Classification of Chances of Defaulting to Pay - Lab Report Example

Summary
This lab report "Classification of Chances of Defaulting to Pay" discusses whether or not the clients will default paying for next month. To classify this data, the Knime tool was used. As a group, it was decided that the data be categorized into two groups…


Classification of Chances of Defaulting to Pay

Having analysed the data in the sections above, this data is only useful to the client for understanding whether or not clients will default on their payment next month. To classify this data, the KNIME tool was used. As a group, it was decided that the data be categorised into two groups: whether a client would default on payment next month or not, i.e. binary classification is the most appropriate in this case. Defaulting on payment was assigned the numerical code 1 and not defaulting the numerical code 0. This is because there are only two possibilities in this situation and only one of the two applies to a particular client.

In order to find the best classification technique, the group decided to carry out all the techniques and pick out the best one. Each member was assigned, or chose, one of the three techniques: the decision tree technique, the boosting technique and the random forest technique. Each member gave the details of his applied technique as shown below. This classification used the KNIME software, as will be illustrated later. The major reason for using the three different techniques was to determine which one best applies to this situation.

Random forest: Myname

This is an ensemble classifier that consists of several decision tree classifiers, hence the name forest. As an ensemble of many trees, it is considered to be computationally efficient and it operates very quickly over large datasets (). As an ensemble technique, the random forest classifier constructs many decision trees. These decision trees are then used to classify a new instance by majority vote. The classifier combines bagging with random selection of attributes. Figuratively speaking, this process can be thought of as a class of weak learners coming together to form a stronger group. Each classifier receives a bootstrapped construct of the same information put differently. In our case, the bootstraps are constructed by randomly drawing with replacement from the training data set. A bootstrap has the same number of instances as the training set. To bring out the aspect of the same information put differently, each bootstrap is fed as input to a base decision tree learner which uses a subset of attributes randomly selected from the original set of attributes, hence the name random. Below is a detailed procedure that was followed in the random forest construction, as guided by (). The random forest model was then constructed.

The random forest classification model

The random forest classification model was built using KNIME as shown below.

Figure 2: A construction of the KNIME nodes for a random forest classification model.

Below are the steps that were involved in running the model:
1. Pre-processing
2. Classifier construction
3. Test and evaluation

Evaluation of the random forest method

CSV Reader Node: The node loads the dataset for further processing.

Column Filter Node: The node deletes irrelevant and redundant attributes. Feature selection occurs naturally as part of random forest data mining, but the column filter node improves efficiency.

Missing Value Node: The node applies the list-wise deletion method: only records with data on all the variables were analysed. It was noted that this would reduce the statistical power of the data; however, this was done as a routine KNIME process, as otherwise all the data was available.

Partitioning Node: The node divides the dataset into two partitions, one of 70% and the other of 30%.
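The pre-processing nodes above have rough code equivalents. The sketch below is a minimal Python analogue of the CSV Reader, Column Filter, Missing Value and Partitioning nodes using pandas and scikit-learn rather than KNIME itself; the file name and the "ID" column it drops are illustrative assumptions, while the target attribute "default payment next month" and the random 70%/30% split follow the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# CSV Reader: load the dataset (file name is an assumption for illustration)
data = pd.read_csv("default_of_credit_card_clients.csv")

# Column Filter: drop attributes judged irrelevant or redundant
# ("ID" is a hypothetical example of such a column)
data = data.drop(columns=["ID"], errors="ignore")

# Missing Value: list-wise deletion, keep only rows with data on all variables
data = data.dropna()

# Binary target: 1 = will default next month, 0 = will not default
# (column name as given in the report; the actual file may label it differently)
y = data["default payment next month"]
X = data.drop(columns=["default payment next month"])

# Partitioning: 70% training set, 30% test set, drawn randomly so the
# test set is independent of the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)
```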
The training set was used by the Random Forest node to construct the random forest classifier, while the test set was used by the Weka Predictor node to evaluate it. The test set was used to estimate the accuracy of the model. To ensure that the test set is independent of the training set, "Draw randomly" was selected in the settings. This avoided the problem of over-fitting. Based on Breiman & Cutler's advice, the test set error was estimated internally by the out-of-bag error.

Classifier Construction Process

Random Forest Node: This node was used to create the random forest classifier. As seen in the figure below, some very important features need to be set for the classifier.

maxDepth: A mechanism that prevents an individual decision tree learner from building an over-complex tree, which would otherwise create an over-fitting problem.

numberFeatures: Using Breiman's method of log M / log 2, this feature helps in determining the number of randomly selected variables. A smaller subset produces less correlation between the individual classifiers, but also indicates lower predictive power.

numTrees: According to (), there is a limit to the number of trees that make up a forest. In our case, doubling the number of trees from 50 to 100 does not improve the performance as indicated by the out-of-bag error. It is also important to note that an increase in the number of trees increases the computational cost.

Test and Evaluation Process

Weka Predictor Node: This node uses the test set of the original data to test the classifier after the random forest classifier has been built using the preceding nodes.

Scorer Node: The node is used to construct the confusion matrix of the classification results. The results are the basis for many measures.

ROC Curve Node: The node is used to construct the receiver operating characteristic curve.

Evaluating the Classifier

The main components that are used to evaluate the classifier are the confusion matrix and the ROC curve.

Confusion Matrix (refer to the figure below this section)

First, the value of the attribute "default payment next month" was coded into 0 and 1, with 1 implying the client will default payment next month and 0 that the client will not default payment next month. From the confusion matrix as shown in the figure below, the following data was extracted:

True Zero (TZ): --- is the number of correct predictions that an instance is zero.
True One (TO): --- is the number of correct predictions that an instance is one.
False Zero (FZ): --- is the number of incorrect predictions that an instance is zero.
False One (FO): --- is the number of incorrect predictions that an instance is one.

Figure 6: The confusion matrix of the scorer node.

Based on the extracted data, the following measures were calculated.

Accuracy = (TZ + TO) / (TO + TZ + FZ + FO) =
Error Rate =
It indicates that ---% of the classification of the test set is correct. These measures may seem useful, but it is not rational to make decisions based on them alone: there are different numbers of items in each class, which means this measure is biased, hence better measures are adopted.

Precision: It measures how accurate the predictions are. Prec = TZ / (TZ + FZ) =

Sensitivity/Recall/True Zero Rate (TZR): This measures the proportion of actual zero instances that are correctly classified by the classifier. In a test with high sensitivity, a negative result is relatively reliable compared to a positive result. A test with high sensitivity has a low type II error.
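As a rough analogue of the Random Forest, Weka Predictor and Scorer nodes, the sketch below continues from the pre-processing sketch above and uses scikit-learn rather than KNIME/Weka, so parameter names differ. It fits a forest with 50 trees, a feature subset of size log M / log 2 and an out-of-bag error estimate, then derives the confusion-matrix measures defined above; the max_depth value of 10 is an illustrative assumption, and none of the report's blank figures are reproduced.

```python
import math
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# numberFeatures: Breiman's log M / log 2, i.e. log2 of the attribute count M
n_features = max(1, int(math.log(X_train.shape[1], 2)))

# numTrees = 50 as discussed above; max_depth caps tree complexity (assumed value);
# oob_score=True gives the internal out-of-bag error estimate
rf = RandomForestClassifier(
    n_estimators=50,
    max_features=n_features,
    max_depth=10,
    oob_score=True,
    random_state=42,
)
rf.fit(X_train, y_train)
print("Out-of-bag error:", 1 - rf.oob_score_)

# Scorer: confusion matrix on the independent test set.
# With labels=[0, 1], ravel() yields, in the report's terms: TZ, FO, FZ, TO
y_pred = rf.predict(X_test)
tz, fo, fz, to = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

accuracy = (tz + to) / (tz + to + fz + fo)   # Accuracy = (TZ+TO)/(TO+TZ+FZ+FO)
error_rate = 1 - accuracy
precision = tz / (tz + fz)                   # Prec = TZ / (TZ + FZ)
tzr = tz / (tz + fo)                         # sensitivity / recall / True Zero Rate
tor = to / (to + fz)                         # specificity / True One Rate
print(accuracy, error_rate, precision, tzr, tor)
```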
TZR = 1 - FOR = TZ / (TZ + FO) =

Specificity/True One Rate (TOR): It measures the proportion of actual one instances that are correctly classified by the classifier. In a test with high specificity, a zero result is highly reliable. In this case, the specificity is ---.

TOR = 1 - FZR = TO / (TO + FZ) =

ROC: Refer to the graph below. The ROC graph was plotted with the False Zero Rate (FZR) on the X-axis and the TZR on the Y-axis. The point at which the curve touches the Y-axis last refers to the perfect classifier which classifies all the zero and one cases correctly, i.e. not defaulting payment next month and defaulting payment respectively. In this case, it is at point ---. The point --- represents a classifier that predicts all zero instances but fails to predict the one instances; --- predicts all cases to be one. The two represent the extreme cases.

Figure: The ROC curve.

The red curve in the graph indicates possible combinations of FZR and TZR of the random forest classifier. The classifier parameters are adjusted to increase TZR at the cost of increased FZR, or to decrease FZR at the cost of decreased TZR. This clearly visualizes the trade-off between the ability of the classifier to correctly identify zero cases and the number of one cases that are incorrectly predicted by the model. Taking our case into account, our main aim is to correctly identify those who will not pay, so we set the FZR to a higher level. As a result, a positive result, which indicates a person who is likely to default payment next month, is highly reliable. The Area Under the Curve (AUC) is ---- and can be used to do a comparative analysis of the model against others. In most cases, the higher the number, the more reliable the classifier/model. A code sketch of this ROC and AUC computation is given at the end of this section.

Advantages and Disadvantages of the Random Forest Classifier

By virtue of using a combination of trees, the method reduces the over-fitting problem. This produces more accurate results. The two levels of randomness, the bagging and the randomly selected attributes, make the classifier robust to noise in the training set. Compared to the decision tree, there is reduced bias that comes with the pruning witnessed in the decision tree classifier. There is improved efficiency and the classifier can be constructed quickly. This makes it a better choice for handling Big Data, which is a common phenomenon of the era. There is also limited data loss through missing data, since it allows individual estimation of individual variables. The classification process is therefore simplified, as it does not require handling of missing data and outliers; neither does it require normalization. More data can be used to train the model, which increases the statistical power. This is achieved because the out-of-bag test error serves as an internal unbiased estimate of accuracy, hence there is no need to allocate a cross-validation set to tune the classifier.

However, it has been observed empirically that random forests seem to over-fit for some datasets with noisy classification tasks. The technique is also seen to be biased towards attributes with more levels, especially for data with nominal variables with several levels. The limit to reducing generalization error by creating more trees is a disadvantage of the model, since there is a limit to error control.
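As referenced above, the following minimal sketch shows one way the ROC curve and AUC could be computed, again continuing from the variables in the previous sketches and using scikit-learn and matplotlib rather than the KNIME ROC Curve node. Because the report plots FZR against TZR, the zero (no-default) class is treated as the positive label here; the resulting curve and AUC are illustrative and do not reproduce the report's values.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Score for the "zero" (no-default) class, located via the classifier's class order
zero_col = list(rf.classes_).index(0)
score_zero = rf.predict_proba(X_test)[:, zero_col]

# ROC with class 0 as the positive label: x = False Zero Rate, y = True Zero Rate
fzr, tzr, _ = roc_curve(y_test, score_zero, pos_label=0)
roc_auc = auc(fzr, tzr)
print("AUC:", roc_auc)

plt.plot(fzr, tzr, color="red", label=f"Random forest (AUC = {roc_auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
plt.xlabel("False Zero Rate (FZR)")
plt.ylabel("True Zero Rate (TZR)")
plt.title("ROC curve of the random forest classifier")
plt.legend()
plt.show()
```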