Classification of Chances of Defaulting to Pay - Lab Report Example

Summary
This lab report "Classification of Chances of Defaulting to Pay" discusses whether or not the clients will default paying for next month. To classify this data, the Knime tool was used. As a group, it was decided that the data be categorized into two groups…


Classification of Chances of Defaulting to Pay

Having analysed the data in the sections above, this data is only useful to the client for understanding whether or not clients will default on their payment next month. To classify this data, the KNIME tool was used. As a group, it was decided that the data be categorised into two groups: whether a client would default on payment next month or not, i.e. binary classification is the most appropriate in this case. Defaulting on payment was assigned the numerical code 1 and not defaulting the numerical code 0. This is because there are only two possibilities in this situation and only one of the two applies to a particular client.

In order to find the best classification technique, the group decided to carry out all the techniques and pick out the best one. Each member was assigned, or chose, one of the three techniques: the decision tree technique, the boosting technique and the random forest technique. Each member gave the details of his applied technique as shown below. This classification used the KNIME software, as will be illustrated later. The major reason for using the three different techniques was to determine which one best applies to this situation.

Random forest: Myname

This is an ensemble classifier that consists of several decision tree classifiers, hence the name forest. As an ensemble of many trees, it is considered to be computationally efficient and it operates very quickly over large datasets (). As an ensemble technique, the random forest classifier constructs many decision trees. These decision trees are then used to classify a new instance by majority vote. The classifier combines bagging with random selection of attributes. Figuratively speaking, this process can be thought of as a class of weak learners coming together to form a stronger group. Each classifier receives a bootstrapped construct of the same information put differently. In our case, the bootstraps are constructed by randomly drawing with replacement from the training data set. A bootstrap has the same number of instances as the training set. To bring out the aspect of the same information put differently, each bootstrap is fed as input to a base decision tree learner which uses a subset of attributes randomly selected from the original set of attributes, hence the name random. Below is a detailed procedure that was followed in the random forest construction, as guided by (). The random forest model was then constructed.

The random forest classification model

The random forest classification model was built using KNIME as shown below.

Figure 2: A construction of the KNIME nodes for a random forest classification model.

Below are the steps that were involved in running the model:
1. Pre-processing
2. Classifier construction
3. Test and evaluation

Evaluation of the random forest method

CSV Reader Node: The node loads the dataset for further processing.

Column Filter Node: The node deletes irrelevant and redundant attributes. Feature selection occurs naturally as part of random forest data mining, but the column filter node improves efficiency.

Missing Value Node: The node applies the list-wise deletion method: only records with data on all the variables were analysed. It was noted that this would reduce the statistical power of the data; however, this was done as a routine KNIME process, as otherwise all the data was available.

Partitioning Node: The node divides the dataset into two partitions, one of 70% and the other of 30%.
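The pre-processing nodes above have rough code equivalents. The sketch below is a minimal Python analogue of the CSV Reader, Column Filter, Missing Value and Partitioning nodes using pandas and scikit-learn rather than KNIME itself; the file name and the "ID" column it drops are illustrative assumptions, while the target attribute "default payment next month" and the random 70%/30% split follow the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# CSV Reader: load the dataset (file name is an assumption for illustration)
data = pd.read_csv("default_of_credit_card_clients.csv")

# Column Filter: drop attributes judged irrelevant or redundant
# ("ID" is a hypothetical example of such a column)
data = data.drop(columns=["ID"], errors="ignore")

# Missing Value: list-wise deletion, keep only rows with data on all variables
data = data.dropna()

# Binary target: 1 = will default next month, 0 = will not default
# (column name as given in the report; the actual file may label it differently)
y = data["default payment next month"]
X = data.drop(columns=["default payment next month"])

# Partitioning: 70% training set, 30% test set, drawn randomly so the
# test set is independent of the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)
```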
The training set was used by the Random Forest node to construct the random forest classifier, while the test set was used by the Weka Predictor node to evaluate it. The test set was used to estimate the accuracy of the model. To ensure that the test set is independent of the training set, "Draw randomly" was selected in the settings. This avoided the problem of over-fitting. Based on Breiman & Cutler's advice, the test set error was estimated internally by the out-of-bag error.

Classifier Construction Process

Random Forest Node: This node was used to create the random forest classifier. As seen in the figure below, some very important features need to be set for the classifier.

maxDepth: A mechanism that prevents an individual decision tree learner from building an over-complex tree, which would otherwise create an over-fitting problem.

numberFeatures: Using Breiman's method of log M / log 2, this feature helps in determining the number of randomly selected variables. A smaller subset produces less correlation between the individual classifiers, but also indicates lower predictive power.

numTrees: According to (), there is a limit to the number of trees that make up a forest. In our case, doubling the number of trees from 50 to 100 does not improve the performance as indicated by the out-of-bag error. It is also important to note that an increase in the number of trees increases the computational cost.

Test and Evaluation Process

Weka Predictor Node: This node uses the test set of the original data to test the classifier after the random forest classifier has been built using the preceding nodes.

Scorer Node: The node is used to construct the confusion matrix of the classification results. The results are the basis for many measures.

ROC Curve Node: The node is used to construct the receiver operating characteristic curve.

Evaluating the Classifier

The main components that are used to evaluate the classifier are the confusion matrix and the ROC curve.

Confusion Matrix (refer to the figure below this section)

First, the value of the attribute "default payment next month" was coded into 0 and 1, with 1 implying the client will default payment next month and 0 that the client will not default payment next month. From the confusion matrix as shown in the figure below, the following data was extracted:

True Zero (TZ): --- is the number of correct predictions that an instance is zero.
True One (TO): --- is the number of correct predictions that an instance is one.
False Zero (FZ): --- is the number of incorrect predictions that an instance is zero.
False One (FO): --- is the number of incorrect predictions that an instance is one.

Figure 6: The confusion matrix of the scorer node.

Based on the extracted data, the following measures were calculated.

Accuracy = (TZ + TO) / (TO + TZ + FZ + FO) =
Error Rate =
It indicates that ---% of the classification of the test set is correct. These measures may seem useful, but it is not rational to make decisions based on them alone: there are different numbers of items in each class, which means this measure is biased, hence better measures are adopted.

Precision: It measures how accurate the predictions are. Prec = TZ / (TZ + FZ) =

Sensitivity/Recall/True Zero Rate (TZR): This measures the proportion of actual zero instances that are correctly classified by the classifier. In a test with high sensitivity, a negative result is relatively reliable compared to a positive result. A test with high sensitivity has a low type II error.
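As a rough analogue of the Random Forest, Weka Predictor and Scorer nodes, the sketch below continues from the pre-processing sketch above and uses scikit-learn rather than KNIME/Weka, so parameter names differ. It fits a forest with 50 trees, a feature subset of size log M / log 2 and an out-of-bag error estimate, then derives the confusion-matrix measures defined above; the max_depth value of 10 is an illustrative assumption, and none of the report's blank figures are reproduced.

```python
import math
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# numberFeatures: Breiman's log M / log 2, i.e. log2 of the attribute count M
n_features = max(1, int(math.log(X_train.shape[1], 2)))

# numTrees = 50 as discussed above; max_depth caps tree complexity (assumed value);
# oob_score=True gives the internal out-of-bag error estimate
rf = RandomForestClassifier(
    n_estimators=50,
    max_features=n_features,
    max_depth=10,
    oob_score=True,
    random_state=42,
)
rf.fit(X_train, y_train)
print("Out-of-bag error:", 1 - rf.oob_score_)

# Scorer: confusion matrix on the independent test set.
# With labels=[0, 1], ravel() yields, in the report's terms: TZ, FO, FZ, TO
y_pred = rf.predict(X_test)
tz, fo, fz, to = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

accuracy = (tz + to) / (tz + to + fz + fo)   # Accuracy = (TZ+TO)/(TO+TZ+FZ+FO)
error_rate = 1 - accuracy
precision = tz / (tz + fz)                   # Prec = TZ / (TZ + FZ)
tzr = tz / (tz + fo)                         # sensitivity / recall / True Zero Rate
tor = to / (to + fz)                         # specificity / True One Rate
print(accuracy, error_rate, precision, tzr, tor)
```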
TZR = 1 - FOR = TZ / (TZ + FO) =

Specificity/True One Rate (TOR): It measures the proportion of actual one instances that are correctly classified by the classifier. In a test with high specificity, a zero result is highly reliable. In this case, the specificity is ---.

TOR = 1 - FZR = TO / (TO + FZ) =

ROC: Refer to the graph below. The ROC graph was plotted with the False Zero Rate (FZR) on the X-axis and the TZR on the Y-axis. The point at which the curve touches the Y-axis last refers to the perfect classifier which classifies all the zero and one cases correctly, i.e. not defaulting payment next month and defaulting payment respectively. In this case, it is at point ---. The point --- represents a classifier that predicts all zero instances but fails to predict the one instances; --- predicts all cases to be one. The two represent the extreme cases.

Figure: The ROC curve.

The red curve in the graph indicates possible combinations of FZR and TZR of the random forest classifier. The classifier parameters are adjusted to increase TZR at the cost of increased FZR, or to decrease FZR at the cost of decreased TZR. This clearly visualizes the trade-off between the ability of the classifier to correctly identify zero cases and the number of one cases that are incorrectly predicted by the model. Taking our case into account, our main aim is to correctly identify those who will not pay, so we set the FZR to a higher level. As a result, a positive result, which indicates a person who is likely to default payment next month, is highly reliable. The Area Under the Curve (AUC) is ---- and can be used to do a comparative analysis of the model against others. In most cases, the higher the number, the more reliable the classifier/model. A code sketch of this ROC and AUC computation is given at the end of this section.

Advantages and Disadvantages of the Random Forest Classifier

By virtue of using a combination of trees, the method reduces the over-fitting problem. This produces more accurate results. The two levels of randomness, the bagging and the randomly selected attributes, make the classifier robust to noise in the training set. Compared to the decision tree, there is reduced bias that comes with the pruning witnessed in the decision tree classifier. There is improved efficiency and the classifier can be constructed quickly. This makes it a better choice for handling Big Data, which is a common phenomenon of the era. There is also limited data loss through missing data, since it allows individual estimation of individual variables. The classification process is therefore simplified, as it does not require handling of missing data and outliers; neither does it require normalization. More data can be used to train the model, which increases the statistical power. This is achieved because the out-of-bag test error serves as an internal unbiased estimate of accuracy, hence there is no need to allocate a cross-validation set to tune the classifier.

However, it has been observed empirically that random forests seem to over-fit for some datasets with noisy classification tasks. The technique is also seen to be biased towards attributes with more levels, especially for data with nominal variables with several levels. The limit to reducing generalization error by creating more trees is a disadvantage of the model, since there is a limit to error control.
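As referenced above, the following minimal sketch shows one way the ROC curve and AUC could be computed, again continuing from the variables in the previous sketches and using scikit-learn and matplotlib rather than the KNIME ROC Curve node. Because the report plots FZR against TZR, the zero (no-default) class is treated as the positive label here; the resulting curve and AUC are illustrative and do not reproduce the report's values.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Score for the "zero" (no-default) class, located via the classifier's class order
zero_col = list(rf.classes_).index(0)
score_zero = rf.predict_proba(X_test)[:, zero_col]

# ROC with class 0 as the positive label: x = False Zero Rate, y = True Zero Rate
fzr, tzr, _ = roc_curve(y_test, score_zero, pos_label=0)
roc_auc = auc(fzr, tzr)
print("AUC:", roc_auc)

plt.plot(fzr, tzr, color="red", label=f"Random forest (AUC = {roc_auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
plt.xlabel("False Zero Rate (FZR)")
plt.ylabel("True Zero Rate (TZR)")
plt.title("ROC curve of the random forest classifier")
plt.legend()
plt.show()
```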