Representativeness Data Problem in Credit Scoring - Research Paper Example

Summary: The paper "Representativeness Data Problem in Credit Scoring" explores what happens if both the development and validation datasets are created so badly that they are not representative of the population, and demonstrates the consequences for the performance of the scorecards.

Representativeness Data Problem in Credit Scoring

1 Introduction

During statistical modeling, it is common for the sample to be split into two samples: the development sample (DEV) and the validation sample (VAL). The development sample is used to develop the model, i.e. for learning and estimating its parameters, while the validation sample is used to evaluate the model and to select the final model. In a later phase of model development, a third type of sample, the testing sample(s), can be used for assessing the predictive performance of the model (Borovicka et al., 2012). If the same data set were used for development, validation and calibration, the estimate of the predictive ability of the model would be overly optimistic. In an ideal situation, two or more independent data sets are collected. But when only one dataset is available and there is no opportunity to collect new data, it is necessary to split the data file. According to Snee (1977), data splitting is the most effective method of model validation when it is impossible to collect new data to examine the model.

It is very important to create both the DEV and VAL samples in such a way that they represent the total population; otherwise they can cause a lot of problems due to bias. This requirement is entirely natural, since the model reflects the specifics of the dataset used for its development. To be sure that a sample is representative, it is important to consider carefully how it was collected. If a sample is chosen for the sake of convenience alone, it becomes difficult to interpret the final model with confidence (Geoff and Everitt, 2001). Bias refers to the tendency of selected samples to differ from the corresponding population in some systematic manner. Bias can arise from the way the sample was chosen, or from the way information is acquired once a given sample has been selected (Peck et al., 2012).
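The DEV/VAL split described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's own code; the function name `split_dev_val` and the 70/30 ratio are assumptions chosen for the example.

```python
import random

def split_dev_val(data, dev_fraction=0.7, seed=42):
    """Randomly partition a single dataset into a development (DEV)
    and a validation (VAL) sample (simple random holdout)."""
    rng = random.Random(seed)
    shuffled = list(data)       # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

dev, val = split_dev_val(range(100))
print(len(dev), len(val))  # 70 30
```

Fixing the random seed makes the split reproducible, which matters when the same DEV/VAL partition must be reused across model iterations.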
When sampling, the most common types of bias are selection bias, response (or measurement) bias and non-response bias. In most applications, simple random sampling is used. Nevertheless, there are several more sophisticated statistical sampling methods better suited to various types of datasets. The purpose of this paper is to show what happens if both the development and validation datasets are created so badly that they are not representative of the population. To demonstrate the consequences for the performance of the scorecards, two different but very common data splitting methods were employed. The rest of this paper is organized as follows. The next section presents a brief overview of various sampling methods. Section 3 explains the methodology used in the performed tests. Section 4 describes the data used to illustrate the impact. Section 5 contains a case study and discusses the results of the analyses. The final section presents conclusions.

2 Data splitting

In many fields, large, representative, independent samples for training and validating (and testing) models can be obtained simply by partitioning one large dataset (the holdout method). But in other industries only datasets of limited size are available, because measurements are expensive and/or labour intensive. To solve the dilemma of partitioning a small pool of data into independent subsets, resampling procedures can be used (the repeated holdout method). It is commonly believed that having more data should improve model performance, but some recently published studies show that this is not necessarily true (Meng and Xie, 2013; Faraway, 2014). Stone (1974) may be considered a pioneer of data splitting. Since then, many data splitting methods have been proposed. They differ in quality and complexity, and no method is superior in general; the choice mostly depends on the purpose of the analysis.
The sampling methods can be divided according to their principles, goals and overall complexity (Reitermanová, 2010). Data splitting algorithms, and comparisons between them, can be found in the literature (e.g. Molinaro et al., 2005; Snee, 1977). Data splitting is easy to implement and is thus an attractive alternative to complex methods of adjusting for the effect of model selection on inference (Faraway, 1998).

Simple random sampling is the most common holdout method. It is efficient and easy to carry out. Samples are selected randomly with a uniform distribution, i.e. each observation has an equal probability of selection. This quite simple method leads to low bias in the estimated model performance. However, when the dataset is not uniformly distributed, or the number of selected cases is much lower than the size of the original database, simple random sampling can produce subsets that do not cover the data properly (e.g. one or more classes might be missing), and therefore the estimate of the model error will have a high variance (Lohr, 1999).

Stratified sampling is a probability sampling method based on the idea of exploring the internal structure and distribution of a dataset and dividing it into (relatively) homogeneous non-overlapping groups called strata (or clusters). Observations are then selected from each stratum in proportion to the appropriate probability. This ensures that each class is represented with the same frequency in the subsets. The important question is how to select observations from each cluster. There are two common principles: select one sample from each cluster (Bowden et al., 2002), or select samples from each cluster with a uniform probability (May et al., 2010). The second approach is often called stratified random sampling.

As a representative of methods for naturally ordered datasets, systematic sampling should be mentioned. From the ordered dataset (e.g. a time series), a random starting sample is chosen and then every kth observation is taken (Elsayir, 2014).
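Stratified random sampling, as described above, can be sketched as follows. This is an illustrative Python implementation, not the paper's code; the `strata_key`/`fraction` interface and the 10% default-rate toy data are assumptions for the example.

```python
import random
from collections import defaultdict

def stratified_sample(records, strata_key, fraction, seed=0):
    """Draw the same fraction from every stratum, so each class keeps
    (approximately) its population frequency in the subset."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[strata_key(r)].append(r)   # group observations into strata
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# 100 applicants with a 10% event ("default") rate
applicants = [{"default": i % 10 == 0} for i in range(100)]
half = stratified_sample(applicants, lambda r: r["default"], 0.5)
```

Here the 50-record subset still contains exactly 10% defaulters, whereas a simple random draw of the same size could miss the rare class almost entirely.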
Systematic sampling is a very efficient method and is easy to implement. However, in many cases it is very difficult to find an appropriate ordering. For disordered data, the results of systematic sampling are comparable to simple random sampling, and it therefore suffers from the same problems. Its sensitivity to periodicities in the dataset is another disadvantage of the method.

Cross-validation is among the most popular resampling methods. For a k-fold cross-validation, the original dataset is partitioned into k equal parts (folds). The first fold is used for evaluation, while the remaining (k-1) folds are used for model learning. In the next step, the second fold is used for evaluation and the rest for learning. This procedure is repeated k times and the results are averaged (Picard and Cook, 1984). There are no clear rules on how many folds should be used; in practice, k=10 is often sufficient. A special variant of cross-validation is called leave-one-out cross-validation (full cross-validation, jackknife). It sets k=n, where n is the size of the original dataset. This setting gives nearly unbiased estimates (lower root mean square errors of prediction) of model performance, but usually with large variability. This deficiency is known as asymptotic inconsistency (Shao, 1993).

The main idea of the bootstrap method, introduced by Efron in 1979 (see Tibshirani and Efron, 1996), is to draw B bootstrap samples by uniform sampling with replacement from the original dataset of size n. On each bootstrap sample the parameters of a model are estimated, while the estimation of prediction performance is performed on the original dataset. Bootstrapping is not affected by asymptotic inconsistency and might be the best way of estimating the error for very small data sets, since the complete procedure can be repeated arbitrarily often. For more information, see e.g. Kohavi (1995) or Andrews (2000).
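The k-fold procedure described above can be sketched as an index generator. This is a minimal Python illustration; it assumes a simple contiguous assignment of observations to folds, which the text does not prescribe (in practice the data would usually be shuffled first).

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation:
    each fold serves exactly once as the evaluation set."""
    idx = list(range(n))
    fold_size, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < extra else 0)  # spread the remainder
        folds.append(idx[start:start + size])
        start += size
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_indices(10, 5))  # 5 train/test pairs over 10 points
```

Setting k equal to n turns the same generator into the leave-one-out variant mentioned above.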
3 Methodology

In this section we present three quick analyses that can be used for checking the representativeness of the two created sub-samples when a scoring model is built. The proposed tests are illustrated later in a case study on credit scoring model development, but they can be used in other areas as well.

It is possible to check, for the different variables of the database used for the computation of the final score, whether the repartition of the modalities differs significantly between the development and validation samples. This is called demand stability. Risk stability examines whether the event rates of corresponding classes of a variable agree between the two samples. A gap table can also be constructed. Its rows represent categories (in the case of explanatory variables) or score deciles (in the case of scorecard output examination). For each class of the examined variable, the columns of the table contain the following information:
- points of each category (for explanatory variables only),
- average total score,
- number of observations and column percentage,
- number of observed and predicted events, and their difference,
- observed and predicted event rate, and their difference,
- p-value of a one-tailed test (H0: event_rate_predicted …).
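One concrete way to quantify the demand stability check described above is the Population Stability Index (PSI), a common measure in credit scoring for comparing category repartitions between two samples. Note this is an assumption: the paper's own gap-table test is not necessarily the PSI, and the 0.1 rule of thumb used below is industry folklore rather than the paper's threshold.

```python
import math
from collections import Counter

def psi(dev_values, val_values):
    """Population Stability Index between the category distributions of
    the development and validation samples; 0 means identical repartition,
    and values above ~0.1 are commonly read as a stability warning."""
    dev_n, val_n = Counter(dev_values), Counter(val_values)
    total_d, total_v = len(dev_values), len(val_values)
    value = 0.0
    for cat in set(dev_n) | set(val_n):
        p = max(dev_n[cat] / total_d, 1e-6)   # floor avoids log(0)
        q = max(val_n[cat] / total_v, 1e-6)
        value += (p - q) * math.log(p / q)
    return value

same = psi(["A"] * 70 + ["B"] * 30, ["A"] * 35 + ["B"] * 15)      # ~0.0
shifted = psi(["A"] * 70 + ["B"] * 30, ["A"] * 30 + ["B"] * 70)   # > 0.1
```

The same comparison applied to event indicators instead of modalities gives a rough analogue of the risk stability check.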
