Quantitative Approaches to Decision Making - Assignment Example

Business Intelligence Assignment

Section A

1(a) Business Intelligence is the term given to the practice whereby businesses gather and extract useful data from particular sources and then analyse and present this valuable information to put it to their strategic advantage by being more effective in their decision making. To help them in this complicated process, businesses make intelligent use of computer applications and technologies. This judicious decision support system is thus a phenomenon of the present Information Age, and its use distinguishes such businesses from those of the past and from those that do not utilise modern technological business methods.

1(b) Business Intelligence is imperative if businesses are to be successful in a competitive market. Quantitative data analysis as an aid to decision-making and for optimising other business processes is the driving force today for competitive advantage. One area of fascination is the current trend of what is called Operational Business Intelligence. This variation of Business Intelligence relies on automatically gathering, processing and analysing data and, in some cases, even implementing the findings immediately. "As the amount of data businesses create increases daily, so does the need to use that information faster, for better decisions . . . Best-in-class organizations use todays data to make todays decisions and give their frontline employees the tools to do that . . . Operational business intelligence will lead the way . . . But looking at what happened two months ago does nothing to help you . . . So the drive [is] to extend actionable business intelligence to a broader audience—frontline employees and even customers and partners . . . putting the spotlight on operational business tools . . . Operational BI can automate operational data collection and integration . . . Operational BI is the trend for 2008 for the simple and crucial reason that it brings relevant information to employees as it is needed, allowing them to respond to problems or opportunities." (Daniel, 2008)

There are plenty of common examples, but a recent business that has initiated this strategy is a web search engine called Cuil, a new challenger to Google. It attempts not just to search but to find relevant results to web searches, displaying the information as a web page rather than a list of links. Whereas Google bases its information on the data obtained from web surfers' search history and habits, monitoring their search activity in addition to trawling and indexing websites, Cuil actually analyses the content of web pages to find relevant material. Two issues related to such automatic analytics for Operational Business Intelligence come to mind. One is that the analytical process should remain targeted to the needs of the particular business and, specifically, of its decision makers. The other is that businesses must maintain a strategic focus for such analytical activity.

2(a) Data mining by supervised learning involves the prior specification of a dependent or target variable for the algorithm to follow. The data mining process is also aided by providing an initial set of values of the target variable so that the algorithm may then learn by comparing these values with those of the predictor variables. Data mining by this supervised learning method is more common than mining by unsupervised learning. Examples of its use are creating classification, estimation and prediction models.
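As a minimal sketch of such a supervised classification task (the data, the field names and the choice of a decision tree classifier are illustrative assumptions, not part of the assignment brief), a model can be trained on cases where the target variable is already known and then judged on cases it has not seen:

```python
# Supervised learning sketch: predict a known target (homeowner or not)
# from predictor variables. All data here is synthetic and illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

# Predictor variables: salary (in thousands of pounds) and age (years).
salary = rng.normal(28, 8, size=500)
age = rng.integers(18, 70, size=500)
X = np.column_stack([salary, age])

# Target variable supplied up front (the "supervision"): homeowners here
# tend to be older and better paid, plus some noise.
y = (salary + 0.4 * age + rng.normal(0, 6, size=500) > 45).astype(int)

# Hold back a test set so the fitted model can later be validated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)            # learn from the known target values
predictions = model.predict(X_test)    # classify unseen cases

print("accuracy on held-back data:", accuracy_score(y_test, predictions))
```

The held-back test set in the sketch plays the role of the test data discussed next.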
The algorithm used in supervised learning is also taught to avoid suggesting obvious spurious correlations, and the likely statistical models obtained are further trained on test sets of data. The adjusted model may then undergo further processes of validation to minimise the error rate. Example of application: determining whether homeowners, rather than tenants, have a higher salary (a classification task).

2(b) Data mining by unsupervised learning does not start with the identification of any target variable, so there is no guidance for the algorithm to begin with, as there is with supervised learning. Instead, the algorithm used for the data mining process itself tries to identify a structure by searching for patterns amongst the variables that are present. In this way it learns for itself from the outset. A typical example of unsupervised data mining is clustering, where the statistical model is formed by the algorithm employing checks on the cluster quality and identifying any outliers. Example of application: analysing data clusters in exploratory data mining.

3. The Cross Industry Standard Process for Data Mining (CRISP-DM) is, as its name suggests, a standard procedure adopted for mining data. It is also application neutral. Being so widely used and well documented testifies to its role as an effective data mining process. The six stages of the CRISP-DM process are identified in the Journal of Data Warehousing as follows: "CRISP-DM organises the data mining process into six phases: business understanding, data understanding, data preparation, modelling, evaluation, and deployment. These phases help organizations understand the data mining process and provide a road map to follow while planning and carrying out a data mining project." (Shearer, 2000) Here is an outline of the process and why it is considered to be effective.

Phases 1-2 (Business understanding and Data understanding): The data mining task is initiated by ensuring that the analyst has an in-depth understanding of the situation and the requirements. A proper understanding of the business and the data leads to a more precise translation of the business requirements into a data mining problem, and a suitable selection of tools can be made once the needs are understood. Understanding therefore helps the task to be identified and tackled more easily.

Phase 3 (Data preparation): Careful preparation or selection of data sources and types, together with 'cleaning', i.e. adjustment of the data, ensures that the data is appropriate and targeted for devising a suitable model that represents the situation and is capable of providing the information required. Shearer identifies five steps in data preparation: "The five steps in data preparation are the selection of data, the cleansing of data, the construction of data, the integration of data, and the formatting of data." (Shearer, 2000) Selecting the right data sources and types ensures suitability for the model; the cleansing of data is necessary for relevance and quality; data construction and integration bring the required fields together; and the re-formatting of data, if necessary, may help make the data better suit the parameters of the model being constructed.

Phase 4 (Modelling): Constructing the model in a systematic manner provides for one that is perfected to suit the business requirements. In this phase, the choice of modelling technique is an important decision, as this will be used to generate the test design and the resulting models, which are then assessed for their effectiveness.
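Before turning to evaluation and deployment, here is a minimal sketch of what phases 3 and 4 might look like in practice; the field names, cleaning rules and choice of model are assumptions made purely for illustration, not a prescribed CRISP-DM implementation:

```python
# CRISP-DM phases 3-4 sketch: prepare a small data set and fit a model.
# The columns and cleaning rules are illustrative assumptions only.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Selection: pull only the fields relevant to the business question.
raw = pd.DataFrame({
    "age":     [23, 41, 35, None, 52, 29],
    "income":  [18000, 32000, 27000, 41000, None, 22000],
    "homeown": ["NO", "YES", "yes", "NO", "YES", "no"],
})

# Cleansing: drop incomplete records (a real project might impute instead).
clean = raw.dropna()

# Construction / formatting: derive a consistent numeric flag from the text field.
clean = clean.assign(homeown_flag=clean["homeown"].str.upper().eq("YES").astype(int))

# Integration would merge further sources on a shared key; it is omitted here.

# Modelling: fit a simple classifier predicting home ownership.
X = clean[["age", "income"]]
y = clean["homeown_flag"]
model = LogisticRegression().fit(X, y)

print(model.predict(X))   # predictions to be assessed in the evaluation phase
```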
Phase 5 (Evaluation): The CRISP-DM process specifies the need to evaluate the model before it is finally deployed and put to use. Evaluation is essential because it provides an opportunity for thorough testing to be certain that the model meets the business requirements for which it was constructed in the first place. Any deficiencies in the model will become apparent and can therefore be corrected so as to refine it. By applying the model to a real situation, the analyst can make sure that no factors have been overlooked, and a review of the entire process gives a measure of quality assurance.

Phase 6 (Deployment): The deployment phase too can be as detailed as required, instead of simply producing a final report. For the data mining to prove really effective, ongoing monitoring and maintenance is the best policy, together with a periodic review of the entire project.

4(a) A correlation coefficient R, calculated by taking the square root of the ratio of the explained variation to the total variation, will yield a value in the range -1 to +1 and indicates the degree of relationship between the variables relative to the type of equation assumed. If a linear equation is assumed and R has a value near ±1, this suggests a linear correlation between the variables, whether the relationship is really causative or spurious. A value of R closer to 0 would in this case suggest no linear correlation between the variables. However, this does not mean that there is no correlation at all, as there may very well be a high nonlinear correlation between the variables. A positive value of R indicates a positive linear correlation, suggesting a proportional relationship, and a negative value of R indicates a negative linear correlation, suggesting an inversely proportional relationship. For values of R at any point in this range, the interpretation is really a measure of the goodness of fit between the equation assumed and the statistical data being analysed.

4(b-c) Neither of these statements can be taken at face value, and both are therefore false. A qualitative observation of the scattering of sample data on a scatter diagram may appear to show a linear correlation between the two variables, but the strength of the relationship is better determined by calculating the correlation coefficient in a quantitative manner. Even then, it may be possible to find a better goodness of fit between the target and predictor variables if some other, nonlinear relationship is assumed. The size of the correlation coefficient only indicates the possible degree of relationship between the variables relative to the type of equation assumed and based on the sample data. A high degree of correlation does not imply that the target variable either causes, or is caused by, a corresponding change in the predictor variable. It may be that a high correlation is indicated but the relationship itself is spurious, requires further validation by analysing more sample data, or involves other intermediate factors. A causative relationship cannot automatically be taken for granted.
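The point that R measures only the linear goodness of fit can be illustrated with a small sketch (the data is invented for the purpose): a strong nonlinear relationship can still produce a correlation coefficient near zero, while a noisy proportional one produces a coefficient near 1.

```python
# Correlation coefficient sketch: R measures only the linear goodness of fit.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)

linear = 2.0 * x + rng.normal(0, 1.0, x.size)    # noisy proportional relationship
quadratic = x ** 2 + rng.normal(0, 0.5, x.size)  # strong but nonlinear relationship

r_lin = np.corrcoef(x, linear)[0, 1]
r_quad = np.corrcoef(x, quadratic)[0, 1]

print(f"R for the linear case:    {r_lin:+.3f}")   # close to +1
print(f"R for the quadratic case: {r_quad:+.3f}")  # close to 0 despite a real relationship
```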
Section B

2. DATA CLUSTERING

(a) The K-Means algorithm is designed to cluster a data set into partitions based on shared attributes and attempts to find the 'centres' of these natural clusters in the data. It is a computational procedure for the general problem of clustering or classifying objects into groups. Starting with the data points, also called objects, the algorithm clusters them based on their attributes into the different partitions, the number of which may be predefined as an input parameter. The condition of this algorithm is that the number of partitions is less than the number of objects, and the algorithm works on the assumption that the object attributes form a vector space. The centres are the arithmetic means of all the objects in each cluster, and the objective of the algorithm is to minimise the total intra-cluster variance, i.e. the squared error function.

In the example provided, the data set comprises information gathered on shoppers and the products they bought. As the shoppers are a variety of people with a variety of purchases, the data set is partitioned based on shared attributes. Two of these attributes form binary partitions, namely home ownership and gender, represented by the variables 'homeown' and 'sex' respectively. The variable 'pmethod' determines a tripartite partition into those who paid by cheque, cash or card. Other attributes used to distinguish between the shoppers are the types of products bought, the amount of money spent on that particular shopping event (value), their income and their age. By attempting to find the centre of each cluster, the algorithm attempts to typify the shoppers in each cluster so as to determine their background and shopping habits. The selection of the attributes used for this determination is an important consideration for the data analyst in producing relevant information. It does seem that the attributes selected in this case are appropriate to the analysis.

(b) The relevance and quality of the information provided by the K-Means algorithm depend on a number of factors. From the right-hand panel of figure 1, Clementine has applied an iterative refinement heuristic, probably an implementation of Lloyd's algorithm. Terminating at a zero error value within a low number of iterations (20) seems to demonstrate that the algorithm worked successfully. However, the algorithm started from the sample data provided, took the default of 5 clusters to partition the shoppers into (as seen in figure 2) and then calculated the 5 means by minimising the variance. The size of the sample data (presumably 1000 shoppers), the parameter of 5 for the number of clusters and the choice of variance as a measure of cluster scatter can all affect the results obtained. In fact, the minimised variance is the total intra-cluster variance, not the global variance of the data. It is possible that this did not yield the optimal partitioning and that the resulting information is not so accurate. It is also possible that different results may be obtained if the algorithm is run again. It would therefore be advisable to re-run the algorithm, then again with different numbers of clusters, and if necessary to try different algorithms or use more sample data, in order to be satisfied that the information is relevant and of good quality. Only when this satisfaction is obtained can the information be usefully applied for Business Intelligence.

The left-hand panel of figure 2 shows that the 5 clusters detected by the K-Means algorithm have between 148 and 287 records, an average of 200 records, which appears to be a fairly even distribution. The third cluster, shown in detail, describes a typical shopper as a 34-year-old who is most likely to be female, probably does not own a home, earns over £24,000 a year and spends on average £34.50 a time, usually on confectionery and wine.
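Cluster profiles of this kind are simply the cluster centres. A minimal sketch of producing them follows; the field names echo those above, but the data is synthetic and the use of scikit-learn rather than Clementine is an assumption made only for illustration.

```python
# K-Means sketch: cluster shopper-like records and inspect the cluster centres.
# The data is synthetic; only the field names mirror the example above.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 1000

shoppers = pd.DataFrame({
    "age":     rng.integers(18, 75, n),
    "income":  rng.normal(24000, 6000, n).round(),
    "value":   rng.gamma(2.0, 17.0, n).round(2),   # spend per visit (£)
    "homeown": rng.integers(0, 2, n),               # 1 = owns a home
    "sex":     rng.integers(0, 2, n),               # 1 = female
})

# Scale the attributes so no single field dominates the squared-error objective.
X = StandardScaler().fit_transform(shoppers)

# Partition into 5 clusters, as in the Clementine example.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Describe each cluster by the mean of its members on the original scale,
# i.e. the "typical shopper" profile for that cluster.
profiles = shoppers.groupby(km.labels_).mean().round(1)
print(profiles)
print("total intra-cluster variance (inertia):", round(km.inertia_, 1))
```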
This kind of information, if it is relevant to the company's projections and its quality is reliable, stemming from solid data analysis, can prove of immense benefit to the company's decision makers. Obtaining, handling and applying such information successfully is the sign of good business intelligence. For example, taking the above information at face value, the company, having recognised one type of typical shopper, can target females of about this age through the marketing of confectionery and wine, knowing that it could cash in on this. More generally, through cluster analysis, especially of data obtained from market research, the company can partition its own customers or clients into market segments. This would help it to understand the relationships between the different groups much better and consequently target its products better, or even track the development of new products. And, by doing the same for the wider population of potential customers and clients, the information could help the company expand its business.

(c) The best approach in deciding how many initial clusters to specify is to start with any reasonably small value and not to end the analysis with that result, but to repeat the algorithm, as many times as necessary, with larger numbers of clusters and compare all the results obtained. Re-running the algorithm in this way is not an inconvenience for the data analyst, because the algorithm normally processes data quickly within a few iterations. Small values are usually advisable, at least to begin with, because the larger the number of clusters, the greater the likelihood of 'over-fitting' the data; this produces results that are not so useful, despite the smaller error function values. Selection of the result with a particular number of initial clusters is best decided by using some suitable criterion to compare the results, as in the sketch below.

If the company has a strategy of specifically dividing its product ranges based on age groups, it may be appropriate to take the number of age groups as the input parameter for the initial number of clusters. This may help the company obtain useful information in order to understand the consumer habits and needs of each of its age groups and to target its business strategies for each age group more effectively. It would mean defining the age variable to contain a limited set of values and associating each cluster with one of those values, or specifying the data gathered on each group separately as the data set for the algorithm. Alternatively, if we are uncertain which value to use for the initial number of clusters, or the results vary drastically for different initial values, we could use another algorithm such as the QT clustering algorithm. This does not require the number of initial clusters to be specified, and it has the further advantage of producing the same result however many times the algorithm is run.
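A sketch of that re-running strategy (it assumes the scaled matrix X from the previous synthetic-shopper sketch and scikit-learn, both illustrative assumptions) is to fit K-Means for a range of cluster counts and compare a quality criterion, here the total intra-cluster variance and the silhouette score, before committing to one value:

```python
# Comparing different numbers of initial clusters before settling on one.
# Assumes X is the scaled shopper matrix from the previous sketch.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  intra-cluster variance={km.inertia_:8.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")

# The variance always falls as k grows (the over-fitting risk noted above),
# so a criterion such as the silhouette score, or an 'elbow' in the variance,
# is used to pick a sensible k rather than the smallest error alone.
```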
3. DECISION TREES

(a) In this decision tree, the domain (root node) is the heterogeneous class of people who have applied for insurance, presumably motor insurance, with this company. At each of the two levels there is a binary partition based on a true/false or yes/no condition. The partitioning rules for each of the 4 resulting homogeneous sub-classes (leaf nodes) are as follows.

Classification at level 1: At the first level, the binary partition is determined by the age of the applicant, namely whether or not he/she is 25 or over.

Classification at level 2: At the second level, the binary partition is determined by whether the applicant has a clean licence.

1. IF AGE ≥ 25 AND CLEAN LICENCE => "ACCEPT UNDER NORMAL TERMS", i.e. if the applicant is 25 years of age or over and has a clean licence, then accept the insurance application under the normal terms.
2. IF AGE ≥ 25 AND NO CLEAN LICENCE => "ACCEPT WITH A HIGH PREMIUM", i.e. if the applicant is 25 years of age or over but does not have a clean licence, then accept the insurance application with a high premium.
3. IF AGE < 25 AND CLEAN LICENCE => "ACCEPT WITH A VERY HIGH PREMIUM", i.e. if the applicant is under 25 years of age and has a clean licence, then accept the insurance application, albeit with a very high premium.
4. IF AGE < 25 AND NO CLEAN LICENCE => "REJECT APPLICATION", i.e. if the applicant is under 25 years of age and does not have a clean licence, then reject the insurance application.

(b) The main advantages of decision trees:

1. A decision tree is a pattern recognition approach to decision making that provides a graphical representation of decision alternatives based on classification rules. This makes the information easy to visualise and interpret.
2. Constructing decision trees is a methodical, mathematical procedure, so it lends itself to being a useful quantitative approach to decision making. "Experts in problem solving agree that the first step in solving a complex problem is to decompose it into a series of smaller subproblems. Decision trees provide a useful means of showing how the problem can be decomposed, as well as showing the sequential nature of the decision process." (Anderson, 1994)
3. In complex situations it may be difficult to isolate the different decision alternatives, even when these are reduced to those likely to be close to the optimal schemes. However, once constructed, the decision tree can be used to calculate the expected payoffs, usually based on probabilities, associated with each possible course of action, i.e. each decision alternative. These can then be compared with one another and the information used to help make an economical choice, either in terms of monetary value or in terms of utility.
4. From a statistical perspective, decision trees have the advantage of being able to handle both continuous and discrete variables, making them suitable for both regression and classification problems. They can also deal with missing or incomplete data; extreme values and co-linearity do not affect the process; and there is no need to make any assumptions about the distribution of the variables.
5. Decision trees can help identify the data fields that are most important, or even critical, for classification or prediction purposes.

(c) It is possible to construct several different decision trees that all classify a particular data set correctly; this construction process is referred to as decision tree induction. However, a good induction algorithm creates the simplest possible structure whilst maintaining the tree's predictive accuracy. The objective is therefore to partition the data in the minimal number of steps so that each partition contains only one class (represented by the leaf nodes). Data containing multiple classes is described as impure: impurity reveals the heterogeneity of the data. Satisfying the objective entails finding those attributes of the data set that produce the least impurity, i.e. the most homogeneous partitions, when the data is split on the values of each attribute. Several 'scoring criteria', or measures, exist to evaluate decision tree partitions, for example those named in part (d) below. All measures of impurity attempt to quantify the quality of the partitions in their own way; they indicate the suitability of the selected attributes so that each partition is dominated by a single class. This is analogous to the goodness of fit in correlation theory.
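As a minimal sketch of such induction (the examples are generated directly from the four leaf-node rules in part (a), and the use of scikit-learn with its Gini-based splitting is an illustrative assumption rather than part of the assignment), a small tree can be induced and its learned partitioning rules printed:

```python
# Decision tree induction sketch: recover the insurance rules from examples.
# The examples are labelled by the four leaf-node rules from part (a).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def underwriting_rule(age, clean_licence):
    """The four leaf-node rules from part (a), used only to label examples."""
    if age >= 25:
        return "normal terms" if clean_licence else "high premium"
    return "very high premium" if clean_licence else "reject"

rng = np.random.default_rng(3)
age = rng.integers(17, 80, 400)
clean = rng.integers(0, 2, 400)
X = np.column_stack([age, clean])
y = [underwriting_rule(a, c) for a, c in zip(age, clean)]

# Induce the simplest tree that classifies the data correctly; the 'gini'
# criterion is one of the impurity measures discussed in part (d).
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "clean_licence"]))
```

Because the examples are perfectly separable by age and licence status, the induced tree reproduces the two-level structure described above with just four leaves.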
(d) The Gini index, entropy and chi-square are the most commonly used criteria for partitioning data for the purpose of impurity reduction.

(i) Gini. If i and j are two classes of the response variable, and p(i|t) and p(j|t) are the probabilities of classes i and j respectively at node t, the Gini index for the node is defined as

G(t) = Σ p(i|t)·p(j|t), summed over all pairs of distinct classes i ≠ j.

For a node t to be pure, all cases will be of the same class, i.e. the probability of one class, say p(i|t), will be unity whereas that of any other, p(j|t), will be zero. As the two probabilities are multiplied before summing over each pair of distinct classes, the Gini index for a pure node is therefore 0. At the other end of the scale, as the number of evenly distributed classes increases, creating a perfectly diverse node, the impure node yields an index value approaching 1. The CART algorithm, which only allows binary partitioning, uses the Gini index to search for the best possible partition at each node, i.e. the one that most reduces the impurity of the node. In doing so, the Gini index prefers the class with the largest population.

(ii) Entropy. In contrast to the Gini index, entropy looks for the partitions in which as many classes as possible are divided correctly, i.e. it works by categorising the largest number of classes correctly. The entropy of node t is defined as

I(t) = -Σ p(i|t)·log2 p(i|t), summed over i = 1 to c,

where p(i|t) is the fraction of instances (examples) that belong to class i at node t and c is the total number of different classes. If the fraction p(i|t) is unity, then all the instances belong to class i and the node is homogeneous, i.e. the partition is pure; if it is zero, there are no instances from that class. A fraction close to zero, indicating few instances, makes the base-2 logarithm a large negative number, whereas a fraction close to unity, meaning a high proportion of instances, makes the logarithm approach zero. Hence zero entropy shows that one class represents all the instances, a case of purity, while an entropy tending towards zero shows that one class represents most of the instances, a case of only slight impurity. Finding the attribute with the lowest average entropy can alternatively be approached by finding the attribute with the highest information gain; the two are identical in so far as their usefulness as a measure of impurity is concerned.

(iii) Chi-square. In calculating this measure of impurity, the chi-squared test is used to estimate the significance of a partition. The test statistic, with v degrees of freedom, is

X²(v) = Σ (observed − expected)² / expected.

The chi-square measure of impurity is therefore suitable for multiple partitioning cases, where a complex decision tree has more than two branches at any of its nodes; this is something that the Gini and entropy criteria cannot cope with adequately. The chi-squared test also allows for adjustment by way of the degrees of freedom, and contingency tables may be used to represent the partitions for easier analysis. The observed and expected 'competitiveness' of each potential partition at a node are obtained, and the chi-squared test then determines its relative importance.
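As a minimal worked sketch of these three measures (the class counts are invented for the purpose), the impurity of a candidate partition can be computed directly from the counts in each child node:

```python
# Impurity measures sketch: Gini index, entropy and a chi-squared statistic
# for one candidate partition. The class counts are illustrative only.
from math import log2

def gini(counts):
    """Gini index from class counts: 1 minus the sum of squared proportions,
    which equals the pairwise sum p(i|t)*p(j|t) over distinct classes."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy of a node from its class counts, in bits."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def chi_squared(children):
    """Chi-squared statistic comparing each child's class counts with the
    counts expected if the parent's class mix were preserved."""
    parent = [sum(col) for col in zip(*children)]
    grand = sum(parent)
    stat = 0.0
    for child in children:
        size = sum(child)
        for observed, parent_count in zip(child, parent):
            expected = parent_count * size / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# A candidate binary partition: class counts [accepted, rejected] per child node.
left, right = [40, 10], [5, 45]

print("gini(left) =", round(gini(left), 3), " gini(right) =", round(gini(right), 3))
print("entropy(left) =", round(entropy(left), 3), " entropy(right) =", round(entropy(right), 3))
print("chi-squared for the split =", round(chi_squared([left, right]), 1))
```

A pure child node gives a Gini index and entropy of zero, while a larger chi-squared statistic indicates a partition that departs more strongly from the parent's class mix.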
Works Cited

Anderson, David. An Introduction to Management Science: Quantitative Approaches to Decision Making. 7th ed., 1994, chapter 14, p. 596.

Daniel, Diann. "The Year Ahead in BI: Operational Business Intelligence, Open-Source Tools and More." CIO, Jan 2008. Accessed 8 Aug 2008. http://www.cio.com/article/167450/The_Year_Ahead_in_BI_Operational_Business_Intelligence_Open_Source_Tools_and_More_

Shearer, Colin. "The CRISP-DM Model: The New Blueprint for Data Mining." Journal of Data Warehousing, Vol. 5, No. 4, Fall 2000.
