
Data Analysis Using R - Assignment Example

Summary
The paper "Data Analysis Using R" is a perfect example of a statistics assignment. Data Analysis is very important in the field of business since many organizations utilize it for different purposes. In many cases, organizations face the challenge of analyzing huge databases of information from different sources…

Extract of sample "Data Analysis Using R"

Data Analysis Using R
[Student name]
[Grade course]
[Tutor's name]
23rd April, 2012

Table of Contents
Executive Summary
Task 1
Task 2
Task 3
Task 4

Executive Summary
Data analysis is very important in business, since many organizations use it for a variety of purposes. In many cases, organizations face the challenge of analysing huge databases of information drawn from different sources. One such industry is banking, where customer information is used in the credit process. This essay explores the use of the CRISP-DM data mining methodology and of data mining tools such as Rattle (the R console) and MS Excel 2007. These tools offer powerful functions and controls for analysing data in different formats. For instance, the data a bank holds on its customers is huge, and a powerful processing tool is needed to analyse it. Using R, we can identify patterns in the data in a short time and obtain interactive results on the analysed data. R has several capabilities for handling data analysis and producing concise results; for instance, decision trees, correlation and other analyses can be conducted easily in R, as is done in this essay. The essay analyses the credit rating of bank customers using different sets of data. It also examines the opportunities and challenges of implementing CRISP-DM within the credit section of the banking industry. The last task looks at the creation of dashboards, which are visual methods used in data analysis; several factors have to be considered when creating dashboards, as seen under Task 3 through the use of Microsoft Excel 2007.

Task 1
CRISP data mining involves six distinct phases. After the data has been gathered, modelling and evaluation have to be conducted to ensure the best results are obtained for the study being undertaken. The correlation results are discussed below.

a) There are seven variables in this study, and they correlate with each other to differing degrees. From the correlation results we found two pairs of variables with notable correlation: amount/duration and age/resident. The correlation between amount and duration is moderate, at 0.62, while the linear relationship between age and resident is weak. The correlations between the remaining pairs of variables are negative, with most of the figures standing at less than -0.1, so we can conclude that meaningful relationships between these variables are scarce. The results of the correlation analysis are represented in Figure 1 below: the light maroon circles represent pairs with a negative linear correlation, the white (uncoloured) circles represent pairs with weak or negligible correlation, and the blue circles represent pairs where the linear correlation is high and a link between the variables is plausible. High correlation figures here may indicate a close relationship between the variables with respect to credit rating.
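As an illustration of how this correlation analysis could be reproduced outside the Rattle interface, the sketch below computes and plots a correlation matrix in plain R. The data frame name credit and the column names duration, amount, age and resident are assumptions made for the example, and the corrplot package is just one of several ways to draw a colour-coded plot similar to Figure 1.

```r
# Sketch: correlation among the numeric credit variables.
# Assumes a data frame `credit` with numeric columns `duration`,
# `amount`, `age` and `resident` (all names assumed).
library(corrplot)

num_vars <- credit[, c("duration", "amount", "age", "resident")]

# Pairwise Pearson correlations, ignoring incomplete rows
cor_mat <- cor(num_vars, use = "pairwise.complete.obs")
print(round(cor_mat, 2))   # e.g. amount vs duration should be about 0.62

# Counterpart of Figure 1: circles coloured by sign and strength
corrplot(cor_mat, method = "circle")
```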
For example, if the loan amount is high, the loan period will probably also be high. The same cannot be said about the relationship between age and residency. It is therefore prudent to conclude that most of these variables are independent and have ranges of values that differ greatly. The most highly correlated variables in this study are amount and duration, which show moderate correlation. The analysis shows that this correlation is moderately linear, which suggests that an increase in the amount loaned out contributes to a longer loan duration, and vice versa. The correlation figures are shown in the figure below.

Figure 1: Correlation levels. Dark blue circles show high positive correlation, light blue circles show medium positive correlation, and light red circles show little correlation between the variables. (NB: correlation can be either positive or negative.)

b) Correlation can in some cases reflect redundancy in the data, so we can eliminate variables that add no information. For instance, age and residency exhibit some correlation. In assessing the creditworthiness of a client, age and residency might both be considered, but using business intelligence methods we can eliminate the variable resident, since it does not affect a client's creditworthiness: it only records how long each customer has lived at their current residence. Most of the variables we can do away with are those that introduce redundancy (Paredes, 2009). These include other, which is not expressly linked to any variable; it cannot be split, and ignoring it makes no difference to the analysis. The variable age, by contrast, reveals a lot about the ages of the loan borrowers and the propensity of different age groups to borrow money in the financial system.

c) The dataset contains numerous variables, which gives us the opportunity to split them into groups. For instance, the variable credit rating can be split into two levels, good and bad, which lets us rate and separate customers with different credit ratings during the analysis. Other variables can also be split into several groups, such as duration, the time taken to repay the loan. Duration could be split into three categories based on the number of months: short-term loans of 6 months or less, medium-term loans of more than 6 months but less than 48 months, and long-term loans exceeding 48 months. Another interesting point is that the loan amount matters, since it indicates whether large borrowers tend to repay their debts or the reverse. However, in some instances the loan amount might not matter when checking a customer's credit history, because depending on the customer's age, job type and employment history the amount may be irrelevant. In some cases the amount borrowed could be split into three groups: large, medium and small loan amounts.
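As a sketch of how such splits might be carried out in R, the code below bins the assumed duration and amount columns into categories. The duration break points follow the 6-month and 48-month boundaries described above, while the amount break points (terciles) are purely illustrative.

```r
# Sketch: binning duration (months) and amount into categories.
# Assumes a data frame `credit` with numeric columns `duration` and
# `amount`; the amount break points below are illustrative only.
credit$duration_band <- cut(credit$duration,
                            breaks = c(-Inf, 6, 48, Inf),
                            labels = c("short", "medium", "long"))

credit$amount_band <- cut(credit$amount,
                          breaks = quantile(credit$amount,
                                            probs = c(0, 1/3, 2/3, 1)),
                          labels = c("low", "medium", "large"),
                          include.lowest = TRUE)

table(credit$duration_band, credit$amount_band)  # quick cross-check
```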
Loan duration could likewise be split into three groups, giving long-, medium- and short-term loans.

d) In drawing a decision tree we take variables that enable us to get optimal results from the data analysis; we only pick variables that have a direct, overall effect on the results obtained from the tree. In building our decision tree we made use of important variables such as the customer number, which is a unique identifier, and duration, which measures the period over which a loan is repaid, and we tried several combinations of variables in order to arrive at a sensible tree. The resulting tree has many nodes and records the observations obtained when different variables are related to credit rating. For instance, customers who borrowed loans with repayment periods below 34 months and a certain level of skills had higher chances of a good credit rating, with around 442 customers observed in that group. Variables such as housing and amount are also used in the tree, and the results reveal that customers who rented their houses had worse credit ratings than those who lived in houses they owned. The decision tree depicted several patterns and showed the number of observations behind each of them, as shown in the figure below.

Figure 2: The decision tree.

The error level involved in building the decision tree is depicted by the relative error plotted against the X-Val measure, as shown in the figure below.

Figure 3: Relative error (X-Val) for the decision tree.
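A minimal sketch of how such a tree could be fitted in R with the rpart package is shown below. The data frame name credit, the response column credit_rating and the particular predictors are assumptions for the example; the variables used in the actual analysis may differ.

```r
# Sketch: classification tree for credit rating.
# Assumes a data frame `credit` with a factor column `credit_rating`
# ("good"/"bad") and predictor columns such as duration, amount,
# housing and job (all names assumed).
library(rpart)

set.seed(42)  # cross-validation inside rpart is randomised
credit_tree <- rpart(credit_rating ~ duration + amount + housing + job,
                     data = credit, method = "class")

# Counterpart of Figure 2: the fitted tree with observation counts
plot(credit_tree, margin = 0.1)
text(credit_tree, use.n = TRUE, cex = 0.7)

# Counterpart of Figure 3: cross-validated relative error by tree size
plotcp(credit_tree)
```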
e) Logistic regression involves a number of procedures. We model the variable credit rating against other variables that might determine its outcome; logistic regression uses predictor variables to find patterns for predicting a particular outcome. In our case we used a set of variables to assess credit rating, considering variables such as employed, job, housing and duration, and analysed the results of the regression, which gives the model and the graphs shown in Figure 4 below. Including variables such as employed, amount and duration is vital when calculating the credit rating of different customers within the credit dataset. The regression produces four diagnostic graphs, including the residuals-versus-fitted plot, which indicates a reasonable model for the dataset; in the scale-location plot of the linear model we see that the dataset has a standard deviation of around 1.

Figure 4: Logistic regression diagnostic plots.

f) Several variables are used in predicting credit rating in the model provided. The overall error of the model, when the selected variables are used to predict good or bad credit rating, is 0.2733. The variables used in the prediction were duration, amount, employed, housing, job and depends; these variables are important since they directly affect a customer's credit rating. From the logistic regression we learn that the error matrix for the linear model gives bad versus bad at 5% and good versus good at 68%. Using this set of variables therefore produces only slight errors, and it is easy to conclude that these variables are important in predicting credit rating.
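The logistic regression and its error matrix could be reproduced along the lines sketched below. The data frame and column names are again assumed, the 70/30 train/test split and the 0.5 probability cut-off are illustrative choices, and the resulting figures will not necessarily match the reported 0.2733.

```r
# Sketch: logistic regression for credit rating and its error matrix.
# Assumes the same `credit` data frame with `credit_rating` plus the
# predictors named in the text (all names and the split are assumed).
set.seed(42)
train_idx    <- sample(nrow(credit), size = 0.7 * nrow(credit))
credit_train <- credit[train_idx, ]
credit_test  <- credit[-train_idx, ]

credit_train$good <- as.integer(credit_train$credit_rating == "good")

credit_glm <- glm(good ~ duration + amount + employed + housing +
                    job + depends,
                  data = credit_train, family = binomial)

# Counterpart of Figure 4: the four standard diagnostic plots
par(mfrow = c(2, 2)); plot(credit_glm); par(mfrow = c(1, 1))

# Error matrix on the held-out set and the overall error rate
pred_good   <- predict(credit_glm, credit_test, type = "response") > 0.5
actual_good <- credit_test$credit_rating == "good"
err_mat     <- table(actual = actual_good, predicted = pred_good)
print(round(prop.table(err_mat), 2))        # cf. the 5% / 68% cells
overall_error <- mean(pred_good != actual_good)  # cf. 0.2733
```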
Task 2
In implementing business intelligence systems we make use of different techniques and procedures. These techniques create opportunities that we can harness to ensure business knowledge and expertise is gained. Some of the opportunities created in implementing and using business intelligence include:

Efficiency: The deployment phase of CRISP-DM combines techniques that ensure the knowledge learnt is used in making business decisions. Knowledge obtained by deploying systems with CRISP-DM is important because it saves time and money in decision making; for instance, it ensures that huge volumes of data are organized and analysed easily (Silverston, 2002). As a result, it saves employees' time on tasks such as analysing the creditworthiness of bank clients. The results obtained from data analysis through CRISP-DM support sound decisions, since the process draws on different strategies (Olson, 2011). Synergies are created by combining human knowledge with computational techniques, making it easier to reach fast decisions and allowing different actors in the credit process to interact.

Competitive Advantage: Using CRISP-DM creates a competitive advantage, since business decisions are made more swiftly than with traditional data mining approaches. CRISP-DM follows set procedures that ensure precise data measures are used in the analysis, making it easier to map out real-world trends and take prudent decisions. For instance, when assessing the creditworthiness of a client, the mined data is packaged and used in a way that ensures prudent decisions and suitable products are offered to the customer, because the information takes into account the trends of the current, dynamic business environment. Data mining through the CRISP-DM deployment process also takes into consideration the social effect of using technological systems (Paredes, 2009): the processing is done by technology while decisions and analysis are made by human beings. The resulting synergies allow faster and more efficient forecasting of trends in the credit process, leading to better decisions.

Cost Savings: Using CRISP-DM saves considerable costs in data management and understanding, because during deployment organizations use software that processes data prudently and quickly. As a result, the cost of hiring and maintaining staff is reduced, since data is processed faster. Moreover, data mining and management through CRISP-DM provides insights into how data can be used in a bank, so banks reduce the occurrence of bad debts that result from poor credit ratings (Larose, 2005). This is because CRISP-DM allows for thorough analysis of data obtained from different sources. Deployment under CRISP-DM uses software that is fast and has a quick turnaround time; rather than consuming a lot of resources, it is efficient, and the costs associated with hiring staff to conduct data analysis are substantially reduced.

Business Continuity: Data mining is challenging, and the CRISP-DM process may have pitfalls that prevent the proper use of the data used to assess creditworthiness. Following CRISP-DM deployment procedures produces robust systems that ensure business activities are analysed in a prudent manner. The CRISP-DM process is iterative, so business continuity is supported at all times because mined data is reused again and again (Larose, 2005), and the model remains easy to apply in the future. As a result, any business could adopt this model to keep the business running.

Accuracy: A deployed CRISP-DM process relies on a business intelligence system that allows accurate data mining during analysis. Deployment involves both systems and human personnel in a way that ensures their social interaction: the systems use complex functionality to analyse data quickly and accurately, while staff cross-check the results of the analysed data. This involvement, together with socio-technical change management techniques, helps ensure that the analysed data is accurate and error free (Yeates, 2001).

Implementation Bottlenecks: The deployment phase of CRISP-DM is the final phase, following the modelling and evaluation steps. Designing the final system to be used in deploying efficient business intelligence systems is challenging, because a lot of data is involved and checking all the integrity constraints within these systems and databases is difficult. Design challenges can lead to incomplete implementations of CRISP-DM systems in the business intelligence field (Silverston, 2002), and this field requires good systems, especially in critical areas such as banking and finance. Some of the challenges faced in the pursuit of good data mining and management systems include:

Time Consuming: CRISP data mining tools and techniques are important for gathering data and ensuring that mining is done properly, but the task of collecting data is tedious and in most cases takes a great deal of time, making the whole data mining process time consuming. In some cases a lot of data is analysed, and this requires the input of personnel, since socio-technical change is difficult to manage. The databases used in banking operations are huge, so processing all the data is tedious and a lot of time is spent on it.
Customers in the banking industry have different attributes, so the process of collecting, validating and grouping data is tedious (Paredes, 2009). Moreover, capturing and developing systems that handle huge volumes of data, such as transactional information from banks, is time consuming and takes a lot of resources.

Redundancy: Data mining involves processing huge volumes of data many times over; CRISP-DM is an iterative process, and this repetition can cause redundancy. Redundancy can lead to a loss of data integrity, which affects how data is processed and analysed. Deploying business intelligence systems can bring many problems associated with handling large volumes of data, especially in the credit application process within the banking industry. As a result, several systems have to be deployed, and data organization has to be done quickly and prudently to ensure data integrity is maintained (Yeates, 2001). Personnel handling business intelligence systems also have to be efficient in handling data during the CRISP-DM deployment process. The banking industry is a sensitive one, so efficiency has to be maintained by reducing redundancy.

Cost: The cost of deploying intelligent systems can be a hindrance to implementing CRISP-DM. A lot of data is used in the implementation process, and this can require financial resources that are not available. Moreover, designing systems, and hiring staff who can deploy innovative CRISP-DM systems in the business intelligence field, is costly (Paredes, 2009). For instance, developing systems that process large amounts of information from the banking industry is a big challenge, and building systems that can handle a lot of data in different formats takes a great deal of development time. The high cost of developing business intelligence systems is a challenge and a hindrance to the deployment of CRISP-DM in banking.

Integrity: Maintaining integrity in decision making is difficult, especially when dealing with huge amounts of information. Business intelligence systems make data preparation and analysis easy; however, since the CRISP-DM deployment process involves such systems, human personnel are still required for decision making. Controlling the decision-making side among employees in a bank is difficult because of the huge amount of information processed by these systems (Olson, 2011). For instance, the world financial crisis precipitated by the banking collapse could be attributed to poor decision making, and this is a challenge encountered in deploying business intelligence systems within the banking and financial system. Even though CRISP-DM could be used in deploying banking and financial systems, maintaining integrity among the users of these systems is difficult, and deploying integrity checks within business intelligence systems through CRISP-DM may be the biggest challenge in the banking industry.

Task 3
In designing a dashboard, several measures and methods are used to ensure the correct procedure is followed at all times. This analysis looks at the design of a dashboard using Excel 2007.
Designing a dashboard is challenging and is done in several steps. The dashboard was supposed to show four different measures of sales and gross profit using four different parameters, and a dashboard can only be built once a pivot table has been created and used. There are different processes involved in the design and analysis of dashboards in Excel 2007. The four design considerations are described below:

Functionality: Dashboards are scalable because they present information and data visually in different ways, so functionality has to be considered in the design. Functionality looks at the purpose a dashboard is supposed to serve: it has to display different data under different situations and time frames. For instance, in the design of Dashboard 1 we had to ensure that the dashboard represents sales revenue and sales gross profit by week, month and year (Walkenbach, 2011), so these different sets of data have to be represented well through the functionality provided by Excel. Under this factor we have to ask questions such as whether Excel is suited to this kind of work. For functionality to be achieved, all the data has to be accommodated in the design and implementation of the dashboard.

Ease of Interpretation: Dashboards have to be easy to interpret, so that decision makers can easily read the data they need at any given moment. Usability covers both the design and the interpretation of the dashboards. When designing a dashboard, do not clutter it with so much information that it ruins the look and the analysis (Frye, 2007); instead, group similar data together to allow easy interpretation.

Scalability: One of the most important factors in the design and use of dashboards is scalability. The design has to allow all data points and values to fit, so that the data can be analysed easily. Data ranges and differences should be kept reasonably small so that scalability can be maintained; in most cases it is handled by sizing and re-sizing the chart areas where the dashboards are presented. In terms of size and appearance, dashboards have to be drawn to accommodate all data figures, including negative ones, so the design should use data ranges that fit the specified data levels (Frye, 2007). Scalability helps produce dashboards that are easy to look at and to use in data analysis.

Dynamic: Dynamism is another factor to consider when designing and building dashboards, because it is important in ensuring that the dashboards drawn represent a true picture of the data used.
Dashboards represent different sets of data and graphs, and since these data sets differ, dynamism is a requirement of the design (Walkenbach, 2011). Through the click of a button or some other action, a dashboard should change easily and display the requested data or graph. Dynamic dashboards are fast and efficient in data analysis because they display data easily and quickly.

The above factors were considered in producing the four dashboards drawn in the Excel workbook for GBI International. The four dashboards and the data they present are described below:

Dashboard One: The first dashboard shows the values of sales revenue and sales gross profit by week, month and year. Designing a dashboard involves calculations and analyses that feed into the final design; the sales revenue and gross profit for the different periods are shown in a chart under the Calculation tab of the workbook. The two most important measures in the calculation, design and analysis of this dashboard are sales revenue and sales gross profit, determined weekly, monthly and yearly. In this case we calculated the two measures over periods of 52 weeks, 12 months and 1 year for two consecutive years (2009 and 2010). Typing any number between 1 and 52 into cells H1, H23 and H47 on the Calculation tab of the GBI Sales workbook produces the desired dashboard.

Dashboard Two: This dashboard shows sales revenue and sales gross profit per product category. Its design required calculating revenue under the Product/Category tab of the GBI Sales workbook. GBI International has four major product categories: Utilities, Touring, Protective and Off Road. Typing any number between 1 and 12 into cell H68 on the Calculation tab produces the desired dashboard.

Dashboard Three: This dashboard presents the figures for sales revenue and sales gross profit across the four divisions of the different sales organizations. Typing 1, 2, 3 or 4 into cell H93 on the Calculation tab produces the desired dashboard.

Dashboard Four: This dashboard presents sales revenue and sales gross profit for the two countries in which GBI International operates, the United States and Germany, and the results are straightforward to interpret. Typing 1 or 2 into cell H117 on the Calculation tab produces the desired dashboard.

Task 4
Various tasks were undertaken in the course of this assignment, which was divided into three basic parts covering Tasks 1, 2 and 3, in which different operations and data analysis techniques were carried out. The journal of activities for this assignment is outlined below:

Task 1: This was the initial processing of the data, undertaken by analysing the data through various methods. The task took a period of 5 days (from the 11th to the 15th of May) because of the complexity of the work.
Under Task 1, several data processes were conducted in pursuit of analysing the data through decision trees, correlation and logistic regression. Decision trees are important for understanding data patterns and the relationships between them; they also help us understand how different variables affect other variables in the case of dependent and independent variables, while correlation looks at the effect of different variables on one another. Most of the results and graphs for the correlation and decision tree analyses are shown in the appendix.

Task 2: This was the other major task undertaken in the assignment and it took a period of 4 days (from the 16th to the 19th of May). The main topic researched was the examination of the opportunities and challenges that exist within the banking industry in the deployment of CRISP-DM systems. We had to look at the challenges and opportunities that can help us design systems using CRISP-DM; making use of CRISP-DM methods ensures that these opportunities are harnessed in producing good systems.

Task 3: This was the last part of the assignment, in which a dashboard of GBI International's sales was produced. The task took 7 days, from the 20th to the 26th of May. In producing the dashboard we had to consider the different factors used in dashboard design; these factors are important in ensuring data is presented in a good format for analysis. The dashboards were used to capture sales revenue and sales gross profit for GBI International across different data sets, and their design involves calculations that are vital for drawing dynamic graphs (Walkenbach, 2011).
