Data Mining Techniques by Using Boston Data - Research Proposal Example

Data Mining Techniques

Name of the student
Name of the Institution
Date of Submission

ABSTRACT

Data mining is a subfield of computer science that draws on several disciplines. It refers to the computational process of discovering patterns in very large data sets using methods at the intersection of database systems, statistics, machine learning and artificial intelligence. In this paper, five of the dominant linear prediction techniques, namely multiple linear regression (MLR), principal component regression (PCR), ridge regression, partial least squares (PLS) and nonlinear partial least squares (NLPLS), are applied to four distinct data sets to compare the predictive abilities of each technique on each data set. The paper explains and explores these techniques using the Boston data.

Table of Contents

ABSTRACT
Table of Contents
1.0 Study background
1.2 Statement of the problem
2.1 Literature Review
2.2 Brief history of data mining
2.3 Concepts in data mining
2.4 Data mining methods
3.0 Methodology and Data
3.1 Data set description and processing
4.1 Multiple linear regression model
4.2 Principal component regression (PCR)
4.3 Ridge regression
4.4 Partial least square
5.0 Summary and conclusion

1.0 Study background

Liao et al. (2012) categorize human problems, whether of economic, intellectual or business interest, in terms of six data mining activities. They include:

1. Classification
2. Estimation
3. Affinity grouping
4. Clustering
5. Description, which is problem summarization

The whole process of solving such problems can be collectively termed the knowledge discovery process. Braha (2013), for his part, classified data mining into two main categories:

i. Prediction, which includes classification, time series and regression.
ii. Knowledge discovery, which includes deviation detection, database segmentation, clustering, association rules, summarization, text mining and visualization (Fayyad et al., 1996).

All of these techniques are known as knowledge discovery in databases (KDD), and they reveal the relationships among the observed data (Dejaeger et al., 2012). In practice, KDD includes many stages, from the initial identification of the business problem and its aims through to the application of decision rules (Larose 2014). KDD is therefore the name for all the stages used in finding and discovering knowledge from data, with data mining being one of the stages in the process. Dejaeger et al. (2012) define data mining as the process of selecting, exploring and modelling large quantities of data to uncover relationships that are at first unknown, with the aim of obtaining clear and useful results for the owner of the database (Linoff & Berry 2011). Predictive data mining (PDM) works in the same way as a human analyst handling a small data set. However, PDM can be applied to large data sets without the constraints that limit a human analyst (Lin & Cercone 2012).

Figure: Three stages of knowledge discovery in databases (KDD)

1.2 Statement of the problem

One of the most developed parts of data mining is its predictive part; it has the greatest potential payoff and the most precise description (Torgo 2010). In data mining, the choice of technique for analysing a data set generally depends on the analyst's understanding of that technique. As a result, much time has been wasted in trying every prediction technique, including but not limited to stacking, boosting and meta-learning, in the effort to find the solution that best fits the analyst's needs (Liu and Motoda 2012).
Therefore, as more improved and modified prediction techniques are discovered, it is essential for the analyst to understand which tool performs which task better for a given type of data set (Rajaraman & Ullman 2012). In this paper, five of the dominant linear prediction techniques, namely multiple linear regression (MLR), principal component regression (PCR), ridge regression, partial least squares (PLS) and nonlinear partial least squares (NLPLS), are applied to four distinct data sets to compare the predictive abilities of each technique on these different samples (Liu and Motoda 2012, p. 23). However, the results presented in this paper, and the arguments and tests based on them, concentrate on only one of the data sets.

2.1 Literature Review

The paper also covers some of the data preprocessing techniques that help to reveal the characteristics of the data sets, with the main aim of applying the right prediction techniques accurately and efficiently. In the process we will be able to assess the merits and demerits of each predictive technique used (Low et al., 2012). Data mining is a subfield of computer science that draws on several disciplines. It refers to the computational process of discovering patterns in very large data sets using methods at the intersection of database systems, statistics, machine learning and artificial intelligence (Han and Kamber, 2006, p. xxi). Data mining is also described as the process of discovering interesting, insightful and novel patterns, as well as predictive, understandable and descriptive models, from large-scale data (Zaki and Meira, 2014, p. 4). The ultimate goal of the data mining process is to extract information from a large raw data set and convert it into a format that can easily be understood for further manipulation or application.
Further, data mining entails online updating, visualization, post-processing of discovered structures, complexity considerations, considerations of inference and models, data pre-processing, data management and database aspects (Han and Kamber, 2006, p. xxi). With globalization there has been rapid growth in data. Contributing factors include satellite remote sensing systems and imaging platforms, along with advances in data collection such as scanned text, bar codes on commercial products (for example in supermarkets), publishing tools, the wide use of digital cameras, and computerized government, scientific and business transactions (Grossman et al., 2013). The advent of the World Wide Web has also contributed to data overload (Han and Kamber, 2006, p. xxi). This tremendous growth points to an urgent need for automated tools and new techniques that can convert vast amounts of data into usable information and knowledge; see the graphs below (Han and Kamber, 2006, p. xxi). This survey focuses on data mining concepts and techniques that can be adopted in the 21st century.

2.2 Brief history of data mining

Data mining is a relatively new computer-based technology. It is seen as a solution to the challenges of data size, dimensionality and complexity. Data mining is applied in a wide range of areas, from finance to manufacturing to health care, among others (Shouman, Turner and Stocker 2012). Organizations can focus on the important information in their larger databases and are able to predict future trends and behaviours. Although the term was introduced only in the late twentieth century, the field has a long history: Bayes' theorem and regression analysis were developed in the eighteenth and nineteenth centuries as methods for identifying patterns and trends in data.
Data mining has its roots in:

Statistics
Artificial intelligence
Machine learning

In the 1990s, the term data mining was coined as one stage of Knowledge Discovery in Databases (KDD). The KDD process combines application domain knowledge, target data set creation, data cleaning and pre-processing, and data reduction and projection; however, processing power was regarded as the main challenge with this approach (Liao, Chu & Hsiao 2012). Computational efficiency and feasibility are still challenges, but today, with technological advancement, accuracy on large data sets is the main concern.

2.3 Concepts in data mining

Data mining techniques can be categorized as follows:

Classification: involves predicting a particular outcome for a particular input. The method uses a training set containing a set of attributes (features) and the respective outcome.
Estimation: the aim is to determine a value for an unknown output attribute, even when the input data are incomplete, uncertain or unstable. The approach is challenged by the presence of outliers (Bettini, Jajodia & Wang 2013).
Prediction: involves predicting future outcomes rather than present behaviour; for example, a business may predict next week's closing price for a product currently on sale (Shouman, Turner & Stocker 2012).
Clustering: groups together similar data items that share a pronounced level of commonality. It is used for statistical data analysis in various fields. Unlike classification, clustering requires no training; it is an unsupervised learning method.
Deviation detection or anomaly detection: identifies outlier events or observations. These may indicate a structural defect, bank fraud, medical problems (for example in MRI scans) or computer hacking, among others (Tang & Zhang 2013).
Association rule mining: helps in understanding behaviour, for instance analysing buyer behaviour to find correlations between purchases.

2.4 Data mining methods

Decision trees: entail assigning probabilities to items or variables to determine the best possible decision.
Artificial neural networks (ANN): these are mathematical models resembling the neural network of the brain. They have a layered set of interconnected nodes called neurons that exchange data with each other (Friedman & Schuster 2010).

3.0 Methodology and Data

The figure below shows the methodology adopted in this work. The four data sets are first introduced, together with the preliminary analysis done on each so as to gain more insight into the characteristics of the data. The relationships in the data are examined at this point, and plots are made of the raw input data (Liu & Motoda 2012). The data are then scaled and standardized to bring the dispersion between the variables to a manageable level. Each variable is then examined computationally to verify the relationship between the input variables and the output variables (Shouman, Turner and Stocker 2012). This is followed by singular value decomposition of the data, transforming them into principal components.

When building the models, the predictive techniques are applied, and different methods of each technique are used. For instance, multiple linear regression has three methods associated with it in this paper, although not all are discussed exhaustively (Shouman, Turner and Stocker 2012). They are full model regression, stepwise regression, and the model built by selecting the variables best correlated with the output variables. The models were validated using a test validation data set. When different data sets are applied, we expect the model to be sound and complete (Gorunescu 2011).
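
The third MLR method mentioned above, building a model from the input variables best correlated with the output, can be illustrated with a short Python sketch. It is only an illustration under stated assumptions and not the procedure actually used in the paper: the use of the Pearson coefficient, the threshold of 0.7, the function names and the toy values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return sxy / sqrt(sxx * syy)


def select_by_correlation(inputs, output, threshold):
    """Keep the input variables whose |r| with the output meets the threshold."""
    return [name for name, column in inputs.items()
            if abs(pearson(column, output)) >= threshold]


# Toy data (hypothetical, not taken from the Boston data set).
inputs = {
    "x1": [1, 2, 3, 4, 5],   # strongly, positively related to the output
    "x2": [5, 3, 4, 1, 2],   # strongly, negatively related to the output
    "x3": [2, 9, 1, 7, 3],   # almost unrelated to the output
}
output = [2.1, 3.9, 6.2, 8.1, 9.8]

selected = select_by_correlation(inputs, output, threshold=0.7)  # assumed cut-off
print(selected)
```

The absolute value is used so that strongly negative relationships are retained as well; the retained variables would then be passed to an ordinary multiple linear regression.
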
3.1 Data set description and processing

The dispersion of each data set was assessed by plotting it in a graph. The input variables were plotted against the output in each data set to establish the relationships between the variables. This was mainly to check whether any variables had a nonlinear relationship (Kharya 2012). The data used in the analysis were obtained from the Boston data set at http://lib.stat.cmu.edu/datasets/boston

The dispersion between the variables is shown in the figure below.

To show the differences in the means, box plots were produced, as shown in the figure below.

The scaled Boston Housing data were plotted against the index, mainly to show the range of dispersion, which lies between -2 and +2 and is very good. It is shown in the table below.

Second data set

The second data set is the coal data. It has 8 variables in 8 columns, and the variables are not named. The figure below shows the dispersion between all the variables.

Figure 5: Coal Data

The differences in the means of the variables are shown in the figure below.

The plot of the scaled coal variables is shown in the figure below. It shows a range of dispersion between -3 and +3, which is not a wide dispersion.

The third data set is the airline data

This data set has around 19 variables contained in 19 columns. A plot of the Airliner data set against the index, revealing the dispersion between the variables (a range of -500 to 4500), is shown in the figure below.

The box plot of the Airliner data set, showing slight differences in the means of the variables, is shown in the figure below.

When the scaled Airliner data set is plotted against the index, it shows a reduction in the range of the variables (within -3 and +3).
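
The effect of scaling described above, where variables with very different raw ranges end up within a few units of zero, can be illustrated with z-score standardization. The paper does not state the exact scaling formula, so the z-score, the function name and the toy values below are assumptions made for the sake of a minimal sketch.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores: subtract the column mean, divide by its std. deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(value - mu) / sigma for value in column]


# Hypothetical columns on very different scales (not taken from the data sets).
raw = {
    "small_scale": [0.1, 0.4, 0.2, 0.9, 0.5],
    "large_scale": [120.0, 4500.0, 800.0, 2300.0, 60.0],
}

scaled = {name: standardize(values) for name, values in raw.items()}

# After scaling, every column has mean 0 and standard deviation 1,
# so the columns can be plotted on a common axis.
for name, values in scaled.items():
    print(name, round(min(values), 2), round(max(values), 2))
```

Because every standardized column has unit variance, no single variable dominates a later regression or principal component step merely because of its units.
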
After examining the behaviour of the three data sets individually, it is important to analyse their combined characteristics. The graph below shows the combined variables against the index, revealing the level of dispersion between the variables (-5 to 25). Box plots for all three data sets were also produced to reveal more characteristics of the data. They show the differences between the column means, as shown below.

The scaled plot of all the variables combined shows a reduction in the level of dispersion, to between -2 and +3, as shown in the figure below.

4.1 Multiple linear regression model

For the Boston data, three methods were used in the analysis: full model regression, stepwise regression, and variable selection based on correlation with the response variable, using the correlation coefficient matrix of all the variables in the data set. The regression results are summarized in the table below.

MLR          R-Sq     Adj.R-Sq   MSE       RMSE     MAE      E.mod    CN         Norm.Wt   N
Full Model   0.7445   0.7306     21.1503   4.5989   3.2500   0.4405   7.33e+7    43.14     13
Cor.Coeff.   0.7038   0.6915     24.5201   4.9518   3.4430   0.3645   7.14e+7    5.64      11
Stepwise     0.6727   0.6968     24.5971   4.9595   3.3989   0.3809   2.122e+7   9.014     6

In the correlation model, the correlation coefficients were used purely to choose the variables best correlated with the output variables. The results are shown in the table above. The stepwise regression gives results only for the training predictions. The model's predicted output on the test data is shown below.

Figure 4: Boston housing data predicted output

4.2 Principal component regression (PCR)

This was first done on the Boston data. The regression results are shown in the table below. The results summarize all the important outputs expected from the tests.
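
The summary measures reported in the MLR table (R-Sq, Adj.R-Sq, MSE, RMSE and MAE) can be computed from the observed and predicted outputs as in the sketch below. It is a hedged illustration only: the paper does not give its formulas, so the textbook definitions are assumed, including dividing the squared error by n for the MSE, and the function name and toy values are hypothetical.

```python
from math import sqrt


def regression_metrics(y_true, y_pred, n_predictors):
    """Common goodness-of-fit measures for a fitted regression model."""
    n = len(y_true)
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_y = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total sum of squares
    mse = ss_res / n                                  # assumed definition (divide by n)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalizes the number of predictors used.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return {
        "r2": r2,
        "adj_r2": adj_r2,
        "mse": mse,
        "rmse": sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Toy observed and predicted values (hypothetical, not from the paper).
y_true = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
y_pred = [2.5, 5.5, 7.0, 8.0, 11.5, 13.5]

metrics = regression_metrics(y_true, y_pred, n_predictors=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions the RMSE is simply the square root of the MSE, which agrees with the MLR table, where 4.5989 is the square root of 21.1503 to four decimal places.
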
It can be seen that R-Sq, adjusted R-Sq and the modified coefficient of efficiency, together with MSE and MAE, are correlated; in the first model they all look almost the same. This can be observed from the table. The predicted output on the test data for the best two PCR models is shown in the diagram below.

4.3 Ridge regression

Five regression models were built using this technique. The plot of alpha versus MSE for ridge regression on the Boston data is shown in the graph below.

The summary of the ridge regression results on the Boston housing data is presented in the table below.

The predicted output plotted against the original output of the test data is shown in the graph below.

4.4 Partial least square

The number of PLS models was the same as for PCR, with the corresponding PC variables. The models were built and the regressions run. The summary of the PLS regression results on the Boston housing data is shown below.

The output scores U were plotted against the input scores T, along with the predicted test response.

When the analysis is carried out for the Coal and Airline data, the behaviour of the model can be clearly seen, and the analyst is able to make observations on it.

5.0 Summary and conclusion

Some predictive data mining problems are nonlinear, and for every complex forecasting problem both linear and nonlinear techniques are best used together in finding the results. When nonlinear techniques are combined with linear models, they are able to produce the best results. To learn more from the data, these techniques can be combined to help produce the results needed for making concrete decisions in the business simulation process. This analysis is important because it paves the way for new researchers in the data mining discipline through its review of data mining concepts and methodologies.
References

Bettini, C., Jajodia, S. and Wang, S., 2013. Time granularities in databases, data mining, and temporal reasoning. Springer Science & Business Media.
Braha, D. ed., 2013. Data mining for design and manufacturing: methods and applications (Vol. 3). Springer Science & Business Media.
Dejaeger, K., Verbeke, W., Martens, D. and Baesens, B., 2012. Data mining techniques for software effort estimation: a comparative study. Software Engineering, IEEE Transactions on, 38(2), pp.375-397.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R., 1996. Advances in knowledge discovery and data mining.
Friedman, A. and Schuster, A., 2010, July. Data mining with differential privacy. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 493-502). ACM.
Gorunescu, F., 2011. Data Mining: Concepts, models and techniques (Vol. 12). Springer Science & Business Media.
Grossman, R.L., Kamath, C., Kegelmeyer, P., Kumar, V. and Namburu, R. eds., 2013. Data mining for scientific and engineering applications (Vol. 2). Springer Science & Business Media.
Han, J. and Kamber, M., 2006. Data mining: concepts and techniques, 2nd ed. Morgan Kaufmann Publishers.
Kharya, S., 2012. Using data mining techniques for diagnosis and prognosis of cancer disease. arXiv preprint arXiv:1205.1923.
Larose, D.T., 2014. Discovering knowledge in data: an introduction to data mining. John Wiley & Sons.
Liao, S.H., Chu, P.H. and Hsiao, P.Y., 2012. Data mining techniques and applications–A decade review from 2000 to 2011. Expert Systems with Applications, 39(12), pp.11303-11311.
Lin, T.Y. and Cercone, N. eds., 2012. Rough sets and data mining: Analysis of imprecise data. Springer Science & Business Media.
Linoff, G.S. and Berry, M.J., 2011. Data mining techniques: for marketing, sales, and customer relationship management. John Wiley & Sons.
Liu, H. and Motoda, H., 2012. Feature selection for knowledge discovery and data mining (Vol. 454). Springer Science & Business Media.
Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A. and Hellerstein, J.M., 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8), pp.716-727.
Rajaraman, A. and Ullman, J.D., 2012. Mining of massive datasets (Vol. 1). Cambridge: Cambridge University Press.
Shouman, M., Turner, T. and Stocker, R., 2012, March. Using data mining techniques in heart disease diagnosis and treatment. In Electronics, Communications and Computers (JEC-ECC), 2012 Japan-Egypt Conference on (pp. 173-177). IEEE.
Tang, Q.Y. and Zhang, C.X., 2013. Data Processing System (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research. Insect Science, 20(2), pp.254-260.
Torgo, L., 2010. Data mining with R: learning with case studies. Chapman & Hall/CRC.
Zaki, M.J. and Meira, W. Jr., 2014. Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press.
