Data Mining Techniques and DNA/bio-data analysis - Essay Example


Introduction

Data Mining is an iterative and interactive discovery process required by organizations that generate huge volumes of data daily and need analysis on the fly. A decision support system must answer queries such as finding all cases of fraud or identifying the customers likely to buy a particular car; traditional query languages are infeasible in such scenarios (Tan, Steinbach and Kumar, 2007). Data Mining is an essential step of the knowledge discovery from data (KDD) process (Han and Kamber, 2006). The data mining process applies knowledge from a combination of fields (computer science, statistics, pattern recognition, artificial intelligence, etc.) to extract hidden, formerly unknown yet potentially useful information (such as patterns, anomalies, changes, associations and statistically significant structures) from large databases, data warehouses or other data repositories (Zaki and Wong, 2003). Data mining applications include market analysis, fraud or unusual pattern detection, risk analysis and management, text/web mining, stream mining, DNA/bio-data analysis and scientific data analysis.

Data mining tasks are either descriptive or predictive. Descriptive tasks characterize the general properties of the data and include Clustering, Summarization, Association Rules and Sequence Discovery. Predictive tasks conduct inference on the current data to make predictions and include Classification, Regression, Time Series Analysis and Prediction (Han and Kamber, 2006; Rokach, Maimon and Maimon, 2010).

Figure 1: Taxonomy of data mining methods and techniques (Sayad, 2012)

Data mining systems are classified according to the data mined, the knowledge extracted or the mining technique used. This report discusses the prominent data mining techniques that are used worldwide.
1. Data Mining Techniques

Data mining techniques are generally based upon non-parametric models which, unlike parametric models, are data driven and do not require equations. These techniques learn and dynamically adapt to the available and continuously changing data: the more input data there is, the better the formulated model. Non-parametric data mining techniques include neural networks, decision trees and genetic algorithms (Dunham, 2006).

A. Statistical Techniques

Statistics is the science of collecting, presenting and analyzing data (Liu, 2006). Many data mining techniques are based on basic statistical concepts.

a) Point Estimation

Point estimation predicts the unknown parameters of data mining models and tasks (e.g. summarization, time-series prediction, etc.) by calculating a parameter for the sample and predicting values for the missing data points. It is used to estimate the statistical parameters that describe the data, e.g. mean, variance and standard deviation. In time-series tasks, for example, point estimation calculates parameters for the sample data and then predicts one or more values that would appear later in the sequence. Point estimation methods include least squares, maximum likelihood estimation, robust estimation, the method of moments, Bayes estimators, etc. (Liu, 2006).

b) Summarization

Summarization is the process of extracting or deriving representative information from a data set. Frequency distribution, mean, variance, median, mode, etc. can all be visualized through summarization in the form of plots (e.g. box plots, scatter plots, etc.).

Figure 2: Box plot
Figure 3: Scatter plot

c) Bayes Theorem

Bayes Theorem calculates the probability of an event (the dependent event) occurring given that another event (the prior event) has already occurred.
Considering the dependent event as B and the prior event as A, the theorem states that

Probability of (B given A) = Probability of (A and B) / Probability of (A)

For example, consider estimating how likely it is that a customer under 21 will increase his spending. The prior condition (A) is the customer being under 21, while the dependent event (B) is the increased spending. If 25 out of 100 customers are under 21 and have increased their spending, then the probability of (A and B) = 25%. If 75 out of 100 customers are under the age of 21, then the probability of A = 75%. Bayes Theorem then predicts that 33% of the customers under the age of 21 are likely to increase their spending, i.e. 25/75.

d) Regression and Correlation

Regression is a data mining function that predicts a number (profit, temperature, distance, value, mortgage rates, etc.) based on past values. For instance, based upon the location, plot size, number of rooms, etc., a regression model can predict a house's value. The model is initially built upon a data set in which the target values are known; in the above case, data about houses (features and values) could be provided for a certain period of time. Based on the underlying relationship, the target value for a new set of predictors can then be predicted. If the target-predictor relationship can be approximated by a straight line, the regression is linear.

Figure 4: Linear regression

Correlation measures how similarly the values of two variables behave, i.e. the strength of association between the two variables. The degree and direction of correlation are estimated through the correlation coefficient, r. A value of 0 indicates that no correlation exists between the two variables, while a value of 1 indicates perfect correlation. A value of -1 indicates a perfect yet opposite correlation.

Figure 5: Correlation coefficient
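The worked example above can be checked in a few lines of Python. This is a minimal sketch: the Bayes figures come from the customer example in the text, while the plot-size and price values used for the correlation part are invented for illustration.

```python
# Bayes Theorem applied to the customer example above:
# A = customer is under 21, B = customer increases spending.
p_a_and_b = 25 / 100           # under 21 AND increased spending
p_a = 75 / 100                 # under 21
p_b_given_a = p_a_and_b / p_a  # P(B given A) = P(A and B) / P(A)
print(round(p_b_given_a, 2))   # -> 0.33

# Correlation coefficient r between two variables
# (invented data: hypothetical plot sizes and house prices).
def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

plot_size = [40, 55, 60, 80, 100]
price = [110, 150, 155, 210, 260]
r = pearson_r(plot_size, price)  # close to +1: a strong positive correlation
```

The larger feature values move together with the larger prices, so r falls just short of the perfect-correlation value of 1.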
B. Similarity Measures

Several data mining tasks that involve computing a distance (e.g. distance-based outlier detection, clustering (k-means), classification (KNN, SVM), etc.) essentially require measuring the similarity or distance between two data sets (Doreswamy, Manohar and Hemanth, 2011). Similarity is the measure of how alike two data objects are. In the context of data mining, it is generally described as a distance over the dimensions that represent the features of the data objects. A small distance indicates a high degree of similarity, while a large distance indicates a low degree of similarity. Similarity is a subjective measure and depends greatly on the domain and the application; for instance, two people are not similar just because they share similar names or cities. The dimensions or features over which distances are calculated must therefore be meaningful.

The feature values over which distances are computed must also be normalized, otherwise one feature may dominate the calculations. For instance, when judging whether two people are similar in height and in how far apart they currently reside, if the heights are measured in centimeters then the residence distance will dominate any height similarity. The distance and similarity measures generally used in data mining tasks are shown in Figure 6 (Doreswamy, Manohar and Hemanth, 2011).

Figure 6: Similarity/distance measures

C. Decision Trees

Decision trees are tree-shaped structures used for representing decision sets. The decisions taken generate rules, which are then used in the classification of a dataset. All nodes of the tree (the root as well as the internal nodes) are labeled with questions, and the arcs adjacent to each node represent the possible answers to its question. The leaf nodes represent the predicted solutions to the problem; in data mining, a leaf node represents the class to which a tuple belongs.
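The normalization requirement discussed under similarity measures can be illustrated with a short sketch. The height and residence values below are invented; min-max scaling is one common way (an assumption here, not prescribed by the text) to keep one feature from dominating the Euclidean distance.

```python
import math

def euclidean(p, q):
    # Distance over the feature dimensions of two data objects.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max_scale(rows):
    # Rescale every feature (column) to [0, 1] so that no single
    # feature dominates the distance calculation.
    cols = list(zip(*rows))
    lows, highs = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
            for row in rows]

# Two similar-height people: (height in cm, residence distance in km).
people = [[170.0, 2.0], [172.0, 800.0], [195.0, 5.0]]

raw = euclidean(people[0], people[1])  # dominated by the km feature (~798)
scaled = min_max_scale(people)
normalized = euclidean(scaled[0], scaled[1])  # both features now comparable
```

Unscaled, the 2 cm height difference is invisible next to the 798 km residence difference; after scaling, both features contribute on the same footing.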
One algorithm is required to create the tree and another to apply it to the data, which is usually a binary search operation (Han and Kamber, 2006). Decision trees can easily be transformed into IF-ELSE rules by mapping each path from the root node to each of the leaf nodes (Sayad, 2012).

Figure 7: Decision rules from decision trees

Decision trees are the most widely used technique in data mining, as the model is understandable and rules are generated easily. Two of the most notable decision tree methods are Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). Both techniques produce sets of rules which can be applied to a new dataset to predict the output for a tuple; in contrast to CHAID, CART generally requires less data preparation. A simple classification problem that decision trees can solve is classifying an individual as short, medium or tall based upon their gender and height. The resulting decision tree would be as shown in Figure 8.

Figure 8: Sample decision tree

In another data mining example, auditors can use decision trees to assess, based on the profits from customers, whether the marketing strategy used by the organization is cost-effective. For large databases the decision trees may be huge, and searching through the entire tree would be an extensive operation; in such cases, some kind of pruning technique has to be applied.

D. Neural Networks

Neural networks (NN), or artificial neural networks (ANN), model the working of the human brain. A NN is a directed graph where the nodes are processing elements and the arcs are the interconnections between those elements. Each node works independently and directs the output computed from its local input towards one of the adjacent nodes. A NN has source (input) nodes forming the input layer, sink (output) nodes forming the output layer and internal (hidden) nodes forming the hidden layer.
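The short/medium/tall tree above can be written out as the IF-ELSE rules obtained from its root-to-leaf paths. This is a sketch only: the split thresholds are assumed for illustration, since the figure itself does not survive in this text.

```python
def classify_stature(gender, height_m):
    # Each root-to-leaf path of the decision tree becomes one
    # IF-ELSE rule; the leaf gives the predicted class.
    # The thresholds below are assumed for illustration.
    if gender == "female":
        if height_m < 1.5:
            return "short"
        if height_m > 1.7:
            return "tall"
        return "medium"
    else:
        if height_m < 1.6:
            return "short"
        if height_m > 1.9:
            return "tall"
        return "medium"

print(classify_stature("female", 1.75))  # -> tall
print(classify_stature("male", 1.75))    # -> medium
```

Applying the rules is a simple walk from the root question (gender) down one branch to a leaf, which is why classifying a tuple with a trained tree is cheap even when building the tree is not.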
To execute a data mining task, each of the input (tuple) values is entered at the corresponding input node and the prediction is read at the output nodes. The NN can learn from prior values entered into it and can even be modified in order to achieve a desired performance level; it can, however, only work with numeric data. The NN is trained by feeding into it a set of input data along with the corresponding output, and the network learns from that input. Once the network is trained to the correct standard on the sample data, weights are assigned to each of the interconnections. Any new data entered as input is processed by the nodes, and an output is predicted according to the assigned weights and the activation function (threshold, sigmoid, Gaussian, symmetric sigmoid) at each node. The output of the network at each node is either -1, 0 or 1. NNs can continue to learn even after the training phase is over.

Consider a scenario where, based upon the weight and height of individuals, it is required to predict an individual's physique. The neural network would be formed as shown in Figure 9.

Figure 9: Basic neural network

When the input is provided, processing starts at each adjacent node. The output of node f3 would be the sum of each input multiplied by the weight of the corresponding interconnection, and the output of f3 would in turn be the input of the adjacent nodes.

There are some disadvantages alongside the accurate prediction and learning features of NNs. When the training data is all of the same kind, over-fitting may occur; such a network cannot generalize accurately to new kinds of data. Secondly, neural networks are data hungry: the more the data, the better the prediction. Finally, the overall design is heuristic and black-box-like, and training on a different set of data can yield different levels of accuracy.
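The weighted-sum-and-activation behaviour of node f3 described above can be sketched as a forward pass through one hidden layer. The weights and the input (weight in kg, height in m) are invented for illustration; in practice the weights would come from the training phase.

```python
import math

def sigmoid(x):
    # One of the activation functions mentioned above.
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    # Each hidden node (such as f3) outputs the activation of the sum
    # of every input multiplied by the weight of its interconnection.
    hidden = [sigmoid(sum(i * w for i, w in zip(inputs, ws)))
              for ws in hidden_weights]
    # The hidden outputs are in turn the inputs of the output node.
    return sigmoid(sum(h * w for h, w in zip(hidden, output_weights)))

# Inputs: (weight in kg, height in m); all weights below are invented.
hidden_weights = [[0.04, -1.2], [-0.03, 1.5]]
output_weights = [0.8, -0.6]
score = forward([70.0, 1.8], hidden_weights, output_weights)
```

With a sigmoid activation the final score always falls strictly between 0 and 1, which is why such outputs are often read as class probabilities or thresholded into discrete predictions.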
NNs have successfully been used in data mining tasks in the fields of fraud detection, medicine, telecommunications, insurance, marketing, operations research and bankruptcy prediction (Cerny, 2001).

E. Genetic Algorithms

Genetic algorithms are an evolutionary computation method that devises new, better and more optimal solutions (individuals) to a problem by using the set of available solutions (individuals). The challenging task is to model the problem and find the initial set of solutions. In data mining an individual (solution) is represented as a tuple of values or an array; the data is generally numeric or binary strings, although complicated data structures can also be used as long as the genetic operator (crossover) is defined for them. The point(s) of crossover are determined by the crossover algorithm. Apart from crossover, a very low probability of mutation is also set, whereby characters of the children are changed randomly. Once new individuals (children) have been formed by the crossover of the original individuals (parents), and mutated (if applicable), a fitness function is used to select the fittest individuals from among the parents and the children: the individuals that better meet the objective are selected. After selecting the best individuals, the whole reproduction process repeats with the selected individuals as parents until the objective reaches an acceptable threshold. Since the genetic algorithm selects only the individuals that can solve the problem, the optimized solution may still not be the best.

In data mining tasks, genetic algorithms have been used in the generation of association rules, in clustering and in classification. For example, a genetic algorithm has been used in a classification problem to optimize the feature selection process so as to predict students' final grades based upon web usage features extracted from their homework data.
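The crossover / mutation / fitness-selection loop described above can be sketched minimally. The toy fitness function here simply counts 1 bits (the classic "OneMax" exercise, an assumption of this sketch) and stands in for a real objective such as the classification error rate.

```python
import random

random.seed(7)

def fitness(individual):
    # Toy objective: the more 1 bits, the fitter the individual.
    return sum(individual)

def crossover(parent1, parent2):
    # Single-point crossover: children exchange tails at a random point.
    point = random.randrange(1, len(parent1))
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def mutate(individual, rate=0.01):
    # Very low probability of mutation: flip each bit with chance `rate`.
    return [1 - bit if random.random() < rate else bit
            for bit in individual]

def evolve(population, generations=60):
    for _ in range(generations):
        children = []
        for _ in range(len(population) // 2):
            parent1, parent2 = random.sample(population, 2)
            child1, child2 = crossover(parent1, parent2)
            children += [mutate(child1), mutate(child2)]
        # The fitness function selects the fittest individuals from
        # among the parents and the children.
        population = sorted(population + children, key=fitness,
                            reverse=True)[:len(population)]
    return population

population = [[random.randint(0, 1) for _ in range(20)]
              for _ in range(30)]
best = evolve(population)[0]
```

Because the parents survive into the selection step, the best fitness never decreases from one generation to the next, though, as the text notes, the final individual is only the best found, not necessarily the global optimum.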
The objective of that problem is to find a population of best weights for every feature vector which minimizes the error rate of the classification (Bidgoli and Punch, 2003). The drawbacks of genetic algorithms are their complexity, which is hard to comprehend and explain to users, and the difficulty of determining the most suitable fitness function, crossover and mutation.

F. K-Means Clustering

K-Means clustering partitions n objects into k clusters such that each object belongs to the cluster whose mean is nearest to it. K-Means clustering produces precisely k different clusters of the highest possible distinction (Sayad, 2012). The number of clusters, i.e. the value of k that would lead to the highest separation (distance), is not known a priori and must be estimated from the data itself. The aim of K-Means clustering is to minimize the total intra-cluster variance, or squared error function.

Figure 10: Squared error function, J = Σ(i=1..k) Σ(xn in Si) ||xn − µi||²

Here k is the number of clusters Si (i = 1, 2, ..., k) and µi is the centroid or mean point of all the points xn in Si. Although K-Means is a rather efficient method, the need to assign the number of clusters in advance, together with the sensitivity of the results to the initial selection of centroids, means that the process may terminate at a local optimum (Sayad, 2012).

Consider an example where the visitors of a site are to be grouped according to their ages.

Figure 11: K-Means clustering example

No change is observed in the last two iterations, and two groups are identified by the clustering method (ages 15-28 and ages 36-65). As can be seen from the example, the initial selection of centroids affects the resulting clusters; because of this, the algorithm is generally run multiple times with different starting points to bring forward the most suitable clusters.
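The visitor-age grouping can be reproduced with a small one-dimensional K-Means sketch. The twelve ages and the initial centroids below are invented so as to match the 15-28 / 36-65 grouping described above.

```python
def kmeans_1d(values, k, centroids, max_iterations=20):
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: each value joins the cluster whose
        # mean (centroid) is nearest to it.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:  # no change: the process has converged
            break
        centroids = updated
    return clusters, centroids

ages = [15, 16, 19, 20, 22, 28, 36, 41, 43, 44, 60, 65]
clusters, centroids = kmeans_1d(ages, 2, centroids=[16, 22])
print(clusters)  # -> [[15, 16, 19, 20, 22, 28], [36, 41, 43, 44, 60, 65]]
```

Even with both initial centroids placed inside the younger group, the update step pulls the second centroid toward the older ages until the two clusters stabilize; a different starting choice can converge to a different partition, which is why the algorithm is rerun from multiple starting points.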
2. Conclusion

Owing to the fact that the amount of information generated daily has increased immensely, data mining has become a necessity of the present day. Data mining can extract hidden, analytical, predictive information from a data set, which can be very useful for a company's strategic policy. This report gave an overview of the many data mining techniques that can be implemented to extract hidden useful information from large data stores.

3. References

Han, J. and Kamber, M. 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann.

Zaki, M.J. and Wong, L. 2003. Data Mining Techniques. WSPC/Lecture Notes Series. [Accessed 11 November 2012]

Dunham, M.H. 2006. Data Mining: Introductory and Advanced Topics. Pearson Education India.

Cerny, P.A. 2001. Data Mining and Neural Networks from a Commercial Perspective. Department of Mathematical Sciences, University of Technology, Sydney, Australia. [Accessed 11 November 2012]

Bidgoli, B.M. and Punch, W.F. 2003. Using Genetic Algorithms for Data Mining Optimization in an Educational Web-based System. Proceedings of the 2003 International Conference on Genetic and Evolutionary Computation: Part II, pp. 2252-2263. Springer-Verlag, Berlin.

Rokach, L., Maimon, O. and Maimon, O.Z. 2010. Data Mining with Decision Trees: Theory and Applications. World Scientific.

Tan, P.N., Steinbach, M. and Kumar, V. 2007. An Introduction to Data Mining. Pearson Education India.

Doreswamy, Manohar, M.G. and Hemanth, K.S. 2011. A Study on Similarity Measure Functions on Engineering Material Selection. [Accessed 11 November 2012]

Liu, H. 2006. Lecture Notes in Data Mining. Chapter 1: Point Estimation Algorithms. World Scientific.

Sayad, S. 2012. An Introduction to Data Mining. [Accessed 11 November 2012]