Data Mining: Concepts and Techniques

Data mining can be defined as a class of database applications that search for hidden patterns within a collection of data, patterns that can then be used to predict future behavior effectively and efficiently. A true data mining application or technique must therefore be able to change the criteria by which data is presented and to discover previously unknown relationships among the data. Data mining tools make it possible to predict future trends and behaviors, and so support proactive, knowledge-driven decisions. The automated, prospective analysis offered by the data mining techniques discussed below goes beyond the analysis of past records provided by the retrospective tools used in decision support systems (DSS). These techniques are the result of a long process of research and product development, which began with the need to collect, store, and retrieve business data. Commonly used techniques include artificial neural networks, biclustering, PageRank, genetic algorithms, nearest neighbor methods, and rule induction.

A) Data Mining Classification over large database

1. The kNN: k-nearest neighbor classification

The simplest memory-based classifier memorizes the entire training data and classifies a test object only when its attributes exactly match one of the training examples. kNN relaxes this requirement: it finds the group of k objects in the training set that are closest to the test object and assigns a label based on the predominant class in that neighborhood. The key elements of the algorithm are a set of labeled objects, a distance or similarity metric for computing the distance between objects, and the number of nearest neighbors, k.
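The three elements named above (labeled objects, a distance metric, and k) can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name `knn_classify`, the toy data, the choice of Euclidean distance, and the simple majority vote are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; Euclidean distance is assumed.
    """
    # Order the labeled objects by their distance to the test object.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Keep the class labels of the k closest objects.
    labels = [label for _, label in by_distance[:k]]
    # The predominant class in the neighborhood decides the label.
    return Counter(labels).most_common(1)[0][0]

# Hypothetical toy data: two well separated groups labeled "a" and "b".
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
label_near_a = knn_classify(train, (1.1, 1.0), k=3)
label_near_b = knn_classify(train, (4.9, 5.0), k=3)
```

Because every training object is compared with the query at classification time, the sketch does no work up front and all of its cost falls on each lookup.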
Advantages: It is simple and easy to understand. It is easy to implement. It performs well in a wide variety of situations, which makes it broadly usable. It is particularly suited to multi-modal classes and to applications in which an object can have several class labels.

Disadvantages: The choice of k is a limiting factor. If k is too small, the result is very sensitive to noise points; if k is too large, the neighborhood is likely to include many points from other classes. Requiring test records to match training records exactly is not a workable alternative, because most test records match no training record exactly and would be left unclassified. The way the neighbors' class labels are combined can also be complicated to get right.

2. PageRank

PageRank is a search ranking algorithm that uses the hyperlinks of the World Wide Web. It produces a static ranking of Web pages: a PageRank value is computed offline for every page, independently of any search query. Instead of the query, it relies on the democratic nature of the Web, using its vast link structure as an indicator of the quality of an individual page. These features contributed to the success of the Google search engine.

Advantages: It is generally dependable, and its rankings are usually accurate. It is simple and efficient to use once its principle is understood.

Disadvantages: Search results are based on literal items (keywords, metadata, and tags) rather than on their actual meaning. Web pages can be ranked poorly under some topological structures of the Web, as happens in Google's ranking algorithm. New pages start with a low rank and take a long time to be listed and to gain a high rank.
Repeated quotation of inaccurate information across Web pages can lead to those inaccurate pages being indexed, which spreads misinformation.

3. Naive Bayes

Advantages: It is easy to construct and needs no complicated iterative parameter estimation, so it can be applied to huge data sets. It is simple, elegant, and robust. It is easy to interpret and requires little expertise. It often performs surprisingly well in practice.

Disadvantages: The assumption that the variables are independent within each class may be unduly restrictive. Its predictive accuracy can be relatively low when that assumption is badly violated.

4. The Apriori Algorithm

This algorithm finds frequent itemsets in a transaction dataset and derives association rules from them. Apriori is a seminal algorithm that finds frequent itemsets by means of candidate generation.

Advantages: It performs a level-wise, complete search that exploits the frequent itemset property. It can make use of the large itemset property. It is easily parallelized. It is easy to implement.

Disadvantages: Performance suffers because it scans the database repeatedly and checks a large set of candidates by pattern matching in order to count item frequencies. Handling a large number of candidate sets, or long patterns, is costly.

5. SVMs

SVM stands for Support Vector Machine. It is a classification model (applied globally) that generates non-overlapping partitions and uses all attributes. It is based on maximum-margin linear discriminants and is similar to probabilistic approaches.
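For reference, the maximum-margin idea is usually written as the following pair of optimization problems. This is the standard textbook soft-margin formulation, not something taken from the report itself; the symbols (weights w, bias b, slack variables ξ_i, Lagrange multipliers α_i, penalty constant C, labels y_i in {-1, +1}) are assumed notation.

```latex
% Primal: maximize the margin while penalizing violations linearly.
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .

% Dual: the slack variables no longer appear; C only bounds the multipliers.
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
 - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\,\alpha_j\,y_i\,y_j\,x_i^{\top}x_j
\quad\text{subject to}\quad
0 \le \alpha_i \le C,\qquad \sum_{i=1}^{n}\alpha_i\,y_i = 0 .
```

Replacing the inner product x_i^T x_j in the dual with a kernel function is what allows the same formulation to produce non-linear decision boundaries.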
The algorithm can handle linearly non-separable points, where the classes overlap to some extent so that a perfect separation is impossible, and it can solve problems with non-linear decision boundaries.

Advantages: With a linear penalty function, the slack variables vanish from the dual problem, and the constant C appears only as an additional constraint on the Lagrange multipliers. It is robust and relatively insensitive to misclassified points in the training set. It handles conditional information efficiently by dividing it into individual sections. Its final classifier has a simple structure.

Disadvantages: Its class membership probabilities are uncalibrated. It applies directly only to two-class tasks, so multi-class tasks must first be reduced to several binary problems. Its effectiveness depends heavily on the kernel selected, the kernel's parameters, and the soft-margin parameter.

Conclusion

This section has discussed the meaning, purpose, and pros and cons of five data mining algorithms within the category of classification over large databases. The goal was to identify, scrutinize, and evaluate each algorithm by shedding light on it. The discussion shows that the performance of each of the five algorithms depends on the type of problem at hand, as well as on the performance metric used and the general characteristics of the dataset.

B) Mining Clusters over large database

Effective and efficient clustering solutions can be obtained by selectively storing the “crucial” portions of the database and summarizing the rest. Clustering is a basic tool of data mining and pattern recognition that divides a given set of data into groups and sub-groups.
The fundamental goal of clustering is to make within-cluster distances small and between-cluster distances large.

Some facts about clustering: It helps to discover inherent classes or subgroups within the data. Unlike in classification, however, neither the number of classes nor the true class labels are known in advance. It is useful wherever an efficient representation of the data is needed. In engineering applications, clustering is associated with pattern recognition.

1. The k-means Algorithm

This is a simple, iterative technique for partitioning a given data set into a user-specified number of clusters, k.

Advantages: The algorithm is widely preferred because it is simple to use, easily understood, reasonably scalable, and easily modified to handle streaming data.

Limitations: Several limitations hinder its use in all areas of data mining. It is highly sensitive to initialization. It is a limiting case of fitting the data with a mixture of k Gaussians with identical, isotropic covariance matrices (Σ = σ²I), in which the soft assignments of data points to mixture components are hardened so that each data point is assigned solely to its most likely component; it therefore struggles when the data are not well described by reasonably separated, roughly spherical clusters. It is sensitive to outliers, because the mean is not a robust statistic. The cost of the optimal solution decreases as k increases, reaching zero when the number of clusters equals the number of distinct data points; this makes it difficult both to compare solutions with different numbers of clusters directly and to find the optimum value of k.

2. Biclustering

Biclustering allows the rows and columns of a matrix to be clustered simultaneously.
The complexity of the biclustering problem depends on the exact problem formulation, and in particular on the merit function used to evaluate the quality of a given bicluster. Biclustering algorithms differ in how they define a bicluster; the three main types are biclusters with constant values, biclusters with constant values on rows or on columns, and biclusters with coherent values.

Advantages: Size constraints can be incorporated naturally, which speeds up the whole process considerably. It is simple and easy to understand once constructed. It gives detailed results.

Disadvantages: The number of inclusion-maximal biclusters can be exponential, which makes generating the entire set infeasible. Using it appropriately requires a good understanding of its concepts.

3. DBSCAN

This is a density-based clustering algorithm that finds clusters starting from the estimated density distribution of the corresponding nodes. It visits every point of the database, possibly several times. It executes exactly one neighborhood query for each point, so if an indexing structure is used that answers such a query in O(log n), the overall runtime complexity is O(n log n). To avoid recomputing distances, the distance matrix is often materialized.

Advantages: It requires only two parameters and is mostly insensitive to the ordering of the points in the database. It can find arbitrarily shaped clusters. The number of clusters does not have to be specified a priori.

Disadvantages: Its quality depends on the distance measure used in the regionQuery function; Euclidean distance is the most commonly used metric. It does not cluster well data sets with large differences in density.

4. Hierarchical

This method of cluster analysis builds a hierarchy of clusters using one of the following strategies:

i. Divisive: a top-down approach in which all observations start in a single cluster, which is split recursively as one moves down the hierarchy.

ii. Agglomerative: the reverse of divisive. It is a bottom-up approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Advantages: Any valid measure of distance can be used, since the method needs only a distance matrix. It is well suited to data with an inherent nested structure, because the resulting hierarchy expresses parent-child relationships between clusters directly.

Disadvantages: It is complex to implement efficiently. Each merge or split depends on the ones made before it and cannot be undone later. It is costly to run and manage on large databases.

5. The EM Algorithm

The Expectation-Maximization (EM) algorithm is used with normal mixture models for clustering continuous data and estimating the underlying density function. It computes maximum likelihood estimates iteratively when the observations are incomplete (missing data, conceptual incompleteness, or artificially augmented data). EM is an ascent algorithm that solves the optimization by iterating two steps: the E-step (Expectation) followed by the M-step (Maximization).

Advantages: Once its formulae are understood, it is flexible and easy to work with. It can be used in many settings and combined with other data mining techniques.
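The alternation of E-step and M-step can be sketched for the simplest case, a mixture of two one-dimensional normal components. This is an illustrative sketch under stated assumptions, not a general implementation: the function names, the starting variances and mixing weight, the fixed iteration count, and the toy data are all chosen for the example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, mean_a, mean_b, iterations=50):
    """Fit a two-component 1-D normal mixture by alternating E- and M-steps."""
    var_a = var_b = 1.0   # assumed starting variances
    weight_a = 0.5        # assumed starting mixing proportion of component a
    n = len(data)
    for _ in range(iterations):
        # E-step: probability that each point belongs to component a,
        # given the current parameter estimates.
        resp = []
        for x in data:
            pa = weight_a * normal_pdf(x, mean_a, var_a)
            pb = (1.0 - weight_a) * normal_pdf(x, mean_b, var_b)
            resp.append(pa / (pa + pb))
        # M-step: re-estimate the parameters as responsibility-weighted averages.
        total_a = sum(resp)
        total_b = n - total_a
        mean_a = sum(r * x for r, x in zip(resp, data)) / total_a
        mean_b = sum((1.0 - r) * x for r, x in zip(resp, data)) / total_b
        # A small floor keeps the variances from collapsing to zero.
        var_a = max(sum(r * (x - mean_a) ** 2 for r, x in zip(resp, data)) / total_a, 1e-6)
        var_b = max(sum((1.0 - r) * (x - mean_b) ** 2 for r, x in zip(resp, data)) / total_b, 1e-6)
        weight_a = total_a / n
    return (mean_a, var_a, weight_a), (mean_b, var_b, 1.0 - weight_a)

# Hypothetical data: two groups centered on 1.0 and 5.0.
data = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
comp_a, comp_b = em_two_gaussians(data, mean_a=0.0, mean_b=6.0)
```

On this toy data the two estimated means settle close to the group centers. With less clearly separated data the result depends on the starting values, and the fixed iteration count would normally be replaced by a convergence check on the log-likelihood.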
Disadvantages: It is numerically harder to maximize the log of a sum than to maximize a sum of logs. It can converge slowly. At times it can be used only with constrained estimation techniques.

Conclusion

Cluster analysis, as discussed above, finds groups of data objects that are similar to one another in their features. It aims for high-quality clusters, in which inter-cluster similarity is low and intra-cluster similarity is high. Like classification, clustering segments data. Unlike classification, however, clustering segments the data into groups that were not defined in advance, which makes it very useful for data exploration, data preprocessing, and anomaly detection.
