Data Mining Technique: Clustering

Data mining technique: Clustering

Data-Mining Assignment
By Chetan Patel (MBSP0202) (03064446)

Table of contents:
Introduction
Business Application of Clustering Technique
Algorithm Description
Application Description
Application Requirement
Business Model
Information Analysis
Advantages & Disadvantages of Clustering Technique
Conclusion
References

1. Introduction

Data mining is a technique used for "the extraction of hidden predictive information from large databases" (An Introduction to Data Mining, n.d.). It finds patterns in the data held in a data warehouse and derives information from those patterns. The patterns and the derived information are used in research, business processes and decision-making. Data mining can be described as a technique that performs "retrospective data access (for) prospective and proactive information delivery" (An Introduction to Data Mining, n.d.). Executing data mining algorithms requires the following three technologies:

Massive data collection
Powerful multiprocessor computers
Data mining algorithms

Clustering is a data mining technique that divides data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups (Berkhin, n.d.). One of many possible applications is a call centre, where every call is recorded as a data point. The number of variables in a data point is kept small, i.e. there are few columns in the database, while the number of data samples is large, i.e. there are many rows. The algorithm used in this report to illustrate a business application of clustering is hierarchical agglomerative clustering. The next section discusses the commercial application of the clustering technique introduced here.

2. Business Application of Clustering Technique

2.1 Clustering very large data sets

Clustering is one of the most extensively used techniques in large-scale data mining. One clustering model is k-means, which needs several passes over the complete data set and is therefore very expensive for large disk-resident data sets. For this reason, approximate versions of k-means that use only a small number of passes can be applied instead. The FEKM algorithm is one of them: it requires a small number of passes over the entire data set and produces all the needed cluster centres. It uses sampling to create the cluster centres, and the full data set is then used to adjust the centres created initially.

2.2 Data Tree Construction

Data tree construction is a well-studied problem in data mining. More recently, it has been applied to the mining of streaming data. A numerical interval pruning approach is used to process numerical attributes in this setting, which has led to a new algorithm that attempts to find and mark the exact split points. These techniques can improve the efficiency and effectiveness of decision tree construction on streaming data.

These techniques operate on attribute spaces. An attribute space A contains data points xi = (xi1, ..., xid) ∈ A, where i = 1, ..., N and each component xil belongs to the l-th attribute Al of A. The data points are also known as "object, instances, cases, patterns or transactions" (Berkhin, n.d.); they are represented by rows in the database. The components are also known as “feature, variable, dimension, component or field” (Berkhin, n.d.); they are the columns in the database. To form clusters, the attribute space A is divided into segments, which are subsets of the attribute space. These segments can be cubes, cells or regions.
The segment is therefore the "simplest attribute space subset (that) is a direct Cartesian product of sub ranges" (Berkhin, n.d.). Because a Cartesian product combines every value of one set with every value of another, a segment contains every combination of values drawn from its sub-ranges. In a clustering histogram, a unit is therefore considered an "elementary segment whose sub-ranges consist of a single category value", and the graph describes the number of data points per unit.

Datapoint ID   AVR menu option (1-4)   Customer native country   Call start time   Call end time   Call length
1              1                       UK                        10:00 AM          10:05 AM        5 min
2              2                       USA                       10:10 AM          10:20 AM        10 min
3              1                       UK                        10:10 AM          10:25 AM        15 min
4              4                       UK                        10:02 AM          10:05 AM        3 min
5              3                       USA                       10:10 AM          10:11 AM        1 min
6              1                       USA                       10:03 AM          10:15 AM        12 min
7              2                       UK                        10:10 AM          10:15 AM        5 min
8              4                       UK                        10:12 AM          10:20 AM        8 min

Table 1: Call Center Call Records. The table gives the call durations of different calls at the various data points.

If the agglomerative clustering algorithm is applied to a segment with a single category, i.e. the AVR menu option column, and the clustering criterion is Euclidean distance = 0, the result is the histogram in Figure 1. The data points are taken on the X-axis and call duration on the Y-axis.

Figure 1: Histogram for segment S where xi = (AVR menu option), refer to Table 1. The idea for the histogram is taken from http://en.wikipedia.org/wiki/Histogram

The clustering algorithm shall assign data points to clusters in such a manner that the intersection of the clusters is empty and their union is equal to the full data set. An agglomerative (bottom-up) clustering algorithm builds a clustering hierarchy by starting with one-point clusters, also known as singletons, and then merging two or more of the most appropriate clusters (Berkhin, n.d.). The criteria for merging clusters are defined by limiting the distance between cluster points.
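The single-category clustering of Table 1 and the partition requirement stated above can be illustrated with a minimal Python sketch. It assumes that "Euclidean distance = 0" on the AVR menu option column simply means "same menu option". The names CALLS, cluster_by_option and is_partition are hypothetical and do not come from the report.

```python
from collections import defaultdict

# Table 1 rows reduced to (datapoint ID, AVR menu option).
CALLS = [(1, 1), (2, 2), (3, 1), (4, 4), (5, 3), (6, 1), (7, 2), (8, 4)]

def cluster_by_option(calls):
    """Group datapoint IDs whose AVR menu option is identical (distance 0)."""
    clusters = defaultdict(set)
    for point_id, option in calls:
        clusters[option].add(point_id)
    return dict(clusters)

def is_partition(clusters, all_ids):
    """True when the clusters are pairwise disjoint and cover every ID."""
    seen = set()
    for members in clusters.values():
        if seen & members:      # a shared member means a non-empty intersection
            return False
        seen |= members
    return seen == set(all_ids)

clusters = cluster_by_option(CALLS)
# Number of data points per unit, i.e. the heights of the histogram bars.
histogram = {option: len(members) for option, members in sorted(clusters.items())}
print(histogram)                                         # {1: 3, 2: 2, 3: 1, 4: 2}
print(is_partition(clusters, [pid for pid, _ in CALLS]))  # True
```

The counts are the number of data points per unit, which is what the report says a clustering histogram describes.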
The data points within a cluster are therefore similar, because the distance between them is limited. As data points are merged to form larger clusters, the clustering criterion has to be generalized "to distance between subsets" (Berkhin, n.d.) of data points rather than the distance between single data points. This measure of proximity between clusters is called the linkage metric. The factors that influence hierarchical algorithms are therefore closeness and connectivity, and both are determined by the linkage metric. The similarity (distance) is calculated by applying the linkage metric operation to pairwise components of the data points or clusters (Berkhin, n.d.).

2.2 Application Description

In this report, the data mining technique is applied to a Customer Relationship Management application, i.e. to develop a call centre activity plan (Murray, Lin, and Chowdhury, n.d.).

2.2.1 Application Requirement

The requirement is to analyze the customer call records in order to find how many call agents the call centre requires during business peak hours. For the data points in Table 1, the AVR options are defined as:

1) Register/unregister for service – human intervention is required
2) Billing enquiry – human intervention is required
3) AVR menu repeat recording
4) Customer support for miscellaneous issues related to the service

2.2.2 Business Model

To find a solution to the stated requirement, the data mining technique is applied. The infrastructure used for the data mining application is shown in Figure 2, as described in (An Introduction to Data Mining, n.d.).

Figure 2: Business infrastructure for data mining applications. The idea for the diagram is adopted from http://www.thearling.com/text/dmwhite/dmwhite.htm

The Data Warehouse is an optimized warehouse built on a relational database; it contains data formed from two sub-ranges, as mentioned above.
The On-Line Analytic Processing (OLAP) server hosts the business model created according to the information requirements. The model indicates the requirement for call agents in business peak hours (OLAP and Data Mining, n.d.).

xi = (AVR menu option, Customer native country, Call start time)

These three columns of Table 1 form the segment S of the attribute space A (Table 1) that is used to form clusters. Similarity between data points in the segment is identified by linkage metrics (Luke, n.d.; Berkhin, n.d.).

The Data Mining Server is integrated with the OLAP server and the data warehouse. It runs the business model created on the OLAP server to return the following data:

Datapoint ID   AVR menu option (1-4)   Customer native country   Call start time   Cluster listing
1              1                       UK                        10:00 AM          1
3              1                       UK                        10:10 AM          2
6              1                       USA                       10:03 AM          3
2              2                       USA                       10:10 AM          4
7              2                       UK                        10:10 AM          5
5              3                       USA                       10:10 AM          6
4              4                       UK                        10:02 AM          7
8              4                       UK                        10:12 AM          8

Table 2: Data clusters formed by hierarchical agglomerative clustering algorithm

Data analysis tools such as reporting and visualization are applied to the results of the data mining server and to other metadata stored on the OLAP server to generate visual reports for management dashboards; refer to the histogram and dendrogram. According to these results, different call agents must be assigned to support customers from different native countries, because cultural requirements such as language accent and time zone differ between countries. Each call agent can be assigned a cluster for one native country.

2.2.3 Information Analysis

The 8 data points in the database are divided into 6 different clusters. Data point ID 5 is an automated response and hence does not require a call agent. One call agent is assigned to each identified cluster.

Histogram:

Figure 3: Histogram for segment S where xi = (AVR menu options, Customer native country, call start time), refer to Table 1.
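The step from 8 singletons to 6 clusters can be reproduced with a short Python sketch. It rests on assumptions the report does not state explicitly: two points are merged when their component-wise distance over segment S is exactly [0, 0, 10] (the value the report's distance table, Table 3, records for the merged pairs), countries are compared as 0 for equal and 1 for different, and start times are measured in minutes past 10:00 AM. All function names are hypothetical.

```python
from itertools import combinations

# Segment S from Table 1: ID -> (AVR menu option, country, minutes past 10:00 AM).
POINTS = {
    1: (1, "UK", 0), 2: (2, "USA", 10), 3: (1, "UK", 10), 4: (4, "UK", 2),
    5: (3, "USA", 10), 6: (1, "USA", 3), 7: (2, "UK", 10), 8: (4, "UK", 12),
}

def component_distance(a, b):
    """Component-wise distance [option, country, start time]; 0 means equal."""
    return [abs(a[0] - b[0]), 0 if a[1] == b[1] else 1, abs(a[2] - b[2])]

def merge_level_one(points, target=(0, 0, 10)):
    """One agglomerative pass: start from singletons, merge pairs at `target`."""
    cluster_of = {pid: {pid} for pid in points}
    for p, q in combinations(sorted(points), 2):
        if tuple(component_distance(points[p], points[q])) == target:
            merged = cluster_of[p] | cluster_of[q]
            for member in merged:
                cluster_of[member] = merged
    # Deduplicate the shared clusters and order them by their smallest ID.
    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted((sorted(c) for c in unique), key=lambda c: c[0])

clusters = merge_level_one(POINTS)
print(clusters)       # [[1, 3], [2], [4, 8], [5], [6], [7]]
print(len(clusters))  # 6
```

Under these assumptions only the pairs (1, 3) and (4, 8) qualify, which agrees with the 6 clusters reported in the information analysis.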
The idea for the histogram in Figure 3 is taken from http://en.wikipedia.org/wiki/Histogram

Dendrogram:

The distance table used to draw the diagram is given below (Hierarchical Clustering, n.d.). The table shows whether any two data points are closely connected. A value of 0 means that the data points are close and connected; for the connected pairs the component-wise distance [0,0,10] is shown. X is a variable that stands in place of the numeric distance between data points that are not connected, because the actual distances and names of data points are not used here.

     1          2    3          4          5    6    7    8
1    0          X    [0,0,10]   X          X    X    X    X
2    X          0    X          X          X    X    X    X
3    [0,0,10]   X    0          X          X    X    X    X
4    X          X    X          0          X    X    X    [0,0,10]
5    X          X    X          X          0    X    X    X
6    X          X    X          X          X    0    X    X
7    X          X    X          X          X    X    0    X
8    X          X    X          [0,0,10]   X    X    X    0

Table 3: Euclidean distance table

From the distance table it is concluded that data points 1 and 3 are close and connected, and similarly data points 4 and 8. Each pair is merged into a new cluster, as shown in Figure 4.

Figure 4: Hierarchical agglomerative clustering (Dendrogram, n.d.). The idea for the diagram is taken from http://en.wikipedia.org/wiki/Dendrogram

The dendrogram illustrates the singleton clusters and the merged clusters formed by applying the call centre business requirement model. It is used to maintain the call centre call records database. The next section discusses the advantages and disadvantages of the techniques described above.

3. Advantages & Disadvantages of Clustering Technique

The clustering technique is compared with the k-nearest neighbour data mining technique to evaluate the advantages and disadvantages of using it for the business model defined in this report.
The k-nearest neighbour data mining technique classifies a new vector into the category that is the “most common class amongst the K nearest neighbours” (k-nearest Neighbour Algorithm, n.d.).

Advantages of the agglomerative clustering algorithm are (Berkhin, n.d.; Agglomerative Hierarchical Clustering Overview, n.d.):

The agglomerative clustering algorithm does not use a majority criterion but a pre-defined linkage metric as its data model. The k-nearest neighbour algorithm, in contrast, has no data model: it calculates the distance to all vectors and then assigns the class common to most of the K nearest vectors (Agglomerative Hierarchical Clustering Overview, n.d.). Example: the HAC algorithm defines its merge criterion with the operation "="; all data points that meet this and the other two criteria are merged into one cluster. The K-NN algorithm calculates the Euclidean distance between the new vector and the vectors already present in the feature space. As a result, the call record for a call with “Call start time” < 10 min may be merged into the wrong cluster, resulting in two calls landing at the same time on one call agent.

Data point ID   AVR menu option (1-4)   Customer native country   Call start time
1               1                       UK                        10:00 AM
3               1                       UK                        10:10 AM
2               1                       UK                        10:07 AM
4               1                       UK                        10:18 AM
5               1                       UK                        10:03 AM

Table 4: HAC vs. K-NN algorithm

A new call record with data point (1, UK, 10:05 AM) forms a new cluster in the HAC algorithm, whereas the K-NN algorithm with K=2 may give this feature space vector a different classification that is not appropriate for this application.

The linkage metric of the HAC data model makes it possible to apply problem requirements at different hierarchy levels in order to find a specific solution. The K-NN algorithm has limited provision for influencing the classification of vectors. Example: considering a maximum “call length” of 10 minutes, the operation may be redefined at the second level of clustering. Thus, there shall be no overlap of calls. This re-modelling is not possible in K-NN.
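The contrast drawn from Table 4 can be made concrete with a small Python sketch. Table 4 carries no class labels, so the agent labels below are purely hypothetical: they are the groups that an exact 10-minute merge rule would produce on the table. The exact 10-minute rule itself is also an assumption, inferred from the [0,0,10] entries of Table 3. Start times are minutes past 10:00 AM, and every name in the block is illustrative.

```python
from collections import Counter

# Table 4: ID -> start time in minutes past 10:00 AM (all rows: AVR option 1, UK).
START = {1: 0, 3: 10, 2: 7, 4: 18, 5: 3}

# HYPOTHETICAL class labels (not in the report): the agent each call would get
# if calls starting exactly 10 minutes apart were merged. Calls 1 and 3 share one.
AGENT = {1: "A", 3: "A", 2: "B", 4: "C", 5: "D"}

def knn_agent(new_start, k=2):
    """Majority class among the k calls whose start time is nearest."""
    nearest = sorted(START, key=lambda pid: abs(START[pid] - new_start))[:k]
    votes = Counter(AGENT[pid] for pid in nearest)
    return votes.most_common(1)[0][0], nearest

def hac_agent(new_start, gap=10):
    """Existing agent only if some call starts exactly `gap` minutes away."""
    for pid, start in START.items():
        if abs(start - new_start) == gap:
            return AGENT[pid]
    return None  # no match: the call becomes a new singleton cluster

agent, neighbours = knn_agent(5)   # new record (1, UK, 10:05 AM)
print(sorted(neighbours))          # [2, 5]
print(hac_agent(5))                # None
```

With K=2 the nearest calls start at 10:03 and 10:07, so K-NN always hands the 10:05 call to an agent who already has a call within two minutes of it. The exact-gap rule finds no partner and opens a new cluster, as the report describes for HAC.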
HAC does not require training samples, as the K-NN algorithm does.

Disadvantages of the hierarchical agglomerative algorithm are:

The hierarchical agglomerative clustering algorithm is an iterative algorithm that may terminate as a headless tree with more than one cluster node at the top level. The data model for the HAC algorithm must therefore define the termination criteria. The linear search K-NN algorithm terminates after classification of all n vectors. Example: in order to allow a maximum of 4 calls per call agent in an hour, the HAC algorithm must stop after the 2nd level.

The hierarchical agglomerative clustering algorithm has complexity O(n^3) (Zhao and Karypis, 2002). The complexity of the linear search k-nearest neighbour algorithm is O(n) (Nearest Neighbour Search, n.d.).

A different data model has to be defined for each iterative step of the HAC algorithm. Since the data model directly affects the result, it must meet the problem requirements in order to solve the given problem statement. Example: a different operation is defined for level 1 and level 2 clustering.

If a data point is wrongly assigned to a cluster, the wrong assignment influences cluster formation at all levels and hence the results.

4. Conclusion

The data mining techniques depend on call length, the database accessed, the data mining server and the user. At present, the user, the agents doing the data mining work and the database are located in different places and are connected through a network. The core variables should be chosen so as to establish the authenticity of the data. For example, when there is doubt about the collected predictive data, it can be verified by calling, using the information available. The data models defined form the basis of the data concepts in the clustering process. This enables comparison between the HAC and K-NN algorithms and makes the clustering technique more suitable.
It is also concluded that the OLAP server analyses the historic data and helps in deciding the formation of new data. The analysis of the results achieved by applying this data model to the database, in order to form new data concepts for the problem solution, is most critical to the successful application of the data mining clustering technique.

5.0 References

1. IMU, 2007, A Guide to Harvard Referencing. n.d. Leeds Metropolitan University, edition not available. [Accessed December 23, 2007]. Available from: http://www.lmu.ac.uk/lskills/open/sfl/content/harvard/
2. IOS, 2007, Agglomerative Hierarchical Clustering Overview. n.d. IOS, edition not available. [Accessed December 28, 2007]. Available from: http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Agglomerative_Hierarchical_Clustering_Overview.htm
3. Thearling.com, 2007, An Introduction to Data Mining. n.d. Edition information not available. [Accessed December 23, 2007]. Available from: http://www.thearling.com/text/dmwhite/dmwhite.htm
4. Berkhin, Pavel. Survey of Clustering Data Mining Techniques. n.d. Accrue Software, Inc. [Accessed December 23, 2007]. Available from: http://www.ee.ucr.edu/~barth/EE242/clustering_survey.pdf
5. Media Wiki, 2007, Data mining. n.d. Wikipedia, edition not available. [Accessed December 23, 2007]. Available from: http://en.wikipedia.org/wiki/Data_mining
6. Media Wiki, 2007, Dendrogram. n.d. Wikipedia, edition information not available. [Accessed December 28, 2007]. Available from: http://en.wikipedia.org/wiki/Dendrogram
7. Media Wiki, 2007, Euclidean Distance. n.d. Edition not available. [Accessed December 23, 2007]. Available from: http://en.wikipedia.org/wiki/Euclidean_distance
8. Hierarchical Clustering – A Working Example. n.d. [Accessed December 23, 2007]. Available from: http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/Hier_Example.htm
9. Media Wiki, 2007, Histogram. n.d. Wikipedia, edition not available. [Accessed December 28, 2007]. Available from: http://en.wikipedia.org/wiki/Histogram
10. Media Wiki, 2007, k-nearest Neighbor Algorithm. n.d. Wikipedia, edition not available. [Accessed December 28, 2007]. Available from: http://en.wikipedia.org/wiki/Nearest_neighbor_(pattern_recognition)
11. Luke, T. Brian. Agglomerative Clustering. n.d. [online] [Accessed December 28, 2007]. Available from: http://fconyx.ncifcrf.gov/~lukeb/agclust.html
12. Murray, Craig, G., Lin, Jimmy. & Chowdhury, Abdur. n.d. Identification of User Sessions Using Hierarchical Agglomerative Clustering. University of Maryland. [Accessed December 23, 2007]. Available from: http://www.glue.umd.edu/~gcraigm/papers/gcraigmHACposter.pdf
13. Media Wiki, 2007, Nearest Neighbor Search. n.d. Wikipedia, edition not available. [Accessed December 28, 2007]. Available from: http://en.wikipedia.org/wiki/Nearest_neighbor_search
14. Oracle, 2007, OLAP and Data Mining. n.d. [online] Oracle, edition not available. [Accessed December 28, 2007]. Available from: http://download.oracle.com/docs/cd/B28359_01/server.111/b28313/bi.htm
15. Zhao, Ying. & Karypis, George. 2002. Evaluation of Hierarchical Clustering Algorithms for Document Datasets. ACM. [Accessed December 23, 2007]. Available from: http://delivery.acm.org
16. Gagan Agrawal, Ruoming Jin, Anjan Goswami, 2007, Algorithms for Data Mining, CSE, edition not available. Retrieved on 10th January 2008 from http://www.cse.ohio-state.edu/~agrawal/Research_new/mining.htm
17. Michael W. Berry, Murray Browne, 2006, Lecture Notes in Data Mining, World Scientific, I edition. Retrieved on 10th January 2008 from http://books.google.com/books?id=Rmjy1qk55tgC&dq=dendrogram+singleton+clusters+data+mining
18. Vladimir Estivill-Castro and Michael Houle, 2000, Robust Distance-Based Clustering with Applications to Spatial Data Mining, kev.pulo.com, edition not available. Retrieved on 10th January 2008 from http://www.kev.pulo.com.au/databases/echgis/
19. Nabil Belacel, Qian (Christa) Wang, Miroslava Cuperlovic-Culf. OMICS: A Journal of Integrative Biology. 2006, 10(4): 507-531. doi:10.1089/omi.2006.10.507. Retrieved on 10th January 2008 from http://www.liebertonline.com/doi/abs/10.1089/omi.2006.10.507?cookieSet=1&journalCode=omi
