StudentShare
Contact Us
Sign In / Sign Up for FREE
Search
Go to advanced search...
Free

The Efficiency of Clustering Algorithms for Mining Large Data Bases - Coursework Example

Cite this document
Summary
The paper "The Efficiency of Clustering Algorithms for Mining Large Data Bases" highlights that using Pro-PAM algorithm based on partitioning clustering techniques in the place of alignment methods in large data sets, increases efficiency and reduces execution time significantly…
Download full paper File format: .doc, available for editing
GRAB THE BEST PAPER92.2% of users find it useful
The Efficiency of Clustering Algorithms for Mining Large Data Bases
Read Text Preview

Extract of sample "The Efficiency of Clustering Algorithms for Mining Large Data Bases"

EFFICIENCY OF CLUSTERING ALGORITHMS FOR MINING LARGE DATA BASES By Presented to Introduction In the field of protein comparison and sequence clustering, there are very few instances in which partitioning techniques have been used. For instance, Guralnik & Karypis (2001) proposed an algorithm or sequencing method-on the basis of the standard k-means-in which proteins are represented using vectors. This shortcoming of this method, however, is that there are no databases or tools readily available to the public. Therefore, this method has not been practised and as such is only a theoretical proposition. However, there are other methods such as JACOP, which also utilize the partitioning algorithm and are available to the public. The problem is, these methods cannot be used for comparing and clustering data sets that have been provided by users (Fayech et al., 2009). This study focuses on evaluating the effeciency of various types of sequencing data mining algorithms with respect to protein sequence data sets, and on the basis of their shortcomings, design and develop an efficient clustering algorithm on the basis of the partitioning method. The rationale here is that, partitioning techniques or methods have not been exploited in the field of protein clustering sequence for comparing and clustering large data sets. As a matter as fact, there are millions of protein sequences available (Mount, 2002). Additionally, as earlier noted, alignment methods are extremely expensive and inefficient when it comes to clustering of large protein data sets. Algorithms Performance and Efficiency Evaluation To meet the objectives of this study, the algorithms based on partitioning techniques proposed by Fayech et al. (2009) are implemented on the protein sequence data sets, to come up with C clusters from n protein sequences data set D, such that the objective function f(X) is optimized. These algorithms proposed by Fayech et al. (2009) are the Pro-LEADER, Pro-CLARANS, Pro-Kmeans and Pro-CLARA. These algorithms’ performance and efficiency are measured and evaluated; then, their results are compared to the results of newly proposed algorithm. This complements their shortcomings and inefficiencies. The function f(X) that is used to examine and evaluate the efficiency, quality and performance of the algorithms is stated as (Fayech et al., 2009) (X) = ………….. (i) Rj in this scenario represents the centroid of the class j in which the object Oi. On the other hand, from the equation, the alignment score of Rj and Oi –protein sequences-is calculated using the equation stated below Score(X, Y) = …(ii) In this case, S(Xj, Yi) represents amino acid substitution score by Xj by Yi on the basis of g(n) and scoring matrix as a total cost for the gap length n defined as …………(iii) The Pro-Kmeans algorithm This algorithm begins by randomly partitioning of the protein sequence data set D into patterns. It utilizes the local Smith Waterman algorithm in comparing the proteins of each pattern and computes the each protein Sum Score. In every cluster Sj, the sequence Oi, is viewed as the cluster’s centroid Rj. With the help of the local alignment algorithm, of the protein data set D, each protein Oh is compared the cluster’s centroid. This is in order to assign Oh, where the similarity score of Rj is maximum, to the nearest pattern or cluster. This procedure is repeated q times so as to optimize the f(X). For this algorithm, iterations, q, and K, the number of clusters are the inputs. Consequently, the outputs are the mean and the best partition of the data set D, of each cluster. The Pro-LEADER algorithm This is algorithm is incremental and considers every first sequence of the protein data set as the leader. Just as the Pro-Kmeans algorithm, Pro-LEADER algorithm also utilizes the local alignment algorithm proposed by Smith and Waterman for computing the similarity score with all leaders the sequence in D (Fayech et al., 2009). It detects, to every sequence Oi, the nearest leader Rj. This is then compared to the score with a predefined threshold. In the event that, the similarity score of Oi and Ri is more than the predetermined threshold, Oi then becomes the new leader sequence. In the event that it is not, the cluster that the leader Rj defines is assigned to the Oi sequence. This algorithm aims at optimizing the f(X) function. In this algorithm, the input is represented by the similarity score threshold. On the other hand, the outputs are the K leaders and the best partition of the data set D, of the clusters obtained (Fayech et al., 2009). The Pro-CLARANS algorithm This algorithm begins with the node C, which is arbitrary in the graph C = [A1, A2…Ak] that represents the first set of medoids. It randomly selects C’, a neighbor of C by a difference of one sequence only (Essoussi & Fayech, 2007). These selections proceeds to the next neighbor as long as its total score is higher than that of the node at hand. It otherwise continues the random checks until a neighbor with a higher score is found, or the threshold number of maximal neighbors has been reached. In this case, the predetermined maximal number neighbors should be at least 250. This algorithm, as opposed to the other previously discussed, aims at optimizing the total score. In this case, the similarity score is also computed using the local alignment method. It is then assigns the score to the nearest cluster (Ng & Han, 2002). The Pro-CLARANS algorithm iterates the process of clustering for predetermined q times. The final clustering is selected as the medoids set with the optimal f(X). The repetitions, q, and K, the number of clusters are the inputs for this algorithm. On the other hand, the outputs are the K medoids and the best partition of the data set D, of the clusters obtained. The Pro-CLARA algorithm This algorithm depends on the sampling approach in handling very large data sets (Essoussi & Fayech, 2007). As opposed to the Pro-CLARANS algorithm, this algorithm creates small samples of 40 + 2K from D, the data set. It utilizes the local alignment method, Smith Waterman algorithm and the maximal set of medoids to obtain the comparison between all the medoids and Oh, each protein of D; the data set and to assign the Oh sequence to the nearest cluster (Bolten et al., 2001). The clustering and sampling process is repeated predetermined number times, q, so as to ensure that there is no bias in sampling. A final clustering result is then selected from the medoids set with the optimal f(X). For this algorithm, the repetitions, q, and K, the number of classes are the inputs. On the other hand, the outputs are the K medoids and the best partition of the data set D, of the clusters obtained. Efficiency and Performance measure A training data set or large protein data set is used for evaluating the performance and efficiency of the partitioning algorithms described in the preceding sections. K clusters from the training phase defined by a leader or a centroid (medoid) are obtained from the training phase whose results are treated as test data set. The sequence of each protein on the data set D is compared with all medoids using the local alignment algorithm, Smith and Waterman, and each sequence is assigned the neighboring cluster. Specificity and sensitivity of the previously discussed algorithms are calculated using the results from the test phase. In this scenario, sensitivity refers to the probability of predicting a classifier correctly, whereas specificity refers to the probability of the correctness of the prediction. They are defined by And In this case, FN represents False Negatives and is the number of unidentified true homologues pairs; TP represents True Positives, which are the true homologous pairs that were identified correctly, and FP stands for False Positives, which are the non-homologues pairs considered as homologues. Pairs in the same category or family group are considered as homologues. The Protein sequence data used The protein sequence data set used was of families whose subgroups or subfamilies- HLA protein sequence family-are known. Eight Hundred and Ninety Three (93) sequences were randomly selected from the set and labeled as DS1. This data set was later grouped into 12 clusters. Additionally, protein sequences from the Hydrolases family were also collected. This family’s sequences are grouped into eight clusters based on their functionalities. It is further subdivided into 3,737 sequences and labeled DS2. Proteins from the Globin’s family sequences are collected as well and randomly grouped into 8 categories and 292 sequences and labeled DS3. Therefore, there are 28 different clusters of sequences as experts have classified them. The data set in use has 4,922 sequences in total and is labeled DS4. Out of these, about 70%, over 3,500 sequences as set for training and the remaining 30% (1,422) set aside for the testing phase. Discussion For the Pro-CLARA, Pro-CLARANS, and the Pro-Kmeans algorithms, q, the number of repetitions is set to five (Ng & Han, 2002), whereas K, the number of clusters is fixed to 28. From the study, it is evident that the best sequencing and clustering outcomes are obtained when the number of clusters K is fixed to 28 (Wang et al., 2012). Consequently, for the Pro-Leader method, predetermined brink is set at 350. It is clear that, Pro-CLARA, Pro-CLARANS, Pro-LEADER, and the Pro-Kmeans algorithms have significantly improved specificity and sensitivity of hierarchical and graph-based methods. However, all these algorithms produced the lowest probability on DS1-4 of predicting sensitivity correctly. Only Pro-CLARA give the highest probability that the prediction provided is correct. Pro-LEADER algorithm is not efficient when it comes to handling large data sets as shown on the DS4 data set. To complement these shortcomings, this study proposes the Pro-PAM algorithm to generate the maximal set of medoids. This algorithm randomly selects K sequences as clusters randomly from the data sets D. It also utilizes the local alignment algorithm-Smith Waterman-in the computation of the TSih, total score of every pair of the sequences selected and those that are not selected. It selects the optimal total. For this algorithm, the repetitions, q, and K, the number of clusters are the inputs. On the other hand, the outputs are the K medoids and the best partition of the data set D, of the clusters obtained. The pseudo code for the proposed algorithm is Input: A sample S of the training set D; S = {Oh} h=1…m; m is the size of S 1. Select K objects randomly from S: Ri (i ∈[i…K]); 2. For every pair of non-selected object Oh in S and selected object Ri do Calculate the total score TSih; 3. Select the maximal TSih: MaxTSih, and mark the corresponding objects Ri and Oh; 4. If MaxTSih > 0 then Ri = Oh; Go back to Step 2 Else For each Oh ∈ S do Compute the similarity score of Oh with each centroid Ri (i ∈[i…K]), using Smith Waterman algorithm Assign Oh to the cluster with the nearest Ri; End Output: BestSets; BestSets refers to the best partition of S into K cluster; with each cluster defined by medoids Ri Conclusion Proteins that have similar sequences possess similar 3D constructs and the same biochemical use. Sequences are said to be homologous when they are from different organisms but have common ancestors. Clustering of protein sequences using Pro-CLARA, Pro-CLARANS, Pro-LEADER, and the Pro-Kmeans algorithms ensures new classification sequences, prediction of protein structures from sequences that are not known. It is clear that, using Pro-PAM algorithm based on partitioning clustering techniques in the place of alignment methods in large data sets, increases efficiency and reduces execution time significantly. Additionally, it improves both the clustering specificity and sensitivity, and endeavor to reduce computation time. References Berkhin, P., 2003. Survey of Clustering Data Mining Techniques. San Jose: Accrue Software, Inc. Bolten, E. et al., 2001. Clustering protein sequences-structure prediction by transitive homology. Bioinformatics, 17(10), pp.935-41. Cai, L., Juedes, D. & Liakhovitch, E., 2000. Evolutionary Computation Techniques for multiple sequence alignment. In Congress on evolutionary computation 2000., 2000. Enright, A. & Ouzounis, C., 2000. GeneRAGE: A robust algorithm for sequence clustering and domain detection. Bioinformatics, 16(5), pp.451-57. Enright, A., Van Dongen, S. & Ouzounis, C., 2002. An efficient algorithm for large-scale detection of protein familes. Nucleic Acids Res, 30(7), pp.1575-84. Essoussi, N. & Fayech, S., 2007. A comparison of four pair-wise sequence alignment methods. Bioinformation, 2, pp.166-68. Fayech, S., Essoussi, N. & Limam, M., 2009. Partitioning clustering algorithms for protein sequence data sets. BioData Mining, 2(3). Galperin, M.Y. & Koonin, E.V., 2001. Comparative Genome Analysis, In Bioinformatics- A Practical Guide to the Analysis of Genes and Proteins. 2nd ed. New York: Wiley-Interscience. Guralnik, V. & Karypis, G., 2001. A scalable algorithm for clustering sequential data. In SIGKDD Workshop on Bioinformatics., 2001. BIOKDD. Han, J. & Kamber, M., 2000. Data Mining: Concepts and Technique. San Francisco: Morgan kaufamnn. Maxwell, E.K., 2010. Graph Mining Algorithms for Memory Leak Diagnosis and Biological Database Clustering. Thesis. Blacksburg: Virginia Polytechnic Institute and State University. Mitra, S. & Acharya, T., 2003. DataMining: Multimedia, Soft Computing and Bioinformatics. New York: John Wiley. Mount, D., 2002. Bioinformatics – Sequence and Genome Analysis. New York: Cold Spring Harbor Laboratory Press. Ng, R.T. & Han, J., 2002. CLARANS: A Method for Clustering Objects for Spatisl Data Mining. Research. Vancouver. Raut, S.A. & Sathe, S.R., 2011. A Modified Fastmap K-Means Clustering Algorithm for Large Scale Gene Expression Datasets. International Journal of Bioscience, Biochemistry and Bioinformatics, 4(1), pp.292-96. Raut, S.A., Sathe, S.R. & Raut, A.P., 2010. Gene Expression Analysis-A Review for large datasets. Journal of Computer Science and Engineering, 4(1). S.A.Raut, Sathe, S.R. & Raut, A., 2010. Bioinformatics: Trends in Gene Expression Analysis. In Proceedings of 2010 International Conference On Bioinformatics and Biomedical Technology. Chengdu, 2010. Sasson, O., Linial, N. & Linial, M., 2002. The metric space of proteins-comparative study of clustering algorithms. Bioinformatics, 18(1), pp.14-21. Sperisen, P. & Pagni, M., 2005. JACOP: a simple and robust method for the automated classification of protein sequences with modular architecture. BMC Bioinformatics, 6(216). Wang, H., Yu, Y., Wang, Q. & Wan, Y., 2012. A density-based clustering structure mining algorithm for data streams. In BigMine 12 Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications. New York, 2012. ACM. Read More
Cite this document
  • APA
  • MLA
  • CHICAGO
(The Efficiency of Clustering Algorithms for Mining Large Data Bases Coursework Example | Topics and Well Written Essays - 1250 words, n.d.)
The Efficiency of Clustering Algorithms for Mining Large Data Bases Coursework Example | Topics and Well Written Essays - 1250 words. https://studentshare.org/information-technology/1819891-data-mining
(The Efficiency of Clustering Algorithms for Mining Large Data Bases Coursework Example | Topics and Well Written Essays - 1250 Words)
The Efficiency of Clustering Algorithms for Mining Large Data Bases Coursework Example | Topics and Well Written Essays - 1250 Words. https://studentshare.org/information-technology/1819891-data-mining.
“The Efficiency of Clustering Algorithms for Mining Large Data Bases Coursework Example | Topics and Well Written Essays - 1250 Words”. https://studentshare.org/information-technology/1819891-data-mining.
  • Cited: 0 times

CHECK THESE SAMPLES OF The Efficiency of Clustering Algorithms for Mining Large Data Bases

Data Mining Techniques

The issues selected for mining in the frequent flier program are first to identify customer characteristics, their most frequent flying zone, class, and period of the year in which they usually fly.... In the essay “data Mining” the author analyzes different techniques to store large amounts of data inaccessible form.... This is usually referred to as a data warehouse.... A data warehouse receives data in raw form from operational systems and stores them....
2 Pages (500 words) Essay

High Level ETL and Data Mining Requirements

The prices of ETL High Level ETL and data Mining Requirements Introduction A data Mining and ETL methodologies seek to organize the pattern discovery process in the data warehouse of an organization.... igh Level ETL ETL is the process in database usage especially data warehousing that involves the following activities:Extracting data from external sources.... anipulating data to fit the operational needs....
2 Pages (500 words) Research Paper

Artificial Intelligence: Machine Learning Algorithms

The author describes the datasets used for experiments on the class prediction problem and the arguments for using a neighborhood analysis approach for AML-ALL class prediction… Class discovery involves categorizing data into an arbitrary number of groups.... Mapping a complete set of genes is impractical both in laboratory testing (obtain the expression of the gene) as well as data analysis.... ince an individual strand of DNA cannot be isolated from a person, data for genes represent the mean values for the samples taken from a person....
8 Pages (2000 words) Assignment

Data Mining: the Personalization of the Organizations Business Processes

ata Miners have always deployed Data Mining algorithms for mining data.... The diversity of data and the nature of the domain in which mining is being performed are also key issues that affect the choice of methodology opted for mining data.... The paper "data Mining: the Personalization of the Organization's Business Processes" presents an organization's functioning.... It is dependent upon the mining of data done by data miners....
5 Pages (1250 words) Essay

Mining of Large Data Volumes With the Aim of Advancing Business Operations

Data janitorial work can be reduced by the development of software that automates the collection, cleaning and consolidation of the plentiful The paper "Mining of large data Volumes With the Aim of Advancing Business Operations " is a wonderful example of an assignment on business.... The article from the WSJ, Getting Started in Big Data, Feb 3, 2014 -Communication skills are very crucial in the mining of large data volumes with the aim of advancing business operations....
2 Pages (500 words) Assignment

Take the Data out of Dating in the Internet System

om, and OkCupid use algorithms to pair coupe bases on answers given to predetermined questions.... This essay "Take the data out of Dating in the Internet System" describes love challenging in Internet social sites.... However, even the algorithm software find prognosticating love challenging because the data mining software requires much tuning to create a perfect march.... People should be aware of the perversity that comes with algorithms....
7 Pages (1750 words) Essay

Human Motion Detection based on Background Subtraction Techniques

It is a technology that uses the difference of the current image and the background image to detect the motion region, and it is generally able to provide data included object information .... The author of this coursework "Human Motion Detection based on Background Subtraction Techniques" provides a review of the human motion detection methods focusing on background subtraction technique....
6 Pages (1500 words) Coursework

Algorithmic: the Spirit of Computing

A number of graph algorithms call for processing vertices and edges of a graph systematically.... The two principal types of algorithms that do such traversals are the depth-first search (DFS) and the breath-first search (BFS).... This assignment "Algorithmic: the Spirit of Computing" discusses Algorithmic as a branch of science but according to David Harel, it is “much more than that and is instead the core of computer sciences” (Harel, 1992)....
1 Pages (250 words) Assignment
sponsored ads
We use cookies to create the best experience for you. Keep on browsing if you are OK with that, or find out how to manage cookies.
Contact Us