Efficiency of Clustering Algorithms for Mining Large Biological Data Bases - Research Paper Example

Summary
This paper, 'Efficiency of Clustering Algorithms for Mining Large Biological Data Bases', argues that using the Pro-PAM algorithm, based on partitioning clustering techniques, in place of alignment methods on large data sets increases efficiency and reduces execution time significantly.

EFFICIENCY OF CLUSTERING ALGORITHMS FOR MINING LARGE BIOLOGICAL DATA BASES

With the enormous number of gene sequences resulting from genome expression, and with data classification into functional families or classes largely uncontrolled, clustering of such large data sets has become a serious problem in functional and structural genomics. The accurate classification of sequences into classes has recently become a necessity, and as a result, computer programs have been developed to assist in automatic clustering. Various clustering algorithms have addressed gene sequence clustering; they fall into partitioning, hierarchical and graph-based techniques. The most widely used of the three are the graph-based and hierarchical techniques. Although partitioning techniques are used in other disciplines, they are rarely applied to gene sequence clustering, and as such there is no substantial evidence of whether partitioning methods are efficient there. This study analyzes four clustering algorithms on four large protein sequence data sets. The analysis highlights the weaknesses and shortcomings of the four and proposes a new algorithm that addresses them.

Introduction

Today there are more than one million protein sequences (Sasson et al., 2002), and as such there is a need in bioinformatics to identify meaningful patterns in order to understand their functions. For a long time, protein and gene sequences have been analyzed, compared and grouped using alignment methods. According to Cai et al. (2000), alignment methods are algorithms constructed to arrange RNA, DNA and protein sequences so as to detect similarities that may result from evolutionary, functional or structural sequence relationships. Mount (2002) asserts that comparing and clustering sequences is done using pair-wise alignment methods, which are of two types, global and local.
The local alignment algorithm proposed by Smith and Waterman (Bolten et al., 2001) is used to identify amino acid patterns that have been conserved in protein sequences, while the global alignment algorithm proposed by Needleman and Wunsch (Bolten et al., 2001) attempts to align the characters of the entire sequences. It is clear from the above that the pair-wise alignment method is expensive when it comes to comparing and clustering a large protein data set: a very large number of comparisons is performed during computation, since every single protein in a data set is compared to all the other proteins in the set (Bolten et al., 2001). This brings into question the efficiency of pair-wise alignment methods for comparing and clustering large protein data sets. Pair-wise alignment methods, both local and global, do not take into consideration the size of the data set, and very large data sets may overwhelm the computer memory. Han & Kamber (2000) argue that unsupervised learning aims at identifying, from a data set, a sensible partition or a natural pattern with the help of a distance function. Biology and the life sciences have extensively exploited clustering techniques in sequence analysis to classify similar sequences into protein or gene families (Galperin & Koonin, 2001). Currently, protein sequences can be classified into similar patterns using various readily available sequencing and clustering methods. As mentioned earlier, these methods can be grouped into graph-based, partitioning and hierarchical methods. These methods, especially the graph-based and hierarchical ones, have been used consecutively or together to complement each other, as argued by Sasson et al. (2002), Sperisen & Pagni (2005), Essoussi & Fayech (2007) and Enright & Ouzounis (2000). In the field of protein comparison and sequence clustering, there are very few instances in which partitioning techniques have been used.
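The pair-wise comparisons discussed above rest on the Smith-Waterman local alignment score. The following is a minimal sketch of that score, assuming a simple match/mismatch scheme and a linear gap penalty in place of the amino-acid substitution matrix an actual implementation would use:

```python
# Minimal Smith-Waterman local alignment score (illustrative sketch only).
# Real protein comparisons would use a substitution matrix such as BLOSUM
# and affine gap costs; here match/mismatch/gap are simple constants.

def smith_waterman_score(x, y, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences x and y."""
    rows, cols = len(x) + 1, len(y) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,                      # local alignment floor
                          H[i - 1][j - 1] + s,    # match / mismatch
                          H[i - 1][j] + gap,      # gap in y
                          H[i][j - 1] + gap)      # gap in x
            best = max(best, H[i][j])
    return best
```

Each single comparison already costs time proportional to the product of the two sequence lengths, which is why performing it for every pair in a data set of millions of sequences becomes prohibitive.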
For instance, Guralnik & Karypis (2001) proposed a sequencing method, based on the standard k-means, in which proteins are represented as vectors. The shortcoming of this method, however, is that no databases or tools implementing it are readily available to the public; it has therefore not been put into practice and remains a theoretical proposition. There are other methods, such as JACOP, which also utilize the partitioning approach and are available to the public, but these methods cannot be used to compare and cluster data sets provided by users (Fayech et al., 2009). This study focuses on evaluating the efficiency of various types of sequence data mining algorithms on protein sequence data sets and, on the basis of their shortcomings, on designing and developing an efficient clustering algorithm based on the partitioning method. The rationale is that partitioning techniques have not been exploited in protein sequence clustering for comparing and clustering large data sets, even though there are millions of protein sequences available (Mount, 2002). Additionally, as noted earlier, alignment methods are extremely expensive and inefficient when it comes to clustering large protein data sets.

Algorithms Performance and Efficiency Evaluation

To meet the objectives of this study, the algorithms based on partitioning techniques proposed by Fayech et al. (2009) are applied to the protein sequence data sets, to produce C clusters from a data set D of n protein sequences such that the objective function f(X) is optimized. These algorithms are Pro-Kmeans, Pro-LEADER, Pro-CLARA and Pro-CLARANS. Their performance and efficiency are measured and evaluated, and their results are compared with those of the newly proposed algorithm, which addresses their shortcomings and inefficiencies.
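The setup just described, partitioning D into clusters so that an alignment-based objective is optimized, can be sketched schematically as follows. The `score` argument stands in for a Smith-Waterman-style similarity and is an assumption of this sketch, not code from the study:

```python
# Schematic view of the clustering objective: f(X) sums, over all
# sequences, each sequence's alignment score against the centroid of the
# cluster it belongs to. `score(a, b)` is a placeholder similarity.

def objective(clusters, centroids, score):
    """clusters: {cluster_id: [sequences]}; centroids: {cluster_id: sequence}."""
    return sum(score(seq, centroids[cid])
               for cid, members in clusters.items()
               for seq in members)

def assign(sequences, centroids, score):
    """Assign each sequence to the centroid with the highest similarity."""
    clusters = {cid: [] for cid in centroids}
    for seq in sequences:
        best = max(centroids, key=lambda cid: score(seq, centroids[cid]))
        clusters[best].append(seq)
    return clusters
```

All four algorithms below share this structure; they differ in how the centroids (or leaders, or medoids) are chosen and updated.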
The function f(X) used to examine and evaluate the efficiency, quality and performance of the algorithms is stated as (Fayech et al., 2009)

f(X) = Σ_{i=1..n} Score(Oi, Rj) .......... (i)

where Rj is the centroid of the class j to which the object Oi belongs. The alignment score of the protein sequences Oi and Rj is calculated using the equation below:

Score(X, Y) = Σ_j S(Xj, Yj) − Σ_gaps g(n) .......... (ii)

where S(Xj, Yj) is the amino acid substitution score of Xj by Yj taken from the scoring matrix, and g(n), the total cost for a gap of length n, takes the standard affine form

g(n) = o + e × n .......... (iii)

with o the gap-opening cost and e the gap-extension cost.

The Pro-Kmeans algorithm

This algorithm begins by randomly partitioning the protein sequence data set D into K clusters. It uses the Smith-Waterman local algorithm to compare the proteins of each cluster and computes each protein's SumScore; in every cluster Sj, the sequence Oi with the maximal SumScore is taken as the cluster's centroid Rj. With the help of the local alignment algorithm, each protein Oh of the data set D is then compared with the cluster centroids, in order to assign Oh to the cluster whose centroid Rj yields the maximum similarity score. This procedure is repeated q times so as to optimize f(X). The inputs of this algorithm are q, the number of iterations, and K, the number of clusters; the outputs are the centroid of each cluster and the best partition of the data set D.

The Pro-LEADER algorithm

This algorithm is incremental and takes the first sequence of the protein data set as the first leader. Like Pro-Kmeans, Pro-LEADER utilizes the local alignment algorithm of Smith and Waterman, computing the similarity score of each sequence in D with all leaders (Fayech et al., 2009). For every sequence Oi it identifies the nearest leader Rj, and the corresponding score is then compared with a predefined threshold.
If the similarity score of Oi and Rj exceeds the predetermined threshold, Oi is assigned to the cluster defined by the leader Rj; otherwise, Oi becomes a new leader. This algorithm also aims at optimizing the function f(X). Its input is the similarity score threshold; its outputs are the K leaders and the best partition of the data set D into the clusters obtained (Fayech et al., 2009).

The Pro-CLARANS algorithm

This algorithm begins with an arbitrary node C = [A1, A2, ..., AK] in the graph, which represents the first set of medoids. It randomly selects C', a neighbor of C differing by one sequence only (Essoussi & Fayech, 2007). The search moves to a neighbor whenever its total score is higher than that of the node at hand; otherwise it continues the random checks until a neighbor with a higher score is found or the maximal number of neighbors has been reached. The predetermined maximal number of neighbors should be at least 250. This algorithm, in contrast to those previously discussed, aims at optimizing the total score. The similarity score is again computed using the local alignment method, and each sequence is assigned to the nearest cluster (Ng & Han, 2002). Pro-CLARANS iterates the clustering process a predetermined q times, and the final clustering is the medoid set with the optimal f(X). The inputs are q, the number of repetitions, and K, the number of clusters; the outputs are the K medoids and the best partition of the data set D into the clusters obtained.

The Pro-CLARA algorithm

This algorithm relies on a sampling approach to handle very large data sets (Essoussi & Fayech, 2007). Unlike Pro-CLARANS, it draws small samples of size 40 + 2K from the data set D.
It then uses the Smith-Waterman local alignment algorithm and the maximal set of medoids obtained from the sample to compare each protein Oh of the data set D with all the medoids and to assign Oh to the nearest cluster (Bolten et al., 2001). The sampling and clustering process is repeated a predetermined number of times, q, so as to avoid sampling bias. The final clustering result is the medoid set with the optimal f(X). The inputs of this algorithm are q, the number of repetitions, and K, the number of classes; the outputs are the K medoids and the best partition of the data set D into the clusters obtained.

Efficiency and Performance Measure

A large protein data set is used to evaluate the performance and efficiency of the partitioning algorithms described in the preceding sections. The training phase yields K clusters, each defined by a leader or a centroid (medoid), and the results are then applied to a test data set: each protein sequence of the test set is compared with all medoids using the Smith-Waterman local alignment algorithm and assigned to the nearest cluster. The specificity and sensitivity of the algorithms are calculated from the results of the test phase. Here, sensitivity refers to the probability of correctly identifying a true homologue pair, whereas specificity refers to the probability that a predicted pair is correct. They are defined by

Sensitivity = TP / (TP + FN)

and

Specificity = TP / (TP + FP)

where FN (False Negatives) is the number of unidentified true homologue pairs, TP (True Positives) is the number of true homologue pairs identified correctly, and FP (False Positives) is the number of non-homologue pairs considered as homologues. Pairs in the same category or family group are considered homologues.
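The pair-based counts above can be computed directly from the true family labels and the predicted cluster labels; a short sketch:

```python
# Pair-based evaluation as described above: pairs in the same true family
# are homologues; pairs placed in the same predicted cluster are predicted
# homologues. (Illustrative sketch with dict-based labels.)
from itertools import combinations

def pair_counts(true_labels, pred_labels):
    """true_labels, pred_labels: dicts mapping sequence id -> label."""
    tp = fp = fn = 0
    for a, b in combinations(sorted(true_labels), 2):
        same_true = true_labels[a] == true_labels[b]
        same_pred = pred_labels[a] == pred_labels[b]
        if same_true and same_pred:
            tp += 1                 # homologue pair found
        elif same_pred and not same_true:
            fp += 1                 # pair wrongly called homologous
        elif same_true and not same_pred:
            fn += 1                 # homologue pair missed
    return tp, fp, fn

def sensitivity(tp, fn):
    return tp / (tp + fn)          # fraction of true homologue pairs found

def specificity(tp, fp):
    return tp / (tp + fp)          # fraction of predicted pairs that are correct
```

Note that the number of pairs grows quadratically with the data set size, so these counts are normally accumulated per family rather than materialized as explicit pair lists.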
The Protein Sequence Data Used

The protein sequence data sets used are from families whose subgroups or subfamilies are known. Eight hundred and ninety-three (893) sequences were randomly selected from the HLA protein sequence family and labeled DS1; this data set is grouped into 12 clusters. Additionally, 3,737 protein sequences from the Hydrolases family, grouped into eight clusters based on their functionalities, were collected and labeled DS2. From the Globins family, 292 sequences grouped into 8 categories were collected and labeled DS3. There are therefore 28 different clusters of sequences, as classified by experts, and the combined data set of 4,922 sequences is labeled DS4. Of these, 3,500 sequences (about 70%) are set aside for training and the remaining 1,422 (30%) for the testing phase.

Discussion

For the Pro-Kmeans, Pro-CLARA and Pro-CLARANS algorithms, q, the number of repetitions, is set to five (Ng & Han, 2002), and K, the number of clusters, is fixed at 28. The study shows that the best sequencing and clustering outcomes are obtained when K is fixed at 28 (Wang et al., 2012). For the Pro-LEADER method, the predetermined threshold is set at 350. Pro-Kmeans, Pro-LEADER, Pro-CLARA and Pro-CLARANS significantly improve on the specificity and sensitivity of the hierarchical and graph-based methods. However, all of these algorithms produced low sensitivity on DS1 through DS4, and only Pro-CLARA gives a high probability that its predictions are correct. The Pro-LEADER algorithm is not efficient at handling large data sets, as shown on DS4. To address these shortcomings, this study proposes the Pro-PAM algorithm, which generates the maximal set of medoids.
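The proposed Pro-PAM, detailed next, builds on the classical PAM (k-medoids) swap idea. The following sketch illustrates that general heuristic only, not the study's exact procedure; `score` is a placeholder similarity (higher means more similar), where the paper would use a Smith-Waterman alignment score:

```python
# Schematic PAM-style (k-medoids) swap loop: repeatedly replace a medoid
# with a non-medoid object whenever the swap improves the total similarity
# of all objects to their nearest medoid. Illustration of the general
# heuristic only; `score` is a placeholder similarity function.
import random

def total_score(data, medoids, score):
    """Sum of each object's similarity to its nearest medoid."""
    return sum(max(score(o, m) for m in medoids) for o in data)

def pam_medoids(data, k, score, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(list(data), k)    # random initial medoids
    improved = True
    while improved:                        # hill-climb until no swap helps
        improved = False
        for i in range(k):
            for o in data:
                if o in medoids:
                    continue
                trial = medoids[:i] + [o] + medoids[i + 1:]
                if total_score(data, trial, score) > total_score(data, medoids, score):
                    medoids = trial        # accept the improving swap
                    improved = True
    return medoids
```

Each accepted swap strictly increases the total score, so the loop terminates; the cost per sweep is dominated by the alignment computations, which is what the sampling in Pro-CLARA and the randomized search in Pro-CLARANS are designed to contain.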
The Pro-PAM algorithm randomly selects K sequences from the data set D as initial medoids. It utilizes the Smith-Waterman local alignment algorithm to compute TSih, the total score for every pair of a selected sequence and a non-selected sequence, and selects the maximal total score. The inputs of this algorithm are q, the number of repetitions, and K, the number of clusters; the outputs are the K medoids and the best partition of the data set D into the clusters obtained. The pseudo code for the proposed algorithm is:

Input: a sample S of the training set D; S = {Oh}, h = 1...m, where m is the size of S
1. Select K objects randomly from S: Ri (i ∈ [1...K]);
2. For every pair of a non-selected object Oh in S and a selected object Ri, calculate the total score TSih;
3. Select the maximal TSih, MaxTSih, and mark the corresponding objects Ri and Oh;
4. If MaxTSih > 0 then
       Ri = Oh;
       go back to Step 2
   Else
       For each Oh ∈ S do
           Compute the similarity score of Oh with each centroid Ri (i ∈ [1...K]), using the Smith-Waterman algorithm;
           Assign Oh to the cluster with the nearest Ri;
Output: BestSets, the best partition of S into K clusters, with each cluster defined by its medoid Ri.

Conclusion

Proteins with similar sequences possess similar 3D structures and the same biochemical function. Sequences are said to be homologous when they come from different organisms but share a common ancestor. Clustering protein sequences with the Pro-Kmeans, Pro-LEADER, Pro-CLARA and Pro-CLARANS algorithms enables the classification of new sequences and the prediction of protein structures from unknown sequences. It is clear that using the Pro-PAM algorithm, based on partitioning clustering techniques, in place of alignment methods on large data sets increases efficiency and reduces execution time significantly. Additionally, it improves both clustering specificity and sensitivity.

References

Berkhin, P., 2003.
Survey of Clustering Data Mining Techniques. San Jose: Accrue Software, Inc.
Bolten, E. et al., 2001. Clustering protein sequences - structure prediction by transitive homology. Bioinformatics, 17(10), pp. 935-41.
Cai, L., Juedes, D. & Liakhovitch, E., 2000. Evolutionary computation techniques for multiple sequence alignment. In Congress on Evolutionary Computation 2000.
Enright, A. & Ouzounis, C., 2000. GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16(5), pp. 451-57.
Enright, A., Van Dongen, S. & Ouzounis, C., 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res, 30(7), pp. 1575-84.
Essoussi, N. & Fayech, S., 2007. A comparison of four pair-wise sequence alignment methods. Bioinformation, 2, pp. 166-68.
Fayech, S., Essoussi, N. & Limam, M., 2009. Partitioning clustering algorithms for protein sequence data sets. BioData Mining, 2(3).
Galperin, M.Y. & Koonin, E.V., 2001. Comparative genome analysis. In Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. 2nd ed. New York: Wiley-Interscience.
Guralnik, V. & Karypis, G., 2001. A scalable algorithm for clustering sequential data. In SIGKDD Workshop on Bioinformatics (BIOKDD), 2001.
Han, J. & Kamber, M., 2000. Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann.
Maxwell, E.K., 2010. Graph Mining Algorithms for Memory Leak Diagnosis and Biological Database Clustering. Thesis. Blacksburg: Virginia Polytechnic Institute and State University.
Mitra, S. & Acharya, T., 2003. Data Mining: Multimedia, Soft Computing and Bioinformatics. New York: John Wiley.
Mount, D., 2002. Bioinformatics: Sequence and Genome Analysis. New York: Cold Spring Harbor Laboratory Press.
Ng, R.T. & Han, J., 2002. CLARANS: a method for clustering objects for spatial data mining. Research report. Vancouver.
Raut, S.A. & Sathe, S.R., 2011. A modified Fastmap k-means clustering algorithm for large scale gene expression datasets. International Journal of Bioscience, Biochemistry and Bioinformatics, 4(1), pp. 292-96.
Raut, S.A., Sathe, S.R. & Raut, A.P., 2010. Gene expression analysis - a review for large datasets. Journal of Computer Science and Engineering, 4(1).
Raut, S.A., Sathe, S.R. & Raut, A., 2010. Bioinformatics: trends in gene expression analysis. In Proceedings of the 2010 International Conference on Bioinformatics and Biomedical Technology. Chengdu, 2010.
Sasson, O., Linial, N. & Linial, M., 2002. The metric space of proteins - comparative study of clustering algorithms. Bioinformatics, 18(1), pp. 14-21.
Sperisen, P. & Pagni, M., 2005. JACOP: a simple and robust method for the automated classification of protein sequences with modular architecture. BMC Bioinformatics, 6(216).
Wang, H., Yu, Y., Wang, Q. & Wan, Y., 2012. A density-based clustering structure mining algorithm for data streams. In BigMine '12: Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining. New York, 2012. ACM.
