Identifying Outliers in a Large Biological Data Base: Coursework Example

Efficiency of Data Mining Algorithms in Identifying Outliers/Noise in a Large Biological Data Base

Introduction

The number of known protein sequences in bioinformatics is estimated at more than half a million. This calls for meaningful partitions of the protein sequences so that the roles they play can be detected. Alignment methods were traditionally used to group and compare protein sequences. Local alignment algorithms were later introduced to replace the earlier methods and perform more complex functions. They were used to find amino acid patterns that are conserved in protein sequences (Clote and Backofen, 2000). The other type is the global alignment algorithm, which aligns entire protein sequences so as to match as many characters as possible. Their introduction was needed because it helped group large data sets into meaningful clusters.

The traditional methods aligned the protein data sets pairwise, which proved too expensive computationally for large comparisons. This made methods based on pairwise alignment inefficient for clustering large sets of protein data, because such approaches do not take into account that the data may be too large to fit in the computer's main memory (Smith, 1999). This situation called for a technique that would group the sequences naturally, or produce a meaningful partition, using a distance function. Several methods have been identified as appropriate for clustering protein sequences into families. These approaches fall into three main categories: graph-based, hierarchical, and partitioning approaches (Needleman and Wunsch, 1997).
Most of these approaches are built on graph-based or hierarchical techniques. Partitioning techniques are relatively few in the field of protein sequence clustering. As a result, no database or tool from this area is available to the scientific community. Most of the approaches in this category cannot be regarded as tools, because they cannot be applied to cluster user-provided data sets (Cabena, 1998). Most of these methods force users to consult classification results stored in a database for large, known data sets. This paper identifies efficient, algorithm-based clustering approaches. They are simple partitioning techniques suited to clustering large data sets. The main aims of these clustering algorithms are to produce meaningful partitions, to improve the quality of classification, and to reduce computation time. The algorithms are Pro-LEADER, Pro-Kmeans, Pro-CLARANS, and Pro-CLARA (Fayyad, 1996). They are used to partition protein sequence data sets into clusters.

Pro-Kmeans Algorithm

The Pro-Kmeans algorithm first partitions the data set randomly into clusters. It then uses the Smith Waterman algorithm to compare the proteins within each cluster and to compute each protein's SumScore. The sequence with the highest SumScore in a cluster is taken as that cluster's centroid (Tatusov, 2003). The Smith Waterman algorithm is then applied to compare each protein in the data set with the centroids found, and each object is assigned to the nearest cluster, that is, the cluster whose centroid gives the maximum similarity score with the object.
The algorithm repeats this process many times in order to maximize its objective function. The number of clusters is the input parameter, and the output is the best partition found for the whole data set (Sasson and Linial, 2002).

Pro-LEADER algorithm

The Pro-LEADER algorithm takes the first sequence in the data set as the first leader. It then uses the Smith Waterman algorithm to calculate the similarity score of every sequence in the data set with all leaders. For each sequence, it identifies the nearest leader and compares that leader's score with a pre-fixed threshold. If the score of the nearest leader is smaller than the threshold, the sequence is too dissimilar to every existing leader and becomes a new leader. Otherwise, the sequence is assigned to the cluster defined by that leader (Herger and Holm, 2001). Pro-LEADER is therefore an incremental algorithm in which each cluster is represented by a leader. The clusters obtained depend on choosing a suitable threshold value. The aim of the algorithm is to maximize its objective function. Its input parameter is the similarity-score threshold for treating an object as a new leader; its output is the best partition of the training data set together with the number of cluster leaders obtained. The algorithm is regarded as fast, since it requires only one pass through the data set (Enright and Ouzounis, 2000).

Pro-CLARA algorithm

This algorithm relies mainly on sampling to handle large data sets. Unlike most other approaches, which find the medoids of the entire data set, Pro-CLARA draws a small sample of the sequences in the data set.
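Before going further into Pro-CLARA, the single-pass Pro-LEADER procedure described above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the published implementation: `sw_score` is a simplified Smith Waterman score with hypothetical match, mismatch and linear gap values (real protein comparisons normally use a substitution matrix), and the sketch assumes that a higher score means greater similarity, so a sequence becomes a new leader when its best score falls below the threshold.

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith Waterman recurrence).

    The scoring values are illustrative assumptions, not taken from the text.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def leader_clustering(sequences, threshold, score=sw_score):
    """One pass over the data; returns (leaders, clusters).

    clusters[i] holds the members of the cluster led by leaders[i].
    """
    leaders, clusters = [], []
    for seq in sequences:
        if leaders:
            scores = [score(seq, leader) for leader in leaders]
            nearest = max(range(len(leaders)), key=scores.__getitem__)
            if scores[nearest] >= threshold:
                # Similar enough: join the nearest leader's cluster.
                clusters[nearest].append(seq)
                continue
        # First sequence, or too dissimilar to every leader: new leader.
        leaders.append(seq)
        clusters.append([seq])
    return leaders, clusters
```

Because each sequence is compared only with the current leaders, the result depends on the order of the input and on the chosen threshold.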
To obtain an optimal medoid set for a given data sample, Pro-CLARA applies the PAM algorithm to the protein sequences. In the Pro-PAM algorithm, a number of sequences are randomly selected from the data set to stand for the clusters, after which the Smith Waterman algorithm is used to compare the total score of every pair made up of a selected sequence and a non-selected sequence (Shi and Malik, 1997). Pro-CLARA then takes the optimal medoid set obtained with Pro-PAM and the Smith Waterman algorithm, compares every protein in the data set with all the medoids, and places each sequence in the nearest cluster. Pro-CLARA repeats the sampling and clustering process in order to reduce sampling bias. The number of repetitions is fixed in advance, and the final clustering result is the medoid set with the maximum function value. The input parameters are the number of clusters and the number of iterations; the output is the medoids of the clusters obtained (Bolten, 2001).

Pro-CLARANS algorithm

The Pro-CLARANS algorithm starts from an arbitrary node in a graph. This node represents an initial set of medoids. The algorithm then randomly picks a neighbor of the current node, that is, a node whose medoid set differs from it by exactly one sequence. If the total score of the selected neighbor is higher than that of the current node, the algorithm moves to this neighbor and continues the comparison and neighbor selection from there. Otherwise, it tries another neighbor until it finds a better one or reaches a pre-determined maximum number of neighbors.
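The neighbor search described above can be sketched in Python as follows. This is a minimal illustration of one search run, assuming a caller-supplied similarity function `sim` where the text uses the Smith Waterman score. The names `identity_sim`, `total_score` and `clarans_search` are hypothetical, and `identity_sim` is only a toy stand-in that counts matching positions.

```python
import random


def identity_sim(a, b):
    """Toy similarity (assumption): number of positions where a and b agree."""
    return sum(x == y for x, y in zip(a, b))


def total_score(medoids, sequences, sim):
    """Sum over all sequences of the similarity to their most similar medoid."""
    return sum(max(sim(s, m) for m in medoids) for s in sequences)


def clarans_search(sequences, k, sim, max_neighbors=250, rng=None):
    """One randomized local search over medoid sets; returns (medoids, score)."""
    rng = rng or random.Random()
    current = rng.sample(sequences, k)  # arbitrary starting node
    current_score = total_score(current, sequences, sim)
    tried = 0
    while tried < max_neighbors:
        candidates = [s for s in sequences if s not in current]
        if not candidates:
            break
        # A neighbor differs from the current node in exactly one medoid.
        neighbor = list(current)
        neighbor[rng.randrange(k)] = rng.choice(candidates)
        neighbor_score = total_score(neighbor, sequences, sim)
        if neighbor_score > current_score:
            # Better neighbor: move there and restart the neighbor count.
            current, current_score = neighbor, neighbor_score
            tried = 0
        else:
            tried += 1
    return current, current_score
```

A call such as `clarans_search(seqs, 2, identity_sim, rng=random.Random(0))` returns one locally good medoid set. Running the search several times and keeping the best-scoring result reduces the effect of an unlucky starting node.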
Here, the maximum number of neighbors is either a threshold value of about 250 or a value derived from the number of clusters and the sequences in the data set (Enright, 2002). The algorithm also uses Smith Waterman to compare each sequence in the data set with every medoid by similarity score and to assign it to the most similar cluster. Pro-CLARANS repeats the clustering process a pre-defined number of times and keeps as its final result the medoid set with the maximal function value (Van Dongen, 2000). The number of clusters is the algorithm's input parameter, and its output is the medoids of the clusters obtained.

How To Measure Performance

To evaluate the clustering algorithms above, a large training data set can be used. The clusters are obtained in a training phase, in which each cluster is defined by a leader or a centroid. The results of the training phase are then used to cluster a different data set, called the test data set. The Smith Waterman algorithm is used to compare each protein sequence in the test data set with all the medoids obtained in the training phase, and every sequence is assigned to the nearest cluster. The family predicted for each sequence is that of its nearest medoid (Essousi and Fayech, 2007). The results of the test phase are then used to compute the sensitivity and the specificity of each algorithm and to compare the algorithm with the published results of other clustering tools. Sensitivity here is the probability of correctly predicting a classifier, while specificity is the probability that a provided prediction is correct (Henikoff, 1999).
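As a small worked illustration of these two measures, the sketch below computes them for one protein family from lists of true and predicted family labels. It assumes the reading suggested by the definitions above: sensitivity = TP / (TP + FN), and specificity = TP / (TP + FP). The second quantity is elsewhere usually called precision, whereas the standard statistical specificity is TN / (TN + FP). The function name and the family labels are hypothetical.

```python
def sensitivity_specificity(true_families, predicted_families, family):
    """Return (sensitivity, specificity) for one family.

    Assumed definitions (see lead-in):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
    """
    pairs = list(zip(true_families, predicted_families))
    tp = sum(1 for t, p in pairs if t == family and p == family)
    fn = sum(1 for t, p in pairs if t == family and p != family)
    fp = sum(1 for t, p in pairs if t != family and p == family)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


# Hypothetical test-phase labels: four kinases and two globins.
true_labels = ["kinase", "kinase", "kinase", "kinase", "globin", "globin"]
predicted = ["kinase", "kinase", "globin", "globin", "globin", "globin"]

print(sensitivity_specificity(true_labels, predicted, "kinase"))  # (0.5, 1.0)
print(sensitivity_specificity(true_labels, predicted, "globin"))  # (1.0, 0.5)
```

In this toy case every predicted kinase really is a kinase, but only half of the kinases are found; the globin cluster finds every globin at the cost of absorbing two kinases.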
The results obtained from Pro-LEADER, Pro-Kmeans, Pro-CLARA and Pro-CLARANS confirm that the proposed partitioning methods are reliable and valuable tools for the automated functional clustering of protein sequences. Using them in place of the classic alignment methods previously used by biologists can improve clustering sensitivity and specificity and reduce computation time. Biologists can use these methods to cluster large protein data sets into meaningful clusters in order to detect protein functions (Guralnik and Karypis, 2001).

Conclusion

Similar protein sequences may share the same biochemical functions and three-dimensional structures. When two sequences are similar but come from different organisms, they may have a common ancestor and are then termed homologous. Clustering protein sequences with the Pro-LEADER, Pro-Kmeans, Pro-CLARA and Pro-CLARANS methods helps to classify new sequences, to find sets of similar sequences, and to predict the structure of unknown proteins. Classifying large protein sequence data sets with clustering techniques instead of alignment methods greatly reduces execution time and improves this important function in molecular biology (Kaufman and Rousseeuw, 1999).

References

Altschul, S. (1999). ‘Basic local alignment search tool.’ J Mol Biol, 215: 403-410
Bolten, E. et al. (2001). ‘Clustering protein sequences-structure prediction by transitive homology.’ Bioinformatics, 17(10): 935-41
Cabena, P. et al. (1998). Discovering Data Mining: From Concept to Implementation. New Jersey: Prentice Hall
Clote, P. & Backofen, R. (2000). Computational Molecular Biology. New York: John Wiley & Sons
Enright, A. & Ouzounis, C. (2000). ‘GeneRAGE: a robust algorithm for the sequence clustering and domain detection.’ Bioinformatics, 16(5): 451-7
Enright, A. (2002). ‘An efficient algorithm for large-scale detection of protein families.’ Nucleic Acids Res, 30(7): 1575-84
Essousi, N. & Fayech, S. (2007). ‘A comparison of four pair-wise sequence alignment methods.’ Bioinformation, 2: 166-168
Fayyad, M. (1996). ‘Data mining and knowledge discovery: Making sense out of data.’ IEEE Expert, 11: 20-25
Guralnik, V. & Karypis, G. (2001). ‘A scalable algorithm for clustering sequential data.’ SIGKDD Workshop on Bioinformatics
Henikoff, S. (1999). ‘Performance evaluation of amino acid substitution matrices.’ Proteins, 17: 49-61
Herger, A. & Holm, L. (2001). ‘Picasso: generating a covering set of protein family profiles.’ Bioinformatics, 17: 272-9
Kaplan, N. et al. (2005). ‘ProtoNet 4.0: a hierarchical classification of one million protein sequences.’ Nucleic Acids Res, 33: 216-8
Kaufman, L. & Rousseeuw, P. (1999). Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons
Needleman, S. & Wunsch, D. (1997). ‘A general method applicable to the search of similarities in the amino acid sequence of proteins.’ J Mol Biol, 48: 443-453
Sasson, O. & Linial, N. (2002). ‘The metric space of proteins-comparative study of clustering algorithms.’ Bioinformatics, 18: 14-21
Shi, J. & Malik, J. (1997). ‘Normalized cuts and image segmentation.’ Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 731-737
Smith, W. (1999). ‘Identification of common molecular subsequences.’ J Mol Biol, 147: 195-197
Tatusov, R. et al. (2003). ‘The COG database: an updated version includes eukaryotes.’ BMC Bioinformatics, 4: 41
Van Dongen, S. (2000). Graph clustering by flow simulation. PhD thesis. Amsterdam: University of Utrecht
Yona, G. & Linial, M. (2000). ‘ProtoMap: automatic classification of protein sequences and hierarchy families.’ Nucleic Acids Res, 28(1): 49-55
