Identifying Outliers in a Large Biological Data Base - Coursework Example

Summary
This coursework, "Identifying Outliers in a Large Biological Data Base", identifies efficient, algorithm-based approaches to clustering. These methods are simple partitioning techniques that are well suited to clustering large data sets…

Extract of sample "Identifying Outliers in a Large Biological Data Base"

Efficiency of Data Mining Algorithms in Identifying Outliers/Noise in a Large Biological Data Base

Introduction

The number of known protein sequences in bioinformatics is estimated at over half a million. This calls for meaningful partitions of the protein sequences so that the roles they play can be detected. Alignment methods were traditionally used to group and compare protein sequences. At a later stage, local alignment algorithms were introduced to replace the earlier methods and perform more complex functions. The local algorithms were used to find amino acid patterns that are conserved in protein sequences (Clote and Backofen, 2000). The other type was the global algorithm, which aligns the entire protein sequence by matching as many characters as possible. These algorithms were needed because they helped group large data sets into meaningful clusters.

The traditional methods aligned the protein data sets pairwise, which proved too expensive computationally for large comparisons. This made pairwise alignment methods inefficient for clustering large protein data sets, because such approaches do not take into account that the data may be too large to fit in the computer's main memory (Smith 1999).

This situation called for a technique that naturally groups the data, or produces a meaningful partition, using a distance function. Several methods have been identified as appropriate for clustering protein sequences into the desired families. These approaches fall into three main categories: graph-based, hierarchical and partitioning approaches (Needleman and Wunsch, 1997).
Most of these approaches are based on graph or hierarchical techniques. Partitioning techniques are relatively few in the field of protein sequence clustering. As a result, no database or tool from this area is available to the scientific community. Most approaches in this category cannot be regarded as tools, because they cannot be applied to cluster user-provided data sets (Cabena, 1998). Most of these methods force users to consult classification results stored in a database for large, known data sets.

This paper identifies efficient clustering approaches based on partitioning algorithms. These techniques are simple and well suited to clustering large data sets. The main aims of the identified clustering algorithms are to produce meaningful partitions, to improve the quality of classification and to reduce computation time. The identified algorithms are Pro-LEADER, Pro-Kmeans, Pro-CLARANS and Pro-CLARA (Fayyad, 1996). These methods are used to partition protein sequence data sets into clusters.

Pro-Kmeans Algorithm

The Pro-Kmeans algorithm first partitions the data set randomly into clusters. It then uses the Smith-Waterman algorithm to compare the proteins within each cluster and to compute each protein's SumScore. The sequence with the highest SumScore in a cluster is taken as that cluster's centroid (Tatusov, 2003). The Smith-Waterman algorithm is then applied again to compare each protein in the data set with the centroids found, and each protein is assigned to the nearest cluster, that is, the cluster whose centroid gives the maximum similarity score.
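
One round of the Pro-Kmeans steps described above can be sketched in Python. This is an illustration under stated assumptions, not the original implementation: `sw_score` is a toy Smith-Waterman scorer with made-up match, mismatch and gap values, SumScore is assumed to be the sum of a sequence's scores against the other members of its cluster, and all function names are hypothetical.

```python
import random


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Toy Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not a real
    amino acid substitution matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def random_partition(seqs, k, seed=0):
    """Randomly spread the sequences over k clusters (some may stay empty)."""
    rng = random.Random(seed)
    clusters = [[] for _ in range(k)]
    for s in seqs:
        clusters[rng.randrange(k)].append(s)
    return clusters


def pick_centroid(cluster):
    """Return the member with the highest SumScore (assumed definition:
    sum of its scores against every other member of the cluster)."""
    def sum_score(i):
        return sum(sw_score(cluster[i], t)
                   for j, t in enumerate(cluster) if j != i)
    return cluster[max(range(len(cluster)), key=sum_score)]


def pro_kmeans_round(clusters):
    """One round: choose a centroid per non-empty cluster, then reassign
    every sequence to the centroid it scores highest against."""
    centroids = [pick_centroid(c) for c in clusters if c]
    new_clusters = [[] for _ in centroids]
    for seq in (s for c in clusters for s in c):
        nearest = max(range(len(centroids)),
                      key=lambda k: sw_score(seq, centroids[k]))
        new_clusters[nearest].append(seq)
    return centroids, new_clusters


seqs = ["AAAAAA", "AAAAAT", "AAAATA", "GGGGGG", "GGGGGC", "GGGGCG"]
centroids, clusters = pro_kmeans_round(random_partition(seqs, 2, seed=1))
print(centroids, clusters)
```

Feeding the returned clusters back into `pro_kmeans_round` gives further rounds; how many rounds to run is left to the caller in this sketch.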
The algorithm repeats this process many times in order to maximize its objective function. The number of clusters is the input parameter, and the output is the best partition found for the whole data set (Sasson and Linial, 2002).

Pro-LEADER Algorithm

The Pro-LEADER algorithm selects the first sequence in the data set as the first leader. It then uses the Smith-Waterman algorithm to calculate the similarity score of every sequence in the data set against all leaders. For each sequence, the algorithm identifies the nearest leader and compares the score with a pre-fixed threshold. Because a higher score means greater similarity, a sequence whose score against its nearest leader is smaller than the threshold is not similar enough to any existing cluster and becomes a new leader. Otherwise, the sequence is assigned to the cluster defined by that leader (Herger and Holm, 2001). Pro-LEADER is therefore an incremental algorithm in which each cluster is represented by a leader. The clusters obtained depend on choosing a suitable threshold value. The aim of the algorithm is to maximize its objective function. Its input parameter is the similarity-score threshold for treating an object as a new leader, and its output is the best partition of the training data set together with the number of cluster leaders obtained. The algorithm is regarded as fast because it requires only one pass through the data set (Enright and Ouzounis, 2000).

Pro-CLARA Algorithm

This algorithm relies mainly on sampling to handle large data sets. Unlike most other approaches, which find the medoids of the entire data set, Pro-CLARA draws a small sample of sequences from the data set provided.
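
Before going further into Pro-CLARA, the single-pass Pro-LEADER procedure described earlier in this section can be sketched in Python. The sketch assumes that a higher score means greater similarity, so a sequence starts a new cluster only when its best score against the existing leaders falls below the threshold. The function names are hypothetical, and `identity_count` is a simple stand-in for a Smith-Waterman scorer, used only to keep the example short.

```python
def pro_leader(seqs, threshold, score):
    """Single-pass leader clustering on similarity scores.

    Returns (leaders, clusters), where clusters[i] holds the sequences
    assigned to leaders[i], the leader itself included.
    Assumption: higher score = more similar.
    """
    leaders, clusters = [], []
    for seq in seqs:
        if leaders:
            best = max(range(len(leaders)),
                       key=lambda i: score(seq, leaders[i]))
            if score(seq, leaders[best]) >= threshold:
                # Similar enough to the nearest leader: join its cluster.
                clusters[best].append(seq)
                continue
        # No leader yet, or none similar enough: seq becomes a new leader.
        leaders.append(seq)
        clusters.append([seq])
    return leaders, clusters


def identity_count(a, b):
    """Stand-in similarity: number of positions where the strings agree."""
    return sum(x == y for x, y in zip(a, b))


leaders, clusters = pro_leader(
    ["AAAA", "AAAT", "GGGG", "GGGC", "AATA"], threshold=3, score=identity_count
)
print(leaders)   # ['AAAA', 'GGGG']
print(clusters)  # [['AAAA', 'AAAT', 'AATA'], ['GGGG', 'GGGC']]
```

Any pairwise similarity function can be passed as `score`. The result depends on the order of the input, since the first sequence always becomes the first leader.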
To find an optimal set of medoids for each data sample, Pro-CLARA applies the PAM algorithm, adapted to protein sequences as Pro-PAM. In the Pro-PAM algorithm, a number of sequences are randomly selected from the data set to represent the clusters, after which the Smith-Waterman algorithm is used to compare the total score of every pair formed by a selected sequence and a non-selected sequence (Shi and Malik, 1997). Pro-CLARA then uses the optimal medoid set obtained with Pro-PAM and the Smith-Waterman algorithm to compare every protein in the data set with all the available medoids, and places each sequence in the nearest cluster. Pro-CLARA repeats the sampling and clustering process in order to reduce sampling bias. The number of repetitions is defined in advance, and the final clustering result is the medoid set with the maximum objective function. The input parameters are the number of clusters and the number of iterations, and the output is the medoids of the clusters obtained (Bolten, 2001).

Pro-CLARANS Algorithm

The Pro-CLARANS algorithm begins from an arbitrary node in a graph, where a node represents a set of medoids. The algorithm then randomly picks a neighbor of the current node, that is, a node that differs from it by exactly one sequence. If the sum of the scores of the selected neighbor is higher than that of the current node, the algorithm moves to this neighbor and continues the neighbor selection and comparison process from there. Otherwise, the algorithm tries another neighbor until it finds a better one or reaches a pre-determined maximum number of neighbors.
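
The neighbor search just described can be sketched in Python as follows. This is a hedged illustration, not the original code: the "sum of scores" of a node is assumed to be the sum, over all sequences, of the score against the best-matching medoid; resetting the neighbor counter after each move follows the usual CLARANS formulation rather than anything stated in the text; and `identity_count` is a stand-in for a Smith-Waterman scorer. All names are hypothetical.

```python
import random


def total_score(medoids, seqs, score):
    """Assumed objective: sum of each sequence's best score against a medoid."""
    return sum(max(score(s, m) for m in medoids) for s in seqs)


def clarans_search(seqs, k, score, max_neighbors=250, seed=0):
    """One local search from a random node (a list of k medoids)."""
    rng = random.Random(seed)
    current = rng.sample(seqs, k)
    current_score = total_score(current, seqs, score)
    tried = 0
    while tried < max_neighbors:
        candidates = [s for s in seqs if s not in current]
        if not candidates:
            break
        # A neighbor differs from the current node in exactly one medoid.
        neighbor = list(current)
        neighbor[rng.randrange(k)] = rng.choice(candidates)
        neighbor_score = total_score(neighbor, seqs, score)
        if neighbor_score > current_score:
            current, current_score = neighbor, neighbor_score
            tried = 0  # assumption: restart the count at the new node
        else:
            tried += 1
    return current, current_score


def identity_count(a, b):
    """Stand-in similarity: number of positions where the strings agree."""
    return sum(x == y for x, y in zip(a, b))


seqs = ["AAAA", "AAAT", "AATA", "GGGG", "GGGC", "GGCG"]
medoids, best = clarans_search(seqs, k=2, score=identity_count)
print(sorted(medoids), best)
```

Running the search several times with different seeds and keeping the medoid set with the highest score corresponds to repeating the clustering process a fixed number of times.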
In this case, the maximum number of neighbors is a threshold value of about 250, or is derived from the number of clusters and the number of sequences in the data set (Enright, 2002). The algorithm also uses the Smith-Waterman algorithm to compute the similarity score of each sequence in the data set against every medoid and to assign the sequence to the most similar cluster. Pro-CLARANS repeats the clustering process a pre-defined number of times and takes as its final result the medoid set with the maximum objective function (Van Dongen, 2000). The number of clusters is the algorithm's input parameter, and its output is the medoids of the clusters obtained.

How To Measure Performance

To evaluate the clustering algorithms described above, a large training data set can be used. Each cluster obtained in the training phase is defined by its leader, centroid or medoid. The results of the training phase are then used to cluster a different data set, the test data set. The Smith-Waterman algorithm is used to compare each protein sequence in the test data set with all the medoids obtained in the training phase, and every sequence is assigned to the nearest cluster. The family predicted for each sequence is that of its nearest medoid (Essousi and Fayech, 2007). The findings from the test phase are then used to compute the sensitivity and specificity of each algorithm and to compare the algorithms with the published results of existing clustering tools. Sensitivity here is the probability that a sequence is classified correctly, while specificity is the probability that a given prediction is correct (Henikoff, 1999).
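
As a worked illustration of the two measures, the following Python sketch computes them for one protein family from true and predicted labels. The formulas are an interpretation of the wording above and should be read as assumptions: sensitivity is taken as TP / (TP + FN), and specificity, "the probability that a prediction is correct", as TP / (TP + FP). The second quantity is more often called precision; the more common definition of specificity, TN / (TN + FP), is a different measure. The function name and the sample labels are hypothetical.

```python
def sensitivity_specificity(true_families, predicted_families, family):
    """Per-family measures, using the interpretation stated above.

    sensitivity = TP / (TP + FN)
    specificity = TP / (TP + FP)   (assumed reading; often called precision)
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_families, predicted_families):
        if pred == family and truth == family:
            tp += 1
        elif pred == family:
            fp += 1
        elif truth == family:
            fn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


# Made-up labels for five test sequences.
truth = ["kinase", "kinase", "kinase", "globin", "globin"]
pred = ["kinase", "kinase", "globin", "globin", "kinase"]
print(sensitivity_specificity(truth, pred, "kinase"))  # 2 of 3 in both cases
```

For the "kinase" family in this toy data there are two true positives, one false negative and one false positive, so both measures come out at 2/3.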
The results obtained from Pro-LEADER, Pro-Kmeans, Pro-CLARA and Pro-CLARANS confirm that the proposed partitioning methods are reliable and valuable tools for the automated functional clustering of protein sequences. Using these methods in place of the classic alignment methods previously used by biologists can improve clustering sensitivity and specificity and reduce computation time. Biologists can use the proposed methods to group large protein data sets into meaningful clusters in order to detect protein functions (Guralnik and Karypis, 2001).

Conclusion

Similar protein sequences may have the same biochemical functions and three-dimensional structures. When two sequences are similar but come from different organisms, they may share a common ancestor and are then termed homologous. Clustering protein sequences with the Pro-LEADER, Pro-Kmeans, Pro-CLARA and Pro-CLARANS methods helps to classify new sequences, to build sets of similar sequences and to predict the structure of unknown proteins. Classifying large protein sequence data sets with clustering techniques instead of alignment methods greatly reduces execution time and improves this important function in molecular biology (Kaufman and Rousseeuw, 1999).

References

Altschul, S. (1999). ‘Basic local alignment search tool.’ J Mol Biol, 215: 403-410
Bolten, E. et al. (2001). ‘Clustering protein sequences-structure prediction by transitive homology.’ Bioinformatics, 17(10): 935-41
Cabena, P. et al. (1998). Discovering Data Mining: From Concept to Implementation. New Jersey: Prentice Hall
Clote, P. & Backofen, R. (2000). Computational Molecular Biology. New York: John Wiley & Sons
Enright, A. & Ouzounis, C. (2000). ‘GeneRAGE: a robust algorithm for the sequence clustering and domain detection.’ Bioinformatics, 16(5): 451-7
Enright, A. (2002). ‘An efficient algorithm for large-scale detection of protein families.’ Nucleic Acids Res, 30(7): 1575-84
Essousi, N. & Fayech, S. (2007). ‘A comparison of four pair-wise sequence alignment methods.’ Bioinformation, 2: 166-168
Fayyad, M. (1996). ‘Data mining and knowledge discovery: Making sense out of data.’ IEEE Expert, 11: 20-25
Guralnik, V. & Karypis, G. (2001). ‘A scalable algorithm for clustering sequential data.’ SIGKDD Workshop on Bioinformatics
Henikoff, S. (1999). ‘Performances evaluation of amino acid substitution matrices.’ Proteins, 17: 49-61
Herger, A. & Holm, L. (2001). ‘Picasso: generating a covering set of protein family profiles.’ Bioinformatics, 17: 272-9
Kaplan, N. et al. (2005). ‘ProtoNet 4.0: a hierarchical classification of one million protein sequences.’ Nucleic Acids Res, 33: 216-8
Kaufman, L. & Rousseeuw, P. (1999). Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons
Needleman, S. & Wunsch, D. (1997). ‘A general method applicable to the search of similarities in the amino acid sequence of proteins.’ J Mol Biol, 48: 443-453
Sasson, O. & Linial, N. (2002). ‘The metric space of proteins-comparative study of clustering algorithms.’ Bioinformatics, 18: 14-21
Shi, J. & Malik, J. (1997). ‘Normalized cuts and image segmentation.’ Proceedings of the IEEE conference on Computer Vision Pattern Recognition, 731-737
Smith, W. (1999). ‘Identification of common molecular subsequences.’ J Mol Biol, 147: 195-197
Tatusov, R. et al. (2003). ‘The COG database: an updated version includes eukaryotes.’ BMC Bioinformatics, 4: 41
Van Dongen, S. (2000). Graph clustering by flow simulation. PhD thesis. Amsterdam: University of Utrecht
Yona, G. & Linial, M. (2000). ‘ProtoMap: automatic classification of protein sequences and hierarchy families.’ Nucleic Acids Res, 28(1): 49-55
Identifying Outliers in a Large Biological Data Base Coursework Example | Topics and Well Written Essays - 1750 words. https://studentshare.org/information-technology/1778467-efficiency-of-data-mining-algorithms-in-identifying-outliersnoise-in-a-large-biological-data-base