
Efficiency of Data Mining Algorithms in Identifying Outliers-Noise in a Large Biological Data Base - Essay Example

Efficiency of Data Mining Algorithms in Identifying Outliers/Noise in a Large Biological Data Base

Introduction

The number of protein sequences in bioinformatics is estimated at over half a million. This creates a need for meaningful partitions of the protein sequences so that the roles they play can be detected. Alignment methods were traditionally used to group and compare protein sequences. Later, local alignment algorithms were introduced to replace the earlier methods and perform more complex functions. Local algorithms find amino acid patterns that are conserved across protein sequences (Clote and Backofen, 2000). Global algorithms, by contrast, align entire protein sequences, matching as many characters as possible. These developments helped group large data sets into meaningful clusters.

The traditional methods aligned protein data sets pairwise, which proved too computationally expensive for large comparisons. This made pairwise-alignment methods inefficient for clustering large protein data sets, because such approaches do not take into account that the data may be too large to fit in the computer's main memory (Smith, 1999). This situation called for a technique that would group the data naturally, or produce a meaningful partition, using a distance function. Several methods appropriate for clustering protein sequences into families have been identified. These approaches fall into three main categories: graph-based, hierarchical, and partitioning approaches (Needleman and Wunsch, 1997).
Most of these approaches are built on graph-based or hierarchical techniques. Partitioning techniques are relatively few in the field of protein sequence clustering. As a result, no database or tool from this area is available to the scientific community. Most approaches in this category cannot be regarded as tools, because they cannot be applied to cluster user-provided data sets (Cabena, 1998). Instead, most of these methods force users to consult classification results stored in a database for known large data sets. This paper identifies algorithm-based approaches that cluster efficiently. These methods are simple partitioning techniques appropriate for clustering large data sets. The main aims of these clustering algorithms are to produce meaningful partitions, to improve classification quality, and to reduce computation time. The identified algorithms are Pro-LEADER, Pro-Kmeans, Pro-CLARANS, and Pro-CLARA (Fayyad, 1996). These algorithms are used to partition protein sequence data sets into clusters.

Pro-Kmeans Algorithm

The Pro-Kmeans algorithm first partitions the data set randomly into clusters. It then uses the Smith-Waterman algorithm to compare the proteins within each cluster and to compute each protein's SumScore. The sequence with the highest SumScore in a cluster is taken as that cluster's centroid (Tatusov, 2003). The Smith-Waterman algorithm is then applied again to compare each protein in the data set with the centroids found, and each sequence is assigned to the nearest cluster, that is, the one whose centroid gives the maximum similarity score.
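The centroid and assignment steps of Pro-Kmeans described above can be sketched as follows. This is a minimal illustration, not the published implementation: the function names, the match/mismatch/gap values, and the linear gap penalty are assumptions made for the example. Real protein comparisons typically use a substitution matrix such as BLOSUM62 rather than a fixed match score.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (assumed toy scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def centroid(cluster, score=smith_waterman_score):
    """Member with the highest SumScore (sum of scores against the other members)."""
    def sum_score(i):
        return sum(score(cluster[i], other)
                   for k, other in enumerate(cluster) if k != i)
    return cluster[max(range(len(cluster)), key=sum_score)]


def assign(sequences, centroids, score=smith_waterman_score):
    """For each sequence, the index of the centroid with the maximum score."""
    return [max(range(len(centroids)), key=lambda c: score(s, centroids[c]))
            for s in sequences]


if __name__ == "__main__":
    cluster = ["ACGT", "ACGA", "TTTT"]
    print(centroid(cluster))
    print(assign(["ACGG", "TTTA"], ["ACGT", "TTTT"]))
```

A full Pro-Kmeans run would alternate `centroid` and `assign` until the partition stops improving; that loop is left out here to keep the sketch short.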
Pro-Kmeans repeats this centroid and assignment process many times in order to maximize its objective function. The input parameter is the number of clusters, and the output is the best partition found for the whole data set (Sasson and Linial, 2002).

Pro-LEADER algorithm

The Pro-LEADER algorithm selects the first sequence in the data set as the first leader and uses the Smith-Waterman algorithm to compute the similarity score of every sequence in the data set with all leaders. For each sequence, the algorithm identifies the nearest leader and compares the score with a pre-fixed threshold. If the similarity score with the nearest leader is lower than the threshold, the sequence becomes a new leader. Otherwise, the sequence is assigned to the cluster defined by that leader (Herger and Holm, 2001). Pro-LEADER is thus an incremental algorithm in which each cluster is represented by a leader. The clusters obtained depend on choosing a suitable threshold value. The aim of the algorithm is to maximize its objective function. Its input parameter is the similarity-score threshold for treating an object as a new leader, and its output is the best partition of the training data set together with the number of cluster leaders obtained. The algorithm is fast and requires only one pass through the data set (Enright and Ouzounis, 2000).

Pro-CLARA algorithm

This algorithm relies mainly on sampling to handle large data sets. Unlike most other approaches, which find the medoids of the entire data set, Pro-CLARA draws a small sample of the sequences in the data set.
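The single-pass Pro-LEADER procedure described earlier in this section can be sketched as below. This is an illustrative sketch under stated assumptions: it assumes that a sequence founds a new cluster when its best similarity score falls below the threshold (the usual reading of the leader algorithm when scores measure similarity), and `shared_positions` is a hypothetical toy stand-in for the Smith-Waterman score so that the example stays short.

```python
def pro_leader(sequences, threshold, similarity):
    """Single-pass leader clustering on a similarity score.

    Returns (leaders, clusters); clusters[i] lists the members of leaders[i].
    """
    leaders, clusters = [], []
    for seq in sequences:
        if leaders:
            scores = [similarity(seq, leader) for leader in leaders]
            best = max(range(len(leaders)), key=scores.__getitem__)
            if scores[best] >= threshold:
                # Similar enough to the nearest leader: join its cluster.
                clusters[best].append(seq)
                continue
        # No leader yet, or none similar enough: the sequence becomes a leader.
        leaders.append(seq)
        clusters.append([seq])
    return leaders, clusters


def shared_positions(a, b):
    """Toy similarity (assumption for the demo): number of equal positions."""
    return sum(x == y for x, y in zip(a, b))


if __name__ == "__main__":
    data = ["ACGT", "ACGA", "TTTT", "TTTA"]
    leaders, clusters = pro_leader(data, threshold=3, similarity=shared_positions)
    print(leaders)
    print(clusters)
```

Because each sequence is compared only with the current leaders, the result depends on the order of the input and on the chosen threshold.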
To find an optimal set of medoids for each data sample, Pro-CLARA applies the PAM algorithm adapted to protein sequence data sets (Pro-PAM). In the Pro-PAM algorithm, a number of sequences are randomly selected from the data set to represent the clusters, after which the Smith-Waterman algorithm is used to compare the total score of every pair made up of a selected sequence and a non-selected sequence (Shi and Malik, 1997). The optimal medoid set obtained with Pro-PAM and the Smith-Waterman algorithm is then used by Pro-CLARA to compare every protein in the data set with all the medoids and to place each sequence in the nearest cluster. Pro-CLARA repeats the sampling and clustering process in order to reduce sampling bias. The number of repetitions is fixed in advance, and the final clustering result is the medoid set with the maximum objective function. The input parameters are the number of clusters and the number of iterations, and the output is the medoids of the clusters obtained (Bolten, 2001).

Pro-CLARANS algorithm

The Pro-CLARANS algorithm starts from an arbitrary node in a graph. This node represents an initial set of medoids. The algorithm then randomly picks a neighbor of the current node, that is, a node that differs from it by exactly one sequence. If the neighbor's total score is higher than that of the current node, the algorithm moves to this neighbor and continues the neighbor selection and comparison process from there. Otherwise, it tries another neighbor, until it either finds a better neighbor or reaches a pre-determined maximum number of neighbors.
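The neighbor search just described can be sketched as a randomized local search over medoid sets. This is a hedged illustration rather than the authors' code: the names `total_score`, `clarans_search`, and `shared_positions` are hypothetical, the toy similarity replaces the Smith-Waterman score, and resetting the neighbor counter after each move is an assumption borrowed from the general CLARANS scheme.

```python
import random


def shared_positions(a, b):
    """Toy similarity (assumption for the demo): number of equal positions."""
    return sum(x == y for x, y in zip(a, b))


def total_score(medoids, sequences, similarity):
    """Sum over all sequences of the similarity to their most similar medoid."""
    return sum(max(similarity(s, sequences[m]) for m in medoids)
               for s in sequences)


def clarans_search(sequences, k, similarity, max_neighbors=250, rng=None):
    """One local search; assumes k < len(sequences).

    Returns (sorted medoid indices, total score) when the search stops.
    """
    rng = rng or random.Random()
    n = len(sequences)
    current = rng.sample(range(n), k)  # arbitrary starting node
    current_score = total_score(current, sequences, similarity)
    tried = 0
    while tried < max_neighbors:
        # A neighbor differs from the current node by exactly one medoid.
        candidate = list(current)
        non_medoids = [i for i in range(n) if i not in current]
        candidate[rng.randrange(k)] = rng.choice(non_medoids)
        candidate_score = total_score(candidate, sequences, similarity)
        if candidate_score > current_score:
            current, current_score = candidate, candidate_score
            tried = 0  # restart the neighbor count at the new node
        else:
            tried += 1
    return sorted(current), current_score


if __name__ == "__main__":
    data = ["AAAA", "AAAA", "AAAT", "TTTT", "TTTT", "TTTA"]
    print(clarans_search(data, 2, shared_positions, rng=random.Random(0)))
```

Running the search several times from different starting nodes and keeping the medoid set with the highest total score corresponds to the repetition step of the algorithm.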
In Pro-CLARANS, the maximum number of neighbors is either a threshold value of about 250 or a value derived from the number of clusters and the number of sequences in the data set (Enright, 2002). The algorithm also uses the Smith-Waterman algorithm to compute the similarity score of each sequence in the data set with every medoid and to assign the sequence to the most similar cluster. Pro-CLARANS repeats the clustering process a pre-defined number of times and keeps as its final result the medoid set with the maximum objective function (Van Dongen, 2000). The input parameter is the number of clusters, and the output is the medoids of the clusters obtained.

How To Measure Performance

To evaluate these clustering algorithms, a large training data set can be used. Each algorithm is run on it in a training phase, and each resulting cluster is defined by its leader or centroid. The results of the training phase are then used to cluster a different data set, the test data set. The Smith-Waterman algorithm is used to compare each protein sequence in the test data set with all the medoids obtained in the training phase and to assign every sequence to the nearest cluster. The family predicted for each sequence is that of its nearest medoid (Essousi and Fayech, 2007). The results of the test phase are then used to compute the sensitivity and specificity of each algorithm and to compare the algorithms with published results of clustering tools. Here, sensitivity is the probability that a sequence is classified correctly, while specificity is the probability that a given prediction is correct (Henikoff, 1999).
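As a small worked example of these two measures, the sketch below computes them for one protein family from true and predicted labels. The formulas TP/(TP+FN) and TP/(TP+FP) are an interpretation of the verbal definitions above, not formulas quoted from the essay; note that the second quantity is often called precision elsewhere. The function name and the family labels are hypothetical.

```python
def sensitivity_specificity(true_families, predicted_families, family):
    """Sensitivity and specificity for one family, as defined in the text above.

    sensitivity = TP / (TP + FN): share of the family's sequences that were found
    specificity = TP / (TP + FP): share of predictions for the family that are right
    """
    pairs = list(zip(true_families, predicted_families))
    tp = sum(t == family and p == family for t, p in pairs)
    fn = sum(t == family and p != family for t, p in pairs)
    fp = sum(t != family and p == family for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    truth = ["kinase", "kinase", "kinase", "globin", "globin"]
    predicted = ["kinase", "kinase", "globin", "globin", "globin"]
    for fam in ("kinase", "globin"):
        print(fam, sensitivity_specificity(truth, predicted, fam))
```

In this toy case one kinase sequence is placed in the globin cluster, so the kinase family loses sensitivity while the globin family loses specificity.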
The results obtained with Pro-LEADER, Pro-Kmeans, Pro-CLARA and Pro-CLARANS confirm that the proposed partitioning methods are reliable and valuable tools for automated functional clustering of protein sequences. Using such methods in place of the classic alignment methods previously used by biologists can improve clustering sensitivity and specificity and reduce computation time. Biologists can use these methods to cluster large protein data sets into meaningful clusters in order to detect protein functions (Guralnik and Karypis, 2001).

Conclusion

Similar protein sequences may have the same biochemical functions and three-dimensional structures. When two sequences from different organisms are similar, they may share a common ancestor and are then termed homologous. Clustering protein sequences with the Pro-LEADER, Pro-Kmeans, Pro-CLARA and Pro-CLARANS methods helps to classify new sequences, to produce sets of similar sequences, and to predict the structure of unknown proteins. Classifying large protein sequence data sets with clustering techniques in place of alignment methods greatly reduces execution time and improves this important function in molecular biology (Kaufman and Rousseeuw, 1999).

References

Altschul, S. (1999). ‘Basic local alignment search tool.’ J Mol Biol, 215: 403-410
Bolten, E. et al. (2001). ‘Clustering protein sequences-structure prediction by transitive homology.’ Bioinformatics, 17(10): 935-41
Cabena, P. et al. (1998). Discovering Data Mining: From Concept to Implementation. New Jersey: Prentice Hall
Clote, P. & Backofen, R. (2000). Computational Molecular Biology. New York: John Wiley & Sons
Enright, A. & Ouzounis, C. (2000). ‘GeneRAGE: a robust algorithm for the sequence clustering and domain detection.’ Bioinformatics, 16(5): 451-7
Enright, A. (2002). ‘An efficient algorithm for large-scale detection of protein families.’ Nucleic Acids Res, 30(7): 1575-84
Essousi, N. & Fayech, S. (2007). ‘A comparison of four pair-wise sequence alignment methods.’ Bioinformation, 2: 166-168
Fayyad, M. (1996). ‘Data mining and knowledge discovery: Making sense out of data.’ IEEE Expert, 11: 20-25
Guralnik, V. & Karypis, G. (2001). ‘A scalable algorithm for clustering sequential data.’ SIGKDD Workshop on Bioinformatics
Henikoff, S. (1999). ‘Performances evaluation of amino acid substitution matrices.’ Proteins, 17: 49-61
Herger, A. & Holm, L. (2001). ‘Picasso: generating a covering set of protein family profiles.’ Bioinformatics, 17: 272-9
Kaplan, N. et al. (2005). ‘ProtoNet 4.0: a hierarchical classification of one million protein sequences.’ Nucleic Acids Res, 33: 216-8
Kaufman, L. & Rousseeuw, P. (1999). Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons
Needleman, S. & Wunsch, D. (1997). ‘A general method applicable to the search of similarities in the amino acid sequence of proteins.’ J Mol Biol, 48: 443-453
Sasson, O. & Linial, N. (2002). ‘The metric space of proteins-comparative study of clustering algorithms.’ Bioinformatics, 18: 14-21
Shi, J. & Malik, J. (1997). ‘Normalized cuts and image segmentation.’ Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 731-737
Smith, W. (1999). ‘Identification of common molecular subsequences.’ J Mol Biol, 147: 195-197
Tatusov, R. et al. (2003). ‘The COG database: an updated version includes eukaryotes.’ BMC Bioinformatics, 4: 41
Van Dongen, S. (2000). Graph clustering by flow simulation. PhD thesis. Amsterdam: University of Utrecht
Yona, G. & Linial, M. (2000). ‘ProtoMap: automatic classification of protein sequences and hierarchy families.’ Nucleic Acids Res, 28(1): 49-55
