Artificial Intelligence: Machine Learning Algorithms Assignment

1. Describe the prediction problem that the authors are addressing.

Of the two problems addressed by the paper, the less complex one is that of class prediction. The basic notion of class prediction is the categorization of data into one of several predefined sets, or classes. In this paper, patients are classified into different forms of cancer based on gene expression. In the class prediction problem, a random set of DNA samples from known cancer patients is available for study. For each sample, the type of cancer is already known from the patient’s medical records. Using these known data, class prediction attempts to extract trends across the patient data to aid in classifying cancers from DNA. In particular, it attempts to establish the differences between the classes. Knowing these differences allows the system to predict the type of cancer a patient has before any other tests are made.

The class prediction problem is composed of two distinct phases. The first is supervised learning, in which patterns are established from previously known results and the algorithm is trained to match them. Once the system has been trained, it can be used to evaluate any given input without knowing the expected result a priori. The goal of the class prediction problem is to achieve a 100% accuracy rate in evaluation.

2. Describe the datasets used for experiments on the class prediction problem.

The datasets available for the study contain DNA samples from AML and ALL cancer patients. Each sample is composed of a set of genes which have been extracted. The genes used in the datasets comprise a small portion of the total genes in an individual. Mapping a complete set of genes is impractical both in laboratory testing (obtaining the expression of each gene) and in data analysis.
Since an individual strand of DNA cannot be isolated from a person, the data for each gene represent the mean values over the samples taken from that person. The expression level is a quantitative measure of how much a specific trait manifests in the genes. To represent different genes equally, the expression level is normalized across the entire dataset. In the class prediction problem, each gene is sampled across the entire dataset and its normalized expression level is compared between the classes to establish patterns.

3. Describe the class discovery problem that the authors are addressing.

The class discovery problem is significantly more complicated than the class prediction problem. The basic premise of class discovery is that certain similarities may be present between samples. These similarities may be used to categorize data into an arbitrary number of classes, each of which is established as a unique cancer class. In this way, similarities among unknown cases may be used to discover new forms of cancer.

Class discovery uses a completely different approach from class prediction. Distinct classes cannot be provided from the start, since doing so would defeat the purpose of the problem. Instead, the classes are initially unknown, and the algorithm is tasked with clustering the data into classes based on strong correlation between gene expressions. The actual classes of the training data may be known, but they play no role in training the algorithm. In fact, no supervised training is present in the system. Instead, unsupervised learning in the form of clustering is used to evaluate any data set, which means that only one step is necessary in this problem.

4. What are the arguments for using a neighborhood analysis approach for AML-ALL class prediction? Indicate if you agree or disagree with their approach.

Neighborhood analysis attempts to identify distinguishing features in class prediction.
Instead of focusing on genes which manifest equally in both classes, neighborhood analysis attempts to identify genes which are expressed very differently between the two classes. This means that a gene with a high expression level in one class should have a significantly lower level in the other. This argument is quite sound, as genes are important indicators of the behavior of cancer cells, so similar behavior should be rooted in similar genes. However, genes that are irrelevant to a given cancer behavior are expressed randomly across individuals, since each individual is unique and these genes represent variation in the population. Using the neighborhood analysis approach allows the system to identify the genes which result in specific cancer cases while rejecting the natural fluctuations in genes from person to person. At the same time, such an approach is also useful in eliminating genes which manifest very similarly across the entire population.

5. What was their overall prediction success with cross validation? Comment on the meaning of this result.

In cross-validation analysis, 36 out of the 38 samples reveal significantly high prediction strengths. This means that for these samples, the gene predictors used tend towards a single class instead of being ambiguous. Since class prediction in these tests compares two classes, a high prediction strength means that the gene samples closely resemble the established set of ideal gene expressions for one class. A low prediction strength, on the other hand, means that some genes appear similar to one class while the rest appear similar to the other class. Since cross-validation tests train the system on almost the same data set that is used for evaluation, they generally yield higher prediction strengths than independent tests.
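To make the idea of prediction strength concrete, the following is a minimal sketch of a two-class weighted-voting predictor. It assumes that each gene's vote is weighted by the difference of the class means divided by the sum of the class standard deviations, and that a prediction is reported only when the margin between the two vote totals, as a fraction of all votes, reaches a threshold. The function names, the 0.3 threshold, and the toy expression values are illustrative assumptions and are not taken from the paper's implementation.

```python
from statistics import mean, stdev


def train_gene(values_a, values_b):
    """Return (weight, boundary) for one gene from its levels in two classes."""
    mu_a, mu_b = mean(values_a), mean(values_b)
    # Assumed weighting: large when the class means differ and the spread is small.
    # (A real implementation would guard against a zero denominator.)
    weight = (mu_a - mu_b) / (stdev(values_a) + stdev(values_b))
    boundary = (mu_a + mu_b) / 2  # halfway between the two class means
    return weight, boundary


def predict(sample, model, labels=("ALL", "AML"), threshold=0.3):
    """Return (label or None, prediction strength) for one sample."""
    votes_a = votes_b = 0.0
    for level, (weight, boundary) in zip(sample, model):
        vote = weight * (level - boundary)
        if vote > 0:
            votes_a += vote   # this gene votes for the first class
        else:
            votes_b += -vote  # this gene votes for the second class
    total = votes_a + votes_b
    if total == 0:
        return None, 0.0
    strength = abs(votes_a - votes_b) / total
    winner = labels[0] if votes_a > votes_b else labels[1]
    # Below the threshold the sample is left unclassified ("uncertain").
    return (winner if strength >= threshold else None), strength


# Toy model with two hypothetical genes that separate the classes in opposite directions.
model = [
    train_gene([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]),
    train_gene([1.0, 2.0, 3.0], [5.0, 6.0, 7.0]),
]
print(predict([6.0, 2.0], model))  # ('ALL', 1.0): both genes agree
print(predict([5.0, 5.0], model))  # (None, 0.0): the genes disagree
```

The second call shows the behavior discussed for the two low-strength samples: the votes cancel out, so the sketch declines to name a class rather than guess.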
The cross-validation results therefore show that while a large number of samples meet the critical value for the prediction strength, the algorithm is not perfect. Some samples cannot be categorized between the two classes. This does not mean that the algorithm itself is ineffective. For the two samples which did not meet the minimum prediction strength, it means that the algorithm’s confidence in the prediction is not sufficient. The actual prediction for these two samples may still coincide with the actual data. It should also be noted that for the samples with a high prediction strength, 100% accuracy was obtained in identifying the class. Based on these observations, the prediction algorithm is capable of accurately predicting the class for the majority of cases and can, at the same time, effectively isolate cases in which there is insufficient evidence to warrant classification.

6. What are the arguments for using a self-organizing map approach for class discovery? Indicate if you agree or disagree with their approach.

Class discovery involves categorizing data into an arbitrary number of groups. Due to the nature of this problem, a clustering algorithm is appropriate. With a large number of variables (such as the number of genes being considered), establishing similarities between classes becomes difficult with a traditional clustering algorithm such as k-means. Also, such methods do not always converge to optimal cluster centroids. With these restrictions, a more intelligent algorithm is necessary. Self-organizing maps are special cases of artificial neural networks. With such a design, inter-neuronal weights can be adjusted to distinguish between different features. The advantage of this approach is that it is capable of identifying patterns in a complex data set with minimal supervision.
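As an illustration of how a self-organizing map adjusts its weights, the following is a minimal sketch of a one-dimensional map with two nodes. It assumes a Gaussian neighbourhood that shrinks over time and a linearly decaying learning rate. The function names, the schedules, and the toy two-gene profiles are illustrative assumptions and do not reproduce the configuration used in the paper.

```python
import math
import random


def nearest(nodes, sample):
    """Index of the node whose weight vector is closest to the sample."""
    return min(range(len(nodes)), key=lambda j: math.dist(nodes[j], sample))


def train_som(samples, n_nodes=2, epochs=50, lr=0.5, seed=0):
    """Train a 1-D self-organizing map and return the node weight vectors."""
    rng = random.Random(seed)
    # Initialise each node from a distinct, randomly chosen sample.
    nodes = [list(s) for s in rng.sample(samples, n_nodes)]
    for epoch in range(epochs):
        progress = epoch / epochs
        rate = lr * (1.0 - progress)   # learning rate decays towards zero
        sigma = 1.0 - 0.9 * progress   # neighbourhood width shrinks over time
        order = list(samples)
        rng.shuffle(order)             # present the samples in random order
        for sample in order:
            best = nearest(nodes, sample)
            for j, node in enumerate(nodes):
                # The winner gets the full update; grid neighbours get less.
                influence = math.exp(-((j - best) ** 2) / (2 * sigma ** 2))
                for k in range(len(node)):
                    node[k] += rate * influence * (sample[k] - node[k])
    return nodes


# Toy profiles: two hypothetical genes measured on six samples.
profiles = [(1.0, 9.0), (1.2, 8.8), (0.8, 9.2),
            (9.0, 1.0), (8.8, 1.2), (9.2, 0.8)]
nodes = train_som(profiles)
print([nearest(nodes, p) for p in profiles])  # cluster index for each sample
```

No class labels are passed to `train_som`; the grouping comes only from the expression profiles, which is what distinguishes this setting from class prediction. Note that the number of nodes still has to be chosen in advance.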
The self-organizing map approach is quite appealing, except that the complexity of such a design increases with a larger number of inputs. Also, evaluation is computationally expensive. I personally agree with this approach. At the same time, however, I believe that other approaches may prove to be equally effective.

7. How do you evaluate the overall quality of the paper? Are there some potential improvements that can be made to improve the performances on the class prediction and class discovery techniques described in the paper?

The paper presents a very novel approach to the problems of class prediction and class discovery for cancers. The document itself is well structured and easily understood. However, the paper does not give a detailed description of the actual implementation. To compensate for this, a web site was provided for the detailed methodology.

In terms of content, the class prediction approach is very well thought out. By establishing the differences between classes, the algorithm can effectively discriminate between two classes. What the paper lacks, however, is an analysis of the effect of the number of gene classifiers on both the prediction strength and the accuracy. While it is mentioned that there is little variation, the paper would have been better if it had provided an approach for determining the optimal number of classifiers. Also, with regard to class prediction, it is possible to improve the given algorithm by using variable weights designed to minimize the prediction errors on the training set, instead of weights based on parametric distributions.

The class discovery approach is also quite well prepared. However, as with any idea, there is room for improvement. One limitation of the algorithm is that it relies on a set number of distinct classes. This makes the discovery of several new classes difficult without first estimating the number of classes.
Also, the class discovery algorithm can result in two classes which describe the same type of cancer but differ as a result of minor fluctuations in the genetic profiles of the subjects. It should therefore be possible to design an algorithm capable of merging two or more classes if a certain degree of correlation exists between them.

8. What will be your approach(es) if you intend to solve the class prediction/discovery problem discussed in the paper? How do you expect the pros and cons of your approach?

An alternative approach to class prediction may be derived from the statistical properties of the data sets. Since the expression levels are assumed to be normally distributed, genes may first be accepted or rejected as classifiers based on the difference of their means and standard deviations. Only those genes with a significant difference at an established confidence level are accepted as classifiers. This compensates for possible variations in the data. Like the approach in the paper, the proposed approach removes natural variations in the data as well as constants in the population. It has the added advantage of being more robust, at the cost of memory, since look-up tables are needed. Another advantage of this method is that it does not require the selection of an arbitrary number of classifiers. Instead, it establishes that number automatically.

Following the initial classifier selection, weighting of values can then be used. An initial estimate is established from the difference of the means of the two classes. Following this, an adjustment pass may be used to further match the training set. The learning rate for the weights should be kept small to avoid overfitting the training set. Evaluation for class prediction can be based on the statistical probabilities of belonging to a class instead of matching the raw expression level against the means of either class.
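The initial gene-selection step proposed above could be sketched as follows. This is a minimal sketch that uses Welch's t statistic as one possible way of comparing the means and standard deviations of the two classes. The function names, the critical value of 2.0 (a stand-in for a value that would normally come from a t-distribution look-up table), and the toy expression levels are all assumptions made for illustration.

```python
import math
from statistics import mean, variance


def welch_t(levels_a, levels_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    difference = mean(levels_a) - mean(levels_b)
    std_error = math.sqrt(variance(levels_a) / len(levels_a)
                          + variance(levels_b) / len(levels_b))
    if std_error == 0:
        # No spread at all: identical means carry no signal, otherwise a perfect split.
        return 0.0 if difference == 0 else math.copysign(math.inf, difference)
    return difference / std_error


def select_genes(class_a, class_b, critical=2.0):
    """Keep the genes whose class means differ significantly.

    class_a and class_b map a gene name to its expression levels in that class.
    The number of selected genes is decided by the test, not fixed in advance.
    """
    return [gene for gene in class_a
            if abs(welch_t(class_a[gene], class_b[gene])) > critical]


# Toy data: gene "g1" separates the classes, gene "g2" does not.
all_levels = {"g1": [5.0, 6.0, 7.0], "g2": [4.0, 5.0, 6.0]}
aml_levels = {"g1": [1.0, 2.0, 3.0], "g2": [4.5, 5.5, 5.0]}
print(select_genes(all_levels, aml_levels))  # ['g1']
```

Here `g2` is rejected because its class means coincide, which corresponds to removing genes that behave the same way across the population.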
Again, evaluating by class probabilities has the disadvantage of requiring more memory, but it may prove to be more robust.

Class discovery may also be implemented differently. Instead of using the more complex self-organizing maps, gene expression levels can be normalized and placed into one of three categories: high, low, or centered. Centered genes are removed during clustering, as these cannot describe differences between classes (as with class prediction). High and low expression levels can then be clustered using an algorithm such as the QT clustering algorithm. This method allows the system to determine automatically a number of high-quality clusters based on gene expressions. It has the advantage of not requiring a fixed number of output classes. Also, data which cannot be classified into classes are ignored by this algorithm. This means that even samples from the healthy population will not corrupt the analysis by creating new classes of healthy individuals. This allows an even more robust class discovery, as random samples can be taken from any population, allowing the system to identify cancers in that population. A disadvantage is the additional computational complexity, as QT clustering is exhaustive over all possible cluster groups. Finally, to process the selected classes or cluster centroids, an analysis of the variance between groups can be used to determine whether any groups can be merged into one class. This, however, is computationally intensive.
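The clustering step of this proposal could be sketched as follows. This is a minimal, unoptimized sketch of quality-threshold clustering that assumes the expression levels have already been normalized and coded as high (+1), low (-1), or centered (0). The function names, the diameter threshold, the minimum cluster size, and the toy profiles are illustrative assumptions.

```python
import math


def diameter(points, members):
    """Largest pairwise distance among the points with the given indices."""
    return max(math.dist(points[a], points[b]) for a in members for b in members)


def qt_cluster(points, max_diameter, min_size=2):
    """Return (clusters, outliers) as lists of indices into `points`."""
    remaining = set(range(len(points)))
    clusters = []
    while remaining:
        best = []
        # Grow one candidate cluster from every remaining point.
        for seed in sorted(remaining):
            candidate = [seed]
            others = remaining - {seed}
            while others:
                # Add the point that keeps the candidate's diameter smallest.
                nxt = min(sorted(others),
                          key=lambda i: diameter(points, candidate + [i]))
                if diameter(points, candidate + [nxt]) > max_diameter:
                    break
                candidate.append(nxt)
                others.discard(nxt)
            if len(candidate) > len(best):
                best = candidate
        if len(best) < min_size:
            break  # whatever is left is treated as unclassified
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters, sorted(remaining)


# Toy profiles over four hypothetical genes; the last sample fits no group.
samples = [(1, 1, -1, -1), (1, 1, -1, 0),
           (-1, -1, 1, 1), (-1, 0, 1, 1),
           (1, -1, 1, -1)]
clusters, outliers = qt_cluster(samples, max_diameter=1.5)
print(clusters)  # [[0, 1], [2, 3]]
print(outliers)  # [4]
```

The number of clusters is not passed in; it follows from the diameter threshold, and the sample that fits no cluster is reported separately instead of forming a class of its own. The nested search over every seed is also what makes the method computationally expensive.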
