Platinum
Overview
ANN (artificial neural network) classification, in GeneLinker™, is the process of learning to separate samples into different classes by finding features that samples of the same known class have in common. For example, a set of samples may be taken from biopsies of two different tumor types, and their gene expression levels measured. GeneLinker™ can use this data to learn to distinguish the two tumor types, so that it can later diagnose the tumor types of new biopsies. Because making predictions on unknown samples is often used as a means of testing the ANN classifier, we use the terms training samples and test samples to distinguish between the samples whose classes GeneLinker™ knows (training) and the samples whose classes GeneLinker™ will predict (test).
Types of Learning
ANN Classification is an example of Supervised Learning. Known class labels indicate whether the system is performing correctly. This information can be used to indicate a desired response, to validate the accuracy of the system, or to help the system learn to behave correctly. The known class labels can be thought of as supervising the learning process; the term is not meant to imply that you have some sort of interventionist role.
Clustering is an example of Unsupervised Learning, in which the class labels are not presented to the system, which instead tries to discover the natural classes in a dataset. Clustering often fails to find known classes because the distinction between the classes can be obscured by the large number of features (genes) that are uncorrelated with the classes. One step in ANN classification is therefore to identify the genes that are intimately connected to the known classes. This is called feature selection or feature extraction. Feature selection and ANN classification together are useful even when prediction of unknown samples is not necessary: they can be used to identify key genes involved in whatever processes distinguish the classes.
Manual Feature Selection
Manual feature selection is useful if you already have some hypothesis about which genes are key to a process. You can test that hypothesis by:
i. constructing a gene list of those genes,
ii. running an ANN classifier using those genes as features, and
iii. displaying a plot which shows whether the data can be successfully classified.
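The three steps above can be sketched outside GeneLinker™ as follows. This is only an illustration under stated assumptions: the expression values, gene names, and class labels are made-up toy data, the helper names (`select_features`, `loo_accuracy`) are hypothetical, a simple nearest-centroid classifier stands in for the ANN committee, and a leave-one-out accuracy figure stands in for the plot.

```python
def select_features(expression, gene_names, gene_list):
    """Keep only the columns of `expression` whose gene is in `gene_list`."""
    cols = [gene_names.index(g) for g in gene_list]
    return [[row[c] for c in cols] for row in expression]

def centroid(rows):
    """Mean feature vector of a group of samples."""
    return [sum(col) / len(col) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loo_accuracy(features, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    This is a stand-in for the ANN classifier, not GeneLinker's method.
    """
    correct = 0
    for i, (x, y) in enumerate(zip(features, labels)):
        # Hold out sample i and build class centroids from the rest.
        rest = [(f, l) for j, (f, l) in enumerate(zip(features, labels)) if j != i]
        classes = {l for _, l in rest}
        cents = {c: centroid([f for f, l in rest if l == c]) for c in classes}
        pred = min(cents, key=lambda c: sq_dist(x, cents[c]))
        correct += pred == y
    return correct / len(labels)

# Toy data (hypothetical): 6 samples x 3 genes, two tumor types.
gene_names = ["g1", "g2", "g3"]
expression = [
    [1.0, 5.0, 0.3],
    [1.2, 4.0, 0.9],
    [0.9, 6.0, 0.1],
    [3.0, 5.5, 0.8],
    [3.1, 4.5, 0.2],
    [2.9, 5.0, 0.5],
]
labels = ["T1", "T1", "T1", "T2", "T2", "T2"]

# Step i: the hypothesized gene list. Step ii: classify using only those genes.
features = select_features(expression, gene_names, ["g1"])
# Step iii (stand-in for the plot): how well do those genes separate the classes?
accuracy = loo_accuracy(features, labels)
print(accuracy)
```

If the hypothesized genes really do distinguish the classes, the held-out samples should be classified well; a poor score suggests the gene list is missing key genes.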
Feature Selection Using the SLAM™ Technology
Genes that appear frequently in the associations found by SLAM™ are often good features for classification with artificial neural networks.
In GeneLinker™, ANN classification is done using a committee of artificial neural networks (ANNs). ANNs are highly adaptable learning machines which can detect non-linear relationships between the features and the sample classes. A committee of ANNs is used because an individual ANN may not be robust; that is, it may not make good predictions on new data (test data) despite excellent performance on the training data. Such a neural network is referred to as being overtrained.
Each ANN (component neural network or learner) is by default trained on a different 90% of the training data and then validated on the remaining 10%. (These fractions can be set differently in the Create ANN Classifier dialog by varying the number of component neural networks.) This technique mitigates the risk of overtraining at the level of the individual component neural network.
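One way to picture this partitioning is the sketch below. It assumes the split works like a k-fold partition, in which each of `n_learners` component networks holds out a different 1/`n_learners` share of the training samples for validation; this is consistent with the 90%/10% default for ten learners, but the function name `committee_splits` and the shuffling details are illustrative assumptions, not GeneLinker™'s documented implementation.

```python
import random

def committee_splits(n_samples, n_learners=10, seed=0):
    """Return one (train_indices, validation_indices) pair per learner."""
    if not 2 <= n_learners <= n_samples:
        raise ValueError("need 2 <= n_learners <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_learners nearly equal validation folds.
    folds = [indices[i::n_learners] for i in range(n_learners)]
    splits = []
    for held_out in folds:
        held = set(held_out)
        # Each learner trains on everything outside its own validation fold.
        train = [i for i in indices if i not in held]
        splits.append((train, sorted(held_out)))
    return splits

splits = committee_splits(n_samples=40, n_learners=10)
train, val = splits[0]
print(len(train), len(val))  # prints: 36 4
```

With ten learners each one trains on 90% of the samples; with five learners the same scheme would give an 80%/20% split, which illustrates why the fractions depend on the number of component networks.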
The committee architecture further enhances robustness by combining the component predictions in a voting scheme. Finally, by examining a chart of the voting results, difficult-to-classify samples can often be identified for re-examination or further study.
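The voting idea can be illustrated with a simple majority vote. The sketch below is an assumption about the general technique, not GeneLinker™'s actual voting scheme: the function name `committee_vote`, the toy predictions, and the 0.75 agreement threshold are all hypothetical.

```python
from collections import Counter

def committee_vote(votes_per_learner):
    """Combine per-learner class predictions by majority vote.

    votes_per_learner[k][s] is learner k's predicted class for sample s.
    Returns one (winning_class, agreement) pair per sample, where
    agreement is the fraction of learners that voted for the winner.
    Ties go to the class that was voted for first.
    """
    results = []
    for sample_votes in zip(*votes_per_learner):
        winner, n_votes = Counter(sample_votes).most_common(1)[0]
        results.append((winner, n_votes / len(sample_votes)))
    return results

# Toy predictions (hypothetical): 5 learners, 3 samples.
votes = [
    ["A", "B", "A"],
    ["A", "B", "B"],
    ["A", "A", "B"],
    ["A", "B", "A"],
    ["A", "B", "B"],
]
results = committee_vote(votes)

# Samples with low agreement are candidates for re-examination.
hard = [s for s, (_, agreement) in enumerate(results) if agreement < 0.75]
print(results)  # prints: [('A', 1.0), ('B', 0.8), ('B', 0.6)]
print(hard)     # prints: [2]
```

A sample on which the committee is split, like the third one here, is the kind of difficult-to-classify sample that a chart of the voting results would make visible.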
Related Topics:
An Introduction to Classification: Feature Selection
Association Mining Using SLAM™