Analysis of Proteomics Data
Tom Radcliffe
Predictive Patterns Software Inc.
http://www.improvedoutcomes.com
Summary
This case study involves a blinded proteomics dataset. It is based on a study of data supplied for a proof-of-concept analysis, but uses simulated data. It details various strategies for using GeneLinker Platinum to solve a difficult classification problem, and it also includes many examples of GeneLinker Gold features used to explore the data and devise a strategy for analysis.
It also demonstrates the power of using multiple classifiers: even though no single classifier was able to call every test sample, combining several classifiers enabled almost perfect classification of the unknowns.
Description of Data
The data consist of two
files of mass spec data:
- TrainingData.csv
- TestData.csv
The TrainingData file
consists of samples that are known to belong to one of four groups.
There is a separate variable file that contains a single column
giving the group for each of these samples.
There are four groups, A, B, C and D, labeled in the last column of the training file as follows:
- A: 78 samples
- B: 83 samples
- C: 83 samples
- D: 81 samples
The data for each sample
are 570 channel values that have been extracted from the original
spectra.
TestData.csv contains 60
samples of unknown group. Part of this study will show how
confidence in classification can be high even when the test samples
are truly of unknown group.
Classification Task
The classification task is
to find features (channel values) and classifiers (ANNs, IBIS or
others) that discriminate between groups reliably.
Rather than try to develop
a single classifier that will distinguish all-from-all, it is often
simpler to "divide and conquer" by first finding a
classifier that distinguishes one group from all the others and then
repeating this strategy until only one group is left.
This report focuses on D
vs. ABC. The best classifier system found has good (>99%)
accuracy on a test data set split off from the training samples
provided, and gives a self-consistent classification of the blinded
test data.
Preliminaries
The data were imported
into GeneLinker™ Platinum and given an initial high-level
analysis using Gold-level functionality to determine their general
characteristics. Samples were labeled S1 through S325.
Preliminary analysis found heterogeneity, redundancy and outlier
issues.
Heterogeneity
The data values range from about -10 to +100. Histogramming shows the usual log-normal distribution characteristic of much biological and chemical data. (For the purposes of this study the data were in fact generated to ensure this was the case, including the dual-Gaussian structure shown below and the outliers discussed in the next section.) Negative values were removed and replaced with the median of their channel, and channels with more than 30 negative values (roughly 10% of samples) were deleted from the dataset. This reduced the number of channels from 570 to 332. The data were then normalized by taking log10; the resulting histogram is shown in Figure 1.
Figure 1: Histogram of all samples after log normalization
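For readers who want to reproduce this preprocessing outside of GeneLinker, a minimal Python sketch follows. It assumes (hypothetically) that samples are rows, channels are columns, and the group label is the last column of TrainingData.csv; the variables X and group are reused by the later sketches.

```python
import numpy as np
import pandas as pd

# Assumed layout: samples as rows, channels as columns, group label last.
df = pd.read_csv("TrainingData.csv")
group = df.iloc[:, -1].to_numpy()
channels = df.iloc[:, :-1]

# Delete channels with more than 30 negative values (roughly 10% of samples).
channels = channels.loc[:, (channels < 0).sum(axis=0) <= 30]

# Replace remaining negative values with the median of their channel.
channels = channels.mask(channels < 0, channels.median(axis=0), axis=1)

# Log-normalize; X is the matrix used in later sketches.
X = np.log10(channels).to_numpy()
```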
As can be seen, the data appear to contain two distinct but overlapping populations. Splitting the dataset into four chunks, one for each group, and repeating this preliminary analysis on each group reveals that this heterogeneous structure is present in all classes individually. For example, the histogram for Group B is shown in Figure 2.
Figure 2: Histogram of just Group B samples after log normalization
Figure 3: Correlation between adjacent entries
Figure 4: Uncorrelated columns separated by 100 daltons
This is important because
it means there is gross structure in the data that is unrelated to
the grouping variable, and therefore may confound any attempt at
classification.
Redundancy
A second type of
preliminary analysis was also performed. Pairs of columns were
examined for correlation using GeneLinker's Scatter Plot
functionality. A few random pairs were selected from the whole of
the data.
It was immediately
observed that adjacent or nearby columns were highly correlated, as
shown in Figure 3. More distant columns were not significantly
correlated, as shown in Figure 4.
Mass spec systems
typically have strong correlations between adjacent channels due to
finite resolution—a peak is naturally extended over many
channels. Therefore the correlation between nearby columns in these
data strongly suggests they are channel (intensity) data rather than
peak (area) data. On this basis the individual entries are referred
to in this report as “channels” rather than “peaks”.
This is significant for
subsequent data analysis because it means the data, already fairly
sparse, are also significantly redundant, and some data mining
methods will pick up several adjacent columns as “significant”
when in reality only one of them carries unique information—the
rest will just be repeating the same information. Therefore post-processing of results may yield smaller sets of equally good discriminators when these redundant columns are removed.
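The adjacent-channel redundancy is easy to verify numerically. The sketch below, reusing X from the preprocessing sketch, measures neighbour correlations and shows one crude de-duplication pass; the r > 0.95 cut-off is an arbitrary choice, not a value from the study.

```python
import numpy as np

# Correlation of each channel with its immediate neighbour.
adj = np.array([np.corrcoef(X[:, i], X[:, i + 1])[0, 1]
                for i in range(X.shape[1] - 1)])
print("median adjacent-channel correlation:", np.median(adj))

# Crude de-duplication: keep a channel only when it is not highly
# correlated with the previously kept one (0.95 is an arbitrary cut-off).
keep = [0]
for i in range(1, X.shape[1]):
    if np.corrcoef(X[:, keep[-1]], X[:, i])[0, 1] <= 0.95:
        keep.append(i)
```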
Outliers
A final aspect of the data
that was uncovered in the preliminary analysis was the presence of
outliers in Group C. PCA analysis, as shown in Figure 5, revealed a
“blob” of samples far from everything else. The 19
samples in this blob were culled from the data prior to further
analysis, reducing the number of Group C samples to 64 from 83.
There are three samples in this culled set that are perhaps
questionable (the ones in the upper right corner of the red overlay).
However, their presence or absence does not make a difference to the
results of analysis, which were ultimately verified on the whole
dataset.
Figure 5: Group C outliers culled from the dataset on the basis of
PCA as shown
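The culling was done interactively from GeneLinker's PCA plot; a rough programmatic equivalent, sketched below with scikit-learn, projects the Group C samples onto the leading components and flags points far from the centroid. The 2-sigma threshold is illustrative only, not the criterion used in the study.

```python
import numpy as np
from sklearn.decomposition import PCA

# Project Group C samples onto the first three principal components.
Xc = X[group == "C"]
scores = PCA(n_components=3).fit_transform(Xc)

# Flag samples far from the group centroid in component space.
dist = np.linalg.norm(scores - scores.mean(axis=0), axis=1)
outliers = np.where(dist > dist.mean() + 2 * dist.std())[0]
```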
Figure 6: Partial plot of un-normalized merged data. Error bars
show standard deviation.
Data Quality
To assess data quality,
samples in each group were merged and the resulting values plotted.
A detail from this plot is shown in Figure 6.
Two things are immediately
apparent: the between-group variation is much smaller than the
standard deviation within each group, and adjacent channels really do
frequently contain redundant information. The large within-group
variation compared to the between-group variation makes the posed
discrimination tasks challenging.
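The merged-group comparison can be approximated in a few lines of pandas. The sketch below computes per-channel group means and standard deviations and the between/within ratio that makes the task hard.

```python
import pandas as pd

# Per-channel mean and standard deviation within each group.
by_group = pd.DataFrame(X).assign(group=group).groupby("group")
mean, std = by_group.mean(), by_group.std()

# Between-group spread relative to typical within-group spread, per channel.
ratio = (mean.max() - mean.min()) / std.mean()
print(ratio.describe())
```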
Splitting
To use the supervised learning capability of GeneLinker™ Platinum effectively, the data from TrainingData.csv were split into a training set and a test set, so that the known groups of the test set could be used to evaluate the quality of classification. The test set comprised approximately 30% of the data, randomly selected across all four groups.
Splitting of data files
and generation of variable files was done outside of GeneLinker™
using Perl scripts. Splitting and variable generation functionality
will be added to GeneLinker™ in a release later this year.
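In the meantime, an equivalent split is straightforward in Python; a sketch using scikit-learn's train_test_split, stratified by group as described above:

```python
from sklearn.model_selection import train_test_split

# Hold out ~30% of the labeled samples, stratified across the four groups.
X_train, X_test, y_train, y_test = train_test_split(
    X, group, test_size=0.3, stratify=group, random_state=0)
```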
Summary
The purpose of the
foregoing is to demonstrate a few of the ways you can use the
capabilities of GeneLinker to examine the gross features of your
data. There is no more important step in data analysis: the most
sophisticated analysis algorithm will produce bad results when fed
bad data.
Outline of Analysis
All classification tasks were approached in a similar way. The data were log-normalized as described above, and several parallel approaches were used for feature selection: the F-test and the KW-test (each followed by PCA), SLAM, and IBIS, as detailed under Classification below.
Each of these approaches
was used to find a small set of channels that were promising
candidates to perform the classification task. One or more
committees of neural networks were then trained on these candidate
channels, and the quality of classification was evaluated using the
fraction of the data held out from the original training dataset.
Note that evaluation was used sparingly, as over-use of a validation
set can effectively turn it into a secondary training set, with the
data analyst serving as part of the training algorithm.
Once reasonable
performance was achieved on features selected by each approach, in
some cases features from several approaches were combined. Different
approaches are sensitive to different kinds of structure within the
data—F-Test and PCA are linear techniques that work on
individual features, SLAM is highly sensitive to non-linear and
combinatorial interactions between features, and IBIS is tuned to
detect non-linear relationships between pairs of features, and works
on continuous rather than discrete data.
The classification system of GeneLinker™ Platinum is a committee of neural networks: a variable number of networks (committees of 10 were used in this work), each trained on a different subset of the training data. A majority vote (7 out of 10 in this case) is required to classify a sample unambiguously; samples for which the committee does not reach majority agreement are classified as “unknown”.
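The voting rule itself is simple to state in code. A minimal sketch (the 7-of-10 quorum is from the text; the function name is mine):

```python
import numpy as np

def committee_call(votes, quorum=7):
    """Return the class receiving at least `quorum` votes, else 'Unknown'."""
    labels, counts = np.unique(votes, return_counts=True)
    top = counts.argmax()
    return labels[top] if counts[top] >= quorum else "Unknown"

print(committee_call(["D"] * 8 + ["ABC"] * 2))  # -> D
print(committee_call(["D"] * 6 + ["ABC"] * 4))  # -> Unknown
```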
Various measures of
goodness of prediction are used in the literature. For succinctness
I have used accuracy without unknowns—that is, the fraction of
correct calls by the classifier ignoring cases where the classifier
was not able to make a call. The fraction of unknowns is reported as
well. Neural network training tends to produce networks that
generate unequivocal calls, even when they are incorrect. Using a
committee architecture allows GeneLinker™ Platinum to make
better judgments of when a sample is not susceptible to unequivocal
classification.
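For concreteness, this is how "accuracy without unknowns" and the unknown fraction are computed in this report, expressed as a short sketch:

```python
import numpy as np

def accuracy_without_unknowns(y_true, y_pred):
    """Fraction of correct calls among samples the classifier actually
    called, plus the fraction of 'Unknown' calls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    called = y_pred != "Unknown"
    return (y_pred[called] == y_true[called]).mean(), 1 - called.mean()
```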
Classification
The goal of the
classification task is to find a small set of channels that allow a
committee of neural networks to be trained that does a reasonable job
of discriminating group D samples from all others.
The high-level overview of
this task is that there are many ways to get 90-95% classification
accuracy, but going beyond that has proven to be extremely difficult.
The most accurate classifier found, using a committee of ten neural
networks with 97 channels as inputs, produced 99% accuracy with 2%
unknown on the test data.
The two aspects of the classification problem are:
- Feature selection
- Classifier creation
Feature selection is
performed using a variety of data-mining algorithms as described
below. Once a set of features has been selected, a committee of
neural networks is trained on them to perform the classification
task.
For reasonably clean data,
the performance of the network committee should be 100% on the
training data. This was only achieved in a single case for the D vs
ABC classification problem, ignoring a single “unknown”
classification. Given how noisy the data are, this is not entirely
surprising.
Apart from feature
selection, the number of nodes in the hidden layer of the networks is
an important parameter in determining their performance. Optimal
network architecture is highly problem dependent. In general, using
the smallest number of hidden nodes that still produces good
classification will decrease network flexibility and reduce the
chance of overtraining. The best network performance was achieved in
most cases with just three or four hidden nodes, regardless of the
number of inputs. The number of input nodes is equal to the number
of features and the number of output nodes is equal to the number of
classes.
Several approaches to
feature selection were taken:
- F-test on
log10-normalized data, followed by PCA analysis to narrow the range
of important channels
- KW-test on
un-normalized data, likewise followed by PCA analysis
- SLAM on both
log10-normalized and un-normalized data
- IBIS on both
log10-normalized and un-normalized data
There isn’t room in this report to describe all of these combinations in detail. A brief overview follows.
The F-test finds channels that vary significantly between groups. The data were log10-normalized prior to running the F-test because it is a parametric test that assumes normally distributed data, and as described above the data are approximately log-normal, although this normalization comes at the cost of the 238 channels filtered out during preprocessing.
There are 110 channels
that vary with a significance of p < 0.001. Filtering on these
channels and training a committee of neural networks resulted in a
single mis-classification in the training data—sample 320 was
classified as “unknown” rather than “D”. Of
these 110 channels, 104 were also represented in the test data (a
slightly different set of channels was lost from the test data due to
removal of negatives prior to log-normalization).
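Outside of GeneLinker, the same significance filter can be sketched with SciPy's one-way ANOVA, reusing the training split from the earlier sketch. For the D vs. ABC task the A, B and C samples could equally well be pooled into a single group.

```python
import numpy as np
from scipy.stats import f_oneway

# Per-channel F-test across the four groups; keep channels with p < 0.001.
pvals = np.array([
    f_oneway(*(X_train[y_train == g][:, j] for g in "ABCD")).pvalue
    for j in range(X_train.shape[1])
])
selected = np.where(pvals < 0.001)[0]
print(len(selected), "channels significant at p < 0.001")
```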
The committee architecture is trained using a fraction of the data to evaluate each network’s performance at the end of each training epoch. For a committee of N networks, each network is validated on a randomly selected 1/N of the samples, and network performance on these samples is used to calculate a mean-squared error (MSE). Figure 7 shows the MSE plot for network training on the 104 channels selected from the F-test.
Figure 7: MSE for training a committee of ten neural networks on 104
channels selected by F-test
As can be seen in the
figure there is some over-training apparent in several networks.
However, experience has shown that moderate over-training on the
validation set is acceptable for small datasets, because the odds are
good that there is a significant difference between the samples
represented in the validation set for each network and the samples it
is trained on. This is not a flaw in the architecture, but a
reflection of the unfortunate and immutable fact that neural networks
require improbably large training sets if they are to be used in a
completely straightforward way—for reasonably sized training
sets a degree of insight and interpretation into their quality
measures is required.
To investigate the source
of overtraining, GeneLinker™ Platinum includes a “pocket
algorithm” option for neural network training. This saves
(“pockets”) the network parameters that produce the best
performance on the validation set for each network, preventing
overtraining. If the overtraining is due to unrepresentative
validation sets the pocket algorithm will result in poorer
classifier performance on the whole dataset.
In this case, using the
pocket algorithm produces a similar level of performance, as does
cutting off training at 5 iterations rather than 10.
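The pocket idea itself is compact. In the sketch below, train_step and validate are hypothetical stand-ins for one training epoch and a validation-MSE evaluation; they are not GeneLinker APIs.

```python
import copy

def train_with_pocket(net, train_step, validate, epochs=10):
    """Keep ('pocket') the parameters with the best validation MSE seen
    during training, rather than the final-epoch parameters."""
    best_mse, pocket = float("inf"), copy.deepcopy(net)
    for _ in range(epochs):
        train_step(net)           # one training epoch (hypothetical)
        mse = validate(net)       # validation MSE (hypothetical)
        if mse < best_mse:
            best_mse, pocket = mse, copy.deepcopy(net)
    return pocket
```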
Testing the trained committee of neural networks on the 30% of the data held out for testing gives an error rate of 5-10%, depending on how unknowns are counted. If an “unknown” classification is counted as an error, the typical accuracy is 91%; if unknowns are excluded from the ratio, then 95-96% accuracy is achievable.
While it is difficult to improve the quality of discrimination, it is easy to reduce the number of channels required to reach 95% discrimination accuracy. Running Principal Component Analysis on the 104 channels identified as significant by the F-test shows that most of the between-group variation is due to the first principal component, and there is a small group of channels that accounts for most of the power in that component.
Figure 8: 3D PCA plot of 104 channels from F-test. Group D is green,
Group ABC is blue. Note that most of the D/ABC difference is
accounted for by PC1.
To reduce the number of channels, the PCA loadings are exported and re-imported into GeneLinker as a new data table after editing the exported file to remove all but the top five components. They are then range-filtered to select the channels whose loadings vary most across these components. This selects channels that account for most of the variation in the data, and because the channels have already been filtered to those that vary significantly between groups, it selects channels that vary most between groups. This reduces the number of channels to 17, which produces perfect performance on the training data after the elimination of 5 unknowns. However, performance on the held-out test data is not good: only 88% correct classification after the elimination of unknowns.
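An approximation of this export/edit/re-import loop in scikit-learn terms is sketched below; the range filter over loadings is one reading of the procedure, and selected comes from the F-test sketch above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Loadings of the top five components over the significant channels.
pca = PCA(n_components=5).fit(X_train[:, selected])
loadings = pca.components_            # shape: (5, n_channels)

# Range filter: keep the channels whose loadings span the widest range
# across the five components.
span = loadings.max(axis=0) - loadings.min(axis=0)
top17 = selected[np.argsort(span)[::-1][:17]]
```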
The Kruskal-Wallis test is
a non-parametric version of the F-test, and produced similar results
on the un-normalized data to those of the F-test on the normalized
data.
To achieve better discrimination, the more powerful feature selectors available in GeneLinker Platinum were brought into play: IBIS and SLAM. IBIS (Integrated Bayesian Inference System) performs an exhaustive search for single genes and pairs of genes that vary significantly between classes using linear, quadratic or uniform Gaussian discriminant analysis. SLAM performs a fast heuristic search for n-tuples of genes whose discretized levels of expression co-vary within a class significantly more often than would be expected by chance.
For a dataset of moderate
size running IBIS on pairs of genes with QDA (quadratic discriminant
analysis) produces hundreds of discriminators with better than 80%
accuracy. Taking the top 97 genes from these gives a set of features that, with a four-hidden-node network, gives 99% accuracy on the test data with 3% unknowns.
Once a reasonable level of
discrimination has been achieved with a feature set of moderate size,
the next challenge is to reduce the number of features until the
minimal number has been found. This was done using a binary search
strategy: halving the number of features at each step. The results
are shown in Table 1.
The classification
accuracy is not a sensitive function of the number of features.
However, the best accuracy was achieved at 50 features (channels).
Larger numbers of channels perform less well due to noisier,
less-relevant data being included in the training process. Smaller
numbers of channels perform less well due to the absence of relevant
information.
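The halving strategy can be sketched as follows; evaluate is a hypothetical stand-in for training a committee on a feature subset and scoring it on the held-out split, and ranked is the IBIS-ordered channel list.

```python
def shrink_features(ranked, evaluate, tol=0.01):
    """Halve the feature list while accuracy stays within `tol` of the best."""
    best_n, best_acc = len(ranked), evaluate(ranked)
    n = len(ranked)
    while n > 1:
        n //= 2
        acc = evaluate(ranked[:n])
        if acc < best_acc - tol:
            break
        best_n, best_acc = n, max(acc, best_acc)
    return ranked[:best_n]
```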
The best performance is very good: 99% correct classification on the held-out test data, with only 2% (2 samples) unable to be classified. If unknowns are (pessimistically) counted as misclassifications, this is still better than 97% accuracy.
Features | Hidden Nodes | Training Accuracy | Training Unknowns | Test Accuracy | Test Unknowns
---------|--------------|-------------------|-------------------|---------------|--------------
97 | 2 | 195/197 (99%) | 5/202 (2.5%) | 96/98 (98%) | 6/104 (6%)
97 | 3 | 200/200 (100%) | 2/202 (1%) | 100/102 (98%) | 2/104 (2%)
97 | 4 | 198/198 (100%) | 4/202 (2%) | 100/101 (99%) | 3/104 (3%)
97 | 5 | 200/201 (99.5%) | 1/202 (0.5%) | 99/100 (99%) | 4/104 (4%)
50 | 2 | 198/199 (99.5%) | 3/202 (1.5%) | 101/102 (99%) | 2/104 (2%)
50 | 3 | 198/199 (99.5%) | 3/202 (1.5%) | 97/98 (99%) | 6/104 (6%)
50 | 4 | 201/202 (99.5%) | 0/202 (0%) | 101/102 (99%) | 2/104 (2%)
50 | 5 | 199/200 (99.5%) | 2/202 (1%) | 100/101 (99%) | 3/104 (3%)
25 | 2 | 196/197 (99.5%) | 5/202 (2.5%) | 97/101 (96%) | 3/104 (3%)
25 | 3 | 196/197 (99.5%) | 5/202 (2.5%) | 96/97 (99%) | 7/104 (7%)
25 | 4 | 198/200 (99%) | 2/202 (1%) | 97/100 (97%) | 4/104 (4%)
25 | 5 | 200/201 (99.5%) | 1/202 (0.5%) | 97/102 (95%) | 2/104 (2%)
25 | 6 | 190/192 (99%) | 10/202 (5%) | 93/95 (98%) | 9/104 (9%)
Table 1: Training and test accuracy of various feature sets selected
by the IBIS algorithm, as a function of the number of hidden neural
network nodes. The training set has 202 samples, the test set 104.
Only one committee was trained for each configuration.
To explore improvements to the classification accuracy, SLAM was run on the log10-normalized data for 5,000,000 iterations. This took several hours on a dual 1 GHz desktop machine.
Of the top 25 SLAM
channels, 11 differ from those found by IBIS. This illustrates the
value of the multiple approaches available in GeneLinker—the
SLAM-found channels are implicated in more complex protein-protein
interactions than the binary interactions that IBIS is sensitive to.
On the other hand, the signals that SLAM detects are often more
subtle and weaker than the pair-wise IBIS channels.
The SLAM-found channels
are not as good predictors as those found by IBIS. The best neural
network committee trained on the top 25 SLAM-found channels gives
only 95% correct classification with 6% unknowns. Interestingly,
they fail on a different set of samples than the top 25 IBIS channel
predictors do. This is a further illustration of the value of
multiple approaches—when the biological context is known, the
role of different feature sets can be interpreted to understand what
aspects of the biology they are sensitive to, rather than just
blindly selecting based on raw statistical significance, as is done
here.
The combined 36 unique
channels (11 from SLAM, 14 shared, 11 from IBIS) do not do
significantly better than the best 25 channels from IBIS alone.
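Combining feature lists is a simple set operation; the sketch below uses placeholder channel indices chosen only to reproduce the 11/14/11 split reported above.

```python
# Placeholder ranked channel lists (illustrative indices only).
ibis_top25 = list(range(25))
slam_top25 = list(range(11, 36))

shared = set(ibis_top25) & set(slam_top25)             # 14 shared channels
combined = sorted(set(ibis_top25) | set(slam_top25))   # 36 unique channels
print(len(shared), "shared,", len(combined), "combined")
```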
The best 50 IBIS channels
are included in the appendix, along with their classifications of the
blinded test data included with the dataset using several committees
of neural networks trained on those channels.
Classification of Blinded Data
Classifying the blinded
data with the best neural network committee produces a higher than
expected proportion of unknowns: 6/60 (10%). This suggests that the
population for the blinded data may be slightly different from that
of the training data. A comparison of histograms adds a certain
amount of credence to this suggestion, as shown in Figure 9.
Figure 9: Comparison of histograms after negative value removal and
log10 normalization for training data and blinded test data. The
x-axis scales are identical in the two plots. While the
distributions are similar, there are clear and significant
differences.
As a cross-check, several
other classifiers were run on the blinded data. In particular,
classifiers trained on the top 50 IBIS-selected channels with 2, 3, 4
and 5 hidden nodes were run. As shown in the appendix, in all but
one case (sample 14) at least one classifier was able to make a call.
In two cases (samples 22 and 53) only one classifier was able to
make a call. In all cases where more than one classifier was able to
make a call, there was complete agreement between them.
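The cross-check amounts to a consensus rule over the four committees; a minimal sketch:

```python
def consensus(calls):
    """Agree when all non-'Unknown' calls match; 'Unknown' only when no
    classifier makes a call."""
    made = [c for c in calls if c != "Unknown"]
    if not made:
        return "Unknown"
    return made[0] if len(set(made)) == 1 else "Conflict"

print(consensus(["Unknown", "Unknown", "Unknown", "D"]))  # sample 22 -> D
print(consensus(["Unknown"] * 4))                         # sample 14 -> Unknown
```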
Appendix
Table 1A: Classification of blinded samples based on the top 50 IBIS channels. In the 57 cases out of 60 where more than one classifier makes a call, there is perfect agreement between the classifiers; only one sample (14) cannot be classified at all.
Sample | 2 Hidden | 3 Hidden | 4 Hidden | 5 Hidden | Call
-------|----------|----------|----------|----------|-----
1 | ABC | ABC | ABC | ABC | ABC
2 | ABC | ABC | ABC | ABC | ABC
3 | ABC | ABC | ABC | ABC | ABC
4 | D | Unknown | D | D | D
5 | Unknown | ABC | ABC | Unknown | ABC
6 | ABC | ABC | ABC | ABC | ABC
7 | ABC | ABC | ABC | ABC | ABC
8 | ABC | ABC | ABC | ABC | ABC
9 | ABC | ABC | ABC | ABC | ABC
10 | ABC | ABC | ABC | ABC | ABC
11 | D | D | D | D | D
12 | D | D | D | D | D
13 | D | D | D | D | D
14 | Unknown | Unknown | Unknown | Unknown | Unknown
15 | D | D | D | D | D
16 | D | D | D | D | D
17 | D | Unknown | Unknown | D | D
18 | D | D | D | D | D
19 | D | D | D | D | D
20 | D | D | D | D | D
21 | ABC | ABC | ABC | Unknown | ABC
22 | Unknown | Unknown | Unknown | D | D
23 | ABC | ABC | ABC | ABC | ABC
24 | ABC | ABC | ABC | ABC | ABC
25 | ABC | ABC | ABC | ABC | ABC
26 | ABC | ABC | ABC | ABC | ABC
27 | ABC | ABC | ABC | ABC | ABC
28 | ABC | ABC | ABC | ABC | ABC
29 | ABC | ABC | ABC | ABC | ABC
30 | ABC | ABC | ABC | ABC | ABC
31 | ABC | ABC | ABC | ABC | ABC
32 | ABC | ABC | ABC | ABC | ABC
33 | ABC | ABC | ABC | ABC | ABC
34 | ABC | ABC | ABC | ABC | ABC
35 | ABC | ABC | ABC | ABC | ABC
36 | D | D | D | D | D
37 | D | D | D | D | D
38 | D | D | D | D | D
39 | D | D | D | D | D
40 | D | D | D | D | D
41 | ABC | ABC | ABC | ABC | ABC
42 | ABC | ABC | ABC | ABC | ABC
43 | ABC | ABC | ABC | ABC | ABC
44 | ABC | ABC | ABC | Unknown | ABC
45 | D | D | D | D | D
46 | D | D | D | D | D
47 | ABC | ABC | ABC | ABC | ABC
48 | ABC | ABC | ABC | ABC | ABC
49 | ABC | ABC | ABC | ABC | ABC
50 | ABC | ABC | ABC | ABC | ABC
51 | ABC | ABC | ABC | ABC | ABC
52 | ABC | ABC | ABC | ABC | ABC
53 | Unknown | ABC | Unknown | Unknown | ABC
54 | ABC | ABC | ABC | Unknown | ABC
55 | ABC | ABC | ABC | ABC | ABC
56 | D | D | D | D | D
57 | D | D | D | D | D
58 | D | Unknown | Unknown | D | D
59 | D | Unknown | Unknown | D | D
60 | D | D | D | D | D