
 

Overview of Principal Component Analysis (PCA) Functionality

 

Overview

Component Analysis is an unsupervised or class-free approach to finding the most informative or explanatory features in data. In particular, Principal Component Analysis (PCA) substantially reduces the complexity of data in which a large number of variables (e.g. thousands) are interrelated, such as in large-scale gene expression data obtained across a variety of different samples or conditions. PCA accomplishes this by computing a new, much smaller set of uncorrelated variables which best represent the original data. PCA is a powerful, well-established technique for data reduction and visualization. 2D and 3D PCA plots often place objects with similar patterns near each other.

GeneLinker™ provides one option for PCA: the orientation of the analysis, which is either Orientation by Genes or Orientation by Samples. In brief, PCA oriented by genes is useful for distinguishing sample classes or sample clusters, while PCA oriented by samples is useful for distinguishing gene classes or gene sets.

 

Mathematical Details and Examples of Orientation

To understand the difference between the two orientations, PCA by Genes and PCA by Samples, and what each implies for interpretation, it is helpful to think about the analysis in terms of covariance matrices. A dataset can be thought of as a set of distinct mathematical or statistical variables (e.g. columns), for each of which there are statistical samples, i.e. observations (e.g. rows).

a) Genes vs. Genes (Orientation by Genes)

b) Samples vs. Samples (Orientation by Samples)

In GeneLinker™, a Principal Component (PC) is defined as a mathematical entity (i.e., vector) computed from the data which is equivalent to a characteristic vector (i.e., eigenvector) of a covariance matrix derived from the data.

Computing the PCs in this way is equivalent to finding the best lower dimensional linear basis set in which to represent the original data, where 'best' means that the residual variance is minimized. The results obtained from the GeneLinker™ implementation are equivalent to a classical PCA of the data's covariance matrix; however, for computational speed and accuracy, covariance matrices are not explicitly computed by GeneLinker™ for PCA. A typical dataset comprises n genes by m samples. From a covariance point of view, one can conceptualize two different kinds of covariance matrices for this data archetype:

a) Orientation by Genes: n by n covariance matrix (genes in the role of the math/statistics variables; hence, n genes vs. n genes, aggregated over all samples) OR

b) Orientation by Samples: m by m covariance matrix (samples in the role of the math/statistics variables; hence, m samples vs. m samples, aggregated over all genes).

For example, if there are n = 1000 genes and m = 12 samples (say, 12 different human subjects), the covariance matrix for case (a) would have 1,000,000 elements (1000 x 1000), but the covariance matrix for case (b) would have only 144 elements (12 x 12).
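The element counts above can be checked with a short sketch. It is an illustration only, not GeneLinker™ code: it assumes NumPy, uses random numbers in place of real expression values, and the names `data`, `cov_genes`, and `cov_samples` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 1000, 12

# Hypothetical dataset: rows are genes, columns are samples.
data = rng.normal(size=(n_genes, n_samples))

# (a) Orientation by Genes: genes play the role of variables,
#     so each ROW of `data` is a variable (rowvar=True).
cov_genes = np.cov(data, rowvar=True)

# (b) Orientation by Samples: samples play the role of variables,
#     so each COLUMN of `data` is a variable (rowvar=False).
cov_samples = np.cov(data, rowvar=False)

print(cov_genes.shape, cov_genes.size)
print(cov_samples.shape, cov_samples.size)
```

The explicit covariance matrices are built here only to make the two shapes visible; as noted above, GeneLinker™ itself does not compute them explicitly.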

 

Technical Notes

Whether PCA is oriented by genes or by samples, the maximum number of bona fide Principal Components that can be returned is the smaller of the number of genes and the number of samples. This is an inherent mathematical constraint.
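The constraint can be seen numerically in the following sketch, which assumes NumPy and a small random dataset (all names are hypothetical). It counts the eigenvalues of a genes vs. genes covariance matrix that are numerically non-zero and compares the count with the smaller of the two dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 50, 6

# Hypothetical dataset: rows are genes, columns are samples.
data = rng.normal(size=(n_genes, n_samples))

# Genes vs. genes covariance matrix (50 x 50), aggregated over 6 samples.
cov_genes = np.cov(data, rowvar=True)

# eigvalsh is intended for symmetric matrices and returns real eigenvalues.
eigenvalues = np.linalg.eigvalsh(cov_genes)

# Count eigenvalues that are not numerically zero.
n_nonzero = int(np.sum(eigenvalues > 1e-8))
upper_bound = min(n_genes, n_samples)

print(n_nonzero, "<=", upper_bound)
```

Although the matrix has 50 rows and columns, the number of non-zero eigenvalues cannot exceed the upper bound of 6. Zero-meaning the variables can remove one further degree of freedom, so the count may be smaller than the bound.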

PC calculation does not require parameters; the only choice you make is the orientation of the calculation. The PCA Components to Display setting in the Preferences (accessed from the Edit menu) only affects display and reporting. The default limit on the number of PCs displayed in the Scree and Loadings plots is 15. This setting does not affect the actual calculation of the PCs. It sets an upper limit only on the number of PCs to display in these plots; therefore it does not have to be set before the PCs are calculated.

Whether the user requests PCA of count data, log data, max-min normalized data, missing value-replaced data, etc., GeneLinker™ automatically zero-means the data 'variables' before the PCA calculation, as is required for the results to be mathematically equivalent to the PCA of the covariance matrix.
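The sketch below illustrates why zero-meaning makes a covariance-free calculation equivalent to PCA of the covariance matrix. It assumes NumPy and uses a singular value decomposition (SVD) as one common way to avoid forming the covariance matrix; the help text does not state which algorithm GeneLinker™ actually uses, so the SVD route is an assumption, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 12 observations (rows) by 5 variables (columns).
X = rng.normal(size=(12, 5))
n_obs = X.shape[0]

# Zero-mean each variable (column).
Xc = X - X.mean(axis=0)

# Route 1: classical PCA, eigenvalues of the explicit covariance matrix.
cov = Xc.T @ Xc / (n_obs - 1)
cov_eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # sorted descending

# Route 2: SVD of the zero-meaned data, no covariance matrix formed.
singular_values = np.linalg.svd(Xc, compute_uv=False)
svd_eigenvalues = singular_values**2 / (n_obs - 1)

print(np.allclose(cov_eigenvalues, svd_eigenvalues))
```

If the column means were not subtracted first, the squared singular values would in general no longer match the eigenvalues of the covariance matrix.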

GeneLinker™ limits the number of PCs by their contribution towards representing fractions of the total variance of the data (i.e., their numerical relevance). Only PCs whose eigenvalues are greater than or equal to 10^-8 are included in the calculation result set. In practice, however, PCs whose eigenvalues (i.e., fractions of the data's total variance) are less than about 0.1 are rarely of much interpretive use or value.
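As a minimal sketch of this kind of filtering, the function below keeps only components whose share of the total variance meets a threshold. It assumes NumPy, and it assumes that the threshold is applied to each eigenvalue expressed as a fraction of total variance, which is one reading of the text above. The function name `retained_components` and the example eigenvalues are hypothetical.

```python
import numpy as np

def retained_components(eigenvalues, threshold=1e-8):
    """Return eigenvalues and variance fractions that meet the threshold.

    Assumption: the threshold is compared against each eigenvalue
    divided by the sum of all eigenvalues (fraction of total variance).
    """
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending
    fractions = ev / ev.sum()
    keep = fractions >= threshold
    return ev[keep], fractions[keep]

# Made-up eigenvalues; the last one is numerically negligible.
ev, frac = retained_components([6.0, 3.0, 0.9, 0.1, 1e-12])

print(len(ev))
print(np.round(frac, 3))
```

With these made-up values the negligible fifth component is dropped, and only the first two retained components account for more than about 0.1 of the total variance each.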

Note also that the direction in which a PC points (e.g., southeast rather than northwest) along the line collinear with the PC is irrelevant. Therefore, reversing the algebraic signs of all the constituent values of a PC (for example, in a Loadings Line Plot) does not change its interpretation.
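The sign ambiguity can be checked directly. The sketch below assumes NumPy and a small random dataset (all names are hypothetical). It confirms that a PC and its negation are both eigenvectors with the same eigenvalue, and that the scores merely change sign.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 20 observations (rows) by 4 variables (columns).
X = rng.normal(size=(20, 4))
Xc = X - X.mean(axis=0)           # zero-mean each variable
cov = np.cov(Xc, rowvar=False)    # 4 x 4 covariance matrix

# eigh returns eigenvalues in ascending order; take the largest.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
lam = eigenvalues[-1]
pc1 = eigenvectors[:, -1]

# Both pc1 and -pc1 satisfy the eigenvector equation cov @ v = lam * v.
for v in (pc1, -pc1):
    assert np.allclose(cov @ v, lam * v)

# Scores flip sign, but the variance they capture is unchanged.
scores_pos = Xc @ pc1
scores_neg = Xc @ (-pc1)
same_variance = bool(np.isclose(scores_pos.var(ddof=1), scores_neg.var(ddof=1)))

print(same_variance)
```

A plot drawn with the flipped PC is a mirror image of the original, so the relative positions of the objects are preserved.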

 

Related Topics:

Performing PCA for a Dataset

Creating a 3D Score Plot

Tutorial 5: Principal Component Analysis (PCA)