Tutorial 3: Introduction

This tutorial introduces you to data normalization and Jarvis-Patrick partitional clustering. The results of the clustering experiments are viewed in a matrix tree plot.

Skills You Will Learn:

How to import gene expression data from a file into the GeneLinker™ database.

How to normalize data.

How to estimate missing values.

How to perform a partitional clustering experiment.

How to view experiment results in a matrix tree plot.

Jarvis-Patrick Partitional Clustering

Also known as mutual nearest neighbors clustering, Jarvis-Patrick clustering is a very fast non-stochastic clustering method. It has seen considerable use in the cheminformatics community, but has not been widely used in gene expression analysis until now.

Jarvis-Patrick clustering depends on two user-configurable parameters: the number of nearest Neighbors to Examine for each item, and the number of those neighbors that two items (genes, for instance) must share in order to be clustered together. The two items must also be among each other’s nearest neighbors. The appropriate values for these parameters depend on the data being clustered and the objective of the analysis. Starting with one or two common neighbors out of five or six nearest neighbors tends to produce a manageable number of clusters on datasets of 100 to 200 items. The larger the list of Neighbors to Examine, the more likely it is that common neighbors will be found to join any two items, so increasing this number tends to lead to fewer and larger clusters. Conversely, the more common neighbors are required, the fewer joins are found, which tends to lead to more and smaller clusters.
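The joining rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not GeneLinker™'s implementation: the function name jarvis_patrick, the parameter names neighbors_to_examine and neighbors_in_common, the use of Euclidean distance, and the choice to leave an item out of its own neighbor list are all assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)


def jarvis_patrick(points, neighbors_to_examine, neighbors_in_common):
    """Return one cluster label per point (labels are arbitrary integers)."""
    n = len(points)

    # Build each item's nearest-neighbor list.
    # Assumption: an item is not counted among its own neighbors.
    nearest = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        nearest.append(set(others[:neighbors_to_examine]))

    # Union-find structure used to merge joined items into clusters.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        # Both conditions from the text: the items are in each other's
        # neighbor lists, and they share enough common neighbors.
        mutual = i in nearest[j] and j in nearest[i]
        shared = len(nearest[i] & nearest[j])
        if mutual and shared >= neighbors_in_common:
            parent[find(i)] = find(j)

    return [find(i) for i in range(n)]


# Hypothetical toy data: two tight groups of three items and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (20.0, 20.0)]

loose = jarvis_patrick(points, neighbors_to_examine=2, neighbors_in_common=1)
strict = jarvis_patrick(points, neighbors_to_examine=2, neighbors_in_common=2)
print(len(set(loose)), len(set(strict)))
```

On this toy data, requiring one common neighbor yields two clusters plus a singleton, while requiring two common neighbors leaves every item on its own, which mirrors the trade-off between the two parameters described above.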

A typical Jarvis-Patrick clustering contains a wide variety of cluster sizes. There are usually a significant number of singleton genes, along with a small number of very large clusters and a smattering of fairly tight clusters containing up to 10 genes. In addition, the clusters are not constrained to be as globular as in, for example, K-Means clustering. Combined with the number of singletons, this means that a centroid plot will often not illustrate the clusters’ characteristics very clearly. Instead, a Matrix Tree Plot is recommended for a comparative overview of the clusters.

Assumptions

This tutorial assumes you have already completed Tutorial 1 and Tutorial 2, so that the Spinal_cord and t_matrix datasets are present in the Experiments navigator. If either dataset is missing, follow the Data Import procedure in Tutorial 1 or Tutorial 2 to import it.

Tutorial Length

This tutorial is split into two parts: part A deals with the Spinal_cord dataset and part B deals with the t_matrix dataset. The entire tutorial should take about 20 minutes, depending on how long you spend investigating the data, and how fast your machine is.

If you must stop partway through the tutorial, simply exit the program by selecting Exit from the File menu. GeneLinker™ automatically saves the data and the experiments you have performed up to that point. The next time you start GeneLinker™, you can continue with the next step in the tutorial.