
Overview
KMeans clustering generates a specific number of disjoint, flat (nonhierarchical) clusters. It is well suited to generating globular clusters. The KMeans method is numerical, unsupervised, nondeterministic and iterative.
KMeans Algorithm Properties
There are always K clusters.
There is always at least one item in each cluster.
The clusters are nonhierarchical and they do not overlap.
Every member of a cluster is closer to its own cluster than to any other cluster, where closeness is measured by the chosen distance metric and need not involve the 'center' of the clusters.
The KMeans Algorithm Process
The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters resulting in clusters that have roughly the same number of data points.
For each data point:
Calculate the distance from the data point to each cluster.
If the data point is closest to its own cluster, leave it where it is; otherwise, move it into the closest cluster.
Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.
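The steps above can be sketched as follows. This is a minimal illustration, not GeneLinker™ code, and it uses the conventional cluster centroid as the measure of closeness (GeneLinker™ itself measures closeness with a linkage distance, as described later in this topic):

```python
import random

def kmeans_partition(points, k, seed=0):
    """Sketch of the process above: random initial partition of
    roughly equal size, then move each data point to its closest
    cluster, repeating until a full pass moves no point.
    Closeness here is distance to the cluster centroid."""
    rng = random.Random(seed)
    # Randomly assign the data points to K clusters of roughly equal size.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    while True:
        # Compute each cluster's centroid (coordinate-wise mean).
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                # Guard against an empty cluster by seeding it with a random point.
                centroids.append(points[rng.randrange(len(points))])
        # Move each data point to the cluster with the nearest centroid.
        moved = False
        for i, p in enumerate(points):
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            if nearest != labels[i]:
                labels[i] = nearest
                moved = True
        # Stop when a complete pass results in no data point moving.
        if not moved:
            return labels
```

Because the initial partition is random, rerunning with a different seed can converge to a different final clustering, which is the sensitivity discussed next.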
The choice of initial partition can greatly affect the final clusters that result, in terms of intercluster and intracluster distances and cohesion.
KMeans Clustering in GeneLinker™
The version of the KMeans algorithm used in GeneLinker™ differs from the conventional KMeans algorithm in that GeneLinker™ does not compute the centroid of the clusters to measure the distance from a data point to a cluster. Instead, the algorithm uses a specified linkage distance metric. The use of the Average Linkage distance metric most closely corresponds to conventional KMeans, but can produce different results in many cases.
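As an illustration of the difference, the following sketch contrasts an average-linkage point-to-cluster distance with the conventional centroid distance. The function names are illustrative only and are not GeneLinker™ API:

```python
import math

def euclidean(p, q):
    """Plain Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage_distance(point, cluster):
    """Average Linkage: the mean of the distances from the point
    to every member of the cluster."""
    return sum(euclidean(point, m) for m in cluster) / len(cluster)

def centroid_distance(point, cluster):
    """Conventional KMeans: the distance from the point to the
    cluster centroid (coordinate-wise mean)."""
    centroid = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    return euclidean(point, centroid)
```

For a spread-out cluster the two measures can disagree: a point midway between two members is at distance zero from their centroid but at a positive average-linkage distance, which is why the two variants can assign points differently.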
Advantages to Using this Technique
With a large number of variables, KMeans may be computationally faster than hierarchical clustering (if K is small).
KMeans may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages to Using this Technique
Difficulty in comparing the quality of the clusters produced (e.g. different initial partitions or different values of K affect the outcome).
The fixed number of clusters makes it difficult to predict what K should be.
Does not work well with nonglobular clusters.
Different initial partitions can result in different final clusters. It is helpful to rerun the program using the same as well as different K values and compare the results.
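Since there is no single agreed measure of cluster quality, a common proxy when comparing reruns (e.g. from different initial partitions) is the total within-cluster sum of squares: smaller is tighter. This is an illustrative sketch, not part of GeneLinker™:

```python
def within_cluster_ss(points, labels, k):
    """Total within-cluster sum of squared distances from each
    point to its cluster centroid; a simple score for comparing
    two clusterings of the same data."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue
        centroid = tuple(sum(xs) / len(members) for xs in zip(*members))
        total += sum(sum((a - b) ** 2 for a, b in zip(p, centroid))
                     for p in members)
    return total
```

Running the clustering several times and keeping the labeling with the lowest score is one practical way to cope with the sensitivity to the initial partition.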
Note the warning in the Pearson Correlation and Pearson Squared Distance Metric topic on the use of KMeans clustering.