Abstract
Cluster analysis has become a very popular tool for the exploration of high-dimensional data. Dozens of algorithms have been proposed, each with its own merits and shortcomings. It is not known to what extent various methods give the same results, nor is it even clear how to measure the similarity between the outputs of two distinct algorithms. Here we propose a statistic that is designed to measure the "correlation" between two clustering methods when applied to a particular data set. In contrast to the Rand index, the most common statistic used for this purpose, the method is very fast. We provide an algorithm that approximates the statistic and demonstrate two of its possible uses. Finally, we use this statistic to understand the clustering in a data set in the context that motivated this work: analysis of a gene expression experiment.
| Field | Value |
|---|---|
| Original language | English (US) |
| Pages (from-to) | 19-33 |
| Number of pages | 15 |
| Journal | Statistica Sinica |
| Volume | 15 |
| Issue number | 1 |
| State | Published - Jan 1 2005 |
Keywords
- Cluster analysis
- Cohen's kappa
- Metropolis algorithm
- Microarray
- Traveling salesman problem