TY - JOUR
T1 - A ground truth based comparative study on clustering of gene expression data
AU - Zhu, Yitan
AU - Wang, Zuyi
AU - Miller, David J.
AU - Clarke, Robert
AU - Xuan, Jianhua
AU - Hoffman, Eric P.
AU - Wang, Yue
PY - 2008
Y1 - 2008
N2 - Given the variety of available clustering methods for gene expression data analysis, it is important to develop an appropriate and rigorous validation scheme to assess the performance and limitations of the most widely used clustering algorithms. In this paper, we present a ground truth based comparative study on the functionality, accuracy, and stability of five data clustering methods, namely hierarchical clustering, K-means clustering, self-organizing maps, standard finite normal mixture fitting, and a caBIG™ toolkit (Visual Statistical Data Analyzer - VISDA), tested on sample clustering of seven published microarray gene expression datasets and one synthetic dataset. We examined the performance of these algorithms in both data-sufficient and data-insufficient cases using quantitative performance measures, including cluster number detection accuracy and mean and standard deviation of partition accuracy. The experimental results showed that VISDA, an interactive coarse-to-fine maximum likelihood fitting algorithm, is a solid performer on most of the datasets, while K-means clustering and self-organizing maps optimized by the mean squared compactness criterion generally produce more stable solutions than the other methods.
KW - Clustering evaluation
KW - Comparative study
KW - Gene expression data
KW - Sample clustering
UR - http://www.scopus.com/inward/record.url?scp=42649139881&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=42649139881&partnerID=8YFLogxK
U2 - 10.2741/2972
DO - 10.2741/2972
M3 - Article
C2 - 18508478
AN - SCOPUS:42649139881
SN - 2768-6701
VL - 13
SP - 3839
EP - 3849
JO - Frontiers in Bioscience
JF - Frontiers in Bioscience
IS - 10
ER -