TY - JOUR
T1 - Penalized model-based clustering with unconstrained covariance matrices
AU - Zhou, Hui
AU - Pan, Wei
AU - Shen, Xiaotong
PY - 2009
Y1 - 2009
N2 - Clustering is one of the most useful tools for high-dimensional analysis, e.g., for microarray data. It becomes challenging in presence of a large number of noise variables, which may mask underlying clustering structures. Therefore, noise removal through variable selection is necessary. One effective way is regularization for simultaneous parameter estimation and variable selection in model-based clustering. However, existing methods focus on regularizing the mean parameters representing centers of clusters, ignoring dependencies among variables within clusters, leading to incorrect orientations or shapes of the resulting clusters. In this article, we propose a regularized Gaussian mixture model with general covariance matrices, taking various dependencies into account. At the same time, this approach shrinks the means and covariance matrices, achieving better clustering and variable selection. To overcome one technical challenge in estimating possibly large covariance matrices, we derive an E-M algorithm to utilize the graphical lasso (Friedman et al. 2007) for parameter estimation. Numerical examples, including applications to microarray gene expression data, demonstrate the utility of the proposed method.
AB - Clustering is one of the most useful tools for high-dimensional analysis, e.g., for microarray data. It becomes challenging in presence of a large number of noise variables, which may mask underlying clustering structures. Therefore, noise removal through variable selection is necessary. One effective way is regularization for simultaneous parameter estimation and variable selection in model-based clustering. However, existing methods focus on regularizing the mean parameters representing centers of clusters, ignoring dependencies among variables within clusters, leading to incorrect orientations or shapes of the resulting clusters. In this article, we propose a regularized Gaussian mixture model with general covariance matrices, taking various dependencies into account. At the same time, this approach shrinks the means and covariance matrices, achieving better clustering and variable selection. To overcome one technical challenge in estimating possibly large covariance matrices, we derive an E-M algorithm to utilize the graphical lasso (Friedman et al. 2007) for parameter estimation. Numerical examples, including applications to microarray gene expression data, demonstrate the utility of the proposed method.
UR - http://www.scopus.com/inward/record.url?scp=77949505112&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77949505112&partnerID=8YFLogxK
U2 - 10.1214/09-EJS487
DO - 10.1214/09-EJS487
M3 - Article
SN - 1935-7524
VL - 3
SP - 1473
EP - 1496
JO - Electronic Journal of Statistics
JF - Electronic Journal of Statistics
ER -