TY - JOUR
T1 - The cluster graphical lasso for improved estimation of Gaussian graphical models
AU - Tan, Kean Ming
AU - Witten, Daniela
AU - Shojaie, Ali
N1 - Publisher Copyright:
© 2014 Elsevier B.V. All rights reserved.
PY - 2015/5
Y1 - 2015/5
N2 - The task of estimating a Gaussian graphical model in the high-dimensional setting is considered. The graphical lasso, which involves maximizing the Gaussian log likelihood subject to a lasso penalty, is a well-studied approach for this task. A surprising connection between the graphical lasso and hierarchical clustering is introduced: the graphical lasso in effect performs a two-step procedure, in which (1) single linkage hierarchical clustering is performed on the variables in order to identify connected components, and then (2) a penalized log likelihood is maximized on the subset of variables within each connected component. Thus, the graphical lasso determines the connected components of the estimated network via single linkage clustering. The single linkage clustering is known to perform poorly in certain finite-sample settings. Therefore, the cluster graphical lasso, which involves clustering the features using an alternative to single linkage clustering, and then performing the graphical lasso on the subset of variables within each cluster, is proposed. Model selection consistency for this technique is established, and its improved performance relative to the graphical lasso is demonstrated in a simulation study, as well as in applications to a university webpage and a gene expression data sets.
AB - The task of estimating a Gaussian graphical model in the high-dimensional setting is considered. The graphical lasso, which involves maximizing the Gaussian log likelihood subject to a lasso penalty, is a well-studied approach for this task. A surprising connection between the graphical lasso and hierarchical clustering is introduced: the graphical lasso in effect performs a two-step procedure, in which (1) single linkage hierarchical clustering is performed on the variables in order to identify connected components, and then (2) a penalized log likelihood is maximized on the subset of variables within each connected component. Thus, the graphical lasso determines the connected components of the estimated network via single linkage clustering. The single linkage clustering is known to perform poorly in certain finite-sample settings. Therefore, the cluster graphical lasso, which involves clustering the features using an alternative to single linkage clustering, and then performing the graphical lasso on the subset of variables within each cluster, is proposed. Model selection consistency for this technique is established, and its improved performance relative to the graphical lasso is demonstrated in a simulation study, as well as in applications to a university webpage and a gene expression data sets.
KW - Hierarchical clustering
KW - High-dimensional setting
KW - Network
KW - Single linkage clustering
KW - Sparsity
UR - http://www.scopus.com/inward/record.url?scp=84920720899&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84920720899&partnerID=8YFLogxK
U2 - 10.1016/j.csda.2014.11.015
DO - 10.1016/j.csda.2014.11.015
M3 - Article
AN - SCOPUS:84920720899
SN - 0167-9473
VL - 85
SP - 23
EP - 36
JO - Computational Statistics and Data Analysis
JF - Computational Statistics and Data Analysis
ER -