High-dimensional variable selection with the plaid mixture model for clustering

Thierry Chekouo, Alejandro Murua

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

In high-dimensional data, the number of covariates is considerably larger than the sample size. We propose a method for analyzing such data that performs clustering and variable selection simultaneously. The method is inspired by the plaid model and may be seen as a multiplicative mixture model that allows for overlapping clustering. Unlike conventional clustering, within this model an observation may be explained by several clusters, which makes it especially suitable for gene expression data. Parameter estimation is performed with the Monte Carlo expectation-maximization algorithm and importance sampling. Using extensive simulations and comparisons with competing methods, we show the advantages of our methodology in terms of both variable selection and clustering. An application of our approach to gene expression data for kidney renal cell carcinoma, taken from The Cancer Genome Atlas, validates some previously identified cancer biomarkers.
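
To make the idea of overlapping clustering concrete, here is a minimal sketch, assuming a simplified additive plaid-style mean in which an observation's expected value is a background level plus the effects of every cluster it belongs to. This is not the multiplicative mixture model or the Monte Carlo EM estimation described in the abstract; the function name `plaid_mean`, the variable names, and all numeric values are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a simplified plaid-style mean with overlapping clusters.
# All names and numbers are hypothetical; nothing here is taken from the paper.

def plaid_mean(background, cluster_effects, memberships):
    """Return the background level plus the effect of every cluster
    whose membership indicator is non-zero for this observation."""
    return background + sum(
        effect for effect, member in zip(cluster_effects, memberships) if member
    )

background = 1.0
cluster_effects = [2.0, -0.5, 3.0]  # one effect per cluster (hypothetical values)

# Rows are observations; columns are binary membership indicators per cluster.
memberships = [
    [1, 0, 0],  # belongs to cluster 0 only
    [1, 1, 0],  # belongs to clusters 0 and 1 (overlapping membership)
    [0, 0, 0],  # belongs to no cluster, so only the background applies
]

means = [plaid_mean(background, cluster_effects, m) for m in memberships]
print(means)
```

The second observation shows the difference from conventional clustering: its mean combines the effects of two clusters rather than being assigned to exactly one.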

Original language: English (US)
Pages (from-to): 1475-1496
Number of pages: 22
Journal: Computational Statistics
Volume: 33
Issue number: 3
State: Published - Sep 1 2018

Keywords

  • Classification
  • Kidney cancer genomic data
  • Model selection
  • Monte Carlo EM
  • Multiplicative mixture model
