A data-driven approach to conditional screening of high-dimensional variables

Hyokyoung G. Hong, Lan Wang, Xuming He

Research output: Contribution to journal, Article

5 Citations (Scopus)

Abstract

Marginal screening is a widely applied technique for reducing the dimensionality of the data when the number of potential features overwhelms the sample size. By their nature, however, marginal screening procedures have difficulty identifying so-called hidden variables, which are jointly important but have weak marginal associations with the response variable. Failing to include a hidden variable in the screening stage has two undesirable consequences: (1) important features are missed in model selection, and (2) biased inference is likely to occur in the subsequent analysis. Motivated by recent work in conditional screening, we propose a data-driven conditional screening algorithm that is computationally efficient, enjoys the sure screening property under weaker assumptions on the model, and works robustly in a variety of settings to reduce false negatives of hidden variables. Numerical comparisons with alternative screening procedures are also made to shed light on the relative merit of the proposed method. We illustrate the proposed methodology using a leukaemia microarray data example.
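
The hidden-variable problem described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' data-driven algorithm: it assumes the conditioning set is already known (column 0), whereas the paper's contribution is to choose it from the data. The function names, the simulated model and all parameter values are hypothetical choices made for illustration. Column 1 is built to have a nonzero coefficient but zero population correlation with the response, so marginal correlation screening tends to rank it low, while screening on partial correlations given column 0 picks it up.

```python
import numpy as np


def simulate(n=200, p=500, rho=0.5, seed=0):
    """Toy linear model in which column 1 is a hidden variable (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # Give column 1 correlation rho with column 0, keeping unit variance.
    X[:, 1] = rho * X[:, 0] + np.sqrt(1 - rho**2) * X[:, 1]
    # Population cov(X1, y) = rho - rho = 0, although X1 has coefficient -rho.
    y = X[:, 0] - rho * X[:, 1] + 0.5 * rng.standard_normal(n)
    return X, y


def marginal_scores(X, y):
    """Absolute sample correlation of each column of X with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))


def conditional_scores(X, y, cond_idx):
    """Absolute partial correlation of each column with y given X[:, cond_idx].

    Columns in cond_idx receive a score of 0 because they are already selected.
    """
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X[:, cond_idx]])
    # Residualize y and every column of X on the intercept and conditioning set.
    coef_y, *_ = np.linalg.lstsq(Z, y, rcond=None)
    coef_X, *_ = np.linalg.lstsq(Z, X, rcond=None)
    ry = y - Z @ coef_y
    RX = X - Z @ coef_X
    keep = np.ones(p, dtype=bool)
    keep[cond_idx] = False
    scores = np.zeros(p)
    scores[keep] = np.abs(RX[:, keep].T @ ry) / (
        np.linalg.norm(RX[:, keep], axis=0) * np.linalg.norm(ry)
    )
    return scores


X, y = simulate()
m = marginal_scores(X, y)
c = conditional_scores(X, y, cond_idx=[0])
# Rank 1 means the largest score among all columns.
print("marginal rank of hidden variable:", int((m > m[1]).sum()) + 1)
print("conditional rank of hidden variable:", int((c > c[1]).sum()) + 1)
```

Residualizing on the conditioning set first is equivalent to fitting a separate regression of the response on the conditioning variables plus each candidate, but needs only two least-squares solves instead of one per candidate.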

Original language: English (US)
Pages (from-to): 200-212
Number of pages: 13
Journal: Stat
Volume: 5
Issue number: 1
DOI: 10.1002/sta4.115
State: Published - Jan 1 2016

Keywords

  • conditional screening
  • false negative
  • feature screening
  • high dimension
  • sparse principal component analysis
  • sure screening property

Cite this

A data-driven approach to conditional screening of high-dimensional variables. / Hong, Hyokyoung G.; Wang, Lan; He, Xuming.

In: Stat, Vol. 5, No. 1, 01.01.2016, p. 200-212.

@article{754b6b33aa83411db627b363df0e0425,
title = "A data-driven approach to conditional screening of high-dimensional variables",
keywords = "conditional screening, false negative, feature screening, high dimension, sparse principal component analysis, sure screening property",
author = "Hong, {Hyokyoung G.} and Lan Wang and Xuming He",
year = "2016",
month = "1",
day = "1",
doi = "10.1002/sta4.115",
language = "English (US)",
volume = "5",
pages = "200--212",
journal = "Stat",
issn = "2049-1573",
publisher = "Wiley-Blackwell Publishing Ltd",
number = "1",

}

TY  - JOUR
T1  - A data-driven approach to conditional screening of high-dimensional variables
AU  - Hong, Hyokyoung G.
AU  - Wang, Lan
AU  - He, Xuming
PY  - 2016/1/1
Y1  - 2016/1/1
KW  - conditional screening
KW  - false negative
KW  - feature screening
KW  - high dimension
KW  - sparse principal component analysis
KW  - sure screening property
UR  - http://www.scopus.com/inward/record.url?scp=84994876279&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84994876279&partnerID=8YFLogxK
U2  - 10.1002/sta4.115
DO  - 10.1002/sta4.115
M3  - Article
AN  - SCOPUS:84994876279
VL  - 5
SP  - 200
EP  - 212
JO  - Stat
JF  - Stat
SN  - 2049-1573
IS  - 1
ER  - 