Variable selection after screening: with or without data splitting?

Xiaoyi Zhu, Yuhong Yang

Research output: Contribution to journal (Article)

4 Scopus citations

Abstract

High-dimensional data sets are now frequently encountered in many scientific fields. In order to select a sparse set of predictors that have predictive power and/or provide insight into which predictors really influence the response, a preliminary variable screening is typically done, often informally. Fan and Lv (J R Stat Soc Ser B 70:849–911, 2008) proposed sure independence screening (SIS) to reduce the dimension of the set of predictors from ultra-high to a moderate scale below the sample size. One may then apply a familiar variable selection technique. While this approach has become popular, the screening bias issue has been largely ignored. The screening bias may lead to the final selection of a number of predictors that have little or no value for prediction or explanation. In this paper we set out to examine this screening bias both theoretically and numerically, and to compare the approach with an alternative that utilizes data splitting. The simulation results and real bioinformatics examples show that data splitting can significantly reduce the screening bias for variable selection and improve prediction accuracy as well.
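The contrast the abstract draws can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes marginal-correlation screening in the spirit of SIS, and it uses a simple correlation threshold as a stand-in for the "familiar variable selection technique". All function names, the sample sizes, and the cutoff `z / sqrt(n)` are hypothetical choices made for illustration.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def screen(X, y, rows, d):
    """SIS-style screening: keep the d columns with the largest absolute
    marginal correlation with y, computed on the given rows only."""
    ysub = [y[i] for i in rows]
    scores = []
    for j in range(len(X[0])):
        col = [X[i][j] for i in rows]
        scores.append((abs_corr(col, ysub), j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:d])


def select(X, y, rows, candidates, z=2.0):
    """Stand-in selection step (an assumption, not the paper's method):
    keep candidates whose marginal correlation on `rows` exceeds z / sqrt(n)."""
    ysub = [y[i] for i in rows]
    cutoff = z / math.sqrt(len(rows))
    return [
        j for j in candidates
        if abs_corr([X[i][j] for i in rows], ysub) > cutoff
    ]


def simulate(n=100, p=500, d=20, seed=0):
    """Return (selection without splitting, selection with splitting)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    # Only the first three predictors influence the response.
    y = [2 * r[0] - 2 * r[1] + 1.5 * r[2] + rng.gauss(0, 1) for r in X]
    all_rows = list(range(n))
    half_a, half_b = all_rows[: n // 2], all_rows[n // 2:]
    # Without splitting: screen and select on the same observations.
    no_split = select(X, y, all_rows, screen(X, y, all_rows, d))
    # With splitting: screen on one half, select on the other half.
    split = select(X, y, half_b, screen(X, y, half_a, d))
    return no_split, split


if __name__ == "__main__":
    no_split, split = simulate()
    print("selected without splitting:", no_split)
    print("selected with splitting:   ", split)
```

In the no-splitting branch, noise predictors that survive screening were chosen precisely because they correlate with the response on those same observations, so they tend to pass the selection step as well. In the splitting branch, the selection step sees fresh observations, so a screened noise predictor has to pass an independent check. The price is a smaller sample for each stage.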

Original language: English (US)
Pages (from-to): 191-203
Number of pages: 13
Journal: Computational Statistics
Volume: 30
Issue number: 1
DOIs
State: Published - Jan 1 2014

Keywords

  • Model selection
  • Prediction
  • Sparse regression
  • Variable screening
