ThrEEBoost: Thresholded Boosting for Variable Selection and Prediction via Estimating Equations

Ben Brown, Christopher J. Miller, Julian Wolfson

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Most variable selection techniques for high-dimensional models are designed to be used in settings where observations are independent and completely observed. At the same time, there is a rich literature on approaches to estimation of low-dimensional parameters in the presence of correlation, missingness, measurement error, selection bias, and other characteristics of real data. In this article, we present ThrEEBoost (Thresholded EEBoost), a general-purpose variable selection technique which can accommodate such problem characteristics by replacing the gradient of the loss with an estimating function. ThrEEBoost generalizes the previously proposed EEBoost algorithm (Wolfson 2011) by allowing the number of regression coefficients updated at each step to be controlled by a thresholding parameter. Different thresholding parameter values yield different variable selection paths, greatly diversifying the set of models that can be explored; the optimal degree of thresholding can be chosen by cross-validation. ThrEEBoost was evaluated using simulation studies to assess the effects of different threshold values on prediction error, sensitivity, specificity, and the number of iterations to identify minimum prediction error under both sparse and nonsparse true models with correlated continuous outcomes. We show that when the true model is sparse, ThrEEBoost achieves similar prediction error to EEBoost while requiring fewer iterations to locate the set of coefficients yielding the minimum error. When the true model is less sparse, ThrEEBoost has lower prediction error than EEBoost and also finds the point yielding the minimum error more quickly. The technique is illustrated by applying it to the problem of identifying predictors of weight change in a longitudinal nutrition study. Supplementary materials are available online.
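The abstract describes the core update: replace the loss gradient with an estimating function and, at each boosting step, update every coefficient whose estimating-function component is within a threshold of the largest one. A minimal sketch in Python, assuming a linear model with an independence working correlation (so the estimating function reduces to the least-squares score); the function and parameter names here are illustrative, not the authors' implementation:

```python
import numpy as np

def threeboost_sketch(X, y, tau=0.8, eps=0.01, n_iter=200):
    """Thresholded boosting on an estimating function (illustrative sketch).

    The estimating function here is the independence-working-model linear
    score U(b) = X'(y - Xb); `tau` in (0, 1] is the thresholding parameter
    (tau = 1 updates only the largest component, as in EEBoost) and `eps`
    is the step size. All names are illustrative assumptions.
    """
    _, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        U = X.T @ (y - X @ beta)           # estimating function at current beta
        cutoff = tau * np.max(np.abs(U))   # threshold relative to largest component
        active = np.abs(U) >= cutoff       # coefficients updated this step
        beta[active] += eps * np.sign(U[active])
    return beta

# Toy usage: sparse truth with three nonzero coefficients.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
beta_hat = threeboost_sketch(X, y, tau=0.9, eps=0.05, n_iter=400)
```

Smaller `tau` updates more coefficients per iteration, which is what diversifies the variable selection paths; cross-validating over a grid of `tau` values then picks the path with the lowest prediction error.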

Original language: English (US)
Pages (from-to): 579-588
Number of pages: 10
Journal: Journal of Computational and Graphical Statistics
Volume: 26
Issue number: 3
DOI: 10.1080/10618600.2016.1247005
State: Published - Jul 3 2017

Keywords

  • Correlation
  • GEE
  • Thresholding

Cite this

ThrEEBoost: Thresholded Boosting for Variable Selection and Prediction via Estimating Equations. / Brown, Ben; Miller, Christopher J.; Wolfson, Julian.

In: Journal of Computational and Graphical Statistics, Vol. 26, No. 3, 03.07.2017, p. 579-588.

Research output: Contribution to journal › Article

@article{e1c28daaed9947bebc7626f32a8ca347,
title = "ThrEEBoost: Thresholded Boosting for Variable Selection and Prediction via Estimating Equations",
abstract = "Most variable selection techniques for high-dimensional models are designed to be used in settings where observations are independent and completely observed. At the same time, there is a rich literature on approaches to estimation of low-dimensional parameters in the presence of correlation, missingness, measurement error, selection bias, and other characteristics of real data. In this article, we present ThrEEBoost (Thresholded EEBoost), a general-purpose variable selection technique which can accommodate such problem characteristics by replacing the gradient of the loss with an estimating function. ThrEEBoost generalizes the previously proposed EEBoost algorithm (Wolfson 2011) by allowing the number of regression coefficients updated at each step to be controlled by a thresholding parameter. Different thresholding parameter values yield different variable selection paths, greatly diversifying the set of models that can be explored; the optimal degree of thresholding can be chosen by cross-validation. ThrEEBoost was evaluated using simulation studies to assess the effects of different threshold values on prediction error, sensitivity, specificity, and the number of iterations to identify minimum prediction error under both sparse and nonsparse true models with correlated continuous outcomes. We show that when the true model is sparse, ThrEEBoost achieves similar prediction error to EEBoost while requiring fewer iterations to locate the set of coefficients yielding the minimum error. When the true model is less sparse, ThrEEBoost has lower prediction error than EEBoost and also finds the point yielding the minimum error more quickly. The technique is illustrated by applying it to the problem of identifying predictors of weight change in a longitudinal nutrition study. Supplementary materials are available online.",
keywords = "Correlation, GEE, Thresholding",
author = "Brown, Ben and Miller, {Christopher J.} and Wolfson, Julian",
year = "2017",
month = "7",
day = "3",
doi = "10.1080/10618600.2016.1247005",
language = "English (US)",
volume = "26",
pages = "579--588",
journal = "Journal of Computational and Graphical Statistics",
issn = "1061-8600",
publisher = "American Statistical Association",
number = "3",
}

TY - JOUR

T1 - ThrEEBoost

T2 - Thresholded Boosting for Variable Selection and Prediction via Estimating Equations

AU - Brown, Ben

AU - Miller, Christopher J.

AU - Wolfson, Julian

PY - 2017/7/3

Y1 - 2017/7/3

N2 - Most variable selection techniques for high-dimensional models are designed to be used in settings where observations are independent and completely observed. At the same time, there is a rich literature on approaches to estimation of low-dimensional parameters in the presence of correlation, missingness, measurement error, selection bias, and other characteristics of real data. In this article, we present ThrEEBoost (Thresholded EEBoost), a general-purpose variable selection technique which can accommodate such problem characteristics by replacing the gradient of the loss with an estimating function. ThrEEBoost generalizes the previously proposed EEBoost algorithm (Wolfson 2011) by allowing the number of regression coefficients updated at each step to be controlled by a thresholding parameter. Different thresholding parameter values yield different variable selection paths, greatly diversifying the set of models that can be explored; the optimal degree of thresholding can be chosen by cross-validation. ThrEEBoost was evaluated using simulation studies to assess the effects of different threshold values on prediction error, sensitivity, specificity, and the number of iterations to identify minimum prediction error under both sparse and nonsparse true models with correlated continuous outcomes. We show that when the true model is sparse, ThrEEBoost achieves similar prediction error to EEBoost while requiring fewer iterations to locate the set of coefficients yielding the minimum error. When the true model is less sparse, ThrEEBoost has lower prediction error than EEBoost and also finds the point yielding the minimum error more quickly. The technique is illustrated by applying it to the problem of identifying predictors of weight change in a longitudinal nutrition study. Supplementary materials are available online.

AB - Most variable selection techniques for high-dimensional models are designed to be used in settings where observations are independent and completely observed. At the same time, there is a rich literature on approaches to estimation of low-dimensional parameters in the presence of correlation, missingness, measurement error, selection bias, and other characteristics of real data. In this article, we present ThrEEBoost (Thresholded EEBoost), a general-purpose variable selection technique which can accommodate such problem characteristics by replacing the gradient of the loss with an estimating function. ThrEEBoost generalizes the previously proposed EEBoost algorithm (Wolfson 2011) by allowing the number of regression coefficients updated at each step to be controlled by a thresholding parameter. Different thresholding parameter values yield different variable selection paths, greatly diversifying the set of models that can be explored; the optimal degree of thresholding can be chosen by cross-validation. ThrEEBoost was evaluated using simulation studies to assess the effects of different threshold values on prediction error, sensitivity, specificity, and the number of iterations to identify minimum prediction error under both sparse and nonsparse true models with correlated continuous outcomes. We show that when the true model is sparse, ThrEEBoost achieves similar prediction error to EEBoost while requiring fewer iterations to locate the set of coefficients yielding the minimum error. When the true model is less sparse, ThrEEBoost has lower prediction error than EEBoost and also finds the point yielding the minimum error more quickly. The technique is illustrated by applying it to the problem of identifying predictors of weight change in a longitudinal nutrition study. Supplementary materials are available online.

KW - Correlation

KW - GEE

KW - Thresholding

UR - http://www.scopus.com/inward/record.url?scp=85017466617&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85017466617&partnerID=8YFLogxK

U2 - 10.1080/10618600.2016.1247005

DO - 10.1080/10618600.2016.1247005

M3 - Article

AN - SCOPUS:85017466617

VL - 26

SP - 579

EP - 588

JO - Journal of Computational and Graphical Statistics

JF - Journal of Computational and Graphical Statistics

SN - 1061-8600

IS - 3

ER -