Penalized regression and model selection methods for polygenic scores on summary statistics

Jack Pattee, Wei Pan

Research output: Contribution to journal › Article › peer-review

22 Scopus citations

Abstract

Polygenic scores quantify the genetic risk associated with a given phenotype and are widely used to predict the risk of complex diseases. There has been recent interest in developing methods to construct polygenic risk scores using summary statistic data. We propose a method to construct polygenic risk scores via penalized regression using summary statistic data and publicly available reference data. Our method is similar to the existing method LassoSum and extends its framework to the Truncated Lasso Penalty (TLP) and the elastic net. We show via simulation and real data application that the TLP improves predictive accuracy compared with the LASSO while imposing additional sparsity where appropriate. To facilitate model selection in the absence of validation data, we propose methods for estimating the model-fitting criteria AIC and BIC. These methods approximate the AIC and BIC when a polygenic risk score has been estimated on summary statistic data and no validation data are available. Additionally, we propose the so-called quasi-correlation metric, which quantifies the predictive accuracy of a polygenic risk score applied to out-of-sample data for which we have only summary statistic information. Together, these methods facilitate the estimation and model selection of polygenic risk scores on summary statistic data, and the application of these scores to out-of-sample data for which we have only summary statistic information. We demonstrate the utility of these methods by applying them to GWA studies of lipids, height, and lung cancer.
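
As a rough illustration of the kind of estimator the abstract describes, the sketch below fits a lasso on summary statistics by coordinate descent, in the spirit of LassoSum, and then moves toward the TLP with a difference-of-convex reweighting. It is a sketch under stated assumptions, not the authors' implementation. The objective b'Rb - 2b'r + 2*lam*sum(w_j*|b_j|), the function names (soft_threshold, summary_stat_lasso, summary_stat_tlp), and the 0/1 reweighting rule are all illustrative choices. Here r stands for SNP-phenotype correlations derived from summary statistics and R for an LD correlation matrix estimated from reference data.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate updates."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def summary_stat_lasso(r, R, lam, weights=None, n_iter=200, tol=1e-8):
    """Coordinate descent for b'Rb - 2 b'r + 2*lam*sum(w_j*|b_j|).

    r: list of SNP-phenotype correlations (assumed input).
    R: LD correlation matrix as a list of lists (assumed input).
    weights: optional per-SNP penalty weights; defaults to all ones.
    """
    p = len(r)
    w = weights if weights is not None else [1.0] * p
    beta = [0.0] * p
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Partial residual: r_j minus the contribution of the other SNPs.
            partial = r[j] - sum(R[j][k] * beta[k] for k in range(p) if k != j)
            new = soft_threshold(partial, lam * w[j]) / R[j][j]
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta


def summary_stat_tlp(r, R, lam, tau, n_outer=10):
    """Difference-of-convex approximation to a penalty lam*sum(min(|b_j|, tau)).

    Each outer step solves a weighted lasso in which coefficients whose
    current magnitude exceeds tau are left unpenalized (weight 0).
    """
    beta = summary_stat_lasso(r, R, lam)
    for _ in range(n_outer):
        w = [1.0 if abs(b) <= tau else 0.0 for b in beta]
        new_beta = summary_stat_lasso(r, R, lam, weights=w)
        if new_beta == beta:
            break
        beta = new_beta
    return beta


# Toy example with three uncorrelated SNPs (made-up numbers).
r_toy = [0.5, 0.05, -0.3]
R_toy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lasso_fit = summary_stat_lasso(r_toy, R_toy, lam=0.1)
tlp_fit = summary_stat_tlp(r_toy, R_toy, lam=0.1, tau=0.15)
```

In the toy example the lasso shrinks every nonzero coefficient by lam, while the TLP step removes the shrinkage from coefficients larger than tau and keeps the small one at zero. This matches the abstract's description of the TLP as retaining sparsity while reducing the bias of the LASSO. In practice R would usually need regularization, since a reference-panel LD matrix is typically not full rank.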

Original language: English (US)
Article number: e1008271
Journal: PLoS Computational Biology
Volume: 16
Issue number: 10
State: Published - Oct 1 2020

Bibliographical note

Funding Information:
WP gratefully acknowledges the financial support of National Institute of General Medical Sciences grant T32GM10855, and NIH grants R01AG065636, R21AG057038, R01HL116720, R01GM113250, R01GM126002 and R01HL10539. WP and JP acknowledge the support of the Minnesota Supercomputing Institute. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Publisher Copyright:
Copyright: © 2020 Pattee, Pan. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
