TY - JOUR
T1 - Second-generation PLINK
T2 - Rising to the challenge of larger and richer datasets
AU - Chang, Christopher C.
AU - Chow, Carson C.
AU - Tellier, Laurent C.A.M.
AU - Vattikuti, Shashaank
AU - Purcell, Shaun M.
AU - Lee, James J.
N1 - Publisher Copyright:
© 2015 Chang et al.; licensee BioMed Central.
PY - 2015/2/25
Y1 - 2015/2/25
N2 - Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
KW - Computational statistics
KW - GWAS
KW - High-density SNP genotyping
KW - Population genetics
KW - Whole-genome sequencing
UR - http://www.scopus.com/inward/record.url?scp=84930213392&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84930213392&partnerID=8YFLogxK
U2 - 10.1186/s13742-015-0047-8
DO - 10.1186/s13742-015-0047-8
M3 - Article
C2 - 25722852
AN - SCOPUS:84930213392
SN - 2047-217X
VL - 4
JO - GigaScience
JF - GigaScience
IS - 1
M1 - 7
ER -