Hybrid assembly with long and short reads improves discovery of gene family expansions

Jason R. Miller, Peng Zhou, Joann Mudge, James Gurtowski, Hayan Lee, Thiruvarangan Ramaraj, Brian P. Walenz, Junqi Liu, Robert M. Stupar, Roxanne Denny, Li Song, Namrata Singh, Lyza G. Maron, Susan R. McCouch, W. Richard McCombie, Michael C. Schatz, Peter Tiffin, Nevin D. Young, Kevin A.T. Silverstein

Research output: Contribution to journal (Article)

16 Citations (Scopus)

Abstract

Background: Long-read and short-read sequencing technologies offer competing advantages for eukaryotic genome sequencing projects. Combinations of both may be appropriate for surveys of within-species genomic variation.

Methods: We developed a hybrid assembly pipeline called "Alpaca" that can operate on 20X long-read coverage plus about 50X short-insert and 50X long-insert short-read coverage. To preclude collapse of tandem repeats, Alpaca relies on base-call-corrected long reads for contig formation.

Results: Compared to two other assembly protocols, Alpaca demonstrated the most reference agreement and repeat capture on the rice genome. On three accessions of the model legume Medicago truncatula, Alpaca generated the most agreement to a conspecific reference and predicted tandemly repeated genes absent from the other assemblies.

Conclusion: Our results suggest Alpaca is a useful tool for investigating structural and copy number variation within de novo assemblies of sampled populations.
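The coverage figures quoted in the Methods (20X long reads, about 50X short-insert and 50X long-insert short reads) can be turned into rough sequencing-yield targets. The following is a minimal sketch, assuming the usual definition of mean coverage as total sequenced bases divided by genome size. The function names, the library labels, and the 400 Mb genome size are illustrative assumptions, not values or code from the paper or from Alpaca itself.

```python
# Coverage targets quoted in the abstract; the dictionary keys are
# illustrative labels, not names used by the Alpaca pipeline.
TARGET_COVERAGE = {
    "long_reads": 20,
    "short_insert_short_reads": 50,
    "long_insert_short_reads": 50,
}


def bases_needed(target_coverage: float, genome_size_bp: int) -> int:
    """Total sequenced bases required to reach a target mean coverage."""
    return int(target_coverage * genome_size_bp)


def mean_coverage(total_bases: int, genome_size_bp: int) -> float:
    """Mean coverage = total sequenced bases / genome size."""
    return total_bases / genome_size_bp


def yield_plan(genome_size_bp: int) -> dict:
    """Bases required per library type for the quoted coverage targets."""
    return {
        library: bases_needed(cov, genome_size_bp)
        for library, cov in TARGET_COVERAGE.items()
    }


if __name__ == "__main__":
    # Assumption: a placeholder genome size of 400 Mb, chosen only for
    # illustration; substitute the estimated size of the target genome.
    GENOME_SIZE_BP = 400_000_000
    for library, bases in yield_plan(GENOME_SIZE_BP).items():
        print(f"{library}: {bases / 1e9:.1f} Gb")
```

Under these assumptions the long-read requirement is well under half of either short-read requirement, which reflects the abstract's point that the pipeline is designed to work with modest long-read coverage.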

Original language: English (US)
Article number: 541
Journal: BMC Genomics
Volume: 18
Issue number: 1
DOI: https://doi.org/10.1186/s12864-017-3927-8
State: Published - Jul 19 2017

Keywords

  • Genome assembly
  • Hybrid assembly pipeline
  • Medicago truncatula
  • Tandem repeats

Cite this

Miller, JR, Zhou, P, Mudge, J, Gurtowski, J, Lee, H, Ramaraj, T, Walenz, BP, Liu, J, Stupar, RM, Denny, R, Song, L, Singh, N, Maron, LG, McCouch, SR, McCombie, WR, Schatz, MC, Tiffin, P, Young, ND & Silverstein, KAT 2017, 'Hybrid assembly with long and short reads improves discovery of gene family expansions', BMC Genomics, vol. 18, no. 1, 541. https://doi.org/10.1186/s12864-017-3927-8
@article{0c7239cd261542baba6bf6ef7b202f67,
title = "Hybrid assembly with long and short reads improves discovery of gene family expansions",
abstract = "Background: Long-read and short-read sequencing technologies offer competing advantages for eukaryotic genome sequencing projects. Combinations of both may be appropriate for surveys of within-species genomic variation. Methods: We developed a hybrid assembly pipeline called {"}Alpaca{"} that can operate on 20X long-read coverage plus about 50X short-insert and 50X long-insert short-read coverage. To preclude collapse of tandem repeats, Alpaca relies on base-call-corrected long reads for contig formation. Results: Compared to two other assembly protocols, Alpaca demonstrated the most reference agreement and repeat capture on the rice genome. On three accessions of the model legume Medicago truncatula, Alpaca generated the most agreement to a conspecific reference and predicted tandemly repeated genes absent from the other assemblies. Conclusion: Our results suggest Alpaca is a useful tool for investigating structural and copy number variation within de novo assemblies of sampled populations.",
keywords = "Genome assembly, Hybrid assembly pipeline, Medicago truncatula, Tandem repeats",
author = "Miller, {Jason R.} and Peng Zhou and Joann Mudge and James Gurtowski and Hayan Lee and Thiruvarangan Ramaraj and Walenz, {Brian P.} and Junqi Liu and Stupar, {Robert M.} and Roxanne Denny and Li Song and Namrata Singh and Maron, {Lyza G.} and McCouch, {Susan R.} and McCombie, {W. Richard} and Schatz, {Michael C.} and Peter Tiffin and Young, {Nevin D.} and Silverstein, {Kevin A.T.}",
year = "2017",
month = "7",
day = "19",
doi = "10.1186/s12864-017-3927-8",
language = "English (US)",
volume = "18",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",
number = "1",
pages = "541",
}

TY  - JOUR
T1  - Hybrid assembly with long and short reads improves discovery of gene family expansions
AU  - Miller, Jason R.
AU  - Zhou, Peng
AU  - Mudge, Joann
AU  - Gurtowski, James
AU  - Lee, Hayan
AU  - Ramaraj, Thiruvarangan
AU  - Walenz, Brian P.
AU  - Liu, Junqi
AU  - Stupar, Robert M.
AU  - Denny, Roxanne
AU  - Song, Li
AU  - Singh, Namrata
AU  - Maron, Lyza G.
AU  - McCouch, Susan R.
AU  - McCombie, W. Richard
AU  - Schatz, Michael C.
AU  - Tiffin, Peter
AU  - Young, Nevin D.
AU  - Silverstein, Kevin A.T.
PY  - 2017/7/19
Y1  - 2017/7/19
AB  - Background: Long-read and short-read sequencing technologies offer competing advantages for eukaryotic genome sequencing projects. Combinations of both may be appropriate for surveys of within-species genomic variation. Methods: We developed a hybrid assembly pipeline called "Alpaca" that can operate on 20X long-read coverage plus about 50X short-insert and 50X long-insert short-read coverage. To preclude collapse of tandem repeats, Alpaca relies on base-call-corrected long reads for contig formation. Results: Compared to two other assembly protocols, Alpaca demonstrated the most reference agreement and repeat capture on the rice genome. On three accessions of the model legume Medicago truncatula, Alpaca generated the most agreement to a conspecific reference and predicted tandemly repeated genes absent from the other assemblies. Conclusion: Our results suggest Alpaca is a useful tool for investigating structural and copy number variation within de novo assemblies of sampled populations.
KW  - Genome assembly
KW  - Hybrid assembly pipeline
KW  - Medicago truncatula
KW  - Tandem repeats
UR  - http://www.scopus.com/inward/record.url?scp=85025075780&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85025075780&partnerID=8YFLogxK
U2  - 10.1186/s12864-017-3927-8
DO  - 10.1186/s12864-017-3927-8
M3  - Article
C2  - 28724409
AN  - SCOPUS:85025075780
VL  - 18
JO  - BMC Genomics
JF  - BMC Genomics
SN  - 1471-2164
IS  - 1
M1  - 541
ER  -