Exploring structural variation and gene family architecture with De Novo assemblies of 15 Medicago genomes

Peng Zhou, Kevin A T Silverstein, Thiruvarangan Ramaraj, Joseph Guhlin, Roxanne Denny, Junqi Liu, Andrew D. Farmer, Kelly P. Steele, Robert M. Stupar, Jason R. Miller, Peter Tiffin, Joann Mudge, Nevin D. Young

Research output: Contribution to journal › Article

16 Citations (Scopus)

Abstract

Background: Previous studies exploring sequence variation in the model legume, Medicago truncatula, relied on mapping short reads to a single reference. However, read-mapping approaches are inadequate to examine large, diverse gene families or to probe variation in repeat-rich or highly divergent genome regions. De novo sequencing and assembly of M. truncatula genomes enable near-comprehensive discovery of structural variants (SVs), analysis of rapidly evolving gene families, and ultimately, construction of a pan-genome.

Results: Genome-wide synteny based on 15 de novo M. truncatula assemblies effectively detected different types of SVs, indicating that as much as 22% of the genome is involved in large structural changes, altogether affecting 28% of gene models. A total of 63 million base pairs (Mbp) of novel sequence was discovered, expanding the reference genome space for Medicago by 16%. Pan-genome analysis revealed that 42% (180 Mbp) of genomic sequence is missing in one or more accessions, while examination of de novo annotated genes identified 67% (50,700) of all ortholog groups as dispensable, estimates comparable to recent studies in rice, maize and soybean. Rapidly evolving gene families typically associated with biotic interactions and stress response were found to be enriched in the accession-specific gene pool. The nucleotide-binding site leucine-rich repeat (NBS-LRR) family, in particular, harbors the highest level of nucleotide diversity, large-effect single nucleotide changes, protein diversity, and presence/absence variation. However, the leucine-rich repeat (LRR) and heat shock gene families are disproportionately affected by large-effect single nucleotide changes and even higher levels of copy number variation.

Conclusions: Analysis of multiple M. truncatula genomes illustrates the value of de novo assemblies to discover and describe structural variation, something that is often underestimated when using read-mapping approaches. Comparisons among the de novo assemblies also indicate that different large gene families differ in the architecture of their structural variation.
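
The core/dispensable split reported in the Results can be illustrated with a minimal sketch. It assumes that an ortholog group counts as dispensable when it is absent from at least one accession, mirroring how the abstract describes the sequence-level figure; the criteria actually used in the paper may differ. The function name, accession labels, and group identifiers below are hypothetical toy values, not data from the study.

```python
from typing import Dict, Set


def classify_groups(presence: Dict[str, Set[str]],
                    accessions: Set[str]) -> Dict[str, str]:
    """Label each ortholog group as 'core' or 'dispensable'.

    Assumption (not taken from the paper): a group is 'core' only if it
    is present in every accession, and 'dispensable' otherwise.
    """
    labels = {}
    for group, found_in in presence.items():
        # accessions <= found_in is True when every accession has the group
        labels[group] = "core" if accessions <= found_in else "dispensable"
    return labels


# Toy presence/absence data for three hypothetical accessions.
accessions = {"acc1", "acc2", "acc3"}
presence = {
    "OG1": {"acc1", "acc2", "acc3"},  # present everywhere
    "OG2": {"acc1", "acc3"},          # missing from acc2
    "OG3": {"acc2"},                  # accession-specific
    "OG4": {"acc1", "acc2", "acc3"},  # present everywhere
}

labels = classify_groups(presence, accessions)
dispensable_fraction = (
    sum(label == "dispensable" for label in labels.values()) / len(labels)
)
print(labels)
print(f"dispensable fraction: {dispensable_fraction:.2f}")
```

With real data, the same tally over all ortholog groups would yield a dispensable fraction of the kind quoted in the abstract, although the result depends on how orthology and presence are called.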

Original language: English (US)
Article number: 261
Journal: BMC Genomics
Volume: 18
Issue number: 1
DOIs: 10.1186/s12864-017-3654-1
State: Published - Mar 27 2017

Fingerprint

Medicago
Medicago truncatula
Genome
Genes
Nucleotides
Leucine
Base Pairing
Synteny
Gene Pool
Soybeans
Fabaceae
Zea mays
Shock
Hot Temperature
Binding Sites

Cite this

Exploring structural variation and gene family architecture with De Novo assemblies of 15 Medicago genomes. / Zhou, Peng; Silverstein, Kevin A T; Ramaraj, Thiruvarangan; Guhlin, Joseph; Denny, Roxanne; Liu, Junqi; Farmer, Andrew D.; Steele, Kelly P.; Stupar, Robert M.; Miller, Jason R.; Tiffin, Peter; Mudge, Joann; Young, Nevin D.

In: BMC Genomics, Vol. 18, No. 1, 261, 27.03.2017.

@article{82538f89e6494cabab2b8fe9059053de,
title = "Exploring structural variation and gene family architecture with De Novo assemblies of 15 Medicago genomes",
abstract = "Background: Previous studies exploring sequence variation in the model legume, Medicago truncatula, relied on mapping short reads to a single reference. However, read-mapping approaches are inadequate to examine large, diverse gene families or to probe variation in repeat-rich or highly divergent genome regions. De novo sequencing and assembly of M. truncatula genomes enables near-comprehensive discovery of structural variants (SVs), analysis of rapidly evolving gene families, and ultimately, construction of a pan-genome. Results: Genome-wide synteny based on 15 de novo M. truncatula assemblies effectively detected different types of SVs indicating that as much as 22{\%} of the genome is involved in large structural changes, altogether affecting 28{\%} of gene models. A total of 63 million base pairs (Mbp) of novel sequence was discovered, expanding the reference genome space for Medicago by 16{\%}. Pan-genome analysis revealed that 42{\%} (180 Mbp) of genomic sequences is missing in one or more accession, while examination of de novo annotated genes identified 67{\%} (50,700) of all ortholog groups as dispensable - estimates comparable to recent studies in rice, maize and soybean. Rapidly evolving gene families typically associated with biotic interactions and stress response were found to be enriched in the accession-specific gene pool. The nucleotide-binding site leucine-rich repeat (NBS-LRR) family, in particular, harbors the highest level of nucleotide diversity, large effect single nucleotide change, protein diversity, and presence/absence variation. However, the leucine-rich repeat (LRR) and heat shock gene families are disproportionately affected by large effect single nucleotide changes and even higher levels of copy number variation. Conclusions: Analysis of multiple M. truncatula genomes illustrates the value of de novo assemblies to discover and describe structural variation, something that is often under-estimated when using read-mapping approaches. Comparisons among the de novo assemblies also indicate that different large gene families differ in the architecture of their structural variation.",
author = "Peng Zhou and Silverstein, {Kevin A T} and Thiruvarangan Ramaraj and Joseph Guhlin and Roxanne Denny and Junqi Liu and Farmer, {Andrew D.} and Steele, {Kelly P.} and Stupar, {Robert M.} and Miller, {Jason R.} and Peter Tiffin and Joann Mudge and Young, {Nevin D.}",
year = "2017",
month = "3",
day = "27",
doi = "10.1186/s12864-017-3654-1",
language = "English (US)",
volume = "18",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Exploring structural variation and gene family architecture with De Novo assemblies of 15 Medicago genomes

AU - Zhou, Peng

AU - Silverstein, Kevin A T

AU - Ramaraj, Thiruvarangan

AU - Guhlin, Joseph

AU - Denny, Roxanne

AU - Liu, Junqi

AU - Farmer, Andrew D.

AU - Steele, Kelly P.

AU - Stupar, Robert M.

AU - Miller, Jason R.

AU - Tiffin, Peter

AU - Mudge, Joann

AU - Young, Nevin D.

PY - 2017/3/27

Y1 - 2017/3/27

N2 - Background: Previous studies exploring sequence variation in the model legume, Medicago truncatula, relied on mapping short reads to a single reference. However, read-mapping approaches are inadequate to examine large, diverse gene families or to probe variation in repeat-rich or highly divergent genome regions. De novo sequencing and assembly of M. truncatula genomes enables near-comprehensive discovery of structural variants (SVs), analysis of rapidly evolving gene families, and ultimately, construction of a pan-genome. Results: Genome-wide synteny based on 15 de novo M. truncatula assemblies effectively detected different types of SVs indicating that as much as 22% of the genome is involved in large structural changes, altogether affecting 28% of gene models. A total of 63 million base pairs (Mbp) of novel sequence was discovered, expanding the reference genome space for Medicago by 16%. Pan-genome analysis revealed that 42% (180 Mbp) of genomic sequences is missing in one or more accession, while examination of de novo annotated genes identified 67% (50,700) of all ortholog groups as dispensable - estimates comparable to recent studies in rice, maize and soybean. Rapidly evolving gene families typically associated with biotic interactions and stress response were found to be enriched in the accession-specific gene pool. The nucleotide-binding site leucine-rich repeat (NBS-LRR) family, in particular, harbors the highest level of nucleotide diversity, large effect single nucleotide change, protein diversity, and presence/absence variation. However, the leucine-rich repeat (LRR) and heat shock gene families are disproportionately affected by large effect single nucleotide changes and even higher levels of copy number variation. Conclusions: Analysis of multiple M. truncatula genomes illustrates the value of de novo assemblies to discover and describe structural variation, something that is often under-estimated when using read-mapping approaches. Comparisons among the de novo assemblies also indicate that different large gene families differ in the architecture of their structural variation.

UR - http://www.scopus.com/inward/record.url?scp=85016145317&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85016145317&partnerID=8YFLogxK

U2 - 10.1186/s12864-017-3654-1

DO - 10.1186/s12864-017-3654-1

M3 - Article

VL - 18

JO - BMC Genomics

JF - BMC Genomics

SN - 1471-2164

IS - 1

M1 - 261

ER -