Identification of the expressome by machine learning on omics data

Ryan C. Sartor, Jaclyn Noshay, Nathan M. Springer, Steven P. Briggs

Research output: Contribution to journal › Article

Abstract

Accurate annotation of plant genomes remains complex due to the presence of many pseudogenes arising from whole-genome duplication-generated redundancy or the capture and movement of gene fragments by transposable elements. Machine learning on genome-wide epigenetic marks, informed by transcriptomic and proteomic training data, could be used to improve annotations through classification of all putative protein-coding genes as either constitutively silent or able to be expressed. Expressed genes were subclassified as able to express both mRNAs and proteins or only RNAs, and CG gene body methylation was associated only with the former subclass. More than 60,000 protein-coding genes have been annotated in the reference genome of maize inbred B73. About two-thirds of these genes are transcribed and are designated the filtered gene set (FGS). Classification of genes by our trained random forest algorithm was accurate and relied only on histone modifications or DNA methylation patterns within the gene body; promoter methylation was unimportant. Other inbred lines are known to transcribe significantly different sets of genes, indicating that the FGS is specific to B73. We accurately classified the sets of transcribed genes in additional inbred lines, arising from inbred-specific DNA methylation patterns. This approach highlights the potential of using chromatin information to improve annotations of functional genes.
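
The classification step described in the abstract can be illustrated with a toy sketch. This is not the authors' pipeline: the abstract does not state the features, software, or hyperparameters used, so everything below is an assumption. The data are synthetic, the feature names (`gene_body_mark_1`, `gene_body_mark_2`, `promoter_mark`) are hypothetical, and the forest is deliberately simplified to bagged depth-1 trees (decision stumps) with random feature subsets, whereas a practical random forest would grow deeper trees. The promoter feature is uninformative by construction, so its absence from the fitted stumps illustrates the idea and does not reproduce the reported result.

```python
import random

# Hypothetical per-gene features; none of these names come from the paper.
FEATURES = ["gene_body_mark_1", "gene_body_mark_2", "promoter_mark"]


def make_toy_genes(n, rng):
    """Synthetic genes: label 1 = able to be expressed, 0 = constitutively silent."""
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 2 / 3 else 0       # toy class balance (assumption)
        body_1 = rng.gauss(0.3 if label else 0.7, 0.1)  # strongly informative (assumption)
        body_2 = rng.gauss(0.6 if label else 0.4, 0.1)  # weakly informative (assumption)
        promoter = rng.gauss(0.5, 0.2)                  # uninformative by construction
        X.append([body_1, body_2, promoter])
        y.append(label)
    return X, y


def fit_stump(X, y, feature_ids):
    """Return (feature, threshold, label_if_below) with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            for below in (0, 1):
                errors = sum(
                    (below if row[f] <= t else 1 - below) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, below)
    return best[1:]


def fit_forest(X, y, n_trees, rng):
    """Bagging plus random feature subsets, with one stump per bootstrap sample."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample of genes
        feats = rng.sample(range(len(X[0])), 2)    # random subset of 2 features
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = sum(below if row[f] <= t else 1 - below for f, t, below in forest)
    return 1 if 2 * votes > len(forest) else 0


def feature_usage(forest):
    """Count how many stumps split on each feature (a crude importance proxy)."""
    counts = [0] * len(FEATURES)
    for f, _, _ in forest:
        counts[f] += 1
    return dict(zip(FEATURES, counts))


rng = random.Random(0)
X_train, y_train = make_toy_genes(150, rng)
X_test, y_test = make_toy_genes(100, rng)
forest = fit_forest(X_train, y_train, 25, rng)
accuracy = sum(predict(forest, r) == l for r, l in zip(X_test, y_test)) / len(y_test)
print(f"held-out accuracy: {accuracy:.2f}")
print("features chosen by the stumps:", feature_usage(forest))
```

Counting which feature each stump splits on is only a rough stand-in for the feature-importance measures a full random forest implementation would report.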

Original language: English (US)
Pages (from-to): 18119-18125
Number of pages: 7
Journal: Proceedings of the National Academy of Sciences of the United States of America
Volume: 116
Issue number: 36
DOIs: 10.1073/pnas.1813645116
State: Published - Sep 3 2019

Fingerprint

Genes
Genome
DNA Methylation
Methylation
Histone Code
Machine Learning
Plant Genome
Molecular Sequence Annotation
Pseudogenes
Proteins
DNA Transposable Elements
Epigenomics
Proteomics
Zea mays
Chromatin
RNA
Messenger RNA

Keywords

  • Epigenomics
  • Genome annotation
  • Machine learning
  • Maize
  • Proteomics

PubMed: MeSH publication types

  • Journal Article

Cite this

Identification of the expressome by machine learning on omics data. / Sartor, Ryan C.; Noshay, Jaclyn; Springer, Nathan M.; Briggs, Steven P.

In: Proceedings of the National Academy of Sciences of the United States of America, Vol. 116, No. 36, 03.09.2019, p. 18119-18125.

@article{f1772b4c50a14b489d79143d2ee7fd26,
title = "Identification of the expressome by machine learning on omics data",
abstract = "Accurate annotation of plant genomes remains complex due to the presence of many pseudogenes arising from whole-genome duplication-generated redundancy or the capture and movement of gene fragments by transposable elements. Machine learning on genome-wide epigenetic marks, informed by transcriptomic and proteomic training data, could be used to improve annotations through classification of all putative protein-coding genes as either constitutively silent or able to be expressed. Expressed genes were subclassified as able to express both mRNAs and proteins or only RNAs, and CG gene body methylation was associated only with the former subclass. More than 60,000 protein-coding genes have been annotated in the reference genome of maize inbred B73. About two-thirds of these genes are transcribed and are designated the filtered gene set (FGS). Classification of genes by our trained random forest algorithm was accurate and relied only on histone modifications or DNA methylation patterns within the gene body; promoter methylation was unimportant. Other inbred lines are known to transcribe significantly different sets of genes, indicating that the FGS is specific to B73. We accurately classified the sets of transcribed genes in additional inbred lines, arising from inbred-specific DNA methylation patterns. This approach highlights the potential of using chromatin information to improve annotations of functional genes.",
keywords = "Epigenomics, Genome annotation, Machine learning, Maize, Proteomics",
author = "Sartor, {Ryan C.} and Jaclyn Noshay and Springer, {Nathan M.} and Briggs, {Steven P.}",
year = "2019",
month = "9",
day = "3",
doi = "10.1073/pnas.1813645116",
language = "English (US)",
volume = "116",
pages = "18119--18125",
journal = "Proceedings of the National Academy of Sciences of the United States of America",
issn = "0027-8424",
number = "36",

}

TY - JOUR

T1 - Identification of the expressome by machine learning on omics data

AU - Sartor, Ryan C.

AU - Noshay, Jaclyn

AU - Springer, Nathan M.

AU - Briggs, Steven P.

PY - 2019/9/3

Y1 - 2019/9/3

AB - Accurate annotation of plant genomes remains complex due to the presence of many pseudogenes arising from whole-genome duplication-generated redundancy or the capture and movement of gene fragments by transposable elements. Machine learning on genome-wide epigenetic marks, informed by transcriptomic and proteomic training data, could be used to improve annotations through classification of all putative protein-coding genes as either constitutively silent or able to be expressed. Expressed genes were subclassified as able to express both mRNAs and proteins or only RNAs, and CG gene body methylation was associated only with the former subclass. More than 60,000 protein-coding genes have been annotated in the reference genome of maize inbred B73. About two-thirds of these genes are transcribed and are designated the filtered gene set (FGS). Classification of genes by our trained random forest algorithm was accurate and relied only on histone modifications or DNA methylation patterns within the gene body; promoter methylation was unimportant. Other inbred lines are known to transcribe significantly different sets of genes, indicating that the FGS is specific to B73. We accurately classified the sets of transcribed genes in additional inbred lines, arising from inbred-specific DNA methylation patterns. This approach highlights the potential of using chromatin information to improve annotations of functional genes.

KW - Epigenomics

KW - Genome annotation

KW - Machine learning

KW - Maize

KW - Proteomics

UR - http://www.scopus.com/inward/record.url?scp=85071788931&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85071788931&partnerID=8YFLogxK

U2 - 10.1073/pnas.1813645116

DO - 10.1073/pnas.1813645116

M3 - Article

C2 - 31420517

AN - SCOPUS:85071788931

VL - 116

SP - 18119

EP - 18125

JO - Proceedings of the National Academy of Sciences of the United States of America

JF - Proceedings of the National Academy of Sciences of the United States of America

SN - 0027-8424

IS - 36

ER -