Genomic limitations to RNA sequencing expression profiling

Research output: Contribution to journal (Article)

12 Citations (Scopus)

Abstract

The field of genomics has grown rapidly with the advent of massively parallel sequencing technologies, allowing for novel biological insights with regard to genomic, transcriptomic, and epigenomic variation. One widely utilized application of high-throughput sequencing is transcriptional profiling using RNA sequencing (RNAseq). Understanding the limitations of a technology is critical for accurate biological interpretations, and clear interpretation of RNAseq data can be difficult in species with complex genomes. To understand the limitations of accurate profiling of expression levels, we simulated RNAseq reads from annotated gene models in several plant species including Arabidopsis, Brachypodium, maize, potato, rice, soybean, and tomato. The simulated reads were aligned using various parameters such as unique versus multiple read alignments. This allowed the identification of genes recalcitrant to RNAseq analyses by having over- and/or under-estimated expression levels. In maize, over 25% of genes deviated by more than 20% from the expected count values, suggesting the need for cautious interpretation of RNAseq data for certain genes. The reasons identified for deviation from expected expression varied between species due to differences in genome structure including, but not limited to, genes encoding short transcripts, overlapping gene models, and gene family size. Utilizing existing empirical datasets, we demonstrate the potential for biological misinterpretation resulting from inclusion of 'flagged genes' in analyses. While RNAseq is a powerful tool for understanding biology, there are limitations to this technology that need to be understood in order to improve our biological interpretations.

Significance Statement

RNAseq is widely used to interrogate the transcriptome of a given sample. Understanding the limitations of this technology is critical for accurate biological interpretations. Here we used simulated RNAseq datasets from seven species ranging in genome complexity. We identified characteristics of genes whose RNAseq counts deviated from expected expression values, including small genes, overlapping genes, and gene families.
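The flagging idea described in the abstract, comparing the read count recovered for each gene after alignment with the count that was simulated for it, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `flag_genes`, the gene identifiers, and the toy counts are hypothetical, and only the 20% threshold is taken from the abstract.

```python
def flag_genes(expected, observed, tolerance=0.20):
    """Return genes whose observed count deviates from the expected
    (simulated) count by more than `tolerance`, labeled by direction.

    `expected` and `observed` map gene IDs to read counts. All names
    here are illustrative assumptions, not taken from the paper.
    """
    flagged = {}
    for gene, exp in expected.items():
        if exp <= 0:
            # No reads were simulated, so relative deviation is undefined.
            continue
        obs = observed.get(gene, 0)
        deviation = (obs - exp) / exp
        if abs(deviation) > tolerance:
            flagged[gene] = "over" if deviation > 0 else "under"
    return flagged


# Toy example with made-up counts: 100 reads simulated per gene.
expected = {"geneA": 100, "geneB": 100, "geneC": 100, "geneD": 100}
observed = {"geneA": 98, "geneB": 55, "geneC": 131, "geneD": 120}
flagged = flag_genes(expected, observed)
print(flagged)
```

In this toy input, a gene that loses reads (for example to multi-mapping within a gene family) is labeled "under", and a gene that gains reads (for example from an overlapping gene model) is labeled "over"; a gene at exactly 20% deviation is not flagged because the comparison is strict.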

Original language: English (US)
Pages (from-to): 491-503
Number of pages: 13
Journal: Plant Journal
Volume: 84
Issue number: 3
DOI: 10.1111/tpj.13014
State: Published - Nov 1 2015

Fingerprint

RNA Sequence Analysis
sequence analysis
genomics
Genes
genes
Overlapping Genes
Technology
Genome
Zea mays
Brachypodium
genome
High-Throughput Nucleotide Sequencing
Lycopersicon esculentum
Solanum tuberosum
Genomics
family size
Transcriptome
Soybeans
corn
Arabidopsis

Keywords

  • Arabidopsis
  • RNAseq
  • expression profile
  • maize
  • structural annotation

Cite this

Genomic limitations to RNA sequencing expression profiling. / Hirsch, Cory D; Springer, Nathan M; Hirsch, Candice N.

In: Plant Journal, Vol. 84, No. 3, 01.11.2015, p. 491-503.

@article{43ee91400a7546ab812b975224b2951f,
title = "Genomic limitations to RNA sequencing expression profiling",
keywords = "Arabidopsis, RNAseq, expression profile, maize, structural annotation",
author = "Hirsch, {Cory D} and Springer, {Nathan M} and Hirsch, {Candice N}",
year = "2015",
month = "11",
day = "1",
doi = "10.1111/tpj.13014",
language = "English (US)",
volume = "84",
pages = "491--503",
journal = "Plant Journal",
issn = "0960-7412",
publisher = "Wiley-Blackwell",
number = "3",

}

TY - JOUR

T1 - Genomic limitations to RNA sequencing expression profiling

AU - Hirsch, Cory D

AU - Springer, Nathan M

AU - Hirsch, Candice N

PY - 2015/11/1

Y1 - 2015/11/1

KW - Arabidopsis

KW - RNAseq

KW - expression profile

KW - maize

KW - structural annotation

UR - http://www.scopus.com/inward/record.url?scp=84944876337&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84944876337&partnerID=8YFLogxK

U2 - 10.1111/tpj.13014

DO - 10.1111/tpj.13014

M3 - Article

C2 - 26331235

AN - SCOPUS:84944876337

VL - 84

SP - 491

EP - 503

JO - Plant Journal

JF - Plant Journal

SN - 0960-7412

IS - 3

ER -