Download pdf - RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

RNA-‐seq Introduc1on

Promises and pi7alls

DNA is the same in all cells but which RNAs that is present is different in all

cells

There is a wide variety of different func1onal RNAs

Which RNAs (and some1mes then translated to proteins) varies between

samples -‐Tissues -‐ Cell types

-‐ Cell states

-‐Individuals -‐Cells

RNA gives informa1on on which genes that are expressed

How DNA get transcribed to RNA (and some1mes then translated to proteins) varies between e. g. -‐Tissues -‐ Cell types

-‐ Cell states

-‐Individuals

RNA gives informa1on on which genes that are expressed

How DNA get transcribed to RNA (and some1mes then translated to proteins) varies between e. g. -‐Tissues -‐ Cell types

-‐ Cell states

-‐Individuals

RNA flavors (pre sequencing era)

•  House keeping RNAs –  rRNAs, tRNAs, snoRNAs, snRNAs, SRP RNAs, cataly1c RNAs (RNAse E)

•  Protein coding RNAs –  (1 coding gene ~ 1 mRNA)

•  Regulatory RNAs –  Few rare examples

ENCODE, the Encyclopedia of DNA Elements, is a project funded by the Na1onal Human Genome Research Ins1tute to iden1fy all regions of transcrip1on, transcrip1on factor associa1on, chroma1n structure and histone modifica1on in the human genome sequence.

ENCyclopedia Of Dna Elements

Different kind of RNAs have different expression values

Landscape of transcrip/on in human cells, S Djebali et al. Nature 2012

What defines RNA depends on how you look at it

Variants

Adapted from Landscape of transcrip/on in human cells, S Djebali et al. Nature 2012

Abundance House keeping RNAs

mRNAs

Regulatory RNAs

Novel intergenic

None

Coverage

One gene many different mRNAs RNA-seq: alternative splicing

Defining func1onal DNA elements in

the human genome •  Statement

–  A priori, we should not expect the transcriptome to consist exclusively of func1onal RNAs.

•  Why is that –  Zero tolerance for errant

transcripts would come at high cost in the proofreading machinery needed to perfectly gate RNA polymerase and splicing ac1vi1es, or to instantly eliminate spurious transcripts.

–  In general, sequences encoding RNAs transcribed by noisy transcrip1onal machinery are expected to be less constrained, which is consistent with data shown here for very low abundance RNA

•  Consequence –  Thus, one should have high

confidence that the subset of the genome with large signals for RNA or chroma1n signatures coupled with strong conserva1on is func1onal and will be supported by appropriate gene1c tests.

–  In contrast, the larger propor1on of genome with reproducible but low biochemical signal strength and less evolu1onary conserva1on is challenging to parse between specific func1ons and biological noise.

This is of course not without an debate Variants Abundance

Most ‘‘Dark Matter’’ Transcripts Are Associated WithKnown GenesHarm van Bakel1, Corey Nislow1,2, Benjamin J. Blencowe1,2, Timothy R. Hughes1,2*

1 Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada, 2 Department of Molecular Genetics, University of Toronto, Toronto,

Ontario, Canada

Abstract

A series of reports over the last few years have indicated that a much larger portion of the mammalian genome istranscribed than can be accounted for by currently annotated genes, but the quantity and nature of these additionaltranscripts remains unclear. Here, we have used data from single- and paired-end RNA-Seq and tiling arrays to assess thequantity and composition of transcripts in PolyA+ RNA from human and mouse tissues. Relative to tiling arrays, RNA-Seqidentifies many fewer transcribed regions (‘‘seqfrags’’) outside known exons and ncRNAs. Most nonexonic seqfrags are inintrons, raising the possibility that they are fragments of pre-mRNAs. The chromosomal locations of the majority ofintergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation siteusage, promoter- and terminator-associated transcripts, or new alternative exons; indeed, reads that bridge splice sitesidentified 4,544 new exons, affecting 3,554 genes. Most of the remaining seqfrags correspond to either single reads thatdisplay characteristics of random sampling from a low-level background or several thousand small transcripts (medianlength = 111 bp) present at higher levels, which also tend to display sequence conservation and originate from regions withopen chromatin. We conclude that, while there are bona fide new intergenic transcripts, their number and abundance isgenerally low in comparison to known exons, and the genome is not as pervasively transcribed as previously reported.

Citation: van Bakel H, Nislow C, Blencowe BJ, Hughes TR (2010) Most ‘‘Dark Matter’’ Transcripts Are Associated With Known Genes. PLoS Biol 8(5): e1000371.doi:10.1371/journal.pbio.1000371

Academic Editor: Sean R. Eddy, HHMI Janelia Farm, United States of America

Received December 3, 2009; Accepted April 9, 2010; Published May 18, 2010

Copyright: ! 2010 van Bakel et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permitsunrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by Genome Canada (http://www.genomecanada.ca) through the Ontario Genomics Institute, the Ontario Research Fund, andMarch of Dimes (http://www.marchofdimes.com). HvB was supported by the Netherlands Organization for Scientific Research (NWO; http://www.nwo.nl) (grantno. 825.06.033) and the Canadian Institutes of Health Research (CIHR; http://www.cihr-irsc.gc.ca/) (grant no. 193588). The funders had no role in study design,data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

Abbreviations: APA, alternative cleavage and polyadenylation; BW, bandwidth parameter; CAGE, capped analysis of gene expression; CNV, copy numbervariation; lincRNAs, large intervening noncoding RNAs; ncRNAs, noncoding RNAs; ORF, open reading frames; pasRNA, promoter-associated RNA; TSS, transcriptionstart site; TTS, transcription termination site; TU, transcript unit; TUF, transcript of unknown function

* E-mail: [email protected]

Introduction

In recent years established views of transcription have beenchallenged by the observation that a much larger portion of thehuman and mouse genomes is transcribed than can be accountedfor by currently annotated coding and noncoding genes. The bulkof these findings have come from experiments using ‘‘tiling’’microarrays with probes that cover the non-repetitive genome atregular intervals [1–9], or from sequencing efforts of full-lengthcDNA libraries enriched for rare transcripts [10,11]. Additionally,capped analysis of gene expression (CAGE) in human and mouseshow that a significant number of sequenced 59 tags map tointergenic regions [12]. Estimates of the proportion of transcriptsthat map to locations separate from known exons range from 47%to 80% and are distributed approximately equally between intronsand intergenic regions. Dubbed transcriptional ‘‘dark matter’’[13], the ‘‘hidden’’ transcriptome [1], or transcripts of unknownfunction (TUFs) [4,14], the exact nature of much of this additionaltranscription is unclear, but it has been presumed to comprise acombination of novel protein coding transcripts, extensions ofexisting transcripts, noncoding RNAs (ncRNAs), antisense tran-scripts, and biological or experimental background. Determining

the relative contributions of each of these potential sources isimportant for understanding the nature and possible biologicalfunction of transcriptional dark matter.

Homology searches for transcripts mapping outside knownannotation boundaries [10], as well as cDNA sequencing efforts,indicate that it is still possible to find new exons of protein codinggenes [10,15,16]. The genomic positions of TUFs are also biasedtowards known transcripts [8], suggesting that at least a portionmay represent extensions of current gene annotations. Neverthe-less, the majority of dark matter transcripts is thought to benoncoding [2,4,5,10]. Previous efforts to characterize dark mattertranscripts have revealed the existence of thousands of ncRNAswith evidence for tissue-specific expression [17,18], as well as overa thousand large intervening noncoding RNAs (lincRNAs)originating from intergenic regions bearing chromatin marksassociated with transcription [19]. Other studies have reportednew classes of ncRNAs, such as those that cluster close to thetranscription start sites (TSSs) of protein coding genes [20–24].These promoter-associated RNAs (pasRNAs) typically initiate inthe nucleosome free regions that mark a TSS, with transcriptionoccurring in both directions. Finally, results from the ENCODEpilot project have suggested a highly interleaved structure of the

PLoS Biology | www.plosbiology.org 1 May 2010 | Volume 8 | Issue 5 | e1000371

Perspective

The Reality of Pervasive TranscriptionMichael B. Clark1, Paulo P. Amaral1., Felix J. Schlesinger2., Marcel E. Dinger1, Ryan J. Taft1, John L.

Rinn3, Chris P. Ponting4, Peter F. Stadler5, Kevin V. Morris6, Antonin Morillon7, Joel S. Rozowsky8,

Mark B. Gerstein8, Claes Wahlestedt9, Yoshihide Hayashizaki10, Piero Carninci10, Thomas R. Gingeras2*,

John S. Mattick1*

1 Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, Australia, 2 Watson School of Biological Sciences, Cold Spring Harbor Laboratory,

Cold Spring Harbor, New York, United States of America, 3 Broad Institute, Cambridge, Massachusetts, United States of America, 4 MRC Functional Genomics Unit,

Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom, 5 Department of Computer Science, University of Leipzig, Leipzig,

Germany, 6 Department of Molecular and Experimental Medicine, Scripps Research Institute, La Jolla, California, United States of America, 7 Institut Curie, UMR3244-

Pavillon Trouillet Rossignol, Paris, France, 8 Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America, 9 University of

Miami, Miami, Florida, United States of America, 10 Omics Science Center, RIKEN Yokohama Institute, Tsurumi-ku, Yokohama, Kanagawa, Japan

Current estimates indicate that onlyabout 1.2% of the mammalian genomecodes for amino acids in proteins. How-ever, mounting evidence over the pastdecade has suggested that the vast major-ity of the genome is transcribed, wellbeyond the boundaries of known genes, aphenomenon known as pervasive tran-scription [1]. Challenging this view, anarticle published in PLoS Biology by vanBakel et al. concluded that ‘‘the genome isnot as pervasively transcribed as previous-ly reported’’ [2] and that the majority ofthe detected low-level transcription is dueto technical artefacts and/or backgroundbiological noise. These conclusions attract-ed considerable publicity [3–6]. Here, wepresent an evaluation of the analysis andconclusions of van Bakel et al. comparedto those of others and show that (1) theexistence of pervasive transcription issupported by multiple independent tech-niques; (2) re-analysis of the van Bakel etal. tiling arrays shows that their results areatypical compared to those of ENCODEand lack independent validation; and (3)the RNA sequencing dataset used by vanBakel et al. suffered from insufficientsequencing depth and poor transcriptassembly, compromising their ability todetect the less abundant transcripts outsideof protein-coding genes. We conclude thatthe totality of the evidence stronglysupports pervasive transcription of mam-malian genomes, although the biologicalsignificance of many novel coding andnoncoding transcripts remains to be ex-plored.

Previous Evidence for PervasiveTranscription

The conclusion that the mammaliangenome is pervasively transcribed (i.e.,‘‘that the majority of its bases are associ-ated with at least one primary transcript’’[1]) was based on multiple lines ofevidence. Both large-scale cDNA sequenc-ing and hybridization to genome-widetiling arrays were the major empiricalsources of data. Analysis of full-lengthcDNAs from many tissues and develop-mental stages in mouse showed that atleast 63% of the genome is transcribed andidentified thousands of novel protein-coding transcripts and over 30,000 longnoncoding intronic, intergenic, and anti-sense transcripts [7–9]. In parallel, wholechromosome tiling array interrogation ofthe RNA content of a variety of humantissues and cell lines revealed that, collec-tively, at least 93% of genomic bases aretranscribed in one cell type or another[1,10–13].

Since it is well established that highlyexpressed mRNAs dominate the non-

ribosomal portion of the polyA+ transcrip-tome [7,8,10,14–19], normalization ap-proaches were used to reduce the quantityof highly expressed transcripts in thesecDNA analyses [7,8], and are implicit intiling array approaches. This was neces-sary to allow the detection of rarer (oftencell type–restricted [1,13,16,19,20]) tran-scripts.

The evidence for pervasive transcriptionalso includes observations from a widevariety of other independent techniques(see reviews [21] and [22] for references).Indeed, a simple query of currentlyavailable human spliced EST data inGenBank shows that documented tran-scripts cover 57.09% of the genome.Because ESTs are largely generated frompolyadenylated RNAs and do not exhaus-tively sample the transcriptome, this cov-erage represents the lower bound ofgenomic transcription.

Based on an analysis of genome-widetiling arrays and short read RNA sequenc-ing data, van Bakel et al. report that ‘‘most‘dark matter’ transcripts (i.e., novel tran-scripts of unknown function) are associated

The Perspective section provides experts with aforum to comment on topical or controversial issuesof broad interest.

Citation: Clark MB, Amaral PP, Schlesinger FJ, Dinger ME, Taft RJ, et al. (2011) The Reality of PervasiveTranscription. PLoS Biol 9(7): e1000625. doi:10.1371/journal.pbio.1000625

Academic Editor: Michael B. Eisen, University of California Berkeley, United States of America

Published July 12, 2011

Copyright: ! 2011 Clark et al. This is an open-access article distributed under the terms of the CreativeCommons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,provided the original author and source are credited.

Funding: This work was supported by the Australian Research Council and National Health and MedicalResearch Council (JSM), NHGRI (TRG, PC), NIH (KVM, MG), Japanese MEXT Research Grant for the RIKEN OmicsScience Center (YH, PC), EU FP-7 (PFS), Damon Runyon, Smith, Merkin and Searle scholarships (JLR), and UKBBSRC, ERC and MRC (CPP). The funders had no role in study design, data collection and analysis, decision topublish, or preparation of the manuscript. We thank Tim Hughes and Harm van Bakel for unreservedlyproviding their data to us for analysis.


Abbreviation: PR, precision recall

* E-mail: [email protected] (JSM); [email protected] (TRG)

. These authors contributed equally to this work.

PLoS Biology | www.plosbiology.org 1 July 2011 | Volume 9 | Issue 7 | e1000625

Perspective

Response to ‘‘The Reality of Pervasive Transcription’’Harm van Bakel1, Corey Nislow1,2, Benjamin J. Blencowe1,2, Timothy R. Hughes1,2*

1 Banting and Best Department of Medical Research and Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario,

Canada, 2 Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada

Clark et al. criticize several aspects ofour study [1], and specifically challengeour assertion that the degree of pervasivetranscription has previously been overstat-ed. We disagree with much of theirreasoning and their interpretation of ourwork. For example, many of our conclu-sions are based on overall sequence readdistributions, while Clark et al. focus ontranscript units and seqfrags (sets ofoverlapping reads). A key point is thatone can derive a robust estimate of therelative amounts of different transcripttypes without having a complete recon-struction of every single transcript.

In this brief response, we first revisitwhat is meant by pervasive transcription,and its potential significance. We thendiscuss the major points raised by Clarket al. in the order presented in theircritique. Finally, we demonstrate thatconclusions very similar to those of ouroriginal study are reached with a datasetwith far greater read depth, obtained bystrand-specific sequencing of rRNA-de-pleted total RNA from a single cell type.

The Meaning of ‘‘Pervasive’’,and the Importance ofTranscript Abundance

Clark et al. define pervasive transcrip-tion of a genome to mean ‘‘that themajority of its bases are associated withat least one primary transcript’’, which isthe same definition used in the ENCODE1% paper [2]. We believe that this specificclaim is not contested, nor is it particularlyinteresting. First, it has long been assumedthat roughly half of the human genomecomprises introns [3]. Second, the mech-anisms that control the positions ofinitiation and termination of Pol IItranscription, as well as RNA processing,are imperfect, such that low-level back-ground transcripts from both physiologi-cally relevant and non-canonical sites arise[4–6]. Blockage of surveillance mecha-nisms that normally degrade such ‘‘cryp-

tic’’ transcripts greatly increases theirabundance [7,8].

We acknowledge that the phrase quotedby Clark et al. in our Author Summaryshould have read ‘‘stably transcribed’’, orsome equivalent, rather than simply ‘‘tran-scribed’’. But this does not change the factthat we strongly disagree with the funda-mental argument put forward by Clarket al., which is that the genomic areacorresponding to transcripts is more im-portant than their relative abundance. Thisviewpoint makes little sense to us. Given thevarious sources of extraneous sequencereads, both biological and laboratory-derived (see below), it is expected that withsufficient sequencing depth the entiregenome would eventually be encompassedby reads. Our statement that ‘‘the genomeis not as not as pervasively transcribed aspreviously reported’’ stems from the factthat our observations relate to the relativequantity of material detected.

Of course, some rare transcripts (and/or rare transcription) are functional, andlow-level transcription may also provide apool of material for evolutionary tinkering.But given that known mechanisms—inparticular, imperfections in termination(see below)—can explain the presence oflow-level random (and many non-random)transcripts, we believe the burden of proofis to show that such transcripts are indeedfunctional, rather than to disprove theirputative functionality.

Contradiction of PreviousReports

The fact that our analyses contradictprevious reports is precisely why we

emphasized the lack of abundant pervasivetranscription in our study. Clark et al. citepapers that have previously documentedpervasive transcription, and point out thatseveral different approaches have beenused as confirmation. We believe that Clarket al. misinterpret what can be claimedfrom much of the literature in this area, andfail to acknowledge known weaknesses insome of these studies. We previouslyreviewed these issues [9]. For example,the number of transfrags detected inpermuted tiling array data can be as highas it is in the real data [10]. In addition, acommon form of ‘‘validation’’ in thesepapers is RT-PCR or RACE, but theseapproaches are generally semi-quantitativeat best and are prone to artefacts such astemplate switching, which readily produceschimeric transcripts in vitro ([11] andreferences therein). Indeed, we note thatin the ENCODE 1% study [2] repeatedlycited by Clark et al., 75 of the 100 negativecontrols (randomly selected non-transfragregions) were actually detected by RACE,making the ‘‘validation’’ rate for negativecontrols only slightly lower than that for theintronic and intergenic transfrags (86%–88%). Thus, either the tiling arrays orRACE assays are highly error prone. Thecontention of Clark et al. that ‘‘any estimateof the pervasiveness of transcription re-quires inclusion of all data sources’’ isflawed, because if one introduces erroneousdata from even a single source, the estimatebecomes worse.

Accuracy of Tiling Arrays

We agree that results obtained fromtiling arrays should improve with increased

The Perspective section provides experts with aforum to comment on topical or controversial issuesof broad interest.

Citation: van Bakel H, Nislow C, Blencowe BJ, Hughes TR (2011) Response to ‘‘The Reality of PervasiveTranscription’’. PLoS Biol 9(7): e1001102. doi:10.1371/journal.pbio.1001102

Academic Editor: Michael B. Eisen, University of California Berkeley, United States of America

Published July 12, 2011

Copyright: ! 2011 van Bakel et al. This is an open-access article distributed under the terms of the CreativeCommons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,provided the original author and source are credited.

Funding: This work was supported by Canadian Institutes of Health Research (CIHR; http://www.cihr-irsc.gc.ca/) Operating Grant to TRH (grant no. 49451). HvB was supported by a CIHR postdoctoral fellowship (grantno. 193588). The funders had no role in study design, data collection and analysis, decision to publish, orpreparation of the manuscript.


* E-mail: [email protected]

PLoS Biology | www.plosbiology.org 1 July 2011 | Volume 9 | Issue 7 | e1001102

Defining functional DNA elements in the human genome Kellis M et al. PNAS 2014;111:6131-6138

Biochemical evidence not enough to iden1fy func1onal RNAs

•  RNA seq course

The RNA seq course •  From RNA seq to reads (Introduc1on) •  Mapping reads programs (Tuesday) •  Transcriptome reconstruc1on using reference (Tuesday) •  Transcriptome reconstruc1on without reference (Tuesday) •  QC analysis (Wednesday) •  Differen1al expression analysis (Wednesday) •  Gene set analysis (Wednesday) •  miRNA analysis (Thursday) •  Allele specific analysis (Thursday) •  Single cell analysis (Thursday)

How are RNA-‐seq data generated?

Sampling process

Depending on the different steps you will get different results

AAAAAAAA

enrichments -‐>

reads -‐>

library -‐>

RNA-‐> PolyA (mRNA) RiboMinus (-‐ rRNA) Size <50 nt (miRNA ) …..

Size of fragment Strand specific 5’ end specific 3’ end specific …..

Single end (1 read per fragment) Paired end (2 reads per fragment)

Promises and pi7alls

Long reads •  Low throughput (-‐) •  Complete transcripts (+) •  Only highly expressed

genes (-‐-‐) •  Expensive (-‐) •  Low background noise (+) •  Easy downstream analysis

(+)

Micro Arrays •  High throughput (+) •  Only known sequences (-‐) •  Limited dynamic range (-‐) •  Cheap (+) •  High background noise (-‐) •  Not strand specific (-‐) •  Well established downstream

methods (+)

short reads •  High throughput (+) •  Frac1ons of transcripts (-‐) •  Full dynamic range (+-‐) •  Unlimited dynamic range (+) •  Cheap (+) •  Low background noise (+) •  Strand specificity (+) •  Re-‐sequencing (+)

1

10

100

1000

10000

1 10 100 1000 10000 100000 1000000

Signal

# trancripts/cell

EST

MicroArray

RNAseq

RNA seq reads correspond directly to abundance of RNAs in the sample

Map reads to reference

Transcriptome assembly using reference

Transcriptome assembly without reference

Quality control -‐samples might not be what you think they are •  Experiments go wrong –  30 samples with 5 steps from samples to reads has 150 poten1al steps for errors

–  Error rate 1/100 with 5 steps suggest that one of every 20 samples the reads does not represent the sample

•  Mixing samples –  30 samples with 5 steps from samples to reads has ~24M poten1al mix ups of samples

–  Error rate 1/ 100 with 5 steps suggest that one of every 20 sample is mislabeled

•  Combine the two steps and approximately one of every 10 samples are wrong

RNA QC Read quality

Transcript quality

Mapping sta1s1cs

Compare between samples

Differen1al expression analysis using univariate analysis

Typically univariate analysis (one gene at a 1me) – even though we know that genes are not independent

Gene set analysis and data integra1on

26

RNase Y and virulence

- 38% of total virulence factors are affected by RNase Y deletion (88% downregulated) - 18% cleaved by RNase Y

with the help of Cristina Della Beffa

¾ SCP hidden by gene downregulation ?

microRNA analysis (Johan)

(Berezikov et al. Genome Research, 2011.)

Single cell RNA-‐seq analysis

(Sandberg, Nature Methods 2014)