RNA-‐seq Introduc1on
Promises and pi7alls
DNA is the same in all cells but which RNAs that is present is different in all
cells
There is a wide variety of different func1onal RNAs
Which RNAs (and some1mes then translated to proteins) varies between
samples -‐Tissues -‐ Cell types
-‐ Cell states
-‐Individuals -‐Cells
RNA gives informa1on on which genes that are expressed
How DNA get transcribed to RNA (and some1mes then translated to proteins) varies between e. g. -‐Tissues -‐ Cell types
-‐ Cell states
-‐Individuals
RNA gives informa1on on which genes that are expressed
How DNA get transcribed to RNA (and some1mes then translated to proteins) varies between e. g. -‐Tissues -‐ Cell types
-‐ Cell states
-‐Individuals
RNA flavors (pre sequencing era)
• House keeping RNAs – rRNAs, tRNAs, snoRNAs, snRNAs, SRP RNAs, cataly1c RNAs (RNAse E)
• Protein coding RNAs – (1 coding gene ~ 1 mRNA)
• Regulatory RNAs – Few rare examples
ENCODE, the Encyclopedia of DNA Elements, is a project funded by the Na1onal Human Genome Research Ins1tute to iden1fy all regions of transcrip1on, transcrip1on factor associa1on, chroma1n structure and histone modifica1on in the human genome sequence.
ENCyclopedia Of Dna Elements
Different kind of RNAs have different expression values
Landscape of transcrip/on in human cells, S Djebali et al. Nature 2012
What defines RNA depends on how you look at it
Variants
Adapted from Landscape of transcrip/on in human cells, S Djebali et al. Nature 2012
Abundance House keeping RNAs
mRNAs
Regulatory RNAs
Novel intergenic
None
Coverage
One gene many different mRNAs RNA-seq: alternative splicing
Defining func1onal DNA elements in
the human genome • Statement
– A priori, we should not expect the transcriptome to consist exclusively of func1onal RNAs.
• Why is that – Zero tolerance for errant
transcripts would come at high cost in the proofreading machinery needed to perfectly gate RNA polymerase and splicing ac1vi1es, or to instantly eliminate spurious transcripts.
– In general, sequences encoding RNAs transcribed by noisy transcrip1onal machinery are expected to be less constrained, which is consistent with data shown here for very low abundance RNA
• Consequence – Thus, one should have high
confidence that the subset of the genome with large signals for RNA or chroma1n signatures coupled with strong conserva1on is func1onal and will be supported by appropriate gene1c tests.
– In contrast, the larger propor1on of genome with reproducible but low biochemical signal strength and less evolu1onary conserva1on is challenging to parse between specific func1ons and biological noise.
This is of course not without an debate Variants Abundance
Most ‘‘Dark Matter’’ Transcripts Are Associated WithKnown GenesHarm van Bakel1, Corey Nislow1,2, Benjamin J. Blencowe1,2, Timothy R. Hughes1,2*
1 Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada, 2 Department of Molecular Genetics, University of Toronto, Toronto,
Ontario, Canada
Abstract
A series of reports over the last few years have indicated that a much larger portion of the mammalian genome istranscribed than can be accounted for by currently annotated genes, but the quantity and nature of these additionaltranscripts remains unclear. Here, we have used data from single- and paired-end RNA-Seq and tiling arrays to assess thequantity and composition of transcripts in PolyA+ RNA from human and mouse tissues. Relative to tiling arrays, RNA-Seqidentifies many fewer transcribed regions (‘‘seqfrags’’) outside known exons and ncRNAs. Most nonexonic seqfrags are inintrons, raising the possibility that they are fragments of pre-mRNAs. The chromosomal locations of the majority ofintergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation siteusage, promoter- and terminator-associated transcripts, or new alternative exons; indeed, reads that bridge splice sitesidentified 4,544 new exons, affecting 3,554 genes. Most of the remaining seqfrags correspond to either single reads thatdisplay characteristics of random sampling from a low-level background or several thousand small transcripts (medianlength = 111 bp) present at higher levels, which also tend to display sequence conservation and originate from regions withopen chromatin. We conclude that, while there are bona fide new intergenic transcripts, their number and abundance isgenerally low in comparison to known exons, and the genome is not as pervasively transcribed as previously reported.
Citation: van Bakel H, Nislow C, Blencowe BJ, Hughes TR (2010) Most ‘‘Dark Matter’’ Transcripts Are Associated With Known Genes. PLoS Biol 8(5): e1000371.doi:10.1371/journal.pbio.1000371
Academic Editor: Sean R. Eddy, HHMI Janelia Farm, United States of America
Received December 3, 2009; Accepted April 9, 2010; Published May 18, 2010
Copyright: ! 2010 van Bakel et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permitsunrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by Genome Canada (http://www.genomecanada.ca) through the Ontario Genomics Institute, the Ontario Research Fund, andMarch of Dimes (http://www.marchofdimes.com). HvB was supported by the Netherlands Organization for Scientific Research (NWO; http://www.nwo.nl) (grantno. 825.06.033) and the Canadian Institutes of Health Research (CIHR; http://www.cihr-irsc.gc.ca/) (grant no. 193588). The funders had no role in study design,data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Abbreviations: APA, alternative cleavage and polyadenylation; BW, bandwidth parameter; CAGE, capped analysis of gene expression; CNV, copy numbervariation; lincRNAs, large intervening noncoding RNAs; ncRNAs, noncoding RNAs; ORF, open reading frames; pasRNA, promoter-associated RNA; TSS, transcriptionstart site; TTS, transcription termination site; TU, transcript unit; TUF, transcript of unknown function
* E-mail: [email protected]
Introduction
In recent years established views of transcription have beenchallenged by the observation that a much larger portion of thehuman and mouse genomes is transcribed than can be accountedfor by currently annotated coding and noncoding genes. The bulkof these findings have come from experiments using ‘‘tiling’’microarrays with probes that cover the non-repetitive genome atregular intervals [1–9], or from sequencing efforts of full-lengthcDNA libraries enriched for rare transcripts [10,11]. Additionally,capped analysis of gene expression (CAGE) in human and mouseshow that a significant number of sequenced 59 tags map tointergenic regions [12]. Estimates of the proportion of transcriptsthat map to locations separate from known exons range from 47%to 80% and are distributed approximately equally between intronsand intergenic regions. Dubbed transcriptional ‘‘dark matter’’[13], the ‘‘hidden’’ transcriptome [1], or transcripts of unknownfunction (TUFs) [4,14], the exact nature of much of this additionaltranscription is unclear, but it has been presumed to comprise acombination of novel protein coding transcripts, extensions ofexisting transcripts, noncoding RNAs (ncRNAs), antisense tran-scripts, and biological or experimental background. Determining
the relative contributions of each of these potential sources isimportant for understanding the nature and possible biologicalfunction of transcriptional dark matter.
Homology searches for transcripts mapping outside knownannotation boundaries [10], as well as cDNA sequencing efforts,indicate that it is still possible to find new exons of protein codinggenes [10,15,16]. The genomic positions of TUFs are also biasedtowards known transcripts [8], suggesting that at least a portionmay represent extensions of current gene annotations. Neverthe-less, the majority of dark matter transcripts is thought to benoncoding [2,4,5,10]. Previous efforts to characterize dark mattertranscripts have revealed the existence of thousands of ncRNAswith evidence for tissue-specific expression [17,18], as well as overa thousand large intervening noncoding RNAs (lincRNAs)originating from intergenic regions bearing chromatin marksassociated with transcription [19]. Other studies have reportednew classes of ncRNAs, such as those that cluster close to thetranscription start sites (TSSs) of protein coding genes [20–24].These promoter-associated RNAs (pasRNAs) typically initiate inthe nucleosome free regions that mark a TSS, with transcriptionoccurring in both directions. Finally, results from the ENCODEpilot project have suggested a highly interleaved structure of the
PLoS Biology | www.plosbiology.org 1 May 2010 | Volume 8 | Issue 5 | e1000371
Perspective
The Reality of Pervasive TranscriptionMichael B. Clark1, Paulo P. Amaral1., Felix J. Schlesinger2., Marcel E. Dinger1, Ryan J. Taft1, John L.
Rinn3, Chris P. Ponting4, Peter F. Stadler5, Kevin V. Morris6, Antonin Morillon7, Joel S. Rozowsky8,
Mark B. Gerstein8, Claes Wahlestedt9, Yoshihide Hayashizaki10, Piero Carninci10, Thomas R. Gingeras2*,
John S. Mattick1*
1 Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, Australia, 2 Watson School of Biological Sciences, Cold Spring Harbor Laboratory,
Cold Spring Harbor, New York, United States of America, 3 Broad Institute, Cambridge, Massachusetts, United States of America, 4 MRC Functional Genomics Unit,
Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom, 5 Department of Computer Science, University of Leipzig, Leipzig,
Germany, 6 Department of Molecular and Experimental Medicine, Scripps Research Institute, La Jolla, California, United States of America, 7 Institut Curie, UMR3244-
Pavillon Trouillet Rossignol, Paris, France, 8 Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America, 9 University of
Miami, Miami, Florida, United States of America, 10 Omics Science Center, RIKEN Yokohama Institute, Tsurumi-ku, Yokohama, Kanagawa, Japan
Current estimates indicate that onlyabout 1.2% of the mammalian genomecodes for amino acids in proteins. How-ever, mounting evidence over the pastdecade has suggested that the vast major-ity of the genome is transcribed, wellbeyond the boundaries of known genes, aphenomenon known as pervasive tran-scription [1]. Challenging this view, anarticle published in PLoS Biology by vanBakel et al. concluded that ‘‘the genome isnot as pervasively transcribed as previous-ly reported’’ [2] and that the majority ofthe detected low-level transcription is dueto technical artefacts and/or backgroundbiological noise. These conclusions attract-ed considerable publicity [3–6]. Here, wepresent an evaluation of the analysis andconclusions of van Bakel et al. comparedto those of others and show that (1) theexistence of pervasive transcription issupported by multiple independent tech-niques; (2) re-analysis of the van Bakel etal. tiling arrays shows that their results areatypical compared to those of ENCODEand lack independent validation; and (3)the RNA sequencing dataset used by vanBakel et al. suffered from insufficientsequencing depth and poor transcriptassembly, compromising their ability todetect the less abundant transcripts outsideof protein-coding genes. We conclude thatthe totality of the evidence stronglysupports pervasive transcription of mam-malian genomes, although the biologicalsignificance of many novel coding andnoncoding transcripts remains to be ex-plored.
Previous Evidence for PervasiveTranscription
The conclusion that the mammaliangenome is pervasively transcribed (i.e.,‘‘that the majority of its bases are associ-ated with at least one primary transcript’’[1]) was based on multiple lines ofevidence. Both large-scale cDNA sequenc-ing and hybridization to genome-widetiling arrays were the major empiricalsources of data. Analysis of full-lengthcDNAs from many tissues and develop-mental stages in mouse showed that atleast 63% of the genome is transcribed andidentified thousands of novel protein-coding transcripts and over 30,000 longnoncoding intronic, intergenic, and anti-sense transcripts [7–9]. In parallel, wholechromosome tiling array interrogation ofthe RNA content of a variety of humantissues and cell lines revealed that, collec-tively, at least 93% of genomic bases aretranscribed in one cell type or another[1,10–13].
Since it is well established that highlyexpressed mRNAs dominate the non-
ribosomal portion of the polyA+ transcrip-tome [7,8,10,14–19], normalization ap-proaches were used to reduce the quantityof highly expressed transcripts in thesecDNA analyses [7,8], and are implicit intiling array approaches. This was neces-sary to allow the detection of rarer (oftencell type–restricted [1,13,16,19,20]) tran-scripts.
The evidence for pervasive transcriptionalso includes observations from a widevariety of other independent techniques(see reviews [21] and [22] for references).Indeed, a simple query of currentlyavailable human spliced EST data inGenBank shows that documented tran-scripts cover 57.09% of the genome.Because ESTs are largely generated frompolyadenylated RNAs and do not exhaus-tively sample the transcriptome, this cov-erage represents the lower bound ofgenomic transcription.
Based on an analysis of genome-widetiling arrays and short read RNA sequenc-ing data, van Bakel et al. report that ‘‘most‘dark matter’ transcripts (i.e., novel tran-scripts of unknown function) are associated
The Perspective section provides experts with aforum to comment on topical or controversial issuesof broad interest.
Citation: Clark MB, Amaral PP, Schlesinger FJ, Dinger ME, Taft RJ, et al. (2011) The Reality of PervasiveTranscription. PLoS Biol 9(7): e1000625. doi:10.1371/journal.pbio.1000625
Academic Editor: Michael B. Eisen, University of California Berkeley, United States of America
Published July 12, 2011
Copyright: ! 2011 Clark et al. This is an open-access article distributed under the terms of the CreativeCommons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,provided the original author and source are credited.
Funding: This work was supported by the Australian Research Council and National Health and MedicalResearch Council (JSM), NHGRI (TRG, PC), NIH (KVM, MG), Japanese MEXT Research Grant for the RIKEN OmicsScience Center (YH, PC), EU FP-7 (PFS), Damon Runyon, Smith, Merkin and Searle scholarships (JLR), and UKBBSRC, ERC and MRC (CPP). The funders had no role in study design, data collection and analysis, decision topublish, or preparation of the manuscript. We thank Tim Hughes and Harm van Bakel for unreservedlyproviding their data to us for analysis.
Competing Interests: The authors have declared that no competing interests exist.
Abbreviation: PR, precision recall
* E-mail: [email protected] (JSM); [email protected] (TRG)
. These authors contributed equally to this work.
PLoS Biology | www.plosbiology.org 1 July 2011 | Volume 9 | Issue 7 | e1000625
Perspective
Response to ‘‘The Reality of Pervasive Transcription’’Harm van Bakel1, Corey Nislow1,2, Benjamin J. Blencowe1,2, Timothy R. Hughes1,2*
1 Banting and Best Department of Medical Research and Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario,
Canada, 2 Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
Clark et al. criticize several aspects ofour study [1], and specifically challengeour assertion that the degree of pervasivetranscription has previously been overstat-ed. We disagree with much of theirreasoning and their interpretation of ourwork. For example, many of our conclu-sions are based on overall sequence readdistributions, while Clark et al. focus ontranscript units and seqfrags (sets ofoverlapping reads). A key point is thatone can derive a robust estimate of therelative amounts of different transcripttypes without having a complete recon-struction of every single transcript.
In this brief response, we first revisitwhat is meant by pervasive transcription,and its potential significance. We thendiscuss the major points raised by Clarket al. in the order presented in theircritique. Finally, we demonstrate thatconclusions very similar to those of ouroriginal study are reached with a datasetwith far greater read depth, obtained bystrand-specific sequencing of rRNA-de-pleted total RNA from a single cell type.
The Meaning of ‘‘Pervasive’’,and the Importance ofTranscript Abundance
Clark et al. define pervasive transcrip-tion of a genome to mean ‘‘that themajority of its bases are associated withat least one primary transcript’’, which isthe same definition used in the ENCODE1% paper [2]. We believe that this specificclaim is not contested, nor is it particularlyinteresting. First, it has long been assumedthat roughly half of the human genomecomprises introns [3]. Second, the mech-anisms that control the positions ofinitiation and termination of Pol IItranscription, as well as RNA processing,are imperfect, such that low-level back-ground transcripts from both physiologi-cally relevant and non-canonical sites arise[4–6]. Blockage of surveillance mecha-nisms that normally degrade such ‘‘cryp-
tic’’ transcripts greatly increases theirabundance [7,8].
We acknowledge that the phrase quotedby Clark et al. in our Author Summaryshould have read ‘‘stably transcribed’’, orsome equivalent, rather than simply ‘‘tran-scribed’’. But this does not change the factthat we strongly disagree with the funda-mental argument put forward by Clarket al., which is that the genomic areacorresponding to transcripts is more im-portant than their relative abundance. Thisviewpoint makes little sense to us. Given thevarious sources of extraneous sequencereads, both biological and laboratory-derived (see below), it is expected that withsufficient sequencing depth the entiregenome would eventually be encompassedby reads. Our statement that ‘‘the genomeis not as not as pervasively transcribed aspreviously reported’’ stems from the factthat our observations relate to the relativequantity of material detected.
Of course, some rare transcripts (and/or rare transcription) are functional, andlow-level transcription may also provide apool of material for evolutionary tinkering.But given that known mechanisms—inparticular, imperfections in termination(see below)—can explain the presence oflow-level random (and many non-random)transcripts, we believe the burden of proofis to show that such transcripts are indeedfunctional, rather than to disprove theirputative functionality.
Contradiction of PreviousReports
The fact that our analyses contradictprevious reports is precisely why we
emphasized the lack of abundant pervasivetranscription in our study. Clark et al. citepapers that have previously documentedpervasive transcription, and point out thatseveral different approaches have beenused as confirmation. We believe that Clarket al. misinterpret what can be claimedfrom much of the literature in this area, andfail to acknowledge known weaknesses insome of these studies. We previouslyreviewed these issues [9]. For example,the number of transfrags detected inpermuted tiling array data can be as highas it is in the real data [10]. In addition, acommon form of ‘‘validation’’ in thesepapers is RT-PCR or RACE, but theseapproaches are generally semi-quantitativeat best and are prone to artefacts such astemplate switching, which readily produceschimeric transcripts in vitro ([11] andreferences therein). Indeed, we note thatin the ENCODE 1% study [2] repeatedlycited by Clark et al., 75 of the 100 negativecontrols (randomly selected non-transfragregions) were actually detected by RACE,making the ‘‘validation’’ rate for negativecontrols only slightly lower than that for theintronic and intergenic transfrags (86%–88%). Thus, either the tiling arrays orRACE assays are highly error prone. Thecontention of Clark et al. that ‘‘any estimateof the pervasiveness of transcription re-quires inclusion of all data sources’’ isflawed, because if one introduces erroneousdata from even a single source, the estimatebecomes worse.
Accuracy of Tiling Arrays
We agree that results obtained fromtiling arrays should improve with increased
The Perspective section provides experts with aforum to comment on topical or controversial issuesof broad interest.
Citation: van Bakel H, Nislow C, Blencowe BJ, Hughes TR (2011) Response to ‘‘The Reality of PervasiveTranscription’’. PLoS Biol 9(7): e1001102. doi:10.1371/journal.pbio.1001102
Academic Editor: Michael B. Eisen, University of California Berkeley, United States of America
Published July 12, 2011
Copyright: ! 2011 van Bakel et al. This is an open-access article distributed under the terms of the CreativeCommons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,provided the original author and source are credited.
Funding: This work was supported by Canadian Institutes of Health Research (CIHR; http://www.cihr-irsc.gc.ca/) Operating Grant to TRH (grant no. 49451). HvB was supported by a CIHR postdoctoral fellowship (grantno. 193588). The funders had no role in study design, data collection and analysis, decision to publish, orpreparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
PLoS Biology | www.plosbiology.org 1 July 2011 | Volume 9 | Issue 7 | e1001102
Defining functional DNA elements in the human genome Kellis M et al. PNAS 2014;111:6131-6138
Biochemical evidence not enough to iden1fy func1onal RNAs
• RNA seq course
The RNA seq course • From RNA seq to reads (Introduc1on) • Mapping reads programs (Tuesday) • Transcriptome reconstruc1on using reference (Tuesday) • Transcriptome reconstruc1on without reference (Tuesday) • QC analysis (Wednesday) • Differen1al expression analysis (Wednesday) • Gene set analysis (Wednesday) • miRNA analysis (Thursday) • Allele specific analysis (Thursday) • Single cell analysis (Thursday)
How are RNA-‐seq data generated?
Sampling process
Depending on the different steps you will get different results
AAAAAAAA
enrichments -‐>
reads -‐>
library -‐>
RNA-‐> PolyA (mRNA) RiboMinus (-‐ rRNA) Size <50 nt (miRNA ) …..
Size of fragment Strand specific 5’ end specific 3’ end specific …..
Single end (1 read per fragment) Paired end (2 reads per fragment)
Promises and pi7alls
Long reads • Low throughput (-‐) • Complete transcripts (+) • Only highly expressed
genes (-‐-‐) • Expensive (-‐) • Low background noise (+) • Easy downstream analysis
(+)
Micro Arrays • High throughput (+) • Only known sequences (-‐) • Limited dynamic range (-‐) • Cheap (+) • High background noise (-‐) • Not strand specific (-‐) • Well established downstream
methods (+)
short reads • High throughput (+) • Frac1ons of transcripts (-‐) • Full dynamic range (+-‐) • Unlimited dynamic range (+) • Cheap (+) • Low background noise (+) • Strand specificity (+) • Re-‐sequencing (+)
1
10
100
1000
10000
1 10 100 1000 10000 100000 1000000
Signal
# trancripts/cell
EST
MicroArray
RNAseq
RNA seq reads correspond directly to abundance of RNAs in the sample
Map reads to reference
Transcriptome assembly using reference
Transcriptome assembly without reference
Quality control -‐samples might not be what you think they are • Experiments go wrong – 30 samples with 5 steps from samples to reads has 150 poten1al steps for errors
– Error rate 1/100 with 5 steps suggest that one of every 20 samples the reads does not represent the sample
• Mixing samples – 30 samples with 5 steps from samples to reads has ~24M poten1al mix ups of samples
– Error rate 1/ 100 with 5 steps suggest that one of every 20 sample is mislabeled
• Combine the two steps and approximately one of every 10 samples are wrong
RNA QC Read quality
Transcript quality
Mapping sta1s1cs
Compare between samples
Differen1al expression analysis using univariate analysis
Typically univariate analysis (one gene at a 1me) – even though we know that genes are not independent
Gene set analysis and data integra1on
26
RNase Y and virulence
- 38% of total virulence factors are affected by RNase Y deletion (88% downregulated) - 18% cleaved by RNase Y
with the help of Cristina Della Beffa
¾ SCP hidden by gene downregulation ?
microRNA analysis (Johan)
(Berezikov et al. Genome Research, 2011.)
Single cell RNA-‐seq analysis
(Sandberg, Nature Methods 2014)