RNA-Seq: From A to T

Nick Beckloff Director, Genomics Core

Research Technology Support Facilities

Tracy Teal BEACON, MMG

Michigan State University 7/30/2014

Outline

• RNA-Seq basics • Sample input • Choosing a method • Sequencing • Library Preparation • Validation • RNA-Seq vs Arrays • Future

Upcoming Events

• Wafegen seminar September* • Methylation Boot Camp September* • BiCEP Launch September • iCER 10th Anniversary October

* Tentative due to availability

RNA-Seq Applications

RNA-Seq

Transcriptome Profiling

Biomarkers

sRNA Variants

Isoforms

Novel transcripts

Contents in this presentation may change at any time**

RNA-Seq Basics

RNA Isolation → RNA QC → RNA-Seq Library prep (Fragmentation → cDNA synthesis → Adapter/Barcode → Amplification) → Library QC → Sequencing → Analysis

RNA-seq Project Checklist

• Budget • Repetitions

Analysis

• Quality, quantity • Species

Library

• Read Length • Depth

Sequencing

RNA-seq Guidelines

Reps

Length Coverage

Length: Differential Expression = 50 vs 100 bp
Coverage: Euk = 20-30M reads/sample, Prok = 10M reads/sample
Repetitions = 3+
Reps > Depth
“when moving from 10 – 30 MM reads with 2 or 3 replicates, one will pull in approximately 25% more differentially expressed (DE) genes.” Bioinformatics Volume 30, Issue 3, 301-304.

RNA-seq Guidelines

Depth vs Reps: a real-world scenario

• 12 total samples (1 control and 5 conditions in duplicate)
• 1×50 bp sequencing for differential expression
• 10 MM reads per sample (12 samples in one lane of ILMN HiSeq) OR 30 MM reads per sample (4 samples in one lane of ILMN HiSeq)

What is the best scenario?

Increasing reads from 10M to 30M gives 25% more DE genes at 1.5x cost

An increase in reps provides 35% more DE genes at 1.6x cost. Bioinformatics Volume 30, Issue 3, 301-304.
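The trade-off can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted on these slides (25% more DE genes at 1.5x cost for more depth, 35% more at 1.6x cost for more replicates). The lane yield of 120M reads is an assumption inferred from "12 samples at 10 MM reads in one lane", and the function names are illustrative.

```python
import math

# Assumed lane yield, inferred from "12 samples x 10M reads in one lane".
ASSUMED_READS_PER_LANE = 120_000_000


def lanes_needed(n_samples, reads_per_sample, reads_per_lane=ASSUMED_READS_PER_LANE):
    """Number of whole lanes required to give every sample the target depth."""
    return math.ceil(n_samples * reads_per_sample / reads_per_lane)


def extra_de_per_extra_cost(extra_de_fraction, cost_multiplier):
    """Additional DE genes (as a fraction) gained per unit of additional relative cost."""
    return extra_de_fraction / (cost_multiplier - 1.0)


depth_gain = extra_de_per_extra_cost(0.25, 1.5)  # 10M -> 30M reads per sample
reps_gain = extra_de_per_extra_cost(0.35, 1.6)   # more replicates

print("lanes at 10M reads/sample:", lanes_needed(12, 10_000_000))
print("lanes at 30M reads/sample:", lanes_needed(12, 30_000_000))
print("more depth:", round(depth_gain, 2), "| more reps:", round(reps_gain, 2))
```

Under these numbers, adding replicates returns slightly more DE genes per unit of extra spend than adding depth, which is consistent with the "Reps > Depth" guideline.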

Project Planning High Quality

Partially Degraded

Degraded

RNA input is one of the most important drivers in selecting a library prep method

• What species?
• RNA quality?
• Quantity?
- 1-200 ng, 1-5 ug, etc.
• What RNA species?
- mRNA, small RNA, etc.

Choosing a Method

Library Preps (PolyA selection)

• Up to 1 ug input • RIN >7.0 • Clean output • Only PolyA RNA • Stranded

Features/Limitations

Library Preps (rRNA removal)

• Not 100% efficient • Check species compatibility • 100 ng-5 ug input* • Excludes RNAs < 200bp • Can use with degraded RNA

Features/Limitations Epicentre: Ribo-Zero TM

Presentation Notes
Essentially a subtractive hybridization. The rRNA removal reagent contains oligo probes complementary to rRNA sequences. Magnetic beads bind rRNA-probe complexes and remove them from solution. The process takes ~1-1.5 hours and requires 1 ug total RNA.

Library Preps (rRNA removal)

http://www.epibio.com/rnamatchmaker

Check non-model organisms

BLAST rRNA seqs New Ribozero Plant

Leaf/Seed/Root!!

Library Preps (rRNA removal)

• Not 100% efficient • Some junk carry over • Find sweet spot • May need extra reads

Limitations/Tips

Sweet Spot

Library Preps (Quantitative RNA)

• 96 distinct, 8 nt Molecular Index

• Large number of combinations (96x96=9216)

• 96 Barcodes • 10-100 ng input • PE Sequencing

Features/Limitations Bioo Scientific qRNA-Seq kit

Library Preps (Low Input RNA)

Nugen Ovation System V2

• Oligo dT and random priming amplifies both polyA AND non-polyA

• 500 pg input • Stranded • Prokaryotic version • RNA < 200 bp lost

Features/Limitations

Library Preps (Low Input RNA)

SMARTer Universal Low Input

• Recommended for single cell (Fluidigm C1) • Input 100 pg • Only for polyA samples • Yields best data for high quality RNA*

*allegedly

Features/Limitations

Library Preps (sRNA)

NEXTflex Small RNA Kit

• Sequencing of small RNAs and miRNAs

• Gel free adapter depletion • No PAGE gel cuts • Total RNA input >1 ug • Adapters are ligated to samples

Features/Limitations

Library Preps (Targeted RNA)

SureSelect RNA capture Kit

• Targeted capture of RNA • Total RNA input • KB to 10 MB size • Post-library capture methods • Nugen uses single primer method • No cost to design

Features/Limitations

Library Preps (Targeted Depletion)

Nugen InDA-C (Insert Dependent Adapter Cleavage)

• Customized probes for target exclusion

• Eliminates targets by cleaving sequencing adapter

• Post-library selection • Used in Ovation Prokaryotic system • Stranded

Features/Limitations

[Figure: InDA-C library workflow. Labeled steps: 1st strand synthesis, 2nd strand synthesis, fragmentation, adaptor ligation, strand selection, InDA-C, enrichment PCR, bead purification]

Library Preps (Targeted Depletion)

Targeted Depletion of rRNA in Vitis vinifera (cultivar Pinot Noir)

Percentage of Mapped Transcripts

            Cyto rRNA   Chloroplast RNA   Mito rRNA   Informative
Library 1   15.4        28.2              0.5         55.9
Library 2   18.9        24.3              0.6         56.2

% informative reads increased from 22% to 56% with InDA-C

[Figure: Mouse 18S coverage with and without InDA-C. Tracks: No InDA-C, With InDA-C, probe location]

Library Preps (Targeted Depletion)

Library Preps (Single Cell RNA)

Ovation Single Cell System

• 1-5 pg input • Converts to cDNA, amplifies library

Features/Limitations

• Bioanalyzer traces of Ovation Single Cell RNA-Seq libraries from CD8+ resting sorted T-cells

• Functional libraries created from a single cell

• Estimated total RNA content per cell is 0.5 pg

Input       Total Reads   % Aligned   % rRNA   % Non-rRNA Single Site RefSeq   Strand Retention
1 cell      3,175,287     35          1.3      67                              95.2
1 cell      2,928,789     39          1.4      69                              87.9
10 cells    3,089,680     54          4.1      69                              97.9
10 cells    2,904,477     52          3.6      69                              98.1
100 cells   3,687,320     81          2.9      66                              98.0
100 cells   3,179,519     83          2.9      65                              98.4

• Good alignment to genome – less wasted reads
• Ribosomal reads less than 5% – more informative sequence
• Excellent strand retention – improved transcriptional value

Library Preps (Single Cell RNA)

RNA-Seq Validation

RNA-Seq Validation (qPCR)

• qPCR validation of RNA-seq • Wafegen SmartChip System • >5,000 rxns/chip • 100 nl volume • Supports TaqMan and SYBR Green • Custom targets for NGS

Features/Limitations

RNA-seq vs Arrays

• RNA-seq vs Microarrays
- Cost is comparable
- Microarrays only detect what is spotted
- RNA-seq > arrays for isoforms, novel transcripts
- Complementary to one another
• Intangible
• Reviewers prefer RNA-seq in grants

Take Home Message

Genomics researchers astonished to learn microarrays still exist!! – The Science Web

Future of RNA-seq

• Transcriptomes of Everything • Lower Inputs • Single cell vs multiple profiles • Longer reads • More novel isoforms • More RNA subspecies

Take Home Message

General considerations for RNA-seq quantification for differential expression

Tracy K. Teal

Assistant Professor Microbiology & Molecular Genetics

July 30, 2014

Adapted from NGS RNASeq slides. Author: Ian Dworkin

What are the goals of your research? Why did you generate all of the RNAseq data in the first place?

RNA-seq is generated for a number of reasons

§  Transcriptome assembly (& SNP discovery)
§  Transcript discovery (variants for transcription start site, alternative splicing, etc.)
§  Quantification of (alternative) transcripts
§  Differential expression analysis across treatments

What were once thought to be separate goals are now clearly recognized as intertwined.

§  Early work for RNA-seq tried to “mirror” the type of gene level analysis used in microarrays.

§  However, RNA-seq has demonstrated how important it is to take into account alternative transcripts, even when attempting to get “gene level” measures.

How do we put together a useful pipeline for RNAseq?

What are the steps we need to consider?
§  Quality filtering
§  Genome/transcriptome assembly
§  Mapping reads to genome/transcriptome
§  Deal with alternative transcripts (new transcriptome)?
§  Remap & count reads
§  Differential expression

Quality filtering: your analysis is only as good as your data

§  Quality control and removal of poor-quality reads (FASTQC, RNASeQC, fastx, …)

§  Remove adapters and linkers (FASTQC, Trimmomatic, …)
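To make the quality-filtering step concrete, here is a minimal pure-Python sketch of dropping poor-quality reads from FASTQ records. It is not how FastQC, fastx, or Trimmomatic are implemented. It assumes Phred+33 quality encoding, and the thresholds (mean quality 20, minimum length 30) and function names are illustrative assumptions.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a read, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def parse_fastq(lines):
    """Yield (header, sequence, quality) from FASTQ lines (4 lines per record)."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i], lines[i + 1], lines[i + 3]


def filter_reads(records, min_mean_quality=20, min_length=30):
    """Keep reads that are long enough and whose mean quality passes the threshold."""
    return [
        (header, seq, qual)
        for header, seq, qual in records
        if len(seq) >= min_length and mean_phred(qual) >= min_mean_quality
    ]


# Toy records, not real data.
fastq_lines = [
    "@good", "ACGT" * 9, "+", "I" * 36,  # Phred 40 at every base
    "@poor", "ACGT" * 9, "+", "#" * 36,  # Phred 2 at every base
]
kept = filter_reads(parse_fastq(fastq_lines))
print([header for header, _, _ in kept])
```

In practice the dedicated tools listed above also handle adapter removal, sliding-window trimming, and paired-end bookkeeping, which this sketch leaves out.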

Mapping reads: ultimately all analyses require read mapping

Image credit: Nir Friedman lab

Challenge: alternative splicing

Overview of RNA-Seq analysis pipeline for detecting differential expression

Oshlack et al., From RNA-seq reads to differential expression results, Genome Biology 2010.

Quality filter

RNA-seq Workflows and Tools. Stephen Turner. Figshare. http://dx.doi.org/10.6084/m9.figshare.662782

Pipelines for RNA-seq (geared towards splicing)

Alamancos et al. Methods to Study Splicing from RNA-Seq. http://arxiv.org/abs/1304.5952 Figshare. http://dx.doi.org/10.6084/m9.figshare.679993

The “tuxedo” protocol for RNA-seq

Trapnell C et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7, 562–578 (2012)

Nookaew et al. 2012 NAR

How should we map reads?

§  Do we want to map to a reference genome (with a “splice aware” aligner)?

§  Or do we want to map to a transcriptome directly?

Mapping to the genome

How do we deal with alternative transcripts or paralogs during mapping?

§  "Splicing aware" aligners:
  §  Exon first (TopHat, MapSplice, SpliceMap), Fig. 1A in Garber et al.
    §  Step 1 - map reads to genome
    §  Step 2 - unmapped reads are split and aligned
  §  Seed & extend (GSNAP, QPALMA), Fig. 1B in Garber et al.
    §  k-mers from reads are mapped (the seeds) and then extended

Garber et al. 2011
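The seed-and-extend idea can be sketched in a few lines. This toy version is ungapped and not splice-aware, so it is not how GSNAP or QPALMA work internally; it only shows k-mer seeds being looked up in an index and each hit being extended to a full-length comparison. The sequences, the value k = 4, and the mismatch limit are illustrative assumptions.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) of the best ungapped placement, or None."""
    best = None
    tried = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for ref_pos in index.get(seed, ()):
            start = ref_pos - offset  # where the whole read would begin
            if start in tried or start < 0 or start + len(read) > len(reference):
                continue
            tried.add(start)
            # Extension step: compare the full read against the reference window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "TTGACCGTAGGCTAACGGATCCATGCAAGT"  # toy sequence, not real data
read = "CGTAGTCTAACG"                         # reference[5:17] with one substitution
index = build_kmer_index(reference, k=4)
print(seed_and_extend(read, reference, index, k=4))
```

Because only one seed has to match exactly, the read is still placed despite the substitution, which is why seed-based methods tolerate divergence better than exact-match searches.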

Mapping to a transcriptome

§  What might be the downside to mapping to the transcriptome? Incomplete transcriptomes can lead to errors in inferred expression levels. Potentially less well annotated.
§  For this, Burrows-Wheeler is faster than seed-based approaches (SHRiMP & Stampy), but the latter may be preferred if mapping to "distant" transcriptomes.

Which to use

§  If a (close to?) perfect match transcriptome assembly is available for mapping, Burrows-Wheeler based aligners can be much faster than seed-based methods (up to 15x faster).
§  BW-based aligners have reduced performance once mismatches are considered.
  §  Exponential decrease in performance with each additional mismatch (iteratively performs perfect searches).
§  Seed methods may be more sensitive when mapping to transcriptomes of distantly related species (or high polymorphism rates).

From Garber et al. 2011

Counting

§  One of the most difficult issues has been how to count reads.

§  What are some of the issues that we need to account for during counting of reads?

Counting

§  We are interested in transcript abundance.
§  But we need to take into account a number of things:
  §  How many reads in the sample
  §  Length of transcripts
  §  GC content and sequencing bias

Counting

§  RPKM (Reads Per Kilobase of transcript per Million mapped reads) – Mortazavi et al 2008

§  FPKM (Fragments Per Kilobase of transcript per Million mapped reads). Avoids double counting in paired-end sequencing.

Normalizing a transcript's read count by both its length and the total number of mapped reads in the sample

Garber et al. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods, Jun 2011
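The RPKM and FPKM definitions above translate directly into code. This is a minimal sketch with illustrative function names and made-up counts; the factor of 1e9 combines "per kilobase" (1e3) and "per million" (1e6).

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Same normalization, but counting fragments (read pairs) instead of reads."""
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Made-up example: 500 reads on a 2 kb transcript in a library of 10M mapped reads.
print(rpkm(500, 2000, 10_000_000))

# The same read count on a transcript twice as long gives half the value.
print(rpkm(500, 4000, 10_000_000))
```

The second call shows why length normalization matters: equal read counts do not mean equal abundance when transcript lengths differ.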

Accounting for multiple isoforms

§  Only count reads that map uniquely to an isoform. This can be very problematic when isoforms do not have unique exons.

§  So-called "isoform-expression" methods (Cufflinks, MISO) model the uncertainty parametrically (often using MLE). The mix of isoforms that best models the data (highest joint probability) is the best estimate. How this is handled differs a great deal between models.

Garber et al. 2011

Trapnell C et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013 Jan;31(1):46-53
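As an illustration of the maximum-likelihood idea behind isoform-expression methods, here is a toy expectation-maximization (EM) sketch. It is not the Cufflinks or MISO model: it ignores transcript length, fragment size, and sequencing bias, and simply splits each ambiguous read among its compatible isoforms in proportion to the current abundance estimates. The function name and the read counts are illustrative assumptions.

```python
def em_isoform_fractions(compatibility, n_isoforms, n_iter=200):
    """Estimate isoform fractions from reads that may match several isoforms.

    compatibility: one tuple per read, listing the isoform indices it is compatible with.
    """
    theta = [1.0 / n_isoforms] * n_isoforms  # start from a uniform guess
    for _ in range(n_iter):
        expected = [0.0] * n_isoforms
        # E-step: assign each read fractionally to its compatible isoforms.
        for isoforms in compatibility:
            total = sum(theta[i] for i in isoforms)
            for i in isoforms:
                expected[i] += theta[i] / total
        # M-step: new fractions are the normalized expected read counts.
        theta = [count / len(compatibility) for count in expected]
    return theta


# Toy data: 30 reads unique to isoform 0, 10 unique to isoform 1, 60 shared by both.
reads = [(0,)] * 30 + [(1,)] * 10 + [(0, 1)] * 60
fractions = em_isoform_fractions(reads, n_isoforms=2)
print([round(f, 3) for f in fractions])
```

Counting only the uniquely mapping reads would discard 60% of this toy data set; the EM estimate uses all reads and, in this simple case, settles on the 3:1 ratio suggested by the unique reads.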

Differential expression

§  DESeq (http://www.ncbi.nlm.nih.gov/pubmed/20979621)
§  edgeR
§  EBSeq (RSEM/EBSeq)
§  RSEM (http://deweylab.biostat.wisc.edu/rsem/)
§  eXpress (http://bio.math.berkeley.edu/eXpress/overview.html)
§  BEERS simulation pipeline (http://www.cbil.upenn.edu/BEERS/)
§  DEXSeq (http://bioconductor.org/packages/release/bioc/html/DEXSeq.html)
§  limma (voom)
§  HTSeq (Python library), works with DESeq

Nookaew et al. 2012 NAR

Differentially expressed genes based on software for quantification
Differentially expressed genes based on software for mapping

Problems with Cufflinks and Cuffdiff? Reproducibility…
§  http://seqanswers.com/forums/showthread.php?t=20702
§  http://seqanswers.com/forums/showthread.php?t=17662
§  http://seqanswers.com/forums/showthread.php?t=23962
§  http://seqanswers.com/forums/showthread.php?t=21020
§  http://seqanswers.com/forums/showthread.php?t=21708
§  http://www.biostars.org/p/6317/

So, what to do?

Example workflows and tutorials
§  Ian Dworkin’s NGS course protocols http://ged.msu.edu/angus/tutorials-2013/index.html
§  Bacterial RNA-Seq workflow from Ben Johnson & Rob Abramovitch http://www.abramovitchlab.com/#/rna-seq-computational-methods/
§  Canadian Bioinformatics workshops http://bioinformatics.ca/workshops/2013/informatics-rna-sequence-analysis-2013
§  Trinity and Tuxedo tutorials http://trinityrnaseq.sourceforge.net/rnaseq_workshop.html
§  Samtools for variant calling

The “tuxedo” protocol for RNA-seq

Trapnell et al. 2012

Overviews of RNA-Seq

§  Garber et al., Computational methods for transcriptome annotation and quantification using RNA-seq, Nat Methods, Jun 2011

§  http://jura.wi.mit.edu/bio/education/hot_topics/RNAseq/RNAseqDE_Dec2011.pdf

Aligning to a transcriptome or a genome

§  Aligning to a genome, you have to account for the different splice variations

§  Aligning to a transcriptome, you have the different isoforms, so the mapping is more straightforward

§  However, you might have to assemble your own transcriptome

How to assemble multiple alternative spliced transcripts?

[Figure: elements labeled 1, 2, 3 and candidate assemblies labeled overlapping, truncated, or correct]

In the presence of AS, conventional assembly may be erroneous, ambiguous, or truncated.

Need to use splice-aware assemblers

• Cufflinks (most commonly used) • Scripture • Trinity • Trans-ABySS • GRIT

Recommended