RNA-Seq with the Tuxedo Suite

Monica Britton, Ph.D. Sr. Bioinformatics Analyst

June 2015 Workshop

The Basic Tuxedo Suite

References

• Trapnell C, et al. 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics

• Trapnell C, et al. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology

• Kim D, et al. 2011. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology

• Roberts A, et al. 2011. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology

• Roberts A, et al. 2011. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics

• Trapnell C, et al. 2013. Differential analysis of gene regulation at transcript resolution with RNA-Seq. Nature Biotechnology

• Cufflinks assembles transcripts

• Cuffdiff identifies differential expression of genes/transcripts/promoters

Alignment and Differential Expression

Inputs: read set(s) and existing annotation (GTF)

TopHat → bam file(s)

Cuffdiff → Toptables, etc.
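The two-step workflow above can be sketched as command lines. This is a minimal sketch that only builds argument lists and does not run anything. The file names in the usage note are hypothetical placeholders; the options shown (-p threads, -G annotation, -o output directory, -L condition labels) are the commonly used ones, but check them against the installed version.

```python
def tophat_command(index_prefix, reads, gtf, out_dir, threads=1):
    """Build a TopHat command (argument list); nothing is executed here."""
    # -G supplies the existing annotation, -o sets the output directory.
    return ["tophat", "-p", str(threads), "-G", gtf, "-o", out_dir,
            index_prefix, reads]


def cuffdiff_command(gtf, condition_bams, labels, out_dir, threads=1):
    """Build a Cuffdiff command for two or more conditions.

    condition_bams is a list of lists: one inner list of BAM paths
    (replicates) per condition, in the same order as labels.
    """
    if len(condition_bams) != len(labels):
        raise ValueError("need exactly one label per condition")
    # Replicates within a condition are joined with commas.
    groups = [",".join(bams) for bams in condition_bams]
    return ["cuffdiff", "-p", str(threads), "-o", out_dir,
            "-L", ",".join(labels), gtf] + groups
```

For example, tophat_command("genome", "s1.fastq", "genes.gtf", "s1_out") gives a command whose alignments end up in s1_out/accepted_hits.bam, and that bam file is then one of the inputs to cuffdiff_command. The lists can be passed to subprocess.run if desired.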

We followed these steps with the single-end reads

But, do we have all the genes?

• For organisms with sequenced genomes, gene models are stored in a GTF file

• Assumptions:

– The GTF file contains annotation for ALL transcripts and genes

– All splice sites, start/stop codons, etc. are correct

• Are these assumptions correct for every sequenced organism?

• RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation

• Method used depends on how much sequence information there is for the organism…
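One quick way to see what a GTF file actually contains is to count the transcripts annotated for each gene. The sketch below assumes the usual GTF layout (nine tab-separated columns, with gene_id and transcript_id given as quoted attributes in the ninth column); the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: gene_id "g1";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def transcripts_per_gene(lines):
    """Return {gene_id: number of distinct transcript_ids} for GTF lines."""
    genes = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GTF record
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if not gene_id:
            continue
        transcripts = genes[gene_id]  # registers the gene even without transcripts
        if attrs.get("transcript_id"):
            transcripts.add(attrs["transcript_id"])
    return {gene: len(ids) for gene, ids in genes.items()}
```

Passing an open file handle, e.g. transcripts_per_gene(open("genes.gtf")), works because the function only iterates over lines. A genome where nearly every gene has exactly one transcript is a hint that splice variants may be under-annotated.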

Gene Construction (Alignment) vs. Assembly

Haas and Zody (2010) Nat. Biotech. 28:421-3

Novel or Non-Model Organisms

Genome-Sequenced Organisms

Trinity software

Gene / Transcriptome Construction

• Annotation can be improved, even for well-annotated model organisms

– Identify all expressed exons

– Combine expressed exons into genes

– Find all splice variants for a gene

– Discover novel transcripts

• For newly sequenced organisms

– Validate ab initio annotation

– Comparison between different annotation sets

• Can assist in finding some types of contamination

– Reconstruction of rRNA genes

– Genomic/mitochondrial DNA in RNA library preps.

Reference Annotation Based Transcript (RABT) Assembly

Inputs: read set(s) and existing annotation (GTF) [optional]

TopHat → bam file(s)

Cufflinks → read-set specific GTF(s)

Cuffmerge → merged GTF

Cuffcompare → final assembly (GTF and stats)

Cuffdiff → Toptables, etc.
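The middle of this workflow (Cufflinks, Cuffmerge, Cuffcompare) can be sketched as a list of commands. This is only a sketch under assumptions: sample names, directory names and the manifest file name are hypothetical, and the commands are built but not executed. The -g option asks Cufflinks to use the reference annotation as a guide (RABT); Cuffmerge reads a text file listing one assembly GTF per line.

```python
def rabt_plan(bams, ref_gtf, genome_fa, manifest="assemblies.txt"):
    """Plan RABT assembly for {sample_name: bam_path}.

    Returns (commands, manifest_text). manifest_text must be written to
    the manifest file before the cuffmerge command is run.
    """
    commands = []
    assemblies = []
    for sample, bam in bams.items():
        out_dir = f"{sample}_cufflinks"
        # Cufflinks writes its assembly to <out_dir>/transcripts.gtf
        commands.append(["cufflinks", "-g", ref_gtf, "-o", out_dir, bam])
        assemblies.append(f"{out_dir}/transcripts.gtf")
    # Cuffmerge combines the per-sample assemblies into merged_asm/merged.gtf
    commands.append(["cuffmerge", "-g", ref_gtf, "-s", genome_fa,
                     "-o", "merged_asm", manifest])
    # Cuffcompare reports how the merged assembly relates to the reference
    commands.append(["cuffcompare", "-r", ref_gtf, "-o", "cuffcmp",
                     "merged_asm/merged.gtf"])
    return commands, "\n".join(assemblies) + "\n"
```

The merged GTF (merged_asm/merged.gtf) is then what Cuffdiff would be given in place of the original annotation.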

TopHat Spliced Alignment to a Genome


Cufflinks – Identification of Incompatible Fragments

Incompatible alignment

Cufflinks – Minimum Paths to Transcripts

Cufflinks – Abundance Estimation

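Cufflinks reports abundance in FPKM (fragments per kilobase of transcript per million mapped fragments). The arithmetic behind that unit can be sketched as below. This is a simplified illustration only: Cufflinks itself estimates abundances with a statistical model that assigns fragments among isoforms and uses an effective transcript length, none of which is reproduced here.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Simplified FPKM: fragments per kilobase per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)
```

For example, 500 fragments on a 2,000 bp transcript in a library of 10 million mapped fragments gives 500 × 10^9 / (2,000 × 10^7) = 25 FPKM.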

Merging Cufflinks Assemblies

So Now We’ve Explored These Tools…

We’ve Used Other Software in Conjunction

HTSeq-count

edgeR

Raw Counts

(But HTSeq-count and edgeR are independent)
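edgeR works from a table of raw counts with one row per gene and one column per sample, while HTSeq-count writes one two-column file per sample. A minimal sketch of combining them follows. It assumes tab-separated "feature, count" lines and that the summary rows at the end of each file start with "__" (for example __no_feature), as in recent HTSeq versions; function names are illustrative.

```python
def read_htseq_counts(lines):
    """Parse HTSeq-count output lines into {feature: count}."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        feature, value = line.split("\t")
        if feature.startswith("__"):
            continue  # summary rows (e.g. __no_feature) are not genes
        counts[feature] = int(value)
    return counts


def merge_counts(per_sample):
    """Merge {sample: {feature: count}} into (genes, samples, table).

    table[i][j] is the raw count of genes[i] in samples[j]; features
    missing from a sample are filled with 0.
    """
    samples = list(per_sample)
    genes = sorted(set().union(*[c.keys() for c in per_sample.values()]))
    table = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, table
```

The counts stay as raw integers on purpose: edgeR expects unnormalized counts and does its own library-size normalization.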

And Then Came Some Extensions…

Modules Introduced in 2014

Cuffquant

• Improves efficiency of running multiple samples

• Stores data in a compressed “.cxb” format that can later be analyzed with cuffdiff or cuffnorm

Cuffnorm

• Generates tables of expression values that are normalized for library size

• Tables can be used as input to Monocle
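How the two modules fit together can be sketched as command lines: Cuffquant is run once per sample, and the resulting .cxb files are then passed to Cuffnorm (or Cuffdiff) grouped by condition. This sketch only builds argument lists; paths and labels in the usage note are hypothetical.

```python
def cuffquant_command(gtf, bam, out_dir, threads=1):
    """Quantify one sample; Cuffquant writes <out_dir>/abundances.cxb."""
    return ["cuffquant", "-p", str(threads), "-o", out_dir, gtf, bam]


def cuffnorm_command(gtf, condition_cxbs, labels, out_dir, threads=1):
    """Normalize expression across conditions from .cxb files.

    condition_cxbs is a list of lists: one inner list of .cxb paths
    (replicates) per condition, in the same order as labels.
    """
    if len(condition_cxbs) != len(labels):
        raise ValueError("need exactly one label per condition")
    groups = [",".join(cxbs) for cxbs in condition_cxbs]
    return ["cuffnorm", "-p", str(threads), "-o", out_dir,
            "-L", ",".join(labels), gtf] + groups
```

For example, cuffquant_command("merged.gtf", "s1/accepted_hits.bam", "s1_quant") is run for each sample, after which the s1_quant/abundances.cxb style outputs are grouped per condition for cuffnorm_command. The efficiency gain is that each bam file is processed only once, however many comparisons follow.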

Monocle

• Used to analyze single-cell expression data

• Trapnell, et al., 2014, Nat. Biotech. 32:381

…But Software Continues to Evolve

HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts)

• Kim et al., 2015, Nat. Methods

• Planned to be TopHat3

• Reported to be faster than other aligners

• Reported to be more accurate on simulated reads


StringTie

• Pertea et al., 2015, Nat. Biotech

• Probable successor to Cufflinks2

• Assembles more transcripts (based on simulated reads)
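For orientation, the newer tools slot into the same positions as TopHat and Cufflinks. The sketch below builds one HISAT alignment command and one StringTie assembly command for single-end reads. It is an assumption-laden illustration: file names in the usage note are hypothetical, only basic options are shown, and the conversion of HISAT's SAM output to a coordinate-sorted BAM (needed by StringTie, typically done with samtools) is left out.

```python
def hisat_command(index_prefix, reads, sam_out, threads=1):
    """Build a HISAT command for single-end reads (SAM output)."""
    return ["hisat", "-p", str(threads), "-x", index_prefix,
            "-U", reads, "-S", sam_out]


def stringtie_command(sorted_bam, out_gtf, ref_gtf=None, threads=1):
    """Build a StringTie assembly command.

    ref_gtf is optional: when given, it is passed with -G so the
    existing annotation guides the assembly.
    """
    cmd = ["stringtie", sorted_bam, "-p", str(threads), "-o", out_gtf]
    if ref_gtf:
        cmd += ["-G", ref_gtf]
    return cmd
```

For example, hisat_command("genome", "s1.fastq", "s1.sam") followed, after sorting, by stringtie_command("s1.sorted.bam", "s1.gtf", ref_gtf="genes.gtf").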

Ballgown

• Frazee et al., 2015, Nat. Biotech

• Bioconductor R package

• Probable successor to Cuffdiff2

• Includes useful Tablemaker preprocessor

A New Potential Game-Changer (2015)

Kallisto (“Near-Optimal RNA-Seq Quantification”)

• Bray et al. (http://arxiv.org/abs/1505.02710)

• Extremely fast, uses pseudo-alignment based on k-mers and de Bruijn graphs
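The idea of k-mer based pseudo-alignment can be illustrated with a toy example: instead of aligning a read base by base, look up each of its k-mers in an index of transcript k-mers and intersect the sets of transcripts that contain them. This is a teaching sketch only, not Kallisto's implementation; it uses a plain dictionary rather than a de Bruijn graph, and it simply skips k-mers that are absent from the index, which is an assumption made here for simplicity.

```python
def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcript names containing it."""
    index = {}
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def pseudoalign(read, index, k):
    """Return the set of transcripts compatible with every indexed k-mer of read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer)
        if hits is None:
            continue  # toy simplification: ignore k-mers not in the index
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible if compatible is not None else set()
```

With two transcripts that share a 5' end, a read from the shared region is compatible with both, while a read covering the region where they differ is compatible with only one. No base-level alignment is computed at any point, which is where the speed comes from.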

[Figure panels: Speed, Accuracy]

Recommended