New RNA-seq workﬂows - Bioconductor · New RNA-seq workﬂows Charlotte Soneson University of...

New RNA-seq workflowsCharlotte Soneson University of Zurich

Brixen 2016

Wikipedia

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.Gene X

The traditional workflow

ALIGNMENT

COUNTING

ANALYSIS

doesn’t account for all uncertainty in abundance estimates

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.Gene X

• slow• unnecessary?

ALIGNMENT

COUNTING

ANALYSIS

inaccurate?

doesn’t account for all uncertainty in abundance estimates

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.Gene X

• slow• unnecessary?

inaccurate?

ALIGNMENT

COUNTING

ANALYSIS

length = L

length = 2L

sample 1 sample 2

Abundance quantification

length = L

length = 2L

sample 1 sample 2

Abundance quantification

Gene-level read countslength = L

length = 2L

150 reads

sample 1 sample 2

150 reads

Gene length = 2.6L

Gene S1 S2

Count 150 150

What can we do?

• Consider another abundance unit that better reflects the underlying abundances (“number of transcript molecules”)

• Include “adjustment” of gene counts to reflect underlying isoform composition

What can we do?

How can we get such values?

How could such adjustment be done?

Are they any good?

What can we do?

How can we get such values?

How could such adjustment be done?

Are they any good?

We need transcript-level

information!

library size

ti =cir

TPMi = 106 · tiPk tk

RPKMi = 109 · ci`iP

k ck= 109 · tiP

k (tk`k)

Abundance units

ciread count for transcript i

length of transcript i

library size

fragment length

TPMi / RPKMiX

TPMi = 106

library size

ti =cir

k ck= 109 · tiP

k (tk`k)

Abundance units

library size

fragment length

TPMi / RPKMiX

TPMi = 106

Abundance units

library size

fragment length

TPMi / RPKMiX

TPMi = 106

ti =cir

k ck= 109 · tiP

k (tk`k)

Abundance units

library size

fragment length

TPMi / RPKMiX

TPMi = 106

ti =cir

k ck= 109 · tiP

k (tk`k)

Abundance units

length of transcript ifragment length

TPMi / RPKMiX

TPMi = 106

library size

ti =cir

k ck= 109 · tiP

k (tk`k)

• Similar to correction factors for library size, but sample- and gene-specific

• Weighted average of transcript lengths, weighted by estimated abundances (TPMs)

• Average transcript length for gene g in sample s:

Offsets (“average transcript lengths”)

ATLgs =

✓is¯`is,X

✓is = 1

¯`is = e↵ective length of isoform i (in sample s)✓is = relative abundance of isoform i in sample s

length = L

length = 2L

Average transcript lengths

ATLg1 = 1 · L+ 0 · 2L = L

ATLg2 = 0 · L+ 1 · 2L = 2L

length = L

length = 2L

Average transcript lengths

ATLg2 = 0.5 · L+ 0.5 · 2L = 1.5L

ATLg1 = 0.75 · L+ 0.25 · 2L = 1.25L

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.Gene X

ALIGNMENT

COUNTING

ANALYSIS

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.Gene X

The “modern” workflow

“MAPPING”

ESTIMATION

ANALYSIS

• Does not provide “full” alignment information (i.e., no exact base-by-base alignment).

• Rather, finds all transcripts (and positions) that a read is compatible with.

• Comes in various flavors: • pseudoalignment (kallisto) • lightweight alignment (Salmon) • quasimapping (Sailfish, RapMap)

The “mapping” step

Bray et al. 2016; Patro et al. 2014; Patro et al. 2015; Srivastava et al. 2016

• Input: for each read, the “equivalence class” of compatible transcripts

• Probabilistic modeling of read generation process, with transcript abundance as parameter

• EM algorithm

• Output: estimated abundance of each transcript

The “estimation” step

Step 1: build transcriptome index

kallisto

Salmon

name of index

transcriptome fasta file

name of index

transcriptome fasta file

number of cores

Where to find transcript fasta?www.ensembl.org/info/data/ftp/index.html

reference files for alignment-based workflow

Step 2: quantify

kallisto

Salmon

name of index

output folder

number of cores

name of index

input fastq files

# bootstrapsnumber of cores input fastq files

libtype

output folder # bootstraps

Salmon LIBTYPE argumenthttp://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype

output

kallisto

Salmon

output

kallisto

Salmon

[abundance.tsv]

[quant.sf]

Comparison to traditional workflow

Salmon/kallisto…

• … are considerably faster than traditional alignment+counting -> allow bootstrapping

• … provide more highly resolved estimates (transcripts rather than gene) - can be aggregated to gene level

• … can use a larger fraction of the reads

• … don’t give precise alignments (for e.g. visualization in genome browser) - but avoid large alignment files

kallisto and Salmon gene counts overall similar

0 1 2 3 4 5

SRR1039508

kallisto (log(counts + 1))

Gene-level counts mostly similar to traditional approach

0 1 2 3 4 5

SRR1039508

featureCounts (log(counts + 1))

kallisto and Salmon can use slightly more reads

featureCounts kallisto Salmon

How to get the estimated values into R?

counts

“ATL” offsets

• Abundance estimates for lowly expressed transcripts are highly variable and should be interpreted with caution

A word of warning

True TPM

Soneson, Love & Robinson, F1000 Research 2016

• Problematic when coverage of region defining an isoform is low

A word of warning

• When aggregated to the gene level, abundance estimates are less variable

A word of warning

True TPM

Differential analysis types for RNA-seq

• Has the total output of a gene changed? DGE

• Has the expression of individual transcripts changed? DTE

• Has any isoform of a given gene changed? DTE+G

• Has the isoform composition for a given gene changed? DTU/DEU

- need different abundance quantification of transcriptomic features (genes, transcripts, exons)

• Srivastava et al.: RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq read to transcriptomes. Bioinformatics 32:i192-i200 (2016) - RapMap

• Patro et al.: Accurate, fast, and model-aware transcript expression quantification with Salmon. bioRxiv http://dx.doi.org/10.1101/021592 (2015) - Salmon

• Bray et al.: Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34(5):525-527 (2016) - kallisto • Patro et al.: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.

Nature Biotechnology 32:462-464 (2014) - Sailfish • Pimentel et al.: Differential analysis of RNA-Seq incorporating quantification uncertainty. bioRxiv http://dx.doi.org/

10.1101/058164 (2016) - sleuth • Wagner et al.: Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among

samples. Theory in Biosciences 131:281-285 (2012) - TPM vs FPKM • Soneson et al.: Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.

F1000Research 4:1521 (2016) - ATL offsets (tximport package) • Li et al.: RNA-seq gene expression estimation with read mapping uncertainty. Bioinformatics 26(4):493-500 (2010) -

TPM, RSEM

References

New RNA-seq workﬂows - Bioconductor · New RNA-seq workﬂows Charlotte Soneson University of...

Documents

RNA-Seq-BasedBreastCancerSubtypesClassificationUsing

Rna seq - PDX models

RNA-Seq analysis workshop - · PDF fileOutline • Background of RNA-Seq • Application of RNA-Seq (what RNA-Seq can do?) • Available sequencing platforms and strategies and which

RNA-seq from a bioinformatics perspective · 2017-11-07 · RNA-Seq - Stranded . Differential expression. Rakesh Kaundal et al. Normalization of RNA -seq . Total count (TC) : Gene

Introduction to RNA-Seq - University of California, Davis...Introduction to RNA-Seq Monica Britton, Ph.D. Bioinformatics Analyst December 2014 Workshop Overview of RNA-Seq Activities

Differential expression in RNA-Seq

1.RNA Seq Part1 WorkingToTheGoal

Gene Mapping via Bulked Segregant RNA-Seq (BSR-Seq)

Poster rna seq-molecular_medtriconf_2011_a_vladimirova

RNA-Seq de novo assembly traininggenoweb.toulouse.inra.fr/~formation/RNASeq_de_novo/RNASeq_de_… · – RNA-Seq techniques RNA-Seq experiment set up Read quality assessment Read

Analysis of RNA-seq Data - University of Hong Kongcgs.hku.hk/portal/files/GRC/Events/Seminars/2017/20170208/rna-seq.pdf · Outline • What is RNA-seq? • What can RNA-seq do? •

Analysing RNA-Seq data produced by Mars-Seq protocoldors.weizmann.ac.il/course/course2018/AnalysingRNA-Seq...Analysing RNA-Seq data produced by Mars-Seq protocol Dena Leshkowitz, Introduction

RNA-Seq and transcriptome analysis

RNA‐Seq: Methods and Applicaonsbarc.wi.mit.edu/education/hot_topics/RNAseq/RNA_Seq.pdf · Outline • Intro to RNA‐Seq Biological Quesons Comparison with Other Methods RNA‐Seq

RNA-seq data analysis - CSC · Analysis tool overview 250 NGS tools for • RNA-seq • single cell RNA-seq • miRNA-seq • exome/genome-seq • ChIP-seq • FAIRE/DNase-seq •

Measuring transcriptomes with RNA-Seq

Bioinformatics for DNA - seq and RNA- seq experiments

RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

Improved RNA-seq Workﬂows Using CyVerse …Improved RNA-seq Workﬂows Using CyVerse Cyberinfrastructure Kapeel M. Chougule,1,5 Liya Wang, 1Joshua C. Stein, Xiaofei Wang, Upendra

Biases in RNA- Seq data