New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of...

Preview:

Citation preview

New RNA-seq workflowsCharlotte Soneson University of Zurich

Brixen 2016

Wikipedia

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.

.

.Gene X

The traditional workflow

ALIGNMENT

COUNTING

ANALYSIS

doesn’t account for all uncertainty in abundance estimates

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.

.

.Gene X

• slow• unnecessary?

The traditional workflow

ALIGNMENT

COUNTING

ANALYSIS

inaccurate?

doesn’t account for all uncertainty in abundance estimates

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.

.

.Gene X

• slow• unnecessary?

inaccurate?

The traditional workflow

ALIGNMENT

COUNTING

ANALYSIS

length = L

length = 2L

sample 1 sample 2

T1

T2

Abundance quantification

length = L

length = 2L

sample 1 sample 2

T1

T2

Abundance quantification

Gene-level read countslength = L

length = 2L

150 reads

sample 1 sample 2

150 reads

T1

T2

Gene length = 2.6L

Gene S1 S2

Count 150 150

What can we do?

• Consider another abundance unit that better reflects the underlying abundances (“number of transcript molecules”)

• Include “adjustment” of gene counts to reflect underlying isoform composition

• Consider another abundance unit that better reflects the underlying abundances (“number of transcript molecules”)

• Include “adjustment” of gene counts to reflect underlying isoform composition

What can we do?

How can we get such values?

How could such adjustment be done?

Are they any good?

• Consider another abundance unit that better reflects the underlying abundances (“number of transcript molecules”)

• Include “adjustment” of gene counts to reflect underlying isoform composition

What can we do?

How can we get such values?

How could such adjustment be done?

Are they any good?

We need transcript-level

information!

library size

ti =cir

`i

TPMi = 106 · tiPk tk

RPKMi = 109 · ci`iP

k ck= 109 · tiP

k (tk`k)

Abundance units

`i

ciread count for transcript i

length of transcript i

library size

fragment length

TPMi / RPKMiX

i

TPMi = 106

library size

ti =cir

`i

TPMi = 106 · tiPk tk

RPKMi = 109 · ci`iP

k ck= 109 · tiP

k (tk`k)

Abundance units

`i

ciread count for transcript i

length of transcript i

library size

fragment length

TPMi / RPKMiX

i

TPMi = 106

Abundance units

`i

ciread count for transcript i

length of transcript i

library size

fragment length

TPMi / RPKMiX

i

TPMi = 106

ti =cir

`i

TPMi = 106 · tiPk tk

RPKMi = 109 · ci`iP

k ck= 109 · tiP

k (tk`k)

Abundance units

`i

ciread count for transcript i

length of transcript i

library size

fragment length

TPMi / RPKMiX

i

TPMi = 106

ti =cir

`i

TPMi = 106 · tiPk tk

RPKMi = 109 · ci`iP

k ck= 109 · tiP

k (tk`k)

Abundance units

`i

ciread count for transcript i

length of transcript ifragment length

TPMi / RPKMiX

i

TPMi = 106

library size

ti =cir

`i

TPMi = 106 · tiPk tk

RPKMi = 109 · ci`iP

k ck= 109 · tiP

k (tk`k)

• Similar to correction factors for library size, but sample- and gene-specific

• Weighted average of transcript lengths, weighted by estimated abundances (TPMs)

• Average transcript length for gene g in sample s:

Offsets (“average transcript lengths”)

ATLgs =

X

i2g

✓is¯`is,X

i2g

✓is = 1

¯`is = e↵ective length of isoform i (in sample s)✓is = relative abundance of isoform i in sample s

length = L

length = 2L

T1

T2

Average transcript lengths

ATLg1 = 1 · L+ 0 · 2L = L

ATLg2 = 0 · L+ 1 · 2L = 2L

length = L

length = 2L

T1

T2

Average transcript lengths

ATLg2 = 0.5 · L+ 0.5 · 2L = 1.5L

ATLg1 = 0.75 · L+ 0.25 · 2L = 1.25L

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.

.

.Gene X

The traditional workflow

ALIGNMENT

COUNTING

ANALYSIS

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.

.

.Gene X

The “modern” workflow

“MAPPING”

ESTIMATION

ANALYSIS

• Does not provide “full” alignment information (i.e., no exact base-by-base alignment).

• Rather, finds all transcripts (and positions) that a read is compatible with.

• Comes in various flavors: • pseudoalignment (kallisto) • lightweight alignment (Salmon) • quasimapping (Sailfish, RapMap)

The “mapping” step

Bray et al. 2016; Patro et al. 2014; Patro et al. 2015; Srivastava et al. 2016

• Input: for each read, the “equivalence class” of compatible transcripts

• Probabilistic modeling of read generation process, with transcript abundance as parameter

• EM algorithm

• Output: estimated abundance of each transcript

The “estimation” step

Step 1: build transcriptome index

kallisto

Salmon

name of index

transcriptome fasta file

name of index

transcriptome fasta file

number of cores

Where to find transcript fasta?www.ensembl.org/info/data/ftp/index.html

Where to find transcript fasta?www.ensembl.org/info/data/ftp/index.html

reference files for alignment-based workflow

Step 2: quantify

kallisto

Salmon

name of index

output folder

number of cores

name of index

input fastq files

# bootstrapsnumber of cores input fastq files

libtype

output folder # bootstraps

Salmon LIBTYPE argumenthttp://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype

output

kallisto

Salmon

output

kallisto

Salmon

[abundance.tsv]

[quant.sf]

Comparison to traditional workflow

Salmon/kallisto…

• … are considerably faster than traditional alignment+counting -> allow bootstrapping

• … provide more highly resolved estimates (transcripts rather than gene) - can be aggregated to gene level

• … can use a larger fraction of the reads

• … don’t give precise alignments (for e.g. visualization in genome browser) - but avoid large alignment files

kallisto and Salmon gene counts overall similar

0 1 2 3 4 5

01

23

45

SRR1039508

kallisto (log(counts + 1))

Salm

on (l

og(c

ount

s +

1))

Gene-level counts mostly similar to traditional approach

0 1 2 3 4 5

01

23

45

SRR1039508

featureCounts (log(counts + 1))

Salm

on (l

og(c

ount

s +

1))

kallisto and Salmon can use slightly more reads

0e+00

1e+07

2e+07

3e+07

SRR

1039

508

SRR

1039

509

SRR

1039

512

SRR

1039

513

SRR

1039

516

SRR

1039

517

SRR

1039

520

SRR

1039

521

Num

ber o

f map

ped

read

s

featureCounts kallisto Salmon

How to get the estimated values into R?

How to get the estimated values into R?

How to get the estimated values into R?

TPMs

counts

“ATL” offsets

• Abundance estimates for lowly expressed transcripts are highly variable and should be interpreted with caution

A word of warning

A B

C

1

Estim

ated

TPM

True TPM

A B

C

1

Soneson, Love & Robinson, F1000 Research 2016

• Problematic when coverage of region defining an isoform is low

A word of warning

Soneson, Love & Robinson, F1000 Research 2016

• When aggregated to the gene level, abundance estimates are less variable

A word of warning

A B

C

1

Estim

ated

TPM

True TPM

A B

C

1

Soneson, Love & Robinson, F1000 Research 2016

A B

C

1

Differential analysis types for RNA-seq

• Has the total output of a gene changed? DGE

• Has the expression of individual transcripts changed? DTE

• Has any isoform of a given gene changed? DTE+G

• Has the isoform composition for a given gene changed? DTU/DEU

- need different abundance quantification of transcriptomic features (genes, transcripts, exons)

• Srivastava et al.: RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq read to transcriptomes. Bioinformatics 32:i192-i200 (2016) - RapMap

• Patro et al.: Accurate, fast, and model-aware transcript expression quantification with Salmon. bioRxiv http://dx.doi.org/10.1101/021592 (2015) - Salmon

• Bray et al.: Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34(5):525-527 (2016) - kallisto • Patro et al.: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.

Nature Biotechnology 32:462-464 (2014) - Sailfish • Pimentel et al.: Differential analysis of RNA-Seq incorporating quantification uncertainty. bioRxiv http://dx.doi.org/

10.1101/058164 (2016) - sleuth • Wagner et al.: Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among

samples. Theory in Biosciences 131:281-285 (2012) - TPM vs FPKM • Soneson et al.: Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.

F1000Research 4:1521 (2016) - ATL offsets (tximport package) • Li et al.: RNA-seq gene expression estimation with read mapping uncertainty. Bioinformatics 26(4):493-500 (2010) -

TPM, RSEM

References

Recommended