42
New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016

New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

New RNA-seq workflowsCharlotte Soneson University of Zurich

Brixen 2016

Page 2: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Wikipedia

Page 3: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.

.

.Gene X

The traditional workflow

ALIGNMENT

COUNTING

ANALYSIS

Page 4: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

doesn’t account for all uncertainty in abundance estimates

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.

.

.Gene X

• slow• unnecessary?

The traditional workflow

ALIGNMENT

COUNTING

ANALYSIS

inaccurate?

Page 5: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

doesn’t account for all uncertainty in abundance estimates

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.

.

.Gene X

• slow• unnecessary?

inaccurate?

The traditional workflow

ALIGNMENT

COUNTING

ANALYSIS

Page 6: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

length = L

length = 2L

sample 1 sample 2

T1

T2

Abundance quantification

Page 7: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

length = L

length = 2L

sample 1 sample 2

T1

T2

Abundance quantification

Page 8: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Gene-level read countslength = L

length = 2L

150 reads

sample 1 sample 2

150 reads

T1

T2

Gene length = 2.6L

Gene S1 S2

Count 150 150

Page 9: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

What can we do?

• Consider another abundance unit that better reflects the underlying abundances (“number of transcript molecules”)

• Include “adjustment” of gene counts to reflect underlying isoform composition

Page 10: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

• Consider another abundance unit that better reflects the underlying abundances (“number of transcript molecules”)

• Include “adjustment” of gene counts to reflect underlying isoform composition

What can we do?

How can we get such values?

How could such adjustment be done?

Are they any good?

Page 11: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

• Consider another abundance unit that better reflects the underlying abundances (“number of transcript molecules”)

• Include “adjustment” of gene counts to reflect underlying isoform composition

What can we do?

How can we get such values?

How could such adjustment be done?

Are they any good?

We need transcript-level

information!

Page 12: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

library size

ti =cir

`i

TPMi = 106 · tiPk tk

RPKMi = 109 · ci`iP

k ck= 109 · tiP

k (tk`k)

Abundance units

`i

ciread count for transcript i

length of transcript i

library size

fragment length

TPMi / RPKMiX

i

TPMi = 106

Page 13: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

library size

ti =cir

`i

TPMi = 106 · tiPk tk

RPKMi = 109 · ci`iP

k ck= 109 · tiP

k (tk`k)

Abundance units

`i

ciread count for transcript i

length of transcript i

library size

fragment length

TPMi / RPKMiX

i

TPMi = 106

Page 14: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Abundance units

`i

ciread count for transcript i

length of transcript i

library size

fragment length

TPMi / RPKMiX

i

TPMi = 106

ti =cir

`i

TPMi = 106 · tiPk tk

RPKMi = 109 · ci`iP

k ck= 109 · tiP

k (tk`k)

Page 15: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Abundance units

`i

ciread count for transcript i

length of transcript i

library size

fragment length

TPMi / RPKMiX

i

TPMi = 106

ti =cir

`i

TPMi = 106 · tiPk tk

RPKMi = 109 · ci`iP

k ck= 109 · tiP

k (tk`k)

Page 16: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Abundance units

`i

ciread count for transcript i

length of transcript ifragment length

TPMi / RPKMiX

i

TPMi = 106

library size

ti =cir

`i

TPMi = 106 · tiPk tk

RPKMi = 109 · ci`iP

k ck= 109 · tiP

k (tk`k)

Page 17: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

• Similar to correction factors for library size, but sample- and gene-specific

• Weighted average of transcript lengths, weighted by estimated abundances (TPMs)

• Average transcript length for gene g in sample s:

Offsets (“average transcript lengths”)

ATLgs =

X

i2g

✓is¯`is,X

i2g

✓is = 1

¯`is = e↵ective length of isoform i (in sample s)✓is = relative abundance of isoform i in sample s

Page 18: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

length = L

length = 2L

T1

T2

Average transcript lengths

ATLg1 = 1 · L+ 0 · 2L = L

ATLg2 = 0 · L+ 1 · 2L = 2L

Page 19: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

length = L

length = 2L

T1

T2

Average transcript lengths

ATLg2 = 0.5 · L+ 0.5 · 2L = 1.5L

ATLg1 = 0.75 · L+ 0.25 · 2L = 1.25L

Page 20: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.

.

.Gene X

The traditional workflow

ALIGNMENT

COUNTING

ANALYSIS

Page 21: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

7 . . . 13 . . . . . . . . . . . . . . .

Gene AGene B

.

.

.Gene X

The “modern” workflow

“MAPPING”

ESTIMATION

ANALYSIS

Page 22: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

• Does not provide “full” alignment information (i.e., no exact base-by-base alignment).

• Rather, finds all transcripts (and positions) that a read is compatible with.

• Comes in various flavors: • pseudoalignment (kallisto) • lightweight alignment (Salmon) • quasimapping (Sailfish, RapMap)

The “mapping” step

Bray et al. 2016; Patro et al. 2014; Patro et al. 2015; Srivastava et al. 2016

Page 23: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

• Input: for each read, the “equivalence class” of compatible transcripts

• Probabilistic modeling of read generation process, with transcript abundance as parameter

• EM algorithm

• Output: estimated abundance of each transcript

The “estimation” step

Page 24: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Step 1: build transcriptome index

kallisto

Salmon

name of index

transcriptome fasta file

name of index

transcriptome fasta file

number of cores

Page 25: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Where to find transcript fasta?www.ensembl.org/info/data/ftp/index.html

Page 26: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Where to find transcript fasta?www.ensembl.org/info/data/ftp/index.html

reference files for alignment-based workflow

Page 27: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Step 2: quantify

kallisto

Salmon

name of index

output folder

number of cores

name of index

input fastq files

# bootstrapsnumber of cores input fastq files

libtype

output folder # bootstraps

Page 28: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Salmon LIBTYPE argumenthttp://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype

Page 29: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

output

kallisto

Salmon

Page 30: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

output

kallisto

Salmon

[abundance.tsv]

[quant.sf]

Page 31: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Comparison to traditional workflow

Salmon/kallisto…

• … are considerably faster than traditional alignment+counting -> allow bootstrapping

• … provide more highly resolved estimates (transcripts rather than gene) - can be aggregated to gene level

• … can use a larger fraction of the reads

• … don’t give precise alignments (for e.g. visualization in genome browser) - but avoid large alignment files

Page 32: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

kallisto and Salmon gene counts overall similar

0 1 2 3 4 5

01

23

45

SRR1039508

kallisto (log(counts + 1))

Salm

on (l

og(c

ount

s +

1))

Page 33: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Gene-level counts mostly similar to traditional approach

0 1 2 3 4 5

01

23

45

SRR1039508

featureCounts (log(counts + 1))

Salm

on (l

og(c

ount

s +

1))

Page 34: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

kallisto and Salmon can use slightly more reads

0e+00

1e+07

2e+07

3e+07

SRR

1039

508

SRR

1039

509

SRR

1039

512

SRR

1039

513

SRR

1039

516

SRR

1039

517

SRR

1039

520

SRR

1039

521

Num

ber o

f map

ped

read

s

featureCounts kallisto Salmon

Page 35: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

How to get the estimated values into R?

Page 36: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

How to get the estimated values into R?

Page 37: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

How to get the estimated values into R?

TPMs

counts

“ATL” offsets

Page 38: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

• Abundance estimates for lowly expressed transcripts are highly variable and should be interpreted with caution

A word of warning

A B

C

1

Estim

ated

TPM

True TPM

A B

C

1

Soneson, Love & Robinson, F1000 Research 2016

Page 39: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

• Problematic when coverage of region defining an isoform is low

A word of warning

Soneson, Love & Robinson, F1000 Research 2016

Page 40: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

• When aggregated to the gene level, abundance estimates are less variable

A word of warning

A B

C

1

Estim

ated

TPM

True TPM

A B

C

1

Soneson, Love & Robinson, F1000 Research 2016

A B

C

1

Page 41: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

Differential analysis types for RNA-seq

• Has the total output of a gene changed? DGE

• Has the expression of individual transcripts changed? DTE

• Has any isoform of a given gene changed? DTE+G

• Has the isoform composition for a given gene changed? DTU/DEU

- need different abundance quantification of transcriptomic features (genes, transcripts, exons)

Page 42: New RNA-seq workflows - Bioconductor · New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016. Wikipedia. 7 . . . ... Step 1: build transcriptome index kallisto

• Srivastava et al.: RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq read to transcriptomes. Bioinformatics 32:i192-i200 (2016) - RapMap

• Patro et al.: Accurate, fast, and model-aware transcript expression quantification with Salmon. bioRxiv http://dx.doi.org/10.1101/021592 (2015) - Salmon

• Bray et al.: Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34(5):525-527 (2016) - kallisto • Patro et al.: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.

Nature Biotechnology 32:462-464 (2014) - Sailfish • Pimentel et al.: Differential analysis of RNA-Seq incorporating quantification uncertainty. bioRxiv http://dx.doi.org/

10.1101/058164 (2016) - sleuth • Wagner et al.: Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among

samples. Theory in Biosciences 131:281-285 (2012) - TPM vs FPKM • Soneson et al.: Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.

F1000Research 4:1521 (2016) - ATL offsets (tximport package) • Li et al.: RNA-seq gene expression estimation with read mapping uncertainty. Bioinformatics 26(4):493-500 (2010) -

TPM, RSEM

References