Reproducible RNA-seq analysis with recount2

Leonardo Collado-Torres, @fellgernon, #bioc2017


Page 1:

Reproducible RNA-seq analysis with recount2

Leonardo Collado-Torres, @fellgernon, #bioc2017

Page 2:

Reference genome

Reads

Page 3:
Page 4:

GTEx, TCGA

slide adapted from Shannon Ellis

Page 5:

SRA

Page 6:

Slide adapted from Ben Langmead

Page 7:

http://rail.bio/

Slide adapted from Ben Langmead

Page 8:

http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

Page 9:

https://jhubiostatistics.shinyapps.io/recount/

Page 10:

Figure: Coverage and Reads tracks over a Gene with exon 1 to exon 4, junctions jx 1 to jx 6, Isoform 1, Isoform 2, a potential isoform 3, and Expressed region 1 (potential exon 5).

Page 11:
Page 12:

exon 1, exon 2, exon 3

Page 13:

disjoint exon 1, disjoint exon 2, disjoint exon 3
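
The step from overlapping exons to disjoint exons can be sketched in base R. This is only an illustration under assumptions: the exon coordinates and the helper name disjoin_intervals are made up, and the slides do not show how the disjoint exons were actually computed. For GRanges objects, GenomicRanges::disjoin() performs this kind of operation.

```r
# Hypothetical overlapping exons on one chromosome (1-based, inclusive ends).
exons <- data.frame(start = c(100, 150, 300), end = c(200, 250, 400))

# Assumed helper: split intervals into non-overlapping pieces.
disjoin_intervals <- function(start, end) {
    # Every exon start, and the base after every exon end, begins a new piece.
    breaks <- sort(unique(c(start, end + 1)))
    seg_start <- head(breaks, -1)
    seg_end <- tail(breaks, -1) - 1
    # Keep only the pieces that lie inside at least one exon.
    covered <- vapply(seq_along(seg_start), function(i) {
        any(start <= seg_start[i] & end >= seg_end[i])
    }, logical(1))
    data.frame(start = seg_start[covered], end = seg_end[covered])
}

disjoint <- disjoin_intervals(exons$start, exons$end)
print(disjoint)
```

With these made-up coordinates, the two overlapping exons are split into three pieces and the third exon is kept whole, so no base is counted twice.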

Page 14:
Page 15:
Page 16:
Page 17:

Figure: Coverage (axis 0 to 5) along the Genome (axis ticks at 5, 10, 15).

Coverage per base: 3 3 5 4 4 2 2 3 1 3 3 1 4 4 2 1

AUC = area under coverage = 45
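
The AUC on this slide can be checked by hand, since it is the sum of the per-base coverage values. A minimal base R sketch; the variable names are illustrative and not part of recount:

```r
# Per-base coverage values copied from the slide (16 genome positions).
coverage <- c(3, 3, 5, 4, 4, 2, 2, 3, 1, 3, 3, 1, 4, 4, 2, 1)

# Area under coverage: the sum of coverage over all positions.
auc <- sum(coverage)
print(auc)  # 45
```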

Page 18:
Page 19:

> library('recount')
> download_study('ERP001942', type = 'rse-gene')
> load(file.path('ERP001942', 'rse_gene.Rdata'))
> rse <- scale_counts(rse_gene)

https://github.com/leekgroup/recount-analyses/
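
scale_counts() rescales the counts so that samples with different sequencing depths are comparable. The sketch below illustrates the idea in base R on made-up numbers, assuming scaling by AUC to a target library size of 40 million reads. The count matrix, the AUC values and the helper name scale_by_auc are hypothetical, and the real function has more options (see ?scale_counts).

```r
# Hypothetical coverage-sum counts: 3 genes (rows) x 2 samples (columns).
counts <- matrix(c(1000, 3000, 6000,
                   2000, 6000, 12000),
                 nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("sampleA", "sampleB")))

# Hypothetical total area under coverage (AUC) for each sample.
auc <- c(sampleA = 1e6, sampleB = 2e6)

# Assumed helper: scale each sample so its AUC corresponds to target_size.
scale_by_auc <- function(counts, auc, target_size = 4e7) {
    scale_factor <- target_size / auc
    # sweep() multiplies every column by that sample's scale factor.
    round(sweep(counts, 2, scale_factor, `*`))
}

scaled <- scale_by_auc(counts, auc)
print(scaled)
```

In this toy example sampleB was sequenced twice as deeply as sampleA, so after scaling both columns hold the same values.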

Page 20:

slide adapted from Jeff Leek

Page 21:

> library('recount')
> download_study('SRP029880', type = 'rse-gene')
> download_study('SRP059039', type = 'rse-gene')
> ## Load each study's rse_gene object into a list named dat
> dat <- lapply(c('SRP029880', 'SRP059039'), function(id) {
+     load(file.path(id, 'rse_gene.Rdata'))
+     rse_gene
+ })
> mdat <- do.call(cbind, dat)

https://github.com/leekgroup/recount-analyses/

Page 22:

Collado-Torres et al, Nat. Biotech, 2017

Page 23:
Page 24:
Page 25:
Page 26:

Figure: Coverage and Reads tracks over a Gene with exon 1 to exon 4, junctions jx 1 to jx 6, Isoform 1, Isoform 2, a potential isoform 3, and Expressed region 1 (potential exon 5).

Page 27:

Collado-Torres et al, NAR, 2017

Page 28:

Postmortem Human Brain Samples

Discovery data: Fetal, Infant, Child, Teen, Adult, 50+ (6 / group, N = 36)

Replication data: Fetal, Infant, Child, Teen, Adult, 50+ (6 / group, N = 36)

Jaffe et al, Nat. Neuroscience, 2015

Page 29:

Jaffe et al, Nat. Neuroscience, 2015

Page 30:

BrainSpan data

Jaffe et al, Nat. Neuroscience, 2015

Page 31:

expression data for ~70,000 human samples

Figure: matrix of expression estimates (gene, exon, junctions, ERs) by samples, with samples from GTEx (N=9,662), TCGA (N=11,284) and SRA (N=49,848).

slide adapted from Shannon Ellis

Page 32:

expression data for ~70,000 human samples

Answer meaningful questions about human biology and expression

Figure: matrix of expression estimates (gene, exon, junctions, ERs) by samples, with samples from GTEx (N=9,662), TCGA (N=11,284) and SRA (N=49,848).

slide adapted from Shannon Ellis

Page 33:

expression data for ~70,000 human samples

Figure: matrix of expression estimates (gene, exon, junctions, ERs) by samples, with samples from GTEx (N=9,662), TCGA (N=11,284) and SRA (N=49,848), next to a samples by phenotypes matrix marked "?".

Answer meaningful questions about human biology and expression

slide adapted from Shannon Ellis

Page 34:

Category   Frequency
F                 95
female          2036
Female            51
M                 77
male            1240
Male             141
Total           3640

Even when information is provided, it’s not always clear…

sra_meta$Sex

“1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”

slide adapted from Shannon Ellis
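
The frequency table above mixes several spellings of the same two categories. A minimal base R sketch of one way to collapse them; the counts are copied from the slide, while the helper name normalize_sex and its mapping rules are assumptions for illustration. Ambiguous values such as “pooled male and female” are deliberately left as NA rather than guessed.

```r
# Category counts copied from the slide for sra_meta$Sex.
sex_counts <- c(F = 95, female = 2036, Female = 51,
                M = 77, male = 1240, Male = 141)

# Assumed mapping: compare case-insensitively and expand one-letter codes.
normalize_sex <- function(x) {
    x <- tolower(trimws(x))
    ifelse(x %in% c("f", "female"), "female",
           ifelse(x %in% c("m", "male"), "male", NA_character_))
}

labels <- normalize_sex(names(sex_counts))
clean_counts <- tapply(sex_counts, labels, sum)
print(clean_counts)
```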

Page 35:

SRA phenotype information is far from complete

      SubjectID  Sex     Tissue  Race  Age
6620  NA         female  liver   NA    NA
6621  NA         female  liver   NA    NA
6622  NA         female  liver   NA    NA
6623  NA         female  liver   NA    NA
6624  NA         female  liver   NA    NA
6625  NA         male    liver   NA    NA
6626  NA         male    liver   NA    NA
6627  NA         male    liver   NA    NA
6628  NA         male    liver   NA    NA
6629  NA         male    liver   NA    NA
6630  NA         male    liver   NA    NA
6631  NA         NA      blood   NA    NA
6632  NA         NA      blood   NA    NA
6633  NA         NA      blood   NA    NA
6634  NA         NA      blood   NA    NA
6635  NA         NA      blood   NA    NA
6636  NA         NA      blood   NA    NA

slide adapted from Shannon Ellis

Page 36:

Goal: to accurately predict critical phenotype information for all samples in recount

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA (Sequence Read Archive), N=49,848

TCGA (The Cancer Genome Atlas), N=11,284

GTEx (Genotype Tissue Expression Project), N=9,662

slide adapted from Shannon Ellis

Page 37:

Goal: to accurately predict critical phenotype information for all samples in recount

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA (Sequence Read Archive), N=49,848

GTEx (Genotype Tissue Expression Project), N=9,662: divide samples
- training set: build and optimize phenotype predictor
- test set: test accuracy of predictor

TCGA (The Cancer Genome Atlas), N=11,284

slide adapted from Shannon Ellis
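
The "divide samples" step can be sketched in base R as a random split into a training set and a test set. An even split, the seed and the sample identifiers below are assumptions for illustration; the slides do not say how the samples were actually divided.

```r
set.seed(2017)  # arbitrary seed so the split is repeatable

# Hypothetical sample identifiers; 9,662 matches the GTEx count on the slide.
sample_ids <- paste0("sample_", seq_len(9662))

# Randomly assign half of the samples to the training set.
train_idx <- sample(length(sample_ids), size = length(sample_ids) %/% 2)
training_set <- sample_ids[train_idx]

# Everything not used for training is held out to test the predictor.
test_set <- sample_ids[-train_idx]
```

Keeping the two sets disjoint matters because the test accuracy is only meaningful on samples the predictor never saw while it was being built.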

Page 38:

Goal: to accurately predict critical phenotype information for all samples in recount

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA (Sequence Read Archive), N=49,848

GTEx (Genotype Tissue Expression Project), N=9,662: divide samples
- training set: build and optimize phenotype predictor
- test set: test accuracy of predictor

TCGA (The Cancer Genome Atlas), N=11,284
- predict phenotypes across samples in TCGA

slide adapted from Shannon Ellis

Page 39:

Goal: to accurately predict critical phenotype information for all samples in recount

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA (Sequence Read Archive), N=49,848
- predict phenotypes across SRA samples

GTEx (Genotype Tissue Expression Project), N=9,662: divide samples
- training set: build and optimize phenotype predictor
- test set: test accuracy of predictor

TCGA (The Cancer Genome Atlas), N=11,284
- predict phenotypes across samples in TCGA

slide adapted from Shannon Ellis

Page 40:

select_regions()

Output:
- Coverage matrix (data.frame)
- Region information (GRanges)

slide adapted from Shannon Ellis

Page 41:

Sex prediction is accurate across data sets

Number of Regions       20      20      20      20
Number of Samples (N)   4,769   4,769   11,245  3,640
Accuracy                99.8%   99.6%   99.4%   88.5%

slide adapted from Shannon Ellis

Page 42:

Sex prediction is accurate across data sets

Number of Regions       20      20      20      20
Number of Samples (N)   4,769   4,769   11,245  3,640
Accuracy                99.8%   99.6%   99.4%   88.5%

slide adapted from Shannon Ellis

Page 43:

http://www.rna-seqblog.com/

Can we use expression data to predict tissue?

slide adapted from Shannon Ellis

Page 44:

Tissue prediction is accurate across data sets

Number of Regions       589     589     589     589
Number of Samples (N)   4,769   4,769   7,193   8,951
Accuracy                97.3%   96.5%   71.9%   50.6%

slide adapted from Shannon Ellis

Page 45:

Prediction is more accurate in healthy tissue

Number of Regions       589     589     589     589     589
Number of Samples (N)   4,769   4,769   613     6,579   8,951
Accuracy                97.3%   96.5%   91.0%   70.2%   50.6%

slide adapted from Shannon Ellis

Page 46:

> library('recount')
> download_study('ERP001942', type = 'rse-gene')
> load(file.path('ERP001942', 'rse_gene.Rdata'))
> rse <- scale_counts(rse_gene)
> rse_with_pred <- add_predictions(rse_gene)

https://github.com/leekgroup/recount-analyses/

Page 47:

expression data for ~70,000 human samples

Figure: matrix of expression estimates (gene, exon, junctions, ERs) by samples, with samples from GTEx (N=9,662), TCGA (N=11,284) and SRA (N=49,848), next to a samples by phenotypes matrix that now holds predicted values:

sex  tissue
M    Blood
F    Heart
F    Liver

Answer meaningful questions about human biology and expression

slide adapted from Shannon Ellis

Page 48:
Page 49:

Collaborators

The Leek Group: Jeff Leek, Shannon Ellis

Hopkins: Ben Langmead, Chris Wilks, Kai Kammers, Kasper Hansen, Margaret Taub

OHSU: Abhinav Nellore

LIBD: Andrew Jaffe, Emily Burke, Stephen Semick, Carrie Wright, Amanda Price, Nina Rajpurohit

Funding

NIH R01 GM105705, NIH 1R21MH109956, CONACyT 351535, AWS in Education, Seven Bridges, IDIES SciServer

Page 50:

http://research.libd.org/recountWorkshop/

help(package=recountWorkshop)

file.edit(system.file('doc/recount-workshop.Rmd', package='recountWorkshop'))

Leonardo Collado-Torres, @fellgernon, #bioc2017

Page 51:

expression data for ~70,000 human samples

(Multiple) Postdoc positions available to
- develop methods to process and analyze data from recount2
- use recount2 to address specific biological questions

This project involves the Hansen, Leek, Langmead and Battle labs at JHU

Contact: Kasper D. Hansen ([email protected] | www.hansenlab.org)