Jonathan Laserson advised by Daphne Koller October 20 th , 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE


Jonathan Laserson

advised by

Daphne Koller

October 20th, 2011

Bayesian Assembly of Reads from High Throughput Sequencing

THESIS DEFENSE

High-Throughput Sequencing

Millions of noisy short reads from a sample
A snapshot of the genomic material in the sample

Roche 454 GS-FLX Titanium
  Avg read length: 380 b
  Reads in a run: 500k
  Base error rate: 0.74% (84% indels)
  Cost per base: 0.05$/kb
[You et al., 2011]

[Figure: genomes -> reads (e.g. CGGACT) -> Output: CGGACTATCCACTATCCACCGACCCTGTGGGTCCCCGTCAGAACCCGCCCTA...]

Population Sequencing

[Figure: reads and sequences, marked with x's for noise and mutations. Depth: phylogenetic trees, Immune-seq. Breadth: metagenomics, de novo assembly.]

Input: reads

CTGGGAAAAG AGCCCAGCCCC CGTAGATCA CTGGAAAAG GCTGACGTAG AAAGCCCAG ACTGACCTG
CCCTCATTTC ACTGCCCTGG GACGTAGATCA GCCCTGGTGA CGGCTGACGT AGATCATAGA
AGCCCAGCCC CATAGAAACGAT

GENOVO

Output: an assembly (contigs, each shown with the reads aligned to it)

contig: CGGCTGACGTAGATCATAGAAACGAT
reads:  CGGCTGACGT GCTGACGTAG GACGTAGATCA CGTAGATCA AGATCATAGA CATAGAAACGAT

contig: ACTGCCCTGGGAAAAGCCCAGCCCCTCATTTC
reads:  ACTGCCCTGG ACTGCCCTG GCCCTGGGA CTGGGAAAAG CTGGGAAAAG AAAGCCCAG AGCCCAGCCC AGCCCAGCCCC CCCTCATTTC

Previous Work

Velvet, Euler:
  Assume most reads have no error
  Assume a single genome
  Trash low-abundance reads
  Fix reads with one error a priori

Newbler:
  No code / publication

Our Approach: Probabilistic Model

Fully Bayesian model for high-throughput sequencing
Jointly performing 3 tasks:
  de-noising the reads
  de novo sequence assembly
  multiple alignment
We do not throw away reads
Can handle unbalanced populations of genomes

Our Approach: Probabilistic Model

[Figure: plate diagram over b_so (o = -∞..+∞), ρ_s (s = 1..∞), s_i, o_i and y_i (i = 1..N), with π and the hyperparameters α and β]

s ~ CRP(α, N)
b_so ~ U[A,C,G,T]
ρ_s ~ Beta(1, 1+β)
o_i ~ G(ρ_{s_i})
y_i ~ read_model(b, s_i, o_i)

log P(y,b,s,o|ρ) ≈ #errors · log(ε) - #bases · log(4) + #seq · log(α)
  #errors · log(ε): min read errors
  #bases · log(4): min bases

We want to find argmax over b,s,o of P(b,s,o | y)
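The first line of the model, s ~ CRP(α, N), can be illustrated with a minimal sketch of drawing one partition of N reads from a Chinese restaurant process. The function name `sample_crp` and the values used are assumptions for illustration only; this is the textbook sequential construction, not the thesis code.

```python
import random

def sample_crp(n_reads, alpha, rng):
    """Draw one assignment of n_reads reads to sequences from a CRP(alpha)."""
    assignments = []   # assignments[i] = index of the sequence read i joined
    counts = []        # counts[s] = number of reads currently in sequence s
    for _ in range(n_reads):
        # An existing sequence s has weight counts[s]; a new sequence has weight alpha.
        weights = counts + [alpha]
        s = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if s == len(counts):
            counts.append(0)   # the read opened a new sequence
        counts[s] += 1
        assignments.append(s)
    return assignments, counts

rng = random.Random(0)
assignments, counts = sample_crp(20, alpha=1.0, rng=rng)
```

Reads tend to join sequences that already hold many reads, and a smaller α makes opening a new sequence less likely.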

Inference

MCMC
  Making moves in a probabilistic space
  Stochastic process
    escape local optimum
    multiple hypotheses

Move 1: Read Mapping

A READ samples a sequence s_i and an offset o_i (inside the seq)

[Figure: a read and its candidate locations; the numeric labels 18 and 10 appear in the figure]

P(s_i=s, o_i=o | y, b, ρ, s_-i, o_-i) ∝ Likelihood Term × Stickiness Term × Geometric Term
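A hedged sketch of how the three terms could be combined into a distribution over candidate locations. It assumes the stickiness term is the CRP weight (the number of other reads already in the sequence, or α for a new sequence) and takes the likelihood and geometric terms as given log values. The function name, the tuple layout and all numbers are made up for illustration.

```python
import math

def mapping_posterior(candidates, alpha):
    """Normalize likelihood x stickiness x geometric over candidate locations.

    Each candidate is (log_likelihood, n_reads_in_seq, log_geometric).
    n_reads_in_seq == 0 marks a brand-new sequence, whose stickiness is alpha.
    """
    log_w = []
    for log_lik, n_reads, log_geo in candidates:
        stickiness = n_reads if n_reads > 0 else alpha
        log_w.append(log_lik + math.log(stickiness) + log_geo)
    top = max(log_w)                        # subtract the max for numerical stability
    w = [math.exp(x - top) for x in log_w]
    total = sum(w)
    return [x / total for x in w]

# Two existing sequences with equal fit, plus the option of a new sequence.
candidates = [(-2.0, 18, -1.0), (-2.0, 10, -1.0), (-9.0, 0, 0.0)]
probs = mapping_posterior(candidates, alpha=1.0)
```

A Gibbs-style move would then sample the read's new location from `probs`.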

Comparison: How many bases did you BLAST into?

[Figure: horizontal line: total bases covered by all methods together; vertical line: threshold for a perfect match]

Comparison: How many bases did you BLAST into?

[Figure: horizontal line: total bases covered by all methods together; vertical line: threshold for a perfect match]

A Score for an Assembly

Score = #errors · log(ε) - #bases · log(4) + #seq · log(α)

  #errors · log(ε): min read errors
  #bases · log(4): min bases
  #seq · log(α): α controls minimal overlap between adjacent reads
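The score is a direct sum of three terms, so it can be written as a one-line function. The sketch below uses hypothetical argument names and made-up values for ε and α; it only shows how the terms trade off.

```python
import math

def assembly_score(n_errors, n_bases, n_seqs, eps, alpha):
    """#errors * log(eps) - #bases * log(4) + #seq * log(alpha)."""
    return n_errors * math.log(eps) - n_bases * math.log(4) + n_seqs * math.log(alpha)

# Same reads explained with one or two sequences (illustrative numbers only).
one_seq = assembly_score(n_errors=3, n_bases=58, n_seqs=1, eps=0.01, alpha=1e-6)
two_seqs = assembly_score(n_errors=3, n_bases=58, n_seqs=2, eps=0.01, alpha=1e-6)
more_errors = assembly_score(n_errors=4, n_bases=58, n_seqs=1, eps=0.01, alpha=1e-6)
```

With ε < 1 and α < 1, every extra read error, base, or sequence lowers the score, so the best assembly explains the reads with as few of each as possible.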

Contributions

Fully Bayesian model for high-throughput sequencing
An algorithm jointly performing 3 tasks:
  de-noising the reads
  de novo sequence assembly
  multiple alignment
A score for de novo assembly
Code freely available

Laserson J, Jojic V, Koller D. Genovo: de novo Assembly for Metagenomes. Journal of Computational Biology, 2011. RECOMB 2010.

Population Sequencing

[Figure: reads and sequences, marked with x's for noise and mutations. Depth: phylogenetic trees, Immune-seq. Breadth: metagenomics, de novo assembly.]

Immune System

10^10 B-Cells in an individual

Immune System

10^10 B-Cells in an individual
Phase 1: generate basic seq
  choose V, D, J from catalog

  segment   catalog size   length
  V         57             ~297 bases
  D         27             ~24 bases
  J         6              ~52 bases

  (57 x 27 x 6 combinations)

[Illustration: Wikipedia]

Immune System

10^10 B-Cells in an individual
Phase 1: generate basic seq
  choose V, D, J
  stitch V-D and D-J by:
    eat up a few letters from ends
    insert a few letters

V N1 D N2 J   (~360 bases)

[Illustration: Wikipedia]
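Phase 1 can be sketched as a toy generator. The catalog sizes and segment lengths come from the slides. Everything else is an assumption made for illustration: the random stand-in segments, the trim limit of 3 letters, the insert limit of 5 letters, and the function names.

```python
import random

N_V, N_D, N_J = 57, 27, 6            # catalog sizes from the slides
N_COMBINATIONS = N_V * N_D * N_J     # V, D, J choices before any stitching

def random_bases(n, rng):
    return "".join(rng.choice("ACGT") for _ in range(n))

def stitch(left, right, rng, max_trim=3, max_insert=5):
    """Join two segments: eat a few letters at the junction, insert a few."""
    trim_left = rng.randint(0, min(max_trim, len(left)))
    trim_right = rng.randint(0, min(max_trim, len(right)))
    left_kept = left[:len(left) - trim_left]
    right_kept = right[trim_right:]
    inserted = random_bases(rng.randint(0, max_insert), rng)   # the N region
    return left_kept, inserted, right_kept

def recombine(v, d, j, rng):
    """Build V N1 D N2 J from one V, one D and one J segment."""
    v_kept, n1, d_rest = stitch(v, d, rng)
    d_kept, n2, j_kept = stitch(d_rest, j, rng)
    return v_kept + n1 + d_kept + n2 + j_kept

rng = random.Random(1)
v, d, j = random_bases(297, rng), random_bases(24, rng), random_bases(52, rng)
seq = recombine(v, d, j, rng)
```

The catalog alone gives 57 x 27 x 6 = 9234 combinations; the random trimming and insertion at the two junctions multiply that many times over.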

Immune System

10^10 B-Cells in an individual
Phase 1: generate basic seq
Phase 2: hyper-mutate
  If detecting an invader, clone yourself with lots of mutations.
  Even better detection? Keep cloning.
  Remember the successful sequences for future reference, and keep producing them.
  This set of sequences is one clone.

V N1 D N2 J

[Illustration: Janeway, 2001]

Input: reads

igForest

Output: clones

[Figure: reads grouped into clones. Each clone is drawn under a germline V-D-J sequence (V1, V7 and V4 in the examples shown); x's mark mutations and sequencing noise.]

De Novo Annotation for B-Cells

ReperTree

Phase 1: Divide reads into clones
  1M reads!
  In a healthy individual almost every read is from a different clone.

Phase 2: Construct a mutation tree for each clone.
  How to tell a mutation from a sequencing error?

Overview

10 (healthy) organ donors
4 samples from each individual
  Blood, spleen, two lymph nodes
380k reads total
  8,000 clones
  165k other B-cell variants (not clones)

[Figure: igForest]

A Repertoire

[Figure: a repertoire shown over Vs (57) and Js (6)]

Variants (173k)

[Figure: repertoires over Vs (57) and Js (6), arranged by patients and by source: blood, spleen, lymph node 1, lymph node 2]

Repertoire Factorization

[Figure: candidate factorizations labeled V,J VJ VS JS VD JD VJS VJD, with values (*1e-3): 1.8, 2, 2.8, 1.9, 2.1, 2.2, 2.9, 2.9, 3]

X(v,j,donor,source) = a(v,j,source) x b(v,j,donor)

RMSE = 1.8e-3
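For a single (v, j) pair the factorization says the donor-by-source table is a product of a donor factor b and a source factor a. The slides do not say how the factors were fit. The sketch below swaps in plain alternating least squares and a toy table as an assumption, and shows how an RMSE like the one quoted would be computed.

```python
import math

def fit_rank1(X, n_iter=50):
    """Alternating least squares fit of X[d][s] ~ b[d] * a[s] for one (v, j) pair."""
    n_donors, n_sources = len(X), len(X[0])
    a = [1.0] * n_sources
    b = [1.0] * n_donors
    for _ in range(n_iter):
        # Best b for fixed a, then best a for fixed b (each is a 1-D least squares).
        aa = sum(x * x for x in a)
        b = [sum(X[d][s] * a[s] for s in range(n_sources)) / aa for d in range(n_donors)]
        bb = sum(x * x for x in b)
        a = [sum(X[d][s] * b[d] for d in range(n_donors)) / bb for s in range(n_sources)]
    return a, b

def rmse(X, a, b):
    sq = [(X[d][s] - b[d] * a[s]) ** 2 for d in range(len(X)) for s in range(len(X[0]))]
    return math.sqrt(sum(sq) / len(sq))

# Toy table (3 donors x 4 sources) that factorizes exactly; values are made up.
true_a = [0.4, 0.3, 0.2, 0.1]
true_b = [0.01, 0.02, 0.03]
X = [[bd * a_s for a_s in true_a] for bd in true_b]
a, b = fit_rank1(X)
error = rmse(X, a, b)
```

On real data the table would not factorize exactly, and the residual RMSE measures how much of the repertoire the two factors fail to explain.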

Properties Of Variants

[Figure: fraction of variants as a function of: length of variable region, length of N1, length of N2, distance to germline]

Clones (8,000)

[Figure: clones per patient by source (blood, spleen, mesenteric, mediastinal), split into pure clones, clones shared by 2 sources, and clones shared by 3 sources]

Evidence of clone migration

What do the trees look like?

Current Phylogenetic Trees

traditional phylo trees        igForest
X*100 sequences                X*10,000 √
Weeks to run                   Hours √
No sequencing/PCR errors       Noise model √
Single solution                Multiple/statistics √
Binary tree                    √
Data at the leaves             √
Continuous mutation model      Discrete √

Previous Work on B Cells

iHMMune-align [Gaëta et al. 2007], SoDA2 [Munshaw and Kepler 2010]
  examine each read independently, no tree

igTree [Barak et al. 2008]
  heuristic search, no high-throughput support

Our Approach (igForest):
  Joint probabilistic model for hypermutation and sequencing noise

Generate Tree (Birth/death process)

Start with one live cell at t=0
Give birth at rate λ
Die at rate δ
Stop when t = "snapshot time"

[Figure: tree growing along the time axis, from the root at t=0 to t=snapshot]
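The four steps above can be sketched as an event-driven simulation. This is a minimal sketch under stated assumptions: each live cell gives birth and dies independently at the given rates, a parent stays alive after giving birth, and the function name, return layout and rate values are illustrative rather than taken from igForest.

```python
import random

def simulate_birth_death(birth_rate, death_rate, snapshot, rng):
    """Simulate the tree; returns (parent, born, died, live).

    parent[c], born[c], died[c] describe cell c (died[c] is None if it never died);
    live lists the cells alive at the snapshot time.
    """
    parent, born, died = [None], [0.0], [None]   # cell 0 is the root, born at t=0
    live = [0]
    t = 0.0
    rate_per_cell = birth_rate + death_rate
    while live and rate_per_cell > 0:
        # Waiting time to the next event among all live cells is exponential.
        t += rng.expovariate(rate_per_cell * len(live))
        if t >= snapshot:
            break
        cell = rng.choice(live)
        if rng.random() < birth_rate / rate_per_cell:
            parent.append(cell)
            born.append(t)
            died.append(None)
            live.append(len(parent) - 1)
        else:
            died[cell] = t
            live.remove(cell)
    return parent, born, died, live

rng = random.Random(2)
parent, born, died, live = simulate_birth_death(1.0, 0.5, 3.0, rng)
```

The cells left in `live` are the ones a sequencing snapshot could observe; cells that died earlier survive only as internal nodes of the tree.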

Generate Sequences

Start with the inferred germline at the root
Generate each child sequence from the sequence of its parent
  Using a Mutation Model
Continue so, recursively.

[Figure: the tree from the root at t=0 to t=snapshot, with x's marking mutations]

Generate Reads

Total N reads.
Assign reads to live cells (uniformly).
Copy read sequence from cell, based on read model.

[Figure: reads attached to the live cells at t=snapshot, on the tree that starts at t=0]
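A minimal sketch of the read-generation step. The uniform assignment follows the slide. The read model here is a stand-in that applies independent substitutions only, which is an assumption: the slides report that most 454 errors are indels, and the actual read model is not specified. Names and sequences are made up.

```python
import random

def generate_reads(cell_seqs, n_reads, error_rate, rng):
    """Assign each read to a live cell uniformly at random and copy it with noise.

    Returns a list of (cell_index, read_sequence) pairs.
    """
    reads = []
    for _ in range(n_reads):
        cell = rng.randrange(len(cell_seqs))
        copied = []
        for base in cell_seqs[cell]:
            if rng.random() < error_rate:
                # Substitute with one of the three other bases.
                base = rng.choice([x for x in "ACGT" if x != base])
            copied.append(base)
        reads.append((cell, "".join(copied)))
    return reads

rng = random.Random(3)
live_cells = ["ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAC"]
clean = generate_reads(live_cells, n_reads=8, error_rate=0.0, rng=rng)
noisy = generate_reads(live_cells, n_reads=8, error_rate=0.0074, rng=rng)
```

Inference has to undo exactly this step: given only the read sequences, decide which differences are mutations shared through the tree and which are per-read noise.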

Inference

MCMC
  Moves to manipulate tree and read assignments
  Generating samples of trees along the way

Output Tree

[Figure: output tree from t=0 to t=snapshot, with numeric labels giving mutation distance]

Output Tree

[Figure: output tree from t=0 to t=snapshot, with numeric labels]

Sample Trees

[Figure: sample trees; labels: reads, mutations compared to germline; sources: blood, spleen, mesenteric, mediastinal]

Sample Trees

[Figure: sample trees; sources: blood, spleen, mesenteric, mediastinal]

Properties Of Trees

[Figure: fraction of clones as a function of: max mutational height, median mutational height, median mutational distance (edge length), max mutational distance (edge length)]

Mutation Analysis – V region

[Figure: fraction of mutations, binned by their aligned codon position across FR2, CDR2, FR3, CDR3. Panels: variants; clones, germline to root; clones, root down the tree.]


Contributions

Fully Bayesian phylogenetic analysis
  O(10k) reads
  Handles sequencing noise
Big Picture view of the B cell repertoire
  Intuitive clone description
Insight on the Immune System
  Evidence of clone migration
  Differences between tissues
  Deep hypermutation analysis
Code (will be) freely available

[Figure: reads and sequences, marked with x's for noise and mutations, along the Depth and Breadth axes]

Thank you!

Acknowledgements

Daphne Koller

Acknowledgements

Vladimir Jojic
Scott Boyd
Andrew Fire
Katherine Jackson
Elyse Hope

Funding
  National Science Foundation
  Stanford Graduate Fellowship

Chair: Richard Olshen

DAGS team
Ben Packer
Karen Sachs
Geremy Heitz
Serafim Batzoglou and his group
Orly Fuhrman
Leor Vartikovski
My Parents