Jared Simpson Ontario Institute for Cancer Research & Department of Computer Science University of Toronto Error correction, assembly and consensus algorithms for MinION data London Calling, May 14th, 2015


Page 1: 150514 jts london_calling

Jared Simpson

Ontario Institute for Cancer Research &
Department of Computer Science, University of Toronto

Error correction, assembly and consensus algorithms for MinION data

London Calling, May 14th, 2015

Page 2: 150514 jts london_calling

Our collaboration


Page 3: 150514 jts london_calling

An overview of NGS assembly

• Illumina data: short reads, very accurate, very deep
  • nearly all Illumina assembly is based on exact-matching algorithms
  • fragmented assemblies
• Algorithms for Illumina data do not work for long, noisy reads
• PacBio developed a pipeline (“HGAP”) to assemble their data
• We used this recipe as a starting point but with custom components

Page 4: 150514 jts london_calling

Long read assembly pipeline

Input reads → Error correction → Celera Assembler → Consensus → Genome Assembly

Page 5: 150514 jts london_calling

Input Data

• First challenge is finding overlaps for reads with 15-20% errors

Page 6: 150514 jts london_calling

Overlap Detection

We use github.com/thegenemyers/daligner to compute overlaps.

Page 7: 150514 jts london_calling

Partial Order Graphs

Add the read GCTACGAT that we want to correct to the graph.

Page 8: 150514 jts london_calling

Partial Order Graphs

Add sequence GCTCGAT to the graph.

Page 9: 150514 jts london_calling

Partial Order Graphs

Add sequence GCTCGATT to the graph.

Page 10: 150514 jts london_calling

Partial Order Graphs

The maximum weight path, GCTCGAT, is the corrected read.
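The partial order graph example on these slides can be sketched in a few lines. This is a minimal illustration, not the POA implementation used in the pipeline: it assumes the three sequences have already been aligned into gapped rows (real partial order alignment aligns each sequence to the graph by dynamic programming), and the scoring rule (edge weight minus half the number of sequences) is an assumption chosen so that only majority-supported edges are rewarded. The function name `consensus_from_alignment` is hypothetical.

```python
from collections import defaultdict

def consensus_from_alignment(rows):
    """Build a weighted DAG from pre-aligned rows and return its best-scoring path."""
    n = len(rows)
    weight = defaultdict(int)   # (node_u, node_v) -> number of rows that use this edge
    nodes = set()
    for row in rows:
        # A node is (column, base); gap characters contribute no node.
        path = [(col, base) for col, base in enumerate(row) if base != "-"]
        nodes.update(path)
        for u, v in zip(path, path[1:]):
            weight[(u, v)] += 1

    preds = defaultdict(list)
    for (u, v), w in weight.items():
        preds[v].append((u, w))

    # Edges only go from a lower column to a higher one, so sorting by
    # column gives a topological order.
    order = sorted(nodes)
    best = {u: 0.0 for u in order}    # best score of any path ending at u
    prev = {u: None for u in order}
    for v in order:
        for u, w in preds[v]:
            score = best[u] + (w - n / 2)   # assumed scoring rule, see lead-in
            if score > best[v]:
                best[v], prev[v] = score, u

    # Trace back from the highest-scoring end node.
    node = max(order, key=lambda u: best[u])
    bases = []
    while node is not None:
        bases.append(node[1])
        node = prev[node]
    return "".join(reversed(bases))

rows = [
    "GCTACGAT-",   # read to correct
    "GCT-CGAT-",
    "GCT-CGATT",
]
print(consensus_from_alignment(rows))   # GCTCGAT
```

With a plain sum of edge weights the path through the extra A would tie with the corrected path, which is why this sketch penalizes edges supported by fewer than half of the sequences.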

Page 11: 150514 jts london_calling

Error Correction


Page 12: 150514 jts london_calling

Contig Assembly

Celera Assembler produces one contig at 98.5% identity.

Page 13: 150514 jts london_calling

Assembly Polishing

• The consensus problem is viewed as choosing a sequence C′ that maximizes the probability of the event data:

$$C' = \operatorname*{argmax}_{S \in \mathcal{C}} P(D \mid S)$$

where

$$P(D \mid S) = \prod_{k=1}^{r} P(e_{i,k}, e_{i+1,k}, \ldots, e_{j,k} \mid S, \Theta)$$

Page 14: 150514 jts london_calling

Selecting a Consensus

Current consensus: GCTACGATCGACTTACGA

Mutate                  P(D|S) (log scale)
ACTACGATCGACTTACGA      -190
CCTACGATCGACTTACGA      -187
TCTACGATCGACTTACGA      -192
...
-CTACGATCGACTTACGA      -176
G-TACGATCGACTTACGA      -191
GC-ACGATCGACTTACGA      -193
GCT-CGATCGACTTACGA      -168
...
GACTACGATCGACTTACGA     -198
GCCTACGATCGACTTACGA     -191
GGCTACGATCGACTTACGA     -195
GTCTACGATCGACTTACGA     -181

Page 15: 150514 jts london_calling

Selecting a Consensus

Select new consensus: GCT-CGATCGACTTACGA (-168)

The new consensus GCT-CGATCGACTTACGA is then mutated again.
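The mutate-and-select loop on these two slides can be sketched as a greedy search. This is a minimal illustration under stated assumptions: the `score` callable stands in for log P(D|S), and in the toy usage below it is replaced by a negative edit distance to a few made-up reads, not the signal-level likelihood described in the talk. The names `mutations`, `polish` and `edit_distance` are hypothetical.

```python
def mutations(seq, alphabet="ACGT"):
    """Yield every sequence one substitution, deletion or insertion away from seq."""
    for i in range(len(seq)):
        for base in alphabet:
            if base != seq[i]:
                yield seq[:i] + base + seq[i + 1:]   # substitution
        yield seq[:i] + seq[i + 1:]                  # deletion
    for i in range(len(seq) + 1):
        for base in alphabet:
            yield seq[:i] + base + seq[i:]           # insertion

def polish(seq, score, max_rounds=100):
    """Repeatedly replace the consensus with its best-scoring mutation."""
    best, best_score = seq, score(seq)
    for _ in range(max_rounds):
        candidate = max(mutations(best), key=score)
        candidate_score = score(candidate)
        if candidate_score <= best_score:
            break                                    # no mutation improves the score
        best, best_score = candidate, candidate_score
    return best, best_score

def edit_distance(a, b):
    """Levenshtein distance, used only as a toy stand-in for the likelihood."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Toy data: three identical made-up reads that lack the extra A.
reads = ["GCTCGATCGACTTACGA"] * 3
toy_score = lambda s: -sum(edit_distance(s, r) for r in reads)
print(polish("GCTACGATCGACTTACGA", toy_score))
```

Scoring every single-base mutation of the whole sequence is quadratic in practice; the sketch ignores that cost to keep the control flow visible.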

Page 16: 150514 jts london_calling

Pore Models


Page 17: 150514 jts london_calling

Generating Events

• What do we expect events from a given sequence to look like?

[Figure: sampled current (pA, axis 0 to 100) versus time (s, axis 0.00 to 1.25) for the sequence GCTACGATT]

Page 18: 150514 jts london_calling

Generating Events

• What do we expect events from a given sequence to look like?

[Figure: sampled current (pA, axis 0 to 100) versus time (s, axis 0.00 to 1.25) for the sequence GCTACGATT, with more of the trace drawn]

Page 19: 150514 jts london_calling

Generating Events

• What do we expect events from a given sequence to look like?

[Figure: sampled current (pA, axis 0 to 100) versus time (s, axis 0.00 to 1.25) for the sequence GCTACGATT, with more of the trace drawn]

Page 20: 150514 jts london_calling

Generating Events

• What do we expect events from a given sequence to look like?

[Figure: sampled current (pA, axis 0 to 100) versus time (s, axis 0.00 to 1.25) for the sequence GCTACGATT, with more of the trace drawn]

Page 21: 150514 jts london_calling

Generating Events

• What do we expect events from a given sequence to look like?

[Figure: sampled current (pA, axis 0 to 100) versus time (s, axis 0.00 to 1.25), complete trace; sequence label shown as CTACGATT]

Page 22: 150514 jts london_calling

Event Detection

[Figure: sampled current (pA, axis 0 to 100) versus time (s, axis 0.00 to 1.25), segmented into events]

Event   mean current (pA)   current stdv   duration (s)
1       60.3                0.7            0.521
2       40.6                1.0            0.112
3       52.2                2.0            0.356
4       54.1                1.2            0.291
5       49.5                1.5            0.141
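Each row of the event table is a summary of one run of raw samples. The sketch below shows only that summarization step, assuming the segment boundaries are already known; choosing the boundaries (the detection itself) is not shown, and the `Event` type, the `summarize_events` function and the sample values are all hypothetical.

```python
import statistics
from typing import NamedTuple

class Event(NamedTuple):
    mean: float       # mean current in pA
    stdv: float       # standard deviation of the current in pA
    duration: float   # duration in seconds

def summarize_events(samples, boundaries, sample_rate_hz):
    """Summarize samples[b0:b1], samples[b1:b2], ... into one Event per segment."""
    events = []
    for start, end in zip(boundaries, boundaries[1:]):
        segment = samples[start:end]
        events.append(Event(
            mean=statistics.mean(segment),
            stdv=statistics.pstdev(segment),
            duration=len(segment) / sample_rate_hz,
        ))
    return events

# Made-up samples: three near 60 pA followed by four near 40 pA.
samples = [60.0, 60.5, 60.4, 40.2, 41.0, 40.6, 40.6]
for event in summarize_events(samples, boundaries=[0, 3, 7], sample_rate_hz=1000):
    print(event)
```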

Page 23: 150514 jts london_calling

A simple model

• What is the probability of observing events E given a sequence S?
• Assuming for the moment there are no missing or extra events:

$$P(e_1, e_2, \ldots, e_n \mid s_1, s_2, \ldots, s_n, \Theta) = \prod_{i=1}^{n} P(e_i \mid s_i, \mu_{s_i}, \sigma_{s_i})$$

$$P(e_i \mid k, \mu_k, \sigma_k) = \mathcal{N}(\mu_k, \sigma_k^2)$$
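The simple model can be evaluated directly in log space. The sketch below uses the five event means from the Event Detection table and the five 5-mers of GCTACGATT; the pore model values (mean and standard deviation per 5-mer) are made up for illustration and are not taken from any real pore model. The function names are hypothetical.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_likelihood(events, kmers, pore_model):
    """log P(e_1..e_n | s_1..s_n), assuming exactly one event per k-mer."""
    if len(events) != len(kmers):
        raise ValueError("the simple model needs one event per k-mer")
    return sum(log_normal_pdf(e, *pore_model[k]) for e, k in zip(events, kmers))

sequence = "GCTACGATT"
kmers = [sequence[i:i + 5] for i in range(len(sequence) - 4)]

# Hypothetical pore model: 5-mer -> (mu, sigma) in pA.
pore_model = {
    "GCTAC": (60.0, 1.5),
    "CTACG": (41.0, 1.5),
    "TACGA": (52.0, 1.5),
    "ACGAT": (54.0, 1.5),
    "CGATT": (50.0, 1.5),
}

events = [60.3, 40.6, 52.2, 54.1, 49.5]   # event means from the table
print(log_likelihood(events, kmers, pore_model))
```

The length check makes the limitation explicit: as soon as an event is missing or duplicated, this model cannot be evaluated at all.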

Page 24: 150514 jts london_calling

Complications

[Figure: sampled current (pA, axis 30 to 70) versus time (s, axis 0.0 to 0.6)]

Page 25: 150514 jts london_calling

Complications

[Figure: sampled current (pA, axis 30 to 70) versus time (s, axis 0.0 to 0.6), annotated: "Is this an event?"]

Page 26: 150514 jts london_calling

Complications

[Figure: sampled current (pA, axis 30 to 70) versus time (s, axis 0.0 to 0.6), annotated: "One event or two?"]

Page 27: 150514 jts london_calling

Nanopore HMM

• must consider:
  • over segmentation
  • under segmentation
  • missed short events
• HMM:
  • M states: match event to 5-mers
  • E states: extra obs. of an event
  • K states: no event obs. for 5-mer

Computing P(D|S):

$$P(\pi, e_1, e_2, \ldots, e_n \mid S, \Theta) = \prod_{i=1}^{n} P(e_i \mid \pi_i, \mu_{s_i}, \sigma_{s_i}) \, P(\pi_i \mid \pi_{i-1}, S)$$

$$P(e_1, e_2, \ldots, e_n \mid S, \Theta) = \sum_{\pi} P(\pi, e_1, e_2, \ldots, e_n \mid S, \Theta)$$
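Summing over all state paths is done with the forward algorithm. The sketch below is a simplified stand-in for the HMM on this slide, not the model from the talk: it collapses the M, E and K states into three moves over a table indexed by (events used, 5-mers used), and it uses constant, made-up probabilities `p_extra` and `p_skip` and a made-up pore model. All function and parameter names are hypothetical.

```python
import math

NEG_INF = float("-inf")

def log_normal_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def logsumexp(values):
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(events, kmers, pore_model, p_extra=0.1, p_skip=0.1):
    """log of the sum over all paths of P(path, events | kmers), simplified model."""
    log_m = math.log(1.0 - p_extra - p_skip)   # M: next k-mer emits the next event
    log_e = math.log(p_extra)                  # E: same k-mer emits one more event
    log_k = math.log(p_skip)                   # K: k-mer passes without an event
    n, m = len(events), len(kmers)
    # f[i][j]: log probability of explaining the first i events with the first j k-mers
    f = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for i in range(n + 1):
        for j in range(1, m + 1):
            mu, sigma = pore_model[kmers[j - 1]]
            terms = [f[i][j - 1] + log_k]
            if i > 0:
                emit = log_normal_pdf(events[i - 1], mu, sigma)
                terms.append(f[i - 1][j - 1] + log_m + emit)
                terms.append(f[i - 1][j] + log_e + emit)
            f[i][j] = logsumexp(terms)
    return f[n][m]

sequence = "GCTACGATT"
kmers = [sequence[i:i + 5] for i in range(len(sequence) - 4)]
pore_model = {   # hypothetical 5-mer -> (mu, sigma) in pA
    "GCTAC": (60.0, 1.5), "CTACG": (41.0, 1.5), "TACGA": (52.0, 1.5),
    "ACGAT": (54.0, 1.5), "CGATT": (50.0, 1.5),
}
events = [60.3, 40.6, 52.2, 54.1, 49.5]
print(forward_log_likelihood(events, kmers, pore_model))
# An over-segmented read (last event split in two) still gets a finite score.
print(forward_log_likelihood(events + [49.7], kmers, pore_model))
```

Unlike the one-event-per-5-mer model, the table accepts event lists that are longer or shorter than the list of 5-mers, which is what over-segmentation and missed events require.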

Page 28: 150514 jts london_calling

Transition Probabilities

• The probability of not observing an event is a function of the absolute difference between the (expected) current levels of consecutive 5-mers

[Figure: sampled current (pA, axis 30 to 70) versus time (s, axis 0.0 to 0.6)]

Page 29: 150514 jts london_calling

Transition Probabilities

• The probability of not observing an event is a function of the absolute difference between the (expected) current levels of consecutive 5-mers

[Figure: sampled current (pA, axis 30 to 70) versus time (s, axis 0.0 to 0.6)]

Page 30: 150514 jts london_calling

Transition Probabilities

[Figure: skip probability (axis 0.1 to 0.5) versus absolute difference (pA, axis 0 to 20)]

Page 31: 150514 jts london_calling

Assembly Accuracy

[Figure, panels A to D: scatter plots of 5-mer count in the draft assembly and in the polished assembly versus 5-mer count in the reference (axes 0 to 10000); bar charts of k-mer counts (axis 0 to 12000) for TTTTT, AAAAA, TTTTG, CAAAA, CTTTT, AAAAG, CCCCC and GGGGG, draft versus reference and polished versus reference]

Draft: 98.5% accuracy
Polished: 99.5% accuracy
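The 5-mer comparison plotted on this slide can be reproduced in outline with a counter per sequence. This is a minimal sketch on made-up toy sequences: it counts forward-strand 5-mers only and does not align the assembly to the reference. The function names are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k=5):
    """Count every k-mer on the forward strand of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def compare_kmer_counts(reference, assembly, k=5):
    """Map each k-mer seen in either sequence to (count in reference, count in assembly)."""
    ref, asm = kmer_counts(reference, k), kmer_counts(assembly, k)
    return {kmer: (ref[kmer], asm[kmer]) for kmer in sorted(set(ref) | set(asm))}

# Toy example: the assembly undercalls a 12-base homopolymer as 10 bases.
reference = "CT" + "A" * 12 + "GTACA"
assembly = "CT" + "A" * 10 + "GTACA"
for kmer, (in_ref, in_asm) in compare_kmer_counts(reference, assembly).items():
    print(kmer, in_ref, in_asm)
```

In the toy data only AAAAA differs between the two sequences, which is the kind of homopolymer deficit the bar charts are designed to expose.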

Page 32: 150514 jts london_calling

Assembly Accuracy

[Figure, panels A to D: the same 5-mer count plots as on the previous slide (draft versus reference and polished versus reference)]

Page 33: 150514 jts london_calling

Aligning Events to a Reference

• HMM can also align events to a reference genome
• Read about it here:
  • http://simpsonlab.github.io/2015/04/08/eventalign/

Page 34: 150514 jts london_calling

Planned Improvements

• Model dwell duration to better call homopolymers

  CTAAAAAAAAAAAAGTACA

• SNP calling/genotyping

$$P(g_i \mid D) = \frac{P(D \mid g_i) \, P(g_i)}{P(D)}$$

• Improve scalability to handle larger genomes
• Use signal data during error correction
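The genotyping formula is Bayes' rule, with P(D) obtained by summing the numerator over all candidate genotypes. The sketch below assumes per-genotype log-likelihoods are already available (for example from a likelihood model such as the HMM); the genotype labels, log-likelihood values and priors are made up for illustration, and the function name is hypothetical.

```python
import math

def genotype_posteriors(log_likelihoods, priors):
    """Return P(g | D) for each genotype g from log P(D | g) and P(g)."""
    log_joint = {g: ll + math.log(priors[g]) for g, ll in log_likelihoods.items()}
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_joint.values())
    unnormalized = {g: math.exp(v - m) for g, v in log_joint.items()}
    total = sum(unnormalized.values())   # proportional to P(D)
    return {g: u / total for g, u in unnormalized.items()}

# Made-up values: two candidate alleles with equal priors.
posteriors = genotype_posteriors(
    log_likelihoods={"A": -168.0, "G": -176.0},
    priors={"A": 0.5, "G": 0.5},
)
print(posteriors)
```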

Page 35: 150514 jts london_calling

Acknowledgements & Code

• Collaborators:
  • Nick Loman, Josh Quick (Birmingham)
  • Jonathan Dursi (OICR)
• Code:
  • github.com/jts/nanocorrect (error correction)
  • github.com/jts/nanopolish (signal-level algorithms)
  • github.com/jts/nanopore-paper-analysis (reproduce our paper)