Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences

Song Gao, Niranjan Nagarajan, Wing-Kin Sung

National University of Singapore
Genome Institute of Singapore

Outline

• Overview
• Methods
  - 1. Pre-Processing
  - 2. A Special Case
  - 3. Full Algorithm
  - 4. Graph Contraction
  - 5. Gap Estimation
• Results
• Ongoing Work

[Figure: biological entities (Genome, Transcripts, Microbial Community) and the corresponding data entities (Genomic Sequence, Transcript Assembly, Metagenome). A Sequencing Machine produces Reads for Analysis, e.g. ACGTTTAACAGG…, TTACGATTCGATGA…, GCCATAATGCAAG…, CTTAGAATCGGATAGAC…, AGGCATAGACTAGAG…]

Sequence Assembly

Reads -> Contigs -> Scaffolds (with Paired-end Reads)

Related Research Works

(I) Contig Level
- OLC Framework: Celera Assembler [Myers et al, 2000], Edena [Hernandez et al, 2008], Arachne [Batzoglou et al, 2002], PE Assembler [Ariyaratne et al, 2011]
- De Bruijn Graph: EULER [Pevzner et al, 2001], Velvet [Zerbino et al, 2008], ALLPATHS [Butler et al, 2008], SOAPdenovo [Li et al, 2010]

(II) Scaffold Level
- Comparative Assembly: AMOScmp [Pop, 2004], ABBA [Salzberg, 2008]
- Embedded Module: EULER [Pevzner et al, 2001], Arachne [Batzoglou et al, 2002], Celera Assembler [Myers et al, 2000], Velvet [Zerbino, 2008]
- Standalone Module: Bambus [Pop et al, 2004], SOPRA [Dayarian et al, 2010]

Scaffolding Problem [Huson et al, 2002]

Value Addition
- Gap Filling: GapCloser Module of SOAPdenovo
- Repeat Resolution
- Long-Range Genomic Structure

[Figure: contigs joined into a scaffold by paired-end reads, with one discordant read marked; labels: 1k, 3k, 2.5k]

* Huson, D. H., Reinert, K., Myers E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)

Data
- Sequencing Errors
- Read Length
- Coverage

Analysis
- Long Insert vs. Long Read [Chaisson, 2009; Zerbino, 2009]
- Statistics of Assembled Genomes [Schatz et al, 2010]

Organism     Genome Size   # of Contigs   Contig N50   # of Scaffolds   Scaffold N50
Grapevine    500Mb         58,611         18.2kb       2,093            1.33Mb
Panda        2.4Gb         200,604        36.7kb       81,469           1.22Mb
Strawberry   220Mb         16,487         28.1kb       3,263            1.44Mb
Turkey       1.1Gb         128,271        12.6kb       26,917           1.5Mb

* Zerbino, D.R.: Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS ONE 4(12) (2009)

* Chaisson, M.J., Brinza, D., Pevzner, P.A.: De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Research 19, 336-346 (2009)

* Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Research 20(9), 1165-1173 (2010)

* N50: Given a set of sequences of varying lengths, the N50 length is the largest length N such that at least 50% of all bases are in sequences of length L >= N.
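The N50 definition above can be made concrete with a short sketch. The function name and the toy contig lengths are illustrative assumptions, not taken from the slides.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half of all bases are covered.
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50() needs at least one sequence")


contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical toy assembly, 54 bases
print(n50(contig_lengths))  # prints 8, since 10 + 9 + 8 = 27 is half of 54
```

Sorting first keeps the function to a single pass, which is enough for the contig and scaffold counts in the table above.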

NP-Complete [Huson et al, 2002]

Heuristic Methods
- Celera Assembler [Myers et al, 2000]
- Euler [Pevzner et al, 2001]
- Jazz [Chapman et al, 2002]
- Arachne [Batzoglou et al, 2002]
- Velvet [Zerbino et al, 2008]
- Bambus [Pop et al, 2004]

“True Complexity”
- Phase transition based on parameters [Hayes, 1996]
  - 3-SAT Problem
- Parametric Complexity [Downey et al, 1999]
  - Vertex Cover Problem
  - Fixed-parameter tractability

* Hayes, B.: Can't get no satisfaction. American Scientist 85, 108-112 (1996)

* Downey, R. G., et al.: Parameterized Complexity: A Framework for Systematically Confronting Computational Intractability. DIMACS, Vol. 49 (1999)
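As an illustration of fixed-parameter tractability, here is the textbook bounded-search-tree algorithm for Vertex Cover. It is a general example of the concept named on this slide, not part of Opera; the function name and the example graph are assumptions.

```python
def vertex_cover(edges, k):
    """Return a vertex cover of size <= k as a set, or None if none exists.

    Any cover must contain an endpoint of every edge, so branch on the two
    endpoints of one uncovered edge. The recursion depth is at most k, so
    the running time is O(2^k * |E|): exponential only in the parameter k.
    """
    if not edges:
        return set()
    if k == 0:
        return None
    u, v = edges[0]
    for pick in (u, v):
        remaining = [e for e in edges if pick not in e]
        cover = vertex_cover(remaining, k - 1)
        if cover is not None:
            return cover | {pick}
    return None


path = [("a", "b"), ("b", "c"), ("c", "d")]  # hypothetical 4-node path
print(vertex_cover(path, 1))  # prints None: one vertex cannot cover a 3-edge path
print(vertex_cover(path, 2))
```

The problem is NP-hard in general, yet cheap whenever k is small. This is the kind of behaviour the slide refers to as "true complexity".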

Outline

• Overview
• Methods
  - 1. Pre-Processing
  - 2. A Special Case
  - 3. Full Algorithm
  - 4. Graph Contraction
  - 5. Gap Estimation
• Results
• Ongoing Work

1. Pre-Processing

Paired-end Reads -> Clusters [Huson et al, 2002]

Chimeric Noise: Filtered by simulation*

* Upper Bound of Paired-end Reads

[Figure: paired-end reads between contigs bundled into a cluster (label: 3), with a chimeric read marked "Chimera"]
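The bundling step can be sketched as follows, in simplified form. The tuple layout, the `min_support` cutoff and the use of a plain mean are assumptions for illustration. The bundling in [Huson et al, 2002] is more refined, and the slides derive the noise threshold by simulation rather than fixing it by hand.

```python
from collections import defaultdict
from statistics import mean


def bundle_links(links, min_support=2):
    """Group read-pair links into clusters (scaffold edges).

    links: iterable of (contig_a, contig_b, orientation, distance) tuples,
           a hypothetical pre-parsed form of the read mappings in which
           each contig pair is always listed in the same order.
    min_support: assumed cutoff; clusters with fewer links are treated as
           chimeric noise and dropped.
    """
    groups = defaultdict(list)
    for a, b, orientation, distance in links:
        groups[(a, b, orientation)].append(distance)
    clusters = {}
    for key, distances in groups.items():
        if len(distances) >= min_support:
            clusters[key] = {
                "support": len(distances),
                "mean_distance": mean(distances),
            }
    return clusters


links = [
    ("A", "B", "++", 1000),
    ("A", "B", "++", 1100),
    ("A", "B", "++", 900),
    ("A", "C", "+-", 5000),  # a lone link, treated here as chimeric noise
]
print(bundle_links(links))
```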

2. A Special Case

No discordant clusters in final scaffold

Naïve Solution: Exponential Time

Contigs: A B C D

+A
+A+B   +A-B   +A+C   +A-C
+A+B+C   +A+B-C   +A-C+B   +A-C-B
……
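To see why the naïve search is exponential, the sketch below enumerates every ordering of the contigs with every choice of orientation. It counts each signed ordering separately and does not merge a scaffold with its reverse complement. The function names are assumptions.

```python
from itertools import permutations, product
from math import factorial


def signed_orderings(contigs):
    """Yield every ordering of the contigs with every choice of orientation."""
    for order in permutations(contigs):
        for signs in product("+-", repeat=len(contigs)):
            yield "".join(sign + name for sign, name in zip(signs, order))


def count_signed_orderings(n):
    """Number of candidates the naive search would have to examine: n! * 2^n."""
    return factorial(n) * 2 ** n


candidates = list(signed_orderings("ABC"))
print(len(candidates), "+A-C+B" in candidates)
for n in (4, 10, 20):
    print(n, count_signed_orderings(n))
```

Four contigs already give 384 candidates, and the count grows faster than exponentially with the number of contigs.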

Dynamic Programming
- Scaffold Tail is Sufficient
- Analogous to Bandwidth Problem [Saxe, 1980]
- Orientation of Nodes
- Direction of Edges
- Discordant Edges
- …

[Figure labels: width(w), Upper Bound]

* Saxe, J.: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods 1(4), 363-369 (1980)

Equivalence class of scaffolds
- S1 and S2 have the same tail -> they are in the same class

Features of an equivalence class:
- All scaffolds in the class use the same set of contigs;
- All or none of them can be extended to a solution

[Figure: example scaffolds with the tail marked; labels: +A-B+C, +D+E, -A+C, +D+E+F]
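The grouping of partial scaffolds by tail can be sketched as below. The tail definition used here (the shortest suffix spanning at least `upper_bound` bases, gaps ignored) is an assumption for illustration, as are the function names and the contig lengths. The slides do not spell out the exact definition.

```python
def tail(scaffold, lengths, upper_bound):
    """Return the shortest suffix of `scaffold` spanning >= upper_bound bases.

    scaffold: tuple of signed contigs such as ("+A", "-B", "+C").
    lengths: dict mapping contig name (without sign) to length in bp.
    If the whole scaffold is shorter than upper_bound, the tail is the
    whole scaffold.
    """
    total = 0
    for start in range(len(scaffold) - 1, -1, -1):
        total += lengths[scaffold[start][1:]]
        if total >= upper_bound:
            return scaffold[start:]
    return scaffold


def group_by_tail(scaffolds, lengths, upper_bound):
    """Map each tail to the partial scaffolds that end with it."""
    classes = {}
    for scaffold in scaffolds:
        key = tail(scaffold, lengths, upper_bound)
        classes.setdefault(key, []).append(scaffold)
    return classes


lengths = {"A": 1000, "B": 2000, "C": 1500, "D": 3000, "E": 4000}  # hypothetical
s1 = ("+A", "-B", "+C", "+D", "+E")
s2 = ("-B", "+A", "+C", "+D", "+E")
print(group_by_tail([s1, s2], lengths, upper_bound=5000))
```

A dynamic program only needs to keep one representative per class, which is what makes the tail sufficient.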

3. Full Algorithm

Consider discordant clusters
- Chimeric Reads
- Sequencing Errors
- Mapping Errors

Equivalence Class: Number of Discordant Edges (p)

[Figure labels: ACCAAAATTT / ACCAAGAATTT; CTAGAA / CAAGAA; ?]
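Counting the discordant edges of a candidate scaffold (the parameter p) might look like the sketch below. The edge encoding, the zero-gap assumption and all names are simplifications chosen for illustration, and every contig named in an edge is assumed to be present in the scaffold.

```python
FLIP = {"+": "-", "-": "+"}


def count_discordant(scaffold, lengths, edges):
    """Count edges that the scaffold fails to satisfy.

    scaffold: list of (name, sign) in left-to-right order, gaps taken as 0.
    lengths: dict of contig lengths in bp.
    edges: list of (a, sign_a, b, sign_b, max_distance), read as
           "a in orientation sign_a precedes b in orientation sign_b with
           at most max_distance bp between them".
    """
    index = {name: i for i, (name, _) in enumerate(scaffold)}
    sign = dict(scaffold)
    discordant = 0
    for a, sign_a, b, sign_b, max_distance in edges:
        if index[a] > index[b]:
            # Read the same edge from the opposite strand.
            a, sign_a, b, sign_b = b, FLIP[sign_b], a, FLIP[sign_a]
        between = sum(lengths[name] for name, _ in scaffold[index[a] + 1:index[b]])
        ok = sign[a] == sign_a and sign[b] == sign_b and between <= max_distance
        if not ok:
            discordant += 1
    return discordant


scaffold = [("A", "+"), ("B", "-"), ("C", "+")]
lengths = {"A": 1000, "B": 3000, "C": 2500}  # hypothetical lengths
edges = [
    ("A", "+", "B", "-", 500),   # concordant
    ("A", "+", "C", "+", 2000),  # discordant: B (3000 bp) lies in between
    ("C", "-", "B", "+", 500),   # concordant when read from the other strand
]
print(count_discordant(scaffold, lengths, edges))  # prints 1
```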

4. Graph Contraction

[Figure: graph contraction example; label: 20k]

5. Gap Estimation

Utility
- Genome finishing (Genome Size Estimation)
- Scaffold Correctness

Calculate Gap Sizes
- Maximum Likelihood
- Quadratic Function
- Solved through quadratic programming [Goldfarb et al, 1983]
- Polynomial Time

[Figure: scaffold with gaps g1, g2, g3 and a paired-end library with parameters μ, σ]

* Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming 27 (1983)
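If each edge's distance is modelled as an independent Gaussian with mean μ and standard deviation σ, maximising the likelihood amounts to minimising a weighted sum of squares, which is a quadratic function of the gap sizes. The sketch below solves that unconstrained problem via the normal equations. It is a stand-in for illustration, not the Goldfarb-Idnani dual method cited above, and the edge format and all names are assumptions.

```python
def estimate_gaps(lengths, edges):
    """Weighted least-squares gap sizes for a linear scaffold.

    lengths: contig lengths in scaffold order; gap i lies between contig i
             and contig i + 1.
    edges: list of (i, j, mu, sigma) with i < j, where mu is the distance
           between the end of contig i and the start of contig j implied
           by the library, and sigma its standard deviation.
    """
    n = len(lengths) - 1
    ata = [[0.0] * n for _ in range(n)]
    atb = [0.0] * n
    for i, j, mu, sigma in edges:
        weight = 1.0 / sigma ** 2
        target = mu - sum(lengths[i + 1:j])  # subtract the known contig lengths
        for r in range(i, j):
            atb[r] += weight * target
            for c in range(i, j):
                ata[r][c] += weight
    # Gaussian elimination with partial pivoting on the normal equations.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        if abs(ata[pivot][col]) < 1e-12:
            raise ValueError("the edges do not determine every gap")
        ata[col], ata[pivot] = ata[pivot], ata[col]
        atb[col], atb[pivot] = atb[pivot], atb[col]
        for r in range(col + 1, n):
            factor = ata[r][col] / ata[col][col]
            for c in range(col, n):
                ata[r][c] -= factor * ata[col][c]
            atb[r] -= factor * atb[col]
    # Back substitution.
    gaps = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(ata[r][c] * gaps[c] for c in range(r + 1, n))
        gaps[r] = (atb[r] - rest) / ata[r][r]
    return gaps


lengths = [1000, 3000, 2500, 2000]  # hypothetical contigs, so three gaps g1, g2, g3
edges = [(0, 1, 100, 10), (1, 2, 200, 10), (2, 3, 300, 10), (0, 2, 3300, 50)]
print(estimate_gaps(lengths, edges))  # close to [100, 200, 300]
```

The linear system has one unknown per gap, so the cost is polynomial in the scaffold length. A quadratic-programming solver is needed once constraints on the gap sizes are added.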

Outline

• Overview
• Methods
  - 1. Pre-Processing
  - 2. A Special Case
  - 3. Full Algorithm
  - 4. Graph Contraction
  - 5. Gap Estimation
• Results
• Ongoing Work

Runtime Comparison

         ◆ E. coli   ★ B. pseudomallei   ◆ S. cerevisiae   ◆ D. melanogaster
Bambus   50s         16m                 2m                3m
SOPRA    49m         -                   2h                5h
Opera    4s          7m                  11s               30s

• Coverage of 300bp insert library: >20X
• Coverage of 10kbp insert library: 2X
• Contigs assembled using Velvet

◆ Simulated data set using MetaSim
★ In-house data

Scaffold Contiguity

[Figure: two bar charts, N50 and Max Length, comparing Velvet, Bambus, SOPRA and Opera on E. coli, B. pseudomallei, S. cerevisiae and D. melanogaster]

Scaffold Correctness

[Figure: bar chart of # of Breakpoints for Velvet, Bambus, SOPRA and Opera on E. coli, S. cerevisiae and D. melanogaster]

Scaffold Correctness

[Figure: bar chart of # of Discordant Edges on E. coli, S. cerevisiae and D. melanogaster; legend: Velvet, Opera]

         E. coli   S. cerevisiae   D. melanogaster
Opera    1         3               4
Bambus   19        55              423

Ongoing Work

A Rodent Genome

         Genome Size   N50
Opera    ~2Gbp         765.5Kbp
SSpace                 281.7Kbp

A Tree Genome

         Genome Size   N50        Max Length
Opera    ~300Mbp       209.9Kbp   921.8Kbp

Ongoing Work
- Repeats
- Lower bounds and better scaffold
- Multiple Libraries
- Other applications
  - Metagenomics
  - Cancer Genomics

Link: https://sourceforge.net/projects/operasf/

Acknowledgement

Questions?

Wing-Kin Sung Niranjan Nagarajan

Pramila N. Ariyaratne

Funding:

A*STAR of Singapore

Ministry of Education, Singapore

NUS Graduate School for Integrative Sciences and Engineering (NGS)