JAMES LINDSAY* , HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU* Scaffolding Large Genomes Using Integer Linear Programming University of Connecticut* Georgia State University

JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU*

Scaffolding Large Genomes Using Integer Linear Programming

University of Connecticut* Georgia State University

De-novo Assembly Paradigm

[Figure: The Genome → (Sequencing) → The Reads → (Assembly) → The Contigs → (Scaffolding) → The Scaffolds]

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

[Figure: gene XYZ with its 5’ UTR and 3’ UTR, shown once with a scaffold and once with no scaffold]

Sanger Sequencing

[Figure: gene XYZ with its 3’ UTR and 5’ UTR]

Biologist: There are holes in my genes!

Read Pairs

Paired Read Construction

[Figure: read pair R1 and R2, labeled 2kb, same strand and orientation]

Informative Reads
• Align each read against the contigs
• Only accept uniquely mapped reads
  • Use the non-unique reads later
• Both reads in a pair must map to different contigs
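The filtering rules for informative reads can be sketched as below. This is a minimal illustration, not the authors' pipeline: it assumes the aligner output has already been reduced to a list of contig hits per read, and the function name `informative_pairs` and the input layout are hypothetical.

```python
def informative_pairs(pairs, hits):
    """Keep read pairs whose two reads map uniquely and to different contigs.

    pairs: iterable of (r1_id, r2_id) tuples (hypothetical layout).
    hits:  dict mapping a read id to the list of contigs it aligned to.
    Returns (informative, deferred); deferred holds pairs with a
    non-uniquely mapped read, set aside so they can be used later.
    """
    informative, deferred = [], []
    for r1, r2 in pairs:
        h1, h2 = hits.get(r1, []), hits.get(r2, [])
        if not h1 or not h2:
            continue  # at least one read did not align at all
        if len(h1) > 1 or len(h2) > 1:
            deferred.append((r1, r2))  # non-unique mapping: revisit later
        elif h1[0] != h2[0]:
            # both reads unique and on different contigs: a usable link
            informative.append((r1, r2, h1[0], h2[0]))
        # pairs that fall inside a single contig carry no linkage information
    return informative, deferred
```

Keeping the deferred pairs in a separate list mirrors the slide's note that non-unique reads are used later rather than discarded.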

Linkage Information

• Two contigs are adjacent if:
  • A read pair spans the contigs
• State (A, B, C, D)
  • Depends on orientation of the read
  • Order of contigs is arbitrary
• Each read pair can be “consistent” with one of the four states

[Figure: Possible States. Contig i and contig j drawn 5’ to 3’, with read pair R1, R2 shown in each of the states A, B, C, D]
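A small sketch of how read pairs might be tallied per state. The slides do not spell out which read orientations correspond to A, B, C and D, so the mapping in `STATE_BY_ORIENTATION` is an assumption chosen only so that A and D are the same-strand states and B and C the opposite-strand states; the swap rule for reordered contigs is likewise illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical convention: the source does not define the four states.
STATE_BY_ORIENTATION = {
    ("+", "+"): "A",
    ("+", "-"): "B",
    ("-", "+"): "C",
    ("-", "-"): "D",
}

def count_states(links):
    """Count, per contig pair, how many read pairs support each state.

    links: iterable of (contig_i, contig_j, strand_r1, strand_r2).
    Returns {(i, j): Counter({"A": n, ...})} with i < j.
    """
    counts = defaultdict(Counter)
    for i, j, s1, s2 in links:
        if i > j:
            # order of contigs is arbitrary: store every pair once as (min, max)
            i, j, s1, s2 = j, i, s2, s1
        counts[(i, j)][STATE_BY_ORIENTATION[(s1, s2)]] += 1
    return counts
```

Counts like these could serve as un-weighted state scores for a contig pair.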

The Scaffolding Problem

Given
• Contigs
• Paired reads

Find
• Orientation
• Ordering
• Relative Distance

Goal
• Recreate true scaffolds

Possible Objectives
• Un-weighted
  • Max number of consistent read pairs
• Weighted
  • Each state is weighted:
    • Overlap with repeat
    • Deviation of expected distance
    • …

$W_{ij}^A,\ W_{ij}^B,\ W_{ij}^C,\ W_{ij}^D$

Graph Representation

Using the input we can define a scaffolding graph: $G = (V, E)$
• $V$, set of all contigs
• $E$, set of …
• This is an undirected multi-graph
• Assume it is connected
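A minimal sketch of such a graph in plain Python. Because the slide's definition of E is cut off, this assumes each edge is one read pair linking two contigs, which is what makes the graph a multi-graph; the function names and input layout are hypothetical.

```python
from collections import defaultdict

def build_scaffolding_graph(links):
    """Build an undirected multi-graph from contig links.

    links: iterable of (contig_i, contig_j), one entry per linking read pair
           (assumed edge definition).
    Returns {contig: {neighbor: number of parallel edges}}.
    """
    adj = defaultdict(lambda: defaultdict(int))
    for i, j in links:
        adj[i][j] += 1
        adj[j][i] += 1
    return adj

def connected_components(adj):
    """Split the graph into connected components with an iterative DFS."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components
```

Splitting into components first is one way to satisfy the "assume it is connected" condition: each component can be treated as its own connected instance.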

Integer Linear Program Formulation

Variables
• Contig Pair State: $A_{ij},\ B_{ij},\ C_{ij},\ D_{ij}$
• Contig Orientation: $S_i \in \{0,1\}$
• Pairwise Contig Consistency: $S_{ij} \in \{0,1\}$

Objective: Maximize weight of consistent pairs

$$\max \sum_{(i,j) \in E} \left( W_{ij}^A A_{ij} + W_{ij}^B B_{ij} + W_{ij}^C C_{ij} + W_{ij}^D D_{ij} \right)$$

Constraints

Pairwise Orientation

$$S_{ij} \le S_i + S_j$$
$$S_{ij} \le 2 - S_i - S_j$$
$$S_{ij} \ge S_j - S_i$$
$$S_{ij} \ge S_i - S_j$$

Mutual Exclusivity

$$A_{ij} + D_{ij} \le 1 - S_{ij}$$
$$B_{ij} + C_{ij} \le S_{ij}$$

Forbid 2 and 3 Cycles Explicitly
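The pairwise-orientation inequalities force $S_{ij}$ to equal $S_i \oplus S_j$, and the mutual-exclusivity constraints then allow at most one of A, D (when $S_{ij}=0$) or one of B, C (when $S_{ij}=1$). The sketch below checks the first claim by enumeration and then solves a toy instance by brute force. It is not an ILP solver and it ignores the cycle constraints and contig ordering; all function names and the weight layout are hypothetical.

```python
from itertools import product

def feasible_sij(s_i, s_j):
    """Values of S_ij in {0, 1} allowed by the four orientation inequalities."""
    return [s_ij for s_ij in (0, 1)
            if s_ij <= s_i + s_j
            and s_ij <= 2 - s_i - s_j
            and s_ij >= s_j - s_i
            and s_ij >= s_i - s_j]

def best_orientation(n_contigs, weights):
    """Brute-force the objective over all contig orientations.

    weights: {(i, j): {"A": w, "B": w, "C": w, "D": w}} with contigs 0..n-1.
    Returns (best objective value, orientation tuple S).
    """
    best_value, best_s = None, None
    for s in product((0, 1), repeat=n_contigs):
        value = 0
        for (i, j), w in weights.items():
            s_ij = feasible_sij(s[i], s[j])[0]  # exactly one value is feasible
            allowed = ("B", "C") if s_ij else ("A", "D")
            # at most one state per pair may be set, so take the best one
            value += max(0, max(w[k] for k in allowed))
        if best_value is None or value > best_value:
            best_value, best_s = value, s
    return best_value, best_s
```

Enumeration is exponential in the number of contigs, so this only serves to make the constraint semantics concrete on tiny inputs.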

Graph Decomposition: Articulation Points

[Figure: scaffolding graph split at an articulation point, with a part labeled “solve”]
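Articulation points can be found with a standard depth-first search that tracks discovery times and low-links. The sketch below is that textbook method in plain Python, not necessarily the implementation used in this work; it is recursive, so a very large graph would need an iterative version or a higher recursion limit.

```python
def articulation_points(adj):
    """Return the set of articulation points of an undirected graph.

    adj: {node: iterable of neighbors}; parallel edges can be collapsed,
         since they do not affect vertex connectivity.
    """
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for neighbor in adj[node]:
            if neighbor == parent:
                continue
            if neighbor in disc:
                # back edge: the subtree can reach an earlier vertex
                low[node] = min(low[node], disc[neighbor])
            else:
                children += 1
                dfs(neighbor, node)
                low[node] = min(low[node], low[neighbor])
                # the child's subtree cannot climb above node
                if parent is not None and low[neighbor] >= disc[node]:
                    points.add(node)
        if parent is None and children > 1:
            points.add(node)  # a DFS root is a cut vertex iff it has 2+ children

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return points
```

Removing an articulation point disconnects the graph, so the pieces on either side share only that one contig.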

Graph Decomposition: 2-cuts

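[Figure: a component separated by a 2-cut, with the two cut contigs labeled by the sign combinations +/+, +/-, -/+, -/-]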

Non-Serial Dynamic Programming

• SPQR-tree to schedule decomposition

• Traverse tree using DFS

• NSDP utilizes solutions of previous stage in current stage
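One way to picture reusing a previous stage's solution: for a component attached through a 2-cut, tabulate its best objective for each of the four orientation combinations of the cut contigs, then let the next stage look the value up. The sketch below is an assumption-laden toy (brute force instead of an ILP, no SPQR-tree, no ordering or cycle constraints), and every name in it is hypothetical.

```python
from itertools import product

def edge_value(w, s_i, s_j):
    """Best state weight for one contig pair under fixed orientations."""
    allowed = ("B", "C") if s_i != s_j else ("A", "D")
    return max(0, max(w[k] for k in allowed))

def component_table(inner, cut, weights):
    """Previous stage: best value of a component per orientation of its 2-cut.

    inner: contigs that belong only to this component.
    cut: (u, v), the two contigs shared with the rest of the graph.
    weights: {(i, j): {"A": w, "B": w, "C": w, "D": w}} for component edges.
    """
    table = {}
    for s_u, s_v in product((0, 1), repeat=2):
        best = 0
        for bits in product((0, 1), repeat=len(inner)):
            s = dict(zip(inner, bits))
            s[cut[0]], s[cut[1]] = s_u, s_v
            best = max(best, sum(edge_value(w, s[i], s[j])
                                 for (i, j), w in weights.items()))
        table[(s_u, s_v)] = best
    return table

def solve_parent(outer, cut, outer_weights, table):
    """Current stage: solve the rest of the graph, reusing the stored table."""
    nodes = list(outer) + list(cut)
    best = 0
    for bits in product((0, 1), repeat=len(nodes)):
        s = dict(zip(nodes, bits))
        value = sum(edge_value(w, s[i], s[j])
                    for (i, j), w in outer_weights.items())
        value += table[(s[cut[0]], s[cut[1]])]  # lookup replaces re-solving
        best = max(best, value)
    return best
```

The parent stage never enumerates the inner contigs, which is the point of the decomposition: each stage only sees its own variables plus a four-entry table per 2-cut.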

Largest Connected Component

Largest Biconnected Component

Largest Triconnected Component

Post Processing ILP Solution

ILP Solution
• May have cycles
• Not a total ordering

Bipartite matching
• For each connected component

[Figure: ILP solution graph on contigs A to F, next to a bipartite graph with an “outgoing” copy and an “incoming” copy of each contig A to F]

Objectives:
• Max weight
• Max cardinality
• Max cardinality / Max weight
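The max-cardinality objective can be sketched with the classic augmenting-path method for bipartite matching. Each contig gets an outgoing copy and an incoming copy, so a matching gives every contig at most one successor and at most one predecessor. This is a generic textbook routine under an assumed input layout, not the authors' code, and a matching by itself may still leave cycles that need to be broken afterwards.

```python
def max_cardinality_matching(successors):
    """Match outgoing contig copies to incoming contig copies.

    successors: {contig: [candidate next contigs]} taken from the directed
                ILP solution of one connected component (assumed layout).
    Returns {contig: chosen next contig}.
    """
    matched_to = {}  # incoming copy -> outgoing copy currently matched to it

    def try_assign(u, seen):
        for v in successors.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            # take v if it is free, or if its current partner can move elsewhere
            if v not in matched_to or try_assign(matched_to[v], seen):
                matched_to[v] = u
                return True
        return False

    for u in successors:
        try_assign(u, set())
    return {u: v for v, u in matched_to.items()}
```

A max-weight variant would need a different algorithm (for example the Hungarian method); the routine above only maximizes the number of kept adjacencies.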

Testing Framework

Venter Genome

| Read Type | Total Reads | Total Bases | Avg length | Coverage |
|---|---|---|---|---|
| Sanger | 31,861,976 | 2.79E+10 | 875 | 9.930637 |
| SOLiD pairs | 4.85E+08 | 2.42E+10 | 50 | 8.623028 |

4x Assembly

| # Reads | # Bases in reads | # Contigs | # Bases in contigs | N50 |
|---|---|---|---|---|
| 112,00,000 | 1.1E+10 | 422,837 | 2.26E+09 | 7704 |

Testing Metrics

Computer Scientists
• Finding Scaffold = Binary Classification Test
• n contigs, try to predict n-1 adjacencies
• TP, FP, TN, FN, Sensitivity, PPV

Biologists (main focus)
• N50 (a length-weighted median scaffold size, ignoring gaps)
• TP50
  • Break scaffold at incorrect edges, then find N50
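These metrics are simple to compute once predicted and true adjacencies are available. The sketch below assumes adjacencies are stored as unordered contig pairs and scaffolds as ordered lists of (contig, length); those layouts and the function names are illustrative choices, and gaps are ignored as on the slide.

```python
def sensitivity_ppv(predicted, truth):
    """Sensitivity = TP / |truth|, PPV = TP / |predicted| over adjacency sets."""
    tp = len(predicted & truth)
    sensitivity = tp / len(truth) if truth else 0.0
    ppv = tp / len(predicted) if predicted else 0.0
    return sensitivity, ppv

def n50(lengths):
    """Largest length L such that pieces of length >= L hold half the bases."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def tp50(scaffolds, truth):
    """Break each scaffold at incorrect adjacencies, then take N50.

    scaffolds: list of scaffolds, each an ordered list of (contig, length).
    truth: set of frozenset({a, b}) for every true adjacency.
    """
    pieces = []
    for scaffold in scaffolds:
        current, previous = 0, None
        for contig, length in scaffold:
            if previous is not None and frozenset((previous, contig)) not in truth:
                pieces.append(current)  # incorrect edge: close the piece
                current = 0
            current += length
            previous = contig
        if scaffold:
            pieces.append(current)
    return n50(pieces)
```

TP50 can never exceed N50 for the same scaffolds, since breaking at incorrect edges only shortens pieces.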

Results

| test case | method | bundle size | sensitivity | ppv | N50 | TP50 |
|---|---|---|---|---|---|---|
| 10% | opera | 2 | 81.13% | 99.26% | 27,567 | 27,327 |
| 10% | mip | 2 | 59.01% | 98.94% | 19,988 | 19,755 |
| 10% | ilp | 1 | 79.86% | 98.58% | 26,814 | 26,459 |
| 25% | opera | 2 | 80.44% | 98.27% | 27,296 | 26,849 |
| 25% | mip | 2 | 58.95% | 97.56% | 19,842 | 19,518 |
| 25% | ilp | 1 | 79.30% | 96.93% | 26,684 | 26,079 |
| 100% | opera | 3 | pending | … | … | … |
| 100% | mip | 3 | failed | n/a | n/a | n/a |
| 100% | ilp | 1 | 68.25% | 89.90% | 20,538 | 19,006 |

Conclusions

Success
• ILP solves scaffolding problem!
• NSDP works.

Improvements
• Finalize large test cases (then publish?!)
• Practical considerations (read style, multi-libraries, merge contigs)

Future Work
• Where else can I apply NSDP?
• Scaffold before assembly??
• Structural Variation??

Questions?