Scaffolding Large Genomes Using Integer Linear Programming
James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*
University of Connecticut*, Georgia State University
ACM-BCB 2012
De-novo Assembly Paradigm
the genome → (shotgun sequencing) → short reads → (de novo assembly) → short contigs → (scaffolding) → the scaffolds
Why Scaffolding?
• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!
[Figure: gene XYZ with its 5’ UTR and 3’ UTR, shown within a single scaffold and split across contigs with no scaffold; a second panel is labeled "Sanger Sequencing"]
Biologist: There are holes in my genes!
Fragmented Genomes
Massive Sequencing Projects
• I5k: 5,000 insect and arthropod species
• G10k: 10,000 vertebrate species
Effects of Read Length
• Dog Genome: 7.5x Sanger, N50: 180 Kb
• Chicken Genome: 6x Illumina, N50: 12 Kb
• Human Genome: 100x Illumina, N50: 24 Kb
The Scaffolding Problem
GIVEN
• CONTIGS, PAIRED READS
FIND
• ORIENTATION, ORDERING, RELATIVE DISTANCE
GOAL
• RECREATE TRUE SCAFFOLDS
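Paired Read Construction
[Figure: paired reads R1 and R2 taken from the two ends of a fragment; size labels: 2kb, 2kb, 100b, 100b, 10kb]
Paired Read Styles
• Mate Pair
• Paired End
[Figure: R1 and R2 drawn once with the same strand and orientation, and once with different strand and orientation]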
Linkage Information
Two contigs are adjacent if:
• A read pair spans the contigs
State (A, B, C, D)
• Depends on orientation of the reads
• Order of contigs is arbitrary
Possible States (mate pair)
• Each read pair can be “consistent” with one of the four states
[Figure: contig i and contig j drawn 5’ to 3’, with read pair R1, R2 placed in each of the states A, B, C, D]
Scaffolding Graph
Nodes
• Nodes are contigs
Edges
• Adjacent contigs have 4 edges (one for each state)
• Weighted by overlap with repetitive region
[Figure: contig i and contig j joined by the State A edge]
$$W_{ij}^{A} = \sum_{\text{read pairs}} \left( 1 - \frac{\text{bp in repeat region}}{\text{bp in read}} \right)$$
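A minimal sketch of the edge weight, assuming each read pair that supports one state has already been summarized as two numbers: the bases that fall in a repeat region and the bases in the read. The function name and the tuple layout are illustrative assumptions, not part of the talk.

```python
from typing import Iterable, Tuple


def edge_weight(read_pairs: Iterable[Tuple[int, int]]) -> float:
    """Sum of (1 - repeat_bp / read_bp) over the read pairs supporting one state.

    Each item is (repeat_bp, read_bp); this layout is a hypothetical summary
    of a read pair, chosen only for the example.
    """
    total = 0.0
    for repeat_bp, read_bp in read_pairs:
        if read_bp <= 0:
            raise ValueError("read_bp must be positive")
        total += 1.0 - repeat_bp / read_bp
    return total


# Three read pairs: no repeat overlap, half in a repeat, entirely in a repeat.
w_a = edge_weight([(0, 100), (50, 100), (100, 100)])
print(w_a)
```

A read pair that lies entirely in a repeat contributes nothing, so repetitive links are down-weighted rather than discarded.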
Integer Linear Program Formulation
Variables
• Contig orientation: $S_i \in \{0,1\}$
• Adjacent contig consistency: $S_{ij} \in \{0,1\}$
• Contig pair state: $A_{ij}, B_{ij}, C_{ij}, D_{ij} \in \{0,1\}$
Objective
• Maximize weight of consistent pairs
$$z = \max \sum_{(i,j) \in E} \left( W_{ij}^{A} A_{ij} + W_{ij}^{B} B_{ij} + W_{ij}^{C} C_{ij} + W_{ij}^{D} D_{ij} \right)$$
Constraints
Pairwise Orientation
$$S_{ij} \le S_i + S_j$$
$$S_{ij} \le 2 - S_i - S_j$$
$$S_{ij} \ge S_j - S_i$$
$$S_{ij} \ge S_i - S_j$$
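Taken together, the four pairwise orientation inequalities leave exactly one feasible value of S_ij for each choice of S_i and S_j, namely S_i XOR S_j. A short enumeration makes this visible; the function name is illustrative.

```python
from itertools import product


def feasible_sij(s_i: int, s_j: int) -> list:
    """Return every S_ij in {0, 1} that satisfies the four pairwise constraints."""
    return [
        s_ij
        for s_ij in (0, 1)
        if s_ij <= s_i + s_j
        and s_ij <= 2 - s_i - s_j
        and s_ij >= s_j - s_i
        and s_ij >= s_i - s_j
    ]


# Print the feasible S_ij for all four orientation combinations.
for s_i, s_j in product((0, 1), repeat=2):
    print(s_i, s_j, feasible_sij(s_i, s_j))
```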
State Variables
$$2A_{ij} \le (1 - S_i) + (1 - S_j) \qquad 2B_{ij} \le (1 - S_i) + S_j$$
$$2C_{ij} \le S_i + (1 - S_j) \qquad 2D_{ij} \le S_i + S_j$$
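Each state-variable constraint has a right-hand side that reaches 2 for exactly one orientation pair, so each pair (S_i, S_j) permits a single state. The sketch below simply evaluates the four right-hand sides; the function name is an assumption made for the example.

```python
def allowed_states(s_i: int, s_j: int) -> list:
    """States whose variable may be set to 1 under the state-variable constraints."""
    rhs = {
        "A": (1 - s_i) + (1 - s_j),
        "B": (1 - s_i) + s_j,
        "C": s_i + (1 - s_j),
        "D": s_i + s_j,
    }
    # Setting a state variable to 1 makes the left-hand side 2,
    # so the right-hand side must be at least 2.
    return [state for state, bound in rhs.items() if 2 <= bound]


for s_i in (0, 1):
    for s_j in (0, 1):
        print((s_i, s_j), allowed_states(s_i, s_j))
```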
Mutual Exclusivity
$$A_{ij} + D_{ij} \le 1 - S_{ij} \qquad B_{ij} + C_{ij} \le S_{ij}$$
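To see the pairwise orientation, state variable and mutual exclusivity constraints work together, the sketch below brute-forces a toy instance with three contigs and two edges. It is only an illustration: the weights are made up, the cycle constraints are deliberately left out, and a real instance would go to an ILP solver rather than exhaustive enumeration.

```python
from itertools import product


def solve_toy_ilp(n_contigs, weights):
    """Enumerate every binary assignment and keep the best feasible one.

    weights maps an edge (i, j) to hypothetical weights for states "A".."D".
    """
    edges = sorted(weights)
    best_value, best_solution = float("-inf"), None
    for s in product((0, 1), repeat=n_contigs):
        # Five binary variables per edge: S_ij, A_ij, B_ij, C_ij, D_ij.
        for flat in product((0, 1), repeat=5 * len(edges)):
            value, ok = 0.0, True
            for k, (i, j) in enumerate(edges):
                s_ij, a, b, c, d = flat[5 * k: 5 * k + 5]
                ok = (
                    # pairwise orientation
                    s_ij <= s[i] + s[j] and s_ij <= 2 - s[i] - s[j]
                    and s_ij >= s[j] - s[i] and s_ij >= s[i] - s[j]
                    # state variables
                    and 2 * a <= (1 - s[i]) + (1 - s[j])
                    and 2 * b <= (1 - s[i]) + s[j]
                    and 2 * c <= s[i] + (1 - s[j])
                    and 2 * d <= s[i] + s[j]
                    # mutual exclusivity
                    and a + d <= 1 - s_ij and b + c <= s_ij
                )
                if not ok:
                    break
                w = weights[(i, j)]
                value += w["A"] * a + w["B"] * b + w["C"] * c + w["D"] * d
            if ok and value > best_value:
                best_value, best_solution = value, (s, flat)
    return best_value, best_solution


toy = {
    (0, 1): {"A": 5.0, "B": 1.0, "C": 0.0, "D": 0.0},
    (1, 2): {"A": 0.0, "B": 4.0, "C": 0.0, "D": 2.0},
}
value, (orientation, _) = solve_toy_ilp(3, toy)
print(value, orientation)
```

Here the best assignment keeps contigs 0 and 1 unflipped and flips contig 2, which selects state A on the first edge and state B on the second.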
Forbid 2 Cycles
$$B_{ij} + C_{ij} \le S_{ij} \qquad A_{ij} + D_{ij} \le 1 - S_{ij}$$
Forbid 3 Cycles
[Figure: 3-cycle constraints]
*larger cycles are broken at the end
Graph Decomposition: Articulation Points
[Figure: the largest connected component is split at an articulation point; the largest biconnected component is what remains to solve]
MIP, Salmela 2011
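Articulation points can be found with the classic depth-first low-link method. The sketch below is a generic textbook version on an undirected contig graph, not necessarily the implementation used in the talk; it is recursive, so very large graphs would need an iterative variant.

```python
from collections import defaultdict


def articulation_points(edges):
    """Return the set of nodes whose removal disconnects the undirected graph."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for nxt in graph[node]:
            if nxt == parent:
                continue
            if nxt in disc:
                # Back edge: node can reach an earlier-discovered vertex.
                low[node] = min(low[node], disc[nxt])
            else:
                children += 1
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                # Non-root node: the subtree under nxt cannot bypass node.
                if parent is not None and low[nxt] >= disc[node]:
                    points.add(node)
        # Root node: articulation point only if it has several DFS children.
        if parent is None and children > 1:
            points.add(node)

    for node in list(graph):
        if node not in disc:
            dfs(node, None)
    return points


# Two triangles of contigs that share only contig "c".
links = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("c", "e")]
print(articulation_points(links))
```

Splitting at contig "c" leaves two biconnected pieces that can be handled separately.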
Non-Serial Dynamic Programming
A technique that exploits the sparsity of the scaffolding graph by computing the solution in stages, with each stage incorporating the results of the previous stages.
Inspired by (Neumaier, 06)
[Figure: a component attached to the rest of the graph by a 2-cut; the four orientation combinations of the cut contigs (+ +, + -, - +, - -) are labeled with the values $z_A$, $z_B$, $z_C$, $z_D$]
Objective Modification: $z_A$, $z_B$, $z_C$, $z_D$
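One way to read the 2-cut step is sketched below under loud assumptions: the component behind the cut is solved once per orientation combination of the two cut contigs, and the four optima are stored as z_A to z_D on a virtual edge between them. The objective is simplified so that each edge contributes the weight of the single state its orientations permit, ordering and cycle constraints are ignored, and all weights and names are made up.

```python
from itertools import product

# Assumed mapping from orientations (S_i, S_j) to the state they permit.
STATE = {(0, 0): "A", (0, 1): "B", (1, 0): "C", (1, 1): "D"}


def best_value(nodes, weights, fixed=None):
    """Brute-force the best total weight; `fixed` pins some orientations."""
    fixed = fixed or {}
    free = [n for n in nodes if n not in fixed]
    best = float("-inf")
    for bits in product((0, 1), repeat=len(free)):
        s = {**fixed, **dict(zip(free, bits))}
        total = sum(w[STATE[(s[i], s[j])]] for (i, j), w in weights.items())
        best = max(best, total)
    return best


def eliminate(nodes, weights, cut):
    """Solve the component once per orientation of the cut pair (z_A..z_D)."""
    u, v = cut
    return {
        STATE[(su, sv)]: best_value(nodes, weights, {u: su, v: sv})
        for su, sv in product((0, 1), repeat=2)
    }


def state_weights(a, b, c, d):
    return {"A": a, "B": b, "C": c, "D": d}


# Contig x hangs off the rest of the graph through the 2-cut {u, v}.
rest = {("r", "u"): state_weights(3, 0, 0, 1), ("r", "v"): state_weights(0, 2, 0, 0)}
component = {("u", "x"): state_weights(1, 4, 0, 0), ("x", "v"): state_weights(0, 0, 5, 1)}

z = eliminate(["u", "x", "v"], component, ("u", "v"))
reduced = best_value(["r", "u", "v"], {**rest, ("u", "v"): z})
full = best_value(["r", "u", "v", "x"], {**rest, **component})
print(z, reduced, full)
```

In this toy case the reduced graph with the virtual edge has the same optimum as the full graph, which is the property a staged solution relies on.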
SPQR-tree Based Implementation
• The SPQR-tree efficiently finds 2-cuts (Tarjan, 73)
• A DFS of the SPQR-tree is used to schedule the elimination order for NSDP
Post Processing ILP Solution
ILP Solution
• May have cycles
• Not a total ordering
Bipartite matching (for each connected component)
[Figure: the ILP solution graph on contigs A to F, and a bipartite graph with an outgoing copy and an incoming copy of each contig]
Objectives:
• Max weight
• Max cardinality
• Max cardinality / Max weight
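The matching step can be pictured as follows, assuming each directed link u → v in the ILP solution becomes an edge between the outgoing copy of u and the incoming copy of v. A matching then gives every contig at most one successor and at most one predecessor. The sketch uses exhaustive backtracking with the max-weight objective on a made-up instance; a real implementation would use a proper matching algorithm, and any cycles left in the result would still need to be broken.

```python
def max_weight_matching(edges):
    """edges: list of (src, dst, weight) directed links with positive weights.

    Returns (total_weight, chosen_links) where no contig is used twice as a
    source or twice as a destination.
    """
    best = (0.0, [])

    def extend(start, used_out, used_in, weight, chosen):
        nonlocal best
        if weight > best[0]:
            best = (weight, list(chosen))
        for idx in range(start, len(edges)):
            src, dst, w = edges[idx]
            if src in used_out or dst in used_in:
                continue  # outgoing or incoming copy is already matched
            chosen.append((src, dst))
            extend(idx + 1, used_out | {src}, used_in | {dst}, weight + w, chosen)
            chosen.pop()

    extend(0, frozenset(), frozenset(), 0.0, [])
    return best


# Hypothetical ILP output in which contigs A and B each have two successors.
solution_links = [
    ("A", "B", 3.0), ("A", "C", 2.0), ("B", "C", 4.0),
    ("C", "D", 1.0), ("B", "D", 2.0),
]
total, path_links = max_weight_matching(solution_links)
print(total, path_links)
```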
GAGE Framework
Genome                     Size (Mb)   # reads
Staphylococcus aureus      2.9         3,494,070
Rhodobacter sphaeroides    4.6         2,050,868
Human Chr14                107         22,669,408
Assembled using:
ABySS, Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet
Scaffolded using:
SILP (our method), Opera, MIP, Bambus2
Testing Metrics
TPN50
• Break scaffold at incorrect edges, then find N50
• N50: the length such that sequences of this length or longer contain 50% of the total bases
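A small sketch of TPN50, assuming each scaffold is given as a list of contig lengths plus one flag per adjacency saying whether that join is correct. Gap sizes are ignored and the data layout is an assumption made for the example.

```python
def n50(lengths):
    """Length L such that pieces of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def break_scaffold(contig_lengths, correct):
    """Split one scaffold at every incorrect adjacency; return piece lengths."""
    pieces, current = [], contig_lengths[0]
    for length, ok in zip(contig_lengths[1:], correct):
        if ok:
            current += length
        else:
            pieces.append(current)
            current = length
    pieces.append(current)
    return pieces


def tp_n50(scaffolds):
    """N50 of the pieces left after breaking all scaffolds at wrong joins."""
    pieces = []
    for contig_lengths, correct in scaffolds:
        pieces.extend(break_scaffold(contig_lengths, correct))
    return n50(pieces)


# Two scaffolds; the second join in the first scaffold is wrong.
scaffolds = [([100, 200, 300], [True, False]), ([400], [])]
print(n50([600, 400]), tp_n50(scaffolds))
```

Breaking at the wrong join shrinks the first scaffold from 600 bp to two 300 bp pieces, so the score drops compared with the plain N50.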
Binary Classification
• Given n contigs in a scaffold
• How many of the n-1 adjacencies can you predict?
• PPV, Sensitivity, MCC
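The three scores follow from the usual confusion-matrix counts. The sketch assumes the counts of true and false positive and negative adjacencies are already available; how an adjacency is judged correct is not specified here, and the numbers in the example are made up.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (PPV, sensitivity, MCC) from confusion-matrix counts."""
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, sensitivity, mcc


# Hypothetical counts: 8 correct predicted adjacencies, 2 wrong, 5 missed.
ppv, sensitivity, mcc = classification_metrics(tp=8, fp=2, tn=85, fn=5)
print(round(ppv, 3), round(sensitivity, 3), round(mcc, 3))
```

MCC uses all four counts, so it stays informative when the number of non-adjacent contig pairs dwarfs the number of true adjacencies.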
Results
[Figure: Scaffolding TPN50 (bp) by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]
[Figure: PPV by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]
[Figure: Sensitivity by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]
[Figure: Matthews Correlation Coefficient by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]
Conclusions
Success
• ILP solves scaffolding problem!
• NSDP works
Improvements
• Include SOAPdenovo, Allpaths-LG scaffolds in comparison
• Look at parameter effects
• Practical considerations (read style, multi-libraries, merge ctgs)
Future Work
• Where else can I apply NSDP?
• Scaffold before assembly … promising
• Structural Variation??
Questions?