Sequence Assembly and Protein Docking Algorithms
Vicky Choi
Department of Computer Science
Duke University
2/73
Outline
• Sequence Assembly Algorithm (Joint work with Martin Farach-Colton @Rutgers)
• Local Search Algorithm for Rigid Protein Docking
(Joint work with Pankaj K. Agarwal, Herbert Edelsbrunner,
Johannes Rudolph @Duke)
3/73
Outline: Sequence Assembly
• Biological Background
• Human Genome Project and the Sequence Assembly Problem
• The BARNACLE Assembler
4/73
DNA
A DNA molecule consists of two strands that are bound together in a helical structure.
Each strand is represented by a string over the alphabet {A,C,G,T}, called a DNA sequence.
Example:
AAGCTTCAGTTCCTGACCTTCCAATCGCAA
Each letter of {A,C,G,T} = nucleotide (base); sequence length is measured in base pairs (bp)
Image Credit: US Department of Energy Human Genome Program http://www.ornl.gov/hgmis
5/73
Orientation: 5' → 3'
Complement: A ↔ T, C ↔ G
Example (5') (3')
ACCATGGTGCACCTGACTCCTGAGGAG
TGGTACCACGTGGACTGAGGACTCCTC
(3') (5')
one strand ⇒ another strand
Two Strands: Reverse Complementary
Image Credit: US Department of Energy Human Genome Program http://www.ornl.gov/hgmis
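The reverse-complement relation between the two strands can be sketched in a few lines. This is a minimal illustration using the example sequence from this slide; the function names `complement` and `reverse_complement` are chosen here for illustration and are not from the talk.

```python
# Watson-Crick base pairing: A <-> T, C <-> G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    """Base-by-base complement, read in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq)

def reverse_complement(seq):
    """The opposite strand written in its own 5' -> 3' orientation."""
    return complement(seq)[::-1]

top = "ACCATGGTGCACCTGACTCCTGAGGAG"      # 5' -> 3', from the slide
bottom_3_to_5 = complement(top)          # as drawn under the top strand
bottom_5_to_3 = reverse_complement(top)  # same strand, read 5' -> 3'
print(bottom_3_to_5)
print(bottom_5_to_3)
```

Applying `reverse_complement` twice returns the original strand, which is why knowing one strand determines the other.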
6/73
A genome is the complete set of DNA sequences of an organism.
Human Genome ~ $3 \times 10^9$ bp
Human Chromosomes
Image Credit: Sanger Center http://www.sanger.ac.uk/
7/73
DNA Sequencing
DNA Sequencing is the process for determining the sequence of nucleotides of a region of DNA.
Question: How to sequence a longer stretch of DNA?
Current technology: ~500 bp per read
C G A A T C G T C G A T G C T A A T G
8/73
Shotgun Sequencing
[Figure: shotgun sequencing pipeline: Target DNA → (Cloning) → Copies of Target → (Shotgun, DNA Sequencing) → Sequence Reads → (Assembly) → Consensus → (Directed Read) → Final]
ACGTAAGAGTACCGATTGGCCA
…ACGTAGTCTTAGATGATAGTAGA…
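To make the "reads → assembly → consensus" step concrete, here is a toy greedy overlap-merge sketch. It is not the assembler discussed in this talk and ignores sequencing errors, orientation, and repeats. The helper names and the `min_len` threshold are assumptions made for illustration, and the short sequence on this slide is used only as a toy target.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest suffix/prefix overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop reads fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: several contigs remain
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

target = "ACGTAAGAGTACCGATTGGCCA"
reads = [target[0:10], target[6:16], target[12:22]]  # three overlapping toy reads
print(greedy_assemble(reads))
```

On real genomes the greedy choice is exactly what repeats break: two reads from different copies of a repeat overlap just as well as true neighbors.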
9/73
Shotgun Sequencing History
• 1980s: 5 to 10 Kbp
• 1990: 40 Kbp
• 1995: 1.8 Mbp (H. influenzae)
• 2000: draft Drosophila (120 Mbp)
• 2001: draft Human Genome ($3 \times 10^9$ bp) (attempted by Celera)
10/73
Outline: Sequence Assembly
• Biological Background
• Human Genome Project and the Sequence Assembly Problem
• The BARNACLE Assembler
11/73
Human Genome Project (HGP)
• 1988: “Mapping and Sequencing the Human Genome”
• 1990: HGP started in US
• 2001: A “working draft” version
• 2003: Completed by HGP Consortium standard
12/73
Hierarchical Shotgun Sequencing (BAC-by-BAC)
• Map First, Then Sequence
[Figure: Human Genome → BAC library (100-200 Kb) → Physical Map → Tiling Path of BACs → Shotgun Sequence & Assembly of each tiling-path (TP) BAC → Final Sequence]
A BAC is a segment of DNA from a chromosome. Each BAC is ~100-200 Kb.
13/73
Hierarchical Shotgun Sequencing (BAC-by-BAC)
• Map First??
[Figure: Human Genome → BAC library (100-200 Kb) → Physical Map (? ? ? ?)]
Physical Map is difficult to build! (original expected time: 5 years)
14/73
BAC-by-BAC → BAC-Based
New Idea: Map + Sequencing concurrently
Randomly pick BACs (without waiting for the Physical Map) and shotgun sequence them.
[Figure labels: BAC, Sequence Reads, Fragments, Ordered Fragments]
Phase 1: Draft
Phase 2: Draft
Phase 3: Finished
15/73
BAC-Based
[Figure: Human Genome → BAC library (100-200 Kb) → Finished + Draft BACs → Sequence Assembly Problem → the working draft of the human genome]
16/73
Outline: Sequence Assembly
• Biological Background
• Human Genome Project and the Sequence Assembly Problem
• The BARNACLE Assembler
– Details of Input
– Difficulties
– Basic Idea
17/73
Details of Input
• Sequence Information:
– BACs
• Overlap Information:
– Local Alignments
– NT-pairs
• Orientation Information:
– Plasmid, EST, mRNA
18/73
Input: Sequence Information
Recall: A BAC is a contiguous stretch of DNA from a chromosome. Each comes as a set of fragments.
BAC:
Accession     Phase   Chrm   # frags
AC002092.1    1       17     4
Fragments:
Frag acc.        length
AC002092.1~1     888
AC002092.1~2     45312
AC002092.1~3     38725
AC002092.1~4.1   10245
• Phase 1, 2 = Draft
• Phase 3 = Finished
19/73
Input: Overlap Information
• Preprocessing:
– Local alignments of all fragment pairs
– NT-pairs: generated from GenBank annotation submitted by genome centers
20/73
Example: Input of Dec 2001 freeze
Sequence Information:
Phase   BACs    Fragments   Total Length (Gbp)   Average Number of Fragments
1       15298   246424      2.5                  16.11
2       2154    8161        3.3                  3.79
3       17624   17624       2                    1
Total   35076   272209      4.9                  7.76
Chromosome Assignments: 31543 by STS; 2450 by GenBank; 1083 unknown
Overlap Information: 403,466 fragment pairs, 12,656 NT-pairs
Orientation Information: 321,751 fragment pairs
21/73
True Overlap
Repeat-induced Overlap
True vs Repeat-induced Overlap
22/73
Repeats of the Human Genome
High-copy repeats, e.g. ALU, L1
Low-copy repeats (segmental duplication)
• Large block (>200 Kb)
• Highly similar (>97%)
23/73
Noise
• Chimeric BAC (CB)
• False positives (FP): due to repeats
• False negatives (FN): polymorphism, draft quality
24/73
The Basic Idea
1. “Conservatively” assemble fragments
25/73
Necessary Condition for True Overlaps
Idea: assemble non-conflicting overlaps first
Given: A overlaps with B, and A overlaps with C.
Does B overlap with C?
[Figure: two layouts of fragments A, B, and C, one answering "Yes." and one answering "No."]
26/73
BAC Graph
The Basic Idea
1. “Conservatively” assemble fragments into subcontigs
27/73
Interval Graph
Definition: A graph G is called an interval graph if there is a one-to-one correspondence between its vertices and a set of intervals on the real line such that two vertices are adjacent in G iff their corresponding intervals overlap.
The BAC graph is an interval graph!
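The definition can be illustrated by going in the easy direction: given intervals, build the graph. The sketch below assumes each BAC is known by hypothetical start and end coordinates on a chromosome; the BAC names and coordinates are made up for illustration.

```python
def interval_graph(intervals):
    """Build the interval graph of named intervals.

    intervals: dict mapping a name to a (start, end) pair with start < end.
    Returns an adjacency dict: name -> set of names whose intervals overlap it.
    """
    adj = {name: set() for name in intervals}
    names = list(intervals)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            (s1, e1), (s2, e2) = intervals[u], intervals[v]
            if s1 < e2 and s2 < e1:  # the two open intervals intersect
                adj[u].add(v)
                adj[v].add(u)
    return adj

# Hypothetical BAC placements (coordinates in Kb).
bacs = {"A": (0, 150), "B": (120, 280), "C": (250, 400)}
print(interval_graph(bacs))
```

The assembly problem runs the other way: only the overlap graph is observed, and an interval realization (the layout) has to be recovered from it.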
28/73
Necessary… But Not Sufficient
Long Repeats
Under-represented
29/73
Non-interval Graph
Collapsing Repeats:
Chimeric BAC
30/73
Forbidden Subgraphs
Theorem (Lekkerkerker & Boland 1962): A graph is interval iff it does not contain any of the following as an induced subgraph:
[Figure: the forbidden induced subgraphs]
31/73
Resolving Non-interval Graphs
Definition: A vertex $u \in V$ is I-critical if $G|_{V \setminus \{u\}}$ (the subgraph induced by removing $u$) is interval.
Given a non-interval graph G, identify a forbidden subgraph.
If at least one of the vertices of the forbidden subgraph is I-critical, we say G is fixable.
Based on the structure of the forbidden subgraph, a fixable graph G is resolved by
1. adding an FN edge; or
2. removing FP edges; or
3. removing a vertex.
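The I-critical definition can be checked by brute force on small graphs. The sketch below does not use the forbidden-subgraph characterization from the previous slide; it swaps in a different classical test (a graph is interval iff its maximal cliques can be ordered so that the cliques containing each vertex are consecutive) and tries every ordering, so it is only practical for tiny examples. All function names are illustrative.

```python
from itertools import permutations

def maximal_cliques(adj):
    """Bron-Kerbosch enumeration; adj maps vertex -> set of neighbours."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(adj), set())
    return cliques

def is_interval(adj):
    """Try every linear order of the maximal cliques (exponential, small graphs only)."""
    cliques = maximal_cliques(adj)
    for order in permutations(cliques):
        consecutive = True
        for v in adj:
            idx = [i for i, c in enumerate(order) if v in c]
            if idx and idx[-1] - idx[0] + 1 != len(idx):
                consecutive = False
                break
        if consecutive:
            return True
    return False

def without(adj, u):
    """Subgraph induced by all vertices except u."""
    return {v: nbrs - {u} for v, nbrs in adj.items() if v != u}

def i_critical(adj):
    """Vertices whose removal leaves an interval graph."""
    return {u for u in adj if is_interval(without(adj, u))}

# A 4-cycle is the smallest non-interval graph; deleting any vertex leaves a path.
c4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
print(is_interval(c4), sorted(i_critical(c4)))
```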
32/73
Divide and Conquer Method
For the non-fixable graphs, we employ a divide-and-conquer method by dividing the graph according to some articulation points such that each subcomponent is fixable.
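As a reminder of what an articulation point is, the following sketch finds them by direct definition: a vertex whose removal increases the number of connected components. It only locates candidate split points; how the talk's method chooses among them and checks fixability of the pieces is not shown here. Function names are illustrative.

```python
def count_components(adj, removed=None):
    """Number of connected components, optionally ignoring one vertex."""
    seen = set() if removed is None else {removed}
    count = 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return count

def articulation_points(adj):
    """Vertices whose removal disconnects part of the graph (quadratic-time sketch)."""
    base = count_components(adj)
    return {v for v in adj if count_components(adj, removed=v) > base}

# Two triangles glued at vertex "c".
g = {
    "a": {"b", "c"}, "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"c", "e"}, "e": {"c", "d"},
}
print(articulation_points(g))
```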
33/73
The Basic Idea
1. “Conservatively” assemble fragments into subcontigs
2. Resolve Non-interval Graph and Find an Interval Realization of the BAC Graph
3. Orient and order subcontigs
[Figure: BAC Graph]
34/73
Error Detection
1. “Conservatively” assemble fragments into subcontigs
2. Resolve Non-interval Graph and Find an Interval Realization of the BAC Graph
3. Orient and order subcontigs
Errors detected:
• wrong NT-pairs (annotation from genome centers)
• chromosome misassignments
• chimerics
• fragment misassemblies
This is the only algorithm available that does Error Detection.
35/73
Output: Contigs
36/73
Other Two Assemblers
• GigAssembler by Jim Kent and David Haussler (stopped after the April 2001 freeze)
• NCBI’s assembler: top-down approach
– builds a physical map using sequence overlaps as fingerprint overlaps;
– uses scoring functions to resolve conflicts.
37/73
BARNACLE’s assembly
NCBI’s assembly
38/73
Comparison with NCBI’s Assembly (Dec 2001)
Assembled BAC Length   Barnacle   NCBI
≤ 250K (good BACs)     33921      29952
250K-300K              434        461
300K-500K              549        1328
500K-800K              33         798
800K-1M                0          248
1M-2M                  0          496
2M-3M                  0          129
3M-10M                 0          259
10M-20M                0          67
Total (>250K)          1016       3786
39/73
How was the human genome “finished”?
• Hand-curate tiling path of BACs (by Genome Centers)
• Finish sequencing the tiling path of BACs only
• Assemble by NCBI’s assembler based on the hand-curated tiling paths
40/73
Incorporating segmental duplication database
BARNACLE’s assembly suggested that at least 89 repeat-contained BACs were dropped from the tiling path.
– 69 were added to HGP’s final tiling path
– 20 were declared unnecessary
  • Due to disagreement about the repeat structure of the genome
Collaboration with Evan E. Eichler (Department of Genetics, Case Western Reserve University)
The Sequence and Assembly of Highly Duplicated Regions in the Human Genome.
V. Choi, J. Bailey, G. Schuler, Z. Gu, P. Li, M. Farach-Colton and E. Eichler.
Genome Sequencing & Biology meeting at Cold Spring Harbor Laboratory 2002.
41/73
Conclusions
• Better assembly
– Error detection
– Measured by the assembled BAC length
• Efficient (3 minutes on a Pentium III)
• To do large scale sequencing:
–Handle repeats
–Design in data acquisition that will permit error detection & correction
Reference: V. Choi, M. Farach-Colton. BARNACLE: An Assembly Algorithm for Clone-based Sequences of Whole Genomes. Gene, 320, 165-176, 2003.
42/73
Acknowledgement
Wojciech Makalowski (NCBI/Penn State University)
David Lipman (NCBI)
Greg Schuler (NCBI)
Evan E. Eichler (Case Western Reserve University)
Granger Sutton (Celera)
JinSheng Lai (Waksman Institute, Rutgers University)
• NCBI/NIH pre-doctoral visiting fellowship
• Program in Mathematics and Molecular Biology (PMMB), Burroughs Wellcome Fund Interface Program fellowship
43/73
Outline : Protein Docking
1. Protein-Protein Docking
2. Local Search Algorithm
3. Test Results
44/73
Protein-Protein Docking
Barnase
Barstar
1BRS : Barnase + Barstar
45/73
Protein Re-Docking Problem (Bound Protein Docking)
Given a known protein-protein complex A-B (native configuration), randomly separate the two proteins.
Fix A, and find a rigid motion $\mu$ such that $\mu(B)$ is near-native.
Rigid Body Assumption
46/73
Formulation of Rigid Protein Docking
• A scoring function that can discriminate the correct docking configuration from incorrect ones;
• A search algorithm that finds the best docking configuration as measured by the scoring function.
47/73
Protein
A protein molecule consists of a set of atoms.
Each atom is represented by a ball in $\mathbb{R}^3$.
Notation: $A = \{(a_1, r_1), \ldots, (a_n, r_n)\}$, where $a_i \in \mathbb{R}^3$ is the center of the $i$-th atom and $r_i$ is its (van der Waals) radius.
Atom Type            C       N     O       P      S
Radius in Angstrom   1.548   1.4   1.348   1.88   1.808
48/73
Our Scoring Function
49/73
Exhaustive Search
• Sample the rigid motion space (6-dimensional)
• Evaluate each motion using the scoring function
Rigid Motion = Rotation + Translation
A rotation in $\mathbb{R}^3$ can be specified by a rotation angle $\theta$ about a rotation axis $u$, and is represented by a unit quaternion.
Sampling Rotation Space ⇒ sampling $S^3$ (the unit sphere in $\mathbb{R}^4$)
A translation in $\mathbb{R}^3$ is a 3-dimensional vector $(x,y,z) \in \mathbb{R}^3$.
Sampling Translation Space: a 3-dimensional grid
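A rigid motion in this representation can be applied to a point as follows. This is a minimal pure-Python sketch of the standard axis-angle-to-quaternion construction and the rotation $q\,p\,q^{-1}$; the function names and the (w, x, y, z) component order are choices made here, not taken from the talk.

```python
import math

def quat_from_axis_angle(u, theta_deg):
    """Unit quaternion (w, x, y, z) for a rotation by theta_deg degrees about axis u."""
    ux, uy, uz = u
    norm = math.sqrt(ux * ux + uy * uy + uz * uz)
    half = math.radians(theta_deg) / 2.0
    s = math.sin(half) / norm
    return (math.cos(half), ux * s, uy * s, uz * s)

def quat_mul(p, q):
    """Hamilton product of two quaternions."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw * qw - px * qx - py * qy - pz * qz,
            pw * qx + px * qw + py * qz - pz * qy,
            pw * qy - px * qz + py * qw + pz * qx,
            pw * qz + px * qy - py * qx + pz * qw)

def rotate(q, point):
    """Rotate a 3D point by the unit quaternion q via q * p * conj(q)."""
    w, x, y, z = q
    conj = (w, -x, -y, -z)
    _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, *point)), conj)
    return (rx, ry, rz)

def rigid_motion(q, t, point):
    """Rotation by q followed by translation by the vector t."""
    rx, ry, rz = rotate(q, point)
    return (rx + t[0], ry + t[1], rz + t[2])

q = quat_from_axis_angle((0.0, 0.0, 1.0), 90.0)
print(rigid_motion(q, (1.0, 2.0, 3.0), (1.0, 0.0, 0.0)))
```

Since q and -q describe the same rotation, a uniform sample of rotations only needs to cover half of $S^3$.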
50/73
Protein Re-Docking Without False Positives
Empirical Results:
The configuration $\mu(B)$, where $\mu$ is a rigid motion, for which
• Score$(A, \mu(B))$ is maximized;
• Bump$(A, \mu(B)) \le 7$;
is near-native: RMSD$(B_{native}, \mu(B)) \le 3$.
Other prior works (e.g. FFT-based, geometric hashing) generate multiple possible docking configurations (i.e. near-native + false positives).
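The near-native test relies on RMSD between two placements of the same protein. A minimal sketch, assuming both placements list the same atom centers in the same order and that no re-superposition is performed (the moved copy is compared in place); the function names and the default cutoff parameter are illustrative.

```python
import math

def rmsd(native, moved):
    """Root-mean-square deviation between corresponding atom centers."""
    if len(native) != len(moved) or not native:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(native, moved))
    return math.sqrt(total / len(native))

def is_near_native(native, moved, cutoff=3.0):
    """True if the moved placement is within the RMSD cutoff of the native one."""
    return rmsd(native, moved) <= cutoff

native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
moved = [(x + 2.0, y, z) for x, y, z in native]  # pure translation by 2 along x
print(rmsd(native, moved), is_near_native(native, moved))
```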
51/73
Sampling Rigid Motion Space : High Resolution
Rotations: 12,036 (~5 degrees)
Translations: 0.4 grid step size ($10^6$ ~ $10^7$)
Running time: 13 hours to two days on 50 machines
Duke Internet Systems and Storage Group Cluster (~200 machines)
Diverse test set (25 protein-protein complexes):
1A22, 1A4Y, 1BI8, 1BUH, 1BXI, 1CHO, 1CSE, 1DFJ, 1F47, 1FC2, 1FIN, 1FS1, 3HLA, 1JAT, 1JLT, 1MCT, 1MEE, 2PTC, 3SGB, 4SGB, 1STF, 1TEC, 1TGS, 1TX4, 3YGS
52/73
Why high resolution?
Between two close configurations (i.e. small RMSD), Score and Bump can fluctuate greatly.
Example:
[Score, Bump] = [309, 2], [467, 39], [158, 2]
53/73
Protein Re-Docking: Local Search Approach
Given $A = \{(a_j, r_j) : 1 \le j \le n\}$, $B = \{(b_i, s_i) : 1 \le i \le m\}$, find a rigid motion $\mu$ such that
• Score$(A, \mu(B))$ is maximized; and
• Bump$(A, \mu(B)) \le 7$.
[Figure: proteins A and B; label: “local”]
54/73
Outline : Protein Docking
1. Protein-Protein Docking
2. Local Search Algorithm
3. Test Results
55/73
Weighted Least Squares Rigid Motion
$\mu$ = WLSM$(w, B, C)$: the rigid motion for which $\sum_i w_i \|\mu(b_i) - c_i\|^2$ is minimized, where $c_i$ is the tentative goal of $b_i$
This is the absolute orientation problem in computer vision.
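One standard closed-form solution of the absolute orientation problem is Horn's quaternion method, sketched below in pure Python. Whether the talk's WLSM uses this particular solver is not stated, so treat it as an assumed stand-in. The dominant eigenvector of Horn's 4×4 matrix is found here by shifted power iteration to avoid external libraries; the start vector and iteration count are arbitrary choices, and the name `wlsm` is hypothetical.

```python
import math

def quat_to_matrix(q):
    """3x3 rotation matrix of a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return [[1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
            [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
            [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)]]

def apply(R, t, p):
    """Apply the rigid motion p -> R p + t."""
    return tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))

def wlsm(w, B, C, iters=500):
    """Rigid motion (R, t) minimising sum_i w_i * ||R b_i + t - c_i||^2."""
    W = sum(w)
    bc = [sum(wi * b[k] for wi, b in zip(w, B)) / W for k in range(3)]
    cc = [sum(wi * c[k] for wi, c in zip(w, C)) / W for k in range(3)]
    # Weighted cross-covariance of the centred point sets.
    S = [[sum(wi * (b[j] - bc[j]) * (c[k] - cc[k]) for wi, b, c in zip(w, B, C))
          for k in range(3)] for j in range(3)]
    (Sxx, Sxy, Sxz), (Syx, Syy, Syz), (Szx, Szy, Szz) = S
    N = [[Sxx + Syy + Szz, Syz - Szy, Szx - Sxz, Sxy - Syx],
         [Syz - Szy, Sxx - Syy - Szz, Sxy + Syx, Szx + Sxz],
         [Szx - Sxz, Sxy + Syx, -Sxx + Syy - Szz, Syz + Szy],
         [Sxy - Syx, Szx + Sxz, Syz + Szy, -Sxx - Syy + Szz]]
    # Shift by the Frobenius norm so the largest eigenvalue is dominant and positive.
    shift = math.sqrt(sum(v * v for row in N for v in row))
    q = [1.0, 0.0, 0.0, 0.0]
    if shift > 0.0:
        q = [1.0, 0.3, 0.2, 0.1]  # arbitrary start; must not be orthogonal to the answer
        for _ in range(iters):
            q = [sum(N[i][k] * q[k] for k in range(4)) + shift * q[i] for i in range(4)]
            norm = math.sqrt(sum(v * v for v in q))
            q = [v / norm for v in q]
    R = quat_to_matrix(q)
    Rb = apply(R, (0.0, 0.0, 0.0), bc)
    t = tuple(cc[i] - Rb[i] for i in range(3))
    return R, t
```

With all weights equal this reduces to the ordinary least-squares superposition; larger weights pull the motion toward matching those particular points.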
56/73
Local Search Algorithm
57/73
Preprocessing: Candidate Positions
mid-spheres = $\{(a, r+s+0.75) : (a,r) \in A\}$
Vertex set = $\{v : v$ is a vertex of the arrangement of mid-spheres, Bump$((v,s),A) = 0\}$
Sc$(v)$ = Score$((v,s),A)$
58/73
Example: Candidate Positions
59/73
Outer Loop : Increasing Score
Local search neighborhood distance $D$ ($\le 4.5$).
The tentative goal $c_i$ is the highest-scoring vertex within the local neighborhood of $b_i$.
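Selecting tentative goals can be sketched as a nearest-candidates lookup. This assumes the candidate vertices and their scores Sc(v) have already been precomputed, uses a linear scan rather than any spatial index, and returns `None` when no candidate lies within distance D, a case the slide does not address. The function and parameter names are illustrative.

```python
import math

def tentative_goals(b_centers, vertices, scores, D=4.5):
    """For each atom center b_i, pick the highest-scoring candidate vertex within distance D."""
    goals = []
    for b in b_centers:
        nearby = [(scores[k], k) for k, v in enumerate(vertices) if math.dist(b, v) <= D]
        # max() compares by score first; ties go to the larger vertex index.
        goals.append(vertices[max(nearby)[1]] if nearby else None)
    return goals

vertices = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
scores = [5, 9, 100]
print(tentative_goals([(0.0, 0.0, 0.0)], vertices, scores))
```

The high-scoring vertex at distance 10 is ignored because it lies outside the local neighborhood, which is what keeps the search local.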
60/73
Apply Least Squares Rigid Motion
61/73
Inner Loop: Collision Resolution
$F = \{(b,s) \in B : \mathrm{Bump}((b,s),A) = 0,\ \mathrm{Score}((b,s),A) > 0\}$
$Y = \{(b,s) \in B : \mathrm{Bump}((b,s),A) \neq 0\}$
62/73
For $(b_i, s) \in F$: $c_i = \mu(b_i)$, $w_i = 1$
For $(b_i, s) \in Y$: $c_i$ = the nearest vertex within distance 2, $w_i = W() / \|\mu(b_i) - c_i\|^2$
Collision Resolution
63/73
Example: 1BRS
Notation: [Score, Bump] (RMSD)
Native: [309, 2] (0)
Input: [91, 5] (3.78)
Increasing Score   Resolving Collisions
[297, 59]          [236, 34], [215, 28], …, [132, 2] (5.45)
[326, 59]          [298, 43], [282, 30], …, [119, 4] (4.67)
[246, 13]          [247, 10], [200, 9], [174, 4] (2.67)
[351, 30]          [332, 16], [323, 7] (1.98)
[386, 18]          [377, 7] (0.53)
Running Time: 30 seconds ~ 2 minutes for preprocessing; 1~3 seconds per local search
64/73
Outline : Protein Docking
1. Protein-Protein Docking
2. Local Search Algorithm
3. Test Results
65/73
Perturbations
Perturb protein B locally from its native position:
rotation = $(u, \theta)$ followed by translation = $(v, t)$
Sampling:
$u, v \in$ {32 uniformly distributed unit vectors in $\mathbb{R}^3$}
$\theta \in \{0, 3, 6, \ldots, 27\}$ (degrees)
$t \in \{0, 0.5, 1.0, \ldots, 4.5\}$ (Angstrom)
Total:
$(32 \times 9 + 1)$ rotations $\times$ $(32 \times 9 + 1)$ translations = 83,521
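The count can be reproduced by enumerating the sample. The sketch assumes, as the "+1" in the formula suggests, that the zero angle and the zero distance each contribute a single identity motion regardless of direction, while every nonzero value is paired with all 32 directions. The names `motions` and `count_in_boxes` are illustrative.

```python
ANGLES = [3 * k for k in range(10)]    # 0, 3, ..., 27 degrees
DISTS = [0.5 * k for k in range(10)]   # 0, 0.5, ..., 4.5 Angstrom
N_DIRS = 32                            # sampled unit vectors

def motions(values, n_dirs=N_DIRS):
    """One magnitude per sampled motion: a single identity plus n_dirs per nonzero value."""
    result = [0]
    for v in values:
        if v > 0:
            result.extend([v] * n_dirs)
    return result

def count_in_boxes(boxes):
    """Count (rotation, translation) pairs with angle <= a and distance <= d for some box (a, d)."""
    rotations = motions(ANGLES)
    translations = motions(DISTS)
    return sum(1 for a in rotations for d in translations
               if any(a <= amax and d <= dmax for amax, dmax in boxes))

print(count_in_boxes([(27, 4.5)]))   # the full sample
print(count_in_boxes([(18, 3.5)]))   # a smaller neighborhood
```

The same function handles unions of neighborhoods, since a perturbation is counted once if it falls in any of the given boxes.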
66/73
Test Results
Example: rotation angle $\theta = 18$, translation distance $t = 2.5$
829/1024 = 81%
Success: Score > 90% Native_Score, Bump $\le 7$, RMSD $\le 2$
67/73
Test Results
40,903/44,481=92%
{($\theta \le 12$, $t \le 3.5$), ($\theta \le 15$, $t \le 3.0$), ($\theta \le 18$, $t \le 2.5$), ($\theta \le 21$, $t \le 2.0$)}
68/73
10 Different Protein-Protein Complexes
Rotation angle $\theta \le 18$ degrees, translation distance $t \le 3.5$ Angstrom (43,425 perturbations)
69/73
Conclusions
Works well in neighborhood:
- Rotation angle $\le 18$ degrees
- Translation distance $\le 3.5$ Angstrom
Global Search
Incorporate conformational flexibility
Reference: V. Choi, H. Edelsbrunner, P.K. Agarwal and J. Rudolph. Local Search Heuristic for Rigid Protein Docking. To be submitted.
70/73
VMD – Visual Molecular Dynamics: http://www.ks.uiuc.edu/Research/vmd
Acknowledgement
Biogeometry Group @ Duke: Tammy Bailey, Andrew Ban, Sergei Bespamiatnykh, Abhijit Guria, Vijay Natarajan, Alper Ungor, Yusu Wang
Navin Goyal (Rutgers University)
Raimund Seidel (Univ. des Saarlandes)
Stefan Leopoldseder (Vienna Univ. of Technology)
71/73
Future Work
Protein Docking Problem : Unbound case
Repeats in the Human Genome:
72/73
Aberrant recombination
Human disease or structural polymorphism
Repeats: Junk DNA?
Not Junk at All!
73/73
Future Work
Protein Docking Problem : Unbound case
Repeats in the Human Genome: characterization and distribution of repeats (both high copy and low copy) in the human genome
Thank You!