CS262 Lecture 9, Win07, Batzoglou
Rapid Global Alignments
How to align genomic sequences in (more or less) linear time
Saving cells in DP
1. Find local alignments
2. Chain them in O(N log N) time, using longest increasing subsequence (L.I.S.)
3. Restricted DP
Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)
The Problem: Find a Chain of Local Alignments
(x, y) → (x’, y’) requires x < x’ and y < y’
Each local alignment has a weight
FIND the chain with highest total weight
Sparse Dynamic Programming
[Figure: dot matrix of x = 15 3 24 16 20 4 24 3 11 18 (horizontal) against y = 4 20 24 3 11 15 11 4 18 20 (vertical), with matching points marked]
• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead
Sparse Dynamic Programming – L.I.S.
• Longest Increasing Subsequence
• Given a sequence over an ordered alphabet
x = x1, …, xm
• Find a longest subsequence
s = s1, …, sk
such that s1 < s2 < … < sk
Sparse Dynamic Programming – L.I.S.
Let input be w: w1, …, wn
INITIALIZATION:
  L: array of last LIS elements; L[0] = -inf, L[1] = w1, L[2…n] = +inf
  B: array holding the indices of the LIS elements; B[0] = 0, B[1] = 1
  P: array of backpointers; P[1] = 0
  // L[j]: smallest jth element wi of a j-long increasing subsequence seen so far
ALGORITHM
for i = 2 to n {
  Find j such that L[j – 1] < w[i] ≤ L[j]
  L[j] ← w[i]
  B[j] ← i
  P[i] ← B[j – 1]
}
That’s it!!!
• Running time?
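The pseudocode above can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own code: the function name and the 0-based arrays `tails`, `tail_idx` and `prev` (standing in for L, B and P) are assumptions made here. Finding j is done with a binary search, so each of the n steps costs O(log n).

```python
from bisect import bisect_left

def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    tails = []             # tails[j]: smallest last value of an increasing run of length j + 1 (the slide's L)
    tail_idx = []          # tail_idx[j]: index in w of tails[j] (the slide's B)
    prev = [-1] * len(w)   # prev[i]: index of the element before w[i] in its run (the slide's P)
    for i, value in enumerate(w):
        j = bisect_left(tails, value)   # first j with tails[j] >= value
        if j == len(tails):
            tails.append(value)
            tail_idx.append(i)
        else:
            tails[j] = value
            tail_idx[j] = i
        prev[i] = tail_idx[j - 1] if j > 0 else -1
    # Follow backpointers from the end of the longest run.
    out = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        out.append(w[i])
        i = prev[i]
    return out[::-1]
```

For example, `longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6])` returns `[1, 4, 5, 6]`.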
Sparse LCS expressed as LIS
Create a sequence w
• Every matching point (i, j), is inserted into w as follows:
• For each column j = 1…m, insert in w the points (i, j), in decreasing row i order
• The 11 example points are inserted in the order given
• a = (y, x), b = (y’, x’) can be chained iff
a is before b in w, and y < y’
[Figure: dot matrix of x (horizontal) against y (vertical); the 11 matching points are numbered in the order they are inserted into w]
Sparse LCS expressed as LIS
Create a sequence w
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w’s elements as ordered lexicographically, where
• (y, x) < (y’, x’) if y < y’
Claim: An increasing subsequence of w is a common subsequence of x and y
[Figure: dot matrix of x (horizontal) against y (vertical) with the 11 numbered matching points]
Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L = [L1] [L2] [L3] [L4] [L5] …
1. (4,2)
2. (3,3)
3. (3,3) (10,5)
4. (2,5) (10,5)
5. (2,5) (8,6)
6. (1,6) (8,6)
7. (1,6) (3,7)
8. (1,6) (3,7) (4,8)
9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence:
s = 4, 24, 3, 11, 18
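The reduction from sparse LCS to LIS can be sketched in Python as below. It is an illustration under assumptions: the function name is invented here, indices are 0-based, and the matching points are enumerated directly from the two sequences. Within each column of x the matching rows of y are emitted in decreasing order, so two points from the same column can never both appear in a strictly increasing run. Enumerating every match of the transcribed sequences also produces a point for the shared value 15, which the slide's list of 11 points does not contain; it does not change the result.

```python
from bisect import bisect_left
from collections import defaultdict

def sparse_lcs(x, y):
    """Longest common subsequence of x and y via LIS over the matching points."""
    rows = defaultdict(list)            # symbol -> positions in y, increasing
    for i, sym in enumerate(y):
        rows[sym].append(i)
    # w: for each column j of x, the matching rows i in decreasing order.
    w = [(i, j) for j, sym in enumerate(x) for i in reversed(rows.get(sym, []))]

    tails, tail_idx, prev = [], [], [-1] * len(w)
    for k, (i, _) in enumerate(w):
        pos = bisect_left(tails, i)     # points are compared by row (y index) only
        if pos == len(tails):
            tails.append(i)
            tail_idx.append(k)
        else:
            tails[pos] = i
            tail_idx[pos] = k
        prev[k] = tail_idx[pos - 1] if pos > 0 else -1

    out, k = [], (tail_idx[-1] if tail_idx else -1)
    while k != -1:                      # follow backpointers
        out.append(y[w[k][0]])
        k = prev[k]
    return out[::-1]

x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
print(sparse_lcs(x, y))   # [4, 24, 3, 11, 18]
```

The running time is O(h log h) for h matching points, which beats the full O(nm) table whenever matches are sparse.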
[Figure: dot matrix of x (horizontal) against y (vertical) with the 11 numbered matching points]
Sparse DP for rectangle chaining
• 1,…, N: rectangles
• (hj, lj): y-coordinates of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of chain ending in j
• L: list of triplets (lj, V(j), j)
L is sorted by lj: smallest (North) to largest (South) value
L is implemented as a balanced binary tree
[Figure: a rectangle spanning y-coordinates h (top) to l (bottom)]
Sparse DP for rectangle chaining
Main idea:
• Sweep through x-coordinates
• To the right of b, anything chainable to a is chainable to b
• Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining
• In L, keep rectangles j sorted by increasing lj-coordinate; their V(j) scores are then also in increasing order
[Figure: rectangles a and b with scores V(a) and V(b)]
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of rectangle i:
a. j: rectangle in L, with largest lj < hi
b. V(i) = w(i) + V(j) (take V(j) = 0 if no such j exists)
2. When on the rightmost end of i:
a. k: rectangle in L, with largest lk ≤ li
b. If V(i) > V(k) (or no such k exists):
i. INSERT (li, V(i), i) in L
ii. REMOVE all (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li
[Figure: rectangles i, j, and k]
Is k ever removed?
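The sweep described above can be sketched in Python. This is an illustration under stated assumptions, not the lecture's implementation: rectangles are given as hypothetical tuples `(x_left, x_right, h, l, weight)` with weights assumed positive, and L is held in two parallel sorted Python lists searched with `bisect` instead of a balanced binary tree. List insertion and deletion cost O(N) in the worst case, so a real tree would be needed for the O(N log N) bound. Left ends are processed before right ends at equal x so that chaining stays strict.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, weight) with h <= l.
    Returns (best_score, chain), where chain lists rectangle indices in order."""
    n = len(rects)
    if n == 0:
        return 0, []
    # kind 0 = left end, kind 1 = right end; left ends sort first on ties.
    events = sorted((x, kind, i)
                    for i, r in enumerate(rects)
                    for kind, x in ((0, r[0]), (1, r[1])))
    V = [0] * n
    pred = [-1] * n
    ls, entries = [], []   # L: ls[p] is l_j, entries[p] is (V(j), j); both increasing
    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            p = bisect_left(ls, h)            # ls[:p] are the entries with l_j < h_i
            V[i] = w
            if p > 0:
                V[i] += entries[p - 1][0]
                pred[i] = entries[p - 1][1]
        else:
            p = bisect_right(ls, l)           # ls[:p] are the entries with l_k <= l_i
            if p == 0 or V[i] > entries[p - 1][0]:
                q = bisect_left(ls, l)        # first entry with l_j >= l_i
                end = q
                while end < len(ls) and entries[end][0] <= V[i]:
                    end += 1                  # dominated entries are consecutive
                ls[q:end] = [l]               # remove them and insert rectangle i
                entries[q:end] = [(V[i], i)]
    best = max(range(n), key=V.__getitem__)
    score, chain = V[best], []
    while best != -1:
        chain.append(best)
        best = pred[best]
    return score, chain[::-1]
```

With the made-up input `[(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (3, 6, 0, 1, 3), (7, 9, 6, 8, 4), (7, 8, 2, 4, 2)]` the function returns `(15, [0, 1, 3])`.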
Example
[Figure: five rectangles in the x–y plane with weights a: 5, b: 6, c: 3, d: 4, e: 2; y-axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

Scores computed by the sweep:
     a    b    c    d    e
V    5    11   8    12   13

Triplets (li, V(i), i) that appear in L during the sweep:
li    V(i)   i
5     5      a
11    8      c
9     11     b
15    12     d
16    13     e
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
• Searching L takes log N
• Inserting to L takes log N
• All deletions are consecutive, so log N per deletion
• Each element is deleted at most once: N log N for all deletions
• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree
Examples
Human Genome Browser
ABC
Gene Recognition
Gene structure
exon1 intron1 exon2 intron2 exon3
transcription
translation
splicing
exon = protein-coding
intron = non-coding
Codon: A triplet of nucleotides that is converted to one amino acid
Where are the genes?
Needles in a Haystack
Gene Finding
• Classes of gene predictors
  Ab initio
  • Only look at the genomic DNA of target genome
  De novo
  • Target genome + aligned informant genome(s)
  EST/cDNA-based & combined approaches
  • Use aligned ESTs or cDNAs + any other kind of evidence
EXON EXON EXON EXON EXON
Human      tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Macaque    tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Mouse      ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca
Rat        ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca
Rabbit     t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Dog        t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta
Cow        t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta
Armadillo  gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta
Elephant   gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Tenrec     tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Opossum    ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg
Chicken    ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg
Signals for Gene Finding
1. Regular gene structure
2. Exon/intron lengths
3. Codon composition
4. Motifs at the boundaries of exons, introns, etc.: start codon, stop codon, splice sites
5. Patterns of conservation
6. Sequenced mRNAs
7. (PCR for verification)
Next Exon: Frame 0
Next Exon: Frame 1
Exon and Intron Lengths
Nucleotide Composition
• Base composition in exons is characteristic due to the genetic code

Amino Acid      SLC   DNA Codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG
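To make the codon table concrete, here is a small Python sketch that inverts it into a codon-to-amino-acid lookup and translates a DNA string three bases at a time. The names `CODONS_BY_AA`, `CODON_TO_AA` and `translate` are assumptions made for this illustration. The table lists 61 codons; the three triplets it omits (TAA, TAG, TGA) are the stop codons, shown here as `*`.

```python
# Codons per amino acid (single-letter code), copied from the table above.
CODONS_BY_AA = {
    "I": "ATT ATC ATA", "L": "CTT CTC CTA CTG TTA TTG", "V": "GTT GTC GTA GTG",
    "F": "TTT TTC", "M": "ATG", "C": "TGT TGC", "A": "GCT GCC GCA GCG",
    "G": "GGT GGC GGA GGG", "P": "CCT CCC CCA CCG", "T": "ACT ACC ACA ACG",
    "S": "TCT TCC TCA TCG AGT AGC", "Y": "TAT TAC", "W": "TGG", "Q": "CAA CAG",
    "N": "AAT AAC", "H": "CAT CAC", "E": "GAA GAG", "D": "GAT GAC",
    "K": "AAA AAG", "R": "CGT CGC CGA CGG AGA AGG",
}
CODON_TO_AA = {codon: aa for aa, codons in CODONS_BY_AA.items()
               for codon in codons.split()}

def translate(dna):
    """Translate DNA in frame 0; triplets missing from the table map to '*'."""
    dna = dna.upper()
    return "".join(CODON_TO_AA.get(dna[i:i + 3], "*")
                   for i in range(0, len(dna) - 2, 3))

print(translate("ATGGCCTGGTGA"))   # MAW*
```

Because several codons map to the same amino acid, the third codon position is freer to vary than the first two, which is one reason base composition in exons looks different from non-coding sequence.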
[Figure: gene sequence annotated with signal motifs: start codon atg, stop codon tga, and splice-site motifs ggtgag (shown four times), caggtg, cagatg, cagttg, caggcc]
Splice Sites
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
HMMs for Gene Recognition
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
intergene exon intron exon intron exon intergene
Intergene State
First Exon State
Intron State
Duration HMMs for Gene Recognition
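[Figure: nucleotide sequence segmented into Exon1, Exon2, Exon3, each exon with a duration d; a position is labelled j+2]
Intron: $\prod_i P_{\mathrm{INTRON}}(x_i \mid x_{i-1} \dots x_{i-w})$
Exon of duration d: $P_{\mathrm{EXON\_DUR}}(d) \prod_i P_{\mathrm{EXON}((i - j + 2) \% 3)}(x_i \mid x_{i-1} \dots x_{i-w})$
5' splice site: $P_{5'\mathrm{SS}}(x_{i-3} \dots x_{i+4})$
Stop codon: $P_{\mathrm{STOP}}(x_{i-4} \dots x_{i+3})$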
Genscan
• Burge, 1997
• First competitive HMM-based gene finder, huge accuracy jump
• Only gene finder at the time to predict partial genes and genes on both strands
Features
– Duration HMM
– Four different parameter sets
• Very low, low, med, high GC-content
Using Comparative Information
• Hox cluster is an example where everything is conserved
Patterns of Conservation
              Mutations   Gaps      Frameshifts
Genes         30%         1.3%      0.14%
Intergenic    58%         14%       10.2%
Separation    2-fold      10-fold   75-fold
Comparison-based Gene Finders
• Rosetta, 2000
• CEM, 2000
– First methods to apply comparative genomics (human-mouse) to improve gene prediction
• Twinscan, 2001
– First HMM for comparative gene prediction in two genomes
• SLAM, 2002
– Generalized pair-HMM for simultaneous alignment and gene prediction in two genomes
• NSCAN, 2006
– Best method to date, based on a phylo-HMM for multiple genome gene prediction
Twinscan
1. Align the two sequences (e.g. from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )
New “alphabet”: 4 x 3 = 12 letters
S = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
3. Run Viterbi using emissions ek(b) where b ∈ { A-, A:, A|, …, U| }
Emission distributions ek(b) estimated from real genes from human/mouse
eI(x|) < eE(x|): matches favored in exons
eI(x-) > eE(x-): gaps (and mismatches) favored in introns
Example
Human:     ACGGCGACGUGCACGU
Mouse:     ACUGUGACGUGCACUU
Alignment: ||:|:|||||||||:|
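Step 2 can be illustrated with a short Python sketch that turns a pairwise alignment into the 12-letter alphabet. The function name and the convention that `-` in the informant row denotes a gap are assumptions made here; columns where the target itself has a gap are skipped because they do not correspond to a target base.

```python
def conservation_symbols(target, informant):
    """Label each target base with '|' (match), ':' (mismatch) or '-' (gap)."""
    if len(target) != len(informant):
        raise ValueError("aligned rows must have equal length")
    letters = []
    for t, m in zip(target, informant):
        if t == "-":
            continue                      # no target base in this column
        if m == "-":
            mark = "-"
        elif m == t:
            mark = "|"
        else:
            mark = ":"
        letters.append(t + mark)          # one letter of the 12-letter alphabet
    return letters

letters = conservation_symbols("ACGGCGACGUGCACGU", "ACUGUGACGUGCACUU")
print("".join(l[1] for l in letters))     # ||:|:|||||||||:|
```

The resulting list, e.g. `A|`, `C|`, `G:`, is what the Viterbi step would consume as its observation sequence.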
SLAM – Generalized Pair HMM
[Figure: a pair of aligned exons with lengths d and e]
Exon GPHMM
1. Choose exon lengths (d, e).
2. Generate alignment of length d + e.
NSCAN: Multiple Species Gene Prediction
• GENSCAN
Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
• TWINSCAN
Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
Conservation  |||:||:||:|||||:||||||||......
sequence
• N-SCAN
Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
Informant1    GGTCAGC___CCAAGAACGTGTAG......
Informant2    GATCAGC___CCAAGAACGTGTAG......
Informant3    GGTGAGCTGACCAAGATCGTGTTGACACAA
...
Target sequence: $P(T_i \mid T_{i-1}, \dots, T_{i-o})$
Informant sequences (vector): $P(\mathbf{I}_i \mid T_i, \dots, T_{i-o}, \mathbf{I}_{i-1}, \dots, \mathbf{I}_{i-o})$
Joint prediction (use phylo-HMM): $P(T_i, \mathbf{I}_i \mid T_{i-1}, \dots, T_{i-o}, \mathbf{I}_{i-1}, \dots, \mathbf{I}_{i-o})$
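One way to read the three quantities together is through the chain rule. The factorization below is a sketch that assumes the target base depends on the history only through the previous target bases; that independence assumption is stated here for illustration and is not spelled out on the slide.

```latex
\begin{aligned}
P(T_i, \mathbf{I}_i \mid T_{i-1}, \dots, T_{i-o}, \mathbf{I}_{i-1}, \dots, \mathbf{I}_{i-o})
  &= P(T_i \mid T_{i-1}, \dots, T_{i-o}, \mathbf{I}_{i-1}, \dots, \mathbf{I}_{i-o}) \\
  &\quad \times P(\mathbf{I}_i \mid T_i, \dots, T_{i-o}, \mathbf{I}_{i-1}, \dots, \mathbf{I}_{i-o}) \\
  &\approx P(T_i \mid T_{i-1}, \dots, T_{i-o})
     \; P(\mathbf{I}_i \mid T_i, \dots, T_{i-o}, \mathbf{I}_{i-1}, \dots, \mathbf{I}_{i-o})
\end{aligned}
```

Under this reading the first factor is a single-genome model of the target and the second factor is where the phylogenetic model of the informants enters.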
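NSCAN: Multiple Species Gene Prediction
[Figure: phylogenetic tree rooted at X: X has children C and Y; Y has children Z and H; Z has children M and R]
$P(H, C, M, R, X, Y, Z) = P(X)\,P(C \mid X)\,P(Y \mid X)\,P(H \mid Y)\,P(Z \mid Y)\,P(M \mid Z)\,P(R \mid Z)$
[Figure: the same tree re-rooted at H]
$P(H, C, M, R, X, Y, Z) = P(H)\,P(Y \mid H)\,P(X \mid Y)\,P(Z \mid Y)\,P(C \mid X)\,P(M \mid Z)\,P(R \mid Z)$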
Performance Comparison
GENSCAN: Generalized HMM; models human sequence
TWINSCAN: Generalized HMM; models human/mouse alignments
N-SCAN: Phylo-HMM; models multiple sequence evolution
NSCAN human/mouse > Human/multiple informants
CONTRAST
• 2-level architecture
• No Phylo-HMM that models alignments
[Figure: two-level architecture: SVMs over the 12-species alignment (Human through Chicken) feeding a CRF; labels X, Y, a, b]
CONTRAST
CONTRAST - Features
• $\log P(y \mid x) \sim w^T F(x, y)$
• $F(x, y) = \sum_i f(y_{i-1}, y_i, i, x)$
• $f(y_{i-1}, y_i, i, x)$:
  $1\{y_{i-1} = \mathrm{INTRON},\ y_i = \mathrm{EXON\_FRAME\_1}\}$
  $1\{y_{i-1} = \mathrm{EXON\_FRAME\_1},\ x_{human,i-2}, \dots, x_{human,i+3} = \mathrm{ACCGGT}\}$
  $1\{y_{i-1} = \mathrm{EXON\_FRAME\_1},\ x_{human,i-1}, \dots, x_{dog,i+1} = \mathrm{ACC}, \mathrm{AGC}\}$
  $(1-c)\,1\{a < \mathrm{SVM\_DONOR}(i) < b\}$
  (optional) $1\{\mathrm{EXON\_FRAME\_1},\ \mathrm{EST\_EVIDENCE}\}$
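The scoring rule log P(y | x) ~ w^T F(x, y) can be sketched in Python as a sum of local feature vectors followed by a dot product with the weights. Everything below is a toy illustration: the two indicator features, the label names and the weights are invented for the example and are not CONTRAST's actual feature set.

```python
def local_features(y_prev, y_cur, i, x):
    """Toy f(y_{i-1}, y_i, i, x): two hypothetical indicator features."""
    return [
        1.0 if (y_prev, y_cur) == ("INTRON", "EXON") else 0.0,   # label transition
        1.0 if y_cur == "EXON" and x[i] in "GC" else 0.0,        # label + observed base
    ]

def unnormalized_log_score(w, x, y):
    """Compute w^T F(x, y) with F(x, y) = sum_i f(y_{i-1}, y_i, i, x)."""
    F = [0.0] * len(w)
    for i in range(1, len(y)):
        for k, value in enumerate(local_features(y[i - 1], y[i], i, x)):
            F[k] += value
    return sum(wk * fk for wk, fk in zip(w, F))

x = "ACGT"
y = ["INTRON", "INTRON", "EXON", "EXON"]
print(unnormalized_log_score([2.0, 0.5], x, y))   # 2.5
```

The score is unnormalized: turning it into log P(y | x) also requires subtracting the log of the partition function, which is omitted here.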
CONTRAST – SVM accuracies
• Accuracy increases as we add informants
• Diminishing returns after ~5 informants
[Figure: SN and SP plots]
CONTRAST - Decoding
Viterbi Decoding:
maximize $P(y \mid x)$
Maximum Expected Boundary Accuracy Decoding:
maximize $\sum_{i,B} 1\{y_{i-1}, y_i \text{ is exon boundary } B\}\ \mathrm{Accuracy}(y_{i-1}, y_i, B \mid x)$
$\mathrm{Accuracy}(y_{i-1}, y_i, B \mid x) = P(y_{i-1}, y_i \text{ is } B \mid x) - (1 - P(y_{i-1}, y_i \text{ is } B \mid x))$
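A short rearrangement makes the decoding objective easier to read. Writing p for the posterior probability of the boundary (a shorthand introduced here, not on the slide):

```latex
\mathrm{Accuracy} = p - (1 - p) = 2p - 1,
\qquad p = P(y_{i-1}, y_i \text{ is } B \mid x)
```

So predicting a boundary adds to the objective only when p > 1/2 and subtracts from it otherwise, which is why this decoder can differ from the single most likely Viterbi path.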
CONTRAST - Training
Maximum Conditional Likelihood Training:
maximize $L(w) = P_w(y \mid x)$
Maximum Expected Boundary Accuracy Training:
$\mathrm{ExpectedBoundaryAccuracy}(w) = \sum_i \mathrm{Accuracy}_i$
$\mathrm{Accuracy}_i = \sum_B 1\{(y_{i-1}, y_i) \text{ is exon boundary } B\}\ P_w(y_{i-1}, y_i \text{ is } B \mid x) - \sum_{B' \neq B} P(y_{i-1}, y_i \text{ is exon boundary } B' \mid x)$
Performance Comparison

De Novo
                N-SCAN (Mouse)   CONTRAST (Mouse)   CONTRAST (Eleven Informants)
Gene Sn         35.6             50.8               58.6
Gene Sp         25.1             29.3               35.5
Exon Sn         84.2             90.8               92.8
Exon Sp         64.6             70.5               72.5
Nucleotide Sn   90.8             96.0               96.9
Nucleotide Sp   67.9             70.0               72.0

EST-assisted
                N-SCAN (Mouse)   CONTRAST (Mouse)   CONTRAST (Eleven Informants)
Gene Sn         46.8             60.7               65.4
Gene Sp         31.7             40.6               46.2
Exon Sn         89.7             92.6               93.9
Exon Sp         66.9             74.8               76.2
Nucleotide Sn   93.7             95.7               96.7
Nucleotide Sp   69.3             74.3               75.8

Genomes: Human, Macaque, Mouse, Rat, Rabbit, Dog, Cow, Armadillo, Elephant, Tenrec, Opossum, Chicken
Performance Comparison