04/18/23 ©Bud Mishra, 2001
Computational Biology
Lecture #7: Local Alignment
Bud Mishra
Professor of Computer Science and Mathematics
10/22/2002
Local Alignment Problem (LAP)
• Goal: find substrings of high similarity.
• Two given strings S1 and S2 may have regions that are locally highly similar, even when the strings as a whole are not.
LAP: Local Alignment Problem
• Given: Two strings S1 and S2
• Find: Substrings α ⊑ S1 and β ⊑ S2 whose similarity (in terms of an objective function, e.g., the optimal global alignment value) is maximum over all pairs of substrings from S1 and S2:
$v^* = \max_{\alpha \sqsubseteq S_1,\ \beta \sqsubseteq S_2} \mathrm{distance}(\alpha, \beta)$
Example
• d(x,x) = 2, d(x,y) = -2
• d(x,-) = d(-,x) = -1
• α = a x a b c s ⊑ S1
• β = a x b a c s ⊑ S2
S1 = p q r a x a b c s t v q
S2 = x y a x b a c s l l
Local alignment:
a  x  a  b  -  c  s
|  |     |     |  |
a  x  -  b  a  c  s
2  2 -1  2 -1  2  2
distance(α, β) = 8
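The value 8 can be checked by summing the column scores of the alignment. The following is a minimal sketch; the function name `score_alignment` and its parameter names are illustrative and not part of the lecture, and the scores are the ones given in the example.

```python
def score_alignment(row1, row2, match=2, mismatch=-2, space=-1):
    """Sum the column scores of a gapped alignment given as two equal-length rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += space      # d(x,-) = d(-,x) = -1
        elif x == y:
            total += match      # d(x,x) = 2
        else:
            total += mismatch   # d(x,y) = -2
    return total


# The local alignment from the example: 2 + 2 - 1 + 2 - 1 + 2 + 2
print(score_alignment("axab-cs", "ax-bacs"))
```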
Naïve Complexity
• Note (1): Let |S1| = n and |S2| = m.
– Total number of substrings of S1 = $\binom{n+1}{2} = O(n^2)$
– Total number of substrings of S2 = $\binom{m+1}{2} = O(m^2)$
– Naïvely, $O(n^2 m^2)$ candidate pairs of substrings need to be globally aligned by a DP algorithm of complexity $O(|\alpha|\,|\beta|)$, which is $O(nm)$ per pair.
• Complexity of the resulting algorithm = $O(n^3 m^3)$
• Note (2): An improved algorithm (SWAT, Smith-Waterman) reduces the time complexity to $O(nm)$.
LSAP: Local Suffix Alignment Problem
• A restricted version of the LAP.
• Given: Two strings S1 and S2 and two indices i ≤ |S1| and j ≤ |S2|
– A_i = S1[1..i], a prefix of S1
– B_j = S2[1..j], a prefix of S2
• Find: A suffix α of A_i (possibly empty; α = S1[k..i]) and a suffix β of B_j (possibly empty; β = S2[l..j]) that maximize a linear objective function V(α, β) over all pairs of suffixes of A_i and B_j.
Objective Function
• $v(i,j) = \max_{\alpha = \mathrm{suf}\ S_1[1..i],\ \beta = \mathrm{suf}\ S_2[1..j]} V(\alpha, \beta)$
= Value of the optimal local suffix alignment for the given index pair i, j.
• $v^* = \max_{i \le n,\ j \le m} v(i,j)$
= Value of the optimal local alignment.
• n = |S1|, m = |S2|
Optimal Local Alignment: Recurrence Equations
• $v^* = \max_{i \le n,\ j \le m} v(i,j)$
• α = suf S1[1..i], β = suf S2[1..j]
• v* = v(i', j') = V(α, β)
• Consider an optimal suffix alignment with α = suf S1[1..i] and β = suf S2[1..j].
• Case 1: α = β = ε (the empty string)
– Base: V(ε, ε) = 0
Optimal Local Alignment: Recurrence Equations
• Case 2: α ≠ ε, α = α' ∘ S1[i], and S1[i] matches "-"
– Ind(A): V(α, β) = V(α', β) + d(S1[i], -)
• ...or S1[i] matches S2[j] (β = β' ∘ S2[j])
– Ind(C): V(α, β) = V(α', β') + d(S1[i], S2[j])
Optimal Local Alignment: Recurrence Equations
• Case 3: β ≠ ε, β = β' ∘ S2[j], and S2[j] matches "-"
– Ind(B): V(α, β) = V(α, β') + d(-, S2[j])
• ...or S1[i] matches S2[j] (α = α' ∘ S1[i])
– Ind(C): V(α, β) = V(α', β') + d(S1[i], S2[j])
Recurrence Equation
• $v(i,j) = \max_{\alpha = \mathrm{suf}\ S_1[1..i],\ \beta = \mathrm{suf}\ S_2[1..j]} V(\alpha, \beta)$
• Base: $v(i,j)\,|_{i=0 \,\lor\, j=0} = 0$, that is, v(0,0) = v(i,0) = v(0,j) = 0
• Induction: for $i \ne 0 \land j \ne 0$,
$v(i,j) = \max[\,0,\ v(i-1,j) + d(S_1[i], -),\ v(i,j-1) + d(-, S_2[j]),\ v(i-1,j-1) + d(S_1[i], S_2[j])\,]$
Dynamic Programming Table (with Traceback)
• Compute all v(i,j) entries: Complexity = O(nm)
• Find v* = v(i*, j*) by finding the largest value in any cell: Complexity = O(nm)
• Trace the pointers back from v(i*, j*) until a cell is reached with value v(i', j') = 0: Complexity = O(n+m)
• Results: α = S1[i'+1..i*] ⊑ S1 and β = S2[j'+1..j*] ⊑ S2 (the zero cell itself lies just before the aligned region)
• Total Complexity = O(nm) = O(|S1| |S2|)
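The local-alignment recurrence and the three table steps above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name `local_align`, the tie-breaking order in the traceback (diagonal, then up, then left), and the 0-based Python slicing are assumptions of the sketch. The scores default to those of the running example.

```python
def local_align(s1, s2, match=2, mismatch=-2, space=-1):
    """Return (v*, alpha, beta): best local score and the two aligned substrings."""
    n, m = len(s1), len(s2)
    # v[i][j] = value of the best local suffix alignment of s1[:i] and s2[:j];
    # row 0 and column 0 stay at 0 (base case).
    v = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            v[i][j] = max(0,
                          v[i - 1][j] + space,      # S1[i] against "-"
                          v[i][j - 1] + space,      # "-" against S2[j]
                          v[i - 1][j - 1] + d)      # S1[i] against S2[j]
            if v[i][j] > best:
                best, bi, bj = v[i][j], i, j
    # Traceback from the best cell until a cell with value 0 is reached.
    i, j = bi, bj
    while v[i][j] > 0:
        d = match if s1[i - 1] == s2[j - 1] else mismatch
        if v[i][j] == v[i - 1][j - 1] + d:
            i, j = i - 1, j - 1
        elif v[i][j] == v[i - 1][j] + space:
            i -= 1
        else:
            j -= 1
    return best, s1[i:bi], s2[j:bj]


print(local_align("pqraxabcstvq", "xyaxbacsll"))
```

On the strings of the running example this reports the score 8 with the substrings `axabcs` and `axbacs`. Filling the table dominates the cost, so the sketch runs in O(nm) time and space.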
Example
(Rows: S1 = p q r a x a b c s t v q; columns: S2 = x y a x b a c s l l. Arrows are traceback pointers: ↖ diagonal, ← left, ↑ up.)
    .   x   y   a   x   b   a   c   s   l   l
.   0   0   0   0   0   0   0   0   0   0   0
p   0   0   0   0   0   0   0   0   0   0   0
q   0   0   0   0   0   0   0   0   0   0   0
r   0   0   0   0   0   0   0   0   0   0   0
a   0   0   0  ↖2  ←1   0  ↖2  ←1   0   0   0
x   0  ↖2  ←1  ↑1  ↖4  ←3  ←2  ←1   0   0   0
a   0  ↑1   0  ↖3  ↑3  ↑2  ↖5  ←4  ←3  ←2  ←1
b   0   0   0  ↑2  ↑2  ↖5  ←4  ←3  ←2  ←1   0
c   0   0   0  ↑1  ↑1  ↑4  ↑3  ↖6  ←5  ←4  ←3
s   0   0   0   0   0  ↑3  ↑2  ↑5  ↖8  ←7  ←6
t   0   0   0   0   0  ↑2  ↑1  ↑4  ↑7  ←6  ←5
v   0   0   0   0   0  ↑1   0  ↑3  ↑6  ←5  ←4
q   0   0   0   0   0   0   0  ↑2  ↑5  ←4  ←3
Dealing with Gaps
• A gap is any “maximal consecutive run of spaces” in a single string of a given alignment.
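c t t t a a c - - a - a c
c - - - c a c c c a t - c
This alignment has four gaps, labeled g1 to g4 from left to right: g1 (three spaces in the second string), g2 (two spaces in the first string), g3 (one space in the first string) and g4 (one space in the second string).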
Gaps
• Initial Gap
– A gap may be bordered on the right by the first character of a string.
• Final Gap
– A gap may be bordered on the left by the last character of a string.
• Internal Gap
– A gap may be bordered on both left and right.
• Simple Gap Penalty Model: Constant Weight, Wg
– Each gap contributes a constant penalty = Wg
– d(x,x) = 2, d(x,y) = -2, d(x,-) = d(-,y) = 0
– # gaps = k. Then
– Value of an alignment = $\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - k\,W_g$
Biological Motivations for Gap Models
– Unequal crossing-over in meiosis
– DNA slippage during replication
– Insertion of transposable elements ("jumping genes")
– Insertion by retroviruses
– Translocation between chromosomes
• Examples of alignment with gaps:
– cDNA matching problem
– Processed pseudo-gene problem
Gap Weights
• Constant:
– Each gap has a penalty of Wg
– Each space is free: d(x,-) = d(-,x) = 0.
• Affine:
– Gap initiation weight = Wg
– Gap extension weight = Ws
– Each gap of length q has a penalty of Wg + q Ws
• Convex:
– Each gap of length q has a penalty of Wg + (ln q) Ws
• Arbitrary:
– Each gap of length q has a penalty of Wg + ω(q) Ws, where ω(q) = arbitrary function
General Model
• Arbitrary:
– Each gap of length q has a penalty of Wg + ω(q) Ws, where ω(q) = arbitrary function
– ω(q) = 0: constant
– ω(q) = q: linear/affine
– ω(q) = ln q: convex
• Total cost under the constant model:
$\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - (\#\text{gaps})\,W_g$
• Total cost under the affine model:
$\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - (\#\text{gaps})\,W_g - (\#\text{spaces})\,W_s$
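The two total-cost formulas can be evaluated directly on a given alignment. The sketch below counts gaps as maximal runs of "-" within each row and treats every space as free in the column sum, as in the gap models above; the function name `alignment_value` and the weights Wg = 3, Ws = 1 used in the demonstration are hypothetical choices for illustration only.

```python
import re


def alignment_value(row1, row2, wg, ws, match=2, mismatch=-2):
    """Return (value, #gaps, #spaces) of an alignment under the affine model.

    Setting ws = 0 gives the constant model.
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    # Column scores: a column containing a space contributes 0.
    columns = 0
    for x, y in zip(row1, row2):
        if x != "-" and y != "-":
            columns += match if x == y else mismatch
    # A gap is a maximal consecutive run of spaces in a single row.
    gaps = len(re.findall(r"-+", row1)) + len(re.findall(r"-+", row2))
    spaces = row1.count("-") + row2.count("-")
    return columns - gaps * wg - spaces * ws, gaps, spaces


# The alignment from the "Dealing with Gaps" slide.
top = "ctttaac--a-ac"
bottom = "c---cacccat-c"
print(alignment_value(top, bottom, wg=3, ws=1))  # affine model
print(alignment_value(top, bottom, wg=3, ws=0))  # constant model
```

For this alignment the column sum is 8 with 4 gaps and 7 spaces, so the assumed weights give 8 - 4·3 - 7·1 = -11 under the affine model and 8 - 4·3 = -4 under the constant model.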
Local Alignment under Arbitrary Gap Weight Model
• Dynamic Programming (Needleman & Wunsch)
• Given two strings S1 and S2, start by aligning the prefixes
– S1,i = S1[1..i] and
– S2,j = S2[1..j]
• There are three different cases to consider…
Case 1
• S1[i] is aligned to a character strictly to the left of a character S2[j]
[Figure: prefixes S1,i and S2,j, with S1[i] placed to the left of S2[j]]
Case 2
• S1[i] is aligned to a character strictly to the right of a character S2[j]
[Figure: prefixes S1,i and S2,j, with S1[i] placed to the right of S2[j]]
Case 3
• S1[i] and S2[j] are aligned opposite each other:
– Subcase A: S1[i] = S2[j]
– Subcase B: S1[i] ≠ S2[j]
[Figure: prefixes S1,i and S2,j, with S1[i] placed opposite S2[j]]
Auxiliary Variables
• $X_L(i,j) = \max_{\text{alignments for case 1}} \mathrm{distance}(S_1[1..i], S_2[1..j])$
• $X_R(i,j) = \max_{\text{alignments for case 2}} \mathrm{distance}(S_1[1..i], S_2[1..j])$
• $X_S(i,j) = \max_{\text{alignments for case 3}} \mathrm{distance}(S_1[1..i], S_2[1..j])$
• V(i,j) = max(XL(i,j), XR(i,j), XS(i,j))
Recurrence: Base
• Notation: ? = "undefined"
• XS(0,0) = 0, XS(i,0) = ?, XS(0,j) = ?
• XL(0,0) = ?, XL(i,0) = -ω(i), XL(0,j) = ?
• XR(0,0) = ?, XR(i,0) = ?, XR(0,j) = -ω(j)
• V(0,0) = 0, V(i,0) = -ω(i), V(0,j) = -ω(j)
Recurrence: Induction
i > 0 and j > 0:
• XS(i,j) = V(i-1,j-1) + d(S1[i], S2[j])
• $X_L(i,j) = \max_{0 \le k \le j-1} \big(V(i,k) - \omega(j-k)\big)$
• $X_R(i,j) = \max_{0 \le l \le i-1} \big(V(l,j) - \omega(i-l)\big)$
• V(i,j) = max(XL(i,j), XR(i,j),XS(i,j))
• Each V(i,j) can be computed in time O(i+j)
Total Time Complexity
• Let |S1| = n and |S2| = m.
• The recurrence can be evaluated with a Dynamic Programming Table of space complexity $O(nm)$ and time complexity $O(n^2 m + m^2 n)$, since each of the nm cells takes O(i+j) time.
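A minimal sketch of the arbitrary-gap recurrence follows. It assumes that the gap weight is passed in as a Python function `gap_penalty(q)` giving the total penalty of a gap of length q (a name chosen here for illustration), and it only needs the V table because XL and XR are recomputed from V in every cell. The inner `max` over k and l is what makes each cell cost O(i+j).

```python
def align_arbitrary_gaps(s1, s2, gap_penalty, match=2, mismatch=-2):
    """Optimal alignment value of s1 and s2 under an arbitrary gap weight."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base: one string is empty, so the other is aligned to a single gap.
    for i in range(1, n + 1):
        V[i][0] = -gap_penalty(i)
    for j in range(1, m + 1):
        V[0][j] = -gap_penalty(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            xs = V[i - 1][j - 1] + d                                 # case 3
            xl = max(V[i][k] - gap_penalty(j - k) for k in range(j))  # case 1
            xr = max(V[l][j] - gap_penalty(i - l) for l in range(i))  # case 2
            V[i][j] = max(xs, xl, xr)
    return V[n][m]


# Hypothetical gap weight for the demonstration: 1 + q for a gap of length q.
print(align_arbitrary_gaps("acgt", "at", lambda q: 1 + q))
```

With the assumed weight, the best alignment of `acgt` and `at` matches the two end characters and pays for one gap of length 2, giving 2 + 2 - 3 = 1.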
Affine Gap Model: Recurrence
• SWAT: Smith-Waterman
• Modifying the recurrence equations for the affine case:
– XS(0,0) = 0, XS(i,0) = ?, XS(0,j) = ?
– XL(0,0) = ?, XL(i,0) = -Wg - i Ws, XL(0,j) = ?
– XR(0,0) = ?, XR(i,0) = ?, XR(0,j) = -Wg - j Ws
– V(0,0) = 0, V(i,0) = -Wg - i Ws, V(0,j) = -Wg - j Ws
Recurrence: Induction
i > 0 and j > 0:
• XS(i,j) = V(i-1,j-1) + d(S1[i], S2[j])
• XL(i,j) = max(XL(i,j-1) - Ws, ?, XS(i,j-1) - Wg - Ws, V(i,j-1) - Wg - Ws)
= max[XL(i,j-1), V(i,j-1) - Wg] - Ws
• XR(i,j) = max(?, XR(i-1,j) - Ws, XS(i-1,j) - Wg - Ws, V(i-1,j) - Wg - Ws)
= max[XR(i-1,j), V(i-1,j) - Wg] - Ws
• V(i,j) = max(XL(i,j), XR(i,j), XS(i,j))
• Each V(i,j) can be computed in O(1) time. The optimal alignment with affine gap weights can be computed with a DP table of space and time complexity = O(nm).
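The O(1)-per-cell induction can be sketched as below. One choice here is an assumption of the sketch rather than something taken from the slides: every border entry of XL and XR that does not correspond to an alignment actually ending in that kind of gap is set to -infinity, so that a run of spaces in one string is never extended by spaces in the other string. The interior updates are exactly the simplified forms of the recurrence above; the function name `align_affine` is illustrative.

```python
def align_affine(s1, s2, wg, ws, match=2, mismatch=-2):
    """Optimal alignment value under affine gap weights (wg + q*ws per gap)."""
    NEG = float("-inf")  # stands for "undefined"
    n, m = len(s1), len(s2)
    V = [[NEG] * (m + 1) for _ in range(n + 1)]
    XL = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with S2[j] against a space
    XR = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with S1[i] against a space
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = XR[i][0] = -wg - i * ws   # sketch's assumption for the border
    for j in range(1, m + 1):
        V[0][j] = XL[0][j] = -wg - j * ws   # sketch's assumption for the border
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            xs = V[i - 1][j - 1] + d
            # Either extend the current gap (pay ws) or open a new one (pay wg + ws).
            XL[i][j] = max(XL[i][j - 1], V[i][j - 1] - wg) - ws
            XR[i][j] = max(XR[i - 1][j], V[i - 1][j] - wg) - ws
            V[i][j] = max(xs, XL[i][j], XR[i][j])
    return V[n][m]


print(align_affine("acgt", "at", wg=1, ws=1))
```

Each cell reads a constant number of neighbours, which is where the O(nm) total time comes from. With Wg = Ws = 1 (hypothetical weights) the gap of length 2 costs 3 and the result is 2 + 2 - 3 = 1.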
Parallelization
• Systolic Arrays:
• Create a special-purpose processor P(i,j) for the (i,j)th entry of the Dynamic Programming Table.
• Connect P(i,j) to P(i-1,j), P(i-1,j-1) and P(i,j-1).
• Each processor holds static data Wg and Ws. Each processor stores and transmits dynamic data: XS(i,j), XL(i,j), XR(i,j) and V(i,j).
Systolic Computation
• Dynamically compute in one cycle:
– XS(i,j), XL(i,j), XR(i,j), V(i,j)
• using
– XS(i-1,j), XL(i-1,j), XR(i-1,j), V(i-1,j)
– XS(i,j-1), XL(i,j-1), XR(i,j-1), V(i,j-1)
– XS(i-1,j-1), XL(i-1,j-1), XR(i-1,j-1), V(i-1,j-1)
• and
– Wg and Ws.
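Because cell (i,j) depends only on its upper, left and upper-left neighbours, all cells on the same anti-diagonal i + j are independent of one another. The sketch below lists which cells could fire in each cycle under that observation. It is a software illustration of the data dependencies only, with `wavefront_schedule` a name chosen here, and it makes no claim about how an actual systolic array would be clocked.

```python
def wavefront_schedule(n, m):
    """Group the cells (i, j), 1 <= i <= n, 1 <= j <= m, by anti-diagonal i + j.

    Cells in one group depend only on cells from the two previous groups
    (or on the table border), so each group could be computed in one cycle.
    """
    steps = []
    for d in range(2, n + m + 1):
        lo = max(1, d - m)
        hi = min(n, d - 1)
        steps.append([(i, d - i) for i in range(lo, hi + 1)])
    return steps


for cycle, cells in enumerate(wavefront_schedule(2, 3), start=1):
    print(cycle, cells)
```

Under this schedule an n by m table needs n + m - 1 cycles instead of nm sequential cell updates.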
Database Search
• BLAST and its relatives:
• A query search ⇒
– Compare the query sequence to all the sequences in the database for local similarities.
• Heuristics:
– BLAST
– FAST
• Needs good complexity analysis
BLAST
• Basic Local Alignment Search Tool
• Query sequence, σ ∈ Σ*; Database, L ⊆ Σ*
• BLAST returns a list of high scoring segment pairs between the query sequence and sequences in the database. Score function depends on -PAM score functions.
BLAST Heuristics
• BLAST is a 3-step algorithm:
– Step 1. Compile a list of high-scoring strings: W = words. W = all w-mers that score at least a given threshold with some w-mer of the query.
– Step 2. Search for hits. Each hit defines a seed. Construct a DFA to recognize W. Scan the database, compiling the hits.
– Step 3. Extend the seeds. The seeds are extended in both directions until the score falls a certain distance below the best so far.
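The three steps can be illustrated with a heavily simplified seed-and-extend sketch. It is not BLAST itself: the word list contains only the exact w-mers of the query (no neighbourhood words scored against a PAM matrix), a dictionary lookup stands in for the DFA, extension is ungapped with a plain match/mismatch score, and all names and parameter values (`seed_and_extend`, `w`, `drop`) are assumptions made for this example.

```python
def _extend(q, t, i, j, step, match, mismatch, drop):
    """Ungapped extension from (i, j) in direction step (+1 or -1).

    Returns (best gain, number of columns kept for that gain).
    """
    score = best = kept = k = 0
    while 0 <= i < len(q) and 0 <= j < len(t):
        score += match if q[i] == t[j] else mismatch
        k += 1
        if score > best:
            best, kept = score, k
        elif best - score > drop:
            break  # fell too far below the best score so far
        i += step
        j += step
    return best, kept


def seed_and_extend(query, db, w=3, match=1, mismatch=-1, drop=2):
    """Return the best segment pair as (score, q_start, q_end, db_start, db_end)."""
    # Step 1 (simplified): the word list is the set of exact w-mers of the query.
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    best = None
    # Step 2: scan the database; every word occurrence is a hit (seed).
    for j in range(len(db) - w + 1):
        for i in words.get(db[j:j + w], []):
            # Step 3: extend the seed in both directions.
            right, r = _extend(query, db, i + w, j + w, +1, match, mismatch, drop)
            left, l = _extend(query, db, i - 1, j - 1, -1, match, mismatch, drop)
            hsp = (w * match + left + right, i - l, i + w + r, j - l, j + w + r)
            if best is None or hsp[0] > best[0]:
                best = hsp
    return best


print(seed_and_extend("GATTACA", "CCGATTTCAGG"))
```

In the demonstration the seeds `GAT` and `ATT` both extend to the same segment pair, covering the whole query against positions 2 to 8 (0-based, end-exclusive 9) of the database string with one mismatch, for a score of 5.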
FAST
• s, t = two sequences being compared. |s| = m and |t| = n.
– Step 1. Determine k-tuples common to both sequences (k = 1 or 2).
– Step 2. The "offset" of a common k-tuple is computed. If the common k-tuples start at positions s[i] and t[j], then offset = i - j.
– Step 3. Determine the most common offset value to align the sequences.
– Step 4. Combine the common k-tuples to create a region.
Example
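• Sequences (with 1-based positions):
– s = H A R F Y A A Q I V L (positions 1 to 11)
– t = V D M A A Q I A (positions 1 to 8)
• Positions of the 1-tuples in s:
– A → (2, 6, 7)
– F → (4)
– H → (1)
– I → (9)
– L → (11)
– Q → (8)
– R → (3)
– V → (10)
– Y → (5)
• Offsets for 1-tuples, per position of t:
– t[1] = V: {9}
– t[2] = D: none
– t[3] = M: none
– t[4] = A: {-2, 2, 3}
– t[5] = A: {-3, 1, 2}
– t[6] = Q: {2}
– t[7] = I: {2}
– t[8] = A: {-6, -2, -1}
• Alignment (offset 2; | marks a match, + a mismatch):
s = H A R F Y A A Q I V L
        + + + | | | | +
t =     V D M A A Q I A
[Figure: histogram of offset values over the range -9 to 9]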
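Steps 1 to 3 of the FAST heuristic can be sketched on the example above. This is a minimal illustration using 1-based positions as on the slide; the function name `most_common_offset` is an assumption of the sketch, and Step 4 (joining tuples into a region) is not covered.

```python
from collections import Counter, defaultdict


def most_common_offset(s, t, k=1):
    """Return (offset, count) for the most frequent offset i - j of common k-tuples."""
    # Step 1: index the positions (1-based) of every k-tuple of s.
    positions = defaultdict(list)
    for i in range(len(s) - k + 1):
        positions[s[i:i + k]].append(i + 1)
    # Step 2: for every k-tuple of t that also occurs in s, record offset = i - j.
    offsets = Counter()
    for j in range(len(t) - k + 1):
        for i in positions.get(t[j:j + k], []):
            offsets[i - (j + 1)] += 1
    # Step 3: pick the most common offset.
    return offsets.most_common(1)[0] if offsets else None


print(most_common_offset("HARFYAAQIVL", "VDMAAQIA"))
```

For the example sequences, offset 2 occurs four times (from t[4], t[5], t[6] and t[7]), more often than any other offset, which is why t is shifted by two positions in the alignment.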