6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science

04/18/23 ©Bud Mishra, 2001

L7-1

Computational Biology

Lecture #7: Local Alignment

Bud Mishra
Professor of Computer Science and Mathematics

10 ¦ 22 ¦ 2002


Local Alignment Problem (LAP)

• Finding substrings of high similarity:

• Given two strings, S1 and S2: They may have regions that are locally highly similar.


LAP: Local Alignment Problem

• Given: Two strings $S_1$ and $S_2$

• Find: Substrings $\alpha \sqsubseteq S_1$ and $\beta \sqsubseteq S_2$ whose similarity (in terms of an objective function, e.g., the optimal global alignment value) is maximum over all pairs of substrings from $S_1$ and $S_2$:

$v^* = \max_{\alpha \sqsubseteq S_1,\ \beta \sqsubseteq S_2} \mathrm{distance}(\alpha, \beta)$


Example

• d(x,x) = 2, d(x,y) = -2

• d(x,-) = d(-,x) = -1

• $\alpha$ = a x a b c s $\sqsubseteq S_1$

• $\beta$ = a x b a c s $\sqsubseteq S_2$

$S_1$ = p q r a x a b c s t v q

$S_2$ = x y a x b a c s l l

Local Alignment:

    a  x  a  b  -  c  s
    |  |     |     |  |
    a  x  -  b  a  c  s
    2  2 -1  2 -1  2  2

$\mathrm{distance}(\alpha, \beta) = 8$
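The value 8 is just the sum of the column scores of the gapped alignment. A minimal sketch in Python, assuming the scores of this example (match 2, mismatch -2, space -1); the function names `d` and `alignment_value` are illustrative only:

```python
def d(x, y):
    """Column score from the example: match 2, mismatch -2, space -1."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2


def alignment_value(row1, row2):
    """Sum the column scores of two gapped rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    return sum(d(x, y) for x, y in zip(row1, row2))


print(alignment_value("axab-cs", "ax-bacs"))  # prints 8
```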


Naïve Complexity

• Note: (1) Let $|S_1| = n$ and $|S_2| = m$.

– Total number of substrings of $S_1$ = $\binom{n+1}{2} = O(n^2)$

– Total number of substrings of $S_2$ = $\binom{m+1}{2} = O(m^2)$

– Naïvely, $O(n^2 m^2)$ candidate pairs of substrings $(\alpha, \beta)$ need to be globally aligned by a DP algorithm of complexity $O(|\alpha|\,|\beta|)$

• Complexity of the resulting algorithm = $O(n^3 m^3)$

• (2) An improved algorithm (SWAT, Smith-Waterman) reduces the time complexity to $O(nm)$
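The naïve method in (1) can be sketched directly: enumerate every pair of substrings and globally align each pair. The code below is only an illustration of that brute-force idea, assuming the scores of the earlier example (match 2, mismatch -2, space -1); `global_value` and `naive_local_value` are hypothetical names.

```python
def global_value(a, b, match=2, mismatch=-2, space=-1):
    """Optimal global alignment value of a and b, in O(|a||b|) time."""
    prev = [j * space for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * space]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + space, cur[j - 1] + space))
        prev = cur
    return prev[-1]


def naive_local_value(s1, s2):
    """Try all O(n^2 m^2) substring pairs; the empty pair scores 0."""
    best = 0
    for i in range(len(s1)):
        for k in range(i + 1, len(s1) + 1):
            for j in range(len(s2)):
                for l in range(j + 1, len(s2) + 1):
                    best = max(best, global_value(s1[i:k], s2[j:l]))
    return best


print(naive_local_value("pqraxabcstvq", "xyaxbacsll"))  # prints 8
```

Even on these two short strings the loop aligns several thousand substring pairs, which is the cost the O(nm) method avoids.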


LSAP: Local Suffix Alignment Problem

• A restricted version of the LAP.

• Given: Two strings $S_1$ and $S_2$ and two indices $i \le |S_1|$ and $j \le |S_2|$

– $A_i = S_1[1..i]$, a prefix of $S_1$

– $B_j = S_2[1..j]$, a prefix of $S_2$

• Find: A suffix $\alpha$ of $A_i$ (possibly empty; $\alpha = S_1[k..i]$) and a suffix $\beta$ of $B_j$ (possibly empty; $\beta = S_2[l..j]$) that maximize a linear objective function $V(\alpha, \beta)$ over all pairs of suffixes of $A_i$ and $B_j$.


Objective Function

• $v(i,j) = \max_{\alpha = \mathrm{suf}\, S_1[1..i],\ \beta = \mathrm{suf}\, S_2[1..j]} V(\alpha, \beta)$
  = Value of the optimal local suffix alignment for the given index pair i, j.

• $v^* = \max_{i \le n,\ j \le m} v(i,j)$
  = Value of the optimal local alignment.

• $n = |S_1|$, $m = |S_2|$


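Optimal Local Alignment: Recurrence Equations

• $v^* = \max_{i \le n,\ j \le m} v(i,j)$

• $\alpha = \mathrm{suf}\, S_1[1..i]$, $\beta = \mathrm{suf}\, S_2[1..j]$

• $v^* = v(i', j') = V(\alpha, \beta)$

• Consider an optimal suffix alignment with $\alpha = \mathrm{suf}\, S_1[1..i]$ and $\beta = \mathrm{suf}\, S_2[1..j]$

• Case 1: $\alpha = \beta = \epsilon$ (= empty string)
– Base: $V(\epsilon, \epsilon) = 0$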



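• Case 2: $\alpha \ne \epsilon$, $\alpha = \alpha' \cdot S_1[i]$ and $S_1[i]$ matches “-”

– Ind(A): $V(\alpha, \beta) = V(\alpha', \beta) + d(S_1[i], -)$

• …or $S_1[i]$ matches $S_2[j]$ ($\beta = \beta' \cdot S_2[j]$)

– Ind(C): $V(\alpha, \beta) = V(\alpha', \beta') + d(S_1[i], S_2[j])$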



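• Case 3: $\beta \ne \epsilon$, $\beta = \beta' \cdot S_2[j]$ and $S_2[j]$ matches “-”

– Ind(B): $V(\alpha, \beta) = V(\alpha, \beta') + d(-, S_2[j])$

• …or $S_1[i]$ matches $S_2[j]$ ($\alpha = \alpha' \cdot S_1[i]$)

– Ind(C): $V(\alpha, \beta) = V(\alpha', \beta') + d(S_1[i], S_2[j])$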


Recurrence Equation

• $v(i,j) = \max_{\alpha = \mathrm{suf}\, S_1[1..i],\ \beta = \mathrm{suf}\, S_2[1..j]} V(\alpha, \beta)$

• Base: $v(i,j) = 0$ for $i = 0 \vee j = 0$
  ($v(0,0) = v(i,0) = v(0,j) = 0$)

• Induction: for $i \ne 0 \wedge j \ne 0$,
  $v(i,j) = \max[\,0,\ v(i-1,j) + d(S_1[i], -),\ v(i,j-1) + d(-, S_2[j]),\ v(i-1,j-1) + d(S_1[i], S_2[j])\,]$
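The recurrence translates almost line for line into a table fill. A minimal sketch, assuming the scores of the earlier example (match 2, mismatch -2, space -1); `local_alignment_table` is an illustrative name, and Python's 0-based strings mean that S1[i] is `s1[i - 1]`.

```python
def local_alignment_table(s1, s2, match=2, mismatch=-2, space=-1):
    """Fill v(i, j) for 0 <= i <= n, 0 <= j <= m using the recurrence."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]  # base: row 0 and column 0 are 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            v[i][j] = max(
                0,
                v[i - 1][j] + space,    # S1[i] against a space
                v[i][j - 1] + space,    # S2[j] against a space
                v[i - 1][j - 1] + sub,  # S1[i] against S2[j]
            )
    return v


table = local_alignment_table("pqraxabcstvq", "xyaxbacsll")
print(max(max(row) for row in table))  # prints 8, the value v*
```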


Dynamic Programming Table

• (with Traceback)

• Compute all v(i,j) entries: Complexity = $O(nm)$

• Find $v^* = v(i^*, j^*)$ by finding the largest value in any cell: Complexity = $O(nm)$

• Trace the pointers back from $v(i^*, j^*)$ until a cell is reached with value $v(i', j') = 0$: Complexity = $O(n+m)$

• Results: $\alpha = S_1[i'+1..i^*] \sqsubseteq S_1$ and $\beta = S_2[j'+1..j^*] \sqsubseteq S_2$ (the zero cell $(i', j')$ lies just before the aligned substrings)

• Total Complexity = $O(nm) = O(|S_1| \cdot |S_2|)$
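The three steps (fill, find the maximum, trace back to a zero cell) can be sketched as follows. This is an illustration under assumptions, not a reference implementation: scores are those of the earlier example, the name `smith_waterman` is hypothetical, and instead of storing pointers the traceback re-derives them, preferring diagonal, then left, then up when several moves tie.

```python
def smith_waterman(s1, s2, match=2, mismatch=-2, space=-1):
    """Return (v*, aligned row of S1, aligned row of S2)."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            v[i][j] = max(0, v[i - 1][j - 1] + sub,
                          v[i][j - 1] + space, v[i - 1][j] + space)
            if v[i][j] > best:
                best, bi, bj = v[i][j], i, j
    # Trace back from the best cell until a cell with value 0 is reached.
    row1, row2 = [], []
    i, j = bi, bj
    while v[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if v[i][j] == v[i - 1][j - 1] + sub:    # diagonal move
            row1.append(s1[i - 1])
            row2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif v[i][j] == v[i][j - 1] + space:     # left move: space in S1
            row1.append("-")
            row2.append(s2[j - 1])
            j -= 1
        else:                                    # up move: space in S2
            row1.append(s1[i - 1])
            row2.append("-")
            i -= 1
    return best, "".join(reversed(row1)), "".join(reversed(row2))


print(smith_waterman("pqraxabcstvq", "xyaxbacsll"))
# prints (8, 'axab-cs', 'ax-bacs')
```

With a different tie-breaking order the traceback may return another alignment of the same value.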


Example

Rows are indexed by $S_1$, columns by $S_2$. Arrows give the traceback pointers: ↖ = diagonal, ← = left, ↑ = up.

      .   x   y   a   x   b   a   c   s   l   l
  .   0   0   0   0   0   0   0   0   0   0   0
  p   0   0   0   0   0   0   0   0   0   0   0
  q   0   0   0   0   0   0   0   0   0   0   0
  r   0   0   0   0   0   0   0   0   0   0   0
  a   0   0   0  ↖2  ←1   0  ↖2  ←1   0   0   0
  x   0  ↖2  ←1  ↑1  ↖4  ←3  ←2  ←1   0   0   0
  a   0  ↑1   0  ↖3  ↑3  ↑2  ↖5  ←4  ←3  ←2  ←1
  b   0   0   0  ↑2  ↑2  ↖5  ←4  ←3  ←2  ←1   0
  c   0   0   0  ↑1  ↑1  ↑4  ↑3  ↖6  ←5  ←4  ←3
  s   0   0   0   0   0  ↑3  ↑2  ↑5  ↖8  ←7  ←6
  t   0   0   0   0   0  ↑2  ↑1  ↑4  ↑7  ←6  ←5
  v   0   0   0   0   0  ↑1   0  ↑3  ↑6  ←5  ←4
  q   0   0   0   0   0   0   0  ↑2  ↑5  ←4  ←3


Dealing with Gaps

• A gap is any “maximal consecutive run of spaces” in a single string of a given alignment.

    c t t t a a c - - a - a c
    c - - - c a c c c a t - c

This alignment contains four gaps (g1, g2, g3, g4) and seven spaces.
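Counting gaps amounts to finding the maximal runs of "-" in each row separately. A small sketch using Python's standard `re` module; the helper name `gap_lengths` is illustrative only:

```python
import re


def gap_lengths(row):
    """Lengths of the maximal runs of '-' (the gaps) in one aligned row."""
    return [len(run) for run in re.findall(r"-+", row)]


row1 = "ctttaac--a-ac"
row2 = "c---cacccat-c"
all_gaps = gap_lengths(row1) + gap_lengths(row2)
print(len(all_gaps), sum(all_gaps))  # prints: 4 7  (4 gaps, 7 spaces)
```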


Gaps

• Initial Gap
– A gap may be bordered on the right by the first character of a string.

• Final Gap
– A gap may be bordered on the left by the last character of a string.

• Internal Gap
– A gap may be bordered on both left and right.

• Simple Gap Penalty Model: Constant Weight, $W_g$
– Each gap contributes a constant penalty = $W_g$
– d(x,x) = 2, d(x,y) = -2, d(x,-) = d(-,y) = 0
– # gaps = k. Then
– Value of an alignment = $\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - k\,W_g$


Biological Motivations for Gap Models

– Unequal Crossing-over in Meiosis
– DNA slippage during replication
– Insertion of transposable elements (“Jumping Genes”)
– Insertion by retroviruses
– Translocation between chromosomes

• Examples of Alignment with gaps:
– cDNA matching problem
– Processed Pseudo-gene Problem


Gap Weights

• Constant:
– Each gap has a penalty of $W_g$
– Each space is free: d(x,-) = d(-,x) = 0.

• Affine:
– Gap initiation weight = $W_g$
– Gap extension weight = $W_s$
– Each gap of length q has a penalty of $W_g + q\,W_s$

• Convex:
– Each gap of length q has a penalty of $W_g + W_s \ln q$

• Arbitrary:
– Each gap of length q has a penalty of $W_g + \omega(q)\,W_s$, where $\omega(q)$ = arbitrary function


General Model

• Arbitrary:
– Each gap of length q has a penalty of $W_g + \omega(q)\,W_s$, where $\omega(q)$ = arbitrary function
– $\omega(q) = 0$: constant
– $\omega(q) = q$: linear/affine
– $\omega(q) = \ln q$: convex

• Total Cost under constant model:
$\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - (\#\text{gaps})\,W_g$

• Total Cost under affine model:
$\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - (\#\text{gaps})\,W_g - (\#\text{spaces})\,W_s$
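The general model can be evaluated on a given alignment by summing the column scores and subtracting one penalty per gap. The sketch below assumes d(x,x) = 2, d(x,y) = -2 and free spaces, with the weights W_g = 2 and W_s = 1 chosen purely for illustration; `alignment_value` and its `omega` parameter are hypothetical names.

```python
import math
import re


def column_score(x, y):
    """d(x,x) = 2, d(x,y) = -2; spaces are free, gaps are charged separately."""
    if x == "-" or y == "-":
        return 0
    return 2 if x == y else -2


def alignment_value(row1, row2, wg, ws, omega):
    """Column scores minus (wg + omega(q) * ws) for every gap of length q."""
    base = sum(column_score(x, y) for x, y in zip(row1, row2))
    gaps = [len(g) for row in (row1, row2) for g in re.findall(r"-+", row)]
    return base - sum(wg + omega(q) * ws for q in gaps)


row1 = "ctttaac--a-ac"
row2 = "c---cacccat-c"
print(alignment_value(row1, row2, 2, 1, lambda q: 0))  # constant: prints 0
print(alignment_value(row1, row2, 2, 1, lambda q: q))  # affine: prints -7
print(alignment_value(row1, row2, 2, 1, math.log))     # convex model
```

With four gaps and seven spaces, the constant model gives 8 - 4·2 = 0 and the affine model gives 8 - 4·2 - 7·1 = -7.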


Local Alignment under Arbitrary Gap Weight Model

• Dynamic Programming (Needleman & Wunsch)

• Given two strings $S_1$ and $S_2$, start by aligning the prefixes
– $S_{1,i} = S_1[1..i]$ and
– $S_{2,j} = S_2[1..j]$

• There are three different cases to consider…


Case 1

• S1[i] is aligned to a character strictly to the left of a character S2[j]

[Figure: the prefixes $S_{1,i}$ and $S_{2,j}$ aligned, with $S_1[i]$ placed to the left of $S_2[j]$.]


Case 2

• S1[i] is aligned to a character strictly to the right of a character S2[j]

[Figure: the prefixes $S_{1,i}$ and $S_{2,j}$ aligned, with $S_1[i]$ placed to the right of $S_2[j]$.]


Case 3

• $S_1[i]$ and $S_2[j]$ are aligned opposite each other:
– Subcase A: $S_1[i] = S_2[j]$
– Subcase B: $S_1[i] \ne S_2[j]$

[Figure: the prefixes $S_{1,i}$ and $S_{2,j}$ aligned, with $S_1[i]$ opposite $S_2[j]$.]


Auxiliary Variables

• $XL(i,j) = \max_{\text{alignments for case 1}} \mathrm{distance}(S_1[1..i], S_2[1..j])$

• $XR(i,j) = \max_{\text{alignments for case 2}} \mathrm{distance}(S_1[1..i], S_2[1..j])$

• $XS(i,j) = \max_{\text{alignments for case 3}} \mathrm{distance}(S_1[1..i], S_2[1..j])$

• $V(i,j) = \max(XL(i,j), XR(i,j), XS(i,j))$


Recurrence: Base

• Notation: “?” denotes “undefined”

• XS(0,0) = 0, XS(i,0) = ?, XS(0,j) = ?

• XL(0,0) = ?, XL(i,0) = $-\omega(i)$, XL(0,j) = ?

• XR(0,0) = ?, XR(i,0) = ?, XR(0,j) = $-\omega(j)$

• V(0,0) = 0, V(i,0) = $-\omega(i)$, V(0,j) = $-\omega(j)$


Recurrence: Induction

i > 0 and j > 0:

• $XS(i,j) = V(i-1,j-1) + d(S_1[i], S_2[j])$

• $XL(i,j) = \max_{0 \le k \le j-1} \left( V(i,k) - \omega(j-k) \right)$

• $XR(i,j) = \max_{0 \le l \le i-1} \left( V(l,j) - \omega(i-l) \right)$

• $V(i,j) = \max(XL(i,j), XR(i,j), XS(i,j))$

• Each V(i,j) can be computed in time $O(i+j)$
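A sketch of this recurrence in Python follows. It assumes that the gap function passed in (here called `gap_weight`, a hypothetical name) returns the full weight subtracted for a gap of length q, as in the base case V(i,0), and it uses d(x,x) = 2, d(x,y) = -2. The two inner `max` scans are what make each cell cost O(i+j).

```python
NEG_INF = float("-inf")


def align_arbitrary_gaps(s1, s2, gap_weight, match=2, mismatch=-2):
    """Global alignment value when a gap of length q costs gap_weight(q)."""
    n, m = len(s1), len(s2)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = -gap_weight(i)
    for j in range(1, m + 1):
        V[0][j] = -gap_weight(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            xs = V[i - 1][j - 1] + sub                                # case 3
            xl = max(V[i][k] - gap_weight(j - k) for k in range(j))   # case 1
            xr = max(V[l][j] - gap_weight(i - l) for l in range(i))   # case 2
            V[i][j] = max(xs, xl, xr)
    return V[n][m]


print(align_arbitrary_gaps("axabcs", "axbacs", lambda q: q))      # prints 8
print(align_arbitrary_gaps("axabcs", "axbacs", lambda q: 2 + q))  # prints 4
```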


Total Time Complexity

• Let |S1| = n and |S2| = m.

• The recurrence can be evaluated with a Dynamic Programming Table of space complexity = $O(nm)$ and in time complexity = $O(n^2 m + m^2 n)$


Affine Gap Model: Recurrence

• SWAT: Smith-Waterman

• Modifying the recurrence equations for the affine case:
– XS(0,0) = 0, XS(i,0) = ?, XS(0,j) = ?
– XL(0,0) = ?, XL(i,0) = $-W_g - i\,W_s$, XL(0,j) = ?
– XR(0,0) = ?, XR(i,0) = ?, XR(0,j) = $-W_g - j\,W_s$
– V(0,0) = 0, V(i,0) = $-W_g - i\,W_s$, V(0,j) = $-W_g - j\,W_s$


Recurrence: Induction

i > 0 and j > 0:

• $XS(i,j) = V(i-1,j-1) + d(S_1[i], S_2[j])$

• $XL(i,j) = \max(XL(i,j-1) - W_s,\ ?,\ XS(i,j-1) - W_g - W_s,\ V(i,j-1) - W_g - W_s)$
  $= \max[XL(i,j-1),\ V(i,j-1) - W_g] - W_s$

• $XR(i,j) = \max(?,\ XR(i-1,j) - W_s,\ XS(i-1,j) - W_g - W_s,\ V(i-1,j) - W_g - W_s)$
  $= \max[XR(i-1,j),\ V(i-1,j) - W_g] - W_s$

• $V(i,j) = \max(XL(i,j), XR(i,j), XS(i,j))$

• Each V(i,j) can be computed in O(1) time. The optimal alignment with affine gap weights can be computed with a DP table of space and time complexity = $O(nm)$.
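The constant-time update can be sketched as below, assuming d(x,x) = 2 and d(x,y) = -2. One deliberate deviation, which is an assumption of this sketch and not taken from the slides: the boundary entries XL(i,0) and XR(0,j) are left undefined (-inf), so that a gap in one string is never extended as if it continued in the other string. The name `align_affine` is illustrative.

```python
NEG_INF = float("-inf")


def align_affine(s1, s2, wg, ws, match=2, mismatch=-2):
    """Global alignment value with affine gap weights, O(1) work per cell."""
    n, m = len(s1), len(s2)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    XL = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a space in S1
    XR = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a space in S2
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = -wg - i * ws
    for j in range(1, m + 1):
        V[0][j] = -wg - j * ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            xs = V[i - 1][j - 1] + sub
            XL[i][j] = max(XL[i][j - 1], V[i][j - 1] - wg) - ws
            XR[i][j] = max(XR[i - 1][j], V[i - 1][j] - wg) - ws
            V[i][j] = max(xs, XL[i][j], XR[i][j])
    return V[n][m]


print(align_affine("axabcs", "axbacs", wg=2, ws=1))  # prints 4
print(align_affine("aaa", "a", wg=2, ws=1))          # prints -2
```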


Parallelization

• Systolic Arrays:

• Create a special-purpose processor P(i,j) for the (i,j)th entry of the Dynamic Programming Table.

• Connect P(i,j) to P(i-1,j), P(i-1,j-1) and P(i,j-1)

• Each processor holds static data $W_g$ and $W_s$. Each processor stores and transmits dynamic data: XS(i,j), XL(i,j), XR(i,j) and V(i,j).


Systolic Computation

• Dynamically compute in one cycle:
– XS(i,j), XL(i,j), XR(i,j), V(i,j)

• using
– XS(i-1,j), XL(i-1,j), XR(i-1,j), V(i-1,j)
– XS(i,j-1), XL(i,j-1), XR(i,j-1), V(i,j-1)
– XS(i-1,j-1), XL(i-1,j-1), XR(i-1,j-1), V(i-1,j-1)

• and
– $W_g$ and $W_s$.
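Because each cell depends only on its upper, left, and upper-left neighbours, all cells on one anti-diagonal i + j = t can be computed in the same cycle. The sketch below is a sequential simulation of that schedule, not a hardware description. It assumes the affine recurrence with d(x,x) = 2 and d(x,y) = -2, and it leaves XL(i,0) and XR(0,j) undefined (-inf), which is an assumption of this sketch. The name `wavefront_affine` is hypothetical.

```python
NEG_INF = float("-inf")


def wavefront_affine(s1, s2, wg, ws, match=2, mismatch=-2):
    """Evaluate the affine recurrence one anti-diagonal per cycle."""
    n, m = len(s1), len(s2)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    XL = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    XR = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = -wg - i * ws
    for j in range(1, m + 1):
        V[0][j] = -wg - j * ws
    cycles = 0
    for t in range(2, n + m + 1):            # anti-diagonal i + j = t
        cycles += 1
        # The cells below are mutually independent: one processor each.
        for i in range(max(1, t - m), min(n, t - 1) + 1):
            j = t - i
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            XL[i][j] = max(XL[i][j - 1], V[i][j - 1] - wg) - ws
            XR[i][j] = max(XR[i - 1][j], V[i - 1][j] - wg) - ws
            V[i][j] = max(V[i - 1][j - 1] + sub, XL[i][j], XR[i][j])
    return V[n][m], cycles


print(wavefront_affine("axabcs", "axbacs", wg=2, ws=1))  # prints (4, 11)
```

For non-empty strings of lengths n and m the schedule takes n + m - 1 cycles instead of nm sequential steps.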


Database Search

• BLAST & its relatives:

• A query search $\Rightarrow$ compare the query sequence to all the sequences in the database for local similarities.

• Heuristics:
– BLAST
– FAST

• Needs good complexity analysis


BLAST

• Basic Local Alignment Search Tool

• Query sequence $\in \Sigma^*$, Database $L \subseteq \Sigma^*$

• BLAST returns a list of high-scoring segment pairs between the query sequence and sequences in the database. The score function depends on PAM score functions.


BLAST Heuristics

• BLAST is a 3-step algorithm:

– Step 1. Compile a list of high-scoring strings: $\mathcal{W}$ = words. $\mathcal{W}$ = all w-mers that score at least a threshold value with some w-mer of the query.

– Step 2. Search for hits. Each hit defines a seed. Construct a DFA to recognize $\mathcal{W}$. Scan the database, compiling the hits.

– Step 3. Extend the seeds. The seeds are extended in both directions until the score falls a certain distance below the best so far.
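The seed-and-extend idea can be illustrated with a heavily simplified sketch. It is not BLAST itself: the word list contains only the exact w-mers of the query (as if the threshold admitted identical words only), a dictionary lookup stands in for the DFA, scoring is a plain match/mismatch of 2/-2 in place of a PAM matrix, and extension is ungapped. All function names and the `drop` parameter are assumptions made for illustration.

```python
from collections import defaultdict


def score(x, y, match=2, mismatch=-2):
    return match if x == y else mismatch


def find_seeds(query, subject, w):
    """Steps 1 and 2: index the query's w-mers, then scan the subject for hits."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return [(i, j) for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], [])]


def extend(query, subject, i, j, w, drop):
    """Step 3: extend right, then left, until the score falls `drop` below the best."""
    best = cur = sum(score(query[i + k], subject[j + k]) for k in range(w))
    k, best_hi = w, w
    while i + k < len(query) and j + k < len(subject) and best - cur < drop:
        cur += score(query[i + k], subject[j + k])
        k += 1
        if cur > best:
            best, best_hi = cur, k
    cur, k, best_lo = best, 0, 0
    while i - k - 1 >= 0 and j - k - 1 >= 0 and best - cur < drop:
        k += 1
        cur += score(query[i - k], subject[j - k])
        if cur > best:
            best, best_lo = cur, k
    return best, query[i - best_lo:i + best_hi], subject[j - best_lo:j + best_hi]


seeds = find_seeds("HARFYAAQIVL", "VDMAAQIA", w=3)
print(seeds)                                             # prints [(5, 3), (6, 4)]
print(extend("HARFYAAQIVL", "VDMAAQIA", 5, 3, w=3, drop=3))
# prints (8, 'AAQI', 'AAQI')
```

Seed positions are 0-based indices into the query and the subject.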


FAST

• s, t = Two sequences being compared. |s| = m and |t| = n.

– Step 1. Determine k-tuples common to both sequences; k = 1 or 2.

– Step 2. The “offset” of a common k-tuple is computed. If the common k-tuples start at positions s[i] and t[j], then
  • offset = i - j

– Step 3. Determine the most common offset value to align the sequences.

– Step 4. Combine the common k-tuples to create a region.


Example

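s = H A R F Y A A Q I V L   (positions 1 to 11)
t = V D M A A Q I A         (positions 1 to 8)

• Offsets for 1-tuples (positions of each 1-tuple in s):
– A → (2,6,7)
– F → (4)
– H → (1)
– I → (9)
– L → (11)
– Q → (8)
– R → (3)
– V → (10)
– Y → (5)

• Offsets i - j for each character of t:

  j   t[j]   offsets
  1   V      {9}
  2   D      {}
  3   M      {}
  4   A      {-2, 2, 3}
  5   A      {-3, 1, 2}
  6   Q      {2}
  7   I      {2}
  8   A      {-6, -2, -1}

[Figure: histogram of offsets over the range -9 to 9; offset 2 occurs most often (4 times).]

• Alignment (offset 2; | = match, + = mismatch):

    H A R F Y A A Q I V L
        + + + | | | | +
        V D M A A Q I A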
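Steps 1 to 3 of the offset method can be sketched on this example with a lookup table of positions and a counter of offsets. Positions are 1-based to match the slide; `ktuple_offsets` is an illustrative name.

```python
from collections import Counter, defaultdict


def ktuple_offsets(s, t, k=1):
    """Count the offsets i - j of all k-tuples common to s and t (1-based)."""
    positions = defaultdict(list)
    for i in range(len(s) - k + 1):
        positions[s[i:i + k]].append(i + 1)
    offsets = Counter()
    for j in range(len(t) - k + 1):
        for i in positions.get(t[j:j + k], []):
            offsets[i - (j + 1)] += 1
    return offsets


offsets = ktuple_offsets("HARFYAAQIVL", "VDMAAQIA")
print(offsets.most_common(1))  # prints [(2, 4)]: offset 2 occurs 4 times
```

Shifting t by the winning offset of 2 places A A Q I of t under positions 6 to 9 of s, which is the alignment shown above.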
