Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming

Inexact Matching

• General Problem
  – Input
    • Strings S and T
  – Questions
    • How distant is S from T?
    • How similar is S to T?
• Solution Technique
  – Dynamic programming with cost/similarity/scoring matrix

Biological Motivation

• Read pages 210-214 in textbook
• “First Fact of Biological Sequence Analysis”
  – In biomolecular sequences (DNA, RNA, amino acid sequences), high sequence similarity usually implies significant functional or structural similarity
  – In short: sequence similarity implies functional/structural similarity
  – Converse is NOT true
  – Evolution reuses, builds upon, duplicates, and modifies “successful” structures

Measuring Distance of S and T

• Consider S and T
• We can transform S into T using the following four operations
  – insertion of a character into S
  – deletion of a character from S
  – substitution (replacement) of a character in S by another character (typically in T)
  – matching (no operation)

Example

• S = vintner
• T = writers
• vintner
• wintner (Replace v with w)
• wrintner (Insert r)
• writner (Delete first n)
• writer (Delete second n)
• writers (Insert s)

Example

• Edit Transcript (or just transcript):
  – a string over the operations I (insert), D (delete), R (replace), and M (match) that describes the transformation of one string into the other
• Example
  – R I M D M D M M I
  – v   i n t n e r
  – w r i   t   e r s
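A transcript can be replayed mechanically against the two strings. The following is a minimal sketch, assuming the four-letter operation alphabet I, D, R, M used in the example; the function name `transcript_to_alignment` is hypothetical and not part of the slides.

```python
def transcript_to_alignment(s: str, t: str, transcript: str) -> tuple[str, str]:
    """Replay an edit transcript over {I, D, R, M} and return the two alignment rows."""
    top, bottom = [], []
    i = j = 0  # next unread positions in s and t
    for op in transcript:
        if op == "I":        # insert t[j] into s: dash in the s row
            top.append("-")
            bottom.append(t[j])
            j += 1
        elif op == "D":      # delete s[i]: dash in the t row
            top.append(s[i])
            bottom.append("-")
            i += 1
        elif op in ("R", "M"):  # replace or match: one character from each string
            top.append(s[i])
            bottom.append(t[j])
            i += 1
            j += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    if i != len(s) or j != len(t):
        raise ValueError("transcript does not consume both strings")
    return "".join(top), "".join(bottom)


print(transcript_to_alignment("vintner", "writers", "RIMDMDMMI"))
```

For the vintner/writers example this yields the rows v-intner- and wri-t-ers, with one dash per I or D in the transcript.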

Edit Distance

• Edit distance of strings S and T
  – The minimum number of edit operations (insertion, deletion, replacement) needed to transform string S into string T
  – Also called Levenshtein distance [299]; Levenshtein appears to have been the first to define this concept
• Optimal transcript
  – An edit transcript of S and T that has the minimum number of edit operations
  – Cooptimal transcripts: there may be more than one optimal transcript

Alignment

• A global alignment of strings S and T is obtained
  – by inserting spaces (dashes) into S and T
    • the two strings must end up with the same number of characters (including dashes)
  – then placing the two strings one over the other, matching each character (or dash) in S with a unique character (or dash) in T
  – Note ALL positions in both S and T are involved
  – Later, we will consider local alignments

Alignments and Edit transcripts

• Example Alignment
  – v-intner-
  – wri-t-ers
• Alignments and edit transcripts are interrelated
  – edit transcript: emphasizes process
    • the specific mutational events
  – alignment: emphasizes product
    • the relationship between the two strings
  – Alignments are often easier to work with and visualize
    • also generalize better to more than 2 strings

Edit Distance Problem

• Input
  – 2 strings S and T
• Task
  – Output edit distance of S and T
  – Output optimal edit transcript
  – Output optimal alignment
• Solution method
  – Dynamic Programming

Definition of D(i,j)

• Let D(i,j) be the edit distance of S[1..i] and T[1..j]
  – The edit distance of the first i characters of S with the first j characters of T
  – Let |S| = n, |T| = m
• D(n,m) = edit distance of S and T
• We will compute D(i,j) for all i and j such that 0 <= i <= n, 0 <= j <= m

Recurrence Relation

• Base Case:
  – For 0 <= i <= n, D(i,0) = i
  – For 0 <= j <= m, D(0,j) = j
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – D(i,j) = min
    • D(i-1,j) + 1
    • D(i,j-1) + 1
    • D(i-1,j-1) + d(i,j)
  – d(i,j) = 0 if S(i) = T(j) and is 1 otherwise
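The recurrence translates almost line for line into a table fill. This is a minimal sketch, assuming 0-based Python strings (so S(i) is `s[i - 1]`); the function names are hypothetical.

```python
def edit_distance_table(s: str, t: str) -> list[list[int]]:
    """Return the full (n+1) x (m+1) table D, where D[i][j] = D(i,j)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):      # base case: D(i,0) = i
        D[i][0] = i
    for j in range(m + 1):      # base case: D(0,j) = j
        D[0][j] = j
    for i in range(1, n + 1):   # row by row, from the upper left corner
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # S(i) opposite a dash
                          D[i][j - 1] + 1,       # T(j) opposite a dash
                          D[i - 1][j - 1] + d)   # S(i) opposite T(j)
    return D


def edit_distance(s: str, t: str) -> int:
    """Edit distance of s and t, read from entry (n, m)."""
    return edit_distance_table(s, t)[len(s)][len(t)]


print(edit_distance("vintner", "writers"))
```

For vintner and writers the result is 5, which agrees with the five non-match operations in the transcript RIMDMDMMI.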

What the various cases mean

• D(i,j) = min
  – D(i-1,j) + 1
    • Align S[1..i-1] with T[1..j] optimally
    • Match S(i) with a dash in T
  – D(i,j-1) + 1
    • Align S[1..i] with T[1..j-1] optimally
    • Match a dash in S with T(j)
  – D(i-1,j-1) + d(i,j)
    • Align S[1..i-1] with T[1..j-1] optimally
    • Match S(i) with T(j)

Computing D(i,j) values

D(i,j)     w  r  i  t  e  r  s
        0  1  2  3  4  5  6  7
     0
  v  1
  i  2
  n  3
  t  4
  n  5
  e  6
  r  7

Initialization: Base Case


D(i,j)     w  r  i  t  e  r  s
        0  1  2  3  4  5  6  7
     0  0  1  2  3  4  5  6  7
  v  1  1
  i  2  2
  n  3  3
  t  4  4
  n  5  5
  e  6  6
  r  7  7

Row i=1


D(i,j)     w  r  i  t  e  r  s
        0  1  2  3  4  5  6  7
     0  0  1  2  3  4  5  6  7
  v  1  1  1  2  3  4  5  6  7
  i  2  2
  n  3  3
  t  4  4
  n  5  5
  e  6  6
  r  7  7

Entry i=2, j=3


D(i,j)     w  r  i  t  e  r  s
        0  1  2  3  4  5  6  7
     0  0  1  2  3  4  5  6  7
  v  1  1  1  2  3  4  5  6  7
  i  2  2  2  2  ?
  n  3  3
  t  4  4
  n  5  5
  e  6  6
  r  7  7
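The entry marked ? follows directly from the recurrence and the neighbouring values already in the table. Here S(2) = i and T(3) = i, so d(2,3) = 0; the arithmetic is:

```latex
\begin{aligned}
D(2,3) &= \min\{\, D(1,3) + 1,\; D(2,2) + 1,\; D(1,2) + d(2,3) \,\} \\
       &= \min\{\, 3 + 1,\; 2 + 1,\; 2 + 0 \,\} \\
       &= 2
\end{aligned}
```

The minimum comes from the diagonal term, i.e. matching S(2) with T(3).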

Calculation methodologies

• Location of edit distance
  – D(n,m)

• Example was to calculate row by row

• Can also calculate column by column

• Can also use antidiagonals

• Key is to build from upper left corner

Traceback

• Using table to construct optimal transcript

• Pointers in cell D(i,j)
  – Set a pointer from cell (i,j) to
    • cell (i, j-1) if D(i,j) = D(i, j-1) + 1
    • cell (i-1,j) if D(i,j) = D(i-1,j) + 1
    • cell (i-1,j-1) if D(i,j) = D(i-1,j-1) + d(i,j)
  – Follow path of pointers from (n,m) back to (0,0)

• Example: Figure 11.3 on page 222
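The traceback can be sketched without storing pointers explicitly, by re-testing the three pointer conditions at each cell. This is one possible sketch, not the textbook's code: it returns a single optimal alignment and, as an arbitrary assumption, prefers the diagonal pointer, then the vertical one, then the horizontal one. The name `optimal_alignment` is hypothetical.

```python
def optimal_alignment(s: str, t: str) -> tuple[int, str, str]:
    """Return (edit distance, aligned s row, aligned t row) for one optimal alignment."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + d)

    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:       # follow pointers from (n,m) back to (0,0)
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            top.append(s[i - 1])      # diagonal: S(i) opposite T(j)
            bottom.append(t[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            top.append(s[i - 1])      # vertical: delete S(i)
            bottom.append("-")
            i -= 1
        else:
            top.append("-")           # horizontal: insert T(j)
            bottom.append(t[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


dist, row_s, row_t = optimal_alignment("vintner", "writers")
print(dist)
print(row_s)
print(row_t)
```

Because cooptimal transcripts may exist, a different tie-breaking order can produce a different alignment with the same distance.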

What the pointers mean

• horizontal pointer: cell (i,j) to cell (i, j-1)
  – Align T(j) with a space in S
  – Insert T(j) into S
• vertical pointer: cell (i,j) to cell (i-1, j)
  – Align S(i) with a space in T
  – Delete S(i) from S
• diagonal pointer: cell (i,j) to cell (i-1, j-1)
  – Align S(i) with T(j)
  – Replace S(i) with T(j), or match if S(i) = T(j)

Table and transcripts

• The pointers represent all optimal transcripts

• Theorem:
  – Any path from (n,m) to (0,0) following the pointers specifies an optimal transcript.
  – Conversely, any optimal transcript is specified by such a path.
  – The correspondence between paths and transcripts is one to one.

Running Time

• Initialization of table
  – O(n+m)
• Calculating table and pointers
  – O(nm)
• Traceback for one optimal transcript or optimal alignment
  – O(n+m)

Operation-Weight Edit Distance

• Consider S and T

• We can assign weights to the various operations
  – insertion/deletion of a character: cost d
  – substitution (replacement) of a character: cost r
  – matching: cost e
  – Previous case: d = r = 1, e = 0

Modified Recurrence Relation

• Base Case:
  – For 0 <= i <= n, D(i,0) = i·d
  – For 0 <= j <= m, D(0,j) = j·d
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – D(i,j) = min
    • D(i-1,j) + d
    • D(i,j-1) + d
    • D(i-1,j-1) + d(i,j)
  – d(i,j) = e if S(i) = T(j) and is r otherwise

Alphabet-Weight Edit Distance

• Define weight of each possible substitution
  – r(a,b) where a is being replaced by b, for all a,b in the alphabet
  – For example, with DNA, maybe r(A,T) > r(A,G)
  – Likewise, the insertion/deletion cost I(a) may vary by character

• Operation-weight edit distance is a special case of this variation

• Weighted edit distance refers to this alphabet-weight setting


• Base Case:
  – For 0 <= i <= n, D(i,0) = Σ_{1 <= k <= i} I(S(k))
  – For 0 <= j <= m, D(0,j) = Σ_{1 <= k <= j} I(T(k))
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – D(i,j) = min
    • D(i-1,j) + I(S(i))
    • D(i,j-1) + I(T(j))
    • D(i-1,j-1) + d(i,j)
  – d(i,j) = r(S(i), T(j))
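The alphabet-weight recurrence can be sketched by passing r and I in as functions. Everything concrete below is an assumption made for illustration: the names `weighted_edit_distance`, `dna_r`, and `dna_indel`, and the particular DNA costs (0 for identical bases, 1 within purines or within pyrimidines, 2 across the two classes, 2 for any insertion or deletion).

```python
from typing import Callable


def weighted_edit_distance(s: str, t: str,
                           r: Callable[[str, str], float],
                           indel: Callable[[str], float]) -> float:
    """Alphabet-weight edit distance; r(a, b) is the substitution cost, indel(a) is I(a)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):   # D(i,0) = sum of I(S(k)) for k = 1..i
        D[i][0] = D[i - 1][0] + indel(s[i - 1])
    for j in range(1, m + 1):   # D(0,j) = sum of I(T(k)) for k = 1..j
        D[0][j] = D[0][j - 1] + indel(t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + indel(s[i - 1]),
                          D[i][j - 1] + indel(t[j - 1]),
                          D[i - 1][j - 1] + r(s[i - 1], t[j - 1]))
    return D[n][m]


PURINES = {"A", "G"}  # illustrative assumption: C and T are the pyrimidines


def dna_r(a: str, b: str) -> int:
    """Hypothetical substitution costs, so that r(A,T) > r(A,G)."""
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2


def dna_indel(a: str) -> int:
    """Hypothetical insertion/deletion cost, the same for every base."""
    return 2


print(weighted_edit_distance("A", "G", dna_r, dna_indel))
print(weighted_edit_distance("A", "T", dna_r, dna_indel))
```

Operation-weight edit distance is recovered by passing an r that returns e for equal characters and r otherwise, together with a constant I(a) = d.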

Measuring Similarity of S and T

• Definitions
  – Let Σ be the alphabet for strings S and T
  – Let Σ’ be the alphabet with character - added
  – For any two characters x,y in Σ’, s(x,y) denotes the value (or score) obtained by aligning x with y
  – For a given alignment A of S and T, let S’ and T’ denote the strings after the chosen insertion of spaces and l their new length
  – The value of alignment A is Σ_{1 <= i <= l} s(S’(i),T’(i))

Example

• a b a a - b a b
• a a a a a b - b
• 1 - 2 + 1 + 1 + 0 + 2 + 0 + 2 = 5
• Scoring matrix (symmetric; upper triangle shown)

  s    a    b    -
  a    1   -2    0
  b         2   -1
  -              0
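The value of the example alignment can be checked column by column. This is a small sketch, assuming the scoring matrix is symmetric, which the example itself relies on when it scores b over a as -2; the names `SCORES`, `s`, and `alignment_value` are hypothetical.

```python
# Upper triangle of the example scoring matrix; lookups are symmetric.
SCORES = {
    ("a", "a"): 1, ("a", "b"): -2, ("a", "-"): 0,
    ("b", "b"): 2, ("b", "-"): -1,
    ("-", "-"): 0,
}


def s(x: str, y: str) -> int:
    """Score of aligning x with y, with s(x, y) = s(y, x)."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def alignment_value(s_row: str, t_row: str) -> int:
    """Sum of s over the columns of an alignment given as two equal-length rows."""
    if len(s_row) != len(t_row):
        raise ValueError("alignment rows must have the same length")
    return sum(s(x, y) for x, y in zip(s_row, t_row))


print(alignment_value("abaa-bab", "aaaaab-b"))
```

The printed value is 5, matching the sum on the slide.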

String Similarity Problem

• Input
  – 2 strings S and T
  – Scoring matrix s for alphabet Σ’
• Task
  – Output optimal alignment value of S and T
    • The alignment of S and T with maximal, not minimal, value
  – Output this alignment


• Let V(i,j) be the value of the optimal alignment of S[1..i] and T[1..j]
• Base Case:
  – For 0 <= i <= n, V(i,0) = Σ_{1 <= k <= i} s(S(k),-)
  – For 0 <= j <= m, V(0,j) = Σ_{1 <= k <= j} s(-,T(k))
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – V(i,j) = max
    • V(i-1,j) + s(S(i),-)
    • V(i,j-1) + s(-,T(j))
    • V(i-1,j-1) + s(S(i), T(j))
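The similarity recurrence has the same shape as the edit distance one, with max in place of min and scores looked up in s. The sketch below assumes a hypothetical scoring function `simple_score` (match 2, mismatch -1, space -1); any function over Σ’ could be passed instead.

```python
from typing import Callable


def simple_score(x: str, y: str) -> int:
    """Hypothetical scoring function: match 2, mismatch -1, any space -1."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -1


def global_alignment_value(s_str: str, t_str: str,
                           score: Callable[[str, str], int]) -> int:
    """Value V(n,m) of an optimal global alignment under the given scoring function."""
    n, m = len(s_str), len(t_str)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):   # V(i,0): every S(k) is aligned with a space
        V[i][0] = V[i - 1][0] + score(s_str[i - 1], "-")
    for j in range(1, m + 1):   # V(0,j): every T(k) is aligned with a space
        V[0][j] = V[0][j - 1] + score("-", t_str[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + score(s_str[i - 1], "-"),
                          V[i][j - 1] + score("-", t_str[j - 1]),
                          V[i - 1][j - 1] + score(s_str[i - 1], t_str[j - 1]))
    return V[n][m]


print(global_alignment_value("acgt", "agt", simple_score))
```

With these assumed scores, aligning acgt with a-gt gives three matches and one space, for a value of 5.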

Longest Common Subsequence Problem

• Given 2 strings S and T, a common subsequence is a subsequence that appears in both S and T.

• The longest common subsequence problem is to find a longest common subsequence (lcs) of S and T
  – subsequence: characters need not be contiguous
  – different than substring
• O(nm) solution:
  – Make scoring matrix 1 for match, 0 for mismatch and for space
  – The matched characters in an alignment of maximal value form a longest common subsequence
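The reduction to alignment can be sketched as follows: fill the similarity table with match 1 and everything else 0, then trace back and keep only the matched characters. The function name `longest_common_subsequence` is hypothetical, and the traceback returns one lcs among possibly several.

```python
def longest_common_subsequence(s: str, t: str) -> str:
    """Return one longest common subsequence of s and t in O(nm) time."""
    n, m = len(s), len(t)
    # Base cases are all 0 because spaces score 0.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if s[i - 1] == t[j - 1] else 0
            V[i][j] = max(V[i - 1][j], V[i][j - 1], V[i - 1][j - 1] + match)

    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1] and V[i][j] == V[i - 1][j - 1] + 1:
            out.append(s[i - 1])   # matched column: part of the lcs
            i -= 1
            j -= 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1                 # S(i) opposite a space
        else:
            j -= 1                 # T(j) opposite a space
    return "".join(reversed(out))


print(longest_common_subsequence("vintner", "writers"))
```

For vintner and writers the only characters the strings share are i, t, e, and r, and the sketch returns iter.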

Similarity and Distance

• If we are focused on aligning both entire strings, maximizing similarity is essentially identical to minimizing distance
  – Just need to modify scoring matrices appropriately
• When we consider substrings of uncertain length, maximizing similarity often makes more sense than minimizing distance
  – Overlapping strings
  – Local alignment

Overlapping Strings

• Find best alignment where the two strings overlap without penalizing for the unmatched ends
  – Application: sequence assembly problem
    • strings are likely to overlap without being substrings of each other
• Solution method
  – End-space free variant of dynamic programming
  – Change base conditions so that V(i,0) = V(0,j) = 0
  – Need to search over row n and column m for optimal value
    • Optimal value may not be in entry (n,m)
• Why is max similarity better than min distance?
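The end-space free variant changes only the base conditions and where the answer is read. A minimal sketch, assuming a hypothetical scoring function `unit_score` (match 1, mismatch -1, space -1) and a hypothetical function name `best_overlap_value`:

```python
from typing import Callable


def unit_score(x: str, y: str) -> int:
    """Hypothetical scoring function: match 1, mismatch -1, any space -1."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


def best_overlap_value(s_str: str, t_str: str,
                       score: Callable[[str, str], int]) -> int:
    """Best alignment value when spaces at either end of the alignment are free."""
    n, m = len(s_str), len(t_str)
    # Leading end spaces are free: V(i,0) = V(0,j) = 0.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + score(s_str[i - 1], "-"),
                          V[i][j - 1] + score("-", t_str[j - 1]),
                          V[i - 1][j - 1] + score(s_str[i - 1], t_str[j - 1]))
    # Trailing end spaces are free: search row n and column m.
    last_row = max(V[n])
    last_col = max(V[i][m] for i in range(n + 1))
    return max(last_row, last_col)


print(best_overlap_value("xxxabc", "abcyyy", unit_score))
```

Here the suffix abc of the first string overlaps the prefix abc of the second, so the best value is 3 and it is found in row n at column 3, not in entry (n,m).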

Maximally Similar Substrings

• Local alignment problem
  – Input
    • Two strings S and T
  – Task
    • Find substrings s and t of S and T whose alignment value is the maximum possible, and report this value.
    • Let v* denote this value.
• Why is max similarity better than min distance?
• Read pages 230-231 for motivation

Local suffix alignments

• Define v(i,j) to be the value of the optimal alignment of any of the i+1 suffixes of S[1..i] with any of the j+1 suffixes of T[1..j].
  – We bound v(i,j) to be at least 0 by scoring the alignment of two empty suffixes to be 0
• Theorem
  – v* (the value of the optimal local alignment) = max{ v(i,j) | 1 <= i <= n, 1 <= j <= m}

Recurrences for local suffix alignments

• Base Case:
  – For 0 <= i <= n, v(i,0) = 0
  – For 0 <= j <= m, v(0,j) = 0
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – v(i,j) = max
    • 0
    • v(i-1,j) + s(S(i),-)
    • v(i,j-1) + s(-,T(j))
    • v(i-1,j-1) + s(S(i), T(j))
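The local suffix recurrence differs from the global one only in the extra 0 term and in taking the maximum over the whole table. This is a sketch under assumed scores: `local_score` (match 2, mismatch -1, space -1) and the function name `local_alignment_value` are hypothetical.

```python
from typing import Callable


def local_score(x: str, y: str) -> int:
    """Hypothetical scoring function: match 2, mismatch -1, any space -1."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -1


def local_alignment_value(s_str: str, t_str: str,
                          score: Callable[[str, str], int]) -> int:
    """Return v*, the maximum of v(i,j) over the whole table."""
    n, m = len(s_str), len(t_str)
    v = [[0] * (m + 1) for _ in range(n + 1)]   # v(i,0) = v(0,j) = 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(0,   # the two empty suffixes
                          v[i - 1][j] + score(s_str[i - 1], "-"),
                          v[i][j - 1] + score("-", t_str[j - 1]),
                          v[i - 1][j - 1] + score(s_str[i - 1], t_str[j - 1]))
            best = max(best, v[i][j])
    return best


print(local_alignment_value("xxabcxx", "yyabcyy", local_score))
print(local_alignment_value("abcde", "abXcde", local_score))
```

In the first call the best local alignment is abc against abc (value 6); in the second it spans abcde against abXcde with one space opposite X (value 9), which beats any gap-free piece.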

Comments

• Traceback
  – No longer start from cell (n,m)
  – Search whole table for max value and start from there
  – Still O(mn) running time
• Terminology
  – In the literature, the distinction between problem statements and solution methods is not always clear
  – Global alignment is often referred to as Needleman-Wunsch alignment
    • Their solution method was cubic in terms of m,n
  – Smith-Waterman is often used to refer to both local alignments and their solution method

Comments continued

• Scoring schemes
  – The utility of optimal local alignments is highly dependent on the scoring scheme
  – Examples
    • matches 1, mismatches & spaces 0 leads to longest common subsequence
    • mismatches and spaces big negatives leads to longest common substring
  – Average score in matrix must be negative, otherwise local alignments tend to be global
  – There is a theory developed about scoring schemes that we will cover later.

Aligning with Gaps

• Gaps: Any maximal run of spaces in a single string of a given alignment

• Example
  – S = aaabbbcccdddeeefff
  – T = aaabbbdddeeefffggg
  – Alignment
    • aaabbbcccdddeeefff---
    • aaabbb---dddeeefffggg

Scoring with gaps

• Example Scoring
  – aaabbbcccdddeeefff---
  – aaabbb---dddeeefffggg
  – (6 × 1) - 1 + (9 × 1) - 1 = 13, with each match scoring 1 and each gap costing 1
• Why include gaps in scoring schemes?
  – Read 236-240
  – When an insertion/deletion event occurs, often more than a single character is inserted or deleted.
  – A single gap cost helps model the fact that a sequence of insertions/deletions is really one mutational event
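The example score can be reproduced by counting gaps as maximal runs of dashes in each row. This is a small sketch, assuming match 1, mismatch cost 1, gap cost 1, and free individual spaces; the function names are hypothetical.

```python
import re


def count_gaps(row: str) -> int:
    """Number of gaps, i.e. maximal runs of '-' in one row of an alignment."""
    return len(re.findall(r"-+", row))


def gap_scored_value(s_row: str, t_row: str,
                     match: int = 1, mismatch: int = 1, wg: int = 1) -> int:
    """Match scores - mismatch scores - wg * (# of gaps); spaces themselves are free."""
    if len(s_row) != len(t_row):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for x, y in zip(s_row, t_row):
        if x == "-" or y == "-":
            continue                      # charged per gap below, not per space
        total += match if x == y else -mismatch
    return total - wg * (count_gaps(s_row) + count_gaps(t_row))


print(gap_scored_value("aaabbbcccdddeeefff---", "aaabbb---dddeeefffggg"))
```

The alignment has 15 matches and 2 gaps, so the printed value is 13.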

Constant gap weight model

• We present a series of possible gap weight models, each of which is a special case of the next one

• Constant gap weight model
  – each individual space is free (Ws = 0)
  – each gap has constant cost Wg
  – Alignment problem boils down to finding an alignment that maximizes
    • Match scores - mismatch scores - Wg(# of gaps)
  – Dynamic programming can still solve in O(nm) time

Affine gap weight model

• Gap opening versus gap extension penalties
  – each gap has constant cost Wg
  – each individual space has cost Ws, typically with Ws < Wg
• Alignment problem boils down to finding an alignment that maximizes
  – Match scores - mismatch scores - Wg(# of gaps) - Ws(# of spaces)
• Dynamic programming can still solve in O(nm) time
• Probably most commonly used model because of efficiency and generality of model

Convex gap weight model

• Extension penalty should not be a constant but rather decrease as length of gap increases
  – One example
    • each gap has cost Wg + log q where q is the length of the gap
• Solution now requires more than O(nm) time
  – In chapter 13 is an O(nm log m) time solution
  – Further improvement is possible, but costly

Arbitrary gap weight model

• Gap cost is an arbitrary function of gap length
  – each gap has cost w(q) where q is the length of the gap
  – no properties are assumed of w(q), such as its second derivative being negative
• Solution time is now O(nm^2 + n^2 m)
  – cubic cost, similar to original Needleman-Wunsch solution

Recurrences for arbitrary gap weights

• Base Case:
  – V(0,0) = 0
  – For 0 < i <= n, V(i,0) = -w(i)
  – For 0 < j <= m, V(0,j) = -w(j)
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – V(i,j) = max
    • V(i-1,j-1) + s(S(i),T(j))
    • max over 0 <= k <= j-1 of [V(i,k) - w(j-k)]
      – Align S[1..i] with T[1..k], then align T[k+1..j] with a gap of length j-k
    • max over 0 <= k <= i-1 of [V(k,j) - w(i-k)]
      – Align S[1..k] with T[1..j], then align S[k+1..i] with a gap of length i-k
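The arbitrary gap weight recurrence can be sketched directly, at cubic cost because each cell scans its whole row and column. The sketch assumes gap costs are subtracted, so the base cases are V(i,0) = -w(i) and V(0,j) = -w(j) with V(0,0) = 0. The names `match_score` and `arbitrary_gap_value` are hypothetical.

```python
from typing import Callable


def match_score(x: str, y: str) -> int:
    """Hypothetical character scoring: match 1, mismatch -1."""
    return 1 if x == y else -1


def arbitrary_gap_value(s_str: str, t_str: str,
                        score: Callable[[str, str], int],
                        w: Callable[[int], float]) -> float:
    """Optimal global alignment value when a gap of length q costs w(q)."""
    n, m = len(s_str), len(t_str)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w(i)           # S[1..i] opposite one gap of length i
    for j in range(1, m + 1):
        V[0][j] = -w(j)           # T[1..j] opposite one gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = V[i - 1][j - 1] + score(s_str[i - 1], t_str[j - 1])
            for k in range(j):    # T[k+1..j] opposite a gap of length j-k
                best = max(best, V[i][k] - w(j - k))
            for k in range(i):    # S[k+1..i] opposite a gap of length i-k
                best = max(best, V[k][j] - w(i - k))
            V[i][j] = best
    return V[n][m]


print(arbitrary_gap_value("aaabbbcccdddeeefff", "aaabbbdddeeefffggg",
                          match_score, lambda q: 1))
```

With a constant gap cost of 1 the sketch gives 13 for the two example strings (15 matches, 2 gaps). Passing `lambda q: wg + ws * q` turns it into a slow version of the affine model.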

Recurrences for affine gap weights

• Base Case:
  – For 0 <= i <= n, V(i,0) = E(i,0) = -Wg - iWs
  – For 0 <= j <= m, V(0,j) = F(0,j) = -Wg - jWs
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – V(i,j) = max [E(i,j), F(i,j), G(i,j)]
    • G(i,j) = V(i-1,j-1) + s(S(i),T(j))
    • E(i,j) = max [E(i,j-1), V(i,j-1) - Wg] - Ws
      – max checks if gap begins at S(i) or if it began earlier
    • F(i,j) = max [F(i-1,j), V(i-1,j) - Wg] - Ws
      – max checks if gap begins at T(j) or if it began earlier
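The three-table recurrence runs in O(nm) time. The sketch below keeps the recursive case as stated but makes two labeled assumptions about the boundary that differ from the base cases as written: V(0,0) = 0, and E(i,0) and F(0,j) are set to minus infinity, so that a gap in one string is never extended as if it were a gap in the other. The names `match_score` and `affine_gap_value` are hypothetical.

```python
from typing import Callable

NEG_INF = float("-inf")


def match_score(x: str, y: str) -> int:
    """Hypothetical character scoring: match 1, mismatch -1."""
    return 1 if x == y else -1


def affine_gap_value(s_str: str, t_str: str,
                     score: Callable[[str, str], int],
                     wg: float, ws: float) -> float:
    """Optimal global alignment value with gap cost wg + ws * (gap length)."""
    n, m = len(s_str), len(t_str)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with T(j) opposite a space
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with S(i) opposite a space
    V[0][0] = 0                                      # assumption: empty alignment scores 0
    for j in range(1, m + 1):
        E[0][j] = -wg - j * ws
        V[0][j] = E[0][j]
    for i in range(1, n + 1):
        F[i][0] = -wg - i * ws
        V[i][0] = F[i][0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G = V[i - 1][j - 1] + score(s_str[i - 1], t_str[j - 1])
            # Extend an existing gap, or open a new one and pay wg.
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - wg) - ws
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - wg) - ws
            V[i][j] = max(E[i][j], F[i][j], G)
    return V[n][m]


print(affine_gap_value("aaabbbcccdddeeefff", "aaabbbdddeeefffggg",
                       match_score, wg=1, ws=0))
```

With Wg = 1 and Ws = 0 this reduces to the constant gap weight model and gives 13 for the example strings.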
