
Multiple Sequence Alignment and Similarity

Seminar Report

Arun Moorthy (CS 93115)

Department of Computer Science & Engineering

Indian Institute of Technology, Madras


Contents

1 Introduction
2 Background: Sequence Similarity & Alignment
    2.1 Definitions
    2.2 Optimal Alignments by Dynamic Programming
3 Multiple Sequence Similarity
    3.1 Multiple String Alignment Problem
    3.2 Optimal Multiple Alignment by DP
4 Approximation Algorithm for Multiple String Alignment
    4.1 Assumptions
    4.2 Definitions
    4.3 Algorithm
    4.4 Time Analysis
    4.5 Error Analysis
5 Consensus String
    5.1 Example
6 Concluding Remarks


1 Introduction

Sequence relationships are usually not restricted to those between two sequences. Rather, they extend to relatedness among a family of sequences. This leads us to the study of the alignment of r (r > 2) sequences.

Protein databases are often categorized by protein families. A protein family is a collection of proteins with similar structure (i.e., shape), similar function, or similar evolutionary history. A consensus sequence for the family can be derived from a good multiple sequence alignment of all its members.

Multiple sequence alignment helps to gain insight into evolutionary relationships. We assume that the consensus sequence is the ancestor of all other sequences in the family, and by looking at the number of edits that are necessary to get from the ancestor sequence to a member sequence, we get an estimate of the time when the two sequences diverged in evolutionary history.

Section 2 summarizes the concepts involved in sequence similarity and alignment. Section 3 introduces multiple sequence similarity and alignment. In Section 4, we study an approximation algorithm for multiple string alignment, discuss its time complexity, and perform an error analysis. Section 5 defines the consensus string.


2 Background: Sequence Similarity & Alignment

Given two sequences, it is of great use to find out how similar they are. (Remember that the genome is a string over the alphabet {A, C, T, G}.) Similarity is obtained by finding a good alignment between them.

2.1 Definitions

1. $\sigma(x, y)$ denotes the score for matching two characters $x$ and $y$.

2. An alignment maps two strings $S$ and $T$ into strings $S'$ and $T'$, where $|S'| = |T'|$ and removal of all spaces from $S'$ and $T'$ leaves $S$ and $T$. The value of the alignment is $\sum_{i=1}^{l} \sigma(S'[i], T'[i])$, where $l = |S'| = |T'|$.

3. An optimal alignment is one with the best value, and the value is denoted by Sim(S, T).

2.2 Optimal Alignments by Dynamic Programming

We can use the principle of dynamic programming (DP) to obtain optimal alignments between two strings. The exact procedure is as follows:

Let $S[1..m]$ and $T[1..n]$ be the two strings which we wish to align.

Let $V(i, j)$ be the value of an optimal alignment for $S[1..i]$ and $T[1..j]$. $V(m, n)$ can be computed using the recurrence

\[
V(i, j) = \max \begin{cases}
V(i-1, j-1) + \sigma(S[i], T[j]), \\
V(i-1, j) + \sigma(S[i], -), \\
V(i, j-1) + \sigma(-, T[j])
\end{cases}
\qquad \forall\, i, j > 0,
\]

with the boundary values

\[
V(i, 0) = \sum_{k=1}^{i} \sigma(S[k], -) \quad \text{and} \quad V(0, j) = \sum_{k=1}^{j} \sigma(-, T[k]),
\]

where $-$ denotes the space character.
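The recurrence above can be sketched in Python as follows. This is a minimal illustration rather than the report's own code: the function name `align`, the example scores in `sigma` (+1 for a match, -1 for a mismatch or a space), and the use of `-` as the space character are all assumptions made for the example.

```python
def sigma(x, y):
    # Hypothetical scoring function: +1 for a match, -1 for a mismatch or a space.
    return 1 if x == y and x != "-" else -1


def align(s, t, score=sigma, gap="-"):
    """Return (value, s_aligned, t_aligned) for one optimal global alignment."""
    m, n = len(s), len(t)
    # V[i][j] holds the value of an optimal alignment of s[:i] and t[:j].
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        V[i][0] = V[i - 1][0] + score(s[i - 1], gap)
    for j in range(1, n + 1):
        V[0][j] = V[0][j - 1] + score(gap, t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                V[i - 1][j] + score(s[i - 1], gap),
                V[i][j - 1] + score(gap, t[j - 1]),
            )
    # Trace back from V[m][n] to recover one optimal alignment.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            a.append(s[i - 1])
            b.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and V[i][j] == V[i - 1][j] + score(s[i - 1], gap):
            a.append(s[i - 1])
            b.append(gap)
            i -= 1
        else:
            a.append(gap)
            b.append(t[j - 1])
            j -= 1
    return V[m][n], "".join(reversed(a)), "".join(reversed(b))


value, s_aligned, t_aligned = align("ACGT", "AGT")
print(value)
print(s_aligned)
print(t_aligned)
```

The table has (m + 1)(n + 1) entries and each is filled in constant time, so the sketch runs in O(mn) time and space.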


3 Multiple Sequence Similarity

3.1 Multiple String Alignment Problem

Given strings $S_1, S_2, \ldots, S_k$, a multiple (global) alignment maps them to strings $S'_1, S'_2, \ldots, S'_k$ that may contain spaces, where

1. $|S'_1| = |S'_2| = \cdots = |S'_k|$, and

2. the removal of all spaces from $S'_i$ leaves $S_i$, for $1 \le i \le k$.

There are several ways to define the value of the multiple alignment.

1. One method involves redefining $\sigma$ (the scoring function). $\sigma$ is now defined to be a function

\[
\sigma : (\mathcal{A} \cup \{-\})^{r} \to \mathbb{R}.
\]

Assume that a pairwise distance $d$ is given for the entire alphabet. Then we can define $\sigma$ to be

\[
\sigma(a_1, a_2, \ldots, a_r) = \min_{\alpha \in \mathcal{A}} \sum_{i=1}^{r} d(a_i, \alpha).
\]

The minimizing $\alpha$ is analogous to a "center of gravity" letter.

2. Another method makes use of the same pairwise scoring function. The sum-of-pairs (SP) value for a multiple global alignment $A$ of $k$ strings is the sum of the values of all $\binom{k}{2}$ pairwise alignments induced by $A$ (assuming that the scoring function is symmetric).

Let us define $\delta(x, y)$ to measure the distance between characters $x$ and $y$, so that an alignment receives a higher value the more distant the two strings are. In the case of two strings, we will thus be trying to minimize

\[
\sum_{i=1}^{l} \delta(S'[i], T'[i]),
\]

where $l = |S'| = |T'|$.

An optimal SP (global) alignment of strings $S_1, S_2, \ldots, S_k$ is an alignment that has the minimum possible sum-of-pairs value for these $k$ strings.
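As a small illustration of the sum-of-pairs definition, the sketch below computes the SP value of an already aligned set of rows. The unit-cost distance in `delta`, the helper names, and the three example rows are assumptions chosen for the illustration; they do not come from the report.

```python
from itertools import combinations


def delta(x, y):
    # Assumed unit-cost distance: 0 for identical symbols (including two spaces), else 1.
    return 0 if x == y else 1


def induced_distance(row_a, row_b):
    """Value of the pairwise alignment that the multiple alignment induces on two rows."""
    return sum(delta(x, y) for x, y in zip(row_a, row_b))


def sp_value(rows):
    """Sum of the induced pairwise values over all k-choose-2 pairs of rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(induced_distance(a, b) for a, b in combinations(rows, 2))


rows = ["AC-GT", "A-TGT", "ACTG-"]
print(sp_value(rows))
```

Each of the three pairs in this example differs in exactly two columns, so the SP value is 2 + 2 + 2 = 6.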


3.2 Optimal Multiple Alignment by DP

Given $k$ strings each of length $n$, there is a generalization of the dynamic programming algorithm that finds an optimal SP alignment. Instead of a 2-dimensional table, it fills in a $k$-dimensional table. This table has dimensions

\[
(n + 1) \times (n + 1) \times \cdots \times (n + 1) \quad (k \text{ times}),
\]

so the table has $(n + 1)^k$ entries. Each entry depends on $2^k - 1$ adjacent entries, corresponding to the possibilities for the last column of an optimal alignment: each string contributes either its last character or a space, but not all $k$ strings can contribute a space. Each of the $(n + 1)^k$ entries can be computed in time proportional to $2^k$, and hence the running time of the algorithm is $O((2n)^k)$. For a typical protein, $n \approx 200$. Hence the algorithm will be practical only for very small values of $k$, perhaps 3 or 4. But there are hundreds of members in protein families! Any useful algorithm must work for $k$ in the hundreds too.
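To make the growth concrete, the following sketch evaluates the table size and the rough amount of work for n = 200 and a few small values of k. The function names are hypothetical, and the work estimate simply multiplies the number of entries by the 2^k - 1 predecessors of each entry, ignoring constant factors.

```python
def dp_table_entries(n, k):
    """Number of entries in the k-dimensional DP table for k strings of length n."""
    return (n + 1) ** k


def dp_work(n, k):
    """Rough operation count: every entry looks at 2**k - 1 adjacent entries."""
    return dp_table_entries(n, k) * (2 ** k - 1)


for k in range(2, 6):
    print(k, dp_table_entries(200, k), dp_work(200, k))
```

Already at k = 4 the table has more than 1.6 billion entries, which is why the exact method stops being practical so quickly.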


4 Approximation Algorithm for Multiple String Alignment

In short, we need an algorithm whose running time is polynomial in both $n$ and $k$. Such an algorithm is unlikely to exist for the exact problem because, according to Wang and Jiang, the optimal SP alignment problem is NP-complete.

We will therefore look at an approximation approach to the multiple string alignment problem. We will show that the algorithm runs in polynomial time and produces multiple string alignments whose SP values are less than twice those of the optimal solutions.

4.1 Assumptions

We assume the following of the distance function $\delta$:

Triangle inequality: $\delta(x, z) \le \delta(x, y) + \delta(y, z)$, for all characters $x$, $y$, and $z$.

$\delta(x, x) = 0$ for all characters $x$.
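Both assumptions are easy to check mechanically for a concrete distance function over a small alphabet. The sketch below is only an illustration: `satisfies_assumptions`, `unit_delta`, and `skewed_delta` are hypothetical names, and the space character is written as `-` and treated as one more symbol.

```python
from itertools import product


def satisfies_assumptions(delta, symbols):
    """Check delta(x, x) = 0 and the triangle inequality over all symbol triples."""
    zero_on_diagonal = all(delta(x, x) == 0 for x in symbols)
    triangle = all(
        delta(x, z) <= delta(x, y) + delta(y, z)
        for x, y, z in product(symbols, repeat=3)
    )
    return zero_on_diagonal and triangle


def unit_delta(x, y):
    # Unit-cost distance: 0 for identical symbols, 1 otherwise.
    return 0 if x == y else 1


def skewed_delta(x, y):
    # Counterexample: A to C costs 5, but the detour A to G to C costs only 2.
    if x == y:
        return 0
    return 5 if {x, y} == {"A", "C"} else 1


print(satisfies_assumptions(unit_delta, "ACGT-"))
print(satisfies_assumptions(skewed_delta, "ACGT-"))
```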

4.2 Definitions

For all strings $S$ and $T$, define $D(S, T)$ to be the value of a minimum-value global alignment of $S$ and $T$.

4.3 Algorithm

Input: A set $\mathcal{T}$ of $k$ strings.

Step 1

Find $S_1 \in \mathcal{T}$ that minimizes

\[
\sum_{S \in \mathcal{T} - \{S_1\}} D(S_1, S).
\]

This is done by running the DP algorithm on each of the $\binom{k}{2}$ pairs of strings in $\mathcal{T}$.

Step 2

Call the remaining strings in $\mathcal{T}$ $S_2, \ldots, S_k$. Add these strings one at a time to a multiple alignment that initially consists only of $S_1$, as follows:

Suppose $S_1, S_2, \ldots, S_{i-1}$ are already aligned as $S'_1, S'_2, \ldots, S'_{i-1}$. To add $S_i$, run the dynamic programming algorithm mentioned earlier on $S'_1$ and $S_i$ to produce $S''_1$ and $S'_i$. Adjust $S'_2, \ldots, S'_{i-1}$ by adding spaces to those columns where spaces were added to get $S''_1$ from $S'_1$. Replace $S'_1$ by $S''_1$.
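The two steps can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the report's implementation: the unit-cost `delta`, the `-` space symbol, and the names `align_min`, `add_to_alignment`, and `center_star` are all hypothetical. Step 1 here runs the DP on ordered pairs, which doubles the work relative to the k-choose-2 unordered pairs but keeps the code short.

```python
GAP = "-"


def delta(x, y):
    # Assumed unit-cost distance; it satisfies delta(x, x) = 0 and the triangle inequality.
    return 0 if x == y else 1


def align_min(s, t):
    """Minimum-cost global alignment of s and t: returns (D, s_aligned, t_aligned)."""
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        V[i][0] = V[i - 1][0] + delta(s[i - 1], GAP)
    for j in range(1, n + 1):
        V[0][j] = V[0][j - 1] + delta(GAP, t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            V[i][j] = min(V[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                          V[i - 1][j] + delta(s[i - 1], GAP),
                          V[i][j - 1] + delta(GAP, t[j - 1]))
    a, b, i, j = [], [], m, n
    while i > 0 or j > 0:  # trace back one optimal path
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            a.append(s[i - 1])
            b.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and V[i][j] == V[i - 1][j] + delta(s[i - 1], GAP):
            a.append(s[i - 1])
            b.append(GAP)
            i -= 1
        else:
            a.append(GAP)
            b.append(t[j - 1])
            j -= 1
    return V[m][n], "".join(reversed(a)), "".join(reversed(b))


def add_to_alignment(rows, new):
    """Step 2 for one string: rows[0] plays the role of S'_1; returns the extended rows."""
    old_center = rows[0]
    _, center, new_row = align_min(old_center, new)
    padded = [[] for _ in rows]
    p = 0  # position in the old center row
    for c in center:
        if p < len(old_center) and c == old_center[p]:
            for out, row in zip(padded, rows):  # existing column: copy it
                out.append(row[p])
            p += 1
        else:
            for out in padded:  # newly added space column: pad every old row
                out.append(GAP)
    return ["".join(out) for out in padded] + [new_row]


def center_star(strings):
    k = len(strings)
    # Step 1: choose the string with the smallest total distance to the others.
    totals = [sum(align_min(strings[c], strings[j])[0] for j in range(k) if j != c)
              for c in range(k)]
    c = totals.index(min(totals))
    # Step 2: add the remaining strings one at a time.
    rows = [strings[c]]
    for j in range(k):
        if j != c:
            rows = add_to_alignment(rows, strings[j])
    return rows


for row in center_star(["acd", "abcd", "abd", "bcd"]):
    print(row)
```

When a newly added space sits next to an existing space in the center row, either of the two columns can be treated as the new one; the alignment of every row against the center is the same in both readings.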

4.4 Time Analysis

Theorem. The approximation algorithm runs in time $O(k^2 n^2)$ when given $k$ strings each of length at most $n$.

Proof. Each of the $\binom{k}{2}$ values $D(S, T)$ required to compute $S_1$ can be computed in $O(n^2)$ time, so the total time for Step 1 is $O(k^2 n^2)$. After adding $S_i$ to the multiple string alignment, the length of $S'_1$ is at most $in$, so the time to add the remaining $k - 1$ strings to the multiple string alignment is

\[
\sum_{i=1}^{k-1} O((in) \cdot n) = O(k^2 n^2).
\]
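Written out, the Step 2 sum is an arithmetic series; the following worked line uses only the quantities already defined in the proof:

```latex
\[
\sum_{i=1}^{k-1} (i n) \cdot n
  \;=\; n^{2} \sum_{i=1}^{k-1} i
  \;=\; n^{2} \, \frac{k(k-1)}{2}
  \;=\; O(k^{2} n^{2}).
\]
```

Step 2 therefore has the same order of growth as Step 1, and the two together give the stated bound.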

4.5 Error Analysis

Let $M$ be the alignment produced by the algorithm, let $d(i, j)$ be the distance $M$ induces on the pair $S_i, S_j$, and let

\[
v(M) = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d(i, j).
\]

Let $M^*$ be the optimal alignment, $d^*(i, j)$ be the distance $M^*$ induces on the pair $S_i, S_j$, and

\[
v(M^*) = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d^*(i, j).
\]

We now prove that

\[
\frac{v(M)}{v(M^*)} \le \frac{2(k - 1)}{k} < 2.
\]


In other words, the algorithm produces an alignment whose SP value is less than twice that of the optimal SP alignment. (Since $v$ counts every pair twice, it is exactly twice the SP value, so the ratio is unchanged.)

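Proof

We will derive an upper bound on $v(M)$ and a lower bound on $v(M^*)$, and then take their quotient.

\[
\begin{aligned}
v(M) &= \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d(i, j) \\
     &\le \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} \bigl( d(i, 1) + d(1, j) \bigr)
        && \text{(triangle inequality)} \\
     &= 2(k - 1) \sum_{l=2}^{k} d(1, l)
        && \text{(because each $d(l, 1)$ occurs in $2(k - 1)$ terms of the second line)} \\
     &= 2(k - 1) \sum_{l=2}^{k} D(S_1, S_l).
\end{aligned}
\]

The last equality holds because the algorithm aligns each $S_l$ optimally with $S_1$, and the spaces added afterwards only pair a space with a space in these two rows, at cost $\delta(-, -) = 0$.

Now for $v(M^*)$:

\[
\begin{aligned}
v(M^*) &= \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d^*(i, j) \\
       &\ge \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} D(S_i, S_j)
          && \text{(by definition of $D$)} \\
       &\ge \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1, S_j)
          && \text{(by definition of $S_1$)} \\
       &= k \sum_{l=2}^{k} D(S_1, S_l).
\end{aligned}
\]

Combining these two bounds,

\[
\frac{v(M)}{v(M^*)} \le \frac{2(k - 1)}{k} < 2.
\]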

Note that, for small values of $k$, the approximation is significantly better than a factor of 2.
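For a sense of scale, the bound $2(k-1)/k$ can be evaluated for a few values of $k$:

```latex
\[
\begin{array}{c|ccccc}
k & 2 & 3 & 4 & 6 & 10 \\ \hline
\dfrac{2(k-1)}{k} & 1 & \tfrac{4}{3} \approx 1.33 & \tfrac{3}{2} = 1.5 & \tfrac{5}{3} \approx 1.67 & \tfrac{9}{5} = 1.8
\end{array}
\]
```

The value 1 at $k = 2$ is consistent with the algorithm: with only two strings it returns an optimal pairwise alignment.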


5 Consensus String

Given a multiple string alignment, we would like to derive from it a consensus string that can be used to represent the entire set of strings in the alignment.

Given a multiple alignment $M$ of strings $S_1, S_2, \ldots, S_k$, the consensus character of column $i$ is the character $c_i$ that minimizes the sum of distances to it from all the characters in column $i$; that is, it minimizes $\sum_{j=1}^{k} \delta(S'_j[i], c_i)$. Let $d(i)$ be this minimum sum. The consensus string is the concatenation $c_1 c_2 \cdots c_l$ of all the consensus characters, where $l = |S'_1| = \cdots = |S'_k|$. The alignment error of $M$ is $\sum_{i=1}^{l} d(i)$.

5.1 Example

    a c - c d b -
    - c - a d b d
    a - b c d a d

Using the distance function $\delta(x, y) = 1$ if $x \ne y$ and $0$ otherwise, we get an SP value of 3 + 5 + 4 = 12. The consensus string for this alignment is ac-cdbd, and its alignment error is 6.
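The figures in this example can be checked with a short script. It is a sketch under assumptions: the distance is taken to be 0 for equal symbols and 1 otherwise, the third row is read as a-bcdad (the seven-column reading that reproduces the stated SP value, consensus string, and alignment error), and the function name `consensus` is hypothetical.

```python
from itertools import combinations

GAP = "-"


def delta(x, y):
    # Assumed unit-cost distance: 0 if the symbols are equal, 1 otherwise.
    return 0 if x == y else 1


def consensus(rows, alphabet):
    """Return (consensus string, alignment error) for a list of aligned rows."""
    chars, error = [], 0
    for column in zip(*rows):
        # Cost of choosing each candidate symbol as this column's consensus character.
        costs = {c: sum(delta(x, c) for x in column) for c in alphabet}
        best = min(alphabet, key=costs.get)  # ties go to the earlier symbol in alphabet
        chars.append(best)
        error += costs[best]
    return "".join(chars), error


rows = ["ac-cdb-", "-c-adbd", "a-bcdad"]
sp = sum(delta(x, y) for a, b in combinations(rows, 2) for x, y in zip(a, b))
print(sp)
print(consensus(rows, "abcd" + GAP))
```

The space symbol is included among the candidate consensus characters, which is what allows the third column (two spaces and one b) to yield a space in the consensus string.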


6 Concluding Remarks

Having reviewed the concepts of sequence similarity and alignment, we saw how they can be extended to multiple (r > 2) strings. We then saw how the dynamic programming approach to string alignment can be extended to the multiple sequence case, at a cost that grows exponentially with the number of strings. We studied an approximation algorithm that runs in polynomial time, and so much faster than the DP algorithm, while producing an SP value less than twice the optimum. Finally, we saw how to generate a consensus string for a given family of proteins.

References

[1] Waterman, Michael S.: "Introduction to Computational Biology: Maps, Sequences and Genomes", Chapman & Hall.

[2] Tompa, Martin: "Lecture Notes: Chapter 6", notes by Markus Mock, 1996.

[3] Barton, Geoff: "Sequence Alignment Tutorial", http://geoff.biop.ox.ac.uk/papers/rev93_1/tableofcontents3_1.html
