  • Multiple Sequence Alignment and Similarity

    Seminar Report

    Arun Moorthy (CS 93115)

    Department of Computer Science & Engineering

    Indian Institute of Technology, Madras


  • Contents

    1 Introduction
    2 Background: Sequence Similarity & Alignment
        2.1 Definitions
        2.2 Optimal Alignments by Dynamic Programming
    3 Multiple Sequence Similarity
        3.1 Multiple String Alignment Problem
        3.2 Optimal Multiple Alignment by DP
    4 Approximation Algorithm for Multiple String Alignment
        4.1 Assumptions
        4.2 Definitions
        4.3 Algorithm
        4.4 Time Analysis
        4.5 Error Analysis
    5 Consensus String
        5.1 Example
    6 Concluding Remarks

  • 1 Introduction

    Sequence relationships are usually not restricted to those between two sequences.
    Rather, they extend to relatedness among a family of sequences. This leads us to
    the study of the alignment of r (r > 2) sequences.

    Protein databases are often categorized by protein families. A protein family is a
    collection of proteins with similar structure (i.e., shape), similar function, or
    similar evolutionary history. A consensus sequence for the family can be derived
    from a good multiple sequence alignment of all its members.

    Multiple sequence alignment also helps us gain insight into evolutionary relationships.
    If we assume that the consensus sequence is the ancestor of all other sequences in
    the family, then the number of edits needed to get from the ancestor sequence to a
    member sequence gives an estimate of the time when the two sequences diverged in
    evolutionary history.

    Section 2 summarizes the concepts involved in sequence similarity and alignment.
    Section 3 introduces multiple sequence similarity and alignment. In Section 4, we
    study an approximation algorithm for multiple string alignment, analyze its running
    time, and bound its error. Section 5 defines the consensus string.


  • 2 Background: Sequence Similarity & Alignment

    Given two sequences, it is very useful to determine how similar they are. (Remember
    that the genome is a string over the alphabet {A, C, T, G}.) Similarity is measured
    by finding a good alignment between them.

    2.1 Definitions

    1. $\sigma(x, y)$ denotes the score for matching two characters $x$ and $y$.

    2. An alignment maps two strings $S$ and $T$ into strings $S'$ and $T'$, which may
       contain spaces, where $|S'| = |T'|$ and the removal of all spaces from $S'$ and
       $T'$ leaves $S$ and $T$. The value of the alignment is
       \[ \sum_{i=1}^{l} \sigma(S'[i], T'[i]), \qquad \text{where } l = |S'| = |T'|. \]

    3. An optimal alignment is one with the maximum value; this value is denoted by
       $\mathrm{Sim}(S, T)$.

    2.2 Optimal Alignments by Dynamic Programming

    We can use dynamic programming (DP) to obtain an optimal alignment of two strings.
    Let $S[1..m]$ and $T[1..n]$ be the two strings we wish to align, and let $V(i, j)$ be
    the value of an optimal alignment of $S[1..i]$ and $T[1..j]$. Writing $\sigma$ for the
    scoring function and $-$ for the space, $V(m, n)$ can be computed using the recurrence

    \[
    V(i, j) = \max \left( \begin{array}{l}
        V(i-1, j-1) + \sigma(S[i], T[j]), \\
        V(i-1, j) + \sigma(S[i], -), \\
        V(i, j-1) + \sigma(-, T[j])
    \end{array} \right)
    \qquad \forall\, i, j > 0,
    \]

    with the boundary conditions

    \[
    V(i, 0) = \sum_{k=1}^{i} \sigma(S[k], -)
    \qquad \text{and} \qquad
    V(0, j) = \sum_{k=1}^{j} \sigma(-, T[k]).
    \]
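    The recurrence can be sketched in a few lines of Python. This is only an illustration:
    the scoring function `sigma` below (+1 for a match, -1 for a mismatch or a space) and
    the names `SPACE` and `similarity` are assumptions made for the example, not values
    taken from the report.

```python
SPACE = "-"

def sigma(x, y):
    """Hypothetical score: +1 for a match, -1 for a mismatch or a space."""
    return 1 if x == y and x != SPACE else -1

def similarity(s, t):
    """Value V(m, n) of an optimal global alignment of s and t."""
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary conditions: a prefix aligned entirely against spaces.
    for i in range(1, m + 1):
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], SPACE)
    for j in range(1, n + 1):
        V[0][j] = V[0][j - 1] + sigma(SPACE, t[j - 1])
    # Each cell takes the best of the three possible last columns.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + sigma(s[i - 1], SPACE),
                V[i][j - 1] + sigma(SPACE, t[j - 1]),
            )
    return V[m][n]

print(similarity("ACGT", "AGT"))  # 2: three matches and one space
```

    The table has (m+1)(n+1) cells and each is filled in constant time, which gives the
    O(mn) cost used later in the time analysis.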


  • 3 Multiple Sequence Similarity

    3.1 Multiple String Alignment Problem

    Given strings $S_1, S_2, \ldots, S_k$, a multiple (global) alignment maps them to
    strings $S'_1, S'_2, \ldots, S'_k$ that may contain spaces, where

    1. $|S'_1| = |S'_2| = \cdots = |S'_k|$, and

    2. the removal of all spaces from $S'_i$ leaves $S_i$, for $1 \le i \le k$.

    There are several ways to define the value of a multiple alignment.

    1. One method involves redefining $\sigma$ (the scoring function). $\sigma$ is now
       defined to be a function
       \[ \sigma : (\mathcal{A} \cup \{-\})^{r} \to \mathbb{R}, \]
       where $\mathcal{A}$ is the alphabet and $-$ is the space. Assume that a pairwise
       distance $d$ is given for the entire alphabet. Then we can define $\sigma$ to be
       \[ \sigma(a_1, a_2, \ldots, a_r) = \min_{\alpha \in \mathcal{A}} \sum_{i=1}^{r} d(a_i, \alpha). \]
       The minimizing $\alpha$ is analogous to a "center of gravity" letter.

    2. Another method reuses the pairwise scoring function. The sum-of-pairs (SP) value
       of a multiple global alignment $A$ of $k$ strings is the sum of the values of all
       $\binom{k}{2}$ pairwise alignments induced by $A$ (assuming that the scoring
       function is symmetric).

       Let us define $\delta(x, y)$ to measure the distance between characters $x$ and
       $y$, so that an alignment receives a higher value the more distant the two strings
       are. In the case of two strings, we will thus be trying to minimize
       \[ \sum_{i=1}^{l} \delta(S'[i], T'[i]), \qquad \text{where } l = |S'| = |T'|. \]

    An optimal SP (global) alignment of strings $S_1, S_2, \ldots, S_k$ is an alignment
    that has the minimum possible sum-of-pairs value for these $k$ strings.
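    As a small illustration of the sum-of-pairs definition, the sketch below computes the
    SP value of an alignment given as equal-length rows. The unit-cost distance `delta`
    and the three-row alignment are made up for the example; the report itself does not
    fix a particular distance at this point.

```python
from itertools import combinations

def delta(x, y):
    """Unit-cost distance (an assumption): 0 if the symbols agree, 1 otherwise."""
    return 0 if x == y else 1

def induced_distance(row_a, row_b):
    """Distance the multiple alignment induces on one pair of rows."""
    return sum(delta(x, y) for x, y in zip(row_a, row_b))

def sp_value(rows):
    """Sum-of-pairs value: total induced distance over all k-choose-2 row pairs."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(induced_distance(a, b) for a, b in combinations(rows, 2))

rows = ["ac-", "-cd", "acd"]  # made-up alignment; "-" is the space
print(sp_value(rows))         # 2 + 1 + 1 = 4
```

    A column in which both rows of a pair hold a space contributes delta("-", "-") = 0,
    so such columns do not affect the induced pairwise distance.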


    3.2 Optimal Multiple Alignment by DP

    Given $k$ strings, each of length $n$, there is a generalization of the dynamic
    programming algorithm that finds an optimal SP alignment. Instead of a 2-dimensional
    table, it fills in a $k$-dimensional table. This table has dimensions

    \[ \underbrace{(n+1) \times (n+1) \times \cdots \times (n+1)}_{k \text{ times}}, \]

    so it has $(n+1)^k$ entries. Each entry depends on $2^k - 1$ adjacent entries,
    corresponding to the possibilities for the last match in an optimal alignment: each
    of the $k$ strings contributes either a character or a space to the last column, and
    not all of them can contribute a space. Each of the $(n+1)^k$ entries can be computed
    in time proportional to $2^k$, and hence the running time of the algorithm is
    $O((2n)^k)$. For a typical protein, $n \approx 200$. Hence the algorithm is practical
    only for very small values of $k$, perhaps 3 or 4. But protein families have hundreds
    of members, so any useful algorithm must work for $k$ in the hundreds too.
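    To get a feel for these numbers, the following sketch multiplies the number of table
    entries by the number of predecessors per entry. The function name `exact_dp_work` is
    made up for the illustration, and the result is only a rough operation count, not a
    measured running time.

```python
def exact_dp_work(n, k):
    """Rough operation count: (n+1)^k table entries, 2^k - 1 predecessors each."""
    return (n + 1) ** k * (2 ** k - 1)

# n = 200 is the typical protein length quoted above.
for k in (2, 3, 4, 10):
    print(k, f"{exact_dp_work(200, k):.2e}")
```

    Each additional string multiplies the count by roughly 2(n+1), which is why the exact
    algorithm stops being usable after a handful of strings.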


  • 4 Approximation Algorithm for Multiple String Alignment

    In short, we need an algorithm whose running time is polynomial in both $n$ and $k$.
    Such an algorithm is unlikely to exist, however, because according to Wang and Jiang
    the optimal SP alignment problem is NP-complete.

    We therefore turn to an approximation algorithm for the multiple string alignment
    problem. We will show that it runs in polynomial time and produces multiple string
    alignments whose SP values are less than twice those of the optimal solutions.

    4.1 Assumptions

    Triangle inequality: $\delta(x, z) \le \delta(x, y) + \delta(y, z)$ for all characters
    $x$, $y$, and $z$.

    $\delta(x, x) = 0$ for all characters $x$.

    4.2 Definitions

    For all strings $S$ and $T$, define $D(S, T)$ to be the value of the minimum global
    alignment of $S$ and $T$.

    4.3 Algorithm

    Input: a set $\mathcal{T}$ of $k$ strings.

    Step 1

    Find $S_1 \in \mathcal{T}$ that minimizes

    \[ \sum_{S \in \mathcal{T} \setminus \{S_1\}} D(S_1, S). \]

    This is done by running the DP algorithm on each of the $\binom{k}{2}$ pairs of
    strings in $\mathcal{T}$.

    Step 2

    Call the remaining strings in $\mathcal{T}$ $S_2, \ldots, S_k$. Add these strings one
    at a time to a multiple alignment that initially consists only of $S_1$, as follows.

    Suppose $S_1, S_2, \ldots, S_{i-1}$ are already aligned as
    $S'_1, S'_2, \ldots, S'_{i-1}$. To add $S_i$, run the dynamic programming algorithm of
    Section 2.2 (minimizing distance rather than maximizing score) on $S'_1$ and $S_i$ to
    produce $S''_1$ and $S'_i$. Spaces already present in $S'_1$ are treated like any
    other character, so aligning one of them with a new space costs $\delta(-, -) = 0$.
    Adjust $S'_2, \ldots, S'_{i-1}$ by adding spaces to those columns where spaces were
    added to get $S''_1$ from $S'_1$. Replace $S'_1$ by $S''_1$.
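    The two steps can be sketched as follows. This is a minimal illustration under stated
    assumptions, not the report's own implementation: `delta` is taken to be the unit-cost
    distance (which satisfies both assumptions of Section 4.1), and the names `align`,
    `center_star`, and the sample strings are made up for the example.

```python
from itertools import combinations

GAP = "-"

def delta(x, y):
    """Unit-cost distance (an assumption): 0 for equal symbols, 1 otherwise."""
    return 0 if x == y else 1

def align(s, t):
    """Minimum-cost global alignment of s and t.

    Returns (cost, columns); each column is a pair (i, j) of indices into
    s and t, with None marking a space inserted on that side.
    """
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + delta(s[i - 1], GAP)
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + delta(GAP, t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                          D[i - 1][j] + delta(s[i - 1], GAP),
                          D[i][j - 1] + delta(GAP, t[j - 1]))
    # Trace one optimal path back from (m, n) to (0, 0).
    columns, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            columns.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(s[i - 1], GAP):
            columns.append((i - 1, None))
            i -= 1
        else:
            columns.append((None, j - 1))
            j -= 1
    return D[m][n], columns[::-1]

def center_star(strings):
    """Return (order, rows): input indices in row order, and the aligned rows."""
    k = len(strings)
    # Step 1: pick the string with the smallest total distance to the others.
    dist = [[0] * k for _ in range(k)]
    for a, b in combinations(range(k), 2):
        dist[a][b] = dist[b][a] = align(strings[a], strings[b])[0]
    center = min(range(k), key=lambda a: sum(dist[a]))
    # Step 2: add the other strings one at a time against the center row.
    order, rows = [center], [strings[center]]
    for idx in (x for x in range(k) if x != center):
        _, columns = align(rows[0], strings[idx])
        # Re-space every existing row wherever the center row gained a space.
        rows = ["".join(r[i] if i is not None else GAP for i, _ in columns)
                for r in rows]
        rows.append("".join(strings[idx][j] if j is not None else GAP
                            for _, j in columns))
        order.append(idx)
    return order, rows

strings = ["acd", "abd", "ad", "bcd"]  # made-up input
order, rows = center_star(strings)
for idx, row in zip(order, rows):
    print(idx, row)
```

    Returning index pairs from `align`, rather than padded strings, makes it unambiguous
    which columns are newly inserted spaces and which were already in the center row.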

    4.4 Time Analysis

    Theorem. The approximation algorithm runs in time $O(k^2 n^2)$ when given $k$ strings,
    each of length at most $n$.

    Proof. Each of the $\binom{k}{2}$ values $D(S, T)$ required to compute $S_1$ can be
    computed in $O(n^2)$ time, so the total time for Step 1 is $O(k^2 n^2)$. After adding
    $S_i$ to the multiple string alignment, the length of $S'_1$ is at most $in$, so the
    time to add all $k$ strings to the multiple string alignment is

    \[ \sum_{i=1}^{k-1} O((in) \cdot n) = O(k^2 n^2). \]
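    The sum in the proof can be expanded explicitly. The short derivation below is a
    worked step added for illustration; it shows where the $k^2$ factor comes from.

```latex
\[
\sum_{i=1}^{k-1} (in) \cdot n
  \;=\; n^2 \sum_{i=1}^{k-1} i
  \;=\; n^2 \cdot \frac{k(k-1)}{2}
  \;=\; O(k^2 n^2).
\]
```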

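    4.5 Error Analysis

    Let $M$ be the alignment produced by the algorithm, let $d(i, j)$ be the distance $M$
    induces on the pair $S_i, S_j$, and let

    \[ v(M) = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d(i, j). \]

    Let $M^*$ be the optimal alignment, let $d^*(i, j)$ be the distance $M^*$ induces on
    the pair $S_i, S_j$, and let

    \[ v(M^*) = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d^*(i, j). \]

    Each pair is counted twice in these sums, so $v$ is twice the SP value; the ratio of
    the two quantities is unaffected. We now prove that

    \[ \frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} < 2. \]

    In other words, the algorithm produces an alignment whose SP value is less than twice
    that of the optimal SP alignment.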

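    Proof

    We will derive an upper bound on $v(M)$ and a lower bound on $v(M^*)$, and then take
    their quotient.

    \begin{align*}
    v(M) &= \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d(i, j) \\
         &\le \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} \bigl( d(i, 1) + d(1, j) \bigr)
              && \text{(triangle inequality)} \\
         &= 2(k-1) \sum_{l=2}^{k} d(1, l)
              && \text{(each $d(l, 1) = d(1, l)$ occurs in $2(k-1)$ terms of the second line)} \\
         &= 2(k-1) \sum_{l=2}^{k} D(S_1, S_l).
    \end{align*}

    The last step holds because the algorithm aligns each $S_l$ optimally against $S_1$,
    and columns of spaces added later contribute $\delta(-, -) = 0$.

    Now for $v(M^*)$:

    \begin{align*}
    v(M^*) &= \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d^*(i, j) \\
           &\ge \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} D(S_i, S_j)
                && \text{(by definition of $D$)} \\
           &\ge \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1, S_j)
                && \text{(by definition of $S_1$)} \\
           &= k \sum_{l=2}^{k} D(S_1, S_l).
    \end{align*}

    Combining these two bounds,

    \[ \frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} < 2. \]

    Note that, for small values of $k$, the approximation is significantly better than a
    factor of 2; for $k = 3$, for example, the bound is $4/3$.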
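    The bound $2(k-1)/k$ is easy to tabulate. The snippet below is a small illustration
    (the function name `approximation_bound` is made up) that evaluates it exactly for a
    few values of $k$.

```python
from fractions import Fraction

def approximation_bound(k):
    """Upper bound 2(k-1)/k on v(M)/v(M*) for k >= 2 strings."""
    if k < 2:
        raise ValueError("the bound is stated for at least two strings")
    return Fraction(2 * (k - 1), k)

for k in (2, 3, 4, 10, 100):
    bound = approximation_bound(k)
    print(k, bound, float(bound))
```

    The bound equals 1 for two strings, where the algorithm reduces to an optimal pairwise
    alignment, and approaches 2 from below as $k$ grows.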


  • 5 Consensus String

    Given a multiple string alignment, we would like to derive from it a consensus string
    that can be used to represent the entire set of strings in the alignment.

    Given a multiple alignment $M$ of strings $S_1, S_2, \ldots, S_k$, the consensus
    character of column $i$ is the character $c_i$ that minimizes the sum of distances to
    it from all the characters in column $i$; that is, it minimizes
    $\sum_{j=1}^{k} \delta(S'_j[i], c_i)$. Let $d(i)$ be this minimum sum. The consensus
    string is the concatenation $c_1 c_2 \cdots c_l$ of all the consensus characters,
    where $l = |S'_1| = \cdots = |S'_k|$. The alignment error of $M$ is
    $\sum_{i=1}^{l} d(i)$.

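    5.1 Example

    a c - c d b -
    - c - a d b d
    a - b c d a d

    Using $\delta$ as the distance function, with $\delta(x, y) = 0$ if $x = y$ and 1
    otherwise, the induced pairwise distances are 3 (rows 1 and 2), 5 (rows 2 and 3), and
    4 (rows 1 and 3), so the SP value is $3 + 5 + 4 = 12$. The consensus string for this
    alignment is ac-cdbd, and its alignment error is 6.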
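    The consensus string and alignment error of this example can be checked with a short
    sketch. It assumes the unit-cost distance and reads the third row as `a-bcdad`, the
    reading under which the stated consensus string and error come out; the function name
    `consensus` is made up for the illustration.

```python
def delta(x, y):
    """Unit-cost distance (an assumption): 0 for equal symbols, 1 otherwise."""
    return 0 if x == y else 1

def consensus(rows, alphabet=None):
    """Return (consensus string, alignment error) for equal-length aligned rows."""
    if alphabet is None:
        # Candidate consensus characters: every symbol seen, including the space.
        alphabet = sorted(set("".join(rows)))
    chars, error = [], 0
    for column in zip(*rows):
        # d(i) is the smallest total distance from the column to any candidate.
        cost, best = min((sum(delta(x, c) for x in column), c) for c in alphabet)
        chars.append(best)
        error += cost
    return "".join(chars), error

rows = ["ac-cdb-", "-c-adbd", "a-bcdad"]
print(consensus(rows))  # ('ac-cdbd', 6)
```

    Ties between candidate characters are broken by character order here; the example has
    no ties, so the choice does not affect the result.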


  • 6 Concluding Remarks

    Having reviewed the concepts of sequence similarity and alignment, we saw how they
    extend to multiple (r > 2) strings. We then saw how the dynamic programming approach
    to string alignment extends to the multiple sequence case, at a cost that grows
    exponentially with the number of strings. Next we studied an approximation algorithm
    that runs in polynomial time, much faster than the exact DP algorithm, and produces
    alignments whose SP value is less than twice the optimum. Finally, we saw how to
    generate a consensus string for a given family of proteins.
