
Multiple Sequence Alignments


Page 1: Multiple Sequence Alignments

Multiple Sequence Alignments

Page 2: Multiple Sequence Alignments


The Global Alignment problem

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

[Figure: axes labeled x, y, z]

Page 3: Multiple Sequence Alignments

Multiple Sequence Alignment

S1=AGGTC

S2=GTTCG

S3=TGAAC

Possible alignment:

S1: A G G T - C -
S2: - G - T T C G
S3: T G - A A C -

Possible alignment:

S1: A G G T - C -
S2: G T T - - C G
S3: - T G A A A C

Page 4: Multiple Sequence Alignments

Motivations

• Protein databases are categorized by protein families: collections of proteins with similar structure, function, or evolutionary history.

• Comparing a new protein with a family requires constructing a representation of the family and then comparing the new protein with that representation.

• How to score a multiple alignment?
  • Consensus Distance
  • Evolutionary Tree Distance
  • Sum-of-Pairs Distance

Page 5: Multiple Sequence Alignments

Definition

Given N sequences x1, x2, …, xN: insert gaps (-) in each sequence xi such that

• All sequences have the same length L
• The score of the global map is maximum

A faint similarity between two sequences becomes significant if it is present in many sequences.

Multiple alignments can help improve the pairwise alignments.

Page 6: Multiple Sequence Alignments

Multiple Sequence Alignment

Definition: Given strings S1, S2, …, Sk, a multiple (global) alignment maps them to strings S’1, S’2, …, S’k that may contain spaces, where:

1. |S’1| = |S’2| = … = |S’k|

2. The removal of spaces from S’i leaves Si

Definition: The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all $\binom{k}{2}$ pairwise alignments induced by A.

Page 7: Multiple Sequence Alignments

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment. A pairwise alignment induced by the multiple alignment is obtained by keeping two of its rows and removing every column in which both rows hold a gap.

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
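The projection in this example can be sketched in a few lines of Python. This is only an illustration: the function name `induced_pairwise` and the dictionary layout are assumptions made for the sketch, not part of the slides.

```python
def induced_pairwise(msa, a, b, gap="-"):
    """Project a multiple alignment onto rows a and b.

    Columns in which both rows hold a gap are dropped; all other
    columns are kept in their original order.
    """
    kept = [(x, y) for x, y in zip(msa[a], msa[b])
            if not (x == gap and y == gap)]
    row_a = "".join(x for x, _ in kept)
    row_b = "".join(y for _, y in kept)
    return row_a, row_b


# The three-row alignment from the example above.
msa = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}

for a, b in [("x", "y"), ("x", "z"), ("y", "z")]:
    print(a, b, induced_pairwise(msa, a, b))
```

Only the (x, y) and (y, z) projections lose a column, because those are the only pairs that share an all-gap column.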

Page 8: Multiple Sequence Alignments

Multiple Sequence Alignment

Given k strings of length n, there is a generalization of the dynamic programming algorithm that finds an optimal SP alignment.

NP completeness:

• Instead of a 2-dimensional table we now have a k-dimensional table to fill, with $O(n^k)$ cells.

• Each dimension’s size is n+1. Each entry depends on $2^k - 1$ adjacent entries.

Time Complexity: $O(k^2 2^k n^k)$

Page 9: Multiple Sequence Alignments

Multidimensional Dynamic Programming

Generalization of Needleman-Wunsch:

$S(m) = \sum_i S(m_i)$ (sum of column scores)

$F(i_1, i_2, \ldots, i_N)$: score of the optimal alignment up to $(i_1, \ldots, i_N)$

$F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$

Page 10: Multiple Sequence Alignments

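Multidimensional Dynamic Programming

Example: in 3D (three sequences), each cell has 7 neighbors:

$$
F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\
F(i-1,j-1,k) + S(x_i, x_j, -) \\
F(i-1,j,k-1) + S(x_i, -, x_k) \\
F(i-1,j,k) + S(x_i, -, -) \\
F(i,j-1,k-1) + S(-, x_j, x_k) \\
F(i,j-1,k) + S(-, x_j, -) \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}
$$

Here $x_i$, $x_j$, $x_k$ denote the i-th, j-th, and k-th characters of the first, second, and third sequence, respectively.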
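The seven cases of the 3D recurrence are exactly the non-zero binary vectors of length 3. A small sketch that enumerates them for any number of sequences (the function name `predecessor_offsets` is an assumption made for illustration):

```python
from itertools import product


def predecessor_offsets(n_seqs):
    """All non-zero binary vectors of length n_seqs.

    A 1 at position p means sequence p contributes a character to the
    column; a 0 means it contributes a gap.
    """
    return [b for b in product((1, 0), repeat=n_seqs) if any(b)]


offsets = predecessor_offsets(3)
print(len(offsets))  # 7, i.e. 2**3 - 1
for b in offsets:
    print(b)
```

With `(1, 0)` as the digit order, the vectors come out in the same order as the cases of the recurrence, from (1, 1, 1) down to (0, 0, 1).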

Page 11: Multiple Sequence Alignments

Multiple alignments

We use a matrix to represent the alignment of k sequences, K=(x1,...,xk). We assume that no column consists solely of blanks.

x1: M Q - I L L L
x2: M L R - L L -
x3: M K - I L L L
x4: M P P V L I L

The common scoring functions give a score to each column, and set: score(K) = ∑i score(column(i))

For k=10, a scoring function has $2^k - 1 > 1000$ entries to specify. The scoring function is symmetric: the order of arguments does not matter, e.g. score(I,-,I,V) = score(-,I,I,V).

Page 12: Multiple Sequence Alignments

SUM OF PAIRS

M Q - I L L L
M L R - L L -
M K - I L L L
M P P V L I L

A common scoring function is SP, the sum of scores of the projected pairwise alignments: SPscore(K) = ∑i<j score(xi,xj).

For this score to be written as ∑i score(column(i)), we set score(-,-) = 0. Why?

Because these entries appear in the sum of columns but not in the sum of projected pairwise alignments (rows).

Note that we need to specify score(-,-) because a column may have several blanks (as long as not all entries are blanks).

Page 13: Multiple Sequence Alignments

SUM OF PAIRS

M Q - I L L L
M L R - L L -
M K - I L L L
M P P V L I L

Definition: The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all $\binom{k}{2}$ projected pairwise alignments induced by A, where the pairwise alignment function score(xi,xj) is additive.

Page 14: Multiple Sequence Alignments

Example

Consider the following alignment:

a c - c d b -
- c - a d b d
a - b c d a d

Using the edit distance, with $\delta(x,x) = 0$ and $\delta(x,y) = 1$ for $x \neq y$, this alignment has an SP value of

3 + 4 + 5 = 12
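The arithmetic of this example can be checked with a short sketch. The helper names (`delta`, `pair_value`, `sp_by_pairs`, `sp_by_columns`) are chosen here for illustration; the unit-cost δ is the one stated in the example, and δ(-,-) = 0 follows from δ(x,x) = 0.

```python
from itertools import combinations


def delta(x, y):
    """Unit-cost edit distance on single symbols; delta('-', '-') is 0."""
    return 0 if x == y else 1


def pair_value(r, s):
    """Value of the pairwise alignment induced on rows r and s."""
    return sum(delta(x, y) for x, y in zip(r, s))


def sp_by_pairs(rows):
    """SP value as a sum over all pairs of rows."""
    return sum(pair_value(r, s) for r, s in combinations(rows, 2))


def sp_by_columns(rows):
    """The same SP value, accumulated column by column."""
    return sum(delta(x, y)
               for col in zip(*rows)
               for x, y in combinations(col, 2))


rows = ["ac-cdb-", "-c-adbd", "a-bcdad"]
print([pair_value(r, s) for r, s in combinations(rows, 2)])  # [3, 4, 5]
print(sp_by_pairs(rows), sp_by_columns(rows))                # 12 12
```

The two totals agree because a column pair of two blanks contributes 0, which is the reason for setting score(-,-) = 0.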

Page 15: Multiple Sequence Alignments

Multidimensional Dynamic Programming

Running Time:

1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences

2. Neighbors/cell: $2^N - 1$

Therefore: $O(2^N L^N)$

How do affine gaps generalize?

VERY badly! They require $2^N$ states, one per combination of gapped/ungapped sequences.

Running time: $O(2^N \cdot 2^N \cdot L^N) = O(4^N L^N)$

[Figure: states labeled by subsets of the sequences X, Y, Z: X, Y, Z, XY, XZ, YZ, XYZ]

Page 16: Multiple Sequence Alignments

Multiple Sequence Alignment

Given k strings of length n, there is a natural generalization of the dynamic programming algorithm that finds an alignment that maximizes

SP-score(K) = ∑i<j score(xi,xj).

Instead of a 2-dimensional table, we now have a k-dimensional table to fill.

For each vector i =(i1,..,ik), compute an optimal multiple alignment for the k prefix sequences x1(1,..,i1),...,xk(1,..,ik).

The adjacent entries are those that differ in their index by one or zero. Each entry depends on $2^k - 1$ adjacent entries.

Page 17: Multiple Sequence Alignments

The idea via K=2

Recall the notation:

$V[i,j] = d(s[1..i],\, t[1..j])$

and the following recurrence for V:

$$
V[i+1,j+1] = \max \begin{cases}
V[i,j] + \sigma(s[i+1],\, t[j+1]) \\
V[i,j+1] + \sigma(s[i+1],\, -) \\
V[i+1,j] + \sigma(-,\, t[j+1])
\end{cases}
$$

[Figure: the four cells V[i,j], V[i+1,j], V[i,j+1], V[i+1,j+1]]

Note that the new cell index (i+1,j+1) differs from previous indices by one of $2^k - 1$ non-zero binary vectors; for k=2 these are (1,1), (1,0), (0,1).
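For reference, the k=2 case is the familiar pairwise global alignment. A minimal sketch follows; the scoring values in `sigma` (+1 match, -1 mismatch, -1 against a gap) are an assumption made for the example and are not taken from the slides.

```python
def sigma(a, b):
    """Hypothetical scores: +1 match, -1 mismatch, -1 against a gap."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def global_score(s, t):
    """Fill the table V, where V[i][j] is the best score for aligning
    the prefixes s[:i] and t[:j], and return V[len(s)][len(t)]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # s prefix against gaps only
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, m + 1):            # t prefix against gaps only
        V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # move (1,1)
                V[i - 1][j] + sigma(s[i - 1], "-"),           # move (1,0)
                V[i][j - 1] + sigma("-", t[j - 1]),           # move (0,1)
            )
    return V[n][m]


print(global_score("ACGT", "AGT"))  # 2: three matches and one gap
```

Each of the three candidates corresponds to one of the non-zero binary vectors (1,1), (1,0), (0,1).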

Page 18: Multiple Sequence Alignments

The idea for arbitrary k

Order the vectors i=(i1,..,ik) by increasing order of the sum ∑j ij. Set s(0,..,0)=0, and for i > (0,...,0):

$$
s(\mathbf{i}) = \max_{\mathbf{b}} \big\{\, s(\mathbf{i} - \mathbf{b}) + \text{score}(\text{column}(\mathbf{i}, \mathbf{b})) \,\big\}
$$

Where:

• The vector b ranges over all non-zero binary vectors.
• The vector i-b is the non-negative difference of i and b.
• The jth entry of column(i,b) equals cj = xj(ij) if bj=1, and cj = ‘-’ otherwise (reflecting that b is 1 at location j if that location changed in the “current comparison”).
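A compact sketch of this recurrence for arbitrary k is given below. Two things are assumptions of the sketch rather than statements from the slides: the pair scores in `pair_score` (+1 match, -1 mismatch or gap, 0 for two blanks), and the traversal order, which is lexicographic instead of increasing index sum. Lexicographic order is also valid, because every predecessor i-b is lexicographically smaller than i.

```python
from itertools import combinations, product


def pair_score(a, b):
    """Hypothetical scores: +1 match, -1 mismatch or gap, score(-,-) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def column_score(col):
    """SP score of one column: sum over all pairs of its entries."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


def msa_sp_score(seqs):
    """Optimal SP score of a multiple alignment of seqs (exponential in k)."""
    k = len(seqs)
    moves = [b for b in product((0, 1), repeat=k) if any(b)]
    s = {(0,) * k: 0}
    for i in product(*(range(len(x) + 1) for x in seqs)):
        if not any(i):
            continue                      # s(0,...,0) is already set
        best = None
        for b in moves:
            prev = tuple(ij - bj for ij, bj in zip(i, b))
            if min(prev) < 0:
                continue                  # i - b must be non-negative
            col = [x[ij - 1] if bj else "-"
                   for x, ij, bj in zip(seqs, i, b)]
            cand = s[prev] + column_score(col)
            if best is None or cand > best:
                best = cand
        s[i] = best
    return s[tuple(len(x) for x in seqs)]


print(msa_sp_score(["AGGTC", "GTTCG", "TGAAC"]))
```

The dictionary `s` holds one entry per cell of the k-dimensional table, so this is only practical for a handful of short sequences.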

Page 19: Multiple Sequence Alignments

Complexity of the DP approach

• Number of cells: $n^k$.
• Number of adjacent cells: $O(2^k)$.
• Computation of the SP score for each column(i,b): $O(k^2)$.

Total run time is $O(k^2 2^k n^k)$, which is utterly unacceptable!

Not much hope for a polynomial algorithm, because the problem has been shown to be NP-complete.

Need heuristics to reduce time.

Page 20: Multiple Sequence Alignments

Time saving heuristics: Relevance tests

Heuristic: Avoid computing score(i) for irrelevant vectors.

x1: M Q - I L L L
x2: M L R - L L -
x3: M K - I L L L
x4: M P P V L I L

Let L be a lower bound on the optimal SP score of a multiple alignment of the k sequences. A lower bound L can be obtained from an arbitrary multiple alignment, computed in any way.

Main idea: Compute upper bounds H(u,v) for the optimal score for every two sequences s=xu and t=xv, 1 ≤ u < v ≤ k. When processing vector i=(..iu,..iv…), the relevant cells are those for which, in every projection on xu and xv, the optimal pairwise score is above a value based on H(u,v) and L.

Page 21: Multiple Sequence Alignments

Recall the Linear Space algorithm

• V[i,j] = d(s[1..i],t[1..j])
• B[i,j] = d(s[i+1..n],t[j+1..m])
• V[i,j] + B[i,j] = score of the best alignment through (i,j)

[Figure: alignment grid with s and t on the axes]

These computations are done in linear space.

Build such a table $a_{x_u,x_v}(i_u, i_v)$ for every two sequences s=xu and t=xv, 1 ≤ u, v ≤ k. This entry encodes the optimum through (iu,iv).
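A sketch of such a "best score through (i,j)" table for one pair of sequences follows. For clarity it stores both full tables in quadratic space instead of using the linear-space variant, and the scores in `sigma` are hypothetical; both points are assumptions of this illustration.

```python
def sigma(a, b):
    """Hypothetical scores: +1 match, -1 mismatch, -1 against a gap."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def prefix_table(s, t):
    """V[i][j] = best score for aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                          V[i - 1][j] + sigma(s[i - 1], "-"),
                          V[i][j - 1] + sigma("-", t[j - 1]))
    return V


def through_table(s, t):
    """a[i][j] = V[i][j] + B[i][j], the best score of any alignment of
    s and t that passes through cell (i, j)."""
    n, m = len(s), len(t)
    V = prefix_table(s, t)
    # Suffix scores B are prefix scores of the reversed strings.
    R = prefix_table(s[::-1], t[::-1])
    return [[V[i][j] + R[n - i][m - j] for j in range(m + 1)]
            for i in range(n + 1)]


a = through_table("ACGT", "AGT")
for row in a:
    print(row)
```

The corner cells a[0][0] and a[n][m] both equal the optimal pairwise score, and no cell can exceed it.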

Page 22: Multiple Sequence Alignments

Time saving heuristics: Relevance test

Let S(u,v) be the score of the alignment of xu and xv in the multiple alignment.

Then, we have: $L \le S(u,v) - H(u,v) + \sum_{u'<v'} H(u',v')$

And then: $S(u,v) \ge L + H(u,v) - \sum_{u'<v'} H(u',v')$

Now for each pair u,v we want to consider only the cells iu and iv for which the best pairwise alignment score that can be obtained through them (that is, $a_{x_u,x_v}(i_u,i_v)$) is at least the above value:

$a_{x_u,x_v}(i_u,i_v) \ge L + H(u,v) - \sum_{u'<v'} H(u',v')$
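The test itself is a single threshold comparison per cell. In the sketch below, the function name `relevant_cells` is an assumption, and the table `a_01`, the bounds `H`, and the lower bound `L` are made-up numbers used only to show the arithmetic.

```python
def relevant_cells(a_uv, L, H, u, v):
    """Cells (iu, iv) that pass the relevance test for the pair (u, v).

    a_uv[iu][iv] is the best pairwise score through (iu, iv),
    H maps each pair (u', v') with u' < v' to its upper bound, and
    L is a lower bound on the optimal SP score.
    """
    threshold = L + H[(u, v)] - sum(H.values())
    return {(i, j)
            for i, row in enumerate(a_uv)
            for j, value in enumerate(row)
            if value >= threshold}


# Made-up numbers for three sequences (pairs (0,1), (0,2), (1,2)).
H = {(0, 1): 5, (0, 2): 4, (1, 2): 3}
L = 10
a_01 = [[5, 2],
        [3, 5]]

# Threshold for pair (0, 1): 10 + 5 - (5 + 4 + 3) = 3.
print(sorted(relevant_cells(a_01, L, H, 0, 1)))
```

A vector i of the k-dimensional table is then worth processing only if every one of its pairwise projections lands in the corresponding relevant set.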

Page 23: Multiple Sequence Alignments

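A Profile Representation of Multiple Alignment

Given a multiple alignment M = m1…mn, replace each column mi with profile entry pi:

• Frequency of each letter in the alphabet
• # gaps
• Optional: # gap openings, extensions, closings

- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G

Column   1   2   3   4   5   6   7   8   9  10  11  12  13  14
A        .   1   .   .   .   .   1   .   .  .8   .   .   .   .
C       .6   .   .   .   1   .   .  .4   1   .  .6  .2   .   .
G        .   .   1  .2   .   .   .   .   .  .2   .   .  .4   1
T       .2   .   .   .   .   1   .  .6   .   .   .   .  .2   .

Gap statistics (nonzero entries, in column order): O (openings): .2 .8 .4 .4; E (extensions): .4; C (closings): .2 .8 .4 .2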

Page 24: Multiple Sequence Alignments

Multiple Alignments With Profile

Consider the MSA

a b c - a
a b a b a
a c c b -
c b - b c

Its corresponding profile P is

     C1    C2    C3    C4    C5
a    75%         25%         50%
b          75%         75%
c    25%   25%   50%         25%
-                25%   25%   25%

Aligning a string S to a profile P will tell us how well S fits P. Given the column positions C of P, the alignment consists of inserting spaces into S and C=(1,2,3,4,5) as in pure string alignment. For instance, an alignment of aabbc to P is:

a a b - b c
1 - 2 3 4 5
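The profile of this MSA can be recomputed with a few lines of Python. The function name `profile` and the representation (one dictionary of frequencies per column) are choices made for this sketch.

```python
from collections import Counter


def profile(msa_rows):
    """Per-column symbol frequencies of an MSA, gaps included.

    Returns one dict per column, mapping each symbol that occurs in
    the column to its relative frequency.
    """
    n = len(msa_rows)
    return [{ch: count / n for ch, count in Counter(col).items()}
            for col in zip(*msa_rows)]


P = profile(["abc-a", "ababa", "accb-", "cb-bc"])
for j, column in enumerate(P, start=1):
    print(f"C{j}", column)
```

Symbols that never occur in a column are simply absent from that column's dictionary, which corresponds to the empty cells of the profile table.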

Page 25: Multiple Sequence Alignments

String-to-Profile Alignment

Scoring a column j is equivalent to aligning Sj to each character at column j:

$\sigma(j) = \sum_i \sigma(S_j, i)\, p_{ij}$

where $p_{ij}$ is the frequency of the i-th character in column j.

Score of an alignment = sum of all column scores σ(j).

Use dynamic programming as before (NW, SW, …) to do a string-to-profile alignment, except that you should use the scoring function defined above.
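A sketch of a global (NW-style) string-to-profile alignment using this column score is shown below. The symbol scores in `sigma`, the cost charged when a character of S is placed opposite a space inserted into the column positions, and the function names are all assumptions made for illustration.

```python
def sigma(a, b):
    """Hypothetical scores: +1 match, -1 mismatch or gap, sigma(-,-) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def col_score(ch, column):
    """Score of symbol ch against a profile column (dict symbol -> freq)."""
    return sum(sigma(ch, c) * p for c, p in column.items())


def string_to_profile(S, P):
    """Best global score for aligning string S to profile P."""
    n, m = len(S), len(P)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):   # character opposite a space in the columns
        V[i][0] = V[i - 1][0] + sigma(S[i - 1], "-")
    for j in range(1, m + 1):   # profile column opposite a space in S
        V[0][j] = V[0][j - 1] + col_score("-", P[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + col_score(S[i - 1], P[j - 1]),
                V[i - 1][j] + sigma(S[i - 1], "-"),
                V[i][j - 1] + col_score("-", P[j - 1]),
            )
    return V[n][m]


# The profile of the MSA example, written as frequencies.
P = [{"a": 0.75, "c": 0.25}, {"b": 0.75, "c": 0.25},
     {"a": 0.25, "c": 0.5, "-": 0.25}, {"b": 0.75, "-": 0.25},
     {"a": 0.5, "c": 0.25, "-": 0.25}]
print(string_to_profile("aabbc", P))
```

The only change from ordinary pairwise alignment is that the substitution score is replaced by the frequency-weighted `col_score`.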

Profile-to-Profile Alignments?

Page 26: Multiple Sequence Alignments

Progressive Alignment

When evolutionary tree is known:

• Align closest first, in the order of the tree
• In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult

Weighted version:

• Tree edges have weights, proportional to the divergence in that edge
• New profile is a weighted average of two old profiles

[Figure: tree with nodes labeled x, w, y, z]

Page 27: Multiple Sequence Alignments

Example

Profile: (A, C, G, T, -)

px = (0.8, 0.2, 0, 0, 0)

py = (0.6, 0, 0, 0, 0.4)

s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)

Result: pxy = (0.7, 0.1, 0, 0, 0.2)

s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)

Result: px- = (0.4, 0.1, 0, 0, 0.5)
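The two operations of this example, scoring a pair of profile columns and averaging them, can be sketched as follows. The symbol scores in `s` and the equal weights in `merge` are assumptions of the sketch; the weighted version would pass tree-derived weights instead.

```python
ALPHABET = "ACGT-"


def s(a, b):
    """Hypothetical scores: +1 match, -1 mismatch or gap, s(-,-) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def profile_score(px, py):
    """Expected symbol score between two profile columns."""
    return sum(px[i] * py[j] * s(a, b)
               for i, a in enumerate(ALPHABET)
               for j, b in enumerate(ALPHABET))


def merge(px, py, wx=0.5, wy=0.5):
    """Weighted average of two profile columns."""
    return tuple(wx * a + wy * b for a, b in zip(px, py))


px = (0.8, 0.2, 0.0, 0.0, 0.0)
py = (0.6, 0.0, 0.0, 0.0, 0.4)
gap = (0.0, 0.0, 0.0, 0.0, 1.0)   # a column consisting only of gaps

print(profile_score(px, py))
print(merge(px, py))    # approximately (0.7, 0.1, 0, 0, 0.2)
print(merge(px, gap))   # approximately (0.4, 0.1, 0, 0, 0.5)
```

Only the products with non-zero frequencies contribute, which is why the expansion of s(px, py) in the example has four terms.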

Page 28: Multiple Sequence Alignments

Progressive Alignment

When evolutionary tree is unknown:

• Perform all pairwise alignments
• Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
• Construct a tree (we will describe this in more detail later in the course)
• Align on the tree

[Figure: sequences x, w, y, z with an unknown tree, marked "?"]

Page 29: Multiple Sequence Alignments

CLUSTALW

(1) Perform pairwise alignments of all sequences

(2) Use alignment scores to produce a phylogenetic tree

(3) Align the sequences sequentially by the tree that is based on genetic distances.

-- The most closely related sequences are aligned first, then additional sequences or groups are added according to initial alignments

-- Genetic distance: no. of mismatched positions divided by the total no. of matched positions (positions opposite a gap are not scored)

-- Sequence contributions to MSA are weighted according to their relationships on the tree

-- Weighting scheme: the more distant, the higher the weight

-- Context (neighboring amino acids) is taken into account for the gap penalty

-- Gap score is adapted to force gaps to open at the same position.
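The genetic distance described above can be sketched directly from an aligned pair. The reading of "matched positions" as all compared positions in which neither row has a gap is an interpretation made for this sketch, and `genetic_distance` is a hypothetical helper, not a CLUSTALW function.

```python
def genetic_distance(a, b, gap="-"):
    """Mismatches divided by compared positions for two aligned rows.

    Positions opposite a gap are not scored. Returns 0.0 when no
    position can be compared.
    """
    scored = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not scored:
        return 0.0
    mismatches = sum(1 for x, y in scored if x != y)
    return mismatches / len(scored)


print(genetic_distance("AC-GT", "ACGGA"))  # 0.25: one mismatch in four
```

Applying this to every pairwise alignment yields the distance matrix from which the guide tree is built.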

Page 30: Multiple Sequence Alignments

[Figure: tree over S1, S3, S2, S5, S6, annotated with pairwise alignment and DP alignment steps]

First align S1 and S3, S5 and S6, then align (S5,S6) and S2. Finally …

Page 31: Multiple Sequence Alignments

Tree Alignments

Assume that there is a tree T=(V,E) whose leaves are the sequences.

• Associate a sequence with each internal node.
• Tree-score(K) = $\sum_{(i,j) \in E} \text{score}(x_i, x_j)$.

Finding the optimal assignment of sequences to the internal nodes is NP-hard.

We will meet this problem again in the study of phylogenetic trees.

Page 32: Multiple Sequence Alignments

Star Alignments

Rather than summing up all pairwise alignments, select a fixed sequence x0 as a center, and set

Star-score(K) = ∑j>0 score(x0,xj).

The algorithm to find an optimal alignment: at each step, add another sequence aligned with x0, keeping old gaps and possibly adding new ones.

Page 33: Multiple Sequence Alignments

Multiple Sequence Alignment – Approximation Algorithm

Polynomial time algorithm:

Assumption: the cost function δ is a distance function:

• $\delta(x,x) = 0$
• $\delta(x,y) = \delta(y,x) \ge 0$
• $\delta(x,z) \le \delta(x,y) + \delta(y,z)$ (triangle inequality)

Let D(S,T) be the value of the minimum global alignment between S and T.

Page 34: Multiple Sequence Alignments

Multiple Sequence Alignment – Approximation Algorithm (cont.)

Polynomial time algorithm:

The input is a set Γ of k strings Si.

1. Find the string S1 that minimizes $\sum_{S \in \Gamma \setminus \{S_1\}} D(S_1, S)$

2. Call the remaining strings S2, …,Sk.

3. Add a string to the multiple alignment that initially contains only S1 as follows:

• Suppose S1, …,Si-1 are already aligned as S’1, …,S’i-1. Add Si by running the dynamic programming algorithm on S’1 and Si to produce S’’1 and S’i.

• Adjust S’2, …,S’i-1 by adding spaces to those columns where spaces were added to get S’’1 from S’1.

• Replace S’1 by S’’1.
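The three steps can be sketched as follows, assuming the unit-cost edit distance for δ (so δ(-,-) = 0). The function names `align` and `center_star` are chosen for illustration. New spaces in the center row are located by a left-to-right scan, which is an implementation shortcut of this sketch rather than part of the algorithm as stated.

```python
def delta(a, b):
    """Unit-cost distance on symbols; delta('-', '-') is 0."""
    return 0 if a == b else 1


def align(s, t):
    """Minimum-cost global alignment of s and t: (cost, s', t')."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delta(s[i - 1], "-")
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + delta("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                          D[i - 1][j] + delta(s[i - 1], "-"),
                          D[i][j - 1] + delta("-", t[j - 1]))
    a, b, i, j = [], [], n, m             # traceback from (n, m)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(s[i - 1], "-"):
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))


def center_star(strings):
    """Return (order, rows): order[0] is the index of the center S1 and
    rows[p] is the aligned version of strings[order[p]]."""
    k = len(strings)
    total = lambda c: sum(align(strings[c], strings[o])[0]
                          for o in range(k) if o != c)
    center = min(range(k), key=total)                 # step 1
    order = [center] + [i for i in range(k) if i != center]
    rows = [strings[center]]
    for idx in order[1:]:                             # step 3
        old_center = rows[0]
        _, new_center, new_row = align(old_center, strings[idx])
        adjusted = []
        for row in rows[1:]:
            out, p = [], 0
            for ch in new_center:
                if p < len(old_center) and ch == old_center[p]:
                    out.append(row[p]); p += 1        # existing column
                else:
                    out.append("-")                   # newly added space
            adjusted.append("".join(out))
        rows = [new_center] + adjusted + [new_row]
    return order, rows


order, rows = center_star(["AT", "A", "AGT"])
for i, row in zip(order, rows):
    print(i, row)
```

Because spaces are only ever added, the alignment induced between the center and each other string stays optimal, which is the property the error analysis relies on.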

Page 35: Multiple Sequence Alignments

Multiple Sequence Alignment – Approximation Algorithm (cont.)

Time analysis:

• Choosing S1: running the dynamic programming algorithm $\binom{k}{2}$ times, $O(k^2 n^2)$.

• When Si is added to the multiple alignment, the length of S’1 is at most $i \cdot n$, so the time to add all k strings is

$$
\sum_{i=1}^{k-1} O(i n \cdot n) = O(k^2 n^2)
$$

Page 36: Multiple Sequence Alignments

Multiple Sequence Alignment – Approximation Algorithm (cont.)

Error analysis:

• M: the alignment produced by this algorithm.

• d(i,j): the distance M induces on the pair Si,Sj.

• $v(M) = \sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} d(i,j)$

• M*: the optimal alignment.

For all i, d(1,i)=D(S1,Si) (we performed an optimal alignment between S’1 and Si, and $\delta(-,-) = 0$).

Page 37: Multiple Sequence Alignments

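Multiple Sequence Alignment – Approximation Algorithm (cont.)

Error analysis:

$$
v(M) = \sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} d(i,j)
\le \sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} \big[ d(1,i) + d(1,j) \big]
\quad \text{(triangle inequality)}
$$

$$
= 2(k-1) \sum_{l=2}^{k} d(1,l) = 2(k-1) \sum_{l=2}^{k} D(S_1, S_l)
$$

$$
v(M^*) = \sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} d^*(i,j)
\ge \sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} D(S_i, S_j)
\ge \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1, S_j)
\quad \text{(definition of } S_1 \text{)}
$$

$$
= k \sum_{j=2}^{k} D(S_1, S_j)
$$

$$
\frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} < 2
$$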

Page 38: Multiple Sequence Alignments

Iterative Refinement

Algorithm (Barton-Sternberg):

1. Align the most similar xi, xj

2. Align the xk most similar to (xixj)

3. Repeat 2 until (x1…xN) are aligned

4. For j = 1 to N, remove xj and realign it to x1…xj-1xj+1…xN

5. Repeat 4 until convergence

Note: Guaranteed to converge

Page 39: Multiple Sequence Alignments


Page 40: Multiple Sequence Alignments

Some Resources

Genome Resources

• Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway

• Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2

• ABC, a nice Stanford tool for browsing alignments: http://encode.stanford.edu/~asimenos/ABC/

Protein Multiple Aligners

• CLUSTALW (most widely used): http://www.ebi.ac.uk/clustalw/

• MUSCLE (most scalable): http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

• PROBCONS (most accurate): http://probcons.stanford.edu/