A memory-efficient algorithm for multiple sequence alignment with constraints

Chin Lung Lu and Yen Pin Huang, National Chiao Tung University, Taiwan, Republic of China

Bioinformatics, Vol. 21 no. 1, 2005

Yutu Liu -- CPSC 689 Algorithmic Techniques for Biology Spring 2005



Motivation

Incorporate biological structures and consensus patterns into sequence alignment

Keep the alignment algorithm memory-efficient

Problem Formulation -- Constraints

What is multiple sequence alignment with constraints?

Sequences:

A T C T C G C T
T G C A T A T

Constraints (labelled $C_1$ and $C_2$ on the slide): AT, T

Two alignments of the sequences:

A T -- C -- T C G C T
-- T G C A T -- -- A T

-- -- -- A T C T C G C T
T G C A T A T -- -- -- --

Constraints model conserved sites of a protein or DNA/RNA family.

$\Omega = (C_1, C_2, \ldots, C_\gamma)$, where $C_i = c^i_1 c^i_2 \cdots c^i_{\lambda_i}$

The constraints do not overlap and appear in the order $C_1, C_2, \ldots, C_\gamma$.

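Problem Formulation -- Constraints

Sequence $S_j$: T G C A T A T

Constraint $C_i = c^i_1 c^i_2$: G A, so $\lambda_i = 2$

Hamming distance: $d(S_j, C_i) = 1$, measured between $C_i$ and a length-2 substring of $S_j$

With error ratio $\varepsilon = 0.5$: $d(S_j, C_i) \le \varepsilon \lambda_i = 1$, so $C_i$ approximately appears in $S_j$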
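The approximate-appearance test can be sketched in a few lines of Python. This is a simplified, ungapped illustration: it slides the constraint along a single sequence, whereas the slides apply the test to the subsequence inside an alignment band. The function names `hamming` and `approximately_appears` are hypothetical, chosen for this example.

```python
def hamming(u: str, v: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(u, v))


def approximately_appears(seq: str, constraint: str, eps: float) -> bool:
    """True if some substring of seq has at most eps * len(constraint) mismatches."""
    lam = len(constraint)      # lambda_i, the constraint length
    budget = eps * lam         # allowed number of mismatches
    return any(
        hamming(seq[p:p + lam], constraint) <= budget
        for p in range(len(seq) - lam + 1)
    )


# Slide example: S_j = TGCATAT, C_i = GA, error ratio 0.5
print(approximately_appears("TGCATAT", "GA", 0.5))
```

With an error ratio of 0 the same call asks for an exact occurrence of the constraint.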

Problem Formulation -- Constraints

Given $S = \{s_1, s_2, \ldots, s_x\}$ and a string $T = t_1 t_2 \cdots t_k$, $T$ approximately appears in an alignment $L$ of $S$ if $L$ has a band $L'$ with

$d(\mathrm{subseq}(s_i, L'), T) \le \varepsilon k$ for all $1 \le i \le x$.

Example alignment $L$:

A T G C A T C G C T
-- T G C A T -- -- A T
T T G C A T C A T C

(Figure labels: band $L'$; $\mathrm{subseq}(S_2, L')$; T G C C C)
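A minimal sketch of the band test follows, using the three-row example alignment from this slide. It makes two assumptions that the slide does not state: the band has exactly as many columns as the target string, and a gap inside the band counts as a mismatch. The name `constraint_in_band` is hypothetical, and gaps are written as a single `-`.

```python
def constraint_in_band(rows, start, end, target, eps):
    """Check whether `target` approximately appears in columns [start, end).

    rows   : aligned sequences of equal length, '-' marks a gap
    target : the constraint string T of length k
    eps    : error ratio; each row may differ from T in at most eps * k columns
    """
    k = len(target)
    if end - start != k:
        return False  # assumption: band width equals the constraint length
    budget = eps * k
    for row in rows:
        band = row[start:end]  # subseq(s_i, L') restricted to the band
        mismatches = sum(a != b for a, b in zip(band, target))
        if mismatches > budget:
            return False
    return True


alignment = ["ATGCATCGCT", "-TGCAT--AT", "TTGCATCATC"]
print(constraint_in_band(alignment, 1, 6, "TGCAT", 0.0))
```

Columns 1 to 5 (0-based) read TGCAT in every row, so the band satisfies the constraint even with an error ratio of 0.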

Problem Formulation

Constrained Multiple Sequence Alignment (CMSA)

Let $S = \{S_1, S_2, \ldots, S_x\}$ be sequences over the alphabet $\Sigma$. Let $\Omega = (C_1, C_2, \ldots, C_\gamma)$ be an ordered set of constraints. Then the CMSA of $S$ w.r.t. $\Omega$ is an alignment $L$ of $S$ over $\Sigma \cup \{-\}$ with the optimal sum-of-pairs score (SP score) in which all the constraints of $\Omega$ approximately appear in the order $C_1, C_2, \ldots, C_\gamma$, such that $d(\mathrm{subseq}(S_i, L'_j), C_j) \le \varepsilon \lambda_j$ for all $1 \le i \le x$ and $1 \le j \le \gamma$, where $\lambda_j$ is the length of $C_j$ and $L'_j$ is the band of $L$ whose induced consensus is $C_j$.

(Figure: sequences S1, S2, S3 aligned with bands C1, C2, C3; labels "Optimal Sum-of-Pair Score" and "CPSA")
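For readers unfamiliar with the SP score, here is a small sketch of how it can be computed for a given alignment. The scoring values (`match`, `mismatch`, `gap`) are assumptions made for the example, not the parameters used in the paper, and the gap cost is linear rather than affine.

```python
from itertools import combinations


def sp_score(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score: add up the pairwise score of every column.

    rows: aligned sequences of equal length, '-' marks a gap.
    A gap aligned with a gap contributes nothing.
    """
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


print(sp_score(["AC-", "ACT", "A-T"]))
```

For two sequences the SP score reduces to the ordinary pairwise alignment score, which is why the CPSA can serve as the building block of the CMSA.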

CMSA

1. Pick two sequences

2. Find the CPSA

3. Use it as a kernel to progressively align more sequences

[1] Progressive Multiple Alignment with Constraints, Gene Myers et al.

[2] MuSiC: A Tool for Multiple Sequence Alignment with Constraints, Yin Te Tsai, Chin Lung Lu, Ching Ta Yu, Yen Pin Huang

Algorithm Overview

Find the recursive relationship

(Figure: cell $M(i-1, j-1)$, covering $a_{i-1}$ and $b_{j-1}$, is extended to cell $M(i, j)$, covering $a_i$ and $b_j$)

Divide-and-Conquer

Notations

$A = a_1 a_2 \cdots a_m$

$B = b_1 b_2 \cdots b_n$

$\mathrm{pref}(A, i) = a_1 a_2 \cdots a_i = A_i$

$\mathrm{suff}(A, i) = a_{i+1} a_{i+2} \cdots a_m = \bar{A}_i$ (the suffix is written with an overbar here)

$A = A_i \bar{A}_i$: $a_1 a_2 \cdots a_i \mid a_{i+1} a_{i+2} \cdots a_m$

Notations

$\mathrm{pref}(B, j) = b_1 b_2 \cdots b_j = B_j$

$\mathrm{suff}(B, j) = b_{j+1} b_{j+2} \cdots b_n = \bar{B}_j$

$\mathrm{pref}(\Omega, k) = (C_1, C_2, \ldots, C_k) = \Omega_k$

$\mathrm{suff}(\Omega, k) = (C_{k+1}, C_{k+2}, \ldots, C_\gamma) = \bar{\Omega}_k$

(suffixes are written with an overbar here)

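Notation

(Figure: $A = a_1 a_2 \cdots a_m$ is split after $a_i$ into the prefix $A_i$ and the remaining suffix, $B = b_1 b_2 \cdots b_n$ is split after $b_j$ into the prefix $B_j$ and the remaining suffix, and the constraints $C_1 \ldots C_k \ldots C_\gamma$ are split at $C_k$. The prefix side carries the scores $M_k(i, j)$, $M^S_k(i, j)$, $M^D_k(i, j)$ and $M^I_k(i, j)$; the suffix side carries the corresponding four scores, written here with an overbar as $\bar{M}_k(i, j)$, $\bar{M}^S_k(i, j)$, $\bar{M}^D_k(i, j)$ and $\bar{M}^I_k(i, j)$.)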

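Alignment Score

Let $M_k(i, j)$ be the score of an optimal constrained alignment of $A_i$ and $B_j$ w.r.t. $\Omega_k$, where $A_i$ and $B_j$ are the length-$i$ and length-$j$ prefixes of $A$ and $B$ and $\Omega_k = (C_1, C_2, \ldots, C_k)$.

(Figure: $A$ and $B$ aligned with bands $C_1, C_2, \ldots, C_k$)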

Alignment Score -- Substitution

Let $M^S_k(i, j)$ be the maximum score of all constrained alignments of $A_i$ and $B_j$ w.r.t. $\Omega_k$ that end with a substitution pair $(a_i, b_j)$.

(Figure: $A$ and $B$ aligned with bands $C_1, C_2, \ldots, C_k$, ending with $a_i$ over $b_j$)

Alignment Score -- Deletion

Let $M^D_k(i, j)$ be the maximum score of all constrained alignments of $A_i$ and $B_j$ w.r.t. $\Omega_k$ that end with a deletion pair $(a_i, -)$.

(Figure: $A$ and $B$ aligned with bands $C_1, C_2, \ldots, C_k$, ending with $a_i$ over a gap)

Alignment Score -- Insertion

Let $M^I_k(i, j)$ be the maximum score of all constrained alignments of $A_i$ and $B_j$ w.r.t. $\Omega_k$ that end with an insertion pair $(-, b_j)$.

(Figure: $A$ and $B$ aligned with bands $C_1, C_2, \ldots, C_k$, ending with a gap over $b_j$)

Semi-Constrained Alignment

A semi-constrained alignment of $A_i$ and $B_j$ w.r.t. $\Omega_k$ is a constrained alignment of $A_i$ and $B_j$ w.r.t. $\Omega_{k-1}$ that ends with a band which is a prefix of $C_k$.

(Figure: $A$ and $B$ aligned with bands $C_1, C_2, \ldots, C_{k-1}$, followed by a final band of length $h$ for $\mathrm{pref}(C_k, h)$; the score is denoted $N_k(i, j, h)$)

Recurrence of Scores

If $k = 0$, then

$$M_k(i, j) = \max \begin{cases} M_k(i-1, j-1) + \sigma(a_i, b_j) \\ M^D_k(i, j) \\ M^I_k(i, j) \end{cases}$$

where $\sigma(a_i, b_j)$ is the score of the substitution pair $(a_i, b_j)$.

Recurrence of Scores

If $1 \le k \le \gamma$, then

$$M_k(i, j) = \max \begin{cases} M_k(i-1, j-1) + \sigma(a_i, b_j) \\ M^D_k(i, j) \\ M^I_k(i, j) \\ N_k(i, j, \lambda_k) \end{cases}$$

where $\sigma(a_i, b_j)$ is the score of the substitution pair $(a_i, b_j)$ and $\lambda_k$ is the length of $C_k$.

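Recurrence of Scores

If the last $\lambda_k$ characters of $A_i$ and of $B_j$ both approximately match $C_k$, that is, $d(a_{i-\lambda_k+1} \cdots a_i, C_k) \le \varepsilon \lambda_k$ and $d(b_{j-\lambda_k+1} \cdots b_j, C_k) \le \varepsilon \lambda_k$, then

$$N_k(i, j, \lambda_k) = M_{k-1}(i - \lambda_k, j - \lambda_k) + \sum_{0 \le h \le \lambda_k - 1} \sigma(a_{i-h}, b_{j-h})$$

else

$$N_k(i, j, \lambda_k) = -\infty$$

(Figure: $a_1 a_2 \cdots a_{i-h} \cdots a_{i-1} a_i$ over $b_1 b_2 \cdots b_{j-h} \cdots b_{j-1} b_j$, with the final columns forming the band $C_k$)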
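The band term can be sketched as a small Python helper. This is an illustration under assumptions: `m_prev` is a hypothetical table holding $M_{k-1}$, `sigma` is a caller-supplied substitution score, and an infeasible band is reported as negative infinity. Strings are 0-indexed, so the last $\lambda_k$ characters of the length-$i$ prefix are `a[i - lam:i]`.

```python
NEG_INF = float("-inf")


def hamming(u, v):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(u, v))


def band_term(a, b, c_k, eps, i, j, m_prev, sigma):
    """N_k(i, j, lambda_k): end the alignment of a[:i], b[:j] with a band for c_k."""
    lam = len(c_k)
    if i < lam or j < lam:
        return NEG_INF                      # not enough characters for the band
    tail_a, tail_b = a[i - lam:i], b[j - lam:j]
    budget = eps * lam
    if hamming(tail_a, c_k) > budget or hamming(tail_b, c_k) > budget:
        return NEG_INF                      # constraint does not approximately appear
    band = sum(sigma(x, y) for x, y in zip(tail_a, tail_b))
    return m_prev[i - lam][j - lam] + band


sigma = lambda x, y: 1 if x == y else -1
zeros = [[0] * 5 for _ in range(5)]         # stand-in for the M_{k-1} table
print(band_term("ACGT", "AGGT", "GT", 0.0, 4, 4, zeros, sigma))
```

The two Hamming checks cost $O(\lambda_k)$ per cell in this naive form; a real implementation would likely precompute where each constraint approximately matches.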

How is $M^D_k(i, j)$ obtained from the alignments scored at $(i-1, j)$?

(Figure: $a_1 a_2 \cdots a_{i-1} a_i$ over $b_1 b_2 \cdots b_{j-1} b_j$)

The alignment of $A_{i-1}$ and $B_j$ may end with a

Substitution $(a_{i-1}, b_j)$: $M^D_k(i, j) = M^S_k(i-1, j) - w_o - w_e$

Deletion $(a_{i-1}, -)$: $M^D_k(i, j) = M^D_k(i-1, j) - w_e$

Insertion $(-, b_j)$: $M^D_k(i, j) = M^I_k(i-1, j) - w_o - w_e$

where $w_o$ is the gap-open penalty and $w_e$ the gap-extension penalty.

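Recurrence

$$M^D_k(i, j) = \max \begin{cases} M^S_k(i-1, j) - w_o - w_e \\ M^I_k(i-1, j) - w_o - w_e \\ M^D_k(i-1, j) - w_e \end{cases}$$

Since $M_k(i-1, j)$ is the best score over all ways the alignment at $(i-1, j)$ can end, and $M^D_k(i-1, j) - w_o - w_e \le M^D_k(i-1, j) - w_e$, this simplifies to

$$M^D_k(i, j) = \max \begin{cases} M_k(i-1, j) - w_o - w_e \\ M^D_k(i-1, j) - w_e \end{cases}$$

Symmetrically, an insertion pair $(-, b_j)$ extends an alignment of $A_i$ and $B_{j-1}$:

$$M^I_k(i, j) = \max \begin{cases} M_k(i, j-1) - w_o - w_e \\ M^I_k(i, j-1) - w_e \end{cases}$$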
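The gap recurrences can be tried out on the unconstrained case $k = 0$. The sketch below fills the three tables $M$, $M^D$, $M^I$ for a global alignment with affine gaps, assuming a gap of length $\ell$ costs $w_o + \ell \, w_e$ and that leading gaps are charged the same way. The function name and the boundary handling are assumptions of this example; the insertion recurrence looks at $(i, j-1)$ because an insertion pair consumes $b_j$.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, sigma, w_o, w_e):
    """Optimal global alignment score with affine gap penalties (k = 0 case)."""
    m, n = len(a), len(b)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # best score, any ending
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends with (a_i, -)
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends with (-, b_j)
    M[0][0] = 0
    for i in range(1, m + 1):                          # first column: one long deletion
        D[i][0] = -w_o - i * w_e
        M[i][0] = D[i][0]
    for j in range(1, n + 1):                          # first row: one long insertion
        I[0][j] = -w_o - j * w_e
        M[0][j] = I[0][j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(M[i - 1][j] - w_o - w_e, D[i - 1][j] - w_e)
            I[i][j] = max(M[i][j - 1] - w_o - w_e, I[i][j - 1] - w_e)
            S = M[i - 1][j - 1] + sigma(a[i - 1], b[j - 1])
            M[i][j] = max(S, D[i][j], I[i][j])
    return M[m][n]


sigma = lambda x, y: 1 if x == y else -1
print(affine_global_score("ACGT", "AGT", sigma, w_o=2, w_e=1))
```

Only the previous row of each table is needed to compute the next one, which is what makes a linear-space score pass possible later in the divide-and-conquer step.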

(Figure: three-dimensional dynamic-programming lattice with axes Sequence A, Sequence B and Constraints, running from $(0, 0, 0)$ to $(m, n, \gamma)$; cell $(i, j, k)$ is shown with its neighbours $(i, j-1, k)$, $(i-1, j-1, k)$ and $(i-1, j, k)$)

$$N_k(i, j, \lambda_k) = M_{k-1}(i - \lambda_k, j - \lambda_k) + \sum_{0 \le h \le \lambda_k - 1} \sigma(a_{i-h}, b_{j-h})$$

(Figure: the band scored by $N_k$ in the lattice)
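Putting the recurrences together, the whole lattice can be filled layer by layer. The following is a minimal sketch rather than the paper's algorithm: it keeps the full $(\gamma+1) \times (m+1) \times (n+1)$ table, uses a linear gap penalty `w_e` (no open-gap penalty), assumes non-empty constraints, and recomputes the Hamming distances at every cell. The name `cpsa_score` is hypothetical.

```python
NEG_INF = float("-inf")


def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))


def cpsa_score(a, b, constraints, eps, sigma, w_e):
    """Score of the best alignment of a and b in which every constraint,
    in order, approximately appears as an ungapped band (-inf if impossible)."""
    m, n, g = len(a), len(b), len(constraints)
    # M[k][i][j]: best score for a[:i], b[:j] with the first k constraints placed
    M = [[[NEG_INF] * (n + 1) for _ in range(m + 1)] for _ in range(g + 1)]
    for k in range(g + 1):
        c = constraints[k - 1] if k else ""
        lam = len(c)
        for i in range(m + 1):
            for j in range(n + 1):
                best = 0 if (i == 0 and j == 0 and k == 0) else NEG_INF
                if i and j:    # substitution pair (a_i, b_j)
                    best = max(best, M[k][i - 1][j - 1] + sigma(a[i - 1], b[j - 1]))
                if i:          # deletion pair (a_i, -)
                    best = max(best, M[k][i - 1][j] - w_e)
                if j:          # insertion pair (-, b_j)
                    best = max(best, M[k][i][j - 1] - w_e)
                if k and i >= lam and j >= lam:   # band for C_k ends here
                    ta, tb = a[i - lam:i], b[j - lam:j]
                    if hamming(ta, c) <= eps * lam and hamming(tb, c) <= eps * lam:
                        band = sum(sigma(x, y) for x, y in zip(ta, tb))
                        best = max(best, M[k - 1][i - lam][j - lam] + band)
                M[k][i][j] = best
    return M[g][m][n]


sigma = lambda x, y: 1 if x == y else -1
print(cpsa_score("ACGT", "AGT", ["GT"], 0.0, sigma, w_e=1))
```

This version uses space proportional to $\gamma m n$, which is exactly the cost the divide-and-conquer scheme is meant to avoid.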

Assignment

Design an algorithm to find the CPSA using the dynamic programming technique. Analyze the time and space complexity of your algorithm. For simplicity, you can ignore the open-gap penalty. Prove that your algorithm is consistent with the constraint set $\Omega$, i.e., it will find such a CPSA if one exists.


Divide-and-Conquer

(Figure: $A = a_1 a_2 \cdots a_m$ is split after $a_i$, $B = b_1 b_2 \cdots b_n$ after $b_j$, and the constraints $C_1 \ldots C_k \ldots C_\gamma$ inside $C_k$ after $h$ characters, into $\mathrm{pref}(C_k, h)$ and $\mathrm{suff}(C_k, \lambda_k - h)$. The prefix side carries $M_k(i, j)$, $M^S_k(i, j)$, $M^D_k(i, j)$, $M^I_k(i, j)$ and $N_k(i, j, h)$; the suffix side carries the corresponding scores, written here with an overbar, such as $\bar{M}_k(i, j)$ and $\bar{N}_k(i, j, h)$.)

$\Omega_k(h) = [C_1, C_2, \ldots, C_{k-1}, \mathrm{pref}(C_k, h)]$

$\bar{\Omega}_k(h) = [\mathrm{suff}(C_k, h), C_{k+1}, \ldots, C_\gamma]$

Divide-and-Conquer

Prefix part: $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$. Suffix part: $\bar{L}_{k_{mid}}(\bar{A}_{i_{mid}}, \bar{B}_{j_{mid}})$. (Overbars mark the suffix-side counterparts.)

Case 1: if the last pair of $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ is a substitution

A. $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ and $\bar{L}_{k_{mid}}(\bar{A}_{i_{mid}}, \bar{B}_{j_{mid}})$ are optimal constrained alignments:

$$M(m, n) = M^S_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}_{k_{mid}}(i_{mid}, j_{mid})$$

B. $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ and $\bar{L}_{k_{mid}}(\bar{A}_{i_{mid}}, \bar{B}_{j_{mid}})$ are optimal semi-constrained alignments:

$$M(m, n) = N_{k_{mid}}(i_{mid}, j_{mid}, h_{mid}) + \bar{N}_{k_{mid}}(i_{mid}, j_{mid}, h_{mid})$$

C. If $h_{mid} = \lambda_{k_{mid}}$:

$$M(m, n) = N_{k_{mid}}(i_{mid}, j_{mid}, \lambda_{k_{mid}}) + \bar{M}_{k_{mid}}(i_{mid}, j_{mid})$$

Divide-and-Conquer

Prefix part: $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$. Suffix part: $\bar{L}_{k_{mid}}(\bar{A}_{i_{mid}}, \bar{B}_{j_{mid}})$. (Overbars mark the suffix-side counterparts.)

Case 2: if the last pair of $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ is a deletion

A. If the first pair of $\bar{L}_{k_{mid}}(\bar{A}_{i_{mid}}, \bar{B}_{j_{mid}})$ is not a deletion pair:

$$M(m, n) = \max \{ M^D_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}^S_{k_{mid}}(i_{mid}, j_{mid}),\; M^D_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}^I_{k_{mid}}(i_{mid}, j_{mid}) \}$$

B. If the first pair of $\bar{L}_{k_{mid}}(\bar{A}_{i_{mid}}, \bar{B}_{j_{mid}})$ is a deletion pair, the gap-open penalty $w_o$ has been charged on both sides and is added back once:

$$M(m, n) = M^D_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}^D_{k_{mid}}(i_{mid}, j_{mid}) + w_o$$

Summary

Writing $(\cdot)$ for the arguments $(i_{mid}, j_{mid})$ and using overbars for the suffix-side scores:

$$M(m, n) = \max \begin{cases} M^D_{k_{mid}}(\cdot) + \bar{M}^I_{k_{mid}}(\cdot) \\ M^D_{k_{mid}}(\cdot) + \bar{M}^S_{k_{mid}}(\cdot) \\ M^S_{k_{mid}}(\cdot) + \bar{M}_{k_{mid}}(\cdot) \\ M^D_{k_{mid}}(\cdot) + \bar{M}^D_{k_{mid}}(\cdot) + w_o \\ N_{k_{mid}}(i_{mid}, j_{mid}, h_{mid}) + \bar{N}_{k_{mid}}(i_{mid}, j_{mid}, h_{mid}) \\ N_{k_{mid}}(i_{mid}, j_{mid}, \lambda_{k_{mid}}) + \bar{M}_{k_{mid}}(\cdot) \end{cases}$$

Together with $M^D_{k_{mid}}(\cdot) + \bar{M}^D_{k_{mid}}(\cdot)$, the first two terms combine into $M^D_{k_{mid}}(\cdot) + \bar{M}_{k_{mid}}(\cdot)$.

Summary

Take $j$, $k$, $h$ as the indices $j_{mid}$, $k_{mid}$ and $h_{mid}$, where $1 \le j \le n$, $0 \le k \le \gamma$ and $1 \le h \le \lambda_k$, such that the following value is the maximum (overbars mark the suffix-side scores):

$$M(m, n) = \max \begin{cases} M^D_k(i_{mid}, j) + \bar{M}_k(i_{mid}, j) \\ M^S_k(i_{mid}, j) + \bar{M}_k(i_{mid}, j) \\ M^D_k(i_{mid}, j) + \bar{M}^D_k(i_{mid}, j) + w_o \\ N_k(i_{mid}, j, h) + \bar{N}_k(i_{mid}, j, h) \\ N_k(i_{mid}, j, \lambda_k) + \bar{M}_k(i_{mid}, j) \end{cases}$$

That is, $M(m, n) = \max\{F(i_{mid}, j, k, h)\}$ with $i_{mid} = m/2$.

Implementation -- CPSA-DC()

Algorithm CPSA-DC($i_{start}$, $i_{end}$, $j_{start}$, $j_{end}$, $k_{start}$, $k_{end}$)

1. Divide $A$ in two at $i_{mid}$, then call BestScore() and BestScoreRev(); the sizes of $B$ and of the constraints are not changed.

2. BestScore() and BestScoreRev() return all the alignment scores of $(A_{i_{mid}}, B_j, \Omega_k)$ and of the corresponding suffixes $(\bar{A}_{i_{mid}}, \bar{B}_j, \bar{\Omega}_k)$.

3. Find $\max\{F(i_{mid}, j, k, h)\}$; the maximizing values of $j$, $k$, $h$ are used as the middle point to divide the alignment for the recursive calls of CPSA-DC().
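The split step follows the same pattern as Hirschberg's linear-space alignment, which the sketch below uses as a stand-in. It is deliberately simplified: no constraints, a linear gap penalty `w_e`, and only the split column $j_{mid}$ is searched (the slides also search over $k$ and $h$). `last_row` plays the role of BestScore(), and calling it on reversed strings plays the role of BestScoreRev(); both names and `best_split` are hypothetical.

```python
def last_row(a, b, sigma, w_e):
    """Scores of aligning all of a against every prefix of b, in O(len(b)) space."""
    prev = [-w_e * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-w_e * i]
        for j in range(1, len(b) + 1):
            cur.append(max(prev[j - 1] + sigma(a[i - 1], b[j - 1]),  # substitution
                           prev[j] - w_e,                            # deletion
                           cur[j - 1] - w_e))                        # insertion
        prev = cur
    return prev


def best_split(a, b, sigma, w_e):
    """Cut a in the middle and find the column of b where the optimal alignment crosses."""
    i_mid = len(a) // 2
    n = len(b)
    fwd = last_row(a[:i_mid], b, sigma, w_e)               # prefix scores
    rev = last_row(a[i_mid:][::-1], b[::-1], sigma, w_e)   # suffix scores, reversed
    totals = [fwd[j] + rev[n - j] for j in range(n + 1)]
    j_mid = max(range(n + 1), key=totals.__getitem__)
    return i_mid, j_mid, totals[j_mid]


sigma = lambda x, y: 1 if x == y else -1
print(best_split("ACGT", "ACGT", sigma, w_e=1))
```

Recursing on the two halves `(a[:i_mid], b[:j_mid])` and `(a[i_mid:], b[j_mid:])` recovers the alignment itself while only ever holding two score rows in memory.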

(Figure: with $A$ fixed at the prefix $A_{i_{mid}}$, the scores $M_k(i_{mid}, j)$ are computed for every split of $B = b_1 b_2 \cdots b_n$ after $b_j$ and for every constraint index $k$ among $C_1 \ldots C_k \ldots C_\gamma$)

Complexity

Space: a single matrix $E$ of size $(\gamma + 1)(n + 1)$, with each entry holding $4 + \lambda_k$ values: $M_k(i_{mid}, j)$, $M^S_k(i_{mid}, j)$, $M^D_k(i_{mid}, j)$, $M^I_k(i_{mid}, j)$ and $N_k(i_{mid}, j, h)$ for $1 \le h \le \lambda_k$

Temporary space: $V$, the same size as $E$

Total space: $O\big((\gamma + \sum_k \lambda_k)\, n\big)$, which follows from the sizes above

Time: let $T$ be the size of the original problem, proportional to $mn$; then the total time complexity of the CPSA-DC algorithm is equal to $T + T/2 + T/4 + T/8 + \cdots = 2T$
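The time bound is a geometric series. The derivation below assumes, as in Hirschberg-style divide-and-conquer, that the total size of the subproblems halves at each level of recursion; the symbol $T$ is introduced here for the size of the original problem.

```latex
\[
T + \frac{T}{2} + \frac{T}{4} + \frac{T}{8} + \cdots
  = T \sum_{\ell = 0}^{\infty} \left(\frac{1}{2}\right)^{\ell}
  = T \cdot \frac{1}{1 - \tfrac{1}{2}}
  = 2T .
\]
```

Under that assumption the divide-and-conquer version costs at most about twice the time of a single full dynamic-programming pass.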


Experimental Results


Discussion

The paper gives no proof that the algorithm is consistent with the constraints

An optimal pairwise alignment of subsequences might prevent the overall alignment from being optimal


Discussion

http://genome.life.nctu.edu.tw:8080/MUSICME/index.html

Assignment

Let $S = \{S_1, S_2\}$ be sequences over the alphabet $\Sigma$. Let $\Omega = (C_1, C_2, \ldots, C_\gamma)$ be an ordered set of constraints, where $C_i = c^i_1 c^i_2 \cdots c^i_{\lambda_i}$. Then the Constrained Pair-wise Sequence Alignment (CPSA) of $S$ w.r.t. $\Omega$ is an alignment $L$ of $S$ over $\Sigma \cup \{-\}$ with the optimal sum-of-pair score (SP score) in which all the constraints of $\Omega$ approximately appear in the order $C_1, C_2, \ldots, C_\gamma$, such that the Hamming distance $d(\mathrm{subseq}(S_i, L'_j), C_j) \le \varepsilon \lambda_j$ for all $1 \le i \le 2$ and $1 \le j \le \gamma$, where $0 \le \varepsilon < 1$ and $L'_j$ is the band of $L$ whose induced consensus is $C_j$. A band is a block of consecutive columns in $L$.

Design an efficient algorithm to find the CPSA using the dynamic programming technique or whatever method you prefer. For simplicity, you can ignore the open-gap penalty. Analyze the time and space complexity of your algorithm. Prove that your algorithm is consistent with the constraint set $\Omega$, i.e., it will find such a CPSA if one exists.

Reference

Efficient Constrained Multiple Sequence Alignment with Performance Guarantee, Francis Y.L. Chin, N.L. Ho, T.W. Lam, Prudence W.H. Wong, M.Y. Chan

Divide-and-conquer multiple alignment with segment-based constraints, Michael Sammeth, Burkhard Morgenstern and Jens Stoye

Multiple sequence alignment with the divide-and-conquer method, Jens Stoye

MuSiC: A Tool for Multiple Sequence Alignment with Constraints, Yin Te Tsai, Chin Lung Lu, Ching Ta Yu, Yen Pin Huang