Aligning Alignments Exactly


By John Kececioglu and Dean Starrett, CS Dept., Univ. of Arizona

Appeared in the 8th ACM RECOMB, 2004

Presented by Jie Meng

Outline: Background, Definition, Hardness, An Exponential-Time Algorithm


Alignments

Given two (DNA or protein) sequences, an alignment puts them against each other such that the similar parts are aligned as closely as possible, for example:

A T - C - T C G C T
- T G - A T G - A T

There are four kinds of aligned columns:

Match

Insertion

Deletion

Mismatch

Scoring Alignments

There are four types of aligned columns:

- Match: score_match = 0.

- Mismatch: score_mismatch ≥ 0.

- Insertion: score_insertion ≥ 0.

- Deletion: score_deletion ≥ 0.

The score of an alignment is defined to be the sum of the scores of its aligned columns.

The goal is to minimize the score.
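As a rough illustration of the column-sum scoring just described, the following Python sketch scores a pairwise alignment written with "-" for a space. The function names and the cost values (mismatch = 1, a single indel cost of 2 for both insertions and deletions) are assumptions chosen for the example, not values from the slides.

```python
def column_score(a, b, mismatch=1, indel=2):
    """Cost of one aligned column; '-' denotes a space (costs are assumed values)."""
    if a == "-" and b == "-":
        raise ValueError("a column may not consist only of spaces")
    if a == "-" or b == "-":
        return indel          # insertion or deletion column
    return 0 if a == b else mismatch  # match costs 0


def alignment_score(row1, row2, mismatch=1, indel=2):
    """Score of a pairwise alignment: the sum of its column scores."""
    if len(row1) != len(row2):
        raise ValueError("both rows must have the same number of columns")
    return sum(column_score(a, b, mismatch, indel) for a, b in zip(row1, row2))


# The example alignment from the Alignments slide.
print(alignment_score("AT-C-TCGCT", "-TG-ATG-AT"))
```

Under these assumed costs the example alignment has five indel columns and two mismatch columns; a lower total means a better alignment.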

Gap-cost

We can refine the indel score into a gap-open cost (open) and a gap-extension cost (extension): a gap of size x then costs open + x * extension instead of x * indel.

AT----CGCTTCAT
-TGCAT--AT-----

The gap of size 4 in the first row costs open + 4 * extension.
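The gap-cost rule can be sketched in Python as follows. This is a minimal sketch that treats every maximal run of "-" in one row as a single gap; the function name and the default costs (open = 2, extension = 1) are assumptions for illustration.

```python
from itertools import groupby


def gap_cost(row, open_cost=2, extension=1):
    """Total gap cost of one row: each maximal run of x spaces costs open + x * extension."""
    total = 0
    for is_space, run in groupby(row, key=lambda ch: ch == "-"):
        if is_space:
            x = len(list(run))          # size of this gap
            total += open_cost + x * extension
    return total


# First row of the slide's example: one gap of size 4.
print(gap_cost("AT----CGCTTCAT"))
```

With a single indel cost the same gap would cost 4 * indel; splitting the cost into open and extension makes one long gap cheaper than several short ones of the same total size.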

Multiple Alignments

In general we also need to compare multiple sequences and find their similarities.

Multiple alignment generalizes the alignment idea to handle many sequences.

AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT

Sum-of-Pairs (SP) Score

Given a multiple alignment, the sum-of-pairs (SP) score is the sum of the induced pairwise alignment scores of each pair of rows in the alignment.

AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT

AT-C-TCGAT     -TGCAT--AT     AT-C-TCGAT
-TGCAT--AT  +  ATCCA-CGCT  +  ATCCA-CGCT
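The SP score can be computed directly from its definition. The Python sketch below assumes simple per-column costs (mismatch = 1, indel = 2, no separate gap-open cost) and skips columns in which both rows of a pair have a space; the names and cost values are illustrative assumptions.

```python
from itertools import combinations


def induced_pair_score(row1, row2, mismatch=1, indel=2):
    """Score of the pairwise alignment induced by two rows of a multiple alignment."""
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            continue                  # column vanishes in the induced alignment
        if a == "-" or b == "-":
            score += indel
        elif a != b:
            score += mismatch
    return score


def sp_score(rows, mismatch=1, indel=2):
    """Sum-of-pairs score: sum of induced pairwise scores over all pairs of rows."""
    return sum(induced_pair_score(r1, r2, mismatch, indel)
               for r1, r2 in combinations(rows, 2))


rows = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(sp_score(rows))
```

For the three-row example the three induced pairwise scores under these assumed costs are 10, 7, and 10.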

BAD NEWS

Multiple alignment is NP-hard.

One method is to approximate the optimal value.

Progressive alignments

A problem arises naturally: Aligning Alignments

Aligning Alignments

Let S be a collection of strings s1, s2, s3, …, sk over an alphabet Σ.

An alignment of S is a matrix A with k rows such that:
i) each entry is either a letter or a space;
ii) no column is all spaces;
iii) reading across row i and removing the spaces yields string si.

As before, each aligned column is scored as a match, a mismatch (substitution), or an indel (insertion or deletion).
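The three conditions in the definition of an alignment can be checked mechanically. In this Python sketch the matrix is a list of equal-length strings with "-" standing for a space; the function name is hypothetical, and condition i) holds automatically because every entry is a single character.

```python
def is_alignment(matrix, strings):
    """Check that `matrix` (rows as strings, '-' = space) is an alignment of `strings`."""
    if len(matrix) != len(strings):
        return False
    if len({len(row) for row in matrix}) > 1:
        return False                      # rows must form a rectangular matrix
    # ii) no column may consist only of spaces
    if any(all(ch == "-" for ch in column) for column in zip(*matrix)):
        return False
    # iii) removing the spaces from row i must give back string s_i
    return all(row.replace("-", "") == s for row, s in zip(matrix, strings))


A = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(is_alignment(A, ["ATCTCGAT", "TGCATAT", "ATCCACGCT"]))
```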

Aligning Alignments

Given two alignments, A with k sequences of length N and B with l sequences of length M, we want to align the columns of A and B.

A:
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGAT

B:
CT-ATTGGAT
-TTAT-G--T
CTTA-GGGAT

Aligning Alignments

In other words, we treat the columns of A and B as single letters, just as when aligning two sequences.

CTGT-T

AT-TGT

C-TG-T--T

-AT--T-GT

Aligning Alignments

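The score function is still the sum-of-pairs score, namely

$$\sum_{1 \le i \le k} \; \sum_{1 \le j \le l} D(A'_i, B'_j)$$

where A'_i and B'_j are row i of A and row j of B after the columns have been aligned, and D is the score of their induced pairwise alignment.

Note that the alignment of Ai' and Bj' may contain columns with a space in both sequences; such columns are simply removed.

Ai': a----aa-a

Bj': aaa-a-a-a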
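The removal of shared-space columns mentioned above can be illustrated on the slide's own pair of rows. This is a small Python sketch; the function name is an assumption.

```python
def induced_pairwise(row_a, row_b):
    """Drop every column in which both rows have a space ('-')."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    new_a = "".join(a for a, _ in kept)
    new_b = "".join(b for _, b in kept)
    return new_a, new_b


# Columns 4 and 8 contain a space in both rows and are removed.
print(induced_pairwise("a----aa-a", "aaa-a-a-a"))
```

Only after this step are gaps counted, which is why a gap in the induced alignment can span columns that were not adjacent in the aligned matrices.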

Aligning Alignments

Without gap cost, aligning alignments is solvable in polynomial time. We can apply dynamic programming as we did when aligning sequences; the only difference is that we align columns.
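A minimal Python sketch of this column-wise dynamic program follows. It assumes per-column costs only (mismatch = 1, indel = 1, no gap-open cost) and counts only pairs made of one row of A and one row of B, since the pairs inside A and inside B are not changed by the alignment. All names and cost values are assumptions for illustration.

```python
def column_pair_cost(col_a, col_b, mismatch=1, indel=1):
    """Sum-of-pairs cost of placing column col_a of A against column col_b of B."""
    cost = 0
    for a in col_a:
        for b in col_b:
            if a == "-" and b == "-":
                continue                  # shared space: no induced column
            if a == "-" or b == "-":
                cost += indel
            elif a != b:
                cost += mismatch
    return cost


def align_alignments_no_gap_cost(A, B, mismatch=1, indel=1):
    """Optimal cost of aligning the columns of A and B without gap-open costs."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    space_a, space_b = ("-",) * len(A), ("-",) * len(B)   # all-space columns
    n, m = len(cols_a), len(cols_b)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = C[i - 1][0] + column_pair_cost(cols_a[i - 1], space_b, mismatch, indel)
    for j in range(1, m + 1):
        C[0][j] = C[0][j - 1] + column_pair_cost(space_a, cols_b[j - 1], mismatch, indel)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i][j] = min(
                C[i - 1][j - 1] + column_pair_cost(cols_a[i - 1], cols_b[j - 1], mismatch, indel),
                C[i - 1][j] + column_pair_cost(cols_a[i - 1], space_b, mismatch, indel),
                C[i][j - 1] + column_pair_cost(space_a, cols_b[j - 1], mismatch, indel),
            )
    return C[n][m]


print(align_alignments_no_gap_cost(["ACT"], ["AT"]))
```

The table has (n + 1)(m + 1) cells and each cell compares two columns in time proportional to the product of the row counts, so the running time is polynomial.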

Aligning Alignments

With gap cost, this problem is NP-complete.

We can use a reduction from the MAX-CUT problem.

MAX-CUT: Given a graph G = (V, E) and an integer c, ask whether there is a partition of V into L and R (V = L ∪ R and L ∩ R = ∅) such that the size of the cut is no less than c.

The cut is the set of edges that have one end vertex in L and the other in R.
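To make the MAX-CUT decision problem concrete, here is a brute-force Python sketch that tries every partition of V into L and R. It is exponential in the number of vertices and is meant only to illustrate the definition; the function names are assumptions.

```python
from itertools import product


def cut_size(edges, left):
    """Number of edges with exactly one end vertex in `left`."""
    return sum(1 for u, v in edges if (u in left) != (v in left))


def has_cut_of_size(vertices, edges, c):
    """Decide MAX-CUT by brute force: is there a partition whose cut has size >= c?"""
    for mask in product([False, True], repeat=len(vertices)):
        left = {v for v, in_left in zip(vertices, mask) if in_left}
        if cut_size(edges, left) >= c:
            return True
    return False


triangle = [(1, 2), (2, 3), (1, 3)]
print(has_cut_of_size([1, 2, 3], triangle, 2))
print(has_cut_of_size([1, 2, 3], triangle, 3))
```

In a triangle any partition separates at most two of the three edges, so the answer is yes for c = 2 and no for c = 3.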

NP-hardness

• Given an instance of MAX-CUT: G = (V, E) with V = {v1, v2, …, vn} and E = {e1, e2, …, em}, and an integer c;

• we construct two multiple alignments A and B over the alphabet {0, 1}: both A and B have m edge rows and k dummy rows, each edge row corresponding to an edge; A has 2n columns, every two consecutive columns corresponding to a vertex; B has 3n columns, every three consecutive columns corresponding to a vertex.

NP-hardness

• The dummy rows in A are (0-)^n; the dummy rows in B are (0--)^n.

• Edge rows in A: for the row of edge e = (vi, vj), the column pairs for vi and vj each contain the substring “-1”, and all other entries are spaces.

• Edge rows in B: for the row of edge e = (vi, vj) with i < j, the column triple for vi contains the substring “010” and the column triple for vj contains the substring “-10”.

NP-hardness

• We simply let the score for a match be 0, the score for a mismatch be 1, the gap-open cost be 2, and the gap-extension cost be 1,

and ask whether there is an alignment whose score is less than d - c.

So we have an instance of Aligning Alignments.

HOMEWORK 4

• Given a set of multiple alignments {A1, A2, …, An}, where each Ai is a multiple alignment with ki sequences, consider the problem of aligning all of them without gap cost, using the method in this paper, i.e. aligning columns. Is this problem hard or easy? If hard, prove it; otherwise, give an efficient algorithm and prove its complexity and correctness.

Exact Algorithm

The basic idea is still dynamic programming.

We have to remember extra information in a set S of so-called shapes: for each row in a multiple alignment, a shape records the column of the right-most letter.
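Read literally, the description of a shape can be sketched in Python as below: for each row, record the 1-based index of the column holding its right-most letter, or 0 if the row has no letter yet. This is only an illustration of the slide's wording; the representation used in the paper may differ, and the function name is an assumption.

```python
def shape(rows):
    """For each row ('-' = space), the 1-based column of its right-most letter (0 if none)."""
    result = []
    for row in rows:
        last = 0
        for col, ch in enumerate(row, start=1):
            if ch != "-":
                last = col            # remember the latest letter seen so far
        result.append(last)
    return tuple(result)


print(shape(["AT-C-", "-TGCA", "-----"]))
```

Two partial alignments with the same shape open the same gaps when a new column is appended, which is what lets the dynamic program keep one cost per shape rather than one per alignment.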

Exact Algorithm

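$$
S(i,j) =
\begin{cases}
S(i-1,j-1)\,(A[i],B[j]) \;\cup\; S(i,j-1)\,(-,B[j]) \;\cup\; S(i-1,j)\,(A[i],-) & \text{otherwise} \\
\{\emptyset\} & i = 0 \text{ and } j = 0 \\
\{\} & i < 0 \text{ or } j < 0
\end{cases}
$$

Here S(·,·)(x, y) denotes the set of shapes obtained by appending the column pair (x, y) to each shape in the set.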

Exact Algorithm

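$$
C(i,j,t) = \min
\begin{cases}
\min\limits_{s \in S(i-1,j-1) \,\&\, s\,(A[i],B[j]) = t} \Big\{ C(i-1,j-1,s) + \mathit{open} \cdot g(A[i],B[j],s) + \sum\limits_{p \in A,\; q \in B} D(A_p[i], B_q[j]) \Big\} \\[6pt]
\min\limits_{s \in S(i,j-1) \,\&\, s\,(-,B[j]) = t} \Big\{ C(i,j-1,s) + \mathit{open} \cdot g(-,B[j],s) + \mathit{extension} \cdot k * |B[j]| \Big\} \\[6pt]
\min\limits_{s \in S(i-1,j) \,\&\, s\,(A[i],-) = t} \Big\{ C(i-1,j,s) + \mathit{open} \cdot g(A[i],-,s) + \mathit{extension} \cdot l * |A[i]| \Big\}
\end{cases}
$$

where g(A[i], B[j], s) is the total number of gaps initiated by appending columns A[i] and B[j] onto an alignment that ends in shape s, and s (x, y) is the shape that results from that append.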

Exact Algorithm

The optimum value is

$$\min_{s \in S(m,n)} \{ C(m, n, s) \}$$

The problem here is that the number of shapes may be very large, so in the worst case the time and space complexity is exponential.

[Formula: worst-case time and space bounds in terms of n and k]

Any Questions?

423B

jmeng@cs.tamu.edu
