
A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices

Maxime Crochemore

Michal Ziv Ukelson

Gad M. Landau

The Sequence Alignment Problem

Compare two strings A and B and measure their similarity by finding the optimal alignment between them.

The alignment is classically based on the transformation of one sequence into the other, via operations of substitutions, insertions, and deletions (indels).

A = c t a c g a g a c

B = a a c g a c g a t

The Scoring Matrix P

      -    a    c    g    t
-         -1   -1   -1   -1
a    -1    1   -1   -1   -1
c    -1   -1    1   -1   -1
g    -1   -1   -1    1   -1
t    -1   -1   -1   -1    1

Two Sequence Alignment Problems

A = c t a c g a g a c

B = a a c g a c g a t

Global Alignment. Value: 2. Trace: [figure not recovered]

Local Alignment. Value: 5. Trace: [figure not recovered]

The O(n^2) Time Classical Dynamic Programming Algorithm

The Alignment Graph

[Figure: the alignment graph of A (rows 0 to 9) and B (columns 0 to 9), with |A| = n and |B| = n.]

Computing the Optimal Global Alignment Value

[Figure: the alignment graph of A and B, |A| = n and |B| = n. Legend: two edge types, scoring 1 and -1.]

Classical Dynamic Programming: O(n^2)
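The classical O(n^2) computation can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the parameters `match`, `mismatch` and `indel` are assumptions, with defaults taken from the scoring matrix P of the slides (1 for a match, -1 for everything else).

```python
def global_alignment_value(a, b, match=1, mismatch=-1, indel=-1):
    """Optimal global alignment value by row-by-row dynamic programming."""
    # Row 0: aligning the empty prefix of a with b[:j] costs j indels.
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,   # diagonal edge
                         prev[j] + indel,     # vertical edge
                         cur[j - 1] + indel)  # horizontal edge
        prev = cur
    return prev[-1]


A = "ctacgagac"
B = "aacgacgat"
print(global_alignment_value(A, B))
```

Only two rows of the table are kept, so the sketch uses O(n) space but still visits all O(n^2) vertices of the alignment graph.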

Computing the Optimal Local Alignment Value

[Figure: the alignment graph of A and B, |A| = n and |B| = n. Legend: two edge types, scoring 1 and -1.]

Classical Dynamic Programming: O(n^2)
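For the local variant, a minimal sketch under the same assumptions (hypothetical function name, scores taken from the scoring matrix P) differs from the global computation only in that a path may start at any vertex, so every cell is floored at 0 and the answer is the best cell anywhere in the table.

```python
def local_alignment_value(a, b, match=1, mismatch=-1, indel=-1):
    """Optimal local alignment value by row-by-row dynamic programming."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                   # start a new local alignment here
                         prev[j - 1] + sub,   # diagonal edge
                         prev[j] + indel,     # vertical edge
                         cur[j - 1] + indel)  # horizontal edge
            best = max(best, cur[j])
        prev = cur
    return best


A = "ctacgagac"
B = "aacgacgat"
print(local_alignment_value(A, B))
```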

Acceleration by Text Compression: Compress the sequences in order to speed up the alignment process.

LZ78 Parsing: Each phrase is the longest matching phrase seen previously, plus one character.

B = a a c g a c g a t

Phrase:   1        2        3        4        5
Pair:     (0, a)   (1, c)   (0, g)   (2, g)   (1, t)

Trie for B (parent --character--> child): 0 --a--> 1, 1 --c--> 2, 0 --g--> 3, 2 --g--> 4, 1 --t--> 5
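The parsing rule can be sketched in a few lines. This is an illustrative implementation, not the authors' code: `lz78_parse` is a hypothetical name, the trie is held in a plain dictionary, and a trailing phrase that merely repeats an earlier one is emitted with an empty character (a convention assumed here).

```python
def lz78_parse(s):
    """Return the LZ78 phrases of s as (previous phrase index, new character) pairs."""
    trie = {}      # (parent phrase index, character) -> phrase index
    phrases = []
    node = 0       # 0 is the root, i.e. the empty phrase
    for ch in s:
        nxt = trie.get((node, ch))
        if nxt is not None:
            node = nxt                         # keep extending the current match
        else:
            phrases.append((node, ch))         # longest match plus one character
            trie[(node, ch)] = len(phrases)    # phrases are numbered from 1
            node = 0
    if node != 0:
        phrases.append((node, ""))             # input ended inside a known phrase
    return phrases


print(lz78_parse("aacgacgat"))
```

Each phrase extends the trie by exactly one node, which is what makes the parsing incremental.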

Theorem 1. [Lempel and Ziv 1976] Given a sequence S of size n over a constant alphabet, the maximal number of phrases obtained by any scheme which parses S into distinct phrases is O(n / log n).

Theorem 2. [Ziv and Lempel 1978] Given a sequence S of size n over a constant alphabet, the number of phrases obtained by LZ78 parsing of S is O(hn / log n), where h <= 1. For most texts, h is the entropy of the text, which is a measure of how "compressible" the text is.

[Figure: the full alignment graph of A and B has O(n^2) vertices.]

[Figure: the same graph with only selected rows and columns drawn: O(hn / log n) rows of n vertices + O(hn / log n) columns of n vertices.]

Our Results:

An O(hn^2 / log n) algorithm for computing the Optimal Global Alignment Value and the Optimal Local Alignment Value.

(Reminder: h <= 1)

The work for each block.

[Figure: block G (5/4) in the alignment graph of A and B, with input border I1 to I6 (left column and top row) and output border O1 to O6 (bottom row and right column).]

t = |I| = |O| = 6

The work for each block is done in O(t) time.

Computing O from I: The Global Alignment Solution.

[Figure: block G (5/4) with input border I1 to I6 and output border O1 to O6.]

Ix = the weight of an optimal path from vertex (0,0) to vertex x of the input border.

Oy = the weight of an optimal path from vertex (0,0) to vertex y of the output border.


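Computing the score for Output Border Vertex O4

[Figure: block G with input border values I1 = 1, I2 = 2, I3 = 3, I4 = 2, I5 = 1, I6 = 3; the value of O4 is to be computed. Legend: two edge types, scoring 1 and -1.]

O4 = max over x = 1..6 of (Ix + DIST[x,4])

Input + DIST[4] = OUT[4]

x    Input Ix    DIST[x,4]    OUT[x,4]
1       1           -3           -2
2       2           -1            1
3       3            1            4
4       2            0            2
5       1            0            1
6       3           -2            1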
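The slides use DIST without spelling out its definition. The sketch below assumes that DIST[x][j] is the weight of the best path inside the block from input border vertex x to output border vertex j, and computes it by brute force, one small dynamic program per input vertex. The function name, the border numbering (input: up the left column, then along the top row; output: along the bottom row, then up the right column) and the use of -inf for unreachable pairs are all assumptions chosen to reproduce the numbers shown for block G, whose rows spell "ag" and whose columns spell "acg". This is not the O(t) incremental construction of the talk.

```python
NEG_INF = float("-inf")


def block_dist(rows, cols, match=1, mismatch=-1, indel=-1):
    """Brute-force DIST for one block: best path weight from each input
    border vertex to each output border vertex (-inf if unreachable)."""
    r, c = len(rows), len(cols)
    # Input border I1..It: up the left column, then right along the top row.
    inputs = [(i, 0) for i in range(r, 0, -1)] + [(0, j) for j in range(c + 1)]
    # Output border O1..Ot: right along the bottom row, then up the right column.
    outputs = [(r, j) for j in range(c + 1)] + [(i, c) for i in range(r - 1, -1, -1)]
    dist = []
    for si, sj in inputs:
        best = [[NEG_INF] * (c + 1) for _ in range(r + 1)]
        best[si][sj] = 0
        for i in range(si, r + 1):
            for j in range(sj, c + 1):
                if (i, j) == (si, sj):
                    continue
                cand = []
                if i > si:
                    cand.append(best[i - 1][j] + indel)
                if j > sj:
                    cand.append(best[i][j - 1] + indel)
                if i > si and j > sj:
                    sub = match if rows[i - 1] == cols[j - 1] else mismatch
                    cand.append(best[i - 1][j - 1] + sub)
                best[i][j] = max(cand)
        dist.append([best[oi][oj] for oi, oj in outputs])
    return dist


DIST_G = block_dist("ag", "acg")
print([row[3] for row in DIST_G])  # the column of DIST that leads to O4
```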

OUT[x,j] = Ix + DIST[x,j]

Input I \ DIST Matrix (a dot marks an undefined entry)

            1    2    3    4    5    6
I1 = 1      0   -1   -2   -3    .    .
I2 = 2     -1   -1   -2   -1   -3    .
I3 = 3     -2    0    0    1   -1   -3
I4 = 2      .   -2   -2    0   -2   -2
I5 = 1      .    .   -2    0   -1   -1
I6 = 3      .    .    .   -2   -1    0

OUT Matrix

            1    2    3    4    5    6
x = 1       1    0   -1   -2    .    .
x = 2       1    1    0    1   -1    .
x = 3       1    3    3    4    2    0
x = 4       .    0    0    2    0    0
x = 5       .    .   -1    1    0    0
x = 6       .    .    .    1    2    3

Output vector O

1 3 3 4 2 3
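The relation OUT[x,j] = Ix + DIST[x,j] and the column maxima can be checked directly on the example values. The sketch below is the naive O(t^2) scan, not the O(t) method of the talk; the variable and function names are assumptions, and `None` stands for an undefined entry.

```python
U = None  # undefined entry (no path from that input vertex to that output vertex)

I = [1, 2, 3, 2, 1, 3]
DIST = [
    [0, -1, -2, -3, U, U],
    [-1, -1, -2, -1, -3, U],
    [-2, 0, 0, 1, -1, -3],
    [U, -2, -2, 0, -2, -2],
    [U, U, -2, 0, -1, -1],
    [U, U, U, -2, -1, 0],
]


def out_matrix(inputs, dist):
    """OUT[x][j] = I[x] + DIST[x][j]; undefined entries stay undefined."""
    return [[None if d is None else ix + d for d in row]
            for ix, row in zip(inputs, dist)]


def column_maxima_naive(out):
    """Maximum of the defined entries of every column, by a full scan."""
    return [max(row[j] for row in out if row[j] is not None)
            for j in range(len(out[0]))]


OUT = out_matrix(I, DIST)
O = column_maxima_naive(OUT)
print(O)
```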

Output Vector O values are set to OUT Matrix Column Maxima

[Figure: block G with input border values I1 = 1, I2 = 2, I3 = 3, I4 = 2, I5 = 1, I6 = 3 and output border values O1 = 1, O2 = 3, O3 = 3, O4 = 4, O5 = 2, O6 = 3.]

The Main Challenges

How to compute the column maxima of OUT in O(t) time? (Utilize the Total Monotonicity Property of OUT.)

How to obtain the DIST for G in O(t) time? (Take advantage of the incremental nature of LZ78 parsing.)

Computing the Column Maxima of OUT in O(t) time

The Monge Property of an array M [Monge 1781]. For all a < b and c < d, M[a,c] - M[b,c] >= M[a,d] - M[b,d]

The Total Monotonicity Property [Aggarwal et al 1987]. For all a < b and c < d, M[a,c] <= M[b,c] ---> M[a,d] <= M[b,d]

Monge implies Total Monotonicity.

DIST is Monge [Aggarwal-Park 1988].

Therefore, OUT is Monge and hence Totally Monotone: OUT[x,j] = Ix + DIST[x,j] adds the same Ix to every entry of row x, so Ia - Ib appears on both sides of the Monge inequality and cancels.
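As a sanity check, the Monge inequality in the form stated above can be tested exhaustively on the example DIST matrix of block G. The sketch is a brute-force checker under two assumptions: undefined entries are written as `None`, and any quadruple that touches an undefined entry is skipped. The function name is hypothetical.

```python
from itertools import combinations


def is_monge(m):
    """Check M[a][c] - M[b][c] >= M[a][d] - M[b][d] for all a < b, c < d,
    skipping quadruples that involve an undefined (None) entry."""
    for a, b in combinations(range(len(m)), 2):
        for c, d in combinations(range(len(m[0])), 2):
            quad = (m[a][c], m[b][c], m[a][d], m[b][d])
            if None in quad:
                continue
            if quad[0] - quad[1] < quad[2] - quad[3]:
                return False
    return True


U = None
DIST = [
    [0, -1, -2, -3, U, U],
    [-1, -1, -2, -1, -3, U],
    [-2, 0, 0, 1, -1, -3],
    [U, -2, -2, 0, -2, -2],
    [U, U, -2, 0, -1, -1],
    [U, U, U, -2, -1, 0],
]
I = [1, 2, 3, 2, 1, 3]
# Adding I[x] to row x does not change any row difference, so OUT stays Monge.
OUT = [[None if d is None else ix + d for d in row] for ix, row in zip(I, DIST)]

print(is_monge(DIST), is_monge(OUT))
```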

Monge Property of DIST and OUT [Monge 1781, Aggarwal-Park 1988]

[Figure: paths W, Z, T and Y between border vertices a, b and c, d, with path segments A, B, C and D.]

If path Z is better than path T then path Y is better than path W.

The two cases: A is better than C; C is better than A.

The Rectangle Problem

OUT Matrix (a dot marks an undefined entry)

            1    2    3    4    5    6
x = 1       1    0   -1   -2    .    .
x = 2       1    1    0    1   -1    .
x = 3       1    3    3    4    2    0
x = 4       .    0    0    2    0    0
x = 5       .    .   -1    1    0    0
x = 6       .    .    .    1    2    3

Complementing the undefined OUT entries (without introducing new column maxima)

1. Upper Right Triangle. All values are set to -inf.

2. Lower Left Triangle. Let k denote the maximal absolute value of a score in the scoring matrix. OUT[i,j] in the lower left triangle will be set to -(n+i+1)*k.

In the example, the lower left entries become -12 in the row of I4, -13 in the row of I5 and -14 in the row of I6.

For all a < b and c < d, OUT[a,c] <= OUT[b,c] ---> OUT[a,d] <= OUT[b,d]
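The two complementing rules can be sketched as a small function. Several things here are assumptions rather than statements from the slides: undefined entries are written as `None`, rows are indexed from 0, and the values n = 8 and k = 1 were chosen only because they reproduce the -12, -13, -14 entries shown for the example.

```python
NEG_INF = float("-inf")


def complement_out(out, n, k):
    """Fill the undefined (None) entries of OUT.
    Left of a row's defined run  -> -(n + i + 1) * k  (lower left triangle).
    Right of a row's defined run -> -inf              (upper right triangle)."""
    padded = []
    for i, row in enumerate(out):
        first = min(j for j, v in enumerate(row) if v is not None)
        new_row = []
        for j, v in enumerate(row):
            if v is not None:
                new_row.append(v)
            elif j < first:
                new_row.append(-(n + i + 1) * k)
            else:
                new_row.append(NEG_INF)
        padded.append(new_row)
    return padded


U = None
OUT = [
    [1, 0, -1, -2, U, U],
    [1, 1, 0, 1, -1, U],
    [1, 3, 3, 4, 2, 0],
    [U, 0, 0, 2, 0, 0],
    [U, U, -1, 1, 0, 0],
    [U, U, U, 1, 2, 3],
]
PADDED = complement_out(OUT, n=8, k=1)
for row in PADDED:
    print(row)
```

Both kinds of filler are smaller than every defined score in their column, which is why no new column maximum can appear.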

Complementing the undefined OUT entries, without changing its Total Monotonicity property, and without introducing new column maxima.

OUT Matrix

            1     2     3     4     5     6
x = 1       1     0    -1    -2   -inf  -inf
x = 2       1     1     0     1    -1   -inf
x = 3       1     3     3     4     2     0
x = 4     -12     0     0     2     0     0
x = 5     -13   -13    -1     1     0     0
x = 6     -14   -14   -14     1     2     3
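Once OUT is a full totally monotone rectangle, the row holding each column maximum can only move downward as the column index grows. The sketch below exploits this with a simple divide-and-conquer search. Note the substitution: this is not the linear-time method behind the O(t) bound of the talk; it costs O(t log t) and is shown only because it is short. Function names are hypothetical, and ties are resolved toward the lower row, which is the choice that keeps the maxima positions monotone under the property as stated.

```python
from itertools import combinations

NEG_INF = float("-inf")

PADDED_OUT = [
    [1, 0, -1, -2, NEG_INF, NEG_INF],
    [1, 1, 0, 1, -1, NEG_INF],
    [1, 3, 3, 4, 2, 0],
    [-12, 0, 0, 2, 0, 0],
    [-13, -13, -1, 1, 0, 0],
    [-14, -14, -14, 1, 2, 3],
]


def is_totally_monotone(m):
    """For all a < b and c < d: m[a][c] <= m[b][c] implies m[a][d] <= m[b][d]."""
    for a, b in combinations(range(len(m)), 2):
        for c, d in combinations(range(len(m[0])), 2):
            if m[a][c] <= m[b][c] and not m[a][d] <= m[b][d]:
                return False
    return True


def column_maxima(m):
    """Column maxima of a totally monotone matrix by divide and conquer."""
    result = [None] * len(m[0])

    def solve(c_lo, c_hi, r_lo, r_hi):
        if c_lo > c_hi:
            return
        mid = (c_lo + c_hi) // 2
        best_row = r_lo
        for r in range(r_lo, r_hi + 1):
            if m[r][mid] >= m[best_row][mid]:   # ">=" keeps the lowest maximal row
                best_row = r
        result[mid] = m[best_row][mid]
        solve(c_lo, mid - 1, r_lo, best_row)    # earlier columns: rows at or above
        solve(mid + 1, c_hi, best_row, r_hi)    # later columns: rows at or below

    solve(0, len(m[0]) - 1, 0, len(m) - 1)
    return result


print(is_totally_monotone(PADDED_OUT), column_maxima(PADDED_OUT))
```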


Utilizing the incremental nature of LZ78 parsing for efficient DIST construction.

[Figure: block G (5/4) in the alignment graph of A and B, together with its prefix blocks.]

4 = (2, g)

left prefix block of G: (5/2)

top prefix block of G: (3/4)

diagonal prefix block of G: (3/2)

G: (5/4)

Only one new DIST column needs to be computed for each block, and this DIST column is computed in O(t) time.

[Figure: block G (5/4) with its left prefix (5/2), top prefix (3/4) and diagonal prefix (3/2) blocks; the new DIST column shown for G is -3, -1, 1, 0, 0, -2.]

Accessing a Prefix Block in Constant time.

Trie for A (parent --character--> child): 0 --c--> 1, 0 --t--> 2, 0 --a--> 3, 1 --g--> 4, 3 --g--> 5, 3 --c--> 6

Trie for B (parent --character--> child): 0 --a--> 1, 1 --c--> 2, 0 --g--> 3, 2 --g--> 4, 1 --t--> 5
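One way to read the constant-time claim: if every phrase stores the index of its parent in the trie, the prefix blocks of a block are found by following one parent pointer per string. The sketch below assumes that reading; the function names are hypothetical, and a block is identified by the pair (phrase of A, phrase of B), as in the 5/4 labels of the figures.

```python
def lz78_parents(s):
    """parents[p] = index of the phrase that phrase p extends by one character
    (0 is the empty phrase; phrases are numbered from 1)."""
    trie = {}          # (parent phrase index, character) -> phrase index
    parents = [None]   # slot 0 is the root
    node = 0
    for ch in s:
        nxt = trie.get((node, ch))
        if nxt is not None:
            node = nxt
        else:
            parents.append(node)
            trie[(node, ch)] = len(parents) - 1
            node = 0
    return parents


def prefix_blocks(i, j, parents_a, parents_b):
    """Prefix blocks of block (i, j): one list lookup per string."""
    return {
        "left": (i, parents_b[j]),               # drop the last column character
        "top": (parents_a[i], j),                # drop the last row character
        "diagonal": (parents_a[i], parents_b[j]),
    }


PA = lz78_parents("ctacgagac")
PB = lz78_parents("aacgacgat")
print(prefix_blocks(5, 4, PA, PB))
```

For block G = (5, 4) this yields the left prefix 5/2, the top prefix 3/4 and the diagonal prefix 3/2.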

Summary of Results:

Global Alignment Problem.

-An O(hn^2 / log n) time and space complexity algorithm for computing the optimal global alignment value.

-After the optimal value has been computed, an optimal alignment trace can be recovered in time linear in its size.

Local Alignment Problem.

-An O(hn^2 / log n) time and space complexity algorithm for computing the optimal local alignment value.

-After the optimal value has been computed, given a vertex whose score is maximal, an optimal alignment trace ending in the vertex can be recovered in time linear in its size.

Open Problems:

We showed an O(hn^2 / log n) time and space complexity algorithm for computing the optimal global and local alignment values of two strings.

In the paper we show how to reduce the space complexity to O(h^2 n^2 / (log n)^2).

Can the space requirement of the algorithm be further reduced, without impairing its sub-quadratic time complexity?