EE3J2 Data Mining Slide 1 EE3J2 Data Mining Lecture 12: Sequence Analysis (2) Martin Russell

Slide 1

EE3J2 Data Mining

Lecture 12: Sequence Analysis (2)

Martin Russell

Slide 2

Objectives

Revise dynamic programming

Examples

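Slide 3

Alignment path

[Figure: alignment grid with the sequence A C X C C D along the horizontal axis and the sequence A B C D along the vertical axis. The label d(C,X) marks the local distance at one grid point.]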

Slide 4

Accumulated Distance

The accumulated distance along a path p is the sum of the local distances along its length

Large accumulated distance = poor matches between symbols = poor path

Small accumulated distance = good matches between symbols = good path

The path with the smallest accumulated distance is called the optimal path

The optimal path is computed using Dynamic Programming
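As a minimal sketch of this definition, the Python function below sums the local distances along a given path. The 0/1 local distance, the names `local_distance` and `path_distance`, and the example path are assumptions made for illustration only; insertion and deletion penalties are left out here.

```python
def local_distance(a, b):
    """Assumed local distance: 0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1


def path_distance(vertical, horizontal, path):
    """Sum the local distances along a path of (row, column) grid points.

    Insertion and deletion penalties are ignored in this sketch.
    """
    return sum(local_distance(vertical[m], horizontal[n]) for m, n in path)


# Hypothetical path through the grid for ABCD (vertical) and ACXCCD (horizontal).
example_path = [(0, 0), (1, 1), (2, 2), (2, 3), (2, 4), (3, 5)]
total = path_distance("ABCD", "ACXCCD", example_path)
print(total)
```

A path that pairs mismatched symbols (here B with C and C with X) picks up one unit of distance per mismatch, so a poorer path has a larger total.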

Slide 5

Dynamic Programming

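[Figure: alignment grid with A C X C C D along the horizontal axis and A B C D along the vertical axis.]

The accumulated distance to a given point is the minimum of the accumulated distances to the possible previous points, plus the local, incremental cost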

Slide 6

Formally…

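\[
ad(m,n) = \min \left\{
\begin{array}{l}
ad(m,n-1) + K_{INS} + d(m,n) \\
ad(m-1,n-1) + d(m,n) \\
ad(m-1,n) + K_{DEL} + d(m,n)
\end{array}
\right.
\]

where:

– $ad(m,n)$ is the accumulated distance up to the point (m,n)

– $K_{DEL}$ is the deletion penalty and $K_{INS}$ is the insertion penalty

– $d(m,n)$ is the ‘local’ distance between the mth symbol in sequence 1 and the nth symbol in sequence 2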
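The recurrence can be sketched in Python as follows. This is an illustrative sketch rather than the course program: the 0/1 local distance, the function name, the treatment of the first row and column (where only one previous point exists) and the pairing of the insertion penalty with horizontal steps and the deletion penalty with vertical steps are all assumptions.

```python
def accumulated_distance(s1, s2, k_del=0, k_ins=0):
    """Forward DP. s1 is indexed by m (rows), s2 by n (columns).

    Returns the full accumulated distance matrix ad, with ad[m][n] the
    smallest accumulated distance of any path from (0, 0) to (m, n).
    """
    rows, cols = len(s1), len(s2)
    ad = [[0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            d = 0 if s1[m] == s2[n] else 1  # assumed 0/1 local distance
            candidates = []
            if n > 0:                       # horizontal step (insertion)
                candidates.append(ad[m][n - 1] + k_ins + d)
            if m > 0 and n > 0:             # diagonal step
                candidates.append(ad[m - 1][n - 1] + d)
            if m > 0:                       # vertical step (deletion)
                candidates.append(ad[m - 1][n] + k_del + d)
            # At (0, 0) there is no previous point, so only d is counted.
            ad[m][n] = min(candidates) if candidates else d
    return ad


for row in accumulated_distance("AABCD", "ABCCD", k_del=2, k_ins=2):
    print(row)
```

With both penalties set to 2, the sequences AABCD and ABCCD give a bottom-right value of 2, the accumulated distance of the optimal path.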

Slide 7

Example application: sequence retrieval

Corpus of sequential data:

……
AAGDTDTDTDD
AABBCBDAAAAAAA
BABABABBCCDF
GGGGDDGDGDGDGDTDTD
DGDGDGDGD
AABCDTAABCDTAABCDTAAB
CDCDCDTGGG
GGAACDTGGGGGAAA
…….

‘Query’ sequence Q:

…BBCCDDDGDGDGDCDTCDTTDCCC…

Dynamic Programming distance calculation: calculate ad(S,Q) for each sequence S in the corpus, then retrieve the closest sequence

\[
\hat{S} = \arg\min_{S} ad(S,Q)
\]
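A minimal Python sketch of this retrieval step is given below. The function names, the 0/1 local distance and the default penalties of 1 are assumptions for illustration. Non-zero penalties are used because, with zero penalties, different sequences can end up at distance 0 from each other.

```python
def edit_distance(s, q, k_del=1, k_ins=1):
    """Accumulated distance ad(S, Q) of the optimal path, computed row by row."""
    prev = []
    for m, a in enumerate(s):
        row = []
        for n, b in enumerate(q):
            d = 0 if a == b else 1          # assumed 0/1 local distance
            candidates = []
            if n > 0:
                candidates.append(row[n - 1] + k_ins + d)
            if m > 0 and n > 0:
                candidates.append(prev[n - 1] + d)
            if m > 0:
                candidates.append(prev[n] + k_del + d)
            row.append(min(candidates) if candidates else d)
        prev = row
    return prev[-1]


def retrieve(corpus, query, k_del=1, k_ins=1):
    """Return the corpus sequence S that minimises ad(S, Q)."""
    return min(corpus, key=lambda s: edit_distance(s, query, k_del, k_ins))


corpus = ["AAGDTDTDTDD", "DGDGDGDGD", "CDCDCDTGGG"]
print(retrieve(corpus, "DGDGDGDGD"))
```

Each query costs one full DP per corpus sequence, so the running time grows with the corpus size times the product of the two sequence lengths.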

Slide 8

Example: Edit Distance

S1 = AABCD (vertical axis), KDEL = 0

S2 = ABCCD (horizontal axis), KINS = 0

Distance matrix

    A  B  C  C  D
A   0  1  1  1  1
A   0  1  1  1  1
B   1  0  1  1  1
C   1  1  0  0  1
D   1  1  1  1  0

Accumulated distance matrix

    A  B  C  C  D
A   0  1  2  3  4
A   0  1  2  3  4
B   1  0  1  2  3
C   2  1  0  0  1
D   3  2  1  1  0

Forward path matrix (\ = diagonal step, | = vertical step, _ = horizontal step)

    A  B  C  C  D
A   \  _  _  _  _
A   |  _  _  _  _
B   |  \  _  _  _
C   |  |  \  _  _
D   |  |  |  |  \

Optimal alignment

AABCCD
AABCCD

Slide 9

Example 2: Edit Distance

S1 = AABCD (vertical axis), KDEL = 2

S2 = ABCCD (horizontal axis), KINS = 2

Distance matrix

    A  B  C  C  D
A   0  1  1  1  1
A   0  1  1  1  1
B   1  0  1  1  1
C   1  1  0  0  1
D   1  1  1  1  0

Accumulated distance matrix

    A  B  C  C  D
A   0  3  6  9 12
A   2  1  4  7 10
B   5  2  2  5  8
C   8  5  2  2  5
D  11  8  5  3  2

Forward path matrix (\ = diagonal step, | = vertical step, _ = horizontal step)

    A  B  C  C  D
A   \  _  _  _  _
A   |  \  _  _  _
B   |  \  \  _  _
C   |  |  \  \  _
D   |  |  |  \  \

Optimal alignment (the optimal path follows the main diagonal)

AABCD
ABCCD
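The two examples can be reproduced with the short Python sketch below, which builds the accumulated distance matrix and the forward path matrix and then traces back from the bottom-right corner. It is not the course program. The function names are hypothetical, and the tie-breaking rule (horizontal preferred over vertical, vertical over diagonal when costs are equal) is an assumption chosen so that the output agrees with the path matrices shown above.

```python
def forward_dp(s1, s2, k_del=0, k_ins=0):
    """Return (accumulated distance matrix, forward path matrix).

    s1 runs down the rows and s2 along the columns, as in the examples.
    """
    rows, cols = len(s1), len(s2)
    ad = [[0] * cols for _ in range(rows)]
    path = [["\\"] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            best, step = None, "\\"          # point (0, 0) is marked diagonal
            if m > 0 and n > 0:
                best, step = ad[m - 1][n - 1], "\\"
            if m > 0 and (best is None or ad[m - 1][n] + k_del <= best):
                best, step = ad[m - 1][n] + k_del, "|"
            if n > 0 and (best is None or ad[m][n - 1] + k_ins <= best):
                best, step = ad[m][n - 1] + k_ins, "_"
            d = 0 if s1[m] == s2[n] else 1   # assumed 0/1 local distance
            ad[m][n] = (0 if best is None else best) + d
            path[m][n] = step
    return ad, path


def align(s1, s2, path):
    """Trace back from the bottom-right corner; return the aligned strings."""
    m, n = len(s1) - 1, len(s2) - 1
    a1, a2 = [s1[m]], [s2[n]]
    while (m, n) != (0, 0):
        step = path[m][n]
        if step != "_":      # diagonal or vertical: move up one row
            m -= 1
        if step != "|":      # diagonal or horizontal: move left one column
            n -= 1
        a1.append(s1[m])
        a2.append(s2[n])
    return "".join(reversed(a1)), "".join(reversed(a2))


ad, path = forward_dp("AABCD", "ABCCD", k_del=2, k_ins=2)
for row in path:
    print(" ".join(row))
print(align("AABCD", "ABCCD", path))
```

On a vertical or horizontal step the symbol of the sequence that does not advance is repeated in the alignment, which is how a five-symbol sequence can appear as six aligned symbols when the penalties are zero.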

Slide 10

edit-dist.c

New C program on course website

Computes the edit distance between two sequences

Prints out:

– Distance matrix

– Forward accumulated distance matrix

– Forward path matrix

– Optimal path

– Optimal alignment

Slide 11

edit-dist.c

Format: edit-dist seq1 seq2 <Kdel> <Kins>

seq1 and seq2 are the sequences

<Kdel> and <Kins> are optional, default 0

Slide 12

Matching partial sequences

In some applications the interest is in whether one sequence matches a subsequence of another sequence

Example: Bioinformatics

– Look for examples of a simple DNA sequence within a more complex sequence

– Infer evolutionary relationship between two organisms

Slide 13

Partial alignment

Simple intuitive solution is to allow Dynamic Programming to:

– Start at any point in the first row

– End at any point in the final row

Then proceed as before

Unfortunately this has limitations…
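The simple solution can be sketched in Python as below, assuming the pattern runs down the rows and the longer sequence along the columns. The function name, the 0/1 local distance and the bookkeeping of each path's start column are assumptions for illustration. The pattern ABBC and the sequence XZABCCYZ are the ones used in the matching subsequences figure later in the lecture.

```python
def partial_match(pattern, seq, k_del=1, k_ins=1):
    """DP with a free start in the first row and a free end in the final row.

    Returns (accumulated distance, start column, end column) of the best path.
    """
    rows, cols = len(pattern), len(seq)

    def d(m, n):
        return 0 if pattern[m] == seq[n] else 1  # assumed 0/1 local distance

    # First row: a path may start at any column, so only the local distance counts.
    ad = [[d(0, n) for n in range(cols)]]
    start = [list(range(cols))]  # start column of the best path to each point
    for m in range(1, rows):
        row, start_row = [], []
        for n in range(cols):
            candidates = [(ad[m - 1][n] + k_del, start[m - 1][n])]
            if n > 0:
                candidates.append((ad[m - 1][n - 1], start[m - 1][n - 1]))
                candidates.append((row[n - 1] + k_ins, start_row[n - 1]))
            cost, origin = min(candidates)
            row.append(cost + d(m, n))
            start_row.append(origin)
        ad.append(row)
        start.append(start_row)
    # Final row: the path may end at any column, so take the best one.
    end = min(range(cols), key=lambda n: ad[-1][n])
    return ad[-1][end], start[-1][end], end


cost, start_col, end_col = partial_match("ABBC", "XZABCCYZ", k_del=2, k_ins=2)
matched = "XZABCCYZ"[start_col:end_col + 1]
print(cost, matched)
```

This sketch only reports the single best-scoring end point, so it does not address the limitations mentioned above.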

Slide 14

Finding matching sub-sequences

[Figure: alignment grid annotated with the labels "Start DP from here", "Best scoring end point" and "Lower cost path".]

Slide 15

Backwards Pass DP

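Forward pass:

\[
ad(m,n) = \min \left\{
\begin{array}{l}
ad(m,n-1) + K_{INS} + d(m,n) \\
ad(m-1,n-1) + d(m,n) \\
ad(m-1,n) + K_{DEL} + d(m,n)
\end{array}
\right.
\]

Backward pass:

\[
ad_B(m,n) = \min \left\{
\begin{array}{l}
ad_B(m,n+1) + K_{INS} + d(m,n) \\
ad_B(m+1,n+1) + d(m,n) \\
ad_B(m+1,n) + K_{DEL} + d(m,n)
\end{array}
\right.
\]

Here $ad_B(m,n)$ denotes the backward accumulated distance; the subscript B is used only to distinguish it from the forward accumulated distance $ad(m,n)$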

Slide 16

Backwards Pass DP

Starts in the bottom row and works right-to-left and bottom-to-top

Otherwise, the backward accumulated distance matrix and backward path matrix are calculated analogously to forward-pass DP
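A minimal Python sketch of the backward pass is given below. For simplicity it assumes a complete alignment anchored at the bottom-right corner; for partial alignment every point in the bottom row would be initialised with its local distance instead. The function name and the 0/1 local distance are assumptions.

```python
def backward_accumulated_distance(s1, s2, k_del=0, k_ins=0):
    """Backward DP, filled right-to-left and bottom-to-top.

    ad[m][n] is the smallest accumulated distance of any path from (m, n)
    to the bottom-right corner.
    """
    rows, cols = len(s1), len(s2)
    ad = [[0] * cols for _ in range(rows)]
    for m in range(rows - 1, -1, -1):
        for n in range(cols - 1, -1, -1):
            d = 0 if s1[m] == s2[n] else 1          # assumed 0/1 local distance
            candidates = []
            if n < cols - 1:                        # horizontal step
                candidates.append(ad[m][n + 1] + k_ins)
            if m < rows - 1 and n < cols - 1:       # diagonal step
                candidates.append(ad[m + 1][n + 1])
            if m < rows - 1:                        # vertical step
                candidates.append(ad[m + 1][n] + k_del)
            ad[m][n] = d + (min(candidates) if candidates else 0)
    return ad


backward = backward_accumulated_distance("AABCD", "ABCCD", k_del=2, k_ins=2)
print(backward[0][0])
```

For a complete alignment the value in the top-left corner of the backward matrix equals the value in the bottom-right corner of the forward matrix, because both are the accumulated distance of the same optimal path.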

Slide 17

Forward-backward DP

Suppose that we have done a complete forward DP and a complete backward DP

We will have two path matrices:

– Forward path matrix

– Backward path matrix

For any point in the bottom row, we can trace back through the forward path matrix and recover a path ending in the top row

For any point in the top row, we can trace back through the backward path matrix and recover a path ending in the bottom row

Slide 18

Matching sub-sequences

Choose a point in the bottom row. Trace back through the forward path matrix

Identify the start of the path. Then trace back through the backward path matrix from that start point

Are the paths the same? If so, then we have a matching subsequence

Slide 19

Matching subsequences

If the same path is obtained both by tracing back through the forward path matrix and by tracing back through the backward path matrix, then the corresponding section of the horizontal sequence is called a matching subsequence

The matching subsequences are those which achieve a good match with the vertical pattern
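The forward-backward procedure can be sketched in Python as below. This is an illustrative reading of the definition, not the lecturer's implementation. It assumes a 0/1 local distance, free start points in the top row for the forward pass, free start points in the bottom row for the backward pass, and ties broken arbitrarily (by tuple ordering). All function names are hypothetical.

```python
def local_distance(a, b):
    return 0 if a == b else 1


def forward_pass(pattern, seq, k_del, k_ins):
    """Forward DP in which a path may start anywhere in the top row."""
    rows, cols = len(pattern), len(seq)
    ad = [[0] * cols for _ in range(rows)]
    prev = [[None] * cols for _ in range(rows)]   # forward path matrix
    for m in range(rows):
        for n in range(cols):
            best = 0
            if m > 0:
                cands = [(ad[m - 1][n] + k_del, (m - 1, n))]
                if n > 0:
                    cands.append((ad[m - 1][n - 1], (m - 1, n - 1)))
                    cands.append((ad[m][n - 1] + k_ins, (m, n - 1)))
                best, prev[m][n] = min(cands)
            ad[m][n] = best + local_distance(pattern[m], seq[n])
    return ad, prev


def backward_pass(pattern, seq, k_del, k_ins):
    """Backward DP in which a path may start anywhere in the bottom row."""
    rows, cols = len(pattern), len(seq)
    ad = [[0] * cols for _ in range(rows)]
    nxt = [[None] * cols for _ in range(rows)]    # backward path matrix
    for m in range(rows - 1, -1, -1):
        for n in range(cols - 1, -1, -1):
            best = 0
            if m < rows - 1:
                cands = [(ad[m + 1][n] + k_del, (m + 1, n))]
                if n < cols - 1:
                    cands.append((ad[m + 1][n + 1], (m + 1, n + 1)))
                    cands.append((ad[m][n + 1] + k_ins, (m, n + 1)))
                best, nxt[m][n] = min(cands)
            ad[m][n] = best + local_distance(pattern[m], seq[n])
    return ad, nxt


def trace(pointers, point):
    """Follow the stored pointers from a point until none is left."""
    path = [point]
    while pointers[point[0]][point[1]] is not None:
        point = pointers[point[0]][point[1]]
        path.append(point)
    return path


def matching_subsequences(pattern, seq, k_del=1, k_ins=1):
    """Return (distance, start column, end column) for each matching subsequence."""
    fwd_ad, prev = forward_pass(pattern, seq, k_del, k_ins)
    _, nxt = backward_pass(pattern, seq, k_del, k_ins)
    last = len(pattern) - 1
    matches = []
    for end in range(len(seq)):
        fwd_path = trace(prev, (last, end))[::-1]  # ordered top row to bottom row
        bwd_path = trace(nxt, fwd_path[0])         # restart from the path's start
        if fwd_path == bwd_path:
            matches.append((fwd_ad[last][end], fwd_path[0][1], end))
    return matches


matches = matching_subsequences("ABBC", "XZABCCYZ", k_del=2, k_ins=2)
best_cost, best_start, best_end = min(matches)
print(best_cost, "XZABCCYZ"[best_start:best_end + 1])
```

With these settings the lowest-distance match for the pattern ABBC in XZABCCYZ is the section ABCC, at distance 1. Other end points may also pass the test at a higher distance, so in practice the matches would be ranked by their accumulated distance.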

Slide 20

Matching subsequences

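[Figure: alignment grid with the pattern A B B C along the vertical axis and the sequence X Z A B C C Y Z along the horizontal axis. A section of the horizontal sequence is marked as the matching subsequence.]

We say that this subsequence most closely resembles the original sequence ABBC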

Slide 21

Summary

Revision of Dynamic Programming

Examples: Edit distance

Motivation for interest in optimal subsequences

Forward and backward dynamic programming

Matching subsequences, subsequences which most closely resemble a given sequence