7/03 Data Mining – Sequences H. Liu (ASU) & G Dong (WSU) 1 7. Sequence Mining Sequences and Strings Recognition with Strings MM & HMM Sequence Association Rules


7. Sequence Mining

Sequences and Strings

Recognition with Strings

MM & HMM

Sequence Association Rules


Sequences and Strings

• A sequence x is an ordered list of discrete items, such as a sequence of letters or a gene sequence
  – Sequences and strings are often used as synonyms
  – String elements (characters, letters, or symbols) are nominal
  – Text is a particularly long type of string
• |x| denotes the length of sequence x
  – |AGCTTC| is 6
• Any contiguous string that is part of x is called a substring, segment, or factor of x
  – GCT is a factor of AGCTTC
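The definitions above can be tried out directly in Python. This is only a small sketch; the helper name `is_factor` is hypothetical and does not come from the slides.

```python
def is_factor(x: str, text: str) -> bool:
    """Return True if x occurs as a contiguous substring (factor) of text."""
    return x in text


print(len("AGCTTC"))               # the length |AGCTTC|
print(is_factor("GCT", "AGCTTC"))  # GCT is contiguous in AGCTTC
print(is_factor("ACT", "AGCTTC"))  # A, C, T appear in order but not contiguously
```

The last call illustrates the difference between a factor and a mere subsequence: the characters must be adjacent in x.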


Recognition with Strings

• String matching
  – Given x and text, determine whether x is a factor of text
• Edit distance (for inexact string matching)
  – Given two strings x and y, compute the minimum number of basic operations (character insertions, deletions, and exchanges) needed to transform x into y


String Matching

• Given |text| >> |x|, with characters taken from an alphabet A
  – A can be {0, 1}, {0, 1, 2, …, 9}, {A, G, C, T}, or {A, B, …}
• A shift s is the offset needed to align the first character of x with character number s+1 in text
• Determine whether there exists a valid shift at which every character of x matches the corresponding character of text


Naïve (Brute-Force) String Matching

• Given A, x, text, n = |text|, m = |x|

    s = 0
    while s ≤ n-m
        if x[1 … m] = text[s+1 … s+m]
            then print “pattern occurs at shift” s
        s = s + 1

• Time complexity (worst case): O((n-m+1)m)
• Shifting by only one character at a time is not necessary
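The brute-force loop above translates almost line for line into Python. The following is a minimal sketch; the function name `naive_match` is hypothetical, and it collects the valid shifts in a list instead of printing them. Because Python indexes from 0, the slice `text[s:s + m]` corresponds to text[s+1 … s+m] on the slide, so the shift values are the same.

```python
def naive_match(x: str, text: str) -> list[int]:
    """Return every shift s at which pattern x occurs in text."""
    n, m = len(text), len(x)
    shifts = []
    for s in range(n - m + 1):       # s = 0, 1, ..., n-m
        if text[s:s + m] == x:       # compare x[1..m] with text[s+1..s+m]
            shifts.append(s)
    return shifts


print(naive_match("GCT", "AGCTTCAGCT"))  # [1, 7]
```

Each of the n-m+1 candidate shifts may compare up to m characters, which gives the O((n-m+1)m) worst case.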


Boyer-Moore and KMP

• See StringMatching.ppt and do not use the following algorithm
• Given A, x, text, n = |text|, m = |x|

    F(x) = last-occurrence function
    G(x) = good-suffix function
    s = 0
    while s ≤ n-m
        j = m
        while j > 0 and x[j] = text[s+j]
            j = j-1
        if j = 0
            then print “pattern occurs at shift” s
                 s = s + G(0)
            else s = s + max[G(j), j-F(text[s+j])]
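To show how a last-occurrence table lets the pattern skip ahead by more than one character, here is a sketch of Horspool's simplification of Boyer-Moore. It is not the algorithm listed above and not the one in StringMatching.ppt: it uses only a last-occurrence shift table and omits the good-suffix function entirely. The function name `horspool_match` is hypothetical.

```python
def horspool_match(x: str, text: str) -> list[int]:
    """Return every shift at which x occurs in text (Horspool's variant)."""
    n, m = len(text), len(x)
    if m == 0 or m > n:
        return []
    # For each character among the first m-1 pattern positions, the distance
    # from its last occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(x[:-1])}
    shifts = []
    s = 0
    while s <= n - m:
        if text[s:s + m] == x:
            shifts.append(s)
        # Skip according to the text character aligned with the pattern's end;
        # characters that never occur in x[:-1] allow a full jump of m.
        s += shift.get(text[s + m - 1], m)
    return shifts


print(horspool_match("GCT", "AGCTTCAGCT"))  # [1, 7]
```

On this input the loop visits only the shifts 0, 1, 4, and 7, whereas the brute-force search would visit all eight.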


Edit Distance

• ED between x and y describes how many fundamental operations are required to transform x to y.

• Fundamental operations (x=‘excused’, y=‘exhausted’)

– Substitutions e.g. ‘c’ is replaced by ‘h’

– Insertions e.g. ‘a’ is inserted into x after ‘h’

– Deletions e.g. a character in x is deleted

• ED is one way of measuring similarity between two strings
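One common way to compute ED is dynamic programming over prefixes of the two strings. The sketch below assumes that each substitution, insertion, and deletion costs 1, which the slides do not state explicitly; the function name `edit_distance` is hypothetical.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning x into y."""
    m, n = len(x), len(y)
    # d[i][j] holds the edit distance between the prefixes x[:i] and y[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                  # delete all i characters of x[:i]
    for j in range(n + 1):
        d[0][j] = j                  # insert all j characters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # deletion
                d[i][j - 1] + 1,                 # insertion
                d[i - 1][j - 1] + substitution,  # substitution or match
            )
    return d[m][n]


print(edit_distance("excused", "exhausted"))  # 3
```

For the slide's example the result is 3: substitute ‘c’ with ‘h’, insert ‘a’, and insert ‘t’.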


Classification using ED

• Nearest-neighbor algorithm can be applied for pattern recognition.
  – Training: data of strings with their class labels stored
  – Classification (testing): a test string is compared to each stored string and an ED is computed; the nearest stored string’s label is assigned to the test string.
• The key is how to calculate ED.
• An example of calculating ED
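The nearest-neighbor procedure described above can be sketched as follows. The training strings and labels are made up purely for illustration, unit edit costs are assumed, and both function names are hypothetical.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cx != cy)))  # substitution or match
        prev = curr
    return prev[-1]


def nearest_neighbor_label(test: str, training: list[tuple[str, str]]) -> str:
    """Return the label of the stored string with the smallest ED to test."""
    _, label = min(training, key=lambda pair: edit_distance(test, pair[0]))
    return label


# Hypothetical training set: (string, class label) pairs.
training = [("AGCTTC", "gene"), ("AGGTTC", "gene"),
            ("excused", "word"), ("exhausted", "word")]
print(nearest_neighbor_label("AGCTAC", training))
print(nearest_neighbor_label("excuses", training))
```

Training is just storage; all the work happens at classification time, where one ED computation is needed per stored string.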


Hidden Markov Model

• Markov Model: transitional states
• Hidden Markov Model: additional visible states
• Evaluation
• Decoding
• Learning


Markov Model

• The Markov property:
  – Given the current state, the transition probability is independent of any previous states.
• A simple Markov Model
  – State ω(t) at time t
  – Sequence of length T:
    • ωT = {ω(1), ω(2), …, ω(T)}
  – Transition probability
    • P(ωj(t+1) | ωi(t)) = aij
  – It’s not required that aij = aji
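Under the Markov property, the probability of a state sequence given its first state is simply the product of the transition probabilities along it. The sketch below illustrates this; the 3-state transition matrix is invented for the example and the function name `sequence_probability` is hypothetical.

```python
def sequence_probability(states: list[int], a: list[list[float]]) -> float:
    """P(ω(2), ..., ω(T) | ω(1)) as a product of transition probabilities a[i][j]."""
    p = 1.0
    for i, j in zip(states, states[1:]):   # consecutive (current, next) pairs
        p *= a[i][j]
    return p


# Hypothetical transition matrix; row i holds P(next = j | current = i).
a = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.0, 0.6]]

print(sequence_probability([0, 1, 1, 2], a))  # 0.3 * 0.6 * 0.3
```

Each row of the matrix sums to 1, and the matrix is deliberately not symmetric (a[0][1] is 0.3 while a[1][0] is 0.1), in line with the last bullet above.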


Hidden Markov Model

• Visible states
  – VT = {v(1), v(2), …, v(T)}
• Emitting a visible state vk(t)
  – P(vk(t) | ωj(t)) = bjk
• Only the visible states vk(t) are accessible; the states ωi(t) are unobservable.
• A Markov model is ergodic if every state has a nonzero probability of occurring given some starting state.


Three Key Issues with HMM

• Evaluation
  – Given an HMM, complete with transition probabilities aij and emission probabilities bjk, determine the probability that a particular sequence of visible states VT was generated by that model
• Decoding
  – Given an HMM and a set of observations VT, determine the most likely sequence of hidden states ωT that led to VT
• Learning
  – Given the number of states and visible states and a set of training observations of visible symbols, determine the probabilities aij and bjk
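The evaluation problem is usually solved with the forward algorithm, which propagates a vector of partial probabilities through time. The sketch below is one way to write it, with `forward`, `A`, and `B` as hypothetical names. The matrices are an assumption: the slides do not list them, and they were chosen because they reproduce the values 0.09, 0.01, 0.2, … that appear on a later figure slide of this deck. The model is assumed to start in state 1 at t = 0 and to observe v1, v3, v2, v0 at t = 1 … 4.

```python
def forward(a, b, start, observations):
    """Return the alpha vectors for t = 0 .. T (alpha[j] = P(prefix, state j))."""
    c = len(a)
    alpha = [1.0 if i == start else 0.0 for i in range(c)]
    trellis = [alpha]
    for k in observations:
        # alpha_j(t) = b_jk * sum_i alpha_i(t-1) * a_ij
        alpha = [b[j][k] * sum(alpha[i] * a[i][j] for i in range(c))
                 for j in range(c)]
        trellis.append(alpha)
    return trellis


# Assumed transition probabilities a[i][j] for hidden states 0..3.
A = [[1.0, 0.0, 0.0, 0.0],
     [0.2, 0.3, 0.1, 0.4],
     [0.2, 0.5, 0.2, 0.1],
     [0.8, 0.1, 0.0, 0.1]]
# Assumed emission probabilities b[j][k] for visible symbols v0..v4.
B = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.3, 0.4, 0.1, 0.2],
     [0.0, 0.1, 0.1, 0.7, 0.1],
     [0.0, 0.5, 0.2, 0.1, 0.2]]

trellis = forward(A, B, start=1, observations=[1, 3, 2, 0])
probability = sum(trellis[-1])   # P(V^T | model)
print([round(v, 4) for v in trellis[1]])
print(round(probability, 4))
```

The cost is O(c²T) for c hidden states and T observations, instead of summing over all c^T possible hidden paths.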


Other Sequential Patterns Mining Problems

• Sequence alignment (homology) and sequence assembly (genome sequencing)

• Trend analysis
  – Trend movement vs. cyclic variations, seasonal variations, and random fluctuations
• Sequential pattern mining
  – Various kinds of sequences (weblogs)
  – Various methods: from GSP to PrefixSpan
• Periodicity analysis
  – Full periodicity, partial periodicity, cyclic association rules


Periodic Pattern

• Full periodic pattern
  – ABC ABC ABC
• Partial periodic pattern
  – ABC ADC ACC ABC
• Pattern hierarchy
  – ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE

Sequences of transactions

[ABC:3|DE:4]

Note (Guozhu Dong): Full periodic patterns are too restrictive for data mining. Pattern hierarchy: the overall pattern is made from two more detailed patterns, each with a duration.
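The difference between full and partial periodicity can be illustrated with a short sketch that splits a sequence into segments of a given period and keeps only the positions that agree in every segment. The function name `periodic_pattern` and the use of `*` for a varying position are assumptions made for this example, not notation from the slides.

```python
def periodic_pattern(sequence: str, period: int) -> str:
    """Summarize which positions within the period are constant across segments."""
    # Cut the sequence into complete segments of length `period`.
    segments = [sequence[i:i + period]
                for i in range(0, len(sequence) - period + 1, period)]
    # A position keeps its symbol if all segments agree there, else becomes '*'.
    return "".join(column[0] if len(set(column)) == 1 else "*"
                   for column in zip(*segments))


print(periodic_pattern("ABCABCABC", 3))     # every position is periodic
print(periodic_pattern("ABCADCACCABC", 3))  # only some positions are periodic
```

A result without any `*` corresponds to a full periodic pattern; a result such as `A*C` corresponds to a partial one.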

Sequence Association Rule Mining

• SPADE (Sequential Pattern Discovery using Equivalence classes)

• Constrained sequence mining (SPIRIT)


Bibliography

• R.O. Duda, P.E. Hart, and D.G. Stork, 2001. Pattern Classification. 2nd Edition. Wiley Interscience.


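[Figure: state-transition diagram of a three-state Markov model. States 1, 2, and 3 are connected by transitions labeled a11, a12, a13, a21, a22, a23, a31, a32, a33.]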


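[Figure: state-transition diagram of a three-state hidden Markov model. Hidden states 1, 2, and 3 are connected by transitions a11 … a33, and each hidden state j emits the visible symbols v1, v2, v3, v4 with probabilities bj1, bj2, bj3, bj4 (b11 … b34).]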


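[Figure: trellis for HMM evaluation. Hidden states 1, 2, 3, …, c are unfolded over time steps t = 1, 2, 3, …, T-1, T. The node values α1(2), α2(2), α3(2), …, αc(2) feed state 2 at the next time step through the transitions a12, a22, a32, …, ac2, and that state emits the visible symbol vk with probability b2k.]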


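[Figure: trellis of values for an evaluation example with hidden states 0 to 3 and time steps t = 0 to 4]

state   t = 0   t = 1   t = 2    t = 3    t = 4
  0       0       0       0        0      0.0011
  1       1      0.09   0.0052   0.0024     0
  2       0      0.01   0.0077   0.0002     0
  3       0      0.2    0.0057   0.0007     0

Visible symbols along the time axis: v3 v1 v3 v2 v0
Products annotated on the transitions from t = 0 to t = 1: 0.2 × 0, 0.3 × 0.3, 0.1 × 0.1, 0.4 × 0.5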


[Figure: left-to-right HMM with states 1 to 7 labeled /v/ /i/ /t/ /e/ /r/ /b/ /i/, followed by a final state 0 labeled /-/]


[Figure: trellis for HMM decoding. Hidden states 0, 1, 2, 3, …, c are unfolded over time steps t = 1, 2, 3, …, T-1, T. The labels max(1), max(2), max(3), …, max(T-1), max(T) mark the maximum taken at each time step.]
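For the decoding problem, the standard dynamic-programming method is the Viterbi algorithm, sketched below. It is named here as a swapped-in technique: it finds the single most probable hidden path as a whole, which is not necessarily the same as taking a separate maximum at each time step as the figure's max(t) labels suggest. The matrices are an assumption (they are not listed on the slides and were chosen to be consistent with the 0.09, 0.01, 0.2, … values in the deck's evaluation figure), and the names `viterbi`, `A`, and `B` are hypothetical.

```python
def viterbi(a, b, start, observations):
    """Return (most probable hidden path for t = 0 .. T, its joint probability)."""
    c = len(a)
    delta = [1.0 if i == start else 0.0 for i in range(c)]
    backpointers = []
    for k in observations:
        # For each state j, remember the best predecessor i.
        best_prev = [max(range(c), key=lambda i: delta[i] * a[i][j])
                     for j in range(c)]
        delta = [b[j][k] * delta[best_prev[j]] * a[best_prev[j]][j]
                 for j in range(c)]
        backpointers.append(best_prev)
    # Backtrack from the best final state.
    state = max(range(c), key=lambda j: delta[j])
    best_probability = delta[state]
    path = [state]
    for best_prev in reversed(backpointers):
        state = best_prev[state]
        path.append(state)
    path.reverse()
    return path, best_probability


# Assumed transition probabilities a[i][j] for hidden states 0..3.
A = [[1.0, 0.0, 0.0, 0.0],
     [0.2, 0.3, 0.1, 0.4],
     [0.2, 0.5, 0.2, 0.1],
     [0.8, 0.1, 0.0, 0.1]]
# Assumed emission probabilities b[j][k] for visible symbols v0..v4.
B = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.3, 0.4, 0.1, 0.2],
     [0.0, 0.1, 0.1, 0.7, 0.1],
     [0.0, 0.5, 0.2, 0.1, 0.2]]

path, p = viterbi(A, B, start=1, observations=[1, 3, 2, 0])
print(path, p)
```

Because each step keeps a backpointer to a predecessor with nonzero transition probability whenever one exists, the returned path never uses a forbidden transition, which a purely per-step maximum cannot guarantee.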


[Figure: trellis of values identical to the earlier evaluation example: hidden states 0 to 3, time steps t = 0 to 4, visible symbols v3 v1 v3 v2 v0]