Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang 2003-5-14




Outline
• Sequential pattern mining
• Apriori-like methods
– GSP
• Pattern-growth methods
– FreeSpan
– PrefixSpan
• Performance analysis
• Conclusions


Motivation

• Sequential pattern mining: Finding time-related frequent patterns

• Most data and applications are time-related

– Customer shopping patterns, telephone calling patterns

– Natural disasters (e.g., earthquake, hurricane)

– Disease and treatment

– Stock market fluctuation

– Weblog click stream analysis

– DNA sequence analysis


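Concepts
• Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of all items
• An itemset is a subset of $I$
• A sequence is an ordered list of itemsets. The itemsets are called elements. The number of item instances in the sequence is its length
– e.g. <(ef)(ab)(df)cb> has 5 elements and length 8
• A sequence $\alpha = \langle a_1 a_2 \ldots a_n \rangle$ is called a subsequence of $\beta = \langle b_1 b_2 \ldots b_m \rangle$, denoted $\alpha \sqsubseteq \beta$, if there exist integers $1 \le j_1 < j_2 < \ldots < j_n \le m$ such that $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \ldots, a_n \subseteq b_{j_n}$
– e.g. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>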
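The subsequence definition can be checked mechanically. The following is a minimal sketch, assuming a sequence is stored as a list of strings with one string of items per element; the function name `is_subsequence` and this representation are illustrative choices, not part of the slides.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta.

    Each sequence is a list of elements; each element is a string of items,
    e.g. ["a", "bc", "d", "c"] stands for <a(bc)dc>.
    """
    j = 0  # index of the next element of beta that may still be used
    for element in alpha:
        # Greedily pick the earliest remaining element of beta that contains it.
        while j < len(beta) and not set(element) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1  # later elements of alpha must map to strictly later elements
    return True


alpha = ["a", "bc", "d", "c"]            # <a(bc)dc>
beta = ["a", "abc", "ac", "d", "cf"]     # <a(abc)(ac)d(cf)>
print(is_subsequence(alpha, beta))       # True
print(is_subsequence(["d", "a"], beta))  # False
```

Matching each element to the earliest possible position is enough here: if any valid choice of indices exists, the greedy choice never blocks a later match.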


Concepts (cont'd)
• A sequence database is a set of tuples <sid, s>, where sid is a sequence_id and s is a sequence. A tuple <sid, s> is said to contain a sequence $\alpha$ if $\alpha$ is a subsequence of s
• The support of $\alpha$ is the number of tuples in the database containing $\alpha$
• If the support of $\alpha$ is no less than a threshold min_sup, $\alpha$ is called a sequential pattern
– e.g. <(ab)c> is a sequential pattern given support threshold min_sup = 2

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
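Support counting over the example database can be sketched as follows. The dictionary `DB`, the list-of-strings encoding, and the function names are assumptions made for illustration; only the four sequences and the threshold come from the slide.

```python
# Example database from the slide; each element is a string of items.
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def contains(s, alpha):
    """True if sequence s contains alpha as a subsequence."""
    j = 0
    for element in alpha:
        while j < len(s) and not set(element) <= set(s[j]):
            j += 1
        if j == len(s):
            return False
        j += 1
    return True


def support(db, alpha):
    """Number of tuples in db whose sequence contains alpha."""
    return sum(contains(s, alpha) for s in db.values())


def is_sequential_pattern(db, alpha, min_sup):
    return support(db, alpha) >= min_sup


print(support(DB, ["ab", "c"]))                   # 2 (SIDs 10 and 30)
print(is_sequential_pattern(DB, ["ab", "c"], 2))  # True
```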


Problem definition

• Given a sequence database and min_sup threshold, the problem of sequential pattern mining is to find the complete set of sequential patterns in the database


Apriori-like methods

• Apriori property: if a sequence S is not frequent, then no super-sequence of S is frequent
– e.g. if <bh> is infrequent, so are <abh> and <b(dh)>
• GSP (Generalized Sequential Pattern) algorithm
– Level by level, do:
• Generate candidate sequences
• Use the Apriori property to prune candidates
• Scan the database to collect support counts
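The generate, prune, and count loop can be illustrated with a deliberately simplified sketch. It assumes every element is a single item, so each sequence is a plain string and no itemset candidates such as <(ab)> are generated; it also omits GSP's hash tree and time constraints. The name `gsp_simple` is hypothetical, and this is not the published GSP implementation.

```python
def is_subseq(pattern, s):
    """True if the characters of pattern appear in s in order."""
    it = iter(s)
    return all(ch in it for ch in pattern)


def gsp_simple(db, min_sup):
    """Level-wise mining for sequences of single items (simplified GSP)."""
    items = sorted({ch for s in db for ch in s})
    level = {}
    for i in items:
        sup = sum(i in s for s in db)
        if sup >= min_sup:
            level[i] = sup
    result = dict(level)
    while level:
        # Generate: join p and q when p minus its first item equals q minus its last.
        cands = {p + q[-1] for p in level for q in level if p[1:] == q[:-1]}
        # Prune: every subsequence obtained by deleting one item must be frequent.
        cands = {c for c in cands
                 if all(c[:i] + c[i + 1:] in level for i in range(len(c)))}
        # Scan: count the support of the surviving candidates.
        level = {}
        for c in sorted(cands):
            sup = sum(is_subseq(c, s) for s in db)
            if sup >= min_sup:
                level[c] = sup
        result.update(level)
    return result


print(gsp_simple(["abc", "acb", "ab"], 2))
```

Each pass of the `while` loop corresponds to one database scan, which is why long patterns force many scans.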


GSP Mining Process

• 1st scan: 8 candidates, 6 length-1 sequential patterns
– <a> <b> <c> <d> <e> <f> <g> <h>
• 2nd scan: 51 candidates, 19 length-2 sequential patterns, 10 candidates not in DB at all
– <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
• 3rd scan: 46 candidates, 19 length-3 sequential patterns, 20 candidates not in DB at all
– <abb> <aab> <aba> <baa> <bab> …
• 4th scan: 8 candidates, 6 length-4 sequential patterns
– <abba> <(bd)bc> …
• 5th scan: 1 candidate, 1 length-5 sequential pattern
– <(bd)cba>
• Legend: some candidates cannot pass the support threshold; others do not appear in the DB at all


Bottlenecks of Apriori-Like Methods
• Potentially huge set of candidate sequences
– 1,000 frequent length-1 sequences generate $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates
• Multiple scans of database
• Difficulties at mining long sequential patterns
– Exponential number of short candidates
– A length-100 sequential pattern needs $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$ candidate sequences
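The two candidate counts on this slide can be checked with a few lines of arithmetic. The function names are illustrative; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def length2_candidates(n):
    """Length-2 candidates from n frequent items.

    n * n ordered pairs <xy> (including <xx>) plus n * (n - 1) / 2
    unordered pairs that share one element, <(xy)>.
    """
    return n * n + n * (n - 1) // 2


def subpattern_count(length):
    """Candidates needed for one pattern of the given length: sum of C(length, i)."""
    return sum(comb(length, i) for i in range(1, length + 1))


print(length2_candidates(1000))            # 1499500
print(length2_candidates(6))               # 51, the 2nd-scan count in the GSP example
print(subpattern_count(100) == 2**100 - 1)  # True
```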


Pattern-growth methods
• A divide-and-conquer approach
– Recursively project a sequence database into a set of smaller databases
– Mine each projected database to find the subset of patterns
• Algorithms
– FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining
– PrefixSpan: Prefix-Projected Sequential Pattern Mining


FreeSpan
• Example: given a sequence database S and min_support = 2
• Step 1: find length-1 sequential patterns and list them in support-descending order
– f_list = a:4, b:4, c:4, d:3, e:3, f:3

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
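Step 1 amounts to one scan that counts, for each item, the number of sequences containing it. A minimal sketch, assuming the list-of-strings encoding below and breaking support ties alphabetically (the tie-breaking rule is an assumption; the slide only shows the resulting order):

```python
from collections import Counter

DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def f_list(db, min_sup):
    """Frequent items with their supports, in support-descending order."""
    counts = Counter()
    for seq in db.values():
        # Count each item at most once per sequence.
        counts.update({item for element in seq for item in element})
    frequent = [(item, c) for item, c in counts.items() if c >= min_sup]
    return sorted(frequent, key=lambda ic: (-ic[1], ic[0]))


print(f_list(DB, 2))
# [('a', 4), ('b', 4), ('c', 4), ('d', 3), ('e', 3), ('f', 3)]
```

Item g occurs only in sequence 40, so it falls below min_support = 2 and is dropped.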


FreeSpan (cont'd)
• Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 disjoint subsets:
– those containing only item a
– those containing item b but no items after b in f_list
– those containing item c but no items after c in f_list
– those containing item d but no items after d in f_list
– those containing item e but no items after e in f_list
– those containing item f
• Step 3: find the subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively


FreeSpan (cont'd)
• Finding sequential patterns containing item b but no items after b in f_list
– <b>-projected database: <a(ab)a>, <aba>, <(ab)b>, <ab>
– Find all the length-2 sequential patterns containing item b but no items after b in f_list: <ab>:4, <ba>:2, <(ab)>:2
– Further partition and mining

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
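The <b>-projected database can be reproduced by keeping, in every sequence that contains b, only the items up to b in f_list. This is a sketch of that single projection step under the encoding shown, not of the full FreeSpan algorithm; `freespan_project` is a hypothetical name.

```python
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}
F_LIST = ["a", "b", "c", "d", "e", "f"]


def freespan_project(db, f_list, item):
    """Project db on `item`: drop infrequent items and items after it in f_list."""
    allowed = set(f_list[: f_list.index(item) + 1])
    projected = []
    for seq in db.values():
        if not any(item in element for element in seq):
            continue  # sequences without the item contribute nothing
        reduced = ["".join(x for x in element if x in allowed) for element in seq]
        projected.append([e for e in reduced if e])  # drop emptied elements
    return projected


print(freespan_project(DB, F_LIST, "b"))
# [['a', 'ab', 'a'], ['a', 'b', 'a'], ['ab', 'b'], ['a', 'b']]
```

The result corresponds to <a(ab)a>, <aba>, <(ab)b>, <ab>. Projecting on f with the same function keeps every frequent item, which illustrates why FreeSpan's projected databases may not shrink much.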


From FreeSpan to PrefixSpan
• FreeSpan:
– Projection-based: no candidate sequence needs to be generated
– But projection can be performed at any point in the sequence, and the projected sequences may not shrink much. For example, the f-projected database is the same size as the original sequence database
• PrefixSpan:
– Projection-based
– But only prefix-based projection: fewer projections and quickly shrinking sequences


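PrefixSpan-concepts
Suppose all items in an element are listed alphabetically. Given sequences $\alpha = \langle e_1 e_2 \ldots e_n \rangle$ and $\beta = \langle e'_1 e'_2 \ldots e'_m \rangle$ ($m \le n$):
• Prefix: $\beta$ is a prefix of $\alpha$ iff (1) $e'_i = e_i$ for $i \le m-1$; (2) $e'_m \subseteq e_m$; (3) all items in $(e_m - e'_m)$ are alphabetically after those in $e'_m$
– e.g. for $\alpha$ = <a(abc)(ac)d(cf)>, $\beta$ = <a(ab)> is a prefix of $\alpha$, but $\beta'$ = <a(bc)> is not
• Postfix: given prefix $\beta = \langle e_1 e_2 \ldots e_{m-1} e'_m \rangle$, the sequence $\gamma = \langle e''_m e_{m+1} \ldots e_n \rangle$ is called the postfix of $\alpha$ w.r.t. prefix $\beta$, where $e''_m = (e_m - e'_m)$, denoted $\gamma = \alpha / \beta$
– e.g. <(_c)(ac)d(cf)> is the postfix of $\alpha$ w.r.t. prefix <a(ab)>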
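The three prefix conditions and the postfix construction translate directly into code. A minimal sketch, assuming elements are non-empty strings with items in alphabetical order and using a leading "_" to mark a partial first element as the slides do; the function names are illustrative.

```python
def is_prefix(beta, alpha):
    """Check the three prefix conditions for beta against alpha."""
    m = len(beta)
    if m == 0 or m > len(alpha):
        return False
    if beta[: m - 1] != alpha[: m - 1]:            # (1) first m-1 elements equal
        return False
    last_b, last_a = set(beta[-1]), set(alpha[m - 1])
    if not last_b <= last_a:                       # (2) last element is a subset
        return False
    # (3) every leftover item comes alphabetically after all items of beta's last element
    return all(x > max(last_b) for x in last_a - last_b)


def postfix(alpha, beta):
    """Postfix of alpha w.r.t. prefix beta; a partial element is written as '_xy'."""
    if not is_prefix(beta, alpha):
        raise ValueError("beta is not a prefix of alpha")
    m = len(beta)
    rest = sorted(set(alpha[m - 1]) - set(beta[-1]))
    head = ["_" + "".join(rest)] if rest else []
    return head + list(alpha[m:])


alpha = ["a", "abc", "ac", "d", "cf"]    # <a(abc)(ac)d(cf)>
print(is_prefix(["a", "ab"], alpha))     # True
print(is_prefix(["a", "bc"], alpha))     # False: leftover item a is not after b, c
print(postfix(alpha, ["a", "ab"]))       # ['_c', 'ac', 'd', 'cf']
```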


PrefixSpan-concepts (cont'd)

• Projected database: let $\alpha$ be a sequential pattern in S. The $\alpha$-projected database, denoted $S|_\alpha$, is the collection of postfixes of sequences in S w.r.t. prefix $\alpha$

• Support count in projected database: let $\alpha$ be a sequential pattern in S, and let $\beta$ be a sequence having prefix $\alpha$. The support count of $\beta$ in the $\alpha$-projected database is the number of sequences $\gamma$ in $S|_\alpha$ such that $\beta \sqsubseteq \alpha \cdot \gamma$.


PrefixSpan-process
• Step 1: find length-1 sequential patterns
– <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3
• Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
– those having prefix <a>
– those having prefix <b>
– …
– those having prefix <f>
• Step 3: find the subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>


PrefixSpan-Process (cont'd)
• Finding sequential patterns with prefix <a>
– <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
– Find all the length-2 sequential patterns having prefix <a>: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2
– Further partition into 6 subsets
• Having prefix <aa>
• …
• Having prefix <af>

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
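The <a>-projected database can be rebuilt by cutting each sequence at the first occurrence of a. The sketch below covers only projection on a single-item prefix, assuming elements are strings with items in alphabetical order; `project_on_item` is a hypothetical helper, and projection on longer prefixes needs extra bookkeeping that is not shown.

```python
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def project_on_item(seq, item):
    """Postfix of seq w.r.t. the single-item prefix <item>, or None if absent."""
    for k, element in enumerate(seq):
        if item in element:
            # Items after `item` in the same element form the partial element (_x).
            rest = "".join(x for x in element if x > item)
            head = ["_" + rest] if rest else []
            return head + seq[k + 1:]
    return None


a_projected = [project_on_item(s, "a") for s in DB.values()]
print(a_projected)
# [['abc', 'ac', 'd', 'cf'], ['_d', 'c', 'bc', 'ae'],
#  ['_b', 'df', 'c', 'b'], ['_f', 'c', 'b', 'c']]
```

Items that precede the first a, such as (ef) in sequence 30 and e, g in sequence 40, are discarded, which is why prefix projection shrinks the data at every level.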


Completeness of PrefixSpan

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
• prefix <a>: <a>-projected database
– <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
– Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
– prefix <aa>: <aa>-projected database …
– …
– prefix <af>: <af>-projected database …
• prefix <b>: <b>-projected database …
• prefix <c>, …, <f>: …
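The recursion in this tree can be sketched for a restricted case. The code below assumes every element is a single item, so sequences are plain strings and itemset patterns such as <(ab)> do not arise; it uses physical projection and is meant as an illustration of prefix-projected growth, not as the authors' implementation.

```python
def prefixspan(db, min_sup):
    """Mine all frequent sequences of single items from a list of strings."""
    results = {}

    def mine(prefix, projected):
        # Count, per item, the number of postfixes that contain it.
        counts = {}
        for s in projected:
            for item in set(s):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            if counts[item] < min_sup:
                continue
            pattern = prefix + item
            results[pattern] = counts[item]
            # Project: keep what follows the first occurrence of the item.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, new_projected)

    mine("", db)
    return results


print(prefixspan(["abc", "acb", "ab"], 2))
# {'a': 3, 'ab': 3, 'ac': 2, 'b': 3, 'c': 2}
```

Every pattern is reached through exactly one chain of prefixes, so the subsets are disjoint and together cover the complete set of patterns.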


Efficiency of PrefixSpan

• No candidate sequence needs to be generated

• Projected databases keep shrinking

• Major cost of PrefixSpan: constructing projected databases

– Can be improved by bi-level projections and pseudo-projections


Optimization Techniques in PrefixSpan

• Single-level vs. bi-level projection

– Bi-level projection with 3-way checking may reduce the number and size of projected databases

• Physical projection vs. pseudo-projection

– Pseudo-projection may reduce the effort of projection when the projected database fits in main memory


S-matrix for sequence database
• Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
• All length-2 sequential patterns are found in the S-matrix

S-matrix

     a          b          c          d          e          f
a    2
b    (4, 2, 2)  1
c    (4, 2, 1)  (3, 3, 2)  3
d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1

• Diagonal entry for a is 2: <aa> happens twice
• Entry (4, 2, 1) in row c, column a: <ac> happens 4 times, <ca> happens twice, <(ac)> happens once
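The S-matrix entries can be recomputed from the example database. This sketch fills each cell by separate support counts for clarity, whereas the slides describe building the matrix during a scan of the database; the dictionary layout keyed by item pairs is an assumption.

```python
from itertools import combinations

DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def contains(s, alpha):
    j = 0
    for element in alpha:
        while j < len(s) and not set(element) <= set(s[j]):
            j += 1
        if j == len(s):
            return False
        j += 1
    return True


def support(db, alpha):
    return sum(contains(s, alpha) for s in db.values())


def s_matrix(db, items):
    """Diagonal: support of <xx>. Off-diagonal (x < y): (<xy>, <yx>, <(xy)>)."""
    matrix = {}
    for x in items:
        matrix[(x, x)] = support(db, [x, x])
    for x, y in combinations(items, 2):
        matrix[(x, y)] = (support(db, [x, y]),
                          support(db, [y, x]),
                          support(db, [x + y]))
    return matrix


m = s_matrix(DB, "abcdef")
print(m[("a", "a")])  # 2
print(m[("a", "c")])  # (4, 2, 1)
```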


S-matrix for <ab>-projected database
• <ab>-projected database:
– <(_c)(ac)d(cf)>, <(_c)(ae)>, <c>
• Frequent items: <a>, <c>, <(_c)>
• S-matrix:

        a          c          (_c)
a       0
c       (1, 0, 1)  1
(_c)    (∅, 2, ∅)  (∅, 1, ∅)

• ∅ entries: there is no <a(_c)>, so no count is kept
• Entry (∅, 2, ∅) in row (_c), column a leads to pattern <a(bc)a>

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>


Scaling-up by Bi-level Projection

• Partition search space based on length-2 sequential patterns

• Only form projected databases and pursue recursive mining over bi-level projected databases


Benefits of Bi-level Projection
• More patterns are found in each pass
• Far fewer projections
– In the example, there are 53 patterns
– 53 level-by-level projections
– 22 bi-level projections


3-way Apriori Checking

• Use the Apriori heuristic to prune items in projected databases

     a          b          c          d          e          f
a    2
b    (4, 2, 2)  1
c    (4, 2, 1)  (3, 3, 2)  3
d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1

• <acd> cannot be a pattern w.r.t. min_support = 2, because the S-matrix shows <cd> has support 1; so d is excluded from the <ac>-projected database
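The check for <acd> can be written as a lookup against S-matrix entries. This sketch hard-codes a few cells copied from the matrix and tests only a necessary condition, namely that every length-2 subsequence of <xyz> is frequent; the function names and the dictionary layout are assumptions.

```python
# Selected S-matrix cells: (x, y) with x < y -> (sup <xy>, sup <yx>, sup <(xy)>)
S = {
    ("a", "b"): (4, 2, 2),
    ("a", "c"): (4, 2, 1),
    ("a", "d"): (2, 1, 1),
    ("b", "c"): (3, 3, 2),
    ("c", "d"): (1, 3, 0),
}


def seq_support(s_matrix, p, q):
    """Support of the length-2 sequence <pq> for distinct items p and q."""
    if p < q:
        return s_matrix[(p, q)][0]
    return s_matrix[(q, p)][1]


def may_be_pattern(s_matrix, x, y, z, min_sup):
    """Necessary condition for <xyz>: <xy>, <xz>, and <yz> are all frequent."""
    return all(seq_support(s_matrix, p, q) >= min_sup
               for p, q in ((x, y), (x, z), (y, z)))


print(may_be_pattern(S, "a", "c", "d", 2))  # False: <cd> has support 1
print(may_be_pattern(S, "a", "c", "b", 2))  # True: <ac>, <ab>, <cb> are frequent
```

A True result only means the item survives pruning; the actual support still has to be counted in the projected database.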


Pseudo-projection
• Major cost of PrefixSpan: projection
– Postfixes of sequences often appear repeatedly in recursive projected databases
• When the projected database fits in memory, use pointers to form projections
– Pointer to the sequence
– Offset of the postfix
• Example: s = <a(abc)(ac)d(cf)>
– s|<a>: (pointer to s, 2), representing <(abc)(ac)d(cf)>
– s|<ab>: (pointer to s, 4), representing <(_c)(ac)d(cf)>
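The offsets 2 and 4 in this example can be reproduced by treating the offset as a 1-based position in the flattened item list of s; that reading is inferred from the example, and the helper names are hypothetical. The sketch handles only extensions that start a new element (such as <a> to <ab>), not extensions within the same element (such as <a> to <(ab)>).

```python
def flatten(seq):
    """List of (element_index, item) pairs in sequence order."""
    return [(ei, item) for ei, element in enumerate(seq) for item in element]


def extend(flat, offset, item):
    """Offset of the postfix after appending `item` as a new element.

    `offset` is the 1-based position where the current postfix starts.
    Returns the new 1-based offset, or None if the item cannot be matched.
    """
    # Element index of the last matched item; -1 if nothing is matched yet.
    last_elem = flat[offset - 2][0] if offset > 1 else -1
    for pos in range(offset - 1, len(flat)):
        ei, x = flat[pos]
        if ei > last_elem and x == item:
            return pos + 2  # postfix starts right after the matched item
    return None


s = ["a", "abc", "ac", "d", "cf"]     # <a(abc)(ac)d(cf)>
flat = flatten(s)
off_a = extend(flat, 1, "a")
off_ab = extend(flat, off_a, "b")
print(off_a, off_ab)                   # 2 4
```

Only the pair (reference to s, offset) is stored per projected sequence, so no postfix is copied; this pays off when the data stays in main memory.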


Pseudo-Projection vs. Physical Projection
• Pseudo-projection avoids physically copying postfixes
– Efficient when the database fits in main memory
– Not efficient when the database cannot fit in main memory
• Disk-based random access is very costly
• Suggested approach:
– Integration of physical and pseudo-projection
– Swapping to pseudo-projection when the data set fits in memory


Experiments

• Synthetic datasets were generated using the procedure described in R. Agrawal and R. Srikant, Mining sequential patterns, ICDE'95
– Number of items: 1,000
– Number of sequences in the data set: 10,000
– Average number of items within elements: 8
– Average number of elements in a sequence: 8


Experiments (cont'd)

• Comparing PrefixSpan with GSP and FreeSpan in large databases
– GSP (IBM Almaden, Srikant & Agrawal, EDBT'96)
– FreeSpan (J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, KDD'00)
– PrefixSpan-1 (single-level projection)
– PrefixSpan-2 (bi-level projection)

• Comparing effects of pseudo-projection

• Comparing I/O cost and scalability


PrefixSpan Is Faster Than GSP and FreeSpan

[Figure: runtime in seconds (0 to 400) versus support threshold (0.00% to 3.00%) for PrefixSpan-1, PrefixSpan-2, FreeSpan, and GSP]


Effect of Pseudo-Projection When the Projected Database Fits in Memory

[Figure: runtime in seconds (0 to 200) versus support threshold (0.20% to 0.60%) for PrefixSpan-1, PrefixSpan-2, PrefixSpan-1 (Pseudo), and PrefixSpan-2 (Pseudo)]


I/O Cost: When It Cannot Fit in Memory

[Figure: I/O cost (0 to 1.E+10) versus support threshold (0.0% to 3.0%) for PrefixSpan-1, PrefixSpan-1 (pseudo), PrefixSpan-2, and PrefixSpan-2 (pseudo)]


Scalability (When DB Is Large)

[Figure: runtime in thousands of seconds (0 to 30) versus number of sequences in thousands (0 to 500) for PrefixSpan-1 and PrefixSpan-2, with min_sup = 0.2%]


Conclusions
• Both PrefixSpan and FreeSpan are pattern-growth methods, which perform better than Apriori-like methods on the sequential pattern mining problem
• PrefixSpan is more elegant than FreeSpan
– The Apriori heuristic is integrated into bi-level projection in PrefixSpan
– Pseudo-projection substantially enhances the performance of memory-based processing


References

• J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.

• J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224.

• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.


Q&A


Thanks