Pattern-Growth Methods for Sequential Pattern Mining
Iris Zhang, 2003-5-14
Outline
• Sequential pattern mining
• Apriori-like methods
– GSP
• Pattern-growth methods
– FreeSpan
– PrefixSpan
• Performance analysis
• Conclusions
Motivation
• Sequential pattern mining: Finding time-related frequent patterns
• Most data and applications are time-related
– Customer shopping patterns, telephone calling patterns
– Natural disasters (e.g., earthquake, hurricane)
– Disease and treatment
– Stock market fluctuation
– Weblog click stream analysis
– DNA sequence analysis
Concepts
• Let I = {i1, i2, …, in} be the set of all items
• An itemset is a subset of items
• A sequence is an ordered list of itemsets. The itemsets are called elements. The number of items in the sequence is its length
– e.g. <(ef)(ab)(df)cb>
• A sequence α = <a1 a2 … an> is called a subsequence of β = <b1 b2 … bm>, denoted α ⊑ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn
– e.g. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
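The subsequence test can be sketched in a few lines of Python. This is only an illustration: the function name `is_subsequence` and the choice of representing a sequence as a list of Python sets are assumptions, not part of the original slides.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta.

    Both arguments are lists of itemsets (Python sets). Each element of
    alpha is matched greedily to the earliest remaining element of beta
    that contains it.
    """
    j = 0
    for a in alpha:
        # Skip elements of beta that do not contain the itemset a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


alpha = [{"a"}, {"b", "c"}, {"d"}, {"c"}]                        # <a(bc)dc>
beta = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]  # <a(abc)(ac)d(cf)>
print(is_subsequence(alpha, beta))  # True
```

Greedy matching is enough here because taking the earliest possible match never rules out a later one.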
Concepts (cont'd)
• A sequence database is a set of tuples <sid, s>, where sid is a sequence_id and s is a sequence. A tuple is said to contain a sequence α if α is a subsequence of s
• The support of α is the number of tuples in the database containing α
• If the support of α is no less than a threshold min_sup, α is called a sequential pattern
– <(ab)c> is a sequential pattern given support threshold min_sup = 2
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
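As a small sketch, the support of a sequence in the table above can be counted as follows. The names `DB`, `contains` and `support` are illustrative assumptions; sequences are represented as lists of Python sets.

```python
# The example sequence database, keyed by SID.
DB = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}


def contains(s, alpha):
    """True if sequence s contains alpha as a subsequence."""
    it = iter(s)  # shared iterator, so matches happen in order
    return all(any(a <= elem for elem in it) for a in alpha)


def support(db, alpha):
    """Number of tuples in db whose sequence contains alpha."""
    return sum(contains(s, alpha) for s in db.values())


print(support(DB, [{"a", "b"}, {"c"}]))  # 2, so <(ab)c> is a pattern for min_sup = 2
```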
Problem definition
• Given a sequence database and min_sup threshold, the problem of sequential pattern mining is to find the complete set of sequential patterns in the database
Apriori-like methods
• Apriori property: if a sequence S is not frequent, then every super-sequence of S is not frequent
– e.g. <bh> is infrequent, so <abh> and <b(dh)> are infrequent as well
• GSP (Generalized Sequential Pattern) algorithm
– Level by level, do:
• Generate candidate sequences
• Use the Apriori property to prune candidates
• Scan the database to collect support counts
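The pruning step can be illustrated with a short sketch. It assumes no time constraints, so a candidate is dropped as soon as any subsequence obtained by deleting one item is not known to be frequent. The helper names and the set `frequent_2` are made up for illustration and are not data from the slides.

```python
def drop_one_item(seq):
    """Yield every subsequence obtained by deleting exactly one item.

    A sequence is a tuple of elements; each element is a tuple of items.
    """
    for i, elem in enumerate(seq):
        for item in elem:
            rest = tuple(x for x in elem if x != item)
            yield seq[:i] + ((rest,) if rest else ()) + seq[i + 1:]


def survives_pruning(candidate, frequent_shorter):
    """Apriori check: all one-item-shorter subsequences must be frequent."""
    return all(sub in frequent_shorter for sub in drop_one_item(candidate))


# Hypothetical frequent length-2 sequences; <bh> is deliberately absent.
frequent_2 = {
    (("a",), ("b",)), (("a",), ("h",)), (("a",), ("d",)),
    (("b",), ("d",)), (("d", "h"),),
}

print(survives_pruning((("a",), ("b",), ("h",)), frequent_2))  # False: <bh> is infrequent
print(survives_pruning((("b",), ("d", "h")), frequent_2))      # False: <bh> is infrequent
print(survives_pruning((("a",), ("d", "h")), frequent_2))      # True
```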
GSP Mining Process
• 1st scan: 8 candidates, 6 length-1 sequential patterns
– <a> <b> <c> <d> <e> <f> <g> <h>
• 2nd scan: 51 candidates, 19 length-2 sequential patterns, 10 candidates not in DB at all
– <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
• 3rd scan: 46 candidates, 19 length-3 sequential patterns, 20 candidates not in DB at all
– <abb> <aab> <aba> <baa> <bab> …
• 4th scan: 8 candidates, 6 length-4 sequential patterns
– <abba> <(bd)bc> …
• 5th scan: 1 candidate, 1 length-5 sequential pattern
– <(bd)cba>
• At each level, candidates are discarded either because they cannot pass the support threshold or because they are not in the DB at all
Bottlenecks of Apriori-Like Methods
• Potentially huge set of candidate sequences
– 1,000 frequent length-1 sequences generate $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates
• Multiple scans of database
• Difficulties at mining long sequential patterns
– Exponential number of short candidates
– A length-100 sequential pattern needs $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$ candidate sequences
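The candidate counts are easy to check with a few lines of arithmetic. The function name `length2_candidates` is an illustrative assumption. The sketch counts n × n ordered pairs <xy> (including <xx>) plus C(n, 2) two-item elements <(xy)>.

```python
from math import comb


def length2_candidates(n):
    """Length-2 candidates generated from n frequent length-1 sequences."""
    return n * n + comb(n, 2)


print(length2_candidates(6))     # 51
print(length2_candidates(1000))  # 1499500

# Short candidates needed for one length-100 pattern: all non-empty subsets.
total = sum(comb(100, i) for i in range(1, 101))
print(total == 2**100 - 1)       # True
print(f"{total:.2e}")            # 1.27e+30
```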
Pattern-growth methods
• A divide-and-conquer approach
– Recursively project a sequence database into a set of smaller databases
– Mine each projected database to find the subset of patterns
• Algorithms
– FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining
– PrefixSpan: Prefix-Projected Sequential Pattern Mining
FreeSpan
• Example: given a sequence database S and min_support = 2
• Step 1: find length-1 sequential patterns and list them in support-descending order
– f_list = a:4, b:4, c:4, d:3, e:3, f:3
SID Sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
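Step 1 can be sketched as a single counting pass. The function name `f_list` and the alphabetical tie-breaking between items of equal support are assumptions made for this illustration.

```python
from collections import Counter

DB = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],   # SID 10
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],               # SID 20
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],        # SID 30
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],           # SID 40
]


def f_list(db, min_sup):
    """Frequent items with supports, in support-descending order."""
    # Each item is counted at most once per sequence.
    counts = Counter(item for seq in db for item in set().union(*seq))
    frequent = [(item, c) for item, c in counts.items() if c >= min_sup]
    return sorted(frequent, key=lambda ic: (-ic[1], ic[0]))


print(f_list(DB, 2))
# [('a', 4), ('b', 4), ('c', 4), ('d', 3), ('e', 3), ('f', 3)]
```

Item g appears in only one sequence, so it is dropped for min_support = 2.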
FreeSpan (cont'd)
• Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 disjoint subsets:
– those containing only item a
– those containing item b but no items after b in f_list
– those containing item c but no items after c in f_list
– those containing item d but no items after d in f_list
– those containing item e but no items after e in f_list
– those containing item f
• Find the subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively
FreeSpan (cont'd)
• Finding seq. patterns containing item b but no items after b in f_list
– <b>-projected database: <a(ab)a>, <aba>, <(ab)b>, <ab>
– Find all the length-2 seq. pat. containing item b but no items after b in f_list: <ab>:4, <ba>:2, <(ab)>:2
– Further partition and mining
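The <b>-projected database in this example can be reproduced with a simplified reading of the projection: keep only the sequences that contain b, and within them drop every item that comes after b in f_list (infrequent items included). The function names below are illustrative assumptions, and the sketch is not the full FreeSpan algorithm.

```python
DB = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],   # SID 10
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],               # SID 20
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],        # SID 30
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],           # SID 40
]
F_LIST = ["a", "b", "c", "d", "e", "f"]


def fmt(seq):
    """Render a list of itemsets in the slide notation, e.g. <a(ab)a>."""
    parts = ["".join(sorted(e)) for e in seq]
    return "<" + "".join(p if len(p) == 1 else "(" + p + ")" for p in parts) + ">"


def freespan_project(db, f_list, item):
    """Sequences containing item, restricted to items up to item in f_list."""
    keep = set(f_list[: f_list.index(item) + 1])
    projected = []
    for seq in db:
        if any(item in elem for elem in seq):
            reduced = [elem & keep for elem in seq]
            projected.append([e for e in reduced if e])  # drop emptied elements
    return projected


print([fmt(s) for s in freespan_project(DB, F_LIST, "b")])
# ['<a(ab)a>', '<aba>', '<(ab)b>', '<ab>']
```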
From FreeSpan to PrefixSpan
• FreeSpan
– Projection-based: no candidate sequence needs to be generated
– But projection can be performed at any point in the sequence, so the projected sequences may not shrink much. For example, the size of the f-projected database is the same as that of the original sequence database
• PrefixSpan
– Projection-based
– But only prefix-based projection: fewer projections and quickly shrinking sequences
PrefixSpan-concepts
Suppose all items in an element are listed alphabetically. Given sequences α = <e1 e2 … en> and β = <e'1 e'2 … e'm> (m ≤ n):
• Prefix: β is a prefix of α iff (1) e'i = ei for i ≤ m-1; (2) e'm ⊆ em; (3) all items in (em - e'm) are alphabetically after those in e'm
– e.g. for α = <a(abc)(ac)d(cf)>, β = <a(ab)> is a prefix of α, but β' = <a(bc)> is not, because condition (3) fails
• Postfix: given prefix β = <e1 e2 … e'm>, the sequence γ = <e''m em+1 … en> is called the postfix of α w.r.t. prefix β, where e''m = (em - e'm), denoted γ = α/β
– e.g. γ = <(_c)(ac)d(cf)> is the postfix of α w.r.t. prefix <a(ab)>
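The three prefix conditions and the resulting postfix can be written out directly. This is a sketch under assumptions: the function name `postfix` is made up, sequences are lists of Python sets, and a postfix is returned as a pair (items left in the split element, remaining elements), where the first part corresponds to the "(_…)" element of the slide notation.

```python
def postfix(alpha, beta):
    """Return the postfix of alpha w.r.t. beta, or None if beta is not a prefix.

    The result is (remainder, rest): remainder is e''m = em - e'm and
    rest is the list of elements of alpha that follow em.
    """
    m = len(beta)
    if m == 0 or m > len(alpha):
        return None
    if any(beta[i] != alpha[i] for i in range(m - 1)):  # condition (1)
        return None
    last_b, last_a = beta[-1], alpha[m - 1]
    if not last_b <= last_a:                            # condition (2)
        return None
    remainder = last_a - last_b
    if any(x < max(last_b) for x in remainder):         # condition (3)
        return None
    return remainder, alpha[m:]


alpha = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]  # <a(abc)(ac)d(cf)>
print(postfix(alpha, [{"a"}, {"a", "b"}]))  # ({'c'}, [...]), i.e. <(_c)(ac)d(cf)>
print(postfix(alpha, [{"a"}, {"b", "c"}]))  # None: item a is not after b and c
```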
PrefixSpan-concepts (cont'd)
• Projected database: let α be a sequential pattern in S. The α-projected database, denoted S|α, is the collection of postfixes of sequences in S w.r.t. prefix α
• Support count in projected database: let α be a sequential pattern in S, and β be a sequence having prefix α. The support count of β in the α-projected database is the number of sequences γ in S|α such that β ⊑ α·γ
PrefixSpan-process
• Step 1: find length-1 sequential patterns
– <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3
• Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
– those having prefix <a>;
– those having prefix <b>;
– …
– those having prefix <f>;
• Find the subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively
PrefixSpan-process (cont'd)
• Finding seq. patterns with prefix <a>
– <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
– Find all the length-2 seq. pat. having prefix <a>: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2
– Further partition into 6 subsets
• Having prefix <aa>;
• …
• Having prefix <af>;
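The recursive process can be sketched compactly. This is an illustrative variant, not the authors' implementation: instead of copying postfixes, each projected database is kept as a list of (sid, element index) pairs that mark where the prefix ends in each sequence. The names `prefixspan`, `grow` and `fmt` are assumptions, and `fmt` assumes single-character item names.

```python
from collections import defaultdict

DB = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}


def fmt(pattern):
    """Render a tuple of itemsets in the slide notation, e.g. <a(bc)a>."""
    parts = ["".join(sorted(e)) for e in pattern]
    return "<" + "".join(p if len(p) == 1 else "(" + p + ")" for p in parts) + ">"


def prefixspan(db, min_sup):
    found = {}

    def grow(prefix, proj):
        # proj: (sid, e) pairs; e indexes the element matching the last
        # itemset of prefix in its leftmost occurrence (-1 for the empty prefix).
        last = prefix[-1] if prefix else frozenset()
        s_ext, i_ext = defaultdict(list), defaultdict(list)
        for sid, e in proj:
            seq = db[sid]
            seen = set()
            for j in range(e + 1, len(seq)):      # sequence extension <prefix x>
                for x in seq[j] - seen:
                    seen.add(x)
                    s_ext[x].append((sid, j))
            if last:
                seen = set()
                for j in range(e, len(seq)):      # itemset extension: last + x
                    if last <= seq[j]:
                        for x in seq[j] - seen:
                            if x > max(last):     # keep items in alphabetical order
                                seen.add(x)
                                i_ext[x].append((sid, j))
        grown = [(prefix + (frozenset([x]),), p) for x, p in s_ext.items()]
        grown += [(prefix[:-1] + (last | {x},), p) for x, p in i_ext.items()]
        for new_prefix, new_proj in grown:
            if len(new_proj) >= min_sup:          # support = number of sequences
                found[fmt(new_prefix)] = len(new_proj)
                grow(new_prefix, new_proj)

    grow((), [(sid, -1) for sid in db])
    return found


patterns = prefixspan(DB, min_sup=2)
print({p: patterns[p] for p in ["<aa>", "<ab>", "<(ab)>", "<ac>", "<ad>", "<af>"]})
# {'<aa>': 2, '<ab>': 4, '<(ab)>': 2, '<ac>': 4, '<ad>': 2, '<af>': 2}
print(len(patterns))  # 53
```

No candidate is generated here: every extension that is counted actually occurs in some projected sequence.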
Completeness of PrefixSpan
[Figure: the search space partitioned recursively by prefix, starting from the example sequence database]
• Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
• Prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
– Length-2 seq. pat.: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
– Prefix <aa>: <aa>-proj. db; …; prefix <af>: <af>-proj. db
• Prefix <b>: <b>-projected database, …
• Prefix <c>, …, <f>: …
Efficiency of PrefixSpan
• No candidate sequence needs to be generated
• Projected databases keep shrinking
• Major cost of PrefixSpan: constructing projected databases
– Can be improved by bi-level projections and pseudo-projections
Optimization Techniques in PrefixSpan
• Single-level vs. bi-level projection
– Bi-level projection with 3-way checking may reduce the number and size of projected databases
• Physical projection vs. pseudo-projection
– Pseudo-projection may reduce the effort of projection when the projected database fits in main memory
S-matrix for sequence database
• Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
• All length-2 sequential patterns are found in the S-matrix
S-matrix
a 2
b (4, 2, 2) 1
c (4, 2, 1) (3, 3, 2) 3
d (2, 1, 1) (2, 2, 0) (1, 3, 0) 0
e (1, 2, 1) (1, 2, 0) (1, 2, 0) (1, 1, 0) 0
f (2, 1, 1) (2, 2, 0) (1, 2, 1) (1, 1, 1) (2, 0, 1) 1
a b c d e f
• Diagonal entry for a is 2: <aa> happens twice
• Entry (4, 2, 1) in row c, column a: <ac> happens 4 times, <ca> happens twice, <(ac)> happens once
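The S-matrix entries can be recomputed from the example database with a short sketch. The helper names are assumptions. A cell for items x < y holds (support of <xy>, support of <yx>, support of <(xy)>), and a diagonal cell holds the support of <xx>.

```python
DB = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],   # SID 10
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],               # SID 20
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],        # SID 30
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],           # SID 40
]


def positions(seq, x):
    """Indexes of the elements of seq that contain item x."""
    return [j for j, elem in enumerate(seq) if x in elem]


def occurs_in_order(seq, x, y):
    """True if seq contains <xy>: some x in an earlier element than some y."""
    px, py = positions(seq, x), positions(seq, y)
    return bool(px and py) and px[0] < py[-1]


def s_matrix(db, items):
    cells = {}
    for i, x in enumerate(items):
        cells[(x, x)] = sum(occurs_in_order(s, x, x) for s in db)
        for y in items[i + 1:]:
            cells[(x, y)] = (
                sum(occurs_in_order(s, x, y) for s in db),        # <xy>
                sum(occurs_in_order(s, y, x) for s in db),        # <yx>
                sum(any({x, y} <= e for e in s) for s in db),     # <(xy)>
            )
    return cells


m = s_matrix(DB, list("abcdef"))
print(m[("a", "a")], m[("a", "c")], m[("c", "d")])  # 2 (4, 2, 1) (1, 3, 0)
```

A single pass over the database per cell is used here for clarity; filling all cells in one scan is possible but makes the sketch longer.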
S-matrix for <ab>-projected database
• <ab>-projected database:
– <(_c)(ac)d(cf)>, <(_c)(ae)>, <c>
• Frequent items: <a>, <c>, <(_c)>
• S-matrix ("-" marks an entry with no count):
a 0
c (1, 0, 1) 1
(_c) (-, 2, -) (-, 1, -)
a c (_c)
– No a(_c), so no count is kept for it
– The entry (-, 2, -) in row (_c), column a leads to pattern <a(bc)a>
Scaling-up by Bi-level Projection
• Partition search space based on length-2 sequential patterns
• Only form projected databases and pursue recursive mining over bi-level projected databases
Benefits of Bi-level Projection
• More patterns are found in each pass
• Far fewer projections
– In the example, there are 53 patterns
– 53 level-by-level projections
– 22 bi-level projections
3-way Apriori Checking
• Using Apriori heuristic to prune items in projected databases
a 2
b (4, 2, 2) 1
c (4, 2, 1) (3, 3, 2) 3
d (2, 1, 1) (2, 2, 0) (1, 3, 0) 0
e (1, 2, 1) (1, 2, 0) (1, 2, 0) (1, 1, 0) 0
f (2, 1, 1) (2, 2, 0) (1, 2, 1) (1, 1, 1) (2, 0, 1) 1
a b c d e f
• <acd> cannot be a pattern w.r.t. min_support = 2, because <cd> has support 1 in the S-matrix, so exclude d from the <ac>-projected database
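The pruning decision can be illustrated with the length-2 supports read off the S-matrix. This sketch covers only sequence extensions (an item appended as a new element); the dictionary `SEQ_SUP` and the function name are assumptions introduced for the example.

```python
MIN_SUP = 2

# Supports of <xy> taken from the S-matrix: rows for first item a and first item c.
SEQ_SUP = {
    ("a", "a"): 2, ("a", "b"): 4, ("a", "c"): 4,
    ("a", "d"): 2, ("a", "e"): 1, ("a", "f"): 2,
    ("c", "a"): 2, ("c", "b"): 3, ("c", "c"): 3,
    ("c", "d"): 1, ("c", "e"): 1, ("c", "f"): 1,
}


def extendable_items(prefix_items, items):
    """Items x for which every <p x>, p in the prefix, is frequent.

    By the Apriori property, any other item cannot extend the prefix
    into a sequential pattern and can be left out of the projection.
    """
    return [
        x for x in items
        if all(SEQ_SUP.get((p, x), 0) >= MIN_SUP for p in prefix_items)
    ]


print(extendable_items(["a", "c"], "abcdef"))  # ['a', 'b', 'c']
```

Items d, e and f are excluded from the <ac>-projected database, which agrees with <acd> not being a pattern.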
Pseudo-projection
• Major cost of PrefixSpan: projection
– Postfixes of sequences often appear repeatedly in recursive projected databases
• When the projected database fits in memory, use pointers to form projections
– Pointer to the sequence
– Offset of the postfix
• Example: s = <a(abc)(ac)d(cf)>
– s|<a> = <(abc)(ac)d(cf)> is stored as (pointer to s, 2)
– s|<ab> = <(_c)(ac)d(cf)> is stored as (pointer to s, 4)
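A minimal sketch of the idea: the projected entry is just a reference to the original sequence plus a 1-based item offset, and the postfix is read in place when needed. The function name `postfix_at` and the tuple-based representation are assumptions for illustration.

```python
s = [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")]  # <a(abc)(ac)d(cf)>


def postfix_at(seq, offset):
    """Render the postfix that starts at the 1-based item offset of seq."""
    out, pos = [], 0
    for elem in seq:
        start = pos + 1              # offset of the first item of this element
        pos += len(elem)
        if pos < offset:
            continue                 # element lies entirely before the offset
        items = elem[max(offset - start, 0):]
        partial = offset > start     # postfix begins in the middle of the element
        text = ("_" if partial else "") + "".join(items)
        out.append(text if len(text) == 1 else "(" + text + ")")
    return "<" + "".join(out) + ">"


# Pseudo-projected entries: (reference to the sequence, offset). Nothing is copied.
projected = {"<a>": (s, 2), "<ab>": (s, 4)}
for prefix, (seq, offset) in projected.items():
    print(prefix, postfix_at(seq, offset))
# <a> <(abc)(ac)d(cf)>
# <ab> <(_c)(ac)d(cf)>
```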
Pseudo-Projection vs. Physical Projection
• Pseudo-projection avoids physically copying postfixes
– Efficient when the database fits in main memory
– Not efficient when the database cannot fit in main memory
• Disk-based random access is very costly
• Suggested approach:
– Integration of physical and pseudo-projection
– Swapping to pseudo-projection when the data set fits in memory
Experiments
• Synthetic datasets were generated using the procedure described in R. Agrawal and R. Srikant, "Mining sequential patterns", ICDE'95
– Number of items: 1,000
– Number of sequences in the data set: 10,000
– Average number of items within elements: 8
– Average number of elements in a sequence: 8
Experiments (cont'd)
• Comparing PrefixSpan with GSP and FreeSpan in large databases
– GSP (IBM Almaden, Srikant & Agrawal, EDBT'96)
– FreeSpan (J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, KDD'00)
– PrefixSpan-1 (single-level projection)
– PrefixSpan-2 (bi-level projection)
• Comparing effects of pseudo-projection
• Comparing I/O cost and scalability
PrefixSpan Is Faster Than GSP and FreeSpan
[Figure: runtime (seconds) vs. support threshold (%) for PrefixSpan-1, PrefixSpan-2, FreeSpan, and GSP]
Effect of Pseudo-Projection When the Projected Database Fits in Memory
[Figure: runtime (seconds) vs. support threshold (%) for PrefixSpan-1, PrefixSpan-2, PrefixSpan-1 (Pseudo), and PrefixSpan-2 (Pseudo)]
I/O Cost: When It Cannot Fit in Memory
[Figure: I/O cost vs. support threshold (%) for PrefixSpan-1, PrefixSpan-1 (pseudo), PrefixSpan-2, and PrefixSpan-2 (pseudo)]
Scalability (When DB Is Large)
[Figure: runtime (thousand seconds) vs. number of sequences (thousand) for PrefixSpan-1 and PrefixSpan-2, min_sup = 0.2%]
Conclusions
• Both PrefixSpan and FreeSpan are pattern-growth methods, which perform better than Apriori-like methods on the sequential pattern mining problem
• PrefixSpan is more elegant than FreeSpan
– The Apriori heuristic is integrated into bi-level projection in PrefixSpan
– Pseudo-projection substantially enhances the performance of memory-based processing
References
• J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.
• J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224.
• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.
Q&A
Thanks