8/10/2019 Sequence Databases and Sequential Pattern Analysis
http://slidepdf.com/reader/full/sequence-databases-and-sequential-pattern-analysis 1/55
Sequential Pattern Analysis
Departemen Ilmu Komputer FMIPA IPB
November 17, 2007, Data Mining: Concepts and Techniques
Sequence Database:

Object  Timestamp  Events
…       …          …
A       20         6, 1
A       23         1
B       11         4, 5, 6
B       21         7, 8, 1, 2
B       28         1, 6
C       14         1, 8, 7
Formal Definition of a Sequence
A sequence is an ordered list of elements (transactions)
  s = < e1 e2 e3 … >
Each element contains a collection of events (items)
  ei = {i1, i2, …, ik}
Each element is attributed to a specific time or location
Length of a sequence, |s|, is given by the number of elements of the sequence
A k-sequence is a sequence that contains k events (items)
Examples of Sequence Data

Columns: Sequence Database; Sequence; Element (Transaction); Event (Item)

Customer: …; … a customer at time t; …, CDs, etc.
Web Data: Browsing activity of a …; A collection of files … after a single mouse click; Home page, index …
Event data: History of events generated by a given sensor; Events triggered by a sensor at time t; Types of alarms generated by sensors
Genome sequences: DNA sequence of a particular species; An element of the DNA sequence; Bases A, T, G, C

[Figure: a sequence drawn as a row of elements e1 to e5; each element (transaction) holds events (items) such as i1, i2, i3.]
Examples of Sequence
Web sequence:
  < {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >
Sequence of initiating events at Three Mile Island (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm):
  < {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases} >
Sequence of books checked out at a library:
  < {Fellowship of the Ring} {The Two Towers} {Return of the King} >
Formal Definition of a Subsequence
A sequence A = <a1 a2 … an> is contained in another sequence B = <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin

Data sequence            Subsequence      Contain?
< {2,4} {3,5,6} {8} >    < {2} {3,5} >    Yes
< {1,2} {3,4} >          < {1} {2} >      No
< {2,4} {2,4} {2,5} >    < {2} {4} >      Yes

The support of a subsequence w is defined as the fraction of data sequences that contain w
A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup)
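A minimal sketch of the containment test and the support measure, assuming sequences are encoded as tuples of tuples of items. The function names `contains` and `support` are illustrative choices, and the greedy left-to-right matching is one simple way to check the index condition i1 < i2 < … < in.

```python
def contains(data, sub):
    """True if sub = <a1 ... an> is contained in data = <b1 ... bm>, i.e. there
    are indices i1 < i2 < ... < in with every a_j a subset of b_(i_j)."""
    pos = 0
    for a in sub:
        # Greedily take the earliest remaining element of data that covers a.
        while pos < len(data) and not set(a) <= set(data[pos]):
            pos += 1
        if pos == len(data):
            return False
        pos += 1
    return True

def support(database, sub):
    """Fraction of the data sequences in database that contain sub."""
    return sum(contains(seq, sub) for seq in database) / len(database)

# The three rows of the containment table: (data sequence, subsequence).
rows = [
    (((2, 4), (3, 5, 6), (8,)), ((2,), (3, 5))),
    (((1, 2), (3, 4)), ((1,), (2,))),
    (((2, 4), (2, 4), (2, 5)), ((2,), (4,))),
]
for data, sub in rows:
    print(contains(data, sub))  # True, False, True
```

In the second row the items 1 and 2 sit in the same element of the data sequence, so they cannot be matched to two different elements, which is why the answer is No.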
Sequential Pattern Mining: Definition
Given:
  a database of sequences
  a user-specified minimum support threshold, minsup
Task:
  find all subsequences with support ≥ minsup
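Challenge on Sequential Pattern Mining
Given a sequence: <{a b} {c d e} {f} {g h i}>
Examples of subsequences: <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc.
How many k-subsequences can be extracted from a given n-sequence?
  <{a b} {c d e} {f} {g h i}>   n = 9
  k = 4:  Y _ _ Y Y _ _ _ Y   gives <{a} {d e} {i}>
  $\binom{n}{k} = \binom{9}{4} = \frac{9 \cdot 8 \cdot 7 \cdot 6}{4 \cdot 3 \cdot 2 \cdot 1} = 126$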
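The count can be checked by brute force. The sketch below assumes the same tuple-of-tuples encoding as before; `rebuild` is a hypothetical helper that groups the kept events back into elements.

```python
from itertools import combinations
from math import comb

sequence = (("a", "b"), ("c", "d", "e"), ("f",), ("g", "h", "i"))
# Flatten to (element index, item) pairs so every event remembers its element.
events = [(i, item) for i, element in enumerate(sequence) for item in element]

n, k = len(events), 4
choices = list(combinations(events, k))  # every way to keep k of the n events
print(len(choices), comb(n, k))          # 126 126

def rebuild(chosen):
    """Group the kept events back into elements, dropping emptied elements."""
    grouped = {}
    for i, item in chosen:
        grouped.setdefault(i, []).append(item)
    return tuple(tuple(items) for _, items in sorted(grouped.items()))

# The selection Y _ _ Y Y _ _ _ Y keeps the events a, d, e and i.
picked = [events[j] for j in (0, 3, 4, 8)]
print(rebuild(picked))                   # (('a',), ('d', 'e'), ('i',))
```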
Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases
A mining algorithm should
  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  be highly efficient, scalable, involving only a small number of database scans
  be able to incorporate various kinds of user-specific constraints
A Basic Property of Sequential Patterns: Apriori
A basic property: Apriori (Agrawal & Srikant '94)
  If a sequence S is not frequent, then none of the super-sequences of S is frequent
  E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent as well

Given support threshold min_sup = 2

Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>
GSP: A Generalized Sequential Pattern Mining Algorithm
GSP (Generalized Sequential Pattern) mining algorithm
  proposed by Agrawal and Srikant
Outline of the method
  Initially, every item in DB is a candidate of length-1
  for each level (i.e., sequences of length-k) do
    scan database to collect support count for each candidate sequence
    generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  repeat until no frequent sequence or no candidate can be found
Major strength: Candidate pruning by Apriori
Finding Length-1 Sequential Patterns
Examine GSP using an example
Initial candidates: all singleton sequences
  <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan database once, count support for candidates

min_sup = 2

Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>

Cand  Sup
<a>   3
<b>   5
<c>   4
<d>   3
<e>   3
<f>   2
<g>   1
<h>   1
Generating Length-2 Candidates

51 length-2 candidates

      <a>   <b>   <c>   <d>   <e>   <f>
<a>   <aa>  <ab>  <ac>  <ad>  <ae>  <af>
<b>   <ba>  <bb>  <bc>  <bd>  <be>  <bf>
<c>   <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
<d>   <da>  <db>  <dc>  <dd>  <de>  <df>
<e>   <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
<f>   <fa>  <fb>  <fc>  <fd>  <fe>  <ff>

      <a>  <b>     <c>     <d>     <e>     <f>
<a>        <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>                <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                        <(cd)>  <(ce)>  <(cf)>
<d>                                <(de)>  <(df)>
<e>                                        <(ef)>
<f>

Without Apriori property, 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
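The candidate counts can be reproduced with a few lines of arithmetic. The helper name `length2_candidates` is an illustrative choice; it counts one sequence-type candidate <xy> per ordered pair of items and one itemset-type candidate <(xy)> per unordered pair of distinct items.

```python
def length2_candidates(n_items):
    """Number of length-2 candidates built from n_items length-1 sequences."""
    sequence_type = n_items * n_items            # <xy>: y in a later element, x may equal y
    itemset_type = n_items * (n_items - 1) // 2  # <(xy)>: x and y in the same element
    return sequence_type + itemset_type

without_pruning = length2_candidates(8)  # all 8 items a..h
with_pruning = length2_candidates(6)     # only the 6 frequent items a..f
pruned = 1 - with_pruning / without_pruning
print(without_pruning, with_pruning, f"{pruned:.2%}")  # 92 51 44.57%
```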
Finding Length-2 Sequential Patterns
Scan database one more time, collect support count for each length-2 candidate
There are 19 length-2 candidates which pass the minimum support threshold
  They are length-2 sequential patterns
Support Count Length-2 Candidate
[Figure: the two tables of length-2 candidates (<aa> … <ff> and <(ab)> … <(ef)>) marked by support count (legend values 0, 1, 3, 4), shown beside the sequence database with min_sup = 2.]
Length-2 Sequential Patterns
[Figure: the two tables of length-2 candidates (<aa> … <ff> and <(ab)> … <(ef)>) with the length-2 sequential patterns marked by support count (legend values 0, 1, 3, 4), shown beside the sequence database with min_sup = 2.]
Generating Length-3 Candidates and Finding Length-3 Patterns
Generate Length-3 Candidates
  Self-join length-2 sequential patterns
    Based on the Apriori property
    <aa>, <ab> and <ba> are all length-2 sequential patterns, so <aba> is a length-3 candidate
    <(bd)>, <db> and <bb> are all length-2 sequential patterns, so <(bd)b> is a length-3 candidate
  46 candidates are generated
Find Length-3 Sequential Patterns
  19 out of 46 candidates pass support threshold
Support Count Length-3 Candidate
[Figure: table of length-3 candidates built from the patterns <aa>, <ab>, <ba>, <bb>, <bc>, <bd>, marked by support count (legend values 0, 1, 3, 4), shown beside the sequence database with min_sup = 2.]
The GSP Mining Process

min_sup = 2

Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>

1st scan: 8 cand. 6 length-1 seq. pat.
  <a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand. 19 length-2 seq. pat. 10 cand. not in DB at all
  <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
3rd scan: 46 cand. 19 length-3 seq. pat. 20 cand. not in DB at all
  <abb> <aab> <aba> <baa> <bab> …
4th scan: 8 cand. 6 length-4 seq. pat.
  <abba> <(bd)bc> …
5th scan: 1 cand. 1 length-5 seq. pat.
  <(bd)cba>

Legend: Cand. cannot pass sup. threshold; Cand. not in DB at all
The GSP Algorithm
Take sequences in form of <x> as length-1 candidates
Scan database once, find F1, the set of length-1 sequential patterns
Let k=1; while Fk is not empty do
  Form Ck+1, the set of length-(k+1) candidates from Fk;
  If Ck+1 is not empty, scan database once, find Fk+1, the set of length-(k+1) sequential patterns
  Let k=k+1;
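The level-wise loop can be sketched in Python on the five-sequence example database. This is a simplified illustration, not GSP itself: candidates are formed by extending each frequent pattern with one frequent item (either as a new element or inside the last element) instead of GSP's join step, and every name below (`DB`, `contains`, `support`, `extensions`, `mine`) is an assumption of this sketch.

```python
DB = {
    10: (("b", "d"), ("c",), ("b",), ("a", "c")),
    20: (("b", "f"), ("c", "e"), ("b",), ("f", "g")),
    30: (("a", "h"), ("b", "f"), ("a",), ("b",), ("f",)),
    40: (("b", "e"), ("c", "e"), ("d",)),
    50: (("a",), ("b", "d"), ("b",), ("c",), ("b",), ("a", "d", "e")),
}

def contains(data, pattern):
    """Greedy left-to-right test that pattern is a subsequence of data."""
    pos = 0
    for element in pattern:
        while pos < len(data) and not set(element) <= set(data[pos]):
            pos += 1
        if pos == len(data):
            return False
        pos += 1
    return True

def support(pattern):
    """Support count: number of data sequences containing the pattern."""
    return sum(contains(seq, pattern) for seq in DB.values())

def extensions(pattern, items):
    """Grow a pattern by one item: as a new last element, or inside the
    current last element (only with a larger item, to keep elements sorted)."""
    for x in items:
        yield pattern + ((x,),)
        if x > pattern[-1][-1]:
            yield pattern[:-1] + (pattern[-1] + (x,),)

def mine(min_sup):
    items = sorted({x for seq in DB.values() for e in seq for x in e})
    level = [((x,),) for x in items if support(((x,),)) >= min_sup]  # F1
    frequent_items = [p[0][0] for p in level]
    levels = []
    while level:                                   # while Fk is not empty
        levels.append(level)
        candidates = {c for p in level for c in extensions(p, frequent_items)}
        level = sorted(c for c in candidates if support(c) >= min_sup)
    return levels

levels = mine(min_sup=2)
print([len(f) for f in levels])
```

On this database the first two levels contain 6 and 19 patterns, matching the counts quoted for the first and second scans, and the length-5 pattern <(bd)cba> appears in the fifth level.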
Bottlenecks of GSP
A huge set of candidates could be generated
  1,000 frequent length-1 sequences generate
  $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates!
Multiple scans of database in mining
Real challenge: mining long sequential patterns
  An exponential number of short candidates
  A length-100 sequential pattern needs $10^{30}$ candidate sequences!
  $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$
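Both figures are easy to confirm with exact integer arithmetic; the variable names below are illustrative.

```python
from math import comb

n = 1000
length2 = n * n + n * (n - 1) // 2  # <xy> candidates plus <(xy)> candidates
print(f"{length2:,}")               # 1,499,500

# Candidate sub-patterns of one length-100 pattern: all non-empty item subsets.
subpatterns = sum(comb(100, i) for i in range(1, 101))
print(subpatterns == 2**100 - 1, f"{subpatterns:.2e}")  # True 1.27e+30
```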
FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining
A divide-and-conquer approach
  Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns
  Mine each projected database to find its patterns
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, FreeSpan: Frequent pattern-projected sequential pattern mining. In KDD'00.

Sequence Database SDB
  < (bd) c b (ac) >
  < (bf) (ce) b (fg) >
  < (ah) (bf) a b f >
  < (be) (ce) d >
  < a (bd) b c b (ade) >

f_list: b:5, c:4, a:3, d:3, e:3, f:2
All seq. pat. can be divided into 6 subsets:
  Those containing item f
  Those containing e but no f
  Those containing d but no e nor f
  Those containing a but no d, e or f
  Those containing c but no a, d, e or f
  Those containing only item b
From FreeSpan to PrefixSpan: Why?
FreeSpan:
  Projection-based: No candidate sequence needs to be generated
  But, projection can be performed at any point in the sequence, and the projected sequences will not shrink much
PrefixSpan
  Projection-based
  But only prefix-based projection: fewer projections and quickly shrinking sequences
Prefix and Suffix (Projection)
<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>

Prefix  Suffix (Prefix-Based Projection)
<a>     <(abc)(ac)d(cf)>
<aa>    <(_bc)(ac)d(cf)>
<ab>    <(_c)(ac)d(cf)>
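A minimal sketch of prefix-based projection for a single sequence, assuming elements are tuples of alphabetically sorted single-character items and that the input is an original (not already projected) sequence. The names `project` and `fmt` are illustrative; the leading "_" marks items left over from the element where the prefix ended.

```python
def project(seq, prefix):
    """Suffix of seq with respect to prefix, or None if prefix does not occur.
    Each prefix element is matched to the earliest possible element of seq."""
    pos = -1
    for element in prefix:
        pos = next((j for j in range(pos + 1, len(seq))
                    if set(element) <= set(seq[j])), None)
        if pos is None:
            return None
    last_item = max(prefix[-1])
    # Items of the matched element that come after the prefix's last item.
    rest = tuple(x for x in seq[pos] if x > last_item)
    head = (("_",) + rest,) if rest else ()
    return head + tuple(seq[pos + 1:])

def fmt(seq):
    """Render a sequence in the slide notation, e.g. <(_c)(ac)d(cf)>."""
    parts = ["".join(e) if len(e) == 1 else "(" + "".join(e) + ")" for e in seq]
    return "<" + "".join(parts) + ">"

s = (("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f"))
for prefix in [(("a",),), (("a",), ("a",)), (("a",), ("b",))]:
    print(fmt(prefix), fmt(project(s, prefix)))
# <a> <(abc)(ac)d(cf)>
# <aa> <(_bc)(ac)d(cf)>
# <ab> <(_c)(ac)d(cf)>
```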
Mining Sequential Patterns by Prefix Projections
Step 1: find length-1 sequential patterns
  <a>, <b>, <c>, <d>, <e>, <f>
Step 2: the complete set of sequential patterns can be partitioned into 6 subsets:
  The ones having prefix <a>;
  The ones having prefix <b>;
  …
  The ones having prefix <f>

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Finding Seq. Patterns with Prefix <a>
Only need to consider projections w.r.t. <a>
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  Further partition into 6 subsets
    Having prefix <aa>
    …
    Having prefix <af>

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Completeness of PrefixSpan

SDB
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  Having prefix <a>: <a>-projected database
    <(abc)(ac)d(cf)>
    <(_d)c(bc)(ae)>
    <(_b)(df)cb>
    <(_f)cbc>
    Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      Having prefix <aa>: <aa>-proj. db
      …
      Having prefix <af>: <af>-proj. db
  Having prefix <b>: <b>-projected database
    …
  Having prefix <c>, …, <f>
    …
Efficiency of PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected databases
  Can be improved by bi-level projections
Physical projection vs. pseudo-projection
  Pseudo-projection may reduce the effort of projection when the projected database fits in main memory
Parallel projection vs. partition projection
  Partition projection may avoid the blowup of disk space
Speed-up by Pseudo-projection
Postfixes of sequences often appear repeatedly in recursive projected databases
When (projected) database can be held in main memory, use pointers to form projections
  Offset of the postfix

s = <a(abc)(ac)d(cf)>
  s|<a>: ( , 2), i.e. the postfix <(abc)(ac)d(cf)>
  s|<ab>: ( , 4), i.e. the postfix <(_c)(ac)d(cf)>
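A small sketch of the idea, under the assumption that a pseudo-projection is stored as a pair (sequence id, offset), where the offset is the 1-based position of the first postfix item when the items of s are counted left to right. This reading is inferred from the offsets 2 and 4 above; the function name `materialize` and the dictionary `db` are hypothetical.

```python
db = {10: (("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f"))}

def materialize(db, pointer):
    """Rebuild the postfix text from a (sequence id, offset) pair without
    ever having stored a copy of the postfix."""
    sid, offset = pointer
    out, seen = [], 0              # seen = items consumed before this element
    for element in db[sid]:
        skip = max(0, offset - 1 - seen)   # items of this element to skip
        kept = element[skip:]
        if kept:
            text = "".join(kept)
            if skip > 0:
                text = "_" + text  # postfix starts in the middle of an element
            out.append(text if len(text) == 1 else "(" + text + ")")
        seen += len(element)
    return "<" + "".join(out) + ">"

print(materialize(db, (10, 2)))  # <(abc)(ac)d(cf)>   i.e. s|<a>
print(materialize(db, (10, 4)))  # <(_c)(ac)d(cf)>    i.e. s|<ab>
```

Only the two integers are kept per projected sequence, which is why the technique pays off when the database itself stays in main memory.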
Pseudo-Projection vs. Physical Projection
Pseudo-projection avoids physically copying postfixes
  Efficient in running time and space when database can be held in main memory
  However, it is not efficient when database cannot fit in main memory
    Disk-based random accessing is very costly
Suggested Approach:
  Integration of physical and pseudo-projection
  Swapping to pseudo-projection when the data set fits in memory
PrefixSpan Is Faster than GSP and FreeSpan
[Figure: runtime (seconds) versus support threshold (%), from 0.00 to 3.00, for PrefixSpan-2, FreeSpan and GSP.]
Effect of Pseudo-Projection
[Figure: runtime (seconds) versus support threshold (%), from 0.20 to 0.60, for PrefixSpan-1, PrefixSpan-2, PrefixSpan-1 (Pseudo) and PrefixSpan-2 (Pseudo).]
Compact Representation of Frequent Itemsets
Some itemsets are redundant because they have identical support
TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
Number of frequent itemsets $= 3 \times \sum_{k=1}^{10} \binom{10}{k}$
Need a compact representation
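The formula can be evaluated directly. The sketch assumes, as the formula itself implies, that the support threshold is low enough for every non-empty subset of a 10-item group (A1..A10, B1..B10 or C1..C10) to be frequent, and that no itemset mixing groups is supported.

```python
from math import comb

groups = 3                                          # item groups A, B and C
per_group = sum(comb(10, k) for k in range(1, 11))  # non-empty subsets of 10 items
print(per_group, groups * per_group)                # 1023 3069
```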
An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: itemset lattice marking the maximal itemsets, the infrequent itemsets and the border between them.]
An itemset is closed if none of its immediate supersets has the same support as the itemset

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset  Support
{A}      4
{B}      5
{C}      3
{D}      4
{A,B}    4
{A,C}    2
{A,D}    3
{B,C}    3
{B,D}    4
{C,D}    3

Itemset    Support
{A,B,C}    2
{A,B,D}    3
{A,C,D}    2
{B,C,D}    3
{A,B,C,D}  2
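The closed itemsets of this five-transaction example can be found by brute force. The sketch below recomputes every support from the transactions and applies the definition literally; names such as `supports` and `is_closed` are illustrative.

```python
from itertools import combinations

transactions = {
    1: {"A", "B"},
    2: {"B", "C", "D"},
    3: {"A", "B", "C", "D"},
    4: {"A", "B", "D"},
    5: {"A", "B", "C", "D"},
}
items = sorted(set().union(*transactions.values()))

def support(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions.values())

supports = {
    frozenset(c): support(set(c))
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
}

def is_closed(itemset):
    """Closed: every immediate superset has strictly smaller support."""
    return all(supports[itemset | {x}] < supports[itemset]
               for x in items if x not in itemset)

closed = sorted("".join(sorted(s)) for s in supports if is_closed(s))
print(closed)  # ['AB', 'ABCD', 'ABD', 'B', 'BCD', 'BD']
```

For instance {A} is not closed because {A,B} has the same support of 4, while {B} is closed because every superset of {B} has support below 5.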
TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

Itemset lattice annotated with transaction IDs (a dash marks itemsets not supported by any transactions):
null
A: 124   B: 123   C: 1234   D: 245   E: 345
AB: 12   AC: 124   AD: 24   AE: 4   BC: 123   BD: 2   BE: 3   CD: 24   CE: 34   DE: 45
ABC: 12   ABD: 2   ABE: -   ACD: 24   ACE: 4   ADE: 4   BCD: 2   BCE: 3   BDE: -   CDE: 4
ABCD: 2   ABCE: -   ABDE: -   ACDE: 4   BCDE: -
ABCDE: -
Sequential Pattern Mining: Example

Object  Timestamp  Events
A       1          …
A       2          2, 3
A       3          5
B       1          1, 2
B       2          2, 3, 4
C       1          1, 2
C       2          2, 3, 4
C       3          …
D       1          2
D       2          3, 4
D       3          4, 5
E       1          1, 3
E       2          2, 4, 5

Minsup = …
Examples of Frequent Subsequences:
  < {2,3} >        s=60%
  < {2,4} >        s=80%
  < {3} {5} >      s=80%
  < {2} {2} >      s=60%
  < {1} {2,3} >    s=60%
  < {2} {2,3} >    s=60%
  …
Given n events: i1, i2, i3, …, in
Candidate 1-subsequences:
  <{i1}>, <{i2}>, <{i3}>, …, <{in}>
Candidate 2-subsequences:
  <{i1, i2}>, <{i1, i3}>, …, <{i1} {i1}>, <{i1} {i2}>, …, <{in-1} {in}>
Candidate 3-subsequences:
  <{i1, i2, i3}>, <{i1, i2, i4}>, …, <{i1, i2} {i1}>, <{i1, i2} {i2}>, …,
  <{i1} {i1, i2}>, <{i1} {i1, i3}>, …, <{i1} {i1} {i1}>, <{i1} {i1} {i2}>, …
Generalized Sequential Pattern (GSP)
Step 1:
  Make the first pass over the sequence database D to yield all the 1-element frequent sequences
Step 2:
  Repeat until no new frequent sequences are found
    Candidate Generation:
      Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items
      Prune candidate k-sequences that contain infrequent (k-1)-subsequences
    Support Counting:
      Make a new pass over the sequence database D to find the support for these candidate sequences
    Candidate Elimination:
      Eliminate candidate k-sequences whose actual support is less than minsup
Base case (k=2):
  Merging two frequent 1-sequences <{i1}> and <{i2}> will produce two candidate 2-sequences: <{i1} {i2}> and <{i1 i2}>
General case (k>2):
  A frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2
  The resulting candidate after merging is given by the sequence w1 extended with the last event of w2.
    If the last two events in w2 belong to the same element, then the last event in w2 becomes part of the last element in w1
    Otherwise, the last event in w2 becomes a separate element appended to the end of w1
Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> will produce the candidate sequence <{1} {2 3} {4 5}> because the last two events in w2 (4 and 5) belong to the same element
Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> will produce the candidate sequence <{1} {2 3} {4} {5}> because the last two events in w2 (4 and 5) do not belong to the same element
We do not have to merge the sequences w1 = <{1} {2 6} {4}> and w2 = <{1} {2} {4 5}> to produce the candidate <{1} {2 6} {4 5}> because if the latter is a viable candidate, then it can be obtained by merging w1 with <{2 6} {4 5}>
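The general-case merge rule can be sketched as follows, assuming sequences are tuples of tuples with the events inside each element kept in sorted order. The helper names `drop_first`, `drop_last` and `merge` are illustrative, and the base case k=2 (which yields two candidates) is deliberately left out.

```python
def drop_first(w):
    """Remove the first event of the first element (drop the element if emptied)."""
    head = w[0][1:]
    return ((head,) if head else ()) + w[1:]

def drop_last(w):
    """Remove the last event of the last element (drop the element if emptied)."""
    tail = w[-1][:-1]
    return w[:-1] + ((tail,) if tail else ())

def merge(w1, w2):
    """Candidate k-sequence from two (k-1)-sequences, or None if they do not join."""
    if drop_first(w1) != drop_last(w2):
        return None
    last = w2[-1][-1]
    if len(w2[-1]) > 1:
        # Last two events of w2 share an element: extend the last element of w1.
        return w1[:-1] + (w1[-1] + (last,),)
    # Otherwise the last event of w2 becomes a new element after w1.
    return w1 + ((last,),)

w1 = ((1,), (2, 3), (4,))
print(merge(w1, ((2, 3), (4, 5))))       # ((1,), (2, 3), (4, 5))
print(merge(w1, ((2, 3), (4,), (5,))))   # ((1,), (2, 3), (4,), (5,))
print(merge(((1,), (2, 6), (4,)), ((2, 6), (4, 5))))  # ((1,), (2, 6), (4, 5))
```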
Timing Constraints (I)
{A B} {C} {D E}
  gap between consecutive elements: > ng and <= xg
  span from the first to the last element: <= ms
xg: max-gap
ng: min-gap
ms: maximum span

xg = 2, ng = 0, ms = 4
Data sequence Subsequence Contain?
< {2,4} {3,5,6} {4,7} {4,5} {8} > < {6} {5} > Yes
< {1} {2} {3} {4} {5}> < {1} {4} > No
< {1} {2,3} {3,4} {4,5}> < {2} {3} {5} > Yes
< {1,2} {3} {2,3} {3,4} {2,4} {4,5}> < {1,2} {5} > No
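The table can be reproduced with a small backtracking check. The sketch assumes that the timestamp of each element is simply its position (1, 2, 3, …) in the data sequence; the function name `contains_timed` is illustrative.

```python
def contains_timed(data, pattern, xg, ng, ms):
    """True if pattern occurs in data with every gap g between consecutive
    matched elements satisfying ng < g <= xg and the total span <= ms."""
    def search(k, prev, first):
        if k == len(pattern):
            return True
        for t, element in enumerate(data, start=1):
            if not set(pattern[k]) <= set(element):
                continue
            if prev is not None and not (ng < t - prev <= xg):
                continue
            start = t if first is None else first
            if t - start > ms:
                continue
            if search(k + 1, t, start):  # backtrack over alternative matches
                return True
        return False
    return search(0, None, None)

rows = [
    (((2, 4), (3, 5, 6), (4, 7), (4, 5), (8,)), ((6,), (5,))),
    (((1,), (2,), (3,), (4,), (5,)), ((1,), (4,))),
    (((1,), (2, 3), (3, 4), (4, 5)), ((2,), (3,), (5,))),
    (((1, 2), (3,), (2, 3), (3, 4), (2, 4), (4, 5)), ((1, 2), (5,))),
]
for data, sub in rows:
    print(contains_timed(data, sub, xg=2, ng=0, ms=4))  # True, False, True, False
```

Backtracking matters here: a greedy earliest match can fail a gap constraint even though a later occurrence of the same element would satisfy it.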
Mining Sequential Patterns with Timing Constraints
Approach 1:
Mine sequential patterns without timing constraints
Postprocess the discovered patterns
Approach 2:
Modify GSP to directly prune candidates that violate timing constraints
Question: Does Apriori principle still hold?
Suppose:
  xg = 1 (max-gap)
  ng = 0 (min-gap)
  ms = 5 (maximum span)
  minsup = 60%

Object  Timestamp  Events
A       1          …
A       2          2, 3
A       3          5
B       1          1, 2
B       2          2, 3, 4
C       1          1, 2
C       2          2, 3, 4
C       3          …
D       1          2
D       2          3, 4
D       3          4, 5
E       1          1, 3
E       2          2, 4, 5

<{2} {5}> support = 40%
but
<{2} {3} {5}> support = 60%

Problem exists because of max-gap constraint
No such problem if max-gap is infinite
s is a contiguous subsequence of w = <e1> <e2> … <ek> if any of the following conditions hold:
  1. s is obtained from w by deleting an item from either e1 or ek
  2. s is obtained from w by deleting an item from any element ei that contains at least 2 items
  3. s is a contiguous subsequence of s' and s' is a contiguous subsequence of w (recursive definition)
Examples: s = < {1} {2} >
  is a contiguous subsequence of < {1} {2 3} >, < {1 2} {2} {3} >, and < {3 4} {1 2} {2 3} {4} >
  is not a contiguous subsequence of < {1} {3} {2} > and < {2} {1} {3} {2} >
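The recursive definition translates into a short search over single-item deletions. This sketch reads condition 2 as applying to any element with two or more items (so that a deletion never empties a middle element), treats a sequence as a contiguous subsequence of itself to terminate the recursion, and uses illustrative names throughout.

```python
def contiguous_children(w):
    """All sequences obtained from w by one allowed deletion: an item of the
    first or last element, or an item of any element holding 2 or more items."""
    out = set()
    for i, element in enumerate(w):
        at_edge = i in (0, len(w) - 1)
        if at_edge or len(element) >= 2:
            for x in element:
                rest = tuple(y for y in element if y != x)
                out.add(w[:i] + ((rest,) if rest else ()) + w[i + 1:])
    return out

def size(w):
    return sum(len(element) for element in w)

def is_contiguous_subsequence(s, w):
    """True if s can be reached from w by a chain of allowed deletions."""
    if s == w:
        return True
    if size(w) <= size(s):
        return False
    return any(is_contiguous_subsequence(s, c) for c in contiguous_children(w))

s = ((1,), (2,))
for w in [((1,), (2, 3)), ((1, 2), (2,), (3,)), ((3, 4), (1, 2), (2, 3), (4,)),
          ((1,), (3,), (2,)), ((2,), (1,), (3,), (2,))]:
    print(is_contiguous_subsequence(s, w))  # True, True, True, False, False
```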
Without maxgap constraint:
  A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent
With maxgap constraint:
  A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent
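{A B} {C} {D E}
  gap between consecutive elements: > ng and <= xg
  time spread of the events within one element: <= ws
  span from the first to the last element: <= ms
xg: max-gap
ng: min-gap
ws: window size
ms: maximum span

xg = 2, ng = 0, ws = 1, ms = 5

Data sequence Subsequence Contain?
< {2,4} {3,5,6} {4,7} {4,6} {8} > < {3} {5} > No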
< {1} {2} {3} {4} {5}> < {1,2} {3} > Yes
< {1,2} {2,3} {3,4} {4,5}> < {1,2} {3,4} > Yes
Given a candidate pattern: <{a, c}>
Any data sequences that contain
  <… {a c} … >,
  <… {a} … {c} …> (where time({c}) - time({a}) ≤ ws)
  <… {c} … {a} …> (where time({a}) - time({c}) ≤ ws)
will contribute to the support count of candidate pattern
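A sketch of the window-size test for a single pattern element, assuming each data sequence is given as a list of (timestamp, items) pairs with numeric timestamps. The function name `element_supported` and the sample data are hypothetical.

```python
def element_supported(events, element, ws):
    """True if all items of `element` occur within some time window of
    width ws, possibly spread over several timestamped events."""
    for start, _ in events:
        window_items = set()
        for t, items in events:
            if start <= t <= start + ws:
                window_items |= set(items)
        if set(element) <= window_items:
            return True
    return False

pattern_element = ("a", "c")
print(element_supported([(1, {"a", "c"})], pattern_element, ws=0))          # True
print(element_supported([(1, {"a"}), (2, {"c"})], pattern_element, ws=1))   # True
print(element_supported([(1, {"c"}), (2, {"a"})], pattern_element, ws=1))   # True
print(element_supported([(1, {"a"}), (3, {"c"})], pattern_element, ws=1))   # False
```

With ws = 0 the test reduces to the ordinary requirement that a and c occur in the same element.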
In some domains, we may have only one very long time series
  Example:
    monitoring network traffic events for attacks
    monitoring telecommunication alarm signals
Goal is to find frequent sequences of events in the time series
  This problem is also known as frequent episode mining
[Figure: one long timeline of events E1 to E5.]
Pattern: <E1> <E3>
General Support Counting Schemes
Assume:
xg = 2 (max-gap)
ng = 0 (min-gap)
ws = 0 (window size)
ms = 2 (maximum span)