Fast Vertical Mining Using Diffsets
Mohammed J. Zaki, Karam Gouda
Amir Epstein
Outline
• Introduction
• Problem Setting and Notations
• Equivalence Classes & Diffsets
• Algorithms for Mining Frequent, Closed and Maximal Patterns
• Experimental Results
• Conclusions
Introduction
• Horizontal methods (Most are Apriori variants)
• Mining Maximal Frequent Patterns (All-MFS, Max Miner, Depth Project, FP-Growth)
• Mining Closed Sets (A-Close, Closet, Charm)
• Vertical Methods
• Vertical Approach Problems
• Diffsets
Notations
• $I$ – set of items
• $T$ – database of transactions
• Tid – transaction identifier
• Itemset – a set $X \subseteq I$
• Tidset – a set $Y \subseteq T$
• $k$-itemset – an itemset with $k$ items
• Support of an itemset $X$, denoted $\sigma(X)$ – the number of transactions in which $X$ occurs as a subset
Notation
• Frequent itemset – if $\sigma(X) \ge \mathit{min\_sup}$
• Powerset $P(I)$ – search space enumeration
• Maximal frequent itemset – if it is not a subset of any other frequent itemset
• Closed frequent itemset $X$ – if there does not exist a superset $Y \supset X$ with $\sigma(X) = \sigma(Y)$
• Closure of an itemset $X$, denoted $c(X)$ – the smallest closed set that contains $X$
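The definitions above can be checked by brute force on a toy database. The sketch below is illustrative only: the six-transaction database over the items A, C, D, T, W, the absolute threshold `MIN_SUP = 3`, and the names `support` and `classify` are assumptions, and exhaustive enumeration is used for clarity rather than as a mining method.

```python
from itertools import combinations

# Assumed toy database (tid -> items) and absolute minimum support.
DB = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}
MIN_SUP = 3


def support(itemset, db=DB):
    """sigma(X): number of transactions that contain X as a subset."""
    return sum(1 for items in db.values() if itemset <= items)


def classify(db=DB, min_sup=MIN_SUP):
    """Return (frequent, closed, maximal) by exhaustive enumeration."""
    items = sorted(set().union(*db.values()))
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            s = support(x, db)
            if s >= min_sup:
                frequent[x] = s
    # Closed: no proper superset has the same support. A superset with
    # equal support is itself frequent, so checking `frequent` suffices.
    closed = {
        x for x, s in frequent.items()
        if not any(x < y and s == frequent[y] for y in frequent)
    }
    # Maximal: no proper superset is frequent.
    maximal = {x for x in frequent if not any(x < y for y in frequent)}
    return frequent, closed, maximal
```

Calling `classify()` on this toy data shows the usual nesting: every maximal itemset is closed, and every closed itemset is frequent.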
The Problem
• Find all frequent itemsets, i.e., all itemsets whose support is at least the minimum support
Database Example
Frequent, Closed and Maximal Itemsets
Data Formats
Equivalence Classes
• Define a function $p : P(I) \times N \to P(I)$, where $p(X, k) = X[1:k]$, the $k$-length prefix of $X$
• Define an equivalence relation $\theta_k$ (prefix-based): $\forall X, Y \in P(I),\ X \equiv_{\theta_k} Y \iff p(X, k) = p(Y, k)$
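Example
(Figure: prefix-based equivalence classes over the items A, C, D, T, W. Each node is a prefix followed by the set of items that can extend it.)
Level 0: {} {A,C,D,T,W}
Level 1: A {C,D,T,W}, C {D,T,W}, D {T,W}, T {W}, W
Level 2: AC {D,T,W}, AD {T,W}, AT {W}, AW, CD {T,W}, CT {W}, CW, DT {W}, DW, TW
Level 3: ACD {T,W}, ACT {W}, ACW, ADT, ADW, ATW, CDT {W}, CDW, CTW, DTW
Level 4: ACDT {W}, ACDW, ACTW, ADTW, CDTW
Level 5: ACDTW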
Compute Subset Class
• Let $P = \{X_1, X_2, \ldots, X_n\}$
• Perform intersection of $PX_i$ with all $PX_j$ with $j > i$ to obtain a new class $PX_i$ with elements $X_j$, where $PX_iX_j$ is frequent
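The subset-class step can be sketched with plain Python sets. This is a minimal illustration, not the authors' implementation: the function name `compute_subset_class`, the `(itemset, tidset)` pair layout, the `ROOT` class and the threshold 3 are all assumptions.

```python
def compute_subset_class(prefix_class, i, min_sup):
    """Build the class of PX_i from the members of the class of P.

    prefix_class: list of (itemset, tidset) pairs, the members PX_1..PX_n
    i:            index of the member PX_i to extend
    """
    x_i, t_i = prefix_class[i]
    new_class = []
    for x_j, t_j in prefix_class[i + 1:]:   # only members with j > i
        tids = t_i & t_j                    # t(PX_iX_j) = t(PX_i) ∩ t(PX_j)
        if len(tids) >= min_sup:            # keep only frequent extensions
            new_class.append((x_i | x_j, tids))
    return new_class


# Assumed root class (empty prefix): single items with their tidsets.
ROOT = [
    (frozenset("A"), {1, 3, 4, 5}),
    (frozenset("C"), {1, 2, 3, 4, 5, 6}),
    (frozenset("D"), {2, 4, 5, 6}),
    (frozenset("T"), {1, 3, 5, 6}),
    (frozenset("W"), {1, 2, 3, 4, 5}),
]
```

With `min_sup = 3`, `compute_subset_class(ROOT, 0, 3)` yields the members AC, AT and AW; AD is dropped because its tidset has only two entries.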
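Tidset Intersections (example)
(Figure: tidsets of the itemsets, obtained level by level by intersecting the tidsets of class members.)
Items: A = 1 3 4 5; C = 1 2 3 4 5 6; D = 2 4 5 6; T = 1 3 5 6; W = 1 2 3 4 5
2-itemsets: AC = 1 3 4 5; AD = 4 5; AT = 1 3 5; AW = 1 3 4 5; CD = 2 4 5 6; CT = 1 3 5 6; CW = 1 2 3 4 5; DT = 5 6; DW = 2 4 5; TW = 1 3 5
3-itemsets: ACT = 1 3 5; ACW = 1 3 4 5; ATW = 1 3 5; CDW = 2 4 5; CTW = 1 3 5
4-itemsets: ACTW = 1 3 5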
Diffsets
Difference of the prefix tidset and a class member tidset
• Consider a class with prefix $P$
• Let $t(X)$ denote the tidset of element $X$
• Let $d(X)$ denote the diffset of element $X$ with respect to the prefix tidset
• Let $PX$ and $PY$ be class members of $P$
• By the definition of support, $t(PX) \subseteq t(P)$ and $t(PY) \subseteq t(P)$
Diffsets
• Then $t(PXY) = t(PX) \cap t(PY)$
• Define diffset $d(PX) = t(P) - t(X)$
• Then $\sigma(PX) = \sigma(P) - |d(PX)|$
Diffsets
• How to calculate $\sigma(PXY)$ using $d(PX)$ and $d(PY)$?
– $\sigma(PXY) = \sigma(PX) - |d(PXY)|$
– $d(PXY) = t(PX) - t(PXY)$
– $d(PXY) = t(PX) - t(PXY) = t(PX) - t(PY) = [t(P) - t(Y)] - [t(P) - t(X)] = d(PY) - d(PX)$
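The identity $d(PXY) = d(PY) - d(PX)$ can be sanity-checked with Python sets. The sketch below is an illustration under assumed data: the prefix tidset `T_P` and the member tidsets `T_X`, `T_Y` are made up for the check, and the function names are hypothetical.

```python
def diffset(t_prefix, t_member):
    """d(PX) = t(P) - t(X): tids of the prefix that the member misses."""
    return t_prefix - t_member


def extend_with_diffsets(sup_px, d_px, d_py):
    """Return (d(PXY), sigma(PXY)) using only the diffsets of PX and PY."""
    d_pxy = d_py - d_px                 # d(PXY) = d(PY) - d(PX)
    return d_pxy, sup_px - len(d_pxy)   # sigma(PXY) = sigma(PX) - |d(PXY)|


# Assumed example: prefix tidset and the tidsets of two members X and Y.
T_P = {1, 2, 3, 4, 5, 6}
T_X = {1, 3, 4, 5}
T_Y = {2, 4, 5, 6}

D_PX = diffset(T_P, T_X)
D_PY = diffset(T_P, T_Y)
D_PXY, SUP_PXY = extend_with_diffsets(len(T_P) - len(D_PX), D_PX, D_PY)
```

The diffset route agrees with the tidset route: `D_PXY` equals `t(PX) - t(PXY)` and `SUP_PXY` equals the size of `T_X & T_Y`.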
Example
(Figure: Venn diagram of $t(P)$, $t(X)$ and $t(Y)$, with the regions $d(PX)$, $d(PY)$, $t(PXY)$ and $d(PXY)$ marked.)
Diffset Intersections (example)
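(Figure: the same lattice stored as diffsets; "{}" denotes an empty diffset.)
TIDSET database: A = 1 3 4 5; C = 1 2 3 4 5 6; D = 2 4 5 6; T = 1 3 5 6; W = 1 2 3 4 5
DIFFSET database: A = 2 6; C = {}; D = 1 3; T = 2 4; W = 6
2-itemsets: AC = {}; AD = 1 3; AT = 4; AW = {}; CD = 1 3; CT = 2 4; CW = 6; DT = 2 4; DW = 6; TW = 6
3-itemsets: ACT = 4; ACW = {}; ATW = {}; CDW = 6; CTW = 6
4-itemsets: ACTW = {}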
Diffset Example
• Diffset calculation
– $d(AD) = t(A) - t(D) = \{1,3\}$
– $d(AD) = d(D) - d(A) = \{1,3\} - \{2,6\} = \{1,3\}$
• Support calculation
– $\sigma(AD) = \sigma(A) - |d(AD)| = 4 - 2 = 2$
Diffset Example
• Database Size
– Tidset database size = 23
– Diffset database size = 7
• Total Size
– Tidset database size = 76
– Diffset database size = 22
• Size By Length
k (itemset length)   Avg. tidset length   Avg. diffset length
2                    3.8                  1
3                    3.2                  0.6
4                    3                    0
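The sizes quoted above can be reproduced from the item tidsets of the running example. A minimal sketch, assuming the five item tidsets and a hand-transcribed list of the itemsets that appear in the example lattice (`LATTICE`); the helper names are hypothetical, and each diffset is taken relative to the itemset's prefix (the itemset without its last item).

```python
from functools import reduce

ITEM_TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
ALL_TIDS = set(range(1, 7))

# Itemsets shown in the example lattice (transcribed by hand).
LATTICE = [
    "A", "C", "D", "T", "W",
    "AC", "AD", "AT", "AW", "CD", "CT", "CW", "DT", "DW", "TW",
    "ACT", "ACW", "ATW", "CDW", "CTW",
    "ACTW",
]


def tidset(itemset):
    """t(X): intersection of the item tidsets (all tids for the empty set)."""
    return reduce(set.intersection,
                  (ITEM_TIDSETS[i] for i in itemset), set(ALL_TIDS))


def diffset(itemset):
    """d(X): tids of the prefix X[:-1] that do not contain X."""
    return tidset(itemset[:-1]) - tidset(itemset)


def sizes(itemsets):
    """Total storage (number of tids) as (tidset size, diffset size)."""
    return (sum(len(tidset(x)) for x in itemsets),
            sum(len(diffset(x)) for x in itemsets))


DATABASE_SIZES = sizes([x for x in LATTICE if len(x) == 1])
TOTAL_SIZES = sizes(LATTICE)
```

`DATABASE_SIZES` covers only the single items, while `TOTAL_SIZES` adds every intermediate itemset of the lattice.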
Experimental Study
• Compare diffsets versus tidsets in terms of database sizes
• Method
– Real datasets (usually dense)
– Synthetic datasets (sparse)
Size Of Database
Average Diffset / Tidset Size By length
Average Diffset / Tidset Size
Database      Min_sup (%)   Max Length   Avg. Diffset Size   Avg. Tidset Size   Reduction Ratio
chess         0.5           16           26                  1820               70
connect       90            12           143                 62204              435
mushroom      5             17           60                  622                10
Pumsb*        35            15           301                 18977              63
pumsb         90            8            330                 45036              136
T10I4D100K    0.025         11           14                  86                 6
T20I16D100K   0.1           14           31                  230                11
T40I10D100K   0.5           18           96                  755                8
When To Use diffsets
• Usually there is a cross-over point
• For dense datasets, start with the diffset format
• For sparse datasets, start with the tidset format
Reduction Ratio
• Let $P$ be a class
• Let $PX$ and $PY$ be class members with tidsets $t(PX)$ and $t(PY)$
• Consider the new itemset $PXY$ in class $PX$
• $PXY$ can be stored as $t(PXY)$ or $d(PXY)$
• Definition: reduction ratio $r = |t(PXY)| / |d(PXY)|$
• Benefit if $r \ge 1$, or $|t(PXY)| / |d(PXY)| \ge 1$
• Since $d(PXY) = t(PX) - t(PY)$: $|t(PXY)| / |t(PX) - t(PY)| \ge 1$
Reduction Ratio
• Or $|t(PXY)| / (|t(PX)| - |t(PXY)|) \ge 1$
• $|t(PX)| / |t(PXY)| \le 2$
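The condition $|t(PX)| / |t(PXY)| \le 2$ gives a cheap test for choosing a representation. The sketch below is illustrative; `reduction_ratio` and `diffset_pays_off` are hypothetical names, and an empty diffset is treated as an infinite ratio by assumption.

```python
def reduction_ratio(t_px, t_py):
    """r = |t(PXY)| / |d(PXY)| for the new itemset PXY."""
    t_pxy = t_px & t_py
    d_pxy = t_px - t_py
    if not d_pxy:                 # empty diffset: nothing to store
        return float("inf")
    return len(t_pxy) / len(d_pxy)


def diffset_pays_off(t_px, t_py):
    """True when |t(PX)| / |t(PXY)| <= 2, i.e. when r >= 1."""
    t_pxy = t_px & t_py
    return len(t_px) <= 2 * len(t_pxy)
```

On dense data the intersection keeps most of `t(PX)`, so the test tends to succeed; on sparse data the intersection shrinks quickly and the tidset is usually the smaller choice.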
Compressed Bitvectors
• Classical way: run-length encoding (RLE) – not appropriate for association mining
• Skinning encoding scheme (used by Viper)
– Worst-case compression ratio asymptotically reaches 2.91
– Best-case compression ratio asymptotically reaches 32
GenMax: Mining Maximal Frequent Itemsets
• Uses backtracking search technique
• Optimizations
– Initially sort items in increasing order of their combine-set size and then in increasing order of support (i. first explore items with small combine sets, ii. remove a node as early as possible from the search tree)
– Superset checking
• More Optimizations
– Progressive focusing to improve superset checking
– Vertical database format to improve frequency checking using tidsets, which is further improved by diffsets
• Memory Handling
– Store at most k = m + l tidsets (diffsets) in memory, where m is the length of the longest combine-set and l is the length of the longest maximal itemset
GenMax (Dataset $T$):
Calculate $F_1$. Calculate $F_2$.
For each item $i \in F_1$ calculate $c(i)$, its combine-set.
Sort items in $F_1$ in INCREASING cardinality of $c(i)$ and then INCREASING $\sigma(i)$.
Sort each $c(i)$ in order of $F_1$.
$c(i) = c(i) - \{j : j < i \text{ in sorted order of } F_1\}$.
$M = \{\}$; // Maximal Frequent Itemsets.
for each $i \in F_1$ do
    $Z = \{x \in M : i \in x\}$
    for each $j \in c(i)$ do
        $H = \{x : x \text{ is } j \text{ or } x \text{ follows } j \text{ in } c(i)\}$
        if $H$ has a superset in $Z$ then break
        $I = \{i, j\}$
        $X = c(i) \cap c(j)$; $d(I) = t(i) - t(j)$
        $Y = \{x \in Z : j \in x\}$
        Extend($I$, $X$, $Y$)
        $Z = Z \cup Y$
    $M = M \cup Z$
Return $M$
// $I$ is the itemset to be extended, $X$ is the set of items
// that can be added to $I$, i.e., the combine set, and $Y$
// is the set of relevant maximal itemsets found so far,
// i.e., all maximal itemsets which contain $I$.
Procedure Extend($I$, $X$, $Y$)
extendflg = 0
for each $j \in X$ do
    if ($|Y| > 0$) then
        $G = \{x : x \text{ is } j \text{ or } x \text{ follows } j \text{ in } X\}$
        if $G$ has a superset in $Y$ then
            extendflg = 1; break;
    $NewI = I \cup \{j\}$; $d(NewI) = d(j) - d(I)$
    if ($NewI$ is frequent) then
        $NewX = X \cap c(j)$
        extendflg = 1
        if ($NewX = \emptyset$) then
            $Y = Y \cup \{NewI\}$
        else
            $NewY = \{x \in Y : j \in x\}$
            Extend($NewI$, $NewX$, $NewY$)
            $Y = Y \cup NewY$
if (extendflg = 0 and $|Y| = 0$) then
    $Y = Y \cup \{I\}$
dEclat: Mining All Frequent Itemsets
• Performs bottom-up search
• The equivalence class lattice is traversed in BFS order
• Input: class members
• Frequent itemsets are generated by computing diffsets for all distinct pairs of itemsets and checking the support of the resulting itemset
• Stores in memory intermediate diffsets (tidsets) of at most two levels
DiffEclat($[P]$):
for all $X_i \in [P]$ do
    for all $X_j \in [P]$, with $j > i$ do
        $R = X_i \cup X_j$;
        $d(R) = d(X_j) - d(X_i)$;
        if $\sigma(R) \ge \mathit{min\_sup}$ then
            $T_i = T_i \cup \{R\}$; // $T_i$ initially empty
for all $T_i \ne \emptyset$ do DiffEclat($T_i$);
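The DiffEclat recursion can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the names `diff_eclat` and `mine`, the triple layout `(itemset, diffset, support)`, the sample `ITEM_TIDSETS` and the threshold 3 are assumptions. For brevity the sketch recurses into each new class as soon as it is built instead of collecting all classes first, which visits the same itemsets.

```python
def diff_eclat(members, min_sup, out):
    """members: list of (itemset, diffset, support) for one class."""
    for i, (x_i, d_i, s_i) in enumerate(members):
        out[x_i] = s_i
        new_class = []                       # T_i, initially empty
        for x_j, d_j, _ in members[i + 1:]:  # all X_j with j > i
            d_r = d_j - d_i                  # d(R) = d(X_j) - d(X_i)
            s_r = s_i - len(d_r)             # sigma(R) = sigma(X_i) - |d(R)|
            if s_r >= min_sup:
                new_class.append((x_i | x_j, d_r, s_r))
        if new_class:
            diff_eclat(new_class, min_sup, out)


def mine(item_tidsets, min_sup):
    """Mine all frequent itemsets, starting from item tidsets."""
    # Root diffsets are taken relative to the union of all item tidsets.
    all_tids = set().union(*item_tidsets.values())
    root = [
        (frozenset([item]), all_tids - tids, len(tids))
        for item, tids in sorted(item_tidsets.items())
        if len(tids) >= min_sup
    ]
    out = {}
    diff_eclat(root, min_sup, out)
    return out


# Assumed sample data: item -> tidset.
ITEM_TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
```

`mine(ITEM_TIDSETS, 3)` returns a dictionary from each frequent itemset to its support; only diffsets are intersected after the root level.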
dCharm: Mining Frequent Closed Itemsets
• Performs bottom-up search
• Eliminates branches and grows itemsets using subset relationships
Subset Relationships
Theorem: Let $X_i \times d(X_i)$ and $X_j \times d(X_j)$ be any two members of class $P$, with $X_i \le_f X_j$, where $f$ is a total order (e.g., lexicographic or support-based). The following four properties hold:
1. If $d(X_i) = d(X_j)$, then $c(X_i) = c(X_j) = c(X_i \cup X_j)$
2. If $d(X_i) \supset d(X_j)$, then $c(X_i) \ne c(X_j)$, but $c(X_i) = c(X_i \cup X_j)$
3. If $d(X_i) \subset d(X_j)$, then $c(X_i) \ne c(X_j)$, but $c(X_j) = c(X_i \cup X_j)$
4. If $d(X_i) \ne d(X_j)$, then $c(X_i) \ne c(X_j) \ne c(X_i \cup X_j)$
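The four cases of the theorem map directly onto Python's set comparisons. A small sketch, with the hypothetical helper `charm_case` returning the number of the property that applies to a pair of diffsets:

```python
def charm_case(d_i, d_j):
    """Return which of the four properties applies to d(X_i), d(X_j)."""
    if d_i == d_j:
        return 1    # c(X_i) = c(X_j) = c(X_i ∪ X_j)
    if d_i > d_j:   # proper superset
        return 2    # c(X_i) = c(X_i ∪ X_j)
    if d_i < d_j:   # proper subset
        return 3    # c(X_j) = c(X_i ∪ X_j)
    return 4        # incomparable: all three closures differ
```

A larger diffset means a smaller tidset, so case 2 says that $X_i$ never occurs without $X_j$, which is why $X_i$ can be replaced by $X_i \cup X_j$.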
DiffCharm($[P]$):
for all $X_i \in [P]$ do
    $X = X_i$
    for all $X_j \in [P]$, with $j > i$ do
        $R = X \cup X_j$
        $d(R) = d(X_j) - d(X_i)$
        if $\sigma(R) < \mathit{min\_sup}$ then continue
        if $d(X_i) = d(X_j)$ then Remove $X_j$ from Nodes; Replace all $X_i$ with $R$
        if $d(X_i) \supset d(X_j)$ then Replace all $X_i$ with $R$
        if $d(X_i) \subset d(X_j)$ then Remove $X_j$ from Nodes; Add $R$ to $NewN$
        if $d(X_i) \ne d(X_j)$ then Add $R$ to $NewN$
    if $NewN \ne \emptyset$ then DiffCharm($NewN$)
Optimized Initialization
• Computation of $F_2$
• Let $n$ be the number of frequent items
• Let $l$ be the average tidset size
• Amount of data read is $l \cdot n(n-1)/2$
• Number of intersections is $n(n-1)/2$
• In the horizontal approach the amount of data read is $nl$
Improvement
• Compute frequent itemsets of length 2
• Combine items $I_1$ and $I_2$ only if $I_1 \cup I_2$ is frequent
• Now the number of intersections in practice is closer to $O(n)$ rather than $O(n^2)$
• Computation of frequent itemsets of length 2
– Perform vertical to horizontal transformation
– Update the count of pairs of items
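The two steps for computing the frequent 2-itemsets can be sketched as follows. This is an illustration under assumptions: the function name `frequent_pairs`, the sample `ITEM_TIDSETS` and the threshold 3 are not from the slides.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(item_tidsets, min_sup):
    """Count item pairs after a vertical-to-horizontal transformation."""
    # Step 1: vertical (item -> tids) to horizontal (tid -> items).
    horizontal = {}
    for item, tids in item_tidsets.items():
        for tid in tids:
            horizontal.setdefault(tid, []).append(item)
    # Step 2: update the count of every pair of items in each transaction.
    counts = Counter()
    for items in horizontal.values():
        counts.update(combinations(sorted(items), 2))
    return {pair: c for pair, c in counts.items() if c >= min_sup}


# Assumed sample data: item -> tidset.
ITEM_TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
```

Each transaction is read once, so the data read is proportional to $nl$ instead of one intersection per pair of items; only the pairs returned here need to be combined later.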
Experimental Results
• Times include all costs, including horizontal to vertical database conversion
• Method
– Real datasets (usually dense)
– Synthetic datasets (sparse)
Database Characteristics
Database      # Items   Avg. Trans. Length   # Records
chess         76        37                   3,196
connect       130       43                   67,557
mushroom      120       23                   8,124
Pumsb*        7117      50                   49,046
pumsb         7117      74                   49,046
T10I4D100K    1000      10                   100,000
T20I16D100K   1000      40                   100,000
Length Of the Longest Itemset
Cardinality Of F.I , C.F.I and M.F.I
Improvements using Diffsets
Mining Frequent Itemsets
Mining Closed Itemsets
Mining Maximal Itemsets
Conclusions
• Diffsets dramatically cut down the size of memory required to store intermediate results
• Diffsets increase performance significantly when incorporated into previous vertical mining methods
• Diffsets can deliver over an order of magnitude performance improvement over the best previous methods