Fast Vertical Mining Using Diffsets

Mohammed J. Zaki, Karam Gouda

Amir Epstein

Outline

• Introduction
• Problem Setting and Notations
• Equivalence Classes & Diffsets
• Algorithms for Mining Frequent, Closed and Maximal Patterns
• Experimental Results
• Conclusions

Introduction

• Horizontal methods (most are Apriori variants)
• Mining maximal frequent patterns (All-MFS, Max Miner, Depth Project, FP-Growth)
• Mining closed sets (A-Close, Closet, Charm)
• Vertical methods
• Vertical approach problems
• Diffsets

Notations

• $I$ – set of items
• $T$ – database of transactions
• Tid – transaction identifier
• Itemset – a set $X \subseteq I$
• Tidset – a set $Y \subseteq T$
• k-itemset – an itemset with $k$ items
• Support of an itemset $X$, denoted $\sigma(X)$ – the number of transactions in which $X$ occurs as a subset

Notation

• Frequent itemset – $X$ is frequent if $\sigma(X) \ge \mathit{min\_sup}$
• Powerset $P(I)$ – the search space for enumeration
• Maximal frequent itemset – a frequent itemset that is not a subset of any other frequent itemset
• Closed frequent itemset $X$ – a frequent itemset for which there is no superset $Y \supset X$ with $\sigma(X) = \sigma(Y)$
• Closure of an itemset $X$, denoted $c(X)$ – the smallest closed set that contains $X$
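The definitions above can be checked by brute force on a small database. The sketch below is only an illustration: the six transactions are reconstructed from the tidset example later in these slides, and the absolute threshold `MIN_SUP = 3` is an assumption.

```python
from itertools import combinations

# Example database reconstructed from the tidset figure (assumption).
DB = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
MIN_SUP = 3  # assumed absolute minimum support

def support(itemset):
    """sigma(X): number of transactions that contain X as a subset."""
    return sum(1 for items in DB.values() if set(itemset) <= set(items))

items = sorted(set("".join(DB.values())))

# Enumerate the powerset P(I) and keep the frequent itemsets with their support.
frequent = {
    "".join(c): support(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(c) >= MIN_SUP
}

def proper_supersets(x):
    """Frequent proper supersets of x."""
    return [y for y in frequent if set(x) < set(y)]

# Closed: no proper superset has the same support.
closed = [x for x in frequent
          if all(frequent[y] != frequent[x] for y in proper_supersets(x))]
# Maximal: no frequent proper superset at all.
maximal = [x for x in frequent if not proper_supersets(x)]

print(len(frequent), sorted(closed), sorted(maximal))
```

Exhaustive enumeration is exponential in the number of items, so this is usable only for toy inputs; it serves as a reference against which the vertical algorithms can be compared.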


The Problem

• Find all frequent itemsets, i.e., all itemsets $X$ with $\sigma(X) \ge \mathit{min\_sup}$

Database Example


Frequent, Closed and Maximal Itemsets


Data Formats


Equivalence Classes

• Define a function $p: P(I) \times N \to P(I)$, where $p(X, k) = X[1:k]$ is the k-length prefix of $X$
• Define a (prefix-based) equivalence relation $\theta_k$: $\forall X, Y \in P(I),\ X \equiv_{\theta_k} Y \Leftrightarrow p(X, k) = p(Y, k)$
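A minimal sketch of the prefix function and the classes it induces, assuming items are kept in lexicographic order. The helper names `prefix` and `equivalence_classes` are illustrative, not taken from the paper.

```python
from collections import defaultdict

def prefix(itemset, k):
    """p(X, k) = X[1:k]: the first k items of X in sorted order."""
    return tuple(sorted(itemset))[:k]

def equivalence_classes(itemsets, k):
    """Group itemsets that agree on their k-length prefix (relation theta_k)."""
    classes = defaultdict(list)
    for x in itemsets:
        classes[prefix(x, k)].append("".join(sorted(x)))
    return dict(classes)

pairs = ["AC", "AD", "AT", "AW", "CD", "CT", "CW", "DT", "DW", "TW"]
by_first_item = equivalence_classes(pairs, 1)
print(by_first_item[("A",)])
```

Each class can then be mined independently of the others, which is what makes the prefix-based decomposition useful.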


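Example

(Subset search tree over the items A, C, D, T, W; each itemset is followed by the set of items that can still extend it.)

{} {A,C,D,T,W}
A {C,D,T,W}   C {D,T,W}   D {T,W}   T {W}   W
AC {D,T,W}   AD {T,W}   AT {W}   AW   CD {T,W}   CT {W}   CW   DT {W}   DW   TW
ACD {T,W}   ACT {W}   ACW   ADT   ADW   ATW   CDT {W}   CDW   CTW   DTW
ACDT {W}   ACDW   ACTW   ADTW   CDTW
ACDTW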

Compute Subset Class

• Let $P = \{X_1, X_2, \ldots, X_n\}$
• Perform the intersection of $PX_i$ with all $PX_j$ with $j > i$ to obtain a new class $PX_i$ with elements $X_j$, where $PX_iX_j$ is frequent
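The class-extension step can be sketched with plain Python sets as tidsets. This is an illustration under assumptions: itemsets are represented as strings, `extend_class` is a hypothetical helper, and the tidsets and threshold are those of the slides' running example.

```python
def extend_class(members, min_sup):
    """members: (itemset, tidset) pairs sharing a prefix, in class order.

    For every X_i, intersect with each later X_j and keep the frequent results
    as the members of the new class rooted at X_i.
    """
    new_classes = {}
    for i, (xi, ti) in enumerate(members):
        new_class = []
        for xj, tj in members[i + 1:]:
            tidset = ti & tj                      # t(P X_i X_j)
            if len(tidset) >= min_sup:
                name = "".join(sorted(set(xi + xj)))
                new_class.append((name, tidset))
        new_classes[xi] = new_class
    return new_classes

items = [("A", {1, 3, 4, 5}), ("C", {1, 2, 3, 4, 5, 6}), ("D", {2, 4, 5, 6}),
         ("T", {1, 3, 5, 6}), ("W", {1, 2, 3, 4, 5})]
level2 = extend_class(items, min_sup=3)
level3 = extend_class(level2["A"], min_sup=3)
print(level2["A"], level3["AC"])
```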

Tidset Intersections (example)

Items:
  t(A) = 1345   t(C) = 123456   t(D) = 2456   t(T) = 1356   t(W) = 12345
2-itemsets:
  t(AC) = 1345   t(AD) = 45   t(AT) = 135   t(AW) = 1345
  t(CD) = 2456   t(CT) = 1356   t(CW) = 12345   t(DT) = 56   t(DW) = 245   t(TW) = 135
3-itemsets:
  t(ACT) = 135   t(ACW) = 1345   t(ATW) = 135   t(CDW) = 245   t(CTW) = 135
4-itemsets:
  t(ACTW) = 135

Diffsets

Diffset: the difference between the prefix tidset and a class member's tidset

• Consider a class with prefix $P$
• Let $t(X)$ denote the tidset of element $X$
• Let $d(X)$ denote the diffset of element $X$ with respect to the prefix tidset
• Let $PX$ and $PY$ be class members of $P$
• By the definition of support, $t(PX) \subseteq t(P)$ and $t(PY) \subseteq t(P)$

Diffsets

• Then $t(PXY) = t(PX) \cap t(PY)$
• Define the diffset $d(PX) = t(P) - t(X)$
• Then $\sigma(PX) = \sigma(P) - |d(PX)|$

Diffsets

• How to calculate $\sigma(PXY)$ using $d(PX)$ and $d(PY)$?
  – $\sigma(PXY) = \sigma(PX) - |d(PXY)|$
  – $d(PXY) = t(PX) - t(PXY)$
  – $d(PXY) = t(PX) - t(PXY) = t(PX) - t(PY) + t(P) - t(P) = [t(P) - t(Y)] - [t(P) - t(X)] = d(PY) - d(PX)$
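The recurrence can be checked numerically. The sketch below compares $d(PXY)$ computed directly from tidsets with $d(PY) - d(PX)$; the item tidsets are those of the slides' running example, and the helper names `tidset` and `check` are hypothetical.

```python
T = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "D": {2, 4, 5, 6},
     "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}

def tidset(itemset):
    """t(X): intersection of the item tidsets; the empty itemset covers all tids."""
    tids = set(range(1, 7))
    for item in itemset:
        tids &= T[item]
    return tids

def check(p, x, y):
    """Return (d(PXY) from tidsets, d(PXY) from diffsets, sigma(PXY))."""
    d_px = tidset(p) - tidset(p + x)                  # equals t(P) - t(X)
    d_py = tidset(p) - tidset(p + y)                  # equals t(P) - t(Y)
    d_direct = tidset(p + x) - tidset(p + x + y)      # t(PX) - t(PXY)
    d_from_diffsets = d_py - d_px                     # d(PY) - d(PX)
    support = len(tidset(p + x)) - len(d_from_diffsets)
    return d_direct, d_from_diffsets, support

print(check("", "A", "D"))
print(check("A", "C", "T"))
```

Only the diffsets of the two parents and the support of $PX$ are needed for the second computation, so the tidsets never have to be materialized below the first level.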

Example

(Figure: Venn diagram of $t(P)$, $t(X)$ and $t(Y)$, marking the regions $d(PX)$, $d(PY)$, $d(PXY)$ and $t(PXY)$.)

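Diffset Intersections (example)

TIDSET database:
  t(A) = 1345   t(C) = 123456   t(D) = 2456   t(T) = 1356   t(W) = 12345
DIFFSET database:
  d(A) = 26   d(C) = {}   d(D) = 13   d(T) = 24   d(W) = 6
2-itemsets:
  d(AC) = {}   d(AD) = 13   d(AT) = 4   d(AW) = {}
  d(CD) = 13   d(CT) = 24   d(CW) = 6   d(DT) = 24   d(DW) = 6   d(TW) = 6
3-itemsets:
  d(ACT) = 4   d(ACW) = {}   d(ATW) = {}   d(CDW) = 6   d(CTW) = 6
4-itemsets:
  d(ACTW) = {}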

Diffset Example

• Diffset calculation
  – $d(AD) = t(A) - t(D) = 13$
  – $d(AD) = d(D) - d(A) = 13 - 26 = 13$
• Support calculation
  – $\sigma(AD) = \sigma(A) - |d(AD)| = 4 - 2 = 2$

Diffset Example

• Database size
  – Tidset database size = 23
  – Diffset database size = 7
• Total size
  – Tidsets = 76
  – Diffsets = 22
• Size by length

k-itemset (k)   Avg. tidset length   Avg. diffset length
2               3.8                  1
3               3.2                  0.6
4               3                    0
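These sizes can be recomputed from the running example. The sketch below assumes that the totals are taken over exactly the itemsets shown in the example figures, that the averages are taken over the frequent itemsets only, and that the minimum support is 3; `SHOWN` and the helper names are hypothetical.

```python
T = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "D": {2, 4, 5, 6},
     "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}
MIN_SUP = 3  # assumed threshold for the example
# Itemsets that appear in the tidset and diffset example figures.
SHOWN = ["A", "C", "D", "T", "W",
         "AC", "AD", "AT", "AW", "CD", "CT", "CW", "DT", "DW", "TW",
         "ACT", "ACW", "ATW", "CDW", "CTW", "ACTW"]

def tidset(itemset):
    tids = set(range(1, 7))
    for item in itemset:
        tids &= T[item]
    return tids

def diffset(itemset):
    """Diffset relative to the parent prefix (the itemset without its last item)."""
    return tidset(itemset[:-1]) - tidset(itemset)

def total_size(fn, itemsets):
    return sum(len(fn(x)) for x in itemsets)

def average_size(fn, k):
    frequent_k = [x for x in SHOWN if len(x) == k and len(tidset(x)) >= MIN_SUP]
    return total_size(fn, frequent_k) / len(frequent_k)

print(total_size(tidset, SHOWN[:5]), total_size(diffset, SHOWN[:5]))
print(total_size(tidset, SHOWN), total_size(diffset, SHOWN))
for k in (2, 3, 4):
    print(k, average_size(tidset, k), average_size(diffset, k))
```

For k = 2 the average tidset length evaluates to 3.75, which the table rounds to 3.8.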


Experimental Study

• Compare diffsets versus tidsets in terms of database sizes

• Method
  – Real datasets (usually dense)
  – Synthetic datasets (sparse)

Size Of Database


Average Diffset / Tidset Size By length


Average Diffset / Tidset Size

Database      Min_sup (%)   Max Length   Avg. Diffset Size   Avg. Tidset Size   Reduction Ratio
chess         0.5           16           26                  1820               70
connect       90            12           143                 62204              435
mushroom      5             17           60                  622                10
Pumsb*        35            15           301                 18977              63
pumsb         90            8            330                 45036              136
T10I4D100K    0.025         11           14                  86                 6
T20I16D100K   0.1           14           31                  230                11
T40I10D100K   0.5           18           96                  755                8

When To Use diffsets

• Usually there is a cross-over point
• For dense datasets, start with the diffset format
• For sparse datasets, start with the tidset format

Reduction Ratio

• Consider a class $P$
• Let $PX$ and $PY$ be class members with tidsets $t(PX)$ and $t(PY)$
• Consider the new itemset $PXY$ in class $PX$
• $PXY$ can be stored as $t(PXY)$ or $d(PXY)$
• Definition: reduction ratio $r = |t(PXY)| / |d(PXY)|$
• Diffsets are beneficial if $r \ge 1$, or $|t(PXY)| / |d(PXY)| \ge 1$
• Since $d(PXY) = t(PX) - t(PY)$: $|t(PXY)| / |t(PX) - t(PY)| \ge 1$

Reduction Ratio

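• Or $|t(PXY)| / (|t(PX)| - |t(PXY)|) \ge 1$
• That is, $|t(PX)| / |t(PXY)| \le 2$: diffsets pay off when $PXY$ keeps at least half the support of $PX$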
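The criterion can be expressed as two small functions. This is a sketch under assumptions: the function names are hypothetical, an empty diffset is treated as an infinite reduction ratio, and the comparison is taken as non-strict.

```python
def reduction_ratio(t_pxy, d_pxy):
    """r = |t(PXY)| / |d(PXY)|; treated as infinite when the diffset is empty."""
    return float("inf") if not d_pxy else len(t_pxy) / len(d_pxy)

def prefer_diffset(t_px, t_pxy):
    """Equivalent test from the derivation: |t(PX)| / |t(PXY)| <= 2."""
    return len(t_px) <= 2 * len(t_pxy)

# From the running example: t(AC) = 1345, t(ACT) = 135, d(ACT) = 4.
print(reduction_ratio({1, 3, 5}, {4}), prefer_diffset({1, 3, 4, 5}, {1, 3, 5}))
# A made-up sparse case where PXY keeps only one of five tids of PX.
print(reduction_ratio({5}, {1, 2, 3, 4}), prefer_diffset({1, 2, 3, 4, 5}, {5}))
```

Writing the second test as a multiplication avoids a division by zero when $t(PXY)$ is empty.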


Compressed Bitvectors

• Classical approach: run-length encoding (RLE), which is not appropriate for association mining
• Skinning encoding scheme (used by Viper)
  – Worst-case compression ratio asymptotically reaches 2.91
  – Best-case compression ratio asymptotically reaches 32

GenMax: Mining Maximal Frequent Itemsets

• Uses a backtracking search technique
• Optimizations
  – Initially sort items in increasing order of combine-set size and then in increasing order of support (i. explore items with small combine sets first; ii. remove a node from the search tree as early as possible)
  – Superset checking
• More optimizations
  – Progressive focusing to improve superset checking
  – Vertical database format to improve frequency checking using tidsets, which is further improved by diffsets
• Memory handling
  – Store at most $k = m + l$ tidsets (diffsets) in memory, where $m$ is the length of the longest combine set and $l$ is the length of the longest maximal itemset

GenMax(Dataset $T$):
  Calculate $F_1$. Calculate $F_2$.
  For each item $i \in F_1$ calculate $c(i)$, its combine-set.
  Sort items in $F_1$ in INCREASING cardinality of $c(i)$ and then INCREASING $\sigma(i)$.
  Sort each $c(i)$ in order of $F_1$.
  $c(i) = c(i) - \{j : j < i \text{ in sorted order of } F_1\}$.
  $M = \{\}$;  // Maximal Frequent Itemsets.
  for each $i \in F_1$ do
    $Z = \{x \in M : i \in x\}$
    for each $j \in c(i)$ do
      $H = \{x : x \text{ is } j \text{ or } x \text{ follows } j \text{ in } c(i)\}$
      if $H$ has a superset in $Z$ then break
      $I = \{i, j\}$
      $X = c(i) \cap c(j)$; $d(I) = t(i) - t(j)$
      $Y = \{x \in Z : j \in x\}$
      Extend($I$, $X$, $Y$)
      $Z = Z \cup Y$
    $M = M \cup Z$
  Return $M$

// $I$ is the itemset to be extended, $X$ is the set of items that can be
// added to $I$, i.e., the combine set, and $Y$ is the set of relevant maximal
// itemsets found so far, i.e., all maximal itemsets which contain $I$.
Procedure Extend($I$, $X$, $Y$):
  extendflg = 0
  for each $j \in X$ do
    if ($Y \ne \emptyset$) then
      $G = \{x : x \text{ is } j \text{ or } x \text{ follows } j \text{ in } X\}$
      if $G$ has a superset in $Y$ then
        extendflg = 1; break;
    $NewI = I \cup \{j\}$; $d(NewI) = d(j) - d(I)$
    if ($NewI$ is frequent) then
      $NewX = X \cap c(j)$
      extendflg = 1
      if ($NewX = \emptyset$) then
        $Y = Y \cup \{NewI\}$
      else
        $NewY = \{x \in Y : j \in x\}$
        Extend($NewI$, $NewX$, $NewY$)
        $Y = Y \cup NewY$
  if (extendflg = 0 and $Y = \emptyset$) then
    $Y = Y \cup \{I\}$
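The backtracking idea can be sketched compactly in Python. This is a simplified illustration and not the procedure above verbatim: it uses tidsets rather than diffsets, orders items by increasing support only, and checks supersets against one global list instead of using progressive focusing. The function name `genmax` and its signature are assumptions.

```python
def genmax(tidsets, min_sup):
    """Return the maximal frequent itemsets as a list of frozensets."""
    maximal = []

    def has_superset(itemset):
        return any(itemset <= m for m in maximal)

    def extend(itemset, tids, combine):
        # combine: ordered (item, tidset) pairs that may still extend itemset.
        extended = False
        for pos, (item, item_tids) in enumerate(combine):
            # Superset check: if itemset plus all remaining items is already
            # covered by a known maximal itemset, nothing new lies below.
            remaining = frozenset(i for i, _ in combine[pos:])
            if has_superset(itemset | remaining):
                extended = True
                break
            new_tids = tids & item_tids
            if len(new_tids) >= min_sup:
                extended = True
                extend(itemset | {item}, new_tids, combine[pos + 1:])
        if not extended and itemset and not has_superset(itemset):
            maximal.append(itemset)

    all_tids = set().union(*tidsets.values())
    ordered = sorted(tidsets.items(), key=lambda kv: (len(kv[1]), kv[0]))
    extend(frozenset(), all_tids, ordered)
    return maximal

T = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "D": {2, 4, 5, 6},
     "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}
result = genmax(T, min_sup=3)
print(sorted("".join(sorted(m)) for m in result))
```

Exploring low-support items first tends to reach long maximal itemsets early, which makes the superset check prune more of the remaining search.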


dEclat: Mining All Frequent Itemsets

• Performs a bottom-up search
• The equivalence class lattice is traversed in breadth-first (BFS) order
• Input: the members of a class
• Frequent itemsets are generated by computing diffsets for all distinct pairs of itemsets and checking the support of the resulting itemset
• Stores intermediate diffsets (tidsets) of at most two levels in memory

DiffEclat($[P]$):
  for all $X_i \in [P]$ do
    for all $X_j \in [P]$, with $j > i$ do
      $R = X_i \cup X_j$;
      $d(R) = d(X_j) - d(X_i)$;
      if $\sigma(R) \ge \mathit{min\_sup}$ then
        $T_i = T_i \cup \{R\}$;  // $T_i$ initially empty
  for all $T_i \ne \emptyset$ do DiffEclat($T_i$);
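A minimal Python rendering of the pseudocode above, assuming each class member is stored as a triple (itemset, diffset, support) and that the first level is seeded with $d(i) = T - t(i)$. The function name `diff_eclat` and the `out` dictionary are illustrative choices, and each new class is processed as soon as it is built.

```python
def diff_eclat(members, min_sup, out):
    """members: (itemset, diffset, support) triples sharing a prefix."""
    for i, (xi, di, si) in enumerate(members):
        out[xi] = si
        new_class = []                            # T_i, initially empty
        for xj, dj, _ in members[i + 1:]:
            r = "".join(sorted(set(xi + xj)))     # R = X_i u X_j
            d_r = dj - di                         # d(R) = d(X_j) - d(X_i)
            s_r = si - len(d_r)                   # sigma(R) = sigma(X_i) - |d(R)|
            if s_r >= min_sup:
                new_class.append((r, d_r, s_r))
        if new_class:
            diff_eclat(new_class, min_sup, out)
    return out

TIDSETS = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "D": {2, 4, 5, 6},
           "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}
ALL_TIDS = set(range(1, 7))
MIN_SUP = 3  # assumed threshold for the example

# Seed the first level with diffsets relative to the whole database.
level1 = [(item, ALL_TIDS - tids, len(tids))
          for item, tids in sorted(TIDSETS.items()) if len(tids) >= MIN_SUP]
frequent = diff_eclat(level1, MIN_SUP, {})
print(len(frequent), frequent["ACTW"])
```

Supports are derived purely from diffset sizes, so no tidset is touched after the first level is seeded.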


dCharm: Mining Frequent Closed Itemsets

• Performs a bottom-up search
• Eliminates branches and grows itemsets using subset relationships

Subset Relationships

Theorem: Let $X_i \times d(X_i)$ and $X_j \times d(X_j)$ be any two members of class $[P]$, with $X_i \le_f X_j$, where $f$ is a total order (e.g., lexicographic or support-based). The following four properties hold:
1. If $d(X_i) = d(X_j)$, then $c(X_i) = c(X_j) = c(X_i \cup X_j)$
2. If $d(X_i) \supset d(X_j)$, then $c(X_i) \ne c(X_j)$, but $c(X_i) = c(X_i \cup X_j)$
3. If $d(X_i) \subset d(X_j)$, then $c(X_i) \ne c(X_j)$, but $c(X_j) = c(X_i \cup X_j)$
4. If $d(X_i) \ne d(X_j)$, then $c(X_i) \ne c(X_j) \ne c(X_i \cup X_j)$
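The four cases can be checked against closures computed by brute force. The sketch below assumes the six-transaction database reconstructed from the slides' tidset example and takes diffsets relative to the empty prefix; property 4 is read as covering incomparable diffsets. All function names are hypothetical.

```python
DB = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
ALL_TIDS = set(DB)

def tidset(x):
    return {tid for tid, items in DB.items() if set(x) <= set(items)}

def diffset(x):
    """Diffset relative to the empty prefix: tids that do not contain x."""
    return ALL_TIDS - tidset(x)

def closure(x):
    """c(X): items common to all transactions containing X (t(X) non-empty)."""
    rows = [set(DB[tid]) for tid in tidset(x)]
    return "".join(sorted(set.intersection(*rows)))

def property_number(xi, xj):
    """Which of the four properties applies to the pair (X_i, X_j)."""
    d_i, d_j = diffset(xi), diffset(xj)
    if d_i == d_j:
        return 1
    if d_i > d_j:      # proper superset
        return 2
    if d_i < d_j:      # proper subset
        return 3
    return 4

for xi, xj in [("AC", "AW"), ("A", "W"), ("C", "A"), ("A", "D")]:
    print(xi, xj, property_number(xi, xj),
          closure(xi), closure(xj), closure(xi + xj))
```

In the first three cases at least one of the two itemsets has the same closure as the union, which is what allows branches to be merged or dropped.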


DiffCharm($[P]$):
  for all $X_i \in [P]$ do
    $X = X_i$
    for all $X_j \in [P]$, with $j > i$ do
      $R = X \cup X_j$
      $d(R) = d(X_j) - d(X_i)$
      if $\sigma(R) < \mathit{min\_sup}$ then continue
      if $d(X_i) = d(X_j)$ then Remove $X_j$ from Nodes; Replace all $X_i$ with $R$
      if $d(X_i) \supset d(X_j)$ then Replace all $X_i$ with $R$
      if $d(X_i) \subset d(X_j)$ then Remove $X_j$ from Nodes; Add $R$ to $NewN$
      if $d(X_i) \ne d(X_j)$ then Add $R$ to $NewN$
    if $NewN \ne \emptyset$ then DiffCharm($NewN$)

Optimized Initialization

• Computation of $F_2$
• Let $n$ be the number of frequent items
• Let $l$ be the average tidset size
• Amount of data read is $l \cdot n(n-1)/2$
• Number of intersections is $n(n-1)/2$
• In the horizontal approach the amount of data read is $nl$

Improvement

• Compute the frequent itemsets of length 2
• Combine items $I_1$ and $I_2$ only if $I_1 I_2$ is frequent
• The number of intersections in practice is then closer to $O(n)$ rather than $O(n^2)$
• Computing the frequent itemsets of length 2
  – Perform a vertical-to-horizontal transformation
  – Update the counts of pairs of items
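The pair-counting step can be sketched as follows, assuming the vertical database is a dictionary from item to tidset. The function name `frequent_pairs` is hypothetical, and the data is the slides' toy example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(tidsets, min_sup):
    """Count all item pairs by rebuilding the horizontal transactions."""
    # Vertical-to-horizontal transformation: tid -> list of items.
    horizontal = {}
    for item, tids in tidsets.items():
        for tid in tids:
            horizontal.setdefault(tid, []).append(item)
    # Update the count of every pair of items in each transaction.
    counts = Counter()
    for items in horizontal.values():
        counts.update(combinations(sorted(items), 2))
    return {pair: c for pair, c in counts.items() if c >= min_sup}

T = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "D": {2, 4, 5, 6},
     "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}
f2 = frequent_pairs(T, min_sup=3)
n = len(T)
print(n * (n - 1) // 2, len(f2))
```

On this dense toy database 8 of the 10 possible pairs are frequent, so little is saved; the benefit described above shows up on data where most pairs are infrequent.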


Experimental Results

• Times include all costs, including horizontal to vertical database conversion

• Method
  – Real datasets (usually dense)
  – Synthetic datasets (sparse)

Database Characteristics

Database      # Items   Avg. Trans. Length   # Records
chess         76        37                   3,196
connect       130       43                   67,557
mushroom      120       23                   8,124
Pumsb*        7117      50                   49,046
pumsb         7117      74                   49,046
T10I4D100K    1000      10                   100,000
T20I16D100K   1000      40                   100,000

Length Of the Longest Itemset


Cardinality Of F.I , C.F.I and M.F.I


Improvements using Diffsets


Mining Frequent Itemsets


Mining Closed Itemsets


Mining Maximal Itemsets


Conclusions

• Diffsets dramatically cut down the size of memory required to store intermediate results

• Diffsets increase performance significantly when incorporated into previous vertical mining methods

• Diffsets can deliver more than an order of magnitude performance improvement over the best previous methods
