NGDM’02 1
Efficient Data-Reduction Methods for On-line Association Rule Mining
H. Bronnimann (Polytechnic Univ), B. Chen (Exelixis), M. Dash, Y. Qiao, P. Scheuermann (Northwestern University), P. Haas (IBM Almaden)
[email protected] [email protected] [email protected] {manoranj,yiqiao,peters}@ece.nwu.edu
Motivation
Volume of data in warehouses & the Internet is growing faster than Moore's Law
Scalability is a major concern: "classical" algorithms require one or more scans of the database
Need to adapt to streaming data: data elements arrive on-line, with only a limited amount of memory
One solution: execute the algorithm on a sample, or on a lossy compressed synopsis (sketch) of the data
Motivation
Sampling Methods
Advantage: can explicitly trade off accuracy and speed
Work best when tailored to the application
Our Contributions
Sampling methods for count datasets: a base set of items, where each data element is a vector of item counts
Application: association rule mining
Outline
Motivation
FAST
Epsilon Approximation
Experimental Results
Data Stream Reduction
Conclusion
The Problem
Generate a smaller subset S0 of a larger superset S such that the supports of 1-itemsets in S0 are close to those in S:

    minimize Dist(S0, S)   subject to   S0 ⊆ S, |S0| = n

NP-Complete: by reduction from the One-In-Three SAT problem

I1(T) = set of all 1-itemsets in transaction set T
L1(T) = set of frequent 1-itemsets in transaction set T
f(A;T) = support of itemset A in transaction set T

    Dist1(S0, S) = max over A ∈ I1(S) of |f(A;S0) − f(A;S)|

    Dist2(S0, S) = (1/|I1(S)|) · Σ over A ∈ I1(S) of (f(A;S0) − f(A;S))²

    Dist3(S0, S) = ( |L1(S0) − L1(S)| + |L1(S) − L1(S0)| ) / |L1(S0) ∪ L1(S)|
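As a concrete illustration, the three distance measures above can be sketched in Python. This is a minimal sketch, not the authors' code: the helper names (`supports`, `dist1`, `dist2`, `dist3`) are my own, and transactions are modeled as sets of items rather than count vectors.

```python
from collections import Counter

def supports(transactions):
    """Support (fraction of transactions) of each 1-itemset."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in t)
    return {item: c / n for item, c in counts.items()}

def dist1(s0, s):
    """Dist_1: maximum absolute support difference over 1-itemsets of S."""
    fs, f0 = supports(s), supports(s0)
    return max(abs(f0.get(a, 0.0) - fs[a]) for a in fs)

def dist2(s0, s):
    """Dist_2: mean squared support difference over 1-itemsets of S."""
    fs, f0 = supports(s), supports(s0)
    return sum((f0.get(a, 0.0) - fs[a]) ** 2 for a in fs) / len(fs)

def dist3(s0, s, p):
    """Dist_3: normalized symmetric difference of frequent 1-itemsets
    at minimum support p."""
    l0 = {a for a, f in supports(s0).items() if f >= p}
    ls = {a for a, f in supports(s).items() if f >= p}
    union = l0 | ls
    return len(l0 ^ ls) / len(union) if union else 0.0
```

For example, with S = [{a,b}, {a}, {b}, {a,b}] and S0 = [{a,b}, {a}], the support of a moves from 0.75 to 1.0 and of b from 0.75 to 0.5, giving Dist1 = 0.25.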
FAST-trim
FAST-trim Outline
Given a specified minimum support p and confidence c, the FAST-trim algorithm proceeds as follows:
1. Obtain a large simple random sample S from D.
2. Compute f(A;S) for each 1-itemset A.
3. Using the supports computed in Step 2, obtain a reduced sample S0 from S by trimming away outlier transactions.
4. Run a standard association-rule algorithm against S0, with minimum support p and confidence c, to obtain the final set of association rules.
FAST-trim
FAST-trim Algorithm (Trimming Phase)
Uses input parameter k to explicitly trade off speed and accuracy:

while (|S0| > n) {
    divide S0 into disjoint groups of min(k, |S0|) transactions each;
    for each group G {
        compute f(A;S0) for each item A;
        set S0 = S0 − {t*}, where Dist(S0 − {t*}, S) = min over t ∈ G of Dist(S0 − {t}, S);
    }
}

Note: removal of the outlier t* causes the maximum decrease or minimum increase in Dist(S0, S)
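A runnable sketch of the trimming loop, under stated simplifications: this is my own rendering, it uses a Dist_2-style squared-support distance, and it removes one outlier at a time from the head of the current sample rather than one per group per pass.

```python
from collections import Counter

def supports(transactions):
    """Support fraction of each 1-itemset."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in t)
    return {item: c / n for item, c in counts.items()}

def dist2(s0, fs):
    """Squared support distance between sample s0 and precomputed supports fs."""
    f0 = supports(s0)
    return sum((f0.get(a, 0.0) - f) ** 2 for a, f in fs.items())

def fast_trim(s, n, k=5):
    """Trim s down to n transactions by repeatedly discarding the outlier
    whose removal best preserves the 1-itemset supports of s."""
    fs = supports(s)                      # supports in the large sample S
    s0 = list(s)
    while len(s0) > n:
        group = s0[:min(k, len(s0))]      # one group of at most k transactions
        # t*: the transaction whose removal minimizes Dist(S0 - {t}, S)
        t_star = min(group, key=lambda t: dist2([u for u in s0 if u is not t], fs))
        s0 = [u for u in s0 if u is not t_star]
    return s0
```

On a toy sample of three {a} transactions and one {b} transaction, trimming to three rows discards an {a} row, since dropping the lone {b} row would distort the supports far more.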
FAST-grow
FAST-grow Algorithm (Growing Phase)
Select representative transactions from S and add them to the sample S0, which is initially empty:

while (|S0| < n) {
    divide the unselected transactions of S into disjoint groups of at most k transactions each;
    for each group G {
        compute f(A;S0) for each item A;
        set S0 = S0 ∪ {t*}, where Dist(S0 ∪ {t*}, S) = min over t ∈ G of Dist(S0 ∪ {t}, S);
    }
}
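A mirror sketch of the growing phase, again my own simplified rendering with a Dist_2-style metric: it greedily adds, one at a time, the transaction whose inclusion best matches the supports of S (rather than one per group per pass).

```python
from collections import Counter

def supports(transactions):
    """Support fraction of each 1-itemset (empty input yields {})."""
    n = len(transactions) or 1
    counts = Counter(item for t in transactions for item in t)
    return {item: c / n for item, c in counts.items()}

def fast_grow(s, n):
    """Grow an initially empty sample to n transactions by repeatedly adding
    the transaction whose inclusion best matches the supports of s."""
    fs = supports(s)
    s0, pool = [], list(s)
    while len(s0) < n and pool:
        def d2(t):
            f0 = supports(s0 + [t])
            return sum((f0.get(a, 0.0) - f) ** 2 for a, f in fs.items())
        t_star = min(pool, key=d2)
        s0.append(t_star)
        pool.remove(t_star)               # drop the first equal transaction
    return s0
```

On a stream of four {a} rows and two {b} rows, growing to three rows yields two {a} rows and one {b} row, matching the 2:1 support ratio of the full set.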
Epsilon Approximation (EA)
Theory based on work in statistics on VC dimensions (Vapnik & Chervonenkis '71) shows:
The frequencies of a collection of subsets can be estimated simultaneously, provided the VC dimension is finite
Applications in computational geometry and learning theory
Def: A sample S0 of S1 is an ε-approximation iff its discrepancy satisfies Dist(S0, S1) ≤ ε
Epsilon Approximation (EA)
Halving Method
Deterministically halves the data to get the sample S0
Apply halving repeatedly (S1 ⇒ S2 ⇒ … ⇒ St (= S0)) while the total discrepancy stays within ε
Each halving step introduces a discrepancy εi = ε(ni, m), where m = total no. of items in the database and ni = size of sub-sample Si
Halving stops with the maximum t such that Σ over i ≤ t of ε(ni, m) ≤ ε
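To make the stopping rule concrete, here is a small sketch that counts how many halvings fit within a total discrepancy budget ε. The per-step form ε(n, m) = sqrt(2·ln(2m)/n) is an assumed instance of the standard discrepancy bound; the slide does not give the exact constant, so treat the numbers as illustrative only.

```python
from math import log, sqrt

def eps_step(n, m):
    """Assumed per-halving discrepancy bound for n transactions, m items."""
    return sqrt(2 * log(2 * m) / n)

def max_halvings(n, m, eps):
    """Largest t such that the summed per-step discrepancies stay within eps.
    Each halving shrinks the sub-sample from n to n // 2."""
    total, t = 0.0, 0
    while n > 1 and total + eps_step(n, m) <= eps:
        total += eps_step(n, m)
        n //= 2
        t += 1
    return t
```

Note how the per-step discrepancy grows as the sub-sample shrinks, so each successive halving consumes more of the budget than the last.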
Epsilon Approximation (EA)
How to compute a halving? The hyperbolic cosine method [Spencer]:
1. Color each transaction red (in sample) or blue (not in sample)
2. Maintain a penalty for each item that reflects its red/blue imbalance:
   penalty stays small if red/blue are approximately balanced
   penalty shoots up exponentially when red dominates (item is over-sampled) or blue dominates (item is under-sampled)
3. Color transactions sequentially, keeping the penalty low
Key property: on average the penalty does not increase
=> one of the two colors does not increase the global penalty
Epsilon Approximation (EA)
Penalty Computation
Let Qi = penalty for item Ai, initialized to Qi = 2
Suppose that we have colored the first j transactions:

    Qi(j) = (1 + δi)^ri · (1 − δi)^bi + (1 − δi)^ri · (1 + δi)^bi

where ri = ri(j) = no. of red transactions containing Ai
      bi = bi(j) = no. of blue transactions containing Ai
      δi = parameter that influences how fast the penalty changes as a function of |ri − bi|
Epsilon Approximation (EA)
How to color transaction j+1: compute the global penalty for each choice

    Q^r = Σ over i of Qi(j+1 | red)  = global penalty assuming transaction j+1 is red
    Q^b = Σ over i of Qi(j+1 | blue) = global penalty assuming transaction j+1 is blue

Choose the color for which the global penalty is smaller
EA is inherently an on-line method
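Putting the penalty bookkeeping together, here is a runnable sketch of one halving pass. This is my own rendering, not the authors' code: a single δ is used for all items, and the red half is returned as the sample.

```python
def halve(transactions, items, delta=0.1):
    """Color transactions red/blue so that each item's red and blue counts
    stay balanced; return the red transactions as the half-size sample."""
    r = {a: 0 for a in items}             # red count per item
    b = {a: 0 for a in items}             # blue count per item

    def penalty(rc, bc):
        # global penalty: sum over items of the two exponential terms
        return sum((1 + delta) ** rc[a] * (1 - delta) ** bc[a]
                   + (1 - delta) ** rc[a] * (1 + delta) ** bc[a]
                   for a in items)

    red = []
    for t in transactions:
        # tentative counts if transaction t is colored red vs. blue
        q_red = penalty({a: r[a] + (a in t) for a in items}, b)
        q_blue = penalty(r, {a: b[a] + (a in t) for a in items})
        if q_red <= q_blue:               # pick the color with smaller penalty
            red.append(t)
            for a in t:
                r[a] += 1
        else:
            for a in t:
                b[a] += 1
    return red
```

On a run of identical transactions, the colors alternate, so the red half has exactly half the rows, which is the balanced outcome the penalty is designed to enforce.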
Performance Evaluation
Synthetic data set from the IBM QUEST project [AS94]:
100,000 transactions
1,000 items
number of maximal potentially large itemsets = 2000
average transaction length: 10
average length of maximal large itemsets: 4
length of the maximal large itemsets: 6
minimum support: 0.77%
Final sampling ratios: 0.76%, 1.51%, 3.0%, … dictated by EA halvings
Experimental Results
[Figure: Time vs. Sample Ratio, FAST_trim vs. EA/SRS. CPU time (sec) vs. sample ratio (0 to 0.3) for FAST_trim_D1, FAST_trim_D2, EA, and SRS]
[Figure: Accuracy vs. Sample Ratio, FAST_trim vs. EA/SRS. Accuracy vs. sample ratio (0 to 0.3) for the same methods]
At an 87% reduction in sample size, accuracy is: EA (99%), FAST_trim_D2 (97%), SRS (94.6%)
Experimental Results
[Figure: Time vs. Sample Ratio, FAST_grow vs. EA/SRS. CPU time (sec) vs. sample ratio (0 to 0.3) for FAST_grow_D1, FAST_grow_D2, EA, and SRS]
[Figure: Accuracy vs. Sample Ratio, FAST_grow vs. EA/SRS. Accuracy vs. sample ratio (0 to 0.3) for the same methods]
FAST_grow_D2 is best for very small sampling ratios (< 2%); EA is best overall in accuracy
Data Stream Reduction
Data Stream Reduction (DSR): maintain a representative sample of the data stream
Assign more weight to recent data while partially keeping track of old data

Bucket #:                mS      mS−1    mS−2    …    1
Raw transactions:        NS/2    NS/2    NS/2    …    NS/2
Sample after halving:    NS/2    NS/4    NS/8    …    1

To generate an NS-element sample, halve bucket k (mS − k) times
Total no. of transactions represented = mS · NS/2
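The bucket arithmetic can be sketched as follows (my own helper, not from the slides; bucket m_s is the newest and bucket 1 the oldest):

```python
def bucket_sample_sizes(m_s, n_s):
    """Sample size contributed by each bucket, newest (bucket m_s) first:
    bucket k holds n_s/2 raw transactions halved (m_s - k) times."""
    return [(n_s // 2) >> (m_s - k) for k in range(m_s, 0, -1)]
```

With 4 buckets and a target sample of 16 elements, the contributions are 8, 4, 2, 1: the geometric decay is exactly how DSR weights recent data more heavily while still keeping a trace of old data.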
Data Stream Reduction
Practical Implementation
[Diagram: a buffer of size Ns whose segments have undergone 0, 1, 2, … halvings, with the newest segment still empty; once the buffer fills, each segment's halving count increases by one (1, 2, 3, … halvings)]
To avoid frequent halving, we use one buffer at a time and compute a new representative sample by applying EA when the buffer is full
Data Stream Reduction
Problem: two users querying immediately before and after a halving operation see data that varies substantially
Continuous DSR: the buffer is divided into chunks; as each batch of ns new transactions arrives, the oldest chunk is halved first
[Diagram: chunk sizes progress from (2ns, 4ns, …, Ns−2ns, Ns) to (3ns, 5ns, …, Ns−ns, Ns) as the next ns transactions arrive]
Conclusion
Two-stage sampling approach based on trimming outliers or selecting representative transactions
Epsilon Approximation: a deterministic method for repeatedly halving the data to obtain the final sample
Can be used in conjunction with other non-sampling, count-based mining algorithms
EA-based data stream reduction
• We are investigating how to evaluate the goodness of a representative subset
• Frequency information to be used in the discrepancy function