NGDM’02 1
Efficient Data-Reduction Methods for On-line Association Rule Mining
H. Bronnimann (Polytechnic Univ), B. Chen (Exelixis), M. Dash, Y. Qiao, P. Scheuermann (Northwestern University), P. Haas (IBM Almaden)
[email protected] [email protected] [email protected] {manoranj,yiqiao,peters}@ece.nwu.edu
Motivation
Volume of data in warehouses & the Internet is growing faster than Moore's Law
Scalability is a major concern: "classical" algorithms require one or more scans of the database
Need to adapt to streaming data: data elements arrive on-line, with only a limited amount of memory
One solution: execute the algorithm on a sample, or on a lossy compressed synopsis (sketch) of the data
Motivation
Sampling Methods
Advantage: can explicitly trade off accuracy and speed
Work best when tailored to the application
Our Contributions
Sampling methods for count datasets: a base set of items, where each data element is a vector of item counts
Application: association rule mining
Outline
Motivation
FAST
Epsilon Approximation
Experimental Results
Data Stream Reduction
Conclusion
The Problem
Generate a smaller subset S0 of a larger superset S such that the supports of 1-itemsets in S0 are close to those in S:

    minimize Dist(S0, S)   subject to   S0 ⊆ S, |S0| = n

NP-Complete: by reduction from the One-In-Three SAT problem

I1(T) = set of all 1-itemsets in transaction set T
L1(T) = set of frequent 1-itemsets in transaction set T
f(A;T) = support of itemset A in transaction set T

    Dist1(S0, S) = max over A ∈ I1(S) of |f(A;S0) − f(A;S)|

    Dist2(S0, S) = (1/|I1(S)|) · Σ over A ∈ I1(S) of (f(A;S0) − f(A;S))²

    Dist3(S0, S) = ( |L1(S0) − L1(S)| + |L1(S) − L1(S0)| ) / |L1(S0) ∪ L1(S)|
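As a concrete illustration, the three distance measures above can be sketched in Python. This is a minimal sketch, not the authors' code: the helper names (`supports`, `dist1`, `dist2`, `dist3`) are my own, and transactions are modeled as sets of items rather than count vectors.

```python
from collections import Counter

def supports(transactions):
    """Support (fraction of transactions) of each 1-itemset."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in t)
    return {item: c / n for item, c in counts.items()}

def dist1(s0, s):
    """Dist_1: maximum absolute support difference over 1-itemsets of S."""
    fs, f0 = supports(s), supports(s0)
    return max(abs(f0.get(a, 0.0) - fs[a]) for a in fs)

def dist2(s0, s):
    """Dist_2: mean squared support difference over 1-itemsets of S."""
    fs, f0 = supports(s), supports(s0)
    return sum((f0.get(a, 0.0) - fs[a]) ** 2 for a in fs) / len(fs)

def dist3(s0, s, p):
    """Dist_3: normalized symmetric difference of frequent 1-itemsets
    at minimum support p."""
    l0 = {a for a, f in supports(s0).items() if f >= p}
    ls = {a for a, f in supports(s).items() if f >= p}
    union = l0 | ls
    return len(l0 ^ ls) / len(union) if union else 0.0
```

For example, with S = [{a,b}, {a}, {b}, {a,b}] and S0 = [{a,b}, {a}], the support of a moves from 0.75 to 1.0 and of b from 0.75 to 0.5, giving Dist1 = 0.25.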
FAST-trim
FAST-trim Outline
Given a specified minimum support p and confidence c, the FAST-trim algorithm proceeds as follows:
1. Obtain a large simple random sample S from D.
2. Compute f(A;S) for each 1-itemset A.
3. Using the supports computed in Step 2, obtain a reduced sample S0 from S by trimming away outlier transactions.
4. Run a standard association-rule algorithm against S0, with minimum support p and confidence c, to obtain the final set of association rules.
FAST-trim
FAST-trim Algorithm (Trimming Phase)
Uses input parameter k to explicitly trade off speed and accuracy:

while (|S0| > n) {
    divide S0 into disjoint groups of min(k, |S0|) transactions each;
    for each group G {
        compute f(A;S0) for each item A;
        set S0 = S0 − {t*}, where Dist(S0 − {t*}, S) = min over t ∈ G of Dist(S0 − {t}, S);
    }
}

Note: removal of the outlier t* causes the maximum decrease or minimum increase in Dist(S0, S)
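A runnable sketch of the trimming loop, under stated simplifications: this is my own rendering, it uses a Dist_2-style squared-support distance, and it removes one outlier at a time from the head of the current sample rather than one per group per pass.

```python
from collections import Counter

def supports(transactions):
    """Support fraction of each 1-itemset."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in t)
    return {item: c / n for item, c in counts.items()}

def dist2(s0, fs):
    """Squared support distance between sample s0 and precomputed supports fs."""
    f0 = supports(s0)
    return sum((f0.get(a, 0.0) - f) ** 2 for a, f in fs.items())

def fast_trim(s, n, k=5):
    """Trim s down to n transactions by repeatedly discarding the outlier
    whose removal best preserves the 1-itemset supports of s."""
    fs = supports(s)                      # supports in the large sample S
    s0 = list(s)
    while len(s0) > n:
        group = s0[:min(k, len(s0))]      # one group of at most k transactions
        # t*: the transaction whose removal minimizes Dist(S0 - {t}, S)
        t_star = min(group, key=lambda t: dist2([u for u in s0 if u is not t], fs))
        s0 = [u for u in s0 if u is not t_star]
    return s0
```

On a toy sample of three {a} transactions and one {b} transaction, trimming to three rows discards an {a} row, since dropping the lone {b} row would distort the supports far more.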
FAST-grow
FAST-grow Algorithm (Growing Phase)
Select representative transactions from S and add them to the sample S0, which is initially empty:

while (|S0| < n) {
    divide the unselected transactions of S into disjoint groups of at most k transactions each;
    for each group G {
        compute f(A;S0) for each item A;
        set S0 = S0 ∪ {t*}, where Dist(S0 ∪ {t*}, S) = min over t ∈ G of Dist(S0 ∪ {t}, S);
    }
}
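A mirror sketch of the growing phase, again my own simplified rendering with a Dist_2-style metric: it greedily adds, one at a time, the transaction whose inclusion best matches the supports of S (rather than one per group per pass).

```python
from collections import Counter

def supports(transactions):
    """Support fraction of each 1-itemset (empty input yields {})."""
    n = len(transactions) or 1
    counts = Counter(item for t in transactions for item in t)
    return {item: c / n for item, c in counts.items()}

def fast_grow(s, n):
    """Grow an initially empty sample to n transactions by repeatedly adding
    the transaction whose inclusion best matches the supports of s."""
    fs = supports(s)
    s0, pool = [], list(s)
    while len(s0) < n and pool:
        def d2(t):
            f0 = supports(s0 + [t])
            return sum((f0.get(a, 0.0) - f) ** 2 for a, f in fs.items())
        t_star = min(pool, key=d2)
        s0.append(t_star)
        pool.remove(t_star)               # drop the first equal transaction
    return s0
```

On a stream of four {a} rows and two {b} rows, growing to three rows yields two {a} rows and one {b} row, matching the 2:1 support ratio of the full set.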
Epsilon Approximation (EA)
Theory based on work in statistics on VC dimensions (Vapnik & Chervonenkis '71) shows:
The frequencies of a collection of subsets can be estimated simultaneously, provided the VC dimension is finite
Applications in computational geometry and learning theory
Def: A sample S0 of S1 is an ε-approximation iff its discrepancy satisfies Dist(S0, S1) ≤ ε
Epsilon Approximation (EA)
Halving Method
Deterministically halves the data to get the sample S0
Apply halving repeatedly (S1 ⇒ S2 ⇒ … ⇒ St (= S0)) while the total discrepancy stays within ε
Each halving step introduces a discrepancy εi = ε(ni, m), where m = total no. of items in the database and ni = size of sub-sample Si
Halving stops with the maximum t such that Σ over i ≤ t of ε(ni, m) ≤ ε
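To make the stopping rule concrete, here is a small sketch that counts how many halvings fit within a total discrepancy budget ε. The per-step form ε(n, m) = sqrt(2·ln(2m)/n) is an assumed instance of the standard discrepancy bound; the slide does not give the exact constant, so treat the numbers as illustrative only.

```python
from math import log, sqrt

def eps_step(n, m):
    """Assumed per-halving discrepancy bound for n transactions, m items."""
    return sqrt(2 * log(2 * m) / n)

def max_halvings(n, m, eps):
    """Largest t such that the summed per-step discrepancies stay within eps.
    Each halving shrinks the sub-sample from n to n // 2."""
    total, t = 0.0, 0
    while n > 1 and total + eps_step(n, m) <= eps:
        total += eps_step(n, m)
        n //= 2
        t += 1
    return t
```

Note how the per-step discrepancy grows as the sub-sample shrinks, so each successive halving consumes more of the budget than the last.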
Epsilon Approximation (EA)
How to compute a halving? The hyperbolic cosine method [Spencer]:
1. Color each transaction red (in sample) or blue (not in sample)
2. Maintain a penalty for each item that reflects its red/blue imbalance:
   penalty stays small if red/blue are approximately balanced
   penalty shoots up exponentially when red dominates (item is over-sampled) or blue dominates (item is under-sampled)
3. Color transactions sequentially, keeping the penalty low
Key property: on average the penalty does not increase
=> one of the two colors does not increase the global penalty
Epsilon Approximation (EA)
Penalty Computation
Let Qi = penalty for item Ai, initialized to Qi = 2
Suppose that we have colored the first j transactions:

    Qi(j) = (1 + δi)^ri · (1 − δi)^bi + (1 − δi)^ri · (1 + δi)^bi

where ri = ri(j) = no. of red transactions containing Ai
      bi = bi(j) = no. of blue transactions containing Ai
      δi = parameter that influences how fast the penalty changes as a function of |ri − bi|
Epsilon Approximation (EA)
How to color transaction j+1: compute the global penalty for each choice

    Q^r = Σ over i of Qi(j+1 | red)  = global penalty assuming transaction j+1 is red
    Q^b = Σ over i of Qi(j+1 | blue) = global penalty assuming transaction j+1 is blue

Choose the color for which the global penalty is smaller
EA is inherently an on-line method
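Putting the penalty bookkeeping together, here is a runnable sketch of one halving pass. This is my own rendering, not the authors' code: a single δ is used for all items, and the red half is returned as the sample.

```python
def halve(transactions, items, delta=0.1):
    """Color transactions red/blue so that each item's red and blue counts
    stay balanced; return the red transactions as the half-size sample."""
    r = {a: 0 for a in items}             # red count per item
    b = {a: 0 for a in items}             # blue count per item

    def penalty(rc, bc):
        # global penalty: sum over items of the two exponential terms
        return sum((1 + delta) ** rc[a] * (1 - delta) ** bc[a]
                   + (1 - delta) ** rc[a] * (1 + delta) ** bc[a]
                   for a in items)

    red = []
    for t in transactions:
        # tentative counts if transaction t is colored red vs. blue
        q_red = penalty({a: r[a] + (a in t) for a in items}, b)
        q_blue = penalty(r, {a: b[a] + (a in t) for a in items})
        if q_red <= q_blue:               # pick the color with smaller penalty
            red.append(t)
            for a in t:
                r[a] += 1
        else:
            for a in t:
                b[a] += 1
    return red
```

On a run of identical transactions, the colors alternate, so the red half has exactly half the rows, which is the balanced outcome the penalty is designed to enforce.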
Performance Evaluation
Synthetic data set from the IBM QUEST project [AS94]:
100,000 transactions
1,000 items
number of maximal potentially large itemsets = 2000
average transaction length: 10
average length of maximal large itemsets: 4
length of the maximal large itemsets: 6
minimum support: 0.77%
Final sampling ratios: 0.76%, 1.51%, 3.0%, … dictated by EA halvings
Experimental Results
[Figure: Time vs. Sample Ratio, FAST_trim vs. EA/SRS. CPU time (sec) vs. sample ratio (0 to 0.3) for FAST_trim_D1, FAST_trim_D2, EA, and SRS]
[Figure: Accuracy vs. Sample Ratio, FAST_trim vs. EA/SRS. Accuracy vs. sample ratio (0 to 0.3) for the same methods]
At an 87% reduction in sample size, accuracy is: EA (99%), FAST_trim_D2 (97%), SRS (94.6%)
Experimental Results
[Figure: Time vs. Sample Ratio, FAST_grow vs. EA/SRS. CPU time (sec) vs. sample ratio (0 to 0.3) for FAST_grow_D1, FAST_grow_D2, EA, and SRS]
[Figure: Accuracy vs. Sample Ratio, FAST_grow vs. EA/SRS. Accuracy vs. sample ratio (0 to 0.3) for the same methods]
FAST_grow_D2 is best for very small sampling ratios (< 2%); EA is best overall in accuracy
Data Stream Reduction
Data Stream Reduction (DSR): maintain a representative sample of the data stream
Assign more weight to recent data while partially keeping track of old data

Bucket #:                mS      mS−1    mS−2    …    1
Raw transactions:        NS/2    NS/2    NS/2    …    NS/2
Sample after halving:    NS/2    NS/4    NS/8    …    1

To generate an NS-element sample, halve bucket k (mS − k) times
Total no. of transactions represented = mS · NS/2
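The bucket arithmetic can be sketched as follows (my own helper, not from the slides; bucket m_s is the newest and bucket 1 the oldest):

```python
def bucket_sample_sizes(m_s, n_s):
    """Sample size contributed by each bucket, newest (bucket m_s) first:
    bucket k holds n_s/2 raw transactions halved (m_s - k) times."""
    return [(n_s // 2) >> (m_s - k) for k in range(m_s, 0, -1)]
```

With 4 buckets and a target sample of 16 elements, the contributions are 8, 4, 2, 1: the geometric decay is exactly how DSR weights recent data more heavily while still keeping a trace of old data.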
Data Stream Reduction
Practical Implementation
[Diagram: a buffer of size Ns whose segments have undergone 0, 1, 2, … halvings, with the newest segment still empty; once the buffer fills, each segment's halving count increases by one (1, 2, 3, … halvings)]
To avoid frequent halving, we use one buffer at a time and compute a new representative sample by applying EA when the buffer is full
Data Stream Reduction
Problem: two users querying immediately before and after a halving operation see data that varies substantially
Continuous DSR: the buffer is divided into chunks; as each batch of ns new transactions arrives, the oldest chunk is halved first
[Diagram: chunk sizes progress from (2ns, 4ns, …, Ns−2ns, Ns) to (3ns, 5ns, …, Ns−ns, Ns) as the next ns transactions arrive]
Conclusion
Two-stage sampling approach based on trimming outliers or selecting representative transactions
Epsilon Approximation: a deterministic method for repeatedly halving the data to obtain the final sample
Can be used in conjunction with other non-sampling, count-based mining algorithms
EA-based data stream reduction
• We are investigating how to evaluate the goodness of a representative subset
• Frequency information to be used in the discrepancy function