Verifying and Mining Frequent Patterns from Large Windows over Data Streams. Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo. ICDE 2008, Cancun, Mexico.
Verifying and Mining Frequent Patterns from Large Windows over Data Streams
Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo
ICDE 2008, Cancun, Mexico
Finding Frequent Patterns for Association Rule Mining
Given a set of transactions T and a support threshold s, find all patterns with support >= s
  Apriori [Agrawal’ 94], FP-growth [Han’ 00]
Fast & light algorithms for data streams
  More than 30 proposals [Jiang’ 06]
For mining windows over streams
  In particular, DSMSs divide windows into panes, a.k.a. slides
  As in our Stream Mill Miner system
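The problem statement above can be made concrete with a brute-force sketch. It is only an illustration of the definition, not Apriori or FP-growth; the function names are hypothetical and the threshold is assumed to be an absolute count rather than a fraction.

```python
from itertools import combinations

def support(pattern, transactions):
    """Number of transactions that contain every item of `pattern`."""
    items = set(pattern)
    return sum(1 for t in transactions if items <= set(t))

def frequent_patterns(transactions, min_count):
    """Enumerate every itemset and keep those with support >= min_count.

    Exponential in the number of items; only meant to show the definition.
    """
    universe = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, len(universe) + 1):
        for pattern in combinations(universe, size):
            c = support(pattern, transactions)
            if c >= min_count:
                result[pattern] = c
    return result

T = [("a", "b", "c"), ("a", "b"), ("a", "c"), ("b", "c"), ("a", "b", "c")]
frequent = frequent_patterns(T, min_count=3)
```

Here every single item and every pair reaches the threshold of 3, while ("a", "b", "c") appears in only two transactions and is left out.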
Moment (Maintaining Closed Frequent Itemsets over a Stream Sliding Window)
Yun Chi, Haixun Wang, Philip S. Yu, Richard R. Muntz
Collaboration of UCLA + IBM
Closed Enumeration Tree (CET)
Very similar to an FP-tree, except that it keeps a dynamic set of itemsets:
  Closed frequent itemsets
  Boundary itemsets
Moment Algorithm (I)
Hope: in the absence of concept drifts, not many changes in status
Maintains two types of boundary nodes:
1. Freq / non-freq
2. Closed / non-closed
Taking specific actions to maintain a shifting boundary whenever a concept shift occurs
Moment Algorithm (II)
Infrequent gateway nodes: infrequent + its parent is frequent + result of a candidate join
Unpromising gateway nodes: frequent + prefix of a closed itemset with the same support
Intermediate nodes: frequent + has a child with the same support + not unpromising
Closed nodes: closed and frequent
Moment Algorithm (III)
Increments: add/delete to/from the CET upon arrival/expiration of each transaction
Downside: batch operations are not applicable, so it suffers from big slide sizes
Advantage: efficient for small slides
CanTree [Leung’ 05]
Use a fixed canonical order according to decreasing single-item frequency
Use a single-round version of FP-growth
Algorithm:
Upon each window move:
  Add/remove new/expired transactions to/from the FP-tree (using the same item order)
  Run FP-growth (without any pruning)
CanTree (cont.)
Pros:
  Very efficient for large slides
Cons:
  Inefficient for small slides
  Not scalable for large windows
  Needs memory for the entire window
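The add/remove step on a window move can be sketched as a prefix tree with a fixed item order. This is a minimal sketch, not the CanTree implementation from [Leung’ 05]: the class and method names are hypothetical, the item order is simply passed in by the caller, and the FP-growth run over the tree is not shown.

```python
class Node:
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

class CanonicalTree:
    """Prefix tree whose item order is fixed up front and never changes."""

    def __init__(self, item_order):
        self.rank = {item: i for i, item in enumerate(item_order)}
        self.root = Node()

    def _canonical(self, transaction):
        return sorted(set(transaction), key=self.rank.get)

    def add(self, transaction):
        node = self.root
        for item in self._canonical(transaction):
            node = node.children.setdefault(item, Node(item))
            node.count += 1

    def remove(self, transaction):
        """Undo an earlier add(); assumes the transaction is in the tree."""
        node = self.root
        for item in self._canonical(transaction):
            child = node.children[item]
            child.count -= 1
            if child.count == 0:
                # No other transaction passes through here, so the whole
                # subtree below this node is empty as well.
                del node.children[item]
                return
            node = child

    def dump(self, node=None):
        """Nested dict view of the tree, handy for inspection."""
        if node is None:
            node = self.root
        return {f"{c.item}:{c.count}": self.dump(c) for c in node.children.values()}

tree = CanonicalTree("abcd")
for t in [("a", "b"), ("c", "a"), ("b",)]:
    tree.add(t)
tree.remove(("a", "b"))   # expired transaction leaves the window
tree.add(("a", "c", "d")) # new transaction enters the window
```

Because the order never changes, a transaction always maps to the same path, which is what makes removal of expired transactions a simple walk down that path.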
Frequent Patterns Mining over Data Streams
Challenges:
  Computation
  Storage
  Real-time response
  Customization
  Integration with the DSMS
[Figure: window W4 moves to W5 over slides S4 … S7; S4 is the expired slide and S7 is the new slide.]
Frequent Patterns Mining over Data Streams
Difficult problem: [Chi’ 04, Leung’ 05, Cheung’ 03, Koh’ 04, …]
  Mining each window from scratch is too expensive
  Subsequent windows have many frequent patterns in common
  Updating frequent patterns on every new tuple is also too expensive
SWIM’s middle-road approach: incrementally maintain frequent patterns over sliding windows
Desiderata: scalability with slide size and window size
SWIM (Sliding Window Incremental Miner)
If a pattern p is frequent in a window, it must be frequent in at least one of its slides, so keep the union of the frequent patterns of all slides in a pattern tree (PT)
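[Figure: window W4 (slides S4, S5, S6) moves to W5 (slides S5, S6, S7); S4 is the expired slide and S7 is the new slide. Fi denotes the frequent patterns of slide Si.]
Before the move: PT = F4 U F5 U F6
New slide S7: count/update frequencies, mine S7 with the mining algorithm, add F7 to PT
Expired slide S4: count/update frequencies, prune PT
After the move: PT = F5 U F6 U F7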
SWIM
For each new slide Si:
  Find all frequent patterns in Si (using FP-growth)
  Verify the frequency of these new patterns in each window slide
    Immediately, or
    With delay (< N slides)
    Trade-off: max delay vs. computation
No false negatives or false positives!
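The loop above can be sketched end to end. This is a simplified illustration, not the authors' implementation: a brute-force miner stands in for FP-growth, PT is a plain dict instead of an FP-tree, new patterns are verified immediately (no delay), and all names are hypothetical. A pattern counts as frequent in a slide or window when its count is at least `min_support` times the number of transactions there, so a pattern frequent in the window must be frequent in at least one slide.

```python
from collections import deque
from itertools import combinations

def count(pattern, transactions):
    items = set(pattern)
    return sum(1 for t in transactions if items <= set(t))

def mine(transactions, min_count):
    """Brute-force stand-in for FP-growth."""
    universe = sorted({x for t in transactions for x in t})
    return {p for k in range(1, len(universe) + 1)
            for p in combinations(universe, k)
            if count(p, transactions) >= min_count}

class Swim:
    def __init__(self, slides_per_window, min_support):
        self.n = slides_per_window
        self.min_support = min_support   # fraction in [0, 1]
        self.slides = deque()
        self.pt = {}                     # pattern -> window count, last slide where frequent
        self.index = -1

    def push(self, slide):
        """Process one new slide; return the patterns frequent in the window."""
        self.index += 1
        # Count/update: add the new slide's contribution to patterns already in PT.
        for p, entry in self.pt.items():
            entry["count"] += count(p, slide)
        # Mine the new slide and add its frequent patterns to PT.
        for p in mine(slide, self.min_support * len(slide)):
            if p not in self.pt:
                # Immediate verification against the older slides of the window.
                older = sum(count(p, s) for s in self.slides)
                self.pt[p] = {"count": older + count(p, slide),
                              "last_frequent": self.index}
            else:
                self.pt[p]["last_frequent"] = self.index
        self.slides.append(slide)
        # Expire the oldest slide, then prune patterns no longer frequent in any slide.
        if len(self.slides) > self.n:
            expired = self.slides.popleft()
            oldest = self.index - self.n + 1
            for p in list(self.pt):
                self.pt[p]["count"] -= count(p, expired)
                if self.pt[p]["last_frequent"] < oldest:
                    del self.pt[p]
        need = self.min_support * sum(len(s) for s in self.slides)
        return {p: e["count"] for p, e in self.pt.items() if e["count"] >= need}
```

Keeping the raw slides in memory is what allows immediate verification here; the delayed variant mentioned above trades that work for a bounded reporting delay.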
SWIM – Design Choices
Data structure for the Si’s: FP-tree [Han’ 00]
Data structure for PT: FP-tree
Mining algorithm: FP-growth
Count/update frequencies: naïve? Hash-tree?
  Counting is the bottleneck
  New and improved counting method named Conditional Counting
Conditional Counting
Verification
  Given a set of transactions T, a set of patterns P, and a threshold s
  Goal: find the exact frequency of each p in P w.r.t. T, IF AND ONLY IF its frequency is >= s
  If s = 0, verification = counting, but if s > 0 extra computation can be avoided
Proposed fast verifiers:
  DTV, DFV, hybrid
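The difference between verification and counting can be shown with a naive verifier. This sketch is not DTV or DFV; it scans the transactions directly and only illustrates where a threshold s > 0 lets work be skipped. The function name and the use of None for "below threshold" are assumptions made for the example.

```python
def verify(transactions, patterns, min_count):
    """Return {pattern: exact count} when count >= min_count, else {pattern: None}."""
    n = len(transactions)
    result = {}
    for pattern in patterns:
        items = set(pattern)
        c = 0
        for seen, t in enumerate(transactions, start=1):
            if items <= set(t):
                c += 1
            # Even if every remaining transaction matched, the threshold
            # could no longer be reached, so stop scanning for this pattern.
            if c + (n - seen) < min_count:
                break
        result[pattern] = c if c >= min_count else None
    return result

T = [("a", "b", "c"), ("a", "b"), ("a", "c"), ("b", "c"), ("a", "b", "c")]
P = [("a", "b"), ("a", "b", "c"), ("d",)]
verified = verify(T, P, min_count=3)  # ("a", "b") -> 3, the other two -> None
counted = verify(T, P, min_count=0)   # plain counting: 3, 2 and 0
```

With `min_count=0` the early exit never fires and the function degenerates to counting, which mirrors the s = 0 case above.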
Conditionalization on FP-trees
[Figure: an FP-tree and its conditional trees FP-tree | g and FP-tree | gd.]
Attempt I: DTV (Double-Tree Verifier)
Not only conditionalize the FP-tree, but also the pattern tree
[Figure: DTV example; the header table lists the items a, b, c, d, e, f, g, h. Panels: initial pattern tree (all counts unknown); pattern tree | “g”; pattern tree | “g” after verification against the FP-tree (root:4, b:3, d:2); filling the original pattern tree using reverse pointers (g:2 and g:4 resolved); FP-tree (nodes a:5, b:5, c:5, d:4, e:1, f:1, g:2, b:1, e:1, g:1, h:1, g:1); FP-tree | g (nodes a:3, b:3, c:3, d:2, b:1, e:1).]
Conditional pattern base of “g”: (a:2,b:2,c:2,d:2) (a:1,b:1,c:1) (b:1,e:1)
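The conditional pattern base of “g” can be reproduced with a few lines. This is a sketch under assumptions: it works directly on a transaction list rather than on an FP-tree with node links, and the six transactions are hypothetical, chosen so that they yield the FP-tree node counts and the conditional pattern base listed above.

```python
from collections import Counter

ORDER = "abcdefgh"  # item order of the header table

def conditional_pattern_base(transactions, item, order=ORDER):
    """Prefixes (items preceding `item` in `order`) of the transactions containing `item`."""
    rank = {x: i for i, x in enumerate(order)}
    base = Counter()
    for t in transactions:
        if item in t:
            prefix = tuple(sorted((x for x in set(t) if rank[x] < rank[item]),
                                  key=rank.get))
            base[prefix] += 1
    return base

def conditional_item_counts(base):
    """Per-item totals in the conditional base, i.e. the item counts of FP-tree | item."""
    counts = Counter()
    for prefix, weight in base.items():
        for x in prefix:
            counts[x] += weight
    return counts

# Hypothetical transactions consistent with the figure.
T = [tuple("abcde"), tuple("abcdf"), tuple("abcdg"),
     tuple("abcdg"), tuple("abcg"), tuple("begh")]

base_g = conditional_pattern_base(T, "g")
counts_g = conditional_item_counts(base_g)
```

`base_g` holds the three prefixes with weights 2, 1 and 1, and `counts_g` gives a:3, b:4, c:3, d:2, e:1; the b total of 4 is split in the figure across the two b nodes of FP-tree | g (b:3 under a, and b:1 under the root).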
DTV (cont.)
Scales up well on large trees
  Much pruning from conditionalization
However, for smaller trees:
  Less pruning
  Overhead of conditionalization is not always worth it
Attempt II: DFV (Depth-First Verifier)
Each node n in PT corresponds to a unique pattern pn, therefore:
For each node n in PT:
  Traverse the FP-tree and count the occurrences of pn in depth-first order
  Keep the nodes marked as FAIL/OK while visiting their children
  Utilize these marks for optimized execution
More efficient when both trees are small
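The depth-first counting step can be sketched on a small prefix tree. This is a simplified illustration: each pattern is counted independently, so the FAIL/OK marks that let DFV reuse work between a pattern-tree node and its children are not modeled, and the tree has no header table or node links. All names are hypothetical.

```python
class Node:
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_tree(transactions, order):
    """Build an FP-tree-like prefix tree with items sorted in `order`."""
    rank = {x: i for i, x in enumerate(order)}
    root = Node()
    for t in transactions:
        node = root
        for x in sorted(set(t), key=rank.get):
            node = node.children.setdefault(x, Node(x))
            node.count += 1
    return root, rank

def count_pattern(root, rank, pattern):
    """Depth-first count of the transactions in the tree that contain `pattern`."""
    wanted = sorted(pattern, key=rank.get)
    if not wanted:
        return sum(c.count for c in root.children.values())

    def visit(node, k):
        # k = number of pattern items already matched on the path to `node`
        total = 0
        for child in node.children.values():
            if child.item == wanted[k]:
                if k + 1 == len(wanted):
                    total += child.count  # every transaction through here matches
                else:
                    total += visit(child, k + 1)
            elif rank[child.item] < rank[wanted[k]]:
                total += visit(child, k)  # wanted[k] may still appear deeper
            # Otherwise the child's item comes after wanted[k] in the order,
            # so wanted[k] cannot occur anywhere below it: prune.
        return total

    return visit(root, 0)

T = [tuple("abcde"), tuple("abcdf"), tuple("abcdg"),
     tuple("abcdg"), tuple("abcg"), tuple("begh")]
root, rank = build_tree(T, "abcdefgh")
g_count = count_pattern(root, rank, ("g",))
bdg_count = count_pattern(root, rank, ("b", "d", "g"))
```

The pruning branch relies on the fixed item order along every path, which is the same property that makes conditionalization possible.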
DFV (cont.)
Comparing Verifiers
Hybrid Verifier
Start by performing DTV recursively until the resulting trees are small enough,
then switch to DFV
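The switch-over idea can be sketched as a recursive function. This is a loose illustration under several assumptions, not the authors' verifier: the data is a weighted multiset of sorted transactions instead of an FP-tree, patterns are a plain list instead of a pattern tree, direct counting stands in for DFV, items are assumed to sort alphabetically, and the cut-off parameter `small` is hypothetical.

```python
from collections import Counter, defaultdict

def naive_counts(db, patterns):
    """Direct counting over a weighted database {sorted transaction tuple: weight}."""
    return {p: sum(w for t, w in db.items() if set(p) <= set(t)) for p in patterns}

def hybrid_verify(db, patterns, min_count, small=4):
    """Exact count for each pattern if it is >= min_count, otherwise None.

    Patterns must be non-empty tuples sorted in the same order as the transactions.
    """
    patterns = list(patterns)
    if len(db) <= small:
        # Small input: conditionalizing further is not worth the overhead.
        return {p: (c if c >= min_count else None)
                for p, c in naive_counts(db, patterns).items()}
    by_last = defaultdict(list)
    for p in patterns:
        by_last[p[-1]].append(p)
    result = {}
    for x, group in by_last.items():
        # Conditionalize the data on x: keep the prefix before x of each
        # transaction that contains x.
        cond = Counter()
        for t, w in db.items():
            if x in t:
                cond[t[:t.index(x)]] += w
        total = sum(cond.values())
        if total < min_count:
            # support(x) bounds every pattern ending in x, so prune them all.
            for p in group:
                result[p] = None
            continue
        # Conditionalize the patterns on x as well: strip x and recurse.
        rest = [p[:-1] for p in group if len(p) > 1]
        sub = hybrid_verify(cond, rest, min_count, small) if rest else {}
        for p in group:
            result[p] = total if len(p) == 1 else sub[p[:-1]]
    return result

T = ["abcde", "abcdf", "abcdg", "abcdg", "abcg", "begh"]
db = Counter(tuple(sorted(set(t))) for t in T)
P = [("g",), ("b", "g"), ("b", "d", "g"), ("a", "b", "g"), ("e", "g"), ("h",), ("d", "f")]
checked = hybrid_verify(db, P, min_count=2, small=2)
```

Each recursive call strips one item from the patterns it receives, so the recursion is never deeper than the longest pattern, regardless of how long the transactions are.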
Comparing Verifiers
Verifiers vs. Hash Trees (Counting)
SWIM with Hybrid Verifier (I)
SWIM with Hybrid Verifier (II)
Applications of Verifiers (I)
Improving counting in static mining methods
  Candidate-generation (and pruning) phase
Example: Toivonen approach [Toivonen’ 96]
1. Maintain a boundary of smallest non-frequent and largest frequent patterns
2. Check the frequency of boundary patterns
Applications of Verifiers (II)
In case resources are limited:
1. Mine once
2. Keep monitoring the current patterns (by verifying them), since verifying is computationally cheaper
3. Whenever a significant concept shift is detected, mine again!
Monitoring/Concept Shift Detection
Verification is much faster than mining (when it suffices)
Privacy Preserving Applications
Random noise methods:
  Add many fake items into the transactions to increase the variance [Evfimievski’ 03]
  Overhead: long transactions (in the order of the number of items)
Lemma: the max depth of the recursion in DTV is <= the max length of the patterns to be verified
  Run-time is independent of the transaction length
Optimization when integrated into a DSMS
Stream Mill Miner (SMM) provides integrated support for online mining algorithms through User-Defined Aggregates (UDAs)
Definition of Mining Models
Constraints used for optimization:
  Max allowed delay
  Interesting/uninteresting items
  Interesting/uninteresting patterns
These are turned from post-conditions into pre-conditions
Conclusions
1. SWIM for incremental mining over large windows
  More efficient than existing approaches on data streams
  Trade-off between real-time response, efficiency, memory, etc.
2. Efficient algorithms for verification/conditional counting
  DTV, DFV, and Hybrid
  These can be used to speed up many applications: incremental mining, enhancing static algorithms, privacy-preserving techniques, …
Implementations of SWIM and the verifiers are available at
http://wis.cs.ucla.edu/swim/index.htm
References
[Agrawal’ 94] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in VLDB, 1994, pp. 487–499.
[Cheung’ 03] W. Cheung and O. R. Zaiane, “Incremental mining of frequent patterns without candidate generation or support constraint,” in IDEAS, 2003.
[Chi’ 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz, “Moment: Maintaining closed frequent itemsets over a stream sliding window,” in ICDM, November 2004.
[Evfimievski’ 03] A. Evfimievski, J. Gehrke, and R. Srikant, “Limiting privacy breaches in privacy preserving data mining,” in PODS, 2003.
[Han’ 00] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in SIGMOD, 2000.
[Koh’ 04] J. Koh and S. Shieh, “An efficient approach for maintaining association rules based on adjusting FP-tree structures,” in DASFAA, 2004.
[Leung’ 05] C.-S. Leung, Q. Khan, and T. Hoque, “CanTree: A tree structure for efficient incremental mining of frequent patterns,” in ICDM, 2005.
[Toivonen’ 96] H. Toivonen, “Sampling large databases for association rules,” in VLDB, 1996, pp. 134–145.
Thank you!
Questions?