
A Shared Execution Strategy for Multiple Pattern Mining Requests Over Streaming Data

Di Yang, Elke A. Rundensteiner and Matthew O. Ward, Worcester Polytechnic Institute. VLDB 2009, Lyon, France.


  • Slide 1

A Shared Execution Strategy for Multiple Pattern Mining Requests Over Streaming Data

Di Yang, Elke A. Rundensteiner and Matthew O. Ward, Worcester Polytechnic Institute. VLDB 2009, Lyon, France.

This work is supported under NSF grants CCF-0811510, IIS-0119276, IIS-0414380.

  • Slide 2

Motivation: data streams are everywhere

Stock market: transaction info streams to stock analysts, who ask: are there any patterns in transactions over the past hour?
Battlefield: position info streams to the commander, who asks: where are the main clusters formed by enemy warcraft?

  • Slide 3

Motivation: pattern mining requests tend to be parameterized

Example 1: give me the stocks that dropped significantly in the most recent transactions. Parameters: 10%, 30% or 50% to the original price within the last 10, 30, or 60 minutes.
Example 2: give me the major clusters formed by enemy warcraft. Parameters: size of n warcraft; density of m warcraft / mi².

  • Slide 4

Motivation: best parameter settings are hard to determine

Multiple analysts may raise multiple queries with different parameter settings:
"I need info for any cluster sized 5 or higher."
"Clusters formed by fighter planes need to be updated every 5 seconds."
"I only care about the clusters that are formed by more than 20 warcraft."
"Clusters formed by boom carriers need to be updated every 10 seconds."

A single analyst may raise multiple queries with different parameter settings:
"Parameter settings? I probably know. But can I try different combinations of them?"

Problem: given many similar queries that differ only in their parameter settings, how can they be answered efficiently?

  • Slide 5

Outline
1. Motivation
2. Problem Definition & State-of-Art
3. Basic: Single Query Processing Strategies
4. Proposed Solution: Multi-Query Sharing
5. Experimental Study
6. Conclusion

  • Slide 6

Target Pattern Type

[Figure: example data points illustrating core objects, edge objects, and noise.]

Core Object: has no fewer than cnt neighbors within distance range of it.
Edge Object: not a core object, but a neighbor of a core object.
Noise: not a core object and not a neighbor of any core object.

A Density-Based Cluster (DB-Cluster) is a maximum group of connected core objects and the edge objects attached to them.

Why this pattern type: popular and well known; finds clusters of arbitrary shape; allows unclassified objects (noise); deterministic.

  • Slide 7

Clustering in Sliding Windows Over Stream

[Figure: two sliding windows, W1 and W2, over the same stream.]

Applications include:
monitoring congestion (clusters) in traffic
looking for intensive transaction areas (clusters) in stock trades
identifying malicious attacks (clusters) in networks

  • Slide 8

Problem Definition

Input: a query group QG of multiple density-based clustering queries over the same input stream, with arbitrary parameter settings.
Goal: to minimize both the average processing time and the peak memory space needed by the system.

[Figure: template density-based clustering query over sliding windows, with its pattern-specific and window-specific parameters marked.]

  • Slide 9

State-Of-Art

Previous work on multiple query sharing concentrates on traditional SPJ and aggregation operators [Arasu04][Hammad03][Wang06].
There is no existing solution for multiple query optimization of complex pattern mining requests, such as clustering, in streaming environments.
Existing algorithms for a single density-based clustering query over sliding windows include Incremental DBSCAN, Exact-N, Abstract-C and Extra-N [Ester98][Yang09].
Executing the algorithms above independently for each query (the naive solution) is not scalable.

  • Slide 10

Outline
1. Motivation
2. Problem Definition & State-of-Art
3. Basic: Single Query Processing Strategies
4. Proposed Solution: Multi-Query Sharing
5. Experimental Study
6. Conclusion

  • Slide 11

Density-Based Clustering on Data Streams

In highly dynamic streaming environments, the options are re-computation and incremental cluster maintenance.
Extra-N proposed a hybrid neighbor relationship (neighborship) mechanism to represent the cluster structure.
Maintain Exact Neighborships (neighbor lists) for non-core objects.
Maintain Abstract Neighborships (cluster memberships) for core objects.
A general concept of Predicted View is applied to update the cluster structure efficiently.
Key: a compact and easily maintainable cluster representation.

  • Slide 12

Concept of Predicted Views

[Figure: window size=16, slide size=4, time=1. The Current View of W0 over data points 1 to 16 is shown together with the Predicted Views of W1, W2 and W3, each holding only the points that will still be in that window.]

  • Slide 13

Update Predicted Views

[Figure: window size=16, slide size=4, time=1. New data points 17 to 20 arrive. The view of W0 becomes the Expired View, W1 becomes the Current View, and the Predicted Views of W2, W3 and W4 are updated with the new points.]

  • Slide 14

Outline
1. Motivation
2. Problem Definition & State-of-Art
3. Basic: Single Query Processing Strategies
4. Proposed Solution: Multiple Query Sharing
5. Experimental Study
6. Conclusion

  • Slide 15

[Table: roadmap of the cases covered. Each pattern-specific parameter (range, cnt) and each window-specific parameter (win, slide) is either fixed or arbitrary.]

  • Slide 16

Arbitrary Pattern-Specific Parameter Case: one of the pattern-specific parameters (range, cnt) arbitrary, the other fixed

Is there any relationship between the cluster sets identified by such queries?

  • Slide 17

Growth Property among DB-cluster Sets

If every cluster Ci in Clu_Set1 is contained in some cluster in Clu_Set2, then Clu_Set2 is a Growth of Clu_Set1.

[Figure: Independent Cluster Structure Storage compared with Hierarchical Cluster Structure Storage, in which clusters such as c4, c5 and c6 grow into larger clusters.]

  • Slide 18

Benefits of Hierarchical Cluster Structure

Benefits for Memory Resources: the memory space needed to store the cluster sets identified by multiple queries in QG is independent of |QG|.
Benefits for Computational Resources: the multiple cluster sets stored in the hierarchical cluster structure (which are usually similar) can be maintained incrementally rather than independently.

  • Slide 19

Arbitrary Pattern-Specific Parameter Case: one of the pattern-specific parameters (range, cnt) arbitrary, the other fixed

The Growth property holds transitively among the cluster sets identified by multiple queries that differ in the arbitrary parameter and share the same value of the fixed one.

  • Slide 20

Arbitrary Pattern-Specific Parameter Case: one of the pattern-specific parameters (range, cnt) arbitrary, the other fixed

Idea: the Growth property holds transitively.
Solution: a single integrated representation of predicted views (IntView).

  • Slide 21

[Table: roadmap of parameter cases, repeated from Slide 15.]

  • Slide 22

Arbitrary Pattern-Specific Parameter Case: the fixed and arbitrary roles of range and cnt exchanged

The Growth property holds transitively among the cluster sets identified by multiple queries that differ in the arbitrary parameter and share the same value of the fixed one.

  • Slide 23

Arbitrary Pattern-Specific Parameter Case: the fixed and arbitrary roles of range and cnt exchanged

Idea: the Growth property holds transitively.
Solution: a single integrated representation of predicted views (IntView).

  • Slide 24

[Table: roadmap of parameter cases, repeated from Slide 15.]

  • Slide 25

Arbitrary Pattern-Specific Parameter Case: arbitrary range, arbitrary cnt

The Growth property holds between two queries Qi and Qj under a pair of conditions, one relating Qi.range to Qj.range and one relating Qi.cnt to Qj.cnt.
A Predicted View Tree serves all queries.

[Figure: IntView tree in which the cluster sets identified by Q1, Q2, Q3, Q4 and Q5 are linked by grow relationships.]

  • Slide 26

[Table: roadmap of parameter cases, repeated from Slide 15.]

  • Slide 27

Arbitrary Window-Specific Parameter Case: arbitrary win, fixed slide

In this case, maintaining a single query is sufficient to answer all queries.
The predicted views of the query Qi with the largest win cover everything the other queries need.

[Figure: shared timeline over data points 1 to 16 with windows W0 to W3. Q1.win=16, slide=4; Q2.win=12, slide=4; Q3.win=8, slide=4; Q4.win=4, slide=4. The answers for Q1, Q2, Q3 and Q4 at T=16 are all read from the views maintained for Q1.]

  • Slide 28

[Table: roadmap of parameter cases, repeated from Slide 15.]

  • Slide 29

Arbitrary Window-Specific Parameter Case: arbitrary slide, arbitrary win

Use a single meta query with the largest window size and an adaptive slide size to represent all queries.

[Figure: Grow.]

  • Slide 30

[Table: roadmap of parameter cases, repeated from Slide 15.]

  • Slide 31

General Case: all four parameters arbitrary

Our proposed techniques for the arbitrary pattern parameter cases (intra-window optimization) and for the arbitrary window parameter cases (inter-window optimization) are orthogonal to each other.
Final integrated structure for QG.

[Figure: the integrated view over pattern parameters and the integrated view over window parameters (IntView_W) combined into a single IntView.]

  • Slide 32

Experimental Study

Alternative Methods:
1. Incremental DBSCAN [Ester98]
2. Incremental DBSCAN with rqs (range query search sharing)
3. Extra-N [Yang09]
4. Extra-N with rqs (range query search sharing)
5. Chandi (our solution)

Real Streaming Data:
1. GMTI data recording information about moving vehicles [Mitre08].
2. STT data recording stock transactions from NYSE [INETATS08].

Measurements:
1. Average processing time for each tuple.
2. Memory footprint.

  • Slide 33

Evaluation for Performance

[Plots: Arbitrary Pattern Parameter Case; Arbitrary Window Parameter Case.]

  • Slide 34

Evaluation for Performance

[Plots: Arbitrary All Four Parameter Cases.]

  • Slide 35

Conclusion:
First effort to apply multi-query optimization techniques to complex mining requests.
General principles proposed, such as incremental pattern representation and the meta query strategy, are applicable to other pattern types.
First full-sharing framework, called Chandi, for multiple density-based clustering queries over streaming windows.
Experimental study shows Chandi has excellent efficiency and scalability.

Future Work:
Other pattern types: other cluster types, outliers, association rules.
Pattern management.
Visualization and interaction.

  • Slide 36

The End. Thanks.
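
The core, edge and noise definitions on Slide 6 and the Growth property on Slide 17 can be illustrated with a small static sketch. This is not Chandi or Extra-N: it recomputes clusters from scratch on a fixed point set, and the names `neighbors`, `db_clusters` and `is_growth` are hypothetical. It also assumes that an object does not count as its own neighbor and that the relaxed query has a range no smaller and a cnt no larger than the strict one, a direction that follows from the definitions but is chosen here for illustration.

```python
from itertools import combinations
from math import dist


def neighbors(points, rng):
    """Map each point index to the set of other indices within distance rng."""
    nbrs = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= rng:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs


def db_clusters(points, rng, cnt):
    """Return (clusters, noise); each cluster is a set of point indices."""
    nbrs = neighbors(points, rng)
    # Assumption: a point is not counted as its own neighbor.
    core = {i for i, ns in nbrs.items() if len(ns) >= cnt}
    clusters, seen = [], set()
    for start in sorted(core):
        if start in seen:
            continue
        # Flood-fill over core objects that are connected through neighborships.
        group, stack = set(), [start]
        while stack:
            p = stack.pop()
            if p in seen:
                continue
            seen.add(p)
            group.add(p)
            stack.extend(q for q in nbrs[p] if q in core and q not in seen)
        # Edge objects: non-core neighbors of the core objects in this group.
        edges = {q for p in group for q in nbrs[p] if q not in core}
        clusters.append(group | edges)
    clustered = set().union(*clusters) if clusters else set()
    noise = set(range(len(points))) - clustered
    return clusters, noise


def is_growth(clu_set1, clu_set2):
    """True if every cluster in clu_set1 is contained in some cluster of clu_set2."""
    return all(any(c1 <= c2 for c2 in clu_set2) for c1 in clu_set1)


# Hypothetical data: two tight groups on a line plus one far-away point.
points = [(0, 0), (1, 0), (2, 0), (5, 0), (6, 0), (7, 0), (20, 0)]
strict, strict_noise = db_clusters(points, rng=1, cnt=2)
relaxed, relaxed_noise = db_clusters(points, rng=3, cnt=2)

print(is_growth(strict, relaxed))  # True: both strict clusters merge into one
print(is_growth(relaxed, strict))  # False: the merged cluster fits in neither
```

With these points, the strict query finds two clusters and the relaxed query merges them into one, so the relaxed cluster set is a Growth of the strict one. A hierarchical store would keep the two small clusters once and record the merged cluster as their parent.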
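
The window-sharing idea on Slides 12 and 27 can be sketched in terms of window membership alone. The sketch below is an illustration under stated assumptions, not the Chandi implementation: windows are count-based, data points are identified by their 1-based arrival position, every query window differs from the largest one by a multiple of the shared slide, and the function names `predicted_views` and `view_index_for` are hypothetical. It tracks only which points each view holds, not the cluster structure maintained inside a view.

```python
def predicted_views(now, win, slide):
    """List the point positions held by W_0, W_1, ... at time `now`.

    views[0] is the current window; views[k] is the predicted view of the
    window that starts k slides later, restricted to points already arrived.
    """
    views = []
    k = 0
    while k * slide < win:
        start = now - win + k * slide + 1
        views.append(range(start, now + 1))
        k += 1
    return views


def view_index_for(query_win, max_win, slide):
    """Pick which view of the largest-window query answers a smaller window."""
    if not 0 < query_win <= max_win:
        raise ValueError("query window must be positive and at most max_win")
    if (max_win - query_win) % slide != 0:
        raise ValueError("window sizes must differ by a multiple of slide")
    return (max_win - query_win) // slide


# The setting of Slide 27: four queries sharing slide=4, evaluated at T=16.
views = predicted_views(now=16, win=16, slide=4)
for name, q_win in [("Q1", 16), ("Q2", 12), ("Q3", 8), ("Q4", 4)]:
    view = views[view_index_for(q_win, max_win=16, slide=4)]
    print(name, "reads points", view.start, "to", view.stop - 1)
```

Only the views of the query with the largest window are maintained; each smaller-window query is answered by reading the view whose start lines up with its own window.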