Page 1

A Unified Framework Supporting Interactive Exploration of Density-Based Clusters in Streaming Windows

Di Yang, Zhengyu Guo, Elke A. Rundensteiner and Matthew O. Ward

Worcester Polytechnic Institute

EDBT 2010, Submitted

This work is supported under NSF grants CCF-0811510, IIS-0119276, IIS-0414380.

Page 2

What are Density-Based Clusters?

Clusters that are defined by individual data points (tuples) and their local “neighborhoods”.

How are they different from K-median style clustering?

[Figure: example clusterings of the same kind of data; one panel shows Cluster 1 and Cluster 2, the other shows Cluster 1 to Cluster 4.]

Page 3

Formal Definition

[Figure: example data points labeled with tuple ids, illustrating core objects, edge objects, and noise.]

Core Object: has more than θcnt neighbors within distance θrange of it.

Edge Object: not a core object, but a neighbor of a core object.

Noise: not a core object and not a neighbor of any core object.

A Density-Based Cluster (DB-Cluster) is a maximal group of connected core objects and the edge objects attached to them.
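
The definitions above can be turned into a small batch procedure. The following is a minimal sketch, not the authors' implementation: the function name `db_clusters`, the use of Euclidean distance, and the rule that an edge object adjacent to several clusters joins the first one found are assumptions made for illustration. It labels each point as core, edge, or noise, then grows clusters by a breadth-first search over connected core objects.

```python
from collections import deque
from math import dist


def db_clusters(points, theta_range, theta_cnt):
    """Label points as core/edge/noise and group them into DB-clusters.

    Hypothetical helper: Euclidean distance and first-come edge assignment
    are assumptions, not taken from the slides.
    """
    n = len(points)
    # Neighbors of i: every other point within theta_range of it.
    nbrs = [
        [j for j in range(n)
         if j != i and dist(points[i], points[j]) <= theta_range]
        for i in range(n)
    ]
    # Core object: more than theta_cnt neighbors (the slide's wording).
    core = [len(nbrs[i]) > theta_cnt for i in range(n)]

    cluster = [None] * n
    next_id = 0
    for seed in range(n):
        if not core[seed] or cluster[seed] is not None:
            continue
        # Breadth-first search over connected core objects.
        cluster[seed] = next_id
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            for j in nbrs[i]:
                if cluster[j] is None:
                    cluster[j] = next_id      # core or edge object joins
                    if core[j]:
                        queue.append(j)       # only core objects extend it
        next_id += 1

    roles = [
        "core" if core[i] else "edge" if cluster[i] is not None else "noise"
        for i in range(n)
    ]
    return roles, cluster


pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
roles, cluster = db_clusters(pts, theta_range=1.0, theta_cnt=1)
print(roles)    # ['core', 'core', 'core', 'core', 'edge', 'noise']
print(cluster)  # [0, 0, 0, 0, 0, None]
```

This recomputes everything from scratch, which is exactly what the streaming algorithms discussed later try to avoid; it is only meant to make the three object roles concrete.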

Page 4

Cluster Detection in Sliding Windows

[Figure: a stream of tuples 1 to 8 covered by two successive sliding windows, W1 and W2.]

Template Density-Based Clustering Query Over Sliding Windows

[Figure: query template with its parameters marked as pattern-specific or window-specific.]

Page 5

Application Examples:

Stock Market: transaction info in, clusters out to Stock Analysts.

Are there intensive-transaction areas in the last hour of transactions?

Battlefield: position info in, clusters out to the Commander.

Where are the main clusters formed by enemy warcraft?

Page 6

State-of-Art

Existing algorithms for density-based clustering queries over sliding windows include Incremental DBSCAN, Exact-N, Abstract-C and Extra-N [Ester98] [Yang09].

Extra-N becomes inefficient as the win/slide ratio increases.

No evolution semantics are defined for how density-based clusters change over time.

No existing system allows interactive exploration of density-based clusters in streaming windows.

Page 7

Goals

1. A more efficient density-based clustering algorithm over streams.

2. Evolution semantics that intuitively explain cluster changes.

3. A visualized pattern space allowing interactive exploration of clusters.

Page 8

Review: Existing Algorithm, Extra-N

In highly dynamic streaming environments:

Re-computation.

Incremental cluster maintenance.

Extra-N [Yang09] proposed a hybrid neighbor relationship (neighborship) mechanism to represent cluster structure.

Maintain “Exact Neighborships” (neighbor lists) for non-core objects.

Maintain “Abstract Neighborships” (cluster memberships) for core objects.

A general concept of “Predicted View” is applied to efficiently update the cluster structure.

Key: a compact and easily maintainable cluster representation.

Page 9

Concept of Predicted Views

window size=16, slide size=4, time=1

[Figure: tuples 1 to 16 on a timeline covered by windows W0 to W3, with four panels: Current View of W0 (tuples 1 to 16), Predicted View of W1 (tuples 5 to 16), Predicted View of W2 (tuples 9 to 16), and Predicted View of W3.]
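
To make the figure's example concrete, here is a minimal sketch of which tuples each predicted view covers, assuming a count-based window in which every slide drops the oldest `slide` tuples. The function name `predicted_views` is hypothetical, and the sketch tracks only tuple membership; the predicted views in Extra-N also carry the cluster structure over those tuples, which is not modeled here.

```python
def predicted_views(window, slide):
    """Map k to the tuples of the current window still alive k slides ahead.

    Assumption: count-based window, oldest tuples first in `window`.
    """
    views = {}
    k = 0
    while k * slide < len(window):
        # After k slides, the k*slide oldest tuples have expired.
        views[k] = window[k * slide:]
        k += 1
    return views


w0 = list(range(1, 17))              # tuples 1..16, window size 16
views = predicted_views(w0, slide=4)
for k, tuples in views.items():
    print(f"W{k}: tuples {tuples[0]}..{tuples[-1]}")
# W0: tuples 1..16
# W1: tuples 5..16
# W2: tuples 9..16
# W3: tuples 13..16
```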

Page 10

Update Predicted Views

[Figure: the window slides and new data points 17 to 20 arrive. The timeline shows tuples 5 to 20 covered by windows W1 to W4. Panels: Expired View of W0 (window size=16, slide size=4, time=1), Current View of W1, Predicted View of W2, Predicted View of W3, and Predicted View of W4. The new data points 17 to 20 appear in each of the four remaining views.]

Page 11

Inefficiency of Extra-N

When the Win/Slide ratio increases (for example, Win=10000 and Slide=10), a large number of predicted views need to be maintained independently.

This places a heavy burden on both CPU and memory resources.
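
A quick worked calculation shows how fast this grows. The sketch below assumes one predicted view per future window that still overlaps the current one, and, as a rough illustrative cost model only, one stored entry per surviving tuple per view; the function name `predicted_view_cost` and the cost model are assumptions, not figures from the slides.

```python
def predicted_view_cost(win, slide):
    """Return (number of views, total per-tuple entries across all views).

    Assumed cost model: each view keeps one entry per tuple it covers.
    """
    views = -(-win // slide)  # ceiling division
    entries = sum(win - k * slide for k in range(views))
    return views, entries


print(predicted_view_cost(16, 4))       # (4, 40)
print(predicted_view_cost(10000, 10))   # (1000, 5005000)
```

With Win=10000 and Slide=10, a thousand views are alive at once, so every arriving tuple may have to be reflected in up to a thousand independently maintained structures.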

Page 12

Proposed Solution: IWIN

Is there any relationship among the clusters identified?

Page 13

“Growth Property” among DB-cluster Sets

[Figure: Clu_Set1 grows into Clu_Set2; clusters c4, c5, and c6 are shown under Independent Cluster Structure Storage and under Hierarchical Cluster Structure Storage.]

If every cluster Ci in Clu_Set1 is “contained” by one cluster in Clu_Set2, then Clu_Set2 is a “Growth” of Clu_Set1.
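
The growth property stated above can be checked directly on two cluster sets. This is a minimal sketch that reads “contained” as the subset relation on tuple ids and reads the condition as applying to every cluster in Clu_Set1; both readings, and the function name `is_growth`, are assumptions for illustration.

```python
def is_growth(clu_set1, clu_set2):
    """True if every cluster in clu_set1 is a subset of some cluster in clu_set2.

    Clusters are represented as Python sets of tuple ids (an assumption).
    """
    return all(any(c1 <= c2 for c2 in clu_set2) for c1 in clu_set1)


clu_set1 = [{1, 2}, {3, 4}, {7}]
clu_set2 = [{1, 2, 3, 4, 5}, {7, 8}]
print(is_growth(clu_set1, clu_set2))  # True
print(is_growth(clu_set2, clu_set1))  # False
```

When the property holds, the smaller cluster set can be stored as a refinement of the larger one instead of as a separate copy, which is the intuition behind a hierarchical cluster structure.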

Page 14

Integrated Vs. Independent Maintenance of Predicted Views

[Figure: IWIN: integrated maintenance. Extra-N: independent maintenance.]

Page 15

Benefits of Integrated Maintenance

Benefits for Memory Resources: The memory space needed to store the cluster sets identified by multiple queries in QG is independent of |QG|.

Benefits for Computational Resources: Multiple cluster sets stored in the hierarchical cluster structure (which are usually similar) can be maintained incrementally, rather than independently.

IWIN outperforms Extra-N in both CPU and memory utilization.

Page 16

Goals

1. A more efficient density-based clustering algorithm over streams.

2. Evolution semantics that intuitively explain cluster changes.

3. A visualized pattern space allowing interactive exploration of clusters.

Page 17

Why do we need evolution semantics?

Analysts need to know how clusters change over time.

This is hard to observe by looking at the clusters alone (even with visualization).

Commander:

History: Did any clusters merge?

Now: Are there any new clusters?

Future: Is any cluster about to break apart?

Page 18

Proposed Semantics

Single-Step Evolutions: birth, termination, split, merge, preserve/expand/shrink

Multi-Step Evolutions: split-expand, split-merge, shrink-split
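
The slides name the single-step evolution types without defining them formally. The following minimal sketch shows one way they could be told apart, under loudly stated assumptions: clusters in consecutive windows are related when they share at least one tuple id, and a one-to-one match is called expand, shrink, or preserve by comparing cluster sizes. The function name `single_step_evolutions` and these rules are illustrative, not the authors' definitions.

```python
def single_step_evolutions(old, new):
    """Classify changes between two cluster sets (lists of sets of tuple ids).

    Returns (kind, old_indices, new_indices) triples. The overlap-based
    matching and size-based expand/shrink rule are assumptions.
    """
    # Successors of each old cluster and predecessors of each new cluster.
    succ = {i: [j for j, n in enumerate(new) if o & n] for i, o in enumerate(old)}
    pred = {j: [i for i, o in enumerate(old) if o & n] for j, n in enumerate(new)}

    events = []
    for j, ps in pred.items():
        if not ps:
            events.append(("birth", (), (j,)))
        elif len(ps) > 1:
            events.append(("merge", tuple(ps), (j,)))
    for i, ss in succ.items():
        if not ss:
            events.append(("termination", (i,), ()))
        elif len(ss) > 1:
            events.append(("split", (i,), tuple(ss)))
        elif len(pred[ss[0]]) == 1:
            # One-to-one match: compare sizes.
            j = ss[0]
            if len(new[j]) > len(old[i]):
                kind = "expand"
            elif len(new[j]) < len(old[i]):
                kind = "shrink"
            else:
                kind = "preserve"
            events.append((kind, (i,), (j,)))
    return events


old = [{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10}, {11, 12}, {13, 14}]
new = [{3, 4, 17}, {5, 6}, {7, 8}, {9, 10, 11, 12, 18}, {19, 20}]
for event in single_step_evolutions(old, new):
    print(event)
```

Multi-step evolutions such as split-expand would then be sequences of these single-step events observed over consecutive window slides.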

Page 19

How to Compute

Extract Predicted Evolution (before window slide)

Update Evolution (after window slide)

[Figure: example clusters annotated with the evolutions preserve, split, preserve, and shrink.]

Page 20

Conclusion for Proposed Semantics

1. Intuitively describe cluster evolution over time.

2. Easily maintainable: can be computed on-the-fly during cluster maintenance.

Page 21

Goals

1. A more efficient density-based clustering algorithm over streams.

2. Evolution semantics that intuitively explain cluster changes.

3. A visualized pattern space allowing interactive exploration of clusters.

Page 22

Outline

1. What is Neighbor-Based Pattern Detection

2. State-of-Art

3. Potential Solutions & Their Inefficiency

4. Proposed Solution: Extra-N

5. Experimental Study

6. Conclusion

Page 23

Why needed?

Analysts need to navigate along the time axis to learn the current state, review the history, and predict the near future.

Example: how are the two clusters in the current window related to those detected 30 minutes back?

Analysts need to study the clusters and their evolution at different abstraction levels.

Example: for routine traffic monitoring, only the positions of major clusters need to be reported; when an accident happens, specific information about cluster members needs to be reported.

Page 24

Proposed Pattern Space

Page 25

Evaluation for IWIN

Alternative Methods:

1. Incremental DBSCAN [Ester98]

2. Extra-N [Yang09]

3. IWIN

Real Streaming Data:

1. GMTI data recording information about moving vehicles [Mitre08].

2. STT data recording stock transactions from NYSE [INETATS08].

Measurements:

1. Average processing time for each tuple.

2. Memory footprint.

Page 26

Evaluation for IWIN

Page 27

Case Study 1

Page 28

Case Study 2

Page 29

Conclusion

1. Presented the first unified framework supporting interactive exploration of density-based clusters in streaming windows.

2. Designed a more efficient density-based clustering algorithm, IWIN.

3. Defined the first evolution semantics for density-based clusters.

4. Our experimental study confirms both the efficiency and the effectiveness of our proposed framework.

Page 30

Future work

Support multiple queries.

Support other pattern types, such as outliers, association rules…

Support pattern storage and matching.

More?

Page 31

The End

Thanks