Clustering Data Streams Chun Wei Dept Computer & Information Technology Advisor: Dr. Sprague

Clustering Data Streams

Chun Wei

Dept Computer & Information Technology

Advisor: Dr. Sprague

Data Stream

Massive data sets accumulated at an astonishing rate.

Examples:

Tracking network data to study change in traffic patterns and possible intrusions

Tracking meteorological data, such as temperatures

NASA MISR satellite

Collects several TB of satellite imagery data daily

Challenges

Compactness of data representation

Fast, incremental processing of new data points (one-pass and linear access of data)

Clear and fast identification of changes in evolving clustering models

Compactness

Utilize a data structure that summarizes a group of data points, minimizing the storage space

The space required does not grow appreciably with the number of points processed

Incremental Processing of data

When clustering new data points, the algorithm should not require comparison with all the points processed in the past

The data must be processed as they are produced. A linear scan is required because random access is prohibitively expensive.

Identification of Changes

The algorithm must be able to:

diagnose changes in evolving data streams

distinguish outliers from data points that represent a new cluster

Current Algorithms

BIRCH

STREAM

CLU-STREAM

…

BIRCH

Use CF vectors to store data: $CF = \left(N, \sum_{i=1}^{N} X_i^2, \sum_{i=1}^{N} X_i\right)$, where each $X_i$ is a vector

Store the number of points, the linear sum and the square sum of all data points in a micro-cluster

Sufficient to calculate centroids, radius, diameter and distances
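To make the "sufficient to calculate" claim concrete, here is a minimal sketch of a CF vector in Python. The class name `CFVector` and its method names are illustrative assumptions, not taken from the BIRCH paper; the radius and diameter use the standard root-mean-square definitions, which need only N, the linear sum and the square sum.

```python
import math
from dataclasses import dataclass, field


@dataclass
class CFVector:
    """Hypothetical CF summary in the slide's order: (N, square sum, linear sum)."""
    dim: int
    n: int = 0
    ss: float = 0.0                          # sum of ||X_i||^2 over all points
    ls: list = field(default_factory=list)   # per-dimension sum of X_i

    def __post_init__(self):
        if not self.ls:
            self.ls = [0.0] * self.dim

    def add_point(self, x):
        """Absorb one point without storing it."""
        self.n += 1
        self.ss += sum(v * v for v in x)
        self.ls = [a + b for a, b in zip(self.ls, x)]

    def merge(self, other):
        """CF vectors are additive: merging two summaries is a component-wise sum."""
        self.n += other.n
        self.ss += other.ss
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """RMS distance of the points to the centroid: sqrt(SS/N - ||LS/N||^2)."""
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

    def diameter(self):
        """RMS pairwise distance: sqrt((2*N*SS - 2*||LS||^2) / (N*(N-1)))."""
        if self.n < 2:
            return 0.0
        ls2 = sum(s * s for s in self.ls)
        return math.sqrt(max((2 * self.n * self.ss - 2 * ls2) / (self.n * (self.n - 1)), 0.0))
```

For the two points (0, 0) and (2, 0), this summary gives centroid (1, 0), radius 1 and diameter 2, all without keeping the points themselves.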

B-Tree

[Figure: B-tree example. A root node with index keys 7, 16, 22, 29 above sorted leaf entries 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*]

Building of CF Tree

A B-tree-like structure with a branching factor B, a threshold T, and a maximum of L entries in a leaf node

[Figure: CF tree. A non-leaf node with entries CF3 and CF6 above leaf nodes holding entries CF1, CF2, CF3 and CF4, CF5, CF6]

Adjusting CF Tree

Increase the threshold T so that each leaf entry can absorb more points. T can bound either the radius or the diameter of a leaf entry.

Leaf entries with “far fewer” points than average are regarded as “outliers” and written out to disk.
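The effect of the threshold can be illustrated with a small sketch. It is a simplification under stated assumptions: the leaf entries are kept in a flat list rather than a tree, T bounds the radius, and the helper names `radius` and `insert` are hypothetical. A point is absorbed by the closest entry only if the entry's radius stays within T; otherwise a new entry is opened.

```python
import math


def radius(n, ss, ls):
    """RMS distance to the centroid, computed from a CF summary (n, ss, ls)."""
    c2 = sum((s / n) ** 2 for s in ls)
    return math.sqrt(max(ss / n - c2, 0.0))


def insert(entries, x, threshold):
    """Absorb x into the closest entry if its radius stays <= threshold.

    entries is a list of [n, ss, ls] summaries (a flat stand-in for leaf nodes).
    """
    xx = sum(v * v for v in x)
    best, best_d = None, float("inf")
    for e in entries:
        d = math.dist([s / e[0] for s in e[2]], x)
        if d < best_d:
            best, best_d = e, d
    if best is not None:
        n, ss, ls = best
        new_ls = [a + b for a, b in zip(ls, x)]
        if radius(n + 1, ss + xx, new_ls) <= threshold:
            best[0], best[1], best[2] = n + 1, ss + xx, new_ls
            return
    entries.append([1, xx, list(x)])   # no entry can absorb x: open a new one


def summarize(points, threshold):
    entries = []
    for p in points:
        insert(entries, p, threshold)
    return entries
```

With the points (0, 0), (1, 0), (10, 0), (11, 0), a threshold of 0.1 keeps four separate entries, while a threshold of 0.6 lets neighbouring points share an entry and leaves two. Raising T therefore shrinks the summary at the cost of coarser entries.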

STREAM

Process data streams in batches of points

Use weighted centroids C_i to represent the i-th batch of points.

Recursively cluster the weighted centroids until k clusters remain
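The batch-then-recluster idea can be sketched as follows. This is an illustration only: STREAM uses a k-median subroutine, whereas this sketch swaps in a plain weighted k-means with farthest-first initialisation to stay short and deterministic. The function names and the rule "recluster the summaries once they outnumber one batch" are assumptions.

```python
import math


def weighted_kmeans(points, weights, k, iters=20):
    """Cluster weighted points into at most k weighted centroids."""
    if not points:
        return [], []
    # Farthest-first initialisation keeps this sketch deterministic.
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    mass = [0.0] * len(centers)
    for _ in range(iters):
        sums = [[0.0] * len(points[0]) for _ in centers]
        mass = [0.0] * len(centers)
        for p, w in zip(points, weights):
            j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            mass[j] += w
            sums[j] = [s + w * v for s, v in zip(sums[j], p)]
        centers = [tuple(v / m for v in s) if m > 0 else c
                   for s, m, c in zip(sums, mass, centers)]
    kept = [(c, m) for c, m in zip(centers, mass) if m > 0]
    return [c for c, _ in kept], [m for _, m in kept]


def stream_cluster(stream, k, batch_size):
    """Summarise each batch by weighted centroids, then cluster the summaries."""
    pts, wts, batch = [], [], []

    def flush():
        nonlocal pts, wts, batch
        if batch:
            c, w = weighted_kmeans(batch, [1.0] * len(batch), k)
            pts, wts, batch = pts + c, wts + w, []
        if len(pts) > batch_size:      # too many summaries: recluster them
            pts, wts = weighted_kmeans(pts, wts, k)

    for x in stream:
        batch.append(tuple(x))
        if len(batch) == batch_size:
            flush()
    flush()
    return weighted_kmeans(pts, wts, k)
```

Only the current batch and the retained weighted centroids are held in memory, so the raw points are read once and then discarded.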

Problems with BIRCH & STREAM

Old data points are treated as equally important as new data points

May not be able to detect new trends in an evolving data stream

CLU-STREAM

Also use CF vectors to store data summaries

Use time stamps to record the elapsed time from the beginning of the stream

Take snapshots at different time stamps, favoring the most recent data

(Snapshot: micro-clusters stored at particular moments in the stream)

CLU-STREAM (continued)

A snapshot contains q micro-clusters, where q depends on the memory available

A new data point is assigned to one of the micro-clusters in the previous snapshot if it falls within the maximum boundary of that micro-cluster.

CLU-STREAM (continued)

If a new data point fails to fit into any current cluster, a new cluster is created, and either an existing cluster is deleted or two existing clusters are merged.

A cluster is removed if the average time stamp of the last m data points it absorbed is the least recent.
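The maintenance rule above can be sketched in a few lines. Several details are assumptions made for illustration: the boundary is taken as `factor` times the cluster radius with a floor of `min_boundary` for very tight clusters, the last m time stamps are kept explicitly in a deque instead of being estimated from time-stamp statistics, and only the deletion branch is shown (the merge alternative is omitted). The names `MicroCluster` and `process` are hypothetical.

```python
import math
from collections import deque


class MicroCluster:
    """CF summary plus the time stamps of the last m absorbed points."""

    def __init__(self, x, t, m):
        self.n = 1
        self.ls = list(x)
        self.ss = sum(v * v for v in x)
        self.recent = deque([t], maxlen=m)

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

    def absorb(self, x, t):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)
        self.recent.append(t)

    def relevance(self):
        """Average time stamp of the last m points; small means stale."""
        return sum(self.recent) / len(self.recent)


def process(clusters, x, t, q, m=3, factor=2.0, min_boundary=1.0):
    """Assign x to the nearest micro-cluster, or replace the stalest one."""
    if clusters:
        nearest = min(clusters, key=lambda c: math.dist(c.centroid(), x))
        boundary = max(factor * nearest.radius(), min_boundary)
        if math.dist(nearest.centroid(), x) <= boundary:
            nearest.absorb(x, t)
            return
    if len(clusters) >= q:             # budget of q micro-clusters is full
        clusters.remove(min(clusters, key=lambda c: c.relevance()))
    clusters.append(MicroCluster(x, t, m))
```

Feeding the points (0, 0), (0.5, 0), (10, 0), (20, 0) at times 1 to 4 with q = 2 first groups the two nearby points, then drops that older cluster when the fourth point needs a slot.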

Detect New Trends

Comparing clustering results from snapshot to snapshot reveals trends in an evolving data stream.
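One simple way to compare two snapshots is to match cluster centroids by distance and report those with no counterpart. The sketch below is an assumption-laden illustration, not the comparison used in the cited work: `compare_snapshots` and the tolerance `tol` are hypothetical, and each snapshot is reduced to a list of centroids.

```python
import math


def compare_snapshots(old, new, tol):
    """Report centroids with no counterpart within tol in the other snapshot."""
    def unmatched(src, ref):
        return [c for c in src if all(math.dist(c, r) > tol for r in ref)]

    return {"appeared": unmatched(new, old), "vanished": unmatched(old, new)}


# Usage: one cluster persists, one disappears, one is new.
earlier = [(0.0, 0.0), (10.0, 0.0)]
later = [(0.2, 0.0), (20.0, 0.0)]
changes = compare_snapshots(earlier, later, tol=1.0)
```

Here `changes["appeared"]` holds the centroid (20, 0) and `changes["vanished"]` holds (10, 0), which is the kind of shift a trend report would flag.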

References

Aggarwal, C. C., Han, J., Wang, J. & Yu, P. S. (2003). A Framework for Clustering Evolving Data Streams. In Proc. of the 29th VLDB Conference.

Barbara, D. (2003). Requirements for Clustering Data Streams. SIGKDD Explorations, 3 (2), 23-27.

Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1998). Clustering for Mining in Large Spatial Databases. Special Issue on Data Mining, KI-Journal, ScienTec Publishing, No. 1.

Guha, S., Meyerson, A., Mishra, N., Motwani, R. & O'Callaghan, L. (2003). Clustering Data Streams: Theory and Practice. IEEE Transactions on Knowledge and Data Engineering, 15 (3), 515-528.

Zhang, T., Ramakrishnan, R. & Livny, M. (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proc. of ACM SIGMOD International Conference on Management of Data.