Upload
bailey-emery
View
213
Download
0
Embed Size (px)
Citation preview
Clustering Data Streams
Chun Wei
Dept Computer & Information Technology
Advisor: Dr. Sprague
Data Stream
Massive data sets accumulated at an astonishing rate.
Examples: Tracking network data to study change in
traffic patterns and possible intrusionsTracking meteorological data, such as
temperatures
NASA MISR satellite
Collects several TB of satellite imagery data daily
Challenges
Compactness of data representationFast, incremental processing of new
data points (one-pass and linear access of data)
Clear and fast identification of changes in evolving clustering models
Compactness
Utilize a data structure that summarizes a group of data points, minimizing the storage space
The space required does not grow appreciably with the number of points processed
Incremental Processing of data
When clustering new data points, the algorithm should not require comparison with all the points processed in the past
The data must be processed as they are produced. Linear scan is required, random access is prohibitively expensive.
Identification of Changes
The algorithm must be able to:diagnose changes in evolving data
streamsdistinguish outliers from data points that
represent a new cluster
Current Algorithms
BIRCHSTREAMCLU-STREAM…
BIRCH
Use CF vectors to store data CF = (N, ∑Xi
2 , ∑ Xi ) Xi is a vector
Store the number of points, the linear sum and the square sum of all data points in a micro-cluster
Sufficient to calculate centroids, radius, diameter and distances
N
i iX1
N
i iX1
B-Tree
Root
16 22 29
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
7
Building of CF Tree
B-Tree, with a branch factor B, threshold T and L maximum number of entries in a leave node
CF3
CF1 CF2 CF3
CF6
CF4 CF5 CF6Leaf node
Non-Leaf node
Adjusting CF Tree
Increases the threshold T so that each leaf entry to absorb more points. T can be set as radius or diameter.
Leaf entries with “far fewer” points are regarded as “outliers” and written back to disk.
STREAM
Process data streams in batches of points
Use weighted centroids Ci to represent ith batch of points.
Recursively cluster the weighted centroids until k-clusters
Problems with BIRCH & STREAM
Old data points are equally important as new data points
May not be able to detect new trends in evolving data stream
CLU-STREAM
Also use CF vectors to store data summary
Use time stamps to record the elapsed time from the beginning
Take snapshots at different time stamp, favoring the most recent data
(Snapshot: micro-clusters stored at particular moments in the stream)
CLU-STREAM (continue)
A snapshot contains q micro-clusters, q depends on the memory available
New data points will be assigned to one of the micro-clusters in previous snapshot if it falls within the maximum boundary of that micro-cluster.
CLU-STREAM (continue)
If a new data points fails to fit into any current cluster, a new cluster is created, and an existing one is deleted or two merged.
A cluster is removed if the average time-stamp when it absorbs m new data points is least recent.
Detect New Trends
Comparing clustering results from snapshots to snapshots reveals trends in evolving data stream.
References
Aggarwal, C. C., Han J., Wang, J. & Yu, P. S. (2003). A Framework for Clustering Evolving Data Stream. In Proc. of the 29th VLDB Conference.
Barbara, D. (2003). Requirements for Clustering Data Streams. SIGKDD Explorations, 3 (2), 23-27.
Ester M., Kriegel H.-P., Sander J. & Xu X (1998). Clustering for Mining in Large Spatial Databases. Special Issue on Data Mining, KI-Journal, ScienTec Publishing, No. 1.
Guha, S., Meyerson, A., Mishra, N. & Motwani, R., Callaghan, L. (2003). Clustering Data Streams: Theory and Practice. IEEE Transactions on Knowledge and Data Engineering, 15 (3), 515-528.
Zhang, T., Ramakrishnan R. & Livny, M. (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proc. of ACM SIGMOD International Conference on Management of Data.