Project Presentation CPSC 695 Prepared By: Priyadarshi Bhattacharya


Outline of Talk

Introduction to clustering and its relevance to my research interests.

Discussion on existing clustering techniques and their shortcomings.

Introduction to a new Delaunay based clustering algorithm.

Experimental Results and comparison with other methods.

Direction of future research.

Clustering – Definition

Automatic identification of groups of similar objects.

A method of grouping data such that intracluster similarity is maximized and intercluster similarity is minimized.

Properties of clustering

Scalability: running time should grow roughly linearly as the data size increases.

Ability to detect clusters of different shapes

Minimal input parameters

Robust with regard to noise

Insensitive to data input order

Scalability to higher dimensions

(Properties taken, with minor modifications, from “On Data Clustering Analysis: Scalability, Constraints and Validation”.)

Relevance to my research

Identification of high-risk areas in the sea based on incident data from the Maritime Activity and Risk Investigation System (MARIS), maintained primarily by the University of Halifax.

[Diagram labels: Incident Data, Clustering Algorithm, Marine Route Planning, ESRI Shape File, High-risk areas, Location of SAR Bases]

Existing clustering algorithms

Clustering approaches:

Partitioning: K-Means, K-Medoid

Hierarchical: BIRCH, CURE, ROCK, CHAMELEON

Density-based: DBSCAN, TURN*

Grid-based: WaveCluster [1], CLIQUE

[1] WaveCluster: A novel clustering approach based on wavelet transforms. Applies a multi-resolution grid structure on the data space. For more details, refer to “WaveCluster: a multi-resolution clustering approach for very large spatial databases”, Proc. 24th Conf. on Very Large Databases.

Shortcomings of existing methods

Require a large number of parameters to be input by the user.

Example – number of clusters, threshold to quantify “similarity”, stopping condition, number of nearest neighbors etc.

Sensitivity to user-supplied parameters.

Capability of identifying clusters degrades with increase in noise.

Inability to identify clusters of widely varying shapes and sizes. Most detect spherical ones only.

Identifying dense clusters in the presence of sparse ones, clusters connected by multiple bridges, and closely lying dense clusters remains elusive.

CRYSTAL – A new Delaunay based clustering algorithm

The algorithm has three stages:

Triangulation phase: Forms the Delaunay Triangulation of the data points and sorts the vertices in the order of decreasing average length of adjacent edges.

Grow cluster phase: Scans the sorted vertex list and grows clusters from the vertices in that order, first encompassing first order neighbors, then second order neighbors and so on. The growth stops when the boundary of the cluster is determined.

Noise removal phase: The algorithm identifies noise as sparse clusters. They can be easily eliminated by removing clusters which are very small in size or which have a very low density.
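The vertex ordering used at the end of the triangulation phase can be sketched as follows. This is a minimal illustration, not the presenter's implementation: it assumes the triangulation is already available as triples of vertex indices (from any Delaunay routine), and the function names are hypothetical. The small hand-made triangulation stands in for real output.

```python
from collections import defaultdict
from math import dist

def edges_from_triangles(triangles):
    """Collect the undirected edges of a triangulation given as vertex-index triples."""
    edges = set()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))
    return edges

def average_adjacent_edge_length(points, triangles):
    """Map each vertex index to the mean length of its incident edges."""
    total = defaultdict(float)
    count = defaultdict(int)
    for u, v in edges_from_triangles(triangles):
        length = dist(points[u], points[v])
        for w in (u, v):
            total[w] += length
            count[w] += 1
    return {v: total[v] / count[v] for v in total}

def sorted_vertices(points, triangles):
    """Vertex indices in order of decreasing average adjacent edge length,
    the order stated on the slide."""
    avg = average_adjacent_edge_length(points, triangles)
    return sorted(avg, key=lambda v: avg[v], reverse=True)

# Hand-made example: a unit square split along one diagonal, plus one far point.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (5.0, 0.5)]
triangles = [(0, 1, 2), (0, 2, 3), (1, 4, 2)]
order = sorted_vertices(points, triangles)
```

In this example the far point (index 4) comes first because both of its edges are long, and the corner with only short edges (index 3) comes last.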

Description of Stage I

Triangulation phase:

Triangulation is done in O(nlogn) time using the incremental algorithm.

An auxiliary grid structure (O(n) in size) is used to speed up the point location problem in the Delaunay Triangulation. This considerably reduces length of walk in the graph to locate the triangle containing the data point.

The well-known Winged-Edge data-structure is used to represent the Delaunay Triangulation because of its efficiency in answering proximity queries.
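One way an auxiliary grid can shorten the point-location walk is sketched below. The slides do not say what the grid stores, so this sketch assumes each cell remembers a single handle (for example, the id of a triangle touched by the last point inserted there); the class name `HintGrid` and its methods are hypothetical. The walk would then start from the returned handle instead of an arbitrary triangle.

```python
from math import ceil, sqrt

class HintGrid:
    """Uniform grid with roughly n cells; each cell stores one handle."""

    def __init__(self, xmin, ymin, xmax, ymax, n_points):
        self.side = max(1, ceil(sqrt(n_points)))  # side * side ~ n cells, so O(n) memory
        self.xmin, self.ymin = xmin, ymin
        self.dx = (xmax - xmin) / self.side or 1.0  # fall back to 1.0 for a zero-width box
        self.dy = (ymax - ymin) / self.side or 1.0
        self.cells = {}

    def _cell(self, x, y):
        # Clamp so that points on or outside the bounding box map to a border cell.
        i = min(self.side - 1, max(0, int((x - self.xmin) / self.dx)))
        j = min(self.side - 1, max(0, int((y - self.ymin) / self.dy)))
        return i, j

    def remember(self, x, y, handle):
        """Record a handle for the cell containing (x, y), replacing any older one."""
        self.cells[self._cell(x, y)] = handle

    def hint(self, x, y):
        """Return a handle from the closest non-empty ring of cells, or None."""
        ci, cj = self._cell(x, y)
        for r in range(self.side):
            for i in range(ci - r, ci + r + 1):
                for j in range(cj - r, cj + r + 1):
                    on_ring = max(abs(i - ci), abs(j - cj)) == r
                    if on_ring and (i, j) in self.cells:
                        return self.cells[(i, j)]
        return None

grid = HintGrid(0.0, 0.0, 10.0, 10.0, n_points=100)
grid.remember(0.5, 0.5, "t0")
grid.remember(9.5, 9.5, "t9")
near_origin = grid.hint(1.2, 0.8)
near_corner = grid.hint(8.0, 9.0)
```

Keeping only one handle per cell keeps the structure linear in size; the price is that the hint is only approximately the nearest triangle, which is acceptable because the walk corrects it.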

Description of Stage II

Grow Cluster phase:

A queue is used to maintain a list of vertices in order, from which the cluster is grown. Only vertices that are not boundary points are inserted into the queue.

To decide whether a point belongs to the cluster, the edge length is compared with the average edge length of the cluster. To decide whether a point is on the boundary of a cluster, the average adjacent edge length of the point is compared to the average edge length of the cluster.
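The queue-based growth described above might look like the sketch below. The slides do not give the exact comparison rule, so the tolerance factor `tol`, the running-average update, and the choice to seed the cluster average with the seed vertex's own average adjacent edge length are all assumptions made for illustration. The adjacency lists stand in for Delaunay neighbors, and the seeds are picked by hand here.

```python
from collections import deque
from math import dist

def grow_cluster(seed, points, neighbors, avg_adj, assigned, tol=1.5):
    """Grow one cluster from `seed` breadth-first over triangulation neighbors.

    `tol` is a hypothetical tolerance; the source does not specify the test.
    """
    cluster = [seed]
    assigned.add(seed)
    edge_sum, edge_count = avg_adj[seed], 1  # running cluster average edge length
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v in assigned:
                continue
            length = dist(points[u], points[v])
            cluster_avg = edge_sum / edge_count
            if length > tol * cluster_avg:
                continue  # edge too long: v does not join this cluster
            cluster.append(v)
            assigned.add(v)
            edge_sum += length
            edge_count += 1
            # Boundary points join the cluster but are not expanded further.
            if avg_adj[v] <= tol * cluster_avg:
                queue.append(v)
    return cluster

# Two tight groups joined by two long "bridge" edges (1-4 and 3-6).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 0), (11, 0), (10, 1)]
neighbors = {0: [1, 2], 1: [0, 2, 3, 4], 2: [0, 1, 3], 3: [1, 2, 6],
             4: [1, 5, 6], 5: [4, 6], 6: [3, 4, 5]}
avg_adj = {v: sum(dist(points[v], points[w]) for w in nb) / len(nb)
           for v, nb in neighbors.items()}

assigned = set()
first = grow_cluster(0, points, neighbors, avg_adj, assigned)
second = grow_cluster(5, points, neighbors, avg_adj, assigned)
```

Vertices 1, 3, 4 and 6 touch a long bridge edge, so their average adjacent edge length is high; they are accepted as boundary points but never enqueued, which is what stops growth from crossing the bridges.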

Description of Stage III

Noise Removal Phase:

Noise in the data may be in the form of isolated data points or scattered throughout the data. In the former case, a cluster based at these data points will not be able to grow.

However, if the noise is scattered uniformly throughout the data, our algorithm identifies it as a single sparse cluster. This phase simply gets rid of noise by eliminating the cluster with the highest average edge length. Also any trivial clusters (size less than an acceptable number) are removed in this phase.
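A minimal sketch of this phase is shown below. The function name, the `min_size` default, and the order of the two steps (trivial clusters are dropped first, then the sparsest survivor) are assumptions; the slides only state that both kinds of cluster are removed.

```python
def remove_noise(clusters, avg_edge_length, min_size=3, drop_sparsest=True):
    """Drop trivial clusters and, optionally, the single sparsest remaining cluster.

    clusters:        dict mapping cluster id -> list of member vertex indices
    avg_edge_length: dict mapping cluster id -> average edge length of that cluster
    """
    kept = [cid for cid, members in clusters.items() if len(members) >= min_size]
    if drop_sparsest and len(kept) > 1:
        # Uniformly scattered noise shows up as one cluster with very long edges.
        sparsest = max(kept, key=lambda cid: avg_edge_length[cid])
        kept.remove(sparsest)
    return {cid: clusters[cid] for cid in kept}

clusters = {
    "a": [0, 1, 2, 3, 4],
    "b": [5, 6, 7, 8],
    "noise": [9, 10, 11, 12, 13, 14],
    "tiny": [15],
}
avg_edge_length = {"a": 1.0, "b": 1.2, "noise": 6.5, "tiny": 0.0}
cleaned = remove_noise(clusters, avg_edge_length)
```

Dropping the sparsest cluster unconditionally would be wrong for noise-free data, so the `drop_sparsest` flag is included here as a hypothetical switch.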

Complexity Analysis

The algorithm operates in O(n log n) time.

Delaunay Triangulation is generated in O(nlogn) time. As a vertex once assigned to a cluster is not considered again, the clustering is done in O(n) time.
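The bound can be written out as a sum of the stages. The sorting term is not mentioned on the slide; it is included here on the assumption that the vertices are ordered with a comparison sort. The linear bound for growth relies on the standard fact that a planar triangulation on n vertices has at most 3n - 6 edges, so visiting every vertex and its adjacent edges once costs O(n).

```latex
\begin{align*}
T(n) &= \underbrace{O(n \log n)}_{\text{triangulation}}
      + \underbrace{O(n \log n)}_{\text{vertex sort}}
      + \underbrace{O(n)}_{\text{cluster growth}} \\
     &= O(n \log n), \\
|E|  &\le 3n - 6 \quad (n \ge 3).
\end{align*}
```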

[Plot: cluster size (in thousands) vs. time consumed (ms)]

Clustering in action

Experimental Results

Comparison with K-Means based approaches

Experimental Results (contd.)

1. Clusters of different shapes

2. Closely lying dense clusters

1. Clusters connected by multiple bridges

2. Clusters of widely varying density

[Figure panels: Data set, K-Means, GEM, CRYSTAL]


Results on t7.10k.dat (originally used in “CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling”)

Conclusion & Future Work

CRYSTAL is a fast O(n log n) clustering algorithm that automatically identifies clusters of widely varying shapes, sizes and densities without requiring any input from the user.

Future work will involve:

Application of the clustering algorithm in identification of high-risk areas in the sea using the MARIS database.

Extension of the algorithm to 3D.

Considering physical constraints in clustering. In GIS, physical constraints such as rivers, highways, mountain ranges can hinder or alter the clustering result.

References

G. Papari, N. Petkov: Algorithm That Mimics Human Perceptual Grouping of Dot Patterns. Lecture Notes in Computer Science (2005) 497-506

Vladimir Estivill-Castro, Ickjai Lee: AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive Point-Data Sets. Fifth International Conference on Geocomputation (2000)

Osmar R. Zaiane, Andrew Foss, Chi-Hoon Lee, Weinan Wang: On Data Clustering Analysis: Scalability, Constraints and Validation. Advances in Knowledge Discovery and Data Mining, Springer-Verlag (2002)

Z.S.H. Chan, N. Kasabov: Efficient global clustering using the Greedy Elimination Method. Electronics Letters 40(25) (2004)

Aristidis Likas, Nikos Vlassis, Jakob J. Verbeek: The global k-means clustering algorithm. Pattern Recognition 36(2) (2003) 451-461

Ying Xu, Victor Olman, Dong Xu: Minimum Spanning Trees for Gene Expression Data Clustering. Computational Protein Structure Group, Life Sciences Division, Oak Ridge National Laboratory, USA

C. Eldershaw, M. Hegland: Cluster Analysis using Triangulation. Computational Techniques and Applications CTAC97, 201-208. World Scientific, Singapore, 1997

Mir Abolfazl Mostafavi, Christopher Gold, Maciej Dakowicz: Delete and insert operations in Voronoi/Delaunay methods and applications. Computers & Geosciences 29(4) (2003) 523-530

Atsuyuki Okabe, Barry Boots, Kokichi Sugihara: Spatial Tessellations: Concepts and Applications of Voronoi Diagrams.

Thank You!

All 11 identified by CRYSTAL!

Questions?
