

Heuristic Based Approach to Clustering and its Time Critical Applications

Alan Chia-Lung Chen*   Shang Gao*   Reda Alhajj*,∇   Panagiotis Karampelas⊗

*Dept of Computer Science, University of Calgary, Calgary, Alberta, Canada
{aclchen,shanggao,alhajj}@ucalgary.ca

∇Dept of Computer Science, Global University, Beirut, Lebanon

⊗Dept of Information Technology, Hellenic American University, Athens, Greece
[email protected]

Abstract

Clustering is one of the earliest data analysis techniques addressed by the research community, dating back to the 1960s. However, as databases continue to grow in size, numerous research studies have been undertaken to develop more efficient clustering algorithms and to improve the performance of existing ones. This paper demonstrates a general optimization technique applicable to clustering algorithms that need to calculate distances and check them against some minimum distance condition. The optimization technique is a simple calculation that finds the minimum possible distance between two points and checks this distance against the minimum distance condition, thus reusing already computed values and reducing the need to evaluate a more expensive distance function repeatedly. The proposed optimization technique has been applied to the agglomerative hierarchical clustering, k-means clustering, and DBSCAN algorithms with successful results. Runtimes for all three algorithms with this optimization were reduced, and the clusters they returned were verified to remain the same as those of the original algorithms. The optimization technique also shows potential for reducing runtimes by a substantial amount for large databases, with the benefit growing as databases grow larger.
Keywords: clustering algorithms, DBSCAN, density-based clustering, distance computation, hierarchical clustering, k-means clustering, performance.

1. Introduction
Clustering is an essential problem in several applications, e.g., [10]; it has been extensively studied in the literature for a long time. However, there is still a need for more effective clustering algorithms with acceptable performance [3]. Approximately 161 exabytes of digital data were produced in 2006, and this amount was estimated to grow to approximately 988 exabytes by 2010 [5]. As databases grow, the efficiency of algorithms should improve to handle emerging needs effectively. Otherwise, clustering large databases will be time-consuming, with no alternative capable of producing the intended result in a reasonable time frame. This problem has been recognized in the research community. Accordingly, some efforts have developed efficient algorithms for use on large databases, such as BIRCH and CLARANS [9]. There has also been research on ways to improve the efficiency of a single clustering algorithm [1].

The main issue with these efforts is that they are too focused on a specific algorithm, and different clustering algorithms give different clusters as results [3]. This means that clustering algorithms are not chosen on performance qualifications alone, so improving the performance of all clustering algorithms matters. Finding a single way to improve the efficiency of clustering algorithms in general is important, because the same idea can then be applied to newly developed clustering algorithms, and the effort can be adapted to research in other areas.

In our effort to produce a general solution that fits all clustering algorithms that use Euclidean distance, this paper serves two objectives. First, to compare the performance of the agglomerative hierarchical clustering, k-means clustering, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithms and establish a basic understanding of their efficiencies. Second, to demonstrate a simple and effective way to improve the efficiency of clustering algorithms on large databases.

The rest of this paper is organized as follows. The necessary background is covered in Section 2. Performance improvement of the three algorithms is discussed in Section 3. The optimization process is presented in Section 4. The performance evaluation of the optimization with a simple check is reported in Section 5. Section 6 presents conclusions and future research directions.

2. Background
This section provides a brief overview of the hierarchical, k-means, and DBSCAN algorithms as implemented and used in this research. Basic versions of the algorithms with no extra optimizations were chosen for the implementation. This helps highlight the power and demonstrate the applicability and effectiveness of the optimization approach proposed in this paper.

2.1 Agglomerative Hierarchical Clustering
The agglomerative hierarchical clustering algorithm implemented is based on the one presented by Rui Xu and Don Wunsch [8]. The implementation of this algorithm requires one input, the minimum cluster distance, which specifies when to stop merging clusters.

Pseudo-code of Hierarchical Clustering
Input: minimum cluster distance
Step 1: Assume every point is a cluster
Step 2: Find the clusters with the smallest distance between each other's centers
Step 3: Merge the clusters with the smallest distance
Step 4: Repeat Steps 2 and 3 while the smallest distance between all clusters < minimum cluster distance

Pseudo-code of K-Means
Input: number of clusters k
Step 1: Arbitrarily select k points as the cluster means
Step 2: Assign each point to the cluster of the nearest cluster mean
Step 3: Recalculate the center of each new cluster to be the cluster mean
Step 4: Repeat Steps 2 and 3 until points no longer get assigned to different clusters

Pseudo-code of DBSCAN
Input: Eps and MinPts
Step 1: Start with a point and find its neighbors (points with distance ≤ Eps)
        Mark the point as visited
        If the number of neighbors < MinPts (border point)
            Mark the point as noise
        Else (core point)
            Start a new cluster
            Add the point to the cluster
            Go to the Expand Cluster Step for this point
Step 2: Repeat Step 1 for all points not yet visited
Expand Cluster Step: For each unvisited neighbor of the point
        Mark the neighbor as visited
        Find the neighbor's neighbors
        If the number of the neighbor's neighbors ≥ MinPts (core point)
            Add the neighbor's neighbors to the current list of neighbors of the point
        Add the neighbor to the cluster if it is not part of any other cluster

2.2 K-Means Clustering
The k-means clustering algorithm implemented is based on the one presented by J. MacQueen [4]. The implementation of this algorithm requires one input, k, which is the number of clusters the algorithm needs to find.

2.3 DBSCAN
The DBSCAN algorithm implemented is based on the original DBSCAN proposed by M. Ester, H.-P. Kriegel, J. Sander, and X. Xu [2]. The implementation of this algorithm requires two inputs, Eps and MinPts. Eps is the maximum distance for a point to be considered a neighbor of another point. MinPts is the minimum number of points in the neighborhood of a point for that point to be considered a core point.

3. Performance Improvement of the Three Algorithms

3.1 Method
The three algorithms were all implemented in Java, then tested and timed on datasets of 100, 250, 500, 1000, 1500, and 2000 points so that a single bad dataset would not skew the results. The datasets are composed of points drawn from a 3000-point 2-D dataset downloaded from [6]. Each of the three algorithms was executed three times on each dataset so that a single bad run would not skew the results, and the average of the three runs was taken for each dataset size.

It is difficult to compare the three algorithms, because they give different results from each other. Because of this, the k-means algorithm was used as a bridge to compare the three algorithms. This was done by running the k-means algorithm on the datasets with the number of clusters that the hierarchical and DBSCAN algorithms returned as the input. There is randomness in the k-means algorithm which could vary its runtime; therefore a seed was set for the random number generator used in the k-means algorithm. The hierarchical and DBSCAN algorithms were executed with the same inputs for all the executions on the different datasets.

3.2 Results

3.2.1 Analysis
The results of the implementation showed that the fastest algorithm is DBSCAN, followed by the k-means algorithm, with the agglomerative hierarchical algorithm the slowest. The DBSCAN algorithm does run significantly faster than the k-means algorithm, but this difference is not nearly as large as when either of these two algorithms is compared to the hierarchical algorithm (see Figure 1 and Figure 2). Compared to the hierarchical algorithm, both the k-means and DBSCAN algorithms run faster by a factor of over 100 for larger datasets. The runtimes of all three algorithms increase as the number of points increases, but the runtime of the hierarchical algorithm increases almost exponentially compared to the k-means and DBSCAN algorithms (see Figure 1 and Figure 2).

Figure 1. Performance comparison chart for hierarchical clustering and k-means with minimum cluster distance = 150.

Figure 2. Performance comparison chart for k-means and DBSCAN with Eps = 350 and MinPts = 4.

4. Optimization of Clustering Algorithms with a Simple Check
The main time spent in these clustering algorithms is in calculating the distances between points. There is a need to reduce these calculations, because many of them are not necessary. Distances already calculated can be used to help avoid other distance calculations; this is the main idea used to improve the performance of the algorithms. Obviously, more distance computations are expected as the number of data points increases. Therefore, the power of the proposed approach becomes more visible as the size of the database increases.

Example 4.1: Assume the distances from all points to A have been computed and the algorithm is currently looking at the distance from B to C (see Figure 3). In general, the distances of A to B and A to C can be used to derive information about the distance of B to C.

Figure 3. Visualization of the example.

Figure 4. Visual concept of the minimum distance formula.

4.1 Minimum Distance
Given a third point A whose distances to B and C are already known, the minimum possible distance between two points B and C is given by:

MinimumDistance(B,C) = |Distance(A,B) - Distance(A,C)|

The formula works for any points A, B, and C. Point A is taken as the center of a circle whose radius is the distance from A to C (see Figure 4). The only location on that circle at which C could sit so that the distance from B to C is minimal lies on the line through A and B (see Figure 4); this is why the formula subtracts the distance of A to C from the distance of A to B. The argument holds for any position of B, because B is either on the circle, inside the circle, or outside the circle. If B is on the circle, the minimum possible distance is zero. If B is outside the circle, the minimum possible distance is a positive number. If B is inside the circle, the subtraction gives a negative number, which is why the formula takes the absolute value; this is equivalent to swapping the roles of B and C. The formula is simply the reverse triangle inequality: Distance(B,C) ≥ |Distance(A,B) - Distance(A,C)|.
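As a concrete illustration, below is a minimal Java sketch of this lower bound; the class and method names are our own and not taken from the paper's implementation, and the numeric values are only an example.

public final class MinDistance {

    // Euclidean distance between two 2-D points (x1, y1) and (x2, y2).
    static double distance(double x1, double y1, double x2, double y2) {
        double dx = x1 - x2, dy = y1 - y2;
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Reverse triangle inequality: |d(A,B) - d(A,C)| <= d(B,C), so this value
    // is the minimum possible distance between B and C given distances to A.
    static double minimumDistance(double distAB, double distAC) {
        return Math.abs(distAB - distAC);
    }

    public static void main(String[] args) {
        double dAB = distance(0, 0, 3, 0);   // A = (0,0), B = (3,0), so d(A,B) = 3
        double dAC = distance(0, 0, 2, 0);   // A = (0,0), C = (2,0), so d(A,C) = 2
        // Prints 1.0; here the bound equals the true d(B,C) because the points
        // are collinear, and in general it is a lower bound on d(B,C).
        System.out.println(minimumDistance(dAB, dAC));
    }
}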

4.2 Check
This simple subtraction and absolute-value calculation gives the minimum possible distance between two points B and C, which can be used as a check against some minimum distance condition to decide whether the exact distance of B to C needs to be calculated. In a number of cases, this eliminates the need to perform a more expensive calculation such as the Euclidean distance formula.
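In code, the check is a cheap guard placed in front of the exact computation. The sketch below is illustrative only: the method name and the generic threshold parameter are our own, standing in for whatever minimum distance condition the algorithm uses.

// Only compute the exact Euclidean distance between B and C when the cheap
// lower bound derived from the known distances to A does not already rule
// the pair out against the threshold.
static boolean withinThreshold(double distAB, double distAC,
                               double bx, double by, double cx, double cy,
                               double threshold) {
    double lowerBound = Math.abs(distAB - distAC);    // minimum possible d(B,C)
    if (lowerBound > threshold) {
        return false;                                 // expensive sqrt avoided
    }
    double dx = bx - cx, dy = by - cy;
    return Math.sqrt(dx * dx + dy * dy) <= threshold; // exact check only if needed
}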

4.3 Distance Information Storage
This optimization technique requires the storage of distance information for later use. This can be accomplished with a distance matrix (see Figure 5). The rows and columns represent the points, and entry (i, j) of the matrix holds the distance between the point of row i and the point of column j (see Figure 3 and Figure 5).

If the distance matrix is symmetric, then only a fraction of the entries (the upper triangle of the matrix) needs to be filled in. This means the distance of A to B only needs to be stored in one place (see Figure 3 and Figure 5).

When it is concluded that there is no need to calculate the exact distance between two points B and C, the corresponding entry in the distance matrix needs to be marked. This makes it clear that the distance information for B and C cannot be reused later on, because the exact distance of B to C was never calculated. The marking is done by storing a negative number in the distance matrix, since distances cannot be negative (see Figure 5).

Figure 5. Example of a distance matrix.
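A minimal sketch of such a storage scheme is given below. It assumes a plain two-dimensional array, the upper-triangle convention, and a negative sentinel for entries that were skipped by the check; the class and constant names are illustrative, not the paper's.

// Upper-triangular distance matrix with a negative sentinel marking entries
// whose exact distance was never computed because the check ruled them out.
public final class DistanceMatrix {
    private static final double NOT_COMPUTED = -1.0; // valid distances are never negative
    private final double[][] d;

    public DistanceMatrix(int n) {
        d = new double[n][n];
        for (double[] row : d) java.util.Arrays.fill(row, NOT_COMPUTED);
    }

    public void set(int i, int j, double dist) {
        if (i <= j) d[i][j] = dist; else d[j][i] = dist; // store upper triangle only
    }

    public double get(int i, int j) {                    // NOT_COMPUTED if skipped
        return (i <= j) ? d[i][j] : d[j][i];
    }

    public boolean isKnown(int i, int j) {
        return get(i, j) >= 0.0;
    }
}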

4.4 Applying the Optimization with Check to Agglomerative Hierarchical Clustering
The proposed optimization method should be applied where the hierarchical algorithm searches for the pair of clusters with the smallest distance between their centers. The minimum possible distance between the centers of two clusters is checked against the smallest distance between two clusters found so far: if the minimum possible distance is not less than or equal to the smallest distance found so far, then these two clusters are not going to be merged during this turn. The distance information is stored in a symmetric distance matrix whose rows and columns represent the clusters.

Example: Assume the distances for all clusters to A are calculated and the algorithm is currently looking at the distance from B to C (see Figure 3).

Smallest.Distance.So.Far = S.D = 0.5
Min.Dist( ) = MinimumDistance( ),  Dist( ) = Distance( )
Min.Dist(B,C) = |Dist(A,B) - Dist(A,C)| = |3 - 2| = 1

If (Min.Dist(B,C) ≤ S.D)
    Calculate Dist(B,C)
    If (Dist(B,C) ≤ S.D)
        Update variables
    Move on to the next two clusters
Else
    Don't calculate Dist(B,C)
    Move on to the next two clusters

1 > 0.5, therefore there is no need to calculate the Distance(B,C), because B and C will not be merged this turn.
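For illustration, a hedged Java sketch of this inner search is given below. It simplifies the full bookkeeping described in Section 4.3 by using the distances from every center to a single reference center (index 0) as the reused information; method and variable names are our own.

// Find the indices of the closest pair of cluster centers, skipping the exact
// distance whenever the |d(ref,i) - d(ref,j)| lower bound already exceeds the
// smallest distance found so far. Each center is a double[]{x, y}.
static int[] closestPair(java.util.List<double[]> centers) {
    int n = centers.size();
    double[] distToRef = new double[n];              // reused distances to center 0
    for (int i = 0; i < n; i++) distToRef[i] = dist(centers.get(0), centers.get(i));

    double best = Double.POSITIVE_INFINITY;
    int[] bestPair = {-1, -1};
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            if (Math.abs(distToRef[i] - distToRef[j]) > best) continue; // cheap check
            double d = dist(centers.get(i), centers.get(j));            // exact only if needed
            if (d < best) { best = d; bestPair = new int[]{i, j}; }
        }
    }
    return bestPair;
}

static double dist(double[] p, double[] q) {
    double dx = p[0] - q[0], dy = p[1] - q[1];
    return Math.sqrt(dx * dx + dy * dy);
}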

4.5 Applying the Optimization with Check to K-Means Clustering
The proposed optimization method should be applied where the k-means algorithm searches for the nearest cluster mean of a point in order to assign the point to a cluster. The minimum possible distance from the point to a cluster mean is checked against the distance of the nearest cluster mean found so far: if the minimum possible distance is not less than or equal to the distance of the nearest cluster mean so far, then the point will not be assigned to the cluster currently being examined. The distance information is stored in a distance matrix whose rows represent the points and whose columns represent the cluster means' previous positions. For k-means clustering, because distances are calculated between points and cluster means rather than between pairs of points, extra distance information is needed: the distance between the old position of a cluster mean and its new position must be calculated, but this one calculation can potentially save many other distance calculations. This extra distance information can be stored in an array.

Example: Assume the nearest cluster mean so far to a point B is N with the distance already calculated and the algorithm is currently looking at the distance from B to a cluster mean M' (see Figure 6). Also assume that the old mean position of M' is M, and the distance between the two is calculated.

Nearest.Mean.Distance.So.Far = N.M.D = 0.5
Min.Dist( ) = MinimumDistance( ),  Dist( ) = Distance( )
Min.Dist(B,M') = |Dist(B,M) - Dist(M,M')| = |3 - 2| = 1

If (Min.Dist(B,M') ≤ N.M.D)
    Calculate Dist(B,M')
    If (Dist(B,M') ≤ N.M.D)
        Update variables
    Move on to the next cluster mean or point
Else
    Don't calculate Dist(B,M')
    Move on to the next cluster mean or point

1 > 0.5, therefore there is no need to calculate the Distance(B,M'), because B does not belong to the cluster of M'.
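A hedged sketch of this assignment step is shown below. It assumes a cached array prevDist of distances from the point to each cluster's previous mean and an array meanShift of how far each mean moved in the last update; skipped entries are marked with a negative value as in Section 4.3. All names are illustrative, not the paper's.

// Assign a point to its nearest cluster mean, skipping exact distance
// computations via the |d(point, oldMean) - d(oldMean, newMean)| lower bound.
static int assignPoint(double[] point, double[][] newMeans,
                       double[] prevDist, double[] meanShift) {
    int bestCluster = -1;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int c = 0; c < newMeans.length; c++) {
        // The cheap bound is usable only if the previous distance is known
        // (a negative entry marks a distance that was skipped earlier).
        if (prevDist[c] >= 0 && Math.abs(prevDist[c] - meanShift[c]) > bestDist) {
            prevDist[c] = -1.0;                  // mark as not exactly computed
            continue;                            // cannot be the nearest mean
        }
        double dx = point[0] - newMeans[c][0], dy = point[1] - newMeans[c][1];
        double d = Math.sqrt(dx * dx + dy * dy); // exact distance only when necessary
        prevDist[c] = d;                         // cache for the next iteration
        if (d < bestDist) { bestDist = d; bestCluster = c; }
    }
    return bestCluster;
}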


Figure 6. Visualization of the k-means example.

4.6 Applying the Optimization with Check to DBSCAN
The proposed optimization method should be applied at the beginning, when the DBSCAN algorithm builds all the neighborhoods by checking the distance of every point to every other point. The minimum possible distance between two points is checked against the input Eps: if the minimum possible distance is not less than or equal to Eps, then the two points are not neighbors of each other. The distance information is stored in a symmetric distance matrix whose rows and columns represent the points.

Example: Assume the distances for all points to A are calculated and the algorithm is currently looking at the distance from B to C to figure out if they are neighbors (see Figure 3). Also assume that the input Eps is 0.8.

Eps = 0.8
Min.Dist( ) = MinimumDistance( ),  Dist( ) = Distance( )
Min.Dist(B,C) = |Dist(A,B) - Dist(A,C)| = |3 - 2| = 1

If (Min.Dist(B,C) ≤ Eps)
    Calculate Dist(B,C)
    If (Dist(B,C) ≤ Eps)
        Update the neighborhoods of B and C
    Move on to the next two points
Else
    Don't calculate Dist(B,C)
    Move on to the next two points

1 > 0.8, therefore there is no need to calculate the Distance(B,C), because B and C are not neighbors.
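A hedged sketch of this neighborhood-building step is given below; points is assumed to be an array of 2-D coordinates, index 0 serves as the reference point whose distances are reused, and all names are illustrative rather than the paper's.

// Build each point's Eps-neighborhood, using the distances to a reference
// point to skip exact computations that cannot possibly fall within Eps.
static java.util.List<java.util.List<Integer>> neighborhoods(double[][] points, double eps) {
    int n = points.length;
    double[] distToRef = new double[n];
    for (int i = 0; i < n; i++) {
        double dx = points[0][0] - points[i][0], dy = points[0][1] - points[i][1];
        distToRef[i] = Math.sqrt(dx * dx + dy * dy);
    }
    java.util.List<java.util.List<Integer>> nbrs = new java.util.ArrayList<>();
    for (int i = 0; i < n; i++) nbrs.add(new java.util.ArrayList<>());
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            if (Math.abs(distToRef[i] - distToRef[j]) > eps) continue; // cheap check
            double dx = points[i][0] - points[j][0], dy = points[i][1] - points[j][1];
            if (Math.sqrt(dx * dx + dy * dy) <= eps) {                 // exact only if needed
                nbrs.get(i).add(j);
                nbrs.get(j).add(i);
            }
        }
    }
    return nbrs;
}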

5. Performance Evaluation of the Optimization with a Simple Check

5.1 Method
Two versions of each of the three algorithms, one with the optimization check and one without, were implemented in Java, then tested and timed on datasets of 100, 250, 500, 1000, 1500, and 2000 points so that a single bad dataset would not skew the results. The datasets are composed of points drawn from the same 3000-point 2-D dataset from [6]. Each algorithm was executed three times on each dataset so that a single bad run would not skew the results, and the average of the three runs was taken for each version and each dataset size. The runtimes of each of the three algorithms optimized with the check were compared to the runtimes of the versions without the optimization. There is randomness in the k-means algorithm that could vary its runtime; therefore, the same seed was set for the random number generator used in both the optimized and unoptimized versions of k-means. Both versions of all three algorithms were executed with the same inputs for all the executions on the different datasets.

Figure 7. Performance evaluation chart for hierarchical clustering with minimum cluster distance = 150.

Figure 8. Performance evaluation chart for k-means clustering with k = 80.

Figure 9. Performance evaluation chart for DBSCAN with Eps = 350 and MinPts = 4.


5.2 Results
5.2.1 Analysis
For all three algorithms, adding the simple check reduced the runtimes (see Figures 7, 8, and 9). Executions on the smaller datasets of 100, 250, and 500 points showed an improvement in performance, but not as significant an improvement as the executions on the larger datasets of 1000, 1500, and 2000 points. This indicates that the optimization technique is better suited to larger datasets. For all three algorithms, the results with the check appear to follow the same runtime pattern as the number of points increases, but the gap between the average runtime with the check and without it widens as the number of points increases (see Figures 7, 8, and 9).

The improvement in cost would be even more visible as the dimensionality of the data points increases, because higher dimensionality leads to a higher cost per distance computation; it is therefore wise to avoid distance computations whenever possible. This indicates that a substantial amount of time can potentially be saved by implementing the optimization with the check on extremely large datasets.

6. Conclusions and Future Work
Hierarchical clustering, k-means clustering, and DBSCAN are very different algorithms, but they share similarities, such as the need to calculate distances and check them against some minimum distance condition. There are times when it is important to focus on the similarities. The similarities between the three algorithms were used to derive a general optimization technique that can potentially be applied to any algorithm that needs to calculate distances and check them against some minimum distance condition. The optimization technique with a simple check presented in this paper successfully improves the performance of the three algorithms studied. The results showed a reduced runtime for all three algorithms using the optimization with the check, and the potential to reduce runtime even further for larger datasets. This matters as databases keep growing, because the technique becomes more effective as the amount of data grows. The clusters found by the three optimized algorithms were compared to the clusters found by the three algorithms without checks to verify that the optimization technique returns the same, valid clusters.

There is a potential problem with this optimization technique, which is a memory issue. Although the technique shows potential for greatly reducing the runtime on extremely large datasets, with larger and larger datasets it requires the storage of more and more distance information. At some point, there will not be enough room in memory to hold all of the required information.

Future research needs to test this optimization technique on extremely large datasets to determine whether there is a substantial improvement in performance as predicted, and to quantify the amount of improvement.

The memory requirement of this optimization technique also needs to be researched, because there are potential memory issues that could arise. There will need to be research on a more efficient storage of distance information, if memory issues do arise with this optimization technique.

The focus of this research was on 2-D data, so the optimization technique has been verified for 2-D data; further work is needed to apply it to data with more than two dimensions. It is anticipated that the cost saving will be even more visible there, because the higher the dimensionality, the higher the cost of each distance computation.

References
[1] Day, W., and Edelsbrunner, H., "Efficient algorithms for agglomerative hierarchical clustering methods," Journal of Classification, Vol. 1, pp. 7-24, December 1984.
[2] Ester, M., Kriegel, H.-P., Sander, J., and Xu, X., "A density-based algorithm for discovering clusters in large spatial databases with noise," Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 226-231, 1996.
[3] Keim, D. A., and Hinneburg, A., "Clustering techniques for large data sets—from the past to the future," Tutorial Notes of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, pp. 141-181, 1999.
[4] MacQueen, J., "Some methods for classification and analysis of multivariate observations," Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Los Angeles, CA, pp. 281-297, 1966.
[5] Parker, K., "Turn off, tune out, drop in," The Washington Post, April 2009.
[6] Pasi Fränti's clustering datasets - A1. http://cs.joensuu.fi/sipu/datasets/a1.txt.
[7] Richards, A. L., et al., "A comparison of four clustering methods for brain expression microarray data," BMC Bioinformatics, 9:490, 2008.
[8] Xu, R., and Wunsch, D., Clustering, IEEE Press, Hoboken, NJ, 2009.
[9] Zhang, T., Ramakrishnan, R., and Livny, M., "BIRCH: an efficient data clustering method for very large databases," Proceedings of ACM SIGMOD, Montreal, pp. 103-114, 1996.
[10] Zang, C., et al., "A clustering approach for identification of enriched domains from histone modification ChIP-Seq data," Bioinformatics, 25(15):1952-1958, 2009.
