
A Robust Hierarchical Clustering Algorithm and its Application in 3D Model Retrieval

Tianyang Lv1, 2, Shaobin Huang1, Xizhe Zhang2, and Zheng-xuan Wang2

1 College of Computer Science and Technology, Harbin Engineering University, Harbin, China
2 College of Computer Science and Technology, Jilin University, Changchun, China

[email protected]

Abstract

Clustering techniques can be adopted to analyze a 3D model database and improve retrieval performance. However, 3D model databases lack valuable prior knowledge, so it is difficult for clustering methods to pre-decide appropriate parameter values. Moreover, clustering methods are weak at handling outliers and simply treat them as "noise". This paper introduces a robust hierarchical clustering algorithm for analyzing 3D model databases. The proposed algorithm stops automatically by utilizing outlier information and adopts the concept of core group to reduce the influence of parameters on the clustering result; a core group refers to data that are always clustered together. After discussing some desirable properties of the new algorithm, the paper conducts a series of experiments on the Princeton Shape Benchmark and two real-life datasets from UCI. A comparative study demonstrates the advantages of our algorithm.

1. Introduction

With the proliferation of 3D models and their wide spread over the Internet, 3D model retrieval has become a hot research topic [1]. Research in this field concentrates on shape-based 3D model retrieval, since traditional retrieval methods are not satisfying in efficiency and effectiveness [2].

The analysis and organization of a 3D model database can improve retrieval performance. However, this topic has not been fully explored in previous research. For instance, ref. [3] adopts the R*-tree to organize a 3D model database, but the R*-tree is not suitable for the high-dimensional shape features of 3D models.

To solve this problem, clustering can be adopted to discover the distribution of the 3D models' shape features. Since models with similar features are clustered together, the clustering result can be treated as a kind of classification of the 3D model database, and retrieval can then be restricted to the clusters nearest to the target model.

However, it is difficult for clustering techniques to select appropriate parameter values. Take traditional hierarchical clustering as an example: methods in this category usually require a user-specified number k of final clusters to stop clustering. Moreover, clustering algorithms are also weak at handling outliers. Some cannot detect outliers at all, while others prune small or slowly growing clusters as outliers, like CURE [4] and DBScan [5]. The latter scheme treats outlier detection as a byproduct of clustering and cannot mine outliers effectively.

To overcome these drawbacks, the paper proposes a novel hierarchical clustering method, which shows its advantages in two aspects:

First, the method removes the parameter k by stopping clustering automatically according to the dissimilarity degree implied by the outliers. It is based on the following observation: as clustering progresses, the dissimilarity D(CNN-A, CNN-B) between the two currently most similar clusters CNN-A and CNN-B keeps increasing. Thus, the clustering process should stop at the moment when CNN-A and CNN-B become too diverse from each other. An outlier-mining method can be adopted to provide that suitable degree of diversity, since outliers are defined by their "great dissimilarity" from the others [6].

Second, the method adopts the novel concept of core group to reduce the influence of parameters. It is motivated by the phenomenon that some data are always clustered together, no matter what the parameter values are. These data reflect the underlying structure of a dataset more precisely, and the proposed clustering method tries to obtain such core groups.

The paper applies the proposed algorithm to analyzing a 3D model database and to improving the efficiency of 3D model retrieval by obtaining the retrieval result from the clusters nearest to the target model.

The rest of the paper is organized as follows: after introducing related work in Section 2, Section 3 proposes a new outlier detection method. Section 4 introduces the clustering algorithm. After discussing some related topics in Section 5, Section 6 conducts a series of experiments. Section 7 summarizes the paper.

Table 1. Important notations

Notation    Description
N           Total number of data
M           Dimensionality of data
k           Number of final clusters
Ci          The i-th cluster
D(Ci, Cj)   Distance between Ci and Cj

2. Related works

In recent years, clustering analysis has seen rapid improvements in many aspects. For example, reference [7] incorporates the clustering result into a database index; a sampling scheme has been proposed for handling large databases [8]; reference [9] proposes a new hierarchical clustering method based on dissimilarity increments to estimate k optimally, but the algorithm is weak at handling outliers and at detecting clusters with complex shapes.

Among these improvements, the CURE algorithm [4] employs r data points (representatives) to represent a cluster and shrinks the representatives towards the cluster's centroid by a fraction before computing the distance between clusters, in order to avoid noise.

The way to choose the representatives of cluster Ci is: if r >= ni, all data of Ci are representatives; otherwise, the first representative is the datum farthest from Ci's centroid, and the datum farthest from the previously chosen representative is selected as the next one; this is repeated until r representatives are decided. However, CURE has several shortcomings: it needs the user-specified parameter k; it is interfered with by outliers before they are pruned; and the clusters' density is not taken into consideration, which can lead to wrong merging decisions.
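For concreteness, the following Python sketch (an illustration of the selection rule just described, not code from [4]) picks r representatives from the member points of one cluster; the function name is ours.

import numpy as np

def choose_representatives(points, r):
    points = np.asarray(points, dtype=float)
    n = len(points)
    if n <= r:                                   # r >= n_i: every datum is a representative
        return points
    centroid = points.mean(axis=0)
    # first representative: the point farthest from the centroid
    chosen = [int(np.argmax(np.linalg.norm(points - centroid, axis=1)))]
    remaining = set(range(n)) - set(chosen)
    while len(chosen) < r:
        # next representative: the remaining point farthest from the previously chosen one
        prev = points[chosen[-1]]
        idx = max(remaining, key=lambda i: np.linalg.norm(points[i] - prev))
        chosen.append(idx)
        remaining.discard(idx)
    return points[chosen]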

As a separate branch of dedicated outlier detection methods, the distance-based method [10] was proposed. Its main idea is: if the distances between datum a and most other data are larger than a threshold Dout, then a is an outlier. However, it is difficult to pre-decide an appropriate Dout, and this method ignores the local distribution around a datum. Take Fig. 1 as an example: p is the first outlier candidate because it is the farthest from the others. However, the data located near q are packed more closely than those near p, so q is a more natural outlier.

In the field of 3D model retrieval, reference [1] presents a shape feature extraction method that applies the spherical harmonic transformation to a voxel descriptor of the 3D model in order to achieve rotation invariance. Its overview is as follows: first, the 3D model is projected into a 2R x 2R x 2R voxel grid, and the value of a voxel is set to 1 if it contains a point of the polygonal surface and to 0 otherwise; the model is then normalized; thus, for each sphere with radius r, the spherical function of the 3D model can be defined as:

f_r(\theta, \varphi) = f(r, \theta, \varphi) = Voxel(R + r\sin(\theta)\cos(\varphi),\; R + r\cos(\theta),\; R + r\sin(\theta)\sin(\varphi))    (1)

where \theta \in [0, \pi], \varphi \in [0, 2\pi] and r \in [1, R]. Each spherical function f_r can be decomposed as the sum of different frequencies:

f_r(\theta, \varphi) = \sum_{l=0}^{B-1} f_r^{l}(\theta, \varphi)    (2)

f_r^{l}(\theta, \varphi) = \sum_{m=-l}^{l} a_{l,m} Y_l^{m}(\theta, \varphi)    (3)

where Y_l^{m} is the harmonic homogeneous polynomial of degree l. Combining the signatures {||f_r^0||, ||f_r^1||, ...} of f_r for the different radii r, the shape descriptor of the 3D model is obtained, whose dimensionality depends on B and R.

The paper adopts this method to obtain the shape feature.
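As an illustration of this pipeline, the following Python sketch (not the code of [1]; the voxelization is assumed to be done already, and the angular sampling resolution is our own choice) computes the rotation-invariant signature from a binary 2R x 2R x 2R numpy array.

import numpy as np
from scipy.special import sph_harm   # scipy convention: sph_harm(m, l, azimuth, polar)

def shape_signature(voxel, R=32, B=10, n_theta=64, n_phi=64):
    theta = (np.arange(n_theta) + 0.5) * np.pi / n_theta        # polar angle in [0, pi]
    phi = (np.arange(n_phi) + 0.5) * 2 * np.pi / n_phi          # azimuth in [0, 2*pi]
    T, P = np.meshgrid(theta, phi, indexing="ij")
    d_omega = (np.pi / n_theta) * (2 * np.pi / n_phi) * np.sin(T)  # surface area element
    signature = []
    for r in range(1, R + 1):
        # sample f_r(theta, phi) on the sphere of radius r, eq. (1)
        x = np.clip(np.round(R + r * np.sin(T) * np.cos(P)).astype(int), 0, 2 * R - 1)
        y = np.clip(np.round(R + r * np.cos(T)).astype(int), 0, 2 * R - 1)
        z = np.clip(np.round(R + r * np.sin(T) * np.sin(P)).astype(int), 0, 2 * R - 1)
        f_r = voxel[x, y, z].astype(float)
        for l in range(B):
            # ||f_r^l|| = sqrt(sum_m |a_{l,m}|^2), eqs. (2)-(3), via numerical integration
            energy = 0.0
            for m in range(-l, l + 1):
                a_lm = np.sum(f_r * np.conj(sph_harm(m, l, P, T)) * d_omega)
                energy += abs(a_lm) ** 2
            signature.append(np.sqrt(energy))
    return np.asarray(signature)   # dimensionality R * B (e.g. 320 for R=32, B=10)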

Figure 1. Local distribution features of data

3. The outlier mining method based on dissimilarity ratio

To overcome the shortcomings of the traditional distance-based outlier mining method, this section proposes a method that considers the local distribution around each datum using the concept of dissimilarity ratio and determines Dout semi-automatically.

Due to the characteristics of outliers, it is natural to conclude that the distance between an outlier and its k-th nearest neighbor is much larger than that of a normal datum, since normal data are packed more closely together. Thus, the concept of dissimilarity ratio is introduced to describe the local distribution around a datum.


Definition 1 (Dissimilarity Ratio). For any datum a and its nearest neighbor b, if b's nearest neighbor is not a, name b's nearest neighbor c; otherwise name b's second nearest neighbor c. The dissimilarity ratio Ratio(a) of a is then:

Ratio(a) = D_{NN}(a) / D_{NN}(b)  if D_{NN}(a) \geq D_{NN}(b);  Ratio(a) = D_{NN}(b) / D_{NN}(a)  otherwise    (4)

where D_NN(a) is the distance between a and its nearest neighbor, and D_NN(b) denotes the distance between b and c. Obviously Ratio(a) >= 1, and its value shows the degree of isolation of a from its neighbors.

Figure 2. The statistical distributions of the distance between each datum and its nearest neighbor (Dis), the dissimilarity increment (Incre), the dissimilarity ratio (Ratio) and the OutDis value adopted in outlier mining (Outlier): (a) Iris dataset; (b) House dataset.

OutDis(a) = D_NN(a) * Ratio(a) is used to detect outliers. Fig. 2 shows the statistical distributions of the dissimilarity increment (Incre curve), the dissimilarity ratio (Ratio curve) and OutDis (Outlier curve) on two real-life datasets [11]. It can be seen that the Ratio and Outlier curves follow the exponential pattern more strictly.
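The following Python sketch (a minimal illustration under the reading of Definition 1 given above, in which D_NN(b) is taken as the distance between b and c) computes D_NN, Ratio and OutDis for a small dataset X.

import numpy as np

def ratio_and_outdis(X):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    np.fill_diagonal(D, np.inf)
    nn = D.argmin(axis=1)                                       # nearest neighbour of each datum
    d_nn = D[np.arange(len(X)), nn]                             # D_NN(a)
    ratio = np.empty(len(X))
    for a in range(len(X)):
        b = nn[a]
        # c: b's nearest neighbour, or its second nearest if that neighbour is a
        c = int(nn[b]) if nn[b] != a else int(np.argsort(D[b])[1])
        d_bc = D[b, c]
        ratio[a] = max(d_nn[a], d_bc) / min(d_nn[a], d_bc)      # eq. (4), always >= 1
    return d_nn, ratio, d_nn * ratio                            # OutDis = D_NN * Ratio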

The outlier detection threshold Dout is decided by referring to the even-distribution pattern; this reference is useful because clusters and outliers only exist when real-life data are distributed unevenly. Under an even distribution, the distance D̄_NN between each datum and its nearest neighbor is the same, and it can be computed according to equation (5), where a^(i)_max and a^(i)_min are the maximum and the minimum of all data in the i-th dimension:

\bar{D}_{NN} = \sqrt{ \sum_{i=1}^{M} \left( \frac{a^{(i)}_{max} - a^{(i)}_{min}}{\sqrt[M]{N}} \right)^{2} }    (5)

To describe how far the realistic distribution deviates from the even-distribution pattern, a parameter β is adopted and Dout = β · D̄_NN.

Thus, the criterion for outlier mining is:

Criterion 1 (Outlier detection). Datum a is an outlier if

D_{NN}(a) \cdot Ratio(a) \;\geq\; \beta \cdot \sqrt{ \sum_{i=1}^{M} \left( \frac{a^{(i)}_{max} - a^{(i)}_{min}}{\sqrt[M]{N}} \right)^{2} }.
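A minimal sketch of Criterion 1, assuming d_nn and ratio come from the previous sketch and that the parameter β has already been chosen (the procedure for choosing it follows):

import numpy as np

def detect_outliers(X, d_nn, ratio, beta):
    # even-distribution nearest-neighbour distance, eq. (5)
    span = X.max(axis=0) - X.min(axis=0)                  # a_max^(i) - a_min^(i)
    d_even = np.sqrt(np.sum((span / len(X) ** (1.0 / X.shape[1])) ** 2))
    d_out = beta * d_even                                 # outlier threshold D_out
    return np.where(d_nn * ratio >= d_out)[0]             # indices satisfying Criterion 1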

A method is proposed to decide the value of the parameter β. It is based on the following observation: once the actual outliers have been detected, the detected outlier number nout increases much faster as Dout decreases further. This is because outliers are extremely far away from the others, while normal data are relatively near to each other, as the Outlier curve of Fig. 2 shows. The method is stated as follows:

(1) Suppose β = Step when only one outlier is detected, and name Step the step length; obviously Step · D̄_NN ≈ D_NN(a_farthest) · Ratio(a_farthest), where a_farthest satisfies D_NN(a_farthest) · Ratio(a_farthest) >= D_NN(b) · Ratio(b) for any datum b;
(2) Observe the increasing speed V of the detected outlier number under different values of β, viz. V = n_out(β_l) − n_out(β_{l−1}) with β_l = Step/l and l = {1, 2, ...}; l is called the step num;
(3) If V reaches its first peak when l = i, set β = Step/(i−1).
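The following sketch implements this step-length procedure under the reading reconstructed above (candidate values β = Step/l); since parts of the original description are illegible, this is only one plausible interpretation, and the helper names are ours.

import numpy as np

def choose_beta(outdis, d_even, max_l=50):
    step = outdis.max() / d_even                              # beta = Step flags only the farthest datum
    n_out = {l: int(np.sum(outdis >= (step / l) * d_even)) for l in range(1, max_l + 1)}
    v = {l: n_out[l] - n_out[l - 1] for l in range(2, max_l + 1)}   # growth speed V at step num l
    # first peak of V: first step num whose V is not exceeded by the next one
    i = next((l for l in range(2, max_l) if v[l] >= v[l + 1]), max_l)
    return step / (i - 1)                                     # stop one step before the peak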

4. The hierarchical clustering algorithm using outlier information and core group

This section introduces the Clustering Algorithm using Outlier Information and Core group (CAOIC). Besides proposing a new way to compute the distance between clusters, CAOIC adopts core groups to reduce the influence of parameters during clustering and uses outlier information to stop clustering automatically.

4.1. Handling outliers

CAOIC performs outlier detection in two phases: (1) detecting outliers with the method of Section 3 before clustering; (2) treating very small clusters in the clustering result as outliers. The first phase reduces the interference of outliers with clustering, and the second refines the clustering result.

4.2. Computing clusters’ distance

Motivated by CURE, CAOIC uses representatives to compute the distance between clusters. In contrast to CURE, CAOIC considers the clusters' density when computing this distance and cancels CURE's shrinking parameter: CURE adopts it to exclude "noise", whereas CAOIC achieves this with the dedicated outlier pruning method stated in Section 3.

As shown in [9], the dissimilarity among data in the same cluster does not change greatly. Thus, CAOIC decides the distance D(Ci, Cj) between Ci and Cj according to two factors: first, the distance Dmin(Ci, Cj) between the nearest representatives of Ci and Cj; second, a factor δ that measures the change of the clusters' density Den. The density of Ci (or Cj) approximately equals the average distance among its representatives. For the new cluster Cnew created by merging Ci and Cj, Den(Cnew) = Dmin(Ci, Cj). Then δ(Ci) is defined as follows, and δ(Cj) likewise:

\delta(C_i) = Den(C_i) / Den(C_{new})  if Den(C_i) \leq Den(C_{new});  \delta(C_i) = Den(C_{new}) / Den(C_i)  otherwise    (6)

Since it is impossible to compute the density of a cluster with only one datum, D(Ci, Cj) = Dmin(Ci, Cj) is defined in that case. The way to compute D(Ci, Cj) is:

D(C_i, C_j) = D_{min}(C_i, C_j) \cdot (\delta(C_i) + \delta(C_j)) / 2  if (n_i > 1) and (n_j > 1);  D(C_i, C_j) = D_{min}(C_i, C_j)  otherwise    (7)

Obviously, the factor δ reflects the influence of the clusters' density on the merging decision. It is easy to prove that (δ(Ci) + δ(Cj)) / 2 <= 1, which means that the bigger the difference between the density of Ci or Cj and that of Cnew, the smaller the possibility that Ci and Cj are merged.
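The following Python sketch illustrates eqs. (6) and (7): the distance between two clusters is the distance between their nearest representatives, weighted by how much the merge would change the clusters' densities. The helper names are illustrative, and the inputs are the representative points of each cluster.

import numpy as np
from scipy.spatial.distance import cdist

def density(reps):
    # approximate density: average pairwise distance among the representatives
    if len(reps) < 2:
        return None                        # undefined for a single-datum cluster
    d = cdist(reps, reps)
    return d[np.triu_indices(len(reps), k=1)].mean()

def cluster_distance(reps_i, reps_j):
    d_min = cdist(reps_i, reps_j).min()    # distance of the nearest representatives
    den_i, den_j = density(reps_i), density(reps_j)
    if den_i is None or den_j is None:     # a cluster with only one datum, eq. (7) else-branch
        return d_min
    den_new = d_min                        # Den(C_new) = D_min(C_i, C_j)
    delta_i = min(den_i, den_new) / max(den_i, den_new)   # eq. (6)
    delta_j = min(den_j, den_new) / max(den_j, den_new)
    return d_min * (delta_i + delta_j) / 2                # eq. (7)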

4.3. Reducing parameter’s influence

CAOIC adopts the parameter r to compute D(Ci, Cj). Therefore, the value of r will influence the clustering result.

In the setting of agglomerative clustering, it is mainly the merging decisions made under different parameter values that cause differences among the clustering results. For example, Fig. 3 illustrates the clustering process under r=2 and r=5. The difference between the two results is caused by the merging decision at step 2: {a, b} is merged with {c} when r=2 but with {d} when r=5.

Figure 3. Example of the clustering process under different values of r: (a) r=2; (b) r=5.

According to the above analysis, CAOIC tries to make each resulting cluster consist of core groups by forecasting the influence of the parameter r on the clustering result. For a cluster Ci, its real nearest neighbor can always be found if r >= ni; but this is uncertain when r < ni, since the representatives may not describe their cluster's properties well. This can lead to a mistake in finding Ci's nearest neighbor Cj, and if Ci and Cj are then merged, the newly created cluster will not contain a core group.

Therefore, the proposed algorithm re-computes the nearest neighbor of a to-be-merged cluster if its size is larger than r. If its nearest-neighbor computation is not accurate under the current value of r, the cluster is excluded from the following clustering, viz. the cluster is frozen.
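The freezing test can be illustrated as follows (one possible interpretation of this section, not the authors' code): the representative-based nearest neighbor of a large cluster is checked against the nearest neighbor computed from all of its members, and a disagreement freezes the cluster.

import numpy as np
from scipy.spatial.distance import cdist

def should_freeze(cluster_points, rep_nn_index, all_clusters):
    """all_clusters: list of member-point arrays; rep_nn_index: neighbour chosen via representatives."""
    dists = [cdist(cluster_points, pts).min() if pts is not cluster_points else np.inf
             for pts in all_clusters]
    true_nn = int(np.argmin(dists))        # nearest neighbour using every member point
    return true_nn != rep_nn_index         # disagreement -> freeze instead of merging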

4.4. Stopping automatically according to outlier information

Without a user-specified condition to stop clustering, it is necessary to extract this information from the processed data. As stated, a suitable moment to stop clustering is when the clusters to be merged are too dissimilar. We use Dout as the dissimilarity threshold, since Dout is used to detect outliers and the major characteristic of outliers is their dissimilarity from the others.

Thus, the stop criterion of clustering is:


Criterion 2 (Automatic Stop). Suppose CNN-A and CNN-B are the most similar clusters at present; stop clustering if D(CNN-A, CNN-B) > Dout.

And the overview of CAOIC is shown in Figure 4.

Algorithm CAOIC(r)
1. { Read the M-dimensional data and obtain amax and amin;
2.   Treat each datum as a cluster;
3.   for ( i = 0; i < N; i++ )
4.     Compute Ci's nearest neighbor;
5.   Determine the value of Step;
6.   Name the nearest clusters at present CNN-A and CNN-B;
7.   Dout = outlier(amax, amin, Step);    // detect outliers
8.   while ( D(CNN-A, CNN-B) <= Dout && CNN-A != null )
9.   {  if ( !isFrozen(CNN-A, CNN-B) )
10.       Merge clusters CNN-A and CNN-B;
11.     Update CNN-A and CNN-B;  }
12.  Output the clustering result;
13. }    // End of CAOIC

Figure 4. The clustering algorithm using outlier information and representatives
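The automatic stop of Criterion 2 can be illustrated in isolation by the following self-contained Python sketch: a plain single-linkage agglomeration that halts once the nearest pair of clusters is farther apart than Dout. The density factor and the freezing test of Sections 4.2-4.3 are omitted here for brevity, so this is a simplification of CAOIC, not the full algorithm.

import numpy as np
from scipy.spatial.distance import cdist

def agglomerate_until_dout(X, d_out):
    clusters = [[i] for i in range(len(X))]          # start: every datum is a cluster
    while len(clusters) > 1:
        best = (np.inf, None, None)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cdist(X[clusters[i]], X[clusters[j]]).min()
                if d < best[0]:
                    best = (d, i, j)
        if best[0] > d_out:                          # Criterion 2: stop automatically
            break
        _, i, j = best
        clusters[i] += clusters[j]                   # merge the two nearest clusters
        del clusters[j]
    return clusters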

5. Discussion

5.1. Further clustering of core group

Since CAOIC sets some clusters aside as core groups during clustering, the number of core groups obtained will be larger than the user-preferred number k. To obtain the desired k clusters, we adopt K-means to perform a further clustering of the core groups. In that phase, a core group is considered as a whole and is represented by its center. Furthermore, the k largest core groups are selected as the initial points of K-means, because the larger a core group is, the closer it is to the center of a resulting cluster. This improves the clustering performance of K-means and reduces its number of iterations.
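A sketch of this refinement step using scikit-learn (the helper name and data layout are illustrative): each core group is reduced to its center, and K-means is seeded with the centers of the k largest core groups and run for a single iteration, as in the experiments below.

import numpy as np
from sklearn.cluster import KMeans

def refine_core_groups(core_groups, k):
    """core_groups: list of numpy arrays, each holding the member points of one core group."""
    centers = np.array([g.mean(axis=0) for g in core_groups])    # one point per core group
    sizes = np.array([len(g) for g in core_groups])
    init = centers[np.argsort(sizes)[::-1][:k]]                  # k largest groups as seeds
    km = KMeans(n_clusters=k, init=init, n_init=1, max_iter=1).fit(centers)
    return km.labels_                                            # cluster id of each core group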

5.2. Comparison with other works

Previous research usually treats clustering and outlier detection as separate topics, and little effort has been devoted to combining them.

The proposed method CAOIC makes clustering and outlier detection work together quite well. First, the proposed outlier detection method excludes most of the disturbance of outliers from clustering; second, the outlier information is utilized in the determination of k. CAOIC achieves this with little extra work, since it determines the vectors amax and amin while reading the data, and computing every datum's nearest neighbor is a preliminary step for both outlier detection and clustering.

CAOIC also applies multiple schemes to reduce the influence of parameters: it cancels the traditional parameter k, decides the parameter β automatically, and adopts the concept of core group to handle the parameter r.

5.3. Complexity analysis

The complexity of the traditional hierarchical algorithm is O(N^2). Since CAOIC is built on the traditional method, it is only necessary to analyze the complexity of each change. The complexity increases by O(N) for the extra scan that detects outliers. It costs O(r*ni) to compute cluster Ci's representatives and O(r^2) to compute its density; since these are needed (N-k) times, for the newly created clusters only, the complexity increases by at most O((r*ni + r^2)*(N-k)). It costs no more than O(r^2*(N-r)) to decide whether a to-be-merged cluster should be frozen, for (N-k) times. k is determined automatically and influences the complexity only indirectly. Taking all these changes together, the complexity increases by O((r*ni + r^2*(N-r) + r^2)*(N-k) + N). Since r^2 << N in most cases, for instance r=5 when N > 225, the complexity of CAOIC remains O(N^2).

6. Experiment and Analysis

The experiment adopts the Iris and Wine datasets from UCI to compare the proposed algorithm with K-means, the Frozen algorithm of [9], etc. The Princeton Shape Benchmark (PSB) with 1814 models [12] is used to test the performance of the combined method in 3D model retrieval. The feature extraction method of Section 2 with R=32 and B=10 is applied to obtain a shape feature of dimensionality 320 from each 3D model. Figure 5 shows the statistical distribution of the elements of the spherical harmonic transformation result for each f_r(\theta, \varphi). In the experiments, we adopt the Euclidean distance.


Figure 5. Average value of each dimension for the respective r

The criteria Entropy and Purity of [13] are used to measure the quality of the clustering results:

Entropy = \frac{1}{N} \sum_{i=1}^{k} n_i \left( -\frac{1}{\log q} \sum_{j=1}^{q} \frac{n_i^j}{n_i} \log \frac{n_i^j}{n_i} \right)    (8)

Purity = \frac{1}{N} \sum_{i=1}^{k} \max_{j}(n_i^j)    (9)

where n_i^j is the number of data of the j-th original class assigned to the i-th cluster, n_i is the size of the i-th cluster, and q is the number of original classes. The better the clustering result, the smaller the Entropy and the bigger the Purity.
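For reference, a small Python sketch computing both measures from the cluster labels and the original class labels:

import numpy as np

def entropy_purity(cluster_labels, class_labels):
    clusters, classes = np.unique(cluster_labels), np.unique(class_labels)
    q, N = len(classes), len(class_labels)
    entropy = purity = 0.0
    for c in clusters:
        members = class_labels[cluster_labels == c]
        n_i = len(members)
        counts = np.array([np.sum(members == j) for j in classes])   # n_i^j
        p = counts[counts > 0] / n_i
        entropy += n_i * (-(p * np.log(p)).sum() / np.log(q))        # per-cluster term of eq. (8)
        purity += counts.max()                                       # eq. (9)
    return entropy / N, purity / N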

6.1. Clustering Performance

Table 2 gives the clustering results of CAOIC and of K-means applied to the core groups; it also states the number of detected outliers.

To be more persuasive, Table 2 shows the best clustering results of K-means, DBScan and Frozen over all possible parameter values, while the best number of clusters for K-means is a little larger than the user-desired k. It can be seen that CAOIC and CAOIC + K-means achieve better performance than the others.

Since K-means approaches its optimum within 2 iterations when it is applied to the core groups, we set its number of iterations to 1 in the experiment for K-means + core groups. Figure 6 shows the clustering performance under different numbers of K-means iterations when handling the core groups of the Iris dataset.

6.2. Application in 3D model retrieval

With r=3 and β=0.6*5, CAOIC obtains 268 core groups from the Princeton Shape Benchmark, the smallest of size 2. Table 3 lists the details of C60 and C268. It can be observed that CAOIC clusters models with similar shape together, especially when the feature extraction method satisfies the requirement that models with similar shape have similar features. Since that requirement cannot always be met, some clustering mistakes can be found. Table 4 lists part of the detected outliers along with the step num l under which they are pruned.

We also applied the auto-stopping methods DBScan and Frozen. Under all possible parameter values, the Frozen algorithm obtains over 1200 clusters, which means that almost all clusters contain only 1 model, while DBScan tends to obtain huge clusters. For instance, when Eps=0.2 and MPts=2, DBScan gets 66 clusters with n0=715, n1=1028, n2, n3, ..., n7=2, and n8, n9, ..., n65=1. Obviously, these clustering results are not acceptable.

Then we perform the further clustering on the core groups to get k clusters and test the retrieval performance by treating each model of the PSB as a target model. Table 5 states the total retrieval time and the number of retrieved models that are among the 5 closest models of the target. In comparison with the linear scan method, the proposed method costs less than 10% of the time while obtaining fairly good results.
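A sketch of this cluster-based retrieval (the names and data layout are illustrative; `clusters` holds index arrays into the feature matrix and `centers` their mean vectors): the query feature is first compared with the cluster centers, and the exact 5-NN search is then restricted to the nearest clusters only.

import numpy as np

def retrieve(query, centers, clusters, features, n_clusters=3, top=5):
    # rank clusters by the distance between their centers and the query feature
    order = np.argsort(np.linalg.norm(centers - query, axis=1))[:n_clusters]
    candidates = np.concatenate([clusters[c] for c in order])
    # exhaustive search restricted to the candidate models
    dists = np.linalg.norm(features[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:top]]       # indices of the retrieved models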

Table 2. Overview of the clustering results of CAOIC and other algorithms

Algorithm        Dataset  Parameters            k    Purity  Entropy  nout
CAOIC            Iris     β=2.2, r=5            10   0.92    0.11     5
CAOIC            Wine     β=1.8, r=5            15   0.89    0.22     2
CAOIC + K-Means  Iris     k=3, Iterations=1     --   0.87    0.30     --
CAOIC + K-Means  Wine     k=3, Iterations=1     --   0.86    0.34     --
Frozen           Iris     4.0                   2    0.67    0.42     --
Frozen           Wine     0.5                   13   0.73    0.50     --
DBScan           Iris     Eps=0.7, MPts=3       2    0.69    0.41     --
DBScan           Wine     Eps=35, MPts=3        6    0.68    0.59     --
K-Means          Iris     k=4, Iterations=20    --   0.90    0.21     --
K-Means          Wine     k=7, Iterations=20    --   0.73    0.49     --

Figure 6. Clustering performance under different numbers of K-means iterations on the Iris dataset

7. Conclusion

To analyze complicated 3D model databases and reduce the influence of parameters on the clustering results, the paper proposes a new clustering algorithm, CAOIC, that integrates outlier detection with clustering and adopts core groups to reduce the influence of parameters. Experimental results show CAOIC's good performance in the application of 3D model retrieval.

Acknowledgements

This work is sponsored by the Natural Science Research Foundation of Harbin Engineering University under grant #HEUFT05007 to Tianyang Lv, by the Natural Science Research Foundation of Harbin Engineering University under grant #HEUF04090 to Shaobin Huang, and by the Innovation Foundation of the Graduate Innovation Lab of Jilin University under grant #503010 to Xizhe Zhang.

References

[1] T. Funkhouser, et al., "A Search Engine for 3D Models", ACM Transactions on Graphics, Vol. 22, No. 1, Jan. 2003, pp. 85-105.

[2] I. Atmosukarto, P. Naval, "A Survey of 3D Model Retrieval Systems", http://www.comp.nus.edu.sg/~indri/downloads/literature-review.ps, 2003.

[3] K. Lou, N. Iyer, et al., "Supporting Effective and Efficient Three-Dimensional Shape Retrieval", TMCE 2004 - Tools and Methods for Competitive Engineering, Lausanne, Switzerland, Vol. 1, 2004, pp. 199-210.

[4] S. Guha, R. Rastogi, and K. Shim, "CURE: an Efficient Clustering Algorithm for Large Databases", in Proceedings of the ACM SIGMOD Conference on Management of Data, Seattle, Washington, ACM Press, 1998, pp. 73-84.

[5] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", in Proceedings of the 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon, AAAI Press, 1996, pp. 226-231.

[6] D. Hawkins, Identification of Outliers, Chapman and Hall, London, 1980.

[7] D. Yu, A. Zhang, "ClusterTree: Integration of Cluster Representation and Nearest-Neighbor Search for Large Data Sets with High Dimensions", IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 5, September/October 2003, pp. 1316-1337.

[8] G. Kollios, et al., "Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets", IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 5, May 2003, pp. 1170-1181.

[9] A. L. N. Fred and J. M. N. Leitão, "A New Cluster Isolation Criterion Based on Dissimilarity Increments", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 8, August 2003, pp. 944-958.

[10] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets", in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, United States, 2000, pp. 427-438.

[11] S. Hettich, C. L. Blake, and C. J. Merz, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], University of California, Irvine, Department of Information and Computer Science, 1998.

[12] P. Shilane, P. Min, M. Kazhdan, T. Funkhouser, "The Princeton Shape Benchmark", in Proceedings of Shape Modeling International 2004 (SMI'04), Genova, Italy, June 2004, pp. 388-399.

[13] Y. Zhao, G. Karypis, "Criterion Functions for Document Clustering: Experiment and Analysis", Technical Report #01-40, University of Minnesota, 2001.


Table 3. Cluster’s detail of C60 and C268

Cluster Cluster’s Detail

C60

C268

Table 4. Part of the detected outliers and the respective values of the step num l

M741 (l=1), M737 (l=2), M416 (l=2), M1401 (l=3), M1678 (l=3)

Table 5. The retrieval performance comparison

Method                  Total Retrieval Time   Total number of 5-NN
CAOIC+K-means (k=28)    3688 ms                7508
CAOIC+K-means (k=40)    2859 ms                7138
CURE (k=28)             16906 ms               7285
Linear Scan             41188 ms               8070
