

Pattern Recognition Letters 30 (2009) 1477–1488


Rough-DBSCAN: A fast hybrid density based clustering method for large data sets

P. Viswanath a, V. Suresh Babu b,*

a Pattern Recognition Research Lab, Department of Computer Science and Engineering, NRI Institute of Technology, Guntur 522 009, Andhra Pradesh, India
b Institute for Research in Applicable Computing, Department of Computing and Information Systems, University of Bedfordshire, Luton Campus, Park Square, Luton, LU1 3JU, UK


Article history: Received 13 February 2008; Received in revised form 16 June 2009; Available online 15 August 2009.

Communicated by L. Heutte

Keywords: Clustering; Density based clustering; DBSCAN; Leaders; Rough sets

0167-8655/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2009.08.008

* Corresponding author. E-mail addresses: [email protected] (P. Viswanath), [email protected] (V. Suresh Babu).

Density based clustering techniques like DBSCAN are attractive because they can find arbitrarily shaped clusters along with noisy outliers. DBSCAN's time requirement is O(n²), where n is the size of the dataset, and because of this it is not suitable for large datasets. The solution proposed in this paper is to first apply the leaders clustering method to derive prototypes, called leaders, from the dataset; the method is modified so that, along with the prototypes, the density information is also preserved. These leaders are then used to derive the density based clusters. The proposed hybrid clustering technique, called rough-DBSCAN, has a time complexity of only O(n) and is analyzed using rough set theory. Experimental studies are done using both synthetic and real world datasets to compare rough-DBSCAN with DBSCAN. It is shown that for large datasets rough-DBSCAN can find a clustering similar to that found by DBSCAN, but is consistently faster than DBSCAN. Also, some properties of the leaders as prototypes are formally established.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Density based clustering techniques like DBSCAN (Density Based Spatial Clustering of Applications with Noise) (Ester et al., 1996) can discover arbitrarily shaped clusters along with detection of noisy outliers. This is in contrast with K-means type clustering methods (Duda et al., 2000), which find compact, spherically shaped clusters. The single-linkage method (Jain et al., 1999) can find arbitrarily shaped clusters, but apart from its huge time requirement, it is very sensitive to noise. Methods like BIRCH (Tian et al., 1996) which use scalable schemes (BIRCH uses a CF-tree to represent the hierarchy of clusterings) are suitable for spherically shaped compact clusters only. In this perspective, density based methods are attractive, but they too suffer from huge computational requirements. For DBSCAN, the time complexity is O(n²), where n is the size of the dataset. So it can take huge amounts of time with large datasets and hence is not suitable for data mining applications. One way to overcome this problem is to build an index over the dataset, like an R-tree (Guttman, 1984), which is useful for finding neighbors of a pattern based on a distance measure. But this solution is suitable only when the dimensionality of the data is low.

Hybrid clustering methods have recently been used (Cheu et al., 2004; Lin and Chen, 2005; Viswanath and Pinkesh, 2006) to overcome the problems with large datasets. The basic technique is to first find suitable prototypes from the large dataset and then to apply the clustering method using only the prototypes.


These kinds of schemes can be seen as approximate methods where the solution can deviate from that of the original method (which uses the entire dataset) depending on the representative power (quality) of the prototypes. The paper proposes to derive prototypes using a fast clustering method called the leaders clustering method (Spath, 1980). The prototypes derived are called leaders. Since leaders alone cannot be used to obtain the density information, the leaders method is modified so that along with each leader the number of patterns grouped with it is also preserved. These leaders are used to find the density based clusters present in the given dataset. The proposed hybrid clustering scheme, called rough-DBSCAN, is a modification of the well known density based clustering method DBSCAN (Ester et al., 1996). Rough set theory (Pawlak, 1982) is a mathematical approach used in various ways for various applications like feature reduction (Yan and Li, 2006; Shiu et al., 2006), classification (Ananthanarayana et al., 2003), and clustering (Asharaf and Murty, 2004), etc., to handle imprecision, vagueness, and uncertainty in data analysis. In this paper, rough set theory is used to analyze the proposed method. In the literature there exist other rough set based clustering methods like rough K-means (Pawan and Chad, 2004), rough fuzzy leaders (Asharaf and Murty, 2003), etc.

The present paper is a detailed and critically analyzed version of a preliminary result presented in a short paper (Viswanath and Pinkesh, 2006) at a conference. Some of the important additions in the present paper are: (i) some properties of using leaders as prototypes are theoretically established, (ii) the proposed method, i.e., rough-DBSCAN, is theoretically shown to run in time linear in the input dataset size, (iii) the quality of the clustering results of the proposed method is analyzed using rough set theory (Pawlak,


1991), and (iv) experimental results using some additional real world large datasets are given.

The rest of the paper is organized as follows. Section 2 briefly gives an overview of clustering methods followed by a detailed description of the DBSCAN method, while Section 3 reviews the leaders clustering method and presents a modified leaders method called counted-leaders. Further, a lower and an upper bound on the number of leaders are formally established. The proposed clustering method rough-DBSCAN, along with its relationship with DBSCAN, is presented in Section 4. Experimental results are given in Section 5. Finally, some conclusions are given in Section 6.

2. Literature review

This section first gives a brief overview of clustering methods, followed by a detailed description of DBSCAN.

2.1. An overview of clustering methods

Clustering methods can be broadly divided into two types, viz., hierarchical and partitional clustering methods. In hierarchical methods, a hierarchy of partitions is obtained where each level (of the hierarchy) represents a partition of the data set. If πi and πi+1 are two successive levels, then normally either πi is a refinement of πi+1 or πi+1 is a refinement of πi. The clustering at one level is obtained either by merging a few blocks (clusters) of the previous level or by splitting a cluster into a few blocks. Single-link, complete-link and average-link clustering methods are the most widely used methods of this category. The single-link clustering method is sensitive to noisy patterns and merges two clusters if they are connected by a chain of noisy patterns (Jain et al., 1999). The time complexity of single-link/complete-link/average-link is O(n³), where n is the number of patterns in the dataset. The CURE clustering method (Guha et al., 1998) is an improvement over single-link which carefully selects several scattered data points as representatives for each cluster and shrinks these representatives towards the centroid in order to eliminate the chaining effect. Let N_sample be the number of patterns taken as representatives for the whole dataset; then its computational complexity is O(N²_sample).

In addition, CURE uses a spatial search data structure, the kd-tree, to search for the nearest representatives. Partitional clustering methods initially obtain a partition (possibly a random partition) and progressively refine it to minimize a criterion function. Let t be the number of iterations required for the criterion to converge. A popular method of this category is the K-means clustering method (MacQueen, 1967), which tries to minimize a squared error criterion (Jain et al., 1999). An extended method, based on the arithmetic median value/median string rather than the arithmetic mean value of K-means, is called the K-median clustering method (Kohonen, 1985). It is suitable for both numerical and categorical datasets. The K-means clustering method is sensitive to noisy patterns, and hence a robust method called the K-medoids clustering method is given, which is based on medoids. It is robust to the presence of outliers and is known as Partitioning Around Medoids (PAM) (Kaufman and Rousseeuw, 1990). Partitional methods initially select K patterns as cluster representatives and then iteratively improve upon this selection. BIRCH (Tian et al., 1996) is a hybrid clustering method which initially partitions the data set into subclusters and uses cluster features (CF) to represent a subcluster. Later BIRCH applies a global clustering method to the generated subclusters. It uses a CF-tree for intermediate representation of the dataset. Let N be the number of nodes in the tree and let B be the branching factor of the CF-tree; then the space complexity of BIRCH is O(NB). Apart from these, a few other methods are based on probabilistic models, like mixture-resolving and mode-seeking methods,

where the assumption is that the patterns to be clustered are drawn from one of several distributions, and the goal is to identify the parameters, which is mostly done by maximum-likelihood or Expectation Maximization methods (Duda et al., 2000).

In recent years, spectral clustering has become one of the most popular clustering methods; it uses concepts from spectral graph theory (Fillippone et al., 2008; von Luxburg, 2006). Initially a graph is constructed using the given set of patterns, which are represented as the vertices of the graph. Each weighted undirected edge between a pair of vertices xi, xj represents the similarity between the patterns xi and xj. If two vertices xi, xj are not connected then the similarity between them is zero. A few popular constructions of the similarity graph are the ε-nearest neighbor graph, the k-nearest neighbor graph and the fully connected graph. The ε-nearest neighbor graph connects all points whose pairwise distance is smaller than ε, whereas the k-nearest neighbor graph connects vertex xi with vertex xj if xj is among the k nearest neighbors of xi. However, the k-nearest neighbor construction leads to a directed graph, as the neighborhood relationship is not symmetric. In order to make the neighborhood relation symmetric, it is proposed either to simply ignore the directions of the edges, or to connect vertex xi and vertex xj only if xi is among the k nearest neighbors of xj and vice versa. In the fully connected graph all pairs of points are connected, weighted by the similarity measure. Spectral clustering can be seen as a graph cut problem, and the core of this theory is to (i) find a partition of the graph based on the eigenvalue decomposition of the Laplacian matrix of the weighted graph obtained from the data set, or (ii) find a partition of the graph such that the edges between different groups have a very low weight and the edges within a group have high weight, or (iii) find a partition of the graph such that a random walk stays long within the same cluster and seldom jumps between clusters (spectral clustering from the random walk point of view). Spectral clustering is sensitive to changes in the similarity graph and to the choice of its parameters. The time and space complexity for building the similarity graph is O(n²) and hence it is not suitable for large datasets, except in rare cases where the similarity graph is highly sparse.

Nonparametric methods for density-based clustering have also been developed (Jain et al., 1999; Jain and Dubes, 1988). These are inspired by the Parzen-window approach to nonparametric density estimation. One of the recent and widely used density based clustering methods of this type is DBSCAN (Density Based Spatial Clustering of Applications with Noise) (Ester et al., 1996), which groups dense and nearby patterns together to form clusters. This approach can automatically find noisy outliers, which is not achieved in most of the other methods. Since the paper deals with DBSCAN and its improvements, it is described in a separate section below.

2.2. DBSCAN

DBSCAN groups data points which are dense and nearby into a single cluster. For example, if the underlying probability density function for the data is as given in Fig. 1, and the threshold density to distinguish between dense and non-dense points is T, then the clusters found are C1, C2 and C3. Patterns in the dataset which do not belong to any of the clusters are called noisy patterns. Details of the method are given below.

Since only a dataset is given as input (instead of the density function), the density at a point is found nonparametrically. It is assumed that the probability density over a small region is uniform, and the density is estimated as m/(nV), where m is the number of points out of the n input data points that fall in the region and V is the volume of the region. The region is taken to be a hypersphere of radius ε, and hence the threshold density can be specified by a parameter MinPts, the minimum number of points required to be present in the region to make it dense.
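As a small illustration of this estimate (a sketch of ours, not code from the paper; the helper names ball_volume, density_estimate and is_dense are assumptions made for the example), the following Python fragment counts the points that fall in the ε-ball around a query point, turns the count into the m/(nV) density estimate, and applies the MinPts test.

import numpy as np
from math import pi, gamma

def ball_volume(radius, dim):
    # Volume of a dim-dimensional hypersphere of the given radius.
    return (pi ** (dim / 2) / gamma(dim / 2 + 1)) * radius ** dim

def density_estimate(x, data, eps):
    # m = number of dataset points inside the eps-ball centred at x (x itself included if x is in data).
    m = int(np.sum(np.linalg.norm(data - x, axis=1) <= eps))
    n, dim = data.shape
    return m / (n * ball_volume(eps, dim)), m

def is_dense(x, data, eps, min_pts):
    # DBSCAN-style test: x is dense if its eps-neighbourhood holds at least MinPts points.
    _, m = density_estimate(x, data, eps)
    return m >= min_pts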

Fig. 1. Clusters found by DBSCAN: a density p(x) over x with threshold T giving rise to clusters C1, C2 and C3.

According to DBSCAN, if a pattern x is dense¹ then it is part of a cluster. A non-dense pattern can also be part of a cluster, as a border pattern of the cluster, if it is at a distance less than or equal to ε from a dense pattern; otherwise it is a noisy outlier. Two patterns x1 and x2 are in a cluster if there is a sequence of patterns x1, y1, y2, ..., ym, x2 in the dataset such that:

(1) the distance between successive patterns in the sequence is less than or equal to ε,
(2) if m = 0, then at least one of x1 and x2 is dense, and
(3) if m > 0, then the patterns y1, y2, ..., ym are dense.

The DBSCAN method is given in Algorithm 1, where D is the input dataset and N_ε(x, D) is the subset of patterns in D that are present in the hypersphere of radius ε centred at x, where x ∈ D. card(N_ε(x, D)) is the cardinality of this set, which is nothing but |N_ε(x, D)|. For given ε and MinPts, DBSCAN finds a dense point in the input set and expands it by merging neighboring dense regions together. The algorithm marks each pattern of D with a cluster identifier (cid) which gives the cluster to which the pattern belongs, or with the mark "noise" indicating that the pattern is a noisy one. One additional mark, "seen", is used to distinguish the patterns which have been processed from those which have not. Note that a pattern which is initially marked as "noise" can later become a border point of a cluster, and in that case the "noise" mark is deleted.

Algorithm 1. DBSCAN(D, ε, MinPts)

{Each cluster is given an identifier cid}
cid = 0;
for each pattern x in D do
  if x is not marked as "seen" then
    Mark x as "seen";
    Find N_ε(x, D);
    if card(N_ε(x, D)) < MinPts then
      Mark x as "noise";
    else
      cid = cid + 1;
      Mark each pattern of N_ε(x, D) with cluster identifier cid;
      Add each pattern of N_ε(x, D) which is not marked as "seen" to the list queue(cid);
      while queue(cid) is not empty do
        Take a pattern y from queue(cid) and mark it as "seen";
        if card(N_ε(y, D)) ≥ MinPts then
          Mark each pattern of N_ε(y, D) with cluster identifier cid;
          If any pattern of N_ε(y, D) is marked as "noise" then remove this mark;
          Add each pattern of N_ε(y, D) which is not marked as "seen" to the list queue(cid);
        end if
        Remove y from queue(cid);
      end while
    end if
  end if
end for
Output all patterns of D along with their cid or "noise" mark;

Table 1. Time and space complexities of the well known clustering methods.

Method | Time complexity | Space complexity
Single/complete link | O(n³) | O(n²)
K-means | O(n · K · t) | O(K)
K-medians | O(n² · t) | O(n)
K-medoids (PAM) | O(K · (n − K)²) | O(K)
DBSCAN | O(n²) | O(n)
CURE | O(N²_sample) | O(N²_sample)
BIRCH | O(n) (2 database scans) | O(N · B)
Leader | O(n) (1 database scan) | O(k)
Rough-DBSCAN | O(n + k²) (1 database scan) | O(k)

1 The DBSCAN given in (Ester et al., 1996) calls a dense point a core point and a non-dense point a non-core point. This modification is done to bring the material closer to the pattern recognition community.

It can be seen that the time consuming step in DBSCAN is finding N_ε(x, D), which can take O(n) time, where x ∈ D and |D| = n. Hence the time complexity of the method is O(n²). On the other hand, if we first derive k leaders by applying the leaders clustering method and subsequently apply DBSCAN only to the leaders, the total time complexity is O(n + k²). This is because the leaders clustering method, as explained in subsequent sections, takes only O(n) time to derive the leaders. Also, it is formally established that an upper bound on the number of leaders k exists which is independent of both n and the distribution from which the dataset is drawn, provided the assumption that the data is drawn from a bounded region of the feature space holds. In general, based on the experimental studies, it is shown that k is considerably smaller than n, especially for large datasets. Hence the hybrid scheme can be a faster way to work with large datasets. Table 1 lists the space and time complexities of well known clustering methods and of rough-DBSCAN.

3. Leaders

In this section, we first present the leaders clustering method (Spath, 1980; Hartigan, 1975) and then, as an improvement of this, we present the counted-leaders method, which preserves the crucial density information when the cluster representatives called leaders are derived.

The leaders method finds a partition of the given dataset like most clustering methods. Its primary advantage is its running time, which is linear in the size of the input dataset. To be more precise, it can find the partition in O(n) time, where n is the dataset size. It needs to read (or scan) the dataset only once from secondary memory and hence is also termed an on-line clustering method. Because of these factors the leaders method is gaining


popularity in fields like data mining where the dataset sizes are very large.

Fig. 2. Leaders is a bad clustering method (the leaders of the two clusters are marked).

The leaders method is given in Algorithm 2. For a given threshold distance τ, the leaders method works as follows. It maintains a set of leaders L, which is initially empty and is incrementally built. For each pattern x in the dataset D, if there is a leader l ∈ L such that the distance² between x and l is less than or equal to τ, then x is assigned to the cluster represented by l. In this case, we call x a follower of the leader l. Note that even if there are many such leaders, only one is chosen. If there is no such leader then x itself becomes a new leader and is added to L. The algorithm outputs the set of leaders L. Each leader can be seen as a representative for the cluster of patterns which are grouped with it.

Algorithm 2. Leaders(D, τ)

L = ∅;
for each x ∈ D do
  Find an l ∈ L such that ||l − x|| ≤ τ;
  if there is no such l or when L = ∅ then
    L = L ∪ {x};
  end if
end for
Output L.

Fig. 3. Ordered-Leaders find semi-spherical clusters.

As a clustering method, the leaders clustering method has the following shortcomings. Let the distance threshold used in the method be τ. The distance between two patterns that are grouped with a leader is guaranteed to be at most 2τ. In this sense, followers of a leader can be seen as similar. But there is no guarantee that followers of two different leaders are separated by a distance of at least τ. Indeed, even when we present the same pattern twice, the two copies may not be assigned to the same leader. So, patterns in distinct clusters need not be dissimilar. The clusters for two leaders can be as shown in Fig. 2.

To some extent these problems can be reduced as follows.

(1) Ordering the leaders as they are derived: A pattern x is assigned to a leader l such that ||l − x|| ≤ τ. If there is more than one such leader, then a leader is chosen according to a predefined ordering of the leaders. The ordering could be the same as the order in which the leaders are created. The resulting clustering is a set of semi-spheres as shown in Fig. 3.

(2) The clusters can be further refined by first deriving the leaders set L and then reassigning each pattern to the closest leader in L. This requires one more database scan. We call this step the refinement step (a small sketch of this reassignment follows the list).
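The refinement step itself is a single nearest-leader assignment; a minimal Python sketch (our own illustrative code, with NumPy assumed) is:

import numpy as np

def refine(data, leaders):
    # One extra database scan: reassign every pattern to its closest leader.
    data, leaders = np.asarray(data), np.asarray(leaders)
    d = np.linalg.norm(data[:, None, :] - leaders[None, :, :], axis=2)
    return d.argmin(axis=1)  # index of the closest leader for each pattern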

For the method proposed in this paper, leaders are used as prototypes to derive a density based clustering, and hence along with each leader the number of followers it has is also preserved. The number of followers a leader has is called the leader's count. The objective of the paper is to find a fast density based clustering method whose results are as close as possible to the results of DBSCAN. The proposed method uses some of the neighboring leaders along with their count values, and hence the refinement step, which requires an additional database scan, is not done. The proposed modified leaders clustering method, called Counted-Leaders, is given in Algorithm 3. The algorithm outputs a set of triplets LI where each triplet consists of a leader, its followers set, and its count.

2 For the sake of simplicity, we assume that the patterns are from a Euclidean space and Euclidean distance is used, whereas the proposed methods are applicable with any distance metric.

Algorithm 3. Counted-Leaders(D, τ)

L = ∅;
for each x ∈ D do
  Find a leader l ∈ L such that Distance(x, l) < τ. If there are many such leaders then choose the first one according to the ordering in which the set L is created.
  if there is no such l or when L = ∅ then
    L = L ∪ {x};
    followers(x) = {x};
    count(x) = 1;
  else
    followers(l) = followers(l) ∪ {x};
    count(l) = count(l) + 1;
  end if
end for
Output LI = {(l, followers(l), count(l)) | l ∈ L}.

3.1. Leaders: some properties

With a few assumptions which simplify the analysis, as described below, we want to make an important claim about the number of leaders, namely: there is an upper bound on the number of leaders which is independent of the size of the dataset from which the leaders are derived and also independent of the probability distribution from which the dataset is drawn. A lower bound on the number of leaders is also given, which for small datasets depends


on both the dataset size and the distribution from which the dataset is drawn.

Let the leaders set L be derived using a threshold distance τ from the dataset D. Each pattern x in D is drawn i.i.d. from a probability density function p(x). Since the scanning order of D

influences L, we assume that D is an ordered dataset, the ordering being the same as the scanning order. The scanning order is specified by a mapping m : {1, 2, ..., n} → D, where |D| = n. The set of all such mappings is denoted by O. Similarly, L is an ordered set, the ordering being the same as the order in which the leaders are added to the initially empty set L, which is incrementally built (see Algorithm 3). Let B_τ(z) be the region which is a closed ball of radius τ centred at z in the feature space. For a pattern z ∈ D, let N_τ(z, D) = {x | x ∈ D, ||z − x|| ≤ τ} = {x1, x2, ..., x_{r+1}}. N_τ(z, D) is the set of neighbors of z (including z) which are in B_τ(z). Apart from z there are r patterns in N_τ(z, D), where r is a non-negative integer.

The following observations and assumptions are used in the subsequent derivations.

(1) If l1 and l2 are two distinct leaders in L, then ||l1 − l2|| > τ.
(2) For a pattern z ∈ D, if z is not a leader, then there is a leader l such that l is in N_τ(z, D) − {z} and z is a follower of l.
(3) It is assumed that the scanning order is independent of |N_τ(z, D)|, i.e., of the number of training patterns in B_τ(z).
(4) It is assumed that, for a pattern z ∈ D, a pattern y ∈ D − N_τ(z, D) has no influence on z being or not being a leader. But y can have an indirect influence on z in cases where y is a leader and has a follower which is in N_τ(z, D). This assumption is made to simplify the analysis.
(5) It is assumed that ∫_{B_τ(z)} p(x)dx = p(z)V_τ, where V_τ is the volume of the closed ball B_τ(z).
(6) It is assumed that the dataset is drawn from a closed and bounded region of the feature space.³ Theoretically, this need not be true. But, in almost all practical situations, the range of values a feature can take is bounded. Hence the assumption that the dataset is from a bounded region of the feature space is not an unrealistic one.

3 In other words, this assumption is the same as saying that the feature space is a compact metric space.

Lemma 3.1. Given that a pattern z is in D, provided that |N_τ(z, D)| = r + 1, the expected probability with which z can become a leader, averaging over O (i.e., over all scanning orders of D), is E_O[P(z ∈ L | z ∈ D)] ≥ 1/(r + 1).

Proof. Since a pattern y ∈ D − N_τ(z, D) has no influence on z being or not being a leader, P(z ∈ L | z ∈ D) is independent of the ordering of patterns in D − N_τ(z, D). Because of this, we can restrict our attention to the patterns in N_τ(z, D) alone. z definitely becomes a leader when z occurs, according to the scanning order, before all of the patterns in N_τ(z, D) − {z}. The total number of orderings of the patterns in N_τ(z, D) is (r + 1)!. The number of orderings in which z occurs before the others is r!. So, E_O[P(z ∈ L | z ∈ D)] ≥ r!/(r + 1)! = 1/(r + 1), since there are also situations where a pattern y in N_τ(z, D) − {z}, even though it occurs before z according to the scanning order, might be a follower of some other leader and hence is not in competition with z to become a leader. □

Lemma 3.2. Let |N_τ(z, D)| = r + 1 and let R = {0, 1, ..., n − 1}. Averaging over all possible values of r (i.e., over R) and over all possible scanning orders of D (i.e., over O), for a given pattern z in D, the expected probability with which it can become a leader is

E_{R×O}[P(z ∈ L | z ∈ D)] ≥ 1/(Pn) − (1 − P)^n/(Pn),

where |D| = n and P = ∫_{B_τ(z)} p(x)dx.

Proof. Since it is assumed that the scanning order is independent of |N_τ(z, D)|, E_{R×O}[P(z ∈ L | z ∈ D)] = E_R[E_O[P(z ∈ L | z ∈ D)]].

Since it is given that z ∈ D, from the remaining (n − 1) i.i.d. patterns, the probability that exactly r of them fall in B_τ(z) is C(n−1, r) P^r (1 − P)^{n−1−r}, where C(·, ·) denotes the binomial coefficient. This is the probability with which (r + 1) patterns among the n patterns in D fall in B_τ(z). So, E_O[P(z ∈ L | z ∈ D)] ≥ 1/(r + 1) with probability C(n−1, r) P^r (1 − P)^{n−1−r}, for r ∈ R. So, the expected value of E_O[P(z ∈ L | z ∈ D)], averaging over all possible values of r, is

E_{R×O}[P(z ∈ L | z ∈ D)] ≥ Σ_{r=0}^{n−1} [1/(r + 1)] C(n−1, r) P^r (1 − P)^{n−1−r}
  ≥ Σ_{r=0}^{n−1} [1/(Pn)] C(n, r+1) P^{r+1} (1 − P)^{n−(r+1)}.

By replacing r + 1 with j, we get

E_{R×O}[P(z ∈ L | z ∈ D)] ≥ (1/(Pn)) Σ_{j=1}^{n} C(n, j) P^j (1 − P)^{n−j}
  ≥ (1/(Pn)) [ Σ_{j=0}^{n} C(n, j) P^j (1 − P)^{n−j} − C(n, 0) P^0 (1 − P)^n ]
  ≥ (1/(Pn)) [ (P + (1 − P))^n − (1 − P)^n ]
  ≥ 1/(Pn) − (1 − P)^n/(Pn). □

Since z is given to be in B_τ(z), we have P ≠ 0. Lemma 3.2 is an important one and some situations that help to understand it better are given below.

(1) When n = 1, z is the only pattern in D, and hence E[P(z ∈ L | z ∈ D)] ≥ 1/P − (1 − P)/P = 1. That is, z must be a leader.
(2) When P = 1, all n patterns lie in B_τ(z), hence E[P(z ∈ L | z ∈ D)] ≥ 1/n − 0/n = 1/n.

Theorem 3.1. Let the dataset D be drawn from a closed and bounded region S of the feature space. Then the number of leaders that can be derived from D using a threshold distance τ, such that τ > 0, is at least

V_S/V_τ − ∫_S (1 − p(z)V_τ)^n / V_τ dz,

where V_S is the volume of the region S and V_τ is the volume of a hypersphere of radius τ.

Proof. The expected value of E_{R×O}[P(z ∈ L | z ∈ D)], averaging over S, is

E_{S×R×O}[P(z ∈ L | z ∈ D)] = ∫_S E_{R×O}[P(z ∈ L | z ∈ D)] p(z) dz
  ≥ ∫_S [ 1/(Pn) − (1 − P)^n/(Pn) ] p(z) dz,

where P = ∫_{B_τ(z)} p(x)dx and |D| = n. Since it is assumed that ∫_{B_τ(z)} p(x)dx = p(z)V_τ, we have

E_{S×R×O}[P(z ∈ L | z ∈ D)] ≥ ∫_S [1/(n p(z)V_τ)] p(z) dz − ∫_S [(1 − P)^n/(n p(z)V_τ)] p(z) dz
  ≥ V_S/(nV_τ) − ∫_S (1 − P)^n/(nV_τ) dz.

Now E_{S×R×O}[P(z ∈ L | z ∈ D)] is the average probability with which a pattern in D is a leader. Since there are n patterns in D which are i.i.d. drawn from the distribution, the number of leaders k is given by

k = n · E_{S×R×O}[P(z ∈ L | z ∈ D)] ≥ V_S/V_τ − ∫_S (1 − p(z)V_τ)^n / V_τ dz. □

Theorem 3.1 gives a lower bound on the number of leaders. For large values of n, the value of (1 − p(z)V_τ)^n is very small and can be neglected, and hence the lower bound on the number of leaders can be taken as V_S/V_τ. The following theorem gives an upper bound on the number of leaders.

Theorem 3.2. Let the dataset D be drawn from a closed and bounded region S of the feature space. Then the number of leaders that can be derived from D using a threshold distance τ, such that τ > 0, is at most V_S/V_{τ/2}, where V_S is the volume of the region S and V_{τ/2} is the volume of a hypersphere of radius τ/2.

Proof. Let the leaders set be L and |L| = k. For two distinct leaders l1 and l2 in L, it is guaranteed that ||l1 − l2|| > τ. Assume that at each leader l ∈ L we place a closed ball of radius τ/2, i.e., B_{τ/2}(l). These closed balls do not intersect each other. The total volume of these closed balls is kV_{τ/2}. Since the dataset is assumed to be drawn from a bounded region whose volume is V_S, we have kV_{τ/2} ≤ V_S. Hence k ≤ V_S/V_{τ/2}. □

Let n0 be such that, for n = n0, ∫_S (1 − p(z)V_τ)^n / V_τ dz = 1. Then for n ≥ n0 we have

∫_S (1 − p(z)V_τ)^n / V_τ dz ≤ 1.   (1)

Corollary 3.1. If τ > 0, then for n ≥ n0 the following inequalities hold:

(V_S/V_τ) − 1 ≤ k ≤ (V_S/V_{τ/2}),   (2)

k ≤ n.   (3)

Proof. From Theorems 3.1 and 3.2 and Inequality (1), Inequality (2) directly follows. The set of leaders is a subset of the dataset from which the leaders are derived and hence Inequality (3) is true. □

Inequality (2) states that for a sufficiently large dataset (one for which n ≥ n0) the number of leaders has lower and upper bounds which are independent of the dataset size and also independent of the distribution function from which the dataset is drawn. Inequality (3) states that the number of leaders can never exceed the dataset size. Hence for large datasets, even as n tends to infinity, since k is finite and bounded, working with leaders has smaller time requirements than working with the entire dataset. It should be noted that, for a high dimensional data set, (V_S/V_τ) and (V_S/V_{τ/2}) are very large numbers (exponential in the number of dimensions). Hence for small data sets the bound given in Inequality (2) does not hold, since n could be much smaller than both (V_S/V_τ) and (V_S/V_{τ/2}). Similarly, the bound does not hold when the assumption that the data set is drawn from a closed and bounded region does not hold.

4. Rough-DBSCAN

In this section, we present the proposed density based clustering method called rough-DBSCAN, which aims at achieving a result similar to that of DBSCAN but with a much smaller time requirement. Rough-DBSCAN uses the same values for the parameters ε and MinPts as used by DBSCAN, but instead of partitioning the dataset directly, it first partitions the set of leaders derived from the dataset; this partition is later expanded into a partition of the dataset by replacing each leader by the set of patterns grouped with it, i.e., its followers set.

Rough-DBSCAN works with LI = {(l, followers(l), count(l)) | l ∈ L},⁴ the output of Algorithm 3, but it uses only the set of leaders and their respective count values. Hence a leader needs to be categorized as either dense or non-dense using only the count values of some of its neighboring leaders. According to DBSCAN, a leader l (which is also a pattern in the given dataset D) is dense if card(N_ε(l, D)) ≥ MinPts, and is non-dense otherwise. From the set of leaders, using the leaders' count values alone, it is not possible to find N_ε(l, D) exactly. But it can be roughly obtained by finding a lower and an upper approximation, as explained below.

4 L is used to denote the set of leaders alone.

Let L̲_l = {l_j ∈ L : ||l_j − l|| < ε − τ}. L̲_l is a set of leaders each of whose followers are guaranteed to be within the radius ε from l. For a leader l, a lower approximation of N_ε(l, D) is

N̲_ε(l, D) = ∪_{l' ∈ L̲_l} followers(l').   (4)

Let L̄_l = {l_j ∈ L : ||l_j − l|| ≤ ε + τ}. For a leader l, an upper approximation of N_ε(l, D) is

N̄_ε(l, D) = ∪_{l' ∈ L̄_l} followers(l').   (5)

It should be noted that

N̲_ε(l, D) ⊆ N_ε(l, D) ⊆ N̄_ε(l, D),   (6)

lim_{τ→0} N̲_ε(l, D) = lim_{τ→0} N̄_ε(l, D) = N_ε(l, D),   (7)

and

card(N̲_ε(l, D)) = Σ_{l' ∈ L̲_l} count(l'),   (8)

card(N̄_ε(l, D)) = Σ_{l' ∈ L̄_l} count(l').   (9)

Let DBSCAN, for a given ε and MinPts, find a partition π of the dataset D. If Algorithm 1 (i.e., DBSCAN) is applied to the set of leaders L with card(N_ε(l, D)) replaced by card(N̲_ε(l, D)), it will find a partition of L. Let this be transformed into a partition of D by replacing each leader by its followers, and let this partition be π̲. A leader that is dense according to card(N_ε(l, D)) can become non-dense according to card(N̲_ε(l, D)), and it can also become a noisy pattern. For example, if π is as shown in Fig. 4a, then π̲ could be as shown in Fig. 4b, where some of the clusters are further divided into subclusters and some of the patterns which are in a cluster are treated as noisy patterns. Similarly, if card(N̄_ε(l, D)) is used instead of card(N̲_ε(l, D)), we may get a partition π̄ as shown in Fig. 4c. The set of all partitions of D, with the distance between two partitions π1 and π2 defined as d(π1, π2) = 1 − Rand-Index(π1, π2), forms a metric space in which, as τ → 0, both π̲ and π̄ approach π. The similarity measure Rand-Index (Rand, 1971; Hubert and Arabie, 1985) is described in Section 5.

Fig. 4. Lower and upper approximations of the partition: (a) the partition found by DBSCAN, (b) the lower-approximation partition found by rough-DBSCAN, (c) the upper-approximation partition found by rough-DBSCAN.

An interpolated value between card(N̲_ε(l, D)) and card(N̄_ε(l, D)) can be taken as an approximation of card(N_ε(l, D)) to obtain a partition π′ of the dataset. For example, the average of card(N̲_ε(l, D)) and card(N̄_ε(l, D)) can be used. But this paper proposes to use rough-card(N_ε(l, D)), as explained below.

Let L̂_l = {l_j ∈ L : ||l_j − l|| ≤ ε}. Then

rough-card(N_ε(l, D)) = Σ_{l' ∈ L̂_l} count(l').   (10)

The leaders in L̄_l − L̲_l are leaders like li and lj shown in Fig. 5, which lie at the border of the hypersphere. A leader like li has some of its followers outside the hypersphere, whereas a leader like lj, which is itself outside the hypersphere, can have some of its followers inside the hypersphere. On average, since the distribution is assumed to be uniform over a small region like the hypersphere of radius ε, the number of leaders like li is the same as the number of leaders like lj. Hence taking only the leaders like li into account gives a better cardinality estimate, which is called rough-card(N_ε(l, D)).

The proposed clustering method rough-DBSCAN is given in Algorithm 4.

Algorithm 4. Rough-DBSCAN(LI, ε, MinPts)

(i) Let L be the set of leaders, extracted from LI.
(ii) Do DBSCAN(L, ε, MinPts), replacing card(N_ε(l, D)) by rough-card(N_ε(l, D)), where l is a leader in L. Let the partition obtained be π′_L.
(iii) Obtain π′ from π′_L by replacing each leader l by followers(l).
(iv) Output π′, which is a partition of D.
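The following condensed Python sketch puts the leader-level part of this together (our own illustrative code under the stated notation; it assumes the leaders and their counts have already been produced by Algorithm 3 and that ε and MinPts are the DBSCAN parameters).

import numpy as np
from collections import deque

def rough_dbscan_on_leaders(leaders, counts, eps, min_pts):
    # DBSCAN over the leaders (Algorithm 4, step (ii)), with rough-card as the density test.
    # Returns a cluster id per leader; 0 denotes noise. Followers inherit their leader's id (step (iii)).
    leaders, counts = np.asarray(leaders), np.asarray(counts)
    k = len(leaders)
    dist = np.linalg.norm(leaders[:, None, :] - leaders[None, :, :], axis=2)
    rough = (dist <= eps) @ counts        # rough-card(N_eps(l, D)) for every leader, Eq. (10)
    cid = np.zeros(k, dtype=int)          # 0 = noise / not yet assigned
    current = 0
    for i in range(k):
        if cid[i] != 0 or rough[i] < min_pts:
            continue                      # skip leaders already clustered or non-dense
        current += 1
        cid[i] = current
        queue = deque([i])
        while queue:
            j = queue.popleft()
            if rough[j] < min_pts:
                continue                  # border leader: keeps its label but is not expanded
            for m in np.where(dist[j] <= eps)[0]:
                if cid[m] == 0:           # an unassigned (possibly noisy) leader joins the cluster
                    cid[m] = current
                    queue.append(m)
    return cid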

4.1. Rough-DBSCAN: computational requirements

The space complexity of rough-DBSCAN is O(n), where n is the dataset size. This is the same as that of DBSCAN. The time requirement of rough-DBSCAN is O(n + k²), where k is the number of leaders, i.e., |L|, whereas the time requirement of DBSCAN is O(n²). Since it is guaranteed that k ≤ n, and assuming that the dataset is drawn from a bounded region of the feature space, there is an upper bound for k (see Theorem 3.2) which is a constant independent of n; hence the running time of rough-DBSCAN is O(n). So it is guaranteed that rough-DBSCAN is faster than DBSCAN, and it is especially suitable for large datasets.

Fig. 5. Different types of leaders.

4.2. Relationship between DBSCAN and rough-DBSCAN

This section analyzes the conditions that must be satisfied for rough-DBSCAN's result to be the same as DBSCAN's result.

For a pattern x ∈ D, let leader(x) be the leader l such that x ∈ followers(l). Let the clustering output of DBSCAN be the partition of the dataset denoted by πD, and that of rough-DBSCAN be denoted by π′D. In both cases, noisy patterns are grouped into a separate cluster so that πD and π′D are partitions of D. The following requirements must be satisfied if πD = π′D is to hold.

Let (x1, x2) be an arbitrary pair of patterns in D × D; then

Requirement 1: If x1 and x2 are in distinct clusters according to πD, then leader(x1) and leader(x2) are also in distinct clusters according to π′D.
Requirement 2: If x1 and x2 are in one cluster according to πD, then leader(x1) and leader(x2) are also in one cluster according to π′D.

There are two types of mistakes made by rough-DBSCAN because of which π′D may not be the same as πD. These are:

(1) Density mistake: A pattern x can be dense (or non-dense) according to DBSCAN, but leader(x) may be non-dense (or dense) according to rough-DBSCAN.
(2) Near-ness mistake: We say that two patterns p and q are transitively near if there is a sequence of patterns p, y1, y2, ..., ym, q such that the distance between any two successive patterns is less than or equal to ε. If two patterns p and q are transitively near according to DBSCAN but leader(p) is not transitively near to leader(q) according to rough-DBSCAN (i.e., there is no sequence of leaders leader(p), l1, l2, ..., leader(q) such that successive leaders are at a distance less than or equal to ε), then we say that there is a near-ness mistake.

If the rough-DBSCAN method uses Eq. (10) to find the cardinality, then, according to the argument given in Section 4, it is not unrealistic to assume that there are no density mistakes when τ < ε.

Requirement 1 is satisfied easily, as shown by the following lemma.


Lemma 4.1. Assuming that there are no density mistakes, if two arbitrary patterns x1 and x2 are in distinct clusters according to DBSCAN, then they are in distinct clusters according to rough-DBSCAN also.

Proof. Suppose that x1 and x2 are in distinct clusters, but leader(x1) and leader(x2) are in the same cluster. We show that this leads to a contradiction.

Two patterns p and q are in a cluster according to DBSCAN if there is a sequence of patterns s : (p, y1, y2, ..., ym, q) in D which satisfies: (i) if m = 0 then at least one of p or q is dense and ||p − q|| ≤ ε, (ii) all of the patterns y1, ..., ym are dense and the distance between any two successive patterns in the sequence s is less than or equal to ε.

That x1 and x2 are in the same cluster according to rough-DBSCAN means that there is a sequence s′ : (x1, l1, l2, ..., lm, x2) which satisfies the above mentioned requirements. Then, since l1, l2, ..., lm are also patterns in D, according to DBSCAN x1 and x2 must be in one cluster. This is the required contradiction. □

The analysis of Requirement 2 is more involved. We use a term called the Nearness-Index, which is defined below.

Definition 4.1 (Nearness-Index). For a set of patterns S, we say that δ-nearness is true if for every pair of patterns p and q in S there is a sequence of patterns p, y1, y2, ..., ym, q such that (i) the distance between any two successive patterns in the sequence is less than or equal to δ, (ii) if m = 0, at least one of p and q is dense, and (iii) if m > 0, y1, y2, ..., ym are dense. Then the Nearness-Index of the set S is the minimum δ value such that δ-nearness is true.

Lemma 4.2. If the distance between two patterns x1 and x2 is a, then the distance between leader(x1) and leader(x2) is less than or equal to a + 2τ, where τ is the threshold distance used while deriving the leaders.

Proof. Since the patterns are assumed to be drawn from a metric space, the distance measure satisfies the triangle inequality. Let the distance between two patterns x1 and x2 be denoted by dist(x1, x2). We have dist(x1, leader(x1)) ≤ τ. So, dist(x2, leader(x1)) ≤ a + τ according to the triangle inequality. We have dist(x2, leader(x2)) ≤ τ. So, dist(leader(x1), leader(x2)) ≤ a + 2τ. □

Lemma 4.3. Let two patterns x and y be in a cluster C according to DBSCAN, where C is a cluster other than the cluster consisting of the noisy patterns. If the Nearness-Index for C is less than or equal to ε − 2τ, then x and y are also in one cluster according to rough-DBSCAN, provided there are no density mistakes.

Proof. That x and y are in the cluster C and the Nearness-Index for C is less than or equal to ε − 2τ means that there is a sequence of patterns x, z1, z2, ..., zp, y in C satisfying (i) the distance between any two successive patterns in the sequence is less than or equal to ε − 2τ, (ii) if p = 0, at least one of x and y is dense, and (iii) if p > 0, z1, z2, ..., zp are dense. Consider the sequence leader(x), l1, l2, ..., lp, leader(y), where li = leader(zi) for i ∈ {1, 2, ..., p}. According to Lemma 4.2, in this sequence successive patterns are at a distance less than or equal to ε. Since we assume that there are no density mistakes, leader(x) and leader(y) must be in the same cluster according to rough-DBSCAN. This in turn means that x and y are in the same cluster according to rough-DBSCAN. □

Theorem 4.1. Rough-DBSCAN's result is the same as DBSCAN's result if the following conditions are true.

(1) There are no density mistakes.
(2) Each cluster of patterns has a Nearness-Index of at most ε − 2τ.

Proof. Follows directly from Lemmas 4.1 and 4.3. □

Theorem 4.1 states a sufficient condition for DBSCAN and rough-DBSCAN to give the same result. But this is not a necessary condition. That is, even when these conditions are not met, it is possible in certain cases that rough-DBSCAN's result is the same as DBSCAN's result. One simple reason for this is that the two kinds of mistakes, viz., density mistakes and near-ness mistakes, can cancel each other. Note that for ε − 2τ ≥ 0 to hold, it is required that τ ≤ ε/2.

The Nearness-Index is related to the distance between neighboring patterns in a cluster. If the distance between neighboring patterns in a cluster is small, then it is likely that rough-DBSCAN's result is close to DBSCAN's result. As the dataset size increases, the distance between neighboring patterns in a cluster reduces, and hence the Nearness-Index reduces. Suppose the probability density required to make a cluster dense is t, and let there be n patterns in the dataset. Then, within a cluster, in a hypersphere of radius ε the expected number of patterns present is k ≥ tnV, where V is the volume of the hypersphere. Assuming that within this hypersphere the density is uniform, on average the distance between two neighboring patterns in the hypersphere is less than or equal to ε/k = ε/(tnV). This is nothing but the Nearness-Index on average. We require Nearness-Index ≤ ε − 2τ, that is, ε/(tnV) ≤ ε − 2τ. That is, on average, for n ≥ ε/(tV(ε − 2τ)) the required condition can be satisfied. So it is more likely that, for large datasets, rough-DBSCAN's result is the same as DBSCAN's result.

4.3. Rough-DBSCAN: further improvements

While the paper has the limited scope of analyzing leaders as prototypes and their usage in a density based clustering scheme, it is worth noting that the time requirement of rough-DBSCAN can be further reduced by: (i) building an index over the set of leaders, like an R-tree index (Guttman, 1984) or a KD-tree index (Bentley, 1975), or (ii) clustering the leaders set L using the leaders method with a larger threshold value, which gives a coarser set of leaders that can be used to reduce the search time, as described in the earlier version of the paper (Viswanath and Pinkesh, 2006).

5. Experimental results

Experimental studies are done with the following objectives:

(1) To compare the clustering result obtained by rough-DBSCAN with that of DBSCAN.
(2) To compare the time taken by rough-DBSCAN with that taken by DBSCAN.

The clustering results of DBSCAN and rough-DBSCAN are compared using the similarity measure Rand-Index (Rand, 1971; Hubert and Arabie, 1985), which is described below.

For an n element dataset, let π and π′ be two partitions. Let a be the number of pairs of patterns in the dataset which are present together in a block of π and also present together in a block of π′. That is, these pairs are grouped together according to π and also according to π′. Let b be the number of pairs in the dataset which are not grouped together in a block according to π and also not according to π′. That is, each such pair is split across two different blocks (clusters) according to π and also according to π′. Then

Rand-Index(π, π′) = (a + b) / C(n, 2),

where C(n, 2) = n(n − 1)/2 is the number of pairs of patterns. Rand-Index has a value between 0 and 1, with 0 indicating that the two partitions do not agree on any pair of patterns and 1 indicating that the partitions are exactly the same.
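A straightforward quadratic-time Python sketch of this measure (our own helper, not code from the paper), taking the two partitions as flat label sequences:

from itertools import combinations

def rand_index(labels_a, labels_b):
    # Fraction of pattern pairs on which the two partitions agree (both together or both apart).
    assert len(labels_a) == len(labels_b)
    agree, pairs = 0, 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)
        pairs += 1
    return agree / pairs

# Example: rand_index([1, 1, 2, 2], [1, 1, 1, 2]) gives 3/6 = 0.5.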


Table 2. Details of datasets used.

Dataset | Number of features | Number of patterns
Gaussian mixture | 2 | 30,000
Banana | 2 | 4000
Shuttle | 9 | 58,000
Letter | 16 | 20,000
Pendigits | 16 | 10,992


Experimental studies are done with two synthetic and three standard datasets whose properties are summarized in Table 2. The three standard datasets are the Shuttle, Letter and Pendigits datasets, where the Shuttle and the Letter datasets are obtained from the URL http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html and the Pendigits dataset is obtained from the URL http://mlearn.ics.uci.edu/databases/pendigits. The synthetic datasets are: (i) the Gaussian mixture dataset, and (ii) the Banana dataset. The Gaussian mixture dataset is a two dimensional dataset having 30,000 patterns which is generated from the distribution (density) function (1/3)p1(x) + (1/3)p2(x) + (1/3)p3(x), where pi(x) = N(μi, Σi) for i = 1, 2 and 3, with μ1 = (0, 0)^T, μ2 = (6, 6)^T, μ3 = (6, −6)^T,

Σ1 = [1/2 0; 0 1],  Σ2 = [1 0; 0 1]  and  Σ3 = [1 0; 0 1/2].


That is, it is a three component mixture where each component is a Gaussian distribution. The other synthetic dataset, i.e., the banana dataset, is also a two dimensional dataset having 4000 patterns, consisting of two banana shaped clusters as shown in Fig. 8.

Experiments are conducted on a PC with an Intel P4 processor (3.2 GHz) and 512 MB RAM. Initially, the comparison between the number of patterns and the number of leaders for the various datasets is given in Fig. 6, where the horizontal axis represents the varying dataset sizes and the vertical axis represents the number of leaders. The dataset is varied each time by adding patterns to the previous (existing) dataset, and the scanning order of the dataset is the same order in which it is present in the dataset. The comparisons of Rand-Index and execution time for DBSCAN and rough-DBSCAN are shown in a table for each dataset. The first column of each table shows the Rand-Index values between DBSCAN and rough-DBSCAN for different threshold values and varying dataset sizes. The second column shows the execution time of rough-DBSCAN for different threshold values and varying dataset sizes. Finally, the third column shows the execution time of DBSCAN for varying dataset sizes. The threshold parameter τ used in the leaders clustering method is also varied. For DBSCAN and rough-DBSCAN the parameters ε and MinPts used are the same. But, as the dataset size is varied, MinPts is also varied proportionately (so that the apparent cut-off density threshold that distinguishes dense points from non-dense points remains the same). The results are summarized below.

Fig. 6. Number of leaders vs. number of patterns for the Letter, Shuttle, Pendigits and synthetic (Gaussian mixture) datasets, each for several threshold values τ.

Fig. 8. Synthetic: banana data set (the marked patterns are seen as noise by DBSCAN).

5.1. The Gaussian mixture dataset

Fig. 7 is a pictorial representation of DBSCAN's result and rough-DBSCAN's results for 1000 randomly chosen patterns. The rough-DBSCAN results are shown when the leaders are derived using two different threshold values, viz., τ = 0.23 and τ = 0.02. From Fig. 7 it can be seen that for the larger τ the DBSCAN result and the rough-DBSCAN result deviate considerably, but when the τ value is lowered the difference is also low.

Table 3 shows the comparison between Rand-Index and time (in seconds) for varying dataset sizes and τ values. The τ values used (i.e., the leaders threshold values used to derive the leaders) are 0.08, 0.06, 0.04 and 0.02, and ε is chosen as 0.12. The parameter MinPts is 10 when the dataset size is 3000, and it is varied proportionately with the dataset size. For the threshold τ = 0.02, the Rand-Index value is 0.991642 over the whole dataset and the execution time of rough-DBSCAN is less than 60% of that of DBSCAN. It is observed from the results that as the number of patterns increases, rough-DBSCAN's result approaches DBSCAN's result.

Table 4
Synthetic: banana data set.

s | Rand-Index | DBSCAN's execution time (seconds) | Rough-DBSCAN's execution time (seconds)
8 | 0.981 | 26.42 | 0.079
4 | 0.993 | 26.42 | 0.237
2 | 1     | 26.42 | 0.711

5.2. The banana dataset

The banana dataset, which consists of two banana shaped clusters, is shown in Fig. 8. The experimental results are summarized in Table 4. For s = 8 the results of DBSCAN and rough-DBSCAN are considerably different, and Fig. 8 marks some of the patterns which are seen as noise by DBSCAN but not by rough-DBSCAN. As the s value is reduced, rough-DBSCAN's result approaches that of DBSCAN, and with s = 2 rough-DBSCAN gives the same result as DBSCAN while its execution time is less than 3% of that of DBSCAN.

[Fig. 7. Synthetic: Gaussian mixture data set representation for 1000 patterns.]

Table 3
Synthetic: Gaussian mixture data set.

Dataset size | Rand-Index for s = 0.08, 0.06, 0.04, 0.02 | Rough-DBSCAN's execution time (seconds) for s = 0.08, 0.06, 0.04, 0.02 | DBSCAN's execution time (seconds)
3000   | 0.690 0.805 0.906 0.983 | 0.35 0.50 0.72  0.98  | 2.01
6000   | 0.770 0.890 0.934 0.983 | 0.76 1.25 2.15  3.54  | 4.84
9000   | 0.810 0.898 0.945 0.969 | 1.17 2.05 3.84  7.03  | 8.81
12,000 | 0.868 0.915 0.979 0.989 | 1.55 2.83 5.66  11.72 | 14.94
15,000 | 0.883 0.918 0.954 0.958 | 1.97 3.67 7.70  17.00 | 24.08
18,000 | 0.906 0.937 0.959 0.986 | 2.32 4.34 9.32  22.83 | 33.61
21,000 | 0.886 0.965 0.976 0.986 | 2.80 5.26 11.79 29.10 | 46.29
24,000 | 0.919 0.968 0.980 0.988 | 3.17 6.11 13.44 36.69 | 61.61
27,000 | 0.929 0.971 0.982 0.991 | 3.55 6.67 15.17 41.61 | 76.29
30,000 | 0.932 0.971 0.988 0.992 | 3.92 7.58 17.24 49.49 | 104.90

5.3. The Pendigits dataset

Table 5 shows the results obtained for the Pendigits dataset. The ε value used is 40. The MinPts used when the dataset size is 1000 is 4. The s values used are 30, 25, 20 and 15.



Table 5
Pendigits data set.

Dataset size | Rand-Index for s = 30, 25, 20, 15 | Rough-DBSCAN's execution time (seconds) for s = 30, 25, 20, 15 | DBSCAN's execution time (seconds)
1000   | 0.828 0.910 0.965 0.996   | 0.13 0.19 0.26  0.30  | 1.06
2000   | 0.839 0.915 0.968 0.996   | 0.35 0.58 0.88  1.14  | 1.92
3000   | 0.842 0.916 0.974 0.996   | 0.64 1.11 1.78  2.50  | 3.86
4000   | 0.838 0.911 0.975 0.998   | 0.95 1.72 2.90  4.26  | 5.98
5000   | 0.841 0.915 0.974 0.999   | 1.35 2.44 4.31  6.63  | 8.13
6000   | 0.851 0.920 0.973 0.9996  | 1.68 3.22 5.80  9.19  | 11.07
7000   | 0.853 0.946 0.983 0.9996  | 2.04 3.94 7.40  12.09 | 14.09
8000   | 0.851 0.965 0.986 0.9997  | 2.48 4.90 9.26  15.47 | 18.19
9000   | 0.852 0.966 0.984 0.9998  | 2.88 5.69 11.24 19.20 | 24.56
10,992 | 0.853 0.982 0.997 0.9999  | 3.70 7.58 15.33 27.52 | 33.60


With s = 15, rough-DBSCAN's result is very close (Rand-Index = 0.999988) to that of DBSCAN, while its execution time is less than 82% of that of DBSCAN when the whole dataset is used. Further, it can be observed that as the number of patterns increases, rough-DBSCAN's result approaches DBSCAN's result.

5.4. The Shuttle dataset

Table 6 summarizes the results obtained for the Shuttle dataset. The s values used are 0.02, 0.01, 0.005 and 0.001. The ε value used is 0.03. The parameter MinPts when the dataset size is 5000 is taken to be 20. It can be seen that when s = 0.001 rough-DBSCAN and DBSCAN give virtually the same result (Rand-Index ≈ 1), while rough-DBSCAN's execution time is less than 10% of that of DBSCAN when the entire dataset is used. It is also observed that as the number of patterns increases, rough-DBSCAN's result approaches DBSCAN's result.

Table 6
Shuttle data set.

Dataset size | Rand-Index for s = 0.02, 0.01, 0.005, 0.001 | Rough-DBSCAN's execution time (seconds) for s = 0.02, 0.01, 0.005, 0.001 | DBSCAN's execution time (seconds)
5000   | 0.813 0.997 0.997 0.9996 | 0.33 0.91 1.01 2.36  | 8.08
10,000 | 0.814 0.998 0.999 0.9996 | 0.60 1.71 1.88 5.75  | 23.56
15,000 | 0.823 0.993 0.993 0.9997 | 0.90 2.40 2.65 9.22  | 42.67
20,000 | 0.821 0.998 0.998 0.9993 | 1.13 3.07 3.40 12.46 | 67.92
25,000 | 0.823 0.997 0.998 0.9994 | 1.39 3.70 4.10 15.61 | 101.82
30,000 | 0.823 0.999 0.999 0.9998 | 1.66 4.30 4.71 18.60 | 145.68
35,000 | 0.833 0.999 0.999 0.9999 | 1.89 4.93 5.44 21.81 | 202.69
40,000 | 0.859 0.999 0.999 0.9998 | 2.19 5.56 6.10 24.55 | 268.78
45,000 | 0.856 0.999 0.999 0.9997 | 2.39 6.14 6.77 27.62 | 321.68
50,000 | 0.858 0.998 0.998 0.9998 | 2.71 6.70 7.40 30.40 | 436.81
58,000 | 0.864 0.998 0.999 0.9999 | 3.05 7.62 8.41 35.11 | 553.79

Table 7
Letter data set.

Dataset size | Rand-Index for s = 0.4, 0.35, 0.3, 0.25 | Rough-DBSCAN's execution time (seconds) for s = 0.4, 0.35, 0.3, 0.25 | DBSCAN's execution time (seconds)
2000   | 0.637 0.827 0.868 0.918 | 0.65  0.84  0.95  1.16  | 5.86
4000   | 0.647 0.834 0.878 0.929 | 1.78  2.61  3.00  4.11  | 12.78
6000   | 0.648 0.845 0.875 0.932 | 3.08  4.83  5.75  8.19  | 21.15
8000   | 0.680 0.855 0.900 0.949 | 4.58  7.36  9.18  13.67 | 32.83
10,000 | 0.704 0.849 0.891 0.951 | 5.97  10.21 12.96 20.28 | 44.67
12,000 | 0.710 0.848 0.901 0.955 | 7.32  13.04 16.71 27.29 | 58.66
14,000 | 0.670 0.854 0.902 0.955 | 9.00  16.62 21.30 34.90 | 74.67
16,000 | 0.709 0.868 0.901 0.955 | 10.33 19.28 25.47 43.72 | 92.87
18,000 | 0.728 0.868 0.899 0.958 | 11.66 22.68 29.99 53.24 | 112.68
20,000 | 0.764 0.868 0.902 0.959 | 13.17 25.97 35.41 64.51 | 134.59

5.5. The Letter dataset

The experimental results with the Letter dataset are summarized in Table 7. The s values used are 0.4, 0.35, 0.3 and 0.25. The ε value used is 0.5. The parameter MinPts when the dataset size is 2000 patterns is 8. It can be seen that the Rand-Index is 0.959704 when the entire dataset is used with s = 0.25, while the time taken by rough-DBSCAN is less than 50% of that taken by DBSCAN. Further, it is observed that as the number of patterns increases, rough-DBSCAN's result approaches DBSCAN's result.

6. Conclusions

The paper presented a novel hybrid clustering scheme to obtain density based, arbitrary shaped clusters. The method, called rough-DBSCAN, first derives prototypes called leaders by using the leaders clustering method and then uses these prototypes to derive the density based clusters.


The leaders clustering method is modified to store the number of followers each leader has, which enables estimating the densities. The proposed method is analyzed using rough set theory. Some properties of the leaders and the relationship between DBSCAN and rough-DBSCAN are established theoretically. It is shown that rough-DBSCAN's running time is linear in the number of patterns, whereas DBSCAN's is quadratic. Experimental studies show that, as the dataset size increases and when the threshold distance used to derive the leaders is low, rough-DBSCAN's result is very close to DBSCAN's result but is obtained in much less time. The proposed clustering method, rough-DBSCAN, is therefore a scalable one, suitable for deriving density based clusters when the dataset sizes are large, as in data mining applications.
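
As a rough illustration of the follower-counting idea summarized above (a sketch only, not the authors' exact implementation), a single scan of the data yields the leaders together with their follower counts, which subsequently act as surrogate densities:

import math

def leaders_with_counts(patterns, tau):
    """One pass over the data: a pattern joins the first leader within
    distance tau (incrementing that leader's count) or becomes a new leader.
    The counts later serve as surrogate densities for the DBSCAN-style step."""
    leaders = []                      # list of [leader_point, count] pairs
    for x in patterns:
        for entry in leaders:
            if math.dist(x, entry[0]) <= tau:
                entry[1] += 1
                break
        else:
            leaders.append([x, 1])    # a new leader counts itself
    return leaders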

References

Ananthanarayana, V.S., Murthy, M.N., Subramanian, D.K., 2003. Tree structure for efficient data mining using rough sets. Pattern Recognition Lett. 24 (6), 851–862.
Asharaf, S., Murty, M.N., 2003. An adaptive rough fuzzy single pass algorithm for clustering large data sets. Pattern Recognition 36 (12), 3015–3018.
Asharaf, S., Murty, M.N., 2004. A rough fuzzy approach to web usage categorization. Fuzzy Sets Systems 148 (1), 119–129.
Bentley, J.L., 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18 (9), 509–517.
Cheu, E.Y., Kwoh, C.K., Zhou, Z., 2004. On the two-level hybrid clustering algorithm. In: Proc. Internat. Conf. on Artificial Intelligence in Science and Technology, pp. 138–142.
Duda, R.O., Hart, P.E., Stork, D.G., 2000. Pattern Classification, second ed. Wiley-Interscience, John Wiley & Sons.
Ester, M., Kriegel, H.P., Xu, X., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. 2nd ACM SIGKDD, Portland, Oregon, pp. 226–231.
Filippone, M., Camastra, F., Masulli, F., Rovetta, S., 2008. A survey of kernel and spectral methods for clustering. Pattern Recognition 41, 176–190.
Guha, S., Rastogi, R., Shim, K., 1998. CURE: An efficient clustering algorithm for large databases. In: Proc. Conf. Management of Data (ACM SIGMOD'98), pp. 73–84.
Guttman, A., 1984. R-trees: A dynamic index structure for spatial searching. In: 13th ACM SIGMOD Internat. Conf. on Management of Data, vol. 2, Boston, MA, pp. 47–57.
Hartigan, J.A., 1975. Clustering Algorithms. John Wiley & Sons, New York.
Hubert, L., Arabie, P., 1985. Comparing partitions. J. Classification 2, 193–218.
Jain, A., Dubes, R., 1988. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, USA.
Jain, A., Murty, M.N., Flynn, P.J., 1999. Data clustering: A review. ACM Comput. Surv. 31 (3), 264–323.
Kaufman, L., Rousseeuw, P., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York.
Kohonen, T., 1985. Median strings. Pattern Recognition Lett. 3, 309–313.
Lin, C.-R., Chen, M.-S., 2005. Combining partitional and hierarchical algorithms for robust and efficient data clustering with cohesion self-merging. IEEE Trans. Knowledge Data Eng. 17 (2), 145–159.
MacQueen, J.B., 1967. Some methods for classification and analysis of multivariate observations. In: Proc. Fifth Berkeley Symp. on Mathematical Statistics and Probability, vol. 1. University of California Press, Berkeley, pp. 281–297.
Pawan, L., Chad, W., 2004. Interval set clustering of web users with rough k-means. J. Intell. Inform. Systems 23 (1), 5–16.
Pawlak, Z., 1982. Rough sets. Internat. J. Comput. Inform. Sci. 11, 341–356.
Pawlak, Z., 1991. Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, Dordrecht.
Rand, W.M., 1971. Objective criteria for the evaluation of clustering methods. J. Amer. Statist. Assoc. 66, 846–850.
Shiu, L.Y., S.K.C., Li, P.S.K., 2006. A rough set based case based reasoner for text categorization. Internat. J. Approximate Reason. 41 (2), 229–255.
Spath, H., 1980. Cluster Analysis Algorithms for Data Reduction and Classification. Ellis Horwood, Chichester, UK.
Tian, Z., Raghu, R., Micon, L., 1996. BIRCH: An efficient data clustering method for very large databases. In: Proc. ACM SIGMOD Internat. Conf. Management of Data, pp. 103–114.
von Luxburg, U., 2006. A tutorial on spectral clustering. Tech. Rep. TR-149, Max Planck Institute for Biological Cybernetics.
Viswanath, P., Pinkesh, R., 2006. l-DBSCAN: A fast hybrid density based clustering method. In: Proc. 18th Internat. Conf. on Pattern Recognition (ICPR-06), vol. 1. IEEE Computer Society, Hong Kong, pp. 912–915.
Yan, S.S., Li, P.S.K., 2006. Combining feature reduction and case selection in building CBR classifiers. IEEE Trans. Knowledge Data Eng. 18 (3), 415–429.