Selective Sampling for Approximate Clustering of Very Large Data Sets

Liang Wang,1,† James C. Bezdek,2,‡ Christopher Leckie,1,∗ Ramamohanarao Kotagiri1,†

1Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia
2Department of Computer Science, University of West Florida, Pensacola, FL 32514, USA

∗Author to whom all correspondence should be addressed: [email protected].
†e-mails: {lwwang, rao}@csse.unimelb.edu.au.
‡e-mail: [email protected].

International Journal of Intelligent Systems, Vol. 23, 313–331 (2008). DOI 10.1002/int.20268.

A key challenge in pattern recognition is how to scale the computational efficiency of clustering algorithms on large data sets. The extension of non-Euclidean relational fuzzy c-means (NERF) clustering to very large (VL = unloadable) relational data is called the extended NERF (eNERF) clustering algorithm, which comprises four phases: (i) finding distinguished features that monitor progressive sampling; (ii) progressively sampling from an N × N relational matrix R_N to obtain an n × n sample matrix R_n; (iii) clustering R_n with literal NERF; and (iv) extending the clusters in R_n to the remainder of the relational data. Previously published examples on several fairly small data sets suggest that eNERF is feasible for truly large data sets. However, it seems that phases (i) and (ii), i.e., finding R_n, are not very practical because the sample size n often turns out to be roughly 50% of N, and this oversampling defeats the whole purpose of eNERF. In this paper, we examine the performance of the sampling scheme of eNERF with respect to different parameters. We propose a modified sampling scheme for use with eNERF that combines simple random sampling with (parts of) the sampling procedures used by eNERF and a related algorithm, sVAT (scalable visual assessment of clustering tendency). We demonstrate that our modified sampling scheme can eliminate the oversampling of the original progressive sampling scheme, thus enabling the processing of truly VL data. Numerical experiments on a distance matrix of a set of 3,000,000 vectors drawn from a mixture of 5 bivariate normal distributions demonstrate the feasibility and effectiveness of the proposed sampling method. We also find that actually running eNERF on a data set of this size is very costly in terms of computation time. Thus, our results demonstrate that further modification of eNERF, especially the extension stage, will be needed before it is truly practical for VL data.

1. INTRODUCTION

A decade ago, Huber1 defined large data sets as being of the order of n = 10^8 bytes, and said "Some simple standard database management tasks with computational complexity O(n) or O(n log n) remain feasible beyond terabyte monster sets, while others (e.g., clustering) blow up already near large data sets." As the size of data sets continues to increase, machine learning faces ever more demanding requirements for efficient data analysis. Data collection often results in sets that exceed the primary memory capacity of computer workstations, so it is very hard to cluster really large data sets. Computer capacity will continue to increase, but so will data size. There will always be data sets that are too large for any computer to handle, so methods that extend to very large (VL) data sets are of continuing importance.

Figure 1. Population R_∞ and samples R_VL, R_L, and R_S.

How should we cluster VL data sets? There are two broad approaches to this problem: distributed clustering, and sampling plus extension. We follow the latter scheme in this paper. An extension scheme clusters a representative (and manageably sized) sample of the full data, and then extends the sample result to obtain (approximate) labels for the remaining data. In contrast, a literal scheme applies the clustering algorithm without modification to the full data set. When the data set is VL, sampling and extension offers a clustering solution (i.e., makes clustering feasible) for cases where it is not possible to use the literal approach alone. If the data set is merely large (L), but still loadable, then an extended scheme may offer an approximate solution comparable to the literal solution at a significantly reduced computational cost; in other words, it accelerates the literal scheme. The benefits for the two cases can be summarized as feasibility for VL data and acceleration for L data. Both situations are depicted in Figure 1, where the data set to be clustered is R_L or R_VL. R_∞ is the source population, and R_S represents a sample of either R_VL or R_L. In a nutshell, sampling procures R_S, a literal algorithm clusters it, and then the extension algorithm uses the sample-based clustering to label the rest of the data in R_L − R_S or R_VL − R_S. A fundamental difference between these two cases involves the calculation of an approximation error. For L data sets, we can assess the approximation error by measuring the difference between the clustering solutions obtained using the corresponding extended and literal schemes. On the other hand, the only solution generally available for a VL data set is the one obtained by the extended scheme, in which case the approximation error cannot be measured. Thus, our confidence in the accuracy of extended clusters in the unverifiable case (R_VL) is necessarily derived from the verified good behavior we can observe by conducting various R_L experiments.

Recently, the fuzzy c-means (FCM) clustering algorithm2 has been successfully extended for VL image data,3 VL object data,4 and VL relational data.5 The term "relational data" refers to a data set that is represented only by a set of pairwise similarities (or dissimilarities) between objects, rather than by a vector of feature values for each object. Different variations of the basic extension scheme are illustrated in these papers, but always on fairly small data sets. This was done to enable comparisons between approximate and literal solutions, to gain some confidence in the overall procedure. This paper discusses the first attempt to actually use one of these methods, namely eNERF, described in Bezdek et al.,5 for a VL relational data set. How well does it work? Not so well! We find that several modifications of eNERF are necessary before it is really useful for VL data; to wit, the sampling procedure needs to be amended, and furthermore, the extension procedure is so time consuming that although it does work well, it is not very practical. We will tackle the first problem in this paper, and highlight the second problem as an issue for future research. An additional issue that is important for clustering in practice is how to choose the number of clusters to seek (c), which is crucial for any of the extended (or conventional) forms of c-means clustering algorithms. In this paper, that job is taken over by the use of sVAT,6 which enables us to visually examine the structure of samples from a truly very large data set in advance of using (any) clustering algorithm. Although it does not do clustering, sVAT helps us select an appropriate value for c. It should be mentioned that sVAT is a scalable version of VAT7 for large data sets that can be implemented in a computationally efficient manner.

The core of any extension scheme is sampling. Simple random sampling has been widely used, for example, to speed up fuzzy c-means,8 to make support vector machines scalable,9 and so on. Alternatively, Provost et al.10 provide a very readable analysis and summary of progressive sampling schemes. Their study shows that progressive sampling is more efficient than literal implementations on the full data under many different circumstances. Meek et al.11 propose a progressive sampling approach for (EM) clustering. These authors advocate the use of a learning curve associated with the expected gain to be realized by moving to the next sample size in the schedule. An important difference between sampling for the FCM-based extensions3−5 and the sampling used in other work10,11 is that the extended clustering algorithms require just a single run of the primary clustering algorithm on the accepted sample, whereas the methods in the literature10,11 assume that the parent algorithm will be run on each sample in the schedule until progressive sampling terminates, and the algorithmic output on the terminal sample is the final result.

There are other ways to cluster VL data in the recent literature.12−19 Single-pass algorithms use all of the VL data sequentially, with some schemes processing the full data set in manageable chunks and then combining the results from each chunk. Examples of single-pass algorithms can be found in several of these works.15−19 Schemes for clustering very large object data sets include a graph-based approach,18 a sample-based hierarchical algorithm (CURE),19 and a method in Bradley et al.15 that builds probabilistic clusters from a random subsample and then validates the sample result using additional data.

In contrast to this previous work, in this paper we evaluate the effectiveness of progressive sampling in practice in the context of the eNERF clustering algorithm for VL relational data, and we investigate alternative approaches to sampling in this context. The main contributions of this paper are as follows:

— We demonstrate the susceptibility of the progressive sampling scheme in eNERF to oversample the data;

— We develop a modified sampling scheme for eNERF, called selective sampling, which uses a combination of random sampling and nearest-prototype partitioning to build a set of samples for clustering; and

— We demonstrate the effectiveness of our selective sampling scheme in terms of reducing the amount of data that needs to be sampled, while still maintaining the accuracy of the final clustering results compared to those of the original eNERF algorithm.

In Section 2 we briefly review eNERF. In Section 3, we analyze the progressive sampling scheme used by eNERF5 that leads to untenable sample sizes, and present a modified sampling scheme to overcome this limitation. Section 4 gives several examples that compare the original and modified sampling schemes based on the quality of approximate clusters for L or VL data sets. Section 5 summarizes our work and lists some open questions for future research.

2. eNERF

Consider a set of N objects {o_1, …, o_N}. Numerical object data has the form X = {x_1, …, x_N} ⊂ ℜ^s, where the coordinates of x_i provide feature values describing object o_i. Sometimes measurements are numerical relational data, which consist of N^2 pairwise similarities (or dissimilarities), represented by a matrix R = [R_ij = (dis)similarity(o_i, o_j) | 1 ≤ i, j ≤ N]. We can always convert X into dissimilarity data D by computing d_ij = ‖x_i − x_j‖ in any vector norm on ℜ^s, so most relational clustering algorithms are (implicitly) applicable to object data. However, there are both similarity and dissimilarity relational data sets that do not begin as object data. For example, consider the problem of finding clusters of related products among the set of products sold by a supermarket. We may have a matrix of similarities between products, based on the frequency of their co-occurrence in customer transactions, but no object representation to describe the products. In cases such as these, we have no choice but to use a relational algorithm such as eNERF.
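For concreteness, the object-to-relational conversion can be sketched in a few lines of NumPy (the function name is ours; the authors' experiments used MATLAB):

    import numpy as np

    def object_to_dissimilarity(X):
        """Convert object data X (N x s) into an N x N Euclidean dissimilarity matrix D.

        The result satisfies d_ij >= 0, d_ij = d_ji, and d_ii = 0, as required below.
        """
        # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
        sq = np.sum(X ** 2, axis=1)
        D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        np.maximum(D2, 0.0, out=D2)      # guard against tiny negatives from round-off
        return np.sqrt(D2)

    # Example: five random points in R^2
    X = np.random.default_rng(0).normal(size=(5, 2))
    D = object_to_dissimilarity(X)
    assert np.allclose(D, D.T) and np.allclose(np.diag(D), 0.0)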

Let c be the number of classes, 1 < c < n. Crisp label vectors in ℜ^c look like y_i = (0, …, 1, …, 0)^T, with the 1 in the ith place meaning that objects with this label belong to class i. Fuzzy (and probabilistic) label vectors (with, e.g., c = 3) look like y = (0.1, 0.6, 0.3)^T; they have entries in [0, 1] that sum to 1. We follow the usual notation for fuzzy and hard labels:

N_fc = { y = (y_1, …, y_c)^T ∈ ℜ^c : Σ_i y_i = 1, 0 ≤ y_i ≤ 1, ∀ i };   N_hc = { y ∈ N_fc : y_i ∈ {0, 1}, ∀ i }.

Henceforth, O_N = {o_1, …, o_N} has dissimilarity matrix D_N, and O_n = {o_i1, …, o_in} of size n has dissimilarity matrix D_n. We assume that D_N = [d_ij] satisfies, for 1 ≤ i, j ≤ N: d_ij ≥ 0; d_ij = d_ji; d_ii = 0 (dissimilarity).

Clustering in O_N (or, somewhat sloppily, in D_N) is the assignment of (hard or fuzzy) label vectors to the objects in O_N. A c-partition of O_N (or D_N) is a set of cN values {u_ik} arrayed as a c × N matrix U = [U_1 U_2 … U_N] = [u_ik], where U_k, the kth column of U, is the label vector for o_k. The element u_ik of U is the membership of o_k in cluster i. Each column of U sums to 1, and each row of U must have at least one nonzero entry for U to qualify as a nondegenerate c-partition of the objects.
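As a concrete illustration of these constraints, a fuzzy c-partition can be validated with a small helper of ours (not part of eNERF):

    import numpy as np

    def is_nondegenerate_fuzzy_partition(U, tol=1e-9):
        """Check that U (c x N) is a nondegenerate fuzzy c-partition:
        entries in [0, 1], every column sums to 1, and no row (cluster) is empty."""
        in_range = np.all((U >= -tol) & (U <= 1 + tol))
        columns_sum_to_one = np.allclose(U.sum(axis=0), 1.0)
        no_empty_cluster = np.all(U.sum(axis=1) > tol)
        return bool(in_range and columns_sum_to_one and no_empty_cluster)

    U = np.array([[1.0, 0.6, 0.2],     # c = 2 clusters
                  [0.0, 0.4, 0.8]])    # N = 3 objects
    print(is_nondegenerate_fuzzy_partition(U))   # True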

The eNERF algorithm comprises the sequential application of four algorithms5: DF, PS, LNERF, and xNERF. In brief, algorithm DF determines h "distinguished features" {m_k : 1 ≤ k ≤ h}, which are used to guide progressive sampling. This amounts to picking rows (corresponding to the distinguished feature choices) that correspond to very dissimilar objects. Algorithm PS randomly selects, without replacement, an initial subset of indices from {1, …, N}, and then tests the corresponding sample O_n to see whether it is a good (statistical) fit to the full data set. The criterion for acceptability of O_n is based on comparing the distribution of the distinguished features found by DF in the columns corresponding to O_n to the corresponding distribution of the distinguished features in O_N, using the divergence test statistic. Note that the use of a histogram-based test for such acceptability requires selection of the desired number of histogram bins b; equal-content bins are elegantly chosen in Bezdek et al.5 If the sample passes, progressive sampling terminates. If the sample fails, it is augmented by randomly selecting additional data without replacement, adding the new data to the sample, and then retesting it. This cycle continues until a representative sample D_n is found. After termination of PS, literal NERF (LNERF) finds a fuzzy c-partition of D_n. Finally, the extension procedure xNERF uses the results of LNERF clustering to obtain labels for the remaining N − n objects in the data, which, when combined with the n LNERF labels, provide a fuzzy partition of the full data set D_N (equivalently, of the full set of objects {o_1, …, o_N}). The details of this procedure are too complicated to repeat in this paper; the reader may refer to Bezdek et al.5 for a complete discussion.
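The accept/augment cycle of PS can be pictured with the sketch below. It grows a random sample and, for each distinguished-feature row, compares an equal-content histogram of the sampled entries against the full row using a chi-square-type divergence. The statistic, threshold logic, and function names are illustrative assumptions of ours; the exact test used by eNERF differs in its details (see Bezdek et al.5).

    import numpy as np
    from scipy.stats import chi2

    def sample_is_acceptable(D_rows, sample_idx, b=10, alpha=0.8):
        """D_rows: h x N array holding the h distinguished-feature rows of D_N.
        Accept the sample if every row's sampled values match the row's
        equal-content histogram (chi-square-type goodness-of-fit sketch)."""
        n = len(sample_idx)
        for row in D_rows:
            edges = np.quantile(row, np.linspace(0, 1, b + 1))   # equal-content bins
            counts, _ = np.histogram(row[sample_idx], bins=edges)
            expected = np.full(b, n / b)          # equal-content bins expect n/b per bin
            stat = np.sum((counts - expected) ** 2 / expected)
            if stat > chi2.ppf(alpha, df=b - 1):  # this feature's distribution fails
                return False
        return True

    def progressive_sample(D_rows, N, n0, delta, rng=np.random.default_rng(0), **kw):
        """Grow a random sample (without replacement) until it passes the test."""
        perm = rng.permutation(N)
        n = n0
        while n < N and not sample_is_acceptable(D_rows, perm[:n], **kw):
            n += delta                            # augment the sample and retest
        return np.sort(perm[:min(n, N)])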

Figure 2 summarizes eNERF, which is composed of the four algorithms DF, PS, LNERF, and xNERF, as discussed earlier. The shaded cells in the upper left panel depict an H × H candidate matrix from which h "distinguished features" {m_k} are chosen by algorithm DF. The shaded rows in the upper right panel are the full rows of D_N corresponding to the h rows in D_H; this is the input matrix to PS. The shaded cells in the lower right panel are the entries of D_n, the sample matrix selected by PS to be processed by LNERF.20 The lower left panel depicts a (reindexed) version of D_N. The blocks of shaded cells in the first n columns are the fixed values of D_n, whereas the n shaded cells in column j, n + 1 ≤ j ≤ N, are used by xNERF to estimate a label vector for object o_j, j = n + 1, …, N. The final operation in this sequence is to form a c-partition of O_N by concatenating U_xNERF,N−n with U_LNERF,n, thereby creating the c × N matrix U_app = [U_LNERF,n | U_xNERF,N−n], the approximation to the literal clusters in D_N that cannot be computed when D_N is unloadable.

Figure 2. Architecture of the original eNERF clustering algorithm.

3. SELECTIVE SAMPLING

Samples often provide sufficient accuracy at far less computational cost than clustering the entire data set directly. However, obtaining "just enough" useful samples to achieve sufficient accuracy is a challenging problem. The motivation for using (DF + PS) is to obtain a representative set of samples as the basis for clustering. For progressive sampling in eNERF, the statistical divergence test between the sample and the population is performed for each selected distinguished feature, and sampling is terminated only when all tests for all features pass. We believe that there are computationally cheaper and/or better methods for terminating progressive sampling. Is the divergence test the best predictor? The answer is probably "no." Unfortunately, the divergence test appears to be very conservative in practice, and the sample size n yielded by this scheme often turns out to be nearly 50% of N, which is absolutely impractical for real VL data. Example 1 in Section 4 will show that even for L data, PS may not terminate until n is 20%–50% of N. We have examined the sampling performance of eNERF with respect to a variety of parameter adjustments in Section 4, and we find that the sample size remains impractically large and cannot be greatly reduced by any combination of PS parameters.

To address this problem, we have developed a revised sampling scheme that combines elements from Bezdek et al.5 and Hathaway et al.6 with simple random sampling (RS). Our new scheme, called selective sampling (SS), operates as follows. First, we obtain NN samples D_NN from the original N samples by uniform random sampling, that is, D_N → D_NN with RS. These samples D_NN act as a quasi-population of manageable size for the subsequent steps. We then select c′ distinguished objects from an H × H subset of D_NN, so that these c′ objects can be used as prototypes to guide sampling, that is, D_NN → D_H, from which we obtain the c′ distinguished features using DF from eNERF. Next, we group each object in {o_1, o_2, …, o_NN} with its nearest distinguished object, that is, we build the nearest-prototype partition O_NN = C_1 ∪ C_2 ∪ … ∪ C_c′. Finally, we select a small number n_i of samples from each C_i so that Σ_{i=1}^{c′} n_i = n. The advantage of this revised sampling scheme is that it eliminates PS, and thus its tendency to oversample O_N. In addition, it can be realized in a computationally efficient manner, while enabling the processing of truly VL data. For convenience, we refer to this selective sampling scheme as the SS algorithm.

******************************************************************

The SS Algorithm

Data Input: N × N pairwise dissimilarities D_N.

User Inputs:

◦ c′ = an overestimate of the true but unknown number of clusters c (i.e., the number of distinguished objects to select);
◦ n = an (approximate) sample size;
◦ NN = an intermediate sample size;
◦ H = the number of candidate objects from which the c′ distinguished objects are selected.

SS1: Obtain NN samples D_NN from the original N objects by randomly sampling D_N; D_NN acts as a quasi-population.

This step is necessary to handle VL data, and it is also plausible because, in general, c ≪ N. Moreover, taking too many samples from VL data basically provides redundant information about the structure of the whole data set.

SS2: Select the indices m_1, …, m_c′ of the c′ distinguished objects from H candidate rows (e.g., the top H × H submatrix of D_NN).

The matrix D_H is loadable, whereas D_NN is not. This step uses DF from eNERF.5 The choice of H is generally determined by the memory available to store H × (H − 1)/2 elements from D_NN.

• Randomly select the first distinguished object, for example, m_1 = 1 (without loss of generality)
• Initialize the search array s = (s_1, …, s_H) = (d_11, …, d_1H)
• For i = 2, …, c′
   ◦ Update s = (min{s_1, d_{m_{i−1},1}}, min{s_2, d_{m_{i−1},2}}, …, min{s_H, d_{m_{i−1},H}})
   ◦ Select m_i = arg max{s_j}, 1 ≤ j ≤ H.

SS3: Group each object in {o_1, …, o_NN} with its nearest distinguished object.

This step finds a crisp c′-partition of O_NN using the nearest-prototype rule.

• Initialize the distinguished-object index sets C_1 = C_2 = · · · = C_c′ = Ø
• For k = 1 to NN
   ◦ Select q = arg min{d_{m_j,k}}, 1 ≤ j ≤ c′
   ◦ Update C_q = C_q ∪ {k}

SS4: Select data for the sample matrix D_n near each of the distinguished objects.

• For i = 1, …, c′
   ◦ Compute the ith group's representative subsample size n_i = ⌈n · |C_i| / NN⌉.
   ◦ Randomly select n_i indices from C_i.
• Let S denote the union of all the randomly selected indices and define n = |S|.

Output: The n × n principal submatrix D_n of D_NN corresponding to the row/column indices in S.

******************************************************************

It should be noted that, if a set of objects O represented by the relational dissimilarity matrix D can be partitioned into c ≥ 1 compact and separated (CS) clusters according to Dunn's definition of CS clusters,21 and if c′ ≥ c, then SS2 will select at least one distinguished object from each cluster. In this case, the proportion of objects in the sample from each cluster equals the proportion of objects in the (quasi-)population from the same cluster.
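To make the steps concrete, SS2–SS4 are sketched below in NumPy under the simplifying assumption that the quasi-population matrix D_NN fits in memory as an array (a real VL implementation would fetch only the c′ rows it needs); variable names are ours.

    import numpy as np

    def selective_sample(D_NN, c_prime, n, H, rng=np.random.default_rng(0)):
        """Return the sorted indices S (|S| is approximately n) selected by SS2-SS4.
        The sample matrix is then D_n = D_NN[np.ix_(S, S)]."""
        NN = D_NN.shape[0]

        # SS2: maximin selection of c' distinguished objects among the first H candidates
        m = [0]                                    # m_1 = 1 (index 0), without loss of generality
        s = D_NN[0, :H].copy()                     # search array over the H candidates
        for _ in range(1, c_prime):
            s = np.minimum(s, D_NN[m[-1], :H])     # distance to the nearest chosen object so far
            m.append(int(np.argmax(s)))            # candidate farthest from all chosen objects

        # SS3: nearest-prototype grouping of all NN objects around the distinguished objects
        q = np.argmin(D_NN[m, :], axis=0)          # group index of each object

        # SS4: proportional random subsampling within each group
        S = []
        for i in range(c_prime):
            Ci = np.flatnonzero(q == i)
            ni = int(np.ceil(n * len(Ci) / NN))
            S.extend(rng.choice(Ci, size=min(ni, len(Ci)), replace=False))
        return np.sort(np.array(S))

    # SS1 is plain uniform random sampling of NN row/column indices of D_N, e.g.
    #   keep = rng.choice(N, size=NN, replace=False);  D_NN = D_N[np.ix_(keep, keep)]
    # (or, for truly VL data, distances computed on demand from the object data).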

4. NUMERICAL EXAMPLES

Our aim in this section is to test the limitations of using progressive sampling (PS) in eNERF, and to investigate the effectiveness of our selective sampling (SS) scheme as an alternative sampling scheme for eNERF. To test the limitations of PS, we first examine whether the performance of eNERF is sensitive to different strategies in DF for choosing distinguished features, or to different parameter settings in PS. Having established the tendency of PS in eNERF to oversample, we then evaluate the accuracy of eNERF using our SS algorithm on a VL data set.

Computation was done on a PC with 2 GB RAM and a 3.0 GHz CPU. The eNERF routines are written in version 7.3 of MATLAB. The termination norms used in the numerical experiments are as follows: for LNERF, the sup norm for matrices, regarding them as vectors in ℜ^cn; and for xNERF, the sup norm on ℜ^c. Our programs have not been optimized to run most efficiently. All numerical examples use mixtures of normal distributions as the source of the data sets. We use data of this kind simply because it is easy to generate VL data having known characteristics, which are useful for assessment of the SS algorithm.

Our partition and error notations are as follows:

U_true = true hard partition (recorded during data generation)
U_lit = U_LNERF,N = LNERF output, when applied to all of D_N
U_app = [U_LNERF,n | U_xNERF,N−n] = eNERF output that approximates U_lit

E1 = ‖U_lit − U_app‖_F = sqrt( Σ_{i=1}^{c} Σ_{k=1}^{N} (u_lit,ik − u_app,ik)^2 );   approximation error of fuzzy memberships   (1)

E2 = 0.5 · Σ_{i=1}^{c} Σ_{k=1}^{N} |H(U_lit)_ik − H(U_app)_ik|;   approximation match error   (2)

E3 = 0.5 · Σ_{i=1}^{c} Σ_{k=1}^{N} |(U_true)_ik − H(U_app)_ik|;   approximation-training error   (3)

E4 = 0.5 · Σ_{i=1}^{c} Σ_{k=1}^{N} |(U_true)_ik − H(U_lit)_ik|;   literal-training error   (4)

Equation 1 is the value of the difference between U_lit and U_app in the Frobenius norm: it measures the approximation error of the fuzzy memberships in U_lit by those in U_app. Equation 2 counts the number of mismatches between columns in the hardened versions H(U_lit) and H(U_app). The function H "hardens" U ∈ M_fcn by finding the maximum entry in each column of U, replacing it with a 1, and setting the remaining (c − 1) entries to 0. Consequently, E2 measures the approximation error of the crisp labels produced by hardening both U_lit and U_app. E2 is not an "error rate" in the usual sense, but Equation 3 for E3 is the classification (training) error of eNERF when its fuzzy partition is hardened and taken as an estimate of the true crisp labels of points in the sample. Equation 4 for E4 is the classification error of LNERF when its fuzzy partition is hardened and taken as an estimate of the true crisp labels of points in the sample.
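A direct transcription of H and Equations 1–4 in NumPy (our own code, using the notation above):

    import numpy as np

    def harden(U):
        """H(U): set the maximum entry of each column of U to 1 and the rest to 0."""
        Hd = np.zeros_like(U)
        Hd[np.argmax(U, axis=0), np.arange(U.shape[1])] = 1.0
        return Hd

    def E1(U_lit, U_app):
        return np.linalg.norm(U_lit - U_app, 'fro')          # Equation (1)

    def mismatch(A, B):
        return 0.5 * np.sum(np.abs(A - B))                   # common form of (2)-(4)

    def E2(U_lit, U_app):
        return mismatch(harden(U_lit), harden(U_app))        # Equation (2)

    def E3(U_true, U_app):
        return mismatch(U_true, harden(U_app))               # Equation (3)

    def E4(U_true, U_lit):
        return mismatch(U_true, harden(U_lit))               # Equation (4)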

Example 1. This example uses two different matrices of squared Euclidean distances between each pair of vectors in a set of N = 3000 points in ℜ^2. The objective of this example is to examine (DF + PS) in eNERF, and to compare it to our modified SS algorithm by examining the accuracy of the corresponding eNERF partitions. The vectors are distributed according to a mixture of c = 5 bivariate normal distributions. For data I, the components are as follows: mixing proportions p_1 = 0.3, p_2 = 0.2, p_3 = 0.1, p_4 = 0.2, p_5 = 0.2; means µ_1 = (−3, −4)^T, µ_2 = (0, 0)^T, µ_3 = (3, 3)^T, µ_4 = (5, −2)^T, µ_5 = (−6, 5)^T; and covariance matrices Σ_1 = Σ_2 = Σ_3 = Σ_4 = Σ_5 = diag(1, 1), the 2 × 2 identity matrix. Samples drawn from this mixture are shown in the upper left view of Figure 3. For data II, the components are as follows: mixing proportions p_1 = 0.2, p_2 = 0.2, p_3 = 0.3, p_4 = 0.1, p_5 = 0.2; means µ_1 = (−3, −3)^T, µ_2 = (0, 0)^T, µ_3 = (3, 3)^T, µ_4 = (3, −2)^T, µ_5 = (−3, 3)^T; and covariance matrices Σ_1 = diag(1, 0.2), Σ_2 = diag(1, 1), Σ_3 = diag(0.1, 1), Σ_4 = diag(0.5, 1), Σ_5 = diag(0.2, 1). Data II are scatter plotted in the upper right frame of Figure 3.

Figure 3. Scatter plots and their corresponding VAT images of data I (left) and data II (right).

VAT7 is a helpful tool for determining possible cluster structure in small data sets. The VAT approach presents pairwise dissimilarity information about the N objects as a square digital image with N^2 pixels, after the objects are suitably reordered so that the image is better able to highlight potential cluster structure. Dark blocks along the diagonal of a VAT image suggest cluster structure in the underlying data. VAT images of data sets I and II are shown directly below their scatter plots. Both images indicate that the data possess c = 5 clusters, and, for these well-defined clusters, the size of each block roughly mirrors the number of vectors drawn from each component of the mixture. For example, the five blocks in the left image have the approximate ratio 2:3:2:2:1, corresponding to (the rearrangement of) the points captured with priors p_2:p_1:p_4:p_5:p_3. Thus, we choose c = 5 as an input value for clustering in data sets I and II.
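The reordering behind VAT can be sketched as a Prim-like (minimum-spanning-tree) ordering of the objects, after which the permuted dissimilarity matrix is displayed as a grayscale image. The code below follows the published description only loosely and is intended for small D; it is our illustration, not the authors' implementation.

    import numpy as np
    import matplotlib.pyplot as plt

    def vat_order(D):
        """Return a VAT-style reordering of the objects (Prim-like MST ordering)."""
        N = D.shape[0]
        start = int(np.unravel_index(np.argmax(D), D.shape)[0])  # one end of the largest dissimilarity
        order = [start]
        remaining = sorted(set(range(N)) - {start})
        while remaining:
            J = np.array(remaining)
            sub = D[np.ix_(order, J)]                   # distances from the ordered set to the rest
            j = int(J[np.argmin(np.min(sub, axis=0))])  # closest remaining object to the ordered set
            order.append(j)
            remaining.remove(j)
        return np.array(order)

    def vat_image(D):
        """Display the reordered dissimilarities; dark diagonal blocks suggest clusters."""
        P = vat_order(D)
        plt.imshow(D[np.ix_(P, P)], cmap='gray')
        plt.show()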

We performed eNERF using the following parameter values: divergence acceptance threshold ε_PS = 0.80; number of DF candidates H = 1000; number of DFs h = 6; number of histogram bins b = 10; initial sample percentage p = 10% of N = 3000, i.e., 300 points; and incremental sample percentage Δp = 1% of N, i.e., 30 points. In addition, the fuzzy weighting constant was m = 2, the number of clusters was c = 5, and the stopping tolerances were ε_L = 0.00001 for LNERF and ε_x = 0.001 for xNERF.

(a) Experiments with respect to different selections of the first distinguished feature in DF of eNERF

First, we test several different methods for selecting the first distinguished feature for DF in eNERF, described as follows: (1) DF1: the original method in eNERF, that is, m_1 = 1; (2) DF2: choose m_1 randomly in {1, 2, …, H}; (3) DF3: find max{d_ij} in D_HH, and set m_1 = i and m_2 = j; (4) DF4: randomly select h rows in {1, 2, …, H}; and (5) DF5: select h rows corresponding to close-to-mean data points (this can only be done when object data and true labels are available). When h is large, the computational load is too large; but if h is too small, the distinguished features may not represent complex structural features of the data. As stated above, if c features (corresponding to rows) that are very different from each other (corresponding to very dissimilar objects) are selected, then it is likely that the features include one object (a representative) from each of the c clusters. Intuitively, a value of h close to c seems more suitable for any data set with unknown structure (we use h = 6 in the following experiments, with c = 5 inferred from the VAT images in Figure 3).
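The first four variants can be sketched as follows (our own illustrative code; DF5 needs object data and true class labels and is therefore not reproduced here):

    import numpy as np

    def first_distinguished(D_HH, method, h=6, rng=np.random.default_rng(0)):
        """Starting choice(s) for the distinguished-feature search on the H x H
        candidate matrix D_HH.  DF1-DF3 seed the maximin search of DF, whereas
        DF4 skips the search and returns h random rows directly."""
        H = D_HH.shape[0]
        if method == 'DF1':                  # original eNERF choice: the first row
            return [0]
        if method == 'DF2':                  # a random first row
            return [int(rng.integers(H))]
        if method == 'DF3':                  # the two most dissimilar candidates
            i, j = np.unravel_index(np.argmax(D_HH), D_HH.shape)
            return [int(i), int(j)]
        if method == 'DF4':                  # h random rows, no maximin search
            return sorted(int(r) for r in rng.choice(H, size=h, replace=False))
        raise ValueError(f"unknown method: {method}")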

We performed 100 trials of eNERF on the two data sets. Although we used a statistical function to terminate progressive sampling, we view it merely as a measure of goodness of fit. The results are listed in Tables I and II. Note that columns 2, 4, and 5 have been rounded off because they represent integers; we do this for all reported integers in the experimental results. From these two tables, we see that

(1) for different DF methods, the resulting sample sizes are very close (nearly 40%–50% of N);
(2) for different DF methods, the results with respect to E1, E2, and E3 are also very similar; and
(3) different ways to initialize PS using DF1–DF5 make little difference to the final sample size and clustering errors.

Table I. Average results of 100 trials using different DFs on Data I. Columns: average sample size (n), average approximation error of fuzzy memberships (E1), average number of mismatches (E2), and average number of training errors (E3).

DFs (h = 6)      n      E1     E2     E3
DF1            1197    0.63     5     45
DF2            1307    0.62     6     45
DF3            1238    0.62     6     45
DF4            1212    0.66     6     45
DF5            1245    0.67     6     45

Table II. Average results of 100 trials using different DFs on Data II. Columns as in Table I.

DFs (h = 6)      n      E1     E2     E3
DF1            1534    0.68     8     82
DF2            1561    0.61     7     82
DF3            1524    0.68     8     81
DF4            1584    0.60     7     82
DF5            1633    0.64     8     81

The average number of samples drawn before PS terminates is 1240 = 41.3% of N in Table I, and 1567 = 52.2% of N in Table II. This shows that, as the clusters become less well separated, (DF + PS) draws more and more samples before PS terminates, and the number of training errors for the more complex data II is nearly twice what it is for data I. The error measure E1 compares the values of cN = 5 × 3000 = 15,000 memberships produced by exact (U_lit) and approximate (U_app) NERF clustering. The 10 E1 values shown in Tables I and II are remarkably consistent and very small (the upper bound on E1 for these partitions is √15000 = 122.47). Thus, the accuracy of approximation is very good. What is not so good is the number of samples (n) processed by LNERF.

(b) Experiments with respect to different parameter settings in (DF + PS) of eNERF

Now, we examine the performance in terms of the starting sample size n_0, the divergence test significance level α, and the number of equal-content histogram bins b. Gu et al.22 show that an improper starting sample size can make PS expensive in terms of computation and excessive database scans to fetch the sample data. The efficiency of PS is highest when the starting sample size n_0 is equal to the minimally necessary sample size n_min (which is impossible to determine from theory). The value of b affects the level of detail that is supported by the sample. A small value will be too coarse to capture the details of the data distribution. A higher value increases the computational complexity, because of the higher cost of obtaining the bin counts and computing the divergence test for the larger number of bins; also, because of the quantization, some histogram bins may contain very small counts, which can be problematic for some histogram-based comparison methods. We performed 25 trials on these two data sets with respect to different settings for α, b, and n_0. The average results over 25 trials with each combination of parameter settings are listed in Tables III and IV.

We can consider these results within the context of the corresponding learning curve. A learning curve10 depicts the relationship between sample size and model accuracy. Learning curves typically have a steeply sloping portion for small sample sizes, a more gently sloping middle portion as the sample sizes increase, and a plateau for sufficiently large sample sizes. When a learning curve reaches its final plateau, we say it has converged, and we denote the training set size at which convergence occurs as the size of the smallest sufficient training set n_min.22

Table III. Average results of 25 trials with different parameter settings on Data I.

n_0 = 300        b = 10                      b = 20                      b = 30
  α          n     E1    E2   E3         n     E1    E2   E3         n     E1    E2   E3
 0.5       902   0.82     7   45       702   1.26    11   46       575   1.09    10   46
 0.6      1034   0.70     6   45       846   0.91     7   45       642   1.06     9   45
 0.7      1352   0.54     5   45      1001   0.77     7   45       942   0.77     6   45
 0.8      1561   0.54     5   45      1222   0.64     6   45      1081   0.75     7   45
 0.9      1961   0.38     3   45      1565   0.46     5   45      1355   0.63     6   45

n_0 = 150        b = 10                      b = 20                      b = 30
  α          n     E1    E2   E3         n     E1    E2   E3         n     E1    E2   E3
 0.5       792   0.94     7   45       638   1.16    10   46       619   1.31    11   46
 0.6       884   0.95     9   45       726   0.93     6   46       709   1.02     8   46
 0.7      1312   0.62     5   45      1024   0.78     7   45       881   0.95     8   45
 0.8      1628   0.51     4   45      1328   0.63     5   45      1082   0.74     6   46
 0.9      2008   0.37     4   45      1470   0.51     4   45      1344   0.62     5   45

Table IV. Average results of 25 trials with different parameter settings on Data II.

n_0 = 300        b = 10                      b = 20                      b = 30
  α          n     E1    E2   E3         n     E1    E2   E3         n     E1    E2   E3
 0.5       930   1.12    15   79       662   1.21    14   82       586   1.38    15   83
 0.6      1020   0.99    11   82       820   1.03    12   82       733   1.27    15   83
 0.7      1272   0.76     9   80      1030   0.93    12   78       916   1.11    14   82
 0.8      1495   0.71     8   82      1108   0.80     9   81      1075   0.86    10   82
 0.9      1765   0.62     8   81      1466   0.70     8   82      1248   0.59     7   82

n_0 = 150        b = 10                      b = 20                      b = 30
  α          n     E1    E2   E3         n     E1    E2   E3         n     E1    E2   E3
 0.5       625   1.41    15   84       605   1.49    17   82       572   1.37    15   82
 0.6       857   0.85    10   80       761   1.12    12   83       698   1.43    16   84
 0.7      1146   0.91    10   82       949   0.98    12   81       772   1.11    11   82
 0.8      1569   0.62     6   82      1168   0.96    12   79      1064   0.90    10   83
 0.9      1985   0.43     5   81      1482   0.61     6   82      1320   0.80     8   82

From Tables III and IV, we can draw the following conclusions:

(1) The final sample size is sensitive to the selection of the initial sample size n_0. As shown here, the smaller n_0 is, the smaller the final sample size is. Note that 150 and 300 are within the middle section of the learning curves for these two data sets (please refer to Tables V and VI).

(2) Increasing the significance level α of the divergence test always increases the size of the accepted subsample n.

(3) As the number of histogram bins b increases, the final sample size will generally decrease.

(4) As the sample size n increases, E1 and E2 decrease, but this is not necessarily true for E3. This is probably because the sample sizes used are all above the minimally sufficient sample size n_min, so the resulting accuracies are stable (see Tables V and VI for cross-reference).

(5) The final sample sizes indicate oversampling (even the minimum, near 600, occurs for α = 0.5, b = 30, n_0 = 150). Note that a significance level of α = 0.5 is very weak, and it still results in a large final sample size.

Table V. Average results of 25 trials with the SS algorithm on Data I.

                 Selective sampling (SS)      Random sampling (RS)
Sample size (n)    E1     E2     E3             E1     E2     E3
      25         6.47     78     99           46.29    684    700
      50         3.83     26     52            4.61     37     59
      75         3.10     23     51            3.99     33     54
     100         2.97     23     51            3.09     22     50
     150         2.29     17     48            2.48     18     49
     200         2.00     14     47            2.12     16     48
     250         1.73     13     46            1.90     15     47
     300         1.51     12     46            1.64     13     47
     600         0.92      7     46            1.06      9     46
     900         0.72      6     45            0.79      6     47
    1200         0.57      4     45            0.65      6     47
    1500         0.46      4     45            0.55      5     45

Table VI. Average results of 25 trials with the SS algorithm on Data II.

                 Selective sampling (SS)      Random sampling (RS)
Sample size (n)    E1     E2     E3             E1     E2     E3
      25         7.42     90    125           28.90    440    472
      50         5.10     53    100            6.62     79    125
      75         4.37     48     95            5.89     74    120
     100         3.65     41     90            4.03     44     97
     150         2.92     32     84            2.97     33     90
     200         2.38     28     81            2.68     26     84
     250         2.18     26     80            2.45     26     87
     300         1.95     23     81            2.26     26     86
     600         1.29     16     78            1.41     15     83
     900         1.04     14     77            1.08     13     80
    1200         0.94     13     77            0.83     10     81
    1500         0.89     13     77            0.71      9     80

(c) Experiments with the SS algorithm

Now we test the SS sampling rule using H = 1000, NN = 2000, and c′ = 6. For comparison, we also list the results of using random sampling (RS), that is, uniformly drawing n samples from the original N samples without replacement. The average results of 25 trials are summarized in Tables V and VI, from which the following can generally be seen:

(1) The clustering errors are still small, similar to those in the experiments above. The error E1 decreases monotonically as n increases, for both SS and RS, and falls into the same range (about 0.6) as in Tables I and II. So, the approximation of eNERF is good, and the difference steadily decreases as the sample size increases.

(2) Samples of size about 200 (about 7% of |X|) seem to be sufficient to obtain steady results (i.e., to arrive at the relatively flat section of the learning curve).

(3) From n = 25 to n = 50, the results change dramatically. For n = 75–300, E3 seems to stay on the gently sloping section of the learning curve. For sample sizes above 300, E3 is basically stable, whereas E1 and E2 continue to decrease slightly.

(4) As expected, E1, E2, and E3 increase with decreasing separation of the clusters and are essentially unchanged with increasing stringency of the different subsample selection strategies.

(5) Generally, for the same sample size, selective sampling is far better than random sampling when n is below the smallest sufficient training set size n_min, and slightly better than random sampling when n is larger than n_min. Note that generating a random sample from a single-table data set typically requires scanning the entire table once, so a larger n results in considerably higher disk I/O cost.

(6) These conclusions are drawn only from these two specific small data sets, so they cannot be used to argue that SS will always be better than RS.

(7) These basic numerical results encourage us to assert that eNERF with the selective sampling scheme will yield very acceptable and reasonably accurate results in the VL data case.

Example 2. Data III is a set of N = 3,000,000 samples drawn from a mixture of c = 5 bivariate normal distributions. The components are as follows: mixing proportions p_1 = 0.2, p_2 = 0.2, p_3 = 0.3, p_4 = 0.1, p_5 = 0.2; means µ_1 = (−3, −3)^T, µ_2 = (0, 0)^T, µ_3 = (4, 3)^T, µ_4 = (5, −2)^T, µ_5 = (−5, 4)^T; and covariance matrices Σ_1 = diag(1, 0.2), Σ_2 = diag(1, 1), Σ_3 = diag(0.1, 1), Σ_4 = diag(0.5, 1), Σ_5 = diag(0.2, 1). The data are scatter plotted in Figure 4, from which we see that five clusters are visually apparent, but there is a lot of mixing between outliers from components in the mixture. We cannot perform literal NERF on 3,000,000 objects, so U_lit was actually obtained by applying the (object data) fuzzy c-means algorithm to the object data displayed in Figure 4. By duality theory,20 the fuzzy partition U_lit obtained by using literal FCM is identical to the one obtained by running literal NERF on the corresponding (Euclidean) distance matrix.
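Data of this kind are easy to regenerate; a NumPy sketch (random seed and generator are our choices, not the authors') is:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 3_000_000
    props = [0.2, 0.2, 0.3, 0.1, 0.2]
    means = [(-3, -3), (0, 0), (4, 3), (5, -2), (-5, 4)]
    covs  = [np.diag(v) for v in [(1, 0.2), (1, 1), (0.1, 1), (0.5, 1), (0.2, 1)]]

    labels = rng.choice(5, size=N, p=props)        # mixture component of each point (U_true)
    X = np.empty((N, 2))
    for k in range(5):
        idx = np.flatnonzero(labels == k)
        X[idx] = rng.multivariate_normal(means[k], covs[k], size=idx.size)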

Figure 4. Scatter plot of data III.

Computing squared Euclidean distances between pairs of vectors in X yields a matrix D_N with N(N − 1)/2 = 4.5 × 10^12 − 1.5 × 10^6 = O(10^12) distinct dissimilarities. We are not able to calculate, load, and process the full distance matrix D_N. The memory required to store just the top half of the relational data matrix is about 36 TB as a ".mat" file, and the number of elements in a relational matrix of size 3,000,000 × 3,000,000 is far greater than the maximum integer possible in MATLAB, so element indexing becomes a problem. Here, we use an approximate "lookup table" mode: we store only the object data set X, access only the vectors needed to make a particular distance computation, and release the memory used by these vectors immediately afterward to avoid running out of memory. If we had only the relational data, this would represent a very large clustering problem relative to the computing environment available.
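The "lookup table" mode amounts to computing any required block of D_N on the fly from X, never materializing the full matrix. A NumPy sketch (the function name is ours):

    import numpy as np

    def dist_block(X, rows, cols):
        """Squared Euclidean distances between X[rows] and X[cols], computed on demand.

        Only a len(rows) x len(cols) block is ever held in memory, so D_N
        (about 36 TB even for just its upper half when N = 3,000,000) is never formed."""
        A, B = X[rows], X[cols]
        block = (np.sum(A ** 2, axis=1)[:, None]
                 + np.sum(B ** 2, axis=1)[None, :]
                 - 2.0 * A @ B.T)
        return np.maximum(block, 0.0)     # clip round-off negatives

    # e.g., the n x 1000 slice needed for one extension chunk (hypothetical indices):
    # D_chunk = dist_block(X, sample_idx, np.arange(j, j + 1000))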

As discussed in Section 3, to cluster D_N using eNERF with our selective sampling algorithm, we first performed random sampling, obtaining intermediate samples of size NN = 60,000, 45,000, and 30,000, which correspond to 2%, 1.5%, and 1% of the input data. Recall that NN denotes the size of the intermediate sample, and n denotes the size of the final sample submitted to LNERF, which finds exact clusters in D_n. For each NN, we show results for two values of n. Four images are shown in Figure 5, all of them based on D_n with n = 2500. Figure 5a shows an sVAT image of D_n obtained by the sVAT6 sampling method for N = 3,000,000 (sVAT is an approximation to VAT for more than about 5000 objects), and Figures 5b–5d are VAT images7 of the D_n obtained by our new SS algorithm for three different intermediate sample sizes NN. From Figure 5, we can see that the sVAT image indicates that there are c = 5 clusters in D_N. For the different intermediate sample sizes NN, the VAT images of the corresponding D_n matrices remain interpretable. That is, for this data set, the SS algorithm seems to extract a representative sample of the VL parent data set. Thus, we should choose c = 5 for eNERF.

Figure 5. sVAT image of the full data, N = 3,000,000 (a), and VAT images of the sampled D_n using the SS algorithm with NN = 60,000 (b), NN = 45,000 (c), and NN = 30,000 (d).

We have evaluated our modified version of eNERF using selective sampling on data III with different values of NN and n. For LNERF, we used c = 5, m = 2, and ε_L = 0.00001. Initializing with the correct hard five-partition of the data, the n × n submatrix D_n was clustered by LNERF. As mentioned above, we use LFCM on the object data to obtain U_lit, which enables measuring the errors in Equations 2 and 4. The xNERF phase required an n × 3,000,000 slice of D_N (which had storage requirements of roughly 6480 MB even for n = 300). To avoid exhausting memory in MATLAB, the processing was broken up by calling the extension routine multiple times, each time supplying it with D_n and an additional n × 1000 subblock of D_N (the last subblock may not be exactly n × 1000, depending on the value of n). Each chunk was used to extend the partition to another 1000 objects. The stopping criterion for the extension iteration used ε_x = 0.001. We computed the errors with respect to Equations 2–4, as summarized in Table VII. The whole eNERF process took about 2 days for each case with different NN and n, so Table VII lists the results of only one run per case. From this table, it can be seen that all measures of error are very small. This was certainly not an easy clustering problem, in terms of how well separated the clusters actually are, but the point here was to demonstrate the feasibility of eNERF with our selective sampling scheme on truly VL data, and this example does exactly that.

Table VII. Results on data III (one run per case); percentages are relative to the full data set (N = 3,000,000).

   NN        n        E2                E3                 E4
 60,000    2,500    1,456 [0.048%]    24,652 [0.822%]    24,758 [0.825%]
 60,000      600    7,007 [0.233%]    26,215 [0.874%]    24,758 [0.825%]
 45,000    2,500    4,284 [0.143%]    23,388 [0.780%]    24,758 [0.825%]
 45,000      450    6,123 [0.204%]    22,391 [0.746%]    24,758 [0.825%]
 30,000    2,500      784 [0.026%]    24,283 [0.809%]    24,758 [0.825%]
 30,000      300    7,739 [0.258%]    25,462 [0.849%]    24,758 [0.825%]
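The chunked organization of the extension stage can be pictured with the loop below, which reuses dist_block from the earlier sketch. The per-chunk labelling shown is only a stand-in (each out-of-sample object inherits the fuzzy label of its nearest sample object); the real xNERF update is iterative and uses the n × 1000 distance block quite differently.

    import numpy as np

    def extend_labels_chunked(X, sample_idx, U_sample, chunk=1000):
        """Label all N objects by processing <=1000-object blocks of D_N at a time.

        U_sample is the c x n fuzzy partition found on the sample.  The nearest-
        sample-object rule below is a simplified stand-in for xNERF."""
        N, c = X.shape[0], U_sample.shape[0]
        U_full = np.empty((c, N))
        U_full[:, sample_idx] = U_sample                  # keep the sample's own labels
        rest = np.setdiff1d(np.arange(N), sample_idx)
        for start in range(0, len(rest), chunk):
            cols = rest[start:start + chunk]
            D_chunk = dist_block(X, sample_idx, cols)     # n x (<=1000) block, built on demand
            nearest = np.argmin(D_chunk, axis=0)          # nearest sample object for each column
            U_full[:, cols] = U_sample[:, nearest]
        return U_full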

5. DISCUSSION

In this paper, we have devised and demonstrated selective sampling for eNERF, an extended version of non-Euclidean relational fuzzy c-means for approximate clustering of very large (unloadable) relational data. Our examples demonstrate the need to revise the original eNERF sampling procedure (DF + PS). Specifically, terminating progressive sampling with the divergence test, as outlined in Bezdek et al.,5 almost always leads to oversampling, so when the data are truly VL, so is the sample found by (DF + PS). Our revised method of selective sampling (SS) seems to overcome this problem on the initial tests that we have performed.

Several needs and opportunities for future work immediately come to mind. An efficient implementation of eNERF, particularly of xNERF, is needed to find out to what degree, and for what problems, eNERF can provide acceleration over a literal NERF approach. Is an alternative, noniterative extension scheme possible? An important follow-up to this research is to apply eNERF with SS to a variety of truly VL data sets to show that our method is generally applicable to other realistic data.

Acknowledgments

The authors would like to express their thanks to the anonymous reviewers for their valuable suggestions and comments. This work was partially supported by ARC Discovery Project Grant no. DP0663196.

References

1. Huber P. Massive data workshop: The morning after. In: Massive data sets. New York: National Academy Press; 1996. pp. 169–184.
2. Bezdek JC, Keller JM, Krishnapuram R, Pal N. Fuzzy models and algorithms for pattern recognition and image processing. Norwell, MA: Kluwer; 1999.
3. Pal NR, Bezdek JC. Complexity reduction for "large image" processing. IEEE Trans Systems, Man Cybernet B 2002;32:598–611.
4. Hathaway RJ, Bezdek JC. Extending fuzzy and probabilistic clustering to very large data sets. Comput Stat Data Anal 2006;51:215–234.
5. Bezdek JC, Hathaway RJ, Huband JM, Leckie C, Kotagiri R. Approximate clustering in very large relational data. Int J Intell Syst 2006;21:817–841.
6. Hathaway RJ, Bezdek JC, Huband JM. Scalable visual assessment of cluster tendency for large data sets. Pattern Recognit 2006;39:1315–1324.
7. Bezdek JC, Hathaway RJ. VAT: A tool for visual assessment of (cluster) tendency. In: Proc Int Joint Conf on Neural Networks; 2002. pp. 2225–2230.
8. Cheng TW, Goldgof DB, Hall LO. Fast fuzzy clustering. Fuzzy Sets Syst 1998;93:49–56.
9. Pavlov D, Chudova D, Smyth P. Towards scalable support vector machines using squashing. In: Proc ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining; 2000. pp. 295–299.
10. Provost F, Jensen D, Oates T. Efficient progressive sampling. In: Proc 5th Int Conf on Knowledge Discovery and Data Mining. New York: ACM Press; 1999. pp. 23–32.
11. Meek C, Thiesson B, Heckerman D. The learning curve sampling method applied to model-based clustering. J Mach Learn Res 2002;2:397–418.
12. Bradley PS, Fayyad U, Reina C. Scaling clustering algorithms to large databases. In: Proc Int Conf on Knowledge Discovery and Data Mining; 1998. pp. 9–15.
13. Gupta C, Grossman R. GenIc: A single pass generalized incremental algorithm for clustering. In: Proc SIAM Int Conf on Data Mining; 2004.
14. Domingos P, Hulten G. A general method for scaling up machine learning algorithms and its application to clustering. In: Proc Int Conf on Machine Learning; 2001. pp. 106–113.
15. Bradley PS, Fayyad U, Reina C. Scaling clustering algorithms to large databases. In: Proc 4th Int Conf on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press; 1998. pp. 9–15.
16. Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. In: Proc ACM SIGMOD Int Conf on Management of Data. New York: ACM Press; 1996. pp. 103–114.
17. Ganti V, Ramakrishnan R, Gehrke J, Powell AL, French JC. Clustering large data sets in arbitrary metric spaces. In: Proc 15th Int Conf on Data Engineering. Los Alamitos, CA: IEEE CS Press; 1999. pp. 502–511.
18. Ng RT, Han J. Efficient and effective clustering methods for spatial data mining. In: Proc 20th Int Conf on Very Large Databases. San Francisco, CA: Morgan Kaufmann; 1994. pp. 144–155.
19. Guha S, Rastogi R, Shim K. CURE: An efficient clustering algorithm for large databases. In: Proc ACM SIGMOD Int Conf on Management of Data; 1998. pp. 73–84.
20. Hathaway RJ, Bezdek JC. NERF c-means: Non-Euclidean relational fuzzy clustering. Pattern Recognit 1994;27:429–437.
21. Dunn JC. Indices of partition fuzziness and the detection of clusters in large data sets. In: Gupta MM, editor. Fuzzy automata and decision processes. New York: Elsevier; 1976.
22. Gu B, Liu B, Hu F, Liu H. Efficiently determining the starting sample size for progressive sampling. In: Proc Euro Conf on Machine Learning; 2001. pp. 192–202.
