
Nonlinear Analysis 71 (2009) e2807–e2813


Clustering using multilayer perceptrons
Dimitrios Charalampidis *, Barry Muldrey
Electrical Engineering Department, University of New Orleans, 2000 Lakeshore Dr., New Orleans, LA 70148, United States

Article info
MSC: 68T99
Keywords: Clustering; Multilayer perceptrons; Neural networks

Abstract

In this paper we present a multilayer perceptron-based approach for data clustering. Traditionally, data clustering is performed using exemplar-based methods that employ some form of similarity or distance measure, discriminatory function-based methods that attempt to identify one or several cluster-dividing hyper-surfaces, point-by-point associative methods that attempt to form groups of points in a pyramidal manner by directly examining the proximity between pairs of points or groups of points, or probabilistic methods which assume that data are sampled from mixture distributions. Commonly, in exemplar-based methods, each cluster is represented by a multi-dimensional centroid. In this paper, we explore the function approximation capabilities of multilayer perceptron neural networks in order to build exemplars which are not simply points but curves or surfaces. The proposed technique aims to group data points into arbitrary-shaped point clouds. The proposed approach may exhibit problems similar to other traditional exemplar-based clustering techniques such as k-means, including convergence to local minimum solutions with respect to the cost function. However, it is illustrated in this work that approaches such as split-and-merge can be appropriately adjusted and employed in the proposed technique, in order to alleviate the problem of reaching poor local minimum solutions.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Cluster analysis is one of the main tools for exploring the underlying structure of a data set. Clustering finds important applications in a wide variety of disciplines including remote sensing, pattern recognition, image processing, and computer vision. The prime objective of a clustering technique is to partition a given data set consisting of N-dimensional points or vectors into a fixed number of L clusters. Cluster homogeneity is one of the most important requirements for an effective clustering technique, and implies that all vectors within a cluster should be more similar to each other than to vectors associated with other clusters. Several conventional clustering methods assume that vectors which are closely located in the N-dimensional space are likely to belong to the same cluster. For instance, a cluster can be considered homogeneous if all associated points are closely located to the cluster's centroid. In an alternative definition, a cluster may be considered homogeneous if any point within the cluster has a significant number of neighbors which are also part of the same cluster. Proximity is usually determined by means of a similarity or dissimilarity measure, and the clustering algorithm associates vectors to clusters based on that measure. The most popular distance measures are the Euclidean distance and, the more general, Mahalanobis distance. However, the notion of similarity may vary depending on the application, while other cluster properties may be important or necessary. Depending on the definition of distance and homogeneity, clusters may tend to form hyper-spherical or hyper-ellipsoidal point clouds, or may take the form of any arbitrary shape, provided that there is an obvious high point density within the cluster that defines or separates it from other clusters.

* Corresponding author. Tel.: +1 504 280 7415; fax: +1 504 280 7415. E-mail address: [email protected] (D. Charalampidis).

0362-546X/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.na.2009.06.064


Extensive overviews of several clustering algorithms can be found in [1–8]. One of the most widely used clustering algorithms is K-means and its variations [7,8]. Several other clustering approaches exist, including those which are neural network (NN) based [9–12]. During the past few years NNs have been extensively used for classification and pattern recognition tasks. The main body of the work in this area has concentrated on the use of feedforward NN models such as Back Propagation NNs [13] and Radial Basis Function NNs [14]. However, NNs have mostly been used for classification and function approximation applications, with the exception of NNs that have some association with exemplar-based techniques, such as the Kohonen feature maps [10] and Adaptive Resonance Theory NNs [11,12].

It was identified by the authors that, in addition to their classification and function approximation capabilities, multilayer perceptron (MLP) NNs exhibit characteristics which are desirable in several data clustering applications. More specifically, the function represented by each MLP can be thought of as a highly versatile multi-dimensional exemplar curve or surface. It is described in Section 2 that a group of MLP NNs can be used to generate a group of exemplar curves or surfaces in order to cluster data points into several arbitrary-shaped point clouds.

To the best of the authors' knowledge, although techniques that cluster data into arbitrary-shaped clusters exist in the literature, the proposed approach is the first exemplar-based approach that attempts to tackle this problem using MLP NNs. The approach in [15] determines the required seeds, and thus the sub-clusters within a cluster, through identification of the cluster border points. Border points are indicative of the elongation or convexity of a cluster. Two sub-clusters are merged based on a minimal spanning tree of seeds and the density at the border region between the two sub-clusters. The algorithm in [16] forms arbitrary-shaped clusters by grouping points into several fuzzy ellipsoidal sub-clusters and by merging the sub-clusters based on their Bhattacharyya distances. In [17] a graph-theoretic, non-parametric clustering approach is presented. The proposed technique builds clusters by defining hyper-surface/curve exemplars instead of merging closely located sub-clusters/points or defining cluster boundaries at sparsely populated regions. Thus, the proposed technique is naturally capable of handling crossing clusters, since exemplar hyper-surfaces/curves associated with different clusters are allowed to cross each other. Other techniques that produce arbitrary-shaped clusters tend to merge crossing clusters into a single one. It should be mentioned at this point that the cost functions used and the clusters produced by clustering techniques are often subjective. Thus, it is not claimed that the proposed technique is objectively superior, but that it has certain capabilities that other techniques may not possess.

The paper is structured as follows. Section 2 introduces the proposed technique. Results that illustrate the effectiveness of the proposed technique are presented in Section 3. Section 4 closes with some concluding remarks and plans for future work.

2. Proposed method

2.1. The basic clustering technique

Consider the case where the data points are located in an N-dimensional space, and the goal of the clustering technique is to group M data points, pm, m = 1, 2, . . . , M, into L clusters. In what follows, it will be assumed that the points pm are row vectors. The basic idea of the proposed technique is to form exemplar hyper-curves or hyper-surfaces in order to represent each cluster. As will be described next, exemplar surfaces are more difficult to handle than exemplar curves. In this work, an MLP-based method is employed in order to produce such exemplar curves/surfaces. In particular, each NN is used to determine a curve/surface that serves as an exemplar for one cluster.

In order to achieve the aforementioned goal, each N-dimensional point pm is split into two lower-dimensional vectors, namely qm and tm, where qm includes some of the coordinate values of pm, while tm includes the remaining coordinate values. The first vector, qm, is used as the input to the networks, while the second vector, tm, is used as the target. Essentially, each network's objective is to map qm to tm, and thus define a curve/surface in the vicinity of pm. The success or, more accurately, the lack of success of the lth network in mapping qm to tm is measured by the square of the Euclidean distance between the target output, tm, and the actual output, sm(l), produced when qm is presented to the lth network:

$E(l,m) = (\mathbf{t}_m - \mathbf{s}_m^{(l)})(\mathbf{t}_m - \mathbf{s}_m^{(l)})^{\mathrm{T}}$  (1)

where the superscript T indicates the transpose operation.
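As a quick numerical illustration of Eq. (1), the error is simply the squared Euclidean norm of the residual between the target sub-vector and the network output (the values below are invented purely for demonstration):

```python
import numpy as np

t_m = np.array([0.20, -0.10])                # target sub-vector t_m (row vector)
s_lm = np.array([0.25, -0.05])               # output s_m^(l) of network l for input q_m
E_lm = float((t_m - s_lm) @ (t_m - s_lm))    # squared Euclidean distance E(l, m)
print(E_lm)                                  # 0.005
```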

The first step in the algorithm is to initialize the L networks with random weights and biases. In the second step, all points pm are split and used to train all L networks. The goal of the second step is to bring all exemplar curves/surfaces close to the actual location of the data points. In the third step, all points are presented (without training) to all networks. A point pm is assigned to the lth network if the squared distance E(l, m) between the sub-vectors tm and sm(l) is smaller than any other squared distance E(lo, m), l ≠ lo, considering of course that the input to all networks is qm. The process is repeated for all points pm. In the fourth step, the networks are retrained using only those points that selected them during the third step. Steps 3 and 4 are repeated until no significant changes in the point assignment are observed. The algorithmic steps are presented in Table 1.

It can be observed that the technique has certain similarities with the traditional K-means clustering algorithm. However, the exemplars produced by the networks are not point centroids, but hyper-curves/surfaces. In order for the exemplars to be hyper-curves, as opposed to hyper-surfaces, the input qm should be a scalar representing a single coordinate value, while tm should include all remaining coordinate values.


Table 1. The proposed algorithm.

Step 1: Initialize the L networks with random weights and biases.
Step 2: Split each point pm, m = 1, . . . , M, into two sub-vectors qm and tm. Train all L networks using all points, with qm as the input and tm as the target.
Step 3: Present, without training, all points to all networks. A point pm is assigned to the lth network if the squared distance E(l, m) between the sub-vectors tm and sm(l) is smaller than any other squared distance E(lo, m), l ≠ lo.
Step 4: Retrain each network using only those points that selected it during Step 3.
Step 5: If significant changes in the point assignment are observed, go back to Step 3; otherwise terminate the algorithm.
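The following Python sketch illustrates the loop of Table 1. It is not the authors' implementation: the function name mlp_cluster, the input_dims argument, and the use of scikit-learn's MLPRegressor (trained with its default solver rather than the Levenberg–Marquardt method used in the paper, see Section 3) are all illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def mlp_cluster(points, L, input_dims, max_iters=50, seed=0):
    """Sketch of the exemplar-based MLP clustering loop of Table 1.

    points: (M, N) array of data points p_m.
    input_dims: coordinate indices forming the network input q_m;
                the remaining coordinates form the target t_m.
    """
    M, N = points.shape
    target_dims = [d for d in range(N) if d not in input_dims]
    Q = points[:, input_dims]
    T = points[:, target_dims]
    y = T.ravel() if T.shape[1] == 1 else T            # 1-D targets when t_m is a scalar

    # Step 1: initialize L networks with (different) random weights and biases.
    nets = [MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=seed + l)
            for l in range(L)]

    # Step 2: train every network on all points, so all exemplars move near the data.
    for net in nets:
        net.fit(Q, y)

    labels = -np.ones(M, dtype=int)
    for _ in range(max_iters):
        # Step 3: assign each point to the network with the smallest error E(l, m), Eq. (1).
        E = np.stack([((net.predict(Q).reshape(M, -1) - T) ** 2).sum(axis=1) for net in nets])
        new_labels = E.argmin(axis=0)
        if np.array_equal(new_labels, labels):         # Step 5: stop when assignments settle
            break
        labels = new_labels
        # Step 4: retrain each network only on the points that selected it.
        for l, net in enumerate(nets):
            mask = labels == l
            if mask.any():
                net.fit(Q[mask], y[mask])
    return labels, nets
```

For the two-dimensional examples of Section 3, where pm is two-dimensional and both qm and tm are scalars, such a sketch would be called as labels, nets = mlp_cluster(points, L=2, input_dims=[0]).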

[Figure: panels labeled (a)–(f) with titles ''Original Data'', ''Network 1'', ''Network 2'', ''Networks 1, 2'', and ''Clustering Results''; axes span approximately −0.5 to 0.5.]
Fig. 1. Split-and-merge example.

2.2. Split-and-merge

Similar to K-means and other exemplar-based approaches, it is possible that the clustering result may not be the optimum one. In order to reduce this problem, a split-and-merge approach appropriate to the proposed method can be employed. More specifically, after one training pass for each network (Step 4 in Table 1), the clusters are split into sub-clusters depending on the distance between neighboring points. In order to achieve this goal in an efficient manner, the exemplars are considered to be hyper-curves as opposed to hyper-surfaces. All points pm are sorted with respect to the corresponding input part of the data points, namely qm, which is possible for hyper-curves since qm is a scalar. Then, the Euclidean distance between two neighboring (with respect to qm) points pm and pn is determined. If the distance exceeds a threshold, then the cluster is split at that point. Once all clusters are split into sub-clusters, they are merged again in the following manner. First, the endpoints of each sub-cluster are determined; these are the points pm whose qm values are either the minimum or the maximum among all qm values in the sub-cluster. Two sub-clusters are merged together if one of the endpoints of one sub-cluster is closer to one of the endpoints of the other sub-cluster than to any other endpoint of any other sub-cluster. This process is repeated until the sub-clusters are merged into L clusters.
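A sketch of this split-and-merge step is given below, again in Python with numpy. It assumes scalar inputs q (one value per point) and greedily merges the pair of sub-clusters with the closest endpoints until L clusters remain; the function name and the split_threshold parameter are illustrative, not from the paper.

```python
import numpy as np

def split_and_merge(points, q, labels, L, split_threshold):
    """Split clusters at large gaps along q, then re-merge nearest sub-clusters (Section 2.2)."""
    # --- split: cut each cluster wherever consecutive points (sorted by q) are far apart ---
    subs = []
    for l in np.unique(labels):
        idx = np.where(labels == l)[0]
        idx = idx[np.argsort(q[idx])]                         # order the cluster along q
        gaps = np.linalg.norm(np.diff(points[idx], axis=0), axis=1)
        cuts = np.where(gaps > split_threshold)[0] + 1
        subs += [seg for seg in np.split(idx, cuts) if len(seg)]

    # --- merge: join the two sub-clusters whose endpoints are closest, until L remain ---
    while len(subs) > L:
        ends = [(points[s[0]], points[s[-1]]) for s in subs]  # min-q and max-q endpoints
        best_pair, best_d = None, np.inf
        for i in range(len(subs)):
            for j in range(i + 1, len(subs)):
                d = min(np.linalg.norm(a - b) for a in ends[i] for b in ends[j])
                if d < best_d:
                    best_pair, best_d = (i, j), d
        i, j = best_pair
        merged = np.concatenate([subs[i], subs[j]])
        subs[i] = merged[np.argsort(q[merged])]               # keep the merged cluster sorted by q
        del subs[j]

    new_labels = np.empty(len(points), dtype=int)
    for l, s in enumerate(subs):
        new_labels[s] = l
    return new_labels
```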


[Figure: four panels titled ''Original Data'', ''Network 1'', ''Network 2'', and ''Clustering Results''; horizontal axes span approximately −0.5 to 0.5 and vertical axes approximately −0.4 to 0.6.]
Fig. 2. Two-dimensional clustering example.

Fig. 1 presents a simple two-cluster example to illustrate the split-and-merge technique. Fig. 1(a) presents the original data. Fig. 1(b) and (c) present the exemplar curves associated with the two NNs, without using the split-and-merge technique. The NNs have not converged to the best possible solution, as shown in Fig. 1(d). The points shown in Fig. 1(d), assuming that they are sorted according to qm (x-axis), are connected by lines whose length indicates their distance. ''Stars'' are points whose distance from their neighbors exceeds the splitting threshold; thus, the clusters should be split at those points. There is an obvious point around qm = 0.05 where both clusters should be split. Fig. 1(e) shows the two NN exemplar curves, and Fig. 1(f) the final clustering results, after the split-and-merge has been applied. In Fig. 1(f) the ''stars'' indicate that these points are located away from their neighbors. However, although the clusters are split at those points, they are merged again at the exact same points.

3. Results

In this section, results that illustrate the performance of the proposed MLP clustering technique are presented. In all examples shown, Back Propagation neural networks have been used. In all cases, the networks are trained using the Levenberg–Marquardt method [18,19].
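The paper does not give implementation details of the training routine. As a rough sketch of how Levenberg–Marquardt training of a small MLP exemplar might look, one could hand the network residuals to scipy's least-squares solver; the tiny one-hidden-layer architecture, the synthetic data, and all names below are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import least_squares

def mlp_forward(w, q, n_hidden):
    """Tiny 1-H-1 MLP with tanh hidden units; w packs all weights and biases."""
    W1 = w[:n_hidden]                      # input -> hidden weights
    b1 = w[n_hidden:2 * n_hidden]          # hidden biases
    W2 = w[2 * n_hidden:3 * n_hidden]      # hidden -> output weights
    b2 = w[3 * n_hidden]                   # output bias
    h = np.tanh(np.outer(q, W1) + b1)      # (M, H) hidden activations
    return h @ W2 + b2                     # (M,) scalar outputs s_m

def residuals(w, q, t, n_hidden):
    return mlp_forward(w, q, n_hidden) - t # one residual per training point

# Illustrative scalar inputs q_m and scalar targets t_m for one cluster.
rng = np.random.default_rng(0)
q = np.linspace(-0.5, 0.5, 80)
t = 0.3 * np.sin(6.0 * q) + 0.02 * rng.standard_normal(q.size)

n_hidden = 8
w0 = 0.1 * rng.standard_normal(3 * n_hidden + 1)
fit = least_squares(residuals, w0, args=(q, t, n_hidden), method="lm")  # Levenberg-Marquardt
exemplar_curve = mlp_forward(fit.x, q, n_hidden)  # network outputs along the input range
```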

Fig. 2 depicts an example where the original data set consists of points located in a two-dimensional space. Thus, the dimensionality of pm is 2, while both qm and tm are scalars. The exemplar curves represented by the two NNs are shown in the same figure. This example presents an interesting result, not commonly obtained with traditional clustering methods, in which not only do the clusters have an irregular shape, but they are also overlapping.

Fig. 3 presents an example similar to the one depicted in Fig. 2. However, in this case the two resulting clusters are not overlapping. The exemplar curves represented by the two MLPs are also different from those shown in Fig. 2. The results presented in the two figures confirm that, similarly to traditional clustering techniques, the proposed technique may produce different clustering results depending on the data and the initial conditions (initial weight and bias values). However, in both cases the results are reasonable, since the ''goodness'' of a clustering result may be subjective.

Fig. 4 presents an example where the input data are located in a three-dimensional space. Therefore, the dimensionality of pm is 3. Moreover, the dimensionality of tm is 2, while qm is a scalar. The two clusters are easily separable, and this is confirmed by the clustering results shown. The two exemplar functions represented by the two NNs are also shown in order to illustrate that the NNs are successful in building meaningful exemplar curves. Fig. 5 presents a similar example. However, in this case the data points are a noisy version of those depicted in Fig. 4, and the separation between the two clusters is not as apparent as in Fig. 4. Nevertheless, the proposed technique was able to produce meaningful clustering results by building meaningful exemplar curves to represent the two clusters (Fig. 5).


[Figure: four panels titled ''Original Data'', ''Network 1'', ''Network 2'', and ''Clustering Results''; horizontal axes span approximately −0.5 to 0.5 and vertical axes approximately −0.4 to 0.6.]
Fig. 3. Similar two-dimensional clustering example as in Fig. 2, with different clustering results.

[Figure: four three-dimensional panels titled ''Original Data'', ''Network 1'', ''Network 2'', and ''Clustering Results''; all axes span approximately −0.5 to 0.5.]
Fig. 4. Three-dimensional clustering example.

4. Conclusions and future work

In this paper, a new clustering method that employs MLP NNs is introduced. Each NN is associated with a single cluster. More specifically, the NNs are used to build exemplar curves, as opposed to the traditionally used exemplar points, in order to represent arbitrary-shaped clusters.


[Figure: four three-dimensional panels titled ''Original Data'', ''Network 1'', ''Network 2'', and ''Clustering Results''; all axes span approximately −0.5 to 0.5.]
Fig. 5. Similar three-dimensional clustering example as in Fig. 4, with noisy data points.

In order to solve the problem of locally optimal clustering, an appropriate split-and-merge method has been proposed. One of the advantages of the proposed technique is that it has the capability to cluster data into partially overlapping clusters. Essentially, this can be achieved in cases where the exemplar curves, which the NNs represent, cross (or almost cross) each other. Therefore, the proposed technique can reveal the underlying structure of overlapping data clusters, which is usually not possible using traditional clustering techniques.

Several points in the algorithm may be improved in future work. Currently, the exemplar curves are limited in the sense that they have to be functions of a single coordinate (qm) within the data vectors (pm). The problem can be addressed in future work by associating multiple networks with a single cluster, and by investigating the merging of exemplar curves to produce loop-like and other arbitrary-shaped exemplars with no limitation in terms of the relation between the coordinates of a data point.

Another issue to be addressed in future work is the fact that the distances from the clusters, as calculated in this work, are not necessarily distances from the manifold. They simply represent regression error vectors of lower dimensionality than the feature space. More geometrically meaningful distances of points to the manifolds will be investigated in the future. Additionally, inclusion of a regularizing term in the objective function might help in gradually reducing the complexity of the NNs during training. For instance, in the current implementation, NNs that claim only a few patterns may employ a large number of unnecessary weights and thus be prone to over-fitting.

Acknowledgement

This work was supported by the Research Competitiveness Subprogram (RCS), Louisiana Board of Regents, Contract Number LEQSF(2007-10)-RD-A-41.

References

[1] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[2] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[3] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum, New York, 1981.
[4] J. Hartigan, Clustering Algorithms, Wiley, New York, 1975.
[5] J. Tou, R. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, MA, 1974.


[6] E. Ruspini, A new approach to clustering, Information and Control 15 (1) (1969) 22–32.
[7] D. Charalampidis, A modified K-means algorithm for circular invariant clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (12) (2005) 1856–1865.
[8] M.C. Su, C.-H. Chou, A modified version of the K-means algorithm with a distance based on cluster symmetry, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001).
[9] T. Kohonen, The neural phonetic typewriter, IEEE Computer 27 (3) (1988) 11–12.
[10] T. Kohonen, Self-Organization and Associative Memory, 3rd ed., Springer-Verlag, New York, Berlin, 1989.
[11] G.A. Carpenter, S. Grossberg, A massively parallel architecture for a self-organizing neural pattern recognition machine, Computer Vision, Graphics, and Image Processing 37 (1987) 54–115.
[12] G.A. Carpenter, S. Grossberg, ART2: Self-organization of stable category recognition codes for analog input patterns, Applied Optics 26 (23) (1987) 4919–4930.
[13] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, MIT Press, Cambridge, MA, 1986 (Chapter 8).
[14] J.E. Moody, C.J. Darken, Fast learning in networks of locally tuned processing units, Neural Computation 1 (1989) 281–294.
[15] M.S. Baghshah, S.B. Shouraki, A fuzzy clustering algorithm for finding arbitrary shaped clusters, in: IEEE/ACS International Conference on Computer Systems and Applications, March 31–April 4, 2008, pp. 559–566.
[16] D. Chaudhuri, B.B. Chaudhuri, A novel multiseed nonhierarchical data clustering technique, IEEE Transactions on Systems, Man, and Cybernetics, Part B 27 (5) (1997) 871–876.
[17] W.L.G. Koontz, P.M. Narendra, K. Fukunaga, A graph-theoretic approach to nonparametric cluster analysis, IEEE Transactions on Computers C-25 (9) (1976) 936–944.
[18] K. Levenberg, A method for the solution of certain non-linear problems in least squares, The Quarterly of Applied Mathematics 2 (1944) 164–168.
[19] D. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, SIAM Journal on Applied Mathematics 11 (1963) 431–441.