
(ICNN&B'05 Invited Paper)

Self-Organising Map as a Natural Kernel Method
Hujun Yin

School of Electrical and Electronic Engineering
The University of Manchester
Manchester, M60 1QD, UK

E-mail: [email protected]

Abstract-In this paper, two recent kernel SOMs are reviewed and it is shown that the kernel SOMs can be formally derived from an energy function of the SOM in the feature space. Various kernel functions are readily applicable to the kernel SOM, while their performance and the choice of kernel parameters depend on the problem. This paper shows that with a symmetric, density-type kernel function, the kernel SOM is equivalent to a homoscedastic Self-Organising Mixture Network, an entropy-based density estimator. It also explains that the SOM naturally approximates a kernel method.

I. INTRODUCTION

The Self-Organising Map (SOM) [10] is one of the most popular and widely applied neural network models, owing to its distinct features over other neural networks such as the nonlinear mapping from input to output space and topology preservation. The SOM has been studied, applied and extended extensively in the last couple of decades and many insights have been gained since its introduction [10]; nevertheless, numerous new developments and applications continue to emerge [2, 3, 9].

Kernel methods have received a great deal of attention in the past few years, especially in the supervised learning community [16]. By applying kernel functions to the input space, a nonlinear, complex problem can become linear in the high dimensional feature space [1]. Typical examples are the Support Vector Machines [5]. Kernel methods have also been applied to unsupervised learning models such as principal component analysis [15], principal factor analysis, projection pursuit and canonical correlation analysis [6]. Two kernel variants of the SOM have been proposed recently. MacDonald and Fyfe [13] derived a kernel SOM by kernelising the k-means clustering algorithm with an added neighbourhood. Andras [4] and Pan, Chen and Zhang [14] have proposed a kernel SOM by transforming the input space to a feature space using nonlinear kernel functions.

The objectives of these kernel SOMs differ from those of some earlier approaches [7] and [17, 18], which aim at optimising the topographic mapping and approximating the data distribution respectively. In [7], Graepel, Burger and Obermayer apply kernel functions to transform the input to a high dimensional space, thus making the distance metric nonlinear and adding more flexibility in vector-quantising and capturing the data structures. In van Hulle [17] and Yin and Allinson [18], neurons in the SOM are treated as Gaussian (or other) kernels, and the resulting map approximates a mixture of Gaussian (or other) distributions of the data. This further establishes a link between the mixture model and the self-organisation process [18].

In Section 2, two recent kernel SOMs are reviewed and it is further shown that one of them can be derived from an energy function [11, 8]. Furthermore, we show that the kernel self-organisation can be performed entirely in the transformed space. The proposed method unifies the kernel approaches to the SOM. In Section 3, a direct relationship between the kernel SOM and an earlier Self-Organising Mixture Network (SOMN) [18] is revealed. The relation further explains that the kernel SOM is an underlying mixture model and that the SOM is already an approximation of a kernel method. Conclusions are given in Section 4.

II. KERNEL SOMs

A kernel is a function $K: X \times X \to \mathbb{R}$, where $X$ is the input space. The kernel is the dot product of a mapping function $\phi(x)$, i.e. $K(x, y) = \langle\phi(x), \phi(y)\rangle$, where $\phi: X \to F$ and $F$ is a high dimensional inner product feature space.
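As a quick numerical illustration of this definition (not from the paper), the snippet below checks that a degree-2 polynomial kernel equals the inner product of an explicit feature map $\phi$ into a three-dimensional feature space; the kernel choice and the names are purely illustrative.

```python
import numpy as np

# Illustrative check of K(x, y) = <phi(x), phi(y)> for the degree-2 polynomial
# kernel K(x, y) = (x . y)^2 on 2-D inputs. The explicit feature map
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2) is the standard one for this kernel.

def poly2_kernel(x, y):
    return float(np.dot(x, y)) ** 2

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly2_kernel(x, y), np.dot(phi(x), phi(y)))  # both print 1.0
```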

A. Type I Kernel SOM

Following the kernel PCA [15], a k-means based kernel SOM (type I) was proposed by MacDonald and Fyfe [13]. Each data point $x$ is mapped to the feature space via $\phi(x)$. Each mean can be described as a weighted sum of the mapping functions, $m_i = \sum_n \gamma_{i,n}\phi(x_n)$, where $\{\gamma_{i,n}\}$ are the constructing parameters. The algorithm then selects a mean


or assigns the data with the minimum distance between the mapped point and the mean,

$$\|\phi(x) - m_i\|^2 = \left\|\phi(x) - \sum_n \gamma_{i,n}\phi(x_n)\right\|^2 = K(x, x) - 2\sum_n\gamma_{i,n}K(x, x_n) + \sum_{n,m}\gamma_{i,n}\gamma_{i,m}K(x_n, x_m) \quad (1)$$

The update of the mean is based on a soft learning algorithm,

$$m_i(t+1) = m_i(t) + \Lambda[\phi(x) - m_i(t)] \quad (2)$$

where $\Lambda$ is the normalised winning frequency of the $i$-th mean and is defined as,

$$\Lambda = \frac{\zeta_i(x)}{\sum_{n=1}^{t+1}\zeta_{i,n}} \quad (3)$$

where $\zeta$ is the winning counter and is often a Gaussian function between two neurons. The updating formula Eq. (2) leads to the following updating rule for the constructing parameters of the kernel weights,

$$\gamma_{i,n}(t+1) = \begin{cases}\gamma_{i,n}(t)(1-\Lambda), & \text{for } n \neq t+1\\ \Lambda, & \text{for } n = t+1\end{cases} \quad (4)$$

Therefore this kernel SOM can be operated entirely in the feature space.
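A minimal sketch of how Eqs. (1) and (4) can be realised in code is given below, assuming a Gaussian kernel. The names `kernel_distance`, `gamma_i` and `Lambda` are illustrative rather than the paper's; in the full algorithm, `Lambda` would be obtained from the winning counters as in Eq. (3).

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is an assumed width."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def kernel_distance(x, X_seen, gamma_i, kernel=gaussian_kernel):
    """Squared feature-space distance ||phi(x) - m_i||^2 of Eq. (1),
    evaluated purely through kernel calls on the samples seen so far."""
    gamma_i = np.asarray(gamma_i, dtype=float)
    k_xn = np.array([kernel(x, xn) for xn in X_seen])
    K_nm = np.array([[kernel(xa, xb) for xb in X_seen] for xa in X_seen])
    return kernel(x, x) - 2.0 * gamma_i @ k_xn + gamma_i @ K_nm @ gamma_i

def update_gamma(gamma_i, Lambda):
    """Eq. (4): shrink the stored coefficients by (1 - Lambda) and append
    Lambda as the coefficient of the newly presented sample x_{t+1}."""
    return np.append(np.asarray(gamma_i, dtype=float) * (1.0 - Lambda), Lambda)
```

Because the means are never represented explicitly, the whole procedure indeed stays in the feature space, as noted above.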

B. Type II Kernel SOM

There is another, more direct way to kernelise the SOM: mapping the data point to the feature space and then applying the SOM in the mapped space. The winning rules of this second type of kernel SOM have been proposed either in the input space [14],

$$v = \arg\min_i\|x - m_i\|^2 \quad (5)$$

or in the feature space [4]:

$$v = \arg\min_i\|\phi(x) - \phi(m_i)\|^2 \quad (6)$$

The weight updating rule is proposed as [4, 14],

$$m_i(t+1) = m_i(t) + \alpha(t)h(v(x), i)\nabla J(x, m_i) \quad (7)$$

where $J(x, m_i) = \|\phi(x) - \phi(m_i)\|^2$ is the distance function in the feature space, or the proposed objective function, while $\alpha(t)$ and $h(v(x), i)$ are the learning rate and neighbourhood function respectively. Note that,

$$J(x, m_i) = \|\phi(x) - \phi(m_i)\|^2 = K(x, x) + K(m_i, m_i) - 2K(x, m_i) \quad (8)$$

and

$$\nabla J(x, m_i) = \frac{\partial K(m_i, m_i)}{\partial m_i} - 2\frac{\partial K(x, m_i)}{\partial m_i} \quad (9)$$
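To make Eqs. (6) and (8) concrete, the sketch below selects the winner entirely through kernel evaluations, here with an assumed Gaussian kernel; `weights` holds the input-space weight vectors $m_i$, and all names are illustrative.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def feature_distance(x, m, kernel=gaussian_kernel):
    """J(x, m) = K(x, x) + K(m, m) - 2 K(x, m), Eq. (8)."""
    return kernel(x, x) + kernel(m, m) - 2.0 * kernel(x, m)

def find_winner(x, weights, kernel=gaussian_kernel):
    """Winning rule in the feature space, Eq. (6)."""
    return int(np.argmin([feature_distance(x, m, kernel) for m in weights]))
```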

C. Link to Energy Function

From the energy function point of view, the SOM minimises the following energy [11, 8], at least for the discrete case*,

$$E = \sum_i\int_{V_i}\sum_j h(i, j)\|x - m_j\|^2 p(x)\,dx \quad (10)$$

where $V_i$ is the Voronoi tessellation of neuron $i$. The extension of this energy function to the feature space is,

$$E_F = \sum_i\int_{V_i}\sum_j h(i, j)\|\phi(x) - \phi(m_j)\|^2 p(x)\,dx \quad (11)$$

The kernel SOM can be seen as the result of directly minimising this transformed energy stochastically, i.e., using the sample gradient,

$$\frac{\partial E_F}{\partial m_i} = h(v(x), i)\frac{\partial}{\partial m_i}\|\phi(x) - \phi(m_i)\|^2 = -2h(v(x), i)\nabla J(x, m_i) \quad (12)$$

This leads to the weight updating rule Eq. (7).

For the type I kernel SOM, as the neurons' weights are defined in the feature space, we can define a corresponding distance between a neuron weight and a mapped input in the feature space as,

$$J'(x, m_i) = \|\phi(x) - m_i\|^2 = K(x, x) + \langle m_i, m_i\rangle - 2\langle\phi(x), m_i\rangle \quad (13)$$

Then,

$$\nabla J'(x, m_i) = 2m_i - 2\phi(x) \quad (14)$$

Replacing the distance measure in the above energy function (11) and its differentiation (12) with $J'$ and using the steepest descent method, we obtain the updating rule for the neurons' weights as,

$$m_i(t+1) = m_i(t) + 4\alpha(t)h(v(x), i)(\phi(x) - m_i) \quad (15)$$

As can be seen, it is similar to Eq. (2), the updating rule for the type I kernel SOM.
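The feature-space energy of Eq. (11) can also be estimated empirically; the sketch below replaces the integral by a sample average, assuming a Gaussian kernel, a one-dimensional map grid and a Gaussian neighbourhood of width `delta` (all illustrative choices, not prescribed by the paper).

```python
import numpy as np

def feature_energy(X, weights, sigma=1.0, delta=1.0):
    """Monte Carlo estimate of E_F in Eq. (11) over a data sample X."""
    grid = np.arange(len(weights))
    E = 0.0
    for x in X:
        # For a Gaussian kernel, J(x, m_j) = 2 - 2 K(x, m_j) as in Eq. (8).
        sq = np.sum((x - weights) ** 2, axis=1)
        J = 2.0 - 2.0 * np.exp(-sq / (2.0 * sigma ** 2))
        v = int(np.argmin(J))                                 # winner, Eq. (6)
        h = np.exp(-((grid - v) ** 2) / (2.0 * delta ** 2))   # neighbourhood
        E += float(h @ J)                                     # sum_j h(v, j) J(x, m_j)
    return E / len(X)
```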

Various kernel functions, such as the Gaussian, Cauchy, logarithmic and polynomial kernels, are readily applicable to the kernel SOM [12]. For example, for the Gaussian kernel, the winning and weight updating rules are,

$$v = \arg\min_i J(x, m_i) = \arg\min_i[-2K(x, m_i)] = \arg\min_i\left[-\exp\left(-\frac{\|x - m_i\|^2}{2\sigma^2}\right)\right] \quad (16)$$

$$m_i(t+1) = m_i(t) + \alpha(t)h(v(x), i)\frac{1}{\sigma^2}\exp\left(-\frac{\|x - m_i\|^2}{2\sigma^2}\right)(x - m_i) \quad (17)$$

respectively.

* The use of this energy function may imply a different winning rule, $v = \arg\min_i\sum_j h(i, j)\|x - m_j\|^2$. For simplicity we still use the simple winning rule, Eq. (5) or (6). When the number of neurons is large and the neighbourhood function is symmetric, the two become similar.

Table 1: Classification errors on the UCI colon cancer dataset. M, A and V denote the minimum distance, average distance and majority voting methods used to label the nodes [12].

Kernel      | Type I Kernel SOM   | Type II Kernel SOM  | SOM
            |  M     A     V      |  M     A     V      |  M     A     V
Gaussian    | 5.6   5.8   5.6     | 5.3   5.3   5.7     |
Cauchy      | 5.5   5.6   5.5     | 5.5   5.5   4.8     | 4.3   7.0   3.8
Log         | 4.6   4.6   4.6     | 5.2   5.2   4.6     |

Experimental results on various benchmark datasets have shown that, although the kernel SOM does produce better classification performance in some cases when the kernel parameters are optimised (often empirically), there is no clear evidence to indicate that the kernel SOM will always outperform the SOM [12]. A typical set of results is given in Table 1.
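For reference, a single adaptation step of the type II kernel SOM with a Gaussian kernel, following Eqs. (16) and (17), might look like the sketch below; the one-dimensional grid, the neighbourhood width `delta` and the parameter names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel_som_step(x, weights, alpha, sigma=1.0, delta=1.0):
    """One update of all weights for a sample x; `weights` is a float array
    of shape (n_nodes, dim) laid out on a 1-D grid."""
    diffs = x - weights
    k = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * sigma ** 2))  # K(x, m_i)
    v = int(np.argmax(k))                       # Eq. (16): minimise -K(x, m_i)
    grid = np.arange(len(weights))
    h = np.exp(-((grid - v) ** 2) / (2.0 * delta ** 2))           # h(v, i)
    # Eq. (17): each weight moves towards x, scaled by its kernel value.
    weights += alpha * (h * k / sigma ** 2)[:, None] * diffs
    return v, weights
```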

III. SELF-ORGANISING MIXTURE NETWORK

The self-organising mixture network (SOMN) [18] extends and adapts the SOM to a mixture density model in which each node characterises a conditional probability distribution. The joint probability density of the data (or the network) is described by a mixture distribution,

$$p(x \mid \Theta) = \sum_{i=1}^{K}p_i(x \mid \theta_i)P_i \quad (18)$$

where $p_i(x \mid \theta_i)$ is the $i$-th component-conditional density and $\theta_i$ is the parameter of the $i$-th conditional density, $i = 1, 2, \ldots, K$, $\Theta = (\theta_1, \theta_2, \ldots, \theta_K)^T$, and $P_i$ is the prior probability of the $i$-th component or node, also called the mixing weight. For example, a Gaussian or Cauchy mixture has the following conditional densities respectively,

$$p_i(x \mid \theta_i) = \frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}}\exp\left[-\frac{1}{2}(x - m_i)^T\Sigma_i^{-1}(x - m_i)\right] \quad (19)$$

$$p_i(x \mid \theta_i) = \frac{1}{\pi|\Sigma_i|^{1/2}\left[1 + (x - m_i)^T\Sigma_i^{-1}(x - m_i)\right]} \quad (20)$$

where $\theta_i = \{m_i, \Sigma_i\}$ are the mean vector and covariance matrix respectively.
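As an illustration of the mixture density in Eqs. (18) and (19), the sketch below evaluates a homoscedastic (shared isotropic variance) Gaussian mixture; the shared variance `var` and the names are assumptions made for brevity.

```python
import numpy as np

def gaussian_component(x, mean, var):
    """Isotropic Gaussian p_i(x | theta_i) with covariance var * I, cf. Eq. (19)."""
    d = x.size
    norm = (2.0 * np.pi * var) ** (-d / 2.0)
    return norm * float(np.exp(-np.sum((x - mean) ** 2) / (2.0 * var)))

def mixture_density(x, means, priors, var=1.0):
    """p(x | Theta) = sum_i p_i(x | theta_i) P_i, Eq. (18)."""
    return sum(P * gaussian_component(x, m, var) for m, P in zip(means, priors))
```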

Suppose that the true environmental data density function and the estimated one are $p(x)$ and $\hat{p}(x)$ respectively. The Kullback-Leibler information metric measures the divergence between these two, and is defined as:

$$I = -\int\log\frac{\hat{p}(x)}{p(x)}\,p(x)\,dx \quad (21)$$

When the estimated density is modelled as a mixture distribution, i.e. a function of various sub-densities and their parameters, one can seek the optimal estimate of these parameters by minimising the Kullback-Leibler metric via its partial differentials with respect to every model parameter, i.e.

$$\frac{\partial I}{\partial\theta_i} = -\int\frac{1}{\hat{p}(x \mid \Theta)}\frac{\partial\hat{p}(x \mid \Theta)}{\partial\theta_i}\,p(x)\,dx, \qquad i = 1, 2, \ldots, K \quad (22)$$

As the true data density is unknown, the stochastic gradient is used for solving these equations, which are not directly solvable. This results in the following adaptive updating rules for the parameters and priors [18],

$$\hat{\theta}_i(t+1) = \hat{\theta}_i(t) + \alpha(t)h(v(x), i)\left[\frac{1}{\hat{p}(x \mid \Theta)}\frac{\partial\,\hat{p}_i(x \mid \theta_i)\hat{P}_i}{\partial\theta_i}\right] \quad (23)$$

$$\hat{P}_i(t+1) = \hat{P}_i(t) - \alpha(t)h(v(x), i)\left[\hat{P}_i(t) - \hat{P}(i \mid x)\right] \quad (24)$$

where $\alpha(t)$ is the learning coefficient or rate at time step $t$, with $0 < \alpha(t) < 1$ and decreasing monotonically. The neighbourhood function $h(v(x), i)$ is further introduced to restrict the learning to a neighbourhood of the winner, which is found via the maximum (estimated) posterior probability of the node,


$$\hat{P}(i \mid x) = \frac{\hat{P}_i\,\hat{p}_i(x \mid \theta_i)}{\hat{p}(x \mid \Theta)} \quad (25)$$

When the SOMN is limited to the homoscedastic case, i.e. equal variances and equal priors (or non-informative priors) for all components, only the means are the learning variables. The above winner rule becomes,

$$v = \arg\max_i\hat{p}_i(x \mid \theta_i) \quad (26)$$

Eq. (26) is equivalent to rule Eq. (6) or (8), when the density function is isotropic or a function of $\|x - m_i\|$. The corresponding weight updating is,

$$m_i(t+1) = m_i(t) + \alpha(t)h(v(x), i)\frac{1}{\sum_j\hat{p}_j(x \mid \theta_j)}\frac{\partial\hat{p}_i(x \mid \theta_i)}{\partial m_i} \quad (27)$$

It can be seen that Eq. (27) bears similarity to Eq. (7). Again, if the conditional density function is of a kernel type and isotropic, or vice versa (e.g. Gaussian and Cauchy functions), the above rule leads to the same result as Eq. (7), with an additional normalising factor $\sum_j\hat{p}_j(x \mid \theta_j) = \hat{p}(x)$.

When the data density is relatively smooth, this factor is merely the same scalar value for all nodes.

For example, for a Gaussian mixture with equal variance and prior for all nodes, it is easy to show that the winning and mean updating rules are,

$$v = \arg\max_i\left[\exp\left(-\frac{\|x - m_i\|^2}{2\sigma^2}\right)\right] \quad (28)$$

$$m_i(t+1) = m_i(t) + \alpha(t)h(v(x), i)\frac{1}{2\sigma^2\sum_j\hat{p}_j(x \mid \theta_j)}\exp\left(-\frac{\|x - m_i\|^2}{2\sigma^2}\right)(x - m_i) \quad (29)$$

They are equivalent to those of the kernel SOM with Gaussian kernels, i.e. Eqs. (16) and (17).
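A minimal sketch of the homoscedastic SOMN step of Eqs. (28) and (29) is given below; note the division by the summed component densities, the extra normalising factor that distinguishes it from the kernel SOM update. The one-dimensional grid, the neighbourhood width `delta` and the constant factors follow the assumptions used in the earlier sketches.

```python
import numpy as np

def somn_step(x, means, alpha, sigma=1.0, delta=1.0):
    """One homoscedastic SOMN update; `means` is a float array (n_nodes, dim)."""
    diffs = x - means
    dens = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * sigma ** 2))
    v = int(np.argmax(dens))              # Eq. (28): maximum component density
    grid = np.arange(len(means))
    h = np.exp(-((grid - v) ** 2) / (2.0 * delta ** 2))
    # Eq. (29): move each mean towards x, weighted by its density and divided
    # by the mixture density (the normalising factor discussed above).
    step = alpha * h * dens / (2.0 * sigma ** 2 * np.sum(dens))
    means += step[:, None] * diffs
    return v, means
```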

The equivalence between the SOMN and the kernel SOM on the one hand explains that the kernel SOM approximates a mixture density model using the kernel function as the prototype conditional density. On the other hand, as the SOM is a special case of the SOMN with equal variance and prior for all nodes, and when the number of nodes is large, the SOM is a natural kernel method.

IV. CONCLUSIONS

In this paper, the relation between the kernel SOM and the self-organising mixture network (SOMN) has been established. When the conditional density function is of a kernel type or the kernel function is of a density type, and both are symmetric functions, as is often the case, the two methods are equivalent. The kernel SOM can then be understood as an entropy-optimised mixture density learner. As the SOM is a special case of the SOMN, this in turn explains that the SOM approximates a kernel method naturally.

REFERENCES

[1] M. Aizerman, E. Braverman and L. Rozonoer (1964), Theoretical foundations of the potential function method in pattern recognition learning, Automation and Remote Control, vol. 25, 821-837.
[2] N. Allinson, K. Obermayer and H. Yin (Eds.) (2002), Special Issue on New Developments on Self-Organising Maps, Neural Networks, vol. 15.
[3] N. Allinson, H. Yin, L. Allinson and J. Slack (Eds.) (2001), Advances in Self-Organising Maps, London, Springer.
[4] P. Andras (2002), Kernel-Kohonen networks, International Journal of Neural Systems, vol. 12, 117-135.
[5] C. Cortes and V. Vapnik (1995), Support vector networks, Machine Learning, vol. 20, 273-297.
[6] C. Fyfe, D. MacDonald and D. Charles (2000), Unsupervised learning using radial kernels, in: Howlett (ed.), Recent Advances in Radial Basis Networks.
[7] T. Graepel, M. Burger and K. Obermayer (1998), Self-organizing maps: generalization and new optimization techniques, Neurocomputing, vol. 21, 173-190.
[8] T. Heskes (1999), Energy functions for self-organizing maps, in: E. Oja and S. Kaski (eds), Kohonen Maps, Elsevier.
[9] M. Ishikawa, R. Miikkulainen and H. Ritter (Eds.) (2004), Special Issue on New Developments on Self-Organising Systems, Neural Networks, vol. 17.
[10] T. Kohonen (1999), Self-Organising Maps, Springer.
[11] J. Lampinen and E. Oja (1992), Clustering properties of hierarchical self-organizing maps, Journal of Mathematical Imaging and Vision, vol. 2, 261-272.
[12] K.W. Lau and H. Yin (2005), Kernel self-organising maps for classification, submitted to Neurocomputing.
[13] D. MacDonald and C. Fyfe (2000), The kernel self-organising map, Applied Computational Intelligence Research Unit, The University of Paisley.
[14] Z.S. Pan, S.C. Chen and D.Q. Zhang (2004), A kernel-based SOM classifier in input space, Acta Electronica Sinica, vol. 32, 227-231 (in Chinese).
[15] B. Schölkopf, A. Smola and K.-R. Müller (1998), Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation, vol. 10, 1299-1319.
[16] J. Shawe-Taylor and N. Cristianini (2004), Kernel Methods for Pattern Analysis, Cambridge University Press.
[17] M. van Hulle (2002), Kernel-based topographic map formation achieved with an information-theoretic approach, Neural Networks, vol. 15, 1029-1039.
[18] H. Yin and N. Allinson (2001), Self-organising mixture networks for probability density estimation, IEEE Trans. Neural Networks, vol. 12, 405-411.
