
Performance Analysis of LVQ Algorithms: A Statistical Physics Approach

    Anarta Ghosh1,3, Michael Biehl1 and Barbara Hammer2

1- Rijksuniversiteit Groningen - Mathematics and Computing Science, P.O. Box 800, NL-9700 AV Groningen - The Netherlands

2- Clausthal University of Technology - Institute of Computer Science, D-98678 Clausthal-Zellerfeld - Germany

    3- [email protected]

    Abstract

Learning vector quantization (LVQ) constitutes a powerful and intuitive method for adaptive nearest prototype classification. However, original LVQ has been introduced based on heuristics, and numerous modifications exist to achieve better convergence and stability. Recently, a mathematical foundation by means of a cost function has been proposed which, as a limiting case, yields a learning rule similar to classical LVQ2.1. It also motivates a modification which shows better stability. However, the exact dynamics as well as the generalization ability of many LVQ algorithms have not been thoroughly investigated so far. Using concepts from statistical physics and the theory of on-line learning, we present a mathematical framework to analyse the performance of different LVQ algorithms in a typical scenario in terms of their dynamics, sensitivity to initial conditions, and generalization ability. Significant differences in the algorithmic stability and generalization ability can be found already for slightly different variants of LVQ. We study five LVQ algorithms in detail: Kohonen's original LVQ1, unsupervised vector quantization (VQ), a mixture of VQ and LVQ, LVQ2.1, and a variant of LVQ which is based on a cost function. Surprisingly, basic LVQ1 shows very good performance in terms of stability, asymptotic generalization ability, and robustness to initializations and model parameters which, in many cases, is superior to recent alternative proposals.

    Key words: Online learning, LVQ, vector quantization, thermodynamic limit, order parameters.

    1 Introduction

Due to its simplicity, flexibility, and efficiency, Learning Vector Quantization (LVQ) as introduced by Kohonen has been widely used in a variety of areas, including real time applications like speech recognition [14-16]. Several modifications of basic LVQ have been proposed which aim at a larger flexibility, faster convergence, more flexible metrics, or better adaptation to Bayesian decision boundaries, etc. [5,12,14]. Thereby, most learning schemes including basic LVQ have been proposed on heuristic grounds and their dynamics are not clear. In particular, there exist potentially powerful extensions like LVQ2.1 which require the introduction of additional heuristics, e.g. the so-called window rule for stability, which are not well understood theoretically.

Recently, several approaches relate LVQ-type learning schemes to exact mathematical concepts and thus open the way towards a solid mathematical justification for LVQ type learning algorithms. The directions are mainly twofold: On the one hand, cost functions have been proposed which, possibly as a limiting case, lead to LVQ-type gradient schemes, such as the approaches in [12,20,21]. Thereby, the nature of the cost function has consequences for the stability of the algorithm, as pointed out in [19,21]; in addition, it allows a principled extension of LVQ-type learning schemes to complex situations, introducing e.g. neighborhood cooperation of the prototypes or adaptive kernels into the classifier [11]. On the other hand, generalization bounds of the algorithms have been derived by means of statistical learning theory, as obtained in [4,10], which characterize the generalization capability of LVQ and variants thereof in terms of the hypothesis margin of the classifier. Interestingly, the cost function of some extensions of LVQ includes a term which measures the structural risk, thus aiming at margin maximization during training, similar to support vector machines [10]. However, the exact connection between the classification accuracy and the cost function is not clear for these proposals. In addition, the relation between the learning scheme and the generalization ability has not been investigated in mathematical terms so far. Furthermore, the formulation via a cost function is often limited to approximate scenarios which reach the exact crisp LVQ-type learning scheme only as a limit case. Thus, there is a need for a systematic investigation of these methods in terms of


their generalization ability, dynamics, and sensitivity to initial conditions.

It has been shown in [3,7] for unsupervised prototype-based methods that a rigorous mathematical analysis of statistical learning algorithms is possible in some cases and might yield quite unexpected results. Recently, first results for supervised learning schemes such as LVQ have been presented in the article [2]; however, they are restricted to methods which adapt only one prototype at a time, excluding methods like LVQ2.1 and variants thereof. In this work we extend this study to a representative collection of popular LVQ-type learning schemes, using concepts from statistical physics. We investigate the learning dynamics, generalization ability, as well as sensitivity to model parameters and initialization.

The dynamics of training is studied along the successful theory of on-line learning [1,6,18], considering learning from a sequence of uncorrelated, random training data generated according to a model distribution unknown to the training scheme, and the limit $N \to \infty$, $N$ being the data dimensionality. In this limit, the system dynamics can be described by coupled ordinary differential equations in terms of characteristic quantities, the solutions of which provide insight into the learning dynamics and interesting quantities such as the generalization error.

Here the investigation and comparison of algorithms are done with respect to the typical behavior of large systems in the framework of a model situation. The approach presented in this paper complements other paradigms which provide rigorously exact results without making explicit assumptions about, for instance, the statistics of the data, see e.g. [3,7]. Our analysis of typical behavior can also be performed for heuristically formulated algorithms which lack, for example, a direct relation to a cost function.

The model of the training data is given in Section 2. In Section 3 we present the LVQ algorithms that are studied in this paper. The method for the dynamical and performance analysis of these LVQ algorithms is described in Section 4. In Section 5 we put forward the results in terms of the performance of the LVQ algorithms in the proposed theoretical framework. A brief summary and conclusions are presented in Section 6. At the end we provide some key mathematical results used in the paper as an appendix.

    2 The data model

We study a simple though relevant model situation with two prototypes and two classes. Note that this situation captures important aspects at the decision boundary between classes. Since LVQ type learning schemes perform a local adaptation, this simple setting provides insight into the dynamics and generalization ability in interesting areas at the class boundaries when learning a more complex data set. We denote the prototypes as

$w_s \in \mathbb{R}^N$, $s \in \{-1, +1\}$. An input data vector $\xi \in \mathbb{R}^N$ is classified as class $s$ iff $d(\xi, w_s) < d(\xi, w_{-s})$, where $d$ is some distance measure (typically Euclidean). At every time step $\mu$, the learning process for the prototype vectors makes use of a labeled training example $(\xi^{\mu}, \sigma^{\mu})$, where $\sigma^{\mu} \in \{-1, +1\}$ is the class of the observed training data $\xi^{\mu}$.

We restrict our analysis to random input training data which are independently distributed according to a bimodal distribution $P(\xi) = \sum_{\sigma=\pm 1} p_\sigma P(\xi \mid \sigma)$. Here $p_\sigma$ is the prior probability of class $\sigma$, $p_{+1} + p_{-1} = 1$. In our study we choose the class conditional distribution $P(\xi \mid \sigma)$ as a Gaussian with mean vector $\lambda B_\sigma$ and independent components with variance $v_\sigma$:

$P(\xi \mid \sigma) = \frac{1}{(2\pi v_\sigma)^{N/2}} \exp\!\left( -\frac{1}{2}\, \frac{(\xi - \lambda B_\sigma)^2}{v_\sigma} \right) \qquad (1)$

We consider orthonormal class center vectors, i.e. $B_l \cdot B_m = \delta_{l,m}$, where $\delta_{\cdot,\cdot}$ is the Kronecker delta. The orthogonality condition merely fixes the position of the class centres with respect to the origin, while the parameter $\lambda$ controls the separation of the class centers. $\langle \cdot \rangle$ denotes the average over $P(\xi)$ and $\langle \cdot \rangle_\sigma$ denotes the conditional average over $P(\xi \mid \sigma)$, hence $\langle \cdot \rangle = \sum_{\sigma=\pm 1} p_\sigma \langle \cdot \rangle_\sigma$. For an input $\xi$ from cluster $\sigma$ we have, e.g., $\langle \xi_j \rangle_\sigma = \lambda (B_\sigma)_j$ and $\langle \xi^2 \rangle_\sigma = \sum_{j=1}^{N} \langle \xi_j^2 \rangle_\sigma = \sum_{j=1}^{N} \left( v_\sigma + \lambda^2 (B_\sigma)_j^2 \right) = v_\sigma N + \lambda^2$, so that $\langle \xi^2 \rangle = (p_{+1} v_{+1} + p_{-1} v_{-1}) N + \lambda^2$. In the mathematical treatment we will exploit formally the thermodynamic limit $N \to \infty$, which corresponds to very high-dimensional data and prototypes. Among other simplifying consequences this allows, for instance, to neglect the term $\lambda^2$ on the right hand side of the above expression for $\langle \xi^2 \rangle$. Hence we have $\langle \xi^2 \rangle \approx N (p_{+1} v_{+1} + p_{-1} v_{-1})$.

In high dimensions the Gaussians overlap significantly. The cluster structure of the data becomes apparent when projected into the plane spanned by $B_{+1}$ and $B_{-1}$, while projections in a randomly chosen two-dimensional subspace overlap completely. In an attempt to learn the classification scheme, the relevant directions $B_{\pm 1} \in \mathbb{R}^N$ have to be identified. Obviously this task becomes highly non-trivial for large $N$.
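As an illustration of this data model, the following sketch (ours, not part of the original paper; names such as `sample_data` are arbitrary) draws examples from the mixture defined by Eq. (1) and compares their projections onto the plane spanned by $B_{+1}, B_{-1}$ with projections onto a random two-dimensional subspace.

```python
import numpy as np

def sample_data(n, N, lam, p_plus, v_plus, v_minus, rng):
    """Draw n examples (xi, sigma) from the bimodal mixture of Eq. (1)."""
    e1 = np.zeros(N); e1[0] = 1.0
    e2 = np.zeros(N); e2[1] = 1.0
    B = {+1: e1, -1: e2}                      # orthonormal class centers B_{+1}, B_{-1}
    v = {+1: v_plus, -1: v_minus}
    sigma = rng.choice([+1, -1], size=n, p=[p_plus, 1.0 - p_plus])
    xi = np.stack([lam * B[s] + np.sqrt(v[s]) * rng.standard_normal(N) for s in sigma])
    return xi, sigma, B

rng = np.random.default_rng(0)
xi, sigma, B = sample_data(n=2000, N=200, lam=1.0, p_plus=0.5,
                           v_plus=1.0, v_minus=1.0, rng=rng)

# projection onto the plane spanned by B_{+1}, B_{-1} ...
proj_B = np.stack([xi @ B[+1], xi @ B[-1]], axis=1)
# ... versus a random two-dimensional subspace
rand_plane, _ = np.linalg.qr(rng.standard_normal((xi.shape[1], 2)))
proj_rand = xi @ rand_plane

# the cluster structure is visible only in the first projection
print(proj_B[sigma == +1].mean(axis=0) - proj_B[sigma == -1].mean(axis=0))
print(proj_rand[sigma == +1].mean(axis=0) - proj_rand[sigma == -1].mean(axis=0))
```

The separation of the class-conditional means is of order $\lambda$ in the first projection and vanishes in the second, which illustrates why the relevant directions have to be learned.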

    3 LVQ algorithms

We consider the following generic structure of LVQ algorithms:

$w^{\mu}_l = w^{\mu-1}_l + \frac{\eta}{N}\, f_l\!\left( \{ w^{\mu-1}_l \}, \xi^{\mu}, \sigma^{\mu} \right) \left( \xi^{\mu} - w^{\mu-1}_l \right), \qquad l \in \{\pm 1\},\ \mu = 1, 2, \ldots \qquad (2)$


where $\eta$ is the so-called learning rate. The specific form of $f_l = f\!\left( \{ w^{\mu-1}_l \}, \xi^{\mu}, \sigma^{\mu} \right)$ is determined by the algorithm. In the following, $d^{\mu}_l = (\xi^{\mu} - w^{\mu-1}_l)^2$ is the squared Euclidean distance between the prototype and the new training data. We consider the following learning rules determined by different forms of $f_l$:

(I) LVQ2.1: $f_l = l\, \sigma^{\mu}$ [13]. In our model with two prototypes, LVQ2.1 updates both of them at each learning step according to the class of the training data. A prototype is moved closer to (away from) the data point if the label of the data is the same as (different from) the label of the prototype. As pointed out in [21], this learning rule can be seen as a limiting case of maximizing the likelihood ratio of the correct and wrong class distribution, which are both described by Gaussian mixtures. Because the ratio is not bounded from above, divergences can occur. Adaptation is often restricted to a window around the decision surface to prevent this behavior. We do not consider a window rule in this article, but we will introduce early stopping to prevent divergence.

(II) LFM: $f_l = l\, \sigma^{\mu}\, \Theta\!\left( d^{\mu}_{+\sigma^{\mu}} - d^{\mu}_{-\sigma^{\mu}} \right)$, where $\Theta$ is the Heaviside function. This is the crisp version of robust soft learning vector quantization (RSLVQ) proposed in [21]. In the model considered here, the prototypes are adapted only according to the misclassified data, hence the name learning from mistakes (LFM) is used for this prescription. RSLVQ results from an optimization of a cost function which considers the ratio of the class distribution and the unlabeled data distribution. Since this ratio is bounded, stability can be expected.

(III) LVQ1: $f_l = l\, \sigma^{\mu}\, \Theta\!\left( d^{\mu}_{-l} - d^{\mu}_{+l} \right)$. This extension of competitive learning to labeled data corresponds to Kohonen's original LVQ1 [13]. The update is towards $\xi^{\mu}$ if the example belongs to the class represented by the winning prototype, the correct winner. On the contrary, a wrong winner is moved away from the current input.

(IV) LVQ+: $f_l = \frac{1}{2}\left[ 1 + l \sigma^{\mu} \right] \Theta\!\left( d^{\mu}_{-l} - d^{\mu}_{+l} \right)$. In this scheme the update is non-zero only for a correct winner and, then, always positive. Hence, a prototype $w_S$ can only accumulate updates from its own class $\sigma^{\mu} = S$. We will use the abbreviation LVQ+ for this prescription.

(V) VQ: $f_l = \Theta\!\left( d^{\mu}_{-l} - d^{\mu}_{+l} \right)$. This update rule disregards the actual data label and always moves the winner towards the example input. It corresponds to unsupervised Vector Quantization (VQ) and aims at finding prototypes which yield a good representation of the data in the sense of Euclidean distances. The choice $f_l = \Theta\!\left( d^{\mu}_{-l} - d^{\mu}_{+l} \right)$ can also be interpreted as describing two prototypes which represent the same class and compete for updates from examples of this very class only.

Note that the VQ procedure can be readily formulated as a stochastic gradient descent with respect to the

quantization error

$e(\xi^{\mu}) = \sum_{S=\pm 1} \frac{1}{2} \left( \xi^{\mu} - w^{\mu-1}_S \right)^2\, \Theta\!\left( d^{\mu}_{-S} - d^{\mu}_{+S} \right), \qquad (3)$

see e.g. [8] for details. While intuitively clear and well motivated, the other algorithms mentioned above lack such a straightforward interpretation as a stochastic gradient descent with respect to a cost function.
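For concreteness, the five prescriptions can be summarized in a short sketch (ours; the paper itself contains no code, and the function names are arbitrary):

```python
import numpy as np

def modulation(rule, l, sigma, d):
    """Modulation function f_l for prototype label l in {+1, -1};
    d = {+1: d_{+1}, -1: d_{-1}} are the squared Euclidean distances."""
    winner = 1.0 if d[l] < d[-l] else 0.0             # Theta(d_{-l} - d_{+l})
    if rule == "LVQ2.1":
        return float(l * sigma)                        # both prototypes updated
    if rule == "LFM":
        return float(l * sigma) if d[sigma] > d[-sigma] else 0.0   # misclassified only
    if rule == "LVQ1":
        return float(l * sigma) * winner               # winner moved toward / away
    if rule == "LVQ+":
        return 0.5 * (1.0 + l * sigma) * winner        # correct winner only
    if rule == "VQ":
        return winner                                  # label ignored
    raise ValueError(rule)

def lvq_step(w, xi, sigma, eta, rule):
    """One on-line step of the generic update (2) for both prototypes."""
    N = xi.shape[0]
    d = {l: float(np.sum((xi - w[l]) ** 2)) for l in (+1, -1)}
    for l in (+1, -1):
        w[l] = w[l] + (eta / N) * modulation(rule, l, sigma, d) * (xi - w[l])
    return w
```

Iterating `lvq_step` on examples drawn from the model of Section 2 reproduces the finite-$N$ stochastic dynamics whose typical behavior is analysed in the next section.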

    4 Dynamics and Performance Analysis

The key steps for the dynamical analysis of LVQ type learning rules can be sketched as follows: 1. The original system with many degrees of freedom is characterized in terms of only a few quantities, the so-called macroscopic order parameters. For these quantities, recursion relations can be derived from the learning algorithm. 2. Application of the central limit theorem enables us to perform an average over the random sequence of example data by means of Gaussian integrations. 3. Self-averaging properties of the order parameters allow us to restrict the description to their mean values; fluctuations of the stochastic dynamics can be neglected in the limit $N \to \infty$. 4. A continuous time limit leads to a description of the dynamics in terms of coupled, deterministic ordinary differential equations (ODE) for the above mentioned order parameters. 5. The (numerical) integration of the ODE for a given modulation function and initial conditions yields the evolution of the order parameters in the course of learning. From the latter one can directly obtain the learning curve, i.e. the generalization ability of the LVQ classifier as a function of the number of example data.

    ODE for the learning dynamics

We assume that learning is driven by statistically independent training examples such that the process is Markovian. For an underlying data distribution which has the same second order statistics as the mixture of Gaussians introduced above, the system dynamics can be analysed using only a few order parameters which depend on the relevant characteristics of the prototypes. In our setting, the system will be described in terms of the projections $\{R_{lm} = w_l \cdot B_m,\ Q_{lm} = w_l \cdot w_m\}$, $l, m \in \{\pm 1\}$. In the thermodynamic limit, these order parameters become self-averaging [17], i.e. the fluctuations about their mean values can be neglected as $N \to \infty$. In [17] a detailed mathematical foundation of this self-averaging property is given for on-line learning. This property facilitates an analysis of the stochastic evolution of the prototype vectors in terms of a deterministic system of differential equations and helps to analyse the system in an exact theoretical way. One can obtain the following recurrence relations from the generic learning scheme


(2) [9]:

$R^{\mu}_{lm} - R^{\mu-1}_{lm} = \frac{\eta}{N}\, f_l \left( b^{\mu}_m - R^{\mu-1}_{lm} \right) \qquad (4)$

$Q^{\mu}_{lm} - Q^{\mu-1}_{lm} = \frac{\eta}{N} \left[ \left( h^{\mu}_l - Q^{\mu-1}_{lm} \right) f_m + \left( h^{\mu}_m - Q^{\mu-1}_{lm} \right) f_l \right] + \frac{\eta^2}{N^2}\, f_l\, f_m\, (\xi^{\mu})^2 \qquad (5)$

where $h^{\mu}_l = w^{\mu-1}_l \cdot \xi^{\mu}$, $b^{\mu}_m = B_m \cdot \xi^{\mu}$, $R^{\mu}_{lm} = w^{\mu}_l \cdot B_m$, $Q^{\mu}_{lm} = w^{\mu}_l \cdot w^{\mu}_m$. As the analysis is done for very large $N$, terms of $O(1/N^2)$ are neglected in (5).

Defining $t \equiv \mu / N$, for $N \to \infty$, $t$ can be conceived as a continuous time variable, and the order parameters $R_{lm}$ and $Q_{lm}$ as functions of $t$ become self-averaging with respect to the random sequence of input training data. An average is performed over the disorder introduced by the randomness in the training data, and (4) and (5) become a coupled system of differential equations [9]:

$\frac{dR_{lm}}{dt} = \eta \left( \langle b_m f_l \rangle - \langle f_l \rangle R_{lm} \right) \qquad (6)$

$\frac{dQ_{lm}}{dt} = \eta \left( \langle h_l f_m \rangle - \langle f_m \rangle Q_{lm} + \langle h_m f_l \rangle - \langle f_l \rangle Q_{lm} \right) + \eta^2 \sum_{\sigma=\pm 1} v_\sigma\, p_\sigma\, \langle f_l f_m \rangle_\sigma \qquad (7)$

After plugging in the exact forms of $f_l$, the averages in (6) and (7) can be computed in terms of an integration over the density of $x = (h_{+1}, h_{-1}, b_{+1}, b_{-1})$ [9], see the appendix. The central limit theorem yields that, in the limit $N \to \infty$, $x \sim \mathcal{N}(\mu_\sigma, C_\sigma)$ (for class $\sigma$), where $\mu_\sigma$ and $C_\sigma$ are the mean vector and covariance matrix of $x$, respectively, cf. [9]. The first order and second order statistics of $x$, viz. $\mu_\sigma$ and $C_\sigma$ respectively, can be computed using the following conditional averages [9]:

$\langle h_S \rangle_\sigma = \lambda R^{\mu-1}_{S\sigma}, \qquad \langle b_\tau \rangle_\sigma = \lambda \delta_{\tau\sigma},$
$\langle h_S h_T \rangle_\sigma - \langle h_S \rangle_\sigma \langle h_T \rangle_\sigma = v_\sigma\, Q^{\mu-1}_{ST},$
$\langle h_S b_\tau \rangle_\sigma - \langle h_S \rangle_\sigma \langle b_\tau \rangle_\sigma = v_\sigma\, R^{\mu-1}_{S\tau},$
$\langle b_\rho b_\tau \rangle_\sigma - \langle b_\rho \rangle_\sigma \langle b_\tau \rangle_\sigma = v_\sigma\, \delta_{\rho\tau},$

where $S, T, \rho, \tau, \sigma \in \{-1, +1\}$ and $\delta_{\cdot,\cdot}$ is the Kronecker delta. Hence, the density of $h_{\pm 1}$ and $b_{\pm 1}$ is given in terms of the model parameters $\lambda, p_{\pm 1}, v_{\pm 1}$ and the set of order parameters as defined above. Note that this holds for any distribution $P(\xi)$ with the same second order statistics as characterized above; thus it is not necessary for the analysis that the conditional densities of $\xi$ in Eqn. (1) are Gaussians.
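Written out as a small sketch (ours; it assumes the order parameters are stored as nested dictionaries `R[l][m]`, `Q[l][m]`), the conditional mean vector $\mu_\sigma$ and covariance matrix $C_\sigma$ of $x$ read:

```python
import numpy as np

def mean_and_cov(R, Q, lam, v_sigma, sigma):
    """Conditional mean mu_sigma and covariance C_sigma of
    x = (h_{+1}, h_{-1}, b_{+1}, b_{-1}) for an input from class sigma."""
    mu = np.array([lam * R[+1][sigma],
                   lam * R[-1][sigma],
                   lam * (1.0 if sigma == +1 else 0.0),
                   lam * (1.0 if sigma == -1 else 0.0)])
    C = v_sigma * np.array([
        [Q[+1][+1], Q[+1][-1], R[+1][+1], R[+1][-1]],
        [Q[+1][-1], Q[-1][-1], R[-1][+1], R[-1][-1]],
        [R[+1][+1], R[-1][+1], 1.0,       0.0],
        [R[+1][-1], R[-1][-1], 0.0,       1.0],
    ])
    return mu, C
```

These are exactly the quantities needed to evaluate the averages appearing in (6) and (7).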

Inserting the closed form expressions of these averages (cf. (A.1), (A.5), (A.6)) yields the final form of the system of ODEs for a given learning rule. In Eqn. (A.9) we present the final form of the ODEs for the LFM algorithm. For the other algorithms, we refer to [9].

The system of ODEs for a specific modulation function $f_l$ can explicitly be solved by (possibly numeric) integration. In the case of LVQ2.1 an exact analytic integration is possible. For given initial conditions $\{R_{ST}(0), Q_{ST}(0)\}$, learning rate $\eta$, and model parameters $\{p_{\pm 1}, \lambda, v_{\pm 1}\}$ one obtains the typical evolution of the characteristic quantities $\{R_{ST}(t), Q_{ST}(t)\}$. This forms the basis for an analysis of the performance and the convergence properties of LVQ-type algorithms in this study.
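As a concrete, purely illustrative realization of this procedure, the sketch below integrates the ODEs (6), (7) for LVQ1 with a simple Euler scheme; instead of the closed-form averages (A.5), (A.6) used in the paper, the conditional averages are estimated here by sampling $x$ from $\mathcal{N}(\mu_\sigma, C_\sigma)$. All function and parameter names are ours.

```python
import numpy as np

def lvq1_ode_euler(eta, lam, p, v, R, Q, t_max=50.0, dt=0.05, n_mc=10000, seed=0):
    """Euler integration of the ODEs (6), (7) for LVQ1.
    R, Q: 2x2 arrays indexed [l, m] with index 0 <-> label +1, 1 <-> label -1.
    p, v: dicts mapping the class labels +1/-1 to p_sigma and v_sigma."""
    rng = np.random.default_rng(seed)
    i = {+1: 0, -1: 1}
    for _ in range(int(t_max / dt)):
        dR, dQ = np.zeros((2, 2)), np.zeros((2, 2))
        for s in (+1, -1):
            # Gaussian statistics of x = (h_{+1}, h_{-1}, b_{+1}, b_{-1}) for class s
            mu = np.array([lam * R[0, i[s]], lam * R[1, i[s]],
                           lam * (s == +1), lam * (s == -1)])
            C = v[s] * np.array([[Q[0, 0], Q[0, 1], R[0, 0], R[0, 1]],
                                 [Q[0, 1], Q[1, 1], R[1, 0], R[1, 1]],
                                 [R[0, 0], R[1, 0], 1.0, 0.0],
                                 [R[0, 1], R[1, 1], 0.0, 1.0]])
            x = rng.multivariate_normal(mu, C, size=n_mc, check_valid="ignore")
            h, b = x[:, :2], x[:, 2:]
            # LVQ1 modulation f_l = l * s * Theta(d_{-l} - d_{+l})
            f = {l: l * s * ((2 * l * (h[:, 0] - h[:, 1])
                              + Q[i[-l], i[-l]] - Q[i[l], i[l]]) > 0)
                 for l in (+1, -1)}
            for l in (+1, -1):
                for m in (+1, -1):
                    dR[i[l], i[m]] += p[s] * eta * np.mean(
                        f[l] * (b[:, i[m]] - R[i[l], i[m]]))
                    dQ[i[l], i[m]] += p[s] * (eta * np.mean(
                        f[m] * (h[:, i[l]] - Q[i[l], i[m]])
                        + f[l] * (h[:, i[m]] - Q[i[l], i[m]]))
                        + eta ** 2 * v[s] * np.mean(f[l] * f[m]))
        R, Q = R + dt * dR, Q + dt * dQ
    return R, Q

# example: initialization and model parameters as used for the LVQ1 curve of Fig. 2
R0, Q0 = np.zeros((2, 2)), np.diag([1e-3, 2e-3])
R, Q = lvq1_ode_euler(eta=0.2, lam=1.0, p={+1: 0.8, -1: 0.2},
                      v={+1: 1.0, -1: 1.0}, R=R0, Q=Q0)
```

The resulting $R(t)$, $Q(t)$ can then be inserted into the generalization error formula given below.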

We will consider training of prototypes that are initialized as independent random vectors of squared length $\hat{Q}_{\pm 1}$ with no prior knowledge about the cluster positions. In terms of order parameters this implies $Q_{11}(0) = \hat{Q}_{1}$, $Q_{-1-1}(0) = \hat{Q}_{-1}$, $Q_{1-1}(0) = R_{S\tau}(0) = 0$ for all $S, \tau = \pm 1$, $\hat{Q}_{1} \neq \hat{Q}_{-1}$.

    Generalization ability

After training, the success of learning can be quantified in terms of the generalization error, i.e. the probability for misclassifying novel, random data which did not appear in the training sequence. The generalization error can be decomposed into the two contributions stemming from misclassified data from cluster $\sigma = +1$ and cluster $\sigma = -1$:

$\epsilon_g = p_{+1} \epsilon_{+1} + p_{-1} \epsilon_{-1} \quad \text{with} \quad \epsilon_\sigma = \left\langle \Theta\!\left( d_{+\sigma} - d_{-\sigma} \right) \right\rangle_\sigma. \qquad (8)$

Exploiting the central limit theorem in the same fashion as above, one can formulate the generalization error $\epsilon_g$ as an explicit function of the order parameters and data statistics (see appendix and [9]):

$\epsilon_\sigma = \Phi\!\left( \frac{Q_{\sigma\sigma} - Q_{-\sigma-\sigma} - 2\lambda \left( R_{\sigma\sigma} - R_{-\sigma\sigma} \right)}{2 \sqrt{v_\sigma}\, \sqrt{Q_{11} - 2 Q_{1-1} + Q_{-1-1}}} \right) \qquad (9)$

where $\Phi(z) = \int_{-\infty}^{z} dx\, \frac{e^{-x^2/2}}{\sqrt{2\pi}}$.
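A direct transcription of Eqs. (8), (9) into code (our sketch; `eps_g` and its argument conventions are illustrative):

```python
from math import erf, sqrt

def Phi(z):
    """Phi(z) = integral_{-inf}^{z} dx exp(-x^2/2) / sqrt(2*pi)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def eps_g(R, Q, lam, p, v):
    """Generalization error of Eqs. (8), (9); R, Q are dicts R[l][m], Q[l][m]."""
    total = 0.0
    for s in (+1, -1):
        num = Q[s][s] - Q[-s][-s] - 2.0 * lam * (R[s][s] - R[-s][s])
        den = 2.0 * sqrt(v[s]) * sqrt(Q[+1][+1] - 2.0 * Q[+1][-1] + Q[-1][-1])
        total += p[s] * Phi(num / den)
    return total
```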

By inserting $\{R_{ST}(t), Q_{ST}(t)\}$ we obtain the learning curve $\epsilon_g(t)$, i.e. the typical generalization error after on-line training with $tN$ random examples. Here, once more, we exploit the fact that the order parameters and, thus, also $\epsilon_g$ are self-averaging, non-fluctuating quantities in the thermodynamic limit $N \to \infty$. In order to verify the correctness of the aforementioned theoretical framework, we compare the solutions of the system of differential equations with Monte Carlo simulation results and find an excellent agreement already for $N \approx 100$ in the simulations. Fig. 1(a) and 1(b) show how the average result in simulations approaches the theoretical prediction and how the corresponding variance vanishes with increasing $N$.
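The finite-$N$ simulations can be sketched along the following lines (our own illustration; the zero-vector initialization is an arbitrary choice and the parameter values, taken from the caption of Fig. 1(a), are only an example setting, so the figure will not be reproduced exactly):

```python
import numpy as np

def lvq1_simulation(N, t_end, eta, lam, p_plus, v, seed):
    """On-line LVQ1 at finite N; returns R_11 = w_{+1} . B_{+1} after t_end * N examples."""
    rng = np.random.default_rng(seed)
    e1 = np.zeros(N); e1[0] = 1.0
    e2 = np.zeros(N); e2[1] = 1.0
    B = {+1: e1, -1: e2}
    w = {+1: np.zeros(N), -1: np.zeros(N)}            # zero-length initialization (ours)
    for _ in range(int(t_end * N)):
        s = +1 if rng.random() < p_plus else -1
        xi = lam * B[s] + np.sqrt(v[s]) * rng.standard_normal(N)
        d = {l: float(np.sum((xi - w[l]) ** 2)) for l in (+1, -1)}
        winner = +1 if d[+1] < d[-1] else -1
        w[winner] += (eta / N) * (winner * s) * (xi - w[winner])   # LVQ1 update
    return float(w[+1] @ B[+1])

# example setting of Fig. 1(a): v_{+1}=9, v_{-1}=16, lambda=3, p_{+1}=0.8, eta=0.1, t=10
runs = [lvq1_simulation(N=100, t_end=10, eta=0.1, lam=3.0, p_plus=0.8,
                        v={+1: 9.0, -1: 16.0}, seed=k) for k in range(10)]
print(np.mean(runs), np.var(runs))
```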

Fig. 1. (a: top left frame) Convergence of Monte Carlo results to the theoretical prediction for $N \to \infty$. The open ring at $1/N = 0$ marks the theoretical result for $R_{11}$ at $t = 10$; dots correspond to Monte Carlo results averaged over 100 independent runs. (b: top right frame) Self-averaging property: the variance of $R_{11}$ at $t = 10$ vanishes with increasing $N$ in Monte Carlo simulations. In both (a) and (b) the following parameter values are used as one example setting: $v_{+1} = 9$, $v_{-1} = 16$, $\lambda = 3$, $p_{+1} = 0.8$, $\eta = 0.1$. (c: bottom frame) Dependence of the generalization error for $t \to \infty$ for LVQ1 (with parameter values $\lambda = v_{+1} = v_{-1} = 1$, $p_{+1} = 0.5$) on the learning rate $\eta$.

Fig. 2. Evolution of the generalization error for the different LVQ algorithms (LVQ1, LVQ2.1, LFM, LVQ+). Parameters: $v_{+1} = v_{-1} = \lambda = 1$, $\eta = 0.2$, $p_{+1} = 0.8$. As the objective of VQ is to minimize the quantization error and not to achieve a good generalization ability, the evolution of $\epsilon_g$ for VQ is not shown in this figure.

For a stochastic gradient descent procedure like VQ, the expectation value of the associated cost function is

minimized in the simultaneous limits of $\eta \to 0$ and many examples, $\tilde{t} \equiv \eta t \to \infty$. In the absence of a cost function we can still consider the above limit, in which the system of ODEs simplifies and can be expressed in terms of the rescaled $\tilde{t}$ after neglecting terms $\propto \eta^2$. A fixed point analysis then yields a well defined asymptotic configuration, cf. [8]. The dependence of the asymptotic $\epsilon_g$ on the choice of the learning rate is illustrated for LVQ1 in Fig. 1(c).

Fig. 3. (a: left frame) Diverging order parameter in LVQ2.1. (b: right frame) Modality of the generalization error with respect to $t$ in the case of LVQ2.1. In both figures the results are for parameter values $v_{+1} = v_{-1} = \lambda = 1$, $\eta = 0.2$, $p_{+1} = 0.2$.

    5 Results - Performance of the LVQ algorithms

In Fig. 2 we illustrate the evolution of the generalization error in the course of training. Qualitatively all the algorithms show a similar evolution of the generalization error along the training process. The performance of the algorithms is evaluated in terms of stability and generalization error. To quantify the generalization ability of the algorithms we define the following performance measure:

$PM = \frac{\int_{p_{+1}=0}^{1} \left( \epsilon_{g,p_{+1},lvq} - \epsilon_{g,p_{+1},bld} \right)^2 dp_{+1}}{\int_{p_{+1}=0}^{1} \epsilon_{g,p_{+1},bld}^2\, dp_{+1}} \qquad (10)$

where $\epsilon_{g,p_{+1},lvq}$ and $\epsilon_{g,p_{+1},bld}$ are the generalization errors that can be achieved by a given LVQ algorithm (except for LVQ2.1 this corresponds to the $t \to \infty$ asymptotic $\epsilon_g$) and by the best linear decision rule, respectively, for a given class prior probability $p_{+1}$. Unless otherwise specified, the generalization error of an optimal linear decision rule is depicted as a dotted line in the figures and the solid line represents the performance of the corresponding LVQ algorithm. We evaluate the performance of the algorithms in terms of the asymptotic generalization error for two example parameter settings: (i) equal class variances ($v_{+1} = v_{-1}$): $v_{+1} = v_{-1} = 1$, $\lambda = 1$; (ii) unequal class variances ($v_{+1} \neq v_{-1}$): $v_{+1} = 0.25$, $v_{-1} = 0.81$, $\lambda = 1$. From here onwards, unless otherwise mentioned, the results illustrated in all figures representing the performances of the algorithms in terms of $\epsilon_g$ are obtained with these parameter values and the following initialization of $w_{\pm 1}$: $R_{11}(0) = R_{1-1}(0) = R_{-11}(0) = R_{-1-1}(0) = 0$, $Q_{11}(0) = 0.001$, $Q_{1-1}(0) = 0$, $Q_{-1-1}(0) = 0.002$.
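Numerically, $PM$ can be evaluated from the two curves sampled on a grid of $p_{+1}$ values, e.g. with a trapezoidal rule (our sketch; the argument names are illustrative):

```python
import numpy as np

def performance_measure(p1_grid, eps_lvq, eps_bld):
    """PM of Eq. (10) from generalization-error curves sampled over p_{+1} in [0, 1]."""
    def trapz(y, x):                       # simple trapezoidal rule
        return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))
    return trapz((eps_lvq - eps_bld) ** 2, p1_grid) / trapz(eps_bld ** 2, p1_grid)
```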

    LVQ2.1

In Fig. 3(a) we illustrate the divergent behavior of LVQ2.1. If the prior probabilities are skewed, the prototype corresponding to the class with lower probability diverges during the learning process and this results in a trivial classification with $\epsilon_{g,p_{+1},lvq2.1} = \min(p_{+1}, p_{-1})$. Note that in the singular case $p_{+1} = p_{-1}$ the behavior of the differential equations differs from the generic case and LVQ2.1 yields prototypes which are symmetric about $\lambda (B_{+1} + B_{-1})/2$. Hence the performance


Fig. 4. Performance of LVQ2.1: (a: top left frame) Asymptotic behavior for $v_{+1} = v_{-1}$ ($PM = 0.6474$); note that for $p_{+1} = 0.5$ the performance is optimal, in all other cases $\epsilon_g = \min(p_{+1}, p_{-1})$. To highlight this different characteristic of LVQ2.1 compared to the other algorithms studied here, the asymptotic generalization error is shown as a dot plot instead of the solid line used for the other algorithms. (b: top right frame) LVQ2.1 with stopping criterion for $v_{+1} = v_{-1}$ ($PM = 0.0524$); the performance is near optimal. (c: bottom frame) LVQ2.1 with stopping criterion when $v_{+1} \neq v_{-1}$ ($PM = 0.1233$). The performance measure $PM$ as given here and in the following figures is defined in (10).

is optimal in the case of equal priors (Fig. 4(a)). In high dimensions, this divergent behavior can also be observed if a window rule of the original formulation [21] is used [9]; thus this heuristic does not prevent the instability of the algorithm. Alternative modifications will be the subject of further work. As the most important objective of a classification algorithm is to achieve minimal generalization error, one way to deal with this divergent behavior of LVQ2.1 is to stop at a point when the generalization error is minimal, e.g. as measured on a validation set. In Fig. 3(b) we see that, typically, the generalization error has a modality with respect to $t$, hence an optimal stopping point exists. In Fig. 4 we illustrate the performance of LVQ2.1. Fig. 4(a) shows the poor asymptotic behavior; only for equal priors does it achieve optimal performance. However, as depicted in Fig. 4(b), an idealized early stopping method as described above can indeed give near optimal behavior for the equal class variance case. The performance is worse when we deal with unequal class variances (Fig. 4(c)). It is important to note that the existence of the minimum in $\epsilon_g(t)$ and its location and depth depend on the precise initialization of the vectors $w_{\pm 1}(0)$, see Fig. 11. This initialization issue is discussed in detail later.

    LVQ1

Fig. 5. Performance of LVQ1: (a: top left frame) Dynamics of the order parameters for model parameters $v_{+1} = 4$, $v_{-1} = 9$, $\lambda = 2$, $p_{+1} = 0.8$ and $\eta = 1.8$. (b: top right frame) Generalization with $v_{+1} = v_{-1}$ ($PM = 0.0305$); the performance is near optimal. (c: bottom frame) Generalization for $v_{+1} \neq v_{-1}$ ($PM = 0.0828$); the performance is worse than in the case $v_{+1} = v_{-1}$.

Fig. 5(a) shows the convergent behavior of LVQ1. The asymptotic generalization error as achieved by LVQ1 is typically quite close to the potential optimum $\epsilon_{g,p_{+1},bld}$. Fig. 5(b) and (c) display the asymptotic generalization

error as a function of the prior $p_{+1}$ in two different settings of the model. In the left panel $v_{+1} = v_{-1}$, whereas the right shows an example with different cluster variances. In the completely symmetric situation with equal variances and balanced priors, $p_{+1} = p_{-1}$, the LVQ1 result coincides with the best linear decision boundary, which is through $\lambda (B_{+1} + B_{-1})/2$ for this setting. Whenever the cluster variances are different, the symmetry about $p_{+1} = 1/2$ is lost, but the performance is optimal for one particular $(v_{+1}, v_{-1})$-dependent value of $p_{+1} \in (0, 1)$. Unlike LVQ2.1, the good performance of LVQ1 is invariant to the initialization of the prototype vectors $w_{\pm 1}$, see Fig. 11.

    LFM

The dynamics of the LFM algorithm is shown in Fig. 6(a). We see that its performance is far from optimal in both the equal (Fig. 6(b)) and unequal class variance (Fig. 6(c)) cases. Hence, though LFM converges to a stable configuration of the prototypes, it fails to give a near optimal performance in terms of the asymptotic generalization error.

Further interesting properties which can be detected within this theoretical analysis of the LFM algorithm are as follows: (i) The asymptotic generalization error is independent of the learning rate $\eta$. It merely controls the magnitude of the fluctuations orthogonal to $\{B_{+1}, B_{-1}\}$ and the asymptotic distance of the prototypes from the decision boundary. (ii) $p_{+1} \epsilon_{+1} = p_{-1} \epsilon_{-1}$, cf. Eqn. (8). That means the two contributions to the total $\epsilon_g$, Eqs. (8), (9), become equal for $t \to \infty$. As a consequence, LFM updates are based on balanced data, asymptotically, as they are restricted to misclassified examples.


Fig. 6. Performance of LFM: (a: top left frame) Convergence for the following parameter values: $v_{+1} = v_{-1} = \lambda = 1$, $p_{+1} = 0.8$ and $\eta = 3$. (b: top right frame) Generalization with $v_{+1} = v_{-1}$ ($PM = 0.3994$). (c: bottom frame) Generalization with $v_{+1} \neq v_{-1}$ ($PM = 0.5033$). In both cases the performance of LFM is far from optimal.

Note that we consider only the crisp LFM procedure here. It is well possible that soft realizations of RSLVQ, as discussed in [20,21], yield significantly better performance.

Fig. 7. Performance of VQ: (a: top left frame) Evolution of the order parameters for $v_{+1} = v_{-1} = \lambda = 1$, $p_{+1} = 0.8$ and $\eta = 0.5$. (b: top right frame) Generalization error when $v_{+1} = v_{-1}$ ($PM = 1.2357$). (c: bottom frame) Generalization for $v_{+1} \neq v_{-1}$ ($PM = 1.6498$).

    VQ

In Fig. 7(a) we see the evolution of the order parameters in the course of training for VQ. The dynamics of VQ has been studied in detail in [8] for the case of equal class variances ($v_{+1} = v_{-1}$) and equal priors ($p_{+1} = p_{-1}$). Though the objective of VQ is to minimize the quantization error and not to achieve a good generalization ability, we can nevertheless compute the asymptotic $\epsilon_g$ from the prototype configuration. In Fig. 7 we illustrate the asymptotic generalization error for the VQ algorithm and note the following interesting facts: In unsupervised VQ a strong prevalence, e.g. $p_{+1} \approx 1$, will be accounted for by placing both vectors inside the stronger cluster, thus achieving a low quantization error. Obviously this yields a poor classification, as indicated by the asymptotic value $\epsilon_g = 1/2$ in the limiting cases $p_{+1} = 0$ or $1$. In the equal class variance case for $p_{+1} = 1/2$ the aim of representation happens to coincide with good generalization and $\epsilon_g$ becomes optimal, Fig. 7(b). In Fig. 7(c) we see that in the case of unequal class variances there exists a prior probability $p_{+1}$ for which the generalization error of VQ is identical to that of the best linear decision surface.

Fig. 8. Performance of LVQ+: (a: top left frame) Convergence with parameter values $v_{+1} = 9$, $v_{-1} = 16$, $\lambda = 3$, $p_{+1} = 0.8$ and $\eta = 0.5$. (b: top right frame) Generalization with $v_{+1} = v_{-1}$ ($PM = 0.6068$); the performance is independent of the class prior probabilities (minmax). (c: bottom frame) Generalization with unequal variances $v_{+1} \neq v_{-1}$ ($PM = 0.7573$).

Fig. 9. Comparison of the asymptotic performances of the algorithms; for the LVQ2.1 algorithm the performance with the stopping criterion is shown. (a: left frame) For $v_{+1} = v_{-1}$ LVQ1 outperforms all other algorithms. (b: right frame) When $v_{+1} \neq v_{-1}$ the absolute supremacy of LVQ1 is lost: there exists a range of values of $p_{+1}$ for which LVQ2.1 with the stopping criterion outperforms LVQ1. However, this performance of LVQ2.1 with stopping is extremely sensitive to the initialization of the prototypes.


    LVQ+

In Fig. 8(a) we illustrate the convergence of the LVQ+ algorithm. LVQ+ updates each $w_S$ only with data from class $S$. As a consequence, the asymptotic positions of the $w_{\pm 1}$ are always symmetric about the geometrical center $\lambda (B_{+1} + B_{-1})/2$, and the asymptotic $\epsilon_g$ is independent of the priors $p_{\pm 1}$. Thus, in the equal class variance case (Fig. 8(b)) LVQ+ is robust with respect to variations of $p_{\pm 1}$ after training, i.e. it is optimal in the sense of the minmax criterion $\sup_{p_{+1}} \epsilon_g(t \to \infty)$ [5]. However, in the unequal class variance case this minmax property is not observed (Fig. 8(c)). Nevertheless, the resulting asymptotic $\epsilon_g$ depends linearly on $p_{+1}$ and is tangent to $\epsilon_g^{bld}$.

    Comparison of the performance of the algorithms

To facilitate a better understanding, we compare the performances of the algorithms in Fig. 9, where we present the asymptotic performance of the three relevant LVQ algorithms: LVQ1, LVQ2.1 with idealized stopping criterion, and LFM. As the performances of LVQ+ and VQ are qualitatively entirely different (Figs. 8, 7) from the other algorithms, these two algorithms are not discussed in this comparison.

In Fig. 9(a) we see that LVQ1 outperforms the other algorithms for equal class variance, and LVQ2.1 with the early stopping criterion yields results which are only slightly worse. However, the superiority of LVQ1 is partly lost in the case of unequal class variance (see Fig. 9(b)), where an interval for $p_{+1}$ exists in which the performance of LVQ2.1 with stopping criterion is better than that of LVQ1. However, if we compare the overall performance for $v_{+1} \neq v_{-1}$ through the performance measure $PM$ defined in (10), then LVQ1 is found to be the best algorithm among these LVQ variants.

The good performance of the LVQ1 algorithm can be further investigated through a geometrical analysis of relevant quantities. Fig. 10 displays the trajectories of the prototypes projected onto the plane spanned by $B_{+1}$ and $B_{-1}$. Note that, as can be expected from symmetry arguments, the $(t \to \infty)$-asymptotic projections of the prototypes into the $B_{\pm 1}$-plane are along the axis connecting the cluster centers. Moreover, in the limit $\eta \to 0$, their asymptotic positions lie exactly in the plane and fluctuations orthogonal to $B_{\pm 1}$ vanish. This is signaled by the fact that the order parameters for $t \to \infty$ satisfy $Q_{SS} = R_{S1}^2 + R_{S-1}^2$ and $Q_{1-1} = R_{11} R_{-11} + R_{1-1} R_{-1-1}$, which implies

$w_S(t \to \infty) = R_{S1} B_{+1} + R_{S-1} B_{-1} \quad \text{for } S = \pm 1. \qquad (11)$

Fig. 10. LVQ1 for $\lambda = 1.2$, $v_{+1} = v_{-1} = 1$, and $p_{+1} = 0.8$. Trajectories of the prototypes in the limit $\eta \to 0$, $t \to \infty$. Solid lines correspond to the projections of the prototypes into the plane spanned by $B_{+1}$ and $B_{-1}$ (marked by open circles). The dots correspond to the pairs of values $\{R_{S1}, R_{S-1}\}$ observed at $\eta t = \tilde{t} = 2, 4, 6, 8, 10, 12, 14$ in Monte Carlo simulations with $\eta = 0.01$ and $N = 200$, averaged over 100 runs. Note that, because $p_{+1} > p_{-1}$, $w_{+1}$ approaches its final position much faster and in fact overshoots. The inset displays a close-up of the region around its stationary location. The short solid line marks the asymptotic decision boundary as parameterized by the prototypes, the short dashed line marks the best linear decision boundary. The latter is very close to $\lambda B_{-1}$ for the pronounced dominance of the $\sigma = +1$ cluster with $p_{+1} = 0.8$.

Hence we can conclude that the actual prototype vectors asymptotically approach the above unique configuration. Note that, in general, stationarity of the order parameters does not necessarily imply that $w_{\pm 1}$ converge to points in the $N$-dimensional space. For LVQ1 with

$\eta > 0$, for instance, fluctuations in the space orthogonal to $\{B_{+1}, B_{-1}\}$ persist even for constant $\{R_{ST}, Q_{ST}\}$.

Fig. 10 reveals further information about the learning process. When learning from unbalanced data, e.g. $p_{+1} > p_{-1}$ as in the example, the prototype representing the stronger cluster will be updated more frequently and in fact overshoots, resulting in a non-monotonic behavior of $\epsilon_g$ (Fig. 2).

In Fig. 9 we see that the performances of LVQ1 and LVQ2.1 with stopping criterion are comparable. As the distribution of the training data is unknown to the algorithms, the performance of the algorithms should also be judged in terms of robustness to the initialization of the prototype vectors $w_{\pm 1}$. In Fig. 11 we illustrate this robustness of the algorithms LVQ1 and LVQ2.1 with stopping by considering an initialization different from the usual $R_{11}(0) = R_{1-1}(0) = R_{-11}(0) = R_{-1-1}(0) = 0$, $Q_{11}(0) = 0.001$, $Q_{1-1}(0) = 0$, $Q_{-1-1}(0) = 0.002$, as specified in the caption. We find that, though there are variations in the learning curves ($\epsilon_g$ vs $t$), the asymptotic performance ($\epsilon_g$ for $t \to \infty$) of LVQ1 is robust to initialization. However, the performance of LVQ2.1 with the stopping criterion is extremely sensitive to initialization; for a good performance it is required that the initial decision boundary is already close to the optimal one. Since no density estimation is performed prior to the training procedure, such an ideal initialization of $w_{\pm 1}$ cannot be assured in general.


Fig. 11. Left panel: The trajectories of the prototypes in the plane spanned by $\{B_{\pm 1}\}$ corresponding to the following initialization: $R_{11}(0) = R_{1-1}(0) = 0$, $R_{-11}(0) = 0.5$, $R_{-1-1}(0) = 0.7$, $Q_{11}(0) = 1.8$, $Q_{1-1}(0) = 0$, $Q_{-1-1}(0) = 2.9$. Comparing with Fig. 10 we see that, though the learning curves are different due to the different trajectories, the asymptotic generalization errors are the same since the final configuration is invariant to the initialization of $w_{\pm 1}$. The other parameter values are the same as in Fig. 10. Right panel: The performance of LVQ2.1 with stopping criterion heavily depends on the initialization; the solid line corresponds to the initialization $R_{11}(0) = R_{1-1}(0) = R_{-11}(0) = R_{-1-1}(0) = 0$, $Q_{11}(0) = 0.001$, $Q_{1-1}(0) = 0$, $Q_{-1-1}(0) = 0.002$, whereas the dashed line is for $R_{11}(0) = R_{1-1}(0) = 0$, $R_{-11}(0) = 0.5$, $R_{-1-1}(0) = 0.7$, $Q_{11}(0) = 1.8$, $Q_{1-1}(0) = 0$, $Q_{-1-1}(0) = 2.9$. The dotted line corresponds to $\epsilon_{g,p_{+1},bld}$. The results are for $v_{+1} = v_{-1}$.

Fig. 12. The $\epsilon_g$ vs $p_{+1}$ curves for LVQ1, VQ and LVQ+ touch that of the best linear decision surface at the same prior probability $p_{+1}$ (in the neighborhood of 0.65 for the parameter values used) for the unequal class variance case, $v_{+1} \neq v_{-1}$.

Another interesting aspect of the LVQ1, VQ and LVQ+ algorithms is highlighted in Fig. 12. For unequal class variances, these three algorithms give optimal performance for the same value of $p_{+1}$, viz. in the neighborhood of 0.65 for the model parameters used here.

    6 Summary and Conclusions

We have investigated different variants of LVQ type algorithms in an exact mathematical way by means of the theory of on-line learning. For $N \to \infty$, using concepts from statistical physics, the system dynamics can be described by a few characteristic quantities, and the learning curves can be evaluated exactly, also for heuristically motivated learning algorithms where a global cost function is lacking, as for standard LVQ1, or where a cost function has only been proposed for a soft version, as for LVQ2.1 and LFM.

Surprisingly, fundamentally different limiting solutions are observed for the algorithms LVQ1, LVQ2.1, LFM, and LVQ+, although their learning rules are quite similar. The behavior of LVQ2.1 is unstable and modifications such as a stopping rule become necessary. The generalization ability of the algorithms differs in particular for unbalanced class distributions. Even more involved properties are revealed when the class variances differ. It is remarkable that the basic LVQ1 algorithm shows near optimal generalization error for all choices of the prior distribution in the equal class variance case. LVQ2.1 with stopping criterion also performs close to optimal for equal class variances. In the unequal class variance case, LVQ2.1 with stopping outperforms the other algorithms for a range of $p_{+1}$ and appropriate initial conditions.

However, the good performance of LVQ2.1 with the idealized stopping criterion is highly sensitive to the initialization of the prototype vectors. The performance degrades heavily if the prototypes are not initialized in such a way that the initial decision surface is close to the optimal one. Since the density is unknown prior to training, the positioning of the cluster centers is unknown in a practical scenario, and hence the aforementioned ideal initialization cannot be assured. The asymptotic performance of LVQ1, on the other hand, does not depend on the initialization, though it yields learning curves ($\epsilon_g$ vs $t$) which are different for different initializations. This partially mirrors well-known effects of LVQ1 in settings where a data cluster from a different class can act as a barrier, slowing down the convergence significantly.

Another interesting finding from this theoretical analysis is that in the equal class variance case LVQ+ achieves a minmax characteristic; however, this special property is lost in the unequal class variance case.

The theoretical framework proposed in this article will be used to study further characteristics of the dynamics such as fixed points, the asymptotic positioning of the prototypes, etc. The main goal of the research presented in this article is to provide a deterministic description of the stochastic evolution of the learning process in an exact mathematical way for interesting learning rules and in relevant (though simple) situations, which will be helpful in constructing efficient (in the Bayesian sense) LVQ algorithms.

    References

[1] M. Biehl and N. Caticha. The statistical mechanics of on-line learning and generalization. In M.A. Arbib, The Handbook of Brain Theory and Neural Networks, MIT Press, 2003.

[2] M. Biehl, A. Ghosh, and B. Hammer. Learning vector quantization: the dynamics of winner-takes-all algorithms. Neurocomputing, in press.


[3] M. Cottrell, J. C. Fort, and G. Pages. Theoretical aspects of the S.O.M algorithm, survey. Neurocomputing, 21:119-138, 1998.

[4] K. Crammer, R. Gilad-Bachrach, A. Navot, and A. Tishby. Margin analysis of the LVQ algorithm. In NIPS, 2002.

[5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2nd edition, New York: Wiley, 2000.

[6] A. Engel and C. van den Broeck, editors. The Statistical Mechanics of Learning. Cambridge University Press, 2001.

[7] J. C. Fort and G. Pages. Convergence of stochastic algorithms: from the Kushner & Clark theorem to the Lyapounov functional. Advances in Applied Probability, 28:1072-1094, 1996.

[8] A. Freking, G. Reents, and M. Biehl. The dynamics of competitive learning. Europhysics Letters, 38:73-78, 1996.

[9] A. Ghosh, M. Biehl, A. Freking, and G. Reents. A theoretical framework for analysing the dynamics of LVQ: A statistical physics approach. Technical Report 2004-9-02, Mathematics and Computing Science, University of Groningen, P.O. Box 800, 9700 AV Groningen, The Netherlands, December 2004, available from www.cs.rug.nl/~biehl.

[10] B. Hammer, M. Strickert, and T. Villmann. On the generalization ability of GRLVQ networks. Neural Processing Letters, 21(2):109-120, 2005.

[11] B. Hammer, M. Strickert, and Th. Villmann. Supervised neural gas with general similarity measure. Neural Processing Letters, 21(1):21-44, 2005.

[12] B. Hammer and T. Villmann. Generalized relevance learning vector quantization. Neural Networks, 15:1059-1068, 2002.

[13] T. Kohonen. Improved versions of learning vector quantization. IJCNN, International Joint Conference on Neural Networks, Vol. 1, pages 545-550, 1990.

[14] T. Kohonen. Self-organizing maps. Springer, Berlin, 1995.

[15] E. McDermott and S. Katagiri. Prototype-based minimum classification error/generalized probabilistic descent training for various speech units. Computer Speech and Language, Vol. 8, No. 4, pages 351-368, 1994.

[16] Neural Networks Research Centre. Bibliography on the self-organizing maps (SOM) and learning vector quantization (LVQ). Otaniemi: Helsinki University of Technology. Available on-line: http://liinwww.ira.uka.de/bibliography/Neural/SOM.LVQ.html.

[17] G. Reents and R. Urbanczik. Self-averaging and on-line learning. Physical Review Letters, Vol. 80, No. 24, pages 5445-5448, 1998.

[18] D. Saad, editor. Online learning in neural networks. Cambridge University Press, 1998.

[19] A. S. Sato and K. Yamada. Generalized learning vector quantization. In G. Tesauro, D. Touretzky and T. Leen, editors, Advances in Neural Information Processing Systems, Vol. 7, pages 423-429, 1995.

[20] S. Seo, M. Bode, and K. Obermayer. Soft nearest prototype classification. IEEE Transactions on Neural Networks, 14(2), pages 390-398, 2003.

[21] S. Seo and K. Obermayer. Soft learning vector quantization. Neural Computation, 15, pages 1589-1604, 2003.

    A The Mathematical Treatment

Here we outline some key mathematical results used in the analysis. The formalism was first used in the context of unsupervised vector quantization [8], and the calculations were recently detailed in a Technical Report [9].

Throughout this appendix the indices $l, m, k, s, \sigma \in \{\pm 1\}$ represent the class labels or cluster memberships. We furthermore use the following shorthands: (i) $\Theta_S = \Theta(d_{-S} - d_{+S})$ for LVQ1, LVQ+ and VQ, and (ii) $\Theta_\sigma = \Theta(d_{+\sigma} - d_{-\sigma})$ for LFM. For convenience the three winner-takes-all algorithms, LVQ1, LVQ+ and VQ, are collectively called the WTA algorithms.

A1. Averages: In order to obtain the final form of the ODEs for a given modulation function, averages over the joint density $P(h_{+1}, h_{-1}, b_{+1}, b_{-1})$ are performed.

LVQ2.1: The elementary averages involved in the system of ODEs for LVQ2.1 can be expressed in closed form as follows:

$\langle \sigma\, b_m \rangle = \sum_{\sigma=\pm 1} p_\sigma\, \sigma\, \lambda\, \delta_{m,\sigma}, \qquad \langle \sigma\, h_m \rangle = \sum_{\sigma=\pm 1} p_\sigma\, \sigma\, \lambda\, R_{m\sigma}, \qquad \langle \sigma \rangle = \sum_{\sigma=\pm 1} p_\sigma\, \sigma. \qquad (A.1)$

Other Algorithms: After inserting the corresponding modulation function $f_l$ of LVQ1, LVQ+, VQ and LFM in the system of ODEs presented in Eqs. (6) and (7) we encounter Heaviside functions of the following generic form:

$\Theta_S = \Theta(\alpha_S \cdot x - \beta_S). \qquad (A.2)$

In the case of the WTA algorithms (LVQ1, LVQ+, VQ): $\Theta_S = \Theta(d_{-S} - d_{+S}) = \Theta(\alpha_S \cdot x - \beta_S)$ with

$\alpha_S = (+2S, -2S, 0, 0) \quad \text{and} \quad \beta_S = Q_{+S,+S} - Q_{-S,-S}, \qquad (A.3)$

and for LFM: $\Theta_\sigma = \Theta(d_{+\sigma} - d_{-\sigma}) = \Theta(\alpha_\sigma \cdot x - \beta_\sigma)$ with

$\alpha_\sigma = (-2\sigma, +2\sigma, 0, 0) \quad \text{and} \quad \beta_\sigma = Q_{-\sigma,-\sigma} - Q_{+\sigma,+\sigma}. \qquad (A.4)$

After plugging in the modulation function $f_l$, performing the averages in Eqs. (6), (7) for the LFM, LVQ1, VQ and LVQ+ algorithms involves conditional means of the form $\langle (x)_n\, \Theta_{sk} \rangle_k$ and $\langle \Theta_{sk} \rangle_k$, where $(x)_n$ is the $n$-th component of $x = (h_{+1}, h_{-1}, b_{+1}, b_{-1})$.


The above mentioned averages can be expressed in closed form in the following way [9]:

$\langle (x)_n\, \Theta_{sk} \rangle_k = \frac{(C_k \alpha_s)_n}{\sqrt{2\pi\, \hat{\alpha}_{sk}}} \exp\!\left( -\frac{1}{2}\, \frac{\tilde{\alpha}_{sk}^2}{\hat{\alpha}_{sk}} \right) + (\mu_k)_n\, \Phi\!\left( \frac{\tilde{\alpha}_{sk}}{\sqrt{\hat{\alpha}_{sk}}} \right), \qquad (A.5)$

$\langle \Theta_{sk} \rangle_k = \Phi\!\left( \frac{\tilde{\alpha}_{sk}}{\sqrt{\hat{\alpha}_{sk}}} \right), \qquad (A.6)$

where

$\hat{\alpha}_{sk} = \left( C_k^{1/2} \alpha_s \right)^2 = \alpha_s \cdot C_k\, \alpha_s \quad \text{and} \quad \tilde{\alpha}_{sk} = \alpha_s \cdot \mu_k - \beta_s, \qquad (A.7)$

$\Phi(x) = \int_{-\infty}^{x} dz\, \frac{e^{-z^2/2}}{\sqrt{2\pi}}. \qquad (A.8)$
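The closed forms (A.5), (A.6) are standard Gaussian averages; the following sketch (ours, not from the paper) evaluates them for arbitrary $\mu_k$, $C_k$, $\alpha_s$, $\beta_s$ and checks them against direct sampling:

```python
import numpy as np
from math import erf, sqrt, pi, exp

def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def theta_averages(mu, C, alpha, beta):
    """Closed forms (A.5), (A.6) for <Theta_sk>_k and <(x)_n Theta_sk>_k,
    with Theta_sk = Theta(alpha . x - beta) and x ~ N(mu, C)."""
    a_hat = float(alpha @ C @ alpha)                      # (A.7)
    a_tilde = float(alpha @ mu) - beta                    # (A.7)
    avg_theta = Phi(a_tilde / sqrt(a_hat))                # (A.6)
    avg_x_theta = ((C @ alpha) / sqrt(2.0 * pi * a_hat)   # (A.5), one entry per n
                   * exp(-0.5 * a_tilde ** 2 / a_hat) + mu * avg_theta)
    return avg_theta, avg_x_theta

# Monte Carlo cross-check for an arbitrary (positive definite) example
rng = np.random.default_rng(1)
mu = np.array([0.3, -0.2, 0.5, 0.0])
A = rng.standard_normal((4, 4)); C = A @ A.T + np.eye(4)
alpha, beta = np.array([2.0, -2.0, 0.0, 0.0]), 0.4
x = rng.multivariate_normal(mu, C, size=200000)
ind = (x @ alpha - beta) > 0
print(theta_averages(mu, C, alpha, beta)[0], ind.mean())
print(theta_averages(mu, C, alpha, beta)[1], (x * ind[:, None]).mean(axis=0))
```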

Using these averages, the final form of the system of differential equations corresponding to the different algorithms is obtained [9].

For brevity we give an example of such a final form for the LFM algorithm only and refer to [9] for the other algorithms:

$\frac{dR_{lm}}{dt} = \eta\, l \left[ \sum_{\sigma=\pm 1} p_\sigma\, \sigma \left( \frac{(C_\sigma \alpha_\sigma)_{n_{b_m}}}{\sqrt{2\pi\, \hat{\alpha}_{\sigma,\sigma}}} \exp\!\left( -\frac{1}{2}\, \frac{\tilde{\alpha}_{\sigma,\sigma}^2}{\hat{\alpha}_{\sigma,\sigma}} \right) + (\mu_\sigma)_{n_{b_m}}\, \Phi\!\left( \frac{\tilde{\alpha}_{\sigma,\sigma}}{\sqrt{\hat{\alpha}_{\sigma,\sigma}}} \right) \right) - \sum_{\sigma=\pm 1} p_\sigma\, \sigma\, \Phi\!\left( \frac{\tilde{\alpha}_{\sigma,\sigma}}{\sqrt{\hat{\alpha}_{\sigma,\sigma}}} \right) R_{lm} \right]$

$\frac{dQ_{lm}}{dt} = \eta\, l \left[ \sum_{\sigma=\pm 1} p_\sigma\, \sigma \left( \frac{(C_\sigma \alpha_\sigma)_{n_{h_m}}}{\sqrt{2\pi\, \hat{\alpha}_{\sigma,\sigma}}} \exp\!\left( -\frac{1}{2}\, \frac{\tilde{\alpha}_{\sigma,\sigma}^2}{\hat{\alpha}_{\sigma,\sigma}} \right) + (\mu_\sigma)_{n_{h_m}}\, \Phi\!\left( \frac{\tilde{\alpha}_{\sigma,\sigma}}{\sqrt{\hat{\alpha}_{\sigma,\sigma}}} \right) \right) - \sum_{\sigma=\pm 1} p_\sigma\, \sigma\, \Phi\!\left( \frac{\tilde{\alpha}_{\sigma,\sigma}}{\sqrt{\hat{\alpha}_{\sigma,\sigma}}} \right) Q_{lm} \right]$
$\qquad + \eta\, m \left[ \sum_{\sigma=\pm 1} p_\sigma\, \sigma \left( \frac{(C_\sigma \alpha_\sigma)_{n_{h_l}}}{\sqrt{2\pi\, \hat{\alpha}_{\sigma,\sigma}}} \exp\!\left( -\frac{1}{2}\, \frac{\tilde{\alpha}_{\sigma,\sigma}^2}{\hat{\alpha}_{\sigma,\sigma}} \right) + (\mu_\sigma)_{n_{h_l}}\, \Phi\!\left( \frac{\tilde{\alpha}_{\sigma,\sigma}}{\sqrt{\hat{\alpha}_{\sigma,\sigma}}} \right) \right) - \sum_{\sigma=\pm 1} p_\sigma\, \sigma\, \Phi\!\left( \frac{\tilde{\alpha}_{\sigma,\sigma}}{\sqrt{\hat{\alpha}_{\sigma,\sigma}}} \right) Q_{lm} \right]$
$\qquad + \eta^2\, l\, m \sum_{\sigma=\pm 1} v_\sigma\, p_\sigma\, \Phi\!\left( \frac{\tilde{\alpha}_{\sigma,\sigma}}{\sqrt{\hat{\alpha}_{\sigma,\sigma}}} \right). \qquad (A.9)$

Here, again, we have to insert $\tilde{\alpha}_{\sigma,\sigma}$ and $\hat{\alpha}_{\sigma,\sigma}$ as defined in (A.7) with $\alpha_\sigma = (-2\sigma, +2\sigma, 0, 0)$ and $\beta_\sigma = Q_{-\sigma,-\sigma} - Q_{+\sigma,+\sigma}$. Also,

$n_{h_m} = \begin{cases} 1 & \text{if } m = +1 \\ 2 & \text{if } m = -1 \end{cases} \qquad \text{and} \qquad n_{b_m} = \begin{cases} 3 & \text{if } m = +1 \\ 4 & \text{if } m = -1. \end{cases}$

A2. The Generalization Error: Using (A.6) we can directly compute the generalization error as follows:

$\epsilon_g = \sum_{k=\pm 1} p_k \langle \Theta_{-k} \rangle_k = \sum_{k=\pm 1} p_k\, \Phi\!\left( \frac{\tilde{\alpha}_{-k,k}}{\sqrt{\hat{\alpha}_{-k,k}}} \right),$

which yields Eqs. (8), (9) in the text after inserting $\tilde{\alpha}_{sk}$ and $\hat{\alpha}_{sk}$ as given in (A.7) with $\alpha_S = (+2S, -2S, 0, 0)$ and $\beta_S = Q_{+S,+S} - Q_{-S,-S}$.
