
Radial-Basis Function Networks



Michel Verleysen Radial-Basis Function Networks - 1

Radial-Basis Function Networks

November 2002

Michel Verleysen Radial-Basis Function Networks - 2

Radial-Basis Function Networks

• Origin: Cover's theorem
• Interpolation problem
• Regularization theory
• Generalized RBFN
  • Universal approximation
  • RBFN ≈ kernel regression
• Learning
  • Centers → vector quantization
  • Widths
  • Multiplying factors
  • Other forms
• Vector quantization


Michel Verleysen Radial-Basis Function Networks - 3

Radial-Basis Function Networks

• Origin: Cover's theorem
• Interpolation problem
• Regularization theory
• Generalized RBFN
  • Universal approximation
  • RBFN ≈ kernel regression
• Learning
  • Centers → vector quantization
  • Widths
  • Multiplying factors
  • Other forms
• Vector quantization

Michel Verleysen Radial-Basis Function Networks - 4

Origin: Cover's theorem

• Cover's theorem on the separability of patterns (1965)
• x_1, x_2, …, x_P assigned to two classes C1 and C2
• ϕ-separability:

$$\exists\,\mathbf{w} \;:\; \begin{cases} \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}) > 0 & \text{if } \mathbf{x} \in C_1 \\ \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}) < 0 & \text{if } \mathbf{x} \in C_2 \end{cases}$$

• Cover's theorem:
  • non-linear functions ϕ(x)
  • dimension of the hidden space > dimension of the input space
  → probability of separability closer to 1
• Example: [figure: a linear versus a quadratic separation boundary]


Michel Verleysen Radial-Basis Function Networks - 5

Radial-Basis Function Networks

• Origin: Cover's theorem
• Interpolation problem
• Regularization theory
• Generalized RBFN
  • Universal approximation
  • RBFN ≈ kernel regression
• Learning
  • Centers → vector quantization
  • Widths
  • Multiplying factors
  • Other forms
• Vector quantization

Michel Verleysen Radial-Basis Function Networks - 6

Interpolation problem

• Given points (x_p, t_p), x_p ∈ ℝ^D, t_p ∈ ℝ, 1 ≤ p ≤ P: find F : ℝ^D → ℝ that satisfies

$$F(\mathbf{x}_p) = t_p, \qquad p = 1, \dots, P$$

• RBF technique (Powell, 1988):

$$F(\mathbf{x}) = \sum_{p=1}^{P} w_p\,\varphi\bigl(\|\mathbf{x} - \mathbf{x}_p\|\bigr)$$

  • the ϕ(‖x − x_p‖) are arbitrary non-linear functions (RBF)
  • as many functions as data points
  • centers fixed at the known points x_p
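A minimal NumPy sketch of this exact interpolation (the Gaussian choice of ϕ, the helper names and the toy data are ours, for illustration only):

```python
import numpy as np

def rbf_interpolate(X, t, sigma=1.0):
    """Exact RBF interpolation: one Gaussian kernel per data point.

    X: (P, D) data points, t: (P,) targets.
    Returns the weight vector w solving Phi w = t.
    """
    # Phi[k, p] = phi(||x_k - x_p||)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.linalg.solve(Phi, t)

def rbf_predict(Xnew, X, w, sigma=1.0):
    """Evaluate F(x) = sum_p w_p phi(||x - x_p||) at new points."""
    d2 = np.sum((Xnew[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2)) @ w

# toy usage: interpolate a 1-D function through 10 points
X = np.linspace(0, 1, 10).reshape(-1, 1)
t = np.sin(2 * np.pi * X).ravel()
w = rbf_interpolate(X, t, sigma=0.2)
print(rbf_predict(X, X, w, sigma=0.2) - t)  # ~0: exact at the data points
```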


Michel Verleysen Radial-Basis Function Networks - 7

Interpolation problem

• In matrix form, the conditions F(x_k) = t_k with F(x) = Σ_p w_p ϕ(‖x_k − x_p‖) become

$$\begin{bmatrix} \varphi_{11} & \varphi_{12} & \cdots & \varphi_{1P} \\ \varphi_{21} & \varphi_{22} & \cdots & \varphi_{2P} \\ \vdots & \vdots & \ddots & \vdots \\ \varphi_{P1} & \varphi_{P2} & \cdots & \varphi_{PP} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_P \end{bmatrix} = \begin{bmatrix} t_1 \\ t_2 \\ \vdots \\ t_P \end{bmatrix}, \qquad \text{where } \varphi_{kl} = \varphi\bigl(\|\mathbf{x}_k - \mathbf{x}_l\|\bigr)$$

  i.e. Φw = t, hence w = Φ⁻¹t.

• Vital question: is Φ non-singular?

Michel Verleysen Radial-Basis Function Networks - 8

Micchelli's theorem

• If the points x_k are distinct, Φ is non-singular (regardless of the dimension of the input space)
• Valid for a large class of RBF functions, for example:

$$\varphi(\mathbf{x}, \mathbf{c}) = \sqrt{\|\mathbf{x} - \mathbf{c}\|^2 + l^2} \qquad \text{(non-localized function)}$$

$$\varphi(\mathbf{x}, \mathbf{c}) = \frac{1}{\sqrt{\|\mathbf{x} - \mathbf{c}\|^2 + l^2}} \quad (l > 0) \qquad \text{(localized function)}$$

$$\varphi(\mathbf{x}, \mathbf{c}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{c}\|^2}{2\sigma^2}\right) \quad (\sigma > 0) \qquad \text{(localized function)}$$


Michel Verleysen Radial-Basis Function Networks - 9

Radial-Basis Function Networks

• Origin: Cover's theorem
• Interpolation problem
• Regularization theory
• Generalized RBFN
  • Universal approximation
  • RBFN ≈ kernel regression
• Learning
  • Centers → vector quantization
  • Widths
  • Multiplying factors
  • Other forms
• Vector quantization

Michel Verleysen Radial-Basis Function Networks - 10

Learning: ill-posed problem

• Necessity for regularization
• Error criterion:

$$E(F) = \underbrace{\frac{1}{2}\sum_{p=1}^{P}\bigl(t_p - F(\mathbf{x}_p)\bigr)^2}_{\text{MSE}} \;+\; \underbrace{\frac{1}{2}\,\lambda\, C(\mathbf{w})}_{\text{regularization}}$$

[Figure: noisy samples t versus x.]


Michel Verleysen Radial-Basis Function Networks - 11

Solution to the regularization problem

• Poggio & Girosi (1990): if C(w) is a (problem-dependent) linear differential operator, the solution to

$$E(F) = \frac{1}{2}\sum_{p=1}^{P}\bigl(t_p - F(\mathbf{x}_p)\bigr)^2 + \frac{1}{2}\,\lambda\, C(\mathbf{w})$$

  is of the following form:

$$F(\mathbf{x}) = \sum_{p=1}^{P} w_p\, G(\mathbf{x}, \mathbf{x}_p), \qquad \mathbf{w} = (\mathbf{G} + \lambda\mathbf{I})^{-1}\mathbf{t}$$

  where G(·,·) is a Green's function and G_kl = G(x_k, x_l).
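A small sketch of this regularized solution, assuming a Gaussian Green's function (the λ and σ values and the function name are illustrative choices, not from the slides):

```python
import numpy as np

def regularized_rbf_weights(X, t, lam=1e-2, sigma=0.2):
    """w = (G + lambda I)^{-1} t, with G_kl = exp(-||x_k - x_l||^2 / (2 sigma^2))."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    G = np.exp(-d2 / (2.0 * sigma ** 2))
    P = len(t)
    return np.linalg.solve(G + lam * np.eye(P), t)

# lam = 0 recovers the exact-interpolation weights of the previous slides
```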

Michel Verleysen Radial-Basis Function Networks - 12

Interpolation - Regularization

• Interpolation
  • Exact interpolator:

$$F(\mathbf{x}) = \sum_{p=1}^{P} w_p\,\varphi\bigl(\|\mathbf{x} - \mathbf{x}_p\|\bigr), \qquad \mathbf{w} = \boldsymbol{\Phi}^{-1}\mathbf{t}$$

  • Possible RBF:

$$\varphi(\mathbf{x}, \mathbf{x}_p) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}_p\|^2}{2\sigma^2}\right)$$

• Regularization
  • Exact interpolator:

$$F(\mathbf{x}) = \sum_{p=1}^{P} w_p\, G(\mathbf{x}, \mathbf{x}_p), \qquad \mathbf{w} = (\mathbf{G} + \lambda\mathbf{I})^{-1}\mathbf{t}$$

  • Equal to the "interpolation" solution iff λ = 0
  • Example of Green's function:

$$G(\mathbf{x}, \mathbf{x}_p) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}_p\|^2}{2\sigma^2}\right)$$

• One RBF / Green's function for each learning pattern!


Michel Verleysen Radial-Basis Function Networks - 13

Radial-Basis Function Networks

• Origin: Cover's theorem
• Interpolation problem
• Regularization theory
• Generalized RBFN
  • Universal approximation
  • RBFN ≈ kernel regression
• Learning
  • Centers → vector quantization
  • Widths
  • Multiplying factors
  • Other forms
• Vector quantization

Michel Verleysen Radial-Basis Function Networks - 14

Generalized RBFN (GRBFN – RBFN)

• As many radial functions as learning patterns:
  • computationally (too) intensive (the cost of inverting the P×P matrix grows as P³)
  • ill-conditioned matrix
  • regularization not easy (problem-specific)
  → Generalized RBFN approach!

$$F(\mathbf{x}) = \sum_{i=1}^{K} w_i\,\varphi\bigl(\|\mathbf{x} - \mathbf{c}_i\|\bigr), \qquad \varphi\bigl(\|\mathbf{x} - \mathbf{c}_i\|\bigr) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{c}_i\|^2}{2\sigma_i^2}\right)$$

• Typically: K << P
• Parameters: c_i, σ_i, w_i
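A hedged sketch of the GRBFN forward pass defined above (the function name and array shapes are our choices):

```python
import numpy as np

def grbfn_forward(X, centers, sigmas, w):
    """Generalized RBFN output F(x) = sum_i w_i exp(-||x - c_i||^2 / (2 sigma_i^2)).

    X: (N, D) inputs, centers: (K, D), sigmas: (K,), w: (K,).
    """
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)  # (N, K)
    Phi = np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))
    return Phi @ w
```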


Michel Verleysen Radial-Basis Function Networks - 15

Radial-Basis Function Networks (RBFN)

• Possibilities:
  • several outputs (common hidden layer)
  • bias (recommended) (see extensions)

[Figure: two-layer RBFN architecture: inputs x_1, …, x_d (plus bias x_0), a first layer of radial units ϕ(‖x − c_i‖) with centers c_ij, and a second, linear layer with weights w_i producing the output F(x); with several outputs, the hidden layer is shared.]

$$F(\mathbf{x}) = \sum_{i=1}^{K} w_i\,\varphi\bigl(\|\mathbf{x} - \mathbf{c}_i\|\bigr), \qquad \varphi\bigl(\|\mathbf{x} - \mathbf{c}_i\|\bigr) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{c}_i\|^2}{2\sigma_i^2}\right)$$

Michel Verleysen Radial-Basis Function Networks - 16

RBFN: universal approximation

• Park & Sandberg (1991): for any continuous input-output mapping function f(x),

$$\exists\, F(\mathbf{x}) = \sum_{i=1}^{K} w_i\,\varphi\bigl(\|\mathbf{x} - \mathbf{c}_i\|\bigr) \;:\; \bigl\|F(\mathbf{x}) - f(\mathbf{x})\bigr\|_{L^p} < \varepsilon, \qquad \forall\,\varepsilon > 0,\; p \in [1, \infty)$$

• The theorem is stronger (radial symmetry not needed)
• K is not specified
• Provides a theoretical basis for practical RBFN!


Michel Verleysen Radial-Basis Function Networks - 17

RBFN and kernel regression

• Non-linear regression model:

$$t_p = f(\mathbf{x}_p) + \varepsilon_p = y_p + \varepsilon_p, \qquad 1 \le p \le P$$

• Estimation of f(x): average of t around x. More precisely:

$$f(\mathbf{x}) = E[y \mid \mathbf{x}] = \int_{-\infty}^{\infty} y\, f_Y(y \mid \mathbf{x})\, dy = \frac{\int_{-\infty}^{\infty} y\, f_{X,Y}(\mathbf{x}, y)\, dy}{f_X(\mathbf{x})}$$

• Need for estimates of f_{X,Y}(x, y) and f_X(x)
  → Parzen-Rosenblatt density estimator

Michel Verleysen Radial-Basis Function Networks - 18

Parzen-Rosenblatt density estimator

• Estimation of f_X(x):

$$\hat{f}_X(\mathbf{x}) = \frac{1}{P h^d} \sum_{p=1}^{P} K\!\left(\frac{\mathbf{x} - \mathbf{x}_p}{h}\right)$$

  with K(·) continuous, bounded, symmetric about the origin, with maximum value at 0 and with unit integral; the estimator is consistent (asymptotically unbiased).

• Estimation of f_{X,Y}(x, y):

$$\hat{f}_{X,Y}(\mathbf{x}, y) = \frac{1}{P h^{d+1}} \sum_{p=1}^{P} K\!\left(\frac{\mathbf{x} - \mathbf{x}_p}{h}\right) K\!\left(\frac{y - y_p}{h}\right)$$
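A minimal sketch of the density estimate above, assuming a Gaussian kernel K (the function name and this particular kernel are our choices):

```python
import numpy as np

def parzen_density(x, X, h):
    """Parzen-Rosenblatt estimate of f_X(x) with a Gaussian kernel K.

    x: (D,) query point, X: (P, D) samples, h: bandwidth.
    """
    P, d = X.shape
    u = (x - X) / h                                          # (P, D)
    K = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (P * h ** d)
```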


Michel Verleysen Radial-Basis Function Networks - 19

RBFN and kernel regression

• Weighted average of the y_p:

$$\hat{f}(\mathbf{x}) = \frac{\int_{-\infty}^{\infty} y\, \hat{f}_{X,Y}(\mathbf{x}, y)\, dy}{\hat{f}_X(\mathbf{x})} = \frac{\sum_{p=1}^{P} y_p\, K\!\left(\dfrac{\mathbf{x} - \mathbf{x}_p}{h}\right)}{\sum_{p=1}^{P} K\!\left(\dfrac{\mathbf{x} - \mathbf{x}_p}{h}\right)}$$

• called the Nadaraya-Watson estimator (1964)
• equivalent to a normalized RBFN in the unregularized context
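A small sketch of the Nadaraya-Watson estimate with a Gaussian kernel (an illustrative choice; the kernel normalization cancels in the ratio):

```python
import numpy as np

def nadaraya_watson(x, X, y, h):
    """Kernel regression: f(x) = sum_p y_p K((x - x_p)/h) / sum_p K((x - x_p)/h)."""
    u = (x - X) / h
    K = np.exp(-0.5 * np.sum(u ** 2, axis=1))   # Gaussian kernel, unnormalized
    return np.dot(K, y) / K.sum()
```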

Michel Verleysen Radial-Basis Function Networks - 20

RBFN ↔ MLP

• RBFN
  • single hidden layer
  • non-linear hidden layer, linear output layer
  • argument of hidden units: Euclidean norm
  • universal approximation property
  • local approximators
  • split learning

• MLP
  • single or multiple hidden layers
  • non-linear hidden layer(s), linear or non-linear output layer
  • argument of hidden units: scalar product
  • universal approximation property
  • global approximators
  • global learning


Michel Verleysen Radial-Basis Function Networks - 21

Radial-Basis Function Networks

• Origin: Cover's theorem
• Interpolation problem
• Regularization theory
• Generalized RBFN
  • Universal approximation
  • RBFN ≈ kernel regression
• Learning
  • Centers → vector quantization
  • Widths
  • Multiplying factors
  • Other forms
• Vector quantization

Michel Verleysen Radial-Basis Function Networks - 22

RBFN: learning strategies

• Parameters to be determined: c_i, σ_i, w_i in

$$F(\mathbf{x}) = \sum_{i=1}^{K} w_i\,\varphi\bigl(\|\mathbf{x} - \mathbf{c}_i\|\bigr), \qquad \varphi\bigl(\|\mathbf{x} - \mathbf{c}_i\|\bigr) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{c}_i\|^2}{2\sigma_i^2}\right)$$

• Traditional learning strategy: split computation
  1. centers c_i
  2. widths σ_i
  3. weights w_i


Michel Verleysen Radial-Basis Function Networks - 23

RBFN: computation of centers

• Idea: the centers c_i must have the (density) properties of the learning points x_k
  → vector quantization (seen in detail hereafter)
• This phase only uses the x_k information, not the t_k

Michel Verleysen Radial-Basis Function Networks - 24

RBFN: computation of widths

• Universal approximation property: valid with identical widths
• In practice (limited learning set): variable widths σ_i
• Idea: RBFNs use local clusters
  → choose σ_i according to the standard deviation of the clusters


Michel Verleysen Radial-Basis Function Networks - 25

RBFN: computation of weights

• With centers and widths fixed, the ϕ(‖x − c_i‖) are constants: the problem becomes linear!
• The solution of the least square criterion

$$E(F) = \frac{1}{P}\sum_{p=1}^{P}\bigl(t_p - F(\mathbf{x}_p)\bigr)^2$$

  leads to

$$\mathbf{w} = \boldsymbol{\Phi}^{+}\mathbf{t} = \bigl(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\bigr)^{-1}\boldsymbol{\Phi}^T\mathbf{t}, \qquad \text{where } \boldsymbol{\Phi} \equiv \varphi_{ki} = \varphi\bigl(\|\mathbf{x}_k - \mathbf{c}_i\|\bigr)$$

• In practice: use SVD!
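A minimal sketch of this linear step (the pseudo-inverse route; NumPy's `pinv` is SVD-based, in line with the recommendation above, and the helper name is ours):

```python
import numpy as np

def rbfn_weights(X, t, centers, sigmas):
    """Least-squares output weights: w = pinv(Phi) t, Phi[k, i] = phi(||x_k - c_i||)."""
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    Phi = np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))
    # np.linalg.pinv computes the pseudo-inverse through an SVD
    return np.linalg.pinv(Phi) @ t
```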

Michel Verleysen Radial-Basis Function Networks - 26

RBFN: gradient descent

• 3-step method on

$$F(\mathbf{x}) = \sum_{i=1}^{K} w_i \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{c}_i\|^2}{2\sigma_i^2}\right)$$

  steps 1 (centers c_i) and 2 (widths σ_i) are unsupervised, step 3 (weights w_i) is supervised
• Once c_i, σ_i, w_i have been set by the previous method: possibility of gradient descent on all parameters
• Some improvement, but:
  • learning speed
  • local minima
  • risk of non-local basis functions
  • etc.


Michel Verleysen Radial-Basis Function Networks - 27

More elaborate models

• Add constant and linear terms:

$$F(\mathbf{x}) = \sum_{i=1}^{K} w_i \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{c}_i\|^2}{2\sigma_i^2}\right) + \sum_{i=1}^{D} w'_i x_i + w'_0$$

  good idea (it is very difficult to approximate a constant with kernels…)

• Use a normalized RBFN:

$$F(\mathbf{x}) = \sum_{i=1}^{K} w_i\, \frac{\exp\!\left(-\dfrac{\|\mathbf{x} - \mathbf{c}_i\|^2}{2\sigma_i^2}\right)}{\sum_{j=1}^{K} \exp\!\left(-\dfrac{\|\mathbf{x} - \mathbf{c}_j\|^2}{2\sigma_j^2}\right)}$$

  the basis functions are bounded in [0, 1] → they can be interpreted as probability values (classification)
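A small sketch of the normalized RBFN output (names are ours, for illustration):

```python
import numpy as np

def normalized_rbfn(X, centers, sigmas, w):
    """Normalized RBFN: each basis response is divided by the sum of all responses."""
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    Phi = np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))
    Phi /= Phi.sum(axis=1, keepdims=True)      # rows now sum to 1, entries in [0, 1]
    return Phi @ w
```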

Michel Verleysen Radial-Basis Function Networks - 28

Back to the widths…

• choose σ_i according to the standard deviation of the clusters
• In the literature:
  • σ = d_max / √(2K), where d_max is the maximum distance between centroids [1]
  • σ_i = ( (1/q) Σ_{j=1}^{q} ‖c_i − c_j‖² )^{1/2}, where the index j scans the q nearest centroids to c_i [2]
  • σ_i = r · min_j ‖c_i − c_j‖, where r is an overlap constant [3]
  • …

[1] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, second edition, 1999.
[2] J. Moody and C. J. Darken, "Fast learning in networks of locally-tuned processing units", Neural Computation 1, pp. 281-294, 1989.
[3] A. Saha and J. D. Keeler, "Algorithms for Better Representation and Faster Learning in Radial Basis Function Networks", Advances in Neural Information Processing Systems 2, D. S. Touretzky ed., pp. 482-489, 1989.
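A hedged sketch of the three width heuristics cited above (helper names and default parameters are our choices):

```python
import numpy as np

def widths_haykin(centers):
    """sigma = d_max / sqrt(2K), identical for all units [1]."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    return np.full(len(centers), d.max() / np.sqrt(2 * len(centers)))

def widths_moody_darken(centers, q=2):
    """sigma_i = RMS distance from c_i to its q nearest centroids [2]."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    d_nearest = np.sort(d, axis=1)[:, 1:q + 1]          # skip the zero self-distance
    return np.sqrt(np.mean(d_nearest ** 2, axis=1))

def widths_saha_keeler(centers, r=1.0):
    """sigma_i = r times the distance to the nearest other centroid [3]."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return r * d.min(axis=1)
```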


Michel Verleysen Radial-Basis Function Networks - 29

Basic example

• Approximation of f(x) = 1 with a d-dimensional RBFN
• In theory: identical w_i
• Experimentally: side effects → only the middle of the domain is taken into account
• [Figure: error versus width]

Michel Verleysen Radial-Basis Function Networks - 30

Basic example: errors vs. space dimension


Michel Verleysen Radial-Basis Function Networks - 31

Basic example: local decomposition?

Michel Verleysen Radial-Basis Function Networks - 32

Multiple local minima in the error curve

• Choose the first minimum to preserve the locality of the clusters
• The first local minimum is usually less sensitive to variability


Michel Verleysen Radial-Basis Function Networks - 33

Some concluding comments

• RBFN: easy learning (compared to MLP); in a cross-validation scheme this is important!
• Many RBFN models
• Even more RBFN learning schemes…
• Results are not very sensitive to the unsupervised part of learning (c_i, σ_i)
• Open work: a priori (problem-dependent) choice of the widths σ_i

Michel Verleysen Radial-Basis Function Networks - 34

Radial-Basis Function Networks

• Origin: Cover's theorem
• Interpolation problem
• Regularization theory
• Generalized RBFN
  • Universal approximation
  • RBFN ≈ kernel regression
• Learning
  • Centers → vector quantization
  • Widths
  • Multiplying factors
  • Other forms
• Vector quantization


Michel Verleysen Radial-Basis Function Networks - 35

Back to the centers: Vector quantization

• Aim and principle: what is a vector quantizer?
• Vector ↔ scalar quantization
• Lloyd's principle and algorithm
• Initialization
• Neural algorithms
  • Competitive learning
  • Frequency Sensitive Learning
• Winner-take-all ↔ winner-take-most
  • Soft Competition Scheme
  • Stochastic Relaxation Scheme
  • Neural gas

Michel Verleysen Radial-Basis Function Networks - 36

Vector quantization

• Aim and principle: what is a vector quantizer?
• Vector ↔ scalar quantization
• Lloyd's principle and algorithm
• Initialization
• Neural algorithms
  • Competitive learning
  • Frequency Sensitive Learning
• Winner-take-all ↔ winner-take-most
  • Soft Competition Scheme
  • Stochastic Relaxation Scheme
  • Neural gas


Michel Verleysen Radial-Basis Function Networks - 37

Aim of vector quantization

• To reduce the size of a database

[Figure: a table of P vectors with N features is replaced by a codebook of Q vectors with the same N features, Q < P.]

Michel Verleysen Radial-Basis Function Networks - 38

Principle of vector quantization

• To project a continuous input space onto a discrete output space, while minimizing the loss of information

[Figure: P data points in the continuous space are represented by Q codebook vectors.]


Michel Verleysen Radial-Basis Function Networks - 39

Principle of vector quantization

• To define zones in the space, the set of points contained in each zone being projected onto a representative vector (centroid)
• Example: 2-dimensional spaces

Michel Verleysen Radial-Basis Function Networks - 40

What is a vector quantizer ?

• A vector quantizer consists of:
  1. A codebook (set of centroids, or codewords): m = {y_j, 1 ≤ j ≤ Q}
  2. A quantization function q: x_i → q(x_i) = y_j
• Usually, q is defined by the nearest-neighbour rule (according to some distance measure)
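A minimal sketch of such a nearest-neighbour quantizer (names are ours):

```python
import numpy as np

def quantize(X, codebook):
    """Nearest-neighbour quantizer: returns the index j and codeword y_j for each x_i."""
    d2 = np.sum((X[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)  # (P, Q)
    j = np.argmin(d2, axis=1)
    return j, codebook[j]
```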


Michel Verleysen Radial-Basis Function Networks - 41

Distance measures

• Least square error:

$$d^2(\mathbf{x}_i, \mathbf{y}_j) = \|\mathbf{x}_i - \mathbf{y}_j\|^2 = \sum_{k=1}^{D} (x_{ik} - y_{jk})^2 = (\mathbf{x}_i - \mathbf{y}_j)^T(\mathbf{x}_i - \mathbf{y}_j)$$

• r-norm error:

$$d_r(\mathbf{x}_i, \mathbf{y}_j) = \left(\sum_{k=1}^{D} \bigl|x_{ik} - y_{jk}\bigr|^r\right)^{1/r}$$

• Mahalanobis distance (Γ is the covariance matrix of the inputs):

$$d_W(\mathbf{x}_i, \mathbf{y}_j) = (\mathbf{x}_i - \mathbf{y}_j)^T\,\boldsymbol{\Gamma}^{-1}(\mathbf{x}_i - \mathbf{y}_j)$$

Michel Verleysen Radial-Basis Function Networks - 42

Vector quantization

• Aim and principle: what is a vector quantizer?
• Vector ↔ scalar quantization
• Lloyd's principle and algorithm
• Initialization
• Neural algorithms
  • Competitive learning
  • Frequency Sensitive Learning
• Winner-take-all ↔ winner-take-most
  • Soft Competition Scheme
  • Stochastic Relaxation Scheme
  • Neural gas


Michel Verleysen Radial-Basis Function Networks - 43

Vector ↔ scalar quantization

• Shannon: a vector quantizer always gives better results than the product of scalar quantizers, even if the probability densities are independent
• Example #1: uniform 2-D distribution (figure parameter a)
  • vector quantization: E[d²(x − y)] = 6a²/36
  • scalar quantization: E[d²(x − y)] = 10a²/36

Michel Verleysen Radial-Basis Function Networks - 44

Vector ↔ scalar quantization

• Shannon: a vector quantizer always gives better results than the product of scalar quantizers, even if the probability densities are independent
• Example #2: uniform 2-D distribution
  • scalar quantization (square cells, figure parameter c): E_SQ[d²(x − y)] = c⁴/6
  • vector quantization (hexagonal cells, figure parameter h): E_VQ[d²(x − y)] = 1.08 h⁴
  • for the same ratio of #centroids per unit surface: E_VQ = 0.962 E_SQ


Michel Verleysen Radial-Basis Function Networks - 45

Vector quantization

• Aim and principle: what is a vector quantizer?
• Vector ↔ scalar quantization
• Lloyd's principle and algorithm
• Initialization
• Neural algorithms
  • Competitive learning
  • Frequency Sensitive Learning
• Winner-take-all ↔ winner-take-most
  • Soft Competition Scheme
  • Stochastic Relaxation Scheme
  • Neural gas

Michel Verleysen Radial-Basis Function Networks - 46

Lloyd’s principle

• 3 properties:
  1. The first one gives the best encoder, once the decoder is known
  2. The second one gives the best decoder, once the encoder is known
  3. There is no point on the borders between Voronoï regions (probability = 0)
• Optimal quantizer: these properties are necessary, but not sufficient

[Figure: x_i → encoder → index j → decoder → y_j; a "bad" and a "good" quantizer.]


Michel Verleysen Radial-Basis Function Networks - 47

Lloyd: property #1

• For a given decoder β, the best encoder α is given by

$$\alpha(\mathbf{x}_i) = \arg\min_{j}\, d\bigl(\mathbf{x}_i, \mathbf{y}_j\bigr), \qquad \text{where } \mathbf{y}_j = \beta(j)$$

  → nearest-neighbor rule!

Michel Verleysen Radial-Basis Function Networks - 48

Lloyd: property #2

• For a given encoder α, the best decoder β is given by

$$\beta(j) = \arg\min_{\mathbf{y}_j}\, E\bigl[\,d(\mathbf{x}_i, \mathbf{y}_j) \mid \alpha(\mathbf{x}_i) = j\,\bigr]$$

  → center-of-gravity rule!


Michel Verleysen Radial-Basis Function Networks - 49

Lloyd: property #3

• The probability of finding a point x_i on a border (between Voronoï regions) is zero!

Michel Verleysen Radial-Basis Function Networks - 50

Lloyd’s algorithm

1. Choice of an initial codebook.

2. All points xi are encoded; EVQ is evaluated.

3. If EVQ is small enough, then stop.

4. All centroids yj are replaced by the center-of-gravity of the data xi associated to yj in step 2.

5. Back to step 2.
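A small NumPy sketch of the batch algorithm above (the stopping test on codebook movement is our simplification of step 3; names are ours):

```python
import numpy as np

def lloyd(X, codebook, tol=1e-6, max_iter=100):
    """Batch Lloyd / k-means: alternate nearest-neighbour encoding and centroid update."""
    for _ in range(max_iter):
        d2 = np.sum((X[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
        j = np.argmin(d2, axis=1)                        # step 2: encode all points
        E_vq = d2[np.arange(len(X)), j].mean()           # distortion E_VQ
        new_codebook = np.array(
            [X[j == q].mean(axis=0) if np.any(j == q) else codebook[q]
             for q in range(len(codebook))])             # step 4: centers of gravity
        if np.linalg.norm(new_codebook - codebook) < tol:  # stop when centroids no longer move
            break
        codebook = new_codebook                          # step 5: back to step 2
    return codebook, E_vq
```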


Michel Verleysen Radial-Basis Function Networks - 51

Lloyd: example

[Figure: data points x_i and centroids y_1, …, y_4. Step 1: initialization of the codebook.]

Michel Verleysen Radial-Basis Function Networks - 52

Lloyd: example

[Figure: data points x_i and centroids y_1, …, y_4. Step 2: encoding (nearest-neighbor).]


Michel Verleysen Radial-Basis Function Networks - 53

Lloyd: example

[Figure: data points x_i and centroids y_1, …, y_4. Step 4: decoding (center-of-gravity).]

Michel Verleysen Radial-Basis Function Networks - 54

Lloyd: example

[Figure: data points x_i and centroids y_1, …, y_4. Step 2: encoding (nearest-neighbor) → new borders.]


Michel Verleysen Radial-Basis Function Networks - 55

Lloyd: example

[Figure: data points x_i and centroids y_1, …, y_4. Step 4: decoding (center-of-gravity) → new positions of the centroids.]

Michel Verleysen Radial-Basis Function Networks - 56

Lloyd: example

[Figure: data points x_i and centroids y_1, …, y_4. Step 2: encoding (nearest-neighbor) → new borders.]


Michel Verleysen Radial-Basis Function Networks - 57

Lloyd: example

[Figure: data points x_i and centroids y_1, …, y_4. Step 4: decoding (center-of-gravity) → new positions of the centroids.]

Michel Verleysen Radial-Basis Function Networks - 58

Lloyd: example

[Figure: data points x_i and centroids y_1, …, y_4. Step 2: encoding (nearest-neighbor) → final borders (convergence).]


Michel Verleysen Radial-Basis Function Networks - 59

Lloyd’s algorithm: the names

• Lloyd's algorithm
• Generalized Lloyd's algorithm
• Linde-Buzo-Gray (LBG) algorithm
• K-means
• ISODATA
• …

All are based on the same principle!

Michel Verleysen Radial-Basis Function Networks - 60

Lloyd’s algorithm: properties

• The codebook is modified only after the presentation of the whole dataset
• The mean square error (E_VQ) decreases at each iteration
• The risk of getting trapped in local minima is high
• The final quantizer depends on the initial one


Michel Verleysen Radial-Basis Function Networks - 61

Vector quantization

• Aim and principle: what is a vector quantizer?
• Vector ↔ scalar quantization
• Lloyd's principle and algorithm
• Initialization
• Neural algorithms
  • Competitive learning
  • Frequency Sensitive Learning
• Winner-take-all ↔ winner-take-most
  • Soft Competition Scheme
  • Stochastic Relaxation Scheme
  • Neural gas

Michel Verleysen Radial-Basis Function Networks - 62

How to initialize Lloyd’s algorithm ?

1. randomly in the input space
2. the Q first data points x_i
3. Q randomly chosen data points x_i
4. 'Product codes': the product of scalar quantizers
5. growing initial set:
   • a first centroid y_1 is randomly chosen (in the data set)
   • a second centroid y_2 is randomly chosen (in the data set); if d(y_1, y_2) > threshold, y_2 is kept
   • a third centroid y_3 is randomly chosen (in the data set); if d(y_1, y_3) > threshold AND d(y_2, y_3) > threshold, y_3 is kept
   • …


Michel Verleysen Radial-Basis Function Networks - 63

How to initialize Lloyd’s algorithm ?

6. 'Pairwise nearest neighbor':
   • a first codebook is built with all data points x_i
   • the two centroids y_j nearest to one another are merged (center-of-gravity)
   • …
   Variant: the increase of distortion (E_VQ) is evaluated for the merge of each pair of centroids y_j; the pair giving the lowest increase is merged.

Michel Verleysen Radial-Basis Function Networks - 64

How to initialize Lloyd’s algorithm ?

7. 'Splitting':
   • a first centroid y_1 is randomly chosen (in the data set)
   • a second centroid y_1 + ε is created; Lloyd's algorithm is applied to the new codebook
   • two new centroids are created by perturbing the two existing ones; Lloyd's algorithm is applied to this 4-centroid codebook
   • …


Michel Verleysen Radial-Basis Function Networks - 65

Vector quantization

• Aim and principle: what is a vector quantizer?
• Vector ↔ scalar quantization
• Lloyd's principle and algorithm
• Initialization
• Neural algorithms
  • Competitive learning
  • Frequency Sensitive Learning
• Winner-take-all ↔ winner-take-most
  • Soft Competition Scheme
  • Stochastic Relaxation Scheme
  • Neural gas

Michel Verleysen Radial-Basis Function Networks - 66

Vector quantization: « neural » algorithms

• Principle: the codebook is (partly or fully) modified at each presentation of one data vector x_i
• Advantages:
  • simplicity
  • adaptive algorithm (with varying data)
  • possible parallelisation
  • speed?
  • avoids local minima?


Michel Verleysen Radial-Basis Function Networks - 67

Competitive learning

• Algorithm: for each input vector x_i, the "winner" y_k is selected:

$$d(\mathbf{x}_i, \mathbf{y}_k) \le d(\mathbf{x}_i, \mathbf{y}_j), \qquad \forall j,\; 1 \le j \le Q$$

• Adaptation rule: the winner is moved towards the input vector:

$$\mathbf{y}_k(t+1) = \mathbf{y}_k(t) + \alpha\,\bigl(\mathbf{x}_i - \mathbf{y}_k(t)\bigr)$$

[Figure: the winning centroid y_k(t) moves towards x_i to its new position y_k(t+1).]
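A minimal sketch of online competitive learning as described above (the constant learning rate α and the epoch loop are our choices; a Robbins-Monro schedule would decrease α over time):

```python
import numpy as np

def competitive_learning(X, codebook, alpha=0.05, epochs=10, seed=0):
    """Online competitive learning: move only the winning centroid towards each sample."""
    rng = np.random.default_rng(seed)
    codebook = codebook.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            k = np.argmin(np.sum((codebook - X[i]) ** 2, axis=1))   # winner
            codebook[k] += alpha * (X[i] - codebook[k])             # move it towards x_i
    return codebook
```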

Michel Verleysen Radial-Basis Function Networks - 68

Competitive learning

• Adaptation rule: stochastic gradient descent on

$$E = \int \bigl\|\mathbf{x} - \mathbf{y}_{j(\mathbf{x})}\bigr\|^2\, p(\mathbf{x})\, d\mathbf{x}$$

  (y_{j(x)} is the winner for x) → convergence to a (local) minimum
• Robbins-Monro conditions on α:

$$\sum_{t=0}^{\infty} \alpha(t) = \infty \qquad \text{and} \qquad \sum_{t=0}^{\infty} \alpha(t)^2 < \infty$$

• Local minima!
• Some centroids may be "lost"!

[Figure: a centroid located far from the data is never the "winner".]


Michel Verleysen Radial-Basis Function Networks - 69

Frequency sensitive learning

• Competitive learning; some centroids may be lost during learning → centroids often chosen (as winners) are penalized!
• The choice of the winner is replaced by

$$u_k\, d(\mathbf{x}_i, \mathbf{y}_k) \le u_j\, d(\mathbf{x}_i, \mathbf{y}_j), \qquad \forall j,\; 1 \le j \le Q$$

  where u_j, u_k are incremented each time they are chosen as winner (starting at 1)

Michel Verleysen Radial-Basis Function Networks - 70

Frequency sensitive learning

• Example (with u_1 = 1 and u_2 = 3): y_1 becomes a possible "winner" for x_i as soon as

$$u_2\, d(\mathbf{x}_i, \mathbf{y}_2) \ge u_1\, d(\mathbf{x}_i, \mathbf{y}_1)$$

[Figure: a centroid y_1 that was never the winner can now win the data point x_i.]


Michel Verleysen Radial-Basis Function Networks - 71

Vector quantization

• Aim and principle: what is a vector quantizer?
• Vector ↔ scalar quantization
• Lloyd's principle and algorithm
• Initialization
• Neural algorithms
  • Competitive learning
  • Frequency Sensitive Learning
• Winner-take-all ↔ winner-take-most
  • Soft Competition Scheme
  • Stochastic Relaxation Scheme
  • Neural gas

Michel Verleysen Radial-Basis Function Networks - 72

Winner-take-all – Winner-take-most

• Most VQ algorithms: if two centroids are close, it will be hard to separate them!
• (Competitive learning and LVQ: only one or two winners are adapted)
• Solution: adapt the winner and other centroids:
  • with respect to the distance between the centroid and x_i (SCS)
  • stochastically with respect to the distance (SRS)
  • with respect to the order of proximity to x_i (neural gas)
  • those in a neighborhood (on a grid) of the winner (Kohonen)
• These algorithms accelerate the VQ too!


Michel Verleysen Radial-Basis Function Networks - 73

Soft Competition Scheme (SCS)

• Adaptation rule on all centroids:

$$\mathbf{y}_k(t+1) = \mathbf{y}_k(t) + \alpha\, G(k, \mathbf{x}_i)\,\bigl(\mathbf{x}_i - \mathbf{y}_k(t)\bigr)$$

  with

$$G(k, \mathbf{x}_i) = \frac{e^{-\|\mathbf{x}_i - \mathbf{y}_k(t)\|^2 / T}}{\sum_{j=1}^{Q} e^{-\|\mathbf{x}_i - \mathbf{y}_j(t)\|^2 / T}}$$

  (T is made decreasing over time)

Michel Verleysen Radial-Basis Function Networks - 74

Stochastic Relaxation Scheme (SRS)

• Adaptation rule on all centroids:

$$\mathbf{y}_k(t+1) = \mathbf{y}_k(t) + \alpha\, G(k, \mathbf{x}_i)\,\bigl(\mathbf{x}_i - \mathbf{y}_k(t)\bigr)$$

  with

$$G(k, \mathbf{x}_i) = \begin{cases} 1 & \text{with probability } P \\ 0 & \text{with probability } 1 - P \end{cases}, \qquad P = \frac{e^{-\|\mathbf{x}_i - \mathbf{y}_k(t)\|^2 / T}}{\sum_{j=1}^{Q} e^{-\|\mathbf{x}_i - \mathbf{y}_j(t)\|^2 / T}}$$

  (T is made decreasing over time)


Michel Verleysen Radial-Basis Function Networks - 75

Neural gas

• Principle: the centroids are ranked according to their distance to x_i:
  h(j, x_i) = c if y_j is the c-th nearest centroid to x_i
• Adaptation rule:

$$\mathbf{y}_k(t+1) = \mathbf{y}_k(t) + \alpha\, e^{-h(k, \mathbf{x}_i)/\lambda}\,\bigl(\mathbf{x}_i - \mathbf{y}_k(t)\bigr)$$
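A one-step sketch of the neural-gas update (names are ours; every centroid moves, with a weight that decays with its distance rank):

```python
import numpy as np

def neural_gas_step(x, codebook, alpha=0.05, lam=1.0):
    """One neural-gas update for a single sample x; modifies and returns the codebook."""
    d = np.linalg.norm(codebook - x, axis=1)
    rank = np.argsort(np.argsort(d))   # h(k, x): 0 for the nearest centroid, 1 for the next, ...
    codebook += alpha * np.exp(-rank / lam)[:, None] * (x - codebook)
    return codebook
```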

Michel Verleysen Radial-Basis Function Networks - 76

Neural gas

• Properties:
  • if λ = 0: competitive learning
  • partial sorting is possible
  • improvement: possibility of frequency sensitive learning
• The adaptation rule is a stochastic gradient descent on

$$E = \frac{1}{2\sum_{l=1}^{Q} e^{-h_l/\lambda}}\; \sum_{j=1}^{Q} \int e^{-h(j,\mathbf{x})/\lambda}\, \bigl\|\mathbf{x} - \mathbf{y}_j\bigr\|^2\, p(\mathbf{x})\, d\mathbf{x}$$


Michel Verleysen Radial-Basis Function Networks - 77

Sources and references (RBFN)

• Most of the basic concepts developed in these slides come from the excellent book:
  • Neural Networks: A Comprehensive Foundation, S. Haykin, Macmillan College Publishing Company, 1994.
• Some supplementary comments come from the tutorial on RBF:
  • An Overview of Radial Basis Function Networks, J. Ghosh & A. Nag, in: Radial Basis Function Networks 2, R. J. Howlett & L. C. Jain eds., Physica-Verlag, 2001.
• The results on the RBFN basic example were generated by my colleague N. Benoudjit, and are submitted for publication.

Michel Verleysen Radial-Basis Function Networks - 78

Sources and references (VQ)

• A classical tutorial on vector quantization:
  • Vector Quantization, R. M. Gray, IEEE ASSP Magazine, vol. 1, pp. 4-29, April 1984.
• Most concepts in this chapter come from the following (specialized) papers:
  • Vector Quantization in Speech Coding, J. Makhoul, S. Roucos, H. Gish, Proceedings of the IEEE, vol. 73, no. 11, November 1985.
  • Neural-Gas Network for Vector Quantization and its Application to Time-Series Prediction, T. M. Martinetz, S. G. Berkovich, K. J. Schulten, IEEE Transactions on Neural Networks, vol. 4, no. 4, July 1993.
  • Habituation in Learning Vector Quantization, T. Geszti, I. Csabai, Complex Systems, no. 6, 1992.
  • DVQ: Dynamic Vector Quantization, an Incremental LVQ, F. Poirier, A. Ferrieux, in: Artificial Neural Networks, T. Kohonen et al. eds., Elsevier, 1991.
  • Representation of Nonlinear Data Structures through a Fast VQP Neural Network, P. Demartines, J. Hérault, Proc. Neuro-Nîmes, 1993.