Giansalvo EXIN Cirrincione unit #6

Problem: given a mapping $\mathbf{x} \in \mathbb{R}^d \to t \in \mathbb{R}$ and a training set (TS) of N points, find a function $h(\mathbf{x})$ such that: $h(\mathbf{x}^n) = t^n$, $n = 1, \dots, N$

Radial basis functions

Exact interpolation

The exact interpolation problem requires every input vector to be mapped exactly onto the corresponding target vector.

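$h(\mathbf{x}) = \sum_{n=1}^{N} w_n\,\phi(\|\mathbf{x} - \mathbf{x}^n\|)$: N basis functions, one per data point; the norm is usually Euclidean. This has the form of a generalized linear discriminant function.

The interpolation conditions give the linear system $\Phi\mathbf{w} = \mathbf{t}$, with $\mathbf{t} \equiv (t^n)$, $\mathbf{w} \equiv (w_n)$ and $\Phi_{nn'} = \phi(\|\mathbf{x}^n - \mathbf{x}^{n'}\|)$.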

Many properties of the interpolating function are relatively insensitive to the precise form of the non-linear kernel.

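Examples of basis functions:
• Gaussian: $\phi(x) = \exp\left(-\dfrac{x^2}{2\sigma^2}\right)$ (localized)
• $\phi(x) = (x^2 + \sigma^2)^{-\alpha}$, $\alpha > 0$ (localized)
• thin-plate spline function: $\phi(x) = x^2 \ln x$
• cubic: $\phi(x) = x^3$
• $\phi(x) = (x^2 + \sigma^2)^{\beta}$, $0 < \beta < 1$ (multi-quadric function for $\beta = 1/2$)
• linear: $\phi(x) = x$ (still non-linear in the components of $\mathbf{x}$)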

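Example: 30 points generated by sampling $h(x) = 0.5 + 0.4\sin(2\pi x)$ and adding Gaussian noise with zero mean and $\sigma = 0.05$.

Exact interpolation with Gaussian basis functions with $\sigma = 0.067$ (roughly twice the spacing of the data points): the interpolating function is highly oscillatory.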
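The exact interpolation example can be sketched numerically as follows. This is a minimal illustration, not the code behind the slide: the equally spaced inputs on [0, 1], the random seed and the helper names `gaussian_design` and `h` are assumptions made for the example.

```python
import numpy as np

def gaussian_design(x_eval, centres, sigma):
    """Matrix of Gaussian activations phi(|x - c|) for 1-D inputs."""
    diff = x_eval[:, None] - centres[None, :]
    return np.exp(-diff**2 / (2.0 * sigma**2))

rng = np.random.default_rng(0)             # arbitrary seed (assumption)
N = 30
x = np.linspace(0.0, 1.0, N)               # equally spaced inputs (assumption)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(0.0, 0.05, N)

sigma = 0.067
Phi = gaussian_design(x, x, sigma)         # N x N: one basis function per data point
w = np.linalg.solve(Phi, t)                # exact interpolation: Phi w = t

def h(x_new):
    """Interpolating function evaluated at new inputs."""
    return gaussian_design(np.atleast_1d(x_new), x, sigma) @ w

# The interpolant passes through every noisy target (up to round-off).
residual = np.max(np.abs(h(x) - t))
print("largest interpolation error on the TS:", residual)
print("largest weight magnitude:", np.max(np.abs(w)))
```

Because the function is forced through every noisy target, the weights tend to become very large with alternating signs, which is one way to see the oscillatory behaviour between data points.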

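Generalization to more than one output variable: each output $h_k(\mathbf{x}) = \sum_n w_{kn}\,\phi(\|\mathbf{x} - \mathbf{x}^n\|)$ must satisfy $h_k(\mathbf{x}^n) = t_k^n$; the same matrix $\Phi$ is used for every output.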

RBF

RBF’s provide a smooth interpolating function in which the number of basis functions is determined by the complexity of the mapping to be represented rather than by the size of the data set.

The number M of basis functions is less than N.

The centres of the basis functions are determined by the training process.

Each basis function has its own width $\sigma_j$, which is adaptive.

Bias parameters are included in the linear sum and compensate for the difference between the average value over the TS of the basis function activations and the corresponding average value of the targets.

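Gaussian kernel: $\phi_j(\mathbf{x}) = \exp\left(-\dfrac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$, where the centre $\boldsymbol{\mu}_j$ has elements $\mu_{ji}$, $i = 1, \dots, d$.

Network mapping: $y_k(\mathbf{x}) = \sum_{j=0}^{M} w_{kj}\,\phi_j(\mathbf{x})$, with the extra basis function $\phi_0$ fixed at 1 to absorb the bias.

• universal approximator
• best approximator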

There is a trade-off between using a smaller number of basis functions with many adjustable parameters and a larger number of less flexible functions.

homework

Suppose that all of the Gaussian basis functions in the network share a common covariance matrix $\Sigma$. Show that the mapping represented by such a network is equivalent to that of a network of spherical Gaussian basis functions with common variance $\sigma^2 = 1$, provided the input vector $\mathbf{x}$ is first transformed by an appropriate linear transformation. Find expressions relating the transformed input vector and kernel centres to the corresponding original vectors.

RBF training

decoupling of the two layers:
• first-layer weights: $\boldsymbol{\mu}_j$, $\sigma_j$
• second-layer weights: $w_{kj}$

fixed first-layer weights → generalized linear discriminant → fast learning

normal equations

Example:
• M = 5
• centres = random subset of the TS
• three choices of width: $\sigma = 0.4$, $\sigma = 0.08$, $\sigma = 10.0$
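With the basis functions held fixed, finding the second-layer weights is a linear least-squares problem. The sketch below fits M = 5 Gaussian basis functions for the three widths listed. It is an illustration under assumptions: the sinusoidal data set, the seed and the names `design_matrix` and `fit_and_score` are hypothetical, and `numpy.linalg.lstsq` (a pseudo-inverse solution) stands in for solving the normal equations explicitly.

```python
import numpy as np

def design_matrix(x, centres, sigma):
    """N x (M+1) matrix: a bias column fixed at 1 plus Gaussian activations."""
    diff = x[:, None] - centres[None, :]
    phi = np.exp(-diff**2 / (2.0 * sigma**2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])

rng = np.random.default_rng(1)                    # arbitrary seed (assumption)
N, M = 30, 5
x = np.linspace(0.0, 1.0, N)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(0.0, 0.05, N)
centres = rng.choice(x, size=M, replace=False)    # random subset of the TS

def fit_and_score(sigma):
    """Least-squares second-layer weights and the training RMS error."""
    Phi = design_matrix(x, centres, sigma)
    # lstsq returns the minimum-norm least-squares solution (pseudo-inverse).
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    rms = np.sqrt(np.mean((Phi @ w - t) ** 2))
    return w, rms

results = {sigma: fit_and_score(sigma) for sigma in (0.4, 0.08, 10.0)}
for sigma, (w, rms) in results.items():
    print(f"sigma = {sigma:5.2f}  training RMS error = {rms:.4f}")
```

Only the linear solve depends on the targets, which is why this stage of training is fast compared with a full non-linear optimization.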

Regularization theory

The error function is augmented with a regularization term defined through a differential operator P.

Set the functional derivative w.r.t. y(x) to zero. This gives the Euler-Lagrange equations, which involve $\hat{P}$, the adjoint differential operator to P.

The solution can be written down in terms of the Green’s functions G(x,x’) of the operator $\hat{P}P$, defined by $\hat{P}P\,G(\mathbf{x},\mathbf{x}') = \delta(\mathbf{x} - \mathbf{x}')$.

If P is translationally and rotationally invariant, G depends only on the distance between x and x’ (radial). The solution is given by:

$y(\mathbf{x}) = \sum_{n} w_n\,G(\|\mathbf{x} - \mathbf{x}^n\|)$ (RBF)

Integrate over a small region around $\mathbf{x}^n$.

Using the solution in terms of the Green’s functions:

The Green’s function is Gaussian with width parameter $\sigma$ if the operator P is chosen as:

A regularization coefficient of 0 implies RB exact interpolation.

• $\sigma = 0.067$
• regularization coefficient = 40
• RB’s centred on each data point

A regularization coefficient of 0 implies RB exact interpolation.
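A minimal numerical sketch of regularized interpolation follows. It assumes that the weights solve the linear system $(G + c\,I)\mathbf{w} = \mathbf{t}$, where $G$ holds the Green’s function values between data points and $c$ is the regularization coefficient, and that $G$ is a unit-height Gaussian. Both the normalization of $G$ and the names used here are assumptions, so the effective strength of $c = 40$ may differ from the slide’s figure.

```python
import numpy as np

def gaussian_matrix(x, sigma):
    """Unit-height Gaussian Green's function values between all data points."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff**2 / (2.0 * sigma**2))

def regularized_weights(G, t, coeff):
    """Solve (G + coeff * I) w = t; coeff = 0 reduces to exact interpolation."""
    return np.linalg.solve(G + coeff * np.eye(G.shape[0]), t)

rng = np.random.default_rng(0)              # arbitrary seed (assumption)
N, sigma = 30, 0.067
x = np.linspace(0.0, 1.0, N)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(0.0, 0.05, N)

G = gaussian_matrix(x, sigma)
w_exact = regularized_weights(G, t, 0.0)    # passes through every target
w_reg = regularized_weights(G, t, 40.0)     # no longer interpolates exactly

fit_exact = G @ w_exact
fit_reg = G @ w_reg
print("weight norm, coefficient 0 :", np.linalg.norm(w_exact))
print("weight norm, coefficient 40:", np.linalg.norm(w_reg))
```

The regularized weights are much smaller in norm than the exact-interpolation weights, which is the mechanism that suppresses the oscillations.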

homework

Consider the functional derivative of the regularization functional (w.r.t. y(x)) given by:

By using successive integration by parts, and making use of identities:

show that the operator $\hat{P}P$ is given by:

It should be assumed that boundary terms arising from the integration by parts can be neglected. Now find the radial Green’s function of this operator as follows. First introduce the multidimensional Fourier transform of G in the form:

By using the last two formulae and using the following formula for the Fourier transform of the Dirac function:

where d is the dimensionality of x and s, show that the Fourier transform of the Green’s function is given by:

Now substitute this result into the inverse Fourier transform of G and then show that the Green’s function is given by:

Regularization can also be applied to general RBF’s. Also, regularization terms can be considered for which the kernels are not necessarily the Green’s functions. For example:

penalizes mappings which have large curvatures. This regularizer leads to second-layer weights which are found by the solution of:

Relation to kernel regression

Technique for estimating regression functions from noisy data, based on methods of kernel density estimation.

Consider a mapping $\mathbf{x} \to \mathbf{y}$ and a corresponding TS; a complete description of the statistical properties of the generator of the data is given by the probability density p(x,t) in the joint input-target space.

Parzen Gaussian kernel estimator

The optimal mapping is given by forming the regression, or conditional average $\langle \mathbf{t}|\mathbf{x} \rangle$, of the target data, conditioned on the input variables.

Nadaraya-Watson estimator:

$y(\mathbf{x}) = \dfrac{\sum_n \mathbf{t}^n \exp\{-\|\mathbf{x} - \mathbf{x}^n\|^2 / 2h^2\}}{\sum_n \exp\{-\|\mathbf{x} - \mathbf{x}^n\|^2 / 2h^2\}}$

This is an expansion in normalized Gaussians centred on the data points, with the targets $\mathbf{t}^n$ as second-layer weights; $h$ denotes the kernel width.

Extension: replace the kernel estimator with an adaptive mixture model.
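A short sketch of the Nadaraya-Watson estimator for scalar inputs and targets is given below. The data set, the seed, the kernel width `h = 0.05` and the function name `nadaraya_watson` are assumptions chosen for illustration.

```python
import numpy as np

def nadaraya_watson(x_query, x_train, t_train, h):
    """Kernel regression with Gaussian kernels of width h (1-D inputs)."""
    diff = np.atleast_1d(x_query)[:, None] - x_train[None, :]
    k = np.exp(-diff**2 / (2.0 * h**2))          # unnormalized Gaussians
    phi = k / k.sum(axis=1, keepdims=True)       # normalized basis functions
    return phi @ t_train                         # targets act as second-layer weights

rng = np.random.default_rng(0)                   # arbitrary seed (assumption)
x_train = np.linspace(0.0, 1.0, 30)
t_train = 0.5 + 0.4 * np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.05, 30)

x_grid = np.linspace(0.0, 1.0, 101)
y_grid = nadaraya_watson(x_grid, x_train, t_train, h=0.05)
```

Since the normalized activations are non-negative and sum to one, every prediction is a weighted average of the training targets and no linear system has to be solved.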

RBF’s for classification

Model the class distributions $p(\mathbf{x}|C_k)$ by local kernel functions.

Figure: multilayer perceptron.

By Bayes’ theorem, $P(C_k|\mathbf{x}) = \dfrac{p(\mathbf{x}|C_k)\,P(C_k)}{\sum_{k'} p(\mathbf{x}|C_{k'})\,P(C_{k'})}$, which has the form of an RBF: the normalized class-conditional densities act as basis functions and the prior $P(C_k)$ is the hidden-to-output weight.

The outputs represent approximations to the posterior probabilities.

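Use a common pool of M basis functions, labelled by an index j, to represent all of the class-conditional densities:

$p(\mathbf{x}|C_k) = \sum_{j=1}^{M} p(\mathbf{x}|j)\,P(j|C_k)$, so that $p(\mathbf{x}) = \sum_{j=1}^{M} p(\mathbf{x}|j)\,P(j)$

where $P(j) = \sum_k P(j|C_k)\,P(C_k)$.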

$P(C_k|\mathbf{x}) = \sum_{j=1}^{M} w_{kj}\,\phi_j(\mathbf{x})$

The activations of the basis functions can be interpreted as the posterior probabilities of the presence of corresponding features in the input space, and the weights can similarly be interpreted as the posterior probabilities of class membership, given the presence of the features. The outputs represent the posterior probabilities of class membership.
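The probabilistic reading of the network can be checked numerically. The sketch below assumes $\phi_j(\mathbf{x}) = P(j|\mathbf{x})$ and $w_{kj} = P(C_k|j)$, builds a hypothetical one-dimensional model with three shared Gaussian kernels and two classes, and compares the network output with the posterior obtained directly from Bayes’ theorem. All numerical values and names are made up for the illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, sigma):
    return np.exp(-(x - mean) ** 2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# Hypothetical model: M = 3 kernels shared by 2 classes (all numbers made up).
means = np.array([-2.0, 0.0, 2.0])
sigmas = np.array([1.0, 0.5, 1.0])
prior_class = np.array([0.6, 0.4])                   # P(C_k)
p_j_given_class = np.array([[0.7, 0.3, 0.0],         # P(j | C_1)
                            [0.0, 0.2, 0.8]])        # P(j | C_2)

p_j = prior_class @ p_j_given_class                  # P(j) = sum_k P(j|C_k) P(C_k)
weights = p_j_given_class * prior_class[:, None] / p_j[None, :]   # w_kj = P(C_k | j)

def posteriors(x):
    """Network output: P(C_k | x) = sum_j w_kj phi_j(x)."""
    dens = gaussian_pdf(np.atleast_1d(x)[:, None], means, sigmas)  # p(x | j)
    joint = dens * p_j
    phi = joint / joint.sum(axis=1, keepdims=True)                 # phi_j = P(j | x)
    return phi @ weights.T

def posteriors_bayes(x):
    """Reference: Bayes' theorem applied to the class-conditional densities."""
    dens = gaussian_pdf(np.atleast_1d(x)[:, None], means, sigmas)
    joint = (dens @ p_j_given_class.T) * prior_class               # p(x|C_k) P(C_k)
    return joint / joint.sum(axis=1, keepdims=True)

xs = np.linspace(-4.0, 4.0, 9)
print(np.round(posteriors(xs), 3))
```

The two functions agree because the factor $P(j)$ cancels between the normalized basis function and the weight.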

Comparison with the multilayer perceptron

The hidden unit activation in an MLP is constant on parallel (d-1)-dimensional hyperplanes.

The hidden unit (RB) activation in an RBF is constant on concentric (d-1)-dimensional hyperspheres (more generally hyperellipsoids).

homework

In an MLP a hidden unit has a constant activation function for input vectors which lie on a hyperplanar surface in input space given by $\mathbf{w}^T\mathbf{x} + w_0 = \text{const}$, while for a spherical RBF a hidden unit has constant activation on a hyperspherical surface defined by $\|\mathbf{x} - \boldsymbol{\mu}\|^2 = \text{const}$. Show that, for suitable choices of the parameters, these surfaces coincide if the input vectors are normalized to unit length. Illustrate this equivalence geometrically for vectors in a three-dimensional input space.

The MLP forms a distributed representation in the space of activation values for the hidden units (problems: local minima, flat regions).

The RBF forms a representation in the space of activation values for the hidden units which is local w.r.t. the input space.

All of the parameters in an MLP are usually determined at the same time as part of a single global supervised training strategy.

An RBF is trained in two steps (kernels determined first by unsupervised methods, weights then found by fast linear supervised methods).

basis function optimization

Unsupervised procedures: ignore any target information.

The basis function parameters should be chosen to form a representation of the pdf of the input data ($\boldsymbol{\mu}_j$’s as prototypes of the inputs).

Problem: if the basis function centres are used to fill out a compact d-dimensional region of the input space, then the number of kernel centres will be an exponential function of d.

Problem: input variables which have significant variance but play little role in determining the appropriate output variables (irrelevant inputs).

Example: y independent of $x_2$.

An MLP can learn to ignore irrelevant inputs and obtain accurate results with a small number of hidden units.

The optimal choice of basis function parameters for input density estimation need not be optimal for representing the mapping to the output variables.

Figure: input pdf and target pdf plotted against the input.

subsets of data points
• very fast
• significantly sub-optimal

basis function centres
• set them equal to a random subset of TS data (use as initial conditions)
• start with all data points as kernel centres and then selectively remove centres in such a way as to have minimum disruption on the system performance

basis function widths
• all equal and set to some multiple of the average distance between the kernel centres (overlap for smoothing)
• determined from the average distance of each kernel to its L nearest neighbours, where L is typically small
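The two width heuristics can be sketched as follows. This is an illustration under assumptions: "average distance between the kernel centres" is read as the mean over all distinct pairs, the multiple 2.0 and L = 2 are arbitrary choices, and the function names are hypothetical.

```python
import numpy as np

def pairwise_distances(centres):
    """Matrix of Euclidean distances between all pairs of centres."""
    diff = centres[:, None, :] - centres[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))

def common_width(centres, multiple=2.0):
    """One width for all kernels: a multiple of the mean inter-centre distance."""
    d = pairwise_distances(centres)
    off_diagonal = d[~np.eye(centres.shape[0], dtype=bool)]
    return multiple * off_diagonal.mean()

def nearest_neighbour_widths(centres, n_neighbours=2):
    """Per-kernel width: mean distance to its L nearest other centres."""
    d_sorted = np.sort(pairwise_distances(centres), axis=1)
    # Column 0 is the zero distance of each centre to itself, so skip it.
    return d_sorted[:, 1:n_neighbours + 1].mean(axis=1)

rng = np.random.default_rng(0)                        # arbitrary seed (assumption)
data = rng.normal(size=(200, 2))
centres = data[rng.choice(200, size=10, replace=False)]   # random subset of the TS
sigma_common = common_width(centres)
sigma_local = nearest_neighbour_widths(centres, n_neighbours=2)
```

The per-kernel version gives narrow kernels where centres are dense and wide kernels where they are sparse, while the common width keeps the network to a single width parameter.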

K-means (basic ISODATA) clustering algorithm

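Suppose there are N data points $\mathbf{x}^n$ in total; the algorithm tries to find a set of K representative vectors $\boldsymbol{\mu}_j$, where j = 1, ..., K. The number K is decided in advance.

It seeks to partition the data points into K disjoint subsets $S_j$ containing $N_j$ data points in such a way as to minimize the sum-of-squares clustering function given by:

$J = \sum_{j=1}^{K} \sum_{n \in S_j} \|\mathbf{x}^n - \boldsymbol{\mu}_j\|^2$

where $\boldsymbol{\mu}_j$ is the mean of the data points in set $S_j$ and is given by:

$\boldsymbol{\mu}_j = \dfrac{1}{N_j} \sum_{n \in S_j} \mathbf{x}^n$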

batch (Lloyd, 1982)

It begins by assigning the points at random to K sets and then computing the mean vectors of the points in each set. Next, each point is re-assigned to a new set according to which is the nearest mean vector. The means of the sets are then recomputed. This procedure is repeated until there is no further change in the grouping of the data points.

At each such iteration, the value of J will not increase.
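The batch procedure described above can be sketched as follows. This is a minimal illustration: the handling of an empty set (re-seeding its mean with a random data point), the iteration cap and the two-cluster test data are assumptions not taken from the slides.

```python
import numpy as np

def kmeans_batch(x, k, rng, max_iter=100):
    """Batch K-means; returns means, labels and the value of J per iteration."""
    n = x.shape[0]
    labels = rng.integers(k, size=n)                 # random initial assignment
    history = []
    for _ in range(max_iter):
        means = np.empty((k, x.shape[1]))
        for j in range(k):
            members = x[labels == j]
            # Assumption: an empty set is re-seeded with a random data point.
            means[j] = members.mean(axis=0) if len(members) else x[rng.integers(n)]
        # Re-assign every point to the set with the nearest mean vector.
        dist2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
        new_labels = dist2.argmin(axis=1)
        history.append(dist2[np.arange(n), new_labels].sum())   # J after re-assignment
        if np.array_equal(new_labels, labels):       # grouping unchanged: stop
            break
        labels = new_labels
    return means, labels, history

rng = np.random.default_rng(0)                       # arbitrary seed (assumption)
x = np.vstack([rng.normal(0.0, 0.5, (100, 2)),
               rng.normal(5.0, 0.5, (100, 2))])
means, labels, history = kmeans_batch(x, k=2, rng=rng)
print("J per iteration:", np.round(history, 2))
```

The recorded values of J form a non-increasing sequence, in line with the statement above.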

Figure: K-means example with K = 2.

sequential (MacQueen, 1967)

The initial centres are randomly chosen from the data points, and as each data point $\mathbf{x}^n$ is presented, the nearest $\boldsymbol{\mu}_j$ is updated using:

$\Delta\boldsymbol{\mu}_j = \eta\,(\mathbf{x}^n - \boldsymbol{\mu}_j)$

where $\eta$ is the learning rate. This is a Robbins-Monro procedure for finding the root of a regression function given by the derivative of J w.r.t. $\boldsymbol{\mu}_j$.

Once the centres of the kernels have been found, the covariance matrices of the kernels can be set to the covariances of the points assigned to the corresponding clusters.
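A sketch of the sequential version, together with the covariance step, is shown below. The learning rate 0.05, the number of passes through the data, the random presentation order and the function names are assumptions made for the example.

```python
import numpy as np

def kmeans_sequential(x, k, rng, eta=0.05, n_epochs=5):
    """On-line K-means: move the nearest centre towards each presented point."""
    centres = x[rng.choice(x.shape[0], size=k, replace=False)].copy()
    for _ in range(n_epochs):
        for n in rng.permutation(x.shape[0]):        # random presentation order
            j = np.argmin(((centres - x[n]) ** 2).sum(axis=1))   # nearest centre
            centres[j] += eta * (x[n] - centres[j])               # update rule
    return centres

def cluster_covariances(x, centres):
    """Covariance of the points assigned to each centre (needs >= 2 points each)."""
    d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    return [np.cov(x[labels == j], rowvar=False) for j in range(len(centres))]

rng = np.random.default_rng(0)                       # arbitrary seed (assumption)
x = np.vstack([rng.normal(0.0, 0.5, (100, 2)),
               rng.normal(5.0, 0.5, (100, 2))])
centres = kmeans_sequential(x, k=2, rng=rng)
print(np.round(centres, 2))
```

With a fixed learning rate the centres keep fluctuating around their final positions; a rate that decreases over time is the usual remedy in Robbins-Monro schemes.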

homework

Write a numerical implementation of the K-means clustering algorithm using both the batch and on-line versions. Illustrate the operation of the algorithm by generating data sets in two dimensions from a mixture of Gaussian distributions, and plotting the data points together with the trajectories of the estimated means during the course of the algorithm. Investigate how the results depend on the value of K in relation to the number of Gaussian distributions, and how they depend on the variances of the distributions in relation to their separation. Study the performance of the on-line version of the algorithm for different values of the learning rate parameter and compare the algorithm with the batch version.

Gaussian mixture models

Figure: input data pdf and basis functions.

The basis functions of the RBF can be regarded as the components of a mixture density model whose parameters are to be optimized by ML (maximum likelihood).

The likelihood is maximized with respect to:
• P(j)
• kernel parameters

Once the mixture model has been optimized, the mixing coefficients P(j) can be discarded, and the basis functions then used in the RBF in which the second-layer weights are found by supervised training.
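One common way to carry out the maximum-likelihood optimization is the EM algorithm. The sketch below fits a mixture of spherical Gaussians; the initialization, the fixed number of iterations, the test data and the function names are assumptions, and no safeguard against collapsing variances is included.

```python
import numpy as np

def log_component_densities(x, means, variances):
    """log p(x^n | j) for spherical Gaussians, shape (N, M)."""
    d = x.shape[1]
    d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    return -0.5 * d * np.log(2 * np.pi * variances) - d2 / (2 * variances)

def em_spherical_gmm(x, m, rng, n_iter=50):
    n, d = x.shape
    means = x[rng.choice(n, size=m, replace=False)]  # centres start on data points
    variances = np.full(m, x.var())
    mixing = np.full(m, 1.0 / m)                     # mixing coefficients P(j)
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: posterior probabilities P(j | x^n).
        log_joint = log_component_densities(x, means, variances) + np.log(mixing)
        log_px = np.logaddexp.reduce(log_joint, axis=1)
        log_likelihoods.append(log_px.sum())
        resp = np.exp(log_joint - log_px[:, None])
        # M-step: re-estimate centres, variances and mixing coefficients.
        nj = resp.sum(axis=0)
        means = (resp.T @ x) / nj[:, None]
        d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
        variances = (resp * d2).sum(axis=0) / (d * nj)
        mixing = nj / n
    return means, variances, mixing, log_likelihoods

rng = np.random.default_rng(0)                       # arbitrary seed (assumption)
x = np.vstack([rng.normal(0.0, 0.5, (100, 2)),
               rng.normal(4.0, 0.5, (100, 2))])
means, variances, mixing, log_likelihoods = em_spherical_gmm(x, 2, rng)
```

After fitting, `means` and `variances` would supply the basis function centres and widths, while `mixing` is discarded as described above.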

K-means as a particular limit of the EM optimization of a Gaussian mixture model

basis function optimization (supervised training)

The basis function parameters for regression can be found by treating the kernel centres and widths, along with the second-layer weights, as adaptive parameters to be determined by minimization of an error function.

• sum-of-squares error
• spherical Gaussian basis functions
• non-linear, computationally intensive optimization problem
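Fully supervised training can be sketched with plain batch gradient descent. The example assumes a single output, spherical Gaussian basis functions and the sum-of-squares error $E = \frac{1}{2}\sum_n (y(\mathbf{x}^n) - t^n)^2$; the learning rate, the number of steps, the data and the function names are hypothetical choices.

```python
import numpy as np

def forward(x, centres, widths, weights, bias):
    """Network output plus intermediate quantities reused by the gradients."""
    diff = x[:, None, :] - centres[None, :, :]       # (N, M, d): x^n - mu_j
    d2 = (diff**2).sum(axis=-1)                      # (N, M): squared distances
    phi = np.exp(-d2 / (2.0 * widths**2))            # Gaussian activations
    return phi @ weights + bias, phi, diff, d2

def sum_of_squares(y, t):
    return 0.5 * np.sum((y - t) ** 2)

def gradients(x, t, centres, widths, weights, bias):
    """Derivatives of the error w.r.t. centres, widths, weights and bias."""
    y, phi, diff, d2 = forward(x, centres, widths, weights, bias)
    err = y - t                                      # (N,)
    common = err[:, None] * phi * weights[None, :]   # (N, M)
    g_weights = phi.T @ err
    g_bias = err.sum()
    g_centres = (common[:, :, None] * diff).sum(axis=0) / widths[:, None] ** 2
    g_widths = (common * d2).sum(axis=0) / widths**3
    return g_centres, g_widths, g_weights, g_bias

rng = np.random.default_rng(0)                       # arbitrary seed (assumption)
x = rng.uniform(0.0, 1.0, size=(30, 1))
t = 0.5 + 0.4 * np.sin(2 * np.pi * x[:, 0])
centres = x[rng.choice(30, size=5, replace=False)].copy()
widths = np.full(5, 0.2)
weights = np.zeros(5)
bias = 0.0
lr = 0.002                                           # hypothetical learning rate
for _ in range(500):
    g_c, g_s, g_w, g_b = gradients(x, t, centres, widths, weights, bias)
    centres -= lr * g_c
    widths -= lr * g_s
    weights -= lr * g_w
    bias -= lr * g_b
```

Every step needs the activations of all kernels for all data points and the error surface is non-linear in the centres and widths, so this is considerably more expensive than the two-stage procedure with a linear solve.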

well localized kernels

With well localized kernels, a given input produces significant activation only for a small fraction of kernels. Training procedures can therefore be sped up significantly by identifying the relevant kernels and avoiding unnecessary computation.

coarse (unsupervised) to fine (supervised)

no guarantee that the basis functions will remain localized!
