ERROR FUNCTIONS
part one
Goal for REGRESSION: to model the conditional distribution of the output variables, conditioned on the input variables.
Goal for CLASSIFICATION: to model the posterior probabilities of class membership, conditioned on the input variables.
ERROR FUNCTIONS
Basic goal for TRAINING: to model the underlying generator of the data, so that the trained network generalizes to new data.
The most general and complete description of the generator of the data is in terms of the probability density p(x,t) in the joint input-target space.
For a set of training data {x^n, t^n} drawn independently from the same distribution, the likelihood is L = Π_n p(t^n | x^n) p(x^n). Taking the negative logarithm and dropping the term in p(x^n) (which is independent of the network parameters) gives the error function

E = −Σ_n ln p(t^n | x^n)

The conditional density p(t | x) is modelled by the feed-forward neural network. The targets may be:
• continuous (regression: prediction)
• continuous (classification: probability of class membership)
• discrete (classification: class membership)

The form chosen for p(t | x) determines the error function.
Sum-of-squares error: OLS approach
Assumptions:
• c target variables t_k
• the distributions of the target variables are independent
• the distributions of the target variables are Gaussian: t_k = y_k(x; w) + ε_k, with ε_k ~ N(0, σ²), where σ doesn't depend on x or on k
Sum-of-squares error
Minimize E w.r.t. the weights w; w* denotes the minimum. Then minimize the negative log-likelihood, computed at w*, w.r.t. σ. The optimum value of σ² is proportional to the residual value of the sum-of-squares error function at its minimum:

σ² = (1/(Nc)) Σ_n Σ_k { y_k(x^n; w*) − t_k^n }²
Of course, the use of a sum-of-squares error doesn’t require the target data to have a Gaussian distribution. However, if we use this error, then the results cannot distinguish between the true distribution and any other distribution having the same mean and variance.
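A minimal numeric sketch of this result (not from the slides; array shapes and the function name are illustrative): the ML estimate of σ² is just the mean squared residual at w*, i.e. proportional to the residual sum-of-squares error.

```python
import numpy as np

def sos_error_and_noise_variance(y, t):
    """Sum-of-squares error and ML noise variance at the minimum w*.

    y : (N, c) array of network outputs y_k(x^n; w*)
    t : (N, c) array of targets t_k^n
    """
    N, c = t.shape
    residuals = y - t
    E = 0.5 * np.sum(residuals ** 2)            # sum-of-squares error
    sigma2 = np.sum(residuals ** 2) / (N * c)   # sigma^2 = 2 E / (N c)
    return E, sigma2
```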
Sum-of-squares error
training / validation
The error can be evaluated over all N patterns in the training set and over all N′ patterns in the test set. A convenient normalized form is

E = Σ_n ||y(x^n; w*) − t^n||² / Σ_n ||t^n − t̄||²

(with t̄ the mean target vector over the corresponding set):
E = 0: perfect prediction of the test data; E = 1: the network is predicting the test data in the mean.
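A small sketch of this normalized error, assuming outputs and targets for a given set (training or test) are stored as NumPy arrays; names are illustrative:

```python
import numpy as np

def normalized_sos_error(y, t):
    """Normalized sum-of-squares error over a data set.

    Returns 0 for perfect prediction and 1 when the network does no
    better than predicting the mean target of the set.
    """
    t_mean = t.mean(axis=0, keepdims=True)
    return np.sum((y - t) ** 2) / np.sum((t - t_mean) ** 2)
```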
linear output units
With linear output units, the sum-of-squares error can be minimized exactly w.r.t. the final-layer weights W^(L) (fast, by OLS), while the remaining weights w̃ must be found by nonlinear optimization (slow). This suggests minimizing the reduced error

Ẽ(w̃) = min over W^(L) of E(w̃, W^(L))

over w̃ only: a reduction of the number of iterations (smaller search space), at a greater cost per iteration.
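As a rough illustration of the reduced-error idea (an assumption-laden sketch, not the slides' code): for fixed hidden-layer weights, the optimal final linear layer can be obtained directly by ordinary least squares.

```python
import numpy as np

def reduced_error(Z, T):
    """E~(w~): with the hidden-layer weights w~ fixed, solve the final
    linear layer exactly by OLS and return the resulting s-o-s error.

    Z : (N, H) matrix of hidden-unit activations (bias column included)
    T : (N, c) matrix of targets
    """
    W, *_ = np.linalg.lstsq(Z, T, rcond=None)   # exact final-layer solution
    residuals = Z @ W - T
    return 0.5 * np.sum(residuals ** 2), W
```

The outer (nonlinear) optimizer then searches only over the hidden-layer weights, which is the smaller search space mentioned above; each evaluation of Ẽ is more expensive because it contains a least-squares solve.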
linear output units
Suppose the TS target patterns satisfy an exact linear relation:

u^T t^n + u_0 = 0

If the final-layer weights are determined by OLS, then the outputs will satisfy the same linear constraint for arbitrary input patterns.
Interpretation of network outputs
For a network trained by minimizing a sum-of-squares error function, the outputs approximate the conditional averages of the target data
Consider the limit in which the size N of the TS goes to infinity
Interpretation of network outputs
KEY ASSUMPTIONS
• the TS must be sufficiently large that it approximates an infinite TS
• the network must be sufficiently general that there exist weights attaining the minimum
• the training procedure must find the appropriate minimum of the cost
This result doesn't depend on the choice of network architecture, or even on using a neural network at all. However, ANNs provide a framework for approximating arbitrary nonlinear multivariate mappings and can therefore, in principle, approximate the conditional average to arbitrary accuracy.
Interpretation of network outputs
example
t = x + 0.3 sin(2πx) + ε
where ε is a RV drawn from a uniform distribution in the range (−0.1, 0.1)
• MLP 1-5-1
• sum-of-squares error
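A data-generation sketch for this example, under the assumption that the noisy mapping is t = x + 0.3 sin(2πx) + ε as reconstructed above (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 300
x = rng.uniform(0.0, 1.0, size=N)
eps = rng.uniform(-0.1, 0.1, size=N)          # uniform noise in (-0.1, 0.1)
t = x + 0.3 * np.sin(2 * np.pi * x) + eps

# A 1-5-1 MLP trained on (x, t) with a sum-of-squares error would
# approximate the conditional average <t|x> = x + 0.3*sin(2*pi*x).
```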
Interpretation of network outputs
The sum-of-squares error function cannot distinguish between the true distribution and a Gaussian distribution having the same x-dependent mean and average variance.
ERROR FUNCTIONS
part two
Goal for REGRESSION: to model the conditional distribution of the output variables, conditioned on the input variables.
Goal for CLASSIFICATION: to model the posterior probabilities of class membership, conditioned on the input variables.
P(C_k | x)
We can exploit a number of results ...
Minimum error-rate decisions
Since the outputs approximate the posterior probabilities P(C_k | x), assigning each input to the class with the largest output gives minimum error-rate decisions. Note that the network outputs need not be close to 0 or 1 if the class-conditional density functions are overlapping.
We can exploit a number of results ...
Minimum error-rate decisions
Outputs sum to 1
The sum-to-1 constraint can be enforced explicitly as part of the choice of network structure.
The average of each output over all patterns in the TS should approximate the corresponding prior class probabilities.
These estimated priors can be compared with the sample estimates of the priors obtained from the fractions of patterns in each class within the TS. Differences are an indication that the network is not modelling the posterior probabilities accurately.
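A sketch of this check (function and variable names are illustrative): average the outputs over the TS and compare with the class fractions.

```python
import numpy as np

def check_priors(outputs, labels, n_classes):
    """Compare the TS average of each output with the sample priors.

    outputs : (N, c) array of network outputs (approximate posteriors)
    labels  : (N,) array of integer class labels
    """
    estimated_priors = outputs.mean(axis=0)
    sample_priors = np.bincount(labels, minlength=n_classes) / len(labels)
    # Large differences suggest the posteriors are not modelled accurately.
    return estimated_priors, sample_priors
```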
We can exploit a number of results ...
Minimum error-rate decisions
Outputs sum to 1
Compensating for different priors
Using Bayes' theorem with the priors of the TS (subscript 1) and the priors expected in use (subscript 2):

P_1(C_k | x) = p(x | C_k) P_1(C_k) / p_1(x)
P_2(C_k | x) = p(x | C_k) P_2(C_k) / p_2(x)

Dividing one by the other:

P_2(C_k | x) = P_1(C_k | x) [P_2(C_k) / P_1(C_k)] [p_1(x) / p_2(x)]

i.e. the network outputs are divided by the TS priors and multiplied by the new priors; p_1(x)/p_2(x) is a common normalization factor.
Case: priors expected when the network is in use differ from those represented by the TS.
• subscript 1 = TS
• subscript 2 = use
Changes in priors can be accommodated without retraining.
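A minimal sketch of this compensation, assuming the network outputs approximate the posteriors under the TS priors (names are illustrative):

```python
import numpy as np

def compensate_priors(posteriors, train_priors, use_priors):
    """Divide by the TS priors, multiply by the new priors, then
    renormalize; the p1(x)/p2(x) factor is common to all classes."""
    adjusted = posteriors * (np.asarray(use_priors) / np.asarray(train_priors))
    return adjusted / adjusted.sum(axis=-1, keepdims=True)
```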
⟨t_k | x⟩ = ∫ t_k p(t_k | x) dt_k
sum-of-squares for classification
every input vector in the TS is labelled by its class membership, represented by a set of target values t_k^n

1-of-c coding:  t_k^n = δ_kl  when x^n ∈ C_l

p(t_k | x) = Σ_{l=1}^{c} δ(t_k − δ_kl) P(C_l | x)

t_k is a discrete RV, taking the value 1 with probability P(C_k | x) and the value 0 with probability 1 − P(C_k | x)
sum-of-squares for classification

With the same 1-of-c coding and the distribution p(t_k | x) above, the network outputs trained with sum-of-squares approximate

y_k(x) = ⟨t_k | x⟩ = ∫ t_k p(t_k | x) dt_k = P(C_k | x)

i.e. the outputs approximate the posterior probabilities of class membership
sum-of-squares for classification
if the outputs are to represent probabilities y_k(x) = P(C_k | x), they should lie in the range (0,1) and should sum to 1
in the case of a 1-of-c coding scheme, the target values sum to unity for each pattern, and so the outputs will satisfy the same constraint
no guarantee that the outputs lie in the range (0,1)
for a network with linear output units and s-o-s error, if the target values satisfy a linear constraint, then the outputs will satisfy the same constraint for an arbitrary input
The s-o-s error is not the most appropriate for classification because it is derived from ML on the assumption of Gaussian distributed target data.
sum-of-squares for classification
two class problem
1-of-c coding: two output units
alternative approach
target coding:
t^n = 1  if x^n ∈ C_1
t^n = 0  if x^n ∈ C_2

single output:
p(t | x) = δ(t − 1) P(C_1 | x) + δ(t) P(C_2 | x)
y(x) = P(C_1 | x),   P(C_2 | x) = 1 − y(x)
Interpretation of hidden units
linear output units
For linear output units,

y_k^n = Σ_j w_kj z̃_j^n

where z̃_j^n are the activations at the output of the final hidden layer. Minimizing the sum-of-squares error w.r.t. the final-layer weights gives the pseudo-inverse solution

W^T = Z̃† T̃

where Z̃ has elements z̃_j^n and T̃ has elements t̃_k^n (tildes denote values with their TS means subtracted).
Total covariance matrix for the activations at the output of the final hidden layer w.r.t. TS
Interpretation of hidden units
linear output units
minimizing E w.r.t. the final-layer weights  ⇔  maximizing a discriminant criterion J defined on the hidden-unit activations
Nothing here is specific to the MLP, or indeed to ANNs. The same result is obtained regardless of the functions z_j (of the weights), and it applies to any generalized linear discriminant in which the kernels are adaptive.
Interpretation of hidden units
linear output units
The weights in the final layer are adjusted to produce an optimum discrimination of the classes of input vectors by means of a linear transformation. Minimizing the error of this linear discriminant requires that the input data undergo a nonlinear transformation into the space spanned by the activations of the hidden units, in such a way as to maximize the discriminant function J.
Interpretation of hidden units
linear output units
With 1-of-c targets, this feature-extraction criterion is strongly weighted in favour of classes with larger numbers of patterns.
Cross-entropy for two classes
single output
wanted:  y(x) = P(C_1 | x),   1 − y(x) = P(C_2 | x)
target coding scheme:
t^n = 1  if x^n ∈ C_1
t^n = 0  if x^n ∈ C_2

cross-entropy error function:
E = −Σ_n { t^n ln y^n + (1 − t^n) ln(1 − y^n) }
• Hopfield (1987)
• Baum and Wilczek (1988)
• Solla et al. (1988)
• Hinton (1989)
• Hampshire and Pearlmutter (1990)
Cross-entropy for two classes
Absolute minimum:  y^n = t^n  for all n
logistic activation function for the output:
y = g(a) = 1 / (1 + e^(−a)),   g′(a) = g(a)(1 − g(a))
BP: with the logistic output unit, the derivative of the error w.r.t. the output-unit activation takes the simple form ∂E^n/∂a^n = y^n − t^n
Natural pairing:
• sum-of-squares + linear output units
• cross-entropy + logistic output unit
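A compact sketch of this pairing (illustrative values, not from the slides): the cross-entropy error with a logistic output unit, and the resulting simple gradient y − t w.r.t. the output-unit activation.

```python
import numpy as np

def logistic(a):
    """Logistic output unit y = g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def binary_cross_entropy(y, t, eps=1e-12):
    """E = -sum_n { t ln y + (1 - t) ln(1 - y) } for 0/1 targets."""
    y = np.clip(y, eps, 1 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

a = np.array([-2.0, 0.5, 3.0])    # output-unit activations (illustrative)
t = np.array([0.0, 1.0, 1.0])     # 0/1 targets
grad_wrt_a = logistic(a) - t      # dE/da = y - t, the "natural pairing"
```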
Cross-entropy for two classes
With the 0/1 target coding, the error at its absolute minimum is E = 0.
It doesn't vanish when t^n is continuous in the range (0,1), representing the probability of the input x^n belonging to class C_1; in that case the error can be rewritten, by subtracting its minimum value, as
E = −Σ_n { t^n ln(y^n / t^n) + (1 − t^n) ln((1 − y^n) / (1 − t^n)) }
whose minimum is at 0.
[Figure: class-conditional pdf's used to generate the TS (equal priors); dashed = Bayes]
Cross-entropy for two classes: example
MLP:
• one input unit
• five hidden units (tanh)
• one output unit (logistic)
• cross-entropy error
• BFGS optimization
sigmoid activation functions

single-layer: if the class-conditional densities p(x | C_k) belong to the exponential family of distributions (e.g. Gaussian, binomial, Bernoulli, Poisson), the posterior probability is a logistic sigmoid acting on a linear function of the inputs:
P(C_1 | x) = g(w^T x + w_0),   g(a) = 1 / (1 + e^(−a))

multi-layer (hidden units → output):
The network output is given by a logistic sigmoid activation function acting on a weighted linear combination of the outputs of those hidden units which send connections to the output unit.
Extension to the hidden units: provided such units use logistic sigmoids, their outputs can be interpreted as probabilities of the presence of corresponding features conditioned on the inputs to the units.
properties of the cross-entropy error
ε^n ≡ y^n − t^n
the error function depends on the relative errors of the outputs(its minimization tends to result in similar relative errors on both small and large targets)
the s-o-s error function depends on the absolute errors(its minimization tends to result in similar absolute errors for each pattern)
the cross-entropy error function performs better than s-o-s at estimating small probabilities
properties of the cross-entropy error
ε^n ≡ y^n − t^n
target coding scheme:
t^n = 1  if x^n ∈ C_1
t^n = 0  if x^n ∈ C_2

With these 0/1 targets the cross-entropy can be written E = −Σ_n ln(1 − |ε^n|), which for small ε^n reduces to the Manhattan error function E ≈ Σ_n |ε^n|.

Compared with s-o-s: much stronger weight to smaller errors; better (more robust) for incorrectly labelled data.
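A tiny numerical illustration of this weighting difference (values are made up): per-pattern contributions of the sum-of-squares and cross-entropy errors as a function of |ε| = |y − t| for 0/1 targets.

```python
import numpy as np

eps = np.array([0.01, 0.1, 0.5, 0.9])   # |y - t| for a few patterns
sos = 0.5 * eps ** 2                    # sum-of-squares contribution
xent = -np.log(1.0 - eps)               # cross-entropy contribution (~|eps| when small)

# For small |eps| the cross-entropy term behaves like |eps|, which is much
# larger than eps^2/2: small errors receive relatively more weight.
```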
justification of the cross-entropy error
infinite data limit
set the functional derivative w.r.t. y(x) to zero
y(x) = ⟨t | x⟩

As for s-o-s, the output of the network approximates the conditional average of the target data for the given input.
Multiple independent attributes
Determine the probabilities of the presence or absence of a number of attributes (which need not be mutually exclusive).
Assumption: independent attributes
multiple outputs: y_k represents the probability that the kth attribute is present
The conditional distribution of the targets is

p(t | x) = Π_{k=1}^{c} y_k^{t_k} (1 − y_k)^{1 − t_k}

giving the error function

E = −Σ_n Σ_k { t_k^n ln y_k^n + (1 − t_k^n) ln(1 − y_k^n) }

With this choice of error function, the outputs should each have a logistic sigmoid activation function.
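A direct sketch of this error function for independent attributes (array shapes are illustrative):

```python
import numpy as np

def multilabel_cross_entropy(y, t, eps=1e-12):
    """E = -sum_n sum_k { t_k ln y_k + (1 - t_k) ln(1 - y_k) },
    where each sigmoid output y_k is the probability that the
    k-th attribute is present.

    y : (N, c) array of logistic-sigmoid outputs
    t : (N, c) array of 0/1 attribute targets
    """
    y = np.clip(y, eps, 1 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```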
Multiple independent attributes
HOME WORK
Show that the entropy measure E, derived for targets t_k^n = 0, 1, applies also in the case where the targets are probabilities with values in (0,1). Do this by considering an extended data set in which each pattern t_k^n is replaced by a set of M patterns, of which M t_k^n (a fraction t_k^n) are set to 1 and the remainder are set to 0, and then applying E to this extended TS.
Cross-entropy for multiple classes
mutually exclusive classes
1-of-c coding:  t_k^n = δ_kl  when x^n ∈ C_l

With one output y_k for each class, the probability of observing the set of target values t_k^n = δ_kl, given an input vector x^n, is just

p(t^n | x^n) = Π_k (y_k^n)^{t_k^n}

leading to the cross-entropy error function

E = −Σ_n Σ_k t_k^n ln y_k^n
The {y_k} are not independent, as a result of the constraint Σ_k y_k = 1
The absolute minimum w.r.t. {y_k^n} occurs when y_k^n = t_k^n for all k, n
Subtracting this minimum value gives E = −Σ_n Σ_k t_k^n ln(y_k^n / t_k^n), a discrete Kullback-Leibler distance
If the output values are to be interpreted as probabilities, they must lie in the range (0,1) and sum to unity.
normalized exponential (softmax):  y_k = exp(a_k) / Σ_l exp(a_l), a generalization of the logistic sigmoid
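A standard softmax sketch (the max-shift is a common numerical-stability trick, not something the slides discuss):

```python
import numpy as np

def softmax(a):
    """Normalized exponential: y_k = exp(a_k) / sum_l exp(a_l)."""
    a = a - np.max(a, axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / np.sum(e, axis=-1, keepdims=True)
```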
Cross-entropy for multiple classes
As with the logistic sigmoid, we can give a general motivation for the softmax by considering the posterior probability that a hidden unit activation z belongs to class C_k: if p(z | C_k) belongs to the exponential family, the posterior takes the softmax form with a_k = w_k^T z + w_k0, where w_k = θ_k and w_k0 = A(θ_k) + ln P(C_k). The outputs can then be interpreted as probabilities of class membership, conditioned on the outputs of the hidden units.
Cross-entropy for multiple classes: BP training
With softmax output units, the derivative of the error w.r.t. the inputs a_k to all output units takes the simple form ∂E^n/∂a_k = y_k^n − t_k^n
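An illustrative finite-difference check of this derivative (all values made up), confirming that with softmax outputs and the multi-class cross-entropy the gradient w.r.t. the output-unit inputs is y − t:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def error(a, t):
    return -np.sum(t * np.log(softmax(a)))     # E^n = -sum_k t_k ln y_k

a = np.array([0.2, -1.0, 0.7])                 # inputs to the output units
t = np.array([0.0, 0.0, 1.0])                  # 1-of-c target
analytic = softmax(a) - t                      # claimed dE/da

h = 1e-6
numeric = np.array([
    (error(a + h * np.eye(3)[k], t) - error(a - h * np.eye(3)[k], t)) / (2 * h)
    for k in range(3)
])
assert np.allclose(analytic, numeric, atol=1e-5)
```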
Natural pairing:
• sum-of-squares + linear output units
• 2-class cross-entropy + logistic output unit
• c-class cross-entropy + softmax output units
Consider the cross-entropy error function for multiple classes, together with a network whose outputs are given by a softmax activation function, in the limit of an infinite data set. Show that the network output functions y_k(x) which minimize the error are given by the conditional averages of the target data, ⟨t_k | x⟩.
Hint: since the outputs are not independent, consider the functional derivative w.r.t. a_k(x) instead.