ERROR FUNCTIONS
part one
Goal for REGRESSION: to model the conditional distribution of the output variables, conditioned on the input variables.
Goal for CLASSIFICATION: to model the posterior probabilities of class membership, conditioned on the input variables.
ERROR FUNCTIONS
Basic goal for TRAINING: to model the underlying generator of the data, so that the trained network generalizes to new data.
The most general and complete description of the generator of the data is in terms of the probability density p(x,t) in the joint input-target space.
For a set of training data {x^n, t^n} drawn independently from the same distribution, the likelihood is L = Π_n p(t^n | x^n) p(x^n). Taking the negative logarithm and dropping the term in p(x^n) (which is independent of the network parameters) gives the error function

E = −Σ_n ln p(t^n | x^n)

The conditional density p(t | x) is modelled by the feed-forward neural network. The targets may be:
• continuous (regression: prediction)
• continuous (classification: probability of class membership)
• discrete (classification: class membership)

The form chosen for p(t | x) determines the error function.
Sum-of-squares error: OLS approach
Assumptions:
• c target variables t_k
• the distributions of the target variables are independent
• the distributions of the target variables are Gaussian: t_k = y_k(x; w) + ε_k, with ε_k ~ N(0, σ²), where σ doesn't depend on x or on k
Sum-of-squares error
Minimize E w.r.t. the weights w; w* denotes the minimum. Then minimize the negative log-likelihood, computed at w*, w.r.t. σ. The optimum value of σ² is proportional to the residual value of the sum-of-squares error function at its minimum:

σ² = (1/(Nc)) Σ_n Σ_k { y_k(x^n; w*) − t_k^n }²
Of course, the use of a sum-of-squares error doesn’t require the target data to have a Gaussian distribution. However, if we use this error, then the results cannot distinguish between the true distribution and any other distribution having the same mean and variance.
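A minimal numeric sketch of this result (not from the slides; array shapes and the function name are illustrative): the ML estimate of σ² is just the mean squared residual at w*, i.e. proportional to the residual sum-of-squares error.

```python
import numpy as np

def sos_error_and_noise_variance(y, t):
    """Sum-of-squares error and ML noise variance at the minimum w*.

    y : (N, c) array of network outputs y_k(x^n; w*)
    t : (N, c) array of targets t_k^n
    """
    N, c = t.shape
    residuals = y - t
    E = 0.5 * np.sum(residuals ** 2)            # sum-of-squares error
    sigma2 = np.sum(residuals ** 2) / (N * c)   # sigma^2 = 2 E / (N c)
    return E, sigma2
```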
Sum-of-squares error
training / validation
The error can be evaluated over all N patterns in the training set and over all N′ patterns in the test set. A convenient normalized form is

E = Σ_n ||y(x^n; w*) − t^n||² / Σ_n ||t^n − t̄||²

(with t̄ the mean target vector over the corresponding set):
E = 0: perfect prediction of the test data; E = 1: the network is predicting the test data in the mean.
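A small sketch of this normalized error, assuming outputs and targets for a given set (training or test) are stored as NumPy arrays; names are illustrative:

```python
import numpy as np

def normalized_sos_error(y, t):
    """Normalized sum-of-squares error over a data set.

    Returns 0 for perfect prediction and 1 when the network does no
    better than predicting the mean target of the set.
    """
    t_mean = t.mean(axis=0, keepdims=True)
    return np.sum((y - t) ** 2) / np.sum((t - t_mean) ** 2)
```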
linear output units
With linear output units, the sum-of-squares error can be minimized exactly w.r.t. the final-layer weights W^(L) (fast, by OLS), while the remaining weights w̃ must be found by nonlinear optimization (slow). This suggests minimizing the reduced error

Ẽ(w̃) = min over W^(L) of E(w̃, W^(L))

over w̃ only: a reduction of the number of iterations (smaller search space), at a greater cost per iteration.
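As a rough illustration of the reduced-error idea (an assumption-laden sketch, not the slides' code): for fixed hidden-layer weights, the optimal final linear layer can be obtained directly by ordinary least squares.

```python
import numpy as np

def reduced_error(Z, T):
    """E~(w~): with the hidden-layer weights w~ fixed, solve the final
    linear layer exactly by OLS and return the resulting s-o-s error.

    Z : (N, H) matrix of hidden-unit activations (bias column included)
    T : (N, c) matrix of targets
    """
    W, *_ = np.linalg.lstsq(Z, T, rcond=None)   # exact final-layer solution
    residuals = Z @ W - T
    return 0.5 * np.sum(residuals ** 2), W
```

The outer (nonlinear) optimizer then searches only over the hidden-layer weights, which is the smaller search space mentioned above; each evaluation of Ẽ is more expensive because it contains a least-squares solve.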
linear output units
Suppose the TS target patterns satisfy an exact linear relation:

u^T t^n + u_0 = 0

If the final-layer weights are determined by OLS, then the outputs will satisfy the same linear constraint for arbitrary input patterns.
Interpretation of network outputs
For a network trained by minimizing a sum-of-squares error function, the outputs approximate the conditional averages of the target data
Consider the limit in which the size N of the TS goes to infinity
Interpretation of network outputs
KEY ASSUMPTIONS
• the TS must be sufficiently large that it approximates an infinite TS
• the network must be sufficiently general that there exist weights attaining the minimum
• the training procedure must find the appropriate minimum of the cost
This result doesn't depend on the choice of network architecture, or even on using a neural network at all. However, ANNs provide a framework for approximating arbitrary nonlinear multivariate mappings and can therefore, in principle, approximate the conditional average to arbitrary accuracy.
Interpretation of network outputs
example
t = x + 0.3 sin(2πx) + ε
where ε is a RV drawn from a uniform distribution in the range (−0.1, 0.1)
• MLP 1-5-1
• sum-of-squares error
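A data-generation sketch for this example, under the assumption that the noisy mapping is t = x + 0.3 sin(2πx) + ε as reconstructed above (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 300
x = rng.uniform(0.0, 1.0, size=N)
eps = rng.uniform(-0.1, 0.1, size=N)          # uniform noise in (-0.1, 0.1)
t = x + 0.3 * np.sin(2 * np.pi * x) + eps

# A 1-5-1 MLP trained on (x, t) with a sum-of-squares error would
# approximate the conditional average <t|x> = x + 0.3*sin(2*pi*x).
```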
Interpretation of network outputs
The sum-of-squares error function cannot distinguish between the true distribution and a Gaussian distribution having the same x-dependent mean and average variance.
ERROR FUNCTIONS
part two
Goal for REGRESSION: to model the conditional distribution of the output variables, conditioned on the input variables.
Goal for CLASSIFICATION: to model the posterior probabilities of class membership, conditioned on the input variables.
P(C_k | x)
We can exploit a number of results ...
Minimum error-rate decisions
Since the outputs approximate the posterior probabilities P(C_k | x), assigning each input to the class with the largest output gives minimum error-rate decisions. Note that the network outputs need not be close to 0 or 1 if the class-conditional density functions are overlapping.
We can exploit a number of results ...
Minimum error-rate decisions
Outputs sum to 1
The sum-to-1 constraint can be enforced explicitly as part of the choice of network structure.
The average of each output over all patterns in the TS should approximate the corresponding prior class probabilities.
These estimated priors can be compared with the sample estimates of the priors obtained from the fractions of patterns in each class within the TS. Differences are an indication that the network is not modelling the posterior probabilities accurately.
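A sketch of this check (function and variable names are illustrative): average the outputs over the TS and compare with the class fractions.

```python
import numpy as np

def check_priors(outputs, labels, n_classes):
    """Compare the TS average of each output with the sample priors.

    outputs : (N, c) array of network outputs (approximate posteriors)
    labels  : (N,) array of integer class labels
    """
    estimated_priors = outputs.mean(axis=0)
    sample_priors = np.bincount(labels, minlength=n_classes) / len(labels)
    # Large differences suggest the posteriors are not modelled accurately.
    return estimated_priors, sample_priors
```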
We can exploit a number of results ...
Minimum error-rate decisions
Outputs sum to 1
Compensating for different priors
Using Bayes' theorem with the priors of the TS (subscript 1) and the priors expected in use (subscript 2):

P_1(C_k | x) = p(x | C_k) P_1(C_k) / p_1(x)
P_2(C_k | x) = p(x | C_k) P_2(C_k) / p_2(x)

Dividing one by the other:

P_2(C_k | x) = P_1(C_k | x) [P_2(C_k) / P_1(C_k)] [p_1(x) / p_2(x)]

i.e. the network outputs are divided by the TS priors and multiplied by the new priors; p_1(x)/p_2(x) is a common normalization factor.
Case: priors expected when the network is in use differ from those represented by the TS.
• subscript 1 = TS
• subscript 2 = use
Changes in priors can be accommodated without retraining.
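A minimal sketch of this compensation, assuming the network outputs approximate the posteriors under the TS priors (names are illustrative):

```python
import numpy as np

def compensate_priors(posteriors, train_priors, use_priors):
    """Divide by the TS priors, multiply by the new priors, then
    renormalize; the p1(x)/p2(x) factor is common to all classes."""
    adjusted = posteriors * (np.asarray(use_priors) / np.asarray(train_priors))
    return adjusted / adjusted.sum(axis=-1, keepdims=True)
```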
⟨t_k | x⟩ = ∫ t_k p(t_k | x) dt_k
sum-of-squares for classification
every input vector in the TS is labelled by its class membership, represented by a set of target values t_k^n

1-of-c coding:  t_k^n = δ_kl  when x^n ∈ C_l

p(t_k | x) = Σ_{l=1}^{c} δ(t_k − δ_kl) P(C_l | x)

t_k is a discrete RV, taking the value 1 with probability P(C_k | x) and the value 0 with probability 1 − P(C_k | x)
sum-of-squares for classification

With the same 1-of-c coding and the distribution p(t_k | x) above, the network outputs trained with sum-of-squares approximate

y_k(x) = ⟨t_k | x⟩ = ∫ t_k p(t_k | x) dt_k = P(C_k | x)

i.e. the outputs approximate the posterior probabilities of class membership
sum-of-squares for classification
if the outputs are to represent probabilities y_k(x) = P(C_k | x), they should lie in the range (0,1) and should sum to 1
in the case of a 1-of-c coding scheme, the target values sum to unity for each pattern, and so the outputs will satisfy the same constraint
no guarantee that the outputs lie in the range (0,1)
for a network with linear output units and s-o-s error, if the target values satisfy a linear constraint, then the outputs will satisfy the same constraint for an arbitrary input
The s-o-s error is not the most appropriate for classification because it is derived from ML on the assumption of Gaussian distributed target data.
sum-of-squares for classification
two class problem
1-of-c coding: two output units
alternative approach
target coding:
t^n = 1  if x^n ∈ C_1
t^n = 0  if x^n ∈ C_2

single output:
p(t | x) = δ(t − 1) P(C_1 | x) + δ(t) P(C_2 | x)
y(x) = P(C_1 | x),   P(C_2 | x) = 1 − y(x)
Interpretation of hidden units
linear output units
For linear output units,

y_k^n = Σ_j w_kj z̃_j^n

where z̃_j^n are the activations at the output of the final hidden layer. Minimizing the sum-of-squares error w.r.t. the final-layer weights gives the pseudo-inverse solution

W^T = Z̃† T̃

where Z̃ has elements z̃_j^n and T̃ has elements t̃_k^n (tildes denote values with their TS means subtracted).
Total covariance matrix for the activations at the output of the final hidden layer w.r.t. TS
Interpretation of hidden units
linear output units
minimizing E w.r.t. the final-layer weights  ⇔  maximizing a discriminant criterion J defined on the hidden-unit activations
Nothing here is specific to the MLP, or indeed to ANNs. The same result is obtained regardless of the functions z_j (of the weights), and it applies to any generalized linear discriminant in which the kernels are adaptive.
Interpretation of hidden units
linear output units
The weights in the final layer are adjusted to produce an optimum discrimination of the classes of input vectors by means of a linear transformation. Minimizing the error of this linear discriminant requires that the input data undergo a nonlinear transformation into the space spanned by the activations of the hidden units, in such a way as to maximize the discriminant function J.
Interpretation of hidden units
linear output units
With 1-of-c targets, this feature-extraction criterion is strongly weighted in favour of classes with larger numbers of patterns.
Cross-entropy for two classes
single output
wanted:  y(x) = P(C_1 | x),   1 − y(x) = P(C_2 | x)
target coding scheme:
t^n = 1  if x^n ∈ C_1
t^n = 0  if x^n ∈ C_2

cross-entropy error function:
E = −Σ_n { t^n ln y^n + (1 − t^n) ln(1 − y^n) }
• Hopfield (1987)
• Baum and Wilczek (1988)
• Solla et al. (1988)
• Hinton (1989)
• Hampshire and Pearlmutter (1990)
Cross-entropy for two classes
Absolute minimum:  y^n = t^n  for all n
logistic activation function for the output:
y = g(a) = 1 / (1 + e^(−a)),   g′(a) = g(a)(1 − g(a))
BP: with the logistic output unit, the derivative of the error w.r.t. the output-unit activation takes the simple form ∂E^n/∂a^n = y^n − t^n
Natural pairing:
• sum-of-squares + linear output units
• cross-entropy + logistic output unit
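A compact sketch of this pairing (illustrative values, not from the slides): the cross-entropy error with a logistic output unit, and the resulting simple gradient y − t w.r.t. the output-unit activation.

```python
import numpy as np

def logistic(a):
    """Logistic output unit y = g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def binary_cross_entropy(y, t, eps=1e-12):
    """E = -sum_n { t ln y + (1 - t) ln(1 - y) } for 0/1 targets."""
    y = np.clip(y, eps, 1 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

a = np.array([-2.0, 0.5, 3.0])    # output-unit activations (illustrative)
t = np.array([0.0, 1.0, 1.0])     # 0/1 targets
grad_wrt_a = logistic(a) - t      # dE/da = y - t, the "natural pairing"
```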
Cross-entropy for two classes
With the 0/1 target coding, the error at its absolute minimum is E = 0.
It doesn't vanish when t^n is continuous in the range (0,1), representing the probability of the input x^n belonging to class C_1; in that case the error can be rewritten, by subtracting its minimum value, as
E = −Σ_n { t^n ln(y^n / t^n) + (1 − t^n) ln((1 − y^n) / (1 − t^n)) }
whose minimum is at 0.
[Figure: class-conditional pdf's used to generate the TS (equal priors); dashed = Bayes]
Cross-entropy for two classes: example
MLP:
• one input unit
• five hidden units (tanh)
• one output unit (logistic)
• cross-entropy error
• BFGS optimization
sigmoid activation functions

single-layer: if the class-conditional densities p(x | C_k) belong to the exponential family of distributions (e.g. Gaussian, binomial, Bernoulli, Poisson), the posterior probability is a logistic sigmoid acting on a linear function of the inputs:
P(C_1 | x) = g(w^T x + w_0),   g(a) = 1 / (1 + e^(−a))

multi-layer (hidden units → output):
The network output is given by a logistic sigmoid activation function acting on a weighted linear combination of the outputs of those hidden units which send connections to the output unit.
Extension to the hidden units: provided such units use logistic sigmoids, their outputs can be interpreted as probabilities of the presence of corresponding features conditioned on the inputs to the units.
properties of the cross-entropy error
ε^n ≡ y^n − t^n
the error function depends on the relative errors of the outputs(its minimization tends to result in similar relative errors on both small and large targets)
the s-o-s error function depends on the absolute errors(its minimization tends to result in similar absolute errors for each pattern)
the cross-entropy error function performs better than s-o-s at estimating small probabilities
properties of the cross-entropy error
ε^n ≡ y^n − t^n
target coding scheme:
t^n = 1  if x^n ∈ C_1
t^n = 0  if x^n ∈ C_2

With these 0/1 targets the cross-entropy can be written E = −Σ_n ln(1 − |ε^n|), which for small ε^n reduces to the Manhattan error function E ≈ Σ_n |ε^n|.

Compared with s-o-s: much stronger weight to smaller errors; better (more robust) for incorrectly labelled data.
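A tiny numerical illustration of this weighting difference (values are made up): per-pattern contributions of the sum-of-squares and cross-entropy errors as a function of |ε| = |y − t| for 0/1 targets.

```python
import numpy as np

eps = np.array([0.01, 0.1, 0.5, 0.9])   # |y - t| for a few patterns
sos = 0.5 * eps ** 2                    # sum-of-squares contribution
xent = -np.log(1.0 - eps)               # cross-entropy contribution (~|eps| when small)

# For small |eps| the cross-entropy term behaves like |eps|, which is much
# larger than eps^2/2: small errors receive relatively more weight.
```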
justification of the cross-entropy error
infinite data limit
set the functional derivative w.r.t. y(x) to zero
y(x) = ⟨t | x⟩

As for s-o-s, the output of the network approximates the conditional average of the target data for the given input.
Multiple independent attributes
Determine the probabilities of the presence or absence of a number of attributes (which need not be mutually exclusive).
Assumption: independent attributes
multiple outputs: y_k represents the probability that the kth attribute is present
The conditional distribution of the targets is

p(t | x) = Π_{k=1}^{c} y_k^{t_k} (1 − y_k)^{1 − t_k}

giving the error function

E = −Σ_n Σ_k { t_k^n ln y_k^n + (1 − t_k^n) ln(1 − y_k^n) }

With this choice of error function, the outputs should each have a logistic sigmoid activation function.
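A direct sketch of this error function for independent attributes (array shapes are illustrative):

```python
import numpy as np

def multilabel_cross_entropy(y, t, eps=1e-12):
    """E = -sum_n sum_k { t_k ln y_k + (1 - t_k) ln(1 - y_k) },
    where each sigmoid output y_k is the probability that the
    k-th attribute is present.

    y : (N, c) array of logistic-sigmoid outputs
    t : (N, c) array of 0/1 attribute targets
    """
    y = np.clip(y, eps, 1 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```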
Multiple independent attributes
HOME WORK
Show that the entropy measure E, derived for targets t_k^n = 0, 1, applies also in the case where the targets are probabilities with values in (0,1). Do this by considering an extended data set in which each pattern t_k^n is replaced by a set of M patterns, of which M t_k^n (a fraction t_k^n) are set to 1 and the remainder are set to 0, and then applying E to this extended TS.
Cross-entropy for multiple classes
mutually exclusive classes
1-of-c coding:  t_k^n = δ_kl  when x^n ∈ C_l

With one output y_k for each class, the probability of observing the set of target values t_k^n = δ_kl, given an input vector x^n, is just

p(t^n | x^n) = Π_k (y_k^n)^{t_k^n}

leading to the cross-entropy error function

E = −Σ_n Σ_k t_k^n ln y_k^n
The {y_k} are not independent, as a result of the constraint Σ_k y_k = 1
The absolute minimum w.r.t. {y_k^n} occurs when y_k^n = t_k^n for all k, n
Subtracting this minimum value gives E = −Σ_n Σ_k t_k^n ln(y_k^n / t_k^n), a discrete Kullback-Leibler distance
If the output values are to be interpreted as probabilities, they must lie in the range (0,1) and sum to unity.
normalized exponential (softmax):  y_k = exp(a_k) / Σ_l exp(a_l), a generalization of the logistic sigmoid
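A standard softmax sketch (the max-shift is a common numerical-stability trick, not something the slides discuss):

```python
import numpy as np

def softmax(a):
    """Normalized exponential: y_k = exp(a_k) / sum_l exp(a_l)."""
    a = a - np.max(a, axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / np.sum(e, axis=-1, keepdims=True)
```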
Cross-entropy for multiple classes
As with the logistic sigmoid, we can give a general motivation for the softmax by considering the posterior probability that a hidden unit activation z belongs to class C_k: if p(z | C_k) belongs to the exponential family, the posterior takes the softmax form with a_k = w_k^T z + w_k0, where w_k = θ_k and w_k0 = A(θ_k) + ln P(C_k). The outputs can then be interpreted as probabilities of class membership, conditioned on the outputs of the hidden units.
Cross-entropy for multiple classes: BP training
With softmax output units, the derivative of the error w.r.t. the inputs a_k to all output units takes the simple form ∂E^n/∂a_k = y_k^n − t_k^n
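An illustrative finite-difference check of this derivative (all values made up), confirming that with softmax outputs and the multi-class cross-entropy the gradient w.r.t. the output-unit inputs is y − t:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def error(a, t):
    return -np.sum(t * np.log(softmax(a)))     # E^n = -sum_k t_k ln y_k

a = np.array([0.2, -1.0, 0.7])                 # inputs to the output units
t = np.array([0.0, 0.0, 1.0])                  # 1-of-c target
analytic = softmax(a) - t                      # claimed dE/da

h = 1e-6
numeric = np.array([
    (error(a + h * np.eye(3)[k], t) - error(a - h * np.eye(3)[k], t)) / (2 * h)
    for k in range(3)
])
assert np.allclose(analytic, numeric, atol=1e-5)
```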
Natural pairing:
• sum-of-squares + linear output units
• 2-class cross-entropy + logistic output unit
• c-class cross-entropy + softmax output units
Consider the cross-entropy error function for multiple classes, together with a network whose outputs are given by a softmax activation function, in the limit of an infinite data set. Show that the network output functions y_k(x) which minimize the error are given by the conditional averages of the target data, ⟨t_k | x⟩.
Hint: since the outputs are not independent, consider the functional derivative w.r.t. a_k(x) instead.