Multi-layer Neural Networks
Steve Renals
Informatics 2B: Learning and Data, Lecture 13, 8 March 2011
Overview
- Multi-layer neural networks
- Multi-layer perceptrons (MLPs)
- The credit assignment problem for hidden units
- Back-propagation of error (backprop) training
Limitations of single-layer neural networks
Single-layer neural networks have many advantages:
- Easy to set up and train
- Explicit link to statistical models: shared-covariance Gaussian density functions; sigmoid output functions allow a link to posterior probabilities
- Outputs are a weighted sum of the inputs: an interpretable representation

But some big limitations:
- Can only represent a limited set of functions
- Decision boundaries must be hyperplanes
- Can only perfectly separate linearly separable data
Generalised Linear Network
Generalises linear discriminants by adding another, non-adaptive layer:

y_k(x) = \sum_{j=0}^{M} w_{kj} \phi_j(x)

- The input vector x is transformed using a set of M predefined non-linear functions, \phi_j(x), called basis functions
- This allows a much larger class of discriminant functions (in fact, such networks can approximate any continuous function to arbitrary accuracy)
- Multi-layer neural networks employ adaptive basis functions, with parameters (weights) that may be estimated from the training data
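To make the idea concrete, here is a minimal NumPy sketch of a generalised linear network. The Gaussian basis functions, their centres, and their width are illustrative assumptions (the slides do not fix a particular basis); only the weight matrix W would be estimated from data:

```python
import numpy as np

def gaussian_basis(X, centres, width=1.0):
    """Predefined basis phi_j(x) = exp(-||x - c_j||^2 / (2 width^2)),
    with a constant phi_0(x) = 1 prepended for the bias weight w_k0."""
    sq_dists = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=-1)
    phi = np.exp(-sq_dists / (2.0 * width ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), phi])

def gln_output(X, centres, W):
    """y_k(x) = sum_{j=0}^{M} w_kj phi_j(x); W has shape (K, M + 1)."""
    return gaussian_basis(X, centres) @ W.T

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))             # five 2-d input vectors
centres = rng.normal(size=(3, 2))       # M = 3 fixed basis centres
W = rng.normal(size=(2, 4))             # K = 2 outputs, M + 1 = 4 weights each
print(gln_output(X, centres, W).shape)  # (5, 2)
```

Because the \phi_j are fixed, fitting W is a linear problem in the transformed features; only in the multi-layer networks below do the basis functions themselves adapt.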
Multi-layer neural networks
- Construct more general networks by considering layers of processing units
- Unlike generalised linear discriminants, each layer has a set of adaptive weights
- Layers that are neither input nor output are referred to as hidden
- Such networks are often called multi-layer perceptrons (MLPs)
Multi-layer Perceptron
[Figure: network diagram with inputs x_0 (bias), x_1, ..., x_d; hidden units z_0 (bias), z_1, ..., z_M; outputs y_1, ..., y_K; first-layer weights w^{(1)}_{ji} (e.g. w^{(1)}_{10}, w^{(1)}_{Md}) and second-layer weights w^{(2)}_{kj} (e.g. w^{(2)}_{10}, w^{(2)}_{KM})]
Building up the MLP (1)
First we form M linear combinations of the d-dimensional inputs:

b_j = \sum_{i=0}^{d} w^{(1)}_{ji} x_i

- b_j: activations
- w^{(1)}_{ji}: first layer of weights

The activations are transformed by a nonlinear activation function h (e.g. a sigmoid):

z_j = h(b_j) = \frac{1}{1 + \exp(-b_j)}

- z_j: hidden unit outputs
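A minimal NumPy sketch of this first layer, assuming the bias is absorbed as a constant input x_0 = 1 (which is why the sum starts at i = 0):

```python
import numpy as np

def sigmoid(b):
    """Sigmoid activation h(b) = 1 / (1 + exp(-b))."""
    return 1.0 / (1.0 + np.exp(-b))

def hidden_layer(x, W1):
    """First layer: activations b_j = sum_i w1[j, i] x_i, outputs z_j = h(b_j).

    x  : input vector with bias element x_0 = 1, shape (d + 1,)
    W1 : first-layer weights, shape (M, d + 1)
    """
    b = W1 @ x        # activations
    z = sigmoid(b)    # hidden unit outputs
    return b, z
```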
Building up the MLP (2)
The outputs of the hidden units are linearly combined to give the activations of the output units:

a_k = \sum_{j=0}^{M} w^{(2)}_{kj} z_j

The output units are transformed using an activation function (e.g. a sigmoid):

y_k = g(a_k) = \frac{1}{1 + \exp(-a_k)}

For multiclass problems, a softmax may be used. Combining the two layers gives the overall forward propagation equation for the network:

y_k = g\left( \sum_{j=0}^{M} w^{(2)}_{kj} \, h\left( \sum_{i=0}^{d} w^{(1)}_{ji} x_i \right) \right)
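Putting the two layers together, a sketch of the full forward propagation under the same conventions (bias units x_0 = z_0 = 1, sigmoids for both h and g):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """y_k = g( sum_j w2[k, j] h( sum_i w1[j, i] x_i ) ).

    x  : input vector with bias element x_0 = 1, shape (d + 1,)
    W1 : input-to-hidden weights, shape (M, d + 1)
    W2 : hidden-to-output weights, shape (K, M + 1)
    """
    b = W1 @ x                               # hidden activations b_j
    z = np.concatenate(([1.0], sigmoid(b)))  # hidden outputs, bias z_0 = 1
    a = W2 @ z                               # output activations a_k
    y = sigmoid(a)                           # network outputs y_k
    return y, (b, z, a)                      # keep intermediates for backprop
```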
Training MLPs: Credit assignment
- Hidden units make training the weights more complicated
- The gradients for a single-layer neural network have a simple form:

\frac{\partial E^n}{\partial w_{ki}} = \delta_k x_i

- For a multi-layer neural network: what is the "error" of a hidden unit? How important is input-to-hidden weight w^{(1)}_{ji} to output unit k?
- Solution: we need to define derivatives of the error with respect to each weight
- Algorithm: back-propagation of error (backprop)
- Backprop gives a way to compute the derivatives; these derivatives are then used by an optimisation algorithm (e.g. gradient descent) to train the weights
Training MLPs: Error function and required gradients
Sum-of-squares error function, obtained by summing over a training set of N examples:

E = \sum_{n=1}^{N} E^n, \qquad E^n = \frac{1}{2} \sum_{k=1}^{K} (y^n_k - t^n_k)^2

To obtain the overall error gradients, we sum over the training examples:

\frac{\partial E}{\partial w_{kj}} = \sum_{n=1}^{N} \frac{\partial E^n}{\partial w_{kj}} \qquad \frac{\partial E}{\partial w_{ji}} = \sum_{n=1}^{N} \frac{\partial E^n}{\partial w_{ji}}
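A direct NumPy transcription of this error function, assuming the outputs and targets for the N examples are stacked as rows:

```python
import numpy as np

def sum_of_squares_error(Y, T):
    """E = sum_n 0.5 * sum_k (y_k^n - t_k^n)^2 for outputs Y and targets T,
    both of shape (N, K)."""
    return 0.5 * np.sum((Y - T) ** 2)
```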
Training MLPs: Hidden-to-output weights
Write E^n in terms of the hidden-to-output weights:

E^n = \frac{1}{2} \sum_{k=1}^{K} \left( g(a^n_k) - t^n_k \right)^2 = \frac{1}{2} \sum_{k=1}^{K} \left( g\left( \sum_{j=0}^{M} w_{kj} z^n_j \right) - t^n_k \right)^2

Break down the error derivatives using the chain rule:

\frac{\partial E^n}{\partial w_{kj}} = \frac{\partial E^n}{\partial a^n_k} \frac{\partial a^n_k}{\partial w_{kj}}

The term \partial E^n / \partial a^n_k is often referred to as the error signal, \delta^n_k:

\delta^n_k = \frac{\partial E^n}{\partial a^n_k} = \frac{\partial E^n}{\partial y^n_k} \cdot \frac{\partial y^n_k}{\partial a^n_k} = (y^n_k - t^n_k) \, g'(a^n_k)

Since \partial a^n_k / \partial w_{kj} = z^n_j, we obtain:

\frac{\partial E^n}{\partial w_{kj}} = \delta^n_k z^n_j

This is similar to the gradient for a single-layer neural network with a nonlinear activation function.
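A NumPy sketch of these per-example quantities, using the standard sigmoid identity g'(a) = g(a)(1 - g(a)):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def output_layer_grads(y, t, a, z):
    """Error signals delta_k = (y_k - t_k) g'(a_k) and gradients
    dE^n/dw_kj = delta_k z_j for one training example.

    y, t, a : outputs, targets, output activations, each shape (K,)
    z       : hidden outputs including bias z_0 = 1, shape (M + 1,)
    """
    g_prime = sigmoid(a) * (1.0 - sigmoid(a))  # g'(a_k) for a sigmoid g
    delta_k = (y - t) * g_prime                # output error signals
    return np.outer(delta_k, z), delta_k       # gradient matrix, shape (K, M + 1)
```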
Training MLPs: Input-to-hidden weights
To compute the error gradients for the input-to-hidden weights, we must take into account all the ways in which hidden unit j (and hence weight w_{ji}) can influence the error. Consider \delta^n_j, the error signal for hidden unit j, which sums over all the output units' contributions:

\delta^n_j = \frac{\partial E^n}{\partial b^n_j} = \sum_{k=1}^{K} \frac{\partial E^n}{\partial a^n_k} \frac{\partial a^n_k}{\partial b^n_j} = \sum_{k=1}^{K} \delta^n_k \frac{\partial a^n_k}{\partial b^n_j}

The remaining derivative is:

\frac{\partial a^n_k}{\partial b^n_j} = \frac{\partial a^n_k}{\partial z^n_j} \frac{\partial z^n_j}{\partial b^n_j} = w_{kj} \, h'(b^n_j)

Substituting in, we obtain:

\delta^n_j = h'(b^n_j) \sum_{k=1}^{K} \delta^n_k w_{kj}

This is the famous back-propagation of error (backprop) equation.
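The backprop equation as a NumPy sketch, assuming the layout of the earlier forward-pass sketch, where column 0 of W2 multiplies the constant bias unit z_0 (no error signal flows back through it):

```python
import numpy as np

def hidden_deltas(delta_k, W2, b):
    """delta_j = h'(b_j) * sum_k delta_k w2[k, j] for sigmoid hidden units.

    delta_k : output error signals, shape (K,)
    W2      : hidden-to-output weights, shape (K, M + 1)
    b       : hidden activations, shape (M,)
    """
    z = 1.0 / (1.0 + np.exp(-b))              # hidden outputs h(b_j)
    h_prime = z * (1.0 - z)                   # h'(b_j) for a sigmoid h
    return h_prime * (W2[:, 1:].T @ delta_k)  # skip bias column: z_0 is constant
```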
Back-propagation of error: hidden unit error signal

[Figure: hidden unit z_j receives input x_i through weight w^{(1)}_{ji} and feeds outputs y_1, ..., y_\ell, ..., y_K through weights w^{(2)}_{1j}, ..., w^{(2)}_{\ell j}, ..., w^{(2)}_{Kj}; the output error signals \delta_1, ..., \delta_K flow back along these weights, giving \delta_j = h'(b_j) \sum_\ell \delta_\ell w^{(2)}_{\ell j}]
Back-propagation of error
The derivatives with respect to the input-to-hidden weights can thus be evaluated using:

\frac{\partial E^n}{\partial w_{ji}} = \frac{\partial E^n}{\partial b^n_j} \frac{\partial b^n_j}{\partial w_{ji}} = \delta^n_j x^n_i

The back-propagation of error algorithm is summarised as follows:
1. Apply the N input vectors from the training set, x^n, to the network and forward propagate to obtain the set of output vectors y^n
2. Using the target vectors t^n, compute the error E
3. Evaluate the error signals \delta^n_k for each output unit
4. Evaluate the error signals \delta^n_j for each hidden unit using back-propagation of error
5. Evaluate the derivatives for each training pattern, summing to obtain the overall derivatives
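A sketch tying the five steps together, reusing the conventions of the earlier snippets (bias column x_0 = 1, sigmoid activations at both layers):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_grads(X, T, W1, W2):
    """Accumulate dE/dW1 and dE/dW2 over the N training examples.

    X : inputs with bias column x_0 = 1, shape (N, d + 1)
    T : target vectors, shape (N, K)
    """
    grad_W1, grad_W2 = np.zeros_like(W1), np.zeros_like(W2)
    for x, t in zip(X, T):
        b = W1 @ x                                   # 1. forward propagate
        z = np.concatenate(([1.0], sigmoid(b)))
        a = W2 @ z
        y = sigmoid(a)
        delta_k = (y - t) * y * (1.0 - y)            # 3. output error signals
        h_prime = z[1:] * (1.0 - z[1:])
        delta_j = h_prime * (W2[:, 1:].T @ delta_k)  # 4. backprop to hidden units
        grad_W2 += np.outer(delta_k, z)              # 5. sum per-pattern derivatives
        grad_W1 += np.outer(delta_j, x)
    return grad_W1, grad_W2
```

Step 2 (computing E itself) is omitted here, since only the derivatives are needed to update the weights; the sum_of_squares_error sketch above covers it.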
Gradient descent training
Operation of gradient descent:
1. Start with a guess for the weight matrix W (small random numbers)
2. Update the weights by adjusting the weight matrix in the direction of -\nabla_W E
3. Recompute the error, and iterate

The update for weight w_{ki} at iteration \tau + 1 is:

w^{\tau+1}_{ki} = w^{\tau}_{ki} - \eta \frac{\partial E}{\partial w_{ki}}

The parameter \eta is the learning rate.
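A sketch of the resulting training loop, assuming backprop_grads from the previous snippet; the toy data, learning rate, and iteration count are illustrative stand-ins, not values from the lecture (apart from the 6 hidden units and 2 outputs used in the example below):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M, K = 200, 2, 6, 2                             # 6 hidden units, 2 outputs
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # bias column x_0 = 1
labels = rng.integers(0, 2, size=(N, 1))              # random binary labels (toy data)
T = np.hstack([labels, 1 - labels]).astype(float)     # 1-of-K target vectors

W1 = 0.1 * rng.standard_normal((M, d + 1))            # small random initial weights
W2 = 0.1 * rng.standard_normal((K, M + 1))
eta = 0.1                                             # learning rate
for tau in range(100):
    grad_W1, grad_W2 = backprop_grads(X, T, W1, W2)
    W1 -= eta * grad_W1                               # w(tau+1) = w(tau) - eta dE/dw
    W2 -= eta * grad_W2
```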
MLP Example
- Netlab demmlp2
- MLP trained as a classifier on data from a known distribution
- Train the MLP: 6 hidden units, 2 output units
- Compare with a single-layer network (SLN)
- Classify test data
Training data
[Figure: the sampled training data (left) and the true probability density p(x) (right)]
Training data
[Figure: four panels showing the class-conditional densities p(x|red) and p(x|yellow), and the posterior probabilities p(red|x) and p(yellow|x)]
MLP Network Output

[Figure: left, the training data with the Bayes and network decision boundaries; right, the MLP network output over the input space]
SLN Network Output

[Figure: left, the training data with the Bayes and SLN decision boundaries; right, the SLN output over the input space]
Decision boundaries
[Figure: test data with the Bayes decision boundary and the SLN decision boundary]
MLP Example: confusion matrices on the test data

Optimal:  [79 28; 10 83]   81% correct
MLP:      [80 27; 13 80]   80% correct
SLN:      [79 38; 16 77]   73% correct
Summary
- Multi-layer perceptrons
- Multi-layer neural networks
- Back-propagation of error training
- Example