
Neural Networks
Oliver Schulte - CMPT 726
Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Neural Networks

• Neural networks arise from attempts to model human/animal brains
• Many models, many claims of biological plausibility
• We will focus on multi-layer perceptrons
  • Mathematical properties rather than biological plausibility
  • For biologically motivated models, see Prof. Hadley's CMPT 418


Uses of Neural Networks

• Pros
  • Good for continuous input variables.
  • General continuous function approximators.
  • Highly non-linear.
  • Learn feature functions.
  • Good to use in continuous domains with little knowledge:
    • when you don't know good features;
    • when you don't know the form of a good functional model.
• Cons
  • Not interpretable, "black box".
  • Learning is slow.
  • Good generalization can require many data points.


Applications

There are many, many applications.
• World-champion backgammon player.
• "No Hands Across America" autonomous driving tour.
• Digit recognition with 99.26% accuracy.
• ...


Outline

Feed-forward Networks

Network Training

Error Backpropagation

Applications



Feed-forward Networks

• We have looked at generalized linear models of the form:

  \[ y(\mathbf{x}, \mathbf{w}) = f\left( \sum_{j=1}^{M} w_j \phi_j(\mathbf{x}) \right) \]

  for fixed non-linear basis functions φ(·).
• We now extend this model by allowing adaptive basis functions, and learning their parameters.
• In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of a linear combination of the inputs:

  \[ \phi_j(\mathbf{x}) = f\left( \sum \ldots \right) \]
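A minimal NumPy sketch (not from the slides) of such a generalized linear model with fixed Gaussian basis functions; gaussian_basis and glm_predict are hypothetical helpers, and the centres, width, and weights are illustrative:

import numpy as np

def gaussian_basis(x, centres, width=1.0):
    """Fixed basis functions phi_j(x) = exp(-(x - c_j)^2 / (2 * width^2)) for scalar x."""
    return np.exp(-(x - centres) ** 2 / (2.0 * width ** 2))

def glm_predict(x, w, centres, f=lambda a: a):
    """y(x, w) = f( sum_j w_j * phi_j(x) ); f is the identity for regression."""
    phi = gaussian_basis(x, centres)
    return f(w @ phi)

centres = np.linspace(-2, 2, 5)   # 5 fixed basis functions
w = np.random.randn(5)            # the weights are the only adaptive part
print(glm_predict(0.3, w, centres))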


Feed-forward Networks

• Starting with input x = (x_1, ..., x_D), construct linear combinations:

  \[ a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \]

  These a_j are known as activations.
• Pass through an activation function h(·) to get the output z_j = h(a_j).
• Model of an individual neuron (figure from Russell and Norvig, AIMA2e).
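A minimal NumPy sketch (not from the slides) of one such unit; the weight values, bias, and logistic activation are illustrative choices:

import numpy as np

# One hidden unit: a_j = sum_i w_ji * x_i + w_j0, then z_j = h(a_j).
def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, -1.2, 3.0])      # input x = (x_1, ..., x_D)
w_j = np.array([0.1, 0.4, -0.2])    # weights w_j1, ..., w_jD (illustrative)
w_j0 = 0.05                         # bias weight
a_j = w_j @ x + w_j0                # activation a_j
z_j = logistic(a_j)                 # unit output z_j = h(a_j)
print(a_j, z_j)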


Activation Functions

• Can use a variety of activation functions
  • Sigmoidal (S-shaped)
    • Logistic sigmoid 1/(1 + exp(−a)) (useful for binary classification)
    • Hyperbolic tangent tanh
  • Radial basis function z_j = Σ_i (x_i − w_ji)²
  • Softmax
    • Useful for multi-class classification
  • Identity
    • Useful for regression
  • Threshold
  • ...
• Needs to be differentiable for gradient-based learning (later)
• Can use different activation functions in each unit
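A minimal NumPy sketch (not from the slides) of several of the activation functions listed above:

import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))       # maps to (0, 1); binary classification

def hyperbolic_tangent(a):
    return np.tanh(a)                      # maps to (-1, 1)

def softmax(a):
    e = np.exp(a - np.max(a))              # shift by the max for numerical stability
    return e / e.sum()                     # normalized; multi-class classification

def identity(a):
    return a                               # regression

a = np.array([0.5, -1.0, 2.0])
print(logistic(a), hyperbolic_tangent(a), softmax(a), identity(a))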


Feed-forward Networks

[Network diagram: inputs x_0, x_1, ..., x_D; hidden units z_0, z_1, ..., z_M; outputs y_1, ..., y_K; weight labels w^(1)_MD, w^(2)_KM, w^(2)_10.]

• Connect together a number of these units into a feed-forward network (a DAG)
• The figure above shows a network with one layer of hidden units
• It implements the function:

  \[ y_k(\mathbf{x}, \mathbf{w}) = h\left( \sum_{j=1}^{M} w^{(2)}_{kj}\, h\!\left( \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \right) + w^{(2)}_{k0} \right) \]
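A minimal NumPy sketch (not from the slides) of this one-hidden-layer forward pass; the tanh hidden activation, identity output, and network sizes are illustrative choices:

import numpy as np

def forward(x, W1, b1, W2, b2, h1=np.tanh, h2=lambda a: a):
    a_hidden = W1 @ x + b1    # activations a_j
    z = h1(a_hidden)          # hidden unit outputs z_j = h(a_j)
    a_out = W2 @ z + b2       # output activations a_k
    return h2(a_out)          # network outputs y_k

D, M, K = 3, 4, 2             # input, hidden, and output dimensions (illustrative)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)   # first-layer weights and biases
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)   # second-layer weights and biases
y = forward(rng.normal(size=D), W1, b1, W2, b2)
print(y)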


Hidden Units Compute Basis Functions

[Figure legend:]
• red dots = network function
• dashed lines = hidden unit activation functions
• blue dots = data points


Network Training

• Given a specified network structure, how do we set its parameters (weights)?
• As usual, we define a criterion to measure how well our network performs, and optimize against it
• For regression, training data are (x_n, t_n) with t_n ∈ R
  • Squared error arises naturally:

  \[ E(\mathbf{w}) = \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 \]

• For binary classification, this is another discriminative model; maximum likelihood gives:

  \[ p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} \{ 1 - y_n \}^{1 - t_n} \]

  \[ E(\mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \} \]
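A minimal NumPy sketch (not from the slides) of these two error functions over a training set; the clipping constant eps is an implementation detail, not from the slides:

import numpy as np

def squared_error(y, t):
    """E(w) = sum_n (y_n - t_n)^2, used for regression."""
    return np.sum((y - t) ** 2)

def cross_entropy(y, t, eps=1e-12):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ], for binary targets."""
    y = np.clip(y, eps, 1.0 - eps)   # guard against log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

t = np.array([1.0, 0.0, 1.0])        # targets
y = np.array([0.9, 0.2, 0.7])        # network outputs
print(squared_error(y, t), cross_entropy(y, t))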


Parameter Optimization

[Figure: the error surface E(w) over weight space (w_1, w_2), with points w_A, w_B, w_C and the gradient ∇E marked.]

• For either of these problems, the error function E(w) is nasty
• Nasty = non-convex
• Non-convex = has local minima


Descent Methods

• The typical strategy for optimization problems of this sort is a descent method:

  \[ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta\mathbf{w}^{(\tau)} \]

• These come in many flavours
  • Gradient descent: step along −∇E(w^(τ))
  • Stochastic gradient descent: step along −∇E_n(w^(τ))
  • Newton-Raphson (second order): uses ∇²E
• All of these can be used here; stochastic gradient descent is particularly effective
  • Redundancy in training data, escaping local minima
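A minimal sketch (not from the slides) of the stochastic gradient descent update w^(τ+1) = w^(τ) − η ∇E_n(w^(τ)); grad_En is a hypothetical function returning the gradient of the error on the n-th training example, and the step size eta and epoch count are illustrative:

import numpy as np

def sgd(w, grad_En, N, eta=0.01, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(N):      # visit the N training points in random order
            w = w - eta * grad_En(w, n)   # descend on the single-example error E_n
    return w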


Computing Gradients

• The function y(x_n, w) implemented by a network is complicated
• It isn't obvious how to compute error function derivatives with respect to the weights
• Numerical method for calculating error derivatives: use finite differences:

  \[ \frac{\partial E_n}{\partial w_{ji}} \approx \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon} \]

• How much computation would this take with W weights in the network?
  • O(W) per derivative, O(W²) total per gradient descent step
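A minimal NumPy sketch (not from the slides) of this central-difference approximation; each of the W partial derivatives needs two evaluations of the error, which is where the O(W²) total cost comes from:

import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    """Approximate dE/dw_j for every weight by central differences."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j] += eps
        w_minus[j] -= eps
        grad[j] = (E(w_plus) - E(w_minus)) / (2.0 * eps)
    return grad

# Example: E(w) = ||w||^2 has gradient 2w.
w = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda v: np.sum(v ** 2), w))   # approx [2., -4., 1.]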



Error Backpropagation

• Backprop is an efficient method for computing the error derivatives ∂E_n/∂w_ji
  • O(W) to compute derivatives with respect to all weights
• First, feed training example x_n forward through the network, storing all activations a_j
• Calculating derivatives for weights connected to output nodes is easy
  • e.g. for linear output nodes y_k = Σ_i w_ki z_i:

  \[ \frac{\partial E_n}{\partial w_{ki}} = \frac{\partial}{\partial w_{ki}} \frac{1}{2} (y_n - t_n)^2 = (y_n - t_n)\, z_{ni} \]

• For hidden layers, propagate the error backwards from the output nodes
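A minimal NumPy sketch (not from the slides) of this output-weight derivative for a single linear output unit with squared error, checked against the finite-difference approximation from the previous slide; all values are illustrative:

import numpy as np

def output_weight_grad(w, z, t):
    """For y = w . z and E_n = 0.5 * (y - t)^2: dE_n/dw_i = (y - t) * z_i."""
    y = w @ z
    return (y - t) * z

rng = np.random.default_rng(1)
w, z, t = rng.normal(size=3), rng.normal(size=3), 0.7
analytic = output_weight_grad(w, z, t)

eps = 1e-6
numeric = np.array([
    (0.5 * ((w + eps * np.eye(3)[i]) @ z - t) ** 2
     - 0.5 * ((w - eps * np.eye(3)[i]) @ z - t) ** 2) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric))   # True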


Chain Rule for Partial Derivatives

• A "reminder"
• For f(x, y), with f differentiable w.r.t. x and y, and x and y differentiable w.r.t. u and v:

  \[ \frac{\partial f}{\partial u} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u}
     \quad\text{and}\quad
     \frac{\partial f}{\partial v} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial v} \]
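A quick worked instance (not from the slides): take f(x, y) = xy with x = u + v and y = uv. Then

  \[ \frac{\partial f}{\partial u}
     = \frac{\partial f}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u}
     = y \cdot 1 + x \cdot v
     = uv + (u + v)v
     = 2uv + v^2, \]

which agrees with differentiating the composed function f = (u + v)uv = u²v + uv² with respect to u directly.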


Error Backpropagation

• We can write

  \[ \frac{\partial E_n}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} E_n(a_{j_1}, a_{j_2}, \ldots, a_{j_m}) \]

  where {j_i} are the indices of the nodes in the same layer as node j
• Using the chain rule:

  \[ \frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}} + \sum_{k} \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{ji}} \]

  where the sum over k runs over all other nodes k in the same layer as node j
• Since a_k does not depend on w_ji, all terms in the summation go to 0:

  \[ \frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}} \]


Error Backpropagation cont.

• Introduce the error δ_j ≡ ∂E_n/∂a_j:

  \[ \frac{\partial E_n}{\partial w_{ji}} = \delta_j \frac{\partial a_j}{\partial w_{ji}} \]

• The other factor is:

  \[ \frac{\partial a_j}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \sum_{k} w_{jk} z_k = z_i \]


Error Backpropagation cont.

• The error δ_j can also be computed using the chain rule:

  \[ \delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_{k} \underbrace{\frac{\partial E_n}{\partial a_k}}_{\delta_k} \frac{\partial a_k}{\partial a_j} \]

  where the sum over k runs over all nodes k in the layer after node j
• Eventually:

  \[ \delta_j = h'(a_j) \sum_{k} w_{kj} \delta_k \]

• A weighted sum of the later errors "caused" by this unit
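A minimal NumPy sketch (not from the slides) putting the pieces together for the one-hidden-layer regression network used earlier (tanh hidden units, linear outputs, E_n = ½‖y − t‖²); the architecture and activation choices are illustrative:

import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    # forward pass, storing activations
    a1 = W1 @ x + b1
    z = np.tanh(a1)                          # hidden outputs z_j = h(a_j)
    y = W2 @ z + b2                          # linear outputs
    # output errors: delta_k = y_k - t_k for squared error with linear outputs
    delta_out = y - t
    # hidden errors: delta_j = h'(a_j) * sum_k w_kj * delta_k, with h'(a) = 1 - tanh(a)^2
    delta_hidden = (1.0 - z ** 2) * (W2.T @ delta_out)
    # gradients: dE_n/dw_ji = delta_j * z_i (or x_i for the first layer)
    grad_W2, grad_b2 = np.outer(delta_out, z), delta_out
    grad_W1, grad_b1 = np.outer(delta_hidden, x), delta_hidden
    return grad_W1, grad_b1, grad_W2, grad_b2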



Applications of Neural Networks

• Many success stories for neural networks
  • Credit card fraud detection
  • Hand-written digit recognition
  • Face detection
  • Autonomous driving (CMU ALVINN)


Hand-written Digit Recognition

• MNIST - standard dataset for hand-written digit recognition
• 60,000 training images, 10,000 test images


LeNet-5

[Architecture diagram: INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: feature maps 6@14x14 (subsampling) → C3: feature maps 16@10x10 (convolutions) → S4: feature maps 16@5x5 (subsampling) → C5: layer 120 → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections).]

• LeNet developed by Yann LeCun et al.
• Convolutional neural network
  • Local receptive fields (5x5 connectivity)
  • Subsampling (2x2)
  • Shared weights (reuse the same 5x5 "filter")
  • Breaking symmetry
• See http://www.codeproject.com/KB/library/NeuralNetRecognition.aspx
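A minimal NumPy sketch (not the actual LeNet-5 implementation) of the two building blocks named above, a 5x5 convolution with one shared filter and 2x2 average subsampling; the filter values and the tanh non-linearity are illustrative:

import numpy as np

def convolve_valid(image, kernel):
    """Slide one shared kernel over the image ('valid' region only)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def subsample_2x2(fmap):
    """Average-pool non-overlapping 2x2 blocks."""
    H, W = fmap.shape
    return fmap[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

image = np.random.rand(32, 32)                 # a 32x32 input, as in LeNet-5
kernel = np.random.randn(5, 5)                 # one shared 5x5 filter (local receptive field)
c1 = np.tanh(convolve_valid(image, kernel))    # 28x28 feature map
s2 = subsample_2x2(c1)                         # 14x14 after subsampling
print(c1.shape, s2.shape)                      # (28, 28) (14, 14)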


[Figure: the 82 test images misclassified by LeNet-5, each labelled with its true and predicted digit (0.82% test error rate).]


Conclusion

• Readings: Ch. 5.1, 5.2, 5.3
• Feed-forward networks can be used for regression or classification
  • Similar to linear models, except with adaptive non-linear basis functions
  • These allow us to go beyond, e.g., linear decision boundaries
  • Different error functions
• Learning is more difficult; the error function is not convex
  • Use stochastic gradient descent, obtain a (good?) local minimum
• Backpropagation for efficient gradient computation