
Deep Neural Networks (1): Hidden layers; Back-propagation

Steve Renals

Machine Learning Practical — MLP Lecture 3, 4 October 2017 / 9 October 2017


Recap: Softmax single layer network

[Figure: a single-layer softmax network with the inputs fully connected to three output units (class 1, class 2, class 3) through a softmax output layer]


Single layer network

Single-layer network, 1 output, 2 inputs



Geometric interpretation

Single-layer network, 1 output, 2 inputs

[Figure: geometric interpretation of the single-layer network: the decision boundary y(w; x) = 0 is a line in the (x1, x2) input space; the weight vector w is normal to the boundary, which lies at a signed distance −b/||w|| from the origin]

(Bishop, sec 3.1)

Single layer network

Single-layer network, 3 outputs, 2 inputs



Example data (three classes)

[Figure: "Data": scatter plot of the example data points from the three classes in two dimensions]


Classification regions with single-layer network

[Figure: "Plot of Decision regions": the decision regions learned by the single-layer network for the three-class data, separated by straight-line boundaries]

Single-layer networks are limited to linear classification boundaries


Single layer network trained on MNIST Digits

[Figure: single-layer network for MNIST digits: 784 inputs (28x28 pixel images), 10 outputs (digits 0-9), connected by a 784x10 weight matrix]

Output weights define a “template” for each class


Hinton Diagrams

Visualise the weights for class k

[Figure: Hinton diagram layout for the weights from 400 (20x20) inputs into output unit k]


Hinton diagram for single layer network trained on MNIST

Weights for each class act as a “discriminative template”

Inner product of class weights and input to measure closeness to each template

Classify to the closest template (maximum value output)

[Figure: Hinton diagrams of the learned weights for digit classes 0, 1, 2 and 3]
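As a minimal sketch of this closest-template rule (assuming a NumPy weight matrix W of shape 10x784, with one row of template weights per digit class, and a bias vector b of length 10; the names are illustrative, not from the lecture code):

    import numpy as np

    def classify(W, b, x):
        # Single-layer "template matching": inner product of each class's
        # weight vector (a row of W) with the input x, plus the bias.
        # Softmax is monotonic, so normalising would not change the argmax.
        scores = W @ x + b          # shape (10,): one score per digit class
        return int(np.argmax(scores))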


Multi-Layer Networks


From templates to features

Good classification needs to cope with the variability of real data: scale, skew, rotation, translation, ...

Very difficult to do with a single template per class

Could have multiple templates per class... this will work, but we can do better

Use features rather than templates

(images from: Nielsen, chapter 1)

Incorporating features in neural network architecture

Layered processing: inputs - features - classification

[Figure: layered network: the inputs feed a layer of feature units, which in turn feeds the 10 output units (digits 0-9)]

How to obtain features? - learning!



Multi-layer network

[Figure: multi-layer network: inputs x_i feed a sigmoid hidden layer with pre-activations a^{(1)}_j, activations h^{(1)}_j, weights w^{(1)}_{ji} and biases b^{(1)}_j, which feeds a softmax output layer with pre-activations a^{(2)}_k, outputs y_k, weights w^{(2)}_{kj} and biases b^{(2)}_k]

y_k = \mathrm{softmax}\Big( \sum_{r=1}^{H} w^{(2)}_{kr} h^{(1)}_r + b^{(2)}_k \Big)

h^{(1)}_j = \mathrm{sigmoid}\Big( \sum_{s=1}^{d} w^{(1)}_{js} x_s + b^{(1)}_j \Big)
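A minimal NumPy sketch of this forward pass (assuming a weight matrix W1 of shape (H, d), a weight matrix W2 of shape (C, H) and bias vectors b1, b2; these names are chosen here for illustration):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def softmax(a):
        e = np.exp(a - a.max())      # subtract the max for numerical stability
        return e / e.sum()

    def forward(x, W1, b1, W2, b2):
        a1 = W1 @ x + b1             # hidden pre-activations a^(1)_j
        h1 = sigmoid(a1)             # hidden activations h^(1)_j
        a2 = W2 @ h1 + b2            # output pre-activations a^(2)_k
        return softmax(a2), h1       # outputs y_k, plus h^(1) kept for backprop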


Multi-layer network for MNIST

(image from: Michael Nielsen, Neural Networks and Deep Learning,

http://neuralnetworksanddeeplearning.com/chap1.html)

Training multi-layer networks: Credit assignment

Hidden units make training the weights more complicated, since the hidden units affect the error function indirectly via all the outputs

The credit assignment problem

what is the “error” of a hidden unit?

how important is input-hidden weight w^{(1)}_{ji} to output unit k?

what is the gradient of the error with respect to each weight?

Solution: back-propagation of error (backprop)

Backprop enables the gradients to be computed. These gradients are used by gradient descent to train the weights.


Training output weights

[Figure: hidden unit h_j connects to the output units y_1, ..., y_K through the weights w^{(2)}_{1j}, ..., w^{(2)}_{Kj}; each output unit carries an error signal g^{(2)}_k]

\frac{\partial E^n}{\partial w^{(2)}_{kj}} = g^{(2)}_k h^{(1)}_j


Training MLPs: Error function and required gradients

Cross-entropy error function:

E^n = -\sum_{k=1}^{C} t^n_k \ln y^n_k

Required gradients:

\frac{\partial E^n}{\partial w^{(2)}_{kj}}, \quad \frac{\partial E^n}{\partial w^{(1)}_{ji}}, \quad \frac{\partial E^n}{\partial b^{(2)}_k}, \quad \frac{\partial E^n}{\partial b^{(1)}_j}

Gradient for hidden-to-output weights similar to single-layer network:

\frac{\partial E^n}{\partial w^{(2)}_{kj}} = \frac{\partial E^n}{\partial a^{(2)}_k} \cdot \frac{\partial a^{(2)}_k}{\partial w^{(2)}_{kj}} = \Big( \sum_{c=1}^{C} \frac{\partial E^n}{\partial y_c} \cdot \frac{\partial y_c}{\partial a^{(2)}_k} \Big) \cdot \frac{\partial a^{(2)}_k}{\partial w^{(2)}_{kj}} = \underbrace{(y_k - t_k)}_{g^{(2)}_k}\, h^{(1)}_j
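In vectorised NumPy, the same gradient for all the hidden-to-output weights at once might look like the following sketch (y, t and h1 are 1-D arrays holding the outputs, targets and hidden activations; hypothetical names):

    import numpy as np

    def output_weight_grad(y, t, h1):
        g2 = y - t                    # error signals g^(2)_k = y_k - t_k
        grad_W2 = np.outer(g2, h1)    # dE^n/dw^(2)_kj = g^(2)_k h^(1)_j, shape (C, H)
        return g2, grad_W2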


Back-propagation of error: hidden unit error signal

[Figure: the output error signals g^{(2)}_1, ..., g^{(2)}_K are propagated back through the weights w^{(2)}_{\ell j} to form the error signal of hidden unit h_j]

g^{(1)}_j = \Big( \sum_{\ell} g^{(2)}_{\ell} w^{(2)}_{\ell j} \Big) h_j (1 - h_j)

\frac{\partial E^n}{\partial w^{(1)}_{ji}} = g^{(1)}_j x_i


Training MLPs: Input-to-hidden weights

\frac{\partial E^n}{\partial w^{(1)}_{ji}} = \underbrace{\frac{\partial E^n}{\partial a^{(1)}_j}}_{g^{(1)}_j} \cdot \underbrace{\frac{\partial a^{(1)}_j}{\partial w^{(1)}_{ji}}}_{x_i}

To compute g^{(1)}_j = \partial E^n / \partial a^{(1)}_j, the error signal for hidden unit j, we must sum over all the output units' contributions to g^{(1)}_j:

g^{(1)}_j = \sum_{c=1}^{K} \frac{\partial E^n}{\partial a^{(2)}_c} \cdot \frac{\partial a^{(2)}_c}{\partial a^{(1)}_j} = \sum_{c=1}^{K} g^{(2)}_c \cdot \frac{\partial a^{(2)}_c}{\partial h^{(1)}_j} \cdot \frac{\partial h^{(1)}_j}{\partial a^{(1)}_j} = \Big( \sum_{c=1}^{K} g^{(2)}_c w^{(2)}_{cj} \Big) h^{(1)}_j (1 - h^{(1)}_j)
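A corresponding NumPy sketch for the hidden error signals and the input-to-hidden gradient (same hypothetical naming as before; W2 has shape (K, H)):

    import numpy as np

    def hidden_weight_grad(g2, W2, h1, x):
        # g^(1)_j = (sum_c g^(2)_c w^(2)_cj) h^(1)_j (1 - h^(1)_j)
        g1 = (W2.T @ g2) * h1 * (1.0 - h1)
        grad_W1 = np.outer(g1, x)     # dE^n/dw^(1)_ji = g^(1)_j x_i, shape (H, d)
        return g1, grad_W1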


Training MLPs: Gradients

\frac{\partial E^n}{\partial w^{(2)}_{kj}} = \underbrace{(y_k - t_k)}_{g^{(2)}_k} \cdot h^{(1)}_j

\frac{\partial E^n}{\partial w^{(1)}_{ji}} = \underbrace{\Big( \sum_{c=1}^{K} g^{(2)}_c w^{(2)}_{cj} \Big) h^{(1)}_j (1 - h^{(1)}_j)}_{g^{(1)}_j} \cdot x_i

Exercise: write down expressions for the gradients w.r.t. the biases

\frac{\partial E^n}{\partial b^{(2)}_k}, \quad \frac{\partial E^n}{\partial b^{(1)}_j}



Back-propagation of error

The back-propagation of error algorithm is summarised as follows:

1 Apply an input vector from the training set, x, to the network and forward propagate to obtain the output vector y

2 Using the target vector t, compute the error E^n

3 Evaluate the error gradients g^{(2)}_k for each output unit

4 Evaluate the error gradients g^{(1)}_j for each hidden unit using back-propagation of error

5 Evaluate the derivatives for each training pattern

Back-propagation can be extended to multiple hidden layers, in each case computing the g^{(\ell)}s for the current layer as a weighted sum of the g^{(\ell+1)}s of the next layer
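Putting the five steps together for a single training pattern, a hedged NumPy sketch for one sigmoid hidden layer with softmax outputs and cross-entropy error (all names are illustrative, and the bias gradients are deliberately left out, matching the exercise above):

    import numpy as np

    def backprop_step(x, t, W1, b1, W2, b2, lr=0.1):
        # 1. Forward propagate to obtain the output vector y
        h1 = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))    # sigmoid hidden layer h^(1)
        a2 = W2 @ h1 + b2
        y = np.exp(a2 - a2.max())
        y = y / y.sum()                              # softmax outputs y_k
        # 2. Cross-entropy error E^n = -sum_k t_k ln y_k
        E = -np.sum(t * np.log(y))
        # 3. Output error signals g^(2)_k = y_k - t_k
        g2 = y - t
        # 4. Back-propagate to get the hidden error signals g^(1)_j
        g1 = (W2.T @ g2) * h1 * (1.0 - h1)
        # 5. Weight gradients, used here for one gradient-descent update
        W2 -= lr * np.outer(g2, h1)
        W1 -= lr * np.outer(g1, x)
        return E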


Training with multiple hidden layers

[Figure: network with two hidden layers: inputs x_i, first hidden layer h^{(1)}_j (weights w^{(1)}_{ji}, error signals g^{(1)}_j), second hidden layer h^{(2)}_k (weights w^{(2)}_{kj}, error signals g^{(2)}_k), and outputs y_1, ..., y_K (weights w^{(3)}_{\ell k}, error signals g^{(3)}_\ell)]

g^{(2)}_k = \Big( \sum_{m} g^{(3)}_m w^{(3)}_{mk} \Big) h^{(2)}_k (1 - h^{(2)}_k)

\frac{\partial E^n}{\partial w^{(2)}_{kj}} = g^{(2)}_k h^{(1)}_j
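A sketch of that recursion for an arbitrary stack of sigmoid hidden layers (illustrative names, not from the slides: Ws holds the weight matrices [W^(1), ..., W^(L)], hs holds the input and hidden activations [x, h^(1), ..., h^(L-1)], and g_out is the output error signal g^(L)):

    def backprop_error_signals(Ws, hs, g_out):
        # Ws and hs are lists of NumPy arrays; g_out is a NumPy vector.
        gs = [g_out]                           # error signals, deepest layer first
        for l in range(len(Ws) - 1, 0, -1):    # walk back through the hidden layers
            # g^(l) is a weighted sum of the g^(l+1)s, times the sigmoid
            # derivative h^(l) (1 - h^(l)) of the current layer
            g = (Ws[l].T @ gs[0]) * hs[l] * (1.0 - hs[l])
            gs.insert(0, g)
        return gs                              # [g^(1), ..., g^(L)]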


Are there alternatives to Sigmoid Hidden Units?


Sigmoid function

[Figure: plot of the logistic sigmoid g(a) against a over the range −5 to 5, rising smoothly from 0 to 1 with g(0) = 0.5]

Logistic sigmoid activation function g(a) = 1/(1+exp(−a))


Sigmoid Hidden Units

Compress unbounded inputs to (0,1), saturating large positive inputs towards 1 and large negative inputs towards 0

Interpretable as the probability of the feature defined by the unit's weight vector

Interpretable as the (normalised) firing rate of a neuron

However...

Saturation causes gradients to approach 0: if the output of a sigmoid unit is h, then the gradient is h(1 − h), which approaches 0 as h saturates to 0 or 1, and hence the gradients it multiplies into approach 0. Very small gradients result in very small parameter changes, so learning becomes very slow

Outputs are not centred at 0: The output of a sigmoid layer will have mean > 0. This is numerically undesirable.


tanh

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

\mathrm{sigmoid}(x) = \frac{1 + \tanh(x/2)}{2}

Derivative:

\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)


tanh hidden units

tanh has the same shape as sigmoid but has output range ±1

Results about the approximation capability of sigmoid networks also apply to tanh networks

Possible reason to prefer tanh over sigmoid: allowing units to be positive or negative allows the gradient for weights into a hidden unit to have a different sign

Saturation still a problem



Rectified Linear Unit – ReLU

relu(x) = max(0, x)

Derivative:

\frac{d}{dx}\,\mathrm{relu}(x) = \begin{cases} 0 & \text{if } x \le 0 \\ 1 & \text{if } x > 0 \end{cases}
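For reference, the three activations and their derivatives as a small NumPy sketch (the function names are illustrative):

    import numpy as np

    def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
    def d_sigmoid(x): h = sigmoid(x); return h * (1.0 - h)    # h(1 - h)

    def tanh(x):      return np.tanh(x)
    def d_tanh(x):    return 1.0 - np.tanh(x) ** 2            # 1 - tanh^2(x)

    def relu(x):      return np.maximum(0.0, x)
    def d_relu(x):    return np.where(x > 0, 1.0, 0.0)        # 0 for x <= 0, 1 for x > 0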


ReLU hidden units

Similar approximation results to tanh and sigmoid hidden units

Empirical results for speech and vision show consistent improvements using relu over sigmoid or tanh

Unlike tanh or sigmoid there is no positive saturation – saturation results in very small derivatives (and hence slower learning)

Negative input to relu results in zero gradient (and hence no learning)

Relu is computationally efficient: max(0, x)

Relu units can “die” (i.e. respond with 0 to everything)

Relu units can be very sensitive to the learning rate


Summary

Understanding what single-layer networks compute

How multi-layer networks allow feature computation

Training multi-layer networks using back-propagation of error

Tanh and ReLU activation functions

Multi-layer networks are also referred to as deep neural networks or multi-layer perceptrons

Reading:

Nielsen, chapter 2

Goodfellow, sections 6.3, 6.4, 6.5

Bishop, sections 3.1, 3.2, and chapter 4

