
Machine learning - HT 2016

8. Neural Networks

Varun Kanade

University of Oxford
February 19, 2016


Outline

Today, we’ll study feedforward neural networks

▶ Multi-layer perceptrons

▶ Application to classification or regression settings

▶ Backpropagation to compute gradients

▶ Finally understand what model:forward and model:backward are doing


Artificial Neuron : Logistic Regression

[Figure: a single artificial neuron. The inputs 1, x_1, x_2 feed into a summation unit Σ with weights b (bias), w_1, w_2, followed by a sigmoid:]

ŷ = σ(w_1 x_1 + w_2 x_2 + b) = Pr(y = 1 | x, w, b)
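As a minimal sketch (ours, not from the slides), this neuron is a one-layer network in Torch; nn.BCECriterion supplies the usual logistic loss:

require 'nn'

model = nn.Sequential()
model:add(nn.Linear(2, 1))      -- computes w_1 x_1 + w_2 x_2 + b
model:add(nn.Sigmoid())         -- squashes to Pr(y = 1 | x, w, b)
criterion = nn.BCECriterion()   -- binary cross-entropy (logistic) loss

local p = model:forward(torch.Tensor{0.5, -1.0})
local loss = criterion:forward(p, torch.Tensor{1})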


Multilayer Perceptron (MLP) : Classification

[Figure: an MLP with inputs x_1, x_2 (plus constant inputs 1 for the biases), a hidden layer of two sigmoid units, and a sigmoid output unit computing ŷ = Pr(y = 1 | x, W, b). Here w^l_{jk} is the weight into unit j of layer l from unit k of the previous layer, and b^l_j the corresponding bias.]


Multilayer Perceptron (MLP) : Regression

[Figure: the same MLP, now read as a regression model with output ŷ = E[y | x, W, b].]


A Simple Example

ŷ = σ( w^4_{11} σ( w^3_{11} σ(w^2_{11} x_1 + w^2_{12} x_2 + b^2_1) + w^3_{12} σ(w^2_{21} x_1 + w^2_{22} x_2 + b^2_2) + b^3_1 )
     + w^4_{12} σ( w^3_{21} σ(w^2_{11} x_1 + w^2_{12} x_2 + b^2_1) + w^3_{22} σ(w^2_{21} x_1 + w^2_{22} x_2 + b^2_2) + b^3_2 )
     + b^4_1 )
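The same function written as a Torch model; a hedged sketch, reading the 2–2–2–1 layer sizes off the formula above:

require 'nn'

model = nn.Sequential()
model:add(nn.Linear(2, 2)); model:add(nn.Sigmoid())   -- layer 2: a^2 = σ(W^2 x + b^2)
model:add(nn.Linear(2, 2)); model:add(nn.Sigmoid())   -- layer 3
model:add(nn.Linear(2, 1)); model:add(nn.Sigmoid())   -- layer 4 (output ŷ)

local y_hat = model:forward(torch.Tensor{1.0, -1.0})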


Feedforward Neural Networks

[Figure: a feedforward network with Layer 1 (Input), Layer 2 (Hidden), Layer 3 (Hidden), Layer 4 (Output).]


[Figure: layers 2, …, l − 1, l, …, L − 1, L; the input x enters as z^1, the loss L is computed on z^{L+1}, and δ^l = ∂L/∂z^l flows backwards from δ^L to δ^1.]

model = nn.Sequential()
model:add(nn.Linear(n_in, n_h1))
model:add(nn.Sigmoid())
...
model:add(nn.Linear(n_h_L-1, n_out))   -- n_h_L-1: size of the last hidden layer
model:add(nn.Sigmoid())
criterion = nn.MSECriterion()

model:forward(x)
criterion:forward(model.output, y)

dL_do = criterion:backward(model.output, y)
model:backward(x, dL_do)
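Putting the pieces together, one stochastic gradient step might look as follows; a sketch assuming a learning rate lr of our choosing:

-- one SGD step on a single example (x, y)
model:zeroGradParameters()                 -- clear accumulated gradients
local out   = model:forward(x)
local loss  = criterion:forward(out, y)
local dL_do = criterion:backward(out, y)   -- ∂L/∂(model output)
model:backward(x, dL_do)                   -- back-propagate to all parameters
model:updateParameters(lr)                 -- w := w − lr · ∂L/∂w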


[Figure: layers 2, …, l, …, L as before; input x at z^1, loss L at z^{L+1}, deltas δ^l = ∂L/∂z^l.]

Forward Equations

(1) z^1 = x (input)

(2) z^l = W^l a^{l−1} + b^l

(3) a^l = f(z^l)

(4) L(a^L, y)


Output Layer

[Figure: the output layer L, with incoming activations a^{L−1} and output a^L; δ^L is computed here.]

z^L = W^L a^{L−1} + b^L

a^L = f(z^L)

Loss: L(y, a^L) = criterion:forward(aL, y)

δ^L = ∂L/∂z^L = ∂L/∂a^L ⊙ f′(z^L)


Back Propagation

[Figure: layer l, with incoming activations a^{l−1} and output a^l; δ^l is computed from δ^{l+1}.]

a^l (the inputs into layer l + 1)

z^{l+1} = W^{l+1} a^l + b^{l+1}   (w^{l+1}_{jk} is the weight on the connection from the kth unit in layer l to the jth unit in layer l + 1)

a^l = f(z^l)   (f is a non-linearity)

δ^{l+1} = ∂L/∂z^{l+1}

δ^l = ∂L/∂z^l = (∂z^{l+1}/∂z^l)^T · ∂L/∂z^{l+1}


Gradients wrt Parameters

[Figure: layer l, with incoming activations a^{l−1} and output a^l.]

z^l = W^l a^{l−1} + b^l   (w^l_{jk} is the weight on the connection from the kth unit in layer l − 1 to the jth unit in layer l)

δ^l = ∂L/∂z^l


[Figure: layers 2, …, l, …, L as before; input x at z^1, loss L at z^{L+1}, deltas δ^l = ∂L/∂z^l.]

Forward Equations

(1) z^1 = x (input)

(2) z^l = W^l a^{l−1} + b^l

(3) a^l = f(z^l)

(4) L(a^L, y)

Back-propagation Equations

(1) δ^L = ∂L/∂a^L ⊙ f′(z^L)

(2) δ^l = (W^{l+1})^T δ^{l+1} ⊙ f′(z^l)

(3) ∂L/∂b^l = δ^l

(4) ∂L/∂W^l = δ^l (a^{l−1})^T
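These four equations are easy to check by writing them out directly; below is a minimal sketch (ours, with made-up layer sizes and squared-error loss L = ½‖a^L − y‖²) using plain torch tensors:

require 'torch'

-- sigmoid, and its derivative expressed via the activation: f'(z) = a(1 − a)
local function sigmoid(z) return torch.exp(-z):add(1):pow(-1) end
local function dsig(a) return torch.cmul(a, a:clone():mul(-1):add(1)) end

local x, y = torch.randn(3), torch.randn(2)        -- input and target
local W2, b2 = torch.randn(4, 3), torch.randn(4)   -- layer 2 parameters
local W3, b3 = torch.randn(2, 4), torch.randn(2)   -- output layer parameters

-- forward: z^l = W^l a^{l−1} + b^l, a^l = f(z^l)
local a1 = x
local z2 = W2 * a1 + b2; local a2 = sigmoid(z2)
local z3 = W3 * a2 + b3; local a3 = sigmoid(z3)

-- (1) δ^L = ∂L/∂a^L ⊙ f'(z^L); here ∂L/∂a^L = a^L − y
local d3 = torch.cmul(a3 - y, dsig(a3))
-- (2) δ^l = (W^{l+1})^T δ^{l+1} ⊙ f'(z^l)
local d2 = torch.cmul(W3:t() * d3, dsig(a2))
-- (3) ∂L/∂b^l = δ^l   and   (4) ∂L/∂W^l = δ^l (a^{l−1})^T
local gb3, gW3 = d3, torch.ger(d3, a2)
local gb2, gW2 = d2, torch.ger(d2, a1)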


Output Layer : LogSoftmax

[Figure: the output layer L, with incoming activations a^{L−1} and output a^L.]

criterion = nn.ClassNLLCriterion()

a^L = LogSoftmax(z^L)

Loss: L(y, a^L) = −a^L_y

δ^L = ∂L/∂z^L = (∂a^L/∂z^L)^T · ∂L/∂a^L
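In Torch the pair nn.LogSoftMax / nn.ClassNLLCriterion implements exactly this loss; a small sketch (the layer sizes are ours):

require 'nn'

local m = nn.Sequential()
m:add(nn.Linear(4, 3))      -- z^L for 3 classes
m:add(nn.LogSoftMax())      -- a^L = LogSoftmax(z^L)
local crit = nn.ClassNLLCriterion()

local x, y = torch.randn(4), 2           -- y is the target class index
local logp = m:forward(x)
local loss = crit:forward(logp, y)       -- equals −a^L_y
m:backward(x, crit:backward(logp, y))    -- starts from ∂L/∂a^L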


[Figure: layers 2, …, l, …, L as before; input x at z^1, loss L at z^{L+1}, deltas δ^l = ∂L/∂z^l.]

Forward Equations

(1) z^1 = x (input)

(2) z^l = g^l(a^{l−1}; W^l, b^l)

(3) a^l = f^l(z^l)

(4) L(a^L, y)

Back-propagation Equations

(1) δ^L = (∂a^L/∂z^L)^T · ∂L/∂a^L

(2) δ^l = (∂z^{l+1}/∂a^l · ∂a^l/∂z^l)^T · δ^{l+1}

(3) ∂L/∂b^l = (∂z^l/∂b^l)^T · δ^l

(4) ∂L/∂W^l = (∂z^l/∂W^l)^T · δ^l


[Figure: layers 2, …, l, …, L as before; input x at z^1, loss L at z^{L+1}, deltas δ^l = ∂L/∂z^l.]

Forward Equations

(1) z^1 = x (input)

(2) z^l = W^l a^{l−1} + b^l

(3) a^l = f(z^l)

(4) L(a^L, y)

Back-propagation Equations

(1) δ^L = ∂L/∂a^L ⊙ f′(z^L)

(2) δ^l = (W^{l+1})^T δ^{l+1} ⊙ f′(z^l)

(3) ∂L/∂b^l = δ^l

(4) ∂L/∂W^l = δ^l (a^{l−1})^T


Computational Questions

What is the running time to compute the gradient for a single data point?

What is the space requirement?

Can we process multiple examples together?
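On the last question: the forward equations are matrix-vector products, so a batch of examples simply turns them into matrix-matrix products, and Torch modules already treat the first dimension as a batch dimension. A brief sketch (the batch size 32 and n_in are our assumptions, reusing the model from earlier):

local X = torch.randn(32, n_in)   -- 32 examples, one per row
local out = model:forward(X)      -- one row of output per example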


Training Deep Neural Networks

▶ Back-propagation gives the gradient

▶ Stochastic gradient descent is the method of choice

▶ Regularisation
  ▶ How do we add ℓ1 or ℓ2 regularisation? (a sketch follows below)
  ▶ Don't regularise bias terms

▶ How about convergence?

▶ What did we learn in the last 10 years that we didn't know in the 80s?
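One way to add ℓ2 regularisation by hand, sketched under the assumption that model is the nn.Sequential from earlier (only weights are decayed; biases are left alone):

-- after model:backward: add λW to each weight gradient, since ∂/∂W (λ/2)‖W‖² = λW
local lambda = 1e-4                      -- assumed regularisation strength
for _, m in ipairs(model.modules) do
  if torch.type(m) == 'nn.Linear' then
    m.gradWeight:add(lambda, m.weight)   -- weights only; gradBias untouched
  end
end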


Training Feedforward Deep Networks

[Figure: the four-layer network from before: Layer 1 (Input), Layer 2 (Hidden), Layer 3 (Hidden), Layer 4 (Output).]

Why do we get a non-convex optimisation problem?


A toy example

[Figure: a single sigmoid unit a^2_1 with one input x and a constant input 1, weight w^2_1 and bias b^2_1.]

x ∈ {−1, 1}

Target is y = (1 − x)/2


Propagating Gradients Backwards

[Figure: a chain of single units z^1_1 → a^2_1 → a^3_1 → a^4_1, with weights w^2_1, w^3_1, w^4_1 and biases b^2_1, b^3_1, b^4_1.]
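For this chain the back-propagation equations collapse to products of scalars; a worked expansion (ours, assuming sigmoid activations f = σ):

\[
\delta^4 = \frac{\partial L}{\partial a^4_1}\, f'(z^4_1), \qquad
\delta^3 = w^4_1\, \delta^4\, f'(z^3_1), \qquad
\delta^2 = w^3_1\, \delta^3\, f'(z^2_1) = w^3_1 w^4_1\, f'(z^2_1)\, f'(z^3_1)\, f'(z^4_1)\, \frac{\partial L}{\partial a^4_1}
\]

Since σ′(z) = σ(z)(1 − σ(z)) ≤ 1/4, each extra layer multiplies the gradient reaching the early parameters by a factor of the form w f′(z); for moderate weights this product shrinks geometrically with depth, which is why the early layers can learn very slowly.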


Digit Classification Problem on Sheet 4

[Figure: two output encodings for the digit labels: one-hot encoding vs binary encoding.]


Digit Classification: Squared Loss vs Cross Entropy

[Figure: digit-classification learning curves, squared error vs cross entropy.]

(See paper by Glorot and Bengio (2010))


Avoiding Saturation

Use rectified linear units:

ReLU(z) = max(0, z)

Sometimes you will see just f(z) = |z|.

Other varieties: leaky ReLUs, parametric ReLUs.
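In Torch these are drop-in replacements for nn.Sigmoid; a quick sketch (the slope 0.01 is our choice, and nn.LeakyReLU assumes a reasonably recent nn package):

require 'nn'

local relu  = nn.ReLU()            -- max(0, z)
local leaky = nn.LeakyReLU(0.01)   -- z if z > 0, else 0.01·z

local z = torch.randn(5)
print(relu:forward(z), leaky:forward(z))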


Initialising Weights and Biases

Why is initialising important?

Suppose we were using a sigmoid unit, how would you initialise the incoming weights?

What if it were a ReLU unit?

How about the biases?
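One common recipe for sigmoid units is the scaling of Glorot and Bengio (2010): draw each weight uniformly from ±√(6/(n_in + n_out)) so pre-activations start in the linear regime of the sigmoid, and set biases to zero. A sketch (ours) for a single nn.Linear layer:

local lin = nn.Linear(n_in, n_out)
local s = math.sqrt(6 / (n_in + n_out))  -- Glorot/Xavier scale
lin.weight:uniform(-s, s)                -- keep initial pre-activations small
lin.bias:zero()                          -- biases start at zero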


Avoiding Overfitting

Deep neural networks have a lot of parameters

▶ Fully connected layers with n_1, n_2, …, n_k units have at least n_1 n_2 + n_2 n_3 + · · · + n_{k−1} n_k parameters

▶ The MLP for digit recognition had 2 million parameters!

▶ For image classification, the neural net used by Krizhevsky, Sutskever, Hinton (2012) has 60 million parameters and 1.2 million training images

▶ How do we prevent deep neural networks from overfitting?


Early Stopping

Maintain a validation set and stop training when the error on the validation set stops decreasing.

What are the computational costs?

What are the advantages?

Early stopping leads to better generalisation

(See paper by Hardt, Recht and Singer (2015))


Add Data

Typically, getting additional data is either impossible or expensive

Fake the data!

Images can be translated slightly, rotated slightly, changed in brightness, etc.

Google Offline Translate trained on entirely fake data!

(Google Research Blog)
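A minimal sketch of faking data with Torch's image package (the jitter amounts are our choices):

require 'image'

local img = image.lena()                      -- any 3×H×W image
local rotated  = image.rotate(img, 0.1)       -- rotate slightly (radians)
local shifted  = image.translate(img, 2, -1)  -- translate slightly (pixels)
local brighter = img * 1.2                    -- change brightness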


Add Data

Adversarial Training

Take a trained (or partially trained) model

Create examples by modifications ‘‘imperceptible to the human eye’’, but where the model fails

(Szegedy et al., Goodfellow et al.)
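One concrete recipe from Goodfellow et al. is the fast gradient sign method: perturb x in the direction that increases the loss. A sketch against the model/criterion from earlier (ε is our choice):

-- x_adv = x + ε · sign(∂L/∂x)
local eps = 0.05
local out = model:forward(x)
criterion:forward(out, y)
local dL_do = criterion:backward(out, y)
local gradInput = model:backward(x, dL_do)    -- model:backward returns ∂L/∂x
local x_adv = x + torch.sign(gradInput) * eps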


Other Ideas to Reduce Overfitting

Hard constraints on weights

Gradient Clipping (a minimal sketch follows this list)

Inject noise into the system

Enforce sparsity in the neural network

Unsupervised Pre-training (Bengio et al.)
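For gradient clipping, a minimal sketch using the flat parameter view that nn containers provide (the threshold c is our choice):

local params, gradParams = model:getParameters()  -- flat views of all w and ∂L/∂w
-- after model:backward: clip each gradient coordinate to [−c, c]
local c = 1.0
gradParams:clamp(-c, c)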


Bagging and Dropout

Bagging (Leo Breiman - 1994)

▶ Given a dataset D = ⟨(x_i, y_i)⟩_{i=1}^m, sample D_1, D_2, …, D_k of size m from D with replacement

▶ Train classifiers f_1, …, f_k on D_1, …, D_k

▶ When predicting, use the majority vote (or the average for regression)

▶ Clearly this approach is not practical for deep networks


Dropout

▶ For each input x, drop each hidden unit independently with probability 1/2

▶ Every input will have a potentially different mask

▶ Potentially exponentially many different models, but they ‘‘share the same weights’’

▶ After training, the whole network is used with all the weights halved

(Srivastava, Hinton, Krizhevsky, 2014)
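In Torch, nn.Dropout implements this; note it rescales activations by 1/(1 − p) during training, so the halving of the weights at test time is already accounted for. A sketch (layer sizes ours):

local net = nn.Sequential()
net:add(nn.Linear(100, 50))
net:add(nn.ReLU())
net:add(nn.Dropout(0.5))    -- drop each hidden unit w.p. 1/2 (training mode only)
net:add(nn.Linear(50, 10))

net:training()              -- a fresh mask is sampled on every forward pass
-- ... training loop ...
net:evaluate()              -- dropout becomes the identity at test time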


Errors Made by MLP for Digit Recognition


Avoiding Overfitting

▶ Use weight sharing, a.k.a. tied weights, in the model

▶ Exploit invariances -- e.g., translation

▶ Exploit locality in images, audio, text, etc.

▶ Convolutional Neural Networks (convnets)


Convolutional Neural Networks (convnets)

(Fukushima, LeCun, Hinton 1980s)


Convolution


Image Convolution

Source: L. W. Kheng


Source: Krizhevsky, Sutskever, Hinton (2012)


Source: Krizhevsky, Sutskever, Hinton (2012); Wikipedia


Source: Krizhevsky, Sutskever, Hinton (2012)


Source: Zeiler and Fergus (2013)


Source: Zeiler and Fergus (2013)


Convnet in Torch


model = nn.Sequential()
model:add(nn.Reshape(1, 32, 32))                 -- 1×32×32 input image
-- layer 1:
model:add(nn.SpatialConvolution(1, 16, 5, 5))    -- 16 feature maps, 28×28 each
model:add(nn.ReLU())
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))      -- downsample to 16×14×14
-- layer 2:
model:add(nn.SpatialConvolution(16, 128, 5, 5))  -- 128 feature maps, 10×10 each
model:add(nn.ReLU())
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))      -- downsample to 128×5×5
-- layer 3:
model:add(nn.Reshape(128*5*5))                   -- flatten to a 3200-vector
model:add(nn.Linear(128*5*5, 200))
model:add(nn.ReLU())
-- output
model:add(nn.Linear(200, 10))
model:add(nn.LogSoftMax())                       -- log class probabilities


Convolutional Layer

z^{l+1}_{i′,j′,f′} = b_{f′} + Σ_{i=1}^{W_{f′}} Σ_{j=1}^{H_{f′}} Σ_{f=1}^{F_l} a^l_{i′+i−1, j′+j−1, f} · w^{l+1,f′}_{i,j,f}

(here W_{f′}, H_{f′} are the filter width and height, and F_l the number of feature maps in layer l)

∂z^{l+1}_{i′,j′,f′} / ∂w^{l+1,f′}_{i,j,f} = a^l_{i′+i−1, j′+j−1, f}

∂L/∂w^{l+1,f′}_{i,j,f} = Σ_{i′,j′} δ^{l+1}_{i′,j′,f′} · a^l_{i′+i−1, j′+j−1, f}


Convolutional Layer

z^{l+1}_{i′,j′,f′} = b_{f′} + Σ_{i=1}^{W_{f′}} Σ_{j=1}^{H_{f′}} Σ_{f=1}^{F_l} a^l_{i′+i−1, j′+j−1, f} · w^{l+1,f′}_{i,j,f}

∂z^{l+1}_{i′,j′,f′} / ∂a^l_{i,j,f} = w^{l+1,f′}_{i−i′+1, j−j′+1, f}

∂L/∂a^l_{i,j,f} = Σ_{i′,j′,f′} δ^{l+1}_{i′,j′,f′} · w^{l+1,f′}_{i−i′+1, j−j′+1, f}


Max-Pooling Layer

b^{l+1}_{i′,j′} = max_{(i,j) ∈ Ω(i′,j′)} a^l_{i,j}

∂b^{l+1}_{i′,j′} / ∂a^l_{i,j} = I( (i, j) = argmax_{(i,j) ∈ Ω(i′,j′)} a^l_{i,j} )

(Ω(i′,j′) is the pooling window feeding output position (i′, j′))
