
Rapid Introduction to Machine Learning/Deep Learning

Hyeong In Choi

Seoul National University


Lecture 4a: Feedforward neural network

October 30, 2015


Table of contents

1. Objectives of Lecture 4a

2. Multilayer perceptron
   2.1. Feedforward data flow
   2.2. Back propagation algorithm

3. Training neural networks
   3.1. Simple neural network
   3.2. Training general feedforward neural network


1. Objectives of Lecture 4a

Objective 1

Learn the basic formalism of multilayer feedforward neural networks

Objective 2

Learn the back propagation algorithm

Objective 3

Learn about the basic issues related to the training of neural networks


Objective 4

Learn some useful tricks for training the neural network


2. Multilayer perceptron
2.1. Feedforward data flow


First layer


$z^1_i$: input (pre-activation) to the $i$th neuron in Layer 1
$$z^1_i = \sum_{j=1}^{d} \omega^1_{ij} x_j + b^1_i$$
$b^1_i$: bias at the $i$th neuron in Layer 1; in vector notation,
$$z^1 = W^1 x + b^1$$
$h^1_i$: output of the $i$th neuron in Layer 1
$$h^1_i = \phi_1(z^1_i),$$
in vector notation,
$$h^1 = \phi_1(z^1),$$


where
$$\phi_1(z) = \begin{cases} \operatorname{sigm}(z) = \dfrac{1}{1+e^{-z}} \\ \tanh(z) \\ \operatorname{ReLU}(z) = \max(z, 0) = z^{+} \\ \text{etc.} \end{cases}$$

At the $\ell$th layer


The input (pre-activation) at the $i$th neuron in Layer $\ell$:
$$z^\ell_i = \sum_j \omega^\ell_{ij} h^{\ell-1}_j + b^\ell_i,$$
in vector notation,
$$z^\ell = W^\ell h^{\ell-1} + b^\ell$$

The output at the $i$th neuron in Layer $\ell$:
$$h^\ell_i = \phi_\ell(z^\ell_i),$$
in vector notation,
$$h^\ell = \phi_\ell(W^\ell h^{\ell-1} + b^\ell)$$
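As an illustration of the data flow just described (not from the slides), here is a minimal NumPy sketch, assuming hypothetical lists `Ws` and `bs` of per-layer weight matrices $W^\ell$ and bias vectors $b^\ell$, and a single activation used at every layer for simplicity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs, phi=sigmoid):
    """Feedforward pass: returns [h^0, h^1, ..., h^L] for input x = h^0."""
    h = x
    hs = [h]
    for W, b in zip(Ws, bs):
        z = W @ h + b      # pre-activation z^l = W^l h^{l-1} + b^l
        h = phi(z)         # layer output  h^l = phi_l(z^l)
        hs.append(h)
    return hs
```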


The output layer: Layer $L$

Pre-activation:
$$z^L = W^L h^{L-1} + b^L$$
Output:
$$h^L = \phi_L(z^L)$$

In case of multi-class classification ($K$ classes),
$$h^L = \operatorname{softmax}(z^L),$$
i.e.
$$h^L_i = \frac{e^{z^L_i}}{\sum_{k=1}^{K} e^{z^L_k}} = \frac{\exp(W^L_{i\cdot}\, h^{L-1} + b^L_i)}{\sum_{k=1}^{K} \exp(W^L_{k\cdot}\, h^{L-1} + b^L_k)}$$
$$h^L_i \sim P(Y = i \mid x)$$
[Note: $W_{i\cdot}$ denotes the $i$th row of the matrix $W$]

In case of regression, $h^L_i \in \mathbb{R}$, $\forall i$.


The loss (error) function

For real output (regression):
$$E = \sum_k |h^L_k - y_k|^2$$

For discrete output (classification), recall the multivariate Bernoulli distribution
$$P(y) = \mu_1^{y_1} \cdots \mu_K^{y_K}$$
Given data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^{N}$, the likelihood is
$$\prod_t P(y^{(t)} \mid x^{(t)}) = \prod_t \mu_1^{y^{(t)}_1} \cdots \mu_K^{y^{(t)}_K}$$


The log likelihood is
$$\sum_t \left\{ y^{(t)}_1 \log \mu_1 + \cdots + y^{(t)}_K \log \mu_K \right\}$$
$h^L_i$ maximizes this log likelihood and is the estimator of $\mu_i$. The error is defined to be the negative of the log likelihood (so $h^L_i$ minimizes this error):
$$E = -\sum_{t=1}^{N} \sum_{k=1}^{K} y^{(t)}_k \log h^L_k = -\sum_{t=1}^{N} \sum_{k=1}^{K} I(y^{(t)} = k) \log \frac{e^{z^L_k}}{\sum_{j=1}^{K} e^{z^L_j}},$$
where $z^L_j = W^L_{j\cdot}\, h^{L-1} + b^L_j$
[Note: this error is called the softmax error or the cross-entropy error]
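As a small illustration (not part of the lecture), a numerically stable sketch of the softmax output and the cross-entropy error for a single example, assuming the label `y` is given as a class index:

```python
import numpy as np

def softmax(zL):
    # subtracting the max does not change the result but avoids overflow
    e = np.exp(zL - zL.max())
    return e / e.sum()

def cross_entropy(zL, y):
    """E = -log h^L_y for one example, y a class index in 0..K-1."""
    hL = softmax(zL)
    return -np.log(hL[y])
```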


Remark

For regression, the $\ell_2$-error is typical, but one can use the $\ell_1$-error
$$E = \sum_k |h^L_k - y_k|,$$
or another convex function of $h^L - y$.

For classification, this softmax error is typical, but one can use other similar errors.

A feedforward network has the property that once the values ($z^\ell$ or $h^\ell$) of all the neurons of a Layer $\ell$ are given, the values of the layers that come after Layer $\ell$ are all determined by them (assuming all weights and biases are fixed). Thus we can write the error $E(x, y)$ as $E(h^\ell, y)$ or $E(z^\ell, y)$ for any $\ell = 0, 1, \cdots, L$.


2.2. Back propagation algorithm


Change of $E$ w.r.t. $\omega^\ell_{ij}$

Fix $y$ and treat $E$ as a function of the values ($z^\ell$ or $h^\ell$) of the $\ell$th layer. Then
$$\frac{\partial E}{\partial z^\ell_i} = \frac{dh^\ell_i}{dz^\ell_i} \frac{\partial E}{\partial h^\ell_i}, \qquad (1)$$
where
$$\frac{dh^\ell_i}{dz^\ell_i} = \begin{cases} h^\ell_i (1 - h^\ell_i) & \text{if } \phi_\ell \text{ is sigm} \\ I(z^\ell_i \ge 0) & \text{if } \phi_\ell \text{ is ReLU} \\ \operatorname{sech}^2 z^\ell_i & \text{if } \phi_\ell \text{ is tanh} \end{cases}$$
Then
$$\frac{\partial E}{\partial h^{\ell-1}_j} = \sum_i \frac{\partial z^\ell_i}{\partial h^{\ell-1}_j} \frac{\partial E}{\partial z^\ell_i}.$$
From
$$z^\ell_i = \sum_j \omega^\ell_{ij} h^{\ell-1}_j + b^\ell_i, \qquad (2)$$


we get
$$\frac{\partial z^\ell_i}{\partial h^{\ell-1}_j} = \omega^\ell_{ij}, \qquad \frac{\partial z^\ell_i}{\partial \omega^\ell_{ij}} = h^{\ell-1}_j$$
Thus
$$\frac{\partial E}{\partial h^{\ell-1}_j} = \sum_i \omega^\ell_{ij} \frac{\partial E}{\partial z^\ell_i} \qquad (3)$$
Using (2), we get
$$\frac{\partial E}{\partial \omega^\ell_{ij}} = \frac{\partial z^\ell_i}{\partial \omega^\ell_{ij}} \frac{\partial E}{\partial z^\ell_i} \qquad (4)$$
Thus
$$\frac{\partial E}{\partial \omega^\ell_{ij}} = h^{\ell-1}_j \frac{\partial E}{\partial z^\ell_i}$$


Change of $E$ w.r.t. $b^\ell_i$

The weight connecting the bias neuron in Layer $\ell-1$ to the $i$th neuron in Layer $\ell$ is $b^\ell_i = \omega^\ell_{i0}$.

$h^{\ell-1}_0 = 1$ and there is no input (pre-activation) to the bias neuron.


Note from (2):
$$\frac{\partial z^\ell_i}{\partial b^\ell_i} = \frac{\partial z^\ell_i}{\partial \omega^\ell_{i0}} = 1$$
Thus from (4):
$$\frac{\partial E}{\partial b^\ell_i} = \frac{\partial E}{\partial \omega^\ell_{i0}} = \frac{\partial E}{\partial z^\ell_i}$$
Hence we get
$$\frac{\partial E}{\partial \omega^\ell_{ij}} = h^{\ell-1}_j \frac{\partial E}{\partial z^\ell_i}, \qquad \frac{\partial E}{\partial b^\ell_i} = \frac{\partial E}{\partial \omega^\ell_{i0}} = \frac{\partial E}{\partial z^\ell_i} \qquad (5)$$


Propagation mechanism

The data flows from Layer $\ell-1$ to Layer $\ell$, i.e. it moves forward (hence the name "feedforward network").

The error derivatives are computed in the opposite direction:
$$\frac{\partial E}{\partial h^\ell_i} \ \xrightarrow{\text{by }(1)}\ \frac{\partial E}{\partial z^\ell_i} \ \xrightarrow{\text{by }(3)}\ \frac{\partial E}{\partial h^{\ell-1}_i} \ \xrightarrow{\text{by }(1)}\ \frac{\partial E}{\partial z^{\ell-1}_i} \ \longrightarrow\ \text{and so on}$$

Namely, the error derivatives can be computed starting from the output layer (Layer $L$) and moving backward all the way to the input layer (Layer 0) (hence the name back propagation).

Equation (5) is the basic equation used in the gradient descent algorithm.
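A minimal backward-pass sketch combining equations (1), (3) and (5) (illustrative only, not the lecture's code): it assumes every $\phi_\ell$ is the sigmoid, that the forward pass stored the layer outputs `hs = [h^0, ..., h^L]`, and that the caller supplies $\partial E / \partial h^L$.

```python
import numpy as np

def backprop(hs, Ws, dE_dhL):
    """Return the lists of gradients dE/dW^l and dE/db^l for l = 1..L."""
    grads_W, grads_b = [], []
    dE_dh = dE_dhL
    for l in range(len(Ws), 0, -1):
        h, h_prev = hs[l], hs[l - 1]
        dE_dz = h * (1.0 - h) * dE_dh                 # eq. (1), sigmoid case
        grads_W.insert(0, np.outer(dE_dz, h_prev))    # eq. (5): dE/dw_ij = h^{l-1}_j dE/dz_i
        grads_b.insert(0, dE_dz)                      # eq. (5): dE/db_i  = dE/dz_i
        dE_dh = Ws[l - 1].T @ dE_dz                   # eq. (3)
    return grads_W, grads_b
```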


3. Training neural networks
3.1. Simple neural network

Reference: Hinton’s Coursera Lectures

$$h = \omega_1 x_1 + \omega_2 x_2$$

$\ell_2$-error for a single training example $((x_1, x_2), y)$:
$$E = E(\omega_1, \omega_2) = (\omega_1 x_1 + \omega_2 x_2 - y)^2$$


$x_1, x_2, y$ are fixed; $\omega_1$ and $\omega_2$ are variables.



Gradient

$$\nabla E = \left( \frac{\partial E}{\partial \omega_1}, \frac{\partial E}{\partial \omega_2} \right) = \big( 2x_1(\omega_1 x_1 + \omega_2 x_2 - y),\ 2x_2(\omega_1 x_1 + \omega_2 x_2 - y) \big) \sim (x_1, x_2)$$

$\nabla E$ is pointing perpendicularly to the line $\omega_1 x_1 + \omega_2 x_2 - y = 0$.


$\ell_2$-error for two data points
$$D = \{((x^1_1, x^1_2), y^1),\ ((x^2_1, x^2_2), y^2)\}$$
$$E = E(\omega_1, \omega_2) = (\omega_1 x^1_1 + \omega_2 x^1_2 - y^1)^2 + (\omega_1 x^2_1 + \omega_2 x^2_2 - y^2)^2$$


The level sets E = constant are ellipses


Steepest descent: full batch learning

$$\omega^{(\mathrm{new})} = \omega^{(\mathrm{old})} - \varepsilon \nabla E(\omega^{(\mathrm{old})})$$
$\varepsilon$: learning rate
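As a toy illustration (data and learning rate are just example values, echoing the $((1,1),2),\ ((1,-1),0)$ example used later in this lecture), a full-batch steepest-descent loop for the two-data-point error above:

```python
import numpy as np

X = np.array([[1.0,  1.0],
              [1.0, -1.0]])      # two inputs (x1, x2)
y = np.array([2.0, 0.0])         # their targets

w = np.zeros(2)                  # (omega1, omega2)
eps = 0.1                        # learning rate

for step in range(100):
    r = X @ w - y                # residuals h - y over the whole data set
    grad = 2.0 * X.T @ r         # gradient of E = sum_k (w . x^k - y^k)^2
    w = w - eps * grad           # steepest descent: w_new = w_old - eps * grad E
```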


Stochastic gradient descent: online learning


Pathological situation

If two lines are almost parallel


then the ellipse has a ravine-like shape


Full batch


Online


In either case, $\omega^{(\mathrm{new})}$ does not move much in the direction of the minimum [oscillation phenomenon].

If the learning rate is big, $\omega^{(\mathrm{new})}$ diverges.


3.2. Training general feedforward neural network

Data: ((x1, x2), y)

y = 0,1


Example: error of binary classifier

Error:
$$E = -\,I(y = 1) \log \frac{e^{\omega_1 x_1 + \omega_2 x_2}}{1 + e^{\omega_1 x_1 + \omega_2 x_2}} - I(y = 0) \log \frac{1}{1 + e^{\omega_1 x_1 + \omega_2 x_2}}$$

When $y = 1$, the error is
$$E = -\log \operatorname{sigm}(\omega_1 x_1 + \omega_2 x_2)$$


- $\operatorname{sigm}(z)$ is concave for $z > 0$
- $\log$ is also concave
- Thus $E = -\log \operatorname{sigm}(\omega_1 x_1 + \omega_2 x_2)$ is convex when $\omega_1 x_1 + \omega_2 x_2 > 0$

When $y = 0$, the error is
$$E = -\log \frac{1}{1 + e^{\omega_1 x_1 + \omega_2 x_2}}$$
- $\dfrac{1}{1 + e^{t}}$ is concave for $t < 0$
- $\log$ is also concave
- $E$ is convex if $\omega_1 x_1 + \omega_2 x_2 < 0$

Thus E is convex in the “correct” region, but not everywhere


Error

The error of a general neural network is a very complicated function of a huge number of variables.


The purpose of training (learning) is to find the value of $\omega$ that minimizes $E$.

(Stochastic) gradient descent

The basic workhorse is a variant of gradient descent

Two stages

- Initialization: how to find a "good" starting point (configuration of $\omega$)
- Algorithm: how to get to a "good" minimum point from the given starting point


Basic issues

(A) Mode

- Full-batch learning (gradient descent): use the entire data set
- Mini-batch learning (stochastic gradient descent): divide the data set into a family of smaller data sets (mini-batches) and use each mini-batch alternatingly [preferred method for a large data set with much redundancy]
- Online learning (stochastic gradient descent): every mini-batch consists of a single data point
(a code sketch of these modes follows the overview below)

(B) Input
Whether to use the given data as input or do some transformation of it

(C) Weight
How to set weights initially and change them in the course of training


(D) Learning rate
How to choose learning rate(s) initially and change it (them) in the course of training

(E) Generalization error

- How to avoid overfitting
- How to estimate the generalization error
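As a rough sketch of the modes in (A) (not from the slides), a mini-batch stochastic gradient descent loop, assuming a hypothetical gradient function `grad_E(w, X_batch, y_batch)` and data arrays `X`, `y`; `batch_size = len(X)` recovers full-batch learning and `batch_size = 1` recovers online learning:

```python
import numpy as np

def sgd(w, X, y, grad_E, eps=0.01, batch_size=32, n_epochs=10):
    """Mini-batch SGD: sweep the shuffled data set, one mini-batch at a time."""
    N = len(X)
    for epoch in range(n_epochs):
        order = np.random.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            w = w - eps * grad_E(w, X[idx], y[idx])
    return w
```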

(B) Input

Example (Hinton)

$$h = \omega_1 x_1 + \omega_2 x_2$$


Data ((101,101),2), ((101,99),0)


Subtract 100 ⇒ ((1,1),2), ((1,−1),0)


Example (Hinton)

Data ((0.1,10),2), ((0.1,−10),0)


Divide componentwise by average magnitude ⇒ ((1,1),2), ((1,−1),0)


If the inputs are highly correlated, they tend to create "ravines". To alleviate this problem,

- decorrelate the input
- normalize each co-ordinate value to have similar variability (i.e. variance)

May use, e.g., PCA or autoencoder (see the forthcoming lecture).
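A sketch of these two preprocessing steps (a PCA-style whitening; illustrative, with a hypothetical helper name and a data matrix `X` holding one example per row):

```python
import numpy as np

def decorrelate_and_normalize(X, eps=1e-8):
    """Center the data, rotate onto the principal axes (PCA), and rescale
    each coordinate so that it has roughly unit variance."""
    Xc = X - X.mean(axis=0)                 # center each input coordinate
    cov = Xc.T @ Xc / len(Xc)               # covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)    # principal axes of the data
    X_rot = Xc @ eigvec                     # decorrelated coordinates
    return X_rot / np.sqrt(eigval + eps)    # equalize the variances
```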


(C) Weight

Weights are to be determined by the (learning) algorithm

Initialization is still an issue

- Random initialization
- Breaking symmetry

Fan-in

- The fan-in of a neuron (layer) is the number of inputs coming into the neuron (layer) in question
- A big fan-in may result in a big change in the value of the neuron in the later layer even with a small change in the earlier layer. Thus, it is better to initialize the incoming weights of a neuron with big fan-in to be small
- With small fan-in, one may initialize the incoming weights bigger (not so small)
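A sketch of symmetry-breaking random initialization scaled by the fan-in; the $1/\sqrt{\text{fan-in}}$ scale is one common choice consistent with the remark above, not a prescription from the slides:

```python
import numpy as np

def init_layer(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Small random weights; the larger the fan-in, the smaller the scale."""
    W = rng.standard_normal((fan_out, fan_in)) / np.sqrt(fan_in)
    b = np.zeros(fan_out)
    return W, b
```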


A general (esp. deep) neural network has lots of local minima in which the gradient descent process may get trapped. This is one of the central issues in the training of neural networks.

Pre-training has the effect of putting the initial position reasonably close to the intended minimum [see the forthcoming lectures].


(D) Learning rate

Big learning rate

- learn quickly (error decreases rapidly) in the early stage
- and then may start to oscillate and the error gets erratic

Small learning rate

- takes a long time to learn
- may get stuck in a "bad" local minimum

Rule of thumb

- If error oscillates, decrease the learning rate
- If error decreases consistently, increase the learning rate

Using a different learning rate for each weight

- The magnitudes of the weights vary greatly, and it may cause problems if a single learning rate is used for all weights
- See Hinton's Coursera lecture for this


Momentum method

Idea: in a ravine, an oscillatory phenomenon occurs

Gradient descent:
$$\omega(t) = \omega(t-1) - \varepsilon \nabla E(\omega(t-1))$$
$$\omega(t+1) = \omega(t) - \varepsilon \nabla E(\omega(t))$$


$\omega(t) - \omega(t-1)$ and $\omega(t+1) - \omega(t)$ are nearly opposite to each other. If added up, they nearly cancel each other, and the resulting vector is


This vector may point roughly toward the bottom of the ravine; the momentum method came out of this kind of observation.


Algorithm

Keep track of the momentum vector v(t)

Given weight $\omega(t)$ and momentum $v(t)$, let
$$v(t+1) = \mu v(t) - \varepsilon \nabla E(\omega(t))$$
$$\omega(t+1) = \omega(t) + v(t+1)$$
$\mu$: momentum decay coefficient
$\varepsilon$: learning rate


Improved momentum method (Sutskever et al., à la Nesterov)
$$v(t+1) = \mu v(t) - \varepsilon \nabla E(\omega(t) + \mu v(t))$$
$$\omega(t+1) = \omega(t) + v(t+1)$$
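A side-by-side sketch of one step of the classic momentum update and the improved (Nesterov-style) update above, assuming a gradient function `grad_E` and array-valued `w`, `v` (illustrative only):

```python
def momentum_step(w, v, grad_E, eps=0.01, mu=0.9):
    """Classic momentum: v(t+1) = mu v(t) - eps grad E(w(t))."""
    v = mu * v - eps * grad_E(w)
    return w + v, v

def nesterov_step(w, v, grad_E, eps=0.01, mu=0.9):
    """Improved momentum: the gradient is evaluated at the look-ahead point w + mu v."""
    v = mu * v - eps * grad_E(w + mu * v)
    return w + v, v
```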


In the ravine, the sequence {ω(t)} may look like

At the beginning, set the momentum coefficient small (e.g., 0.5) and eventually increase it (e.g., to 0.9).


Other methods

Separate adaptive learning rate

RMSprop

[See Hinton]
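A hedged sketch of an RMSprop-style per-weight learning rate (one common formulation; see Hinton's lecture for the original description):

```python
import numpy as np

def rmsprop_step(w, cache, grad, eps=0.001, decay=0.9, delta=1e-8):
    """Keep a running average of the squared gradient for every weight and
    divide the step by its square root, giving each weight its own
    effective learning rate."""
    cache = decay * cache + (1.0 - decay) * grad**2
    w = w - eps * grad / (np.sqrt(cache) + delta)
    return w, cache
```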


(E) Generalization error

Neural networks tend to overfit

Some care is needed to control the generalization error

Make use of machine learning techniques

- Regularization: add a regularizing term to the error function (standard regularization); keep weights within some prescribed bound; keep the network simpler
- Model selection: try many neural networks and choose the best one according to the model selection criterion; try early stopping
- Aggregation/Bagging: train many neural networks and use the averaging technique; try bootstrap in conjunction with aggregation
- Randomization: apply dropout
- etc.
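As an illustration of the first item (adding a regularizing term to the error), the standard $\ell_2$ penalty $\frac{\lambda}{2}\|\omega\|^2$ only adds $\lambda\omega$ to the gradient; a minimal sketch with a hypothetical hyperparameter `lam`:

```python
def regularized_grad(w, grad_E, lam=1e-4):
    """Gradient of E(w) + (lam/2) * ||w||^2: the penalty contributes lam * w."""
    return grad_E(w) + lam * w
```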


Dropout

Imitation of random forests idea

Repeat

- at the start of each training step, randomly select some neurons
- remove them together with the edges connected to them
- do the training with the remaining network
- put back the removed neurons and edges (with old values)

This forces each edge (weight) to individually adapt to the patterns without the co-operation from other edges (weights).
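A sketch of the dropout step described above, applied to one layer's output during a training step (the keep probability `p` is a hypothetical hyperparameter; at test time the full network is used):

```python
import numpy as np

def dropout(h, p=0.5, rng=np.random.default_rng()):
    """Zero out each neuron's output with probability 1 - p, i.e. temporarily
    remove it together with the edges leaving it."""
    mask = rng.random(h.shape) < p
    return h * mask
```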
