
Rapid Introduction to Machine Learning/Deep Learning

Hyeong In Choi

Seoul National University


Lecture 4a: Feedforward neural network

October 30, 2015


Table of contents

1. Objectives of Lecture 4a

2. Multilayer perceptron
   2.1. Feedforward data flow
   2.2. Back propagation algorithm

3. Training neural networks
   3.1. Simple neural network
   3.2. Training general feedforward neural network


1. Objectives of Lecture 4a

Objective 1

Learn the basic formalism of the multilayer feedforward neural network

Objective 2

Learn the back propagation algorithm

Objective 3

Learn about the basic issues related to the training of neural networks


Objective 4

Learn some useful tricks for training neural networks


2. Multilayer perceptron
2.1. Feedforward data flow


First layer


z^1_i : input (pre-activation) to the i-th neuron in Layer 1

z^1_i = Σ_{j=1}^{d} ω^1_{ij} x_j + b^1_i

b^1_i : bias at the i-th neuron in Layer 1; in vector notation

z^1 = W^1 x + b^1

h^1_i : output of the i-th neuron in Layer 1

h^1_i = φ_1(z^1_i), in vector notation h^1 = φ_1(z^1),


where φ_1(z) is one of

sigm(z) = 1 / (1 + e^{−z})

tanh(z)

ReLU(z) = max(z, 0) = z^+

etc.

At the ℓ-th layer


The input (pre-activation) at the i-th neuron in Layer ℓ

z^ℓ_i = Σ_j ω^ℓ_{ij} h^{ℓ−1}_j + b^ℓ_i, in vector notation z^ℓ = W^ℓ h^{ℓ−1} + b^ℓ

The output at the i-th neuron in Layer ℓ

h^ℓ_i = φ_ℓ(z^ℓ_i), in vector notation h^ℓ = φ_ℓ(W^ℓ h^{ℓ−1} + b^ℓ)
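As a concrete illustration, here is a minimal NumPy sketch of this layer-by-layer recursion. The layer sizes, the ReLU choice for φ, and all function names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def relu(z):
    # ReLU(z) = max(z, 0), applied elementwise
    return np.maximum(z, 0.0)

def forward(x, weights, biases, phi=relu):
    """Feedforward pass: h^0 = x, then h^l = phi(W^l h^{l-1} + b^l)."""
    h = x
    activations = [h]          # keep h^0, h^1, ... (useful later for back propagation)
    pre_activations = []       # keep z^1, z^2, ...
    for W, b in zip(weights, biases):
        z = W @ h + b          # z^l = W^l h^{l-1} + b^l
        h = phi(z)             # h^l = phi_l(z^l)
        pre_activations.append(z)
        activations.append(h)
    return pre_activations, activations

# Tiny usage example with random weights (d = 3 inputs, one hidden layer of 4, output of 2)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
zs, hs = forward(rng.normal(size=3), weights, biases)
```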


The output layer: Layer L

Pre-activation: z^L = W^L h^{L−1} + b^L

Output: h^L = φ_L(z^L)

In case of multi-class classification (K classes)

h^L = softmax(z^L)

i.e.

h^L_i = e^{z^L_i} / Σ_{k=1}^{K} e^{z^L_k} = exp(W^L_{i·} h^{L−1} + b^L_i) / Σ_{k=1}^{K} exp(W^L_{k·} h^{L−1} + b^L_k)

h^L_i ∼ P(Y_i = 1 | x)

[Note: W_{i·} denotes the i-th row of matrix W]

In case of regression, h^L_i ∈ ℝ, ∀i
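A minimal sketch of the softmax output in NumPy; subtracting the maximum before exponentiating is an added implementation detail for numerical stability, not something the slides discuss.

```python
import numpy as np

def softmax(z):
    # h^L_i = exp(z_i) / sum_k exp(z_k); subtracting max(z) avoids overflow
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```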


The loss (error) function

For real output (regression)

E = Σ_k |h^L_k − y_k|^2

For discrete output (classification)

Recall: multivariate Bernoulli

P(y) = µ_1^{y_1} ⋯ µ_K^{y_K}

Given data D = {(x^{(t)}, y^{(t)})}_{t=1}^{N}

Likelihood

Π_t P(y^{(t)} | x^{(t)}) = Π_t µ_1^{y^{(t)}_1} ⋯ µ_K^{y^{(t)}_K}


Log likelihood

Σ_t { y^{(t)}_1 log µ_1 + ⋯ + y^{(t)}_K log µ_K }

h^L_i maximizes this log likelihood and is the estimator of µ_i. The error is defined to be the negative of the log likelihood (h^L_i minimizes this error)

E = − Σ_{t=1}^{N} Σ_{k=1}^{K} y^{(t)}_k log h^L_k = − Σ_{t=1}^{N} Σ_{k=1}^{K} I(y^{(t)} = k) log [ e^{z^L_k} / Σ_{j=1}^{K} e^{z^L_j} ],

where z^L_j = W^L_{j·} h^{L−1} + b^L_j

[Note: this error is called the softmax error or the cross-entropy error]
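A minimal sketch of this cross-entropy error for a single one-hot target; the 1e-12 guard against log(0) is an added implementation detail.

```python
import numpy as np

def cross_entropy(h_L, y_onehot):
    # E = - sum_t sum_k y_k^(t) log h_k^L (here for a single example t)
    return -np.sum(y_onehot * np.log(h_L + 1e-12))

# One example, K = 3 classes, true class is the second one
h_L = np.array([0.2, 0.7, 0.1])          # softmax output
y = np.array([0.0, 1.0, 0.0])            # one-hot target
print(cross_entropy(h_L, y))             # equals -log(0.7)
```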


Remark

For regression, the ℓ2-error is typical. But one can use the ℓ1-error, like

E = Σ_k |h^L_k − y_k|,

or another convex function of h^L − y

For classification, this softmax error is typical. But one can use other similar errors

A feedforward network has the property that once the values (z^ℓ or h^ℓ) of all the neurons of a Layer ℓ are given, the values of the layers that come after Layer ℓ are all determined by them (assuming all weights and biases are fixed). Thus one can write the error E(x, y) as E(h^ℓ, y) or E(z^ℓ, y) for any ℓ = 0, 1, ⋯, L


2.2. Back propagation algorithm


Change of E w.r.t. ω^ℓ_{ij}

Fix y, treat E as a function of the values (z^ℓ or h^ℓ) of the ℓ-th layer. Then

∂E/∂z^ℓ_i = (dh^ℓ_i/dz^ℓ_i) · ∂E/∂h^ℓ_i,   (1)

where

dh^ℓ_i/dz^ℓ_i = h^ℓ_i (1 − h^ℓ_i)   if φ_ℓ is sigm
              = I(z^ℓ_i ≥ 0)        if φ_ℓ is ReLU
              = sech^2(z^ℓ_i)       if φ_ℓ is tanh

Then

∂E/∂h^{ℓ−1}_j = Σ_i (∂z^ℓ_i/∂h^{ℓ−1}_j) · ∂E/∂z^ℓ_i.

From

z^ℓ_i = Σ_j ω^ℓ_{ij} h^{ℓ−1}_j + b^ℓ_i,   (2)


we get

∂z^ℓ_i/∂h^{ℓ−1}_j = ω^ℓ_{ij},   ∂z^ℓ_i/∂ω^ℓ_{ij} = h^{ℓ−1}_j

Thus

∂E/∂h^{ℓ−1}_j = Σ_i ω^ℓ_{ij} ∂E/∂z^ℓ_i   (3)

Using (2), we get

∂E/∂ω^ℓ_{ij} = (∂z^ℓ_i/∂ω^ℓ_{ij}) · ∂E/∂z^ℓ_i   (4)

Thus

∂E/∂ω^ℓ_{ij} = h^{ℓ−1}_j ∂E/∂z^ℓ_i


Change of E w.r.t. b^ℓ_i

The weight connecting the bias neuron in Layer ℓ − 1 to the i-th neuron in Layer ℓ is b^ℓ_i = ω^ℓ_{i0}

h^{ℓ−1}_0 = 1 and there is no input (pre-activation) to the bias neuron


Note from (2)

∂z^ℓ_i/∂b^ℓ_i = ∂z^ℓ_i/∂ω^ℓ_{i0} = 1

Thus from (4)

∂E/∂b^ℓ_i = ∂E/∂ω^ℓ_{i0} = ∂E/∂z^ℓ_i

Hence we get

∂E/∂ω^ℓ_{ij} = h^{ℓ−1}_j ∂E/∂z^ℓ_i
∂E/∂b^ℓ_i = ∂E/∂ω^ℓ_{i0} = ∂E/∂z^ℓ_i   (5)


Propagation mechanism

The data flows from Layer ℓ − 1 to Layer ℓ, i.e. it moves forward (hence the name "feedforward network")

The error derivatives are computed in the opposite direction:

∂E/∂h^ℓ_i  --by (1)-->  ∂E/∂z^ℓ_i  --by (3)-->  ∂E/∂h^{ℓ−1}_i  --by (1)-->  ∂E/∂z^{ℓ−1}_i  --> and so on

Namely, the error derivatives can be computed from the output layer (Layer L) backward all the way to the input layer (Layer 0) (hence the name back propagation)

Equation (5) is the basic equation to be used for the gradient descent algorithm
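A minimal NumPy sketch of equations (1), (3), and (5) for a network whose layers all use the sigmoid activation and the squared error E = Σ_k (h^L_k − y_k)^2; the function names and network setup are illustrative choices, not the lecture's own code.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    hs = [x]                                      # h^0 = x
    for W, b in zip(weights, biases):
        hs.append(sigm(W @ hs[-1] + b))           # h^l = sigm(W^l h^{l-1} + b^l)
    return hs

def backprop(x, y, weights, biases):
    """Gradients of E = sum_k (h^L_k - y_k)^2 via equations (1), (3), (5)."""
    hs = forward(x, weights, biases)
    dE_dh = 2.0 * (hs[-1] - y)                    # dE/dh^L for the squared error
    grads_W, grads_b = [], []
    for l in reversed(range(len(weights))):
        h_l, h_prev = hs[l + 1], hs[l]
        dE_dz = h_l * (1.0 - h_l) * dE_dh         # (1), with dh/dz = h(1-h) for sigm
        grads_W.insert(0, np.outer(dE_dz, h_prev))  # (5): dE/dw^l_ij = h^{l-1}_j dE/dz^l_i
        grads_b.insert(0, dE_dz)                    # (5): dE/db^l_i  = dE/dz^l_i
        dE_dh = weights[l].T @ dE_dz                # (3): dE/dh^{l-1}_j = sum_i w^l_ij dE/dz^l_i
    return grads_W, grads_b
```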


3. Training neural networks
3.1. Simple neural network

Reference: Hinton’s Coursera Lectures

h = ω_1 x_1 + ω_2 x_2

ℓ2-error for a single training example ((x_1, x_2), y)

E = E(ω_1, ω_2) = (ω_1 x_1 + ω_2 x_2 − y)^2


x_1, x_2, y are fixed; ω_1 and ω_2 are variables


Gradient

∇E = (∂E/∂ω_1, ∂E/∂ω_2) = (2x_1(ω_1x_1 + ω_2x_2 − y), 2x_2(ω_1x_1 + ω_2x_2 − y)) ∼ (x_1, x_2)

∇E points perpendicular to the line ω_1x_1 + ω_2x_2 − y = 0


ℓ2-error for two data points

D = {((x^1_1, x^1_2), y^1), ((x^2_1, x^2_2), y^2)}

E = E(ω_1, ω_2) = (ω_1 x^1_1 + ω_2 x^1_2 − y^1)^2 + (ω_1 x^2_1 + ω_2 x^2_2 − y^2)^2


The level sets E = constant are ellipses


Steepest descent: full batch learning

ω(new) = ω(old) − ε∇E(ω(old))

ε: learning rate
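A minimal NumPy sketch of this full-batch update on a two-data-point ℓ2-error of the kind above; the data values, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Full-batch steepest descent on E(w) = sum_t (w . x^t - y^t)^2
# for the simple two-weight network h = w1*x1 + w2*x2.
X = np.array([[1.0, 1.0],
              [1.0, -1.0]])          # two data points (x1, x2)
y = np.array([2.0, 0.0])             # targets
w = np.zeros(2)                      # starting point
eps = 0.1                            # learning rate

for step in range(100):
    residual = X @ w - y             # (w1*x1 + w2*x2 - y) for each data point
    grad = 2.0 * X.T @ residual      # gradient of the summed squared error
    w = w - eps * grad               # w(new) = w(old) - eps * grad E(w(old))

print(w)                             # converges to (1, 1) for this data
```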


Stochastic gradient descent: online learning


Pathological situation

If two lines are almost parallel


then the ellipse has a ravine-like shape


Full batch


Online


In either case, ω(new) does not move much in the direction of the minimum [oscillation phenomenon]

If the learning rate is big, ω(new) diverges


3.2. Training general feedforward neural network

Data: ((x_1, x_2), y)

y ∈ {0, 1}


Example: error of binary classifier

Error

E = − I(y = 1) log [ e^{ω_1x_1 + ω_2x_2} / (1 + e^{ω_1x_1 + ω_2x_2}) ] − I(y = 0) log [ 1 / (1 + e^{ω_1x_1 + ω_2x_2}) ]

When y = 1,

E = − log sigm(ω_1x_1 + ω_2x_2)


sigm(z) is concave for z > 0

log is also concave

Thus E = − log sigm(ω_1x_1 + ω_2x_2) is convex when ω_1x_1 + ω_2x_2 > 0

When y = 0,

E = − log [ 1 / (1 + e^{ω_1x_1 + ω_2x_2}) ]

1/(1 + e^t) is concave for t < 0

log is also concave, so E is convex if ω_1x_1 + ω_2x_2 < 0

Thus E is convex in the “correct” region, but not everywhere


Error

The error of a general neural network is a very complicated function of a huge number of variables


The purpose of training (learning) is to find the value of ω that minimizes E.

(Stochastic) gradient descent

The basic workhorse is a variant of gradient descent

Two stages

Initialization: how to find a "good" starting point (configuration of ω)

Algorithm: how to get to a "good" minimum point from the given starting point


Basic issues

(A) Mode

Full-batch learning (gradient descent): use the entire data set

Mini-batch learning (stochastic gradient descent): divide the data set into a family of smaller data sets (mini-batches) and use each mini-batch in turn [preferred method for a large data set with much redundancy]; a minimal sketch of these modes follows this list

Online learning (stochastic gradient descent): every mini-batch consists of a single data point

(B) Input

Whether to use the given data as input or do some transformation of it

(C) Weight

How to set weights initially and change them in the course of training

41/59

1. Objectives of Lecture 4a 2. Multilayer perceptron 3. Training neural networks

3.2. Training general feedforward neural network

(D) Learning rate

How to choose learning rate(s) initially and change it (them) in the course of training

(E) Generalization error

How to avoid overfitting

How to estimate the generalization error
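The sketch promised under (A): one routine covering all three modes, where batch_size = len(data) gives full-batch learning, 1 < batch_size < len(data) gives mini-batch learning, and batch_size = 1 gives online learning. The function names, learning rate, epoch count, and the reuse of the earlier two-point least-squares example are illustrative assumptions.

```python
import numpy as np

def sgd(grad_fn, w, data, batch_size, eps=0.01, epochs=10, rng=None):
    """Mini-batch SGD; batch_size = len(data) is full-batch, batch_size = 1 is online."""
    rng = rng or np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(len(data))                  # reshuffle each epoch
        for start in range(0, len(data), batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            g = sum(grad_fn(w, x, y) for x, y in batch) / len(batch)
            w = w - eps * g                               # one update per mini-batch
    return w

# Usage on the two-point least-squares example:
def grad_fn(w, x, y):
    return 2.0 * (w @ x - y) * x                          # gradient of (w.x - y)^2

data = [(np.array([1.0, 1.0]), 2.0), (np.array([1.0, -1.0]), 0.0)]
w = sgd(grad_fn, np.zeros(2), data, batch_size=1, eps=0.1, epochs=50)
print(w)                                                  # approaches (1, 1)
```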

(B) Input

Example (Hinton)

h = ω_1 x_1 + ω_2 x_2


Data ((101,101),2), ((101,99),0)


Subtract 100 ⇒ ((1,1),2), ((1,−1),0)


Example (Hinton)

Data ((0.1,10),2), ((0.1,−10),0)


Divide componentwise by the average magnitude ⇒ ((1,1),2), ((1,−1),0)


If the inputs are highly correlated, they tend to create“ravines”. To alleviate this problem,

decorrelate the input

normalize each coordinate value to have similar variability (i.e. variance)

One may use, e.g., PCA or an autoencoder (see the forthcoming lecture)
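A minimal sketch of these two remedies, assuming plain per-coordinate standardization and PCA-based decorrelation in NumPy; the autoencoder alternative is left to the forthcoming lecture, and the data reuses Hinton's example above.

```python
import numpy as np

def standardize(X):
    # Normalize each coordinate to zero mean and (roughly) unit variance
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-12
    return (X - mu) / sigma

def pca_decorrelate(X):
    # Rotate the centered data onto the principal axes so the coordinates are uncorrelated
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    return Xc @ eigvecs

X = np.array([[101.0, 101.0], [101.0, 99.0]])   # inputs from the example above
print(standardize(X))
print(pca_decorrelate(X))
```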


(C) Weight

Weights are to be determined by the (learning) algorithm

Initialization is still an issue

Random initialization

Breaking symmetry

Fan-in

The fan-in of a neuron (layer) is the number of incoming connections (input lines) to the neuron (layer) in question

A big fan-in may result in a big change in the value of the neuron in the later layer, even with small changes in the earlier layer. Thus it is better to initialize the incoming weights of a neuron with a big fan-in to be small

With a small fan-in, one may initialize the incoming weights to be bigger (not so small)
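A minimal sketch of fan-in-dependent random initialization; scaling the standard deviation by 1/√fan-in is one common way to make "smaller weights for bigger fan-in" concrete, not a prescription from the slides.

```python
import numpy as np

def init_weights(layer_sizes, rng=None):
    """Random initialization that breaks symmetry and shrinks with the fan-in.

    The 1/sqrt(fan_in) scale is an assumed concrete choice."""
    rng = rng or np.random.default_rng(0)
    weights, biases = [], []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
        weights.append(W)
        biases.append(np.zeros(fan_out))
    return weights, biases

weights, biases = init_weights([784, 100, 10])   # layer sizes are illustrative
print([W.std() for W in weights])                # smaller spread where fan-in is bigger
```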


A general (especially deep) neural network has lots of local minima in which the gradient descent process may get trapped. This is one of the central issues in the training of neural networks

Pre-training has the effect of putting the initial position reasonably close to the intended minimum [see the forthcoming lectures]


(D) Learning rate

Big learning rate

learns quickly (the error decreases rapidly) in the early stage

then it may start to oscillate and the error gets erratic

Small learning rate

takes a long time to learn

may get stuck in a "bad" local minimum

Rule of thumb

If the error oscillates, decrease the learning rate

If the error decreases consistently, increase the learning rate

Using a different learning rate for each weight

The magnitudes of the weights vary greatly, and this may cause problems if a single learning rate is used for all weights

See Hinton's Coursera lecture for this


Momentum method

Idea: in a ravine, an oscillatory phenomenon occurs

Gradient descent:

ω(t) = ω(t − 1) − ε∇E(ω(t − 1))
ω(t + 1) = ω(t) − ε∇E(ω(t))


ω(t) − ω(t − 1) and ω(t + 1) − ω(t) are nearly opposite to each other. If added up, they nearly cancel each other, and the resulting vector is


This vector may point roughly toward the bottom of the ravine;

the momentum method came out of this kind of observation


Algorithm

Keep track of the momentum vector v(t)

Given weight ω(t) and momentum v(t), let

v(t + 1) = µv(t) − ε∇E(ω(t))

ω(t + 1) = ω(t) + v(t + 1)

µ : momentum decay coefficient
ε : learning rate
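A minimal sketch of this update rule in NumPy; the quadratic test function, learning rate, and momentum coefficient below are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def momentum_step(w, v, grad_E, eps=0.01, mu=0.9):
    """One step of the momentum method:
    v(t+1) = mu * v(t) - eps * grad E(w(t));  w(t+1) = w(t) + v(t+1)."""
    v_new = mu * v - eps * grad_E(w)
    w_new = w + v_new
    return w_new, v_new

# Usage on a ravine-like quadratic E(w) = 0.5*(10*w1^2 + 0.1*w2^2) (an assumed example)
grad_E = lambda w: np.array([10.0, 0.1]) * w
w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, grad_E, eps=0.05, mu=0.9)
print(w)   # approaches the minimum at the origin
```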


Improved momentum method (Sutskever et al., à la Nesterov)

v(t + 1) = µv(t) − ε∇E(ω(t) + µv(t))

ω(t + 1) = ω(t) + v(t + 1)
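The same sketch adapted to the improved rule; the only change from the previous block is that the gradient is evaluated at the look-ahead point ω(t) + µv(t).

```python
def nesterov_momentum_step(w, v, grad_E, eps=0.01, mu=0.9):
    """Improved (Nesterov-style) momentum:
    v(t+1) = mu * v(t) - eps * grad E(w(t) + mu * v(t));  w(t+1) = w(t) + v(t+1)."""
    v_new = mu * v - eps * grad_E(w + mu * v)
    w_new = w + v_new
    return w_new, v_new
```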


In the ravine, the sequence {ω(t)} may look like

At the beginning, set the momentum coefficient small (e.g., 0.5) and eventually increase it (e.g., to 0.9)


Other methods

Separate adaptive learning rate

RMSprop

[See Hinton]


(E) Generalization error

Neural networks tend to overfit

Some care is needed to control the generalization error

Make use of machine learning techniques

Regularization: add a regularizing term to the error function (standard regularization); keep the weights within some prescribed bound; keep the network simpler

Model selection: try many neural networks and choose the best one according to a model selection criterion; try early stopping

Aggregation/Bagging: train many neural networks and use the averaging technique; try the bootstrap in conjunction with aggregation

Randomization: apply dropout

etc.


Dropout

Imitation of random forests idea

Repeat

at the start of each training step, randomly select some neurons

remove them together with the edges connected to them

do the training with the remaining network

put back the removed neurons and edges (with old values)

This forces each edge (weight) to individually adapt to the patterns without co-operation from the other edges (weights)
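A minimal sketch of the dropout step applied to a vector of hidden activations; the keep probability and the 1/keep_prob rescaling ("inverted dropout") are common implementation choices that the slides do not spell out.

```python
import numpy as np

def dropout_forward(h, keep_prob=0.5, rng=None, training=True):
    """Randomly 'remove' neurons by zeroing their outputs for this training step."""
    if not training:
        return h                                   # at test time use the full network
    rng = rng or np.random.default_rng(0)
    mask = rng.random(h.shape) < keep_prob         # which neurons survive this step
    return h * mask / keep_prob                    # zero out dropped neurons, rescale the rest

h = np.array([0.3, 1.2, 0.7, 0.0, 2.1])
print(dropout_forward(h, keep_prob=0.5))
```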


Recommended