
Rapid Introduction to Machine Learning/Deep Learning

Hyeong In Choi

Seoul National University


Lecture 4a: Feedforward neural network

October 30, 2015


Table of contents

1. Objectives of Lecture 4a

2. Multilayer perceptron
   2.1. Feedforward data flow
   2.2. Back propagation algorithm

3. Training neural networks
   3.1. Simple neural network
   3.2. Training general feedforward neural network


1. Objectives of Lecture 4a

Objective 1

Learn the basic formalism of multilayer feedforward neural networks

Objective 2

Learn the back propagation algorithm

Objective 3

Learn about the basic issues related to the training of neural networks


Objective 4

Learn some useful tricks for training the neural network


2. Multilayer perceptron
2.1. Feedforward data flow


First layer


$z^1_i$: input (pre-activation) to the $i$th neuron in Layer 1
$$z^1_i = \sum_{j=1}^{d} \omega^1_{ij} x_j + b^1_i$$
$b^1_i$: bias at the $i$th neuron in Layer 1; in vector notation,
$$z^1 = W^1 x + b^1$$
$h^1_i$: output of the $i$th neuron in Layer 1
$$h^1_i = \phi_1(z^1_i),$$
in vector notation,
$$h^1 = \phi_1(z^1),$$


where
$$\phi_1(z) = \begin{cases} \operatorname{sigm}(z) = \dfrac{1}{1+e^{-z}} \\ \tanh(z) \\ \operatorname{ReLU}(z) = \max(z, 0) = z^{+} \\ \text{etc.} \end{cases}$$

At the $\ell$th layer


The input (pre-activation) at the $i$th neuron in Layer $\ell$:
$$z^\ell_i = \sum_j \omega^\ell_{ij} h^{\ell-1}_j + b^\ell_i,$$
in vector notation,
$$z^\ell = W^\ell h^{\ell-1} + b^\ell$$

The output at the $i$th neuron in Layer $\ell$:
$$h^\ell_i = \phi_\ell(z^\ell_i),$$
in vector notation,
$$h^\ell = \phi_\ell(W^\ell h^{\ell-1} + b^\ell)$$
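As an illustration of the data flow just described (not from the slides), here is a minimal NumPy sketch, assuming hypothetical lists `Ws` and `bs` of per-layer weight matrices $W^\ell$ and bias vectors $b^\ell$, and a single activation used at every layer for simplicity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs, phi=sigmoid):
    """Feedforward pass: returns [h^0, h^1, ..., h^L] for input x = h^0."""
    h = x
    hs = [h]
    for W, b in zip(Ws, bs):
        z = W @ h + b      # pre-activation z^l = W^l h^{l-1} + b^l
        h = phi(z)         # layer output  h^l = phi_l(z^l)
        hs.append(h)
    return hs
```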


The output layer: Layer $L$

Pre-activation:
$$z^L = W^L h^{L-1} + b^L$$
Output:
$$h^L = \phi_L(z^L)$$

In case of multi-class classification ($K$ classes),
$$h^L = \operatorname{softmax}(z^L),$$
i.e.
$$h^L_i = \frac{e^{z^L_i}}{\sum_{k=1}^{K} e^{z^L_k}} = \frac{\exp(W^L_{i\cdot}\, h^{L-1} + b^L_i)}{\sum_{k=1}^{K} \exp(W^L_{k\cdot}\, h^{L-1} + b^L_k)}$$
$$h^L_i \sim P(Y = i \mid x)$$
[Note: $W_{i\cdot}$ denotes the $i$th row of the matrix $W$]

In case of regression, $h^L_i \in \mathbb{R}$, $\forall i$.


The loss (error) function

For real output (regression):
$$E = \sum_k |h^L_k - y_k|^2$$

For discrete output (classification), recall the multivariate Bernoulli distribution
$$P(y) = \mu_1^{y_1} \cdots \mu_K^{y_K}$$
Given data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^{N}$, the likelihood is
$$\prod_t P(y^{(t)} \mid x^{(t)}) = \prod_t \mu_1^{y^{(t)}_1} \cdots \mu_K^{y^{(t)}_K}$$


The log likelihood is
$$\sum_t \left\{ y^{(t)}_1 \log \mu_1 + \cdots + y^{(t)}_K \log \mu_K \right\}$$
$h^L_i$ maximizes this log likelihood and is the estimator of $\mu_i$. The error is defined to be the negative of the log likelihood (so $h^L_i$ minimizes this error):
$$E = -\sum_{t=1}^{N} \sum_{k=1}^{K} y^{(t)}_k \log h^L_k = -\sum_{t=1}^{N} \sum_{k=1}^{K} I(y^{(t)} = k) \log \frac{e^{z^L_k}}{\sum_{j=1}^{K} e^{z^L_j}},$$
where $z^L_j = W^L_{j\cdot}\, h^{L-1} + b^L_j$
[Note: this error is called the softmax error or the cross-entropy error]
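As a small illustration (not part of the lecture), a numerically stable sketch of the softmax output and the cross-entropy error for a single example, assuming the label `y` is given as a class index:

```python
import numpy as np

def softmax(zL):
    # subtracting the max does not change the result but avoids overflow
    e = np.exp(zL - zL.max())
    return e / e.sum()

def cross_entropy(zL, y):
    """E = -log h^L_y for one example, y a class index in 0..K-1."""
    hL = softmax(zL)
    return -np.log(hL[y])
```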


Remark

For regression, the $\ell_2$-error is typical, but one can use the $\ell_1$-error
$$E = \sum_k |h^L_k - y_k|,$$
or another convex function of $h^L - y$.

For classification, this softmax error is typical, but one can use other similar errors.

A feedforward network has the property that once the values ($z^\ell$ or $h^\ell$) of all the neurons of a Layer $\ell$ are given, the values of the layers that come after Layer $\ell$ are all determined by them (assuming all weights and biases are fixed). Thus we can write the error $E(x, y)$ as $E(h^\ell, y)$ or $E(z^\ell, y)$ for any $\ell = 0, 1, \cdots, L$.


2.2. Back propagation algorithm


Change of $E$ w.r.t. $\omega^\ell_{ij}$

Fix $y$ and treat $E$ as a function of the values ($z^\ell$ or $h^\ell$) of the $\ell$th layer. Then
$$\frac{\partial E}{\partial z^\ell_i} = \frac{dh^\ell_i}{dz^\ell_i} \frac{\partial E}{\partial h^\ell_i}, \qquad (1)$$
where
$$\frac{dh^\ell_i}{dz^\ell_i} = \begin{cases} h^\ell_i (1 - h^\ell_i) & \text{if } \phi_\ell \text{ is sigm} \\ I(z^\ell_i \ge 0) & \text{if } \phi_\ell \text{ is ReLU} \\ \operatorname{sech}^2 z^\ell_i & \text{if } \phi_\ell \text{ is tanh} \end{cases}$$
Then
$$\frac{\partial E}{\partial h^{\ell-1}_j} = \sum_i \frac{\partial z^\ell_i}{\partial h^{\ell-1}_j} \frac{\partial E}{\partial z^\ell_i}.$$
From
$$z^\ell_i = \sum_j \omega^\ell_{ij} h^{\ell-1}_j + b^\ell_i, \qquad (2)$$


we get
$$\frac{\partial z^\ell_i}{\partial h^{\ell-1}_j} = \omega^\ell_{ij}, \qquad \frac{\partial z^\ell_i}{\partial \omega^\ell_{ij}} = h^{\ell-1}_j$$
Thus
$$\frac{\partial E}{\partial h^{\ell-1}_j} = \sum_i \omega^\ell_{ij} \frac{\partial E}{\partial z^\ell_i} \qquad (3)$$
Using (2), we get
$$\frac{\partial E}{\partial \omega^\ell_{ij}} = \frac{\partial z^\ell_i}{\partial \omega^\ell_{ij}} \frac{\partial E}{\partial z^\ell_i} \qquad (4)$$
Thus
$$\frac{\partial E}{\partial \omega^\ell_{ij}} = h^{\ell-1}_j \frac{\partial E}{\partial z^\ell_i}$$


Change of $E$ w.r.t. $b^\ell_i$

The weight connecting the bias neuron in Layer $\ell-1$ to the $i$th neuron in Layer $\ell$ is $b^\ell_i = \omega^\ell_{i0}$.

$h^{\ell-1}_0 = 1$ and there is no input (pre-activation) to the bias neuron.


Note from (2):
$$\frac{\partial z^\ell_i}{\partial b^\ell_i} = \frac{\partial z^\ell_i}{\partial \omega^\ell_{i0}} = 1$$
Thus from (4):
$$\frac{\partial E}{\partial b^\ell_i} = \frac{\partial E}{\partial \omega^\ell_{i0}} = \frac{\partial E}{\partial z^\ell_i}$$
Hence we get
$$\frac{\partial E}{\partial \omega^\ell_{ij}} = h^{\ell-1}_j \frac{\partial E}{\partial z^\ell_i}, \qquad \frac{\partial E}{\partial b^\ell_i} = \frac{\partial E}{\partial \omega^\ell_{i0}} = \frac{\partial E}{\partial z^\ell_i} \qquad (5)$$


Propagation mechanism

The data flows from Layer $\ell-1$ to Layer $\ell$, i.e. it moves forward (hence the name "feedforward network").

The error derivatives are computed in the opposite direction:
$$\frac{\partial E}{\partial h^\ell_i} \ \xrightarrow{\text{by }(1)}\ \frac{\partial E}{\partial z^\ell_i} \ \xrightarrow{\text{by }(3)}\ \frac{\partial E}{\partial h^{\ell-1}_i} \ \xrightarrow{\text{by }(1)}\ \frac{\partial E}{\partial z^{\ell-1}_i} \ \longrightarrow\ \text{and so on}$$

Namely, the error derivatives can be computed starting from the output layer (Layer $L$) and moving backward all the way to the input layer (Layer 0) (hence the name back propagation).

Equation (5) is the basic equation used in the gradient descent algorithm.
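A minimal backward-pass sketch combining equations (1), (3) and (5) (illustrative only, not the lecture's code): it assumes every $\phi_\ell$ is the sigmoid, that the forward pass stored the layer outputs `hs = [h^0, ..., h^L]`, and that the caller supplies $\partial E / \partial h^L$.

```python
import numpy as np

def backprop(hs, Ws, dE_dhL):
    """Return the lists of gradients dE/dW^l and dE/db^l for l = 1..L."""
    grads_W, grads_b = [], []
    dE_dh = dE_dhL
    for l in range(len(Ws), 0, -1):
        h, h_prev = hs[l], hs[l - 1]
        dE_dz = h * (1.0 - h) * dE_dh                 # eq. (1), sigmoid case
        grads_W.insert(0, np.outer(dE_dz, h_prev))    # eq. (5): dE/dw_ij = h^{l-1}_j dE/dz_i
        grads_b.insert(0, dE_dz)                      # eq. (5): dE/db_i  = dE/dz_i
        dE_dh = Ws[l - 1].T @ dE_dz                   # eq. (3)
    return grads_W, grads_b
```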


3. Training neural networks
3.1. Simple neural network

Reference: Hinton’s Coursera Lectures

$$h = \omega_1 x_1 + \omega_2 x_2$$

$\ell_2$-error for a single training example $((x_1, x_2), y)$:
$$E = E(\omega_1, \omega_2) = (\omega_1 x_1 + \omega_2 x_2 - y)^2$$


$x_1, x_2, y$ are fixed; $\omega_1$ and $\omega_2$ are variables.



Gradient

$$\nabla E = \left( \frac{\partial E}{\partial \omega_1}, \frac{\partial E}{\partial \omega_2} \right) = \big( 2x_1(\omega_1 x_1 + \omega_2 x_2 - y),\ 2x_2(\omega_1 x_1 + \omega_2 x_2 - y) \big) \sim (x_1, x_2)$$

$\nabla E$ is pointing perpendicularly to the line $\omega_1 x_1 + \omega_2 x_2 - y = 0$.


$\ell_2$-error for two data points
$$D = \{((x^1_1, x^1_2), y^1),\ ((x^2_1, x^2_2), y^2)\}$$
$$E = E(\omega_1, \omega_2) = (\omega_1 x^1_1 + \omega_2 x^1_2 - y^1)^2 + (\omega_1 x^2_1 + \omega_2 x^2_2 - y^2)^2$$


The level sets E = constant are ellipses


Steepest descent: full batch learning

$$\omega^{(\mathrm{new})} = \omega^{(\mathrm{old})} - \varepsilon \nabla E(\omega^{(\mathrm{old})})$$
$\varepsilon$: learning rate
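As a toy illustration (data and learning rate are just example values, echoing the $((1,1),2),\ ((1,-1),0)$ example used later in this lecture), a full-batch steepest-descent loop for the two-data-point error above:

```python
import numpy as np

X = np.array([[1.0,  1.0],
              [1.0, -1.0]])      # two inputs (x1, x2)
y = np.array([2.0, 0.0])         # their targets

w = np.zeros(2)                  # (omega1, omega2)
eps = 0.1                        # learning rate

for step in range(100):
    r = X @ w - y                # residuals h - y over the whole data set
    grad = 2.0 * X.T @ r         # gradient of E = sum_k (w . x^k - y^k)^2
    w = w - eps * grad           # steepest descent: w_new = w_old - eps * grad E
```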


Stochastic gradient descent: online learning


Pathological situation

If two lines are almost parallel


then the ellipse has a ravine-like shape


Full batch


Online


In either case, $\omega^{(\mathrm{new})}$ does not move much in the direction of the minimum [oscillation phenomenon].

If the learning rate is big, $\omega^{(\mathrm{new})}$ diverges.


3.2. Training general feedforward neural network

Data: ((x1, x2), y)

y = 0,1


Example: error of binary classifier

Error:
$$E = -\,I(y = 1) \log \frac{e^{\omega_1 x_1 + \omega_2 x_2}}{1 + e^{\omega_1 x_1 + \omega_2 x_2}} - I(y = 0) \log \frac{1}{1 + e^{\omega_1 x_1 + \omega_2 x_2}}$$

When $y = 1$, the error is
$$E = -\log \operatorname{sigm}(\omega_1 x_1 + \omega_2 x_2)$$


- $\operatorname{sigm}(z)$ is concave for $z > 0$
- $\log$ is also concave
- Thus $E = -\log \operatorname{sigm}(\omega_1 x_1 + \omega_2 x_2)$ is convex when $\omega_1 x_1 + \omega_2 x_2 > 0$

When $y = 0$, the error is
$$E = -\log \frac{1}{1 + e^{\omega_1 x_1 + \omega_2 x_2}}$$
- $\dfrac{1}{1 + e^{t}}$ is concave for $t < 0$
- $\log$ is also concave
- $E$ is convex if $\omega_1 x_1 + \omega_2 x_2 < 0$

Thus E is convex in the “correct” region, but not everywhere


Error

The error of a general neural network is a very complicated function of a huge number of variables.


The purpose of training (learning) is to find the value of $\omega$ that minimizes $E$.

(Stochastic) gradient descent

The basic workhorse is a variant of gradient descent

Two stages

- Initialization: how to find a "good" starting point (configuration of $\omega$)
- Algorithm: how to get to a "good" minimum point from the given starting point


Basic issues

(A) Mode

- Full-batch learning (gradient descent): use the entire data set
- Mini-batch learning (stochastic gradient descent): divide the data set into a family of smaller data sets (mini-batches) and use each mini-batch alternatingly [preferred method for a large data set with much redundancy]
- Online learning (stochastic gradient descent): every mini-batch consists of a single data point
(a code sketch of these modes follows the overview below)

(B) Input
Whether to use the given data as input or do some transformation of it

(C) Weight
How to set weights initially and change them in the course of training


(D) Learning rate
How to choose learning rate(s) initially and change it (them) in the course of training

(E) Generalization error

- How to avoid overfitting
- How to estimate the generalization error
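As a rough sketch of the modes in (A) (not from the slides), a mini-batch stochastic gradient descent loop, assuming a hypothetical gradient function `grad_E(w, X_batch, y_batch)` and data arrays `X`, `y`; `batch_size = len(X)` recovers full-batch learning and `batch_size = 1` recovers online learning:

```python
import numpy as np

def sgd(w, X, y, grad_E, eps=0.01, batch_size=32, n_epochs=10):
    """Mini-batch SGD: sweep the shuffled data set, one mini-batch at a time."""
    N = len(X)
    for epoch in range(n_epochs):
        order = np.random.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            w = w - eps * grad_E(w, X[idx], y[idx])
    return w
```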

(B) Input

Example (Hinton)

$$h = \omega_1 x_1 + \omega_2 x_2$$


Data ((101,101),2), ((101,99),0)


Subtract 100 ⇒ ((1,1),2), ((1,−1),0)


Example (Hinton)

Data ((0.1,10),2), ((0.1,−10),0)


Divide componentwise by average magnitude ⇒ ((1,1),2), ((1,−1),0)


If the inputs are highly correlated, they tend to create "ravines". To alleviate this problem,

- decorrelate the input
- normalize each co-ordinate value to have similar variability (i.e. variance)

May use, e.g., PCA or autoencoder (see the forthcoming lecture).
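A sketch of these two preprocessing steps (a PCA-style whitening; illustrative, with a hypothetical helper name and a data matrix `X` holding one example per row):

```python
import numpy as np

def decorrelate_and_normalize(X, eps=1e-8):
    """Center the data, rotate onto the principal axes (PCA), and rescale
    each coordinate so that it has roughly unit variance."""
    Xc = X - X.mean(axis=0)                 # center each input coordinate
    cov = Xc.T @ Xc / len(Xc)               # covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)    # principal axes of the data
    X_rot = Xc @ eigvec                     # decorrelated coordinates
    return X_rot / np.sqrt(eigval + eps)    # equalize the variances
```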


(C) Weight

Weights are to be determined by the (learning) algorithm

Initialization is still an issue

- Random initialization
- Breaking symmetry

Fan-in

- The fan-in of a neuron (layer) is the number of inputs coming into the neuron (layer) in question
- A big fan-in may result in a big change in the value of the neuron in the later layer even with a small change in the earlier layer. Thus, it is better to initialize the incoming weights of a neuron with big fan-in to be small
- With small fan-in, one may initialize the incoming weights bigger (not so small)
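A sketch of symmetry-breaking random initialization scaled by the fan-in; the $1/\sqrt{\text{fan-in}}$ scale is one common choice consistent with the remark above, not a prescription from the slides:

```python
import numpy as np

def init_layer(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Small random weights; the larger the fan-in, the smaller the scale."""
    W = rng.standard_normal((fan_out, fan_in)) / np.sqrt(fan_in)
    b = np.zeros(fan_out)
    return W, b
```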


A general (esp. deep) neural network has lots of local minima in which the gradient descent process may get trapped. This is one of the central issues in the training of neural networks.

Pre-training has the effect of putting the initial position reasonably close to the intended minimum [see the forthcoming lectures].


(D) Learning rate

Big learning rate

- learn quickly (error decreases rapidly) in the early stage
- and then may start to oscillate and the error gets erratic

Small learning rate

- takes a long time to learn
- may get stuck in a "bad" local minimum

Rule of thumb

- If error oscillates, decrease the learning rate
- If error decreases consistently, increase the learning rate

Using a different learning rate for each weight

- The magnitudes of the weights vary greatly, and it may cause problems if a single learning rate is used for all weights
- See Hinton's Coursera lecture for this


Momentum method

Idea: in a ravine, an oscillatory phenomenon occurs

Gradient descent:
$$\omega(t) = \omega(t-1) - \varepsilon \nabla E(\omega(t-1))$$
$$\omega(t+1) = \omega(t) - \varepsilon \nabla E(\omega(t))$$


$\omega(t) - \omega(t-1)$ and $\omega(t+1) - \omega(t)$ are nearly opposite to each other. If added up, they nearly cancel each other, and the resulting vector is


This vector may point roughly toward the bottom of the ravine; the momentum method came out of this kind of observation.


Algorithm

Keep track of the momentum vector v(t)

Given weight $\omega(t)$ and momentum $v(t)$, let
$$v(t+1) = \mu v(t) - \varepsilon \nabla E(\omega(t))$$
$$\omega(t+1) = \omega(t) + v(t+1)$$
$\mu$: momentum decay coefficient
$\varepsilon$: learning rate


Improved momentum method (Sutskever et al., à la Nesterov)
$$v(t+1) = \mu v(t) - \varepsilon \nabla E(\omega(t) + \mu v(t))$$
$$\omega(t+1) = \omega(t) + v(t+1)$$
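A side-by-side sketch of one step of the classic momentum update and the improved (Nesterov-style) update above, assuming a gradient function `grad_E` and array-valued `w`, `v` (illustrative only):

```python
def momentum_step(w, v, grad_E, eps=0.01, mu=0.9):
    """Classic momentum: v(t+1) = mu v(t) - eps grad E(w(t))."""
    v = mu * v - eps * grad_E(w)
    return w + v, v

def nesterov_step(w, v, grad_E, eps=0.01, mu=0.9):
    """Improved momentum: the gradient is evaluated at the look-ahead point w + mu v."""
    v = mu * v - eps * grad_E(w + mu * v)
    return w + v, v
```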


In the ravine, the sequence {ω(t)} may look like

At the beginning, set the momentum coefficient small (e.g., 0.5) and eventually increase it (e.g., to 0.9).


Other methods

Separate adaptive learning rate

RMSprop

[See Hinton]
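A hedged sketch of an RMSprop-style per-weight learning rate (one common formulation; see Hinton's lecture for the original description):

```python
import numpy as np

def rmsprop_step(w, cache, grad, eps=0.001, decay=0.9, delta=1e-8):
    """Keep a running average of the squared gradient for every weight and
    divide the step by its square root, giving each weight its own
    effective learning rate."""
    cache = decay * cache + (1.0 - decay) * grad**2
    w = w - eps * grad / (np.sqrt(cache) + delta)
    return w, cache
```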


(E) Generalization error

Neural networks tend to overfit

Some care is needed to control the generalization error

Make use of machine learning techniques

- Regularization: add a regularizing term to the error function (standard regularization); keep weights within some prescribed bound; keep the network simpler
- Model selection: try many neural networks and choose the best one according to the model selection criterion; try early stopping
- Aggregation/Bagging: train many neural networks and use the averaging technique; try bootstrap in conjunction with aggregation
- Randomization: apply dropout
- etc.
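As an illustration of the first item (adding a regularizing term to the error), the standard $\ell_2$ penalty $\frac{\lambda}{2}\|\omega\|^2$ only adds $\lambda\omega$ to the gradient; a minimal sketch with a hypothetical hyperparameter `lam`:

```python
def regularized_grad(w, grad_E, lam=1e-4):
    """Gradient of E(w) + (lam/2) * ||w||^2: the penalty contributes lam * w."""
    return grad_E(w) + lam * w
```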


Dropout

Imitation of random forests idea

Repeat

- at the start of each training step, randomly select some neurons
- remove them together with the edges connected to them
- do the training with the remaining network
- put back the removed neurons and edges (with old values)

This forces each edge (weight) to individually adapt to the patterns without the co-operation from other edges (weights).
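A sketch of the dropout step described above, applied to one layer's output during a training step (the keep probability `p` is a hypothetical hyperparameter; at test time the full network is used):

```python
import numpy as np

def dropout(h, p=0.5, rng=np.random.default_rng()):
    """Zero out each neuron's output with probability 1 - p, i.e. temporarily
    remove it together with the edges leaving it."""
    mask = rng.random(h.shape) < p
    return h * mask
```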
