
Rapid Introduction to Machine Learning/Deep Learning

Hyeong In Choi

Seoul National University


Lecture 4a: Feedforward neural network

October 30, 2015


Table of contents

1. Objectives of Lecture 4a

2. Multilayer perceptron
   2.1. Feedforward data flow
   2.2. Back propagation algorithm

3. Training neural networks
   3.1. Simple neural network
   3.2. Training general feedforward neural network


1. Objectives of Lecture 4a

Objective 1

Learn the basic formalism of the multilayer feedforward neural network

Objective 2

Learn the back propagation algorithm

Objective 3

Learn about the basic issues related to the training of neural networks


Objective 4

Learn some useful tricks for training neural networks


2. Multilayer perceptron
2.1. Feedforward data flow


First layer


z^1_i : input (pre-activation) to the i-th neuron in Layer 1

z^1_i = Σ_{j=1}^{d} ω^1_{ij} x_j + b^1_i

b^1_i : bias at the i-th neuron in Layer 1; in vector notation

z^1 = W^1 x + b^1

h^1_i : output of the i-th neuron in Layer 1

h^1_i = φ_1(z^1_i), in vector notation h^1 = φ_1(z^1),


where φ_1(z) is one of

sigm(z) = 1 / (1 + e^{−z})

tanh(z)

ReLU(z) = max(z, 0) = z^+

etc.

At the ℓ-th layer


The input (pre-activation) at the i-th neuron in Layer ℓ

z^ℓ_i = Σ_j ω^ℓ_{ij} h^{ℓ−1}_j + b^ℓ_i, in vector notation z^ℓ = W^ℓ h^{ℓ−1} + b^ℓ

The output at the i-th neuron in Layer ℓ

h^ℓ_i = φ_ℓ(z^ℓ_i), in vector notation h^ℓ = φ_ℓ(W^ℓ h^{ℓ−1} + b^ℓ)
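As a concrete illustration, here is a minimal NumPy sketch of this layer-by-layer recursion. The layer sizes, the ReLU choice for φ, and all function names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def relu(z):
    # ReLU(z) = max(z, 0), applied elementwise
    return np.maximum(z, 0.0)

def forward(x, weights, biases, phi=relu):
    """Feedforward pass: h^0 = x, then h^l = phi(W^l h^{l-1} + b^l)."""
    h = x
    activations = [h]          # keep h^0, h^1, ... (useful later for back propagation)
    pre_activations = []       # keep z^1, z^2, ...
    for W, b in zip(weights, biases):
        z = W @ h + b          # z^l = W^l h^{l-1} + b^l
        h = phi(z)             # h^l = phi_l(z^l)
        pre_activations.append(z)
        activations.append(h)
    return pre_activations, activations

# Tiny usage example with random weights (d = 3 inputs, one hidden layer of 4, output of 2)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
zs, hs = forward(rng.normal(size=3), weights, biases)
```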


The output layer: Layer L

Pre-activation: z^L = W^L h^{L−1} + b^L

Output: h^L = φ_L(z^L)

In case of multi-class classification (K classes)

h^L = softmax(z^L)

i.e.

h^L_i = e^{z^L_i} / Σ_{k=1}^{K} e^{z^L_k} = exp(W^L_{i·} h^{L−1} + b^L_i) / Σ_{k=1}^{K} exp(W^L_{k·} h^{L−1} + b^L_k)

h^L_i ∼ P(Y_i = 1 | x)

[Note: W_{i·} denotes the i-th row of matrix W]

In case of regression, h^L_i ∈ ℝ, ∀i
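A minimal sketch of the softmax output in NumPy; subtracting the maximum before exponentiating is an added implementation detail for numerical stability, not something the slides discuss.

```python
import numpy as np

def softmax(z):
    # h^L_i = exp(z_i) / sum_k exp(z_k); subtracting max(z) avoids overflow
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```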


The loss (error) function

For real output (regression)

E = Σ_k |h^L_k − y_k|^2

For discrete output (classification)

Recall: multivariate Bernoulli

P(y) = µ_1^{y_1} ⋯ µ_K^{y_K}

Given data D = {(x^{(t)}, y^{(t)})}_{t=1}^{N}

Likelihood

Π_t P(y^{(t)} | x^{(t)}) = Π_t µ_1^{y^{(t)}_1} ⋯ µ_K^{y^{(t)}_K}


Log likelihood

Σ_t { y^{(t)}_1 log µ_1 + ⋯ + y^{(t)}_K log µ_K }

h^L_i maximizes this log likelihood and is the estimator of µ_i. The error is defined to be the negative of the log likelihood (h^L_i minimizes this error)

E = − Σ_{t=1}^{N} Σ_{k=1}^{K} y^{(t)}_k log h^L_k = − Σ_{t=1}^{N} Σ_{k=1}^{K} I(y^{(t)} = k) log [ e^{z^L_k} / Σ_{j=1}^{K} e^{z^L_j} ],

where z^L_j = W^L_{j·} h^{L−1} + b^L_j

[Note: this error is called the softmax error or the cross-entropy error]
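A minimal sketch of this cross-entropy error for a single one-hot target; the 1e-12 guard against log(0) is an added implementation detail.

```python
import numpy as np

def cross_entropy(h_L, y_onehot):
    # E = - sum_t sum_k y_k^(t) log h_k^L (here for a single example t)
    return -np.sum(y_onehot * np.log(h_L + 1e-12))

# One example, K = 3 classes, true class is the second one
h_L = np.array([0.2, 0.7, 0.1])          # softmax output
y = np.array([0.0, 1.0, 0.0])            # one-hot target
print(cross_entropy(h_L, y))             # equals -log(0.7)
```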


Remark

For regression, the ℓ2-error is typical. But one can use the ℓ1-error, like

E = Σ_k |h^L_k − y_k|,

or another convex function of h^L − y

For classification, this softmax error is typical. But one can use other similar errors

A feedforward network has the property that once the values (z^ℓ or h^ℓ) of all the neurons of a Layer ℓ are given, the values of the layers that come after Layer ℓ are all determined by them (assuming all weights and biases are fixed). Thus one can write the error E(x, y) as E(h^ℓ, y) or E(z^ℓ, y) for any ℓ = 0, 1, ⋯, L


2.2. Back propagation algorithm


Change of E w.r.t. ω^ℓ_{ij}

Fix y, treat E as a function of the values (z^ℓ or h^ℓ) of the ℓ-th layer. Then

∂E/∂z^ℓ_i = (dh^ℓ_i/dz^ℓ_i) · ∂E/∂h^ℓ_i,   (1)

where

dh^ℓ_i/dz^ℓ_i = h^ℓ_i (1 − h^ℓ_i)   if φ_ℓ is sigm
              = I(z^ℓ_i ≥ 0)        if φ_ℓ is ReLU
              = sech^2(z^ℓ_i)       if φ_ℓ is tanh

Then

∂E/∂h^{ℓ−1}_j = Σ_i (∂z^ℓ_i/∂h^{ℓ−1}_j) · ∂E/∂z^ℓ_i.

From

z^ℓ_i = Σ_j ω^ℓ_{ij} h^{ℓ−1}_j + b^ℓ_i,   (2)


we get

∂z^ℓ_i/∂h^{ℓ−1}_j = ω^ℓ_{ij},   ∂z^ℓ_i/∂ω^ℓ_{ij} = h^{ℓ−1}_j

Thus

∂E/∂h^{ℓ−1}_j = Σ_i ω^ℓ_{ij} ∂E/∂z^ℓ_i   (3)

Using (2), we get

∂E/∂ω^ℓ_{ij} = (∂z^ℓ_i/∂ω^ℓ_{ij}) · ∂E/∂z^ℓ_i   (4)

Thus

∂E/∂ω^ℓ_{ij} = h^{ℓ−1}_j ∂E/∂z^ℓ_i


Change of E w.r.t. b^ℓ_i

The weight connecting the bias neuron in Layer ℓ − 1 to the i-th neuron in Layer ℓ is b^ℓ_i = ω^ℓ_{i0}

h^{ℓ−1}_0 = 1 and there is no input (pre-activation) to the bias neuron


Note from (2)

∂z^ℓ_i/∂b^ℓ_i = ∂z^ℓ_i/∂ω^ℓ_{i0} = 1

Thus from (4)

∂E/∂b^ℓ_i = ∂E/∂ω^ℓ_{i0} = ∂E/∂z^ℓ_i

Hence we get

∂E/∂ω^ℓ_{ij} = h^{ℓ−1}_j ∂E/∂z^ℓ_i
∂E/∂b^ℓ_i = ∂E/∂ω^ℓ_{i0} = ∂E/∂z^ℓ_i   (5)


Propagation mechanism

The data flows from Layer ℓ − 1 to Layer ℓ, i.e. it moves forward (hence the name "feedforward network")

The error derivatives are computed in the opposite direction:

∂E/∂h^ℓ_i  --by (1)-->  ∂E/∂z^ℓ_i  --by (3)-->  ∂E/∂h^{ℓ−1}_i  --by (1)-->  ∂E/∂z^{ℓ−1}_i  --> and so on

Namely, the error derivatives can be computed from the output layer (Layer L) backward all the way to the input layer (Layer 0) (hence the name back propagation)

Equation (5) is the basic equation to be used for the gradient descent algorithm
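A minimal NumPy sketch of equations (1), (3), and (5) for a network whose layers all use the sigmoid activation and the squared error E = Σ_k (h^L_k − y_k)^2; the function names and network setup are illustrative choices, not the lecture's own code.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    hs = [x]                                      # h^0 = x
    for W, b in zip(weights, biases):
        hs.append(sigm(W @ hs[-1] + b))           # h^l = sigm(W^l h^{l-1} + b^l)
    return hs

def backprop(x, y, weights, biases):
    """Gradients of E = sum_k (h^L_k - y_k)^2 via equations (1), (3), (5)."""
    hs = forward(x, weights, biases)
    dE_dh = 2.0 * (hs[-1] - y)                    # dE/dh^L for the squared error
    grads_W, grads_b = [], []
    for l in reversed(range(len(weights))):
        h_l, h_prev = hs[l + 1], hs[l]
        dE_dz = h_l * (1.0 - h_l) * dE_dh         # (1), with dh/dz = h(1-h) for sigm
        grads_W.insert(0, np.outer(dE_dz, h_prev))  # (5): dE/dw^l_ij = h^{l-1}_j dE/dz^l_i
        grads_b.insert(0, dE_dz)                    # (5): dE/db^l_i  = dE/dz^l_i
        dE_dh = weights[l].T @ dE_dz                # (3): dE/dh^{l-1}_j = sum_i w^l_ij dE/dz^l_i
    return grads_W, grads_b
```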


3. Training neural networks
3.1. Simple neural network

Reference: Hinton’s Coursera Lectures

h = ω_1 x_1 + ω_2 x_2

ℓ2-error for a single training example ((x_1, x_2), y)

E = E(ω_1, ω_2) = (ω_1 x_1 + ω_2 x_2 − y)^2


x_1, x_2, y are fixed; ω_1 and ω_2 are variables


Gradient

∇E = (∂E/∂ω_1, ∂E/∂ω_2) = (2x_1(ω_1x_1 + ω_2x_2 − y), 2x_2(ω_1x_1 + ω_2x_2 − y)) ∼ (x_1, x_2)

∇E points perpendicular to the line ω_1x_1 + ω_2x_2 − y = 0


ℓ2-error for two data points

D = {((x^1_1, x^1_2), y^1), ((x^2_1, x^2_2), y^2)}

E = E(ω_1, ω_2) = (ω_1 x^1_1 + ω_2 x^1_2 − y^1)^2 + (ω_1 x^2_1 + ω_2 x^2_2 − y^2)^2


The level sets E = constant are ellipses


Steepest descent: full batch learning

ω(new) = ω(old) − ε∇E(ω(old))

ε: learning rate
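A minimal NumPy sketch of this full-batch update on a two-data-point ℓ2-error of the kind above; the data values, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Full-batch steepest descent on E(w) = sum_t (w . x^t - y^t)^2
# for the simple two-weight network h = w1*x1 + w2*x2.
X = np.array([[1.0, 1.0],
              [1.0, -1.0]])          # two data points (x1, x2)
y = np.array([2.0, 0.0])             # targets
w = np.zeros(2)                      # starting point
eps = 0.1                            # learning rate

for step in range(100):
    residual = X @ w - y             # (w1*x1 + w2*x2 - y) for each data point
    grad = 2.0 * X.T @ residual      # gradient of the summed squared error
    w = w - eps * grad               # w(new) = w(old) - eps * grad E(w(old))

print(w)                             # converges to (1, 1) for this data
```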


Stochastic gradient descent: online learning


Pathological situation

If two lines are almost parallel


then the ellipse has a ravine-like shape


Full batch


Online


In either case, ω(new) does not move much in the direction of the minimum [oscillation phenomenon]

If the learning rate is big, ω(new) diverges


3.2. Training general feedforward neural network

Data: ((x_1, x_2), y)

y ∈ {0, 1}


Example: error of binary classifier

Error

E = − I(y = 1) log [ e^{ω_1x_1 + ω_2x_2} / (1 + e^{ω_1x_1 + ω_2x_2}) ] − I(y = 0) log [ 1 / (1 + e^{ω_1x_1 + ω_2x_2}) ]

When y = 1,

E = − log sigm(ω_1x_1 + ω_2x_2)


sigm(z) is concave for z > 0

log is also concave

Thus E = − log sigm(ω_1x_1 + ω_2x_2) is convex when ω_1x_1 + ω_2x_2 > 0

When y = 0,

E = − log [ 1 / (1 + e^{ω_1x_1 + ω_2x_2}) ]

1/(1 + e^t) is concave for t < 0

log is also concave, so E is convex if ω_1x_1 + ω_2x_2 < 0

Thus E is convex in the “correct” region, but not everywhere


Error

The error of a general neural network is a very complicated function of a huge number of variables


The purpose of training (learning) is to find the value of ω that minimizes E.

(Stochastic) gradient descent

The basic workhorse is a variant of gradient descent

Two stages

Initialization: how to find a "good" starting point (configuration of ω)

Algorithm: how to get to a "good" minimum point from the given starting point


Basic issues

(A) Mode

Full-batch learning (gradient descent): use the entire data set

Mini-batch learning (stochastic gradient descent): divide the data set into a family of smaller data sets (mini-batches) and use each mini-batch in turn [preferred method for a large data set with much redundancy]; a minimal sketch of these modes follows this list

Online learning (stochastic gradient descent): every mini-batch consists of a single data point

(B) Input

Whether to use the given data as input or do some transformation of it

(C) Weight

How to set weights initially and change them in the course of training

41/59

1. Objectives of Lecture 4a 2. Multilayer perceptron 3. Training neural networks

3.2. Training general feedforward neural network

(D) Learning rate

How to choose learning rate(s) initially and change it (them) in the course of training

(E) Generalization error

How to avoid overfitting

How to estimate the generalization error
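The sketch promised under (A): one routine covering all three modes, where batch_size = len(data) gives full-batch learning, 1 < batch_size < len(data) gives mini-batch learning, and batch_size = 1 gives online learning. The function names, learning rate, epoch count, and the reuse of the earlier two-point least-squares example are illustrative assumptions.

```python
import numpy as np

def sgd(grad_fn, w, data, batch_size, eps=0.01, epochs=10, rng=None):
    """Mini-batch SGD; batch_size = len(data) is full-batch, batch_size = 1 is online."""
    rng = rng or np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(len(data))                  # reshuffle each epoch
        for start in range(0, len(data), batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            g = sum(grad_fn(w, x, y) for x, y in batch) / len(batch)
            w = w - eps * g                               # one update per mini-batch
    return w

# Usage on the two-point least-squares example:
def grad_fn(w, x, y):
    return 2.0 * (w @ x - y) * x                          # gradient of (w.x - y)^2

data = [(np.array([1.0, 1.0]), 2.0), (np.array([1.0, -1.0]), 0.0)]
w = sgd(grad_fn, np.zeros(2), data, batch_size=1, eps=0.1, epochs=50)
print(w)                                                  # approaches (1, 1)
```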

(B) Input

Example (Hinton)

h = ω_1 x_1 + ω_2 x_2


Data ((101,101),2), ((101,99),0)


Subtract 100 ⇒ ((1,1),2), ((1,−1),0)


Example (Hinton)

Data ((0.1,10),2), ((0.1,−10),0)


Divide componentwise by the average magnitude ⇒ ((1,1),2), ((1,−1),0)


If the inputs are highly correlated, they tend to create“ravines”. To alleviate this problem,

decorrelate the input

normalize each coordinate value to have similar variability (i.e. variance)

One may use, e.g., PCA or an autoencoder (see the forthcoming lecture)
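A minimal sketch of these two remedies, assuming plain per-coordinate standardization and PCA-based decorrelation in NumPy; the autoencoder alternative is left to the forthcoming lecture, and the data reuses Hinton's example above.

```python
import numpy as np

def standardize(X):
    # Normalize each coordinate to zero mean and (roughly) unit variance
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-12
    return (X - mu) / sigma

def pca_decorrelate(X):
    # Rotate the centered data onto the principal axes so the coordinates are uncorrelated
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    return Xc @ eigvecs

X = np.array([[101.0, 101.0], [101.0, 99.0]])   # inputs from the example above
print(standardize(X))
print(pca_decorrelate(X))
```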


(C) Weight

Weights are to be determined by the (learning) algorithm

Initialization is still an issue

Random initialization

Breaking symmetry

Fan-in

The fan-in of a neuron (layer) is the number of incoming connections (input lines) to the neuron (layer) in question

A big fan-in may result in a big change in the value of the neuron in the later layer, even with small changes in the earlier layer. Thus it is better to initialize the incoming weights of a neuron with a big fan-in to be small

With a small fan-in, one may initialize the incoming weights to be bigger (not so small)
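A minimal sketch of fan-in-dependent random initialization; scaling the standard deviation by 1/√fan-in is one common way to make "smaller weights for bigger fan-in" concrete, not a prescription from the slides.

```python
import numpy as np

def init_weights(layer_sizes, rng=None):
    """Random initialization that breaks symmetry and shrinks with the fan-in.

    The 1/sqrt(fan_in) scale is an assumed concrete choice."""
    rng = rng or np.random.default_rng(0)
    weights, biases = [], []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
        weights.append(W)
        biases.append(np.zeros(fan_out))
    return weights, biases

weights, biases = init_weights([784, 100, 10])   # layer sizes are illustrative
print([W.std() for W in weights])                # smaller spread where fan-in is bigger
```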


A general (especially deep) neural network has lots of local minima in which the gradient descent process may get trapped. This is one of the central issues in the training of neural networks

Pre-training has the effect of putting the initial position reasonably close to the intended minimum [see the forthcoming lectures]


(D) Learning rate

Big learning rate

learns quickly (the error decreases rapidly) in the early stage

then it may start to oscillate and the error gets erratic

Small learning rate

takes a long time to learn

may get stuck in a "bad" local minimum

Rule of thumb

If the error oscillates, decrease the learning rate

If the error decreases consistently, increase the learning rate

Using a different learning rate for each weight

The magnitudes of the weights vary greatly, and this may cause problems if a single learning rate is used for all weights

See Hinton's Coursera lecture for this


Momentum method

Idea: in a ravine, an oscillatory phenomenon occurs

Gradient descent:

ω(t) = ω(t − 1) − ε∇E(ω(t − 1))
ω(t + 1) = ω(t) − ε∇E(ω(t))


ω(t) − ω(t − 1) and ω(t + 1) − ω(t) are nearly opposite to each other. If added up, they nearly cancel each other, and the resulting vector is


This vector may point roughly toward the bottom of the ravine;

the momentum method came out of this kind of observation


Algorithm

Keep track of the momentum vector v(t)

Given weight ω(t) and momentum v(t), let

v(t + 1) = µv(t) − ε∇E(ω(t))

ω(t + 1) = ω(t) + v(t + 1)

µ : momentum decay coefficient
ε : learning rate
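A minimal sketch of this update rule in NumPy; the quadratic test function, learning rate, and momentum coefficient below are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def momentum_step(w, v, grad_E, eps=0.01, mu=0.9):
    """One step of the momentum method:
    v(t+1) = mu * v(t) - eps * grad E(w(t));  w(t+1) = w(t) + v(t+1)."""
    v_new = mu * v - eps * grad_E(w)
    w_new = w + v_new
    return w_new, v_new

# Usage on a ravine-like quadratic E(w) = 0.5*(10*w1^2 + 0.1*w2^2) (an assumed example)
grad_E = lambda w: np.array([10.0, 0.1]) * w
w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, grad_E, eps=0.05, mu=0.9)
print(w)   # approaches the minimum at the origin
```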


Improved momentum method (Sutskever et al., à la Nesterov)

v(t + 1) = µv(t) − ε∇E(ω(t) + µv(t))

ω(t + 1) = ω(t) + v(t + 1)
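The same sketch adapted to the improved rule; the only change from the previous block is that the gradient is evaluated at the look-ahead point ω(t) + µv(t).

```python
def nesterov_momentum_step(w, v, grad_E, eps=0.01, mu=0.9):
    """Improved (Nesterov-style) momentum:
    v(t+1) = mu * v(t) - eps * grad E(w(t) + mu * v(t));  w(t+1) = w(t) + v(t+1)."""
    v_new = mu * v - eps * grad_E(w + mu * v)
    w_new = w + v_new
    return w_new, v_new
```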


In the ravine, the sequence {ω(t)} may look like

At the beginning, set the momentum coefficient small (e.g., 0.5) and eventually increase it (e.g., to 0.9)


Other methods

Separate adaptive learning rate

RMSprop

[See Hinton]


(E) Generalization error

Neural networks tend to overfit

Some care is needed to control the generalization error

Make use of machine learning techniques

Regularization: add a regularizing term to the error function (standard regularization); keep the weights within some prescribed bound; keep the network simpler

Model selection: try many neural networks and choose the best one according to a model selection criterion; try early stopping

Aggregation/Bagging: train many neural networks and use the averaging technique; try the bootstrap in conjunction with aggregation

Randomization: apply dropout

etc.


Dropout

Imitation of random forests idea

Repeat

at the start of each training step, randomly select some neurons

remove them together with the edges connected to them

do the training with the remaining network

put back the removed neurons and edges (with old values)

This forces each edge (weight) to individually adapt to the patterns without co-operation from the other edges (weights)
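A minimal sketch of the dropout step applied to a vector of hidden activations; the keep probability and the 1/keep_prob rescaling ("inverted dropout") are common implementation choices that the slides do not spell out.

```python
import numpy as np

def dropout_forward(h, keep_prob=0.5, rng=None, training=True):
    """Randomly 'remove' neurons by zeroing their outputs for this training step."""
    if not training:
        return h                                   # at test time use the full network
    rng = rng or np.random.default_rng(0)
    mask = rng.random(h.shape) < keep_prob         # which neurons survive this step
    return h * mask / keep_prob                    # zero out dropped neurons, rescale the rest

h = np.array([0.3, 1.2, 0.7, 0.0, 2.1])
print(dropout_forward(h, keep_prob=0.5))
```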


Recommended