
Page 1:

Neural Networks and Deep Learning

Professor Ameet Talwalkar

November 12, 2015


Page 2:

Outline

1 Review of last lecture
   AdaBoost
   Boosting as learning a nonlinear basis

2 Neural networks

3 Summary


Page 3:

How does the boosting algorithm work?

Given: $N$ samples $\{x_n, y_n\}$, where $y_n \in \{+1, -1\}$, and some way of constructing weak (or base) classifiers.

Initialize weights $w_1(n) = \frac{1}{N}$ for every training sample.

For $t = 1$ to $T$:

1 Train a weak classifier $h_t(x)$ based on the current weights $w_t(n)$, by minimizing the weighted classification error
$$\varepsilon_t = \sum_n w_t(n)\, \mathbb{I}[y_n \neq h_t(x_n)]$$

2 Calculate the weight for combining classifiers: $\beta_t = \frac{1}{2} \log \frac{1 - \varepsilon_t}{\varepsilon_t}$

3 Update the weights
$$w_{t+1}(n) \propto w_t(n)\, e^{-\beta_t y_n h_t(x_n)}$$
and normalize them such that $\sum_n w_{t+1}(n) = 1$.

Output the final classifier
$$h(x) = \operatorname{sign}\left[\sum_{t=1}^{T} \beta_t h_t(x)\right]$$
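To make this loop concrete, here is a minimal Python sketch of the algorithm; the choice of decision stumps as the weak learners is an assumption for illustration, not something the slide specifies.

```python
import numpy as np

def adaboost(X, y, T):
    """AdaBoost with axis-aligned decision stumps as the weak learners.

    X: (N, D) features; y: (N,) labels in {+1, -1}.
    Returns a list of (stump, beta_t) pairs defining h(x) = sign(sum_t beta_t h_t(x)).
    """
    N, D = X.shape
    w = np.full(N, 1.0 / N)                        # w_1(n) = 1/N
    ensemble = []
    for t in range(T):
        # 1. Train a weak classifier by minimizing the weighted error eps_t.
        best = None
        for d in range(D):
            for thr in np.unique(X[:, d]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, d] > thr, 1, -1)
                    eps = np.sum(w * (pred != y))
                    if best is None or eps < best[0]:
                        best = (eps, d, thr, s)
        eps, d, thr, s = best
        # 2. Weight for combining classifiers.
        beta = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        # 3. Reweight the samples and normalize.
        pred = s * np.where(X[:, d] > thr, 1, -1)
        w = w * np.exp(-beta * y * pred)
        w /= w.sum()
        ensemble.append(((d, thr, s), beta))
    return ensemble

def predict(ensemble, X):
    score = np.zeros(X.shape[0])
    for (d, thr, s), beta in ensemble:
        score += beta * s * np.where(X[:, d] > thr, 1, -1)
    return np.sign(score)
```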


Page 4:

Derivation of AdaBoost

Minimize the exponential loss
$$\ell_{\exp}(f(x), y) = e^{-y f(x)}$$

Greedily (sequentially) find the best classifier to optimize the loss: the current classifier $f_{t-1}(x)$ is improved by adding a new classifier $h_t(x)$,
$$f(x) = f_{t-1}(x) + \beta_t h_t(x)$$
$$(h_t^*(x), \beta_t^*) = \operatorname*{argmin}_{(h_t(x),\, \beta_t)} \sum_n e^{-y_n f(x_n)} = \operatorname*{argmin}_{(h_t(x),\, \beta_t)} \sum_n e^{-y_n [f_{t-1}(x_n) + \beta_t h_t(x_n)]}$$
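The slide stops at this objective; for completeness, the standard next step (not shown here) goes as follows. Writing $w_t(n) \propto e^{-y_n f_{t-1}(x_n)}$ for the normalized AdaBoost weights, the objective is proportional to
$$\sum_n w_t(n)\, e^{-\beta_t y_n h_t(x_n)} = e^{-\beta_t} \sum_{n:\, y_n = h_t(x_n)} w_t(n) + e^{\beta_t} \sum_{n:\, y_n \neq h_t(x_n)} w_t(n) = e^{-\beta_t}(1 - \varepsilon_t) + e^{\beta_t}\, \varepsilon_t,$$
and setting its derivative with respect to $\beta_t$ to zero recovers $\beta_t = \frac{1}{2} \log \frac{1 - \varepsilon_t}{\varepsilon_t}$, the combination weight used in the algorithm.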


Page 5:

Nonlinear basis learned by boosting

Two-stage process:

Get $\operatorname{sign}[f_1(x)], \operatorname{sign}[f_2(x)], \ldots$, then combine them into a linear classification model
$$y = \operatorname{sign}\left\{\sum_t \beta_t \operatorname{sign}[f_t(x)]\right\}$$

Equivalently, each stage learns a nonlinear basis $\phi_t(x) = \operatorname{sign}[f_t(x)]$.

A natural question, then: why not learn the basis functions and the classifier at the same time?


Page 6:

Outline

1 Review of last lecture

2 Neural networks
   Algorithm
   Deep Neural Networks (DNNs)

3 Summary


Page 7:

Basic idea

Transform the input features with a nonlinear function

Use nonlinear basis functions $\phi(\cdot)$

[Figure: the original input/features $x \in \mathbb{R}^D$ (components $x_1, x_2, \ldots, x_D$) are mapped to new features $\phi(x) \in \mathbb{R}^M$ (components $\phi_1(x), \phi_2(x), \ldots, \phi_M(x)$).]

Linear regression: $y = w^T \phi(x) + w_0$

Linear classification: $y = \operatorname{sgn}(w^T \phi(x) + w_0)$
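As a concrete (if toy) illustration, the following numpy sketch fits a linear model on hand-picked nonlinear basis functions; the polynomial basis and the sine target are assumptions for illustration only.

```python
import numpy as np

def phi(x):
    # map a scalar input x in R to phi(x) in R^3 (an illustrative polynomial basis)
    return np.stack([x, x ** 2, x ** 3], axis=-1)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + 0.1 * rng.normal(size=200)      # a nonlinear target

Phi = np.column_stack([np.ones_like(x), phi(x)])    # prepend a column of ones for w_0
w = np.linalg.lstsq(Phi, y, rcond=None)[0]          # linear regression: y = w^T phi(x) + w_0
pred = Phi @ w                                      # predictions are linear in the new features
```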


Page 8:

Nonlinear basis as two-layer network

Layered architecture of "neurons"

Input layer: features

Hidden layer: nonlinear transformation

Output layer: targets

Feedforward computation

Hidden layer output: $z_j = h(a_j) = h\!\left(\sum_{i=0}^{D} w^{(1)}_{ji} x_i\right)$

Output layer output: $y_k = g\!\left(\sum_{j=0}^{M} w^{(2)}_{kj} z_j\right)$

[Figure: inputs $x_0, x_1, \ldots, x_D$, hidden units $z_0, z_1, \ldots, z_M$, and outputs $y_1, \ldots, y_K$, connected by weights $w^{(1)}$ and $w^{(2)}$. The units $x_0$ and $z_0$ are often set to the constant value 1, hence "bias".]
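A minimal numpy sketch of this feedforward computation in the slide's notation; the tanh hidden nonlinearity and the identity output are assumptions, since $h$ and $g$ are left unspecified here.

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh, g=lambda a: a):
    """Two-layer feedforward pass.

    x:  (D,) input; a constant 1 is prepended as x_0 (the bias unit).
    W1: (M, D+1) first-layer weights w^(1)_ji.
    W2: (K, M+1) second-layer weights w^(2)_kj.
    """
    x = np.concatenate(([1.0], x))        # x_0 = 1 ("bias")
    a = W1 @ x                            # a_j = sum_i w^(1)_ji x_i
    z = np.concatenate(([1.0], h(a)))     # z_0 = 1, z_j = h(a_j)
    return g(W2 @ z)                      # y_k = g(sum_j w^(2)_kj z_j)

# example: D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
y = forward(rng.normal(size=3), rng.normal(size=(4, 4)), rng.normal(size=(2, 5)))
```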

Page 9:

A very concise history

1943 McCulloch-Pitts model of single neurons

1960s Rosenblatt's perceptron learning

1969 Minsky and Papert's Perceptrons

1985 Hopfield neural nets

1986 Parallel and Distributed Processing (the PDP book) and connectionism

2006 Deep nets

[Figure: a single neuron as a nonlinear processing unit, taking input from other neurons and producing an output.]

Page 10:

Neural networks are very powerful

Universal approximator: with a sufficient number of nonlinear hidden units, a linear output unit can approximate any continuous function.

Transfer functions for the neurons:

sigmoid function: $h(z) = \dfrac{1}{1 + e^{-z}}$

tanh function: $h(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

piecewise linear (ReLU): $h(z) = \max(0, z)$

[Figure: the same two-layer network diagram as before, with inputs $x_0, \ldots, x_D$, hidden units $z_0, \ldots, z_M$, and outputs $y_1, \ldots, y_K$.]
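For concreteness, the three transfer functions in numpy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # h(z) = 1 / (1 + e^{-z})

def tanh(z):
    return np.tanh(z)                 # h(z) = (e^z - e^{-z}) / (e^z + e^{-z})

def relu(z):
    return np.maximum(0.0, z)         # h(z) = max(0, z), the piecewise-linear unit
```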

Page 11:

Ex: computing a highly nonlinear function

[Figure: a small network on a scalar input $x$, with bias units and edge weights such as $+1$, $-1$, $+0.5$, $+1$, $+10$, and $0.5$.]

$$z_1 = h(0.5x + 1), \quad z_2 = h(x + 1), \quad z_3 = h(10x)$$
$$y = -z_1 + z_2 + 0.5\, z_3 + 0.5$$
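A quick numeric sketch of this example; using the sigmoid for $h$ is an assumption, since the slide does not say which transfer function it intends.

```python
import numpy as np

def h(z):                              # assumed transfer function: sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    z1 = h(0.5 * x + 1)
    z2 = h(x + 1)
    z3 = h(10 * x)
    return -z1 + z2 + 0.5 * z3 + 0.5

print([round(f(x), 3) for x in np.linspace(-2, 2, 9)])   # a very nonlinear curve from just three units
```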

Page 12:

Complicated decision boundaries

[Figure: a two-dimensional example of the complicated decision boundaries a neural network can produce.]

Page 13:

Choice of output nodes

Regression

Linear output: $y_k = \sum_j w^{(2)}_{kj}\, h\!\left(\sum_i w^{(1)}_{ji} x_i\right)$

Classification

sigmoid (for binary classification): $y = \sigma\!\left(\sum_j w^{(2)}_{kj}\, h\!\left(\sum_i w^{(1)}_{ji} x_i\right)\right)$

softmax (for multiclass classification): $z_k = \sum_j w^{(2)}_{kj}\, h\!\left(\sum_i w^{(1)}_{ji} x_i\right)$, $\qquad y_k = \dfrac{e^{z_k}}{\sum_{k'} e^{z_{k'}}}$

[Figure: the same two-layer network diagram, with inputs $x_0, \ldots, x_D$, hidden units $z_0, \ldots, z_M$, and outputs $y_1, \ldots, y_K$.]
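A small numpy sketch of the two classification output choices; the max-subtraction in the softmax is a standard numerical-stability detail, not something on the slide.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    z = z - np.max(z)        # subtract the max for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()       # y_k = e^{z_k} / sum_{k'} e^{z_{k'}}

z = np.array([2.0, 0.5, -1.0])     # output-layer pre-activations z_k
print(sigmoid(z[0]))               # binary classification: P(y = +1 | x)
print(softmax(z))                  # multiclass: a probability distribution over K classes
```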

Page 14:

Can have multiple (i.e., deep) layers

Implements a highly complicated nonlinear mapping $y = f(x)$

Page 15:

How to learn the parameters?

Choose the right loss function

Regression: least-squares loss, $\displaystyle \min \sum_n (f(x_n) - y_n)^2$

Classification: cross-entropy loss, $\displaystyle \min\; -\sum_n \sum_k y_{nk} \log f_k(x_n)$

This is a very hard optimization problem

Stochastic gradient descent is commonly used

Many optimization tricks are applied
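Written out in numpy, as a point of reference:

```python
import numpy as np

def squared_loss(preds, targets):
    """Least-squares loss: sum_n (f(x_n) - y_n)^2."""
    return np.sum((preds - targets) ** 2)

def cross_entropy(probs, targets_onehot):
    """Cross-entropy loss: -sum_n sum_k y_nk log f_k(x_n)."""
    return -np.sum(targets_onehot * np.log(probs))
```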

Page 16:

Stochastic gradient descent

High-level idea

Randomly pick a data point $(x_n, y_n)$

Compute the gradient using only this data point, for example
$$g = \frac{\partial\, (f(x_n) - y_n)^2}{\partial w}$$

Update the parameters right away: $w \leftarrow w - \eta\, g$

Iterate until some stopping criterion is met

There are many possible improvements to this simple procedure (though in practice it already works quite well in many cases!).
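A minimal sketch of this procedure for a linear model with squared loss; the linear form of $f$ is an assumption for illustration, since the slides leave $f$ general.

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=20, seed=0):
    """Plain SGD for f(x) = w . x with squared loss (a stand-in for any differentiable model)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):        # randomly pick a data point (x_n, y_n)
            g = 2 * (X[n] @ w - y[n]) * X[n]     # gradient of (f(x_n) - y_n)^2 with respect to w
            w -= eta * g                         # update the parameters right away
    return w

# toy check: recover a known weight vector
X = np.random.default_rng(1).normal(size=(200, 3))
w_hat = sgd(X, X @ np.array([1.0, -2.0, 0.5]), eta=0.05)
```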

Page 17:

Several common tricks

Initialization is very important

We are solving a very difficult optimization problem, and there are several heuristics for choosing your starting point wisely.

Learning rate decay

The step size can be large at the beginning but should be reduced later: as the iteration count $t$ goes up, the learning rate $\eta$ is made smaller.

Minibatch

Use a small batch of data points (instead of just one) to estimate gradients more robustly.

Momentum

Remember the good directions from previous iterations in which you changed the parameters.

…
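A sketch combining three of these tricks in one training loop; the specific decay schedule, the momentum coefficient, and the `grad_fn` interface are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sgd_with_tricks(grad_fn, w, data, eta0=0.1, decay=0.01, batch=32, mu=0.9, epochs=10, seed=0):
    """SGD with learning-rate decay, minibatches, and momentum (illustrative hyperparameters).

    grad_fn(w, minibatch) should return a gradient estimate; `data` is an array of examples.
    """
    rng = np.random.default_rng(seed)
    velocity = np.zeros_like(w)
    t = 0
    for _ in range(epochs):
        rng.shuffle(data)                                # visit the data in random order
        for i in range(0, len(data), batch):             # minibatch: more robust gradient estimates
            eta = eta0 / (1.0 + decay * t)               # learning-rate decay as iteration t grows
            g = grad_fn(w, data[i:i + batch])
            velocity = mu * velocity - eta * g           # momentum: keep part of the previous direction
            w = w + velocity
            t += 1
    return w
```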

Page 18:

Heavy tuning

In practice

Many tricks require experimenting and tweaking to obtain the best results.

Additionally, other hyperparameters need to be tuned too:

Number of hidden layers?

Number of hidden units in each layer?

But all of this pays off:

Deep neural networks attain the best results in automatic speech recognition.

Deep neural networks attain the best results in image recognition.

Deep neural networks attain the best results in recognizing faces.

Deep neural networks attain the best results in recognizing poses.

Page 19:

How to compute the gradient?

Even for very complicated nonlinear functions, computing the gradient is surprisingly simple to implement.

The idea behind it is called error backpropagation.

It employs the simple chain rule for taking derivatives.

Implemented in many sophisticated packages:

Theano

cuDNN

Page 20:

Derivation of error backpropagation

Calculate the feedforward signals from the input to the output.

Calculate the output error $E$ based on the predictions $x_k$ and the targets $t_k$.

Backpropagate the error by weighting it by the gradients of the associated activation functions and by the weights in previous layers.

Calculate the gradients $\frac{\partial E}{\partial w}$ for the parameters based on the backpropagated error signal and the feedforward signals from the inputs.

Update the parameters using the calculated gradients: $w \leftarrow w - \eta \frac{\partial E}{\partial w}$.


Page 21:

Illustrative example

[Figure: three consecutive layers, with a node $x_i$ feeding node $x_j$ through weight $w_{ij}$, and $x_j$ feeding node $x_k$ through weight $w_{jk}$.]

$w_{ij}$: weights connecting node $i$ in layer $(\ell - 1)$ to node $j$ in layer $\ell$.

$b_j$: bias for node $j$.

$z_j$: input to node $j$ (where $z_j = b_j + \sum_i x_i w_{ij}$).

$g_j$: activation function for node $j$ (applied to $z_j$).

$x_j = g_j(z_j)$: output/activation of node $j$.

$t_k$: target value for node $k$ in the output layer.


Page 22:

Illustrative example (cont’d)

Network output
$$x_k = g_k\!\left(b_k + \sum_j g_j\!\left(b_j + \sum_i x_i w_{ij}\right) w_{jk}\right)$$

Let's assume the error function is the sum of squared differences between the target values $t_k$ and the network outputs $x_k$:
$$E = \frac{1}{2} \sum_{k \in K} (x_k - t_k)^2$$

Gradients for the output layer
$$\frac{\partial E}{\partial w_{jk}} = (x_k - t_k)\, \frac{\partial}{\partial w_{jk}}(x_k - t_k) = (x_k - t_k)\, g_k'(z_k)\, \frac{\partial z_k}{\partial w_{jk}} = (x_k - t_k)\, g_k'(z_k)\, x_j = \delta_k x_j$$

where $\delta_k$ is the output error propagated back through the top activation layer.


Page 23:

Illustrative example (cont’d)

Gradients for the hidden layer
$$\frac{\partial E}{\partial w_{ij}} = \sum_{k \in K} (x_k - t_k)\, \frac{\partial}{\partial w_{ij}}(x_k - t_k) = \sum_{k \in K} (x_k - t_k)\, g_k'(z_k)\, \frac{\partial z_k}{\partial w_{ij}}$$
$$= \sum_{k \in K} (x_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j)\, x_i = x_i\, g_j'(z_j) \sum_{k \in K} (x_k - t_k)\, g_k'(z_k)\, w_{jk} = \delta_j x_i$$

where we substituted $z_k = b_k + \sum_j g_j\!\left(b_j + \sum_i x_i w_{ij}\right) w_{jk}$ and
$$\frac{\partial z_k}{\partial w_{ij}} = \frac{\partial z_k}{\partial x_j}\, \frac{\partial x_j}{\partial w_{ij}} = w_{jk}\, g_j'(z_j)\, x_i.$$

The gradients with respect to the biases are, respectively,
$$\frac{\partial E}{\partial b_k} = \delta_k, \qquad \frac{\partial E}{\partial b_j} = \delta_j.$$
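A compact numpy sketch of these formulas for a single hidden layer; sigmoid activations for $g_j$ and $g_k$ are an assumption, since the derivation leaves them general.

```python
import numpy as np

def g(z):                              # assumed activation: sigmoid, so g'(z) = g(z) (1 - g(z))
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, b1, W2, b2, eta=0.1):
    """One gradient step for a one-hidden-layer network, in the slide's delta notation.

    x: (I,) input; t: (K,) target; W1: (I, J); W2: (J, K).
    """
    # feedforward
    zj = b1 + x @ W1                   # z_j = b_j + sum_i x_i w_ij
    xj = g(zj)                         # x_j = g_j(z_j)
    zk = b2 + xj @ W2                  # z_k = b_k + sum_j x_j w_jk
    xk = g(zk)                         # network output
    # backpropagated errors
    dk = (xk - t) * xk * (1 - xk)      # delta_k = (x_k - t_k) g'_k(z_k)
    dj = (W2 @ dk) * xj * (1 - xj)     # delta_j = g'_j(z_j) sum_k delta_k w_jk
    # gradients: dE/dw_jk = delta_k x_j, dE/dw_ij = delta_j x_i, dE/db_k = delta_k, dE/db_j = delta_j
    W2 -= eta * np.outer(xj, dk);  b2 -= eta * dk
    W1 -= eta * np.outer(x, dj);   b1 -= eta * dj
    return W1, b1, W2, b2
```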


Page 24:

Basic idea behind DNNs

Architecturally, a big neural network (with a lot of variants)

in depth: 4-5 layers are common (GoogLeNet uses more than 20)

in width: the number of hidden units in each layer can be a few thousand

the number of parameters: hundreds of millions, even billions

Algorithmically, many new things

Pre-training: do not run error backpropagation right away

Layer-wise greedy training: train one layer at a time

...

Computing

Heavy computing: both in speed of computation and in coping with a lot of data

Ex: fast Graphics Processing Units (GPUs) are almost indispensable


Page 25:

Good references

Easy to find, as DNNs are very popular these days

Many, many online video tutorials

Good open-source packages: Theano, Caffe, MatConvNet, TensorFlow, etc.

Examples:

The Wikipedia entry on "Deep Learning" (http://en.wikipedia.org/wiki/Deep_learning) provides a decent portal to many topics, including deep belief networks and convolutional nets.

A collection of tutorials, and code for implementing them in Python: http://www.deeplearning.net/tutorial/


Page 26:

Outline

1 Review of last lecture

2 Neural networks

3 Summary
   Supervised learning


Page 27:

Summary of the course so far: a short list of important concepts

Supervised learning has been our focus

Setup: given a training dataset $\{x_n, y_n\}_{n=1}^{N}$, we learn a function $h(x)$ to predict $x$'s true value $y$ (i.e., regression or classification)

Linear vs. nonlinear features
1 Linear: $h(x)$ depends on $w^T x$
2 Nonlinear: $h(x)$ depends on $w^T \phi(x)$, which in turn depends on a kernel function $k(x_m, x_n) = \phi(x_m)^T \phi(x_n)$

Loss functions
1 Squared loss: least squares for regression (minimizing the residual sum of errors)
2 Logistic loss: logistic regression
3 Exponential loss: AdaBoost
4 Margin-based loss: support vector machines

Principles of estimation
1 Point estimate: maximum likelihood, regularized likelihood


Page 28:

cont’d

Optimization
1 Methods: gradient descent, Newton's method
2 Convex optimization: global optimum vs. local optimum
3 Lagrange duality: primal and dual formulations

Learning theory
1 Difference between training error and generalization error
2 Overfitting, bias-variance tradeoff
3 Regularization: various regularized models
