Machine Learning: Chenhao Tan
University of Colorado Boulder
LECTURE 8
Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, Michael Nielsen
HW1 grades
• Most people did well!
• We are extra lenient this time
• Submit only your code in a zip file, with no folder structure
Quiz 1
Which of the following statements is true? (Suppose that the training data is large.)
A. In training, K-nearest neighbors takes a shorter time than neural networks.
B. In training, K-nearest neighbors takes a longer time than neural networks.
C. In testing, K-nearest neighbors takes a shorter time than neural networks.
D. In testing, K-nearest neighbors takes a longer time than neural networks.
Quiz 2
How many parameters are there in the following feed-forward neural network (assuming no biases)?
[Figure: a feed-forward network with input layer x1, x2, …, x100; first hidden layer h1, …, h50; second hidden layer h1, …, h20; output layer o1, …, o5]
A. 100 * 50 + 50 * 20 + 20 * 5
B. 100 * 50 + 50 + 50 * 20 + 20 + 20 * 5 + 5
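With no biases, the count is just the sum of products of consecutive layer widths; a quick check in Python, with the layer sizes read off the figure above:

```python
# Weight count for a fully-connected network with no biases:
# each consecutive layer pair contributes fan_in * fan_out weights.
layers = [100, 50, 20, 5]
print(sum(a * b for a, b in zip(layers, layers[1:])))  # 100*50 + 50*20 + 20*5 = 6100
```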
Quiz 3
How many parameters are there in the following convolutional neural network? (Assume no biases; convolution with 4 filters, then max pooling, ReLU, and finally a fully-connected layer.)
input image (10×10) → 4 filters, 5×5, stride 1 → 4@6×6 → max pooling, 2×2, stride 2 → 4@3×3 → fully connected → 5×1
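A quick check in Python, assuming a single-channel input (so each 5×5 filter has 25 weights) and no biases:

```python
conv_params = 4 * 5 * 5        # 4 filters, each 5x5, one input channel
fc_params = (4 * 3 * 3) * 5    # flattened 4@3x3 pooled output into 5 units
print(conv_params, fc_params, conv_params + fc_params)  # 100 180 280
```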
Quiz 3
How many ReLU operations are performed on the forward pass? (Assume no biases; convolution with 4 filters, then max pooling, ReLU, and finally a fully-connected layer.)
input image (10×10) → 4 filters, 5×5, stride 1 → 4@6×6 → max pooling, 2×2, stride 2 → 4@3×3 → fully connected → 5×1
Overview
History lesson
Deep learning in practice
◦ Improve stochastic gradient descent
◦ Unstable gradients
◦ Data preprocessing
◦ Weight Initialization
◦ Model Architecture
History lesson
• 1962: Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms
◦ First neuron-based learning algorithm
◦ "Could learn anything that you could program"
• 1969: Minsky & Papert, Perceptrons: An Introduction to Computational Geometry
◦ First real complexity analysis
◦ Showed, in principle, many things that perceptrons can't learn to do
◦ Shut down most interest in neural networks
• 1986: Rumelhart, Hinton & Williams, Back Propagation
◦ Overcame many difficulties raised by Minsky et al.
◦ Neural networks wildly popular again (for a while)
• 1999-2005
◦ Shift to Bayesian Methods
  • Best ideas from neural networks
  • Direct statistical computing
◦ Support Vector Machines
  • Nice mathematical properties
  • Kernel tricks
◦ A few people still playing with NNs
  • Bengio
  • Hinton
  • LeCun
• 2005-2010
◦ Core group continues to make improvements
◦ Various tricks to make NNs practical
• 2010-present
◦ BOOM!
Question: Why? What made neural networks amazing again?
• Massive datasets
• Computing power
• Algorithmic improvements
Machine learning has a short history, but it seems cyclic. What is next?
Gradient descent
Gradient descent:
w_{t+1} = w_t − η∇f(w_t)
AdaGrad
Not all features are created equal!
• Gradient descent:
w_{t+1} = w_t − η∇f(w_t)
• Adagrad [Duchi et al., 2011]:
cache = cache + (∇f(w_t))^2
w_{t+1} = w_t − η ∇f(w_t) / (10^{−7} + √cache)
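A minimal NumPy sketch of this update, applied elementwise so each parameter gets its own effective step size (the names adagrad_step, grad_w, and eta are placeholders, not from the slides):

```python
import numpy as np

def adagrad_step(w, grad_w, cache, eta=0.01, eps=1e-7):
    """One Adagrad update: accumulate squared gradients in `cache`,
    then scale each parameter's step by 1 / (eps + sqrt(cache))."""
    cache += grad_w ** 2
    w -= eta * grad_w / (eps + np.sqrt(cache))
    return w, cache
```

Because the cache only grows, frequently updated parameters take smaller and smaller steps, which is the sense in which "not all features are created equal."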
Momentum
• Gradient descent:
w_{t+1} = w_t − η∇f(w_t)
Physical interpretation: imagine an object falling, but it does not accumulate any velocity. Let us fix that!
• Momentum:
v_{t+1} = βv_t − ∇f(w_t)
w_{t+1} = w_t + ηv_{t+1}
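A matching sketch of the momentum update (same placeholder names as before; beta plays the role of β above, typically around 0.9):

```python
def momentum_step(w, grad_w, v, eta=0.01, beta=0.9):
    """One momentum update: `v` is a velocity that accumulates an
    exponentially decaying sum of past gradients."""
    v = beta * v - grad_w   # decay old velocity, push in the gradient direction
    w += eta * v
    return w, v
```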
http://cs231n.github.io/assets/nn3/opt2.gif
Image credit: Alec Radford
More variations
• Adam [Kingma and Ba, 2014]
• RMSProp
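For concreteness, a minimal sketch of Adam following [Kingma and Ba, 2014]: it keeps a momentum-style first-moment estimate and an RMSProp-style second-moment estimate, with bias correction (default constants from the paper):

```python
import numpy as np

def adam_step(w, grad_w, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad_w       # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad_w ** 2  # second moment: mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # correct the bias toward zero
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```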
Dropout layer
"randomly set some neurons to zero in the forward pass" [Srivastava et al., 2014]
Forces the network to have a redundant representation.
Another interpretation: dropout is training a large ensemble of models.
Unstable gradients
x → h_1 → h_2 → … → h_L → o

∂L/∂b_1 = σ′(z_1) × w_2 × σ′(z_2) × w_3 × ⋯ × σ′(z_L) × ∂L/∂a_L
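A small numeric sketch of why this product is unstable, assuming sigmoid units and standard-Gaussian weights (so each factor |w_{j+1} σ′(z_j)| is at most |w_{j+1}|/4):

```python
import numpy as np

np.random.seed(0)
L = 20                       # network depth
w = np.random.randn(L)       # one weight per layer, w ~ N(0, 1)
factors = np.abs(w) * 0.25   # sigmoid derivative is at most 1/4
print(np.prod(factors))      # product over 20 layers: vanishingly small
```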
[Figure: the sigmoid function and its derivative on (−10, 10); the derivative peaks at 1/4 and vanishes in both tails]
Vanishing gradients
If we use Gaussian initialization for the weights, w_j ∼ N(0, 1), then typically
|w_j| < 1 and |w_j σ′(z_j)| < 1/4,
so ∂L/∂b_1 decays to zero exponentially with depth.
A common fix: ReLU.
[Figure: the ReLU function and its derivative; the derivative is 1 for positive inputs and 0 otherwise]
Exploding gradients
If w_j = 100, then |w_j σ′(z_j)| ≈ k > 1, and the same product grows exponentially with depth.
Another subtle issue with activation functions
If all inputs x are positive, the gradients on w are either all positive or all negative.
Zero-center the inputs!
Data preprocessing
• Original data
[Figure: scatter plot of the raw data]
• Zero-centered data: X − X.mean(axis=0)
[Figure: the same data shifted to zero mean]
• Normalized data: X /= np.std(X, axis=0)
[Figure: the zero-centered data rescaled to unit standard deviation in each dimension]
• PCA, whitening
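A minimal NumPy sketch of the first two steps, matching the expressions above (the toy data is made up for illustration):

```python
import numpy as np

X = 3.0 * np.random.randn(500, 2) + 5.0  # toy data with nonzero mean, non-unit scale
X = X - X.mean(axis=0)                   # zero-center each dimension
X = X / np.std(X, axis=0)                # rescale each dimension to unit std
```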
Batch normalization
Why only for the input data? [Ioffe and Szegedy, 2015]
Consider a batch of activations at some layer. Make each dimension unit Gaussian:
â_k = (a_k − E[a_k]) / √Var[a_k]
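A minimal sketch of the training-time forward pass (per-dimension batch statistics as above; gamma and beta are the learnable scale and shift from [Ioffe and Szegedy, 2015], so the layer can recover the identity transform if that is best):

```python
import numpy as np

def batchnorm_forward(a, gamma, beta, eps=1e-5):
    """Normalize a batch of activations `a` (shape: batch x features)
    to zero mean and unit variance per dimension, then scale and shift."""
    mu = a.mean(axis=0)                    # batch mean, per dimension
    var = a.var(axis=0)                    # batch variance, per dimension
    a_hat = (a - mu) / np.sqrt(var + eps)
    return gamma * a_hat + beta
```

At test time, the batch statistics are replaced by estimates from the training data, as noted below.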
• Reduces internal covariate shift
• Reduces the dependence of gradients on the scale of the parameters or their initial values
• Allows higher learning rates and the use of saturating nonlinearities
• Reduces the need for dropout (maybe)
During training, use the batch mean and batch variance; during testing, use the empirical mean and variance computed on the training data.
Add batch normalization before the nonlinear activation, or after it?
https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md
Non-convexity
The loss surface of a deep network is non-convex, so where we start matters.
Weight initialization
Old idea: W = 0. What happens?
There is no source of asymmetry. (Every neuron looks the same, which leads to a slow start.)

δ^L = ∇_{a^L} L ⊙ σ′(z^L)   # compute δ's on the output layer
For ℓ = L, …, 1:
  ∂L/∂W^ℓ = δ^ℓ (a^{ℓ−1})^T   # compute weight derivatives
  ∂L/∂b^ℓ = δ^ℓ   # compute bias derivatives
  δ^{ℓ−1} = (W^ℓ)^T δ^ℓ ⊙ σ′(z^{ℓ−1})   # backprop δ's to the previous layer
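A toy demonstration of the symmetry problem (a two-layer network with sigmoid hidden units, a linear output, and squared loss; these choices are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.ones(3), np.array([1.0])           # toy input and target
W1, W2 = np.zeros((4, 3)), np.zeros((1, 4))  # old idea: all-zero weights

a1 = sigmoid(W1 @ x)                         # every hidden unit outputs sigmoid(0) = 0.5
delta2 = (W2 @ a1) - y                       # output-layer delta
dW2 = np.outer(delta2, a1)                   # all columns identical
delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # zero, because W2 is zero
dW1 = np.outer(delta1, x)                    # all zeros: W1 gets no gradient

print(dW2)  # identical entries: hidden units remain interchangeable
print(dW1)  # zeros: no asymmetry is ever introduced
```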
First idea: small random numbers, W ∼ N(0, 0.01). (In deep networks this makes the activations, and hence the gradients, shrink toward zero layer by layer.)
Var(z) = Var(∑_i w_i x_i) = n Var(w_i) Var(x_i)
(assuming the w_i and x_i are independent with zero mean, so the variances of the n terms add and factor)
Xavier initialization [Glorot and Bengio, 2010]:
W ∼ N(0, 2 / (n_in + n_out))
Does not work for ReLU.
He initialization [He et al., 2015]:
W ∼ N(0, 2 / n_in)
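Both schemes amount to scaling standard-Gaussian draws; a minimal sketch (np.random.randn samples N(0, 1), so multiplying by the target standard deviation gives the stated variance):

```python
import numpy as np

def xavier_init(n_in, n_out):
    """Xavier/Glorot: Var(W) = 2 / (n_in + n_out)."""
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / (n_in + n_out))

def he_init(n_in, n_out):
    """He: Var(W) = 2 / n_in, compensating for ReLU zeroing half the inputs."""
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
```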
This is an active research area, and the next great idea may come from you!
ResNet
How to train a neural network with 100 layers? [He et al., 2016]
Why is it hard to train a large number of layers?
Simple solution: add identity skip connections, so each block only needs to learn a residual F(x) and outputs F(x) + x.
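A minimal sketch of a residual block (two weight layers computing F(x), plus the identity shortcut; batch normalization is omitted for brevity, and the shapes are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """Basic residual block: output relu(F(x) + x), where
    F(x) = W2 @ relu(W1 @ x) is the learned residual."""
    f = W2 @ relu(W1 @ x)
    return relu(f + x)   # identity shortcut: gradients flow through unchanged
```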
References (1)
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011. URL http://dl.acm.org/citation.cfm?id=1953048.2021068.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/glorot10a.html.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In The IEEE International Conference on Computer Vision (ICCV), December 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/ioffe15.html.
References (2)
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of ICLR, 2014.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.