
Introduction to Machine Learning

Amo G. Tong

Lecture: Deep Learning

• An Introduction

• Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville

• Some materials are courtesy of .

• All pictures belong to their creators.



Deep Learning

• What are deep learning methods?

• Using a complex neural network to approximate the function we want to learn.

Story: ImageNet object recognition contest.

https://medium.com/analytics-vidhya/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5



LeNet-5 (1998): 7-level convolutional network.




AlexNet (2012): more layers and filters. Trained for 6 days on two GPUs. Top-5 error rate: 15.3% (reduced from 26.2%).




ResNet (2015): 152 layers with residual connections. Error rate: 3.57%.




• Why do we need deep learning?

• We need to solve complex real-world problems.

• A standard way to parameterize functions.

• Layer by layer.

• Flexible enough to be customized for different applications.

• Image processing

• Sequence data

• Standard training methods.

• Backpropagation.


• Why now?

• Availability of data

• Hardware

• GPUs, …

• Software

• TensorFlow, PyTorch, …


Outline

• General Deep Feedforward Network.

• Convolutional Neural Network (CNN)

• Image processing.

• Recurrent Neural Network (RNN)

• Sequence data processing.


Deep Feedforward Network

[Figure: a feedforward network with an input, hidden layers (producing $h$), and an output.]

Deep Feedforward Network: Training, Design, and Regularization


Deep Feedforward Network:

• Feedforward Network (Multi-Layer Perceptron (MLP))

• More Layers.

• Training

• Design

• Regularization



Deep Feedforward Network: Training

• Cost function: $J(\theta)$
• Common choice: cross-entropy
  • It measures the difference between two distributions $P$ and $Q$.
  • $H(P, Q) = -\mathbb{E}_{x \sim P} \log Q(x)$
  • The related KL divergence is $D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$, and $H(P, Q) = H(P) + D(P \| Q)$.
  • $J(\theta) = -\mathbb{E}_{x, y \sim D_{\text{train}}} \log P_{\text{model}}(y \mid x)$ (negative log-likelihood)
  • $P_{\text{model}}(y \mid x)$ changes from model to model.
  • Least-squares error: assume a Gaussian $P_{\text{model}}$ and use maximum likelihood estimation (MLE).
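The relation between cross-entropy and KL divergence can be checked numerically. The following is a minimal sketch for discrete distributions; the two distributions `P` and `Q` are made-up values for illustration, not taken from the lecture.

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), skipping zero-probability outcomes."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions over three outcomes.
P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]

h_pq = cross_entropy(P, Q)
gap = h_pq - entropy(P)          # should match D(P || Q)
print(h_pq, kl_divergence(P, Q), gap)
```

Since $H(P)$ does not depend on the model, minimizing the cross-entropy with respect to $Q$ is the same as minimizing the KL divergence.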



• Gradient-based training
  • The cost is non-convex because the network has nonlinear units.
  • Training iteratively decreases the cost function; it is not guaranteed to reach a global optimum, or even a local one.
  • Initialize the weights with small random numbers.
  • One issue: the gradient must be large and predictable.

Stochastic gradient descent (SGD) algorithm:
    While the stopping criterion is not met:
        Sample a minibatch of $m$ examples $D_m$
        Compute the gradient $g \leftarrow \sum_{x \in D_m} \nabla_\theta J(\theta, x)$
        Update $\theta \leftarrow \theta - \epsilon g$
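The SGD loop above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the toy regression data, the squared-error gradient, the fixed step budget used as the stopping criterion, and the names `sgd` and `squared_error_grad` are all assumptions, not part of the lecture.

```python
import random

def sgd(data, theta, grad_fn, lr=0.05, batch_size=8, steps=2000, seed=0):
    """Minibatch SGD: theta <- theta - lr * g, with g summed over the minibatch."""
    rng = random.Random(seed)
    for _ in range(steps):                      # stopping criterion: fixed step budget
        batch = rng.sample(data, batch_size)    # sample a minibatch of m examples
        g = [0.0] * len(theta)
        for example in batch:                   # g <- sum of per-example gradients
            for i, gi in enumerate(grad_fn(theta, example)):
                g[i] += gi
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def squared_error_grad(theta, example):
    """Gradient of 0.5 * (w*x + b - y)^2 with respect to (w, b)."""
    (w, b), (x, y) = theta, example
    err = w * x + b - y
    return [err * x, err]

# Toy data from y = 2x + 1 on x in [-1, 1] (noise-free, for illustration).
data = [(i / 50.0, 2.0 * (i / 50.0) + 1.0) for i in range(-50, 51)]
w, b = sgd(data, [0.0, 0.0], squared_error_grad)
print(round(w, 3), round(b, 3))
```

In a neural network the per-example gradient would come from backpropagation rather than a closed-form expression.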


Deep Feedforward Network: Design

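• Output units
  • A real vector
    • Regression problem: $\Pr[y \mid x]$.
    • Affine transformation: $y_{\text{out}} = W^\top h + b$
    • $J(\theta) = \mathbb{E}_{x, y \sim D_{\text{train}}} \|y - y_{\text{out}}\|^2$
  • Binary classification
    • Sigmoid $\sigma$
    • $y_{\text{out}} = \sigma(W^\top h + b)$
    • $J(\theta) = -\mathbb{E}_{x, y \sim D_{\text{train}}} \log P_{\text{model}}(y \mid x)$, where $P_{\text{model}}(y = 1 \mid x) = y_{\text{out}}(x)$
  • Multinoulli ($n$-class classification)
    • Softmax
    • $z = W^\top h + b = (z_1, \dots, z_n)$
    • $\operatorname{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$

Note: the gradient must be large and predictable. The log in the cost function can help, because it undoes the exponential in the sigmoid and softmax.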
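The sigmoid and softmax output units, together with the negative log-likelihood cost, can be sketched as follows. Subtracting `max(z)` before exponentiating is a common numerical-stability trick that is not part of the slides; the input vector `z` is a made-up example.

```python
import math

def sigmoid(z):
    """Sigmoid output unit for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def nll(probs, label):
    """Negative log-likelihood of the true class under the predicted distribution."""
    return -math.log(probs[label])

z = [2.0, 1.0, -1.0]     # hypothetical pre-activations W^T h + b
p = softmax(z)
print(p, nll(p, 0))
```

The shift does not change the result, because multiplying the numerator and the denominator by the same constant cancels out.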







• Hidden units
  • Rectified linear units (ReLU)
    • $g(z) = \max\{0, z\}$
    • $h = g(W^\top x + b)$
    • Good point: large gradient when active, so easy to train.
    • Bad point: no gradient when inactive, so the unit cannot learn from those examples.
  • Generalizations
    • $g(z) = \max(0, z) + \alpha \min(0, z)$
    • Absolute value rectification: $\alpha = -1$.
    • Leaky ReLU: fixed small $\alpha$ (e.g. 0.01).
    • Parametric ReLU (PReLU): learnable $\alpha$.
    • Maxout units: group the units and take the max of each group.
  • Traditional units
    • Sigmoid (also consider $\tanh(z)$)
    • Perceptron
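The ReLU family above shares one formula with different values of $\alpha$. A minimal sketch (the function names are chosen here for illustration):

```python
def rectifier(z, alpha=0.0):
    """Generalized rectifier: g(z) = max(0, z) + alpha * min(0, z)."""
    return max(0.0, z) + alpha * min(0.0, z)

def relu(z):
    return rectifier(z, 0.0)       # alpha = 0: standard ReLU

def leaky_relu(z):
    return rectifier(z, 0.01)      # fixed small alpha

def abs_rectifier(z):
    return rectifier(z, -1.0)      # alpha = -1 gives |z|

for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), leaky_relu(z), abs_rectifier(z))
```

A PReLU unit would use the same `rectifier` function but treat `alpha` as a parameter updated by gradient descent.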






• Architecture design
  • The layer structure:
    • $h^{(1)} = g^{(1)}(W^{(1)\top} x + b^{(1)})$
    • $h^{(2)} = g^{(2)}(W^{(2)\top} h^{(1)} + b^{(2)})$
    • …
  • How many layers? How many units in each layer?
    • Theoretically, by the universal approximation theorem, one hidden layer of sigmoid units can approximate any Borel-measurable function, given enough hidden units.
    • But it is not guaranteed that such a network can be learned.
    • Practically, (a) follow classic models: CNN, RNN, …; (b) start with a few (2 or 3) layers with a few (16, 32, or 64) hidden units each, and monitor the validation error.
    • Greater depth may not be better.







• Other considerations: locally connected layers, skip connections, recurrent layers, …
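The layer-by-layer structure $h^{(k)} = g^{(k)}(W^{(k)\top} h^{(k-1)} + b^{(k)})$ can be sketched as a loop over layers. The weights below are hand-picked so that a two-unit ReLU hidden layer computes XOR; they are an illustrative assumption, not learned values or anything given in the slides.

```python
def affine(W, x, b):
    """Compute W^T x + b, where W[i][j] connects input i to unit j."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def forward(x, layers):
    """Apply h = g(W^T h_prev + b) for each (W, b, g) in turn."""
    h = x
    for W, b, g in layers:
        h = [g(a) for a in affine(W, h, b)]
    return h

def relu(a):
    return max(0.0, a)

def identity(a):
    return a

# Hypothetical hand-picked weights: one ReLU hidden layer, one linear output.
layers = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu),
    ([[1.0], [-2.0]], [0.0], identity),
]

for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
    print(x, forward(x, layers))
```

Adding depth means appending more `(W, b, g)` triples to `layers`; the forward pass itself does not change.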


Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 1: Norm penalties
  • $J_{\text{new}}(\theta) = J(\theta) + \alpha \cdot \Omega(\theta)$
  • $L^2$ regularization (weight decay): $\Omega(\theta) = \frac{1}{2} \|\theta\|_2^2$
  • $L^1$ regularization: $\Omega(\theta) = \sum_i |\theta_i|$
  • What is the difference between them?
  • Typically, we put a penalty on the weights $W$ but not on the biases, e.g. in $h^{(1)} = g^{(1)}(W^{(1)\top} x + b^{(1)})$.
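One way to see the difference between the two penalties is to take a single gradient step on the penalty term alone. The gradient of the $L^2$ penalty is proportional to each weight, so all weights shrink by the same factor; the (sub)gradient of the $L^1$ penalty has constant magnitude, so small weights are pushed all the way to zero. The sketch below uses made-up weights and a made-up step size.

```python
def l2_penalty(theta):
    """Omega(theta) = 0.5 * ||theta||_2^2."""
    return 0.5 * sum(t * t for t in theta)

def l1_penalty(theta):
    """Omega(theta) = sum_i |theta_i|."""
    return sum(abs(t) for t in theta)

def l2_grad(theta):
    return list(theta)                              # d/dtheta of 0.5 * theta^2

def l1_grad(theta):
    return [(t > 0) - (t < 0) for t in theta]       # sign(theta), a subgradient

def penalty_step(theta, grad_fn, step):
    """One gradient step on the penalty alone; step stands for epsilon * alpha."""
    return [t - step * g for t, g in zip(theta, grad_fn(theta))]

theta = [2.0, 0.1, -0.5]                            # hypothetical weights
after_l2 = penalty_step(theta, l2_grad, 0.1)        # each weight shrinks by 10%
after_l1 = penalty_step(theta, l1_grad, 0.1)        # each weight moves 0.1 toward zero
print(after_l2, after_l1)
```

In this example the small weight 0.1 becomes exactly zero under the $L^1$ step, while the $L^2$ step only scales it down.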








• Method 2: Norm penalties as constrained optimization
  • $J_{\text{new}}(\theta, \alpha) = J(\theta) + \alpha \cdot (\Omega(\theta) - k)$
  • $\alpha$ increases when $\Omega(\theta) > k$; $\alpha$ decreases when $\Omega(\theta) < k$.
  • Take $\alpha$ as another learnable parameter.
  • Select $k$ manually.
  • Need to control $\alpha$ manually.









• Method 3: Explicit constraints
  • $J_{\text{new}}(\theta) = J(\theta)$
  • Whenever $\Omega(\theta) > k$, project $\theta$ to the nearest point in the region $\Omega(\theta) < k$.
  • After each iteration, check whether the new parameters are in the region $\Omega(\theta) < k$.
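For the $L^2$ penalty $\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2$, the projection step has a simple closed form: rescale $\theta$ so that its norm equals $\sqrt{2k}$. The sketch below assumes that choice of $\Omega$ and projects onto the closed region $\Omega(\theta) \le k$; the function name is hypothetical.

```python
import math

def project_l2(theta, k):
    """Project theta onto the region 0.5 * ||theta||^2 <= k (an L2 ball)."""
    norm = math.sqrt(sum(t * t for t in theta))
    radius = math.sqrt(2.0 * k)
    if norm <= radius:
        return list(theta)                      # already feasible: leave unchanged
    return [t * radius / norm for t in theta]   # rescale onto the boundary

print(project_l2([3.0, 4.0], 0.5))   # outside the ball of radius 1: rescaled
print(project_l2([0.1, 0.1], 0.5))   # inside the ball: unchanged
```

In training, this function would be called on the weights after each gradient update.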









• Method 4: Data augmentation
  • How does over-fitting happen?
    • High representational capacity and insufficient data.
  • Creating fake but useful training data is possible.
    • The target classifier is invariant to some transformations.
    • E.g. object detection.

[Figure: two images of the digit 9, labeled "Good" and "Bad".]





• Method 5: Multi-task learning
  • Two models (problems) share some parameters.
  • The hidden representations can be learned better if the two tasks share certain factors.

[Figure: an input feeds a shared part that branches into Output 1 and Output 2; example questions: "Is it an animal?", "Is it a cat?", "Is it a dog?"]





• Method 6: Early stopping
  • We may want to stop training before convergence.
  • Monitor the validation set error.
  • One possible method: stop if no better parameters are found in $p$ (say 10) consecutive checks, each made after $n$ training steps.
  • Simple to implement.
  • Needs some extra data (the validation set).
  • Use a second round of training to fully utilize the data:
    • Retraining: train again on all the data for the same number of training steps.
    • Continued training: start from the obtained parameters.

    j = 0
    while j < p:
        update θ for n steps
        if no smaller validation error:
            j = j + 1
        else:
            note the new parameters
            j = 0
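The early-stopping loop above can be written as a small generic function. This is a sketch under assumptions: `train_n_steps` and `validation_error` are hypothetical callbacks supplied by the caller, and the toy "model" below is a single integer whose validation error is smallest at 10.

```python
import copy

def early_stopping(theta, train_n_steps, validation_error, patience=10):
    """Train until the validation error fails to improve `patience` times in a row."""
    best_theta = copy.deepcopy(theta)
    best_err = validation_error(theta)
    j = 0
    while j < patience:
        theta = train_n_steps(theta)            # update theta for n steps
        err = validation_error(theta)
        if err < best_err:                      # note the new parameters
            best_theta, best_err, j = copy.deepcopy(theta), err, 0
        else:                                   # no smaller validation error
            j += 1
    return best_theta, best_err

# Toy stand-ins: "training" increments theta; validation error is lowest at 10.
best_theta, best_err = early_stopping(
    0,
    train_n_steps=lambda t: t + 1,
    validation_error=lambda t: (t - 10) ** 2,
    patience=10,
)
print(best_theta, best_err)
```

The function returns the best parameters seen, not the last ones, which is why a copy is stored each time the validation error improves.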





• Method 7: Dropout
  • Bagging: train different models with different datasets.
    • Bagging large neural networks directly is impractical, since each model is costly to train and evaluate.
  • Dropout:
    • Idea: dynamically change the network structure.
    • Mute units by setting their outputs to zero.
  • Algorithm:
    • Before each iteration, randomly mute units.
    • Typically, drop an input node with probability 0.2 and a hidden node with probability 0.5.
  • Pros: computationally cheap; independent of the training algorithm and the model.
  • Cons: the size of the model may need to increase; does not work well for small datasets.
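The muting step can be sketched as an independent random mask over a layer's outputs. This sketch covers only the training-time masking described above; the rescaling that dropout implementations apply at test time (or during training, in the "inverted" variant) is left out. The function name and the example values are assumptions.

```python
import random

def dropout(values, drop_prob, rng):
    """Mute each unit independently with probability drop_prob (output set to zero)."""
    return [0.0 if rng.random() < drop_prob else v for v in values]

rng = random.Random(0)                    # seeded for reproducibility
inputs = [0.5, -1.2, 3.0, 0.7]            # hypothetical input-layer values
hidden = [1.0] * 1000                     # hypothetical hidden-layer outputs

masked_inputs = dropout(inputs, 0.2, rng)     # input nodes: drop with prob 0.2
masked_hidden = dropout(hidden, 0.5, rng)     # hidden nodes: drop with prob 0.5
kept = sum(1 for v in masked_hidden if v != 0.0)
print(masked_inputs, kept)
```

A fresh mask is drawn before each training iteration, so every minibatch effectively trains a different sub-network that shares weights with all the others.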





Reference: Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.





• Method 8: Adversarial training
  • There can be $x_1 \approx x_2$ for which the predictions are very different.
  • Train on adversarially perturbed examples from the training set.
  • Generative adversarial network (GAN)

Reference: Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. CoRR, abs/1412.6572.


Convolutional Neural Networks




• One large class of neural networks.

• The main unit is the convolutional layer.

• Replace matrix multiplication with the convolution operation.

• Designed for processing grid-like inputs.







• Convolution operation in math: $(f * g)(t) = \int_{-\infty}^{+\infty} f(\tau)\, g(t - \tau)\, d\tau$

• A weighted average of $g$: the value $g(t - \tau)$ gets weight $f(\tau)$.







• Convolution operation in deep learning:
  • $h^{(1)} = g^{(1)}(W^{(1)\top} x + b^{(1)})$
  • $h^{(2)} = g^{(2)}(W^{(2)\top} h^{(1)} + b^{(2)})$
  • …
  • Replace the matrix product with a convolution by a kernel $K$: $W^{(2)\top} h^{(1)} \rightarrow K * h^{(1)}$







1D example: with $h^{(1)} = (a, b, c, d)$ and a two-element kernel $K = (e, f)$, sliding $K$ over $h^{(1)}$ (padded with a zero at the end) gives $K * h^{(1)} = (ae + bf,\ be + cf,\ ce + df,\ de + 0)$.
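The 1D example can be reproduced with a short function. As in the example, the kernel is slid over the input without being flipped (strictly speaking a cross-correlation, which is what deep learning libraries usually call convolution), and the input is zero-padded on the right. The numeric values are made up for illustration.

```python
def conv1d(h, kernel):
    """Slide the kernel over h (zero-padded on the right), one output per input."""
    padded = list(h) + [0.0] * (len(kernel) - 1)
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(h))]

h1 = [1.0, 2.0, 3.0, 4.0]     # plays the role of (a, b, c, d)
K = [10.0, 1.0]               # plays the role of (e, f)
print(conv1d(h1, K))          # [a*e + b*f, b*e + c*f, c*e + d*f, d*e + 0]
```

The kernel has only two parameters regardless of the input length, which is the source of the parameter savings compared with a full weight matrix.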







[Figure: 2D example of replacing $W^{(2)\top} h^{(1)}$ with $K * h^{(1)}$.]




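• $h^{(1)} = g^{(1)}(K^{(1)} * x)$
• $h^{(2)} = g^{(2)}(K^{(2)} * h^{(1)})$
• …
• What are the advantages?
  • Sparse interactions: each output depends only on a kernel-sized neighborhood of the input.
    • Fewer parameters
  • Parameter sharing:
    • One set of parameters (the kernel) for each layer
  • Equivariance
    • The output changes in the same way as the input does (e.g. shifting the input shifts the output).

[Figure: connectivity of $W^{(2)\top} h^{(1)}$ compared with $K * h^{(1)}$.]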




• Typical structure of a convolutional network:
  • Convolution layers +
  • Activation functions (e.g. ReLU) +
  • Pooling and sampling
• Pooling layers:
  • Replace each node with a summary of its nearby neighbors.
  • E.g. max pooling reports the maximum output within a rectangular neighborhood.
• Why pooling?
  • Theoretically, it makes the model invariant to small translations.
  • Intuitively, we care more about whether a feature exists than about exactly where it is.




[Figure: max pooling example with three neighborhoods, giving Max=1, Max=1, Max=0.]
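Max pooling and its invariance to small translations can be illustrated in one dimension. This is a minimal sketch with non-overlapping windows; the window width, stride, and feature maps are made-up values.

```python
def max_pool1d(values, width, stride):
    """Report the maximum within each window of `width` values."""
    return [max(values[i:i + width])
            for i in range(0, len(values) - width + 1, stride)]

feature_map = [0, 1, 0, 0, 0, 0]     # a feature detected at position 1
shifted_map = [0, 0, 1, 0, 0, 0]     # the same feature shifted by one position

print(max_pool1d(feature_map, 3, 3))
print(max_pool1d(shifted_map, 3, 3))
```

Both inputs pool to the same output because the shift stays inside one pooling window; a shift that crosses a window boundary would still change the result.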










Recurrent Neural Network

“In this lecture, we are going to learn deep learning…..”





• Designed for processing sequence data.


[Figure: a sequence model mapping inputs $x_1, x_2, x_3, x_4$ to outputs $y_1, y_2, y_3, y_4$.]





A typical pattern.

[Figure: an unrolled recurrent network with rows labeled Truth, Loss, Output, Hidden state, and Input.]

• Stationary model
• Parameter sharing
• Unlimited sequence length
• Powerful: can simulate a universal Turing machine

Hidden transmission:
$a_t = b + W h_{t-1} + U x_t$
$h_t = \tanh(a_t)$

Output:
$o_t = c + V h_t$
$y_{\text{out}, t} = \operatorname{softmax}(o_t)$

Loss: cross-entropy
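The recurrence above can be sketched as a single step function that is applied with the same parameters at every time step. All parameter values and sizes below (3 inputs, 2 hidden units, 3 output classes) are hypothetical, chosen only to make the example runnable.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x_t, h_prev, U, W, V, b, c):
    """a_t = b + W h_{t-1} + U x_t; h_t = tanh(a_t); y_t = softmax(c + V h_t)."""
    a_t = [bi + wh + ux
           for bi, wh, ux in zip(b, matvec(W, h_prev), matvec(U, x_t))]
    h_t = [math.tanh(a) for a in a_t]
    o_t = [ci + vh for ci, vh in zip(c, matvec(V, h_t))]
    return h_t, softmax(o_t)

# Hypothetical parameters, shared across all time steps.
U = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W = [[0.1, 0.4], [-0.3, 0.2]]
V = [[1.0, -1.0], [0.5, 0.5], [-0.7, 0.2]]
b, c = [0.0, 0.1], [0.0, 0.0, 0.0]

h = [0.0, 0.0]                                   # initial hidden state
outputs = []
for x in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):      # a one-hot input sequence
    h, y = rnn_step(x, h, U, W, V, b, c)
    outputs.append(y)
print(outputs)
```

Because the loop reuses `U`, `W`, `V`, `b`, and `c`, the same function handles sequences of any length.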


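Example ("hello"): character-level language model

Vocabulary: [h, e, l, o]

t  Input  One-hot input  Hidden state h_t    Output o_t              Most likely  Target
1  h      (1, 0, 0, 0)   (0.3, -0.1, 0.9)    (1.0, 2.2, -3.0, 4.1)   o            e
2  e      (0, 1, 0, 0)   (1.0, -0.3, 0.1)    (0.5, 0.3, -1.0, 1.2)   o            l
3  l      (0, 0, 1, 0)   (0.1, -0.5, -0.3)   (0.1, 0.5, 1.9, -1.1)   l            l
4  l      (0, 0, 1, 0)   (-0.3, 0.9, 0.7)    (0.2, -1.5, -0.1, 2.2)  o            o

The "Most likely" column is the character with the largest score in $o_t$; the loss compares the output with the target character.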
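The output scores in the "hello" example can be turned into predictions and per-step cross-entropy losses. The sketch below uses the output vectors from the example, read in time order; the helper `log_softmax` is a name chosen here for illustration.

```python
import math

vocab = ["h", "e", "l", "o"]
scores = [                      # output o_t at each time step, from the example
    [1.0, 2.2, -3.0, 4.1],
    [0.5, 0.3, -1.0, 1.2],
    [0.1, 0.5, 1.9, -1.1],
    [0.2, -1.5, -0.1, 2.2],
]
targets = ["e", "l", "l", "o"]  # the next character at each step of "hello"

def log_softmax(o):
    """log softmax(o)_i, computed with the log-sum-exp trick for stability."""
    m = max(o)
    log_z = m + math.log(sum(math.exp(v - m) for v in o))
    return [v - log_z for v in o]

predicted = [vocab[o.index(max(o))] for o in scores]
losses = [-log_softmax(o)[vocab.index(t)] for o, t in zip(scores, targets)]

print("".join(predicted))       # prints "oolo"
print(sum(losses) / len(losses))
```

Only steps 3 and 4 predict the target character, so the losses at steps 1 and 2 are the larger ones; training would adjust the parameters to raise the scores of "e" and "l" there.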





Output-guided hidden transmission (teacher forcing): the hidden state receives the previous output instead of the previous hidden state, so it gets less information.

• Less powerful in terms of representability.
• The output may not carry sufficient information.
• Training can be parallelized.




Summary

• General Deep Feedforward Network.

• Training

• Design: output units, hidden units, architecture

• Regularization

• Convolutional Neural Network

• Convolution Operator

• Grid-like input

• Extract features layer by layer

• Recurrent Neural Network

• Sequence data processing

• Hidden state transmission
