Introduction to Machine Learning
Introduction to Machine Learning Amo G. Tong 1
Lecture: Deep Learning
• An Introduction
• Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville
• Some materials are courtesy of .
• All pictures belong to their creators.
Deep Learning
• What are deep learning methods?
• Using a complex neural network to approximate the function we want to learn.
Story: ImageNet object recognition contest.
• LeNet-5 (1998): 7-level convolutional network.
• AlexNet (2012): more layers and filters. Trained for 6 days on two GPUs. Error rate: 15.3% (reduced from 26.2%).
• ResNet (2015): 152 layers with residual connections. Error rate: 3.57%.
https://medium.com/analytics-vidhya/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
• Why do we need deep learning?
• We need to solve complex real-world problems.
• A standard way to parameterize functions.
• Layer by layer.
• Flexible to be customized for different applications.
• Image processing
• Sequence data
• Standard training methods.
• Backpropagation.
• Why now?
• Availability of data
• Hardware
• GPUs,…
• Software
• TensorFlow, PyTorch, …
Outline
• General Deep Feedforward Network.
• Convolutional Neural Network (CNN)
• Image processing.
• Recurrent Neural Network (RNN)
• Sequence data processing.
Deep Feedforward Network
Training, Design, and Regularization
[Figure: feedforward network with an input, hidden layers $h$, and an output]
Deep Feedforward Network:
• Feedforward Network (Multi-Layer Perceptron (MLP))
• More Layers.
• Training
• Design
• Regularization
Deep Feedforward Network: Training
• Cost function: $J(\theta)$
• Common choice: cross-entropy
• Measures the difference between two distributions $P$ and $Q$
• $H(P, Q) = -\mathbb{E}_{x \sim P} \log Q(x)$
• Related quantity (KL divergence): $D(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$
• $J(\theta) = -\mathbb{E}_{(x, y) \sim D_{\text{train}}} \log P_{\text{model}}(y \mid x)$ (negative log-likelihood)
• $P_{\text{model}}(y \mid x)$ changes from model to model
• Least squares error: assume a Gaussian and use MLE
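As a sanity check on the definitions above, the following minimal Python sketch computes the cross-entropy $H(P, Q)$ and the divergence $D(P \| Q)$ for two small discrete distributions. The distributions `P` and `Q` and the helper names are made up for illustration; they do not come from the lecture.

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), skipping zero-probability outcomes."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -E_{x~P} log Q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions over three outcomes.
P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]

h_pq = cross_entropy(P, Q)
d_pq = kl_divergence(P, Q)
```

Under these definitions `cross_entropy(P, Q)` equals `entropy(P) + kl_divergence(P, Q)`, so minimizing the cross-entropy over $Q$ is the same as minimizing the divergence from $P$.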
• Gradient-Based Training
• Non-convex because we have nonlinear units.
• Iteratively decrease the cost function; there is no guarantee of reaching a global optimum, or even a local one.
• Initialize with small random numbers.
• Backpropagation.
• One issue: the gradient must be large and predictable.
Stochastic gradient descent (SGD) algorithm:
While stopping criterion not met:
    Sample a minibatch of $m$ examples $D_m$
    Compute the gradient $g \leftarrow \sum_{x \in D_m} \nabla_\theta J(\theta, x)$
    Update $\theta \leftarrow \theta - \epsilon g$
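The SGD loop above can be sketched in plain Python. This is only an illustration under stated assumptions: the model is a one-dimensional linear fit $y = wx + b$ with a squared-error cost, the stopping criterion is a fixed number of steps, and the function name `sgd_linear`, the toy data, and the hyperparameter values are all hypothetical.

```python
import random

def sgd_linear(data, epsilon=0.01, m=4, steps=2000, seed=0):
    """Fit y = w*x + b by minibatch SGD on the cost 0.5 * (prediction - y)^2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # theta = (w, b)
    for _ in range(steps):               # assumed stopping criterion: step budget
        batch = rng.sample(data, m)      # sample a minibatch of m examples
        gw = gb = 0.0
        for x, y in batch:               # g <- sum of per-example gradients
            err = (w * x + b) - y
            gw += err * x
            gb += err
        w -= epsilon * gw                # theta <- theta - epsilon * g
        b -= epsilon * gb
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = sgd_linear(data)
```

Because the toy data is noise-free, the iterates settle close to $w = 2$, $b = 1$; with noisy data the parameters would keep fluctuating at a scale set by $\epsilon$.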
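Deep Feedforward Network: Design
• Output units
• A real vector
• Regression problem: model $\Pr[y \mid x]$.
• Affine transformation: $y_{out} = W^T h + b$
• $J(\theta) = \mathbb{E}_{(x, y) \sim D_{\text{train}}} \| y - y_{out} \|^2$
• Binary classification
• Sigmoid $\sigma$
• $y_{out} = \sigma(W^T h + b)$
• $J(\theta) = -\mathbb{E}_{(x, y) \sim D_{\text{train}}} \log P_{\text{model}}(y \mid x)$, where $P_{\text{model}}(y = 1 \mid x) = y_{out}(x)$
• Multinoulli ($n$-class classification)
• softmax
• $z = W^T h + b = (z_1, \dots, z_n)$
• $\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$
Note: the gradient must be large and predictable. The log in the cost function can help: it undoes the exp in the sigmoid and softmax, so the gradient does not vanish when the output unit saturates.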
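The output units above can be written out in a few lines of Python. This is a sketch rather than a reference implementation: the function names are hypothetical, and subtracting the maximum score before exponentiating is a standard numerical-stability trick that the slides do not mention.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_nll(z, label):
    """-log softmax(z)[label], computed as logsumexp(z) - z[label]."""
    m = max(z)
    return m + math.log(sum(math.exp(zi - m) for zi in z)) - z[label]

probs = softmax([1.0, 2.0, 3.0])
loss = softmax_nll([1.0, 2.0, 3.0], 2)
```

Computing the loss directly from the scores, as `softmax_nll` does, shows how the log cancels the exp: the loss stays roughly linear in the scores instead of saturating.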
• Hidden units
• Rectified linear units (ReLU).
• $g(z) = \max\{0, z\}$
• $h = g(W^T x + b)$
• Good point: large gradient when active, easy to train
• Bad point: cannot learn when inactive, because the gradient is zero
• Generalizations.
• $g(z) = \max(0, z) + \alpha \min(0, z)$
• Absolute value rectification: $\alpha = -1$.
• Leaky ReLU: fixed small $\alpha$ (e.g. 0.01).
• Parametric ReLU (PReLU): learnable $\alpha$.
• Maxout units: group and take the max.
• Traditional units:
• Sigmoid (consider $\tanh(z)$ instead)
• Perceptron
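The ReLU generalizations above all follow from one formula, which the following short Python sketch makes explicit. The function names are hypothetical, and the maxout unit is reduced to its core idea (the max over a group of pre-activations).

```python
def generalized_relu(z, alpha=0.0):
    """g(z) = max(0, z) + alpha * min(0, z)."""
    return max(0.0, z) + alpha * min(0.0, z)

def relu(z):
    return generalized_relu(z, alpha=0.0)

def absolute_value_rectification(z):
    return generalized_relu(z, alpha=-1.0)

def leaky_relu(z, alpha=0.01):
    # For PReLU, alpha would be a learned parameter instead of a constant.
    return generalized_relu(z, alpha=alpha)

def maxout(group):
    """Maxout unit: the maximum over a group of pre-activations."""
    return max(group)
```

For a negative input such as `z = -2.0`, ReLU returns zero (and passes back no gradient), the leaky variant returns a small negative value, and absolute value rectification returns `2.0`.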
• Architecture design
• The layer structure.
• $h^{(1)} = g^{(1)}(W^{(1)T} x + b^{(1)})$
• $h^{(2)} = g^{(2)}(W^{(2)T} h^{(1)} + b^{(2)})$
• …
• How many layers? How many units in each layer?
• Theoretically, universal approximation theorem: a network with one hidden layer of sigmoid units can approximate any Borel measurable function, given enough hidden units.
• But it is not guaranteed that it can be learned.
• Practically, (a) follow classic models: CNN, RNN, …; (b) start from a few (2 or 3) layers with a few (16, 32 or 64) hidden units and watch the validation error.
• Greater depth may not be better.
• Other considerations: locally connected layers, skip connections, recurrent layers, …
Deep Feedforward Network: Regularization
• Regularization: methods for reducing over-fitting.
• Over-fitting is a serious problem in deep learning.
• Method 1: Norm Penalties:
• $J_{new}(\theta) = J(\theta) + \alpha \cdot \Omega(\theta)$
• $L^2$ regularization (weight decay): $\Omega(\theta) = \frac{1}{2} \| \theta \|_2^2$
• $L^1$ regularization: $\Omega(\theta) = \sum_i |\theta_i|$
• What is the difference between them?
• Typically, we put a penalty on the weights $W$ but not on the bias, e.g. on $W^{(1)}$ but not $b^{(1)}$ in $h^{(1)} = g^{(1)}(W^{(1)T} x + b^{(1)})$.
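One way to get a feel for the difference between the two penalties is to apply only the penalty part of the update to a small parameter vector. The sketch below is an illustration under assumptions: the $L^2$ step is the plain gradient step on $\alpha \cdot \frac{1}{2}\|\theta\|^2$, while the $L^1$ step uses soft-thresholding (a proximal step), which is a common way to handle the non-differentiable $L^1$ term and is not something the slides specify. All names and values are hypothetical.

```python
def l2_penalty_step(theta, epsilon, alpha):
    """Gradient step on alpha * 0.5 * ||theta||^2: multiplicative shrinkage."""
    return [(1.0 - epsilon * alpha) * t for t in theta]

def l1_penalty_step(theta, epsilon, alpha):
    """Soft-thresholding step for alpha * sum |theta_i| (assumed proximal update)."""
    shrink = epsilon * alpha
    out = []
    for t in theta:
        if t > shrink:
            out.append(t - shrink)
        elif t < -shrink:
            out.append(t + shrink)
        else:
            out.append(0.0)      # small weights are set exactly to zero
    return out

theta_l2 = [1.0, -0.05]
theta_l1 = [1.0, -0.05]
for _ in range(10):
    theta_l2 = l2_penalty_step(theta_l2, epsilon=0.1, alpha=0.1)
    theta_l1 = l1_penalty_step(theta_l1, epsilon=0.1, alpha=0.1)
```

In this toy run the $L^2$ step shrinks every weight by the same factor, so the small weight gets smaller but stays nonzero, whereas the $L^1$ step subtracts a constant amount and drives the small weight exactly to zero. That is the usual intuition for why $L^1$ tends to produce sparse parameters.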
• Method 2: Norm Penalties as Constrained Optimization:
• $J_{new}(\theta, \alpha) = J(\theta) + \alpha \cdot (\Omega(\theta) - k)$
• $\alpha$ increases when $\Omega(\theta) > k$; $\alpha$ decreases when $\Omega(\theta) < k$
• Take $\alpha$ as another learnable parameter.
• Select $k$ manually.
• Need to control $\alpha$ manually.
• Method 3: Explicit Constraints:
• $J_{new}(\theta) = J(\theta)$ (the cost function is unchanged)
• After each iteration, check whether the new parameters are in the region $\Omega(\theta) \le k$.
• Whenever $\Omega(\theta) > k$, project $\theta$ to the nearest point in the region $\Omega(\theta) \le k$.
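The projection step can be sketched for one concrete choice of penalty. The code below assumes $\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2$ (the $L^2$ penalty from Method 1), for which the nearest feasible point is obtained by rescaling $\theta$; other choices of $\Omega$ need a different projection. The function names and the example values are hypothetical.

```python
import math

def omega_l2(theta):
    """Omega(theta) = 0.5 * ||theta||_2^2."""
    return 0.5 * sum(t * t for t in theta)

def project(theta, k):
    """Return the nearest point with omega_l2 <= k (a rescaling of theta)."""
    value = omega_l2(theta)
    if value <= k:
        return list(theta)               # already feasible: leave unchanged
    scale = math.sqrt(k / value)         # shrink onto the boundary omega = k
    return [scale * t for t in theta]

# After an ordinary SGD update, apply the projection.
theta = project([3.0, 4.0], k=2.0)
```

In a training loop, `project` would be called right after the update $\theta \leftarrow \theta - \epsilon g$, so the cost function itself never changes.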
• Method 4: Data Augmentation
• How does over-fitting happen?
• High representational capacity and insufficient data.
• Creating fake but useful training data is possible.
• The target classifier is invariant to some transformations.
• E.g. object detection.
[Figure: two transformed images of the digit 9, one labeled "Good" and one labeled "Bad"]
• Method 5: Multi-Task Learning
• Two models (problems) share some parameters
• The hidden representations can be better learned if two tasks share certain factors.
[Figure: an input feeding a shared representation with two task-specific outputs (Output 1, Output 2); example questions: Is it an animal? Is it a cat? Is it a dog?]
• Method 6: Early Stopping
• We want to stop early, before convergence.
• Monitor the validation set error.
• One possible method: stop if no better parameters are found in $p$ (say 10) consecutive rounds of training.
• Simple to implement.
• Needs some extra data (the validation set).
• Use a second round of training to fully utilize the data.
• Retraining: adopt the same number of training steps.
• Continued training: start from the obtained parameters.
j = 0
while j < p:
    update θ for n steps
    if no smaller validation error:
        j = j + 1
    else:
        note the new parameters
        j = 0
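The early stopping loop above translates almost line by line into Python. In this sketch the training routine and the validation error are passed in as functions; `train_n_steps`, `validation_error`, and the toy problem at the bottom are hypothetical stand-ins, not part of the lecture.

```python
def early_stopping(train_n_steps, validation_error, theta, p=10):
    """Train until the validation error fails to improve p rounds in a row."""
    best_theta = theta
    best_error = validation_error(theta)
    j = 0
    while j < p:
        theta = train_n_steps(theta)         # update theta for n steps
        error = validation_error(theta)
        if error < best_error:               # smaller validation error:
            best_theta, best_error = theta, error   # note the new parameters
            j = 0
        else:
            j += 1
    return best_theta, best_error

# Toy stand-in: "training" moves theta by +1, validation error is lowest at 5.
best_theta, best_error = early_stopping(
    train_n_steps=lambda t: t + 1,
    validation_error=lambda t: (t - 5) ** 2,
    theta=0,
    p=3,
)
```

The function returns the noted parameters rather than the final ones, since the last `p` rounds by construction did not improve the validation error.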
• Method 7: Dropout
• Bagging: train different models with different datasets.
• Bagging large neural networks is impractical: training and evaluating many networks is costly.
• Dropout:
• Idea: dynamically change the network structure.
• Mute units by setting their output to zero.
• Algorithm:
• Before each iteration, randomly mute units.
• Typically, drop an input node with probability 0.2 and a hidden node with probability 0.5.
• Pros: computationally cheap; independent of the training algorithm or model.
• Cons: the size of the model may need to increase; does not work well for small data sets.
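Muting units at random can be sketched as a function applied to a layer's outputs. One assumption goes beyond the slides: the surviving outputs are divided by the keep probability ("inverted dropout"), a common convention that keeps the expected output unchanged so nothing needs rescaling at test time. The function name and arguments are hypothetical.

```python
import random

def dropout(values, drop_prob, rng, training=True):
    """Zero each value with probability drop_prob; rescale the survivors."""
    if not training or drop_prob == 0.0:
        return list(values)                  # no muting at test time
    keep = 1.0 - drop_prob
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
hidden = [1.0] * 10000
dropped = dropout(hidden, drop_prob=0.5, rng=rng)      # hidden nodes: 0.5
inputs = dropout([0.3, 0.7, 0.1], drop_prob=0.2, rng=rng)  # input nodes: 0.2
```

A fresh random mask is drawn on every call, which corresponds to muting a different set of units before each training iteration.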
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
• Method 8: Adversarial training
• There can be $x_1 \approx x_2$ for which the predictions are very different.
• Train on adversarially perturbed examples from the training set.
• Generative adversarial network (GAN)
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adversarial examples. CoRR, abs/1412.6572.
Convolutional Neural Networks
• One large class of neural networks.
• The main unit is the convolutional layer
• Replace the matrix multiplication with a convolution operation.
• Designed for processing grid-like inputs.
• Convolution operation in math: $(f * g)(t) = \int_{-\infty}^{+\infty} f(\tau) \, g(t - \tau) \, d\tau$
• A weighted average of $g$ around $t$: the value $g(t - \tau)$ is weighted by $f(\tau)$ (equivalently, $g(s)$ is weighted by the flipped function $f(t - s)$).
• Convolution operation in deep learning:
• $h^{(1)} = g^{(1)}(W^{(1)T} x + b^{(1)})$
• $h^{(2)} = g^{(2)}(W^{(2)T} h^{(1)} + b^{(2)})$
• …
• Replace the matrix product with a convolution by a kernel $K$: $W^{(2)T} h^{(1)} \to K * h^{(1)}$
• 1D example: with $h^{(1)} = (a, b, c, d)$ and $K = (e, f)$, the output is $K * h^{(1)} = (ae + bf, \; be + cf, \; ce + df, \; de + 0)$.
• 2D example: [Figure: a kernel $K$ applied to a two-dimensional $h^{(1)}$, $W^{(2)T} h^{(1)} \to K * h^{(1)}$]
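The 1D example can be reproduced with a few lines of Python. The sketch follows the slide's output pattern ($ae + bf$, $be + cf$, …, $de + 0$), which slides the kernel without flipping it and pads the end with zeros; strictly speaking that is a cross-correlation, which is what deep learning libraries usually call convolution. The function name and the numeric values are hypothetical.

```python
def conv1d(h, kernel):
    """Slide the kernel over h, zero-padding the end so len(output) == len(h)."""
    padded = list(h) + [0.0] * (len(kernel) - 1)
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(h))
    ]

# (a, b, c, d) = (1, 2, 3, 4) and K = (e, f) = (10, 100)
out = conv1d([1, 2, 3, 4], [10, 100])
```

With these values the four outputs are $1 \cdot 10 + 2 \cdot 100$, $2 \cdot 10 + 3 \cdot 100$, $3 \cdot 10 + 4 \cdot 100$ and $4 \cdot 10 + 0$. The same two kernel weights are reused at every position, which is the parameter sharing that distinguishes this from a full matrix product.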
• Convolution operation in deep learning:
• $h^{(1)} = g^{(1)}(K^{(1)} * x)$
• $h^{(2)} = g^{(2)}(K^{(2)} * h^{(1)})$
• …
• What are the advantages?
• Sparse interactions: each output depends only on the inputs covered by the kernel
• Fewer parameters
• Parameter sharing:
• One set of parameters (the kernel) for each layer
• Equivariance
• The output shifts in the same way as the input does.
[Figure: $W^{(2)T} h^{(1)}$ compared with $K * h^{(1)}$]
• Typical structure of a convolutional network:
• Convolution layers +
• Activation functions (e.g. ReLU) +
• Pooling and sampling
• Pooling layers:
• Replace the nodes with a summary of the nearby neighbors
• E.g. max pooling: reports the maximum output within a rectangular neighborhood.
• Why pooling?
• Theoretically, it makes the model invariant to small translations.
• Intuitively, we care more about whether some feature exists than about where it is.
[Figure: max pooling example with three neighborhoods: Max=1, Max=1, Max=0]
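Max pooling and its tolerance to small shifts can be illustrated in one dimension. The sketch below uses non-overlapping windows by default; the function name, the window width, and the input vectors are hypothetical and are not taken from the slide's figure.

```python
def max_pool_1d(values, width, stride=None):
    """Report the maximum within each window of the given width."""
    stride = width if stride is None else stride
    return [
        max(values[i:i + width])
        for i in range(0, len(values) - width + 1, stride)
    ]

features = [0, 1, 0, 0, 1, 0, 0, 0, 0]
shifted = [0, 0, 1, 0, 0, 1, 0, 0, 0]   # the same features moved one step right

pooled = max_pool_1d(features, width=3)
pooled_shifted = max_pool_1d(shifted, width=3)
```

Both inputs pool to the same summary because each feature stays inside its window after the shift. A larger shift that crosses a window boundary would change the result, so the invariance only holds for small translations.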
Recurrent Neural Network
• One large class of neural networks.
• Designed for processing sequence data.
“In this lecture, we are going to learn deep learning…..”
[Figure: inputs $x_1, \dots, x_4$ with corresponding outputs $y_1, \dots, y_4$]
A typical pattern. [Figure: unrolled recurrent network showing, from top to bottom: Truth, Loss, Output, Hidden state, Input]
• Stationary model
• Parameter sharing
• Unlimited sequence length
• Powerful: can simulate a universal Turing machine
Hidden transmission:
$a_t = b + W h_{t-1} + U x_t$
$h_t = \tanh(a_t)$
Output:
$o_t = c + V h_t$
$y^{out}_t = \text{softmax}(o_t)$
Loss: cross-entropy
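The recurrence above can be sketched in plain Python with lists standing in for vectors and matrices. The shapes are assumptions ($U$ is hidden×input, $W$ is hidden×hidden, $V$ is output×hidden), and the helper names and the all-zero parameters in the example are hypothetical, chosen only so that the result is easy to check by hand.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x_t, h_prev, U, W, V, b, c):
    """One time step: a_t = b + W h_{t-1} + U x_t, h_t = tanh(a_t), o_t = c + V h_t."""
    a_t = [bi + wi + ui for bi, wi, ui in zip(b, matvec(W, h_prev), matvec(U, x_t))]
    h_t = [math.tanh(a) for a in a_t]
    o_t = [ci + vi for ci, vi in zip(c, matvec(V, h_t))]
    return h_t, softmax(o_t)

def rnn_loss(xs, targets, h0, U, W, V, b, c):
    """Sum of cross-entropy losses; the same parameters are reused at every step."""
    h, loss = h0, 0.0
    for x_t, y_t in zip(xs, targets):
        h, y_out = rnn_step(x_t, h, U, W, V, b, c)
        loss -= math.log(y_out[y_t])
    return loss

n_in, n_hid, n_out = 4, 3, 4
zeros = lambda rows, cols: [[0.0] * cols for _ in range(rows)]
U, W, V = zeros(n_hid, n_in), zeros(n_hid, n_hid), zeros(n_out, n_hid)
b, c = [0.0] * n_hid, [0.0] * n_out
loss = rnn_loss([[1, 0, 0, 0], [0, 1, 0, 0]], [1, 2], [0.0] * n_hid, U, W, V, b, c)
```

With all parameters at zero, every prediction is uniform over the four classes, so the loss over two steps is $2 \log 4$. Training would adjust $U$, $W$, $V$, $b$ and $c$ by backpropagating this loss through all time steps.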
Example ("hello"): Character-level Language Model
Vocabulary: [h, e, l, o]
Values at time steps 1 to 4:
Input characters: h, e, l, l
Input (one-hot): (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 1, 0)
Hidden state: (0.3, -0.1, 0.9), (1.0, -0.3, 0.1), (0.1, -0.5, -0.3), (-0.3, 0.9, 0.7)
Output scores: (1.0, 2.2, -3.0, 4.1), (0.5, 0.3, -1.0, 1.2), (0.1, 0.5, 1.9, -1.1), (0.2, -1.5, -0.1, 2.2)
Predicted characters (highest score): o, o, l, o
Target characters: e, l, l, o
The loss compares the softmax of the output scores with the target characters.
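The output scores in the "hello" example can be turned into predictions and a loss with a short Python sketch. The numbers are read from the example above; treating the largest score as the predicted character and summing the per-step cross-entropy are the assumptions made here, and the variable names are hypothetical.

```python
import math

vocab = ["h", "e", "l", "o"]
scores = [                       # output scores o_t at time steps 1 to 4
    [1.0, 2.2, -3.0, 4.1],
    [0.5, 0.3, -1.0, 1.2],
    [0.1, 0.5, 1.9, -1.1],
    [0.2, -1.5, -0.1, 2.2],
]
targets = "ello"                 # the character that should follow h, e, l, l

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

# Predicted character at each step: the vocabulary entry with the largest score.
predicted = "".join(vocab[max(range(len(o)), key=o.__getitem__)] for o in scores)

# Cross-entropy loss: minus the log-probability assigned to each target character.
loss = sum(-math.log(softmax(o)[vocab.index(t)]) for o, t in zip(scores, targets))
```

With these scores the network is right only at the last two steps, so most of the loss comes from the first two, where the target characters `e` and `l` receive low scores.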
Output-guided hidden transmission (trained with teacher forcing): the next step receives the previous output instead of the previous hidden state, so it gets less information.
• Less powerful in terms of representability.
• Output may not carry sufficient information.
• Training can be parallelized: with teacher forcing, the true output from the training set is fed to the next step, so the time steps decouple.
Summary
• General Deep Feedforward Network.
• Training
• Design: output units, hidden units
• Regularization
• Convolutional Neural Network
• Convolution Operator
• Grid-like input
• Extract features layer by layer
• Recurrent Neural Network
• Sequence data processing
• Hidden state transmission