Error Function: Squared Error
E = ½ (ŷ − t)², where ŷ is the prediction and t is the ground truth / target
The factor ½ has no special meaning except that it makes the gradients look nicer
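A minimal sketch of this error function and its gradient (variable names are illustrative, not from the slides); note how the ½ cancels the factor 2 from differentiation:

```python
import numpy as np

def squared_error(prediction, target):
    # E = 1/2 * (prediction - target)^2
    return 0.5 * (prediction - target) ** 2

def squared_error_grad(prediction, target):
    # dE/dprediction = prediction - target; the 1/2 cancels the 2 from the power rule
    return prediction - target

y_hat, t = np.array([0.8]), np.array([1.0])
print(squared_error(y_hat, t), squared_error_grad(y_hat, t))
```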
Closed Form Solution
● Fast to compute
● Only exists for some models and error functions
● Must be determined manually
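As an illustration (not from the slides), linear least squares is one of the models for which a closed form solution exists, the normal equation:

```python
import numpy as np

# Synthetic data for a linear model y = X @ w (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Closed form solution of the squared error: w = (X^T X)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(w_closed)   # close to w_true
```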
Gradient Descent
1. Initialize at random
2. Compute error
3. Compute gradients w.r.t. parameters
4. Apply the above update rule
5. Go back to 2. and repeat until the error does not decrease anymore
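A minimal sketch of this loop for the linear least-squares model above (learning rate and stopping tolerance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = rng.normal(size=3) * 0.01                 # 1. initialize at random
lr, prev_error = 0.1, np.inf
while True:
    error = 0.5 * np.mean((X @ w - y) ** 2)   # 2. compute error
    if error >= prev_error - 1e-12:           # 5. stop once the error no longer decreases
        break
    prev_error = error
    grad = X.T @ (X @ w - y) / len(y)         # 3. gradients w.r.t. parameters
    w -= lr * grad                            # 4. apply the update rule
print(w)
```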
Gradient Descent (Result)
1. Initialize at random
2. Compute error
3. Compute gradients w.r.t. parameters
4. Apply the above update rule
5. Go back to 2. and repeat until the error does not decrease anymore
Vanishing Gradients
Backpropagation repeatedly multiplies the error signal (element-wise) with the small or even tiny gradients of each layer
In a neural network with many layers, the gradients of the objective function w.r.t. the weights of a layer close to the inputs may become near zero!
⇒ Gradient descent updates will starve
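A small numeric sketch of this effect (layer sizes, weight scale, and sigmoid activations are illustrative assumptions); the mean gradient magnitude shrinks as the error signal is propagated back through more layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
grad = np.ones(100)                            # error signal arriving at the top layer
for layer in range(20):                        # propagate back through 20 layers
    W = rng.normal(0, 0.1, size=(100, 100))
    pre_act = rng.normal(0, 1, size=100)
    # one backprop step: transposed weights times incoming gradient,
    # multiplied element-wise with the (small) sigmoid derivative
    grad = (W.T @ grad) * sigmoid(pre_act) * (1 - sigmoid(pre_act))
    print(layer, np.abs(grad).mean())
```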
The Importance of Weight Initialization
● Simple CNN trained on MNIST for 12 epochs
● 10-batch rolling average of training loss
Image Source: https://intoli.com/blog/neural-network-initialization/
The Importance of Weight Initialization
Initialization with “0” values is ALWAYS WRONG!
How to initialize properly?
Initializing with “0” means everything is 0, hence there is no error signal
Information Flow in a Neural Network
Consider a network with ...
● 5 hidden layers and 100 neurons per hidden layer
● the hidden layer activation function = identity function
Let’s omit the bias term for simplicity (commonly initialized with all 0’s).
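A sketch of that setup (the weight scale is the quantity being investigated; 0.01 is just one illustrative choice), showing how the scale of the activations collapses or explodes from layer to layer:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=100)                       # input activations

for layer in range(5):                         # 5 hidden layers, 100 neurons each
    W = rng.normal(0, 0.01, size=(100, 100))   # try 0.01, 0.1, 1.0 as the init scale
    h = W @ h                                  # identity activation, no bias
    print(layer, h.std())                      # std shrinks (or explodes) per layer
```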
Information Flow in a Neural Network
Image Source: https://intoli.com/blog/neural-network-initialization/
Information Flow in a Neural Network
What’s the explanation for the previous image?
One layer with some activation function and without the bias term: y = f(Wx)
Information Flow in a Neural Network
The layer output y (1) tends to 0 when either the weights W (2) tend to 0 or the inputs x (3) tend to 0.
⇒ Preserve the variance of activations throughout the network.
Information Flow in a Neural Network
A variance approximation is possible when the pre-activation neuron values are close to zero (where the activation function is approximately linear).
“Glorot” Initialization
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th international conference on artificial intelligence and statistics (pp. 249-256).
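A sketch of the uniform variant of this initialization (function and argument names are illustrative):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=np.random.default_rng()):
    # Glorot/Xavier uniform initialization: Var(W) = 2 / (fan_in + fan_out)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W = glorot_uniform(100, 100)
print(W.std(), np.sqrt(2.0 / 200))   # empirical std ≈ target std
```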
Gradient Descent
Martens, J. (2010). Deep Learning via Hessian-Free Optimization. In Proceedings of the 27th International Conference on Machine Learning (pp. 735-742).
Too large learning rate ⇒ zig-zag
Too small learning rate ⇒ starvation
Batch Gradient Descent
● Update based on the entire training data set
● Susceptible to converging to local minima
● Expensive and inefficient for large training data sets
Stochastic Gradient Descent (SGD)
● Update based on a single example
● More robust against local minima
● Noisy updates ⇒ small learning rate
Mini-Batch Gradient Descent
● Update based on multiple examples
● More robust against local minima
● More stable than stochastic gradient descent
● Most common variant
● Often also called SGD despite using multiple examples (see the sketch below)
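A sketch of one epoch of mini-batch gradient descent on the linear least-squares example from above (batch size and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w, lr, batch_size = np.zeros(3), 0.1, 32
idx = rng.permutation(len(X))                    # shuffle once per epoch
for start in range(0, len(X), batch_size):
    batch = idx[start:start + batch_size]
    Xb, yb = X[batch], y[batch]
    grad = Xb.T @ (Xb @ w - yb) / len(batch)     # gradient on the mini-batch only
    w -= lr * grad
print(w)
```

Batch gradient descent corresponds to batch_size = len(X), stochastic gradient descent to batch_size = 1.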
Gradient Descent with Momentum
● Momentum dampens oscillations
● Gradient is computed before momentum is applied
● Typical momentum term:
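A sketch of the classical momentum update (the coefficients are common defaults, not taken from the slide):

```python
import numpy as np

def momentum_step(w, velocity, grad_fn, lr=0.01, mu=0.9):
    # the gradient is computed at the current parameters, before momentum is applied
    grad = grad_fn(w)
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

w, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, v = momentum_step(w, v, grad_fn=lambda w: 2 * w)   # minimize w**2
print(w)   # approaches 0
```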
Gradient Descent with Nesterov Momentum
● Gradient is computed after momentum is applied
● Anticipated update from momentum is used to include knowledge of momentum in the gradient
● Typically preferred over vanilla momentum
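A sketch of the Nesterov variant; the only change from the block above is that the gradient is evaluated at the anticipated (look-ahead) position (same assumed coefficients):

```python
import numpy as np

def nesterov_step(w, velocity, grad_fn, lr=0.01, mu=0.9):
    # the gradient is computed after the anticipated momentum step
    lookahead = w + mu * velocity
    grad = grad_fn(lookahead)
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity
```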
AdaGrad
● Adaptive (per-weight) learning rates
● Learning rates of frequently occurring features are reduced while learning rates of infrequent features remain large
● Monotonically decreasing learning rates
● Suited for sparse data
● Typical learning rate:
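A sketch of the AdaGrad update (base learning rate and epsilon are common defaults, not taken from the slide):

```python
import numpy as np

def adagrad_step(w, accum, grad, lr=0.01, eps=1e-8):
    # accumulate squared gradients per weight; frequently updated weights
    # accumulate more and therefore get a smaller effective learning rate
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum
```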
Matrix-Vector Multiplication
[Computational graph: symbolic variables W (MATRIX, float) and x (VECTOR, float) feed a MATMUL operation that produces y (VECTOR, float); legend: SYMBOL, TYPE / data type, OPERATION, symbolic variable]
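The slides do not name the framework, but the diagrams match Theano-style symbolic graphs; assuming Theano, the same matrix-vector multiplication could be built like this:

```python
import numpy as np
import theano
import theano.tensor as T

W = T.matrix('W')                    # symbolic variable: MATRIX of floats
x = T.vector('x')                    # symbolic variable: VECTOR of floats
y = T.dot(W, x)                      # MATMUL operation, yields a symbolic VECTOR

matvec = theano.function([W, x], y)  # compile the graph into a callable
print(matvec(np.eye(3), np.ones(3)))
```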
Graph Optimization
[Computational graph: scalar float inputs x and y pass through MULTIPLY and DIVIDE operations to produce scalar float z; an OPTIMIZATION pass simplifies the compiled graph]
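As an illustration of such an optimization (again assuming a Theano-style API), a canonical example is (x · y) / y, which the graph compiler simplifies so that z is computed directly from x:

```python
import theano
import theano.tensor as T

x = T.scalar('x')
y = T.scalar('y')
z = (x * y) / y                      # MULTIPLY followed by DIVIDE

f = theano.function([x, y], z)       # graph optimization happens at compile time
theano.printing.debugprint(f)        # the compiled graph no longer multiplies or divides
print(f(2.0, 7.0))                   # 2.0
```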
Automatic Differentiation
[Computational graph: scalar float x feeds a SQUARE operation producing scalar float y; GRAD(y, x) extends the graph with a MULTIPLY node that computes dy/dx = 2 · x]
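A corresponding sketch (same Theano assumption): GRAD(y, x) builds a new symbolic expression for dy/dx, which the optimizer reduces to 2 · x:

```python
import theano
import theano.tensor as T

x = T.scalar('x')
y = x ** 2                           # SQUARE operation
dy_dx = theano.grad(y, x)            # GRAD(y, x): derivative as a new graph

f = theano.function([x], dy_dx)
print(f(3.0))                        # 6.0, i.e. dy/dx = 2 * x
```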