CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader

Lecture 19: NN Regularization
Outline

Regularization of NN
§ Norm Penalties
§ Early Stopping
§ Data Augmentation
§ Sparse Representation
§ Bagging
§ Dropout
Regularization

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
Overfitting

Fitting a deep neural network with 5 layers and 100 neurons per layer can yield very good predictions on the training set but poor predictions on the validation set.
Norm Penalties

We used to optimize:

$$J(W; X, y)$$

Change to:

$$\tilde{J}(W; X, y) = J(W; X, y) + \alpha \, \Omega(W)$$

(Biases are not penalized.)

L2 regularization, $\Omega(W) = \tfrac{1}{2}\|W\|_2^2$:
– weight decay
– MAP estimation with a Gaussian prior

L1 regularization, $\Omega(W) = \|W\|_1$:
– encourages sparsity
– MAP estimation with a Laplacian prior

Why "weight decay"? Plain gradient descent updates the weights as

$$W^{(t+1)} = W^{(t)} - \lambda \frac{\partial J}{\partial W}$$

With the L2 penalty, $\tilde{J}(W; X, y) = J(W; X, y) + \tfrac{1}{2}\alpha \|W\|_2^2$, the update gains an extra term:

$$W^{(t+1)} = W^{(t)} - \lambda \frac{\partial J}{\partial W} - \lambda \alpha W^{(t)}$$

so each weight decays in proportion to its size.
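A minimal NumPy sketch of this update rule; `grad_J` is a hypothetical stand-in for the gradient of the unpenalized loss obtained from backpropagation:

```python
import numpy as np

def sgd_step_weight_decay(W, grad_J, lr=0.01, alpha=1e-4):
    """One gradient step on the L2-penalized loss J~ = J + (alpha/2)||W||^2.

    grad_J: gradient of the *unpenalized* loss J at W.
    The extra -lr * alpha * W term shrinks each weight in
    proportion to its size ("weight decay").
    """
    return W - lr * grad_J - lr * alpha * W

# Toy usage with made-up numbers:
W = np.array([1.0, -2.0, 0.5])
grad = np.array([0.1, 0.0, -0.2])
W_new = sgd_step_weight_decay(W, grad)
```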
Norm Penalties as Constraints

Instead of adding a penalty to the loss, we can constrain the norm of the weights directly:

$$\min_{W \,:\, \Omega(W) \le K} \; J(W; X, y)$$

Useful if K is known in advance.

Optimization:
• Construct the Lagrangian and apply gradient descent
• Projected gradient descent
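A sketch of projected gradient descent for the L2-ball constraint $\|W\|_2 \le K$ (NumPy; `grad_J` is again a hypothetical stand-in for the backprop gradient):

```python
import numpy as np

def projected_gd_step(W, grad_J, lr=0.01, K=1.0):
    """Gradient step followed by projection onto the ball ||W||_2 <= K."""
    W = W - lr * grad_J            # ordinary (unconstrained) gradient step
    norm = np.linalg.norm(W)
    if norm > K:                   # step left the feasible set:
        W = W * (K / norm)         # rescale back onto {W : ||W||_2 <= K}
    return W
```

For the L2 ball the projection is just a rescaling, which is why this variant is cheap in practice.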
Early Stopping

Training time can be treated as a hyperparameter.

Early stopping: track validation-set performance during training, stop once it no longer improves, and keep the parameters from the epoch with the best validation performance.
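A framework-agnostic sketch of the early-stopping loop; `train_step` and `val_loss` are hypothetical stand-ins for your own epoch-training and validation code:

```python
import copy

def train_with_early_stopping(model, train_step, val_loss,
                              max_epochs=1000, patience=10):
    """Stop when validation loss hasn't improved for `patience` epochs.

    train_step(model): runs one epoch of training in place (hypothetical).
    val_loss(model):   returns the current validation loss (hypothetical).
    """
    best_loss, best_model, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)
        loss = val_loss(model)
        if loss < best_loss:            # validation improved: remember weights
            best_loss = loss
            best_model = copy.deepcopy(model)
            bad_epochs = 0
        else:                           # no improvement this epoch
            bad_epochs += 1
            if bad_epochs >= patience:  # give up: validation stopped improving
                break
    return best_model                   # parameters from the best epoch
```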
Data Augmentation

Enlarge the training set by applying label-preserving transformations (e.g., flips, shifts, rotations, added noise) to the existing examples.
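A minimal NumPy sketch of such transformations; the flip/shift choices and ranges here are illustrative, not from the slides:

```python
import numpy as np

def augment(image, rng):
    """Random label-preserving transforms for one image of shape (H, W, C).

    A minimal sketch: horizontal flips and small pixel shifts; real
    pipelines add rotations, crops, color jitter, noise, etc.
    """
    if rng.random() < 0.5:                    # random horizontal flip
        image = image[:, ::-1, :]
    dy, dx = rng.integers(-4, 5, size=2)      # shift by up to 4 pixels
    image = np.roll(image, (dy, dx), axis=(0, 1))
    return image

rng = np.random.default_rng(0)
batch = rng.random((8, 32, 32, 3))            # fake image batch
augmented = np.stack([augment(img, rng) for img in batch])
```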
Sparse Representation

Consider a small network with hidden-layer weights $W_1, \dots, W_6$ and an output weight vector feeding the output $Y$, trained on the loss $J(\theta; X, y)$. In the slides' example, the weights in the output layer multiply the hidden-layer outputs to give

$$Y = \begin{pmatrix} 3.2 & 2.0 & 1.8 \end{pmatrix} \begin{pmatrix} 2 \\ -2.2 \\ 1.3 \end{pmatrix} = 4.34$$

Penalizing the weights,

$$\tilde{J}(W; X, y) = J(\theta; X, y) + \alpha \, \Omega(W),$$

shrinks the weights in the output layer:

$$Y = \begin{pmatrix} 0.5 & 0.2 & 0.1 \end{pmatrix} \begin{pmatrix} 2 \\ -2.2 \\ 1.3 \end{pmatrix} = 0.69$$

A sparse representation instead penalizes the outputs of the hidden layer, $h_{1,1}, h_{1,2}, h_{1,3}$:

$$\tilde{J}(W; X, y) = J(\theta; X, y) + \alpha \, \Omega(h)$$

This drives most hidden activations to zero, so each input is encoded by only a few active units; in the slides' example the output drops from 4.34 to 1.3 as the representation becomes sparse.
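A sketch of the penalized objective in NumPy; the value of `alpha` and the L1 choice of $\Omega$ are illustrative:

```python
import numpy as np

def sparse_representation_loss(J, h, alpha=1e-3):
    """Sparse-representation objective: J~ = J + alpha * ||h||_1.

    J: unpenalized loss for the batch.
    h: hidden-layer activations, shape (batch, units).
    The L1 term on the *activations* (not the weights) pushes
    most entries of h toward exactly zero.
    """
    return J + alpha * np.abs(h).sum()
```

Keras exposes the same idea through a layer's `activity_regularizer` argument, as opposed to `kernel_regularizer`, which penalizes the weights.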
Bagging

Train several networks on different bootstrap resamples of the training data and average their predictions; the ensemble typically generalizes better than any single network.
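A sketch of the idea in NumPy; the models and their `predict` method are hypothetical placeholders for trained networks:

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Resample the training set with replacement (one bootstrap sample)."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def bagged_predict(models, X):
    """Average the predictions of an ensemble (bagging).

    Each model is assumed to have been trained on its own
    bootstrap resample of the training data.
    """
    return np.mean([m.predict(X) for m in models], axis=0)
```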
Noise Robustness

Random perturbation of network weights:
• Gaussian noise: equivalent to minimizing the loss with an added regularization term
• Encourages a smooth function: a small perturbation of the weights leads to only a small change in the output

Injecting noise in the output labels:
• Better convergence: prevents the pursuit of hard 0/1 probabilities
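One standard way to inject noise in the output labels is label smoothing (the slide does not name a specific scheme); a minimal sketch:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Soften hard 0/1 targets.

    The correct class gets probability 1 - eps + eps/k and every
    other class gets eps/k, so the network is never pushed toward
    infinitely confident ("hard") output probabilities.
    """
    k = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + eps / k

# Toy usage: three-class one-hot labels.
y = np.array([[1.0, 0.0, 0.0]])
y_soft = smooth_labels(y)   # -> [[0.9333, 0.0333, 0.0333]]
```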
Dropout

Train all sub-networks obtained by removing non-output units from the base network.
Dropout: Stochastic GD

For each new example/mini-batch:
• Randomly sample a binary mask μ independently, where μ_i indicates whether input/hidden node i is included
• Multiply the output of node i by μ_i, and perform a gradient update

Typically, an input node is included with probability 0.8, a hidden node with probability 0.5.
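A sketch of the per-mini-batch masking in NumPy:

```python
import numpy as np

def dropout_forward(h, p_keep=0.5, rng=None):
    """Apply a dropout mask to one layer's activations at training time.

    A fresh binary mask mu is sampled per example/mini-batch;
    each node is kept independently with probability p_keep.
    """
    rng = rng or np.random.default_rng()
    mu = rng.random(h.shape) < p_keep   # binary inclusion mask
    return h * mu                       # dropped nodes output 0
```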
Dropout: Weight Scaling

At prediction time use all units, but scale the weights by the probability of inclusion, so the expected input to each unit matches what it saw during training.
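A sketch of the corresponding prediction-time scaling for one layer:

```python
import numpy as np

def predict_with_weight_scaling(h, W, p_keep=0.5):
    """Inference-time counterpart of dropout.

    All units are active, but the weights are scaled by the keep
    probability, so the expected pre-activation matches training,
    where only a fraction p_keep of the nodes were on.
    """
    return h @ (W * p_keep)
```

Many implementations instead divide the activations by `p_keep` during training ("inverted dropout"), so that no scaling is needed at prediction time.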
Adversarial Examples

[Figure: an image classified as "panda" with 57% confidence, plus a small noise pattern, is classified as "gibbon" with 99.3% confidence.]

Training on adversarial examples is mostly intended to improve security, but can sometimes provide generic regularization.
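A sketch of one standard way to construct such examples, the fast gradient sign method (not named on the slide); `grad_x_J` is a hypothetical stand-in for the gradient of the loss with respect to the input:

```python
import numpy as np

def fgsm_example(x, grad_x_J, eps=0.01):
    """Fast gradient sign method.

    Perturb the input in the direction that most increases the loss,
    with a small step eps per pixel. Training on (x_adv, original
    label) pairs is adversarial training.
    """
    return x + eps * np.sign(grad_x_J)
```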
Recap

Regularization of NN
§ Norm Penalties
§ Early Stopping
§ Data Augmentation
§ Sparse Representation
§ Bagging
§ Dropout