
CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader

Lecture 19: NN Regularization


Outline

Regularization of NN
§ Norm Penalties
§ Early Stopping
§ Data Augmentation
§ Sparse Representation
§ Bagging
§ Dropout



Regularization


Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.


Overfitting


Fitting a deep neural network with 5 layers and 100 neurons per layer can give very good predictions on the training set but poor predictions on the validation set.
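A minimal sketch of such a network, assuming a Keras workflow (the input dimension, activation, and compile settings are illustrative assumptions, not from the lecture):

from tensorflow.keras import layers, models

# 5 hidden layers of 100 neurons each: large capacity relative to a small
# training set, so overfitting is likely without regularization.
model = models.Sequential(
    [layers.Input(shape=(10,))] +
    [layers.Dense(100, activation='relu') for _ in range(5)] +
    [layers.Dense(1)]
)
model.compile(optimizer='adam', loss='mse')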


Norm Penalties

We used to optimize
$$J(W; X, y)$$

Change to
$$\tilde{J}(W; X, y) = J(W; X, y) + \alpha\,\Omega(W)$$

Biases are not penalized.

L2 regularization, $\Omega(W) = \frac{1}{2}\|W\|_2^2$:
– weight decay
– MAP estimation with a Gaussian prior

L1 regularization, $\Omega(W) = \|W\|_1$:
– encourages sparsity
– MAP estimation with a Laplacian prior


Norm Penalties

The plain gradient descent update is

$$W^{(t+1)} = W^{(t)} - \lambda \frac{\partial J}{\partial W}$$

With the L2 penalty, $\tilde{J}(W; X, y) = J(W; X, y) + \frac{1}{2}\alpha \|W\|_2^2$, the update becomes

$$W^{(t+1)} = W^{(t)} - \lambda \frac{\partial J}{\partial W} - \lambda\alpha W^{(t)}$$

so each weight decays in proportion to its size.
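A minimal sketch of adding such an L2 penalty in practice, assuming a Keras model (the architecture, input dimension, and value of alpha are illustrative assumptions):

from tensorflow.keras import layers, models, regularizers

alpha = 1e-3  # penalty strength, playing the role of alpha in J~(W) = J(W) + alpha * Omega(W)

model = models.Sequential([
    layers.Input(shape=(10,)),
    # kernel_regularizer penalizes only the weights; biases are not penalized
    layers.Dense(100, activation='relu', kernel_regularizer=regularizers.l2(alpha)),
    layers.Dense(100, activation='relu', kernel_regularizer=regularizers.l2(alpha)),
    layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')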



Norm Penalties as Constraints

Instead of adding a penalty, constrain the norm of the weights:

$$\min_{W \,:\, \Omega(W) \le K} J(W; X, y)$$

Useful if K is known in advance.

Optimization:

• Construct the Lagrangian and apply gradient descent
• Projected gradient descent
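A minimal sketch of projected gradient descent for an L2-norm ball constraint $\|W\|_2 \le K$ (the loss, learning rate, and example values are illustrative assumptions):

import numpy as np

def project_l2_ball(W, K):
    """Project W onto the feasible set {W : ||W||_2 <= K}."""
    norm = np.linalg.norm(W)
    return W if norm <= K else W * (K / norm)

def projected_gradient_descent(grad_J, W0, K, lr=0.1, steps=200):
    W = W0.copy()
    for _ in range(steps):
        W = W - lr * grad_J(W)      # ordinary gradient step on J
        W = project_l2_ball(W, K)   # project back onto the constraint set
    return W

# Toy example: minimize ||W - target||^2 subject to ||W||_2 <= 1
target = np.array([2.0, 0.5])
W_opt = projected_gradient_descent(lambda W: 2.0 * (W - target), np.zeros(2), K=1.0)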



Early Stopping


Training time can be treated as a hyperparameter.

Early stopping: stop training when performance on the validation set stops improving, i.e., keep the weights from the epoch with the best validation performance.
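A minimal sketch with a Keras callback, assuming a compiled model and a held-out validation set (the patience value and variable names are illustrative assumptions):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss',          # track validation loss
                           patience=10,                  # stop after 10 epochs with no improvement
                           restore_best_weights=True)    # roll back to the best epoch

# model.fit(X_train, y_train,
#           validation_data=(X_val, y_val),
#           epochs=500,
#           callbacks=[early_stop])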




Data Augmentation
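A minimal sketch for images, assuming a Keras pipeline (the transformations and parameters are illustrative assumptions): additional training examples are generated on the fly by applying label-preserving transformations such as shifts, rotations, and flips.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=15,       # random rotations up to 15 degrees
                               width_shift_range=0.1,   # random horizontal shifts
                               height_shift_range=0.1,  # random vertical shifts
                               horizontal_flip=True)    # random left-right flips

# model.fit(augmenter.flow(X_train, y_train, batch_size=32), epochs=50)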



Sparse Representation

Without regularization we minimize $J(\theta; X, y)$. In a small network whose hidden units $h_{11}, h_{12}, h_{13}$ feed the output $Y$ through weights $W^{(out)}$, the fitted output layer looks like

$$4.34 \;=\; \underbrace{\begin{bmatrix} 3.2 & 2.0 & 1.8 \end{bmatrix}}_{W^{(out)}} \underbrace{\begin{bmatrix} 2 \\ -2.2 \\ 1.3 \end{bmatrix}}_{h}$$

Neither the output-layer weights nor the hidden activations are sparse.

Adding a norm penalty on the weights, $\tilde{J}(W; X, y) = J(\theta; X, y) + \alpha\,\Omega(W)$, shrinks the weights in the output layer toward zero while the hidden activations are unchanged:

$$0.69 \;=\; \underbrace{\begin{bmatrix} 0.5 & 0.2 & 0.1 \end{bmatrix}}_{W^{(out)}} \underbrace{\begin{bmatrix} 2 \\ -2.2 \\ 1.3 \end{bmatrix}}_{h}$$

A sparse representation instead targets the activations of the hidden layer, $h_{11}, h_{12}, h_{13}$. Without a penalty on $h$ they are dense:

$$4.34 \;=\; \begin{bmatrix} 3.2 & 2.0 & 1.8 \end{bmatrix} \underbrace{\begin{bmatrix} 2 \\ -2.2 \\ 1.3 \end{bmatrix}}_{h}$$

Penalizing the output of the hidden layer, $\tilde{J}(W; X, y) = J(\theta; X, y) + \alpha\,\Omega(h)$, drives the activations themselves toward sparsity while the weights remain dense:

$$1.3 \;\approx\; \begin{bmatrix} 3.2 & 2.0 & 1.8 \end{bmatrix} \underbrace{\begin{bmatrix} 0 \\ -0.2 \\ 0.9 \end{bmatrix}}_{h}$$
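A minimal sketch of encouraging a sparse representation in practice, assuming a Keras model (the architecture and penalty strength are illustrative assumptions): an L1 activity regularizer penalizes the hidden activations $h$, whereas a kernel regularizer would penalize the weights $W$.

from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Input(shape=(10,)),
    # Omega(h): L1 penalty on this layer's *activations*, pushing them toward zero
    layers.Dense(100, activation='relu', activity_regularizer=regularizers.l1(1e-3)),
    layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')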

Bagging

Train several networks on bootstrap resamples of the training data and average their predictions; the ensemble typically generalizes better than any single network.
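A minimal sketch of bagging for neural networks (the network, data names, and hyperparameters are illustrative assumptions): each network is fit on a bootstrap resample of the training data and the predictions are averaged.

import numpy as np
from tensorflow.keras import layers, models

def make_net(input_dim):
    net = models.Sequential([layers.Input(shape=(input_dim,)),
                             layers.Dense(50, activation='relu'),
                             layers.Dense(1)])
    net.compile(optimizer='adam', loss='mse')
    return net

def bagged_predict(X_train, y_train, X_test, n_models=5, epochs=20):
    preds = []
    n = len(X_train)
    for _ in range(n_models):
        idx = np.random.choice(n, size=n, replace=True)   # bootstrap resample
        net = make_net(X_train.shape[1])
        net.fit(X_train[idx], y_train[idx], epochs=epochs, verbose=0)
        preds.append(net.predict(X_test, verbose=0))
    return np.mean(preds, axis=0)                          # average over the ensemble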



Noise Robustness

Random perturbation of the network weights

• Gaussian noise: equivalent to minimizing the loss with an added regularization term
• Encourages a smooth function: small perturbations of the weights lead to small changes in the output

Injecting noise into the output labels

• Better convergence: prevents the model from chasing hard 0/1 probabilities
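A minimal sketch of injecting noise into the output labels via label smoothing, assuming a Keras classification loss (the smoothing value is an illustrative assumption):

from tensorflow.keras.losses import CategoricalCrossentropy

# One-hot targets are mixed toward the uniform distribution, so the network
# is never pushed to output hard 0/1 probabilities.
loss = CategoricalCrossentropy(label_smoothing=0.1)

# model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])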


Dropout


Train all sub-networks that can be obtained by removing non-output units from the base network.


Dropout: Stochastic GD

For each new example/mini-batch:

• Randomly sample a binary mask $\mu$ independently, where $\mu_i$ indicates whether input/hidden node $i$ is included
• Multiply the output of node $i$ by $\mu_i$, and perform the gradient update

Typically, an input node is included with probability 0.8 and a hidden node with probability 0.5.
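A minimal sketch of the mask step described above (illustrative only; in practice one would use a built-in layer such as tf.keras.layers.Dropout):

import numpy as np

def dropout_forward(h, keep_prob=0.5, training=True):
    """Multiply each activation by an independent Bernoulli(keep_prob) mask."""
    if not training:
        return h * keep_prob                      # weight scaling at prediction time (see the next slide)
    mask = (np.random.rand(*h.shape) < keep_prob).astype(h.dtype)
    return h * mask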


Dropout: Weight Scaling

At prediction time use all units, but scale each weight by its probability of inclusion.



Adversarial Examples

Example: an image classified as a panda with 57% confidence, after adding an imperceptible noise pattern, is classified as a gibbon with 99.3% confidence.

Training on adversarial examples is mostly intended to improve security, but it can sometimes provide generic regularization.
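One common way to generate such adversarial examples is the fast gradient sign method (FGSM); a minimal sketch, assuming a differentiable TensorFlow/Keras model and loss (function and parameter names are illustrative assumptions):

import tensorflow as tf

def fgsm_example(model, x, y_true, loss_fn, epsilon=0.01):
    """Return x + epsilon * sign(grad_x loss): a small perturbation that increases the loss."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y_true, model(x))
    grad = tape.gradient(loss, x)
    return x + epsilon * tf.sign(grad)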


Recap

Regularization of NN
§ Norm Penalties
§ Early Stopping
§ Data Augmentation
§ Sparse Representation
§ Bagging
§ Dropout
