CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader

Lecture 19: NN Regularization
Outline

Regularization of NN
§ Norm Penalties
§ Early Stopping
§ Data Augmentation
§ Sparse Representation
§ Bagging
§ Dropout
Regularization

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
Overfitting

Fitting a deep neural network with 5 layers and 100 neurons per layer can yield very good predictions on the training set but poor predictions on the validation set.
Norm Penalties

We used to optimize:

$$J(W; X, y)$$

Change to:

$$\tilde{J}(W; X, y) = J(W; X, y) + \alpha \, \Omega(W)$$

(Biases are not penalized.)

L2 regularization, $\Omega(W) = \tfrac{1}{2}\|W\|_2^2$:
– weight decay
– MAP estimation with a Gaussian prior

L1 regularization, $\Omega(W) = \|W\|_1$:
– encourages sparsity
– MAP estimation with a Laplacian prior

Why "weight decay"? Plain gradient descent updates the weights as

$$W^{(t+1)} = W^{(t)} - \lambda \frac{\partial J}{\partial W}$$

With the L2 penalty, $\tilde{J}(W; X, y) = J(W; X, y) + \tfrac{1}{2}\alpha \|W\|_2^2$, the update gains an extra term:

$$W^{(t+1)} = W^{(t)} - \lambda \frac{\partial J}{\partial W} - \lambda \alpha W^{(t)}$$

so each weight decays in proportion to its size.
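A minimal NumPy sketch of this update rule; `grad_J` is a hypothetical stand-in for the gradient of the unpenalized loss obtained from backpropagation:

```python
import numpy as np

def sgd_step_weight_decay(W, grad_J, lr=0.01, alpha=1e-4):
    """One gradient step on the L2-penalized loss J~ = J + (alpha/2)||W||^2.

    grad_J: gradient of the *unpenalized* loss J at W.
    The extra -lr * alpha * W term shrinks each weight in
    proportion to its size ("weight decay").
    """
    return W - lr * grad_J - lr * alpha * W

# Toy usage with made-up numbers:
W = np.array([1.0, -2.0, 0.5])
grad = np.array([0.1, 0.0, -0.2])
W_new = sgd_step_weight_decay(W, grad)
```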
Norm Penalties as Constraints

Instead of adding a penalty to the loss, we can constrain the norm of the weights directly:

$$\min_{W \,:\, \Omega(W) \le K} \; J(W; X, y)$$

Useful if K is known in advance.

Optimization:
• Construct the Lagrangian and apply gradient descent
• Projected gradient descent
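A sketch of projected gradient descent for the L2-ball constraint $\|W\|_2 \le K$ (NumPy; `grad_J` is again a hypothetical stand-in for the backprop gradient):

```python
import numpy as np

def projected_gd_step(W, grad_J, lr=0.01, K=1.0):
    """Gradient step followed by projection onto the ball ||W||_2 <= K."""
    W = W - lr * grad_J            # ordinary (unconstrained) gradient step
    norm = np.linalg.norm(W)
    if norm > K:                   # step left the feasible set:
        W = W * (K / norm)         # rescale back onto {W : ||W||_2 <= K}
    return W
```

For the L2 ball the projection is just a rescaling, which is why this variant is cheap in practice.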
Early Stopping

Training time can be treated as a hyperparameter.

Early stopping: track validation-set performance during training, stop once it no longer improves, and keep the parameters from the epoch with the best validation performance.
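A framework-agnostic sketch of the early-stopping loop; `train_step` and `val_loss` are hypothetical stand-ins for your own epoch-training and validation code:

```python
import copy

def train_with_early_stopping(model, train_step, val_loss,
                              max_epochs=1000, patience=10):
    """Stop when validation loss hasn't improved for `patience` epochs.

    train_step(model): runs one epoch of training in place (hypothetical).
    val_loss(model):   returns the current validation loss (hypothetical).
    """
    best_loss, best_model, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)
        loss = val_loss(model)
        if loss < best_loss:            # validation improved: remember weights
            best_loss = loss
            best_model = copy.deepcopy(model)
            bad_epochs = 0
        else:                           # no improvement this epoch
            bad_epochs += 1
            if bad_epochs >= patience:  # give up: validation stopped improving
                break
    return best_model                   # parameters from the best epoch
```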
Data Augmentation

Enlarge the training set by applying label-preserving transformations (e.g., flips, shifts, rotations, added noise) to the existing examples.
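A minimal NumPy sketch of such transformations; the flip/shift choices and ranges here are illustrative, not from the slides:

```python
import numpy as np

def augment(image, rng):
    """Random label-preserving transforms for one image of shape (H, W, C).

    A minimal sketch: horizontal flips and small pixel shifts; real
    pipelines add rotations, crops, color jitter, noise, etc.
    """
    if rng.random() < 0.5:                    # random horizontal flip
        image = image[:, ::-1, :]
    dy, dx = rng.integers(-4, 5, size=2)      # shift by up to 4 pixels
    image = np.roll(image, (dy, dx), axis=(0, 1))
    return image

rng = np.random.default_rng(0)
batch = rng.random((8, 32, 32, 3))            # fake image batch
augmented = np.stack([augment(img, rng) for img in batch])
```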
Sparse Representation

Consider a small network with hidden-layer weights $W_1, \dots, W_6$ and an output weight vector feeding the output $Y$, trained on the loss $J(\theta; X, y)$. In the slides' example, the weights in the output layer multiply the hidden-layer outputs to give

$$Y = \begin{pmatrix} 3.2 & 2.0 & 1.8 \end{pmatrix} \begin{pmatrix} 2 \\ -2.2 \\ 1.3 \end{pmatrix} = 4.34$$

Penalizing the weights,

$$\tilde{J}(W; X, y) = J(\theta; X, y) + \alpha \, \Omega(W),$$

shrinks the weights in the output layer:

$$Y = \begin{pmatrix} 0.5 & 0.2 & 0.1 \end{pmatrix} \begin{pmatrix} 2 \\ -2.2 \\ 1.3 \end{pmatrix} = 0.69$$

A sparse representation instead penalizes the outputs of the hidden layer, $h_{1,1}, h_{1,2}, h_{1,3}$:

$$\tilde{J}(W; X, y) = J(\theta; X, y) + \alpha \, \Omega(h)$$

This drives most hidden activations to zero, so each input is encoded by only a few active units; in the slides' example the output drops from 4.34 to 1.3 as the representation becomes sparse.
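A sketch of the penalized objective in NumPy; the value of `alpha` and the L1 choice of $\Omega$ are illustrative:

```python
import numpy as np

def sparse_representation_loss(J, h, alpha=1e-3):
    """Sparse-representation objective: J~ = J + alpha * ||h||_1.

    J: unpenalized loss for the batch.
    h: hidden-layer activations, shape (batch, units).
    The L1 term on the *activations* (not the weights) pushes
    most entries of h toward exactly zero.
    """
    return J + alpha * np.abs(h).sum()
```

Keras exposes the same idea through a layer's `activity_regularizer` argument, as opposed to `kernel_regularizer`, which penalizes the weights.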
Bagging

Train several networks on different bootstrap resamples of the training data and average their predictions; the ensemble typically generalizes better than any single network.
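A sketch of the idea in NumPy; the models and their `predict` method are hypothetical placeholders for trained networks:

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Resample the training set with replacement (one bootstrap sample)."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def bagged_predict(models, X):
    """Average the predictions of an ensemble (bagging).

    Each model is assumed to have been trained on its own
    bootstrap resample of the training data.
    """
    return np.mean([m.predict(X) for m in models], axis=0)
```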
Noise Robustness

Random perturbation of network weights:
• Gaussian noise: equivalent to minimizing the loss with an added regularization term
• Encourages a smooth function: a small perturbation of the weights leads to only a small change in the output

Injecting noise in the output labels:
• Better convergence: prevents the pursuit of hard 0/1 probabilities
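One standard way to inject noise in the output labels is label smoothing (the slide does not name a specific scheme); a minimal sketch:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Soften hard 0/1 targets.

    The correct class gets probability 1 - eps + eps/k and every
    other class gets eps/k, so the network is never pushed toward
    infinitely confident ("hard") output probabilities.
    """
    k = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + eps / k

# Toy usage: three-class one-hot labels.
y = np.array([[1.0, 0.0, 0.0]])
y_soft = smooth_labels(y)   # -> [[0.9333, 0.0333, 0.0333]]
```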
Dropout

Train all sub-networks obtained by removing non-output units from the base network.
Dropout: Stochastic GD

For each new example/mini-batch:
• Randomly sample a binary mask μ independently, where μ_i indicates whether input/hidden node i is included
• Multiply the output of node i by μ_i, and perform a gradient update

Typically, an input node is included with probability 0.8, a hidden node with probability 0.5.
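A sketch of the per-mini-batch masking in NumPy:

```python
import numpy as np

def dropout_forward(h, p_keep=0.5, rng=None):
    """Apply a dropout mask to one layer's activations at training time.

    A fresh binary mask mu is sampled per example/mini-batch;
    each node is kept independently with probability p_keep.
    """
    rng = rng or np.random.default_rng()
    mu = rng.random(h.shape) < p_keep   # binary inclusion mask
    return h * mu                       # dropped nodes output 0
```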
Dropout: Weight Scaling

At prediction time use all units, but scale the weights by the probability of inclusion, so the expected input to each unit matches what it saw during training.
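A sketch of the corresponding prediction-time scaling for one layer:

```python
import numpy as np

def predict_with_weight_scaling(h, W, p_keep=0.5):
    """Inference-time counterpart of dropout.

    All units are active, but the weights are scaled by the keep
    probability, so the expected pre-activation matches training,
    where only a fraction p_keep of the nodes were on.
    """
    return h @ (W * p_keep)
```

Many implementations instead divide the activations by `p_keep` during training ("inverted dropout"), so that no scaling is needed at prediction time.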
Adversarial Examples

[Figure: an image classified as "panda" with 57% confidence, plus a small noise pattern, is classified as "gibbon" with 99.3% confidence.]

Training on adversarial examples is mostly intended to improve security, but can sometimes provide generic regularization.
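A sketch of one standard way to construct such examples, the fast gradient sign method (not named on the slide); `grad_x_J` is a hypothetical stand-in for the gradient of the loss with respect to the input:

```python
import numpy as np

def fgsm_example(x, grad_x_J, eps=0.01):
    """Fast gradient sign method.

    Perturb the input in the direction that most increases the loss,
    with a small step eps per pixel. Training on (x_adv, original
    label) pairs is adversarial training.
    """
    return x + eps * np.sign(grad_x_J)
```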
Recap

Regularization of NN
§ Norm Penalties
§ Early Stopping
§ Data Augmentation
§ Sparse Representation
§ Bagging
§ Dropout