
Learning From Data

Lecture 12

Regularization

Constraining the Model

Weight Decay

Augmented Error

M. Magdon-Ismail, CSCI 4100/6100


recap: Overfitting

Fitting the data more than is warranted

[Figure: y vs. x; the fitted curve chases the noisy data points and strays from the target. Legend: Data, Target, Fit.]


recap: Noise is Part of y We Cannot Model

Stochastic Noise

y = f(x) + stochastic noise

[Figure: data points scattered around the target f(x).]

Deterministic Noise

y = h∗(x) + deterministic noise

[Figure: the target relative to the best hypothesis h∗; the part of f that h∗ cannot capture acts as noise.]

Stochastic and Deterministic Noise Hurt Learning

Human: Good at extracting the simple pattern, ignoring the noise and complications.

Computer: Pays equal attention to all pixels. Needs help simplifying → (features, regularization).


Regularization

What is regularization?

A cure for our tendency to fit (get distracted by) the noise, hence improving Eout.

How does it work?

By constraining the model so that we cannot fit the noise.

(putting on the brakes)

Side effects?

The medication will have side effects – if we cannot fit the noise, maybe we cannot fit f (the signal)?


Constraining the Model: Does it Help?

[Figure: a fit to the data, with no constraint on the weights.]

. . . and the winner is:


Constraining the Model: Does it Help?

[Figure: the same data fit twice; in the second fit the weights are constrained to be smaller.]

. . . and the winner is:


Bias Goes Up A Little

[Figure: the learned fits g(x) versus the target sin(x), without regularization (left) and with regularization (right).]

no regularization: bias = 0.21
regularization:    bias = 0.23 ← side effect

(Constant model had bias=0.5 and var=0.25.)


Variance Drop is Dramatic!

[Figure: the same two fits, now showing the spread of g(x) across datasets around the target sin(x).]

no regularization: bias = 0.21, var = 1.69
regularization:    bias = 0.23 (← side effect), var = 0.33 (← treatment)

(Constant model had bias=0.5 and var=0.25.)
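A minimal numpy sketch of how numbers of this kind can be estimated. The slide does not spell out the experimental setup, so the details below are illustrative assumptions: two-point datasets drawn from the target sin(πx) on [−1, 1], a line h(x) = w0 + w1·x fit with and without weight decay, and λ = 0.1 for the regularized fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_line(x, y, lam=0.0):
    """Fit h(x) = w0 + w1*x; lam > 0 adds weight decay (lam = 0 is the plain least-squares fit)."""
    Z = np.column_stack([np.ones_like(x), x])            # features (1, x)
    return np.linalg.solve(Z.T @ Z + lam * np.eye(2), Z.T @ y)

def bias_var(lam, n_datasets=10_000, n_test=1_000):
    """Estimate bias and var of the learned line over many two-point datasets."""
    x_test = rng.uniform(-1, 1, n_test)
    f_test = np.sin(np.pi * x_test)                      # target values on the test points
    preds = np.empty((n_datasets, n_test))
    for d in range(n_datasets):
        x = rng.uniform(-1, 1, 2)                        # a fresh two-point dataset
        w = fit_line(x, np.sin(np.pi * x), lam)
        preds[d] = w[0] + w[1] * x_test
    g_bar = preds.mean(axis=0)                           # the average hypothesis
    return np.mean((g_bar - f_test) ** 2), np.mean(preds.var(axis=0))   # (bias, var)

print("no regularization:", bias_var(lam=0.0))           # variance blows up without regularization
print("weight decay     :", bias_var(lam=0.1))           # bias up a little, variance down a lot
```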


Regularization in a Nutshell

VC analysis: Eout(g) ≤ Ein(g) + Ω(H)
(If you use a simpler H and get a good fit, then your Eout is better.)

Regularization takes this a step further:

If you use a ‘simpler’ h and get a good fit, then is your Eout better?


Polynomials of Order Q - A Useful Testbed

HQ: polynomials of order Q.

Standard polynomial:

z = (1, x, x^2, . . . , x^Q)

h(x) = w^t z(x) = w0 + w1 x + · · · + wQ x^Q

Legendre polynomial:

z = (1, L1(x), L2(x), . . . , LQ(x))

h(x) = w^t z(x) = w0 + w1 L1(x) + · · · + wQ LQ(x)

(We are using linear regression; the Legendre basis allows us to treat the weights ‘independently’.)

L1(x) = x
L2(x) = (1/2)(3x^2 − 1)
L3(x) = (1/2)(5x^3 − 3x)
L4(x) = (1/8)(35x^4 − 30x^2 + 3)
L5(x) = (1/8)(63x^5 − · · · )
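As an aside, the Legendre feature transform z(x) is available directly in numpy. A minimal sketch (the function name legendre_features and the use of legvander are my choices, not part of the lecture):

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_features(x, Q):
    """Map inputs x in [-1, 1] to z(x) = (1, L1(x), ..., LQ(x))."""
    # legvander returns one row per input with columns [L0(x), L1(x), ..., LQ(x)];
    # L0(x) = 1 supplies the constant feature.
    return legendre.legvander(np.asarray(x, dtype=float), Q)

Z = legendre_features(np.linspace(-1, 1, 5), Q=5)
print(Z.shape)     # (5, 6)
print(Z[:, 2])     # L2(x) = (3x^2 - 1)/2, matching the table above
```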


recap: Linear Regression

(x1, y1), . . . , (xN, yN)   [X, y]   −→   (z1, y1), . . . , (zN, yN)   [Z, y]

min: Ein(w) = (1/N) Σ_{n=1}^{N} (w^t zn − yn)^2 = (1/N)(Zw − y)^t(Zw − y)

wlin = (Z^t Z)^{−1} Z^t y

[Figure: the resulting linear regression fit.]
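A minimal numpy sketch of this fit, assuming the transformed data Z and targets y are already in hand; lstsq is used instead of forming (Z^t Z)^{−1} explicitly, purely for numerical robustness:

```python
import numpy as np

def linear_regression(Z, y):
    """wlin = argmin_w (1/N)||Zw - y||^2, i.e. (Z^t Z)^{-1} Z^t y when Z^t Z is invertible."""
    wlin, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return wlin

def in_sample_error(Z, y, w):
    """Ein(w) = (1/N)(Zw - y)^t(Zw - y)."""
    r = Z @ w - y
    return (r @ r) / len(y)
```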


Constraining The Model: H10 vs. H2

H10 = { h(x) = w0 + w1Φ1(x) + w2Φ2(x) + w3Φ3(x) + · · · + w10Φ10(x) }

H2 = { h(x) = w0 + w1Φ1(x) + w2Φ2(x) + w3Φ3(x) + · · · + w10Φ10(x)
       such that w3 = w4 = · · · = w10 = 0 }

(a ‘hard’ order constraint that sets some weights to zero)

H2 ⊂ H10


Soft Order Constraint

Don’t set weights explicitly to zero (e.g. w3 = 0).

Give a budget and let the learning choose.

Σ_{q=0}^{Q} w_q^2 ≤ C      ← a budget for the weights

[Figure: H2 at one extreme and H10 at the other (C → ∞); the soft order constraint allows ‘intermediate’ models in between.]


Soft Order Constrained Model HC

H10 = { h(x) = w0 + w1Φ1(x) + w2Φ2(x) + w3Φ3(x) + · · · + w10Φ10(x) }

H2 = { h(x) = w0 + w1Φ1(x) + w2Φ2(x) + w3Φ3(x) + · · · + w10Φ10(x)
       such that w3 = w4 = · · · = w10 = 0 }

HC = { h(x) = w0 + w1Φ1(x) + w2Φ2(x) + w3Φ3(x) + · · · + w10Φ10(x)
       such that Σ_{q=0}^{10} w_q^2 ≤ C }

(a ‘soft’ budget constraint on the sum of squared weights)

VC-perspective: HC is smaller than H10 =⇒ better generalization.


Fitting the Data

The optimal weights wreg ∈ HC (‘reg’ for regularized) should minimize the in-sample error, but be within the budget.

wreg is a solution to

min: Ein(w) = (1/N)(Zw − y)^t(Zw − y)
subject to: w^t w ≤ C


Solving For wreg

min: Ein(w) = (1/N)(Zw − y)^t(Zw − y)
subject to: w^t w ≤ C

Observations:

1. The optimal w tries to get as ‘close’ to wlin as possible, so it will use the full budget and lie on the surface w^t w = C.

2. At the optimal w, the surface w^t w = C should be perpendicular to ∇Ein; otherwise we could move along the surface and decrease Ein.

3. The normal to the surface w^t w = C at w is the vector w itself.

4. The surface is ⊥ ∇Ein and the surface is ⊥ its normal, so ∇Ein is parallel to the normal (but in the opposite direction):

∇Ein(wreg) = −2λC wreg

[Figure: level curves Ein = const. around wlin and the constraint surface w^t w = C; at wreg, ∇Ein and the normal w point in opposite directions.]

(λC, the Lagrange multiplier, is positive. The 2 is for mathematical convenience.)


Solving For wreg

Ein(w) is minimized, subject to: w^t w ≤ C

⇔ ∇Ein(wreg) + 2λC wreg = 0

⇔ ∇(Ein(w) + λC w^t w) |_{w=wreg} = 0

⇔ Ein(w) + λC w^t w is minimized, unconditionally

There is a correspondence: C ↑ ↔ λC ↓


The Augmented Error

Pick a C and minimize

Ein(w)   subject to: w^t w ≤ C

↕

Pick a λC and minimize

Eaug(w) = Ein(w) + λC w^t w   unconditionally

(λC w^t w is a penalty for the ‘complexity’ of h, measured by the size of the weights.)

We can pick any budget C. Translation: we are free to pick any multiplier λC.

What’s the right C? ↔ What’s the right λC?
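In code, moving from the constrained problem to the penalty form is just a change to the error being minimized. A minimal sketch (names are illustrative):

```python
import numpy as np

def in_sample_error(w, Z, y):
    """Ein(w) = (1/N)(Zw - y)^t(Zw - y)."""
    r = Z @ w - y
    return (r @ r) / len(y)

def augmented_error(w, Z, y, lam_C):
    """Eaug(w) = Ein(w) + lam_C * w^t w, to be minimized with no constraint on w."""
    return in_sample_error(w, Z, y) + lam_C * (w @ w)
```

Any unconstrained minimizer can now be applied to Eaug; for the linear model there is a closed form, coming up shortly.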


Linear Regression With Soft Order Constraint

Eaug(w) = (1/N)(Zw − y)^t(Zw − y) + λC w^t w

Convenient to set λC = λ/N:

Eaug(w) = (1/N) [ (Zw − y)^t(Zw − y) + λ w^t w ]

(Called ‘weight decay’, as the penalty encourages smaller weights.)

Unconditionally minimize Eaug(w).


The Solution for wreg

∇Eaug(w) = 2Z^t(Zw − y) + 2λw
         = 2(Z^t Z + λI)w − 2Z^t y

Set ∇Eaug(w) = 0:

wreg = (Z^t Z + λI)^{−1} Z^t y      (λ determines the amount of regularization)

Recall the unconstrained solution (λ = 0):

wlin = (Z^t Z)^{−1} Z^t y
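A minimal numpy sketch of this closed form (setting λ = 0 recovers wlin):

```python
import numpy as np

def weight_decay_fit(Z, y, lam):
    """wreg = (Z^t Z + lam*I)^{-1} Z^t y; lam = 0 gives the unconstrained wlin."""
    d = Z.shape[1]
    # For lam > 0, Z^t Z + lam*I is positive definite, so the system is always solvable,
    # even when Z^t Z itself is singular.
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
```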


A Little Regularization . . .

Minimizing Ein(w) + (λ/N) w^t w with different λ’s:

λ = 0: Overfitting

[Figure: the λ = 0 fit chases the noise. Legend: Data, Target, Fit.]


. . . Goes A Long Way

Minimizing Ein(w) + (λ/N) w^t w with different λ’s:

λ = 0: Overfitting          λ = 0.0001: Wow!

[Figure: the λ = 0 fit chases the noise; the λ = 0.0001 fit stays close to the target. Legend: Data, Target, Fit.]


Don’t Overdose

Minimizing Ein(w) + (λ/N) w^t w with different λ’s:

λ = 0          λ = 0.0001          λ = 0.01          λ = 1

[Figure: four fits to the same data, one per λ. Legend: Data, Target, Fit.]

Overfitting → → Underfitting
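A minimal, self-contained sketch of this kind of λ sweep. The target, noise level, sample size, and model order below are illustrative assumptions, not the slide's exact setup:

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(1)
target = lambda x: np.sin(np.pi * x)              # assumed target
N, Q, sigma = 15, 10, 0.2                         # assumed sample size, model order, noise level

x = rng.uniform(-1, 1, N)
y = target(x) + sigma * rng.normal(size=N)
Z = legendre.legvander(x, Q)                      # Legendre features (1, L1(x), ..., LQ(x))

x_test = np.linspace(-1, 1, 1000)
Z_test = legendre.legvander(x_test, Q)

for lam in [0.0, 1e-4, 1e-2, 1.0]:                # the lambdas shown above
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(Q + 1), Z.T @ y)   # weight-decay solution
    e_out = np.mean((Z_test @ w - target(x_test)) ** 2)
    print(f"lambda = {lam:g}: E_out ~ {e_out:.3f}")
```

With a setup like this, the test error typically traces the same overfitting-to-underfitting pattern as λ grows.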


Overfitting and Underfitting

[Figure: expected Eout versus the regularization parameter λ; too little regularization overfits, too much underfits, and the minimum lies in between.]


More Noise Needs More Medicine

[Figure: expected Eout versus the regularization parameter λ for stochastic noise levels σ^2 = 0, 0.25, 0.5; the more noise, the more regularization (larger λ) is needed.]


. . . Even For Deterministic Noise

[Figures: expected Eout versus the regularization parameter λ. Left: stochastic noise σ^2 = 0, 0.25, 0.5 (as on the previous slide). Right: deterministic noise, via target complexity Qf = 15, 30, 100; again, more noise calls for more regularization.]


Variations on Weight Decay

Uniform Weight Decay:  Σ_{q=0}^{Q} w_q^2
Low Order Fit:         Σ_{q=0}^{Q} q · w_q^2
Weight Growth(!):      Σ_{q=0}^{Q} 1/w_q^2

[Figures: expected Eout versus the regularization parameter λ for each penalty. The first two show the usual overfitting/underfitting trade-off; the third compares weight growth against weight decay.]
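For the two decay-style penalties, only the diagonal matrix in the closed-form solution changes. A minimal sketch (tikhonov_fit and the gamma vectors are my names; the weight-growth penalty Σ 1/w_q^2 is not quadratic in w, so it has no closed form of this kind):

```python
import numpy as np

def tikhonov_fit(Z, y, lam, gamma):
    """Minimize (1/N)[(Zw - y)^t(Zw - y) + lam * sum_q gamma_q * w_q^2]."""
    return np.linalg.solve(Z.T @ Z + lam * np.diag(gamma), Z.T @ y)

# usage, with Q+1 features:
#   uniform weight decay: tikhonov_fit(Z, y, lam, np.ones(Q + 1))
#   low order fit:        tikhonov_fit(Z, y, lam, np.arange(Q + 1))
```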


Choosing a Regularizer – A Practitioner’s Guide

The perfect regularizer: constrain in the ‘direction’ of the target function.
But the target function is unknown (we are going around in circles).

The guiding principle: constrain in the ‘direction’ of smoother (usually simpler) hypotheses.
Smoothness hurts your ability to fit the ‘high frequency’ noise.

Smoother and simpler usually means weight decay, not weight growth.

What if you choose the wrong regularizer? You still have λ to play with: validation.
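A minimal sketch of choosing λ with a held-out validation set (the split fraction and the λ grid are illustrative assumptions):

```python
import numpy as np

def choose_lambda(Z, y, lambdas, val_fraction=0.25, seed=0):
    """Return the lambda whose weight-decay fit has the smallest validation error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_fraction * len(y))
    val, train = idx[:n_val], idx[n_val:]
    d = Z.shape[1]
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        w = np.linalg.solve(Z[train].T @ Z[train] + lam * np.eye(d), Z[train].T @ y[train])
        err = np.mean((Z[val] @ w - y[val]) ** 2)     # validation estimate of Eout
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

# usage: lam_star = choose_lambda(Z, y, lambdas=[0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0])
```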


How Does Regularization Work?

Stochastic noise −→ nothing you can do about that.

Good features −→ help to reduce deterministic noise.

Regularization:

Helps to combat what noise remains, especially when N is small.

Typical modus operandi: sacrifice a little bias for a huge improvement in var.

VC angle: you are using a smaller H without sacrificing too much Ein.


Augmented Error as a Proxy for Eout

Eaug(h) = Ein(h) + (λ/N) Ω(h)        (here Ω(h) was w^t w)

↕

Eout(h) ≤ Ein(h) + Ω(H)              (here Ω(H) was O(√((dvc/N) ln N)))

Eaug can beat Ein as a proxy for Eout (this depends on the choice of λ).

© AML Creator: Malik Magdon-Ismail