Optimization and (Under/over)fitting
EECS 442 – Prof. David Fouhey
Winter 2019, University of Michigan
http://web.eecs.umich.edu/~fouhey/teaching/EECS442_W19/


Page 1

Optimization and (Under/over)fitting
EECS 442 – Prof. David Fouhey

Winter 2019, University of Michigan
http://web.eecs.umich.edu/~fouhey/teaching/EECS442_W19/

Page 2

Administrivia

• We’re grading HW2 and will try to get it back ASAP
• Midterm next week
• Review sessions next week (go to whichever you want)
• We will post priorities from the slides
• There’s a post on Piazza looking for topics you want reviewed

Page 3

Previously

Page 4

Regularized Least Squares

Add a regularization term to the objective that prefers some solutions:

Before (Loss): $\arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$

After (Loss + Regularization): $\arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda \|\mathbf{w}\|_2^2$

λ sets the trade-off between the two terms.

Want the model “smaller”: pay a penalty for w with a big norm.

Intuitive objective: an accurate model (low loss) but not too complex (low regularization). λ controls how much of each.
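As a concrete illustration (not from the slides), here is a minimal NumPy sketch that just evaluates this regularized objective for a given w; the data X, y and the values of λ are made up for the example.

import numpy as np

def ridge_loss(w, X, y, lam):
    """Regularized least squares: ||y - Xw||^2 + lam * ||w||^2."""
    residual = y - X @ w
    return residual @ residual + lam * (w @ w)

# Made-up data: 5 points, 2 features.
X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.0], [0.5, 2.0], [1.5, 1.5]])
y = np.array([1.0, 2.0, 2.5, 1.5, 2.2])
w = np.array([0.7, 0.3])

print(ridge_loss(w, X, y, lam=0.0))   # plain least-squares loss
print(ridge_loss(w, X, y, lam=1.0))   # same fit, plus a penalty for large ||w||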

Page 5

Nearest Neighbors

[Figure: known images x_1, …, x_N with labels (Cat, Dog, …); a test image x_T; distances D(x_1, x_T), …, D(x_N, x_T).]

(1) Compute the distance between feature vectors, (2) find the nearest known image, (3) use its label: Cat!
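A minimal 1-nearest-neighbor sketch (illustrative, not course code): Euclidean distance is assumed, and made-up feature vectors stand in for the images.

import numpy as np

def nearest_neighbor(known_x, known_labels, x_test):
    """Return the label of the known example closest to x_test (Euclidean distance)."""
    dists = np.linalg.norm(known_x - x_test, axis=1)   # D(x_i, x_T) for every i
    return known_labels[np.argmin(dists)]

known_x = np.array([[1.0, 2.0], [8.0, 9.0], [1.5, 1.8]])   # stand-ins for image features
known_labels = ["cat", "dog", "cat"]
print(nearest_neighbor(known_x, known_labels, np.array([1.2, 2.1])))   # -> cat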

Page 6

Picking Parameters

What distance? What value for k / λ?

Split the data into Training | Validation | Test.

Training: use these data points for lookup.
Validation: evaluate on these points for different k, λ, distances.
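A sketch of that selection loop (illustrative only: the splits are random made-up data, the distance is Euclidean, and only k is searched over):

import numpy as np

def knn_predict(train_x, train_y, x, k):
    """Label x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_x - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(train_y[nearest]).argmax()

# Made-up split: train_* for lookup, val_* for choosing k (keep test data untouched).
rng = np.random.default_rng(0)
train_x, val_x = rng.normal(size=(20, 2)), rng.normal(size=(10, 2))
train_y, val_y = rng.integers(0, 2, 20), rng.integers(0, 2, 10)

best_k, best_acc = None, -1.0
for k in [1, 3, 5]:
    preds = np.array([knn_predict(train_x, train_y, x, k) for x in val_x])
    acc = np.mean(preds == val_y)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)   # pick k on validation; report final numbers on the test split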

Page 7

Linear Models

Example Setup: 3 classes

Model – one weight vector per class: $\mathbf{w}_0, \mathbf{w}_1, \mathbf{w}_2$

Want:
$\mathbf{w}_0^T\mathbf{x}$ big if cat
$\mathbf{w}_1^T\mathbf{x}$ big if dog
$\mathbf{w}_2^T\mathbf{x}$ big if hippo

Stack together into $\mathbf{W} \in \mathbb{R}^{3 \times F}$, where $\mathbf{x} \in \mathbb{R}^F$.

Page 8

Linear Models

[Example (diagram by Karpathy, Fei-Fei): a weight matrix W whose rows are the cat, dog, and hippo weight vectors

0.2  -0.5   0.1   2.0  |  1.1
1.5   1.3   2.1   0.0  |  3.2
0.0   0.3   0.2  -0.3  | -1.2

applied to an image's feature vector x_i = [56, 231, 24, 2, 1] (the trailing 1 pairs with the last column, the per-class bias), giving the scores Wx_i = [-96.8, 437.9, 61.95] for cat, dog, and hippo.]

The weight matrix is a collection of scoring functions, one per class. The prediction is a vector whose jth component is the "score" for the jth class.

Page 9

Objective 1: Multiclass SVM

Training (x_i, y_i):

$\arg\min_{W} \; \lambda\|W\|_2^2 + \sum_{i=1}^{n} \sum_{j \neq y_i} \max\big(0,\, (W\mathbf{x}_i)_j - (W\mathbf{x}_i)_{y_i} + m\big)$

The first term is regularization. The double sum runs over all data points and, for each one, over every class j that is NOT the correct one (y_i). Pay no penalty if the prediction for class y_i is bigger than that for j by m (the "margin"); otherwise, pay proportional to the score of the wrong class.

Inference (x): $\arg\max_k (W\mathbf{x})_k$ (take the class whose weight vector gives the highest score)
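A small NumPy sketch of this hinge loss for one training example (illustrative; the scores, the margin m = 1, and the label are made up):

import numpy as np

def multiclass_svm_loss(scores, y, m=1.0):
    """Sum over wrong classes j of max(0, scores[j] - scores[y] + m)."""
    margins = np.maximum(0.0, scores - scores[y] + m)
    margins[y] = 0.0                          # j != y_i: the correct class pays nothing
    return margins.sum()

scores = np.array([-0.9, 0.4, 0.6])           # (Wx)_k for cat, dog, hippo
print(multiclass_svm_loss(scores, y=2))       # correct class = hippo -> 0.8
print(np.argmax(scores))                      # inference: class with the highest score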

Page 10

Objective 1 is called the Support Vector Machine. There is lots of great theory as to why this is a sensible thing to do; see this useful (and free!) book:

The Elements of Statistical Learning

Hastie, Tibshirani, Friedman

https://web.stanford.edu/~hastie/ElemStatLearn/

Page 11

Objective 2: Making Probabilities

Converting scores to a "probability distribution":

[Worked example: scores (cat, dog, hippo) = (-0.9, 0.4, 0.6); apply exp(x) to get (e^{-0.9}, e^{0.4}, e^{0.6}) = (0.41, 1.49, 1.82), which sum to 3.72; normalize to get (0.11, 0.40, 0.49) = (P(cat), P(dog), P(hippo)).]

Generally: $P(\text{class } j) = \frac{\exp((W\mathbf{x})_j)}{\sum_k \exp((W\mathbf{x})_k)}$

This is called the softmax function.

Inspired by: Karpathy, Fei-Fei
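A minimal sketch of that conversion; it reproduces the numbers above (subtracting the max before exponentiating is a standard numerical-stability trick, not something the slide relies on):

import numpy as np

def softmax(scores):
    """exp(score_j) / sum_k exp(score_k), shifted by the max for numerical stability."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

print(softmax(np.array([-0.9, 0.4, 0.6])))    # ≈ [0.11, 0.40, 0.49]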

Page 12

Objective 2: Softmax

Inference (x): $\arg\max_k (W\mathbf{x})_k$ (take the class whose weight vector gives the highest score)

$P(\text{class } j) = \frac{\exp((W\mathbf{x})_j)}{\sum_k \exp((W\mathbf{x})_k)}$

Why can we skip the exp / sum-of-exp step to make a decision? Because exp is monotonic and the denominator is the same for every class, the most probable class is also the class with the highest score.

Page 13

Objective 2: Softmax

Inference (x): $\arg\max_k (W\mathbf{x})_k$ (take the class whose weight vector gives the highest score)

Training (x_i, y_i):

$\arg\min_{W} \; \lambda\|W\|_2^2 + \sum_{i=1}^{n} -\log\frac{\exp((W\mathbf{x}_i)_{y_i})}{\sum_k \exp((W\mathbf{x}_i)_k)}$

The first term is regularization; the sum runs over all data points; the fraction is P(correct class). Pay a penalty for not making the correct class likely: the "negative log-likelihood".
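A tiny sketch of the per-example penalty (illustrative, with the same made-up scores as above):

import numpy as np

def softmax_nll(scores, y):
    """Negative log-likelihood of the correct class y under softmax(scores)."""
    e = np.exp(scores - np.max(scores))
    p = e / e.sum()
    return -np.log(p[y])

print(softmax_nll(np.array([-0.9, 0.4, 0.6]), y=2))   # ≈ 0.71; shrinks as P(hippo) grows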

Page 14

Today

• Optimizing things
• Time permitting: (Under/over)fitting or (Bias/Variance)
  • (Not on the exam)

Page 15

How Do We Optimize Things?

Goal: find the w minimizing some loss function L:

$\arg\min_{\mathbf{w} \in \mathbb{R}^N} L(\mathbf{w})$

Works for lots of different Ls:

$L(W) = \lambda\|W\|_2^2 + \sum_{i=1}^{n} -\log\frac{\exp((W\mathbf{x}_i)_{y_i})}{\sum_k \exp((W\mathbf{x}_i)_k)}$

$L(\mathbf{w}) = \lambda\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} (y_i - \mathbf{w}^T\mathbf{x}_i)^2$

$L(\mathbf{w}) = C\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$

Page 16

Sample Function to Optimize

f(x, y) = (x + 2y - 7)^2 + (2x + y - 5)^2   [contour plot with the global minimum marked]

Page 17

Sample Function to Optimize

• I’ll switch back and forth between this 2D function (called the Booth Function) and other more-learning-focused functions

• The beauty of optimization is that it’s all the same in principle.

• But don’t draw too many conclusions: 2D space has qualitative differences from 1000D space.

See the intro of: Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization," NIPS 2014. https://ganguli-gang.stanford.edu/pdf/14.SaddlePoint.NIPS.pdf

Page 18

A Caveat

• Each point in the picture is a function evaluation
• Here it takes microseconds – so we can easily see the answer
• Functions we want to optimize may take hours to evaluate

Page 19

A Few Caveats

[Landscape diagram: Karpathy and Fei-Fei; both axes run from -20 to 20.]

Model in your head: moving around a landscape with a teleportation device.

Page 20

Option #1A – Grid Search

# systematically try things
best, bestScore = None, Inf
for dim1Value in dim1Values:
    ….
        for dimNValue in dimNValues:
            w = [dim1Value, …, dimNValue]
            if L(w) < bestScore:
                best, bestScore = w, L(w)
return best
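A runnable version of the same idea (a sketch: the loss is the Booth function from the earlier slides, and itertools.product plays the role of the N nested loops):

import itertools
import numpy as np

def L(w):
    x, y = w
    return (x + 2*y - 7)**2 + (2*x + y - 5)**2     # made-up objective (Booth function)

def grid_search(loss, axes):
    """Try every combination of per-dimension sample values; keep the best."""
    best, best_score = None, float("inf")
    for w in itertools.product(*axes):
        score = loss(w)
        if score < best_score:
            best, best_score = w, score
    return best, best_score

axes = [np.linspace(-10, 10, 21)] * 2      # samplesPerDim ** numberOfDims evaluations
print(grid_search(L, axes))                # lands on the true minimum at (1, 3)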

Page 21

Option #1A – Grid Search

Page 22

Option #1A – “Grid Search”

Pros:
1. Super simple
2. Only requires being able to evaluate the model

Cons:
1. Scales horribly to high-dimensional spaces

Complexity: samplesPerDim^numberOfDims

Page 23

Option #1B – Random Search

# Do random stuff, RANSAC style
best, bestScore = None, Inf
for iter in range(numIters):
    w = random(N,1)      # sample
    score = L(w)         # evaluate
    if score < bestScore:
        best, bestScore = w, score
return best

Page 24

Option #1B – Random Search

Page 25

Option #1B – Random Search

Pros:
1. Super simple
2. Only requires being able to sample the model and evaluate it

Cons:
1. Slow and stupid – throwing darts at a high-dimensional dart board
2. Might miss something

[Diagram: within each dimension's range [0, 1], the good parameters occupy a fraction ε, so P(all N dimensions correct) = ε^N.]

Page 26

When Do You Use Options 1A/1B?

Use these when

• Number of dimensions small, space bounded

• Objective is impossible to analyze (e.g., test accuracy if we use this distance function)

Random search is arguably more effective; grid search makes it easy to systematically test something (people love certainty)

Page 27

Option 2 – Use The Gradient

Arrows: gradient

Page 28

Option 2 – Use The Gradient

Arrows: gradient direction (scaled to unit length)

Page 29

Option 2 – Use The Gradient

Want: $\arg\min_{\mathbf{w}} L(\mathbf{w})$

$\nabla_{\mathbf{w}} L(\mathbf{w}) = \big[\partial L/\partial w_1, \ldots, \partial L/\partial w_N\big]^T$

What's the geometric interpretation of the gradient?

Which is bigger (for small α): $L(\mathbf{w})$ or $L(\mathbf{w} + \alpha\nabla_{\mathbf{w}} L(\mathbf{w}))$?

Page 30

Option 2 – Use The Gradient

Arrows: gradient direction (scaled to unit length). [Figure marks a point x and the point x + α∇.]

Page 31

Option 2 – Use The Gradient

w = initialize()                   # initialize
for iter in range(numIters):
    g = ∇_w L(w)                   # eval gradient
    w = w + -stepsize(iter)*g      # update w
return w

Method: at each step, move in the direction of the negative gradient.

Page 32

Gradient Descent

Given a starting point (blue):

$w_{i+1} = w_i - 9.8\times 10^{-2} \cdot \text{gradient}$
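A runnable sketch of this exact run on the Booth function (the analytic gradient below is easy to verify by hand; the starting point is made up):

import numpy as np

def booth(w):
    x, y = w
    return (x + 2*y - 7)**2 + (2*x + y - 5)**2

def booth_grad(w):
    x, y = w
    a, b = x + 2*y - 7, 2*x + y - 5
    return np.array([2*a + 4*b, 4*a + 2*b])

w = np.array([-15.0, 15.0])              # made-up starting point
for it in range(100):
    w = w - 9.8e-2 * booth_grad(w)       # w_{i+1} = w_i - stepsize * gradient
print(w, booth(w))                       # approaches the global minimum at (1, 3)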

Page 33

Computing the Gradient

How Do You Compute The Gradient?

Numerical Method:

$\nabla_{\mathbf{w}} L(\mathbf{w}) = \big[\partial L(\mathbf{w})/\partial w_1, \ldots, \partial L(\mathbf{w})/\partial w_n\big]^T$

How do you compute each entry? By definition,

$\frac{\partial f(x)}{\partial x} = \lim_{\epsilon \to 0} \frac{f(x+\epsilon) - f(x)}{\epsilon}$

In practice, use the centered difference:

$\frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$
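A minimal sketch of that centered-difference estimate (ε = 1e-5 is an arbitrary but typical choice; the Booth function is reused as the test function):

import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Estimate the gradient of f at w, one dimension at a time, with centered differences."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)   # two evaluations per dimension
    return g

f = lambda w: (w[0] + 2*w[1] - 7)**2 + (2*w[0] + w[1] - 5)**2
print(numerical_gradient(f, np.array([0.0, 0.0])))   # ≈ [-34, -38]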

Page 34

Computing the Gradient

How Do You Compute The Gradient?

Numerical Method:

$\nabla_{\mathbf{w}} L(\mathbf{w}) = \big[\partial L(\mathbf{w})/\partial w_1, \ldots, \partial L(\mathbf{w})/\partial w_n\big]^T$

Use: $\frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$

How many function evaluations per dimension?

Page 35

Computing the Gradient

How Do You Compute The Gradient?

Analytical Method: use calculus!

$\nabla_{\mathbf{w}} L(\mathbf{w}) = \big[\partial L(\mathbf{w})/\partial w_1, \ldots, \partial L(\mathbf{w})/\partial w_n\big]^T$

Page 36

Computing the Gradient

$L(\mathbf{w}) = \lambda\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} (y_i - \mathbf{w}^T\mathbf{x}_i)^2$

Taking $\partial/\partial\mathbf{w}$:

$\nabla_{\mathbf{w}} L(\mathbf{w}) = 2\lambda\mathbf{w} + \sum_{i=1}^{n} -2(y_i - \mathbf{w}^T\mathbf{x}_i)\,\mathbf{x}_i$

Note: if you look at other derivations, things are written either (y - w^T x) or (w^T x - y); the gradients will differ by a minus sign.
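A direct NumPy transcription of that gradient (a sketch; X, y, w, and λ are made up):

import numpy as np

def ridge_grad(w, X, y, lam):
    """Gradient of lam*||w||^2 + sum_i (y_i - w^T x_i)^2 with respect to w."""
    residual = y - X @ w                        # (y_i - w^T x_i) for every i
    return 2 * lam * w - 2 * X.T @ residual     # 2*lam*w + sum_i -2*(y_i - w^T x_i) x_i

X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.0]])
y = np.array([1.0, 2.0, 2.5])
print(ridge_grad(np.array([0.7, 0.3]), X, y, lam=0.1))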

Page 37

Interpreting the Gradient

Recall the update: w = w + -∇_w L(w)

$\nabla_{\mathbf{w}} L(\mathbf{w}) = 2\lambda\mathbf{w} + \sum_{i=1}^{n} -2(y_i - \mathbf{w}^T\mathbf{x}_i)\,\mathbf{x}_i$

$-\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\lambda\mathbf{w} + \sum_{i=1}^{n} 2(y_i - \mathbf{w}^T\mathbf{x}_i)\,\mathbf{x}_i$

The -2λw term pushes w towards 0.

If y_i > w^T x_i (prediction too low), the update does w = w + αx_i for some α > 0.
Before: w^T x. After: (w + αx)^T x = w^T x + α x^T x, which is bigger.

Page 38

Quick annoying detail: subgradients

What is the derivative of |x|?

$\frac{\partial}{\partial x} f(x) = \mathrm{sign}(x)$ for $x \neq 0$; at $x = 0$ it is undefined.

Derivatives/gradients are defined everywhere but 0. Oh no! A discontinuity!

[Plot of |x|.]

Page 39

Quick annoying detail: subgradients

Subgradient: any underestimate of the function.

$\frac{\partial}{\partial x} f(x) = \mathrm{sign}(x)$ for $x \neq 0$; at $x = 0$, $\frac{\partial}{\partial x} f(x) \in [-1, 1]$.

Subderivatives/subgradients are defined everywhere. In practice: at the discontinuity, pick the value from either side.

[Plot of |x|.]

Page 40

Let’s Compute Another Gradient

Page 41

Computing The Gradient

Multiclass Support Vector Machine:

$\arg\min_{W} \; \lambda\|W\|_2^2 + \sum_{i=1}^{n} \sum_{j \neq y_i} \max\big(0,\, (W\mathbf{x}_i)_j - (W\mathbf{x}_i)_{y_i} + m\big)$

Notation: W has rows $\mathbf{w}_j$ (i.e., one per-class scorer), and $(W\mathbf{x}_i)_j = \mathbf{w}_j^T\mathbf{x}_i$, so:

$\arg\min_{W} \; \lambda\sum_{j=1}^{K}\|\mathbf{w}_j\|_2^2 + \sum_{i=1}^{n} \sum_{j \neq y_i} \max\big(0,\, \mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m\big)$

Derivation setup: Karpathy and Fei-Fei

Page 42

Computing The Gradient

$\arg\min_{W} \; \lambda\sum_{j=1}^{K}\|\mathbf{w}_j\|_2^2 + \sum_{i=1}^{n} \sum_{j \neq y_i} \max\big(0,\, \mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m\big)$

$\partial/\partial\mathbf{w}_j$ of each max term:

$0$ if $\mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m \leq 0$;  $\mathbf{x}_i$ if $\mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m > 0$

$\rightarrow \; \mathbb{1}(\mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m > 0)\,\mathbf{x}_i$

Derivation setup: Karpathy and Fei-Fei

Page 43

Computing The Gradient

$\arg\min_{W} \; \lambda\sum_{j=1}^{K}\|\mathbf{w}_j\|_2^2 + \sum_{i=1}^{n} \sum_{j \neq y_i} \max\big(0,\, \mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m\big)$

$\partial/\partial\mathbf{w}_{y_i}$:

$\sum_{j \neq y_i} \mathbb{1}(\mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m > 0)\,(-\mathbf{x}_i)$

Derivation setup: Karpathy and Fei-Fei

Page 44

Interpreting The Gradient

$-\partial/\partial\mathbf{w}_j: \; -\mathbb{1}(\mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m > 0)\,\mathbf{x}_i$

If we do not predict the correct class by at least a score difference of m, we want the incorrect class's scoring vector to score that point lower.

Recall: before, w^T x; after, (w - αx)^T x = w^T x - α x^T x.

Derivation setup: Karpathy and Fei-Fei

Page 45

Computing The Gradient

• Numerical: foolproof but slow

• Analytical: can mess things up ☺

• In practice: do analytical, but check with numerical (called a gradient check)

Slide: Karpathy and Fei-Fei
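A gradient-check sketch in that spirit (illustrative; it compares the analytic ridge-regression gradient from a few slides back against the centered-difference estimate on made-up data):

import numpy as np

def ridge_grad(w, X, y, lam):
    return 2 * lam * w - 2 * X.T @ (y - X @ w)            # analytic gradient

def numerical_gradient(f, w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.0]])
y = np.array([1.0, 2.0, 2.5])
lam, w = 0.1, np.array([0.7, 0.3])
L = lambda w: lam * (w @ w) + np.sum((y - X @ w) ** 2)

diff = np.abs(ridge_grad(w, X, y, lam) - numerical_gradient(L, w))
print(diff.max())    # tiny if the analytic gradient is right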

Page 46

Implementing Gradient Descent

The loss is a function that we can evaluate over data.

All data: $-\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\lambda\mathbf{w} + \sum_{i=1}^{n} 2(y_i - \mathbf{w}^T\mathbf{x}_i)\,\mathbf{x}_i$

Subset B: $-\nabla_{\mathbf{w}} L_B(\mathbf{w}) = -2\lambda\mathbf{w} + \sum_{i \in B} 2(y_i - \mathbf{w}^T\mathbf{x}_i)\,\mathbf{x}_i$

Page 47

Implementing Gradient Descent

Option 1: Vanilla Gradient Descent – compute the gradient of L over all data points

for iter in range(numIters):
    g = gradient(data, L)
    w = w + -stepsize(iter)*g      # update w

Page 48

Implementing Gradient Descent

Option 2: Stochastic Gradient Descent – compute the gradient of L over 1 random sample

for iter in range(numIters):
    index = randint(0, #data)
    g = gradient(data[index], L)
    w = w + -stepsize(iter)*g      # update w

Page 49

Implementing Gradient Descent

Option 3: Minibatch Gradient Descent – compute the gradient of L over a subset of B samples (typical batch sizes: ~100)

for iter in range(numIters):
    subset = choose_samples(#data, B)
    g = gradient(data[subset], L)
    w = w + -stepsize(iter)*g      # update w
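A runnable minibatch-SGD sketch for the regularized least-squares loss from earlier (everything here – the synthetic data, B = 10, and the constant step size – is made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

lam, B, stepsize = 0.01, 10, 1e-2
w = np.zeros(5)
for it in range(2000):
    idx = rng.integers(0, len(X), size=B)             # random minibatch
    Xb, yb = X[idx], y[idx]
    g = 2 * lam * w - 2 * Xb.T @ (yb - Xb @ w)        # gradient of L_B(w)
    w = w - stepsize * g
print(w)                                              # roughly recovers true_w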

Page 50

Gradient Descent Details

The step size (also called the learning rate / lr) is a critical parameter:

10×10⁻²: converges
12×10⁻²: diverges
1×10⁻²: falls short

Page 51

Gradient Descent Details

11×10⁻²: oscillates (raw gradients)

Page 52

Gradient Descent Details

Typical solution: start with an initial rate lr and multiply it by f every N iterations. For example: init_lr = 10⁻¹, f = 0.1, N = 10K.
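A sketch of that step-decay schedule as a stepsize(iter) function (the constants are just the example values above):

def stepsize(it, init_lr=1e-1, f=0.1, N=10_000):
    """Start at init_lr and multiply by f every N iterations."""
    return init_lr * (f ** (it // N))

print(stepsize(0), stepsize(10_000), stepsize(25_000))   # 0.1, then 0.01, then 0.001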

Page 53

Gradient Descent Details

11×10⁻²: oscillates (raw gradients)

Solution: average gradients with exponentially decaying weights, called “momentum”.
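One common form of that update, as a sketch (the decay 0.25 matches the next slide's “0.25 momentum”; the exact formula varies between implementations):

import numpy as np

def momentum_step(w, v, grad, stepsize, beta=0.25):
    """Keep a decaying running average of gradients in v and step along it."""
    v = beta * v + grad          # old gradients shrink geometrically by beta each step
    w = w - stepsize * v
    return w, v

# Usage inside a training loop (grad computed however you like):
w, v = np.zeros(2), np.zeros(2)
grad = np.array([1.0, -2.0])
w, v = momentum_step(w, v, grad, stepsize=0.11)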

Page 54

Gradient Descent Details

11×10⁻² with 0.25 momentum vs. 11×10⁻² with raw gradients (oscillates)

Page 55

Gradient Descent Details

Multiple minima: gradient descent finds a local minimum.

Page 56

Gradient Descent Details

[Figure: a landscape with a start point and candidate minima labeled 1–4. Guess the minimum!]

Page 57

Gradient Descent Details

Page 58

Gradient Descent Details

[Figure: the landscape again with the start point and candidate minima labeled 1–4. Guess the minimum!]

Page 59

Gradient Descent Details

Dynamics are fairly complex.

Many important functions are convex: any local minimum is a global minimum. Many important functions are not.

Page 60

In practice

• Conventional wisdom: minibatch stochastic gradient descent (SGD) + momentum (package implements it for you) + decaying learning rate

• The above is typically what is meant by “SGD”

• Other update rules exist; benefits in general not clear (sometimes better, sometimes worse)

Page 61

Optimizing Everything

• Optimize w on training set with SGD to maximize training accuracy

• Optimize λ with random/grid search to maximize validation accuracy

• Note: optimizing λ on the training set sets it to 0

$L(W) = \lambda\|W\|_2^2 + \sum_{i=1}^{n} -\log\frac{\exp((W\mathbf{x}_i)_{y_i})}{\sum_k \exp((W\mathbf{x}_i)_k)}$

$L(\mathbf{w}) = \lambda\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} (y_i - \mathbf{w}^T\mathbf{x}_i)^2$

Page 62

(Over/Under)fitting and Complexity

Let’s fit a polynomial: given x, predict y

Note: can do non-linear regression with copies of x

$\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} x_1^F & \cdots & x_1^2 & x_1 & 1 \\ \vdots & \ddots & \vdots & \vdots & \vdots \\ x_N^F & \cdots & x_N^2 & x_N & 1 \end{bmatrix} \begin{bmatrix} w_F \\ \vdots \\ w_2 \\ w_1 \\ w_0 \end{bmatrix}$

The matrix collects all polynomial degrees of each x; the weights are one per polynomial degree.
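A sketch of building that matrix and doing the fit (np.vander produces exactly these columns; the data are generated from the model on the next slide, 1.5x² + 2.3x + 2 plus N(0, 0.5) noise):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=20)
y = 1.5 * x**2 + 2.3 * x + 2 + rng.normal(0, 0.5, size=20)

F = 2                                        # polynomial degree
A = np.vander(x, F + 1)                      # columns [x^F, ..., x^2, x, 1]
w, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares fit of the weights
print(w)                                     # ≈ [1.5, 2.3, 2.0]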

Page 63

(Over/Under)fitting and Complexity

Model: 1.5x² + 2.3x + 2 + N(0, 0.5)

Page 64

Underfitting

Model: 1.5x² + 2.3x + 2 + N(0, 0.5)

Page 65

Underfitting

The model doesn’t have the parameters to fit the data.

Bias (statistics): error intrinsic to the model.

Page 66

Overfitting

Model: 1.5x² + 2.3x + 2 + N(0, 0.5)

Page 67

Overfitting

The model has high variance: remove one point, and the model changes dramatically.

Page 68

(Continuous) Model Complexity

$\arg\min_{W} \; \lambda\|W\|_2^2 + \sum_{i=1}^{n} -\log\frac{\exp((W\mathbf{x}_i)_{y_i})}{\sum_k \exp((W\mathbf{x}_i)_k)}$

Regularization: a penalty for a complex model. The second term pays a penalty for the negative log-likelihood of the correct class.

Intuitively: big weights = a more complex model.

Model 1: 0.01*x1 + 1.3*x2 + -0.02*x3 + -2.1*x4 + 10
Model 2: 37.2*x1 + 13.4*x2 + 5.6*x3 + -6.1*x4 + 30

Page 69

Fitting Model

Again fitting a polynomial, but with regularization:

$\arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2, \quad \mathbf{X} = \begin{bmatrix} x_1^F & \cdots & x_1^2 & x_1 & 1 \\ \vdots & \ddots & \vdots & \vdots & \vdots \\ x_N^F & \cdots & x_N^2 & x_N & 1 \end{bmatrix}, \; \mathbf{w} = \begin{bmatrix} w_F \\ \vdots \\ w_0 \end{bmatrix}$

Page 70

Adding Regularization

No regularization: fits all data points.
Regularization: can’t fit all data points.

Page 71

In General

Error on new data comes from combination of:

1. Bias: model is oversimplified and can’t fit the underlying data

2. Variance: you don’t have the ability to estimate your model from limited data

3. Inherent: the data is intrinsically difficult

Bias and variance trade off: fixing one hurts the other.

Page 72

Underfitting and Overfitting

[Diagram adapted from D. Hoiem: training error and test error vs. model complexity. The left end of the complexity axis is high bias / low variance; the right end is low bias / high variance. The high-bias region is marked "Underfitting!"]

Page 73

Underfitting and Overfitting

[Diagram adapted from D. Hoiem: test error vs. model complexity, again from high bias / low variance (left) to low bias / high variance (right). With small data you overfit with a small model; with big data you overfit with a bigger model.]

Page 74

Underfitting

Do poorly on both training and validation data due to bias.

Solution:

1. More features

2. More powerful model

3. Reduce regularization

Page 75

Overfitting

Do well on training data, but poorly on validation data due to variance

Solution:

1. More data

2. Less powerful model

3. Regularize your model more

Cris Dima rule: first make sure you can overfit, then stop overfitting.

Page 76

Next Class

• Non-linear models (neural nets)