Optimization and (Under/over)fitting
EECS 442 – Prof. David Fouhey
Winter 2019, University of Michigan
http://web.eecs.umich.edu/~fouhey/teaching/EECS442_W19/


Page 1

Optimization and (Under/over)fitting
EECS 442 – Prof. David Fouhey

Winter 2019, University of Michigan
http://web.eecs.umich.edu/~fouhey/teaching/EECS442_W19/

Page 2

Administrivia

• We’re grading HW2 and will try to get it back ASAP
• Midterm next week
• Review sessions next week (go to whichever you want)
• We will post priorities from the slides
• There’s a post on Piazza looking for topics you want reviewed

Page 3

Previously

Page 4

Regularized Least Squares

Add a regularization term to the objective that prefers some solutions:

Before (Loss): $\arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$

After (Loss + Regularization): $\arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda \|\mathbf{w}\|_2^2$

λ sets the trade-off between the two terms.

Want the model “smaller”: pay a penalty for w with a big norm.

Intuitive objective: an accurate model (low loss) but not too complex (low regularization). λ controls how much of each.
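As a concrete illustration (not from the slides), here is a minimal NumPy sketch that just evaluates this regularized objective for a given w; the data X, y and the values of λ are made up for the example.

import numpy as np

def ridge_loss(w, X, y, lam):
    """Regularized least squares: ||y - Xw||^2 + lam * ||w||^2."""
    residual = y - X @ w
    return residual @ residual + lam * (w @ w)

# Made-up data: 5 points, 2 features.
X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.0], [0.5, 2.0], [1.5, 1.5]])
y = np.array([1.0, 2.0, 2.5, 1.5, 2.2])
w = np.array([0.7, 0.3])

print(ridge_loss(w, X, y, lam=0.0))   # plain least-squares loss
print(ridge_loss(w, X, y, lam=1.0))   # same fit, plus a penalty for large ||w||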

Page 5

Nearest Neighbors

[Figure: known images x_1, …, x_N with labels (Cat, Dog, …); a test image x_T; distances D(x_1, x_T), …, D(x_N, x_T).]

(1) Compute the distance between feature vectors, (2) find the nearest known image, (3) use its label: Cat!
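A minimal 1-nearest-neighbor sketch (illustrative, not course code): Euclidean distance is assumed, and made-up feature vectors stand in for the images.

import numpy as np

def nearest_neighbor(known_x, known_labels, x_test):
    """Return the label of the known example closest to x_test (Euclidean distance)."""
    dists = np.linalg.norm(known_x - x_test, axis=1)   # D(x_i, x_T) for every i
    return known_labels[np.argmin(dists)]

known_x = np.array([[1.0, 2.0], [8.0, 9.0], [1.5, 1.8]])   # stand-ins for image features
known_labels = ["cat", "dog", "cat"]
print(nearest_neighbor(known_x, known_labels, np.array([1.2, 2.1])))   # -> cat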

Page 6

Picking Parameters

What distance? What value for k / λ?

Split the data into Training | Validation | Test.

Training: use these data points for lookup.
Validation: evaluate on these points for different k, λ, distances.
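A sketch of that selection loop (illustrative only: the splits are random made-up data, the distance is Euclidean, and only k is searched over):

import numpy as np

def knn_predict(train_x, train_y, x, k):
    """Label x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_x - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(train_y[nearest]).argmax()

# Made-up split: train_* for lookup, val_* for choosing k (keep test data untouched).
rng = np.random.default_rng(0)
train_x, val_x = rng.normal(size=(20, 2)), rng.normal(size=(10, 2))
train_y, val_y = rng.integers(0, 2, 20), rng.integers(0, 2, 10)

best_k, best_acc = None, -1.0
for k in [1, 3, 5]:
    preds = np.array([knn_predict(train_x, train_y, x, k) for x in val_x])
    acc = np.mean(preds == val_y)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)   # pick k on validation; report final numbers on the test split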

Page 7

Linear Models

Example Setup: 3 classes

Model – one weight vector per class: $\mathbf{w}_0, \mathbf{w}_1, \mathbf{w}_2$

Want:
$\mathbf{w}_0^T\mathbf{x}$ big if cat
$\mathbf{w}_1^T\mathbf{x}$ big if dog
$\mathbf{w}_2^T\mathbf{x}$ big if hippo

Stack together into $\mathbf{W} \in \mathbb{R}^{3 \times F}$, where $\mathbf{x} \in \mathbb{R}^F$.

Page 8

Linear Models

[Example (diagram by Karpathy, Fei-Fei): a weight matrix W whose rows are the cat, dog, and hippo weight vectors

0.2  -0.5   0.1   2.0  |  1.1
1.5   1.3   2.1   0.0  |  3.2
0.0   0.3   0.2  -0.3  | -1.2

applied to an image's feature vector x_i = [56, 231, 24, 2, 1] (the trailing 1 pairs with the last column, the per-class bias), giving the scores Wx_i = [-96.8, 437.9, 61.95] for cat, dog, and hippo.]

The weight matrix is a collection of scoring functions, one per class. The prediction is a vector whose jth component is the "score" for the jth class.

Page 9

Objective 1: Multiclass SVM

Training (x_i, y_i):

$\arg\min_{W} \; \lambda\|W\|_2^2 + \sum_{i=1}^{n} \sum_{j \neq y_i} \max\big(0,\, (W\mathbf{x}_i)_j - (W\mathbf{x}_i)_{y_i} + m\big)$

The first term is regularization. The double sum runs over all data points and, for each one, over every class j that is NOT the correct one (y_i). Pay no penalty if the prediction for class y_i is bigger than that for j by m (the "margin"); otherwise, pay proportional to the score of the wrong class.

Inference (x): $\arg\max_k (W\mathbf{x})_k$ (take the class whose weight vector gives the highest score)
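A small NumPy sketch of this hinge loss for one training example (illustrative; the scores, the margin m = 1, and the label are made up):

import numpy as np

def multiclass_svm_loss(scores, y, m=1.0):
    """Sum over wrong classes j of max(0, scores[j] - scores[y] + m)."""
    margins = np.maximum(0.0, scores - scores[y] + m)
    margins[y] = 0.0                          # j != y_i: the correct class pays nothing
    return margins.sum()

scores = np.array([-0.9, 0.4, 0.6])           # (Wx)_k for cat, dog, hippo
print(multiclass_svm_loss(scores, y=2))       # correct class = hippo -> 0.8
print(np.argmax(scores))                      # inference: class with the highest score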

Page 10

Objective 1 is called the Support Vector Machine. There is lots of great theory as to why this is a sensible thing to do; see this useful (and free!) book:

The Elements of Statistical Learning

Hastie, Tibshirani, Friedman

https://web.stanford.edu/~hastie/ElemStatLearn/

Page 11

Objective 2: Making Probabilities

Converting scores to a "probability distribution":

[Worked example: scores (cat, dog, hippo) = (-0.9, 0.4, 0.6); apply exp(x) to get (e^{-0.9}, e^{0.4}, e^{0.6}) = (0.41, 1.49, 1.82), which sum to 3.72; normalize to get (0.11, 0.40, 0.49) = (P(cat), P(dog), P(hippo)).]

Generally: $P(\text{class } j) = \frac{\exp((W\mathbf{x})_j)}{\sum_k \exp((W\mathbf{x})_k)}$

This is called the softmax function.

Inspired by: Karpathy, Fei-Fei
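A minimal sketch of that conversion; it reproduces the numbers above (subtracting the max before exponentiating is a standard numerical-stability trick, not something the slide relies on):

import numpy as np

def softmax(scores):
    """exp(score_j) / sum_k exp(score_k), shifted by the max for numerical stability."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

print(softmax(np.array([-0.9, 0.4, 0.6])))    # ≈ [0.11, 0.40, 0.49]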

Page 12

Objective 2: Softmax

Inference (x): $\arg\max_k (W\mathbf{x})_k$ (take the class whose weight vector gives the highest score)

$P(\text{class } j) = \frac{\exp((W\mathbf{x})_j)}{\sum_k \exp((W\mathbf{x})_k)}$

Why can we skip the exp / sum-of-exp step to make a decision? Because exp is monotonic and the denominator is the same for every class, the most probable class is also the class with the highest score.

Page 13

Objective 2: Softmax

Inference (x): $\arg\max_k (W\mathbf{x})_k$ (take the class whose weight vector gives the highest score)

Training (x_i, y_i):

$\arg\min_{W} \; \lambda\|W\|_2^2 + \sum_{i=1}^{n} -\log\frac{\exp((W\mathbf{x}_i)_{y_i})}{\sum_k \exp((W\mathbf{x}_i)_k)}$

The first term is regularization; the sum runs over all data points; the fraction is P(correct class). Pay a penalty for not making the correct class likely: the "negative log-likelihood".
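A tiny sketch of the per-example penalty (illustrative, with the same made-up scores as above):

import numpy as np

def softmax_nll(scores, y):
    """Negative log-likelihood of the correct class y under softmax(scores)."""
    e = np.exp(scores - np.max(scores))
    p = e / e.sum()
    return -np.log(p[y])

print(softmax_nll(np.array([-0.9, 0.4, 0.6]), y=2))   # ≈ 0.71; shrinks as P(hippo) grows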

Page 14

Today

• Optimizing things
• Time permitting: (Under/over)fitting or (Bias/Variance)
  • (Not on the exam)

Page 15

How Do We Optimize Things?

Goal: find the w minimizing some loss function L:

$\arg\min_{\mathbf{w} \in \mathbb{R}^N} L(\mathbf{w})$

Works for lots of different Ls:

$L(W) = \lambda\|W\|_2^2 + \sum_{i=1}^{n} -\log\frac{\exp((W\mathbf{x}_i)_{y_i})}{\sum_k \exp((W\mathbf{x}_i)_k)}$

$L(\mathbf{w}) = \lambda\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} (y_i - \mathbf{w}^T\mathbf{x}_i)^2$

$L(\mathbf{w}) = C\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} \max(0,\, 1 - y_i\mathbf{w}^T\mathbf{x}_i)$

Page 16

Sample Function to Optimize

f(x, y) = (x + 2y - 7)^2 + (2x + y - 5)^2   [contour plot with the global minimum marked]

Page 17

Sample Function to Optimize

• I’ll switch back and forth between this 2D function (called the Booth Function) and other more-learning-focused functions

• The beauty of optimization is that it’s all the same in principle.

• But don’t draw too many conclusions: 2D space has qualitative differences from 1000D space.

See the intro of: Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization," NIPS 2014. https://ganguli-gang.stanford.edu/pdf/14.SaddlePoint.NIPS.pdf

Page 18

A Caveat

• Each point in the picture is a function evaluation
• Here it takes microseconds – so we can easily see the answer
• Functions we want to optimize may take hours to evaluate

Page 19

A Few Caveats

[Landscape diagram: Karpathy and Fei-Fei; both axes run from -20 to 20.]

Model in your head: moving around a landscape with a teleportation device.

Page 20

Option #1A – Grid Search

# systematically try things
best, bestScore = None, Inf
for dim1Value in dim1Values:
    ….
        for dimNValue in dimNValues:
            w = [dim1Value, …, dimNValue]
            if L(w) < bestScore:
                best, bestScore = w, L(w)
return best
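A runnable version of the same idea (a sketch: the loss is the Booth function from the earlier slides, and itertools.product plays the role of the N nested loops):

import itertools
import numpy as np

def L(w):
    x, y = w
    return (x + 2*y - 7)**2 + (2*x + y - 5)**2     # made-up objective (Booth function)

def grid_search(loss, axes):
    """Try every combination of per-dimension sample values; keep the best."""
    best, best_score = None, float("inf")
    for w in itertools.product(*axes):
        score = loss(w)
        if score < best_score:
            best, best_score = w, score
    return best, best_score

axes = [np.linspace(-10, 10, 21)] * 2      # samplesPerDim ** numberOfDims evaluations
print(grid_search(L, axes))                # lands on the true minimum at (1, 3)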

Page 21

Option #1A – Grid Search

Page 22

Option #1A – “Grid Search”

Pros:
1. Super simple
2. Only requires being able to evaluate the model

Cons:
1. Scales horribly to high-dimensional spaces

Complexity: samplesPerDim^numberOfDims

Page 23

Option #1B – Random Search

# Do random stuff, RANSAC style
best, bestScore = None, Inf
for iter in range(numIters):
    w = random(N,1)      # sample
    score = L(w)         # evaluate
    if score < bestScore:
        best, bestScore = w, score
return best

Page 24

Option #1B – Random Search

Page 25

Option #1B – Random Search

Pros:
1. Super simple
2. Only requires being able to sample the model and evaluate it

Cons:
1. Slow and stupid – throwing darts at a high-dimensional dart board
2. Might miss something

[Diagram: within each dimension's range [0, 1], the good parameters occupy a fraction ε, so P(all N dimensions correct) = ε^N.]

Page 26

When Do You Use Options 1A/1B?

Use these when

• Number of dimensions small, space bounded

• Objective is impossible to analyze (e.g., test accuracy if we use this distance function)

Random search is arguably more effective; grid search makes it easy to systematically test something (people love certainty)

Page 27

Option 2 – Use The Gradient

Arrows: gradient

Page 28

Option 2 – Use The Gradient

Arrows: gradient direction (scaled to unit length)

Page 29

Option 2 – Use The Gradient

Want: $\arg\min_{\mathbf{w}} L(\mathbf{w})$

$\nabla_{\mathbf{w}} L(\mathbf{w}) = \big[\partial L/\partial w_1, \ldots, \partial L/\partial w_N\big]^T$

What's the geometric interpretation of the gradient?

Which is bigger (for small α): $L(\mathbf{w})$ or $L(\mathbf{w} + \alpha\nabla_{\mathbf{w}} L(\mathbf{w}))$?

Page 30

Option 2 – Use The Gradient

Arrows: gradient direction (scaled to unit length). [Figure marks a point x and the point x + α∇.]

Page 31

Option 2 – Use The Gradient

w = initialize()                   # initialize
for iter in range(numIters):
    g = ∇_w L(w)                   # eval gradient
    w = w + -stepsize(iter)*g      # update w
return w

Method: at each step, move in the direction of the negative gradient.

Page 32

Gradient Descent

Given a starting point (blue):

$w_{i+1} = w_i - 9.8\times 10^{-2} \cdot \text{gradient}$
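A runnable sketch of this exact run on the Booth function (the analytic gradient below is easy to verify by hand; the starting point is made up):

import numpy as np

def booth(w):
    x, y = w
    return (x + 2*y - 7)**2 + (2*x + y - 5)**2

def booth_grad(w):
    x, y = w
    a, b = x + 2*y - 7, 2*x + y - 5
    return np.array([2*a + 4*b, 4*a + 2*b])

w = np.array([-15.0, 15.0])              # made-up starting point
for it in range(100):
    w = w - 9.8e-2 * booth_grad(w)       # w_{i+1} = w_i - stepsize * gradient
print(w, booth(w))                       # approaches the global minimum at (1, 3)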

Page 33

Computing the Gradient

How Do You Compute The Gradient?

Numerical Method:

$\nabla_{\mathbf{w}} L(\mathbf{w}) = \big[\partial L(\mathbf{w})/\partial w_1, \ldots, \partial L(\mathbf{w})/\partial w_n\big]^T$

How do you compute each entry? By definition,

$\frac{\partial f(x)}{\partial x} = \lim_{\epsilon \to 0} \frac{f(x+\epsilon) - f(x)}{\epsilon}$

In practice, use the centered difference:

$\frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$
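A minimal sketch of that centered-difference estimate (ε = 1e-5 is an arbitrary but typical choice; the Booth function is reused as the test function):

import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Estimate the gradient of f at w, one dimension at a time, with centered differences."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)   # two evaluations per dimension
    return g

f = lambda w: (w[0] + 2*w[1] - 7)**2 + (2*w[0] + w[1] - 5)**2
print(numerical_gradient(f, np.array([0.0, 0.0])))   # ≈ [-34, -38]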

Page 34

Computing the Gradient

How Do You Compute The Gradient?

Numerical Method:

$\nabla_{\mathbf{w}} L(\mathbf{w}) = \big[\partial L(\mathbf{w})/\partial w_1, \ldots, \partial L(\mathbf{w})/\partial w_n\big]^T$

Use: $\frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$

How many function evaluations per dimension?

Page 35

Computing the Gradient

How Do You Compute The Gradient?

Analytical Method: use calculus!

$\nabla_{\mathbf{w}} L(\mathbf{w}) = \big[\partial L(\mathbf{w})/\partial w_1, \ldots, \partial L(\mathbf{w})/\partial w_n\big]^T$

Page 36

Computing the Gradient

$L(\mathbf{w}) = \lambda\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} (y_i - \mathbf{w}^T\mathbf{x}_i)^2$

Taking $\partial/\partial\mathbf{w}$:

$\nabla_{\mathbf{w}} L(\mathbf{w}) = 2\lambda\mathbf{w} + \sum_{i=1}^{n} -2(y_i - \mathbf{w}^T\mathbf{x}_i)\,\mathbf{x}_i$

Note: if you look at other derivations, things are written either (y - w^T x) or (w^T x - y); the gradients will differ by a minus sign.
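A direct NumPy transcription of that gradient (a sketch; X, y, w, and λ are made up):

import numpy as np

def ridge_grad(w, X, y, lam):
    """Gradient of lam*||w||^2 + sum_i (y_i - w^T x_i)^2 with respect to w."""
    residual = y - X @ w                        # (y_i - w^T x_i) for every i
    return 2 * lam * w - 2 * X.T @ residual     # 2*lam*w + sum_i -2*(y_i - w^T x_i) x_i

X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.0]])
y = np.array([1.0, 2.0, 2.5])
print(ridge_grad(np.array([0.7, 0.3]), X, y, lam=0.1))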

Page 37

Interpreting the Gradient

Recall the update: w = w + -∇_w L(w)

$\nabla_{\mathbf{w}} L(\mathbf{w}) = 2\lambda\mathbf{w} + \sum_{i=1}^{n} -2(y_i - \mathbf{w}^T\mathbf{x}_i)\,\mathbf{x}_i$

$-\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\lambda\mathbf{w} + \sum_{i=1}^{n} 2(y_i - \mathbf{w}^T\mathbf{x}_i)\,\mathbf{x}_i$

The -2λw term pushes w towards 0.

If y_i > w^T x_i (prediction too low), the update does w = w + αx_i for some α > 0.
Before: w^T x. After: (w + αx)^T x = w^T x + α x^T x, which is bigger.

Page 38

Quick annoying detail: subgradients

What is the derivative of |x|?

$\frac{\partial}{\partial x} f(x) = \mathrm{sign}(x)$ for $x \neq 0$; at $x = 0$ it is undefined.

Derivatives/gradients are defined everywhere but 0. Oh no! A discontinuity!

[Plot of |x|.]

Page 39

Quick annoying detail: subgradients

Subgradient: any underestimate of the function.

$\frac{\partial}{\partial x} f(x) = \mathrm{sign}(x)$ for $x \neq 0$; at $x = 0$, $\frac{\partial}{\partial x} f(x) \in [-1, 1]$.

Subderivatives/subgradients are defined everywhere. In practice: at the discontinuity, pick the value from either side.

[Plot of |x|.]

Page 40

Let’s Compute Another Gradient

Page 41

Computing The Gradient

Multiclass Support Vector Machine:

$\arg\min_{W} \; \lambda\|W\|_2^2 + \sum_{i=1}^{n} \sum_{j \neq y_i} \max\big(0,\, (W\mathbf{x}_i)_j - (W\mathbf{x}_i)_{y_i} + m\big)$

Notation: W has rows $\mathbf{w}_j$ (i.e., one per-class scorer), and $(W\mathbf{x}_i)_j = \mathbf{w}_j^T\mathbf{x}_i$, so:

$\arg\min_{W} \; \lambda\sum_{j=1}^{K}\|\mathbf{w}_j\|_2^2 + \sum_{i=1}^{n} \sum_{j \neq y_i} \max\big(0,\, \mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m\big)$

Derivation setup: Karpathy and Fei-Fei

Page 42

Computing The Gradient

$\arg\min_{W} \; \lambda\sum_{j=1}^{K}\|\mathbf{w}_j\|_2^2 + \sum_{i=1}^{n} \sum_{j \neq y_i} \max\big(0,\, \mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m\big)$

$\partial/\partial\mathbf{w}_j$ of each max term:

$0$ if $\mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m \leq 0$;  $\mathbf{x}_i$ if $\mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m > 0$

$\rightarrow \; \mathbb{1}(\mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m > 0)\,\mathbf{x}_i$

Derivation setup: Karpathy and Fei-Fei

Page 43

Computing The Gradient

$\arg\min_{W} \; \lambda\sum_{j=1}^{K}\|\mathbf{w}_j\|_2^2 + \sum_{i=1}^{n} \sum_{j \neq y_i} \max\big(0,\, \mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m\big)$

$\partial/\partial\mathbf{w}_{y_i}$:

$\sum_{j \neq y_i} \mathbb{1}(\mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m > 0)\,(-\mathbf{x}_i)$

Derivation setup: Karpathy and Fei-Fei

Page 44

Interpreting The Gradient

$-\partial/\partial\mathbf{w}_j: \; -\mathbb{1}(\mathbf{w}_j^T\mathbf{x}_i - \mathbf{w}_{y_i}^T\mathbf{x}_i + m > 0)\,\mathbf{x}_i$

If we do not predict the correct class by at least a score difference of m, we want the incorrect class's scoring vector to score that point lower.

Recall: before, w^T x; after, (w - αx)^T x = w^T x - α x^T x.

Derivation setup: Karpathy and Fei-Fei

Page 45

Computing The Gradient

• Numerical: foolproof but slow

• Analytical: can mess things up ☺

• In practice: do analytical, but check with numerical (called a gradient check)

Slide: Karpathy and Fei-Fei
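A gradient-check sketch in that spirit (illustrative; it compares the analytic ridge-regression gradient from a few slides back against the centered-difference estimate on made-up data):

import numpy as np

def ridge_grad(w, X, y, lam):
    return 2 * lam * w - 2 * X.T @ (y - X @ w)            # analytic gradient

def numerical_gradient(f, w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.0]])
y = np.array([1.0, 2.0, 2.5])
lam, w = 0.1, np.array([0.7, 0.3])
L = lambda w: lam * (w @ w) + np.sum((y - X @ w) ** 2)

diff = np.abs(ridge_grad(w, X, y, lam) - numerical_gradient(L, w))
print(diff.max())    # tiny if the analytic gradient is right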

Page 46

Implementing Gradient Descent

The loss is a function that we can evaluate over data.

All data: $-\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\lambda\mathbf{w} + \sum_{i=1}^{n} 2(y_i - \mathbf{w}^T\mathbf{x}_i)\,\mathbf{x}_i$

Subset B: $-\nabla_{\mathbf{w}} L_B(\mathbf{w}) = -2\lambda\mathbf{w} + \sum_{i \in B} 2(y_i - \mathbf{w}^T\mathbf{x}_i)\,\mathbf{x}_i$

Page 47

Implementing Gradient Descent

Option 1: Vanilla Gradient Descent – compute the gradient of L over all data points

for iter in range(numIters):
    g = gradient(data, L)
    w = w + -stepsize(iter)*g      # update w

Page 48

Implementing Gradient Descent

Option 2: Stochastic Gradient Descent – compute the gradient of L over 1 random sample

for iter in range(numIters):
    index = randint(0, #data)
    g = gradient(data[index], L)
    w = w + -stepsize(iter)*g      # update w

Page 49

Implementing Gradient Descent

Option 3: Minibatch Gradient Descent – compute the gradient of L over a subset of B samples (typical batch sizes: ~100)

for iter in range(numIters):
    subset = choose_samples(#data, B)
    g = gradient(data[subset], L)
    w = w + -stepsize(iter)*g      # update w
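A runnable minibatch-SGD sketch for the regularized least-squares loss from earlier (everything here – the synthetic data, B = 10, and the constant step size – is made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

lam, B, stepsize = 0.01, 10, 1e-2
w = np.zeros(5)
for it in range(2000):
    idx = rng.integers(0, len(X), size=B)             # random minibatch
    Xb, yb = X[idx], y[idx]
    g = 2 * lam * w - 2 * Xb.T @ (yb - Xb @ w)        # gradient of L_B(w)
    w = w - stepsize * g
print(w)                                              # roughly recovers true_w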

Page 50

Gradient Descent Details

The step size (also called the learning rate / lr) is a critical parameter:

10×10⁻²: converges
12×10⁻²: diverges
1×10⁻²: falls short

Page 51

Gradient Descent Details

11×10⁻²: oscillates (raw gradients)

Page 52

Gradient Descent Details

Typical solution: start with an initial rate lr and multiply it by f every N iterations. For example: init_lr = 10⁻¹, f = 0.1, N = 10K.
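A sketch of that step-decay schedule as a stepsize(iter) function (the constants are just the example values above):

def stepsize(it, init_lr=1e-1, f=0.1, N=10_000):
    """Start at init_lr and multiply by f every N iterations."""
    return init_lr * (f ** (it // N))

print(stepsize(0), stepsize(10_000), stepsize(25_000))   # 0.1, then 0.01, then 0.001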

Page 53

Gradient Descent Details

11×10⁻²: oscillates (raw gradients)

Solution: average gradients with exponentially decaying weights, called “momentum”.
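One common form of that update, as a sketch (the decay 0.25 matches the next slide's “0.25 momentum”; the exact formula varies between implementations):

import numpy as np

def momentum_step(w, v, grad, stepsize, beta=0.25):
    """Keep a decaying running average of gradients in v and step along it."""
    v = beta * v + grad          # old gradients shrink geometrically by beta each step
    w = w - stepsize * v
    return w, v

# Usage inside a training loop (grad computed however you like):
w, v = np.zeros(2), np.zeros(2)
grad = np.array([1.0, -2.0])
w, v = momentum_step(w, v, grad, stepsize=0.11)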

Page 54

Gradient Descent Details

11×10⁻² with 0.25 momentum vs. 11×10⁻² with raw gradients (oscillates)

Page 55

Gradient Descent Details

Multiple minima: gradient descent finds a local minimum.

Page 56

Gradient Descent Details

[Figure: a landscape with a start point and candidate minima labeled 1–4. Guess the minimum!]

Page 57

Gradient Descent Details

Page 58

Gradient Descent Details

[Figure: the landscape again with the start point and candidate minima labeled 1–4. Guess the minimum!]

Page 59

Gradient Descent Details

Dynamics are fairly complex.

Many important functions are convex: any local minimum is a global minimum. Many important functions are not.

Page 60

In practice

• Conventional wisdom: minibatch stochastic gradient descent (SGD) + momentum (package implements it for you) + decaying learning rate

• The above is typically what is meant by “SGD”

• Other update rules exist; benefits in general not clear (sometimes better, sometimes worse)

Page 61

Optimizing Everything

• Optimize w on training set with SGD to maximize training accuracy

• Optimize λ with random/grid search to maximize validation accuracy

• Note: optimizing λ on the training set sets it to 0

$L(W) = \lambda\|W\|_2^2 + \sum_{i=1}^{n} -\log\frac{\exp((W\mathbf{x}_i)_{y_i})}{\sum_k \exp((W\mathbf{x}_i)_k)}$

$L(\mathbf{w}) = \lambda\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} (y_i - \mathbf{w}^T\mathbf{x}_i)^2$

Page 62

(Over/Under)fitting and Complexity

Let’s fit a polynomial: given x, predict y

Note: can do non-linear regression with copies of x

$\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} x_1^F & \cdots & x_1^2 & x_1 & 1 \\ \vdots & \ddots & \vdots & \vdots & \vdots \\ x_N^F & \cdots & x_N^2 & x_N & 1 \end{bmatrix} \begin{bmatrix} w_F \\ \vdots \\ w_2 \\ w_1 \\ w_0 \end{bmatrix}$

The matrix collects all polynomial degrees of each x; the weights are one per polynomial degree.
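A sketch of building that matrix and doing the fit (np.vander produces exactly these columns; the data are generated from the model on the next slide, 1.5x² + 2.3x + 2 plus N(0, 0.5) noise):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=20)
y = 1.5 * x**2 + 2.3 * x + 2 + rng.normal(0, 0.5, size=20)

F = 2                                        # polynomial degree
A = np.vander(x, F + 1)                      # columns [x^F, ..., x^2, x, 1]
w, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares fit of the weights
print(w)                                     # ≈ [1.5, 2.3, 2.0]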

Page 63

(Over/Under)fitting and Complexity

Model: 1.5x² + 2.3x + 2 + N(0, 0.5)

Page 64

Underfitting

Model: 1.5x² + 2.3x + 2 + N(0, 0.5)

Page 65

Underfitting

The model doesn’t have the parameters to fit the data.

Bias (statistics): error intrinsic to the model.

Page 66

Overfitting

Model: 1.5x² + 2.3x + 2 + N(0, 0.5)

Page 67

Overfitting

The model has high variance: remove one point, and the model changes dramatically.

Page 68

(Continuous) Model Complexity

$\arg\min_{W} \; \lambda\|W\|_2^2 + \sum_{i=1}^{n} -\log\frac{\exp((W\mathbf{x}_i)_{y_i})}{\sum_k \exp((W\mathbf{x}_i)_k)}$

Regularization: a penalty for a complex model. The second term pays a penalty for the negative log-likelihood of the correct class.

Intuitively: big weights = a more complex model.

Model 1: 0.01*x1 + 1.3*x2 + -0.02*x3 + -2.1*x4 + 10
Model 2: 37.2*x1 + 13.4*x2 + 5.6*x3 + -6.1*x4 + 30

Page 69

Fitting Model

Again fitting a polynomial, but with regularization:

$\arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2, \quad \mathbf{X} = \begin{bmatrix} x_1^F & \cdots & x_1^2 & x_1 & 1 \\ \vdots & \ddots & \vdots & \vdots & \vdots \\ x_N^F & \cdots & x_N^2 & x_N & 1 \end{bmatrix}, \; \mathbf{w} = \begin{bmatrix} w_F \\ \vdots \\ w_0 \end{bmatrix}$

Page 70

Adding Regularization

No regularization: fits all data points.
Regularization: can’t fit all data points.

Page 71

In General

Error on new data comes from combination of:

1. Bias: model is oversimplified and can’t fit the underlying data

2. Variance: you don’t have the ability to estimate your model from limited data

3. Inherent: the data is intrinsically difficult

Bias and variance trade off: fixing one hurts the other.

Page 72

Underfitting and Overfitting

[Diagram adapted from D. Hoiem: training error and test error vs. model complexity. The left end of the complexity axis is high bias / low variance; the right end is low bias / high variance. The high-bias region is marked "Underfitting!"]

Page 73

Underfitting and Overfitting

[Diagram adapted from D. Hoiem: test error vs. model complexity, again from high bias / low variance (left) to low bias / high variance (right). With small data you overfit with a small model; with big data you overfit with a bigger model.]

Page 74

Underfitting

Do poorly on both training and validation data due to bias.

Solution:

1. More features

2. More powerful model

3. Reduce regularization

Page 75

Overfitting

Do well on training data, but poorly on validation data due to variance

Solution:

1. More data

2. Less powerful model

3. Regularize your model more

Cris Dima rule: first make sure you can overfit, then stop overfitting.

Page 76

Next Class

• Non-linear models (neural nets)