Lecture 5: Linear Regression with Regularization
CSC 84020 - Machine Learning
Andrew Rosenberg
February 19, 2009
Today
Linear Regression with Regularization
Recap
Linear Regression
Given a target vector t, and data matrix X.
Goal: Identify the best parameters for a regression function

y = w_0 + \sum_{i=1}^{N} w_i x_i

w = (X^T X)^{-1} X^T t
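As a concrete illustration (my own sketch, not from the slides), the closed-form solution in NumPy, assuming X already contains a bias column of ones:

```python
import numpy as np

def fit_linear_regression(X, t):
    """Closed-form least squares: w = (X^T X)^{-1} X^T t.

    Uses np.linalg.solve rather than an explicit inverse for
    numerical stability; assumes X^T X is invertible.
    """
    return np.linalg.solve(X.T @ X, X.T @ t)

# Toy usage: recover w0 = 1, w1 = 2 from noiseless data.
X = np.column_stack([np.ones(5), np.arange(5.0)])
t = 1.0 + 2.0 * np.arange(5.0)
w = fit_linear_regression(X, t)   # approximately [1.0, 2.0]
```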
Closed form solution for linear regression
This solution is based on
Maximum Likelihood estimation under an assumption of Gaussian likelihood

Empirical Risk Minimization under an assumption of squared error

The extension to Basis Functions gives linear regression significant power.
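As a hedged sketch of the basis-function idea (function names are my own), the same closed-form fit applied to a polynomial design matrix; the model stays linear in w even though it is nonlinear in x:

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Map scalar inputs x to the basis [1, x, x^2, ..., x^M]."""
    return np.vander(np.asarray(x, dtype=float), N=M + 1, increasing=True)

x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x)
Phi = polynomial_design_matrix(x, M=3)
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # same normal equations as before
```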
Revisiting overfitting
Overfitting occurs when a model captures idiosyncrasies of the input data, rather than generalizing.
Too many parameters relative to the amount of training data
For example, an order-N polynomial can exactly fit N + 1 data points.
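A quick numerical check of that claim (my own sketch, not part of the lecture): a degree-9 polynomial fit to 10 noisy points drives the training residual to essentially zero, fitting the noise exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)                 # N + 1 = 10 points
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

coeffs = np.polyfit(x, t, deg=9)              # order N = 9 polynomial
max_residual = np.max(np.abs(t - np.polyval(coeffs, x)))
print(max_residual)                           # ~0: every training point is hit
```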
Overfitting Example
Avoiding Overfitting
Ways of detecting/avoiding overfitting.
Use more data
Evaluate on a parameter tuning set
Regularization
Take a Bayesian approach
Regularization
In a Linear Regression model, overfitting is characterized by large parameters.
        M=0     M=1      M=3           M=9
w0      0.19    0.82     0.31          0.35
w1             -1.27     7.99        232.37
w2                     -25.43      -5321.83
w3                      17.37      48568.31
w4                               -231639.30
w5                                640042.26
w6                              -1061800.52
w7                               1042400.18
w8                               -557682.99
w9                                125201.43
Regularization
Introduce a penalty term for the size of the weights.
Unregularized Regression
E(w) = \frac{1}{2} \sum_{n=0}^{N-1} \{ t_n - y(x_n, w) \}^2
Regularized Regression (L2-Regularization or Ridge Regularization)

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \frac{\lambda}{2} \|w\|^2
Note: Large λ leads to higher complexity penalization.
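As a small illustrative sketch (my own, assuming y(x, w) is the linear model Xw), the two objectives side by side:

```python
import numpy as np

def unregularized_error(w, X, t):
    """E(w) = 1/2 * sum_n (t_n - y(x_n, w))^2 with y(x, w) = Xw."""
    r = t - X @ w
    return 0.5 * r @ r

def ridge_error(w, X, t, lam):
    """Adds the L2 penalty: E(w) = ... + lam/2 * ||w||^2."""
    return unregularized_error(w, X, t) + 0.5 * lam * w @ w
```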
Least Squares Regression with L2-Regularization

\nabla_w E(w) = 0

\nabla_w \left[ \frac{1}{2} \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2 + \frac{\lambda}{2} \|w\|^2 \right] = 0

\nabla_w \left[ \frac{1}{2} \|t - Xw\|^2 + \frac{\lambda}{2} \|w\|^2 \right] = 0

\nabla_w \left[ \frac{1}{2} (t - Xw)^T (t - Xw) + \frac{\lambda}{2} w^T w \right] = 0

-X^T t + X^T X w + \lambda w = 0

-X^T t + X^T X w + \lambda I w = 0

-X^T t + (X^T X + \lambda I) w = 0

(X^T X + \lambda I) w = X^T t

w = (X^T X + \lambda I)^{-1} X^T t
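A minimal NumPy sketch of this closed-form ridge solution (my own code; whether the bias weight should also be penalized is a design choice the slides do not address):

```python
import numpy as np

def fit_ridge(X, t, lam):
    """Regularized least squares: w = (X^T X + lam * I)^{-1} X^T t."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)

# lam = 0 recovers the unregularized solution; larger lam shrinks the weights.
```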
Regularization Results
Further Regularization
Regularization Approaches

L2-Regularization (closed form, solvable in polynomial time):

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \frac{\lambda}{2} \|w\|^2

L1-Regularization (can be approximated in polynomial time):

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \lambda \|w\|_1

L0-Regularization (NP-complete optimization):

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \lambda \sum_{n=0}^{N-1} \mathbb{1}(w_n \neq 0)

The L0-norm represents the optimal subset of features needed by a Regression model.

How can we optimize each of these functions?
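For comparison, a hedged scikit-learn sketch (assuming that library is available; it calls the regularization strength `alpha`): Ridge solves the L2 problem in closed form, Lasso approximates the L1 problem iteratively, and the sparse Lasso weights stand in for the L0-style feature selection, which has no general polynomial-time solver.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
t = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(50)

ridge = Ridge(alpha=1.0).fit(X, t)   # L2: shrinks all weights smoothly
lasso = Lasso(alpha=0.1).fit(X, t)   # L1: drives many weights to exactly zero
print(ridge.coef_)
print(lasso.coef_)                   # sparse: implicit feature selection
```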
Curse of Dimensionality
Increasing the dimensionality of the feature space exponentially increases the data needs.

Note: The dimensionality of the feature space = the number of features.

What is the message of this?

Models should be small relative to the amount of available data.

Dimensionality Reduction techniques and feature selection can help.

L0-regularization is feature selection for linear models. L1- and L2-regularizations approximate feature selection and regularize the function.
Curse of Dimensionality Example
Assume a cell requires 100 data points to generalize properly, and 3-ary multinomial features.

One dimension requires 300 data points.

Two dimensions require 900 data points.

Three dimensions require 2,700 data points.

In this example, for D-dimensional model fitting, the data requirement is 3^D × 100.
Argument against the Kitchen Sink approach.
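The arithmetic behind those counts, as a one-line sketch (my own illustration):

```python
# 3-ary features: 3**D cells in D dimensions, each needing ~100 points.
for D in (1, 2, 3):
    print(D, 3**D * 100)   # 300, 900, 2700
```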
Bayesians v. Frequentists
What is a Probability?
The Frequentist position
A probability is the likelihood that an event will happen.
It is approximated as the ratio of the number of times the event happened to the total number of events.
Assessment is very important to select a model.
Point Estimates are fine: n/N.
The Bayesian position
A probability is the degree of believability that the event will happen.
Bayesians require that probabilities be conditioned on data, p(y|x).
The Bayesian approach is optimal given a good model, a good prior, and a good loss function; don't worry about assessment as much.

Bayesians say: if you are ever making a point estimate, you've made a mistake. The only valid probabilities are posteriors based on evidence given some prior.
Bayesian Linear Regression
In the previous derivation of the linear regression optimization, we made point estimates for the weight vector, w.

Bayesians would say stop right there: use a distribution over w to estimate the parameters.
p(w|\alpha) = \mathcal{N}(w|0, \alpha^{-1} I) = \left( \frac{\alpha}{2\pi} \right)^{(M+1)/2} \exp\left( -\frac{\alpha}{2} w^T w \right)

\alpha is a hyperparameter over w; it is the precision, or inverse variance, of the distribution.
So, optimize
p(w|x, t, \alpha, \beta) \propto p(t|x, w, \beta)\, p(w|\alpha)
Bayesian Linear Regression

p(w|x, t, \alpha, \beta) \propto p(t|x, w, \beta)\, p(w|\alpha)

Again, optimizing the log likelihood yields a simpler solution:

\ln p(t|x, w, \beta) + \ln p(w|\alpha)

p(t|x, w, \beta) = \prod_{n=0}^{N-1} \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{\beta}{2} (t_n - y(x_n, w))^2 \right)

\ln p(t|x, w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \frac{\beta}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2

\ln p(w|\alpha) = \frac{M+1}{2} \ln \alpha - \frac{M+1}{2} \ln 2\pi - \frac{\alpha}{2} w^T w

\ln p(t|x, w, \beta) + \ln p(w|\alpha) = -\frac{\beta}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 - \frac{\alpha}{2} w^T w + \text{const}
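To connect this back to regularization, a hedged NumPy sketch (my own, taking y(x, w) = Xw): maximizing the log posterior above is the same as minimizing the L2-regularized squared error with λ = α/β, so the MAP weights coincide with the ridge solution.

```python
import numpy as np

def map_weights(X, t, alpha, beta):
    """MAP estimate under a N(0, alpha^{-1} I) prior and Gaussian noise with
    precision beta: equivalent to ridge regression with lam = alpha / beta.
    """
    lam = alpha / beta
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)
```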
Broader Context
Overfitting is bad.
Bayesians v. Frequentists.
Does it matter which camp you lie in?
Not particularly, but Bayesian approaches allow us some useful, interesting, and principled tools.
Bye
Next: Categorization

Logistic Regression

Naive Bayes