Linear Regression with Shrinkage
Some slides are due to Tommi Jaakkola, MIT AI Lab
Introduction
• The goal of regression is to make quantitative (real-valued) predictions on the basis of a (vector of) features or attributes.
• Examples: house prices, stock values, survival time, fuel efficiency of cars, etc.
Predicting vehicle fuel efficiency (mpg) from 8 attributes:
A generic regression problem
• The input attributes are given as fixed-length vectors x ∈ ℝ^M that could come from different sources: raw inputs, transformations of inputs (log, square root, etc.), or basis functions.
• The outputs are assumed to be real valued, y ∈ ℝ (with some possible restrictions).
• Given n i.i.d. training samples D = {(x_1, y_1), ..., (x_n, y_n)} from an unknown distribution P(x, y), the goal is to minimize the prediction error (loss) on new examples (x, y) drawn at random from the same P(x, y).
• An example of a loss function is the squared loss
  L(f(x), y) = (f(x) − y)^2
  where f(x) is our prediction for x.
Regression Function
• We need to define a class of functions (the types of predictions we make).
• Linear prediction:
  f(x; w_0, w_1) = w_0 + w_1 x
  where w_0, w_1 are the parameters we need to estimate.
• Typically we have a set of training data from which we estimate the parameters w_0, w_1.
Linear Regression
• We have training data (x_1, y_1), ..., (x_n, y_n), where each x_i = (x_i1, ..., x_iM) is a vector of measurements for the i-th case, or h(x_i) = (h_0(x_i), ..., h_M(x_i)) is a basis expansion of x_i.
• The prediction is linear in the parameters:
  y(x) ≈ f(x; w) = w_0 h_0(x) + Σ_{j=1}^M w_j h_j(x) = w^T h(x)
  (define h_0(x) = 1)
Basis Functions
• There are many basis functions we can use, e.g. (a design-matrix sketch follows below):
  • Polynomial: h_j(x) = x^(j−1)
  • Radial basis functions: h_j(x) = exp( −(x − μ_j)^2 / (2 s^2) )
  • Sigmoidal: h_j(x) = σ( (x − μ_j) / s )
  • Splines, Fourier, wavelets, etc.
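The following is a minimal sketch (not from the slides) of building basis-expanded design matrices in Python/NumPy; the basis parameters (polynomial degree, centers μ_j, width s) are illustrative assumptions.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Columns h_j(x) = x^(j-1); the constant column h_0 = 1 is included."""
    return np.vander(x, N=degree + 1, increasing=True)

def rbf_basis(x, centers, s):
    """Columns h_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant column."""
    H = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), H])

def sigmoid_basis(x, centers, s):
    """Columns h_j(x) = sigma((x - mu_j) / s), plus a constant column."""
    H = 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))
    return np.hstack([np.ones((len(x), 1)), H])

x = np.linspace(-3, 3, 50)
H_poly = polynomial_basis(x, degree=3)                             # 50 x 4
H_rbf = rbf_basis(x, centers=np.array([-2.0, 0.0, 2.0]), s=1.0)    # 50 x 4
```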
Estimation Criterion
• We need a fitting/estimation criterion to select appropriate values for the parameters w_0, w_1 based on the training set D = {(x_1, y_1), ..., (x_n, y_n)}.
• For example, we can use the empirical loss:
  J_n(w_0, w_1) = (1/n) Σ_{i=1}^n ( y_i − f(x_i; w_0, w_1) )^2
(note: the loss is the same as in evaluation)
Empirical loss: motivation
• Ideally, we would like to find the parameters w_0, w_1 that minimize the expected loss (unlimited training data):
  J(w_0, w_1) = E_{(x,y)~P} [ ( y − f(x; w_0, w_1) )^2 ]
  where the expectation is over samples from P(x, y).
• When the number of training examples n is large:
  E_{(x,y)~P} [ ( y − f(x; w_0, w_1) )^2 ]  ≈  (1/n) Σ_{i=1}^n ( y_i − f(x_i; w_0, w_1) )^2
  (expected loss ≈ empirical loss)
Linear Regression Estimation
• Minimize the empirical squared loss
  J_n(w_0, w_1) = (1/n) Σ_{i=1}^n ( y_i − f(x_i; w_0, w_1) )^2
                = (1/n) Σ_{i=1}^n ( y_i − w_0 − w_1 x_i )^2
• By setting the derivatives with respect to w_0, w_1 to zero we get necessary conditions for the "optimal" parameter values (a closed-form sketch follows below):
  ∂J_n(w_0, w_1)/∂w_1 = −(2/n) Σ_{i=1}^n ( y_i − w_0 − w_1 x_i ) x_i = 0
  ∂J_n(w_0, w_1)/∂w_0 = −(2/n) Σ_{i=1}^n ( y_i − w_0 − w_1 x_i ) = 0
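A minimal sketch, on assumed toy data, of solving these two conditions in closed form (w_1 = cov(x, y)/var(x), w_0 = ȳ − w_1 x̄):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=100)   # assumed toy model

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar

# At (w0, w1) both conditions hold: the residuals sum to zero and are
# uncorrelated with x.
residuals = y - w0 - w1 * x
print(w0, w1, residuals.sum(), (residuals * x).sum())   # the sums are ~0
```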
Interpretation
• The optimality conditions
  (2/n) Σ_{i=1}^n ( y_i − w_0 − w_1 x_i ) x_i = 0
  (2/n) Σ_{i=1}^n ( y_i − w_0 − w_1 x_i ) · 1 = 0
  ensure that the prediction error ε_i = y_i − w_0 − w_1 x_i is decorrelated with any linear function of the inputs.
Linear Regression: Matrix Form
• Stack the n outputs into a vector y = (y_1, ..., y_n)^T and the inputs into an n × (M+1) design matrix X whose i-th row is (1, x_i1, ..., x_iM); let w = (w_0, w_1, ..., w_M)^T.
• The empirical loss then becomes
  J_n(w) = (1/n) Σ_{i=1}^n ( y_i − w_0 − w_1 x_i )^2 = (1/n) ( y − Xw )^T ( y − Xw )
Linear Regression Solution
• By setting the derivative of ( y − Xw )^T ( y − Xw ) with respect to w to zero,
  (2/n) X^T ( y − Xw ) = 0,
  we get the solution:
  ŵ = ( X^T X )^{-1} X^T y
• The solution is a linear function of the outputs y.
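A minimal sketch of this closed form on assumed toy data; np.linalg.solve is used rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, M))])  # prepend constant column
w_true = np.array([1.0, 0.5, -2.0, 3.0])                   # assumed parameters
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)        # solves (X^T X) w = X^T y
# np.linalg.lstsq gives the same answer and is more stable when X^T X is
# ill conditioned:
w_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```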
Statistical view of linear regression
• In a statistical regression model we model both the function and the noise:
  Observed output = function + noise
  y(x) = f(x; w) + ε,  where, e.g., ε ~ N(0, σ²)
• Whatever we cannot capture with our chosen family of functions will be interpreted as noise.
Statistical view of linear regression
• f(x; w) is trying to capture the mean of the observations y given the input x:
  E[ y | x ] = E[ f(x; w) + ε | x ] = f(x; w)
  where E[ y | x ] is the conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution of X).
Statistical view of linear regression
• According to our statistical model
  y(x) = f(x; w) + ε,  ε ~ N(0, σ²),
  the outputs y given x are normally distributed with mean f(x; w) and variance σ²:
  p( y | x, w, σ² ) = 1/√(2πσ²) · exp( −( y − f(x; w) )² / (2σ²) )
  (we model the uncertainty in the predictions, not just the mean)
Maximum likelihood estimation
• Given observations D = {(x_1, y_1), ..., (x_n, y_n)}, we find the parameters w that maximize the likelihood of the outputs:
  L(w, σ²) = Π_{i=1}^n p( y_i | x_i, w, σ² )
           = ( 1/(2πσ²) )^{n/2} exp( −(1/(2σ²)) Σ_{i=1}^n ( y_i − f(x_i; w) )² )
• Maximize the log-likelihood:
  log L(w, σ²) = (n/2) log( 1/(2πσ²) ) − (1/(2σ²)) Σ_{i=1}^n ( y_i − f(x_i; w) )²
  Maximizing over w is equivalent to minimizing the sum of squared errors.
Maximum likelihood estimation
• Thus
  w_MLE = argmin_w Σ_{i=1}^n ( y_i − f(x_i; w) )²
• But the empirical squared loss is
  J_n(w) = (1/n) Σ_{i=1}^n ( y_i − f(x_i; w) )²
Least-squares Linear Regression is MLE for Gaussian noise !!!
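A small numerical check of this equivalence on assumed data: minimizing the Gaussian negative log-likelihood over w (with σ² fixed) recovers the least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.hstack([np.ones((40, 1)), rng.normal(size=(40, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + 0.2 * rng.normal(size=40)

sigma2 = 0.04   # assumed, fixed noise variance

def neg_log_likelihood(w):
    r = y - X @ w
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + (r @ r) / (2 * sigma2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_mle, w_ls, atol=1e-4))       # True: MLE matches least squares
```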
Linear Regression (is it “good”?)
• Simple model
• Straightforward solution
BUT
• The least-squares estimate is not a good estimator for prediction error
• The matrix X^T X could be ill conditioned:
  • Inputs are correlated
  • Input dimension is large
  • Training set is small
Linear Regression - Example
In this example the output is generated by the model (where ε is a small white Gaussian noise):
  y = 0.8 x_1 + 0.2 x_2 + ε
Three training sets with different correlations between the two inputs were randomly chosen, and the linear regression solution was applied.
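A sketch reproducing the flavor of this experiment (the sample size, noise level, and correlation values are assumptions, not the slides' exact settings):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
for rho in (0.0, 0.9, 0.999):     # no / strong / very strong correlation
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    y = 0.8 * X[:, 0] + 0.2 * X[:, 1] + 0.05 * rng.normal(size=n)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ w) ** 2)
    print(f"rho={rho}: w={np.round(w, 2)}, RSS={rss:.3f}")
# With strongly correlated inputs the two coefficients can become very large
# with opposite signs while the RSS stays small.
```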
Linear Regression – Example
And the results are…
[Figure: fitted regression for each of the three training sets; panel annotations:]
  x_1, x_2 uncorrelated:          b = (0.807, 0.209), RSS = 0.090
  x_1, x_2 correlated:            b = (0.435, 0.568), RSS = 0.105
  x_1, x_2 strongly correlated:   b = (30.24, −29.22), RSS = 0.073
[A second set of annotations reports w = (30.36, −29.21) with RSS = 0.47; w = (0.42, 0.56) with RSS = 0.204; and w = (0.82, 0.209) with RSS = 0.29.]
Strong correlation can cause the coefficients to be very large, and they will cancel each other to achieve a good RSS.
Linear Regression
What can be done?
Shrinkage methods
• Ridge Regression
• Lasso
• PCR (Principal Components Regression)
• PLS (Partial Least Squares)
Shrinkage methods
Before we proceed:
• Since the following methods are not invariant under input scale, we will assume the inputs are normalized (mean 0, variance 1); see the preprocessing sketch below:
  x′_ij ← ( x_ij − x̄_j ) / σ_j
• The offset is always estimated as w_0 = (1/N) Σ_i y_i, and we will work with centered y (meaning y ← y − w_0).
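A minimal sketch of this preprocessing (the function names are illustrative):

```python
import numpy as np

def standardize_inputs(X):
    """Subtract the column means and divide by the column standard deviations."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def center_outputs(y):
    """Return (y - w0, w0), where w0 = mean(y) plays the role of the offset."""
    w0 = y.mean()
    return y - w0, w0
```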
Ridge Regression
• Ridge regression shrinks the regression coefficients by imposing a penalty on their size (also called weight decay).
• In ridge regression, we add a quadratic penalty on the weights:
  J(w) = Σ_{i=1}^N ( y_i − w_0 − Σ_{j=1}^M x_ij w_j )² + λ Σ_{j=1}^M w_j²
where λ ≥ 0 is a tuning parameter that controls the amount of shrinkage.
• The size constraint prevents the phenomenon of wildly large coefficients that cancel each other from occurring.
Ridge Regression Solution
• Ridge regression in matrix form (a numerical sketch follows below):
  J(w) = ( y − Xw )^T ( y − Xw ) + λ w^T w
• The solution is
  ŵ_ridge = ( X^T X + λ I_M )^{-1} X^T y    (compare ŵ_LS = ( X^T X )^{-1} X^T y)
• The solution adds a positive constant to the diagonal of X^T X before inversion. This makes the problem non-singular even if X does not have full column rank.
• For orthogonal inputs the ridge estimates are scaled versions of the least squares estimates:
  ŵ_ridge = γ ŵ_LS,  0 ≤ γ ≤ 1
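A minimal sketch of the closed-form ridge estimate on assumed data; the λ values are illustrative, and in practice X is standardized and y centered as described above.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam I)^(-1) X^T y."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + 0.1 * rng.normal(size=30)

for lam in (0.0, 1.0, 10.0):
    print(lam, np.round(ridge_fit(X, y, lam), 2))   # larger lam shrinks the coefficients
```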
Ridge Regression (insights)
The matrix X can be represented by its SVD:
  X = U D V^T
• U is an N×M matrix; its columns span the column space of X
• D is an M×M diagonal matrix of singular values
• V is an M×M matrix; its columns span the row space of X
Let's see how this method looks in the principal-component coordinates of X.
Ridge Regression (insights)
Work in the principal-component coordinates of X, i.e. rotate the inputs and weights by V:
  X′ = X V,  w′ = V^T w,  so that X w = X′ w′
The least squares solution ŵ_ls = ( X^T X )^{-1} X^T y becomes
  ŵ′_ls = V^T ŵ_ls = V^T ( V D U^T U D V^T )^{-1} V D U^T y = D^{-1} U^T y
The ridge regression solution is similarly given by
  ŵ′_ridge = V^T ŵ_ridge = V^T ( V D U^T U D V^T + λ I )^{-1} V D U^T y
           = ( D² + λ I )^{-1} D U^T y = ( D² + λ I )^{-1} D² ŵ′_ls
where ( D² + λ I )^{-1} D² is diagonal.
Ridge Regression (insights)
  ŵ′_ridge,j = d_j² / ( d_j² + λ ) · ŵ′_ls,j
In the PCA axes, the ridge coefficients are just scaled LS coefficients!
The coefficients that correspond to smaller input-variance directions (smaller d_j) are scaled down more.
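A small numerical check, on assumed data, that the ridge coefficients in the SVD (principal-component) coordinates are the LS coefficients scaled by d_j²/(d_j² + λ):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
lam = 2.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)    # X = U diag(d) V^T
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

w_ls_pc = Vt @ w_ls                                  # coefficients in the PC axes
w_ridge_pc = Vt @ w_ridge
print(np.allclose(w_ridge_pc, d**2 / (d**2 + lam) * w_ls_pc))   # True
```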
Ridge Regression
In the following simulations, quantities are plotted versus
  df(λ) = Σ_{j=1}^M d_j² / ( d_j² + λ )
This monotonically decreasing function of λ is the effective degrees of freedom of the ridge regression fit.
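A minimal sketch computing df(λ) from the singular values of an assumed design matrix:

```python
import numpy as np

def effective_df(X, lam):
    """Effective degrees of freedom sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))

X = np.random.default_rng(6).normal(size=(40, 4))
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(effective_df(X, lam), 2))   # decreases from M toward 0
```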
Ridge Regression
Simulation results: 4 dimensions (w = 1, 2, 3, 4), for a training set with no correlation and one with strong correlation between the inputs.
[Figure: coefficient estimates (w) and SSR plotted versus the effective dimension.]
Ridge regression is MAP with Gaussian prior
  J(w) = log P(D | w) P(w)
       = log Π_{i=1}^n N( y_i | w^T x_i, σ² ) · N( w | 0, τ² )
       = −(1/(2σ²)) ( y − Xw )^T ( y − Xw ) − (1/(2τ²)) w^T w + const
This is the same objective function that ridge solves,
  Ridge:  J(w) = ( y − Xw )^T ( y − Xw ) + λ w^T w,
using λ = σ² / τ².
Lasso
Lasso is a shrinkage method like ridge, with a subtle but important difference. It is defined by
  ŵ_lasso = argmin_w Σ_{i=1}^N ( y_i − Σ_{j=1}^M x_ij w_j )²
  subject to  Σ_{j=1}^M | w_j | ≤ t
There is a similarity to the ridge regression problem: the L2 ridge regression penalty is replaced by the L1 lasso penalty.
The L1 penalty makes the solution non-linear in y and requires a quadratic programming algorithm to compute it.
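A sketch using scikit-learn's Lasso on assumed data (the slides do not prescribe a solver). Note that scikit-learn solves the penalized (Lagrangian) form (1/(2n))·||y − Xw||² + α·||w||₁ rather than the constrained form above; every constraint level t corresponds to some α.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 2.0]) + 0.1 * rng.normal(size=50)

for alpha in (0.01, 0.1, 1.0):
    w = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(w, 2))   # larger alpha drives more coefficients to exactly 0
```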
Lasso
If t is chosen to be larger than t_0 ≡ Σ_{j=1}^M | ŵ_j^ls |, then the lasso estimate is identical to the least squares estimate. On the other hand, for say t = t_0/2, the least squares coefficients are shrunk by about 50% on average.
For convenience we will plot the simulation results versus shrinkage factor s:
  s ≡ t / t_0 = t / Σ_{j=1}^M | ŵ_j^ls |
Lasso
Simulation results: no correlation vs. strong correlation between the inputs.
[Figure: coefficient estimates (w) and SSR plotted versus the shrinkage factor s.]
L2 vs L1 penalties
[Figure: contours of the LS error function together with the two constraint regions.]
  L1 (lasso):  | β_1 | + | β_2 | ≤ t        L2 (ridge):  β_1² + β_2² ≤ t²
In the lasso the constraint region has corners (e.g., at β_1 = 0); when the solution hits a corner the corresponding coefficient becomes 0 (when M > 2, more than one can become 0).
Problem: Figure 1 plots linear regression results on the basis of only three data points. We used various types of regularization to obtain the plots (see below) but got confused about which plot corresponds to which regularization method. Please assign each plot to one (and only one) of the following regularization methods.