Linear Regression with Shrinkage
Some slides are due to Tommi Jaakkola, MIT AI Lab
Introduction
• The goal of regression is to make quantitative (real-valued) predictions on the basis of a (vector of) features or attributes.
• Examples: house prices, stock values, survival time, fuel efficiency of cars, etc.
Predicting vehicle fuel efficiency (mpg) from 8 attributes:
A generic regression problem
• The input attributes are given as fixed-length vectors x ∈ ℝ^M that could come from different sources: raw inputs, transformations of inputs (log, square root, etc.), or basis functions.
• The outputs are assumed to be real valued, y ∈ ℝ (with some possible restrictions).
• Given n i.i.d. training samples D = {(x_1, y_1), ..., (x_n, y_n)} from an unknown distribution P(x, y), the goal is to minimize the prediction error (loss) on new examples (x, y) drawn at random from the same P(x, y).
• An example of a loss function is the squared loss
  L(f(x), y) = (f(x) − y)^2
  where f(x) is our prediction for x.
Regression Function
• We need to define a class of functions (the types of predictions we make).
• Linear prediction:
  f(x; w_0, w_1) = w_0 + w_1 x
  where w_0, w_1 are the parameters we need to estimate.
• Typically we have a set of training data from which we estimate the parameters w_0, w_1.
Linear Regression
• We have training data (x_1, y_1), ..., (x_n, y_n), where each x_i = (x_i1, ..., x_iM) is a vector of measurements for the i-th case, or h(x_i) = (h_0(x_i), ..., h_M(x_i)) is a basis expansion of x_i.
• The prediction is linear in the parameters:
  y(x) ≈ f(x; w) = w_0 h_0(x) + Σ_{j=1}^M w_j h_j(x) = w^T h(x)
  (define h_0(x) = 1)
Basis Functions
• There are many basis functions we can use, e.g. (a design-matrix sketch follows below):
  • Polynomial: h_j(x) = x^(j−1)
  • Radial basis functions: h_j(x) = exp( −(x − μ_j)^2 / (2 s^2) )
  • Sigmoidal: h_j(x) = σ( (x − μ_j) / s )
  • Splines, Fourier, wavelets, etc.
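The following is a minimal sketch (not from the slides) of building basis-expanded design matrices in Python/NumPy; the basis parameters (polynomial degree, centers μ_j, width s) are illustrative assumptions.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Columns h_j(x) = x^(j-1); the constant column h_0 = 1 is included."""
    return np.vander(x, N=degree + 1, increasing=True)

def rbf_basis(x, centers, s):
    """Columns h_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant column."""
    H = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), H])

def sigmoid_basis(x, centers, s):
    """Columns h_j(x) = sigma((x - mu_j) / s), plus a constant column."""
    H = 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))
    return np.hstack([np.ones((len(x), 1)), H])

x = np.linspace(-3, 3, 50)
H_poly = polynomial_basis(x, degree=3)                             # 50 x 4
H_rbf = rbf_basis(x, centers=np.array([-2.0, 0.0, 2.0]), s=1.0)    # 50 x 4
```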
Estimation Criterion
• We need a fitting/estimation criterion to select appropriate values for the parameters w_0, w_1 based on the training set D = {(x_1, y_1), ..., (x_n, y_n)}.
• For example, we can use the empirical loss:
  J_n(w_0, w_1) = (1/n) Σ_{i=1}^n ( y_i − f(x_i; w_0, w_1) )^2
(note: the loss is the same as in evaluation)
Empirical loss: motivation
• Ideally, we would like to find the parameters w_0, w_1 that minimize the expected loss (unlimited training data):
  J(w_0, w_1) = E_{(x,y)~P} [ ( y − f(x; w_0, w_1) )^2 ]
  where the expectation is over samples from P(x, y).
• When the number of training examples n is large:
  E_{(x,y)~P} [ ( y − f(x; w_0, w_1) )^2 ]  ≈  (1/n) Σ_{i=1}^n ( y_i − f(x_i; w_0, w_1) )^2
  (expected loss ≈ empirical loss)
Linear Regression Estimation
• Minimize the empirical squared loss
  J_n(w_0, w_1) = (1/n) Σ_{i=1}^n ( y_i − f(x_i; w_0, w_1) )^2
                = (1/n) Σ_{i=1}^n ( y_i − w_0 − w_1 x_i )^2
• By setting the derivatives with respect to w_0, w_1 to zero we get necessary conditions for the "optimal" parameter values (a closed-form sketch follows below):
  ∂J_n(w_0, w_1)/∂w_1 = −(2/n) Σ_{i=1}^n ( y_i − w_0 − w_1 x_i ) x_i = 0
  ∂J_n(w_0, w_1)/∂w_0 = −(2/n) Σ_{i=1}^n ( y_i − w_0 − w_1 x_i ) = 0
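A minimal sketch, on assumed toy data, of solving these two conditions in closed form (w_1 = cov(x, y)/var(x), w_0 = ȳ − w_1 x̄):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=100)   # assumed toy model

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar

# At (w0, w1) both conditions hold: the residuals sum to zero and are
# uncorrelated with x.
residuals = y - w0 - w1 * x
print(w0, w1, residuals.sum(), (residuals * x).sum())   # the sums are ~0
```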
Interpretation
• The optimality conditions
  (2/n) Σ_{i=1}^n ( y_i − w_0 − w_1 x_i ) x_i = 0
  (2/n) Σ_{i=1}^n ( y_i − w_0 − w_1 x_i ) · 1 = 0
  ensure that the prediction error ε_i = y_i − w_0 − w_1 x_i is decorrelated with any linear function of the inputs.
Linear Regression: Matrix Form
• Stack the n outputs into a vector y = (y_1, ..., y_n)^T and the inputs into an n × (M+1) design matrix X whose i-th row is (1, x_i1, ..., x_iM); let w = (w_0, w_1, ..., w_M)^T.
• The empirical loss then becomes
  J_n(w) = (1/n) Σ_{i=1}^n ( y_i − w_0 − w_1 x_i )^2 = (1/n) ( y − Xw )^T ( y − Xw )
Linear Regression Solution
• By setting the derivative of ( y − Xw )^T ( y − Xw ) with respect to w to zero,
  (2/n) X^T ( y − Xw ) = 0,
  we get the solution:
  ŵ = ( X^T X )^{-1} X^T y
• The solution is a linear function of the outputs y.
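A minimal sketch of this closed form on assumed toy data; np.linalg.solve is used rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, M))])  # prepend constant column
w_true = np.array([1.0, 0.5, -2.0, 3.0])                   # assumed parameters
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)        # solves (X^T X) w = X^T y
# np.linalg.lstsq gives the same answer and is more stable when X^T X is
# ill conditioned:
w_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```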
Statistical view of linear regression
• In a statistical regression model we model both the function and the noise:
  Observed output = function + noise
  y(x) = f(x; w) + ε,  where, e.g., ε ~ N(0, σ²)
• Whatever we cannot capture with our chosen family of functions will be interpreted as noise.
Statistical view of linear regression
• f(x; w) is trying to capture the mean of the observations y given the input x:
  E[ y | x ] = E[ f(x; w) + ε | x ] = f(x; w)
  where E[ y | x ] is the conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution of X).
Statistical view of linear regression
• According to our statistical model
  y(x) = f(x; w) + ε,  ε ~ N(0, σ²),
  the outputs y given x are normally distributed with mean f(x; w) and variance σ²:
  p( y | x, w, σ² ) = 1/√(2πσ²) · exp( −( y − f(x; w) )² / (2σ²) )
  (we model the uncertainty in the predictions, not just the mean)
Maximum likelihood estimation
• Given observations D = {(x_1, y_1), ..., (x_n, y_n)}, we find the parameters w that maximize the likelihood of the outputs:
  L(w, σ²) = Π_{i=1}^n p( y_i | x_i, w, σ² )
           = ( 1/(2πσ²) )^{n/2} exp( −(1/(2σ²)) Σ_{i=1}^n ( y_i − f(x_i; w) )² )
• Maximize the log-likelihood:
  log L(w, σ²) = (n/2) log( 1/(2πσ²) ) − (1/(2σ²)) Σ_{i=1}^n ( y_i − f(x_i; w) )²
  Maximizing over w is equivalent to minimizing the sum of squared errors.
Maximum likelihood estimation
• Thus
  w_MLE = argmin_w Σ_{i=1}^n ( y_i − f(x_i; w) )²
• But the empirical squared loss is
  J_n(w) = (1/n) Σ_{i=1}^n ( y_i − f(x_i; w) )²
Least-squares Linear Regression is MLE for Gaussian noise !!!
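A small numerical check of this equivalence on assumed data: minimizing the Gaussian negative log-likelihood over w (with σ² fixed) recovers the least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.hstack([np.ones((40, 1)), rng.normal(size=(40, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + 0.2 * rng.normal(size=40)

sigma2 = 0.04   # assumed, fixed noise variance

def neg_log_likelihood(w):
    r = y - X @ w
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + (r @ r) / (2 * sigma2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_mle, w_ls, atol=1e-4))       # True: MLE matches least squares
```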
Linear Regression (is it “good”?)
• Simple model
• Straightforward solution
BUT
• The least-squares estimate is not a good estimator for prediction error
• The matrix X^T X could be ill conditioned:
  • Inputs are correlated
  • Input dimension is large
  • Training set is small
Linear Regression - Example
In this example the output is generated by the model (where ε is a small white Gaussian noise):
  y = 0.8 x_1 + 0.2 x_2 + ε
Three training sets with different correlations between the two inputs were randomly chosen, and the linear regression solution was applied.
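A sketch reproducing the flavor of this experiment (the sample size, noise level, and correlation values are assumptions, not the slides' exact settings):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
for rho in (0.0, 0.9, 0.999):     # no / strong / very strong correlation
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    y = 0.8 * X[:, 0] + 0.2 * X[:, 1] + 0.05 * rng.normal(size=n)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ w) ** 2)
    print(f"rho={rho}: w={np.round(w, 2)}, RSS={rss:.3f}")
# With strongly correlated inputs the two coefficients can become very large
# with opposite signs while the RSS stays small.
```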
Linear Regression – Example
And the results are…
[Figure: fitted regression for each of the three training sets; panel annotations:]
  x_1, x_2 uncorrelated:          b = (0.807, 0.209), RSS = 0.090
  x_1, x_2 correlated:            b = (0.435, 0.568), RSS = 0.105
  x_1, x_2 strongly correlated:   b = (30.24, −29.22), RSS = 0.073
[A second set of annotations reports w = (30.36, −29.21) with RSS = 0.47; w = (0.42, 0.56) with RSS = 0.204; and w = (0.82, 0.209) with RSS = 0.29.]
Strong correlation can cause the coefficients to be very large, and they will cancel each other to achieve a good RSS.
Linear Regression
What can be done?
Shrinkage methods
• Ridge Regression
• Lasso
• PCR (Principal Components Regression)
• PLS (Partial Least Squares)
Shrinkage methods
Before we proceed:
• Since the following methods are not invariant under input scale, we will assume the inputs are normalized (mean 0, variance 1); see the preprocessing sketch below:
  x′_ij ← ( x_ij − x̄_j ) / σ_j
• The offset is always estimated as w_0 = (1/N) Σ_i y_i, and we will work with centered y (meaning y ← y − w_0).
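A minimal sketch of this preprocessing (the function names are illustrative):

```python
import numpy as np

def standardize_inputs(X):
    """Subtract the column means and divide by the column standard deviations."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def center_outputs(y):
    """Return (y - w0, w0), where w0 = mean(y) plays the role of the offset."""
    w0 = y.mean()
    return y - w0, w0
```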
Ridge Regression
• Ridge regression shrinks the regression coefficients by imposing a penalty on their size (also called weight decay).
• In ridge regression, we add a quadratic penalty on the weights:
  J(w) = Σ_{i=1}^N ( y_i − w_0 − Σ_{j=1}^M x_ij w_j )² + λ Σ_{j=1}^M w_j²
where λ ≥ 0 is a tuning parameter that controls the amount of shrinkage.
• The size constraint prevents the phenomenon of wildly large coefficients that cancel each other from occurring.
Ridge Regression Solution
• Ridge regression in matrix form (a numerical sketch follows below):
  J(w) = ( y − Xw )^T ( y − Xw ) + λ w^T w
• The solution is
  ŵ_ridge = ( X^T X + λ I_M )^{-1} X^T y    (compare ŵ_LS = ( X^T X )^{-1} X^T y)
• The solution adds a positive constant to the diagonal of X^T X before inversion. This makes the problem non-singular even if X does not have full column rank.
• For orthogonal inputs the ridge estimates are scaled versions of the least squares estimates:
  ŵ_ridge = γ ŵ_LS,  0 ≤ γ ≤ 1
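A minimal sketch of the closed-form ridge estimate on assumed data; the λ values are illustrative, and in practice X is standardized and y centered as described above.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam I)^(-1) X^T y."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + 0.1 * rng.normal(size=30)

for lam in (0.0, 1.0, 10.0):
    print(lam, np.round(ridge_fit(X, y, lam), 2))   # larger lam shrinks the coefficients
```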
Ridge Regression (insights)
The matrix X can be represented by its SVD:
  X = U D V^T
• U is an N×M matrix; its columns span the column space of X
• D is an M×M diagonal matrix of singular values
• V is an M×M matrix; its columns span the row space of X
Let's see how this method looks in the principal-component coordinates of X.
Ridge Regression (insights)
Work in the principal-component coordinates of X, i.e. rotate the inputs and weights by V:
  X′ = X V,  w′ = V^T w,  so that X w = X′ w′
The least squares solution ŵ_ls = ( X^T X )^{-1} X^T y becomes
  ŵ′_ls = V^T ŵ_ls = V^T ( V D U^T U D V^T )^{-1} V D U^T y = D^{-1} U^T y
The ridge regression solution is similarly given by
  ŵ′_ridge = V^T ŵ_ridge = V^T ( V D U^T U D V^T + λ I )^{-1} V D U^T y
           = ( D² + λ I )^{-1} D U^T y = ( D² + λ I )^{-1} D² ŵ′_ls
where ( D² + λ I )^{-1} D² is diagonal.
Ridge Regression (insights)
  ŵ′_ridge,j = d_j² / ( d_j² + λ ) · ŵ′_ls,j
In the PCA axes, the ridge coefficients are just scaled LS coefficients!
The coefficients that correspond to smaller input-variance directions (smaller d_j) are scaled down more.
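A small numerical check, on assumed data, that the ridge coefficients in the SVD (principal-component) coordinates are the LS coefficients scaled by d_j²/(d_j² + λ):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
lam = 2.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)    # X = U diag(d) V^T
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

w_ls_pc = Vt @ w_ls                                  # coefficients in the PC axes
w_ridge_pc = Vt @ w_ridge
print(np.allclose(w_ridge_pc, d**2 / (d**2 + lam) * w_ls_pc))   # True
```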
Ridge Regression
In the following simulations, quantities are plotted versus
  df(λ) = Σ_{j=1}^M d_j² / ( d_j² + λ )
This monotonically decreasing function of λ is the effective degrees of freedom of the ridge regression fit.
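A minimal sketch computing df(λ) from the singular values of an assumed design matrix:

```python
import numpy as np

def effective_df(X, lam):
    """Effective degrees of freedom sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))

X = np.random.default_rng(6).normal(size=(40, 4))
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(effective_df(X, lam), 2))   # decreases from M toward 0
```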
Ridge Regression
Simulation results: 4 dimensions (w = 1, 2, 3, 4), for a training set with no correlation and one with strong correlation between the inputs.
[Figure: coefficient estimates (w) and SSR plotted versus the effective dimension.]
Ridge regression is MAP with Gaussian prior
  J(w) = log P(D | w) P(w)
       = log Π_{i=1}^n N( y_i | w^T x_i, σ² ) · N( w | 0, τ² )
       = −(1/(2σ²)) ( y − Xw )^T ( y − Xw ) − (1/(2τ²)) w^T w + const
This is the same objective function that ridge solves,
  Ridge:  J(w) = ( y − Xw )^T ( y − Xw ) + λ w^T w,
using λ = σ² / τ².
Lasso
Lasso is a shrinkage method like ridge, with a subtle but important difference. It is defined by
  ŵ_lasso = argmin_w Σ_{i=1}^N ( y_i − Σ_{j=1}^M x_ij w_j )²
  subject to  Σ_{j=1}^M | w_j | ≤ t
There is a similarity to the ridge regression problem: the L2 ridge regression penalty is replaced by the L1 lasso penalty.
The L1 penalty makes the solution non-linear in y and requires a quadratic programming algorithm to compute it.
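A sketch using scikit-learn's Lasso on assumed data (the slides do not prescribe a solver). Note that scikit-learn solves the penalized (Lagrangian) form (1/(2n))·||y − Xw||² + α·||w||₁ rather than the constrained form above; every constraint level t corresponds to some α.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 2.0]) + 0.1 * rng.normal(size=50)

for alpha in (0.01, 0.1, 1.0):
    w = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(w, 2))   # larger alpha drives more coefficients to exactly 0
```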
Lasso
If t is chosen to be larger than t_0 ≡ Σ_{j=1}^M | ŵ_j^ls |, then the lasso estimate is identical to the least squares estimate. On the other hand, for say t = t_0/2, the least squares coefficients are shrunk by about 50% on average.
For convenience we will plot the simulation results versus shrinkage factor s:
  s ≡ t / t_0 = t / Σ_{j=1}^M | ŵ_j^ls |
Lasso
Simulation results: no correlation vs. strong correlation between the inputs.
[Figure: coefficient estimates (w) and SSR plotted versus the shrinkage factor s.]
L2 vs L1 penalties
[Figure: contours of the LS error function together with the two constraint regions.]
  L1 (lasso):  | β_1 | + | β_2 | ≤ t        L2 (ridge):  β_1² + β_2² ≤ t²
In the lasso the constraint region has corners (e.g., at β_1 = 0); when the solution hits a corner the corresponding coefficient becomes 0 (when M > 2, more than one can become 0).
Problem: Figure 1 plots linear regression results on the basis of only three data points. We used various types of regularization to obtain the plots (see below) but got confused about which plot corresponds to which regularization method. Please assign each plot to one (and only one) of the following regularization methods.