Lecture 5: Linear Regression with Regularization
CSC 84020 - Machine Learning
Andrew Rosenberg
February 19, 2009
Today
Linear Regression with Regularization
Recap
Linear Regression
Given a target vector t, and data matrix X.
Goal: Identify the best parameters for a regression function

y = w_0 + \sum_{i=1}^{N} w_i x_i

w = (X^T X)^{-1} X^T t
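As a concrete illustration (my own sketch, not from the slides), the closed-form solution in NumPy, assuming X already contains a bias column of ones:

```python
import numpy as np

def fit_linear_regression(X, t):
    """Closed-form least squares: w = (X^T X)^{-1} X^T t.

    Uses np.linalg.solve rather than an explicit inverse for
    numerical stability; assumes X^T X is invertible.
    """
    return np.linalg.solve(X.T @ X, X.T @ t)

# Toy usage: recover w0 = 1, w1 = 2 from noiseless data.
X = np.column_stack([np.ones(5), np.arange(5.0)])
t = 1.0 + 2.0 * np.arange(5.0)
w = fit_linear_regression(X, t)   # approximately [1.0, 2.0]
```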
Closed form solution for linear regression
This solution is based on
Maximum Likelihood estimation under an assumption of Gaussian likelihood

Empirical Risk Minimization under an assumption of squared error

The extension to Basis Functions gives linear regression significant power.
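As a hedged sketch of the basis-function idea (function names are my own), the same closed-form fit applied to a polynomial design matrix; the model stays linear in w even though it is nonlinear in x:

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Map scalar inputs x to the basis [1, x, x^2, ..., x^M]."""
    return np.vander(np.asarray(x, dtype=float), N=M + 1, increasing=True)

x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x)
Phi = polynomial_design_matrix(x, M=3)
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # same normal equations as before
```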
Revisiting overfitting
Overfitting occurs when a model captures idiosyncrasies of the input data, rather than generalizing.
Too many parameters relative to the amount of training data
For example, an order-N polynomial can exactly fit N + 1 data points.
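A quick numerical check of that claim (my own sketch, not part of the lecture): a degree-9 polynomial fit to 10 noisy points drives the training residual to essentially zero, fitting the noise exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)                 # N + 1 = 10 points
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

coeffs = np.polyfit(x, t, deg=9)              # order N = 9 polynomial
max_residual = np.max(np.abs(t - np.polyval(coeffs, x)))
print(max_residual)                           # ~0: every training point is hit
```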
Overfitting Example
Avoiding Overfitting
Ways of detecting/avoiding overfitting.
Use more data
Evaluate on a parameter tuning set
Regularization
Take a Bayesian approach
Regularization
In a Linear Regression model, overfitting is characterized by large parameters.
        M=0     M=1      M=3           M=9
w0      0.19    0.82     0.31          0.35
w1             -1.27     7.99        232.37
w2                     -25.43      -5321.83
w3                      17.37      48568.31
w4                               -231639.30
w5                                640042.26
w6                              -1061800.52
w7                               1042400.18
w8                               -557682.99
w9                                125201.43
Regularization
Introduce a penalty term for the size of the weights.
Unregularized Regression
E(w) = \frac{1}{2} \sum_{n=0}^{N-1} \{ t_n - y(x_n, w) \}^2
Regularized Regression (L2-Regularization or Ridge Regularization)

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \frac{\lambda}{2} \|w\|^2
Note: Large λ leads to higher complexity penalization.
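As a small illustrative sketch (my own, assuming y(x, w) is the linear model Xw), the two objectives side by side:

```python
import numpy as np

def unregularized_error(w, X, t):
    """E(w) = 1/2 * sum_n (t_n - y(x_n, w))^2 with y(x, w) = Xw."""
    r = t - X @ w
    return 0.5 * r @ r

def ridge_error(w, X, t, lam):
    """Adds the L2 penalty: E(w) = ... + lam/2 * ||w||^2."""
    return unregularized_error(w, X, t) + 0.5 * lam * w @ w
```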
Least Squares Regression with L2-Regularization

\nabla_w E(w) = 0

\nabla_w \left[ \frac{1}{2} \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2 + \frac{\lambda}{2} \|w\|^2 \right] = 0

\nabla_w \left[ \frac{1}{2} \|t - Xw\|^2 + \frac{\lambda}{2} \|w\|^2 \right] = 0

\nabla_w \left[ \frac{1}{2} (t - Xw)^T (t - Xw) + \frac{\lambda}{2} w^T w \right] = 0

-X^T t + X^T X w + \lambda w = 0

-X^T t + X^T X w + \lambda I w = 0

-X^T t + (X^T X + \lambda I) w = 0

(X^T X + \lambda I) w = X^T t

w = (X^T X + \lambda I)^{-1} X^T t
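A minimal NumPy sketch of this closed-form ridge solution (my own code; whether the bias weight should also be penalized is a design choice the slides do not address):

```python
import numpy as np

def fit_ridge(X, t, lam):
    """Regularized least squares: w = (X^T X + lam * I)^{-1} X^T t."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)

# lam = 0 recovers the unregularized solution; larger lam shrinks the weights.
```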
Regularization Results
Further Regularization
Regularization Approaches

L2-Regularization (closed form, solvable in polynomial time):

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \frac{\lambda}{2} \|w\|^2

L1-Regularization (can be approximated in polynomial time):

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \lambda \|w\|_1

L0-Regularization (NP-complete optimization):

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \lambda \sum_{n=0}^{N-1} \mathbb{1}(w_n \neq 0)

The L0-norm represents the optimal subset of features needed by a Regression model.

How can we optimize each of these functions?
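For comparison, a hedged scikit-learn sketch (assuming that library is available; it calls the regularization strength `alpha`): Ridge solves the L2 problem in closed form, Lasso approximates the L1 problem iteratively, and the sparse Lasso weights stand in for the L0-style feature selection, which has no general polynomial-time solver.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
t = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(50)

ridge = Ridge(alpha=1.0).fit(X, t)   # L2: shrinks all weights smoothly
lasso = Lasso(alpha=0.1).fit(X, t)   # L1: drives many weights to exactly zero
print(ridge.coef_)
print(lasso.coef_)                   # sparse: implicit feature selection
```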
Curse of Dimensionality
Increasing the dimensionality of the feature space exponentially increases the data needs.

Note: The dimensionality of the feature space = the number of features.

What is the message of this?

Models should be small relative to the amount of available data.

Dimensionality Reduction techniques and feature selection can help.

L0-regularization is feature selection for linear models. L1- and L2-regularizations approximate feature selection and regularize the function.
Curse of Dimensionality Example
Assume a cell requires 100 data points to generalize properly, and 3-ary multinomial features.

One dimension requires 300 data points.

Two dimensions require 900 data points.

Three dimensions require 2,700 data points.

In this example, for D-dimensional model fitting, the data requirement is 3^D × 100.
Argument against the Kitchen Sink approach.
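The arithmetic behind those counts, as a one-line sketch (my own illustration):

```python
# 3-ary features: 3**D cells in D dimensions, each needing ~100 points.
for D in (1, 2, 3):
    print(D, 3**D * 100)   # 300, 900, 2700
```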
Bayesians v. Frequentists
What is a Probability?
The Frequentist position
A probability is the likelihood that an event will happen.
It is approximated as the ratio of the number of times the event happened to the total number of events.
Assessment is very important to select a model.
Point Estimates are fine: n/N.
The Bayesian position
A probability is the degree of believability that the event will happen.
Bayesians require that probabilities be conditioned on data, p(y|x).
The Bayesian approach is optimal given a good model, a good prior, and a good loss function; don't worry about assessment as much.

Bayesians say: if you are ever making a point estimate, you've made a mistake. The only valid probabilities are posteriors based on evidence given some prior.
Bayesian Linear Regression
In the previous derivation of the linear regression optimization, we made point estimates for the weight vector, w.

Bayesians would say stop right there: use a distribution over w to estimate the parameters.
p(w|\alpha) = \mathcal{N}(w|0, \alpha^{-1} I) = \left( \frac{\alpha}{2\pi} \right)^{(M+1)/2} \exp\left( -\frac{\alpha}{2} w^T w \right)

\alpha is a hyperparameter over w; it is the precision, or inverse variance, of the distribution.
So, optimize
p(w|x, t, \alpha, \beta) \propto p(t|x, w, \beta)\, p(w|\alpha)
Bayesian Linear Regression

p(w|x, t, \alpha, \beta) \propto p(t|x, w, \beta)\, p(w|\alpha)

Again, optimizing the log likelihood yields a simpler solution:

\ln p(t|x, w, \beta) + \ln p(w|\alpha)

p(t|x, w, \beta) = \prod_{n=0}^{N-1} \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{\beta}{2} (t_n - y(x_n, w))^2 \right)

\ln p(t|x, w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \frac{\beta}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2

\ln p(w|\alpha) = \frac{M+1}{2} \ln \alpha - \frac{M+1}{2} \ln 2\pi - \frac{\alpha}{2} w^T w

\ln p(t|x, w, \beta) + \ln p(w|\alpha) = -\frac{\beta}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 - \frac{\alpha}{2} w^T w + \text{const}
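To connect this back to regularization, a hedged NumPy sketch (my own, taking y(x, w) = Xw): maximizing the log posterior above is the same as minimizing the L2-regularized squared error with λ = α/β, so the MAP weights coincide with the ridge solution.

```python
import numpy as np

def map_weights(X, t, alpha, beta):
    """MAP estimate under a N(0, alpha^{-1} I) prior and Gaussian noise with
    precision beta: equivalent to ridge regression with lam = alpha / beta.
    """
    lam = alpha / beta
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)
```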
Broader Context
Overfitting is bad.
Bayesians v. Frequentists.
Does it matter which camp you lie in?
Not particularly, but Bayesian approaches allow us some useful, interesting, and principled tools.
Bye
Next: Categorization

Logistic Regression

Naive Bayes