18-661 Introduction to Machine Learning
Linear Regression – II
Spring 2020
ECE – Carnegie Mellon University
Announcements
• Homework 1 due today.
• If you are not able to access Gradescope, the entry code is 9NEDVR
• Python code will be graded for correctness and not efficiency
• The class waitlist is almost clear now. Any questions about registration?
• The next few classes will be taught by Prof. Carlee Joe-Wong and broadcast from SV to Pittsburgh & Kigali
1
Today’s Class: Practical Issues with Using
Linear Regression and How to Address Them
1
Outline
1. Review of Linear Regression
2. Gradient Descent Methods
3. Feature Scaling
4. Ridge regression
5. Non-linear Basis Functions
6. Overfitting
2
Review of Linear Regression
Example: Predicting house prices
Sale price ≈ price per sqft × square footage + fixed expense
3
Minimize squared errors
Our model:

Sale price = price per sqft × square footage + fixed expense + unexplainable stuff

Training data:

sqft    sale price    prediction    error    squared error
2000    810K          720K          90K      8100
2100    907K          800K          107K     107²
1100    312K          350K          38K      38²
5500    2,600K        2,600K        0        0
· · ·
Total: 8100 + 107² + 38² + 0 + · · ·
Aim:
Adjust price per sqft and fixed expense such that the sum of the squared
error is minimized — i.e., the unexplainable stuff is minimized.
4
Linear regression
Setup:
• Input: x ∈ R^D (covariates, predictors, features, etc.)
• Output: y ∈ R (responses, targets, outcomes, outputs, etc.)
• Model: f : x → y, with f(x) = w_0 + \sum_{d=1}^{D} w_d x_d = w_0 + w^T x
• w = [w_1 w_2 · · · w_D]^T: weights, parameters, or parameter vector
• w_0 is called the bias
• Sometimes, we also call w = [w_0 w_1 w_2 · · · w_D]^T the parameters
• Training data: D = {(xn, yn), n = 1, 2, . . . ,N}
Minimize the Residual sum of squares:
RSS(w) = \sum_{n=1}^{N} [y_n − f(x_n)]^2 = \sum_{n=1}^{N} [y_n − (w_0 + \sum_{d=1}^{D} w_d x_{nd})]^2
5
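A minimal NumPy sketch of this RSS computation (illustrative only; the toy data and the guessed parameters below are made up, not from the slides):

import numpy as np

# Toy data: N = 4 houses, D = 1 feature (sqft in 1000's); prices in 100k.
x = np.array([[1.0], [2.0], [1.5], [2.5]])   # shape (N, D)
y = np.array([2.0, 3.5, 3.0, 4.5])

w0 = 0.5                  # bias, an arbitrary guess
w = np.array([1.5])       # weight vector, an arbitrary guess

# f(x_n) = w0 + w^T x_n for every n, then sum the squared residuals
predictions = w0 + x @ w
rss = np.sum((y - predictions) ** 2)
print("RSS =", rss)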
A simple case: x is just one-dimensional (D=1)
Residual sum of squares:

RSS(w) = \sum_n [y_n − f(x_n)]^2 = \sum_n [y_n − (w_0 + w_1 x_n)]^2

Stationary points:
Take the derivative with respect to the parameters and set it to zero:

∂RSS(w)/∂w_0 = 0 ⇒ −2 \sum_n [y_n − (w_0 + w_1 x_n)] = 0,
∂RSS(w)/∂w_1 = 0 ⇒ −2 \sum_n [y_n − (w_0 + w_1 x_n)] x_n = 0.
7
A simple case: x is just one-dimensional (D=1)
∂RSS(w)/∂w_0 = 0 ⇒ −2 \sum_n [y_n − (w_0 + w_1 x_n)] = 0
∂RSS(w)/∂w_1 = 0 ⇒ −2 \sum_n [y_n − (w_0 + w_1 x_n)] x_n = 0

Simplify these expressions to get the "Normal Equations":

\sum y_n = N w_0 + w_1 \sum x_n
\sum x_n y_n = w_0 \sum x_n + w_1 \sum x_n^2

Solving the system, we obtain the least squares coefficient estimates:

w_1 = \sum (x_n − x̄)(y_n − ȳ) / \sum (x_n − x̄)^2   and   w_0 = ȳ − w_1 x̄

where x̄ = (1/N) \sum_n x_n and ȳ = (1/N) \sum_n y_n.
8
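A short NumPy sketch of these closed-form estimates (illustrative; the data are the toy house prices used later in the lecture):

import numpy as np

x = np.array([1.0, 2.0, 1.5, 2.5])   # sqft in 1000's
y = np.array([2.0, 3.5, 3.0, 4.5])   # price in 100k

x_bar, y_bar = x.mean(), y.mean()
# w1 = sum((x_n - x_bar)(y_n - y_bar)) / sum((x_n - x_bar)^2), w0 = y_bar - w1 * x_bar
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print(w0, w1)   # approximately 0.45 and 1.6 for this data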
Least Mean Squares when x is D-dimensional
RSS(w) in matrix form:

RSS(w) = \sum_n [y_n − (w_0 + \sum_d w_d x_{nd})]^2 = \sum_n [y_n − w^T x_n]^2,

where we have redefined some variables (by augmenting)

x ← [1 x_1 x_2 . . . x_D]^T,   w ← [w_0 w_1 w_2 . . . w_D]^T

Design matrix and target vector:

X = [x_1^T; x_2^T; . . . ; x_N^T] ∈ R^{N×(D+1)},   y = [y_1; y_2; . . . ; y_N] ∈ R^N

Compact expression:

RSS(w) = ‖Xw − y‖_2^2 = {w^T X^T X w − 2(X^T y)^T w} + const
9
Example: RSS(w) in compact form
sqft (1000's)   bedrooms   bathrooms   sale price (100k)
1               2          1           2
2               2          2           3.5
1.5             3          2           3
2.5             4          2.5         4.5

Design matrix and target vector:

X = [x_1^T; x_2^T; . . . ; x_N^T] ∈ R^{N×(D+1)},   y = [y_1; y_2; . . . ; y_N] ∈ R^N

Compact expression:

RSS(w) = ‖Xw − y‖_2^2 = {w^T X^T X w − 2(X^T y)^T w} + const
10
Example: RSS(w) in compact form
sqft (1000's)   bedrooms   bathrooms   sale price (100k)
1               2          1           2
2               2          2           3.5
1.5             3          2           3
2.5             4          2.5         4.5

Design matrix and target vector:

X = [x_1^T; x_2^T; x_3^T; x_4^T] =
    [1   1     2   1  ]
    [1   2     2   2  ]
    [1   1.5   3   2  ]
    [1   2.5   4   2.5],
y = [2; 3.5; 3; 4.5]

Compact expression:

RSS(w) = ‖Xw − y‖_2^2 = {w^T X^T X w − 2(X^T y)^T w} + const
11
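A small NumPy sketch of this compact form for the table above (illustrative):

import numpy as np

# Design matrix with a leading column of 1's for the bias w0
X = np.array([[1, 1.0, 2, 1.0],
              [1, 2.0, 2, 2.0],
              [1, 1.5, 3, 2.0],
              [1, 2.5, 4, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

w = np.zeros(X.shape[1])          # any candidate parameter vector
rss = np.sum((X @ w - y) ** 2)    # RSS(w) = ||Xw - y||_2^2
print("RSS =", rss)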
Three Optimization Methods
Want to Minimize
RSS(w) = ‖Xw − y‖_2^2 = {w^T X^T X w − 2(X^T y)^T w} + const

• Least-Squares Solution: taking the derivative and setting it to zero
• Batch Gradient Descent
• Stochastic Gradient Descent
12
Least-Squares Solution
Compact expression

RSS(w) = ‖Xw − y‖_2^2 = {w^T X^T X w − 2(X^T y)^T w} + const

Gradients of Linear and Quadratic Functions

• ∇_x (b^T x) = b
• ∇_x (x^T A x) = 2Ax (symmetric A)

Normal equation

∇_w RSS(w) = 2X^T X w − 2X^T y = 0

This leads to the least-mean-squares (LMS) solution

w_LMS = (X^T X)^{-1} X^T y
13
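A minimal NumPy sketch of the LMS solution (illustrative; in practice np.linalg.solve or np.linalg.lstsq is preferred over forming an explicit inverse):

import numpy as np

X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])   # bias column + sqft
y = np.array([2.0, 3.5, 3.0, 4.5])

# Solve the normal equations X^T X w = X^T y
w_lms = np.linalg.solve(X.T @ X, X.T @ y)
print(w_lms)   # approximately [0.45, 1.6]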
Gradient Descent Methods
Outline
Review of Linear Regression
Gradient Descent Methods
Feature Scaling
Ridge regression
Non-linear Basis Functions
Overfitting
14
Three Optimization Methods
Want to Minimize
RSS(w) = ‖Xw − y‖_2^2 = {w^T X^T X w − 2(X^T y)^T w} + const

• Least-Squares Solution: taking the derivative and setting it to zero
• Batch Gradient Descent
• Stochastic Gradient Descent
15
Computational complexity
Bottleneck of computing the solution?
w = (X^T X)^{-1} X^T y

How many operations do we need?

• O(ND^2) for the matrix multiplication X^T X
• O(D^3) (e.g., using Gauss-Jordan elimination) or O(D^2.373) (recent theoretical advances) for the matrix inversion of X^T X
• O(ND) for the matrix multiplication X^T y
• O(D^2) for (X^T X)^{-1} times X^T y

O(ND^2) + O(D^3): impractical for very large D or N
16
Alternative method: Batch Gradient Descent
(Batch) Gradient descent
• Initialize w to w^(0) (e.g., randomly); set t = 0; choose η > 0
• Loop until convergence
  1. Compute the gradient: ∇RSS(w^(t)) = X^T (X w^(t) − y)
  2. Update the parameters: w^(t+1) = w^(t) − η ∇RSS(w^(t))
  3. t ← t + 1
What is the complexity of each iteration?
O(ND)
17
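A compact NumPy sketch of batch gradient descent for this objective (illustrative; the step size and iteration count are arbitrary choices):

import numpy as np

def batch_gradient_descent(X, y, eta=0.1, num_iters=1000):
    """Minimize ||Xw - y||_2^2 with full-batch gradient steps."""
    w = np.zeros(X.shape[1])                 # w^(0)
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y)             # O(ND) work per iteration
        w = w - eta * grad
    return w

X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(batch_gradient_descent(X, y))          # converges to approximately [0.45, 1.6]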
Why would this work?
If gradient descent converges, it will converge to the same solution as
using matrix inversion.
This is because RSS(w) is a convex function in its parameters w
Hessian of RSS
RSS(w) = w^T X^T X w − 2(X^T y)^T w + const  ⇒  ∂^2 RSS(w)/∂w ∂w^T = 2X^T X

X^T X is positive semidefinite, because for any v

v^T X^T X v = ‖Xv‖_2^2 ≥ 0
18
Three Optimization Methods
Want to Minimize
RSS(w) = ‖Xw − y‖_2^2 = {w^T X^T X w − 2(X^T y)^T w} + const

• Least-Squares Solution: taking the derivative and setting it to zero
• Batch Gradient Descent
• Stochastic Gradient Descent
19
Stochastic gradient descent (SGD)
Widrow-Hoff rule: update parameters using one example at a time
• Initialize w to some w^(0); set t = 0; choose η > 0
• Loop until convergence
  1. Randomly choose a training sample x_t
  2. Compute its contribution to the gradient: g_t = (x_t^T w^(t) − y_t) x_t
  3. Update the parameters: w^(t+1) = w^(t) − η g_t
  4. t ← t + 1

How does the complexity per iteration compare with gradient descent?

• O(ND) for gradient descent versus O(D) for SGD
20
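A minimal NumPy sketch of this SGD loop (illustrative; the fixed number of steps and fixed η are arbitrary simplifications):

import numpy as np

def sgd(X, y, eta=0.05, num_iters=2000, seed=0):
    """Widrow-Hoff style updates: one randomly chosen sample per step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                     # w^(0)
    for _ in range(num_iters):
        t = rng.integers(len(y))                 # pick a random training sample
        g_t = (X[t] @ w - y[t]) * X[t]           # O(D) gradient contribution
        w = w - eta * g_t
    return w

X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(sgd(X, y))   # lands near the least-squares solution [0.45, 1.6], with some noise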
SGD versus Batch GD
• SGD reduces per-iteration complexity from O(ND) to O(D)
• But it is noisier and can take longer to converge
21
Example: Comparing the Three Methods
sqft (1000’s) sale price (100k)
1 2
2 3.5
1.5 3
2.5 4.5
(Figure: scatter plot of price ($) versus house size, with the regression fit to be determined)
22
Example: Least Squares Solution
sqft (1000’s) sale price (100k)
1 2
2 3.5
1.5 3
2.5 4.5
The w_0 and w_1 that minimize the RSS are given by:

w_LMS = (X^T X)^{-1} X^T y

[w_0; w_1] = ( [1 1 1 1; 1 2 1.5 2.5] [1 1; 1 2; 1 1.5; 1 2.5] )^{-1} [1 1 1 1; 1 2 1.5 2.5] [2; 3.5; 3; 4.5]

[w_0; w_1] = [0.45; 1.6]

Minimum RSS is RSS* = ‖X w_LMS − y‖_2^2 = 0.05 (so ‖X w_LMS − y‖_2 ≈ 0.2236)
24
Example: Batch Gradient Descent
sqft (1000’s) sale price (100k)
1 2
2 3.5
1.5 3
2.5 4.5
w^(t+1) = w^(t) − η ∇RSS(w^(t)) = w^(t) − η X^T (X w^(t) − y)

(Plot: RSS value versus number of iterations, 0 to 40, for η = 0.01)
25
Larger η gives faster convergence
sqft (1000’s) sale price (100k)
1 2
2 3.5
1.5 3
2.5 4.5
w^(t+1) = w^(t) − η ∇RSS(w^(t)) = w^(t) − η X^T (X w^(t) − y)

(Plot: RSS value versus number of iterations for η = 0.01 and η = 0.1; the larger step size reaches a low RSS in fewer iterations)
26
But too large η makes GD unstable
sqft (1000’s) sale price (100k)
1 2
2 3.5
1.5 3
2.5 4.5
w^(t+1) = w^(t) − η ∇RSS(w^(t)) = w^(t) − η X^T (X w^(t) − y)

(Plot: RSS value versus number of iterations for η = 0.01, 0.1, and 0.12; with η = 0.12 the RSS blows up instead of converging)
27
Example: Stochastic Gradient Descent
sqft (1000’s) sale price (100k)
1 2
2 3.5
1.5 3
2.5 4.5
w^(t+1) = w^(t) − η g_t = w^(t) − η (x_t^T w^(t) − y_t) x_t

(Plot: RSS value versus number of iterations for SGD with η = 0.05)
28
Larger η gives faster convergence
sqft (1000’s) sale price (100k)
1 2
2 3.5
1.5 3
2.5 4.5
w^(t+1) = w^(t) − η g_t = w^(t) − η (x_t^T w^(t) − y_t) x_t

(Plot: RSS value versus number of iterations for SGD with η = 0.05 and η = 0.1; the larger step size drives the RSS down faster)
29
But too large η makes SGD unstable
sqft (1000’s) sale price (100k)
1 2
2 3.5
1.5 3
2.5 4.5
w^(t+1) = w^(t) − η g_t = w^(t) − η (x_t^T w^(t) − y_t) x_t

(Plot: RSS value versus number of iterations for SGD with η = 0.05, 0.1, and 0.25; with η = 0.25 the RSS stays noisy instead of settling near the minimum)
30
How to Choose Learning Rate η in practice?
• Try 0.0001, 0.001, 0.01, 0.1, etc. on a validation dataset (more on this later) and choose the one that gives the fastest, stable convergence
• Reduce η by a constant factor (e.g., 10) when learning saturates, so that we can reach closer to the true minimum
• More advanced learning rate schedules such as AdaGrad, Adam, and AdaDelta are used in practice
31
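A small sketch of the second idea, a step decay of η (illustrative; the decay trigger, factor, and step size are arbitrary choices, not from the slides):

import numpy as np

def gd_with_step_decay(X, y, eta=0.1, decay=10.0, patience=20, num_iters=500):
    """Batch GD that divides eta by `decay` when the RSS stops improving."""
    w = np.zeros(X.shape[1])
    best_rss, stale = np.inf, 0
    for _ in range(num_iters):
        w = w - eta * (X.T @ (X @ w - y))
        rss = np.sum((X @ w - y) ** 2)
        if rss < best_rss - 1e-8:
            best_rss, stale = rss, 0
        else:
            stale += 1
            if stale >= patience:        # learning has saturated, shrink the step size
                eta, stale = eta / decay, 0
    return w

X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])
print(gd_with_step_decay(X, y))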
Summary of Gradient Descent Methods
• Batch gradient descent computes the exact gradient.
• Stochastic gradient descent approximates the gradient with a single data point; its expectation equals the true gradient.
• Mini-batch variant: set the batch size to trade off between the accuracy of estimating the gradient and the computational cost.
• Similar ideas extend to other ML optimization problems.
• Similar ideas extend to other ML optimization problems.
32
Feature Scaling
Outline
Review of Linear Regression
Gradient Descent Methods
Feature Scaling
Ridge regression
Non-linear Basis Functions
Overfitting
33
Batch Gradient Descent: Scaled Features
sqft (1000’s) sale price (100k)
1 2
2 3.5
1.5 3
2.5 4.5
w^(t+1) = w^(t) − η ∇RSS(w^(t)) = w^(t) − η X^T (X w^(t) − y)

(Plot: RSS value versus number of iterations for η = 0.01 and η = 0.1 on the scaled data; both converge, the larger step size faster)
34
Batch Gradient Descent: Without Feature Scaling
sqft sale price
1000 200,000
2000 350,000
1500 300,000
2500 450,000
• Least-squares solution is (w_0*, w_1*) = (45000, 160)
• ∇RSS(w^(t)) = X^T (X w^(t) − y) becomes HUGE, causing instability
• We need a tiny η to compensate, but this leads to slow convergence

(Plot: RSS value versus number of iterations with η = 0.0000001; the RSS decreases very slowly, still on the order of 10^4 to 10^5 after 40 iterations)
35
Batch Gradient Descent: Without Feature Scaling
sqft sale price
1000 200,000
2000 350,000
1500 300,000
2500 450,000
• Least-squares solution is (w_0*, w_1*) = (45000, 160)
• ∇RSS(w) becomes HUGE, causing instability
• We need a tiny η to compensate, but this leads to slow convergence

(Plot: RSS value versus number of iterations; with η = 0.00001 the RSS blows up to around 10^110, while η = 0.0000001 barely makes progress)
36
How to Scale Features?
• Min-max normalization

  x'_d = (x_d − min_n x_d) / (max_n x_d − min_n x_d)

  The min and max are taken over the values x_d^(1), . . . , x_d^(N) of x_d in the dataset. This will result in all scaled features satisfying 0 ≤ x'_d ≤ 1.

• Mean normalization

  x'_d = (x_d − avg(x_d)) / (max_n x_d − min_n x_d)

  This will result in all scaled features satisfying −1 ≤ x'_d ≤ 1.

Several other methods exist, e.g., dividing by the standard deviation (Z-score normalization). Labels y^(1), . . . , y^(N) should be similarly re-scaled.
37
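A small NumPy sketch of these two normalizations (illustrative):

import numpy as np

X = np.array([[1000.0, 2], [2000.0, 2], [1500.0, 3], [2500.0, 4]])  # raw features, one row per example

# Min-max normalization: every column ends up in [0, 1]
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# Mean normalization: every column ends up between -1 and 1
X_mean = (X - X.mean(axis=0)) / (x_max - x_min)

print(X_minmax)
print(X_mean)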
Ridge regression
Outline
Review of Linear Regression
Gradient Descent Methods
Feature Scaling
Ridge regression
Non-linear Basis Functions
Overfitting
38
What if X>X is not invertible?
w_LMS = (X^T X)^{-1} X^T y

Why might this happen?

• Answer 1: N < D. Not enough data to estimate all parameters; X^T X is not full-rank.
• Answer 2: Columns of X are not linearly independent, e.g., some features are linear functions of other features. In this case, the solution is not unique. Examples:
  • A feature is a re-scaled version of another, for example, having two features correspond to length in meters and feet respectively
  • The same feature is repeated twice, which could happen when there are many features
  • A feature has the same value for all data points
  • The sum of two features is equal to a third feature
39
Example: Matrix X>X is not invertible
sqft (1000's)   bathrooms   sale price (100k)
1               2           2
2               2           3.5
1.5             2           3
2.5             2           4.5

Design matrix and target vector:

X = [1   1     2]
    [1   2     2]
    [1   1.5   2]
    [1   2.5   2],   w = [w_0; w_1; w_2],   y = [2; 3.5; 3; 4.5]

The 'bathrooms' feature is redundant, so we don't need w_2:

y = w_0 + w_1 x_1 + w_2 x_2
  = w_0 + w_1 x_1 + 2 w_2, since x_2 is always 2!
  = w_{0,eff} + w_1 x_1, where w_{0,eff} = w_0 + 2 w_2
40
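A quick NumPy check of this situation (illustrative): the constant 'bathrooms' column makes X^T X rank-deficient, so plain inversion fails.

import numpy as np

X = np.array([[1, 1.0, 2], [1, 2.0, 2], [1, 1.5, 2], [1, 2.5, 2]])
A = X.T @ X
print(np.linalg.matrix_rank(A))   # 2, even though A is 3x3

try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as err:
    print("inversion failed:", err)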
What does the RSS loss function look like?
• When X^T X is not invertible, the RSS objective function has a ridge, that is, the minimum is a line instead of a single point.

In our example, this line is w_{0,eff} = w_0 + 2 w_2.
41
How do you fix this issue?
sqft (1000’s) bathrooms sale price (100k)
1 2 2
2 2 3.5
1.5 2 3
2.5 2 4.5
• Manually remove redundant features
• But this can be tedious and non-trivial, especially when a feature is a linear combination of several other features
Need a general way that doesn’t require manual feature engineering
SOLUTION: Ridge Regression
42
Ridge regression
Intuition: what does a non-invertible X^T X mean?

Consider the SVD of this matrix:

X^T X = V diag(λ_1, λ_2, . . . , λ_r, 0, . . . , 0) V^T

where λ_1 ≥ λ_2 ≥ · · · ≥ λ_r > 0 and r < D. We will have a divide-by-zero issue when computing (X^T X)^{-1}.

Fix the problem: ensure all singular values are non-zero:

X^T X + λI = V diag(λ_1 + λ, λ_2 + λ, . . . , λ_r + λ, λ, . . . , λ) V^T

where λ > 0 and I is the identity matrix.
43
Regularized least square (ridge regression)
Solution

w = (X^T X + λI)^{-1} X^T y

This is equivalent to adding an extra term to RSS(w):

(1/2){w^T X^T X w − 2(X^T y)^T w}  +  (1/2) λ ‖w‖_2^2
          RSS(w)                        regularization

Benefits

• Numerically more stable, invertible matrix
• Forces w to be small
• Prevents overfitting (more on this later)
44
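A minimal NumPy sketch of this regularized solution (illustrative), applied to the redundant-feature example above:

import numpy as np

X = np.array([[1, 1.0, 2], [1, 2.0, 2], [1, 1.5, 2], [1, 2.5, 2]])
y = np.array([2.0, 3.5, 3.0, 4.5])
lam = 0.5

# Ridge solution: (X^T X + lambda I)^{-1} X^T y, computed via a linear solve
D = X.shape[1]
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(w_ridge)   # approximately [0.208, 1.247, 0.417], matching the lambda = 0.5 example below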
Applying this to our example
sqft (1000’s) bathrooms sale price (100k)
1 2 2
2 2 3.5
1.5 2 3
2.5 2 4.5
The 'bathrooms' feature is redundant, so we don't need w_2:

y = w_0 + w_1 x_1 + w_2 x_2
  = w_0 + w_1 x_1 + 2 w_2, since x_2 is always 2!
  = w_{0,eff} + w_1 x_1, where w_{0,eff} = w_0 + 2 w_2
  = 0.45 + 1.6 x_1   ← should get this
45
Applying this to our example
The 'bathrooms' feature is redundant, so we don't need w_2:

y = w_0 + w_1 x_1 + w_2 x_2
  = w_0 + w_1 x_1 + 2 w_2, since x_2 is always 2!
  = w_{0,eff} + w_1 x_1, where w_{0,eff} = w_0 + 2 w_2
  = 0.45 + 1.6 x_1   ← should get this

Compute the solution for λ = 0.5:

[w_0; w_1; w_2] = (X^T X + λI)^{-1} X^T y = [0.208; 1.247; 0.4166]
46
How does λ affect the solution?
[w_0; w_1; w_2] = (X^T X + λI)^{-1} X^T y

Let us plot w_{0,eff} = w_0 + 2 w_2 and w_1 for different λ ∈ [0.01, 20].

(Plot: parameter values w_{0,eff} and w_1 versus the hyperparameter λ from 0 to 20)

Setting a small λ gives almost the least-squares solution, but it can cause numerical instability in the inversion.
47
How to choose λ?
λ is referred to as a hyperparameter

• It is associated with the estimation method, not the dataset
• In contrast, w is the parameter vector
• Use a validation set or cross-validation to find a good choice of λ (more on this in the next lecture)

(Plot: parameter values w_{0,eff} and w_1 versus the hyperparameter λ from 0 to 20)
48
Why is it called Ridge Regression?
• When X^T X is not invertible, the RSS objective function has a ridge, that is, the minimum is a line instead of a single point
• Adding the regularizer term (1/2) λ ‖w‖_2^2 yields a unique minimum, thus avoiding instability in matrix inversion
49
Probabilistic Interpretation of Ridge Regression
Add a term to the objective function.

• Choose the parameters to not just minimize risk, but to also avoid them being too large.

(1/2){w^T X^T X w − 2(X^T y)^T w} + (1/2) λ ‖w‖_2^2

Probabilistic interpretation: place a prior on our weights

• Interpret w as a random variable
• Assume that each w_d is centered around zero
• Use observed data D to update our prior belief on w

Gaussian priors lead to ridge regression.
50
Review: Probabilistic interpretation of Linear Regression
Linear Regression model: Y = w^T X + η

η ∼ N(0, σ_0^2) is a Gaussian random variable, so Y ∼ N(w^T X, σ_0^2).

Frequentist interpretation: we assume that w is fixed.

• The likelihood function maps parameters to probabilities:

  L : w, σ_0^2 ↦ p(D | w, σ_0^2) = p(y | X, w, σ_0^2) = \prod_n p(y_n | x_n, w, σ_0^2)

• Maximizing the likelihood with respect to w minimizes the RSS and yields the LMS solution:

  w_LMS = w_ML = argmax_w L(w, σ_0^2)
51
Probabilistic interpretation of Ridge Regression
Ridge Regression model: Y = w^T X + η

• Y ∼ N(w^T X, σ_0^2) is a Gaussian random variable (as before)
• w_d ∼ N(0, σ^2) are i.i.d. Gaussian random variables (unlike before)
• Note that all w_d share the same variance σ^2
• To find w given data D, compute the posterior distribution of w:

  p(w | D) = p(D | w) p(w) / p(D)

• Maximum a posteriori (MAP) estimate:

  w_MAP = argmax_w p(w | D) = argmax_w p(D | w) p(w)
52
Estimating w
Let x_1, . . . , x_N be i.i.d. with y | w, x ∼ N(w^T x, σ_0^2) and w_d ∼ N(0, σ^2).

Joint likelihood of data and parameters (given σ_0, σ):

p(D, w) = p(D | w) p(w) = \prod_n p(y_n | x_n, w) \prod_d p(w_d)

Plugging in the Gaussian PDF, we get:

log p(D, w) = \sum_n log p(y_n | x_n, w) + \sum_d log p(w_d)
            = − \sum_n (w^T x_n − y_n)^2 / (2σ_0^2) − \sum_d w_d^2 / (2σ^2) + const

MAP estimate: w_MAP = argmax_w log p(D, w)

w_MAP = argmin_w \sum_n (w^T x_n − y_n)^2 / (2σ_0^2) + ‖w‖_2^2 / (2σ^2)
53
Maximum a posterior (MAP) estimate
E(w) = \sum_n (w^T x_n − y_n)^2 + λ ‖w‖_2^2

where λ > 0 is used to denote σ_0^2/σ^2. This extra term ‖w‖_2^2 is called a regularization term (regularizer) and controls the magnitude of w.

Intuitions

• If λ → +∞, then σ_0^2 ≫ σ^2: the variance of the noise is far greater than what our prior model can allow for w. In this case, our prior model on w will force w to be close to zero. Numerically,

  w_MAP → 0

• If λ → 0, then we trust our data more. Numerically,

  w_MAP → w_LMS = argmin_w \sum_n (w^T x_n − y_n)^2
54
Outline
1. Review of Linear Regression
2. Gradient Descent Methods
3. Feature Scaling
4. Ridge regression
5. Non-linear Basis Functions
6. Overfitting
55
Non-linear Basis Functions
Outline
Review of Linear Regression
Gradient Descent Methods
Feature Scaling
Ridge regression
Non-linear Basis Functions
Overfitting
56
Is a linear modeling assumption always a good idea?
Figure 1: Sale price can saturate as sq.footage increases
Figure 2: Temperature has cyclic variations over each year
57
General nonlinear basis functions
We can use a nonlinear mapping to a new feature vector:
φ(x) : x ∈ R^D → z ∈ R^M

• M is the dimensionality of the new features z (or φ(x))
• M could be greater than, less than, or equal to D

We can apply existing learning methods on the transformed data:

• linear methods: prediction is based on w^T φ(x)
• other methods: nearest neighbors, decision trees, etc.
58
Regression with nonlinear basis
Residual sum of squares:

\sum_n [w^T φ(x_n) − y_n]^2

where w ∈ R^M, the same dimensionality as the transformed features φ(x).

The LMS solution can be formulated with the new design matrix

Φ = [φ(x_1)^T; φ(x_2)^T; . . . ; φ(x_N)^T] ∈ R^{N×M},   w_LMS = (Φ^T Φ)^{-1} Φ^T y
59
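A minimal NumPy sketch of this idea with a polynomial basis (illustrative; the degree and toy data are arbitrary choices):

import numpy as np

def poly_features(x, M):
    """Map a 1-D input x to phi(x) = [1, x, x^2, ..., x^M]."""
    return np.vander(x, M + 1, increasing=True)   # shape (N, M+1)

x = np.array([1.0, 2.0, 1.5, 2.5])
y = np.array([2.0, 3.5, 3.0, 4.5])

Phi = poly_features(x, M=2)                        # quadratic basis
w_lms, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # solves min ||Phi w - y||^2
print(w_lms)
print(Phi @ w_lms)                                 # fitted values on the training points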
Example: Lot of Flexibility in Designing New Features!
x_1, Area (1k sqft)   x_1^2, Area^2   Price (100k)
1                     1               2
2                     4               3.5
1.5                   2.25            3
2.5                   6.25            4.5

Figure 3: Add x_1^2 as a feature to allow us to fit quadratic, instead of linear, functions of the house area x_1
60
Example: Lot of Flexibility in Designing New Features!
x_1, front (100 ft)   x_2, depth (100 ft)   10·x_1·x_2, Lot (1k sqft)   Price (100k)
0.5                   0.5                   2.5                         2
0.5                   1                     5                           3.5
0.8                   1.5                   12                          3
1.0                   1.5                   15                          4.5

Figure 4: Instead of having frontage and depth as two separate features, it may be better to consider the lot area, which is equal to frontage × depth
61
Example with regression
Polynomial basis functions

φ(x) = [1, x, x^2, . . . , x^M]^T  ⇒  f(x) = w_0 + \sum_{m=1}^{M} w_m x^m

Fitting samples from a sine function: underfitting, since f(x) is too simple.

(Plots: polynomial fits with M = 0 and M = 1 to samples from a sine curve; both are too simple to capture the curve)
62
Adding high-order terms
M = 3 and M = 9 fits:

(Plots: the polynomial fit with M = 3, and the fit with M = 9, which overfits the training points)

More complex features lead to better results on the training data, but potentially worse results on new data, e.g., test data!
63
Overfitting
Outline
Review of Linear Regression
Gradient Descent Methods
Feature Scaling
Ridge regression
Non-linear Basis Functions
Overfitting
64
Overfitting
Parameters for higher-order polynomials are very large
       M = 0   M = 1    M = 3       M = 9
w_0    0.19    0.82     0.31        0.35
w_1            -1.27    7.99        232.37
w_2                     -25.43      -5321.83
w_3                     17.37       48568.31
w_4                                 -231639.30
w_5                                 640042.26
w_6                                 -1061800.52
w_7                                 1042400.18
w_8                                 -557682.99
w_9                                 125201.43
65
Overfitting can be quite disastrous
Fitting the housing price data with large M:
Predicted price goes to zero (and is ultimately negative) if you buy a big
enough house!
This is called poor generalization/overfitting.
66
Detecting overfitting
Plot model complexity versus objective function:
• X axis: model complexity, e.g., M
• Y axis: error, e.g., RSS, RMS (square root of RSS), 0-1 loss

Compute the objective on a training and a test dataset.

(Plot: E_RMS versus M for the training and test sets)

As a model increases in complexity:

• Training error keeps improving
• Test error may first improve but eventually will deteriorate
67
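A small NumPy sketch of this diagnostic (illustrative; the synthetic sine data, noise level, and train/test split are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)   # noisy sine samples
x_train, y_train, x_test, y_test = x[:10], y[:10], x[10:], y[10:]

def rms_error(w, x, y, M):
    Phi = np.vander(x, M + 1, increasing=True)
    return np.sqrt(np.mean((Phi @ w - y) ** 2))

# Training error keeps falling as M grows, while test error eventually rises
for M in [0, 1, 3, 9]:
    Phi_train = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    print(M, rms_error(w, x_train, y_train, M), rms_error(w, x_test, y_test, M))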
Dealing with overfitting
Try to use more training data
(Plots: the M = 9 fit with the original small training set, and with N = 15 and N = 100 training points; with more data the same model no longer overfits)
What if we do not have a lot of data?
68
Regularization methods
Intuition: Give preference to ‘simpler’ models
• How do we define a simple linear regression model w^T x?
• Intuitively, the weights should not be "too large"

       M = 0   M = 1    M = 3       M = 9
w_0    0.19    0.82     0.31        0.35
w_1            -1.27    7.99        232.37
w_2                     -25.43      -5321.83
w_3                     17.37       48568.31
w_4                                 -231639.30
w_5                                 640042.26
w_6                                 -1061800.52
w_7                                 1042400.18
w_8                                 -557682.99
w_9                                 125201.43
69
Next Class: Overfitting, Regularization and
the Bias-variance trade-off
69