Linear Regression
Marc Deisenroth
Centre for Artificial Intelligence
Department of Computer Science
University College London
https://deisenroth.cc
AIMS Rwanda and AIMS Ghana
March/April 2020
Reference
Mathematics for Machine Learning
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
The fundamental mathematical tools needed to understand machine learning include linear algebra, analytic geometry, matrix decompositions, vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science or computer science students, or professionals, to efficiently learn the mathematics. This self-contained textbook bridges the gap between mathematical and machine learning texts, introducing the mathematical concepts with a minimum of prerequisites. It uses these concepts to derive four central machine learning methods: linear regression, principal component analysis, Gaussian mixture models and support vector machines. For students and others with a mathematical background, these derivations provide a starting point to machine learning texts. For those learning the mathematics for the first time, the methods help build intuition and practical experience with applying mathematical concepts. Every chapter includes worked examples and exercises to test understanding. Programming tutorials are offered on the book’s web site.
MARC PETER DEISENROTH is Senior Lecturer in Statistical Machine Learning at the Department of Computing, Imperial College London.
A. ALDO FAISAL leads the Brain & Behaviour Lab at Imperial College London, where he is also Reader in Neurotechnology at the Department of Bioengineering and the Department of Computing.
CHENG SOON ONG is Principal Research Scientist at the Machine Learning Research Group, Data61, CSIRO. He is also Adjunct Associate Professor at Australian National University.
https://mml-book.com
Chapter 9
Regression Problems
Regression (curve fitting)
Given inputs $x \in \mathbb{R}^D$ and corresponding observations $y \in \mathbb{R}$, find a function $f$ that models the relationship between $x$ and $y$.

[Figure: noisy observations $y$ plotted against inputs $x$]

Typically parametrize the function $f$ with parameters $\theta$
Linear regression: consider functions $f$ that are linear in the parameters
Linear Regression Functions

Straight lines:
$$y = f(x, \theta) = \theta_0 + \theta_1 x = \begin{bmatrix} \theta_0 & \theta_1 \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix}$$

Polynomials:
$$y = f(x, \theta) = \sum_{m=0}^{M} \theta_m x^m = \begin{bmatrix} \theta_0 & \cdots & \theta_M \end{bmatrix} \begin{bmatrix} 1 \\ \vdots \\ x^M \end{bmatrix}$$

Radial basis function networks:
$$y = f(x, \theta) = \sum_{m=1}^{M} \theta_m \exp\big(-\tfrac{1}{2}(x - \mu_m)^2\big)$$

[Figure: example functions $f(x)$ of each type plotted against $x$]
Linear Regression Model and Setting

$$y = x^\top \theta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

Given a training set $(x_1, y_1), \ldots, (x_N, y_N)$, we seek optimal parameters $\theta^*$
Maximum likelihood estimation
Maximum a posteriori estimation
Maximum Likelihood

Define $X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times D}$ and $y = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N$
Find parameters $\theta^*$ that maximize the likelihood
$$p(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \theta) = p(y \mid X, \theta) = \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big)$$
Log-transformation ⟹ maximize the log-likelihood
$$\log p(y \mid X, \theta) = \sum_{n=1}^{N} \log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big), \qquad \log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big) = -\frac{1}{2\sigma^2}(y_n - x_n^\top \theta)^2 + \text{const}$$
Maximum Likelihood (2)

With
$$\log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big) = -\frac{1}{2\sigma^2}(y_n - x_n^\top \theta)^2 + \text{const}$$
we get
$$\log p(y \mid X, \theta) = \sum_{n=1}^{N} \log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top \theta)^2 + \text{const} = -\frac{1}{2\sigma^2}(y - X\theta)^\top (y - X\theta) + \text{const} = -\frac{1}{2\sigma^2}\|y - X\theta\|^2 + \text{const}$$
Computing the gradient with respect to $\theta$ and setting it to 0 gives the maximum likelihood estimator (least-squares estimator)
$$\theta_{\mathrm{ML}} = (X^\top X)^{-1} X^\top y$$
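The closed-form estimator above translates directly into a few lines of numpy. The following is a minimal sketch (not part of the slides; the synthetic data and variable names are illustrative) that fits $\theta_{\mathrm{ML}}$ and computes a mean prediction; it solves the normal equations rather than forming the matrix inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: y = x^T theta_true + noise
N, D, sigma = 50, 2, 0.5
X = rng.uniform(-4, 4, size=(N, D))            # design matrix, one row per input
theta_true = np.array([1.5, -0.7])
y = X @ theta_true + sigma * rng.standard_normal(N)

# Maximum likelihood / least-squares estimate: theta = (X^T X)^{-1} X^T y.
# Solving the normal equations is numerically preferable to an explicit inverse.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
print("theta_ML:", theta_ml)

# Mean prediction at a new input x_star
x_star = np.array([0.3, -1.2])
print("mean prediction:", x_star @ theta_ml)
```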
Making Predictions

$$y = x^\top \theta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Given an arbitrary input $x_*$, we can predict the corresponding observation $y_*$ using the maximum likelihood parameter:
$$p(y_* \mid x_*, \theta_{\mathrm{ML}}) = \mathcal{N}\big(y_* \mid x_*^\top \theta_{\mathrm{ML}}, \sigma^2\big)$$
The measurement noise variance $\sigma^2$ is assumed known
In the absence of noise ($\sigma^2 = 0$), the prediction will be deterministic
Example 1: Linear Functions
[Figure: training data and the maximum likelihood (MLE) fit, a straight line]

$$y = \theta_0 + \theta_1 x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
At any query point $x_*$ we obtain the mean prediction as
$$\mathbb{E}[y_* \mid \theta_{\mathrm{ML}}, x_*] = \theta_0^{\mathrm{ML}} + \theta_1^{\mathrm{ML}} x_*$$
Nonlinear Functions
$$y = \phi(x)^\top \theta + \epsilon = \sum_{m=0}^{M} \theta_m x^m + \epsilon$$
Polynomial regression with features
$$\phi(x) = [1, x, x^2, \ldots, x^M]^\top$$
Maximum likelihood estimator:
$$\theta_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top y$$
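As a small sketch (not from the slides; the helper name and data are illustrative), the same estimator with polynomial features: stack the feature vectors $\phi(x_n)^\top$ into the matrix $\Phi$ and solve the normal equations.

```python
import numpy as np

def poly_features(x, M):
    """Feature matrix Phi with rows [1, x_n, x_n^2, ..., x_n^M]."""
    return np.vander(x, N=M + 1, increasing=True)

rng = np.random.default_rng(1)
x = rng.uniform(-4, 4, size=30)
y = np.sin(x) + 0.3 * rng.standard_normal(x.shape)     # some nonlinear ground truth

M = 5
Phi = poly_features(x, M)                               # shape (30, M+1)
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)      # (Phi^T Phi)^{-1} Phi^T y

# Mean predictions on a test grid
x_test = np.linspace(-4, 4, 100)
y_pred = poly_features(x_test, M) @ theta_ml
```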
Example 2: Polynomial Regression
Figure: Training data
Example 2: Polynomial Regression
Figure: 2nd-order polynomial (MLE fit and training data)
Example 2: Polynomial Regression
Figure: 4th-order polynomial (MLE fit and training data)
Example 2: Polynomial Regression
Figure: 6th-order polynomial (MLE fit and training data)
Example 2: Polynomial Regression
Figure: 8th-order polynomial (MLE fit and training data)
Example 2: Polynomial Regression
Figure: 10th-order polynomial (MLE fit and training data)
Overfitting
[Figure: RMSE on the training and test sets as a function of the degree of the polynomial]

The training error decreases with higher flexibility of the model
We are not so much interested in the training error, but in the generalization error: how well does the model perform when we predict at previously unseen input locations?
Maximum likelihood often runs into overfitting problems, i.e., we exploit the flexibility of the model to fit to the noise in the data
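To make the curve above concrete, here is a rough numpy sketch (synthetic data and helper names are illustrative, not from the slides) that fits polynomials of increasing degree and compares training and test RMSE.

```python
import numpy as np

rng = np.random.default_rng(2)

def poly_features(x, M):
    return np.vander(x, N=M + 1, increasing=True)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Noisy data from an underlying smooth function
f = lambda x: np.sin(x) + 0.5 * x
x_train = rng.uniform(-4, 4, 15)
x_test = rng.uniform(-4, 4, 200)
y_train = f(x_train) + 0.4 * rng.standard_normal(x_train.shape)
y_test = f(x_test) + 0.4 * rng.standard_normal(x_test.shape)

for M in range(0, 11):
    Phi = poly_features(x_train, M)
    theta = np.linalg.lstsq(Phi, y_train, rcond=None)[0]    # least-squares fit
    train_err = rmse(y_train, Phi @ theta)
    test_err = rmse(y_test, poly_features(x_test, M) @ theta)
    print(f"degree {M:2d}: train RMSE {train_err:.2f}, test RMSE {test_err:.2f}")
```

Typically, the training RMSE keeps falling with the degree while the test RMSE eventually rises again, which is the overfitting pattern the slide describes.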
MAP Estimation
Empirical observation: parametric models that overfit tend to have some extreme (large-amplitude) parameter values
Mitigate the effect of overfitting by placing a prior distribution $p(\theta)$ on the parameters
Penalize extreme values that are implausible under that prior
Choose $\theta^*$ as the parameter that maximizes the (log) parameter posterior
$$\log p(\theta \mid X, y) = \underbrace{\log p(y \mid X, \theta)}_{\text{log-likelihood}} + \underbrace{\log p(\theta)}_{\text{log-prior}} + \text{const}$$
The log-prior induces a direct penalty on the parameters
Maximum a posteriori (MAP) estimate (regularized least squares)
MAP Estimation (2)
Gaussian parameter prior $p(\theta) = \mathcal{N}(0, \alpha^2 I)$
Log-posterior distribution:
$$\log p(\theta \mid X, y) = -\frac{1}{2\sigma^2}(y - X\theta)^\top (y - X\theta) - \frac{1}{2\alpha^2}\theta^\top \theta + \text{const} = -\frac{1}{2\sigma^2}\|y - X\theta\|^2 - \frac{1}{2\alpha^2}\|\theta\|^2 + \text{const}$$
Compute the gradient with respect to $\theta$ and set it to 0
Maximum a posteriori estimate:
$$\theta_{\mathrm{MAP}} = \Big(X^\top X + \frac{\sigma^2}{\alpha^2} I\Big)^{-1} X^\top y$$
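The MAP estimator is regularized (ridge) least squares; a short sketch (illustrative names and data, not from the slides) shows that only the regularized Gram matrix changes relative to the ML solution, and that the resulting parameter vector is typically smaller in norm.

```python
import numpy as np

def map_estimate(Phi, y, sigma2, alpha2):
    """theta_MAP = (Phi^T Phi + (sigma^2 / alpha^2) I)^{-1} Phi^T y."""
    D = Phi.shape[1]
    A = Phi.T @ Phi + (sigma2 / alpha2) * np.eye(D)
    return np.linalg.solve(A, Phi.T @ y)

# Example usage with polynomial features of degree 8
rng = np.random.default_rng(3)
x = rng.uniform(-4, 4, 20)
y = np.sin(x) + 0.3 * rng.standard_normal(x.shape)
Phi = np.vander(x, N=9, increasing=True)

theta_map = map_estimate(Phi, y, sigma2=0.3 ** 2, alpha2=1.0)
theta_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]
print("||theta_ML|| =", np.linalg.norm(theta_ml))
print("||theta_MAP|| =", np.linalg.norm(theta_map))   # the prior shrinks extreme values
```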
Example: Polynomial Regression
Figure: Training data
Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\, \theta_{\mathrm{MAP}}$$
Example: Polynomial Regression
Figure: 2nd-order polynomial (training data with MLE and MAP fits)
Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\, \theta_{\mathrm{MAP}}$$
Example: Polynomial Regression
Figure: 4th-order polynomial (training data with MLE and MAP fits)
Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\, \theta_{\mathrm{MAP}}$$
Example: Polynomial Regression
Figure: 6th-order polynomial (training data with MLE and MAP fits)
Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\, \theta_{\mathrm{MAP}}$$
Example: Polynomial Regression
Figure: 8th-order polynomial (training data with MLE and MAP fits)
Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\, \theta_{\mathrm{MAP}}$$
Example: Polynomial Regression
Figure: 10th-order polynomial (training data with MLE and MAP fits)
Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\, \theta_{\mathrm{MAP}}$$
Generalization Error
[Figure: test-set RMSE as a function of the degree of the polynomial, for maximum likelihood and maximum a posteriori estimation]

MAP estimation “delays” the problem of overfitting
It does not provide a general solution
Need a more principled solution
Bayesian Linear Regression
$$y = \phi^\top(x)\, \theta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Avoid overfitting by not fitting any parameters: integrate the parameters out instead of optimizing them
Use a full parameter distribution $p(\theta)$ (and not a single point estimate $\theta^*$) when making predictions:
$$p(y_* \mid x_*) = \int p(y_* \mid x_*, \theta)\, p(\theta)\, \mathrm{d}\theta$$
The prediction no longer depends on $\theta$
The predictive distribution reflects the uncertainty about the “correct” parameter setting
Example
[Figure, left: training data with MLE, MAP, and BLR predictions and shaded predictive uncertainty; right: plausible functions sampled from the parameter distribution]

Light gray: uncertainty due to noise (same as in MLE/MAP)
Dark gray: uncertainty due to parameter uncertainty
Right: plausible functions under the parameter distribution (every single parameter setting describes one function)
Model for Bayesian Linear Regression
Prior: $p(\theta) = \mathcal{N}(m_0, S_0)$
Likelihood: $p(y \mid x, \theta) = \mathcal{N}\big(y \mid \phi^\top(x)\theta, \sigma^2\big)$
The parameter $\theta$ becomes a latent (random) variable
The prior distribution induces a distribution over plausible functions
Choose a conjugate Gaussian prior ⟹ closed-form computations, Gaussian posterior
Parameter Posterior and Predictions
The prior $p(\theta) = \mathcal{N}(m_0, S_0)$ is Gaussian ⟹ the posterior is Gaussian (derive this):
$$p(\theta \mid X, y) = \mathcal{N}(m_N, S_N), \qquad S_N = \big(S_0^{-1} + \sigma^{-2}\Phi^\top\Phi\big)^{-1}, \qquad m_N = S_N\big(S_0^{-1} m_0 + \sigma^{-2}\Phi^\top y\big)$$
The mean $m_N$ is identical to the MAP estimate
Assume a Gaussian distribution $p(\theta) = \mathcal{N}(m_N, S_N)$. Then
$$p(y_* \mid x_*) = \mathcal{N}\big(y_* \mid \phi^\top(x_*)\, m_N,\; \phi^\top(x_*)\, S_N\, \phi(x_*) + \sigma^2\big)$$
$\phi^\top(x_*)\, S_N\, \phi(x_*)$ accounts for parameter uncertainty in the predictive variance
More details: https://mml-book.com, Chapter 9
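A compact numpy sketch of these posterior and predictive formulas (not from the slides; the data, feature choice, and function names are illustrative):

```python
import numpy as np

def blr_posterior(Phi, y, sigma2, m0, S0):
    """Gaussian posterior N(m_N, S_N) over theta for Bayesian linear regression."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma2)
    return mN, SN

def blr_predict(phi_star, mN, SN, sigma2):
    """Predictive mean and variance at a single feature vector phi(x_*)."""
    mean = phi_star @ mN
    var = phi_star @ SN @ phi_star + sigma2        # parameter uncertainty + noise
    return mean, var

# Example usage: degree-3 polynomial features, zero-mean isotropic prior
rng = np.random.default_rng(4)
x = rng.uniform(-4, 4, 25)
y = np.sin(x) + 0.3 * rng.standard_normal(x.shape)
Phi = np.vander(x, N=4, increasing=True)

m0, S0, sigma2 = np.zeros(4), np.eye(4), 0.3 ** 2
mN, SN = blr_posterior(Phi, y, sigma2, m0, S0)

phi_star = np.vander(np.array([1.5]), N=4, increasing=True)[0]
mean, var = blr_predict(phi_star, mN, SN, sigma2)
print(f"p(y*|x*=1.5) ~ N({mean:.2f}, {var:.3f})")
```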
Marginal Likelihood
The marginal likelihood can be computed analytically. With $p(\theta) = \mathcal{N}(\mu, \Sigma)$,
$$p(y \mid X) = \int p(y \mid X, \theta)\, p(\theta)\, \mathrm{d}\theta = \mathcal{N}\big(y \mid \Phi\mu,\; \Phi\Sigma\Phi^\top + \sigma^2 I\big)$$
Derivation via completing the squares (see Section 9.3.5 of the MML book)
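As a sketch (illustrative data and helper name, not from the slides; scipy is assumed only for the Gaussian log-density), this closed form can be evaluated directly, for example to compare feature sets on the same data:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(Phi, y, mu, Sigma, sigma2):
    """log p(y|X) = log N(y | Phi mu, Phi Sigma Phi^T + sigma^2 I)."""
    mean = Phi @ mu
    cov = Phi @ Sigma @ Phi.T + sigma2 * np.eye(len(y))
    return multivariate_normal(mean=mean, cov=cov).logpdf(y)

# Example usage: compare two polynomial degrees on the same data
rng = np.random.default_rng(5)
x = rng.uniform(-4, 4, 25)
y = np.sin(x) + 0.3 * rng.standard_normal(x.shape)

for M in (2, 6):
    Phi = np.vander(x, N=M + 1, increasing=True)
    lml = log_marginal_likelihood(Phi, y, np.zeros(M + 1), np.eye(M + 1), 0.3 ** 2)
    print(f"degree {M}: log marginal likelihood = {lml:.1f}")
```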
Distribution over Functions
Consider a linear regression setting
$$y = f(x) + \epsilon = a + bx + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$

[Figure: the prior $p(a, b)$ visualized in the $(a, b)$ plane]
Sampling from the Prior over Functions
Consider a linear regression setting
$$y = f(x) + \epsilon = a + bx + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$
$$f_i(x) = a_i + b_i x, \qquad [a_i, b_i] \sim p(a, b)$$
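A tiny sketch of this sampling procedure (illustrative, not from the slides; matplotlib is assumed for plotting): draw $(a_i, b_i)$ from the prior and plot the corresponding straight lines $f_i$.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
xs = np.linspace(-5, 5, 100)

# Prior p(a, b) = N(0, I): each sample (a_i, b_i) defines one function f_i(x) = a_i + b_i x
samples = rng.multivariate_normal(mean=np.zeros(2), cov=np.eye(2), size=10)
for a_i, b_i in samples:
    plt.plot(xs, a_i + b_i * xs, color="gray", alpha=0.6)

plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("Functions sampled from the prior")
plt.show()
```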
Sampling from the Posterior over Functions
Consider a linear regression setting
$$y = f(x) + \epsilon = a + bx + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$
$X = [x_1, \ldots, x_N]$, $y = [y_1, \ldots, y_N]$: training inputs/targets

[Figure: training data $(x, y)$ scattered around a straight line]
Sampling from the Posterior over Functions
Consider a linear regression setting
$$y = f(x) + \epsilon = a + bx + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$
Posterior: $p(a, b \mid X, y) = \mathcal{N}(m_N, S_N)$

[Figure: the posterior $p(a, b \mid X, y)$ visualized in the $(a, b)$ plane]
Sampling from the Posterior over Functions
Consider a linear regression setting
$$y = f(x) + \epsilon = a + bx + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2)$$
$$[a_i, b_i] \sim p(a, b \mid X, y), \qquad f_i(x) = a_i + b_i x$$
Fitting Nonlinear Functions
Fit nonlinear functions using (Bayesian) linear regression: a linear combination of nonlinear features
Example: radial-basis-function (RBF) network
$$f(x) = \sum_{i=1}^{n} \theta_i\, \phi_i(x), \qquad \theta_i \sim \mathcal{N}(0, \sigma_p^2),$$
where
$$\phi_i(x) = \exp\big(-\tfrac{1}{2}(x - \mu_i)^\top(x - \mu_i)\big)$$
for given “centers” $\mu_i$
Illustration: Fitting a Radial Basis Function Network
$$\phi_i(x) = \exp\big(-\tfrac{1}{2}(x - \mu_i)^\top(x - \mu_i)\big)$$

[Figure: Gaussian-shaped basis functions plotted over $x$]

Place Gaussian-shaped basis functions $\phi_i$ at 25 input locations $\mu_i$, linearly spaced in the interval $[-5, 3]$
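A sketch of this construction (illustrative data and names, not from the slides): build the RBF feature matrix for 25 centers on $[-5, 3]$ and reuse the Bayesian linear regression posterior from above to sample functions.

```python
import numpy as np

def rbf_features(x, centers):
    """Phi[n, i] = exp(-0.5 * (x_n - mu_i)^2) for scalar inputs."""
    return np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

centers = np.linspace(-5, 3, 25)              # 25 centers, linearly spaced in [-5, 3]

rng = np.random.default_rng(7)
x_train = rng.uniform(-5, 3, 40)
y_train = np.sin(x_train) + 0.2 * rng.standard_normal(x_train.shape)

Phi = rbf_features(x_train, centers)
sigma2, S0 = 0.2 ** 2, np.eye(len(centers))   # prior p(theta) = N(0, I)

# Posterior N(m_N, S_N) over the RBF weights (same formulas as before, with m0 = 0)
SN = np.linalg.inv(np.linalg.inv(S0) + Phi.T @ Phi / sigma2)
mN = SN @ (Phi.T @ y_train / sigma2)

# Sample functions from the posterior on a test grid
x_test = np.linspace(-5, 5, 200)
Phi_test = rbf_features(x_test, centers)
theta_samples = rng.multivariate_normal(mN, SN, size=5)
f_samples = theta_samples @ Phi_test.T        # each row is one sampled function
```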
Samples from the RBF Prior
$$f(x) = \sum_{i=1}^{n} \theta_i\, \phi_i(x), \qquad p(\theta) = \mathcal{N}(0, I)$$

[Figure: functions $f(x)$ sampled from the RBF prior]
Samples from the RBF Posterior
$$f(x) = \sum_{i=1}^{n} \theta_i\, \phi_i(x), \qquad p(\theta \mid X, y) = \mathcal{N}(m_N, S_N)$$

[Figure: functions $f(x)$ sampled from the RBF posterior]
RBF Posterior
[Figure: the RBF posterior over $x \in [-5, 5]$]
Limitations
[Figure: the RBF posterior over $x \in [-5, 5]$; the basis functions cover only part of the input range]

Feature engineering (what basis functions to use?)
Finite number of features:
Above: without basis functions on the right, we cannot express any variability of the function
Ideally: add more (infinitely many) basis functions
Approach
Instead of sampling parameters, which induce a distribution over functions, sample functions directly
Place a prior on functions ⟹ make assumptions on the distribution of functions
Intuition: a function is an infinitely long vector of function values ⟹ make assumptions on the distribution of function values
⟹ Gaussian process
Summary
[Figures: test-set RMSE for maximum likelihood vs. maximum a posteriori estimation; polynomial fits (MLE, MAP, BLR) with training data; the parameter posterior over $(a, b)$; functions sampled from the posterior]

Regression = curve fitting
Linear regression = linear in the parameters
Parameter estimation via maximum likelihood and MAP estimation can lead to overfitting
Bayesian linear regression addresses this issue, but may not be analytically tractable
Predictive uncertainty in Bayesian linear regression explicitly accounts for parameter uncertainty
Distribution over parameters ⟹ distribution over functions
Appendix
Joint Gaussian Distribution
Joint Gaussian distribution
$$p(x, y) = \mathcal{N}\left(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$$
Marginal:
$$p(x) = \int p(x, y)\, \mathrm{d}y = \mathcal{N}(\mu_x, \Sigma_{xx})$$
Conditional:
$$p(x \mid y) = \mathcal{N}(\mu_{x|y}, \Sigma_{x|y}), \qquad \mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y), \qquad \Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$$
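These formulas are easy to exercise numerically; the sketch below (illustrative numbers, not from the slides) reads off the marginal and computes the conditional for a two-dimensional joint Gaussian.

```python
import numpy as np

# Joint Gaussian over (x, y), both scalar here for simplicity
mu = np.array([1.0, -0.5])                       # [mu_x, mu_y]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                   # [[Sxx, Sxy], [Syx, Syy]]

# Marginal p(x) = N(mu_x, Sigma_xx): simply read off the corresponding blocks
mu_x, Sxx = mu[0], Sigma[0, 0]

# Conditional p(x | y = y_obs)
y_obs = 0.3
Sxy, Syy, mu_y = Sigma[0, 1], Sigma[1, 1], mu[1]
mu_x_given_y = mu_x + Sxy / Syy * (y_obs - mu_y)
S_x_given_y = Sxx - Sxy / Syy * Sxy
print(f"p(x | y={y_obs}) = N({mu_x_given_y:.3f}, {S_x_given_y:.3f})")
```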
Linear Transformation of Gaussian RandomVariables
If $x \sim \mathcal{N}(x \mid \mu, \Sigma)$ and $z = Ax + b$, then
$$p(z) = \mathcal{N}\big(z \mid A\mu + b,\; A\Sigma A^\top\big)$$
Product of Two Gaussians
Let $x \in \mathbb{R}^D$. Then
$$\mathcal{N}(x \mid a, A)\, \mathcal{N}(x \mid b, B) = Z\, \mathcal{N}(x \mid c, C)$$
$$C = (A^{-1} + B^{-1})^{-1}, \qquad c = C(A^{-1}a + B^{-1}b)$$
$$Z = (2\pi)^{-\frac{D}{2}}\, |A + B|^{-\frac{1}{2}}\, \exp\big(-\tfrac{1}{2}(a - b)^\top (A + B)^{-1}(a - b)\big)$$
The product of two Gaussians is an unnormalized Gaussian
The “un-normalizer” $Z$ has a Gaussian functional form:
$$Z = \mathcal{N}(a \mid b, A + B) = \mathcal{N}(b \mid a, A + B)$$
Note: this is not a distribution (no random variables)
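A numerical sketch (illustrative matrices, not from the slides; scipy is assumed only for density evaluation) that checks the product formula and that $Z$ agrees with the Gaussian form $\mathcal{N}(a \mid b, A + B)$:

```python
import numpy as np
from scipy.stats import multivariate_normal

a, A = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
b, B = np.array([1.5, -0.5]), np.array([[0.8, 0.0], [0.0, 1.2]])

# N(x|a,A) N(x|b,B) = Z N(x|c,C)
C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.inv(A) @ a + np.linalg.inv(B) @ b)
Z = multivariate_normal(mean=b, cov=A + B).pdf(a)     # Z = N(a | b, A + B)

# Check at an arbitrary point x: the product of densities equals Z * N(x|c,C)
x = np.array([0.3, 0.7])
lhs = multivariate_normal(a, A).pdf(x) * multivariate_normal(b, B).pdf(x)
rhs = Z * multivariate_normal(c, C).pdf(x)
print(lhs, rhs)   # should agree up to numerical precision
```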
Example: Marginalization of a Product
$$p_1(x) = \mathcal{N}(x \mid a, A), \qquad p_2(x) = \mathcal{N}(x \mid b, B)$$
Then
$$\int p_1(x)\, p_2(x)\, \mathrm{d}x = Z = \mathcal{N}(a \mid b, A + B) \in \mathbb{R}$$
Note: in this context, $\mathcal{N}$ is used to describe the functional relationship between $a$ and $b$. Do not treat $a$ or $b$ as random variables; they are both deterministic quantities.