
Linear Regression

Marc Deisenroth
Centre for Artificial Intelligence
Department of Computer Science
University College London

@mpd37 · m.deisenroth@ucl.ac.uk

https://deisenroth.cc

AIMS Rwanda and AIMS Ghana

March/April 2020


Reference

Mathematics for Machine Learning
Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong


The fundamental mathematical tools needed to understand machine learning include linear algebra, analytic geometry, matrix decompositions, vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science or computer science students, or professionals, to efficiently learn the mathematics. This self-contained textbook bridges the gap between mathematical and machine learning texts, introducing the mathematical concepts with a minimum of prerequisites. It uses these concepts to derive four central machine learning methods: linear regression, principal component analysis, Gaussian mixture models and support vector machines. For students and others with a mathematical background, these derivations provide a starting point to machine learning texts. For those learning the mathematics for the first time, the methods help build intuition and practical experience with applying mathematical concepts. Every chapter includes worked examples and exercises to test understanding. Programming tutorials are offered on the book's web site.

MARC PETER DEISENROTH is Senior Lecturer in Statistical Machine Learning at the Department of Computing, Imperial College London.

A. ALDO FAISAL leads the Brain & Behaviour Lab at Imperial College London, where he is also Reader in Neurotechnology at the Department of Bioengineering and the Department of Computing.

CHENG SOON ONG is Principal Research Scientist at the Machine Learning Research Group, Data61, CSIRO. He is also Adjunct Associate Professor at Australian National University.


https://mml-book.com, Chapter 9


Regression Problems

Regression (curve fitting)

Given inputs $x \in \mathbb{R}^D$ and corresponding observations $y \in \mathbb{R}$, find a function $f$ that models the relationship between $x$ and $y$.

[Figure: training data as a scatter plot of y against x]

Typically, we parametrize the function $f$ with parameters $\theta$.

Linear regression: consider functions $f$ that are linear in the parameters $\theta$.


Linear Regression Functions

Straight lines
$$y = f(x, \theta) = \theta_0 + \theta_1 x = \begin{bmatrix}\theta_0 & \theta_1\end{bmatrix}\begin{bmatrix}1 \\ x\end{bmatrix}$$

Polynomials
$$y = f(x, \theta) = \sum_{m=0}^{M} \theta_m x^m = \begin{bmatrix}\theta_0 & \cdots & \theta_M\end{bmatrix}\begin{bmatrix}1 \\ \vdots \\ x^M\end{bmatrix}$$

Radial basis function networks
$$y = f(x, \theta) = \sum_{m=1}^{M} \theta_m \exp\!\big(-\tfrac{1}{2}(x - \mu_m)^2\big)$$

[Figures: example straight-line, polynomial, and RBF functions f(x) over x ∈ [−4, 4]]
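All three model classes are linear in the parameters $\theta$ and differ only in the feature map. The following is a minimal NumPy sketch of that idea; the particular feature choices, centers, and variable names are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def phi_line(x):
    """Features for a straight line: [1, x]."""
    return np.array([1.0, x])

def phi_poly(x, M=3):
    """Polynomial features: [1, x, x^2, ..., x^M]."""
    return np.array([x**m for m in range(M + 1)])

def phi_rbf(x, centers=np.linspace(-4, 4, 5)):
    """Gaussian RBF features exp(-0.5 * (x - mu_m)^2) at assumed centers."""
    return np.exp(-0.5 * (x - centers) ** 2)

def f(x, theta, phi):
    """Linear-in-the-parameters model: f(x, theta) = phi(x)^T theta."""
    return phi(x) @ theta

# Example: evaluate an RBF model with random parameters at x = 1.5
rng = np.random.default_rng(0)
theta = rng.normal(size=5)
print(f(1.5, theta, phi_rbf))
```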


Linear Regression Model and Setting

$$y = x^\top \theta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

Given a training set $(x_1, y_1), \ldots, (x_N, y_N)$, we seek optimal parameters $\theta^*$:

Maximum likelihood estimation
Maximum a posteriori estimation


Maximum Likelihood

Define $X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times D}$ and $y = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N$.

Find parameters $\theta^*$ that maximize the likelihood
$$p(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \theta) = p(y \mid X, \theta) = \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big)$$

Log-transformation: maximize the log-likelihood
$$\log p(y \mid X, \theta) = \sum_{n=1}^{N} \log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big), \qquad \log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big) = -\frac{1}{2\sigma^2}\big(y_n - x_n^\top \theta\big)^2 + \text{const}$$


Maximum Likelihood (2)

With
$$\log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big) = -\frac{1}{2\sigma^2}\big(y_n - x_n^\top \theta\big)^2 + \text{const}$$
we get
$$\log p(y \mid X, \theta) = \sum_{n=1}^{N} \log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}\big(y_n - x_n^\top \theta\big)^2 + \text{const}$$
$$= -\frac{1}{2\sigma^2}(y - X\theta)^\top (y - X\theta) + \text{const} = -\frac{1}{2\sigma^2}\lVert y - X\theta\rVert^2 + \text{const}$$

Computing the gradient with respect to $\theta$ and setting it to $0$ gives the maximum likelihood estimator (least-squares estimator)
$$\theta_{\mathrm{ML}} = (X^\top X)^{-1} X^\top y$$
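A minimal NumPy sketch of this least-squares/maximum-likelihood estimator; the synthetic data, noise level, and variable names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N points in D dimensions with an assumed "true" parameter vector
N, D, sigma = 50, 3, 0.5
X = rng.normal(size=(N, D))                      # design matrix, rows are x_n^T
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + sigma * rng.normal(size=N)  # y_n = x_n^T theta + eps_n

# theta_ML = (X^T X)^{-1} X^T y
# Solving the normal equations is preferable to forming the inverse explicitly.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ml)
```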


Making Predictions

$$y = x^\top \theta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

Given an arbitrary input $x_*$, we can predict the corresponding observation $y_*$ using the maximum likelihood parameter:
$$p(y_* \mid x_*, \theta_{\mathrm{ML}}) = \mathcal{N}\big(y_* \mid x_*^\top \theta_{\mathrm{ML}}, \sigma^2\big)$$

The measurement-noise variance $\sigma^2$ is assumed known.

In the absence of noise ($\sigma^2 = 0$), the prediction is deterministic.
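A small sketch of evaluating this predictive density at a query point; the fitted parameter values, the query input, and the noise level are assumed for illustration (SciPy's normal density is used for convenience):

```python
import numpy as np
from scipy.stats import norm

theta_ml = np.array([1.02, -1.97, 0.48])   # assumed fitted estimate
sigma = 0.5                                 # assumed known noise standard deviation

x_star = np.array([0.3, -1.2, 2.0])         # arbitrary query input
mean_star = x_star @ theta_ml               # predictive mean x_*^T theta_ML

# p(y_* | x_*, theta_ML) = N(y_* | x_*^T theta_ML, sigma^2)
y_star = 1.0
density = norm.pdf(y_star, loc=mean_star, scale=sigma)
print(mean_star, density)
```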


Example 1: Linear Functions

[Figure: training data with the fitted MLE straight line]

$$y = \theta_0 + \theta_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

At any query point $x_*$ we obtain the mean prediction as
$$\mathbb{E}[y_* \mid \theta_{\mathrm{ML}}, x_*] = \theta_0^{\mathrm{ML}} + \theta_1^{\mathrm{ML}} x_*$$


Nonlinear Functions

$$y = \phi(x)^\top \theta + \varepsilon = \sum_{m=0}^{M} \theta_m x^m + \varepsilon$$

Polynomial regression with features
$$\phi(x) = [1, x, x^2, \ldots, x^M]^\top$$

Maximum likelihood estimator:
$$\theta_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top y,$$
where $\Phi = [\phi(x_1), \ldots, \phi(x_N)]^\top$ is the feature (design) matrix.
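A sketch of polynomial regression with these features; the synthetic data, the degree $M = 4$, and the variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1-D data (assumed): noisy observations of a nonlinear function
N, sigma = 30, 0.3
x = rng.uniform(-4, 4, size=N)
y = np.sin(x) + sigma * rng.normal(size=N)

def poly_features(x, M):
    """Feature matrix Phi with rows phi(x_n)^T = [1, x_n, ..., x_n^M]."""
    return np.vander(x, M + 1, increasing=True)

M = 4
Phi = poly_features(x, M)

# theta_ML = (Phi^T Phi)^{-1} Phi^T y, via the normal equations
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Predictions at test inputs
x_test = np.linspace(-4, 4, 100)
y_pred = poly_features(x_test, M) @ theta_ml
print(theta_ml)
```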


Example 2: Polynomial Regression

[Figure: training data]

[Figures: maximum likelihood fits of 2nd-, 4th-, 6th-, 8th-, and 10th-order polynomials to the training data]

Overfitting

[Figure: training and test RMSE as a function of the polynomial degree (0–10)]

The training error decreases as the flexibility of the model increases.

We are not so much interested in the training error as in the generalization error: how well does the model perform when we predict at previously unseen input locations?

Maximum likelihood often runs into overfitting problems, i.e., we exploit the flexibility of the model to fit the noise in the data.
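A sketch of how such a train/test RMSE curve could be produced; the synthetic data, the train/test split, and the range of degrees are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data, split into training and test sets (assumed setup)
def make_data(n):
    x = rng.uniform(-4, 4, size=n)
    y = np.sin(x) + 0.3 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(100)

def fit_and_rmse(M):
    """Fit an M-th order polynomial by least squares and return train/test RMSE."""
    Phi_train = np.vander(x_train, M + 1, increasing=True)
    Phi_test = np.vander(x_test, M + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    rmse = lambda Phi, y: np.sqrt(np.mean((Phi @ theta - y) ** 2))
    return rmse(Phi_train, y_train), rmse(Phi_test, y_test)

for M in range(11):
    train_rmse, test_rmse = fit_and_rmse(M)
    print(f"degree {M:2d}: train {train_rmse:.3f}, test {test_rmse:.3f}")
```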


MAP Estimation

Empirical observation: parametric models that overfit tend to have some extreme (large-amplitude) parameter values.

Mitigate the effect of overfitting by placing a prior distribution $p(\theta)$ on the parameters.

Penalize extreme values that are implausible under that prior.

Choose $\theta^*$ as the parameter that maximizes the (log) parameter posterior
$$\log p(\theta \mid X, y) = \underbrace{\log p(y \mid X, \theta)}_{\text{log-likelihood}} + \underbrace{\log p(\theta)}_{\text{log-prior}} + \text{const}$$

The log-prior induces a direct penalty on the parameters.

Maximum a posteriori (MAP) estimate (regularized least squares).


MAP Estimation (2)

Gaussian parameter prior $p(\theta) = \mathcal{N}\big(0, \alpha^2 I\big)$

Log-posterior distribution:
$$\log p(\theta \mid X, y) = -\frac{1}{2\sigma^2}(y - X\theta)^\top (y - X\theta) - \frac{1}{2\alpha^2}\theta^\top\theta + \text{const} = -\frac{1}{2\sigma^2}\lVert y - X\theta\rVert^2 - \frac{1}{2\alpha^2}\lVert\theta\rVert^2 + \text{const}$$

Compute the gradient with respect to $\theta$ and set it to $0$.

Maximum a posteriori estimate:
$$\theta_{\mathrm{MAP}} = \Big(X^\top X + \frac{\sigma^2}{\alpha^2} I\Big)^{-1} X^\top y$$
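A sketch of the MAP/regularized least-squares estimator, written with a polynomial feature matrix $\Phi$ in place of $X$ (the natural generalization used in the examples); the data, degree, $\sigma$, and $\alpha$ values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed setup: polynomial features for noisy 1-D data
x = rng.uniform(-4, 4, size=15)
y = np.sin(x) + 0.3 * rng.normal(size=x.size)
M, sigma, alpha = 8, 0.3, 1.0
Phi = np.vander(x, M + 1, increasing=True)

# theta_MAP = (Phi^T Phi + (sigma^2 / alpha^2) I)^{-1} Phi^T y
D = Phi.shape[1]
A = Phi.T @ Phi + (sigma**2 / alpha**2) * np.eye(D)
theta_map = np.linalg.solve(A, Phi.T @ y)

# For comparison: the (possibly overfitting) maximum likelihood estimate
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.linalg.norm(theta_map), np.linalg.norm(theta_ml))
```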


Example: Polynomial Regression

[Figure: training data]

Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\,\theta_{\mathrm{MAP}}$$

[Figures: MLE and MAP fits of 2nd-, 4th-, 6th-, 8th-, and 10th-order polynomials to the training data]

Generalization Error

[Figure: test-set RMSE as a function of the polynomial degree, for maximum likelihood and maximum a posteriori estimation]

MAP estimation "delays" the problem of overfitting.

It does not provide a general solution; we need a more principled solution.


Bayesian Linear Regression

$$y = \phi^\top(x)\,\theta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

Avoid overfitting by not fitting any parameters: integrate the parameters out instead of optimizing them.

Use a full parameter distribution $p(\theta)$ (and not a single point estimate $\theta^*$) when making predictions:
$$p(y_* \mid x_*) = \int p(y_* \mid x_*, \theta)\, p(\theta)\, \mathrm{d}\theta$$

The prediction no longer depends on $\theta$.

The predictive distribution reflects the uncertainty about the "correct" parameter setting.


Example

[Figures: left, Bayesian linear regression predictive distribution on the training data compared with the MLE and MAP fits; right, plausible functions sampled from the parameter distribution]

Light gray: uncertainty due to noise (the same as in MLE/MAP).

Dark gray: uncertainty due to parameter uncertainty.

Right: plausible functions under the parameter distribution (every single parameter setting describes one function).


Model for Bayesian Linear Regression

$$\text{Prior:}\quad p(\theta) = \mathcal{N}\big(m_0, S_0\big), \qquad \text{Likelihood:}\quad p(y \mid x, \theta) = \mathcal{N}\big(y \mid \phi^\top(x)\,\theta, \sigma^2\big)$$

The parameter $\theta$ becomes a latent (random) variable.

The prior distribution induces a distribution over plausible functions.

Choose a conjugate Gaussian prior: closed-form computations; Gaussian posterior.


Parameter Posterior and Predictions

If the prior $p(\theta) = \mathcal{N}\big(m_0, S_0\big)$ is Gaussian, the posterior is Gaussian as well (derive this):
$$p(\theta \mid X, y) = \mathcal{N}\big(m_N, S_N\big), \qquad S_N = \big(S_0^{-1} + \sigma^{-2}\Phi^\top\Phi\big)^{-1}, \qquad m_N = S_N\big(S_0^{-1}m_0 + \sigma^{-2}\Phi^\top y\big)$$

The mean $m_N$ is identical to the MAP estimate.

Assume a Gaussian distribution $p(\theta) = \mathcal{N}\big(m_N, S_N\big)$. Then
$$p(y_* \mid x_*) = \mathcal{N}\big(y_* \mid \phi^\top(x_*)\,m_N,\; \phi^\top(x_*)\,S_N\,\phi(x_*) + \sigma^2\big)$$

The term $\phi^\top(x_*)\,S_N\,\phi(x_*)$ accounts for the parameter uncertainty in the predictive variance.

More details: https://mml-book.com, Chapter 9.
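A sketch of these posterior and predictive computations for a polynomial-feature model; the zero-mean isotropic prior, the data, and all settings are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed setup: polynomial features, prior N(0, alpha^2 I), noise variance sigma^2
x = rng.uniform(-4, 4, size=20)
y = np.sin(x) + 0.3 * rng.normal(size=x.size)
M, sigma, alpha = 5, 0.3, 1.0
Phi = np.vander(x, M + 1, increasing=True)
D = Phi.shape[1]
m0, S0 = np.zeros(D), alpha**2 * np.eye(D)

# Posterior: S_N = (S_0^{-1} + sigma^{-2} Phi^T Phi)^{-1}, m_N = S_N (S_0^{-1} m_0 + sigma^{-2} Phi^T y)
S0_inv = np.linalg.inv(S0)
SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)
mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma**2)

# Predictive distribution at a query point x_*
x_star = 1.0
phi_star = np.vander(np.array([x_star]), M + 1, increasing=True).ravel()
pred_mean = phi_star @ mN
pred_var = phi_star @ SN @ phi_star + sigma**2   # parameter uncertainty + noise
print(pred_mean, pred_var)
```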


Marginal Likelihood

The marginal likelihood can be computed analytically. With $p(\theta) = \mathcal{N}\big(\mu, \Sigma\big)$,
$$p(y \mid X) = \int p(y \mid X, \theta)\, p(\theta)\, \mathrm{d}\theta = \mathcal{N}\big(y \mid \Phi\mu,\; \Phi\Sigma\Phi^\top + \sigma^2 I\big)$$

Derivation via completing the squares (see Section 9.3.5 of the MML book).
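A sketch of evaluating the (log) marginal likelihood with this closed form; the polynomial setup and all numbers are assumptions, and SciPy's multivariate normal density is used purely for convenience:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)

# Assumed setup: features Phi, targets y, prior N(mu, Sigma), noise variance sigma^2
x = rng.uniform(-4, 4, size=20)
y = np.sin(x) + 0.3 * rng.normal(size=x.size)
M, sigma, alpha = 5, 0.3, 1.0
Phi = np.vander(x, M + 1, increasing=True)
mu = np.zeros(Phi.shape[1])
Sigma = alpha**2 * np.eye(Phi.shape[1])

# p(y | X) = N(y | Phi mu, Phi Sigma Phi^T + sigma^2 I)
mean = Phi @ mu
cov = Phi @ Sigma @ Phi.T + sigma**2 * np.eye(len(y))
log_marginal = multivariate_normal.logpdf(y, mean=mean, cov=cov)
print(log_marginal)
```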


Distribution over Functions

Consider a linear regression setting
$$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$

[Figure: contour plot of the prior p(a, b) in the (a, b) plane]


Sampling from the Prior over Functions

Consider a linear regression setting
$$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$

Sample functions from the prior:
$$f_i(x) = a_i + b_i x, \qquad [a_i, b_i] \sim p(a, b)$$
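A sketch of drawing such functions from the prior; the number of samples and the x-grid are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Prior p(a, b) = N(0, I): sample intercept/slope pairs and the corresponding lines
num_samples = 10
ab = rng.multivariate_normal(mean=np.zeros(2), cov=np.eye(2), size=num_samples)

xs = np.linspace(-5, 5, 101)
# Each row of `functions` is one sampled function f_i(x) = a_i + b_i * x on the grid
functions = ab[:, 0:1] + ab[:, 1:2] * xs
print(functions.shape)  # (10, 101)
```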


Sampling from the Posterior over Functions

Consider a linear regression setting
$$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$

Training inputs/targets: $X = [x_1, \ldots, x_N]$, $y = [y_1, \ldots, y_N]$

[Figure: training data, y against x]


Posterior over the parameters:
$$p(a, b \mid X, y) = \mathcal{N}\big(m_N, S_N\big)$$

[Figure: contour plot of the posterior p(a, b | X, y) in the (a, b) plane]


Sample functions from the posterior:
$$[a_i, b_i] \sim p(a, b \mid X, y), \qquad f_i(x) = a_i + b_i x$$
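A sketch of computing the parameter posterior for this two-parameter model and sampling functions from it, using the posterior formulas above with $\phi(x) = [1, x]^\top$; the training data and noise level are assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed training data for the straight-line model y = a + b x + eps
N, sigma_n = 20, 1.0
x = rng.uniform(-10, 10, size=N)
y = -1.0 + 0.7 * x + sigma_n * rng.normal(size=N)

Phi = np.column_stack([np.ones(N), x])       # phi(x_n)^T = [1, x_n]
S0 = np.eye(2)                               # prior p(a, b) = N(0, I)

# Posterior N(m_N, S_N)
SN = np.linalg.inv(np.linalg.inv(S0) + Phi.T @ Phi / sigma_n**2)
mN = SN @ (Phi.T @ y / sigma_n**2)

# Sample parameter vectors [a_i, b_i] and form functions f_i(x) = a_i + b_i x
samples = rng.multivariate_normal(mN, SN, size=5)
xs = np.linspace(-10, 10, 101)
posterior_functions = samples[:, 0:1] + samples[:, 1:2] * xs
print(mN, posterior_functions.shape)
```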


Fitting Nonlinear Functions

Fit nonlinear functions using (Bayesian) linear regression: a linear combination of nonlinear features.

Example: radial-basis-function (RBF) network
$$f(x) = \sum_{i=1}^{n} \theta_i\, \phi_i(x), \quad \theta_i \sim \mathcal{N}(0, \sigma_p^2), \qquad \phi_i(x) = \exp\!\big(-\tfrac{1}{2}(x - \mu_i)^\top (x - \mu_i)\big)$$
for given "centers" $\mu_i$.
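A sketch of Bayesian linear regression with RBF features: the posterior formulas are the same as before, only the feature matrix changes. The training data, noise and prior variances, and helper names are assumptions; the 25 centers in $[-5, 3]$ follow the illustration below:

```python
import numpy as np

rng = np.random.default_rng(8)

def rbf_features(x, centers):
    """Phi with entries phi_i(x_n) = exp(-0.5 * (x_n - mu_i)^2) for scalar inputs."""
    return np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

centers = np.linspace(-5, 3, 25)             # 25 linearly spaced centers
sigma_n, sigma_p = 0.2, 1.0                  # assumed noise / prior standard deviations

# Assumed training data
x = rng.uniform(-5, 0, size=30)
y = np.sin(2 * x) + sigma_n * rng.normal(size=x.size)

Phi = rbf_features(x, centers)
S0 = sigma_p**2 * np.eye(len(centers))
SN = np.linalg.inv(np.linalg.inv(S0) + Phi.T @ Phi / sigma_n**2)
mN = SN @ (Phi.T @ y / sigma_n**2)

# Posterior mean function evaluated on a grid
xs = np.linspace(-5, 5, 200)
mean_function = rbf_features(xs, centers) @ mN
print(mean_function.shape)
```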


Illustration: Fitting a Radial Basis Function Network

$$\phi_i(x) = \exp\!\big(-\tfrac{1}{2}(x - \mu_i)^\top (x - \mu_i)\big)$$

[Figure: the Gaussian-shaped basis functions over x]

Place Gaussian-shaped basis functions $\phi_i$ at 25 input locations $\mu_i$, linearly spaced in the interval $[-5, 3]$.


Samples from the RBF Prior

$$f(x) = \sum_{i=1}^{n} \theta_i\, \phi_i(x), \qquad p(\theta) = \mathcal{N}(0, I)$$

[Figure: functions f(x) sampled from the RBF prior]


Samples from the RBF Posterior

$$f(x) = \sum_{i=1}^{n} \theta_i\, \phi_i(x), \qquad p(\theta \mid X, y) = \mathcal{N}\big(m_N, S_N\big)$$

[Figure: functions f(x) sampled from the RBF posterior]


RBF Posterior

[Figure: posterior over functions for the RBF model]


Limitations

[Figure: RBF posterior over functions]

Feature engineering: what basis functions should we use?

Finite number of features: above, without basis functions on the right, we cannot express any variability of the function. Ideally, we would add more (infinitely many) basis functions.


Approach

Instead of sampling parameters, which induce a distribution over functions, sample functions directly.

Place a prior on functions: make assumptions on the distribution of functions.

Intuition: a function is an infinitely long vector of function values; make assumptions on the distribution of those function values.

This leads to the Gaussian process.


Summary

[Figures: test RMSE for ML vs. MAP, polynomial fits (MLE/MAP/BLR), the parameter posterior in the (a, b) plane, and functions sampled from the posterior]

Regression = curve fitting.
Linear regression = linear in the parameters.
Parameter estimation via maximum likelihood and MAP estimation can lead to overfitting.
Bayesian linear regression addresses this issue, but may not be analytically tractable.
Predictive uncertainty in Bayesian linear regression explicitly accounts for parameter uncertainty.
Distribution over parameters → distribution over functions.


Appendix


Joint Gaussian Distribution

Joint Gaussian distribution
$$p(x, y) = \mathcal{N}\!\left(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$$

Marginal:
$$p(x) = \int p(x, y)\, \mathrm{d}y = \mathcal{N}\big(\mu_x, \Sigma_{xx}\big)$$

Conditional:
$$p(x \mid y) = \mathcal{N}\big(\mu_{x|y}, \Sigma_{x|y}\big), \qquad \mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y), \qquad \Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$$
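A sketch of Gaussian marginalization and conditioning with these formulas; the example mean, covariance, and observed value are assumptions:

```python
import numpy as np

# Assumed joint Gaussian over (x, y), here 1-D each: blocks of the mean and covariance
mu = np.array([0.0, 1.0])                    # [mu_x, mu_y]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])               # [[Sxx, Sxy], [Syx, Syy]]

mu_x, mu_y = mu[0:1], mu[1:2]
Sxx, Sxy = Sigma[0:1, 0:1], Sigma[0:1, 1:2]
Syx, Syy = Sigma[1:2, 0:1], Sigma[1:2, 1:2]

# Marginal p(x) = N(mu_x, Sxx): just read off the corresponding blocks
print("marginal:", mu_x, Sxx)

# Conditional p(x | y) = N(mu_x + Sxy Syy^{-1} (y - mu_y), Sxx - Sxy Syy^{-1} Syx)
y_obs = np.array([2.0])
mu_cond = mu_x + Sxy @ np.linalg.solve(Syy, y_obs - mu_y)
Sigma_cond = Sxx - Sxy @ np.linalg.solve(Syy, Syx)
print("conditional:", mu_cond, Sigma_cond)
```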


Linear Transformation of Gaussian Random Variables

If $x \sim \mathcal{N}\big(x \mid \mu, \Sigma\big)$ and $z = Ax + b$, then
$$p(z) = \mathcal{N}\big(z \mid A\mu + b,\; A\Sigma A^\top\big)$$
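A small sketch illustrating this identity by comparing the closed form with a Monte Carlo estimate; the particular $A$, $b$, $\mu$, $\Sigma$ are assumptions:

```python
import numpy as np

rng = np.random.default_rng(9)

# Assumed Gaussian x ~ N(mu, Sigma) and affine map z = A x + b
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
A = np.array([[2.0, 0.0], [1.0, -1.0]])
b = np.array([0.5, 0.0])

# Closed form: z ~ N(A mu + b, A Sigma A^T)
mean_z = A @ mu + b
cov_z = A @ Sigma @ A.T

# Monte Carlo check
x_samples = rng.multivariate_normal(mu, Sigma, size=100_000)
z_samples = x_samples @ A.T + b
print(mean_z, z_samples.mean(axis=0))
print(cov_z, np.cov(z_samples.T))
```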


Product of Two Gaussians

Let $x \in \mathbb{R}^D$. Then
$$\mathcal{N}\big(x \mid a, A\big)\,\mathcal{N}\big(x \mid b, B\big) = Z\,\mathcal{N}\big(x \mid c, C\big)$$
$$C = (A^{-1} + B^{-1})^{-1}, \qquad c = C(A^{-1}a + B^{-1}b), \qquad Z = (2\pi)^{-D/2}\,|A + B|^{-1/2}\exp\!\big(-\tfrac{1}{2}(a - b)^\top (A + B)^{-1}(a - b)\big)$$

The product of two Gaussians is an unnormalized Gaussian.

The "un-normalizer" $Z$ has a Gaussian functional form:
$$Z = \mathcal{N}\big(a \mid b, A + B\big) = \mathcal{N}\big(b \mid a, A + B\big)$$

Note: this is not a distribution ($a$ and $b$ are not random variables here).
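A sketch computing $C$, $c$, and $Z$ for two example Gaussians and checking $Z$ against the Gaussian functional form; the example means and covariances are assumptions, and SciPy's multivariate normal density is used for the checks:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two assumed Gaussians N(x | a, A) and N(x | b, B) in D = 2 dimensions
a = np.array([0.0, 1.0]);  A = np.array([[1.0, 0.2], [0.2, 0.5]])
b = np.array([1.0, -1.0]); B = np.array([[0.8, 0.0], [0.0, 0.3]])

# C = (A^{-1} + B^{-1})^{-1},  c = C (A^{-1} a + B^{-1} b)
C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))

# Z = N(a | b, A + B): the normalizer of the product
Z = multivariate_normal.pdf(a, mean=b, cov=A + B)

# Consistency check at an arbitrary point x: N(x|a,A) N(x|b,B) == Z N(x|c,C)
x = np.array([0.3, 0.4])
lhs = multivariate_normal.pdf(x, a, A) * multivariate_normal.pdf(x, b, B)
rhs = Z * multivariate_normal.pdf(x, c, C)
print(lhs, rhs)
```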


Example: Marginalization of a Product

$$p_1(x) = \mathcal{N}\big(x \mid a, A\big), \qquad p_2(x) = \mathcal{N}\big(x \mid b, B\big)$$

Then
$$\int p_1(x)\, p_2(x)\, \mathrm{d}x = Z = \mathcal{N}\big(a \mid b, A + B\big) \in \mathbb{R}$$

Note: in this context, $\mathcal{N}$ is used to describe the functional relationship between $a$ and $b$. Do not treat $a$ or $b$ as random variables; they are both deterministic quantities.