Linear Regression
Marc Deisenroth
Centre for Artificial Intelligence
Department of Computer Science
University College London
https://deisenroth.cc
AIMS Rwanda and AIMS Ghana
March/April 2020
Reference
Mathematics for Machine Learning
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
The fundamental mathematical tools needed to understand machine learning include linear algebra, analytic geometry, matrix decompositions, vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science or computer science students, or professionals, to efficiently learn the mathematics. This self-contained textbook bridges the gap between mathematical and machine learning texts, introducing the mathematical concepts with a minimum of prerequisites. It uses these concepts to derive four central machine learning methods: linear regression, principal component analysis, Gaussian mixture models and support vector machines. For students and others with a mathematical background, these derivations provide a starting point to machine learning texts. For those learning the mathematics for the first time, the methods help build intuition and practical experience with applying mathematical concepts. Every chapter includes worked examples and exercises to test understanding. Programming tutorials are offered on the book’s web site.
MARC PETER DEISENROTH is Senior Lecturer in Statistical Machine Learning at the Department of Computing, Imperial College London.
A. ALDO FAISAL leads the Brain & Behaviour Lab at Imperial College London, where he is also Reader in Neurotechnology at the Department of Bioengineering and the Department of Computing.
CHENG SOON ONG is Principal Research Scientist at the Machine Learning Research Group, Data61, CSIRO. He is also Adjunct Associate Professor at Australian National University.
https://mml-book.com
Chapter 9
Regression Problems
Regression (curve fitting)
Given inputs $x \in \mathbb{R}^D$ and corresponding observations $y \in \mathbb{R}$, find a function $f$ that models the relationship between $x$ and $y$.

[Figure: noisy observations $y$ plotted against inputs $x$]

Typically parametrize the function $f$ with parameters $\theta$
Linear regression: consider functions $f$ that are linear in the parameters
Linear Regression Functions

Straight lines:
$$y = f(x, \theta) = \theta_0 + \theta_1 x = \begin{bmatrix} \theta_0 & \theta_1 \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix}$$

Polynomials:
$$y = f(x, \theta) = \sum_{m=0}^{M} \theta_m x^m = \begin{bmatrix} \theta_0 & \cdots & \theta_M \end{bmatrix} \begin{bmatrix} 1 \\ \vdots \\ x^M \end{bmatrix}$$

Radial basis function networks:
$$y = f(x, \theta) = \sum_{m=1}^{M} \theta_m \exp\big(-\tfrac{1}{2}(x - \mu_m)^2\big)$$

[Figure: example functions $f(x)$ of each type plotted against $x$]
Linear Regression Model and Setting

$$y = x^\top \theta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

Given a training set $(x_1, y_1), \ldots, (x_N, y_N)$, we seek optimal parameters $\theta^*$
Maximum likelihood estimation
Maximum a posteriori estimation
Maximum Likelihood

Define $X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times D}$ and $y = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N$
Find parameters $\theta^*$ that maximize the likelihood
$$p(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \theta) = p(y \mid X, \theta) = \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big)$$
Log-transformation ⟹ maximize the log-likelihood
$$\log p(y \mid X, \theta) = \sum_{n=1}^{N} \log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big), \qquad \log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big) = -\frac{1}{2\sigma^2}(y_n - x_n^\top \theta)^2 + \text{const}$$
Maximum Likelihood (2)

With
$$\log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big) = -\frac{1}{2\sigma^2}(y_n - x_n^\top \theta)^2 + \text{const}$$
we get
$$\log p(y \mid X, \theta) = \sum_{n=1}^{N} \log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top \theta)^2 + \text{const} = -\frac{1}{2\sigma^2}(y - X\theta)^\top (y - X\theta) + \text{const} = -\frac{1}{2\sigma^2}\|y - X\theta\|^2 + \text{const}$$
Computing the gradient with respect to $\theta$ and setting it to 0 gives the maximum likelihood estimator (least-squares estimator)
$$\theta_{\mathrm{ML}} = (X^\top X)^{-1} X^\top y$$
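The closed-form estimator above translates directly into a few lines of numpy. The following is a minimal sketch (not part of the slides; the synthetic data and variable names are illustrative) that fits $\theta_{\mathrm{ML}}$ and computes a mean prediction; it solves the normal equations rather than forming the matrix inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: y = x^T theta_true + noise
N, D, sigma = 50, 2, 0.5
X = rng.uniform(-4, 4, size=(N, D))            # design matrix, one row per input
theta_true = np.array([1.5, -0.7])
y = X @ theta_true + sigma * rng.standard_normal(N)

# Maximum likelihood / least-squares estimate: theta = (X^T X)^{-1} X^T y.
# Solving the normal equations is numerically preferable to an explicit inverse.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
print("theta_ML:", theta_ml)

# Mean prediction at a new input x_star
x_star = np.array([0.3, -1.2])
print("mean prediction:", x_star @ theta_ml)
```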
Making Predictions

$$y = x^\top \theta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Given an arbitrary input $x_*$, we can predict the corresponding observation $y_*$ using the maximum likelihood parameter:
$$p(y_* \mid x_*, \theta_{\mathrm{ML}}) = \mathcal{N}\big(y_* \mid x_*^\top \theta_{\mathrm{ML}}, \sigma^2\big)$$
The measurement noise variance $\sigma^2$ is assumed known
In the absence of noise ($\sigma^2 = 0$), the prediction will be deterministic
Example 1: Linear Functions
[Figure: training data and the maximum likelihood (MLE) fit, a straight line]

$$y = \theta_0 + \theta_1 x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
At any query point $x_*$ we obtain the mean prediction as
$$\mathbb{E}[y_* \mid \theta_{\mathrm{ML}}, x_*] = \theta_0^{\mathrm{ML}} + \theta_1^{\mathrm{ML}} x_*$$
Nonlinear Functions
$$y = \phi(x)^\top \theta + \epsilon = \sum_{m=0}^{M} \theta_m x^m + \epsilon$$
Polynomial regression with features
$$\phi(x) = [1, x, x^2, \ldots, x^M]^\top$$
Maximum likelihood estimator:
$$\theta_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top y$$
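As a small sketch (not from the slides; the helper name and data are illustrative), the same estimator with polynomial features: stack the feature vectors $\phi(x_n)^\top$ into the matrix $\Phi$ and solve the normal equations.

```python
import numpy as np

def poly_features(x, M):
    """Feature matrix Phi with rows [1, x_n, x_n^2, ..., x_n^M]."""
    return np.vander(x, N=M + 1, increasing=True)

rng = np.random.default_rng(1)
x = rng.uniform(-4, 4, size=30)
y = np.sin(x) + 0.3 * rng.standard_normal(x.shape)     # some nonlinear ground truth

M = 5
Phi = poly_features(x, M)                               # shape (30, M+1)
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)      # (Phi^T Phi)^{-1} Phi^T y

# Mean predictions on a test grid
x_test = np.linspace(-4, 4, 100)
y_pred = poly_features(x_test, M) @ theta_ml
```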
Example 2: Polynomial Regression
Figure: Training data
Example 2: Polynomial Regression
Figure: 2nd-order polynomial (MLE fit and training data)
Example 2: Polynomial Regression
Figure: 4th-order polynomial (MLE fit and training data)
Example 2: Polynomial Regression
Figure: 6th-order polynomial (MLE fit and training data)
Example 2: Polynomial Regression
Figure: 8th-order polynomial (MLE fit and training data)
Example 2: Polynomial Regression
Figure: 10th-order polynomial (MLE fit and training data)
Overfitting
[Figure: RMSE on the training and test sets as a function of the degree of the polynomial]

The training error decreases with higher flexibility of the model
We are not so much interested in the training error, but in the generalization error: how well does the model perform when we predict at previously unseen input locations?
Maximum likelihood often runs into overfitting problems, i.e., we exploit the flexibility of the model to fit to the noise in the data
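To make the curve above concrete, here is a rough numpy sketch (synthetic data and helper names are illustrative, not from the slides) that fits polynomials of increasing degree and compares training and test RMSE.

```python
import numpy as np

rng = np.random.default_rng(2)

def poly_features(x, M):
    return np.vander(x, N=M + 1, increasing=True)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Noisy data from an underlying smooth function
f = lambda x: np.sin(x) + 0.5 * x
x_train = rng.uniform(-4, 4, 15)
x_test = rng.uniform(-4, 4, 200)
y_train = f(x_train) + 0.4 * rng.standard_normal(x_train.shape)
y_test = f(x_test) + 0.4 * rng.standard_normal(x_test.shape)

for M in range(0, 11):
    Phi = poly_features(x_train, M)
    theta = np.linalg.lstsq(Phi, y_train, rcond=None)[0]    # least-squares fit
    train_err = rmse(y_train, Phi @ theta)
    test_err = rmse(y_test, poly_features(x_test, M) @ theta)
    print(f"degree {M:2d}: train RMSE {train_err:.2f}, test RMSE {test_err:.2f}")
```

Typically, the training RMSE keeps falling with the degree while the test RMSE eventually rises again, which is the overfitting pattern the slide describes.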
MAP Estimation
Empirical observation: parametric models that overfit tend to have some extreme (large-amplitude) parameter values
Mitigate the effect of overfitting by placing a prior distribution $p(\theta)$ on the parameters
Penalize extreme values that are implausible under that prior
Choose $\theta^*$ as the parameter that maximizes the (log) parameter posterior
$$\log p(\theta \mid X, y) = \underbrace{\log p(y \mid X, \theta)}_{\text{log-likelihood}} + \underbrace{\log p(\theta)}_{\text{log-prior}} + \text{const}$$
The log-prior induces a direct penalty on the parameters
Maximum a posteriori (MAP) estimate (regularized least squares)
MAP Estimation (2)
Gaussian parameter prior $p(\theta) = \mathcal{N}(0, \alpha^2 I)$
Log-posterior distribution:
$$\log p(\theta \mid X, y) = -\frac{1}{2\sigma^2}(y - X\theta)^\top (y - X\theta) - \frac{1}{2\alpha^2}\theta^\top \theta + \text{const} = -\frac{1}{2\sigma^2}\|y - X\theta\|^2 - \frac{1}{2\alpha^2}\|\theta\|^2 + \text{const}$$
Compute the gradient with respect to $\theta$ and set it to 0
Maximum a posteriori estimate:
$$\theta_{\mathrm{MAP}} = \Big(X^\top X + \frac{\sigma^2}{\alpha^2} I\Big)^{-1} X^\top y$$
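The MAP estimator is regularized (ridge) least squares; a short sketch (illustrative names and data, not from the slides) shows that only the regularized Gram matrix changes relative to the ML solution, and that the resulting parameter vector is typically smaller in norm.

```python
import numpy as np

def map_estimate(Phi, y, sigma2, alpha2):
    """theta_MAP = (Phi^T Phi + (sigma^2 / alpha^2) I)^{-1} Phi^T y."""
    D = Phi.shape[1]
    A = Phi.T @ Phi + (sigma2 / alpha2) * np.eye(D)
    return np.linalg.solve(A, Phi.T @ y)

# Example usage with polynomial features of degree 8
rng = np.random.default_rng(3)
x = rng.uniform(-4, 4, 20)
y = np.sin(x) + 0.3 * rng.standard_normal(x.shape)
Phi = np.vander(x, N=9, increasing=True)

theta_map = map_estimate(Phi, y, sigma2=0.3 ** 2, alpha2=1.0)
theta_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]
print("||theta_ML|| =", np.linalg.norm(theta_ml))
print("||theta_MAP|| =", np.linalg.norm(theta_map))   # the prior shrinks extreme values
```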
Example: Polynomial Regression
Figure: Training data
Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\, \theta_{\mathrm{MAP}}$$
Example: Polynomial Regression
Figure: 2nd-order polynomial (training data with MLE and MAP fits)
Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\, \theta_{\mathrm{MAP}}$$
Example: Polynomial Regression
Figure: 4th-order polynomial (training data with MLE and MAP fits)
Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\, \theta_{\mathrm{MAP}}$$
Example: Polynomial Regression
Figure: 6th-order polynomial (training data with MLE and MAP fits)
Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\, \theta_{\mathrm{MAP}}$$
Example: Polynomial Regression
Figure: 8th-order polynomial (training data with MLE and MAP fits)
Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\, \theta_{\mathrm{MAP}}$$
Example: Polynomial Regression
Figure: 10th-order polynomial (training data with MLE and MAP fits)
Mean prediction:
$$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi^\top(x_*)\, \theta_{\mathrm{MAP}}$$
Generalization Error
[Figure: test-set RMSE as a function of the degree of the polynomial, for maximum likelihood and maximum a posteriori estimation]

MAP estimation “delays” the problem of overfitting
It does not provide a general solution
Need a more principled solution
Bayesian Linear Regression
$$y = \phi^\top(x)\, \theta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Avoid overfitting by not fitting any parameters: integrate the parameters out instead of optimizing them
Use a full parameter distribution $p(\theta)$ (and not a single point estimate $\theta^*$) when making predictions:
$$p(y_* \mid x_*) = \int p(y_* \mid x_*, \theta)\, p(\theta)\, \mathrm{d}\theta$$
The prediction no longer depends on $\theta$
The predictive distribution reflects the uncertainty about the “correct” parameter setting
Example
[Figure, left: training data with MLE, MAP, and BLR predictions and shaded predictive uncertainty; right: plausible functions sampled from the parameter distribution]

Light gray: uncertainty due to noise (same as in MLE/MAP)
Dark gray: uncertainty due to parameter uncertainty
Right: plausible functions under the parameter distribution (every single parameter setting describes one function)
Model for Bayesian Linear Regression
Prior: $p(\theta) = \mathcal{N}(m_0, S_0)$
Likelihood: $p(y \mid x, \theta) = \mathcal{N}\big(y \mid \phi^\top(x)\theta, \sigma^2\big)$
The parameter $\theta$ becomes a latent (random) variable
The prior distribution induces a distribution over plausible functions
Choose a conjugate Gaussian prior ⟹ closed-form computations, Gaussian posterior
Parameter Posterior and Predictions
The prior $p(\theta) = \mathcal{N}(m_0, S_0)$ is Gaussian ⟹ the posterior is Gaussian (derive this):
$$p(\theta \mid X, y) = \mathcal{N}(m_N, S_N), \qquad S_N = \big(S_0^{-1} + \sigma^{-2}\Phi^\top\Phi\big)^{-1}, \qquad m_N = S_N\big(S_0^{-1} m_0 + \sigma^{-2}\Phi^\top y\big)$$
The mean $m_N$ is identical to the MAP estimate
Assume a Gaussian distribution $p(\theta) = \mathcal{N}(m_N, S_N)$. Then
$$p(y_* \mid x_*) = \mathcal{N}\big(y_* \mid \phi^\top(x_*)\, m_N,\; \phi^\top(x_*)\, S_N\, \phi(x_*) + \sigma^2\big)$$
$\phi^\top(x_*)\, S_N\, \phi(x_*)$ accounts for parameter uncertainty in the predictive variance
More details: https://mml-book.com, Chapter 9
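A compact numpy sketch of these posterior and predictive formulas (not from the slides; the data, feature choice, and function names are illustrative):

```python
import numpy as np

def blr_posterior(Phi, y, sigma2, m0, S0):
    """Gaussian posterior N(m_N, S_N) over theta for Bayesian linear regression."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma2)
    return mN, SN

def blr_predict(phi_star, mN, SN, sigma2):
    """Predictive mean and variance at a single feature vector phi(x_*)."""
    mean = phi_star @ mN
    var = phi_star @ SN @ phi_star + sigma2        # parameter uncertainty + noise
    return mean, var

# Example usage: degree-3 polynomial features, zero-mean isotropic prior
rng = np.random.default_rng(4)
x = rng.uniform(-4, 4, 25)
y = np.sin(x) + 0.3 * rng.standard_normal(x.shape)
Phi = np.vander(x, N=4, increasing=True)

m0, S0, sigma2 = np.zeros(4), np.eye(4), 0.3 ** 2
mN, SN = blr_posterior(Phi, y, sigma2, m0, S0)

phi_star = np.vander(np.array([1.5]), N=4, increasing=True)[0]
mean, var = blr_predict(phi_star, mN, SN, sigma2)
print(f"p(y*|x*=1.5) ~ N({mean:.2f}, {var:.3f})")
```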
Marginal Likelihood
The marginal likelihood can be computed analytically. With $p(\theta) = \mathcal{N}(\mu, \Sigma)$,
$$p(y \mid X) = \int p(y \mid X, \theta)\, p(\theta)\, \mathrm{d}\theta = \mathcal{N}\big(y \mid \Phi\mu,\; \Phi\Sigma\Phi^\top + \sigma^2 I\big)$$
Derivation via completing the squares (see Section 9.3.5 of the MML book)
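As a sketch (illustrative data and helper name, not from the slides; scipy is assumed only for the Gaussian log-density), this closed form can be evaluated directly, for example to compare feature sets on the same data:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(Phi, y, mu, Sigma, sigma2):
    """log p(y|X) = log N(y | Phi mu, Phi Sigma Phi^T + sigma^2 I)."""
    mean = Phi @ mu
    cov = Phi @ Sigma @ Phi.T + sigma2 * np.eye(len(y))
    return multivariate_normal(mean=mean, cov=cov).logpdf(y)

# Example usage: compare two polynomial degrees on the same data
rng = np.random.default_rng(5)
x = rng.uniform(-4, 4, 25)
y = np.sin(x) + 0.3 * rng.standard_normal(x.shape)

for M in (2, 6):
    Phi = np.vander(x, N=M + 1, increasing=True)
    lml = log_marginal_likelihood(Phi, y, np.zeros(M + 1), np.eye(M + 1), 0.3 ** 2)
    print(f"degree {M}: log marginal likelihood = {lml:.1f}")
```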
Distribution over Functions
Consider a linear regression setting
$$y = f(x) + \epsilon = a + bx + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$

[Figure: the prior $p(a, b)$ visualized in the $(a, b)$ plane]
Sampling from the Prior over Functions
Consider a linear regression setting
$$y = f(x) + \epsilon = a + bx + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$
$$f_i(x) = a_i + b_i x, \qquad [a_i, b_i] \sim p(a, b)$$
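A tiny sketch of this sampling procedure (illustrative, not from the slides; matplotlib is assumed for plotting): draw $(a_i, b_i)$ from the prior and plot the corresponding straight lines $f_i$.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
xs = np.linspace(-5, 5, 100)

# Prior p(a, b) = N(0, I): each sample (a_i, b_i) defines one function f_i(x) = a_i + b_i x
samples = rng.multivariate_normal(mean=np.zeros(2), cov=np.eye(2), size=10)
for a_i, b_i in samples:
    plt.plot(xs, a_i + b_i * xs, color="gray", alpha=0.6)

plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("Functions sampled from the prior")
plt.show()
```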
Sampling from the Posterior over Functions
Consider a linear regression setting
$$y = f(x) + \epsilon = a + bx + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$
$X = [x_1, \ldots, x_N]$, $y = [y_1, \ldots, y_N]$: training inputs/targets

[Figure: training data $(x, y)$ scattered around a straight line]
Sampling from the Posterior over Functions
Consider a linear regression setting
$$y = f(x) + \epsilon = a + bx + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$
Posterior: $p(a, b \mid X, y) = \mathcal{N}(m_N, S_N)$

[Figure: the posterior $p(a, b \mid X, y)$ visualized in the $(a, b)$ plane]
Sampling from the Posterior over Functions
Consider a linear regression setting
$$y = f(x) + \epsilon = a + bx + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2)$$
$$[a_i, b_i] \sim p(a, b \mid X, y), \qquad f_i(x) = a_i + b_i x$$
Fitting Nonlinear Functions
Fit nonlinear functions using (Bayesian) linear regression: a linear combination of nonlinear features
Example: radial-basis-function (RBF) network
$$f(x) = \sum_{i=1}^{n} \theta_i\, \phi_i(x), \qquad \theta_i \sim \mathcal{N}(0, \sigma_p^2),$$
where
$$\phi_i(x) = \exp\big(-\tfrac{1}{2}(x - \mu_i)^\top(x - \mu_i)\big)$$
for given “centers” $\mu_i$
Illustration: Fitting a Radial Basis Function Network
$$\phi_i(x) = \exp\big(-\tfrac{1}{2}(x - \mu_i)^\top(x - \mu_i)\big)$$

[Figure: Gaussian-shaped basis functions plotted over $x$]

Place Gaussian-shaped basis functions $\phi_i$ at 25 input locations $\mu_i$, linearly spaced in the interval $[-5, 3]$
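A sketch of this construction (illustrative data and names, not from the slides): build the RBF feature matrix for 25 centers on $[-5, 3]$ and reuse the Bayesian linear regression posterior from above to sample functions.

```python
import numpy as np

def rbf_features(x, centers):
    """Phi[n, i] = exp(-0.5 * (x_n - mu_i)^2) for scalar inputs."""
    return np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

centers = np.linspace(-5, 3, 25)              # 25 centers, linearly spaced in [-5, 3]

rng = np.random.default_rng(7)
x_train = rng.uniform(-5, 3, 40)
y_train = np.sin(x_train) + 0.2 * rng.standard_normal(x_train.shape)

Phi = rbf_features(x_train, centers)
sigma2, S0 = 0.2 ** 2, np.eye(len(centers))   # prior p(theta) = N(0, I)

# Posterior N(m_N, S_N) over the RBF weights (same formulas as before, with m0 = 0)
SN = np.linalg.inv(np.linalg.inv(S0) + Phi.T @ Phi / sigma2)
mN = SN @ (Phi.T @ y_train / sigma2)

# Sample functions from the posterior on a test grid
x_test = np.linspace(-5, 5, 200)
Phi_test = rbf_features(x_test, centers)
theta_samples = rng.multivariate_normal(mN, SN, size=5)
f_samples = theta_samples @ Phi_test.T        # each row is one sampled function
```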
Samples from the RBF Prior
$$f(x) = \sum_{i=1}^{n} \theta_i\, \phi_i(x), \qquad p(\theta) = \mathcal{N}(0, I)$$

[Figure: functions $f(x)$ sampled from the RBF prior]
Samples from the RBF Posterior
$$f(x) = \sum_{i=1}^{n} \theta_i\, \phi_i(x), \qquad p(\theta \mid X, y) = \mathcal{N}(m_N, S_N)$$

[Figure: functions $f(x)$ sampled from the RBF posterior]
RBF Posterior
[Figure: the RBF posterior over $x \in [-5, 5]$]
Limitations
[Figure: the RBF posterior over $x \in [-5, 5]$; the basis functions cover only part of the input range]

Feature engineering (what basis functions to use?)
Finite number of features:
Above: without basis functions on the right, we cannot express any variability of the function
Ideally: add more (infinitely many) basis functions
Approach
Instead of sampling parameters, which induce a distribution over functions, sample functions directly
Place a prior on functions ⟹ make assumptions on the distribution of functions
Intuition: a function is an infinitely long vector of function values ⟹ make assumptions on the distribution of function values
⟹ Gaussian process
Summary
[Figures: test-set RMSE for maximum likelihood vs. maximum a posteriori estimation; polynomial fits (MLE, MAP, BLR) with training data; the parameter posterior over $(a, b)$; functions sampled from the posterior]

Regression = curve fitting
Linear regression = linear in the parameters
Parameter estimation via maximum likelihood and MAP estimation can lead to overfitting
Bayesian linear regression addresses this issue, but may not be analytically tractable
Predictive uncertainty in Bayesian linear regression explicitly accounts for parameter uncertainty
Distribution over parameters ⟹ distribution over functions
Appendix
Joint Gaussian Distribution
Joint Gaussian distribution
$$p(x, y) = \mathcal{N}\left(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$$
Marginal:
$$p(x) = \int p(x, y)\, \mathrm{d}y = \mathcal{N}(\mu_x, \Sigma_{xx})$$
Conditional:
$$p(x \mid y) = \mathcal{N}(\mu_{x|y}, \Sigma_{x|y}), \qquad \mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y), \qquad \Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$$
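These formulas are easy to exercise numerically; the sketch below (illustrative numbers, not from the slides) reads off the marginal and computes the conditional for a two-dimensional joint Gaussian.

```python
import numpy as np

# Joint Gaussian over (x, y), both scalar here for simplicity
mu = np.array([1.0, -0.5])                       # [mu_x, mu_y]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                   # [[Sxx, Sxy], [Syx, Syy]]

# Marginal p(x) = N(mu_x, Sigma_xx): simply read off the corresponding blocks
mu_x, Sxx = mu[0], Sigma[0, 0]

# Conditional p(x | y = y_obs)
y_obs = 0.3
Sxy, Syy, mu_y = Sigma[0, 1], Sigma[1, 1], mu[1]
mu_x_given_y = mu_x + Sxy / Syy * (y_obs - mu_y)
S_x_given_y = Sxx - Sxy / Syy * Sxy
print(f"p(x | y={y_obs}) = N({mu_x_given_y:.3f}, {S_x_given_y:.3f})")
```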
Linear Transformation of Gaussian RandomVariables
If $x \sim \mathcal{N}(x \mid \mu, \Sigma)$ and $z = Ax + b$, then
$$p(z) = \mathcal{N}\big(z \mid A\mu + b,\; A\Sigma A^\top\big)$$
Product of Two Gaussians
Let $x \in \mathbb{R}^D$. Then
$$\mathcal{N}(x \mid a, A)\, \mathcal{N}(x \mid b, B) = Z\, \mathcal{N}(x \mid c, C)$$
$$C = (A^{-1} + B^{-1})^{-1}, \qquad c = C(A^{-1}a + B^{-1}b)$$
$$Z = (2\pi)^{-\frac{D}{2}}\, |A + B|^{-\frac{1}{2}}\, \exp\big(-\tfrac{1}{2}(a - b)^\top (A + B)^{-1}(a - b)\big)$$
The product of two Gaussians is an unnormalized Gaussian
The “un-normalizer” $Z$ has a Gaussian functional form:
$$Z = \mathcal{N}(a \mid b, A + B) = \mathcal{N}(b \mid a, A + B)$$
Note: this is not a distribution (no random variables)
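A numerical sketch (illustrative matrices, not from the slides; scipy is assumed only for density evaluation) that checks the product formula and that $Z$ agrees with the Gaussian form $\mathcal{N}(a \mid b, A + B)$:

```python
import numpy as np
from scipy.stats import multivariate_normal

a, A = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
b, B = np.array([1.5, -0.5]), np.array([[0.8, 0.0], [0.0, 1.2]])

# N(x|a,A) N(x|b,B) = Z N(x|c,C)
C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.inv(A) @ a + np.linalg.inv(B) @ b)
Z = multivariate_normal(mean=b, cov=A + B).pdf(a)     # Z = N(a | b, A + B)

# Check at an arbitrary point x: the product of densities equals Z * N(x|c,C)
x = np.array([0.3, 0.7])
lhs = multivariate_normal(a, A).pdf(x) * multivariate_normal(b, B).pdf(x)
rhs = Z * multivariate_normal(c, C).pdf(x)
print(lhs, rhs)   # should agree up to numerical precision
```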
Example: Marginalization of a Product
$$p_1(x) = \mathcal{N}(x \mid a, A), \qquad p_2(x) = \mathcal{N}(x \mid b, B)$$
Then
$$\int p_1(x)\, p_2(x)\, \mathrm{d}x = Z = \mathcal{N}(a \mid b, A + B) \in \mathbb{R}$$
Note: in this context, $\mathcal{N}$ is used to describe the functional relationship between $a$ and $b$. Do not treat $a$ or $b$ as random variables; they are both deterministic quantities.