MLSS 2012: Gaussian Processes for Machine Learning

John P. Cunningham
Washington University
University of Cambridge

April 18, 2012

Outline

Gaussian Process Basics
  Gaussians in words and pictures
  Gaussians in equations
  Using Gaussian Processes

Beyond Basics
  Kernel choices
  Likelihood choices
  Shortcomings of GP
  Connections

Conclusions & References

Gaussian Process Basics
Gaussians in words and pictures

What is a Gaussian (for machine learning)?

◮ A handy tool for Bayesian inference on real valued variables:

[Figure: probability density over Heart Rate at 0700 (bps); successive slides overlay the prior p(f), the observations y1 (Monday), y2 (Tuesday), y3 (Wednesday), y4 (Thursday), and the posterior p(f|y)]

From univariate to multivariate Gaussians:

[Figure: joint density over Heart Rate at 0700 (bps) and Heart Rate at 0800 (bps); successive slides overlay the prior p(f), the observations y1 (Monday) through y4 (Thursday), and the posterior p(f|y)]

From multivariate Gaussians to Gaussian Processes:

[Figure: Heart Rate (bps) versus Time of Day (0600 to 1200); successive slides add the samples y1 (Monday), y2 (Tuesday), y3 (Wednesday), y4 (Thursday)]

Our representation of a GP distribution:

[Figure: Heart Rate (bps) versus Time of Day, showing the prior p(f) and the observations y1 (Monday) through y4 (Thursday)]

We can take measurements less rigidly:

[Figure: Heart Rate (bps) versus Time of Day]

Updating the posterior:

[Figure: Heart Rate (bps) versus Time of Day; the posterior is updated as observations arrive]

An intuitive summary

◮ Univariate Gaussians: distributions over real valued variables

◮ Multivariate Gaussians: {pairs, triplets, ...} of real valued vars

◮ Gaussian Processes: functions of (infinite numbers of) real valued variables → regression.

Regression: a few reminders

◮ denoising/smoothing
◮ prediction/forecasting
◮ dangers of parametric models
◮ dangers of overfitting/underfitting

[Figure: Heart Rate (bps) versus Time of Day, illustrating each point in turn]


Gaussian Process Basics

Gaussians in equations

Review: multivariate Gaussian

◮ f ∈ R^n is normally distributed if

   p(f) = (2\pi)^{-n/2} |K|^{-1/2} \exp\left\{ -\tfrac{1}{2} (f - m)^\top K^{-1} (f - m) \right\}

◮ for mean vector m ∈ R^n and positive semidefinite covariance matrix K ∈ R^{n×n}

◮ shorthand: f ∼ N(m, K)
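A minimal NumPy sketch (not from the slides) of the density just defined; the mean, covariance, and test point are arbitrary illustrative values, and scipy's multivariate_normal is used only as a cross-check.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_logpdf(f, m, K):
    """Log of p(f) = (2*pi)^(-n/2) |K|^(-1/2) exp(-0.5 (f-m)^T K^{-1} (f-m))."""
    n = len(m)
    L = np.linalg.cholesky(K)                  # K = L L^T (requires K positive definite)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f - m))   # K^{-1}(f - m)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))  # log|K| via the Cholesky factor
    return -0.5 * (f - m) @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

# Illustrative values (not from the slides)
m = np.array([70.0, 75.0])
K = np.array([[49.0, 29.7], [29.7, 49.0]])
f = np.array([65.0, 80.0])

print(gaussian_logpdf(f, m, K))
print(multivariate_normal(mean=m, cov=K).logpdf(f))   # should agree
```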

Definition: Gaussian Process

◮ Loosely, a multivariate Gaussian of uncountably infinite length... a really long vector ≈ a function

◮ f is a Gaussian process if f(t) = [f(t_1), ..., f(t_n)]^T has a multivariate normal distribution for all t = [t_1, ..., t_n]^T:

   f(t) ∼ N(m(t), K(t, t))

◮ (t ∈ R here for familiarity with regression in time, but the domain can be x ∈ R^D)

◮ What are m(t), K(t, t)?

Definition: Gaussian Process

Mean function m(t):

◮ any function m : R → R (or m : R^D → R)
◮ very often m(t) = 0 ∀ t (mean subtract your data)

Kernel (covariance) function:

◮ any valid Mercer kernel k : R^D × R^D → R

◮ Mercer's theorem: every matrix K(t, t) = {k(t_i, t_j)}_{i,j=1...n} is a positive semidefinite (covariance) matrix ∀ t:

   v^\top K(t, t)\, v = \sum_{i=1}^{n} \sum_{j=1}^{n} K_{ij} v_i v_j = \sum_{i=1}^{n} \sum_{j=1}^{n} k(t_i, t_j)\, v_i v_j \geq 0
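A small NumPy check (illustrative, not from the slides) that a Gram matrix built from the squared exponential kernel introduced next is indeed positive semidefinite; the input points and hyperparameters are arbitrary.

```python
import numpy as np

def se_kernel(ti, tj, sigma_f=7.0, ell=100.0):
    """Squared exponential kernel k(ti, tj) = sigma_f^2 exp(-(ti - tj)^2 / (2 ell^2))."""
    return sigma_f**2 * np.exp(-0.5 * (ti - tj)**2 / ell**2)

t = np.linspace(0.0, 500.0, 50)               # arbitrary input points
K = se_kernel(t[:, None], t[None, :])         # Gram matrix {k(t_i, t_j)}

# Mercer: v^T K v >= 0 for all v, i.e. all eigenvalues are (numerically) nonnegative
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())                          # ~0 up to floating-point error
assert eigvals.min() > -1e-8
```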

Definition: Gaussian Process

GP is fully defined by:

◮ mean function m(·) and kernel (covariance) function k(·, ·)
◮ the requirement that every finite subset of the domain t has a multivariate normal distribution f(t) ∼ N(m(t), K(t, t))

Notes
◮ that this should exist is not trivial!
◮ most interesting properties are inherited
◮ Kernel function...

Kernel Function

Example kernel (squared exponential or SE):

   k(t_i, t_j) = \sigma_f^2 \exp\left\{ -\frac{1}{2\ell^2} (t_i - t_j)^2 \right\}

From kernel to covariance matrix
◮ Choose some hyperparameters: σf = 7, ℓ = 100

   t = [0700, 0800, 1029]^T

   K(t, t) = {k(t_i, t_j)}_{i,j} =
   [ 49.0  29.7  00.2 ]
   [ 29.7  49.0  03.6 ]
   [ 00.2  03.6  49.0 ]
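A short NumPy sketch reproducing the covariance matrix above from the SE kernel; rerunning it with the other hyperparameter settings shown next gives the corresponding matrices.

```python
import numpy as np

def se_kernel(ti, tj, sigma_f, ell):
    """Squared exponential kernel: sigma_f^2 * exp(-(ti - tj)^2 / (2 ell^2))."""
    return sigma_f**2 * np.exp(-0.5 * (ti - tj)**2 / ell**2)

t = np.array([700.0, 800.0, 1029.0])          # the three input times on the slide

# sigma_f = 7, ell = 100 reproduces the matrix above (to one decimal place)
K = se_kernel(t[:, None], t[None, :], sigma_f=7.0, ell=100.0)
print(np.round(K, 1))
# approximately:
# [[49.  29.7  0.2]
#  [29.7 49.   3.6]
#  [ 0.2  3.6 49. ]]
```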

Kernel Function

Same SE kernel, different hyperparameters:

◮ σf = 7, ℓ = 500:

   K(t, t) = [ 49.0  48.0  39.5 ]
             [ 48.0  49.0  44.1 ]
             [ 39.5  44.1  49.0 ]

◮ σf = 7, ℓ = 50:

   K(t, t) = [ 49.0  06.6  00.0 ]
             [ 06.6  49.0  00.0 ]
             [ 00.0  00.0  49.0 ]

◮ σf = 14, ℓ = 50:

   K(t, t) = [ 196   26.5  00.0 ]
             [ 26.5  196   0.01 ]
             [ 00.0  0.01  196  ]

Intuitive summary of GP so far

◮ GPs offer distributions over functions (infinite numbers of jointly Gaussian variables)

◮ For any finite subset vector t, we have a normal distribution:

   f(t) ∼ N(0, K(t, t))

◮ where the covariance matrix K is calculated by plugging t into the kernel k(·, ·).

◮ New notation: f ∼ GP(m(·), k(·, ·)) or f ∼ GP(m, k).

MLSS 2012: Gaussian Processes for Machine Learning

Gaussian Process Basics

Gaussians in equations

Important Gaussian properties (for today’s purposes):

◮ additivity (forming a joint)

◮ conditioning (inference)

◮ expectations (posterior and predictive moments)

◮ marginalisation (marginal likelihood/model selection)

◮ ...

MLSS 2012: Gaussian Processes for Machine Learning

Gaussian Process Basics

Gaussians in equations

Additivity (joint)

◮ prior (or latent) f ∼ N(m_f, K_ff)
◮ additive iid noise n ∼ N(0, σ_n² I)
◮ let y = f + n, then:

   p(y, f) = p(y|f)\, p(f) = N\left( \begin{bmatrix} f \\ y \end{bmatrix};\ \begin{bmatrix} m_f \\ m_y \end{bmatrix},\ \begin{bmatrix} K_{ff} & K_{fy} \\ K_{fy}^\top & K_{yy} \end{bmatrix} \right)

◮ where (in this case):

   K_{fy} = E[(f - m_f)(y - m_y)^\top] = K_{ff}, \qquad K_{yy} = K_{ff} + \sigma_n^2 I

◮ latent f and noisy observation y are jointly Gaussian

Where did the GP go?

◮ prior (or latent) f ∼ N(m_f, K_ff)
◮ additive iid noise n ∼ N(0, σ_n² I)
◮ let y = f + n, then:

   p(y, f) = p(y|f)\, p(f) = N\left( \begin{bmatrix} f \\ y \end{bmatrix};\ \begin{bmatrix} m_f \\ m_y \end{bmatrix},\ \begin{bmatrix} K_{ff} & K_{fy} \\ K_{fy}^\top & K_{yy} \end{bmatrix} \right)

◮ If f and y are indexed by some input points t:

   m_f = [m_f(t_1), ..., m_f(t_n)]^\top, \qquad K_{ff} = \{k(t_i, t_j)\}_{i,j=1...n}, \quad ...

Where did the GP go?

◮ prior (or latent) f ∼ GP(m_f, k_ff)
◮ additive iid noise n ∼ GP(0, σ_n² δ)
◮ let y = f + n, then:

   p(y(t), f(t)) = p(y|f)\, p(f) = N\left( \begin{bmatrix} f \\ y \end{bmatrix};\ \begin{bmatrix} m_f \\ m_y \end{bmatrix},\ \begin{bmatrix} K_{ff} & K_{fy} \\ K_{fy}^\top & K_{yy} \end{bmatrix} \right)

◮ If f and y are indexed by some input points t:

   m_f = [m_f(t_1), ..., m_f(t_n)]^\top, \qquad K_{ff} = \{k(t_i, t_j)\}_{i,j=1...n}, \quad ...

◮ warning: overloaded notation - f can be infinite (GP) or finite (MVN) depending on context.

Conditioning (inference)

◮ If f and y are jointly Gaussian:

   \begin{bmatrix} f \\ y \end{bmatrix} \sim N\left( \begin{bmatrix} m_f \\ m_y \end{bmatrix},\ \begin{bmatrix} K_{ff} & K_{fy} \\ K_{fy}^\top & K_{yy} \end{bmatrix} \right)

◮ Then:

   f \,|\, y \sim N\left( K_{fy} K_{yy}^{-1} (y - m_y) + m_f,\ \ K_{ff} - K_{fy} K_{yy}^{-1} K_{fy}^\top \right)

◮ inference of the latent given data is simple linear algebra:

   p(f|y) = \frac{p(y|f)\, p(f)}{p(y)}
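A minimal NumPy sketch of this conditioning formula, using the SE kernel with the hyperparameter values the slides choose later (σf = 10, ℓ = 50, σn = 1); the observed values themselves are made up, and this is illustrative rather than the lecture's own code.

```python
import numpy as np

def se_kernel(a, b, sigma_f=10.0, ell=50.0):
    """Squared exponential kernel matrix between two vectors of inputs."""
    return sigma_f**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def condition(t_f, t_y, y, sigma_n=1.0):
    """Posterior mean and covariance of f(t_f) given noisy observations y at t_y.

    Implements  f|y ~ N( Kfy Kyy^{-1} (y - my) + mf,  Kff - Kfy Kyy^{-1} Kfy^T )
    with zero mean functions (mf = my = 0) and Kyy = Kff(t_y, t_y) + sigma_n^2 I.
    """
    Kff = se_kernel(t_f, t_f)
    Kfy = se_kernel(t_f, t_y)
    Kyy = se_kernel(t_y, t_y) + sigma_n**2 * np.eye(len(t_y))
    mean = Kfy @ np.linalg.solve(Kyy, y)
    cov = Kff - Kfy @ np.linalg.solve(Kyy, Kfy.T)
    return mean, cov

# Illustrative usage: condition on two made-up observations, query a small grid
t_y = np.array([204.0, 90.0])
y = np.array([5.0, -7.0])
t_f = np.linspace(0, 500, 6)
mean, cov = condition(t_f, t_y, y)
print(mean)
print(np.sqrt(np.diag(cov)))          # pointwise posterior standard deviation
```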

Expectation (posterior and predictive moments)

◮ Conditioning on data gave us:

   f \,|\, y \sim N\left( K_{fy} K_{yy}^{-1} (y - m_y) + m_f,\ \ K_{ff} - K_{fy} K_{yy}^{-1} K_{fy}^\top \right)

◮ then E[f | y] = K_{fy} K_{yy}^{-1} (y - m_y) + m_f (MAP, posterior mean, ...)

◮ Predict data observations y*:

   \begin{bmatrix} y \\ y_* \end{bmatrix} \sim N\left( \begin{bmatrix} m_y \\ m_{y_*} \end{bmatrix},\ \begin{bmatrix} K_{yy} & K_{y_* y} \\ K_{y_* y}^\top & K_{y_* y_*} \end{bmatrix} \right)

◮ no different:

   y_* \,|\, y \sim N\left( K_{y_* y} K_{yy}^{-1} (y - m_y) + m_{y_*},\ \ K_{y_* y_*} - K_{y_* y} K_{yy}^{-1} K_{y_* y}^\top \right)

Marginalisation (marginal likelihood and model selection)

◮ Again, if:

   \begin{bmatrix} f \\ y \end{bmatrix} \sim N\left( \begin{bmatrix} m_f \\ m_y \end{bmatrix},\ \begin{bmatrix} K_{ff} & K_{fy} \\ K_{fy}^\top & K_{yy} \end{bmatrix} \right)

◮ we can marginalize out the latent:

   p(y) = \int p(y|f)\, p(f)\, df \ \ \leftrightarrow \ \ y \sim N(m_y, K_{yy})

◮ the marginal likelihood of the data (or log p(y), the data log-likelihood)

◮ In the GP context, this is actually p(y|θ) = p(y | σ_f, σ_n, ℓ). This can be the basis of model selection.
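A NumPy sketch of evaluating this Gaussian marginal likelihood, y ∼ N(0, K_ff + σ_n² I), for the running SE-plus-noise model; the data values and hyperparameters are illustrative.

```python
import numpy as np

def log_marginal_likelihood(t, y, sigma_f, ell, sigma_n):
    """log p(y | sigma_f, sigma_n, ell) under y ~ N(0, Kff + sigma_n^2 I)."""
    n = len(t)
    Kyy = sigma_f**2 * np.exp(-0.5 * (t[:, None] - t[None, :])**2 / ell**2) \
          + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(Kyy)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # Kyy^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                     # -0.5 log|Kyy|
            - 0.5 * n * np.log(2 * np.pi))

# Illustrative usage on made-up data
t = np.array([90.0, 204.0, 330.0])
y = np.array([-7.0, 5.0, 2.0])
print(log_marginal_likelihood(t, y, sigma_f=10.0, ell=50.0, sigma_n=1.0))
```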

MLSS 2012: Gaussian Processes for Machine Learning

Gaussian Process Basics

Gaussians in equations

Complaint

◮ I’m bored. All we are doing is messing around with Gaussians.

◮ Correct! (sorry)

◮ This is the whole point.

◮ We can do some remarkable things...

Gaussian Process Basics
Using Gaussian Processes

Our example model

◮ f ∼ GP(0, k_ff), where k_ff(t_i, t_j) = σ_f² exp{ −(t_i − t_j)² / (2ℓ²) }
◮ y | f ∼ GP(f, k_nn), where k_nn(t_i, t_j) = σ_n² δ(t_i − t_j)
◮ y ∼ GP(0, k_yy), where k_yy(t_i, t_j) = k_ff(t_i, t_j) + k_nn(t_i, t_j)
◮ We choose σ_f = 10, ℓ = 50, σ_n = 1
◮ The prior on f, and a draw from f:

[Figure: output f versus input t ∈ [0, 500], showing the prior on f and a single draw from f]

Drawing from the prior

◮ These steps should be clear:

[Figure: a draw from the prior, output f versus input t ∈ [0, 500]]

◮ Take n (many, but finite!) points t_i ∈ [0, 500]
◮ Evaluate K_ff = {k_ff(t_i, t_j)}
◮ Draw from f ∼ N(0, K_ff)
◮ ( f = chol(K)' * randn(n,1) )
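The MATLAB one-liner above, expanded into a NumPy sketch; the small jitter added to the diagonal is a standard numerical safeguard (an assumption, not something the slide specifies).

```python
import numpy as np

rng = np.random.default_rng(0)

# Take n (many, but finite!) points in [0, 500]
n = 200
t = np.linspace(0.0, 500.0, n)

# Evaluate Kff with the SE kernel, sigma_f = 10, ell = 50
sigma_f, ell = 10.0, 50.0
Kff = sigma_f**2 * np.exp(-0.5 * (t[:, None] - t[None, :])**2 / ell**2)

# Draw f ~ N(0, Kff) via the Cholesky factor (equivalent to chol(K)' * randn(n,1))
jitter = 1e-8 * np.eye(n)                   # numerical safeguard
L = np.linalg.cholesky(Kff + jitter)        # lower triangular, L L^T = Kff
f = L @ rng.standard_normal(n)
print(f[:5])
```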

Draw a few more

◮ four draws from f:

[Figure: four draws from f, output f versus input t ∈ [0, 500]]

Impact of hyperparameters

◮ σ_f = 10, ℓ = 50   [Figure: draws from f, output f versus input t]
◮ σ_f = 4, ℓ = 50    [Figure: draws from f; smaller σ_f gives smaller amplitude]
◮ σ_f = 4, ℓ = 10    [Figure: draws from f; smaller ℓ gives faster variation]

Multidimensional input

◮ just make each input x ∈ R^D (here D = 2, e.g. lat and long)
◮ f ∼ GP(0, k_ff), where

   k_{ff}(x^{(i)}, x^{(j)}) = \sigma_f^2 \exp\left\{ -\sum_{d=1}^{D} \frac{1}{2\ell_d^2} \left(x_d^{(i)} - x_d^{(j)}\right)^2 \right\}

◮ (multidimensional GP in action: Cunningham, Rasmussen, Ghahramani (2012) AISTATS)
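A NumPy sketch of this multidimensional (per-dimension lengthscale) squared exponential kernel; the 2-D inputs and lengthscale values are illustrative.

```python
import numpy as np

def ard_se_kernel(X, Z, sigma_f, lengthscales):
    """k(x, z) = sigma_f^2 exp( -sum_d (x_d - z_d)^2 / (2 ell_d^2) ).

    X: (n, D) array, Z: (m, D) array, lengthscales: length-D array of ell_d.
    """
    ell = np.asarray(lengthscales)
    diff = X[:, None, :] - Z[None, :, :]            # (n, m, D) pairwise differences
    sq = np.sum((diff / ell)**2, axis=-1)           # sum_d (x_d - z_d)^2 / ell_d^2
    return sigma_f**2 * np.exp(-0.5 * sq)

# Illustrative usage with D = 2 (e.g. lat and long)
X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
K = ard_se_kernel(X, X, sigma_f=10.0, lengthscales=[50.0, 50.0])
print(K.shape)
print(np.round(K, 2))
```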

Observations

Same model; we will now gather data y_i.

◮ f ∼ GP(0, k_ff), where k_ff(t_i, t_j) = σ_f² exp{ −(t_i − t_j)² / (2ℓ²) }
◮ y | f ∼ GP(f, k_nn), where k_nn(t_i, t_j) = σ_n² δ(t_i − t_j)
◮ y ∼ GP(0, k_yy), where k_yy(t_i, t_j) = k_ff(t_i, t_j) + k_nn(t_i, t_j)
◮ We choose σ_f = 10, ℓ = 50, σ_n = 1

Observations

◮ the GP prior p(f)

[Figure: the GP prior p(f), output f versus input t ∈ [0, 500]]

Observations

◮ Observe a single point at t = 204:

   y(204) ∼ N(0, k_yy(204, 204)) = N(0, σ_f² + σ_n²)

[Figure: the single observation y(204), output f versus input t]

Observations

◮ Use conditioning to update the posterior:

   f \,|\, y(204) \sim N\left( K_{fy} K_{yy}^{-1} (y(204) - m_y),\ \ K_{ff} - K_{fy} K_{yy}^{-1} K_{fy}^\top \right)

[Figure: posterior over f after conditioning on y(204), output f versus input t]

Observations

◮ ... and the predictive distribution:

   y_* \,|\, y(204) \sim N\left( K_{y_* y} K_{yy}^{-1} (y(204) - m_y),\ \ K_{y_* y_*} - K_{y_* y} K_{yy}^{-1} K_{y_* y}^\top \right)

[Figure: predictive distribution for y* after observing y(204)]

Observations

◮ More observations (data vector y):

   y_* \,\big|\, y([204, 90]^\top) \sim N\left( K_{y_* y} K_{yy}^{-1} \left( y([204, 90]^\top) - m_y \right),\ \ K_{y_* y_*} - K_{y_* y} K_{yy}^{-1} K_{y_* y}^\top \right)

[Figure: predictive distribution after observing y at t = 204 and t = 90]

Observations

◮ More observations (data vector y):

   y_* \,|\, y \sim N\left( K_{y_* y} K_{yy}^{-1} (y - m_y),\ \ K_{y_* y_*} - K_{y_* y} K_{yy}^{-1} K_{y_* y}^\top \right)

[Figure: predictive distribution as more observations are added]
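Putting the running example together: a self-contained NumPy sketch of GP regression with the SE kernel and σ_f = 10, ℓ = 50, σ_n = 1; the observation values themselves are made up for illustration.

```python
import numpy as np

sigma_f, ell, sigma_n = 10.0, 50.0, 1.0

def k_ff(a, b):
    """SE kernel matrix between input vectors a and b."""
    return sigma_f**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

# Observations (data vector y) at a few inputs; values are illustrative
t_obs = np.array([90.0, 204.0, 350.0, 420.0])
y = np.array([-8.0, 6.0, 3.0, -2.0])

# Query points t* on a grid
t_star = np.linspace(0.0, 500.0, 101)

# Predictive distribution:  y*|y ~ N( K*y Kyy^{-1} y,  K** - K*y Kyy^{-1} K*y^T )
Kyy = k_ff(t_obs, t_obs) + sigma_n**2 * np.eye(len(t_obs))
Ksy = k_ff(t_star, t_obs)
Kss = k_ff(t_star, t_star) + sigma_n**2 * np.eye(len(t_star))

mean = Ksy @ np.linalg.solve(Kyy, y)
cov = Kss - Ksy @ np.linalg.solve(Kyy, Ksy.T)
std = np.sqrt(np.diag(cov))

print(mean[::20])        # predictive mean at a few query points
print(std[::20])         # predictive standard deviation there
```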

Nonparametric Regression

◮ GPs let the data speak for itself... but all the data must speak.

   y_* \,|\, y \sim N\left( K_{y_* y} K_{yy}^{-1} (y - m_y),\ \ K_{y_* y_*} - K_{y_* y} K_{yy}^{-1} K_{y_* y}^\top \right)

◮ "nonparametric models have an infinite number of parameters"

◮ "nonparametric models have a finite but unbounded number of parameters that grows with data"

Almost through the basics...

Gaussian Process Basics
  Gaussians in words and pictures
  Gaussians in equations
  Using Gaussian Processes

Beyond Basics
  Kernel choices
  Likelihood choices
  Shortcomings of GP
  Connections

Conclusions & References

Model Selection / Hyperparameter Learning

◮ f ∼ GP(0, k_ff), where k_ff(t_i, t_j) = σ_f² exp{ −(t_i − t_j)² / (2ℓ²) }

◮ ℓ = 50: just right    [Figure: GP fit, output f versus input t]
◮ ℓ = 15: overfitting   [Figure: GP fit that follows the noise]
◮ ℓ = 250: underfitting [Figure: GP fit that is too smooth]

Model Selection (1): Marginal Likelihood

◮ p(y) = N(0, K_yy)  →  p(y | σ_f, σ_n, ℓ) = N(0, K_yy(σ_f, σ_n, ℓ))

◮ not obvious why this should not over- (or under-) fit, but it's in the math...

   \log p(y \,|\, \sigma_f, \sigma_n, \ell) = -\tfrac{1}{2} y^\top K_{yy}^{-1} y - \tfrac{1}{2} \log |K_{yy}| - \tfrac{n}{2} \log(2\pi)

◮ "Occam's Razor" implemented via regularization / Bayesian model selection

◮ Details to be fleshed out in practical...
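A sketch of hyperparameter learning by numerically maximizing this log marginal likelihood with scipy; optimizing in log-space, the small diagonal jitter, and the particular optimizer are assumptions of this sketch, not the lecture's prescription (the practical may do this differently, e.g. with analytic gradients).

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, t, y):
    """-log p(y | sigma_f, sigma_n, ell); hyperparameters optimized in log-space."""
    sigma_f, ell, sigma_n = np.exp(log_params)
    n = len(t)
    Kyy = sigma_f**2 * np.exp(-0.5 * (t[:, None] - t[None, :])**2 / ell**2) \
          + (sigma_n**2 + 1e-8) * np.eye(n)          # 1e-8 jitter for stability
    L = np.linalg.cholesky(Kyy)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2 * np.pi)

# Made-up data for illustration
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 500, 30))
y = 10 * np.sin(t / 50.0) + rng.normal(0, 1, size=t.shape)

res = minimize(neg_log_marginal_likelihood, x0=np.log([5.0, 30.0, 0.5]),
               args=(t, y), method="L-BFGS-B")
print(np.exp(res.x))   # learned (sigma_f, ell, sigma_n)
```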

MLSS 2012: Gaussian Processes for Machine Learning

Gaussian Process Basics

Outline

Model Selection (2): Cross Validation

◮ Can also consider predictive distribution for some held out data:

PL(σf , σn, ℓ) = log (p(ytest|ytrain, σf , σn, ℓ))

◮ Again a Gaussian.

◮ Again can take derivatives and tune model hyperparameters.

◮ Details to be fleshed out in practical...


More details

◮ GP basics: Appdx A, Ch 1
◮ Regression: Ch 2
◮ Kernels: Ch 4
◮ Model Selection: Ch 5

Beyond Basics

What's next?

◮ Revisit the model and see what can be hacked:

   f \sim GP(0, k_{ff}), \quad \text{where } k_{ff}(t_i, t_j) = \sigma_f^2 \exp\left\{ -\frac{1}{2\ell^2} (t_i - t_j)^2 \right\}

   y_i \,|\, f_i \sim N(f_i, \sigma_n^2 I)

◮ Option 1: hyperparameters → model selection.
◮ Option 2: functional form of k_ff → kernel choices.
◮ Option 3: the GP?
◮ Option 4: the data distribution → likelihood choices.


Beyond Basics

Kernel choices

What's next? → Option 2: functional form of k_ff → kernel choices.

What the kernel is doing (SE)

   k(t_i, t_j) = \sigma_f^2 \exp\left\{ -\frac{1}{2\ell^2} (t_i - t_j)^2 \right\}

[Figure: draws from a GP with the SE kernel, output f versus input t ∈ [0, 500]]

Rational Quadratic

   k(t_i, t_j) = \sigma_f^2 \left( 1 + \frac{1}{2\alpha\ell^2} (t_i - t_j)^2 \right)^{-\alpha}
   \propto \sigma_f^2 \int z^{\alpha - 1} \exp\left( -\frac{\alpha z}{\beta} \right) \exp\left( -\frac{z (t_i - t_j)^2}{2} \right) dz

[Figure: draws from a GP with the rational quadratic kernel]

Periodic

   k(t_i, t_j) = \sigma_f^2 \exp\left\{ -\frac{2}{\ell^2} \sin^2\!\left( \frac{\pi}{p} |t_i - t_j| \right) \right\}

[Figure: draws from a GP with the periodic kernel]

From Stationary to Nonstationary Kernels

◮ k(t_i, t_j) = k(t_i − t_j) = k(τ)
◮ k(t_i, t_j) = σ_f² exp{ −(t_i − t_j)² / (2ℓ²) }

[Figure: draws from a stationary GP, output f versus input t]

Wiener Process

◮ k(t_i, t_j) = min(t_i, t_j)
◮ Still a GP; draws come from a nonstationary GP

[Figure: draws from the Wiener process, output f versus input t ∈ [0, 500]]

Linear Regression...

◮ f(t) = w t with w ∼ N(0, 1)
◮ k(t_i, t_j) = E[f(t_i) f(t_j)] = t_i t_j

[Figure: draws from the linear kernel GP (straight lines through the origin), output f versus input t ∈ [0, 10]]
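A small NumPy "kernel zoo" collecting the kernels from the last few slides (SE, rational quadratic, periodic, Wiener, linear), usable for sampling as in the earlier prior-drawing sketch; the hyperparameter values are illustrative.

```python
import numpy as np

def se(a, b, sigma_f=10.0, ell=50.0):
    return sigma_f**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def rational_quadratic(a, b, sigma_f=10.0, ell=50.0, alpha=2.0):
    return sigma_f**2 * (1 + (a[:, None] - b[None, :])**2 / (2 * alpha * ell**2))**(-alpha)

def periodic(a, b, sigma_f=10.0, ell=1.0, p=100.0):
    return sigma_f**2 * np.exp(-2.0 * np.sin(np.pi * np.abs(a[:, None] - b[None, :]) / p)**2 / ell**2)

def wiener(a, b):
    return np.minimum(a[:, None], b[None, :])

def linear(a, b):
    return a[:, None] * b[None, :]

# Draw one sample from each kernel's GP prior on a grid
rng = np.random.default_rng(0)
t = np.linspace(1.0, 500.0, 100)
for name, k in [("SE", se), ("RQ", rational_quadratic), ("periodic", periodic),
                ("Wiener", wiener), ("linear", linear)]:
    K = k(t, t) + 1e-6 * np.eye(len(t))        # jitter for numerical stability
    f = np.linalg.cholesky(K) @ rng.standard_normal(len(t))
    print(name, np.round(f[:3], 2))
```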

Build your own kernel (1): Operations

◮ Linear: k(t_i, t_j) = α k_1(t_i, t_j) + β k_2(t_i, t_j) (for α, β ≥ 0)
   or k(x^{(i)}, x^{(j)}) = k_a(x_1^{(i)}, x_1^{(j)}) + k_b(x_2^{(i)}, x_2^{(j)})

◮ Products: k(t_i, t_j) = k_1(t_i, t_j) k_2(t_i, t_j)

◮ Integration: z(t) = ∫ g(u, t) f(u) du  ↔  k_z(t_i, t_j) = ∫∫ g(u, t_i) k_f(u, v) g(v, t_j) du dv

◮ Differentiation: z(t) = ∂/∂t f(t)  ↔  k_z(t_i, t_j) = ∂²/(∂t_i ∂t_j) k_f(t_i, t_j)

◮ Warping: z(t) = f(h(t))  ↔  k_z(t_i, t_j) = k_f(h(t_i), h(t_j))  (see the composition sketch below)
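A sketch of the linear-combination, product, and warping rules as higher-order functions in NumPy; it reuses an SE kernel with made-up hyperparameters and checks positive semidefiniteness of one composed kernel numerically.

```python
import numpy as np

def se(ell):
    """Return an SE kernel function with lengthscale ell (unit signal variance)."""
    return lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def add(k1, k2, alpha=1.0, beta=1.0):
    """Linear combination alpha*k1 + beta*k2 (alpha, beta >= 0)."""
    return lambda a, b: alpha * k1(a, b) + beta * k2(a, b)

def prod(k1, k2):
    """Product kernel k1 * k2."""
    return lambda a, b: k1(a, b) * k2(a, b)

def warp(k, h):
    """Warped kernel k(h(t_i), h(t_j))."""
    return lambda a, b: k(h(a), h(b))

# Compose: an SE kernel plus a product of an SE and a log-warped SE (illustrative)
k = add(se(50.0), prod(se(200.0), warp(se(1.0), np.log)), alpha=1.0, beta=0.5)

t = np.linspace(1.0, 500.0, 80)
K = k(t, t)
print(np.linalg.eigvalsh(K).min())   # composed kernel is still (numerically) PSD
```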

Preserves joint Gaussianity (mostly)!

◮ ... the same operations as on the previous slide: linear combinations, products, integration, differentiation, warping.

Build your own kernel (2): frequency domain

◮ a stationary kernel k(t_i, t_j) = k(t_i − t_j) = k(τ) is positive semidefinite (satisfies Mercer) iff:

   S(ω) = F{k}(ω) ≥ 0  ∀ ω

◮ Power spectral density (Wiener-Khinchin, ...)

◮ Note k(0) = ∫ S(ω) dω

Kernel Summary

◮ GP gives a distribution over functions...
◮ the kernel determines the type of functions.
◮ can/should be tailored to application
◮ toward a GP toolbox

Beyond Basics
Likelihood choices

What's next? → Option 4: the data distribution → likelihood choices.

Data up to now

◮ continuous regression made sense
◮ data likelihood model: y_i | f_i ∼ N(f_i, σ_n² I)

[Figure: continuous observed data y versus input t ∈ [0, 500]]

Binary label data

◮ Classification (not regression) setting
◮ y_i | f_i ∼ N(f_i, σ_n² I) is inappropriate

[Figure: binary observed data y ∈ {−1, +1} versus input t ∈ [0, 500]]

GP Classification

◮ Probit or Logistic "regression" model on y_i ∈ {−1, +1}:

   p(y_i \,|\, f_i) = \phi(y_i f_i) = \frac{1}{1 + \exp(-y_i f_i)}

◮ Warps f onto the [0, 1] interval

[Figure: the warped GP value φ(f) versus the GP value f, a sigmoid from 0 to 1]
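A tiny NumPy sketch of this squashing step: a latent GP draw f is pushed through the logistic function to give class probabilities; the SE kernel and the particular hyperparameter values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), mapping R onto (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Draw a latent function f from an SE-kernel GP prior (as in the earlier sketch)
rng = np.random.default_rng(0)
t = np.linspace(0.0, 500.0, 200)
K = 2.0**2 * np.exp(-0.5 * (t[:, None] - t[None, :])**2 / 50.0**2) + 1e-8 * np.eye(len(t))
f = np.linalg.cholesky(K) @ rng.standard_normal(len(t))

p = sigmoid(f)                                       # p(y = +1 | f) in (0, 1)
y = np.where(rng.uniform(size=len(t)) < p, 1, -1)    # sample labels in {-1, +1}
print(p[:5], y[:5])
```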

GP Classification

◮ Probit or Logistic "regression" model on y_i ∈ {−1, +1}:

   p(y_i \,|\, f_i) = \phi(y_i f_i) = \frac{1}{1 + \exp(-y_i f_i)}

◮ Warps f onto the [0, 1] interval

[Figure: a latent GP draw f versus input t, and the same draw mapped through φ onto the [0, 1] interval]

What we want to calculate

◮ predictive distribution:

   p(y_* \,|\, y) = \int p(y_* \,|\, f_*)\, p(f_* \,|\, y)\, df_*

◮ predictive posterior:

   p(f_* \,|\, y) = \int p(f_* \,|\, f)\, p(f \,|\, y)\, df

◮ data posterior:

   p(f \,|\, y) = \frac{\prod_i p(y_i \,|\, f_i)\, p(f)}{p(y)}

◮ None of which is tractable to compute

However...

◮ predictive distribution:

   p(y_* \,|\, y) = \int p(y_* \,|\, f_*)\, q(f_* \,|\, y)\, df_*

◮ predictive posterior:

   q(f_* \,|\, y) = \int p(f_* \,|\, f)\, q(f \,|\, y)\, df

◮ data posterior:

   q(f \,|\, y) \approx p(f \,|\, y) = \frac{\prod_i p(y_i \,|\, f_i)\, p(f)}{p(y)}

◮ If q is Gaussian, these are tractable to compute

Approximate Inference

◮ Methods for producing a Gaussian q(f|y) ≈ p(f|y)
◮ Laplace Approximation, Expectation Propagation, Variational Inference
◮ Technologies within a GP method
◮ Subject of much research; often work well
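One of the three named technologies, sketched: a Laplace approximation for GP classification with the logistic likelihood, found by naive Newton iteration (the numerically stabler textbook version is Algorithm 3.1 of Rasmussen & Williams); the data and hyperparameters are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_approx(K, y, n_iter=20):
    """Gaussian q(f|y) = N(f_hat, Sigma) approximating the GP classification posterior.

    Logistic likelihood p(y_i | f_i) = sigmoid(y_i * f_i), labels y in {-1, +1}.
    Naive Newton iteration on  log p(y|f) - 0.5 f^T K^{-1} f.
    """
    K_inv = np.linalg.inv(K)
    f = np.zeros(len(y))
    for _ in range(n_iter):
        pi = sigmoid(f)                         # p(y_i = +1 | f_i)
        grad = (y + 1) / 2 - pi                 # d/df_i log p(y_i | f_i)
        W = pi * (1 - pi)                       # -d^2/df_i^2 log p(y_i | f_i)
        f = np.linalg.solve(K_inv + np.diag(W), W * f + grad)   # Newton step
    pi = sigmoid(f)
    Sigma = np.linalg.inv(K_inv + np.diag(pi * (1 - pi)))       # covariance at the mode
    return f, Sigma

# Made-up 1-D classification data and SE kernel (illustrative values)
t = np.array([50.0, 120.0, 200.0, 300.0, 410.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
K = 4.0 * np.exp(-0.5 * (t[:, None] - t[None, :])**2 / 50.0**2) + 1e-6 * np.eye(len(t))

f_hat, Sigma = laplace_approx(K, y)
print(np.round(sigmoid(f_hat), 2))              # approx. p(y = +1) at the training inputs
```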

Using Approximate Inference

◮ Allows "regression" on the [0, 1] interval

[Figure: mapped GP value φ(f) versus input t ∈ [0, 500]]

What about SVM?

◮ illustrate flexibility of GP
◮ draw an interesting connection
◮ GP joint:

   -\log p(y, f) = \tfrac{1}{2} f^\top K_{ff}^{-1} f - \sum_i \log p(y_i \,|\, f_i)

◮ SVM loss (for f_i = f(x_i) = w^\top x_i):

   \ell(w) = \tfrac{1}{2} w^\top w + C \sum_i (1 - y_i f_i)_+

SVM

◮ illustrate flexibility of GP
◮ draw an interesting connection
◮ GP joint:

   -\log p(y, f) = \tfrac{1}{2} f^\top K^{-1} f - \sum_i \log p(y_i \,|\, f_i)

◮ SVM loss (for f_i = f(x_i), with feature map φ(x_i) = k(·, x_i)):

   \ell(\phi) = \tfrac{1}{2} f^\top K^{-1} f + C \sum_i (1 - y_i f_i)_+

◮ (more reading: Seeger (2002), Relationship between GP, SVM, Splines)


Beyond Basics

Shortcomings of GP

Our [weakness] grows out of our [strength]

Everything is happily Gaussian

◮ ... except when it isn't.
◮ nonnormal likelihood → approximate inference
◮ model selection p(y | {σ_f, σ_n, ℓ})

Nonparametric flexibility

◮ ... but we have to compute on all the data:

   f \,|\, y \sim N\left( K_{fy} K_{yy}^{-1} (y - m_y) + m_f,\ \ K_{ff} - K_{fy} K_{yy}^{-1} K_{fy}^\top \right)

◮ O(n³) in runtime, O(n²) in memory
◮ sparsification methods (pseudo points, etc.)
◮ special structure methods (kernels, input points, etc.)

MLSS 2012: Gaussian Processes for Machine Learning

Beyond Basics

Shortcomings of GP

Shortcomings → active research

◮ Applications

◮ Approximate inference

◮ Computational complexity / sparsification

MLSS 2012: Gaussian Processes for Machine Learning

Beyond Basics

Outline

Outline

Gaussian Process BasicsGaussians in words and picturesGaussians in equationsUsing Gaussian Processes

Beyond BasicsKernel choicesLikelihood choicesShortcomings of GPConnections

Conclusions & References

MLSS 2012: Gaussian Processes for Machine Learning

Beyond Basics

Connections

Connections

Already we have seen:

◮ Wiener processes

◮ linear regression

◮ SVM

◮ what else?

MLSS 2012: Gaussian Processes for Machine Learning

Beyond Basics

Connections

Temporal linear Gaussian models

◮ Wiener process (Brownian motion, random walk, OU process)
◮ Linear dynamical system (state space model, Kalman filter/smoother, etc.)

   f(t) = A f(t-1) + w(t), \qquad y(t) = f(t) + n(t)

◮ Gauss-Markov processes (ARMA(p, q), etc.)

   f(t) = \sum_{i=1}^{p} \alpha_i f(t-i) + \sum_{i=1}^{q} \beta_i w(t-i)

◮ Intuition of linearity and Gaussianity → GP

MLSS 2012: Gaussian Processes for Machine Learning

Beyond Basics

Connections

Other nonparametric models (or parametric limits)

◮ Kernel smoothing (Nadaraya-Watson, locally weighted regression):

   y_* = \sum_{i=1}^{n} \alpha_i y_i \quad \text{where } \alpha_i = k(t_i, t_*)

◮ Compare to:

   y_* \,|\, y \sim N\left( K_{y_* y} K_{yy}^{-1} (y - m_y),\ \ K_{y_* y_*} - K_{y_* y} K_{yy}^{-1} K_{y_* y}^\top \right)

◮ Neural network limit (infinite bases, important to know about, Neal '96):

   f(t) = \lim_{m \to \infty} \sum_{i=1}^{m} v_i\, h(t\,;\, u_i)

MLSS 2012: Gaussian Processes for Machine Learning

Conclusions & References


Conclusions

◮ Gaussian Processes can be effective tools for regression and classification
◮ Quantified uncertainty can be highly valuable
◮ GP can be extended in interesting ways (linearity helps)
◮ GP appear as limits or general cases of a number of ML technologies
◮ GP are not without problems

Some References/Pointers/Credits

◮ Rasmussen and Williams, Gaussian Processes for Machine Learning (recommended)
◮ Bishop, Pattern Recognition and Machine Learning
◮ www.gaussianprocess.org (better updated/kept than .com)
◮ loads of papers at AISTATS/NIPS/ICML/JMLR over the last 12 years.