Bayesian Scientific Computing, Spring 2013 (N. Zabaras)
Gaussian Models: Bayesian Inference Prof. Nicholas Zabaras
Materials Process Design and Control Laboratory
Sibley School of Mechanical and Aerospace Engineering
101 Frank H. T. Rhodes Hall
Cornell University
Ithaca, NY 14853-3801
Email: [email protected]
URL: http://mpdc.mae.cornell.edu/
February 1, 2014
Contents

Inferring the Precision of a Univariate Gaussian with Known Mean; Gamma and Inverse Gamma as Priors for λ and σ²; Inverse Chi-Squared Distribution as a Prior for σ²

Bayesian Inference for the Univariate Gaussian with Unknown Mean and Precision λ; Normal-Gamma Distribution as a Prior for (μ, λ); Posterior for (μ, σ²) using a Normal-Inverse χ² Prior; Marginal Posteriors, Credible Intervals, Bayesian T-Test, Multi-Sensor Fusion with Unknown Parameters

Inference for μ in a Multivariate Gaussian with a Gaussian Prior; Inference of Λ in a Multivariate Gaussian and the Wishart Distribution; Inference of Σ and the Inverse Wishart Distribution; MAP Estimate, MAP Shrinkage Estimation; Inference for (μ, Λ); Inference for (μ, Σ); Posterior Marginals of (μ, Σ); Visualization of the Wishart

Chris Bishop, Pattern Recognition and Machine Learning, Chapter 2
Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 4
Inference of Precision with Known Mean

Consider $x_n \sim \mathcal{N}(x_n \mid \mu, \lambda^{-1})$, $n = 1, \ldots, N$. We want to infer the precision $\lambda = 1/\sigma^2$ with the mean $\mu$ taken as known.

The likelihood takes the form:
$$p(\mathcal{X} \mid \lambda) = \prod_{n=1}^{N} f(x_n \mid \lambda) \propto \lambda^{N/2} \exp\left\{ -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}$$

The corresponding conjugate prior should be proportional to the product of a power of λ and the exponential of a linear function of λ. This corresponds to the Gamma distribution:
$$\mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} e^{-b\lambda}, \quad \lambda > 0, \qquad \mathbb{E}[\lambda] = \frac{a}{b}, \quad \mathrm{var}[\lambda] = \frac{a}{b^2}$$

The Gamma distribution has a finite integral if a > 0, and the distribution itself is finite if a ≥ 1.
Inference of Precision with Known Mean

The posterior takes the form:
$$p(\lambda \mid \mathcal{X}, \mu) \propto f(\mathcal{X} \mid \mu, \lambda)\, \mathrm{Gamma}(\lambda \mid a_0, b_0) \propto \lambda^{N/2 + a_0 - 1} \exp\left\{ -b_0 \lambda - \frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}$$

We can immediately see that the posterior is also a Gamma distribution:
$$p(\lambda \mid \mathcal{X}, \mu) = \mathrm{Gamma}(\lambda \mid a_N, b_N), \qquad a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 = b_0 + \frac{N}{2} \sigma_{ML}^2$$

Here $\sigma_{ML}^2$ is the MLE of the variance.
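As a quick illustration of this update, here is a minimal Python sketch (not the lecture's PMTK/MATLAB code; the data and hyperparameter values are invented for the example) that computes the Gamma posterior over the precision when the mean is known:

import numpy as np
from scipy import stats

np.random.seed(0)
mu_known = 2.0                                   # known mean
x = np.random.normal(mu_known, 1.5, size=50)     # synthetic data, true sigma = 1.5

# Gamma prior on the precision lambda = 1/sigma^2
a0, b0 = 1.0, 1.0

# Conjugate update: a_N = a0 + N/2, b_N = b0 + 0.5 * sum (x_n - mu)^2
N = len(x)
aN = a0 + N / 2.0
bN = b0 + 0.5 * np.sum((x - mu_known) ** 2)

posterior = stats.gamma(a=aN, scale=1.0 / bN)    # Gamma(a_N, rate=b_N)
print("posterior mean of lambda:", posterior.mean())
print("implied sigma^2 estimate:", 1.0 / posterior.mean())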
Inference of Precision with Known Mean

Recall the posterior
$$p(\lambda \mid \mathcal{X}, \mu) = \mathrm{Gamma}(\lambda \mid a_N, b_N), \qquad a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 = b_0 + \frac{N}{2} \sigma_{ML}^2$$

The effect of observing N data points is to increase the value of a by N/2 (i.e. 1/2 for each data point). Thus we interpret the parameter $a_0$ as $2a_0$ 'effective' prior observations.

Each measurement contributes on average $\sigma_{ML}^2/2$ to the parameter b. Since we have $2a_0$ effective prior measurements, each of them contributes to $b_0$ an effective prior variance
$$\sigma^2 = \frac{2 b_0}{2 a_0} = \frac{b_0}{a_0}$$

The interpretation of a conjugate prior in terms of effective dummy data points is typical for the exponential family of distributions.

The results above are identical to inferring the variance $\sigma^2$ directly, using the prior $\mathrm{InvGamma}(\sigma^2 \mid a_0, b_0)$ and obtaining the posterior $\mathrm{InvGamma}(\sigma^2 \mid a_N, b_N)$.
Gamma and Inverse Gamma

If $\lambda \sim \mathrm{Gamma}(\lambda \mid a, b)$, then $1/\lambda \sim \mathrm{InvGamma}(1/\lambda \mid a, b)$.

Here: $\sigma^{-2} \sim \mathrm{Gamma}(\sigma^{-2} \mid a, b)$, so $\sigma^{2} \sim \mathrm{InvGamma}(\sigma^{2} \mid a, b)$.

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004
Univariate Posterior – Inverse Chi-Squared Prior

An alternative prior for $\sigma^2$ is the scaled inverse chi-squared distribution*
$$\chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) = \mathrm{InvGamma}\left(\sigma^2 \,\Big|\, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right) \propto (\sigma^2)^{-\nu_0/2 - 1} \exp\left\{ -\frac{\nu_0 \sigma_0^2}{2 \sigma^2} \right\}$$

Here $\nu_0$ represents the strength of the prior and $\sigma_0^2$ encodes the value of the prior. With this, the posterior takes the form:
$$p(\sigma^2 \mid \mathcal{D}, \mu) = \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2), \qquad \nu_N = \nu_0 + N, \qquad \sigma_N^2 = \frac{\nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \mu)^2}{\nu_N}$$

The posterior dof $\nu_N$ is the prior dof $\nu_0$ plus N. The posterior sum of squares $\nu_N \sigma_N^2$ is the prior sum of squares plus the data sum of squares:
$$\nu_N \sigma_N^2 = \nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \mu)^2$$

An uninformative prior corresponds to zero virtual sample size, $\nu_0 = 0$; this is the prior $p(\sigma^2) \propto \sigma^{-2}$. This parameterization, with its interpretable hyperparameters, is certainly more appealing.

* Often denoted as $\text{Scale-Inv-}\chi^2(\sigma^2 \mid \nu_0, \sigma_0^2)$, with
$$\text{mean} = \frac{\nu_0 \sigma_0^2}{\nu_0 - 2} \ (\nu_0 > 2), \qquad \text{mode} = \frac{\nu_0 \sigma_0^2}{\nu_0 + 2}, \qquad \text{var} = \frac{2 \nu_0^2 \sigma_0^4}{(\nu_0 - 2)^2 (\nu_0 - 4)} \ (\nu_0 > 4)$$
Sequential Update of the Posterior for σ²

Sequential update of the posterior for $\sigma^2$, starting from the uninformative prior $\mathrm{IW}(\sigma^2 \mid \nu_0 = 0.001, s_0 = 0.001)$. The data were generated from $\mathcal{N}(5, 10)$.

[Figure: posteriors $p(\sigma^2 \mid \mathcal{D})$ after N = 2, 5, 50, and 100 observations; prior IW(ν = 0.001, S = 0.001), true σ² = 10. gaussSeqUpdateSigma1D from PMTK.]

For D = 1 the inverse Wishart reduces to an inverse Gamma: $\mathrm{IW}(\sigma^2 \mid \nu, s^{-1}) = \mathrm{InvGamma}(\sigma^2 \mid \nu/2, s/2)$.

Gelman 2006. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1(3):515–533.
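The sequential update in the figure can be reproduced with a few lines of Python; this is a minimal sketch (not the PMTK gaussSeqUpdateSigma1D code), writing the scaled inverse chi-squared posterior as an inverse Gamma and drawing synthetic data as described above:

import numpy as np
from scipy import stats

np.random.seed(1)
mu_known, true_var = 5.0, 10.0
data = np.random.normal(mu_known, np.sqrt(true_var), size=100)

# (Nearly) uninformative scaled inverse chi-squared prior
nu0, s0_sq = 0.001, 0.001

for N in [2, 5, 50, 100]:
    x = data[:N]
    nuN = nu0 + N
    sN_sq = (nu0 * s0_sq + np.sum((x - mu_known) ** 2)) / nuN
    # Scale-Inv-chi2(nuN, sN_sq) == InvGamma(a = nuN/2, scale = nuN*sN_sq/2)
    post = stats.invgamma(a=nuN / 2.0, scale=nuN * sN_sq / 2.0)
    print(f"N={N:3d}: posterior mode of sigma^2 = {nuN * sN_sq / (nuN + 2):.2f}")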
Bayesian Inference: Unknown Mean and Precision

Consider $x_n \sim \mathcal{N}(x_n \mid \mu, \lambda^{-1})$, $n = 1, \ldots, N$. We want to infer both the precision $\lambda = 1/\sigma^2$ and the mean $\mu$.

The likelihood takes the form:
$$p(\mathcal{X} \mid \mu, \lambda) = \prod_{n=1}^{N} \left(\frac{\lambda}{2\pi}\right)^{1/2} \exp\left\{ -\frac{\lambda}{2} (x_n - \mu)^2 \right\} \propto \left[ \lambda^{1/2} \exp\left( -\frac{\lambda \mu^2}{2} \right) \right]^{N} \exp\left\{ \lambda \mu \sum_{n=1}^{N} x_n - \frac{\lambda}{2} \sum_{n=1}^{N} x_n^2 \right\}$$

We need a prior that has a similar functional form in terms of λ and μ:
$$p(\mu, \lambda) \propto \left[ \lambda^{1/2} \exp\left( -\frac{\lambda \mu^2}{2} \right) \right]^{c} \exp\left\{ d \lambda \mu - b \lambda \right\} = \lambda^{c/2} \exp\left\{ -\frac{c\lambda}{2} \left( \mu - \frac{d}{c} \right)^2 \right\} \exp\left\{ -\left( b - \frac{d^2}{2c} \right) \lambda \right\}$$
Bayesian Inference: Unknown Mean and Precision

We can easily identify that the prior is of Normal-Gamma form:
$$p(\mu, \lambda) \propto \lambda^{c/2} \exp\left\{ -\frac{c\lambda}{2} \left( \mu - \frac{d}{c} \right)^2 \right\} \exp\left\{ -\left( b - \frac{d^2}{2c} \right) \lambda \right\} = \mathcal{N}\!\left( \mu \,\Big|\, \frac{d}{c}, (c\lambda)^{-1} \right) \mathrm{Gamma}\!\left( \lambda \,\Big|\, \frac{c+1}{2},\; b - \frac{d^2}{2c} \right)$$

Recall the form of the Gamma distribution:
$$\mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} e^{-b\lambda}$$
Bayesian Inference: Unknown Mean and Precision

Combining the likelihood and prior, we can re-arrange and write:
$$p(\mu, \lambda \mid \mathcal{X}) \propto \lambda^{(N+c)/2} \exp\left\{ -\frac{\lambda}{2} \left[ (N + c)\mu^2 - 2\mu \left( d + \sum_{n=1}^{N} x_n \right) + \sum_{n=1}^{N} x_n^2 \right] - b\lambda \right\}$$

Completing the square in the quadratic in μ gives a posterior of the same Normal-Gamma form:
$$p(\mu, \lambda \mid \mathcal{X}) = \mathcal{N}\!\left( \mu \,\Big|\, \mu_N, \big((c + N)\lambda\big)^{-1} \right) \mathrm{Gamma}(\lambda \mid a_N, b_N)$$
$$\mu_N = \frac{d + \sum_{n=1}^{N} x_n}{c + N}, \qquad a_N = a + \frac{N}{2} = \frac{c + N + 1}{2}, \qquad b_N = b + \frac{1}{2} \sum_{n=1}^{N} x_n^2 - \frac{(c + N)\,\mu_N^2}{2}$$
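A minimal Python sketch of this conjugate update, written in the equivalent (μ0, κ0, a0, b0) parameterization of the Normal-Gamma prior (the data and hyperparameters here are illustrative, not from the lecture):

import numpy as np

def normal_gamma_update(x, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0):
    """Posterior hyperparameters of the Normal-Gamma prior
    N(mu | mu0, (kappa0*lambda)^-1) Gamma(lambda | a0, b0)."""
    x = np.asarray(x)
    N, xbar = len(x), x.mean()
    kappaN = kappa0 + N
    muN = (kappa0 * mu0 + N * xbar) / kappaN
    aN = a0 + N / 2.0
    bN = (b0 + 0.5 * np.sum((x - xbar) ** 2)
          + 0.5 * kappa0 * N * (xbar - mu0) ** 2 / kappaN)
    return muN, kappaN, aN, bN

np.random.seed(2)
x = np.random.normal(3.0, 2.0, size=40)           # synthetic data
muN, kappaN, aN, bN = normal_gamma_update(x)
print("posterior mean of mu:", muN)
print("posterior mean of lambda:", aN / bN)       # E[lambda | D]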
The Normal-Gamma Distribution

$$p(\mu, \lambda) = \mathcal{N}\!\left( \mu \,\Big|\, \mu_0, (c\lambda)^{-1} \right) \mathrm{Gamma}\!\left( \lambda \,\Big|\, \frac{c+1}{2},\; b - \frac{d^2}{2c} \right), \qquad \mu_0 = \frac{d}{c}$$

[Figure: contour plot of the Normal-Gamma density over (μ, λ) with a = 5, b = 6, μ0 = 0, prior strength 2. MATLAB code.]

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (provides additional results for posterior marginals, posterior predictive, and reference results for an uninformative prior)
Posterior for μ and σ² for Scalar Data

We can also work directly with σ². We use the normal inverse chi-squared (NIX) distribution:
$$\mathrm{NI}\chi^2(\mu, \sigma^2 \mid m_0, \kappa_0, \nu_0, \sigma_0^2) = \mathcal{N}(\mu \mid m_0, \sigma^2/\kappa_0)\, \chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) \propto (\sigma^2)^{-(\nu_0 + 3)/2} \exp\left\{ -\frac{\nu_0 \sigma_0^2 + \kappa_0 (\mu - m_0)^2}{2\sigma^2} \right\}$$

Similarly to our earlier calculations, the posterior is given as:
$$p(\mu, \sigma^2 \mid \mathcal{D}) = \mathrm{NI}\chi^2(\mu, \sigma^2 \mid m_N, \kappa_N, \nu_N, \sigma_N^2)$$
$$m_N = \frac{\kappa_0 m_0 + N \bar{x}}{\kappa_N}, \quad \kappa_N = \kappa_0 + N, \quad \nu_N = \nu_0 + N, \quad \nu_N \sigma_N^2 = \nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \bar{x})^2 + \frac{N \kappa_0}{\kappa_0 + N} (m_0 - \bar{x})^2$$

The posterior marginal for σ² and its posterior expectation are:
$$p(\sigma^2 \mid \mathcal{D}) = \int p(\mu, \sigma^2 \mid \mathcal{D})\, d\mu = \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2), \qquad \mathbb{E}[\sigma^2 \mid \mathcal{D}] = \frac{\nu_N}{\nu_N - 2}\, \sigma_N^2$$

where, as before,
$$\chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) = \mathrm{IG}\left(\sigma^2 \,\Big|\, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right) \propto (\sigma^2)^{-\nu_0/2 - 1} \exp\left\{ -\frac{\nu_0 \sigma_0^2}{2\sigma^2} \right\}$$

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (provides additional results for posterior marginals, posterior predictive, and reference results for an uninformative prior; Section 6 provides the analysis for a normal-inverse-Gamma prior)
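A minimal Python sketch of the NIX hyperparameter update (illustrative values, not from the lecture's MATLAB code):

import numpy as np

def nix_update(x, m0=0.0, kappa0=1.0, nu0=1.0, sigma0_sq=1.0):
    """Posterior hyperparameters of the normal inverse chi-squared prior."""
    x = np.asarray(x)
    N, xbar = len(x), x.mean()
    kappaN = kappa0 + N
    mN = (kappa0 * m0 + N * xbar) / kappaN
    nuN = nu0 + N
    sigmaN_sq = (nu0 * sigma0_sq + np.sum((x - xbar) ** 2)
                 + N * kappa0 * (m0 - xbar) ** 2 / kappaN) / nuN
    return mN, kappaN, nuN, sigmaN_sq

x = np.array([2.3, 1.8, 2.9, 2.1, 2.6])          # toy data
mN, kappaN, nuN, sigmaN_sq = nix_update(x)
print(mN, kappaN, nuN, sigmaN_sq)
print("E[sigma^2 | D] =", nuN * sigmaN_sq / (nuN - 2))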
Normal Inverse χ² Distribution

[Figure: surface/contour plots of NIχ²(m0=0, κ0=1, ν0=1, σ0²=1), NIχ²(m0=0, κ0=5, ν0=1, σ0²=1), and NIχ²(m0=0, κ0=1, ν0=5, σ0²=1) over (μ, σ²). NIXdemo2 from PMTK.]

The NIχ²(m0, κ0, ν0, σ0²) distribution. m0 is the prior mean and κ0 is how strongly we believe this; σ0² is the prior variance and ν0 is how strongly we believe this. (a) The contour plot (underneath the surface) is shaped like a "squashed egg". (b) We increase the strength of our belief in the mean, so the distribution gets narrower in μ. (c) We increase the strength of our belief in the variance, so it gets narrower in σ².
Posterior for μ and σ² for Scalar Data

The posterior marginal for μ is Student's T:
$$p(\mu \mid \mathcal{D}) = \int p(\mu, \sigma^2 \mid \mathcal{D})\, d\sigma^2 = \mathcal{T}(\mu \mid m_N, \sigma_N^2/\kappa_N, \nu_N)$$

Let us revisit these results with an uninformative prior:
$$p(\mu, \sigma^2) \propto p(\mu)\, p(\sigma^2) \propto \sigma^{-2} \;\Longleftrightarrow\; \mathrm{NI}\chi^2(\mu, \sigma^2 \mid m_0 = 0, \kappa_0 = 0, \nu_0 = -1, \sigma_0^2 = 0)$$

With this prior, the posterior becomes:
$$p(\mu, \sigma^2 \mid \mathcal{D}) = \mathrm{NI}\chi^2\!\left(\mu, \sigma^2 \,\Big|\, \bar{x}, N, N - 1, s^2\right), \qquad s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{N}{N - 1}\, \sigma_{MLE}^2$$

s is the sample standard deviation. Thus the marginal posterior for μ becomes:
$$p(\mu \mid \mathcal{D}) = \mathcal{T}\!\left(\mu \,\Big|\, \bar{x}, \frac{s^2}{N}, N - 1\right), \qquad \mathrm{var}[\mu \mid \mathcal{D}] = \frac{N - 1}{N - 3}\, \frac{s^2}{N} \approx \frac{s^2}{N}$$

The standard error of the mean is defined as $\sqrt{\mathrm{var}[\mu \mid \mathcal{D}]} \approx s/\sqrt{N}$. An approximate 95% posterior credible interval is thus:
$$I_{0.95}(\mu \mid \mathcal{D}) = \bar{x} \pm 2\, \frac{s}{\sqrt{N}}$$

Here $\mathcal{T}(x \mid \mu, \lambda^{-1}, \nu)$ denotes the Student's T distribution,
$$p(x \mid \mu, \lambda, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left[ 1 + \frac{\lambda (x - \mu)^2}{\nu} \right]^{-(\nu + 1)/2}$$
with mean μ (ν > 1), mode μ, and variance $\frac{1}{\lambda}\frac{\nu}{\nu - 2}$ (ν > 2).
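A minimal Python sketch of this credible interval (using exact Student-T quantiles rather than the ±2 approximation; the data are synthetic, invented for the example):

import numpy as np
from scipy import stats

np.random.seed(3)
x = np.random.normal(5.0, 2.0, size=20)
N, xbar, s = len(x), x.mean(), x.std(ddof=1)

# Posterior marginal p(mu | D) = T(mu | xbar, s^2/N, N-1) under the uninformative prior
lo, hi = stats.t.interval(0.95, df=N - 1, loc=xbar, scale=s / np.sqrt(N))
print(f"95% credible interval for mu: ({lo:.3f}, {hi:.3f})")
print("approximation xbar +/- 2 s/sqrt(N):",
      xbar - 2 * s / np.sqrt(N), xbar + 2 * s / np.sqrt(N))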
Bayesian T-Test

We want to test the hypothesis μ ≠ μ0 for some known value μ0 (often 0), given xi ∼ N(μ, σ²). This is called a two-sided, one-sample t-test. One can check if μ0 ∈ I0.95(μ|D). If it is not, then we can be 95% sure that μ ≠ μ0.

A more common scenario is when we want to test if two paired samples have the same mean. More precisely, suppose yi ∼ N(μ1, σ²) and zi ∼ N(μ2, σ²). We want to determine if μ := μ1 − μ2 > 0, using xi = yi − zi as our data. We can evaluate this as follows:
$$p(\mu > \mu_0 \mid \mathcal{D}) = \int_{\mu_0}^{\infty} p(\mu \mid \mathcal{D})\, d\mu$$

This is called a one-sided, paired t-test. To calculate the posterior, we must specify a prior. Suppose we use an uninformative prior. As shown earlier, the posterior marginal is:
$$p(\mu \mid \mathcal{D}) = \mathcal{T}\!\left(\mu \,\Big|\, \bar{x}, \frac{s^2}{N}, N - 1\right)$$
Bayesian T-Test

We define the t statistic as:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{N}}$$

The denominator is the standard error of the mean. With this definition, note that
$$p(\mu > \mu_0 \mid \mathcal{D}) = \int_{\mu_0}^{\infty} p(\mu \mid \mathcal{D})\, d\mu = 1 - F_{N-1}(-t) = F_{N-1}(t)$$
where $F_\nu(t)$ is the CDF of the standard Student's T distribution $\mathcal{T}(0, 1, \nu)$.

Note: The posterior of μ has the form $\mu \mid \mathcal{D} \sim \mathcal{T}_{N-1}\!\left(\bar{x}, \frac{s^2}{N}\right)$. From a frequentist point of view, this is identical to the sampling distribution of the MLE: $\bar{x} \mid \mu \sim \mathcal{T}_{N-1}\!\left(\mu, \frac{s^2}{N}\right)$. Thus the one-sided p-value in a frequentist test is numerically the same as the Bayesian estimate $p(\mu \le \mu_0 \mid \mathcal{D}) = 1 - F_{N-1}(t)$. The interpretation of the results in the two approaches is of course very different.

Box, G. and G. Tiao (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley.
Gonen, M., W. Johnson, Y. Lu, and P. Westfall (2005, August). The Bayesian Two-Sample t Test. The American Statistician 59(3), 252–257.
Rouder, J., P. Speckman, D. Sun, and R. Morey (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review 16(2), 225–237.

bayesTtestDemo from PMTK
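A minimal Python sketch of this calculation (not the PMTK bayesTtestDemo; the paired data are invented for illustration):

import numpy as np
from scipy import stats

# Paired measurements y_i, z_i; we test whether mu = mu_1 - mu_2 > 0
y = np.array([5.1, 4.8, 5.6, 5.0, 5.3, 4.9])
z = np.array([4.7, 4.9, 5.1, 4.6, 5.0, 4.5])
x = y - z                                 # differences
N, xbar, s = len(x), x.mean(), x.std(ddof=1)

mu0 = 0.0
t = (xbar - mu0) / (s / np.sqrt(N))

# Bayesian posterior probability p(mu > mu0 | D) under the uninformative prior
p_greater = stats.t.cdf(t, df=N - 1)      # = F_{N-1}(t)
print(f"t = {t:.3f},  p(mu > {mu0} | D) = {p_greater:.4f}")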
Sensor Fusion with Unknown Parameters

Let us consider a sensor fusion problem where the precision of each measurement device is unknown. The unknown-precision case turns out to give qualitatively different results from the earlier case of known precision, yielding a potentially multi-modal posterior.

Suppose we want to pool data from two sources x and y to estimate some quantity μ ∈ R, but the reliability of the sources is unknown. Specifically, suppose we have two different measurement devices with different precisions:
$$x_i \mid \mu \sim \mathcal{N}(\mu, \lambda_x^{-1}), \qquad y_i \mid \mu \sim \mathcal{N}(\mu, \lambda_y^{-1})$$

We make two independent measurements with each device, which turn out to be x1 = 1.1, x2 = 1.9, y1 = 2.9, y2 = 4.1.

We use a non-informative prior for μ, p(μ) ∝ 1, which we can emulate using a Gaussian, $p(\mu) = \mathcal{N}(\mu \mid m_0 = 0, \lambda_0^{-1})$ with $\lambda_0 \to 0$.

Minka, T. (2000). Estimating a Dirichlet distribution. Technical report, MIT.
Sensor Fusion with Unknown Parameters

For known precisions, the posterior is Gaussian:
$$p(\mu \mid \mathcal{D}, \lambda_x, \lambda_y) = \mathcal{N}(\mu \mid m_N, \lambda_N^{-1}), \qquad \lambda_N = N_x \lambda_x + N_y \lambda_y, \qquad m_N = \frac{N_x \lambda_x \bar{x} + N_y \lambda_y \bar{y}}{\lambda_N}$$
with $N_x = N_y = 2$, $\bar{x} = 1.5$, $\bar{y} = 3.5$.

However, the measurement precisions are not known. Initially we will estimate them by MLE. The log-likelihood is given by
$$\ell(\mu, \lambda_x, \lambda_y) = \frac{N_x}{2} \log \lambda_x + \frac{N_y}{2} \log \lambda_y - \frac{\lambda_x}{2} \sum_i (x_i - \mu)^2 - \frac{\lambda_y}{2} \sum_i (y_i - \mu)^2$$

The MLE is obtained by solving the following coupled equations:
$$\mu = \frac{N_x \lambda_x \bar{x} + N_y \lambda_y \bar{y}}{N_x \lambda_x + N_y \lambda_y}, \qquad \frac{1}{\lambda_x} = \frac{1}{N_x} \sum_i (x_i - \mu)^2, \qquad \frac{1}{\lambda_y} = \frac{1}{N_y} \sum_i (y_i - \mu)^2$$
Sensor Fusion with Unknown Parameters

We solve these equations by iteration, starting with $\lambda_x = \lambda_y = 1$:
$$\mu \leftarrow \frac{N_x \lambda_x \bar{x} + N_y \lambda_y \bar{y}}{N_x \lambda_x + N_y \lambda_y}, \qquad \frac{1}{\lambda_x} \leftarrow \frac{1}{N_x} \sum_i (x_i - \mu)^2, \qquad \frac{1}{\lambda_y} \leftarrow \frac{1}{N_y} \sum_i (y_i - \mu)^2$$

Upon convergence, $1/\hat{\lambda}_x \approx 0.1662$ and $1/\hat{\lambda}_y \approx 4.0509$, giving the plug-in approximation
$$p(\mu \mid \mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y) = \mathcal{N}(\mu \mid 1.5788,\, 0.0798)$$

The plug-in approximation to the posterior is plotted next.
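A minimal Python sketch of this fixed-point iteration and the plug-in posterior, using the measurements given on the previous slide:

import numpy as np

x = np.array([1.1, 1.9])
y = np.array([2.9, 4.1])
Nx, Ny = len(x), len(y)

lam_x, lam_y = 1.0, 1.0                       # starting values
for _ in range(100):                          # iterate the coupled MLE equations
    mu = (Nx * lam_x * x.mean() + Ny * lam_y * y.mean()) / (Nx * lam_x + Ny * lam_y)
    lam_x = 1.0 / np.mean((x - mu) ** 2)
    lam_y = 1.0 / np.mean((y - mu) ** 2)

lam_N = Nx * lam_x + Ny * lam_y               # plug-in posterior precision
m_N = (Nx * lam_x * x.mean() + Ny * lam_y * y.mean()) / lam_N
print(f"1/lam_x = {1/lam_x:.4f}, 1/lam_y = {1/lam_y:.4f}")
print(f"plug-in posterior: N(mu | {m_N:.4f}, {1/lam_N:.4f})")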
Plug-in Approximation to the Posterior

Posterior for μ using the plug-in approximation.

[Figure: plug-in approximation $p(\mu \mid \mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y)$. sensorFusionUnknownPrec from Kevin Murphy's PMTK.]

This weights each sensor according to its estimated precision. Since sensor y was estimated to be much less reliable than sensor x, we have $p(\mu \mid \mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y) \approx p(\mu \mid \mathcal{D}_x, \hat{\lambda}_x)$. Effectively, with this approximation we ignore the y sensor.
Sensor Fusion with Unknown Parameters

Now we adopt a Bayesian approach and integrate out the unknown precisions, rather than trying to estimate them. That is, we compute
$$p(\mu \mid \mathcal{D}) \propto p(\mu) \left[ \int p(\mathcal{D}_x \mid \mu, \lambda_x)\, p(\lambda_x \mid \mu)\, d\lambda_x \right] \left[ \int p(\mathcal{D}_y \mid \mu, \lambda_y)\, p(\lambda_y \mid \mu)\, d\lambda_y \right]$$

We will use uninformative Jeffreys priors, p(μ) ∝ 1, p(λx|μ) ∝ 1/λx and p(λy|μ) ∝ 1/λy.

The first integral becomes (using the likelihood derived earlier):
$$I_x = \int p(\mathcal{D}_x \mid \mu, \lambda_x)\, p(\lambda_x \mid \mu)\, d\lambda_x \propto \int \lambda_x^{N_x/2} \exp\left\{ -\frac{N_x \lambda_x}{2} \left[ (\bar{x} - \mu)^2 + s_x^2 \right] \right\} \frac{1}{\lambda_x}\, d\lambda_x, \qquad s_x^2 = \frac{1}{N_x} \sum_i (x_i - \bar{x})^2$$

For Nx = 2, the integrand reduces to $\exp(-b\lambda_x)$ with $b = (\bar{x} - \mu)^2 + s_x^2$, so the integral is simply the normalizing factor of a Gamma distribution, $\mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)} \lambda^{a-1} e^{-b\lambda}$, with a = 1:
$$I_x \propto \frac{1}{(\bar{x} - \mu)^2 + s_x^2}$$
Sensor Fusion with Unknown Parameters

Finally:
$$p(\mu \mid \mathcal{D}) \propto \frac{1}{(\bar{x} - \mu)^2 + s_x^2} \cdot \frac{1}{(\bar{y} - \mu)^2 + s_y^2}$$

The exact posterior is plotted below.

[Figure: exact posterior p(μ|D), showing two modes. sensorFusionUnknownPrec from Kevin Murphy's PMTK.]

The posterior has two modes, near $\bar{x} = 1.5$ and $\bar{y} = 3.5$. The weight of the first mode is larger, since the data from the x sensor agree more with each other; it seems likely that the x sensor is the reliable one. The Bayesian solution keeps open the possibility that the y sensor is the more reliable one: from two measurements we cannot tell, and choosing just the x sensor, as the plug-in approximation does, results in overconfidence (a too-narrow posterior).
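A minimal Python sketch evaluating this exact posterior on a grid (same data as above):

import numpy as np

x = np.array([1.1, 1.9])
y = np.array([2.9, 4.1])
sx2 = np.mean((x - x.mean()) ** 2)            # s_x^2 with N_x in the denominator
sy2 = np.mean((y - y.mean()) ** 2)

mu = np.linspace(-2.0, 6.0, 2001)
post = 1.0 / (((x.mean() - mu) ** 2 + sx2) * ((y.mean() - mu) ** 2 + sy2))
post /= np.trapz(post, mu)                    # normalize on the grid

is_peak = (post[1:-1] > post[:-2]) & (post[1:-1] > post[2:])
modes = mu[np.r_[False, is_peak, False]]
print("local modes near:", modes)             # expect one near 1.5 and one near 3.5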
Multivariate Gaussian: Posterior of m

Consider a known covariance Σ and a Gaussian prior $\mathcal{N}(\boldsymbol{m} \mid \boldsymbol{m}_0, \Sigma_0)$, with the posterior for the unknown mean m taking the form:
$$p(\boldsymbol{m} \mid \mathcal{X}) \propto p(\boldsymbol{m}) \prod_{n=1}^{N} p(\boldsymbol{x}_n \mid \boldsymbol{m}, \Sigma)$$

This posterior is the exponential of a quadratic in m:
$$-\frac{1}{2} (\boldsymbol{m} - \boldsymbol{m}_0)^T \Sigma_0^{-1} (\boldsymbol{m} - \boldsymbol{m}_0) - \frac{1}{2} \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{m})^T \Sigma^{-1} (\boldsymbol{x}_n - \boldsymbol{m}) = -\frac{1}{2} \boldsymbol{m}^T \left( \Sigma_0^{-1} + N \Sigma^{-1} \right) \boldsymbol{m} + \boldsymbol{m}^T \left( \Sigma_0^{-1} \boldsymbol{m}_0 + \Sigma^{-1} \sum_{n=1}^{N} \boldsymbol{x}_n \right) + \text{const}$$

So the covariance and mean of the posterior $p(\boldsymbol{m} \mid \mathcal{X}) = \mathcal{N}(\boldsymbol{m} \mid \boldsymbol{m}_N, \Sigma_N)$ are:
$$\Sigma_N^{-1} = \Sigma_0^{-1} + N \Sigma^{-1}, \qquad \boldsymbol{m}_N = \Sigma_N \left( \Sigma_0^{-1} \boldsymbol{m}_0 + N \Sigma^{-1} \boldsymbol{m}_{ML} \right), \qquad \boldsymbol{m}_{ML} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n$$

For an uninformative prior ($\Sigma_0^{-1} \to 0$):
$$p(\boldsymbol{m} \mid \mathcal{X}) = \mathcal{N}\!\left( \boldsymbol{m} \,\Big|\, \boldsymbol{m}_{ML}, \frac{1}{N} \Sigma \right)$$
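A minimal Python sketch of this update for a known covariance (synthetic data; not from the lecture):

import numpy as np

np.random.seed(4)
D, N = 2, 30
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])         # known covariance
true_mean = np.array([1.0, -1.0])
X = np.random.multivariate_normal(true_mean, Sigma, size=N)

m0 = np.zeros(D)                                    # prior N(m0, Sigma0)
Sigma0 = 10.0 * np.eye(D)

Sigma_inv, Sigma0_inv = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)
SigmaN = np.linalg.inv(Sigma0_inv + N * Sigma_inv)                  # posterior covariance
mN = SigmaN @ (Sigma0_inv @ m0 + N * Sigma_inv @ X.mean(axis=0))    # posterior mean
print("posterior mean:", mN)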
Posterior Distribution of Precision Λ

We now discuss how to compute p(Λ|D, μ). The likelihood has the form
$$p(\mathcal{D} \mid \mu, \Lambda) \propto |\Lambda|^{N/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\Lambda S_\mu) \right\}, \qquad S_\mu = \sum_n (\boldsymbol{x}_n - \boldsymbol{\mu})(\boldsymbol{x}_n - \boldsymbol{\mu})^T$$

The corresponding conjugate prior is known as the Wishart distribution
$$\mathrm{Wi}(\Lambda \mid \boldsymbol{W}, v) = B(\boldsymbol{W}, v)\, |\Lambda|^{(v - D - 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\boldsymbol{W}^{-1} \Lambda) \right\}, \qquad v > D - 1 \text{ (dof)}, \quad \boldsymbol{W} \text{ sym. pos. def. } D \times D$$
$$B(\boldsymbol{W}, v) = |\boldsymbol{W}|^{-v/2} \left( 2^{vD/2}\, \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\!\left(\frac{v + 1 - i}{2}\right) \right)^{-1}, \qquad \Gamma_D\!\left(\frac{v}{2}\right) = \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\!\left(\frac{v + 1 - i}{2}\right) \ \text{(multivariate Gamma function)}$$

For D = 1: $\mathrm{Wi}(\lambda \mid s^{-1}, \nu) = \mathrm{Gamma}\!\left(\lambda \,\Big|\, \frac{\nu}{2}, \frac{s}{2}\right)$.

The following slide shows the similarities of this distribution with the Gamma prior for λ used earlier for univariate Gaussian distributions.
Wishart Distribution

For $\Lambda \sim \mathrm{Wi}(\boldsymbol{S}, v)$ in k dimensions, the mode is $(v - k - 1)\boldsymbol{S}$ for $v \ge k + 1$.

If $\boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{0}, \Sigma)$, $i = 1, \ldots, N$, then the scatter matrix $\boldsymbol{S} = \sum_{i=1}^{N} \boldsymbol{x}_i \boldsymbol{x}_i^T \sim \mathrm{Wi}(\boldsymbol{S} \mid \Sigma, N)$.

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004
Posterior Distribution of Σ

We now similarly discuss how to compute p(Σ|D, μ). The likelihood has the form
$$p(\mathcal{D} \mid \mu, \Sigma) \propto |\Sigma|^{-N/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\Sigma^{-1} S_\mu) \right\}, \qquad S_\mu = \sum_n (\boldsymbol{x}_n - \boldsymbol{\mu})(\boldsymbol{x}_n - \boldsymbol{\mu})^T$$

The corresponding conjugate prior is known as the inverse Wishart distribution
$$\mathrm{InvWi}(\Sigma \mid \boldsymbol{S}_0^{-1}, \nu_0) \propto |\Sigma|^{-(\nu_0 + D + 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\boldsymbol{S}_0 \Sigma^{-1}) \right\}, \qquad \Sigma \text{ sym. pos. def.}$$

ν0 + D + 1 controls the strength of the prior, and hence plays a role analogous to the sample size N. The prior scatter matrix here is S0.

Note: There are many parametrizations of the InvWi leading to different reported DOF. We here follow the notation from Gelman et al., with the same ν for both Wi and InvWi in the equation above:
$$\text{If } \Lambda \sim \mathrm{Wi}(\Lambda \mid \boldsymbol{S}, \nu), \text{ then } \Sigma = \Lambda^{-1} \sim \mathrm{InvWi}(\Sigma \mid \boldsymbol{S}^{-1}, \nu)$$

Steven W. Nydick, The Wishart and Inverse Wishart Distributions, Report, 2012.
A. Gelman, J. Carlin, H. Stern and D. Rubin, Bayesian Data Analysis, 2004
Inverse Wishart Distribution

For $\Sigma \sim \mathrm{InvWi}(\boldsymbol{S}^{-1}, \nu)$ in k dimensions, the mode is $\boldsymbol{S}/(\nu + k + 1)$.

For k = 1: $\mathrm{InvWi}(\sigma^2 \mid s^{-1}, \nu) = \mathrm{InvGamma}\!\left(\sigma^2 \,\Big|\, \frac{\nu}{2}, \frac{s}{2}\right)$.

If $\lambda \sim \mathrm{Gamma}(a, b)$, then $1/\lambda \sim \mathrm{InvGamma}(a, b)$. If $\Sigma^{-1} \sim \mathrm{Wi}(\boldsymbol{S}, \nu)$, then $\Sigma \sim \mathrm{InvWi}(\boldsymbol{S}^{-1}, \nu)$.

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004
Posterior Distribution of Σ

Multiplying the likelihood and prior, we find that the posterior is also inverse Wishart:
$$p(\Sigma \mid \mathcal{D}, \mu) \propto |\Sigma|^{-N/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\Sigma^{-1} S_\mu) \right\} |\Sigma|^{-(\nu_0 + D + 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\Sigma^{-1} \boldsymbol{S}_0) \right\}$$
$$= |\Sigma|^{-(\nu_0 + N + D + 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}\!\left[ \Sigma^{-1} (\boldsymbol{S}_0 + S_\mu) \right] \right\}$$
$$\Rightarrow \quad p(\Sigma \mid \mathcal{D}, \mu) = \mathrm{InvWi}(\Sigma \mid \boldsymbol{S}_N^{-1}, \nu_N), \qquad \nu_N = \nu_0 + N, \qquad \boldsymbol{S}_N = \boldsymbol{S}_0 + S_\mu$$

The posterior strength νN is the prior strength ν0 plus the number of observations N. The posterior scatter matrix SN is the prior scatter matrix S0 plus the data scatter matrix Sμ.
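A minimal Python sketch of this update using scipy's inverse Wishart (scipy.stats.invwishart is parameterized by a dof and a scale matrix, which correspond here to νN and SN; the data are synthetic):

import numpy as np
from scipy import stats

np.random.seed(5)
D, N = 2, 100
mu_known = np.zeros(D)
true_Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
X = np.random.multivariate_normal(mu_known, true_Sigma, size=N)

nu0 = D + 2                                   # prior strength
S0 = np.eye(D)                                # prior scatter matrix
S_mu = (X - mu_known).T @ (X - mu_known)      # data scatter matrix

nuN, SN = nu0 + N, S0 + S_mu
posterior = stats.invwishart(df=nuN, scale=SN)
print("posterior mean of Sigma:\n", posterior.mean())
print("posterior mode of Sigma:\n", SN / (nuN + D + 1))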
MAP Estimation

From the mode of the inverse Wishart and $p(\Sigma \mid \mathcal{D}, \mu) = \mathrm{InvWi}(\Sigma \mid \boldsymbol{S}_N^{-1}, \nu_N)$ with $\boldsymbol{S}_N = \boldsymbol{S}_0 + S_\mu$, $\nu_N = \nu_0 + N$, we conclude that the MAP estimate is:
$$\hat{\Sigma}_{MAP} = \frac{\boldsymbol{S}_N}{\nu_N + D + 1} = \frac{\boldsymbol{S}_0 + S_\mu}{N_0 + N}, \qquad N_0 = \nu_0 + D + 1$$

For an improper prior, S0 = 0 and N0 = 0, and
$$\hat{\Sigma}_{MAP} = \frac{S_\mu}{N} = \hat{\Sigma}_{MLE}$$

Consider now the use of a proper informative prior, which is necessary whenever D/N is large. Let $\lambda = \frac{N_0}{N_0 + N}$. Then we can rewrite the MAP estimate as a convex combination of the prior mode and the MLE
$$\hat{\Sigma}_{MAP} = \frac{\boldsymbol{S}_0 + S_\mu}{N_0 + N} = \frac{N_0}{N_0 + N} \frac{\boldsymbol{S}_0}{N_0} + \frac{N}{N_0 + N} \frac{S_\mu}{N} = \lambda \Sigma_0 + (1 - \lambda)\, \hat{\Sigma}_{MLE}, \qquad \Sigma_0 = \frac{\boldsymbol{S}_0}{N_0} \ (\text{prior mode})$$

where λ controls the amount of shrinkage towards the prior.
MAP Estimation

$$\hat{\Sigma}_{MAP} = \lambda \Sigma_0 + (1 - \lambda)\, \hat{\Sigma}_{MLE}, \qquad \Sigma_0 = \frac{\boldsymbol{S}_0}{N_0} \ (\text{prior mode})$$

Set λ by cross-validation. Alternatively, we can use the closed-form formula provided in Ledoit & Wolf and Schaefer & Strimmer, which is the optimal frequentist estimate if we use squared loss. This is not a natural loss function for covariance matrices, since it ignores the positive definite constraint, but it results in a simple estimator (see the PMTK function shrinkcov).

For the prior mode Σ0, it is common to use the following (data-dependent) choice:
$$\Sigma_0 = \mathrm{diag}\!\left(\hat{\Sigma}_{MLE}\right)$$

Ledoit, O. and M. Wolf (2004b). A well conditioned estimator for large dimensional covariance matrices. J. of Multivariate Analysis 88(2), 365–411.
Ledoit, O. and M. Wolf (2004a). Honey, I Shrunk the Sample Covariance Matrix. J. of Portfolio Management 31(1).
Schaefer, J. and K. Strimmer (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statist. Appl. Genet. Mol. Biol. 4(32).
MAP Shrinkage Estimation

With $\Sigma_0 = \mathrm{diag}\!\left(\hat{\Sigma}_{MLE}\right)$, the MAP estimate is:
$$\hat{\Sigma}_{MAP}(i, j) = \begin{cases} \hat{\Sigma}_{MLE}(i, j) & \text{if } i = j \\ (1 - \lambda)\, \hat{\Sigma}_{MLE}(i, j) & \text{otherwise} \end{cases}$$

Thus we see that the diagonal entries are equal to their MLE estimates, and the off-diagonal elements are "shrunk" somewhat towards 0 (shrinkage estimation, or regularized estimation).

The benefits of MAP estimation are illustrated next. We consider fitting a 50-dimensional Gaussian to N = 100, N = 50 and N = 25 data points. We see that the MAP estimate is always well-conditioned, unlike the MLE. In particular, the eigenvalue spectrum of the MAP estimate is much closer to that of the true matrix than the MLE's. The eigenvectors, however, are unaffected.
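A minimal Python sketch of this shrinkage estimator with a fixed λ (the PMTK shrinkcov function instead picks λ via the Ledoit-Wolf/Schaefer-Strimmer formula; here λ is simply set by hand):

import numpy as np

def shrink_cov(X, lam=0.9):
    """MAP shrinkage estimate: keep the MLE diagonal, shrink off-diagonals by (1-lam)."""
    S_mle = np.cov(X, rowvar=False, bias=True)        # MLE covariance
    S_map = (1.0 - lam) * S_mle
    np.fill_diagonal(S_map, np.diag(S_mle))           # diagonal entries stay at the MLE
    return S_map

np.random.seed(6)
D, N = 50, 25
X = np.random.multivariate_normal(np.zeros(D), np.eye(D), size=N)
S_mle = np.cov(X, rowvar=False, bias=True)
S_map = shrink_cov(X, lam=0.9)
print("condition number MLE:", np.linalg.cond(S_mle))   # huge: rank deficient for N < D
print("condition number MAP:", np.linalg.cond(S_map))   # well conditioned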
Posterior Distribution of Σ

Estimating a covariance matrix in D = 50 dimensions using N ∈ {100, 50, 25} samples. Eigenvalues in descending order for the true covariance matrix (solid black), MLE (dotted blue) and MAP estimates (dashed red) with λ = 0.9. The condition number of each matrix is also given in the legend.

[Figure: eigenvalue spectra for N = 100 (true κ = 10.00, MLE κ = 71, MAP κ = 8.62), N = 50 (true κ = 10.00, MLE κ = 4.1e+16, MAP κ = 8.85), and N = 25 (true κ = 10.00, MLE κ = 3.7e+17, MAP κ = 21.09). shrinkcovDemo from PMTK.]
Inference for Both m and Λ

Suppose x1, x2, …, xN ~ (i.i.d.) N(m, Λ⁻¹). We do not know m or Λ.

When both sets of parameters are unknown, a conjugate family of priors is one in which Λ ~ Wi(Λ | T, ν) and m | Λ ~ N(m0, (κΛ)⁻¹).

The Wishart distribution is the multivariate analog of the Gamma distribution (an extension to positive definite matrices). If a matrix U has the Wishart distribution, then U⁻¹ has the inverse Wishart distribution. The resulting p(m, Λ | m0, κ, T, ν) is the Gaussian-Wishart distribution.

The quantity ν is a positive scalar, while T is a positive definite matrix. They play roles analogous to those played by a and b, respectively, in the Gamma distribution. The other parameters of the prior are the mean vector m0 and κ, the latter of which represents the "a priori number of observations".
Inference for Both m and Λ

The likelihood and prior distributions are given explicitly as:
$$p(\mathcal{D} \mid \boldsymbol{m}, \Lambda) = (2\pi)^{-ND/2} |\Lambda|^{N/2} \exp\left\{ -\frac{1}{2} \sum_{i=1}^{N} (\boldsymbol{x}_i - \boldsymbol{m})^T \Lambda (\boldsymbol{x}_i - \boldsymbol{m}) \right\}$$
$$p(\boldsymbol{m}, \Lambda) = \mathrm{NWi}(\boldsymbol{m}, \Lambda \mid \boldsymbol{m}_0, \kappa_0, \nu_0, \boldsymbol{T}_0) = \mathcal{N}\!\left(\boldsymbol{m} \mid \boldsymbol{m}_0, (\kappa_0 \Lambda)^{-1}\right) \mathrm{Wi}(\Lambda \mid \boldsymbol{T}_0, \nu_0)$$
$$= \frac{1}{Z} |\Lambda|^{1/2} \exp\left\{ -\frac{\kappa_0}{2} (\boldsymbol{m} - \boldsymbol{m}_0)^T \Lambda (\boldsymbol{m} - \boldsymbol{m}_0) \right\} |\Lambda|^{(\nu_0 - D - 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\boldsymbol{T}_0^{-1} \Lambda) \right\}$$
where the normalization Z involves $\left(\frac{2\pi}{\kappa_0}\right)^{D/2}$ and the multivariate Gamma function $\Gamma_D(\nu_0/2)$.

Combining them gives the following posterior:
$$p(\boldsymbol{m}, \Lambda \mid \mathcal{D}) = \mathrm{NWi}(\boldsymbol{m}, \Lambda \mid \boldsymbol{m}_N, \kappa_N, \nu_N, \boldsymbol{T}_N)$$
$$\boldsymbol{m}_N = \frac{\kappa_0 \boldsymbol{m}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \qquad \kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N$$
$$\boldsymbol{T}_N^{-1} = \boldsymbol{T}_0^{-1} + \boldsymbol{S} + \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{m}_0)(\bar{\boldsymbol{x}} - \boldsymbol{m}_0)^T, \qquad \boldsymbol{S} = \sum_{i=1}^{N} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^T$$

M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.
K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 8)
Inference for Both m and Λ

The posterior marginals can be derived as:
$$p(\Lambda \mid \mathcal{D}) = \mathrm{Wi}(\Lambda \mid \boldsymbol{T}_N, \nu_N), \qquad p(\boldsymbol{m} \mid \mathcal{D}) = \mathcal{T}_{\nu_N - D + 1}\!\left(\boldsymbol{m} \,\Big|\, \boldsymbol{m}_N, \frac{\boldsymbol{T}_N^{-1}}{\kappa_N (\nu_N - D + 1)}\right)$$

One can also derive the MAP estimates as:
$$(\hat{\boldsymbol{m}}, \hat{\Lambda}) = \arg\max_{\boldsymbol{m}, \Lambda} p(\boldsymbol{m}, \Lambda \mid \mathcal{D}): \qquad \hat{\boldsymbol{m}} = \frac{\kappa_0 \boldsymbol{m}_0 + \sum_{i=1}^{N} \boldsymbol{x}_i}{\kappa_0 + N}, \qquad \hat{\Lambda}^{-1} = \frac{\boldsymbol{T}_0^{-1} + \kappa_0 (\boldsymbol{m}_0 - \hat{\boldsymbol{m}})(\boldsymbol{m}_0 - \hat{\boldsymbol{m}})^T + \sum_{i=1}^{N} (\boldsymbol{x}_i - \hat{\boldsymbol{m}})(\boldsymbol{x}_i - \hat{\boldsymbol{m}})^T}{\nu_0 + N - D}$$

These are reduced to the MLE by setting $\kappa_0 = 0$, $\nu_0 = D$, $\boldsymbol{T}_0^{-1} = \boldsymbol{0}$.
Inference for Both m and Λ

The posterior predictive is:
$$p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{\nu_N - D + 1}\!\left(\boldsymbol{x} \,\Big|\, \boldsymbol{m}_N, \frac{\kappa_N + 1}{\kappa_N (\nu_N - D + 1)}\, \boldsymbol{T}_N^{-1}\right)$$

The marginal likelihood can be computed as a ratio of normalization constants:
$$p(\mathcal{D}) = \frac{Z_N}{Z_0}\, \frac{1}{(2\pi)^{ND/2}} = \frac{1}{\pi^{ND/2}}\, \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)}\, \frac{|\boldsymbol{T}_0^{-1}|^{\nu_0/2}}{|\boldsymbol{T}_N^{-1}|^{\nu_N/2}} \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2}$$

A useful reference analysis considers $\boldsymbol{m}_0 = \boldsymbol{0}$, $\kappa_0 = 0$, $\nu_0 = -1$, $\boldsymbol{T}_0^{-1} = \boldsymbol{0}$. This results in the following prior:
$$p(\boldsymbol{m}, \Lambda) \propto |\Lambda|^{-(D+1)/2}$$

The posterior parameters are simplified as:
$$\boldsymbol{m}_N = \bar{\boldsymbol{x}}, \qquad \kappa_N = N, \qquad \nu_N = N - 1, \qquad \boldsymbol{T}_N^{-1} = \boldsymbol{S}$$

The posterior marginals and posterior predictive are:
$$p(\Lambda \mid \mathcal{D}) = \mathrm{Wi}(\Lambda \mid \boldsymbol{S}^{-1}, N - 1), \qquad p(\boldsymbol{m} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left(\boldsymbol{m} \,\Big|\, \bar{\boldsymbol{x}}, \frac{\boldsymbol{S}}{N(N - D)}\right), \qquad p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left(\boldsymbol{x} \,\Big|\, \bar{\boldsymbol{x}}, \frac{(N + 1)\,\boldsymbol{S}}{N(N - D)}\right)$$
Inference for m and Σ

For the case of the multivariate Gaussian of a D-dimensional variable x, N(x | m, Σ), with both the mean and the covariance unknown, the likelihood is of the form:
$$p(\mathcal{D} \mid \boldsymbol{m}, \Sigma) = (2\pi)^{-ND/2} |\Sigma|^{-N/2} \exp\left\{ -\frac{1}{2} \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{m})^T \Sigma^{-1} (\boldsymbol{x}_n - \boldsymbol{m}) \right\}$$
$$= (2\pi)^{-ND/2} |\Sigma|^{-N/2} \exp\left\{ -\frac{N}{2} (\boldsymbol{m} - \bar{\boldsymbol{x}})^T \Sigma^{-1} (\boldsymbol{m} - \bar{\boldsymbol{x}}) \right\} \exp\left\{ -\frac{1}{2} \mathrm{Tr}\!\left(\Sigma^{-1} \boldsymbol{S}_{\bar{x}}\right) \right\}, \qquad \boldsymbol{S}_{\bar{x}} = \sum_{n=1}^{N} (\boldsymbol{x}_n - \bar{\boldsymbol{x}})(\boldsymbol{x}_n - \bar{\boldsymbol{x}})^T$$

The conjugate prior is given as the product of a Gaussian and the inverse Wishart distribution (NIW):
$$\mathrm{NIW}(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_0, \kappa_0, \nu_0, \boldsymbol{S}_0) = \mathcal{N}\!\left(\boldsymbol{m} \,\Big|\, \boldsymbol{m}_0, \frac{1}{\kappa_0} \Sigma\right) \mathrm{InvWi}(\Sigma \mid \boldsymbol{S}_0^{-1}, \nu_0)$$
$$= \frac{1}{Z_{NIW}} |\Sigma|^{-1/2} \exp\left\{ -\frac{\kappa_0}{2} (\boldsymbol{m} - \boldsymbol{m}_0)^T \Sigma^{-1} (\boldsymbol{m} - \boldsymbol{m}_0) \right\} |\Sigma|^{-(\nu_0 + D + 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\Sigma^{-1} \boldsymbol{S}_0) \right\}$$
$$= \frac{1}{Z_{NIW}} |\Sigma|^{-(\nu_0 + D + 2)/2} \exp\left\{ -\frac{\kappa_0}{2} (\boldsymbol{m} - \boldsymbol{m}_0)^T \Sigma^{-1} (\boldsymbol{m} - \boldsymbol{m}_0) - \frac{1}{2} \mathrm{Tr}(\Sigma^{-1} \boldsymbol{S}_0) \right\}$$
$$Z_{NIW} = 2^{\nu_0 D/2}\, \Gamma_D\!\left(\frac{\nu_0}{2}\right) \left(\frac{2\pi}{\kappa_0}\right)^{D/2} |\boldsymbol{S}_0|^{-\nu_0/2}, \qquad \Gamma_D: \text{multivariate Gamma function}$$
The Posterior of m and Σ

The posterior is NIW, given as:
$$p(\boldsymbol{m}, \Sigma \mid \mathcal{D}) = \mathrm{NIW}(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_N, \kappa_N, \nu_N, \boldsymbol{S}_N)$$
$$\boldsymbol{m}_N = \frac{\kappa_0 \boldsymbol{m}_0 + N \bar{\boldsymbol{x}}}{\kappa_N} = \frac{\kappa_0}{\kappa_0 + N} \boldsymbol{m}_0 + \frac{N}{\kappa_0 + N} \bar{\boldsymbol{x}}, \qquad \kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N$$
$$\boldsymbol{S}_N = \boldsymbol{S}_0 + \boldsymbol{S}_{\bar{x}} + \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{m}_0)(\bar{\boldsymbol{x}} - \boldsymbol{m}_0)^T, \qquad \boldsymbol{S}_{\bar{x}} = \sum_{i=1}^{N} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^T$$

The posterior mean is a convex combination of the prior mean and the MLE, with strength κ0 + N. The posterior scatter matrix SN is the prior scatter matrix S0 plus the empirical scatter matrix S_x̄ plus an extra term due to the uncertainty in the mean, which creates its own scatter matrix.

Minka, T. (2000). Inferring a Gaussian distribution. Technical report, MIT.
Chipman, H., E. George, and R. McCulloch (2001). The practical implementation of Bayesian Model Selection. Model Selection. IMS Lecture Notes.
Fraley, C. and A. Raftery (2007). Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. J. of Classification 24, 155–181.
K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007
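A minimal Python sketch of the NIW hyperparameter update (synthetic data; not the lecture's MATLAB code):

import numpy as np

def niw_update(X, m0, kappa0, nu0, S0):
    """Posterior hyperparameters of the Normal-Inverse-Wishart prior."""
    X = np.asarray(X)
    N, xbar = X.shape[0], X.mean(axis=0)
    kappaN = kappa0 + N
    mN = (kappa0 * m0 + N * xbar) / kappaN
    nuN = nu0 + N
    S_xbar = (X - xbar).T @ (X - xbar)
    SN = S0 + S_xbar + (kappa0 * N / kappaN) * np.outer(xbar - m0, xbar - m0)
    return mN, kappaN, nuN, SN

np.random.seed(7)
D = 2
X = np.random.multivariate_normal([1.0, -2.0], [[1.0, 0.4], [0.4, 2.0]], size=200)
mN, kappaN, nuN, SN = niw_update(X, m0=np.zeros(D), kappa0=1.0, nu0=D + 2, S0=np.eye(D))
print("posterior mean of m:", mN)
print("posterior mean of Sigma:\n", SN / (nuN - D - 1))   # mean of InvWi(S_N^-1, nu_N)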
MAP Estimate of m and Σ

The mode of the joint posterior $p(\boldsymbol{m}, \Sigma \mid \mathcal{D}) = \mathrm{NIW}(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_N, \kappa_N, \nu_N, \boldsymbol{S}_N)$ is:
$$(\hat{\boldsymbol{m}}, \hat{\Sigma}) = \arg\max_{\boldsymbol{m}, \Sigma}\, p(\boldsymbol{m}, \Sigma \mid \mathcal{D}) = \left(\boldsymbol{m}_N, \frac{\boldsymbol{S}_N}{\nu_N + D + 2}\right)$$

For κ0 = 0, this becomes:
$$(\hat{\boldsymbol{m}}, \hat{\Sigma}) = \left(\bar{\boldsymbol{x}}, \frac{\boldsymbol{S}_0 + \boldsymbol{S}_{\bar{x}}}{\nu_0 + N + D + 2}\right)$$

It is interesting to note that this mode is almost the same as the MAP estimate computed earlier; it differs by 1 in the denominator, as the mode above is the mode of the joint posterior!
The Posterior Marginals of m and Σ

The posterior marginals for Σ and m are:
$$p(\Sigma \mid \mathcal{D}) = \int p(\boldsymbol{m}, \Sigma \mid \mathcal{D})\, d\boldsymbol{m} = \mathrm{InvWi}(\Sigma \mid \boldsymbol{S}_N^{-1}, \nu_N), \qquad \mathbb{E}[\Sigma \mid \mathcal{D}] = \frac{\boldsymbol{S}_N}{\nu_N - D - 1}, \qquad \hat{\Sigma}_{MAP} = \frac{\boldsymbol{S}_N}{\nu_N + D + 1}$$
$$p(\boldsymbol{m} \mid \mathcal{D}) = \int p(\boldsymbol{m}, \Sigma \mid \mathcal{D})\, d\Sigma = \mathcal{T}_{\nu_N - D + 1}\!\left(\boldsymbol{m} \,\Big|\, \boldsymbol{m}_N, \frac{\boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)}\right)$$

It is not surprising that the last marginal is Student's T, which we know can be represented as a mixture of Gaussians.

To see the connection with the scalar case, note that SN plays the role of the posterior sum of squares $\nu_N \sigma_N^2$: for D = 1,
$$\frac{\boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)} = \frac{\nu_N \sigma_N^2}{\kappa_N \nu_N} = \frac{\sigma_N^2}{\kappa_N}$$
The Posterior Predictive of m and Σ

The posterior predictive p(x|D) = p(x, D)/p(D) can be evaluated as:
$$p(\boldsymbol{x} \mid \mathcal{D}) = \int\!\!\int \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{m}, \Sigma)\, \mathrm{NIW}(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_N, \kappa_N, \nu_N, \boldsymbol{S}_N)\, d\boldsymbol{m}\, d\Sigma = \mathcal{T}_{\nu_N - D + 1}\!\left(\boldsymbol{x} \,\Big|\, \boldsymbol{m}_N, \frac{\kappa_N + 1}{\kappa_N (\nu_N - D + 1)}\, \boldsymbol{S}_N\right)$$

Recall that the Student's T distribution has heavier tails than the Gaussian but rapidly becomes Gaussian-like.

To see the connection of the above expression with the scalar case, note that for D = 1:
$$\frac{(\kappa_N + 1)\, \boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)} = \frac{(\kappa_N + 1)\, \nu_N \sigma_N^2}{\kappa_N \nu_N} = \frac{1 + \kappa_N}{\kappa_N}\, \sigma_N^2$$

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 9)
Marginal Likelihood

The posterior can be written as
$$p(\boldsymbol{m}, \Sigma \mid \mathcal{D}) = \frac{1}{p(\mathcal{D})}\, \frac{1}{(2\pi)^{ND/2}}\, \frac{1}{Z_0}\, \mathrm{N}'(\mathcal{D} \mid \boldsymbol{m}, \Sigma)\; \mathrm{NIW}'(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_0, \kappa_0, \nu_0, \boldsymbol{S}_0)$$
where the last two expressions are the unnormalized likelihood and prior,
$$\mathrm{N}'(\mathcal{D} \mid \boldsymbol{m}, \Sigma) = |\Sigma|^{-N/2} \exp\left\{ -\frac{N}{2} (\boldsymbol{m} - \bar{\boldsymbol{x}})^T \Sigma^{-1} (\boldsymbol{m} - \bar{\boldsymbol{x}}) - \frac{1}{2} \mathrm{Tr}(\Sigma^{-1} \boldsymbol{S}_{\bar{x}}) \right\}$$
$$\mathrm{NIW}'(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_0, \kappa_0, \nu_0, \boldsymbol{S}_0) = |\Sigma|^{-(\nu_0 + D + 2)/2} \exp\left\{ -\frac{\kappa_0}{2} (\boldsymbol{m} - \boldsymbol{m}_0)^T \Sigma^{-1} (\boldsymbol{m} - \boldsymbol{m}_0) - \frac{1}{2} \mathrm{Tr}(\Sigma^{-1} \boldsymbol{S}_0) \right\}$$
Their product is the unnormalized posterior $\mathrm{NIW}'(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_N, \kappa_N, \nu_N, \boldsymbol{S}_N)$, which integrates to $Z_N$.

The marginal likelihood is then a ratio of normalization constants:
$$p(\mathcal{D}) = \frac{Z_N}{Z_0}\, \frac{1}{(2\pi)^{ND/2}} = \frac{1}{\pi^{ND/2}}\, \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)}\, \frac{|\boldsymbol{S}_0|^{\nu_0/2}}{|\boldsymbol{S}_N|^{\nu_N/2}} \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2}$$

Note that for D = 1, with $S = \nu \sigma^2$, this reduces to the familiar equation
$$p(\mathcal{D}) = \frac{1}{\pi^{N/2}}\, \frac{\Gamma(\nu_N/2)}{\Gamma(\nu_0/2)}\, \frac{(\nu_0 \sigma_0^2)^{\nu_0/2}}{(\nu_N \sigma_N^2)^{\nu_N/2}} \sqrt{\frac{\kappa_0}{\kappa_N}}$$

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (see the calculation of the marginal likelihood for the 1D analysis of the Normal-Inverse-Chi-Squared prior in Section 5)
Non-Informative Prior

The uninformative Jeffreys prior is $p(\boldsymbol{m}, \Sigma) \propto |\Sigma|^{-(D+1)/2}$. This is obtained in the limit $\kappa_0 \to 0$, $\nu_0 \to -1$, $|\boldsymbol{S}_0| \to 0$.

In this case, we have:
$$\boldsymbol{m}_N = \bar{\boldsymbol{x}}, \qquad \kappa_N = N, \qquad \nu_N = N - 1, \qquad \boldsymbol{S}_N = \boldsymbol{S}_{\bar{x}} = \sum_{i} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^T$$

The posterior marginals are then given as:
$$p(\Sigma \mid \mathcal{D}) = \mathrm{IW}(\Sigma \mid \boldsymbol{S}_{\bar{x}}^{-1}, N - 1), \qquad p(\boldsymbol{m} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left(\boldsymbol{m} \,\Big|\, \bar{\boldsymbol{x}}, \frac{\boldsymbol{S}_{\bar{x}}}{N(N - D)}\right)$$

Also the posterior predictive is:
$$p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left(\boldsymbol{x} \,\Big|\, \bar{\boldsymbol{x}}, \frac{(N + 1)\, \boldsymbol{S}_{\bar{x}}}{N(N - D)}\right)$$

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004 (pp. 88)
K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (see Section 9)
Non-Informative Prior

Based on the report of Minka below, the uninformative prior should instead use ν0 = 0 (rather than ν0 = −1), since the limit of proper NIW priors is
$$\lim_{\kappa \to 0}\, \mathcal{N}\!\left(\boldsymbol{m} \,\Big|\, \boldsymbol{m}_0, \frac{\Sigma}{\kappa}\right) \mathrm{InvWi}(\Sigma \mid \kappa \boldsymbol{I}, \kappa) \propto |2\pi\Sigma|^{-1/2}\, |\Sigma|^{-(D+1)/2} = \mathrm{NIW}(\boldsymbol{m}, \Sigma \mid \boldsymbol{0}, 0, 0, \boldsymbol{0}\boldsymbol{I})$$

Often, a data-dependent weakly informative prior is recommended (see Chipman et al. and Fraley and Raftery):
$$\boldsymbol{S}_0 = \frac{\mathrm{diag}(\boldsymbol{S}_{\bar{x}})}{N}, \qquad \nu_0 = D + 2 \ \text{(to ensure } \mathbb{E}[\Sigma] = \boldsymbol{S}_0\text{)}, \qquad \boldsymbol{m}_0 = \bar{\boldsymbol{x}}, \qquad \kappa_0 = 0.01$$

Minka, T. (2000). Inferring a Gaussian distribution. Technical report, MIT.
Chipman, H., E. George, and R. McCulloch (2001). The practical implementation of Bayesian Model Selection. Model Selection. IMS Lecture Notes.
Fraley, C. and A. Raftery (2007). Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. J. of Classification 24, 155–181.
Visualization of the Wishart Distribution

Since the Wishart is a distribution over matrices, it is difficult to plot. However, one can sample from it, and in 2D use the eigenvectors of each sampled matrix to define an ellipse (see the figure on the next slide).

For higher-dimensional matrices, we can plot the marginals. The diagonal elements of a Wishart-distributed matrix have Gamma distributions. For off-diagonal elements, one can sample matrices from the distribution and then compute their distribution empirically.

We can convert each sampled matrix to a correlation matrix, and thus compute a Monte Carlo approximation to the expected correlations:
$$\mathbb{E}[R_{ij}] \approx \frac{1}{S} \sum_{s=1}^{S} R_{ij}\!\left(\Sigma^{(s)}\right), \qquad \Sigma^{(s)} \sim \mathrm{Wi}(\boldsymbol{S}, \nu), \qquad R_{ij}(\Sigma) = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\, \Sigma_{jj}}}$$

We can then use kernel density estimation to produce a smooth approximation to the univariate density of $R_{ij}$ for plotting purposes.
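A minimal Python sketch of this Monte Carlo visualization recipe (not the PMTK wiPlotDemo; the scale matrix below is the one quoted on the next slide):

import numpy as np
from scipy import stats

S = np.array([[3.1653, -0.0262], [-0.0262, 0.6477]])    # scale matrix from the figure
nu = 3.0
samples = stats.wishart(df=nu, scale=S).rvs(size=1000, random_state=0)

# Convert each sample to a correlation coefficient R_12 = Sigma_12 / sqrt(Sigma_11 Sigma_22)
r12 = samples[:, 0, 1] / np.sqrt(samples[:, 0, 0] * samples[:, 1, 1])
print("Monte Carlo estimate of E[R_12]:", r12.mean())

kde = stats.gaussian_kde(r12)                            # smooth density of R_12 for plotting
grid = np.linspace(-1, 1, 5)
print("KDE of R_12 on a coarse grid:", kde(grid))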
Visualization of the Wishart Distribution

Above: Samples from Σ ∼ Wi(S, ν), where S = [3.1653, −0.0262; −0.0262, 0.6477] and ν = 3. Right: Plots of the marginals (which are Gamma), and the sample-based marginal on the correlation coefficient.

[Figure: ellipses for sampled matrices from Wi(dof = 3.0, S), with E[Σ] = [9.5, −0.1; −0.1, 1.9] and E[ρ] = −0.018; marginal densities of σ₁², σ₂², and ρ(1,2). wiPlotDemo from PMTK.]

If ν = 3, there is a lot of uncertainty about the value of the correlation coefficient ρ (almost a uniform distribution on [−1, 1]). The sampled matrices are highly variable, and some are nearly singular. As ν increases, the sampled matrices are more concentrated on the prior S.
A. Gelman et al., Visualizing Distributions of Covariance Matrices, Unpublished.