
Gaussian Models: Bayesian Inference


Page 1: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Gaussian Models: Bayesian Inference Prof. Nicholas Zabaras

Materials Process Design and Control Laboratory

Sibley School of Mechanical and Aerospace Engineering

101 Frank H. T. Rhodes Hall

Cornell University

Ithaca, NY 14853-3801

Email: [email protected]

URL: http://mpdc.mae.cornell.edu/

February 1, 2014

1

Page 2: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Contents

Inferring the precision of a univariate Gaussian with known mean; Gamma and inverse Gamma as priors for λ and σ²; the scaled inverse chi-squared distribution as a prior for σ².

Bayesian inference for the univariate Gaussian with unknown mean μ and precision λ; the Normal-Gamma distribution as a prior for (μ, λ).

Posterior for (μ, σ²) using a normal-inverse-χ² prior; marginal posteriors, credible intervals, the Bayesian t-test, multi-sensor fusion with unknown parameters.

Inference for μ in a multivariate Gaussian with a Gaussian prior; inference of Λ in a multivariate Gaussian and the Wishart distribution; inference of Σ and the inverse Wishart distribution; MAP estimation and MAP shrinkage estimation; inference for (μ, Λ) and for (μ, Σ); posterior marginals of (μ, Σ); visualization of the Wishart.

2

Chris Bishop, Pattern Recognition and Machine Learning, Chapter 2

Kevin Murphy, Machine Learning: A probabilistic Perspective, Chapter 4

Page 3: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Inference of Precision with Known Mean

3

Consider $x_n \sim \mathcal{N}(x_n \mid \mu, \lambda^{-1})$, $n = 1, \dots, N$. We want to infer the precision $\lambda = 1/\sigma^2$, with the mean $\mu$ taken as known.

The likelihood takes the form:

$$p(\mathbf{X} \mid \lambda) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \lambda^{-1}) \propto \lambda^{N/2} \exp\Big\{ -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big\}$$

The corresponding conjugate prior should be proportional to the product of a power of λ and the exponential of a linear function of λ. This corresponds to the Gamma distribution:

$$\mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)} \lambda^{a-1} e^{-b\lambda}, \quad \lambda > 0, \qquad \mathbb{E}[\lambda] = \frac{a}{b}, \quad \operatorname{var}[\lambda] = \frac{a}{b^2}$$

The Gamma distribution has a finite integral if $a > 0$, and the distribution itself is finite if $a \geq 1$.

Page 4: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Inference of Precision with Known Mean

4

The posterior takes the form:

$$p(\lambda \mid \mathbf{X}, \mu) \propto \mathrm{Gamma}(\lambda \mid a_0, b_0)\, \lambda^{N/2} \exp\Big\{ -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big\} \propto \lambda^{a_0 + N/2 - 1} \exp\Big\{ -\lambda \Big( b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big) \Big\}$$

We can immediately see that the posterior is also a Gamma distribution:

$$p(\lambda \mid \mathbf{X}, \mu) = \mathrm{Gamma}(\lambda \mid a_N, b_N), \qquad a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 = b_0 + \frac{N}{2} \sigma^2_{ML}$$

Here $\sigma^2_{ML}$ is the MLE of the variance.
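As a concrete illustration of this conjugate update (not part of the original slides), the following minimal Python sketch computes $(a_N, b_N)$; the prior values and data are made up.

```python
# Gamma posterior for the precision lambda with known mean mu (illustrative values).
import numpy as np

def gamma_posterior_precision(x, mu, a0, b0):
    """Return (aN, bN) of p(lambda | X, mu) = Gamma(lambda | aN, bN)."""
    x = np.asarray(x, dtype=float)
    N = x.size
    aN = a0 + N / 2.0                       # each observation adds 1/2 to a
    bN = b0 + 0.5 * np.sum((x - mu) ** 2)   # bN = b0 + (N/2) * sigma2_ML
    return aN, bN

x = np.array([4.2, 5.9, 5.1, 3.8, 6.3])
aN, bN = gamma_posterior_precision(x, mu=5.0, a0=1.0, b0=1.0)
print(aN, bN, aN / bN)                      # posterior mean of lambda is aN / bN
```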

Page 5: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Inference of Precision with Known Mean

5

The effect of observing N data points is to increase the value of a by N/2 (i.e., ½ for each data point). Thus we interpret the parameter $a_0$ as corresponding to $2a_0$ 'effective' prior observations.

Each measurement contributes $\sigma^2_{ML}/2$ to the parameter b. Since we have $2a_0$ effective prior measurements, each of them contributes to b an effective prior variance of

$$\sigma_0^2 = \frac{2 b_0}{2 a_0} = \frac{b_0}{a_0}$$

The interpretation of a conjugate prior in terms of effective dummy data points is typical for the exponential family of distributions.

The results above are identical to inferring the variance $\sigma^2$ directly, using the prior $\mathrm{InvGamma}(\sigma^2 \mid a_0, b_0)$ and obtaining the posterior $\mathrm{InvGamma}(\sigma^2 \mid a_N, b_N)$, with

$$a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{N}{2} \sigma^2_{ML}$$

Page 6: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras) 6

Gamma and Inverse Gamma

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004

$$\text{If } \lambda \sim \mathrm{Gamma}(\lambda \mid a, b), \text{ then } \frac{1}{\lambda} \sim \mathrm{InvGamma}\Big(\frac{1}{\lambda} \,\Big|\, a, b\Big)$$

$$\text{Here: } \frac{1}{\sigma^2} \sim \mathrm{Gamma}\Big(\frac{1}{\sigma^2} \,\Big|\, a, b\Big) \;\Rightarrow\; \sigma^2 \sim \mathrm{InvGamma}(\sigma^2 \mid a, b)$$

Page 7: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Univariate Posterior – Inverse Chi Squared Prior

An alternative prior for σ² is the scaled inverse chi-squared distribution*

$$\chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) = \mathrm{InvGamma}\Big(\sigma^2 \,\Big|\, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\Big) \propto (\sigma^2)^{-\nu_0/2 - 1} \exp\Big( -\frac{\nu_0 \sigma_0^2}{2 \sigma^2} \Big)$$

Here $\nu_0$ represents the strength of the prior and $\sigma_0^2$ encodes its value. With this, the posterior takes the form:

$$p(\sigma^2 \mid \mathcal{D}, \mu) = \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2), \qquad \nu_N = \nu_0 + N, \qquad \sigma_N^2 = \frac{\nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \mu)^2}{\nu_N}$$

The posterior dof $\nu_N$ is the prior dof plus N. The posterior sum of squares $\nu_N \sigma_N^2$ is the prior sum of squares $\nu_0 \sigma_0^2$ plus the data sum of squares. An uninformative prior corresponds to zero virtual sample size, $\nu_0 = 0$; this is the prior $p(\sigma^2) \propto \sigma^{-2}$.

This parameterization is more appealing than the (a, b) form of the inverse Gamma, since $\nu_0$ and $\sigma_0^2$ directly encode the prior strength and the prior variance.

* Often denoted as

$$\text{Scale-Inv-}\chi^2(\sigma^2 \mid \nu_0, \sigma_0^2): \quad \text{mean} = \frac{\nu_0 \sigma_0^2}{\nu_0 - 2} \; (\nu_0 > 2), \qquad \text{mode} = \frac{\nu_0 \sigma_0^2}{\nu_0 + 2}, \qquad \operatorname{var} = \frac{2 \nu_0^2 \sigma_0^4}{(\nu_0 - 2)^2 (\nu_0 - 4)} \; (\nu_0 > 4)$$

Page 8: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Sequential Update of the Posterior for s2

Sequential update of the posterior for σ², starting from the near-uninformative prior $\mathrm{IW}(\sigma^2 \mid \nu_0 = 0.001, S_0 = 0.001)$. The data were generated from $\mathcal{N}(5, 10)$.

[Figure: posterior p(σ² | D) after N = 2, 5, 50 and 100 observations; prior = IW(ν = 0.001, S = 0.001), true σ² = 10. gaussSeqUpdateSigma1D from PMTK]

For D = 1 the inverse Wishart reduces to an inverse Gamma:

$$\mathrm{IW}(\sigma^2 \mid S, \nu) = \mathrm{IG}\Big(\sigma^2 \,\Big|\, \frac{\nu}{2}, \frac{S}{2}\Big)$$

Gelman 2006. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1(3):515–533.
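A small Python sketch of this sequential update (not the PMTK code; the data stream is re-simulated here and the prior values follow the figure legend):

```python
# Sequential inverse-Gamma update of p(sigma^2 | D) with known mean mu.
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(0)
mu, true_var = 5.0, 10.0
a, b = 0.001, 0.001                       # near-uninformative InvGamma(a, b) prior on sigma^2
for N in (2, 5, 50, 100):
    x = rng.normal(mu, np.sqrt(true_var), size=N)
    aN = a + N / 2.0
    bN = b + 0.5 * np.sum((x - mu) ** 2)
    post = invgamma(aN, scale=bN)         # scipy's InvGamma(aN, bN) with shape/scale arguments
    print(N, post.ppf(0.5))               # posterior median of sigma^2, approaching 10
```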

Page 9: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Bayesian Inference: Unknown Mean and Precision

9

Consider $x_n \sim \mathcal{N}(x_n \mid \mu, \lambda^{-1})$, $n = 1, \dots, N$. We want to infer both the precision $\lambda = 1/\sigma^2$ and the mean $\mu$.

The likelihood takes the form:

$$p(\mathbf{X} \mid \mu, \lambda) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \lambda^{-1}) \propto \lambda^{N/2} \exp\Big\{ -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big\} \propto \Big[ \lambda^{1/2} \exp\Big( -\frac{\lambda \mu^2}{2} \Big) \Big]^{N} \exp\Big\{ \lambda \mu \sum_{n=1}^{N} x_n - \frac{\lambda}{2} \sum_{n=1}^{N} x_n^2 \Big\}$$

We need a prior that has a similar functional form in terms of λ and μ:

$$p(\mu, \lambda) \propto \Big[ \lambda^{1/2} \exp\Big( -\frac{\lambda \mu^2}{2} \Big) \Big]^{\beta} \exp\{ c \lambda \mu - d \lambda \} = \exp\Big\{ -\frac{\beta \lambda}{2} \Big( \mu - \frac{c}{\beta} \Big)^2 \Big\} \, \lambda^{\beta/2} \exp\Big\{ -\Big( d - \frac{c^2}{2\beta} \Big) \lambda \Big\}$$

Page 10: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Bayesian Inference: Unknown Mean and Precision

10

We can easily identify that the prior is of the form (Normal-Gamma):

$$p(\mu, \lambda) = \mathcal{N}\Big( \mu \,\Big|\, \frac{c}{\beta}, (\beta \lambda)^{-1} \Big) \, \mathrm{Gamma}\Big( \lambda \,\Big|\, \frac{1 + \beta}{2}, \, d - \frac{c^2}{2\beta} \Big)$$

Recall the form of the Gamma distribution:

$$\mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)} \lambda^{a-1} e^{-b\lambda}$$

Page 11: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Bayesian Inference: Unknown Mean and Precision

11

Combining the likelihood and prior (written with parameters $\mu_0, \beta_0, a_0, b_0$), we can re-arrange and write:

$$p(\mu, \lambda \mid \mathbf{X}) \propto \lambda^{N/2 + a_0 - 1} \exp\Big\{ -b_0 \lambda - \frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{\beta_0 \lambda}{2} (\mu - \mu_0)^2 \Big\}$$

Completing the square on the 2nd argument gives the conditional for μ:

$$\mu \mid \lambda, \mathbf{X} \sim \mathcal{N}\Big( \mu_N, \big( (\beta_0 + N) \lambda \big)^{-1} \Big), \qquad \mu_N = \frac{\beta_0 \mu_0 + \sum_{n=1}^{N} x_n}{\beta_0 + N} = \frac{\beta_0 \mu_0 + N \bar{x}}{\beta_0 + N}$$

while the remaining λ-dependence is again a Gamma distribution:

$$\lambda \mid \mathbf{X} \sim \mathrm{Gamma}(\lambda \mid a_N, b_N), \qquad a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \bar{x})^2 + \frac{\beta_0 N (\bar{x} - \mu_0)^2}{2 (\beta_0 + N)}$$
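A minimal Python sketch of this Normal-Gamma update, assuming the standard (μ0, β0, a0, b0) parameterization used above; the numbers are illustrative only.

```python
# Normal-Gamma posterior update for (mu, lambda).
import numpy as np

def normal_gamma_posterior(x, mu0, beta0, a0, b0):
    x = np.asarray(x, dtype=float)
    N, xbar = x.size, x.mean()
    betaN = beta0 + N
    muN = (beta0 * mu0 + N * xbar) / betaN
    aN = a0 + N / 2.0
    bN = b0 + 0.5 * np.sum((x - xbar) ** 2) + 0.5 * beta0 * N * (xbar - mu0) ** 2 / betaN
    return muN, betaN, aN, bN

print(normal_gamma_posterior([1.2, 0.7, 1.9, 1.4], mu0=0.0, beta0=2.0, a0=5.0, b0=6.0))
```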

Page 12: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

The Normal-Gamma Distribution

12

$$p(\mu, \lambda) = \mathcal{N}\big( \mu \mid \mu_0, (\beta \lambda)^{-1} \big) \, \mathrm{Gamma}(\lambda \mid a, b)$$

[Figure: contour plot of the Normal-Gamma prior for a = 5, b = 6, μ0 = 0, β = 2. MATLAB code.]

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (provides additional results for posterior marginals, posterior predictive, and reference results for an uninformative prior).

Page 13: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Posterior for μ and σ² for Scalar Data

We can also work directly with σ². We use the normal inverse chi-squared (NIX) distribution:

$$\mathrm{NI}\chi^2(\mu, \sigma^2 \mid m_0, \kappa_0, \nu_0, \sigma_0^2) = \mathcal{N}(\mu \mid m_0, \sigma^2 / \kappa_0) \, \chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) \propto (\sigma^2)^{-(\nu_0 + 3)/2} \exp\Big( -\frac{\nu_0 \sigma_0^2 + \kappa_0 (\mu - m_0)^2}{2 \sigma^2} \Big)$$

Similarly to our earlier calculations, the posterior is given as:

$$p(\mu, \sigma^2 \mid \mathcal{D}) = \mathrm{NI}\chi^2(\mu, \sigma^2 \mid m_N, \kappa_N, \nu_N, \sigma_N^2)$$
$$m_N = \frac{\kappa_0 m_0 + N \bar{x}}{\kappa_N}, \quad \kappa_N = \kappa_0 + N, \quad \nu_N = \nu_0 + N, \quad \nu_N \sigma_N^2 = \nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \bar{x})^2 + \frac{N \kappa_0}{\kappa_0 + N} (m_0 - \bar{x})^2$$

The posterior marginal for σ² and its posterior expectation are:

$$p(\sigma^2 \mid \mathcal{D}) = \int p(\mu, \sigma^2 \mid \mathcal{D}) \, d\mu = \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2), \qquad \mathbb{E}[\sigma^2 \mid \mathcal{D}] = \frac{\nu_N}{\nu_N - 2} \sigma_N^2$$

where, as before,

$$\chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) = \mathrm{IG}\Big( \sigma^2 \,\Big|\, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2} \Big) \propto (\sigma^2)^{-\nu_0/2 - 1} \exp\Big( -\frac{\nu_0 \sigma_0^2}{2 \sigma^2} \Big)$$

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (provides additional results for posterior marginals, posterior predictive, and reference results for an uninformative prior; Section 6 provides the analysis for a normal-inverse-Gamma prior).

Page 14: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Normal Inverse x2 Distribution

14

[Figure: surface/contour plots over (μ, σ²) of NIX(m0 = 0, κ0 = 1, ν0 = 1, σ²0 = 1), NIX(m0 = 0, κ0 = 5, ν0 = 1, σ²0 = 1) and NIX(m0 = 0, κ0 = 1, ν0 = 5, σ²0 = 1).]

The NIχ²(m0, κ0, ν0, σ²0) distribution. m0 is the prior mean and κ0 is how strongly we believe this; σ²0 is the prior variance and ν0 is how strongly we believe this. (a) The contour plot (underneath the surface) is shaped like a "squashed egg". (b) We increase the strength of our belief in the mean, so it gets narrower. (c) We increase the strength of our belief in the variance, so it gets narrower.

NIXdemo2

from PMTK

Page 15: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Posterior for μ and σ² for Scalar Data

The posterior marginal for μ is Student's T:

$$p(\mu \mid \mathcal{D}) = \int p(\mu, \sigma^2 \mid \mathcal{D}) \, d\sigma^2 = \mathcal{T}(\mu \mid m_N, \sigma_N^2 / \kappa_N, \nu_N)$$

Let us revisit these results with an uninformative prior:

$$p(\mu, \sigma^2) = p(\mu) \, p(\sigma^2) \propto \sigma^{-2} = \mathrm{NI}\chi^2(\mu, \sigma^2 \mid \mu_0 = 0, \kappa_0 = 0, \nu_0 = -1, \sigma_0^2 = 0)$$

With this prior, the posterior becomes:

$$p(\mu, \sigma^2 \mid \mathcal{D}) = \mathrm{NI}\chi^2(\mu, \sigma^2 \mid \bar{x}, N, N - 1, s^2), \qquad s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{N}{N - 1} \sigma^2_{MLE}$$

Here s is the sample standard deviation. Thus the marginal posterior for μ becomes:

$$p(\mu \mid \mathcal{D}) = \mathcal{T}\Big( \mu \,\Big|\, \bar{x}, \frac{s^2}{N}, N - 1 \Big), \qquad \operatorname{var}[\mu \mid \mathcal{D}] = \frac{\nu_N}{\nu_N - 2} \frac{s^2}{N} = \frac{N - 1}{N - 3} \frac{s^2}{N} \approx \frac{s^2}{N}$$

The standard error of the mean is defined as $\sqrt{\operatorname{var}[\mu \mid \mathcal{D}]} \approx s / \sqrt{N}$.

An approximate 95% posterior credible interval is thus $I_{0.95}(\mu \mid \mathcal{D}) = \bar{x} \pm 2 \, s / \sqrt{N}$.

Recall the Student's T distribution (with precision λ) and its moments:

$$\mathcal{T}(x \mid \mu, \lambda, \nu) = \frac{\Gamma\big(\frac{\nu + 1}{2}\big)}{\Gamma\big(\frac{\nu}{2}\big)} \Big( \frac{\lambda}{\pi \nu} \Big)^{1/2} \Big[ 1 + \frac{\lambda (x - \mu)^2}{\nu} \Big]^{-\frac{\nu + 1}{2}}$$
$$\text{Mean} = \mu \; (\nu > 1), \qquad \text{Mode} = \mu, \qquad \operatorname{Var} = \frac{\nu}{(\nu - 2)\lambda} \; (\nu > 2)$$
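A small Python sketch of this marginal posterior and its credible interval (the data are made up; SciPy's Student-t with `loc`/`scale` is assumed):

```python
# Student-t marginal posterior of mu under the uninformative prior, and its 95% interval.
import numpy as np
from scipy.stats import t

x = np.array([4.2, 5.9, 5.1, 3.8, 6.3, 5.5])
N, xbar = x.size, x.mean()
s = x.std(ddof=1)                                      # sample standard deviation
post = t(df=N - 1, loc=xbar, scale=s / np.sqrt(N))     # p(mu | D) = T(mu | xbar, s^2/N, N-1)
lo, hi = post.interval(0.95)                           # exact 95% credible interval
print((lo, hi), (xbar - 2 * s / np.sqrt(N), xbar + 2 * s / np.sqrt(N)))  # vs. the +-2 SE rule
```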

Page 16: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Bayesian T-Test We want to test the hypothesis μ ≠ μ0 for some known value μ0 (often

0), given xi ∼ N(μ, σ2). This is called a two-sided, one-sample t-test.

One can check if μ0 Є I0.95(μ|D). If it is not, then we can be 95% sure

that μ ≠ μ0.

A more common scenario is when we want to test if two paired

samples have the same mean. More precisely, suppose yi ∼ N(μ1,

σ2) and zi ∼ N(μ2, σ2). We want to determine if μ = μ1 − μ2 > 0, using

xi = yi − zi as our data. We can evaluate this as follows:

This is called a one-sided, paired t-test. To calculate the posterior,

we must specify a prior. Suppose we use an uninformative prior. As

shown earlier, the posterior marginal is:

16

$$p(\mu > \mu_0 \mid \mathcal{D}) = \int_{\mu_0}^{\infty} p(\mu \mid \mathcal{D}) \, d\mu$$

$$p(\mu \mid \mathcal{D}) = \mathcal{T}\Big( \mu \,\Big|\, \bar{x}, \frac{s^2}{N}, N - 1 \Big)$$

Page 17: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Bayesian T-Test

We define the t statistic as

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{N}}$$

The denominator is the standard error of the mean. With this definition, note that

$$p(\mu > \mu_0 \mid \mathcal{D}) = \int_{\mu_0}^{\infty} p(\mu \mid \mathcal{D}) \, d\mu = 1 - F_{N-1}(-t) = F_{N-1}(t)$$

where $F_\nu(t)$ is the CDF of the standard Student's T distribution $\mathcal{T}(0, 1, \nu)$.

Note: the posterior of μ has the form $\frac{\mu - \bar{x}}{s/\sqrt{N}} \,\big|\, \mathcal{D} \sim \mathcal{T}_{N-1}$. From a frequentist point of view, this is identical to the sampling distribution of the MLE: $\frac{\bar{x} - \mu}{s/\sqrt{N}} \sim \mathcal{T}_{N-1}$. Thus the one-sided p-value of the frequentist test, $1 - F_{N-1}(t)$, is numerically the same as the Bayesian estimate $p(\mu \leq \mu_0 \mid \mathcal{D})$. The interpretation of the results in the two approaches is of course very different.

Box, G. and G. Tiao (1973). Bayesian inference in statistical analysis. Addison-Wesley.

Gonen, M., W. Johnson, Y. Lu, and P. Westfall (2005, August). The Bayesian Two-Sample t Test. The American

Statistician 59(3), 252–257.

Rouder, J., P. Speckman, D. Sun, and R. Morey (2009). Bayesian t tests for accepting and rejecting the null

hypothesis. Pyschonomic Bulletin & Review 16(2), 225–237.

bayesTtestDemo

from PMTK
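A Python sketch of this one-sided test (not the PMTK demo; the paired data y, z are made-up values):

```python
# One-sided "Bayesian t-test": p(mu > mu0 | D) for paired differences x = y - z.
import numpy as np
from scipy.stats import t

y = np.array([5.1, 4.8, 6.0, 5.6, 5.3])
z = np.array([4.7, 4.9, 5.2, 5.1, 4.8])
x = y - z                                    # paired differences, x_i ~ N(mu, sigma^2)
N, xbar, s = x.size, x.mean(), x.std(ddof=1)
mu0 = 0.0
t_stat = (xbar - mu0) / (s / np.sqrt(N))
prob = t.cdf(t_stat, df=N - 1)               # p(mu > mu0 | D) = F_{N-1}(t)
print(t_stat, prob, 1.0 - prob)              # 1 - prob equals the frequentist one-sided p-value
```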

Page 18: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Sensor Fusion with Unknown Parameters

Let us consider a sensor fusion problem where the precision of each measurement device is unknown.

The unknown precision case turns out to give qualitatively different results from the earlier case of known precision, yielding a potentially multi-modal posterior.

Suppose we want to pool data from two sources x and y to estimate some quantity μ ∈ R, but the reliability of the sources is unknown. Specifically, suppose we have two different measurement devices:

$$x_i \mid \mu \sim \mathcal{N}(\mu, \lambda_x^{-1}), \qquad y_i \mid \mu \sim \mathcal{N}(\mu, \lambda_y^{-1})$$

We make two independent measurements with each device, which turn out to be $x_1 = 1.1$, $x_2 = 1.9$, $y_1 = 2.9$, $y_2 = 4.1$.

We use a non-informative prior for μ, $p(\mu) \propto 1$, which we can emulate using a Gaussian with vanishing precision, $p(\mu) = \mathcal{N}(\mu \mid m_0 = 0, \lambda_0^{-1})$ with $\lambda_0 \to 0$.

Minka, T. (2000). Estimating a Dirichlet distribution. Technical report, MIT.

Page 19: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Sensor Fusion with Unknown Parameters For known precisions, the posterior is Gaussian:

However, the measurement precisions are not known. Initially we will

estimate them by MLE. The log-likelihood is given by

The MLE is obtained by solving the following coupled equations:

19

$$p(\mu \mid \mathcal{D}, \lambda_x, \lambda_y) = \mathcal{N}(\mu \mid m_N, \lambda_N^{-1}), \qquad \lambda_N = \lambda_0 + N_x \lambda_x + N_y \lambda_y, \qquad m_N = \frac{N_x \lambda_x \bar{x} + N_y \lambda_y \bar{y}}{N_x \lambda_x + N_y \lambda_y}$$
$$N_x = N_y = 2, \qquad \bar{x} = 1.5, \qquad \bar{y} = 3.5$$

$$\ell(\mu, \lambda_x, \lambda_y) = \frac{N_x}{2} \log \lambda_x + \frac{N_y}{2} \log \lambda_y - \frac{\lambda_x}{2} \sum_{i=1}^{N_x} (x_i - \mu)^2 - \frac{\lambda_y}{2} \sum_{i=1}^{N_y} (y_i - \mu)^2$$

$$\hat{\mu} = \frac{N_x \hat{\lambda}_x \bar{x} + N_y \hat{\lambda}_y \bar{y}}{N_x \hat{\lambda}_x + N_y \hat{\lambda}_y}, \qquad \frac{1}{\hat{\lambda}_x} = \frac{1}{N_x} \sum_{i=1}^{N_x} (x_i - \hat{\mu})^2, \qquad \frac{1}{\hat{\lambda}_y} = \frac{1}{N_y} \sum_{i=1}^{N_y} (y_i - \hat{\mu})^2$$

Page 20: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Sensor Fusion with Unknown Parameters

We solve these equations by iteration, starting from the per-sensor sample variances

$$\frac{1}{\hat{\lambda}_x} = \frac{1}{N_x} \sum_{i=1}^{N_x} (x_i - \bar{x})^2 = 0.16, \qquad \frac{1}{\hat{\lambda}_y} = \frac{1}{N_y} \sum_{i=1}^{N_y} (y_i - \bar{y})^2 = 0.36$$

Upon convergence, $\hat{\mu} = 1.5788$, $1/\hat{\lambda}_x = 0.1662$, $1/\hat{\lambda}_y = 4.0509$, and

$$p(\mu \mid \mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y) = \mathcal{N}(\mu \mid 1.5788, 0.0798)$$

The plug-in approximation to the posterior is plotted next.
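A short Python sketch of this fixed-point iteration using the slide's data; the values it converges to should be close to those quoted above.

```python
# Fixed-point iteration for the MLE of (mu, lambda_x, lambda_y) in the sensor fusion example.
import numpy as np

x = np.array([1.1, 1.9]); y = np.array([2.9, 4.1])
vx, vy = x.var(), y.var()             # start from the per-sensor sample variances (0.16, 0.36)
for _ in range(100):
    lam_x, lam_y = 1.0 / vx, 1.0 / vy
    mu = (lam_x * x.sum() + lam_y * y.sum()) / (lam_x * x.size + lam_y * y.size)
    vx = np.mean((x - mu) ** 2)       # 1 / lambda_x
    vy = np.mean((y - mu) ** 2)       # 1 / lambda_y
post_var = 1.0 / (x.size / vx + y.size / vy)
print(mu, vx, vy, post_var)           # mu_hat, 1/lam_x, 1/lam_y, plug-in posterior variance
```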

Page 21: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Plug-in Approximation to the Posterior Posterior for μ. Plug-in approximation.

21

[Figure: plug-in approximation to the posterior p(μ | D, λ̂x, λ̂y), a narrow Gaussian centered near μ ≈ 1.58. sensorFusionUnknownPrec from Kevin Murphy's PMTK]

This weights each sensor according to its estimated precision. Since sensor y was estimated to be much less reliable than sensor x, we have

$$\mathbb{E}[\mu \mid \mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y] \approx \bar{x}$$

Effectively, with this approximation we ignore the y sensor.

Page 22: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Sensor Fusion with Unknown Parameters

Now we will adopt a Bayesian approach and integrate out the unknown precisions, rather than trying to estimate them. That is, we compute

$$p(\mu \mid \mathcal{D}) \propto p(\mu) \Big[ \int p(\mathcal{D}_x \mid \mu, \lambda_x) \, p(\lambda_x \mid \mu) \, d\lambda_x \Big] \Big[ \int p(\mathcal{D}_y \mid \mu, \lambda_y) \, p(\lambda_y \mid \mu) \, d\lambda_y \Big]$$

We will use uninformative Jeffreys priors, $p(\mu) \propto 1$, $p(\lambda_x \mid \mu) \propto 1/\lambda_x$ and $p(\lambda_y \mid \mu) \propto 1/\lambda_y$.

The first integral becomes (using the likelihood derived earlier):

$$I_x = \int p(\mathcal{D}_x \mid \mu, \lambda_x) \, p(\lambda_x \mid \mu) \, d\lambda_x \propto \int \lambda_x^{N_x/2 - 1} \exp\Big( -\frac{N_x \lambda_x}{2} \big[ (\bar{x} - \mu)^2 + s_x^2 \big] \Big) \, d\lambda_x$$

For $N_x = 2$, it simplifies via the normalizing factor of a Gamma distribution,

$$\int \lambda^{a - 1} e^{-b\lambda} \, d\lambda = \frac{\Gamma(a)}{b^a}, \qquad \mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)} \lambda^{a-1} e^{-b\lambda},$$

to

$$I_x \propto \frac{1}{(\bar{x} - \mu)^2 + s_x^2}$$

Page 23: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Sensor Fusion with Unknown Parameters

Finally:

$$p(\mu \mid \mathcal{D}) \propto \frac{1}{(\bar{x} - \mu)^2 + s_x^2} \cdot \frac{1}{(\bar{y} - \mu)^2 + s_y^2}$$

The exact posterior is plotted below.

[Figure: exact posterior p(μ | D), a bimodal density over μ ∈ [−2, 6]. sensorFusionUnknownPrec from Kevin Murphy's PMTK]

The posterior has two modes, at $\mu \approx \bar{x} = 1.5$ and $\mu \approx \bar{y} = 3.5$. The weight of the 1st mode is larger, since the data from the x sensor agree more with each other. It seems likely that the x sensor is the reliable one.

The Bayesian solution makes it possible that the y sensor is the more reliable one; from two measurements, we cannot tell, and choosing just the x sensor, as the plug-in approximation does, results in overconfidence (a narrow posterior).
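A Python sketch that evaluates this exact (unnormalized) posterior on a grid; the grid limits and normalization are illustrative choices.

```python
# Exact posterior p(mu | D) after integrating out the unknown precisions (Jeffreys priors).
import numpy as np

x = np.array([1.1, 1.9]); y = np.array([2.9, 4.1])
sx2, sy2 = x.var(), y.var()                           # per-sensor sample variances
mus = np.linspace(-2.0, 6.0, 801)
post = 1.0 / (((x.mean() - mus) ** 2 + sx2) * ((y.mean() - mus) ** 2 + sy2))
post /= post.sum() * (mus[1] - mus[0])                # normalize numerically
print(mus[np.argmax(post)])                           # global mode, near xbar = 1.5
```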

Page 24: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Multivariate Gaussian:Posterior of m

24

Consider a known covariance Σ and a Gaussian prior $\mathcal{N}(\mathbf{m}_0, \boldsymbol{\Sigma}_0)$, with the posterior for the unknown mean μ taking the form:

$$p(\boldsymbol{\mu} \mid \mathbf{X}) \propto p(\boldsymbol{\mu}) \prod_{n=1}^{N} p(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$$

This posterior is the exponential of a quadratic in μ:

$$-\frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}) - \frac{1}{2} (\boldsymbol{\mu} - \mathbf{m}_0)^T \boldsymbol{\Sigma}_0^{-1} (\boldsymbol{\mu} - \mathbf{m}_0) = -\frac{1}{2} \boldsymbol{\mu}^T \big( \boldsymbol{\Sigma}_0^{-1} + N \boldsymbol{\Sigma}^{-1} \big) \boldsymbol{\mu} + \boldsymbol{\mu}^T \Big( \boldsymbol{\Sigma}_0^{-1} \mathbf{m}_0 + \boldsymbol{\Sigma}^{-1} \sum_{n=1}^{N} \mathbf{x}_n \Big) + \text{const}$$

So the variance and mean of the posterior are:

$$p(\boldsymbol{\mu} \mid \mathbf{X}) = \mathcal{N}(\boldsymbol{\mu} \mid \mathbf{m}_N, \boldsymbol{\Sigma}_N), \qquad \boldsymbol{\Sigma}_N^{-1} = \boldsymbol{\Sigma}_0^{-1} + N \boldsymbol{\Sigma}^{-1}, \qquad \mathbf{m}_N = \boldsymbol{\Sigma}_N \big( \boldsymbol{\Sigma}^{-1} (N \bar{\mathbf{x}}) + \boldsymbol{\Sigma}_0^{-1} \mathbf{m}_0 \big), \qquad \boldsymbol{\mu}_{ML} = \bar{\mathbf{x}}$$

For an uninformative prior, $\boldsymbol{\Sigma}_0 \to \infty \mathbf{I}$:

$$p(\boldsymbol{\mu} \mid \mathbf{X}) = \mathcal{N}\Big( \boldsymbol{\mu} \,\Big|\, \boldsymbol{\mu}_{ML}, \frac{1}{N} \boldsymbol{\Sigma} \Big)$$
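A Python sketch of this multivariate posterior update (prior parameters and data are illustrative):

```python
# Gaussian posterior over mu with known covariance Sigma.
import numpy as np

def posterior_mean_gaussian(X, Sigma, m0, Sigma0):
    X = np.atleast_2d(X)
    N, xbar = X.shape[0], X.mean(axis=0)
    prec = np.linalg.inv(Sigma0) + N * np.linalg.inv(Sigma)    # Sigma_N^{-1}
    SigmaN = np.linalg.inv(prec)
    mN = SigmaN @ (np.linalg.inv(Sigma) @ (N * xbar) + np.linalg.inv(Sigma0) @ m0)
    return mN, SigmaN

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
X = rng.multivariate_normal([1.0, -1.0], Sigma, size=20)
mN, SigmaN = posterior_mean_gaussian(X, Sigma, m0=np.zeros(2), Sigma0=10.0 * np.eye(2))
print(mN, SigmaN)
```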

Page 25: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Posterior Distribution of Precision Λ

We now discuss how to compute $p(\boldsymbol{\Lambda} \mid \mathcal{D}, \boldsymbol{\mu})$. The likelihood has the form

$$p(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}) \propto |\boldsymbol{\Lambda}|^{N/2} \exp\Big( -\frac{1}{2} \mathrm{Tr}(\mathbf{S}_\mu \boldsymbol{\Lambda}) \Big), \qquad \mathbf{S}_\mu = \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^T$$

The corresponding conjugate prior is known as the Wishart distribution

$$\mathrm{Wi}(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu) = B(\mathbf{W}, \nu) \, |\boldsymbol{\Lambda}|^{(\nu - D - 1)/2} \exp\Big( -\frac{1}{2} \mathrm{Tr}(\mathbf{W}^{-1} \boldsymbol{\Lambda}) \Big), \qquad \nu \geq D \; (\text{dof}), \quad \mathbf{W} \text{ a } D \times D \text{ sym. pos. def. matrix}$$
$$B(\mathbf{W}, \nu) = |\mathbf{W}|^{-\nu/2} \Big( 2^{\nu D/2} \, \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\Big( \frac{\nu + 1 - i}{2} \Big) \Big)^{-1}$$

(the product of Gamma functions, together with the factor $\pi^{D(D-1)/4}$, is the multivariate Gamma function $\Gamma_D(\nu/2)$).

The following slide shows the similarities of this distribution with the Gamma prior for λ used earlier for univariate Gaussian distributions. For D = 1:

$$\mathrm{Wi}(\lambda \mid w, \nu) = \mathrm{Gamma}\Big( \lambda \,\Big|\, \frac{\nu}{2}, \frac{1}{2w} \Big)$$

Page 26: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras) 26

Wishart Distribution

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004

$$\text{mode}[\boldsymbol{\Lambda}] = (\nu - k - 1)\, \mathbf{S} \qquad \text{for } \nu \geq k + 1$$

$$\text{For } \mathbf{x}_i \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}), \; i = 1, \dots, N: \quad \mathbf{S} = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T \;\; (\text{scatter matrix}) \; \sim \; \mathrm{Wi}(\boldsymbol{\Sigma}, N)$$

Page 27: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Posterior Distribution of Σ

We now similarly discuss how to compute $p(\boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu})$. The likelihood has the form

$$p(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma}|^{-N/2} \exp\Big( -\frac{1}{2} \mathrm{Tr}(\mathbf{S}_\mu \boldsymbol{\Sigma}^{-1}) \Big), \qquad \mathbf{S}_\mu = \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^T$$

The corresponding conjugate prior is known as the inverse Wishart distribution

$$\mathrm{InvWi}(\boldsymbol{\Sigma} \mid \mathbf{S}_0, \nu_0) \propto |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2} \exp\Big( -\frac{1}{2} \mathrm{Tr}(\mathbf{S}_0 \boldsymbol{\Sigma}^{-1}) \Big), \qquad \boldsymbol{\Sigma} \text{ sym. pos. def.}$$

$\nu_0 + D + 1$ controls the strength of the prior, and hence plays a role analogous to the sample size N. The prior scatter matrix is here $\mathbf{S}_0$.

Note: there are many parametrizations of the InvWi leading to different reported DOF. We here follow the notation of Gelman et al., with the same ν for both Wi and InvWi in the relation

$$\text{If } \boldsymbol{\Lambda} \sim \mathrm{Wi}(\mathbf{S}, \nu), \text{ then } \boldsymbol{\Sigma} = \boldsymbol{\Lambda}^{-1} \sim \mathrm{InvWi}(\mathbf{S}^{-1}, \nu)$$

Steven W. Nydick, The Wishart and Inverse Wishart Distributions, Report, 2012.

A. Gelman, J. Carlin, H. Stern and D. Rubin, Bayesian Data Analysis, 2004

Page 28: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras) 28

Inverse Wishart Distribution

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004

$$\text{mode}[\boldsymbol{\Sigma}] = \frac{\mathbf{S}}{\nu + k + 1}$$

$$\text{For } k = 1: \quad \mathrm{InvWi}(\sigma^2 \mid S, \nu) = \mathrm{InvGamma}\Big( \sigma^2 \,\Big|\, \frac{\nu}{2}, \frac{S}{2} \Big)$$

$$\text{If } \lambda \sim \mathrm{Gamma}(a, b), \text{ then } \frac{1}{\lambda} \sim \mathrm{InvGamma}(a, b); \qquad \text{if } \mathbf{S} \sim \mathrm{Wi}(\boldsymbol{\Sigma}, \nu), \text{ then } \mathbf{S}^{-1} \sim \mathrm{InvWi}(\boldsymbol{\Sigma}^{-1}, \nu)$$

Page 29: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Posterior Distribution of Σ

Multiplying the likelihood and prior, we find that the posterior is also inverse Wishart:

$$p(\boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}) \propto |\boldsymbol{\Sigma}|^{-N/2} \exp\Big( -\frac{1}{2} \mathrm{Tr}(\mathbf{S}_\mu \boldsymbol{\Sigma}^{-1}) \Big) \, |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2} \exp\Big( -\frac{1}{2} \mathrm{Tr}(\mathbf{S}_0 \boldsymbol{\Sigma}^{-1}) \Big)$$
$$= |\boldsymbol{\Sigma}|^{-(\nu_0 + N + D + 1)/2} \exp\Big( -\frac{1}{2} \mathrm{Tr}\big[ (\mathbf{S}_0 + \mathbf{S}_\mu) \boldsymbol{\Sigma}^{-1} \big] \Big)$$
$$\Rightarrow \quad p(\boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}) = \mathrm{InvWi}(\boldsymbol{\Sigma} \mid \mathbf{S}_N, \nu_N), \qquad \nu_N = \nu_0 + N, \qquad \mathbf{S}_N = \mathbf{S}_0 + \mathbf{S}_\mu$$

The posterior strength $\nu_N$ is the prior strength $\nu_0$ plus the number of observations N.

The posterior scatter matrix $\mathbf{S}_N$ is the prior scatter matrix $\mathbf{S}_0$ plus the data scatter matrix $\mathbf{S}_\mu$.
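A Python sketch of this inverse-Wishart update (the prior values and simulated data are illustrative; SciPy's `invwishart(df, scale)` parameterization is assumed):

```python
# Conjugate inverse-Wishart update for Sigma with known mean mu.
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(2)
mu_true = np.array([0.0, 0.0])
Sigma_true = np.array([[2.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=200)

nu0, S0 = 4, np.eye(2)                        # weak prior
S_mu = (X - mu_true).T @ (X - mu_true)        # scatter matrix around the known mean
nuN, SN = nu0 + X.shape[0], S0 + S_mu
D = X.shape[1]
print(SN / (nuN - D - 1))                     # posterior mean of Sigma, close to Sigma_true
sample = invwishart(df=nuN, scale=SN).rvs(random_state=1)  # one posterior draw of Sigma
```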

Page 30: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

MAP Estimation

From the mode of the inverse Wishart and the posterior

$$p(\boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}) = \mathrm{InvWi}(\boldsymbol{\Sigma} \mid \mathbf{S}_N, \nu_N), \qquad \nu_N = \nu_0 + N, \qquad \mathbf{S}_N = \mathbf{S}_0 + \mathbf{S}_\mu,$$

we conclude that the MAP estimate is:

$$\hat{\boldsymbol{\Sigma}}_{MAP} = \frac{\mathbf{S}_N}{\nu_N + D + 1} = \frac{\mathbf{S}_0 + \mathbf{S}_\mu}{N_0 + N}, \qquad N_0 = \nu_0 + D + 1$$

For an improper prior, $\mathbf{S}_0 = \mathbf{0}$ and $N_0 = 0$:

$$\hat{\boldsymbol{\Sigma}}_{MAP} = \frac{\mathbf{S}_\mu}{N} = \hat{\boldsymbol{\Sigma}}_{MLE}$$

Consider now the use of a proper informative prior, which is necessary whenever D/N is large. Let $\lambda = \frac{N_0}{N_0 + N}$. Then we can rewrite the MAP estimate as a convex combination of the prior mode $\boldsymbol{\Sigma}_0 = \mathbf{S}_0 / N_0$ and the MLE:

$$\hat{\boldsymbol{\Sigma}}_{MAP} = \frac{\mathbf{S}_0 + \mathbf{S}_{\bar{x}}}{N_0 + N} = \frac{N_0}{N_0 + N} \frac{\mathbf{S}_0}{N_0} + \frac{N}{N_0 + N} \frac{\mathbf{S}_{\bar{x}}}{N} = \lambda \boldsymbol{\Sigma}_0 + (1 - \lambda) \hat{\boldsymbol{\Sigma}}_{MLE}$$

where λ controls the amount of shrinkage towards the prior.

Page 31: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

MAP Estimation

Set λ by cross validation. Alternatively, we can use the closed-form

formula provided in (Ledoit & Wolf and Schaefer & Strimmer), which

is the optimal frequentist estimate if we use squared loss.

This is not a natural loss function for covariance matrices since it

ignores the positive definite constraint, but it results in a simple

estimator (see PMTK function shrinkcov).

For the prior covariance matrix, S0, it is common to use the following

(data dependent) prior:

31

$$\hat{\boldsymbol{\Sigma}}_{MAP} = \lambda \boldsymbol{\Sigma}_0 + (1 - \lambda) \hat{\boldsymbol{\Sigma}}_{MLE}, \qquad \boldsymbol{\Sigma}_0 = \frac{\mathbf{S}_0}{N_0} \; (\text{prior mode})$$

$$\boldsymbol{\Sigma}_0 = \mathrm{diag}(\hat{\boldsymbol{\Sigma}}_{MLE})$$

Ledoit, O. and M. Wolf (2004b). A well conditioned estimator for large dimensional covariance matrices. J. of

Multivariate Analysis 88(2), 365– 411.

Ledoit, O. and M. Wolf (2004a). Honey, I Shrunk the Sample Covariance Matrix. J. of Portfolio Management 31(1).

Schaefer, J. and K. Strimmer (2005). A shrinkage approach to largescale covariance matrix estimation and

implications for functional genomics. Statist. Appl. Genet. Mol. Biol 4(32).

Page 32: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

MAP Shrinkage Estimation

In this case, the MAP estimate is:

Thus we see that the diagonal entries are equal to their MLE

estimates, and the off diagonal elements are “shrunk” somewhat

towards 0 (shrinkage estimation, or regularized estimation).

The benefits of MAP estimation are illustrated next. We consider

fitting a 50-dim Gaussian to N = 100, N = 50 and N = 25 data points.

We see that the MAP estimate is always well-conditioned, unlike the MLE.

In particular, the eigenvalue spectrum of the MAP estimate is much

closer to that of the true matrix than the MLE’s.

The eigenvectors, however, are unaffected.

32

$$\hat{\Sigma}_{MAP}(i, j) = \begin{cases} \hat{\Sigma}_{MLE}(i, j) & \text{if } i = j \\ (1 - \lambda)\, \hat{\Sigma}_{MLE}(i, j) & \text{otherwise} \end{cases}, \qquad \boldsymbol{\Sigma}_0 = \mathrm{diag}(\hat{\boldsymbol{\Sigma}}_{MLE})$$
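A Python sketch of this diagonal-target shrinkage estimator (not the PMTK shrinkcov function; λ is fixed by hand here rather than chosen by cross-validation or the Ledoit-Wolf formula):

```python
# MAP shrinkage estimate of a covariance matrix with a diagonal prior target.
import numpy as np

def shrink_cov(X, lam):
    """Shrink off-diagonal entries of the MLE covariance toward zero by a factor (1 - lam)."""
    S_mle = np.cov(X, rowvar=False, bias=True)     # MLE of the covariance
    S_map = (1.0 - lam) * S_mle
    np.fill_diagonal(S_map, np.diag(S_mle))        # diagonal stays at its MLE value
    return S_map

rng = np.random.default_rng(3)
X = rng.standard_normal((25, 50))                  # N = 25 samples in D = 50 dimensions
print(np.linalg.cond(np.cov(X, rowvar=False, bias=True)),  # ill-conditioned MLE
      np.linalg.cond(shrink_cov(X, lam=0.9)))               # well-conditioned MAP estimate
```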

Page 33: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Posterior Distribution of Σ

Estimating a covariance matrix in D = 50 dimensions using N ∈ {100, 50, 25} samples. Eigenvalues in descending order for the true covariance matrix (solid black), the MLE (dotted blue) and the MAP estimate (dashed red) with λ = 0.9. The condition number of each matrix is also given in the legend.

33

shrinkcovDemo

from PMTK

[Figure: eigenvalue spectra for N = 100, 50 and 25 with D = 50. Legend condition numbers: N = 100: true k = 10.00, MLE k = 71, MAP k = 8.62; N = 50: true k = 10.00, MLE k = 4.1e+16, MAP k = 8.85; N = 25: true k = 10.00, MLE k = 3.7e+17, MAP k = 21.09.]

Page 34: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Inference for Both m and L

34

Suppose $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N \sim$ (i.i.d.) $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$. We do not know μ or Λ.

When both sets of parameters are unknown, a conjugate family of priors is one in which

$$\boldsymbol{\Lambda} \sim \mathrm{Wi}(\boldsymbol{\Lambda} \mid \mathbf{T}, \nu) \qquad \text{and} \qquad \boldsymbol{\mu} \mid \boldsymbol{\Lambda} \sim \mathcal{N}\big( \mathbf{m}_0, (\kappa \boldsymbol{\Lambda})^{-1} \big)$$

The Wishart distribution is the multivariate analog of the Gamma distribution (extension to positive definite matrices). If matrix U has the Wishart distribution, then U⁻¹ has the inverse-Wishart distribution. The resulting $p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathbf{m}_0, \kappa, \mathbf{T}, \nu)$ is the Gaussian-Wishart distribution.

The quantity ν is a positive scalar, while T is a positive definite matrix. They play roles analogous to those played by a and b, respectively, in the Gamma distribution.

The other parameters of the prior are the mean vector $\mathbf{m}_0$ and κ, the latter of which represents the ``a priori number of observations''.

Page 35: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Inference for Both m and L

35

The likelihood and prior distributions are given explicitly as:

$$p(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}) = (2\pi)^{-ND/2} |\boldsymbol{\Lambda}|^{N/2} \exp\Big( -\frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Lambda} (\mathbf{x}_n - \boldsymbol{\mu}) \Big)$$

$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda}) = \mathrm{NWi}(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathbf{m}_0, \kappa_0, \nu_0, \mathbf{T}_0) = \mathcal{N}\big( \boldsymbol{\mu} \mid \mathbf{m}_0, (\kappa_0 \boldsymbol{\Lambda})^{-1} \big) \, \mathrm{Wi}(\boldsymbol{\Lambda} \mid \mathbf{T}_0, \nu_0)$$
$$= \frac{1}{Z_0} |\boldsymbol{\Lambda}|^{1/2} \exp\Big( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \mathbf{m}_0)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \mathbf{m}_0) \Big) \, |\boldsymbol{\Lambda}|^{(\nu_0 - D - 1)/2} \exp\Big( -\frac{1}{2} \mathrm{Tr}(\mathbf{T}_0^{-1} \boldsymbol{\Lambda}) \Big)$$
$$Z_0 = \Big( \frac{2\pi}{\kappa_0} \Big)^{D/2} 2^{\nu_0 D/2} \, |\mathbf{T}_0|^{\nu_0/2} \, \Gamma_D\Big( \frac{\nu_0}{2} \Big), \qquad \Gamma_D(\cdot) \text{ the multivariate Gamma function}$$

Combining them gives the following posterior:

$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) = \mathrm{NWi}(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathbf{m}_N, \kappa_N, \nu_N, \mathbf{T}_N)$$
$$\mathbf{m}_N = \frac{\kappa_0 \mathbf{m}_0 + N \bar{\mathbf{x}}}{\kappa_0 + N}, \qquad \kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N$$
$$\mathbf{T}_N^{-1} = \mathbf{T}_0^{-1} + \mathbf{S}_{\bar{x}} + \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\mathbf{x}} - \mathbf{m}_0)(\bar{\mathbf{x}} - \mathbf{m}_0)^T, \qquad \mathbf{S}_{\bar{x}} = \sum_{i=1}^{N} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$

M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 8)

Page 36: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Inference for Both m and L

36

The posterior marginals can be derived as:

$$p(\boldsymbol{\Lambda} \mid \mathcal{D}) = \mathrm{Wi}(\boldsymbol{\Lambda} \mid \mathbf{T}_N, \nu_N), \qquad p(\boldsymbol{\mu} \mid \mathcal{D}) = \mathcal{T}\Big( \boldsymbol{\mu} \,\Big|\, \mathbf{m}_N, \frac{\mathbf{T}_N^{-1}}{\kappa_N (\nu_N - D + 1)}, \nu_N - D + 1 \Big)$$

One can also derive the MAP estimates as:

$$(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Lambda}}) = \arg\max_{\boldsymbol{\mu}, \boldsymbol{\Lambda}} p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}), \qquad \hat{\boldsymbol{\mu}} = \frac{\kappa_0 \mathbf{m}_0 + \sum_{i=1}^{N} \mathbf{x}_i}{\kappa_0 + N}, \qquad \hat{\boldsymbol{\Lambda}}^{-1} = \frac{\mathbf{T}_0^{-1} + \sum_{i=1}^{N} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T + \kappa_0 (\hat{\boldsymbol{\mu}} - \mathbf{m}_0)(\hat{\boldsymbol{\mu}} - \mathbf{m}_0)^T}{\nu_0 + N - D}$$

These are reduced to the MLE by setting $\kappa_0 = 0$, $\nu_0 = D$, $\mathbf{T}_0^{-1} = \mathbf{0}$.

Page 37: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Inference for Both m and L

37

The posterior predictive is:

$$p(\mathbf{x} \mid \mathcal{D}) = \mathcal{T}\Big( \mathbf{x} \,\Big|\, \mathbf{m}_N, \frac{(\kappa_N + 1)\, \mathbf{T}_N^{-1}}{\kappa_N (\nu_N - D + 1)}, \nu_N - D + 1 \Big)$$

The marginal likelihood can be computed as a ratio of normalization constants:

$$p(\mathcal{D}) = \frac{Z_N}{Z_0} \frac{1}{(2\pi)^{ND/2}} = \frac{1}{\pi^{ND/2}} \, \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)} \, \frac{|\mathbf{T}_N|^{\nu_N/2}}{|\mathbf{T}_0|^{\nu_0/2}} \, \Big( \frac{\kappa_0}{\kappa_N} \Big)^{D/2}$$

A useful reference analysis considers $\mathbf{m}_0 = \mathbf{0}$, $\kappa_0 = 0$, $\nu_0 = -1$, $\mathbf{T}_0^{-1} = \mathbf{0}$. This results in the following prior:

$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda}) \propto |\boldsymbol{\Lambda}|^{-(D+1)/2}$$

The posterior parameters are simplified as:

$$\mathbf{m}_N = \bar{\mathbf{x}}, \qquad \kappa_N = N, \qquad \nu_N = N - 1, \qquad \mathbf{T}_N^{-1} = \mathbf{S}_{\bar{x}} \equiv \mathbf{S}$$

The posterior marginals and posterior predictive are:

$$p(\boldsymbol{\Lambda} \mid \mathcal{D}) = \mathrm{Wi}(\boldsymbol{\Lambda} \mid \mathbf{S}^{-1}, N - 1), \qquad p(\boldsymbol{\mu} \mid \mathcal{D}) = \mathcal{T}\Big( \boldsymbol{\mu} \,\Big|\, \bar{\mathbf{x}}, \frac{\mathbf{S}}{N(N - D)}, N - D \Big), \qquad p(\mathbf{x} \mid \mathcal{D}) = \mathcal{T}\Big( \mathbf{x} \,\Big|\, \bar{\mathbf{x}}, \frac{(N + 1)\, \mathbf{S}}{N(N - D)}, N - D \Big)$$

Page 38: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Inference for m and S

38

For the case of the multivariate Gaussian of a D-dimensional variable x, $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$, with both the mean and covariance unknown, the likelihood is of the form:

$$p(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-ND/2} |\boldsymbol{\Sigma}|^{-N/2} \exp\Big( -\frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}) \Big)$$
$$= (2\pi)^{-ND/2} |\boldsymbol{\Sigma}|^{-N/2} \exp\Big( -\frac{N}{2} (\boldsymbol{\mu} - \bar{\mathbf{x}})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \bar{\mathbf{x}}) \Big) \exp\Big( -\frac{1}{2} \mathrm{Tr}(\mathbf{S}_{\bar{x}} \boldsymbol{\Sigma}^{-1}) \Big)$$

The conjugate prior is given as the product of a Gaussian and the inverse Wishart distribution:

$$\mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{m}_0, \kappa_0, \nu_0, \mathbf{S}_0) = \mathcal{N}\Big( \boldsymbol{\mu} \,\Big|\, \mathbf{m}_0, \frac{1}{\kappa_0} \boldsymbol{\Sigma} \Big) \, \mathrm{InvWi}(\boldsymbol{\Sigma} \mid \mathbf{S}_0, \nu_0)$$
$$= \frac{1}{Z_{NIW}} |\boldsymbol{\Sigma}|^{-1/2} \exp\Big( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \mathbf{m}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \mathbf{m}_0) \Big) \, |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2} \exp\Big( -\frac{1}{2} \mathrm{Tr}(\mathbf{S}_0 \boldsymbol{\Sigma}^{-1}) \Big)$$
$$Z_{NIW} = 2^{\nu_0 D/2} \, \Gamma_D\Big( \frac{\nu_0}{2} \Big) \Big( \frac{2\pi}{\kappa_0} \Big)^{D/2} |\mathbf{S}_0|^{-\nu_0/2}, \qquad \Gamma_D(\cdot) \text{ the multivariate Gamma function}$$

Page 39: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

The Posterior of m and S

39

The posterior is NIW, given as:

$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}) = \mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{m}_N, \kappa_N, \nu_N, \mathbf{S}_N)$$
$$\mathbf{m}_N = \frac{\kappa_0 \mathbf{m}_0 + N \bar{\mathbf{x}}}{\kappa_N}, \qquad \kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N$$
$$\mathbf{S}_N = \mathbf{S}_0 + \mathbf{S}_{\bar{x}} + \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\mathbf{x}} - \mathbf{m}_0)(\bar{\mathbf{x}} - \mathbf{m}_0)^T, \qquad \mathbf{S}_{\bar{x}} = \sum_{i=1}^{N} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$

The posterior mean $\mathbf{m}_N$ is a convex combination of the prior mean and the MLE, with strength $\kappa_0 + N$.

The posterior scatter matrix $\mathbf{S}_N$ is the prior scatter matrix $\mathbf{S}_0$ plus the empirical scatter matrix $\mathbf{S}_{\bar{x}}$ plus an extra term due to the uncertainty in the mean, which creates its own scatter matrix.

Minka, T. (2000). Inferring a Gaussian distribution. Technical report, MIT.

Chipman, H., E. George, and R. McCulloch (2001). The practical implementation of Bayesian Model Selection. Model Selection. IMS Lecture Notes.

Fraley, C. and A. Raftery (2007). Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. J. of Classification 24, 155–181.

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007.
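A Python sketch of this NIW update (prior values and data are illustrative):

```python
# Normal-inverse-Wishart posterior update (m0, kappa0, nu0, S0) -> (mN, kappaN, nuN, SN).
import numpy as np

def niw_posterior(X, m0, kappa0, nu0, S0):
    X = np.atleast_2d(X)
    N, xbar = X.shape[0], X.mean(axis=0)
    kappaN, nuN = kappa0 + N, nu0 + N
    mN = (kappa0 * m0 + N * xbar) / kappaN
    Sxbar = (X - xbar).T @ (X - xbar)                     # empirical scatter matrix
    d = (xbar - m0).reshape(-1, 1)
    SN = S0 + Sxbar + (kappa0 * N / kappaN) * (d @ d.T)   # extra term from uncertainty in mu
    return mN, kappaN, nuN, SN

rng = np.random.default_rng(4)
X = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.5], [0.5, 2.0]], size=100)
mN, kN, nN, SN = niw_posterior(X, m0=np.zeros(2), kappa0=0.01, nu0=4, S0=np.eye(2))
print(mN, SN / (nN + 2 + 2))                              # posterior mode of (mu, Sigma), D = 2
```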


Page 40: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

MAP Estimate of m and S

40

The mode of the joint posterior is:

$$(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}) = \arg\max_{\boldsymbol{\mu}, \boldsymbol{\Sigma}} p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}) = \Big( \mathbf{m}_N, \frac{\mathbf{S}_N}{\nu_N + D + 2} \Big)$$

For $\kappa_0 = 0$, this becomes:

$$(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}) = \Big( \bar{\mathbf{x}}, \frac{\mathbf{S}_0 + \mathbf{S}_{\bar{x}}}{\nu_0 + N + D + 2} \Big)$$

It is interesting to note that this mode is almost the same as the MAP estimate of Σ computed earlier; it differs by 1 in the denominator, as the mode above is the mode of the joint posterior.

Page 41: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

The Posterior Marginals of m and S

41

The posterior marginals for Σ and μ are:

$$p(\boldsymbol{\Sigma} \mid \mathcal{D}) = \int p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}) \, d\boldsymbol{\mu} = \mathrm{IW}(\boldsymbol{\Sigma} \mid \mathbf{S}_N, \nu_N), \qquad \mathbb{E}[\boldsymbol{\Sigma} \mid \mathcal{D}] = \frac{\mathbf{S}_N}{\nu_N - D - 1}, \qquad \hat{\boldsymbol{\Sigma}}_{MAP} = \frac{\mathbf{S}_N}{\nu_N + D + 1}$$

$$p(\boldsymbol{\mu} \mid \mathcal{D}) = \int p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}) \, d\boldsymbol{\Sigma} = \mathcal{T}\Big( \boldsymbol{\mu} \,\Big|\, \mathbf{m}_N, \frac{\mathbf{S}_N}{\kappa_N (\nu_N - D + 1)}, \nu_N - D + 1 \Big)$$

It is not surprising that the last marginal is Student's T, which we know can be represented as a mixture of Gaussians.

To see the connection with the scalar case, note that $\mathbf{S}_N$ plays the role of the posterior sum of squares $\nu_N \sigma_N^2$, so that for D = 1

$$\frac{S_N}{\kappa_N (\nu_N - D + 1)} = \frac{\nu_N \sigma_N^2}{\kappa_N \nu_N} = \frac{\sigma_N^2}{\kappa_N}$$

Page 42: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

The Posterior Predictive of m and S

42

The posterior predictive $p(\mathbf{x} \mid \mathcal{D}) = p(\mathbf{x}, \mathcal{D}) / p(\mathcal{D})$ can be evaluated as:

$$p(\mathbf{x} \mid \mathcal{D}) = \int\!\!\int \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, \mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{m}_N, \kappa_N, \nu_N, \mathbf{S}_N) \, d\boldsymbol{\mu} \, d\boldsymbol{\Sigma} = \mathcal{T}\Big( \mathbf{x} \,\Big|\, \mathbf{m}_N, \frac{(\kappa_N + 1)\, \mathbf{S}_N}{\kappa_N (\nu_N - D + 1)}, \nu_N - D + 1 \Big)$$

Recall that the Student's T distribution has heavier tails than the Gaussian but rapidly becomes Gaussian-like.

To see the connection of the above expression with the scalar case, note that for D = 1:

$$\frac{(\kappa_N + 1)\, S_N}{\kappa_N (\nu_N - D + 1)} = \frac{(\kappa_N + 1) \nu_N \sigma_N^2}{\kappa_N \nu_N} = \frac{1 + \kappa_N}{\kappa_N} \sigma_N^2$$

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 9)

Page 43: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Marginal Likelihood

43

The posterior can be written in terms of the unnormalized prior and likelihood as:

$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}) = \frac{1}{p(\mathcal{D})} \frac{1}{(2\pi)^{ND/2}} \frac{1}{Z_0} \, |\boldsymbol{\Sigma}|^{-1/2} \exp\Big( -\frac{\kappa_N}{2} (\boldsymbol{\mu} - \mathbf{m}_N)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \mathbf{m}_N) \Big) \, |\boldsymbol{\Sigma}|^{-(\nu_N + D + 1)/2} \exp\Big( -\frac{1}{2} \mathrm{Tr}(\mathbf{S}_N \boldsymbol{\Sigma}^{-1}) \Big)$$

The marginal likelihood is then the ratio of the posterior and prior normalization constants:

$$p(\mathcal{D}) = \frac{Z_N}{Z_0} \frac{1}{(2\pi)^{ND/2}} = \frac{1}{\pi^{ND/2}} \, \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)} \, \frac{|\mathbf{S}_0|^{\nu_0/2}}{|\mathbf{S}_N|^{\nu_N/2}} \, \Big( \frac{\kappa_0}{\kappa_N} \Big)^{D/2}$$

Note that for D = 1, this reduces to the familiar equation

$$p(\mathcal{D}) = \frac{1}{\pi^{N/2}} \, \frac{\Gamma(\nu_N/2)}{\Gamma(\nu_0/2)} \, \frac{(\nu_0 \sigma_0^2)^{\nu_0/2}}{(\nu_N \sigma_N^2)^{\nu_N/2}} \, \sqrt{\frac{\kappa_0}{\kappa_N}}$$

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (see the calculation of the marginal likelihood for the 1D analysis of the normal-inverse-chi-squared prior in Section 5).

Page 44: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Non Informative Prior

44

The uninformative Jeffreys prior is $p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma}|^{-(D+1)/2}$. This is obtained in the limit $\mathbf{m}_0 = \mathbf{0}$, $\kappa_0 \to 0$, $\nu_0 \to -1$, $\mathbf{S}_0 \to \mathbf{0}$.

In this case, we have:

$$\mathbf{m}_N = \bar{\mathbf{x}}, \qquad \kappa_N = N, \qquad \nu_N = N - 1, \qquad \mathbf{S}_N = \mathbf{S}_{\bar{x}} = \sum_{i=1}^{N} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$

The posterior marginals are then given as:

$$p(\boldsymbol{\Sigma} \mid \mathcal{D}) = \mathrm{IW}(\boldsymbol{\Sigma} \mid \mathbf{S}_{\bar{x}}, N - 1), \qquad p(\boldsymbol{\mu} \mid \mathcal{D}) = \mathcal{T}\Big( \boldsymbol{\mu} \,\Big|\, \bar{\mathbf{x}}, \frac{\mathbf{S}_{\bar{x}}}{N(N - D)}, N - D \Big)$$

Also the posterior predictive is:

$$p(\mathbf{x} \mid \mathcal{D}) = \mathcal{T}\Big( \mathbf{x} \,\Big|\, \bar{\mathbf{x}}, \frac{(N + 1)\, \mathbf{S}_{\bar{x}}}{N(N - D)}, N - D \Big)$$

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004 (pp. 88)

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (see Section 9)

Page 45: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Non Informative Prior

45

Based on the report of Minka below, the uninformative prior should instead be

$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \lim_{k \to 0} \mathcal{N}\Big( \boldsymbol{\mu} \,\Big|\, \mathbf{m}_0, \frac{1}{k} \boldsymbol{\Sigma} \Big) \, \mathrm{InvWi}(\boldsymbol{\Sigma} \mid k\mathbf{I}, k) \propto |2\pi \boldsymbol{\Sigma}|^{-1/2} \, |\boldsymbol{\Sigma}|^{-(D+1)/2} = \mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{0}, 0, 0, 0\mathbf{I})$$

i.e., with $\nu_0 = 0$ instead of $\nu_0 = -1$.

Often, a data-dependent weakly informative prior is recommended (see Chipman et al. and Fraley and Raftery):

$$\text{Set } \mathbf{S}_0 = \frac{\mathrm{diag}(\mathbf{S}_{\bar{x}})}{N} \text{ and } \nu_0 = D + 2 \text{ (to ensure } \mathbb{E}[\boldsymbol{\Sigma}] = \mathbf{S}_0 \text{), and } \boldsymbol{\mu}_0 = \bar{\mathbf{x}}, \; \kappa_0 = 0.01$$

Minka, T. (2000). Inferring a Gaussian distribution. Technical report, MIT.

Chipman, H., E. George, and R. McCulloch (2001). The practical implementation of Bayesian Model Selection. Model Selection. IMS Lecture Notes.

Fraley, C. and A. Raftery (2007). Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. J. of Classification 24, 155–181

Page 46: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Visualization of the Wishart Distribution

Since the Wishart is a distribution over matrices, it is difficult to plot. However, one can sample from it and, in 2d, use the eigenvectors of each sampled matrix to define an ellipse (see the figure on the next slide).

For higher-dimensional matrices, we can plot the marginals. The diagonal elements of a Wishart-distributed matrix have Gamma distributions.

For the off-diagonal elements, one can sample matrices from the distribution and then compute their distribution empirically. We can convert each sampled matrix to a correlation matrix, and thus compute a Monte Carlo approximation

$$\mathbb{E}[R_{ij}] \approx \frac{1}{S} \sum_{s=1}^{S} R_{ij}^{(s)}, \qquad R_{ij}^{(s)} = \frac{\Sigma_{ij}^{(s)}}{\sqrt{\Sigma_{ii}^{(s)} \Sigma_{jj}^{(s)}}}, \qquad \boldsymbol{\Sigma}^{(s)} \sim \mathrm{Wi}(\mathbf{S}, \nu)$$

We can then use kernel density estimation to produce, for plotting purposes, a smooth approximation to the univariate density of $R_{ij}$.
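A Python sketch of this Monte Carlo visualization (not the PMTK demo; it uses the scale matrix and ν = 3 shown on the next slide and SciPy's wishart and gaussian_kde):

```python
# Sample Sigma ~ Wi(S, nu), convert each draw to a correlation, and smooth with a KDE.
import numpy as np
from scipy.stats import wishart, gaussian_kde

S = np.array([[3.1653, -0.0262], [-0.0262, 0.6477]])
samples = wishart(df=3, scale=S).rvs(size=5000, random_state=0)          # shape (5000, 2, 2)
rho = samples[:, 0, 1] / np.sqrt(samples[:, 0, 0] * samples[:, 1, 1])    # correlation of each draw
kde = gaussian_kde(rho)
grid = np.linspace(-0.99, 0.99, 201)
print(rho.mean(), grid[np.argmax(kde(grid))])   # nearly flat on [-1, 1] when nu is this small
```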

Page 47: Gaussian Models: Bayesian Inference

Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Visualization of the Wishart Distribution

Above: samples from Σ ∼ Wi(S, ν), where S = [3.1653, −0.0262; −0.0262, 0.6477] and ν = 3.

Right: plots of the marginals (which are Gamma), and the sample-based marginal of the correlation coefficient.

47

[Figure: nine elliptical samples from Wi(dof = 3.0, S), some nearly degenerate; marginal histograms of σ²1 and σ²2 (Gamma-shaped) and of the correlation coefficient ρ(1, 2). Legend: E = [9.5, -0.1; -0.1, 1.9], ρ = -0.018.]

If ν = 3 there is a lot of uncertainty

about the value of the correlation

coefficient ρ (almost uniform

distribution on [−1, 1]).

The sampled matrices are highly

variable, and some are nearly

singular.

As ν increases, the sampled

matrices are more concentrated on

the prior S.

wiPlotDemo

from PMTK

A. Gelman et al., Visualizing Distributions of Covariance Matrices, Unpublished.