Bayesian Scientific Computing, Spring 2013 (N. Zabaras)
Gaussian Models: Bayesian Inference Prof. Nicholas Zabaras
Materials Process Design and Control Laboratory
Sibley School of Mechanical and Aerospace Engineering
101 Frank H. T. Rhodes Hall
Cornell University
Ithaca, NY 14853-3801
Email: [email protected]
URL: http://mpdc.mae.cornell.edu/
February 1, 2014
Contents

Inferring the Precision of a Univariate Gaussian with Known Mean; Gamma and Inverse Gamma as Priors for λ and σ²; Inverse Chi-Squared Distribution as a Prior for σ²

Bayesian Inference for the Univariate Gaussian with Unknown Mean and Precision λ; Normal-Gamma Distribution as a Prior for (μ, λ); Posterior for (μ, σ²) using a Normal-Inverse χ² Prior; Marginal Posteriors, Credible Intervals, Bayesian T-Test, Multi-Sensor Fusion with Unknown Parameters

Inference for μ in a Multivariate Gaussian with a Gaussian Prior; Inference of Λ in a Multivariate Gaussian and the Wishart Distribution; Inference of Σ and the Inverse Wishart Distribution; MAP Estimate, MAP Shrinkage Estimation; Inference for (μ, Λ); Inference for (μ, Σ); Posterior Marginals of (μ, Σ); Visualization of the Wishart

Chris Bishop, Pattern Recognition and Machine Learning, Chapter 2
Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 4
Inference of Precision with Known Mean

Consider $x_n \sim \mathcal{N}(x_n \mid \mu, \lambda^{-1})$, $n = 1, \ldots, N$. We want to infer the precision $\lambda = 1/\sigma^2$ with the mean $\mu$ taken as known.

The likelihood takes the form:
$$p(\mathcal{X} \mid \lambda) = \prod_{n=1}^{N} f(x_n \mid \lambda) \propto \lambda^{N/2} \exp\left\{ -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}$$

The corresponding conjugate prior should be proportional to the product of a power of λ and the exponential of a linear function of λ. This corresponds to the Gamma distribution:
$$\mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} e^{-b\lambda}, \quad \lambda > 0, \qquad \mathbb{E}[\lambda] = \frac{a}{b}, \quad \mathrm{var}[\lambda] = \frac{a}{b^2}$$

The Gamma distribution has a finite integral if a > 0, and the distribution itself is finite if a ≥ 1.
Inference of Precision with Known Mean

The posterior takes the form:
$$p(\lambda \mid \mathcal{X}, \mu) \propto f(\mathcal{X} \mid \mu, \lambda)\, \mathrm{Gamma}(\lambda \mid a_0, b_0) \propto \lambda^{N/2 + a_0 - 1} \exp\left\{ -b_0 \lambda - \frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}$$

We can immediately see that the posterior is also a Gamma distribution:
$$p(\lambda \mid \mathcal{X}, \mu) = \mathrm{Gamma}(\lambda \mid a_N, b_N), \qquad a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 = b_0 + \frac{N}{2} \sigma_{ML}^2$$

Here $\sigma_{ML}^2$ is the MLE of the variance.
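As a quick illustration of this update, here is a minimal Python sketch (not the lecture's PMTK/MATLAB code; the data and hyperparameter values are invented for the example) that computes the Gamma posterior over the precision when the mean is known:

import numpy as np
from scipy import stats

np.random.seed(0)
mu_known = 2.0                                   # known mean
x = np.random.normal(mu_known, 1.5, size=50)     # synthetic data, true sigma = 1.5

# Gamma prior on the precision lambda = 1/sigma^2
a0, b0 = 1.0, 1.0

# Conjugate update: a_N = a0 + N/2, b_N = b0 + 0.5 * sum (x_n - mu)^2
N = len(x)
aN = a0 + N / 2.0
bN = b0 + 0.5 * np.sum((x - mu_known) ** 2)

posterior = stats.gamma(a=aN, scale=1.0 / bN)    # Gamma(a_N, rate=b_N)
print("posterior mean of lambda:", posterior.mean())
print("implied sigma^2 estimate:", 1.0 / posterior.mean())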
Inference of Precision with Known Mean

Recall the posterior
$$p(\lambda \mid \mathcal{X}, \mu) = \mathrm{Gamma}(\lambda \mid a_N, b_N), \qquad a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 = b_0 + \frac{N}{2} \sigma_{ML}^2$$

The effect of observing N data points is to increase the value of a by N/2 (i.e. 1/2 for each data point). Thus we interpret the parameter $a_0$ as $2a_0$ 'effective' prior observations.

Each measurement contributes on average $\sigma_{ML}^2/2$ to the parameter b. Since we have $2a_0$ effective prior measurements, each of them contributes to $b_0$ an effective prior variance
$$\sigma^2 = \frac{2 b_0}{2 a_0} = \frac{b_0}{a_0}$$

The interpretation of a conjugate prior in terms of effective dummy data points is typical for the exponential family of distributions.

The results above are identical to inferring the variance $\sigma^2$ directly, using the prior $\mathrm{InvGamma}(\sigma^2 \mid a_0, b_0)$ and obtaining the posterior $\mathrm{InvGamma}(\sigma^2 \mid a_N, b_N)$.
Gamma and Inverse Gamma

If $\lambda \sim \mathrm{Gamma}(\lambda \mid a, b)$, then $1/\lambda \sim \mathrm{InvGamma}(1/\lambda \mid a, b)$.

Here: $\sigma^{-2} \sim \mathrm{Gamma}(\sigma^{-2} \mid a, b)$, so $\sigma^{2} \sim \mathrm{InvGamma}(\sigma^{2} \mid a, b)$.

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004
Univariate Posterior – Inverse Chi-Squared Prior

An alternative prior for $\sigma^2$ is the scaled inverse chi-squared distribution*
$$\chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) = \mathrm{InvGamma}\left(\sigma^2 \,\Big|\, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right) \propto (\sigma^2)^{-\nu_0/2 - 1} \exp\left\{ -\frac{\nu_0 \sigma_0^2}{2 \sigma^2} \right\}$$

Here $\nu_0$ represents the strength of the prior and $\sigma_0^2$ encodes the value of the prior. With this, the posterior takes the form:
$$p(\sigma^2 \mid \mathcal{D}, \mu) = \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2), \qquad \nu_N = \nu_0 + N, \qquad \sigma_N^2 = \frac{\nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \mu)^2}{\nu_N}$$

The posterior dof $\nu_N$ is the prior dof $\nu_0$ plus N. The posterior sum of squares $\nu_N \sigma_N^2$ is the prior sum of squares plus the data sum of squares:
$$\nu_N \sigma_N^2 = \nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \mu)^2$$

An uninformative prior corresponds to zero virtual sample size, $\nu_0 = 0$; this is the prior $p(\sigma^2) \propto \sigma^{-2}$. This parameterization, with its interpretable hyperparameters, is certainly more appealing.

* Often denoted as $\text{Scale-Inv-}\chi^2(\sigma^2 \mid \nu_0, \sigma_0^2)$, with
$$\text{mean} = \frac{\nu_0 \sigma_0^2}{\nu_0 - 2} \ (\nu_0 > 2), \qquad \text{mode} = \frac{\nu_0 \sigma_0^2}{\nu_0 + 2}, \qquad \text{var} = \frac{2 \nu_0^2 \sigma_0^4}{(\nu_0 - 2)^2 (\nu_0 - 4)} \ (\nu_0 > 4)$$
Sequential Update of the Posterior for σ²

Sequential update of the posterior for $\sigma^2$, starting from the uninformative prior $\mathrm{IW}(\sigma^2 \mid \nu_0 = 0.001, s_0 = 0.001)$. The data were generated from $\mathcal{N}(5, 10)$.

[Figure: posteriors $p(\sigma^2 \mid \mathcal{D})$ after N = 2, 5, 50, and 100 observations; prior IW(ν = 0.001, S = 0.001), true σ² = 10. gaussSeqUpdateSigma1D from PMTK.]

For D = 1 the inverse Wishart reduces to an inverse Gamma: $\mathrm{IW}(\sigma^2 \mid \nu, s^{-1}) = \mathrm{InvGamma}(\sigma^2 \mid \nu/2, s/2)$.

Gelman 2006. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1(3):515–533.
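The sequential update in the figure can be reproduced with a few lines of Python; this is a minimal sketch (not the PMTK gaussSeqUpdateSigma1D code), writing the scaled inverse chi-squared posterior as an inverse Gamma and drawing synthetic data as described above:

import numpy as np
from scipy import stats

np.random.seed(1)
mu_known, true_var = 5.0, 10.0
data = np.random.normal(mu_known, np.sqrt(true_var), size=100)

# (Nearly) uninformative scaled inverse chi-squared prior
nu0, s0_sq = 0.001, 0.001

for N in [2, 5, 50, 100]:
    x = data[:N]
    nuN = nu0 + N
    sN_sq = (nu0 * s0_sq + np.sum((x - mu_known) ** 2)) / nuN
    # Scale-Inv-chi2(nuN, sN_sq) == InvGamma(a = nuN/2, scale = nuN*sN_sq/2)
    post = stats.invgamma(a=nuN / 2.0, scale=nuN * sN_sq / 2.0)
    print(f"N={N:3d}: posterior mode of sigma^2 = {nuN * sN_sq / (nuN + 2):.2f}")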
Bayesian Inference: Unknown Mean and Precision

Consider $x_n \sim \mathcal{N}(x_n \mid \mu, \lambda^{-1})$, $n = 1, \ldots, N$. We want to infer both the precision $\lambda = 1/\sigma^2$ and the mean $\mu$.

The likelihood takes the form:
$$p(\mathcal{X} \mid \mu, \lambda) = \prod_{n=1}^{N} \left(\frac{\lambda}{2\pi}\right)^{1/2} \exp\left\{ -\frac{\lambda}{2} (x_n - \mu)^2 \right\} \propto \left[ \lambda^{1/2} \exp\left( -\frac{\lambda \mu^2}{2} \right) \right]^{N} \exp\left\{ \lambda \mu \sum_{n=1}^{N} x_n - \frac{\lambda}{2} \sum_{n=1}^{N} x_n^2 \right\}$$

We need a prior that has a similar functional form in terms of λ and μ:
$$p(\mu, \lambda) \propto \left[ \lambda^{1/2} \exp\left( -\frac{\lambda \mu^2}{2} \right) \right]^{c} \exp\left\{ d \lambda \mu - b \lambda \right\} = \lambda^{c/2} \exp\left\{ -\frac{c\lambda}{2} \left( \mu - \frac{d}{c} \right)^2 \right\} \exp\left\{ -\left( b - \frac{d^2}{2c} \right) \lambda \right\}$$
Bayesian Inference: Unknown Mean and Precision

We can easily identify that the prior is of Normal-Gamma form:
$$p(\mu, \lambda) \propto \lambda^{c/2} \exp\left\{ -\frac{c\lambda}{2} \left( \mu - \frac{d}{c} \right)^2 \right\} \exp\left\{ -\left( b - \frac{d^2}{2c} \right) \lambda \right\} = \mathcal{N}\!\left( \mu \,\Big|\, \frac{d}{c}, (c\lambda)^{-1} \right) \mathrm{Gamma}\!\left( \lambda \,\Big|\, \frac{c+1}{2},\; b - \frac{d^2}{2c} \right)$$

Recall the form of the Gamma distribution:
$$\mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} e^{-b\lambda}$$
Bayesian Inference: Unknown Mean and Precision

Combining the likelihood and prior, we can re-arrange and write:
$$p(\mu, \lambda \mid \mathcal{X}) \propto \lambda^{(N+c)/2} \exp\left\{ -\frac{\lambda}{2} \left[ (N + c)\mu^2 - 2\mu \left( d + \sum_{n=1}^{N} x_n \right) + \sum_{n=1}^{N} x_n^2 \right] - b\lambda \right\}$$

Completing the square in the quadratic in μ gives a posterior of the same Normal-Gamma form:
$$p(\mu, \lambda \mid \mathcal{X}) = \mathcal{N}\!\left( \mu \,\Big|\, \mu_N, \big((c + N)\lambda\big)^{-1} \right) \mathrm{Gamma}(\lambda \mid a_N, b_N)$$
$$\mu_N = \frac{d + \sum_{n=1}^{N} x_n}{c + N}, \qquad a_N = a + \frac{N}{2} = \frac{c + N + 1}{2}, \qquad b_N = b + \frac{1}{2} \sum_{n=1}^{N} x_n^2 - \frac{(c + N)\,\mu_N^2}{2}$$
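A minimal Python sketch of this conjugate update, written in the equivalent (μ0, κ0, a0, b0) parameterization of the Normal-Gamma prior (the data and hyperparameters here are illustrative, not from the lecture):

import numpy as np

def normal_gamma_update(x, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0):
    """Posterior hyperparameters of the Normal-Gamma prior
    N(mu | mu0, (kappa0*lambda)^-1) Gamma(lambda | a0, b0)."""
    x = np.asarray(x)
    N, xbar = len(x), x.mean()
    kappaN = kappa0 + N
    muN = (kappa0 * mu0 + N * xbar) / kappaN
    aN = a0 + N / 2.0
    bN = (b0 + 0.5 * np.sum((x - xbar) ** 2)
          + 0.5 * kappa0 * N * (xbar - mu0) ** 2 / kappaN)
    return muN, kappaN, aN, bN

np.random.seed(2)
x = np.random.normal(3.0, 2.0, size=40)           # synthetic data
muN, kappaN, aN, bN = normal_gamma_update(x)
print("posterior mean of mu:", muN)
print("posterior mean of lambda:", aN / bN)       # E[lambda | D]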
The Normal-Gamma Distribution

$$p(\mu, \lambda) = \mathcal{N}\!\left( \mu \,\Big|\, \mu_0, (c\lambda)^{-1} \right) \mathrm{Gamma}\!\left( \lambda \,\Big|\, \frac{c+1}{2},\; b - \frac{d^2}{2c} \right), \qquad \mu_0 = \frac{d}{c}$$

[Figure: contour plot of the Normal-Gamma density over (μ, λ) with a = 5, b = 6, μ0 = 0, prior strength 2. MATLAB code.]

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (provides additional results for posterior marginals, posterior predictive, and reference results for an uninformative prior)
Posterior for μ and σ² for Scalar Data

We can also work directly with σ². We use the normal inverse chi-squared (NIX) distribution:
$$\mathrm{NI}\chi^2(\mu, \sigma^2 \mid m_0, \kappa_0, \nu_0, \sigma_0^2) = \mathcal{N}(\mu \mid m_0, \sigma^2/\kappa_0)\, \chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) \propto (\sigma^2)^{-(\nu_0 + 3)/2} \exp\left\{ -\frac{\nu_0 \sigma_0^2 + \kappa_0 (\mu - m_0)^2}{2\sigma^2} \right\}$$

Similarly to our earlier calculations, the posterior is given as:
$$p(\mu, \sigma^2 \mid \mathcal{D}) = \mathrm{NI}\chi^2(\mu, \sigma^2 \mid m_N, \kappa_N, \nu_N, \sigma_N^2)$$
$$m_N = \frac{\kappa_0 m_0 + N \bar{x}}{\kappa_N}, \quad \kappa_N = \kappa_0 + N, \quad \nu_N = \nu_0 + N, \quad \nu_N \sigma_N^2 = \nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \bar{x})^2 + \frac{N \kappa_0}{\kappa_0 + N} (m_0 - \bar{x})^2$$

The posterior marginal for σ² and its posterior expectation are:
$$p(\sigma^2 \mid \mathcal{D}) = \int p(\mu, \sigma^2 \mid \mathcal{D})\, d\mu = \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2), \qquad \mathbb{E}[\sigma^2 \mid \mathcal{D}] = \frac{\nu_N}{\nu_N - 2}\, \sigma_N^2$$

where, as before,
$$\chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) = \mathrm{IG}\left(\sigma^2 \,\Big|\, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right) \propto (\sigma^2)^{-\nu_0/2 - 1} \exp\left\{ -\frac{\nu_0 \sigma_0^2}{2\sigma^2} \right\}$$

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (provides additional results for posterior marginals, posterior predictive, and reference results for an uninformative prior; Section 6 provides the analysis for a normal-inverse-Gamma prior)
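A minimal Python sketch of the NIX hyperparameter update (illustrative values, not from the lecture's MATLAB code):

import numpy as np

def nix_update(x, m0=0.0, kappa0=1.0, nu0=1.0, sigma0_sq=1.0):
    """Posterior hyperparameters of the normal inverse chi-squared prior."""
    x = np.asarray(x)
    N, xbar = len(x), x.mean()
    kappaN = kappa0 + N
    mN = (kappa0 * m0 + N * xbar) / kappaN
    nuN = nu0 + N
    sigmaN_sq = (nu0 * sigma0_sq + np.sum((x - xbar) ** 2)
                 + N * kappa0 * (m0 - xbar) ** 2 / kappaN) / nuN
    return mN, kappaN, nuN, sigmaN_sq

x = np.array([2.3, 1.8, 2.9, 2.1, 2.6])          # toy data
mN, kappaN, nuN, sigmaN_sq = nix_update(x)
print(mN, kappaN, nuN, sigmaN_sq)
print("E[sigma^2 | D] =", nuN * sigmaN_sq / (nuN - 2))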
Normal Inverse χ² Distribution

[Figure: surface/contour plots of NIχ²(m0=0, κ0=1, ν0=1, σ0²=1), NIχ²(m0=0, κ0=5, ν0=1, σ0²=1), and NIχ²(m0=0, κ0=1, ν0=5, σ0²=1) over (μ, σ²). NIXdemo2 from PMTK.]

The NIχ²(m0, κ0, ν0, σ0²) distribution. m0 is the prior mean and κ0 is how strongly we believe this; σ0² is the prior variance and ν0 is how strongly we believe this. (a) The contour plot (underneath the surface) is shaped like a "squashed egg". (b) We increase the strength of our belief in the mean, so the distribution gets narrower in μ. (c) We increase the strength of our belief in the variance, so it gets narrower in σ².
Posterior for μ and σ² for Scalar Data

The posterior marginal for μ is Student's T:
$$p(\mu \mid \mathcal{D}) = \int p(\mu, \sigma^2 \mid \mathcal{D})\, d\sigma^2 = \mathcal{T}(\mu \mid m_N, \sigma_N^2/\kappa_N, \nu_N)$$

Let us revisit these results with an uninformative prior:
$$p(\mu, \sigma^2) \propto p(\mu)\, p(\sigma^2) \propto \sigma^{-2} \;\Longleftrightarrow\; \mathrm{NI}\chi^2(\mu, \sigma^2 \mid m_0 = 0, \kappa_0 = 0, \nu_0 = -1, \sigma_0^2 = 0)$$

With this prior, the posterior becomes:
$$p(\mu, \sigma^2 \mid \mathcal{D}) = \mathrm{NI}\chi^2\!\left(\mu, \sigma^2 \,\Big|\, \bar{x}, N, N - 1, s^2\right), \qquad s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{N}{N - 1}\, \sigma_{MLE}^2$$

s is the sample standard deviation. Thus the marginal posterior for μ becomes:
$$p(\mu \mid \mathcal{D}) = \mathcal{T}\!\left(\mu \,\Big|\, \bar{x}, \frac{s^2}{N}, N - 1\right), \qquad \mathrm{var}[\mu \mid \mathcal{D}] = \frac{N - 1}{N - 3}\, \frac{s^2}{N} \approx \frac{s^2}{N}$$

The standard error of the mean is defined as $\sqrt{\mathrm{var}[\mu \mid \mathcal{D}]} \approx s/\sqrt{N}$. An approximate 95% posterior credible interval is thus:
$$I_{0.95}(\mu \mid \mathcal{D}) = \bar{x} \pm 2\, \frac{s}{\sqrt{N}}$$

Here $\mathcal{T}(x \mid \mu, \lambda^{-1}, \nu)$ denotes the Student's T distribution,
$$p(x \mid \mu, \lambda, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left[ 1 + \frac{\lambda (x - \mu)^2}{\nu} \right]^{-(\nu + 1)/2}$$
with mean μ (ν > 1), mode μ, and variance $\frac{1}{\lambda}\frac{\nu}{\nu - 2}$ (ν > 2).
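A minimal Python sketch of this credible interval (using exact Student-T quantiles rather than the ±2 approximation; the data are synthetic, invented for the example):

import numpy as np
from scipy import stats

np.random.seed(3)
x = np.random.normal(5.0, 2.0, size=20)
N, xbar, s = len(x), x.mean(), x.std(ddof=1)

# Posterior marginal p(mu | D) = T(mu | xbar, s^2/N, N-1) under the uninformative prior
lo, hi = stats.t.interval(0.95, df=N - 1, loc=xbar, scale=s / np.sqrt(N))
print(f"95% credible interval for mu: ({lo:.3f}, {hi:.3f})")
print("approximation xbar +/- 2 s/sqrt(N):",
      xbar - 2 * s / np.sqrt(N), xbar + 2 * s / np.sqrt(N))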
Bayesian T-Test

We want to test the hypothesis μ ≠ μ0 for some known value μ0 (often 0), given xi ∼ N(μ, σ²). This is called a two-sided, one-sample t-test. One can check if μ0 ∈ I0.95(μ|D). If it is not, then we can be 95% sure that μ ≠ μ0.

A more common scenario is when we want to test if two paired samples have the same mean. More precisely, suppose yi ∼ N(μ1, σ²) and zi ∼ N(μ2, σ²). We want to determine if μ := μ1 − μ2 > 0, using xi = yi − zi as our data. We can evaluate this as follows:
$$p(\mu > \mu_0 \mid \mathcal{D}) = \int_{\mu_0}^{\infty} p(\mu \mid \mathcal{D})\, d\mu$$

This is called a one-sided, paired t-test. To calculate the posterior, we must specify a prior. Suppose we use an uninformative prior. As shown earlier, the posterior marginal is:
$$p(\mu \mid \mathcal{D}) = \mathcal{T}\!\left(\mu \,\Big|\, \bar{x}, \frac{s^2}{N}, N - 1\right)$$
Bayesian T-Test

We define the t statistic as:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{N}}$$

The denominator is the standard error of the mean. With this definition, note that
$$p(\mu > \mu_0 \mid \mathcal{D}) = \int_{\mu_0}^{\infty} p(\mu \mid \mathcal{D})\, d\mu = 1 - F_{N-1}(-t) = F_{N-1}(t)$$
where $F_\nu(t)$ is the CDF of the standard Student's T distribution $\mathcal{T}(0, 1, \nu)$.

Note: The posterior of μ has the form $\mu \mid \mathcal{D} \sim \mathcal{T}_{N-1}\!\left(\bar{x}, \frac{s^2}{N}\right)$. From a frequentist point of view, this is identical to the sampling distribution of the MLE: $\bar{x} \mid \mu \sim \mathcal{T}_{N-1}\!\left(\mu, \frac{s^2}{N}\right)$. Thus the one-sided p-value in a frequentist test is numerically the same as the Bayesian estimate $p(\mu \le \mu_0 \mid \mathcal{D}) = 1 - F_{N-1}(t)$. The interpretation of the results in the two approaches is of course very different.

Box, G. and G. Tiao (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley.
Gonen, M., W. Johnson, Y. Lu, and P. Westfall (2005, August). The Bayesian Two-Sample t Test. The American Statistician 59(3), 252–257.
Rouder, J., P. Speckman, D. Sun, and R. Morey (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review 16(2), 225–237.

bayesTtestDemo from PMTK
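A minimal Python sketch of this calculation (not the PMTK bayesTtestDemo; the paired data are invented for illustration):

import numpy as np
from scipy import stats

# Paired measurements y_i, z_i; we test whether mu = mu_1 - mu_2 > 0
y = np.array([5.1, 4.8, 5.6, 5.0, 5.3, 4.9])
z = np.array([4.7, 4.9, 5.1, 4.6, 5.0, 4.5])
x = y - z                                 # differences
N, xbar, s = len(x), x.mean(), x.std(ddof=1)

mu0 = 0.0
t = (xbar - mu0) / (s / np.sqrt(N))

# Bayesian posterior probability p(mu > mu0 | D) under the uninformative prior
p_greater = stats.t.cdf(t, df=N - 1)      # = F_{N-1}(t)
print(f"t = {t:.3f},  p(mu > {mu0} | D) = {p_greater:.4f}")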
Sensor Fusion with Unknown Parameters

Let us consider a sensor fusion problem where the precision of each measurement device is unknown. The unknown-precision case turns out to give qualitatively different results from the earlier case of known precision, yielding a potentially multi-modal posterior.

Suppose we want to pool data from two sources x and y to estimate some quantity μ ∈ R, but the reliability of the sources is unknown. Specifically, suppose we have two different measurement devices with different precisions:
$$x_i \mid \mu \sim \mathcal{N}(\mu, \lambda_x^{-1}), \qquad y_i \mid \mu \sim \mathcal{N}(\mu, \lambda_y^{-1})$$

We make two independent measurements with each device, which turn out to be x1 = 1.1, x2 = 1.9, y1 = 2.9, y2 = 4.1.

We use a non-informative prior for μ, p(μ) ∝ 1, which we can emulate using a Gaussian, $p(\mu) = \mathcal{N}(\mu \mid m_0 = 0, \lambda_0^{-1})$ with $\lambda_0 \to 0$.

Minka, T. (2000). Estimating a Dirichlet distribution. Technical report, MIT.
Sensor Fusion with Unknown Parameters

For known precisions, the posterior is Gaussian:
$$p(\mu \mid \mathcal{D}, \lambda_x, \lambda_y) = \mathcal{N}(\mu \mid m_N, \lambda_N^{-1}), \qquad \lambda_N = N_x \lambda_x + N_y \lambda_y, \qquad m_N = \frac{N_x \lambda_x \bar{x} + N_y \lambda_y \bar{y}}{\lambda_N}$$
with $N_x = N_y = 2$, $\bar{x} = 1.5$, $\bar{y} = 3.5$.

However, the measurement precisions are not known. Initially we will estimate them by MLE. The log-likelihood is given by
$$\ell(\mu, \lambda_x, \lambda_y) = \frac{N_x}{2} \log \lambda_x + \frac{N_y}{2} \log \lambda_y - \frac{\lambda_x}{2} \sum_i (x_i - \mu)^2 - \frac{\lambda_y}{2} \sum_i (y_i - \mu)^2$$

The MLE is obtained by solving the following coupled equations:
$$\mu = \frac{N_x \lambda_x \bar{x} + N_y \lambda_y \bar{y}}{N_x \lambda_x + N_y \lambda_y}, \qquad \frac{1}{\lambda_x} = \frac{1}{N_x} \sum_i (x_i - \mu)^2, \qquad \frac{1}{\lambda_y} = \frac{1}{N_y} \sum_i (y_i - \mu)^2$$
Sensor Fusion with Unknown Parameters

We solve these equations by iteration, starting with $\lambda_x = \lambda_y = 1$:
$$\mu \leftarrow \frac{N_x \lambda_x \bar{x} + N_y \lambda_y \bar{y}}{N_x \lambda_x + N_y \lambda_y}, \qquad \frac{1}{\lambda_x} \leftarrow \frac{1}{N_x} \sum_i (x_i - \mu)^2, \qquad \frac{1}{\lambda_y} \leftarrow \frac{1}{N_y} \sum_i (y_i - \mu)^2$$

Upon convergence, $1/\hat{\lambda}_x \approx 0.1662$ and $1/\hat{\lambda}_y \approx 4.0509$, giving the plug-in approximation
$$p(\mu \mid \mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y) = \mathcal{N}(\mu \mid 1.5788,\, 0.0798)$$

The plug-in approximation to the posterior is plotted next.
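A minimal Python sketch of this fixed-point iteration and the plug-in posterior, using the measurements given on the previous slide:

import numpy as np

x = np.array([1.1, 1.9])
y = np.array([2.9, 4.1])
Nx, Ny = len(x), len(y)

lam_x, lam_y = 1.0, 1.0                       # starting values
for _ in range(100):                          # iterate the coupled MLE equations
    mu = (Nx * lam_x * x.mean() + Ny * lam_y * y.mean()) / (Nx * lam_x + Ny * lam_y)
    lam_x = 1.0 / np.mean((x - mu) ** 2)
    lam_y = 1.0 / np.mean((y - mu) ** 2)

lam_N = Nx * lam_x + Ny * lam_y               # plug-in posterior precision
m_N = (Nx * lam_x * x.mean() + Ny * lam_y * y.mean()) / lam_N
print(f"1/lam_x = {1/lam_x:.4f}, 1/lam_y = {1/lam_y:.4f}")
print(f"plug-in posterior: N(mu | {m_N:.4f}, {1/lam_N:.4f})")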
Plug-in Approximation to the Posterior

Posterior for μ using the plug-in approximation.

[Figure: plug-in approximation $p(\mu \mid \mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y)$. sensorFusionUnknownPrec from Kevin Murphy's PMTK.]

This weights each sensor according to its estimated precision. Since sensor y was estimated to be much less reliable than sensor x, we have $p(\mu \mid \mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y) \approx p(\mu \mid \mathcal{D}_x, \hat{\lambda}_x)$. Effectively, with this approximation we ignore the y sensor.
Sensor Fusion with Unknown Parameters

Now we adopt a Bayesian approach and integrate out the unknown precisions, rather than trying to estimate them. That is, we compute
$$p(\mu \mid \mathcal{D}) \propto p(\mu) \left[ \int p(\mathcal{D}_x \mid \mu, \lambda_x)\, p(\lambda_x \mid \mu)\, d\lambda_x \right] \left[ \int p(\mathcal{D}_y \mid \mu, \lambda_y)\, p(\lambda_y \mid \mu)\, d\lambda_y \right]$$

We will use uninformative Jeffreys priors, p(μ) ∝ 1, p(λx|μ) ∝ 1/λx and p(λy|μ) ∝ 1/λy.

The first integral becomes (using the likelihood derived earlier):
$$I_x = \int p(\mathcal{D}_x \mid \mu, \lambda_x)\, p(\lambda_x \mid \mu)\, d\lambda_x \propto \int \lambda_x^{N_x/2} \exp\left\{ -\frac{N_x \lambda_x}{2} \left[ (\bar{x} - \mu)^2 + s_x^2 \right] \right\} \frac{1}{\lambda_x}\, d\lambda_x, \qquad s_x^2 = \frac{1}{N_x} \sum_i (x_i - \bar{x})^2$$

For Nx = 2, the integrand reduces to $\exp(-b\lambda_x)$ with $b = (\bar{x} - \mu)^2 + s_x^2$, so the integral is simply the normalizing factor of a Gamma distribution, $\mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)} \lambda^{a-1} e^{-b\lambda}$, with a = 1:
$$I_x \propto \frac{1}{(\bar{x} - \mu)^2 + s_x^2}$$
Sensor Fusion with Unknown Parameters

Finally:
$$p(\mu \mid \mathcal{D}) \propto \frac{1}{(\bar{x} - \mu)^2 + s_x^2} \cdot \frac{1}{(\bar{y} - \mu)^2 + s_y^2}$$

The exact posterior is plotted below.

[Figure: exact posterior p(μ|D), showing two modes. sensorFusionUnknownPrec from Kevin Murphy's PMTK.]

The posterior has two modes, near $\bar{x} = 1.5$ and $\bar{y} = 3.5$. The weight of the first mode is larger, since the data from the x sensor agree more with each other; it seems likely that the x sensor is the reliable one. The Bayesian solution keeps open the possibility that the y sensor is the more reliable one: from two measurements we cannot tell, and choosing just the x sensor, as the plug-in approximation does, results in overconfidence (a too-narrow posterior).
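A minimal Python sketch evaluating this exact posterior on a grid (same data as above):

import numpy as np

x = np.array([1.1, 1.9])
y = np.array([2.9, 4.1])
sx2 = np.mean((x - x.mean()) ** 2)            # s_x^2 with N_x in the denominator
sy2 = np.mean((y - y.mean()) ** 2)

mu = np.linspace(-2.0, 6.0, 2001)
post = 1.0 / (((x.mean() - mu) ** 2 + sx2) * ((y.mean() - mu) ** 2 + sy2))
post /= np.trapz(post, mu)                    # normalize on the grid

is_peak = (post[1:-1] > post[:-2]) & (post[1:-1] > post[2:])
modes = mu[np.r_[False, is_peak, False]]
print("local modes near:", modes)             # expect one near 1.5 and one near 3.5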
Multivariate Gaussian: Posterior of m

Consider a known covariance Σ and a Gaussian prior $\mathcal{N}(\boldsymbol{m} \mid \boldsymbol{m}_0, \Sigma_0)$, with the posterior for the unknown mean m taking the form:
$$p(\boldsymbol{m} \mid \mathcal{X}) \propto p(\boldsymbol{m}) \prod_{n=1}^{N} p(\boldsymbol{x}_n \mid \boldsymbol{m}, \Sigma)$$

This posterior is the exponential of a quadratic in m:
$$-\frac{1}{2} (\boldsymbol{m} - \boldsymbol{m}_0)^T \Sigma_0^{-1} (\boldsymbol{m} - \boldsymbol{m}_0) - \frac{1}{2} \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{m})^T \Sigma^{-1} (\boldsymbol{x}_n - \boldsymbol{m}) = -\frac{1}{2} \boldsymbol{m}^T \left( \Sigma_0^{-1} + N \Sigma^{-1} \right) \boldsymbol{m} + \boldsymbol{m}^T \left( \Sigma_0^{-1} \boldsymbol{m}_0 + \Sigma^{-1} \sum_{n=1}^{N} \boldsymbol{x}_n \right) + \text{const}$$

So the covariance and mean of the posterior $p(\boldsymbol{m} \mid \mathcal{X}) = \mathcal{N}(\boldsymbol{m} \mid \boldsymbol{m}_N, \Sigma_N)$ are:
$$\Sigma_N^{-1} = \Sigma_0^{-1} + N \Sigma^{-1}, \qquad \boldsymbol{m}_N = \Sigma_N \left( \Sigma_0^{-1} \boldsymbol{m}_0 + N \Sigma^{-1} \boldsymbol{m}_{ML} \right), \qquad \boldsymbol{m}_{ML} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n$$

For an uninformative prior ($\Sigma_0^{-1} \to 0$):
$$p(\boldsymbol{m} \mid \mathcal{X}) = \mathcal{N}\!\left( \boldsymbol{m} \,\Big|\, \boldsymbol{m}_{ML}, \frac{1}{N} \Sigma \right)$$
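A minimal Python sketch of this update for a known covariance (synthetic data; not from the lecture):

import numpy as np

np.random.seed(4)
D, N = 2, 30
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])         # known covariance
true_mean = np.array([1.0, -1.0])
X = np.random.multivariate_normal(true_mean, Sigma, size=N)

m0 = np.zeros(D)                                    # prior N(m0, Sigma0)
Sigma0 = 10.0 * np.eye(D)

Sigma_inv, Sigma0_inv = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)
SigmaN = np.linalg.inv(Sigma0_inv + N * Sigma_inv)                  # posterior covariance
mN = SigmaN @ (Sigma0_inv @ m0 + N * Sigma_inv @ X.mean(axis=0))    # posterior mean
print("posterior mean:", mN)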
Posterior Distribution of Precision Λ

We now discuss how to compute p(Λ|D, μ). The likelihood has the form
$$p(\mathcal{D} \mid \mu, \Lambda) \propto |\Lambda|^{N/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\Lambda S_\mu) \right\}, \qquad S_\mu = \sum_n (\boldsymbol{x}_n - \boldsymbol{\mu})(\boldsymbol{x}_n - \boldsymbol{\mu})^T$$

The corresponding conjugate prior is known as the Wishart distribution
$$\mathrm{Wi}(\Lambda \mid \boldsymbol{W}, v) = B(\boldsymbol{W}, v)\, |\Lambda|^{(v - D - 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\boldsymbol{W}^{-1} \Lambda) \right\}, \qquad v > D - 1 \text{ (dof)}, \quad \boldsymbol{W} \text{ sym. pos. def. } D \times D$$
$$B(\boldsymbol{W}, v) = |\boldsymbol{W}|^{-v/2} \left( 2^{vD/2}\, \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\!\left(\frac{v + 1 - i}{2}\right) \right)^{-1}, \qquad \Gamma_D\!\left(\frac{v}{2}\right) = \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\!\left(\frac{v + 1 - i}{2}\right) \ \text{(multivariate Gamma function)}$$

For D = 1: $\mathrm{Wi}(\lambda \mid s^{-1}, \nu) = \mathrm{Gamma}\!\left(\lambda \,\Big|\, \frac{\nu}{2}, \frac{s}{2}\right)$.

The following slide shows the similarities of this distribution with the Gamma prior for λ used earlier for univariate Gaussian distributions.
Wishart Distribution

For $\Lambda \sim \mathrm{Wi}(\boldsymbol{S}, v)$ in k dimensions, the mode is $(v - k - 1)\boldsymbol{S}$ for $v \ge k + 1$.

If $\boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{0}, \Sigma)$, $i = 1, \ldots, N$, then the scatter matrix $\boldsymbol{S} = \sum_{i=1}^{N} \boldsymbol{x}_i \boldsymbol{x}_i^T \sim \mathrm{Wi}(\boldsymbol{S} \mid \Sigma, N)$.

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004
Posterior Distribution of Σ

We now similarly discuss how to compute p(Σ|D, μ). The likelihood has the form
$$p(\mathcal{D} \mid \mu, \Sigma) \propto |\Sigma|^{-N/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\Sigma^{-1} S_\mu) \right\}, \qquad S_\mu = \sum_n (\boldsymbol{x}_n - \boldsymbol{\mu})(\boldsymbol{x}_n - \boldsymbol{\mu})^T$$

The corresponding conjugate prior is known as the inverse Wishart distribution
$$\mathrm{InvWi}(\Sigma \mid \boldsymbol{S}_0^{-1}, \nu_0) \propto |\Sigma|^{-(\nu_0 + D + 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\boldsymbol{S}_0 \Sigma^{-1}) \right\}, \qquad \Sigma \text{ sym. pos. def.}$$

ν0 + D + 1 controls the strength of the prior, and hence plays a role analogous to the sample size N. The prior scatter matrix here is S0.

Note: There are many parametrizations of the InvWi leading to different reported DOF. We here follow the notation from Gelman et al., with the same ν for both Wi and InvWi in the equation above:
$$\text{If } \Lambda \sim \mathrm{Wi}(\Lambda \mid \boldsymbol{S}, \nu), \text{ then } \Sigma = \Lambda^{-1} \sim \mathrm{InvWi}(\Sigma \mid \boldsymbol{S}^{-1}, \nu)$$

Steven W. Nydick, The Wishart and Inverse Wishart Distributions, Report, 2012.
A. Gelman, J. Carlin, H. Stern and D. Rubin, Bayesian Data Analysis, 2004
Inverse Wishart Distribution

For $\Sigma \sim \mathrm{InvWi}(\boldsymbol{S}^{-1}, \nu)$ in k dimensions, the mode is $\boldsymbol{S}/(\nu + k + 1)$.

For k = 1: $\mathrm{InvWi}(\sigma^2 \mid s^{-1}, \nu) = \mathrm{InvGamma}\!\left(\sigma^2 \,\Big|\, \frac{\nu}{2}, \frac{s}{2}\right)$.

If $\lambda \sim \mathrm{Gamma}(a, b)$, then $1/\lambda \sim \mathrm{InvGamma}(a, b)$. If $\Sigma^{-1} \sim \mathrm{Wi}(\boldsymbol{S}, \nu)$, then $\Sigma \sim \mathrm{InvWi}(\boldsymbol{S}^{-1}, \nu)$.

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004
Posterior Distribution of Σ

Multiplying the likelihood and prior, we find that the posterior is also inverse Wishart:
$$p(\Sigma \mid \mathcal{D}, \mu) \propto |\Sigma|^{-N/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\Sigma^{-1} S_\mu) \right\} |\Sigma|^{-(\nu_0 + D + 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\Sigma^{-1} \boldsymbol{S}_0) \right\}$$
$$= |\Sigma|^{-(\nu_0 + N + D + 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}\!\left[ \Sigma^{-1} (\boldsymbol{S}_0 + S_\mu) \right] \right\}$$
$$\Rightarrow \quad p(\Sigma \mid \mathcal{D}, \mu) = \mathrm{InvWi}(\Sigma \mid \boldsymbol{S}_N^{-1}, \nu_N), \qquad \nu_N = \nu_0 + N, \qquad \boldsymbol{S}_N = \boldsymbol{S}_0 + S_\mu$$

The posterior strength νN is the prior strength ν0 plus the number of observations N. The posterior scatter matrix SN is the prior scatter matrix S0 plus the data scatter matrix Sμ.
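A minimal Python sketch of this update using scipy's inverse Wishart (scipy.stats.invwishart is parameterized by a dof and a scale matrix, which correspond here to νN and SN; the data are synthetic):

import numpy as np
from scipy import stats

np.random.seed(5)
D, N = 2, 100
mu_known = np.zeros(D)
true_Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
X = np.random.multivariate_normal(mu_known, true_Sigma, size=N)

nu0 = D + 2                                   # prior strength
S0 = np.eye(D)                                # prior scatter matrix
S_mu = (X - mu_known).T @ (X - mu_known)      # data scatter matrix

nuN, SN = nu0 + N, S0 + S_mu
posterior = stats.invwishart(df=nuN, scale=SN)
print("posterior mean of Sigma:\n", posterior.mean())
print("posterior mode of Sigma:\n", SN / (nuN + D + 1))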
MAP Estimation

From the mode of the inverse Wishart and $p(\Sigma \mid \mathcal{D}, \mu) = \mathrm{InvWi}(\Sigma \mid \boldsymbol{S}_N^{-1}, \nu_N)$ with $\boldsymbol{S}_N = \boldsymbol{S}_0 + S_\mu$, $\nu_N = \nu_0 + N$, we conclude that the MAP estimate is:
$$\hat{\Sigma}_{MAP} = \frac{\boldsymbol{S}_N}{\nu_N + D + 1} = \frac{\boldsymbol{S}_0 + S_\mu}{N_0 + N}, \qquad N_0 = \nu_0 + D + 1$$

For an improper prior, S0 = 0 and N0 = 0, and
$$\hat{\Sigma}_{MAP} = \frac{S_\mu}{N} = \hat{\Sigma}_{MLE}$$

Consider now the use of a proper informative prior, which is necessary whenever D/N is large. Let $\lambda = \frac{N_0}{N_0 + N}$. Then we can rewrite the MAP estimate as a convex combination of the prior mode and the MLE
$$\hat{\Sigma}_{MAP} = \frac{\boldsymbol{S}_0 + S_\mu}{N_0 + N} = \frac{N_0}{N_0 + N} \frac{\boldsymbol{S}_0}{N_0} + \frac{N}{N_0 + N} \frac{S_\mu}{N} = \lambda \Sigma_0 + (1 - \lambda)\, \hat{\Sigma}_{MLE}, \qquad \Sigma_0 = \frac{\boldsymbol{S}_0}{N_0} \ (\text{prior mode})$$

where λ controls the amount of shrinkage towards the prior.
MAP Estimation

$$\hat{\Sigma}_{MAP} = \lambda \Sigma_0 + (1 - \lambda)\, \hat{\Sigma}_{MLE}, \qquad \Sigma_0 = \frac{\boldsymbol{S}_0}{N_0} \ (\text{prior mode})$$

Set λ by cross-validation. Alternatively, we can use the closed-form formula provided in Ledoit & Wolf and Schaefer & Strimmer, which is the optimal frequentist estimate if we use squared loss. This is not a natural loss function for covariance matrices, since it ignores the positive definite constraint, but it results in a simple estimator (see the PMTK function shrinkcov).

For the prior mode Σ0, it is common to use the following (data-dependent) choice:
$$\Sigma_0 = \mathrm{diag}\!\left(\hat{\Sigma}_{MLE}\right)$$

Ledoit, O. and M. Wolf (2004b). A well conditioned estimator for large dimensional covariance matrices. J. of Multivariate Analysis 88(2), 365–411.
Ledoit, O. and M. Wolf (2004a). Honey, I Shrunk the Sample Covariance Matrix. J. of Portfolio Management 31(1).
Schaefer, J. and K. Strimmer (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statist. Appl. Genet. Mol. Biol. 4(32).
MAP Shrinkage Estimation

With $\Sigma_0 = \mathrm{diag}\!\left(\hat{\Sigma}_{MLE}\right)$, the MAP estimate is:
$$\hat{\Sigma}_{MAP}(i, j) = \begin{cases} \hat{\Sigma}_{MLE}(i, j) & \text{if } i = j \\ (1 - \lambda)\, \hat{\Sigma}_{MLE}(i, j) & \text{otherwise} \end{cases}$$

Thus we see that the diagonal entries are equal to their MLE estimates, and the off-diagonal elements are "shrunk" somewhat towards 0 (shrinkage estimation, or regularized estimation).

The benefits of MAP estimation are illustrated next. We consider fitting a 50-dimensional Gaussian to N = 100, N = 50 and N = 25 data points. We see that the MAP estimate is always well-conditioned, unlike the MLE. In particular, the eigenvalue spectrum of the MAP estimate is much closer to that of the true matrix than the MLE's. The eigenvectors, however, are unaffected.
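A minimal Python sketch of this shrinkage estimator with a fixed λ (the PMTK shrinkcov function instead picks λ via the Ledoit-Wolf/Schaefer-Strimmer formula; here λ is simply set by hand):

import numpy as np

def shrink_cov(X, lam=0.9):
    """MAP shrinkage estimate: keep the MLE diagonal, shrink off-diagonals by (1-lam)."""
    S_mle = np.cov(X, rowvar=False, bias=True)        # MLE covariance
    S_map = (1.0 - lam) * S_mle
    np.fill_diagonal(S_map, np.diag(S_mle))           # diagonal entries stay at the MLE
    return S_map

np.random.seed(6)
D, N = 50, 25
X = np.random.multivariate_normal(np.zeros(D), np.eye(D), size=N)
S_mle = np.cov(X, rowvar=False, bias=True)
S_map = shrink_cov(X, lam=0.9)
print("condition number MLE:", np.linalg.cond(S_mle))   # huge: rank deficient for N < D
print("condition number MAP:", np.linalg.cond(S_map))   # well conditioned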
Posterior Distribution of Σ

Estimating a covariance matrix in D = 50 dimensions using N ∈ {100, 50, 25} samples. Eigenvalues in descending order for the true covariance matrix (solid black), MLE (dotted blue) and MAP estimates (dashed red) with λ = 0.9. The condition number of each matrix is also given in the legend.

[Figure: eigenvalue spectra for N = 100 (true κ = 10.00, MLE κ = 71, MAP κ = 8.62), N = 50 (true κ = 10.00, MLE κ = 4.1e+16, MAP κ = 8.85), and N = 25 (true κ = 10.00, MLE κ = 3.7e+17, MAP κ = 21.09). shrinkcovDemo from PMTK.]
Inference for Both m and Λ

Suppose x1, x2, …, xN ~ (i.i.d.) N(m, Λ⁻¹). We do not know m or Λ.

When both sets of parameters are unknown, a conjugate family of priors is one in which Λ ~ Wi(Λ | T, ν) and m | Λ ~ N(m0, (κΛ)⁻¹).

The Wishart distribution is the multivariate analog of the Gamma distribution (an extension to positive definite matrices). If a matrix U has the Wishart distribution, then U⁻¹ has the inverse Wishart distribution. The resulting p(m, Λ | m0, κ, T, ν) is the Gaussian-Wishart distribution.

The quantity ν is a positive scalar, while T is a positive definite matrix. They play roles analogous to those played by a and b, respectively, in the Gamma distribution. The other parameters of the prior are the mean vector m0 and κ, the latter of which represents the "a priori number of observations".
Inference for Both m and Λ

The likelihood and prior distributions are given explicitly as:
$$p(\mathcal{D} \mid \boldsymbol{m}, \Lambda) = (2\pi)^{-ND/2} |\Lambda|^{N/2} \exp\left\{ -\frac{1}{2} \sum_{i=1}^{N} (\boldsymbol{x}_i - \boldsymbol{m})^T \Lambda (\boldsymbol{x}_i - \boldsymbol{m}) \right\}$$
$$p(\boldsymbol{m}, \Lambda) = \mathrm{NWi}(\boldsymbol{m}, \Lambda \mid \boldsymbol{m}_0, \kappa_0, \nu_0, \boldsymbol{T}_0) = \mathcal{N}\!\left(\boldsymbol{m} \mid \boldsymbol{m}_0, (\kappa_0 \Lambda)^{-1}\right) \mathrm{Wi}(\Lambda \mid \boldsymbol{T}_0, \nu_0)$$
$$= \frac{1}{Z} |\Lambda|^{1/2} \exp\left\{ -\frac{\kappa_0}{2} (\boldsymbol{m} - \boldsymbol{m}_0)^T \Lambda (\boldsymbol{m} - \boldsymbol{m}_0) \right\} |\Lambda|^{(\nu_0 - D - 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\boldsymbol{T}_0^{-1} \Lambda) \right\}$$
where the normalization Z involves $\left(\frac{2\pi}{\kappa_0}\right)^{D/2}$ and the multivariate Gamma function $\Gamma_D(\nu_0/2)$.

Combining them gives the following posterior:
$$p(\boldsymbol{m}, \Lambda \mid \mathcal{D}) = \mathrm{NWi}(\boldsymbol{m}, \Lambda \mid \boldsymbol{m}_N, \kappa_N, \nu_N, \boldsymbol{T}_N)$$
$$\boldsymbol{m}_N = \frac{\kappa_0 \boldsymbol{m}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \qquad \kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N$$
$$\boldsymbol{T}_N^{-1} = \boldsymbol{T}_0^{-1} + \boldsymbol{S} + \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{m}_0)(\bar{\boldsymbol{x}} - \boldsymbol{m}_0)^T, \qquad \boldsymbol{S} = \sum_{i=1}^{N} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^T$$

M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.
K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 8)
Inference for Both m and Λ

The posterior marginals can be derived as:
$$p(\Lambda \mid \mathcal{D}) = \mathrm{Wi}(\Lambda \mid \boldsymbol{T}_N, \nu_N), \qquad p(\boldsymbol{m} \mid \mathcal{D}) = \mathcal{T}_{\nu_N - D + 1}\!\left(\boldsymbol{m} \,\Big|\, \boldsymbol{m}_N, \frac{\boldsymbol{T}_N^{-1}}{\kappa_N (\nu_N - D + 1)}\right)$$

One can also derive the MAP estimates as:
$$(\hat{\boldsymbol{m}}, \hat{\Lambda}) = \arg\max_{\boldsymbol{m}, \Lambda} p(\boldsymbol{m}, \Lambda \mid \mathcal{D}): \qquad \hat{\boldsymbol{m}} = \frac{\kappa_0 \boldsymbol{m}_0 + \sum_{i=1}^{N} \boldsymbol{x}_i}{\kappa_0 + N}, \qquad \hat{\Lambda}^{-1} = \frac{\boldsymbol{T}_0^{-1} + \kappa_0 (\boldsymbol{m}_0 - \hat{\boldsymbol{m}})(\boldsymbol{m}_0 - \hat{\boldsymbol{m}})^T + \sum_{i=1}^{N} (\boldsymbol{x}_i - \hat{\boldsymbol{m}})(\boldsymbol{x}_i - \hat{\boldsymbol{m}})^T}{\nu_0 + N - D}$$

These are reduced to the MLE by setting $\kappa_0 = 0$, $\nu_0 = D$, $\boldsymbol{T}_0^{-1} = \boldsymbol{0}$.
Inference for Both m and Λ

The posterior predictive is:
$$p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{\nu_N - D + 1}\!\left(\boldsymbol{x} \,\Big|\, \boldsymbol{m}_N, \frac{\kappa_N + 1}{\kappa_N (\nu_N - D + 1)}\, \boldsymbol{T}_N^{-1}\right)$$

The marginal likelihood can be computed as a ratio of normalization constants:
$$p(\mathcal{D}) = \frac{Z_N}{Z_0}\, \frac{1}{(2\pi)^{ND/2}} = \frac{1}{\pi^{ND/2}}\, \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)}\, \frac{|\boldsymbol{T}_0^{-1}|^{\nu_0/2}}{|\boldsymbol{T}_N^{-1}|^{\nu_N/2}} \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2}$$

A useful reference analysis considers $\boldsymbol{m}_0 = \boldsymbol{0}$, $\kappa_0 = 0$, $\nu_0 = -1$, $\boldsymbol{T}_0^{-1} = \boldsymbol{0}$. This results in the following prior:
$$p(\boldsymbol{m}, \Lambda) \propto |\Lambda|^{-(D+1)/2}$$

The posterior parameters are simplified as:
$$\boldsymbol{m}_N = \bar{\boldsymbol{x}}, \qquad \kappa_N = N, \qquad \nu_N = N - 1, \qquad \boldsymbol{T}_N^{-1} = \boldsymbol{S}$$

The posterior marginals and posterior predictive are:
$$p(\Lambda \mid \mathcal{D}) = \mathrm{Wi}(\Lambda \mid \boldsymbol{S}^{-1}, N - 1), \qquad p(\boldsymbol{m} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left(\boldsymbol{m} \,\Big|\, \bar{\boldsymbol{x}}, \frac{\boldsymbol{S}}{N(N - D)}\right), \qquad p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left(\boldsymbol{x} \,\Big|\, \bar{\boldsymbol{x}}, \frac{(N + 1)\,\boldsymbol{S}}{N(N - D)}\right)$$
Inference for m and Σ

For the case of the multivariate Gaussian of a D-dimensional variable x, N(x | m, Σ), with both the mean and the covariance unknown, the likelihood is of the form:
$$p(\mathcal{D} \mid \boldsymbol{m}, \Sigma) = (2\pi)^{-ND/2} |\Sigma|^{-N/2} \exp\left\{ -\frac{1}{2} \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{m})^T \Sigma^{-1} (\boldsymbol{x}_n - \boldsymbol{m}) \right\}$$
$$= (2\pi)^{-ND/2} |\Sigma|^{-N/2} \exp\left\{ -\frac{N}{2} (\boldsymbol{m} - \bar{\boldsymbol{x}})^T \Sigma^{-1} (\boldsymbol{m} - \bar{\boldsymbol{x}}) \right\} \exp\left\{ -\frac{1}{2} \mathrm{Tr}\!\left(\Sigma^{-1} \boldsymbol{S}_{\bar{x}}\right) \right\}, \qquad \boldsymbol{S}_{\bar{x}} = \sum_{n=1}^{N} (\boldsymbol{x}_n - \bar{\boldsymbol{x}})(\boldsymbol{x}_n - \bar{\boldsymbol{x}})^T$$

The conjugate prior is given as the product of a Gaussian and the inverse Wishart distribution (NIW):
$$\mathrm{NIW}(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_0, \kappa_0, \nu_0, \boldsymbol{S}_0) = \mathcal{N}\!\left(\boldsymbol{m} \,\Big|\, \boldsymbol{m}_0, \frac{1}{\kappa_0} \Sigma\right) \mathrm{InvWi}(\Sigma \mid \boldsymbol{S}_0^{-1}, \nu_0)$$
$$= \frac{1}{Z_{NIW}} |\Sigma|^{-1/2} \exp\left\{ -\frac{\kappa_0}{2} (\boldsymbol{m} - \boldsymbol{m}_0)^T \Sigma^{-1} (\boldsymbol{m} - \boldsymbol{m}_0) \right\} |\Sigma|^{-(\nu_0 + D + 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}(\Sigma^{-1} \boldsymbol{S}_0) \right\}$$
$$= \frac{1}{Z_{NIW}} |\Sigma|^{-(\nu_0 + D + 2)/2} \exp\left\{ -\frac{\kappa_0}{2} (\boldsymbol{m} - \boldsymbol{m}_0)^T \Sigma^{-1} (\boldsymbol{m} - \boldsymbol{m}_0) - \frac{1}{2} \mathrm{Tr}(\Sigma^{-1} \boldsymbol{S}_0) \right\}$$
$$Z_{NIW} = 2^{\nu_0 D/2}\, \Gamma_D\!\left(\frac{\nu_0}{2}\right) \left(\frac{2\pi}{\kappa_0}\right)^{D/2} |\boldsymbol{S}_0|^{-\nu_0/2}, \qquad \Gamma_D: \text{multivariate Gamma function}$$
The Posterior of m and Σ

The posterior is NIW, given as:
$$p(\boldsymbol{m}, \Sigma \mid \mathcal{D}) = \mathrm{NIW}(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_N, \kappa_N, \nu_N, \boldsymbol{S}_N)$$
$$\boldsymbol{m}_N = \frac{\kappa_0 \boldsymbol{m}_0 + N \bar{\boldsymbol{x}}}{\kappa_N} = \frac{\kappa_0}{\kappa_0 + N} \boldsymbol{m}_0 + \frac{N}{\kappa_0 + N} \bar{\boldsymbol{x}}, \qquad \kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N$$
$$\boldsymbol{S}_N = \boldsymbol{S}_0 + \boldsymbol{S}_{\bar{x}} + \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{m}_0)(\bar{\boldsymbol{x}} - \boldsymbol{m}_0)^T, \qquad \boldsymbol{S}_{\bar{x}} = \sum_{i=1}^{N} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^T$$

The posterior mean is a convex combination of the prior mean and the MLE, with strength κ0 + N. The posterior scatter matrix SN is the prior scatter matrix S0 plus the empirical scatter matrix S_x̄ plus an extra term due to the uncertainty in the mean, which creates its own scatter matrix.

Minka, T. (2000). Inferring a Gaussian distribution. Technical report, MIT.
Chipman, H., E. George, and R. McCulloch (2001). The practical implementation of Bayesian Model Selection. Model Selection. IMS Lecture Notes.
Fraley, C. and A. Raftery (2007). Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. J. of Classification 24, 155–181.
K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007
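A minimal Python sketch of the NIW hyperparameter update (synthetic data; not the lecture's MATLAB code):

import numpy as np

def niw_update(X, m0, kappa0, nu0, S0):
    """Posterior hyperparameters of the Normal-Inverse-Wishart prior."""
    X = np.asarray(X)
    N, xbar = X.shape[0], X.mean(axis=0)
    kappaN = kappa0 + N
    mN = (kappa0 * m0 + N * xbar) / kappaN
    nuN = nu0 + N
    S_xbar = (X - xbar).T @ (X - xbar)
    SN = S0 + S_xbar + (kappa0 * N / kappaN) * np.outer(xbar - m0, xbar - m0)
    return mN, kappaN, nuN, SN

np.random.seed(7)
D = 2
X = np.random.multivariate_normal([1.0, -2.0], [[1.0, 0.4], [0.4, 2.0]], size=200)
mN, kappaN, nuN, SN = niw_update(X, m0=np.zeros(D), kappa0=1.0, nu0=D + 2, S0=np.eye(D))
print("posterior mean of m:", mN)
print("posterior mean of Sigma:\n", SN / (nuN - D - 1))   # mean of InvWi(S_N^-1, nu_N)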
MAP Estimate of m and Σ

The mode of the joint posterior $p(\boldsymbol{m}, \Sigma \mid \mathcal{D}) = \mathrm{NIW}(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_N, \kappa_N, \nu_N, \boldsymbol{S}_N)$ is:
$$(\hat{\boldsymbol{m}}, \hat{\Sigma}) = \arg\max_{\boldsymbol{m}, \Sigma}\, p(\boldsymbol{m}, \Sigma \mid \mathcal{D}) = \left(\boldsymbol{m}_N, \frac{\boldsymbol{S}_N}{\nu_N + D + 2}\right)$$

For κ0 = 0, this becomes:
$$(\hat{\boldsymbol{m}}, \hat{\Sigma}) = \left(\bar{\boldsymbol{x}}, \frac{\boldsymbol{S}_0 + \boldsymbol{S}_{\bar{x}}}{\nu_0 + N + D + 2}\right)$$

It is interesting to note that this mode is almost the same as the MAP estimate computed earlier; it differs by 1 in the denominator, as the mode above is the mode of the joint posterior!
The Posterior Marginals of m and Σ

The posterior marginals for Σ and m are:
$$p(\Sigma \mid \mathcal{D}) = \int p(\boldsymbol{m}, \Sigma \mid \mathcal{D})\, d\boldsymbol{m} = \mathrm{InvWi}(\Sigma \mid \boldsymbol{S}_N^{-1}, \nu_N), \qquad \mathbb{E}[\Sigma \mid \mathcal{D}] = \frac{\boldsymbol{S}_N}{\nu_N - D - 1}, \qquad \hat{\Sigma}_{MAP} = \frac{\boldsymbol{S}_N}{\nu_N + D + 1}$$
$$p(\boldsymbol{m} \mid \mathcal{D}) = \int p(\boldsymbol{m}, \Sigma \mid \mathcal{D})\, d\Sigma = \mathcal{T}_{\nu_N - D + 1}\!\left(\boldsymbol{m} \,\Big|\, \boldsymbol{m}_N, \frac{\boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)}\right)$$

It is not surprising that the last marginal is Student's T, which we know can be represented as a mixture of Gaussians.

To see the connection with the scalar case, note that SN plays the role of the posterior sum of squares $\nu_N \sigma_N^2$: for D = 1,
$$\frac{\boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)} = \frac{\nu_N \sigma_N^2}{\kappa_N \nu_N} = \frac{\sigma_N^2}{\kappa_N}$$
The Posterior Predictive of m and Σ

The posterior predictive p(x|D) = p(x, D)/p(D) can be evaluated as:
$$p(\boldsymbol{x} \mid \mathcal{D}) = \int\!\!\int \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{m}, \Sigma)\, \mathrm{NIW}(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_N, \kappa_N, \nu_N, \boldsymbol{S}_N)\, d\boldsymbol{m}\, d\Sigma = \mathcal{T}_{\nu_N - D + 1}\!\left(\boldsymbol{x} \,\Big|\, \boldsymbol{m}_N, \frac{\kappa_N + 1}{\kappa_N (\nu_N - D + 1)}\, \boldsymbol{S}_N\right)$$

Recall that the Student's T distribution has heavier tails than the Gaussian but rapidly becomes Gaussian-like.

To see the connection of the above expression with the scalar case, note that for D = 1:
$$\frac{(\kappa_N + 1)\, \boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)} = \frac{(\kappa_N + 1)\, \nu_N \sigma_N^2}{\kappa_N \nu_N} = \frac{1 + \kappa_N}{\kappa_N}\, \sigma_N^2$$

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 9)
Marginal Likelihood

The posterior can be written as
$$p(\boldsymbol{m}, \Sigma \mid \mathcal{D}) = \frac{1}{p(\mathcal{D})}\, \frac{1}{(2\pi)^{ND/2}}\, \frac{1}{Z_0}\, \mathrm{N}'(\mathcal{D} \mid \boldsymbol{m}, \Sigma)\; \mathrm{NIW}'(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_0, \kappa_0, \nu_0, \boldsymbol{S}_0)$$
where the last two expressions are the unnormalized likelihood and prior,
$$\mathrm{N}'(\mathcal{D} \mid \boldsymbol{m}, \Sigma) = |\Sigma|^{-N/2} \exp\left\{ -\frac{N}{2} (\boldsymbol{m} - \bar{\boldsymbol{x}})^T \Sigma^{-1} (\boldsymbol{m} - \bar{\boldsymbol{x}}) - \frac{1}{2} \mathrm{Tr}(\Sigma^{-1} \boldsymbol{S}_{\bar{x}}) \right\}$$
$$\mathrm{NIW}'(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_0, \kappa_0, \nu_0, \boldsymbol{S}_0) = |\Sigma|^{-(\nu_0 + D + 2)/2} \exp\left\{ -\frac{\kappa_0}{2} (\boldsymbol{m} - \boldsymbol{m}_0)^T \Sigma^{-1} (\boldsymbol{m} - \boldsymbol{m}_0) - \frac{1}{2} \mathrm{Tr}(\Sigma^{-1} \boldsymbol{S}_0) \right\}$$
Their product is the unnormalized posterior $\mathrm{NIW}'(\boldsymbol{m}, \Sigma \mid \boldsymbol{m}_N, \kappa_N, \nu_N, \boldsymbol{S}_N)$, which integrates to $Z_N$.

The marginal likelihood is then a ratio of normalization constants:
$$p(\mathcal{D}) = \frac{Z_N}{Z_0}\, \frac{1}{(2\pi)^{ND/2}} = \frac{1}{\pi^{ND/2}}\, \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)}\, \frac{|\boldsymbol{S}_0|^{\nu_0/2}}{|\boldsymbol{S}_N|^{\nu_N/2}} \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2}$$

Note that for D = 1, with $S = \nu \sigma^2$, this reduces to the familiar equation
$$p(\mathcal{D}) = \frac{1}{\pi^{N/2}}\, \frac{\Gamma(\nu_N/2)}{\Gamma(\nu_0/2)}\, \frac{(\nu_0 \sigma_0^2)^{\nu_0/2}}{(\nu_N \sigma_N^2)^{\nu_N/2}} \sqrt{\frac{\kappa_0}{\kappa_N}}$$

K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (see the calculation of the marginal likelihood for the 1D analysis of the Normal-Inverse-Chi-Squared prior in Section 5)
Non-Informative Prior

The uninformative Jeffreys prior is $p(\boldsymbol{m}, \Sigma) \propto |\Sigma|^{-(D+1)/2}$. This is obtained in the limit $\kappa_0 \to 0$, $\nu_0 \to -1$, $|\boldsymbol{S}_0| \to 0$.

In this case, we have:
$$\boldsymbol{m}_N = \bar{\boldsymbol{x}}, \qquad \kappa_N = N, \qquad \nu_N = N - 1, \qquad \boldsymbol{S}_N = \boldsymbol{S}_{\bar{x}} = \sum_{i} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^T$$

The posterior marginals are then given as:
$$p(\Sigma \mid \mathcal{D}) = \mathrm{IW}(\Sigma \mid \boldsymbol{S}_{\bar{x}}^{-1}, N - 1), \qquad p(\boldsymbol{m} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left(\boldsymbol{m} \,\Big|\, \bar{\boldsymbol{x}}, \frac{\boldsymbol{S}_{\bar{x}}}{N(N - D)}\right)$$

Also the posterior predictive is:
$$p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left(\boldsymbol{x} \,\Big|\, \bar{\boldsymbol{x}}, \frac{(N + 1)\, \boldsymbol{S}_{\bar{x}}}{N(N - D)}\right)$$

Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004 (pp. 88)
K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (see Section 9)
Non-Informative Prior

Based on the report of Minka below, the uninformative prior should instead use ν0 = 0 (rather than ν0 = −1), since the limit of proper NIW priors is
$$\lim_{\kappa \to 0}\, \mathcal{N}\!\left(\boldsymbol{m} \,\Big|\, \boldsymbol{m}_0, \frac{\Sigma}{\kappa}\right) \mathrm{InvWi}(\Sigma \mid \kappa \boldsymbol{I}, \kappa) \propto |2\pi\Sigma|^{-1/2}\, |\Sigma|^{-(D+1)/2} = \mathrm{NIW}(\boldsymbol{m}, \Sigma \mid \boldsymbol{0}, 0, 0, \boldsymbol{0}\boldsymbol{I})$$

Often, a data-dependent weakly informative prior is recommended (see Chipman et al. and Fraley and Raftery):
$$\boldsymbol{S}_0 = \frac{\mathrm{diag}(\boldsymbol{S}_{\bar{x}})}{N}, \qquad \nu_0 = D + 2 \ \text{(to ensure } \mathbb{E}[\Sigma] = \boldsymbol{S}_0\text{)}, \qquad \boldsymbol{m}_0 = \bar{\boldsymbol{x}}, \qquad \kappa_0 = 0.01$$

Minka, T. (2000). Inferring a Gaussian distribution. Technical report, MIT.
Chipman, H., E. George, and R. McCulloch (2001). The practical implementation of Bayesian Model Selection. Model Selection. IMS Lecture Notes.
Fraley, C. and A. Raftery (2007). Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. J. of Classification 24, 155–181.
Visualization of the Wishart Distribution

Since the Wishart is a distribution over matrices, it is difficult to plot. However, one can sample from it, and in 2D use the eigenvectors of each sampled matrix to define an ellipse (see the figure on the next slide).

For higher-dimensional matrices, we can plot the marginals. The diagonal elements of a Wishart-distributed matrix have Gamma distributions. For off-diagonal elements, one can sample matrices from the distribution and then compute their distribution empirically.

We can convert each sampled matrix to a correlation matrix, and thus compute a Monte Carlo approximation to the expected correlations:
$$\mathbb{E}[R_{ij}] \approx \frac{1}{S} \sum_{s=1}^{S} R_{ij}\!\left(\Sigma^{(s)}\right), \qquad \Sigma^{(s)} \sim \mathrm{Wi}(\boldsymbol{S}, \nu), \qquad R_{ij}(\Sigma) = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\, \Sigma_{jj}}}$$

We can then use kernel density estimation to produce a smooth approximation to the univariate density of $R_{ij}$ for plotting purposes.
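A minimal Python sketch of this Monte Carlo visualization recipe (not the PMTK wiPlotDemo; the scale matrix below is the one quoted on the next slide):

import numpy as np
from scipy import stats

S = np.array([[3.1653, -0.0262], [-0.0262, 0.6477]])    # scale matrix from the figure
nu = 3.0
samples = stats.wishart(df=nu, scale=S).rvs(size=1000, random_state=0)

# Convert each sample to a correlation coefficient R_12 = Sigma_12 / sqrt(Sigma_11 Sigma_22)
r12 = samples[:, 0, 1] / np.sqrt(samples[:, 0, 0] * samples[:, 1, 1])
print("Monte Carlo estimate of E[R_12]:", r12.mean())

kde = stats.gaussian_kde(r12)                            # smooth density of R_12 for plotting
grid = np.linspace(-1, 1, 5)
print("KDE of R_12 on a coarse grid:", kde(grid))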
Visualization of the Wishart Distribution

Above: Samples from Σ ∼ Wi(S, ν), where S = [3.1653, −0.0262; −0.0262, 0.6477] and ν = 3. Right: Plots of the marginals (which are Gamma), and the sample-based marginal on the correlation coefficient.

[Figure: ellipses for sampled matrices from Wi(dof = 3.0, S), with E[Σ] = [9.5, −0.1; −0.1, 1.9] and E[ρ] = −0.018; marginal densities of σ₁², σ₂², and ρ(1,2). wiPlotDemo from PMTK.]

If ν = 3, there is a lot of uncertainty about the value of the correlation coefficient ρ (almost a uniform distribution on [−1, 1]). The sampled matrices are highly variable, and some are nearly singular. As ν increases, the sampled matrices are more concentrated on the prior S.
A. Gelman et al., Visualizing Distributions of Covariance Matrices, Unpublished.