1

Connections between MCMC and Likelihood Methods

Donald A. Pierce with Ruggero Bellio

Winter 2010, OSU

Slides are at www.science.oregonstate.edu/~piercedo/osu-mcmc-mpl.ppt

2

It is popular these days to be “Bayesian”, in large part due to the utility of MCMC and in particular (Win)BUGS

However, substantive prior information is seldom used, aiming for “objective Bayes”, and connections to likelihood inference are interesting

Largely, the gain in MCMC is in utilizing rather intractable likelihood functions: integrating over latent variates, e.g. latent cluster effects or covariates observed with error

However, if everything except observed data is a random variable, issues of inference become highly (too?) automatic

3

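A key issue in this is the contrast of profile and integrated likelihoods, namely (with $\psi$ the interest parameter and $\lambda$ the nuisance parameter)

\[ L_P(\psi; y) = \max_{\lambda} L(\psi, \lambda; y) \]

\[ L_I^{w}(\psi; y) = \int L(\psi, \lambda; y)\, w(\lambda)\, d\lambda \]

Modern higher-order likelihood theory suggests, surprisingly, that integrated likelihoods can overcome shortcomings of profile likelihood

A posterior for $\psi$ is an instance of integrated likelihood

That is, $\pi(\psi, \lambda \mid y) \propto L(\psi, \lambda; y)\, \pi(\psi, \lambda)$, so

\[ \pi(\psi \mid y) \propto \pi(\psi) \int L(\psi, \lambda; y)\, \pi(\lambda \mid \psi)\, d\lambda \]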
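A minimal sketch of the contrast, under assumptions not in the slides: a normal sample with mean $\lambda$ (nuisance) and variance $\psi$ (interest), a flat weight $w(\lambda) = 1$, and made-up data. In this toy model the profile likelihood peaks near $S/n$ and the integrated likelihood near $S/(n-1)$, where $S$ is the sum of squares about the sample mean.

```python
import math

def log_lik(psi, lam, y):
    """Normal log-likelihood: mean lam (nuisance), variance psi (interest)."""
    n = len(y)
    ss = sum((v - lam) ** 2 for v in y)
    return -0.5 * n * math.log(psi) - ss / (2.0 * psi)

def log_profile(psi, y):
    """Maximise over lam; here the maximiser is the sample mean for every psi."""
    lam_hat = sum(y) / len(y)
    return log_lik(psi, lam_hat, y)

def log_integrated(psi, y, lo=-5.0, hi=6.0, m=801):
    """Integrate L(psi, lam; y) over lam with flat weight (trapezoid rule)."""
    h = (hi - lo) / (m - 1)
    logs = [log_lik(psi, lo + i * h, y) for i in range(m)]
    top = max(logs)                       # rescale to avoid underflow
    vals = [math.exp(v - top) for v in logs]
    area = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return top + math.log(area)

# Hypothetical data, for illustration only
y = [1.2, -0.4, 0.7, 2.1, 0.3, -1.0, 1.6, 0.9]
grid = [0.3 + 0.01 * i for i in range(300)]
psi_profile = max(grid, key=lambda p: log_profile(p, y))
psi_integrated = max(grid, key=lambda p: log_integrated(p, y))
print(psi_profile, psi_integrated)
```

The integration limits and the grid are chosen by hand for these data and would need adjusting for anything else.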

4

An integrated likelihood is approximated very well by a Laplace approximation

\[ L_I^{w}(\psi; y) \doteq L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, w(\hat\lambda_\psi) \]

Hence, the MCMC posterior for “flat” priors is essentially

\[ \pi(\psi \mid y) \propto L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2} \]

We will see that this depends substantially on the representation of the nuisance parameter --- to be avoided in frequentist or likelihood inference

The approximation above is, within reason, valid for any such representation (not that this is so comforting)
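A small check of the Laplace approximation, under assumptions not stated on this slide: two exponential samples of size n with means lam and psi*lam, flat weight w(lam) = 1, and invented sample totals s1 and s2. The integral over lam is available in closed form for this toy model, so the exact and Laplace versions can be compared directly. The sketch keeps the constant (2*pi)^(1/2), which does not depend on psi.

```python
import math

def log_lik(psi, lam, s1, s2, n):
    """Two exponential samples of size n, means lam and psi*lam; s1, s2 are sample totals."""
    return -2 * n * math.log(lam) - n * math.log(psi) - (s1 + s2 / psi) / lam

def log_integrated_exact(psi, s1, s2, n):
    """Closed form of the integral of L over lam in (0, inf) with w(lam) = 1."""
    t = s1 + s2 / psi
    return -n * math.log(psi) + math.lgamma(2 * n - 1) - (2 * n - 1) * math.log(t)

def log_integrated_laplace(psi, s1, s2, n):
    """log L_P(psi) - 0.5*log j(lam_hat_psi) + 0.5*log(2*pi)."""
    lam_hat = (s1 + s2 / psi) / (2 * n)   # constrained MLE of lam for this psi
    j = 2 * n / lam_hat ** 2              # observed information for lam at lam_hat
    return (log_lik(psi, lam_hat, s1, s2, n)
            - 0.5 * math.log(j) + 0.5 * math.log(2 * math.pi))

# Hypothetical totals, for illustration only
n, s1, s2 = 10, 12.0, 30.0
psis = [0.5 + 0.25 * i for i in range(30)]
diffs = [log_integrated_exact(p, s1, s2, n) - log_integrated_laplace(p, s1, s2, n)
         for p in psis]
print(min(diffs), max(diffs))
```

In this model the difference is the same for every psi, so as functions of psi the two agree up to a constant; that is a feature of the example, not a general guarantee.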

5

Regarding “flat” priors: in practice those used in WinBUGS manual examples seem advisable, i.e. proper but very diffuse for parameters on $(-\infty, \infty)$, e.g. dnorm(0,1E-6), and implicitly for the logs of inherently positive parameters, e.g. dgamma(1E-6,1E-6)

The latter is to obtain approximate invariance to scale for scale parameters, a natural requirement

If, to facilitate convergence, the prior $\pi(\psi)$ is chosen otherwise, then for likelihood analysis one should divide the posterior of $\psi$ by the prior

Geyer & Thompson (1992 JRSS-B) gave a method for computing the likelihood using MCMC, but the proposal here is far simpler
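A short illustration of why these priors behave as "flat", assuming the BUGS conventions that dnorm takes a precision as its second argument and dgamma takes shape and rate. The helper names are hypothetical. The first function converts the precision to a standard deviation; the second gives the unnormalised log density implied for log(x) when x has the gamma prior, which is nearly constant over a wide range.

```python
import math

def precision_to_sd(tau):
    """BUGS dnorm(mean, tau) uses precision tau = 1/variance."""
    return 1.0 / math.sqrt(tau)

def log_density_of_log_x(x, shape=1e-6, rate=1e-6):
    """Unnormalised log density of log(x) when x ~ Gamma(shape, rate):
    x^(shape-1) * exp(-rate*x), times the Jacobian x of the log transform."""
    return shape * math.log(x) - rate * x

sd = precision_to_sd(1e-6)                      # standard deviation of dnorm(0,1E-6)
xs = [10.0 ** k for k in range(-3, 4)]          # six orders of magnitude
vals = [log_density_of_log_x(x) for x in xs]
spread = max(vals) - min(vals)                  # how far from flat on the log scale
print(sd, spread)
```

A small spread here means the prior is close to uniform for log(x) over that range, which is the approximate scale invariance mentioned above.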

6

An attempt to generally improve on profile likelihood was the Cox-Reid approximate conditional likelihood (ACL)

\[ L_{AC}(\psi; y) = L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2} \]

requiring that the nuisance parameter be represented as ‘orthogonal’ to $\psi$, i.e. that $\hat\lambda_\psi$ varies slowly with $\psi$

However, orthogonal parameters are not at all uniquely defined, resulting in arbitrariness of the ACL that must be resolved

A partial indication of our interests is that the ACL is formally the same as the above approximation to the posterior for $\psi$ using flat priors

7

Barndorff-Nielsen developed the modified profile likelihood (MPL)

\[ L_{MP}(\psi; y) = L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, |\partial\hat\lambda_\psi / \partial\hat\lambda|^{-1} \]

that is invariant to representation of the nuisance parameter --- a really key issue

Remarkable stroke of intuition, and B-N only showed that the MPL approximates what is desired for the primary special settings: exponential families, regression-scale models, etc

We have been developing the idea that what the MPL in general approximates is a suitable integrated likelihood, hence with close connections to MCMC

8

Example (Pierce & Peters 1992): case-control (CC) study, 40 sets with 2:1 matching, 30/80 of controls “exposed”

Solid line PL, dashed lines conditional likelihood and MPL

9

Concept of ‘orthogonal’ parameter, for ACL and for MCMC, needs clarification

In principle there is an ‘ideal’ choice of orthogonal parameter such that the integrated likelihood, i.e. the Bayes posterior (with uniform priors), approximates the MPL

Some goals are: (a) to actually compute this, either from the likelihood or the posterior samples, (b) to recover the PL from the posterior distribution, and (c) to approximate the MPL in this way, even if not as in (a)

These are not completed, but some progress has been made

10

Example: Binary data on 50 subjects, repeated observations at up to five times, total of 220 observations

Suitable for logistic mixed model with latent random intercepts for subjects

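Interest parameter $\sigma$, the standard deviation of the random intercepts. Seven nuisance parameters: constant term, 2 treatment parameters, 4 for time effects

Usual parametrization is not orthogonal: the vector $\beta$ of canonical regression parameters is ‘attenuated’ as (with $\hat\beta_\sigma$ the constrained MLE for given $\sigma$)

\[ \hat\beta_0 \approx \hat\beta_\sigma / \sqrt{1 + 0.304\,\sigma^2} \]

suggesting an approximately orthogonal parameter

\[ \beta / \sqrt{1 + 0.304\,\sigma^2} \]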
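A sketch of the rescaling as a pair of helper functions. Both function names are hypothetical. The slide does not say where 0.304 comes from; the sketch assumes it is 3/pi^2, the reciprocal of the variance of the standard logistic distribution, which matches the constant to three decimals.

```python
import math

# Assumption: 0.304 is 3/pi^2 (reciprocal of the standard logistic variance pi^2/3)
C2 = 3.0 / math.pi ** 2

def attenuation(sigma):
    """Factor by which the regression parameters are scaled for a given sigma."""
    return math.sqrt(1.0 + C2 * sigma ** 2)

def to_orthogonal(beta, sigma):
    """Map the original regression parameters to the approximately orthogonal ones."""
    return [b / attenuation(sigma) for b in beta]

def from_orthogonal(gamma, sigma):
    """Inverse map, back to the original parametrization."""
    return [g * attenuation(sigma) for g in gamma]

# Hypothetical values, for illustration only
beta = [-1.0, 0.5, 0.8]
gamma = to_orthogonal(beta, sigma=2.0)
back = from_orthogonal(gamma, sigma=2.0)
print(C2, gamma)
```

In a model specification the sampler would work with the rescaled parameters and the original ones would be recovered through the inverse map.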

11

WinBUGS posterior densities of $\sigma$ using flat priors: heavy line original parametrization, light line using the approximately orthogonal nuisance parameters

[Figure: posterior density (vertical axis, 0 to 1) against Sigma (horizontal axis, 0 to 5)]

12

Posterior samples: Sigma vs constant term, for original and orthogonal parametrizations

This provides a clue that we can use posterior samples to assess and correct for lack of orthogonality

[Figure: two scatterplots of posterior samples of Sigma against the constant term, one labelled constant-orig and one labelled constant-orthog]

13

Important but confusing issue --- clearly, if we transform the posterior samples as $\{\psi, \lambda\} \to \{\psi, \gamma(\psi, \lambda)\}$ the marginal distribution of $\psi$ is unchanged

Part of the reason reparametrization of $\lambda$ matters is that this is done in the model specification, where in contrast to the above there is no (implicit) Jacobian involved in the density

Having samples from the joint distribution of $\{\psi, \lambda\}$, it would be possible but impractical to divide the density by the Jacobian, to avoid re-doing MCMC

We can achieve this aim otherwise by resampling from the posterior samples with weights inversely proportional to the reciprocal Jacobian
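A minimal sketch of the resampling step, assuming the target is the posterior that a flat prior on a new nuisance parameter gamma(psi, lam) would have produced. Under that assumption each draw gets weight |d gamma / d lam|. The reparametrization used below (gamma = lam / sqrt(1 + 0.304 psi^2)) and the toy draws are illustrative assumptions, not output from the slides' model.

```python
import math
import random

def resample(samples, weight_fn, k, seed=0):
    """Draw k pairs (psi, lam) with replacement, with probability proportional to weight_fn."""
    rng = random.Random(seed)
    weights = [weight_fn(psi, lam) for psi, lam in samples]
    return rng.choices(samples, weights=weights, k=k)

def jacobian_weight(psi, lam):
    """|d gamma / d lam| for the assumed map gamma = lam / sqrt(1 + 0.304 * psi^2)."""
    return 1.0 / math.sqrt(1.0 + 0.304 * psi ** 2)

# Toy stand-in for MCMC output, for illustration only
rng = random.Random(1)
draws = [(rng.uniform(0.5, 4.0), rng.gauss(0.0, 1.0)) for _ in range(5000)]
new_draws = resample(draws, jacobian_weight, k=5000)

mean_before = sum(p for p, _ in draws) / len(draws)
mean_after = sum(p for p, _ in new_draws) / len(new_draws)
print(mean_before, mean_after)
```

Resampling with replacement duplicates some draws, so the effective sample size drops when the weights vary a lot.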

14

Recall that to very good approximation the MCMC posterior, for flat priors, is essentially

\[ \pi(\psi \mid y) \propto L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2} \]

which can be expressed approximately as

\[ \pi(\psi \mid y) \propto L_P(\psi; y)\, |\mathrm{asyvar}(\lambda \mid \psi)|^{1/2} \]

We can approximate the final factor from the MCMC samples at hand, and thus approximate the PL by dividing the posterior density of $\psi$ by our estimate of $|\mathrm{asyvar}(\lambda \mid \psi)|^{1/2}$

There are, however, issues involving the distinction between posterior $\mathrm{var}(\lambda \mid \psi)$ and sampling theory $\mathrm{var}(\hat\lambda_\psi; \psi)$

15

A transparent way to do this, although there may be more accurate ways:

Choose bins for $\psi$ (e.g. 20 using quantiles), for each of these compute $|\mathrm{var}(\lambda \mid \psi)|$, and then smooth (the logs of) these by quadratic regression on the bin classmarks

[Figure: log var(lam|psi) (vertical axis, -10 to -4) against the bin classmarks (horizontal axis, 1.0 to 3.5)]
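The binning recipe above can be sketched as follows, assuming a single (scalar) nuisance parameter so that |var(lam | psi)| is an ordinary variance; with several nuisance parameters the determinant of the within-bin covariance matrix would take its place. All function names and the toy draws are hypothetical.

```python
import math
import random

def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, 3):
            f = m[r][c] / m[c][c]
            for k in range(c, 4):
                m[r][k] -= f * m[c][k]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][k] * x[k] for k in range(r + 1, 3))) / m[r][r]
    return x

def fit_quadratic(xs, ys):
    """Least-squares fit y = c0 + c1*x + c2*x^2 via the normal equations."""
    s = [sum(x ** p for x in xs) for p in range(5)]
    t = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]
    return solve3([[s[i + j] for j in range(3)] for i in range(3)], t)

def smoothed_log_var(samples, n_bins=20):
    """samples: (psi, lam) draws. Returns quadratic coefficients for log var(lam | psi)."""
    ordered = sorted(samples)                 # sort by psi; quantile bins of equal count
    size = len(ordered) // n_bins             # any leftover draws at the top are dropped
    marks, log_vars = [], []
    for b in range(n_bins):
        chunk = ordered[b * size:(b + 1) * size]
        lams = [l for _, l in chunk]
        mean = sum(lams) / len(lams)
        var = sum((l - mean) ** 2 for l in lams) / (len(lams) - 1)
        marks.append(0.5 * (chunk[0][0] + chunk[-1][0]))   # classmark: bin midpoint
        log_vars.append(math.log(var))
    return fit_quadratic(marks, log_vars)

def adjusted_log_density(psi, log_posterior, coef):
    """Subtract half the smoothed log variance from a log posterior density value."""
    return log_posterior - 0.5 * (coef[0] + coef[1] * psi + coef[2] * psi ** 2)

# Toy draws in which var(lam | psi) = psi^2, for illustration only
rng = random.Random(0)
toy = []
for _ in range(20000):
    psi = rng.uniform(1.0, 3.0)
    toy.append((psi, rng.gauss(0.0, psi)))
coef = smoothed_log_var(toy)
log_var_at_2 = coef[0] + coef[1] * 2.0 + coef[2] * 4.0
print(coef, log_var_at_2)
```

The posterior density of psi itself still has to be estimated from the draws, for instance by a histogram or kernel estimate, before `adjusted_log_density` is applied.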

16

Red right: MCMC posterior, original parametrization

Red left (dashed): after above adjustment

Black: PL computed by quadrature

[Figure: the three curves plotted over a horizontal axis from 0 to 5 and a vertical axis from 0 to 1]

17

What should be the meaning of ‘orthogonal’ parameter for use in the APL?

Said earlier that $\hat\lambda_\psi$ should vary slowly with $\psi$, which is related to the more usual definition that the (expected) cross-information terms are zero

But if $\lambda$ satisfies this definition then so does any 1-1 transformation of it --- very unsatisfactory

Further, this could not be a requirement for validity of APL, since linear transformations leave the APL unchanged even though not conforming at all to such requirements

This suggests more difficulties than first thought in utilizing plots such as on slide 12 for such purposes

c

18

There is in principle a reparametrization such that MPL and IL agree (related to Severini, 2007 Biometrika)

The constrained MLE $\hat\lambda_\psi$ can be thought of as a function of $(\hat\psi, \hat\lambda)$ if sufficient, otherwise of $(\hat\psi, \hat\lambda, a)$

If $\hat\lambda$ there is taken as a variable $\lambda^*$, this defines a nuisance parameter representation $\lambda(\psi, \lambda^*)$

This representation of the NP depends on $\hat\psi$ or on $(\hat\psi, a)$ --- no real problem for Bayesian methods

Define $\lambda^*(\psi, \lambda)$ as the inverse function solving the equation $\lambda = \lambda(\psi, \lambda^*)$

Then the MPL is the Laplace approximation to the integrated likelihood based on representation $\lambda^*(\psi, \lambda)$ of the nuisance parameter

19

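Theory for this: Laplace approximations in parametrizations $(\psi, \lambda)$ and $(\psi, \lambda^*)$ differ only by a Jacobian factor

\[ L_P(\psi)\, |j_{\lambda^*\lambda^*}(\hat\lambda^*_\psi)|^{-1/2} = L_P(\psi)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, |\partial\lambda^* / \partial\lambda|_{\text{cnstr MLE}} \]

and we are matching that Jacobian with the final factor of

\[ L_{MP}(\psi; y) = L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, |\partial\hat\lambda_\psi / \partial\hat\lambda|^{-1} \]

Actually need only derivatives $\partial\lambda^* / \partial\lambda = 1 / \{\partial\hat\lambda_\psi / \partial\hat\lambda\}$

Difficulty in all this is in utilizing, for likelihood, variations in $\hat\lambda$ while holding fixed a suitable ancillary “a”

Roughly speaking, a suitable ancillary is the ratio of observed to expected information for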

20

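Ex: Two exponential samples, each of size $n$, with means $\lambda$ and $\psi\lambda$

The “obvious” orthogonal reparametrization has means $\nu/\sqrt{\psi}$, $\nu\sqrt{\psi}$

In the original parametrization, $\hat\lambda_\psi = \hat\lambda\,(1 + \hat\psi/\psi)/2$ provides the corresponding parametric function $\lambda(\psi, \lambda^*) = \lambda^*\,(1 + \hat\psi/\psi)/2$

Set this equal to $\lambda$ and solve for the inverse $\lambda^*(\psi, \lambda) = 2\lambda / (1 + \hat\psi/\psi)$

Then up to Laplace approximation the MPL is the IL for nuisance parameter representation $\lambda^*$

log PL: $-2n \log(1 + \hat\psi/\psi) - n \log(\psi)$

log ACL and MCMC posterior with “obvious” orthog: $-(2n-1) \log(1 + \hat\psi/\psi) - (n - 1/2) \log(\psi)$

but for this example MPL=PL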
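A numerical check of this example, assuming equal sample sizes n, means lam and psi*lam, the geometric-mean parameter nu = lam*sqrt(psi) as the "obvious" orthogonal choice, and invented sample means. It evaluates the profile, modified profile and approximate conditional log likelihoods from their definitions and compares them with the closed forms, up to additive constants.

```python
import math

def lam_hat_psi(psi, ybar1, ybar2):
    """Constrained MLE of lam for given psi: equals lam_hat*(1 + psi_hat/psi)/2."""
    return (ybar1 + ybar2 / psi) / 2.0

def log_lik(psi, lam, ybar1, ybar2, n):
    return -2 * n * math.log(lam) - n * math.log(psi) - n * (ybar1 + ybar2 / psi) / lam

def log_pl(psi, ybar1, ybar2, n):
    return log_lik(psi, lam_hat_psi(psi, ybar1, ybar2), ybar1, ybar2, n)

def log_mpl(psi, ybar1, ybar2, n):
    lhp = lam_hat_psi(psi, ybar1, ybar2)
    psi_hat = ybar2 / ybar1
    j = 2 * n / lhp ** 2                       # observed information for lam
    dlhp = (1 + psi_hat / psi) / 2.0           # d lam_hat_psi / d lam_hat, psi_hat fixed
    return log_pl(psi, ybar1, ybar2, n) - 0.5 * math.log(j) - math.log(dlhp)

def log_acl(psi, ybar1, ybar2, n):
    nu_hat_psi = math.sqrt(psi) * lam_hat_psi(psi, ybar1, ybar2)
    j_nu = 2 * n / nu_hat_psi ** 2             # observed information for nu
    return log_pl(psi, ybar1, ybar2, n) - 0.5 * math.log(j_nu)

def log_acl_formula(psi, psi_hat, n):
    return -(2 * n - 1) * math.log(1 + psi_hat / psi) - (n - 0.5) * math.log(psi)

# Hypothetical sample means, for illustration only
n, ybar1, ybar2 = 10, 1.2, 3.0
psi_hat = ybar2 / ybar1
psis = [0.5 + 0.25 * i for i in range(30)]
d_mpl = [log_mpl(p, ybar1, ybar2, n) - log_pl(p, ybar1, ybar2, n) for p in psis]
d_acl = [log_acl(p, ybar1, ybar2, n) - log_acl_formula(p, psi_hat, n) for p in psis]
print(max(d_mpl) - min(d_mpl), max(d_acl) - min(d_acl))
```

Both spreads come out at rounding-error level, meaning the MPL differs from the PL only by a constant and the ACL matches its closed form up to a constant.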

21

Our MCMC example is not very suitable for investigating all this --- MPL is (again) very near the PL

When likelihood is intractable, or when the MLE is not sufficient, can we use the MCMC to approximate the MPL?

Is it better to approximate the reparametrization for which IL = MPL, or better to compute the required Jacobian more directly?

An issue is whether there can, in principle, be enough information in the likelihood, or posterior samples, to approximate the MPL

Can we tell from the posterior samples how the joint distribution would change for slightly different data?

22

There is yet another parametrization such that locally the nuisance parameter becomes a translation parameter

In this parametrization the answer to that question is “yes”

An aim is to capitalize on this without solving for that new parametrization, perhaps taking advantage of the fact that the product of the final two terms in the MPL is invariant to reparametrization

Have had some success for a single nuisance parameter, but there remains much to do

\[ L_{MP}(\psi; y) = L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, |\partial\hat\lambda_\psi / \partial\hat\lambda|^{-1} \]