Inference for misspecified models
Richard Wilkinson, University of Sheffield


  • Inference for misspecified models

    Richard Wilkinson

    University of Sheffield

  • Mechanistic models

    Models describe hypothesised relationships between variables.

    A mechanistic model (e.g. ODE/PDE models):

    explains how/why the variables interact the way they do

    parameters may have a physical meaning

    often an imperfect representation of reality, but may be the only link between the quantity of interest and the data

    e.g. Atrial fibrillation

    http://staffwww.dcs.shef.ac.uk/people/R.Clayton/MoviesAndPictures/fk3d-3panel.mpg


  • UQ in Patient Specific Cardiac Models

    With Sam Coveney, Richard Clayton, Steve Neiderer, Jeremy Oakley, ...

    Atrial fibrillation (AF): rapid and uncoordinated electrical activation (arrhythmia) leading to poor mechanical function.

    Affects around 610,000 people in the UK.

    Catheter ablation removes/isolates the pathological tissue that sustains/initiates AF.

    40% of patients subsequently experience atrial tachycardia (AT).

    Aim: predict which AF patients will develop AT following ablation, and then treat for both in a single procedure.

    We use complex electrophysiology simulations, combined with sparse and noisy clinical data, to:

    infer tissue properties, including regions of fibrotic material

    predict AT pathways

    aid clinical decision making (accounting for uncertainty)

    However, our simulator is imperfect. How should we proceed?


  • Inference under discrepancy

    How should we do inference if the model is imperfect?

    Data generating process: y ∼ G

    Model (complex simulator, finite-dimensional parameter): F = {Fθ : θ ∈ Θ}

    If G = Fθ0 ∈ F then we know what to do¹. How should we proceed if G ∉ F?

    Note: interest lies in inference of θ,

    θ̂ ± σ or π(θ | y),

    not calibrated prediction:

    π(y′ | y) = ∫ Fθ(y′) π(θ | y) dθ

    ¹ Even if we can't agree about it!


  • Maximum likelihood

    Maximum likelihood estimator:

    θ̂n = arg maxθ log π(y | θ)

    If G = Fθ0 ∈ F, then (under some conditions)

    θ̂n → θ0 almost surely as n → ∞

    √n (θ̂n − θ0) ⇒ N(0, I⁻¹(θ0)) in distribution

    Asymptotic consistency, efficiency, normality.

    If G ∉ F,

    θ̂n → θ∗ = arg minθ DKL(G, Fθ) = arg minθ ∫ log(dG/dFθ) dG   a.s.

    √n (θ̂n − θ∗) ⇒ N(0, V⁻¹) in distribution
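The pseudo-true value θ∗ is easy to see numerically. A minimal sketch (my own illustration, not from the slides): fit a Gaussian model to data drawn from a Gamma distribution. The Gaussian MLE is the sample mean and variance, and for a Gaussian model class the KL minimiser is exactly the moment-matching parameter, so the MLE converges to that pseudo-true value rather than to any "true" parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data-generating process G: a Gamma distribution (not in the model class).
y = rng.gamma(shape=2.0, scale=1.5, size=100_000)

# Model class F = {N(mu, sigma^2)}: the Gaussian MLE is the sample mean/variance.
mu_hat, sigma2_hat = y.mean(), y.var()

# For a Gaussian model class the KL minimiser theta* matches the first two
# moments of G, so the MLE converges to the pseudo-true value
# (E[y], Var[y]) = (3, 4.5), not to any "true" Gaussian parameter.
print(mu_hat, sigma2_hat)
```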


  • Bayes

    Bayesian posterior: π(θ | y) ∝ π(y | θ) π(θ)

    If G = Fθ0 ∈ F,

    π(θ | y) ⇒ N(θ0, I⁻¹(θ0)) as n → ∞

    Bernstein-von Mises theorem: we forget the prior, and get asymptotic concentration and normality. This also requires (a long list of) identifiability conditions to hold.

    If G ∉ F, we still get asymptotic concentration (and possibly normality), but to θ∗ (the pseudo-true value).

    There is no obvious meaning for Bayesian analysis in this case.

    Often with non-parametric models (e.g. GPs), we don't even get this convergence to the pseudo-true value, due to lack of identifiability.


  • An appealing idea: model the discrepancy

    Kennedy and O'Hagan 2001

    Can we model our way out of trouble by expanding F into a non-parametric world?

    Grey-box models

    One way to expand the class of models is by adding a Gaussian process (GP) to the simulator.

    If fθ(x) is our simulator and y the observation, then perhaps we can correct f using the model

    y = fθ∗(x) + δ(x), where δ(·) ∼ GP,

    and jointly infer θ∗ and δ(·).


  • An appealing, but flawed, idea

    Kennedy and O'Hagan 2001, Brynjarsdottir and O'Hagan 2014

    Simulator: fθ(x) = θx        Reality: g(x) = θx / (1 + x/a), with θ = 0.65, a = 20

    [Figure: y against x over x ∈ [1, 4]; solid = model with true theta, dashed = truth.]
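The bias this toy problem produces can be reproduced in a few lines. A minimal sketch (my own construction around the slide's simulator/reality pair, with a plain noiseless least-squares fit standing in for the full Bayesian calibration):

```python
import numpy as np

# Brynjarsdottir and O'Hagan (2014) toy problem: simulator f_theta(x) = theta*x,
# reality g(x) = theta*x / (1 + x/a) with theta = 0.65, a = 20.
theta_true, a = 0.65, 20.0
x = np.linspace(0.1, 4.0, 50)
y = theta_true * x / (1.0 + x / a)          # noiseless "observations" of reality

# Least-squares fit of the misspecified simulator: theta_hat = sum(x*y)/sum(x^2).
theta_hat = (x * y).sum() / (x * x).sum()

# theta_hat is biased below the true value: the fit sacrifices parameter
# accuracy to compensate for the missing 1/(1 + x/a) term.
print(theta_hat)
```

The fitted value sits well below 0.65, even with no observation noise: the parameter estimate absorbs the structural error.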

  • An appealing, but flawed, idea

    Bolting on a GP can correct your predictions², but won't necessarily fix your inference:

    No discrepancy: y = fθ(x) + N(0, σ²), with θ ∼ N(0, 100), σ² ∼ Γ⁻¹(0.001, 0.001)

    GP discrepancy: y = fθ(x) + δ(x) + N(0, σ²), with δ(·) ∼ GP(·, ·) with objective priors

    [Figure: posterior histograms of θ (on [0.2, 1.0]) under four treatments of the model discrepancy (MD): no MD, a GP prior on the MD, uniform MD on [−1, 1], and uniform MD on [−0.5, 0.5].]

    ² As long as you are not extrapolating.

  • Dynamic discrepancy

    Time-structured problems give us many more opportunities to learn the model discrepancy.

    Consider the state space model:

    xt+1 = fθ(xt) + et,    yt = g(xt) + εt

    Can we correct errors in f or g? E.g. xt+1 = fθ(xt) + δ(xt) + et

    [Figure 6.6, from a chapter on Gaussian process models of simulator discrepancy: the learnt discrepancy (solid black line) and the true discrepancy function (red line) from using different incorrect simulators with GP discrepancy, e.g. f(xt, ut) = 0.5xt + 8ut + GP(0, K), f(xt, ut) = 25xt/(1 + xt²) + 8ut + GP(0, K), f(xt, ut) = 0.5 + 8ut + GP(0, K), and f(xt, ut) = 8ut + GP(0, K), each plotted as x(t+1) − f(xt, ut) against x(t). Data are generated from Equations 6.4.1 and 6.4.2 with true parameters (q², r²) = (0.1, 1) (top row), (1, 0.1) (middle row), and (1, 100) (bottom row).]

    Fitting a GP is challenging: PGAS works but is expensive; reduced-rank methods are better. Variational approaches (for parametric models) look promising...
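A minimal numerical sketch of the idea (my own construction, loosely matching the transition in the figure labels, and not the chapter's actual GP/PGAS fitting procedure): simulate a nonlinear state-space model, drop its nonlinear term from the simulator, and check that the one-step residuals x(t+1) − f(xt, ut) expose the missing discrepancy. A real treatment would then fit a GP to these residuals.

```python
import numpy as np

rng = np.random.default_rng(1)

# True transition (a standard nonlinear benchmark, as in the figure labels):
# x_{t+1} = 0.5 x_t + 25 x_t / (1 + x_t^2) + 8 u_t + e_t.
T = 2000
u = np.cos(1.2 * np.arange(T))            # known input sequence
x = np.zeros(T + 1)
for t in range(T):
    x[t + 1] = (0.5 * x[t] + 25 * x[t] / (1 + x[t] ** 2)
                + 8 * u[t] + rng.normal(0, 0.3))

# Misspecified simulator f_theta drops the nonlinear term.
f = 0.5 * x[:-1] + 8 * u

# One-step residuals x_{t+1} - f(x_t, u_t) expose the discrepancy delta(x_t):
# they should track the missing term 25 x / (1 + x^2) almost exactly.
resid = x[1:] - f
missing = 25 * x[:-1] / (1 + x[:-1] ** 2)
corr = np.corrcoef(resid, missing)[0, 1]
print(corr)
```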


  • Dangers of non-parametric model extensions

    There are (at least) two problems with this approach:

    We may still find G ∉ F

    Identifiability:

    A GP is an incredibly complex infinite-dimensional model, which is not necessarily identified even asymptotically. The posterior can concentrate not on a point, but on some sub-manifold of parameter space, and the projection of the prior onto this space continues to impact the posterior even as more and more data are collected.

    i.e. we never forget the prior, but the prior is too complex to understand.

    Brynjarsdottir and O'Hagan 2014 try to model their way out of trouble with prior information:

    δ(0) = 0,    δ′(x) ≥ 0

    Great if you have this information.


  • Inferential approaches

    Instead of trying to model our way out of trouble, can we modify the inferential approach instead?

    Common approaches to inference:

    maximum likelihood / minimum-distance

    Bayes(ish)

    history matching (HM) / ABC-type methods (thresholding)

    How do these approaches behave for well-specified and mis-specified models?

    Try to understand why (at least anecdotally) HM and ABC seem to work well in mis-specified cases.

    The big question³ is: what properties would we like our inferential approach to possess?

    ³ To which I have no answer.


  • ABC: approximate Bayesian computation

    Rejection Algorithm

    Draw θ from the prior π(·)

    Accept θ with probability π(D | θ)

    Accepted θ are independent draws from the posterior distribution, π(θ | D).

    If the likelihood, π(D | θ), is unknown:

    'Mechanical' Rejection Algorithm

    Draw θ from π(·)

    Simulate X ∼ f(θ) from the computer model

    Accept θ if D = X, i.e. if the computer output equals the observation

    The acceptance rate is ∫ P(D | θ) π(θ) dθ = P(D).


  • Rejection ABC

    If P(D) is small (or D is continuous), we will rarely accept any θ. Instead, there is an approximate version:

    Uniform Rejection Algorithm

    Draw θ from π(θ)

    Simulate X ∼ f(θ)

    Accept θ if ρ(D, X) ≤ ε

    ε reflects the tension between computability and accuracy.

    As ε → ∞, we get observations from the prior, π(θ).

    If ε = 0, we generate observations from π(θ | D).

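The uniform rejection algorithm is a few lines of code. A minimal sketch, using the toy model from the example slides that follow (θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²), ρ(D, X) = |D − X|, D = 2); the tolerance ε = 1 and sample size are my own choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: theta ~ U[-10, 10], X | theta ~ N(2(theta+2)theta(theta-2), 0.1 + theta^2).
D, eps, n = 2.0, 1.0, 200_000
theta = rng.uniform(-10, 10, size=n)
mean = 2 * (theta + 2) * theta * (theta - 2)
X = rng.normal(mean, np.sqrt(0.1 + theta ** 2))

# Uniform rejection: keep theta whenever rho(D, X) = |D - X| <= eps.
accepted = theta[np.abs(D - X) <= eps]

# As eps -> 0 these become draws from pi(theta | D); here the accepted sample
# concentrates near the roots of 2(theta+2)theta(theta-2) = 2.
print(len(accepted))
```

Note that the accepted sample is multimodal: the cubic mean function equals D at several values of θ, and rejection ABC keeps all of them.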

  • ε = 10

    [Figure: left, θ against simulated D for draws from the prior, with the acceptance band D ± ε marked; right, the resulting ABC posterior density for θ against the true posterior.]

    θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)

    ρ(D, X) = |D − X|, D = 2

  • ε = 7.5

    [Figure: left, θ against simulated D for draws from the prior, with the acceptance band D ± ε marked; right, the resulting ABC posterior density for θ against the true posterior.]

    θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)

    ρ(D, X) = |D − X|, D = 2

  • ε = 5

    [Figure: left, θ against simulated D for draws from the prior, with the acceptance band D ± ε marked; right, the resulting ABC posterior density for θ against the true posterior.]

    θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)

    ρ(D, X) = |D − X|, D = 2

  • ε = 2.5

    [Figure: left, θ against simulated D for draws from the prior, with the acceptance band D ± ε marked; right, the resulting ABC posterior density for θ against the true posterior.]

    θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)

    ρ(D, X) = |D − X|, D = 2

  • ε = 1

    [Figure: as above with ε = 1 - scatter of θ against simulated X ("theta vs D") with the narrower acceptance band D ± ε, and the correspondingly sharper ABC density estimate against the true posterior ("ABC" vs "True").]

    θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)

    ρ(D, X) = |D − X|, D = 2

  • Rejection ABC

    If the data are too high dimensional we never observe simulations that are 'close' to the field data - the curse of dimensionality.

    Reduce the dimension using summary statistics, S(D).

    Approximate Rejection Algorithm With Summaries

    Draw θ from π(θ)

    Simulate X ∼ f(θ)

    Accept θ if ρ(S(D), S(X)) < ε

    If S is sufficient this is equivalent to the previous algorithm.

    Simple → Popular with non-statisticians


  • History matching and ABC

    History matching seeks to find a NROY (not ruled out yet) set

    Pθ = {θ : SHM(F̂θ, y) ≤ 3}

    where

    SHM(Fθ) = |EFθ(Y) − y| / √VarFθ(Y)

    ABC approximates the posterior as

    πε(θ) ∝ π(θ) E(I{S(F̂θ, y) ≤ ε})

    for some choice of S (typically S(F̂θ, y) = ρ(η(y), η(y′)) where y′ ∼ Fθ) and ε.

    They have thresholding of a score in common and are algorithmically comparable.
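The NROY computation is easy to sketch for the toy simulator from the earlier slides, where EFθ(Y) and VarFθ(Y) are available in closed form; the grid and the choice of example model are illustrative assumptions.

```python
import numpy as np

def implausibility(mean, var, y):
    """History-matching score S_HM = |E(Y) - y| / sqrt(Var(Y))."""
    return np.abs(mean - y) / np.sqrt(var)

# Toy simulator from the earlier slides, with known mean and variance:
#   E(X | theta) = 2(theta + 2) theta (theta - 2),  Var(X | theta) = 0.1 + theta^2
y_obs = 2.0
theta = np.linspace(-10, 10, 2001)
mean = 2 * (theta + 2) * theta * (theta - 2)
var = 0.1 + theta**2

S = implausibility(mean, var, y_obs)
nroy = theta[S <= 3.0]    # Not Ruled Out Yet set, thresholded at 3
```

In practice the mean and variance would come from an emulator of the simulator, and the variance would also include emulator and discrepancy terms; here only the sampling variance is used.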


  • History matching and ABC

    These methods (anecdotally) seem to work better in mis-specified situations.

    Why?

    They differ from likelihood based approaches in that:

    They only use some aspect of the simulator output

    – Typically we hand pick which simulator outputs to compare, and weight them on a case by case basis.

    Potentially use generalised scores/loss-functions

    The thresholding-type nature potentially makes them somewhat conservative.


  • What makes a good inferential approach?

    Do any of these approaches have favourable properties/characteristics for inference under discrepancy? Particularly when the discrepancy model is crude?

    Consistency?

    – I don't want inconsistency.

    Asymptotic concentration or normality?

    Frequency properties?

    – I wouldn't object, but this seems impossible for subjective priors.

    Coherence?

    Robustness to small mis-specifications?


  • Generalized scores

    Likelihood based methods are notoriously sensitive to mis-specification.

    A single outlier can make our inference arbitrarily bad.

    The likelihood can pick up on unintended aspects of the data (e.g. tail behaviour).

    Consider scoring rules instead. If we forecast F and observe y, then we receive the score S(F, y).

    S is a proper score if

    G = argmin_F E_{Y∼G} S(F, Y)

    i.e. predicting G gives the best possible score. This encourages honest reporting.

    Examples:

    Log-likelihood: S(F, y) = − log f(y)

    Tsallis score: S(F, y) = (γ − 1) ∫ f(x)^γ dx − γ f(y)^(γ−1)
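To make the contrast concrete, here is a sketch comparing the log score with the Tsallis score (written with a single exponent γ, taken as γ = 2) for a standard normal forecast; the outlier value y = 8 is an illustrative choice.

```python
import numpy as np

def norm_pdf(x):
    """Standard normal density."""
    return np.exp(-0.5 * np.asarray(x) ** 2) / np.sqrt(2 * np.pi)

def log_score(pdf, y):
    """S(F, y) = -log f(y)."""
    return -np.log(pdf(y))

def tsallis_score(pdf, y, gamma=2.0):
    """S(F, y) = (gamma - 1) * int f(x)^gamma dx - gamma * f(y)^(gamma - 1)."""
    x = np.linspace(-20.0, 20.0, 200_001)          # Riemann sum for the integral
    integral = (pdf(x) ** gamma).sum() * (x[1] - x[0])
    return (gamma - 1) * integral - gamma * pdf(y) ** (gamma - 1)

# An outlier at y = 8 makes the log score explode, while the Tsallis
# score stays bounded - the robustness property discussed above.
print(log_score(norm_pdf, 8.0))       # ~32.92
print(tsallis_score(norm_pdf, 8.0))   # ~0.2821, bounded
```

The log score is unbounded in y (it grows like y²/2 here), whereas the Tsallis score is bounded below by −γ sup f^(γ−1) and above by (γ − 1)∫f^γ, which is why a single outlier cannot dominate the total score.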


  • Minimum scoring rule estimation (Dawid et al. 2014 etc) uses

    θ̂ = argmin_θ S(Fθ, y)

    For proper scores

    E_{θ0}( ∂/∂θ S(Fθ, y) |_{θ=θ0} ) = ∂/∂θ E_{θ0} S(Fθ, y) |_{θ=θ0} = 0

    so we have an unbiased estimating equation, and hence get asymptotic consistency for well-specified models. We also get asymptotic normality.
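A sketch of minimum scoring rule estimation with the Tsallis score (γ = 2) for an N(θ, 1) model. The contaminated sample and the grid optimiser are illustrative assumptions, but they show the minimum-score estimator ignoring a gross outlier that drags the maximum likelihood estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

def norm_pdf(x):
    return np.exp(-0.5 * np.asarray(x) ** 2) / np.sqrt(2 * np.pi)

# Data: 100 draws from N(0, 1) plus one gross outlier
y = np.concatenate([rng.normal(0.0, 1.0, 100), [50.0]])

def total_tsallis(theta):
    """Sum of Tsallis scores (gamma = 2) for F_theta = N(theta, 1).
    Here int f_theta^2 dx = 1/(2 sqrt(pi)) for every theta, so
    S(F_theta, y_i) = 1/(2 sqrt(pi)) - 2 phi(y_i - theta)."""
    return len(y) / (2 * np.sqrt(np.pi)) - 2 * norm_pdf(y - theta).sum()

grid = np.linspace(-5.0, 5.0, 2001)
theta_hat = grid[np.argmin([total_tsallis(t) for t in grid])]  # minimum-score estimate
mle = y.mean()    # minimum log-score (maximum likelihood) estimate
```

theta_hat stays near 0, while the MLE is pulled towards the outlier by roughly 50/101 ≈ 0.5.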


  • Dawid et al. 2014 show that if

    – ∇θ fθ(x) is bounded in x for all θ

    – the Bregman gauge of the scoring rule is locally bounded

    then the minimum scoring rule estimator θ̂ is B-robust, i.e. it has a bounded influence function

    IF(x; θ̂, Fθ) = lim_{ε→0} [θ̂(ε δx + (1 − ε) Fθ) − θ̂(Fθ)] / ε

    i.e. if Fθ is contaminated by an outlier at x, this does not unduly affect the inference.

    Note both ABC and HM are B-robust in this sense, but using the log-likelihood is not.

    What type of robustness do we want here?
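The influence function can be approximated numerically by contaminating a sample and differencing. The two estimators below (mean vs median) are stand-ins chosen to illustrate unbounded vs bounded influence; they are not the estimators from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.normal(0.0, 1.0, 5000)    # sample standing in for F_theta

def empirical_influence(estimator, sample, x, eps=1e-3):
    """Finite-difference influence function
    [T((1 - eps) F + eps delta_x) - T(F)] / eps,
    approximated by appending copies of x with total weight ~ eps."""
    k = max(1, round(eps * len(sample) / (1 - eps)))
    contaminated = np.concatenate([sample, np.full(k, x)])
    eps_actual = k / len(contaminated)
    return (estimator(contaminated) - estimator(sample)) / eps_actual

# Outlier at x = 100: the mean has unbounded influence (IF grows like x),
# while the median's influence stays bounded (IF ~ 1/(2 f(median)) here)
if_mean = empirical_influence(np.mean, base, 100.0)
if_median = empirical_influence(np.median, base, 100.0)
```

Moving the outlier further out makes if_mean grow without bound while if_median barely changes, which is the distinction B-robustness captures.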


  • Bayes like approaches

    What about Bayesian like approaches with generalized scores?

    [Image: title page of Bissiri, P. G., Holmes, C. C. and Walker, S. G. (2016), "A general framework for updating belief distributions", J. R. Statist. Soc. B, 78(5), 1103-1130.]

    Bissiri et al. 2016 consider updating prior beliefs when the parameter θ is connected to observations via a loss function L(θ, y). They argue the update must be of the form

    π(θ|x) ∝ exp(−L(θ, x))π(θ)

    via coherency arguments. Note using the log-likelihood as the loss funct
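The update π(θ|x) ∝ exp(−L(θ, x))π(θ) can be sketched on a grid. The data, prior, and squared-error loss below are illustrative assumptions; taking L to be the negative log-likelihood would recover standard Bayesian updating.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, 50)                     # observations

theta = np.linspace(-3.0, 5.0, 4001)
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * theta**2 / 10.0**2)        # vague N(0, 10^2) prior, unnormalised

# General Bayes update: pi(theta | x) ∝ exp(-L(theta, x)) pi(theta),
# here with cumulative squared-error loss targeting the mean
loss = 0.5 * ((x[None, :] - theta[:, None]) ** 2).sum(axis=1)
post = np.exp(-(loss - loss.min())) * prior      # subtract min for numerical stability
post /= post.sum() * dtheta                      # normalise on the grid

post_mean = (theta * post).sum() * dtheta        # concentrates near the sample mean
```

With this loss the posterior is a Gaussian centred (up to slight prior shrinkage) on the sample mean; scaling the loss by a learning-rate parameter would control how fast the belief update concentrates, a tuning question the general Bayes literature discusses at length.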