Inference for misspecified models
Richard Wilkinson, University of Sheffield


  • Inference for misspecified models

    Richard Wilkinson

    University of Sheffield

  • Mechanistic models

    Models describe hypothesised relationships between variables.

    A mechanistic model (e.g. ODE/PDE models):

    explains how/why the variables interact the way they do

    parameters may have a physical meaning

    often an imperfect representation of reality, but may be the only link between the quantity of interest and the data

    e.g. Atrial fibrillation

    http://staffwww.dcs.shef.ac.uk/people/R.Clayton/MoviesAndPictures/fk3d-3panel.mpg


  • UQ in Patient Specific Cardiac Models

    With Sam Coveney, Richard Clayton, Steve Neiderer, Jeremy Oakley, ...

    Atrial fibrillation (AF): rapid and uncoordinated electrical activation (arrhythmia) leading to poor mechanical function.

    Affects around 610,000 people in the UK.

    Catheter ablation removes/isolates the pathological tissue that sustains/initiates AF.

    40% of patients subsequently experience atrial tachycardia (AT).

    Aim: predict which AF patients will develop AT following ablation, and then treat for both in a single procedure.

    We use complex electrophysiology simulations, combined with sparse and noisy clinical data, to:

    infer tissue properties, including regions of fibrotic material

    predict AT pathways

    aid clinical decision making (accounting for uncertainty)

    However, our simulator is imperfect. How should we proceed?


  • Inference under discrepancy

    How should we do inference if the model is imperfect?

    Data generating process: y ∼ G

    Model (complex simulator, finite-dimensional parameter): F = {Fθ : θ ∈ Θ}

    If G = Fθ0 ∈ F then we know what to do¹. How should we proceed if G ∉ F?

    Note: interest lies in inference of θ,

    θ̂ ± σ or π(θ | y),

    not calibrated prediction:

    π(y′ | y) = ∫ Fθ(y′) π(θ | y) dθ

    ¹ Even if we can't agree about it!


  • Maximum likelihood

    Maximum likelihood estimator:

    θ̂n = arg maxθ log π(y | θ)

    If G = Fθ0 ∈ F, then (under some conditions)

    θ̂n → θ0 almost surely as n → ∞

    √n (θ̂n − θ0) ⇒ N(0, I⁻¹(θ0)) in distribution

    Asymptotic consistency, efficiency, normality.

    If G ∉ F,

    θ̂n → θ∗ = arg minθ DKL(G, Fθ) = arg minθ ∫ log(dG/dFθ) dG   a.s.

    √n (θ̂n − θ∗) ⇒ N(0, V⁻¹) in distribution
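The pseudo-true value θ∗ is easy to see numerically. A minimal sketch (my own illustration, not from the slides): fit a Gaussian model to data drawn from a Gamma distribution. The Gaussian MLE is the sample mean and variance, and for a Gaussian model class the KL minimiser is exactly the moment-matching parameter, so the MLE converges to that pseudo-true value rather than to any "true" parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data-generating process G: a Gamma distribution (not in the model class).
y = rng.gamma(shape=2.0, scale=1.5, size=100_000)

# Model class F = {N(mu, sigma^2)}: the Gaussian MLE is the sample mean/variance.
mu_hat, sigma2_hat = y.mean(), y.var()

# For a Gaussian model class the KL minimiser theta* matches the first two
# moments of G, so the MLE converges to the pseudo-true value
# (E[y], Var[y]) = (3, 4.5), not to any "true" Gaussian parameter.
print(mu_hat, sigma2_hat)
```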


  • Bayes

    Bayesian posterior: π(θ | y) ∝ π(y | θ) π(θ)

    If G = Fθ0 ∈ F,

    π(θ | y) ⇒ N(θ0, I⁻¹(θ0)) as n → ∞

    Bernstein-von Mises theorem: we forget the prior, and get asymptotic concentration and normality. This also requires (a long list of) identifiability conditions to hold.

    If G ∉ F, we still get asymptotic concentration (and possibly normality), but to θ∗ (the pseudo-true value).

    There is no obvious meaning for Bayesian analysis in this case.

    Often with non-parametric models (e.g. GPs), we don't even get this convergence to the pseudo-true value, due to lack of identifiability.


  • An appealing idea: model the discrepancy

    Kennedy and O'Hagan 2001

    Can we model our way out of trouble by expanding F into a non-parametric world?

    Grey-box models

    One way to expand the class of models is by adding a Gaussian process (GP) to the simulator.

    If fθ(x) is our simulator and y the observation, then perhaps we can correct f using the model

    y = fθ∗(x) + δ(x), where δ(·) ∼ GP,

    and jointly infer θ∗ and δ(·).


  • An appealing, but flawed, idea

    Kennedy and O'Hagan 2001, Brynjarsdottir and O'Hagan 2014

    Simulator: fθ(x) = θx        Reality: g(x) = θx / (1 + x/a), with θ = 0.65, a = 20

    [Figure: y against x over x ∈ [1, 4]; solid = model with true theta, dashed = truth.]
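The bias this toy problem produces can be reproduced in a few lines. A minimal sketch (my own construction around the slide's simulator/reality pair, with a plain noiseless least-squares fit standing in for the full Bayesian calibration):

```python
import numpy as np

# Brynjarsdottir and O'Hagan (2014) toy problem: simulator f_theta(x) = theta*x,
# reality g(x) = theta*x / (1 + x/a) with theta = 0.65, a = 20.
theta_true, a = 0.65, 20.0
x = np.linspace(0.1, 4.0, 50)
y = theta_true * x / (1.0 + x / a)          # noiseless "observations" of reality

# Least-squares fit of the misspecified simulator: theta_hat = sum(x*y)/sum(x^2).
theta_hat = (x * y).sum() / (x * x).sum()

# theta_hat is biased below the true value: the fit sacrifices parameter
# accuracy to compensate for the missing 1/(1 + x/a) term.
print(theta_hat)
```

The fitted value sits well below 0.65, even with no observation noise: the parameter estimate absorbs the structural error.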

  • An appealing, but flawed, idea

    Bolting on a GP can correct your predictions², but won't necessarily fix your inference:

    No discrepancy: y = fθ(x) + N(0, σ²), with θ ∼ N(0, 100), σ² ∼ Γ⁻¹(0.001, 0.001)

    GP discrepancy: y = fθ(x) + δ(x) + N(0, σ²), with δ(·) ∼ GP(·, ·) with objective priors

    [Figure: posterior histograms of θ (on [0.2, 1.0]) under four treatments of the model discrepancy (MD): no MD, a GP prior on the MD, uniform MD on [−1, 1], and uniform MD on [−0.5, 0.5].]

    ² As long as you are not extrapolating.

  • Dynamic discrepancy

    Time-structured problems give us many more opportunities to learn the model discrepancy.

    Consider the state space model:

    xt+1 = fθ(xt) + et,    yt = g(xt) + εt

    Can we correct errors in f or g? E.g. xt+1 = fθ(xt) + δ(xt) + et

    [Figure 6.6, from a chapter on Gaussian process models of simulator discrepancy: the learnt discrepancy (solid black line) and the true discrepancy function (red line) from using different incorrect simulators with GP discrepancy, e.g. f(xt, ut) = 0.5xt + 8ut + GP(0, K), f(xt, ut) = 25xt/(1 + xt²) + 8ut + GP(0, K), f(xt, ut) = 0.5 + 8ut + GP(0, K), and f(xt, ut) = 8ut + GP(0, K), each plotted as x(t+1) − f(xt, ut) against x(t). Data are generated from Equations 6.4.1 and 6.4.2 with true parameters (q², r²) = (0.1, 1) (top row), (1, 0.1) (middle row), and (1, 100) (bottom row).]

    Fitting a GP is challenging: PGAS works but is expensive; reduced-rank methods are better. Variational approaches (for parametric models) look promising...
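A minimal numerical sketch of the idea (my own construction, loosely matching the transition in the figure labels, and not the chapter's actual GP/PGAS fitting procedure): simulate a nonlinear state-space model, drop its nonlinear term from the simulator, and check that the one-step residuals x(t+1) − f(xt, ut) expose the missing discrepancy. A real treatment would then fit a GP to these residuals.

```python
import numpy as np

rng = np.random.default_rng(1)

# True transition (a standard nonlinear benchmark, as in the figure labels):
# x_{t+1} = 0.5 x_t + 25 x_t / (1 + x_t^2) + 8 u_t + e_t.
T = 2000
u = np.cos(1.2 * np.arange(T))            # known input sequence
x = np.zeros(T + 1)
for t in range(T):
    x[t + 1] = (0.5 * x[t] + 25 * x[t] / (1 + x[t] ** 2)
                + 8 * u[t] + rng.normal(0, 0.3))

# Misspecified simulator f_theta drops the nonlinear term.
f = 0.5 * x[:-1] + 8 * u

# One-step residuals x_{t+1} - f(x_t, u_t) expose the discrepancy delta(x_t):
# they should track the missing term 25 x / (1 + x^2) almost exactly.
resid = x[1:] - f
missing = 25 * x[:-1] / (1 + x[:-1] ** 2)
corr = np.corrcoef(resid, missing)[0, 1]
print(corr)
```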


  • Dangers of non-parametric model extensions

    There are (at least) two problems with this approach:

    We may still find G ∉ F

    Identifiability:

    A GP is an incredibly complex infinite-dimensional model, which is not necessarily identified even asymptotically. The posterior can concentrate not on a point, but on some sub-manifold of parameter space, and the projection of the prior onto this space continues to impact the posterior even as more and more data are collected.

    i.e. we never forget the prior, but the prior is too complex to understand.

    Brynjarsdottir and O'Hagan 2014 try to model their way out of trouble with prior information:

    δ(0) = 0,    δ′(x) ≥ 0

    Great if you have this information.


  • Inferential approaches

    Instead of trying to model our way out of trouble, can we modify the inferential approach instead?

    Common approaches to inference:

    maximum likelihood / minimum-distance

    Bayes(ish)

    history matching (HM) / ABC-type methods (thresholding)

    How do these approaches behave for well-specified and mis-specified models?

    Try to understand why (at least anecdotally) HM and ABC seem to work well in mis-specified cases.

    The big question³ is: what properties would we like our inferential approach to possess?

    ³ To which I have no answer.


  • ABC: approximate Bayesian computation

    Rejection Algorithm

    Draw θ from the prior π(·)

    Accept θ with probability π(D | θ)

    Accepted θ are independent draws from the posterior distribution, π(θ | D).

    If the likelihood, π(D | θ), is unknown:

    'Mechanical' Rejection Algorithm

    Draw θ from π(·)

    Simulate X ∼ f(θ) from the computer model

    Accept θ if D = X, i.e. if the computer output equals the observation

    The acceptance rate is ∫ P(D | θ) π(θ) dθ = P(D).


  • Rejection ABC

    If P(D) is small (or D is continuous), we will rarely accept any θ. Instead, there is an approximate version:

    Uniform Rejection Algorithm

    Draw θ from π(θ)

    Simulate X ∼ f(θ)

    Accept θ if ρ(D, X) ≤ ε

    ε reflects the tension between computability and accuracy.

    As ε → ∞, we get observations from the prior, π(θ).

    If ε = 0, we generate observations from π(θ | D).

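The uniform rejection algorithm is a few lines of code. A minimal sketch, using the toy model from the example slides that follow (θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²), ρ(D, X) = |D − X|, D = 2); the tolerance ε = 1 and sample size are my own choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: theta ~ U[-10, 10], X | theta ~ N(2(theta+2)theta(theta-2), 0.1 + theta^2).
D, eps, n = 2.0, 1.0, 200_000
theta = rng.uniform(-10, 10, size=n)
mean = 2 * (theta + 2) * theta * (theta - 2)
X = rng.normal(mean, np.sqrt(0.1 + theta ** 2))

# Uniform rejection: keep theta whenever rho(D, X) = |D - X| <= eps.
accepted = theta[np.abs(D - X) <= eps]

# As eps -> 0 these become draws from pi(theta | D); here the accepted sample
# concentrates near the roots of 2(theta+2)theta(theta-2) = 2.
print(len(accepted))
```

Note that the accepted sample is multimodal: the cubic mean function equals D at several values of θ, and rejection ABC keeps all of them.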

  • ε = 10

    [Figure: left, θ against simulated D for draws from the prior, with the acceptance band D ± ε marked; right, the resulting ABC posterior density for θ against the true posterior.]

    θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)

    ρ(D, X) = |D − X|, D = 2

  • ε = 7.5

    [Figure: left, θ against simulated D for draws from the prior, with the acceptance band D ± ε marked; right, the resulting ABC posterior density for θ against the true posterior.]

    θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)

    ρ(D, X) = |D − X|, D = 2

  • ε = 5

    [Figure: left, θ against simulated D for draws from the prior, with the acceptance band D ± ε marked; right, the resulting ABC posterior density for θ against the true posterior.]

    θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)

    ρ(D, X) = |D − X|, D = 2

  • ε = 2.5

    [Figure: left, θ against simulated D for draws from the prior, with the acceptance band D ± ε marked; right, the resulting ABC posterior density for θ against the true posterior.]

    θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)

    ρ(D, X) = |D − X|, D = 2

  • ε = 1

    [Figure: as above with ε = 1 - scatter of θ against simulated X ("theta vs D") with the narrower acceptance band D ± ε, and the correspondingly sharper ABC density estimate against the true posterior ("ABC" vs "True").]

    θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ²)

    ρ(D, X) = |D − X|, D = 2

  • Rejection ABC

    If the data are too high dimensional we never observe simulations that are 'close' to the field data - the curse of dimensionality.

    Reduce the dimension using summary statistics, S(D).

    Approximate Rejection Algorithm With Summaries

    Draw θ from π(θ)

    Simulate X ∼ f(θ)

    Accept θ if ρ(S(D), S(X)) < ε

    If S is sufficient this is equivalent to the previous algorithm.

    Simple → Popular with non-statisticians


  • History matching and ABC

    History matching seeks to find a NROY (not ruled out yet) set

    Pθ = {θ : SHM(F̂θ, y) ≤ 3}

    where

    SHM(Fθ) = |EFθ(Y) − y| / √VarFθ(Y)

    ABC approximates the posterior as

    πε(θ) ∝ π(θ) E(I{S(F̂θ, y) ≤ ε})

    for some choice of S (typically S(F̂θ, y) = ρ(η(y), η(y′)) where y′ ∼ Fθ) and ε.

    They have thresholding of a score in common and are algorithmically comparable.
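The NROY computation is easy to sketch for the toy simulator from the earlier slides, where EFθ(Y) and VarFθ(Y) are available in closed form; the grid and the choice of example model are illustrative assumptions.

```python
import numpy as np

def implausibility(mean, var, y):
    """History-matching score S_HM = |E(Y) - y| / sqrt(Var(Y))."""
    return np.abs(mean - y) / np.sqrt(var)

# Toy simulator from the earlier slides, with known mean and variance:
#   E(X | theta) = 2(theta + 2) theta (theta - 2),  Var(X | theta) = 0.1 + theta^2
y_obs = 2.0
theta = np.linspace(-10, 10, 2001)
mean = 2 * (theta + 2) * theta * (theta - 2)
var = 0.1 + theta**2

S = implausibility(mean, var, y_obs)
nroy = theta[S <= 3.0]    # Not Ruled Out Yet set, thresholded at 3
```

In practice the mean and variance would come from an emulator of the simulator, and the variance would also include emulator and discrepancy terms; here only the sampling variance is used.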


  • History matching and ABC

    These methods (anecdotally) seem to work better in mis-specified situations.

    Why?

    They differ from likelihood based approaches in that:

    They only use some aspect of the simulator output

    – Typically we hand pick which simulator outputs to compare, and weight them on a case by case basis.

    Potentially use generalised scores/loss-functions

    The thresholding-type nature potentially makes them somewhat conservative.


  • What makes a good inferential approach?

    Do any of these approaches have favourable properties/characteristics for inference under discrepancy? Particularly when the discrepancy model is crude?

    Consistency?

    – I don't want inconsistency.

    Asymptotic concentration or normality?

    Frequency properties?

    – I wouldn't object, but this seems impossible for subjective priors.

    Coherence?

    Robustness to small mis-specifications?


  • Generalized scores

    Likelihood based methods are notoriously sensitive to mis-specification.

    A single outlier can make our inference arbitrarily bad.

    The likelihood can pick up on unintended aspects of the data (e.g. tail behaviour).

    Consider scoring rules instead. If we forecast F and observe y, then we receive the score S(F, y).

    S is a proper score if

    G = argmin_F E_{Y∼G} S(F, Y)

    i.e. predicting G gives the best possible score. This encourages honest reporting.

    Examples:

    Log-likelihood: S(F, y) = − log f(y)

    Tsallis score: S(F, y) = (γ − 1) ∫ f(x)^γ dx − γ f(y)^(γ−1)
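To make the contrast concrete, here is a sketch comparing the log score with the Tsallis score (written with a single exponent γ, taken as γ = 2) for a standard normal forecast; the outlier value y = 8 is an illustrative choice.

```python
import numpy as np

def norm_pdf(x):
    """Standard normal density."""
    return np.exp(-0.5 * np.asarray(x) ** 2) / np.sqrt(2 * np.pi)

def log_score(pdf, y):
    """S(F, y) = -log f(y)."""
    return -np.log(pdf(y))

def tsallis_score(pdf, y, gamma=2.0):
    """S(F, y) = (gamma - 1) * int f(x)^gamma dx - gamma * f(y)^(gamma - 1)."""
    x = np.linspace(-20.0, 20.0, 200_001)          # Riemann sum for the integral
    integral = (pdf(x) ** gamma).sum() * (x[1] - x[0])
    return (gamma - 1) * integral - gamma * pdf(y) ** (gamma - 1)

# An outlier at y = 8 makes the log score explode, while the Tsallis
# score stays bounded - the robustness property discussed above.
print(log_score(norm_pdf, 8.0))       # ~32.92
print(tsallis_score(norm_pdf, 8.0))   # ~0.2821, bounded
```

The log score is unbounded in y (it grows like y²/2 here), whereas the Tsallis score is bounded below by −γ sup f^(γ−1) and above by (γ − 1)∫f^γ, which is why a single outlier cannot dominate the total score.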


  • Minimum scoring rule estimation (Dawid et al. 2014 etc) uses

    θ̂ = argmin_θ S(Fθ, y)

    For proper scores

    E_{θ0}( ∂/∂θ S(Fθ, y) |_{θ=θ0} ) = ∂/∂θ E_{θ0} S(Fθ, y) |_{θ=θ0} = 0

    so we have an unbiased estimating equation, and hence get asymptotic consistency for well-specified models. We also get asymptotic normality.
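A sketch of minimum scoring rule estimation with the Tsallis score (γ = 2) for an N(θ, 1) model. The contaminated sample and the grid optimiser are illustrative assumptions, but they show the minimum-score estimator ignoring a gross outlier that drags the maximum likelihood estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

def norm_pdf(x):
    return np.exp(-0.5 * np.asarray(x) ** 2) / np.sqrt(2 * np.pi)

# Data: 100 draws from N(0, 1) plus one gross outlier
y = np.concatenate([rng.normal(0.0, 1.0, 100), [50.0]])

def total_tsallis(theta):
    """Sum of Tsallis scores (gamma = 2) for F_theta = N(theta, 1).
    Here int f_theta^2 dx = 1/(2 sqrt(pi)) for every theta, so
    S(F_theta, y_i) = 1/(2 sqrt(pi)) - 2 phi(y_i - theta)."""
    return len(y) / (2 * np.sqrt(np.pi)) - 2 * norm_pdf(y - theta).sum()

grid = np.linspace(-5.0, 5.0, 2001)
theta_hat = grid[np.argmin([total_tsallis(t) for t in grid])]  # minimum-score estimate
mle = y.mean()    # minimum log-score (maximum likelihood) estimate
```

theta_hat stays near 0, while the MLE is pulled towards the outlier by roughly 50/101 ≈ 0.5.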


  • Dawid et al. 2014 show that if

    – ∇θ fθ(x) is bounded in x for all θ

    – the Bregman gauge of the scoring rule is locally bounded

    then the minimum scoring rule estimator θ̂ is B-robust, i.e. it has a bounded influence function

    IF(x; θ̂, Fθ) = lim_{ε→0} [θ̂(ε δx + (1 − ε) Fθ) − θ̂(Fθ)] / ε

    i.e. if Fθ is contaminated by an outlier at x, this does not unduly affect the inference.

    Note both ABC and HM are B-robust in this sense, but using the log-likelihood is not.

    What type of robustness do we want here?
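The influence function can be approximated numerically by contaminating a sample and differencing. The two estimators below (mean vs median) are stand-ins chosen to illustrate unbounded vs bounded influence; they are not the estimators from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.normal(0.0, 1.0, 5000)    # sample standing in for F_theta

def empirical_influence(estimator, sample, x, eps=1e-3):
    """Finite-difference influence function
    [T((1 - eps) F + eps delta_x) - T(F)] / eps,
    approximated by appending copies of x with total weight ~ eps."""
    k = max(1, round(eps * len(sample) / (1 - eps)))
    contaminated = np.concatenate([sample, np.full(k, x)])
    eps_actual = k / len(contaminated)
    return (estimator(contaminated) - estimator(sample)) / eps_actual

# Outlier at x = 100: the mean has unbounded influence (IF grows like x),
# while the median's influence stays bounded (IF ~ 1/(2 f(median)) here)
if_mean = empirical_influence(np.mean, base, 100.0)
if_median = empirical_influence(np.median, base, 100.0)
```

Moving the outlier further out makes if_mean grow without bound while if_median barely changes, which is the distinction B-robustness captures.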


  • Bayes like approaches

    What about Bayesian like approaches with generalized scores?

    [Image: title page of Bissiri, P. G., Holmes, C. C. and Walker, S. G. (2016), "A general framework for updating belief distributions", J. R. Statist. Soc. B, 78(5), 1103-1130.]

    Bissiri et al. 2016 consider updating prior beliefs when the parameter θ is connected to observations via a loss function L(θ, y). They argue the update must be of the form

    π(θ|x) ∝ exp(−L(θ, x))π(θ)

    via coherency arguments. Note using the log-likelihood as the loss funct
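The update π(θ|x) ∝ exp(−L(θ, x))π(θ) can be sketched on a grid. The data, prior, and squared-error loss below are illustrative assumptions; taking L to be the negative log-likelihood would recover standard Bayesian updating.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, 50)                     # observations

theta = np.linspace(-3.0, 5.0, 4001)
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * theta**2 / 10.0**2)        # vague N(0, 10^2) prior, unnormalised

# General Bayes update: pi(theta | x) ∝ exp(-L(theta, x)) pi(theta),
# here with cumulative squared-error loss targeting the mean
loss = 0.5 * ((x[None, :] - theta[:, None]) ** 2).sum(axis=1)
post = np.exp(-(loss - loss.min())) * prior      # subtract min for numerical stability
post /= post.sum() * dtheta                      # normalise on the grid

post_mean = (theta * post).sum() * dtheta        # concentrates near the sample mean
```

With this loss the posterior is a Gaussian centred (up to slight prior shrinkage) on the sample mean; scaling the loss by a learning-rate parameter would control how fast the belief update concentrates, a tuning question the general Bayes literature discusses at length.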