Robust Bayesian Regression

Readings: Hoff Chapter 9; West, JRSSB 1984; Fúquene, Pérez & Pericchi 2015

STA 721, Duke University
November 21, 2017



Multiple Outliers

• Hoeting, Madigan and Raftery (in various permutations) consider the problem of simultaneous variable selection and outlier identification.

• This is implemented in the BMA package in the function MC3.REG, which has the advantage that more than two points may be considered as outliers at the same time.

• The function uses a Markov chain to identify both important variables and potential outliers; it is coded in Fortran, so it should run reasonably quickly.

• Can also use BAS or other variable selection programs.



Using BAS

library(MASS)
data(stackloss)
n = nrow(stackloss)
# augment the predictors with an identity matrix: one indicator column
# per case, so including column i gives case i a mean shift
stack.out = cbind(stackloss, diag(n))

library(BAS)
BAS.stack = bas.lm(stack.loss ~ ., data = stack.out,
                   prior = "hyper-g-n", a = 3,
                   modelprior = tr.beta.binomial(1, 1, 15),
                   method = "MCMC", MCMC.it = 200000)


Output

           P(B != 0 | Y)  model 1  model 2  model 3  model 4  model 5
Intercept       1.00        1.00     1.00     1.00     1.00     1.00
Air.Flow        1.00        1.00     1.00     1.00     1.00     1.00
Water.Temp      0.23        0.00     0.00     0.00     1.00     1.00
Acid.Conc.      0.04        0.00     0.00     0.00     0.00     0.00
`1`             0.22        0.00     0.00     0.00     0.00     1.00
`2`             0.07        0.00     0.00     0.00     0.00     0.00
`3`             0.24        0.00     0.00     0.00     0.00     1.00
`4`             0.75        1.00     0.00     1.00     1.00     1.00
`5`             0.03        0.00     0.00     0.00     0.00     0.00
`6`             0.04        0.00     0.00     0.00     0.00     0.00
`7`             0.03        0.00     0.00     0.00     0.00     0.00
`8`             0.03        0.00     0.00     0.00     0.00     0.00
`9`             0.03        0.00     0.00     0.00     0.00     0.00
`10`            0.03        0.00     0.00     0.00     0.00     0.00
`11`            0.03        0.00     0.00     0.00     0.00     0.00
`12`            0.04        0.00     0.00     0.00     0.00     0.00
`13`            0.16        0.00     0.00     1.00     0.00     0.00
`14`            0.08        0.00     0.00     0.00     0.00     0.00
`15`            0.03        0.00     0.00     0.00     0.00     0.00
`16`            0.03        0.00     0.00     0.00     0.00     0.00
`17`            0.03        0.00     0.00     0.00     0.00     0.00
`18`            0.02        0.00     0.00     0.00     0.00     0.00
`19`            0.04        0.00     0.00     0.00     0.00     0.00
`20`            0.06        0.00     0.00     0.00     0.00     0.00
`21`            0.94        1.00     1.00     1.00     1.00     1.00
BF                          0.13     0.01     0.08     0.07     1.00
PostProbs                   0.24     0.11     0.03     0.02     0.02
R2                          0.96     0.93     0.97     0.97     0.99
dim                         4.00     3.00     5.00     5.00     7.00
logmarg                    22.17    19.43    21.68    21.57    24.18


BAS

[Figure: diagnostic plots for BAS.stack. Panels: "Residuals vs Fitted" (residuals vs predictions under BMA, with cases 13, 3 and 14 labeled); "Model Probabilities" (cumulative probability vs model search order); "Model Complexity" (log(marginal) vs model dimension); "Inclusion Probabilities" (marginal inclusion probability for Intercept, Air.Flow, Water.Temp, Acid.Conc. and the case indicators `1`–`21`).]


BAS

[Figure: image(BAS.stack). The top 20 models, ordered by model rank and log posterior odds, plotted against the included variables: Intercept, Air.Flow, Water.Temp, Acid.Conc. and the case indicators `1`–`21`.]


Body Fat Data: Intervals w/ All Data

Response: % Body Fat. Predictor: Waist (Abdomen) Circumference.

[Figure: two panels titled "95% confidence and prediction intervals for bodyfat.lm" and "95% confidence and prediction intervals for bodyfat.lm2", each plotting Bodyfat against Abdomen (80–140) with observed points, the fitted line, confidence and prediction bands, and x̄ marked; bodyfat.lm uses all the data, while bodyfat.lm2 evidently refits without case 39.]

Which analysis do we use? With case 39, without it, or something different altogether?


Cook’s Distance

[Figure: "Residuals vs Leverage" plot for lm(Bodyfat ~ I(Abdomen - 2.54 * 34)): standardized residuals against leverage, with Cook's distance contours at 0.5 and 1; cases 39, 216 and 41 are flagged.]


Options for Handling Influential Cases

• Are there scientific grounds for eliminating the case?

• Test whether the case has a different mean than the population.

• Report results with and without the case.

• Model averaging to account for model uncertainty?

• Full model: Y = Xβ + I_n δ + ε

• 2^n submodels: γ_i = 0 ⇔ δ_i = 0

• If γ_i = 1, then case i has a different mean: a "mean shift" outlier.



Mean Shift = Variance Inflation

• Model: Y = Xβ + I_n δ + ε

• Prior: δ_i | γ_i ∼ N(0, V σ² γ_i), with γ_i ∼ Ber(π)

Then ε_i given σ² is independent of δ_i, and

ε*_i ≡ ε_i + δ_i | σ² ∼ N(0, σ²) with probability (1 − π), and ∼ N(0, σ²(1 + V)) with probability π.

So the model is Y = Xβ + ε*: "variance inflation". V + 1 = K = 7 in the paper by Hoeting et al. (package BMA).
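
A quick simulation sketch of this equivalence (variable names are illustrative): marginally over δ and γ, ε* has variance (1 − π)σ² + πσ²(1 + V).

set.seed(42)
m <- 1e5; sigma <- 1; V <- 6; p <- 0.1       # K = V + 1 = 7, as in Hoeting et al.
gam <- rbinom(m, 1, p)                       # gamma_i ~ Ber(pi)
delta <- gam * rnorm(m, 0, sigma*sqrt(V))    # delta_i | gamma_i ~ N(0, V sigma^2 gamma_i)
eps.star <- rnorm(m, 0, sigma) + delta       # eps*_i = eps_i + delta_i
c(empirical = var(eps.star),
  theoretical = (1 - p)*sigma^2 + p*sigma^2*(1 + V))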


Simultaneous Outlier and Variable Selection

MC3.REG(all.y = bodyfat$Bodyfat, all.x = as.matrix(bodyfat$Abdomen),

num.its = 10000, outliers = TRUE)

Model parameters: PI=0.02 K=7 nu=2.58 lambda=0.28 phi=2.85

15 models were selected

Best 5 models (cumulative posterior probability = 0.9939):

prob model 1 model 2 model 3 model 4 model 5

variables

all.x 1 x x x x x

outliers

39 0.94932 x x . x .

204 0.04117 . . . x .

207 0.10427 . x . . x

post prob 0.815 0.095 0.044 0.035 0.004


Change Error Assumptions

Y_i ind∼ t(ν, α + βx_i, 1/φ)

L(α, β, φ) ∝ ∏_{i=1}^n φ^{1/2} (1 + φ(y_i − α − βx_i)²/ν)^{−(ν+1)/2}

Use the prior p(α, β, φ) ∝ 1/φ.

Posterior distribution:

p(α, β, φ | Y) ∝ φ^{n/2−1} ∏_{i=1}^n (1 + φ(y_i − α − βx_i)²/ν)^{−(ν+1)/2}
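
No conjugate updating is available here, but the posterior is easy to evaluate numerically. A minimal sketch of the log posterior, up to an additive constant (the function name is illustrative):

logpost <- function(alpha, beta, phi, y, x, nu = 9) {
  # log p(alpha, beta, phi | Y) + const, matching the expression above
  r <- y - alpha - beta*x
  (length(y)/2 - 1)*log(phi) - (nu + 1)/2 * sum(log1p(phi*r^2/nu))
}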



Bounded Influence - West 1984 (and references within)

Treat σ² as given. The influence of individual observations on the posterior distribution of β, in the model where E[Y_i] = x_i^T β, is investigated through the score function:

d/dβ log p(β | Y) = d/dβ log p(β) + ∑_{i=1}^n x_i g(y_i − x_i^T β)

where

g(ε) = −d/dε log p(ε)

is the influence function of the error distribution (unimodal, continuous, differentiable, symmetric).

An outlying observation y_j is accommodated if the posterior distribution p(β | Y) converges to the case-deleted posterior p(β | Y_(j)) for all β as |y_j| → ∞. This requires error models whose influence functions go to zero, such as the Student t (O'Hagan, 1979).


Choice of df

• Score function for t with α degrees of freedom has turning points at ±√α

[Figure: the influence function g(ε) for t_9 errors over ε ∈ (−100, 100): g rises to its turning points near ±3 and decays toward zero in the tails; a sketch reproducing it follows this slide.]

• g′(ε) is negative when ε² > α (standardized errors)

• The contribution of such an observation to the information matrix is negative, and the observation is doubtful

• Suggest taking α = 8 or α = 9 to reject errors larger than √8 or 3 sd

Problem: no closed-form solution for the posterior distribution.
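
The plotted score is easy to reproduce: for t_α errors, g(ε) = −d/dε log p(ε) = (α + 1)ε/(α + ε²), which peaks at ±√α and decays to zero. A sketch with α = 9, as on the slide:

g <- function(eps, alpha = 9) (alpha + 1)*eps / (alpha + eps^2)  # score for t_alpha errors
curve(g(x), from = -100, to = 100, xlab = "eps", ylab = "g(eps, 9)")
abline(v = c(-3, 3), lty = 2)   # turning points at +/- sqrt(9) = 3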


Scale-Mixtures of Normal Representation

Z_i iid∼ t(ν, 0, σ²)  ⇔

Z_i | λ_i ind∼ N(0, σ²/λ_i)
λ_i iid∼ G(ν/2, ν/2)

Integrate out the "latent" λ's to obtain the marginal distribution.
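
A quick simulation check of the representation (variable names are illustrative; note that rgamma's rate parameterization matches G(ν/2, ν/2)):

set.seed(1)
nu <- 9; sigma <- 2; m <- 1e5
lambda <- rgamma(m, nu/2, rate = nu/2)   # latent precisions
z <- rnorm(m, 0, sigma/sqrt(lambda))     # Z | lambda ~ N(0, sigma^2/lambda)
qqplot(z, sigma * rt(m, df = nu))        # marginally, Z ~ t_nu(0, sigma^2)
abline(0, 1)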


Latent Variable Model

Y_i | α, β, φ, λ ind∼ N(α + βx_i, 1/(φλ_i))
λ_i iid∼ G(ν/2, ν/2)
p(α, β, φ) ∝ 1/φ

Joint posterior distribution (collecting the λ_i^{1/2} factors from the normal densities with the prior terms):

p(α, β, φ, λ_1, …, λ_n | Y) ∝ φ^{n/2−1} exp{−(φ/2) ∑_i λ_i (y_i − α − βx_i)²} ∏_{i=1}^n λ_i^{(ν+1)/2−1} exp(−λ_i ν/2)


JAGS

Just Another Gibbs Sampler (and more)

• Model

• Data

• Initial values (optional)

May do this through ordinary text files, or use the functions in R2jags to specify the model, data, and initial values and then call jags.


Model Specification via R2jags

rr.model = function() {
  for (i in 1:n) {
    mu[i] <- alpha0 + alpha1*(X[i] - Xbar)   # centered regression mean
    lambda[i] ~ dgamma(9/2, 9/2)             # scale-mixture weight: t errors, nu = 9
    prec[i] <- phi*lambda[i]                 # case-specific precision
    Y[i] ~ dnorm(mu[i], prec[i])
  }
  phi ~ dgamma(1.0E-6, 1.0E-6)               # diffuse gamma on the precision
  alpha0 ~ dnorm(0, 1.0E-6)                  # diffuse normal priors (precision 1e-6)
  alpha1 ~ dnorm(0, 1.0E-6)
}


Notes on Models

• Distributions of stochastic "nodes" are specified using ∼

• Assignment to deterministic "nodes" uses <- (NOT =)

• JAGS allows expressions as arguments in distributions

• Normal distributions are parameterized by precisions, so dnorm(0, 1.0E-6) is a N(0, 10⁶)

• Uses a for-loop structure as in R for model description, but is coded in C++ so it is fast!


Data

A list or rectangular data structure holding all data and summaries of data used in the model:

bf.data = list(Y = bodyfat$Bodyfat,
               X = bodyfat$Abdomen)
bf.data$n = length(bf.data$Y)      # sample size used in the model's for loop
bf.data$Xbar = mean(bf.data$X)     # centering constant


Specifying which Parameters to Save

The parameters to be monitored and returned to R are specified with the variable parameters:

parameters = c("beta0", "beta1", "sigma",

"mu34", "y34", "lambda[39]")

• All of the above (except lambda) are calculated from the other parameters. (See the R code for definitions of these parameters; a hypothetical sketch follows below.)

• lambda[39] saves only the 39th element of λ.

• To save a whole vector (for example, all lambdas), just give the vector name.
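
The course R code is not reproduced on the slides. One plausible sketch of how such derived nodes could be defined inside rr.model (these definitions are an assumption, chosen to match the centering at 2.54 * 34 cm of abdomen used earlier):

# hypothetical derived nodes; assumed definitions, see the course R code
beta1 <- alpha1                            # slope on the original scale
beta0 <- alpha0 - alpha1*Xbar              # uncentered intercept
sigma <- 1/sqrt(phi)                       # error scale
mu34 <- alpha0 + alpha1*(2.54*34 - Xbar)   # mean bodyfat at a 34 in. waist
lambda34 ~ dgamma(9/2, 9/2)                # weight for a new case
y34 ~ dnorm(mu34, phi*lambda34)            # posterior predictive bodyfat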


Running jags from R

bf.sim = jags(bf.data, inits = NULL, par = parameters,
              model = rr.model,
              n.chains = 2, n.iter = 10000)
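
The returned object can then be inspected with the usual R2jags tools, for example:

print(bf.sim)   # posterior summaries and convergence diagnostics for monitored nodes
lambda39 = bf.sim$BUGSoutput$sims.matrix[, "lambda[39]"]   # raw MCMC draws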


Output

             mean    sd   2.5%    50%  97.5%
beta0      -41.70  2.75 -46.91 -41.67 -36.40
beta1        0.66  0.03   0.60   0.66   0.71
sigma        4.48  0.23   4.05   4.46   4.96
mu34        15.10  0.35  14.43  15.10  15.82
y34         14.94  5.15   4.37  15.21  24.65
lambda[39]   0.33  0.16   0.11   0.30   0.72

95% HPD interval for expected bodyfat: (14.5, 15.8)

95% HPD interval for bodyfat: (5.1, 25.3)


Comparison

• 95% probability interval for β is (0.60, 0.71) with t_9 errors

• 95% confidence interval for β is (0.58, 0.69) (all data, normal model)

• 95% confidence interval for β is (0.61, 0.73) (normal model without case 39)

The t-error results are intermediate without having to remove any observations; case 39 is simply down-weighted by λ_39.


Full Conditional for λ_j

p(λ_j | rest, Y) ∝ p(α, β, φ, λ_1, …, λ_n | Y)

∝ φ^{n/2−1} exp{−(φ/2) ∑_i λ_i (y_i − α − βx_i)²} ∏_{i=1}^n λ_i^{(ν+1)/2−1} exp(−λ_i ν/2)

Ignore all terms except those that involve λ_j:

λ_j | rest, Y ∼ G( (ν + 1)/2 , (φ(y_j − α − βx_j)² + ν)/2 )
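
Together with the normal-theory updates for α, β (weighted least squares given the λ's) and for φ, this full conditional yields a complete Gibbs sampler. A minimal sketch for the simple regression case (function and variable names are illustrative; ν = 9 assumed):

gibbs.t <- function(y, x, nu = 9, n.iter = 5000) {
  n <- length(y); xc <- x - mean(x)                 # center x as in rr.model
  alpha <- mean(y); beta <- 0; phi <- 1
  out <- matrix(NA, n.iter, 3, dimnames = list(NULL, c("alpha", "beta", "phi")))
  for (s in 1:n.iter) {
    r <- y - alpha - beta*xc
    lambda <- rgamma(n, (nu + 1)/2, rate = (phi*r^2 + nu)/2)  # full conditional above
    alpha <- rnorm(1, sum(lambda*(y - beta*xc))/sum(lambda),  # precision phi*sum(lambda)
                   1/sqrt(phi*sum(lambda)))
    beta  <- rnorm(1, sum(lambda*xc*(y - alpha))/sum(lambda*xc^2),
                   1/sqrt(phi*sum(lambda*xc^2)))
    r <- y - alpha - beta*xc
    phi <- rgamma(1, n/2, rate = sum(lambda*r^2)/2)           # G(n/2, sum lambda r^2 / 2)
    out[s, ] <- c(alpha, beta, phi)
  }
  out
}
# e.g. draws = gibbs.t(bodyfat$Bodyfat, bodyfat$Abdomen)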


Weights

Under the prior, E[λ_i] = 1.

Under the posterior, large residuals are down-weighted (approximately those bigger than √ν).

[Figure: posterior distribution of λ_39; the density is concentrated on (0, 1), well below the prior mean of 1.]


Prior Distributions on Parameters

As a general recommendation, the prior distribution should have "heavier" tails than the likelihood:

• with t_9 errors, use a t_α prior with α < 9

• this can also be represented via a scale mixture of normals (a sketch follows below)

• Horseshoe, Double Pareto, and Cauchy priors all have heavier tails

• see the Stack-loss code
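
For instance, a Cauchy (t_1) prior on a coefficient can be written as a scale mixture of normals in the same style as rr.model. A sketch (prec.b and lam.b are illustrative names; this is not the Stack-loss code itself):

# Cauchy(0, scale) prior on a coefficient via a scale mixture of normals
lam.b ~ dgamma(1/2, 1/2)         # t with 1 df = Cauchy mixing weight
beta1 ~ dnorm(0, prec.b*lam.b)   # prec.b = 1/scale^2 fixes the prior scale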
