
Page 1

Bayesian auxiliary variable models for binary and multinomial regression

(Bayesian Analysis, 2006)

Authors: Chris Holmes, Leonhard Held

As interpreted by: Rebecca Ferrell

UW Statistics 572, Final Talk

May 27, 2014

Page 2

Holmes & Held set out to take regression models for categorical outcomes and ...

Page 4

Outline

- Introduction
  - Intro to probit and logistic regression in Bayesian context
  - Quick overview of the Gibbs sampler
- Probit regression
  - Review popular way of doing Bayesian probit regression from 1993 by Albert & Chib (A&C)
  - Compare Holmes & Held (H&H) probit approach with A&C
- Logistic regression
  - H&H's modifications to make ideas work for logistic regression
  - Empirical performance of sampling strategies
- Discussion
  - Extension to model uncertainty (no time!)
  - Extension to multiple outcomes (no time!)
  - Concluding thoughts

Page 5

Binary data setup

Classical framework with n binary responses y_i and covariates x_i:

y_i ∼ Bernoulli(p_i)

p_i = g^{-1}(η_i),   g^{-1}: ℝ → (0, 1)

η_i = x_i β,   i = 1, …, n

x_i = (x_{i1} … x_{ip}),   β = (β_1 … β_p)^T

Put a prior on the unknown coefficients:

β ∼ π(β)

Inferential goal: compute posterior π(β | y) ∝ p(y | β)π(β)

Page 6

Why are binary regression models hard to Bayesify?

- No conjugate priors – will need to use MCMC sampling
  - (Maximum likelihood needs iterative methods, asymptotics)
- Previous approaches involve sampling from an approximation to the posterior, need tuning, or otherwise rely on data-dependent accept-reject steps (e.g. Gamerman, 1997; Chen & Dey, 1998)
- Adaptive-rejection sampling (Dellaportas & Smith, 1993) only updates individual coefficients, resulting in poor mixing when coefficients are correlated

Wishlist for automatic and efficient Bayesian inference:

- MCMC samples from the exact posterior distribution
- No tuning of proposal distributions or low accept-reject rates
- Reasonable mixing even with correlated parameters

Page 7

Intro to Gibbs sampling

Setup: we don't know the posterior distribution π(β | y), but we do know each conditional posterior π(β_i | β_{−i}, y).

Gibbs sampling iterates over the conditional posteriors to produce a sample from a Markov chain with stationary distribution π(β | y):

(1) Initialize β^{(0)} = (β_1^{(0)}, …, β_p^{(0)})

(2) Draw β_1^{(1)} ∼ π(β_1 | β_{−1}^{(0)}, y)

(3) Draw β_2^{(1)} ∼ π(β_2 | β_1^{(1)}, β_{−{1,2}}^{(0)}, y) …

(4) … Draw β_p^{(1)} ∼ π(β_p | β_{−p}^{(1)}, y)

(5) Done with sample observation β^{(1)}; now repeat (2)–(4)

Combine Gibbs steps into blocks: e.g. if the distribution π(β_1, β_2 | β_{−{1,2}}, y) is available, it can be used in place of the individual conditionals in (2) and (3).
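The steps above can be sketched on a toy target where both conditionals are known exactly: a standard bivariate normal with correlation ρ (a hypothetical example, not from the paper), whose full conditionals are N(ρ·other, 1 − ρ²).

```python
import numpy as np

# Toy Gibbs sampler: target is a standard bivariate normal with
# correlation rho, so each full conditional is a univariate normal.
rng = np.random.default_rng(0)
rho = 0.9
M = 5000
draws = np.empty((M, 2))
b1, b2 = 0.0, 0.0                       # step (1): initialize beta^(0)
for m in range(M):
    # step (2): beta_1 | beta_2 ~ N(rho * beta_2, 1 - rho^2)
    b1 = rng.normal(rho * b2, np.sqrt(1 - rho**2))
    # step (3): beta_2 | beta_1 ~ N(rho * beta_1, 1 - rho^2)
    b2 = rng.normal(rho * b1, np.sqrt(1 - rho**2))
    draws[m] = (b1, b2)                 # step (5): store the draw, repeat

# sample correlation should come out close to rho after burn-in
print(np.corrcoef(draws[1000:].T)[0, 1])
```

Note the alternation: each update conditions on the most recent value of the other coordinate, exactly as in steps (2)–(4).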

Page 8

Probit regression

A&C auxiliary variable approach: introduce unobserved auxiliary variables z_i and re-write the probit model as

y_i = 1[z_i > 0]

z_i = x_i β + ε_i

ε_i ∼ N(0, 1)

β ∼ N(b, v)

Equivalent to the probit model with y_i ∼ Bernoulli(p_i = Φ(x_i β)):

p_i = P(z_i > 0 | β) = P(x_i β + ε_i > 0 | β) = 1 − Φ(−x_i β) = Φ(x_i β)
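A quick numerical check of this equivalence, using a made-up value for the linear predictor:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
eta = 0.7                             # hypothetical value of x_i @ beta
z = eta + rng.normal(size=1_000_000)  # z_i = x_i beta + eps_i, eps_i ~ N(0, 1)
p_mc = np.mean(z > 0)                 # Monte Carlo estimate of P(z_i > 0)
print(p_mc, norm.cdf(eta))            # both should be close to Phi(0.7) ≈ 0.758
```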

Page 9

Probit the A&C way: iterative Gibbs steps

From the joint posterior, obtain nice block conditional distributions of the parameters to iterate through in Gibbs steps:

π(β, z | y) ∝ p(y | z) p(z | β) π(β), so:

(1) π(β | z, y) ∝ p(z | β) π(β) = π(β) ∏_{i=1}^n p(z_i | β), with π(β) = N(b, v) and p(z_i | β) = N(x_i β, 1):
a multivariate normal

(2) π(z | β, y) ∝ p(y | z) p(z | β) = ∏_{i=1}^n p(y_i | z_i) p(z_i | β)
= ∏_{i=1}^n (1[z_i > 0] 1[y_i = 1] + 1[z_i ≤ 0] 1[y_i = 0]) φ(z_i − x_i β), with each factor π(z_i | β, y_i) a truncated normal:
a product of truncated univariate normals
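These two block updates are all a sampler needs. Below is a minimal sketch of the A&C scheme; the function name and interface are illustrative, not from the paper, and truncated-normal draws come from scipy.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs_ac(X, y, b, v, n_iter=2000, seed=0):
    """Sketch of the Albert & Chib auxiliary-variable Gibbs sampler.

    Prior: beta ~ N(b, v), with v the prior covariance matrix.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    v_inv = np.linalg.inv(v)
    # beta | z is N(B, V): V fixed across iterations since z ~ N(X beta, I)
    V = np.linalg.inv(v_inv + X.T @ X)
    L = np.linalg.cholesky(V)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # step (2): z | beta, y -- independent truncated normals, variance 1
        mu = X @ beta
        lo = np.where(y == 1, -mu, -np.inf)   # truncate to z_i > 0 when y_i = 1
        hi = np.where(y == 1, np.inf, -mu)    # truncate to z_i <= 0 when y_i = 0
        z = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # step (1): beta | z -- multivariate normal N(B, V)
        B = V @ (v_inv @ b + X.T @ z)
        beta = B + L @ rng.normal(size=p)
        draws[t] = beta
    return draws
```

On data simulated with known coefficients and a vague prior, the posterior means should land near the truth, up to posterior spread and Monte Carlo error.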

Page 10

Mixing demonstration

[Animation across Pages 10–23 of the original deck: sample paths in the (z, β) plane, z axis from −6 to 6, β axis from −3 to 3, first labeled "Iterative steps", then "Joint steps"]

Page 24

Smarter Gibbs sampling for probit?

H&H improve mixing by updating (β, z) jointly: simulate from π(z | y), then from π(β | z, y). With π(β) normal:

π(β, z | y) = π(β | z, y) π(z | y), where π(β, z | y) has known form and π(β | z, y) is normal, which implies

π(z | y) ∼ truncated multivariate normal

The truncated multivariate normal is very hard to sample from directly, but the univariate conditionals can be Gibbsed:

π(z_i | z_{−i}, y) ≅ N(m_i, v_i) 1[z_i > 0]   if y_i = 1
π(z_i | z_{−i}, y) ≅ N(m_i, v_i) 1[z_i ≤ 0]   if y_i = 0

where m_i and v_i are leave-one-out functions of z_{−i}, the data, and the prior
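One way to implement this sweep, as a generic sketch: marginally z ∼ N(Xb, I + XvX′) before truncation, and the conditional moments below are read off the precision matrix of that distribution rather than computed with the paper's efficient leave-one-out m_i, v_i formulas (same conditionals, slower bookkeeping).

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_z_given_y(X, y, b, v, z, rng):
    """One Gibbs sweep through pi(z_i | z_{-i}, y) for the H&H probit sampler.

    Marginal law of z (before truncation) is N(X b, I + X v X'); conditional
    moments come from its precision matrix Q via standard normal conditioning.
    """
    n = X.shape[0]
    m = X @ b
    Q = np.linalg.inv(np.eye(n) + X @ v @ X.T)   # precision of z (pre-truncation)
    for i in range(n):
        cond_var = 1.0 / Q[i, i]                 # Var(z_i | z_{-i})
        resid = z - m
        resid[i] = 0.0                           # drop the j = i term from the sum
        cond_mean = m[i] - cond_var * (Q[i] @ resid)
        sd = np.sqrt(cond_var)
        if y[i] == 1:                            # truncate to z_i > 0
            a, c = -cond_mean / sd, np.inf
        else:                                    # truncate to z_i <= 0
            a, c = -np.inf, -cond_mean / sd
        z[i] = cond_mean + sd * truncnorm.rvs(a, c, random_state=rng)
    return z
```

Every draw respects the sign constraint implied by y, and cond_var here is the v_i > 1 the slides refer to.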

Page 25

Probit sampler comparison

Iterative updates from A&C:

- Iterate between block Gibbs updates π(z | β, y), π(β | z, y)
- π(z | β, y) ∼ n independent truncated normals with variance 1
- Blocking and independence need just two matrix updates per cycle, so it should run quickly; the implementation in the H&H paper appears not to have exploited this for π(z | β, y)

Joint updates from H&H:

- Iterate through n univariate Gibbs updates π(z_i | z_{−i}, y), then one block Gibbs update π(β | z, y)
- π(z_i | z_{−i}, y) ∼ truncated normal with variance v_i > 1
- Can't do the z_i's all at once, so need n + 1 matrix calculations per cycle; but maybe the bigger variance can offset the slowness through better mixing?

Page 26

Efficient Bayesian inference

How might we see whether an MCMC sampling algorithm is efficient?

(1) Time elapsed to run M iterations

(2) Effective sample size (ESS) for a single parameter:

ESS = M / (1 + 2 ∑_{k=1}^∞ ρ(k))

where ρ(k) = monotone sample autocorrelation at lag k (Kass et al., 1998)

(3) Average update distance: measure mixing with

(1 / (M − 1)) ∑_{i=1}^{M−1} ‖β^{(i+1)} − β^{(i)}‖

Testing procedure: compute these metrics on each of 10 runs of M = 10,000 iterations per run (discard 1,000 burn-in)
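A bare-bones version of metric (2). This is a sketch: the infinite sum is cut at the first non-positive sample autocorrelation, one simple reading of the monotone-sequence estimator attributed to Kass et al.

```python
import numpy as np

def ess(x, max_lag=200):
    """ESS = M / (1 + 2 * sum of positive sample autocorrelations).

    Truncates the sum at the first non-positive autocorrelation (or max_lag).
    """
    x = np.asarray(x, dtype=float)
    M = len(x)
    xc = x - x.mean()
    var = xc @ xc / M
    s = 0.0
    for k in range(1, max_lag):
        rho = (xc[:-k] @ xc[k:]) / (M * var)   # sample autocorrelation at lag k
        if rho <= 0:                           # simple positive-sequence cutoff
            break
        s += rho
    return M / (1 + 2 * s)
```

For i.i.d. draws ESS should be close to M; for an AR(1) chain with coefficient φ it should come out near M(1 − φ)/(1 + φ).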

Page 27

Test data

H&H analyze several stock datasets with binary outcomes:

- Pima Indian data (n = 532, p = 8): outcome is diabetes; covariates include BMI, age, number of pregnancies
- Australian credit data (n = 690, p = 14): outcome is credit approval; 14 generic covariates
- Heart disease data (n = 270, p = 13): outcome is heart disease; covariates include age, sex, blood pressure, chest pain type
- German credit data (n = 1000, p = 24): outcome is good vs. bad credit risk; covariates include checking account status, purpose of loan, gender and marital status

Page 28

Probit performance: median values in 10 runs

Page 29

Probit performance: relative

Standardize for run time with the ratios (ESS/second, joint) / (ESS/second, iterative) and (Dist./second, joint) / (Dist./second, iterative)

Page 30

From probit to logit

Extend auxiliary variables to logistic regression with another level to model differing variances of the error terms:

y_i = 1[z_i > 0]

z_i = x_i β + ε_i

ε_i ∼ N(0, λ_i)

λ_i = (2ψ_i)^2,   ψ_i ∼ KS (Kolmogorov-Smirnov)

β ∼ N(b, v)

Equivalent to the logit model with y_i ∼ Bernoulli(p_i = expit(x_i β)), because ε_i marginally has a logistic distribution (Andrews & Mallows, 1974) and the CDF of the logistic is the expit function:

p_i = P(z_i > 0 | β) = P(ε_i > −x_i β | β) = 1 − expit(−x_i β) = expit(x_i β)
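The claimed equivalence can be checked numerically without touching the KS mixture, by drawing the logistic errors directly (a sketch with a made-up linear predictor value):

```python
import numpy as np

rng = np.random.default_rng(2)
eta = -0.4                            # hypothetical value of x_i @ beta
eps = rng.logistic(size=1_000_000)    # marginal law of eps_i is standard logistic
p_mc = np.mean(eta + eps > 0)         # Monte Carlo estimate of P(z_i > 0)
expit_eta = 1.0 / (1.0 + np.exp(-eta))
print(p_mc, expit_eta)                # both should be close to expit(-0.4) ≈ 0.401
```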

Page 31

Logistic Gibbs

In similar fashion to the probit model, simulate from the posterior conditionals:

π(β, z, λ | y) ∝ p(y | z) p(z | β, λ) p(λ) π(β), where p(y | z) gives the truncators, p(z | β, λ) is independent normal, p(λ) is KS², and π(β) is normal

(1) π(β | z, λ, y) ∝ p(z | β, λ) π(β) ≅ normal

(2) π(z | β, λ, y) ∝ p(y | z) p(z | β, λ) ≅ independent truncated normals

(3) π(λ | β, z, y) ∝ p(z | β, λ) p(λ) ≅ independent normal × KS²

The last conditional distribution is non-standard, but can be simulated using rejection sampling with Generalized Inverse Gaussian proposals and an alternating series representation ("squeezing")

Page 32

Logistic sampler comparison

Iterative updates (not analyzed in paper):

- Iterate block Gibbs updates π(β | z, λ, y), π(z | β, λ, y), π(λ | β, z, y)
- Variance of π(z | β, λ, y) is λ_i, with expected value π²/3

Joint updating scheme (z, λ):

- Iterate block Gibbs updates π(z, λ | β, y) = π(z | β, y) π(λ | β, z), with π(z | β, y) truncated independent logistic and π(λ | β, z) normal × KS²; then π(β | z, λ), which is normal
- Variance of π(z | β, y) is π²/3 – little gain by marginalizing?

Joint updating scheme (z, β):

- Iterate Gibbs updates π(z, β | λ, y) = π(z | λ, y) π(β | z, λ), with π(z | λ, y) truncated multivariate normal and π(β | z, λ) normal; then π(λ | β, z), which is normal × KS²
- Note that π(z | λ, y) will require Gibbsing through π(z_i | z_{−i}, λ, y), but the variance is > λ_i

Page 33

Logit performance: median values in 10 runs

Page 34

Logit performance: relative

Standardize for run time with the ratios (ESS/second, joint) / (ESS/second, iterative) and (Dist./second, joint) / (Dist./second, iterative)

Page 35

Concluding thoughts

- Latent variables can induce convenient conditional distributions to make MCMC sampling tractable for Bayesian models of binary data
- In these cases, all conditionals can be sampled from without Metropolis-Hastings
- Joint updating to increase variance in Gibbs sampling might make sense theoretically…
- …but only the scheme updating (z, λ) jointly in logistic regression was competitive with blocked iterative updates
- Don't replace independent truncated univariate distributions with a truncated multivariate normal!
- The auxiliary variable technique H&H introduced for logistic regression extends straightforwardly to Bayesian model uncertainty situations and polytomous outcomes
