Bayesian inference for logistic models using Polya-Gamma latent variables

Nicholas G. Polson∗ (University of Chicago)
James G. Scott† and Jesse Windle‡ (University of Texas at Austin)

First Draft: August 2011. This Draft: April 2012.
Abstract
We propose a new data-augmentation strategy for fully Bayesian inference in models with logistic likelihoods. To illustrate the method we focus on four examples: binary logistic regression; contingency tables with fixed margins; multinomial logistic regression; and negative-binomial models for count data. In each case, we show how our data-augmentation strategy leads to simple, effective methods for posterior inference that: (1) entirely circumvent the need for analytic approximations, numerical integration, or Metropolis–Hastings; and (2) outperform the existing state of the art, both in ease of use and in computational efficiency. We also describe how the method may be extended to other logit models, including topic models and nonlinear mixed-effects models. Our approach appeals heavily to the novel class of Polya-Gamma distributions; much of the paper is devoted to constructing this class in detail, and demonstrating its relevance for Bayesian inference. The methods described here are implemented in the R package BayesLogit.
∗[email protected]  †[email protected]  ‡[email protected]
arXiv:1205.0310v1 [stat.ME] 2 May 2012
1 Introduction
1.1 A latent-variable representation of logistic models
Many common statistical models involve a likelihood that decomposes into a product
of terms of the form
$$L_i = \frac{(e^{\psi_i})^{a_i}}{(1 + e^{\psi_i})^{b_i}}, \qquad (1)$$

where ψ_i is a linear function of parameters, and where a_i and b_i involve the response for subject i. One familiar case arises in binary logistic regression, where we observe outcomes y_i ∈ {0, 1}, i = 1, . . . , n, and assume that

$$\mathrm{pr}(y_i = 1) = \frac{e^{\psi_i}}{1 + e^{\psi_i}}. \qquad (2)$$

Here ψ_i = x_i^T β for a known p-vector of predictors x_i and a common, unknown set of regression coefficients β = (β_1, . . . , β_p)^T.
In this paper, we propose a new latent-variable representation of likelihoods in-
volving terms like (1). This leads directly to efficient Gibbs-sampling algorithms for a
very wide class of Bayesian models that have previously eluded simple treatment. This
paper focuses on four common models: binary logistic regression; contingency tables
with logistic-normal priors; multinomial logistic regression; and negative-binomial
models for count data. But the basic approach also applies to a much wider class
of models, including topic models, discrete mixtures of logits, random-effects and
mixed-effects models, non-Gaussian factor models, and hazard models.
Our representation is based upon a new family of Polya-Gamma distributions,
which we introduce here, and describe thoroughly in Section 2.
Definition 1. A random variable X has a Polya-Gamma distribution with parameters b > 0 and real c, denoted X ∼ PG(b, c), if

$$X \overset{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2 + c^2/(4\pi^2)}, \qquad (3)$$

where each g_k ∼ Ga(b, 1) is an independent gamma random variable.
With this definition in place, we proceed directly to our main result, which we
prove in Section 2.3.
Theorem 1. Let p(ω) denote the density of the random variable ω ∼ PG(b, 0), b > 0. Then the following integral identity holds for all a ∈ R:

$$\frac{(e^{\psi})^a}{(1 + e^{\psi})^b} = 2^{-b} e^{\kappa\psi} \int_0^{\infty} e^{-\omega\psi^2/2}\, p(\omega)\, d\omega, \qquad (4)$$

where κ = a − b/2.
Moreover, the conditional distribution

$$p(\omega \mid \psi) = \frac{e^{-\omega\psi^2/2}\, p(\omega)}{\int_0^{\infty} e^{-\omega\psi^2/2}\, p(\omega)\, d\omega},$$

which arises in treating the integrand in (4) as an unnormalized joint density in (ψ, ω), is also in the Polya-Gamma class: (ω | ψ) ∼ PG(b, ψ).
Equation (4) is significant because it allows us to write (1) in a conditionally Gaussian form, given the latent precision ω_i. Thus in the important case where ψ_i = x_i^T β is a linear function of predictors, the full likelihood in β will also be conditionally Gaussian. Moreover, the second half of Theorem 1 shows that the full conditional distribution for each ω_i, given ψ_i, is in the same class as the prior p(ω_i).
Theorem 1 thus enables full Bayesian inference for a wide class of models involving
logistic likelihoods and conditionally Gaussian priors. All of these models can now be
fit with a Gibbs sampler having only two basic steps: multivariate normal draws for
the main parameters affecting the ψi terms, and Polya-Gamma draws for the latent
variables ωi. (We will describe this second step in detail.)
1.2 Existing work
As we will show, our approach has significant advantages over existing approaches
for Bayesian inference in logistic models. While traditional approaches have involved
numerical integration, normal approximations to the likelihood (Carlin, 1992; Gelman
et al., 2004), or the Metropolis–Hastings algorithm (Dobra et al., 2006), the most
direct comparison here is with the more recent methods of Holmes and Held (2006)
and Fruhwirth-Schnatter and Fruhwirth (2007). Both of these papers rely upon a
random-utility construction of the logistic model, where the outcomes yi are assumed
to be thresholded versions of an underlying continuous quantity zi. For the purposes
of exposition, assume a linear model where
$$y_i = \begin{cases} 1, & z_i \ge 0 \\ -1, & z_i < 0 \end{cases} \qquad z_i = x_i^T\beta + \epsilon_i, \qquad \epsilon_i \sim \mathrm{Lo}(1), \qquad (5)$$
where εi ∼ Lo(1) has a standard logistic distribution. Upon marginalizing over zi,
the likelihood in (2) is recovered. A useful analogy is with the work of Albert and
Chib (1993), who represent the probit regression model in the same way, subject to
the modification that εi has a standard normal distribution.
The difficulty with this latent-utility representation is that the logistic distribution
does not lead to an easy method for sampling from the posterior distribution of β.
This is very different from the probit case, where one may exploit the conditional
normality of the likelihood in β, given zi.
Figure 1: Directed acyclic graphs depicting two latent-variable constructions for the logistic-regression model: the random-utility model of Holmes and Held (2006) and Fruhwirth-Schnatter and Fruhwirth (2007), on the left; versus our direct data-augmentation scheme, on the right.
To respond to this difficulty, Holmes and Held (2006) introduce a second layer of latent variables for the logit model, writing the errors ε_i in (5) as

$$(\epsilon_i \mid \phi_i) \sim \mathrm{N}(0, \phi_i), \qquad \phi_i = (2\lambda_i)^2, \qquad \lambda_i \sim \mathrm{KS}(1),$$
where λi has a Kolmogorov–Smirnov distribution. Marginalizing over λi recovers the
original latent-utility representation of the likelihood for yi. This exploits the fact
that the logistic distribution is a scale mixture of normals (Andrews and Mallows,
1974). Posterior sampling proceeds by iterating three steps: sample zi from a trun-
cated normal distribution, given λi, yi, and β; sample β from a multivariate normal
distribution, given zi and λi; and sample λi, given zi and β. This latter update can
be accomplished via adaptive rejection sampling, using the alternating-series method
of Devroye (1986).
Fruhwirth-Schnatter and Fruhwirth (2007) take a different approach based upon
auxiliary mixture sampling. They approximate p(εi) in (5) using a discrete mixture
of normals, rather than a scale mixture:
$$(\epsilon_i \mid \phi_i) \sim \mathrm{N}(0, \phi_i), \qquad \phi_i \sim \sum_{k=1}^{K} w_k\, \delta_{\phi^{(k)}},$$

where δ_φ indicates a Dirac measure at φ. The weights w_k and the points φ^{(k)} in
the discrete mixture are fixed for a given choice of K so that the Kullback–Leibler
divergence from the true distribution of the random utilities is minimized. Fruhwirth-
Schnatter and Fruhwirth (2007) argue that the choice of K = 10 leads to a good
approximation, and list the optimal weights and variances for this choice.
This results in a Gibbs-sampling scheme that also requires two levels of latent variables, and that has a very similar structure to that of Holmes and Held (2006): sample z_i, given φ_i, y_i, and β; sample β, given z_i and φ_i; and sample φ_i, given z_i and β, from its discrete conditional distribution. The difficulties of dealing with the
Kolmogorov–Smirnov distribution for λi are avoided, at the cost of approximating
the true model for εi with a thinner-tailed Gaussian mixture.
An approach similar to ours is that of Gramacy and Polson (2012), who concentrate
on a latent-variable representation of a powered-up version of the logit likelihood. This
representation is very useful for obtaining classical penalized-likelihood estimates via
simulation. But the difficulty with performing fully Bayesian inference using this
representation is that, when applied to the likelihood in (2), it leads to an improper
mixing distribution for the latent variable. This requires modifications that make
simulation very challenging in the general logit case. In contrast, our Polya-Gamma
representation sidesteps this issue entirely, and results in a proper mixing distribution
for all common choices of ai, bi in (1).
1.3 Outline
In Section 2, we construct the class of Polya–Gamma variables, and prove our key
result (Theorem 1). Then in Section 3 we illustrate the power of this result by
showing how it leads to a simple, efficient sampling algorithm for posterior inference
in binary logistic regression. We also benchmark our method extensively against
those of Holmes and Held (2006) and Fruhwirth-Schnatter and Fruhwirth (2007). In
Sections 4–6 we illustrate the method in three further settings: contingency tables,
multinomial logistic regression, and negative-binomial regression.
Then in Section 7, we address the key requirement for Theorem 1 to be practically
useful: namely, an efficient method for simulating Polya-Gamma random variables.
The identity in (3) suggests a naive way to approximate this distribution using a
large number of independent gamma draws. But in Section 7, we describe a far more
efficient and stable way to sample exactly from the Polya-Gamma distribution, which
avoids the difficulties that can result from truncating an infinite sum. To do so, we use
the alternating-series method from Devroye (1986) to construct a rejection sampler
for the Polya-Gamma distribution.
As we will prove, this sampler is highly efficient: on average, it requires 1.000803
proposals for every accepted draw, regardless of the parameters of the underlying
distribution. The proposal distribution itself is easy to sample from, and involves only
exponential and inverse-Gaussian draws, together with a handful of simple auxiliary
computations.
We conclude in Section 8 with some final remarks about how the method can be
generalized to other settings.
2 Polya-Gamma random variables
2.1 The case PG(b, 0)
The key step in our approach is the construction of a novel class of Polya-Gamma
random variables. Together with the sampling method described in Section 7, this
distributional theory greatly simplifies Bayesian inference in models with logistic like-
lihoods.
Following Devroye (2009), a random variable J∗ has a Jacobi distribution if
$$J^* \overset{D}{=} \frac{2}{\pi^2} \sum_{k=1}^{\infty} \frac{e_k}{(k - 1/2)^2}, \qquad (6)$$
where the e_k are independent, standard exponential random variables. The moment-generating function of this distribution is

$$\mathrm{E}(e^{-tJ^*}) = \frac{1}{\cosh(\sqrt{2t})}. \qquad (7)$$
The density of this distribution is expressible as a multi-scale mixture of inverse-
Gaussians; all moments are finite and expressible in terms of Riemann zeta functions.
The Jacobi is related to the Polya distribution (Barndorff-Nielsen et al., 1982), in that if J^* has a Jacobi distribution, and ω =^D J^*/4, then ω ∼ Pol(1/2, 1/2).

Let ω_k ∼ Pol(1/2, 1/2) for k = 1, . . . , n be a set of independent Polya-distributed random variables. A PG(n, 0) random variable is then defined by the sum ω^⋆ =^D Σ_{k=1}^n ω_k. Its moment-generating function follows from that of the Jacobi distribution:

$$\mathrm{E}\{\exp(-\omega_k t)\} = \frac{1}{\cosh(\sqrt{t/2})}.$$
Therefore, for the Polya-Gamma distribution with parameters (n, 0), we have

$$\mathrm{E}\{\exp(-\omega^{\star} t)\} = \frac{1}{\cosh^n(\sqrt{t/2})}. \qquad (8)$$
The name “Polya-Gamma” arises from the following observation. From (6),

$$\omega^{\star} \overset{D}{=} \sum_{l=1}^{n} \left( \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{e_{l,k}}{(k - 1/2)^2} \right),$$

where the e_{l,k} are independent exponential random variables. Rearranging terms,

$$\omega^{\star} \overset{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{\sum_{l=1}^{n} e_{l,k}}{(k - 1/2)^2} \overset{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2},$$
where gk are independent Gamma(n, 1) random variables. More generally we may
replace n with any positive real number b.
2.2 The general PG(b, c) class
The general PG(b, c) class arises through an exponential tilting of the PG(b, 0) density.
Specifically, we define the density of a PG(b, c) random variable as
p(ω | b, c) =exp
(− c2
2ω)p(ω | b, 0)
Eω
exp(− c2
2ω) , (9)
where p(ω | b, 0) is the density of a PG(b, 0) random variable. The expectation in
the denominator is taken with respect to the PG(b, 0) distribution, ensuring that
p(ω | b, c) integrates to 1.
Using the Weierstrass factorization theorem, we may write the moment-generating function of the PG(b, c) distribution as

$$\mathrm{E}_{\omega}\left\{ \exp\left( -\frac{1}{2}\omega t \right) \right\} = \frac{\cosh^b\left( \frac{c}{2} \right)}{\cosh^b\left( \frac{\sqrt{c^2 + t}}{2} \right)} = \prod_{k=1}^{\infty} \left( \frac{1 + \frac{c^2}{4(k - 1/2)^2\pi^2}}{1 + \frac{c^2 + t}{4(k - 1/2)^2\pi^2}} \right)^{b} = \prod_{k=1}^{\infty} (1 + d_k^{-1} t)^{-b}, \quad \text{where } d_k = 4\left( k - \frac{1}{2} \right)^2 \pi^2 + c^2.$$
Each term in the product is recognizable as the moment-generating function of a gamma distribution. We can therefore write a PG(b, c) random variable as

$$\omega \overset{D}{=} 2 \sum_{k=1}^{\infty} \frac{\mathrm{Ga}(b, 1)}{d_k} = \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{\mathrm{Ga}(b, 1)}{(k - \frac{1}{2})^2 + c^2/(4\pi^2)}.$$
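This sum representation suggests a simple, if approximate, way to generate PG(b, c) draws by truncating the series at K terms. The following R sketch is ours (not part of the BayesLogit package) and is intended only for illustration; the exact sampler of Section 7 is preferable in practice, since the neglected tail biases the draw downward by roughly b/(2π²K) on average.

    # Approximate PG(b, c) draw: truncate the sum-of-gammas
    # representation in (3) at K terms.
    rpg.naive <- function(b, c = 0, K = 200) {
      k <- 1:K
      g <- rgamma(K, shape = b, rate = 1)   # g_k ~ Ga(b, 1)
      sum(g / ((k - 1/2)^2 + c^2 / (4 * pi^2))) / (2 * pi^2)
    }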
2.3 Proof of Theorem 1
The key step in the proof of Theorem 1 is simply the construction of the Polya-Gamma
class itself. With this theory in place, the proof proceeds very straightforwardly.
Proof. Appealing to the moment-generating function in (8), and noting that 1 + e^ψ = 2 e^{ψ/2} cosh(ψ/2), we may write the left-hand side of (4) as

$$\frac{(e^{\psi})^a}{(1 + e^{\psi})^b} = \frac{2^{-b} \exp(\kappa\psi)}{\cosh^b(\psi/2)} = 2^{-b} e^{\kappa\psi}\, \mathrm{E}_{\omega}\{\exp(-\omega\psi^2/2)\},$$

where the expectation is taken with respect to ω ∼ PG(b, 0), and where κ = a − b/2.

Turn now to the conditional distribution

$$p(\omega \mid \psi) = \frac{e^{-\omega\psi^2/2}\, p(\omega)}{\int_0^{\infty} e^{-\omega\psi^2/2}\, p(\omega)\, d\omega},$$

where p(ω) is the density of the prior, PG(b, 0). This is of the same form as (9), with c = ψ. Therefore (ω | ψ) ∼ PG(b, ψ).
3 Binary logistic regression
3.1 Overview of approach
Several examples will serve to illustrate the utility of Theorem 1, beginning with the
case of binary logistic regression.
Let y_i ∈ {0, 1} be a binary outcome for unit i, with corresponding predictors x_i = (x_{i1}, . . . , x_{ip}). Suppose that the logistic model of (2) holds, and that we model the log odds of success as ψ_i = x_i^T β. Appealing to Theorem 1, the contribution of y_i to the likelihood in β may be written as

$$L_i(\beta) = \frac{\exp(x_i^T\beta)^{y_i}}{1 + \exp(x_i^T\beta)} \propto \exp(\kappa_i x_i^T\beta) \int_0^{\infty} \exp\{-\omega_i (x_i^T\beta)^2/2\}\, p(\omega_i \mid 1, 0)\, d\omega_i,$$

where κ_i = y_i − 1/2, and where p(ω_i | 1, 0) is the density of a Polya-Gamma random variable with parameters (1, 0).
Combining all n terms gives the following expression for the conditional likelihood in β, given ω = (ω_1, . . . , ω_n):
$$L(\beta \mid \omega) = \prod_{i=1}^{n} L_i(\beta \mid \omega_i) \propto \prod_{i=1}^{n} \exp\left\{ \kappa_i x_i^T\beta - \omega_i (x_i^T\beta)^2/2 \right\} \propto \prod_{i=1}^{n} \exp\left\{ -\frac{\omega_i}{2}\left( x_i^T\beta - \kappa_i/\omega_i \right)^2 \right\} \propto \exp\left\{ -\frac{1}{2}(z - X\beta)^T \Omega (z - X\beta) \right\},$$
where z = (κ_1/ω_1, . . . , κ_n/ω_n)^T, and where Ω = diag(ω_1, . . . , ω_n). Given all ω_i terms, we therefore have a conditionally Gaussian likelihood in β, with pseudo-response vector z, design matrix X, and covariance matrix Ω^{-1}.
Suppose that β has a conditionally normal prior β ∼ N(b, B). We may then sample from the joint posterior distribution for β by iterating two simple Gibbs steps:

$$(\omega_i \mid \beta) \sim \mathrm{PG}(1, x_i^T\beta), \qquad (\beta \mid y, \omega) \sim \mathrm{N}(m, V),$$

where

$$V = (X^T \Omega X + B^{-1})^{-1}, \qquad m = V(X^T \Omega z + B^{-1} b),$$
recalling that z_i = ω_i^{-1}(y_i − 1/2) itself depends upon the augmentation variable ω_i, along with the binary outcome y_i. If the number of regressors p exceeds the number of observations, then these matrix computations can be done faster by exploiting the Sherman–Morrison–Woodbury identity. Moreover, the outer products x_i x_i^T can be pre-evaluated, which will speed computation of X^T Ω X at each step of the Gibbs sampler.
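For concreteness, the following R sketch implements this two-step sampler under the N(b, B) prior. We assume the interface rpg(num, h, z) for Polya-Gamma draws from our BayesLogit package; the naive truncated-sum sampler sketched in Section 2 could be substituted. Note that X^T Ω z = X^T κ, since z_i = κ_i/ω_i.

    library(BayesLogit)   # provides rpg(num, h, z), as assumed here

    logit.gibbs <- function(y, X, b0, B0, nsamp = 1000) {
      n <- nrow(X); p <- ncol(X)
      P0    <- solve(B0)            # prior precision B^{-1}
      P0b   <- P0 %*% b0
      kappa <- y - 1/2              # kappa_i = y_i - 1/2
      beta  <- rep(0, p)
      draws <- matrix(NA, nsamp, p)
      for (s in 1:nsamp) {
        # Step 1: (omega_i | beta) ~ PG(1, x_i' beta)
        omega <- rpg(n, h = 1, z = as.numeric(X %*% beta))
        # Step 2: (beta | y, omega) ~ N(m, V)
        V    <- solve(t(X) %*% (X * omega) + P0)   # (X' Omega X + B^{-1})^{-1}
        m    <- V %*% (t(X) %*% kappa + P0b)       # V (X' kappa + B^{-1} b)
        beta <- as.numeric(m + t(chol(V)) %*% rnorm(p))
        draws[s, ] <- beta
      }
      draws
    }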
3.2 Example: spam classification
To demonstrate the approach, we fit a logistic-regression model to the data set on
spam classification from the online machine-learning repository at the University of
California, Irvine. The data set has n = 4601 observations and p = 57 predictors, all
of which correspond to the frequencies with which certain words appear in the body of
an e-mail. In fitting the model, we omitted the frequencies of the words “George” and
“cs.” These are the e-mail recipient’s first name and home department, respectively,
and therefore not generalizable to spam filters for other e-mail recipients.
Figure 2: Posterior distributions for the regression coefficients on the spam-classification data. The black dots are the posterior means; the light- and dark-grey lines indicate 90% and 50% posterior credible intervals, respectively; and the crosses are the corresponding maximum-likelihood estimates.
As a prior for β, we used a logistic distribution, following Gelman et al. (2008):

$$p(\beta) \propto \frac{\exp(x_0^T\beta)^{1/2}}{1 + \exp(x_0^T\beta)},$$
where each entry of the prior design point x0 is the sample mean of the corresponding
column of the design matrix X. This encodes the prior belief that the probability of
a success is 1/2 at the global average value of the predictors, and can be interpreted
as a pseudo-observation of y0 = 1/2 with a prior sample size of 1. This prior is easily
accommodated by the Polya-Gamma data-augmentation framework. In principle,
any prior which is a scale mixture of normals may be used instead; see, e.g. Polson
and Scott (2012).
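In fact, this prior is itself a term of the form (1), with a = 1/2 and b = 1, so that κ_0 = a − b/2 = 0 and the prior contributes its own latent ω_0 ∼ PG(1, x_0^T β). A hedged sketch of folding it into the sampler above (variable names are ours) simply appends a pseudo-row to the design:

    x0      <- colMeans(X)    # the prior design point
    Xa      <- rbind(X, x0)   # design matrix with the pseudo-observation appended
    kappa.a <- c(y - 1/2, 0)  # kappa_0 = a - b/2 = 1/2 - 1/2 = 0
    # Inside the Gibbs loop, with no separate Gaussian prior term remaining:
    #   omega.a <- rpg(nrow(Xa), h = 1, z = as.numeric(Xa %*% beta))
    #   V <- solve(t(Xa) %*% (Xa * omega.a));  m <- V %*% (t(Xa) %*% kappa.a)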
Results of the model fit are shown in Figure 2. For the most part, the Bayesian
model yields results very similar to the maximum-likelihood solution. This is unsur-
prising, given the large amount of data and the weakly informative prior.
3.3 Comparison with existing methods
To demonstrate the merits of our method, we benchmark it against both those of
Fruhwirth-Schnatter and Fruhwirth (2007) and Holmes and Held (2006)—specifically,
their joint updating technique, as it has the largest effective sample size of those they
consider. We use the same metrics, datasets, and sample sizes described in Holmes
and Held (2006). The efficiency of each sampler is measured in two ways: by the
average step size between successive draws of the regression coefficients and by the
effective sample size of each component averaged across all components of β. The
average step size is
Dist. =1
N − 1
N−1∑i=1
‖β(i) − β(i+1)‖
and the average effective sample size is

$$\mathrm{ESS} = \frac{1}{P} \sum_{i=1}^{P} \mathrm{ESS}_i, \qquad \text{where } \mathrm{ESS}_i = M \Big/ \Big( 1 + 2\sum_{j=1}^{k} \rho_i(j) \Big),$$

M is the number of post-burn-in samples, and ρ_i(j) is the jth autocorrelation of β_i, as estimated by the initial monotone sequence estimator of Geyer (1992).
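Both metrics are straightforward to compute in R from an N × P matrix beta of post-burn-in draws. A sketch, with the caveat that coda's effectiveSize uses a spectral estimate of the autocorrelation time rather than Geyer's initial monotone sequence estimator, so it only approximates the ESS used in our tables:

    library(coda)
    avg.dist <- mean(sqrt(rowSums(diff(beta)^2)))  # mean of ||beta^(i) - beta^(i+1)||
    avg.ess  <- mean(effectiveSize(mcmc(beta)))    # ESS averaged over the P components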
The four datasets considered are the Pima Indians diabetes dataset used in Ripley
(1996), and the Australian Credit, German Credit, and Heart datasets found in the
STATLOG project (Michie et al., 1994). For each dataset and each method we
ran 10 simulations of 10,000 samples each, discarding the first 1,000 samples of each
simulation as burn-in. In each case we assume a diffuse normal prior, β ∼ N(0, 100I).
Table 1 presents the results of these comparisons. For the examples considered
here, the Polya-Gamma method is at least 2.8 times as efficient as that of Holmes and
Held, and at least 35 times as efficient as that of Fruhwirth-Schnatter and Fruhwirth.
These draw-by-draw comparisons of efficiency are useful, but do not take into con-
sideration the relative speed of individual draws from each method. We compare the
running times of the Polya-Gamma (PG) method versus Holmes and Held (HH) to
measure the relative effective sample sizes per unit time. These results were produced
using R. In general, R is a poor language in which to benchmark execution time, as
the number of for loops, while loops, if statements, function calls, and vectorized op-
erations can have a large impact on how long a script runs. Nonetheless, the comparison
is still appropriate in this case, as the PG and HH algorithms have a similar structure.
Furthermore, in both cases we use an external routine for the most time-consuming
steps: drawing from the Polya-Gamma distribution and drawing from p(λ | z, β) in
the case of the HH method.
As seen in Table 2, the gains in efficiency are even greater for the PG method
when accounting for per-draw execution time. The PG method finished roughly twice
as fast on average, producing at least a five-fold improvement in effective sample size
per unit time. While the direct comparison was performed in R, the BayesLogit
package has implemented the Polya-Gamma procedure for binary logistic regression
in C++. The approximate running times in BayesLogit are also listed in Table 2,
and are much faster.
These results can be explained by the fact that our approach requires only a
single layer of latent variables. It also maintains conditional conjugacy, and does not
Dataset      n     p   p*  |  PG: ESS   Dist.  |  HH: ESS   Dist.  |  FSF: ESS   Dist.  |  ESS Ratio: HH   FSF
Diabetes    768    8    9  |     4797    0.70  |     1695    0.42  |       114    0.11  |            2.8    42
Heart       270   13   19  |     3184    4.86  |      891    2.70  |        85    0.78  |            3.6    37
Australia   690   14   35  |     3173   10.93  |      724    5.37  |        84    1.64  |            4.4    38
Germany    1000   20   49  |     4981    3.95  |     1723    2.40  |       115    0.68  |            2.9    43

Table 1: The effective sample size and average distance between samples of β for the four data sets considered. The ESS ratio is the ratio of the effective sample size of the Polya-Gamma technique to those of Holmes and Held (2006) and Fruhwirth-Schnatter and Fruhwirth (2007), respectively. For each data set there are n observations, p predictors, and p* components of β including interactions. The Australian credit dataset possesses perfectly separable data, which is captured by a diverging entry in β. This inflates the measure of distance between successive draws of β for each method, and is strongly affected by the choice of prior for β. Consequently, one should view the distance between β's for the Australian credit dataset as a measure of how quickly each method captures this pathological coefficient.
Dataset     BayesLogit: Time  |  PG (R): Time   ESS/sec.  |  HH (R): Time   ESS/sec.  |  ESS/sec. Ratio
Diabetes          5.3         |          167      29      |          334      5.1     |       5.6
Heart             2.6         |           61      53      |          100      9.0     |       5.9
Australia         9.6         |          158      20      |          334      2.1     |       9.3
Germany          18.0         |          234      21      |          553      3.1     |       6.9

Table 2: Comparisons of actual run times in seconds for the Polya-Gamma and Holmes–Held methods implemented in R. All computations were performed on a Linux machine with an Intel Xeon 3.30 GHz CPU and 4GB of RAM. The average running time (sec.) was calculated using the proc.time command in R. Both the KS mixture distribution and the Polya-Gamma distribution were sampled from R by calls to an external compiled C routine, making these times comparable. The BayesLogit column is the average time it took to run 10,000 samples using the BayesLogit package, which calls a compiled C++ routine to perform posterior sampling.
require an analytic approximation to p(εi). In fact, we dispense with the latent-utility
approach altogether in favor of a direct mixture representation of the logit likelihood,
which also avoids the step of sampling a truncated normal variable. Moreover, the
latent precision ωi in our construction is faster to sample than the latent variance
φi in the Kolmogorov–Smirnov mixture construction, leading to the further gains in
speed shown in Table 2.
Another important gain in efficiency arises when the data contain repeated ob-
servations. This can happen in logistic regression when there are repeated design
points, but is most relevant for the analysis of contingency tables with fixed margins.
Suppose that we cross-tabulate n observations into k distinct nonzero cells. If we
parametrize the cell probabilities by their log-odds and fit a logistic-normal model,
the latent-utility approach will require 2n latent variables. By contrast, our approach
will require only k latent variables; the cell counts affect the distribution of these la-
tent variables via the a and b terms in (4), but not their number. Our latent-variable
construction will therefore scale to larger data sets much more efficiently.
4 Contingency tables with fixed margins
4.1 2 × 2 × N tables
Next, consider a simple example of a binary-response clinical trial conducted in each
of N different centers. Let nij be the number of patients assigned to treatment
regime j in center i; and let Y = {y_ij} be the corresponding numbers of successes for i = 1, . . . , N. Table 3 presents a data set along these lines, from Skene and Wakefield (1990). These data arise from a multi-center trial comparing the efficacy
of two different topical cream preparations, labeled the treatment and the control.
Let p_ij denote the underlying success probability in center i for treatment j, and ψ_ij the corresponding log-odds. If ψ_i = (ψ_{i1}, ψ_{i2})^T is assigned a bivariate normal prior ψ_i ∼ N(µ, Σ), then the posterior for Ψ = {ψ_ij} is

$$p(\Psi \mid Y) \propto \prod_{i=1}^{N} \left\{ \frac{e^{y_{i1}\psi_{i1}}}{(1 + e^{\psi_{i1}})^{n_{i1}}} \cdot \frac{e^{y_{i2}\psi_{i2}}}{(1 + e^{\psi_{i2}})^{n_{i2}}} \cdot p(\psi_{i1}, \psi_{i2} \mid \mu, \Sigma) \right\}.$$
We apply Theorem 1 to each term in the posterior, thereby introducing augmentation variables Ω_i = diag(ω_{i1}, ω_{i2}) for each center. This yields, after some algebra, a simple Gibbs sampler that iterates between two sets of conditional distributions:

$$(\psi_i \mid Y, \Omega_i, \mu, \Sigma) \sim \mathrm{N}(m_i, V_{\Omega_i}), \qquad (\omega_{ij} \mid \psi_{ij}) \sim \mathrm{PG}(n_{ij}, \psi_{ij}), \qquad (10)$$

where

$$V_{\Omega_i}^{-1} = \Omega_i + \Sigma^{-1}, \qquad m_i = V_{\Omega_i}(\kappa_i + \Sigma^{-1}\mu), \qquad \kappa_i = (y_{i1} - n_{i1}/2,\; y_{i2} - n_{i2}/2)^T.$$
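A minimal R sketch of one sweep of this sampler for a single center, under the same assumed rpg interface as in Section 3: here y and n are length-2 vectors of successes and totals for (treatment, control), and psi holds the current pair of log-odds.

    update.center <- function(psi, y, n, mu, Sigma) {
      omega <- rpg(2, h = n, z = psi)     # (omega_ij | psi_ij) ~ PG(n_ij, psi_ij)
      Sinv  <- solve(Sigma)
      V     <- solve(diag(omega) + Sinv)  # V_Omega_i = (Omega_i + Sigma^{-1})^{-1}
      kappa <- y - n / 2
      m     <- V %*% (kappa + Sinv %*% mu)
      as.numeric(m + t(chol(V)) %*% rnorm(2))   # draw psi_i ~ N(m_i, V_Omega_i)
    }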
Figure 3 shows the results of applying this Gibbs sampler to the data from Skene
and Wakefield (1990). Notice that our method requires 16 latent variables for a data
set with 273 observations. By comparison, a Gibbs sampler based on the random-
utility construction in (5) would require 546 latent variables: a (zi, φi) pair for every
individual in the study.
In this analysis, we used a normal-Wishart prior for (µ,Σ−1). Hyperparameters
were chosen to match Table II from Skene and Wakefield (1990), who parameterize
Table 3: Data from a multi-center, binary-response study on topical cream effectiveness (Skene and Wakefield, 1990).
Center  |  Treatment: Success   Total  |  Control: Success   Total
   1    |              11         36   |             10        37
   2    |              16         20   |             22        32
   3    |              14         19   |              7        19
   4    |               2         16   |              1        17
   5    |               6         17   |              0        12
   6    |               1         11   |              0        10
   7    |               1          5   |              1         9
   8    |               4          6   |              6         7
Figure 3: Posterior distributions for the log-odds ratio for each of the 8 centers in the topical-cream study from Skene and Wakefield (1990). The vertical lines are central 95% posterior credible intervals; the dots are the posterior means; and the X's are the maximum-likelihood estimates of the log-odds ratios, with no shrinkage among the treatment centers. Note that the maximum-likelihood estimate is ψ_{i2} = −∞ for the control group in centers 5 and 6, as no successes were observed.
the model in terms of the prior expected values for ρ, σ²_{ψ1}, and σ²_{ψ2}, where

$$\Sigma = \begin{pmatrix} \sigma^2_{\psi_1} & \rho \\ \rho & \sigma^2_{\psi_2} \end{pmatrix}.$$
To match their choices, we use the following identity, which codifies the relationship between the hyperparameters B and d and the prior moments for the marginal variances and the correlation coefficient. If Σ ∼ IW(d, B), then

$$B = (d - 3) \begin{pmatrix} \mathrm{E}(\sigma^2_{\psi_1}) + \mathrm{E}(\sigma^2_{\psi_2}) + 2\,\mathrm{E}(\rho)\sqrt{\mathrm{E}(\sigma^2_{\psi_1})\,\mathrm{E}(\sigma^2_{\psi_2})} & \mathrm{E}(\sigma^2_{\psi_2}) + \mathrm{E}(\rho)\sqrt{\mathrm{E}(\sigma^2_{\psi_1})\,\mathrm{E}(\sigma^2_{\psi_2})} \\ \mathrm{E}(\sigma^2_{\psi_2}) + \mathrm{E}(\rho)\sqrt{\mathrm{E}(\sigma^2_{\psi_1})\,\mathrm{E}(\sigma^2_{\psi_2})} & \mathrm{E}(\sigma^2_{\psi_2}) \end{pmatrix}.$$
In this way we are able to map from pre-specified moments to hyperparameters, ending up with d = 4 and

$$B = \begin{pmatrix} 0.754 & 0.857 \\ 0.857 & 1.480 \end{pmatrix}.$$
4.2 Higher-order tables
Now consider a multi-center, multinomial-response study with more than two treatment arms. This can be modeled using a hierarchy of N different two-way tables, each having the same J treatment regimes and K possible outcomes. The data D consist of triply indexed outcomes y_ijk, each indicating the number of observations in center i and treatment j with outcome k. We let n_ij = Σ_k y_ijk denote the number of subjects assigned to treatment j at center i.
Let P = {p_ijk} denote the set of probabilities that a subject in center i with treatment j experiences outcome k, such that Σ_k p_ijk = 1 for all i, j. Given these probabilities, the full likelihood is

$$L(P) = \prod_{i=1}^{N} \prod_{j=1}^{J} \prod_{k=1}^{K} p_{ijk}^{y_{ijk}}.$$
Following Leonard (1975), we can model these probabilities using a logistic transformation. Let

$$p_{ijk} = \frac{\exp(\psi_{ijk})}{\sum_{l=1}^{K} \exp(\psi_{ijl})}.$$
Many common prior structures will maintain conditional conjugacy using the Polya-
Gamma framework outlined thus far. For example, we may assume an exchangeable
matrix-normal prior at the level of treatment centers:
ψi ∼ N(M,ΣR,ΣC) ,
where ψi is the matrix whose (j, k) entry is ψijk; M is the mean matrix; and ΣR
and ΣC are row- and column-specific covariance matrices, respectively. See Dawid
(1981) for further details on matrix-normal theory. Note that, for identifiability, we
set ψijK = 0, implying that ΣC is of dimension K − 1.
This leads to a posterior of the form

$$p(\Psi \mid D) \propto \prod_{i=1}^{N} \left[ p(\psi_i) \cdot \prod_{j=1}^{J} \prod_{k=1}^{K} \left( \frac{\exp(\psi_{ijk})}{\sum_{l=1}^{K} \exp(\psi_{ijl})} \right)^{y_{ijk}} \right],$$
suppressing any dependence on (M,ΣR,ΣC) for notational ease.
To show that this fits within the Polya-Gamma framework, we use an approach similar to that of Holmes and Held (2006), rewriting each probability as

$$p_{ijk} = \frac{\exp(\psi_{ijk})}{\sum_{l \neq k} \exp(\psi_{ijl}) + \exp(\psi_{ijk})} = \frac{e^{\psi_{ijk} - c_{ijk}}}{1 + e^{\psi_{ijk} - c_{ijk}}},$$

where c_ijk = log Σ_{l≠k} exp(ψ_ijl) is implicitly a function of the other ψ_ijl's for l ≠ k.
We now fix values of i and k and examine the conditional posterior distribution for ψ_{i·k} = (ψ_{i1k}, . . . , ψ_{iJk})^T, given ψ_{i·l} for l ≠ k:

$$p(\psi_{i\cdot k} \mid D, \psi_{i\cdot(-k)}) \propto p(\psi_{i\cdot k} \mid \psi_{i\cdot(-k)}) \cdot \prod_{j=1}^{J} \left( \frac{e^{\psi_{ijk} - c_{ijk}}}{1 + e^{\psi_{ijk} - c_{ijk}}} \right)^{y_{ijk}} \left( \frac{1}{1 + e^{\psi_{ijk} - c_{ijk}}} \right)^{n_{ij} - y_{ijk}} = p(\psi_{i\cdot k} \mid \psi_{i\cdot(-k)}) \cdot \prod_{j=1}^{J} \frac{e^{y_{ijk}(\psi_{ijk} - c_{ijk})}}{(1 + e^{\psi_{ijk} - c_{ijk}})^{n_{ij}}}.$$
This is simply a multivariate version of the bivariate form that arises in a 2 × 2 table. Appealing to the theory of Polya-Gamma random variables outlined above, we may express this as

$$p(\psi_{i\cdot k} \mid D, \psi_{i\cdot(-k)}) \propto p(\psi_{i\cdot k} \mid \psi_{i\cdot(-k)}) \cdot \prod_{j=1}^{J} \frac{e^{\kappa_{ijk}(\psi_{ijk} - c_{ijk})}}{\cosh^{n_{ij}}\{(\psi_{ijk} - c_{ijk})/2\}} = p(\psi_{i\cdot k} \mid \psi_{i\cdot(-k)}) \cdot \prod_{j=1}^{J} \left[ e^{\kappa_{ijk}(\psi_{ijk} - c_{ijk})} \cdot \mathrm{E}\left\{ e^{-\omega_{ijk}(\psi_{ijk} - c_{ijk})^2/2} \right\} \right],$$
where ω_ijk ∼ PG(n_ij, 0) for j = 1, . . . , J, and κ_ijk = y_ijk − n_ij/2. Given ω_ijk for j = 1, . . . , J, all of these terms will combine in a single normal kernel, whose mean and covariance structure will depend heavily upon the particular choices of hyperparameters in the matrix-normal prior for ψ_i. Each ω_ijk term can be updated as

$$(\omega_{ijk} \mid \psi_{ijk}) \sim \mathrm{PG}(n_{ij}, \psi_{ijk} - c_{ijk}),$$

leading to a simple MCMC scheme that loops over centers and responses, drawing each vector of parameters ψ_{i·k} (that is, for all treatments at once) conditional on the other ψ_{i·(−k)}'s; a sketch of the required ingredients is given below.
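A brief R sketch (names ours) of the ingredients for this per-(i, k) block update, given the J × K matrix Psi.i of current log-odds for center i (last column fixed at zero), the J × K count matrix y.i, and the vector n.i of treatment totals:

    block.stats <- function(Psi.i, y.i, n.i, k) {
      C     <- log(rowSums(exp(Psi.i[, -k, drop = FALSE])))  # c_ijk, j = 1..J
      eta   <- Psi.i[, k] - C                                # psi_ijk - c_ijk
      omega <- rpg(nrow(Psi.i), h = n.i, z = eta)            # PG(n_ij, psi_ijk - c_ijk)
      kappa <- y.i[, k] - n.i / 2                            # kappa_ijk = y_ijk - n_ij/2
      list(omega = omega, kappa = kappa, C = C)
    }
    # These terms combine with the matrix-normal prior into one Gaussian
    # draw for the column psi_{i.k}, analogous to the bivariate update above.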
5 Multinomial logistic regression
One may extend the Polya-Gamma method used for binary logistic regression to multinomial logistic regression. Consider the multinomial sample y_i = {y_ij}_{j=1}^J, which records the number of responses in each category j = 1, . . . , J, together with the total number of responses n_i. The logistic link function for polychotomous regression stipulates that the probability of randomly drawing a single response from the jth category in the ith sample is

$$p_{ij} = \frac{\exp(\psi_{ij})}{\sum_{k=1}^{J} \exp(\psi_{ik})},$$

where the log odds ψ_ij is modeled by x_i^T β_j, and β_J has been constrained to zero for purposes of identification. Following Holmes and Held (2006), the likelihood for β_j conditional upon β_{−j}, the matrix with column vector β_j removed, is
$$\ell(\beta_j \mid \beta_{-j}, y) = \prod_{i=1}^{N} \left( \frac{e^{\eta_{ij}}}{1 + e^{\eta_{ij}}} \right)^{y_{ij}} \left( \frac{1}{1 + e^{\eta_{ij}}} \right)^{n_i - y_{ij}}, \quad \text{where } \eta_{ij} = x_i^T\beta_j - C_{ij}, \quad C_{ij} = \log \sum_{k \neq j} \exp(x_i^T\beta_k),$$
which looks like the binary logistic likelihood previously discussed. Incorporating the Polya-Gamma auxiliary variables, the likelihood becomes

$$\prod_{i=1}^{N} e^{\kappa_{ij}\eta_{ij}}\, e^{-\eta_{ij}^2\omega_{ij}/2}\, \mathrm{PG}(\omega_{ij} \mid n_i, 0),$$

where κ_ij = y_ij − n_i/2. Employing the conditionally conjugate prior β_j ∼ N(m_{0j}, V_{0j}) yields a two-part update:

$$(\beta_j \mid \Omega_j) \sim \mathrm{N}(m_j, V_j), \qquad (\omega_{ij} \mid \beta_j) \sim \mathrm{PG}(n_i, \eta_{ij}) \quad \text{for } i = 1, \dots, N,$$
where

$$V_j^{-1} = X^T\Omega_j X + V_{0j}^{-1}, \qquad m_j = V_j\left\{ X^T(\kappa_j - \Omega_j C_j) + V_{0j}^{-1} m_{0j} \right\},$$

and Ω_j = diag({ω_ij}_{i=1}^N). One may sample the posterior of (β | y) via Gibbs sampling by repeatedly iterating over the above steps for j = 1, . . . , J − 1; a minimal sketch is given below.
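The following R sketch implements the update for a single category j, again under the assumed rpg interface. Here y is an N × J count matrix, n the vector of row totals, and Beta a p × J coefficient matrix with Beta[, J] fixed at zero.

    update.category <- function(j, Beta, y, n, X, m0, V0inv) {
      C     <- log(rowSums(exp(X %*% Beta[, -j])))  # C_ij = log sum_{k != j} exp(x_i' beta_k)
      eta   <- as.numeric(X %*% Beta[, j]) - C      # eta_ij = x_i' beta_j - C_ij
      omega <- rpg(nrow(X), h = n, z = eta)         # (omega_ij | beta_j) ~ PG(n_i, eta_ij)
      kappa <- y[, j] - n / 2
      V <- solve(t(X) %*% (X * omega) + V0inv)
      m <- V %*% (t(X) %*% (kappa - omega * C) + V0inv %*% m0)
      as.numeric(m + t(chol(V)) %*% rnorm(ncol(X))) # (beta_j | Omega_j) ~ N(m_j, V_j)
    }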
The Polya-Gamma method generates samples from the joint posterior distribu-
tion without appealing to analytic approximations to the posterior. This offers an
important advantage when the number of observations is not significantly larger than
the number of parameters.
To see this, consider sampling the joint posterior for β using a Metropolis-Hastings
algorithm with an independence proposal. The likelihood in β is approximately
normal, centered at the posterior mode m, and with variance V equal to the inverse
of the Hessian matrix evaluated at the mode. (Both of these may be found using
standard numerical routines.) Thus a natural proposal for (vec(β)^(t) | y) is vec(b) ∼ N(m, aV) for some a ≈ 1. When data are plentiful, this method is both simple and
highly efficient, and is implemented in many standard software packages (e.g. Martin
et al., 2011).
But when vec(β) is high-dimensional relative to the number of observations the
Hessian matrix H may be ill-conditioned, making it impossible or impractical to
generate normal proposals. Multinomial logistic regression succumbs to this problem
more quickly than binary logistic regression, as the number of parameters scales like
the product of the number of categories and the number of predictors.
To illustrate this phenomenon, we consider glass-identification data from German
(1987). This data set has J = 6 categories of glass and nine predictors describing the
chemical and optical properties of the glass that one may measure in a forensics lab
and use in a criminal investigation. This generates up to 50 = 10 × 5 parameters,
including the intercepts and the constraint that βJ = 0. These must be estimated
using n = 214 observations. In this case, the Hessian H at the posterior mode is
poorly conditioned when employing a vague prior, incapacitating the independent
Metropolis-Hastings algorithm. Numerical experiments confirm that even when a
vague prior is strong enough to produce a numerically invertible Hessian, rejection
rates are prohibitively high. In contrast, the multinomial Polya-Gamma method still
produces reasonable posterior distributions in a fully automatic fashion, even with
a weakly informative normal prior for each βj. Table 4, which shows the in-sample
performance of the multinomial logit model, demonstrates the problem with the joint
proposal distribution: category 6 is perfectly separable into cases and non-cases, even
though the other categories are not. This is a well-known problem with maximum-
likelihood estimation of logistic models. The same problem also forecloses the option
of posterior sampling using methods that require a unique MLE to exist.
Class      1    2    3    5    6    7
Total     70   76   17   13    9   29
Correct   50   55    0    9    9   27

Table 4: "Correct" refers to the number of glass fragments for each category that were correctly identified by the Bayesian multinomial logit model. The glass identification dataset includes a type of glass, class 4, for which there are no observations.
6 Negative-binomial models for count data
6.1 With regressors
Suppose that we have a Poisson model with gamma overdispersion:

$$(y_i \mid \lambda_i) \sim \mathrm{Pois}(\lambda_i), \qquad (\lambda_i \mid h, p_i) \sim \mathrm{Ga}\left( h, \frac{p_i}{1 - p_i} \right).$$
If we marginalize over λ_i, we get a negative-binomial marginal distribution for y_i:

$$p(y_i \mid h, p_i) \propto (1 - p_i)^h\, p_i^{y_i},$$

ignoring constants of proportionality. Now use a logit transform to represent p_i as

$$p_i = \frac{e^{\psi_i}}{1 + e^{\psi_i}}.$$
Since eψi = pi/(1− pi), the ψi term is analogous to a re-scaled version of the typical
linear predictor in a Poisson generalized-linear model using the canonical log link.
Appealing to Theorem 1, we can rewrite the negative-binomial likelihood in terms of ψ_i as

$$(1 - p_i)^h\, p_i^{y_i} = \frac{\exp(\psi_i)^{y_i}}{\{1 + \exp(\psi_i)\}^{h + y_i}} \propto e^{\kappa_i\psi_i} \int_0^{\infty} e^{-\omega_i\psi_i^2/2}\, p(\omega_i \mid h + y_i, 0)\, d\omega_i,$$
where κ_i = (y_i − h)/2, and where the mixing distribution is Polya-Gamma. Conditional upon ω_i, we have a likelihood proportional to e^{−Q(ψ_i)} for some quadratic form Q. Therefore this will be conditionally conjugate to any Gaussian prior, or to any prior that can be made conditionally Gaussian.
In the case where regressors are present, we have ψ_i = x_i^T β for some p-vector x_i. Then, conditional upon ω_i, the contribution of the ith observation to the likelihood is

$$L_i(\beta) \propto \exp\left\{ \kappa_i x_i^T\beta - \omega_i(x_i^T\beta)^2/2 \right\} \propto \exp\left\{ -\frac{\omega_i}{2}\left( \frac{y_i - h}{2\omega_i} - x_i^T\beta \right)^2 \right\}.$$
Let Ω = diag(ω_1, . . . , ω_n); let z_i = (y_i − h)/(2ω_i); and let z denote the stacked vector of z_i terms. Then, putting all the terms in the likelihood together, this is equivalent to the sampling model

$$(z \mid \beta, \Omega) \sim \mathrm{N}(X\beta, \Omega^{-1}),$$

or a simple Gaussian regression model with (known) covariance matrix Ω^{-1} describing the error structure.
Suppose that we assume a conditionally Gaussian prior, β ∼ N(b, B). Then Gibbs sampling proceeds in two simple steps:

$$(\omega_i \mid h, \beta) \sim \mathrm{PG}(h + y_i, x_i^T\beta), \qquad (\beta \mid \Omega, z) \sim \mathrm{N}(m, V),$$

where

$$V = (X^T\Omega X + B^{-1})^{-1}, \qquad m = V(X^T\Omega z + B^{-1}b),$$
recalling the definition of z above. A related paper is that of Zhou et al. (2012),
who use our Polya-Gamma construction to arrive at a similar MCMC method for
log-normal mixtures of negative binomial models.
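In R, the resulting sampler differs from the binary-logit sketch of Section 3 in only two lines. A hedged sketch of one iteration with fixed dispersion h (rpg as assumed before, now with first parameter h + y_i):

    nb.gibbs.step <- function(beta, y, X, h, b0, B0inv) {
      omega <- rpg(nrow(X), h = h + y, z = as.numeric(X %*% beta))  # PG(h + y_i, x_i' beta)
      kappa <- (y - h) / 2                      # so that X' Omega z = X' kappa
      V <- solve(t(X) %*% (X * omega) + B0inv)
      m <- V %*% (t(X) %*% kappa + B0inv %*% b0)
      as.numeric(m + t(chol(V)) %*% rnorm(ncol(X)))
    }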
6.2 Example: an AR(1) model for the incidence of flu
To illustrate the further elaborations that are possible to the basic framework, we
fit a negative-binomial AR(1) model to four years (2008–11) of weekly data on flu
incidence in Texas. These data were collected from the Texas Department of State Health Services. We emphasize that there are many ways to handle the overdispersion
present in this data set, and that we do not intend our model to be taken as a definitive
analysis. We merely intend it as a “proof of concept” example showing how various
aspects of Bayesian time-series modeling—in this case, a simple AR(1) model—can
now be incorporated seamlessly into models with non-Gaussian likelihoods, such as
those that arise in the analysis of binary, multinomial, and count data.
Let yt denote the number of reported cases of influenza-like illness (ILI) in week
t. We assume that these counts follow a negative-binomial model, or equivalently a
Figure 4: Incidence of influenza-like illness in Texas, 2008–11, together with the estimated mean λ_t from the negative-binomial AR(1) model (left) and the Poisson regression model incorporating a lagged value of the response variable as a predictor (right). The blanks in weeks 21-41 correspond to missing data. In each frame the grey lines depict the upper and lower bounds of a 95% predictive interval.
Gamma-overdispersed Poisson model:

$$y_t \sim \mathrm{NB}(h, p_t), \qquad p_t = \frac{e^{\psi_t}}{1 + e^{\psi_t}}, \qquad \psi_t = \alpha + \gamma_t, \qquad \gamma_t = \phi\gamma_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \mathrm{N}(0, \sigma^2),$$
assuming that t indexes time. It is trivial to sample from this model by combining the
results of the previous section with standard results on AR(1) models. In the following
analysis, we assumed an improper uniform prior on the dispersion parameter h, and
fixed φ and σ2 to 0.98 and 1, respectively. But it would be equally straightforward to
place hyper-priors upon each parameter, and to sample them in a hierarchical fashion.
It would also be straightforward to incorporate fixed effects in the form of regressors.
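As one concrete possibility (a sketch under our own naming, not the implementation used for Figure 4), the latent path γ can be updated by a forward-filter backward-sampler or, more simply, one site at a time: given ω_t ∼ PG(h + y_t, α + γ_t) and the pseudo-observation z_t = (y_t − h)/(2ω_t), each interior γ_t has a Gaussian full conditional.

    update.gamma.t <- function(t, gamma, z, omega, alpha, phi, sigma2) {
      # Precision: likelihood term omega_t plus AR(1) terms from t-1 and t+1
      prec <- omega[t] + (1 + phi^2) / sigma2
      mu   <- (omega[t] * (z[t] - alpha) +
               phi * (gamma[t - 1] + gamma[t + 1]) / sigma2) / prec
      rnorm(1, mu, sqrt(1 / prec))
    }

A forward-filter backward-sampler mixes faster and handles the endpoints directly; the single-site form above is shown only because it makes the conditional Gaussianity explicit.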
Figure 4 shows the results of the fit, along with a lag-1 Poisson regression model
incorporating xt = log(yt−1) as a regressor for the rate parameter at time t. In each
frame the grey lines depict the upper and lower bounds of a 95% predictive interval.
In the case of the Poisson model, this predictive interval is of negligible width, and
fails entirely to capture the right degree of dispersion. (The grey lines sit almost
directly over the black line; the predictive interval has a half-width essentially equal
to twice the square root of the estimated conditional rate parameter.)
7 Simulating Polya-Gamma random variables
7.1 Overview of approach
All our developments thus far require an efficient method for sampling Polya-Gamma
random variables. In this section, we derive such a method; it is implemented in a
separate sampling routine in the R package BayesLogit.
First, observe that one may sample Polya-Gamma random variables naively (and
approximately) using the sum-of-gammas representation in Equation (3). But this
is slow, and involves the potentially dangerous step of truncating an infinite sum.
We therefore construct an alternate, exact method by extending the approach taken
by Devroye (2009) for simulating distributions related to the Jacobi theta function,
which henceforth we will simply call Jacobi distributions. Of particular interest is J∗,
a Jacobi random variable having the moment-generating function given by (7).
The PG(1, z) distribution is related to an exponentially tilted Jacobi distribution J*(1, z), defined by

$$f(x \mid z) = \cosh(z)\, e^{-xz^2/2} f(x), \qquad (11)$$

where f(x) denotes the density function of J*, through the rescaling

$$\mathrm{PG}(1, z) = \frac{1}{4} J^*(1, z/2). \qquad (12)$$
We have appealed to (7) to compute the normalizing constant for f(x|z). Since
PG(b, z) is the sum of b independent PG(1, z) random variables, and since for our
purposes b is generally small, it will suffice to generate draws from PG(1, z), or
equivalently J*(1, z/2). (Though it should be noted that the algorithm we propose is
fast enough that it remains easy to simulate from PG(b, z) even for relatively large
b.) Thus the task at hand is to quickly simulate from the J∗(1, z/2) distribution. One
easily sums and scales b such draws to generate a PG(b, z) random variate.
7.2 The J∗(1, z) Sampler
Devroye (2009) develops an efficient J* sampler. Following this work, we develop an efficient sampler for the exponentially tilted random variate J*(1, z). In both cases, the density of interest can be written as an infinite, alternating sum, and is thus amenable to von Neumann's alternating-series method, which says that when the density of interest f sits between the partial sums of the series, one may draw from a proposal g and then accept or reject that proposal using the partial sums (Devroye, 2009). In particular, suppose we want to sample from the density f using the proposal density g, where ‖f/g‖_∞ ≤ c. Employing the accept-reject algorithm, one proposes X ∼ g and accepts this proposal if Y ≤ f(X), where Y ∼ U(0, cg(X)). If f is an infinite series and its partial sums satisfy

$$S_0(x) \ge S_2(x) \ge \cdots \ge f(x) \ge \cdots \ge S_3(x) \ge S_1(x), \qquad (13)$$

then Y ≤ f(X) is equivalent to Y ≤ S_n(X) for some odd n, and Y > f(X) is equivalent to Y > S_n(X) for some even n. Thus, to sample f using accept-reject, one may propose X ∼ g, draw Y ∼ U(0, cg(X)), accept the proposal X if Y ≤ S_n(X) for some odd n, and reject it if Y > S_n(X) for some even n. These partial sums may be calculated and checked on the fly.
The Jacobi density has two infinite-sum representations that, when spliced together, yield f(x) = Σ_{n=0}^∞ (−1)^n a_n(x), with

$$a_n(x) = \begin{cases} \pi(n + 1/2)\left( \dfrac{2}{\pi x} \right)^{3/2} \exp\left\{ -\dfrac{2(n + 1/2)^2}{x} \right\}, & 0 < x \le t, \qquad (14) \\[1ex] \pi(n + 1/2) \exp\left\{ -\dfrac{(n + 1/2)^2\pi^2 x}{2} \right\}, & x > t, \qquad (15) \end{cases}$$

which satisfies the partial-sum criterion (13) for all x > 0, as shown by Devroye. It is necessary to use both representations to satisfy the partial-sum criterion for all x.
The J*(1, z) density can likewise be written as an alternating sum f(x | z) = Σ_{n=0}^∞ (−1)^n a_n(x | z) that satisfies the partial-sum criterion, by setting

$$a_n(x \mid z) = \cosh(z) \exp\left( -\frac{z^2 x}{2} \right) a_n(x)$$

in the manner of (11).
in the manner of (11). The first term in the series provides a natural proposal, as
a0(x|z) ≥ f(x|z), suggesting that
c(z) g(x|z) = cosh(z)
(
2
πx
)3/2
exp
−z
2x
2− 1
2x
, 0 < x ≤ t,
exp
−(z2
2+π2
8
)x
, x > t.
(16)
Examining the piecewise kernels, one finds that g(x | z) can be written as the mixture

$$g(x \mid z) = \begin{cases} \mathrm{IG}(|z|^{-1}, 1)\,\mathbf{1}_{(0,t]}, & \text{with probability } p/(p + q), \\ \mathrm{Ex}(z^2/2 + \pi^2/8)\,\mathbf{1}_{(t,\infty)}, & \text{with probability } q/(p + q), \end{cases}$$

where p = ∫_0^t c g(x | z) dx and q = ∫_t^∞ c g(x | z) dx. With this proposal in hand, sampling J*(1, z) can be summarized as follows.
1. Generate a proposal X ∼ g(x|z).
2. Generate Y ∼ U(0, cg(X|z)).
3. Iteratively calculate S_n(X) until Y ≤ S_n(X) for an odd n, or until Y > S_n(X) for an even n.
4. Accept X if n is odd; return to step 1 if n is even.
The details of the implementation, along with pseudocode, are described in the Ap-
pendix.
7.3 Analysis of acceptance rate
The J∗(1, z) sampler is highly efficient. The parameter c found in (16), which depends
on both the tilting parameter z and the truncation point t, describes on average how
many proposals we expect to make before accepting. Devroye shows that in the case
of z = 0, one can pick t so that c is near unity. The following extends this result to
non-zero tilting parameters so that, on average, the J∗(1, z) sampler rejects no more
than 8 out of every 10,000 draws, regardless of z. We proceed in two steps to show
that one may choose t so that c(z, t) is near unity for all z ≥ 0.
To begin, we show that there is some t* such that c(z, t*) ≤ c(z, t) for all z ≥ 0 and for all t for which the alternating-sum criterion holds. Let a_n^L denote the left coefficient, a_n 1_{(0,t]}, and a_n^R the right coefficient, a_n 1_{(t,∞)}, from (14) and (15) respectively.
Take z to be fixed. We are interested in picking t to minimize c(t) = p(t) + q(t), where

$$p = \int_0^t \cosh(z) \exp\left( -\frac{z^2 x}{2} \right) a_0^L(x)\, dx \qquad \text{and} \qquad q = \int_t^{\infty} \cosh(z) \exp\left( -\frac{z^2 x}{2} \right) a_0^R(x)\, dx.$$
The truncation point t must lie in the interval (log 3)/π² ≤ t ≤ 4/log 3 for the alternating-sum criterion to hold. Therefore c will attain a minimum (for fixed z) at some t* in that interval, as c is continuous in t. Differentiating with respect to t, we find that any critical point t* will satisfy

$$\cosh(z) \exp\left( -\frac{z^2 t^*}{2} \right) \left[ a_0^L(t^*) - a_0^R(t^*) \right] = 0,$$

or will be at the boundary of the interval. Thus any minimum on the interior of this interval will be independent of z. Devroye suggests that the best choice of t is indeed on the interior of the aforementioned interval, and is t* = 0.64.
Having found that the best t* is independent of z, we now show that the maximum of c(z, t*) is achieved over z > 0. Indeed, differentiating under the integral sign, we find that

$$c'(z) = \left( \tanh(z) - \frac{z^2}{2} \right) c(z).$$

Hence a critical point z* must solve

$$\tanh(z) - \frac{z^2}{2} = 0,$$

as c(z) > 0. Solving for z and checking the second-order conditions, one can easily show that the maximum is achieved at z* = 1.378293, for which c(z*, t*) = 1.0008. In other words, one expects to reject at most 8 out of every 10,000 draws for any z > 0.
8 Discussion
We have shown that Bayesian inference for logistic models can be implemented using
a data augmentation scheme based on Polya-Gamma distributions. This leads to
simple Gibbs-sampling algorithms for posterior computation that exploit standard
normal linear-model theory.
It also opens the door for exact Bayesian treatments of many modern-day machine-
learning classification methods based on mixtures of logits. Indeed, many likelihood
functions long thought to be intractable resemble the sum-of-exponentials form in
the multinomial logit model; two prominent examples are restricted Boltzmann ma-
chines (Salakhutdinov et al., 2007) and logistic-normal topic models (Blei and Laf-
ferty, 2007). Applying the Polya-Gamma mixture framework to such problems is
currently an active area of research.
A further useful fact is that the expected value of a Polya-Gamma random variable is available in closed form. If ω ∼ PG(b, c), then

$$\mathrm{E}(\omega) = \frac{b}{2c} \tanh(c/2).$$
We arrive at this result by appealing to the moment-generating function of the PG(b, 0) density, evaluated at c²/2:

$$\cosh^{-b}\left( \frac{c}{2} \right) = \mathrm{E}\left( e^{-\frac{1}{2}\omega c^2} \right) = \int_0^{\infty} e^{-\frac{1}{2}\omega c^2} p(\omega \mid b, 0)\, d\omega.$$

Taking logs and differentiating under the integral sign with respect to c then gives the moment identity

$$\mathrm{E}(\omega) = \frac{1}{c} \frac{\partial}{\partial c} \log \cosh^b\left( \frac{c}{2} \right).$$

Simple algebra reduces this down to the form above.
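A quick Monte Carlo check of this identity, using the truncated sum-of-gammas sketch from Section 2:

    b <- 2; c <- 3
    mean(replicate(1e4, rpg.naive(b, c)))  # roughly 0.30, up to Monte Carlo and truncation error
    (b / (2 * c)) * tanh(c / 2)            # closed form: 0.3017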
This allows the same data-augmentation scheme to be used in EM algorithms,
where the latent ω’s will form a set of complete-data sufficient statistics for the logistic
likelihood. We are actively studying how this fact can be exploited in large-scale
problems where Gibbs sampling becomes intractable. Some early results along these
lines are described in Polson and Scott (2011).
A number of technical details of our latent-variable representation are worth fur-
ther comment. First, the dimensionality of the set of latent ωi’s does not depend on
the sample size ni corresponding to each unique design point. Rather, the sample
size only affects the distribution of these latent variables. Therefore, our MCMC al-
gorithms are more parsimonious than traditional approaches that require one latent
variable for each observation. This is a major source of efficiency in the analysis of
contingency tables.
Second, posterior updating via exponential tilting is a quite general situation
that arises in Bayesian inference incorporating latent variables. For example, the
posterior distribution of ω that arises under normal data with precision ω and a
PG(b, 0) prior is precisely an exponentially tilted PG(b, 0) random variable. This led
to our characterization of the general PG(b, c) class.
Notice, moreover, that one may identify the conditional posterior for ωij strictly
using its moment-generating function, without ever appealing to Bayes’ rule for den-
sity functions. This follows the Levy-penalty framework of Polson and Scott (2012)
and relates to work by Ciesielski and Taylor (1962), who use a similar argument to
characterize sojourn times of Brownian motion. Doubtless there are many other mod-
eling situations where the basic idea is also applicable, or will lead to new insights.
A Details of sampling algorithm
Algorithm 1 shows how to simulate a PG(1, z) random variate. Recall that for the PG(b, z) case, where b is an integer, we add up b independent copies of PG(1, z). We expand here upon the simulation of truncated inverse-Gaussian random variables, denoted IG(µ, λ). We break the truncated inverse-Gaussian draw into two separate scenarios. When µ = 1/Z is large, the inverse-Gaussian distribution is approximately an inverse-χ²₁, motivating an accept-reject algorithm. When µ is small, a simple rejection algorithm suffices, as the truncated inverse-Gaussian will have most of its mass below the truncation point t.
Thus, when µ > t, we generate a truncated inverse-Gaussian random variate using accept-reject sampling. The proposal is Y ∼ 1/χ²₁, restricted to (0, t]. Following Devroye (2009), this variate may be generated by drawing a standard normal z truncated to [1/√t, ∞), squaring it, and taking the reciprocal. Devroye (1986, p. 382) suggests sampling the truncated normal by generating independent standard-exponential pairs (E, E′) until E² ≤ 2E′/t and returning (1 + tE)/√t. The ratio of the target inverse-Gaussian kernel to the proposal kernel is

$$\frac{x^{-3/2} \exp\left( -\frac{1}{2x} - \frac{z^2 x}{2} \right)}{x^{-3/2} \exp\left( -\frac{1}{2x} \right)} = \exp\left( -\frac{z^2 x}{2} \right),$$

whose supremum is unity. This implies an acceptance probability of exp(−Z²X/2). Since Z < 1/t and X < t, we may compute a lower bound on the average rate of acceptance:

$$\mathrm{E}\left\{ \exp\left( -\frac{Z^2 X}{2} \right) \right\} \ge \exp\left( -\frac{1}{2t} \right) = 0.61.$$

See Algorithm 2 for pseudocode.
When µ < t, we generate a truncated inverse-Gaussian random variate using rejection sampling. Devroye (1986, p. 149) describes how to sample from an inverse-Gaussian distribution using a many-to-one transformation. Sampling X in this fashion until X < t yields an acceptance rate bounded below by

$$\int_0^t \mathrm{IG}(x \mid \mu = 1/Z, \lambda = 1)\, dx \ge \int_0^t \mathrm{IG}(x \mid \mu = t, \lambda = 1)\, dx = 0.67$$

for all µ < t. See Algorithm 3 for pseudocode.
A final note applies to the negative-binomial case, where the first parameter of the Polya-Gamma distribution (b) may be a non-integral real number. Here we exploit the additivity of the Polya-Gamma class, writing b = ⌊b⌋ + e, where ⌊b⌋ is the integer part of b and e is the fractional remainder. We then sample from PG(⌊b⌋, z) exactly, and from PG(e, z)
Algorithm 1 Sampling from PG(1, z)

Input: z, a positive real number.
Define: pigauss(t | µ, λ), the CDF of the inverse-Gaussian distribution.
Define: a_n(x), the piecewise-defined coefficients in Equations (14) and (15).

z ← |z|/2, t ← 0.64, K ← π²/8 + z²/2
p ← (π/(2K)) exp(−Kt)
q ← 2 exp(−z) pigauss(t | µ = 1/z, λ = 1.0)
repeat
    Generate U, V ∼ U(0, 1)
    if U < p/(p + q) then
        (Truncated exponential)
        X ← t + E/K, where E ∼ E(1)
    else
        (Truncated inverse Gaussian)
        µ ← 1/z
        if µ > t then
            repeat
                Generate 1/X ∼ χ²₁ 1_{(1/t, ∞)} (Algorithm 2)
            until U(0, 1) < exp(−z²X/2)
        else
            repeat
                Generate X ∼ IG(µ, 1.0) (Algorithm 3)
            until X < t
        end if
    end if
    S ← a₀(X), Y ← V S, n ← 0
    repeat
        n ← n + 1
        if n is odd then
            S ← S − a_n(X); if Y < S, then return X/4
        else
            S ← S + a_n(X); if Y > S, then break
        end if
    until FALSE
until FALSE
Algorithm 2 Generating IG(µ = 1/Z, λ = 1), truncated to (0, t], when µ > t

Truncation point: t. Set Z = 1/µ.
repeat
    repeat
        Generate E, E′ ∼ E(1).
    until E² ≤ 2E′/t
    X ← t/(1 + tE)²
    α ← exp(−½ Z²X)
    U ∼ U(0, 1)
until U ≤ α
Algorithm 3 Generating IG(µ = 1/Z, λ = 1), truncated to (0, t], when µ ≤ t

repeat
    Y ∼ N(0, 1)²
    X ← µ + ½µ²Y − ½µ√(4µY + (µY)²)
    U ∼ U(0, 1)
    If U > µ/(µ + X), then X ← µ²/X.
until X ≤ t
using the finite sum-of-gammas approximation. With 200 terms in the sum, we find
that the approximation is quite accurate for such small values of the first parameter,
as each Ga(e, 1) term in the sum tends to be small, and the weights in the sum decay
like 1/k². (This approximation is never necessary in the case of the logit models considered.)
References
J. H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–79, 1993.
D. Andrews and C. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36:99–102, 1974.
O. E. Barndorff-Nielsen, J. Kent, and M. Sorensen. Normal variance-mean mixtures and z distributions. International Statistical Review, 50:145–59, 1982.
D. M. Blei and J. Lafferty. A correlated topic model of Science. The Annals of Applied Statistics, 1(1):17–35, 2007.
J. Carlin. Meta-analysis for 2 × 2 tables: a Bayesian approach. Statistics in Medicine, 11(2):141–58, 1992.
Z. Ciesielski and S. J. Taylor. First passage times and sojourn times for Brownian motion in space and the exact Hausdorff measure of the sample path. Transactions of the American Mathematical Society, 103(3):434–50, 1962.
A. P. Dawid. Some matrix-variate distribution theory: Notational considerations and a Bayesian application. Biometrika, 68:265–274, 1981.
L. Devroye. Non-Uniform Random Variate Generation. Springer, 1986.
L. Devroye. On exact simulation algorithms for some distributions related to Jacobi theta functions. Statistics & Probability Letters, 79(21):2251–9, 2009.
A. Dobra, C. Tebaldi, and M. West. Data augmentation in multi-way contingency tables with fixed marginal totals. Journal of Statistical Planning and Inference, 136(2):355–72, 2006.
S. Fruhwirth-Schnatter and R. Fruhwirth. Auxiliary mixture sampling with applications to logistic models. Computational Statistics and Data Analysis, 51:3509–28, 2007.
A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2nd edition, 2004.
A. Gelman, A. Jakulin, M. Pittau, and Y. Su. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4):1360–83, 2008.
B. German. Glass identification dataset, 1987. URL http://archive.ics.uci.edu/ml/datasets/Glass+Identification.
C. Geyer. Practical Markov chain Monte Carlo. Statistical Science, 7:473–511, 1992.
R. B. Gramacy and N. G. Polson. Simulation-based regularized logistic regression. Bayesian Analysis, 2012.
C. Holmes and L. Held. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145–68, 2006.
T. Leonard. Bayesian estimation methods for two-way contingency tables. Journal of the Royal Statistical Society (Series B), 37(1):23–37, 1975.
A. D. Martin, K. M. Quinn, and J. H. Park. MCMCpack: Markov chain Monte Carlo in R. Journal of Statistical Software, 42(9):1–21, 2011.
D. Michie, D. Spiegelhalter, and C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood Limited, 1994.
N. G. Polson and J. G. Scott. Data augmentation for non-Gaussian regression models using variance-mean mixtures. Technical report, University of Texas at Austin, http://arxiv.org/abs/1103.5407v3, 2011.
N. G. Polson and J. G. Scott. Local shrinkage rules, Levy processes, and regularized regression. Journal of the Royal Statistical Society (Series B), 2012. (to appear).
B. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th Annual International Conference on Machine Learning, pages 791–8, 2007.
A. Skene and J. C. Wakefield. Hierarchical models for multi-centre binary response studies. Statistics in Medicine, 9:919–29, 1990.
M. Zhou, L. Li, D. Dunson, and L. Carin. Lognormal and gamma mixed negative binomial regression. In International Conference on Machine Learning (ICML), 2012.