Bayesian inference for logistic models using Polya-Gamma latent variables

Nicholas G. Polson∗ (University of Chicago)
James G. Scott† and Jesse Windle‡ (University of Texas at Austin)

First Draft: August 2011. This Draft: April 2012.
Abstract
We propose a new data-augmentation strategy for fully Bayesian inference in models with logistic likelihoods. To illustrate the method we focus on four examples: binary logistic regression; contingency tables with fixed margins; multinomial logistic regression; and negative-binomial models for count data. In each case, we show how our data-augmentation strategy leads to simple, effective methods for posterior inference that: (1) entirely circumvent the need for analytic approximations, numerical integration, or Metropolis–Hastings; and (2) outperform the existing state of the art, both in ease of use and in computational efficiency. We also describe how the method may be extended to other logit models, including topic models and nonlinear mixed-effects models. Our approach appeals heavily to the novel class of Polya-Gamma distributions; much of the paper is devoted to constructing this class in detail, and demonstrating its relevance for Bayesian inference. The methods described here are implemented in the R package BayesLogit.
∗[email protected]  †[email protected]  ‡[email protected]
arXiv:1205.0310v1 [stat.ME] 2 May 2012
1 Introduction
1.1 A latent-variable representation of logistic models
Many common statistical models involve a likelihood that decomposes into a product
of terms of the form
$$L_i = \frac{(e^{\psi_i})^{a_i}}{(1 + e^{\psi_i})^{b_i}}, \qquad (1)$$

where ψ_i is a linear function of parameters, and where a_i and b_i involve the response for subject i. One familiar case arises in binary logistic regression, where we observe outcomes y_i ∈ {0, 1}, i = 1, . . . , n, and assume that

$$\mathrm{pr}(y_i = 1) = \frac{e^{\psi_i}}{1 + e^{\psi_i}}. \qquad (2)$$

Here ψ_i = x_i^T β for a known p-vector of predictors x_i and a common, unknown set of regression coefficients β = (β_1, . . . , β_p)^T.
In this paper, we propose a new latent-variable representation of likelihoods in-
volving terms like (1). This leads directly to efficient Gibbs-sampling algorithms for a
very wide class of Bayesian models that have previously eluded simple treatment. This
paper focuses on four common models: binary logistic regression; contingency tables
with logistic-normal priors; multinomial logistic regression; and negative-binomial
models for count data. But the basic approach also applies to a much wider class
of models, including topic models, discrete mixtures of logits, random-effects and
mixed-effects models, non-Gaussian factor models, and hazard models.
Our representation is based upon a new family of Polya-Gamma distributions,
which we introduce here, and describe thoroughly in Section 2.
Definition 1. A random variable X has a Polya-Gamma distribution with parameters b > 0 and real c, denoted X ∼ PG(b, c), if

$$X \overset{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2 + c^2/(4\pi^2)}, \qquad (3)$$

where each g_k ∼ Ga(b, 1) is an independent gamma random variable.
With this definition in place, we proceed directly to our main result, which we
prove in Section 2.3.
Theorem 1. Let p(ω) denote the density of the random variable ω ∼ PG(b, 0), b > 0. Then the following integral identity holds for all a ∈ R:

$$\frac{(e^{\psi})^a}{(1 + e^{\psi})^b} = 2^{-b} e^{\kappa\psi} \int_0^{\infty} e^{-\omega\psi^2/2}\, p(\omega)\, d\omega, \qquad (4)$$

where κ = a − b/2.
Moreover, the conditional distribution

$$p(\omega \mid \psi) = \frac{e^{-\omega\psi^2/2}\, p(\omega)}{\int_0^{\infty} e^{-\omega\psi^2/2}\, p(\omega)\, d\omega},$$

which arises in treating the integrand in (4) as an unnormalized joint density in (ψ, ω), is also in the Polya-Gamma class: (ω | ψ) ∼ PG(b, ψ).
Equation (4) is significant because it allows us to write (1) in a conditionally Gaussian form, given the latent precision ω_i. Thus in the important case where ψ_i = x_i^T β is a linear function of predictors, the full likelihood in β will also be conditionally Gaussian. Moreover, the second half of Theorem 1 shows that the full conditional distribution for each ω_i, given ψ_i, is in the same class as the prior p(ω_i).
Theorem 1 thus enables full Bayesian inference for a wide class of models involving
logistic likelihoods and conditionally Gaussian priors. All of these models can now be
fit with a Gibbs sampler having only two basic steps: multivariate normal draws for
the main parameters affecting the ψi terms, and Polya-Gamma draws for the latent
variables ωi. (We will describe this second step in detail.)
1.2 Existing work
As we will show, our approach has significant advantages over existing approaches
for Bayesian inference in logistic models. While traditional approaches have involved
numerical integration, normal approximations to the likelihood (Carlin, 1992; Gelman
et al., 2004), or the Metropolis–Hastings algorithm (Dobra et al., 2006), the most
direct comparison here is with the more recent methods of Holmes and Held (2006)
and Fruhwirth-Schnatter and Fruhwirth (2007). Both of these papers rely upon a
random-utility construction of the logistic model, where the outcomes yi are assumed
to be thresholded versions of an underlying continuous quantity zi. For the purposes
of exposition, assume a linear model where
$$y_i = \begin{cases} 1, & z_i \ge 0 \\ -1, & z_i < 0 \end{cases} \qquad z_i = x_i^T\beta + \epsilon_i, \qquad \epsilon_i \sim \mathrm{Lo}(1), \qquad (5)$$
where εi ∼ Lo(1) has a standard logistic distribution. Upon marginalizing over zi,
the likelihood in (2) is recovered. A useful analogy is with the work of Albert and
Chib (1993), who represent the probit regression model in the same way, subject to
the modification that εi has a standard normal distribution.
The difficulty with this latent-utility representation is that the logistic distribution
does not lead to an easy method for sampling from the posterior distribution of β.
This is very different from the probit case, where one may exploit the conditional
normality of the likelihood in β, given zi.
Figure 1: Directed acyclic graphs depicting two latent-variable constructions for the logistic-regression model: the random-utility model of Holmes and Held (2006) and Fruhwirth-Schnatter and Fruhwirth (2007), on the left; versus our direct data-augmentation scheme, on the right.
To respond to this difficulty, Holmes and Held (2006) introduce a second layer of latent variables for the logit model, writing the errors ε_i in (5) as

$$(\epsilon_i \mid \phi_i) \sim \mathrm{N}(0, \phi_i), \qquad \phi_i = (2\lambda_i)^2, \qquad \lambda_i \sim \mathrm{KS}(1),$$
where λi has a Kolmogorov–Smirnov distribution. Marginalizing over λi recovers the
original latent-utility representation of the likelihood for yi. This exploits the fact
that the logistic distribution is a scale mixture of normals (Andrews and Mallows,
1974). Posterior sampling proceeds by iterating three steps: sample zi from a trun-
cated normal distribution, given λi, yi, and β; sample β from a multivariate normal
distribution, given zi and λi; and sample λi, given zi and β. This latter update can
be accomplished via adaptive rejection sampling, using the alternating-series method
of Devroye (1986).
Fruhwirth-Schnatter and Fruhwirth (2007) take a different approach based upon
auxiliary mixture sampling. They approximate p(εi) in (5) using a discrete mixture
of normals, rather than a scale mixture:
$$(\epsilon_i \mid \phi_i) \sim \mathrm{N}(0, \phi_i), \qquad \phi_i \sim \sum_{k=1}^{K} w_k\, \delta_{\phi^{(k)}},$$

where δ_φ indicates a Dirac measure at φ. The weights w_k and the points φ^{(k)} in
the discrete mixture are fixed for a given choice of K so that the Kullback–Leibler
divergence from the true distribution of the random utilities is minimized. Fruhwirth-
Schnatter and Fruhwirth (2007) argue that the choice of K = 10 leads to a good
approximation, and list the optimal weights and variances for this choice.
This results in a Gibbs-sampling scheme that also requires two levels of latent variables, and that has a very similar structure to that of Holmes and Held (2006): sample z_i, given φ_i, y_i, and β; sample β, given z_i and φ_i; and sample φ_i, given z_i and β, from its discrete conditional distribution. The difficulties of dealing with the
Kolmogorov–Smirnov distribution for λi are avoided, at the cost of approximating
the true model for εi with a thinner-tailed Gaussian mixture.
An approach similar to ours is that of Gramacy and Polson (2012), who concentrate
on a latent-variable representation of a powered-up version of the logit likelihood. This
representation is very useful for obtaining classical penalized-likelihood estimates via
simulation. But the difficulty with performing fully Bayesian inference using this
representation is that, when applied to the likelihood in (2), it leads to an improper
mixing distribution for the latent variable. This requires modifications that make
simulation very challenging in the general logit case. In contrast, our Polya-Gamma
representation sidesteps this issue entirely, and results in a proper mixing distribution
for all common choices of ai, bi in (1).
1.3 Outline
In Section 2, we construct the class of Polya–Gamma variables, and prove our key
result (Theorem 1). Then in Section 3 we illustrate the power of this result by
showing how it leads to a simple, efficient sampling algorithm for posterior inference
in binary logistic regression. We also benchmark our method extensively against
those of Holmes and Held (2006) and Fruhwirth-Schnatter and Fruhwirth (2007). In
Sections 4–6 we illustrate the method in three further settings: contingency tables,
multinomial logistic regression, and negative-binomial regression.
Then in Section 7, we address the key requirement for Theorem 1 to be practically
useful: namely, an efficient method for simulating Polya-Gamma random variables.
The identity in (3) suggests a naive way to approximate this distribution using a
large number of independent gamma draws. But in Section 7, we describe a far more
efficient and stable way to sample exactly from the Polya-Gamma distribution, which
avoids the difficulties that can result from truncating an infinite sum. To do so, we use
the alternating-series method from Devroye (1986) to construct a rejection sampler
for the Polya-Gamma distribution.
As we will prove, this sampler is highly efficient: on average, it requires 1.000803
proposals for every accepted draw, regardless of the parameters of the underlying
distribution. The proposal distribution itself is easy to sample from, and involves only
exponential and inverse-Gaussian draws, together with a handful of simple auxiliary
computations.
We conclude in Section 8 with some final remarks about how the method can be
generalized to other settings.
2 Polya-Gamma random variables
2.1 The case PG(b, 0)
The key step in our approach is the construction of a novel class of Polya-Gamma
random variables. Together with the sampling method described in Section 7, this
distributional theory greatly simplifies Bayesian inference in models with logistic like-
lihoods.
Following Devroye (2009), a random variable J∗ has a Jacobi distribution if
$$J^* \overset{D}{=} \frac{2}{\pi^2} \sum_{k=1}^{\infty} \frac{e_k}{(k - 1/2)^2}, \qquad (6)$$
where the e_k are independent, standard exponential random variables. The moment-generating function of this distribution is

$$\mathrm{E}(e^{-tJ^*}) = \frac{1}{\cosh(\sqrt{2t})}. \qquad (7)$$
The density of this distribution is expressible as a multi-scale mixture of inverse-
Gaussians; all moments are finite and expressible in terms of Riemann zeta functions.
The Jacobi is related to the Polya distribution (Barndorff-Nielsen et al., 1982), in that if J^* has a Jacobi distribution, and ω =^D J^*/4, then ω ∼ Pol(1/2, 1/2).

Let ω_k ∼ Pol(1/2, 1/2) for k = 1, . . . , n be a set of independent Polya-distributed random variables. A PG(n, 0) random variable is then defined by the sum ω^⋆ =^D Σ_{k=1}^n ω_k. Its moment-generating function follows from that of the Jacobi distribution:

$$\mathrm{E}\{\exp(-\omega_k t)\} = \frac{1}{\cosh(\sqrt{t/2})}.$$
Therefore, for the Polya-Gamma distribution with parameters (n, 0), we have

$$\mathrm{E}\{\exp(-\omega^{\star} t)\} = \frac{1}{\cosh^n(\sqrt{t/2})}. \qquad (8)$$
The name “Polya-Gamma” arises from the following observation. From (6),

$$\omega^{\star} \overset{D}{=} \sum_{l=1}^{n} \left( \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{e_{l,k}}{(k - 1/2)^2} \right),$$

where the e_{l,k} are independent exponential random variables. Rearranging terms,

$$\omega^{\star} \overset{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{\sum_{l=1}^{n} e_{l,k}}{(k - 1/2)^2} \overset{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2},$$
where gk are independent Gamma(n, 1) random variables. More generally we may
replace n with any positive real number b.
2.2 The general PG(b, c) class
The general PG(b, c) class arises through an exponential tilting of the PG(b, 0) density.
Specifically, we define the density of a PG(b, c) random variable as
p(ω | b, c) =exp
(− c2
2ω)p(ω | b, 0)
Eω
exp(− c2
2ω) , (9)
where p(ω | b, 0) is the density of a PG(b, 0) random variable. The expectation in
the denominator is taken with respect to the PG(b, 0) distribution, ensuring that
p(ω | b, c) integrates to 1.
Using the Weierstrass factorization theorem, we may write the moment-generating function of the PG(b, c) distribution as

$$\mathrm{E}_{\omega}\left\{ \exp\left( -\frac{1}{2}\omega t \right) \right\} = \frac{\cosh^b\left( \frac{c}{2} \right)}{\cosh^b\left( \frac{\sqrt{c^2 + t}}{2} \right)} = \prod_{k=1}^{\infty} \left( \frac{1 + \frac{c^2}{4(k - 1/2)^2\pi^2}}{1 + \frac{c^2 + t}{4(k - 1/2)^2\pi^2}} \right)^{b} = \prod_{k=1}^{\infty} (1 + d_k^{-1} t)^{-b}, \quad \text{where } d_k = 4\left( k - \frac{1}{2} \right)^2 \pi^2 + c^2.$$
Each term in the product is recognizable as the moment-generating function of a gamma distribution. We can therefore write a PG(b, c) random variable as

$$\omega \overset{D}{=} 2 \sum_{k=1}^{\infty} \frac{\mathrm{Ga}(b, 1)}{d_k} = \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{\mathrm{Ga}(b, 1)}{(k - \frac{1}{2})^2 + c^2/(4\pi^2)}.$$
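This sum representation suggests a simple, if approximate, way to generate PG(b, c) draws by truncating the series at K terms. The following R sketch is ours (not part of the BayesLogit package) and is intended only for illustration; the exact sampler of Section 7 is preferable in practice, since the neglected tail biases the draw downward by roughly b/(2π²K) on average.

    # Approximate PG(b, c) draw: truncate the sum-of-gammas
    # representation in (3) at K terms.
    rpg.naive <- function(b, c = 0, K = 200) {
      k <- 1:K
      g <- rgamma(K, shape = b, rate = 1)   # g_k ~ Ga(b, 1)
      sum(g / ((k - 1/2)^2 + c^2 / (4 * pi^2))) / (2 * pi^2)
    }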
2.3 Proof of Theorem 1
The key step in the proof of Theorem 1 is simply the construction of the Polya-Gamma
class itself. With this theory in place, the proof proceeds very straightforwardly.
Proof. Appealing to the moment-generating function in (8), and noting that 1 + e^ψ = 2 e^{ψ/2} cosh(ψ/2), we may write the left-hand side of (4) as

$$\frac{(e^{\psi})^a}{(1 + e^{\psi})^b} = \frac{2^{-b} \exp(\kappa\psi)}{\cosh^b(\psi/2)} = 2^{-b} e^{\kappa\psi}\, \mathrm{E}_{\omega}\{\exp(-\omega\psi^2/2)\},$$

where the expectation is taken with respect to ω ∼ PG(b, 0), and where κ = a − b/2.

Turn now to the conditional distribution

$$p(\omega \mid \psi) = \frac{e^{-\omega\psi^2/2}\, p(\omega)}{\int_0^{\infty} e^{-\omega\psi^2/2}\, p(\omega)\, d\omega},$$

where p(ω) is the density of the prior, PG(b, 0). This is of the same form as (9), with c = ψ. Therefore (ω | ψ) ∼ PG(b, ψ).
3 Binary logistic regression
3.1 Overview of approach
Several examples will serve to illustrate the utility of Theorem 1, beginning with the
case of binary logistic regression.
Let y_i ∈ {0, 1} be a binary outcome for unit i, with corresponding predictors x_i = (x_{i1}, . . . , x_{ip}). Suppose that the logistic model of (2) holds, and that we model the log odds of success as ψ_i = x_i^T β. Appealing to Theorem 1, the contribution of y_i to the likelihood in β may be written as

$$L_i(\beta) = \frac{\exp(x_i^T\beta)^{y_i}}{1 + \exp(x_i^T\beta)} \propto \exp(\kappa_i x_i^T\beta) \int_0^{\infty} \exp\{-\omega_i (x_i^T\beta)^2/2\}\, p(\omega_i \mid 1, 0)\, d\omega_i,$$

where κ_i = y_i − 1/2, and where p(ω_i | 1, 0) is the density of a Polya-Gamma random variable with parameters (1, 0).
Combining all n terms gives the following expression for the conditional likelihood in β, given ω = (ω_1, . . . , ω_n):
$$L(\beta \mid \omega) = \prod_{i=1}^{n} L_i(\beta \mid \omega_i) \propto \prod_{i=1}^{n} \exp\left\{ \kappa_i x_i^T\beta - \omega_i (x_i^T\beta)^2/2 \right\} \propto \prod_{i=1}^{n} \exp\left\{ -\frac{\omega_i}{2}\left( x_i^T\beta - \kappa_i/\omega_i \right)^2 \right\} \propto \exp\left\{ -\frac{1}{2}(z - X\beta)^T \Omega (z - X\beta) \right\},$$
where z = (κ_1/ω_1, . . . , κ_n/ω_n)^T, and where Ω = diag(ω_1, . . . , ω_n). Given all ω_i terms, we therefore have a conditionally Gaussian likelihood in β, with pseudo-response vector z, design matrix X, and covariance matrix Ω^{-1}.
Suppose that β has a conditionally normal prior β ∼ N(b, B). We may then sample from the joint posterior distribution for β by iterating two simple Gibbs steps:

$$(\omega_i \mid \beta) \sim \mathrm{PG}(1, x_i^T\beta), \qquad (\beta \mid y, \omega) \sim \mathrm{N}(m, V),$$

where

$$V = (X^T \Omega X + B^{-1})^{-1}, \qquad m = V(X^T \Omega z + B^{-1} b),$$
recalling that z_i = ω_i^{-1}(y_i − 1/2) itself depends upon the augmentation variable ω_i, along with the binary outcome y_i. If the number of regressors p exceeds the number of observations, then these matrix computations can be done faster by exploiting the Sherman–Morrison–Woodbury identity. Moreover, the outer products x_i x_i^T can be pre-evaluated, which will speed computation of X^T Ω X at each step of the Gibbs sampler.
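For concreteness, the following R sketch implements this two-step sampler under the N(b, B) prior. We assume the interface rpg(num, h, z) for Polya-Gamma draws from our BayesLogit package; the naive truncated-sum sampler sketched in Section 2 could be substituted. Note that X^T Ω z = X^T κ, since z_i = κ_i/ω_i.

    library(BayesLogit)   # provides rpg(num, h, z), as assumed here

    logit.gibbs <- function(y, X, b0, B0, nsamp = 1000) {
      n <- nrow(X); p <- ncol(X)
      P0    <- solve(B0)            # prior precision B^{-1}
      P0b   <- P0 %*% b0
      kappa <- y - 1/2              # kappa_i = y_i - 1/2
      beta  <- rep(0, p)
      draws <- matrix(NA, nsamp, p)
      for (s in 1:nsamp) {
        # Step 1: (omega_i | beta) ~ PG(1, x_i' beta)
        omega <- rpg(n, h = 1, z = as.numeric(X %*% beta))
        # Step 2: (beta | y, omega) ~ N(m, V)
        V    <- solve(t(X) %*% (X * omega) + P0)   # (X' Omega X + B^{-1})^{-1}
        m    <- V %*% (t(X) %*% kappa + P0b)       # V (X' kappa + B^{-1} b)
        beta <- as.numeric(m + t(chol(V)) %*% rnorm(p))
        draws[s, ] <- beta
      }
      draws
    }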
3.2 Example: spam classification
To demonstrate the approach, we fit a logistic-regression model to the data set on
spam classification from the online machine-learning repository at the University of
California, Irvine. The data set has n = 4601 observations and p = 57 predictors, all
of which correspond to the frequencies with which certain words appear in the body of
an e-mail. In fitting the model, we omitted the frequencies of the words “George” and
“cs.” These are the e-mail recipient’s first name and home department, respectively,
and therefore not generalizable to spam filters for other e-mail recipients.
Figure 2: Posterior distributions for the regression coefficients on the spam-classification data. The black dots are the posterior means; the light- and dark-grey lines indicate 90% and 50% posterior credible intervals, respectively; and the crosses are the corresponding maximum-likelihood estimates.
As a prior for β, we used a logistic distribution, following Gelman et al. (2008):

$$p(\beta) \propto \frac{\exp(x_0^T\beta)^{1/2}}{1 + \exp(x_0^T\beta)},$$
where each entry of the prior design point x0 is the sample mean of the corresponding
column of the design matrix X. This encodes the prior belief that the probability of
a success is 1/2 at the global average value of the predictors, and can be interpreted
as a pseudo-observation of y0 = 1/2 with a prior sample size of 1. This prior is easily
accommodated by the Polya-Gamma data-augmentation framework. In principle,
any prior which is a scale mixture of normals may be used instead; see, e.g. Polson
and Scott (2012).
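In fact, this prior is itself a term of the form (1), with a = 1/2 and b = 1, so that κ_0 = a − b/2 = 0 and the prior contributes its own latent ω_0 ∼ PG(1, x_0^T β). A hedged sketch of folding it into the sampler above (variable names are ours) simply appends a pseudo-row to the design:

    x0      <- colMeans(X)    # the prior design point
    Xa      <- rbind(X, x0)   # design matrix with the pseudo-observation appended
    kappa.a <- c(y - 1/2, 0)  # kappa_0 = a - b/2 = 1/2 - 1/2 = 0
    # Inside the Gibbs loop, with no separate Gaussian prior term remaining:
    #   omega.a <- rpg(nrow(Xa), h = 1, z = as.numeric(Xa %*% beta))
    #   V <- solve(t(Xa) %*% (Xa * omega.a));  m <- V %*% (t(Xa) %*% kappa.a)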
Results of the model fit are shown in Figure 2. For the most part, the Bayesian
model yields results very similar to the maximum-likelihood solution. This is unsur-
prising, given the large amount of data and the weakly informative prior.
3.3 Comparison with existing methods
To demonstrate the merits of our method, we benchmark it against both those of
Fruhwirth-Schnatter and Fruhwirth (2007) and Holmes and Held (2006)—specifically,
their joint updating technique, as it has the largest effective sample size of those they
consider. We use the same metrics, datasets, and sample sizes described in Holmes
and Held (2006). The efficiency of each sampler is measured in two ways: by the
average step size between successive draws of the regression coefficients and by the
effective sample size of each component averaged across all components of β. The
average step size is
Dist. =1
N − 1
N−1∑i=1
‖β(i) − β(i+1)‖
and the average effective sample size is

$$\mathrm{ESS} = \frac{1}{P} \sum_{i=1}^{P} \mathrm{ESS}_i, \qquad \text{where } \mathrm{ESS}_i = M \Big/ \Big( 1 + 2\sum_{j=1}^{k} \rho_i(j) \Big),$$

M is the number of post-burn-in samples, and ρ_i(j) is the jth autocorrelation of β_i, as estimated by the initial monotone sequence estimator of Geyer (1992).
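Both metrics are straightforward to compute in R from an N × P matrix beta of post-burn-in draws. A sketch, with the caveat that coda's effectiveSize uses a spectral estimate of the autocorrelation time rather than Geyer's initial monotone sequence estimator, so it only approximates the ESS used in our tables:

    library(coda)
    avg.dist <- mean(sqrt(rowSums(diff(beta)^2)))  # mean of ||beta^(i) - beta^(i+1)||
    avg.ess  <- mean(effectiveSize(mcmc(beta)))    # ESS averaged over the P components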
The four datasets considered are the Pima Indians diabetes dataset used in Ripley
(1996), and the Australian Credit, German Credit, and Heart datasets found in the
STATLOG project (Michie et al., 1994). For each dataset and each method we
ran 10 simulations of 10,000 samples each, discarding the first 1,000 samples of each
simulation as burn-in. In each case we assume a diffuse normal prior, β ∼ N(0, 100I).
Table 1 presents the results of these comparisons. For the examples considered
here, the Polya-Gamma method is at least 2.8 times as efficient as that of Holmes and
Held, and at least 35 times as efficient as that of Fruhwirth-Schnatter and Fruhwirth.
These draw-by-draw comparisons of efficiency are useful, but do not take into con-
sideration the relative speed of individual draws from each method. We compare the
running times of the Polya-Gamma (PG) method versus Holmes and Held (HH) to
measure the relative effective sample sizes per unit time. These results were produced
using R. In general, R is a poor language in which to benchmark execution time, as
the number of for loops, while loops, if statements, function calls, and vectorized op-
erations can have a large impact on how long a script runs. Nonetheless, the comparison
is still appropriate in this case, as the PG and HH algorithms have a similar structure.
Furthermore, in both cases we use an external routine for the most time-consuming
steps: drawing from the Polya-Gamma distribution and drawing from p(λ | z, β) in
the case of the HH method.
As seen in Table 2, the gains in efficiency are even greater for the PG method
when accounting for per-draw execution time. The PG method finished roughly twice
as fast on average, producing at least a five-fold improvement in effective sample size
per unit time. While the direct comparison was performed in R, the BayesLogit
package has implemented the Polya-Gamma procedure for binary logistic regression
in C++. The approximate running times in BayesLogit are also listed in Table 2,
and are much faster.
These results can be explained by the fact that our approach requires only a
single layer of latent variables. It also maintains conditional conjugacy, and does not
Dataset      n     p   p*  |  PG: ESS   Dist.  |  HH: ESS   Dist.  |  FSF: ESS   Dist.  |  ESS Ratio: HH   FSF
Diabetes    768    8    9  |     4797    0.70  |     1695    0.42  |       114    0.11  |            2.8    42
Heart       270   13   19  |     3184    4.86  |      891    2.70  |        85    0.78  |            3.6    37
Australia   690   14   35  |     3173   10.93  |      724    5.37  |        84    1.64  |            4.4    38
Germany    1000   20   49  |     4981    3.95  |     1723    2.40  |       115    0.68  |            2.9    43

Table 1: The effective sample size and average distance between samples of β for the four data sets considered. The ESS ratio is the ratio of the effective sample size of the Polya-Gamma technique to those of Holmes and Held (2006) and Fruhwirth-Schnatter and Fruhwirth (2007), respectively. For each data set there are n observations, p predictors, and p* components of β including interactions. The Australian credit dataset possesses perfectly separable data, which is captured by a diverging entry in β. This inflates the measure of distance between successive draws of β for each method, and is strongly affected by the choice of prior for β. Consequently, one should view the distance between β's for the Australian credit dataset as a measure of how quickly each method captures this pathological coefficient.
Dataset     BayesLogit: Time  |  PG (R): Time   ESS/sec.  |  HH (R): Time   ESS/sec.  |  ESS/sec. Ratio
Diabetes          5.3         |          167      29      |          334      5.1     |       5.6
Heart             2.6         |           61      53      |          100      9.0     |       5.9
Australia         9.6         |          158      20      |          334      2.1     |       9.3
Germany          18.0         |          234      21      |          553      3.1     |       6.9

Table 2: Comparisons of actual run times in seconds for the Polya-Gamma and Holmes–Held methods implemented in R. All computations were performed on a Linux machine with an Intel Xeon 3.30 GHz CPU and 4GB of RAM. The average running time (sec.) was calculated using the proc.time command in R. Both the KS mixture distribution and the Polya-Gamma distribution were sampled from R by calls to an external compiled C routine, making these times comparable. The BayesLogit column is the average time it took to run 10,000 samples using the BayesLogit package, which calls a compiled C++ routine to perform posterior sampling.
require an analytic approximation to p(εi). In fact, we dispense with the latent-utility
approach altogether in favor of a direct mixture representation of the logit likelihood,
which also avoids the step of sampling a truncated normal variable. Moreover, the
latent precision ωi in our construction is faster to sample than the latent variance
φi in the Kolmogorov–Smirnov mixture construction, leading to the further gains in
speed shown in Table 2.
Another important gain in efficiency arises when the data contain repeated ob-
servations. This can happen in logistic regression when there are repeated design
points, but is most relevant for the analysis of contingency tables with fixed margins.
Suppose that we cross-tabulate n observations into k distinct nonzero cells. If we
parametrize the cell probabilities by their log-odds and fit a logistic-normal model,
the latent-utility approach will require 2n latent variables. By contrast, our approach
will require only k latent variables; the cell counts affect the distribution of these la-
tent variables via the a and b terms in (4), but not their number. Our latent-variable
construction will therefore scale to larger data sets much more efficiently.
4 Contingency tables with fixed margins
4.1 2 × 2 × N tables
Next, consider a simple example of a binary-response clinical trial conducted in each
of N different centers. Let nij be the number of patients assigned to treatment
regime j in center i; and let Y = {y_ij} be the corresponding numbers of successes for i = 1, . . . , N. Table 3 presents a data set along these lines, from Skene and Wakefield (1990). These data arise from a multi-center trial comparing the efficacy
of two different topical cream preparations, labeled the treatment and the control.
Let p_ij denote the underlying success probability in center i for treatment j, and ψ_ij the corresponding log-odds. If ψ_i = (ψ_{i1}, ψ_{i2})^T is assigned a bivariate normal prior ψ_i ∼ N(µ, Σ), then the posterior for Ψ = {ψ_ij} is

$$p(\Psi \mid Y) \propto \prod_{i=1}^{N} \left\{ \frac{e^{y_{i1}\psi_{i1}}}{(1 + e^{\psi_{i1}})^{n_{i1}}} \cdot \frac{e^{y_{i2}\psi_{i2}}}{(1 + e^{\psi_{i2}})^{n_{i2}}} \cdot p(\psi_{i1}, \psi_{i2} \mid \mu, \Sigma) \right\}.$$
We apply Theorem 1 to each term in the posterior, thereby introducing augmentation variables Ω_i = diag(ω_{i1}, ω_{i2}) for each center. This yields, after some algebra, a simple Gibbs sampler that iterates between two sets of conditional distributions:

$$(\psi_i \mid Y, \Omega_i, \mu, \Sigma) \sim \mathrm{N}(m_i, V_{\Omega_i}), \qquad (\omega_{ij} \mid \psi_{ij}) \sim \mathrm{PG}(n_{ij}, \psi_{ij}), \qquad (10)$$

where

$$V_{\Omega_i}^{-1} = \Omega_i + \Sigma^{-1}, \qquad m_i = V_{\Omega_i}(\kappa_i + \Sigma^{-1}\mu), \qquad \kappa_i = (y_{i1} - n_{i1}/2,\; y_{i2} - n_{i2}/2)^T.$$
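A minimal R sketch of one sweep of this sampler for a single center, under the same assumed rpg interface as in Section 3: here y and n are length-2 vectors of successes and totals for (treatment, control), and psi holds the current pair of log-odds.

    update.center <- function(psi, y, n, mu, Sigma) {
      omega <- rpg(2, h = n, z = psi)     # (omega_ij | psi_ij) ~ PG(n_ij, psi_ij)
      Sinv  <- solve(Sigma)
      V     <- solve(diag(omega) + Sinv)  # V_Omega_i = (Omega_i + Sigma^{-1})^{-1}
      kappa <- y - n / 2
      m     <- V %*% (kappa + Sinv %*% mu)
      as.numeric(m + t(chol(V)) %*% rnorm(2))   # draw psi_i ~ N(m_i, V_Omega_i)
    }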
Figure 3 shows the results of applying this Gibbs sampler to the data from Skene
and Wakefield (1990). Notice that our method requires 16 latent variables for a data
set with 273 observations. By comparison, a Gibbs sampler based on the random-
utility construction in (5) would require 546 latent variables: a (zi, φi) pair for every
individual in the study.
In this analysis, we used a normal-Wishart prior for (µ,Σ−1). Hyperparameters
were chosen to match Table II from Skene and Wakefield (1990), who parameterize
Table 3: Data from a multi-center, binary-response study on topical cream effectiveness (Skene and Wakefield, 1990).
Center  |  Treatment: Success   Total  |  Control: Success   Total
   1    |              11         36   |             10        37
   2    |              16         20   |             22        32
   3    |              14         19   |              7        19
   4    |               2         16   |              1        17
   5    |               6         17   |              0        12
   6    |               1         11   |              0        10
   7    |               1          5   |              1         9
   8    |               4          6   |              6         7
Figure 3: Posterior distributions for the log-odds ratio for each of the 8 centers in the topical-cream study from Skene and Wakefield (1990). The vertical lines are central 95% posterior credible intervals; the dots are the posterior means; and the X's are the maximum-likelihood estimates of the log-odds ratios, with no shrinkage among the treatment centers. Note that the maximum-likelihood estimate is ψ_{i2} = −∞ for the control group in centers 5 and 6, as no successes were observed.
the model in terms of the prior expected values for ρ, σ²_{ψ1}, and σ²_{ψ2}, where

$$\Sigma = \begin{pmatrix} \sigma^2_{\psi_1} & \rho \\ \rho & \sigma^2_{\psi_2} \end{pmatrix}.$$
To match their choices, we use the following identity, which codifies the relationship between the hyperparameters B and d and the prior moments for the marginal variances and the correlation coefficient. If Σ ∼ IW(d, B), then

$$B = (d - 3) \begin{pmatrix} \mathrm{E}(\sigma^2_{\psi_1}) + \mathrm{E}(\sigma^2_{\psi_2}) + 2\,\mathrm{E}(\rho)\sqrt{\mathrm{E}(\sigma^2_{\psi_1})\,\mathrm{E}(\sigma^2_{\psi_2})} & \mathrm{E}(\sigma^2_{\psi_2}) + \mathrm{E}(\rho)\sqrt{\mathrm{E}(\sigma^2_{\psi_1})\,\mathrm{E}(\sigma^2_{\psi_2})} \\ \mathrm{E}(\sigma^2_{\psi_2}) + \mathrm{E}(\rho)\sqrt{\mathrm{E}(\sigma^2_{\psi_1})\,\mathrm{E}(\sigma^2_{\psi_2})} & \mathrm{E}(\sigma^2_{\psi_2}) \end{pmatrix}.$$
In this way we are able to map from pre-specified moments to hyperparameters, ending up with d = 4 and

$$B = \begin{pmatrix} 0.754 & 0.857 \\ 0.857 & 1.480 \end{pmatrix}.$$
4.2 Higher-order tables
Now consider a multi-center, multinomial-response study with more than two treatment arms. This can be modeled using a hierarchy of N different two-way tables, each having the same J treatment regimes and K possible outcomes. The data D consist of triply indexed outcomes y_ijk, each indicating the number of observations in center i and treatment j with outcome k. We let n_ij = Σ_k y_ijk denote the number of subjects assigned to treatment j at center i.
Let P = {p_ijk} denote the set of probabilities that a subject in center i with treatment j experiences outcome k, such that Σ_k p_ijk = 1 for all i, j. Given these probabilities, the full likelihood is

$$L(P) = \prod_{i=1}^{N} \prod_{j=1}^{J} \prod_{k=1}^{K} p_{ijk}^{y_{ijk}}.$$
Following Leonard (1975), we can model these probabilities using a logistic transformation. Let

$$p_{ijk} = \frac{\exp(\psi_{ijk})}{\sum_{l=1}^{K} \exp(\psi_{ijl})}.$$
Many common prior structures will maintain conditional conjugacy using the Polya-
Gamma framework outlined thus far. For example, we may assume an exchangeable
matrix-normal prior at the level of treatment centers:
ψi ∼ N(M,ΣR,ΣC) ,
where ψi is the matrix whose (j, k) entry is ψijk; M is the mean matrix; and ΣR
and ΣC are row- and column-specific covariance matrices, respectively. See Dawid
(1981) for further details on matrix-normal theory. Note that, for identifiability, we
set ψijK = 0, implying that ΣC is of dimension K − 1.
This leads to a posterior of the form

$$p(\Psi \mid D) \propto \prod_{i=1}^{N} \left[ p(\psi_i) \cdot \prod_{j=1}^{J} \prod_{k=1}^{K} \left( \frac{\exp(\psi_{ijk})}{\sum_{l=1}^{K} \exp(\psi_{ijl})} \right)^{y_{ijk}} \right],$$
suppressing any dependence on (M,ΣR,ΣC) for notational ease.
To show that this fits within the Polya-Gamma framework, we use an approach similar to that of Holmes and Held (2006), rewriting each probability as

$$p_{ijk} = \frac{\exp(\psi_{ijk})}{\sum_{l \neq k} \exp(\psi_{ijl}) + \exp(\psi_{ijk})} = \frac{e^{\psi_{ijk} - c_{ijk}}}{1 + e^{\psi_{ijk} - c_{ijk}}},$$

where c_ijk = log Σ_{l≠k} exp(ψ_ijl) is implicitly a function of the other ψ_ijl's for l ≠ k.
We now fix values of i and k and examine the conditional posterior distribution for ψ_{i·k} = (ψ_{i1k}, . . . , ψ_{iJk})^T, given ψ_{i·l} for l ≠ k:

$$p(\psi_{i\cdot k} \mid D, \psi_{i\cdot(-k)}) \propto p(\psi_{i\cdot k} \mid \psi_{i\cdot(-k)}) \cdot \prod_{j=1}^{J} \left( \frac{e^{\psi_{ijk} - c_{ijk}}}{1 + e^{\psi_{ijk} - c_{ijk}}} \right)^{y_{ijk}} \left( \frac{1}{1 + e^{\psi_{ijk} - c_{ijk}}} \right)^{n_{ij} - y_{ijk}} = p(\psi_{i\cdot k} \mid \psi_{i\cdot(-k)}) \cdot \prod_{j=1}^{J} \frac{e^{y_{ijk}(\psi_{ijk} - c_{ijk})}}{(1 + e^{\psi_{ijk} - c_{ijk}})^{n_{ij}}}.$$
This is simply a multivariate version of the bivariate form that arises in a 2 × 2 table. Appealing to the theory of Polya-Gamma random variables outlined above, we may express this as

$$p(\psi_{i\cdot k} \mid D, \psi_{i\cdot(-k)}) \propto p(\psi_{i\cdot k} \mid \psi_{i\cdot(-k)}) \cdot \prod_{j=1}^{J} \frac{e^{\kappa_{ijk}(\psi_{ijk} - c_{ijk})}}{\cosh^{n_{ij}}\{(\psi_{ijk} - c_{ijk})/2\}} = p(\psi_{i\cdot k} \mid \psi_{i\cdot(-k)}) \cdot \prod_{j=1}^{J} \left[ e^{\kappa_{ijk}(\psi_{ijk} - c_{ijk})} \cdot \mathrm{E}\left\{ e^{-\omega_{ijk}(\psi_{ijk} - c_{ijk})^2/2} \right\} \right],$$
where ω_ijk ∼ PG(n_ij, 0) for j = 1, . . . , J, and κ_ijk = y_ijk − n_ij/2. Given ω_ijk for j = 1, . . . , J, all of these terms will combine in a single normal kernel, whose mean and covariance structure will depend heavily upon the particular choices of hyperparameters in the matrix-normal prior for ψ_i. Each ω_ijk term can be updated as

$$(\omega_{ijk} \mid \psi_{ijk}) \sim \mathrm{PG}(n_{ij}, \psi_{ijk} - c_{ijk}),$$

leading to a simple MCMC scheme that loops over centers and responses, drawing each vector of parameters ψ_{i·k} (that is, for all treatments at once) conditional on the other ψ_{i·(−k)}'s; a sketch of the required ingredients is given below.
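A brief R sketch (names ours) of the ingredients for this per-(i, k) block update, given the J × K matrix Psi.i of current log-odds for center i (last column fixed at zero), the J × K count matrix y.i, and the vector n.i of treatment totals:

    block.stats <- function(Psi.i, y.i, n.i, k) {
      C     <- log(rowSums(exp(Psi.i[, -k, drop = FALSE])))  # c_ijk, j = 1..J
      eta   <- Psi.i[, k] - C                                # psi_ijk - c_ijk
      omega <- rpg(nrow(Psi.i), h = n.i, z = eta)            # PG(n_ij, psi_ijk - c_ijk)
      kappa <- y.i[, k] - n.i / 2                            # kappa_ijk = y_ijk - n_ij/2
      list(omega = omega, kappa = kappa, C = C)
    }
    # These terms combine with the matrix-normal prior into one Gaussian
    # draw for the column psi_{i.k}, analogous to the bivariate update above.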
5 Multinomial logistic regression
One may extend the Polya-Gamma method used for binary logistic regression to multinomial logistic regression. Consider the multinomial sample y_i = {y_ij}_{j=1}^J, which records the number of responses in each category j = 1, . . . , J, together with the total number of responses n_i. The logistic link function for polychotomous regression stipulates that the probability of randomly drawing a single response from the jth category in the ith sample is

$$p_{ij} = \frac{\exp(\psi_{ij})}{\sum_{k=1}^{J} \exp(\psi_{ik})},$$

where the log odds ψ_ij is modeled by x_i^T β_j, and β_J has been constrained to zero for purposes of identification. Following Holmes and Held (2006), the likelihood for β_j conditional upon β_{−j}, the matrix with column vector β_j removed, is
$$\ell(\beta_j \mid \beta_{-j}, y) = \prod_{i=1}^{N} \left( \frac{e^{\eta_{ij}}}{1 + e^{\eta_{ij}}} \right)^{y_{ij}} \left( \frac{1}{1 + e^{\eta_{ij}}} \right)^{n_i - y_{ij}}, \quad \text{where } \eta_{ij} = x_i^T\beta_j - C_{ij}, \quad C_{ij} = \log \sum_{k \neq j} \exp(x_i^T\beta_k),$$
which looks like the binary logistic likelihood previously discussed. Incorporating the Polya-Gamma auxiliary variables, the likelihood becomes

$$\prod_{i=1}^{N} e^{\kappa_{ij}\eta_{ij}}\, e^{-\eta_{ij}^2\omega_{ij}/2}\, \mathrm{PG}(\omega_{ij} \mid n_i, 0),$$

where κ_ij = y_ij − n_i/2. Employing the conditionally conjugate prior β_j ∼ N(m_{0j}, V_{0j}) yields a two-part update:

$$(\beta_j \mid \Omega_j) \sim \mathrm{N}(m_j, V_j), \qquad (\omega_{ij} \mid \beta_j) \sim \mathrm{PG}(n_i, \eta_{ij}) \quad \text{for } i = 1, \dots, N,$$
where

$$V_j^{-1} = X^T\Omega_j X + V_{0j}^{-1}, \qquad m_j = V_j\left\{ X^T(\kappa_j - \Omega_j C_j) + V_{0j}^{-1} m_{0j} \right\},$$

and Ω_j = diag({ω_ij}_{i=1}^N). One may sample the posterior of (β | y) via Gibbs sampling by repeatedly iterating over the above steps for j = 1, . . . , J − 1; a minimal sketch is given below.
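The following R sketch implements the update for a single category j, again under the assumed rpg interface. Here y is an N × J count matrix, n the vector of row totals, and Beta a p × J coefficient matrix with Beta[, J] fixed at zero.

    update.category <- function(j, Beta, y, n, X, m0, V0inv) {
      C     <- log(rowSums(exp(X %*% Beta[, -j])))  # C_ij = log sum_{k != j} exp(x_i' beta_k)
      eta   <- as.numeric(X %*% Beta[, j]) - C      # eta_ij = x_i' beta_j - C_ij
      omega <- rpg(nrow(X), h = n, z = eta)         # (omega_ij | beta_j) ~ PG(n_i, eta_ij)
      kappa <- y[, j] - n / 2
      V <- solve(t(X) %*% (X * omega) + V0inv)
      m <- V %*% (t(X) %*% (kappa - omega * C) + V0inv %*% m0)
      as.numeric(m + t(chol(V)) %*% rnorm(ncol(X))) # (beta_j | Omega_j) ~ N(m_j, V_j)
    }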
The Polya-Gamma method generates samples from the joint posterior distribu-
tion without appealing to analytic approximations to the posterior. This offers an
important advantage when the number of observations is not significantly larger than
the number of parameters.
To see this, consider sampling the joint posterior for β using a Metropolis-Hastings
algorithm with an independence proposal. The likelihood in β is approximately
normal, centered at the posterior mode m, and with variance V equal to the inverse
of the Hessian matrix evaluated at the mode. (Both of these may be found using
standard numerical routines.) Thus a natural proposal for (vec(β)^(t) | y) is vec(b) ∼ N(m, aV) for some a ≈ 1. When data are plentiful, this method is both simple and
highly efficient, and is implemented in many standard software packages (e.g. Martin
et al., 2011).
But when vec(β) is high-dimensional relative to the number of observations the
Hessian matrix H may be ill-conditioned, making it impossible or impractical to
generate normal proposals. Multinomial logistic regression succumbs to this problem
more quickly than binary logistic regression, as the number of parameters scales like
the product of the number of categories and the number of predictors.
To illustrate this phenomenon, we consider glass-identification data from German
(1987). This data set has J = 6 categories of glass and nine predictors describing the
chemical and optical properties of the glass that one may measure in a forensics lab
and use in a criminal investigation. This generates up to 50 = 10 × 5 parameters,
including the intercepts and the constraint that βJ = 0. These must be estimated
using n = 214 observations. In this case, the Hessian H at the posterior mode is
poorly conditioned when employing a vague prior, incapacitating the independent
Metropolis-Hastings algorithm. Numerical experiments confirm that even when a
vague prior is strong enough to produce a numerically invertible Hessian, rejection
rates are prohibitively high. In contrast, the multinomial Polya-Gamma method still
produces reasonable posterior distributions in a fully automatic fashion, even with
a weakly informative normal prior for each βj. Table 4, which shows the in-sample
performance of the multinomial logit model, demonstrates the problem with the joint
proposal distribution: category 6 is perfectly separable into cases and non-cases, even
though the other categories are not. This is a well-known problem with maximum-
likelihood estimation of logistic models. The same problem also forecloses the option
of posterior sampling using methods that require a unique MLE to exist.
Class      1    2    3    5    6    7
Total     70   76   17   13    9   29
Correct   50   55    0    9    9   27

Table 4: "Correct" refers to the number of glass fragments for each category that were correctly identified by the Bayesian multinomial logit model. The glass identification dataset includes a type of glass, class 4, for which there are no observations.
6 Negative-binomial models for count data
6.1 With regressors
Suppose that we have a Poisson model with gamma overdispersion:

$$(y_i \mid \lambda_i) \sim \mathrm{Pois}(\lambda_i), \qquad (\lambda_i \mid h, p_i) \sim \mathrm{Ga}\left( h, \frac{p_i}{1 - p_i} \right).$$
If we marginalize over λ_i, we get a negative-binomial marginal distribution for y_i:

$$p(y_i \mid h, p_i) \propto (1 - p_i)^h\, p_i^{y_i},$$

ignoring constants of proportionality. Now use a logit transform to represent p_i as

$$p_i = \frac{e^{\psi_i}}{1 + e^{\psi_i}}.$$
Since eψi = pi/(1− pi), the ψi term is analogous to a re-scaled version of the typical
linear predictor in a Poisson generalized-linear model using the canonical log link.
Appealing to Theorem 1, we can rewrite the negative-binomial likelihood in terms of ψ_i as

$$(1 - p_i)^h\, p_i^{y_i} = \frac{\exp(\psi_i)^{y_i}}{\{1 + \exp(\psi_i)\}^{h + y_i}} \propto e^{\kappa_i\psi_i} \int_0^{\infty} e^{-\omega_i\psi_i^2/2}\, p(\omega_i \mid h + y_i, 0)\, d\omega_i,$$
where κ_i = (y_i − h)/2, and where the mixing distribution is Polya-Gamma. Conditional upon ω_i, we have a likelihood proportional to e^{−Q(ψ_i)} for some quadratic form Q. Therefore this will be conditionally conjugate to any Gaussian prior, or to any prior that can be made conditionally Gaussian.
In the case where regressors are present, we have ψ_i = x_i^T β for some p-vector x_i. Then, conditional upon ω_i, the contribution of the ith observation to the likelihood is

$$L_i(\beta) \propto \exp\left\{ \kappa_i x_i^T\beta - \omega_i(x_i^T\beta)^2/2 \right\} \propto \exp\left\{ -\frac{\omega_i}{2}\left( \frac{y_i - h}{2\omega_i} - x_i^T\beta \right)^2 \right\}.$$
Let Ω = diag(ω_1, . . . , ω_n); let z_i = (y_i − h)/(2ω_i); and let z denote the stacked vector of z_i terms. Then, putting all the terms in the likelihood together, this is equivalent to the sampling model

$$(z \mid \beta, \Omega) \sim \mathrm{N}(X\beta, \Omega^{-1}),$$

or a simple Gaussian regression model with (known) covariance matrix Ω^{-1} describing the error structure.
Suppose that we assume a conditionally Gaussian prior, β ∼ N(b, B). Then Gibbs sampling proceeds in two simple steps:

$$(\omega_i \mid h, \beta) \sim \mathrm{PG}(h + y_i, x_i^T\beta), \qquad (\beta \mid \Omega, z) \sim \mathrm{N}(m, V),$$

where

$$V = (X^T\Omega X + B^{-1})^{-1}, \qquad m = V(X^T\Omega z + B^{-1}b),$$
recalling the definition of z above. A related paper is that of Zhou et al. (2012),
who use our Polya-Gamma construction to arrive at a similar MCMC method for
log-normal mixtures of negative binomial models.
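In R, the resulting sampler differs from the binary-logit sketch of Section 3 in only two lines. A hedged sketch of one iteration with fixed dispersion h (rpg as assumed before, now with first parameter h + y_i):

    nb.gibbs.step <- function(beta, y, X, h, b0, B0inv) {
      omega <- rpg(nrow(X), h = h + y, z = as.numeric(X %*% beta))  # PG(h + y_i, x_i' beta)
      kappa <- (y - h) / 2                      # so that X' Omega z = X' kappa
      V <- solve(t(X) %*% (X * omega) + B0inv)
      m <- V %*% (t(X) %*% kappa + B0inv %*% b0)
      as.numeric(m + t(chol(V)) %*% rnorm(ncol(X)))
    }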
6.2 Example: an AR(1) model for the incidence of flu
To illustrate the further elaborations that are possible to the basic framework, we
fit a negative-binomial AR(1) model to four years (2008–11) of weekly data on flu
incidence in Texas. These data were collected from the Texas Department of State Health Services. We emphasize that there are many ways to handle the overdispersion
present in this data set, and that we do not intend our model to be taken as a definitive
analysis. We merely intend it as a “proof of concept” example showing how various
aspects of Bayesian time-series modeling—in this case, a simple AR(1) model—can
now be incorporated seamlessly into models with non-Gaussian likelihoods, such as
those that arise in the analysis of binary, multinomial, and count data.
Let yt denote the number of reported cases of influenza-like illness (ILI) in week
t. We assume that these counts follow a negative-binomial model, or equivalently a
Figure 4: Incidence of influenza-like illness in Texas, 2008–11, together with the estimated mean λ_t from the negative-binomial AR(1) model (left) and the Poisson regression model incorporating a lagged value of the response variable as a predictor (right). The blanks in weeks 21-41 correspond to missing data. In each frame the grey lines depict the upper and lower bounds of a 95% predictive interval.
Gamma-overdispersed Poisson model:

$$y_t \sim \mathrm{NB}(h, p_t), \qquad p_t = \frac{e^{\psi_t}}{1 + e^{\psi_t}}, \qquad \psi_t = \alpha + \gamma_t, \qquad \gamma_t = \phi\gamma_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \mathrm{N}(0, \sigma^2),$$
assuming that t indexes time. It is trivial to sample from this model by combining the
results of the previous section with standard results on AR(1) models. In the following
analysis, we assumed an improper uniform prior on the dispersion parameter h, and
fixed φ and σ2 to 0.98 and 1, respectively. But it would be equally straightforward to
place hyper-priors upon each parameter, and to sample them in a hierarchical fashion.
It would also be straightforward to incorporate fixed effects in the form of regressors.
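As one concrete possibility (a sketch under our own naming, not the implementation used for Figure 4), the latent path γ can be updated by a forward-filter backward-sampler or, more simply, one site at a time: given ω_t ∼ PG(h + y_t, α + γ_t) and the pseudo-observation z_t = (y_t − h)/(2ω_t), each interior γ_t has a Gaussian full conditional.

    update.gamma.t <- function(t, gamma, z, omega, alpha, phi, sigma2) {
      # Precision: likelihood term omega_t plus AR(1) terms from t-1 and t+1
      prec <- omega[t] + (1 + phi^2) / sigma2
      mu   <- (omega[t] * (z[t] - alpha) +
               phi * (gamma[t - 1] + gamma[t + 1]) / sigma2) / prec
      rnorm(1, mu, sqrt(1 / prec))
    }

A forward-filter backward-sampler mixes faster and handles the endpoints directly; the single-site form above is shown only because it makes the conditional Gaussianity explicit.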
Figure 4 shows the results of the fit, along with a lag-1 Poisson regression model
incorporating xt = log(yt−1) as a regressor for the rate parameter at time t. In each
frame the grey lines depict the upper and lower bounds of a 95% predictive interval.
In the case of the Poisson model, this predictive interval is of negligible width, and
fails entirely to capture the right degree of dispersion. (The grey lines sit almost
directly over the black line; the predictive interval has a half-width essentially equal
to twice the square root of the estimated conditional rate parameter.)
7 Simulating Polya-Gamma random variables
7.1 Overview of approach
All our developments thus far require an efficient method for sampling Polya-Gamma
random variables. In this section, we derive such a method; it is implemented in a
separate sampling routine in the R package BayesLogit.
First, observe that one may sample Polya-Gamma random variables naively (and
approximately) using the sum-of-gammas representation in Equation (3). But this
is slow, and involves the potentially dangerous step of truncating an infinite sum.
We therefore construct an alternate, exact method by extending the approach taken
by Devroye (2009) for simulating distributions related to the Jacobi theta function,
which henceforth we will simply call Jacobi distributions. Of particular interest is J∗,
a Jacobi random variable having the moment-generating function given by (7).
The PG(1, z) distribution is related to an exponentially tilted Jacobi distribution J*(1, z), defined by

$$f(x \mid z) = \cosh(z)\, e^{-xz^2/2} f(x), \qquad (11)$$

where f(x) denotes the density function of J*, through the rescaling

$$\mathrm{PG}(1, z) = \frac{1}{4} J^*(1, z/2). \qquad (12)$$
We have appealed to (7) to compute the normalizing constant for f(x|z). Since
PG(b, z) is the sum of b independent PG(1, z) random variables, and since for our
purposes b is generally small, it will suffice to generate draws from PG(1, z), or
equivalently J*(1, z/2). (Though it should be noted that the algorithm we propose is
fast enough that it remains easy to simulate from PG(b, z) even for relatively large
b.) Thus the task at hand is to quickly simulate from the J∗(1, z/2) distribution. One
easily sums and scales b such draws to generate a PG(b, z) random variate.
7.2 The J∗(1, z) Sampler
Devroye (2009) develops an efficient J* sampler. Following this work, we develop an efficient sampler for the exponentially tilted random variate J*(1, z). In both cases, the density of interest can be written as an infinite, alternating sum, and is thus amenable to von Neumann's alternating-series method, which says that when the density of interest f sits between the partial sums of the series, one may draw from a proposal g and then accept or reject that proposal using the partial sums (Devroye, 2009). In particular, suppose we want to sample from the density f using the proposal density g, where ‖f/g‖_∞ ≤ c. Employing the accept-reject algorithm, one proposes X ∼ g and accepts this proposal if Y ≤ f(X), where Y ∼ U(0, cg(X)). If f is an infinite series and its partial sums satisfy

$$S_0(x) \ge S_2(x) \ge \cdots \ge f(x) \ge \cdots \ge S_3(x) \ge S_1(x), \qquad (13)$$

then Y ≤ f(X) is equivalent to Y ≤ S_n(X) for some odd n, and Y > f(X) is equivalent to Y > S_n(X) for some even n. Thus, to sample f using accept-reject, one may propose X ∼ g, draw Y ∼ U(0, cg(X)), accept the proposal X if Y ≤ S_n(X) for some odd n, and reject it if Y > S_n(X) for some even n. These partial sums may be calculated and checked on the fly.
The Jacobi density has two infinite-sum representations that, when spliced together, yield f(x) = Σ_{n=0}^∞ (−1)^n a_n(x), with

$$a_n(x) = \begin{cases} \pi(n + 1/2)\left( \dfrac{2}{\pi x} \right)^{3/2} \exp\left\{ -\dfrac{2(n + 1/2)^2}{x} \right\}, & 0 < x \le t, \qquad (14) \\[1ex] \pi(n + 1/2) \exp\left\{ -\dfrac{(n + 1/2)^2\pi^2 x}{2} \right\}, & x > t, \qquad (15) \end{cases}$$

which satisfies the partial-sum criterion (13) for all x > 0, as shown by Devroye. It is necessary to use both representations to satisfy the partial-sum criterion for all x.
The J*(1, z) density can likewise be written as an alternating sum f(x | z) = Σ_{n=0}^∞ (−1)^n a_n(x | z) that satisfies the partial-sum criterion, by setting

$$a_n(x \mid z) = \cosh(z) \exp\left( -\frac{z^2 x}{2} \right) a_n(x)$$

in the manner of (11).
in the manner of (11). The first term in the series provides a natural proposal, as
a0(x|z) ≥ f(x|z), suggesting that
c(z) g(x|z) = cosh(z)
(
2
πx
)3/2
exp
−z
2x
2− 1
2x
, 0 < x ≤ t,
exp
−(z2
2+π2
8
)x
, x > t.
(16)
Examining the piecewise kernels, one finds that g(x | z) can be written as the mixture

$$g(x \mid z) = \begin{cases} \mathrm{IG}(|z|^{-1}, 1)\,\mathbf{1}_{(0,t]}, & \text{with probability } p/(p + q), \\ \mathrm{Ex}(z^2/2 + \pi^2/8)\,\mathbf{1}_{(t,\infty)}, & \text{with probability } q/(p + q), \end{cases}$$

where p = ∫_0^t c g(x | z) dx and q = ∫_t^∞ c g(x | z) dx. With this proposal in hand, sampling J*(1, z) can be summarized as follows.
1. Generate a proposal X ∼ g(x|z).
2. Generate Y ∼ U(0, cg(X|z)).
3. Iteratively calculate S_n(X) until Y ≤ S_n(X) for an odd n, or until Y > S_n(X) for an even n.
4. Accept X if n is odd; return to step 1 if n is even.
The details of the implementation, along with pseudocode, are described in the Ap-
pendix.
7.3 Analysis of acceptance rate
The J∗(1, z) sampler is highly efficient. The parameter c found in (16), which depends
on both the tilting parameter z and the truncation point t, describes on average how
many proposals we expect to make before accepting. Devroye shows that in the case
of z = 0, one can pick t so that c is near unity. The following extends this result to
non-zero tilting parameters so that, on average, the J∗(1, z) sampler rejects no more
than 8 out of every 10,000 draws, regardless of z. We proceed in two steps to show
that one may choose t so that c(z, t) is near unity for all z ≥ 0.
To begin, we show that there is some t* such that c(z, t*) ≤ c(z, t) for all z ≥ 0 and for all t for which the alternating-sum criterion holds. Let a_n^L denote the left coefficient, a_n 1_{(0,t]}, and a_n^R the right coefficient, a_n 1_{(t,∞)}, from (14) and (15) respectively.
Take z to be fixed. We are interested in picking t to minimize c(t) = p(t) + q(t), where

$$p = \int_0^t \cosh(z) \exp\left( -\frac{z^2 x}{2} \right) a_0^L(x)\, dx \qquad \text{and} \qquad q = \int_t^{\infty} \cosh(z) \exp\left( -\frac{z^2 x}{2} \right) a_0^R(x)\, dx.$$
The truncation point t must lie in the interval (log 3)/π² ≤ t ≤ 4/log 3 for the alternating-sum criterion to hold. Therefore c will attain a minimum (for fixed z) at some t* in that interval, as c is continuous in t. Differentiating with respect to t, we find that any critical point t* will satisfy

$$\cosh(z) \exp\left( -\frac{z^2 t^*}{2} \right) \left[ a_0^L(t^*) - a_0^R(t^*) \right] = 0,$$

or will be at the boundary of the interval. Thus any minimum on the interior of this interval will be independent of z. Devroye suggests that the best choice of t is indeed on the interior of the aforementioned interval, and is t* = 0.64.
Having found that the best t* is independent of z, we now show that the maximum of c(z, t*) is achieved over z > 0. Indeed, differentiating under the integral sign, we find that

$$c'(z) = \left( \tanh(z) - \frac{z^2}{2} \right) c(z).$$

Hence a critical point z* must solve

$$\tanh(z) - \frac{z^2}{2} = 0,$$

as c(z) > 0. Solving for z and checking the second-order conditions, one can easily show that the maximum is achieved at z* = 1.378293, for which c(z*, t*) = 1.0008. In other words, one expects to reject at most 8 out of every 10,000 draws for any z > 0.
8 Discussion
We have shown that Bayesian inference for logistic models can be implemented using
a data augmentation scheme based on Polya-Gamma distributions. This leads to
simple Gibbs-sampling algorithms for posterior computation that exploit standard
normal linear-model theory.
It also opens the door for exact Bayesian treatments of many modern-day machine-
learning classification methods based on mixtures of logits. Indeed, many likelihood
functions long thought to be intractable resemble the sum-of-exponentials form in
the multinomial logit model; two prominent examples are restricted Boltzmann ma-
chines (Salakhutdinov et al., 2007) and logistic-normal topic models (Blei and Laf-
ferty, 2007). Applying the Polya-Gamma mixture framework to such problems is
currently an active area of research.
A further useful fact is that the expected value of a Polya-Gamma random variable is available in closed form. If ω ∼ PG(b, c), then

$$\mathrm{E}(\omega) = \frac{b}{2c} \tanh(c/2).$$
We arrive at this result by appealing to the moment-generating function of the PG(b, 0) density, evaluated at c²/2:

$$\cosh^{-b}\left( \frac{c}{2} \right) = \mathrm{E}\left( e^{-\frac{1}{2}\omega c^2} \right) = \int_0^{\infty} e^{-\frac{1}{2}\omega c^2} p(\omega \mid b, 0)\, d\omega.$$

Taking logs and differentiating under the integral sign with respect to c then gives the moment identity

$$\mathrm{E}(\omega) = \frac{1}{c} \frac{\partial}{\partial c} \log \cosh^b\left( \frac{c}{2} \right).$$

Simple algebra reduces this down to the form above.
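A quick Monte Carlo check of this identity, using the truncated sum-of-gammas sketch from Section 2:

    b <- 2; c <- 3
    mean(replicate(1e4, rpg.naive(b, c)))  # roughly 0.30, up to Monte Carlo and truncation error
    (b / (2 * c)) * tanh(c / 2)            # closed form: 0.3017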
This allows the same data-augmentation scheme to be used in EM algorithms,
where the latent ω’s will form a set of complete-data sufficient statistics for the logistic
likelihood. We are actively studying how this fact can be exploited in large-scale
problems where Gibbs sampling becomes intractable. Some early results along these
lines are described in Polson and Scott (2011).
A number of technical details of our latent-variable representation are worth fur-
ther comment. First, the dimensionality of the set of latent ωi’s does not depend on
the sample size ni corresponding to each unique design point. Rather, the sample
size only affects the distribution of these latent variables. Therefore, our MCMC al-
gorithms are more parsimonious than traditional approaches that require one latent
variable for each observation. This is a major source of efficiency in the analysis of
contingency tables.
Second, posterior updating via exponential tilting is a quite general situation
that arises in Bayesian inference incorporating latent variables. For example, the
posterior distribution of ω that arises under normal data with precision ω and a
PG(b, 0) prior is precisely an exponentially tilted PG(b, 0) random variable. This led
to our characterization of the general PG(b, c) class.
Notice, moreover, that one may identify the conditional posterior for ωij strictly
using its moment-generating function, without ever appealing to Bayes’ rule for den-
sity functions. This follows the Levy-penalty framework of Polson and Scott (2012)
and relates to work by Ciesielski and Taylor (1962), who use a similar argument to
characterize sojourn times of Brownian motion. Doubtless there are many other mod-
eling situations where the basic idea is also applicable, or will lead to new insights.
A Details of sampling algorithm
Algorithm 1 shows how to simulate a PG(1, z) random variate. Recall that for the PG(b, z) case, where b is an integer, we add up b independent copies of PG(1, z). We expand here upon the simulation of truncated inverse-Gaussian random variables, denoted IG(µ, λ). We break the truncated inverse-Gaussian draw into two separate scenarios. When µ = 1/Z is large, the inverse-Gaussian distribution is approximately an inverse-χ²₁, motivating an accept-reject algorithm. When µ is small, a simple rejection algorithm suffices, as the truncated inverse-Gaussian will have most of its mass below the truncation point t.
Thus, when µ > t, we generate a truncated inverse-Gaussian random variate using accept-reject sampling. The proposal is Y ∼ 1/χ²₁, restricted to (0, t]. Following Devroye (2009), this variate may be generated by drawing a standard normal z truncated to [1/√t, ∞), squaring it, and taking the reciprocal. Devroye (1986, p. 382) suggests sampling the truncated normal by generating independent standard-exponential pairs (E, E′) until E² ≤ 2E′/t and returning (1 + tE)/√t. The ratio of the target inverse-Gaussian kernel to the proposal kernel is

$$\frac{x^{-3/2} \exp\left( -\frac{1}{2x} - \frac{z^2 x}{2} \right)}{x^{-3/2} \exp\left( -\frac{1}{2x} \right)} = \exp\left( -\frac{z^2 x}{2} \right),$$

whose supremum is unity. This implies an acceptance probability of exp(−Z²X/2). Since Z < 1/t and X < t, we may compute a lower bound on the average rate of acceptance:

$$\mathrm{E}\left\{ \exp\left( -\frac{Z^2 X}{2} \right) \right\} \ge \exp\left( -\frac{1}{2t} \right) = 0.61.$$

See Algorithm 2 for pseudocode.
When µ < t, we generate a truncated inverse-Gaussian random variate using rejection sampling. Devroye (1986, p. 149) describes how to sample from an inverse-Gaussian distribution using a many-to-one transformation. Sampling X in this fashion until X < t yields an acceptance rate bounded below by

$$\int_0^t \mathrm{IG}(x \mid \mu = 1/Z, \lambda = 1)\, dx \ge \int_0^t \mathrm{IG}(x \mid \mu = t, \lambda = 1)\, dx = 0.67$$

for all µ < t. See Algorithm 3 for pseudocode.
A final note applies to the negative-binomial case, where the first parameter of the Polya-Gamma distribution (b) may be a non-integral real number. Here we exploit the additivity of the Polya-Gamma class, writing b = ⌊b⌋ + e, where ⌊b⌋ is the integer part of b and e is the fractional remainder. We then sample from PG(⌊b⌋, z) exactly, and from PG(e, z)
Algorithm 1 Sampling from PG(1, z)

Input: z, a positive real number.
Define: pigauss(t | µ, λ), the CDF of the inverse-Gaussian distribution.
Define: a_n(x), the piecewise-defined coefficients in Equations (14) and (15).

z ← |z|/2, t ← 0.64, K ← π²/8 + z²/2
p ← (π/(2K)) exp(−Kt)
q ← 2 exp(−z) pigauss(t | µ = 1/z, λ = 1.0)
repeat
    Generate U, V ∼ U(0, 1)
    if U < p/(p + q) then
        (Truncated exponential)
        X ← t + E/K, where E ∼ E(1)
    else
        (Truncated inverse Gaussian)
        µ ← 1/z
        if µ > t then
            repeat
                Generate 1/X ∼ χ²₁ 1_{(1/t, ∞)} (Algorithm 2)
            until U(0, 1) < exp(−z²X/2)
        else
            repeat
                Generate X ∼ IG(µ, 1.0) (Algorithm 3)
            until X < t
        end if
    end if
    S ← a₀(X), Y ← V S, n ← 0
    repeat
        n ← n + 1
        if n is odd then
            S ← S − a_n(X); if Y < S, then return X/4
        else
            S ← S + a_n(X); if Y > S, then break
        end if
    until FALSE
until FALSE
Algorithm 2 Generating IG(µ = 1/Z, λ = 1), truncated to (0, t], when µ > t

Truncation point: t. Set Z = 1/µ.
repeat
    repeat
        Generate E, E′ ∼ E(1).
    until E² ≤ 2E′/t
    X ← t/(1 + tE)²
    α ← exp(−½ Z²X)
    U ∼ U(0, 1)
until U ≤ α
Algorithm 3 Generating IG(µ = 1/Z, λ = 1), truncated to (0, t], when µ ≤ t

repeat
    Y ∼ N(0, 1)²
    X ← µ + ½µ²Y − ½µ√(4µY + (µY)²)
    U ∼ U(0, 1)
    If U > µ/(µ + X), then X ← µ²/X.
until X ≤ t
using the finite sum-of-gammas approximation. With 200 terms in the sum, we find
that the approximation is quite accurate for such small values of the first parameter,
as each Ga(e, 1) term in the sum tends to be small, and the weights in the sum decay
like 1/k². (This approximation is never necessary in the case of the logit models considered.)
References
J. H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–79, 1993.
D. Andrews and C. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36:99–102, 1974.
O. E. Barndorff-Nielsen, J. Kent, and M. Sorensen. Normal variance-mean mixtures and z distributions. International Statistical Review, 50:145–59, 1982.
D. M. Blei and J. Lafferty. A correlated topic model of Science. The Annals of Applied Statistics, 1(1):17–35, 2007.
J. Carlin. Meta-analysis for 2 × 2 tables: a Bayesian approach. Statistics in Medicine, 11(2):141–58, 1992.
Z. Ciesielski and S. J. Taylor. First passage times and sojourn times for Brownian motion in space and the exact Hausdorff measure of the sample path. Transactions of the American Mathematical Society, 103(3):434–50, 1962.
A. P. Dawid. Some matrix-variate distribution theory: Notational considerations and a Bayesian application. Biometrika, 68:265–274, 1981.
L. Devroye. Non-Uniform Random Variate Generation. Springer, 1986.
L. Devroye. On exact simulation algorithms for some distributions related to Jacobi theta functions. Statistics & Probability Letters, 79(21):2251–9, 2009.
A. Dobra, C. Tebaldi, and M. West. Data augmentation in multi-way contingency tables with fixed marginal totals. Journal of Statistical Planning and Inference, 136(2):355–72, 2006.
S. Fruhwirth-Schnatter and R. Fruhwirth. Auxiliary mixture sampling with applications to logistic models. Computational Statistics and Data Analysis, 51:3509–28, 2007.
A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2nd edition, 2004.
A. Gelman, A. Jakulin, M. Pittau, and Y. Su. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4):1360–83, 2008.
B. German. Glass identification dataset, 1987. URL http://archive.ics.uci.edu/ml/datasets/Glass+Identification.
C. Geyer. Practical Markov chain Monte Carlo. Statistical Science, 7:473–511, 1992.
R. B. Gramacy and N. G. Polson. Simulation-based regularized logistic regression. Bayesian Analysis, 2012.
C. Holmes and L. Held. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145–68, 2006.
T. Leonard. Bayesian estimation methods for two-way contingency tables. Journal of the Royal Statistical Society (Series B), 37(1):23–37, 1975.
A. D. Martin, K. M. Quinn, and J. H. Park. MCMCpack: Markov chain Monte Carlo in R. Journal of Statistical Software, 42(9):1–21, 2011.
D. Michie, D. Spiegelhalter, and C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood Limited, 1994.
N. G. Polson and J. G. Scott. Data augmentation for non-Gaussian regression models using variance-mean mixtures. Technical report, University of Texas at Austin, http://arxiv.org/abs/1103.5407v3, 2011.
N. G. Polson and J. G. Scott. Local shrinkage rules, Levy processes, and regularized regression. Journal of the Royal Statistical Society (Series B), 2012. (to appear).
B. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th Annual International Conference on Machine Learning, pages 791–8, 2007.
A. Skene and J. C. Wakefield. Hierarchical models for multi-centre binary response studies. Statistics in Medicine, 9:919–29, 1990.
M. Zhou, L. Li, D. Dunson, and L. Carin. Lognormal and gamma mixed negative binomial regression. In International Conference on Machine Learning (ICML), 2012.