
Bayesian recursive variable selection

Li Ma

September 21, 2012

Abstract

We show that all model space priors for linear regression can be represented by a

recursive procedure that randomly adds variables into the model in a fashion analogous

to the classical forward-stepwise strategy. We show that this recursive representation

enjoys desirable properties that provide new tools for Bayesian inference. In partic-

ular, this representation transforms the computation of model space posteriors from

an integration problem to a recursion one. In high-dimensional problems, where ex-

act evaluation of the posterior is computationally infeasible, this enables one to infer

the structure of the posterior through computational strategies designed for approx-

imating recursions, without resorting to sampling algorithms such as Markov Chain

Monte Carlo. While sampling methods are powerful for exploring local structures of

the posterior but prone to poor mixing and convergence in high-dimensional spaces, our

approach can effectively identify the global structure of the posterior, which can then

be used either directly for inference or for suggesting good starting points for sampling

methods. In addition, the recursive representation also facilitates prior specifications

for incorporating structural features of the model space. For illustration, we show how

one may choose an appropriate prior to take into account model space redundancy

arising from strong correlation among predictors.

Keywords: Bayesian model selection, Bayesian model averaging, stochastic variable search,

dilution effect, high-dimensional inference, forward-stepwise selection.


1 Introduction

We consider variable selection in linear regression. Suppose the data are n observations with

p potential predictors and a response. Let Y = (y1, y2, . . . , yn) denote the response vector

and Xj = (x1j, x2j, . . . , xnj) the jth predictor vector for j = 1, 2, . . . , p. We consider linear

regression models of the form

Mγ : Y = 1nα + Xγβγ + ǫ

where 1n stands for an n-vector of “1”s; ǫ = (ǫ1, ǫ2, . . . , ǫn) is a vector of i.i.d. Gaussian noise

with mean 0 and variance 1/φ; γ = (γ1, γ2, . . . , γp) ∈ {0, 1}^p ≡ ΩM is the “model identifier”,

that is, a vector of indicators whose jth element γj = 1 if and only if the jth variable Xj

enters the model; Xγ and βγ represent the corresponding design matrix and coefficients.

Bayesian inference for model choice is based on the posterior probability for each of the

models under consideration, which is determined by

p(Mγ|Y) = p(Mγ)p(Y|Mγ) / Σ_{γ′} p(Mγ′)p(Y|Mγ′)

where p(Mγ) is the prior probability assigned to model Mγ and p(Y |Mγ) is the marginal

likelihood under that model. In the current context of variable selection in linear regression,

p(Y|Mγ) = ∫ p(Y|θγ, Mγ) p(θγ|Mγ) dθγ

where θγ = (α,βγ, φ), representing the parameters under model Mγ and p(θγ|Mγ) is the

prior on these parameters given the model Mγ.

Recent decades have witnessed remarkable advance in the development of Bayesian vari-

able selection methods. The first studies on the choice of prior for model coefficients emerged


in the 1970’s and this topic has remained an active direction of research: see for example

Zellner (1971); Leamer (1978a,b); Zellner and Siow (1980); Zellner (1981, 1986); Stewart and

Davis (1986); Mitchell and Beauchamp (1988); Foster and George (1994); Kass and Wasser-

man (1995); George et al. (2000); Clyde and George (2000); Hansen and Yu (2001); Berger

and Pericchi (2001); Fernández et al. (2001); Clyde and George (2004); Liang et al. (2008).

Tremendous progress has also been made in the development of algorithms for effectively

exploring the posterior distribution on the model space especially since the birth of Markov

Chain Monte Carlo (MCMC) methods such as Gibbs sampling and the Metropolis-Hastings

algorithm: see for example George and McCulloch (1993); Geweke (1996); Smith and Kohn

(1996); Clyde et al. (1996); George and McCulloch (1997); Raftery et al. (1997); Hoeting

et al. (1999); Jones et al. (2005); Hans et al. (2007); Clyde et al. (2011).

In comparison, less effort has been made in studying the choice of priors on the model

space p(Mγ), and the independence model (George and McCulloch, 1993; Chipman, 1996;

George and McCulloch, 1997; Raftery et al., 1997)

p(Mγ) = ∏_{i=1}^{p} wi^{γi} (1 − wi)^{1−γi} (1)

where wi ∈ (0, 1) is the prior marginal inclusion probability of predictor Xi, has become a

popular choice mostly due to its simplicity. (Some recent exceptions include Li and Zhang

(2010) and George (2010).) For complex problems where it is hard to elicit prior information

on the wi’s, a common choice is to set wi ≡ w, e.g. w = 1/2, for all i.

Despite its popularity, inference using this simple independence prior on the model space

is unsatisfactory in several respects. First, this seemingly “non-informative” choice of the

prior in fact imposes strong assumptions on model complexity. In particular, it induces a

Binomial(p, w) distribution on the model size—that is, the number of predictors involved (Clyde and

George, 2004)—and favors large models when pw is large. One consequence of this on the


inference is the lack of control for multiple testing as pointed out recently by Scott and Berger

(2010). Empirical Bayes (George et al., 2000) and fully Bayes (Cui and George, 2008; Ley and

Steel, 2009; Carvalho and Scott, 2009) methods have been proposed to address this difficulty

by either estimating the value of w or placing a hyperprior on it. Second, it seems more

reasonable to allow the inclusion probability of each predictor given the others in the model

to depend on what those other predictors are. For example, when some predictors represent

the interactions of some other variables, one often wants to impose the constraint that the

inclusion probabilities of those “interaction” variables are non-zero only if the corresponding

main effects are included in the model. The independence prior does not allow specification

of such conditional inclusion probabilities.

A challenge faced by all existing Bayesian variable selection methods in high-dimensional

problems is efficient exploration of the posterior on model space. Almost all current methods rely on sampling techniques such as Markov Chain Monte Carlo (MCMC) and

stochastic variable search (SVS) to achieve this goal. (For one exception, see Hans et al.

(2007).) The performance of these techniques for drawing models in high-dimensional situ-

ations, however, is hard to guarantee and to evaluate. They are efficient in exploring local

features of the posterior but are prone to being trapped locally instead of exploring the larger

landscape of the posterior. Consequently, it is desirable to have a way to more effectively

capture the global shape of the posterior—one can then couple it with sampling methods for

fine mapping of the posterior.

In this work we introduce a new perspective on Bayesian variable selection that offers solu-

tions to the aforementioned statistical and computational challenges in a principled manner.

The key to this perspective is an observation that all priors on model spaces in the regression

context, including of course the commonly adopted independence prior, can be represented

using a recursive constructive procedure that randomly generates models by adding variables

sequentially. This procedure is specified by two sets of parameters that respectively control


for model complexity and conditional inclusion probabilities of the predictors, facilitating

the incorporation of such prior information. Computationally, this representation allows the

posterior on model spaces to be calculated analytically through a sequence of recursions. In

high-dimensional problems, where exact evaluation of the posterior is computationally infea-

sible and sampling methods are typically inefficient, one can use this representation to learn

the general structure of the posterior using strategies designed for approximating recursions.

The rest of the work is organized as follows. In Section 2 we introduce the construction

of the recursive representation of model space priors. We show that this representation is

completely general—all model space priors (and posteriors) can be represented this way. We

then show how to carry out Bayesian inference with priors under this representation based

on a recipe for analytically computing posteriors through recursion. Next, we show that

this representation allows one to utilize computational strategies designed for approximating

recursions to infer the shape of the posterior in high-dimensional problems. Section 3 presents

several simulated and real data examples to illustrate how the approach works. Additionally, in Section 4, we

carry out a case study that illustrates the flexibility of this representation for incorporating

model space structures. In particular, we provide an example showing how to take strong correlation among the predictors into account through prior specification under the

recursive representation. We conclude with some discussions in Section 5.

2 Methods

2.1 The forward-stepwise distribution

One way to construct a prior for Mγ, or equivalently one for γ, is by designing a random

mechanism that draws a model out of the collection of all possible models ΩM. In this

subsection we present such a procedure, and in the next we show its generality—all model

space priors can be constructed this way. Our proposed procedure starts from an initial model


Mγ(0) and randomly modifies this model in stages to produce the final model, denoted by

Mγ. It is a probabilistic counterpart of the frequentist forward-stepwise model selection

procedure, and can be most easily described in a recursive manner (details given below).

Figure 1 presents a graphical illustration of the procedure.

Figure 1: The forward-stepwise procedure.

The forward-stepwise (FS) procedure. Suppose after t steps of the recursive construction,

we have the model γ(t) = γ. (To begin, let t = 0, and we may start from the null model

γ(0) = (0, 0, . . . , 0) as our initial model.) Draw a Bernoulli random variable S(γ) with success

probability ρ(γ). If S(γ) = 1, then the procedure terminates and the current model Mγ is the final model. If instead S(γ) = 0, then Mγ is not the final model, and we then

modify it by adding one more predictor randomly drawn from the predictors that are not

already in the model.

More specifically, let J (γ) := {j ∈ {1, 2, . . . , p} : γj = 0}, that is, the collection of predictors not already in Mγ. (In particular, for t = 0, all p predictors are in J (γ(0)), and as we will see, for γ(t) = γ, the number of elements in J (γ) is always p − t under the design of the prior.) We randomly draw one element J(γ) out of J (γ) such that P(J(γ) = j) = λj(γ) and Σ_{j∈J (γ)} λj(γ) = 1. (This is equivalent to drawing J(γ) out of {1, 2, . . . , p} with λj(γ) = 0 for all j ∉ J (γ).) If J(γ) = j, then we add Xj into Mγ to get a new model Mγ(t+1). That is, we let γ(t+1) = (γ1(t+1), γ2(t+1), . . . , γp(t+1)) be such that γl(t+1) = γl for all l ≠ j and γj(t+1) = 1.

From now on we shall use γ+j to denote the indicator vector of this new model, where the

superscript “+j” stands for the additional inclusion of variable Xj into Mγ. This com-

pletes the (t + 1)th step of the recursive construction. The procedure then repeats itself on

γ(t+1) = γ+j, starting from the drawing of a stopping variable, until it eventually stops and

produces the final model Mγ. Note that if the procedure does not stop in the first p steps,

then it will reach the full model, γ(p) = γfull = (1, 1, . . . , 1). Because no further variables

can be added, we must have ρ(γfull) = 1 by design so the procedure always terminates.

Definition 1. A model Mγ, or equivalently a vector γ in {0, 1}^p, that arises from the above random recursive procedure is said to have a forward-stepwise (FS) distribution. The corresponding parameters for such a distribution are the stopping probability ρ(γ) and the selection probabilities λ(γ) := {λj(γ) : j ∈ J (γ)}, for each Mγ ∈ ΩM.

The stopping probabilities and the selection probabilities in the FS procedure characterize

two key aspects of model selection—the former on the size or complexity of the model

generated and the latter on the relevant predictors to be included, the order of their inclusion,

and the conditional inclusion probability given the variables already included in the model.

Later we will see through examples how we may incorporate prior information regarding

these aspects of the model through proper specification of these parameters.
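To make the procedure concrete, the following minimal sketch (not part of the paper; the helper names and the constant-ρ, uniform-λ specification are illustrative assumptions) simulates a single draw from an FS prior.

```python
import numpy as np

def draw_fs_model(p, rho, lam, rng=None):
    """Draw one model indicator gamma in {0,1}^p from a forward-stepwise (FS) prior.

    rho(gamma): stopping probability for the current model indicator gamma.
    lam(gamma): dict mapping each j with gamma[j] == 0 to its selection probability
                (must sum to one over the excluded predictors).
    """
    rng = np.random.default_rng() if rng is None else rng
    gamma = np.zeros(p, dtype=int)              # start from the null model
    while gamma.sum() < p:
        if rng.random() < rho(gamma):           # stopping variable S(gamma)
            return gamma                        # terminate: gamma is the final model
        probs = lam(gamma)                      # selection probabilities over J(gamma)
        j = rng.choice(list(probs.keys()), p=list(probs.values()))
        gamma[j] = 1                            # add predictor X_j, i.e. gamma -> gamma^{+j}
    return gamma                                # full model reached; rho = 1 there by design

# A concrete specification: constant stopping probability and uniform selection.
p = 5
rho = lambda g: 1.0 if g.sum() == p else 0.5
lam = lambda g: {j: 1.0 / (p - g.sum()) for j in range(p) if g[j] == 0}
print(draw_fs_model(p, rho, lam))
```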

2.2 Relationship to other model space priors

While the construction of the FS distribution seems quite peculiar, as is stated in the next

theorem, this class of distributions is actually as rich as it can possibly be: all probability

distributions—in particular all prior and posterior distributions—on the model space ΩM

belong to this class. Thus we lose no generality by considering Bayesian variable selection


with priors under the FS representation. (A proof is given in the Supplementary Materials.)

Theorem 1 (Generality of the FS distribution). All probability measures on ΩM are FS

distributions. That is, all distributions on ΩM can be represented by an FS procedure with the

corresponding parameters.

Remark: Note that this theorem only establishes the existence of an FS representation for

any given distribution on ΩM. It does not establish the uniqueness. In fact, different FS

procedures may give rise to the same model space distribution. To understand this, note that

the FS procedure generates samples on the “expanded model space” that incorporates the

order of variable inclusions. The induced FS distribution on ΩM is in essence the marginal

distribution on the model space after integrating out the different orderings.

Because the parameters ρ and λ directly correspond to two important aspects of model

selection decisions, namely model complexity and the conditional inclusion probabilities of the predictors, the FS representation can be used to motivate the choice of model space priors. For

example, a useful way to specify the prior stopping probabilities is to let ρ(γ) ≡ ρ_{|γ|}, where |γ| denotes the number of 1’s in γ, that is, the number of predictors in Mγ. One could let ρs ≡ ρ, a constant in [0, 1), for all s = 0, 1, 2, . . . , p − 1. For values of ρ not too close to zero, this imposes further parsimony over the size of the model than what is implicit through the use of marginal likelihoods (Berger and Pericchi, 2001), as the total prior probability for models including s predictors is ρ(1 − ρ)^s for s < p and (1 − ρ)^p for s = p. The prior expected size of the model is thus (1/ρ − 1)[1 − (1 − ρ)^p], which is decreasing in ρ and is close to 1/ρ − 1 for large p. To impose even stronger parsimony over the model size, one can let ρs be an increasing function in s, such as ρs = 1 − e^{−as} for some positive constant a.

Other choices for ρs can be used to incorporate different prior knowledge (or lack of

knowledge) about the model size. For example, if one believes that the actual model size is

between kmin and kmax where 0 ≤ kmin ≤ kmax ≤ p, but has no a priori reason to prefer any

one particular model size in this range over another, then one may wish to assign uniform


prior probability for the model size on the support {kmin, kmin + 1, . . . , kmax}. This can be achieved by letting ρs = 0 for s < kmin, and ρs = 1/(kmax − s + 1) for kmin ≤ s ≤ kmax. In particular, if one wishes to assign uniform prior probability on {0, 1, 2, . . . , p} for the model

size, then one can let ρs = 1/(p − s + 1) for all s.
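As a quick numerical check of this choice (a sketch under the same conventions as above, not code from the paper), the induced model-size distribution can be computed directly from the stopping probabilities:

```python
import numpy as np

def size_distribution(rho):
    """Model-size distribution induced by stopping probabilities rho[0], ..., rho[p-1]
    (the stopping probability at the full model is implicitly 1)."""
    p = len(rho)
    probs = np.zeros(p + 1)
    survive = 1.0                       # probability of not having stopped yet
    for s in range(p):
        probs[s] = survive * rho[s]     # stop at size s
        survive *= 1.0 - rho[s]         # continue and add one more predictor
    probs[p] = survive                  # forced stop at the full model
    return probs

p = 6
rho_uniform = [1.0 / (p - s + 1) for s in range(p)]   # rho_s = 1/(p - s + 1)
print(size_distribution(rho_uniform))                 # all entries equal 1/(p + 1)
```

Running it for p = 6 returns seven equal probabilities of 1/7, confirming the uniform size distribution.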

For the prior values of λj(γ), a simple choice is the uniform selection probabilities, that

is, to let λj(γ) ≡ 1/|J (γ)| for all j ∈ J (γ) and all γ ∈ ΩM. This assumes no prior

knowledge about any particular predictor being more likely to be included than others. More

sophisticated choices of the prior selection probabilities may be used to take into account

the dependence structure among the predictors. For example, one may want to impose

the constraint that the conditional inclusion probability of a predictor that represents an

interaction be zero unless the corresponding main effects have been included in the model.

A more sophisticated example involves the so-called “dilution effect”, which arises from

strong correlation among the predictors as noted in George (1999) and George (2010). We

will devote Section 4 to illustrating how this effect can be accounted for in the FS framework.

Note that if one adopts ρ(γ) ≡ ρ|γ| and the uniform values on λ(γ) as described above,

then by symmetry, the prior probability mass on each model of size t is

π(Mγ) = ρt ∏_{s=0}^{t−1} (1 − ρs) / (p choose t)

for all γ such that |γ| = t, where (p choose t) is the number of models of size t.

Any symmetric prior over ΩM—symmetric in the sense that any two models of the same

size receive equal prior probability—can thus be represented with appropriate choices of the

stopping probabilities ρ0, ρ1, . . . , ρp−1.

2.3 Bayesian variable selection with FS priors

Next we investigate how to draw inference using the FS representation for model space priors.

First note that one can of course still apply any of the existing methods for variable selection


based on sampling methods to a prior specified under the FS representation. But we will see

that the FS representation leads to an entirely new set of inferential tools that do not rely

on any sampling algorithms. More specifically, we will see that under the FS representation,

the posterior for any prior can be calculated analytically through a sequence of recursions.

The rest of this subsection is devoted to establishing this essential result.

Suppose a model Mγ̃ is generated from an FS procedure. (Throughout this subsection we write γ̃ for the indicator of the final model produced by the procedure, to distinguish it from the indicator γ of a model reached at an intermediate step.) Also, suppose during the corresponding recursive procedure, after t steps of recursion, we arrive at a model Mγ(t) = Mγ. If S(γ) = 1, then Mγ̃ = Mγ is the final model, and again we let θγ denote the corresponding regression coefficients and variance, drawn from a prior p(θγ|Mγ). In this case the likelihood under the final model is p(Y|θγ, Mγ). If instead S(γ) = 0, then Mγ is not the final model. In this case, the final model Mγ̃, and thus the likelihood under it, depends on further steps of the recursive procedure as characterized by the stopping and selection variables.

The above description can also be represented mathematically in a recursive manner. For any model Mγ of size t, we let q(γ̃|γ) denote the likelihood under the final model Mγ̃ given that the FS procedure does not terminate in the first t steps and the model reached at the tth step is Mγ(t) = Mγ. Then we have

q(γ̃|γ) = p(Y|θγ, Mγ)   if S(γ) = 1 and the model coefficients are θγ,
q(γ̃|γ) = q(γ̃|γ+J(γ))   if S(γ) = 0 and the next variable to include is J(γ),

or equivalently,

q(γ̃|γ) = S(γ)p(Y|θγ, Mγ) + (1 − S(γ))q(γ̃|γ+J(γ)). (2)

Note that the likelihood q(γ̃|γ) is determined by γ along with the following sets of variables:

Sγ⊂ = {S(γ∗) : Mγ∗ contains Mγ as a submodel},
Jγ⊂ = {J(γ∗) : Mγ∗ contains Mγ as a submodel},
θγ⊂ = {θγ∗ : Mγ∗ contains Mγ as a submodel}.

That is, it depends on the stopping and selection variables along with the regression coefficients and variance for all models containing Mγ as a submodel.

Now we define Φ(γ) to be the marginal likelihood under the final model Mγ̃, integrated over the FS procedure, given that the procedure does not stop in the first t steps and Mγ(t) = Mγ. That is,

Φ(γ) = ∫∫ q(γ̃|γ) p(dθγ̃|Mγ̃) π(dMγ̃ | γ(t) = γ)
     = Σ_{γ̃: Mγ ⊂ Mγ̃} ∫ q(γ̃|γ) p(dθγ̃|Mγ̃) π(Mγ̃ | γ(t) = γ)
     = Σ_{sγ⊂, jγ⊂} ∫ q(γ̃|γ) p(dθγ̃|Mγ̃) π(sγ⊂, jγ⊂), (3)

where π(·|γ(t) = γ) denotes the model space prior conditional on the event that the FS procedure does not terminate in the first t steps and γ(t) = γ, and it is determined by the prior on (Sγ⊂, Jγ⊂), denoted by π(sγ⊂, jγ⊂). The last summation is taken over all possible values of Sγ⊂ and Jγ⊂.

Eqs. (2) and (3) together give rise to a recursive representation of Φ(γ):

Φ(γ) = ρ(γ)p(Y|Mγ) + (1 − ρ(γ)) Σ_{j∈J (γ)} λj(γ)Φ(γ+j), (4)

with the convention that if γ = (1, 1, . . . , 1), the full model, then ρ(γ) = 1 and so Φ(γ) = p(Y|Mγ). Note that in order to carry out this recursion, the marginal likelihood term

p(Y |Mγ) under each Mγ needs to be available. (To this end, convenient choices for the

conditional prior of the regression coefficients include the g-prior (Zellner, 1986) and the

hyper-g prior (Liang et al., 2008). We will adopt these choices in the numerical exam-

ples. More generally one may apply methods such as Laplace’s approximation to calculate

P (Y |Mγ) if other priors for the coefficients are adopted.)

But why do we even care about the Φ(γ) terms? They allow us to compute any model space posterior analytically through recursion! The recipe for achieving this is given in the

following theorem. (A proof is given in the Supplementary Materials.)

Theorem 2 (FS representation of a model space posterior). An FS representation for any

model space posterior can be computed analytically based on an FS representation for the

corresponding prior. Specifically, if the prior can be represented by an FS procedure with

parameters ρ and λ, then the posterior can be represented as follows. For each Mγ ∈ ΩM,

1. Posterior stopping probability:

ρpost(γ) = ρ(γ)p(Y|Mγ)/Φ(γ),

2. Posterior selection probabilities:

λpostj(γ) = (1 − ρ(γ)) λj(γ)Φ(γ+j) / [Φ(γ) − ρ(γ)p(Y|Mγ)].

Remark: This theorem gives an analytic representation of the corresponding posterior distri-

bution. It allows us to compute the posterior exactly through the recursion formula Eq. (4),

to sample from this posterior directly by simulating the FS procedure with the updated

parameters, and to compute certain summary statistics analytically without using sampling

at all. These will be illustrated in our numerical examples.
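For small p, the recursion in Eq. (4) and the updates of Theorem 2 can be evaluated by brute force. The sketch below is not the author's implementation; it assumes uniform prior selection probabilities, a size-dependent stopping probability, and user-supplied marginal likelihoods or Bayes factors, with invented toy values used only to exercise the mechanics.

```python
from functools import lru_cache
from itertools import product

def fs_posterior(p, rho, marg):
    """Recursive computation of Phi and of the posterior FS parameters
    (in the spirit of Eq. (4) and Theorem 2) for a small number of predictors p.

    rho(s)      : prior stopping probability for a model of size s (size p always stops).
    marg(gamma) : marginal likelihood, or Bayes factor versus a fixed baseline,
                  for the model indexed by the 0/1 tuple gamma.
    Uniform prior selection probabilities lambda_j(gamma) = 1/|J(gamma)| are assumed.
    """
    def add(gamma, j):                      # gamma^{+j}: include predictor j
        return gamma[:j] + (1,) + gamma[j + 1:]

    @lru_cache(maxsize=None)
    def phi(gamma):
        s = sum(gamma)
        if s == p:                          # full model: rho = 1 by convention
            return marg(gamma)
        out = [j for j in range(p) if gamma[j] == 0]
        tail = sum(phi(add(gamma, j)) for j in out) / len(out)
        return rho(s) * marg(gamma) + (1.0 - rho(s)) * tail

    rho_post, lam_post = {}, {}
    for gamma in product((0, 1), repeat=p):
        s = sum(gamma)
        r = 1.0 if s == p else rho(s)
        rho_post[gamma] = r * marg(gamma) / phi(gamma)
        if s < p:
            out = [j for j in range(p) if gamma[j] == 0]
            denom = phi(gamma) - r * marg(gamma)
            lam_post[gamma] = {j: (1.0 - r) * (1.0 / len(out)) * phi(add(gamma, j)) / denom
                               for j in out}
    return rho_post, lam_post

# Toy illustration with made-up Bayes factors (purely for checking the mechanics):
bf = lambda g: 1.0 + 4.0 * g[0] + 2.0 * g[2]          # pretend X1 and X3 matter
rho_post, lam_post = fs_posterior(p=3, rho=lambda s: 0.5, marg=bf)
print(rho_post[(0, 0, 0)])        # posterior probability of stopping at the null model
print(lam_post[(0, 0, 0)])        # posterior selection probabilities from the null model
```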


Eq. (4) and Theorem 2 can be written in terms of Bayes factors (BFs) with respect to a base model Mγb. Specifically, if we let Φb(γ) = Φ(γ)/P(Y|Mγb) and BF(Mγ : Mγb) = P(Y|Mγ)/P(Y|Mγb), then Eq. (4) becomes

Φb(γ) = ρ(γ)BF(Mγ : Mγb) + (1 − ρ(γ)) Σ_{j∈J (γ)} λj(γ)Φb(γ+j).

Accordingly, the posterior parameter updates in Theorem 2 become

ρpost(γ) = ρ(γ)BF(Mγ : Mγb)/Φb(γ)   and   λpostj(γ) = (1 − ρ(γ))λj(γ)Φb(γ+j) / [Φb(γ) − ρ(γ)BF(Mγ : Mγb)].

Thus one can carry out Bayesian variable selection using priors under the FS representation

completely in terms of the BFs. This will become very handy as it is often easier to compute

the BFs with respect to a baseline model than to compute the marginal likelihoods. (See

Section 3.1.) A simple choice of the baseline model is the null model. We will use this choice

in the rest of this work without further declaration.

2.4 Computational strategies in high-dimensional problems

Computing the exact model space posterior requires enumerating all models, as well as com-

puting and summing the marginal likelihood or Bayes factor for each of them. When the

number of potential predictors p is large (> 30), such brute-force integration is infeasible.

Traditionally, sampling methods such as MCMC are then employed to approximate the

integral through simulation. In a different vein, the FS representation through Eq. (4) con-

verts the problem of integration into one of recursion. Consequently, many computational

strategies for approximating recursive computation on trees can be adopted to effectively

approximate the posterior in high-dimensional problems. These strategies form an alterna-

tive to sampling algorithms such as MCMC and SVS for posterior inference, and hold much


promise for Bayesian analysis in high-dimensional settings. While sampling algorithms are

effective in exploring local details of the posterior distribution, approximate recursion algo-

rithms provide a means to learning the global shape of the posterior while avoiding difficulties

in mixing and convergence. When combined, the two types of algorithms can form a powerful

set of Bayesian tools for exploring large multi-modal posterior distributions.

One of the simplest approaches for approximating a recursion is to impose an upper limit

k on the depth of the recursion. We shall refer to this method as k-thresholding. In the

current context, this can be achieved by restricting the support of the prior to models of size

no more than k—simply let ρ(γ) = 1 for each model Mγ of size k. When the underlying

model indeed involves no more than k predictors (essentially a sparsity assumption), such an

assumption will actually result in a gain in statistical power in selecting the “true” model. Otherwise, the “true” model falls outside the support of the posterior, but the method will still

identify relevant sub-models of size no more than k. If desired, one may then keep all the

variables with moderate to high marginal inclusion probabilities as estimated by draws from

the approximate posterior (which effectively serves as a dimensionality reduction step), and

carry out another round of selection on that sub-space of models spanned by these “candidate

variables” with a larger model size limit.
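In code, k-thresholding is a one-line modification of the stopping probabilities; a hedged sketch (continuing the conventions of the earlier FS-prior sketch, with illustrative names):

```python
import numpy as np

def threshold_rho(rho, k):
    """k-thresholding: force the FS procedure to stop once the model reaches size k,
    i.e. set the stopping probability to 1 for every model of size k."""
    return lambda gamma: 1.0 if int(np.sum(gamma)) >= k else rho(gamma)

# Example: cap a constant-stopping-probability prior at models of size 3.
rho = lambda gamma: 0.5
rho_capped = threshold_rho(rho, k=3)
print(rho_capped(np.array([1, 0, 1, 1, 0])))   # size 3 -> 1.0
print(rho_capped(np.array([1, 0, 0, 0, 0])))   # size 1 -> 0.5
```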

A more involved approximation technique, the k-step look-ahead, is a generalization of

k-thresholding. More specifically, one can start by imposing a size limit k on the support of

the FS procedure. After computing the corresponding posterior through a k-level recursion,

one can identify one or a small number of candidate “representative” models that receive high

posterior probability. One can then place a new “local” prior under the FS representation

with size limit k further along the branch of each of those models, and carry out another

iteration of k-level recursion to find its posterior. This procedure continues until the local FS

posteriors suggest termination through the posterior stopping probabilities. The reasoning

behind this algorithm is that large models typically involve good sub-models. Through this


procedure, one can effectively zoom into the parts of the model space that are likely to

receive the highest posterior probability.

Two examples of such k-step look-ahead procedures are given in the following boxes as

Algorithm 1 and Algorithm 2. How they work is illustrated in the next section through

Example 3. In the description of Algorithms 1 and 2, S denotes the set of predictors selected by the algorithm, V the rest of the predictors, ΩVM the model space spanned by the predictors in V, and ΩVM(k) the subset of models in ΩVM of size no more than k. Algorithm 1

uses the hierarchical maximum a posteriori (hMAP) model (Wong and Ma, 2010) as the

“representative” model in each k-level recursion. (For the definition of the hMAP and the

motivation to use it instead of the more common MAP model, see Supplementary Materials.)

Algorithm 2 is a variant that adds only one additional variable in each k-level recursion. It

is computationally more demanding than Algorithm 1 but less aggressive in approximation.

Algorithm 1 k-step look-ahead variable selection using stepwise hMAP

1: Initialization: S ← ∅, V ← {1, 2, . . . , p}.
2: Demean: yi ← yi − ȳ, xij ← xij − x̄·j for all i = 1, 2, . . . , n and j ∈ V.
3: k-thresholding: Place an FS-prior on ΩVM supported on ΩVM(k).
4: Compute the posterior FS distribution by recursions as prescribed in Eq. (4) and Theorem 2.
5: Find the hMAP model MVγhMAP.
6: Scurr ← variables in MVγhMAP, V ← V \ Scurr, and S ← S ∪ Scurr.
7: if |Scurr| = k and |S| < p then
8:   for each j ∈ V do
9:     Regress predictor Xj on the variables in Scurr.
10:    Update xij ← eij, the corresponding residual, for each i = 1, 2, . . . , n.
11:    Regress Y on the variables in Scurr.
12:    Update yi ← ei, the corresponding residual, for each i = 1, 2, . . . , n.
13:  end for
14:  Go to Step 3.
15: else
16:  The algorithm terminates and returns S.
17: end if


Algorithm 2 k-step look-ahead variable selection with inclusion of one additional variable for each k-level recursion

1: Initialization: S ← ∅, V ← {1, 2, . . . , p}.
2: Demean: yi ← yi − ȳ, xij ← xij − x̄·j for all i = 1, 2, . . . , n and j ∈ V.
3: k-thresholding: Place an FS-prior on ΩVM supported on ΩVM(k).
4: Compute the posterior FS distribution by recursions as prescribed in Eq. (4) and Theorem 2.
5: if ρpost(ΩVM) < 0.5 and |S| < p then
6:   Scurr ← the variable with the highest posterior selection probability λpostj(ΩVM), V ← V \ Scurr, and S ← S ∪ Scurr.
7:   for each j ∈ V do
8:     Regress predictor Xj on the variable in Scurr.
9:     Update xij ← eij, the corresponding residual, for each i = 1, 2, . . . , n.
10:    Regress Y on the variable in Scurr.
11:    Update yi ← ei, the corresponding residual, for each i = 1, 2, . . . , n.
12:  end for
13:  Go to Step 3.
14: else
15:  The algorithm terminates and returns S.
16: end if

The k-step look-ahead is greedy to varying degrees depending on the choice of k. In particular, for k = 1 this is a completely greedy algorithm. This type of algorithm goes beyond the realm of probability-based inference, but it can be extremely powerful in many

problems involving a large number of potential predictors where exact Bayesian computation

is prohibitive. Of course, by making this Bayesian/frequentist compromise, one must pay

extra attention to the potential problem of overfitting as the models selected by such a

procedure are no longer automatically corrected for all of the multiple testing involved in the

procedure. The potential gain in scalability, however, can be so substantial in large problems

that it well justifies the extra effort needed in addressing such complications. In the current

work, we do not delve further into this particular aspect but do acknowledge its importance.

2.5 Posterior model averaging using the FS representation

While the main focus of this work is on model selection, in this subsection we digress slightly

and show that using the FS representation of model space posteriors, Bayesian model averaging (BMA) (Hoeting et al., 1999), which is essentially integration over the posterior, can

also be carried out through recursion. We let ∆ denote a quantity of interest. BMA is based

on the posterior distribution of ∆:

P(∆|Y) = Σγ P(∆|Y, Mγ) P(Mγ|Y),

and in particular its posterior expectation

E(∆|Y) = Σγ E(∆|Y, Mγ) P(Mγ|Y).

Under an FS representation of the posterior, these two formulas have a recursive represen-

tation as well. To see this, we first define, for each model Mγ, a quantity Ψ(γ) to be the

posterior expectation of ∆ given that Mγ arises during the random procedure that generates

the final model after |γ| steps. Then

Ψ(γ) = ρpost(γ) E(∆|Y, Mγ) + (1 − ρpost(γ)) Σ_{j∈J (γ)} λpostj(γ) Ψ(γ+j). (5)

Note that E(∆|Y ) = Ψ(γ(0)) and so it can be calculated recursively according to Eq. (5). It

follows that the posterior distribution P (∆|Y ) can also be computed this way by setting ∆

to the appropriate indicator function. Therefore, to carry out BMA, we can first use Theorem 2 to recursively compute the corresponding posterior FS representation, then use Eq. (5)

to again recursively compute the model averaged quantities. Note that the existence of such

a recursive representation of BMA suggests that the computational strategies for approxi-

mating recursions introduced previously can also be adopted for carrying out approximate

BMA in high-dimensional problems.

Example 1 (BMA estimation of marginal inclusion probabilities). In this example we use


the above recipe to evaluate the posterior marginal inclusion probability of a variable Xi. In

this case, we let ∆ be the indicator for the event that Xi is in the final model. For any model

Mγ, we define Ψ(γ) as in Eq. (5). Note that if Mγ contains Xi, then Eq. (5) reduces to

Ψ(γ) = 1. On the other hand, if Mγ does not contain Xi, then E(∆|Y ,Mγ) = 0, and so

Ψ(γ) = (1 − ρpost(γ)) Σ_{j∈J (γ)} λpostj(γ) Ψ(γ+j)
     = (1 − ρpost(γ)) [ λposti(γ) + Σ_{j∈J (γ)\{i}} λpostj(γ) Ψ(γ+j) ].

By the above recursion, we can calculate Ψ(γ(0)), which is exactly the posterior marginal

inclusion probability for variable Xi.
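The recursion in this example translates directly into code. The sketch below is an illustration rather than the paper's software; it reuses the rho_post and lam_post dictionaries produced by the fs_posterior sketch given earlier.

```python
from functools import lru_cache

def inclusion_probability(p, i, rho_post, lam_post):
    """Posterior marginal inclusion probability of predictor i via the recursion of Eq. (5),
    using posterior FS parameters stored as dictionaries keyed by 0/1 tuples
    (rho_post[gamma] and lam_post[gamma][j], as in the fs_posterior sketch above)."""
    @lru_cache(maxsize=None)
    def psi(gamma):
        if gamma[i] == 1:                                  # X_i already in the model
            return 1.0
        tail = sum(lam_post[gamma][j] * psi(gamma[:j] + (1,) + gamma[j + 1:])
                   for j in range(p) if gamma[j] == 0)
        return (1.0 - rho_post[gamma]) * tail
    return psi((0,) * p)                                   # start the recursion at the null model

# Continuing the toy example from the fs_posterior sketch: P(X1 is included | Y)
# print(inclusion_probability(3, 0, rho_post, lam_post))
```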

Remark: While the above example shows that marginal inclusion probabilities can be com-

puted recursively under the proposed framework, it does not suggest that this is the most

efficient way to evaluate those probabilities, especially when there are a large number of

potential variables and the investigator wants to evaluate the inclusion probability for each.

An alternative approach is to sample models from the posterior FS procedure, and estimate

the inclusion probabilities using the sample averages. Although some Monte Carlo error will

be introduced, sampling can be a much more efficient approach when there are many predic-

tors. For this reason, in our later numerical examples, we use sampling from the posterior

to estimate the marginal inclusion probabilities.

3 Numerical examples

In this section we use several numerical examples to illustrate variable selection using priors

under the FS representation. We adopt the g-prior and the hyper-g prior on the regression

coefficients and the noise variance given a model Mγ. Such choices are made due to the


availability of closed-form marginal likelihoods, P (Y |Mγ), and BFs. First we provide some

brief background on how the marginal likelihood and BFs are computed for the g-prior and

the hyper-g prior. We refer the interested reader to Liang et al. (2008) for more detail.

3.1 Bayes factors under g-prior and hyper-g prior

Given a particular model Mγ, Zellner’s g-prior in its most popular form is the following

prior on the regression coefficients and the noise variance

p(φ) ∝ 1/φ and βγ | φ, Mγ ∼ N(β0γ, g(XγT Xγ)^{−1}/φ)

where β0γ and g are hyperparameters. Following the exposition in Liang et al. (2008), we

assume without loss of generality that the predictor variables X1, X2, . . . , Xp have all

been mean centered at zero. Then we can place a common non-informative flat prior on the

intercept α for all models. So p(α, φ) ∝ 1/φ. Under this prior setup, the marginal likelihood

for model Mγ is

p(Y|Mγ) = [Γ((n − 1)/2) / (π^{(n−1)/2} √n)] · (Σ_{i=1}^{n} (yi − ȳ)^2)^{−(n−1)/2} · (1 + g)^{(n−1−|γ|)/2} / (1 + g(1 − R²γ))^{(n−1)/2},

where R²γ is the coefficient of determination for model Mγ. If we choose our baseline model Mγb to be the null model, that is, γb = (0, 0, . . . , 0), then the Bayes factor for a model Mγ versus Mγb is

BF(Mγ : Mγb) = (1 + g)^{(n−1−|γ|)/2} / (1 + g(1 − R²γ))^{(n−1)/2}.

Due to the simplicity of the BF in comparison to the marginal likelihood, we will carry out

the inference through Theorem 2 in terms of the BFs.
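For reference, the BF above is straightforward to evaluate numerically; a small sketch (the function name and the example numbers are illustrative, not from the paper):

```python
def g_prior_bf(r2, n, k, g):
    """Bayes factor of a size-k model with coefficient of determination r2
    against the null model, under Zellner's g-prior (formula above)."""
    return (1.0 + g) ** ((n - 1 - k) / 2.0) / (1.0 + g * (1.0 - r2)) ** ((n - 1) / 2.0)

# e.g. n = 100 observations, a 3-predictor model with R^2 = 0.4 and g = n:
print(g_prior_bf(r2=0.4, n=100, k=3, g=100))
```

For large n it can be preferable to work with the logarithm of this expression to avoid overflow.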

To avoid undesirable features of the g-priors such as Bartlett’s paradox and the informa-

tion paradox (Berger and Pericchi, 2001), Liang et al. (2008) proposed the use of mixtures


of g-priors. In particular, they introduced the hyper-g prior, which adopts the following

hyperprior on g:

g/(1 + g) ∼ Beta(1, a/2 − 1).

This prior also renders a closed form representation for the model-specific marginal likelihood,

and thus for the corresponding BFs. In particular, Liang et al. (2008) showed that the BF

of a model Mγ versus the null model Mγb is given by

BF(Mγ : Mγb) = [(a − 2) / (|γ| + a − 2)] · 2F1((n − 1)/2, 1; (|γ| + a)/2; R²γ)

where 2F1(·, ·; ·; ·) is the hypergeometric function. More specifically, in the notations of Liang

et al. (2008),

2F1(a, b; c; z) = [Γ(c) / (Γ(b)Γ(c − b))] ∫_0^1 t^{b−1}(1 − t)^{c−b−1} / (1 − tz)^{a} dt.

Programming libraries are available for evaluating this function (Liang et al., 2008).
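One such option, assumed here purely for illustration rather than prescribed by the paper, is scipy.special.hyp2f1:

```python
from scipy.special import hyp2f1

def hyper_g_bf(r2, n, k, a=3.0):
    """Bayes factor of a size-k model with coefficient of determination r2 against the
    null model under the hyper-g prior of Liang et al. (2008), using the formula above."""
    return (a - 2.0) / (k + a - 2.0) * hyp2f1((n - 1.0) / 2.0, 1.0, (k + a) / 2.0, r2)

print(hyper_g_bf(r2=0.4, n=100, k=3, a=3.0))
```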

3.2 Examples

Example 2 (US Crime data). This is a classical data set introduced in Vandaele (1978)

and adopted in many articles including Raftery et al. (1997) and Clyde et al. (2011) to

compare methods for model selection and averaging. This data set contains 15 variables and

so an exhaustive computation of the marginal likelihood of all 2^15 models is possible. We

place a prior under the FS representation on this model space with constant prior stopping

probability ρs ≡ 0.5 for all s = 0, 1, . . . , p − 1 and uniform prior selection probabilities,

together with a g-prior with g = n on the coefficients for each model. After computing the

corresponding posterior FS representation, we randomly draw from this posterior—through

simulating the FS procedure—10,000 models, and use the sample average to estimate the

marginal inclusion probability of each predictor.


Figure 2 presents the estimated marginal inclusion probabilities of the 15 predictors

along with the posterior distribution of the model size. In addition, we also compute the

“co-inclusion” probability of each pair of variables—that is, the fraction of sampled models

in which two variables are selected simultaneously. Figure 3, which we call a “co-inclusion”

graph, summarizes the results. In the figure, the size of each node (variable) is proportional

to its estimated marginal inclusion probability, while the width of the link between two nodes

is proportional to their co-inclusion probability. The links between pairs with estimated co-

inclusion probabilities more than 20% are dark colored. Note for example although X4 and

X5 both have fairly high marginal inclusion probabilities, they typically do not both appear

in the sampled models due to their strong correlation.

Figure 2: Estimated marginal inclusion probabilities (left) and the estimated posterior of the model size (right) for the US crime data.

Example 3 (k-step look-ahead in a high-dimensional model space). In this example we

illustrate the use of the k-step look-ahead algorithms introduced in the previous section

through a simulated data set with 1,000 potential predictor variables, X1, X2,. . . , X1000.

These 1,000 predictors are drawn from a multivariate normal distribution, for which the

marginal means are 0 and the marginal variances are 1, and the correlation between variables

Xi and Xj is corr(Xi, Xj) = (1 − 0.05|i − j|) · 1{|i − j| ≤ 20}. That is, the “close-by” predictors are

highly correlated.

Figure 3: A “co-inclusion” graph for the US crime data set.

A response Y is simulated as

Y = 10 + 3X3 − 3X120 + 3X230 − 1.5X300 + 2X562 − 2X722 + 2X819 + ǫ

where the errors are independent draws from a Normal(0, 10^2) distribution.
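The paper does not state how the correlated design was generated; one construction that reproduces the stated banded correlation exactly (an assumption made here only for illustration) is a length-20 moving average of i.i.d. normals:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, w = 1500, 1000, 20

# A moving average of w i.i.d. normals gives corr(X_i, X_j) = 1 - 0.05|i - j| for
# |i - j| <= 20 and 0 otherwise, with unit marginal variance.
Z = rng.standard_normal((n, p + w - 1))
X = np.stack([Z[:, i:i + w].sum(axis=1) / np.sqrt(w) for i in range(p)], axis=1)

beta = np.zeros(p)
for idx, b in [(3, 3), (120, -3), (230, 3), (300, -1.5), (562, 2), (722, -2), (819, 2)]:
    beta[idx - 1] = b                      # predictors are 1-indexed in the text
Y = 10 + X @ beta + rng.normal(scale=10, size=n)
```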

We simulate such a data set of size 1,500, and then apply the k-step look-ahead procedure

described as Algorithm 1 to this data set with a step size k set to 3. More specifically, in each

iteration of the procedure, we place a local prior under the FS representation on the model

space spanned by those variables that have not been selected into the model after taking a

partial regression on all of the variables selected in previous iterations, and the prior places all

its probability mass on models of size no more than k. Again, the prior stopping probabilities

are set to 0.5 for all non-full models, and uniform prior selection probabilities are adopted.


A hyper-g prior with a = 3 is placed on the coefficients given each model. After computing

the corresponding posterior FS distribution through recursion, the variables present in the

hMAP model found at each iteration are added into the set of selected variables. Thus

in each iteration at most k variables can be selected. If the hMAP model at an iteration

reaches the maximum size k, then the procedure takes the next iteration and places a new

FS-prior on the model space spanned by the variables not yet selected after taking the partial

regression on the ones already included. This procedure terminates when the hMAP model

at an iteration falls short of the maximum size supported by the prior or the full model has

been reached.

For the current example, the k-step look-ahead terminated after three iterations. In

the first iteration, the hMAP model contains variables X3, X120 and X231. In the second

iteration, variables X560, X722, and X819 are added into the model. Finally, in the third

iteration, a single variable X300 is added and the procedure terminates. (The entire process

took about 5 minutes on a single Intel Core i7 CPU core at 3.8Ghz.) So the “best” model

this procedure selects is Y ∼ X3 + X120 + X231 + X300 + X560 + X722 + X819.

Due to the strong correlation among neighboring markers in this particular simulation

the procedure chose X231 and X560 instead of X230 and X562, but the selected model is indeed

very close to the underlying truth. Note that the algorithm automatically terminated after

three iterations of k-level recursion and the inclusion of the seven relevant variables into the

model. The relevant predictors (or their immediate neighbors) are identified by the algorithm

in an order matching their corresponding effect sizes.

In addition to finding a single representative model, we can also estimate the inclusion

probabilities of the variables as follows. At each iteration of k-step look-ahead, given those

already selected in the previous iterations, we sample 1,000 models from the k-thresholded

posterior FS-distribution on the model space spanned by the variables not yet selected.

These are then used to estimate the conditional inclusion probabilities of the variables given


the variables selected in previous steps. We compute the conditional inclusion probability

for each variable not yet in the model in each iteration. The estimated conditional inclusion

probabilities are presented in Figure 4. Due to the correlation among the predictors, the

neighbors of the true predictors of the response also have moderate inclusion probabilities.

Looking at the lower panel in Figure 4, it may seem surprising that in the third iteration

of the k-step algorithm, despite the relatively low inclusion probabilities of all the avail-

able predictors, the algorithm was able to identify X300 in the hMAP model. This shows a

desirable aspect of inference using the FS representation. Although due to the strong cor-

relation among the predictors there are no predictors that stand out as should be included

with high probability, the recursive framework through Eq. (4) allows us to combine the

evidence for the inclusion of an additional predictor from all the neighbors around X300.

As a result, the posterior stopping probability on the model reached after two iterations,

Y ∼ X3 + X120 + X231 + X560 + X722 + X819, is small—about 32%, suggesting the inclusion

of another predictor, and then X300 is chosen as it has the highest conditional inclusion

probability among the variables not in the model. We have also applied Algorithm 2 to the

same simulated data set. The result is very similar and so is not reported here to avoid

redundancy. The performance of the algorithms is consistent across repeated simulations.

Our next example illustrates that one can combine the recursive approach with the stan-

dard sampling methods for inference. In particular, given the model chosen by our method

as a starting point, one can start his/her favorite sampling-based algorithms such as MCMC

or SVS to further explore the posterior around this model. This helps ensure that the chain

starts in a region of “good” models.

Figure 4: Estimated inclusion probabilities for Example 3 in the three rounds of k-step look-ahead in Algorithm 1.

Example 4. We simulate data sets of size 1,000 with 100 predictors from the model:

Y = 10 + 3X3 − 3X15 + 3X25 − 1.5X50 + 2X70 − 2X80 + 2X95 + ǫ

with the same error distribution and correlation structure among the predictors as in Exam-

ple 3. For each simulated data set, we adopt the same FS prior specification as in Example 3

and compute the corresponding approximate posterior through recursion with 3-step look-

ahead. We use Algorithm 1 to find a good model. This typically results in a model very

close to the true model, such as Y ∼ X3 + X15 + X25 + X52 + X70 + X80 + X96.

We then use this model as the starting point and apply the stochastic search variable

selection (SSVS) method proposed by George and McCulloch (1993). (The corresponding

parameters are ci = 10 and τi = 0.3.) In comparison, we perform the same SSVS method

using the full model as a starting point. Figure 5 presents the estimated posterior marginal

inclusion probabilities based on these two SSVS searches. The inclusion probabilities are

estimated based on 2,000 Gibbs iterations with an additional 1,000 burn-in iterations for the SSVS starting from the full model.

Figure 5: Estimated posterior marginal inclusion probabilities based on two SSVS searches. The solid curve is for the SSVS with the model selected by Algorithm 1 as the starting point. The dashed curve is for the SSVS with the full model as the starting point. The light vertical lines indicate the seven predictors in the true model.

While the strong correlation structure makes stochastic

model search very challenging in this example, we can see that the chain that started at the

model selected by our method performs better in terms of sampling around the true model.

We find that if we keep running SSVS chains long enough then the effect of a good starting

point becomes less significant, as one would expect. Thus the higher the dimensionality, the greater the benefit a good starting point brings, as it will take longer for the SSVS to reach regions of good models, and even longer to obtain reliable estimates of the

inclusion probabilities. We note that Algorithm 1 with 3-step look-ahead takes about two

seconds in this example on a single 3.8Ghz Intel Core-i7 CPU core.

4 Case study: Incorporating model space redundancy

In this subsection we carry out a case study to show that the FS representation of model

space priors provides much flexibility in incorporating prior information. In particular, we

show how one can address an interesting phenomenon called the dilution effect first noted by


George (1999). “Dilution” occurs when there is redundancy in the model space. More specif-

ically, consider the scenario where there is strong correlation among some of the predictors,

and any one of these predictors captures virtually all of the association between them and

the response. In this case models that contain different members of this class but are other-

wise identical are essentially the same. As a result, if, say, a symmetric prior specification is

adopted, these models will receive more prior probability than they properly should. At the

same time, other models that do not include members of this class will be down-weighted

in the prior. In real data, this phenomenon occurs to varying degrees depending on the

underlying correlation structure among the predictors.

Next, we present a very simple specification of the FS procedure producing a prior that

can effectively address this phenomenon. We do not claim that this approach is the “best”

way to deal with dilution, but rather use this as an example to illustrate the flexibility

rendered by the FS representation. The specification can be described in two steps.

Step I. Pre-clustering the predictors based on their correlation. First, we carry out a

hierarchical clustering over the predictor variables using the (absolute) correlation as the

similarity metric, which divides the predictors into K clusters—C1, C2, . . . , CK . We recom-

mend using complete linkage as this will ensure that the variables within each cluster are all

very “close” to each other. One needs to choose a correlation threshold s for cutting the cor-

responding dendrogram into clusters—in the case of complete linkage, this is the minimum

correlation for two variables to be in the same cluster. We recommend choosing a large s,

such as 0.9, to place variables into the same basket only if they are highly correlated.
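One possible implementation of this pre-clustering step (the use of scipy here is an assumption; any hierarchical clustering routine with complete linkage would do):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def correlation_clusters(X, threshold=0.9):
    """Step I sketch: complete-linkage clustering of the columns of X, cutting the
    dendrogram so that variables in a cluster have pairwise |correlation| >= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))       # p x p absolute correlation matrix
    dist = 1.0 - corr                                  # dissimilarity
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="complete")
    # With complete linkage, cutting at height 1 - threshold guarantees that the minimum
    # correlation within each resulting cluster is at least `threshold`.
    return fcluster(Z, t=1.0 - threshold, criterion="distance")
```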

Step II. Specification of the FS prior given the predictor clusters. Based on the predictor

clusters, we assign prior selection probabilities for a model Mγ to the variables not yet in

the model in the following manner. First, we place equal total prior selection probability

over each of the available clusters. Then within each cluster, we assign selection probability

evenly across the variables.


For example, consider the situation where there are a total of 10 predictors X1 through

X10, and following Step I, they form four clusters C1 = {X1, X2, X3}, C2 = {X4, X10}, C3 = {X5, X7, X9} and C4 = {X6, X8}. Let Mγ be the model that contains variables X1, X4, X5, X6, and X8. That is, γ = (1, 0, 0, 1, 1, 1, 0, 1, 0, 0). If the FS procedure reaches Mγ and the procedure does not stop, that is, S(γ) = 0, then five variables, X2, X3, X7, X9, X10, from three clusters C′1 = {X2, X3}, C′2 = {X10}, and C′3 = {X7, X9} are available for further inclusion. In this case we choose the selection probabilities λ(γ) to be: λ1(γ) = λ4(γ) = λ5(γ) = λ6(γ) = λ8(γ) = 0, λ2(γ) = λ3(γ) = λ7(γ) = λ9(γ) = 1/3 × 1/2 = 1/6, and λ10(γ) = 1/3.

Under such a specification, the predictors falling in the same cluster evenly share a fixed piece of the prior selection probability, which ensures that the prior weight on the other variables is not “diluted”.
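The cluster-based assignment of selection probabilities can be sketched as follows (illustrative code, not the paper's implementation; it reproduces the ten-predictor example above):

```python
import numpy as np

def dilution_selection_probs(gamma, clusters):
    """Step II sketch: prior selection probabilities lambda_j(gamma) that split mass
    equally across the clusters still represented outside the model, then equally
    across the excluded variables within each such cluster.

    gamma    : 0/1 array of length p (current model).
    clusters : array of cluster labels; clusters[j] is the cluster of predictor X_{j+1}.
    """
    gamma = np.asarray(gamma)
    out = np.flatnonzero(gamma == 0)                       # predictors not yet in the model
    labels = {clusters[j] for j in out}                    # clusters with something left to add
    lam = np.zeros(len(gamma))
    for c in labels:
        members = [j for j in out if clusters[j] == c]
        lam[members] = (1.0 / len(labels)) / len(members)  # equal mass per cluster, then per member
    return lam

# The 10-predictor illustration from the text: clusters {X1,X2,X3}, {X4,X10}, {X5,X7,X9}, {X6,X8};
# the current model contains X1, X4, X5, X6, X8.
clusters = np.array([1, 1, 1, 2, 3, 4, 3, 4, 3, 2])
gamma = np.array([1, 0, 0, 1, 1, 1, 0, 1, 0, 0])
print(dilution_selection_probs(gamma, clusters))
```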

Example 5. We simulate a data set with 60 predictors X1, X2, . . . , X60. They have the

following correlation structure:

corr(Xi, Xj) = 1 − 0.01|i − j| if i = j or max{i, j} ≤ 50, and corr(Xi, Xj) = 0 if i ≠ j and max{i, j} > 50.

In other words, there is strong correlation among the first 50 predictors X1, X2, . . . , X50

while each of the other 10 variables is independent of all other predictors. We simulate a

response variable Y = 10 + 4X3 + 3X60 + ǫ with ǫ being independent N(0, 10^2) noise. Due to

the high correlation among X1 through X50, many models that contain different subsets of

them but are otherwise identical are essentially the same. If a symmetric prior specification

is adopted then these models will receive too large a proportion of the prior probability mass.

For example, the first 50 variables possess 5/6 of the prior selection probability to be added

into the null model, while the “effective” number of predictors among them is much smaller.

Consequently, variables X51 to X60 will receive much less prior selection probability than they

deserve, and that can lead to substantial underestimation of their inclusion probabilities.

Figure 6: Dendrogram for the simulated predictors in Example 5 under complete linkage. The dashed horizontal line indicates the cluster cutoff based on a correlation of 0.9.

We generate 300 observations from the above model, and place a prior under the FS

representation supported on the models including at most 6 variables—that is, we adopt a

k-thresholded version of the FS representation with k = 6—and carry out the correspond-

ing recursive computation to find the posterior. We place a strong sparsity assumption on

the model complexity by setting ρs ≡ 0.9 for all s and again adopt a hyper-g prior with

a = 3 for the model coefficients and variance conditional on each model. We compare two

prior specifications for the selection probabilities: (1) a simple uniform specification with

λj(γ) = 1/|J (γ)| for all j ∈ J (γ), and (2) the dilution specification which assigns

equal selection probability to each cluster of the predictors determined by the pre-clustering

step. For the pre-clustering step, we use hierarchical clustering with complete linkage and

a correlation threshold of 0.9 as the cutoff for clusters. This divides the predictors into 17

clusters with X51 through X60 each being its own cluster and X1 through X50 divided into

7 clusters. The dendrogram is presented in Figure 6.

Figure 7: The estimated marginal inclusion probabilities and the posterior for model size. (Panels show the inclusion probabilities and the posterior for model size under the dilution and the symmetric specifications.)

In Figure 7 we present the estimated marginal inclusion probabilities and the posterior

distribution of the model size, both estimated from 5,000 draws from the posterior under

each of the two specifications on prior selection probabilities. We see that incorporating

dilution into the prior specification considerably improves our inference on both variable

inclusion and model complexity.

5 Discussions

In this work we have shown that all distributions on linear regression model spaces can be

represented by a recursive constructive procedure. This general representation is specified

in terms of parameters that control for (1) the complexity of the models generated and (2)

the variables included in the model along with their conditional inclusion probabilities. This

representation transforms the challenge for Bayesian posterior inference in high-dimensional


problems from integration (and sampling) into recursion, and so computational strategies

for approximating recursions can be adopted to explore the structure of the posterior,

without resorting to sampling algorithms such as MCMC and SVS. This new approach to

posterior inference is in fact complementary to the traditional sampling-based methods. The

approximate recursion methods are effective at learning the global shape of the posterior and

thus can serve as a tool for choosing starting points for sampling methods, which are good

at exploring local features of the posterior.

We note that the construction of the FS procedure can be completely “reversed”—starting

from the full model, which includes all predictors, one can use an analogous recursive pro-

cedure to drop predictors one at a time until stopping. This gives rise to a corresponding

backward-stepwise (BS) representation. Similarly, one can introduce a “two-directional”

representation that also maintains the properties of the FS procedure. To avoid redundancy

in the presentation, we leave the details to the interested reader. The BS procedure is less

useful than the FS as the former imposes a penalty on model simplicity and often favors

more complex models. In the uncommon situation where one has reasons to believe that

the underlying model involves a majority of the potential predictors, however, the BS proce-

dure can be an appropriate choice. In problems with a large number of potential predictors

where the underlying model is expected to involve only a small number of them, the FS

representation seems to be the natural option.
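As a small illustration of the reversed construction, the sketch below draws a single model from a BS-style prior, assuming a constant stopping probability rho_bs and uniform deletion probabilities; both choices, and the name draw_bs_model, are illustrative rather than a recommended specification.

    # Draw one model from a toy backward-stepwise (BS) prior: start from the
    # full model and repeatedly either stop or drop one remaining predictor.
    draw_bs_model <- function(p, rho_bs) {
      gamma <- seq_len(p)                        # start from the full model
      while (length(gamma) > 0 && runif(1) > rho_bs) {
        drop_idx <- sample.int(length(gamma), 1) # uniform deletion probabilities
        gamma <- gamma[-drop_idx]
      }
      gamma                                      # indices of retained predictors
    }

    # Example: with rho_bs = 0.5 the draw is the full model with probability
    # 0.5, illustrating how this construction concentrates on larger models.
    set.seed(1); draw_bs_model(p = 10, rho_bs = 0.5)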

An R package for the introduced method will be available in the near future.

Acknowledgment

The author is thankful to Jim Berger, Merlise Clyde, David Dunson, Fan Li, James Scott,

and Mike West for helpful discussions and comments. The author is especially grateful to

Quanli Wang for programming help that substantially improved the efficiency of the software.

