32
Bayesian structure learning in graphical models Sayantan Banerjee 1* , Subhashis Ghosal 2 1 The University of Texas MD Anderson Cancer Center 2 North Carolina State University Abstract We consider the problem of estimating a sparse precision matrix of a multivariate Gaussian distribution, where the dimension p may be large. Gaussian graphical models provide an important tool in describ- ing conditional independence through presence or absence of edges in the underlying graph. A popular non-Bayesian method of estimating a graphical structure is given by the graphical lasso. In this paper, we consider a Bayesian approach to the problem. We use priors which put a mixture of a point mass at zero and certain absolutely continuous distribution on off-diagonal elements of the precision matrix. Hence the resulting posterior distribution can be used for graphical struc- ture learning. The posterior convergence rate of the precision matrix is obtained and is shown to match the oracle rate. The posterior dis- tribution on the model space is extremely cumbersome to compute using the commonly used reversible jump Markov chain Monte Carlo methods. However, the posterior mode in each graph can be easily identified as the graphical lasso restricted to each model. We propose a fast computational method for approximating the posterior proba- bilities of various graphs using the Laplace approximation approach by expanding the posterior density around the posterior mode. We also provide estimates of the accuracy in the approximation. Keywords : Graphical lasso; Graphical models; Laplace approximation; Pos- terior convergence; Precision matrix. * Corresponding author at: Department of Biostatistics, The University of Texas MD Anderson Cancer Center, 1400 Pressler Street, Houston, Texas 77030. Tel: +1-919-699- 8773. e-mail: [email protected] 1

Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Bayesian structure learning in graphicalmodels

Sayantan Banerjee1∗, Subhashis Ghosal21The University of Texas MD Anderson Cancer Center

2North Carolina State University

AbstractWe consider the problem of estimating a sparse precision matrix of

a multivariate Gaussian distribution, where the dimension p may belarge. Gaussian graphical models provide an important tool in describ-ing conditional independence through presence or absence of edges inthe underlying graph. A popular non-Bayesian method of estimatinga graphical structure is given by the graphical lasso. In this paper, weconsider a Bayesian approach to the problem. We use priors which puta mixture of a point mass at zero and certain absolutely continuousdistribution on off-diagonal elements of the precision matrix. Hencethe resulting posterior distribution can be used for graphical struc-ture learning. The posterior convergence rate of the precision matrixis obtained and is shown to match the oracle rate. The posterior dis-tribution on the model space is extremely cumbersome to computeusing the commonly used reversible jump Markov chain Monte Carlomethods. However, the posterior mode in each graph can be easilyidentified as the graphical lasso restricted to each model. We proposea fast computational method for approximating the posterior proba-bilities of various graphs using the Laplace approximation approachby expanding the posterior density around the posterior mode. Wealso provide estimates of the accuracy in the approximation.

Keywords : Graphical lasso; Graphical models; Laplace approximation; Pos-terior convergence; Precision matrix.

∗Corresponding author at: Department of Biostatistics, The University of Texas MDAnderson Cancer Center, 1400 Pressler Street, Houston, Texas 77030. Tel: +1-919-699-8773. e-mail: [email protected]

1

Page 2: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

1 Introduction

Statistical inference on a large covariance or precision matrix (inverse of co-variance matrix) is a topic of growing interest in recent times. Often thedimension p grows with the sample size n and even p can exceed n. Dataof this type are frequently encountered in fMRI, spectroscopy, gene arrayexpressions and so on. Estimation of a covariance or precision matrix is ofspecial interest because of its importance in methods like principal compo-nent analysis (PCA), linear discriminant analysis (LDA), etc. In cases wherep > n, the sample covariance matrix is necessarily singular, and hence anestimator of the precision matrix cannot be obtained by inverting it. There-fore we need to resort to other techniques for handling the high-dimensionalproblems.

Regularization methods for estimation of the covariance or precision ma-trix have been proposed and studied in recent literature for high-dimensionalproblems. These include banding, thresholding, tapering and penalizationbased methods; for example, see Ledoit and Wolf (2004); Huang et al. (2006);Yuan and Lin (2007); Bickel and Levina (2008a,b); Karoui (2008); Friedmanet al. (2008); Rothman et al. (2008); Lam and Fan (2009); Rothman et al.(2009); Cai et al. (2010, 2011); see also Banerjee and Ghosal (2014) for aBayesian method based on banding. The primary goal of these regulariza-tion based methods is to impose a sparsity structure in the matrix. Mostof these methods are applicable to situations where there is a natural order-ing in the underlying variables, for example in data from time series, spatialdata, etc., so that variables which are far off from each other have smallercorrelations or partial correlations. In high-dimensional situations for dataarising from genetics or econometrics, a natural ordering of the underlyingvariables may not always be readily available and hence estimation methodswhich are invariant to the ordering of the variables are desirable.

For estimation of a sparse inverse covariance matrix, graphical models(Lauritzen, 1996) provide an excellent tool, as the conditional dependencebetween the component variables is captured by an undirected graph; see Do-bra et al. (2004); Meinshausen and Buhlmann (2006); Yuan and Lin (2007);Friedman et al. (2008). There are several methods in the frequentist literaturefor the estimation of the precision matrix through graphical models. Thesemethods include minimization of the penalized log-likelihood of the data witha lasso type penalty on the elements of the precision matrix. Several algo-rithms have been developed in the literature to solve the above optimization

2

Page 3: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

problem, including coordinate descent based algorithm for the lasso, which ispopularly known as the graphical lasso (Meinshausen and Buhlmann, 2006;Friedman et al., 2008; Banerjee et al., 2008; Yuan and Lin, 2007; Guo et al.,2011; Witten et al., 2011). Other methods include the Sparse PermutationInvariant Covariance Estimator (SPICE) (Rothman et al., 2008).

Frequentist behavior of Bayesian methods in the context of high dimen-sional covariance matrix estimation have been studied only by a few authors.Ghosal (2000) studied asymptotic normality of posterior distributions forexponential families, which include the normal model with unknown covari-ance matrix, when the dimension p→∞, but restricting to p n. Recently,Pati et al. (2014) considered sparse Bayesian factor models for dimension-ality reduction in high dimensional problems and showed consistency in theL2-operator norm (also known as the spectral norm) by using a point massmixture prior on the factor loadings, assuming such a factor model represen-tation for the true covariance matrix.

Bayesian methods for inference using graphical models have also been de-veloped, as in Roverato (2000); Atay-Kayis and Massam (2005); Letac andMassam (2007). A conjugate family of priors, known as the G-Wishart prior(Roverato, 2000) have been developed for incomplete decomposable graphs.The equivalent prior on the covariance matrix is termed as the hyper inverseWishart distribution in Dawid and Lauritzen (1993). Letac and Massam(2007) introduced a more general family of conjugate priors for the precisionmatrix, known as the WPG

-Wishart family of distributions, which also has theconjugacy property. The properties of this family of distributions, includingexpressions for the Bayes estimators were further explored in Rajaratnamet al. (2008). Recently Banerjee and Ghosal (2014) studied posterior conver-gence rates for a G-Wishart prior inducing a banding structure, where thetrue precision matrix need not have the banding structure.

Wang (2012) developed a Bayesian version of the graphical lasso, byputting Laplace priors on the off-diagonal elements of the precision matrixand exponential priors on the diagonals. Similar in lines with the Bayesianlasso (Park and Casella, 2008), the posterior mode in this case coincideswith the graphical lasso estimate. A block Gibbs sampler is also developedfor sampling from the resulting posterior. However, the Bayesian graphicallasso does not introduce any sparsity in the graphical structure because of theabsence of a point mass at zero in the prior distribution for the off-diagonalelements. On the other hand, if point masses are introduced, the result-ing posterior distribution on the structure of the graph becomes extremely

3

Page 4: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

difficult to compute based on the traditional reversible jump Markov chainMonte Carlo method.

In this paper, we derive posterior convergence rates for the Bayesiangraphical lasso prior in terms of the Frobenius norm under appropriate spar-sity conditions when the dimension p grows with the sample size n. Forcomputing the posterior distribution, we propose a Laplace approximationbased method to compute the posterior probability of different graphicalstructures. Such Laplace approximations based methods have been devel-oped for variable selection in regression models; for example, see Yuan andLin (2005); Curtis et al. (2014). The lasso type penalty on the elements leadto non-differentiability of the integrand, when the graphical lasso sets an off-diagonal entry to zero, but the model includes that off-diagonal entry as afree variable. We shall call such models non-regular following the terminol-ogy used by Yuan and Lin (2005) for variable selection in linear regressionmodels. We show that the posterior probability of non-regular models aresubstantially smaller than their regular counterparts and hence in compar-ison may be ignored from consideration. We also estimate the error in theLaplace approximation for regular models.

The paper is organized as follows. In the next section, we introduce no-tations and discuss preliminaries on graphical models required for the othersections of the paper. In Section 3, we state model assumptions and specifythe prior distribution on the underlying parameters, derive the form of theposterior and obtain the posterior convergence rate using the general theorydeveloped in Ghosal et al. (2000). In Section 4, we develop the approxima-tion of the posterior probabilities for different graphical models and discussthe issue of non-regular graphical models. We also show that the error in ap-proximation of the posterior probabilities using the Laplace approximation isasymptotically negligible under appropriate conditions. A simulation studyis performed in the Section 5 followed by a real data example in Section 6.Proofs of main results and additional lemmas are included in the Appendix.

2 Notations and preliminaries

An undirected graph G comprises of a non-empty set V of p vertices indexingthe components of a p-dimensional random vector along with an edge set E ⊂(i, j) ∈ V ×V : i < j. Let X = (X1, . . . , Xp)

T be distributed as Np(0,Ω−1),

where the precision matrix Ω = ((ωij)) is such that (i, j) 6∈ E implies ωij =

4

Page 5: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

0. We then say that X follows a Gaussian graphical model (GGM) withrespect to the graph G. Since the absence of an edge between i and j impliesconditional independence of Xi and Xj given (Xr: r 6= i, j), a GGM serves asan excellent tool in representing the sparsity structure in the precision matrix.Following the notation in Letac and Massam (2007), the canonical parameterΩ is restricted to PG, where PG is the cone of positive definite symmetricmatrices of order p having zero entry corresponding to each missing edgein E. We also denote the linear space of symmetric matrices of order p byM, and M+ ⊂ M to be the cone of positive definite matrices of order p.Corresponding to each GGM G = (V,E), we define the set E = (i, j) ∈V × V : i = j, or (i, j) ∈ E.

By tn = O(δn) (respectively, o(δn)), we mean that tn/δn is bounded (re-spectively, tn/δn → 0 as n→∞). For a random sequence Xn, Xn = OP (δn)(respectively, Xn = oP (δn)) means that P(|Xn| ≤ Mδn) → 1 for some con-stant M (respectively, P(|Xn| < εδn) → 1 for all ε > 0). For numericalsequences rn and sn, by rn sn (or, sn rn) we mean that rn = o(sn),while by rn . sn (or sn & rn) we mean that rn = O(sn). By rn snwe mean that both rn . sn and sn . rn hold, while rn ∼ sn stands forrn/sn → 1. The indicator function is denoted by 1l. Non-stochastic vectorsare represented in bold lowercase English or Greek letters with the compo-nents of a vector by the corresponding non-bold letters, that is, for x ∈ Rp,x = (x1, . . . , xp)

T . For a vector x ∈ Rp, we define the following vector

norms: ‖x‖r =(∑p

j=1 |xj|r)1/r

, ‖x‖∞ = maxj |xj|. Matrices are denoted in

bold uppercase English or Greek letters, like A = ((aij)), where aij standsfor the (i, j)th entry of A. The identity matrix of order p will be denoted byIp. If A is a symmetric p× p matrix, let eig1(A) ≤ · · · ≤ eigp(A) stand forits eigenvalues and let the trace of A be denoted by tr(A). Viewing A as avector in Rp2 , we define Lr, 1 ≤ r <∞ and L∞-norms on p× p matrices as

‖A‖r =

(p∑i=1

p∑j=1

|aij|r)1/r

, 1 ≤ r <∞, ‖A‖∞ = maxi,j|aij|.

Note that ‖A‖2 =√

tr(ATA), the Frobenius norm. Viewing A an operatorfrom (Rp, ‖ · ‖r) to (Rp, ‖ · ‖s), where 1 ≤ r, s ≤ ∞, we can also define,‖A‖(r,s) = sup(‖Ax‖s: ‖x‖r = 1). We refer to the norm ‖ · ‖(r,r) as the Lr-operator norm. This gives the L2-operator norm of A as

‖A‖(2,2) = [maxeigi(ATA): 1 ≤ i ≤ p]1/2.

5

Page 6: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

For symmetric matrices, ‖A‖(2,2) = max|eigi(A)|: 1 ≤ i ≤ p. For symmet-ric matrices A and B of order p, we have the following:

‖A‖∞ ≤ ‖A‖(2,2) ≤ ‖A‖2 ≤ p‖A‖∞,‖AB‖2 ≤ ‖A‖(2,2)‖B‖2, ‖AB‖2 ≤ ‖A‖2‖B‖(2,2).

(2.1)

Let A1/2 stand for the unique positive definite square root of a positivedefinite matrix A. We say that A > 0 if A is positive definite, where 0stands for the zero matrix. We denote sets in non-bold uppercase Englishletters. The cardinality of a set T , that is, the number of elements in T isdenoted by #T . We define the symmetric matrix E(i,j) = ((1l(i,j),(j,i)(l,m))).

The Hellinger distance between two probability densities q1 and q2 is givenby h(q1, q2) = ‖√q1 −

√q2‖2.

For a subset A of a metric space (S, d), N(ε, A, d) denotes the ε-coveringnumber of A with respect to d, that is, the minimum number of d-balls ofsize ε in S needed to cover A.

3 Model, prior and posterior concentration

Consider n independent random samples X1, . . . ,Xn from Np(0,Σ), whereΣ is nonsingular and the precision matrix Ω = Σ−1 is sparse. The problemis to estimate Ω and to learn the underlying graphical structure. Let Σ =n−1

∑ni=1 XiX

Ti , the natural estimator of Σ.

The graphical lasso produces sparse solutions for the precision matrix, asthe lasso does for linear regression. The graphical lasso estimator minimizestwo times the penalized average negative log-likelihood

− log det(Ω) + tr(ΣΩ) +λ

n‖Ω‖1, (3.1)

over the class of positive definite matrices, and λ ≥ 0 acts as the penaltyparameter. Rothman et al. (2008) derived frequentist convergence rates of thepenalized estimator under some sparsity assumptions on the true precisionmatrix. More specifically, consider the following class of positive definitematrices of order p:

U(ε0, s) = Ω: #(i, j): 1 ≤ i < j ≤ p, ωij 6= 0 ≤ s,

0 < ε0 ≤ eig1(Ω) ≤ eigp(Ω) ≤ ε−10 <∞

.

(3.2)

6

Page 7: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Though Rothman et al. (2008) considered penalizing only the off-diagonalelements of Ω, a minor modification of their proof leads to the same conver-gence rate for the graphical lasso estimator, obtained by additionally penal-izing the diagonal elements. Let us denote Ω∗ as the graphical lasso obtainedby minimizing (3.1) based on a sample of size n from a p-dimensional Gaus-sian distribution with precision matrix Ω0 ∈ U(ε0, s). Then, it follows fromTheorem 1 in Rothman et al. (2008) that the rate of convergence of Ω∗ isn−1/2(p + s)1/2(log p)1/2 in the Frobenius norm. In particular, this impliesthat ‖Ω∗ −Ω0‖2 tends to zero in probability whenever n−1(p+ s) log p→ 0.Under this condition, we have that

‖Ω∗‖(2,2) = OP (1), ‖Ω∗−1‖(2,2) = OP (1). (3.3)

The first relation follows by the triangle inequality ‖Ω∗‖(2,2) ≤ ‖Ω0‖(2,2) +‖Ω∗ −Ω0‖(2,2) and the norm inequality ‖Ω∗ −Ω0‖(2,2) ≤ ‖Ω∗ −Ω0‖2, whilefor the second relation observe that

‖Ω∗−1‖(2,2) ≤ ‖Ω−10 ‖(2,2) + ‖Ω∗−1 −Ω−1

0 ‖(2,2)

≤ ‖Ω−10 ‖(2,2) + ‖Ω−1

0 ‖(2,2)‖Ω∗ −Ω0‖(2,2)‖Ω∗−1‖(2,2),

which leads to

‖Ω∗−1‖(2,2) ≤‖Ω−1

0 ‖(2,2)

1− ‖Ω−10 ‖(2,2)‖Ω∗ −Ω0‖(2,2)

. (3.4)

In the Bayesian context, Wang (2012) introduced the graphical lasso prior,which uses exponential distributions on diagonal elements and Laplace den-sity λe−λ|x|/2 on off-diagonal elements, all independently of each other, andfinally imposes a positive definiteness constraint. The graphical lasso priorhas a drawback that it puts absolutely continuous priors on the elementsof the precision matrix, and hence the posterior probabilities of the eventωij = 0 is always exactly zero.

Wang (2012) also mentioned an extension of the graphical lasso by puttingan additional level of prior on the underlying graphical model structure usingpoint mass priors on the events corresponding to the absence of an edge inthe edge-set E, but did not pursue the method. We put point-mass prior onthe events ωij = 0 to make posterior inference about the sparse structureof the underlying graphical model. Define Γ = (γij: 1 ≤ i < j ≤ p) to be a(p2

)-dimensional vector of edge-inclusion indicator, that is,

γij = 1l(i, j) ∈ E, 1 ≤ i < j ≤ p. (3.5)

7

Page 8: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Identifying Γ with the set of indices (i, j): γij = 1, we denote by #Γ thenumber of non-zero elements in Γ. Similar to the Bayesian graphical lassoprior, given the underlying graphical structure, we put a Laplace prior onthe non-zero off-diagonal elements of the precision matrix and for the diag-onal elements we have an exponential prior, overall maintaining the positivedefiniteness of the parameter. In order to establish convergence rates, we infact need to keep Ω and its inverse away from singular matrices by imposinga restriction on their eigenvalues. Let M+

0 be a subset of M+ whose ele-ments have eigenvalues bounded between two fixed positive numbers. Thenthe joint prior density on Ω is given by,

p(Ω|Γ) ∝∏γij=1

exp(−λ|ωij|)p∏i=1

exp (−λωii/2) 1lM+0

(Ω). (3.6)

We propose two different priors on the graphical structure indicator Γ. Theedge indicators γij, 1 ≤ i < j ≤ p, are considered to be independent andidentically distributed (i.i.d) Bernoulli(q) random variables, but then condi-tioned to the restriction that the model size

∑1≤i<j≤p γij does not exceed R.

For some a1, a2 > 0, the prior distribution on R is assumed to satisfy

P(R > a1m) ≤ e−a2m logm, m = 1, 2, . . . . (3.7)

This prior is similar to that used by Castillo and van der Vaart (2012),which chooses the model size first according to a distribution with a similartail decay and then subsets are selected randomly with equal probability.We can also specify the individual priors on γij the same as above, butnow truncating the model size to some fixed r, where r may depend on nand is chosen to satisfy a metric entropy condition required for posteriorconvergence.

Thus, in the first situation, the prior on the graphical structure indicatorΓ, given R, is given by,

p(Γ | R) ∝ q#Γ(1− q)(p2)−#Γ1l(#Γ ≤ R), (3.8)

leading to

p(Γ) ∝ q#Γ(1− q)(p2)−#ΓP(R ≥ #Γ). (3.9)

In the second case, the prior on Γ is simply given by

p(Γ) ∝ q#Γ(1− q)(p2)−#Γ1l(#Γ ≤ r). (3.10)

8

Page 9: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Smaller values of q prefer graphical models with fewer number of edges, andhence induce more sparsity in the precision matrix.

Due to the positive definiteness constraint on the parameter Ω, the nor-malizing constant corresponding posterior distribution of the graphical modelis intractable. One possible solution is to employ a reversible jump Markovchain Monte Carlo (RJMCMC) algorithm, which jumps from models of vary-ing dimensions to evaluate the posterior probabilities. As there are as many

as 2(p2) possible models, the posterior model probabilities estimated by RJM-

CMC visits are extremely unreliable. We consider a radically different ap-proach to posterior computation based on Laplace approximations, elabo-rated in the next section.

Under the above prior specifications, the joint posterior distribution of Ωand Γ given the data X(n) = (X1, . . . ,Xn) is given by

p(Ω,Γ|X(n)) ∝ p(X(n)|Ω,Γ)p(Ω|Γ)p(Γ)

= (2π)−np/2det(Ω)n/2 exp−n tr(ΣΩ)/2

×∏γij=1

λ exp(−λ|ωij|)/2p∏i=1

λ exp (−λωii/2) /2

× p(Γ)1lM+0

(Ω). (3.11)

Thus,p(Ω,Γ|X(n)) ∝ CΓQ(Ω,Γ|X(n)), (3.12)

where

CΓ = (2π)−np/2q#Γ(1− q)(p2)−#Γ(λ/2)p+#Γβ(Γ),

β(Γ) =

P(R ≥ #Γ), for prior as in (3.9),

1l#Γ ≤ r, for prior as in (3.10),

Q(Ω,Γ|X(n)) = det(Ω)n/2 exp−n tr(ΣΩ)/2∏γij=1

exp(−λ|ωij|)

×p∏i=1

exp (−λωii/2) 1lM+0

(Ω). (3.13)

The following result gives posterior convergence rate as n → ∞. Weassume that the true model is sparse, as described by (3.2).

9

Page 10: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Theorem 3.1. Let X(n) = (X1, . . . ,Xn) be a random sample from a p-dimensional Gaussian distribution with mean 0 and precision matrix Ω0 ∈U(ε0, s) for some 0 < ε0 <∞ and 0 ≤ s ≤ p(p− 1)/2. Also assume that theprior distributions p(Ω | Γ) and p(Γ) as in (3.6) and (3.9) or (3.10) withq < 1/2 and the range of eigenvalues of matrices in M+

0 is sufficiently broadto contain [ε0, ε

−10 ]. Then the posterior distribution of Ω satisfies

E0

[P‖Ω−Ω0‖2 > Mεn |X(n)

]→ 0, (3.14)

for εn = n−1/2(p+ s)1/2(log p)1/2 and a sufficiently large constant M > 0.

The proof uses the general theory of posterior convergence of Ghosal et al.(2000) and will be given in the appendix. The above posterior convergencerate matches exactly with the convergence rate of the graphical lasso obtainedin Rothman et al. (2008).

Note that, Theorem 3.1 gives ‖Ω −Ω0‖2 = O(εn) with posterior proba-bility tending to one in probability and from Rothman et al. (2008) it followsthat, ‖Ω∗ −Ω0‖2 = OP (εn), where Ω∗ is the graphical lasso. Hence, by thetriangle inequality, ‖Ω−Ω∗‖2 = O(εn) with posterior probability tending toone in probability. This gives,∫

‖Ω−Ω∗‖2≤εn f(Ω)∏

(i,j)∈E dωij∫Ω∈M+

0f(Ω)

∏(i,j)∈E dωij

→ 1, (3.15)

where f(Ω) is a bounded and positive measurable function of Ω. If the modelis restricted to Γ, then the posterior and the graphical lasso will concentratearound the projection of the true precision matrix on the model at the rateεn, so that the posterior probability of an εn-Frobenius neighborhood aroundthe graphical lasso in model Γ will go to one.

4 Posterior Computation

The marginal posterior density of the graphical structure indicator Γ canbe obtained by integrating out elements of the precision matrix in the jointposterior density in (3.11), to get

p(Γ|X(n)) ∝ CΓ

∫ΩΓ∈M+

0

exp−nhΓ(ΩΓ)/2∏

(i,j)∈EΓ

dωΓ,ij, (4.1)

10

Page 11: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

where

hΓ(ΩΓ) = − log det(ΩΓ) + tr(ΣΩΓ) +2λ

n

∑γij=1

|ωΓ,ij|+λ

n

p∑i=1

ωΓ,ii, (4.2)

and EΓ = (i, j) ∈ V × V : i = j, or γij = 1 for i 6= j. In other words, EΓ

refers to the indices of the diagonal elements and the non-zero off-diagonalelements corresponding to the edges in the graphical model defined by theedge-inclusion indicator vector Γ, to be referred to as model Γ below.

Note that hΓ(ΩΓ) is minimized at ΩΓ = Ω∗Γ, the graphical lasso corre-sponding to the penalty parameter λ/n and model Γ.

The marginal posterior of Γ is, however, intractable. We give an approx-imate method for the posterior probability computations of various modelsusing Laplace approximation. The Laplace approximation requires expand-ing the integrand in (4.1) around the maximum, which in this case, coincideswith the graphical lasso in model Γ. Laplace approximation technique is wellaccepted in Bayesian computation, most notably in deriving the BayesianInformation Criterion (BIC) as an approximation to logarithm of posteriormodel probabilities.

4.1 Approximating model posterior probabilities

Define ∆Γ = ΩΓ −Ω∗Γ = ((uΓ,ij)), where Ω∗Γ = ((ω∗Γ,ij)) is the graphical lassosolution corresponding to the underlying graphical model structure Γ andpenalty parameter λ/n. Then,

p(Γ|X) ∝ CΓ exp−nhΓ(Ω∗Γ)/2 det(Ω∗Γ)−n/2

×∫

∆Γ+Ω∗Γ∈M+0

exp−n gΓ(∆Γ)/2∏

(i,j)∈EΓ

duΓ,ij,(4.3)

where gΓ(∆Γ) is given by

− log det(∆Γ +Ω∗Γ)+tr(Σ∆Γ)+2λ

n

∑γij=1

(|uΓ,ij +ω∗Γ,ij|−|ω∗Γ,ij|)+λ

n

p∑i=1

uΓ,ii.

(4.4)Clearly gΓ(∆Γ) is minimized at ∆Γ = 0 by the definition of Ω∗Γ, so the firstderivative of gΓ(∆Γ) vanishes at 0, provided that it is differentiable at 0.

11

Page 12: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Define the matrix HB = ((hB(i, j), (l,m))), where

hB(i, j), (l,m) = trB−1E(i,j)B

−1E(l,m)

. (4.5)

Using standard matrix calculus (for example, see Section 15.9 of Harville(2008)), we can find that the Hessian of gΓ(∆Γ) is the #EΓ ×#EΓ matrixH∆Γ+Ω∗Γ

, whose (i, j), (l,m)th entry for (i, j), (l,m) ∈ EΓ is given by

∂2gΓ(∆Γ)

∂uΓ,ij∂uΓ,lm

= tr

(∆Γ + Ω∗Γ)−1E(i,j)(∆Γ + Ω∗Γ)−1E(l,m)

. (4.6)

Thus the Laplace approximation p∗(Γ | X(n)) to the posterior probabilityp(Γ |X(n)) is given by

p∗(Γ|X(n)) ∝ CΓ exp−nhΓ(Ω∗Γ)/2 det(Ω∗Γ)−n/2 exp−n g(0)/2

× (2π)#EΓ/2(n/2)−#EΓ/2

[det

∂2gΓ(∆Γ)

∂∆Γ∂∆TΓ

∣∣∣∣0

]−1/2

= CΓ exp−nhΓ(Ω∗Γ)/2(π/n)#EΓ/2det(HΩ∗Γ)−1/2.

(4.7)

The approximation in (4.7) is meaningful only if all entries of the graphicallasso in model Γ are non-zero; otherwise the derivative of gΓ(∆Γ) does notexist. A similar situation arises in the context of regression models; seeYuan and Lin (2005) and Curtis et al. (2014). In the next section, we showthat such “non-regular models” can essentially be ignored for the purposeof posterior probability evaluation. Also to satisfy the regularity conditionsfor Laplace approximation, the determinant of the Hessian above needs tobe bounded away from zero. We have shown this in Lemma A.3 by arguingthat the minimum eigenvalue of the Hessian is bounded away from zero.The partial derivatives as defined in equation (4.6) need to be bounded in aneighborhood of ∆Γ = 0, which follows from continuity, since Ω∗Γ is close tothe projection of the true Ω0 on Γ with high probability where entries of theinverse are bounded.

4.2 Ignorability of non-regular models

As discussed in the previous section, the objective function of the graphicallasso problem in model Γ is not differentiable if the graphical lasso has a zerooff-diagonal entry, that is, ω∗ij = 0 for at least one γij = 1. Let us assume,

12

Page 13: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

for notational simplicity, that the vector Γ has been arranged in such a waythat the first t elements of Γ are 1 and the rest are 0. Also, among those t1s, the last r of them have corresponding graphical lasso solution equal tozero. For such a non-regular model, we argue that the submodel Γ′, withfirst (t− r) 1s and rest 0s, provides the same graphical lasso solution for thenon-zero elements as the bigger model Γ. This means that for (i, j) such thatγij = γ′ij = 1, the graphical lasso solution corresponding to Γ, given by ω∗Γ,ijis identical with that corresponding to Γ′, given by ω∗Γ′,ij. We refer to such asubmodel Γ′ as the corresponding regular submodel of the non-regular modelΓ.

Lemma 4.1. For the corresponding regular submodel Γ′ of Γ, the graphicallasso solution corresponding to models Γ and Γ′ are identical.

We give a proof of the above lemma in the appendix. Let ΩΓ denote theprecision matrix in model Γ and ∆Γ = ΩΓ−Ω∗Γ = ((uΓ,ij)). Let the graphicallasso in model Γ be denoted by Ω∗Γ. Note that Ω∗Γ′ = Ω∗Γ by the definitionof the regular submodel Γ′. The ratio of the posterior model probabilities ofthe two model is given by,

p(Γ|X(n))

p(Γ′|X(n))=

∫∆Γ+Ω∗Γ∈M

+0

exp−nhΓ(∆Γ)/2∏

(i,j)∈EΓduΓ,ij

CΓ′∫

∆Γ′+Ω∗Γ′∈M

+0

exp−nhΓ′(∆Γ′)/2∏

(i,j)∈EΓ′duΓ′,ij

. (4.8)

The following result shows the ignorability of the non-regular models.

Theorem 4.2. Consider the prior on Γ as given in (3.9) or (3.10) withq < 1/2. The posterior probability of a non-regular model Γ is always lessthan that of the corresponding regular submodel Γ′.

Proof. Using (3.15), we have,

p(Γ|X(n))

p(Γ′|X(n))=

∫‖∆Γ‖2≤εn

exp−nhΓ(∆Γ)/2∏

(i,j)∈EΓduΓ,ij + o(1)

CΓ′∫‖∆Γ′‖2≤εn

exp−nhΓ′(∆Γ′)/2∏

(i,j)∈EΓ′duΓ′,ij + o(1)

.

Now, note that for (i, j) such that γij = γ′ij = 1, we have,

uΓ,ij: ‖∆Γ‖2 ≤ εn ⊂ uΓ′,ij: ‖∆Γ′‖2 ≤ εn.

13

Page 14: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Hence, using Lemma A.2, we get

p(Γ|X(n))

p(Γ′|X(n))≤ CΓ

CΓ′

∫‖∆Γ‖2≤εn

exp

−n2

n

∑γij=1,γ′ij=0

|uΓ,ij|

∏(i,j)∈EΓ∩Ec

Γ′

duΓ,ij

≤ CΓ

CΓ′

∫exp

−λ ∑γij=1,γ′ij=0

|uΓ,ij|

∏(i,j)∈EΓ∩Ec

Γ′

duΓ,ij

=CΓ

CΓ′

(2

λ

)#Γ−#Γ′

=

(q

1− q

)rβ(Γ)

β(Γ′)

≤(

q

1− q

)r. (4.9)

The last inequality follows from the fact that if the prior as in (3.9) is used,then P(R ≥ #Γ) ≤ P(R ≥ #Γ′) since #Γ > #Γ′. For the other prior as in(3.10), the inequality follows trivially as it involves the ratio of two indicatorvariables only.

For q < 1/2, the above ratio is less than 1. This completes the proof.

The above result is particularly important in the sense that we can focuson regular models only, ignoring non-regular ones especially if q is chosento be small. While approximating the posterior probabilities of the regularmodels, we re-normalize the values considering the regular models only.

4.3 Error in Laplace approximation

The approximation of the posterior probability of a model Γ is based on aTaylor series expansion of the function h(ΩΓ) defined in (4.2) around thegraphical lasso in model Γ. Let ∆Γ = ΩΓ − Ω∗Γ, and vec(∆Γ) denote thevectorized version of ∆Γ, but excluding the entries set to zero by the modelΓ. Thus vec(∆Γ) is a vector of dimension at most p + #Γ. The followingresult gives the bound on the remainder term of the Taylor series expansionunder the above assumptions.

Lemma 4.3. For any regular model Γ, with probability tending to one, theremainder term in the expansion of the function hΓ(ΩΓ), as defined in (4.2),

14

Page 15: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

around Ω∗Γ, is bounded by (p+#Γ)‖∆Γ‖22 (C1‖∆Γ‖2 + C2‖∆Γ‖2

2) /2 for somepositive constants C1, C2.

This result can be used to find a bound for the error in Laplace approxi-mation of the posterior probabilities of the graphical model structures. Thefollowing result gives the condition for which the error in approximation isasymptotically negligible.

Theorem 4.4. With probability tending to one, the error in Laplace approx-imation of the posterior probability of any regular model Γ is asymptoticallynegligible if (p+ #Γ)2εn → 0 in probability, where εn is the posterior conver-gence rate, that is, the error in the Laplace approximation tends to zero inprobability if n−1/2(p+ #Γ)5/2(log p)1/2 → 0 in probability.

The proof of the above result depends on several additional results, in-cluding Lemma 4.3 involving the bound on the remainder term in the Taylorseries expansion of hΓ(ΩΓ). We give a proof of the above result along withthese additional results in the appendix.

5 Simulation results

We perform a simulation study to assess the performance of the Bayesianmethod for graphical structure learning. We use 4 different models for oursimulations, and we specify these models in terms of the elements of thecovariance matrix Σ = ((σij)) or the precision matrix Ω = ((ωij)), as follows:

1. Model 1: AR(1) model, σij = 0.7|i−j|.

2. Model 2: AR(2) model, ωii = 1, ωi,i−1 = ωi−1,i = 0.5, ωi,i−2 = ωi−2,i =0.25.

3. Model 3: Star model, where every node is connected to the first node,and ωii = 1, ω1,i = ωi,1 = 0.1, and ωij = 0 otherwise.

4. Model 4: Circle model, ωii = 2, ωi−1,i = ωi,i−1 = 1, ω1,p = ωp,1 = 0.9.

Corresponding to each model, we generate samples of size n = 100, 200and dimension p = 30, 50, 100. The penalty parameter λ for the graphicallasso algorithm is chosen such that λ/n = 0.5 and the value of q appearing in

15

Page 16: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

the prior of the graphical structure indicator is taken to be 0.4. In computa-tion, there is implicit restriction on the size by the size of the graphical lasso.For the theory, there is a restriction on the model size through the variableR which has sharp tails (that is, R is unlikely to be big). Thus in either way,a restriction on size is imposed meaning larger values of q are acceptablewithout problem (subject to q < 1/2 for ignorability of non-regular models).Thus if we need to fix a value of q, it makes sense to choose relatively big toavoid low sensitivity. Lower values will be justified if we have strong priorinformation. Choosing q from data by putting a prior and calculating itsposterior is of course sensible but cannot be done in our setting, as we arenot computing the full posterior.

We run 100 replications for each of the models and find the median prob-ability model for each replication. To assess the performance of the medianprobability model (denoted by ‘MPP’), we compute the specificity (SP), sen-sitivity (SE) and Matthews Correlation Coefficient (MCC) averaged acrossthe replications as defined below and also compute the same for the graphicallasso (denoted by ‘GL’). The results are presented in Table 1.

SP =TN

TN + FP, SE =

TP

TP + FN

MCC =TP× TN− FP× FN√

(TP + FP)(TP + FN)(TN + FP)(TN + FN), (5.1)

where TP, TN, FP and FN respectively denote the true positives (edgesincluded which are present in the true model), true negatives (edges excludedwhich are absent in the true model), false positives (edges included whichare absent in the true model) and false negatives (edges excluded which arepresent in the true model) in the selected model. In this context we alsodefine the False Positive Rate (FPR) as FPR = FP/(TN + FP), that is,FPR = 1 − SP. We show the ROC curves corresponding to the variousmodels with values of the penalty parameter ranging between 0.1 and 1plotting Sensitivity (SE) against False Positive Rate (FPR). The curves areshown in Figures 1 and 2.

In all the cases, the Bayesian method performs slightly better than thegraphical lasso in terms of specificity, but suffers a bit in sensitivity. This isexpected from a theoretical viewpoint as the approximate posterior probabil-ities are computed for the graphs which are sub-graphs of the graphical struc-ture identified by the graphical lasso. The remaining graphs are not explored

16

Page 17: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

owing to non-regularity. Choosing q < 1/2 ensures that those structures canbe ignored safely.

The sensitivity results for both the methods are not good for AR(2) andStar models. The ROC curves corresponding to these two models reveal thathigher values of the penalty parameter result in better sensitivity, but atthe cost of higher false positive rate. Overall, in terms of MCC, the medianprobability model performs better than the model selected by the graphicallasso.

Figure 1: ROC curves for AR(1) and AR(2) structures of the precision matrixcorresponding to sample size n = 100 and matrix dimensions p = 30, 50, 100.The penalty parameter in the graphical lasso algorithm varies between 0.1and 1. Red curve corresponds to the median probability model (MPP) andthe blue curve corresponds to graphical lasso (GL).

17

Page 18: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Tab

le1:

Sim

ula

tion

resu

lts

for

diff

eren

tst

ruct

ure

sof

pre

cisi

onm

atri

ces

n=

100

n=

200

MP

PG

LM

PP

GL

Mod

elp

SP

SE

MC

CS

PS

EM

CC

SP

SE

MC

CS

PS

EM

CC

300.

977

0.94

10.

831

0.9

61

0.9

83

0.7

84

0.9

86

0.9

96

0.9

07

0.9

69

1.0

00

0.8

23

(0.0

03)

(0.0

19)

(0.0

15)

(0.0

03)

(0.0

10)

(0.0

13)

(0.0

02)

(0.0

03)

(0.0

14)

(0.0

02)

(0.0

00)

(0.0

13)

AR

(1)

500.

987

0.95

30.

841

0.9

77

0.9

86

0.7

85

0.9

91

0.9

92

0.9

03

0.9

80

1.0

00

0.8

23

(0.0

02)

(0.0

13)

(0.0

10)

(0.0

01)

(0.0

04)

(0.0

10)

(0.0

01)

(0.0

04)

(0.0

08)

(0.0

01)

(0.0

00)

(0.0

06)

100

0.99

20.

967

0.83

70.9

89

0.9

91

0.8

04

0.9

94

0.9

95

0.8

90

0.9

91

0.9

99

0.8

27

(0.0

01)

(0.0

08)

(0.0

07)

(0.0

01)

(0.0

03)

(0.0

06)

(0.0

01)

(0.0

02)

(0.0

08)

(0.0

01)

(0.0

01)

(0.0

06)

300.

975

0.47

00.

546

0.9

64

0.5

35

0.5

58

0.9

87

0.4

95

0.6

17

0.9

82

0.5

17

0.6

10

(0.0

03)

(0.0

14)

(0.0

13)

(0.0

02)

(0.0

13)

(0.0

12)

(0.0

02)

(0.0

08)

(0.0

08)

(0.0

02)

(0.0

09)

(0.0

07)

AR

(2)

500.

983

0.46

20.5

410.9

71

0.5

08

0.5

22

0.9

93

0.4

89

0.6

29

0.9

87

0.5

34

0.6

22

(0.0

01)

(0.0

13)

(0.0

11)

(0.0

02)

(0.0

10)

(0.0

09)

(0.0

01)

(0.0

05)

(0.0

07)

(0.0

01)

(0.0

01)

(0.0

06)

100

0.98

90.

470

0.53

70.9

80

0.5

31

0.5

14

0.9

95

0.4

84

0.6

24

0.9

93

0.5

29

0.6

24

(0.0

01)

(0.0

06)

(0.0

06)

(0.0

01)

(0.0

07)

(0.0

07)

(0.0

01)

(0.0

06)

(0.0

04)

(0.0

01)

(0.0

09)

(0.0

05)

300.

947

0.28

90.

228

0.9

37

0.3

10

0.2

24

0.9

95

0.2

10

0.3

78

0.9

93

0.2

52

0.4

02

(0.0

04)

(0.0

38)

(0.0

36)

(0.0

03)

(0.0

43)

(0.0

36)

(0.0

01)

(0.0

32)

(0.0

41)

(0.0

01)

(0.0

36)

(0.0

38)

Sta

r50

0.94

50.

492

0.33

20.9

34

0.5

14

0.3

17

0.9

93

0.4

75

0.5

85

0.9

90

0.5

14

0.5

77

(0.0

03)

(0.0

34)

(0.0

25)

(0.0

03)

(0.0

35)

(0.0

23)

(0.0

00)

(0.0

34)

(0.0

24)

(0.0

01)

(0.0

32)

(0.0

22)

100

0.93

91.

000

0.48

50.9

27

1.0

00

0.4

52

0.9

88

1.0

00

0.7

92

0.9

84

1.0

00

0.7

48

(0.0

02)

(0.0

00)

(0.0

07)

(0.0

02)

(0.0

00)

(0.0

05)

(0.0

00)

(0.0

00)

(0.0

08)

(0.0

01)

(0.0

00)

(0.0

07)

300.

733

1.00

00.

399

0.6

94

1.0

00

0.3

69

0.7

19

1.0

00

0.3

88

0.6

74

1.0

00

0.3

54

(0.0

04)

(0.0

00)

(0.0

03)

(0.0

06)

(0.0

00)

(0.0

04)

(0.0

05)

(0.0

00)

(0.0

04)

(0.0

04)

(0.0

00)

(0.0

03)

Cir

cle

500.

831

1.00

00.

409

0.8

22

1.0

00

0.3

98

0.8

33

1.0

00

0.4

11

0.8

14

1.0

00

0.3

90

(0.0

03)

(0.0

00)

(0.0

03)

(0.0

02)

(0.0

00)

(0.0

03)

(0.0

02)

(0.0

00)

(0.0

03)

(0.0

02)

(0.0

00)

(0.0

02)

100

0.89

11.

000

0.37

80.8

94

1.0

00

0.3

83

0.9

03

1.0

00

0.3

99

0.9

02

1.0

00

0.3

97

(0.0

01)

(0.0

00)

(0.0

02)

(0.0

01)

(0.0

00)

(0.0

02)

(0.0

08)

(0.0

00)

(0.0

02)

(0.0

01)

(0.0

00)

(0.0

02)

18

Page 19: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Figure 2: ROC curves for Star and Circle structures of the precision matrixcorresponding to sample size n = 100 and matrix dimensions p = 30, 50, 100.The penalty parameter in the graphical lasso algorithm varies between 0.1and 1. Red curve corresponds to the median probability model (MPP) andthe blue curve corresponds to graphical lasso (GL).

6 Illustration with real data

In this section we illustrate the Bayesian graphical structure learning methodwith the stock price data from Yahoo! Finance. Description of the dataset can be found in Liu et al. (2009) and available in the huge package onCRAN (Zhao et al., 2012) as stockdata. The data set consists of closingprices of stocks that were consistently included in the S&P 500 index inthe time period January 1, 2003 to January 1, 2008 for a total of 1258days. The stocks are also categorized into 10 Global Industry ClassificationStandard (GICS) sectors, namely, “Health Care”, “Materials”, “Industrials”,“Consumer Staples”, “Consumer Discretionary”, “Utilities”, “InformationTechnology”, “Financials”, “Energy”, “Telecommunication Services”.

19

Page 20: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Denoting Ytj as the closing stock price for the jth stock on day t, weconstruct the 1257×452 data matrix S with entries stj = log(Y(t+1)j/Ytj), t =1, . . . , 1257, j = 1, . . . , 452. For analysis, we construct the data matrix X bystandardizing S, so that each stock has mean zero and standard deviationone. We find the median probability model as selected by the Bayesiangraphical structure learning method. The corresponding graphical structureis displayed in Figure 3. The vertices of the graph are colored correspondingto the different GICS sectors. We find that stocks from the same sectorstend to be related with other members from that category, and generallynot related across different sectors, though there are some connections. Thegrouping of the stocks corresponding to their sectors is expected, implyingthat the stock prices for a particular sector are conditionally independent ofthose of other sectors.

We also individually study data pertaining to some of the specific sectorsto have a closer look at the strength of the groupings where perturbations dueto latent factors is least expected. For this, we consider the sectors “Utilities”and “Information Technology”. The graphical structure is displayed in Figure4. The stock prices for the two sectors clearly separate as expected.

A Proofs

Proof of Theorem 3.1. We apply the general theory of posterior convergencerate by verifying the conditions in Theorem 2.1 of Ghosal et al. (2000). Thisrequires evaluating the prior concentration rate of Kullback-Leibler neigh-borhoods, finding a suitable sieve in the space of densities and controlling itsHellinger metric entropy and showing that the complement of the sieve hasexponentially small prior probability. Then the posterior convergence rateat the true density is obtained in terms of the Hellinger distance. Finally, byLemma A.1, within the model the Hellinger distance is equivalent with theFrobenius distance on precision matrices. Hence the entropy calculation canbe done in terms of the Frobenius distance and the convergence rate at thetrue precision matrix will also follow in terms of the Frobenius distance.

To estimate prior concentration, let

B(pΩ0 , εn) = p:K(pΩ0 , pΩ) ≤ ε2n, V (pΩ0 , pΩ) ≤ ε2n,

where K(f, g) =∫f log(f/g) and V (f, g) =

∫f log2(f/g). Recall that, for

20

Page 21: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Figure 3: Graphical structure of the median probability model selected bythe Bayesian graphical structure learning method.

Z ∼ Np(0,Σ) and a p× p symmetric matrix A, we have,

E(ZTAZ) = tr(AΣ), Var(ZTAZ) = 2 tr(AΣAΣ). (A.1)

Then

K(pΩ0 , pΩ) =1

2(log det Ω0 − log det Ω)− 1

2tr(Ip −ΩΩ−1

0 )

= −1

2

p∑i=1

log di −1

2

p∑i=1

(1− di),

21

Page 22: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Figure 4: Graphical structure corresponding to the subgraph correspondingto the sectors “Utilities” [red] and “Information Technology”[violet].

and

V (pΩ0 , pΩ) =1

2tr(Ip − 2ΩΩ−1

0 + ΩΩ−10 ΩΩ−1

0 )

=1

2tr(Ip −Ω

−1/20 ΩΩ

−1/20 )2

=1

2

p∑i=1

(1− di)2,

where di, i = 1, . . . , p, are the eigenvalues of Ω−1/20 ΩΩ

−1/20 . Now, K(pΩ0 , pΩ) ≥

h2(pΩ0 , pΩ), and hence by Lemma A.1 maxi |di − 1| < 1. Hence we can ex-pand log di in the powers of (1−di) to get K(pΩ0 , pΩ) ∼ 1

4

∑pi=1(1−di)2. By

Lemma A.1,∑p

i=1(1− di)2 is bounded by a constant multiple of ‖Ω−Ω0‖22,

and hence a constant multiple of p2‖Ω − Ω0‖2∞ in view of (2.1). Therefore

22

Page 23: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

B(pΩ0 , εn) ⊃ pΩ: ‖Ω −Ω0‖∞ ≤ cεn/p, and hence it suffices to get a lowerestimate of the prior probability of the latter set. The components of Ω arenot independently distributed, since the prior for Ω is truncated to M+

0 .However, as a small neighborhood of Ω0 ∈ U(ε0, s) lies withinM+

0 , the trun-cation can only increase prior concentration of B(pΩ0 , εn). Therefore we canpretend componentwise independence for the purpose of lower bounding theabove prior probability which gives the estimate

Π (‖Ω0 −Ω‖∞ ≤ cεn/p) & (cεn/p)p+s , (A.2)

where Π denotes the prior distribution on Ω. The prior concentration ratecondition thus gives,

(p+ s)(log p+ log εn−1) nε2n, (A.3)

so as to get εn = n−1/2(p+ s)1/2(log n)1/2.Consider the sieve Pn to be the space of all densities pΩ such that the

graph induced by Ω has maximum number of edges r <(p2

)/2 and each entry

of Ω is at most L in absolute value, where r and L depend on n and are to bechosen later. Then the metric entropy of the set of precision matrices withrespect to the Frobenius distance is given by

log

r∑j=1

(L

εn

)j ((p2

)j

)≤ log

r

(L

εn

)r (p+

(p2

)r

). log r + r logL+ r log ε−1

n + r log p,

so the choices r ∼ b1nε2n/ log n, and L ∼ b2nε

2n for any choice of constants

b1, b2 > 0 will satisfy the rate equation

log r + r log p+ r log εn−1 + r log(nε2n) nε2n. (A.4)

Note that under the condition nε2n/ log n (p2

), the requirement r <

(p2

)/2

is satisfied as n→∞.To bound the prior probability of Pcn, observe that pΩ can fall outside

Pn only if either an entry exceeds L or the number of off-diagonal entriesexceed r. The probability of the former event is bounded by

(p2

)e−L, which is

bounded by exp(−b3nε2n) for some constant b3 > 0 by the choice L ∼ b2nε

2n

and b3 can be chosen to be as large as we like by choosing b2 sufficiently large.The probability of the latter event is

P(R > r) ≤ exp(−a′2b1nε2n), (A.5)

23

Page 24: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

where a′2b1 can be made as large as possible by making b1 large. Henceεn = n−1/2(p+ s)1/2(log n)1/2 is the posterior convergence rate.

The following lemma establishes a norm equivalence necessary for findingposterior convergence rate and metric entropy calculations.

Lemma A.1. If pΩkis the density of Np(0,Ω

−1k ), k = 1, 2, then for all

Ωk ∈M+0 , k = 1, 2, and di, i = 1, . . . , p, eigenvalues of A = Ω

−1/21 Ω2Ω

−1/21 ,

we have that for some δ > 0 and constant c0 > 0,

(i) c−10 ‖Ω1 −Ω2‖2

2 ≤∑p

i=1 |di − 1|2 ≤ c0‖Ω1 −Ω2‖22,

(ii) h(pΩ1 , pΩ2) < δ implies maxi |di−1| < 1 and ‖Ω1−Ω2‖2 ≤ c0h2(pΩ1 , pΩ2),

(iii) h2(pΩ1 , pΩ2) ≤ c0‖Ω1 −Ω2‖22,

Proof. Recall that as Ω1 ∈ M+0 , both ‖Ω1‖2 and ‖Ω−1

1 ‖2 are bounded by aconstant. As Ip −A have eigenvalues (1− d1), . . . , (1− dp), we have

‖Ω1 −Ω2‖22 = ‖Ω1/2

1 (Ip −A)Ω1/21 ‖2

2

≤ ‖Ω1‖2(2,2)‖Ip −A‖2

2

= ‖Ω1‖2(2,2)tr (Ip −A)2

= ‖Ω1‖2(2,2)

p∑i=1

(di − 1)2

while conversely

p∑i=1

(di − 1)2 = ‖Ip −A‖22

= ‖Ω−1/21 (Ω1 −Ω2)Ω

−1/21 ‖2

2

≤ ‖Ω−11 ‖2

(2,2)‖Ω1 −Ω2‖22.

This establishes (i). In particular, if ‖Ω1 − Ω2‖2 is sufficiently small, thenmaxi |di − 1| < 1.

Now by direct calculations, in a normal scale model,

1

2h2(pΩ1 , pΩ2) = 1−det(A1/2+A−1/2)−1/2 = 1−

p∏i=1

1

2(d

1/2i + d

−1/2i )

−1/2

,

24

Page 25: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

If h(pΩ1 , pΩ2) < δ, this implies 1− ∏p

i=112(d

1/2i + d

−1/2i )−1/2 ≤ δ2/2. Rear-

ranging the terms, we get,∏p

i=112(d

1/2i + d

−1/2i ) ≤ (1− δ2/2)−2 = 1 + η, say.

Since every term in the product exceeds 1, we have,

maxi

1

2(d

1/2i + d

−1/2i ) ≤ 1 + η. (A.6)

The above equation, upon squaring and rearrangement of terms, gives, forall i, (di− 1)2 ≤ 2d

1/2i η. Note that equation (A.6) gives that d

1/2i ≤ 2(1 + η).

Hence, the above equation implies that (di − 1)2 ≤ 4η(1 + η). If δ > 0 ischosen sufficiently small to make η < (

√2 − 1)/2, then |di − 1| < 1 for all

i = 1, . . . , p.Abbreviating h2(pΩ1 , pΩ2) by h2, the expression for the Hellinger distance

gives∏p

i=1(d1/2i +d

−1/2i ) = 2p(1−h2)−2. On the other hand, for some constants

c1, c2 > 0,[1 + c1

p∑i=1

(di − 1)2

]≤ 2−p

p∏i=1

(d1/2i + d

−1/2i ) ≤

[1 + c2

p∑i=1

(di − 1)2

](A.7)

by a Taylor series expansion, which is possible since maxi |di − 1| < 1. Thelower estimate gives c1

∑pi=1(di − 1)2 ≤ (1− h2)−2 ∼ 2h2, which proves (ii).

Finally, since the Hellinger distance is bounded, to prove (iii), it sufficesto consider the case ‖Ω1 −Ω2‖2 < δ where δ > 0 is sufficiently small. Thenpart (i) implies that maxi |di−1| < 1, and hence the upper estimate in (A.7)gives 2h2 ∼ (1− h2)−2 ≤ c2

∑pi=1(di − 1)2, which proves (iii) in view of part

(i).

Proof of Lemma 4.1. The graphical lasso for the model Γ, given by Ω∗Γ =((ω∗Γ,ij)) satisfies

Ω∗−1Γ − Σ− λG = 0, (A.8)

where G = ((gij)) is a matrix with elements gij = ω∗Γ,ij/|ω∗Γ,ij| if ω∗Γ,ij 6= 0 andgij ∈ [−1, 1] if ω∗Γ,ij = 0, by the Karush-Kuhn-Tucker (KKT) condition; see,for example, Boyd and Vandenberghe (2004), Witten et al. (2011). Whenthe model Γ is non-regular and Γ′ is its corresponding regular submodel, let(i, j) be a pair such that γij = 1 but ω∗Γ,ij = 0. Then Ω∗Γ automaticallysatisfies the KKT condition also for the model Γ′ because ω∗Γ,ij = 0 for anyγ′ij = 0.

25

Page 26: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

The following lemma is essential in proving the ignorability of the non-regular models for posterior probability evaluation.

Lemma A.2. Consider a non-regular model Γ and let Γ′ be its correspondingregular submodel with their common graphical lasso Ω∗Γ. Let ∆Γ = ΩΓ−Ω∗ =((uΓ,ij)) and ∆Γ′ = ((uΓ′,ij)) such that uΓ′,ij = uΓ,ij if i = j or γij = γ′ij = 1and uΓ′,ij = 0 for pairs (i, j) with γ′ij = 0. Then for fixed values of uΓ′,ij forγ′ij = 1, we have,

log det(∆Γ + Ω∗Γ)− tr(Σ∆Γ) ≤ log det(∆Γ′ + Ω∗Γ)− tr(Σ∆Γ′). (A.9)

Proof. Consider maximization of the function

f(∆Γ) = log det(∆Γ + Ω∗Γ)− tr(Σ∆Γ). (A.10)

with respect to the elements uΓ,ij where (i, j) ∈ (i, j): γij = 1, γ′ij = 0.Differentiating the above function for a particular value of uij gives,

∂f(∆Γ)

∂uΓ,ij

= tr[

(∆Γ + Ω∗Γ)−1E(i,j) − ΣE(i,j)

]. (A.11)

The maximizer uΓ,ij satisfies tr[

(∆Γ + Ω∗Γ)−1E(i,j) − ΣE(i,j)

]= 0. Now

consider the function gΓ(∆Γ) defined in (4.4). The derivative of gΓ(∆Γ) withrespect to uΓ,ij satisfies

∂gΓ(∆Γ)

∂uΓ,ij

∣∣∣∣uΓ,ij=0+,uΓ,lm=0,∀(l,m)6=(i,j)

≥ 0, (A.12)

and∂gΓ(∆Γ)

∂uΓ,ij

∣∣∣∣uΓ,ij=0−,uΓ,lm=0,∀(l,m)6=(i,j)

≤ 0. (A.13)

The above two conditions give,

tr[

(∆Γ + Ω∗Γ)−1E(i,j) − ΣE(i,j)

]∣∣∣uΓ,ij=0+,uΓ,lm=0,∀(l,m)6=(i,j)

= 0 ≤ 2λ

n,

tr[

(∆Γ + Ω∗Γ)−1E(i,j) − ΣE(i,j)

]∣∣∣uΓ,ij=0−,uΓ,lm=0,∀(l,m)6=(i,j)

= 0 ≥ −2λ

n.

Since the first derivative of f(∆Γ) is continuous at 0, we have uΓ,ij = 0. Thisimmediately implies the result stated in the lemma.

26

Page 27: Bayesian structure learning in graphical modelsghoshal/papers/Bayesian...Bayesian structure learning in graphical models Sayantan Banerjee1, Subhashis Ghosal2 1The University of Texas

Proof of Lemma 4.3. The Taylor series expansion of $h_\Gamma(\Omega_\Gamma)$ under model $\Gamma$, defined by (4.2), gives
\[
h_\Gamma(\Omega_\Gamma) = h_\Gamma(\Omega^*_\Gamma) + \tfrac12 \operatorname{vec}(\Delta_\Gamma)^T H_{\Omega^*_\Gamma} \operatorname{vec}(\Delta_\Gamma) + R_n, \tag{A.14}
\]
where $R_n$ is the remainder term in the expansion. Using the integral form of the remainder, we have
\[
h_\Gamma(\Omega_\Gamma) = h_\Gamma(\Omega^*_\Gamma) + \operatorname{vec}(\Delta_\Gamma)^T \Bigl\{\int_0^1 (1-\nu)\, H_{\Omega^*_\Gamma+\nu\Delta_\Gamma}\, d\nu\Bigr\} \operatorname{vec}(\Delta_\Gamma). \tag{A.15}
\]
Subtracting (A.14) from (A.15) gives
\[
\begin{aligned}
R_n &= \operatorname{vec}(\Delta_\Gamma)^T \Bigl\{\int_0^1 (1-\nu)\, H_{\Omega^*_\Gamma+\nu\Delta_\Gamma}\, d\nu\Bigr\} \operatorname{vec}(\Delta_\Gamma) - \tfrac12 \operatorname{vec}(\Delta_\Gamma)^T H_{\Omega^*_\Gamma} \operatorname{vec}(\Delta_\Gamma) \\
&= \operatorname{vec}(\Delta_\Gamma)^T \Bigl\{\int_0^1 (1-\nu)\bigl(H_{\Omega^*_\Gamma+\nu\Delta_\Gamma} - H_{\Omega^*_\Gamma}\bigr)\, d\nu\Bigr\} \operatorname{vec}(\Delta_\Gamma) \\
&\le \|\Delta_\Gamma\|_2^2\, \Bigl\|\int_0^1 (1-\nu)\bigl(H_{\Omega^*_\Gamma+\nu\Delta_\Gamma} - H_{\Omega^*_\Gamma}\bigr)\, d\nu\Bigr\|_{(2,2)} \\
&\le \|\Delta_\Gamma\|_2^2 \int_0^1 (1-\nu)\bigl\|H_{\Omega^*_\Gamma+\nu\Delta_\Gamma} - H_{\Omega^*_\Gamma}\bigr\|_{(2,2)}\, d\nu \\
&\le \tfrac12 \|\Delta_\Gamma\|_2^2 \max_{0\le\nu\le1} \bigl\|H_{\Omega^*_\Gamma+\nu\Delta_\Gamma} - H_{\Omega^*_\Gamma}\bigr\|_{(2,2)} \\
&\le \tfrac12 \|\Delta_\Gamma\|_2^2\, (p+\#\Gamma) \max_{0\le\nu\le1} \bigl\|H_{\Omega^*_\Gamma+\nu\Delta_\Gamma} - H_{\Omega^*_\Gamma}\bigr\|_\infty, \tag{A.16}
\end{aligned}
\]
since $\int_0^1 (1-\nu)\,d\nu = 1/2$ and the Hessian is a matrix of order $(p+\#\Gamma)\times(p+\#\Gamma)$ for a regular model $\Gamma$. The above bound involves the maximum of the absolute differences between the elements of the Hessian matrices $H$ computed at the two values $\Omega^*_\Gamma + \nu\Delta_\Gamma$ and $\Omega^*_\Gamma$. Observe that, by the matrix norm relations in (2.1) and (3.1), with probability tending to one,
\[
\begin{aligned}
\|(\Omega^*_\Gamma + \nu\Delta_\Gamma)^{-1} - \Omega^{*-1}_\Gamma\|_\infty
&\le \|(\Omega^*_\Gamma + \nu\Delta_\Gamma)^{-1} - \Omega^{*-1}_\Gamma\|_{(2,2)} \\
&= \|\nu\,\Omega^{*-1}_\Gamma \Delta_\Gamma (I + \nu\,\Omega^{*-1}_\Gamma \Delta_\Gamma)^{-1} \Omega^{*-1}_\Gamma\|_{(2,2)} \\
&\le \nu\, \|\Omega^{*-1}_\Gamma\|_{(2,2)}^2\, \|\Delta_\Gamma\|_{(2,2)} \\
&\le K \|\Delta_\Gamma\|_2. \tag{A.17}
\end{aligned}
\]
For any symmetric matrix $A = ((a_{ij}))$ we note that $\operatorname{tr}\{A E_{(i,j)} A E_{(l,m)}\}$ has the form $2a_{il}a_{jm} + 2a_{im}a_{jl}$ for $i \neq j$, $l \neq m$; $2a_{il}a_{im}$ for $i = j$, $l \neq m$; and $a_{il}^2$ for $i = j$, $l = m$. Hence the elements of $H_{\Omega^*_\Gamma+\nu\Delta_\Gamma} - H_{\Omega^*_\Gamma}$ have the respective forms $(2a_{il}a_{jm} + 2a_{im}a_{jl}) - (2b_{il}b_{jm} + 2b_{im}b_{jl})$, $2a_{il}a_{im} - 2b_{il}b_{im}$ and $a_{il}^2 - b_{il}^2$, where $((a_{ij})) = (\Omega^*_\Gamma)^{-1}$ and $((b_{ij})) = (\Omega^*_\Gamma + \nu\Delta_\Gamma)^{-1}$. Then, using (A.17), we get, with probability tending to one, for all $((i,j),(l,m))$,
\[
\Bigl|\sum a_{il}a_{jm} - \sum b_{il}b_{jm}\Bigr| \le C_1\|\Delta_\Gamma\|_2 + C_2\|\Delta_\Gamma\|_2^2. \tag{A.18}
\]
Since this holds for an arbitrary element of $H_{\Omega^*+\nu\Delta} - H_{\Omega^*}$, using (3.3) and (A.18) we get that, with probability tending to one,
\[
\|H_{\Omega^*+\nu\Delta} - H_{\Omega^*}\|_\infty \le C_1\|\Delta\|_2 + C_2\|\Delta\|_2^2, \tag{A.19}
\]
where $C_1$ and $C_2$ are suitable constants. Now, using (A.16) and (A.19), with probability tending to one,
\[
R_n \le \tfrac12 (p+\#\Gamma)\|\Delta\|_2^2\bigl(C_1\|\Delta\|_2 + C_2\|\Delta\|_2^2\bigr).
\]
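
The argument above only uses the fact that the entries of the Hessian are of the form $\operatorname{tr}(\Omega^{-1}E_{(i,j)}\Omega^{-1}E_{(l,m)})$ and hence vary continuously in $\Omega$. The short sketch below is a numerical illustration with a toy model; the dimension, edge set, mode and perturbation sizes are arbitrary and not taken from the paper. It builds this restricted Hessian directly and shows the sup-norm difference in (A.19) shrinking proportionally to the size of the perturbation.

\begin{verbatim}
import numpy as np

def basis_matrix(i, j, p):
    # E_(i,j): symmetric basis matrix, 1 at (i,j) and (j,i) (a single 1 at (i,i) if i == j)
    E = np.zeros((p, p))
    E[i, j] = 1.0
    E[j, i] = 1.0
    return E

def restricted_hessian(Omega, free_pairs):
    # Entries tr(Omega^{-1} E_(i,j) Omega^{-1} E_(l,m)) over the free coordinates
    A = np.linalg.inv(Omega)
    E = [basis_matrix(i, j, Omega.shape[0]) for (i, j) in free_pairs]
    H = np.empty((len(E), len(E)))
    for k in range(len(E)):
        for l in range(len(E)):
            H[k, l] = np.trace(A @ E[k] @ A @ E[l])
    return H

p = 5
free_pairs = [(i, i) for i in range(p)] + [(0, 1), (1, 2), (3, 4)]  # p diagonals + 3 edges
Omega_star = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))   # toy positive definite mode
H_star = restricted_hessian(Omega_star, free_pairs)

rng = np.random.default_rng(1)
for scale in [1e-1, 1e-2, 1e-3]:
    D = np.zeros((p, p))
    for (i, j) in free_pairs:            # symmetric perturbation supported on the model
        D[i, j] = D[j, i] = scale * rng.standard_normal()
    diff = restricted_hessian(Omega_star + D, free_pairs) - H_star
    print("||Delta||_2 = %.1e   ||H(Omega*+Delta) - H(Omega*)||_inf = %.2e"
          % (np.linalg.norm(D), np.abs(diff).max()))
\end{verbatim}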

Proof of Theorem 4.4. Using the Taylor series expansion of $h_\Gamma(\Omega_\Gamma)$ as in (A.14), the posterior probability of model $\Gamma$ given the data $X^{(n)}$, as in equation (4.1), is proportional to
\[
\int_{\Delta_\Gamma+\Omega^*_\Gamma\in\mathcal{M}^+_0} \exp\Bigl[-\frac{n}{2}\Bigl\{h_\Gamma(\Omega^*_\Gamma) + \tfrac12 \operatorname{vec}(\Delta_\Gamma)^T H_{\Omega^*_\Gamma} \operatorname{vec}(\Delta_\Gamma) + R_n\Bigr\}\Bigr] \prod_{(i,j)\in E_\Gamma} du_{\Gamma,ij}.
\]
We denote $\prod_{(i,j)\in E_\Gamma} du_{\Gamma,ij}$ by $d\Delta_\Gamma$ for notational simplicity. Using (3.15), we get
\[
\frac{\int_{\|\Delta_\Gamma\|_2\le\epsilon_n} \exp\bigl[-n\bigl\{h_\Gamma(\Omega^*_\Gamma) + \tfrac12 \operatorname{vec}(\Delta_\Gamma)^T H_{\Omega^*_\Gamma} \operatorname{vec}(\Delta_\Gamma) + R_n\bigr\}/2\bigr]\, d\Delta_\Gamma}{\int_{\Delta_\Gamma+\Omega^*_\Gamma\in\mathcal{M}^+_0} \exp\bigl[-n\bigl\{h_\Gamma(\Omega^*_\Gamma) + \tfrac12 \operatorname{vec}(\Delta_\Gamma)^T H_{\Omega^*_\Gamma} \operatorname{vec}(\Delta_\Gamma) + R_n\bigr\}/2\bigr]\, d\Delta_\Gamma} \to 1.
\]
Also, for $\|\Delta_\Gamma\|_2 \le \epsilon_n$, $R_n \le (p+\#\Gamma)\|\Delta_\Gamma\|_2^2\,\epsilon_n/2$. Thus, upper and lower bounds for the integral $\int_{\|\Delta_\Gamma\|_2\le\epsilon_n} \exp\{-n h_\Gamma(\Omega_\Gamma)/2\}\, d\Delta_\Gamma$ are given by
\[
e^{-n h_\Gamma(\Omega^*_\Gamma)/2} \int_{\|\Delta_\Gamma\|_2\le\epsilon_n} \exp\bigl[-n \operatorname{vec}(\Delta_\Gamma)^T \bigl\{H_{\Omega^*_\Gamma} \mp (p+\#\Gamma)\epsilon_n I\bigr\} \operatorname{vec}(\Delta_\Gamma)/4\bigr]\, d\Delta_\Gamma.
\]
Note that
\[
\int_{\|\Delta_\Gamma\|_2>\epsilon_n} \exp\bigl[-n \operatorname{vec}(\Delta_\Gamma)^T \bigl\{H_{\Omega^*_\Gamma} \mp (p+\#\Gamma)\epsilon_n I\bigr\} \operatorname{vec}(\Delta_\Gamma)/4\bigr]\, d\Delta_\Gamma \to 0
\]
if $(p+\#\Gamma)\epsilon_n \to 0$ and the minimum eigenvalue of $H_{\Omega^*_\Gamma}$ is bounded away from zero, which we prove in Lemma A.3 below. Hence, the bounds can be simplified to
\[
e^{-n h_\Gamma(\Omega^*_\Gamma)/2} \int_{\Delta_\Gamma+\Omega^*_\Gamma\in\mathcal{M}^+_0} \exp\Bigl[-\frac{n}{4} \operatorname{vec}(\Delta_\Gamma)^T \bigl\{H_{\Omega^*_\Gamma} \mp (p+\#\Gamma)\epsilon_n I\bigr\} \operatorname{vec}(\Delta_\Gamma)\Bigr]\, d\Delta_\Gamma.
\]
Using these bounds, the ratio of the actual integral to the approximating integral has upper and lower bounds given by
\[
\frac{\int_{\Delta_\Gamma+\Omega^*_\Gamma\in\mathcal{M}^+_0} \exp\bigl[-n \operatorname{vec}(\Delta_\Gamma)^T \bigl\{H_{\Omega^*_\Gamma} \mp (p+\#\Gamma)\epsilon_n I\bigr\} \operatorname{vec}(\Delta_\Gamma)/4\bigr]\, d\Delta_\Gamma}{\int_{\Delta_\Gamma+\Omega^*_\Gamma\in\mathcal{M}^+_0} \exp\bigl\{-n \operatorname{vec}(\Delta_\Gamma)^T H_{\Omega^*_\Gamma} \operatorname{vec}(\Delta_\Gamma)/4\bigr\}\, d\Delta_\Gamma}
= \Biggl[\frac{\det\bigl\{H_{\Omega^*_\Gamma} \pm (p+\#\Gamma)\epsilon_n I\bigr\}}{\det(H_{\Omega^*_\Gamma})}\Biggr]^{-1/2}. \tag{A.20}
\]
The above expression lies between $\bigl[1 \mp \operatorname{eig}_1(H_{\Omega^*_\Gamma})^{-1}(p+\#\Gamma)\epsilon_n\bigr]^{-(p+\#\Gamma)/2}$. Using Lemma A.3 below, $\operatorname{eig}_1(H_{\Omega^*_\Gamma})$ is bounded away from zero, and hence the above bound on the ratio goes to $1$ if $(p+\#\Gamma)^2\epsilon_n \to 0$, so that the error in the Laplace approximation is asymptotically small.
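
In practical terms, the proof reduces the contribution of each graph $\Gamma$ to a single Gaussian integral around the restricted graphical lasso; evaluating the approximating integral gives $\exp\{-n h_\Gamma(\Omega^*_\Gamma)/2\}\,(4\pi/n)^{(p+\#\Gamma)/2}\det(H_{\Omega^*_\Gamma})^{-1/2}$. The helper below is a minimal sketch of this final formula only; it assumes that $h_\Gamma(\Omega^*_\Gamma)$ and the restricted Hessian have already been computed elsewhere (their exact forms follow from (4.2) and are not reproduced here), and the numbers in the usage example are placeholders rather than output of the paper's procedure.

\begin{verbatim}
import numpy as np

def log_laplace_model_weight(h_at_mode, H_at_mode, n):
    # log of exp(-n*h/2) * (4*pi/n)^{d/2} * det(H)^{-1/2}, with d = p + #Gamma
    d = H_at_mode.shape[0]
    sign, logdet = np.linalg.slogdet(H_at_mode)
    if sign <= 0:
        raise ValueError("Hessian at the mode must be positive definite (Lemma A.3)")
    return -0.5 * n * h_at_mode + 0.5 * d * np.log(4.0 * np.pi / n) - 0.5 * logdet

# Usage with placeholder inputs for two hypothetical graphs (3 and 4 free parameters):
log_w1 = log_laplace_model_weight(1.8421, np.diag([2.0, 1.5, 1.2]), n=200)
log_w2 = log_laplace_model_weight(1.8395, np.diag([2.0, 1.5, 1.2, 0.9]), n=200)
logs = np.array([log_w1, log_w2])
print("approximate posterior model probabilities:",
      np.exp(logs - np.logaddexp(log_w1, log_w2)))
\end{verbatim}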

We now prove the result that the eigenvalues of the Hessian $H_{\Omega^*_\Gamma}$ are bounded away from zero.

Lemma A.3. Given a model $\Gamma$, the minimum eigenvalue of the Hessian $H_{\Omega^*_\Gamma}$ corresponding to the function $h_\Gamma(\Omega_\Gamma)$, evaluated at $\Omega^*_\Gamma$, is bounded away from zero.

Proof. Note that the Hessian $H_{\Omega^*_\Gamma}$ evaluated at the graphical lasso $\Omega^*_\Gamma$ corresponding to the model $\Gamma$ has the form $T' A_{\Omega^*_\Gamma} T$, where the $(p + 2\#\Gamma)$-dimensional matrix $A_{\Omega^*_\Gamma}$ is a principal submatrix of the $p^2 \times p^2$ matrix $(\Omega^*_\Gamma)^{-1} \otimes (\Omega^*_\Gamma)^{-1}$, and $T$ is a $(p + 2\#\Gamma) \times (p + \#\Gamma)$ matrix of $0$s and $1$s having full column rank. Thus $T' A_{\Omega^*_\Gamma} T$ has full rank if the minimum eigenvalue of $(\Omega^*_\Gamma)^{-1} \otimes (\Omega^*_\Gamma)^{-1}$ is bounded away from zero. Note that $\operatorname{eig}_1\{(\Omega^*_\Gamma)^{-1} \otimes (\Omega^*_\Gamma)^{-1}\} = [\operatorname{eig}_1\{(\Omega^*_\Gamma)^{-1}\}]^2$. The parameter space $\mathcal{M}^+_0$ of precision matrices imposes fixed bounds on the minimum and maximum eigenvalues, and thus $\Omega^*_\Gamma$ in any model $\Gamma$ has to satisfy the eigenvalue restriction. Hence, $[\operatorname{eig}_1\{(\Omega^*_\Gamma)^{-1}\}]^2 = 1/\|\Omega^*_\Gamma\|_2^2 > 0$.
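
To see this structure concretely, the following sketch is a numerical illustration with an arbitrary small graph and an arbitrary positive definite $\Omega^*_\Gamma$, neither taken from the paper. It assembles $T' A_{\Omega^*_\Gamma} T$ from the Kronecker product above and checks that its smallest eigenvalue is no smaller than $1/\|\Omega^*_\Gamma\|_{(2,2)}^2$, the spectral-norm reading of the bound in the proof.

\begin{verbatim}
import numpy as np

p = 4
edges = [(0, 1), (2, 3)]                         # a hypothetical graph with #Gamma = 2 edges
Omega_star = np.eye(p) + 0.25 * (np.eye(p, k=1) + np.eye(p, k=-1))
Ainv = np.linalg.inv(Omega_star)

K = np.kron(Ainv, Ainv)                          # the p^2 x p^2 matrix Omega^{-1} (x) Omega^{-1}

# vec-positions (column-major) of the free entries: p diagonals plus both copies of each edge
pos = ([i + i * p for i in range(p)]
       + [i + j * p for (i, j) in edges]
       + [j + i * p for (i, j) in edges])
A = K[np.ix_(pos, pos)]                          # (p + 2#Gamma)-dimensional principal submatrix

d = p + len(edges)                               # p + #Gamma free parameters
T = np.zeros((len(pos), d))                      # 0-1 matrix of full column rank
for i in range(p):
    T[i, i] = 1.0                                # diagonal parameters
for k in range(len(edges)):
    T[p + k, p + k] = 1.0                        # (i, j) copy of edge k
    T[p + len(edges) + k, p + k] = 1.0           # (j, i) copy of the same edge

H = T.T @ A @ T                                  # restricted Hessian T' A T
print("eig_1(H)                =", np.linalg.eigvalsh(H).min())
print("1 / ||Omega*||_(2,2)^2  =", 1.0 / np.linalg.norm(Omega_star, 2) ** 2)
\end{verbatim}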

