
Variable Selection for Gaussian Graphical Models

Jean Honorio1 Dimitris Samaras1 Irina Rish2 Guillermo Cecchi2

[email protected] [email protected] [email protected] [email protected]

1 Stony Brook University, Stony Brook, NY 11794, USA
2 IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA

Abstract

We present a variable-selection structure learning approach for Gaussian graphical models. Unlike standard sparseness promoting techniques, our method aims at selecting the most important variables in addition to simply sparsifying the set of edges. Through simulations, we show that our method outperforms the state-of-the-art in recovering the ground truth model. Our method also exhibits better generalization performance in a wide range of complex real-world datasets: brain fMRI, gene expression, NASDAQ stock prices and world weather. We also show that our resulting networks are more interpretable in the context of brain fMRI analysis, while retaining discriminability. From an optimization perspective, we show that a block coordinate descent method generates a sequence of positive definite solutions. Thus, we reduce the original problem to a sequence of strictly convex (ℓ1, ℓp) regularized quadratic minimization subproblems for p ∈ {2, ∞}. Our algorithm is well founded since the optimal solution of the maximization problem is unique and bounded.

1 Introduction

Structure learning aims to discover the topology of a probabilistic graphical model such that this model accurately represents a given dataset. Accuracy of representation is measured by the likelihood that the model explains the observed data. From an algorithmic point of view, one challenge faced by structure learning is that the number of possible structures is super-exponential in the number of variables. From a statistical perspective, it is very important to find good regularization techniques in order to avoid over-fitting and to achieve better generalization performance. Such regularization techniques aim to reduce the complexity of the graphical model, which is measured by its number of parameters.

Appearing in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. Volume XX of JMLR: W&CP XX. Copyright 2012 by the authors.

For Gaussian graphical models, the number of parameters, the number of edges in the structure and the number of non-zero elements in the inverse covariance or precision matrix are equivalent measures of complexity. Therefore, several techniques focus on enforcing sparseness of the precision matrix. An approximation method proposed in [1] relied on a sequence of sparse regressions. Maximum likelihood estimation with an ℓ1-norm penalty for encouraging sparseness is proposed in [2, 3, 4].

In this paper, we enforce a particular form of sparseness: that only a small number of nodes in the graphical model interact with each other. Intuitively, we want to select these "important" nodes. However, the above methods for sparsifying the network structure do not directly promote variable selection, i.e. group-wise elimination of all edges adjacent to an "unimportant" node. Variable selection in graphical models presents several advantages. From a computational point of view, reducing the number of variables can significantly reduce the number of precision-matrix parameters. Moreover, group-wise edge elimination may serve as a more aggressive regularization, removing all "noisy" edges associated with nuisance variables at once, and potentially leading to better generalization performance, especially if, indeed, the underlying problem structure involves only a limited number of "important" variables. Finally, variable selection improves interpretability of the graphical model: for example, when learning a graphical model of brain area connectivity, variable selection may help to localize brain areas most relevant to particular mental states.


Table 1: Notation used in this paper.

Notation    Description
‖c‖1        ℓ1-norm of c ∈ R^N, i.e. Σ_n |c_n|
‖c‖∞        ℓ∞-norm of c ∈ R^N, i.e. max_n |c_n|
‖c‖2        Euclidean norm of c ∈ R^N, i.e. √(Σ_n c_n²)
A ⪰ 0       A ∈ R^{N×N} is symmetric and positive semidefinite
A ≻ 0       A ∈ R^{N×N} is symmetric and positive definite
‖A‖1        ℓ1-norm of A ∈ R^{M×N}, i.e. Σ_{mn} |a_{mn}|
‖A‖∞        ℓ∞-norm of A ∈ R^{M×N}, i.e. max_{mn} |a_{mn}|
‖A‖2        spectral norm of A ∈ R^{N×N}, i.e. the maximum eigenvalue of A ≻ 0
‖A‖F        Frobenius norm of A ∈ R^{M×N}, i.e. √(Σ_{mn} a_{mn}²)
⟨A, B⟩      scalar product of A, B ∈ R^{M×N}, i.e. Σ_{mn} a_{mn} b_{mn}

Our contribution is to develop variable selection in the context of learning sparse Gaussian graphical models. To achieve this, we add an ℓ1,p-norm regularization term to the maximum likelihood estimation problem, for p ∈ {2, ∞}. We optimize this problem through a block coordinate descent method which yields sparse and positive definite estimates. We show through synthetic experiments that our method outperforms the state-of-the-art in recovering the ground truth model. We also show that our structures achieve higher test log-likelihood than competing methods on a wide range of complex real-world datasets: brain fMRI, gene expression, NASDAQ stock prices and world weather. In particular, in the context of brain fMRI analysis, we show that our method produces more interpretable models that involve few brain areas, unlike standard sparseness promoting techniques which produce hard-to-interpret networks involving most of the brain. Moreover, our structures are as good as those of standard sparseness promoting techniques when used for classification purposes.

Sec. 2 introduces Gaussian graphical models and techniques for learning them from data. Sec. 3 sets up the ℓ1,p-regularized maximum likelihood problem and discusses its properties. Sec. 4 describes our block coordinate descent method. Experimental results are in Sec. 5.

2 Background

In this paper, we use the notation in Table 1.

A Gaussian graphical model is a graph in which all random variables are continuous and jointly Gaussian. This model corresponds to the multivariate normal distribution for N variables with covariance matrix Σ ∈ R^{N×N}. Conditional independence in a Gaussian graphical model is simply reflected in the zero entries of the precision matrix Ω = Σ^{−1} [5]. Let Ω = {ω_{n1 n2}}; two variables n1 and n2 are conditionally independent if and only if ω_{n1 n2} = 0.
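To make this concrete, here is a small numerical illustration (our addition, not from the paper; the matrix values are arbitrary): a zero off-diagonal entry of the precision matrix gives a zero partial correlation, i.e. conditional independence, even though the corresponding covariance entry is nonzero.

    import numpy as np

    # Hypothetical 3-variable precision matrix with omega_13 = 0, so variables
    # 1 and 3 are conditionally independent given variable 2.
    Omega = np.array([[2.0, 0.6, 0.0],
                      [0.6, 2.0, 0.5],
                      [0.0, 0.5, 2.0]])
    assert np.all(np.linalg.eigvalsh(Omega) > 0)   # valid (positive definite) precision matrix

    Sigma = np.linalg.inv(Omega)                   # covariance of the corresponding Gaussian

    # Partial correlation of variables 1 and 3 given 2: -omega_13 / sqrt(omega_11 * omega_33)
    print(-Omega[0, 2] / np.sqrt(Omega[0, 0] * Omega[2, 2]))   # 0.0: conditionally independent
    print(Sigma[0, 2])                                         # nonzero: marginally dependent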

The estimation of sparse precision matrices was first introduced in [6]. It is well known that finding the most sparse precision matrix which fits a dataset is an NP-hard problem [2]. Therefore, several ℓ1-regularization methods have been proposed.

Given a dense sample covariance matrix Σ ⪰ 0, the problem of finding a sparse precision matrix Ω by regularized maximum likelihood estimation is given by:

max_{Ω ≻ 0} ( log det Ω − ⟨Σ, Ω⟩ − ρ‖Ω‖1 )    (1)

for ρ > 0. The term log det Ω − ⟨Σ, Ω⟩ is the Gaussian log-likelihood. The term ‖Ω‖1 encourages sparseness of the precision matrix, or conditional independence among variables.

Several algorithms have been proposed for solving eq.(1): covariance selection [2], graphical lasso [3] and the Meinshausen-Buhlmann approximation [1].

Besides sparseness, several regularizers have been proposed for Gaussian graphical models, for enforcing diagonal structure [7], spatial coherence [8], common structure among multiple tasks [9], or sparse changes in controlled experiments [10]. In particular, different group sparse priors have been proposed for enforcing block structure for known block-variable assignments [11, 12] and unknown block-variable assignments [13, 14], or power law regularization in scale free networks [15].

Variable selection has been applied to very diverse problems, such as linear regression [16], classification [17, 18, 19] and reinforcement learning [20].

Structure learning through ℓ1-regularization has also been proposed for different types of graphical models: Markov random fields [21], Bayesian networks on binary variables [22], conditional random fields [23], and Ising models [24].

3 Preliminaries

In this section, we set up the problem and discuss some of its properties.

3.1 Problem Setup

We propose priors that are motivated by the variable selection literature in regression and classification, such as group lasso [25, 26, 27], which imposes an ℓ1,2-norm penalty, and simultaneous lasso [28, 29], which imposes an ℓ1,∞-norm penalty.


Recall that an edge in a Gaussian graphical model corresponds to a non-zero entry in the precision matrix. We promote variable selection by learning a structure with a small number of nodes that interact with each other, or equivalently a large number of nodes that are disconnected from the rest of the graph. For each disconnected node, its corresponding row in the precision matrix (or column, given that it is symmetric) contains only zeros, except for the diagonal. Therefore, the use of row-level regularizers such as the ℓ1,p-norm is natural in our context. Note that our goal differs from sparse Gaussian graphical models, in which sparseness is imposed at the edge level only. We additionally impose sparseness at the node level, which promotes conditional independence of variables with respect to all other variables.

Given a dense sample covariance matrix Σ ⪰ 0, we learn a precision matrix Ω ∈ R^{N×N} for N variables. The variable-selection structure learning problem is defined as:

max_{Ω ≻ 0} ( log det Ω − ⟨Σ, Ω⟩ − ρ‖Ω‖1 − τ‖Ω‖1,p )    (2)

for ρ > 0, τ > 0 and p ∈ {2, ∞}. The term log det Ω − ⟨Σ, Ω⟩ is the Gaussian log-likelihood. ‖Ω‖1 encourages sparseness of the precision matrix, or conditional independence among variables. The last term, ‖Ω‖1,p, is our variable selection regularizer, defined as:

‖Ω‖1,p = Σ_n ‖(ω_{n,1}, …, ω_{n,n−1}, ω_{n,n+1}, …, ω_{n,N})‖_p    (3)
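To make eq.(2) and eq.(3) concrete, the following sketch (our addition; the values of rho, tau, p and the data are illustrative) evaluates the ℓ1,p regularizer over the off-diagonal entries of each row and the penalized log-likelihood of a candidate precision matrix:

    import numpy as np

    def l1p_offdiag(Omega, p):
        """Variable-selection regularizer of eq.(3): sum over rows of the
        l_p norm of the off-diagonal entries of that row."""
        N = Omega.shape[0]
        total = 0.0
        for n in range(N):
            row = np.delete(Omega[n], n)           # drop the diagonal entry
            total += np.linalg.norm(row, ord=p)    # p = 2 or np.inf
        return total

    def objective(Omega, Sigma_hat, rho, tau, p):
        """Penalized Gaussian log-likelihood of eq.(2) (to be maximized)."""
        sign, logdet = np.linalg.slogdet(Omega)
        assert sign > 0, "Omega must be positive definite"
        loglik = logdet - np.sum(Sigma_hat * Omega)   # log det(Omega) - <Sigma, Omega>
        penalty = rho * np.abs(Omega).sum() + tau * l1p_offdiag(Omega, p)
        return loglik - penalty

    # Illustrative usage with arbitrary data and parameters
    rng = np.random.default_rng(0)
    Sigma_hat = np.cov(rng.standard_normal((50, 10)), rowvar=False)
    Omega0 = np.diag(1.0 / np.diag(Sigma_hat))        # the initialization used later in Algorithm 1
    print(objective(Omega0, Sigma_hat, rho=0.5, tau=25.0, p=np.inf))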

In a technical report, [30] proposed an optimization problem that is similar to eq.(2). The main differences are that their model does not promote sparseness, and that they do not solve the original maximum likelihood problem, but instead build upon an approximate (pseudo-likelihood) approach of Meinshausen-Buhlmann [1] based on independent linear regression problems. Finally, note that regression-based methods such as [1] have already been shown in [3] to have worse performance than solving the original maximum likelihood problem. In this paper, we solve the original maximum likelihood problem.

3.2 Bounds

In what follows, we discuss uniqueness and boundedness of the optimal solution of our problem.

Lemma 1. For ρ > 0, τ > 0, the variable-selection structure learning problem in eq.(2) is a maximization problem with a concave (but not strictly concave) objective function and convex constraints.

Proof. The Gaussian log-likelihood is concave, since log det is concave on the space of symmetric positive definite matrices, and since the linear operator ⟨·, ·⟩ is also concave. Both regularization terms, the negative ℓ1-norm as well as the negative ℓ1,p-norm defined in eq.(3), are non-smooth concave functions. Finally, Ω ≻ 0 is a convex constraint.

For clarity of exposition, we assume that the diagonals of Ω are penalized by our variable selection regularizer defined in eq.(3).

Theorem 2. For ρ > 0, τ > 0, the optimal solution to the variable-selection structure learning problem in eq.(2) is unique and bounded as follows:

( 1 / (‖Σ‖2 + Nρ + N^{1/p′} τ) ) I ⪯ Ω* ⪯ ( N / max(ρ, τ) ) I    (4)

where the ℓp′-norm is the dual of the ℓp-norm, i.e. (p = 2, p′ = 2) or (p = ∞, p′ = 1).

Proof. By using the identity for dual norms, κ‖c‖p = max_{‖d‖p′ ≤ κ} dᵀc, in eq.(2), we get:

max_{Ω ≻ 0} min_{‖A‖∞ ≤ ρ, ‖B‖∞,p′ ≤ τ} ( log det Ω − ⟨Σ + A + B, Ω⟩ )    (5)

where ‖B‖∞,p′ = max_n ‖(b_{n,1}, …, b_{n,N})‖_p′. By virtue of Sion's minimax theorem, we can swap the order of max and min. Furthermore, note that the optimal solution of the inner problem is given by Ω = (Σ + A + B)^{−1}. By replacing this solution in eq.(5), we get the dual problem of eq.(2):

min_{‖A‖∞ ≤ ρ, ‖B‖∞,p′ ≤ τ} ( − log det(Σ + A + B) − N )    (6)

In order to find a lower bound for the minimum eigenvalue of Ω*, note that ‖Ω*^{−1}‖2 = ‖Σ + A + B‖2 ≤ ‖Σ‖2 + ‖A‖2 + ‖B‖2 ≤ ‖Σ‖2 + N‖A‖∞ + N^{1/p′}‖B‖∞,p′ ≤ ‖Σ‖2 + Nρ + N^{1/p′}τ. (Here we used ‖B‖2 ≤ N^{1/p′}‖B‖∞,p′, as shown in Appendix A.)

In order to find an upper bound for the maximum eigenvalue of Ω*, note that, at the optimum, the primal-dual gap is zero:

−N + ⟨Σ, Ω*⟩ + ρ‖Ω*‖1 + τ‖Ω*‖1,p = 0    (7)

The upper bound is found as follows: ‖Ω*‖2 ≤ ‖Ω*‖F ≤ ‖Ω*‖1 = (N − ⟨Σ, Ω*⟩ − τ‖Ω*‖1,p)/ρ. Note that τ‖Ω*‖1,p ≥ 0, and since Σ ⪰ 0 and Ω* ≻ 0, it follows that ⟨Σ, Ω*⟩ ≥ 0. Therefore, ‖Ω*‖2 ≤ N/ρ. In a similar fashion, ‖Ω*‖2 ≤ ‖Ω*‖1,p = (N − ⟨Σ, Ω*⟩ − ρ‖Ω*‖1)/τ. (Here we used ‖Ω*‖2 ≤ ‖Ω*‖1,p, as shown in Appendix A.) Note that ρ‖Ω*‖1 ≥ 0 and ⟨Σ, Ω*⟩ ≥ 0. Therefore, ‖Ω*‖2 ≤ N/τ.
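Both bounds in eq.(4) are computable directly from Σ, ρ, τ, N and p before any optimization; a small sketch (our addition, with illustrative values):

    import numpy as np

    def eigenvalue_bounds(Sigma_hat, rho, tau, p):
        """Lower/upper bounds of eq.(4) on the eigenvalues of the optimal Omega*."""
        N = Sigma_hat.shape[0]
        p_dual = 2 if p == 2 else 1                  # dual exponent p'
        spec = np.linalg.norm(Sigma_hat, ord=2)      # spectral norm ||Sigma||_2
        lower = 1.0 / (spec + N * rho + N ** (1.0 / p_dual) * tau)
        upper = N / max(rho, tau)
        return lower, upper

    rng = np.random.default_rng(1)
    Sigma_hat = np.cov(rng.standard_normal((30, 8)), rowvar=False)
    print(eigenvalue_bounds(Sigma_hat, rho=0.5, tau=25.0, p=np.inf))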

4 Block Coordinate Descent Method

Since the objective function in eq.(2) contains a non-smooth regularizer, methods such as gradient descent cannot be applied. On the other hand, subgradient descent methods very rarely converge to non-smooth points [31]. In our problem, these non-smooth points correspond to zeros in the precision matrix, are often the true minima of the objective function, and are very desirable in the solution because they convey information of conditional independence among variables.

We apply a block coordinate descent method on the primal problem [8, 9], unlike covariance selection [2] and graphical lasso [3], which optimize the dual. Optimization of the dual problem in eq.(6) by a block coordinate descent method can be done with quadratic programming for p = ∞ but not for p = 2 (i.e. the objective function is quadratic for p ∈ {2, ∞}, while the constraints are linear for p = ∞ and quadratic for p = 2). Optimization of the primal problem provides the same efficient framework for p ∈ {2, ∞}. We point out that a projected subgradient method as in [11] cannot be applied since our regularizer does not decompose into disjoint subsets. Our problem contains a positive definiteness constraint and therefore does not fall into the general framework of [25, 26, 27, 32, 28, 29], which consider unconstrained problems only. Finally, more recent work [33, 34] considers subsets with overlap, but still considers unconstrained problems only.

Theorem 3. The block coordinate descent method for the variable-selection structure learning problem in eq.(2) generates a sequence of positive definite solutions.

Proof. Maximization can be performed with respect to one row and column of the precision matrix Ω at a time. Without loss of generality, we use the last row and column in our derivation. Let:

Ω = [ W  y ; yᵀ  z ],   Σ = [ S  u ; uᵀ  v ]    (8)

where W, S ∈ R^{(N−1)×(N−1)} and y, u ∈ R^{N−1}.

In terms of the variables y, z and the constant matrix W, the variable-selection structure learning problem in eq.(2) can be reformulated as:

max_{Ω ≻ 0} ( log(z − yᵀW^{−1}y) − 2uᵀy − (v + ρ)z − 2ρ‖y‖1 − τ‖y‖p − τ Σ_n ‖(y_n, t_n)‖_p )    (9)

where t_n = ‖(w_{n,1}, …, w_{n,n−1}, w_{n,n+1}, …, w_{n,N})‖_p. If Ω is a symmetric matrix, then according to the Haynsworth inertia formula, Ω ≻ 0 if and only if its Schur complement z − yᵀW^{−1}y > 0 and W ≻ 0. By maximizing eq.(9) with respect to z, we get:

z − yᵀW^{−1}y = 1/(v + ρ)    (10)

and since v > 0 and ρ > 0, this implies that the Schur complement in eq.(10) is positive. Finally, in our iterative optimization, it suffices to initialize Ω to a matrix known to be positive definite, e.g. a diagonal matrix with positive elements.
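A minimal sketch of the row/column split in eq.(8) and of the z-update implied by eq.(10) (our addition; it assumes the last variable is the one being updated and inverts W directly instead of maintaining W^{−1} incrementally):

    import numpy as np

    def split_last(Omega, Sigma_hat):
        """Partition Omega and Sigma as in eq.(8), using the last row/column."""
        W, y, z = Omega[:-1, :-1], Omega[:-1, -1], Omega[-1, -1]
        S, u, v = Sigma_hat[:-1, :-1], Sigma_hat[:-1, -1], Sigma_hat[-1, -1]
        return W, y, z, S, u, v

    def update_z(W, y, v, rho):
        """Optimal z from eq.(10): z = 1/(v + rho) + y^T W^{-1} y, so that the
        Schur complement z - y^T W^{-1} y = 1/(v + rho) stays strictly positive."""
        return 1.0 / (v + rho) + y @ np.linalg.solve(W, y)

    # Tiny illustration with arbitrary matrices
    Omega, Sigma_hat = np.eye(3), 2.0 * np.eye(3)
    W, y, z, S, u, v = split_last(Omega, Sigma_hat)
    print(update_z(W, y, v, rho=0.5))    # 0.4 = 1/(2 + 0.5) + 0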

Theorem 4. The block coordinate descent method for the variable-selection structure learning problem in eq.(2) is equivalent to solving a sequence of strictly convex (ℓ1, ℓ1,p) regularized quadratic subproblems for p ∈ {2, ∞}:

min_{y ∈ R^{N−1}} ( (1/2) yᵀ(v + ρ)W^{−1}y + uᵀy + ρ‖y‖1 + (τ/2)‖y‖p + (τ/2) Σ_n ‖(y_n, t_n)‖_p )    (11)

Proof. By replacing the optimal z given by eq.(10) into the objective function in eq.(9), we get eq.(11). Since W ≻ 0 ⇒ W^{−1} ≻ 0, eq.(11) is strictly convex.

Lemma 5. If ‖u‖∞ ≤ ρ + τ/(2(N−1)^{1/p′}) or ‖u‖p′ ≤ ρ + τ/2, the (ℓ1, ℓ1,p) regularized quadratic problem in eq.(11) has the minimizer y* = 0.

Proof. Note that since W ≻ 0 ⇒ W^{−1} ≻ 0, y* = 0 is the minimizer of the quadratic part of eq.(11). It suffices to prove that the remaining part is also minimized at y* = 0, i.e. that uᵀy + ρ‖y‖1 + (τ/2)‖y‖p + (τ/2) Σ_n ‖(y_n, t_n)‖_p ≥ (τ/2) Σ_n t_n for an arbitrary y. The lower bound comes from setting y* = 0 in eq.(11) and noting that (∀n) t_n > 0.

By using the lower bounds Σ_n ‖(y_n, t_n)‖_p ≥ Σ_n t_n and either ‖y‖p ≥ ‖y‖1/(N−1)^{1/p′} or ‖y‖1 ≥ ‖y‖p, we modify the original claim into a stronger one, i.e. uᵀy + (ρ + τ/(2(N−1)^{1/p′}))‖y‖1 ≥ 0 or uᵀy + (ρ + τ/2)‖y‖p ≥ 0. Finally, by using the identity for dual norms, κ‖y‖p = max_{‖d‖p′ ≤ κ} dᵀy, we have max_{‖d‖∞ ≤ ρ + τ/(2(N−1)^{1/p′})} (u + d)ᵀy ≥ 0 or max_{‖d‖p′ ≤ ρ + τ/2} (u + d)ᵀy ≥ 0, which proves our claim.

Remark 6. By using Lemma 5, we can reduce the size of the original problem by removing the variables for which this condition holds, since the condition depends only on the dense sample covariance matrix.
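A sketch of this pre-screening (our addition; applying the test of Lemma 5 row by row to the sample covariance is our reading of Remark 6):

    import numpy as np

    def screen_variables(Sigma_hat, rho, tau, p):
        """Indices that Lemma 5 allows us to disconnect up front: their
        off-diagonal precision entries are zero at the optimum."""
        N = Sigma_hat.shape[0]
        p_dual = 2 if p == 2 else 1
        removable = []
        for n in range(N):
            u = np.delete(Sigma_hat[n], n)     # off-diagonal covariances of variable n
            cond_inf = np.max(np.abs(u)) <= rho + tau / (2 * (N - 1) ** (1.0 / p_dual))
            cond_dual = np.linalg.norm(u, ord=p_dual) <= rho + tau / 2.0
            if cond_inf or cond_dual:
                removable.append(n)
        return removable

    rng = np.random.default_rng(2)
    Sigma_hat = np.cov(rng.standard_normal((40, 12)), rowvar=False)
    print(screen_variables(Sigma_hat, rho=0.5, tau=25.0, p=2))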


Theorem 7. The coordinate descent method for the (ℓ1, ℓ1,p) regularized quadratic problem in eq.(11) is equivalent to solving a sequence of strictly convex (ℓ1, ℓp) regularized quadratic subproblems:

min_x ( (1/2) q x² − c x + ρ|x| + (τ/2)‖(x, a)‖_p + (τ/2)‖(x, b)‖_p )    (12)

Proof. Without loss of generality, we use the last row and column in our derivation, since permutation of rows and columns is always possible. Let:

W^{−1} = [ H11  h12 ; h12ᵀ  h22 ],   y = [ y1 ; x ],   u = [ u1 ; u2 ]    (13)

where H11 ∈ R^{(N−2)×(N−2)} and h12, y1, u1 ∈ R^{N−2}.

In terms of the variable x and the constants q = (v + ρ)h22, c = −((v + ρ)h12ᵀy1 + u2), a = ‖y1‖p, b = t_n, the (ℓ1, ℓ1,p) regularized quadratic problem in eq.(11) can be reformulated as in eq.(12). Moreover, since v > 0 ∧ ρ > 0 ∧ h22 > 0 ⇒ q > 0, eq.(12) is strictly convex.

For p = ∞, eq.(12) has five points at which the objective function is non-smooth, i.e. x ∈ {−max(a, b), −min(a, b), 0, min(a, b), max(a, b)}. Furthermore, since the objective function is quadratic on each interval, it admits a closed form solution.

For p = 2, eq.(12) has only one non-smooth point, i.e. x = 0. Given the objective function f(x), we first compute the left derivative ∂−f(0) = −c − ρ and the right derivative ∂+f(0) = −c + ρ. If ∂−f(0) ≤ 0 ∧ ∂+f(0) ≥ 0, then x* = 0. If ∂−f(0) > 0, then x* < 0 and we use the one-dimensional Newton-Raphson method for finding x*. If ∂+f(0) < 0, then x* > 0. For numerical stability, we add a small ε > 0 to the ℓ2-norms by using √(x² + a² + ε) instead of ‖(x, a)‖2.
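A sketch of the p = 2 scalar update for eq.(12) along these lines (our addition, not the authors' implementation; the step-halving safeguard and the starting point are choices of ours):

    import numpy as np

    def solve_p2_scalar(q, c, a, b, rho, tau, eps=1e-9, iters=20):
        """Minimize f(x) = q x^2/2 - c x + rho |x| + (tau/2) sqrt(x^2 + a^2 + eps)
                                                   + (tau/2) sqrt(x^2 + b^2 + eps),
        the eps-smoothed p = 2 case of eq.(12), with q > 0."""
        if -c - rho <= 0.0 <= -c + rho:          # subgradient test at the non-smooth point x = 0
            return 0.0
        sign = 1.0 if -c + rho < 0.0 else -1.0   # side of zero that contains the minimizer
        x = sign * 1e-3                          # small start on the correct side
        for _ in range(iters):
            s_a = np.sqrt(x * x + a * a + eps)
            s_b = np.sqrt(x * x + b * b + eps)
            grad = q * x - c + rho * sign + 0.5 * tau * x / s_a + 0.5 * tau * x / s_b
            hess = q + 0.5 * tau * (a * a + eps) / s_a**3 + 0.5 * tau * (b * b + eps) / s_b**3
            x_new = x - grad / hess
            if sign * x_new <= 0.0:              # Newton overshot past zero: damp toward zero
                x_new = x / 2.0
            if abs(x_new - x) < 1e-12:
                return x_new
            x = x_new
        return x

    print(solve_p2_scalar(q=2.0, c=1.5, a=0.3, b=0.4, rho=0.2, tau=1.0))   # arbitrary constants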

Algorithm 1 shows the block coordinate descent method in detail. A careful implementation leads to a time complexity of O(KN³) for K iterations and N variables. In our experiments, the algorithm usually converges in K = 10 iterations. Polynomial dependence O(N³) on the number of variables is expected, since no algorithm can be faster than computing the inverse of the sample covariance in the case of an infinite sample.

5 Experimental Results

We test with a synthetic example the ability of the method to recover the ground truth structure from data.

Algorithm 1 Block Coordinate Descent

Input: Σ ⪰ 0, ρ > 0, τ > 0, p ∈ {2, ∞}
Initialize Ω = diag(Σ)^{−1}
for each iteration 1, …, K and each variable 1, …, N do
    Split Ω into W, y, z and Σ into S, u, v as described in eq.(8)
    Update W^{−1} by using the Sherman-Woodbury-Morrison formula (note that when iterating from one variable to the next, only one row and column of W change)
    for each variable 1, …, N − 1 do
        Split W^{−1}, y, u as in eq.(13)
        Solve the (ℓ1, ℓp) regularized quadratic problem in closed form (p = ∞) or by using the Newton-Raphson method (p = 2)
    end for
    Update z ← 1/(v + ρ) + yᵀW^{−1}y
end for
Output: Ω ≻ 0

The model contains N ∈ {50, 100, 200} variables. For each of 50 repetitions, we first select a proportion of "connected" nodes (either 0.2, 0.5, 0.8) from the N variables. The unselected (i.e. "disconnected") nodes do not participate in any edge of the ground truth model. We then generate edges among the connected nodes with a required density (either 0.2, 0.5, 0.8), where each edge weight is generated uniformly at random from {−1, +1}. We ensure positive definiteness of Ω_g by verifying that its minimum eigenvalue is at least 0.1. We then generate a dataset of 50 samples. We model the ratio σc/σd between the standard deviation of connected versus disconnected nodes. In the "high variance confounders" regime, σc/σd = 1, which means that on average connected and disconnected variables have the same standard deviation. In the "low variance confounders" regime, σc/σd = 10, which means that on average the standard deviation of a connected variable is 10 times that of a disconnected variable. Variables with low variance produce higher values in the precision matrix than variables with high variance. We analyze both regimes in order to evaluate the impact of this effect on structure recovery.
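A sketch of this kind of ground-truth generator (our addition; the diagonal shift used to reach the minimum-eigenvalue target is a simplification of ours, and the σc/σd manipulation is omitted):

    import numpy as np

    def make_ground_truth(N, connect_frac, density, min_eig=0.1, seed=0):
        """Random precision matrix in which only a fraction of 'connected' nodes
        carry edges; the remaining nodes are disconnected."""
        rng = np.random.default_rng(seed)
        connected = rng.choice(N, size=int(round(connect_frac * N)), replace=False)
        Omega = np.zeros((N, N))
        for idx, i in enumerate(connected):
            for j in connected[idx + 1:]:
                if rng.random() < density:
                    Omega[i, j] = Omega[j, i] = rng.choice([-1.0, 1.0])   # edge weights from {-1, +1}
        # Shift the diagonal so that the minimum eigenvalue is exactly min_eig (>= 0.1).
        Omega += (min_eig - np.linalg.eigvalsh(Omega).min()) * np.eye(N)
        return Omega

    Omega_g = make_ground_truth(N=50, connect_frac=0.8, density=0.5)
    data = np.random.default_rng(1).multivariate_normal(
        np.zeros(50), np.linalg.inv(Omega_g), size=50)    # dataset of 50 samples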

In order to measure the closeness of the recovered models to the ground truth, we measured the Kullback-Leibler (KL) divergence, sensitivity (one minus the fraction of falsely excluded edges) and specificity (one minus the fraction of falsely included edges). We compare to the following methods: covariance selection [2], graphical lasso [3], the Meinshausen-Buhlmann approximation [1] and Tikhonov regularization. For our method, we found that the variable selection parameter τ = 50ρ provides reasonable results in both synthetic and real-world experiments. Therefore, we report results only with respect to the sparseness parameter ρ.

First, we test the performance of our methods for an increasing number of variables, moderate edge density (0.5) and a high proportion of connected nodes (0.8).
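For reference, the KL divergence and the edge-recovery scores used here can be computed as follows (our addition, using the standard closed form for zero-mean Gaussians; the support threshold tol is an assumption):

    import numpy as np

    def gaussian_kl(Omega_true, Omega_est):
        """KL( N(0, Omega_true^{-1}) || N(0, Omega_est^{-1}) )."""
        N = Omega_true.shape[0]
        trace_term = np.trace(Omega_est @ np.linalg.inv(Omega_true))
        _, logdet_true = np.linalg.slogdet(Omega_true)
        _, logdet_est = np.linalg.slogdet(Omega_est)
        return 0.5 * (trace_term - N + logdet_true - logdet_est)

    def edge_scores(Omega_true, Omega_est, tol=1e-8):
        """Sensitivity and specificity of the recovered off-diagonal support."""
        off = ~np.eye(Omega_true.shape[0], dtype=bool)
        true_e = (np.abs(Omega_true) > tol) & off
        est_e = (np.abs(Omega_est) > tol) & off
        tp, fn = np.sum(true_e & est_e), np.sum(true_e & ~est_e)
        tn, fp = np.sum(~true_e & ~est_e & off), np.sum(~true_e & est_e & off)
        sensitivity = tp / max(tp + fn, 1)
        specificity = tn / max(tn + fp, 1)
        return sensitivity, specificity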

Figure 1: ROC curves (first row) and KL divergence (second row) for the "high variance confounders" regime. ROC curves (third row) and KL divergence (fourth row) for the "low variance confounders" regime. Left: N = 50 variables, center: N = 100 variables, right: N = 200 variables (connectedness 0.8, edge density 0.5). Our proposed methods ℓ1,2 (L2) and ℓ1,∞ (LI) recover edges better and produce better probability distributions than Meinshausen-Buhlmann with AND-rule (MA), OR-rule (MO), graphical lasso (GL), covariance selection (CS) and Tikhonov regularization (TR). Our methods degrade less in recovering the ground truth edges when the number of variables grows.

Fig. 1 shows the ROC curves and KL divergence between the recovered models and the ground truth. In both the "low" and "high variance confounders" regimes, our ℓ1,2 and ℓ1,∞ methods recover the ground truth edges better (higher ROC curves) and produce better probability distributions (lower KL divergence) than the competing methods. Our methods also degrade less than competing methods in recovering the ground truth edges when the number of variables grows, while the KL divergence behavior remains similar.



Figure 2: Cross-validated KL divergence for structures learnt for the "low variance confounders" regime (N = 50 variables, different connectedness and density levels). Our proposed methods ℓ1,2 (L2) and ℓ1,∞ (LI) produce better probability distributions than Meinshausen-Buhlmann with AND-rule (MA), OR-rule (MO), graphical lasso (GL), covariance selection (CS) and Tikhonov regularization (TR).

Second, we test the performance of our methods with respect to edge density and the proportion of connected nodes. Fig. 2 shows the KL divergence between the recovered models and the ground truth for the "low variance confounders" regime. Our ℓ1,2 and ℓ1,∞ methods produce better probability distributions (lower KL divergence) than the remaining techniques. (Please see Appendix B for results on ROC and the "high variance confounders" regime.)

Our ℓ1,2 method takes 0.07s for N = 100 and 0.12s for N = 200 variables. Our ℓ1,∞ method takes 0.13s for N = 100 and 0.63s for N = 200. Graphical lasso [3], the fastest and most accurate competing method in our evaluation, takes 0.11s for N = 100 and 0.49s for N = 200. Our ℓ1,∞ method is slightly slower than graphical lasso, while our ℓ1,2 method is the fastest. One reason for this is that Lemma 5 eliminates more variables in the ℓ1,2 setting.

For experimental validation on real-world datasets, we use datasets with a diverse nature of probabilistic relationships: brain fMRI, gene expression, NASDAQ stock prices and world weather. The brain fMRI dataset collected by [35] captures brain function of 15 cocaine addicted and 11 control subjects under conditions of monetary reward. Each subject contains 87 scans of 53 × 63 × 46 voxels each, taken every 3.5 seconds. Registration to a common spatial template and spatial smoothing was done in SPM2 (http://www.fil.ion.ucl.ac.uk/spm/). After sampling each 4 × 4 × 4 voxels, we obtained 869 variables. The gene expression dataset contains 8,565 variables and 587 samples. The dataset was collected by [36] from drug-treated rat livers, by treating rats with a variety of fibrate, statin, or estrogen receptor agonist compounds. The dataset is publicly available at http://www.ebi.ac.uk/. In order to consider the full set of genes, we had to impute a very small percentage (0.90%) of missing values by randomly generating values with the same mean and standard deviation. The NASDAQ stocks dataset contains daily opening and closing prices for 2,749 stocks from Apr 19, 2010 to Apr 18, 2011 (257 days). The dataset was downloaded from http://www.google.com/finance. For our experiments, we computed the percentage of change between the closing and opening prices. The world weather dataset contains monthly measurements of temperature, precipitation, vapor, cloud cover, wet days and frost days from Jan 1990 to Dec 2002 (156 months) on a 2.5 × 2.5 degree grid that covers the entire world. The dataset is publicly available at http://www.cru.uea.ac.uk/. After sampling each 5 × 5 degrees, we obtained 4,146 variables. For our experiments, we computed the change between each month and the month in the previous year.

For all the datasets, we used one third of the data for training, one third for validation and the remaining third for testing. Since the brain fMRI dataset has a very small number of subjects, we performed six repetitions by making each third of the data take turns as training, validation and testing sets. In our evaluation, we included scale free networks [15]. We did not include the covariance selection method [2] since we found it to be extremely slow for these high-dimensional datasets. We report the negative log-likelihood on the testing set in Fig. 3 (we subtracted the entropy measured on the testing set and then scaled the results for visualization purposes). We can observe that the log-likelihood of our method is remarkably better than that of the other techniques for all the datasets.

Regarding comparison to group sparse methods: in our previous experiments we did not include block structure for known block-variable assignments [11, 12], since our synthetic and real-world datasets lack such assignments. We did not include block structure for unknown assignments [13, 14] given their time complexity ([14] has an O(N^5)-time Gibbs sampler step for N variables and is applied for N = 60 only, while [13] has an O(N^4)-time ridge regression step). Instead, we evaluated our method on the baker's yeast gene expression dataset in [11], which contains 677 variables and 173 samples.


Figure 3: Test negative log-likelihood of structures learnt for (a) addicted subjects and (b) control subjects in the brain fMRI dataset, (c) gene expression, (d) NASDAQ stocks and (e) world weather. Our proposed methods ℓ1,2 (L2) and ℓ1,∞ (LI) outperform Meinshausen-Buhlmann with AND-rule (MA), OR-rule (MO), graphical lasso (GL), Tikhonov regularization (TR) and scale free networks (SF).

We used the experimental settings of Fig. 3 in [13]. For learning one structure, [13] took 5 hours while our ℓ1,2 method took only 50 seconds. Our method outperforms block structures for both known and unknown assignments: the log-likelihood is 0 for Tikhonov regularization, 6 for [11, 13], 8 for [12], and 22 for our ℓ1,2 method.

We show the structures learnt for cocaine addicted and control subjects in Fig. 4, for our ℓ1,2 method and graphical lasso [3]. The disconnected variables are not shown. Note that our structures involve remarkably fewer connected variables but yield a higher log-likelihood than graphical lasso (Fig. 3), which suggests that the edges discarded from the disconnected nodes are not important for accurate modeling of this dataset. Moreover, removal of a large number of nuisance variables (voxels) results in a more interpretable model, clearly demonstrating brain areas involved in structural model differences that discriminate cocaine addicted from control subjects. Note that graphical lasso (bottom of Fig. 4) connects most of the brain voxels in both populations, making them impossible to compare. Our approach produces more "localized" networks (top of Fig. 4) involving a relatively small number of brain areas: cocaine addicted subjects show increased interactions between the visual cortex (back of the brain, on the left in the image) and the prefrontal cortex (front of the brain, on the right in the image), and at the same time a decreased density of interactions between the visual cortex and other brain areas (more clearly present in control subjects). The alteration in this pathway in the addict group is highly significant from a neuroscientific perspective. First, the trigger for reward was a visual stimulus. Abnormalities in the visual cortex were reported in [37] when comparing cocaine abusers to control subjects. Second, the prefrontal cortex is involved in higher-order cognitive functions such as decision making and reward processing.


Figure 4: Structures learnt for cocaine addicted (left) and control subjects (right), for our ℓ1,2 method (top) and graphical lasso (bottom). Regularization parameter ρ = 1/16. Positive interactions are shown in blue, negative interactions in red. Our structures are sparser (density 0.0016) than graphical lasso (density 0.023), where the number of edges in a complete graph is ≈ 378,000.

Abnormal monetary processing in the prefrontal cortex was reported in [38] when comparing cocaine addicted individuals to controls. Although a more careful interpretation of the observed results remains to be done in the near future, these results are encouraging and lend themselves to specific neuroscientific hypothesis testing.

In a different evaluation, we used the generatively learnt structures for a classification task. We performed a five-fold cross-validation on the subjects. From the subjects in the training set, we learned one structure for cocaine addicted and one structure for control subjects. Then, we assigned a test subject to the structure that gave the highest probability for his data. All methods in our evaluation except Tikhonov regularization obtained 84.6% accuracy; Tikhonov regularization obtained 65.4% accuracy. Therefore, our method produces structures that retain discriminability with respect to standard sparseness promoting techniques.
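A sketch of that generative classification rule (our addition; centering and any scaling of the data are assumptions, since the paper does not detail them):

    import numpy as np

    def avg_gaussian_loglik(X, Omega):
        """Average log-likelihood of the rows of X under N(0, Omega^{-1}),
        with the data assumed centered."""
        N = Omega.shape[0]
        _, logdet = np.linalg.slogdet(Omega)
        quad = np.einsum('ij,jk,ik->i', X, Omega, X)     # x_i^T Omega x_i for every sample
        return 0.5 * np.mean(logdet - quad - N * np.log(2 * np.pi))

    def classify_subject(X_test, Omega_addicted, Omega_control):
        """Assign the subject to the structure that explains his data best."""
        if avg_gaussian_loglik(X_test, Omega_addicted) > avg_gaussian_loglik(X_test, Omega_control):
            return 'cocaine addicted'
        return 'control'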

6 Conclusions and Future Work

In this paper, we presented variable selection in the context of learning sparse Gaussian graphical models by adding an ℓ1,p-norm regularization term, for p ∈ {2, ∞}. We presented a block coordinate descent method which yields sparse and positive definite estimates. We solved the original problem by efficiently solving a sequence of strictly convex (ℓ1, ℓp) regularized quadratic minimization subproblems.

The motivation behind this work was to incorporate variable selection into structure learning of sparse Markov networks, and specifically Gaussian graphical models. Besides providing a better regularizer (as observed on several real-world datasets: brain fMRI, gene expression, NASDAQ stock prices and world weather), key advantages of our approach include more accurate structure recovery in the presence of multiple noisy variables (as demonstrated by simulations), and significantly better interpretability with the same discriminability of the resulting network in practical applications (as shown for brain fMRI analysis).

There are several ways to extend this research. In practice, our technique converges in a small number of iterations, but an analysis of the convergence rate needs to be performed. Consistency when the number of samples grows to infinity also needs to be proved.

Acknowledgments

We thank Rita Goldstein for providing us the fMRI dataset. This work was supported in part by NIH Grants 1 R01 DA020949 and 1 R01 EB007530.

References

[1] N. Meinshausen and P. Buhlmann. High dimensional graphs and variable selection with the lasso. The Annals of Statistics, 2006.

[2] O. Banerjee, L. El Ghaoui, A. d'Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. ICML, 2006.

[3] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2007.

[4] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 2007.

[5] S. Lauritzen. Graphical Models. Oxford Press, 1996.

[6] A. Dempster. Covariance selection. Biometrics, 1972.

[7] E. Levina, A. Rothman, and J. Zhu. Sparse estimation of large covariance matrices via a nested lasso penalty. The Annals of Applied Statistics, 2008.

[8] J. Honorio, L. Ortiz, D. Samaras, N. Paragios, and R. Goldstein. Sparse and locally constant Gaussian graphical models. NIPS, 2009.

[9] J. Honorio and D. Samaras. Multi-task learning of Gaussian graphical models. ICML, 2010.

[10] B. Zhang and Y. Wang. Learning structural changes of Gaussian graphical models in controlled experiments. UAI, 2010.

[11] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. UAI, 2008.

[12] M. Schmidt, E. van den Berg, M. Friedlander, and K. Murphy. Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. AISTATS, 2009.

[13] B. Marlin and K. Murphy. Sparse Gaussian graphical models with unknown block structure. ICML, 2009.

[14] B. Marlin, M. Schmidt, and K. Murphy. Group sparse priors for covariance estimation. UAI, 2009.

[15] Q. Liu and A. Ihler. Learning scale free networks by reweighted ℓ1 regularization. AISTATS, 2011.

[16] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 1996.

[17] A. Chan, N. Vasconcelos, and G. Lanckriet. Direct convex relaxations of sparse SVM. ICML, 2007.

[18] S. Lee, H. Lee, P. Abbeel, and A. Ng. Efficient ℓ1 regularized logistic regression. AAAI, 2006.

[19] J. Duchi and Y. Singer. Boosting with structural sparsity. ICML, 2009.

[20] R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M. Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. ICML, 2008.

[21] S. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of Markov networks using ℓ1-regularization. NIPS, 2006.

[22] M. Schmidt, A. Niculescu-Mizil, and K. Murphy. Learning graphical model structure using ℓ1-regularization paths. AAAI, 2007.

[23] M. Schmidt, K. Murphy, G. Fung, and R. Rosales. Structure learning in random fields for heart motion abnormality detection. CVPR, 2008.

[24] M. Wainwright, P. Ravikumar, and J. Lafferty. High dimensional graphical model selection using ℓ1-regularized logistic regression. NIPS, 2006.

[25] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, 2006.

[26] L. Meier, S. van de Geer, and P. Buhlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, 2008.

[27] G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 2010.

[28] B. Turlach, W. Venables, and S. Wright. Simultaneous variable selection. Technometrics, 2005.

[29] J. Tropp. Algorithms for simultaneous sparse approximation, part II: convex relaxation. Signal Processing, 2006.

[30] J. Friedman, T. Hastie, and R. Tibshirani. Applications of the lasso and grouped lasso to the estimation of sparse graphical models. Technical Report, Stanford University, 2010.

[31] J. Duchi and Y. Singer. Efficient learning using forward-backward splitting. NIPS, 2009.

[32] A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for ℓ1,∞ regularization. ICML, 2009.

[33] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing. Smoothing proximal gradient method for general structured sparse learning. UAI, 2011.

[34] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. NIPS, 2010.

[35] R. Goldstein, D. Tomasi, N. Alia-Klein, L. Zhang, F. Telang, and N. Volkow. The effect of practice on a sustained attention task in cocaine abusers. NeuroImage, 2007.

[36] G. Natsoulis, L. El Ghaoui, G. Lanckriet, A. Tolley, F. Leroy, S. Dunlea, B. Eynon, C. Pearson, S. Tugendreich, and K. Jarnagin. Classification of a large microarray data set: algorithm comparison and analysis of drug signatures. Genome Research.

[37] J. Lee, F. Telang, C. Springer, and N. Volkow. Abnormal brain activation to visual stimulation in cocaine abusers. Life Sciences, 2003.

[38] R. Goldstein, N. Alia-Klein, D. Tomasi, J. Honorio, T. Maloney, P. Woicik, R. Wang, F. Telang, and N. Volkow. Anterior cingulate cortex hypoactivations to an emotionally salient task in cocaine addiction. Proceedings of the National Academy of Sciences, USA, 2009.
