
Statistical Inference for Matrix-variate Gaussian

Graphical Models and False Discovery Rate Control

Xi Chen∗ and Weidong Liu †

Abstract

Matrix-variate Gaussian graphical models (GGM) have been widely used for modelling matrix-variate data. Since the supports of the sparse row and column precision matrices encode the conditional independence among rows and columns of the data, it is of great interest to conduct support recovery. A commonly used approach is the penalized log-likelihood method. However, due to the complicated structure of the precision matrices of matrix-variate GGMs, the log-likelihood is non-convex, which brings a great challenge for both computation and theoretical analysis. In this paper, we propose an alternative approach by formulating the support recovery problem as a multiple testing problem. A new test statistic is proposed and, based on it, we further develop a method to control the false discovery rate (FDR) asymptotically. Our method is computationally attractive since it only involves convex optimization. Theoretically, our method allows very weak conditions: even when the sample size is a constant and the dimensions go to infinity, the asymptotic normality of the test statistics and FDR control can still be guaranteed. The finite sample performance of the proposed method is illustrated by both simulated and real data analysis.

∗Stern School of Business, New York University. Email: [email protected].
†Department of Mathematics, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University. Research supported by NSFC Grants No. 11322107 and No. 11431006, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning, the Shanghai Shuguang Program, and the 973 Program (2015CB856004). Email: [email protected].

arXiv:1509.05453v1 [stat.ME] 17 Sep 2015


1 Introduction

In the era of big data, matrix-variate observations are becoming prevalent in a wide range

of domains such as biomedical imaging, genomics, financial markets, spatio-temporal en-

vironmental data analysis, etc. A typical example is the gene expression data in genomics,

where each observation contains expression levels of p genes on q microarrays of the same

subject (see, e.g., Efron [2009], Yin and Li [2012]). Another example of such data is

multi-channel electroencephalography (EEG) data for brain imaging studies (see, e.g.,

Bijma et al. [2005]), where each measurement can be expressed as a matrix with rows

corresponding to p different channels and columns to q time points. Leng and Tang [2012]

provided more interesting examples of matrix data analysis, such as the volatility of equity
options. Due to the prevalence of matrix-variate observations (especially, high-dimensional

observations), it is important for us to understand the structural information encoded in

these observations.

To study matrix-variate data where each observation X is a p×q matrix, it is commonly

assumed that X follows a matrix-variate normal distribution, e.g., [Efron, 2009, Allen and

Tibshirani, 2010, Yin and Li, 2012, Leng and Tang, 2012, Zhou, 2014]. The matrix-variate

normal distribution is a generalization of the familiar multivariate normal distribution for

vector data. In particular, let vec(X) ∈ Rpq×1 be the vectorization of matrix X obtained

by stacking the columns of X on top of each other. We say X follows a matrix-variate

normal distribution with mean matrix µ ∈ Rp×q, row covariance matrix Σ ∈ Rp×p, column

covariance matrix Ψ ∈ Rq×q if and only if vec(X′) ∼ N(vec(µ′),Σ⊗Ψ), where X′ denotes

the transpose of X and ⊗ is the Kronecker product. Or more succinctly,

X ∼ Np,q(µ,Σ⊗Ψ) if and only if vec(X′) ∼ N(vec(µ′),Σ⊗Ψ). (1)

We note that the subscript p, q in Np,q(µ,Σ⊗Ψ) will be omitted for notational simplicity

when the dimensionality is clear from the context. The readers might refer to Dawid

[1981], Gupta and Nagar [1999] for more properties of matrix-variate normal distribution.

Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional

independence among variables is encoded in the precision matrix, one can analogously

define matrix-variate Gaussian graphical models (MGGM) (a.k.a. Gaussian bigraphical

model). In particular, let Ω = Σ^{−1} = (ω_{ij})_{p×p} and Γ = Ψ^{−1} = (γ_{ij})_{q×q}; the positions of zeros in the precision matrix Ω ⊗ Γ = (Σ ⊗ Ψ)^{−1} represent the conditional independence among entries of X. More specifically, X_{ij} and X_{kl} are conditionally independent given the remaining entries if and only if (iff) at least one of ω_{ik} and γ_{jl} is zero. As a consequence, let X_{i,·} and X_{·,j} denote the i-th row and j-th column of X, respectively. For any pair i, k ∈ [p] := {1, . . . , p} and any pair j, l ∈ [q] := {1, . . . , q}, we have

X_{i,·} ⊥⊥ X_{k,·} | X_{[p]\{i,k},·}   iff   ω_{ik} = 0;      X_{·,j} ⊥⊥ X_{·,l} | X_{·,[q]\{j,l}}   iff   γ_{jl} = 0,

where ⊥⊥ denotes conditional independence.
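To make the Kronecker-product structure in (1) concrete, a minimal numpy sketch for drawing a single observation is given below; the example covariance matrices, dimensions, and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# A minimal sketch of the matrix-variate normal definition in (1): draw one
# observation X ~ N_{p,q}(mu, Sigma ⊗ Psi) by sampling vec(X') and reshaping.
rng = np.random.default_rng(0)
p, q = 4, 3
mu = np.zeros((p, q))
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # row covariance (p x p)
Psi = 0.3 ** np.abs(np.subtract.outer(np.arange(q), np.arange(q)))    # column covariance (q x q)
# vec(X') stacks the rows of X, so it equals the row-major flattening of X
vec_Xt = rng.multivariate_normal(mu.flatten(), np.kron(Sigma, Psi))
X = vec_Xt.reshape(p, q)
# Zeros in Omega = Sigma^{-1} (resp. Gamma = Psi^{-1}) encode conditional
# independence between rows (resp. columns) of X, as stated above.
Omega, Gamma = np.linalg.inv(Sigma), np.linalg.inv(Psi)
```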

When p = 1 or q = 1, MGGMs reduce to vector-variate GGMs, which have been

extensively studied in the high-dimensional setting over the past ten years. Here, we

only briefly survey a few related works. The major research problems in high-dimensional

GGMs include the estimation and support recovery of precision matrices, which have been

investigated by Meinshausen and Buhlmann [2006], Yuan and Lin [2007], Rothman et al.

[2008a], Yuan [2010], d’Aspremont et al. [2008], Friedman et al. [2008], Ravikumar et al.

[2011], Cai et al. [2011], etc. Liu et al. [2012] and Xue and Zou [2012] further extended

GGMs to a more general class of multivariate distributions, i.e., semiparametric Gaussian

copula graphical models. Recently, Liu [2013], Ren et al. [2015], Jankova and van de

Geer [2015] developed various approaches to address the statistical inference problem in high-

dimensional GGMs and established asymptotic normality for elements of the precision

matrix. In particular, Liu [2013] investigated the false discovery rate (FDR) control of

GGMs.

In contrast to the large body of literature on vector-variate GGMs, work on matrix-variate GGMs (MGGMs) is much scarcer, since estimating precision matrices in MGGMs

is significantly more challenging. Efron [2009] investigated the testing problem on the

row or column independence of X, i.e., testing whether Ω or Γ is a diagonal matrix. For

general sparse precision matrices, one straightforward estimation approach is based on

the penalized likelihood method. However, due to the complicated structure of precision

matrix in the form of Kronecker product, the negative log-likelihood function is no longer

convex, which not only makes the computation more difficult but also makes the development of
theoretical properties of the resulting estimator significantly more challenging. A few re-

cent works [Allen and Tibshirani, 2010, Leng and Tang, 2012, Yin and Li, 2012, Kalaitzis

et al., 2013, Tsiligkaridis et al., 2013, Ying and Liu, 2013, Huang and Chen, 2015] have

been devoted to developing various penalized likelihood approaches for estimating MG-

GMs or extensions of MGGMs (e.g., multiple MGGMs in Huang and Chen [2015] and


semiparametric extension in Ying and Liu [2013]). Some of these works also provide the-

oretical guarantees on the estimated precision matrices, e.g., rate of convergence under

Frobenius norm and sparsistency. However, all these results are stated in the sense that

there exists a local minimizer that enjoys certain good properties. In practice, one would

never know whether the local minimizer obtained from an optimization solver is such

a desired local minimizer. Most convergence results also require certain conditions on the

sample size n and dimensionality p and q, e.g. p and q cannot differ too much from each

other and n should go to infinity at certain rate (we will show later that n → ∞ is not

necessary at all for estimation in MGGMs). Zhou [2014] developed new penalized meth-

ods for estimating Σ ⊗ Ψ and Ω ⊗ Γ and established the convergence rate under both

the spectral norm and Frobenius norm. However, it fails to provide theoretical results on

the support recovery of Ω ⊗ Γ and requires that µ = 0, which potentially limits its

applications (note that Zhou [2014] considered the case that n = 1. For only one sample,

µ cannot be estimated without any structural assumption and thus the data cannot be

centralized).

The main goal of this paper is to infer the support of the precision matrix of an MGGM

in the high-dimensional setup, i.e., supp(Ω ⊗ Γ), which characterizes the conditional

independence relationship. Our method is rather different from the common approaches

that formulate the problem into a joint optimization problem as in penalized likelihood

methods. Instead, we construct large-scale multiple testing problems for Ω and Γ:

H^Ω_{0ij}: ω_{ij} = 0   vs   H^Ω_{1ij}: ω_{ij} ≠ 0,   1 ≤ i < j ≤ p,  (2)

and

H^Γ_{0ij}: γ_{ij} = 0   vs   H^Γ_{1ij}: γ_{ij} ≠ 0,   1 ≤ i < j ≤ q.  (3)

Given such a multiple testing formulation, we propose a new approach to asymptotically

control the false discovery rate (FDR). In many applications of GGMs, FDR has been

considered as one of the most important criteria for the performance measure (e.g., in

the application of genetic data analysis [Schafer and Strimmer, 2005, Ma et al., 2007]).

Our approach is the first to investigate FDR control in MGGMs and possesses several

advantages as summarized below.

(1) As compared to existing penalized likelihood methods whose objective functions are

non-convex, our approach is completely based on convex optimization and thus com-


putationally more attractive. In particular, the main computational cost of our ap-

proach is the construction of initial estimates of Ω and Γ, which will directly lead

to our test statistics. We propose to adopt a neighborhood-based Lasso approach to
separately estimate Ω and Γ. The corresponding p + q Lasso problems are completely
independent, which allows an efficient parallel implementation.

(2) Our approach only requires the sample size n ≥ 2, which is a very attractive property

from both theoretical and practical perspectives. More specifically, even for the case

that n is a constant and p → ∞ and q → ∞, our method still guarantees the

asymptotic normality of the test statistics and FDR control. We note that this is

fundamentally different from the case of vector-variate GGMs, which always requires

n→∞ for estimation consistency and support recovery. Roughly speaking, the main

reason that our method works for the constant sample size is that we fully utilize

the correlation structure among rows and columns of X: when constructing the initial
estimate of Ω (or Γ), all the columns (or rows) of each matrix-variate observation

are treated as correlated samples. To the best of our knowledge, our approach is the

first on support recovery of MGGMs that allows constant sample size; while previous

works all require that n → ∞ for the support recovery (see, e.g., [Yin and Li, 2012,

Leng and Tang, 2012]).

(3) Our approach allows a wider range of p and q as compared to existing literature. In

particular, our result on FDR control holds for (p, q) such that q^{r_2} ≤ p ≤ q^{r_1} for
some 0 < r_2 ≤ r_1, while in comparison, Leng and Tang [2012] required p log(p) = o(nq)
and q log(q) = o(np). As one can see, if n is a constant, or n → ∞ but
n = o(min(log p, log q)), such a condition does not hold.

Our approach can be directly applied to control FDR for correlated vector-variate

GGM data (see Section 4). From this point of view, this work generalizes the method in

Liu [2013], which only deals with i.i.d. vector-variate GGM data. Such a generalization

from i.i.d. vector-variate GGM to MGGMs (as well as correlated vector-variate GGM)

requires several important technical improvements over Liu [2013] as summarized below:

(1) In Proposition 3.1 of Liu [2013], to establish the asymptotic normality for vector-

variate GGMs, the constructed test statistic is based on a bias correction of the
sample covariance of the residuals from the Lasso. For our problem, the bias correction alone
is not sufficient. We propose a variance correction technique (combined with the bias

correction) to estimate the precision matrices for MGGMs or correlated vector-variate

GGMs.

(2) When constructing the initial estimate of Ω (or Γ), we adopt Lasso by treating each

column (or row) of every matrix-variate observation as a single sample and thus the

samples are not i.i.d.; while the samples are i.i.d. in Liu [2013] for estimating precision

matrices in vector-variate GGMs. Therefore, one of the main technical challenges is

to establish the consistency result for Lasso with correlated samples, which itself is

of independent interest for high-dimensional linear regression.

(3) Since our results on MGGMs allow the sample size n to be a small constant, as

compared to the requirement on n → ∞ in Liu [2013], all the convergence analysis

needs to be carefully tailored.

The differences between this work and Liu [2013], from both methodological and technical perspectives, will become clearer as we explain our method and analysis later in the paper.

Finally, we introduce some necessary notation. Let X^{(k)} = (X^{(k)}_{ij})_{p×q} for k = 1, . . . , n be the n i.i.d. matrix-variate observations from N_{p,q}(µ, Σ ⊗ Ψ) and let X̄ = (1/n) ∑_{k=1}^{n} X^{(k)}. Put Σ = (σ_{ij}) and Ψ = (ψ_{ij}). For any vector x = (x_1, . . . , x_p)′, let x_{−i} denote the (p − 1)-dimensional vector obtained by removing x_i from x. For any p × q matrix A, let A_{i,−j} denote the i-th row of A with its j-th entry removed and A_{−i,j} denote the j-th column of A with its i-th entry removed; A_{−i,−j} denotes the (p − 1) × (q − 1) matrix obtained by removing the i-th row and j-th column of A. Define [n] := {1, . . . , n}, [p] := {1, . . . , p} and [q] := {1, . . . , q}. For a p-dimensional vector x, let |x|_0 = ∑_{j=1}^{p} I(x_j ≠ 0), |x|_1 = ∑_{j=1}^{p} |x_j| and |x|_2 = (∑_{j=1}^{p} x_j^2)^{1/2} be the ℓ_0, ℓ_1 and Euclidean norms of x, respectively. For a matrix A = (a_{ij}) ∈ R^{p×q}, let ‖A‖_F = (∑_{i∈[p], j∈[q]} a_{ij}^2)^{1/2} be the Frobenius norm of A, |A|_∞ = max_{i∈[p], j∈[q]} |a_{ij}| be the element-wise ℓ_∞-norm of A, and ‖A‖_2 = sup_{|x|_2 ≤ 1} |Ax|_2 be the spectral norm of A. For a square matrix A, let tr(A) denote the trace of A. For a given set H, let Card(H) be the cardinality of H. Throughout the paper, we use I_p to denote the p × p identity matrix, and use C, c, etc. to denote generic constants whose values may change from place to place.

The rest of the paper is organized as follows. In Section 2, we introduce our test

statistics using both bias and variance corrections and then describe the FDR control
procedure for MGGM estimation. Theoretical results on the asymptotic normality of our

test statistic and FDR control are given in Section 3. The proposed method can also be

applied to estimate vector-variate GGMs with correlated samples as described in Section

4. Simulation studies and real data analysis are given in Sections 5 and 6, respectively. All technical proofs are relegated to the appendix.

2 Methodology

Our approach is to estimate the supports of Ω and Γ separately under the control of the false discovery rate (FDR)/false discovery proportion (FDP), and then combine them to obtain an estimate of the support of Ω ⊗ Γ. In fact, if ŝupp(Ω) and ŝupp(Γ) are estimators of supp(Ω) and supp(Γ), respectively, each under FDP control at level α, it is clear that the FDP of ŝupp(Ω) ⊗ ŝupp(Γ) as an estimator of supp(Ω ⊗ Γ) is controlled at the level α′, where α′ given in (30) is a monotonically increasing function of α ∈ [0, 1]. Therefore, our main task is to design an estimator of supp(Γ) under the FDP level α; the estimator of supp(Ω) is completely symmetric. We propose to estimate supp(Γ) by implementing the following multiple tests

H_{0ij}: γ_{ij} = 0   vs   H_{1ij}: γ_{ij} ≠ 0,   1 ≤ i < j ≤ q.  (4)

We note that we omit the superscript Γ on H in (3) for notational simplicity.

2.1 Construction of Test Statistics

In this section, we propose our test statistic for each H_{0ij} in (4). Our test statistic relies on an initial estimator of Γ that is described as follows. By the conditional distribution of MGGMs (see Theorem 2 in Allen and Tibshirani [2010]), for every row 1 ≤ l ≤ p and column 1 ≤ j ≤ q, we have

X_{l,j} | X_{l,−j} ∼ N( µ_{l,j} − (X_{l,−j} − µ_{l,−j}) γ_{jj}^{−1} Γ_{−j,j},  γ_{jj}^{−1} σ_{ll} ),

which can be equivalently written as a linear model,

X_{lj} = α_{lj} + X_{l,−j} β_j + ε_{lj},  (5)

where β_j = −γ_{jj}^{−1} Γ_{−j,j} does not depend on l and α_{lj} = µ_{l,j} + γ_{jj}^{−1} µ_{l,−j} Γ_{−j,j}. The error terms ε_{lj} satisfy E(ε_{lj}) = 0 for any l ∈ [p] and j ∈ [q], and Corr(ε_{li}, ε_{lj}) = γ_{ij}/√(γ_{ii}γ_{jj}) for any l ∈ [p] and 1 ≤ i < j ≤ q (see Lemma 1 of Peng et al. [2009]).
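As a quick illustration of this regression representation, the population coefficient vector β_j can be read off directly from Γ; the toy matrix below is illustrative and not from the paper.

```python
import numpy as np

# Numerical check of (5): the population coefficients are beta_j = -Gamma_{-j,j} / gamma_{jj},
# so zero entries in column j of Gamma translate into zero regression coefficients,
# which is exactly what the multiple tests in (4) probe.
Gamma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 2.0, 0.3],
                  [0.0, 0.3, 2.0]])
j = 1
beta_j = -np.delete(Gamma[:, j], j) / Gamma[j, j]
print(beta_j)   # [-0.25, -0.15]
```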

Let β̂_j = (β̂_{1,j}, . . . , β̂_{q−1,j})′ be an initial estimator of β_j which satisfies

max_{1≤j≤q} |β̂_j − β_j|_1 = O_P(a_{n1})   and   max_{1≤j≤q} |β̂_j − β_j|_2 = O_P(a_{n2}),  (6)

where a_{n1} → 0 and a_{n2} → 0 at rates that will be specified later. In this paper, we adopt the standard Lasso to construct the initial estimator β̂_j for ease of presentation. We also note that the Dantzig selector [Candes and Tao, 2007], the square-root/scaled Lasso [Belloni et al., 2011, Sun and Zhang, 2012], or other sparse regression approaches can also be adopted, provided that (6) is satisfied. More details of the construction of the initial estimator (e.g., tuning parameter selection) will be provided in Section 2.2.

Given the initial estimates {β̂_j}_{j∈[q]}, we define the residual for each regression problem corresponding to (5) by

ε̂^{(k)}_{lj} = X^{(k)}_{lj} − X̄_{lj} − (X^{(k)}_{l,−j} − X̄_{l,−j}) β̂_j,

where X̄_{lj} = (1/n) ∑_{k=1}^{n} X^{(k)}_{lj} and X̄_{l,−j} = (1/n) ∑_{k=1}^{n} X^{(k)}_{l,−j}. Further, let r_{ij} be the sample covariance between the residuals,

r_{ij} = 1/((n−1)p) ∑_{k=1}^{n} ∑_{l=1}^{p} ε̂^{(k)}_{li} ε̂^{(k)}_{lj}.  (7)

In contrast to the case of vector-variate GGMs in Liu [2013], where the test statistic is based only on a bias correction of r_{ij}, our test statistic needs both a bias and a variance correction. Let us first introduce the bias correction of r_{ij}. In particular, for 1 ≤ i < j ≤ q, we introduce the statistic

T_{ij} = r_{ij} + r_{ii} β̂_{i,j} + r_{jj} β̂_{j−1,i} = (1/((n−1)p)) ( ∑_{k=1}^{n} ∑_{l=1}^{p} ε̂^{(k)}_{li} ε̂^{(k)}_{lj} + ∑_{k=1}^{n} ∑_{l=1}^{p} (ε̂^{(k)}_{li})^2 β̂_{i,j} + ∑_{k=1}^{n} ∑_{l=1}^{p} (ε̂^{(k)}_{lj})^2 β̂_{j−1,i} ),  (8)

where β̂_j = (β̂_{1,j}, . . . , β̂_{q−1,j})′. Here we note that the estimated coefficient β̂_{j−1,i} in fact corresponds to X_{·,j}, since the linear model in (5) associated with β_i = (β_{1,i}, . . . , β_{q−1,i})′ takes the following form,

X_{li} = α_{li} + ∑_{k<i} X_{lk} β_{k,i} + ∑_{k>i} X_{lk} β_{k−1,i}.
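For concreteness, a minimal numpy sketch of the residual computation in (7) and the bias-corrected statistic in (8) is given below, assuming the initial estimates β̂_j are already available; the array layout and the helper name bias_corrected_stats are ours, not the paper's.

```python
import numpy as np

def bias_corrected_stats(X, B_hat):
    """Residuals (7) and bias-corrected statistics T_ij of (8).

    A minimal sketch, assuming X has shape (n, p, q) (n i.i.d. matrix
    observations) and B_hat has shape (q, q-1), with B_hat[j] an initial
    estimate of beta_j from the column regressions in (5).
    """
    n, p, q = X.shape
    Xc = X - X.mean(axis=0)                     # centered observations X^{(k)} - Xbar
    eps = np.empty((n, p, q))                   # residuals eps_hat^{(k)}_{lj}
    for j in range(q):
        others = [m for m in range(q) if m != j]
        eps[:, :, j] = Xc[:, :, j] - Xc[:, :, others] @ B_hat[j]
    # r_{ij} in (7): residual covariances pooled over k and l
    r = np.einsum('kli,klj->ij', eps, eps) / ((n - 1) * p)
    # bias-corrected T_{ij} in (8); note the index shift beta_hat_{j-1,i} for column j
    # inside the regression of column i (i < j)
    T = np.zeros((q, q))
    for i in range(q):
        for j in range(i + 1, q):
            b_ij = B_hat[j][i]          # beta_hat_{i,j}
            b_ji = B_hat[i][j - 1]      # beta_hat_{j-1,i}
            T[i, j] = T[j, i] = r[i, j] + r[i, i] * b_ij + r[j, j] * b_ji
    return eps, r, T
```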


For k ∈ [n], l ∈ [p] and j ∈ [q], let us define

ε̃^{(k)}_{lj} = ε^{(k)}_{lj} − (1/n) ∑_{k=1}^{n} ε^{(k)}_{lj} =: ε^{(k)}_{lj} − ε̄_{lj},      σ_{jj,ε} = 1/((n−1)p) ∑_{k=1}^{n} ∑_{l=1}^{p} (ε̃^{(k)}_{lj})^2.

We will show that

T_{ij} = −b_{ij} γ_{ij}/(γ_{ii}γ_{jj}) + [ ∑_{k=1}^{n} ∑_{l=1}^{p} ( ε̃^{(k)}_{li} ε̃^{(k)}_{lj} − E ε̃^{(k)}_{li} ε̃^{(k)}_{lj} ) ] / ((n−1)p) + O_P( (a_{n1}^2 + a_{n1}) √(log max(q, np)/(np)) + a_{n2}^2 ),  (9)

where

b_{ij} = γ_{ii} σ_{ii,ε} + γ_{jj} σ_{jj,ε} − tr(Σ)/p.  (10)

In addition to the bias correction, we also need a variance correction on T_{ij} so that the constructed test statistic has an asymptotic standard normal distribution. To this end, we first establish the following asymptotic normality result for the second term in (9):

[ ∑_{k=1}^{n} ∑_{l=1}^{p} ( ε̃^{(k)}_{li} ε̃^{(k)}_{lj} − E ε̃^{(k)}_{li} ε̃^{(k)}_{lj} ) ] / √( (n−1) ‖Σ‖_F^2 ( (γ_{ii}γ_{jj})^{−1} + γ_{ij}^2 (γ_{ii}γ_{jj})^{−2} ) ) ⇒ N(0, 1),   as np → ∞.  (11)

In (11), the quantity γ_{ii} is unknown. To address this issue, we will show that r_{ii} defined in (7) satisfies

r_{ii} − tr(Σ)/(p γ_{ii}) = O_P( a_{n2}^2 + (a_{n1}^2 + a_{n1}) √(log max(q, np)/(np)) + √(log max(q, np)/(np)) ).  (12)

Using (12), the denominator in (11) under the null H_{0ij}: γ_{ij} = 0, which is essentially the asymptotic standard deviation of (n − 1)p T_{ij} under H_{0ij}, can be approximated by √((n − 1)p r_{ii} r_{jj} A_p), where

A_p = p ‖Σ‖_F^2 / (tr(Σ))^2,  (13)

which plays the key role in the variance correction.

Combining (11) and (12) with (9), under the null H_{0ij}: γ_{ij} = 0 and a certain rate of convergence of a_{n1} and a_{n2} (i.e., a_{n1} = o(1/√(log max(q, np))) and a_{n2} = o((np)^{−1/4})), we have

√((n−1)p) T_{ij} / √(r_{ii} r_{jj} A_p) ⇒ N(0, 1),   as np → ∞.  (14)

We note that, in contrast to the vector-variate GGMs in Liu [2013], due to the more complicated covariance structure of X, this extra variance correction involving Σ (see (13)) is necessary for the FDR/FDP control. Given (14), it is natural to construct the final test statistic as

T̂_{ij} = √( (n−1)p / (r_{ii} r_{jj} Â_p) ) T_{ij},  (15)

where Â_p is an estimator of A_p which satisfies Â_p/A_p → 1 in probability as nq → ∞.

2.1.1 Estimator for Ap

We propose an estimator Â_p of A_p based on a thresholding estimator of Σ. We first construct an initial estimator of Σ based on the nq p-dimensional samples extracted from X. In particular, let X̄ = (1/n) ∑_{k=1}^{n} X^{(k)} and Y_{kj} = X^{(k)}_{·,j} − X̄_{·,j} ∈ R^{p×1} for k ∈ [n] and j ∈ [q], and define

Σ̂ = (σ̂_{ij})_{p×p} := 1/((n−1)q) ∑_{k=1}^{n} ∑_{j=1}^{q} Y_{kj} (Y_{kj})′.

In more succinct notation, letting Y = [X^{(1)} − X̄, . . . , X^{(n)} − X̄] ∈ R^{p×(nq)}, we have Σ̂ = (1/((n−1)q)) Y Y′. Then, we threshold the elements of Σ̂ as follows:

σ̂_{ij,λ} = σ̂_{ij} I{ |σ̂_{ij}| ≥ λ √(log max(p, nq)/(nq)) }   for i ≠ j,  (16)

and σ̂_{ii,λ} = σ̂_{ii} for i ∈ [p]. Set Σ̂_λ = (σ̂_{ij,λ})_{p×p} and define the estimator

Â_p = p ‖Σ̂_λ‖_F^2 / (tr(Σ̂_λ))^2.  (17)

In Proposition 3.2 in Section 3, we show that Â_p/A_p → 1 in probability as nq → ∞ for a properly chosen λ (the choice of λ will be discussed later in Section 2.2). Therefore, by (14), we have the desired asymptotic normality of T̂_{ij} under the null: T̂_{ij} ⇒ N(0, 1) as np → ∞. We note that, when the columns of X are i.i.d. (i.e., Ψ = I_q), consistency under the spectral norm ‖Σ̂_λ − Σ‖_2 has been established under the sparsity condition s_0(p) = O(√(nq)/log p) on Σ (see, e.g., Bickel and Levina [2008], Cai and Liu [2011] and references therein). We do not need such a strong consistency result of Σ̂_λ in spectral norm to establish the consistency of Â_p. In fact, since A_p only involves ‖Σ‖_F rather than Σ itself, the sparsity condition on s_0(p) is no longer necessary; see Proposition 3.2 and its proof for more details. We also note that some other estimators of the Frobenius norm ‖Σ‖_F have been proposed (e.g., by Chen and Qin [2010]). However, those approaches heavily depend on the i.i.d. assumption on the columns of X, which is not valid in our setting. In our framework, the development of the convergence result is technically more challenging since the columns of X are no longer i.i.d., and one needs to carefully deal with the correlation structure among {Y_{kj}}_{k∈[n], j∈[q]} and establish a large deviation result for dependent observations.
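A minimal sketch of the thresholding estimator (16)-(17) and of the variance-corrected statistic (15), reusing the hypothetical helper bias_corrected_stats from the earlier sketch, is given below; the thresholding constant lam is illustrative.

```python
import numpy as np

def variance_correction(X, r, T, lam=2.0):
    """Thresholded estimate of A_p in (16)-(17) and the final statistics (15).

    A minimal sketch, assuming X has shape (n, p, q) and that r, T are the
    outputs of bias_corrected_stats; lam plays the role of the thresholding
    constant lambda, chosen here arbitrarily for illustration.
    """
    n, p, q = X.shape
    Xc = X - X.mean(axis=0)                     # center each observation
    Y = Xc.transpose(1, 0, 2).reshape(p, n * q) # p x (nq) matrix of centered columns
    Sigma_hat = Y @ Y.T / ((n - 1) * q)         # initial estimator of Sigma
    thr = lam * np.sqrt(np.log(max(p, n * q)) / (n * q))
    Sigma_lam = np.where(np.abs(Sigma_hat) >= thr, Sigma_hat, 0.0)  # threshold (16)
    np.fill_diagonal(Sigma_lam, np.diag(Sigma_hat))                 # keep the diagonal
    A_hat = p * np.sum(Sigma_lam ** 2) / np.trace(Sigma_lam) ** 2   # (17)
    # variance-corrected test statistics (15)
    d = np.sqrt(np.outer(np.diag(r), np.diag(r)) * A_hat)
    T_hat = np.sqrt((n - 1) * p) * T / d
    return A_hat, T_hat
```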

2.2 Initial estimator βj and data-driven choice of parameters

In the construction of the test statistic T̂_{ij}, we need the initial estimate β̂_j satisfying the condition in (6). Here, we choose to construct β̂_j by the Lasso for simplicity of presentation; other approaches such as the Dantzig selector and the scaled/square-root Lasso can also be used. In particular, let Z = [X^{(1)′} − X̄′, . . . , X^{(n)′} − X̄′] ∈ R^{q×(np)} and Ψ̂ = (1/((n−1)p)) Z Z′ =: (ψ̂_{ij})_{q×q}. For 1 ≤ j ≤ q, define D_j = diag(Ψ̂_{−j,−j}). The coefficients β_j can be estimated by the Lasso as follows:

β̂_j(δ) = D_j^{−1/2} α̂_j(δ),  (18)

where

α̂_j(δ) = argmin_{α ∈ R^{q−1}} { 1/(2np) ∑_{k=1}^{n} ∑_{l=1}^{p} ( X^{(k)}_{lj} − X̄_{lj} − (X^{(k)}_{l,−j} − X̄_{l,−j}) D_j^{−1/2} α )^2 + λ_{nj}(δ) |α|_1 }  (19)

and

λ_{nj}(δ) = δ √( ψ̂_{jj} log max(q, np) / (np) ).  (20)

We note that, in the Lasso estimation (19), the np samples ( X^{(k)}_{l,−j} − X̄_{l,−j}, X^{(k)}_{lj} − X̄_{lj} ) for k ∈ [n] and l ∈ [p] are not i.i.d., and thus the standard consistency results for the Lasso cannot be applied here. By exploiting the correlation structure among the rows of X, we manage to derive the rate of convergence of the Lasso estimator in the ℓ_1 and ℓ_2 norms. This result and the corresponding proof technique (see Proposition 3.3 and its proof) may be of independent interest for dealing with high-dimensional correlated data.
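A minimal sketch of this neighborhood Lasso step (18)-(20) using scikit-learn is shown below; the helper name initial_estimates and the default value of δ are illustrative, and any sparse regression solver satisfying (6) could be substituted.

```python
import numpy as np
from sklearn.linear_model import Lasso

def initial_estimates(X, delta=2.0):
    """Neighborhood Lasso estimates of beta_j, j = 1..q, as in (18)-(20).

    A minimal sketch, assuming X has shape (n, p, q); delta is the tuning
    constant in (20).
    """
    n, p, q = X.shape
    Xc = (X - X.mean(axis=0)).transpose(2, 0, 1).reshape(q, n * p).T  # (np) x q centered data
    Psi_hat = Xc.T @ Xc / ((n - 1) * p)
    B_hat = np.zeros((q, q - 1))
    for j in range(q):
        others = [m for m in range(q) if m != j]
        d = np.sqrt(np.diag(Psi_hat)[others])            # D_j^{1/2}
        Zj = Xc[:, others] / d                           # standardized predictors
        lam = delta * np.sqrt(Psi_hat[j, j] * np.log(max(q, n * p)) / (n * p))
        # sklearn's Lasso minimizes (1/(2*n_samples))||y - Zb||^2 + alpha*|b|_1,
        # which matches (19) with n_samples = np and alpha = lambda_{nj}(delta)
        fit = Lasso(alpha=lam, fit_intercept=False).fit(Zj, Xc[:, j])
        B_hat[j] = fit.coef_ / d                         # beta_hat_j = D_j^{-1/2} alpha_hat_j
    return B_hat
```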


Although our theoretical results hold for any sufficiently large constants λ in (16) (for estimating A_p) and δ in (20) (for β̂_j(δ)), from a practical perspective it is important to have a data-driven approach for tuning λ and δ. We propose to adopt the data-driven parameter tuning strategy of Liu [2013] (see Eq. (19) therein). In particular, λ and δ are selected by

(λ̂, δ̂) = argmin_{λ, δ} ∑_{k=3}^{9} ( [ ∑_{1≤i≠j≤q} I{ |T̂_{ij}(λ, δ)| ≥ Φ^{−1}(1 − k/20) } ] / [ k(q^2 − q)/10 ] − 1 )^2,  (21)

where T̂_{ij}(λ, δ) is the test statistic in (15) computed with the initial estimator β̂_j(δ) and Â_p. We note that although the parameter search is conducted on a two-dimensional grid over λ and δ, the main computational cost is the construction of β̂_j(δ), which does not depend on λ. Therefore, the computational cost of (21) is moderate.
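A minimal sketch of this grid search, reusing the hypothetical helpers from the earlier sketches, might look as follows; the grids are supplied by the user.

```python
import numpy as np
from scipy.stats import norm
from itertools import product

def tune_parameters(X, lam_grid, delta_grid):
    """Data-driven choice of (lambda, delta) via the criterion (21)."""
    n, p, q = X.shape
    best, best_score = None, np.inf
    for lam, delta in product(lam_grid, delta_grid):
        B_hat = initial_estimates(X, delta=delta)
        _, r, T = bias_corrected_stats(X, B_hat)
        _, T_hat = variance_correction(X, r, T, lam=lam)
        off = np.abs(T_hat[~np.eye(q, dtype=bool)])      # all q^2 - q off-diagonal statistics
        score = sum(
            (np.sum(off >= norm.ppf(1 - k / 20)) / (k * (q ** 2 - q) / 10) - 1) ** 2
            for k in range(3, 10)
        )
        if score < best_score:
            best, best_score = (lam, delta), score
    return best
```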

2.3 FDR control procedure

Given the constructed test statistics T̂_{ij}, we can carry out the (q^2 − q)/2 tests in (4) simultaneously. Let t be the threshold level such that H_{0ij} is rejected if |T̂_{ij}| ≥ t. To control the FDR while obtaining a large power, an ideal choice of t is the smallest t such that the FDP (i.e., the proportion of false discoveries among total rejections) is controlled at a pre-specified level 0 < α < 1. That is,

t_{orc} = inf{ t > 0 :  [ ∑_{(i,j)∈H_0} I{|T̂_{ij}| ≥ t} ] / max( ∑_{1≤i<j≤q} I{|T̂_{ij}| ≥ t}, 1 ) ≤ α },

where H_0 := {(i, j) : γ_{ij} = 0, 1 ≤ i < j ≤ q}; we take the infimum over t so that the test rejects as many hypotheses as possible. We also note that the left hand side of the inequality above is the FDP, whose expectation is the FDR. The oracle t_{orc} cannot be computed since H_0 is unknown. Nevertheless, by the asymptotic normality of T̂_{ij} under the null, we can approximate ∑_{(i,j)∈H_0} I{|T̂_{ij}| ≥ t} by 2(1 − Φ(t))|H_0|, where Φ(t) denotes the CDF of the standard normal distribution. Moreover, since Γ is sparse, |H_0| is close to (q^2 − q)/2. Using such an approximation, we obtain the following procedure to conduct the multiple tests in (4) and estimate supp(Γ):

Let 0 < α < 1 and

t̂ = inf{ t ≥ 0 :  (1 − Φ(t))(q^2 − q) / max( ∑_{1≤i<j≤q} I{|T̂_{ij}| ≥ t}, 1 ) ≤ α }.  (22)

For 1 ≤ i ≠ j ≤ q, we reject H_{0ij} if |T̂_{ij}| ≥ t̂, and the estimated support of Γ is

ŝupp(Γ) = {(i, j) : |T̂_{ij}| ≥ t̂, 1 ≤ i ≠ j ≤ q} ∪ {(i, i) : 1 ≤ i ≤ q}.  (23)

Note that we set T̂_{ji} = T̂_{ij} for 1 ≤ i < j ≤ q in (23).
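A minimal sketch of the thresholding rule (22) and the support estimate (23) follows; the search over the observed values of |T̂_{ij}| is a practical approximation to the infimum in (22).

```python
import numpy as np
from scipy.stats import norm

def estimate_support(T_hat, alpha=0.1):
    """Threshold of (22) and support estimate of (23) from a symmetric q x q T_hat."""
    q = T_hat.shape[0]
    stats = np.abs(T_hat[np.triu_indices(q, k=1)])       # (q^2 - q)/2 upper-triangular statistics
    # practical approximation: search over the observed statistic values (plus 0)
    t_hat = np.inf                                        # default: reject nothing
    for t in np.sort(np.concatenate(([0.0], stats))):
        rejections = max(np.sum(stats >= t), 1)
        if (1 - norm.cdf(t)) * (q ** 2 - q) / rejections <= alpha:
            t_hat = t
            break
    support = (np.abs(T_hat) >= t_hat) | np.eye(q, dtype=bool)   # include the diagonal
    return t_hat, support
```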

Estimation of supp(Ω). The estimation of supp(Ω) can be done with a completely symmetric procedure. In particular, we only need to consider the transposes of the matrix-variate observations, i.e., (X^{(1)})′, . . . , (X^{(n)})′, and swap the necessary notation (e.g., p with q).

Estimation of supp(Ω ⊗ Γ). Let ŝupp(Ω) and ŝupp(Γ) be the estimators of supp(Ω) and supp(Γ), respectively, each under FDR control at level α. The support of Ω ⊗ Γ can then be estimated by ŝupp(Ω) ⊗ ŝupp(Γ). In Theorem 3.3, we will show that the FDR/FDP of this estimator is asymptotically controlled at the level

α′ = α((2 − α)ab + aq + bp) / max(ab + aq + pb, 1),

where a and b are the numbers of total discoveries in ŝupp(Ω) and ŝupp(Γ), respectively, excluding the diagonal entries.
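For bookkeeping, a small helper evaluating the estimate α′ above, together with the true FDP it approximates (see (29) in Section 3), might look as follows; the function name and argument layout are illustrative.

```python
def kronecker_fdp_summary(a, b, a0, b0, p, q, alpha):
    """FDP bookkeeping for the Kronecker support estimate.

    a, b are the off-diagonal discovery counts in the two estimated supports,
    a0, b0 the (unknown in practice) false discoveries among them, and alpha
    the per-matrix FDR level.
    """
    # true FDP of the Kronecker support estimate, cf. (29)
    false_disc = a0 * (q + b) + (a - a0) * b0 + p * b0
    total_disc = p * b + a * (q + b)
    fdp = false_disc / max(total_disc, 1)
    # its practical estimate alpha', cf. (30): replace a0, b0 by alpha*a, alpha*b
    alpha_prime = alpha * ((2 - alpha) * a * b + a * q + b * p) / max(a * b + a * q + b * p, 1)
    return fdp, alpha_prime
```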

3 Theoretical results

Let λ_min(Σ) = λ^{(1)}_1 ≤ . . . ≤ λ^{(1)}_p = λ_max(Σ) be the eigenvalues of Σ and λ_min(Ψ) = λ^{(2)}_1 ≤ . . . ≤ λ^{(2)}_q = λ_max(Ψ) be the eigenvalues of Ψ. We make the following standard assumption on the eigenvalues:

(C1). We assume that c^{−1} ≤ λ^{(1)}_1 ≤ . . . ≤ λ^{(1)}_p ≤ c and c^{−1} ≤ λ^{(2)}_1 ≤ . . . ≤ λ^{(2)}_q ≤ c for some constant c > 0.

The first result gives the asymptotic normality of our test statistic.

Proposition 3.1 Assume that, as np → ∞, log max(q, np) = o(np), and the initial estimators β̂_j for j ∈ [q] satisfy (6) with

a_{n2} = o((np)^{−1/4})   and   a_{n1} = o(1/√(log max(q, np))).  (24)

Then, as np → ∞,

√( (n−1)p / (r_{ii} r_{jj} A_p) ) ( T_{ij} + b_{ij} γ_{ij}/(γ_{ii}γ_{jj}) ) ⇒ N( 0, 1 + γ_{ij}^2/(γ_{ii}γ_{jj}) )

in distribution, where r_{ii}, b_{ij} and A_p are defined in (7), (10), and (13), respectively.

Note that by the assumption (C1), 1/c′ ≤ Ap ≤ c′ for some constant c′ > 0. Also note

that the condition in (24) will be established later in Proposition 3.3. It is interesting

to see that, in Proposition 3.1, we only require np → ∞ which means that the sample

size n can be a fixed constant. This is a significant difference between the estimation

of MGGMs and that of vector-variate GGMs. In the latter problem, to establish the

asymptotic consistency or normality, the sample size is usually required to go to infinity

in the existing literature; see, e.g., Rothman et al. [2008b], Lam and Fan [2009], Liu [2013],

Ren et al. [2015].

We next establish the convergence rate of the estimator of A_p. To this end, we need an additional condition on Σ.

(C2). For some 0 < τ < 2, assume that ∑_{j=1}^{p} |σ_{ij}|^τ ≤ C uniformly in 1 ≤ i ≤ p.

Note that (C2) is only slightly stronger than (C1), since (C1) implies that ∑_{j=1}^{p} σ_{ij}^2 ≤ C uniformly in 1 ≤ i ≤ p.

Proposition 3.2 Let λ_n = λ √(log max(p, nq)/(nq)) with the constant λ sufficiently large. Suppose that (C2) holds. Then Â_p/A_p = 1 + O_P(λ_n^{2−τ}) as nq → ∞.

By Propositions 3.1 and 3.2, under H_{0ij}: γ_{ij} = 0, the test statistic satisfies T̂_{ij} ⇒ N(0, 1) if log max(q, np) = o(np) and log max(p, nq) = o(nq) as (np, nq) → ∞.

By Lemma 6.3 in the Appendix, we have b_{ij} − tr(Σ)/p = o_P(1), where b_{ij} is defined in (10). Therefore, Propositions 3.1 and 3.2 also imply that, to detect γ_{ij} ≠ 0 correctly, γ_{ij} should satisfy √(np) |γ_{ij}| → ∞. Note that, in the estimation of vector-variate GGMs, one requires √n |γ_{ij}| → ∞ to ensure that γ_{ij} ≠ 0 can be detected with high probability; see Ren et al. [2015] for the minimax optimal rate for the estimation of a single entry of the precision matrix in vector-variate GGMs.

We next show that the FDP and FDR of ŝupp(Ω) ⊗ ŝupp(Γ) can be controlled asymptotically. To this end, we discuss the FDP and FDR of the estimators of supp(Γ) and supp(Ω) separately. For the estimation of supp(Γ), recall the definitions of the FDP and FDR,

FDP_1 = [ ∑_{(i,j)∈H_0} I{(i, j) ∈ ŝupp(Γ)} ] / max( ∑_{1≤i<j≤q} I{(i, j) ∈ ŝupp(Γ)}, 1 ) = [ ∑_{(i,j)∈H_0} I{|T̂_{ij}| ≥ t̂} ] / max( ∑_{1≤i<j≤q} I{|T̂_{ij}| ≥ t̂}, 1 ),

FDR_1 = E(FDP_1),

where H_0 = {(i, j) : γ_{ij} = 0, 1 ≤ i < j ≤ q}. Let H_1 = {(i, j) : γ_{ij} ≠ 0, 1 ≤ i < j ≤ q}. Further, let ϖ_0 = Card(H_0) be the number of true nulls, ϖ_1 = Card(H_1) be the number of true alternatives, and ϖ = (q^2 − q)/2 be the number of hypotheses. For a constant γ > 0 and 1 ≤ i ≤ q, define

A_i(γ) = { j : 1 ≤ j ≤ q, j ≠ i, |γ_{ij}| ≥ (log q)^{−2−γ} }.

Theorem 3.1 shows that our procedure controls the FDP and FDR at level α asymptotically.

Theorem 3.1 Let the dimensions (p, q) satisfy q ≤ (np)^r for some r > 0. Suppose that

Card{ (i, j) : 1 ≤ i < j ≤ q,  |γ_{ij}|/√(γ_{ii}γ_{jj}) ≥ 4 √( A_p log q / ((n−1)p) ) } ≥ √(log log q).  (25)

Assume that ϖ_0 ≥ cϖ for some c > 0 and that the initial estimators {β̂_i}_{i∈[q]} satisfy (6) with

a_{n1} = o(1/ log max(q, np))   and   a_{n2} = o((np log q)^{−1/4}).  (26)

Under (C1), (C2), and max_{1≤i≤q} Card(A_i(γ)) = O(q^ρ) for some ρ < 1/2 and γ > 0, we have

lim_{(np,q)→∞} FDR_1/(αϖ_0/ϖ) = 1   and   FDP_1/(αϖ_0/ϖ) → 1 in probability as (np, q) → ∞.

Condition (25) is nearly necessary for the FDP control. In fact, Liu and Shao [2014] (Proposition 2.1) showed that, in large-scale multiple testing problems, if the number of true alternatives is fixed, then it is impossible for the Benjamini-Hochberg method [Benjamini and Hochberg, 1995] to control the FDP with probability tending to one at any desired level. Note that (25) is only slightly stronger than the condition that the number of true alternatives goes to infinity. The condition on Card(A_i(γ)) is a sparsity condition on Γ. This condition is also quite weak. For the estimation of vector-variate GGMs, the existing literature often requires the sparsity of the precision matrix to be less than O(√n). Note that when the dimension q is much larger than n, our condition on Card(A_i(γ)) in Theorem 3.1 is clearly much weaker. The assumption that ϖ_0 ≥ cϖ for some constant c > 0 is not restrictive: if ϖ_0 = o(ϖ), almost all γ_{ij} for 1 ≤ i < j ≤ q are nonzero, and rejecting all the hypotheses already leads to FDR → 0 and thus controls the FDR. It should be noted that, in Theorem 3.1, the sample size n can be a fixed constant as long as the dimensions (p, q) → ∞.

As in Theorem 3.1, we have a similar FDP and FDR control result for the estimation of supp(Ω). Let H′_0 = {(i, j) : ω_{ij} = 0, 1 ≤ i < j ≤ p} and H′_1 = {(i, j) : ω_{ij} ≠ 0, 1 ≤ i < j ≤ p}. Let κ_0 = Card(H′_0), κ_1 = Card(H′_1) and κ = (p^2 − p)/2. Recall the definitions of the FDP and FDR of the estimator of supp(Ω),

FDP_2 = [ ∑_{(i,j)∈H′_0} I{(i, j) ∈ ŝupp(Ω)} ] / max( ∑_{1≤i<j≤p} I{(i, j) ∈ ŝupp(Ω)}, 1 ),   FDR_2 = E(FDP_2).

For a constant γ > 0 and 1 ≤ i ≤ p, define

B_i(γ) = { j : 1 ≤ j ≤ p, j ≠ i, |ω_{ij}| ≥ (log p)^{−2−γ} }.

Let B_q = q ‖Ψ‖_F^2/(tr(Ψ))^2. As in (C2), we assume the following condition on Ψ = (ψ_{ij})_{q×q}:

(C3). For some 0 < τ < 2, assume that ∑_{j=1}^{q} |ψ_{ij}|^τ ≤ C uniformly in 1 ≤ i ≤ q.

Theorem 3.2 Let the dimensions (p, q) satisfy p ≤ (nq)^r for some r > 0. Suppose that

Card{ (i, j) : 1 ≤ i < j ≤ p,  |ω_{ij}|/√(ω_{ii}ω_{jj}) ≥ 4 √( B_q log p / ((n−1)q) ) } ≥ √(log log p).  (27)

Assume that κ_0 > cκ for some c > 0 and that the initial estimators {β̂_i}_{i∈[p]} satisfy (6) with

a_{n1} = o(1/ log max(p, nq))   and   a_{n2} = o((nq log p)^{−1/4}).  (28)

Under (C1), (C3), and max_{1≤i≤p} Card(B_i(γ)) = O(p^ρ) for some ρ < 1/2 and γ > 0, we have

lim_{(nq,p)→∞} FDR_2/(ακ_0/κ) = 1   and   FDP_2/(ακ_0/κ) → 1 in probability as (nq, p) → ∞.

By Theorems 3.1 and 3.2, we can obtain the FDP and FDR of the estimator ŝupp(Ω) ⊗ ŝupp(Γ). In particular, let a_0 and a be the numbers of false discoveries and total discoveries in ŝupp(Ω), excluding the diagonal entries. Similarly, let b_0 and b be the numbers of false discoveries and total discoveries in ŝupp(Γ), excluding the diagonal entries. Then, it is easy to calculate that the number of false discoveries in ŝupp(Ω) ⊗ ŝupp(Γ) is a_0(q + b) + (a − a_0)b_0 + pb_0, and the number of total discoveries in ŝupp(Ω) ⊗ ŝupp(Γ) is pb + a(q + b) (excluding the diagonal entries). We have the following formulas for the FDP and FDR of ŝupp(Ω) ⊗ ŝupp(Γ):

FDP = [ a_0(q + b) + (a − a_0)b_0 + pb_0 ] / max( pb + a(q + b), 1 ),   FDR = E(FDP).  (29)

The true FDP in (29) cannot be computed in practice since the numbers of false discoveries a_0 and b_0 are unknown. A straightforward estimator of the FDP is obtained by replacing the unknown quantities a_0 and b_0 by αa and αb, respectively, which leads to the following FDP estimator α′:

α′ = α((2 − α)ab + aq + bp) / max(ab + aq + bp, 1).  (30)

In the next theorem, we show that FDP/α′ converges to 1 in probability as (p, q) → ∞.

Theorem 3.3 Under the conditions of Theorems 3.1 and 3.2, together with the sparsity conditions ϖ_1 = o(ϖ) and κ_1 = o(κ), we have

FDP/α′ → 1 in probability as (p, q) → ∞.

Theorem 3.3 shows that the FDP of the proposed estimator ŝupp(Ω) ⊗ ŝupp(Γ) can be estimated consistently (note that the values of a and b are known). In fact, we can still obtain a guarantee on the FDP of ŝupp(Ω) ⊗ ŝupp(Γ) even without the sparsity conditions ϖ_1 = o(ϖ) and κ_1 = o(κ). In particular, when these sparsity conditions are not satisfied, by Theorems 3.1 and 3.2 we can obtain from the formula of the FDP that, for any ε > 0,

P( FDP / [ α(2ab + aq + bp)/max(ab + aq + pb, 1) ] ≤ 1 + ε ) → 1.

Finally, we present the next proposition, which shows that the convergence rate conditions on the initial estimators of Γ in (24) and (26) can be satisfied under some regularity conditions. The convergence rate condition in (28) for estimating Ω can be established similarly. This result establishes the consistency of the Lasso with correlated samples, which is of independent interest.

Proposition 3.3 Let δ in (20) be sufficiently large. Suppose that (C1) holds and max_{1≤i≤q} |β_i|_0 = o( √(np) / (log max(q, np))^{3/2} ). Then the initial estimators β̂_i(δ) for 1 ≤ i ≤ q are consistent in both the ℓ_1 and ℓ_2 norms with the rates in (26).

4 Statistical inference for vector-variate GGMs with

correlated samples

The proposed method for estimating MGGMs can also be adapted to estimate vector-

variate GGMs with identically distributed but correlated samples. In fact, for MGGMs

with only a single matrix-variate observation (n = 1), estimating supp(Γ) is equivalent to

the estimation of vector-variate GGM with p correlated q-dimensional samples (recall that

Γ = (γij)q×q = Ψ−1). To see this, let X ∼ N(µ,Σ⊗Ψ), where the mean matrix µ ∈ Rp×q

is assumed to have p identical rows (each row is denoted by µr) and the diagonal entries

of Σ satisfy σii = 1 for all 1 ≤ i ≤ p. Then, p rows of X are identically distributed (note

that σii = 1) with the multivariate normal distribution N(µr,Ψ) but are correlated. The

correlation across samples may arise in applications, e.g., the microarray studies, in which

Efron [2009] showed that the independence across samples can be a risky assumption.

When Σ = Ip in the above setting, the rows of X become p i.i.d. vector-variate

samples. In such a case, the asymptotic normality for the estimator of γij/(γiiγjj) and

γij has been established in Liu [2013] and Ren et al. [2015], based on which the test on

γij = 0 can be performed. Liu [2013] proposed a procedure to control the FDR in the

estimation of supp(Γ) for i.i.d. vector-variate GGMs. However, the method proposed in

Liu [2013] cannot be applied to deal with correlated samples. The following Proposition

4.1 shows that when there are correlations among samples, the null distribution of the

test statistics developed in Liu [2013] is incorrect and thus cannot be used for the purpose

of FDR control.

Recall the above setup, where the p rows of X ∈ R^{p×q} form a set of identically distributed samples from N(µ_r, Ψ). Let ε̂_{lj} = X_{lj} − X̄_j − (X_{l,−j} − X̄_{−j}) β̂_j, where X̄_j = (1/p) ∑_{l=1}^{p} X_{lj} and X̄_{−j} = (1/p) ∑_{l=1}^{p} X_{l,−j}. Define r_{ij} = (1/p) ∑_{l=1}^{p} ε̂_{li} ε̂_{lj} and T_{ij} = r_{ij} + r_{ii} β̂_{i,j} + r_{jj} β̂_{j−1,i}, for 1 ≤ i < j ≤ q. Liu [2013] proposed the following test statistic for H_{0ij}: γ_{ij} = 0,

T_{ij,0} = √( p / (r_{ii} r_{jj}) ) T_{ij}.  (31)

Liu [2013] showed that, if the samples are independent and the condition in (24) is satisfied with n = 1, then under H_{0ij}: γ_{ij} = 0 we have T_{ij,0} ⇒ N(0, 1). However, when the samples are correlated, this result no longer holds, and hence the FDR/FDP procedure in Liu [2013] is unable to control the FDR/FDP. An adjustment (the so-called variance correction) of T_{ij,0} is needed to establish the asymptotic normality, as shown in the following proposition.

Proposition 4.1 Assume (C2) and the conditions of Proposition 3.1 hold with n = 1. Suppose q^{r_1} ≤ p ≤ q^{r_2} for some 0 < r_1 ≤ r_2 < ∞. Under H_{0ij}, we have T_{ij,0}/√(A_p) ⇒ N(0, 1) as (p, q) → ∞, where A_p = p ‖Σ‖_F^2/(tr(Σ))^2.

From Proposition 4.1, we can see that, when the samples are correlated, the asymptotic variance of T_{ij,0} changes from 1 to A_p. Note that, since σ_{ii} = 1, we have A_p = p^{−1} ∑_{1≤i,j≤p} σ_{ij}^2. Clearly, the quantity A_p reflects how the correlation changes the asymptotic variance of T_{ij,0}. If the correlation is weak, so that A_p → 1 (as p → ∞), then the method of Liu [2013] still works. However, when the correlation becomes stronger, so that A_p ≥ 1 + c for some fixed c > 0, the method of Liu [2013] fails to control the FDR/FDP because the asymptotic variance of T_{ij,0} is strictly larger than one.

For estimating vector-variate GGMs with correlated samples, we can use the variance correction technique introduced in Section 2 to control the FDR/FDP in the estimation of supp(Γ). Recall that X = (X_1, . . . , X_q), where each X_j ∈ R^p for 1 ≤ j ≤ q. Let Σ̂ = (1/q) ∑_{j=1}^{q} (X_j − X̄_j 1)(X_j − X̄_j 1)′ ∈ R^{p×p}, where 1 is the p-dimensional vector of all ones. Define the thresholding estimators σ̂_{ij,λ} = σ̂_{ij} I{ |σ̂_{ij}| ≥ λ √(log(p ∨ q)/(p ∧ q)) } for i ≠ j and σ̂_{ii,λ} = σ̂_{ii}. Set Σ̂_λ = (σ̂_{ij,λ})_{p×p} and Â_p = p ‖Σ̂_λ‖_F^2/(tr(Σ̂_λ))^2. We construct the following test statistic:

T̂_{ij} = √( p / (r_{ii} r_{jj} Â_p) ) T_{ij}.  (32)

The following proposition shows that the proposed variance correction leads to the asymptotic normality of the test statistic in (32) for correlated samples.

Proposition 4.2 Assume (C2) and the conditions of Proposition 3.1 hold with n = 1. Suppose q^{r_1} ≤ p ≤ q^{r_2} for some 0 < r_1 ≤ r_2 < ∞. Under H_{0ij}, we have T̂_{ij} ⇒ N(0, 1) as (p, q) → ∞.


With the procedure (22) and T̂_{ij} in (32), the FDR and FDP control results of Theorem 3.1 hold (without the factor (n − 1) in (25) and with n = 1 in (26)). We note that although the proposed method for estimating a vector-variate GGM with correlated samples is similar to the approach for estimating MGGMs in Section 2, there are still some subtle differences. In particular, for MGGMs with n > 1 samples, the samples are centralized by subtracting the matrix-variate sample mean X̄ = (1/n) ∑_{k=1}^{n} X^{(k)} ∈ R^{p×q}, and {X^{(k)}}_{k=1}^{n} are i.i.d. In contrast, for vector-variate GGMs with correlated samples, the j-th column of X is centralized by subtracting X̄_j 1, where X̄_j = (1/p) ∑_{l=1}^{p} X_{lj}. This difference requires some additional technical manipulations for establishing Proposition 4.2.
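A minimal sketch of the corresponding computation for the correlated-sample case (n = 1), plugging a column-centered estimate of Σ into (32), might look as follows; the helper name and the thresholding constant are illustrative, and r, T denote the n = 1 quantities defined above.

```python
import numpy as np

def corrected_statistic_n1(X, r, T, lam=2.0):
    """Variance-corrected statistics (32) for a single p x q observation."""
    p, q = X.shape
    Xc = X - X.mean(axis=0, keepdims=True)           # subtract the column means Xbar_j
    Sigma_hat = Xc @ Xc.T / q                        # p x p estimate of Sigma
    thr = lam * np.sqrt(np.log(max(p, q)) / min(p, q))
    Sigma_lam = np.where(np.abs(Sigma_hat) >= thr, Sigma_hat, 0.0)
    np.fill_diagonal(Sigma_lam, np.diag(Sigma_hat))
    A_hat = p * np.sum(Sigma_lam ** 2) / np.trace(Sigma_lam) ** 2
    d = np.sqrt(np.outer(np.diag(r), np.diag(r)) * A_hat)
    return np.sqrt(p) * T / d                        # test statistics of (32)
```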

5 Simulated Experiments

In this section, we conduct simulation studies. In particular, we construct Ω and Γ based on combinations of the following graph structures used in Liu [2013]; a code sketch of these constructions is given after the list.

1. Hub graph. There are p/10 rows with sparsity 11; the rest of the rows have sparsity 2. To this end, we let Ω_1 = (ω_{ij}) with ω_{ij} = ω_{ji} = 0.5 for i = 10(k − 1) + 1 and 10(k − 1) + 2 ≤ j ≤ 10(k − 1) + 10, 1 ≤ k ≤ p/10; the diagonal entries ω_{ii} = 1 and all other entries are zero. We then let Ω = Ω_1 + (|λ_min(Ω_1)| + 0.05) I_p to make the matrix positive definite.

2. Band graph. Ω = (ω_{ij}), where ω_{ii} = 1, ω_{i,i+1} = ω_{i+1,i} = 0.6, ω_{i,i+2} = ω_{i+2,i} = 0.3, and ω_{ij} = 0 for |i − j| ≥ 3.

3. Erdős-Rényi random graph. There is an edge between each pair of nodes with probability min(0.05, 5/p) independently. Let Ω_1 = (ω_{ij}) with ω_{ii} = 1 and ω_{ij} = u_{ij} · δ_{ij} for i ≠ j, where u_{ij} ∼ U(0.4, 0.8) is a uniform random variable and δ_{ij} is a Bernoulli random variable with success probability min(0.05, 5/p); u_{ij} and δ_{ij} are independent. We then let Ω = Ω_1 + (|λ_min(Ω_1)| + 0.05) I_p so that the matrix is positive definite.

The Γ matrix is also constructed from one of the above three graph structures.
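The following numpy sketch illustrates how the three precision-matrix structures above might be generated; the function names and the random seed are ours.

```python
import numpy as np

def make_pd(O):
    """Shift the spectrum: Omega = Omega_1 + (|lambda_min(Omega_1)| + 0.05) I."""
    lmin = np.linalg.eigvalsh(O).min()
    return O + (abs(lmin) + 0.05) * np.eye(O.shape[0])

def hub_graph(p):
    """Hub precision matrix: one hub per block of 10 rows, as in structure 1."""
    O = np.eye(p)
    for k in range(p // 10):
        hub = 10 * k
        O[hub, hub + 1:hub + 10] = O[hub + 1:hub + 10, hub] = 0.5
    return make_pd(O)

def band_graph(p):
    """Banded precision matrix with bandwidth 2, as in structure 2."""
    O = np.eye(p)
    for i in range(p - 1):
        O[i, i + 1] = O[i + 1, i] = 0.6
    for i in range(p - 2):
        O[i, i + 2] = O[i + 2, i] = 0.3
    return O

def random_graph(p, rng=None):
    """Erdos-Renyi precision matrix, as in structure 3."""
    rng = np.random.default_rng(0) if rng is None else rng
    prob = min(0.05, 5 / p)
    U = rng.uniform(0.4, 0.8, size=(p, p))
    D = rng.binomial(1, prob, size=(p, p))
    O = np.triu(U * D, k=1)
    return make_pd(O + O.T + np.eye(p))
```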

For each combination of Ω and Γ, we generate n (n = 20 or n = 100) samples

(X(k))nk=1 with each X(k) ∼ Np,q(0,Σ ⊗Ψ) where Σ = Ω−1 and Ψ = Γ−1. We consider

different settings of p and q, i.e., (p, q) = (100, 100), (p, q) = (200, 50), (p, q) = (200, 200)

and (p, q) = (400, 400). The FDR significance level α for estimating both supp(Ω) and supp(Γ) is set to 0.1 (the observations for other α's are similar and thus omitted for space considerations). The parameters λ and δ are tuned using the data-driven approach in (21). All the simulation results are based on 100 independent replications.

  p    q    Ω        Γ          n = 20                  n = 100
                                FDP (α′)        Power   FDP (α′)        Power
 100  100   hub      hub        0.094 (0.144)   0.998   0.127 (0.145)   1.000
            hub      band       0.093 (0.151)   0.999   0.125 (0.152)   1.000
            hub      random     0.085 (0.151)   0.826   0.117 (0.153)   1.000
            band     band       0.097 (0.161)   1.000   0.139 (0.161)   1.000
            band     random     0.077 (0.163)   0.995   0.106 (0.163)   1.000
            random   random     0.080 (0.162)   0.777   0.084 (0.165)   0.999
 200  200   hub      hub        0.121 (0.145)   1.000   0.130 (0.145)   1.000
            hub      band       0.108 (0.152)   1.000   0.127 (0.152)   1.000
            hub      random     0.112 (0.153)   0.955   0.131 (0.153)   1.000
            band     band       0.109 (0.161)   1.000   0.136 (0.162)   1.000
            band     random     0.107 (0.163)   1.000   0.120 (0.163)   1.000
            random   random     0.112 (0.165)   0.974   0.115 (0.165)   1.000
 200   50   hub      hub        0.084 (0.144)   0.997   0.119 (0.145)   1.000
            hub      band       0.107 (0.151)   1.000   0.128 (0.152)   1.000
            hub      random     0.055 (0.152)   0.976   0.103 (0.153)   1.000
            band     band       0.077 (0.161)   1.000   0.127 (0.161)   1.000
            band     random     0.086 (0.161)   0.870   0.082 (0.163)   1.000
            random   random     0.076 (0.158)   0.608   0.102 (0.165)   0.966
 400  400   hub      hub        0.134 (0.145)   1.000   0.131 (0.145)   1.000
            hub      band       0.119 (0.152)   1.000   0.131 (0.152)   1.000
            hub      random     0.135 (0.153)   0.996   0.135 (0.153)   1.000
            band     band       0.133 (0.162)   1.000   0.140 (0.162)   1.000
            band     random     0.131 (0.164)   1.000   0.134 (0.164)   1.000
            random   random     0.136 (0.166)   1.000   0.134 (0.166)   1.000

Table 1: Averaged empirical FDP, α′ in (30), and power for α = 0.1.


[Figure 1 here; panels: (a) p = 100, q = 100; (b) p = 200, q = 50; (c) p = 200, q = 200; (d) p = 400, q = 400.]

Figure 1: Boxplots for FDP when n = 100 and α = 0.1.

In Table 1, we report the averaged true FDP in (29), α′ in (30), and the power for estimating supp(Ω ⊗ Γ) over 100 replications. The power of the estimator ŝupp(Ω) ⊗ ŝupp(Γ) can be calculated with the following simple formula. In particular, let A and B be the numbers of nonzero off-diagonal entries in Ω and Γ, and recall the definitions of a_0, a, b_0, b in (29); then

Power = [ (pb + a(q + b)) − (a_0(q + b) + (a − a_0)b_0 + pb_0) ] / [ pB + A(q + B) ] = [ p(b − b_0) + (a − a_0)(q + b − b_0) ] / [ pB + A(q + B) ],  (33)

where the numerator is the number of true discoveries in ŝupp(Ω) ⊗ ŝupp(Γ) and the denominator is the total number of nonzero off-diagonal entries of Ω ⊗ Γ. From Table 1,

the true FDP is well controlled below or close to its estimate α′ in all cases. For n = 100, the power is very close to 1 in all cases. For n = 20, when Ω and Γ are both generated from random graphs, there are a few settings in which the power is low. We examined these cases and found that our FDP procedure generates overly sparse estimators, which

leads to lower powers. This shows that when Ω and Γ are both generated from random graphs, it is difficult to recover their supports with a very small n (e.g., n = 20). We also note that the true FDP in (29) is usually smaller than its estimator α′ in (30), and the true FDP gets closer to its estimate when p and q become large. In fact, in all the cases with n = 100, the true FDPs for both ŝupp(Ω) and ŝupp(Γ) are usually smaller than α, so that the overall FDP of ŝupp(Ω) ⊗ ŝupp(Γ) is smaller than its estimate α′ (recall that α′ is obtained by estimating the number of false discoveries as α times the number of total discoveries). Since the power is almost 1, such a small FDP indicates that our method has superior support recovery performance. We present boxplots of the FDPs for n = 100 over 100 replications in Figure 1. As we can see from Figure 1, the FDPs are well concentrated, which suggests that the performance of the proposed estimator is quite stable.

In Figure 2, we plot the ratio Â_p/A_p for n = 20 (left panel) and n = 100 (right panel) as p = q increases from 50 to 400. Due to space constraints, we only show the case where both Ω and Γ are generated from hub graphs (the plots for the other graph structures are similar). As one can see from Figure 2, when either n is fixed and p = q increases, or p = q is fixed and n increases from 20 to 100, the mean ratio becomes closer to one and the standard deviation of the ratios becomes smaller. This study empirically verifies Proposition 3.2, which states that the ratio Â_p/A_p converges to 1 in probability as nq → ∞.

[Figure 2 here; panels: (a) n = 20; (b) n = 100.]

Figure 2: The ratio Â_p/A_p for different n and p = q when Ω and Γ are hub graphs.

  p    q    Ω        Γ          n = 20            n = 100
                                FDP      Power    FDP      Power
 100  100   hub      hub        0.037    0.347    0.038    0.346
            hub      band       0.455    0.410    0.348    0.381
            hub      random     0.722    0.926    0.738    0.910
            band     band       0.341    0.206    0.332    0.167
            band     random     0.339    0.964    0.246    0.971
            random   random     0.703    0.962    0.622    0.943
 200  200   hub      hub        0.027    0.322    0.059    0.370
            hub      band       0.357    0.314    0.333    0.306
            hub      random     0.629    0.903    0.654    0.888
            band     band       0.333    0.187    0.333    0.167
            band     random     0.246    0.966    0.151    0.982
            random   random     0.481    0.828    0.326    0.863
 200   50   hub      hub        0.060    0.381    0.054    0.385
            hub      band       0.362    0.330    0.371    0.517
            hub      random     0.754    0.927    0.704    0.909
            band     band       0.340    0.205    0.334    0.174
            band     random     0.588    0.833    0.551    0.855
            random   random     0.784    0.961    0.607    0.947
 400  400   hub      hub        0.946    0.978    0.952    1.000
            hub      band       0.982    1.000    0.962    0.992
            hub      random     0.949    0.999    0.910    1.000
            band     band       0.338    0.179    0.336    0.175
            band     random     0.233    0.848    0.237    0.864
            random   random     0.141    0.610    0.099    0.644

Table 2: Averaged empirical FDP and power using the penalized log-likelihood method with the SCAD penalty.

Finally, we compare our procedure with the penalized log-likelihood approach proposed

by Leng and Tang [2012]. We adopt the same (regularization) parameter tuning procedure

as Leng and Tang [2012], i.e., we generate an extra random test dataset with the sample


size equal to the training set and choose the parameter that maximizes the log-likelihood

on the test dataset. Due to space constraints, we only report the results using the SCAD
penalty rather than the L1 penalty, since the SCAD penalty leads to slightly better performance

(and this is also observed in Leng and Tang [2012]). The averaged empirical FDPs and

powers for different settings of n, p, q,Ω,Γ are shown in Table 2. As one can see from

Table 2, none of the cases has a small FDP while maintaining a large power. In fact, for

those settings with small averaged FDPs (e.g., n = 100, p = 200, q = 50 and Ω and Γ

generated from hub graphs with the averaged FDP 0.054), the corresponding powers are

also small (e.g., 0.385 for the aforementioned case), which indicates that the estimated Ω

or Γ is too sparse. On the other hand, for those settings with large averaged powers (e.g.,

n = 100, p = q = 400 and Ω from hub and Γ from random with the averaged power=1),

the corresponding FDPs are also large (e.g., 0.910 for the aforementioned case), which

indicates that the estimated Ω or Γ is too dense. We also note that when p, q are small as

compared to n, the penalized log-likelihood approach still achieves good support recovery

performance (e.g., the case n = 100, p = q = 20 as reported in Leng and Tang [2012]).

When p, q are comparable to or larger than n, our method achieves much better support

recovery performance than the penalized log-likelihood approach.

6 Real Data Analysis

6.1 US Agricultural Export Data

We first apply our method to the US agricultural export data studied in Leng and Tang

[2012]. The dataset contains annual U.S. agriculture export data from 1970 to 2009, in

total 40 years. Each annual data contains the amount (in thousands U.S. dollars) of export

for 36 products (e.g., pet foods, snack foods, breakfast cereals, soybean meal, different

kinds of meats, eggs, dairy products, etc) at 13 different regions (e.g., North America,

Central America, South America, South Asia, etc). Thus, the dataset can be organized

into 40 matrix-variate observations, where each observation is a (p = 13)×(q = 36) matrix.

We adopt the method proposed in Leng and Tang [2012] to remove the dependence in

this matrix-variate time series data. In particular, we take the logarithm of the original

data plus one and then take the lag-one difference for each matrix observation so that the

number of observations becomes n = 39. Please refer to Leng and Tang [2012] for more

25

Page 26: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

Region Product

α = 0.1 α = 0.2 α = 0.3 α = 0.1 α = 0.2 α = 0.3

No. of Edges 2 21 26 11 16 29

Table 3: No. of edges for export data. For 13 regions, there are 78 possible edges in total.

For 36 products, there are 630 possible edges in total.

details of the pre-processing of the data.

We apply the proposed FDR control procedure to estimate the support of the preci-

sion matrices for regions and products under different level α ∈ 0.1, 0.2, 0.3. In Table

3, we report the number of edges/discoveries for different α’s. We observe that the

number of discoveries is relatively small as compared to the number of hypotheses. In

Figure 3, we plot the graphs corresponding to the estimated supports of Ω and Γ for

α = 0.2 and α = 0.3. Figure 3(a) and 3(c) show the plots of graphs for p = 13 re-

gions. As we can see, the regions in the following sets East Asia, Southeast Asia,European Union, Other Europe and Central America, North America, South Americaare always connected. Such an observation is expectable since regions in each afore-

mentioned set are close geographically. This observation is kind of consistent with the

result obtained by penalized log-likelihood approach in Leng and Tang [2012], which

claims that “the magnitude between Europe Union and Other Europe, and that be-

tween East Asia and Southeast Asia are the strongest”. We also note that the regions

South Asia, Sub-Saharan Africa, and North Africa connect to fewer regions with the

degree either one or two. This observation is also kind of consistent with the result

in Leng and Tang [2012], which says that “interestingly, none of the 11 largest edges

corresponds to either North Africa or Sub-Saharan Africa”. The estimated graphs for

products shown in Figure 3(b) and 3(d) are quite sparse, which indicates many pairs

of products are conditionally independent given the information of the rest of the prod-

ucts. We also observe that the products in the following sets Pet foods, Snack foods,Eggs,Dairy Products,Red Meats FR/CH/FR,Red Meats Prep/Pres,Poultry Meat are

always connected. This observation also makes a lot of sense since different kinds of meats,

eggs, and dairy products are closely related products and thus should be highly correlated.

26

Page 27: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

North America

Caribbean

Central America

South America

European Union−27

Other Europe

East Asia

Middle East

North Africa

Sub−Saharan Afri

South Asia

Southeast Asia

Oceania

(a) Graph for Regions (α = 0.2)

Wheat

Coarse Grains

Rice

Soybeans

CottonTobacco

Pulses

Peanuts

Other Bulk Commodities

Wheat Flour

Soybean Meal

Soybean Oil

Vegetable Oils (Ex Soybean)

Feeds & FoddersLive Animals

Hides & Skins

Animal Fats

Planting Seeds

Sugar Sweeteners Bev Bases

Other Intermediate Products

Snack Foods

Breakfast Cereals

Red Meats FR/CH/FR

Red Meats Prep/PresPoultry Meat

Dairy Products

Eggs & Products

Fresh Fruit

Fresh Vegetables

Processed Fruit & Vegetables

Fruit & Vegetable Juices

Tree Nuts

Wine and Beer

Nursery Products

Pet Foods

Other Consumer Oriented

(b) Graph for Products (α = 0.2)

North America

Caribbean

Central AmericaSouth America

European Union−27

Other Europe

East Asia

Middle East

North Africa

Sub−Saharan AfricaSouth Asia

Southeast Asia

Oceania

(c) Graph for Regions (α = 0.3)

Wheat

Coarse Grains

Rice

Soybeans

Cotton

Tobacco

PulsesPeanuts

Other Bulk Commodities

Wheat Flour

Soybean Meal

Soybean Oil

Vegetable Oils (Ex Soybean)

Feeds & FoddersLive Animals

Hides & Skins

Animal Fats

Planting Seeds

Sugar Sweeteners Bev Bases

Other Intermediate Products

Snack Foods

Breakfast Cereals

Red Meats FR/CH/FR

Red Meats Prep/Pres

Poultry Meat

Dairy Products

Eggs & Products

Fresh Fruit

Fresh VegetablesProcessed Fruit & Vegetables

Fruit & Vegetable Juices

Tree Nuts

Wine and Beer

Nursery Products

Pet Foods

Other Consumer Oriented

(d) Graph for Products (α = 0.3)

Figure 3: Sparse graphs for export data

6.2 Climate Data Analysis

In this section, we study a climatological data in Lozano et al. [2009], which contains

monthly data of p = 17 different meteorological factors from 1990 to 2002, in total 144

months. The observations span q = 125 locations in the US. The 17 meteorological factors

measured for each month include CO2, CH4, H2, CO, average temperature (TMP), diurnal

temperature range (DTR), minimum temperate (TMN), maximum temperature (TMX),

27

Page 28: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

CO2

CH4

CO

H2WETCLD

VAP

PRE

FRS

DTR

TMN

TMP

TMX GLOETR

ETRN

DIR

(a) Graph for meteorological factors (α = 0.2)

CO2

CH4

CO

H2WETCLD

VAP

PRE

FRS

DTR

TMN

TMP

TMX GLOETR

ETRN

DIR

(b) Graph for meteorological factors (α = 0.3)

Figure 4: Sparse graphs for climate data

Meteorological factors Locations

α = 0.1 α = 0.2 α = 0.3 α = 0.1 α = 0.2 α = 0.3

No. of Edges 30 36 39 956 1406 1970

Table 4: No. of edges for climate data. For p = 17 meteorological factors, there are 136

edges in total. For q = 125 locations, there are 7750 possible edges in total.

precipitation (PRE), vapor (VAP), cloud cover (CLD), wet days (WET), frost days (FRS),

global solar radiation (GLO), direct solar radiation (DIR), extraterrestrial radiation (ETR),

extraterrestrial normal radiation (ETRN). We note that we ignore the UV aerosol index

factor in Lozano et al. [2009] since most measurements of this factor are missing. We

adopt the same procedure as described in Section 6.1 to reduce the level of dependence

in this matrix-variate time series data.

We apply the proposed FDR control procedure to estimate the support of the precision

matrices for meteorological factors and locations under different level α ∈ 0.1, 0.2, 0.3.In Table 4, we report the number of edges/discoveries for different α’s. From Table 4, the

number of discoveries for meteorological factors is quite stable as α increases from 0.1 to

0.3. Moreover, the number of discoveries for locations is relatively large, which indicates

28

Page 29: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

many strong correlations among pairs of locations. We plot the graphs corresponding to

the estimated supports of the precision matrices for meteorological factors in Figure 4

(the plots for locations are omitted since they are too dense to visualize). An interesting

observation the factors TMX, TMP, TMN and DTR form a clique. This pattern is kind

of reasonable since the factors TMX, TMP, TMN and DTR are all related temperature

and thus should be highly correlated. Other sparsity patterns might also provide some

insights for understanding dependency relationships among meteorological factors.

Acknowledgement

The authors thank Chenlei Leng and Cheng Yong Tang for sharing the code of Leng and

Tang [2012] and US export data.

References

G. I. Allen and R. Tibshirani. Transposable regularized covariance models with an appli-

cation to missing data imputation. Annals of Applied Statistics, 4:764–790, 2010.

A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse

signals via conic programming. Biometrika, 98:791–806, 2011.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and

powerful approach to multiple testing. Journal of the Royal Statistical Society. Series

B (Methodological), 57(1):389–300, 1995.

P. Bickle and E. Levina. Regularized estimation of large covariance matrices. 36(1):

199–227, 2008.

F. Bijma, J. De Munck, and R. Heethaar. The spatiotemporal meg covariance matrix

modeled as a sum of kronecker products. NeuroImage, 27(2):402–415, 2005.

T. Cai and W. Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal

of the American Statistical Association, 106(494):672–684, 2011.

T. T. Cai, W. Liu, and X. Luo. A constrained `1 minimization approach to sparse precision

matrix estimation. Journal of the American Statistical Association, 106:594–607, 2011.

29

Page 30: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

E. Candes and T. Tao. The dantzig selector: statistical estimation when p is much larger

than n. Annals of Statistics, 35:2313–2351, 2007.

S. X. Chen and Y.-L. Qin. A two-sample test for high-dimensional data with applications

to gene-set testing. Ann. Statist., 38(2):808–835, 04 2010.

A. d’Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance

selection. SIAM Journal on Matrix Analysis and its Applications, 30:56–66, 2008.

A. P. Dawid. Some matrix-variate distribution theory: Notational considerations and a

bayesian application. Biometrika, 68:265–274, 1981.

B. Efron. Are a set of microarrays independent of each other? Annals of Applied Statistics,

3(3):922–942, 2009.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the

graphical lasso. Biostatistics, 9:432–441, 2008.

A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman Hall, 1999.

F. Huang and S. Chen. Joint learning of multiple sparse matrix Gaussian graphical

models. IEEE Transactions on Neural Networks and Learning Systems, 2015.

J. Jankova and S. van de Geer. Confidence intervals for high-dimensional inverse covari-

ance estimation. Electronic Journal of Statistics, 9:1205–1229, 2015.

A. Kalaitzis, J. Lafferty, N. D. Lawrence, and S. Zhou. The bigraphical lasso. In The

30th International Conference on Machine Learning, 2013.

C. Lam and J. Fan. Sparsistency and rates of convergence in large covariance matrix

estimation. Ann. Stat., 37:4254–4278, 2009.

C. Leng and C. Y. Tang. Sparse matrix graphical models. Journal of the American

Statistical Association, 107:1187–1200, 2012.

H. Liu, F. Han, M. Yuan, J. Lafferty, and L. Wasserman. High dimensional semiparametric

Gaussian copula graphical models. Annals of Statistics, 40:2293–2326, 2012.

W. Liu. Gaussian graphical model estimation with false discovery rate control. Annals

of Statistics, 41:2948–2978, 2013.

30

Page 31: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

W. Liu and Q.-M. Shao. Phase transition and regularized bootstrap in large-scale t-tests

with false discovery rate control. Ann. Statist., 42(5):2003–2025, 10 2014.

A. C. Lozano, H. Li, A. Niculescu-Mizil, Y. Liu, C. Perlich, J. Hosking, and N. Abe.

Spatial-temporal causal modeling for climate change attribution. In Proceedings of

the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data

Mining, 2009.

S. Ma, Q. Gong, and H. J. Bohnert. An arabidopsis gene network based on the graphical

Gaussian model. Genome Research, 17:1614–1625, 2007.

N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with

the lasso. Annals of Statistics, 34:1436–1462, 2006.

J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse

regression models. Journal of the American Statistical Association, 104(486):735–746,

2009.

P. Ravikumar, M. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance

estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal

of Statistics, 5:935–980, 2011.

Z. Ren, T. Sun, C. H. Zhang, and H. H. Zhou. Asymptotic normality and optimalities in

estimation of large Gaussian graphical model. Annals of Statistics, 43:991–1026, 2015.

A. Rothman, P. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance

estimation. Electronic Journal of Statistics, 2:494–515, 2008a.

A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covari-

ance estimation. Electron. J. Stat., 2:494–515, 2008b.

J. Schafer and K. Strimmer. An empirical Bayes approach to inferring large-scale gene

association networks. Bioinformatics, 21:754–764, 2005.

T. Sun and C. H. Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898,

2012.

T. Tsiligkaridis, A. O. H. III, and S. Zhou. Convergence properties of kronecker graphical

lasso algorithms. IEEE Transactions on Signal Processing, 61(7):1743–1755, 2013.

31

Page 32: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

L. Xue and H. Zou. Regularized rank-based estimation of high-dimensional nonparanor-

mal graphical models. Annals of Statistics, 40:2541–2571, 2012.

J. Yin and H. Li. Model selection and estimation in matrix normal graphical model.

Journal of Multivariate Analysis, 107:119–140, 2012.

Y. Ying and H. Liu. High-dimensional semiparametric bigraphical models. Biometrika,

pages 1–16, 2013.

M. Yuan. Sparse inverse covariance matrix estimation via linear programming. Journal

of Machine Learning Research, 11:2261–2286, 2010.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model.

Biometrika, 94:19–35, 2007.

S. Zhou. Gemini: Graph estimation with matrix variate normal instances. Annals of

Statistics, 42:532–562, 2014.

32

Page 33: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

Appendix

Let Ψ = (ψij)q×q. We first prove the following lemma.

Lemma 6.1 We have for any M > 0, there exists a constant C such that

P(

max1≤i≤j≤q

∣∣∣ψij − tr(Σ)

pψij

∣∣∣ ≥ C

√log max(q, np)

np

)= O((q + np)−M).

Proof. Recall that for any pair of i ∈ [q] and j ∈ [q]

ψij =1

(n− 1)p

n∑k=1

p∑l=1

(X(k)li − Xli)(X

(k)lj − Xlj)

=1

(n− 1)p

p∑l=1

n∑k=1

(X(k)li − Xli)(X

(k)lj − Xlj), (34)

where Xli = 1n

∑nk=1X

(k)li and Xlj = 1

n

∑nk=1X

(k)lj . Without loss of generality, we assume

that µ = 0.

Let A ∈ Rn×n be an orthogonal matrix with the last row ( 1√n, . . . , 1√

n). Let Yli =

(Y(1)li , . . . , Y

(n)li )

′= A(X

(1)li , . . . , X

(n)li )

′ ∈ Rn×1. So we have√nXli = Y

(n)li and

n∑k=1

(X(k)li − Xli)(X

(k)lj − Xlj) = Y

liYlj − Y (n)li Y

(n)lj =

n−1∑k=1

Y(k)li Y

(k)lj . (35)

Since (X(1)li , . . . , X

(n)li )

′ ∼ N(0, σllψiiIn×n), (Y(1)li , . . . , Y

(n−1)li )

′ ∼ N(0, σllψiiI(n−1)×(n−1)).

Let Yk = (Y(k)li )1≤l≤p,1≤i≤q for 1 ≤ k ≤ n − 1. We have Yk ∼ N(0,Σ ⊗ Ψ) and Yk,

1 ≤ k ≤ n−1, are independent. Let us define Zk = Σ−1/2Yk ∼ N(0, Ip×p⊗Ψ). Let Zki be

the i-th column of Yk. Then (Zki,Zkj) ∼ N(0, Ip×p⊗Ψ[i,j]), where Ψ[i,j] =

(ϕii ϕij

ϕji ϕjj

).

Let the U′DU be the eigenvalue decomposition of Σ, where U is an orthogonal ma-

trix and D = diag(λ(1)1 , . . . , λ

(1)p ). Define (Wki,Wkj) := (UZki,UZkj) ∈ Rp×2 where

Wki = (wki,1, . . . , wki,p)′ ∈ Rp×1 and Wkj = (wkj,1, . . . , wkj,p)

′ ∈ Rp×1. Since U is an

orthogonal matrix, (Wki,Wkj) ∼ N(0, Ip×p ⊗Ψ[i,j]), which also implies that (wki,l, wkj,l)

are independent for 1 ≤ l ≤ p.

Now combining (34) and (35), we have,

ψij =1

(n− 1)p

p∑l=1

n−1∑k=1

Y′kiYki

33

Page 34: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

=1

(n− 1)p

n−1∑k=1

(UZki)′DUZkj

=1

(n− 1)p

n−1∑k=1

p∑l=1

λ(1)l wki,lwkj,l. (36)

We further note that

Eψij =1

(n− 1)p

n−1∑k=1

p∑l=1

λ(1)l ψij =

tr(Σ)

pψij.

Put wkij,l = wki,lwkj,l − Ewki,lwkj,l. We have for some η > 0 such that Ee2η|λ(1)l wkij,l| ≤ K

for some K > 0, uniformly in i, j, l, k. It implies that

n−1∑k=1

p∑l=1

E(λ(1)l wkij,l)

2eη|λ(1)l wkij,l| ≤ K

n−1∑k=1

p∑l=1

(E(λ(1)l wkij,l)

4)1/2

=√

2K(n− 1)‖Σ‖2F (ϕiiϕjj + ϕ2ij).

By the exponential inequality in Lemma 1 in Cai and Liu (2011) and ‖Σ‖2F/p ≤ λ(1)p , for

any M > 0, there exists a constant C > 0,

P(|ψij − Eψij| ≥ C

√log max(q, np)

np

)= O((q + np)−M).

This proves Lemma 6.1.

Lemma 6.2 For any M > 0, there exists a constant C such that

P(

max1≤i≤q

max1≤h≤q,h6=i

∣∣∣ 1

np

n∑k=1

p∑l=1

ε(k)li (X

(k)lh − Xlh)

∣∣∣ ≥ C

√log max(q, np)

np

)= O((q + np)−M)

and

P(

max1≤i≤p

∣∣∣ 1

np

n∑k=1

p∑l=1

ε(k)li (X

(k)l,−i − Xl,−i)βi

∣∣∣ ≥ C

√log max(q, np)

np

)= O((q + np)−M).

Proof. Recall that

ε(k)li = X

(k)li − αli −X

(k)l,−iβi.

Set ε(k)i = (ε

(k)1i , . . . , ε

(k)pi )

′and Y

(k)i = (X

(k)1,−iβi, . . . ,X

(k)p,−iβi)

′. It is easy to see that

Cov(ε(k)i ) = γ−1ii Σ. Let X

(k)h be the h-th column of X(k). Since ε

(k)i and X

(k)h are

34

Page 35: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

independent for h 6= i, for the p × 2 matrix (ε(k)i ,X

(k)h ), we have Cov((ε

(k)i ,X

(k)h )) =

Σ⊗ diag(γ−1ii , ψii) for h 6= i.

In addition, X(k)·,−i ∼ N(0,Σ⊗Ψ−i,−i) and βi = − 1

γiiΓ−i,i, we have

Cov(Y(k)i ) =

1

γ2iitr(Γ−i,iΓi,−iΨ−i,−i)Σ.

Further, by the fact that,

tr(Γ−i,iΓi,−iΨ−i,−i) = Γi,−iΨ−i,−iΓ−i,i = −γiiΨi,−iΓ−i,i = −γii(1− ψiiγii),

which further implies that,

Cov(Y(k)i ) =

ψiiγii − 1

γiiΣ.

Since ε(k)i and Y

(k)i are independent, Cov((ε

(k)i ,Y

(k)i )) = Σ⊗diag(γ−1ii , (ψiiγii−1)/γii). Fol-

lowing exactly the same proof of Lemma 6.1, where we replace (X(k)li , X

(k)lj ) by (ε

(k)li , X

(k)lh )

or (ε(k)li ,X

(k)l,−iβi) and Ψ[i,j] by diag(γ−1ii , ψii) or diag(γ−1ii , (ψiiγii − 1)/γii), we can obtain

Lemma 6.2 immediately.

Lemma 6.3 (i). We have, as np→∞,∑nk=1

∑pl=1(ε

(k)li ε

(k)lj − Eε(k)li ε

(k)lj )√

n1‖Σ‖2F→ N

(0,

1

γiiγjj+

γ2ij(γiiγjj)2

)in distribution.

(ii). For any M > 0, there exists a constant C such that

P(

max1≤i≤j≤q

∣∣∣σij,ε − tr(Σ)

p

γijγiiγjj

∣∣∣ ≥ C

√log max(q, np)

np

)= O((q + np)−M).

Proof. Note that Cov(ε(k)li , ε

(k)li ) =

γijγiiγjj

. It is easy to show that Cov((ε(k)i , ε

(k)j )) =

Σ⊗∆[i,j], where ∆[i,j] =

(1γii

γijγiiγjj

γjiγiiγjj

1γjj

). As in the proof of Lemma 6.1, we can write

n∑k=1

p∑l=1

ε(k)li ε

(k)lj =

n−1∑k=1

p∑l=1

λ(1)l ηki,lηkj,l, (37)

where (ηki,l, ηkj,l), 1 ≤ l ≤ p, 1 ≤ k ≤ n − 1, are i.i.d. N(0,∆[i,j]) random vectors. Note

that Var(ηki,lηkj,l) = 1γiiγjj

+γ2ij

(γiiγjj)2and

∑pl=1(λ

(1)l )2 = ‖Σ‖2F. (i) follows from Lindeberg-

Feller central limit theorem. (ii) follows from the exponential inequality in Lemma 1 in

Cai and Liu (2011).

35

Page 36: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

Proof of (9). For notational simplicity, let n1 = n − 1. Note that for all k ∈ [n] and

j ∈ [q],

ε(k)lj = ε

(k)lj − (X

(k)l,−j − Xl,−j)(βj − βj),

which implies that.

ε(k)li ε

(k)lj = ε

(k)li ε

(k)lj − ε

(k)li (X

(k)l,−j − Xl,−j)(βj − βj)

−ε(k)lj (X(k)l,−i − Xl,−i)(βi − βi)

+(βi − βi)′(X

(k)l,−i − Xl,−i)

′(X

(k)l,−j − Xl,−j)(βj − βj). (38)

Let σ = tr(Σ)/p. By the assumption C1, we have c−1 ≤ σ ≤ c. For the last term in (38),

by Cauchy-Schwarz inequality, we have∣∣∣∣∣ 1

n1p

n∑k=1

p∑l=1

(βi − βi)′(X

(k)l,−i − Xl,−i)

′(X

(k)l,−j − Xl,−j)(βj − βj)

∣∣∣∣∣≤ max

1≤i≤p

∣∣∣(βi − βi)′Ψ−i,−i(βi − βi)∣∣∣ .For any i ∈ [q], we have

|(βi − βi)′Ψ−i,−i(βi − βi)| ≤ |(βi − βi)

′(Ψ−i,−i − σΨ−i,−i)(βi − βi)|

+σ|(βi − βi)′Ψ−i,−i(βi − βi)|.

By Lemma 6.1,

max1≤i≤q

|(βi − βi)′(Ψ−i,−i − σΨ−i,−i)(βi − βi)| = OP

(a2n1

√log max(q, np)

np

).

Moreover,

|(βi − βi)′Ψ−i,−i(βi − βi)| = OP(λmax(Ψ)|βi − βi|22)

uniformly in i ∈ [q]. Combining the above arguments,∣∣∣ 1

n1p

n∑k=1

p∑l=1

(βi − βi)′(X

(k)l,−i − Xl,−i)

′(X

(k)l,−j − Xl,−j)(βj − βj)

∣∣∣= OP

(a2n2 + a2n1

√log max(q, np)

np

).

We now estimate the second term on the right hand side of (38). For 1 ≤ i ≤ j ≤ q,

write

ε(k)li (X

(k)l,−j − Xl,−j)(βj − βj) = ε

(k)li (X

(k)li − Xli)(βi,j − βi,j)Ii 6= j

36

Page 37: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

+∑h6=i,j

ε(k)li (X

(k)lh − Xlh)(βh,j − βh,j),

where βj = (β1,j, . . . , βp−1,j)′

and we set βp,j = 0. By Lemma 6.2 (i),

max1≤i≤j≤q

∣∣∣ ∑h6=i,j

1

n1p

n∑k=1

p∑l=1

ε(k)li (X

(k)lh − Xlh)(βh,j − βh,j)

∣∣∣≤ max

1≤i≤j≤qmaxh6=i,j

∣∣∣∣∣ 1

n1p

n∑k=1

p∑l=1

ε(k)li (X

(k)lh − Xlh)

∣∣∣∣∣ |βj − βj|1= OP

(an1

√log max(q, np)

np

).

A similar inequality holds for the third term on the right hand side of (38). Therefore,

1

n1p

n∑k=1

p∑l=1

ε(k)li ε

(k)lj =

1

n1p

n∑k=1

p∑l=1

ε(k)li ε

(k)lj −

1

n1p

n∑k=1

p∑l=1

ε(k)li (X

(k)li − Xli)(βi,j − βi,j)Ii 6= j

− 1

n1p

n∑k=1

p∑l=1

ε(k)lj (X

(k)lj − Xlj)(βj−1,i − βj−1,i)Ii 6= j

+OP

((a2n1 + an1)

√log max(q, np)

np+ a2n2

)(39)

uniformly in 1 ≤ i ≤ j ≤ q. By (5), we have

1

n1p

n∑k=1

p∑l=1

ε(k)li (X

(k)li − Xli) =

1

n1p

n∑k=1

p∑l=1

(ε(k)li )2 +

1

n1p

n∑k=1

p∑l=1

ε(k)li (X

(k)l,−i − Xl,−i)βi. (40)

By (39), (40) and Lemma 6.2, we have, uniformly in 1 ≤ i ≤ p,

1

n1p

n∑k=1

p∑l=1

ε(k)li (X

(k)li − Xli) =

1

n1p

n∑k=1

p∑l=1

(ε(k)li )2 +OP

(√ log max(q, np)

np

)=

1

n1p

n∑k=1

p∑l=1

(ε(k)li )2 +OP

(√ log max(q, np)

np

)+OP

((a2n1 + an1)

√log max(q, np)

np+ a2n2

), (41)

where the last equation follows from (39) with i = j. So, by (39), (41) and maxi,j |βi,j −βi,j| = OP(an1) = oP(1), for 1 ≤ i < j ≤ q,

1

n1p

n∑k=1

p∑l=1

ε(k)li ε

(k)lj =

1

n1p

n∑k=1

p∑l=1

ε(k)li ε

(k)lj −

1

n1p

n∑k=1

p∑l=1

(ε(k)li )2(βi,j − βi,j)

37

Page 38: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

− 1

n1p

n∑k=1

p∑l=1

(ε(k)lj )2(βj−1,i − βj−1,i)

+OP

((a2n1 + an1)

√log max(q, np)

np+ a2n2

).

So

1

n1p

n∑k=1

p∑l=1

ε(k)li ε

(k)lj +

1

n1p

n∑k=1

p∑l=1

(ε(k)li )2βi,j +

1

n1p

n∑k=1

p∑l=1

(ε(k)lj )2βj−1,i

=1

n1p

n∑k=1

p∑l=1

ε(k)li ε

(k)lj +

1

n1p

n∑k=1

p∑l=1

(ε(k)li )2βi,j +

1

n1p

n∑k=1

p∑l=1

(ε(k)lj )2βj−1,i

+OP

((a2n1 + an1)

√log max(q, np)

np+ a2n2

)= −bij

ωijωiiωjj

+

∑nk=1

∑pl=1(ε

(k)li ε

(k)lj − Eε(k)li ε

(k)lj )

n1p

+OP

((a2n1 + an1)

√log max(q, np)

np+ a2n2

)(42)

uniformly in 1 ≤ i < j ≤ p, where the last equality follows from (39) with i = j,

βi,j = −ωijωjj

, βj−1,i = −ωjiωii

, E(ε(k)li ε

(k)lj ) = n−1

n

γijγiiγjj

σll and maxi,j |βi,j| ≤ C for some

constant C > 0. The proof of (9) is complete.

Proof of Proposition 3.1. The proposition follows from Lemma 6.3, (9) and (39) with

i = j.

Proof of Proposition 3.2. Let φij = tr(Γ)qσij and define λ = λ

√logmax(p,nq)

nq. We have

p∑j=1

σ2ij,λ =

p∑j=1

(σ2ij − φ2

ij)I|σij| ≥ λ+

p∑j=1

φ2ijI|σij| ≥ λ.

Also by Lemma 6.1, with probability tending to one,

p∑j=1

φ2ijI|σij| < λ ≤

p∑j=1

φ2ijI|φij| < 2λ = O(λ2−τ )

uniformly in 1 ≤ i ≤ p by the assumption (C2). So, for the last term,

p∑j=1

φ2ijI|σij| ≥ λ =

p∑j=1

φ2ij −

p∑j=1

φ2ijI|σij| < λ = (1 +OP(λ2−τ ))

p∑j=1

φ2ij

38

Page 39: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

uniformly in 1 ≤ i ≤ p. Moreover, with probability tending to one,

p∑j=1

|σ2ij − φ2

ij|I|σij| ≥ λ =

p∑j=1

|σij + φij||σij − φij|I|σij| ≥ λ

≤ C

p∑j=1

|φij||σij − φij|I|φij| ≥ λ/2

≤ Cλ−(τ−1)∨0p∑j=1

|φij|τ |σij − φij|

≤ Cλ(2−τ)∧1,

where the last inequality following from max1≤i≤j≤q |σij − φij| = OP(λ) by Lemma 6.1. It

implies that max1≤i≤q |∑p

j=1 σ2ij,λ −

∑pj=1 φ

2ij| = OP(λ(2−τ)∧1) and hence

‖Σλ‖2F‖Σ‖2F

= OP(λ(2−τ)∧1).

By Lemma 6.1, we have tr(Σλ)/tr(Σ) = OP(λ). This implies this proposition holds.

Proof of Proposition 3.3. Define

aj =1

(n− 1)p

n∑k=1

p∑l=1

(X(k)l,−j − Xl,−j)

′(X

(k)lj − Xlj).

We let λnj and αj denote λnj(δ) = δ√

ψjj log q

npand αj(δ) (defined in (19)), respectively.

By the Karush-Kuhn-Tucker (KKT) condition, we have∣∣∣D−1/2j Ψ−j,−jβj −D−1/2j aj

∣∣∣∞≤ λnj. (43)

By Lemma 6.1, we have c−1 ≤ min1≤j≤q−1 Dj ≤ max1≤j≤q−1 Dj ≤ c for some c > 0 with

probability tending to one. This, together with Lemma 6.2, implies that, for sufficiently

large δ, ∣∣∣ 1

(n− 1)pD−1/2j

n∑k=1

p∑l=1

(X(k)l,−j − Xl,−j)

′ε(k)lj

∣∣∣∞≤ 1

2λnj (44)

uniformly in 1 ≤ j ≤ q, with probability tending to one. Note that ε(k)lj = X

(k)lj − Xlj −

(X(k)l,−j − Xl,−j)βj. Therefore,∣∣∣D−1/2j Ψ−j,−jβj −D

−1/2j aj

∣∣∣∞≤ 1

2λnj (45)

39

Page 40: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

uniformly in 1 ≤ j ≤ q. Note that inequalities (43) and (45) imply that∣∣∣D−1/2j Ψ−j,−j(βj − βj)∣∣∣∞≤ 3

2λnj. (46)

Define Λ = diag(Ψ)−1/2Ψdiag(Ψ)−1/2. For any subset T ⊂ 1, 2, · · · , q−1 and ν ∈ Rq−1

with |T | = o(√

nplogmax(q,np)

)and |νT c |1 ≤ c|νT |1 for some c > 0, by Lemma 6.1 and the

conditions in Proposition 3.3, we have

ν′D−1/2j Ψ−j,−jD

−1/2j ν ≥ λmin(Λ−j,−j)|ν|22 −OP

(√ log max(q, np)

np

)|ν|21 ≥ |ν|22/C, (47)

for some constant C > 0, where the first inequality follows from the fact

|ν ′(D−1/2j Ψ−j,−jD−1/2j −Λ−j,−j)ν| ≤ |D−1/2j Ψ−j,−jD

−1/2j −Λ−j,−j|∞|ν|21

and the second inequality follows from the fact |ν|21 ≤ (1 + c)2|νT |21 ≤ (1 + c)2|T ||ν|22.Now let T be the support of βj, αj = D

1/2j βj and ν = D

1/2j (βj −βj) = αj −αj. We

first show that |νT c|1 ≤ 3|νT |1 uniformly in 1 ≤ j ≤ q with probability tending to one.

Define

Q(αj) =1

2(n− 1)p

n∑k=1

p∑l=1

(X(k)lj − Xlj − (X

(k)l,−j − Xl,−j)D

−1/2j αj)

2,

S(αj) = D−1/2j aj −D

−1/2j Ψ−j,−jβj.

Note that S(αj) is the gradient of Q(αj). By the definition of αj, we have

Q(αj)−Q(αj) ≤ λnj(δ)|αj|1 − λnj|αj|1 ≤ λnj(|νT |1 − |νT c|1),

and by (45), with probability tending to one,

Q(αj)−Q(αj) ≥ S′(αj)ν ≥ −

1

2λnj|ν|1 = −1

2λnj(|νT |1 + |νT c |1)

uniformly in 1 ≤ j ≤ q. It follows from the above two inequalities that |νT c|1 ≤ 3|νT |1.So by (46) and (47) we have

|ν|22 ≤ Cν′D−1/2j Ψ−j,−jD

−1/2j ν

≤ C|D−1/2j Ψ−j,−jD−1/2j ν|∞|ν|1

≤ 3

2Cλnj(|νT |1 + |νT c |1)

40

Page 41: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

≤ 6Cλnj|νT |1≤ 6Cλnj

√|βj|0|νT |2

uniformly in 1 ≤ j ≤ q with probability tending to one. By noting that c−1 ≤ min1≤j≤q−1 Dj ≤max1≤j≤q−1 Dj ≤ c with probability tending to one, we have |βj − βj|2 ≤ c|ν|2. Hence,

by the conditions in Proposition 3.3, we have an2 = oP

((np log q)−1/4

). Note that

|ν|1 ≤ 4|νT |1 ≤ 4√|βj|0|νT |2 = o((log max(q, np))−1) uniformly in 1 ≤ j ≤ q − 1

with probability tending to one. This proves Proposition 3.3 holds.

6.3 Proof of Theorems 3.1-3.3

Based on the results established above, now the proof of Theorems 3.1 and 3.2 follows

exactly from the proof of Theorem 3.1 in Liu [2013]. We first give some key lemmas which

are the generalization of Lemmas 6,1 and 6.2 in Liu [2013] from i.i.d. case to independent

case (but not necessarily identically distributed).

Let ξ1, . . . , ξn be independent d-dimensional random vectors with mean zero. Let

G(t) = 2 − 2Φ(t) and define | · |(d) by |z|(d) = min|zi|; 1 ≤ i ≤ d for z = (z1, . . . , zd)′.

Let (p, n) be a sequence of positive integers and the constants c, r, b, γ,K,C mentioned

below do not depend on (p, n).

Lemma 6.4 Suppose that p ≤ cnr and max1≤k≤n E|ξk|bdr+2+ε2 ≤ K for some fixed c > 0,

r > 0, b > 0, K > 0 and ε > 0. Assume that ‖ 1nCov(

∑nk=1 ξk) − Id‖ ≤ C(log p)−2−γ for

some γ > 0 and C > 0. Then we have

sup0≤t≤

√b log p

∣∣∣P(|∑n

k=1 ξk|(d) ≥ t√n)

(G(t))d− 1∣∣∣ ≤ C(log p)−1−γ1

for γ1 = minγ, 1/2.

Let ηk = (ηk1, ηk2)′, 1 ≤ k ≤ n, are independent 2-dimensional random vectors with

mean zero.

Lemma 6.5 Suppose that p ≤ cnr and max1≤k≤n E|ηk|2br+2+ε2 < ∞ for some fixed c >

0, r > 0, b > 0 and ε > 0. Assume that∑n

k=1 Var(ηk1) =∑n

k=1 Var(ηk2) = n and

| 1n

∑nk=1 Cov(ηk1, ηk2)| ≤ δ for some 0 ≤ δ < 1. Then we have

P(|

n∑k=1

ηk1| ≥ t√n, |

n∑k=1

ηk2| ≥ t√n)≤ C(t+ 1)−2 exp(−t2/(1 + δ))

41

Page 42: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

uniformly for 0 ≤ t ≤√b log p, where C only depends on c, b, r, ε, δ.

The proofs of Lemmas 6.4 and 6.5 are the same as those of Lemma 6.1 and 6.2 in Liu

[2013].

Recall βki,l in (37). For 1 ≤ i < j ≤ q, let

Uij =

∑n−1k=1

∑pl=1 λ

(1)l (ηki,lηkj,l − Eηki,lηkj,l)(γiiγjj)1/2√

(n− 1)pEp, (48)

where Ep = p−1∑p

l=1(λ(1)l )2. Note that λ

(1)l are bounded away from zero and infin-

ity. Also, Var(ηki,lηkj,l) = (γiiγjj)−1(1 + γ2ij(γiiγjj)

−1), Var(Uij) = 1 + γ2ij(γiiγjj)−1 and

Corr(Uij, Ukl) = Corr(η1i,1η1j,1, η1k,1η1l,1). By Lemma 6.4 with d = 1, we have

max1≤i,j≤q

sup0≤t≤4

√log q

∣∣∣P(|Uij| ≥ t√

1 + γ2ij(γiiγjj)−1)

G(t)− 1∣∣∣ ≤ C(log q)−1−ε

for some ε > 0. Therefore, max1≤i,j≤q |Uij| = OP(√

log q). This, together with (39) with

i = j, Lemma 6.3, Proposition 3.2, (9) and (26), implies that

max1≤i<j≤q

∣∣∣√(n− 1)p

riirjjAp(Tij + bij

γijγiiγjj

)− Uij∣∣∣ = oP((log q)−1/2)

as (np, q) → ∞. Now Theorem 3.1 follows from the proof of Theorem 3.1 in Liu [2013]

step by step, by using Lemmas 6.4 and 6.5 and replacing Uij in Liu [2013] by Uij in

(48) and the sample size in Liu [2013] by (n− 1)p. The proof of Theorem 3.2 is similar.

Theorem 3.3 follows from the formula of FDP and Theorems 3.1 and 3.2.

6.4 Proof of Propositions 4.1 and 4.2

Let τ ′ = τ ∨ 1. Note that

Var(

p∑l=1

Xli) = ψii∑

1≤k,l≤p

σlk ≤ ψiip2−1/τ ′ max

1≤l≤p(

p∑k=1

στ′

lk)1/τ′.

It follows from the tail probability of normal distribution that for any M > 0, there exists

a constant C > 0 such that

P(

max1≤i≤q

∣∣∣∑pl=1(Xli − EXli)

p

∣∣∣ ≥ C

√log max(p, q)

p1/(2τ ′)

)= O((p+ q)−M). (49)

42

Page 43: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

Similarly, we have

P(

max1≤i≤q

∣∣∣∑pl=1 εlip

∣∣∣ ≥ C

√log max(p, q)

p1/(2τ ′)

)= O((p+ q)−M) (50)

and

P(

max1≤i≤q

∣∣∣∑pl=1(Xl,−i − EXl,−i)βi

p

∣∣∣ ≥ C

√log max(p, q)

p1/(2τ ′)

)= O((p+ q)−M).

Without loss of generality, we assume that µ = 0. Let Ψ = 1pY′Y = (ψij)1≤i,j≤q,

εli = εli − εi with εi = 1p

∑pl=1 εli and X−i = 1

p

∑pl=1 Xl,−i. Note that

1

p

p∑l=1

(Xli − Xi)(Xlj − Xj) =1

p

p∑l=1

XliXlj − XiXj,

1

p

p∑l=1

(εli − εi)(εlj − εj) =1

p

p∑l=1

εliεlj − εiεj. (51)

By a similar equation as (36) with n = 2, we have

P(|1p

p∑l=1

(XliXlj − EXliXlj)| ≥ C

√log max(p, q)

p

)= O((q + p)−M). (52)

This, together with (49) and (51), implies that Lemma 6.1 holds with n = 1.

Similarly, as Lemma 6.2 and Lemma 6.3, we have

P(

max1≤i≤q

max1≤h≤q,h6=i

∣∣∣1p

p∑l=1

εli(Xlh − Xh)∣∣∣ ≥ C

√log max(q, p)

p

)= O((q + p)−M),

P(

max1≤i≤p

∣∣∣1p

p∑l=1

εli(Xl,−i − X−i)βi

∣∣∣ ≥ C

√log max(q, p)

p

)= O((q + p)−M)

and

P(

max1≤i≤j≤q

∣∣∣σij,ε − tr(Σ)

p

γijγiiγjj

∣∣∣ ≥ C

√log max(q, p)

np

)= O((q + p)−M).

By Lindeberg-Feller central limit theorem and a similar equation as (37), we have∑pl=1(εliεlj − Eεliεlj)√

‖Σ‖2F→ N

(0,

1

γiiγjj+

γ2ij(γiiγjj)2

)43

Page 44: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

in distribution. By the fact τ ′ < 2, (50) and (51),∑pl=1(εliεlj − Eεliεlj)√

‖Σ‖2F→ N

(0,

1

γiiγjj+

γ2ij(γiiγjj)2

)in distribution. By the proof of (9), we can obtain Proposition 4.1.

We now show that Ap/Ap = 1 +OP(λ(2−τ)∧1). We have

1

q

q∑l=1

(Xil − Xl)(Xjl − Xl) =1

q

q∑l=1

XilXjl −1

q

q∑l=1

(Xil +Xjl)Xl +1

q

q∑l=1

X2l .

Note that 1q

∑ql=1 X

2l = OP

(log(p∨q)p1/τ

), and as (52), we have

P(|1q

q∑l=1

(XilXjl − EXilXjl)| ≥ C

√log(p ∨ q)

q

)= O((q + p)−M).

Let Zi = XiΣ−1/2 and Z = 1

p

∑pi=1Zi = (Z1, . . . , Zq). So

1

q

q∑l=1

XilXl =1

qZiΨZ

′=

1

qZiUDU

′Z′,

where U′DU is the eigenvalue decomposition of Ψ, U is an orthogonal matrix and D =

diag(λ(2)1 , . . . , λ

(2)q ). Set ZiU = (zi1, . . . , ziq) and ZU = (z1, . . . , zq). It is easy to see that

(zil, zl), 1 ≤ l ≤ q, are independent. It follows that

1

q

q∑l=1

XilXl =1

q

q∑l=1

λ(2)l zilzl.

So we have

P(1

q

∣∣∣ q∑l=1

λ(2)l (zilzl − Ezilzl)√

Var(z1)

∣∣∣ ≥ C

√log(p ∨ q)

q

)= O((q + p)−M).

Note that Var(z1) = p−2∑

1≤i,j≤p σij ≤ p−1/2√λ(1)p and |Ezilzl| = |p−1

∑pj=1 σij| ≤

p−1/2√λ(1)p . This implies that

P(1

q

∣∣∣ q∑l=1

XilXl

∣∣∣ ≥ C

√log(p ∨ q)p ∧ q

)= O((q + p)−M).

44

Page 45: Statistical Inference for Matrix-variate Gaussian ... · Similar to the vector-variate Gaussian graphical models (GGMs) where the conditional independence among variables is encoded

Combining the above arguments, we obtain that for any M > 0, there exists some C > 0

such that

P(1

q

∣∣∣ q∑l=1

(Xil − Xl)(Xjl − Xl)−tr(Ψ)

qσij

∣∣∣ ≥ C

√log(p ∨ q)p ∧ q

)= O((q + p)−M).

By the proof of Proposition 3.2, we can show that Ap/Ap = 1 +OP(λ(2−τ)∧1). Proposition

4.2 is proved by Proposition 4.1.

45