Between Pathway Interactions Matrix with Applications to ...homepage.fudan.edu.cn/xiayin/files/2017/02/... · procedure is applied to a breast cancer gene expression study to identify

Full Terms & Conditions of access and use can be found athttp://www.tandfonline.com/action/journalInformation?journalCode=uasa20

Download by: [University North Carolina - Chapel Hill] Date: 12 February 2017, At: 21:30

Journal of the American Statistical Association

ISSN: 0162-1459 (Print) 1537-274X (Online) Journal homepage: http://www.tandfonline.com/loi/uasa20

Multiple Testing of Submatrices of a PrecisionMatrix with Applications to Identification ofBetween Pathway Interactions

Yin Xia, Tianxi Cai & T. Tony Cai

To cite this article: Yin Xia, Tianxi Cai & T. Tony Cai (2016): Multiple Testing of Submatrices of aPrecision Matrix with Applications to Identification of Between Pathway Interactions, Journal of theAmerican Statistical Association, DOI: 10.1080/01621459.2016.1251930

To link to this article: http://dx.doi.org/10.1080/01621459.2016.1251930

View supplementary material

Accepted author version posted online: 16Dec 2016.

Submit your article to this journal

Article views: 96

View related articles

View Crossmark data

http://www.tandfonline.com/action/journalInformation?journalCode=uasa20

http://www.tandfonline.com/loi/uasa20

http://www.tandfonline.com/action/showCitFormats?doi=10.1080/01621459.2016.1251930

http://dx.doi.org/10.1080/01621459.2016.1251930

http://www.tandfonline.com/doi/suppl/10.1080/01621459.2016.1251930

http://www.tandfonline.com/doi/suppl/10.1080/01621459.2016.1251930

http://www.tandfonline.com/action/authorSubmission?journalCode=uasa20&show=instructions

http://www.tandfonline.com/action/authorSubmission?journalCode=uasa20&show=instructions

http://www.tandfonline.com/doi/mlt/10.1080/01621459.2016.1251930

http://www.tandfonline.com/doi/mlt/10.1080/01621459.2016.1251930

http://crossmark.crossref.org/dialog/?doi=10.1080/01621459.2016.1251930&domain=pdf&date_stamp=2016-12-16

http://crossmark.crossref.org/dialog/?doi=10.1080/01621459.2016.1251930&domain=pdf&date_stamp=2016-12-16

ACCEPTED MANUSCRIPT

Multiple Testing of Submatrices of a Precision Matrixwith Applications to Identification of Between Pathway

Interactions

Yin Xia1, Tianxi Cai2, and T. Tony Cai3

Abstract

Making accurate inference for gene regulatory networks, including inferring about path-

way by pathway interactions, is an important and difficult task. Motivated by such genomic

applications, we consider multiple testing for conditional dependence between subgroups of

variables. Under a Gaussian graphical model framework, the problem is translated into si-

multaneous testing for a collection of submatrices of a high-dimensional precision matrix with

each submatrix summarizing the dependence structure between two subgroups of variables.

A novel multiple testing procedure is proposed and both theoretical and numerical prop-

erties of the procedure are investigated. Asymptotic null distribution of the test statistic for

an individual hypothesis is established and the proposed multiple testing procedure is shown

to asymptotically control the false discovery rate (FDR) and false discovery proportion (FDP)

at the pre-specified level under regularity conditions. Simulations show that the procedure

works well in controlling the FDR and has good power in detecting the true interactions. The

procedure is applied to a breast cancer gene expression study to identify between pathway

interactions.

Keywords: Between pathway interactions, conditional dependence, covariance structure, false discovery

proportion, false discovery rate, Gaussian graphical model, multiple testing, precision matrix, testing sub-

matrices.

1Department of Statistics, Fudan University and Department of Statistics & Operations Research, University of NorthCarolina at Chapel Hill. The research of Yin Xia was supported in part by “The Recruitment Program of GlobalExperts” Youth Project from China, the startup fund from Fudan University and NSF Grant DMS-1612906.

2Department of Biostatistics, Harvard School of Public Health, Harvard University. The research ofTianxi Cai was supported in part by NIH Grants R01 GM079330, P50 MH106933, and U54 HG007963.

3Department of Statistics, The Wharton School, University of Pennsylvania. The research of Tony Caiwas supported in part by NSF Grants DMS-1208982 and DMS-1403708, and NIH Grant R01 CA127334.

1ACCEPTED MANUSCRIPT

ACCEPTED MANUSCRIPT

1 Introduction

Simultaneous inference for the interactions among a large number of variables is an important

problem in statistics with a wide range of applications. Many statistical methods have been pro-

posed to infer about pairwise interactions (Ritchie et al., 2001; Chatterjee et al., 2006; Kooperberg

and Ruczinski, 2005; Kooperberg and LeBlanc, 2008; Fan and Lv, 2008; Cai and Zhang, 2014;

Cai and Liu, 2015, e.g). Most of the existing methods focus on marginal assessments of pairwise

interactions without conditioning on the other variables. Such marginal methods may result in

false identification of interactions due to the discrepancy between conditional and unconditional

effects. When prior knowledge is available to group the variables of interest, it is often of interest

to make simultaneous inference for the interactions at the group level. For example, functionally

related genes are often grouped into pathways and inferring about between pathway interactions is

important as they represent a majority of the genetic interactions (Kelley and Ideker, 2005).

Motivated by applications in genomics, in this paper we propose methods to efficiently identify

between group interactions while accounting for the joint effects from all other variables of inter-

est. Under a Gaussian graphical model framework, we translate the problem of detecting between

group interactions into the statistical problem of simultaneous testing of a collection of submatri-

ces of a high-dimensional precision matrix. We first discuss the motivating problem of detecting

between pathway interactions before presenting the framework for large-scale multiple testing of

submatrices of a high-dimensional precision matrix.

1.1 Detection of Between Pathway Interactions

It is well known that genes interact functionally in networks to orchestrate cellular processes. Bio-

logical interactions of genes are often inferred based on co-expression networks since co-expressed

genes tend to be functionally related or controlled by the same transcriptional regulatory elements

(Weirauch, 2011). Throughout, we use the term gene-gene interaction to refer totheir biological

interaction, quantified by conditional co-expression (given all other genes), rather than statistical


ACCEPTED MANUSCRIPT

interaction unless specified otherwise. Accurately identifying important gene-gene interactions is

a difficult task due to the high dimensionality of the feature space spanned by gene pairs. Particu-

larly in genome-wide studies where the sample sizes are typically small compared to the number

of interactions of interest, gene level analyses often produce results that are difficult to interpret or

replicate.

One approach to improve the interpretability and reproducibility is to incorporate prior biolog-

ical knowledge such as gene structure or protein-protein interaction network information to group

functionally related genes into pathways and perform analysis at the pathway level. Throughout,

we use the termpathwayto refer generically a gene group under study, whether or not the group

is indeed representing a metabolic or signaling pathway. A large number of knowledge bases

have become available to assemble biologically meaningful gene groups (Xenarios et al., 2002;

Rual et al., 2005; Matthews et al., 2009; Craven and Kumlien, 1999; Khatri et al., 2012). The

knowledge bases provide prior information on biological processes, components, or structures in

which individual genes and proteins are involved in. Analyzing high-throughput molecular mea-

surements at the functional level is very appealing due to its potential in reducing the complexity

of the problem and improving the power (Subramanian et al., 2005; Glazko and Emmert-Streib,

2009).

Detecting pathway level interactions is also biologically relevant because in order to produce

appropriate physiological responses to both internal and external factors, pathways often need

to function in a coordinated fashion due to the complex nature of biological systems. In addi-

tion, there is accumulating evidence that complex traits are often influenced by multiple groups

of functional related genes through their dynamic interaction and co-regulation (Jia et al., 2011).

Therefore, the knowledge of pathway crosstalk network is helpful for inferring the function of

complex biological systems (Li et al., 2008). A wide range of between pathway crosstalk have

been identified as critical for understanding many diseases including breast cancer, lung cancer,

ovarian cancer, major depression disorder, and Alzheimer (Osborne et al., 2005; Shou et al., 2004;

Jia et al., 2011; Liu et al., 2010; Pan, 2012; Puri et al., 2008).


ACCEPTED MANUSCRIPT

In addition to applications to the identification of between genetic pathway interactions, the

proposed procedures are also useful for other settings. Examples include interactions between

biological markers when markers are measured at different time points with multiple measurements

of each marker representing one group; and interactions between different brain regions when

functional MRI measurements are taken over the entire brain with groups indexed by brain regions.

We next describe our proposed framework for detecting between pathway interactions based on

testing for submatrices of a high-dimensional precision matrix.

1.2 Multiple Testing of Submatrices of A Precision Matrix

Under a Gaussian graphical model framework, we formulate the problem of identifying between

group interactions that account for joint effects from all genes of interest as the statistical problem

of simultaneous testing of submatrices of a high-dimensional precision matrix. Let{X1, ∙ ∙ ∙ , Xn}

be a random sample consisting ofn independent copies of ap dimensional Gaussian random vector

X ∼ Np(μ,Σ). The precision matrix, which is the inverse ofΣ, is denoted byΩ = (ωi, j). It is well-

known that the precision matrix is closely connected to the corresponding Gaussian graphG =

(V,E), which represents the conditional independence between components ofX = (X1, . . . ,Xp)T.

HereV is the vertex set consisting of thep componentsX1, . . . ,Xp andE is the edge set consisting

of ordered pairs (i, j), where (i, j) ∈ E if there is an edge betweenXi andXj, indicating thatXi

andXj are conditionally dependent given{Xk, k , i, j}. It is a well-known fact that the conditional

independence betweenXi and Xj given all other variables is equivalent toωi, j = 0. See, e.g.,

Lauritzen(1996).

LetJ1, . . . ,JM ⊂ {1, . . . , p} be a collection of prespecified non-overlapping sets which index

group memberships (e.g. pathway membership), we wish to test simultaneously the hypotheses

of the conditional independence between any two gene groups given all remaining genes in the

collection with proper control of the false discovery rate (FDR) and false discovery proportion

(FDP) asymptotically. It follows from the above discussion that this multiple testing problem can

be equivalently formulated as testing the hypotheses on the submatrices of the precision matrixΩ,


ACCEPTED MANUSCRIPT

H0,m,h : ΩJm×Jh = 0 versusH1,m,h : ΩJm×Jh , 0, 1 ≤ m< h ≤ M, (1)

while controlling the FDR and FDP asymptotically. Hereafter, all results related to the FDR and

FDP are studied in the asymptotic regime and we use FDR and FDP as simplifications for the

expressions of asymptotic FDR and asymptotic FDP.

Simultaneous testing of between group interactions with FDR control is technically challeng-

ing, both in constructing a suitable test statistic and establishing its null distribution for testing the

interactions between any two given groups and in developing a multiple testing procedure that ac-

counts for the multiplicity and dependency with FDR control. To the best of our knowledge, there

are no currently available methods with theoretical guarantees to infer about interactions between

pre-specified gene groups that adjust for effects from a large number of other genes. Furthermore,

no existing methods allow the testing for such group level interactions while properly controlling

a desired FDR.Liu (2013) proposed a multiple testing procedure with the FDR control for the

partial correlations under a Gaussian graphical model.Xia et al.(2015) considered the problem of

identifying gene-by-gene interactions associated with a binary trait under a two-sample framework

and proposed a procedure for testing the differential network by simultaneously testing entry-wise

hypotheses with FDR control. These methods, which can identify the locations of individual gene-

by-gene interactions, are however unable to detect the presence interactions between pairs of gene

groups while controlling the FDR at the group level.

In this paper, we propose a novel multiple testing procedure for between group interactions that

controls the FDR and FDP asymptotically at any pre-specified level 0< α < 1. The simultaneous

testing procedure is developed in two steps. In the first step, we construct a test statistic for testing

the conditional independence of a given pair of variable groupsJm andJh, H0,m,h : ΩJm×Jh = 0,

with m, h. The test statistic is based on the Frobenius norm of a standardized submatrix estimate

with unknown correlation structure. The estimation of this dependency structure is technically

challenging, because correlations among the estimates of the entries ofΩJm×Jh not only depend on


ACCEPTED MANUSCRIPT

the entries within the submatrix, but also largely depend on the entries outside of it. To incorpo-

rate this dependency structure, we estimate the eigenvalues of the correlation matrix of the entry

estimates of a given submatrixΩJm×Jh through a Kronecker product by estimating the eigenvalues

of two partial correlation submatricesRJm×Jm and RJh×Jh of R = D−1/2ΩD−1/2, whereD is the

diagonal matrix ofΩ. It is shown that the test statistic has asymptotically the same limiting null

distribution as a mixture ofχ21 with the estimated correlation structure.

In the second step, we construct a simultaneous testing procedure based on these test statistics.

A major difficulty here is that the correlation structures of the entry estimates vary across different

submatrices. Consequently the limiting null distributions of the test statistics for different subma-

trices are different. We introduce a normal quantile transformation for each test statistic, and the

transformed test statistics are shown to have asymptotically the same distribution as the absolute

value of a standard normal random variable under the null. Based on them, we develop a multiple

testing procedure to account for the multiplicity in testing a large number of hypotheses so that the

overall FDR and FDP are controlled.

Both the theoretical and numerical properties of the proposed procedure are investigated. The

theoretical results show that, under regularity conditions, the proposed procedure asymptotically

controls both the overall FDR and FDP at the pre-specified level. As a comparison, it is discussed

in Section4.3that a direct application of the well-known B-H procedure (Benjamini and Hochberg,

1995) to the individual test statistics is not able to control the FDP when the number of true al-

ternatives is fixed. Simulation studies are carried out to examine the numerical performance of

the multiple testing procedure in various settings. The results show that the procedure performs

well numerically in terms of both the size and power of the test. We also consider a simulation

setting that is similar to the breast cancer gene expression data analyzed in this paper by mimick-

ing the true sizes of the gene groups in the breast cancer study. The result shows that the FDR

is well controlled and this new group level based method significantly outperforms the alternative

procedures.

Finally, we apply the proposed procedure to assess the between pathway interactions in a breast


ACCEPTED MANUSCRIPT

cancer gene expression study. Many of the identified interactions are consistent with those reported

in the literature.

1.3 Structure of the Paper

The rest of the paper is organized as follows. We give a detailed construction of the statistic for

testing a specific submatrix of a precision matrix in Section2. The limiting null distribution of

the test statistic and the theoretical properties of the testing procedure are obtained in Section3.

A multiple testing procedure for simultaneously assessing a collection of submatrices is proposed

and its theoretical properties are established in Section4. Simulation results demonstrating the

performance of the proposed methods in finite sample are given in Section5. In Section6, we apply

the new multiple testing procedure to a breast cancer gene expression study to identify between

pathway interactions. A discussion on possible extensions is given in Section7. All proofs are

contained in the supplementXia et al.(2016).

2 Testing A Given Submatrix

We consider in this section testing a given submatrix of the precision matrixΩ,

H0 : ΩI×J = 0 versus H1 : ΩI×J , 0, (2)

under the framework of Section1.2, whereI andJ index two non-overlapping gene groups.

A rejection of H0 means that at least one pair of variables fromI andJ are not conditionally

independent from each other given all other variables. As the group information is considered

as prior knowledge, performing analysis at the group level is more appealing than the entrywise

procedure as discussed in Section1. We shall construct a test statistic forH0, corresponding to

no interactions between gene groupsI andJ conditional on all other genes. Related works on

testing for independence and conditional independence between random vectors can be found in,

e.g.,Gieser and Randles(1997); Um and Randles(2001); Beran et al.(2007); Su and White(2007,

2008); andHuang et al.(2010).


ACCEPTED MANUSCRIPT

2.1 Notation and Definitions

DenoteA ⊗ B the Kronecker product of matrixA and B. For a vectorβ = (β1, . . . , βp)T ∈ Rp,

define the q norm by|β|q = (∑p

i=1 |βi |q)1/q for 1 ≤ q ≤ ∞. For any vectorμ with dimensionp× 1,

let μ−i denote the (p− 1)× 1 vector by removing theith entry fromμ. For a symmetric matrixA,

let λmax(A) andλmin(A) denote the largest and smallest eigenvalues ofA. For anyp × q matrix

A, Ai,− j denotes theith row of A with its jth entry removed andA−i, j denotes thejth column of

A with its ith entry removed.A−i,− j denotes the (p − 1) × (q − 1) submatrix ofA with its ith row

and jth column removed.Ar×c denotes the submatrix ofA corresponding to the row vectorr and

column vectorc. For an× p data matrixU = (U1, . . . ,Un)T , denote ann× (p− 1) matrixU∙,−i =

(UT1,−i , . . . ,U

Tn,−i)

T. Let U∙,−i = 1/n∑n

k=1 Uk,−i with dimension 1× (p − 1), U(i) = (U1,i , . . . ,Un,i)T

with dimensionn × 1, U(i) = (Ui , . . . , Ui)T with dimensionn × 1, whereUi = 1/n∑n

k=1 Uk,i, and

U(∙,−i) = (UT

∙,−i , . . . , UT

∙,−i)T with dimensionn× (p−1). For a matrixΩ = (ωi, j)p×p, the matrix 1-norm

is defined by‖Ω‖L1 = max1≤ j≤p∑p

i=1 |ωi, j | and the matrix elementwise infinity norm is defined to

be ‖Ω‖∞ = max1≤i, j≤p |ωi, j |. For a setH , denote|H| the cardinality ofH . For two sequences of

real numbers{an} and{bn}, write an = O(bn) if there exists a constantC such that|an| ≤ C|bn| holds

for all n, write an = o(bn) if lim n→∞ an/bn = 0, and writean � bn if lim n→∞ an/bn = 1.

2.2 Testing Procedure

We shall first define a standardized estimateWi, j for each individual entry of the precision matrix,

which is the one-sample version of the estimates proposed inXia et al. (2015), then propose a

novel test statisticSI×J based on the sum of all possibleW2i, j, for (i, j) ∈ I × J .

It is well known that in the Gaussian setting, the precision matrix can be described in terms of

the regression models, see, e.g., Section 2.5 inAnderson(2003). Specifically, we may write

Xk,i = αi + Xk,−iβi + εk,i , 1 ≤ k ≤ n, (3)

whereεk,i ∼ N(0, σi,i − Σi,−iΣ−1−i,−iΣ−i,i) is independent ofXk,−i, andαi = μi − Σi,−iΣ

−1−i,−iμ−i. The


ACCEPTED MANUSCRIPT

regression coefficient vectorβi and the error termsεk,i satisfy

βi = −ω−1i,i Ω−i,i and ri, j ≡ Cov(εk,i , εk, j) = ωi, j/(ωi,iω j, j).

As in Xia et al. (2015), we first develop an estimator ofωi, j and then base the test on its bias

corrected standardization. We begin by constructing estimators ofri, j.

Let βi = (β1,i , ∙ ∙ ∙ , βp−1,i)T be estimators ofβi satisfying

max1≤i≤p|βi − βi |1 = oP{(log p)−1}, (4)

max1≤i≤p|βi − βi |2 = oP

{(n log p)−1/4

}. (5)

Such estimators can be obtained easily via the standard methods such as the Lasso and Dantzig

Selector, see, e.g.,Xia et al. (2015) Section 2.3. Specifically, if we use the Lasso estimator (see

(18) in Section5), then equations (4) and (5) can be satisfied under the condition (C1) in Section3

and the sparsity condition max1≤i≤p |βi |0 = o{n1/2/(log p)3/2}.

Define the fitted residuals by

εk,i = Xk,i − Xi − (Xk,−i − X−i)βi ,

whereXi =1n

∑nk=1 Xk,i, X−i =

1n

∑nk=1 Xk,−i. A natural estimator ofri, j is the sample covariance

between the residuals

r i, j =1n

n∑

k=1

εk,i εk, j . (6)

However, wheni , j, r i, j tends to be biased due to the correlation induced by the estimated

parameters.Xia et al.(2015) proposed a bias corrected estimator ofri, j as

r i, j = −(r i, j + r i,i βi, j + r j, j β j−1,i), for 1 ≤ i < j ≤ p.

For i = j, we let r i,i = r i,i, which is a nearly unbiased estimator ofri,i. For 1≤ i < j ≤ p, a

natural estimator ofωi, j can then be defined by

Ti, j = r i, j/(r i,i ∙ r j, j).


ACCEPTED MANUSCRIPT

Since{Ti, j ,1 ≤ i < j ≤ p} are heteroscedastic and can possibly have a wide range of variability, we

shall first standardizeTi, j. To estimate its variance, note that

θi, j ≡ Var(εk,iεk, j/(ri,i r j, j))/n = (1+ ρ2i, j)/(nri,i r j, j),

whereρ2i, j = β

2i, j ri,i/r j, j. Thenθi, j can be estimated byθi, j = (1+ β2

i, j r i,i/r j, j)/(nri,i r j, j).

Define the standardized statistics

Wi, j = Ti, j/(θi, j)1/2, for 1 ≤ i < j ≤ p. (7)

Finally, we propose the following test statistic for testing a given submatrixΩI×J ,

SI×J =∑

(i, j)∈I×J

W2i, j . (8)

We detail in Section3 statistical properties of the proposed test statistic.

3 Theories on Testing A Given Submatrix

In this section, we investigate the theoretical properties including the limiting null distribution and

the asymptotic power. We first show that the null distribution ofSI×J converges to the distribution

of a mixture ofχ21 variables as (n, p) → ∞ and then demonstrate that the test based onSI×J is

powerful under a large collection of alternatives.

3.1 Asymptotic Null Distribution

Before studying the null distribution ofSI×J , we first introduce the following condition on the

eigenvalues ofΩ, which is a common assumption in the high-dimensional setting (Cai et al., 2013;

Xia et al., 2015; Liu, 2013).

(C1) Assume that logp = o(n1/5), and for some constantC0 > 0, C−10 ≤ λmin(Ω) ≤ λmax(Ω) ≤ C0.

Suppose|Jm| does not depend onn andp for 1 ≤ m≤ M.


ACCEPTED MANUSCRIPT

Let D be the diagonal ofΩ and let (ηi, j) =: R = D−1/2ΩD−1/2. UnderH0, for (i1, j1), (i2, j2) ∈

I × J , the covariance between the standardized statisticsWi1, j1 andWi2, j2, as defined in (7), is

approximately equal toηi1,i2η j1, j2, and thus can be estimated byTi1,i2T j1, j2, whereT := (Ti, j)p×p with

Ti, j = r i, j/√

r i,i r j, j. Thus, we shall estimate the covariance matrix of{Wi, j , (i, j) ∈ I × J} by the

Kronecker product ofTI×I and TJ×J . Let ΛI = (λI1 , . . . , λI|I|)

T andΛJ = {λJ1 , . . . , λJ|J|)

T be the

eigenvalues ofTI×I and TJ×J respectively. We then estimate the eigenvalues of the covariance

matrix of {Wi, j , (i, j) ∈ I × J} by ΛI×J

= (λI×J1 , . . . , λI×JK )T which is the vectorizedΛI ⊗ ΛJ ,

whereK = |I||J|. The following theorem states the asymptotic null distributions forSI×J .

Theorem 1 Suppose that (C1), (4) and (5) hold. Then under H0 : ΩI×J = 0, for any given t∈ R,

we have

P(SI×J ≤ t)

P(∑K

l=1 λI×Jl Z2

l ≤ t)→ 1, (9)

as(n, p)→ ∞, where(Z1, . . . ,ZK) ∼ N(0, I K×K).

Remark 1 The difficulty of Theorem1 comes from the fact that, thoughΩI×J = 0 under the null,

the entries{εk,iεk, j , (i, j) ∈ I×J} can still be highly dependent with each other and their correlations

depend on the entries outside of submatrixΩI×J . Thus, the distribution ofSI×J cannot be simply

estimated by the chi-square distribution. Actually, if we use the chi-square approximation in the

following FDR control procedure in Section4, the choice of threshold level of each statistic will

be too conservative and as the result the FDR cannot be controlled at the pre-specified levelα, i.e.,

the FDR will be much larger thanα.

It has been shown in the above theorem thatSI×J has different asymptotic distribution for

different submatrixΩI×J . Thus, we introduce the normal quantile transformation ofSI×J as

follows

NI×J = Φ−1

1− P(

K∑

l=1

λI×Jl Z2l ≥ SI×J )/2

,


ACCEPTED MANUSCRIPT

whereΦ(t) = P(N(0,1) ≤ t) is standard normal cumulative distribution function (cdf) andSI×J

is the observed value. Thus, we haveP(|N(0,1)| ≥ NI×J ) = P(∑K

l=1 λI×Jl Z2

l ≥ SI×J ). Since

asymptoticallySI×J and∑K

l=1 λI×Jl Z2

l have the same distribution as studied in Theorem1, thus

NI×J asymptotically has the same distribution as the absolute value of a standard normal random

variable. We then define the testΦI×Jα by

ΦI×Jα = I{NI×J ≥ Φ−1(1− α)

}. (10)

The hypothesisH0 : ΩI×J = 0 is rejected wheneverΦI×Jα = 1.

Remark 2 The eigenvalues{λI×Jl , l = 1, . . . ,K} are calculated based onTI×I and TJ×J as de-

scribed earlier. Given the values of{λI×Jl , l = 1, . . . ,K}, the distribution of the mixture ofχ21

variables∑K

l=1 λI×Jl Z2

l can be approximated by a non-central chi-squared distribution with the pa-

rameters determined by the first four cumulants of the quadratic form, see, e.g.,Liu et al. (2009).

We will use this approximation in our numerical studies.

3.2 Asymptotic Power

We now turn to analyze the power of the testΦI×Jα given in (10). For a given pair of index setsI

andJ , we shall first define the following class of precision matrices

WI×J (α, β) ={Ω :

∑

(i, j)∈I×J

ω2i, j

θi, j≥ (2+ δ)(Ψ2

1−α + Ψ21−β)

}, (11)

for anyδ > 0, whereΨ1−α is the 1− α quantile of∑K

l=1 λI×Jl Z2

l as defined in Theorem1.

The next theorem shows that the testΦI×Jα is able to asymptotically distinguish the null param-

eter set in whichΩI×J = 0 fromWI×J (α, β) for arbitrarily small constantδ > 0, with β→ 0.

Theorem 2 Suppose that (C1), (4) and (5) hold. Then we have, for any constantδ > 0,

infΩ∈WI×J (α,β)

P(ΦI×Jα = 1

)≥ 1− β, as n, p→ ∞. (12)


ACCEPTED MANUSCRIPT

Sinceθi, j is of order 1/n, Theorem2 shows that the proposed test rejects the null hypothesisH0 :

ΩI×J = 0 with high probability for a large class of precision matrices satisfying the condition

that there exists one entry of the submatrixΩI×J having a magnitude larger thanC/n1/2 for C =

{2(2+ δ)C20(Ψ2

1−α + Ψ21−β)}

1/2, whereC0 is given in Condition (C1).

4 Multiple Testing of Submatrices with FDR Control

In practice, there are typically many pathways under investigation and it is often of significant

interest to identify which pairs of the pathways interact with each other. A natural approach to

investigate interactions among theM pathways, indexed by{Jm,m = 1, . . . ,M}, is to carry out

simultaneous testing of

H0,m,h : ΩJm×Jh = 0 versusH1,m,h : ΩJm×Jh , 0, for 1 ≤ m< h ≤ M, (13)

whereJ1, . . . ,JM ⊂ {1, . . . , p} is a collection of pre-specified non-overlapping index sets. In

this section, we introduce a multiple testing procedure with FDR and FDP control for testing a

collection ofM = M(M − 1)/2 hypotheses, and we shall assume thatM is large. LetLm denote

the cardinality ofJm assumed to be independent ofn or p for 1 ≤ m ≤ M. LetH = {(m,h) : 1 ≤

m< h ≤ M},H0 = {(m,h) : ΩJm×Jh = 0,1 ≤ m< h ≤ M} be the set of true nulls andH1 = H \H0

be the set of true alternatives. We shall assume that|H1| is relatively small compared to|H|, and

this assumption arises frequently in many contemporary applications.

4.1 Multiple Testing Procedure

Recall that the standardization ofTi, j is defined byWi, j = Ti, j/(θi, j)1/2 as in (7), and the test statistic

SJm×Jh is defined based onWi, j as in (8). It has been shown in Theorem1 thatSJm×Jh has different

asymptotic null distribution for different submatrixΩJm×Jh. Thus, as discussed in Section3.1, the

normal quantile transformation ofSJm×Jh is defined by

NJm×Jh = Φ−1

1− P(

LmLh∑

l=1

λJm×Jhl Z2

l ≥ SJm×Jh)/2

,


ACCEPTED MANUSCRIPT

and NJm×Jh approximately has the same distribution as the absolute value of a standard normal

random variable under the nullH0,m,h. Let t be the threshold level such thatH0,m,h is rejected if

NJm×Jh ≥ t. For any givent, denote the total number of false positives by

R0(t) =∑

(m,h)∈H0

I {NJm×Jh ≥ t}, (14)

and the total number of rejections by

R(t) =∑

(m,h)∈H

I {NJm×Jh ≥ t}. (15)

The false discovery proportion (FDP) and false discovery rate (FDR) are defined as

FDP(t) =R0(t)

R(t) ∨ 1and FDR(t) = E[FDP(t)].

An ideal choice oft is

t0 = inf

{

0 ≤ t ≤√

2 logM :R0(t)

R(t) ∨ 1≤ α

}

,

which would reject as many true positives as possible while controlling the FDR at the pre-specified

levelα. However, the total number of false positives,R0(t), is unknown as the setH0 is unknown.

We propose to estimateR0(t) by 2(1−Φ(t))|H0| and simply estimate|H0| byM because the number

of true alternatives is relatively small. This leads to the following multiple testing procedure with

FDR control.

1. Calculate test statisticsNJm×Jh.

2. For given 0≤ α ≤ 1, calculate

t = inf

{

0 ≤ t ≤√

2 logM− 2 log logM :2M(1− Φ(t))

R(t) ∨ 1≤ α

}

. (16)

If (16) does not exist, then sett =√

2 logM.

3. For (m,h) ∈ H , rejectH0,m,h if NJm×Jh ≥ t.


ACCEPTED MANUSCRIPT

4.2 Theoretical Properties

We now investigate the theoretical properties of the multiple testing procedure given above. For

any 1≤ m≤ M, define

Ξm(γ) ={h : 1 ≤ h ≤ M,h , m,∃i ∈ Jm, j ∈ Jh s.t. |ωi, j | ≥ (logM)−2−γ

}.

The following theorem shows that, under regularity conditions, the above multiple testing proce-

dure controls the FDR and FDP at the pre-specified levelα asymptotically.

Theorem 3 Assume thatM0 =: |H0| � M, and (4) and (5) hold. Suppose there exists someγ > 0

such thatmax1≤m≤M |Ξm(γ)| = o(Mτ) for anyτ > 0. Then under (C1) with p≤ cnr for some c> 0

and r> 0, we have

lim(n,M)→∞FDR(t) ≤ α,

and for anyε > 0,

lim(n,M)→∞

P(FDP(t) ≤ α + ε) = 1.

Remark 3 The technical condition on|Ξm(γ)| is to ensure that most of the submatrices are not

highly correlated with each other. In the special case when max1≤m≤M |Ξm(γ)| = 0, then all sub-

groups are weakly correlated with each other, i.e.,|ωi, j | ≤ (logM)−2−γ for all i ∈ Jm, j ∈ Jh

with m , h. Under this setting, it is shown in the supplementXia et al.(2016) that the proposed

multiple testing procedure performs asymptotically the same as the case when all submatrices are

independent with each other. We do not need this strong condition, and the weaker condition

max1≤m≤M |Ξm(γ)| = o(Mτ) for any τ > 0 assumed in the theorem allows the number of highly

correlated submatrices growing withM.

When t is not attained in the range [0,√

2 logM− 2 log logM] as described in equation (16),

we shall threshold it at√

2 logM. We state in the following corollary a condition to ensure the

existence oft in the range, and as a result, the FDR and FDP will converge to the pre-specified

levelα.


ACCEPTED MANUSCRIPT

Corollary 1 Let

Sρ ={(m,h) ∈ H : ∃(i, j) ∈ Jm× Jh such that |ωi, j |/(θi, j)

1/2 ≥ (logM)12+ρ

}.

Suppose for someρ > 0 and someδ > 0, |Sρ| ≥ ( 1√πα

+ δ)√

logM. Assume thatM0 =: |H0| � M,

and (4) and (5) hold. Suppose there exists someγ > 0 such thatmax1≤m≤M |Ξm(γ)| = o(Mτ) for any

τ > 0. Then, under (C1) with p≤ cnr for some c> 0 and r> 0, we have

lim(n,M)→∞

FDR(t) = α, and FDP(t)/α→ 1

in probability, as(n,M)→ ∞.

Remark 4 The condition|Sρ| ≥ ( 1√πα

+ δ)√

logM in Corollary 1 is mild, since there areM

hypotheses in total and this condition only requires a few submatrices having one entry with mag-

nitude exceeding (logM)1/2+ρ/n1/2 for some constantρ > 0.

4.3 Differences with the B-H Procedure

In this section we first discuss the difference between our method and the Benjamini-Hochberg

(B-H) procedure and then explain why in the multiple testing procedure it is critical to restrictt on

the range 0≤ t ≤√

2 logM− 2 log logM in equation (16) and to thresholdNJm×Jh at√

2 logM

whent is not attained in the range.

Once the test statisticNJm×Jh for a given submatrix is developed, a natural approach to construct

a procedure for simultaneously testing a collection of submatrices is to apply the well-known

B-H procedure to thep-valuespm,h = 2(1 − Φ(NJm×Jh)), 1 ≤ m < h ≤ M, computed from

the transformed statisticsNJm×Jh. Applying the B-H procedure to thesep-values is equivalent to

rejecting the null hypothesesH0,m,h wheneverNJm×Jh ≥ tBH, where

tBH = inf

{

t ≥ 0 :2M(1− Φ(t))

R(t) ∨ 1≤ α

}

. (17)

Note that, the difference between our procedure and the B-H procedure is on the ranges oft in

equations (16) and (17).


ACCEPTED MANUSCRIPT

We first emphasize here that the restriction on the range 0≤ t ≤√

2 logM− 2 log logM in

our proposed procedure as defined in (16) is critical. Whent ≥√

2 logM− log logM, 2M(1 −

Φ(t)) → 0 is not even a consistent estimate ofR0(t) because|R0(t)/{2M(1 − Φ(t))} − 1| 9 0 in

probability as (n,M) → ∞. However, direct application of the B-H procedure to thep-values

amounts to using 2M(1− Φ(t)) as an estimate ofR0(t) for all t ≥ 0, and as a result it may not able

to control the FDP with positive probability. For example, when the number of true alternatives is

fixed, it is shown in Proposition 2.1 inLiu and Shao(2014) that the B-H procedure cannot control

the FDP with positive probability. Thus, in order to control FDP, it is crucial to restrictt on the

range 0≤ t ≤√

2 logM− 2 log logM.

When t is not attained in the range, it is also critical to thresholdNJm×Jh at√

2 logM in-

stead of√

2 logM− 2 log logM. When t does not exist in the range, thresholdingNJm×Jh at√

2 logM− 2 log logM will cause too many false rejections, and consequently the FDR cannot be

controlled asymptotically at levelα. If the threshold level is increased to√

2 logM, the probability

of false rejections can then be perfectly controlled asymptotically as shown in equation (??) of the

supplementXia et al.(2016).

To summarize, in order to control FDR and FDP, it is crucial to restrictt on the range 0≤

t ≤√

2 logM− 2 log logM in equation (16), and when it is not attained in the range, to threshold

NJm×Jh at√

2 logM.

5 Simulation Studies

We now turn to analyze the numerical performance of the proposed multiple testing procedure

through simulation studies. We first investigate the size and power of the proposed method by

considering three matrix models with a random selection of the size of submatrices. We then

mimic the sizes of the pathways of the breast cancer dataset analyzed in Section6 and study the

numerical performance of the proposed multiple testing procedure in a setting that is similar to

the real data application. Our method, which tests for the conditional dependence structure at a


ACCEPTED MANUSCRIPT

group level, is then compared with the entrywise testing method and the B-H procedure. We also

compare the new method with the Bonferroni correction procedure and report the results in the

supplement.

5.1 Simulation for Different Constructions of Submatrices

Our analysis is divided into two parts: the performance of the new test statistics for testing a given

submatrix and the performance of the proposed multiple testing procedure. We first describe the

construction of the submatrices. For a given precision matrixΩ, we randomly divide the upper

triangular matrix ofΩ intoM submatrices, whereM = bp/sc(bp/sc − 1)/2 ands= 2 and 4. Thus

the length of the index sets can range from 1 to (p− bp/sc + 1). This is equivalent to grouping the

genes intobp/sc pathways and considering all possible conditional dependence structure between

different pathways of different sizes.

The data{X1, . . . , Xn} are generated from multivariate normal distribution with zero-mean and

precision matrixΩ. Three choices ofΩ are considered:

• Model 1: Ω∗(1) = (ω∗(1)i, j ) whereω∗(1)

i,i = 1, ω∗(1)i,i+1 = ω∗(1)

i+1,i = 0.5, ω∗(1)i,i+2 = ω∗(1)

i+2,i = 0.5.

For each of the submatrices as we constructed above, if it contains one of those entries,

we make the first row of the submatrices equal to 0.5. Letω∗(1)i, j = 0 otherwise.Ω(1) =

D1/2(Ω∗(1) + δI )/(1+ δ)D1/2 with δ = |λmin(Ω∗(1))| + 0.05.

• Model 2:Ω∗(2) = (ω∗(2)i, j ) whereω∗(2)

i, j = ω∗(2)j,i = 0.3 for i = 10(k − 1)+ 1 and 10(k − 1)+ 2 ≤

j ≤ 10(k − 1) + 10, 1≤ k ≤ p/10. ω∗(2)i, j = 0 otherwise. For each of the submatrices as we

constructed above, if it contains less than three of those entries, we make the submatrices

equal to 0. Let the first row of the submatrices which are closest to the diagonal equal to 0.3.

Ω(2) = D1/2(Ω∗(2) + δI )/(1+ δ)D1/2 with δ = |λmin(Ω∗(2))| + 0.05.

• Model 3: Ω∗(3) = (ω(3)i, j ). For each of the two submatrices closest to the diagonal, as we

constructed above, pick a random row and make the entries equal to 0.3. Letω∗(3)j,i = ω∗(3)

i, j .


ACCEPTED MANUSCRIPT

Ω(3) = D1/2(Ω∗(3) + δI )/(1+ δ)D1/2 with δ = |λmin(Ω∗(3))| + 0.05.

whereD = (Di, j) is a diagonal matrix withDi,i = Unif(1,3) for i = 1, . . . , p.

For each generated dataset, we use the Lasso to estimate the regression coefficientsβi:

βi = D− 1

2i arg min

u

{ 12n

∣∣∣(X−i − X−i)D−1/2i u − (X(i) − X(i))

∣∣∣2

2+ λn,i |u|1

}, (18)

whereDi = diag(Σ−i,−i), andλn,i = κ√σi,i log p/n.

Performance for testing a given submatrix: We start by comparing our test based on the test

statisticSI×J with the entrywise testing of a given submatrix where the null hypothesisH0 :

ΩI×J = 0 is rejected whenever max(i, j)∈I×J |Wi, j | ≥ Φ−1(1−α/K). As our target is the FDR control

of the multiple comparisons, we focus on the power comparisons of these two methods for a range

of significance levels from 0 toα = 0.1/M. For illustration, we compare the performance of

these two tests by testing against a randomly selected nonzero submatrix closest to the diagonal

for Model 1 with s = 4. For each method, the sample size is taken to ben = 200, while the

dimensionp varies over the values 100, 200, 500 and 1000. For simplicity of the comparison, the

tuning parametersλn,i in (18) is selected to beλn,i =√σi,i log p/n for both methods. The power

curves, illustrated in Figure1, are estimated from 100 replications. We can see from the figure that

the power of the new group method is much higher than the entrywise method, and the advantage

becomes much clearer when the dimension ofΩ grows.

Comparison of the multiple testing procedures: We now compare the proposed group level

FDR control procedure (Group) with three other methods: entrywise multiple testing method (En-

trywise), B-H procedure (B-H) and Bonferroni correction procedure (Bonferroni).

For the new method, as described in Section4, we select the tuning parametersλn,i in (18)

adaptively by the data with the principle of making∑

(m,h)∈H0I (NJm×Jh ≥ t) and (2− 2Φ(t))|H0| as

close as possible. The algorithm is similar asXia et al.(2015) and is summarized as follows.

1. Let λn,i = b/20√Σi,i log p/n for b = 1, ∙ ∙ ∙ ,40. For eachb, calculateβ

(b)i , i = 1, ∙ ∙ ∙ , p.

Based on the estimation of regression coefficients, construct the corresponding standardized


ACCEPTED MANUSCRIPT

transformed statisticsN(b)Jm×Jh

for eachb.

2. Chooseb as the minimizer of

10∑

d=1

(∑

(m,h)∈H I (N(b)Jm×Jh

≥ Φ−1(1− d(1− Φ(√

logM))/10))

d(1− Φ(√

logM))/10 ∙ 2M− 1

)2.

The tuning parametersλn,i are then chosen to be

λn,i = b/20√Σi,i log p/n.

We examine the power of the new method based on the average powers for 100 replications,

1100

100∑

r=1

∑(m,h)∈H1

I {NJm×Jh,r ≥ t}

|H1|, (19)

wherer denotes ther-th replication.

For the entrywise multiple testing method, we select the tuning parametersλn,i adaptively using

the principle as described in Section 5 inXia et al.(2015). We applied the multiple testing proce-

dure as developed in Section 4 ofXia et al.(2015) by restrictingt on the range [0,√

4 logp− 2 log logp]

and threshold|Wi, j | at√

4 logp if t is not attained in the range. We then examine the empirical FDR

by

1100

100∑

r=1

∑(m,h)∈H0

I {max(i, j)∈Jm×Jh |Wi, j | ≥ t}∑

(m,h) I {max(i, j)∈Jm×Jh |Wi, j | ≥ t},

and the empirical power by

1100

100∑

r=1

∑(m,h)∈H1

I {max(i, j)∈Jm×Jh |Wi, j | ≥ t}

|H1|.

We apply the Bonferroni correction procedure to the new test statistics and calculate its power

based on (19), with t obtained by settingαB = α/M. The power of the B-H procedure applied to

max(i, j)∈I×J |Wi, j | are calculated by (19) with no restriction on the range oft.

We apply all procedures to these three models withs = 2 and 4. For each method, the sample

size is taken to ben = 200, while the dimensionp varies over the values 100, 200, 500 and 1000.

The FDR level is set atα = 0.1 andα = 0.01 respectively, and the empirical FDRs and powers,


ACCEPTED MANUSCRIPT

summarized in Tables1 and2, are estimated from 100 replications. The standard errors of the

estimated powers are much smaller than the powers themselves and are thus not reported.

The average numbers of conditionally dependent (“true interaction”) and conditionally inde-

pendent (“no interaction”) pairs of subgroups with 100 replications are summarized in Table3. It

can be seen that the number of “true interactions” is relatively small compared to the total number

of pairs of subgroups in all cases, as we assumed in Section4.1.

The results in Table1 show that the empirical FDRs of the new group level method are well

maintained under the target FDR level and are reasonably close toα for almost all settings. The

standard errors of the FDP are small in most cases, especially when the dimension grows. They are

slightly larger in the cases whenα = 0.01, mainly due to the fact that the estimation error of the

standard deviation of FDP is of the order 1/l1/2 with l = 100. As a comparison, the empirical FDRs

of the entrywise method have serious distortion in most of the scenarios, especially whens = 4,

in which case the empirical FDRs can be even larger than 4α. The empirical FDRs of the B-H

procedure are well under control in most cases. However, its standard errors are much larger than

the standard errors of the proposed method in many cases, which coincides with the discussion

in Section4.3. The numerical results also show that the Bonferroni correction procedure is much

more conservative than the other two methods, and the detailed analysis is summarized in the

supplementXia et al.(2016).

Table2 shows that the empirical powers of our proposed method for all these models are very

high under various constructions of submatrices. In particular, it outperforms the entrywise testing

method and the B-H procedure. Especially when the dimension is high, the powers of the new

method are much higher than the other methods under all scenarios. Furthermore, the power gain

of the new group level testing procedure over the entrywise testing method is significant when the

dimension is high. Especially for model 3 whens= 4, the empirical powers of the new procedure

are more than twice the entrywise testing method. This is because the advantage of the group level

testing becomes more significant when the signals spread across various submatrices as in Model

3. We can see from the table that the empirical power of the new method gets smaller when the


ACCEPTED MANUSCRIPT

dimensionp grows. This is because of the fact that we keep the magnitude ofωi, j invariant for

various range of dimensions.

5.2 Simulation by Mimicking the Sizes of Gene Groups

We now consider a simulation setting that is similar to the breast cancer data application given in

Section6. The submatrices of the precision matrixΩ is constructed by mimicking the sizes of the

249 gene groups used in the breast cancer application, with parameter valuesp = 1624,n = 295

andM = 30876. The sizes of the gene groups range from 1 to 110, and the corresponding sizes of

the off-diagonal submatrices range from 1×1 to 97×110. For the diagonal submatricesΩ∗Jm×Jm:=

(ω∗m,i, j) with sizesLm × Lm, m= 1, . . . , 249, which describe the conditional dependency within the

pathways, we letω∗m,i,i = 1, ω∗m,i,i+1 = ω∗m,i+1,i = 0.8 if Lm ≥ 2, ω∗m,i,i+2 = ω∗m,i+2,i = 0.6 if Lm ≥ 3,

andω∗m, j,i = ω∗m,i, j. For each of the non-diagonal submatricesΩ∗Jm×Jm+1

andΩ∗Jm×Jm+2, we randomly

pick one row and let min{10, |Jm+1|} and min{10, |Jm+2|} random entries ofω∗i, j in the rows equal

to 0.5 respectively. We then construct the precision matrix asΩ = D1/2(Ω∗ + δI )/(1+ δ)D1/2, with

δ = λmin(Ω∗) + 0.05. The FDR level is set atα = 0.1 andα = 0.01 respectively.

By mimicking the gene group sizes, we apply the proposed method in Section4.1, the entrywise

testing procedure, the B-H procedure and the Bonferroni correction procedure as described in

Section5.1. The empirical FDR and power results are summarized in Table4, and the performance

of the Bonferroni method is reported in the supplement. The empirical FDR of the new method

is equal to 0.062 whenα = 0.1 and is equal to 0.006 whenα = 0.01, and thus both are close to

the corresponding pre-specified level. Similarly as in Section5.1, the B-H procedure has larger

standard errors than the new procedure, while the entrywise multiple testing procedure has serious

FDR distortion. For the empirical powers, it is shown in Table4 that, the new testing procedure is

more powerful than all the other methods.


ACCEPTED MANUSCRIPT

6 Analysis of Breast Cancer Gene Expression Data

In this section, we apply the multiple testing procedure developed in Section4 to identify between

pathway interactions based on a breast cancer gene expression study as described invan’t Veer

et al.(2002), to further illustrate the merit of the procedure.

This study consists of 295 subjects with primary breast carcinomas whose gene-expression

levels (in log scale) are measured at cancer diagnosis. For illustration, we considerM = 70 breast

cancer related pathways, including several major signaling pathways, assembled based on existing

literature (Osborne et al., 2005; Pan, 2012, e.g.). These pathways consist ofp = 1624 unique genes,

from the molecular signature database. Examples include the MAPK signaling, WNT signaling,

TGF-β signaling, calcium signaling, cell communication, p53 signaling and breast cancer estrogen

signaling pathways. Note that many of the pathways have overlapping genes while our method

requires group indices to be non-overlapping since two groups with shared genes are obviously

dependent of each other. To remove the influence of such trivial dependence, we further partitioned

the 70 pathways into 249 non-overlapping gene subgroups whose sizes range from 1 to 110 with

an average of 6.5. The algorithm used for such partitioning aims to identify the smallest number

of non-overlapping subgroups that can cover all the genes under consideration. The partitioning

algorithm begins with creating anM × p index matrix, I = [I1, ..., I p]. For m = 1, ...,M and

q = 1, ..., p, the (m,q)th element ofI is set to 1 if theqth gene belongs to themth pathway, and 0

otherwise. Then the subgroups are indexed by the unique values of{I1, ..., I p}.

Applying our proposed methods with target false discovery rate of 0.01, we identified 494 be-

tween subgroup interactions out of the 30876 possible subgroup pairs. These between subgroup

interactions can be mapped to 311 unique between pathway interactions and 18 within pathway

interactions. The top pathways with highest numbers of interactions with other pathways include

MAPK signaling, calcium signaling, gycan structures biosynthesis, WNT signaling, cell commu-

nication, TGF-β signaling and breast cancer estrogen pathways. The MAPK signaling pathway has

interactions with 92 gene subgroups which corresponds to 31 pathways including TGF-β, MTOR,


ACCEPTED MANUSCRIPT

P53, WNT, and ERBB signaling pathways. The WNT signaling pathway interacts with 25 other

pathways including TGF-β, MTOR, MPAK and breast cancer estrogen signaling. The TGF-β

signaling pathway interacts with 21 other pathways including MAPK, p53, WNT and calcium

signaling.

Many of these interactions have been previously documented. For example, experimental data

suggest that inhibition of mTORC1 leads to MAPK pathway activation (Carracedo et al., 2008).

The interaction between TGF-β and WNT pathways has been known for a long time and is proba-

bly the most extensively studied. At the organism level, TGF-β interacts with many other pathways

at every stage of life from birth to death. During embryonic development, the complex but delicate

interactions between the TGF-β, WNT, MAPK, and other pathways are important for a range of

processes including body patterning, stem cell maintenance and cell fate determination (Guo and

Wang, 2008). Kouzmenko et al.(2004) showed the first direct evidence of interaction between

WNT and estrogen signaling pathways via functional interaction betweenβ-catenin and ERα.

To examine whether these 70 breast cancer pathways are enriched with interactions, we ran-

domly selected 50 sets of 70 pathways of similar sizes as the breast cancer pathways from the C2

pathway gene sets curated from various online databases (available from the Broad Institute). For

each of the 70 randomly selected pathways, we performed the same analysis as the breast cancer

pathways by first partitioning them into non-overlapping subgroups and then applied our method

to identify significant between subgroup interactions. To determine whether the 70 breast can-

cer pathways are enriched with between subgroup interactions relative to these randomly selected

pathways, we calculate the proportion of between subgroup interactions deemed as significant at

the FDR level of 0.01. Across the 50 randomly selected pathways, the average proportion of sig-

nificant pairs was 0.011 with standard deviation 0.002. The proportion of significant pairs we

identified in the breast cancer data is 0.016, which is 2.5 standard deviations higher than the mean

of proportions from those 50 random sets. The results suggest that the selected 70 pathways are

indeed enriched with “interaction” pairs.


ACCEPTED MANUSCRIPT

7 Discussions

We proposed in this paper a multiple testing procedure under the Gaussian graphical models for

detecting between group interactions. The proposed method can potentially be extended in several

directions. We discuss in this section two of these possible extensions.

7.1 Extension to Gaussian Copula Graphical Models

In the present paper, the problem of identifying the conditional between group interactions is trans-

lated to the problem of multiple testing of submatrices of a high-dimensional precision matrixΩ

under the Gaussian graphical model framework. The main reason for the success of this approach

is that the conditional independence between two non-overlapping groups of variables is equiva-

lent to the corresponding submatrix ofΩ being 0. This approach can be extended to more general

settings of the semiparametric Gaussian copula graphical models where the population distribu-

tion is non-Gaussian, seeLiu et al. (2012) andXue and Zou(2012). The semiparametric Gaussian

copula model assumes that the variables follow a joint normal distribution after a set of unknown

marginal monotonic transformations. It would be interesting to develop a multiple testing proce-

dure and investigate its properties under the semiparametric Gaussian copula graphical models.

Detailed analysis is involved and is an interesting topic for future research.

7.2 The Two-Sample Case

We have focused on the one-sample case in this paper. It is also of interest to study the two-

sample case where the goal is to discover the changes in the conditional dependence between

pathway interactions under two different disease settings. In the one-sample case studied in this

paper,ΩJm×Jh = 0 underH0,m,h. Thus the null is simple but the technical details of deriving the

limiting distribution of a given submatrix is still very involved because the correlation structure of

{Wi, j , (i, j) ∈ Jm×Jh} largely depends on the entries outside of the submatrix of interest. In the two-

sample case, we wish to test the hypothesesH0,m,h : Ω(1)Jm×Jh

= Ω(2)Jm×Jh

. Under the null hypothesis


ACCEPTED MANUSCRIPT

H0,m,h, each submatrix is not necessary a zero matrix. So the null is composite, consequently the

dependence structures of the suitable test statistics depend on the entries both inside and outside of

the submatrices of direct interest. The two-sample case is technically even more challenging and

we leave it as future work.


ACCEPTED MANUSCRIPT

References

Anderson, T. W. (2003).An Introduction To Multivariate Statistical Analysis. Wiley-Intersceince, 3rd ed,

New York.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful

approach to multiple testing.J. R. Statist. Soc. B, 57:289–300.

Beran, R., Bilodeau, M., and de Micheaux, P. L. (2007). Nonparametric tests of independence between

random vectors.Journal of Multivariate Analysis, 98(9):1805–1824.

Cai, T. T. and Liu, W. (2015). Large-scale multiple testing of correlations.Journal of the American Statistical

Association, 110(to appear).

Cai, T. T., Liu, W., and Xia, Y. (2013). Two-sample covariance matrix testing and support recovery in

high-dimensional and sparse settings.J. Amer. Statist. Assoc., 108(501):265–277.

Cai, T. T. and Zhang, A. (2014). Inference on high-dimensional differential correlation matrix.arXiv

preprint arXiv:1408.5907.

Carracedo, A., Ma, L., Teruya-Feldstein, J., Rojo, F., Salmena, L., Alimonti, A., Egia, A., Sasaki, A. T.,

Thomas, G., Kozma, S. C., et al. (2008). Inhibition of mtorc1 leads to mapk pathway activation through

a pi3k-dependent feedback loop in human cancer.The Journal of clinical investigation, 118(9):3065.

Chatterjee, N., Kalaylioglu, Z., Moslehi, R., Peters, U., and Wacholder, S. (2006). Powerful multilocus tests

of genetic association in the presence of gene-gene and gene-environment interactions.The American

Journal of Human Genetics, 79(6):1002–1016.

Craven, M. and Kumlien, J. (1999). Constructing biological knowledge bases by extracting information

from text sources. InISMB, volume 1999, pages 77–86.

Fan, J. and Lv, J. (2008). Sure independence screening for ultra-high dimensional feature space (with

discussions).Journal of the Royal Statistical Society, Series B, 70:849–911.

Gieser, P. W. and Randles, R. H. (1997). A nonparametric test of independence between two vectors.Journal

of the American Statistical Association, 92(438):561–567.


ACCEPTED MANUSCRIPT

Glazko, G. V. and Emmert-Streib, F. (2009). Unite and conquer: univariate and multivariate approaches for

finding differentially expressed gene sets.Bioinformatics, 25(18):2348–2354.

Guo, X. and Wang, X.-F. (2008). Signaling cross-talk between tgf-β/bmp and other pathways.Cell research,

19(1):71–88.

Huang, T.-M. et al. (2010). Testing conditional independence using maximal nonlinear conditional correla-

tion. The Annals of Statistics, 38(4):2047–2091.

Jia, P., Kao, C.-F., Kuo, P.-H., and Zhao, Z. (2011). A comprehensive network and pathway analysis of

candidate genes in major depressive disorder.BMC systems biology, 5(Suppl 3):S12.

Kelley, R. and Ideker, T. (2005). Systematic interpretation of genetic interactions using protein networks.

Nature biotechnology, 23(5):561–566.

Khatri, P., Sirota, M., and Butte, A. J. (2012). Ten years of pathway analysis: current approaches and

outstanding challenges.PLoS computational biology, 8(2):e1002375.

Kooperberg, C. and LeBlanc, M. (2008). Increasing the power of identifying gene× gene interactions in

genome-wide association studies.Genetic epidemiology, 32(3):255–263.

Kooperberg, C. and Ruczinski, I. (2005). Identifying interacting snps using monte carlo logic regression.

Genetic epidemiology, 28(2):157–170.

Kouzmenko, A. P., Takeyama, K.-i., Ito, S., Furutani, T., Sawatsubashi, S., Maki, A., Suzuki, E., Kawasaki,

Y., Akiyama, T., Tabata, T., et al. (2004). Wnt/β-catenin and estrogen signaling converge in vivo.Journal

of Biological Chemistry, 279(39):40255–40258.

Lauritzen, S. L. (1996).Graphical models. Oxford University Press.

Li, Y., Agarwal, P., and Rajagopalan, D. (2008). A global pathway crosstalk network.Bioinformatics,

24(12):1442–1447.

Liu, H., Han, F., Yuan, M., Lafferty, J., Wasserman, L., et al. (2012). High-dimensional semiparametric

gaussian copula graphical models.Ann. Statist., 40(4):2293–2326.

Liu, H., Tang, Y., and Zhang, H. H. (2009). A new chi-square approximation to the distribution of non-

negative definite quadratic forms in non-central normal variables.Comput. Stat. Data. Anal., 53(4):853–

856.


ACCEPTED MANUSCRIPT

Liu, W. (2013). Gaussian graphical model estimation with false discovery rate control.Ann. Statist.,

41(6):2948–2978.

Liu, W. and Shao, Q.-M. (2014). Phase transition and regularized bootstrap in large scalet-tests with false

discovery rate control.Ann. Statist., 42(5):2003–2025.

Liu, Z.-P., Wang, Y., Zhang, X.-S., and Chen, L. (2010). Identifying dysfunctional crosstalk of pathways in

various regions of alzheimer’s disease brains.BMC systems biology, 4(Suppl 2):S11.

Matthews, L., Gopinath, G., Gillespie, M., Caudy, M., Croft, D., de Bono, B., Garapati, P., Hemish, J.,

Hermjakob, H., Jassal, B., et al. (2009). Reactome knowledgebase of human biological pathways and

processes.Nucleic acids research, 37(suppl 1):D619–D622.

Osborne, C. K., Shou, J., Massarweh, S., and Schiff, R. (2005). Crosstalk between estrogen receptor and

growth factor receptor pathways as a cause for endocrine therapy resistance in breast cancer.Clinical

cancer research, 11(2):865s–870s.

Pan, X.-H. (2012). Pathway crosstalk analysis based on protein-protein network analysis in ovarian cancer.

Asian Pacific Journal of Cancer Prevention, 13(8):3905–3909.

Puri, N., Salgia, R., et al. (2008). Synergism of egfr and c-met pathways, cross-talk and inhibition, in

non-small cell lung cancer.Journal of carcinogenesis, 7(1):9.

Ritchie, M., Hahn, L., Roodi, N., Bailey, L., Dupont, W., Parl, F., and Moore, J. (2001). Multifactor-

dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic

breast cancer.Am. J. Hum. Genet., 69(1):138–147.

Rual, J.-F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., Berriz, G. F., Gibbons,

F. D., Dreze, M., Ayivi-Guedehoussou, N., et al. (2005). Towards a proteome-scale map of the human

protein–protein interaction network.Nature, 437(7062):1173–1178.

Shou, J., Massarweh, S., Osborne, C. K., Wakeling, A. E., Ali, S., Weiss, H., and Schiff, R. (2004). Mecha-

nisms of tamoxifen resistance: increased estrogen receptor-her2/neu cross-talk in er/her2–positive breast

cancer.Journal of the National Cancer Institute, 96(12):926–935.

Su, L. and White, H. (2007). A consistent characteristic function-based test for conditional independence.

Journal of Econometrics, 141(2):807–834.


ACCEPTED MANUSCRIPT

Su, L. and White, H. (2008). A nonparametric hellinger metric test for conditional independence.Econo-

metric Theory, 24(04):829–864.

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A.,

Pomeroy, S. L., Golub, T. R., Lander, E. S., et al. (2005). Gene set enrichment analysis: a knowledge-

based approach for interpreting genome-wide expression profiles.Proceedings of the National Academy

of Sciences of the United States of America, 102(43):15545–15550.

Um, Y. and Randles, R. H. (2001). A multivariate nonparametric test of independence among many vectors.

Journal of Nonparametric Statistics, 13(5):699–708.

van’t Veer, L. J., Dai, H., Van De Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy,

K., Marton, M. J., Witteveen, A. T., et al. (2002). Gene expression profiling predicts clinical outcome of

breast cancer.Nature, 415(6871):530–536.

Weirauch, M. T. (2011). Gene coexpression networks for the analysis of dna microarray data.Applied

statistics for network biology: methods in systems biology, pages 215–250.

Xenarios, I., Salwinski, L., Duan, X. J., Higney, P., Kim, S.-M., and Eisenberg, D. (2002). Dip, the database

of interacting proteins: a research tool for studying cellular networks of protein interactions.Nucleic

acids research, 30(1):303–305.

Xia, Y., Cai, T., and Cai, T. T. (2015). Testing differential networks with applications to the detection of

gene-gene interactions.Biometrika, 102:247–266.

Xia, Y., Cai, T., and Cai, T. T. (2016). Supplement to “Multiple Testing of Submatrices of a Precision Matrix

with Applications to Identification of Between Pathway Interactions”. Technical report.

Xue, L. and Zou, H. (2012). Regularized rank-based estimation of high-dimensional nonparanormal graph-

ical models.Ann. Statist., 40(5):2541–2571.


ACCEPTED MANUSCRIPT

pm

etho

dα=

10%

α=

1%s

24

24

Mod

els

12

31

23

12

31

23

Em

piric

alF

DR

(SE

)(in

%)

100

Gro

up8.

9(3

.4)

9.5

(4.3

)9.

0(2

.9)

4.6

(3.3

)8.

6(6

.0)

9.7

(5.6

)0.

8(1

.2)

1.0

(1.0

)0.

6(0

.8)

0.2

(0.5

)0.

8(1

.0)

0.4

(1.1

)E

ntry

wis

e24

.5(4

.6)

15.7

(5.0

)12

.7(3

.7)

36.2

(7.6

)24

.1(8

.3)

16.3

(6.1

)2.

7(2

.3)

1.3

(1.8

)1.

0(1

.2)

4.0

(4.1

)1.

8(3

.3)

1.3

(5.5

)B

-H10

.0(3

.6)

12.7

(5.5

)9.

8(4

.2)

8.3

(5.0

)12

.4(7

.4)

11.7

(6.8

)0.

8(1

.3)

1.2

(1.9

)0.

8(1

.1)

0.4

(1.5

)1.

5(5

.6)

1.1

(3.5

)

200

Gro

up8.

8(2

.5)

9.4

(3.5

)8.

7(2

.5)

5.8

(3.3

)8.

5(4

.5)

8.7

(3.9

)0.

6(0

.5)

0.9

(0.5

)0.

8(0

.8)

0.3

(0.4

)0.

8(0

.2)

0.8

(0.7

)E

ntry

wis

e23

.5(3

.5)

14.9

(3.7

)13

.6(3

.0)

33.1

(7.0

)23

.0(5

.5)

15.1

(5.8

)2.

1(1

.4)

1.0

(1.2

)0.

9(1

.0)

3.1

(2.8

)1.

3(2

.3)

0.7

(3.1

)B

-H9.

8(2

.9)

11.7

(3.8

)9.

2(3

.1)

8.8

(4.2

)11

.9(4

.6)

12.1

(5.5

)0.

8(0

.9)

0.9

(1.9

)0.

8(1

.1)

0.8

(1.5

)1.

4(2

.5)

1.0

(4.2

)

500

Gro

up7.

9(1

.5)

9.9

(2.3

)8.

1(1

.5)

5.4

(1.8

)8.

7(2

.9)

9.3

(2.7

)0.

8(0

.4)

0.9

(0.7

)0.

7(0

.5)

0.5

(0.6

)0.

7(0

.8)

0.9

(0.8

)E

ntry

wis

e19

.4(2

.2)

15.0

(2.3

)12

.5(2

.1)

24.6

(4.9

)20

.6(4

.1)

16.2

(5.3

)1.

6(1

.0)

0.9

(0.9

)0.

8(0

.8)

1.6

(1.4

)1.

1(1

.6)

1.1

(2.1

)B

-H8.

3(2

.0)

11.3

(2.6

)9.

2(2

.1)

8.7

(3.4

)11

.9(4

.2)

11.6

(4.9

)0.

8(0

.6)

1.1

(1.1

)0.

9(1

.1)

0.7

(1.0

)1.

2(2

.3)

1.4

(3.5

)

1000

Gro

up7.

9(1

.2)

9.8

(1.8

)8.

7(1

.4)

6.0

(1.7

)9.

0(2

.0)

10.0

(2.1

)0.

7(0

.3)

0.9

(0.4

)0.

7(0

.4)

0.4

(0.4

)0.

8(0

.5)

1.0

(0.7

)E

ntry

wis

e16

.7(2

.0)

14.5

(1.8

)13

.1(2

.1)

20.9

(3.5

)20

.0(2

.9)

16.5

(5.4

)1.

3(0

.7)

1.3

(0.9

)1.

1(1

.0)

1.7

(1.7

)1.

2(1

.6)

0.8

(2.0

)B

-H7.

8(1

.3)

11.8

(2.3

)10

.1(1

.9)

9.5

(2.7

)12

.0(2

.8)

12.7

(6.2

)0.

6(0

.5)

1.1

(1.0

)0.

8(1

.1)

0.9

(1.2

)1.

3(2

.3)

0.8

(2.1

)

Tabl

e1:

Em

piric

alF

DR

s(s

tand

ard

erro

rs)

(%)

withn=

200,α=

0.1

and

0.01

resp

ectiv

ely,

100

repl

icat

ions

.


ACCEPTED MANUSCRIPT

p method α = 10% α = 1%s 2 4 2 4

Models 1 2 3 1 2 3 1 2 3 1 2 3

100Group 93.8 88.5 84.6 95.2 84.473.5 87.8 74.2 69.4 92.3 68.754.4

Entrywise 92.8 90.3 85.2 92.8 85.167.3 83.4 71.9 66.8 85.9 56.328.5B-H 92.8 88.3 84.6 92.9 81.167.2 82.3 66.3 64.5 84.0 50.931.2

200Group 90.4 83.3 72.5 92.6 77.958.8 82.6 66.8 56.1 87.9 62.742.6

Entrywise 87.6 84.8 71.7 87.3 76.045.5 93.6 60.8 47.6 73.6 40.815.0B-H 87.6 81.4 70.9 86.9 71.343.3 72.2 54.6 46.7 71.1 33.814.1

500Group 84.3 72.3 56.6 85.9 67.441.4 75.1 55.4 40.7 76.0 53.427.2

Entrywise 78.7 70.8 50.8 72.5 60.024.2 60.0 43.3 25.6 48.3 25.9 5.0B-H 77.8 66.0 50.4 70.0 52.520.3 57.5 36.3 23.8 43.3 19.3 3.0

1000Group 80.1 63.3 46.4 80.7 59.931.9 69.9 57.2 31.8 69.2 46.821.1

Entrywise 71.2 59.2 37.1 60.7 48.514.5 49.3 31.5 15.5 33.5 18.5 3.1B-H 69.8 53.1 35.4 56.6 39.9 9.6 45.8 23.9 12.7 27.3 10.6 1.2

Table 2:Empirical powers (%) withn = 200,α = 0.1 and 0.01 respectively, 100 replications.


ACCEPTED MANUSCRIPT

s 2 4 2 4Models 1 2 3 1 2 3 1 2 3 1 2 3

p Trueinteractions Nointeractions100 72 54 96 29 28 46 1153 1171 1129 271 272 254200 146 110 196 60 58 96 4803 4839 4754 1165 1167 1129500 373 279 496 154 147246 30752 30846 30629 7596 760375041000 748 560 996 311 297496 124002 124190 123754 30814 3082830629

Table 3:Average numbers of true interactions and no interactions based on 100 replications.


ACCEPTED MANUSCRIPT

p method α = 10% α = 1%FDR (SE) (in%) Power (in%) FDR (SE) (in%) Power (in%)

1624Group 6.2 (1.4) 47.2 0.5 (0.3) 33.6

Entrywise 26.0 (2.7) 39.9 2.5 (1.5) 26.0B-H 11.1 (2.0) 44.9 1.1 (1.0) 19.8

Table 4: Empirical FDRs (SEs) and powers (%) by mimicking the real data withα = 0.1 andα = 0.01respectively, based on 100 replications.


ACCEPTED MANUSCRIPT

0.00000 0.00005 0.00010 0.00015 0.00020 0.00025 0.00030

0.0

0.2

0.4

0.6

0.8

alpha

pow

er

groupentrywise

(a) p=100

0e+00 2e-05 4e-05 6e-05 8e-050.

00.

20.

40.

60.

8

alpha

pow

er

groupentrywise

(b) p=200

0.0e+00 2.0e-06 4.0e-06 6.0e-06 8.0e-06 1.0e-05 1.2e-05

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

alpha

pow

er

groupentrywise

(c) p=500

0.0e+00 5.0e-07 1.0e-06 1.5e-06 2.0e-06 2.5e-06 3.0e-06

0.0

0.1

0.2

0.3

0.4

0.5

0.6

alpha

pow

er

groupentrywise

(d) p=1000Figure 1:Power comparisons of the group method (red, solid) and entrywise method (blue, dash) for testinga given nonzero submatrix. 100 replications.


Documents

Between Pathway Interactions Matrix with Applications to ...homepage.fudan.edu.cn/xiayin/files/2017/02/... · procedure is applied to a breast cancer gene expression study to identify