

Learning regulatory programs by threshold SVD regression

Xin Ma^a,1, Luo Xiao^b,1, and Wing Hung Wong^a,c,2

Departments of ^aStatistics and ^cHealth Research & Policy, Stanford University, Stanford, CA 94305; and ^bDepartment of Biostatistics, Johns Hopkins University, Baltimore, MD 21205

Contributed by Wing Hung Wong, September 18, 2014 (sent for review August 3, 2014; reviewed by Hongyu Zhao)

We formulate a statistical model for the regulation of global gene expression by multiple regulatory programs and propose a thresholding singular value decomposition (T-SVD) regression method for learning such a model from data. Extensive simulations demonstrate that this method offers improved computational speed and higher sensitivity and specificity over competing approaches. The method is used to analyze microRNA (miRNA) and long noncoding RNA (lncRNA) data from The Cancer Genome Atlas (TCGA) consortium. The analysis yields previously unidentified insights into the combinatorial regulation of gene expression by noncoding RNAs, as well as findings that are supported by evidence from the literature.

regulatory program | SVD | sparse | multivariate | regression

The development of microarray and next-generation sequencing technologies has enabled rapid quantification of various genome-wide features (DNA sequences, gene expressions, noncoding RNA expressions, methylation, etc.) in a population of samples (1, 2). Large consortia have compiled genetic and molecular profiling data in an enormous number of tumors across hundreds of samples (3, 4). A common challenge arising from these large-scale genomic studies is the inference of regulatory relationships between different genome-wide measurements from the complex biological systems where the number of predictors and responses often far exceeds the sample size.

To formulate a statistical model for such regulatory relations, consider the situation depicted in Fig. 1 (see Fig. S1 for more detailed illustration of the model schema), where $p$ regulators $x = (x_1, \ldots, x_p)$ regulate $q$ responses $y = (y_1, \ldots, y_q)$ through $r$ regulatory programs that are represented by hidden nodes, e.g., $h_1, \ldots, h_r$. The activity $h_j$ of the $j$th program depends on the regulators connected to hidden node $j$, and $h_j$ in turn affects the level of the responses that are connected to node $j$. To express this model mathematically, we denote by $u_j$ and $v_j$ the unit vectors corresponding respectively to the input weights $\{a_{ij}, i = 1, \ldots, p\}$ and the output weights $\{b_{jk}, k = 1, \ldots, q\}$ of the $j$th program. Then the regulatory relations are represented as $h_j = \sigma(x u_j)$ and $y = \sum_{j=1}^{r} d_j h_j v_j'$, where $x$, $y$ are regarded as row vectors, $u$, $v$ are regarded as column vectors, and $\sigma(\cdot)$ is a sigmoidal function. The aforementioned is a standard single-layer neural network model that is widely used in predictive modeling but could be impossible to learn in biological studies where sample size $n$ is much smaller than $p$ or $q$. Thus, we first simplify the model by taking $\sigma$ to be the identity function. Then our model becomes $h_j = x u_j$, $y = \sum_{j=1}^{r} d_j h_j v_j'$.
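To make the simplified linear model concrete, the following minimal Python sketch evaluates $y = \sum_j d_j h_j v_j'$ with $h_j = x u_j$ for a toy example and checks that it agrees with multiplying $x$ by the coefficient matrix $\sum_j d_j u_j v_j'$. The function names and all numerical values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
# Illustrative sketch of the linear regulatory-program model (sigma = identity).
# All names and numbers below are hypothetical, for demonstration only.

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def response_via_programs(x, U, d, V):
    """Compute y = sum_j d_j * h_j * v_j' with h_j = x u_j.

    U is a list of r input-weight vectors u_j (each of length p),
    V is a list of r output-weight vectors v_j (each of length q),
    d is a list of r program strengths d_j.
    """
    q = len(V[0])
    y = [0.0] * q
    for u_j, d_j, v_j in zip(U, d, V):
        h_j = dot(x, u_j)  # activity of program j
        for k in range(q):
            y[k] += d_j * h_j * v_j[k]
    return y

def coefficient_matrix(U, d, V):
    """Build the p x q matrix C = sum_j d_j u_j v_j'."""
    p, q = len(U[0]), len(V[0])
    C = [[0.0] * q for _ in range(p)]
    for u_j, d_j, v_j in zip(U, d, V):
        for i in range(p):
            for k in range(q):
                C[i][k] += d_j * u_j[i] * v_j[k]
    return C

def response_via_matrix(x, C):
    """Compute y = x C for a row vector x."""
    return [sum(x[i] * C[i][k] for i in range(len(x))) for k in range(len(C[0]))]

# Toy example with p = 4 regulators, q = 3 responses, r = 2 programs.
# The u_j (and the v_j) are sparse, unit length, and mutually orthogonal.
U = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.6, 0.8, 0.0]]
V = [[0.0, 1.0, 0.0], [0.6, 0.0, 0.8]]
d = [3.0, 1.5]
x = [2.0, 1.0, -1.0, 5.0]

y_programs = response_via_programs(x, U, d, V)
y_matrix = response_via_matrix(x, coefficient_matrix(U, d, V))
```

In this toy example the fourth regulator has zero weight in both programs, so changing its value leaves every response unchanged, which is the kind of sparsity the model is meant to capture.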

We make the biologically plausible assumption that only a small subset of regulators is contributing to any program and that each program regulates only a small subset of responses. Under this assumption, $u_j$ and $v_j$ are sparse vectors in $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively. The magnitude of the output weight vector (denoted by $d_j$) represents the “importance” of the $j$th program relative to other programs. Finally, the different programs are assumed to operate independently. Although there are many possible ways to enforce this independence, we choose to assume that $u_1, \ldots, u_r$ are orthogonal to each other and $v_1, \ldots, v_r$ are orthogonal to each other. This assumption enables us to develop fast algorithms for statistical inference of the model from observed data on $y$ and $x$.

It follows from the above assumptions that the $u_j$s and $v_j$s are respectively the left and right singular vectors in the singular value decomposition (SVD) of the coefficients matrix in the regression of $y$ on $x$. Although there have been considerable recent works on the use of sparse SVD in statistical modeling, most of them are targeted to the situation where only $y$ is observed and there is no predictor variable $x$ (5). An exception is reduced-rank stochastic regression with SVD (RRRR) (6), which to our knowledge seems to be the first to introduce a sparse SVD model for the regression relation. However, as will be seen below, our algorithm thresholding SVD (T-SVD) for learning the model is entirely new and provides substantial improvement in estimation accuracy as well as learning speed. Thus, besides making a conceptual contribution of formulating the regulatory programs as components in a sparse SVD model, our work also represents an advance in the statistical methodology for estimating such a model.

We demonstrate the better performance of our method compared with other existing methods using simulation data mimicking the sparse and combinatorial feature of complex biological systems. We also investigate the microRNA (miRNA)–gene regulation (i.e., regulation of gene expression by miRNA) and long noncoding RNA (lncRNA)–gene regulations by applying T-SVD to analyze a large ovarian cancer gene expression dataset from The Cancer Genome Atlas (TCGA) consortium (3). This analysis is challenging in that the sample size is substantially smaller than the number of regulators and responses. Our analysis reveals regulatory programs that associate specific miRNA or lncRNAs to relevant cancer pathways. Many of the regulator–target relationships are supported by external evidence from the literature.

Significance

With the increase in high-throughput data in genomic studies, the study of regulatory relationships between multidimensional predictors and responses is becoming a common task. Although high-dimensional data hold promise for revealing rich and complex regulations, it remains challenging to infer the relations between tens of thousands of responses and thousands of predictors, as the desired signal must be searched among an overwhelming number of irrelevant responses. Here we show that by formulating the regulatory programs as hidden-intermediate nodes in a linear network, a sparsity-inducing modeling and inference approach is effective in extracting the regulatory relations among very high-dimensional responses and predictors, even when the sample size is much lower.

Author contributions: X.M., L.X., and W.H.W. designed research; X.M. and L.X. performed research; X.M. and L.X. contributed new reagents/analytic tools; X.M. analyzed data; and X.M., L.X., and W.H.W. wrote the paper.

Reviewers included: H.Z., Yale University.

The authors declare no conflict of interest.

Freely available online through the PNAS open access option.

^1 X.M. and L.X. contributed equally to this work.

^2 To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1417808111/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1417808111 PNAS | November 4, 2014 | vol. 111 | no. 44 | 15675–15680

STATISTICS

Results

T-SVD Regression. The above model implies that if $y$ and $x$ are respectively the response and regulator vectors measured from a given sample, then we have the regression relationship $E(y \mid x) = x \left( \sum_j d_j u_j v_j' \right)$. In other words, the $p \times q$ matrix of regression coefficients has an SVD with singular values $d_j$ and corresponding left- and right-singular vectors $u_j$ and $v_j$ for $j = 1, \ldots, r$.

When response data are available from $n$ samples, they are denoted by an $n \times q$ matrix $Y$ whose rows correspond to the sample responses. Similarly, the regulator profiles for the $n$ samples are represented by an $n \times p$ matrix $X$. Our model can then be expressed in matrix form as $E(Y \mid X) = XC$, where $C = \sum_{j=1}^{r} d_j u_j v_j' = UDV'$, $U = [u_1, \ldots, u_r]$, $V = [v_1, \ldots, v_r]$, and $D$ is the $r \times r$ diagonal matrix with diagonal $d_1 \geq \cdots \geq d_r > 0$.

To fit the model we propose an iterative method. Conditional on $U$ (or $V$), we estimate $VD$ (or $UD$) by a thresholding-based regularized multivariate regression step (Materials and Methods, Eqs. 2 and 3), which gives an estimated matrix that is sparse but not orthogonal (i.e., not having orthogonal columns). To obtain an orthogonal estimate we developed a sparse orthogonal decomposition algorithm (SODA) (Materials and Methods). SODA, unlike the standard Gram–Schmidt process, does not destroy sparsity. We iterate between the estimates of $U$ and $V$ until reaching convergence (Fig. S1).

Two additional methodological innovations were incorporated into our T-SVD regression algorithm. First, the threshold parameter in the thresholding step is automatically determined by a Bayesian information criterion (BIC)-like criterion that is specifically derived for our model (Materials and Methods). Second, to achieve speed and scalability the algorithm exploits sparsity in its computation by precomputing and storing terms that are not changed by the iteration and by indexing the nonzero rows in the sparse large matrices so that only the calculation involving nonzero rows would be carried out. For more details on the algorithm, see Materials and Methods.

Compared with sequential extraction algorithm (SEA) and iterative exclusive extraction algorithm (IEEA) methods (6), our T-SVD method is an entirely different algorithmic approach to estimate a sparse SVD for the regression relation. Whereas SEA and IEEA try to estimate the singular vector pairs $(u_j, v_j)$ sequentially, starting from the pair with the largest singular value, our approach iterates between the estimation of the $U$ matrix and estimation of the $V$ matrix.
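The remark that the standard Gram–Schmidt process can destroy sparsity is easy to see on a small example. The Python sketch below applies classical Gram–Schmidt to two sparse, nonorthogonal vectors and counts the nonzero entries before and after. It only illustrates the problem that motivates SODA; it is not an implementation of SODA, and the vectors and helper names are hypothetical.

```python
import math

def gram_schmidt(columns):
    """Classical Gram-Schmidt: return orthonormal vectors spanning the inputs."""
    basis = []
    for v in columns:
        w = list(v)
        for q in basis:
            coef = sum(vi * qi for vi, qi in zip(v, q))  # projection of v on q
            w = [wi - coef * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        basis.append([wi / norm for wi in w])
    return basis

def n_nonzero(v, tol=1e-12):
    """Number of entries whose magnitude exceeds tol."""
    return sum(1 for vi in v if abs(vi) > tol)

# Two sparse vectors that overlap in one coordinate, so they are not orthogonal.
cols = [[1.0, 1.0, 0.0, 0.0],
        [0.0, 1.0, 1.0, 0.0]]

q1, q2 = gram_schmidt(cols)

nonzeros_before = n_nonzero(cols[1])  # second input vector
nonzeros_after = n_nonzero(q2)        # second orthogonalized vector
```

Here the second input vector has a zero in its first coordinate, but subtracting its projection onto the first vector fills that coordinate in, so the orthogonalized vector is less sparse than the input.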

Simulation Study. We performed comprehensive simulations to assess the performance of T-SVD relative to four existing methods. Table 1 lists the four methods and their references. Besides SEA and IEEA, we also include two biclustering-based methods (SSVD and BCssvd) (5, 7) in the comparison. Details of the simulation settings are given in Materials and Methods. Here we focus on the case where $n = 100$, $p = 150$, $q = 150$, and the signal-to-noise ratio (S2N) = 1. Fig. 2 gives the comparison results for the inference of three sparse matrices: coefficient matrix $C$, left singular vectors $U$, and right singular vectors $V$. The metrics for comparison are in terms of sensitivity, specificity, and sum of squares of errors (SSE). As shown in Fig. 2, T-SVD outperforms competing algorithms in almost all different scenarios considered here. The sensitivity of T-SVD is among the highest when the maximum correlation ($\rho$) among regulators is below 0.5 but starts to decrease afterward. On the other hand, T-SVD almost always gives the highest specificity by a large margin and it also achieves the lowest total estimation error in terms of SSE. The high specificity of T-SVD means that it produces the sparsest estimate among all algorithms. It is interesting to note that the biclustering-based methods have much lower specificity. Similar comparison results hold in other simulation settings (Figs. S2 and S3).
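The three comparison metrics can be computed from a true and an estimated matrix as in the following Python sketch. It follows the definitions given in the Fig. 2 legend (sensitivity as the fraction of true nonzeros that are identified, specificity as the fraction of identified nonzeros that are true, a quantity often called precision elsewhere, and SSE as the sum of squared errors). The function name and the toy inputs are hypothetical.

```python
def support_metrics(true_vals, est_vals, tol=0.0):
    """Sensitivity, specificity (Fig. 2 legend definition), and SSE.

    true_vals and est_vals are flat lists holding the entries of the true
    and estimated matrices in the same order.
    """
    true_nz = [abs(t) > tol for t in true_vals]
    est_nz = [abs(e) > tol for e in est_vals]
    hits = sum(1 for t, e in zip(true_nz, est_nz) if t and e)
    n_true = sum(true_nz)
    n_est = sum(est_nz)
    sensitivity = hits / n_true if n_true else float("nan")
    specificity = hits / n_est if n_est else float("nan")
    sse = sum((t - e) ** 2 for t, e in zip(true_vals, est_vals))
    return sensitivity, specificity, sse

# Hypothetical toy entries: three true nonzeros, three estimated nonzeros,
# two of which coincide.
true_c = [1.0, 0.0, 0.0, -2.0, 0.0, 0.5]
est_c = [0.5, 0.0, 0.5, -2.0, 0.0, 0.0]

sens, spec, sse = support_metrics(true_c, est_c)
```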

Performance of the BIC. The BIC is a criterion widely used in model selection but it does not apply directly in our setting where $n$ is much smaller than $q$. We proposed a new BIC for our model (Materials and Methods) and tested it by simulations against the standard BIC that simply used the number of nonzero entries of singular vectors as the number of free parameters. The results are given in Table S1. It is seen that the new BIC achieves a 10- to 100-fold decrease of incorrectly identified zeros in the estimation of $C$, $U$, and $V$.

In terms of computational efficiency, the two biclustering methods (SSVD and BCssvd) were the fastest (Table S2) but at the expense of greatly increasing the misclassification and estimation errors. T-SVD, although on average 200 times slower than biclustering methods, is nonetheless 50-fold and 180-fold faster than SEA and IEEA methods.

Application to Ovarian Cancer Data. We demonstrate the capability of T-SVD by applying it to the miRNA and protein-coding gene expression data from TCGA ovarian cancer data (3). We also obtain lncRNA expression from a recent study that uses a bioinformatics approach to infer lncRNA expression from array data (8). Instead of using the whole set of lncRNA, we focus on a subset of 4,297 lncRNAs that have significant association with ovarian cancer-related traits. The final dataset consists of 487 samples with measurements of all three RNA types (254 miRNAs with SD > 0.5, 11,864 protein-coding genes, and 4,297 lncRNAs).

The numbers of regulatory programs (i.e., rank of $C$) in the miRNA–gene regulation ($n = 487$, $p = 254$, $q = 11{,}864$) and lncRNA–gene regulation ($n = 487$, $p = 4{,}297$, $q = 11{,}864$) were estimated to be 28 and 22, respectively, using T-SVD. For the miRNA–gene regulation analysis, T-SVD selected one to eight miRNAs and 284–431 genes per program (Fig. 3A), resulting in on average around 100-fold reduction in number of predictors as well as 30-fold reduction in number of responses (Table S3). The estimated lncRNA–gene regulation also showed substantial sparsity in each program (Fig. S4 and Table S4). Additionally, the set of regulators and response genes in each program (i.e., regulators and genes with nonzero edge weights in the program) shows minimal overlaps across programs (Fig. 3B). These together demonstrate that the T-SVD has succeeded in extracting independent and sparse regulatory programs from the data.

Fig. 1. Schematic representation of the combinatorial regulation using regulatory programs. Circles represent $p$ predictors $X$, hexagons represent $q$ responses $Y$, and diamonds represent $r$ regulatory programs $h$. Predictors and responses were color matched to their corresponding related hidden programs. Each $a_{ij}$ represents the strength of the input from regulator $i$ to program $j$, where $i = 1, \ldots, p$ and $j = 1, \ldots, r$. Each $b_{jk}$ represents the strength of the input from program $j$ to output response $k$, where $j = 1, \ldots, r$ and $k = 1, \ldots, q$. The arrow indicates the direction of the input.

Table 1. List of existing algorithms used in comparison with T-SVD in the simulation study

Name      Brief description of the algorithm
SEA       Sequential extraction algorithm using SVD
IEEA      Iterative extraction algorithm using SVD
SSVD      Biclustering method using sparse SVD
BCssvd    Biclustering method using sparse SVD

miRNA–Gene Regulation. A singular value $d_i$ reflects the relative importance of the corresponding program. Thus, we would expect to see the first regulation program capture the global gene expression pattern within the samples. A previous study from TCGA (3) classified the 487 ovarian cancer samples into four subtypes (immunoreactive, differentiated, proliferative, and mesenchymal) based on the global gene expression profile. Little was known regarding which miRNAs might be more related to the gene expression cancer subtypes. Program 1 in the miRNA–gene regulation clearly captured the major subtypes, especially the immunoreactive and proliferative subtypes (Fig. 3C). The majority of genes and four of five miRNAs in this program show high expression in immunoreactive subtype samples and low expression in proliferative subtype samples. Interestingly, miR-142 was identified as the strongest signature associated with lymphocyte-specific gene expression and methylation across multiple cancer types in another study (9). Several recent cancer studies also supported the inhibitory effect on cell proliferation by miR-142-3p in pancreatic cancer (10) and miR-224 in ovarian cancer (11). The only miRNA that showed high expression in proliferative subtype and low expression in immunoreactive subtype is miR-218. However, recent studies suggested its role as a tumor suppressor and inhibiting cell proliferation (12, 13). This suggested the complexities of miRNA regulations in different cancer types. High-throughput experimental approaches such as CLIP-seq (14) in the ovarian cancer-related cells would help to elucidate the direct targets and functional roles of these miRNAs.

The response genes from the top programs showed enrichment in cancer-related KEGG and Reactome pathways, including pathways in cancer, transcriptional misregulation in cancer, cell adhesion, and GPCR signaling pathways (15) (Fig. S5). Based on information in a recent review paper (16), the vast majority (10 of 14) of the miRNAs from the top three programs were related to cancer. We use program 3 as an example, to illustrate the ability of T-SVD to capture important miRNA–gene regulation pathways. As shown in Fig. 4A, the response genes in this program are strongly enriched in cell adhesion and virus infection-related pathways. The results are also compared with information from starBase, a database of miRNA–target interaction constructed from combined evidence of sequence-based predictions and 14 large-scale RNA–protein interaction experimental datasets (17). We used starBase entries supported by at least one RNA–protein interaction experiment as the true interaction reference. In general only 1.9% of miRNA–gene pairs have support from starBase, whereas we found that 3.8% of the miRNA–gene pairs on this program have support from starBase, which represents a twofold enrichment and is strongly significant (hypergeometric P value = 4e-7, Fig. 4B).

In particular, 6 of the 11 genes in the enriched ECM–receptor interaction pathway showed evidence of interaction with multiple miRNAs identified as regulators in program 3 (Fig. 4C). Both let-7 and miR-29b were demonstrated to play critical roles in cancer proliferation and the ECM pathway in other cancer types (18, 19). Furthermore, the main target gene of let-7 in lung cancer (18), HMGA2, was recently shown to induce ovarian surface epithelial transformation through regulation of EMT genes (20). Taken together, these results suggest that let-7, miR-29b, and perhaps some of the other miRNAs in program 3 may play regulatory roles in ovarian cancer through extracellular matrix pathways.
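The enrichment of starBase-supported pairs described above is assessed with a hypergeometric test. The Python sketch below shows how such an upper-tail P value can be computed from four counts. The paper reports only the two percentages and the resulting P value, so the counts used here are hypothetical placeholders that merely mimic the 1.9% and 3.8% rates; they are not the study's data and will not reproduce the reported P value.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for a hypergeometric variable X.

    N: total number of miRNA-gene pairs considered (population size)
    K: number of pairs in the population with database support
    n: number of pairs selected (e.g., pairs in one program)
    k: number of selected pairs with database support
    """
    upper = min(K, n)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total

# Small sanity check with an exactly known answer: drawing all 5 supported
# pairs in 5 draws from 10 has probability 1 / C(10, 5) = 1 / 252.
p_small = hypergeom_upper_tail(5, 10, 5, 5)

# Hypothetical counts (assumed for illustration only): 10,000 pairs overall,
# 190 supported (1.9%); 500 pairs in the program, 19 supported (3.8%).
p_example = hypergeom_upper_tail(19, 10_000, 190, 500)
```

Exact integer binomial coefficients are used here to avoid floating-point underflow; for much larger populations a log-space computation or a statistics library would be the more practical choice.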

lncRNA–Gene Regulation. Because knowledge of lncRNA targets and functions is limited, we focused on the enriched Gene Ontology (GO) (21) categories of the response genes in top programs from T-SVD. The enriched GO categories in the top three programs include cell cycle, chromatin organization, chromatin modification, and mitochondrion, etc. (Fig. S6).

Many well-studied lncRNAs [e.g., Xist (22), HOTAIR (23), and HOTTIP (24)] regulate gene expression through interactions with chromatin modification complexes and then targeting these enzymatic activities to appropriate locations in the genome (25). Programs 2 and 3 are significantly associated with chromatin regulation-related GO categories (Fig. S6). Eighty-eight genes in the chromatin modification (chromatin regulation, CR) biological process (GO:0016568) are found in the response genes in program 2 (hypergeometric test P value = 6.8e-7). There are 2 CR genes, CENPN and CCNB1, among the top 20 genes in this program if we rank genes by the magnitude of the corresponding component of the right singular vector of the program. CENPN was recently identified as a core component in a breast cancer prognostic gene signature (26).

Because many lncRNAs are involved in cis-regulation (27), we examined whether the high-ranking genes are near the same genomic locus of the regulatory lncRNAs in a program. One of the strongest lncRNAs in program 2, ENSG00000233589 (RP4-694A7.2), was found to be on the antisense strand of gene DEPDC1. The coefficient estimate from T-SVD (shown below the diagonal line in Fig. 5) was very strong as corroborated by the pairwise Pearson correlation coefficient ($r = 0.6$, shown above the diagonal line in Fig. 5). An additional program that showed CR gene enrichment is program 13, which has HMGA2 as a top-ranking response gene (Fig. S7). HMGA2 is the main target gene of let-7 miRNA in lung cancer (18) and also found to be strongly regulated by miRNA let-7c in our miRNA–gene regulation analysis. Finally we note that although the above analysis shows that some of the top lncRNA–gene regulations have substantial support from the literature, the exact molecular mechanism of these lncRNAs and their direct targets needs further experimental evaluation.

Fig. 2. Evaluation of different algorithms through simulations. We compared the performance of T-SVD, SEA, IEEA, SSVD, and BCssvd for the estimation of three matrices $C$ (A), $U$ (B), and $V$ (C). The dimensions of the simulation results are $n = 100$, $p = 150$, $q = 150$. Details of the simulation methods are in Materials and Methods. We varied the parameter $\rho$, which represents the strength of the correlation among all predictors, from 0 (no correlation) to 0.625 (high correlation). From Left to Right, the three panels in each row represent the performance based on three different statistics: sensitivity (percentage of true nonzeros identified by each method), specificity (percentage of identified nonzero items being true), and SSE (sum of squared errors).

Discussion

SVD has previously been used in biclustering of genes and samples based on a sample of gene expression profiles (5, 28). The use of sparse SVD to model the regression relation (i.e., the matrix $C$) was first introduced in the RRRR method. However, RRRR is not guaranteed to give orthogonal singular vectors and its approach of sequential extraction of rank one components does not scale well to a high dimension. This is confirmed in our simulations. Another joint variable and rank selection method (29) uses an $\ell_2$ group penalty on the rows of $C$. However, this algorithm along with another recently proposed method (30) can reduce dimension in predictor space but not response space and does not provide information on independent regulatory programs. A sparse network-regularized multiple nonnegative matrix factorization (SNMNMF), which incorporates the known interaction from literature as prior information in the parameter estimations, was recently proposed for the inference of miRNA–gene regulation (31). However, SNMNMF identifies only the coexistence relationship between predictors and responses without the estimation of the relative strength or direction of regulation.

Here, we formally established connections between the regulation networks and SVD regression both conceptually and mathematically and proposed an SVD regression-based model (T-SVD) for learning regulatory programs in a complex biological system. Our model can capture the association between a large number of predictors and responses simultaneously and identify regulatory relationships between subsets of regulators and subsets of responses. The learning of independent regulatory programs provides deeper insight into the complex regulatory relations underlying the biological system.

Other than the ncRNA–gene regulation examples shown here, the T-SVD framework is applicable to other areas of genomic studies such as the inference of shared trans-expression QTL and global regulation of epigenetic markers on gene expression, etc. We have implemented the T-SVD method as a freely available R package named “T_SVD”.

Materials and Methods

T-SVD Model. With the SVD representation of the coefficient matrix, model $Y = XC + E$ becomes

$$Y = XUDV' + E. \tag{1}$$

The SVD expression of $C$ represents $r$ programs of parameters with decreasing importance and each program relates the responses to the predictors in a unique way. For the $k$th program, $u_k$ is interpreted as the predictor effect, $v_k$ is interpreted as the response effect, and $d_k$ indicates the relative importance of the program. When $u_k$ and $v_k$ are sparse with many zero entries, only a few predictors account for the effect of the $k$th program, and only a few responses are predicted by it.

We propose to estimate $U$ and $V$ by an iterative thresholding algorithm similar to the one used in Yang et al. (7) for the sparse SVD of a matrix. To be more specific, first fix $U$. Let $X_v = XU$ and $V_v = VD$. Then

$$Y = X_v V_v' + E. \qquad [2]$$

Here $V_v$ is a matrix with sparse and orthogonal columns. Because $r < p$, we estimate $V_v$ initially by $\hat{V}_v = \{(X_v'X_v)^{-1}X_v'Y\}'$. To “kill” the small coordinates in $\hat{V}_v$, we threshold $\hat{V}_v$ with a thresholding function $\Theta_v$ so that $V_v^{thr} = \Theta_v(\hat{V}_v; \gamma_v)$. Although the Gram–Schmidt process can be used to extract the orthogonal component of $V_v^{thr}$, it is not guaranteed to produce a sparse result even if $V_v^{thr}$ itself is sparse. Consequently, we develop a novel SODA and apply it to $V_v^{thr}$ to obtain an estimate with sparse and orthogonal columns. SODA is similar to the QR decomposition but is more appropriate for extracting sparse orthogonal vectors; details are provided later. Next fix $V$. Multiplying Eq. 1 from the right by $V$ leads to

$$Y_u = XU_u + E_u, \qquad [3]$$

where $Y_u = YV$, $U_u = UD$, and $E_u = EV$. Here $U_u$ is also a matrix containing sparse and orthogonal columns. If $p \le n$, we can estimate $U_u$ in the same way as before. We focus on the case $p > n$, where regularization methods are needed. We adopt the thresholding-based iterative procedure (32), which iterates

$$U_u^{(k)} = \Theta_u\left\{\left(I_p - \frac{\Sigma}{\|\Sigma\|_2}\right)U_u^{(k-1)} + \frac{X'YV}{\|\Sigma\|_2};\ \gamma_u\right\}. \qquad [4]$$

Here $\Sigma = X'X$, $\|\Sigma\|_2$ is the operator norm of $\Sigma$, $\Theta_u$ is a thresholding function, and $\gamma_u$ is a thresholding parameter. Because we iterate between the estimates of $U$ and $V$, we can just iterate [4] once. Then we apply SODA again to obtain an estimate of $U$.
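The two thresholded updates can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation in the R package: it assumes a hard-thresholding function (the thresholding function actually used is described in SI Materials and Methods), it omits the subsequent SODA step, and the function names `hard_threshold`, `v_step`, and `u_step` are hypothetical.

```python
import numpy as np

def hard_threshold(M, gamma):
    """Set to zero every entry whose absolute value is at most gamma."""
    return np.where(np.abs(M) > gamma, M, 0.0)

def v_step(X, Y, U, gamma_v):
    """Thresholded least-squares estimate of V_v = V D with U fixed (model [2])."""
    Xv = X @ U                                        # n x r
    Vv_hat = np.linalg.solve(Xv.T @ Xv, Xv.T @ Y).T   # q x r
    return hard_threshold(Vv_hat, gamma_v)

def u_step(X, Y, V, Uu_prev, gamma_u):
    """One pass of iteration [4] for U_u = U D with V fixed (model [3])."""
    Sigma = X.T @ X
    op_norm = np.linalg.norm(Sigma, 2)                # operator (spectral) norm
    update = Uu_prev - Sigma @ Uu_prev / op_norm + X.T @ Y @ V / op_norm
    return hard_threshold(update, gamma_u)
```

With noiseless data and the true $U$ and $V$ supplied, `v_step` returns $VD$ and $UD$ is a fixed point of `u_step`, which is a convenient sanity check before noise is added.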

Fig. 3. miRNA–gene regulation program properties. (A) Distribution of the number of predictor miRNAs and response genes in each of the programs. (B) Venn diagram showing that there are minimal overlaps among different programs in $V$. (C) The expression heat map of predictor miRNAs and response genes on the first regulatory program. Columns correspond to samples and rows correspond to genes and miRNAs, and expression levels from low to high are coded in a blue to red color scheme. The samples are clustered using the expressions of the nonzero response genes on the first program, and there is clear clustering of immunoreactive and proliferative subtype samples based on the expression profile for this particular program.

15678 | www.pnas.org/cgi/doi/10.1073/pnas.1417808111 Ma et al.


The proposed iterative thresholding algorithm is named T-SVD and is shown below.

The superscript $(k)$ indicates the $k$th iteration. The detailed information on the thresholding function is given in SI Materials and Methods. The algorithm reduces to the algorithm in ref. 7 if $X$ is an identity matrix, except that we use the sparse orthogonal decomposition instead of the usual QR decomposition in steps 3 and 5.

Further Details for Estimation. Sparse orthogonal decomposition. The sparse orthogonal decomposition is designed to extract sparse and orthogonal eigenvectors from a sparse matrix. To illustrate the idea, we consider a two-column matrix $[u_1, u_2]$. The QR decomposition gives an orthogonal matrix $[v_1, v_2]$, where $v_1 = u_1/\|u_1\|$ and $v_2 = \tilde{v}_2/\|\tilde{v}_2\|$ with $\tilde{v}_2 = u_2 - (u_2'v_1)v_1$. Suppose $u_1$ and $u_2$ are both sparse; then $v_2$ might contain more nonzero entries than $u_2$ because of the orthogonal constraint with $u_1$. The sparse orthogonal decomposition changes only the nonzero entries of $u_2$ so that the entries in $v_2$ remain zero whenever the corresponding entries in $u_2$ are zero. Let $[v_1^*, v_2^*]$ denote the resulting singular vectors from the sparse orthogonal decomposition. The following is an example:

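$$[u_1\ u_2] = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & -0.9 \end{pmatrix}, \quad [v_1\ v_2] = \begin{pmatrix} 0.5 & -0.019 \\ 0.5 & -0.019 \\ 0.5 & 0.725 \\ 0.5 & -0.688 \end{pmatrix}, \quad [v_1^*\ v_2^*] = \begin{pmatrix} 0.5 & 0 \\ 0.5 & 0 \\ 0.5 & 0.707 \\ 0.5 & -0.707 \end{pmatrix}.$$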
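The example can be reproduced numerically. The sketch below rests on an assumption about the rule, since the text here specifies only that zero entries must stay zero: for each column, the entries on its support are replaced by the least-squares residual against the earlier columns restricted to that support. This rule gives the $[v_1^*, v_2^*]$ shown above, but it is not claimed to be the authors' exact algorithm, and the name `sparse_orthogonalize` is hypothetical.

```python
import numpy as np

def sparse_orthogonalize(M, tol=1e-12):
    """Orthonormalize columns left to right, changing only nonzero entries when possible."""
    M = np.asarray(M, dtype=float)
    n, r = M.shape
    V = np.zeros((n, r))
    for j in range(r):
        u = M[:, j]
        support = np.flatnonzero(u)
        w = np.zeros(n)
        w[support] = u[support]
        if j > 0 and support.size > 0:
            # Assumed rule: on the support only, remove the part explained
            # by the earlier columns restricted to that support.
            A = V[support, :j]
            coef = np.linalg.lstsq(A, u[support], rcond=None)[0]
            w[support] = u[support] - A @ coef
        if j > 0 and np.linalg.norm(w) <= tol:
            # Fallback to an ordinary Gram-Schmidt step, which may fill in zeros.
            prev = V[:, :j]
            w = u - prev @ (prev.T @ u)
        norm = np.linalg.norm(w)
        if norm <= tol:
            raise ValueError(f"column {j} is linearly dependent on earlier columns")
        V[:, j] = w / norm
    return V
```

Calling `sparse_orthogonalize([[1, 0], [1, 0], [1, 1], [1, -0.9]])` keeps the first two entries of the second column at zero. When the second column has a single nonzero entry, the restricted residual vanishes and the sketch falls back to the Gram–Schmidt step.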

As the example shows, the sparse orthogonal decomposition may provide sparser orthogonal vectors than the QR decomposition. The sparse orthogonal decomposition has some limitations. In this example, it will fail algebraically if the third entry of $u_2$ is zero. In such a case, we revert to the QR decomposition.

Initialization and implementation. The proposed algorithm requires an initial estimate, $U^{(0)}$, $V^{(0)}$, and $d^{(0)}$, which can be obtained from an initial estimate of $C$ through the SVD. One plausible estimator is the reduced-rank least squares estimator (33, 34), which is consistent for high-dimensional data (35). Another one is the ridge regression estimator, in which a small multiple of the identity matrix, $\epsilon I_p$, is added to $\Sigma$ to make it invertible. In the simulation study and for the real data, we use the ridge estimator and let $\epsilon = (0.1/p)\sum_{i=1}^{n}\lambda_i(\Sigma)$, where $\lambda_i(\Sigma)$ is the $i$th singular value of $\Sigma$.
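A minimal sketch of the ridge initialization follows, assuming $\Sigma = X'X$ as defined for iteration [4] and a plain truncated SVD of the ridge estimate of $C$. The function name `ridge_initial_estimate` and the use of NumPy are illustrative choices, not part of the released package.

```python
import numpy as np

def ridge_initial_estimate(X, Y, r):
    """Return (U0, d0, V0) from the rank-r SVD of a ridge estimate of C."""
    n, p = X.shape
    Sigma = X.T @ X
    # Singular values of Sigma in decreasing order; at most n are nonzero when p > n.
    sing = np.linalg.svd(Sigma, compute_uv=False)
    eps = (0.1 / p) * sing[:n].sum()
    C0 = np.linalg.solve(Sigma + eps * np.eye(p), X.T @ Y)   # p x q
    U, d, Vt = np.linalg.svd(C0, full_matrices=False)
    return U[:, :r], d[:r], Vt[:r, :].T
```

The returned factors are dense; sparsity enters only through the subsequent thresholding iterations.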

The accuracy of the algorithm may depend on the initial estimate. To reduce the effect of the initial estimate, we adopt a two-step procedure. We first run the proposed algorithm with the ridge estimator as starting values, and then we use the resulting estimate as a new initial estimator and run the algorithm again to obtain the final estimate.

Selection of the tuning parameters. Our idea is to combine the BICs for conditional models and propose the criterion

$$\mathrm{BIC} = \log(\mathrm{SSE}) + \frac{\log(rn)}{rn}\,\mathrm{df}_v + \frac{\log(qn)}{qn}\,\mathrm{df}_u,$$

where $\mathrm{SSE} = \|Y - XUDV'\|_F^2$, $\mathrm{df}_v$ is the degrees of freedom of model [2], and $\mathrm{df}_u$ is the degrees of freedom of model [3]. Here $\|\cdot\|_F$ is the Frobenius norm. Conditional on $U$, our estimation procedure with a hard-thresholding function is equivalent to an $\ell_0$ penalization (31), and hence without the orthogonal constraints, the number of nonzero entries in the final estimate of $V_v$ can be easily shown to be an unbiased estimator of the degrees of freedom of the $\ell_0$ penalization. Therefore, we estimate $\mathrm{df}_v$ by $\widehat{\mathrm{df}}_v = \#(V \neq 0) - r(r-1)/2$, where $\#(\cdot)$ denotes the number of true statements in a vector of expressions or in a matrix of expressions. Note that we subtract $r(r-1)/2$ in the above formula as there are $r(r-1)/2 + r$ constraints in $V$ and $r$ free parameters in $D$. Conditional on $V$, the number of nonzero elements in the final estimate of $U_u$ underestimates the degrees of freedom of the hard-thresholded least-squares estimation; however, the bias is negligible under some mild conditions. Therefore, we estimate $\mathrm{df}_u$ by $\widehat{\mathrm{df}}_u = \#(U \neq 0) - r(r-1)/2$. The derivation of the degrees of freedom of the hard-thresholded least-squares estimation is given in SI Materials and Methods.
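The criterion is straightforward to evaluate once estimates of $U$, $D$, and $V$ are available. The sketch below transcribes the formula as printed above, with the degrees of freedom estimated by the nonzero counts minus $r(r-1)/2$. The function name `tsvd_bic` is hypothetical, and the singular values are assumed to be passed as a vector `d`.

```python
import numpy as np

def tsvd_bic(Y, X, U, d, V):
    """Combined BIC for a fitted T-SVD model with singular values d."""
    n, q = Y.shape
    r = len(d)
    residual = Y - X @ U @ np.diag(d) @ V.T
    sse = np.sum(residual ** 2)                      # squared Frobenius norm
    df_v = np.count_nonzero(V) - r * (r - 1) / 2
    df_u = np.count_nonzero(U) - r * (r - 1) / 2
    return (np.log(sse)
            + np.log(r * n) / (r * n) * df_v
            + np.log(q * n) / (q * n) * df_u)
```

In practice one would evaluate this over a grid of thresholding parameters $(\gamma_u, \gamma_v)$ and keep the fit with the smallest value; a perfect fit with SSE equal to zero makes the logarithm diverge and would need separate handling.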

[Figure 4 graphic. (A) KEGG pathway enrichment, axis: log10(P value); pathways: cytokine–cytokine receptor interaction, focal adhesion, ECM–receptor interaction, influenza A, HTLV-I infection, herpes simplex infection, systemic lupus erythematosus, rheumatoid arthritis. (B) Pie charts: validated interactions (1.9%) from all possible interactions (n = 3,013,456); validated interactions (3.8%) from T-SVD on layer 3 (n = 1,612); P = 4E−7. (C) Network nodes: genes COL2A1, CAV2, CAPN2, ITGB4, ITGB8, COL1A1; miRNAs miR-9, let-7c, miR-29b, miR-301b.]

Fig. 4. Functional analysis of program 3 of miRNA–gene regulations. (A) Significantly enriched KEGG pathways. (B) Pie charts showing that the miRNA–gene regulations predicted by T-SVD are significantly enriched for experimental support based on starBase. (C) Network of experimentally validated miRNA–gene interactions for the ECM–receptor interaction pathway. Only target genes based on the experimentally validated database starBase are shown. Genes are shown as circles and miRNAs are shown as boxes.

Ma et al. PNAS | November 4, 2014 | vol. 111 | no. 44 | 15679

Simulation. The rank of the coefficient matrix was fixed at 3. The design matrix $X$, of size $n \times p$, is generated from a multivariate normal distribution with mean 0 and covariance matrix $\Sigma$, where the $(k, j)$th entry of $\Sigma$ is defined as $\Sigma_{kj} = \rho^{|k-j|}$ for $k = 1, \ldots, n$; $j = 1, \ldots, p$. Here $\rho$, which represents the strength of the correlation among all predictors, either is 0 for the independent case or varies from 0.125 to 0.625 with an increment of 0.125 for the dependent case. Let $C = UDV'$, where $U$ and $V$, whose unnormalized values are shown in Table S5, are three-column orthogonal matrices containing left and right singular vectors, respectively. $D$ is a $3 \times 3$ diagonal matrix with diagonal entries (20, 10, 5). The response matrix $Y$ is generated by $Y = XC + E$, where $E$ is the matrix with i.i.d. errors from $N(0, \sigma^2)$. We varied $\sigma^2$ to achieve different levels of S2Ns; i.e., $\sigma^2 = \{\sum \mathrm{diag}(C' \times \Sigma \times C)\}/(n \times q \times \mathrm{s2r})$. For each scenario, 200 replications were simulated. Two algorithms, SEA and IEEA, from the reduced-rank stochastic regression model (6) implemented in the R package “RRRR,” were included in the comparison. Two SVD biclustering models, SSVD and BCssvd (5, 7), implemented in the R packages “ssvd” and “s4vd,” were also included in the comparison. Because SSVD and BCssvd are biclustering algorithms, the input matrix was calculated by $(X'X)^{-1}X'Y$ for $n > p$ and $(X'X + 0.1 \times \min(d)/p \times \mathrm{diag}(p))^{-1}X'Y$ for $n \le p$, where $d$ is the vector of singular values of $X$. All algorithms were assessed for sensitivity (percentage of true nonzeros identified by each method) and specificity (percentage of identified nonzero items being true), as well as sum of squared errors, which is defined as $\mathrm{SSE}(\cdot, \cdot) = \|\cdot - \cdot\|^2$ for $C$, $U$, and $V$, respectively. The average computation time for each algorithm was also recorded.
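One replicate of this design can be generated as sketched below. Several choices are assumptions: $\Sigma$ is taken to be the $p \times p$ predictor covariance, the noise variance follows the printed formula with the signal-to-noise level passed as `s2n`, and `U` and `V` are supplied by the caller, since the Table S5 values are not reproduced here. The function names are hypothetical.

```python
import numpy as np

def ar1_covariance(p, rho):
    """p x p matrix with (k, j) entry rho^|k - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate(n, U, d, V, rho, s2n, rng):
    """Draw X, Y = XC + E with C = U diag(d) V' and noise scaled to the S2N level."""
    p, q = U.shape[0], V.shape[0]
    Sigma = ar1_covariance(p, rho)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    C = U @ np.diag(d) @ V.T
    sigma2 = np.trace(C.T @ Sigma @ C) / (n * q * s2n)
    Y = X @ C + rng.normal(scale=np.sqrt(sigma2), size=(n, q))
    return X, Y, C

def sensitivity_specificity(est, truth):
    """Sensitivity and specificity as defined in the text (fractions, not percentages)."""
    est_nz, true_nz = est != 0, truth != 0
    hits = np.sum(est_nz & true_nz)
    sensitivity = hits / np.sum(true_nz)   # share of true nonzeros that were found
    specificity = hits / np.sum(est_nz)    # share of reported nonzeros that are true
    return sensitivity, specificity
```

Note that specificity as defined in the text is the fraction of reported nonzeros that are correct, so it is undefined when an estimate has no nonzero entries.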

Ovarian Cancer TCGA Data. The miRNA and gene expression data for the 489 published samples (487 samples have measurements in both miRNA and gene expression) were obtained from the TCGA ovarian cancer study (3) companion website: tcga-data.nci.nih.gov/docs/publications/ov2011/. Unified expression of 11,864 genes from three different platforms (Agilent, Affymetrix HuEx, and Affymetrix U133A) along with the 254 miRNAs with large variation (SD > 0.5) from the original data file were used to carry out the analysis. The lncRNA expression data were extracted from a recent study (8), and the predictors are selected to be the 4,297 lncRNAs including literature-curated lncRNAs, ovarian cancer subtype-specific lncRNAs, lncRNAs associated with overall or progression-free survival, and lncRNAs associated with local copy number changes.

Functional Pathways and GO Categories Enriched in Predicted Response Genes. We used the Cytoscape plug-in ClueGO (36) for the functional analysis of the predicted response genes in the miRNA–gene and lncRNA–gene regulation. We obtained the significantly enriched (Benjamini–Hochberg corrected P value < 0.05) KEGG and Reactome (37) pathways for the predicted response genes from T-SVD in each program. For the lncRNA–gene regulation, we also searched for the large (>50 overlapping genes) significantly enriched biological processes in GO. As previous lncRNA studies suggest an important role of lncRNA in chromatin regulation (25), we collected a CR-related gene list by assembling all genes associated with the GO category chromatin modification (GO:0016568) and its offspring. Then we specifically tested the enrichment of CR genes among the predicted response genes in each program by a hypergeometric test.
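For reference, the upper-tail hypergeometric P value used in such an enrichment test can be computed with the standard library alone. The sketch assumes a one-sided test for over-representation; the function name and argument names are hypothetical, and the numbers in the usage note are illustrative rather than taken from the study.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N genes, K of them annotated."""
    upper = min(K, n)
    lower = max(k, 0)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(lower, upper + 1))
    return tail / comb(N, n)
```

For example, `hypergeom_enrichment_p(10, 3, 4, 2)` is the probability that at least 2 of 4 selected genes fall in a 3-gene category out of 10 genes, which equals 1/3.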

ACKNOWLEDGMENTS. We thank Kun Chen for providing the RRRR package. We are grateful to the TCGA consortium for generating the cancer data. X.M. and W.H.W. were partially supported by National Institutes of Health Grant R01HG006018; L.X. was partially supported by National Institute of Neurological Disorders and Stroke Grant R01NS060910.

1. Brown PO, Botstein D (1999) Exploring the new world of the genome with DNA microarrays. Nat Genet 21(1, Suppl):33–37.

2. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5(7):621–628.

3. Cancer Genome Atlas Research Network (2011) Integrated genomic analyses of ovarian carcinoma. Nature 474(7353):609–615.

4. Barretina J, et al. (2012) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483(7391):603–607.

5. Lee M, Shen H, Huang JZ, Marron JS (2010) Biclustering via sparse singular value decomposition. Biometrics 66(4):1087–1095.

6. Chen K, Chan K-S, Stenseth NC (2012) Reduced rank stochastic regression with a sparse singular value decomposition. J R Stat Soc B 74(2):203–221.

7. Yang D, Ma Z, Buja A (2013) A sparse SVD method for high-dimensional data. J Comput Graph Stat arXiv:1112.2433v1.

8. Du Z, et al. (2013) Integrative genomic analyses reveal clinically relevant long noncoding RNAs in human cancer. Nat Struct Mol Biol 20(7):908–913.

9. Andreopoulos B, Anastassiou D (2012) Integrated analysis reveals hsa-miR-142 as a representative of a lymphocyte-specific gene expression and methylation signature. Cancer Inform 11:61–75.

10. MacKenzie TN, et al. (2013) Triptolide induces the expression of miR-142-3p: A negative regulator of heat shock protein 70 and pancreatic cancer cell proliferation. Mol Cancer Ther 12(7):1266–1275.

11. White NM, et al. (2010) Three dysregulated miRNAs control kallikrein 10 expression and cell proliferation in ovarian cancer. Br J Cancer 102(8):1244–1253.

12. Uesugi A, et al. (2011) The tumor suppressive microRNA miR-218 targets the mTOR component Rictor and inhibits AKT phosphorylation in oral cancer. Cancer Res 71(17):5765–5778.

13. Venkataraman S, et al. (2013) MicroRNA 218 acts as a tumor suppressor by targeting multiple cancer phenotype-associated genes in medulloblastoma. J Biol Chem 288(3):1918–1928.

14. Zhang C, Darnell RB (2011) Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data. Nat Biotechnol 29(7):607–614.

15. Dorsam RT, Gutkind JS (2007) G-protein-coupled receptors and cancer. Nat Rev Cancer 7(2):79–94.

16. Koturbash I, Zemp FJ, Pogribny I, Kovalchuk O (2011) Small molecules with big effects: The role of the microRNAome in cancer and carcinogenesis. Mutat Res 722(2):94–105.

17. Li JH, Liu S, Zhou H, Qu LH, Yang JH (2014) starBase v2.0: Decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res 42(Database issue):D92–D97.

18. Mayr C, Hemann MT, Bartel DP (2007) Disrupting the pairing between let-7 and Hmga2 enhances oncogenic transformation. Science 315(5818):1576–1579.

19. Park SY, Lee JH, Ha M, Nam JW, Kim VN (2009) miR-29 miRNAs activate p53 by targeting p85 alpha and CDC42. Nat Struct Mol Biol 16(1):23–29.

20. Wu J, et al. (2011) HMGA2 overexpression-induced ovarian surface epithelial transformation is mediated through regulation of EMT genes. Cancer Res 71(2):349–359.

21. Ashburner M, et al.; The Gene Ontology Consortium (2000) Gene ontology: Tool for the unification of biology. Nat Genet 25(1):25–29.

22. Maenner S, et al. (2010) 2-D structure of the A region of Xist RNA and its implication for PRC2 association. PLoS Biol 8(1):e1000276.

23. Tsai MC, et al. (2010) Long noncoding RNA as modular scaffold of histone modification complexes. Science 329(5992):689–693.

24. Wang KC, et al. (2011) A long noncoding RNA maintains active chromatin to coordinate homeotic gene expression. Nature 472(7341):120–124.

25. Rinn JL, Chang HY (2012) Genome regulation by long noncoding RNAs. Annu Rev Biochem 81:145–166.

26. Wu G, Stein L (2012) A network module-based method for identifying cancer prognostic signatures. Genome Biol 13(12):R112.

27. Guil S, Esteller M (2012) Cis-acting noncoding RNAs: Friends and foes. Nat Struct Mol Biol 19(11):1068–1075.

28. Alter O, Brown PO, Botstein D (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 97(18):10101–10106.

29. Bunea F, She Y, Wegkamp MH (2012) Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. Ann Stat 40(5):2359–2388.

30. Ma Z, Sun T (2014) Adaptive sparse reduced-rank regression. arXiv:1403.1922.

31. Zhang S, Li Q, Liu J, Zhou XJ (2011) A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. Bioinformatics 27(13):i401–i409.

32. She Y (2009) Thresholding-based iterative selection procedures for model selection and shrinkage. Electron J Stat 3:384–415.

33. Anderson TW (1951) Estimating linear restrictions on regression coefficients for multivariate normal distributions. Ann Math Stat 22(3):327–351.

34. Reinsel GC, Velu RP (2006) Partially reduced-rank multivariate regression models. Stat Sin 16:899–917.

35. Bunea F, She Y, Wegkamp MH (2011) Optimal selection of reduced rank estimators of high-dimensional matrices. Ann Stat 39(2):1282–1309.

36. Bindea G, et al. (2009) ClueGO: A Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics 25(8):1091–1093.

37. Wu G, Feng X, Stein L (2010) A human functional protein interaction network and its application to cancer data analysis. Genome Biol 11(5):R53.

[Figure 5 graphic. Color scale from −1 to 1. Rows and columns are labeled with the lncRNAs ENSG00000254551, ENSG00000233589, ENSG00000229044, ENSG00000259153, ENSG00000227540, ENSG00000232065, ENSG00000255857, XLOC_013174, ENSG00000233895, XLOC_006753, ENSG00000261098, ENSG00000225032, ENSG00000251143, ENSG00000224660, ENSG00000260054 and the genes CKS1B, CDC2, MRPL11, POLE2, XTP3TPA, GINS2, RAD1, CENPN, SPC25, STOML2, RFC2, DEPDC1*, CDKN3, BXDC2, CCNB1, RPL39L, RAN, BIRC5, PRIM1, RAB32.]

Fig. 5. Coefficient estimate and pairwise Pearson correlation of the lncRNA–gene regulation matrix in program 2. Coefficient estimates from the T-SVD (color and dot size rescaled to [−1, 1]) are shown in the lower left triangle below the diagonal line, and pairwise Pearson correlation coefficients are plotted in the upper right triangle. Only the top 20 (in absolute $C$ value) genes in program 2 are listed. Gene names are colored red for CR (chromatin regulator) genes. Neighboring genes of significant lncRNAs are marked with an asterisk after the name. Note that the pairwise correlations can be either negative or positive, although they are all positive in the subset of the top 20 genes in this particular program.
