Xi Chen, Han Liu, Jaime G. Carbonell
Carnegie Mellon University
Pittsburgh, PA, USA
[email protected], [email protected], [email protected]
Abstract
In this paper, we propose to apply sparse canonical correlation analysis (sparse CCA) to an important genome-wide association study problem, eQTL mapping. Existing sparse CCA models do not incorporate structural information among variables, such as pathways of genes. This work extends sparse CCA so that it can exploit either pre-given or unknown group structure via structured-sparsity-inducing penalties. Such structured penalties pose new challenges for optimization techniques. To address this challenge, by specializing the excessive gap framework, we develop a scalable primal-dual optimization algorithm with a fast rate of convergence. Empirical results show that the proposed optimization algorithm is more efficient than existing state-of-the-art methods. We also demonstrate the effectiveness of the structured sparse CCA on both simulated and genetic datasets.
1 Introduction
A fundamental problem in genome-wide association study (GWA study or GWAS) is to understand associations between genetic variations and phenotypes. An important special case of GWA study is expression quantitative trait loci (eQTL) mapping. More specifically, eQTL mapping discovers genetic associations between genotype data of single nucleotide polymorphisms (SNPs) and phenotype data of gene expression levels to provide insights into gene regulation and, potentially, the controlling factors of a disease.
More formally, we have two datasets X (e.g. SNP data) and Y (e.g. gene expression data) of dimensions n × d and n × p, collected on the same set of n observations. Both p and d can be much larger than n in an eQTL study. Our goal is to investigate the relationship between X and Y.
A popular approach for eQTL mapping is to formulate the problem as a sparse multivariate regression (Lee et al., 2010; Kim et al., 2009). These methods treat X as input and Y as output and try to identify a small subset of input variables that are simultaneously related to all the responses. Despite the promising aspects of these models, such multivariate-regression approaches are not symmetric in that the regression coefficients are only placed on the X side. There is no clear reason why one should regress Y on X rather than X on Y for an association study. Also, in eQTL mapping, it is difficult to find a small subset of SNPs that can explain the expression levels of all the involved genes.
In contrast to the sparse regression approach, sparse canonical correlation analysis (sparse CCA) (Witten et al., 2009; Witten and Tibshirani, 2009) provides a more "symmetric" solution in which it finds two sparse canonical vectors u and v that maximize the correlation between Xu and Yv. Although sparse CCA has been successfully applied to some genomic datasets (e.g. CGH data (Witten and Tibshirani, 2009)), it has not been well studied for eQTL mapping. In a study of eQTL mapping, it is of great interest for biologists to seek a subset of SNP genotypes and a subset of gene expression levels that are closely related. To address this problem, we apply sparse CCA to eQTL mapping and show that, by incorporating proper structural information, sparse CCA can be a useful tool for GWAS.
It is well known that when dealing with high-dimensional data, prior structural knowledge is crucial for the analysis, as it facilitates the model's interpretability. For example, a biological pathway is a group of genes that participate in a particular biological process to perform certain functionality in a cell. To find
the controlling factors related to a disease, it is more meaningful to study the genes by considering their pathways. However, the existing sparse CCA models use $\ell_1$-regularization and do not incorporate the rich structural information among variables (e.g. genetic pathways). In this paper, we propose a structured sparse CCA framework that can naturally incorporate group structural information. In particular, we consider two scenarios: (1) when the group structure is pre-given, we propose to incorporate such prior knowledge using the overlapping-group-lasso penalty (Jenatton et al., 2009, 2010b). Compared to the standard group lasso (Yuan and Lin, 2006), we allow arbitrary overlaps among groups, which reflects the fact that a gene may belong to multiple pathways. We refer to this model as the group-structured sparse CCA; (2) if such structural information is not available a priori, we propose the group pursuit sparse CCA using a group pursuit penalty, which simultaneously conducts variable selection and structure estimation.
We formulate the structured sparse CCA as a biconvex problem and adopt an alternating optimization strategy. However, unlike in Witten et al. (2009), where the simple $\ell_1$-norm penalty is used, our formulation involves the non-separable overlapping-group-lasso penalty and group pursuit penalty. Such non-separability poses a great challenge for optimization techniques. Although many methods have been proposed (Duchi and Singer, 2009; Jenatton et al., 2009, 2010a; Mairal et al., 2010; Yuan et al., 2011; Chen et al., 2012; Argyriou et al., 2011; Qin and Goldfarb, 2011) for solving related sparse learning problems, as we survey in Section 3.1, they either (1) cannot be applied to our problem, (2) suffer from a slow convergence rate, or (3) have no convergence guarantees or convergence rate analysis. In this paper, we propose an efficient optimization algorithm under the excessive gap framework (Nesterov, 2003), which solves sparse CCA with various structured-sparsity-inducing penalties. Since it is a first-order method, the per-iteration time complexity is very low (e.g. linear in the sum of group sizes), and the method can scale up to millions of variables. It is a primal-dual approach that diminishes the primal-dual gap over iterations. For each subproblem in the alternating optimization procedure, the algorithm provably converges to an $\epsilon$-accurate solution (i.e. the duality gap is less than $\epsilon$) in $O(1/\sqrt{\epsilon})$ iterations.
2 Preliminaries
Given two datasets X and Y of dimensions n × d and n × p on the same set of n observations, we assume that each column of X and Y is normalized to have mean zero and standard deviation one. Sparse CCA (Witten et al., 2009) takes the form:
$$\max_{u,v} \; u^T X^T Y v \qquad (1)$$
$$\text{s.t.} \quad \|u\|_2 \leq 1, \; \|v\|_2 \leq 1, \; P_1(u) \leq c_1, \; P_2(v) \leq c_2,$$
where the constraints $\|u\|_2 \leq 1$, $\|v\|_2 \leq 1$ are convex relaxations of the equality constraints $\|u\|_2 = 1$, $\|v\|_2 = 1$, which ensure that the correlation is normalized. $P_1$ and $P_2$ are convex, non-smooth sparsity-inducing penalties that yield sparse u and v. We note that, since the focus of the paper is on structured sparsity, we only consider the first pair of canonical vectors. Using the technique from Section 3 of Witten and Tibshirani (2009), our method can easily be extended to estimate multiple canonical vectors.
Witten et al. (2009) only studied two specific forms of the penalty P (either $P_1$ or $P_2$) with relatively simple structure: the $\ell_1$-norm penalty and the chain-structured fused lasso penalty. In this work, we extend sparse CCA to more general forms of P that incorporate group structural information. In eQTL mapping, the structural knowledge among genes on the Y side is often of more interest. For ease of illustration, we assume that $P_1(u) = \|u\|_1$ and mainly focus on $P_2(v)$, which incorporates the structural information. Since Eq. (1) is biconvex in u and v, a natural optimization strategy is the alternating approach: fix u and optimize over v; then fix v and optimize over u; and iterate over these two steps. In our setting, the optimization with respect to u with $P_1(u) = \|u\|_1$ is relatively simple, and the closed-form solution was obtained in Witten et al. (2009). However, due to the complicated structure of $P_2(v)$, the optimization with respect to v cannot be easily solved, and we address this challenge in the following section.
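For concreteness, the following is a minimal NumPy sketch of that u-update, following the soft-thresholding-and-bisection construction described in Witten et al. (2009) (assuming $1 \leq c_1 \leq \sqrt{d}$ so the constraint set is non-trivial); the function names and iteration count are our own choices, not the authors' code.

```python
import numpy as np

def soft(a, delta):
    """Elementwise soft-thresholding: sign(a) * max(|a| - delta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def update_u(a, c1, n_bisect=50):
    """u-update for P1(u) = ||u||_1: u = soft(a, delta) / ||soft(a, delta)||_2,
    with the smallest delta >= 0 such that ||u||_1 <= c1 (found by bisection).
    Here a = X^T Y v."""
    nrm = np.linalg.norm(a)
    if nrm == 0:
        return a                          # degenerate case: nothing to select
    u = a / nrm
    if np.linalg.norm(u, 1) <= c1:
        return u                          # delta = 0 is already feasible
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(n_bisect):
        delta = 0.5 * (lo + hi)
        su = soft(a, delta)
        u = su / np.linalg.norm(su)
        if np.linalg.norm(u, 1) > c1:
            lo = delta                    # threshold too small: u still too dense
        else:
            hi = delta
    return u
```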
3 Group-structured Sparse CCA
In this section, we study the problem in which the group structural information among variables in Y is pre-given from domain knowledge; our goal is to identify a small subset of relevant groups under the sparse CCA framework. More formally, let us assume that the set of groups of variables in Y, $\mathcal{G} = \{g_1, \ldots, g_{|\mathcal{G}|}\}$, is defined as a subset of the power set of $\{1, \ldots, p\}$ and is available as prior knowledge. Note that the members (groups) of $\mathcal{G}$ are allowed to overlap. Inspired by the group-lasso penalty (Yuan and Lin, 2006) and the elastic-net penalty (Zou and Hastie, 2005), we define our penalty $P_2(v)$ as follows:
$$P_2(v) = \sum_{g \in \mathcal{G}} w_g \|v_g\|_2 + \frac{c}{2} v^T v, \qquad (2)$$
where $v_g \in \mathbb{R}^{|g|}$ is the subvector of v in group g, $w_g$ is the predefined weight for group g, and c is the tuning
parameter. The $\ell_1/\ell_2$ mixed-norm penalty in $P_2(v)$ plays the role of group selection. Since some gene expression levels are highly correlated, the ridge penalty $\frac{c}{2} v^T v$ addresses the problem of collinearity, enforcing strongly correlated variables to be in or out of the model together for better interpretability (Zou and Hastie, 2005). In addition, according to Zou and Hastie (2005), the ridge penalty is crucial to ensure stable variable selection performance when $p \gg n$, which is a typical setting of eQTL mapping.
Rather than solving the constrained form with $P_2(v)$, we solve the regularized problem in Lagrange form:
$$\min_{u,v} \; -u^T X^T Y v + \theta \sum_{g \in \mathcal{G}} w_g \|v_g\|_2 + \frac{\tau}{2} v^T v$$
$$\text{s.t.} \quad \|u\|_2 \leq 1, \; \|v\|_2 \leq 1, \; \|u\|_1 \leq c_1, \qquad (3)$$
where there exists a one-to-one correspondence between $(\theta, \tau)$ and $(c, c_2)$. We refer to this model as the group-structured sparse CCA.
3.1 Optimization Algorithm
The main difficulty in solving Eq. (3) arises from optimizing with respect to v. Denote the domain of v by $Q_1 = \{v \mid \|v\|_2 \leq 1\}$, and let $\beta = \frac{1}{\tau} Y^T X u$ and $\gamma = \frac{\theta}{\tau}$. The optimization of Eq. (3) with respect to v can then be written as:
$$\min_{v \in Q_1} \; f(v) \equiv l(v) + P(v), \qquad (4)$$
where $l(v) = \frac{1}{2}\|v - \beta\|_2^2$ is a Euclidean distance loss function and $P(v)$ is the overlapping-group-lasso penalty: $P(v) = \gamma \sum_{g \in \mathcal{G}} w_g \|v_g\|_2$.
Related First-order Methods  When v is unconstrained and the groups do not overlap, the closed-form optimal solution can easily be obtained as in Duchi and Singer (2009). In contrast, when the groups overlap, the subgradient over each group becomes very complicated and hence there is no closed-form solution. The traditional interior point method and iteratively reweighted least squares (Argyriou et al., 2008) suffer from the high computational cost of solving a large linear system.
A number of first-order methods have recently been developed for solving variants of the overlapping-group-lasso problem. The methods in Jenatton et al. (2010a) and Mairal et al. (2010) can only be applied to tree-structured or $\ell_1/\ell_\infty$-regularized groups, but not to $\ell_1/\ell_2$-regularized overlapping group structures. The forward-backward splitting method in Duchi and Singer (2009) and the smoothing proximal-gradient method in Chen et al. (2012) achieve a slow convergence rate of $O(1/\epsilon)$. An algorithm for $\ell_1/\ell_2$-regularized overlapping group structures was proposed very recently in Yuan et al. (2011). However, due to the additional constraint $\|v\|_2 \leq 1$, the key theorem in Yuan et al. (2011) no longer holds. Although it is still possible to apply a modification of this algorithm, its convergence rate is unknown. Other possible methods, including the alternating direction augmented Lagrangian method (Qin and Goldfarb, 2011) and the fixed-point method (Argyriou et al., 2011), also lack convergence rate analysis.
In this section, by specializing the general excessive gap framework of Nesterov (2003), we present a first-order approach with a fast convergence rate of $O(1/\sqrt{\epsilon})$ for solving Eq. (4).
Reformulation of the Penalty  As shown in Chen et al. (2012), the overlapping-group-lasso penalty $P(v)$ can be reformulated into a maximization form as follows. Using the dual norm, we have $\|v_g\|_2 = \max_{\|\alpha_g\|_2 \leq 1} \alpha_g^T v_g$, where $\alpha_g$ is a vector of length $|g|$. Let $\alpha = [\alpha_{g_1}^T, \ldots, \alpha_{g_{|\mathcal{G}|}}^T]^T$ be the concatenation of the vectors $\{\alpha_g\}_{g \in \mathcal{G}}$. We denote the domain of $\alpha$ by $Q_2 \equiv \{\alpha \mid \|\alpha_g\|_2 \leq 1, \; \forall g \in \mathcal{G}\}$. The penalty $P(v)$ can be reformulated as:
$$P(v) = \gamma \sum_{g \in \mathcal{G}} w_g \|v_g\|_2 = \max_{\alpha \in Q_2} \alpha^T C v, \qquad (5)$$
where $C \in \mathbb{R}^{(\sum_{g \in \mathcal{G}} |g|) \times p}$ is defined as follows. The rows of C are indexed by all pairs $(i, g) \in \{(i, g) \mid i \in g, \; i \in \{1, \ldots, p\}\}$, the columns are indexed by $j \in \{1, \ldots, p\}$, and $C_{(i,g),j} = \gamma w_g$ if $i = j$ and 0 otherwise.
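To make the construction of C concrete, consider a small hypothetical example with $p = 3$ and $\mathcal{G} = \{g_1, g_2\}$, $g_1 = \{1, 2\}$, $g_2 = \{2, 3\}$, with unit weights. The rows of C are indexed by $(1, g_1), (2, g_1), (2, g_2), (3, g_2)$, so
$$C = \gamma \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad Cv = \gamma \begin{pmatrix} v_1 \\ v_2 \\ v_2 \\ v_3 \end{pmatrix},$$
and hence $\alpha^T C v = \gamma(\alpha_{g_1}^T v_{g_1} + \alpha_{g_2}^T v_{g_2})$: each variable lying in an overlap is duplicated once per group containing it, which is what makes the maximization in Eq. (5) separable over groups.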
Here, we provide some insight into this reformulation by showing its connection with the Fenchel conjugate (Borwein and Lewis, 2000). Let $f^*$ denote the Fenchel conjugate of a general function f, and let $\delta_{Q_2}$ denote the indicator function of $Q_2$, i.e. $\delta_{Q_2}(\alpha) = 0$ if $\alpha \in Q_2$ and $+\infty$ otherwise. Then $P(v)$ can be viewed as the Fenchel conjugate of this indicator function:
$$P(v) = \max_{\alpha \in Q_2} \alpha^T C v = \delta_{Q_2}^*(Cv).$$
Smoothing the Penalty  With the penalty $P(v)$ in the form $\max_{\alpha \in Q_2} \alpha^T C v$, we construct a smooth approximation of $P(v)$ using Nesterov's smoothing technique (Nesterov, 2005) as follows:
$$P_\mu(v) = \max_{\alpha \in Q_2} \; \alpha^T C v - \mu d(\alpha), \qquad (6)$$
where $\mu$ is a positive smoothness parameter and $d(\alpha)$ is defined as $\frac{1}{2}\|\alpha\|_2^2$. The relationship between the smooth approximation $P_\mu(v)$ and the original penalty $P(v)$ can be characterized by the following inequality:
$$P(v) - \mu D \leq P_\mu(v) \leq P(v), \qquad (7)$$
where $D = \max_{\alpha \in Q_2} d(\alpha)$; the inequality follows since $0 \leq d(\alpha) \leq D$ for all $\alpha \in Q_2$. In our case, $D = |\mathcal{G}|/2$, where $|\mathcal{G}|$ is the number of groups.
According to Theorem 1 of Nesterov (2005), stated below, $P_\mu(v)$ is a smooth function for any $\mu > 0$.
Theorem 1  For any $\mu > 0$, $P_\mu(v)$ is convex and continuously differentiable in v with the gradient:
$$\nabla P_\mu(v) = C^T \alpha_\mu(v), \qquad (8)$$
$$\alpha_\mu(v) = \operatorname{argmax}_{\alpha \in Q_2} \; \alpha^T C v - \mu d(\alpha). \qquad (9)$$
Performing some algebra, we obtain the closed-form expression for $\alpha_\mu(v)$ in the next proposition.
Proposition 1  The $\alpha_\mu(v)$ in Eq. (9) is the concatenation of subvectors $\{[\alpha_\mu(v)]_g\}$ for all $g \in \mathcal{G}$. For any g, $[\alpha_\mu(v)]_g = S_2\!\left(\frac{\gamma w_g v_g}{\mu}\right)$, where $S_2$ is the projection operator onto the $\ell_2$-ball, defined as follows: $S_2(x) = \frac{x}{\|x\|_2}$ if $\|x\|_2 > 1$ and $S_2(x) = x$ otherwise.
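Proposition 1 follows by completing the square in Eq. (9). The maximization decouples over groups, and for each g,
$$\max_{\|\alpha_g\|_2 \leq 1} \; \alpha_g^T (\gamma w_g v_g) - \frac{\mu}{2}\|\alpha_g\|_2^2 = \max_{\|\alpha_g\|_2 \leq 1} \left( -\frac{\mu}{2}\left\|\alpha_g - \frac{\gamma w_g v_g}{\mu}\right\|_2^2 \right) + \frac{\gamma^2 w_g^2 \|v_g\|_2^2}{2\mu},$$
so the maximizer is the Euclidean projection of $\gamma w_g v_g / \mu$ onto the unit $\ell_2$-ball, which is exactly $S_2(\gamma w_g v_g / \mu)$.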
We substitute $P(v)$ in the original objective $f(v)$ with $P_\mu(v)$ and construct the smooth approximation of $f(v)$: $f_\mu(v) \equiv l(v) + P_\mu(v)$. According to Eq. (7), for any $\mu > 0$:
$$f(v) - \mu D \leq f_\mu(v) \leq f(v). \qquad (10)$$
Therefore, $f_\mu(v)$ is a uniformly smooth approximation of the objective $f(v)$ with a maximum gap of $\mu D$, and $\mu$ controls the gap between $f_\mu(v)$ and $f(v)$.
Fenchel Dual of f(v)  The fundamental idea of the excessive gap method is to diminish the duality gap between the objective $f(v)$ and its Fenchel dual over iterations. According to Theorem 3.3.5 in Borwein and Lewis (2000), the Fenchel dual problem of $f(v)$, denoted $\phi(\alpha)$, takes the following form:
$$\phi(\alpha) = -l^*(-C^T \alpha) - \delta_{Q_2}(\alpha), \qquad (11)$$
where $l^*$ is the Fenchel conjugate of l and $-l^*(-C^T \alpha) = \min_{v \in Q_1} v^T C^T \alpha + \frac{1}{2}\|v - \beta\|_2^2$. As a direct consequence of Theorem 1 in Nesterov (2005), we obtain the next theorem.
Theorem 2  The gradient of $\phi(\alpha)$ is:
$$\nabla \phi(\alpha) = C v(\alpha), \qquad (12)$$
where $v(\alpha) = \operatorname{argmin}_{v \in Q_1} v^T C^T \alpha + \frac{1}{2}\|v - \beta\|_2^2$. Moreover, $\nabla \phi(\alpha)$ is Lipschitz continuous with Lipschitz constant $L(\phi) = \frac{1}{\sigma}\|C\|^2$, where $\sigma = 1$ is the strong convexity parameter of the function $l(v)$ and $\|C\|$ is the matrix spectral norm of C: $\|C\| \equiv \max_{\|x\|_2 = 1} \|Cx\|_2$.
Utilizing Proposition 1 in Chen et al. (2012), we present closed-form expressions for $v(\alpha)$ and $\|C\|$.
Proposition 2  $v(\alpha)$ takes the following form: $v(\alpha) = S_2(\beta - C^T \alpha)$, where $S_2$ is the projection operator onto the $\ell_2$-ball, and $\|C\| = \gamma \max_{j \in \{1, \ldots, p\}} \sqrt{\sum_{g \in \mathcal{G} \text{ s.t. } j \in g} (w_g)^2}$.
According to Proposition 2, the Lipschitz constant for $\nabla \phi(\alpha)$ is:
$$L(\phi) = \|C\|^2 = \gamma^2 \max_{j \in \{1, \ldots, p\}} \sum_{g \in \mathcal{G} \text{ s.t. } j \in g} (w_g)^2. \qquad (13)$$
Excessive Gap Method  According to the Fenchel duality theorem (Borwein and Lewis, 2000), under certain mild conditions that hold for our problem, we have $\min_{v \in Q_1} f(v) = \max_{\alpha \in Q_2} \phi(\alpha)$, and
$$\phi(\alpha) \leq f(v), \quad \forall \, v \in Q_1, \; \alpha \in Q_2. \qquad (14)$$
At each iteration t, the excessive gap method (Nesterov, 2003) simultaneously updates v and $\alpha$ to guarantee that $f_{\mu_t}(v^t) \leq \phi(\alpha^t)$. According to Eq. (10) and (14):
$$f_{\mu_t}(v^t) \leq \phi(\alpha^t) \leq f(v^t) \leq f_{\mu_t}(v^t) + \mu_t D. \qquad (15)$$
To guarantee convergence of the algorithm, the excessive gap method diminishes the value of the smoothness parameter $\mu_t$ over iterations: $\mu_{t+1} \leq \mu_t$ and $\lim_{t \to \infty} \mu_t = 0$. From Eq. (15), when $\mu_t \to 0$, we have $f(v^t) \approx \phi(\alpha^t)$, so $v^t$ and $\alpha^t$ approach the optimal primal and dual solutions.
Moreover, the excessive gap method uses a gradient mapping operator $\psi : Q_2 \to Q_2$, defined as follows:
$$\psi(z) = \operatorname{argmax}_{\alpha \in Q_2} \; \langle \nabla \phi(z), \alpha - z \rangle - \frac{L(\phi)}{2}\|\alpha - z\|_2^2, \qquad (16)$$
where z is any point in $Q_2$. In our problem, the gradient mapping operator $\psi$ can also be computed in closed form, as shown in the next proposition.
Proposition 3  For any $z \in Q_2$, $\psi(z)$ in Eq. (16) is the concatenation of subvectors $[\psi(z)]_g$ for all groups $g \in \mathcal{G}$. For any g, $[\psi(z)]_g = S_2\!\left(z_g + \frac{[\nabla \phi(z)]_g}{L(\phi)}\right)$, where $[\nabla \phi(z)]_g = \gamma w_g [v(z)]_g$ is the g-th subvector of $\nabla \phi(z)$.
With Propositions 1, 2, and 3 in place, all the essential ingredients of the excessive gap framework can be computed in closed form. We present the excessive gap method for solving Eq. (4) in Algorithm 1.
Algorithm 1  Excessive Gap Method for Solving Eq. (4)
Input: $\beta$, $\gamma$, $\mathcal{G}$ and $\{w_g\}_{g \in \mathcal{G}}$.
Initialization: (1) construct C; (2) compute $L(\phi)$ as in Eq. (13) and set $\mu_0 = 2L(\phi)$; (3) set $v^0 = v(0) = S_2(\beta)$; (4) set $\alpha^0 = \psi(0)$.
Iterate: for $t = 0, 1, 2, \ldots$, until convergence of $v^t$:
1. Set $\tau_t = \frac{2}{t+3}$.
2. Compute $\alpha_{\mu_t}(v^t)$ as in Proposition 1.
3. Set $z^t = (1 - \tau_t)\alpha^t + \tau_t \alpha_{\mu_t}(v^t)$.
4. Update $\mu_{t+1} = (1 - \tau_t)\mu_t$.
5. Compute $v(z^t)$ as in Proposition 2.
6. Update $v^{t+1} = (1 - \tau_t)v^t + \tau_t v(z^t)$.
7. Update $\alpha^{t+1} = \psi(z^t)$ as in Proposition 3.
Output: $v^{t+1}$.

Convergence Rate and Time Complexity  Lemma 7.4 and Theorem 7.5 in Nesterov (2003) guarantee that the starting points $v^0$ and $\alpha^0$ and the sequences $\{v^t\}$ and $\{\alpha^t\}$ in Algorithm 1 satisfy the key condition $f_{\mu_t}(v^t) \leq \phi(\alpha^t)$. Using Eq. (15), the duality gap can be bounded by:
$$f(v^t) - \phi(\alpha^t) \leq f_{\mu_t}(v^t) + \mu_t D - \phi(\alpha^t) \leq \mu_t D. \qquad (17)$$
From Eq. (17), we can see that the duality gap, which characterizes the convergence rate, is reduced at the same rate at which $\mu_t$ approaches 0. According to Step 4 in Algorithm 1, the closed-form expression for $\mu_t$ is:
$$\mu_t = (1 - \tau_{t-1})\mu_{t-1} = \frac{t}{t+2}\,\mu_{t-1} = \frac{4L(\phi)}{(t+1)(t+2)}. \qquad (18)$$
Combining Eq. (17) and (18), we immediately obtain the convergence rate for the duality gap of Algorithm 1.
Theorem 3 (Nesterov, 2003)  The duality gap between the primal solutions $\{v^t\}$ and dual solutions $\{\alpha^t\}$ generated by Algorithm 1 satisfies: $f(v^t) - \phi(\alpha^t) \leq \mu_t D = \frac{4\|C\|^2 D}{(t+1)(t+2)}$, where $D = \max_{\alpha \in Q_2} d(\alpha)$. In other words, if we require the duality gap to be less than $\epsilon$, Algorithm 1 needs at most $\lceil 2\|C\|\sqrt{D/\epsilon} - 1 \rceil$ iterations, i.e. $O(1/\sqrt{\epsilon})$ iterations. The per-iteration time complexity of Algorithm 1 is linear in $p + \sum_{g \in \mathcal{G}} |g|$.
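To make Algorithm 1 concrete, here is a minimal NumPy sketch of the whole procedure; it is our own rendering under the stated closed forms, not the authors' code. Groups are given as 0-indexed index arrays with per-group weights, and C is never materialized, since multiplication by C and $C^T$ reduces to per-group scaling and accumulation (hence the linear per-iteration cost). The helpers prox_l2_ball and Ct_alpha are reused in the sketches in Section 5.

```python
import numpy as np

def prox_l2_ball(x):
    """S2(x): project x onto the unit l2-ball."""
    nrm = np.linalg.norm(x)
    return x / nrm if nrm > 1 else x

def Ct_alpha(alpha, groups, weights, gamma, p):
    """C^T alpha: accumulate gamma * w_g * alpha_g into a length-p vector."""
    out = np.zeros(p)
    for g, w, a in zip(groups, weights, alpha):
        out[g] += gamma * w * a
    return out

def excessive_gap(beta, gamma, groups, weights, n_iter=1000, tol=1e-8):
    """Sketch of Algorithm 1 for
       min_{||v||_2 <= 1} 0.5 * ||v - beta||^2 + gamma * sum_g w_g ||v_g||_2.
    `groups` is a list of 0-indexed index arrays; `weights` holds the w_g."""
    p = beta.shape[0]
    # L(phi) = gamma^2 * max_j sum_{g : j in g} w_g^2   (Eq. 13)
    col = np.zeros(p)
    for g, w in zip(groups, weights):
        col[g] += w ** 2
    L = gamma ** 2 * col.max()

    def v_of(alpha):
        # Proposition 2: v(alpha) = S2(beta - C^T alpha)
        return prox_l2_ball(beta - Ct_alpha(alpha, groups, weights, gamma, p))

    def alpha_mu(v, mu):
        # Proposition 1: [alpha_mu(v)]_g = S2(gamma * w_g * v_g / mu)
        return [prox_l2_ball(gamma * w * v[g] / mu) for g, w in zip(groups, weights)]

    def psi(z):
        # Proposition 3: [psi(z)]_g = S2(z_g + gamma * w_g * [v(z)]_g / L)
        vz = v_of(z)
        return [prox_l2_ball(zg + gamma * w * vz[g] / L)
                for g, w, zg in zip(groups, weights, z)]

    mu = 2.0 * L                                     # mu_0 = 2 L(phi)
    v = prox_l2_ball(beta)                           # v^0 = v(0) = S2(beta)
    alpha = psi([np.zeros(len(g)) for g in groups])  # alpha^0 = psi(0)
    for t in range(n_iter):
        tau = 2.0 / (t + 3)                          # step 1
        a_mu = alpha_mu(v, mu)                       # step 2
        z = [(1 - tau) * a + tau * am for a, am in zip(alpha, a_mu)]  # step 3
        mu = (1 - tau) * mu                          # step 4
        v_new = (1 - tau) * v + tau * v_of(z)        # steps 5-6
        alpha = psi(z)                               # step 7
        if np.linalg.norm(v_new - v) <= tol * max(1.0, np.linalg.norm(v)):
            return v_new, alpha
        v = v_new
    return v, alpha
```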
4 Group Pursuit in Sparse CCA
When group information is not given a priori, it is desirable to automatically group the relevant variables into clusters under the sparse CCA framework. For this purpose, we propose the group pursuit sparse CCA in this section. Our group pursuit approach is based on pairwise comparisons between $v_i$ and $v_j$ for all $1 \leq i < j \leq p$: when $v_i = v_j$, the i-th and j-th variables are grouped together. We identify all subgroups among the p variables by conducting pairwise comparisons and applying the transitivity rule, i.e. $v_i = v_j$ and $v_j = v_k$ imply that the i-th, j-th, and k-th variables are clustered into the same group. The pairwise comparisons can be naturally encoded in the fusion penalty (Tibshirani and Saunders, 2005; Kim et al., 2009) $\sum_{i<j} |v_i - v_j|$, where the $\ell_1$-norm enforces $v_i - v_j = 0$ for closely related (i, j) pairs.
In practice, instead of using the simple penalty $\sum_{i<j} |v_i - v_j|$, which treats each pair of variables equally, we can add weights $w_{ij}$ to incorporate prior knowledge of how likely the i-th and j-th variables are to be in the same group. Moreover, the $\ell_1$-norm of v is also incorporated in the penalty to enforce a sparse solution, as in the fused lasso model (Tibshirani and Saunders, 2005). The group pursuit penalty then takes the following form:
$$P_2(v) = c'\|v\|_1 + \sum_{i<j} w_{ij}|v_i - v_j| + \frac{c}{2} v^T v, \qquad (19)$$
where $c'$ is the tuning parameter that balances the $\ell_1$-norm and the fusion penalty. This penalty function simultaneously selects the relevant variables and clusters them into groups in an automatic manner. A natural way of assigning $w_{ij}$ is to set $w_{ij} = |r_{ij}|^q$, where $r_{ij}$ is the correlation between the i-th and j-th variables; q models the strength of the prior: a larger q expresses a stronger belief in the correlation-based group structure. For simplicity, we set $w_{ij} = |r_{ij}|$ (i.e. $q = 1$) in this paper, while in principle any prior knowledge of the probability of being in the same group can be incorporated into w. Note that the group pursuit penalty can be viewed as a variant of the graph-guided fusion penalty (Kim et al., 2009).
In some cases, we have the prior knowledge that the i-th and j-th variables do not belong to the same group; then the term $|v_i - v_j|$ should not appear in the group pursuit penalty in Eq. (19). Therefore, rather than having $|v_i - v_j|$ for all (i, j) pairs, which forms a complete graph, we generalize the group pursuit penalty to a fusion penalty defined on an arbitrary graph with edge set E. In summary, the group pursuit sparse CCA is defined as follows:
$$\min_{u,v} \; -u^T X^T Y v + \theta\Big(c'\|v\|_1 + \sum_{(i,j) \in E} w_{ij}|v_i - v_j|\Big) + \frac{\tau}{2} v^T v$$
$$\text{s.t.} \quad \|u\|_2 \leq 1, \; \|v\|_2 \leq 1, \; P_1(u) \leq c_1. \qquad (20)$$
It is straightforward to specialize the excessive gap method for solving the group pursuit sparse CCA using an approach similar to that in Section 3.1. Note that another, non-convex group pursuit penalty was recently proposed in Shen and Huang (2010). However, it is computationally very expensive due to the non-convexity of the penalty and can easily be trapped in local minima.
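As an illustration of the pairwise-comparison-plus-transitivity step, here is a minimal sketch of how clusters can be read off an estimated $\hat{v}$: variables connected in E whose coefficients are numerically equal are merged, and the transitivity rule is applied via union-find. The tolerances mirror the choice $|v_i - v_j| < 10^{-3}$ used in Section 5.3; all names are ours.

```python
def pursue_groups(v, edges, diff_tol=1e-3, zero_tol=1e-8):
    """Recover clusters from an estimated canonical vector v.
    `edges` is the fusion edge set E as (i, j) pairs (0-indexed)."""
    p = len(v)
    parent = list(range(p))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    for i, j in edges:
        if abs(v[i] - v[j]) < diff_tol:      # v_i ~ v_j: merge the two variables
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(p):
        if abs(v[i]) >= zero_tol:            # keep only selected variables
            clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```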
Table 1: Computational efficiency comparisons.

|G| = 40, p = 40,100
                 CPU (s)       Primal Obj    Rel Gap
γ = 0.4  ExGap   1.9767E-02    8.8682E+03    2.6663E-15
         AG      3.5394E-01    8.8682E+03    —
         ADAL    2.1506E+00    8.8682E+03    —
         IPM     9.6420E+02    8.8682E+03    7.0490E-09
γ = 4    ExGap   1.7907E-01    8.8851E+03    1.6285E-08
         AG      3.0946E+00    8.8851E+03    —
         ADAL    1.8290E+01    8.8851E+03    —
         IPM     2.9520E+03    8.8851E+03    3.8840E-08

|G| = 500, p = 500,100
γ = 5    ExGap   4.1094E+00    1.1211E+05    4.9669E-08
         AG      1.0723E+01    1.1211E+05    —
         ADAL    1.8374E+01    1.1211E+05    —
γ = 10   ExGap   6.2159E+00    1.1219E+05    2.8000E-07
         AG      9.0650E+00    1.1219E+05    —
         ADAL    1.5146E+01    1.1219E+05    —

|G| = 5,000, p = 5,000,100
γ = 10   ExGap   9.8349E+01    1.1240E+06    9.0019E-07
         AG      8.5362E+02    1.1240E+06    —
         ADAL    7.7661E+02    1.1240E+06    —
γ = 20   ExGap   1.7264E+02    1.1245E+06    8.7364E-07
         AG      1.0585E+03    1.1245E+06    —
         ADAL    7.3510E+02    1.1245E+06    —
5 Experiments
5.1 Computational Efficiency
We evaluate the scalability and efficiency of our excessive gap method (ExGap) for solving
$$\operatorname{argmin}_{v : \|v\|_2 \leq 1} \; f(v) = \frac{1}{2}\|v - \beta\|_2^2 + \gamma \sum_{g \in \mathcal{G}} \|v_g\|_2,$$
where $\beta$ is given. We compare ExGap with several state-of-the-art algorithms: (1) a modification of the accelerated gradient method (AG) (Yuan et al., 2011) obtained by adding the constraint; (2) the alternating direction augmented Lagrangian (ADAL) method (Qin and Goldfarb, 2011); (3) the interior point method (IPM) for second-order cone programming via CVX (Grant and Boyd, 2011). All of the experiments are performed on a PC with an Intel Core 2 Quad Q6600 2.4GHz CPU and 4GB RAM. The software is written in MATLAB. We terminate ExGap and IPM when the relative duality gap (Rel Gap) is less than $10^{-6}$:
$$\text{Rel Gap} = \frac{|f(v^t) - \phi(\alpha^t)|}{1 + |f(v^t)| + |\phi(\alpha^t)|} \leq 10^{-6}.$$
For ADAL and (modified) AG, the dual solutions cannot be easily derived. We stop AG and ADAL when their objectives are less than 1.00001 times the objective attained by ExGap.
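As a sketch of this stopping rule under the same assumptions (reusing prox_l2_ball and Ct_alpha from the Algorithm 1 sketch), the dual value $\phi(\alpha)$ can be evaluated at $v(\alpha) = S_2(\beta - C^T\alpha)$:

```python
import numpy as np

def rel_gap(v, alpha, beta, gamma, groups, weights):
    """Relative duality gap |f(v) - phi(alpha)| / (1 + |f(v)| + |phi(alpha)|)."""
    p = beta.shape[0]
    # Primal objective: f(v) = 0.5 ||v - beta||^2 + gamma * sum_g w_g ||v_g||
    f = 0.5 * np.sum((v - beta) ** 2) + gamma * sum(
        w * np.linalg.norm(v[g]) for g, w in zip(groups, weights))
    # Dual objective, attained at v(alpha) = S2(beta - C^T alpha) (Theorem 2)
    cta = Ct_alpha(alpha, groups, weights, gamma, p)
    va = prox_l2_ball(beta - cta)
    phi = va @ cta + 0.5 * np.sum((va - beta) ** 2)
    return abs(f - phi) / (1.0 + abs(f) + abs(phi))
```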
We generate the simulated data with an overlapping group structure imposed on $\beta$ as follows. Assuming that inputs are ordered and each group is of size 1000, we define a sequence of groups of 1000 adjacent inputs with an overlap of 100 variables between two successive groups, i.e. $\mathcal{G} = \{\{1, \ldots, 1000\}, \{901, \ldots, 1900\}, \ldots, \{p - 999, \ldots, p\}\}$ with $p = 1000|\mathcal{G}| + 100$. We set the support of $\beta$ to the first half of the variables, setting the values of $\beta$ on the support to 1 and the rest to 0.
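A sketch of this setup, reusing the excessive_gap function from the Algorithm 1 sketch (the helper name and the choice of unit weights are ours):

```python
import numpy as np

def chain_overlap_groups(n_groups, size=1000, overlap=100):
    """Groups of `size` adjacent indices (0-indexed), each pair of successive
    groups sharing `overlap` variables."""
    step = size - overlap
    return [np.arange(k * step, k * step + size) for k in range(n_groups)]

groups = chain_overlap_groups(40)             # |G| = 40
p = int(groups[-1][-1]) + 1                   # p induced by the chain construction
beta = np.zeros(p)
beta[: p // 2] = 1.0                          # support = first half of the variables

v_hat, alpha_hat = excessive_gap(beta, gamma=0.4, groups=groups,
                                 weights=[1.0] * len(groups))
```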
We vary the number of groups $|\mathcal{G}|$ and report the CPU time in seconds (CPU), the primal objective (Primal Obj), and the relative duality gap (Rel Gap) in Table 1. For each setting of $|\mathcal{G}|$, we use two levels of the regularization parameter $\gamma$. Note that when $|\mathcal{G}| \geq 50$ ($p \geq 50{,}100$), we are unable to collect results for IPM because it runs out of memory due to the large storage requirement of solving the Newton linear system. From Table 1, we can see that ExGap is more efficient than both AG and ADAL in terms of CPU time. In addition, when p is smaller, AG is more efficient than ADAL, while for large p, ADAL appears to be more efficient. ExGap can easily scale up to high-dimensional data with millions of variables. Another interesting observation is that for smaller $\gamma$, which leads to smaller $\|C\|$ and $L(\phi)$, the convergence of ExGap is much faster. This observation suggests that ExGap is more efficient when the non-smooth part plays a less important role in the optimization problem.
5.2 Simulation Study
In this and the next subsection, we use simulated data and a real eQTL dataset to investigate the performance of the overlapping group-structured and group pursuit sparse CCA. All regularization parameters are chosen from 0.01 to 10 and set using the permutation-based method in Witten and Tibshirani (2009). Instead of tuning all the parameters on a multi-dimensional grid, which is computationally heavy, we first train the $\ell_1$-regularized sparse CCA (i.e. $P_1(u) = \|u\|_1$, $P_2(v) = \|v\|_1$), and the tuned regularization parameter $c_1$ in Eq. (1) is used for all structured models. For the overlapping-group-lasso penalty in Eq. (3), all group weights $\{w_g\}$ are set to 1. In addition, we observe that the learned sparsity pattern is insensitive to the parameter $\tau$, and we therefore set it to 1 for simplicity. For all algorithms, we use 10 random initializations of u and select the results that lead to the largest correlation.
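Putting the pieces together, a sketch of the full alternating procedure with random restarts is shown below, reusing update_u (Section 2) and excessive_gap (Section 3.1); the restart and outer-iteration counts are arbitrary choices of ours, and $\tau$ is fixed to 1 as in the experiments.

```python
import numpy as np

def structured_sparse_cca(X, Y, c1, gamma, groups, weights,
                          n_outer=50, n_restarts=10, seed=0):
    """Alternating optimization sketch for Eq. (3) with tau = 1:
    the v-update solves Eq. (4) with beta = Y^T X u via excessive_gap,
    and the u-update applies update_u to a = X^T Y v."""
    rng = np.random.default_rng(seed)
    XtY = X.T @ Y
    best_uv, best_corr = None, -np.inf
    for _ in range(n_restarts):               # 10 random initializations of u
        u = rng.standard_normal(X.shape[1])
        u /= np.linalg.norm(u)
        v = np.zeros(Y.shape[1])
        for _ in range(n_outer):
            v, _ = excessive_gap(XtY.T @ u, gamma, groups, weights)
            u = update_u(XtY @ v, c1)
        corr = u @ XtY @ v                    # u^T X^T Y v
        if corr > best_corr:
            best_uv, best_corr = (u, v), corr
    return best_uv
```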
Given Overlapping Group Structure  In this simulation, the overlapping group structure in v is given a priori. We generate the data X and Y with n = 50, d = 100 and p = 82 as follows.
Figure 1: (a) True u and v; (b) estimated u and v using the $\ell_1$-regularized sparse CCA; (c) estimated u and v using the group-structured sparse CCA.
Figure 2: (a) True u and v; (b) estimated u and v using the $\ell_1$-regularized sparse CCA; (c) estimated u and v using the group pursuit sparse CCA.
Let u be a vector of length d with 20 0s, 20 −1s, and 60 0s. We construct v with p = 82 variables as follows: assuming that v is covered by 10 groups, each group has 10 variables, with 2 variables overlapping between every two successive groups, i.e. $\mathcal{G} = \{\{1, \ldots, 10\}, \{9, \ldots, 18\}, \ldots, \{73, \ldots, 82\}\}$. For the indices of the 2nd, 3rd, 8th, 9th and 10th groups, we set the corresponding entries of v to zero; the other entries are sampled i.i.d. from N(0, 1). In addition, we randomly generate a latent vector z of length n from $N(0, I_{n \times n})$ and normalize it to unit length.
We generate the data matrix X with each $X_{ij} \sim N(z_i u_j, 1)$ and Y with each $Y_{ij} \sim N(z_i v_j, 1)$. The true and estimated vectors u and v are presented in Figure 1. For the group-structured sparse CCA, we add the regularization $\sum_{g \in \mathcal{G}} \|v_g\|_2$ on v, where $\mathcal{G}$ is taken from the prior knowledge. It can be seen that the group-structured sparse CCA recovers the true v much better, while the simple $\ell_1$-regularized sparse CCA leads to an over-sparsified v vector.
Group Pursuit  In this simulation, we assume that the group structure over v is unknown, and the goal is to uncover it using the group pursuit sparse CCA. We generate the data X and Y with n = 50 and p = d = 100 as follows. Let u be a vector of length d with 20 0s, 20 −1s, and 60 0s, as in the previous simulation study, and let v be a vector of length p with 10 3s, 10 −1.5s, 10 1s, 10 2s and 60 0s. In addition, we randomly generate a latent vector z from $N(0, I_{n \times n})$ and normalize it to unit length. We generate X with each sample $x_i \sim N(z_i u, 0.1 I_{d \times d})$ and Y with each sample $y_i \sim N(z_i v, 0.1\Sigma_y)$, where $(\Sigma_y)_{jk} = \exp(-|v_j - v_k|)$. We conduct the group pursuit sparse CCA in Eq. (20), where we add the fusion penalty for each pair of variables in v, i.e. E is the edge set of the complete graph. The estimated vectors u and v are presented in Figure 2(c). It can easily be seen that the group pursuit sparse CCA correctly captures the group structure in the v vector.
5.3 Real eQTL Data Analysis
Pathway Selection via Group-structured Sparse CCA  We analyze the yeast eQTL data of Zhu et al. (2008) (the full dataset can be downloaded from http://blogs.ls.berkeley.edu/bremlab/data): X contains d = 1260 SNPs from chromosomes 1–16 for n = 114 yeast strains, and Y is the gene expression data of p = 1,155 genes for the same 114 yeast strains. All these genes are in the KEGG database (Kanehisa and Goto, 2000) and belong to 92 pathways. We treat each pathway as a group. To achieve a more refined resolution of gene selection, besides the 92 pathway groups, we also add p = 1,155 singleton groups, each containing a single gene, so that the solutions can be sparse at both the group and individual feature levels, as in Friedman et al. (2010).
Using the group-structured sparse CCA, we select 121 SNPs and 47 genes. These 47 genes spread over 32 pathways. Using the tool ClueGO (Bindea et al., 2009), we perform KEGG enrichment analysis on the selected genes; the overview chart is presented in Figure 3(a).
Figure 3: Overview chart of the KEGG functional enrichment using (a) the group-structured sparse CCA and (b) the $\ell_1$-regularized sparse CCA; (c) learned group structure using the group pursuit sparse CCA. In (a) and (b), the proportion of the pie chart represents the number of selected genes with the given annotation over the total number of selected genes. Different colors are used to visually discriminate different functional groups. See Bindea et al. (2009) for more details of the presentation.
From Figure 3(a), we see that most pathways are involved in the functional group Terpenoid backbone biosynthesis; terpenoids are a large class of natural products consisting of isoprene (C5) units. As a comparison, the $\ell_1$-regularized sparse CCA selects 173 SNPs and 71 genes, and these 71 genes belong to 50 pathways. The enrichment of the genes selected by the $\ell_1$-regularized sparse CCA is presented in Figure 3(b). We further perform GO enrichment analysis on the selected genes. Most GO terms obtained from the group-structured sparse CCA have much smaller p-values than those from the $\ell_1$-regularized sparse CCA. In addition, the overlapping-group-lasso penalty on genes also affects the selection of SNPs. An interesting observation is that most of the SNPs selected by the group-structured sparse CCA are concentrated on chromosomes 12 and 13.
In addition to the group-structured sparse CCA with groups extracted from KEGG pathways, we also perform tree-structured sparse CCA, where we first run hierarchical agglomerative clustering on the p × p correlation matrix of Y, define a group for each tree node, and then run the group-structured sparse CCA. The tree-structured sparse CCA selects 123 SNPs and 66 genes. The functional enrichment results show that most of the functions are identical to those learned by the group-structured sparse CCA with groups from KEGG, e.g. Terpenoid backbone biosynthesis, Steroid biosynthesis, O-Mannosyl glycan biosynthesis, Sphingolipid, etc. This experiment suggests that, even without any prior knowledge of the group structure, the correlation-based tree-structured sparse CCA can select the relevant genes and provide similar enrichment results.
Group Pursuit Sparse CCA  Now we assume no prior information about the group structure among the genes. Our goal is to simultaneously select the relevant genes and group them into clusters. Using the group pursuit sparse CCA in Eq. (20) and thresholding the absolute value of the pairwise correlation at 0.8 to construct the edge set E, we select 61 genes in total. We present the obtained clusters among these 61 genes in Figure 3(c), where two selected genes are connected if the absolute difference between their estimated parameters ($|v_i - v_j|$) is less than $10^{-3}$ (singleton nodes are not plotted due to space limitations). We observe that there are two obvious clusters. With the learned clustering structure, we can study the functional enrichment of each cluster separately, which could lead to a more elaborate enrichment analysis compared to analyzing all the selected genes together.
6 Conclusions
In this paper, we propose a structured sparse CCA framework that can exploit either pre-given or unknown group structural information. We also provide an efficient optimization algorithm that can scale up to very large problems.
7 Acknowledgements
We would like to thank Dr. Seyoung Kim for pointing out and helping us prepare the yeast data. We also thank Qihang Lin for helpful discussions on related first-order methods. Han Liu is supported by NSF grant IIS-1116730.
References
Argyriou, A., Evgeniou, T., and Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73(3), 243–272.
Argyriou, A., Micchelli, C. A., Pontil, M., Shen, L., and Xu, Y. (2011). Efficient first order methods for linear composite regularizers. arXiv:1104.1436v1.
Bindea, G., Mlecnik, B., Hackl, H., et al. (2009). ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics, 25(8), 1091–1093.
Borwein, J. and Lewis, A. S. (2000). Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer.
Chen, X., Lin, Q., Kim, S., Carbonell, J., and Xing, E. (2012). Smoothing proximal gradient method for general structured sparse learning. Annals of Applied Statistics (to appear).
Duchi, J. and Singer, Y. (2009). Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10, 2899–2934.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). A note on the group lasso and a sparse group lasso. arXiv:1001.0736.
Grant, M. and Boyd, S. (2011). CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx.
Jenatton, R., Audibert, J., and Bach, F. (2009). Structured variable selection with sparsity-inducing norms. Technical report, INRIA.
Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. (2010a). Proximal methods for sparse hierarchical dictionary learning. In ICML.
Jenatton, R., Obozinski, G., and Bach, F. (2010b). Structured sparse principal component analysis. In AISTATS.
Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28, 27–30.
Kim, S., Sohn, K.-A., and Xing, E. P. (2009). A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12), i204–i212.
Lee, S., Zhu, J., and Xing, E. P. (2010). Adaptive multi-task lasso: with application to eQTL detection. In NIPS.
Mairal, J., Jenatton, R., Obozinski, G., and Bach, F. (2010). Network flow algorithms for structured sparsity. In NIPS.
Nesterov, Y. (2003). Excessive gap technique in non-smooth convex minimization. Technical report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE).
Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming, 103(1), 127–152.
Qin, Z. and Goldfarb, D. (2011). Structured sparsity via alternating direction methods. Technical report, Columbia University.
Shen, X. and Huang, H.-C. (2010). Grouping pursuit through a regularization solution surface. Journal of the American Statistical Association, 105(490), 727–739.
Tibshirani, R. and Saunders, M. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67(1), 91–108.
Witten, D. and Tibshirani, R. (2009). Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical Applications in Genetics and Molecular Biology, 8(1), 1–27.
Witten, D., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10, 515–534.
Yuan, L., Liu, J., and Ye, J. (2011). Efficient methods for overlapping group lasso. In NIPS.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68, 49–67.
Zhu, J., Zhang, B., Smith, E. N., Drees, B., Brem, R. B., Kruglyak, L., Bumgarner, R. E., and Schadt, E. E. (2008). Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nature Genetics, 40, 854–861.