Gene Networks
Estimation
References
Gene Networks Estimation
Extensions of the lasso
Jose Sanchez
Mathematical Sciences, Chalmers University of Technology
Sep 12, 2013
Cancer systems biology
The transfer of information from a protein to either DNA or RNA is not possible.
This fact establishes a framework for the study of cancer at the molecular level.
Network Modeling
Why gene networks?
A gene regulatory network describes how genes interact with each other to form modules and carry out cell functions.
Gene networks help in systematically understanding complex molecular mechanisms.
They allow the identification of hub genes, which are potential disease drivers (Kendall et al., 2005; Mani et al., 2008; Nibbe et al., 2010; Slavov and Dawson, 2009).
Goals
Estimate joint gene regulatory networks for several types of cancer and data types.
Incorporate biologically meaningful constraints into the model (commonality, modularity).
Take into account the high dimensionality ($p \gg N$) of the problem.
Gaussian Graphical Models
A graph consists of a set of vertices $V$ and a set of edges $E$, which is a subset of $V \times V$. In a graphical model, the vertices correspond to a set of random variables $X = (X_1, X_2, \ldots, X_p)$ coming from a distribution $P$.
A conditional independence graph (CIG) is a graphical model where the absence of an edge between variables $X_i$ and $X_j$ implies that they are conditionally independent given the rest, that is, $X_i \perp X_j \mid X_{V \setminus \{i,j\}}$.
If the variables $X = (X_1, X_2, \ldots, X_p)$ come from the multivariate normal distribution $N(0, \Sigma)$, the CIG corresponds to a Gaussian graphical model (Lauritzen, 1996). In this case the conditional independencies between the variables in the model are given by the zeros of the inverse covariance matrix $\Theta = \Sigma^{-1}$, and the edges in the graph by its non-zero off-diagonal entries.
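As a small illustration of how the precision matrix encodes the graph, the sketch below builds a three-variable chain X1 - X2 - X3. The numerical values and the helper names `partial_correlations` and `edges` are illustrative assumptions, not part of the talk. The covariance matrix is dense, yet the zero in the precision matrix shows that X1 and X3 are conditionally independent given X2.

```python
import numpy as np

# Precision matrix of a chain X1 - X2 - X3 (illustrative values).
theta = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])

# The covariance matrix is the inverse of the precision matrix.
sigma = np.linalg.inv(theta)


def partial_correlations(theta):
    """Partial correlation of X_i and X_j given the rest: -theta_ij / sqrt(theta_ii * theta_jj)."""
    d = np.sqrt(np.diag(theta))
    pcor = -theta / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


def edges(theta, tol=1e-10):
    """Edges of the conditional independence graph: non-zero off-diagonal entries."""
    p = theta.shape[0]
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(theta[i, j]) > tol]


print(sigma)                        # all entries non-zero: X1 and X3 are marginally correlated
print(partial_correlations(theta))  # entry (0, 2) is zero: no edge between X1 and X3
print(edges(theta))
```

Reading edges off the covariance matrix instead would wrongly connect X1 and X3, which is why the estimation target is the precision matrix.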
Gene Network Modeling
GGM for gene networks
Assume the genes to be $N(\mu, \Sigma)$ distributed and model them using Gaussian graphical models.
The links of the gene network are given by the non-zero entries of the precision matrix $\Theta = \Sigma^{-1}$.
Since $p \gg N$, the empirical covariance matrix is singular and the precision matrix cannot be estimated by direct inversion; regularization (sparsity) has to be introduced.
Not the only methods
Bayesian networks.
Information theory-based methods.
Correlation based methods.
Network Modeling: a high-dimensional problem
We may not be grapes, but estimation of (human) gene networks is still a high-dimensional problem.
Figure: Source: M. Pertea and S. Salzberg, Genome Biology, 2010.
The Lasso: an approach to the p >> N problem
Consider the usual multiple regression setting.
$X_1, X_2, \ldots, X_n$ are $p$-dimensional covariates and $Y_1, Y_2, \ldots, Y_n$ is a univariate response.
We model the response variable through a linear model
\[ Y_i = \sum_{j=1}^{p} \beta_j X_{ji} + \varepsilon_i, \qquad i = 1, 2, \ldots, n. \]
The Lasso estimates for $\beta$ are given by the minimizer (Tibshirani, 1996)
\[ \hat{\beta}(\lambda) = \arg\min_{\beta} \left\{ \frac{1}{n} \| Y - X\beta \|_2^2 + \lambda \| \beta \|_1 \right\}. \]
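The Lasso objective can be minimized in several ways. The sketch below uses cyclic coordinate descent, which is one common choice and not necessarily the solver used for this work. The function names `soft_threshold` and `lasso_cd` are hypothetical, the objective is scaled exactly as $\frac{1}{n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$, and the design matrix is assumed to have no all-zero columns.

```python
import numpy as np


def soft_threshold(x, t):
    """Soft-thresholding operator: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/n)||y - X b||_2^2 + lam * ||b||_1.

    Assumes X (n-by-p) has no all-zero columns.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n          # x_j^T x_j / n
    resid = np.asarray(y, dtype=float) - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]         # partial residual without feature j
            rho_j = X[:, j] @ resid / n
            # The factor 1/2 comes from differentiating the squared loss.
            beta[j] = soft_threshold(rho_j, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]         # restore the full residual
    return beta


# Usage on simulated data with more covariates than observations.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 100))
beta_true = np.zeros(100)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(30)
beta_hat = lasso_cd(X, y, lam=0.5)
print(np.count_nonzero(beta_hat), "non-zero coefficients out of", beta_hat.size)
```

Each coordinate update is a one-dimensional Lasso problem with a closed-form solution, which is what makes the penalty usable when $p \gg N$.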
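Penalized GGM for gene networks
Maximize the L1 penalized likelihood function for the precision matrix $\Theta$
\[ l(\Theta) = \ln[\det(\Theta)] - \operatorname{tr}(S\Theta) - g(\lambda, \Theta), \]
where $S = \frac{1}{n} X^T X$ is the empirical covariance matrix.
The graphical lasso (Friedman et al., 2008)
\[ g(\lambda, \Theta) = \lambda \sum_{i \neq j} |\theta_{ij}| \]
The group lasso (Yuan and Lin, 2007)
\[ g(\lambda, \{\Theta\}) = \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} |\theta^k_{ij}| + \lambda_2 \sum_{i \neq j} \sqrt{\sum_{k=1}^{K} |\theta^k_{ij}|} \]
The fused lasso (Danaher et al., 2011)
\[ g(\lambda, \{\Theta\}) = \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} |\theta^k_{ij}| + \lambda_2 \sum_{k<k'} \sum_{i,j} |\theta^k_{ij} - \theta^{k'}_{ij}| \]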
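To make the graphical lasso and fused lasso penalties concrete, the following sketch evaluates them with NumPy for given precision matrices. The function names are hypothetical and the inputs are toy matrices; the code only evaluates the objective and does not optimize it.

```python
import numpy as np


def offdiag_mask(p):
    """Boolean mask selecting the off-diagonal entries (i != j)."""
    return ~np.eye(p, dtype=bool)


def glasso_penalty(theta, lam):
    """Graphical lasso penalty: lam * sum_{i != j} |theta_ij|."""
    return lam * np.abs(theta[offdiag_mask(theta.shape[0])]).sum()


def fused_penalty(thetas, lam1, lam2):
    """Fused lasso penalty over K classes; thetas has shape (K, p, p)."""
    thetas = np.asarray(thetas, dtype=float)
    K, p, _ = thetas.shape
    mask = offdiag_mask(p)
    # Sparsity term: off-diagonal entries only.
    sparsity = sum(np.abs(t[mask]).sum() for t in thetas)
    # Fusion term: all entries (i, j), summed over class pairs k < k'.
    fusion = sum(np.abs(thetas[k] - thetas[kp]).sum()
                 for k in range(K) for kp in range(k + 1, K))
    return lam1 * sparsity + lam2 * fusion


def penalized_loglik(theta, S, lam):
    """ln det(theta) - tr(S theta) - graphical lasso penalty."""
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:
        return -np.inf  # theta is not positive definite
    return logdet - np.trace(S @ theta) - glasso_penalty(theta, lam)


theta_a = np.array([[1.0, 0.5], [0.5, 1.0]])
theta_b = np.array([[1.0, 0.0], [0.0, 2.0]])
print(glasso_penalty(theta_a, lam=2.0))
print(fused_penalty([theta_a, theta_b], lam1=1.0, lam2=0.5))
```

The fusion term is zero exactly when two classes share an entry, so a large $\lambda_2$ pushes the class networks towards a common network.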
Network Modeling: a high-dimensional problem
Specifically, we are interested in estimating the networks for 8 cancer types and 6 types of variables. The problem results in the estimation of about 485 million edges.
Data type      Number of variables
mRNA           7954
CNA            6562
miRNA          285
Methylation    3831
Mutation       469
Clinical       3
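The Alternating Directions Method of Multipliers
To jointly model sparse GGMs we propose an extended version of the fused lasso penalty. We minimize the penalized negative log-likelihood
\[ l(\{\Theta\}) = \sum_{k=1}^{K} n_k \left[ \operatorname{tr}(S^k \Theta^k) - \ln\left(\det(\Theta^k)\right) \right] + g(\lambda, \{Z\}), \]
\[ g(\lambda, \{Z\}) = \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} \left[ \alpha \left| Z^k_{ij} \right| + (1 - \alpha) \left( Z^k_{ij} \right)^2 \right] + \lambda_2 \sum_{k<k'} \sum_{i,j} \left| Z^k_{ij} - Z^{k'}_{ij} \right|. \]
The ADMM (Boyd et al., 2011) can be applied to the general problem
\[ \underset{\{\Theta\}, \{Z\}}{\text{minimize}} \quad f(\{\Theta\}) + g(\lambda, \{Z\}) \qquad \text{subject to} \quad \Theta^k = Z^k, \quad k = 1, \ldots, K, \]
where $f(\{\Theta\}) = \sum_{k=1}^{K} n_k \left[ \operatorname{tr}(S^k \Theta^k) - \ln\left(\det(\Theta^k)\right) \right]$ is the negative log-likelihood term.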
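ADMM steps
ADMM solves this problem by defining the scaled augmented Lagrangian as follows
\[ L(\{\Theta\}, \{Z\}, \{U\}) = f(\{\Theta\}) + g(\lambda, \{Z\}) + \frac{\rho}{2} \sum_{k=1}^{K} \left\| \Theta^k - Z^k + U^k \right\|_F^2, \]
where $U^k$ are the dual variables.
At iteration $m$, the variables $\{\Theta\}$, $\{Z\}$ and $\{U\}$ are updated according to
1. $\Theta^k_m \leftarrow \arg\min_{\{\Theta\}} L(\{\Theta\}, \{Z_{m-1}\}, \{U_{m-1}\})$
2. $Z^k_m \leftarrow \arg\min_{\{Z\}} L(\{\Theta_m\}, \{Z\}, \{U_{m-1}\})$
3. $U^k_m \leftarrow U^k_{m-1} + \Theta^k_m - Z^k_m$
for $k = 1, \ldots, K$.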
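The three updates can be sketched end to end for the simplest special case: a single class (K = 1, so the fusion term vanishes) with a pure L1 penalty (α = 1) on the off-diagonal entries. This is a simplification chosen for illustration, not the full joint method. The function name `admm_glasso` and all default values are assumptions. The Θ-update uses the closed form based on the eigendecomposition of $\frac{\rho}{n}(Z - U) - S$, and the Z-update reduces to soft-thresholding.

```python
import numpy as np


def soft_threshold(x, t):
    """Soft-thresholding operator: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def admm_glasso(S, n, lam, rho=1.0, n_iter=200):
    """Scaled ADMM for n * [tr(S Theta) - ln det Theta] + lam * sum_{i != j} |Z_ij|
    subject to Theta = Z (single class, L1 penalty only)."""
    p = S.shape[0]
    Z = np.eye(p)
    U = np.zeros((p, p))
    for _ in range(n_iter):
        # Step 1: Theta-update in closed form.
        d, V = np.linalg.eigh((rho / n) * (Z - U) - S)
        d_new = n / (2.0 * rho) * (d + np.sqrt(d ** 2 + 4.0 * rho / n))
        Theta = (V * d_new) @ V.T
        # Step 2: Z-update, soft-thresholding the off-diagonal entries of Theta + U.
        A = Theta + U
        Z = soft_threshold(A, lam / rho)
        np.fill_diagonal(Z, np.diag(A))      # the diagonal is not penalized
        # Step 3: dual update.
        U = U + Theta - Z
    return Theta, Z


rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
S = X.T @ X / 50
Theta, Z = admm_glasso(S, n=50, lam=5.0)
print(np.count_nonzero(np.triu(Z, k=1)), "edges in the sparse estimate Z")
```

Θ is always positive definite by construction, while Z carries the exact zeros, so the sparse network is read off Z.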
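ADMM, first step
For the first step, the function $g$ is constant, so the problem is to minimize the function
\[ \sum_{k=1}^{K} n_k \left[ \operatorname{tr}(S^k \Theta^k) - \ln\left(\det(\Theta^k)\right) \right] + \frac{\rho}{2} \sum_{k=1}^{K} \left\| \Theta^k - Z^k + U^k \right\|_F^2 \]
with respect to $\{\Theta\}$.
Let $V D V^T$ be the eigendecomposition of $\frac{\rho}{n_k}(Z^k - U^k) - S^k$.
The minimizer is given (Witten and Tibshirani, 2009) by $V \tilde{D} V^T$, where $\tilde{D}$ is diagonal and
\[ \tilde{D}_{jj} = \frac{n_k}{2\rho} \left( D_{jj} + \sqrt{D_{jj}^2 + 4\rho/n_k} \right). \]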
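The closed form follows from the first-order condition of the first-step objective. The derivation below is a sketch added for clarity; it uses the same symbols as the slide and writes $\tilde{D}$ for the diagonal matrix of the minimizer.

```latex
% The objective separates over k. Setting the gradient with respect to \Theta^k to zero:
n_k \left( S^k - (\Theta^k)^{-1} \right) + \rho \left( \Theta^k - Z^k + U^k \right) = 0
\quad \Longleftrightarrow \quad
\frac{\rho}{n_k} \Theta^k - (\Theta^k)^{-1}
  = \frac{\rho}{n_k} \left( Z^k - U^k \right) - S^k = V D V^T .

% Seeking a solution with the same eigenvectors, \Theta^k = V \tilde{D} V^T,
% gives one scalar equation per eigenvalue:
\frac{\rho}{n_k} \tilde{D}_{jj} - \frac{1}{\tilde{D}_{jj}} = D_{jj}
\quad \Longleftrightarrow \quad
\frac{\rho}{n_k} \tilde{D}_{jj}^2 - D_{jj} \tilde{D}_{jj} - 1 = 0 .

% The positive root of this quadratic is
\tilde{D}_{jj} = \frac{n_k}{2\rho} \left( D_{jj} + \sqrt{D_{jj}^2 + 4\rho / n_k} \right) > 0 ,
% so the minimizer is positive definite.
```

Taking the positive root is what keeps each $\Theta^k$ a valid precision matrix at every iteration.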
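ADMM, second step
For the second step, the function $f$ is constant, so the problem is to minimize the function
\[ g(\lambda, \{Z\}) + \frac{\rho}{2} \sum_{k=1}^{K} \left\| \Theta^k - Z^k + U^k \right\|_F^2
 = \frac{\rho}{2} \sum_{k=1}^{K} \left\| Z^k - A^k \right\|_F^2 + \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} \left[ \alpha |Z^k_{ij}| + (1 - \alpha) \left( Z^k_{ij} \right)^2 \right] + \lambda_2 \sum_{k<k'} \sum_{i,j} |Z^k_{ij} - Z^{k'}_{ij}| \]
with respect to $\{Z\}$, where $A^k = \Theta^k + U^k$.
This problem is separable for each element $(i, j)$, so we can solve separately the problems
\[ \underset{\{Z_{ij}\}}{\text{minimize}} \left\{ \frac{1}{2} \sum_{k=1}^{K} \left( Z^k_{ij} - A^k_{ij} \right)^2 + \frac{\lambda_1}{\rho} I_{i \neq j} \sum_{k=1}^{K} \left[ \alpha |Z^k_{ij}| + (1 - \alpha) \left( Z^k_{ij} \right)^2 \right] + \frac{\lambda_2}{\rho} \sum_{k<k'} |Z^k_{ij} - Z^{k'}_{ij}| \right\}. \]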
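ADMM, second step
For a fixed element $(i, j)$, write $Z = (Z^1, \ldots, Z^K)$ and $A = (A^1, \ldots, A^K)$, and let
\[ g_1(Z) = \frac{1}{2} \sum_{k=1}^{K} \left( Z^k - A^k \right)^2 \]
\[ g_2(Z) = \sum_{k=1}^{K} \lambda_1^k \left[ \alpha |Z^k| + (1 - \alpha) \left( Z^k \right)^2 \right] \]
\[ g_3(Z) = \sum_{k<k'} \lambda_2^{kk'} |Z^k - Z^{k'}| = \| \Lambda_2 L Z \|_1, \]
where $\Lambda_2 = \operatorname{diag}(\lambda_2^{kk'})$ holds the $\frac{1}{2}K(K-1)$ pairwise weights and $L$ is a $\frac{1}{2}K(K-1)$-by-$K$ matrix with values in $\{-1, 0, 1\}$ corresponding to the pairwise differences to be penalized.
This problem can be written as
\[ \underset{Z}{\text{minimize}} \quad g_1(Z) + g_2(V) + g_3(W) \qquad \text{subject to} \quad V = Z, \quad W = LZ. \]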
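ADMM, second step
In each iteration, the solutions to this problem are given by
\[ Z = \left[ (\rho_1 + 1) I + \rho_2 L^T L \right]^{-1} \left[ A + \rho_1 \left( V - \frac{1}{\rho_1} P \right) + \rho_2 L^T \left( W - \frac{1}{\rho_2} Q \right) \right] \]
\[ V = ST_{\lambda_1/\rho_1} \left( Z + \frac{1}{\rho_1} P \right) \]
\[ W = ST_{\lambda_2/\rho_2} \left( LZ + \frac{1}{\rho_2} Q \right), \]
where $ST_t$ denotes the soft-thresholding operator with threshold $t$, and $P$ and $Q$ are the dual variables for the constraints $V = Z$ and $W = LZ$, with penalty parameters $\rho_1$ and $\rho_2$.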
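The inner iteration can be sketched for a single element (i, j) as follows. This is a minimal sketch under stated assumptions: α = 1 (pure L1 sparsity term, so the V-update is plain soft-thresholding), common weights λ1 and λ2 for all classes, and the standard ADMM dual steps P ← P + ρ1(Z − V) and Q ← Q + ρ2(LZ − W), which are not shown on the slide. The names `difference_matrix` and `fused_elementwise` are hypothetical.

```python
import numpy as np
from itertools import combinations


def soft_threshold(x, t):
    """Soft-thresholding operator: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def difference_matrix(K):
    """Matrix L with one row per pair k < k', so that (L z) = z_k - z_k'."""
    pairs = list(combinations(range(K), 2))
    L = np.zeros((len(pairs), K))
    for row, (k, kp) in enumerate(pairs):
        L[row, k] = 1.0
        L[row, kp] = -1.0
    return L


def fused_elementwise(A, lam1, lam2, rho1=1.0, rho2=1.0, n_iter=2000):
    """Minimize 0.5*||z - A||^2 + lam1*||z||_1 + lam2*sum_{k<k'} |z_k - z_k'|
    over z in R^K (alpha = 1 assumed)."""
    A = np.asarray(A, dtype=float)
    K = A.size
    L = difference_matrix(K)
    V, P = np.zeros(K), np.zeros(K)
    W, Q = np.zeros(L.shape[0]), np.zeros(L.shape[0])
    M = (rho1 + 1.0) * np.eye(K) + rho2 * L.T @ L
    for _ in range(n_iter):
        rhs = A + rho1 * (V - P / rho1) + rho2 * L.T @ (W - Q / rho2)
        Z = np.linalg.solve(M, rhs)
        V = soft_threshold(Z + P / rho1, lam1 / rho1)
        W = soft_threshold(L @ Z + Q / rho2, lam2 / rho2)
        # Dual updates (standard ADMM steps, assumed here).
        P = P + rho1 * (Z - V)
        Q = Q + rho2 * (L @ Z - W)
    return Z


# Two classes with values 1 and 3: a large fusing weight merges them at their mean.
print(fused_elementwise([1.0, 3.0], lam1=0.0, lam2=2.0))
print(fused_elementwise([1.0, 3.0], lam1=0.0, lam2=0.5))
```

Since K is the number of classes, the linear system solved in the Z-update is tiny, and the matrix M can be factorized once and reused for every element (i, j).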
Selection of parameters via bootstrap
The most important parameters in the model are the sparsity parameter, $\lambda_1$, and the fusing parameter, $\lambda_2$. Here we propose to use the bootstrap and select values for the parameters that generate stable networks.
Consider first the sparsity parameter and assume we have $B$ bootstrap estimates of our networks. For class $k = 1, 2, \ldots, K$ let
\[ n^k_{ij} = \frac{1}{B} \sum_{b=1}^{B} I(\theta^k_{ij,b} \neq 0), \]
where $\theta^k_{ij,b}$ is the $b$-th bootstrap estimate for link $(i, j)$ in class $k$. Then $n^k_{ij}$ is an estimate of the probability of presence of link $(i, j)$ in cancer class $k$.
For a given threshold $T_1$, a link will be present in the final estimate if it is present in at least $100\,T_1\%$ of the bootstrap estimates, that is, if $n^k_{ij} \geq T_1$.
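The presence frequencies and the thresholding rule translate directly into array operations. In the sketch below the bootstrap estimates for one class are assumed to be stacked in an array of shape (B, p, p); the function names and the optional tolerance `tol` for treating tiny values as zero are assumptions.

```python
import numpy as np


def edge_frequencies(theta_boot, tol=0.0):
    """Fraction of bootstrap estimates in which each link (i, j) is non-zero.

    theta_boot: array of shape (B, p, p) with the B bootstrap precision matrices
    of one class.
    """
    theta_boot = np.asarray(theta_boot, dtype=float)
    return (np.abs(theta_boot) > tol).mean(axis=0)


def stable_edges(theta_boot, T1, tol=0.0):
    """Boolean p-by-p matrix: True where the presence frequency is at least T1."""
    return edge_frequencies(theta_boot, tol) >= T1


# Toy example with B = 4 bootstrap estimates of a 2-by-2 precision matrix.
boot = np.zeros((4, 2, 2))
boot[:, 0, 0] = boot[:, 1, 1] = 1.0
boot[:, 0, 1] = boot[:, 1, 0] = [0.3, 0.0, 0.1, 0.2]   # link present in 3 of 4 estimates
print(edge_frequencies(boot))
print(stable_edges(boot, T1=0.7))
```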
Selection of parameters via bootstrap
To select the fusing parameter we proceed similarly. For classes $k, k' = 1, 2, \ldots, K$ let
\[ n^{kk'}_{ij} = \frac{\sum_{b=1}^{B} I(\theta^k_{ij,b} \neq \theta^{k'}_{ij,b},\ \theta^k_{ij,b} \neq 0,\ \theta^{k'}_{ij,b} \neq 0)}{\sum_{b=1}^{B} I(\theta^k_{ij,b} \neq 0,\ \theta^{k'}_{ij,b} \neq 0)}. \]
This is an estimate of the probability that link $(i, j)$ is differential in classes $k$ and $k'$ given that it is present in both classes.
For a given threshold $T_2$, if $n^{kk'}_{ij} \geq T_2$, then link $(i, j)$ is differential in classes $k$ and $k'$; otherwise it is fused.
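The differential frequency can be computed the same way for a pair of classes. The sketch below assumes the bootstrap estimates of each class are stacked in arrays of shape (B, p, p) and that replicate b of one class is paired with replicate b of the other. The function name, the tolerance `tol`, and the choice to return NaN when a link is never present in both classes are assumptions.

```python
import numpy as np


def differential_frequencies(boot_k, boot_kp, tol=0.0):
    """Estimate P(link differs between classes | link present in both classes).

    boot_k, boot_kp: arrays of shape (B, p, p) with paired bootstrap estimates.
    Entries never present in both classes are returned as NaN.
    """
    a = np.asarray(boot_k, dtype=float)
    b = np.asarray(boot_kp, dtype=float)
    both = (a != 0) & (b != 0)
    differ = both & (np.abs(a - b) > tol)
    num = differ.sum(axis=0).astype(float)
    den = both.sum(axis=0).astype(float)
    out = np.full(den.shape, np.nan)
    np.divide(num, den, out=out, where=den > 0)
    return out


# Toy example: the link is present in both classes in 3 replicates and differs in 1.
a = np.zeros((4, 2, 2))
b = np.zeros((4, 2, 2))
a[:, 0, 1] = [0.3, 0.3, 0.0, 0.5]
b[:, 0, 1] = [0.3, 0.1, 0.2, 0.5]
n_diff = differential_frequencies(a, b)
T2 = 0.5
print(n_diff[0, 1], "differential" if n_diff[0, 1] >= T2 else "fused")
```

Because the fused penalty produces exactly equal estimates for fused links, an exact comparison is meaningful; a small positive `tol` may be safer when the estimates come from an iterative solver stopped early.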
Pipeline for TCGA data analysis
Validation
Biological analysis
References
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
P. Danaher, P. Wang, and D. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. arXiv:1111.0324v1, 2011.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432–441, 2008.
SD. Kendall, CM. Linardic, SJ. Adam, and CM. Counter. A network of genetic events sufficient to convert normal human cells to a tumorigenic state. Cancer Research, 65:9824–9828, 2005.
S. Lauritzen. Graphical Models. Oxford Science Publications, 1996.
KM. Mani, C. Lefebvre, K. Wang, WK. Lim, K. Basso, et al. A systems biology approach to prediction of oncogenes and molecular perturbation targets in B-cell lymphomas. Molecular Systems Biology, 4(169), 2008.
RK. Nibbe, M. Koyuturk, and MR. Chance. An integrative -omics approach to identify functional sub-networks in human colorectal cancer. PLoS Computational Biology, 6(1):1–15, 2010.
N. Slavov and KA. Dawson. Correlation signature of the macroscopic states of the gene regulatory network in cancer. Proceedings of the National Academy of Sciences of the United States of America, 106(11):4079–4084, 2009.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal