We propose a regularized method for multivariate linear regression when the number of predictors may exceed the sample size. This method is designed to strengthen the estimation and the selection of the relevant input features with three ingredients: it takes advantage of the dependency pattern between the responses by estimating the residual covariance; it performs selection on direct links between predictors and responses; and selection is driven by prior structural information. To this end, we build on a recent reformulation of the multivariate linear regression model to a conditional Gaussian graphical model and propose a new regularization scheme accompanied with an efficient optimization procedure. On top of showing very competitive performance on artificial and real data sets, our method demonstrates capabilities for fine interpretation of its parameters, as illustrated in applications to genetics, genomics and spectroscopy.
Structured Regularization for conditional Gaussian Graphical Models

Julien Chiquet, Stéphane Robin, Tristan Mary-Huard

MAP5 – April 4th, 2014

arXiv preprint: http://arxiv.org/abs/1403.6168
Application to multi-trait genomic selection (MLCB 2013 NIPS Workshop)
R package spring: https://r-forge.r-project.org/projects/spring-pkg/
Multivariate regression analysis
Consider n samples and, for individual i, let
- $y_i$ be the q-dimensional vector of responses,
- $x_i$ be the p-dimensional vector of predictors,
- B be the p × q matrix of regression coefficients,
- $\varepsilon_i$ be a noise term with q-dimensional covariance matrix R:
$$y_i = B^\top x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, R), \quad \forall i = 1, \dots, n.$$

Matrix notation
Let Y (n × q) and X (n × p) be the data matrices; then
$$Y = XB + \varepsilon, \quad \mathrm{vec}(\varepsilon) \sim \mathcal{N}(0, R \otimes I_n).$$
Remark
If X is a design matrix, this is called the “General Linear Model” (GLM).
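A minimal simulation sketch in base R (an illustration only, not part of the spring package): draw from the model above, using a Cholesky factor so that each noise row has covariance R.

```r
set.seed(0)
n <- 100; p <- 6; q <- 3
X <- matrix(rnorm(n * p), n, p)
B <- matrix(rnorm(p * q), p, q)
R <- 0.5^abs(outer(1:q, 1:q, "-"))           # a Toeplitz covariance, as used later
E <- matrix(rnorm(n * q), n, q) %*% chol(R)  # each row of E ~ N(0, R)
Y <- X %*% B + E
```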
Motivating example: cookie dough data
Osborne, B.G., Fearn, T., Miller, A.R., and Douglas, S. Application of near infrared reflectance spectroscopy to compositional analysis of biscuits and biscuit doughs. J. Sci. Food Agr., 1984.
[Figure: distributions of the four responses (fat, sucrose, dry flour, water) and reflectance spectra of the predictors along the wavelength axis.]

- q = 4 responses related to the composition of biscuit dough.
- p = 256 wavelengths equally sampled between 1380nm and 2400nm.
- n = 70 biscuit dough samples.
From Low to High dimensional setup

Low dimensional setup
Mardia, Kent and Bibby, Multivariate Analysis, Academic Press, 1979.
- The mathematics is the same for both the GLM and MLR.
- Maximum likelihood, least squares and generalized least squares all lead to the estimator
$$\hat B = (X^\top X)^{-1} X^\top Y,$$
which is not defined when n < p.

High dimensional setup: regularization is a popular answer
Bias B towards a given feasible set to enhance both prediction performance and interpretability.
- What features are required for the coefficients? (sparsity, and...)
- How do we shape this feasible set?

Our proposal: SPRING (Structured selection of Primordial Relationships IN the General linear model)
1. Account for the dependency structure between the outputs, if it exists, by estimating R.
2. Pay attention to the direct links between predictors and responses by means of a sparse GGM.
3. Integrate prior information about the predictors by means of graph-regularization.
Outline
Statistical Model
Regularizing Scheme and Optimization
Inference and Optimization
Simulation Studies
Spectroscopy and the cookie dough data
Multi-trait genomic selection for a biparental population (Colza)
Connection between multivariate regression and GGM (I)
Multivariate Linear Regression (MLR)
The model writes $y_i \mid x_i \sim \mathcal{N}(B^\top x_i, R)$, with negative log-likelihood
$$-\log L(B, R) = \frac{n}{2}\log|R| + \frac{1}{2}\mathrm{tr}\big((Y - XB)\,R^{-1}(Y - XB)^\top\big) + \text{cst},$$
which is only bi-convex in (B, R).
Connection between multivariate regression and GGM (II)
Used in Sohn & Kim (2012) and others.

Assume that $x_i, y_i$ are centered and jointly Gaussian, such that
$$\begin{pmatrix} x_i \\ y_i \end{pmatrix} \sim \mathcal{N}(0, \Sigma), \quad \text{with } \Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}, \quad \Omega \triangleq \Sigma^{-1} = \begin{pmatrix} \Omega_{xx} & \Omega_{xy} \\ \Omega_{yx} & \Omega_{yy} \end{pmatrix}.$$

A convex log-likelihood
The model writes $y_i \mid x_i \sim \mathcal{N}\big(-\Omega_{yy}^{-1}\Omega_{yx}x_i,\; \Omega_{yy}^{-1}\big)$ and
$$-\frac{2}{n}\log L(\Omega_{xy}, \Omega_{yy}) = -\log|\Omega_{yy}| + \mathrm{tr}(S_{yy}\Omega_{yy}) + 2\,\mathrm{tr}(S_{xy}\Omega_{yx}) + \mathrm{tr}\big(\Omega_{yx}S_{xx}\Omega_{xy}\Omega_{yy}^{-1}\big) + \text{cst}$$
(with $S_{xx} = X^\top X / n$ and so on).
CGGM: interpretation (I)
Matrix Ω is related to partial correlations (direct links):
$$\mathrm{cor}(X_i, X_j \mid \text{rest}) = \frac{-\Omega_{ij}}{\sqrt{\Omega_{ii}\Omega_{jj}}}, \quad \text{so } \Omega_{ij} = 0 \Leftrightarrow X_i \perp\!\!\!\perp X_j \mid \text{rest}.$$

Linking parameters of MLR to cGGM
The cGGM "splits" the regression coefficients into two parts:
$$B = -\Omega_{xy}\Omega_{yy}^{-1}, \qquad R = \Omega_{yy}^{-1}.$$
1. $\Omega_{xy}$ describes the direct links between predictors and responses.
2. $\Omega_{yy}$ is the inverse of the residual covariance R.

B entails both direct and indirect links, the latter due to correlation between responses.
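As a sanity check of this partial-correlation reading, here is a small base-R illustration on simulated data (not from the talk): the entry of the inverse covariance reproduces the correlation of residuals after regressing out the remaining variables.

```r
set.seed(6)
Z <- matrix(rnorm(500 * 4), 500, 4)
Z[, 2] <- Z[, 1] + 0.5 * Z[, 2]              # induce some dependence
Omega <- solve(cov(Z))                       # empirical precision matrix
pc_12 <- -Omega[1, 2] / sqrt(Omega[1, 1] * Omega[2, 2])
r1 <- resid(lm(Z[, 1] ~ Z[, 3] + Z[, 4]))    # regress out variables 3 and 4
r2 <- resid(lm(Z[, 2] ~ Z[, 3] + Z[, 4]))
all.equal(pc_12, cor(r1, r2))                # identical up to numerical error
```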
CGGM: interpretation (II)
Illustrative examples

Settings (shown with and without structure along the predictors):
- $\Omega_{xy}$: p = 40 predictors, q = 5 outcomes.
- R: Toeplitz scheme $R_{ij} = \tau^{|i-j|}$, with $\tau \in \{0.1, 0.5, 0.9\}$ (low, medium, high).
- $B = -\Omega_{xy} R$.

Direct relationships are masked in B in case of strong correlations between the responses: the higher τ, the less the support of $\Omega_{xy}$ can be read off B (a numerical sketch follows the remarks below).

[Figure: heatmaps of $\Omega_{xy}$, R and B for the three levels of τ, with and without structure along the predictors.]

Consequence
Our regularization scheme will be applied on the direct links $\Omega_{xy}$.

Remarks
- Sparsity on $\Omega_{xy}$ does not necessarily induce sparsity on B.
- The prior structure on the predictors is identical on $\Omega_{xy}$ and B, as it applies on the "rows".
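Below is a minimal numerical sketch of this masking effect in base R (made-up values, not the spring package):

```r
# How correlation between responses masks the direct links of Omega_xy
# inside B = -Omega_xy %*% R.
set.seed(1)
p <- 40; q <- 5
Omega_xy <- matrix(0, p, q)                   # sparse direct links
nz <- sample(p * q, 25)
Omega_xy[nz] <- sample(c(-1, 1), 25, replace = TRUE)
make_R <- function(tau, q) tau^abs(outer(1:q, 1:q, "-"))  # Toeplitz R
for (tau in c(0.1, 0.5, 0.9)) {
  B <- -Omega_xy %*% make_R(tau, q)
  # With tau = 0.1, B is almost as sparse as Omega_xy; with tau = 0.9,
  # many more entries of B move away from zero although Omega_xy is unchanged.
  cat("tau =", tau, "- share of |B| > 0.05:", mean(abs(B) > 0.05), "\n")
}
```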
Ball crafting towards structured regularization (1)
Elastic-Net
Grouping effect that catches highly correlated predictors simultaneously:
$$\underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}} \;\; \|X\beta - y\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2.$$

Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. JRSS B, 2005.
Ball crafting towards structured regularization (2)

Fused-Lasso
Encourages sparsity and identical consecutive parameters:
$$\underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}} \;\; \|X\beta - y\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\sum_{j=1}^{p-1}|\beta_{j+1} - \beta_j|,$$
or equivalently, with the (p−1) × p first-difference matrix D,
$$\underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}} \;\; \|X\beta - y\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\|D\beta\|_1, \qquad D = \begin{pmatrix} -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{pmatrix}.$$

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. Sparsity and smoothness via the fused lasso. JRSS B, 2005.
Ball crafting towards structured regularization (3)

Structured/Generalized Elastic-Net
A "smooth" version of the fused-Lasso (neighbors should be close, not identical):
$$\underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}} \;\; \|X\beta - y\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\sum_{j=1}^{p-1}(\beta_{j+1} - \beta_j)^2,$$
where the quadratic term equals $\lambda_2\,\beta^\top D^\top D\,\beta$, with
$$L = D^\top D = \begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix}.$$

Slawski, zu Castell and Tutz. Feature selection guided by structural information. Ann. Appl. Stat., 2010.
Hebiri and van de Geer. The smooth-lasso and other ℓ1 + ℓ2 penalized methods. EJS, 2011.
Generalized fused penalty: the univariate case

Graphical interpretation of the fusion penalty
$$\underbrace{\sum_{j=1}^{p-1}|\beta_{j+1} - \beta_j|}_{\text{Fused Lasso}} \qquad\qquad \underbrace{\sum_{j=1}^{p-1}(\beta_{j+1} - \beta_j)^2}_{\text{Generalized ridge}}$$
Both correspond to a chain graph between the successive (ordered) predictors.

Generalization via a graphical argument
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{W})$ be a graph with weighted edges. Then
$$\sum_{(i,j)\in\mathcal{E}} w_{ij}\,|\beta_j - \beta_i| = \|D\beta\|_1, \qquad \sum_{(i,j)\in\mathcal{E}} w_{ij}\,(\beta_j - \beta_i)^2 = \beta^\top L\,\beta,$$
where $L = D^\top D \succeq 0$ is the graph Laplacian associated to $\mathcal{G}$.
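A quick base-R check of this graphical correspondence on the chain graph (a sketch; D and L built by hand, unweighted edges):

```r
# Chain graph over p ordered predictors: incidence matrix D, Laplacian L.
p <- 6
D <- matrix(0, p - 1, p)
for (j in 1:(p - 1)) { D[j, j] <- -1; D[j, j + 1] <- 1 }
L <- t(D) %*% D                       # tridiagonal: 1, 2, ..., 2, 1 diagonal
beta <- rnorm(p)
fused <- sum(abs(diff(beta)))         # sum_j |beta_{j+1} - beta_j|
stopifnot(all.equal(fused, sum(abs(D %*% beta))))
ridge <- sum(diff(beta)^2)            # sum_j (beta_{j+1} - beta_j)^2
stopifnot(all.equal(ridge, drop(t(beta) %*% L %*% beta)))
```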
Adapting this scheme to multivariate settings

Bayesian interpretation
Suppose the prior structure is encoded in a matrix L.
- Univariate case: the conjugate prior for β is $\mathcal{N}(0, L^{-1})$.
- Multivariate case: combine with the covariance, then $\mathrm{vec}(B) \sim \mathcal{N}(0, R \otimes L^{-1})$.
- Using vec and ⊗ properties, we have for the direct links $\mathrm{vec}(\Omega_{xy}) \sim \mathcal{N}(0, R^{-1} \otimes L^{-1})$.

Corresponding regularization term
$$\log P(\Omega_{xy} \mid L, R) = -\frac{1}{2}\,\mathrm{tr}\big(\Omega_{xy}^\top L\,\Omega_{xy} R\big) + \text{cst}.$$
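The trace form of the prior follows from the identity $\mathrm{vec}(\Omega)^\top (R \otimes L)\,\mathrm{vec}(\Omega) = \mathrm{tr}(\Omega^\top L\,\Omega R)$, which can be checked numerically (base R, random matrices):

```r
# Numerical check of the Kronecker/trace identity behind the penalty.
set.seed(2)
p <- 4; q <- 3
L <- crossprod(matrix(rnorm(p * p), p))   # positive definite "structure"
R <- crossprod(matrix(rnorm(q * q), q))   # positive definite "covariance"
O <- matrix(rnorm(p * q), p, q)           # stands for Omega_xy
lhs <- drop(t(c(O)) %*% (R %x% L) %*% c(O))   # vec form
rhs <- sum(diag(t(O) %*% L %*% O %*% R))      # trace form
stopifnot(all.equal(lhs, rhs))
```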
Optimization problem
Penalized criterion
Encourage sparsity with a structuring prior on the direct links:
$$J(\Omega_{xy}, \Omega_{yy}) = -\frac{1}{n}\log L(\Omega_{xy}, \Omega_{yy}) + \frac{\lambda_2}{2}\,\mathrm{tr}\big(\Omega_{yx} L\,\Omega_{xy}\Omega_{yy}^{-1}\big) + \lambda_1\|\Omega_{xy}\|_1.$$

Proposition
The objective function is jointly convex in $(\Omega_{xy}, \Omega_{yy})$ and admits at least one global minimum, which is unique when n ≥ q and $(\lambda_2 L + S_{xx})$ is positive definite.
Algorithm
Alternate optimization
$$\Omega_{yy}^{(k+1)} = \underset{\Omega_{yy} \succ 0}{\arg\min}\; J_{\lambda_1\lambda_2}\big(\Omega_{xy}^{(k)}, \Omega_{yy}\big), \qquad (1a)$$
$$\Omega_{xy}^{(k+1)} = \underset{\Omega_{xy}}{\arg\min}\; J_{\lambda_1\lambda_2}\big(\Omega_{xy}, \Omega_{yy}^{(k+1)}\big). \qquad (1b)$$

- (1a) boils down to the diagonalization of a q × q matrix: O(q³).
- (1b) can be recast as a generalized Elastic-Net problem of size pq: O(npqk), where k is the final number of nonzero entries in $\Omega_{xy}$.

Convergence
Despite the nonsmoothness of the objective, the ℓ1 penalty is separable in $(\Omega_{xy}, \Omega_{yy})$, so the results of Tseng (2001, 2009) on the convergence of coordinate descent apply.
First block: covariance estimation
Analytic resolution for R̂

If $\Omega_{xy} = 0$, then $\hat\Omega_{yy} = S_{yy}^{-1}$. Otherwise we rely on the following:

Proposition
Let n > q. Assume that the following eigen decomposition holds,
$$\Omega_{yx}\Sigma_{xx}^{\lambda_2}\Omega_{xy}\,S_{yy} = U\,\mathrm{diag}(\zeta)\,U^{-1}$$
(where $\Sigma_{xx}^{\lambda_2} = S_{xx} + \lambda_2 L$ collects the quadratic terms of the criterion), and denote by $\eta = (\eta_1, \dots, \eta_q)$ the roots of $\eta_j^2 - \eta_j - \zeta_j = 0$. Then
$$\hat\Omega_{yy} = U\,\mathrm{diag}(\eta/\zeta)\,U^{-1}\,\Omega_{yx}\Sigma_{xx}^{\lambda_2}\Omega_{xy} \;(=\hat R^{-1}), \qquad (2a)$$
$$\hat\Omega_{yy}^{-1} = S_{yy}\,U\,\mathrm{diag}(\eta^{-1})\,U^{-1} \;(=\hat R). \qquad (2b)$$

Proof. Differentiation of the objective, the commuting-matrices property, algebra.
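The proposition can be probed numerically: differentiating the objective in $\Omega_{yy}$ gives the first-order condition $R + RMR = S_{yy}$ with $M = \Omega_{yx}\Sigma_{xx}^{\lambda_2}\Omega_{xy}$, and the candidate of (2b) satisfies it. A base-R sketch with random inputs (taking the positive roots; not the spring implementation):

```r
# Check that Rhat from eq. (2b) solves R + R M R = Syy.
set.seed(3)
q <- 4; p <- 10; n <- 50
Syy <- crossprod(matrix(rnorm(n * q), n, q)) / n
Oxy <- matrix(rnorm(p * q), p, q)
Sig <- crossprod(matrix(rnorm(n * p), n, p)) / n   # plays Sigma_xx^lambda2
M   <- t(Oxy) %*% Sig %*% Oxy
ev  <- eigen(M %*% Syy)                            # U diag(zeta) U^{-1}
U   <- Re(ev$vectors); zeta <- Re(ev$values)
eta <- (1 + sqrt(1 + 4 * zeta)) / 2                # positive roots of eta^2 - eta - zeta
Rhat <- Syy %*% U %*% diag(1 / eta) %*% solve(U)   # eq. (2b)
stopifnot(all.equal(Rhat + Rhat %*% M %*% Rhat, Syy, tolerance = 1e-8))
```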
Second block: parameter estimation
Reformulation as an Elastic-Net problem

Proposition
The solution $\hat\Omega_{xy}$ for a fixed $\Omega_{yy}$ is given by $\mathrm{vec}(\hat\Omega_{xy}) = \hat\omega$, where $\hat\omega$ solves the Elastic-Net problem
$$\underset{\omega \in \mathbb{R}^{pq}}{\arg\min}\;\; \frac{1}{2}\|A\omega - b\|_2^2 + \lambda_1\|\omega\|_1 + \frac{\lambda_2}{2}\,\omega^\top\big(\Omega_{yy}^{-1} \otimes L\big)\,\omega,$$
where A and b are defined through the Cholesky decomposition $C^\top C = \Omega_{yy}^{-1}$, as
$$A = \big(C \otimes X/\sqrt{n}\big), \qquad b = -\mathrm{vec}\big([Y C^{-1}/\sqrt{n}]^\top\big).$$

Proof. Tedious algebra with vec/tr/⊗ properties.
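The construction of A and b is easy to mirror in base R; as a consistency check, the mixed-product property gives $A^\top A = \Omega_{yy}^{-1} \otimes S_{xx}$ (a sketch with simulated matrices):

```r
# Build A and b from the Cholesky factor C of Omega_yy^{-1}.
set.seed(4)
n <- 30; p <- 5; q <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(rnorm(n * q), n, q)
Oyy_inv <- crossprod(matrix(rnorm(2 * q * q), 2 * q, q)) / (2 * q)
C <- chol(Oyy_inv)                    # upper triangular, t(C) %*% C = Oyy_inv
A <- C %x% (X / sqrt(n))              # (nq) x (pq) design on omega = vec(Omega_xy)
b <- -c(t(Y %*% solve(C) / sqrt(n)))  # b = -vec([Y C^{-1} / sqrt(n)]^T)
Sxx <- crossprod(X) / n
stopifnot(all.equal(crossprod(A), Oyy_inv %x% Sxx))
```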
Monitoring convergence
Example on the cookie dough data

[Figure: objective value vs. iteration, monitored along the whole path of λ1 (from 3e−01 down to 2e−04).]

[Figure: log-likelihood vs. iteration, monitored along the same path of λ1.]
Tuning the penalty parameters

K-fold cross-validation
Computationally intensive, but works! For a fold assignment $\kappa : \{1, \dots, n\} \to \{1, \dots, K\}$,
$$(\lambda_1^{cv}, \lambda_2^{cv}) = \underset{(\lambda_1,\lambda_2) \in \Lambda_1 \times \Lambda_2}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n} \big\| x_i^\top \hat B^{\lambda_1,\lambda_2}_{-\kappa(i)} - y_i \big\|_2^2.$$
(One may also cross-validate the log-likelihood $\log L(\hat\Omega_{xy}^{\lambda_1,\lambda_2}, \hat\Omega_{yy}^{\lambda_1,\lambda_2};\, x_i, y_i)$ instead of the prediction error.)

Information criteria adapted to regularized methods
$$(\lambda_1^{pen}, \lambda_2^{pen}) = \underset{\lambda_1,\lambda_2}{\arg\min}\; \Big\{ -2\log L(\hat\Omega_{xy}^{\lambda_1,\lambda_2}, \hat\Omega_{yy}^{\lambda_1,\lambda_2}) + \mathrm{pen}(\mathrm{df}_{\lambda_1,\lambda_2}) \Big\}.$$

Proposition (Generalized degrees of freedom)
$$\mathrm{df}_{\lambda_1,\lambda_2} = \mathrm{card}(\mathcal{A}) - \lambda_2\,\mathrm{tr}\Big( (R \otimes L)_{\mathcal{A}\mathcal{A}}\, \big(R \otimes (S_{xx} + \lambda_2 L)\big)^{-1}_{\mathcal{A}\mathcal{A}} \Big),$$
where $\mathcal{A} = \big\{ j : \mathrm{vec}(\hat\Omega_{xy}^{\lambda_1,\lambda_2})_j \neq 0 \big\}$ is the set of active coefficients.
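The grid search itself is straightforward to set up; here is a minimal runnable sketch in base R where the spring estimator is replaced by a hypothetical ridge stand-in `fit_B` (the fold machinery is the point, not the fit):

```r
set.seed(5)
n <- 60; p <- 8; q <- 2; K <- 5
X <- matrix(rnorm(n * p), n, p)
B_true <- matrix(rnorm(p * q) * rbinom(p * q, 1, 0.3), p, q)
Y <- X %*% B_true + matrix(rnorm(n * q), n, q)
kappa <- sample(rep(1:K, length.out = n))       # fold assignment kappa(i)
fit_B <- function(X, Y, lambda)                 # hypothetical stand-in, NOT spring
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))
lambdas <- 10^seq(-2, 2, length.out = 10)
cv_err <- sapply(lambdas, function(l)
  mean(sapply(1:K, function(k) {
    train <- kappa != k
    B <- fit_B(X[train, ], Y[train, ], l)
    mean((X[!train, ] %*% B - Y[!train, ])^2)   # held-out prediction error
  })))
lambdas[which.min(cv_err)]                      # penalty selected by CV
```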
Assessing the gain brought by covariance estimation
Simulation settings

Parameters
- $\Omega_{xy}$: p = 40 predictors, q = 5 outcomes; 25 non-null entries in {−1, 1}, no particular structure along the predictors.
- R: Toeplitz scheme $R_{ij} = \tau^{|i-j|}$ with $\tau \in \{0.1, 0.5, 0.9\}$.
- $B = -\Omega_{xy} R$.

Data generation
Draw $n_{train} = 50$ plus $n_{test} = 1000$ samples from
$$y_i = B^\top x_i + \varepsilon_i, \quad \text{with } x_i \sim \mathcal{N}(0, I) \text{ and } \varepsilon_i \sim \mathcal{N}(0, R).$$

Evaluating performance
Compare the prediction error over 100 runs between the Lasso, the group-Lasso and SPRING.

[Figure: heatmaps of $\Omega_{xy}$, R and B for the three levels of τ.]
Assessing the gain brought by covariance estimation
Results

[Figure: prediction error over 100 runs for SPRING (oracle), SPRING, Lasso and group-Lasso, illustrating the influence of the correlations between outcomes; scenarios low, medium, high map to τ ∈ {.1, .5, .9}.]
Assessing the gain brought by structure integration
Simulation settings

Parameters
- p = 100, q = 1, to remove the covariance effect.
- $\Omega_{xy} \triangleq \omega_{xy}$: a vector with two successive bumps,
$$\omega_j = \begin{cases} -\big((30-j)^2 - 100\big)/200 & j = 21, \dots, 39, \\ \big((70-j)^2 - 100\big)/200 & j = 61, \dots, 80, \\ 0 & \text{otherwise}. \end{cases}$$
- $R \triangleq \rho = 5$: a residual (scalar) variance.
- $\beta = -\omega_{xy}/\rho$.

[Figure: the two bumps of ω over the 100 predictor positions, with values in (−0.50, 0.50).]

Data generation
Draw $n_{train} = 120$ plus $n_{test} = 1000$ samples from
$$y_i = \beta^\top x_i + \varepsilon_i, \quad \text{with } x_i \sim \mathcal{N}(0, I) \text{ and } \varepsilon_i \sim \mathcal{N}(0, \rho).$$
A base-R sketch of the bump vector follows.
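```r
# The two-bump coefficient vector omega of this scenario, and beta = -omega/rho.
p <- 100
omega <- numeric(p)
j1 <- 21:39; omega[j1] <- -((30 - j1)^2 - 100) / 200
j2 <- 61:80; omega[j2] <-  ((70 - j2)^2 - 100) / 200
rho <- 5
beta <- -omega / rho
plot(omega, type = "h", xlab = "predictor index", ylab = "omega_j")
```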
Assessing the gain brought by structure information
Results (1): predictive performance

[Figure: mean prediction error + standard error over 100 runs along a grid of λ1, for SPRING with (λ2 = .01) and without (λ2 = 0) structural regularization (L = DᵀD), and for the Lasso.]
Assessing the gain brought by structure information
Results (2): robustness

What if we introduce a "wrong" structure? Evaluate performance with the same settings, but:
- randomly swap all elements of $\omega_{xy}$ to remove any structure,
- keep exactly the same $x_i$, $\varepsilon_i$,
- draw $y_i$ with the swapped and unswapped parameters,
- use the same folds for cross-validation,
then replicate 100 times.

Method               Scenario     MSE           PE
LASSO                –            .336 (.096)   58.6 (10.2)
E-Net (L = I)        –            .340 (.095)   59.0 (10.3)
SPRING (L = I)       –            .358 (.094)   60.7 (10.0)
S. E-Net (L = DᵀD)   unswapped    .163 (.036)   41.3 (4.08)
S. E-Net (L = DᵀD)   swapped      .352 (.107)   60.3 (11.42)
SPRING (L = DᵀD)     unswapped    .062 (.022)   31.4 (2.99)
SPRING (L = DᵀD)     swapped      .378 (.123)   62.9 (13.15)
Cookie dough data: performance

Method        fat    sucrose  flour  water
Step. MLR     .044   1.188    .722   .221
Decision th.  .076   .566     .265   .176
PLS           .151   .583     .375   .105
PCR           .160   .614     .388   .106
Bayes. Reg.   .058   .819     .457   .080
LASSO         .045   .860     .376   .104
grp LASSO     .127   .918     .467   .102
str E-Net     .039   .666     .365   .100
MRCE          .151   .821     .321   .081
SPRING (CV)   .065   .397     .237   .083
SPRING (BIC)  .048   .389     .243   .066

Table: test error.

Brown, P.J., Fearn, T., and Vannucci, M. Bayesian wavelet regression on curves with applications to a spectroscopic calibration problem. JASA, 2001.

How each competitor structures the problem:
- The Lasso induces sparsity on B: no structure along the predictors, no structure between responses.
- The group-Lasso induces sparsity on B group-wise across the responses: no structure along the predictors, (too) strong structure between responses.
- The Structured Elastic-Net induces sparsity on B with a smooth neighborhood prior along the predictors (L = DᵀD): structure along the predictors, no structure between responses.
- MRCE induces sparsity on B and sparsity on R⁻¹: no structure along the predictors, (supposed to add) structure between responses.
- SPRING uses the cGGM to induce structured sparsity on the direct links between the responses and the predictors, plus the smooth neighborhood prior via L = DᵀD.

[Figure: estimated coefficient profiles along the wavelengths (1500–2250) for each method.]
Cookie dough data: parameters

[Figure: estimated B and −Ω̂xy profiles along the wavelengths (1500–2250), and heatmap of the estimated residual covariance R̂ between dry flour, fat, sucrose and water (values from −0.25 to 0.50).]
Cookie dough data: model selection with BIC

[Figure: BIC value along the λ1 path for λ2 ∈ {0.01, 0.1, 1, 10}.]
Quantitative Trait Loci (QTL) study in Colza
Doubled haploid samples
- n = 103 homozygous lines of Brassica napus, obtained by crossing the 'Stellar' and 'Major' cultivars.

Biparental markers
- p = 300 markers with known loci, spread over the 19 chromosomes, with values in {Major, Stellar, Missing} → {1, −1, 0}.

Traits
Consider q = 8 traits, including
- survival traits (% survival in winter): surv92, surv93, surv94, surv97, surv99;
- flowering traits (no vernalization, 4 weeks or 8 weeks of vernalization): flower0, flower4, flower8.
Include genetic linkage information

Genetic distance between markers A1 and A2
Let $r_{12}$ be the recombination rate between $A_1$ and $A_2$; then
$$d_{12} = -\frac{1}{2}\log(1 - 2r_{12}).$$

Linkage disequilibrium as covariance between the markers
In a biparental population with independent recombination events, one has
$$\mathrm{cor}(A_1, A_3) = \rho^{d_{13}} = \rho^{d_{12} + d_{23}}, \quad \text{with } \rho = e^{-2}.$$

Proposition (Including LD information in the model)
The matrix L is given by inverting the covariance matrix, which can be done analytically.
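A small base-R illustration of these two formulas (with made-up recombination rates):

```r
# Map distance from a recombination fraction r, and correlation decay along
# the chromosome: cor(A1, A3) = rho^(d12 + d23).
map_dist <- function(r) -0.5 * log(1 - 2 * r)
rho <- exp(-2)
r12 <- 0.05; r23 <- 0.10              # hypothetical recombination rates
d12 <- map_dist(r12); d23 <- map_dist(r23)
rho^(d12 + d23)                       # implied correlation between A1 and A3
```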
Analytical form of L as a precision matrix
As usually met in AR(1) processes

L is given by the inverse of the correlation matrix of the markers:
$$L = U^\top \Lambda\, U,$$
with
$$U = \begin{pmatrix} 1 & -\rho^{d_{12}} & 0 & \cdots & 0 \\ 0 & 1 & -\rho^{d_{23}} & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & -\rho^{d_{m-1,m}} \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix}, \qquad \Lambda = \mathrm{diag}\Big( (1-\rho^{2d_{12}})^{-1}, \dots, (1-\rho^{2d_{m-1,m}})^{-1},\, 1 \Big).$$
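This analytic inverse can be verified directly (a base-R sketch with made-up distances): building U and Λ as above yields an L with LC = I, where C is the marker correlation matrix:

```r
rho <- exp(-2)
d <- c(0.3, 0.8, 0.5, 1.2)                # adjacent genetic distances (made up)
m <- length(d) + 1
a <- rho^d                                # correlations between adjacent markers
pos <- c(0, cumsum(d))
C <- rho^abs(outer(pos, pos, "-"))        # full marker correlation matrix
U <- diag(m); for (k in 1:(m - 1)) U[k, k + 1] <- -a[k]
Lambda <- diag(c(1 / (1 - a^2), 1))
L <- t(U) %*% Lambda %*% U
stopifnot(all.equal(L %*% C, diag(m)))    # L is exactly the inverse of C
```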
Predictive performance

1. Split the data into training/test sets (n1 = 70, n2 = 33).
2. Adjust each procedure using 5-fold CV for model selection.
3. Compute the test (prediction) error.

Method              surv92  surv93  surv94  surv97  surv99  Mean PE
LASSO               .79     .98     .90     1.02    1.00    .938
group-LASSO         .90     1.00    .92     .99     .92     .946
Enet (no LD)        .87     1.01    .97     1.03    1.03    .983
Gen-Enet (LD)       .75     .98     .89     1.03    1.02    .934
our proposal (LD)   .77     .96     .84     1.00    1.02    .918

Table: survival traits.

Method              flower0  flower4  flower8  Mean PE
LASSO               .58      .53      .74      .616
group-LASSO         .59      .55      .74      .626
Enet (no LD)        .55      .54      .69      .593
Gen-Enet (LD)       .55      .50      .74      .596
our proposal (LD)   .48      .46      .68      .540

Table: flowering traits.
Estimated Residual Covariance R

[Figure: heatmap of the estimated residual correlations (scale −1 to 1) between the eight traits flower0, flower4, flower8, surv92, surv93, surv94, surv97, surv99.]
Estimated Regression Coefficients B

[Figure: estimated coefficients of B along the marker positions (0–1500), colored by outcome (survival and flowering traits).]
Estimated Direct Effects Ωxy

[Figure: estimated direct effects Ω̂xy along the marker positions (0–1500), colored by outcome.]
QTL Mapping (chr. 2, 8, 10), regression coefficients B

[Figure: markers selected in B on chromosomes 2, 8 and 10, displayed by locus and outcome.]
QTL Mapping (chr. 2, 8, 10), direct links Ωxy

[Figure: markers selected in Ω̂xy on chromosomes 2, 8 and 10, displayed by locus and outcome.]
QTL Mapping (all chromosomes), B

[Figure: markers selected in B across the 19 chromosomes, displayed by locus and outcome.]
QTL Mapping (all chromosomes), Ωxy

[Figure: markers selected in Ω̂xy across the 19 chromosomes, displayed by locus and outcome.]
Some concluding remarks

Perspectives
1. Modelling
   - Generalized Fused-Lasso penalty
   - Automatic inference of L
   - Environment? Multiparental aspect, multi-population?
2. Technical algorithmic points
   - active-set strategy in the alternating algorithm
   - smart screening of irrelevant predictors
   - full C++ implementation
3. Applications to regulatory motif discovery
   - Y is a matrix of q microarrays for n genes (the individuals),
   - X is the matrix of motif counts in the promoter of each gene,
   - L is a matrix based upon the edit distance between motifs.
   A first attempt is made in the paper, but we would like to consider large-scale problems (10s/100s of q, 1000s of n, 10,000s of p).
Thanks
Hiring! We are looking for a post-doc with a strong background in optimization and statistics.

Thanks for your patience, and thanks to my co-workers.