High-dimensional statistics: Some progress and challenges ahead

Martin Wainwright
UC Berkeley, Departments of Statistics and EECS

University College London, Master Class: Lecture 2

Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu
High-level overview

Last lecture: least-squares loss and ℓ1-regularization.

The big picture: lots of other estimators have the same basic form:

$$\underbrace{\widehat{\theta}_{\lambda_n}}_{\text{Estimate}} \;\in\; \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} \;+\; \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \Big\}.$$

Past years have witnessed an explosion of results (compressed sensing, covariance estimation, block-sparsity, graphical models, matrix completion, ...)

Question: Is there a common set of underlying principles?
Last lecture: Sparse linear regression

[Figure: observation model y = Xθ* + w, with X an n × p design matrix and θ* supported on S and zero on S^c.]

Set-up: noisy observations y = Xθ* + w with sparse θ*

Estimator: Lasso program

$$\widehat{\theta} \in \arg\min_{\theta} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \theta)^2 + \lambda_n \sum_{j=1}^p |\theta_j| \Big\}$$
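A minimal numerical sketch of this program, assuming scikit-learn is available; its `Lasso` minimizes (1/2n)‖y − Xθ‖₂² + α‖θ‖₁, so `alpha` = λ_n/2 relative to the (1/n) normalization displayed above. Problem sizes and constants are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 400, 5           # high-dimensional regime: p >> n, sparsity s

theta_star = np.zeros(p)
theta_star[:s] = 1.0            # s-sparse truth, support S = {0,...,s-1}
X = rng.normal(size=(n, p))
sigma = 0.5
y = X @ theta_star + sigma * rng.normal(size=n)

# Theory suggests lambda_n of order sigma * sqrt(log(p)/n); constant is illustrative
lam = 2 * sigma * np.sqrt(np.log(p) / n)
fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, y)  # alpha = lam/2 matches (1/n) loss

theta_hat = fit.coef_
print("estimation error:", np.linalg.norm(theta_hat - theta_star))
print("support recovered:", np.nonzero(theta_hat)[0])
```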
Block-structured extension

[Figure: observation model Y = XΘ* + W, with Y and W of size n × r, X of size n × p, and Θ* of size p × r with non-zero rows S and zero rows S^c.]

- signal Θ* is a p × r matrix, partitioned into non-zero rows S and zero rows S^c
- various applications: multiple-view imaging, gene array prediction, graphical model fitting

Row-wise ℓ1/ℓ2-norm:

$$|||\Theta|||_{1,2} = \sum_{j=1}^p \|\Theta_j\|_2$$

More complicated group structures are also possible (Obozinski et al., 2009):

$$|||\Theta|||_{\mathcal{G}} = \sum_{g \in \mathcal{G}} \|\Theta_g\|_2$$
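A quick numpy rendering of these two norms; the helper names are mine, and `group_norm` also accepts overlapping groups.

```python
import numpy as np

def row_l1_l2_norm(Theta):
    """Row-wise l1/l2 norm: sum over rows of each row's Euclidean norm."""
    return np.linalg.norm(Theta, axis=1).sum()

def group_norm(theta, groups):
    """General group norm: sum of l2 norms over index groups."""
    return sum(np.linalg.norm(theta[g]) for g in groups)

rng = np.random.default_rng(1)
Theta = rng.normal(size=(6, 3))
print(row_l1_l2_norm(Theta))

theta = rng.normal(size=9)
groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]
print(group_norm(theta, groups))
```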
Example: Low-rank matrix approximation

[Figure: factorization Θ* = U D V^T, with Θ* of size p1 × p2, U of size p1 × r, D of size r × r, and V^T of size r × p2.]

Set-up: matrix Θ* ∈ R^{p1 × p2} with rank r ≪ min{p1, p2}.

Estimator:

$$\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ \frac{1}{n} \sum_{i=1}^n \big( y_i - \langle\langle X_i, \Theta \rangle\rangle \big)^2 + \lambda_n \sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta) \Big\}$$

Some past work: Fazel, 2001; Srebro et al., 2004; Recht, Fazel & Parrilo, 2007; Bach, 2008; Candes & Recht, 2008; Keshavan et al., 2009; Rohde & Tsybakov, 2010; Recht, 2009; Negahban & W., 2010 ...
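A small sketch of the estimator's ingredients, assuming (as is standard for this observation model) that ⟨⟨X, Θ⟩⟩ denotes the trace inner product trace(XᵀΘ):

```python
import numpy as np

rng = np.random.default_rng(2)
p1, p2, r = 30, 20, 3

# Rank-r ground truth built from random factors
Theta_star = rng.normal(size=(p1, r)) @ rng.normal(size=(r, p2))

def trace_inner(X, Theta):
    """Trace inner product <<X, Theta>> = trace(X^T Theta)."""
    return float(np.sum(X * Theta))

def nuclear_norm(Theta):
    """Sum of singular values: the penalty in the estimator above."""
    return np.linalg.svd(Theta, compute_uv=False).sum()

X = rng.normal(size=(p1, p2))
print(trace_inner(X, Theta_star))
print(nuclear_norm(Theta_star))   # only r = 3 nonzero singular values
```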
Application: Collaborative filtering

[Figure: partially observed ratings matrix, with * marking missing entries:

    4  *  3  ...  *
    3  5  *  ...  2
    5  4  3  ...  3
    2  *  *  ...  1  ]

Universe of p1 individuals and p2 films; observe n ≪ p1 p2 ratings.

(e.g., Srebro, Alon & Jaakkola, 2004)
Security and robustness issues

[Figure: two Amazon product pages recommended together, one labeled "Spiritual guide" and one labeled "Sex manual".]

Breakdown of the Amazon recommendation system, 2002.
Matrix decomposition: Low-rank plus sparse

Matrix Y can be (approximately) decomposed into a sum:

[Figure: Y ≈ U D V^T + sparse matrix, with Y of size p1 × p2, U of size p1 × r, D of size r × r, and V^T of size r × p2.]

$$Y = \underbrace{\Theta^*}_{\text{Low-rank component}} + \underbrace{\Gamma^*}_{\text{Sparse component}}$$

- exact decomposition: initially studied by Chandrasekaran, Sanghavi, Parrilo & Willsky, 2009; subsequent work: Candes et al., 2010; Xu et al., 2010; Hsu et al., 2010; Agarwal et al., 2011
- various applications: robust collaborative filtering, robust PCA, graphical model selection with hidden variables
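One simple way to compute such a decomposition, not taken from the slides, is block coordinate descent on the convex objective ½‖Y − Θ − Γ‖_F² + λ₁|||Θ|||₁ + λ₂‖Γ‖₁; each block update is an exact proximal step (singular-value or entrywise soft-thresholding). A sketch with illustrative parameter choices:

```python
import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft-thresholding: prox of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def low_rank_plus_sparse(Y, lam1, lam2, iters=100):
    """Block coordinate descent on
       0.5*||Y - Theta - Gamma||_F^2 + lam1*|||Theta|||_1 + lam2*||Gamma||_1."""
    Theta = np.zeros_like(Y)
    Gamma = np.zeros_like(Y)
    for _ in range(iters):
        Theta = svt(Y - Gamma, lam1)    # exact minimization over Theta
        Gamma = soft(Y - Theta, lam2)   # exact minimization over Gamma
    return Theta, Gamma

rng = np.random.default_rng(3)
p, r = 50, 3
L = rng.normal(size=(p, r)) @ rng.normal(size=(r, p))                # low-rank
S = (rng.random((p, p)) < 0.05) * rng.normal(scale=5, size=(p, p))   # sparse
Theta, Gamma = low_rank_plus_sparse(L + S, lam1=2.0, lam2=0.5)
print(np.linalg.matrix_rank(Theta), np.mean(Gamma != 0))
```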
Gauss-Markov models with hidden variables

[Figure: observed variables X1, X2, X3, X4, each connected to a hidden variable Z.]

Problems with hidden variables: conditioned on the hidden Z, the vector X = (X1, X2, X3, X4) is Gauss-Markov.

The inverse covariance of X satisfies a sparse plus low-rank decomposition:

$$\begin{pmatrix} 1-\mu & -\mu & -\mu & -\mu \\ -\mu & 1-\mu & -\mu & -\mu \\ -\mu & -\mu & 1-\mu & -\mu \\ -\mu & -\mu & -\mu & 1-\mu \end{pmatrix} = I_{4 \times 4} - \mu \mathbf{1}\mathbf{1}^T.$$

(Chandrasekaran, Parrilo & Willsky, 2010)
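The mechanism here is Gaussian marginalization: if the joint precision matrix over (X, Z) is sparse, the marginal precision of X equals the sparse X-block minus a low-rank Schur-complement correction. A numpy check, with coupling numbers chosen purely for illustration:

```python
import numpy as np

# Joint precision over (X1..X4, Z): the X's are conditionally independent
# given Z (sparse block = identity), each coupled to Z with weight a.
d, a, kzz = 4, 0.3, 2.0
K = np.zeros((d + 1, d + 1))
K[:d, :d] = np.eye(d)        # sparse conditional precision of X given Z
K[:d, d] = K[d, :d] = a      # X-Z couplings
K[d, d] = kzz                # precision of Z

# Marginal precision of X is the Schur complement K_XX - K_XZ K_ZZ^{-1} K_ZX:
# sparse (identity) minus a rank-one term, matching I - mu*11^T with
# mu = a^2 / kzz in this symmetric example.
Sigma_X = np.linalg.inv(K)[:d, :d]
K_marginal = np.linalg.inv(Sigma_X)
mu = a**2 / kzz
print(np.allclose(K_marginal, np.eye(d) - mu * np.ones((d, d))))   # True
```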
Example: Sparse principal components analysis

[Figure: decomposition Σ = ZZ^T + D.]

Set-up: covariance matrix Σ = ZZ^T + D, where the leading eigenspace Z has sparse columns.

Estimator:

$$\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ -\langle\langle \Theta, \widehat{\Sigma} \rangle\rangle + \lambda_n \sum_{(j,k)} |\Theta_{jk}| \Big\}$$

Some past work: Johnstone, 2001; Joliffe et al., 2003; Johnstone & Lu, 2004; Zou et al., 2004; d'Aspremont et al., 2007; Johnstone & Paul, 2008; Amini & Wainwright, 2008
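The constraint set is left implicit above; in the d'Aspremont et al. (2007) SDP relaxation it is {Θ ⪰ 0, trace(Θ) = 1}. A sketch under that assumption, using cvxpy with an SDP-capable solver (e.g. SCS) installed:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
p, lam = 10, 0.2

# Spiked covariance with a sparse leading eigenvector (illustrative)
z = np.zeros(p); z[:3] = 1 / np.sqrt(3)
Sigma_hat = 4 * np.outer(z, z) + np.eye(p)

Theta = cp.Variable((p, p), symmetric=True)
objective = cp.Minimize(-cp.trace(Sigma_hat @ Theta) + lam * cp.sum(cp.abs(Theta)))
constraints = [Theta >> 0, cp.trace(Theta) == 1]   # assumed constraint set
cp.Problem(objective, constraints).solve()

# Leading eigenvector of the solution estimates the sparse principal component
w, V = np.linalg.eigh(Theta.value)
print(np.round(V[:, -1], 2))
```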
Motivation and roadmap

- many results on different high-dimensional models
- all based on estimators of the type

$$\underbrace{\widehat{\theta}_{\lambda_n}}_{\text{Estimate}} \;\in\; \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} \;+\; \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \Big\}.$$

Question: Is there a common set of underlying principles?

Answer: Yes, two essential ingredients.

(I) Restricted strong convexity of the loss function
(II) Decomposability of the regularizer
(I) Role of curvature

1. Curvature controls the difficulty of estimation:

[Figure: loss difference δL plotted against a perturbation ∆ around the minimizer: (a) high curvature, easy to estimate; (b) low curvature, harder.]

2. Captured by a lower bound on the Taylor-series error T_L(∆; θ*):

$$\underbrace{\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla \mathcal{L}(\theta^*),\, \Delta\rangle}_{T_{\mathcal{L}}(\Delta;\,\theta^*)} \;\ge\; \gamma^2\, \|\Delta\|^2 \qquad \text{for all } \Delta \text{ around } \theta^*.$$
High dimensions: no strong convexity!

[Figure: 3-D surface of a high-dimensional least-squares loss, flat along directions in the nullspace of the design matrix.]

When p > n, the Hessian ∇²L(θ; Z_1^n) has a nullspace of dimension p − n.
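This is easy to verify numerically for the least-squares loss, whose Hessian XᵀX/n has rank at most n (a quick sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 200
X = rng.normal(size=(n, p))

H = X.T @ X / n                        # Hessian of the least-squares loss
rank = np.linalg.matrix_rank(H)
print("rank:", rank, "nullspace dim:", p - rank)    # rank n = 50, nullity 150

# Any direction in the nullspace of X leaves the loss unchanged:
_, _, Vt = np.linalg.svd(X)
delta = Vt[-1]                         # right singular vector with sigma = 0
print("curvature along delta:", delta @ H @ delta)  # ~0
```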
Restricted strong convexity

Definition
The loss function L_n satisfies restricted strong convexity (RSC) with respect to the regularizer R if

$$\underbrace{\mathcal{L}_n(\theta^* + \Delta) - \big\{ \mathcal{L}_n(\theta^*) + \langle \nabla \mathcal{L}_n(\theta^*),\, \Delta\rangle \big\}}_{\text{Taylor error } T_{\mathcal{L}}(\Delta;\, \theta^*)} \;\ge\; \underbrace{\gamma_\ell^2\, \|\Delta\|_e^2}_{\text{Lower curvature}} \;-\; \underbrace{\tau_\ell^2\, \mathcal{R}^2(\Delta)}_{\text{Tolerance}}$$

for all ∆ in a suitable neighborhood of θ*.

- ordinary strong convexity: the special case with tolerance τ_ℓ = 0; it does not hold for most loss functions when p > n
- RSC enforces a lower bound on curvature, but only when R²(∆) ≪ ‖∆‖²_e
Least-squares: RSC ≡ restricted eigenvalue

For the least-squares loss L(θ) = (1/2n)‖y − Xθ‖₂²:

$$T_{\mathcal{L}}(\Delta; \theta^*) = \mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) - \langle \nabla \mathcal{L}_n(\theta^*),\, \Delta\rangle = \frac{1}{2n}\, \|X\Delta\|_2^2.$$

Restricted eigenvalue (RE) condition (van de Geer, 2007; Bickel et al., 2009):

$$\frac{\|X\Delta\|_2^2}{2n} \;\ge\; \gamma^2\, \|\Delta\|_2^2 \qquad \text{for all } \Delta \in \mathbb{R}^p \text{ with } \|\Delta\|_1 \le 2\sqrt{s}\, \|\Delta\|_2.$$

(This set contains the usual RE cone: if ‖∆_{S^c}‖₁ ≤ ‖∆_S‖₁ for an s-sparse support S, then ‖∆‖₁ ≤ 2‖∆_S‖₁ ≤ 2√s‖∆‖₂.)

- holds with high probability for various sub-Gaussian designs when n ≳ s log(p/s)
- fairly strong dependency between the covariates is possible
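A Monte Carlo sanity check of the RE condition for a standard Gaussian design is sketched below; it only probes randomly sampled directions in the constraint set, so it illustrates rather than certifies the condition:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, s = 200, 500, 10
X = rng.normal(size=(n, p))

def random_cone_direction():
    """Random unit vector on s coordinates; such vectors satisfy
    ||Delta||_1 <= sqrt(s)*||Delta||_2 <= 2*sqrt(s)*||Delta||_2."""
    delta = np.zeros(p)
    idx = rng.choice(p, size=s, replace=False)
    delta[idx] = rng.normal(size=s)
    return delta / np.linalg.norm(delta)

ratios = []
for _ in range(2000):
    d = random_cone_direction()
    ratios.append(np.linalg.norm(X @ d) ** 2 / (2 * n))  # ||Xd||^2/(2n) vs gamma^2
print("empirical min curvature over sampled directions:", min(ratios))
```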
Restricted strong convexity for GLMs

Generalized linear model linking covariates x ∈ R^p to an output y ∈ R:

$$\mathbb{P}(y \mid x;\, \theta^*) \;\propto\; \exp\big\{ y\, \langle x, \theta^*\rangle - \Phi(\langle x, \theta^*\rangle) \big\}.$$

The Taylor-series expansion involves the random Hessian

$$H(\theta) = \frac{1}{n} \sum_{i=1}^n \Phi''\big(\langle \theta, X_i\rangle\big)\, X_i X_i^T \;\in\; \mathbb{R}^{p \times p}.$$

Proposition (Negahban, W., Ravikumar & Yu, 2010)
For zero-mean sub-Gaussian covariates X_i with covariance Σ and any GLM,

$$T_{\mathcal{L}}(\Delta; \theta^*) \;\ge\; c_3\, \|\Sigma^{1/2}\Delta\|_2 \Big\{ \|\Sigma^{1/2}\Delta\|_2 - c_4\, \kappa(\Sigma) \Big(\frac{\log p}{n}\Big)^{1/2} \|\Delta\|_1 \Big\} \qquad \text{for all } \|\Delta\|_2 \le 1,$$

with probability at least 1 − c₁ exp(−c₂ n). Here κ(Σ) = max_j Σ_jj.
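For concreteness, a sketch of the Taylor error T_L(∆; θ*) for the logistic case Φ(t) = log(1 + eᵗ), taking the loss to be the rescaled negative log-likelihood; sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 50
X = rng.normal(size=(n, p))
theta_star = np.zeros(p); theta_star[:3] = 1.0
y = rng.binomial(1, 1 / (1 + np.exp(-X @ theta_star)))

def loss(theta):
    """Negative log-likelihood / n for the logistic GLM, Phi(t) = log(1+e^t)."""
    t = X @ theta
    return np.mean(np.log1p(np.exp(t)) - y * t)

def grad(theta):
    t = X @ theta
    return X.T @ (1 / (1 + np.exp(-t)) - y) / n

def taylor_error(delta):
    """T_L(Delta; theta*) = L(theta*+Delta) - L(theta*) - <grad L(theta*), Delta>."""
    return loss(theta_star + delta) - loss(theta_star) - grad(theta_star) @ delta

delta = rng.normal(size=p)
delta /= 2 * np.linalg.norm(delta)       # a direction with ||Delta||_2 = 0.5
print(taylor_error(delta))               # nonnegative by convexity; RSC lower-bounds it
```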
(II) Decomposable regularizers

[Figure: a model subspace M and its orthogonal complement M⊥.]

Subspace M: approximation to the model parameters.
Complementary subspace M⊥: undesirable deviations.

Regularizer R decomposes across (M, M⊥) if

$$\mathcal{R}(\alpha + \beta) = \mathcal{R}(\alpha) + \mathcal{R}(\beta) \qquad \text{for all } \alpha \in \mathcal{M} \text{ and } \beta \in \mathcal{M}^\perp.$$

Includes:
- (weighted) ℓ1-norms
- group-sparse norms
- nuclear norm
- sums of decomposable norms
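For the ℓ1 norm with M the vectors supported on a fixed set S, and M⊥ those supported on S^c, decomposability is just additivity of the absolute sum over disjoint supports; a quick check:

```python
import numpy as np

rng = np.random.default_rng(8)
p = 10
S = np.arange(4)                                   # model subspace: support {0,1,2,3}
alpha = np.zeros(p); alpha[S] = rng.normal(size=4)     # alpha in M
beta = np.zeros(p); beta[4:] = rng.normal(size=6)      # beta in M-perp

l1 = lambda v: np.abs(v).sum()
print(np.isclose(l1(alpha + beta), l1(alpha) + l1(beta)))   # True
```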
Main theorem

Estimator:

$$\widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \big\{ \mathcal{L}_n(\theta; Z_1^n) + \lambda_n \mathcal{R}(\theta) \big\},$$

where L satisfies RSC(γ, τ) with respect to the regularizer R.

Theorem (Negahban, Ravikumar, W. & Yu, 2012)
Suppose that θ* ∈ M. For any regularization parameter λ_n ≥ 2R*(∇L(θ*; Z_1^n)), any solution θ̂_{λ_n} satisfies

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|^2 \;\lesssim\; \frac{1}{\gamma^2(\mathcal{L})} \big\{ \lambda_n^2 + \tau^2(\mathcal{L}) \big\}\, \Psi^2(\mathcal{M}).$$

Quantities that control rates:
- curvature in RSC: γ_ℓ
- tolerance in RSC: τ
- dual norm of regularizer: R*(v) := sup_{R(u)≤1} ⟨v, u⟩
- optimal subspace constant: Ψ(M) = sup_{θ∈M\{0}} R(θ)/‖θ‖
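For the Lasso these quantities are explicit: R* is the ℓ∞ norm, and Ψ(M) = √s when M is the set of vectors supported on s fixed coordinates. A numerical illustration:

```python
import numpy as np

rng = np.random.default_rng(9)
p, s = 100, 5

# Dual of the l1 norm is l-infinity: R*(v) = sup_{||u||_1 <= 1} <v, u> = max_j |v_j|
v = rng.normal(size=p)
print(np.max(np.abs(v)))

# Subspace constant for M = {theta supported on s fixed coordinates}:
# Psi(M) = sup ||theta||_1 / ||theta||_2 = sqrt(s), tight at equal magnitudes
theta = np.zeros(p)
theta[:s] = 1.0
print(np.abs(theta).sum() / np.linalg.norm(theta), np.sqrt(s))   # both sqrt(5)
```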
Main theorem (oracle version)

Theorem (Oracle version)
For any regularization parameter λ_n ≥ 2R*(∇L(θ*; Z_1^n)), any solution θ̂_{λ_n} satisfies

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|^2 \;\lesssim\; \underbrace{\frac{(\lambda_n')^2}{\gamma^2(\mathcal{L})}\, \Psi^2(\mathcal{M})}_{\text{Estimation error}} \;+\; \underbrace{\frac{\lambda_n'}{\gamma(\mathcal{L})}\, \mathcal{R}\big(\Pi_{\mathcal{M}^\perp}(\theta^*)\big)}_{\text{Approximation error}},$$

where λ_n' = max{λ_n, τ(L)}. Here θ* need not lie in M: the second term is the price of approximating θ* by the subspace, and the bound can be optimized over the choice of M.
Example: Linear regression (exact sparsity)

Lasso program:

$$\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \Big\}$$

- RSC corresponds to a lower bound on the restricted eigenvalues of XᵀX ∈ R^{p×p}
- for an s-sparse vector, we have ‖θ‖₁ ≤ √s ‖θ‖₂

Corollary
Suppose that the true parameter θ* is exactly s-sparse. Under RSC and with λ_n ≥ 2‖Xᵀw/n‖_∞, any Lasso solution satisfies

$$\|\widehat{\theta} - \theta^*\|_2^2 \;\le\; \frac{4}{\gamma^2}\, s\, \lambda_n^2.$$

Some stochastic instances recover known results:
- compressed sensing: X_ij ~ N(0, 1) and bounded noise ‖w‖₂ ≤ σ√n
- deterministic design: X with bounded columns and w_i ~ N(0, σ²)

$$\Big\|\frac{X^T w}{n}\Big\|_\infty \le \sqrt{\frac{3\sigma^2 \log p}{n}} \;\;\text{w.h.p.} \quad\Longrightarrow\quad \|\widehat{\theta} - \theta^*\|_2^2 \le \frac{12\sigma^2}{\gamma^2}\, \frac{s \log p}{n}.$$

(e.g., Candes & Tao, 2007; Huang & Zhang, 2008; Meinshausen & Yu, 2008; Bickel et al., 2008)
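A simulation sketch of the corollary: set λ_n at the theoretical scale and watch ‖θ̂ − θ*‖₂² track s log p / n across problem sizes. Constants are illustrative; scikit-learn's `alpha` equals λ_n here since its loss carries the same 1/(2n) factor.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(10)
sigma, s = 1.0, 8

for n, p in [(200, 1000), (400, 2000), (800, 4000)]:
    theta_star = np.zeros(p); theta_star[:s] = 1.0
    X = rng.normal(size=(n, p))
    y = X @ theta_star + sigma * rng.normal(size=n)
    lam = 2 * np.sqrt(3 * sigma**2 * np.log(p) / n)   # scale from the corollary
    theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    err = np.sum((theta_hat - theta_star) ** 2)
    print(f"n={n:4d} p={p:5d}  error={err:.4f}  s*log(p)/n={s*np.log(p)/n:.4f}")
```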
Example: Linear regression (weak sparsity)

For some q ∈ [0, 1], say θ* belongs to the ℓq-"ball"

$$\mathbb{B}_q(R_q) := \Big\{ \theta \in \mathbb{R}^p \;\Big|\; \sum_{j=1}^p |\theta_j|^q \le R_q \Big\}.$$

Corollary
For θ* ∈ B_q(R_q), any Lasso solution satisfies (w.h.p.)

$$\|\widehat{\theta} - \theta^*\|_2^2 \;\lesssim\; \sigma^2 R_q \Big( \frac{\log p}{n} \Big)^{1 - q/2}.$$

- rate known to be minimax optimal (Raskutti, W. & Yu, 2011)
Example: Group-structured regularizers

Many applications exhibit sparsity with more structure...

[Figure: coefficient vector partitioned into groups G1, G2, G3.]

- divide the index set {1, 2, ..., p} into groups G = {G1, G2, ..., G_T}
- for parameters ν_t ∈ [1, ∞], define the block-norm

$$\|\theta\|_{\nu, \mathcal{G}} := \sum_{t=1}^T \|\theta_{G_t}\|_{\nu_t}$$

- group/block Lasso program:

$$\widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_{\nu, \mathcal{G}} \Big\}.$$

Different versions studied by various authors (Wright et al., 2005; Tropp et al., 2006; Yuan & Li, 2006; Baraniuk, 2008; Obozinski et al., 2008; Zhao et al., 2008; Bach et al., 2009; Lounici et al., 2009)
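For ν_t ≡ 2 the proximal operator of the block-norm penalty is block soft-thresholding, so proximal gradient gives a minimal solver for non-overlapping groups; a sketch with illustrative parameters:

```python
import numpy as np

def block_soft(theta, groups, tau):
    """Prox of tau * sum_g ||theta_g||_2 (block soft-thresholding)."""
    out = theta.copy()
    for g in groups:
        nrm = np.linalg.norm(theta[g])
        out[g] = 0.0 if nrm <= tau else (1 - tau / nrm) * theta[g]
    return out

def group_lasso(X, y, groups, lam, iters=500):
    """Proximal gradient on (1/2n)||y - X theta||^2 + lam * ||theta||_{2,G}."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2      # 1/L for the smooth part
    theta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n
        theta = block_soft(theta - step * grad, groups, step * lam)
    return theta

rng = np.random.default_rng(11)
n, p, T = 100, 60, 12
groups = np.split(np.arange(p), T)            # 12 groups of size 5
theta_star = np.zeros(p)
theta_star[groups[0]] = 1; theta_star[groups[1]] = -1
X = rng.normal(size=(n, p))
y = X @ theta_star + 0.1 * rng.normal(size=n)
theta_hat = group_lasso(X, y, groups, lam=0.1)
print([t for t, g in enumerate(groups) if np.linalg.norm(theta_hat[g]) > 1e-6])
```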
Convergence rates for general group Lasso

Corollary
Say θ* is supported on s_G groups, and X satisfies RSC. Then for regularization parameter

$$\lambda_n \ge 2 \max_{t = 1, \ldots, T} \Big\| \frac{X^T w}{n} \Big\|_{\nu_t^*}, \qquad \text{where } \frac{1}{\nu_t^*} = 1 - \frac{1}{\nu_t},$$

any solution θ̂_{λ_n} satisfies

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|_2 \;\le\; \frac{2}{\gamma_\ell}\, \Psi_\nu(S_{\mathcal{G}})\, \lambda_n, \qquad \text{where } \Psi_\nu(S_{\mathcal{G}}) = \sup_{\theta \in \mathcal{M}(S_{\mathcal{G}}) \setminus \{0\}} \frac{\|\theta\|_{\nu, \mathcal{G}}}{\|\theta\|_2}.$$

Some special cases, with m ≡ maximum group size:

1. ℓ1/ℓ2 regularization (group norm with ν = 2):

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|_2^2 = O\Big( \frac{s_{\mathcal{G}}\, m}{n} + \frac{s_{\mathcal{G}} \log T}{n} \Big)$$

2. ℓ1/ℓ∞ regularization (group norm with ν = ∞):

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|_2^2 = O\Big( \frac{s_{\mathcal{G}}\, m^2}{n} + \frac{s_{\mathcal{G}} \log T}{n} \Big)$$
Example: Low-rank matrices and nuclear norm

- matrix Θ* ∈ R^{p1 × p2} that is exactly (or approximately) low-rank
- noisy/partial observations of the form

$$y_i = \langle\langle X_i, \Theta^* \rangle\rangle + w_i, \qquad i = 1, \ldots, n, \quad w_i \text{ i.i.d. noise}$$

- estimate by solving the semi-definite program (SDP)

$$\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ \frac{1}{n} \sum_{i=1}^n \big( y_i - \langle\langle X_i, \Theta \rangle\rangle \big)^2 + \lambda_n \underbrace{\sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta)}_{|||\Theta|||_1} \Big\}$$

- various applications: matrix completion, rank-reduced multivariate regression, time-series modeling (vector autoregressions), ...

Some past work: Fazel, 2001; Srebro et al., 2004; Recht et al., 2007; Candes & Recht, 2008; Recht, 2009; Negahban & W., 2010; Rohde & Tsybakov, 2010.
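Like ℓ1, the nuclear norm has a cheap proximal operator (soft-threshold the singular values), so proximal gradient applies directly; below, a sketch for the matrix-completion instance in which each X_i selects a single entry. Parameters are illustrative:

```python
import numpy as np

def svt(M, tau):
    """Prox of tau * |||.|||_1: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(12)
p1, p2, r = 40, 30, 2
Theta_star = rng.normal(size=(p1, r)) @ rng.normal(size=(r, p2))

mask = rng.random((p1, p2)) < 0.5              # observed entries (X_i = e_a e_b^T)
Y = np.where(mask, Theta_star + 0.1 * rng.normal(size=(p1, p2)), 0.0)

lam, step = 0.05, 1.0
Theta = np.zeros((p1, p2))
for _ in range(300):
    grad = mask * (Theta - Y)                  # gradient of 0.5 * sum of observed residuals^2
    Theta = svt(Theta - step * grad, step * lam)

print("rank:", np.linalg.matrix_rank(Theta, tol=1e-3),
      "rel. error:", np.linalg.norm(Theta - Theta_star) / np.linalg.norm(Theta_star))
```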
Rates for (near) low-rank estimation

For a parameter q ∈ [0, 1], define the set of near low-rank matrices:

$$\mathbb{B}_q(R_q) = \Big\{ \Theta^* \in \mathbb{R}^{p_1 \times p_2} \;\Big|\; \sum_{j=1}^{\min\{p_1, p_2\}} |\sigma_j(\Theta^*)|^q \le R_q \Big\}.$$

Corollary (Negahban & W., 2011)
Under the RSC condition, with regularization parameter

$$\lambda_n \ge 16\, \sigma \Big( \sqrt{\tfrac{p_1}{n}} + \sqrt{\tfrac{p_2}{n}} \Big),$$

we have w.h.p.

$$|||\widehat{\Theta} - \Theta^*|||_F^2 \;\le\; c_0\, \frac{R_q}{\gamma_\ell^2} \Big( \frac{\sigma^2\, (p_1 + p_2)}{n} \Big)^{1 - \frac{q}{2}}.$$

Two ingredients behind the proof:
- for a rank-r matrix M,

$$|||M|||_1 = \sum_{j=1}^r \sigma_j(M) \le \sqrt{r}\, \sqrt{\sum_{j=1}^r \sigma_j^2(M)} = \sqrt{r}\, |||M|||_F;$$

- solve the nuclear-norm regularized program with λ_n ≥ (2/n) |||Σ_{i=1}^n w_i X_i|||_2 (operator norm).
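A quick numeric check of the rank-r inequality above (it is Cauchy-Schwarz applied to the r nonzero singular values):

```python
import numpy as np

rng = np.random.default_rng(13)
r = 4
M = rng.normal(size=(50, r)) @ rng.normal(size=(r, 30))   # rank-r matrix
nuc = np.linalg.svd(M, compute_uv=False).sum()            # |||M|||_1
fro = np.linalg.norm(M, 'fro')                            # |||M|||_F
print(nuc, np.sqrt(r) * fro, nuc <= np.sqrt(r) * fro)     # inequality holds
```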
Noisy matrix completion (unrescaled)

[Figure: Frobenius-norm MSE versus raw sample size for q = 0, with curves for d² = 40², 60², 80², 100².]
Noisy matrix completion (rescaled)

[Figure: Frobenius-norm MSE versus rescaled sample size for q = 0, with curves for d² = 40², 60², 80², 100².]
Summary

- unified framework for high-dimensional M-estimators:
  - decomposability of the regularizer R
  - restricted strong convexity of the loss function
- actual rates determined by:
  - noise measured in the dual norm R*
  - subspace constant Ψ in moving from R to the error norm ‖·‖
  - restricted strong convexity constant

Looking ahead to tomorrow:

From parametric to non-parametric problems: using kernel-based methods in high dimensions.
Some papers (www.eecs.berkeley.edu/~wainwrig)

1. S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, December 2012.
2. S. Negahban and M. J. Wainwright (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39(2):1069–1097.
3. S. Negahban and M. J. Wainwright (2012). Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, May 2012.
4. G. Raskutti, M. J. Wainwright and B. Yu (2011). Minimax rates for linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994.
Significance of decomposability

[Figure: the set C of possible error vectors, drawn in coordinates R(Π_{M̄}(∆)) versus R(Π_{M̄⊥}(∆)): (a) C for an exact model is a cone; (b) C for an approximate model is star-shaped.]

Lemma
Suppose that L is convex, and R is decomposable w.r.t. M. Then as long as λ_n ≥ 2R*(∇L(θ*; Z_1^n)), the error vector ∆̂ = θ̂_{λ_n} − θ* belongs to

$$\mathcal{C}(\mathcal{M}, \overline{\mathcal{M}}; \theta^*) := \Big\{ \Delta \in \Omega \;\Big|\; \mathcal{R}\big(\Pi_{\overline{\mathcal{M}}^{\perp}}(\Delta)\big) \le 3\, \mathcal{R}\big(\Pi_{\overline{\mathcal{M}}}(\Delta)\big) + 4\, \mathcal{R}\big(\Pi_{\mathcal{M}^{\perp}}(\theta^*)\big) \Big\}.$$