High-dimensional statistics: Some progress and challenges ahead

Martin Wainwright
UC Berkeley, Departments of Statistics and EECS

University College London, Master Class: Lecture 2

Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu
High-level overview

Last lecture: least-squares loss and ℓ1-regularization.

The big picture: lots of other estimators have the same basic form:

$$\underbrace{\widehat{\theta}_{\lambda_n}}_{\text{Estimate}} \;\in\; \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} \;+\; \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \Big\}.$$

Past years have witnessed an explosion of results (compressed sensing, covariance estimation, block-sparsity, graphical models, matrix completion, ...)

Question: Is there a common set of underlying principles?
Last lecture: Sparse linear regression

[Figure: observation model y = Xθ* + w, with X an n × p design matrix and θ* supported on S and zero on S^c.]

Set-up: noisy observations y = Xθ* + w with sparse θ*

Estimator: Lasso program

$$\widehat{\theta} \in \arg\min_{\theta} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \theta)^2 + \lambda_n \sum_{j=1}^p |\theta_j| \Big\}$$
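A minimal numerical sketch of this program, assuming scikit-learn is available; its `Lasso` minimizes (1/2n)‖y − Xθ‖₂² + α‖θ‖₁, so `alpha` = λ_n/2 relative to the (1/n) normalization displayed above. Problem sizes and constants are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 400, 5           # high-dimensional regime: p >> n, sparsity s

theta_star = np.zeros(p)
theta_star[:s] = 1.0            # s-sparse truth, support S = {0,...,s-1}
X = rng.normal(size=(n, p))
sigma = 0.5
y = X @ theta_star + sigma * rng.normal(size=n)

# Theory suggests lambda_n of order sigma * sqrt(log(p)/n); constant is illustrative
lam = 2 * sigma * np.sqrt(np.log(p) / n)
fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, y)  # alpha = lam/2 matches (1/n) loss

theta_hat = fit.coef_
print("estimation error:", np.linalg.norm(theta_hat - theta_star))
print("support recovered:", np.nonzero(theta_hat)[0])
```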
Block-structured extension

[Figure: observation model Y = XΘ* + W, with Y and W of size n × r, X of size n × p, and Θ* of size p × r with non-zero rows S and zero rows S^c.]

- signal Θ* is a p × r matrix, partitioned into non-zero rows S and zero rows S^c
- various applications: multiple-view imaging, gene array prediction, graphical model fitting

Row-wise ℓ1/ℓ2-norm:

$$|||\Theta|||_{1,2} = \sum_{j=1}^p \|\Theta_j\|_2$$

More complicated group structures are also possible (Obozinski et al., 2009):

$$|||\Theta|||_{\mathcal{G}} = \sum_{g \in \mathcal{G}} \|\Theta_g\|_2$$
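A quick numpy rendering of these two norms; the helper names are mine, and `group_norm` also accepts overlapping groups.

```python
import numpy as np

def row_l1_l2_norm(Theta):
    """Row-wise l1/l2 norm: sum over rows of each row's Euclidean norm."""
    return np.linalg.norm(Theta, axis=1).sum()

def group_norm(theta, groups):
    """General group norm: sum of l2 norms over index groups."""
    return sum(np.linalg.norm(theta[g]) for g in groups)

rng = np.random.default_rng(1)
Theta = rng.normal(size=(6, 3))
print(row_l1_l2_norm(Theta))

theta = rng.normal(size=9)
groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]
print(group_norm(theta, groups))
```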
Example: Low-rank matrix approximation

[Figure: factorization Θ* = U D V^T, with Θ* of size p1 × p2, U of size p1 × r, D of size r × r, and V^T of size r × p2.]

Set-up: matrix Θ* ∈ R^{p1 × p2} with rank r ≪ min{p1, p2}.

Estimator:

$$\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ \frac{1}{n} \sum_{i=1}^n \big( y_i - \langle\langle X_i, \Theta \rangle\rangle \big)^2 + \lambda_n \sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta) \Big\}$$

Some past work: Fazel, 2001; Srebro et al., 2004; Recht, Fazel & Parrilo, 2007; Bach, 2008; Candes & Recht, 2008; Keshavan et al., 2009; Rohde & Tsybakov, 2010; Recht, 2009; Negahban & W., 2010 ...
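A small sketch of the estimator's ingredients, assuming (as is standard for this observation model) that ⟨⟨X, Θ⟩⟩ denotes the trace inner product trace(XᵀΘ):

```python
import numpy as np

rng = np.random.default_rng(2)
p1, p2, r = 30, 20, 3

# Rank-r ground truth built from random factors
Theta_star = rng.normal(size=(p1, r)) @ rng.normal(size=(r, p2))

def trace_inner(X, Theta):
    """Trace inner product <<X, Theta>> = trace(X^T Theta)."""
    return float(np.sum(X * Theta))

def nuclear_norm(Theta):
    """Sum of singular values: the penalty in the estimator above."""
    return np.linalg.svd(Theta, compute_uv=False).sum()

X = rng.normal(size=(p1, p2))
print(trace_inner(X, Theta_star))
print(nuclear_norm(Theta_star))   # only r = 3 nonzero singular values
```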
Application: Collaborative filtering

[Figure: partially observed ratings matrix, with * marking missing entries:

    4  *  3  ...  *
    3  5  *  ...  2
    5  4  3  ...  3
    2  *  *  ...  1  ]

Universe of p1 individuals and p2 films; observe n ≪ p1 p2 ratings.

(e.g., Srebro, Alon & Jaakkola, 2004)
Security and robustness issues

[Figure: two Amazon product pages recommended together, one labeled "Spiritual guide" and one labeled "Sex manual".]

Breakdown of the Amazon recommendation system, 2002.
Matrix decomposition: Low-rank plus sparse

Matrix Y can be (approximately) decomposed into a sum:

[Figure: Y ≈ U D V^T + sparse matrix, with Y of size p1 × p2, U of size p1 × r, D of size r × r, and V^T of size r × p2.]

$$Y = \underbrace{\Theta^*}_{\text{Low-rank component}} + \underbrace{\Gamma^*}_{\text{Sparse component}}$$

- exact decomposition: initially studied by Chandrasekaran, Sanghavi, Parrilo & Willsky, 2009; subsequent work: Candes et al., 2010; Xu et al., 2010; Hsu et al., 2010; Agarwal et al., 2011
- various applications: robust collaborative filtering, robust PCA, graphical model selection with hidden variables
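One simple way to compute such a decomposition, not taken from the slides, is block coordinate descent on the convex objective ½‖Y − Θ − Γ‖_F² + λ₁|||Θ|||₁ + λ₂‖Γ‖₁; each block update is an exact proximal step (singular-value or entrywise soft-thresholding). A sketch with illustrative parameter choices:

```python
import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft-thresholding: prox of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def low_rank_plus_sparse(Y, lam1, lam2, iters=100):
    """Block coordinate descent on
       0.5*||Y - Theta - Gamma||_F^2 + lam1*|||Theta|||_1 + lam2*||Gamma||_1."""
    Theta = np.zeros_like(Y)
    Gamma = np.zeros_like(Y)
    for _ in range(iters):
        Theta = svt(Y - Gamma, lam1)    # exact minimization over Theta
        Gamma = soft(Y - Theta, lam2)   # exact minimization over Gamma
    return Theta, Gamma

rng = np.random.default_rng(3)
p, r = 50, 3
L = rng.normal(size=(p, r)) @ rng.normal(size=(r, p))                # low-rank
S = (rng.random((p, p)) < 0.05) * rng.normal(scale=5, size=(p, p))   # sparse
Theta, Gamma = low_rank_plus_sparse(L + S, lam1=2.0, lam2=0.5)
print(np.linalg.matrix_rank(Theta), np.mean(Gamma != 0))
```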
Gauss-Markov models with hidden variables

[Figure: observed variables X1, X2, X3, X4, each connected to a hidden variable Z.]

Problems with hidden variables: conditioned on the hidden Z, the vector X = (X1, X2, X3, X4) is Gauss-Markov.

The inverse covariance of X satisfies a sparse plus low-rank decomposition:

$$\begin{pmatrix} 1-\mu & -\mu & -\mu & -\mu \\ -\mu & 1-\mu & -\mu & -\mu \\ -\mu & -\mu & 1-\mu & -\mu \\ -\mu & -\mu & -\mu & 1-\mu \end{pmatrix} = I_{4 \times 4} - \mu \mathbf{1}\mathbf{1}^T.$$

(Chandrasekaran, Parrilo & Willsky, 2010)
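The mechanism here is Gaussian marginalization: if the joint precision matrix over (X, Z) is sparse, the marginal precision of X equals the sparse X-block minus a low-rank Schur-complement correction. A numpy check, with coupling numbers chosen purely for illustration:

```python
import numpy as np

# Joint precision over (X1..X4, Z): the X's are conditionally independent
# given Z (sparse block = identity), each coupled to Z with weight a.
d, a, kzz = 4, 0.3, 2.0
K = np.zeros((d + 1, d + 1))
K[:d, :d] = np.eye(d)        # sparse conditional precision of X given Z
K[:d, d] = K[d, :d] = a      # X-Z couplings
K[d, d] = kzz                # precision of Z

# Marginal precision of X is the Schur complement K_XX - K_XZ K_ZZ^{-1} K_ZX:
# sparse (identity) minus a rank-one term, matching I - mu*11^T with
# mu = a^2 / kzz in this symmetric example.
Sigma_X = np.linalg.inv(K)[:d, :d]
K_marginal = np.linalg.inv(Sigma_X)
mu = a**2 / kzz
print(np.allclose(K_marginal, np.eye(d) - mu * np.ones((d, d))))   # True
```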
Example: Sparse principal components analysis

[Figure: decomposition Σ = ZZ^T + D.]

Set-up: covariance matrix Σ = ZZ^T + D, where the leading eigenspace Z has sparse columns.

Estimator:

$$\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ -\langle\langle \Theta, \widehat{\Sigma} \rangle\rangle + \lambda_n \sum_{(j,k)} |\Theta_{jk}| \Big\}$$

Some past work: Johnstone, 2001; Joliffe et al., 2003; Johnstone & Lu, 2004; Zou et al., 2004; d'Aspremont et al., 2007; Johnstone & Paul, 2008; Amini & Wainwright, 2008
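The constraint set is left implicit above; in the d'Aspremont et al. (2007) SDP relaxation it is {Θ ⪰ 0, trace(Θ) = 1}. A sketch under that assumption, using cvxpy with an SDP-capable solver (e.g. SCS) installed:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
p, lam = 10, 0.2

# Spiked covariance with a sparse leading eigenvector (illustrative)
z = np.zeros(p); z[:3] = 1 / np.sqrt(3)
Sigma_hat = 4 * np.outer(z, z) + np.eye(p)

Theta = cp.Variable((p, p), symmetric=True)
objective = cp.Minimize(-cp.trace(Sigma_hat @ Theta) + lam * cp.sum(cp.abs(Theta)))
constraints = [Theta >> 0, cp.trace(Theta) == 1]   # assumed constraint set
cp.Problem(objective, constraints).solve()

# Leading eigenvector of the solution estimates the sparse principal component
w, V = np.linalg.eigh(Theta.value)
print(np.round(V[:, -1], 2))
```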
Motivation and roadmap

- many results on different high-dimensional models
- all based on estimators of the type

$$\underbrace{\widehat{\theta}_{\lambda_n}}_{\text{Estimate}} \;\in\; \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{Loss function}} \;+\; \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{Regularizer}} \Big\}.$$

Question: Is there a common set of underlying principles?

Answer: Yes, two essential ingredients.

(I) Restricted strong convexity of the loss function
(II) Decomposability of the regularizer
(I) Role of curvature

1. Curvature controls the difficulty of estimation:

[Figure: loss difference δL plotted against a perturbation ∆ around the minimizer: (a) high curvature, easy to estimate; (b) low curvature, harder.]

2. Captured by a lower bound on the Taylor-series error T_L(∆; θ*):

$$\underbrace{\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla \mathcal{L}(\theta^*),\, \Delta\rangle}_{T_{\mathcal{L}}(\Delta;\,\theta^*)} \;\ge\; \gamma^2\, \|\Delta\|^2 \qquad \text{for all } \Delta \text{ around } \theta^*.$$
High dimensions: no strong convexity!

[Figure: 3-D surface of a high-dimensional least-squares loss, flat along directions in the nullspace of the design matrix.]

When p > n, the Hessian ∇²L(θ; Z_1^n) has a nullspace of dimension p − n.
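This is easy to verify numerically for the least-squares loss, whose Hessian XᵀX/n has rank at most n (a quick sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 200
X = rng.normal(size=(n, p))

H = X.T @ X / n                        # Hessian of the least-squares loss
rank = np.linalg.matrix_rank(H)
print("rank:", rank, "nullspace dim:", p - rank)    # rank n = 50, nullity 150

# Any direction in the nullspace of X leaves the loss unchanged:
_, _, Vt = np.linalg.svd(X)
delta = Vt[-1]                         # right singular vector with sigma = 0
print("curvature along delta:", delta @ H @ delta)  # ~0
```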
Restricted strong convexity

Definition
The loss function L_n satisfies restricted strong convexity (RSC) with respect to the regularizer R if

$$\underbrace{\mathcal{L}_n(\theta^* + \Delta) - \big\{ \mathcal{L}_n(\theta^*) + \langle \nabla \mathcal{L}_n(\theta^*),\, \Delta\rangle \big\}}_{\text{Taylor error } T_{\mathcal{L}}(\Delta;\, \theta^*)} \;\ge\; \underbrace{\gamma_\ell^2\, \|\Delta\|_e^2}_{\text{Lower curvature}} \;-\; \underbrace{\tau_\ell^2\, \mathcal{R}^2(\Delta)}_{\text{Tolerance}}$$

for all ∆ in a suitable neighborhood of θ*.

- ordinary strong convexity: the special case with tolerance τ_ℓ = 0; it does not hold for most loss functions when p > n
- RSC enforces a lower bound on curvature, but only when R²(∆) ≪ ‖∆‖²_e
Least-squares: RSC ≡ restricted eigenvalue

For the least-squares loss L(θ) = (1/2n)‖y − Xθ‖₂²:

$$T_{\mathcal{L}}(\Delta; \theta^*) = \mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) - \langle \nabla \mathcal{L}_n(\theta^*),\, \Delta\rangle = \frac{1}{2n}\, \|X\Delta\|_2^2.$$

Restricted eigenvalue (RE) condition (van de Geer, 2007; Bickel et al., 2009):

$$\frac{\|X\Delta\|_2^2}{2n} \;\ge\; \gamma^2\, \|\Delta\|_2^2 \qquad \text{for all } \Delta \in \mathbb{R}^p \text{ with } \|\Delta\|_1 \le 2\sqrt{s}\, \|\Delta\|_2.$$

(This set contains the usual RE cone: if ‖∆_{S^c}‖₁ ≤ ‖∆_S‖₁ for an s-sparse support S, then ‖∆‖₁ ≤ 2‖∆_S‖₁ ≤ 2√s‖∆‖₂.)

- holds with high probability for various sub-Gaussian designs when n ≳ s log(p/s)
- fairly strong dependency between the covariates is possible
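A Monte Carlo sanity check of the RE condition for a standard Gaussian design is sketched below; it only probes randomly sampled directions in the constraint set, so it illustrates rather than certifies the condition:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, s = 200, 500, 10
X = rng.normal(size=(n, p))

def random_cone_direction():
    """Random unit vector on s coordinates; such vectors satisfy
    ||Delta||_1 <= sqrt(s)*||Delta||_2 <= 2*sqrt(s)*||Delta||_2."""
    delta = np.zeros(p)
    idx = rng.choice(p, size=s, replace=False)
    delta[idx] = rng.normal(size=s)
    return delta / np.linalg.norm(delta)

ratios = []
for _ in range(2000):
    d = random_cone_direction()
    ratios.append(np.linalg.norm(X @ d) ** 2 / (2 * n))  # ||Xd||^2/(2n) vs gamma^2
print("empirical min curvature over sampled directions:", min(ratios))
```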
Restricted strong convexity for GLMs

Generalized linear model linking covariates x ∈ R^p to an output y ∈ R:

$$\mathbb{P}(y \mid x;\, \theta^*) \;\propto\; \exp\big\{ y\, \langle x, \theta^*\rangle - \Phi(\langle x, \theta^*\rangle) \big\}.$$

The Taylor-series expansion involves the random Hessian

$$H(\theta) = \frac{1}{n} \sum_{i=1}^n \Phi''\big(\langle \theta, X_i\rangle\big)\, X_i X_i^T \;\in\; \mathbb{R}^{p \times p}.$$

Proposition (Negahban, W., Ravikumar & Yu, 2010)
For zero-mean sub-Gaussian covariates X_i with covariance Σ and any GLM,

$$T_{\mathcal{L}}(\Delta; \theta^*) \;\ge\; c_3\, \|\Sigma^{1/2}\Delta\|_2 \Big\{ \|\Sigma^{1/2}\Delta\|_2 - c_4\, \kappa(\Sigma) \Big(\frac{\log p}{n}\Big)^{1/2} \|\Delta\|_1 \Big\} \qquad \text{for all } \|\Delta\|_2 \le 1,$$

with probability at least 1 − c₁ exp(−c₂ n). Here κ(Σ) = max_j Σ_jj.
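For concreteness, a sketch of the Taylor error T_L(∆; θ*) for the logistic case Φ(t) = log(1 + eᵗ), taking the loss to be the rescaled negative log-likelihood; sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 50
X = rng.normal(size=(n, p))
theta_star = np.zeros(p); theta_star[:3] = 1.0
y = rng.binomial(1, 1 / (1 + np.exp(-X @ theta_star)))

def loss(theta):
    """Negative log-likelihood / n for the logistic GLM, Phi(t) = log(1+e^t)."""
    t = X @ theta
    return np.mean(np.log1p(np.exp(t)) - y * t)

def grad(theta):
    t = X @ theta
    return X.T @ (1 / (1 + np.exp(-t)) - y) / n

def taylor_error(delta):
    """T_L(Delta; theta*) = L(theta*+Delta) - L(theta*) - <grad L(theta*), Delta>."""
    return loss(theta_star + delta) - loss(theta_star) - grad(theta_star) @ delta

delta = rng.normal(size=p)
delta /= 2 * np.linalg.norm(delta)       # a direction with ||Delta||_2 = 0.5
print(taylor_error(delta))               # nonnegative by convexity; RSC lower-bounds it
```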
(II) Decomposable regularizers

[Figure: a model subspace M and its orthogonal complement M⊥.]

Subspace M: approximation to the model parameters.
Complementary subspace M⊥: undesirable deviations.

Regularizer R decomposes across (M, M⊥) if

$$\mathcal{R}(\alpha + \beta) = \mathcal{R}(\alpha) + \mathcal{R}(\beta) \qquad \text{for all } \alpha \in \mathcal{M} \text{ and } \beta \in \mathcal{M}^\perp.$$

Includes:
- (weighted) ℓ1-norms
- group-sparse norms
- nuclear norm
- sums of decomposable norms
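For the ℓ1 norm with M the vectors supported on a fixed set S, and M⊥ those supported on S^c, decomposability is just additivity of the absolute sum over disjoint supports; a quick check:

```python
import numpy as np

rng = np.random.default_rng(8)
p = 10
S = np.arange(4)                                   # model subspace: support {0,1,2,3}
alpha = np.zeros(p); alpha[S] = rng.normal(size=4)     # alpha in M
beta = np.zeros(p); beta[4:] = rng.normal(size=6)      # beta in M-perp

l1 = lambda v: np.abs(v).sum()
print(np.isclose(l1(alpha + beta), l1(alpha) + l1(beta)))   # True
```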
Main theorem

Estimator:

$$\widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \big\{ \mathcal{L}_n(\theta; Z_1^n) + \lambda_n \mathcal{R}(\theta) \big\},$$

where L satisfies RSC(γ, τ) with respect to the regularizer R.

Theorem (Negahban, Ravikumar, W. & Yu, 2012)
Suppose that θ* ∈ M. For any regularization parameter λ_n ≥ 2R*(∇L(θ*; Z_1^n)), any solution θ̂_{λ_n} satisfies

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|^2 \;\lesssim\; \frac{1}{\gamma^2(\mathcal{L})} \big\{ \lambda_n^2 + \tau^2(\mathcal{L}) \big\}\, \Psi^2(\mathcal{M}).$$

Quantities that control rates:
- curvature in RSC: γ_ℓ
- tolerance in RSC: τ
- dual norm of regularizer: R*(v) := sup_{R(u)≤1} ⟨v, u⟩
- optimal subspace constant: Ψ(M) = sup_{θ∈M\{0}} R(θ)/‖θ‖
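For the Lasso these quantities are explicit: R* is the ℓ∞ norm, and Ψ(M) = √s when M is the set of vectors supported on s fixed coordinates. A numerical illustration:

```python
import numpy as np

rng = np.random.default_rng(9)
p, s = 100, 5

# Dual of the l1 norm is l-infinity: R*(v) = sup_{||u||_1 <= 1} <v, u> = max_j |v_j|
v = rng.normal(size=p)
print(np.max(np.abs(v)))

# Subspace constant for M = {theta supported on s fixed coordinates}:
# Psi(M) = sup ||theta||_1 / ||theta||_2 = sqrt(s), tight at equal magnitudes
theta = np.zeros(p)
theta[:s] = 1.0
print(np.abs(theta).sum() / np.linalg.norm(theta), np.sqrt(s))   # both sqrt(5)
```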
Main theorem (oracle version)

Theorem (Oracle version)
For any regularization parameter λ_n ≥ 2R*(∇L(θ*; Z_1^n)), any solution θ̂_{λ_n} satisfies

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|^2 \;\lesssim\; \underbrace{\frac{(\lambda_n')^2}{\gamma^2(\mathcal{L})}\, \Psi^2(\mathcal{M})}_{\text{Estimation error}} \;+\; \underbrace{\frac{\lambda_n'}{\gamma(\mathcal{L})}\, \mathcal{R}\big(\Pi_{\mathcal{M}^\perp}(\theta^*)\big)}_{\text{Approximation error}},$$

where λ_n' = max{λ_n, τ(L)}. Here θ* need not lie in M: the second term is the price of approximating θ* by the subspace, and the bound can be optimized over the choice of M.
Example: Linear regression (exact sparsity)

Lasso program:

$$\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \Big\}$$

- RSC corresponds to a lower bound on the restricted eigenvalues of XᵀX ∈ R^{p×p}
- for an s-sparse vector, we have ‖θ‖₁ ≤ √s ‖θ‖₂

Corollary
Suppose that the true parameter θ* is exactly s-sparse. Under RSC and with λ_n ≥ 2‖Xᵀw/n‖_∞, any Lasso solution satisfies

$$\|\widehat{\theta} - \theta^*\|_2^2 \;\le\; \frac{4}{\gamma^2}\, s\, \lambda_n^2.$$

Some stochastic instances recover known results:
- compressed sensing: X_ij ~ N(0, 1) and bounded noise ‖w‖₂ ≤ σ√n
- deterministic design: X with bounded columns and w_i ~ N(0, σ²)

$$\Big\|\frac{X^T w}{n}\Big\|_\infty \le \sqrt{\frac{3\sigma^2 \log p}{n}} \;\;\text{w.h.p.} \quad\Longrightarrow\quad \|\widehat{\theta} - \theta^*\|_2^2 \le \frac{12\sigma^2}{\gamma^2}\, \frac{s \log p}{n}.$$

(e.g., Candes & Tao, 2007; Huang & Zhang, 2008; Meinshausen & Yu, 2008; Bickel et al., 2008)
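A simulation sketch of the corollary: set λ_n at the theoretical scale and watch ‖θ̂ − θ*‖₂² track s log p / n across problem sizes. Constants are illustrative; scikit-learn's `alpha` equals λ_n here since its loss carries the same 1/(2n) factor.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(10)
sigma, s = 1.0, 8

for n, p in [(200, 1000), (400, 2000), (800, 4000)]:
    theta_star = np.zeros(p); theta_star[:s] = 1.0
    X = rng.normal(size=(n, p))
    y = X @ theta_star + sigma * rng.normal(size=n)
    lam = 2 * np.sqrt(3 * sigma**2 * np.log(p) / n)   # scale from the corollary
    theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    err = np.sum((theta_hat - theta_star) ** 2)
    print(f"n={n:4d} p={p:5d}  error={err:.4f}  s*log(p)/n={s*np.log(p)/n:.4f}")
```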
Example: Linear regression (weak sparsity)

For some q ∈ [0, 1], say θ* belongs to the ℓq-"ball"

$$\mathbb{B}_q(R_q) := \Big\{ \theta \in \mathbb{R}^p \;\Big|\; \sum_{j=1}^p |\theta_j|^q \le R_q \Big\}.$$

Corollary
For θ* ∈ B_q(R_q), any Lasso solution satisfies (w.h.p.)

$$\|\widehat{\theta} - \theta^*\|_2^2 \;\lesssim\; \sigma^2 R_q \Big( \frac{\log p}{n} \Big)^{1 - q/2}.$$

- rate known to be minimax optimal (Raskutti, W. & Yu, 2011)
Example: Group-structured regularizers

Many applications exhibit sparsity with more structure...

[Figure: coefficient vector partitioned into groups G1, G2, G3.]

- divide the index set {1, 2, ..., p} into groups G = {G1, G2, ..., G_T}
- for parameters ν_t ∈ [1, ∞], define the block-norm

$$\|\theta\|_{\nu, \mathcal{G}} := \sum_{t=1}^T \|\theta_{G_t}\|_{\nu_t}$$

- group/block Lasso program:

$$\widehat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_{\nu, \mathcal{G}} \Big\}.$$

Different versions studied by various authors (Wright et al., 2005; Tropp et al., 2006; Yuan & Li, 2006; Baraniuk, 2008; Obozinski et al., 2008; Zhao et al., 2008; Bach et al., 2009; Lounici et al., 2009)
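For ν_t ≡ 2 the proximal operator of the block-norm penalty is block soft-thresholding, so proximal gradient gives a minimal solver for non-overlapping groups; a sketch with illustrative parameters:

```python
import numpy as np

def block_soft(theta, groups, tau):
    """Prox of tau * sum_g ||theta_g||_2 (block soft-thresholding)."""
    out = theta.copy()
    for g in groups:
        nrm = np.linalg.norm(theta[g])
        out[g] = 0.0 if nrm <= tau else (1 - tau / nrm) * theta[g]
    return out

def group_lasso(X, y, groups, lam, iters=500):
    """Proximal gradient on (1/2n)||y - X theta||^2 + lam * ||theta||_{2,G}."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2      # 1/L for the smooth part
    theta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n
        theta = block_soft(theta - step * grad, groups, step * lam)
    return theta

rng = np.random.default_rng(11)
n, p, T = 100, 60, 12
groups = np.split(np.arange(p), T)            # 12 groups of size 5
theta_star = np.zeros(p)
theta_star[groups[0]] = 1; theta_star[groups[1]] = -1
X = rng.normal(size=(n, p))
y = X @ theta_star + 0.1 * rng.normal(size=n)
theta_hat = group_lasso(X, y, groups, lam=0.1)
print([t for t, g in enumerate(groups) if np.linalg.norm(theta_hat[g]) > 1e-6])
```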
Convergence rates for general group Lasso

Corollary
Say θ* is supported on s_G groups, and X satisfies RSC. Then for regularization parameter

$$\lambda_n \ge 2 \max_{t = 1, \ldots, T} \Big\| \frac{X^T w}{n} \Big\|_{\nu_t^*}, \qquad \text{where } \frac{1}{\nu_t^*} = 1 - \frac{1}{\nu_t},$$

any solution θ̂_{λ_n} satisfies

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|_2 \;\le\; \frac{2}{\gamma_\ell}\, \Psi_\nu(S_{\mathcal{G}})\, \lambda_n, \qquad \text{where } \Psi_\nu(S_{\mathcal{G}}) = \sup_{\theta \in \mathcal{M}(S_{\mathcal{G}}) \setminus \{0\}} \frac{\|\theta\|_{\nu, \mathcal{G}}}{\|\theta\|_2}.$$

Some special cases, with m ≡ maximum group size:

1. ℓ1/ℓ2 regularization (group norm with ν = 2):

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|_2^2 = O\Big( \frac{s_{\mathcal{G}}\, m}{n} + \frac{s_{\mathcal{G}} \log T}{n} \Big)$$

2. ℓ1/ℓ∞ regularization (group norm with ν = ∞):

$$\|\widehat{\theta}_{\lambda_n} - \theta^*\|_2^2 = O\Big( \frac{s_{\mathcal{G}}\, m^2}{n} + \frac{s_{\mathcal{G}} \log T}{n} \Big)$$
Example: Low-rank matrices and nuclear norm

- matrix Θ* ∈ R^{p1 × p2} that is exactly (or approximately) low-rank
- noisy/partial observations of the form

$$y_i = \langle\langle X_i, \Theta^* \rangle\rangle + w_i, \qquad i = 1, \ldots, n, \quad w_i \text{ i.i.d. noise}$$

- estimate by solving the semi-definite program (SDP)

$$\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ \frac{1}{n} \sum_{i=1}^n \big( y_i - \langle\langle X_i, \Theta \rangle\rangle \big)^2 + \lambda_n \underbrace{\sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta)}_{|||\Theta|||_1} \Big\}$$

- various applications: matrix completion, rank-reduced multivariate regression, time-series modeling (vector autoregressions), ...

Some past work: Fazel, 2001; Srebro et al., 2004; Recht et al., 2007; Candes & Recht, 2008; Recht, 2009; Negahban & W., 2010; Rohde & Tsybakov, 2010.
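Like ℓ1, the nuclear norm has a cheap proximal operator (soft-threshold the singular values), so proximal gradient applies directly; below, a sketch for the matrix-completion instance in which each X_i selects a single entry. Parameters are illustrative:

```python
import numpy as np

def svt(M, tau):
    """Prox of tau * |||.|||_1: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(12)
p1, p2, r = 40, 30, 2
Theta_star = rng.normal(size=(p1, r)) @ rng.normal(size=(r, p2))

mask = rng.random((p1, p2)) < 0.5              # observed entries (X_i = e_a e_b^T)
Y = np.where(mask, Theta_star + 0.1 * rng.normal(size=(p1, p2)), 0.0)

lam, step = 0.05, 1.0
Theta = np.zeros((p1, p2))
for _ in range(300):
    grad = mask * (Theta - Y)                  # gradient of 0.5 * sum of observed residuals^2
    Theta = svt(Theta - step * grad, step * lam)

print("rank:", np.linalg.matrix_rank(Theta, tol=1e-3),
      "rel. error:", np.linalg.norm(Theta - Theta_star) / np.linalg.norm(Theta_star))
```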
Rates for (near) low-rank estimation

For a parameter q ∈ [0, 1], define the set of near low-rank matrices:

$$\mathbb{B}_q(R_q) = \Big\{ \Theta^* \in \mathbb{R}^{p_1 \times p_2} \;\Big|\; \sum_{j=1}^{\min\{p_1, p_2\}} |\sigma_j(\Theta^*)|^q \le R_q \Big\}.$$

Corollary (Negahban & W., 2011)
Under the RSC condition, with regularization parameter

$$\lambda_n \ge 16\, \sigma \Big( \sqrt{\tfrac{p_1}{n}} + \sqrt{\tfrac{p_2}{n}} \Big),$$

we have w.h.p.

$$|||\widehat{\Theta} - \Theta^*|||_F^2 \;\le\; c_0\, \frac{R_q}{\gamma_\ell^2} \Big( \frac{\sigma^2\, (p_1 + p_2)}{n} \Big)^{1 - \frac{q}{2}}.$$

Two ingredients behind the proof:
- for a rank-r matrix M,

$$|||M|||_1 = \sum_{j=1}^r \sigma_j(M) \le \sqrt{r}\, \sqrt{\sum_{j=1}^r \sigma_j^2(M)} = \sqrt{r}\, |||M|||_F;$$

- solve the nuclear-norm regularized program with λ_n ≥ (2/n) |||Σ_{i=1}^n w_i X_i|||_2 (operator norm).
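A quick numeric check of the rank-r inequality above (it is Cauchy-Schwarz applied to the r nonzero singular values):

```python
import numpy as np

rng = np.random.default_rng(13)
r = 4
M = rng.normal(size=(50, r)) @ rng.normal(size=(r, 30))   # rank-r matrix
nuc = np.linalg.svd(M, compute_uv=False).sum()            # |||M|||_1
fro = np.linalg.norm(M, 'fro')                            # |||M|||_F
print(nuc, np.sqrt(r) * fro, nuc <= np.sqrt(r) * fro)     # inequality holds
```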
Noisy matrix completion (unrescaled)

[Figure: Frobenius-norm MSE versus raw sample size for q = 0, with curves for d² = 40², 60², 80², 100².]
Noisy matrix completion (rescaled)

[Figure: Frobenius-norm MSE versus rescaled sample size for q = 0, with curves for d² = 40², 60², 80², 100².]
Summary

- unified framework for high-dimensional M-estimators:
  - decomposability of the regularizer R
  - restricted strong convexity of the loss function
- actual rates determined by:
  - noise measured in the dual norm R*
  - subspace constant Ψ in moving from R to the error norm ‖·‖
  - restricted strong convexity constant

Looking ahead to tomorrow:

From parametric to non-parametric problems: using kernel-based methods in high dimensions.
Some papers (www.eecs.berkeley.edu/~wainwrig)

1. S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, December 2012.
2. S. Negahban and M. J. Wainwright (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39(2):1069–1097.
3. S. Negahban and M. J. Wainwright (2012). Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, May 2012.
4. G. Raskutti, M. J. Wainwright and B. Yu (2011). Minimax rates for linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994.
Significance of decomposability

[Figure: the set C of possible error vectors, drawn in coordinates R(Π_{M̄}(∆)) versus R(Π_{M̄⊥}(∆)): (a) C for an exact model is a cone; (b) C for an approximate model is star-shaped.]

Lemma
Suppose that L is convex, and R is decomposable w.r.t. M. Then as long as λ_n ≥ 2R*(∇L(θ*; Z_1^n)), the error vector ∆̂ = θ̂_{λ_n} − θ* belongs to

$$\mathcal{C}(\mathcal{M}, \overline{\mathcal{M}}; \theta^*) := \Big\{ \Delta \in \Omega \;\Big|\; \mathcal{R}\big(\Pi_{\overline{\mathcal{M}}^{\perp}}(\Delta)\big) \le 3\, \mathcal{R}\big(\Pi_{\overline{\mathcal{M}}}(\Delta)\big) + 4\, \mathcal{R}\big(\Pi_{\mathcal{M}^{\perp}}(\theta^*)\big) \Big\}.$$