Seymour Other possible talks Introduction Changing the sign of the estimator A lasso trace (profile) Standardization and sign changes? Towards nonunique estimation Geometry of r(X ) < p + 1 LARS References
Another Look at the Lasso
Ronald Christensen
Department of Math and Statistics, University of New Mexico
October 15, 2015
Seymour
Figure: Seymour with Anne
Seymour
Figure: Seymour: In disguise as Fisher??
Seymour
Figure: Seymour at ease.
Seymour
Figure: Seymour with pipes.
Seymour
Figure: Seymour with hair.
Seymour
Figure: Seymour as I knew him.
Minnesota
First half of my life in Minnesota.
Second quarter at U of M.
Carefree and callow?
Seymour.
Grad school
Figure: Our home.
Grad school
Figure: Our other home.
Grad school
Figure: Friday night without lights.
Grad school
My son Fletcher is in the background of some of these.
Figure: Chris.
Grad school
Figure: Norton with Erik and Chris in background.
Grad school
Figure: Gary and me.
Another look at the lasso
Other talks I could have given.
Linear model lack-of-fit tests
Partial sums of residuals.
Nonnormal asymptotics.
Linear model version of law of the iterated logarithm.
Slow convergence to asymptotic null distribution.
Remarkably successful simulations of most null distributions.
Generalized Split Plot Models
Testing the whole plot variance.
Equivalence of F and GLRT.
Constraints on parameter space complicate things.
When different, F is better.
Quite lovely application of linear model theory.
Favorite results in prediction
Obviously, in honor of Seymour.
Bayes better than frequentist, by frequentist standards.
R² = [corr(ŷᵢ, yᵢ)]².
BP residuals uncorrelated with everything.
BLP residuals uncorrelated with everything linear.
Leave-1-out CV overestimates prediction error worse than naive underestimates.
Lasso
Instead, I’ve chosen to discuss something I don’t know much about!
Hot topic; HTW (2015).
20 years of work; Tibshirani (1996).
The talk consists of:
Lots of pictures;
Lots of speculation.
Outline of Presentation
1 Defining the problem.
2 Geometry for unique least squares estimates.
3 Estimates changing signs. (Lasso trace/profile.)
4 Geometry for nonunique least squares estimates.
5 Uniqueness of lasso estimates.
6 LARS
Penalized estimation
Linear model
Y = Xβ + e, E(e) = 0, Cov(e) = σ²I. (1)
Minimize
‖Y − Xβ‖² + λ p(β). (2)
Tuning parameter λ ≥ 0. Nonnegative penalty function p(β).
Equivalent: Restricted estimation
Minimize
‖Y − Xβ‖²
subject to
p(β) ≤ δ.
Unless the least squares estimate already satisfies p(β̂) < δ, the minimizing value has p(β) = δ.
We focus on restricted estimation. HTW focus on penalized estimation.
Lasso and Ridge
yᵢ = β₀ + β₁xᵢ₁ + · · · + βₚxᵢₚ + εᵢ, i = 1, . . . , n, (3)
The lasso penalty:
pL(β) = ∑_{j=1}^{p} |βⱼ|.
Ridge regression penalty:
pR(β) = ∑_{j=1}^{p} βⱼ².
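Stated as code, the two penalties are one-liners; a minimal sketch (the function names are mine, not the talk's):

```python
# The two penalty functions above; names are illustrative.
def lasso_penalty(beta):
    """p_L(beta) = sum_j |beta_j|, the l1 norm of the slopes."""
    return sum(abs(b) for b in beta)

def ridge_penalty(beta):
    """p_R(beta) = sum_j beta_j**2, the squared l2 norm of the slopes."""
    return sum(b * b for b in beta)

print(lasso_penalty([3.0, -4.0]))  # 7.0
print(ridge_penalty([3.0, -4.0]))  # 25.0
```

Both penalties apply to the slopes β₁, . . . , βₚ, not the intercept β₀.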
Geometry. Not Statistics.
Least squares is a geometric procedure, not a statistical one. Under certain model assumptions, least squares has very nice statistical properties.
Similarly, the lasso is a geometric procedure, not a statistical one. Inference is largely based on the bootstrap, hence is asymptotic.
The Bayesian lasso largely defeats the stated purpose.
Two standard pictures
Figure: Shrinkage without selection.
Two standard pictures
Figure: Shrinkage and selection.
Two standard pictures
If you don’t believe me
Figure: Shrinkage and selection: Closeup.
Grow the ellipse
Figure: Repeat: Lasso with δ = 1. Estimation is trivial.
Grow the diamond
Figure: Lasso with δ = 1.7; the point (1.7, 0) is marked. Estimation is easy.
Grow the diamond
Figure: Lasso with δ = 2.7; the point (0, 2.7) is marked. Estimation is easy.
Grow the diamond
Figure: Lasso with δ = 3.1. Estimation is trivial.
Two dimensions. Too easy.
If you know you are hitting a vertex, estimation is trivial.
If you know you are hitting a line, estimation is easy.
The problem is knowing what you are hitting.
Much harder in higher dimensions.
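The two-dimensional case can in fact be solved by brute force, which makes the "what are you hitting" question concrete: scan the boundary of the diamond |b1| + |b2| = δ for the minimizer of a correlated quadratic loss. All numbers below (center, correlation, δ) are illustrative, not the ones behind the talk's figures.

```python
# Brute-force 2-D lasso sketch: minimize a quadratic centered at the least
# squares estimate (m1, m2) over the diamond |b1| + |b2| <= delta.
def lasso_2d(m1, m2, rho, delta, steps=4000):
    def q(b1, b2):  # quadratic loss with correlation rho between predictors
        d1, d2 = b1 - m1, b2 - m2
        return d1 * d1 + 2.0 * rho * d1 * d2 + d2 * d2

    if abs(m1) + abs(m2) <= delta:      # constraint slack: LS estimate wins
        return m1, m2
    best = None
    for k in range(steps):              # walk all four edges of the diamond
        t = delta * k / steps
        for b1, b2 in ((t, delta - t), (t, t - delta),
                       (-t, delta - t), (-t, t - delta)):
            val = q(b1, b2)
            if best is None or val < best[0]:
                best = (val, b1, b2)
    return best[1], best[2]

b1, b2 = lasso_2d(m1=2.0, m2=1.0, rho=0.45, delta=1.7)
print(round(b1, 2), round(b2, 2))  # 1.35 0.35 -- an edge, not a vertex
```

With these particular numbers the solution lands on an edge, so both coefficients survive; shrinking δ far enough eventually pushes the solution to a vertex, where one coefficient is zeroed.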
Three dimensions
Figure: Octahedron. (Copied from Wikipedia.)
Hit it with a football. Because a football is round, you can hit a surface, an edge, or a vertex.
Three dimensions
Figure: Octahedron with ellipsoid. (Copied from HTW.)
Hit it with a football. Because a football is round, you can hit a surface, an edge, or a vertex.
Three and higher dimensions.
If you know you are hitting a vertex, estimation is trivial.
If you know you are hitting an edge, estimation is still pretty easy.
If you know you are hitting a surface, estimation is easy.
Even in very high dimensions, estimation is pretty easy if you know what you are hitting.
The problem is knowing what you are hitting.
Three and higher dimensions.
A vertex is the intersection of 3 one-dimensional linear constraints. Dimension 0.
An edge is the intersection of 2 one-dimensional linear constraints. Dimension 1.
A surface is 1 one-dimensional linear constraint. Dimension 2.
Ideas scale up to higher dimensions.
Three and higher dimensions.
It is easy to fit a linear model subject to linear constraints.
(Just figure out the new linear model.)
The problem is knowing what constraints are appropriate.
With δ large enough, knowing the (one) constraint is easy.
Wanted to reconstruct the lasso based on this fact.
Three and higher dimensions.
First idea: fit the model subject to the constraint; zero out any estimates that changed sign, refit. (Not even close.)
Second idea: backward elimination as δ decreases; zero out estimates one at a time. (Not the lasso, but not necessarily bad.)
Note: LARS with the lasso modification is a version of forward selection with stepwise.
Latest conjecture: backward elimination with stepwise might be the lasso.
With predictor standardization, why bother to shrink at all? (Carroll and colleagues, relaxed.)
Backward elimination, forward selection, or stepwise based on |β̂ⱼ|? Out with the small, in with the big.
Changing signs: Lasso
My first error was thinking that Lasso estimates could not change sign.
Based on the two standard pictures plus a lack of imagination on my part.
I thought that the signs of β determined the surface. Just had to worry about its edges and vertices.
Only true when δ is large.
Changing signs: Ridge regression
Figure: Ridge regression with sign change.
I knew ridge regression estimates could change sign from Hoerl and Kennard (and the Ph.D. applied class).
Changing signs: Lasso
Figure: Lasso with sign change.
A lasso trace (profile)
Figure: δ = 1.8.
A lasso trace
Figure: δ = 1.6.
A lasso trace
Figure: Closeup: δ = 1.6.
A lasso trace
Figure: δ = 1.52.
A lasso trace
Figure: δ = 1.477.
A lasso trace
Figure: δ = 1.46.
A lasso trace
Figure: δ = 1.43.
A lasso trace
Figure: δ = 1.39.
A lasso trace
Figure: δ = 1.3.
No changing signs with standardization?
Figure: Standardized lasso. No sign change? ρ = 0.9, δ = 1.
No changing signs with standardization?
Figure: Standardized lasso. No sign change? ρ = 0.999, δ = 1.935.
No changing signs with standardization?
Figure: Standardized lasso. No sign change? ρ = −0.999, δ = 1.935.
Increasing correlation (Towards nonunique estimation)
Figure: Repeat: Lasso with ρ = 0.4/√0.8 ≈ 0.45.
Increasing correlation
Figure: Lasso with ρ = 0.999.
Increasing correlation
Figure: Lasso with ρ = 0.9999.
Increasing correlation
Figure: Lasso with ρ = −0.9999.
Nonunique estimates
Figure: Nonunique estimates. δ = 1.
Nonunique estimates
Figure: Least squares lasso estimate. δ = 7/3. Grow the diamond to hit the line. δ > 7/3?
Nonunique estimates
Figure: Nonunique estimates with shortest estimate. δ = 1.
Nonunique estimates
Figure: Nonunique estimates with shortest estimate and lasso estimate. δ = 1.
Nonunique estimates
Figure: Octahedron. (Copied from Wikipedia.)
If you throw it against a wall, it is likely to hit a vertex. If you throw it into a wire, it is likely to hit an edge.
Overparameterized one-way ANOVA
Figure: Shrinkage and selection. (The (α1, α2) plane with the line α1 − α2 = 2 and the point β0 = (1, −1).)
Computations
Y = Xβ + e
Mean adjusted.
X′X is a correlation matrix.
Geometry versus Lagrange (penalized likelihood)
δ → ∞ ⇒ β̂(δ) = β̂.
δ → 0 ⇒ β̂(δ) → 0.
λ → ∞ ⇒ β̂(λ) = 0.
λ → 0 ⇒ β̂(λ) → β̂.
Soft threshold for one variable:
β̂ⱼ(λ) = sign(β̂ⱼ)[|β̂ⱼ| − λ]₊ ≡ Sλ(β̂ⱼ).
This is 0 unless λ < maxⱼ |X′ⱼY| because, with X′X a correlation matrix, X′ⱼY = β̂ⱼ.
Cyclic coordinate descent: β̂ⱼ ← Sλ(β̂ⱼ + X′ⱼ[Y − Xβ̂]); keep running through the j's.
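The cyclic coordinate descent update is easy to sketch in pure Python. The helper names and toy data below are mine; the update assumes mean-adjusted data with each column of X scaled to unit length, so that X′X has ones on its diagonal.

```python
# Soft threshold S_lambda(z) = sign(z) * max(|z| - lambda, 0).
def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Cyclic coordinate descent: beta_j <- S_lambda(beta_j + X_j'[Y - X beta]),
# sweeping repeatedly through the coordinates j.
def coordinate_descent_lasso(X, Y, lam, sweeps=100):
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            resid = [Y[i] - sum(X[i][k] * beta[k] for k in range(p))
                     for i in range(n)]
            z = beta[j] + sum(X[i][j] * resid[i] for i in range(n))
            beta[j] = soft_threshold(z, lam)
    return beta

# Toy orthonormal design: the answer is the soft threshold of X'Y.
X = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
Y = [2.0, -0.5, 1.0]
print(coordinate_descent_lasso(X, Y, lam=1.0))  # [1.0, 0.0]
```

With an orthonormal design a single sweep already lands on the soft-thresholded solution; the repeated sweeps only matter for correlated columns.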
LARS
Focus on update steps; worry about starting later.
There exist knots λ₀ ≥ λ₁ ≥ · · · ≥ λ_{p∧n} ≥ 0.
For knot λk there is an active set of predictor variables Ak.
β̂Ak are the least squares estimates for the active set, with 0s for inactive variables.
β̂(λk) is the current estimate.
Define, for λ ≤ λk,
β̂(λ) = (λ/λk) β̂(λk) + [(λk − λ)/λk] β̂Ak.
LARS—will see later
For active Xj,
|X′ⱼ[Y − Xβ̂(λ)]| = λ.
For inactive Xi,
|X′ᵢ[Y − Xβ̂(λ)]| ≤ λ.
Decreasing λ, the first inactive Xi with
|X′ᵢ[Y − Xβ̂(λ)]| = λ
is added to the active set, which defines λk+1.
Repeat.
LARS Notes
1 The two definitions of β̂(λk+1) agree:
β̂(λk+1) = (λk+1/λk) β̂(λk) + [(λk − λk+1)/λk] β̂Ak
= (λk+1/λk+1) β̂(λk+1) + [(λk+1 − λk+1)/λk+1] β̂Ak+1.
When you add a new active variable, you initially use very little of it.
2 λ₀ = maxⱼ |X′ⱼY|. A₀ is the maximizing Xⱼ. For λ ≥ λ₀, β̂(λ) ≡ 0.
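Note 2 is easiest to see in the orthonormal case X′X = I, where the whole lasso path is explicit, β̂ⱼ(λ) = Sλ(X′ⱼY), and the knots are just the sorted |X′ⱼY|. A sketch with made-up inner products:

```python
# Soft threshold; with X'X = I the lasso path is beta_hat_j = soft(X_j'Y, lam).
def soft(z, lam):
    return (abs(z) - lam) * (1.0 if z > 0 else -1.0) if abs(z) > lam else 0.0

xty = [2.5, -1.4, 0.6]                 # hypothetical X_j'Y values
knots = sorted((abs(v) for v in xty), reverse=True)
lam0 = knots[0]                        # lambda_0 = max_j |X_j'Y| = 2.5

# For lambda >= lambda_0 every coefficient is zero ...
assert all(soft(v, lam0) == 0.0 for v in xty)

# ... and just below the second knot exactly one variable is active.
lam = knots[1] + 1e-9
active = [j for j, v in enumerate(xty) if soft(v, lam) != 0.0]
print(active)  # [0]
```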
LARS into Lasso
If any coefficient in
β̂(λ) = (λ/λk) β̂(λk) + [(λk − λ)/λk] β̂Ak
changes sign, throw out that variable and refit.
Constant residual covariance for LARS
Induction. Suppose that at knot λk we have
0 ≤ X′ⱼ[Y − Xβ̂(λk)] = λk
for any Xⱼ in the active set. Then, for λ < λk,
X′ⱼ[Y − Xβ̂(λ)] = (λ/λk) X′ⱼ[Y − Xβ̂(λk)] + [(λk − λ)/λk] X′ⱼ[Y − Xβ̂Ak]
= (λ/λk) λk + [(λk − λ)/λk] 0
= λ.
Recall: Xβ̂Ak = MAk Y, where MAk is the ppo onto the space of active variables.
A similar result holds if the inner product is negative.
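This constant-correlation property can be spot-checked in the simplest setting, an orthonormal design, where the residual inner product for an active Xⱼ is X′ⱼY − β̂ⱼ(λ) = ±λ exactly. The X′ⱼY values below are made up.

```python
# Soft threshold, as on the computations slide.
def soft(z, lam):
    return (abs(z) - lam) * (1.0 if z > 0 else -1.0) if abs(z) > lam else 0.0

xty = [2.2, -1.8, 0.4]        # hypothetical X_j'Y values (orthonormal design)
lam = 1.0
for v in xty:
    b = soft(v, lam)
    if b != 0.0:              # active: |inner product with residual| = lambda
        assert abs(abs(v - b) - lam) < 1e-12
    else:                     # inactive: inner product is at most lambda
        assert abs(v - b) <= lam
print("checked at lambda =", lam)
```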
LARS—Fletch
Additions to the active set are driven by knot behavior. For an inactive variable Xi, compute pi defined by
X′ᵢ[Y − Xβ̂(λk)] ≡ pi λk.
Then
X′ᵢ[Y − Xβ̂(λ)] = (λ/λk) X′ᵢ[Y − Xβ̂(λk)] + [(λk − λ)/λk] X′ᵢ[Y − Xβ̂Ak].
Add Xi if
(1 − λ/λk) X′ᵢ[Y − Xβ̂Ak] = (1 − pi) λ.
LARS—Fletch
For λk ≥ λ ≥ λk+1,
β̂(λ) = ∑_{j=0}^{k} (λ/λⱼ) β̂Aj.
Degrees of freedom
Dimensions of vector spaces.
Leery of other definitions. (DIC?)
HTW: yᵢ = m(xᵢ) + εᵢ, m̂ᵢ(Y) = m̂(xᵢ),
df[m̂] = ∑_{i=1}^{n} Cov[m̂ᵢ(Y), yᵢ] / σ².
Lasso: df estimated by the number of nonzero coefficients.
Ridge: df = p/(1 + λ) for an orthogonal design.
ŷᵢ = yᵢ/(nλ) gives df = 1/λ.
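For a linear smoother ŷ = HY, the covariance definition reduces to df = tr(H), since Cov(ŷᵢ, yᵢ) = σ²Hᵢᵢ. A sketch checking the two closing examples (taking λ = 1 and writing each smoother as a diagonal matrix, the ridge case in its eigenbasis):

```python
# df of a linear smoother y_hat = H y is the trace of H.
def trace_df(H):
    return sum(H[i][i] for i in range(len(H)))

n = p = 4
lam = 1.0

# Ridge, orthonormal design: H has eigenvalues 1/(1 + lam), so df = p/(1 + lam).
H_ridge = [[1.0 / (1.0 + lam) if i == j else 0.0 for j in range(p)]
           for i in range(p)]
print(trace_df(H_ridge))  # p / (1 + lam) = 2.0

# Toy smoother y_hat_i = y_i / (n * lam): H = I/(n lam), so df = 1/lam.
H_toy = [[1.0 / (n * lam) if i == j else 0.0 for j in range(n)]
         for i in range(n)]
print(trace_df(H_toy))  # 1 / lam = 1.0
```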
Sparsity
Fitting:
Y = Xβ + e.
Truth:
Y = X₀γ + e, C(X₀) = span{Xⱼ : j ∈ S}.
S ⊂ {1, 2, . . . , p} has r elements.
Ideally,
Xk ⊥ C(X₀), k ∉ S,
or some asymptotic version thereof.
Clearly, if the extraneous variables are highly correlated with the important ones, you have a mess.
References
Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall, Boca Raton, FL.
Murdoch, D. and Chow, E. D. (2015). Package ‘ellipse.’https://cran.r-project.org/web/packages/ellipse/ellipse.pdf
Qi, X., Luo, R., Carroll, R. J., and Zhao, H. (2015). Sparse regression by projection and sparse discriminant analysis. Journal of Computational and Graphical Statistics, 24, 416-438.
Tibshirani, R. (1996). Regression shrinkage and selection via thelasso. Journal of the Royal Statistical Society: Series B, 58,267-288.