
Solving ℓ1-Regularized Regression Problems

Stephen Wright

University of Wisconsin-Madison

Waterloo, June 2007


1 Introduction

2 Signal Reconstruction / Compressed Sensing
  Formulation and Theory
  Algorithms
  Results

3 Logistic Regression
  Application
  Algorithms
  Results

4 Conclusions

With Mario Figueiredo, Rob Nowak, Weiliang Shi, Grace Wahba


Formulation

Consider problems of the form

$$\min_x \; \phi(x) \;\stackrel{\mathrm{def}}{=}\; f(x) + \tau\|x\|_1,$$

where f(x) is a smooth convex function. This formulation arises in many applications of recent interest.

The term ‖x‖₁ tends to encourage sparsity in the optimal x. Motivation: variable selection. Seek a small number of “explanatory variables.”

Often need to solve for multiple values of τ, e.g. to adjust sparsity to some desired level or to perform cross-validation.


First Problem: “Sparse” Least Squares

$$\min_x \; \phi(x) \;\stackrel{\mathrm{def}}{=}\; \tfrac{1}{2}\|Ax - y\|_2^2 + \tau\|x\|_1,$$

where $A \in \mathbb{R}^{m \times n}$ typically has more columns than rows. See LASSO (Tibshirani, 1996).

A is not necessarily sparse;

m and n may be extremely large;

not practical to store or factor substantial submatrices of A;

may wish to solve for a range of τ values, not just one.


Second Problem: Logistic Regression

Have attribute vectors x(1), x(2), ..., x(n) (real vectors) and labels y(1), y(2), ..., y(n) (binary 0/1).

The probability of outcome Y = 1 given attribute vector X is p(X) = P(Y = 1 | X). Model the log odds or logit function as a linear combination of basis functions of x:

$$\ln\left(\frac{p(x)}{1 - p(x)}\right) = \sum_{l=0}^{N} a_l B_l(x).$$

Define a log-likelihood function based on observations:

$$\frac{1}{n}\sum_{i=1}^{n}\left[\, y(i)\log p(x(i)) + (1 - y(i))\log(1 - p(x(i)))\,\right].$$

Choose coefficients a_l, l = 0, 1, ..., N, to maximize this function.

Regularize by adding the ℓ1 term: $\tau \sum_{l=1}^{N} |a_l|$.


Approaches

For both problems, we describe two related approaches:

1 Split x into positive and negative parts, x = u − v, and solve a bound-constrained problem:

$$\min_{u,v} \; f(u - v) + \tau \mathbf{1}^T(u + v) \quad \text{s.t.} \quad (u, v) \ge 0.$$

Apply gradient projection and variants (Barzilai-Borwein, two-metric).

2 Apply methods for composite nonsmooth minimization, obtaining steps d from solving

$$\min_d \; m(d) + \tau\|x + d\|_1,$$

where m is a simplified model of f with ∇m(0) = ∇f(x). Use a quadratic term α d^T d to damp the step length, and use second-order scaling.


PART I: Sparse Least Squares, applied to Signal Reconstruction / Compressed Sensing


Wavelet-Based Signal Reconstruction

Problem has the form Ax ≈ y , where A = RW :

x is vector of coefficients for the unknown image or signal;

W is a wavelet basis (multiplication by W performs a wavelet transform). Possibly not square.

R is the observation operator (e.g. convolution with a blur operator, or tomographic projection).

y is vector of observations, possibly containing errors/noise.

W is generally large and dense, so it is impractical to store or factor it. However, matrix-vector multiplications by R, R^T, W, W^T can be performed economically, typically in O(n) or O(n log n) operations.

Motivation: we want to reconstruct a signal x from a transmitted encoding y, given prior knowledge that x is sparse. (There’s recent supporting theory.)

(Ref: NY Times, “Scientist at Work: Terence Tao,” 13 March 2007)


Compressed Sensing

Recent theory shows that, if x is known to be sparse, then it can be reconstructed from y ≈ Ax, where A is m × n “random” with m < n, under certain conditions on A.

A Representative Result (Candes, Romberg, Tao, 2005; see also Donoho, 2004). Given A, define δ_S to be the smallest quantity for which

$$(1 - \delta_S)\|c\|_2^2 \;\le\; \|A_T c\|_2^2 \;\le\; (1 + \delta_S)\|c\|_2^2 \quad \text{for all } c,$$

where A_T is a column submatrix of A defined by T ⊂ {1, 2, ..., n} with |T| ≤ S. (This ensures that A_T is close to orthonormal.)

If δ_{3S} + 3δ_{4S} < 2, then for any signal x̄ with at most S nonzeros and any vector y such that ‖y − Ax̄‖₂ ≤ ε, the solution x̂ of

$$\min_x \|x\|_1 \quad \text{subject to} \quad \|Ax - y\|_2 \le \epsilon$$

satisfies ‖x̂ − x̄‖₂ ≤ C_S ε, where C_S is a constant depending only on δ_{4S}.


Algorithms

$$\text{nonsmooth:}\quad \min_x \; \phi(x) \;\stackrel{\mathrm{def}}{=}\; q(x) + \tau\|x\|_1, \qquad q(x) \;\stackrel{\mathrm{def}}{=}\; \tfrac{1}{2}\|Ax - y\|_2^2.$$

Can formulate as bound-constrained least squares by splitting

x = u − v , (u, v) ≥ 0,

and writing

$$\text{least sq:}\quad \min_{u \ge 0,\, v \ge 0} \; \phi(u, v) \;\stackrel{\mathrm{def}}{=}\; \tfrac{1}{2}\|A(u - v) - y\|_2^2 + \tau \mathbf{1}^T u + \tau \mathbf{1}^T v.$$

For signal processing applications, we’ve had success with separable approximation for the nonsmooth form and gradient projection algorithms for the QP form.

Many other approaches proposed recently. (I’ll mention some.)


Separable (Successive) Approximation: Nonsmooth Form

From the current iterate x_k, form a simple approximation to φ(x_k + d) and choose the step d to minimize it:

$$\min_d \; \nabla q(x_k)^T d + \tfrac{1}{2}\alpha_k d^T d + \tau\|x_k + d\|_1. \qquad \text{(SA)}$$

The αk term can be thought of as

an approximation to the Hessian: α_k I ≈ ∇²q = A^T A;

a Lagrange multiplier for a trust-region constraint ‖d‖₂ ≤ ∆.

(SA) is trivial to solve in O(n) operations, since it is separable in the components of d.

(The approach generalizes easily when ‖x‖₁ is replaced by $\sum_{i=1}^{n} |x_i|^p$.)

Cost: 2-3 multiplications by A and/or A^T at each iteration.
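The minimizer of (SA) is just componentwise soft-thresholding of a gradient step. A minimal numpy sketch of that closed form (function names such as `sa_step` are ours, not from the talk):

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft-thresholding: argmin_u 0.5*(u - z)^2 + t*|u|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sa_step(x, A, y, tau, alpha):
    """Solve the separable subproblem (SA) at x for a given damping alpha.

    min_d  grad_q(x)^T d + (alpha/2) d^T d + tau * ||x + d||_1
    is, with u = x + d, equivalent to
    min_u  (alpha/2) * ||u - (x - grad/alpha)||^2 + tau * ||u||_1,
    so u is a soft-thresholded gradient step.
    """
    grad = A.T @ (A @ x - y)               # gradient of q(x) = 0.5*||Ax - y||^2
    u = soft_threshold(x - grad / alpha, tau / alpha)
    return u - x                           # the step d
```

The two matrix-vector products in `grad` are the dominant cost, matching the 2-3 multiplications per iteration noted above.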


Choosing αk : Barzilai-Borwein

Choose α_k so that α_k I mimics the behavior of the true Hessian A^T A over the last step taken. Define

$$s = x_k - x_{k-1}, \qquad y = \nabla q(x_k) - \nabla q(x_{k-1}) = A^T A (x_k - x_{k-1}).$$

Now choose α_k such that α_k s ≈ y in a least-squares sense. This leads to

$$\alpha_k = \frac{s^T y}{s^T s}.$$

(Barzilai and Borwein (1988) describe the approach in the context of unconstrained minimization of a smooth function.)

A distinctive property of BB scaling is that it leads to nonmonotone methods: the function value may increase at some iterations. Safeguarding can be used to ensure convergence.

Some analysis and computational results indicate its superiority over monotone steepest descent, in certain contexts.
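A sketch of the BB scaling computation for this least-squares q, exploiting the fact that the gradient difference costs only one multiplication each by A and A^T; the safeguard bounds are our addition:

```python
import numpy as np

def bb_alpha(A, x_new, x_old, alpha_min=1e-30, alpha_max=1e30):
    """Barzilai-Borwein scaling alpha_k = s^T y / s^T s for q(x) = 0.5*||Ax - y||^2.

    Since grad q(x) = A^T (Ax - y), the gradient difference is y = A^T A s.
    """
    s = x_new - x_old
    ys = A.T @ (A @ s)                     # y = grad q(x_new) - grad q(x_old)
    sTs = s @ s
    if sTs == 0.0:
        return alpha_max                   # degenerate step; fall back to a safe value
    return np.clip((s @ ys) / sTs, alpha_min, alpha_max)
```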


Monotone Variant: Nonsmooth Form

Obtain a monotone variant by

making an initial choice of αk using Barzilai-Borwein

If φ(xk + d) > φ(xk), set αk ← 2αk and repeat.

Gives “smoother” algorithmic behavior, sometimes faster convergence.

Each point tried is the solution of a trust-region subproblem:

$$\min_d \; \nabla q(x_k)^T d + \tau\|x_k + d\|_1 \quad \text{subject to} \quad \|d\|_2 \le \Delta,$$

for some ∆ > 0. (Increasing αk corresponds to decreasing ∆.)

In a sense, this scheme is

trust-region with a linearized approximation of φ, and

an interesting initial choice of trust-region radius ∆ at each iteration.


QP Formulation: Gradient Projection

The problem is $\min_{u \ge 0,\, v \ge 0} \phi(u, v)$. We have

$$\nabla_{u,v}\,\phi(u, v) = \begin{bmatrix} A^T A(u - v) - A^T y + \tau\mathbf{1} \\ -A^T A(u - v) + A^T y + \tau\mathbf{1} \end{bmatrix}.$$

Choose new iterate as

$$(u_{k+1}, v_{k+1}) = \big[(u_k, v_k) - \alpha(\nabla_u\phi^k, \nabla_v\phi^k)\big]_+,$$

where [·]_+ denotes projection onto the nonnegative orthant. Set α by

Barzilai-Borwein, or

the minimizer of φ along (δu, δv), which is the “immediate” projection of −∇φ onto the constraint set;

possibly modify α to achieve decrease in φ.

Get a monotone variant of BB by doing a line search on the feasible line between (u_k, v_k) and (u_{k+1}, v_{k+1}).

Computation per iteration is similar to the separable approximation.
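One projected-gradient iteration on the split QP form might look like the following sketch; α would come from BB or from the exact minimizer along the projected direction, as described above, and the function name is ours:

```python
import numpy as np

def gp_qp_step(u, v, A, y, tau, alpha):
    """One projected-gradient step for the split QP form.

    phi(u, v) = 0.5*||A(u - v) - y||^2 + tau*1^T u + tau*1^T v,  with (u, v) >= 0.
    """
    r = A @ (u - v) - y                    # residual
    g = A.T @ r                            # A^T A (u - v) - A^T y
    grad_u = g + tau
    grad_v = -g + tau
    u_new = np.maximum(u - alpha * grad_u, 0.0)   # projection onto u >= 0
    v_new = np.maximum(v - alpha * grad_v, 0.0)   # projection onto v >= 0
    return u_new, v_new
```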


Termination

Important issue in many applications, generally overlooked by optimizers.

May or may not need a highly accurate solution. Often it is important to find a near-optimal manifold (“select variables” near-optimally).

Stabilization of the active manifold can be used as an indicator of near-optimality. (Measure the fraction of status changes at each step.)

Use other estimates of dist(x, S) or [φ(x) − φ(x∗)] based on:

duality (an upper bound on [φ(x) − φ(x∗)]);
LCP perturbation theory (a bound on dist(x, S));
a bound on the achievable decrease in φ over a fixed trust region.

All can be implemented cheaply. The trick is to find a tolerance that yields good results on most instances.


Debiasing

After finding an approximate solution, debias by

fixing the zero elements of x at zero

doing an unconstrained minimization of ‖Ax − y‖₂² over the nonzero elements.

In other words, having selected the variables, we solve a reduced linear least squares problem in these variables alone.

We use standard conjugate gradient for this step; it works well because the (projected) Hessian is well conditioned.
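A sketch of the debiasing step, running a hand-rolled conjugate-gradient solve of the normal equations restricted to the selected columns of A; the function name and tolerances are ours:

```python
import numpy as np

def debias(A, y, x, cg_iters=50, tol=1e-8):
    """Fix the zeros of x and re-fit the nonzeros by least squares.

    Solves min_z ||A_S z - y||^2 over the support S = {i : x_i != 0} with CG
    on the normal equations, touching only the selected columns of A.
    """
    S = np.flatnonzero(x)
    if S.size == 0:
        return np.zeros_like(x)
    AS = A[:, S]
    z = x[S].copy()                        # warm start from the l1 solution
    r = AS.T @ (y - AS @ z)                # residual of the normal equations
    p = r.copy()
    rs = r @ r
    for _ in range(cg_iters):
        Ap = AS.T @ (AS @ p)
        a = rs / (p @ Ap)
        z += a * p
        r -= a * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    x_debiased = np.zeros_like(x)
    x_debiased[S] = z
    return x_debiased
```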


Multiple Values of τ

Often need to solve for a range of τ values.

Separable approximation / gradient projection approaches can make use of a good starting point. Simply use the approximate solution for one value of τ as the starting point for a nearby value.

Solve for a preassigned sequence of τ values, in increasing order. (The set of nonzero x components tends to shrink as τ increases.)
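A sketch of this warm-started continuation strategy; `solver` stands in for any single-τ method such as GPSR and is an assumption of ours, not part of the talk:

```python
import numpy as np

def solve_path(A, y, taus, solver, x0=None):
    """Solve for an increasing sequence of tau values, warm-starting each solve.

    solver(A, y, tau, x_init) is a hypothetical single-tau solver.
    """
    x = np.zeros(A.shape[1]) if x0 is None else x0
    path = []
    for tau in sorted(taus):               # increasing tau: supports tend to shrink
        x = solver(A, y, tau, x)           # reuse the previous solution as a start
        path.append((tau, x.copy()))
    return path
```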


Alternative: Active Set/Pivoting/Homotopy

Start with τ = ∞ (for which the solution is x = 0) and decrease τ towards the desired value (possibly zero).

Determine “breakpoints”: values of τ at which a component of x changes from zero to nonzero or vice versa. Use pivoting operations to update x at these values for the new active set.

See Osborne et al (2000), Efron et al (2003) (LARS).

Donoho and Tsaig (2006): if there is a sparse solution x_LS with only S nonzeros, only about S pivots are needed to find it, at a total cost of about O(S³ + Smn) operations (vs. O(m³) to do just one step of a standard pivoting method, when A is dense).

Can implement using only matrix-vector multiplications by A and A^T. Competitive with GP for very sparse solutions.


Alternative: Interior-Point

Could apply a primal-dual method to the QP formulation, solving the linear system at each IP iteration by CG or LSQR. (Inner iterations require only multiplications by A and A^T.)

Chen, Donoho, Saunders (1998): basis pursuit.

Saunders (2002): PDCO / SolveBP.

l1_ls: Kim et al. (2007) use a different formulation:

$$\min_{x,u} \; \tfrac{1}{2}\|Ax - y\|_2^2 + \tau e^T u, \quad \text{subject to} \quad -u \le x \le u.$$

Preconditions CG on the inner iterations.

Issues:

Need to use iterative methods on ill-conditioned systems to find interior-point steps; preconditioners are not always easy to find.

Not very good at solving for multiple τ values (the usual difficulties of warm-starting interior-point methods).


Alternative: “Bound Optimization”

Replace the Hessian A^T A by a diagonal approximation D, with A^T A ⪯ D:

$$\min_d \; \nabla q(x_k)^T d + \tfrac{1}{2} d^T D d + \tau\|x_k + d\|_1.$$

Again, subproblem is separable so solution can be calculated cheaply.

Can prove convergence because of the dominance property A^T A ⪯ D, but it can be overly conservative. (See Figueiredo, Nowak, others.)

Similar to scaled steepest descent / gradient projection with fixed step length (that’s a bit too long).


Alternative Approach: Second-Order Cone

Applies to formulation

$$\min_x \|x\|_1 \quad \text{subject to} \quad \|Ax - y\|_2 \le t,$$

for parameter t ≥ 0.

In the l1-magic code (Candes, Romberg, 2005), this is recast as a second-order cone program and solved by a primal log-barrier / Newton / CG approach, using the usual barrier term

$$\mu \log\left(t^2 - \|Ax - y\|_2^2\right)$$

for the constraint.

Again, ill-conditioned linear systems cause numerical difficulties. Does not handle multiple t values well.


Results: Compressed Sensing

Small, explicit problem to evaluate different algorithms.

$$\min_x \; \tfrac{1}{2}\|y - Rx\|_2^2 + \tau\|x\|_1,$$

R is 512 × 4096, dense, with elements chosen independently from N(0, 1); the rows are then normalized.

Choose x*_i = 0 with probability .99, x*_i ∈ Uniform[−1, 1] with probability .01.

Choose y = Rx* + e, where e_i ∼ N(0, .005).

Solve for a single value of τ, chosen according to some statistical criterion.

Compare several algorithms:

GPSR: QP form, gradient projection, monotone Barzilai-Borwein.

l1-magic (SOCP formulation)

SparseLab: basis pursuit + interior-point

Bound optimization


GP reconstructs the signal well (compared to the least-squares solution). (Note some attenuation due to the ‖x‖₁ term.)

[Figure: original spike signal; reconstruction (n = 4096, k = 512, sigma = 0.005, tau = 0.0289173, MSE = 7.56393e-005); and least-squares pseudo-solution (MSE = 0.0468702).]


As size increases, GP beats the competition. (Other GP variants similar.)

[Figure: CPU time (seconds) versus problem size n, from 128 to 8192, on a log scale, for GPSR-BB, SparseLab, L1-magic, and BOA.]


Debiasing removes the attenuation due to the ‖x‖1 term.

[Figure: original signal, reconstruction, and debiased reconstruction, with the MSE reported for each panel.]


Comparisons with l1_ls

(Kim, Koh, Lustig, Boyd, Gorinevsky, 2007.) A problem similar to the above (random matrix with spikes), with noise of 10⁻⁴ (Sec. 5.1 of Kim et al.).

Compare l1_ls with three variants of gradient projection on the QP formulation:

“immediate gradient” with Armijo,

Non-monotone Barzilai-Borwein,

Monotone Barzilai-Borwein.

l1_ls beats l1-magic, PDCO, homotopy, and the interior-point code MOSEK handily on this example.


[Figure: average CPU time versus problem size n (10⁴ to 10⁶), with empirical asymptotic exponents O(n^p): GPSR-BB monotone (p = 0.861), GPSR-BB non-monotone (p = 0.882), GPSR-Basic (p = 0.874), l1_ls (p = 1.21).]


PART II: Regularized Convex Minimization, applied to Logistic Regression


Logistic Regression

Reminder: we have attribute vectors x(1), x(2), ..., x(n) (real vectors) and binary labels y(1), y(2), ..., y(n) (all 0 or 1).

The probability of outcome Y = 1 given attribute X is p(X) = P(Y = 1 | X), parametrized as

$$\ln\left(\frac{p(x)}{1 - p(x)}\right) = \sum_{l=0}^{N} a_l B_l(x).$$

By rearranging, we can write this as

$$p(x) = \left[1 + \exp\left(-\sum_{l=0}^{N} a_l B_l(x)\right)\right]^{-1}.$$

The a posteriori log-likelihood function is then

$$L(a) = \frac{1}{n}\sum_{i=1}^{n}\left[\, y(i)\log p(x(i)) + (1 - y(i))\log(1 - p(x(i)))\,\right],$$

which can be written as ...


$$L(a) = \frac{1}{n}\sum_{i=1}^{n}\left[\, y(i)\sum_{l=0}^{N} a_l B_l(x(i)) - \log\left(1 + \exp\sum_{l=0}^{N} a_l B_l(x(i))\right)\right].$$

Regularize by including a multiple of the ℓ1 term:

$$J(a) = \sum_{l=1}^{N} |a_l|.$$

(Note that |a₀| is omitted, because conventionally B₀(x) ≡ 1, and we don’t penalize the constant shift.)

The optimization problem is then

$$\min_a \; T_\tau(a) \;\stackrel{\mathrm{def}}{=}\; -L(a) + \tau J(a).$$

Typical dimensions:

number of observations n: possibly 10³ to 10⁵;
length of the data vectors x(i) (features): 2 to 1000s;
N potentially exponential in the number of features (to capture interactions).


Evaluation of L and its Derivatives

We have

$$L(a) = \frac{1}{n}\sum_{i=1}^{n}\left[\, y(i)\sum_{l=0}^{N} a_l B_l(x(i)) - \log\big(1 + F(x(i); a)\big)\right],$$

where

$$F(x; a) = \exp\sum_{l=0}^{N} a_l B_l(x).$$

It could be relatively expensive to evaluate the F(x(i); a) for i = 1, 2, ..., n, which is needed for L. However, once these quantities are known, the gradient ∇L(a) is cheap and the Hessian ∇²L(a) has simple structure (though dense).
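A sketch of evaluating L(a) and ∇L(a); we assume the basis evaluations are collected in an n × (N+1) matrix B with B[i, l] = B_l(x(i)), which is our storage convention, not the talk's:

```python
import numpy as np

def loglik_and_grad(a, B, y):
    """Average log-likelihood L(a) and its gradient for the logistic model."""
    eta = B @ a                            # linear predictor: sum_l a_l B_l(x(i))
    # L(a) = (1/n) * sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ], computed stably
    L = np.mean(y * eta - np.logaddexp(0.0, eta))
    p = 1.0 / (1.0 + np.exp(-eta))         # fitted probabilities p(x(i))
    grad = B.T @ (y - p) / len(y)          # gradient: (1/n) * B^T (y - p)
    return L, grad
```

Computing `eta` is the expensive part (it plays the role of the F(x(i); a) values); the gradient then costs one more product with B^T, as the slide indicates.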


PatternSearch: Beaver Dam Data

n = 876 persons in the study. y(i) = 1 if the patient has myopia. Each x(i) is a zero-one vector of 7 features/risk factors:

Risk factor        0                       1
sex                female                  male
income             > $30,000               ≤ $30,000
juvenile myopia    myopic after age 21     myopic before age 21
cataract           severity 1,2,3          severity 4,5
smoking            packs × years ≤ 30      packs × years > 30
aspirin            yes                     no
vitamins           no                      yes

Find combinations of factors, as well as individual factors, that predict progression of myopia. Define N = 2⁷ and basis functions

$$B_{i_1,i_2,\ldots,i_7}(x) = \prod_{j:\, i_j = 1} x_j.$$

This function is 1 if x_j = 1 for all j with i_j = 1, and zero otherwise.
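A sketch of building the 2⁷ = 128 pattern basis functions from the zero-one risk factors; the function name and array layout are ours:

```python
import numpy as np
from itertools import product

def pattern_basis(X):
    """Evaluate all 2^m pattern basis functions B_{i_1,...,i_m}(x) = prod_{j: i_j=1} x_j.

    X is an (n, m) 0/1 array of risk factors (m = 7 here); returns an (n, 2^m)
    matrix whose first column (the empty pattern) is the constant function B_0 = 1.
    """
    n, m = X.shape
    patterns = list(product([0, 1], repeat=m))       # all 2^m index vectors
    B = np.ones((n, len(patterns)))
    for col, idx in enumerate(patterns):
        for j, ij in enumerate(idx):
            if ij == 1:
                B[:, col] *= X[:, j]                 # product over selected factors
    return B
```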


PatternSearch: Rheumatoid Arthritis and SNPs

We want to predict the likelihood that an individual is susceptible to rheumatoid arthritis based on genetic variations, plus some environmental factors.

SNP: a variation in a single nucleotide in the genome sequence, e.g. AAGGC changes to ATGGC. This version has two alleles, A and T. The less frequent nucleotides are called minor alleles.

It’s observed that rheumatoid arthritis is associated with SNPs on chromosome 6. Include 9187 nucleotides, mostly on chromosome 6, in the feature vector x, where the relevant component of x is coded as 0, 1, 2 according to whether it contains the most common nucleotide or one of the minor alleles.

x also contains a coding of a variation of “DR type at the HLA locus of chromosome 6.” x also contains variables for gender (female = 1), smoking (yes = 1), and age (older than 55 = 1). Total of 9192 x components.


Naively, we might like to examine all possible interactions, but this would involve solving a problem with ≈ 3^10000 unknowns.

Instead, do multiple rounds of prescreening.

795 of the 9192 individual variables survived the first round. (Actually, 880 variables survive, as for some variables more than one “level” was of interest.)

After screening for interactions between pairs of these 795 variables, we obtained 1679 interactions of possible interest.

Then solve a max-likelihood problem with 2559 = 880 + 1679 variables.


Algorithms

Again, we base algorithms on two formulations: bound-constrained via variable splitting, and the original nonsmooth formulation.

Restoring earlier notation, we have the nonsmooth form

$$\text{nonsmooth:}\quad \min_x \; f(x) + \tau\|x\|_1,$$

and the bound-constrained form:

$$\text{bound:}\quad \min_{(u,v) \ge 0} \; f(u - v) + \tau\mathbf{1}^T u + \tau\mathbf{1}^T v.$$

In our case, f = −L.

Write this as

$$\min_{z \ge 0} \; F(z),$$

where z = (u, v) and F is defined in the obvious way.


Bound-Constrained: Two-Metric Gradient Projection

Use gradient projection again, with two-metric scaling in the “free” components (Bertsekas, 1982). At each iterate z_k for min_{z≥0} F(z):

Calculate ∇F(z_k) and use it to form an estimate of the free variable set I_k. Exclude i from I_k if z^k_i is close to zero and ∂F/∂z_i > 0.

Calculate the partial Hessian corresponding to I_k:

$$H_{I_k} = \left[\frac{\partial^2 F(z_k)}{\partial z_i\, \partial z_j}\right]_{(i,j) \in I_k \times I_k}.$$

Form search direction pk by

$$p^k_i = -\tau_k \frac{\partial F}{\partial z_i}, \quad i \notin I_k,$$

for some scale factor τ_k, and

$$p^k_{I_k} = -\left(H_{I_k} + \epsilon_k I\right)^{-1} \nabla_{I_k} F(z_k).$$

Do an Armijo backtracking line search along $(z_k + \alpha p^k)_+$, $\alpha = 1, \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \ldots$, until descent in F.
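A sketch of one 2MGP iteration for min_{z ≥ 0} F(z); the callables `grad_F` and `hess_F`, the parameter names, and the defaults are our assumptions:

```python
import numpy as np

def two_metric_gp_step(z, F, grad_F, hess_F, eps=1e-6, scale=1.0,
                       active_tol=1e-8, max_backtracks=30):
    """One two-metric gradient-projection step for min_{z >= 0} F(z).

    grad_F(z) returns the full gradient; hess_F(z, I) returns the partial
    Hessian over the index set I.
    """
    g = grad_F(z)
    # Estimated free set: exclude components pinned near zero with positive gradient.
    free = ~((z <= active_tol) & (g > 0.0))
    I = np.flatnonzero(free)

    p = -scale * g                                      # scaled steepest-descent part
    if I.size > 0:
        H = hess_F(z, I) + eps * np.eye(I.size)         # damped reduced Hessian
        p[I] = -np.linalg.solve(H, g[I])                # Newton-like step on free set

    # Armijo-style backtracking along the projected arc [z + alpha*p]_+.
    f0 = F(z)
    alpha = 1.0
    for _ in range(max_backtracks):
        z_new = np.maximum(z + alpha * p, 0.0)
        if F(z_new) < f0:
            return z_new
        alpha *= 0.5
    return z                                            # no acceptable step found
```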


Why 2MGP?

For “interesting” τ values, there are very few nonzero components of a at the solution, so we have to compute and factor only a small submatrix of the Hessian (cheap), at least on later iterations.

Use of the reduced Hessian (with a damping term) accelerates the method greatly over first-order gradient projection.

Strategies that explicitly require more than the I_k part of the Hessian are not practical because of its size and density.

Strategies that use CG on the reduced Hessian are unnecessary, because near the solution it is small, easy to calculate, and ill conditioned.

Though we can cut many corners, the doubling of problem size is ugly. Can we adapt the appealing aspects of 2MGP to the nonsmooth setting, avoiding splitting of the variables?


Successive Approximation Approaches: Nonsmooth Form

Similarly to the “separable approximation” for the least-squares application, we can get the step d by solving

$$\min_d \; \nabla f(x_k)^T d + \tfrac{1}{2}\alpha_k d^T d + \tau\|x_k + d\|_1.$$

Again, this is separable and can be solved trivially.

Again, it’s related to a linearized trust-region subproblem:

$$\min_d \; \nabla f(x_k)^T d + \tau\|x_k + d\|_1 \quad \text{s.t.} \quad \|d\|_2 \le \Delta.$$

Adding a full quadratic term (1/2)d^T W_k d (giving a Newton-type method) yields a model that’s no longer separable and too expensive to solve.


Relationship to Composite Nonsmooth Algorithms

There’s considerable relevant research on trust-region algorithms for composite nonsmooth optimization in the period 1984-1991, by many authors (e.g. Conn, Coleman, Fletcher, Osborne, Yuan, SW, others):

$$\min_x \; f(x) + h(c(x)),$$

where $f : \mathbb{R}^n \to \mathbb{R}$ and $c : \mathbb{R}^n \to \mathbb{R}^m$ are smooth and h is polyhedral convex.

The main motivation was the ℓ1 exact penalty function for nonlinear programming.

What’s different/special about this problem?

Simple nonsmooth term ‖x‖1 makes “linear” model trivial to solve.

Easier to work directly with the damping term α d^T d than with the trust region.

A full quadratic approximation is not practical, but a reduced Hessian would be OK.

Second-order corrections not needed (no constraint curvature).


Particularly relevant are the techniques of Fletcher and Sainz de la Maza (1989) and SW (1990), which first solve a linear trust-region subproblem to try to identify the “active manifold”:

$$\min_d \; \nabla f(x_k)^T d + h(c(x_k) + A(x_k)d) \quad \text{s.t.} \quad \|d\| \le \Delta,$$

for some trust-region radius ∆. (Yields the analog of a “Cauchy point.”)

They then solve a quadratic subproblem, or try to do a higher-order correction along the active manifold.


The history book on the shelf
Is always repeating itself
    Waterloo (ABBA, 1974)


A 2MGP-like Approach

The damped linear model (in place of the linear trust-region model) is used to identify an active manifold: I_k = {i | (x_k + d)_i ≠ 0}.

Evaluate the reduced Hessian H_{I_k} of f at x_k and try to take a line-search Newton step in the I_k components.

Don’t let any of the I_k components cross over 0.

Move the non-I_k components to 0.

Accept if there’s a decrease in Tτ .

If the Newton step fails, and if the step from the linear model gives a decrease in T_τ, take it. Otherwise, increase α_k.
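A rough sketch of one plausible reading of this step; the callables (`grad_f`, `hess_f(x, I)`), the handling of the zero-crossing restriction, and the replacement of the line search by a simple acceptance test are all our assumptions:

```python
import numpy as np

def two_metric_nonsmooth_step(x, f, grad_f, hess_f, tau, alpha, eps=1e-6):
    """One 2MGP-like step on T(x) = f(x) + tau*||x||_1 (a sketch).

    Returns the new iterate and the (possibly increased) damping alpha.
    """
    def T(z):
        return f(z) + tau * np.abs(z).sum()

    g = grad_f(x)
    # Damped linear model: a soft-thresholded gradient step identifies the manifold.
    u = x - g / alpha
    x_lin = np.sign(u) * np.maximum(np.abs(u) - tau / alpha, 0.0)
    I = np.flatnonzero(x_lin)                        # estimated active manifold I_k

    if I.size > 0:
        # Newton-like step in the I_k components; the l1 term is smooth there.
        H = hess_f(x, I) + eps * np.eye(I.size)
        gI = g[I] + tau * np.sign(x_lin[I])
        xI = x[I] - np.linalg.solve(H, gI)
        xI[np.sign(xI) != np.sign(x_lin[I])] = 0.0   # don't let components cross 0
        x_newton = np.zeros_like(x)                  # non-I_k components moved to 0
        x_newton[I] = xI
        if T(x_newton) < T(x):
            return x_newton, alpha

    if T(x_lin) < T(x):                              # fall back to linear-model step
        return x_lin, alpha
    return x, 2.0 * alpha                            # otherwise increase the damping
```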


Relationships

This approach is also related to other approaches proposed for minimization of nonsmooth functions that are smooth on naturally arising manifolds: partly smooth functions.

Many researchers, esp. Lewis. “Predictor-corrector” methods analyzed by various authors (Mifflin, Sagastizabal, Daniilidis, Hare, Malick, Miller, ...):

Predictor step is the Newton-like step along the active manifold;

Corrector step(s) try to return to the manifold (e.g. a projection or proximal-point mapping).

Our linear model is similar (not identical) to the proximal-point corrector step.

There is also a connection to SLIQUE (Byrd, Gould, Nocedal, Waltz) for nonlinear programming, in which:

a linearized trust-region problem is used to identify the active manifold;

an equality-constrained SQP step is taken on this manifold.


Other Algorithms for Logistic Regression

Many other algorithms have been tried recently.

“Grafting” (Perkins et al., 2003): starting from x = 0 and σ = ∅, select the component i for which |∂f/∂x_i| is largest; set σ = σ ∪ {i} and minimize f over the components in σ; repeat.

“Boosting”: move one component at a time, sometimes by a fixed step length (in the + or − direction).

Bound optimization (Krishnapuram, Carin, Figueiredo, 2005): SQP with a diagonal Hessian approximation.

Descent along a subgradient scaled with an L-BFGS approximate Hessian (Andrew and Gao, 2007).

SQP on the equivalent formulation

$$\min_x \; f(x) \quad \text{s.t.} \quad \|x\|_1 \le C,$$

with the weighted least-squares subproblem solved by a homotopy solver and Armijo line search (Lee, Lee, Abbeel, Ng, 2006).

Smoothing of the ℓ1 term.

However, 2MGP is reported to be better in comparisons by two teams.


Results: Myopia Progression

Choose τ by generalized cross-validation.

For the final selected τ, 13 of the 128 possible combined risk factors are selected.

These are subjected to further analysis (“Step 2”); five factors survive.

Coefficients for f :

pattern                                        coefficient
constant                                         -3.29
cataract                                          2.42
smoking, no vitamins                              1.18
male, low income, juv. myopia, no aspirin         1.84
male, low income, cataract, no aspirin            1.08


Conclusions

An interesting class of problems that’s arisen recently in diverse applications.

Can draw on much fundamental optimization work, algorithmic and theoretical, BUT

It takes a lot of work to accommodate the characteristics of each problem to devise a practical algorithm:

size and structure
nonlinearity and conditioning
expected “density” of solutions
context (e.g. warm starts)

These problems motivate new algorithmic work for optimizers.

Work in progress!
