
Solving ℓ1-Regularized Regression Problems

Stephen Wright

University of Wisconsin-Madison

Waterloo, June 2007


1 Introduction

2 Signal Reconstruction / Compressed Sensing
  Formulation and Theory
  Algorithms
  Results

3 Logistic Regression
  Application
  Algorithms
  Results

4 Conclusions

With Mario Figueiredo, Rob Nowak, Weiliang Shi, Grace Wahba


Formulation

Consider problems of the form

$$\min_x \; \phi(x) \;\stackrel{\mathrm{def}}{=}\; f(x) + \tau\|x\|_1,$$

where f(x) is a smooth convex function. This formulation arises in many applications of recent interest.

The term ‖x‖₁ tends to encourage sparsity in the optimal x. Motivation: variable selection. Seek a small number of “explanatory variables.”

Often need to solve for multiple values of τ, e.g. to adjust sparsity to some desired level or to perform cross-validation.


First Problem: “Sparse” Least Squares

$$\min_x \; \phi(x) \;\stackrel{\mathrm{def}}{=}\; \tfrac{1}{2}\|Ax - y\|_2^2 + \tau\|x\|_1,$$

where $A \in \mathbb{R}^{m \times n}$ typically has more columns than rows. See LASSO (Tibshirani, 1996).

A is not necessarily sparse;

m and n may be extremely large;

not practical to store or factor substantial submatrices of A;

may wish to solve for a range of τ values, not just one.


Second Problem: Logistic Regression

Have attribute vectors x(1), x(2), ..., x(n) (real vectors) and labels y(1), y(2), ..., y(n) (binary 0/1).

The probability of outcome Y = 1 given attribute vector X is p(X) = P(Y = 1 | X). Model the log odds or logit function as a linear combination of basis functions of x:

$$\ln\left(\frac{p(x)}{1 - p(x)}\right) = \sum_{l=0}^{N} a_l B_l(x).$$

Define a log-likelihood function based on observations:

$$\frac{1}{n}\sum_{i=1}^{n}\left[\, y(i)\log p(x(i)) + (1 - y(i))\log(1 - p(x(i)))\,\right].$$

Choose coefficients a_l, l = 0, 1, ..., N, to maximize this function.

Regularize by adding the ℓ1 term: $\tau \sum_{l=1}^{N} |a_l|$.


Approaches

For both problems, we describe two related approaches:

1 Split x into positive and negative parts, x = u − v, and solve a bound-constrained problem:

$$\min_{u,v} \; f(u - v) + \tau \mathbf{1}^T(u + v) \quad \text{s.t.} \quad (u, v) \ge 0.$$

Apply gradient projection and variants (Barzilai-Borwein, two-metric).

2 Apply methods for composite nonsmooth minimization, obtaining steps d from solving

$$\min_d \; m(d) + \tau\|x + d\|_1,$$

where m is a simplified model of f with ∇m(0) = ∇f(x). Use a quadratic term α d^T d to damp the step length, and use second-order scaling.


PART I: Sparse Least Squares, applied to Signal Reconstruction / Compressed Sensing


Wavelet-Based Signal Reconstruction

Problem has the form Ax ≈ y , where A = RW :

x is vector of coefficients for the unknown image or signal;

W is a wavelet basis (multiplication by W performs a wavelet transform). Possibly not square.

R is the observation operator (e.g. convolution with a blur operator, or tomographic projection).

y is vector of observations, possibly containing errors/noise.

W is generally large and dense, so it is impractical to store or factor it. However, matrix-vector multiplications by R, R^T, W, W^T can be performed economically, typically in O(n) or O(n log n) operations.

Motivation: we want to reconstruct a signal x from a transmitted encoding y, given prior knowledge that x is sparse. (There’s recent supporting theory.)

(Ref: NY Times, “Scientist at Work: Terence Tao,” 13 March 2007)


Compressed Sensing

Recent theory shows that, if x is known to be sparse, then it can be reconstructed from y ≈ Ax, where A is m × n “random” with m < n, under certain conditions on A.

A Representative Result (Candes, Romberg, Tao, 2005; see also Donoho, 2004). Given A, define δ_S to be the smallest quantity for which

$$(1 - \delta_S)\|c\|_2^2 \;\le\; \|A_T c\|_2^2 \;\le\; (1 + \delta_S)\|c\|_2^2 \quad \text{for all } c,$$

where A_T is a column submatrix of A defined by T ⊂ {1, 2, ..., n} with |T| ≤ S. (This ensures that A_T is close to orthonormal.)

If δ_{3S} + 3δ_{4S} < 2, then for any signal x̄ with at most S nonzeros and any vector y such that ‖y − Ax̄‖₂ ≤ ε, the solution x̂ of

$$\min_x \|x\|_1 \quad \text{subject to} \quad \|Ax - y\|_2 \le \epsilon$$

satisfies ‖x̂ − x̄‖₂ ≤ C_S ε, where C_S is a constant depending only on δ_{4S}.


Algorithms

$$\text{nonsmooth:}\quad \min_x \; \phi(x) \;\stackrel{\mathrm{def}}{=}\; q(x) + \tau\|x\|_1, \qquad q(x) \;\stackrel{\mathrm{def}}{=}\; \tfrac{1}{2}\|Ax - y\|_2^2.$$

Can formulate as bound-constrained least squares by splitting

x = u − v , (u, v) ≥ 0,

and writing

$$\text{least sq:}\quad \min_{u \ge 0,\, v \ge 0} \; \phi(u, v) \;\stackrel{\mathrm{def}}{=}\; \tfrac{1}{2}\|A(u - v) - y\|_2^2 + \tau \mathbf{1}^T u + \tau \mathbf{1}^T v.$$

For signal processing applications, we’ve had success with separable approximation for the nonsmooth form and gradient projection algorithms for the QP form.

Many other approaches proposed recently. (I’ll mention some.)


Separable (Successive) Approximation: Nonsmooth Form

From the current iterate x_k, form a simple approximation to φ(x_k + d) and choose the step d to minimize it:

$$\min_d \; \nabla q(x_k)^T d + \tfrac{1}{2}\alpha_k d^T d + \tau\|x_k + d\|_1. \qquad \text{(SA)}$$

The αk term can be thought of as

an approximation to the Hessian: α_k I ≈ ∇²q = A^T A;

a Lagrange multiplier for a trust-region constraint ‖d‖₂ ≤ ∆.

(SA) is trivial to solve in O(n) operations, since it is separable in the components of d.

(The approach generalizes easily when ‖x‖₁ is replaced by $\sum_{i=1}^{n} |x_i|^p$.)

Cost: 2-3 multiplications by A and/or A^T at each iteration.
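The minimizer of (SA) is just componentwise soft-thresholding of a gradient step. A minimal numpy sketch of that closed form (function names such as `sa_step` are ours, not from the talk):

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft-thresholding: argmin_u 0.5*(u - z)^2 + t*|u|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sa_step(x, A, y, tau, alpha):
    """Solve the separable subproblem (SA) at x for a given damping alpha.

    min_d  grad_q(x)^T d + (alpha/2) d^T d + tau * ||x + d||_1
    is, with u = x + d, equivalent to
    min_u  (alpha/2) * ||u - (x - grad/alpha)||^2 + tau * ||u||_1,
    so u is a soft-thresholded gradient step.
    """
    grad = A.T @ (A @ x - y)               # gradient of q(x) = 0.5*||Ax - y||^2
    u = soft_threshold(x - grad / alpha, tau / alpha)
    return u - x                           # the step d
```

The two matrix-vector products in `grad` are the dominant cost, matching the 2-3 multiplications per iteration noted above.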


Choosing αk : Barzilai-Borwein

Choose α_k so that α_k I mimics the behavior of the true Hessian A^T A over the last step taken. Define

$$s = x_k - x_{k-1}, \qquad y = \nabla q(x_k) - \nabla q(x_{k-1}) = A^T A (x_k - x_{k-1}).$$

Now choose α_k such that α_k s ≈ y in a least-squares sense. This leads to

$$\alpha_k = \frac{s^T y}{s^T s}.$$

(Barzilai and Borwein (1988) describe the approach in the context of unconstrained minimization of a smooth function.)

A distinctive property of BB scaling is that it leads to nonmonotone methods: the function value may increase at some iterations. Safeguarding can be used to ensure convergence.

Some analysis and computational results indicate its superiority over monotone steepest descent, in certain contexts.
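A sketch of the BB scaling computation for this least-squares q, exploiting the fact that the gradient difference costs only one multiplication each by A and A^T; the safeguard bounds are our addition:

```python
import numpy as np

def bb_alpha(A, x_new, x_old, alpha_min=1e-30, alpha_max=1e30):
    """Barzilai-Borwein scaling alpha_k = s^T y / s^T s for q(x) = 0.5*||Ax - y||^2.

    Since grad q(x) = A^T (Ax - y), the gradient difference is y = A^T A s.
    """
    s = x_new - x_old
    ys = A.T @ (A @ s)                     # y = grad q(x_new) - grad q(x_old)
    sTs = s @ s
    if sTs == 0.0:
        return alpha_max                   # degenerate step; fall back to a safe value
    return np.clip((s @ ys) / sTs, alpha_min, alpha_max)
```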


Monotone Variant: Nonsmooth Form

Obtain a monotone variant by

making an initial choice of αk using Barzilai-Borwein

If φ(xk + d) > φ(xk), set αk ← 2αk and repeat.

Gives “smoother” algorithmic behavior, sometimes faster convergence.

Each point tried is the solution of a trust-region subproblem:

$$\min_d \; \nabla q(x_k)^T d + \tau\|x_k + d\|_1 \quad \text{subject to} \quad \|d\|_2 \le \Delta,$$

for some ∆ > 0. (Increasing αk corresponds to decreasing ∆.)

In a sense, this scheme is

trust-region with a linearized approximation of φ, and

an interesting initial choice of trust-region radius ∆ at each iteration.


QP Formulation: Gradient Projection

The problem is $\min_{u \ge 0,\, v \ge 0} \phi(u, v)$. We have

$$\nabla_{u,v}\,\phi(u, v) = \begin{bmatrix} A^T A(u - v) - A^T y + \tau\mathbf{1} \\ -A^T A(u - v) + A^T y + \tau\mathbf{1} \end{bmatrix}.$$

Choose new iterate as

$$(u_{k+1}, v_{k+1}) = \big[(u_k, v_k) - \alpha(\nabla_u\phi^k, \nabla_v\phi^k)\big]_+,$$

where [·]_+ denotes projection onto the nonnegative orthant. Set α by

Barzilai-Borwein, or

the minimizer of φ along (δu, δv), which is the “immediate” projection of −∇φ onto the constraint set;

possibly modify α to achieve decrease in φ.

Get a monotone variant of BB by doing a line search on the feasible line between (u_k, v_k) and (u_{k+1}, v_{k+1}).

Computation per iteration is similar to the separable approximation.
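One projected-gradient iteration on the split QP form might look like the following sketch; α would come from BB or from the exact minimizer along the projected direction, as described above, and the function name is ours:

```python
import numpy as np

def gp_qp_step(u, v, A, y, tau, alpha):
    """One projected-gradient step for the split QP form.

    phi(u, v) = 0.5*||A(u - v) - y||^2 + tau*1^T u + tau*1^T v,  with (u, v) >= 0.
    """
    r = A @ (u - v) - y                    # residual
    g = A.T @ r                            # A^T A (u - v) - A^T y
    grad_u = g + tau
    grad_v = -g + tau
    u_new = np.maximum(u - alpha * grad_u, 0.0)   # projection onto u >= 0
    v_new = np.maximum(v - alpha * grad_v, 0.0)   # projection onto v >= 0
    return u_new, v_new
```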


Termination

Important issue in many applications, generally overlooked by optimizers.

May or may not need a highly accurate solution. Often it is important to find a near-optimal manifold (“select variables” near-optimally).

Stabilization of the active manifold can be used as an indicator of near-optimality. (Measure the fraction of status changes at each step.)

Use other estimates of dist(x, S) or [φ(x) − φ(x∗)] based on:

duality (an upper bound on [φ(x) − φ(x∗)]);
LCP perturbation theory (a bound on dist(x, S));
a bound on the achievable decrease in φ over a fixed trust region.

All can be implemented cheaply. The trick is to find a tolerance that yields good results on most instances.


Debiasing

After finding an approximate solution, debias by

fixing the zero elements of x at zero

doing an unconstrained minimization of ‖Ax − y‖₂² over the nonzero elements.

In other words, having selected the variables, we solve a reduced linear least squares problem in these variables alone.

We use standard conjugate gradient for this step; it works well because the (projected) Hessian is well conditioned.
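A sketch of the debiasing step, running a hand-rolled conjugate-gradient solve of the normal equations restricted to the selected columns of A; the function name and tolerances are ours:

```python
import numpy as np

def debias(A, y, x, cg_iters=50, tol=1e-8):
    """Fix the zeros of x and re-fit the nonzeros by least squares.

    Solves min_z ||A_S z - y||^2 over the support S = {i : x_i != 0} with CG
    on the normal equations, touching only the selected columns of A.
    """
    S = np.flatnonzero(x)
    if S.size == 0:
        return np.zeros_like(x)
    AS = A[:, S]
    z = x[S].copy()                        # warm start from the l1 solution
    r = AS.T @ (y - AS @ z)                # residual of the normal equations
    p = r.copy()
    rs = r @ r
    for _ in range(cg_iters):
        Ap = AS.T @ (AS @ p)
        a = rs / (p @ Ap)
        z += a * p
        r -= a * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    x_debiased = np.zeros_like(x)
    x_debiased[S] = z
    return x_debiased
```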


Multiple Values of τ

Often need to solve for a range of τ values.

Separable approximation / gradient projection approaches can make use of a good starting point. Simply use the approximate solution for one value of τ as the starting point for a nearby value.

Solve for a preassigned sequence of τ values, in increasing order. (The set of nonzero x components tends to shrink as τ increases.)
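A sketch of this warm-started continuation strategy; `solver` stands in for any single-τ method such as GPSR and is an assumption of ours, not part of the talk:

```python
import numpy as np

def solve_path(A, y, taus, solver, x0=None):
    """Solve for an increasing sequence of tau values, warm-starting each solve.

    solver(A, y, tau, x_init) is a hypothetical single-tau solver.
    """
    x = np.zeros(A.shape[1]) if x0 is None else x0
    path = []
    for tau in sorted(taus):               # increasing tau: supports tend to shrink
        x = solver(A, y, tau, x)           # reuse the previous solution as a start
        path.append((tau, x.copy()))
    return path
```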


Alternative: Active Set/Pivoting/Homotopy

Start with τ = ∞ (for which the solution is x = 0) and decrease τ towards the desired value (possibly zero).

Determine “breakpoints”: values of τ at which a component of x changes from zero to nonzero or vice versa. Use pivoting operations to update x at these values for the new active set.

See Osborne et al (2000), Efron et al (2003) (LARS).

Donoho and Tsaig (2006): if there is a sparse solution x_LS with only S nonzeros, only about S pivots are needed to find it, at a total cost of about O(S³ + Smn) operations (vs. O(m³) to do just one step of a standard pivoting method, when A is dense).

Can implement using only matrix-vector multiplications by A and A^T. Competitive with GP for very sparse solutions.


Alternative: Interior-Point

Could apply a primal-dual method to the QP formulation, solving the linear system at each IP iteration by CG or LSQR. (Inner iterations require only multiplications by A and A^T.)

Chen, Donoho, Saunders (1998): basis pursuit.

Saunders (2002): PDCO / SolveBP.

l1_ls: Kim et al. (2007) use a different formulation:

$$\min_{x,u} \; \tfrac{1}{2}\|Ax - y\|_2^2 + \tau e^T u, \quad \text{subject to} \quad -u \le x \le u.$$

Preconditions CG on the inner iterations.

Issues:

Need to use iterative methods on ill-conditioned systems to find interior-point steps; preconditioners are not always easy to find.

Not very good at solving for multiple τ values (the usual difficulties of warm-starting interior-point methods).


Alternative: “Bound Optimization”

Replace the Hessian A^T A by a diagonal approximation D, with A^T A ⪯ D:

$$\min_d \; \nabla q(x_k)^T d + \tfrac{1}{2} d^T D d + \tau\|x_k + d\|_1.$$

Again, subproblem is separable so solution can be calculated cheaply.

Can prove convergence because of the dominance property A^T A ⪯ D, but it can be overly conservative. (See Figueiredo, Nowak, others.)

Similar to scaled steepest descent / gradient projection with fixed step length (that’s a bit too long).


Alternative Approach: Second-Order Cone

Applies to formulation

$$\min_x \|x\|_1 \quad \text{subject to} \quad \|Ax - y\|_2 \le t,$$

for parameter t ≥ 0.

In the l1-magic code (Candes, Romberg, 2005), this is recast as a second-order cone program and solved by a primal log-barrier / Newton / CG approach, using the usual barrier term

$$\mu \log\left(t^2 - \|Ax - y\|_2^2\right)$$

for the constraint.

Again, ill-conditioned linear systems cause numerical difficulties. Does not handle multiple t values well.


Results: Compressed Sensing

Small, explicit problem to evaluate different algorithms.

$$\min_x \; \tfrac{1}{2}\|y - Rx\|_2^2 + \tau\|x\|_1,$$

R is 512 × 4096, dense, with elements chosen independently from N(0, 1); the rows are then normalized.

Choose x*_i = 0 with probability .99, x*_i ∈ Uniform[−1, 1] with probability .01.

Choose y = Rx* + e, where e_i ∼ N(0, .005).

Solve for a single value of τ, chosen according to some statistical criterion.

Compare several algorithms:

GPSR: QP form, gradient projection, monotone Barzilai-Borwein.

l1-magic (SOCP formulation)

SparseLab: basis pursuit + interior-point

Bound optimization


GP reconstructs the signal well (compared to the least-squares solution). (Note some attenuation due to the ‖x‖₁ term.)

[Figure: original spike signal; reconstruction (n = 4096, k = 512, sigma = 0.005, tau = 0.0289173, MSE = 7.56393e-005); and least-squares pseudo-solution (MSE = 0.0468702).]


As size increases, GP beats the competition. (Other GP variants similar.)

[Figure: CPU time (seconds) versus problem size n, from 128 to 8192, on a log scale, for GPSR-BB, SparseLab, L1-magic, and BOA.]


Debiasing removes the attenuation due to the ‖x‖1 term.

[Figure: original signal, reconstruction, and debiased reconstruction, with the MSE reported for each panel.]


Comparisons with l1_ls

(Kim, Koh, Lustig, Boyd, Gorinevsky, 2007.) A problem similar to the above (random matrix with spikes), with noise of 10⁻⁴ (Sec. 5.1 of Kim et al.).

Compare l1_ls with three variants of gradient projection on the QP formulation:

“immediate gradient” with Armijo,

Non-monotone Barzilai-Borwein,

Monotone Barzilai-Borwein.

l1_ls beats l1-magic, PDCO, homotopy, and the interior-point code MOSEK handily on this example.


[Figure: average CPU time versus problem size n (10⁴ to 10⁶), with empirical asymptotic exponents O(n^p): GPSR-BB monotone (p = 0.861), GPSR-BB non-monotone (p = 0.882), GPSR-Basic (p = 0.874), l1_ls (p = 1.21).]


PART II: Regularized Convex Minimization, applied to Logistic Regression


Logistic Regression

Reminder: we have attribute vectors x(1), x(2), ..., x(n) (real vectors) and binary labels y(1), y(2), ..., y(n) (all 0 or 1).

The probability of outcome Y = 1 given attribute X is p(X) = P(Y = 1 | X), parametrized as

$$\ln\left(\frac{p(x)}{1 - p(x)}\right) = \sum_{l=0}^{N} a_l B_l(x).$$

By rearranging, we can write this as

$$p(x) = \left[1 + \exp\left(-\sum_{l=0}^{N} a_l B_l(x)\right)\right]^{-1}.$$

The a posteriori log-likelihood function is then

$$L(a) = \frac{1}{n}\sum_{i=1}^{n}\left[\, y(i)\log p(x(i)) + (1 - y(i))\log(1 - p(x(i)))\,\right],$$

which can be written as ...


$$L(a) = \frac{1}{n}\sum_{i=1}^{n}\left[\, y(i)\sum_{l=0}^{N} a_l B_l(x(i)) - \log\left(1 + \exp\sum_{l=0}^{N} a_l B_l(x(i))\right)\right].$$

Regularize by including a multiple of the ℓ1 term:

$$J(a) = \sum_{l=1}^{N} |a_l|.$$

(Note that |a₀| is omitted, because conventionally B₀(x) ≡ 1, and we don’t penalize the constant shift.)

The optimization problem is then

$$\min_a \; T_\tau(a) \;\stackrel{\mathrm{def}}{=}\; -L(a) + \tau J(a).$$

Typical dimensions:

number of observations n: possibly 10³ to 10⁵;
length of the data vectors x(i) (features): 2 to 1000s;
N potentially exponential in the number of features (to capture interactions).


Evaluation of L and its Derivatives

We have

$$L(a) = \frac{1}{n}\sum_{i=1}^{n}\left[\, y(i)\sum_{l=0}^{N} a_l B_l(x(i)) - \log\big(1 + F(x(i); a)\big)\right],$$

where

$$F(x; a) = \exp\sum_{l=0}^{N} a_l B_l(x).$$

It could be relatively expensive to evaluate the F(x(i); a) for i = 1, 2, ..., n, which is needed for L. However, once these quantities are known, the gradient ∇L(a) is cheap and the Hessian ∇²L(a) has simple structure (though dense).
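A sketch of evaluating L(a) and ∇L(a); we assume the basis evaluations are collected in an n × (N+1) matrix B with B[i, l] = B_l(x(i)), which is our storage convention, not the talk's:

```python
import numpy as np

def loglik_and_grad(a, B, y):
    """Average log-likelihood L(a) and its gradient for the logistic model."""
    eta = B @ a                            # linear predictor: sum_l a_l B_l(x(i))
    # L(a) = (1/n) * sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ], computed stably
    L = np.mean(y * eta - np.logaddexp(0.0, eta))
    p = 1.0 / (1.0 + np.exp(-eta))         # fitted probabilities p(x(i))
    grad = B.T @ (y - p) / len(y)          # gradient: (1/n) * B^T (y - p)
    return L, grad
```

Computing `eta` is the expensive part (it plays the role of the F(x(i); a) values); the gradient then costs one more product with B^T, as the slide indicates.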


PatternSearch: Beaver Dam Data

n = 876 persons in the study. y(i) = 1 if the patient has myopia. Each x(i) is a zero-one vector of 7 features/risk factors:

Risk factor        0                       1
sex                female                  male
income             > $30,000               ≤ $30,000
juvenile myopia    myopic after age 21     myopic before age 21
cataract           severity 1,2,3          severity 4,5
smoking            packs × years ≤ 30      packs × years > 30
aspirin            yes                     no
vitamins           no                      yes

Find combinations of factors, as well as individual factors, that predict progression of myopia. Define N = 2⁷ and basis functions

$$B_{i_1,i_2,\ldots,i_7}(x) = \prod_{j:\, i_j = 1} x_j.$$

This function is 1 if x_j = 1 for all j with i_j = 1, and zero otherwise.
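A sketch of building the 2⁷ = 128 pattern basis functions from the zero-one risk factors; the function name and array layout are ours:

```python
import numpy as np
from itertools import product

def pattern_basis(X):
    """Evaluate all 2^m pattern basis functions B_{i_1,...,i_m}(x) = prod_{j: i_j=1} x_j.

    X is an (n, m) 0/1 array of risk factors (m = 7 here); returns an (n, 2^m)
    matrix whose first column (the empty pattern) is the constant function B_0 = 1.
    """
    n, m = X.shape
    patterns = list(product([0, 1], repeat=m))       # all 2^m index vectors
    B = np.ones((n, len(patterns)))
    for col, idx in enumerate(patterns):
        for j, ij in enumerate(idx):
            if ij == 1:
                B[:, col] *= X[:, j]                 # product over selected factors
    return B
```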


PatternSearch: Rheumatoid Arthritis and SNPs

We want to predict the likelihood that an individual is susceptible to rheumatoid arthritis based on genetic variations, plus some environmental factors.

SNP: a variation in a single nucleotide in the genome sequence, e.g. AAGGC changes to ATGGC. This version has two alleles, A and T. The less frequent nucleotides are called minor alleles.

It’s observed that rheumatoid arthritis is associated with SNPs on chromosome 6. Include 9187 nucleotides, mostly on chromosome 6, in the feature vector x, where the relevant component of x is coded as 0, 1, 2 according to whether it contains the most common nucleotide or one of the minor alleles.

x also contains a coding of a variation of “DR type at the HLA locus of chromosome 6.” x also contains variables for gender (female = 1), smoking (yes = 1), and age (older than 55 = 1). Total of 9192 x components.


Naively, we might like to examine all possible interactions, but this would involve solving a problem with ≈ 3^10000 unknowns.

Instead, do multiple rounds of prescreening.

795 of the 9192 individual variables survived the first round. (Actually, 880 variables survive, as for some variables more than one “level” was of interest.)

After screening for interactions between pairs of these 795 variables, we obtained 1679 interactions of possible interest.

Then solve a max-likelihood problem with 2559 = 880 + 1679 variables.


Algorithms

Again, we base algorithms on two formulations: bound-constrained via variable splitting, and the original nonsmooth formulation.

Restoring earlier notation, we have the nonsmooth form

$$\text{nonsmooth:}\quad \min_x \; f(x) + \tau\|x\|_1,$$

and the bound-constrained form:

$$\text{bound:}\quad \min_{(u,v) \ge 0} \; f(u - v) + \tau\mathbf{1}^T u + \tau\mathbf{1}^T v.$$

In our case, f = −L.

Write this as

$$\min_{z \ge 0} \; F(z),$$

where z = (u, v) and F is defined in the obvious way.


Bound-Constrained: Two-Metric Gradient Projection

Use gradient projection again, with two-metric scaling in the “free” components (Bertsekas, 1982). At each iterate z_k for min_{z≥0} F(z):

Calculate ∇F(z_k) and use it to form an estimate of the free variable set I_k. Exclude i from I_k if z^k_i is close to zero and ∂F/∂z_i > 0.

Calculate the partial Hessian corresponding to I_k:

$$H_{I_k} = \left[\frac{\partial^2 F(z_k)}{\partial z_i\, \partial z_j}\right]_{(i,j) \in I_k \times I_k}.$$

Form search direction pk by

$$p^k_i = -\tau_k \frac{\partial F}{\partial z_i}, \quad i \notin I_k,$$

for some scale factor τ_k, and

$$p^k_{I_k} = -\left(H_{I_k} + \epsilon_k I\right)^{-1} \nabla_{I_k} F(z_k).$$

Do an Armijo backtracking line search along $(z_k + \alpha p^k)_+$, $\alpha = 1, \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \ldots$, until descent in F.
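A sketch of one 2MGP iteration for min_{z ≥ 0} F(z); the callables `grad_F` and `hess_F`, the parameter names, and the defaults are our assumptions:

```python
import numpy as np

def two_metric_gp_step(z, F, grad_F, hess_F, eps=1e-6, scale=1.0,
                       active_tol=1e-8, max_backtracks=30):
    """One two-metric gradient-projection step for min_{z >= 0} F(z).

    grad_F(z) returns the full gradient; hess_F(z, I) returns the partial
    Hessian over the index set I.
    """
    g = grad_F(z)
    # Estimated free set: exclude components pinned near zero with positive gradient.
    free = ~((z <= active_tol) & (g > 0.0))
    I = np.flatnonzero(free)

    p = -scale * g                                      # scaled steepest-descent part
    if I.size > 0:
        H = hess_F(z, I) + eps * np.eye(I.size)         # damped reduced Hessian
        p[I] = -np.linalg.solve(H, g[I])                # Newton-like step on free set

    # Armijo-style backtracking along the projected arc [z + alpha*p]_+.
    f0 = F(z)
    alpha = 1.0
    for _ in range(max_backtracks):
        z_new = np.maximum(z + alpha * p, 0.0)
        if F(z_new) < f0:
            return z_new
        alpha *= 0.5
    return z                                            # no acceptable step found
```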


Why 2MGP?

For “interesting” τ values, there are very few nonzero components of a at the solution, so we have to compute and factor only a small submatrix of the Hessian (cheap), at least on later iterations.

Use of the reduced Hessian (with a damping term) accelerates the method greatly over first-order gradient projection.

Strategies that explicitly require more than the I_k part of the Hessian are not practical because of its size and density.

Strategies that use CG on the reduced Hessian are unnecessary, because near the solution it is small, easy to calculate, and ill conditioned.

Though we can cut many corners, the doubling of problem size is ugly. Can we adapt the appealing aspects of 2MGP to the nonsmooth setting, avoiding splitting of the variables?


Successive Approximation Approaches: Nonsmooth Form

Similarly to the “separable approximation” for the least-squares application, we can get the step d by solving

$$\min_d \; \nabla f(x_k)^T d + \tfrac{1}{2}\alpha_k d^T d + \tau\|x_k + d\|_1.$$

Again, this is separable and can be solved trivially.

Again, it’s related to a linearized trust-region subproblem:

$$\min_d \; \nabla f(x_k)^T d + \tau\|x_k + d\|_1 \quad \text{s.t.} \quad \|d\|_2 \le \Delta.$$

Adding a full quadratic term (1/2)d^T W_k d (giving a Newton-type method) yields a model that’s no longer separable and too expensive to solve.


Relationship to Composite Nonsmooth Algorithms

There’s considerable relevant research on trust-region algorithms for composite nonsmooth optimization in the period 1984-1991, by many authors (e.g. Conn, Coleman, Fletcher, Osborne, Yuan, SW, others):

$$\min_x \; f(x) + h(c(x)),$$

where $f : \mathbb{R}^n \to \mathbb{R}$ and $c : \mathbb{R}^n \to \mathbb{R}^m$ are smooth and h is polyhedral convex.

The main motivation was the ℓ1 exact penalty function for nonlinear programming.

What’s different/special about this problem?

Simple nonsmooth term ‖x‖1 makes “linear” model trivial to solve.

Easier to work directly with the damping term α d^T d than with the trust region.

A full quadratic approximation is not practical, but a reduced Hessian would be OK.

Second-order corrections not needed (no constraint curvature).


Particularly relevant are the techniques of Fletcher and Sainz de la Maza (1989) and SW (1990), which first solve a linear trust-region subproblem to try to identify the “active manifold”:

$$\min_d \; \nabla f(x_k)^T d + h(c(x_k) + A(x_k)d) \quad \text{s.t.} \quad \|d\| \le \Delta,$$

for some trust-region radius ∆. (Yields the analog of a “Cauchy point.”)

They then solve a quadratic subproblem, or try to do a higher-order correction along the active manifold.


The history book on the shelf
Is always repeating itself
    Waterloo (ABBA, 1974)


A 2MGP-like Approach

The damped linear model (in place of the linear trust-region model) is used to identify an active manifold: I_k = {i | (x_k + d)_i ≠ 0}.

Evaluate the reduced Hessian H_{I_k} of f at x_k and try to take a line-search Newton step in the I_k components.

Don’t let any of the I_k components cross over 0.

Move the non-I_k components to 0.

Accept if there’s a decrease in Tτ .

If the Newton step fails, and if the step from the linear model gives a decrease in T_τ, take it. Otherwise, increase α_k.
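A rough sketch of one plausible reading of this step; the callables (`grad_f`, `hess_f(x, I)`), the handling of the zero-crossing restriction, and the replacement of the line search by a simple acceptance test are all our assumptions:

```python
import numpy as np

def two_metric_nonsmooth_step(x, f, grad_f, hess_f, tau, alpha, eps=1e-6):
    """One 2MGP-like step on T(x) = f(x) + tau*||x||_1 (a sketch).

    Returns the new iterate and the (possibly increased) damping alpha.
    """
    def T(z):
        return f(z) + tau * np.abs(z).sum()

    g = grad_f(x)
    # Damped linear model: a soft-thresholded gradient step identifies the manifold.
    u = x - g / alpha
    x_lin = np.sign(u) * np.maximum(np.abs(u) - tau / alpha, 0.0)
    I = np.flatnonzero(x_lin)                        # estimated active manifold I_k

    if I.size > 0:
        # Newton-like step in the I_k components; the l1 term is smooth there.
        H = hess_f(x, I) + eps * np.eye(I.size)
        gI = g[I] + tau * np.sign(x_lin[I])
        xI = x[I] - np.linalg.solve(H, gI)
        xI[np.sign(xI) != np.sign(x_lin[I])] = 0.0   # don't let components cross 0
        x_newton = np.zeros_like(x)                  # non-I_k components moved to 0
        x_newton[I] = xI
        if T(x_newton) < T(x):
            return x_newton, alpha

    if T(x_lin) < T(x):                              # fall back to linear-model step
        return x_lin, alpha
    return x, 2.0 * alpha                            # otherwise increase the damping
```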


Relationships

This approach is also related to other approaches proposed for minimization of nonsmooth functions that are smooth on naturally arising manifolds: partly smooth functions.

Many researchers, esp. Lewis. “Predictor-corrector” methods analyzed by various authors (Mifflin, Sagastizabal, Daniilidis, Hare, Malick, Miller, ...):

Predictor step is the Newton-like step along the active manifold;

Corrector step(s) try to return to the manifold (e.g. a projection or proximal-point mapping).

Our linear model is similar (not identical) to the proximal-point corrector step.

There is also a connection to SLIQUE (Byrd, Gould, Nocedal, Waltz) for nonlinear programming, in which:

a linearized trust-region problem is used to identify the active manifold;

an equality-constrained SQP step is taken on this manifold.


Other Algorithms for Logistic Regression

Many other algorithms have been tried recently.

“Grafting” (Perkins et al., 2003): starting from x = 0 and σ = ∅, select the component i for which |∂f/∂x_i| is largest; set σ = σ ∪ {i} and minimize f over the components in σ; repeat.

“Boosting”: move one component at a time, sometimes by a fixed step length (in the + or − direction).

Bound optimization (Krishnapuram, Carin, Figueiredo, 2005): SQP with a diagonal Hessian approximation.

Descent along a subgradient scaled with an L-BFGS approximate Hessian (Andrew and Gao, 2007).

SQP on the equivalent formulation

$$\min_x \; f(x) \quad \text{s.t.} \quad \|x\|_1 \le C,$$

with the weighted least-squares subproblem solved by a homotopy solver and Armijo line search (Lee, Lee, Abbeel, Ng, 2006).

Smoothing of the ℓ1 term.

However, 2MGP is reported to be better in comparisons by two teams.


Results: Myopia Progression

Choose τ by generalized cross-validation.

For the final selected τ, 13 of the 128 possible combined risk factors are selected.

These are subjected to further analysis (“Step 2”); five factors survive.

Coefficients for f :

pattern                                        coefficient
constant                                         -3.29
cataract                                          2.42
smoking, no vitamins                              1.18
male, low income, juv. myopia, no aspirin         1.84
male, low income, cataract, no aspirin            1.08


Conclusions

An interesting class of problems that’s arisen recently in diverse applications.

Can draw on much fundamental optimization work, algorithmic and theoretical, BUT

It takes a lot of work to accommodate the characteristics of each problem to devise a practical algorithm:

size and structure
nonlinearity and conditioning
expected “density” of solutions
context (e.g. warm starts)

These problems motivate new algorithmic work for optimizers.

Work in progress!
