SIAM J. OPTIM. © 2018 Society for Industrial and Applied Mathematics
Vol. 28, No. 1, pp. 433–458

A HIGHLY EFFICIENT SEMISMOOTH NEWTON AUGMENTED LAGRANGIAN METHOD FOR SOLVING LASSO PROBLEMS*

XUDONG LI†, DEFENG SUN‡, AND KIM-CHUAN TOH§

Abstract. We develop a fast and robust algorithm for solving large-scale convex composite optimization models with an emphasis on the $\ell_1$-regularized least squares regression (lasso) problems. Despite the fact that there exist a large number of solvers in the literature for the lasso problems, we found that no solver can efficiently handle difficult large-scale regression problems with real data. By leveraging on available error bound results to realize the asymptotic superlinear convergence property of the augmented Lagrangian algorithm, and by exploiting the second order sparsity of the problem through the semismooth Newton method, we are able to propose an algorithm, called Ssnal, to efficiently solve the aforementioned difficult problems. Under very mild conditions, which hold automatically for lasso problems, both the primal and the dual iteration sequences generated by Ssnal possess a fast linear convergence rate, which can even be superlinear asymptotically. Numerical comparisons between our approach and a number of state-of-the-art solvers, on real data sets, are presented to demonstrate the high efficiency and robustness of our proposed algorithm in solving difficult large-scale lasso problems.

Key words. lasso, sparse optimization, augmented Lagrangian, metric subregularity, semismoothness, Newton's method

AMS subject classifications. 65F10, 90C06, 90C25, 90C31

DOI. 10.1137/16M1097572

1. Introduction. In this paper, we aim to design a highly efficient and robust algorithm for solving convex composite optimization problems including the following $\ell_1$-regularized least squares (LS) problem:

(1)    $\min \Big\{ \frac{1}{2}\|\mathcal{A}x - b\|^2 + \lambda\|x\|_1 \Big\},$

where $\mathcal{A} : \mathcal{X} \to \mathcal{Y}$ is a linear map whose adjoint is denoted as $\mathcal{A}^*$, $b \in \mathcal{Y}$ and $\lambda > 0$ are given data, and $\mathcal{X}$, $\mathcal{Y}$ are two real finite-dimensional Euclidean spaces each equipped with an inner product $\langle \cdot, \cdot\rangle$ and its induced norm $\|\cdot\|$.
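For concreteness, the following minimal NumPy sketch evaluates the objective in (1) on a small synthetic instance; the data here are random and purely illustrative, not taken from the paper.

import numpy as np

# Synthetic, purely illustrative instance of (1): A is m x n, b in R^m.
rng = np.random.default_rng(0)
m, n, lam = 50, 200, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def lasso_objective(x):
    # (1/2) * ||A x - b||^2 + lambda * ||x||_1
    return 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.linalg.norm(x, 1)

print(lasso_objective(np.zeros(n)))  # objective value at the zero vector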

With the advent of convenient automated data collection technologies, the Big Data era brings new challenges in analyzing massive data due to the inherent sizes: large samples and high dimensionality [15]. In order to respond to these challenges, researchers have developed many new statistical tools to analyze such data. Among these, the $\ell_1$-regularized models are arguably the most intensively studied. They are used in many applications, such as in compressive sensing, high-dimensional variable selection, and image reconstruction.

* Received by the editors October 6, 2016; accepted for publication (in revised form) August 23, 2017; published electronically February 13, 2018. An earlier version of this paper was made available in arXiv as arXiv:1607.05428.
  http://www.siam.org/journals/siopt/28-1/M109757.html
† Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544 (xudongl@princeton.edu).
‡ Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong. On leave from Department of Mathematics, National University of Singapore (defeng.sun@polyu.edu.hk).
§ Department of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076, Singapore (mattohkc@nus.edu.sg).


Most notably, the model (1), called lasso, was proposed in [48] and has been used extensively in high-dimensional statistics and machine learning. The model (1) has also been studied in the signal processing context under the name basis pursuit denoising [9]. In addition to its own importance in statistics and machine learning, lasso problem (1) also appears as an inner subproblem of many important algorithms. For example, in recent papers [4, 1], a level set method was proposed to solve a computationally more challenging reformulation of the lasso problem, i.e.,

$$\min \Big\{ \|x\|_1 \;\Big|\; \frac{1}{2}\|\mathcal{A}x - b\|^2 \le \sigma \Big\}.$$

The level set method relies critically on the assumption that the optimization problem

$$\min \Big\{ \frac{1}{2}\|\mathcal{A}x - b\|^2 + \delta_{B_\lambda}(x) \Big\}, \qquad B_\lambda := \{x \mid \|x\|_1 \le \lambda\},$$

of the same type as (1), can be efficiently solved to high accuracy. The lasso-type optimization problems also appear as subproblems in various proximal Newton methods for solving convex composite optimization problems [7, 29, 53]. Notably, in a broader sense, all these proximal Newton methods belong to the class of algorithms studied in [17].

The above-mentioned importance, together with a wide range of applications of (1), has inspired many researchers to develop various algorithms for solving this problem and its equivalent reformulations. These algorithms can roughly be divided into two categories. The first category consists of algorithms that use only the gradient information, for example, the accelerated proximal gradient (APG) method [37, 2], GPSR [16], SPGL1 [4], SpaRSA [50], FPC_AS [49], and NESTA [3], to name only a few. Meanwhile, algorithms in the second category, including but not limited to mfIPM [18], SNF [36], BAS [6], SQA [7], OBA [26], and FBS-Newton [51], utilize the second order information of the underlying problem in the algorithmic design to accelerate the convergence. Nearly all of these second order information based solvers rely on certain nondegeneracy assumptions to guarantee the nonsingularity of the corresponding inner linear systems. The only exception is the inexact interior-point-algorithm based solver mfIPM, which does not rely on the nondegeneracy assumption but requires the availability of appropriate preconditioners to ameliorate the extreme ill-conditioning in the linear systems of the subproblems. For nondegenerate problems, the solvers in the second category generally work quite well and usually outperform the algorithms in the first category when high accuracy solutions are sought. In this paper, we also aim to solve the lasso problems by making use of the second order information. The novelty of our approach is that we do not need any nondegeneracy assumption in our theory or computations. The core idea is to analyze the fundamental nonsmooth structures in the problems to formulate and solve specific semismooth equations with well conditioned symmetric and positive definite generalized Jacobian matrices, which consequently play a critical role in our algorithmic design. When applied to solving difficult large-scale sparse optimization problems, even for degenerate ones, our approach can outperform the first order algorithms by a huge margin, regardless of whether low- or high-accuracy solutions are sought. This is in sharp contrast to most of the other second order based solvers mentioned above, where their competitive advantages over first order methods only become apparent when high-accuracy solutions are sought.

Our proposed algorithm is a semismooth Newton augmented Lagrangian (in short, Ssnal) method for solving the dual of problem (1), where the sparsity property of the second order generalized Hessian is wisely exploited. This algorithmic framework is adapted from those that appeared in [54, 25, 52, 31] for solving semidefinite programming problems, where impressive numerical results have been reported. Specialized to the vector case, our Ssnal possesses unique features that are not available in the semidefinite programming case. It is these combined unique features that allow our algorithm to converge at a very fast speed with very low computational costs at each step. Indeed, for large-scale sparse lasso problems, our numerical experiments show that the proposed algorithm needs at most a few dozen outer iterations to reach solutions with the desired accuracy, while all the inner semismooth Newton subproblems can be solved very cheaply. One reason for this impressive performance is that the piecewise linear-quadratic structure of the lasso problem (1) guarantees the asymptotic superlinear convergence of the augmented Lagrangian algorithm. Beyond the piecewise linear-quadratic case, we also study more general functions to guarantee this fast convergence rate. More importantly, since there are several desirable properties, including the strong convexity of the objective function in the inner subproblems, in each outer iteration we only need to execute a few (usually one to four) semismooth Newton steps to solve the underlying subproblem. As will be shown later, for lasso problems with sparse optimal solutions, the computational costs of performing these semismooth Newton steps can be made to be extremely cheap compared to other costs. This seems to be counterintuitive, as normally one would expect a second order method to be computationally much more expensive than the first order methods at each step. Here we make this counterintuitive achievement possible by carefully exploiting the second order sparsity in the augmented Lagrangian functions. Notably, our algorithmic framework not only works for models such as lasso, adaptive lasso [56], and elastic net [57], but can also be applied to more general convex composite optimization problems. The high performance of our algorithm also serves to show that the second order information, more specifically the nonsmooth second order information, can be and should be incorporated intelligently into the algorithmic design for large-scale optimization problems.

The remaining parts of this paper are organized as follows. In the next section, we introduce some definitions and present preliminary results on the metric subregularity of multivalued mappings. We should emphasize here that these stability results play a pivotal role in the analysis of the convergence rate of our algorithm. In section 3, we propose an augmented Lagrangian algorithm to solve the general convex composite optimization model and analyze its asymptotic superlinear convergence. The semismooth Newton algorithm for solving the inner subproblems and the efficient implementation of the algorithm are also presented in this section. In section 4, we conduct extensive numerical experiments to evaluate the performance of Ssnal in solving various lasso problems. We conclude our paper in the final section.

2. Preliminaries. We discuss in this section some stability properties of convex composite optimization problems. It will become apparent later that these stability properties are the key ingredients for establishing the fast convergence of our augmented Lagrangian method.

Recall that $\mathcal{X}$ and $\mathcal{Y}$ are two real finite-dimensional Euclidean spaces. For a given closed proper convex function $p : \mathcal{X} \to (-\infty, +\infty]$, the proximal mapping $\mathrm{Prox}_p(\cdot)$ associated with $p$ is defined by

$$\mathrm{Prox}_p(x) := \arg\min_{u \in \mathcal{X}} \Big\{ p(u) + \frac{1}{2}\|u - x\|^2 \Big\} \quad \forall x \in \mathcal{X}.$$


We will often make use of the Moreau identity $\mathrm{Prox}_{tp}(x) + t\,\mathrm{Prox}_{p^*/t}(x/t) = x$, where $t > 0$ is a given parameter. Denote $\mathrm{dist}(x, C) := \inf_{x' \in C} \|x - x'\|$ for any $x \in \mathcal{X}$ and any set $C \subset \mathcal{X}$.
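For the lasso problem (1), the relevant instance is $p = \lambda\|\cdot\|_1$, whose proximal mapping is the soft-thresholding operator given explicitly in section 3.3. The following NumPy sketch (illustrative only) evaluates it and checks the Moreau identity numerically, using the standard fact that $p^*$ is then the indicator of the $\ell_\infty$-ball of radius $\lambda$, so that $\mathrm{Prox}_{p^*/t}$ is the projection onto that ball.

import numpy as np

def prox_l1(x, tau):
    # Prox_{tau * ||.||_1}(x) = sign(x) * max(|x| - tau, 0)  (soft-thresholding)
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def proj_linf_ball(x, lam):
    # Projection onto {z : ||z||_inf <= lam}; this equals Prox_{p*/t} for
    # p = lam * ||.||_1 and any t > 0, since p* is an indicator function.
    return np.clip(x, -lam, lam)

rng = np.random.default_rng(1)
x, lam, t = rng.standard_normal(6), 0.3, 2.0
lhs = prox_l1(x, t * lam) + t * proj_linf_ball(x / t, lam)
print(np.allclose(lhs, x))  # Moreau identity: Prox_{tp}(x) + t*Prox_{p*/t}(x/t) = x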

Let $F : \mathcal{X} \rightrightarrows \mathcal{Y}$ be a multivalued mapping. We define the graph of $F$ to be the set

$$\mathrm{gph}\,F := \{(x, y) \in \mathcal{X} \times \mathcal{Y} \mid y \in F(x)\};$$

$F^{-1}$, the inverse of $F$, is the multivalued mapping from $\mathcal{Y}$ to $\mathcal{X}$ whose graph is $\{(y, x) \mid (x, y) \in \mathrm{gph}\,F\}$.

Definition 2.1 (error bound). Let $F : \mathcal{X} \rightrightarrows \mathcal{Y}$ be a multivalued mapping and $y \in \mathcal{Y}$ satisfy $F^{-1}(y) \neq \emptyset$. $F$ is said to satisfy the error bound condition for the point $y$ with modulus $\kappa \ge 0$ if there exists $\varepsilon > 0$ such that if $x \in \mathcal{X}$ with $\mathrm{dist}(y, F(x)) \le \varepsilon$, then

(2)    $\mathrm{dist}(x, F^{-1}(y)) \le \kappa\, \mathrm{dist}(y, F(x)).$

The above error bound condition was called the growth condition in [34] and was used to analyze the local linear convergence properties of the proximal point algorithm. Recall that $F : \mathcal{X} \rightrightarrows \mathcal{Y}$ is called a polyhedral multifunction if its graph is the union of finitely many polyhedral convex sets. In [39], Robinson established the following celebrated proposition on the error bound result for polyhedral multifunctions.

Proposition 2.2. Let $F$ be a polyhedral multifunction from $\mathcal{X}$ to $\mathcal{Y}$. Then $F$ satisfies the error bound condition (2) for any point $y \in \mathcal{Y}$ satisfying $F^{-1}(y) \neq \emptyset$, with a common modulus $\kappa \ge 0$.

For later use, we present the following definition of metric subregularity from Chapter 3 in [13].

Definition 2.3 (metric subregularity). Let $F : \mathcal{X} \rightrightarrows \mathcal{Y}$ be a multivalued mapping and $(\bar{x}, \bar{y}) \in \mathrm{gph}\,F$. $F$ is said to be metrically subregular at $\bar{x}$ for $\bar{y}$ with modulus $\kappa \ge 0$ if there exist neighborhoods $U$ of $\bar{x}$ and $V$ of $\bar{y}$ such that

$$\mathrm{dist}(x, F^{-1}(\bar{y})) \le \kappa\, \mathrm{dist}(\bar{y}, F(x) \cap V) \quad \forall x \in U.$$

From the above definition, we see that if $F : \mathcal{X} \rightrightarrows \mathcal{Y}$ satisfies the error bound condition (2) for $\bar{y}$ with the modulus $\kappa$, then it is metrically subregular at $\bar{x}$ for $\bar{y}$ with the same modulus $\kappa$ for any $\bar{x} \in F^{-1}(\bar{y})$.

The following definition on essential smoothness is taken from [40, section 26].

Definition 2.4 (essential smoothness). A proper convex function $f$ on $\mathcal{X}$ is essentially smooth if $f$ is differentiable on $\mathrm{int}(\mathrm{dom}\, f) \neq \emptyset$ and $\lim_{k\to\infty} \|\nabla f(x^k)\| = +\infty$ whenever $\{x^k\}$ is a sequence in $\mathrm{int}(\mathrm{dom}\, f)$ converging to a boundary point $x$ of $\mathrm{int}(\mathrm{dom}\, f)$.

In particular, a smooth convex function on $\mathcal{X}$ is essentially smooth. Moreover, if a closed proper convex function $f$ is strictly convex on $\mathrm{dom}\, f$, then its conjugate $f^*$ is essentially smooth [40, Theorem 26.3]. Consider the following composite convex optimization model:

(3)    $\max \; \{-(f(x) := h(\mathcal{A}x) - \langle c, x\rangle + p(x))\},$

where $\mathcal{A} : \mathcal{X} \to \mathcal{Y}$ is a linear map, $h : \mathcal{Y} \to \Re$ and $p : \mathcal{X} \to (-\infty, +\infty]$ are two closed proper convex functions, and $c \in \mathcal{X}$ is a given vector. The dual of (3) can be written as

(4)    $\min \; \{h^*(y) + p^*(z) \mid \mathcal{A}^* y + z = c\},$


where $h^*$ and $p^*$ are the Fenchel conjugate functions of $h$ and $p$, respectively. Throughout this section, we make the following blanket assumption on $h^*(\cdot)$ and $p^*(\cdot)$.

Assumption 2.5. $h^*(\cdot)$ is essentially smooth, and $p^*(\cdot)$ is either an indicator function $\delta_P(\cdot)$ or a support function $\delta_P^*(\cdot)$ for some nonempty polyhedral convex set $P \subseteq \mathcal{X}$. Moreover, $\nabla h^*$ is locally Lipschitz continuous and directionally differentiable on $\mathrm{int}(\mathrm{dom}\, h^*)$.
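For instance, the lasso problem (1) is obtained from (3) by taking $h(y) = \frac{1}{2}\|y - b\|^2$, $c = 0$, and $p = \lambda\|\cdot\|_1$; the standard conjugacy computations below are a sketch of this specialization (not spelled out at this point in the text):

$$h^*(y) = \frac{1}{2}\|y\|^2 + \langle b, y\rangle, \qquad p^*(z) = \delta_{B_\infty^\lambda}(z), \quad B_\infty^\lambda := \{z \mid \|z\|_\infty \le \lambda\},$$

so that (4) becomes

$$\min_{y, z} \; \Big\{ \frac{1}{2}\|y\|^2 + \langle b, y\rangle + \delta_{B_\infty^\lambda}(z) \;\Big|\; \mathcal{A}^* y + z = 0 \Big\}.$$

In this case $\nabla h^*(y) = y + b$ is globally Lipschitz and $p^*$ is the indicator of a polyhedral set, so Assumption 2.5 holds.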

Under Assumption 2.5, by [40, Theorem 26.1], we know that $\partial h^*(y) = \emptyset$ whenever $y \notin \mathrm{int}(\mathrm{dom}\, h^*)$. Denote by $l$ the Lagrangian function for (4):

(5)    $l(y, z; x) := h^*(y) + p^*(z) - \langle x, \mathcal{A}^* y + z - c\rangle \quad \forall (y, z, x) \in \mathcal{Y} \times \mathcal{X} \times \mathcal{X}.$

Corresponding to the closed proper convex function $f$ in the objective of (3) and the convex-concave function $l$ in (5), define the maximal monotone operators $\mathcal{T}_f$ and $\mathcal{T}_l$ [42] by

$$\mathcal{T}_f(x) := \partial f(x), \qquad \mathcal{T}_l(y, z, x) := \{(y', z', x') \mid (y', z', -x') \in \partial l(y, z; x)\},$$

whose inverses are given, respectively, by

$$\mathcal{T}_f^{-1}(x') := \partial f^*(x'), \qquad \mathcal{T}_l^{-1}(y', z', x') := \{(y, z, x) \mid (y', z', -x') \in \partial l(y, z; x)\}.$$

Unlike the case for $\mathcal{T}_f$ [33, 47, 58], stability results of $\mathcal{T}_l$, which correspond to the perturbations of both primal and dual solutions, are very limited. Next, as a tool for studying the convergence rate of Ssnal, we shall establish a theorem which reveals the metric subregularity of $\mathcal{T}_l$ under some mild assumptions.

The KKT system associated with problem (4) is given as follows:

$$0 \in \partial h^*(y) - \mathcal{A}x, \quad 0 \in -x + \partial p^*(z), \quad 0 = \mathcal{A}^* y + z - c, \quad (x, y, z) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{X}.$$

Assume that the above KKT system has at least one solution. This existence assumption, together with the essential smoothness assumption on $h^*$, implies that the above KKT system can be equivalently rewritten as

(6)    $0 = \nabla h^*(y) - \mathcal{A}x, \quad 0 \in -x + \partial p^*(z), \quad 0 = \mathcal{A}^* y + z - c, \quad (x, y, z) \in \mathcal{X} \times \mathrm{int}(\mathrm{dom}\, h^*) \times \mathcal{X}.$

Therefore, under the above assumptions, we only need to focus on $\mathrm{int}(\mathrm{dom}\, h^*) \times \mathcal{X}$ when solving problem (4). Let $(\bar{y}, \bar{z})$ be an optimal solution to problem (4). Then we know that the set of the Lagrangian multipliers associated with $(\bar{y}, \bar{z})$, denoted as $\mathcal{M}(\bar{y}, \bar{z})$, is nonempty. Define the critical cone associated with (4) at $(\bar{y}, \bar{z})$ as follows:

(7)    $\mathcal{C}(\bar{y}, \bar{z}) := \big\{ (d_1, d_2) \in \mathcal{Y} \times \mathcal{X} \mid \mathcal{A}^* d_1 + d_2 = 0, \ \langle \nabla h^*(\bar{y}), d_1\rangle + (p^*)'(\bar{z}; d_2) = 0, \ d_2 \in \mathcal{T}_{\mathrm{dom}(p^*)}(\bar{z}) \big\},$

where $(p^*)'(\bar{z}; d_2)$ is the directional derivative of $p^*$ at $\bar{z}$ with respect to $d_2$, and $\mathcal{T}_{\mathrm{dom}(p^*)}(\bar{z})$ is the tangent cone of $\mathrm{dom}(p^*)$ at $\bar{z}$. When the conjugate function $p^*$ is taken to be the indicator function of a nonempty polyhedral set $P$, the above definition reduces directly to the standard definition of the critical cone in the nonlinear programming setting.


Definition 2.6 (second order sufficient condition). Let $(\bar{y}, \bar{z}) \in \mathcal{Y} \times \mathcal{X}$ be an optimal solution to problem (4) with $\mathcal{M}(\bar{y}, \bar{z}) \neq \emptyset$. We say that the second order sufficient condition for problem (4) holds at $(\bar{y}, \bar{z})$ if

$$\langle d_1, (\nabla h^*)'(\bar{y}; d_1)\rangle > 0 \quad \forall\, 0 \neq (d_1, d_2) \in \mathcal{C}(\bar{y}, \bar{z}).$$

By building on the proof ideas from the literature on nonlinear programming problems [11, 27, 24], we are able to prove the following result on the metric subregularity of $\mathcal{T}_l$. This allows us to prove the linear and even the asymptotic superlinear convergence of the sequences generated by the Ssnal algorithm to be presented in the next section, even when the objective in problem (3) does not possess the piecewise linear-quadratic structure as in the lasso problem (1).

Theorem 2.7. Assume that the KKT system (6) has at least one solution and denote it as $(\bar{x}, \bar{y}, \bar{z})$. Suppose that Assumption 2.5 holds and that the second order sufficient condition for problem (4) holds at $(\bar{y}, \bar{z})$. Then $\mathcal{T}_l$ is metrically subregular at $(\bar{y}, \bar{z}, \bar{x})$ for the origin.

Proof. First we claim that there exists a neighborhood $U$ of $(\bar{x}, \bar{y}, \bar{z})$ such that, for any $w = (u_1, u_2, v) \in \mathcal{W} := \mathcal{Y} \times \mathcal{X} \times \mathcal{X}$ with $\|w\|$ sufficiently small, any solution $(x(w), y(w), z(w)) \in U$ of the perturbed KKT system

(8)    $0 = \nabla h^*(y) - \mathcal{A}x - u_1, \quad 0 \in -x - u_2 + \partial p^*(z), \quad 0 = \mathcal{A}^* y + z - c - v$

satisfies the following estimate:

(9)    $\|(y(w), z(w)) - (\bar{y}, \bar{z})\| = O(\|w\|).$

For the sake of contradiction, suppose that our claim is not true. Then there exist some sequences $\{w^k = (u_1^k, u_2^k, v^k)\}$ and $\{(x^k, y^k, z^k)\}$ such that $\|w^k\| \to 0$ and $(x^k, y^k, z^k) \to (\bar{x}, \bar{y}, \bar{z})$, for each $k$ the point $(x^k, y^k, z^k)$ is a solution of (8) for $w = w^k$, and

$$\|(y^k, z^k) - (\bar{y}, \bar{z})\| > \gamma_k \|w^k\|$$

with some $\gamma_k > 0$ such that $\gamma_k \to \infty$. Passing to a subsequence if necessary, we assume that $\big((y^k, z^k) - (\bar{y}, \bar{z})\big)/\|(y^k, z^k) - (\bar{y}, \bar{z})\|$ converges to some $\xi = (\xi_1, \xi_2) \in \mathcal{Y} \times \mathcal{X}$ with $\|\xi\| = 1$. Then, setting $t_k := \|(y^k, z^k) - (\bar{y}, \bar{z})\|$ and passing to a subsequence further if necessary, by the local Lipschitz continuity and the directional differentiability of $\nabla h^*(\cdot)$ at $\bar{y}$, we know that for all $k$ sufficiently large,

$$\begin{aligned} \nabla h^*(y^k) - \nabla h^*(\bar{y}) &= \nabla h^*(\bar{y} + t_k \xi_1) - \nabla h^*(\bar{y}) + \nabla h^*(y^k) - \nabla h^*(\bar{y} + t_k \xi_1)\\ &= t_k (\nabla h^*)'(\bar{y}; \xi_1) + o(t_k) + O(\|y^k - \bar{y} - t_k \xi_1\|)\\ &= t_k (\nabla h^*)'(\bar{y}; \xi_1) + o(t_k). \end{aligned}$$

Denote, for each $k$, $\tilde{x}^k := x^k + u_2^k$. Simple calculations show that for all $k$ sufficiently large,

(10)    $0 = \nabla h^*(y^k) - \nabla h^*(\bar{y}) - \mathcal{A}(\tilde{x}^k - \bar{x}) + \mathcal{A}u_2^k - u_1^k = t_k (\nabla h^*)'(\bar{y}; \xi_1) + o(t_k) - \mathcal{A}(\tilde{x}^k - \bar{x})$

and

(11)    $0 = \mathcal{A}^*(y^k - \bar{y}) + (z^k - \bar{z}) - v^k = t_k(\mathcal{A}^* \xi_1 + \xi_2) + o(t_k).$


Dividing both sides of (11) by $t_k$ and then taking limits, we obtain

(12)    $\mathcal{A}^* \xi_1 + \xi_2 = 0,$

which further implies that

(13)    $\langle \nabla h^*(\bar{y}), \xi_1\rangle = \langle \mathcal{A}\bar{x}, \xi_1\rangle = -\langle \bar{x}, \xi_2\rangle.$

Since $\tilde{x}^k \in \partial p^*(z^k)$, we know that for all $k$ sufficiently large,

$$z^k = \bar{z} + t_k \xi_2 + o(t_k) \in \mathrm{dom}\, p^*.$$

That is, $\xi_2 \in \mathcal{T}_{\mathrm{dom}\, p^*}(\bar{z})$. According to the structure of $p^*$, we separate our discussions into two cases.

Case I. There exists a nonempty polyhedral convex set $P$ such that $p^*(z) = \delta_P^*(z)$ for all $z \in \mathcal{X}$. Then, for each $k$, it holds that

$$\tilde{x}^k = \Pi_P(z^k + \tilde{x}^k).$$

By [14, Theorem 4.1.1], we have

(14)    $\tilde{x}^k = \Pi_P(z^k + \tilde{x}^k) = \Pi_P(\bar{z} + \bar{x}) + \Pi_C(z^k - \bar{z} + \tilde{x}^k - \bar{x}) = \bar{x} + \Pi_C(z^k - \bar{z} + \tilde{x}^k - \bar{x}),$

where $C$ is the critical cone of $P$ at $\bar{z} + \bar{x}$, i.e.,

$$C \equiv \mathcal{C}_P(\bar{z} + \bar{x}) := \mathcal{T}_P(\bar{x}) \cap \bar{z}^{\perp}.$$

Since $C$ is a polyhedral cone, we know from [14, Proposition 4.1.4] that $\Pi_C(\cdot)$ is a piecewise linear function, i.e., there exist a positive integer $l$ and orthogonal projectors $B_1, \ldots, B_l$ such that for any $x \in \mathcal{X}$,

$$\Pi_C(x) \in \{B_1 x, \ldots, B_l x\}.$$

By restricting to a subsequence if necessary, we may further assume that there exists $1 \le j' \le l$ such that for all $k \ge 1$,

(15)    $\Pi_C(z^k - \bar{z} + \tilde{x}^k - \bar{x}) = B_{j'}(z^k - \bar{z} + \tilde{x}^k - \bar{x}) = \Pi_{\mathrm{Range}(B_{j'})}(z^k - \bar{z} + \tilde{x}^k - \bar{x}).$

Denote $L := \mathrm{Range}(B_{j'})$. Combining (14) and (15), we get

$$C \cap L \ni (\tilde{x}^k - \bar{x}) \perp (z^k - \bar{z}) \in C^{\circ} \cap L^{\perp},$$

where $C^{\circ}$ is the polar cone of $C$. Since $\tilde{x}^k - \bar{x} \in C$, we have $\langle \tilde{x}^k - \bar{x}, \bar{z}\rangle = 0$, which, together with $\langle \tilde{x}^k - \bar{x}, z^k - \bar{z}\rangle = 0$, implies $\langle \tilde{x}^k - \bar{x}, z^k\rangle = 0$. Thus for all $k$ sufficiently large,

$$\langle z^k, \tilde{x}^k\rangle = \langle z^k, \bar{x}\rangle = \langle \bar{z} + t_k \xi_2 + o(t_k), \bar{x}\rangle,$$

and it follows that

(16)    $(p^*)'(\bar{z}; \xi_2) = \lim_{k\to\infty} \dfrac{\delta_P^*(\bar{z} + t_k \xi_2) - \delta_P^*(\bar{z})}{t_k} = \lim_{k\to\infty} \dfrac{\delta_P^*(z^k) - \delta_P^*(\bar{z})}{t_k} = \lim_{k\to\infty} \dfrac{\langle z^k, \tilde{x}^k\rangle - \langle \bar{z}, \bar{x}\rangle}{t_k} = \langle \bar{x}, \xi_2\rangle.$


By (12), (13), and (16), we know that $(\xi_1, \xi_2) \in \mathcal{C}(\bar{y}, \bar{z})$. Since $C \cap L$ is a polyhedral convex cone, we know from [40, Theorem 19.3] that $\mathcal{A}(C \cap L)$ is also a polyhedral convex cone, which, together with (10), implies

$$(\nabla h^*)'(\bar{y}; \xi_1) \in \mathcal{A}(C \cap L).$$

Therefore, there exists a vector $\eta \in C \cap L$ such that

$$\langle \xi_1, (\nabla h^*)'(\bar{y}; \xi_1)\rangle = \langle \xi_1, \mathcal{A}\eta\rangle = -\langle \xi_2, \eta\rangle = 0,$$

where the last equality follows from the fact that $\xi_2 \in C^{\circ} \cap L^{\perp}$. Note that the last inclusion holds since the polyhedral convex cone $C^{\circ} \cap L^{\perp}$ is closed and

$$\xi_2 = \lim_{k\to\infty} \frac{z^k - \bar{z}}{t_k} \in C^{\circ} \cap L^{\perp}.$$

As $0 \neq \xi = (\xi_1, \xi_2) \in \mathcal{C}(\bar{y}, \bar{z})$ but $\langle \xi_1, (\nabla h^*)'(\bar{y}; \xi_1)\rangle = 0$, this contradicts the assumption that the second order sufficient condition holds for (4) at $(\bar{y}, \bar{z})$. Thus, we have proved our claim for Case I.

Case II. There exists a nonempty polyhedral convex set $P$ such that $p^*(z) = \delta_P(z)$ for all $z \in \mathcal{X}$. Then we know that for each $k$,

$$z^k = \Pi_P(z^k + \tilde{x}^k).$$

Since $(\delta_P)'(\bar{z}; d_2) = 0$ for all $d_2 \in \mathcal{T}_{\mathrm{dom}(p^*)}(\bar{z})$, the critical cone in (7) now takes the following form:

$$\mathcal{C}(\bar{y}, \bar{z}) = \big\{ (d_1, d_2) \in \mathcal{Y} \times \mathcal{X} \mid \mathcal{A}^* d_1 + d_2 = 0, \ \langle \nabla h^*(\bar{y}), d_1\rangle = 0, \ d_2 \in \mathcal{T}_{\mathrm{dom}(p^*)}(\bar{z}) \big\}.$$

Similar to Case I, without loss of generality, we can assume that there exists a subspace $L$ such that for all $k \ge 1$,

$$C \cap L \ni (z^k - \bar{z}) \perp (\tilde{x}^k - \bar{x}) \in C^{\circ} \cap L^{\perp},$$

where

$$C \equiv \mathcal{C}_P(\bar{z} + \bar{x}) := \mathcal{T}_P(\bar{z}) \cap \bar{x}^{\perp}.$$

Since $C \cap L$ is a polyhedral convex cone, we know

$$\xi_2 = \lim_{k\to\infty} \frac{z^k - \bar{z}}{t_k} \in C \cap L,$$

and consequently $\langle \xi_2, \bar{x}\rangle = 0$, which, together with (12) and (13), implies $\xi = (\xi_1, \xi_2) \in \mathcal{C}(\bar{y}, \bar{z})$. By (10) and the fact that $\mathcal{A}(C^{\circ} \cap L^{\perp})$ is a polyhedral convex cone, we know that

$$(\nabla h^*)'(\bar{y}; \xi_1) \in \mathcal{A}(C^{\circ} \cap L^{\perp}).$$

Therefore, there exists a vector $\eta \in C^{\circ} \cap L^{\perp}$ such that

$$\langle \xi_1, (\nabla h^*)'(\bar{y}; \xi_1)\rangle = \langle \xi_1, \mathcal{A}\eta\rangle = -\langle \xi_2, \eta\rangle = 0.$$

Since $\xi = (\xi_1, \xi_2) \neq 0$, we arrive at a contradiction to the assumed second order sufficient condition. So our claim is also true for this case.

In summary, we have proven that there exists a neighborhood $U$ of $(\bar{x}, \bar{y}, \bar{z})$ such that, for any $w$ close enough to the origin, (9) holds for any solution $(x(w), y(w), z(w)) \in U$ to the perturbed KKT system (8). Next we show that $\mathcal{T}_l$ is metrically subregular at $(\bar{y}, \bar{z}, \bar{x})$ for the origin.

Define the mapping $\Theta_{\mathrm{KKT}} : \mathcal{X} \times \mathcal{Y} \times \mathcal{X} \times \mathcal{W} \to \mathcal{Y} \times \mathcal{X} \times \mathcal{X}$ by

$$\Theta_{\mathrm{KKT}}(x, y, z, w) := \begin{pmatrix} \nabla h^*(y) - \mathcal{A}x - u_1 \\ z - \mathrm{Prox}_{p^*}(z + x + u_2) \\ \mathcal{A}^* y + z - c - v \end{pmatrix} \quad \forall (x, y, z, w) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{X} \times \mathcal{W},$$

and define the mapping $\theta : \mathcal{X} \to \mathcal{Y} \times \mathcal{X} \times \mathcal{X}$ as follows:

$$\theta(x) := \Theta_{\mathrm{KKT}}(x, \bar{y}, \bar{z}, 0) \quad \forall x \in \mathcal{X}.$$

Then we have $x \in \mathcal{M}(\bar{y}, \bar{z})$ if and only if $\theta(x) = 0$. Since $\mathrm{Prox}_{p^*}(\cdot)$ is a piecewise affine function, $\theta(\cdot)$ is a piecewise affine function and thus a polyhedral multifunction. By using Proposition 2.2 and shrinking the neighborhood $U$ if necessary, for any $w$ close enough to the origin and any solution $(x(w), y(w), z(w)) \in U$ of the perturbed KKT system (8), we have

$$\begin{aligned} \mathrm{dist}(x(w), \mathcal{M}(\bar{y}, \bar{z})) &= O(\|\theta(x(w))\|)\\ &= O(\|\Theta_{\mathrm{KKT}}(x(w), \bar{y}, \bar{z}, 0) - \Theta_{\mathrm{KKT}}(x(w), y(w), z(w), w)\|)\\ &= O(\|w\| + \|(y(w), z(w)) - (\bar{y}, \bar{z})\|), \end{aligned}$$

which, together with (9), implies the existence of a constant $\kappa \ge 0$ such that

(17)    $\|(y(w), z(w)) - (\bar{y}, \bar{z})\| + \mathrm{dist}(x(w), \mathcal{M}(\bar{y}, \bar{z})) \le \kappa \|w\|.$

Thus, by Definition 2.3, we have proven that $\mathcal{T}_l$ is metrically subregular at $(\bar{y}, \bar{z}, \bar{x})$ for the origin. The proof of the theorem is complete.

Remark 2.8. For convex piecewise linear-quadratic programming problems, such as the $\ell_1$ and elastic net regularized LS problems, we know from J. Sun's thesis [46] on the characterization of the subdifferentials of convex piecewise linear-quadratic functions (see also [43, Proposition 12.30]) that the corresponding operators $\mathcal{T}_l$ and $\mathcal{T}_f$ are polyhedral multifunctions, and thus, by Proposition 2.2, the error bound condition holds. Here we emphasize again that the error bound condition holds for the lasso problem (1). Moreover, $\mathcal{T}_f$ associated with the $\ell_1$ or elastic net regularized logistic regression model, i.e., for a given vector $b \in \Re^m$ the loss function $h : \Re^m \to \Re$ in problem (3) takes the form

$$h(y) = \sum_{i=1}^{m} \log(1 + e^{-b_i y_i}) \quad \forall y \in \Re^m,$$

also satisfies the error bound condition [33, 47]. Meanwhile, from Theorem 2.7, we know that $\mathcal{T}_l$ corresponding to the $\ell_1$ or the elastic net regularized logistic regression model is metrically subregular at any solution to the KKT system (6) for the origin.

3. An augmented Lagrangian method with asymptotic superlinear convergence. Recall the general convex composite model (3):

(P)    $\max \; \{-(f(x) := h(\mathcal{A}x) - \langle c, x\rangle + p(x))\}$

and its dual

(D)    $\min \; \{h^*(y) + p^*(z) \mid \mathcal{A}^* y + z = c\}.$


In this section, we shall propose an asymptotically superlinearly convergent augmented Lagrangian method for solving (P) and (D). We make the following standing assumptions regarding the function $h$.

Assumption 3.1.
(a) $h : \mathcal{Y} \to \Re$ is a convex differentiable function whose gradient is $1/\alpha_h$-Lipschitz continuous, i.e.,

$$\|\nabla h(y') - \nabla h(y)\| \le (1/\alpha_h)\|y' - y\| \quad \forall y', y \in \mathcal{Y}.$$

(b) $h$ is essentially locally strongly convex [21], i.e., for any compact and convex set $K \subset \mathrm{dom}\, \partial h$, there exists $\beta_K > 0$ such that

$$(1-\lambda)h(y') + \lambda h(y) \ge h\big((1-\lambda)y' + \lambda y\big) + \frac{1}{2}\beta_K \lambda(1-\lambda)\|y' - y\|^2 \quad \forall y', y \in K$$

for all $\lambda \in [0, 1]$.

Many commonly used loss functions in the machine learning literature satisfy the above mild assumptions. For example, $h$ can be the loss function in the linear regression, logistic regression, and Poisson regression models. While the strict convexity of a convex function is closely related to the differentiability of its conjugate, the essential local strong convexity in Assumption 3.1(b) for a convex function was first proposed in [21] to obtain a characterization of the local Lipschitz continuity of the gradient of its conjugate function.

The aforementioned assumptions on $h$ imply the following useful properties of $h^*$. First, by [43, Proposition 12.60], we know that $h^*$ is strongly convex with modulus $\alpha_h$. Second, by [21, Corollary 4.4], we know that $h^*$ is essentially smooth and $\nabla h^*$ is locally Lipschitz continuous on $\mathrm{int}(\mathrm{dom}\, h^*)$. If the solution set to the KKT system associated with (P) and (D) is further assumed to be nonempty, similar to what we have discussed in the last section, one only needs to focus on $\mathrm{int}(\mathrm{dom}\, h^*) \times \mathcal{X}$ when solving (D). Given $\sigma > 0$, the augmented Lagrangian function associated with (D) is given by

$$\mathcal{L}_\sigma(y, z; x) := l(y, z; x) + \frac{\sigma}{2}\|\mathcal{A}^* y + z - c\|^2 \quad \forall (y, z, x) \in \mathcal{Y} \times \mathcal{X} \times \mathcal{X},$$

where the Lagrangian function $l(\cdot)$ is defined in (5).

3.1. SSNAL: A semismooth Newton augmented Lagrangian algorithm for (D). The detailed steps of our algorithm Ssnal for solving (D) are given as follows. Since a semismooth Newton method will be employed to solve the subproblems involved in this prototype of the inexact augmented Lagrangian method [42], it is natural for us to call our algorithm Ssnal.

Algorithm SSNAL: An inexact augmented Lagrangian method for (D).
Let $\sigma_0 > 0$ be a given parameter. Choose $(y^0, z^0, x^0) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^* \times \mathcal{X}$. For $k = 0, 1, \ldots$, perform the following steps in each iteration.
Step 1. Compute

(18)    $(y^{k+1}, z^{k+1}) \approx \arg\min \{\Psi_k(y, z) := \mathcal{L}_{\sigma_k}(y, z; x^k)\}.$

Step 2. Compute $x^{k+1} = x^k - \sigma_k(\mathcal{A}^* y^{k+1} + z^{k+1} - c)$ and update $\sigma_{k+1} \uparrow \sigma_\infty \le \infty$.
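A minimal Python sketch of this outer loop is given below (illustrative only; solve_subproblem is an assumed placeholder standing for an inner solver such as Algorithm Ssn presented later, and the simple residual-based stopping test and doubling rule for sigma are ours, not the paper's):

import numpy as np

def ssnal_outer(A, c, solve_subproblem, y, z, x, sigma=1.0,
                sigma_max=1e6, max_iter=100, tol=1e-6):
    # Skeleton of Algorithm Ssnal for (D): min h*(y) + p*(z) s.t. A^T y + z = c,
    # with A the matrix representation of the linear map.
    for k in range(max_iter):
        y, z = solve_subproblem(y, z, x, sigma)   # Step 1: inexactly solve (18)
        residual = A.T @ y + z - c                # A^* y^{k+1} + z^{k+1} - c
        x = x - sigma * residual                  # Step 2: multiplier update
        if np.linalg.norm(residual) <= tol:       # crude surrogate stopping test
            break
        sigma = min(2.0 * sigma, sigma_max)       # sigma_{k+1} increasing to sigma_inf
    return y, z, x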

Next, we shall adapt the results developed in [41, 42, 34] to establish the global and local superlinear convergence of our algorithm.


Since the inner problems cannot be solved exactly, we use the following standard stopping criterion studied in [41, 42] for approximately solving (18):

(A)    $\Psi_k(y^{k+1}, z^{k+1}) - \inf \Psi_k \le \varepsilon_k^2/(2\sigma_k), \quad \sum_{k=0}^{\infty} \varepsilon_k < \infty.$

Now we can state the global convergence of Algorithm Ssnal from [41, 42] without much difficulty.

Theorem 3.2. Suppose that Assumption 3.1 holds and that the solution set to (P) is nonempty. Let $\{(y^k, z^k, x^k)\}$ be the infinite sequence generated by Algorithm Ssnal with stopping criterion (A). Then the sequence $\{x^k\}$ is bounded and converges to an optimal solution of (P). In addition, $\{(y^k, z^k)\}$ is also bounded and converges to the unique optimal solution $(y^*, z^*) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$ of (D).

Proof. Since the solution set to (P) is assumed to be nonempty, the optimal value of (P) is finite. From Assumption 3.1(a), we have that $\mathrm{dom}\, h = \mathcal{Y}$ and $h^*$ is strongly convex [43, Proposition 12.60]. Then, by Fenchel's duality theorem [40, Corollary 31.2.1], we know that the solution set to (D) is nonempty, and the optimal value of (D) is finite and equal to the optimal value of (P). That is, the solution set to the KKT system associated with (P) and (D) is nonempty. The uniqueness of the optimal solution $(y^*, z^*) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathcal{X}$ of (D) then follows directly from the strong convexity of $h^*$. By combining this uniqueness with [42, Theorem 4], one can easily obtain the boundedness of $\{(y^k, z^k)\}$ and the other desired results.

We need the following stopping criteria for the local convergence analysis:

(B1)    $\Psi_k(y^{k+1}, z^{k+1}) - \inf \Psi_k \le (\delta_k^2/(2\sigma_k))\|x^{k+1} - x^k\|^2, \quad \sum_{k=0}^{\infty} \delta_k < +\infty;$

(B2)    $\mathrm{dist}(0, \partial \Psi_k(y^{k+1}, z^{k+1})) \le (\delta_k'/\sigma_k)\|x^{k+1} - x^k\|, \quad 0 \le \delta_k' \to 0,$

where $x^{k+1} = x^k - \sigma_k(\mathcal{A}^* y^{k+1} + z^{k+1} - c)$.

Theorem 3.3. Assume that Assumption 3.1 holds and that the solution set $\Omega$ to (P) is nonempty. Suppose that $\mathcal{T}_f$ satisfies the error bound condition (2) for the origin with modulus $a_f$. Let $\{(y^k, z^k, x^k)\}$ be any infinite sequence generated by Algorithm Ssnal with stopping criteria (A) and (B1). Then the sequence $\{x^k\}$ converges to $x^* \in \Omega$ and, for all $k$ sufficiently large,

(19)    $\mathrm{dist}(x^{k+1}, \Omega) \le \theta_k\, \mathrm{dist}(x^k, \Omega),$

where $\theta_k = \big(a_f(a_f^2 + \sigma_k^2)^{-1/2} + 2\delta_k\big)(1 - \delta_k)^{-1} \to \theta_\infty = a_f(a_f^2 + \sigma_\infty^2)^{-1/2} < 1$ as $k \to +\infty$. Moreover, the sequence $\{(y^k, z^k)\}$ converges to the unique optimal solution $(y^*, z^*) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$ of (D).

Moreover, if $\mathcal{T}_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with modulus $a_l$ and the stopping criterion (B2) is also used, then for all $k$ sufficiently large,

$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta_k' \|x^{k+1} - x^k\|,$$

where $\theta_k' = a_l(1 + \delta_k')/\sigma_k$ with $\lim_{k\to\infty} \theta_k' = a_l/\sigma_\infty$.

Proof. The first part of the theorem follows from [34, Theorem 2.1], [42, Proposition 7 and Theorem 5], and Theorem 3.2. To prove the second part, we recall that if $\mathcal{T}_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with the modulus $a_l$ and $(y^k, z^k, x^k) \to (y^*, z^*, x^*)$, then for all $k$ sufficiently large,

$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| + \mathrm{dist}(x^{k+1}, \Omega) \le a_l\, \mathrm{dist}(0, \mathcal{T}_l(y^{k+1}, z^{k+1}, x^{k+1})).$$

Therefore, by the estimate (4.21) in [42] and the stopping criterion (B2), we obtain that for all $k$ sufficiently large,

$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \frac{a_l(1 + \delta_k')}{\sigma_k}\|x^{k+1} - x^k\|.$$

This completes the proof of Theorem 3.3.

Remark 3.4. In the above theorem, when $\sigma_\infty = +\infty$, the inequality (19) directly implies that $\{x^k\}$ converges Q-superlinearly. Moreover, recent advances in [12] also established the asymptotic R-superlinear convergence of the dual iteration sequence $\{(y^k, z^k)\}$. Indeed, from [12, Proposition 4.1], under the same conditions of Theorem 3.3, we have that for $k$ sufficiently large,

(20)    $\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta_k' \|x^{k+1} - x^k\| \le \theta_k'(1 - \delta_k)^{-1}\, \mathrm{dist}(x^k, \Omega),$

where $\theta_k'(1 - \delta_k)^{-1} \to a_l/\sigma_\infty$. Thus, if $\sigma_\infty = \infty$, the Q-superlinear convergence of $\{x^k\}$ and (20) further imply that $\{(y^k, z^k)\}$ converges R-superlinearly.

We should emphasize here that, by combining Remarks 2.8 and 3.4 and Theorem 3.3, our Algorithm Ssnal is guaranteed to produce an asymptotically superlinearly convergent sequence when used to solve (D) for many commonly used regularizers and loss functions. In particular, the Ssnal algorithm is asymptotically superlinearly convergent when applied to the dual of (1).

3.2. Solving the augmented Lagrangian subproblems. Here we shall propose an efficient semismooth Newton algorithm to solve the inner subproblems in the augmented Lagrangian method (18). That is, for some fixed $\sigma > 0$ and $x \in \mathcal{X}$, we aim to solve

(21)    $\min_{y, z} \; \Psi(y, z) := \mathcal{L}_\sigma(y, z; x).$

Since $\Psi(\cdot, \cdot)$ is a strongly convex function, we have that, for any $\alpha \in \Re$, the level set $L_\alpha := \{(y, z) \in \mathrm{dom}\, h^* \times \mathrm{dom}\, p^* \mid \Psi(y, z) \le \alpha\}$ is a closed and bounded convex set. Moreover, problem (21) admits a unique optimal solution, denoted as $(\bar{y}, \bar{z}) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$.

Denote, for any $y \in \mathcal{Y}$,

$$\psi(y) := \inf_z \Psi(y, z) = h^*(y) + p^*\big(\mathrm{Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^* y + c)\big) + \frac{1}{2\sigma}\big\|\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c))\big\|^2 - \frac{1}{2\sigma}\|x\|^2.$$

Therefore, if $(\bar{y}, \bar{z}) = \arg\min \Psi(y, z)$, then $(\bar{y}, \bar{z}) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$ can be computed simultaneously by

$$\bar{y} = \arg\min_y \psi(y), \qquad \bar{z} = \mathrm{Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^* \bar{y} + c).$$


Note that $\psi(\cdot)$ is strongly convex and continuously differentiable on $\mathrm{int}(\mathrm{dom}\, h^*)$ with

$$\nabla\psi(y) = \nabla h^*(y) - \mathcal{A}\,\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c)) \quad \forall y \in \mathrm{int}(\mathrm{dom}\, h^*).$$

Thus, $\bar{y}$ can be obtained via solving the following nonsmooth equation:

(22)    $\nabla\psi(y) = 0, \quad y \in \mathrm{int}(\mathrm{dom}\, h^*).$
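For the lasso specialization (so that $\nabla h^*(y) = y + b$, $p = \lambda\|\cdot\|_1$, and $c = 0$, an assumption of this sketch rather than something stated at this point in the text), $\nabla\psi$ can be evaluated as follows:

import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def grad_psi(y, A, b, x, sigma, lam):
    # grad psi(y) = grad h*(y) - A Prox_{sigma*lam*||.||_1}(x - sigma * A^T y),
    # with grad h*(y) = y + b for the lasso loss h(y) = 0.5 * ||y - b||^2.
    xtilde = x - sigma * (A.T @ y)
    return (y + b) - A @ soft_threshold(xtilde, sigma * lam)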

Let $y \in \mathrm{int}(\mathrm{dom}\, h^*)$ be any given point. Since $h^*$ is a convex function with a locally Lipschitz continuous gradient on $\mathrm{int}(\mathrm{dom}\, h^*)$, the following operator is well defined:

$$\hat{\partial}^2\psi(y) := \partial(\nabla h^*)(y) + \sigma \mathcal{A}\, \partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c))\, \mathcal{A}^*,$$

where $\partial(\nabla h^*)(y)$ is the Clarke subdifferential of $\nabla h^*$ at $y$ [10], and $\partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c))$ is the Clarke subdifferential of the Lipschitz continuous mapping $\mathrm{Prox}_{\sigma p}(\cdot)$ at $x - \sigma(\mathcal{A}^* y - c)$. Note that from [10, Proposition 2.3.3 and Theorem 2.6.6] we know that

$$\partial^2\psi(y)\,(d) \subseteq \hat{\partial}^2\psi(y)\,(d) \quad \forall d \in \mathcal{Y},$$

where $\partial^2\psi(y)$ denotes the generalized Hessian of $\psi$ at $y$. Define

(23)    $V := H + \sigma \mathcal{A} U \mathcal{A}^*$

with $H \in \partial^2 h^*(y)$ and $U \in \partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c))$. Then we have $V \in \hat{\partial}^2\psi(y)$. Since $h^*$ is a strongly convex function, we know that $H$ is symmetric positive definite on $\mathcal{Y}$, and thus $V$ is also symmetric positive definite on $\mathcal{Y}$.

Under the mild assumption that $\nabla h^*$ and $\mathrm{Prox}_{\sigma p}$ are strongly semismooth (the definition of which is given next), we can design a superlinearly convergent semismooth Newton method to solve the nonsmooth equation (22).

Definition 3.5 (semismoothness [35, 38, 45]). Let $F : O \subseteq \mathcal{X} \to \mathcal{Y}$ be a locally Lipschitz continuous function on the open set $O$. $F$ is said to be semismooth at $x \in O$ if $F$ is directionally differentiable at $x$ and, for any $V \in \partial F(x + \Delta x)$ with $\Delta x \to 0$,

$$F(x + \Delta x) - F(x) - V\Delta x = o(\|\Delta x\|).$$

$F$ is said to be strongly semismooth at $x$ if $F$ is semismooth at $x$ and

$$F(x + \Delta x) - F(x) - V\Delta x = O(\|\Delta x\|^2).$$

$F$ is said to be a semismooth (respectively, strongly semismooth) function on $O$ if it is semismooth (respectively, strongly semismooth) everywhere in $O$.

Note that it is widely known in the nonsmooth optimization/equation community that continuous piecewise affine functions and twice continuously differentiable functions are all strongly semismooth everywhere. In particular, $\mathrm{Prox}_{\|\cdot\|_1}$, as a Lipschitz continuous piecewise affine function, is strongly semismooth. See [14] for more semismooth and strongly semismooth functions.

Now we can design a semismooth Newton (Ssn) method to solve (22) as follows and could expect to get a fast superlinear or even quadratic convergence.


Algorithm SSN: A semismooth Newton algorithm for solving (22) (SSN($y^0$, $x$, $\sigma$)).
Given $\mu \in (0, 1/2)$, $\eta \in (0, 1)$, $\tau \in (0, 1]$, and $\delta \in (0, 1)$, choose $y^0 \in \mathrm{int}(\mathrm{dom}\, h^*)$. Iterate the following steps for $j = 0, 1, \ldots$.
Step 1. Choose $H_j \in \partial(\nabla h^*)(y^j)$ and $U_j \in \partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y^j - c))$. Let $V_j := H_j + \sigma \mathcal{A} U_j \mathcal{A}^*$. Solve the linear system

(24)    $V_j d + \nabla\psi(y^j) = 0$

exactly or by the conjugate gradient (CG) algorithm to find $d^j$ such that

$$\|V_j d^j + \nabla\psi(y^j)\| \le \min(\eta, \|\nabla\psi(y^j)\|^{1+\tau}).$$

Step 2. (Line search) Set $\alpha_j = \delta^{m_j}$, where $m_j$ is the first nonnegative integer $m$ for which

$$y^j + \delta^m d^j \in \mathrm{int}(\mathrm{dom}\, h^*) \quad \text{and} \quad \psi(y^j + \delta^m d^j) \le \psi(y^j) + \mu \delta^m \langle \nabla\psi(y^j), d^j\rangle.$$

Step 3. Set $y^{j+1} = y^j + \alpha_j d^j$.

The convergence results for the above Ssn algorithm are stated in the next theorem.

Theorem 3.6. Assume that $\nabla h^*(\cdot)$ and $\mathrm{Prox}_{\sigma p}(\cdot)$ are strongly semismooth on $\mathrm{int}(\mathrm{dom}\, h^*)$ and $\mathcal{X}$, respectively. Let the sequence $\{y^j\}$ be generated by Algorithm Ssn. Then $\{y^j\}$ converges to the unique optimal solution $\bar{y} \in \mathrm{int}(\mathrm{dom}\, h^*)$ of the problem in (22), and

$$\|y^{j+1} - \bar{y}\| = O(\|y^j - \bar{y}\|^{1+\tau}).$$

Proof. Since, by [54, Proposition 3.3], $d^j$ is a descent direction, Algorithm Ssn is well defined. By (23), we know that for any $j \ge 0$, $V_j \in \hat{\partial}^2\psi(y^j)$. Then we can prove the conclusion of this theorem by following the proofs of [54, Theorems 3.4 and 3.5]. We omit the details here.

We shall now discuss the implementations of stopping criteria (A), (B1), and (B2) for Algorithm Ssn to solve the subproblem (18) in Algorithm Ssnal. Note that when Algorithm Ssn is applied to minimize $\Psi_k(\cdot)$ to find

$$y^{k+1} = \mathrm{Ssn}(y^k, x^k, \sigma_k) \quad \text{and} \quad z^{k+1} = \mathrm{Prox}_{p^*/\sigma_k}(x^k/\sigma_k - \mathcal{A}^* y^{k+1} + c),$$

we have, by simple calculations and the strong convexity of $h^*$, that

$$\Psi_k(y^{k+1}, z^{k+1}) - \inf \Psi_k = \psi_k(y^{k+1}) - \inf \psi_k \le \frac{1}{2\alpha_h}\|\nabla\psi_k(y^{k+1})\|^2$$

and $(\nabla\psi_k(y^{k+1}), 0) \in \partial\Psi_k(y^{k+1}, z^{k+1})$, where $\psi_k(y) := \inf_z \Psi_k(y, z)$ for all $y \in \mathcal{Y}$. Therefore, the stopping criteria (A), (B1), and (B2) can be achieved by the following implementable criteria:

(A')    $\|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h/\sigma_k}\, \varepsilon_k, \quad \sum_{k=0}^{\infty} \varepsilon_k < \infty;$

(B1')    $\|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h \sigma_k}\, \delta_k \|\mathcal{A}^* y^{k+1} + z^{k+1} - c\|, \quad \sum_{k=0}^{\infty} \delta_k < +\infty;$

(B2')    $\|\nabla\psi_k(y^{k+1})\| \le \delta_k' \|\mathcal{A}^* y^{k+1} + z^{k+1} - c\|, \quad 0 \le \delta_k' \to 0.$


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as $\|\nabla\psi_k(y^{k+1})\|$ is sufficiently small.

3.3. An efficient implementation of SSN for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we shall discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer $p$ is chosen to be $\lambda\|\cdot\|_1$ for some $\lambda > 0$. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction $d^j$ from the linear system (24). So we shall first discuss the solving of this linear system.

Let $(x, y) \in \Re^n \times \Re^m$ and $\sigma > 0$ be given. We consider the following Newton linear system:

(25)    $(H + \sigma A U A^T)\, d = -\nabla\psi(y),$

where $H \in \partial(\nabla h^*)(y)$, $A$ denotes the matrix representation of $\mathcal{A}$ with respect to the standard bases of $\Re^n$ and $\Re^m$, and $U \in \partial\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ with $\tilde{x} := x - \sigma(A^T y - c)$. Since $H$ is a symmetric and positive definite matrix, (25) can be equivalently rewritten as

$$\big(I_m + \sigma(L^{-1}A)U(L^{-1}A)^T\big)(L^T d) = -L^{-1}\nabla\psi(y),$$

where $L$ is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of $H$ such that $H = LL^T$. In many applications, $H$ is usually a sparse matrix. Indeed, when the function $h$ in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrices $H$ are in fact diagonal matrices. That is, the costs of computing $L$ and its inverse are negligible in most situations. Therefore, without loss of generality, we can consider a simplified version of (25) as follows:

(26)    $(I_m + \sigma A U A^T)\, d = -\nabla\psi(y),$

which is precisely the Newton system associated with the standard lasso problem (1). Since $U \in \Re^{n\times n}$ is a diagonal matrix, at first glance, the costs of computing $A U A^T$ and the matrix-vector multiplication $A U A^T d$ for a given vector $d \in \Re^m$ are $O(m^2 n)$ and $O(mn)$, respectively. These computational costs are too expensive when the dimensions of $A$ are large and can make the commonly employed approaches, such as the Cholesky factorization and the conjugate gradient method, inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of $U$ is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level such that they are negligible or at least insignificant compared to other costs. Next we shall show how this can be done by taking full advantage of the sparsity of $U$. This sparsity will be referred to as the second order sparsity of the underlying problem.

For $\tilde{x} = x - \sigma(A^T y - c)$, in our computations we can always choose $U = \mathrm{Diag}(u)$, the diagonal matrix whose $i$th diagonal element $u_i$ is given by

$$u_i = \begin{cases} 0 & \text{if } |\tilde{x}_i| \le \sigma\lambda,\\ 1 & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, n.$$

Since $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x}) = \mathrm{sign}(\tilde{x}) \circ \max\{|\tilde{x}| - \sigma\lambda, 0\}$, it is not difficult to see that $U \in \partial\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$. Let $\mathcal{J} := \{j \mid |\tilde{x}_j| > \sigma\lambda,\ j = 1, \ldots, n\}$ and $r := |\mathcal{J}|$, the cardinality of $\mathcal{J}$. By taking the special 0-1 structure of $U$ into consideration, we have that


(27)    $A U A^T = (A U)(A U)^T = A_{\mathcal{J}} A_{\mathcal{J}}^T,$

where $A_{\mathcal{J}} \in \Re^{m \times r}$ is the submatrix of $A$ with those columns not in $\mathcal{J}$ being removed from $A$. (Fig. 1: Reducing the computational costs from $O(m^2 n)$ to $O(m^2 r)$.) Then, by using (27), we know that the costs of computing $A U A^T$ and $A U A^T d$ for a given vector $d$ are now reduced to $O(m^2 r)$ and $O(mr)$, respectively. Due to the sparsity-promoting property of the regularizer $p$, the number $r$ is usually much smaller than $n$. Thus, by exploiting the aforementioned second order sparsity, we can greatly reduce the computational costs in solving the linear system (26) when using the Cholesky factorization. More specifically, the total computational costs of using the Cholesky factorization to solve the linear system are reduced from $O(m^2(m+n))$ to $O(m^2(m+r))$; see Figure 1 for an illustration of the reduction. This means that even if $n$ happens to be extremely large (say, larger than $10^7$), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization as long as both $m$ and $r$ are moderate (say, less than $10^4$). If, in addition, $r \ll m$, which is often the case when $m$ is large and the optimal solutions to the underlying problem are sparse, then instead of factorizing an $m \times m$ matrix, we can make use of the Sherman–Morrison–Woodbury formula [22] to get the inverse of $I_m + \sigma A U A^T$ by inverting a much smaller $r \times r$ matrix as follows:

$$(I_m + \sigma A U A^T)^{-1} = (I_m + \sigma A_{\mathcal{J}} A_{\mathcal{J}}^T)^{-1} = I_m - A_{\mathcal{J}}(\sigma^{-1} I_r + A_{\mathcal{J}}^T A_{\mathcal{J}})^{-1} A_{\mathcal{J}}^T;$$

see Figure 2 for an illustration of the computation of $A_{\mathcal{J}}^T A_{\mathcal{J}}$. (Fig. 2: Further reducing the computational costs to $O(r^2 m)$.) In this case, the total computational costs for solving the Newton linear system (26) are reduced significantly further, from $O(m^2(m+r))$ to $O(r^2(m+r))$. We should emphasize here that this dramatic reduction in computational costs results from the wise combination of the careful examination of the existing second order sparsity in the lasso-type problems and some "smart" numerical linear algebra.
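A minimal NumPy sketch of this reduced solve for the lasso case is given below (illustrative only; dense linear algebra is used here in place of the sparse Cholesky factorization discussed above):

import numpy as np

def newton_direction(A, rhs, xtilde, sigma, lam):
    # Solve (I_m + sigma * A U A^T) d = rhs, where U = Diag(u) with
    # u_i = 1 if |xtilde_i| > sigma*lam and 0 otherwise, so that
    # A U A^T = A_J A_J^T for the index set J of the large components.
    m = A.shape[0]
    J = np.abs(xtilde) > sigma * lam
    AJ = A[:, J]                                   # m x r submatrix
    r = AJ.shape[1]
    if r == 0:
        return rhs.copy()                          # system reduces to I_m d = rhs
    if r <= m:
        # Sherman-Morrison-Woodbury: invert an r x r matrix instead of m x m.
        small = np.eye(r) / sigma + AJ.T @ AJ
        return rhs - AJ @ np.linalg.solve(small, AJ.T @ rhs)
    # Otherwise form and solve the m x m system I_m + sigma * A_J A_J^T directly.
    return np.linalg.solve(np.eye(m) + sigma * (AJ @ AJ.T), rhs)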

From the above arguments, we can see that as long as the number of the nonzero components of $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is small, say, less than $\sqrt{n}$, and $H_j \in \partial(\nabla h^*)(y^j)$ is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low costs. In particular, this is true for the lasso problems admitting sparse solutions. Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). Note that one may argue that, even if the original problem has only sparse solutions, at certain stages one may still encounter the situation that the number of the nonzero components of $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice, since we always start with a sparse feasible point, e.g., the zero vector. Second, even if at certain steps this phenomenon does occur, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter $\sigma$ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with $p(\cdot)$ being chosen to be $\lambda\|\cdot\|_1$.

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has been well documented in the two recent papers [18, 36], which appears to suggest that for some large-scale sparse reconstruction problems, mfIPM^1 and FPC_AS^2 have mostly outperformed the other solvers. Hence, in this section we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP^3 [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparisons, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB, with the step length set to be 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later, they lack the ability to efficiently solve difficult problems such as the large-scale regression problems when the data $\mathcal{A}$ is badly conditioned.

For testing purposes, the regularization parameter $\lambda$ in the lasso problem (1) is chosen as

$$\lambda = \lambda_c \|\mathcal{A}^* b\|_\infty,$$

where $0 < \lambda_c < 1$. In our numerical experiments, we measure the accuracy of an approximate optimal solution $x$ for (1) by using the following relative KKT residual:

$$\eta = \frac{\|x - \mathrm{prox}_{\lambda\|\cdot\|_1}(x - \mathcal{A}^*(\mathcal{A}x - b))\|}{1 + \|x\| + \|\mathcal{A}x - b\|}.$$
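A small NumPy sketch of this accuracy measure (assuming the matrix representation A of the linear map) is as follows:

import numpy as np

def relative_kkt_residual(x, A, b, lam):
    # eta = ||x - prox_{lam*||.||_1}(x - A^T (A x - b))|| / (1 + ||x|| + ||A x - b||)
    grad = A.T @ (A @ x - b)
    v = x - grad
    prox = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
    res = A @ x - b
    return np.linalg.norm(x - prox) / (1.0 + np.linalg.norm(x) + np.linalg.norm(res))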

^1 http://www.maths.ed.ac.uk/ERGO/mfipmcs/
^2 http://www.caam.rice.edu/~optimization/L1/FPC_AS/
^3 http://yelab.net/software/SLEP/


For a given tolerance ε gt 0 we will stop the tested algorithms when η lt ε Forall the tests in this section we set ε = 10minus6 The algorithms will also be stoppedwhen they reach the maximum number of iterations (1000 iterations for our algorithmand mfIPM and 20000 iterations for FPC AS APG ADMM and LADMM) or themaximum computation time of 7 hours All the parameters for mfIPM FPC ASand APG are set to the default values All our computational results are obtainedby running MATLAB (version 84) on a Windows workstation (16-core Intel XeonE5-2650 260GHz 64 G RAM)

41 Numerical results for large-scale regression problems In this sub-section we test all the algorithms with the test instances (A b) obtained from large-scale regression problems in the LIBSVM data sets [8] These data sets are collectedfrom 10-K Corpus [28] and the UCI data repository [30] For computational efficiencyzero columns in A (the matrix representation of A) are removed As suggested in[23] for the data sets pyrim triazines abalone bodyfat housing mpg andspace ga we expand their original features by using polynomial basis functions overthose features For example the last digit in pyrim5 indicates that an order 5 poly-nomial is used to generate the basis functions This naming convention is also usedin the rest of the expanded data sets These test instances shown in Table 1 can bequite difficult in terms of the problem dimensions and the largest eigenvalue of AAlowastwhich is denoted as λmax(AAlowast)

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, m denotes the number of samples, n denotes the number of features, and "nnz" denotes the number of nonzeros in the solution x obtained by Ssnal using the following estimation:

\[
\mathrm{nnz} = \min\Big\{ k \ \Big|\ \sum_{i=1}^{k} |\hat{x}_i| \ge 0.999\,\|x\|_1 \Big\},
\]

where x̂ is obtained by sorting x such that |x̂₁| ≥ ⋯ ≥ |x̂ₙ|. An entry marked as "Error" indicates that the algorithm breaks down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and the entry "00" in the time column means that the computation time is less than 0.5 second.
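A minimal MATLAB sketch of this estimation, assuming x is the computed solution vector, reads as follows.

```matlab
% Estimate the number of "effective" nonzeros in x: the smallest k such that
% the k largest (in magnitude) entries already account for 99.9% of ||x||_1.
xhat    = sort(abs(x), 'descend');           % |xhat_1| >= |xhat_2| >= ... >= |xhat_n|
nnz_est = find(cumsum(xhat) >= 0.999*norm(x,1), 1, 'first');
```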

One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances to the required accuracy after 20000 iterations or 7 hours.

Table 1
Statistics of the UCI test instances.

Probname            m; n              λmax(AA*)
E2006.train         16087; 150348     1.91e+05
log1p.E2006.train   16087; 4265669    5.86e+07
E2006.test          3308; 72812       4.79e+04
log1p.E2006.test    3308; 1771946     1.46e+07
pyrim5              74; 169911        1.22e+06
triazines4          186; 557845       2.07e+07
abalone7            4177; 6435        5.21e+05
bodyfat7            252; 116280       5.29e+04
housing7            506; 77520        3.28e+05
mpg7                392; 3432         1.28e+04
space_ga9           3107; 5005        4.01e+03


Table 2
The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy ε = 10^{-6}). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM. For each instance (m; n) and each value of λc, the table reports nnz together with the relative KKT residual η and the computation time of each of the six solvers.


In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances. In fact, for 3 test instances it breaks down due to some internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with λc = 10^{-3}, we can see that Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20000 iterations and consumes about 2 hours but only produces a rather inaccurate solution.

On the other hand, one can observe that the two second order information based methods, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train with approximately 4.3 million features in 20 seconds (λc = 10^{-3}). Among these two algorithms, Ssnal is clearly far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where it can be up to 300 times faster. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (those with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with the accuracy of 10^{-1} when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point method in exploiting the sparsity in the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix A (the matrix representation of A) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter λ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when λ is not a scalar, based on the recommendation of the referee, we use another popular active-set based solver, PSSas⁴ [44], to replace FPC_AS for the testing. All the parameters for PSSas are set to the default values. In our tests, PSSas will be stopped when it reaches the maximum number of 20000 iterations or the maximum computation time of 7 hours. We note that, by default, PSSas will terminate when its progress is smaller than the threshold 10^{-9}.
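The following minimal MATLAB sketch, assuming A is stored as an explicit matrix with nonzero columns, illustrates this normalization and the induced nonuniform weight vector; it is only meant to make the change of variables explicit.

```matlab
% Normalize the columns of A to unit norm and change variables accordingly.
% With xbar = D*x (D = diag of column norms d), A*x = Abar*xbar and
% lambda*||x||_1 = sum_j (lambda/d_j)*|xbar_j|, so the scaled problem uses
% the nonuniform weight vector w = lambda./d instead of the scalar lambda.
d    = sqrt(sum(A.^2, 1))';              % column norms of A (assumed nonzero)
Abar = bsxfun(@rdivide, A, d');          % columns of Abar have unit norm
w    = lambda ./ d;                      % nonuniform weights on the scaled variable
% After solving min 0.5*||Abar*xbar - b||^2 + ||w .* xbar||_1 for xbar,
% the solution of the original problem is recovered via x = xbar ./ d.
```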

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions based on Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM type of solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy. In fact, PSSas fails on all the test instances when λc = 10^{-4}. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with η = 3.2 × 10^{-3}. (Actually, we also ran PSSas on these test instances without scaling and obtained similar performance; detailed results are omitted to conserve space.)

⁴ https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3
The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy ε = 10^{-6}). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM. For each instance (m; n) and each value of λc, the table reports nnz together with the relative KKT residual η and the computation time of each of the six solvers.


Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for Sparco collection. In this subsection, the test instances (A, b) are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce a 60dB noise to b (as in [18]) by using the MATLAB command b = awgn(b, 60, 'measured'). For these test instances, the matrix representations of the linear maps A are not available. Hence, ADMM will not be tested in this subsection, as it will be extremely expensive, if not impossible, to compute and factorize I + σAA*.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here, we test five choices of λc, i.e., λc = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}. As one can observe, as λc decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions (m, n) and the largest eigenvalue of AA* (λmax(AA*)). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, as well as the computation times (in the format of hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well for these test instances with small Lipschitz constants λmax(AA*). As such, Ssnal does not have a clear advantage as shown in Table 2. Moreover, since the matrix representations of the linear maps A and A* involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases with λc = 10^{-3} and 10^{-4}. We can observe that FPC_AS performs very well when it succeeds in obtaining a solution with the desired accuracy.

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy ε = 10^{-6}, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

λc | nnz | Iteration: a | b | c | d | e | η: a | b | c | d | e | Time: a | b | c | d | e

srcsep1: m = 29166, n = 57344, λmax(AA*) = 3.56
0.16 | 380  | 14 | 28 | 42 | 401 | 89   | 6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7 | 07 | 09 | 05 | 15 | 07
0.10 | 726  | 16 | 38 | 42 | 574 | 161  | 4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7 | 11 | 15 | 05 | 22 | 13
0.06 | 1402 | 19 | 41 | 63 | 801 | 393  | 1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7 | 18 | 18 | 10 | 30 | 32
0.03 | 2899 | 19 | 56 | 110 | 901 | 337 | 2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7 | 28 | 53 | 16 | 33 | 28
0.01 | 6538 | 17 | 88 | 223 | 1401 | 542 | 7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7 | 1:21 | 2:15 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, λmax(AA*) = 4.95
0.16 | 423  | 15 | 29 | 42 | 501 | 127  | 3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7 | 14 | 13 | 07 | 25 | 15
0.10 | 805  | 16 | 37 | 84 | 601 | 212  | 8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7 | 21 | 19 | 16 | 30 | 26
0.06 | 1549 | 19 | 40 | 84 | 901 | 419  | 1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7 | 32 | 28 | 18 | 44 | 50
0.03 | 3254 | 20 | 69 | 128 | 901 | 488 | 1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7 | 1:06 | 1:33 | 26 | 44 | 59
0.01 | 7400 | 21 | 94 | 259 | 2201 | 837 | 8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7 | 1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy ε = 10^{-6}, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

probname; m; n | λc | nnz | η: a | b | c | d | e | Time: a | b | c | d | e

blknheavi       | 10^{-3} | 12    | 5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7 | 01 | 01 | 55 | 08 | 06
1024; 1024      | 10^{-4} | 12    | 9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7 | 01 | 01 | 49 | 07 | 07
srcsep1         | 10^{-3} | 14066 | 1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7 | 5:41 | 42:34 | 13:25 | 1:56 | 4:16
29166; 57344    | 10^{-4} | 19306 | 9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7 | 9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
srcsep2         | 10^{-3} | 16502 | 3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7 | 9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
29166; 86016    | 10^{-4} | 22315 | 7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7 | 19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
srcsep3         | 10^{-3} | 27314 | 6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7 | 33 | 6:24 | 8:51 | 47 | 49
196608; 196608  | 10^{-4} | 83785 | 9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7 | 2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
soccer1         | 10^{-3} | 4     | 1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7 | 01 | 03 | 13:51 | 2:35 | 02
3200; 4096      | 10^{-4} | 8     | 8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7 | 01 | 02 | 13:23 | 3:07 | 02
soccer2         | 10^{-3} | 4     | 3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7 | 00 | 03 | 13:46 | 1:40 | 02
3200; 4096      | 10^{-4} | 8     | 2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7 | 01 | 03 | 13:27 | 3:07 | 02
blurrycam       | 10^{-3} | 1694  | 1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7 | 03 | 09 | 03 | 02 | 07
65536; 65536    | 10^{-4} | 5630  | 1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7 | 05 | 1:35 | 08 | 03 | 29
blurspike       | 10^{-3} | 1954  | 3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7 | 03 | 05 | 6:38 | 03 | 27
16384; 16384    | 10^{-4} | 11698 | 3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7 | 10 | 08 | 6:43 | 05 | 35

However, FPC_AS is not robust in that it fails to solve 4 out of 8 and 5 out of 8 problems when λc = 10^{-3} and 10^{-4}, respectively. For a few cases, FPC_AS can only achieve a poor accuracy (10^{-1}). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of λmax(AA*). For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi (m = n = 1024), whose corresponding λmax(AA*) = 709. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases, LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to λmax(AA*), one can safely conclude that the first order algorithms can only be used to solve problems with relatively small λmax(AA*). Moreover, in order to obtain efficient and robust algorithms for lasso problems or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotic superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving ℓ1-regularized least squares (LS) problems.


With the intelligent incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale ℓ1-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and the anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candès, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrmann, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Tech., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, F. Han, and H. Liu, Challenges of big data analysis, Natl. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationnelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B.-G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wächter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Softw., 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C^{1,1} optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, in NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A Two-Phase Proximal Augmented Lagrangian Method for Convex Quadratic Semidefinite Programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k²), Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact Regularized Proximal Newton Method: Provable Convergence Guarantees for Non-Smooth Convex Minimization without Strong Convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.
[56] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.
[57] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.


proposed in [48] and has been used extensively in high-dimensional statistics andmachine learning The model (1) has also been studied in the signal processing contextunder the name basis pursuit denoising [9] In addition to its own importance instatistics and machine learning lasso problem (1) also appears as an inner subproblemof many important algorithms For example in recent papers [4 1] a level set methodwas proposed to solve a computationally more challenging reformulation of the lassoproblem ie

min

x1 |

1

2Axminus b2 le σ

The level set method relies critically on the assumption that the optimization problem

min

1

2Axminus b2 + δBλ(x)

Bλ = x | middot 1 le λ

of the same type as (1) can be efficiently solved to high accuracy The lasso-type op-timization problems also appear as subproblems in various proximal Newton methodsfor solving convex composite optimization problems [7 29 53] Notably in a broadersense all these proximal Newton methods belong to the class of algorithms studiedin [17]

The above-mentioned importance together with a wide range of applications of(1) has inspired many researchers to develop various algorithms for solving this prob-lem and its equivalent reformulations These algorithms can roughly be divided intotwo categories The first category consists of algorithms that use only the gradientinformation for example the accelerated proximal gradient (APG) method [37 2]GPSR [16] SPGL1 [4] SpaRSA [50] FPC AS [49] and NESTA [3] to name onlya few Meanwhile algorithms in the second category including but not limited tomfIPM [18] SNF [36] BAS [6] SQA [7] OBA [26] and FBS-Newton [51] utilizethe second order information of the underlying problem in the algorithmic design toaccelerate the convergence Nearly all of these second order information based solversrely on certain nondegeneracy assumptions to guarantee the nonsingularity of thecorresponding inner linear systems The only exception is the inexact interior-point-algorithm based solver mfIPM which does not rely on the nondegeneracy assumptionbut requires the availability of appropriate preconditioners to ameliorate the extremeill-conditioning in the linear systems of the subproblems For nondegenerate problemsthe solvers in the second category generally work quite well and usually outperformthe algorithms in the first category when high accuracy solutions are sought In thispaper we also aim to solve the lasso problems by making use of the second orderinformation The novelty of our approach is that we do not need any nondegeneracyassumption in our theory or computations The core idea is to analyze the fundamen-tal nonsmooth structures in the problems to formulate and solve specific semismoothequations with well conditioned symmetric and positive definite generalized Jacobianmatrices which consequently play a critical role in our algorithmic design Whenapplied to solving difficult large-scale sparse optimization problems even for degen-erate ones our approach can outperform the first order algorithms by a huge marginregardless of whether low- or high-accuracy solutions are sought This is in sharpcontrast to most of the other second order based solvers mentioned above wheretheir competitive advantages over first order methods only become apparent whenhigh-accuracy solutions are sought

Our proposed algorithm is a semismooth Newton augmented Lagrangian (in shortSsnal) method for solving the dual of problem (1) where the sparsity property of the

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 435

second order generalized Hessian is wisely exploited This algorithmic framework isadapted from those that appeared in [54 25 52 31] for solving semidefinite program-ming problems where impressive numerical results have been reported Specializedto the vector case our Ssnal possesses unique features that are not available in thesemidefinite programming case It is these combined unique features that allow ouralgorithm to converge at a very fast speed with very low computational costs at eachstep Indeed for large-scale sparse lasso problems our numerical experiments showthat the proposed algorithm needs at most a few dozens of outer iterations to reach so-lutions with the desired accuracy while all the inner semismooth Newton subproblemscan be solved very cheaply One reason for this impressive performance is that thepiecewise linear-quadratic structure of the lasso problem (1) guarantees the asymp-totic superlinear convergence of the augmented Lagrangian algorithm Beyond thepiecewise linear-quadratic case we also study more general functions to guarantee thisfast convergence rate More importantly since there are several desirable propertiesincluding the strong convexity of the objective function in the inner subproblems ineach outer iteration we only need to execute a few (usually one to four) semismoothNewton steps to solve the underlying subproblem As will be shown later for lassoproblems with sparse optimal solutions the computational costs of performing thesesemismooth Newton steps can be made to be extremely cheap compared to othercosts This seems to be counterintuitive as normally one would expect a second ordermethod to be computationally much more expensive than the first order methodsat each step Here we make this counterintuitive achievement possible by carefullyexploiting the second order sparsity in the augmented Lagrangian functions Notablyour algorithmic framework not only works for models such as lasso adaptive lasso[56] and elastic net [57] but can also be applied to more general convex compositeoptimization problems The high performance of our algorithm also serves to showthat the second order information more specifically the nonsmooth second order in-formation can be and should be incorporated intelligently into the algorithmic designfor large-scale optimization problems

The remaining parts of this paper are organized as follows In the next section weintroduce some definitions and present preliminary results on the metric subregularityof multivalued mappings We should emphasize here that these stability results playa pivotal role in the analysis of the convergence rate of our algorithm In section 3we propose an augmented Lagrangian algorithm to solve the general convex com-posite optimization model and analyze its asymptotic superlinear convergence Thesemismooth Newton algorithm for solving the inner subproblems and the efficientimplementation of the algorithm are also presented in this section In section 4 weconduct extensive numerical experiments to evaluate the performance of Ssnal insolving various lasso problems We conclude our paper in sections

2 Preliminaries We discuss in this section some stability properties of convexcomposite optimization problems It will become apparent later that these stabilityproperties are the key ingredients for establishing the fast convergence of our aug-mented Lagrangian method

Recall that X and Y are two real finite-dimensional Euclidean spaces For a givenclosed proper convex function p X rarr (minusinfin+infin] the proximal mapping Proxp(middot)associated with p is defined by

Proxp(x) = arg minuisinX

p(x) +

1

2uminus x2

forallx isin X D

ownl

oade

d 02

22

18 to

158

132

175

120

Red

istr

ibut

ion

subj

ect t

o SI

AM

lice

nse

or c

opyr

ight

see

http

w

ww

sia

mo

rgjo

urna

lso

jsa

php

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

436 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

We will often make use of the Moreau identity Proxtp(x) + tProxplowastt(xt)= x where t gt 0 is a given parameter Denote dist(xC) = infxprimeisinC x minus xprime forany x isin X and any set C sub X

Let F X rArr Y be a multivalued mapping We define the graph of F to be theset

gphF = (x y) isin X times Y | y isin F (x)Fminus1 the inverse of F is the multivalued mapping from Y to X whose graph is(y x) | (x y) isin gphF

Definition 21 (error bound) Let F X rArr Y be a multivalued mapping andy isin Y satisfy Fminus1(y) 6= empty F is said to satisfy the error bound condition for the pointy with modulus κ ge 0 if there exists ε gt 0 such that if x isin X with dist(y F (x)) le εthen

dist(x Fminus1(y)

)le κdist(y F (x))(2)

The above error bound condition was called the growth condition in [34] and wasused to analyze the local linear convergence properties of the proximal point algorithmRecall that F X rArr Y is called a polyhedral multifunction if its graph is the unionof finitely many polyhedral convex sets In [39] Robinson established the followingcelebrated proposition on the error bound result for polyhedral multifunctions

Proposition 22 Let F be a polyhedral multifunction from X to Y Then Fsatisfies the error bound condition (2) for any point y isin Y satisfying Fminus1(y) 6= empty witha common modulus κ ge 0

For later use we present the following definition of metric subregularity fromChapter 3 in [13]

Definition 23 (metric subregularity) Let F X rArr Y be a multivalued map-ping and (x y) isin gphF F is said to be metrically subregular at x for y with modulusκ ge 0 if there exist neighborhoods U of x and V of y such that

dist(x Fminus1(y)) le κdist(y F (x) cap V ) forallx isin U

From the above definition we see that if F X rArr Y satisfies the error boundcondition (2) for y with the modulus κ then it is metrically subregular at x for y withthe same modulus κ for any x isin Fminus1(y)

The following definition on essential smoothness is taken from [40 section 26]

Definition 24 (essential smoothness) A proper convex function f on X is es-sentially smooth if f is differentiable on int (dom f) 6= empty and limkrarrinfin nablaf(xk) = +infinwhenever xk is a sequence in int (dom f) converging to a boundary point x ofint (dom f)

In particular a smooth convex function on X is essentially smooth Moreover ifa closed proper convex function f is strictly convex on dom f then its conjugate flowast

is essentially smooth [40 Theorem 263]Consider the following composite convex optimization model

(3) max minus (f(x) = h(Ax)minus 〈c x〉+ p(x))

where A X rarr Y is a linear map h Y rarr lt and p X rarr (minusinfin+infin] are twoclosed proper convex functions and c isin X is a given vector The dual of (3) can bewritten as

(4) min hlowast(y) + plowast(z) | Alowasty + z = c

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 437

where glowast and plowast are the Fenchel conjugate functions of g and h respectively Through-out this section we make the following blanket assumption on hlowast(middot) and plowast(middot)

Assumption 25 hlowast(middot) is essentially smooth and plowast(middot) is either an indicator func-tion δP (middot) or a support function δlowastP (middot) for some nonempty polyhedral convex setP sube X Moreover nablahlowast is locally Lipschitz continuous and directionally differen-tiable on int (domhlowast)

Under Assumption 25 by [40 Theorem 261] we know that parthlowast(y) = empty whenevery 6isin int (domhlowast) Denote by l the Lagrangian function for (4)

(5) l(y z x) = hlowast(y) + plowast(z)minus 〈x Alowasty + z minus c〉 forall (y z x) isin Y times X times X

Corresponding to the closed proper convex function f in the objective of (3) and theconvex-concave function l in (5) define the maximal monotone operators Tf and Tl[42] by

Tf (x) = partf(x) Tl(y z x) = (yprime zprime xprime) | (yprime zprimeminusxprime) isin partl(y z x)

whose inverses are given respectively by

T minus1f (xprime) = partflowast(xprime) T minus1l (yprime zprime xprime) = (y z x) | (yprime zprimeminusxprime) isin partl(y z x)

Unlike the case for Tf [33 47 58] stability results of Tl which correspond to theperturbations of both primal and dual solutions are very limited Next as a tool forstudying the convergence rate of Ssnal we shall establish a theorem which revealsthe metric subregularity of Tl under some mild assumptions

The KKT system associated with problem (4) is given as follows

0 isin parthlowast(y)minusAx 0 isin minusx+ partplowast(z) 0 = Alowasty + z minus c (x y z) isin X times Y times X

Assume that the above KKT system has at least one solution This existence assump-tion together with the essential smoothness assumption on hlowast implies that the aboveKKT system can be equivalently rewritten as

(6)

0 = nablahlowast(y)minusAx0 isin minusx+ partplowast(z)

0 = Alowasty + z minus c(x y z) isin X times int(domhlowast)timesX

Therefore under the above assumptions we only need to focus on int (domhlowast) times Xwhen solving problem (4) Let (y z) be an optimal solution to problem (4) Thenwe know that the set of the Lagrangian multipliers associated with (y z) denoted asM(y z) is nonempty Define the critical cone associated with (4) at (y z) as follows

(7) C(y z) =

(d1 d2) isin Y times X | Alowastd1 + d2 = 0

〈nablahlowast(y) d1〉+ (plowast)prime(z d2) = 0 d2 isin Tdom(plowast)(z)

where (plowast)prime(z d2) is the directional derivative of plowast at z with respect to d2 andTdom(plowast)(z) is the tangent cone of dom(plowast) at z When the conjugate function plowast

is taken to be the indicator function of a nonempty polyhedral set P the above defi-nition reduces directly to the standard definition of the critical cone in the nonlinearprogramming setting

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

438 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

Definition 26 (second order sufficient condition) Let (y z) isin Y times X be anoptimal solution to problem (4) with M(y z) 6= empty We say that the second ordersufficient condition for problem (4) holds at (y z) if

〈d1 (nablahlowast)prime(y d1)〉 gt 0 forall 0 6= (d1 d2) isin C(y z)

By building on the proof ideas from the literature on nonlinear programmingproblems [11 27 24] we are able to prove the following result on the metric subreg-ularity of Tl This allows us to prove the linear and even the asymptotic superlinearconvergence of the sequences generated by the Ssnal algorithm to be presented in thenext section even when the objective in problem (3) does not possess the piecewiselinear-quadratic structure as in the lasso problem (1)

Theorem 27 Assume that the KKT system (6) has at least one solution anddenote it as (x y z) Suppose that Assumption 25 holds and that the second ordersufficient condition for problem (4) holds at (y z) Then Tl is metrically subregularat (y z x) for the origin

Proof First we claim that there exists a neighborhood U of (x y z) such thatfor any w = (u1 u2 v) isin W = Y times X times X with w sufficiently small any solution(x(w) y(w) z(w)) isin U of the perturbed KKT system

(8) 0 = nablahlowast(y)minusAxminus u1 0 isin minusxminus u2 + partplowast(z) 0 = Alowasty + z minus cminus v

satisfies the following estimate

(9) (y(w) z(w))minus (y z) = O(w)

For the sake of contradiction suppose that our claim is not true Then there ex-ist some sequences wk= (uk1 u

k2 v

k) and (xk yk zk) such that wk rarr 0 and(xk yk zk)rarr (x y z) for each k the point (xk yk zk) is a solution of (8) for w = wkand

(yk zk)minus (y z) gt γkwk

with some γk gt 0 such that γk rarrinfin Passing to a subsequence if necessary we assumethat

((yk zk) minus (y z)

)(yk zk) minus (y z) converges to some ξ = (ξ1 ξ2) isin Y times X

ξ = 1 Then setting tk = (yk zk) minus (y z) and passing to a subsequence furtherif necessary by the local Lipschitz continuity and the directional differentiability ofnablahlowast(middot) at y we know that for all k sufficiently large

nablahlowast(yk)minusnablahlowast(y) =nablahlowast(y + tkξ1)minusnablahlowast(y) +nablahlowast(yk)minusnablahlowast(y + tkξ1)

= tk(nablahlowast)prime(y ξ1) + o(tk) +O(yk minus y minus tkξ1)

= tk(nablahlowast)prime(y ξ1) + o(tk)

Denote for each k xk = xk + uk2 Simple calculations show that for all k sufficientlylarge

(10) 0 = nablahlowast(yk)minusnablahlowast(y)minusA(xkminusx)+Auk2minusuk1 = tk(nablahlowast)prime(y ξ1)+o(tk)minusA(xkminusx)

and

(11) 0 = A(yk minus y) + (zk minus z)minus vk = tk(Alowastξ1 + ξ2) + o(tk)

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 439

Dividing both sides of (11) by tk and then taking limits we obtain

(12) Alowastξ1 + ξ2 = 0

which further implies that

(13) 〈nablahlowast(y) ξ1〉 = 〈Ax ξ1〉 = minus〈x ξ2〉

Since xk isin partplowast(zk) we know that for all k sufficiently large

zk = z + tkξ2 + o(tk) isin dom plowast

That is ξ2 isin Tdom plowast(z)According to the structure of plowast we separate our discussions into two cases

Case I There exists a nonempty polyhedral convex set P such that plowast(z) =δlowastP (z) for all z isin X Then for each k it holds that

xk = ΠP (zk + xk)

By [14 Theorem 411] we have

(14) xk = ΠP (zk+ xk) = ΠP (z+ x)+ΠC(zkminus z+ xkminus x) = x+ΠC(z

kminus z+ xkminus x)

where C is the critical cone of P at z + x ie

C equiv CP (z + x) = TP (x) cap zperp

Since C is a polyhedral cone we know from [14 Proposition 414] that ΠC(middot) is apiecewise linear function ie there exist a positive integer l and orthogonal projectorsB1 Bl such that for any x isin X

ΠC(x) isin B1x Blx

By restricting to a subsequence if necessary we may further assume that there exists1 le jprime le l such that for all k ge 1

(15) ΠC(zk minus z + xk minus x) = Bjprime(z

k minus z + xk minus x) = ΠRange(Bjprime )(zk minus z + xk minus x)

Denote L = Range(Bjprime) Combining (14) and (15) we get

C cap L 3 (xk minus x) perp (zk minus z) isin C cap Lperp

where C is the polar cone of C Since xk minus x isin C we have 〈xk minus x z〉 = 0 whichtogether with 〈xkminus x zkminus z〉 = 0 implies 〈xkminus x zk〉 = 0 Thus for all k sufficientlylarge

〈zk xk〉 = 〈zk x〉 = 〈z + tkξ2 + o(tk) x〉

and it follows that

(16)

(plowast)prime(z ξ2) = limkrarrinfin

δlowastP (z + tkξ2)minus δlowastP (z)

tk= limkrarrinfin

δlowastP (zk)minus δlowastP (z)

tk

= limkrarrinfin

〈zk xk〉 minus 〈z x〉tk

= 〈x ξ2〉

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

440 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

By (12) (13) and (16) we know that (ξ1 ξ2) isin C(y z) Since C cap L is a polyhedralconvex cone we know from [40 Theorem 193] that A(C cap L) is also a polyhedralconvex cone which together with (10) implies

(nablahlowast)prime(y ξ1) isin A(C cap L)

Therefore there exists a vector η isin C cap L such that

〈ξ1 (nablahlowast)prime(y ξ1)〉 = 〈ξ1 Aη〉 = minus〈ξ2 η〉 = 0

where the last equality follows from the fact that ξ2 isin C cap Lperp Note that the lastinclusion holds since the polyhedral convex cone C cap Lperp is closed and

ξ2 = limkrarrinfin

zk minus ztk

isin C cap Lperp

As 0 6= ξ = (ξ1 ξ2) isin C(y z) but 〈ξ1 (nablahlowast)prime(y ξ1)〉 = 0 this contradicts the assump-tion that the second order sufficient condition holds for (4) at (y z) Thus we haveproved our claim for Case I

Case II There exists a nonempty polyhedral convex set P such that plowast(z) =δP (z) for all z isin X Then we know that for each k

zk = ΠP (zk + xk)

Since (δP )prime(z d2) = 0 for all d2 isin Tdom(plowast)(z) the critical cone in (7) now takes thefollowing form

C(y z) =

(d1 d2) isin Y times X | Alowastd1 + d2 = 0 〈nablahlowast(y) d1〉 = 0 d2 isin Tdom(plowast)(z)

Similar to Case I without loss of generality we can assume that there exists a subspaceL such that for all k ge 1

C cap L 3 (zk minus z) perp (xk minus x) isin C cap Lperp

whereC equiv CP (z + x) = TP (z) cap xperp

Since C cap L is a polyhedral convex cone we know

ξ2 = limkrarrinfin

zk minus ztk

isin C cap L

and consequently 〈ξ2 x〉 = 0 which together with (12) and (13) implies ξ =(ξ1 ξ2) isin C(y z) By (10) and the fact that A(C cap Lperp) is a polyhedral convexcone we know that

(nablahlowast)prime(y ξ1) isin A(C cap Lperp)

Therefore there exists a vector η isin C cap Lperp such that

〈ξ1 (nablahlowast)prime(y ξ1)〉 = 〈ξ1 Aη〉 = minus〈ξ2 η〉 = 0

Since ξ = (ξ1 ξ2) 6= 0 we arrive at a contradiction to the assumed second ordersufficient condition So our claim is also true for this case

In summary we have proven that there exists a neighborhood U of (x y z) suchthat for any w close enough to the origin (9) holds for any solution (x(w) y(w)

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 441

z(w)) isin U to the perturbed KKT system (8) Next we show that Tl is metricallysubregular at (y z x) for the origin

Define the mapping ΘKKT X times Y times X timesW rarr Y timesX times X by

ΘKKT (x y z w) =

nablahlowast(y)minusAxminus u1z minus Proxplowast(z + x+ u2)Alowasty + z minus cminus v

forall(x y z w) isin X times Y times X timesW

and define the mapping θ X rarr Y times X times X as follows

θ(x) = ΘKKT(x y z 0) forallx isin X

Then we have x isin M(y z) if and only if θ(x) = 0 Since Proxplowast(middot) is a piecewiseaffine function θ(middot) is a piecewise affine function and thus a polyhedral multifunctionBy using Proposition 22 and shrinking the neighborhood U if necessary for any wclose enough to the origin and any solution (x(w) y(w) z(w)) isin U of the perturbedKKT system (8) we have

dist(x(w)M(y z)) = O(θ(x(w)))

= O(ΘKKT(x(w) y z 0)minusΘKKT(x(w) y(w) z(w) w))

= O(w+ (y(w) z(w))minus (y z))

which together with (9) implies the existence of a constant κ ge 0 such that

(17) (y(w) z(w))minus (y z)+ dist(x(w)M(y z)) le κw

Thus by Definition 23 we have proven that Tl is metrically subregular at (y z x)for the origin The proof of the theorem is complete

Remark 28 For convex piecewise linear-quadratic programming problems suchas the `1 and elastic net regularized LS problem we know from J Sunrsquos thesis [46]on the characterization of the subdifferentials of convex piecewise linear-quadraticfunctions (see also [43 Proposition 1230]) that the corresponding operators Tl and Tfare polyhedral multifunctions and thus by Proposition 22 the error bound conditionholds Here we emphasize again that the error bound condition holds for the lassoproblem (1) Moreover Tf associated with the `1 or elastic net regularized logisticregression model ie for a given vector b isin ltm the loss function h ltm rarr lt inproblem (3) takes the form

h(y) =

msumi=1

log(1 + eminusbiyi) forally isin ltm

also satisfies the error bound condition [33 47] Meanwhile from Theorem 27 weknow that Tl corresponding to the `1 or the elastic net regularized logistic regressionmodel is metrically subregular at any solutions to the KKT system (6) for the origin

3 An augmented Lagrangian method with asymptotic superlinear con-vergence Recall the general convex composite model (3)

(P) max minus (f(x) = h(Ax)minus 〈c x〉+ p(x))

and its dual(D) minhlowast(y) + plowast(z) | Alowasty + z = c

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

442 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

In this section we shall propose an asymptotically superlinearly convergent aug-mented Lagrangian method for solving (P) and (D) In this section we make thefollowing standing assumptions regarding the function h

Assumption 31(a) h Y rarr lt is a convex differentiable function whose gradient is 1αh-Lipschitz

continuous ie

nablah(yprime)minusnablah(y) le (1αh)yprime minus y forallyprime y isin Y

(b) h is essentially locally strongly convex [21] ie for any compact and convexset K sub dom parth there exists βK gt 0 such that

(1minusλ)h(yprime) +λh(y) ge h((1minusλ)yprime+λy) +1

2βKλ(1minusλ)yprimeminusy2 forallyprime y isin K

for all λ isin [0 1]

Many commonly used loss functions in the machine learning literature satisfy theabove mild assumptions For example h can be the loss function in the linear regres-sion logistic regression and Poisson regression models While the strict convexity of aconvex function is closely related to the differentiability of its conjugate the essentiallocal strong convexity in Assumption 31(b) for a convex function was first proposedin [21] to obtain a characterization of the local Lipschitz continuity of the gradient ofits conjugate function

The aforementioned assumptions on h imply the following useful properties of hlowastFirst by [43 Proposition 1260] we know that hlowast is strongly convex with modulusαh Second by [21 Corollary 44] we know that hlowast is essentially smooth and nablahlowast islocally Lipschitz continuous on int (domhlowast) If the solution set to the KKT systemassociated with (P) and (D) is further assumed to be nonempty similar to what wehave discussed in the last section one only needs to focus on int (domhlowast)times X whensolving (D) Given σ gt 0 the augmented Lagrangian function associated with (D) isgiven by

Lσ(y zx) = l(y z x) +σ

2Alowasty + z minus c2 forall (y z x) isin Y times X times X

where the Lagrangian function l(middot) is defined in (5)

31 SSNAL A semismooth Newton augmented Lagrangian algorithmfor (D) The detailed steps of our algorithm Ssnal for solving (D) are given as fol-lows Since a semismooth Newton method will be employed to solve the subproblemsinvolved in this prototype of the inexact augmented Lagrangian method [42] it isnatural for us to call our algorithm Ssnal

Algorithm SSNAL: An inexact augmented Lagrangian method for (D).

Let $\sigma_0 > 0$ be a given parameter. Choose $(y^0, z^0, x^0)\in \operatorname{int}(\operatorname{dom} h^*)\times\operatorname{dom} p^*\times\mathcal{X}$. For $k = 0, 1, \ldots$, perform the following steps in each iteration:
Step 1. Compute

(18)    $(y^{k+1}, z^{k+1}) \approx \arg\min\bigl\{\Psi_k(y,z) := \mathcal{L}_{\sigma_k}(y,z;x^k)\bigr\}.$

Step 2. Compute $x^{k+1} = x^k - \sigma_k(\mathcal{A}^*y^{k+1} + z^{k+1} - c)$ and update $\sigma_{k+1}\uparrow\sigma_\infty\le\infty$.
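For readers who prefer pseudocode, the following Python sketch shows the structure of this outer loop. It is an illustration only, not the authors' implementation: `inner_solve` stands for the semismooth Newton solver described in section 3.2, and the geometric increase of `sigma` is one simple (assumed) choice of the sequence $\sigma_k$.

```python
import numpy as np

def ssnal_outer(A, c, inner_solve, sigma0=1.0, max_outer=100, tol=1e-6):
    """Illustrative sketch of the Ssnal outer loop for (D).

    `inner_solve(x, sigma, y, z)` is a placeholder for the semismooth Newton
    solver of section 3.2: it should return (y, z) approximately minimizing
    Psi_k(y, z) = L_{sigma}(y, z; x) to the accuracy required by (A)/(B1)/(B2).
    """
    m, n = A.shape
    x, y, z = np.zeros(n), np.zeros(m), np.zeros(n)
    sigma = sigma0
    for k in range(max_outer):
        # Step 1: approximately solve the augmented Lagrangian subproblem (18).
        y, z = inner_solve(x, sigma, y, z)
        # Step 2: multiplier update x^{k+1} = x^k - sigma_k (A^* y^{k+1} + z^{k+1} - c).
        feas = A.T @ y + z - c
        x = x - sigma * feas
        if np.linalg.norm(feas) <= tol:
            break
        sigma = min(10.0 * sigma, 1e8)   # sigma_k increasing towards sigma_inf (assumed rule)
    return x, y, z
```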

Next we shall adapt the results developed in [41, 42, 34] to establish the global and local superlinear convergence of our algorithm.


Since the inner problems cannot be solved exactly, we use the following standard stopping criterion studied in [41, 42] for approximately solving (18):

$({\rm A})\qquad \Psi_k(y^{k+1}, z^{k+1}) - \inf\Psi_k \le \varepsilon_k^2/2\sigma_k, \qquad \sum_{k=0}^{\infty}\varepsilon_k < \infty.$

Now we can state the global convergence of Algorithm Ssnal, which follows from [41, 42] without much difficulty.

Theorem 3.2. Suppose that Assumption 3.1 holds and that the solution set to (P) is nonempty. Let $\{(y^k, z^k, x^k)\}$ be the infinite sequence generated by Algorithm Ssnal with stopping criterion (A). Then the sequence $\{x^k\}$ is bounded and converges to an optimal solution of (P). In addition, $\{(y^k, z^k)\}$ is also bounded and converges to the unique optimal solution $(y^*, z^*)\in\operatorname{int}(\operatorname{dom} h^*)\times\operatorname{dom} p^*$ of (D).

Proof. Since the solution set to (P) is assumed to be nonempty, the optimal value of (P) is finite. From Assumption 3.1(a), we have that $\operatorname{dom} h = \mathcal{Y}$ and $h^*$ is strongly convex [43, Proposition 12.60]. Then, by Fenchel's duality theorem [40, Corollary 31.2.1], we know that the solution set to (D) is nonempty, and the optimal value of (D) is finite and equal to the optimal value of (P). That is, the solution set to the KKT system associated with (P) and (D) is nonempty. The uniqueness of the optimal solution $(y^*, z^*)\in\operatorname{int}(\operatorname{dom} h^*)\times\mathcal{X}$ of (D) then follows directly from the strong convexity of $h^*$. By combining this uniqueness with [42, Theorem 4], one can readily obtain the boundedness of $\{(y^k, z^k)\}$ and the other desired results.

We need the following stopping criteria for the local convergence analysis:

$({\rm B1})\qquad \Psi_k(y^{k+1}, z^{k+1}) - \inf\Psi_k \le (\delta_k^2/2\sigma_k)\|x^{k+1} - x^k\|^2, \qquad \sum_{k=0}^{\infty}\delta_k < +\infty,$

$({\rm B2})\qquad \operatorname{dist}\bigl(0,\, \partial\Psi_k(y^{k+1}, z^{k+1})\bigr) \le (\delta'_k/\sigma_k)\|x^{k+1} - x^k\|, \qquad 0\le\delta'_k\to 0,$

where $x^{k+1} = x^k - \sigma_k(\mathcal{A}^*y^{k+1} + z^{k+1} - c)$.

Theorem 3.3. Assume that Assumption 3.1 holds and that the solution set $\Omega$ to (P) is nonempty. Suppose that $T_f$ satisfies the error bound condition (2) for the origin with modulus $a_f$. Let $\{(y^k, z^k, x^k)\}$ be any infinite sequence generated by Algorithm Ssnal with stopping criteria (A) and (B1). Then the sequence $\{x^k\}$ converges to some $x^*\in\Omega$, and for all $k$ sufficiently large,

(19)    $\operatorname{dist}(x^{k+1},\Omega) \le \theta_k\operatorname{dist}(x^k,\Omega),$

where $\theta_k = \bigl(a_f(a_f^2+\sigma_k^2)^{-1/2} + 2\delta_k\bigr)(1-\delta_k)^{-1} \to \theta_\infty = a_f(a_f^2+\sigma_\infty^2)^{-1/2} < 1$ as $k\to+\infty$. Moreover, the sequence $\{(y^k, z^k)\}$ converges to the unique optimal solution $(y^*, z^*)\in\operatorname{int}(\operatorname{dom} h^*)\times\operatorname{dom} p^*$ of (D).

Moreover, if $T_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with modulus $a_l$ and the stopping criterion (B2) is also used, then for all $k$ sufficiently large,

$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta'_k\|x^{k+1} - x^k\|,$

where $\theta'_k = a_l(1+\delta'_k)/\sigma_k$ with $\lim_{k\to\infty}\theta'_k = a_l/\sigma_\infty$.

Proof. The first part of the theorem follows from [34, Theorem 2.1], [42, Proposition 7 and Theorem 5], and Theorem 3.2. To prove the second part, we recall that


if $T_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with the modulus $a_l$ and $(y^k, z^k, x^k)\to(y^*, z^*, x^*)$, then for all $k$ sufficiently large,

$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| + \operatorname{dist}(x^{k+1},\Omega) \le a_l\operatorname{dist}\bigl(0,\, T_l(y^{k+1}, z^{k+1}, x^{k+1})\bigr).$

Therefore, by the estimate (4.21) in [42] and the stopping criterion (B2), we obtain that for all $k$ sufficiently large,

$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le a_l(1+\delta'_k)/\sigma_k\,\|x^{k+1} - x^k\|.$

This completes the proof of Theorem 3.3.

Remark 3.4. In the above theorem, when $\sigma_\infty = +\infty$, the inequality (19) directly implies that $\{x^k\}$ converges Q-superlinearly. Moreover, recent advances in [12] also established the asymptotic R-superlinear convergence of the dual iteration sequence $\{(y^k, z^k)\}$. Indeed, from [12, Proposition 4.1], under the same conditions as in Theorem 3.3, we have for all $k$ sufficiently large that

(20)    $\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta'_k\|x^{k+1} - x^k\| \le \theta'_k(1-\delta_k)^{-1}\operatorname{dist}(x^k, \Omega),$

where $\theta'_k(1-\delta_k)^{-1} \to a_l/\sigma_\infty$. Thus, if $\sigma_\infty = \infty$, the Q-superlinear convergence of $\{x^k\}$ and (20) further imply that $\{(y^k, z^k)\}$ converges R-superlinearly.

We should emphasize here that, by combining Remarks 2.8 and 3.4 and Theorem 3.3, our Algorithm Ssnal is guaranteed to produce an asymptotically superlinearly convergent sequence when used to solve (D) for many commonly used regularizers and loss functions. In particular, the Ssnal algorithm is asymptotically superlinearly convergent when applied to the dual of (1).

3.2. Solving the augmented Lagrangian subproblems. Here we shall propose an efficient semismooth Newton algorithm to solve the inner subproblems in the augmented Lagrangian method (18). That is, for some fixed $\sigma > 0$ and $x\in\mathcal{X}$, we aim to solve

(21)    $\min_{y,z}\ \Psi(y,z) := \mathcal{L}_\sigma(y,z;x).$

Since $\Psi(\cdot,\cdot)$ is a strongly convex function, for any $\alpha\in\Re$ the level set $\{(y,z)\in\operatorname{dom} h^*\times\operatorname{dom} p^* \mid \Psi(y,z)\le\alpha\}$ is a closed and bounded convex set. Moreover, problem (21) admits a unique optimal solution, denoted as $(\bar{y}, \bar{z})\in\operatorname{int}(\operatorname{dom} h^*)\times\operatorname{dom} p^*$.

Denote, for any $y\in\mathcal{Y}$,

$\psi(y) := \inf_z\Psi(y,z) = h^*(y) + p^*\bigl(\operatorname{Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^*y + c)\bigr) + \frac{1}{2\sigma}\bigl\|\operatorname{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y - c))\bigr\|^2 - \frac{1}{2\sigma}\|x\|^2.$

Therefore, if $(\bar{y}, \bar{z}) = \arg\min\Psi(y,z)$, then $(\bar{y}, \bar{z})\in\operatorname{int}(\operatorname{dom} h^*)\times\operatorname{dom} p^*$ can be computed simultaneously by

$\bar{y} = \arg\min\psi(y), \qquad \bar{z} = \operatorname{Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^*\bar{y} + c).$
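As a concrete illustration for the lasso case $p = \lambda\|\cdot\|_1$, the two proximal mappings appearing above reduce to soft-thresholding and a box projection. The Python sketch below (an illustration, not part of the paper) computes $\operatorname{Prox}_{\sigma p}$ directly and $\operatorname{Prox}_{p^*/\sigma}$ via the Moreau identity $\operatorname{Prox}_{tp}(x) + t\operatorname{Prox}_{p^*/t}(x/t) = x$ recalled in section 2.

```python
import numpy as np

def prox_l1(v, t):
    """Soft-thresholding: Prox_{t||.||_1}(v), applied componentwise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_l1_conj(v, sigma, lam):
    """Prox_{p*/sigma}(v) for p = lam*||.||_1, via the Moreau identity with t = sigma:
    Prox_{p*/sigma}(v) = v - Prox_{sigma p}(sigma v)/sigma.
    For the l1 norm this is simply the projection of v onto the box [-lam, lam]^n."""
    return v - prox_l1(sigma * v, sigma * lam) / sigma
```

With these two routines, $\bar z$ above is obtained by applying `prox_l1_conj` to $x/\sigma - \mathcal{A}^*\bar y + c$.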


Note that $\psi(\cdot)$ is strongly convex and continuously differentiable on $\operatorname{int}(\operatorname{dom} h^*)$, with

$\nabla\psi(y) = \nabla h^*(y) - \mathcal{A}\operatorname{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y - c)) \quad \forall\, y\in\operatorname{int}(\operatorname{dom} h^*).$

Thus, $\bar{y}$ can be obtained by solving the following nonsmooth equation:

(22)    $\nabla\psi(y) = 0, \qquad y\in\operatorname{int}(\operatorname{dom} h^*).$

Let $y\in\operatorname{int}(\operatorname{dom} h^*)$ be any given point. Since $h^*$ is a convex function with a locally Lipschitz continuous gradient on $\operatorname{int}(\operatorname{dom} h^*)$, the following operator is well defined:

$\hat{\partial}^2\psi(y) := \partial(\nabla h^*)(y) + \sigma\mathcal{A}\,\partial\operatorname{Prox}_{\sigma p}\bigl(x - \sigma(\mathcal{A}^*y - c)\bigr)\,\mathcal{A}^*,$

where $\partial(\nabla h^*)(y)$ is the Clarke subdifferential of $\nabla h^*$ at $y$ [10], and $\partial\operatorname{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y - c))$ is the Clarke subdifferential of the Lipschitz continuous mapping $\operatorname{Prox}_{\sigma p}(\cdot)$ at $x - \sigma(\mathcal{A}^*y - c)$. Note that from [10, Proposition 2.3.3 and Theorem 2.6.6], we know that

$\partial^2\psi(y)\,(d) \subseteq \hat{\partial}^2\psi(y)\,(d) \quad \forall\, d\in\mathcal{Y},$

where $\partial^2\psi(y)$ denotes the generalized Hessian of $\psi$ at $y$. Define

(23)    $V := H + \sigma\mathcal{A}U\mathcal{A}^*$

with $H\in\partial(\nabla h^*)(y)$ and $U\in\partial\operatorname{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y - c))$. Then we have $V\in\hat{\partial}^2\psi(y)$. Since $h^*$ is a strongly convex function, we know that $H$ is symmetric positive definite on $\mathcal{Y}$, and thus $V$ is also symmetric positive definite on $\mathcal{Y}$.

Under the mild assumption that $\nabla h^*$ and $\operatorname{Prox}_{\sigma p}$ are strongly semismooth (the definition of which is given next), we can design a superlinearly convergent semismooth Newton method to solve the nonsmooth equation (22).

Definition 3.5 (semismoothness [35, 38, 45]). Let $F: O\subseteq\mathcal{X}\to\mathcal{Y}$ be a locally Lipschitz continuous function on the open set $O$. $F$ is said to be semismooth at $x\in O$ if $F$ is directionally differentiable at $x$ and for any $V\in\partial F(x+\Delta x)$ with $\Delta x\to 0$,

$F(x+\Delta x) - F(x) - V\Delta x = o(\|\Delta x\|).$

$F$ is said to be strongly semismooth at $x$ if $F$ is semismooth at $x$ and

$F(x+\Delta x) - F(x) - V\Delta x = O(\|\Delta x\|^2).$

$F$ is said to be a semismooth (respectively, strongly semismooth) function on $O$ if it is semismooth (respectively, strongly semismooth) everywhere in $O$.

Note that it is widely known in the nonsmooth optimization/equation community that continuous piecewise affine functions and twice continuously differentiable functions are all strongly semismooth everywhere. In particular, $\operatorname{Prox}_{\|\cdot\|_1}$, as a Lipschitz continuous piecewise affine function, is strongly semismooth. See [14] for more semismooth and strongly semismooth functions.

Now we can design a semismooth Newton (Ssn) method to solve (22) as follows, and we can expect a fast superlinear or even quadratic convergence rate.


Algorithm SSN: A semismooth Newton algorithm for solving (22) (SSN($y^0, x, \sigma$)).

Given $\mu\in(0, 1/2)$, $\eta\in(0,1)$, $\tau\in(0,1]$, and $\delta\in(0,1)$. Choose $y^0\in\operatorname{int}(\operatorname{dom} h^*)$. Iterate the following steps for $j = 0, 1, \ldots$:
Step 1. Choose $H_j\in\partial(\nabla h^*)(y^j)$ and $U_j\in\partial\operatorname{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y^j - c))$. Let $V_j := H_j + \sigma\mathcal{A}U_j\mathcal{A}^*$. Solve the linear system

(24)    $V_j d + \nabla\psi(y^j) = 0$

exactly or by the conjugate gradient (CG) algorithm to find $d^j$ such that $\|V_j d^j + \nabla\psi(y^j)\| \le \min(\eta, \|\nabla\psi(y^j)\|^{1+\tau})$.
Step 2. (Line search) Set $\alpha_j = \delta^{m_j}$, where $m_j$ is the first nonnegative integer $m$ for which

$y^j + \delta^m d^j\in\operatorname{int}(\operatorname{dom} h^*) \quad\text{and}\quad \psi(y^j + \delta^m d^j) \le \psi(y^j) + \mu\delta^m\langle\nabla\psi(y^j),\, d^j\rangle.$

Step 3. Set $y^{j+1} = y^j + \alpha_j d^j$.
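To make the inner solver concrete, the following Python sketch specializes Algorithm Ssn to the lasso case, where $h(w) = \frac12\|w-b\|^2$ (so $\nabla h^*(y) = y + b$ and $H = I_m$), $c = 0$, and $p = \lambda\|\cdot\|_1$. It is an illustration under these assumptions, not the authors' implementation: the linear system (24) is solved by a dense direct solve rather than CG, the line-search constants are our choices, and dom $h^* = \Re^m$ so the interiority check in Step 2 is vacuous.

```python
import numpy as np

def prox_l1(v, t):
    """Soft-thresholding operator Prox_{t||.||_1}(v)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ssn_lasso(A, b, lam, x, sigma, y0, mu=1e-4, delta=0.5, tol=1e-8, max_iter=50):
    """Illustrative Ssn solver for (22) in the lasso case h(w)=0.5||w-b||^2, c=0."""
    m, n = A.shape

    def psi(yy):
        # psi(y) = h*(y) + (1/2s)||Prox_{s p}(x - s A^T y)||^2 - (1/2s)||x||^2,
        # using that p*(Prox_{p*/s}(.)) = 0 when p = lam*||.||_1.
        pw = prox_l1(x - sigma * (A.T @ yy), sigma * lam)
        return 0.5 * yy @ yy + b @ yy + (pw @ pw) / (2 * sigma) - (x @ x) / (2 * sigma)

    y = y0.copy()
    for _ in range(max_iter):
        w = x - sigma * (A.T @ y)
        grad = y + b - A @ prox_l1(w, sigma * lam)     # nabla psi(y), cf. (22)
        if np.linalg.norm(grad) <= tol:
            break
        J = np.abs(w) > sigma * lam                    # support of U = Diag(u)
        AJ = A[:, J]
        V = np.eye(m) + sigma * (AJ @ AJ.T)            # V = H + sigma A U A^T, cf. (23)
        d = np.linalg.solve(V, -grad)                  # Newton direction from (24)
        alpha, psi_y, slope = 1.0, psi(y), grad @ d    # slope < 0: descent direction
        while psi(y + alpha * d) > psi_y + mu * alpha * slope and alpha > 1e-12:
            alpha *= delta                             # backtracking line search (Step 2)
        y = y + alpha * d
    return y
```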

The convergence results for the above Ssn algorithm are stated in the next theorem.

Theorem 3.6. Assume that $\nabla h^*(\cdot)$ and $\operatorname{Prox}_{\sigma p}(\cdot)$ are strongly semismooth on $\operatorname{int}(\operatorname{dom} h^*)$ and $\mathcal{X}$, respectively. Let the sequence $\{y^j\}$ be generated by Algorithm Ssn. Then $\{y^j\}$ converges to the unique optimal solution $\bar{y}\in\operatorname{int}(\operatorname{dom} h^*)$ of the problem in (22), and

$\|y^{j+1} - \bar{y}\| = O(\|y^j - \bar{y}\|^{1+\tau}).$

Proof. Since, by [54, Proposition 3.3], $d^j$ is a descent direction, Algorithm Ssn is well defined. By (23), we know that for any $j\ge 0$, $V_j\in\hat{\partial}^2\psi(y^j)$. Then we can prove the conclusion of this theorem by following the proofs of [54, Theorems 3.4 and 3.5]. We omit the details here.

We shall now discuss the implementation of stopping criteria (A), (B1), and (B2) when Algorithm Ssn is used to solve the subproblem (18) in Algorithm Ssnal. Note that when Algorithm Ssn is applied to minimize $\Psi_k(\cdot)$ to find

$y^{k+1} = \operatorname{Ssn}(y^k, x^k, \sigma_k) \quad\text{and}\quad z^{k+1} = \operatorname{Prox}_{p^*/\sigma_k}(x^k/\sigma_k - \mathcal{A}^*y^{k+1} + c),$

we have, by simple calculations and the strong convexity of $h^*$, that

$\Psi_k(y^{k+1}, z^{k+1}) - \inf\Psi_k = \psi_k(y^{k+1}) - \inf\psi_k \le (1/2\alpha_h)\|\nabla\psi_k(y^{k+1})\|^2$

and $(\nabla\psi_k(y^{k+1}), 0)\in\partial\Psi_k(y^{k+1}, z^{k+1})$, where $\psi_k(y) := \inf_z\Psi_k(y,z)$ for all $y\in\mathcal{Y}$. Therefore, the stopping criteria (A), (B1), and (B2) can be achieved by the following implementable criteria:

$({\rm A}')\qquad \|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h/\sigma_k}\,\varepsilon_k, \qquad \sum_{k=0}^{\infty}\varepsilon_k < \infty,$

$({\rm B1}')\qquad \|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h\sigma_k}\,\delta_k\|\mathcal{A}^*y^{k+1} + z^{k+1} - c\|, \qquad \sum_{k=0}^{\infty}\delta_k < +\infty,$

$({\rm B2}')\qquad \|\nabla\psi_k(y^{k+1})\| \le \delta'_k\|\mathcal{A}^*y^{k+1} + z^{k+1} - c\|, \qquad 0\le\delta'_k\to 0.$


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as $\|\nabla\psi_k(y^{k+1})\|$ is sufficiently small.
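In an implementation, these checks only require the two scalars $\|\nabla\psi_k(y^{k+1})\|$ and $\|\mathcal{A}^*y^{k+1}+z^{k+1}-c\|$. A small Python helper along the following lines suffices; it is an illustrative sketch in which the summable sequences $\varepsilon_k$, $\delta_k$ and the sequence $\delta'_k$ are chosen by the user, and it mirrors the reconstructed criteria (A'), (B1'), (B2') above.

```python
import numpy as np

def inner_stop_ok(grad_psi_norm, feas_norm, alpha_h, sigma_k,
                  eps_k, delta_k, delta_k_prime):
    """Check the implementable criteria (A'), (B1'), (B2') for one outer iteration.
    eps_k and delta_k are the k-th terms of user-chosen summable sequences."""
    A_ok  = grad_psi_norm <= np.sqrt(alpha_h / sigma_k) * eps_k
    B1_ok = grad_psi_norm <= np.sqrt(alpha_h * sigma_k) * delta_k * feas_norm
    B2_ok = grad_psi_norm <= delta_k_prime * feas_norm
    return A_ok and B1_ok and B2_ok
```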

3.3. An efficient implementation of SSN for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we shall discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer $p$ is chosen to be $\lambda\|\cdot\|_1$ for some $\lambda > 0$. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction $d^j$ from the linear system (24). So we shall first discuss the solving of this linear system.

Let $(x, y)\in\Re^n\times\Re^m$ and $\sigma > 0$ be given. We consider the following Newton linear system:

(25)    $(H + \sigma AUA^T)d = -\nabla\psi(y),$

where $H\in\partial(\nabla h^*)(y)$, $A$ denotes the matrix representation of $\mathcal{A}$ with respect to the standard bases of $\Re^n$ and $\Re^m$, and $U\in\partial\operatorname{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ with $\tilde{x} := x - \sigma(A^Ty - c)$. Since $H$ is a symmetric and positive definite matrix, (25) can be equivalently rewritten as

$\bigl(I_m + \sigma(L^{-1}A)U(L^{-1}A)^T\bigr)(L^Td) = -L^{-1}\nabla\psi(y),$

where $L$ is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of $H$ such that $H = LL^T$. In many applications, $H$ is usually a sparse matrix. Indeed, when the function $h$ in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrices $H$ are in fact diagonal matrices. That is, the costs of computing $L$ and its inverse are negligible in most situations. Therefore, without loss of generality, we can consider a simplified version of (25) as follows:

(26)    $(I_m + \sigma AUA^T)d = -\nabla\psi(y),$

which is precisely the Newton system associated with the standard lasso problem (1). Since $U\in\Re^{n\times n}$ is a diagonal matrix, at first glance the costs of computing $AUA^T$ and the matrix-vector multiplication $AUA^Td$ for a given vector $d\in\Re^m$ are $O(m^2n)$ and $O(mn)$, respectively. These computational costs are too expensive when the dimensions of $A$ are large, and they can make commonly employed approaches such as the Cholesky factorization and the conjugate gradient method inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of $U$ is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level such that they are negligible, or at least insignificant, compared to other costs. Next we shall show how this can be done by taking full advantage of the sparsity of $U$. This sparsity will be referred to as the second order sparsity of the underlying problem.

For $\tilde{x} = x - \sigma(A^Ty - c)$, in our computations we can always choose $U = \operatorname{Diag}(u)$, the diagonal matrix whose $i$th diagonal element $u_i$ is given by

$u_i = \begin{cases} 0 & \text{if } |\tilde{x}_i|\le\sigma\lambda,\\ 1 & \text{otherwise}, \end{cases} \qquad i = 1,\ldots,n.$

Since $\operatorname{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x}) = \operatorname{sign}(\tilde{x})\circ\max\{|\tilde{x}| - \sigma\lambda,\, 0\}$, it is not difficult to see that $U\in\partial\operatorname{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$. Let $\mathcal{J} := \{j \mid |\tilde{x}_j| > \sigma\lambda,\ j = 1,\ldots,n\}$ and $r := |\mathcal{J}|$, the cardinality of $\mathcal{J}$. By taking the special 0-1 structure of $U$ into consideration, we have that


Fig. 1. Reducing the computational costs from $O(m^2n)$ to $O(m^2r)$.

(27)    $AUA^T = (AU)(AU)^T = A_{\mathcal{J}}A_{\mathcal{J}}^T,$

where $A_{\mathcal{J}}\in\Re^{m\times r}$ is the submatrix of $A$ with those columns not in $\mathcal{J}$ removed. Then, by using (27), the costs of computing $AUA^T$ and $AUA^Td$ for a given vector $d$ are reduced to $O(m^2r)$ and $O(mr)$, respectively. Due to the sparsity-promoting property of the regularizer $p$, the number $r$ is usually much smaller than $n$. Thus, by exploiting the aforementioned second order sparsity, we can greatly reduce the computational costs in solving the linear system (26) when using the Cholesky factorization. More specifically, the total computational cost of using the Cholesky factorization to solve the linear system is reduced from $O(m^2(m+n))$ to $O(m^2(m+r))$. See Figure 1 for an illustration of this reduction. This means that even if $n$ happens to be extremely large (say, larger than $10^7$), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization as long as both $m$ and $r$ are moderate (say, less than $10^4$). If, in addition, $r\ll m$, which is often the case when $m$ is large and the optimal solutions to the underlying problem are sparse, then instead of factorizing an $m\times m$ matrix we can use the Sherman–Morrison–Woodbury formula [22] to obtain the inverse of $I_m + \sigma AUA^T$ by inverting a much smaller $r\times r$ matrix, as follows:

$(I_m + \sigma AUA^T)^{-1} = (I_m + \sigma A_{\mathcal{J}}A_{\mathcal{J}}^T)^{-1} = I_m - A_{\mathcal{J}}(\sigma^{-1}I_r + A_{\mathcal{J}}^TA_{\mathcal{J}})^{-1}A_{\mathcal{J}}^T.$

See Figure 2 for an illustration of the computation of $A_{\mathcal{J}}^TA_{\mathcal{J}}$. In this case, the total computational cost for solving the Newton linear system (26) is reduced significantly further, from $O(m^2(m+r))$ to $O(r^2(m+r))$. We should emphasize here that this dramatic reduction in computational costs results from the wise combination of a careful examination of the existing second order sparsity in lasso-type problems and some "smart" numerical linear algebra.
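A compact Python illustration of this reduction (again for the lasso case with $H = I_m$; the routine below is our illustrative sketch, not the authors' code): given the index set $\mathcal{J}$, only the small $r\times r$ matrix $\sigma^{-1}I_r + A_{\mathcal{J}}^TA_{\mathcal{J}}$ is formed, and the Sherman–Morrison–Woodbury formula is applied to the right-hand side.

```python
import numpy as np

def newton_direction_smw(A, J, sigma, rhs):
    """Solve (I_m + sigma * A_J A_J^T) d = rhs via Sherman-Morrison-Woodbury.

    A   : (m, n) design matrix;  J : boolean mask of the r active columns;
    rhs : right-hand side -grad psi(y).  Cost roughly O(r^2 (m + r)) instead of
    O(m^2 (m + n)) when r << min(m, n)."""
    AJ = A[:, J]                               # m x r submatrix A_J
    r = AJ.shape[1]
    if r == 0:
        return rhs                             # V = I_m when U = 0
    small = np.eye(r) / sigma + AJ.T @ AJ      # sigma^{-1} I_r + A_J^T A_J
    w = np.linalg.solve(small, AJ.T @ rhs)
    return rhs - AJ @ w                        # (I_m - A_J small^{-1} A_J^T) rhs
```

Replacing the dense solve in the earlier Ssn sketch by this routine gives the cost profile discussed above.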

From the above arguments, we can see that as long as the number of nonzero components of $\operatorname{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is small, say less than $\sqrt{n}$, and $H_j\in\partial(\nabla h^*)(y^j)$ is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low cost. In particular, this is true for lasso problems admitting sparse solutions.


Fig. 2. Further reducing the computational costs to $O(r^2m)$.

Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). Note that one may argue that even if the original problem has only sparse solutions, at certain stages one may still encounter the situation where the number of nonzero components of $\operatorname{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice, since we always start with a sparse feasible point, e.g., the zero vector. Second, even if at certain steps this phenomenon does occur, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter $\sigma$ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with $p(\cdot)$ chosen to be $\lambda\|\cdot\|_1$.

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two recent papers [18, 36], which appear to suggest that for some large-scale sparse reconstruction problems, mfIPM (http://www.maths.ed.ac.uk/ERGO/mfipmcs/) and FPC_AS (http://www.caam.rice.edu/~optimization/L1/FPC_AS/) have mostly outperformed the other solvers. Hence, in this section we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP (http://yelab.net/software/SLEP/) [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparisons, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB with the step length set to be 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later, they lack the ability to efficiently solve difficult problems such as large-scale regression problems where the data $\mathcal{A}$ is badly conditioned.

For testing purposes, the regularization parameter $\lambda$ in the lasso problem (1) is chosen as

$\lambda = \lambda_c\|\mathcal{A}^*b\|_\infty,$

where $0 < \lambda_c < 1$. In our numerical experiments, we measure the accuracy of an approximate optimal solution $x$ for (1) by using the following relative KKT residual:

$\eta = \frac{\|x - \operatorname{prox}_{\lambda\|\cdot\|_1}(x - \mathcal{A}^*(\mathcal{A}x - b))\|}{1 + \|x\| + \|\mathcal{A}x - b\|}.$
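For concreteness, the following Python lines (an illustrative sketch in which a dense matrix A stands in for the linear map $\mathcal{A}$) compute this choice of $\lambda$ and the relative KKT residual $\eta$ for a candidate solution x.

```python
import numpy as np

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_lambda(A, b, lam_c):
    """lambda = lam_c * ||A^* b||_inf with 0 < lam_c < 1."""
    return lam_c * np.linalg.norm(A.T @ b, ord=np.inf)

def rel_kkt_residual(A, b, lam, x):
    """Relative KKT residual eta used as the stopping measure."""
    res = A @ x - b
    num = np.linalg.norm(x - prox_l1(x - A.T @ res, lam))
    return num / (1.0 + np.linalg.norm(x) + np.linalg.norm(res))
```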



For a given tolerance $\varepsilon > 0$, we will stop the tested algorithms when $\eta < \varepsilon$. For all the tests in this section, we set $\varepsilon = 10^{-6}$. The algorithms will also be stopped when they reach the maximum number of iterations (1000 iterations for our algorithm and mfIPM, and 20000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to the default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core, Intel Xeon E5-2650 @ 2.60GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms with the test instances $(\mathcal{A}, b)$ obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in $A$ (the matrix representation of $\mathcal{A}$) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order 5 polynomial is used to generate the basis functions. This naming convention is also used in the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$, which is denoted as $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$.

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, $m$ denotes the number of samples, $n$ denotes the number of features, and "nnz" denotes the number of nonzeros in the solution $x$ obtained by Ssnal using the following estimation:

$\mathrm{nnz} := \min\Bigl\{k \;\Big|\; \sum_{i=1}^{k}|\hat{x}_i| \ge 0.999\|x\|_1\Bigr\},$

where $\hat{x}$ is obtained by sorting $x$ such that $|\hat{x}_1|\ge\cdots\ge|\hat{x}_n|$. An entry marked as "Error" indicates that the algorithm breaks down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and an entry "00" in the time column means that the computation time is less than 0.5 second.

One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances

Table 1
Statistics of the UCI test instances.

Probname            | m; n             | λmax(AA*)
E2006.train         | 16087; 150348    | 1.91e+05
log1p.E2006.train   | 16087; 4265669   | 5.86e+07
E2006.test          | 3308; 72812      | 4.79e+04
log1p.E2006.test    | 3308; 1771946    | 1.46e+07
pyrim5              | 74; 169911       | 1.22e+06
triazines4          | 186; 557845      | 2.07e+07
abalone7            | 4177; 6435       | 5.21e+05
bodyfat7            | 252; 116280      | 5.29e+04
housing7            | 506; 77520       | 3.28e+05
mpg7                | 392; 3432        | 1.28e+04
space_ga9           | 3107; 5005       | 4.01e+03


Table 2
The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy ε = 10^{-6}). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM; for each instance (m; n) and each λc, the columns report nnz, the relative KKT residual η for a|b|c|d|e|f, and the computation time for a|b|c|d|e|f. [The detailed entries of Table 2 are not reproduced here.]


to the required accuracy after 20000 iterations or 7 hours. In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances; in fact, for 3 test instances it breaks down due to some internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with $\lambda_c = 10^{-3}$, we can see that Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20000 iterations and consumes about 2 hours but only produces a rather inaccurate solution.

On the other hand, one can observe that the two methods based on second order information, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds ($\lambda_c = 10^{-3}$). Between these two algorithms, Ssnal is clearly far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where Ssnal can be up to 300 times faster. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (those with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with an accuracy of $10^{-1}$ when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point approach in exploiting the sparsity of the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix $A$ (the matrix representation of $\mathcal{A}$) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter $\lambda$ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when $\lambda$ is not a scalar, based on the recommendation of the referee, we use another popular active-set based solver, PSSas (https://www.cs.ubc.ca/~schmidtm/Software/thesis.html) [44], to replace FPC_AS for the testing. All the parameters for PSSas are set to the default values. In our tests, PSSas will be stopped when it reaches the maximum number of 20000 iterations or the maximum computation time of 7 hours. We note that, by default, PSSas will terminate when its progress is smaller than the threshold $10^{-9}$.

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions drawn from Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM type of solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy; in fact, PSSas fails on all the test instances when $\lambda_c = 10^{-4}$. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with $\eta = 3.2\times 10^{-3}$. (Actually, we also ran PSSas on these test instances without scaling



Table 3
The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy ε = 10^{-6}). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM; for each instance (m; n) and each λc, the columns report nnz, the relative KKT residual η for a|b|c1|d|e|f, and the computation time for a|b|c1|d|e|f. [The detailed entries of Table 3 are not reproduced here.]


and obtained similar performances; detailed results are omitted to conserve space.) Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for the Sparco collection. In this subsection, the test instances $(\mathcal{A}, b)$ are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce a 60dB noise to $b$ (as in [18]) by using the MATLAB command b = awgn(b,60,'measured'). For these test instances, the matrix representations of the linear maps $\mathcal{A}$ are not available. Hence, ADMM will not be tested in this subsection, as it would be extremely expensive, if not impossible, to compute and factorize $I + \sigma\mathcal{A}\mathcal{A}^*$.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of $\lambda_c$, i.e., $\lambda_c = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}$. As one can observe, as $\lambda_c$ decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions $(m, n)$ and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$ ($\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, and the computation times (in the format hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well on these test instances, whose Lipschitz constants $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$ are small. As such, Ssnal does not have the clear advantage shown in Table 2. Moreover, since the matrix representations of the linear maps $\mathcal{A}$ and $\mathcal{A}^*$ involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases, with $\lambda_c = 10^{-3}$ and $10^{-4}$. We can observe that FPC_AS

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy ε = 10^{-6}, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

λc | nnz | iteration (a | b | c | d | e) | η (a | b | c | d | e) | time (a | b | c | d | e)

srcsep1: m = 29166, n = 57344, λmax(AA*) = 3.56
0.16 | 380  | 14 | 28 | 42 | 401 | 89    | 6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7 | 07 | 09 | 05 | 15 | 07
0.10 | 726  | 16 | 38 | 42 | 574 | 161   | 4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7 | 11 | 15 | 05 | 22 | 13
0.06 | 1402 | 19 | 41 | 63 | 801 | 393   | 1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7 | 18 | 18 | 10 | 30 | 32
0.03 | 2899 | 19 | 56 | 110 | 901 | 337  | 2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7 | 28 | 53 | 16 | 33 | 28
0.01 | 6538 | 17 | 88 | 223 | 1401 | 542 | 7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7 | 1:21 | 2:15 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, λmax(AA*) = 4.95
0.16 | 423  | 15 | 29 | 42 | 501 | 127   | 3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7 | 14 | 13 | 07 | 25 | 15
0.10 | 805  | 16 | 37 | 84 | 601 | 212   | 8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7 | 21 | 19 | 16 | 30 | 26
0.06 | 1549 | 19 | 40 | 84 | 901 | 419   | 1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7 | 32 | 28 | 18 | 44 | 50
0.03 | 3254 | 20 | 69 | 128 | 901 | 488  | 1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7 | 1:06 | 1:33 | 26 | 44 | 59
0.01 | 7400 | 21 | 94 | 259 | 2201 | 837 | 8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7 | 1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy ε = 10^{-6}, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

probname, m;n | λc | nnz | η (a | b | c | d | e) | time (a | b | c | d | e)

blknheavi, 1024;1024   | 10^-3 | 12    | 5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7 | 01 | 01 | 55 | 08 | 06
                       | 10^-4 | 12    | 9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7 | 01 | 01 | 49 | 07 | 07
srcsep1, 29166;57344   | 10^-3 | 14066 | 1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7 | 5:41 | 42:34 | 13:25 | 1:56 | 4:16
                       | 10^-4 | 19306 | 9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7 | 9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
srcsep2, 29166;86016   | 10^-3 | 16502 | 3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7 | 9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
                       | 10^-4 | 22315 | 7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7 | 19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
srcsep3, 196608;196608 | 10^-3 | 27314 | 6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7 | 33 | 6:24 | 8:51 | 47 | 49
                       | 10^-4 | 83785 | 9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7 | 2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
soccer1, 3200;4096     | 10^-3 | 4     | 1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7 | 01 | 03 | 13:51 | 2:35 | 02
                       | 10^-4 | 8     | 8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7 | 01 | 02 | 13:23 | 3:07 | 02
soccer2, 3200;4096     | 10^-3 | 4     | 3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7 | 00 | 03 | 13:46 | 1:40 | 02
                       | 10^-4 | 8     | 2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7 | 01 | 03 | 13:27 | 3:07 | 02
blurrycam, 65536;65536 | 10^-3 | 1694  | 1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7 | 03 | 09 | 03 | 02 | 07
                       | 10^-4 | 5630  | 1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7 | 05 | 1:35 | 08 | 03 | 29
blurspike, 16384;16384 | 10^-3 | 1954  | 3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7 | 03 | 05 | 6:38 | 03 | 27
                       | 10^-4 | 11698 | 3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7 | 10 | 08 | 6:43 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when $\lambda_c = 10^{-3}$ and $10^{-4}$, respectively. For a few cases, FPC_AS can only achieve a poor accuracy ($10^{-1}$). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi ($m = n = 1024$), whose corresponding $\lambda_{\max}(\mathcal{A}\mathcal{A}^*) = 709$. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This again demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$, one can safely conclude that the first order algorithms can only be used to solve problems with relatively small $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. Moreover, in order to obtain efficient and robust algorithms for lasso problems, or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotically superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving $\ell_1$-regularized least squares (LS) problems. With the intelligent


incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale $\ell_1$-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and the anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candes, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrman, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intel. Syst. Tech., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sele. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wachter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Software, 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C^{1,1} optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A two-phase proximal augmented Lagrangian method for convex quadratic semidefinite programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact regularized proximal Newton method: Provable convergence guarantees for non-smooth convex minimization without strong convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.
[56] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.
[57] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 3: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 435

second order generalized Hessian is wisely exploited This algorithmic framework isadapted from those that appeared in [54 25 52 31] for solving semidefinite program-ming problems where impressive numerical results have been reported Specializedto the vector case our Ssnal possesses unique features that are not available in thesemidefinite programming case It is these combined unique features that allow ouralgorithm to converge at a very fast speed with very low computational costs at eachstep Indeed for large-scale sparse lasso problems our numerical experiments showthat the proposed algorithm needs at most a few dozens of outer iterations to reach so-lutions with the desired accuracy while all the inner semismooth Newton subproblemscan be solved very cheaply One reason for this impressive performance is that thepiecewise linear-quadratic structure of the lasso problem (1) guarantees the asymp-totic superlinear convergence of the augmented Lagrangian algorithm Beyond thepiecewise linear-quadratic case we also study more general functions to guarantee thisfast convergence rate More importantly since there are several desirable propertiesincluding the strong convexity of the objective function in the inner subproblems ineach outer iteration we only need to execute a few (usually one to four) semismoothNewton steps to solve the underlying subproblem As will be shown later for lassoproblems with sparse optimal solutions the computational costs of performing thesesemismooth Newton steps can be made to be extremely cheap compared to othercosts This seems to be counterintuitive as normally one would expect a second ordermethod to be computationally much more expensive than the first order methodsat each step Here we make this counterintuitive achievement possible by carefullyexploiting the second order sparsity in the augmented Lagrangian functions Notablyour algorithmic framework not only works for models such as lasso adaptive lasso[56] and elastic net [57] but can also be applied to more general convex compositeoptimization problems The high performance of our algorithm also serves to showthat the second order information more specifically the nonsmooth second order in-formation can be and should be incorporated intelligently into the algorithmic designfor large-scale optimization problems

The remaining parts of this paper are organized as follows In the next section weintroduce some definitions and present preliminary results on the metric subregularityof multivalued mappings We should emphasize here that these stability results playa pivotal role in the analysis of the convergence rate of our algorithm In section 3we propose an augmented Lagrangian algorithm to solve the general convex com-posite optimization model and analyze its asymptotic superlinear convergence Thesemismooth Newton algorithm for solving the inner subproblems and the efficientimplementation of the algorithm are also presented in this section In section 4 weconduct extensive numerical experiments to evaluate the performance of Ssnal insolving various lasso problems We conclude our paper in sections

2 Preliminaries We discuss in this section some stability properties of convexcomposite optimization problems It will become apparent later that these stabilityproperties are the key ingredients for establishing the fast convergence of our aug-mented Lagrangian method

Recall that X and Y are two real finite-dimensional Euclidean spaces For a givenclosed proper convex function p X rarr (minusinfin+infin] the proximal mapping Proxp(middot)associated with p is defined by

Proxp(x) = arg minuisinX

p(x) +

1

2uminus x2

forallx isin X D


We will often make use of the Moreau identity Prox_{tp}(x) + t Prox_{p^*/t}(x/t) = x, where t > 0 is a given parameter. Denote dist(x, C) := inf_{x′ ∈ C} ‖x − x′‖ for any x ∈ X and any set C ⊆ X.
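For the ℓ_1 regularizer that is central to this paper, both proximal mappings in the Moreau identity have closed forms, and the identity is easy to verify numerically. The following MATLAB sketch is our own illustration (not part of the paper); the data are synthetic.

% Illustrative check of the Moreau identity Prox_{t*p}(x) + t*Prox_{p^*/t}(x/t) = x
% for p = lambda*||.||_1: Prox_{s*p} is soft-thresholding, and since p^* is the
% indicator of the box [-lambda, lambda]^n, Prox_{p^*/t} is the projection onto it.
lambda = 0.5;  t = 2.0;
x = randn(8,1);
soft  = @(v,s) sign(v).*max(abs(v)-s, 0);   % Prox_{s*||.||_1}(v)
projB = @(v,s) min(max(v,-s), s);           % projection onto [-s, s]^n
lhs = soft(x, t*lambda) + t*projB(x/t, lambda);
fprintf('Moreau identity residual = %.2e\n', norm(lhs - x));   % should be ~1e-16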

Let F: X ⇉ Y be a multivalued mapping. We define the graph of F to be the set

gph F := {(x, y) ∈ X × Y | y ∈ F(x)};

F^{-1}, the inverse of F, is the multivalued mapping from Y to X whose graph is {(y, x) | (x, y) ∈ gph F}.

Definition 2.1 (error bound). Let F: X ⇉ Y be a multivalued mapping and y ∈ Y satisfy F^{-1}(y) ≠ ∅. F is said to satisfy the error bound condition for the point y with modulus κ ≥ 0 if there exists ε > 0 such that if x ∈ X with dist(y, F(x)) ≤ ε, then

(2)  dist(x, F^{-1}(y)) ≤ κ dist(y, F(x)).

The above error bound condition was called the growth condition in [34] and was used to analyze the local linear convergence properties of the proximal point algorithm. Recall that F: X ⇉ Y is called a polyhedral multifunction if its graph is the union of finitely many polyhedral convex sets. In [39], Robinson established the following celebrated proposition on the error bound result for polyhedral multifunctions.

Proposition 2.2. Let F be a polyhedral multifunction from X to Y. Then F satisfies the error bound condition (2) for any point y ∈ Y satisfying F^{-1}(y) ≠ ∅ with a common modulus κ ≥ 0.

For later use, we present the following definition of metric subregularity from Chapter 3 in [13].

Definition 2.3 (metric subregularity). Let F: X ⇉ Y be a multivalued mapping and (x̄, ȳ) ∈ gph F. F is said to be metrically subregular at x̄ for ȳ with modulus κ ≥ 0 if there exist neighborhoods U of x̄ and V of ȳ such that

dist(x, F^{-1}(ȳ)) ≤ κ dist(ȳ, F(x) ∩ V)  ∀ x ∈ U.

From the above definition, we see that if F: X ⇉ Y satisfies the error bound condition (2) for ȳ with the modulus κ, then it is metrically subregular at x̄ for ȳ with the same modulus κ for any x̄ ∈ F^{-1}(ȳ).

The following definition of essential smoothness is taken from [40, section 26].

Definition 2.4 (essential smoothness). A proper convex function f on X is essentially smooth if f is differentiable on int(dom f) ≠ ∅ and lim_{k→∞} ‖∇f(x^k)‖ = +∞ whenever {x^k} is a sequence in int(dom f) converging to a boundary point x of int(dom f).

In particular, a smooth convex function on X is essentially smooth. Moreover, if a closed proper convex function f is strictly convex on dom f, then its conjugate f^* is essentially smooth [40, Theorem 26.3].

Consider the following composite convex optimization model:

(3)  max { −(f(x) := h(Ax) − ⟨c, x⟩ + p(x)) },

where A: X → Y is a linear map, h: Y → ℜ and p: X → (−∞, +∞] are two closed proper convex functions, and c ∈ X is a given vector. The dual of (3) can be written as

(4)  min { h^*(y) + p^*(z) | A^*y + z = c },


where h^* and p^* are the Fenchel conjugate functions of h and p, respectively. Throughout this section, we make the following blanket assumption on h^*(·) and p^*(·).

Assumption 2.5. h^*(·) is essentially smooth, and p^*(·) is either an indicator function δ_P(·) or a support function δ^*_P(·) for some nonempty polyhedral convex set P ⊆ X. Moreover, ∇h^* is locally Lipschitz continuous and directionally differentiable on int(dom h^*).

Under Assumption 2.5, by [40, Theorem 26.1], we know that ∂h^*(y) = ∅ whenever y ∉ int(dom h^*). Denote by l the Lagrangian function for (4):

(5)  l(y, z; x) := h^*(y) + p^*(z) − ⟨x, A^*y + z − c⟩  ∀ (y, z, x) ∈ Y × X × X.

Corresponding to the closed proper convex function f in the objective of (3) and the convex-concave function l in (5), define the maximal monotone operators T_f and T_l [42] by

T_f(x) := ∂f(x),  T_l(y, z, x) := {(y′, z′, x′) | (y′, z′, −x′) ∈ ∂l(y, z, x)},

whose inverses are given, respectively, by

T_f^{-1}(x′) := ∂f^*(x′),  T_l^{-1}(y′, z′, x′) := {(y, z, x) | (y′, z′, −x′) ∈ ∂l(y, z, x)}.

Unlike the case for T_f [33, 47, 58], stability results for T_l, which correspond to the perturbations of both primal and dual solutions, are very limited. Next, as a tool for studying the convergence rate of Ssnal, we shall establish a theorem which reveals the metric subregularity of T_l under some mild assumptions.

The KKT system associated with problem (4) is given as follows:

0 ∈ ∂h^*(y) − Ax,  0 ∈ −x + ∂p^*(z),  0 = A^*y + z − c,  (x, y, z) ∈ X × Y × X.

Assume that the above KKT system has at least one solution. This existence assumption, together with the essential smoothness assumption on h^*, implies that the above KKT system can be equivalently rewritten as

(6)  0 = ∇h^*(y) − Ax,  0 ∈ −x + ∂p^*(z),  0 = A^*y + z − c,  (x, y, z) ∈ X × int(dom h^*) × X.

Therefore, under the above assumptions, we only need to focus on int(dom h^*) × X when solving problem (4). Let (ȳ, z̄) be an optimal solution to problem (4). Then we know that the set of Lagrangian multipliers associated with (ȳ, z̄), denoted as M(ȳ, z̄), is nonempty. Define the critical cone associated with (4) at (ȳ, z̄) as follows:

(7)  C(ȳ, z̄) := {(d_1, d_2) ∈ Y × X | A^*d_1 + d_2 = 0, ⟨∇h^*(ȳ), d_1⟩ + (p^*)′(z̄; d_2) = 0, d_2 ∈ T_{dom(p^*)}(z̄)},

where (p^*)′(z̄; d_2) is the directional derivative of p^* at z̄ with respect to d_2 and T_{dom(p^*)}(z̄) is the tangent cone of dom(p^*) at z̄. When the conjugate function p^* is taken to be the indicator function of a nonempty polyhedral set P, the above definition reduces directly to the standard definition of the critical cone in the nonlinear programming setting.


Definition 2.6 (second order sufficient condition). Let (ȳ, z̄) ∈ Y × X be an optimal solution to problem (4) with M(ȳ, z̄) ≠ ∅. We say that the second order sufficient condition for problem (4) holds at (ȳ, z̄) if

⟨d_1, (∇h^*)′(ȳ; d_1)⟩ > 0  ∀ 0 ≠ (d_1, d_2) ∈ C(ȳ, z̄).

By building on the proof ideas from the literature on nonlinear programming problems [11, 27, 24], we are able to prove the following result on the metric subregularity of T_l. This allows us to prove the linear and even the asymptotic superlinear convergence of the sequences generated by the Ssnal algorithm to be presented in the next section, even when the objective in problem (3) does not possess the piecewise linear-quadratic structure as in the lasso problem (1).

Theorem 2.7. Assume that the KKT system (6) has at least one solution and denote it as (x̄, ȳ, z̄). Suppose that Assumption 2.5 holds and that the second order sufficient condition for problem (4) holds at (ȳ, z̄). Then T_l is metrically subregular at (ȳ, z̄, x̄) for the origin.

Proof. First we claim that there exists a neighborhood U of (x̄, ȳ, z̄) such that for any w = (u_1, u_2, v) ∈ W := Y × X × X with ‖w‖ sufficiently small, any solution (x(w), y(w), z(w)) ∈ U of the perturbed KKT system

(8)  0 = ∇h^*(y) − Ax − u_1,  0 ∈ −x − u_2 + ∂p^*(z),  0 = A^*y + z − c − v

satisfies the following estimate:

(9)  ‖(y(w), z(w)) − (ȳ, z̄)‖ = O(‖w‖).

For the sake of contradiction, suppose that our claim is not true. Then there exist sequences {w^k = (u_1^k, u_2^k, v^k)} and {(x^k, y^k, z^k)} such that w^k → 0 and (x^k, y^k, z^k) → (x̄, ȳ, z̄); for each k, the point (x^k, y^k, z^k) is a solution of (8) for w = w^k; and

‖(y^k, z^k) − (ȳ, z̄)‖ > γ_k ‖w^k‖

with some γ_k > 0 such that γ_k → ∞. Passing to a subsequence if necessary, we assume that ((y^k, z^k) − (ȳ, z̄))/‖(y^k, z^k) − (ȳ, z̄)‖ converges to some ξ = (ξ_1, ξ_2) ∈ Y × X with ‖ξ‖ = 1. Then, setting t_k := ‖(y^k, z^k) − (ȳ, z̄)‖ and passing to a further subsequence if necessary, by the local Lipschitz continuity and the directional differentiability of ∇h^*(·) at ȳ, we know that for all k sufficiently large,

∇h^*(y^k) − ∇h^*(ȳ) = ∇h^*(ȳ + t_k ξ_1) − ∇h^*(ȳ) + ∇h^*(y^k) − ∇h^*(ȳ + t_k ξ_1)
                     = t_k (∇h^*)′(ȳ; ξ_1) + o(t_k) + O(‖y^k − ȳ − t_k ξ_1‖)
                     = t_k (∇h^*)′(ȳ; ξ_1) + o(t_k).

Denote, for each k, x̃^k := x^k + u_2^k. Simple calculations show that for all k sufficiently large,

(10)  0 = ∇h^*(y^k) − ∇h^*(ȳ) − A(x̃^k − x̄) + A u_2^k − u_1^k = t_k (∇h^*)′(ȳ; ξ_1) + o(t_k) − A(x̃^k − x̄)

and

(11)  0 = A^*(y^k − ȳ) + (z^k − z̄) − v^k = t_k (A^*ξ_1 + ξ_2) + o(t_k).


Dividing both sides of (11) by t_k and then taking limits, we obtain

(12)  A^*ξ_1 + ξ_2 = 0,

which further implies that

(13)  ⟨∇h^*(ȳ), ξ_1⟩ = ⟨Ax̄, ξ_1⟩ = −⟨x̄, ξ_2⟩.

Since x̃^k ∈ ∂p^*(z^k), we know that for all k sufficiently large,

z^k = z̄ + t_k ξ_2 + o(t_k) ∈ dom p^*.

That is, ξ_2 ∈ T_{dom p^*}(z̄). According to the structure of p^*, we separate our discussion into two cases.

Case I. There exists a nonempty polyhedral convex set P such that p^*(z) = δ^*_P(z) for all z ∈ X. Then for each k it holds that

x̃^k = Π_P(z^k + x̃^k).

By [14, Theorem 4.1.1], we have

(14)  x̃^k = Π_P(z^k + x̃^k) = Π_P(z̄ + x̄) + Π_C(z^k − z̄ + x̃^k − x̄) = x̄ + Π_C(z^k − z̄ + x̃^k − x̄),

where C is the critical cone of P at z̄ + x̄, i.e.,

C ≡ C_P(z̄ + x̄) := T_P(x̄) ∩ z̄^⊥.

Since C is a polyhedral cone, we know from [14, Proposition 4.1.4] that Π_C(·) is a piecewise linear function; i.e., there exist a positive integer l and orthogonal projectors B_1, …, B_l such that for any x ∈ X,

Π_C(x) ∈ {B_1 x, …, B_l x}.

By restricting to a subsequence if necessary, we may further assume that there exists 1 ≤ j′ ≤ l such that for all k ≥ 1,

(15)  Π_C(z^k − z̄ + x̃^k − x̄) = B_{j′}(z^k − z̄ + x̃^k − x̄) = Π_{Range(B_{j′})}(z^k − z̄ + x̃^k − x̄).

Denote L := Range(B_{j′}). Combining (14) and (15), we get

C ∩ L ∋ (x̃^k − x̄) ⊥ (z^k − z̄) ∈ C° ∩ L^⊥,

where C° is the polar cone of C. Since x̃^k − x̄ ∈ C, we have ⟨x̃^k − x̄, z̄⟩ = 0, which together with ⟨x̃^k − x̄, z^k − z̄⟩ = 0 implies ⟨x̃^k − x̄, z^k⟩ = 0. Thus, for all k sufficiently large,

⟨z^k, x̃^k⟩ = ⟨z^k, x̄⟩ = ⟨z̄ + t_k ξ_2 + o(t_k), x̄⟩,

and it follows that

(16)  (p^*)′(z̄; ξ_2) = lim_{k→∞} [δ^*_P(z̄ + t_k ξ_2) − δ^*_P(z̄)]/t_k = lim_{k→∞} [δ^*_P(z^k) − δ^*_P(z̄)]/t_k = lim_{k→∞} [⟨z^k, x̃^k⟩ − ⟨z̄, x̄⟩]/t_k = ⟨x̄, ξ_2⟩.


By (12), (13), and (16), we know that (ξ_1, ξ_2) ∈ C(ȳ, z̄). Since C ∩ L is a polyhedral convex cone, we know from [40, Theorem 19.3] that A(C ∩ L) is also a polyhedral convex cone, which, together with (10), implies

(∇h^*)′(ȳ; ξ_1) ∈ A(C ∩ L).

Therefore, there exists a vector η ∈ C ∩ L such that

⟨ξ_1, (∇h^*)′(ȳ; ξ_1)⟩ = ⟨ξ_1, Aη⟩ = −⟨ξ_2, η⟩ = 0,

where the last equality follows from the fact that ξ_2 ∈ C° ∩ L^⊥. Note that the last inclusion holds since the polyhedral convex cone C° ∩ L^⊥ is closed and

ξ_2 = lim_{k→∞} (z^k − z̄)/t_k ∈ C° ∩ L^⊥.

As 0 ≠ ξ = (ξ_1, ξ_2) ∈ C(ȳ, z̄) but ⟨ξ_1, (∇h^*)′(ȳ; ξ_1)⟩ = 0, this contradicts the assumption that the second order sufficient condition holds for (4) at (ȳ, z̄). Thus we have proved our claim for Case I.

Case II. There exists a nonempty polyhedral convex set P such that p^*(z) = δ_P(z) for all z ∈ X. Then we know that for each k,

z^k = Π_P(z^k + x̃^k).

Since (δ_P)′(z̄; d_2) = 0 for all d_2 ∈ T_{dom(p^*)}(z̄), the critical cone in (7) now takes the following form:

C(ȳ, z̄) = {(d_1, d_2) ∈ Y × X | A^*d_1 + d_2 = 0, ⟨∇h^*(ȳ), d_1⟩ = 0, d_2 ∈ T_{dom(p^*)}(z̄)}.

Similar to Case I, without loss of generality, we can assume that there exists a subspace L such that for all k ≥ 1,

C ∩ L ∋ (z^k − z̄) ⊥ (x̃^k − x̄) ∈ C° ∩ L^⊥,

where

C ≡ C_P(z̄ + x̄) := T_P(z̄) ∩ x̄^⊥.

Since C ∩ L is a polyhedral convex cone, we know

ξ_2 = lim_{k→∞} (z^k − z̄)/t_k ∈ C ∩ L,

and consequently ⟨ξ_2, x̄⟩ = 0, which, together with (12) and (13), implies ξ = (ξ_1, ξ_2) ∈ C(ȳ, z̄). By (10) and the fact that A(C° ∩ L^⊥) is a polyhedral convex cone, we know that

(∇h^*)′(ȳ; ξ_1) ∈ A(C° ∩ L^⊥).

Therefore, there exists a vector η ∈ C° ∩ L^⊥ such that

⟨ξ_1, (∇h^*)′(ȳ; ξ_1)⟩ = ⟨ξ_1, Aη⟩ = −⟨ξ_2, η⟩ = 0.

Since ξ = (ξ_1, ξ_2) ≠ 0, we arrive at a contradiction to the assumed second order sufficient condition. So our claim is also true for this case.

In summary, we have proven that there exists a neighborhood U of (x̄, ȳ, z̄) such that for any w close enough to the origin, (9) holds for any solution (x(w), y(w),


z(w)) ∈ U of the perturbed KKT system (8). Next, we show that T_l is metrically subregular at (ȳ, z̄, x̄) for the origin.

Define the mapping Θ_KKT: X × Y × X × W → Y × X × X by

Θ_KKT(x, y, z; w) := ( ∇h^*(y) − Ax − u_1,  z − Prox_{p^*}(z + x + u_2),  A^*y + z − c − v )  ∀ (x, y, z, w) ∈ X × Y × X × W,

and define the mapping θ: X → Y × X × X as follows:

θ(x) := Θ_KKT(x, ȳ, z̄; 0)  ∀ x ∈ X.

Then we have x ∈ M(ȳ, z̄) if and only if θ(x) = 0. Since Prox_{p^*}(·) is a piecewise affine function, θ(·) is a piecewise affine function and thus a polyhedral multifunction. By using Proposition 2.2 and shrinking the neighborhood U if necessary, for any w close enough to the origin and any solution (x(w), y(w), z(w)) ∈ U of the perturbed KKT system (8), we have

dist(x(w), M(ȳ, z̄)) = O(‖θ(x(w))‖)
                     = O(‖Θ_KKT(x(w), ȳ, z̄; 0) − Θ_KKT(x(w), y(w), z(w); w)‖)
                     = O(‖w‖ + ‖(y(w), z(w)) − (ȳ, z̄)‖),

which, together with (9), implies the existence of a constant κ ≥ 0 such that

(17)  ‖(y(w), z(w)) − (ȳ, z̄)‖ + dist(x(w), M(ȳ, z̄)) ≤ κ‖w‖.

Thus, by Definition 2.3, we have proven that T_l is metrically subregular at (ȳ, z̄, x̄) for the origin. The proof of the theorem is complete.

Remark 2.8. For convex piecewise linear-quadratic programming problems, such as the ℓ_1 and elastic net regularized LS problems, we know from J. Sun's thesis [46] on the characterization of the subdifferentials of convex piecewise linear-quadratic functions (see also [43, Proposition 12.30]) that the corresponding operators T_l and T_f are polyhedral multifunctions, and thus, by Proposition 2.2, the error bound condition holds. Here we emphasize again that the error bound condition holds for the lasso problem (1). Moreover, T_f associated with the ℓ_1 or elastic net regularized logistic regression model, i.e., where for a given vector b ∈ ℜ^m the loss function h: ℜ^m → ℜ in problem (3) takes the form

h(y) = Σ_{i=1}^m log(1 + e^{−b_i y_i})  ∀ y ∈ ℜ^m,

also satisfies the error bound condition [33, 47]. Meanwhile, from Theorem 2.7, we know that T_l corresponding to the ℓ_1 or the elastic net regularized logistic regression model is metrically subregular at any solution to the KKT system (6) for the origin.

3. An augmented Lagrangian method with asymptotic superlinear convergence. Recall the general convex composite model (3):

(P)  max { −(f(x) = h(Ax) − ⟨c, x⟩ + p(x)) }

and its dual

(D)  min { h^*(y) + p^*(z) | A^*y + z = c }.


In this section, we shall propose an asymptotically superlinearly convergent augmented Lagrangian method for solving (P) and (D). Throughout this section, we make the following standing assumptions regarding the function h.

Assumption 3.1.
(a) h: Y → ℜ is a convex differentiable function whose gradient is (1/α_h)-Lipschitz continuous, i.e.,

‖∇h(y′) − ∇h(y)‖ ≤ (1/α_h)‖y′ − y‖  ∀ y′, y ∈ Y.

(b) h is essentially locally strongly convex [21], i.e., for any compact and convex set K ⊂ dom ∂h, there exists β_K > 0 such that

(1−λ)h(y′) + λh(y) ≥ h((1−λ)y′ + λy) + (1/2)β_K λ(1−λ)‖y′ − y‖²  ∀ y′, y ∈ K,

for all λ ∈ [0, 1].

Many commonly used loss functions in the machine learning literature satisfy the above mild assumptions. For example, h can be the loss function in the linear regression, logistic regression, and Poisson regression models. While the strict convexity of a convex function is closely related to the differentiability of its conjugate, the essential local strong convexity in Assumption 3.1(b) for a convex function was first proposed in [21] to obtain a characterization of the local Lipschitz continuity of the gradient of its conjugate function.

The aforementioned assumptions on h imply the following useful properties of h^*. First, by [43, Proposition 12.60], we know that h^* is strongly convex with modulus α_h. Second, by [21, Corollary 4.4], we know that h^* is essentially smooth and ∇h^* is locally Lipschitz continuous on int(dom h^*). If the solution set to the KKT system associated with (P) and (D) is further assumed to be nonempty, then, similar to what we have discussed in the last section, one only needs to focus on int(dom h^*) × X when solving (D). Given σ > 0, the augmented Lagrangian function associated with (D) is given by

L_σ(y, z; x) := l(y, z; x) + (σ/2)‖A^*y + z − c‖²  ∀ (y, z, x) ∈ Y × X × X,

where the Lagrangian function l(·) is defined in (5).

3.1. SSNAL: A semismooth Newton augmented Lagrangian algorithm for (D). The detailed steps of our algorithm Ssnal for solving (D) are given as follows. Since a semismooth Newton method will be employed to solve the subproblems involved in this prototype of the inexact augmented Lagrangian method [42], it is natural for us to call our algorithm Ssnal.

Algorithm SSNAL: An inexact augmented Lagrangian method for (D).
Let σ_0 > 0 be a given parameter. Choose (y^0, z^0, x^0) ∈ int(dom h^*) × dom p^* × X. For k = 0, 1, …, perform the following steps in each iteration.
Step 1. Compute

(18)  (y^{k+1}, z^{k+1}) ≈ arg min { Ψ_k(y, z) := L_{σ_k}(y, z; x^k) }.

Step 2. Compute x^{k+1} = x^k − σ_k(A^*y^{k+1} + z^{k+1} − c) and update σ_{k+1} ↑ σ_∞ ≤ ∞.

Next, we shall adapt the results developed in [41, 42, 34] to establish the global and local superlinear convergence of our algorithm.


Since the inner problems cannot be solved exactly, we use the following standard stopping criterion studied in [41, 42] for approximately solving (18):

(A)  Ψ_k(y^{k+1}, z^{k+1}) − inf Ψ_k ≤ ε_k²/(2σ_k),  Σ_{k=0}^∞ ε_k < ∞.

Now we can state the global convergence of Algorithm Ssnal from [41, 42] without much difficulty.

Theorem 3.2. Suppose that Assumption 3.1 holds and that the solution set to (P) is nonempty. Let {(y^k, z^k, x^k)} be the infinite sequence generated by Algorithm Ssnal with stopping criterion (A). Then the sequence {x^k} is bounded and converges to an optimal solution of (P). In addition, {(y^k, z^k)} is also bounded and converges to the unique optimal solution (y^*, z^*) ∈ int(dom h^*) × dom p^* of (D).

Proof. Since the solution set to (P) is assumed to be nonempty, the optimal value of (P) is finite. From Assumption 3.1(a), we have that dom h = Y and h^* is strongly convex [43, Proposition 12.60]. Then, by Fenchel's duality theorem [40, Corollary 31.2.1], we know that the solution set to (D) is nonempty and the optimal value of (D) is finite and equal to the optimal value of (P). That is, the solution set to the KKT system associated with (P) and (D) is nonempty. The uniqueness of the optimal solution (y^*, z^*) ∈ int(dom h^*) × X of (D) then follows directly from the strong convexity of h^*. By combining this uniqueness with [42, Theorem 4], one can easily obtain the boundedness of {(y^k, z^k)} and the other desired results.

We need the following stopping criteria for the local convergence analysis:

(B1)  Ψ_k(y^{k+1}, z^{k+1}) − inf Ψ_k ≤ (δ_k²/(2σ_k))‖x^{k+1} − x^k‖²,  Σ_{k=0}^∞ δ_k < +∞;

(B2)  dist(0, ∂Ψ_k(y^{k+1}, z^{k+1})) ≤ (δ′_k/σ_k)‖x^{k+1} − x^k‖,  0 ≤ δ′_k → 0,

where x^{k+1} = x^k − σ_k(A^*y^{k+1} + z^{k+1} − c).

Theorem 3.3. Assume that Assumption 3.1 holds and that the solution set Ω to (P) is nonempty. Suppose that T_f satisfies the error bound condition (2) for the origin with modulus a_f. Let {(y^k, z^k, x^k)} be any infinite sequence generated by Algorithm Ssnal with stopping criteria (A) and (B1). Then the sequence {x^k} converges to x^* ∈ Ω, and for all k sufficiently large,

(19)  dist(x^{k+1}, Ω) ≤ θ_k dist(x^k, Ω),

where θ_k = (a_f(a_f² + σ_k²)^{-1/2} + 2δ_k)(1 − δ_k)^{-1} → θ_∞ = a_f(a_f² + σ_∞²)^{-1/2} < 1 as k → +∞. Moreover, the sequence {(y^k, z^k)} converges to the unique optimal solution (y^*, z^*) ∈ int(dom h^*) × dom p^* of (D).

Moreover, if T_l is metrically subregular at (y^*, z^*, x^*) for the origin with modulus a_l and the stopping criterion (B2) is also used, then for all k sufficiently large,

‖(y^{k+1}, z^{k+1}) − (y^*, z^*)‖ ≤ θ′_k ‖x^{k+1} − x^k‖,

where θ′_k = a_l(1 + δ′_k)/σ_k with lim_{k→∞} θ′_k = a_l/σ_∞.

Proof. The first part of the theorem follows from [34, Theorem 2.1], [42, Proposition 7 and Theorem 5], and Theorem 3.2. To prove the second part, we recall that


if T_l is metrically subregular at (y^*, z^*, x^*) for the origin with the modulus a_l and (y^k, z^k, x^k) → (y^*, z^*, x^*), then for all k sufficiently large,

‖(y^{k+1}, z^{k+1}) − (y^*, z^*)‖ + dist(x^{k+1}, Ω) ≤ a_l dist(0, T_l(y^{k+1}, z^{k+1}, x^{k+1})).

Therefore, by the estimate (4.21) in [42] and the stopping criterion (B2), we obtain that for all k sufficiently large,

‖(y^{k+1}, z^{k+1}) − (y^*, z^*)‖ ≤ a_l(1 + δ′_k)/σ_k ‖x^{k+1} − x^k‖.

This completes the proof of Theorem 3.3.

Remark 3.4. In the above theorem, when σ_∞ = +∞, the inequality (19) directly implies that {x^k} converges Q-superlinearly. Moreover, recent advances in [12] also established the asymptotic R-superlinear convergence of the dual iteration sequence {(y^k, z^k)}. Indeed, from [12, Proposition 4.1], under the same conditions as in Theorem 3.3, we have that for k sufficiently large,

(20)  ‖(y^{k+1}, z^{k+1}) − (y^*, z^*)‖ ≤ θ′_k‖x^{k+1} − x^k‖ ≤ θ′_k(1 − δ_k)^{-1} dist(x^k, Ω),

where θ′_k(1 − δ_k)^{-1} → a_l/σ_∞. Thus, if σ_∞ = ∞, the Q-superlinear convergence of {x^k} and (20) further imply that {(y^k, z^k)} converges R-superlinearly.

We should emphasize here that, by combining Remarks 2.8 and 3.4 and Theorem 3.3, our Algorithm Ssnal is guaranteed to produce an asymptotically superlinearly convergent sequence when used to solve (D) for many commonly used regularizers and loss functions. In particular, the Ssnal algorithm is asymptotically superlinearly convergent when applied to the dual of (1).

3.2. Solving the augmented Lagrangian subproblems. Here we shall propose an efficient semismooth Newton algorithm to solve the inner subproblems in the augmented Lagrangian method (18). That is, for some fixed σ > 0 and x ∈ X, we aim to solve

(21)  min_{y,z} { Ψ(y, z) := L_σ(y, z; x) }.

Since Ψ(·, ·) is a strongly convex function, we have that for any α ∈ ℜ, the level set L_α := {(y, z) ∈ dom h^* × dom p^* | Ψ(y, z) ≤ α} is a closed and bounded convex set. Moreover, problem (21) admits a unique optimal solution, denoted as (ȳ, z̄) ∈ int(dom h^*) × dom p^*.

Denote, for any y ∈ Y,

ψ(y) := inf_z Ψ(y, z) = h^*(y) + p^*(Prox_{p^*/σ}(x/σ − A^*y + c)) + (1/(2σ))‖Prox_{σp}(x − σ(A^*y − c))‖² − (1/(2σ))‖x‖².

Therefore, if (ȳ, z̄) = arg min Ψ(y, z), then (ȳ, z̄) ∈ int(dom h^*) × dom p^* can be computed simultaneously by

ȳ = arg min ψ(y),  z̄ = Prox_{p^*/σ}(x/σ − A^*ȳ + c).


Note that ψ(·) is strongly convex and continuously differentiable on int(dom h^*) with

∇ψ(y) = ∇h^*(y) − A Prox_{σp}(x − σ(A^*y − c))  ∀ y ∈ int(dom h^*).

Thus, ȳ can be obtained by solving the following nonsmooth equation:

(22)  ∇ψ(y) = 0,  y ∈ int(dom h^*).
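For the lasso problem (1), one has h(w) = (1/2)‖w − b‖², so h^*(y) = (1/2)‖y‖² + ⟨b, y⟩ and ∇h^*(y) = y + b, while c = 0, p = λ‖·‖_1, and the term p^*(Prox_{p^*/σ}(·)) vanishes because p^* is the indicator of the box {z : ‖z‖_∞ ≤ λ}. The MATLAB sketch below is our own illustration (synthetic data) of ψ and ∇ψ in this case.

% psi and grad(psi) specialized to the lasso: h(w) = 0.5*||w-b||^2, p = lambda*||.||_1, c = 0.
rng(1); m = 50; n = 200;
A = randn(m,n); b = randn(m,1); lambda = 0.1; sigma = 1.0; x = zeros(n,1);
soft = @(v,s) sign(v).*max(abs(v)-s, 0);                     % Prox_{s*||.||_1}
psi  = @(y) 0.5*norm(y)^2 + b'*y ...
            + (0.5/sigma)*norm(soft(x - sigma*(A'*y), sigma*lambda))^2 ...
            - (0.5/sigma)*norm(x)^2;
gpsi = @(y) y + b - A*soft(x - sigma*(A'*y), sigma*lambda);  % grad psi(y)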

Let y ∈ int(dom h^*) be any given point. Since h^* is a convex function with a locally Lipschitz continuous gradient on int(dom h^*), the following operator is well defined:

∂̂²ψ(y) := ∂(∇h^*)(y) + σ A ∂Prox_{σp}(x − σ(A^*y − c)) A^*,

where ∂(∇h^*)(y) is the Clarke subdifferential of ∇h^* at y [10], and ∂Prox_{σp}(x − σ(A^*y − c)) is the Clarke subdifferential of the Lipschitz continuous mapping Prox_{σp}(·) at x − σ(A^*y − c). Note that from [10, Proposition 2.3.3 and Theorem 2.6.6], we know that

∂²ψ(y)(d) ⊆ ∂̂²ψ(y)(d)  ∀ d ∈ Y,

where ∂²ψ(y) denotes the generalized Hessian of ψ at y. Define

(23)  V := H + σAUA^*

with H ∈ ∂²h^*(y) and U ∈ ∂Prox_{σp}(x − σ(A^*y − c)). Then we have V ∈ ∂̂²ψ(y). Since h^* is a strongly convex function, we know that H is symmetric positive definite on Y, and thus V is also symmetric positive definite on Y.

Under the mild assumption that ∇h^* and Prox_{σp} are strongly semismooth (the definition of which is given next), we can design a superlinearly convergent semismooth Newton method to solve the nonsmooth equation (22).

Definition 3.5 (semismoothness [35, 38, 45]). Let F: O ⊆ X → Y be a locally Lipschitz continuous function on the open set O. F is said to be semismooth at x ∈ O if F is directionally differentiable at x and for any V ∈ ∂F(x + Δx) with Δx → 0,

F(x + Δx) − F(x) − VΔx = o(‖Δx‖).

F is said to be strongly semismooth at x if F is semismooth at x and

F(x + Δx) − F(x) − VΔx = O(‖Δx‖²).

F is said to be a semismooth (respectively, strongly semismooth) function on O if it is semismooth (respectively, strongly semismooth) everywhere in O.

Note that it is widely known in the nonsmooth optimization/equation community that continuous piecewise affine functions and twice continuously differentiable functions are all strongly semismooth everywhere. In particular, Prox_{‖·‖_1}, as a Lipschitz continuous piecewise affine function, is strongly semismooth. See [14] for more semismooth and strongly semismooth functions.

Now we can design a semismooth Newton (Ssn) method to solve (22) as follows and can expect fast superlinear or even quadratic convergence.


Algorithm SSN: A semismooth Newton algorithm for solving (22) (Ssn(y^0, x, σ)).
Given μ ∈ (0, 1/2), η̄ ∈ (0, 1), τ ∈ (0, 1], and δ ∈ (0, 1), choose y^0 ∈ int(dom h^*). Iterate the following steps for j = 0, 1, ….
Step 1. Choose H_j ∈ ∂(∇h^*)(y^j) and U_j ∈ ∂Prox_{σp}(x − σ(A^*y^j − c)). Let V_j := H_j + σAU_jA^*. Solve the linear system

(24)  V_j d + ∇ψ(y^j) = 0

exactly or by the conjugate gradient (CG) algorithm to find d^j such that ‖V_j d^j + ∇ψ(y^j)‖ ≤ min(η̄, ‖∇ψ(y^j)‖^{1+τ}).
Step 2. (Line search) Set α_j = δ^{m_j}, where m_j is the first nonnegative integer m for which

y^j + δ^m d^j ∈ int(dom h^*) and ψ(y^j + δ^m d^j) ≤ ψ(y^j) + μδ^m ⟨∇ψ(y^j), d^j⟩.

Step 3. Set y^{j+1} = y^j + α_j d^j.
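The following MATLAB function is a minimal sketch of Algorithm Ssn specialized to the lasso case (so that ∇h^*(y) = y + b, H_j = I_m, and c = 0). It is our illustration rather than the authors' released code: the parameter choices are arbitrary, the generalized Jacobian element U_j is the 0-1 diagonal matrix described in section 3.3 below, and the Newton system is solved directly.

function y = ssn_lasso(A, b, lambda, sigma, x, y, tol)
% Semismooth Newton sketch for solving grad(psi)(y) = 0 in the lasso case.
% Inputs: data (A, b), regularizer weight lambda, penalty sigma, multiplier x,
%         starting point y, tolerance tol on ||grad(psi)(y)||.
  mu = 1e-4; beta = 0.5;                              % line-search parameters
  soft = @(v,s) sign(v).*max(abs(v)-s, 0);
  psi  = @(y) 0.5*norm(y)^2 + b'*y + ...
        (0.5/sigma)*norm(soft(x - sigma*(A'*y), sigma*lambda))^2 - (0.5/sigma)*norm(x)^2;
  for j = 1:50
      xbar = x - sigma*(A'*y);
      g = y + b - A*soft(xbar, sigma*lambda);         % grad(psi)(y)
      if norm(g) <= tol, break; end
      J  = abs(xbar) > sigma*lambda;                  % support of the 0-1 diagonal U_j
      AJ = A(:, J);
      V  = eye(length(y)) + sigma*(AJ*AJ');           % V_j = H_j + sigma*A*U_j*A'
      d  = -V\g;                                      % Newton direction (direct solve)
      alpha = 1;                                      % backtracking line search
      while psi(y + alpha*d) > psi(y) + mu*alpha*(g'*d)
          alpha = beta*alpha;
      end
      y = y + alpha*d;
  end
end

In a production implementation the linear system would instead be solved by CG or via the reduced formulation of section 3.3, so that V is never formed explicitly.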

The convergence results for the above Ssn algorithm are stated in the next theorem.

Theorem 3.6. Assume that ∇h^*(·) and Prox_{σp}(·) are strongly semismooth on int(dom h^*) and X, respectively. Let the sequence {y^j} be generated by Algorithm Ssn. Then {y^j} converges to the unique optimal solution ȳ ∈ int(dom h^*) of the problem in (22), and

‖y^{j+1} − ȳ‖ = O(‖y^j − ȳ‖^{1+τ}).

Proof. Since, by [54, Proposition 3.3], d^j is a descent direction, Algorithm Ssn is well defined. By (23), we know that for any j ≥ 0, V_j ∈ ∂̂²ψ(y^j). Then we can prove the conclusion of this theorem by following the proofs of [54, Theorems 3.4 and 3.5]. We omit the details here.

We shall now discuss the implementation of the stopping criteria (A), (B1), and (B2) for Algorithm Ssn when solving the subproblem (18) in Algorithm Ssnal. Note that when Algorithm Ssn is applied to minimize Ψ_k(·) to find

y^{k+1} = Ssn(y^k, x^k, σ_k) and z^{k+1} = Prox_{p^*/σ_k}(x^k/σ_k − A^*y^{k+1} + c),

we have, by simple calculations and the strong convexity of h^*, that

Ψ_k(y^{k+1}, z^{k+1}) − inf Ψ_k = ψ_k(y^{k+1}) − inf ψ_k ≤ (1/(2α_h))‖∇ψ_k(y^{k+1})‖²

and (∇ψ_k(y^{k+1}), 0) ∈ ∂Ψ_k(y^{k+1}, z^{k+1}), where ψ_k(y) := inf_z Ψ_k(y, z) for all y ∈ Y. Therefore, the stopping criteria (A), (B1), and (B2) can be achieved by the following implementable criteria:

(A′)  ‖∇ψ_k(y^{k+1})‖ ≤ √(α_h/σ_k) ε_k,  Σ_{k=0}^∞ ε_k < ∞;

(B1′)  ‖∇ψ_k(y^{k+1})‖ ≤ √(α_h σ_k) δ_k ‖A^*y^{k+1} + z^{k+1} − c‖,  Σ_{k=0}^∞ δ_k < +∞;

(B2′)  ‖∇ψ_k(y^{k+1})‖ ≤ δ′_k ‖A^*y^{k+1} + z^{k+1} − c‖,  0 ≤ δ′_k → 0.


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as ‖∇ψ_k(y^{k+1})‖ is sufficiently small.
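Combining the outer update of Algorithm Ssnal with the inner semismooth Newton solver, a bare-bones MATLAB sketch for the lasso reads as follows. It is our illustration only: it assumes the function ssn_lasso sketched after Algorithm Ssn above is available, uses a crude decreasing gradient tolerance in the spirit of (A′)–(B2′), and makes arbitrary choices for the penalty update.

function [x, y, z] = ssnal_lasso(A, b, lambda)
% Sketch of Algorithm Ssnal for min 0.5*||A*x - b||^2 + lambda*||x||_1.
  [m, n] = size(A);
  x = zeros(n,1); y = zeros(m,1); sigma = 1;
  for k = 1:100
      % Step 1: approximately minimize Psi_k(y,z) = L_{sigma_k}(y,z;x^k) via Ssn,
      % stopping once ||grad(psi_k)|| is small (stand-in for (A')-(B2')).
      tol = max(1e-10, 0.5^k);
      y = ssn_lasso(A, b, lambda, sigma, x, y, tol);
      z = min(max(x/sigma - A'*y, -lambda), lambda);  % Prox_{p^*/sigma}(x/sigma - A'*y)
      % Step 2: multiplier update and penalty update.
      Rd = A'*y + z;                                  % dual infeasibility A'*y + z - c (c = 0)
      x  = x - sigma*Rd;
      if norm(Rd) < 1e-8, break; end
      sigma = min(10*sigma, 1e6);                     % sigma_k nondecreasing toward sigma_inf
  end
end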

3.3. An efficient implementation of SSN for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we shall discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer p is chosen to be λ‖·‖_1 for some λ > 0. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction d^j from the linear system (24). So we shall first discuss the solving of this linear system.

Let (x, y) ∈ ℜ^n × ℜ^m and σ > 0 be given. We consider the following Newton linear system:

(25)  (H + σAUA^T)d = −∇ψ(y),

where H ∈ ∂(∇h^*)(y), A denotes the matrix representation of A with respect to the standard bases of ℜ^n and ℜ^m, and U ∈ ∂Prox_{σλ‖·‖_1}(x̄) with x̄ := x − σ(A^Ty − c). Since H is a symmetric and positive definite matrix, (25) can be equivalently rewritten as

(I_m + σ(L^{-1}A)U(L^{-1}A)^T)(L^Td) = −L^{-1}∇ψ(y),

where L is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of H such that H = LL^T. In many applications, H is usually a sparse matrix. Indeed, when the function h in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrices H are in fact diagonal matrices. That is, the costs of computing L and its inverse are negligible in most situations. Therefore, without loss of generality, we can consider a simplified version of (25) as follows:

(26)  (I_m + σAUA^T)d = −∇ψ(y),

which is precisely the Newton system associated with the standard lasso problem (1). Since U ∈ ℜ^{n×n} is a diagonal matrix, at first glance the costs of computing AUA^T and the matrix-vector multiplication AUA^Td for a given vector d ∈ ℜ^m are O(m²n) and O(mn), respectively. These computational costs are too expensive when the dimensions of A are large and can make the commonly employed approaches, such as the Cholesky factorization and the conjugate gradient method, inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of U is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level such that they are negligible or at least insignificant compared to other costs. Next, we shall show how this can be done by taking full advantage of the sparsity of U. This sparsity will be referred to as the second order sparsity of the underlying problem.

For x̄ = x − σ(A^Ty − c), in our computations we can always choose U = Diag(u), the diagonal matrix whose ith diagonal element is given by u_i with

u_i = 0 if |x̄_i| ≤ σλ, and u_i = 1 otherwise, i = 1, …, n.

Since Prox_{σλ‖·‖_1}(x̄) = sign(x̄) ∘ max(|x̄| − σλ, 0), it is not difficult to see that U ∈ ∂Prox_{σλ‖·‖_1}(x̄). Let J := {j | |x̄_j| > σλ, j = 1, …, n} and r := |J|, the cardinality of J. By taking the special 0-1 structure of U into consideration, we have that


Fig. 1. Reducing the computational costs from O(m²n) to O(m²r).

(27)  AUA^T = (AU)(AU)^T = A_J A_J^T,

where A_J ∈ ℜ^{m×r} is the submatrix of A with those columns not in J removed from A. Then, by using (27), the costs of computing AUA^T and AUA^Td for a given vector d are reduced to O(m²r) and O(mr), respectively. Due to the sparsity-promoting property of the regularizer p, the number r is usually much smaller than n. Thus, by exploring the aforementioned second order sparsity, we can greatly reduce the computational costs in solving the linear system (26) when using the Cholesky factorization. More specifically, the total computational cost of using the Cholesky factorization to solve the linear system is reduced from O(m²(m+n)) to O(m²(m+r)). See Figure 1 for an illustration of the reduction. This means that even if n happens to be extremely large (say, larger than 10^7), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization as long as both m and r are moderate (say, less than 10^4). If, in addition, r ≪ m, which is often the case when m is large and the optimal solutions to the underlying problem are sparse, then instead of factorizing an m×m matrix, we can make use of the Sherman–Morrison–Woodbury formula [22] to obtain the inverse of I_m + σAUA^T by inverting a much smaller r×r matrix as follows:

(I_m + σAUA^T)^{-1} = (I_m + σA_JA_J^T)^{-1} = I_m − A_J(σ^{-1}I_r + A_J^TA_J)^{-1}A_J^T.

See Figure 2 for an illustration of the computation of A_J^TA_J. In this case, the total computational cost for solving the Newton linear system (26) is reduced significantly further, from O(m²(m+r)) to O(r²(m+r)). We should emphasize here that this dramatic reduction in computational costs results from the wise combination of the careful examination of the existing second order sparsity in the lasso-type problems and some "smart" numerical linear algebra.
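The two routes can be compared directly in a few lines of MATLAB. The sketch below is our own illustration (synthetic data, hypothetical sizes): when r = |J| ≪ m, the Sherman–Morrison–Woodbury route factorizes only an r × r matrix.

% Solving (I_m + sigma*A*U*A')*d = rhs when U is 0-1 diagonal with support J.
rng(0); m = 500; n = 20000; r = 30;
A = randn(m, n); sigma = 1; rhs = randn(m, 1);
J = false(n,1); J(randperm(n, r)) = true;     % pretend |xbar_j| > sigma*lambda exactly on J
AJ = A(:, J);                                 % m-by-r submatrix, so A*U*A' = AJ*AJ'
% Direct route: factorize an m-by-m matrix, O(m^2*(m+r)) overall.
d1 = (eye(m) + sigma*(AJ*AJ')) \ rhs;
% Sherman-Morrison-Woodbury route: only an r-by-r system, O(r^2*(m+r)) overall.
d2 = rhs - AJ*((eye(r)/sigma + AJ'*AJ) \ (AJ'*rhs));
fprintf('relative difference between the two solves = %.2e\n', norm(d1-d2)/norm(d1));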

From the above arguments, we can see that as long as the number of nonzero components of Prox_{σλ‖·‖_1}(x̄) is small, say less than √n, and H_j ∈ ∂(∇h^*)(y^j) is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low cost. In particular, this is true for the lasso problems admitting sparse


Fig. 2. Further reducing the computational costs to O(r²m).

solutions. Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). Note that one may argue that even if the original problem has only sparse solutions, at certain stages one may still encounter the situation that the number of nonzero components of Prox_{σλ‖·‖_1}(x̄) is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice since we always start with a sparse feasible point, e.g., the zero vector. Second, even if at certain steps this phenomenon does occur, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter σ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with p(·) chosen to be λ‖·‖_1.

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two recent papers [18, 36], which appears to suggest that for some large-scale sparse reconstruction problems, mfIPM¹ and FPC_AS² have mostly outperformed the other solvers. Hence, in this section we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP³ [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparisons, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB, with the step length set to be 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later, they lack the ability to efficiently solve difficult problems such as the large-scale regression problems when the data A is badly conditioned.

For testing purposes, the regularization parameter λ in the lasso problem (1) is chosen as

λ = λ_c‖A^*b‖_∞,

where 0 < λ_c < 1. In our numerical experiments, we measure the accuracy of an approximate optimal solution x for (1) by using the following relative KKT residual:

η = ‖x − prox_{λ‖·‖_1}(x − A^*(Ax − b))‖ / (1 + ‖x‖ + ‖Ax − b‖).
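For completeness, the residual η is straightforward to evaluate; the small MATLAB helper below is our own illustration and is not taken from any of the tested solvers.

function eta = kkt_residual(A, b, lambda, x)
% Relative KKT residual of an approximate lasso solution x (illustrative).
  g = x - A'*(A*x - b);                            % gradient step
  proxg = sign(g).*max(abs(g) - lambda, 0);        % Prox_{lambda*||.||_1}(g)
  eta = norm(x - proxg) / (1 + norm(x) + norm(A*x - b));
end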

¹ http://www.maths.ed.ac.uk/ERGO/mfipmcs/
² http://www.caam.rice.edu/~optimization/L1/FPC_AS/
³ http://yelab.net/software/SLEP/


For a given tolerance ε > 0, we will stop the tested algorithms when η < ε. For all the tests in this section, we set ε = 10^{-6}. The algorithms will also be stopped when they reach the maximum number of iterations (1000 iterations for our algorithm and mfIPM, and 20000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to the default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core Intel Xeon E5-2650 @ 2.60GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms with the test instances (A, b) obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in A (the matrix representation of A) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order 5 polynomial is used to generate the basis functions. This naming convention is also used in the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of AA^*, which is denoted as λ_max(AA^*).

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, m denotes the number of samples, n denotes the number of features, and "nnz" denotes the number of nonzeros in the solution x obtained by Ssnal using the following estimation:

nnz := min{ k | Σ_{i=1}^k |x̂_i| ≥ 0.999‖x‖_1 },

where x̂ is obtained by sorting x such that |x̂_1| ≥ ⋯ ≥ |x̂_n|. An entry marked as "Error" indicates that the algorithm breaks down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and the entry "00" in the time column means that the computation time is less than 0.5 second.
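This estimate can be computed with a few lines of MATLAB (our illustration):

function k = nnz_estimate(x)
% Smallest k such that the k largest-magnitude entries of x carry 99.9% of ||x||_1.
  s = sort(abs(x), 'descend');
  k = find(cumsum(s) >= 0.999*norm(x,1), 1, 'first');
end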

One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances

Table 1
Statistics of the UCI test instances.

Probname             m; n              λ_max(AA^*)
E2006.train          16087; 150348     1.91e+05
log1p.E2006.train    16087; 4265669    5.86e+07
E2006.test           3308; 72812       4.79e+04
log1p.E2006.test     3308; 1771946     1.46e+07
pyrim5               74; 169911        1.22e+06
triazines4           186; 557845       2.07e+07
abalone7             4177; 6435        5.21e+05
bodyfat7             252; 116280       5.29e+04
housing7             506; 77520        3.28e+05
mpg7                 392; 3432         1.28e+04
space_ga9            3107; 5005        4.01e+03


Table 2
The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy ε = 10^{-6}). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM; for each instance (probname, m; n) and each λ_c, the table lists nnz, the relative KKT residuals η of a|b|c|d|e|f, and the computation times of a|b|c|d|e|f. [The individual table entries are not recoverable from this extraction.]


to the required accuracy after 20000 iterations or 7 hours. In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances; in fact, for 3 test instances it breaks down due to some internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with λ_c = 10^{-3}, we can see that Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20000 iterations and consumes about 2 hours but only produces a rather inaccurate solution.

On the other hand, one can observe that the two second order information based methods, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds (λ_c = 10^{-3}). Among these two algorithms, clearly Ssnal is far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where the factor can be up to 300 times faster. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (those with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with an accuracy of 10^{-1} when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point method in exploiting the sparsity in the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix A (the matrix representation of A) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter λ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when λ is not a scalar, based on the recommendation of the referee, we use another popular active-set based solver, PSSas⁴ [44], to replace FPC_AS for the testing. All the parameters for PSSas are set to the default values. In our tests, PSSas will be stopped when it reaches the maximum number of 20000 iterations or the maximum computation time of 7 hours. We note that by default, PSSas will terminate when its progress is smaller than the threshold 10^{-9}.

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions based on Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM type of solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy. In fact, PSSas fails on all the test instances when λ_c = 10^{-4}. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with η = 3.2 × 10^{-3}. (Actually, we also run PSSas on these test instances without scaling

⁴ https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3
The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy ε = 10^{-6}). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM; for each instance (probname, m; n) and each λ_c, the table lists nnz, the relative KKT residuals η of a|b|c1|d|e|f, and the computation times of a|b|c1|d|e|f. [The individual table entries are not recoverable from this extraction.]


and obtain similar performances. Detailed results are omitted to conserve space.) Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for the Sparco collection. In this subsection, the test instances (A, b) are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce 60dB noise to b (as in [18]) by using the MATLAB command b = awgn(b,60,'measured'). For these test instances, the matrix representations of the linear maps A are not available. Hence, ADMM will not be tested in this subsection, as it will be extremely expensive, if not impossible, to compute and factorize I + σAA^*.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of λ_c, i.e., λ_c = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}. As one can observe, as λ_c decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions (m, n) and the largest eigenvalue of AA^* (λ_max(AA^*)). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, as well as the computation times (in the format of hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well for these test instances with small Lipschitz constants λ_max(AA^*). As such, Ssnal does not have a clear advantage as shown in Table 2. Moreover, since the matrix representations of the linear maps A and A^* involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases, with λ_c = 10^{-3} and 10^{-4}. We can observe that FPC_AS

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy ε = 10^{-6}, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

λ_c   | nnz  | Iteration a|b|c|d|e | η a|b|c|d|e                    | Time a|b|c|d|e

srcsep1: m = 29166, n = 57344, λ_max(AA^*) = 3.56
0.16  | 380  | 14|28|42|401|89     | 6.4-7|9.0-7|2.7-8|5.3-7|9.5-7  | 07|09|05|15|07
0.10  | 726  | 16|38|42|574|161    | 4.7-7|5.9-7|2.1-8|2.8-7|9.6-7  | 11|15|05|22|13
0.06  | 1402 | 19|41|63|801|393    | 1.5-7|1.5-7|5.4-8|6.3-7|9.9-7  | 18|18|10|30|32
0.03  | 2899 | 19|56|110|901|337   | 2.7-7|9.4-7|7.3-8|9.3-7|9.9-7  | 28|53|16|33|28
0.01  | 6538 | 17|88|223|1401|542  | 7.1-7|5.7-7|1.1-7|9.9-7|9.9-7  | 1:21|2:15|34|53|45

srcsep2: m = 29166, n = 86016, λ_max(AA^*) = 4.95
0.16  | 423  | 15|29|42|501|127    | 3.2-7|2.1-7|1.9-8|4.9-7|9.4-7  | 14|13|07|25|15
0.10  | 805  | 16|37|84|601|212    | 8.8-7|9.2-7|2.6-8|9.6-7|9.7-7  | 21|19|16|30|26
0.06  | 1549 | 19|40|84|901|419    | 1.4-7|3.1-7|5.4-8|4.7-7|9.9-7  | 32|28|18|44|50
0.03  | 3254 | 20|69|128|901|488   | 1.3-7|6.6-7|8.9-7|9.4-7|9.9-7  | 1:06|1:33|26|44|59
0.01  | 7400 | 21|94|259|2201|837  | 8.8-7|4.0-7|9.9-7|8.8-7|9.3-7  | 1:42|5:33|59|2:05|1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy ε = 10^{-6}, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

probname (m, n)            λc       nnz      η (a | b | c | d | e)                        Time (a | b | c | d | e)
blknheavi (1024, 1024)     10^-3    12       5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7        01 | 01 | 55 | 08 | 06
                           10^-4    12       9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7        01 | 01 | 49 | 07 | 07
srcsep1 (29166, 57344)     10^-3    14066    1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7        5:41 | 42:34 | 13:25 | 1:56 | 4:16
                           10^-4    19306    9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7        9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
srcsep2 (29166, 86016)     10^-3    16502    3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7        9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
                           10^-4    22315    7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7        19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
srcsep3 (196608, 196608)   10^-3    27314    6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7        33 | 6:24 | 8:51 | 47 | 49
                           10^-4    83785    9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7        2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
soccer1 (3200, 4096)       10^-3    4        1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7        01 | 03 | 13:51 | 2:35 | 02
                           10^-4    8        8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7        01 | 02 | 13:23 | 3:07 | 02
soccer2 (3200, 4096)       10^-3    4        3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7        00 | 03 | 13:46 | 1:40 | 02
                           10^-4    8        2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7        01 | 03 | 13:27 | 3:07 | 02
blurrycam (65536, 65536)   10^-3    1694     1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7        03 | 09 | 03 | 02 | 07
                           10^-4    5630     1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7        05 | 1:35 | 08 | 03 | 29
blurspike (16384, 16384)   10^-3    1954     3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7        03 | 05 | 6:38 | 03 | 27
                           10^-4    11698    3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7        10 | 08 | 6:43 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when $\lambda_c = 10^{-3}$ and $10^{-4}$, respectively. For a few cases, FPC_AS can only achieve a poor accuracy ($10^{-1}$). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi ($m = n = 1024$), whose corresponding $\lambda_{\max}(\mathcal{A}\mathcal{A}^*) = 709$. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$, one can safely conclude that the first order algorithms can only be used to solve problems with relatively small $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. Moreover, in order to obtain efficient and robust algorithms for lasso problems, or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method of an asymptotic superlinear convergence rate for solving the large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving $\ell_1$-regularized least squares (LS) problems. With the intelligent


incorporation of the semismooth Newton method, our algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale $\ell_1$-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.

[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.

[3] S. Becker, J. Bobin, and E. J. Candes, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.

[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.

[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrman, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.

[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex $\ell_1$-regularized optimization, Math. Program., 159 (2016), pp. 435–467.

[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.

[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intel. Syst. Tech., 2 (2011), 27.

[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.

[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.

[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.

[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.

[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.

[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.

[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293–314.

[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.

[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.

[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.

[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.

[20] R. Glowinski and A. Marroco, Sur l'approximation par elements finis d'ordre un et la resolution par penalisation-dualite d'une classe de problemes de Dirichlet non lineaires, Rev. Francaise d'Automatique Inform. Rech. Operationelle, 9 (1975), pp. 41–76.

[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.

[23] L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.

[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.

[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.

[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wachter, A second-order method for convex $\ell_1$-regularized optimization with active-set prediction, Optim. Methods Softw., 31 (2016), pp. 605–621.

[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C^{1,1} optimization problems, Math. Program., 88 (2000), pp. 169–180.

[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, in NAACL-HLT 2009, Boulder, CO, 2009.

[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.

[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.

[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A two-phase proximal augmented Lagrangian method for convex quadratic semidefinite programming, preprint, arXiv:1512.08872, 2015.

[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.

[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.

[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.

[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.

[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for $\ell_1$-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.

[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372–376.

[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.

[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.

[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.

[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.

[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.

[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.

[44] M. Schmidt, Graphical Model Structure Learning with $\ell_1$-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.

[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.

[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.

[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.

[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.

[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.

[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.

[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.

[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact regularized proximal Newton method: Provable convergence guarantees for non-smooth convex minimization without strong convexity, preprint, arXiv:1605.07522, 2016.

[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.

[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.

[56] H. Zhou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.

[57] H. Zhou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.

[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.


We will often make use of the Moreau identity $\mathrm{Prox}_{tp}(x) + t\,\mathrm{Prox}_{p^*/t}(x/t) = x$, where $t > 0$ is a given parameter. Denote $\mathrm{dist}(x, C) := \inf_{x'\in C}\|x - x'\|$ for any $x\in\mathcal{X}$ and any set $C\subseteq\mathcal{X}$.
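To make the identity concrete, here is a minimal MATLAB check of ours (not from the paper) for the choice $p = \lambda\|\cdot\|_1$, for which $\mathrm{Prox}_{tp}$ is soft-thresholding and $p^*$ is the indicator of the box $\{z : \|z\|_\infty \le \lambda\}$:

```matlab
% Minimal check of the Moreau identity Prox_{t p}(x) + t*Prox_{p*/t}(x/t) = x
% for p = lambda*||.||_1 (illustrative sketch; names are ours).
lambda = 0.5;  t = 2.0;  x = randn(5,1);
prox_tp  = sign(x).*max(abs(x) - t*lambda, 0);      % soft-thresholding
% p* is the indicator of {z : ||z||_inf <= lambda}, so Prox_{p*/t} is simply
% the projection onto that box (scaling an indicator by 1/t changes nothing).
prox_pst = max(min(x/t, lambda), -lambda);
fprintf('Moreau identity residual: %.2e\n', norm(prox_tp + t*prox_pst - x));
```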

Let $F:\mathcal{X}\rightrightarrows\mathcal{Y}$ be a multivalued mapping. We define the graph of $F$ to be the set
$$\mathrm{gph}\,F := \{(x,y)\in\mathcal{X}\times\mathcal{Y} \mid y\in F(x)\};$$
$F^{-1}$, the inverse of $F$, is the multivalued mapping from $\mathcal{Y}$ to $\mathcal{X}$ whose graph is $\{(y,x)\mid (x,y)\in\mathrm{gph}\,F\}$.

Definition 2.1 (error bound). Let $F:\mathcal{X}\rightrightarrows\mathcal{Y}$ be a multivalued mapping and $y\in\mathcal{Y}$ satisfy $F^{-1}(y)\neq\emptyset$. $F$ is said to satisfy the error bound condition for the point $y$ with modulus $\kappa\ge 0$ if there exists $\varepsilon>0$ such that if $x\in\mathcal{X}$ with $\mathrm{dist}(y,F(x))\le\varepsilon$, then
$$\text{(2)}\qquad \mathrm{dist}(x,F^{-1}(y)) \le \kappa\,\mathrm{dist}(y,F(x)).$$

The above error bound condition was called the growth condition in [34] and was used to analyze the local linear convergence properties of the proximal point algorithm. Recall that $F:\mathcal{X}\rightrightarrows\mathcal{Y}$ is called a polyhedral multifunction if its graph is the union of finitely many polyhedral convex sets. In [39], Robinson established the following celebrated proposition on the error bound result for polyhedral multifunctions.

Proposition 2.2. Let $F$ be a polyhedral multifunction from $\mathcal{X}$ to $\mathcal{Y}$. Then $F$ satisfies the error bound condition (2) for any point $y\in\mathcal{Y}$ satisfying $F^{-1}(y)\neq\emptyset$ with a common modulus $\kappa\ge 0$.

For later use, we present the following definition of metric subregularity from Chapter 3 in [13].

Definition 2.3 (metric subregularity). Let $F:\mathcal{X}\rightrightarrows\mathcal{Y}$ be a multivalued mapping and $(\bar x,\bar y)\in\mathrm{gph}\,F$. $F$ is said to be metrically subregular at $\bar x$ for $\bar y$ with modulus $\kappa\ge 0$ if there exist neighborhoods $U$ of $\bar x$ and $V$ of $\bar y$ such that
$$\mathrm{dist}(x,F^{-1}(\bar y)) \le \kappa\,\mathrm{dist}(\bar y,F(x)\cap V) \qquad \forall\, x\in U.$$

From the above definition, we see that if $F:\mathcal{X}\rightrightarrows\mathcal{Y}$ satisfies the error bound condition (2) for $\bar y$ with the modulus $\kappa$, then it is metrically subregular at $\bar x$ for $\bar y$ with the same modulus $\kappa$ for any $\bar x\in F^{-1}(\bar y)$.

The following definition on essential smoothness is taken from [40, section 26].

Definition 2.4 (essential smoothness). A proper convex function $f$ on $\mathcal{X}$ is essentially smooth if $f$ is differentiable on $\mathrm{int}(\mathrm{dom}\, f)\neq\emptyset$ and $\lim_{k\to\infty}\|\nabla f(x^k)\| = +\infty$ whenever $\{x^k\}$ is a sequence in $\mathrm{int}(\mathrm{dom}\, f)$ converging to a boundary point $x$ of $\mathrm{int}(\mathrm{dom}\, f)$.

In particular, a smooth convex function on $\mathcal{X}$ is essentially smooth. Moreover, if a closed proper convex function $f$ is strictly convex on $\mathrm{dom}\, f$, then its conjugate $f^*$ is essentially smooth [40, Theorem 26.3].

Consider the following composite convex optimization model:
$$\text{(3)}\qquad \max\; -\bigl(f(x) := h(\mathcal{A}x) - \langle c, x\rangle + p(x)\bigr),$$
where $\mathcal{A}:\mathcal{X}\to\mathcal{Y}$ is a linear map, $h:\mathcal{Y}\to\Re$ and $p:\mathcal{X}\to(-\infty,+\infty]$ are two closed proper convex functions, and $c\in\mathcal{X}$ is a given vector. The dual of (3) can be written as
$$\text{(4)}\qquad \min\;\{\, h^*(y) + p^*(z) \mid \mathcal{A}^*y + z = c \,\},$$


where $h^*$ and $p^*$ are the Fenchel conjugate functions of $h$ and $p$, respectively. Throughout this section, we make the following blanket assumption on $h^*(\cdot)$ and $p^*(\cdot)$.

Assumption 2.5. $h^*(\cdot)$ is essentially smooth, and $p^*(\cdot)$ is either an indicator function $\delta_P(\cdot)$ or a support function $\delta_P^*(\cdot)$ for some nonempty polyhedral convex set $P\subseteq\mathcal{X}$. Moreover, $\nabla h^*$ is locally Lipschitz continuous and directionally differentiable on $\mathrm{int}(\mathrm{dom}\, h^*)$.

Under Assumption 2.5, by [40, Theorem 26.1], we know that $\partial h^*(y) = \emptyset$ whenever $y\notin\mathrm{int}(\mathrm{dom}\, h^*)$. Denote by $l$ the Lagrangian function for (4):
$$\text{(5)}\qquad l(y,z;x) := h^*(y) + p^*(z) - \langle x,\, \mathcal{A}^*y + z - c\rangle \qquad \forall\,(y,z,x)\in\mathcal{Y}\times\mathcal{X}\times\mathcal{X}.$$

Corresponding to the closed proper convex function $f$ in the objective of (3) and the convex-concave function $l$ in (5), define the maximal monotone operators $\mathcal{T}_f$ and $\mathcal{T}_l$ [42] by
$$\mathcal{T}_f(x) := \partial f(x), \qquad \mathcal{T}_l(y,z,x) := \{(y',z',x') \mid (y',z',-x')\in\partial l(y,z;x)\},$$
whose inverses are given, respectively, by
$$\mathcal{T}_f^{-1}(x') = \partial f^*(x'), \qquad \mathcal{T}_l^{-1}(y',z',x') = \{(y,z,x) \mid (y',z',-x')\in\partial l(y,z;x)\}.$$

Unlike the case for $\mathcal{T}_f$ [33, 47, 58], stability results of $\mathcal{T}_l$, which correspond to perturbations of both primal and dual solutions, are very limited. Next, as a tool for studying the convergence rate of Ssnal, we shall establish a theorem which reveals the metric subregularity of $\mathcal{T}_l$ under some mild assumptions.

The KKT system associated with problem (4) is given as follows:
$$0\in\partial h^*(y) - \mathcal{A}x,\qquad 0\in -x + \partial p^*(z),\qquad 0 = \mathcal{A}^*y + z - c,\qquad (x,y,z)\in\mathcal{X}\times\mathcal{Y}\times\mathcal{X}.$$

Assume that the above KKT system has at least one solution. This existence assumption, together with the essential smoothness assumption on $h^*$, implies that the above KKT system can be equivalently rewritten as
$$\text{(6)}\qquad \begin{cases} 0 = \nabla h^*(y) - \mathcal{A}x,\\ 0 \in -x + \partial p^*(z),\\ 0 = \mathcal{A}^*y + z - c,\end{cases}\qquad (x,y,z)\in\mathcal{X}\times\mathrm{int}(\mathrm{dom}\, h^*)\times\mathcal{X}.$$

Therefore, under the above assumptions, we only need to focus on $\mathrm{int}(\mathrm{dom}\, h^*)\times\mathcal{X}$ when solving problem (4). Let $(\bar y,\bar z)$ be an optimal solution to problem (4). Then we know that the set of Lagrangian multipliers associated with $(\bar y,\bar z)$, denoted as $\mathcal{M}(\bar y,\bar z)$, is nonempty. Define the critical cone associated with (4) at $(\bar y,\bar z)$ as follows:
$$\text{(7)}\qquad \mathcal{C}(\bar y,\bar z) := \bigl\{(d_1,d_2)\in\mathcal{Y}\times\mathcal{X} \mid \mathcal{A}^*d_1 + d_2 = 0,\ \langle\nabla h^*(\bar y), d_1\rangle + (p^*)'(\bar z; d_2) = 0,\ d_2\in\mathcal{T}_{\mathrm{dom}(p^*)}(\bar z)\bigr\},$$
where $(p^*)'(\bar z; d_2)$ is the directional derivative of $p^*$ at $\bar z$ with respect to $d_2$ and $\mathcal{T}_{\mathrm{dom}(p^*)}(\bar z)$ is the tangent cone of $\mathrm{dom}(p^*)$ at $\bar z$. When the conjugate function $p^*$ is taken to be the indicator function of a nonempty polyhedral set $P$, the above definition reduces directly to the standard definition of the critical cone in the nonlinear programming setting.


Definition 2.6 (second order sufficient condition). Let $(\bar y,\bar z)\in\mathcal{Y}\times\mathcal{X}$ be an optimal solution to problem (4) with $\mathcal{M}(\bar y,\bar z)\neq\emptyset$. We say that the second order sufficient condition for problem (4) holds at $(\bar y,\bar z)$ if
$$\langle d_1,\,(\nabla h^*)'(\bar y; d_1)\rangle > 0 \qquad \forall\, 0\neq(d_1,d_2)\in\mathcal{C}(\bar y,\bar z).$$

By building on the proof ideas from the literature on nonlinear programming problems [11, 27, 24], we are able to prove the following result on the metric subregularity of $\mathcal{T}_l$. This allows us to prove the linear, and even the asymptotic superlinear, convergence of the sequences generated by the Ssnal algorithm to be presented in the next section, even when the objective in problem (3) does not possess the piecewise linear-quadratic structure as in the lasso problem (1).

Theorem 2.7. Assume that the KKT system (6) has at least one solution and denote it as $(\bar x,\bar y,\bar z)$. Suppose that Assumption 2.5 holds and that the second order sufficient condition for problem (4) holds at $(\bar y,\bar z)$. Then $\mathcal{T}_l$ is metrically subregular at $(\bar y,\bar z,\bar x)$ for the origin.

Proof. First, we claim that there exists a neighborhood $U$ of $(\bar x,\bar y,\bar z)$ such that for any $w = (u_1,u_2,v)\in\mathcal{W} := \mathcal{Y}\times\mathcal{X}\times\mathcal{X}$ with $\|w\|$ sufficiently small, any solution $(x(w),y(w),z(w))\in U$ of the perturbed KKT system
$$\text{(8)}\qquad 0 = \nabla h^*(y) - \mathcal{A}x - u_1,\qquad 0\in -x - u_2 + \partial p^*(z),\qquad 0 = \mathcal{A}^*y + z - c - v$$
satisfies the following estimate:
$$\text{(9)}\qquad \|(y(w),z(w)) - (\bar y,\bar z)\| = O(\|w\|).$$

For the sake of contradiction, suppose that our claim is not true. Then there exist sequences $\{w^k = (u_1^k,u_2^k,v^k)\}$ and $\{(x^k,y^k,z^k)\}$ such that $w^k\to 0$ and $(x^k,y^k,z^k)\to(\bar x,\bar y,\bar z)$; for each $k$, the point $(x^k,y^k,z^k)$ is a solution of (8) for $w = w^k$; and
$$\|(y^k,z^k) - (\bar y,\bar z)\| > \gamma_k\|w^k\|$$
with some $\gamma_k > 0$ such that $\gamma_k\to\infty$. Passing to a subsequence if necessary, we assume that $\bigl((y^k,z^k) - (\bar y,\bar z)\bigr)/\|(y^k,z^k) - (\bar y,\bar z)\|$ converges to some $\xi = (\xi_1,\xi_2)\in\mathcal{Y}\times\mathcal{X}$ with $\|\xi\| = 1$. Then, setting $t_k := \|(y^k,z^k) - (\bar y,\bar z)\|$ and passing to a subsequence further if necessary, by the local Lipschitz continuity and the directional differentiability of $\nabla h^*(\cdot)$ at $\bar y$, we know that for all $k$ sufficiently large,
$$\begin{aligned}\nabla h^*(y^k) - \nabla h^*(\bar y) &= \nabla h^*(\bar y + t_k\xi_1) - \nabla h^*(\bar y) + \nabla h^*(y^k) - \nabla h^*(\bar y + t_k\xi_1)\\ &= t_k(\nabla h^*)'(\bar y;\xi_1) + o(t_k) + O(\|y^k - \bar y - t_k\xi_1\|)\\ &= t_k(\nabla h^*)'(\bar y;\xi_1) + o(t_k).\end{aligned}$$

Denote, for each $k$, $\hat x^k := x^k + u_2^k$. Simple calculations show that for all $k$ sufficiently large,
$$\text{(10)}\qquad 0 = \nabla h^*(y^k) - \nabla h^*(\bar y) - \mathcal{A}(\hat x^k - \bar x) + \mathcal{A}u_2^k - u_1^k = t_k(\nabla h^*)'(\bar y;\xi_1) + o(t_k) - \mathcal{A}(\hat x^k - \bar x)$$
and
$$\text{(11)}\qquad 0 = \mathcal{A}^*(y^k - \bar y) + (z^k - \bar z) - v^k = t_k(\mathcal{A}^*\xi_1 + \xi_2) + o(t_k).$$


Dividing both sides of (11) by $t_k$ and then taking limits, we obtain
$$\text{(12)}\qquad \mathcal{A}^*\xi_1 + \xi_2 = 0,$$
which further implies that
$$\text{(13)}\qquad \langle\nabla h^*(\bar y),\,\xi_1\rangle = \langle\mathcal{A}\bar x,\,\xi_1\rangle = -\langle\bar x,\,\xi_2\rangle.$$

Since $\hat x^k\in\partial p^*(z^k)$, we know that for all $k$ sufficiently large,
$$z^k = \bar z + t_k\xi_2 + o(t_k) \in \mathrm{dom}\, p^*.$$
That is, $\xi_2\in\mathcal{T}_{\mathrm{dom}\, p^*}(\bar z)$. According to the structure of $p^*$, we separate our discussion into two cases.

Case I. There exists a nonempty polyhedral convex set $P$ such that $p^*(z) = \delta_P^*(z)$ for all $z\in\mathcal{X}$. Then, for each $k$, it holds that
$$\hat x^k = \Pi_P(z^k + \hat x^k).$$
By [14, Theorem 4.1.1], we have
$$\text{(14)}\qquad \hat x^k = \Pi_P(z^k + \hat x^k) = \Pi_P(\bar z + \bar x) + \Pi_C(z^k - \bar z + \hat x^k - \bar x) = \bar x + \Pi_C(z^k - \bar z + \hat x^k - \bar x),$$
where $C$ is the critical cone of $P$ at $\bar z + \bar x$, i.e.,
$$C \equiv \mathcal{C}_P(\bar z + \bar x) = \mathcal{T}_P(\bar x)\cap\bar z^{\perp}.$$
Since $C$ is a polyhedral cone, we know from [14, Proposition 4.1.4] that $\Pi_C(\cdot)$ is a piecewise linear function; i.e., there exist a positive integer $l$ and orthogonal projectors $B_1,\ldots,B_l$ such that for any $x\in\mathcal{X}$,
$$\Pi_C(x) \in \{B_1x,\ldots,B_lx\}.$$
By restricting to a subsequence if necessary, we may further assume that there exists $1\le j'\le l$ such that for all $k\ge 1$,
$$\text{(15)}\qquad \Pi_C(z^k - \bar z + \hat x^k - \bar x) = B_{j'}(z^k - \bar z + \hat x^k - \bar x) = \Pi_{\mathrm{Range}(B_{j'})}(z^k - \bar z + \hat x^k - \bar x).$$
Denote $L := \mathrm{Range}(B_{j'})$. Combining (14) and (15), we get
$$C\cap L \ni (\hat x^k - \bar x) \perp (z^k - \bar z) \in C^{\circ}\cap L^{\perp},$$
where $C^{\circ}$ is the polar cone of $C$. Since $\hat x^k - \bar x\in C$, we have $\langle\hat x^k - \bar x,\,\bar z\rangle = 0$, which together with $\langle\hat x^k - \bar x,\, z^k - \bar z\rangle = 0$ implies $\langle\hat x^k - \bar x,\, z^k\rangle = 0$. Thus, for all $k$ sufficiently large,
$$\langle z^k,\hat x^k\rangle = \langle z^k,\bar x\rangle = \langle\bar z + t_k\xi_2 + o(t_k),\,\bar x\rangle,$$
and it follows that
$$\text{(16)}\qquad (p^*)'(\bar z;\xi_2) = \lim_{k\to\infty}\frac{\delta_P^*(\bar z + t_k\xi_2) - \delta_P^*(\bar z)}{t_k} = \lim_{k\to\infty}\frac{\delta_P^*(z^k) - \delta_P^*(\bar z)}{t_k} = \lim_{k\to\infty}\frac{\langle z^k,\hat x^k\rangle - \langle\bar z,\bar x\rangle}{t_k} = \langle\bar x,\xi_2\rangle.$$


By (12), (13), and (16), we know that $(\xi_1,\xi_2)\in\mathcal{C}(\bar y,\bar z)$. Since $C\cap L$ is a polyhedral convex cone, we know from [40, Theorem 19.3] that $\mathcal{A}(C\cap L)$ is also a polyhedral convex cone, which together with (10) implies
$$(\nabla h^*)'(\bar y;\xi_1) \in \mathcal{A}(C\cap L).$$
Therefore, there exists a vector $\eta\in C\cap L$ such that
$$\langle\xi_1,\,(\nabla h^*)'(\bar y;\xi_1)\rangle = \langle\xi_1,\,\mathcal{A}\eta\rangle = -\langle\xi_2,\,\eta\rangle = 0,$$
where the last equality follows from the fact that $\xi_2\in C^{\circ}\cap L^{\perp}$. Note that the latter inclusion holds since the polyhedral convex cone $C^{\circ}\cap L^{\perp}$ is closed and
$$\xi_2 = \lim_{k\to\infty}\frac{z^k - \bar z}{t_k} \in C^{\circ}\cap L^{\perp}.$$
As $0\neq\xi = (\xi_1,\xi_2)\in\mathcal{C}(\bar y,\bar z)$ but $\langle\xi_1,\,(\nabla h^*)'(\bar y;\xi_1)\rangle = 0$, this contradicts the assumption that the second order sufficient condition holds for (4) at $(\bar y,\bar z)$. Thus, we have proved our claim for Case I.

Case II. There exists a nonempty polyhedral convex set $P$ such that $p^*(z) = \delta_P(z)$ for all $z\in\mathcal{X}$. Then we know that, for each $k$,
$$z^k = \Pi_P(z^k + \hat x^k).$$
Since $(\delta_P)'(\bar z; d_2) = 0$ for all $d_2\in\mathcal{T}_{\mathrm{dom}(p^*)}(\bar z)$, the critical cone in (7) now takes the following form:
$$\mathcal{C}(\bar y,\bar z) = \bigl\{(d_1,d_2)\in\mathcal{Y}\times\mathcal{X} \mid \mathcal{A}^*d_1 + d_2 = 0,\ \langle\nabla h^*(\bar y),\,d_1\rangle = 0,\ d_2\in\mathcal{T}_{\mathrm{dom}(p^*)}(\bar z)\bigr\}.$$
Similar to Case I, without loss of generality, we can assume that there exists a subspace $L$ such that for all $k\ge 1$,
$$C\cap L \ni (z^k - \bar z) \perp (\hat x^k - \bar x) \in C^{\circ}\cap L^{\perp},$$
where
$$C \equiv \mathcal{C}_P(\bar z + \bar x) = \mathcal{T}_P(\bar z)\cap\bar x^{\perp}.$$
Since $C\cap L$ is a polyhedral convex cone, we know
$$\xi_2 = \lim_{k\to\infty}\frac{z^k - \bar z}{t_k} \in C\cap L,$$
and consequently $\langle\xi_2,\,\bar x\rangle = 0$, which together with (12) and (13) implies $\xi = (\xi_1,\xi_2)\in\mathcal{C}(\bar y,\bar z)$. By (10) and the fact that $\mathcal{A}(C^{\circ}\cap L^{\perp})$ is a polyhedral convex cone, we know that
$$(\nabla h^*)'(\bar y;\xi_1) \in \mathcal{A}(C^{\circ}\cap L^{\perp}).$$
Therefore, there exists a vector $\eta\in C^{\circ}\cap L^{\perp}$ such that
$$\langle\xi_1,\,(\nabla h^*)'(\bar y;\xi_1)\rangle = \langle\xi_1,\,\mathcal{A}\eta\rangle = -\langle\xi_2,\,\eta\rangle = 0.$$
Since $\xi = (\xi_1,\xi_2)\neq 0$, we arrive at a contradiction to the assumed second order sufficient condition. So our claim is also true for this case.

In summary, we have proven that there exists a neighborhood $U$ of $(\bar x,\bar y,\bar z)$ such that for any $w$ close enough to the origin, (9) holds for any solution


$(x(w), y(w), z(w))\in U$ of the perturbed KKT system (8). Next, we show that $\mathcal{T}_l$ is metrically subregular at $(\bar y,\bar z,\bar x)$ for the origin.

Define the mapping $\Theta_{\rm KKT}:\mathcal{X}\times\mathcal{Y}\times\mathcal{X}\times\mathcal{W}\to\mathcal{Y}\times\mathcal{X}\times\mathcal{X}$ by
$$\Theta_{\rm KKT}(x,y,z,w) := \begin{pmatrix}\nabla h^*(y) - \mathcal{A}x - u_1\\ z - \mathrm{Prox}_{p^*}(z + x + u_2)\\ \mathcal{A}^*y + z - c - v\end{pmatrix} \qquad \forall\,(x,y,z,w)\in\mathcal{X}\times\mathcal{Y}\times\mathcal{X}\times\mathcal{W},$$
and define the mapping $\theta:\mathcal{X}\to\mathcal{Y}\times\mathcal{X}\times\mathcal{X}$ as follows:
$$\theta(x) := \Theta_{\rm KKT}(x,\bar y,\bar z,0) \qquad \forall\, x\in\mathcal{X}.$$
Then we have $x\in\mathcal{M}(\bar y,\bar z)$ if and only if $\theta(x) = 0$. Since $\mathrm{Prox}_{p^*}(\cdot)$ is a piecewise affine function, $\theta(\cdot)$ is a piecewise affine function and thus a polyhedral multifunction. By using Proposition 2.2 and shrinking the neighborhood $U$ if necessary, for any $w$ close enough to the origin and any solution $(x(w),y(w),z(w))\in U$ of the perturbed KKT system (8), we have
$$\begin{aligned}\mathrm{dist}(x(w),\mathcal{M}(\bar y,\bar z)) &= O(\|\theta(x(w))\|)\\ &= O(\|\Theta_{\rm KKT}(x(w),\bar y,\bar z,0) - \Theta_{\rm KKT}(x(w),y(w),z(w),w)\|)\\ &= O(\|w\| + \|(y(w),z(w)) - (\bar y,\bar z)\|),\end{aligned}$$
which together with (9) implies the existence of a constant $\kappa\ge 0$ such that
$$\text{(17)}\qquad \|(y(w),z(w)) - (\bar y,\bar z)\| + \mathrm{dist}(x(w),\mathcal{M}(\bar y,\bar z)) \le \kappa\|w\|.$$
Thus, by Definition 2.3, we have proven that $\mathcal{T}_l$ is metrically subregular at $(\bar y,\bar z,\bar x)$ for the origin. The proof of the theorem is complete.

Remark 2.8. For convex piecewise linear-quadratic programming problems, such as the $\ell_1$ and elastic net regularized LS problems, we know from J. Sun's thesis [46] on the characterization of the subdifferentials of convex piecewise linear-quadratic functions (see also [43, Proposition 12.30]) that the corresponding operators $\mathcal{T}_l$ and $\mathcal{T}_f$ are polyhedral multifunctions, and thus, by Proposition 2.2, the error bound condition holds. Here we emphasize again that the error bound condition holds for the lasso problem (1). Moreover, $\mathcal{T}_f$ associated with the $\ell_1$ or elastic net regularized logistic regression model, i.e., for a given vector $b\in\Re^m$ the loss function $h:\Re^m\to\Re$ in problem (3) takes the form
$$h(y) = \sum_{i=1}^{m}\log(1 + e^{-b_i y_i}) \qquad \forall\, y\in\Re^m,$$
also satisfies the error bound condition [33, 47]. Meanwhile, from Theorem 2.7, we know that $\mathcal{T}_l$ corresponding to the $\ell_1$ or the elastic net regularized logistic regression model is metrically subregular at any solution of the KKT system (6) for the origin.
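As an illustration (ours, with assumed variable names), the logistic loss above and its gradient can be evaluated in MATLAB in a numerically stable way as follows:

```matlab
% Logistic loss h(y) = sum_i log(1 + exp(-b_i*y_i)) and its gradient,
% written to avoid overflow when b_i*y_i is very negative (sketch, ours).
h     = @(y,b) sum(log1p(exp(-abs(b.*y))) + max(-b.*y, 0));
gradh = @(y,b) -b ./ (1 + exp(b.*y));
% quick sanity check on random data
b = sign(randn(4,1));  y = randn(4,1);
fprintf('h = %.4f,  ||grad h|| = %.4f\n', h(y,b), norm(gradh(y,b)));
```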

3. An augmented Lagrangian method with asymptotic superlinear convergence. Recall the general convex composite model (3):
$$\text{(P)}\qquad \max\; -\bigl(f(x) := h(\mathcal{A}x) - \langle c, x\rangle + p(x)\bigr)$$
and its dual
$$\text{(D)}\qquad \min\;\{\, h^*(y) + p^*(z) \mid \mathcal{A}^*y + z = c \,\}.$$


In this section, we shall propose an asymptotically superlinearly convergent augmented Lagrangian method for solving (P) and (D). We make the following standing assumptions regarding the function $h$.

Assumption 3.1.
(a) $h:\mathcal{Y}\to\Re$ is a convex differentiable function whose gradient is $1/\alpha_h$-Lipschitz continuous, i.e.,
$$\|\nabla h(y') - \nabla h(y)\| \le (1/\alpha_h)\|y' - y\| \qquad \forall\, y', y\in\mathcal{Y}.$$
(b) $h$ is essentially locally strongly convex [21], i.e., for any compact and convex set $K\subset\mathrm{dom}\,\partial h$, there exists $\beta_K > 0$ such that
$$(1-\lambda)h(y') + \lambda h(y) \ge h((1-\lambda)y' + \lambda y) + \tfrac{1}{2}\beta_K\lambda(1-\lambda)\|y' - y\|^2 \qquad \forall\, y', y\in K,$$
for all $\lambda\in[0,1]$.

Many commonly used loss functions in the machine learning literature satisfy the above mild assumptions. For example, $h$ can be the loss function in the linear regression, logistic regression, and Poisson regression models. While the strict convexity of a convex function is closely related to the differentiability of its conjugate, the essential local strong convexity in Assumption 3.1(b) for a convex function was first proposed in [21] to obtain a characterization of the local Lipschitz continuity of the gradient of its conjugate function.
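For instance, for the squared loss underlying the lasso problem (1), these properties can be verified directly; the following short computation is ours and is not part of the original text:
$$h(y) = \tfrac{1}{2}\|y-b\|^2, \qquad h^*(u) = \sup_{y}\bigl\{\langle u,y\rangle - \tfrac{1}{2}\|y-b\|^2\bigr\} = \tfrac{1}{2}\|u\|^2 + \langle b,u\rangle,$$
so that one may take $\alpha_h = 1$ in Assumption 3.1(a), $h^*$ is strongly convex with modulus 1, and $\nabla h^*(u) = u + b$ is globally Lipschitz continuous.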

The aforementioned assumptions on $h$ imply the following useful properties of $h^*$. First, by [43, Proposition 12.60], we know that $h^*$ is strongly convex with modulus $\alpha_h$. Second, by [21, Corollary 4.4], we know that $h^*$ is essentially smooth and $\nabla h^*$ is locally Lipschitz continuous on $\mathrm{int}(\mathrm{dom}\, h^*)$. If the solution set to the KKT system associated with (P) and (D) is further assumed to be nonempty, then, similar to what we have discussed in the last section, one only needs to focus on $\mathrm{int}(\mathrm{dom}\, h^*)\times\mathcal{X}$ when solving (D). Given $\sigma > 0$, the augmented Lagrangian function associated with (D) is given by
$$\mathcal{L}_{\sigma}(y,z;x) := l(y,z;x) + \frac{\sigma}{2}\|\mathcal{A}^*y + z - c\|^2 \qquad \forall\,(y,z,x)\in\mathcal{Y}\times\mathcal{X}\times\mathcal{X},$$
where the Lagrangian function $l(\cdot)$ is defined in (5).

3.1. Ssnal: A semismooth Newton augmented Lagrangian algorithm for (D). The detailed steps of our algorithm Ssnal for solving (D) are given as follows. Since a semismooth Newton method will be employed to solve the subproblems involved in this prototype of the inexact augmented Lagrangian method [42], it is natural for us to call our algorithm Ssnal.

Algorithm SSNAL: An inexact augmented Lagrangian method for (D).
Let $\sigma_0 > 0$ be a given parameter. Choose $(y^0, z^0, x^0)\in\mathrm{int}(\mathrm{dom}\, h^*)\times\mathrm{dom}\, p^*\times\mathcal{X}$. For $k = 0, 1, \ldots$, perform the following steps in each iteration.
Step 1. Compute
$$\text{(18)}\qquad (y^{k+1}, z^{k+1}) \approx \arg\min\bigl\{\Psi_k(y,z) := \mathcal{L}_{\sigma_k}(y,z;x^k)\bigr\}.$$
Step 2. Compute $x^{k+1} = x^k - \sigma_k(\mathcal{A}^*y^{k+1} + z^{k+1} - c)$ and update $\sigma_{k+1}\uparrow\sigma_{\infty}\le\infty$.
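The outer loop can be organized schematically as in the following MATLAB sketch; it is our paraphrase of the two steps above, with solve_subproblem standing for an assumed inner solver (e.g., Algorithm Ssn presented later) and with a simplified stopping test and σ-update:

```matlab
function [x, y, z] = ssnal_outer(Astar, c, solve_subproblem, x0, y0, z0, sigma0, tol, maxit)
% Schematic sketch of the SSNAL outer iteration (not the authors' code).
%   solve_subproblem(y, z, x, sigma): assumed handle that inexactly minimizes
%   Psi_k(y,z) = L_sigma(y,z;x), e.g. by a semismooth Newton method.
x = x0;  y = y0;  z = z0;  sigma = sigma0;
for k = 0:maxit
    [y, z] = solve_subproblem(y, z, x, sigma);      % Step 1 (criterion (A))
    r = Astar(y) + z - c;                           % dual infeasibility
    x = x - sigma*r;                                % Step 2: multiplier update
    if norm(r) < tol, break; end
    sigma = min(2*sigma, 1e6);                      % sigma_{k+1} nondecreasing
end
end
```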

Next, we shall adapt the results developed in [41, 42, 34] to establish the global and local superlinear convergence of our algorithm.


Since the inner problems cannot be solved exactly, we use the following standard stopping criterion, studied in [41, 42], for approximately solving (18):
$$\text{(A)}\qquad \Psi_k(y^{k+1}, z^{k+1}) - \inf\Psi_k \le \varepsilon_k^2/2\sigma_k, \qquad \sum_{k=0}^{\infty}\varepsilon_k < \infty.$$

Now we can state the global convergence of Algorithm Ssnal from [41, 42] without much difficulty.

Theorem 3.2. Suppose that Assumption 3.1 holds and that the solution set to (P) is nonempty. Let $\{(y^k, z^k, x^k)\}$ be the infinite sequence generated by Algorithm Ssnal with stopping criterion (A). Then the sequence $\{x^k\}$ is bounded and converges to an optimal solution of (P). In addition, $\{(y^k, z^k)\}$ is also bounded and converges to the unique optimal solution $(y^*, z^*)\in\mathrm{int}(\mathrm{dom}\, h^*)\times\mathrm{dom}\, p^*$ of (D).

Proof. Since the solution set to (P) is assumed to be nonempty, the optimal value of (P) is finite. From Assumption 3.1(a), we have that $\mathrm{dom}\, h = \mathcal{Y}$ and $h^*$ is strongly convex [43, Proposition 12.60]. Then, by Fenchel's duality theorem [40, Corollary 31.2.1], we know that the solution set to (D) is nonempty and the optimal value of (D) is finite and equal to the optimal value of (P). That is, the solution set to the KKT system associated with (P) and (D) is nonempty. The uniqueness of the optimal solution $(y^*, z^*)\in\mathrm{int}(\mathrm{dom}\, h^*)\times\mathcal{X}$ of (D) then follows directly from the strong convexity of $h^*$. By combining this uniqueness with [42, Theorem 4], one can easily obtain the boundedness of $\{(y^k, z^k)\}$ and the other desired results readily.

We need the following stopping criteria for the local convergence analysis:
$$\text{(B1)}\qquad \Psi_k(y^{k+1}, z^{k+1}) - \inf\Psi_k \le (\delta_k^2/2\sigma_k)\|x^{k+1} - x^k\|^2, \qquad \sum_{k=0}^{\infty}\delta_k < +\infty;$$
$$\text{(B2)}\qquad \mathrm{dist}(0,\,\partial\Psi_k(y^{k+1}, z^{k+1})) \le (\delta_k'/\sigma_k)\|x^{k+1} - x^k\|, \qquad 0\le\delta_k'\to 0,$$
where $x^{k+1} = x^k - \sigma_k(\mathcal{A}^*y^{k+1} + z^{k+1} - c)$.

Theorem 3.3. Assume that Assumption 3.1 holds and that the solution set $\Omega$ to (P) is nonempty. Suppose that $\mathcal{T}_f$ satisfies the error bound condition (2) for the origin with modulus $a_f$. Let $\{(y^k, z^k, x^k)\}$ be any infinite sequence generated by Algorithm Ssnal with stopping criteria (A) and (B1). Then the sequence $\{x^k\}$ converges to $x^*\in\Omega$, and for all $k$ sufficiently large,
$$\text{(19)}\qquad \mathrm{dist}(x^{k+1}, \Omega) \le \theta_k\,\mathrm{dist}(x^k, \Omega),$$
where $\theta_k = \bigl(a_f(a_f^2 + \sigma_k^2)^{-1/2} + 2\delta_k\bigr)(1 - \delta_k)^{-1} \to \theta_{\infty} = a_f(a_f^2 + \sigma_{\infty}^2)^{-1/2} < 1$ as $k\to+\infty$. Moreover, the sequence $\{(y^k, z^k)\}$ converges to the unique optimal solution $(y^*, z^*)\in\mathrm{int}(\mathrm{dom}\, h^*)\times\mathrm{dom}\, p^*$ of (D).

Moreover, if $\mathcal{T}_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with modulus $a_l$ and the stopping criterion (B2) is also used, then for all $k$ sufficiently large,
$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta_k'\|x^{k+1} - x^k\|,$$
where $\theta_k' = a_l(1 + \delta_k')/\sigma_k$ with $\lim_{k\to\infty}\theta_k' = a_l/\sigma_{\infty}$.

Proof. The first part of the theorem follows from [34, Theorem 2.1], [42, Proposition 7 and Theorem 5], and Theorem 3.2. To prove the second part, we recall that


if $\mathcal{T}_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with the modulus $a_l$ and $(y^k, z^k, x^k)\to(y^*, z^*, x^*)$, then for all $k$ sufficiently large,
$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| + \mathrm{dist}(x^{k+1}, \Omega) \le a_l\,\mathrm{dist}(0,\,\mathcal{T}_l(y^{k+1}, z^{k+1}, x^{k+1})).$$
Therefore, by the estimate (4.21) in [42] and the stopping criterion (B2), we obtain that for all $k$ sufficiently large,
$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le a_l(1 + \delta_k')/\sigma_k\,\|x^{k+1} - x^k\|.$$
This completes the proof of Theorem 3.3.

Remark 3.4. In the above theorem, when $\sigma_{\infty} = +\infty$, the inequality (19) directly implies that $\{x^k\}$ converges Q-superlinearly. Moreover, recent advances in [12] also established the asymptotic R-superlinear convergence of the dual iteration sequence $\{(y^k, z^k)\}$. Indeed, from [12, Proposition 4.1], under the same conditions of Theorem 3.3, we have that for $k$ sufficiently large,
$$\text{(20)}\qquad \|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta_k'\|x^{k+1} - x^k\| \le \theta_k'(1 - \delta_k)^{-1}\,\mathrm{dist}(x^k, \Omega),$$
where $\theta_k'(1 - \delta_k)^{-1}\to a_l/\sigma_{\infty}$. Thus, if $\sigma_{\infty} = \infty$, the Q-superlinear convergence of $\{x^k\}$ and (20) further imply that $\{(y^k, z^k)\}$ converges R-superlinearly.

We should emphasize here that, by combining Remarks 2.8 and 3.4 and Theorem 3.3, our algorithm Ssnal is guaranteed to produce an asymptotically superlinearly convergent sequence when used to solve (D) for many commonly used regularizers and loss functions. In particular, the Ssnal algorithm is asymptotically superlinearly convergent when applied to the dual of (1).

3.2. Solving the augmented Lagrangian subproblems. Here we shall propose an efficient semismooth Newton algorithm to solve the inner subproblems in the augmented Lagrangian method (18). That is, for some fixed $\sigma > 0$ and $x\in\mathcal{X}$, we aim to solve
$$\text{(21)}\qquad \min_{y,z}\;\bigl\{\Psi(y,z) := \mathcal{L}_{\sigma}(y,z;x)\bigr\}.$$
Since $\Psi(\cdot,\cdot)$ is a strongly convex function, we have that for any $\alpha\in\Re$, the level set $\mathcal{L}_{\alpha} := \{(y,z)\in\mathrm{dom}\, h^*\times\mathrm{dom}\, p^* \mid \Psi(y,z)\le\alpha\}$ is a closed and bounded convex set. Moreover, problem (21) admits a unique optimal solution, denoted as $(\bar y,\bar z)\in\mathrm{int}(\mathrm{dom}\, h^*)\times\mathrm{dom}\, p^*$.

Denote, for any $y\in\mathcal{Y}$,
$$\psi(y) := \inf_{z}\Psi(y,z) = h^*(y) + p^*\bigl(\mathrm{Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^*y + c)\bigr) + \frac{1}{2\sigma}\|\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y - c))\|^2 - \frac{1}{2\sigma}\|x\|^2.$$
Therefore, if $(\bar y,\bar z) = \arg\min\Psi(y,z)$, then $(\bar y,\bar z)\in\mathrm{int}(\mathrm{dom}\, h^*)\times\mathrm{dom}\, p^*$ can be computed simultaneously by
$$\bar y = \arg\min\psi(y), \qquad \bar z = \mathrm{Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^*\bar y + c).$$


Note that $\psi(\cdot)$ is strongly convex and continuously differentiable on $\mathrm{int}(\mathrm{dom}\, h^*)$ with
$$\nabla\psi(y) = \nabla h^*(y) - \mathcal{A}\,\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y - c)) \qquad \forall\, y\in\mathrm{int}(\mathrm{dom}\, h^*).$$
Thus, $\bar y$ can be obtained via solving the following nonsmooth equation:
$$\text{(22)}\qquad \nabla\psi(y) = 0, \qquad y\in\mathrm{int}(\mathrm{dom}\, h^*).$$
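For the lasso problem (1), where $h^*(y) = \frac12\|y\|^2 + \langle b,y\rangle$, $p = \lambda\|\cdot\|_1$, and $c = 0$, the function $\psi$ and its gradient take an explicit form; the following MATLAB lines are a small sketch of ours (random data, assumed names), not the authors' implementation:

```matlab
% psi and grad(psi) for the lasso specialization of (21)-(22) (sketch, ours):
% h*(y) = 0.5||y||^2 + <b,y>, p = lambda||.||_1, c = 0, so Prox_{sigma p} is
% soft-thresholding and the p*-term in psi vanishes.
rng(1);  m = 50;  n = 200;  A = randn(m,n);  b = randn(m,1);
lambda = 0.1*norm(A'*b, inf);  sigma = 1;  x = zeros(n,1);   % current multiplier
soft = @(u,t) sign(u).*max(abs(u) - t, 0);
xbar = @(y) x - sigma*(A'*y);                                % x - sigma*(A*y - c)
psi  = @(y) 0.5*norm(y)^2 + b'*y ...
            + norm(soft(xbar(y), sigma*lambda))^2/(2*sigma) - norm(x)^2/(2*sigma);
gpsi = @(y) y + b - A*soft(xbar(y), sigma*lambda);
```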

Let $y\in\mathrm{int}(\mathrm{dom}\, h^*)$ be any given point. Since $h^*$ is a convex function with a locally Lipschitz continuous gradient on $\mathrm{int}(\mathrm{dom}\, h^*)$, the following operator is well defined:
$$\hat{\partial}^2\psi(y) := \partial(\nabla h^*)(y) + \sigma\mathcal{A}\,\partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y - c))\,\mathcal{A}^*,$$
where $\partial(\nabla h^*)(y)$ is the Clarke subdifferential of $\nabla h^*$ at $y$ [10] and $\partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y - c))$ is the Clarke subdifferential of the Lipschitz continuous mapping $\mathrm{Prox}_{\sigma p}(\cdot)$ at $x - \sigma(\mathcal{A}^*y - c)$. Note that from [10, Proposition 2.3.3 and Theorem 2.6.6], we know that
$$\partial^2\psi(y)\,(d) \subseteq \hat{\partial}^2\psi(y)\,(d) \qquad \forall\, d\in\mathcal{Y},$$
where $\partial^2\psi(y)$ denotes the generalized Hessian of $\psi$ at $y$. Define
$$\text{(23)}\qquad V := H + \sigma\mathcal{A}U\mathcal{A}^*$$
with $H\in\partial^2 h^*(y)$ and $U\in\partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y - c))$. Then we have $V\in\hat{\partial}^2\psi(y)$. Since $h^*$ is a strongly convex function, we know that $H$ is symmetric positive definite on $\mathcal{Y}$, and thus $V$ is also symmetric positive definite on $\mathcal{Y}$.

Under the mild assumption that $\nabla h^*$ and $\mathrm{Prox}_{\sigma p}$ are strongly semismooth (the definition of which is given next), we can design a superlinearly convergent semismooth Newton method to solve the nonsmooth equation (22).

Definition 3.5 (semismoothness [35, 38, 45]). Let $F:\mathcal{O}\subseteq\mathcal{X}\to\mathcal{Y}$ be a locally Lipschitz continuous function on the open set $\mathcal{O}$. $F$ is said to be semismooth at $x\in\mathcal{O}$ if $F$ is directionally differentiable at $x$ and, for any $V\in\partial F(x + \Delta x)$ with $\Delta x\to 0$,
$$F(x + \Delta x) - F(x) - V\Delta x = o(\|\Delta x\|).$$
$F$ is said to be strongly semismooth at $x$ if $F$ is semismooth at $x$ and
$$F(x + \Delta x) - F(x) - V\Delta x = O(\|\Delta x\|^2).$$
$F$ is said to be a semismooth (respectively, strongly semismooth) function on $\mathcal{O}$ if it is semismooth (respectively, strongly semismooth) everywhere in $\mathcal{O}$.

Note that it is widely known in the nonsmooth optimization/equation community that continuous piecewise affine functions and twice continuously differentiable functions are all strongly semismooth everywhere. In particular, $\mathrm{Prox}_{\|\cdot\|_1}$, as a Lipschitz continuous piecewise affine function, is strongly semismooth. See [14] for more semismooth and strongly semismooth functions.
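A tiny numerical illustration of this fact (ours): for the soft-thresholding map, the residual in Definition 3.5 vanishes exactly whenever the perturbation does not cross a kink of the piecewise affine function.

```matlab
% Strong semismoothness of F = Prox_{||.||_1} (with unit threshold), sketched:
% away from the kinks |x_i| = 1 the residual F(x+dx)-F(x)-V*dx is exactly zero
% for the element V = Diag(1_{|x_i|>1}) of the Clarke Jacobian at x.
F  = @(u) sign(u).*max(abs(u) - 1, 0);
x  = [2; -0.3; 1.5];  dx = 1e-6*randn(3,1);
V  = diag(double(abs(x) > 1));
fprintf('semismoothness residual: %.2e\n', norm(F(x+dx) - F(x) - V*dx));
```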

Now we can design a semismooth Newton (Ssn) method to solve (22) as follows, and we can expect to get a fast superlinear or even quadratic convergence.


Algorithm SSN: A semismooth Newton algorithm for solving (22) (Ssn($y^0$, $x$, $\sigma$)).
Given $\mu\in(0, 1/2)$, $\bar\eta\in(0,1)$, $\tau\in(0,1]$, and $\delta\in(0,1)$, choose $y^0\in\mathrm{int}(\mathrm{dom}\, h^*)$. Iterate the following steps for $j = 0, 1, \ldots$.
Step 1. Choose $H_j\in\partial(\nabla h^*)(y^j)$ and $U_j\in\partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y^j - c))$. Let $V_j := H_j + \sigma\mathcal{A}U_j\mathcal{A}^*$. Solve the linear system
$$\text{(24)}\qquad V_j d + \nabla\psi(y^j) = 0$$
exactly, or by the conjugate gradient (CG) algorithm, to find $d^j$ such that
$$\|V_j d^j + \nabla\psi(y^j)\| \le \min(\bar\eta,\,\|\nabla\psi(y^j)\|^{1+\tau}).$$
Step 2. (Line search) Set $\alpha_j = \delta^{m_j}$, where $m_j$ is the first nonnegative integer $m$ for which
$$y^j + \delta^m d^j\in\mathrm{int}(\mathrm{dom}\, h^*) \quad\text{and}\quad \psi(y^j + \delta^m d^j) \le \psi(y^j) + \mu\delta^m\langle\nabla\psi(y^j),\, d^j\rangle.$$
Step 3. Set $y^{j+1} = y^j + \alpha_j d^j$.
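For the lasso specialization discussed in section 3.3 below, the steps of Ssn can be written out explicitly; the MATLAB sketch below (ours, with random data and simplified safeguards) solves (24) by a direct factorization and applies the backtracking line search of Step 2:

```matlab
% Simplified SSN run for the lasso subproblem (sketch, ours):
% grad psi(y) = y + b - A*soft(x - sigma*A'*y, sigma*lambda), and an element of
% the generalized Hessian is V = I_m + sigma*AJ*AJ' with J the prox support.
rng(1);  m = 50;  n = 200;  A = randn(m,n);  b = randn(m,1);
lambda = 0.1*norm(A'*b, inf);  sigma = 1;  x = zeros(n,1);
soft = @(u,t) sign(u).*max(abs(u) - t, 0);
psi  = @(y) 0.5*norm(y)^2 + b'*y + norm(soft(x - sigma*(A'*y), sigma*lambda))^2/(2*sigma);
gpsi = @(y) y + b - A*soft(x - sigma*(A'*y), sigma*lambda);
y = zeros(m,1);  mu = 1e-4;  delta = 0.5;
for j = 0:30
    g = gpsi(y);
    if norm(g) < 1e-10, break; end
    J = abs(x - sigma*(A'*y)) > sigma*lambda;          % 0-1 diagonal U = Diag(J)
    V = eye(m) + sigma*(A(:,J)*A(:,J)');               % element of the gen. Hessian
    d = -(V\g);                                        % Newton direction from (24)
    alpha = 1;                                         % Step 2: backtracking
    while psi(y + alpha*d) > psi(y) + mu*alpha*(g'*d) && alpha > 1e-12
        alpha = delta*alpha;
    end
    y = y + alpha*d;                                   % Step 3
end
fprintf('SSN stopped with ||grad psi(y)|| = %.2e after %d iterations\n', norm(gpsi(y)), j);
```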

The convergence results for the above Ssn algorithm are stated in the next theorem.

Theorem 3.6. Assume that $\nabla h^*(\cdot)$ and $\mathrm{Prox}_{\sigma p}(\cdot)$ are strongly semismooth on $\mathrm{int}(\mathrm{dom}\, h^*)$ and $\mathcal{X}$, respectively. Let the sequence $\{y^j\}$ be generated by Algorithm Ssn. Then $\{y^j\}$ converges to the unique optimal solution $\bar y\in\mathrm{int}(\mathrm{dom}\, h^*)$ of the problem in (22), and
$$\|y^{j+1} - \bar y\| = O(\|y^j - \bar y\|^{1+\tau}).$$

Proof. Since, by [54, Proposition 3.3], $d^j$ is a descent direction, Algorithm Ssn is well defined. By (23), we know that for any $j\ge 0$, $V_j\in\hat{\partial}^2\psi(y^j)$. Then we can prove the conclusion of this theorem by following the proofs of [54, Theorems 3.4 and 3.5]. We omit the details here.

We shall now discuss the implementations of stopping criteria (A), (B1), and (B2) for Algorithm Ssn to solve the subproblem (18) in Algorithm Ssnal. Note that when Algorithm Ssn is applied to minimize $\Psi_k(\cdot)$ to find
$$y^{k+1} = \mathrm{Ssn}(y^k, x^k, \sigma_k) \qquad\text{and}\qquad z^{k+1} = \mathrm{Prox}_{p^*/\sigma_k}(x^k/\sigma_k - \mathcal{A}^*y^{k+1} + c),$$
we have, by simple calculations and the strong convexity of $h^*$, that
$$\Psi_k(y^{k+1}, z^{k+1}) - \inf\Psi_k = \psi_k(y^{k+1}) - \inf\psi_k \le (1/2\alpha_h)\|\nabla\psi_k(y^{k+1})\|^2$$
and $(\nabla\psi_k(y^{k+1}),\,0)\in\partial\Psi_k(y^{k+1}, z^{k+1})$, where $\psi_k(y) := \inf_z\Psi_k(y,z)$ for all $y\in\mathcal{Y}$. Therefore, the stopping criteria (A), (B1), and (B2) can be achieved by the following implementable criteria:
$$\text{(A$'$)}\qquad \|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h/\sigma_k}\,\varepsilon_k, \qquad \sum_{k=0}^{\infty}\varepsilon_k < \infty,$$
$$\text{(B1$'$)}\qquad \|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h/\sigma_k}\,\delta_k\|\mathcal{A}^*y^{k+1} + z^{k+1} - c\|, \qquad \sum_{k=0}^{\infty}\delta_k < +\infty,$$
$$\text{(B2$'$)}\qquad \|\nabla\psi_k(y^{k+1})\| \le \delta_k'\|\mathcal{A}^*y^{k+1} + z^{k+1} - c\|, \qquad 0\le\delta_k'\to 0.$$


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as $\|\nabla\psi_k(y^{k+1})\|$ is sufficiently small.

3.3. An efficient implementation of Ssn for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we shall discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer $p$ is chosen to be $\lambda\|\cdot\|_1$ for some $\lambda > 0$. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction $d^j$ from the linear system (24). So we shall first discuss the solving of this linear system.

Let $(x, y)\in\Re^n\times\Re^m$ and $\sigma > 0$ be given. We consider the following Newton linear system:
$$\text{(25)}\qquad (H + \sigma A U A^T)\,d = -\nabla\psi(y),$$
where $H\in\partial(\nabla h^*)(y)$, $A$ denotes the matrix representation of $\mathcal{A}$ with respect to the standard bases of $\Re^n$ and $\Re^m$, and $U\in\partial\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde x)$ with $\tilde x := x - \sigma(A^T y - c)$. Since $H$ is a symmetric and positive definite matrix, (25) can be equivalently rewritten as
$$\bigl(I_m + \sigma(L^{-1}A)U(L^{-1}A)^T\bigr)(L^T d) = -L^{-1}\nabla\psi(y),$$
where $L$ is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of $H$ such that $H = LL^T$. In many applications, $H$ is usually a sparse matrix. Indeed, when the function $h$ in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrices $H$ are in fact diagonal. That is, the costs of computing $L$ and its inverse are negligible in most situations. Therefore, without loss of generality, we can consider a simplified version of (25) as follows:
$$\text{(26)}\qquad (I_m + \sigma A U A^T)\,d = -\nabla\psi(y),$$
which is precisely the Newton system associated with the standard lasso problem (1). Since $U\in\Re^{n\times n}$ is a diagonal matrix, at first glance, the costs of computing $AUA^T$ and the matrix-vector multiplication $AUA^Td$ for a given vector $d\in\Re^m$ are $O(m^2n)$ and $O(mn)$, respectively. These computational costs are too expensive when the dimensions of $A$ are large and can make the commonly employed approaches, such as the Cholesky factorization and the conjugate gradient method, inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of $U$ is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level such that they are negligible or at least insignificant compared to other costs. Next, we shall show how this can be done by taking full advantage of the sparsity of $U$. This sparsity will be referred to as the second order sparsity of the underlying problem.

For $\tilde x = x - \sigma(A^Ty - c)$, in our computations we can always choose $U = \mathrm{Diag}(u)$, the diagonal matrix whose $i$th diagonal element is given by $u_i$ with
$$u_i = \begin{cases}0 & \text{if } |\tilde x_i| \le \sigma\lambda,\\ 1 & \text{otherwise},\end{cases} \qquad i = 1, \ldots, n.$$
Since $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde x) = \mathrm{sign}(\tilde x)\circ\max\{|\tilde x| - \sigma\lambda,\, 0\}$, it is not difficult to see that $U\in\partial\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde x)$. Let $J := \{j \mid |\tilde x_j| > \sigma\lambda,\ j = 1,\ldots,n\}$ and $r := |J|$, the cardinality of $J$. By taking the special 0-1 structure of $U$ into consideration, we have that


Fig. 1. Reducing the computational costs from $O(m^2n)$ to $O(m^2r)$.

$$\text{(27)}\qquad AUA^T = (AU)(AU)^T = A_J A_J^T,$$
where $A_J\in\Re^{m\times r}$ is the submatrix of $A$ with those columns not in $J$ removed from $A$. Then, by using (27), we know that the costs of computing $AUA^T$ and $AUA^Td$ for a given vector $d$ are now reduced to $O(m^2r)$ and $O(mr)$, respectively. Due to the sparsity-promoting property of the regularizer $p$, the number $r$ is usually much smaller than $n$. Thus, by exploiting the aforementioned second order sparsity, we can greatly reduce the computational costs in solving the linear system (26) when using the Cholesky factorization. More specifically, the total computational costs of using the Cholesky factorization to solve the linear system are reduced from $O(m^2(m+n))$ to $O(m^2(m+r))$. See Figure 1 for an illustration of the reduction. This means that even if $n$ happens to be extremely large (say, larger than $10^7$), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization as long as both $m$ and $r$ are moderate (say, less than $10^4$). If, in addition, $r\ll m$, which is often the case when $m$ is large and the optimal solutions to the underlying problem are sparse, then instead of factorizing an $m\times m$ matrix, we can make use of the Sherman–Morrison–Woodbury formula [22] to get the inverse of $I_m + \sigma AUA^T$ by inverting a much smaller $r\times r$ matrix as follows:
$$(I_m + \sigma AUA^T)^{-1} = (I_m + \sigma A_J A_J^T)^{-1} = I_m - A_J(\sigma^{-1}I_r + A_J^T A_J)^{-1}A_J^T.$$
See Figure 2 for an illustration of the computation of $A_J^T A_J$. In this case, the total computational costs for solving the Newton linear system (26) are reduced significantly further, from $O(m^2(m+r))$ to $O(r^2(m+r))$. We should emphasize here that this dramatic reduction in computational costs results from the wise combination of a careful examination of the existing second order sparsity in lasso-type problems and some "smart" numerical linear algebra.
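The following MATLAB fragment (ours, with synthetic data) contrasts the two ways of solving (26) just described: forming the $m\times m$ matrix $I_m + \sigma A_J A_J^T$ directly versus applying the Sherman–Morrison–Woodbury formula, which only factorizes an $r\times r$ matrix:

```matlab
% Solving (I_m + sigma*A*U*A')*d = rhs using the second order sparsity (sketch, ours).
rng(2);  m = 500;  n = 4000;  r = 30;  sigma = 1;
A = randn(m,n);  rhs = randn(m,1);
J = randperm(n, r);  AJ = A(:,J);              % stand-in for {j : |xtilde_j| > sigma*lambda}
% (i)  direct m-by-m solve using A*U*A' = AJ*AJ'
d1 = (eye(m) + sigma*(AJ*AJ')) \ rhs;
% (ii) Sherman-Morrison-Woodbury: only an r-by-r system is factorized
d2 = rhs - AJ*((eye(r)/sigma + AJ'*AJ) \ (AJ'*rhs));
fprintf('relative difference between the two solves: %.2e\n', norm(d1 - d2)/norm(d1));
```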

From the above arguments, we can see that as long as the number of the nonzero components of $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde x)$ is small, say less than $\sqrt{n}$, and $H_j\in\partial(\nabla h^*)(y^j)$ is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low cost. In particular, this is true for the lasso problems admitting sparse


Fig. 2. Further reducing the computational costs to $O(r^2m)$.

solutions. Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). Note that one may argue that even if the original problem has only sparse solutions, at certain stages one may still encounter the situation that the number of the nonzero components of $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde x)$ is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice, since we always start with a sparse feasible point, e.g., the zero vector. Second, even if at certain steps this phenomenon does occur, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter $\sigma$ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with $p(\cdot)$ chosen to be $\lambda\|\cdot\|_1$.

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two recent papers [18, 36], which appear to suggest that for some large-scale sparse reconstruction problems, mfIPM¹ and FPC_AS² have mostly outperformed the other solvers. Hence, in this section, we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2] as implemented by Liu, Ji, and Ye in SLEP³ [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparisons, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB with the step length set to be 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later, they lack the ability to efficiently solve difficult problems such as the large-scale regression problems when the data $A$ is badly conditioned.

For testing purposes, the regularization parameter $\lambda$ in the lasso problem (1) is chosen as
$$\lambda = \lambda_c\|\mathcal{A}^*b\|_{\infty},$$
where $0 < \lambda_c < 1$. In our numerical experiments, we measure the accuracy of an approximate optimal solution $x$ for (1) by using the following relative KKT residual:
$$\eta = \frac{\|x - \mathrm{prox}_{\lambda\|\cdot\|_1}(x - \mathcal{A}^*(\mathcal{A}x - b))\|}{1 + \|x\| + \|\mathcal{A}x - b\|}.$$
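In MATLAB, these two quantities can be computed as in the following sketch (ours, with placeholder data $A$, $b$, and a candidate solution $x$):

```matlab
% Regularization parameter and relative KKT residual used in the experiments
% (illustrative sketch, ours).
rng(0);  A = randn(20,100);  b = randn(20,1);  x = zeros(100,1);
lambda_c = 1e-3;
lambda   = lambda_c*norm(A'*b, inf);
soft = @(u,t) sign(u).*max(abs(u) - t, 0);
eta  = norm(x - soft(x - A'*(A*x - b), lambda)) / (1 + norm(x) + norm(A*x - b));
fprintf('eta = %.2e\n', eta);
```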

¹http://www.maths.ed.ac.uk/ERGO/mfipmcs  ²http://www.caam.rice.edu/~optimization/L1/FPC_AS  ³http://yelab.net/software/SLEP


For a given tolerance $\varepsilon > 0$, we will stop the tested algorithms when $\eta < \varepsilon$. For all the tests in this section, we set $\varepsilon = 10^{-6}$. The algorithms will also be stopped when they reach the maximum number of iterations (1000 iterations for our algorithm and mfIPM, and 20000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to the default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core Intel Xeon E5-2650 @ 2.60GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms with the test instances $(A, b)$ obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in $A$ (the matrix representation of $\mathcal{A}$) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order 5 polynomial is used to generate the basis functions. This naming convention is also used for the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$, which is denoted as $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$.

Table 2 reports the detailed numerical results for Ssnal mfIPM FPC AS APGLADMM and ADMM in solving large-scale regression problems In the table mdenotes the number of samples n denotes the number of features and ldquonnzrdquo denotesthe number of nonzeros in the solution x obtained by Ssnal using the followingestimation

$$\mathrm{nnz} = \min\Big\{k \;\Big|\; \sum_{i=1}^{k} |\hat{x}_i| \ge 0.999\,\|x\|_1\Big\},$$

where $\hat{x}$ is obtained by sorting $x$ such that $|\hat{x}_1| \ge \cdots \ge |\hat{x}_n|$. An entry marked as "Error" indicates that the algorithm breaks down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and the entry "00" in the time column means that the computation time is less than 0.5 second.
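In MATLAB, this estimate can be computed as in the following short sketch (variable names are illustrative):

xs      = sort(abs(x), 'descend');                  % |x_1| >= |x_2| >= ... >= |x_n| after sorting
nnz_est = find(cumsum(xs) >= 0.999*norm(x,1), 1);   % smallest k capturing 99.9% of ||x||_1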

One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances

Table 1
Statistics of the UCI test instances.

Probname            | m; n            | λmax(AA*)
E2006.train         | 16087; 150348   | 1.91e+05
log1p.E2006.train   | 16087; 4265669  | 5.86e+07
E2006.test          | 3308; 72812     | 4.79e+04
log1p.E2006.test    | 3308; 1771946   | 1.46e+07
pyrim5              | 74; 169911      | 1.22e+06
triazines4          | 186; 557845     | 2.07e+07
abalone7            | 4177; 6435      | 5.21e+05
bodyfat7            | 252; 116280     | 5.29e+04
housing7            | 506; 77520      | 3.28e+05
mpg7                | 392; 3432       | 1.28e+04
space_ga9           | 3107; 5005      | 4.01e+03


Table 2
The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy $\varepsilon = 10^{-6}$). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM; for each instance (with dimensions m; n) and each value of $\lambda_c$, the table reports nnz, the relative KKT residual $\eta$ attained by a–f, and the computation time of a–f.


to the required accuracy after 20000 iterations or 7 hours. In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances; in fact, for 3 test instances it breaks down due to some internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with $\lambda_c = 10^{-3}$, we can see that Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20000 iterations and consumes about 2 hours but only produces a rather inaccurate solution.

On the other hand, one can observe that the two second order information based methods, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds ($\lambda_c = 10^{-3}$). Among these two algorithms, Ssnal is clearly far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where the factor can be up to 300 times. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with an accuracy of $10^{-1}$ when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point approach in exploiting the sparsity in the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix $A$ (the matrix representation of $\mathcal{A}$) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter $\lambda$ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when $\lambda$ is not a scalar, based on the recommendation of the referee we use another popular active-set based solver, PSSas^4 [44], to replace FPC_AS in this test. All the parameters for PSSas are set to their default values. In our tests, PSSas is stopped when it reaches the maximum number of 20000 iterations or the maximum computation time of 7 hours. We note that by default, PSSas terminates when its progress is smaller than the threshold $10^{-9}$.

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions drawn from Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM type of solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy. In fact, PSSas fails on all the test instances when $\lambda_c = 10^{-4}$. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with $\eta = 3.2\times 10^{-3}$. (Actually, we also ran PSSas on these test instances without scaling

^4 https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3
The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy $\varepsilon = 10^{-6}$). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM; for each instance and each value of $\lambda_c$, the table reports nnz, the relative KKT residual $\eta$ attained by a–f, and the computation time of a–f.


and obtained similar performance; detailed results are omitted to conserve space.) Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for the Sparco collection. In this subsection, the test instances (A, b) are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce 60dB noise to b (as in [18]) by using the MATLAB command b = awgn(b,60,'measured'). For these test instances, the matrix representations of the linear maps $\mathcal{A}$ are not available. Hence, ADMM is not tested in this subsection, as it would be extremely expensive, if not impossible, to compute and factorize $I + \sigma\mathcal{A}\mathcal{A}^*$.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of $\lambda_c$, i.e., $\lambda_c = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}$. As one can observe, as $\lambda_c$ decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions $(m, n)$ and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$ ($\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, as well as the computation times (in the format of hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well for these test instances with small Lipschitz constants $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. As such, Ssnal does not have as clear an advantage as that shown in Table 2. Moreover, since the matrix representations of the linear maps $\mathcal{A}$ and $\mathcal{A}^*$ involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases with $\lambda_c = 10^{-3}$ and $10^{-4}$. We can observe that FPC_AS

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy $\varepsilon = 10^{-6}$, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM; an entry x.y-k stands for $x.y \times 10^{-k}$.

λc | nnz || Iteration: a | b | c | d | e || η: a | b | c | d | e || Time: a | b | c | d | e

srcsep1: m = 29166, n = 57344, λmax(AA*) = 3.56
0.16 | 380  || 14 | 28 | 42 | 401 | 89    || 6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7 || 07 | 09 | 05 | 15 | 07
0.10 | 726  || 16 | 38 | 42 | 574 | 161   || 4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7 || 11 | 15 | 05 | 22 | 13
0.06 | 1402 || 19 | 41 | 63 | 801 | 393   || 1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7 || 18 | 18 | 10 | 30 | 32
0.03 | 2899 || 19 | 56 | 110 | 901 | 337  || 2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7 || 28 | 53 | 16 | 33 | 28
0.01 | 6538 || 17 | 88 | 223 | 1401 | 542 || 7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7 || 121 | 215 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, λmax(AA*) = 4.95
0.16 | 423  || 15 | 29 | 42 | 501 | 127   || 3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7 || 14 | 13 | 07 | 25 | 15
0.10 | 805  || 16 | 37 | 84 | 601 | 212   || 8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7 || 21 | 19 | 16 | 30 | 26
0.06 | 1549 || 19 | 40 | 84 | 901 | 419   || 1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7 || 32 | 28 | 18 | 44 | 50
0.03 | 3254 || 20 | 69 | 128 | 901 | 488  || 1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7 || 106 | 133 | 26 | 44 | 59
0.01 | 7400 || 21 | 94 | 259 | 2201 | 837 || 8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7 || 142 | 533 | 59 | 205 | 143


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy $\varepsilon = 10^{-6}$, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM; an entry x.y-k stands for $x.y \times 10^{-k}$.

probname; m; n | λc | nnz || η: a | b | c | d | e || Time: a | b | c | d | e

blknheavi; 1024; 1024
  10^{-3} | 12    || 5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7 || 01 | 01 | 55 | 08 | 06
  10^{-4} | 12    || 9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7 || 01 | 01 | 49 | 07 | 07
srcsep1; 29166; 57344
  10^{-3} | 14066 || 1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7 || 541 | 4234 | 1325 | 156 | 416
  10^{-4} | 19306 || 9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7 || 928 | 33108 | 3228 | 250 | 1306
srcsep2; 29166; 86016
  10^{-3} | 16502 || 3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7 || 951 | 10110 | 1627 | 257 | 849
  10^{-4} | 22315 || 7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7 || 1914 | 64021 | 20106 | 456 | 1601
srcsep3; 196608; 196608
  10^{-3} | 27314 || 6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7 || 33 | 624 | 851 | 47 | 49
  10^{-4} | 83785 || 9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7 || 203 | 14259 | 340 | 115 | 306
soccer1; 3200; 4096
  10^{-3} | 4     || 1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7 || 01 | 03 | 1351 | 235 | 02
  10^{-4} | 8     || 8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7 || 01 | 02 | 1323 | 307 | 02
soccer2; 3200; 4096
  10^{-3} | 4     || 3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7 || 00 | 03 | 1346 | 140 | 02
  10^{-4} | 8     || 2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7 || 01 | 03 | 1327 | 307 | 02
blurrycam; 65536; 65536
  10^{-3} | 1694  || 1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7 || 03 | 09 | 03 | 02 | 07
  10^{-4} | 5630  || 1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7 || 05 | 135 | 08 | 03 | 29
blurspike; 16384; 16384
  10^{-3} | 1954  || 3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7 || 03 | 05 | 638 | 03 | 27
  10^{-4} | 11698 || 3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7 || 10 | 08 | 643 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when $\lambda_c = 10^{-3}$ and $10^{-4}$, respectively. For a few cases, FPC_AS can only achieve a poor accuracy ($10^{-1}$). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. For example, both FPC_AS and APG fail to solve the rather small problem blknheavi ($m = n = 1024$), whose corresponding $\lambda_{\max}(\mathcal{A}\mathcal{A}^*) = 709$. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$, one can safely conclude that the first order algorithms can only be used to solve problems with relatively small $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. Moreover, in order to obtain efficient and robust algorithms for lasso problems or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotic superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving $\ell_1$-regularized least squares (LS) problems. With the intelligent


incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale $\ell_1$-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and the anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candès, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrmann, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Natl. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation par éléments finis d'ordre un et la résolution par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationnelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B.-G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wächter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Softw., 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C^{1,1} optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, in NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A two-phase proximal augmented Lagrangian method for convex quadratic semidefinite programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact regularized proximal Newton method: Provable convergence guarantees for non-smooth convex minimization without strong convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.
[56] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.
[57] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.


where $h^*$ and $p^*$ are the Fenchel conjugate functions of $h$ and $p$, respectively. Throughout this section, we make the following blanket assumption on $h^*(\cdot)$ and $p^*(\cdot)$.

Assumption 2.5. $h^*(\cdot)$ is essentially smooth, and $p^*(\cdot)$ is either an indicator function $\delta_P(\cdot)$ or a support function $\delta^*_P(\cdot)$ for some nonempty polyhedral convex set $P \subseteq X$. Moreover, $\nabla h^*$ is locally Lipschitz continuous and directionally differentiable on $\mathrm{int}(\mathrm{dom}\, h^*)$.

Under Assumption 2.5, by [40, Theorem 26.1], we know that $\partial h^*(y) = \emptyset$ whenever $y \notin \mathrm{int}(\mathrm{dom}\, h^*)$. Denote by $l$ the Lagrangian function for (4):

$$(5) \quad l(y, z; x) = h^*(y) + p^*(z) - \langle x,\, A^*y + z - c\rangle \quad \forall\, (y, z, x) \in Y \times X \times X.$$

Corresponding to the closed proper convex function $f$ in the objective of (3) and the convex-concave function $l$ in (5), define the maximal monotone operators $T_f$ and $T_l$ [42] by

$$T_f(x) := \partial f(x), \qquad T_l(y,z,x) := \{(y', z', x') \mid (y', z', -x') \in \partial l(y,z,x)\},$$

whose inverses are given respectively by

$$T_f^{-1}(x') = \partial f^*(x'), \qquad T_l^{-1}(y',z',x') = \{(y,z,x) \mid (y', z', -x') \in \partial l(y,z,x)\}.$$

Unlike the case for $T_f$ [33, 47, 58], stability results for $T_l$, which correspond to perturbations of both the primal and the dual solutions, are very limited. Next, as a tool for studying the convergence rate of Ssnal, we shall establish a theorem which reveals the metric subregularity of $T_l$ under some mild assumptions.

The KKT system associated with problem (4) is given as follows:

$$0 \in \partial h^*(y) - Ax, \quad 0 \in -x + \partial p^*(z), \quad 0 = A^*y + z - c, \quad (x,y,z) \in X \times Y \times X.$$

Assume that the above KKT system has at least one solution. This existence assumption, together with the essential smoothness assumption on $h^*$, implies that the above KKT system can be equivalently rewritten as

$$(6) \quad 0 = \nabla h^*(y) - Ax, \quad 0 \in -x + \partial p^*(z), \quad 0 = A^*y + z - c, \quad (x,y,z) \in X \times \mathrm{int}(\mathrm{dom}\, h^*) \times X.$$

Therefore, under the above assumptions, we only need to focus on $\mathrm{int}(\mathrm{dom}\, h^*) \times X$ when solving problem (4). Let $(\bar{y}, \bar{z})$ be an optimal solution to problem (4). Then we know that the set of Lagrangian multipliers associated with $(\bar{y}, \bar{z})$, denoted as $\mathcal{M}(\bar{y}, \bar{z})$, is nonempty. Define the critical cone associated with (4) at $(\bar{y}, \bar{z})$ as follows:

$$(7) \quad \mathcal{C}(\bar{y}, \bar{z}) = \big\{(d_1, d_2) \in Y \times X \mid A^*d_1 + d_2 = 0,\ \langle \nabla h^*(\bar{y}), d_1\rangle + (p^*)'(\bar{z}; d_2) = 0,\ d_2 \in \mathcal{T}_{\mathrm{dom}\, p^*}(\bar{z})\big\},$$

where (plowast)prime(z d2) is the directional derivative of plowast at z with respect to d2 andTdom(plowast)(z) is the tangent cone of dom(plowast) at z When the conjugate function plowast

is taken to be the indicator function of a nonempty polyhedral set P the above defi-nition reduces directly to the standard definition of the critical cone in the nonlinearprogramming setting


Definition 2.6 (second order sufficient condition). Let $(\bar{y}, \bar{z}) \in Y \times X$ be an optimal solution to problem (4) with $\mathcal{M}(\bar{y},\bar{z}) \neq \emptyset$. We say that the second order sufficient condition for problem (4) holds at $(\bar{y},\bar{z})$ if

$$\langle d_1, (\nabla h^*)'(\bar{y}; d_1)\rangle > 0 \quad \forall\, 0 \neq (d_1, d_2) \in \mathcal{C}(\bar{y}, \bar{z}).$$

By building on the proof ideas from the literature on nonlinear programming problems [11, 27, 24], we are able to prove the following result on the metric subregularity of $T_l$. This allows us to prove the linear and even the asymptotic superlinear convergence of the sequences generated by the Ssnal algorithm, to be presented in the next section, even when the objective in problem (3) does not possess the piecewise linear-quadratic structure as in the lasso problem (1).

Theorem 2.7. Assume that the KKT system (6) has at least one solution, denoted as $(\bar{x}, \bar{y}, \bar{z})$. Suppose that Assumption 2.5 holds and that the second order sufficient condition for problem (4) holds at $(\bar{y},\bar{z})$. Then $T_l$ is metrically subregular at $(\bar{y},\bar{z},\bar{x})$ for the origin.

Proof. First, we claim that there exists a neighborhood $U$ of $(\bar{x},\bar{y},\bar{z})$ such that for any $w = (u_1, u_2, v) \in W := Y \times X \times X$ with $\|w\|$ sufficiently small, any solution $(x(w), y(w), z(w)) \in U$ of the perturbed KKT system

$$(8) \quad 0 = \nabla h^*(y) - Ax - u_1, \quad 0 \in -x - u_2 + \partial p^*(z), \quad 0 = A^*y + z - c - v$$

satisfies the following estimate

$$(9) \quad \|(y(w), z(w)) - (\bar{y}, \bar{z})\| = O(\|w\|).$$

For the sake of contradiction, suppose that our claim is not true. Then there exist sequences $\{w^k = (u_1^k, u_2^k, v^k)\}$ and $\{(x^k, y^k, z^k)\}$ such that $w^k \to 0$ and $(x^k, y^k, z^k) \to (\bar{x}, \bar{y}, \bar{z})$, for each $k$ the point $(x^k, y^k, z^k)$ is a solution of (8) for $w = w^k$, and
$$\|(y^k, z^k) - (\bar{y}, \bar{z})\| > \gamma_k\|w^k\|$$
with some $\gamma_k > 0$ such that $\gamma_k \to \infty$. Passing to a subsequence if necessary, we assume that

with some γk gt 0 such that γk rarrinfin Passing to a subsequence if necessary we assumethat

((yk zk) minus (y z)

)(yk zk) minus (y z) converges to some ξ = (ξ1 ξ2) isin Y times X

ξ = 1 Then setting tk = (yk zk) minus (y z) and passing to a subsequence furtherif necessary by the local Lipschitz continuity and the directional differentiability ofnablahlowast(middot) at y we know that for all k sufficiently large

nablahlowast(yk)minusnablahlowast(y) =nablahlowast(y + tkξ1)minusnablahlowast(y) +nablahlowast(yk)minusnablahlowast(y + tkξ1)

= tk(nablahlowast)prime(y ξ1) + o(tk) +O(yk minus y minus tkξ1)

= tk(nablahlowast)prime(y ξ1) + o(tk)

Denote, for each $k$, $\tilde{x}^k := x^k + u_2^k$. Simple calculations show that for all $k$ sufficiently large,
$$(10) \quad 0 = \nabla h^*(y^k) - \nabla h^*(\bar{y}) - A(\tilde{x}^k - \bar{x}) + Au_2^k - u_1^k = t_k(\nabla h^*)'(\bar{y}; \xi_1) + o(t_k) - A(\tilde{x}^k - \bar{x})$$
and
$$(11) \quad 0 = A^*(y^k - \bar{y}) + (z^k - \bar{z}) - v^k = t_k(A^*\xi_1 + \xi_2) + o(t_k).$$


Dividing both sides of (11) by $t_k$ and then taking limits, we obtain
$$(12) \quad A^*\xi_1 + \xi_2 = 0,$$
which further implies that
$$(13) \quad \langle \nabla h^*(\bar{y}), \xi_1\rangle = \langle A\bar{x}, \xi_1\rangle = -\langle \bar{x}, \xi_2\rangle.$$

Since $\tilde{x}^k \in \partial p^*(z^k)$, we know that for all $k$ sufficiently large,
$$z^k = \bar{z} + t_k\xi_2 + o(t_k) \in \mathrm{dom}\, p^*.$$
That is, $\xi_2 \in \mathcal{T}_{\mathrm{dom}\, p^*}(\bar{z})$. According to the structure of $p^*$, we separate our discussion into two cases.

Case I. There exists a nonempty polyhedral convex set $P$ such that $p^*(z) = \delta^*_P(z)$ for all $z \in X$. Then for each $k$ it holds that
$$\tilde{x}^k = \Pi_P(z^k + \tilde{x}^k).$$

By [14, Theorem 4.1.1], we have
$$(14) \quad \tilde{x}^k = \Pi_P(z^k + \tilde{x}^k) = \Pi_P(\bar{z} + \bar{x}) + \Pi_C(z^k - \bar{z} + \tilde{x}^k - \bar{x}) = \bar{x} + \Pi_C(z^k - \bar{z} + \tilde{x}^k - \bar{x}),$$
where $C$ is the critical cone of $P$ at $\bar{z} + \bar{x}$, i.e.,
$$C \equiv \mathcal{C}_P(\bar{z} + \bar{x}) = \mathcal{T}_P(\bar{x}) \cap \bar{z}^\perp.$$

Since $C$ is a polyhedral cone, we know from [14, Proposition 4.1.4] that $\Pi_C(\cdot)$ is a piecewise linear function, i.e., there exist a positive integer $l$ and orthogonal projectors $B_1, \ldots, B_l$ such that for any $x \in X$,
$$\Pi_C(x) \in \{B_1 x, \ldots, B_l x\}.$$

By restricting to a subsequence if necessary, we may further assume that there exists $1 \le j' \le l$ such that for all $k \ge 1$,
$$(15) \quad \Pi_C(z^k - \bar{z} + \tilde{x}^k - \bar{x}) = B_{j'}(z^k - \bar{z} + \tilde{x}^k - \bar{x}) = \Pi_{\mathrm{Range}(B_{j'})}(z^k - \bar{z} + \tilde{x}^k - \bar{x}).$$

Denote $L := \mathrm{Range}(B_{j'})$. Combining (14) and (15), we get
$$C \cap L \ni (\tilde{x}^k - \bar{x}) \perp (z^k - \bar{z}) \in C^\circ \cap L^\perp,$$
where $C^\circ$ is the polar cone of $C$. Since $\tilde{x}^k - \bar{x} \in C$, we have $\langle \tilde{x}^k - \bar{x}, \bar{z}\rangle = 0$, which together with $\langle \tilde{x}^k - \bar{x}, z^k - \bar{z}\rangle = 0$ implies $\langle \tilde{x}^k - \bar{x}, z^k\rangle = 0$. Thus, for all $k$ sufficiently large,
$$\langle z^k, \tilde{x}^k\rangle = \langle z^k, \bar{x}\rangle = \langle \bar{z} + t_k\xi_2 + o(t_k), \bar{x}\rangle,$$

and it follows that
$$(16) \quad (p^*)'(\bar{z}; \xi_2) = \lim_{k\to\infty} \frac{\delta^*_P(\bar{z} + t_k\xi_2) - \delta^*_P(\bar{z})}{t_k} = \lim_{k\to\infty} \frac{\delta^*_P(z^k) - \delta^*_P(\bar{z})}{t_k} = \lim_{k\to\infty} \frac{\langle z^k, \tilde{x}^k\rangle - \langle \bar{z}, \bar{x}\rangle}{t_k} = \langle \bar{x}, \xi_2\rangle.$$


By (12), (13), and (16), we know that $(\xi_1, \xi_2) \in \mathcal{C}(\bar{y},\bar{z})$. Since $C \cap L$ is a polyhedral convex cone, we know from [40, Theorem 19.3] that $A(C \cap L)$ is also a polyhedral convex cone, which together with (10) implies
$$(\nabla h^*)'(\bar{y}; \xi_1) \in A(C \cap L).$$
Therefore, there exists a vector $\eta \in C \cap L$ such that
$$\langle \xi_1, (\nabla h^*)'(\bar{y}; \xi_1)\rangle = \langle \xi_1, A\eta\rangle = -\langle \xi_2, \eta\rangle = 0,$$

where the last equality follows from the fact that $\xi_2 \in C^\circ \cap L^\perp$. Note that this last inclusion holds since the polyhedral convex cone $C^\circ \cap L^\perp$ is closed and
$$\xi_2 = \lim_{k\to\infty} \frac{z^k - \bar{z}}{t_k} \in C^\circ \cap L^\perp.$$

As $0 \neq \xi = (\xi_1,\xi_2) \in \mathcal{C}(\bar{y},\bar{z})$ but $\langle \xi_1, (\nabla h^*)'(\bar{y}; \xi_1)\rangle = 0$, this contradicts the assumption that the second order sufficient condition holds for (4) at $(\bar{y},\bar{z})$. Thus, we have proved our claim for Case I.

Case II. There exists a nonempty polyhedral convex set $P$ such that $p^*(z) = \delta_P(z)$ for all $z \in X$. Then we know that for each $k$,
$$z^k = \Pi_P(z^k + \tilde{x}^k).$$

Since $(\delta_P)'(\bar{z}; d_2) = 0$ for all $d_2 \in \mathcal{T}_{\mathrm{dom}\, p^*}(\bar{z})$, the critical cone in (7) now takes the following form:
$$\mathcal{C}(\bar{y},\bar{z}) = \big\{(d_1,d_2) \in Y \times X \mid A^*d_1 + d_2 = 0,\ \langle \nabla h^*(\bar{y}), d_1\rangle = 0,\ d_2 \in \mathcal{T}_{\mathrm{dom}\, p^*}(\bar{z})\big\}.$$

Similar to Case I, without loss of generality, we can assume that there exists a subspace $L$ such that for all $k \ge 1$,
$$C \cap L \ni (z^k - \bar{z}) \perp (\tilde{x}^k - \bar{x}) \in C^\circ \cap L^\perp,$$
where
$$C \equiv \mathcal{C}_P(\bar{z} + \bar{x}) = \mathcal{T}_P(\bar{z}) \cap \bar{x}^\perp.$$

Since $C \cap L$ is a polyhedral convex cone, we know
$$\xi_2 = \lim_{k\to\infty} \frac{z^k - \bar{z}}{t_k} \in C \cap L,$$

and consequently $\langle \xi_2, \bar{x}\rangle = 0$, which together with (12) and (13) implies $\xi = (\xi_1,\xi_2) \in \mathcal{C}(\bar{y},\bar{z})$. By (10) and the fact that $A(C^\circ \cap L^\perp)$ is a polyhedral convex cone, we know that
$$(\nabla h^*)'(\bar{y}; \xi_1) \in A(C^\circ \cap L^\perp).$$
Therefore, there exists a vector $\eta \in C^\circ \cap L^\perp$ such that
$$\langle \xi_1, (\nabla h^*)'(\bar{y}; \xi_1)\rangle = \langle \xi_1, A\eta\rangle = -\langle \xi_2, \eta\rangle = 0.$$

Since $\xi = (\xi_1,\xi_2) \neq 0$, we arrive at a contradiction to the assumed second order sufficient condition. So our claim is also true for this case.

In summary, we have proven that there exists a neighborhood $U$ of $(\bar{x},\bar{y},\bar{z})$ such that for any $w$ close enough to the origin, (9) holds for any solution


$(x(w), y(w), z(w)) \in U$ of the perturbed KKT system (8). Next, we show that $T_l$ is metrically subregular at $(\bar{y},\bar{z},\bar{x})$ for the origin.

Define the mapping $\Theta_{\rm KKT}: X \times Y \times X \times W \to Y \times X \times X$ by
$$\Theta_{\rm KKT}(x,y,z,w) := \begin{pmatrix} \nabla h^*(y) - Ax - u_1 \\ z - \mathrm{Prox}_{p^*}(z + x + u_2) \\ A^*y + z - c - v \end{pmatrix} \quad \forall\, (x,y,z,w) \in X \times Y \times X \times W,$$
and define the mapping $\theta: X \to Y \times X \times X$ as follows:
$$\theta(x) := \Theta_{\rm KKT}(x, \bar{y}, \bar{z}, 0) \quad \forall x \in X.$$

Then we have $x \in \mathcal{M}(\bar{y},\bar{z})$ if and only if $\theta(x) = 0$. Since $\mathrm{Prox}_{p^*}(\cdot)$ is a piecewise affine function, $\theta(\cdot)$ is a piecewise affine function and thus a polyhedral multifunction. By using Proposition 2.2 and shrinking the neighborhood $U$ if necessary, for any $w$ close enough to the origin and any solution $(x(w),y(w),z(w)) \in U$ of the perturbed KKT system (8), we have
$$\mathrm{dist}(x(w), \mathcal{M}(\bar{y},\bar{z})) = O(\|\theta(x(w))\|) = O(\|\Theta_{\rm KKT}(x(w),\bar{y},\bar{z},0) - \Theta_{\rm KKT}(x(w),y(w),z(w),w)\|) = O(\|w\| + \|(y(w),z(w)) - (\bar{y},\bar{z})\|),$$
which together with (9) implies the existence of a constant $\kappa \ge 0$ such that
$$(17) \quad \|(y(w),z(w)) - (\bar{y},\bar{z})\| + \mathrm{dist}(x(w), \mathcal{M}(\bar{y},\bar{z})) \le \kappa\|w\|.$$

Thus, by Definition 2.3, we have proven that $T_l$ is metrically subregular at $(\bar{y},\bar{z},\bar{x})$ for the origin. The proof of the theorem is complete.

Remark 2.8. For convex piecewise linear-quadratic programming problems, such as the $\ell_1$ and elastic net regularized LS problems, we know from J. Sun's thesis [46] on the characterization of the subdifferentials of convex piecewise linear-quadratic functions (see also [43, Proposition 12.30]) that the corresponding operators $T_l$ and $T_f$ are polyhedral multifunctions, and thus by Proposition 2.2 the error bound condition holds. Here we emphasize again that the error bound condition holds for the lasso problem (1). Moreover, $T_f$ associated with the $\ell_1$ or elastic net regularized logistic regression model, i.e., the model in which, for a given vector $b \in \Re^m$, the loss function $h: \Re^m \to \Re$ in problem (3) takes the form
$$h(y) = \sum_{i=1}^m \log(1 + e^{-b_i y_i}) \quad \forall y \in \Re^m,$$

also satisfies the error bound condition [33, 47]. Meanwhile, from Theorem 2.7, we know that $T_l$ corresponding to the $\ell_1$ or the elastic net regularized logistic regression model is metrically subregular at any solution to the KKT system (6) for the origin.

3. An augmented Lagrangian method with asymptotic superlinear convergence. Recall the general convex composite model (3):
$$(\mathrm{P}) \quad \max\ \big\{-\big(f(x) := h(Ax) - \langle c, x\rangle + p(x)\big)\big\}$$
and its dual
$$(\mathrm{D}) \quad \min\ \big\{h^*(y) + p^*(z) \mid A^*y + z = c\big\}.$$
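For concreteness, under the standard identification $h(u) = \frac{1}{2}\|u - b\|^2$, $c = 0$, and $p = \lambda\|\cdot\|_1$, problem (P) reduces (in minimization form) to the lasso problem (1), and since $h^*(y) = \frac{1}{2}\|y\|^2 + \langle b, y\rangle$ and $p^*(z) = \delta_{\{\|\cdot\|_\infty \le \lambda\}}(z)$, the corresponding dual (D) reads
$$\min_{y,\,z}\ \Big\{\tfrac{1}{2}\|y\|^2 + \langle b, y\rangle + \delta_{\{\|\cdot\|_\infty \le \lambda\}}(z) \ \Big|\ A^*y + z = 0\Big\}.$$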


In this section, we shall propose an asymptotically superlinearly convergent augmented Lagrangian method for solving (P) and (D). We make the following standing assumptions regarding the function $h$.

Assumption 3.1.
(a) $h: Y \to \Re$ is a convex differentiable function whose gradient is $1/\alpha_h$-Lipschitz continuous, i.e.,
$$\|\nabla h(y') - \nabla h(y)\| \le (1/\alpha_h)\|y' - y\| \quad \forall\, y', y \in Y.$$
(b) $h$ is essentially locally strongly convex [21], i.e., for any compact and convex set $K \subset \mathrm{dom}\,\partial h$, there exists $\beta_K > 0$ such that
$$(1-\lambda)h(y') + \lambda h(y) \ge h((1-\lambda)y' + \lambda y) + \tfrac{1}{2}\beta_K\lambda(1-\lambda)\|y' - y\|^2 \quad \forall\, y', y \in K,$$
for all $\lambda \in [0,1]$.

Many commonly used loss functions in the machine learning literature satisfy the above mild assumptions. For example, $h$ can be the loss function in the linear regression, logistic regression, and Poisson regression models. While the strict convexity of a convex function is closely related to the differentiability of its conjugate, the essential local strong convexity in Assumption 3.1(b) of a convex function was first proposed in [21] to obtain a characterization of the local Lipschitz continuity of the gradient of its conjugate function.

The aforementioned assumptions on $h$ imply the following useful properties of $h^*$. First, by [43, Proposition 12.60], we know that $h^*$ is strongly convex with modulus $\alpha_h$. Second, by [21, Corollary 4.4], we know that $h^*$ is essentially smooth and $\nabla h^*$ is locally Lipschitz continuous on $\mathrm{int}(\mathrm{dom}\, h^*)$. If the solution set to the KKT system associated with (P) and (D) is further assumed to be nonempty, then, similar to what we have discussed in the last section, one only needs to focus on $\mathrm{int}(\mathrm{dom}\, h^*) \times X$ when solving (D). Given $\sigma > 0$, the augmented Lagrangian function associated with (D) is given by
$$\mathcal{L}_\sigma(y, z; x) := l(y, z; x) + \frac{\sigma}{2}\|A^*y + z - c\|^2 \quad \forall\, (y, z, x) \in Y \times X \times X,$$
where the Lagrangian function $l(\cdot)$ is defined in (5).

3.1. SSNAL: A semismooth Newton augmented Lagrangian algorithm for (D). The detailed steps of our algorithm Ssnal for solving (D) are given as follows. Since a semismooth Newton method will be employed to solve the subproblems involved in this prototype of the inexact augmented Lagrangian method [42], it is natural for us to call our algorithm Ssnal.

Algorithm SSNAL: An inexact augmented Lagrangian method for (D).
Let $\sigma_0 > 0$ be a given parameter. Choose $(y^0, z^0, x^0) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^* \times X$. For $k = 0, 1, \ldots$, perform the following steps in each iteration:
Step 1. Compute
$$(18) \quad (y^{k+1}, z^{k+1}) \approx \arg\min\big\{\Psi_k(y,z) := \mathcal{L}_{\sigma_k}(y, z; x^k)\big\}.$$
Step 2. Compute $x^{k+1} = x^k - \sigma_k(A^*y^{k+1} + z^{k+1} - c)$ and update $\sigma_{k+1} \uparrow \sigma_\infty \le \infty$.
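For the lasso problem (1) (where $p = \lambda\|\cdot\|_1$ and $c = 0$, so that $\mathrm{Prox}_{p^*/\sigma_k}$ is the projection onto $[-\lambda,\lambda]^n$), the outer loop can be sketched in MATLAB as follows; this is a minimal illustration rather than the authors' implementation, ssn_subproblem is a hypothetical stand-in for the inner solver described in section 3.2, and sigma0, gamma, sigma_max, tol, maxit are user-chosen parameters:

% Sketch of the Ssnal outer loop for the lasso dual (illustrative only).
sigma = sigma0;
for k = 1:maxit
    y   = ssn_subproblem(y, x, sigma, A, b, lambda);  % Step 1: approximately minimize Psi_k
    z   = min(max(x/sigma - A'*y, -lambda), lambda);  % z^{k+1} = Prox_{p*/sigma_k}(x^k/sigma_k - A^*y^{k+1})
    x   = x - sigma*(A'*y + z);                       % Step 2: multiplier update (c = 0)
    v   = x - A'*(A*x - b);                           % evaluate the relative KKT residual of x
    eta = norm(x - sign(v).*max(abs(v)-lambda,0)) / (1 + norm(x) + norm(A*x - b));
    if eta < tol, break; end                          % stop when eta < epsilon
    sigma = min(gamma*sigma, sigma_max);              % sigma_k nondecreasing (sigma_k -> sigma_inf)
end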

Next, we shall adapt the results developed in [41, 42, 34] to establish the global and local superlinear convergence of our algorithm.


Since the inner problems cannot be solved exactly, we use the following standard stopping criterion studied in [41, 42] for approximately solving (18):
$$(\mathrm{A}) \quad \Psi_k(y^{k+1}, z^{k+1}) - \inf \Psi_k \le \varepsilon_k^2/(2\sigma_k), \qquad \sum_{k=0}^{\infty} \varepsilon_k < \infty.$$

Now we can state the global convergence of Algorithm Ssnal, which follows from [41, 42] without much difficulty.

Theorem 3.2. Suppose that Assumption 3.1 holds and that the solution set to (P) is nonempty. Let $\{(y^k, z^k, x^k)\}$ be the infinite sequence generated by Algorithm Ssnal with stopping criterion (A). Then the sequence $\{x^k\}$ is bounded and converges to an optimal solution of (P). In addition, $\{(y^k, z^k)\}$ is also bounded and converges to the unique optimal solution $(y^*, z^*) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$ of (D).

Proof. Since the solution set to (P) is assumed to be nonempty, the optimal value of (P) is finite. From Assumption 3.1(a), we have that $\mathrm{dom}\, h = Y$ and $h^*$ is strongly convex [43, Proposition 12.60]. Then, by Fenchel's duality theorem [40, Corollary 31.2.1], we know that the solution set to (D) is nonempty, and the optimal value of (D) is finite and equal to the optimal value of (P). That is, the solution set to the KKT system associated with (P) and (D) is nonempty. The uniqueness of the optimal solution $(y^*, z^*) \in \mathrm{int}(\mathrm{dom}\, h^*) \times X$ of (D) then follows directly from the strong convexity of $h^*$. By combining this uniqueness with [42, Theorem 4], one can easily obtain the boundedness of $\{(y^k, z^k)\}$ and the other desired results.

We need the following stopping criteria for the local convergence analysis:
$$(\mathrm{B1}) \quad \Psi_k(y^{k+1}, z^{k+1}) - \inf \Psi_k \le (\delta_k^2/(2\sigma_k))\|x^{k+1} - x^k\|^2, \qquad \sum_{k=0}^{\infty} \delta_k < +\infty,$$
$$(\mathrm{B2}) \quad \mathrm{dist}(0, \partial\Psi_k(y^{k+1}, z^{k+1})) \le (\delta'_k/\sigma_k)\|x^{k+1} - x^k\|, \qquad 0 \le \delta'_k \to 0,$$
where $x^{k+1} = x^k - \sigma_k(A^*y^{k+1} + z^{k+1} - c)$.

Theorem 3.3. Assume that Assumption 3.1 holds and that the solution set $\Omega$ to

(P) is nonempty. Suppose that $T_f$ satisfies the error bound condition (2) for the origin with modulus $a_f$. Let $\{(y^k, z^k, x^k)\}$ be any infinite sequence generated by Algorithm Ssnal with stopping criteria (A) and (B1). Then the sequence $\{x^k\}$ converges to some $x^* \in \Omega$ and, for all $k$ sufficiently large,
$$(19) \quad \mathrm{dist}(x^{k+1}, \Omega) \le \theta_k\, \mathrm{dist}(x^k, \Omega),$$
where $\theta_k := \big(a_f(a_f^2 + \sigma_k^2)^{-1/2} + 2\delta_k\big)(1 - \delta_k)^{-1} \to \theta_\infty := a_f(a_f^2 + \sigma_\infty^2)^{-1/2} < 1$ as $k \to +\infty$. Moreover, the sequence $\{(y^k, z^k)\}$ converges to the unique optimal solution $(y^*, z^*) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$ of (D).
Moreover, if $T_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with modulus $a_l$ and the stopping criterion (B2) is also used, then for all $k$ sufficiently large,
$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta'_k\|x^{k+1} - x^k\|,$$
where $\theta'_k := a_l(1 + \delta'_k)/\sigma_k$ with $\lim_{k\to\infty}\theta'_k = a_l/\sigma_\infty$.

Proof. The first part of the theorem follows from [34, Theorem 2.1], [42, Proposition 7 and Theorem 5], and Theorem 3.2. To prove the second part, we recall that


if $T_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with modulus $a_l$ and $(y^k, z^k, x^k) \to (y^*, z^*, x^*)$, then for all $k$ sufficiently large,
$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| + \mathrm{dist}(x^{k+1}, \Omega) \le a_l\, \mathrm{dist}(0, T_l(y^{k+1}, z^{k+1}, x^{k+1})).$$
Therefore, by the estimate (4.21) in [42] and the stopping criterion (B2), we obtain that for all $k$ sufficiently large,
$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le a_l(1 + \delta'_k)/\sigma_k\,\|x^{k+1} - x^k\|.$$
This completes the proof of Theorem 3.3.

Remark 3.4. In the above theorem, when $\sigma_\infty = +\infty$, the inequality (19) directly implies that $\{x^k\}$ converges Q-superlinearly. Moreover, recent advances in [12] also establish the asymptotic R-superlinear convergence of the dual iteration sequence $\{(y^k, z^k)\}$. Indeed, from [12, Proposition 4.1], under the same conditions as in Theorem 3.3, we have that for $k$ sufficiently large,
$$(20) \quad \|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta'_k\|x^{k+1} - x^k\| \le \theta'_k(1 - \delta_k)^{-1}\mathrm{dist}(x^k, \Omega),$$
where $\theta'_k(1 - \delta_k)^{-1} \to a_l/\sigma_\infty$. Thus, if $\sigma_\infty = \infty$, the Q-superlinear convergence of $\{x^k\}$ and (20) further imply that $\{(y^k, z^k)\}$ converges R-superlinearly.

We should emphasize here that, by combining Remarks 2.8 and 3.4 and Theorem 3.3, our Algorithm Ssnal is guaranteed to produce an asymptotically superlinearly convergent sequence when used to solve (D) for many commonly used regularizers and loss functions. In particular, the Ssnal algorithm is asymptotically superlinearly convergent when applied to the dual of (1).

3.2. Solving the augmented Lagrangian subproblems. Here we shall propose an efficient semismooth Newton algorithm to solve the inner subproblems in the augmented Lagrangian method (18). That is, for some fixed $\sigma > 0$ and $x \in X$, we aim to solve
$$(21) \quad \min_{y,z}\ \big\{\Psi(y,z) := \mathcal{L}_\sigma(y, z; x)\big\}.$$

Since $\Psi(\cdot,\cdot)$ is a strongly convex function, we have that for any $\alpha \in \Re$, the level set $\mathcal{L}_\alpha := \{(y,z) \in \mathrm{dom}\, h^* \times \mathrm{dom}\, p^* \mid \Psi(y,z) \le \alpha\}$ is a closed and bounded convex set. Moreover, problem (21) admits a unique optimal solution, denoted as $(\bar{y}, \bar{z}) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$.

Denote, for any $y \in Y$,
$$\psi(y) := \inf_z \Psi(y,z) = h^*(y) + p^*\big(\mathrm{Prox}_{p^*/\sigma}(x/\sigma - A^*y + c)\big) + \frac{1}{2\sigma}\big\|\mathrm{Prox}_{\sigma p}\big(x - \sigma(A^*y - c)\big)\big\|^2 - \frac{1}{2\sigma}\|x\|^2.$$

Therefore, if $(\bar{y}, \bar{z}) = \arg\min \Psi(y,z)$, then $(\bar{y}, \bar{z}) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$ can be computed simultaneously by
$$\bar{y} = \arg\min \psi(y), \qquad \bar{z} = \mathrm{Prox}_{p^*/\sigma}(x/\sigma - A^*\bar{y} + c).$$


Note that $\psi(\cdot)$ is strongly convex and continuously differentiable on $\mathrm{int}(\mathrm{dom}\, h^*)$ with
$$\nabla\psi(y) = \nabla h^*(y) - A\,\mathrm{Prox}_{\sigma p}\big(x - \sigma(A^*y - c)\big) \quad \forall y \in \mathrm{int}(\mathrm{dom}\, h^*).$$
Thus, $\bar{y}$ can be obtained by solving the following nonsmooth equation:
$$(22) \quad \nabla\psi(y) = 0, \quad y \in \mathrm{int}(\mathrm{dom}\, h^*).$$

Let $y \in \mathrm{int}(\mathrm{dom}\, h^*)$ be any given point. Since $h^*$ is a convex function with a locally Lipschitz continuous gradient on $\mathrm{int}(\mathrm{dom}\, h^*)$, the following operator is well defined:
$$\hat{\partial}^2\psi(y) := \partial(\nabla h^*)(y) + \sigma A\,\partial\mathrm{Prox}_{\sigma p}\big(x - \sigma(A^*y - c)\big)A^*,$$
where $\partial(\nabla h^*)(y)$ is the Clarke subdifferential of $\nabla h^*$ at $y$ [10], and $\partial\mathrm{Prox}_{\sigma p}(x - \sigma(A^*y - c))$ is the Clarke subdifferential of the Lipschitz continuous mapping $\mathrm{Prox}_{\sigma p}(\cdot)$ at $x - \sigma(A^*y - c)$. Note that from [10, Proposition 2.3.3 and Theorem 2.6.6], we know that
$$\partial^2\psi(y)\,(d) \subseteq \hat{\partial}^2\psi(y)\,(d) \quad \forall\, d \in Y,$$
where $\partial^2\psi(y)$ denotes the generalized Hessian of $\psi$ at $y$. Define

where part2ψ(y) denotes the generalized Hessian of ψ at y Define

(23) V = H + σAUAlowast

with H isin part2hlowast(y) and U isin partProxσp(x minus σ(Alowasty minus c)) Then we have V isin part2ψ(y)Since hlowast is a strongly convex function we know that H is symmetric positive definiteon Y and thus V is also symmetric positive definite on Y

Under the mild assumption that $\nabla h^*$ and $\mathrm{Prox}_{\sigma p}$ are strongly semismooth (the definition of which is given next), we can design a superlinearly convergent semismooth Newton method to solve the nonsmooth equation (22).

Definition 3.5 (semismoothness [35, 38, 45]). Let $F: \mathcal{O} \subseteq X \to Y$ be a locally Lipschitz continuous function on the open set $\mathcal{O}$. $F$ is said to be semismooth at $x \in \mathcal{O}$ if $F$ is directionally differentiable at $x$ and, for any $V \in \partial F(x + \Delta x)$ with $\Delta x \to 0$,
$$F(x + \Delta x) - F(x) - V\Delta x = o(\|\Delta x\|).$$
$F$ is said to be strongly semismooth at $x$ if $F$ is semismooth at $x$ and
$$F(x + \Delta x) - F(x) - V\Delta x = O(\|\Delta x\|^2).$$
$F$ is said to be a semismooth (respectively, strongly semismooth) function on $\mathcal{O}$ if it is semismooth (respectively, strongly semismooth) everywhere in $\mathcal{O}$.

It is widely known in the nonsmooth optimization/equation community that continuous piecewise affine functions and twice continuously differentiable functions are all strongly semismooth everywhere. In particular, $\mathrm{Prox}_{\|\cdot\|_1}$, as a Lipschitz continuous piecewise affine function, is strongly semismooth. See [14] for more semismooth and strongly semismooth functions.

Now we can design a semismooth Newton (Ssn) method to solve (22) as follows, and we can expect to obtain fast superlinear or even quadratic convergence.


Algorithm SSN: A semismooth Newton algorithm for solving (22) (SSN($y^0$, $x$, $\sigma$)).
Given $\mu \in (0, 1/2)$, $\eta \in (0,1)$, $\tau \in (0,1]$, and $\delta \in (0,1)$. Choose $y^0 \in \mathrm{int}(\mathrm{dom}\, h^*)$. Iterate the following steps for $j = 0, 1, \ldots$:
Step 1. Choose $H_j \in \partial(\nabla h^*)(y^j)$ and $U_j \in \partial\mathrm{Prox}_{\sigma p}(x - \sigma(A^*y^j - c))$. Let $V_j := H_j + \sigma A U_j A^*$. Solve the linear system
$$(24) \quad V_j d + \nabla\psi(y^j) = 0$$
exactly or by the conjugate gradient (CG) algorithm to find $d^j$ such that
$$\|V_j d^j + \nabla\psi(y^j)\| \le \min\big(\eta, \|\nabla\psi(y^j)\|^{1+\tau}\big).$$
Step 2. (Line search) Set $\alpha_j = \delta^{m_j}$, where $m_j$ is the first nonnegative integer $m$ for which
$$y^j + \delta^m d^j \in \mathrm{int}(\mathrm{dom}\, h^*) \quad \text{and} \quad \psi(y^j + \delta^m d^j) \le \psi(y^j) + \mu\delta^m\langle\nabla\psi(y^j), d^j\rangle.$$
Step 3. Set $y^{j+1} = y^j + \alpha_j d^j$.

The convergence results for the above Ssn algorithm are stated in the next theorem.

Theorem 3.6. Assume that $\nabla h^*(\cdot)$ and $\mathrm{Prox}_{\sigma p}(\cdot)$ are strongly semismooth on $\mathrm{int}(\mathrm{dom}\, h^*)$ and $X$, respectively. Let the sequence $\{y^j\}$ be generated by Algorithm Ssn. Then $\{y^j\}$ converges to the unique optimal solution $\bar{y} \in \mathrm{int}(\mathrm{dom}\, h^*)$ of the problem in (22), and
$$\|y^{j+1} - \bar{y}\| = O(\|y^j - \bar{y}\|^{1+\tau}).$$

Proof. Since, by [54, Proposition 3.3], $d^j$ is a descent direction, Algorithm Ssn is well defined. By (23), we know that for any $j \ge 0$, $V_j \in \hat{\partial}^2\psi(y^j)$. Then we can prove the conclusion of this theorem by following the proofs of [54, Theorems 3.4 and 3.5]. We omit the details here.

We shall now discuss the implementation of the stopping criteria (A), (B1), and (B2) when Algorithm Ssn is used to solve the subproblem (18) in Algorithm Ssnal. Note that when Algorithm Ssn is applied to minimize $\Psi_k(\cdot)$ to find
$$y^{k+1} = \mathrm{Ssn}(y^k, x^k, \sigma_k) \quad \text{and} \quad z^{k+1} = \mathrm{Prox}_{p^*/\sigma_k}(x^k/\sigma_k - A^*y^{k+1} + c),$$

we have by simple calculations and the strong convexity of hlowast that

Ψk(yk+1 zk+1)minus inf Ψk = ψk(yk+1)minus inf ψk le (12αh)nablaψk(yk+1)2

and (nablaψk(yk+1) 0) isin partΨk(yk+1 zk+1) where ψk(y) = infz Ψk(y z) for all y isin YTherefore the stopping criteria (A) (B1) and (B2) can be achieved by the followingimplementable criteria

(A′)  ‖∇ψ_k(y^{k+1})‖ ≤ √(α_h/σ_k) ε_k,  ∑_{k=0}^∞ ε_k < ∞;

(B1′) ‖∇ψ_k(y^{k+1})‖ ≤ √(α_h σ_k) δ_k ‖A*y^{k+1} + z^{k+1} − c‖,  ∑_{k=0}^∞ δ_k < +∞;

(B2′) ‖∇ψ_k(y^{k+1})‖ ≤ δ′_k ‖A*y^{k+1} + z^{k+1} − c‖,  0 ≤ δ′_k → 0.
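In code, these tests reduce to a few norm evaluations per outer iteration. The sketch below assumes, purely for illustration, that gradpsi, ynew, znew, sigk, alphah, epsk, delk, and delpk hold ∇ψ_k(y^{k+1}), y^{k+1}, z^{k+1}, σ_k, α_h, ε_k, δ_k, and δ′_k from the current Ssnal iteration, and that the scale factors √(α_h/σ_k) and √(α_h σ_k) follow the derivation above.

    % Implementable stopping tests (A'), (B1'), (B2') for the k-th ALM subproblem (a sketch).
    normg = norm(gradpsi);                                 % ||grad psi_k(y^{k+1})||
    feas  = norm(A' * ynew + znew - c);                    % ||A* y^{k+1} + z^{k+1} - c||
    okA   = normg <= sqrt(alphah / sigk) * epsk;           % criterion (A')
    okB1  = normg <= sqrt(alphah * sigk) * delk * feas;    % criterion (B1')
    okB2  = normg <= delpk * feas;                         % criterion (B2')
    stop  = okA && okB1 && okB2;                           % accept (y^{k+1}, z^{k+1}) when all hold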


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as ‖∇ψ_k(y^{k+1})‖ is sufficiently small.

3.3. An efficient implementation of SSN for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we shall discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer p is chosen to be λ‖·‖₁ for some λ > 0. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction d^j from the linear system (24). So we shall first discuss how this linear system can be solved.

Let (x̃, ȳ) ∈ ℝⁿ × ℝᵐ and σ > 0 be given. We consider the following Newton linear system:

(25) (H + σAUA^T) d = −∇ψ(ȳ),

where H ∈ ∂(∇h*)(ȳ), A denotes the matrix representation of A with respect to the standard bases of ℝⁿ and ℝᵐ, and U ∈ ∂Prox_{σλ‖·‖₁}(x̄) with x̄ := x̃ − σ(A^T ȳ − c). Since H is a symmetric and positive definite matrix, (25) can be equivalently rewritten as

(I_m + σ(L^{-1}A)U(L^{-1}A)^T)(L^T d) = −L^{-1}∇ψ(ȳ),

where L is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of H such that H = LL^T. In many applications, H is a sparse matrix. Indeed, when the function h in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrices H are in fact diagonal. That is, the costs of computing L and its inverse are negligible in most situations. Therefore, without loss of generality, we can consider a simplified version of (25) as follows:

(26) (I_m + σAUA^T) d = −∇ψ(ȳ),

which is precisely the Newton system associated with the standard lasso problem (1). Since U ∈ ℝ^{n×n} is a diagonal matrix, at first glance the costs of computing AUA^T and the matrix-vector multiplication AUA^T d for a given vector d ∈ ℝᵐ are O(m²n) and O(mn), respectively. These computational costs are too expensive when the dimensions of A are large and can make commonly employed approaches, such as the Cholesky factorization and the conjugate gradient method, inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of U is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level at which they are negligible, or at least insignificant compared to other costs. Next we shall show how this can be done by taking full advantage of the sparsity of U. This sparsity will be referred to as the second order sparsity of the underlying problem.

For x̄ = x̃ − σ(A^T ȳ − c), in our computations we can always choose U = Diag(u), the diagonal matrix whose ith diagonal element is given by u_i with

u_i = 0 if |x̄_i| ≤ σλ, and u_i = 1 otherwise,  i = 1, ..., n.

Since Prox_{σλ‖·‖₁}(x̄) = sign(x̄) ∘ max(|x̄| − σλ, 0), it is not difficult to see that U ∈ ∂Prox_{σλ‖·‖₁}(x̄). Let J := {j | |x̄_j| > σλ, j = 1, ..., n} and r := |J|, the cardinality of J. By taking the special 0–1 structure of U into consideration, we have that
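As a quick illustration, the choice of U and the index set J above amount to two lines of MATLAB, assuming xbar, sigma, and lambda hold x̄, σ, and λ (the variable names are illustrative):

    % A 0-1 generalized Jacobian element U = Diag(u) of Prox_{sigma*lambda*||.||_1} at xbar.
    u = double(abs(xbar) > sigma * lambda);   % u_i = 0 if |xbar_i| <= sigma*lambda, 1 otherwise
    J = find(u);                              % J = {j : |xbar_j| > sigma*lambda}; r = numel(J)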


[Fig. 1. Reducing the computational costs from O(m²n) to O(m²r).]

(27) AUA^T = (AU)(AU)^T = A_J A_J^T,

where A_J ∈ ℝ^{m×r} is the submatrix of A obtained by removing those columns whose indices are not in J. Then, by using (27), the costs of computing AUA^T and AUA^T d for a given vector d are reduced to O(m²r) and O(mr), respectively. Due to the sparsity-promoting property of the regularizer p, the number r is usually much smaller than n. Thus, by exploiting the aforementioned second order sparsity, we can greatly reduce the computational cost of solving the linear system (26) via the Cholesky factorization. More specifically, the total computational cost of using the Cholesky factorization to solve the linear system is reduced from O(m²(m+n)) to O(m²(m+r)). See Figure 1 for an illustration of this reduction. This means that even if n happens to be extremely large (say, larger than 10⁷), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization as long as both m and r are moderate (say, less than 10⁴). If, in addition, r ≪ m, which is often the case when m is large and the optimal solutions to the underlying problem are sparse, then instead of factorizing an m×m matrix we can make use of the Sherman–Morrison–Woodbury formula [22] to obtain the inverse of I_m + σAUA^T by inverting a much smaller r×r matrix as follows:

(I_m + σAUA^T)^{-1} = (I_m + σA_J A_J^T)^{-1} = I_m − A_J(σ^{-1} I_r + A_J^T A_J)^{-1} A_J^T.

See Figure 2 for an illustration of the computation of A_J^T A_J. In this case, the total computational cost of solving the Newton linear system (26) is reduced significantly further, from O(m²(m+r)) to O(r²(m+r)). We should emphasize here that this dramatic reduction in computational cost results from combining a careful examination of the second order sparsity present in lasso-type problems with some "smart" numerical linear algebra.
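A minimal sketch of this low-rank solve is given below; it assumes that A, J, sigma, and a right-hand side rhs (for instance −∇ψ(ȳ)) are available from the current Ssn iteration, and the variable names are illustrative.

    % Applying (I_m + sigma*A_J*A_J')^{-1} to rhs via the Sherman-Morrison-Woodbury identity.
    AJ = A(:, J);                           % m x r submatrix of A, r = numel(J)
    r  = size(AJ, 2);
    W  = (1/sigma) * eye(r) + AJ' * AJ;     % r x r matrix; O(r^2 m) to form (cf. Figure 2)
    d  = rhs - AJ * (W \ (AJ' * rhs));      % total cost O(r^2(m + r)) instead of O(m^2(m + r))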

From the above arguments, we can see that as long as the number of nonzero components of Prox_{σλ‖·‖₁}(x̄) is small, say less than √n, and H_j ∈ ∂(∇h*)(y^j) is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low cost. In particular, this is true for the lasso problems admitting sparse


[Fig. 2. Further reducing the computational costs to O(r²m).]

solutions. Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). One may argue that, even if the original problem has only sparse solutions, at certain stages one may still encounter the situation where the number of nonzero components of Prox_{σλ‖·‖₁}(x̄) is large. Our answer to this concern is simple. First, this phenomenon rarely occurs in practice, since we always start with a sparse feasible point, e.g., the zero vector. Second, even if at certain steps this phenomenon does occur, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter σ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with p(·) chosen to be λ‖·‖₁.

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two recent papers [18, 36], which appear to suggest that for some large-scale sparse reconstruction problems, mfIPM¹ and FPC_AS² have mostly outperformed the other solvers. Hence, in this section we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP³ [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparison, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB with the step length set to 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later, they lack the ability to efficiently solve difficult problems such as the large-scale regression problems where the data A is badly conditioned.

For testing purposes, the regularization parameter λ in the lasso problem (1) is chosen as

λ = λ_c ‖A*b‖_∞,

where 0 < λ_c < 1. In our numerical experiments, we measure the accuracy of an approximate optimal solution x for (1) by using the following relative KKT residual:

η = ‖x − prox_{λ‖·‖₁}(x − A*(Ax − b))‖ / (1 + ‖x‖ + ‖Ax − b‖).
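For concreteness, the parameter choice and the residual η can be evaluated as in the following MATLAB sketch, where lambda_c, the data (A, b), and the current iterate x are assumed to be given (the variable names are illustrative):

    % Regularization parameter and relative KKT residual used to terminate the solvers (a sketch).
    lambda = lambda_c * norm(A' * b, inf);                  % lambda = lambda_c * ||A* b||_inf
    res    = A * x - b;
    u      = x - A' * res;
    proxu  = sign(u) .* max(abs(u) - lambda, 0);            % prox_{lambda*||.||_1}(u)
    eta    = norm(x - proxu) / (1 + norm(x) + norm(res));   % relative KKT residual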

¹http://www.maths.ed.ac.uk/ERGO/mfipmcs/
²http://www.caam.rice.edu/~optimization/L1/FPC_AS/
³http://yelab.net/software/SLEP/


For a given tolerance ε > 0, we stop the tested algorithms when η < ε. For all the tests in this section, we set ε = 10⁻⁶. The algorithms are also stopped when they reach the maximum number of iterations (1000 iterations for our algorithm and mfIPM, and 20000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to the default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core Intel Xeon E5-2650 @ 2.60GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms with the test instances (A, b) obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in A (the matrix representation of A) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order 5 polynomial is used to generate the basis functions. This naming convention is also used in the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of AA*, which is denoted as λ_max(AA*).

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, m denotes the number of samples, n denotes the number of features, and "nnz" denotes the number of nonzeros in the solution x obtained by Ssnal using the following estimation:

nnz := min { k | ∑_{i=1}^{k} |x̂_i| ≥ 0.999 ‖x‖₁ },

where x̂ is obtained by sorting x such that |x̂_1| ≥ ··· ≥ |x̂_n|. An entry marked as "Error" indicates that the algorithm breaks down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and the entry "00" in the time column means that the computation time is less than 0.5 second.
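The nnz estimate above is, for example, a two-line computation in MATLAB, where x is assumed to hold the solution returned by Ssnal (the variable names are illustrative):

    % Smallest k such that the k largest-magnitude entries of x carry 99.9% of ||x||_1.
    xs      = sort(abs(x), 'descend');
    nnz_est = find(cumsum(xs) >= 0.999 * sum(xs), 1, 'first');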

One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances

Table 1
Statistics of the UCI test instances.

Probname             m; n               λ_max(AA*)
E2006.train          16087; 150348      1.91e+05
log1p.E2006.train    16087; 4265669     5.86e+07
E2006.test           3308; 72812        4.79e+04
log1p.E2006.test     3308; 1771946      1.46e+07
pyrim5               74; 169911         1.22e+06
triazines4           186; 557845        2.07e+07
abalone7             4177; 6435         5.21e+05
bodyfat7             252; 116280        5.29e+04
housing7             506; 77520         3.28e+05
mpg7                 392; 3432          1.28e+04
space_ga9            3107; 5005         4.01e+03


Table 2
The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy ε = 10⁻⁶). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM; for each instance (with its m and n) and each λ_c ∈ {10⁻³, 10⁻⁴}, the columns report nnz, the relative KKT residual η for a–f, and the computation time for a–f.


to the required accuracy after 20000 iterations or 7 hours. In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances; in fact, for 3 test instances it breaks down due to some internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with λ_c = 10⁻³, we can see that Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20000 iterations and consumes about 2 hours, but only produces a rather inaccurate solution.

On the other hand, one can observe that the two second order information based methods, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds (λ_c = 10⁻³). Between these two algorithms, Ssnal is clearly far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where it can be up to 300 times faster. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (those with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with an accuracy of 10⁻¹ when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point method in exploiting the sparsity in the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix A (the matrix representation of A) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter λ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when λ is not a scalar, based on the recommendation of the referee, we use another popular active-set based solver, PSSas⁴ [44], to replace FPC_AS for the testing. All the parameters for PSSas are set to the default values. In our tests, PSSas is stopped when it reaches the maximum number of 20000 iterations or the maximum computation time of 7 hours. We note that by default PSSas terminates when its progress is smaller than the threshold 10⁻⁹.

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions based on Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM type of solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy. In fact, PSSas fails on all the test instances when λ_c = 10⁻⁴. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with η = 3.2 × 10⁻³. (Actually, we also run PSSas on these test instances without scaling

⁴https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3
The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy ε = 10⁻⁶). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM; for each instance and each λ_c ∈ {10⁻³, 10⁻⁴}, the columns report nnz, the relative KKT residual η for each solver, and the computation time for each solver.


and obtain similar performance. Detailed results are omitted to conserve space.) Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for Sparco collection. In this subsection, the test instances (A, b) are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce a 60dB noise to b (as in [18]) by using the MATLAB command b = awgn(b, 60, 'measured'). For these test instances, the matrix representations of the linear maps A are not available. Hence, ADMM will not be tested in this subsection, as it would be extremely expensive, if not impossible, to compute and factorize I + σAA*.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of λ_c, i.e., λ_c = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}. As one can observe, as λ_c decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions (m, n) and the largest eigenvalue of AA* (λ_max(AA*)). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, as well as the computation times (in the format of hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well on these test instances with small Lipschitz constants λ_max(AA*). As such, Ssnal does not have as clear an advantage as that shown in Table 2. Moreover, since the matrix representations of the linear maps A and A* involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases, with λ_c = 10⁻³ and 10⁻⁴. We can observe that FPC_AS

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy ε = 10⁻⁶, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

λ_c  | nnz  | Iteration: a | b | c | d | e  | η: a | b | c | d | e                              | Time: a | b | c | d | e

srcsep1: m = 29166, n = 57344, λ_max(AA*) = 3.56
0.16 |  380 | 14 | 28 |  42 |  401 |  89 | 6.4e-7 | 9.0e-7 | 2.7e-8 | 5.3e-7 | 9.5e-7 | 07 | 09 | 05 | 15 | 07
0.10 |  726 | 16 | 38 |  42 |  574 | 161 | 4.7e-7 | 5.9e-7 | 2.1e-8 | 2.8e-7 | 9.6e-7 | 11 | 15 | 05 | 22 | 13
0.06 | 1402 | 19 | 41 |  63 |  801 | 393 | 1.5e-7 | 1.5e-7 | 5.4e-8 | 6.3e-7 | 9.9e-7 | 18 | 18 | 10 | 30 | 32
0.03 | 2899 | 19 | 56 | 110 |  901 | 337 | 2.7e-7 | 9.4e-7 | 7.3e-8 | 9.3e-7 | 9.9e-7 | 28 | 53 | 16 | 33 | 28
0.01 | 6538 | 17 | 88 | 223 | 1401 | 542 | 7.1e-7 | 5.7e-7 | 1.1e-7 | 9.9e-7 | 9.9e-7 | 1:21 | 2:15 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, λ_max(AA*) = 4.95
0.16 |  423 | 15 | 29 |  42 |  501 | 127 | 3.2e-7 | 2.1e-7 | 1.9e-8 | 4.9e-7 | 9.4e-7 | 14 | 13 | 07 | 25 | 15
0.10 |  805 | 16 | 37 |  84 |  601 | 212 | 8.8e-7 | 9.2e-7 | 2.6e-8 | 9.6e-7 | 9.7e-7 | 21 | 19 | 16 | 30 | 26
0.06 | 1549 | 19 | 40 |  84 |  901 | 419 | 1.4e-7 | 3.1e-7 | 5.4e-8 | 4.7e-7 | 9.9e-7 | 32 | 28 | 18 | 44 | 50
0.03 | 3254 | 20 | 69 | 128 |  901 | 488 | 1.3e-7 | 6.6e-7 | 8.9e-7 | 9.4e-7 | 9.9e-7 | 1:06 | 1:33 | 26 | 44 | 59
0.01 | 7400 | 21 | 94 | 259 | 2201 | 837 | 8.8e-7 | 4.0e-7 | 9.9e-7 | 8.8e-7 | 9.3e-7 | 1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy ε = 10⁻⁶, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

probname (m; n)          | λ_c  | nnz   | η: a | b | c | d | e                               | Time: a | b | c | d | e
blknheavi (1024; 1024)   | 10⁻³ |    12 | 5.7e-7 | 9.2e-7 | 1.3e-1 | 2.0e-6 | 9.9e-7 | 01 | 01 | 55 | 08 | 06
                         | 10⁻⁴ |    12 | 9.2e-8 | 8.7e-7 | 4.6e-3 | 8.5e-5 | 9.9e-7 | 01 | 01 | 49 | 07 | 07
srcsep1 (29166; 57344)   | 10⁻³ | 14066 | 1.6e-7 | 7.3e-7 | 9.7e-7 | 8.7e-7 | 9.7e-7 | 5:41 | 42:34 | 13:25 | 1:56 | 4:16
                         | 10⁻⁴ | 19306 | 9.8e-7 | 9.5e-7 | 9.9e-7 | 9.9e-7 | 9.5e-7 | 9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
srcsep2 (29166; 86016)   | 10⁻³ | 16502 | 3.9e-7 | 6.8e-7 | 9.9e-7 | 9.7e-7 | 9.8e-7 | 9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
                         | 10⁻⁴ | 22315 | 7.9e-7 | 9.5e-7 | 1.0e-3 | 9.6e-7 | 9.5e-7 | 19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
srcsep3 (196608; 196608) | 10⁻³ | 27314 | 6.1e-7 | 9.6e-7 | 9.9e-7 | 9.6e-7 | 9.9e-7 | 33 | 6:24 | 8:51 | 47 | 49
                         | 10⁻⁴ | 83785 | 9.7e-7 | 9.9e-7 | 9.9e-7 | 9.9e-7 | 9.9e-7 | 2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
soccer1 (3200; 4096)     | 10⁻³ |     4 | 1.8e-7 | 6.3e-7 | 5.2e-1 | 8.4e-7 | 9.9e-7 | 01 | 03 | 13:51 | 2:35 | 02
                         | 10⁻⁴ |     8 | 8.7e-7 | 4.3e-7 | 5.2e-1 | 3.3e-6 | 9.6e-7 | 01 | 02 | 13:23 | 3:07 | 02
soccer2 (3200; 4096)     | 10⁻³ |     4 | 3.4e-7 | 6.3e-7 | 5.0e-1 | 8.2e-7 | 9.9e-7 | 00 | 03 | 13:46 | 1:40 | 02
                         | 10⁻⁴ |     8 | 2.1e-7 | 1.4e-7 | 6.8e-1 | 1.8e-6 | 9.1e-7 | 01 | 03 | 13:27 | 3:07 | 02
blurrycam (65536; 65536) | 10⁻³ |  1694 | 1.9e-7 | 6.5e-7 | 3.6e-8 | 4.1e-7 | 9.4e-7 | 03 | 09 | 03 | 02 | 07
                         | 10⁻⁴ |  5630 | 1.0e-7 | 9.7e-7 | 1.3e-7 | 9.7e-7 | 9.9e-7 | 05 | 1:35 | 08 | 03 | 29
blurspike (16384; 16384) | 10⁻³ |  1954 | 3.1e-7 | 9.5e-7 | 7.4e-4 | 9.9e-7 | 9.9e-7 | 03 | 05 | 6:38 | 03 | 27
                         | 10⁻⁴ | 11698 | 3.5e-7 | 7.4e-7 | 8.3e-5 | 9.8e-7 | 9.9e-7 | 10 | 08 | 6:43 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when λ_c = 10⁻³ and 10⁻⁴, respectively. For a few cases, FPC_AS can only achieve a poor accuracy (10⁻¹). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of λ_max(AA*). For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi (m = n = 1024), whose corresponding λ_max(AA*) = 709. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to λ_max(AA*), one can safely conclude that the first order algorithms can only be used to solve problems with relatively small λ_max(AA*). Moreover, in order to obtain efficient and robust algorithms for lasso problems, or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotically superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving ℓ₁-regularized least squares (LS) problems. With the intelligent


incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale ℓ₁-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and the anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candès, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrmann, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intel. Syst. Tech., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B.-G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wächter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Software, 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C^{1,1} optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting Risk from Financial Reports with Regression, NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A Two-Phase Proximal Augmented Lagrangian Method for Convex Quadratic Semidefinite Programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k²), Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact Regularized Proximal Newton Method: Provable Convergence Guarantees for Non-Smooth Convex Minimization without Strong Convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.
[56] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.
[57] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.


Page 6: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

438 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

Definition 26 (second order sufficient condition) Let (y z) isin Y times X be anoptimal solution to problem (4) with M(y z) 6= empty We say that the second ordersufficient condition for problem (4) holds at (y z) if

〈d1 (nablahlowast)prime(y d1)〉 gt 0 forall 0 6= (d1 d2) isin C(y z)

By building on the proof ideas from the literature on nonlinear programmingproblems [11 27 24] we are able to prove the following result on the metric subreg-ularity of Tl This allows us to prove the linear and even the asymptotic superlinearconvergence of the sequences generated by the Ssnal algorithm to be presented in thenext section even when the objective in problem (3) does not possess the piecewiselinear-quadratic structure as in the lasso problem (1)

Theorem 27 Assume that the KKT system (6) has at least one solution anddenote it as (x y z) Suppose that Assumption 25 holds and that the second ordersufficient condition for problem (4) holds at (y z) Then Tl is metrically subregularat (y z x) for the origin

Proof First we claim that there exists a neighborhood U of (x y z) such thatfor any w = (u1 u2 v) isin W = Y times X times X with w sufficiently small any solution(x(w) y(w) z(w)) isin U of the perturbed KKT system

(8) 0 = nablahlowast(y)minusAxminus u1 0 isin minusxminus u2 + partplowast(z) 0 = Alowasty + z minus cminus v

satisfies the following estimate

(9) (y(w) z(w))minus (y z) = O(w)

For the sake of contradiction suppose that our claim is not true Then there ex-ist some sequences wk= (uk1 u

k2 v

k) and (xk yk zk) such that wk rarr 0 and(xk yk zk)rarr (x y z) for each k the point (xk yk zk) is a solution of (8) for w = wkand

(yk zk)minus (y z) gt γkwk

with some γk gt 0 such that γk rarrinfin Passing to a subsequence if necessary we assumethat

((yk zk) minus (y z)

)(yk zk) minus (y z) converges to some ξ = (ξ1 ξ2) isin Y times X

ξ = 1 Then setting tk = (yk zk) minus (y z) and passing to a subsequence furtherif necessary by the local Lipschitz continuity and the directional differentiability ofnablahlowast(middot) at y we know that for all k sufficiently large

nablahlowast(yk)minusnablahlowast(y) =nablahlowast(y + tkξ1)minusnablahlowast(y) +nablahlowast(yk)minusnablahlowast(y + tkξ1)

= tk(nablahlowast)prime(y ξ1) + o(tk) +O(yk minus y minus tkξ1)

= tk(nablahlowast)prime(y ξ1) + o(tk)

Denote for each k xk = xk + uk2 Simple calculations show that for all k sufficientlylarge

(10) 0 = nablahlowast(yk)minusnablahlowast(y)minusA(xkminusx)+Auk2minusuk1 = tk(nablahlowast)prime(y ξ1)+o(tk)minusA(xkminusx)

and

(11) 0 = A(yk minus y) + (zk minus z)minus vk = tk(Alowastξ1 + ξ2) + o(tk)

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 439

Dividing both sides of (11) by tk and then taking limits we obtain

(12) Alowastξ1 + ξ2 = 0

which further implies that

(13) 〈nablahlowast(y) ξ1〉 = 〈Ax ξ1〉 = minus〈x ξ2〉

Since xk isin partplowast(zk) we know that for all k sufficiently large

zk = z + tkξ2 + o(tk) isin dom plowast

That is ξ2 isin Tdom plowast(z)According to the structure of plowast we separate our discussions into two cases

Case I There exists a nonempty polyhedral convex set P such that plowast(z) =δlowastP (z) for all z isin X Then for each k it holds that

xk = ΠP (zk + xk)

By [14 Theorem 411] we have

(14) xk = ΠP (zk+ xk) = ΠP (z+ x)+ΠC(zkminus z+ xkminus x) = x+ΠC(z

kminus z+ xkminus x)

where C is the critical cone of P at z + x ie

C equiv CP (z + x) = TP (x) cap zperp

Since C is a polyhedral cone we know from [14 Proposition 414] that ΠC(middot) is apiecewise linear function ie there exist a positive integer l and orthogonal projectorsB1 Bl such that for any x isin X

ΠC(x) isin B1x Blx

By restricting to a subsequence if necessary we may further assume that there exists1 le jprime le l such that for all k ge 1

(15) ΠC(zk minus z + xk minus x) = Bjprime(z

k minus z + xk minus x) = ΠRange(Bjprime )(zk minus z + xk minus x)

Denote L = Range(Bjprime) Combining (14) and (15) we get

C cap L 3 (xk minus x) perp (zk minus z) isin C cap Lperp

where C is the polar cone of C Since xk minus x isin C we have 〈xk minus x z〉 = 0 whichtogether with 〈xkminus x zkminus z〉 = 0 implies 〈xkminus x zk〉 = 0 Thus for all k sufficientlylarge

〈zk xk〉 = 〈zk x〉 = 〈z + tkξ2 + o(tk) x〉

and it follows that

(16)

(plowast)prime(z ξ2) = limkrarrinfin

δlowastP (z + tkξ2)minus δlowastP (z)

tk= limkrarrinfin

δlowastP (zk)minus δlowastP (z)

tk

= limkrarrinfin

〈zk xk〉 minus 〈z x〉tk

= 〈x ξ2〉

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

440 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

By (12) (13) and (16) we know that (ξ1 ξ2) isin C(y z) Since C cap L is a polyhedralconvex cone we know from [40 Theorem 193] that A(C cap L) is also a polyhedralconvex cone which together with (10) implies

(nablahlowast)prime(y ξ1) isin A(C cap L)

Therefore there exists a vector η isin C cap L such that

〈ξ1 (nablahlowast)prime(y ξ1)〉 = 〈ξ1 Aη〉 = minus〈ξ2 η〉 = 0

where the last equality follows from the fact that ξ2 isin C cap Lperp Note that the lastinclusion holds since the polyhedral convex cone C cap Lperp is closed and

ξ2 = limkrarrinfin

zk minus ztk

isin C cap Lperp

As 0 6= ξ = (ξ1 ξ2) isin C(y z) but 〈ξ1 (nablahlowast)prime(y ξ1)〉 = 0 this contradicts the assump-tion that the second order sufficient condition holds for (4) at (y z) Thus we haveproved our claim for Case I

Case II There exists a nonempty polyhedral convex set P such that plowast(z) =δP (z) for all z isin X Then we know that for each k

zk = ΠP (zk + xk)

Since (δP )prime(z d2) = 0 for all d2 isin Tdom(plowast)(z) the critical cone in (7) now takes thefollowing form

C(y z) =

(d1 d2) isin Y times X | Alowastd1 + d2 = 0 〈nablahlowast(y) d1〉 = 0 d2 isin Tdom(plowast)(z)

Similar to Case I without loss of generality we can assume that there exists a subspaceL such that for all k ge 1

C cap L 3 (zk minus z) perp (xk minus x) isin C cap Lperp

whereC equiv CP (z + x) = TP (z) cap xperp

Since C cap L is a polyhedral convex cone we know

ξ2 = limkrarrinfin

zk minus ztk

isin C cap L

and consequently 〈ξ2 x〉 = 0 which together with (12) and (13) implies ξ =(ξ1 ξ2) isin C(y z) By (10) and the fact that A(C cap Lperp) is a polyhedral convexcone we know that

(nablahlowast)prime(y ξ1) isin A(C cap Lperp)

Therefore there exists a vector η isin C cap Lperp such that

〈ξ1 (nablahlowast)prime(y ξ1)〉 = 〈ξ1 Aη〉 = minus〈ξ2 η〉 = 0

Since ξ = (ξ1 ξ2) 6= 0 we arrive at a contradiction to the assumed second ordersufficient condition So our claim is also true for this case

In summary we have proven that there exists a neighborhood U of (x y z) suchthat for any w close enough to the origin (9) holds for any solution (x(w) y(w)

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 441

z(w)) isin U to the perturbed KKT system (8) Next we show that Tl is metricallysubregular at (y z x) for the origin

Define the mapping ΘKKT X times Y times X timesW rarr Y timesX times X by

ΘKKT (x y z w) =

nablahlowast(y)minusAxminus u1z minus Proxplowast(z + x+ u2)Alowasty + z minus cminus v

forall(x y z w) isin X times Y times X timesW

and define the mapping θ X rarr Y times X times X as follows

θ(x) = ΘKKT(x y z 0) forallx isin X

Then we have x isin M(y z) if and only if θ(x) = 0 Since Proxplowast(middot) is a piecewiseaffine function θ(middot) is a piecewise affine function and thus a polyhedral multifunctionBy using Proposition 22 and shrinking the neighborhood U if necessary for any wclose enough to the origin and any solution (x(w) y(w) z(w)) isin U of the perturbedKKT system (8) we have

dist(x(w)M(y z)) = O(θ(x(w)))

= O(ΘKKT(x(w) y z 0)minusΘKKT(x(w) y(w) z(w) w))

= O(w+ (y(w) z(w))minus (y z))

which together with (9) implies the existence of a constant κ ge 0 such that

(17) (y(w) z(w))minus (y z)+ dist(x(w)M(y z)) le κw

Thus by Definition 23 we have proven that Tl is metrically subregular at (y z x)for the origin The proof of the theorem is complete

Remark 28 For convex piecewise linear-quadratic programming problems suchas the `1 and elastic net regularized LS problem we know from J Sunrsquos thesis [46]on the characterization of the subdifferentials of convex piecewise linear-quadraticfunctions (see also [43 Proposition 1230]) that the corresponding operators Tl and Tfare polyhedral multifunctions and thus by Proposition 22 the error bound conditionholds Here we emphasize again that the error bound condition holds for the lassoproblem (1) Moreover Tf associated with the `1 or elastic net regularized logisticregression model ie for a given vector b isin ltm the loss function h ltm rarr lt inproblem (3) takes the form

h(y) =

msumi=1

log(1 + eminusbiyi) forally isin ltm

also satisfies the error bound condition [33 47] Meanwhile from Theorem 27 weknow that Tl corresponding to the `1 or the elastic net regularized logistic regressionmodel is metrically subregular at any solutions to the KKT system (6) for the origin

3 An augmented Lagrangian method with asymptotic superlinear con-vergence Recall the general convex composite model (3)

(P) max minus (f(x) = h(Ax)minus 〈c x〉+ p(x))

and its dual(D) minhlowast(y) + plowast(z) | Alowasty + z = c

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

442 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

In this section we shall propose an asymptotically superlinearly convergent aug-mented Lagrangian method for solving (P) and (D) In this section we make thefollowing standing assumptions regarding the function h

Assumption 31(a) h Y rarr lt is a convex differentiable function whose gradient is 1αh-Lipschitz

continuous ie

nablah(yprime)minusnablah(y) le (1αh)yprime minus y forallyprime y isin Y

(b) h is essentially locally strongly convex [21] ie for any compact and convexset K sub dom parth there exists βK gt 0 such that

(1minusλ)h(yprime) +λh(y) ge h((1minusλ)yprime+λy) +1

2βKλ(1minusλ)yprimeminusy2 forallyprime y isin K

for all λ isin [0 1]

Many commonly used loss functions in the machine learning literature satisfy theabove mild assumptions For example h can be the loss function in the linear regres-sion logistic regression and Poisson regression models While the strict convexity of aconvex function is closely related to the differentiability of its conjugate the essentiallocal strong convexity in Assumption 31(b) for a convex function was first proposedin [21] to obtain a characterization of the local Lipschitz continuity of the gradient ofits conjugate function

The aforementioned assumptions on h imply the following useful properties of hlowastFirst by [43 Proposition 1260] we know that hlowast is strongly convex with modulusαh Second by [21 Corollary 44] we know that hlowast is essentially smooth and nablahlowast islocally Lipschitz continuous on int (domhlowast) If the solution set to the KKT systemassociated with (P) and (D) is further assumed to be nonempty similar to what wehave discussed in the last section one only needs to focus on int (domhlowast)times X whensolving (D) Given σ gt 0 the augmented Lagrangian function associated with (D) isgiven by

Lσ(y zx) = l(y z x) +σ

2Alowasty + z minus c2 forall (y z x) isin Y times X times X

where the Lagrangian function l(middot) is defined in (5)

31 SSNAL A semismooth Newton augmented Lagrangian algorithmfor (D) The detailed steps of our algorithm Ssnal for solving (D) are given as fol-lows Since a semismooth Newton method will be employed to solve the subproblemsinvolved in this prototype of the inexact augmented Lagrangian method [42] it isnatural for us to call our algorithm Ssnal

Algorithm SSNAL An inexact augmented Lagrangian method for (D)

Let σ0 gt 0 be a given parameter Choose (y0 z0 x0) isin int(domhlowast) times dom plowast times X For k = 0 1 perform the following steps in each iterationStep 1 Compute

(18) (yk+1 zk+1) asymp arg minΨk(y z) = Lσk(y zxk)

Step 2 Compute xk+1 = xk minus σk(Alowastyk+1 + zk+1 minus c) and update σk+1 uarr σinfin le infin

Next we shall adapt the results developed in [41 42 34] to establish the globaland local superlinear convergence of our algorithm

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 443

Since the inner problems cannot be solved exactly we use the following standardstopping criterion studied in [41 42] for approximately solving (18)

(A) Ψk(yk+1 zk+1)minus inf Ψk le ε2k2σkinfinsumk=0

εk ltinfin

Now we can state the global convergence of Algorithm Ssnal from [41 42] withoutmuch difficulty

Theorem 32 Suppose that Assumption 31 holds and that the solution set to(P) is nonempty Let (yk zk xk) be the infinite sequence generated by AlgorithmSsnal with stopping criterion (A) Then the sequence xk is bounded and convergesto an optimal solution of (P) In addition (yk zk) is also bounded and convergesto the unique optimal solution (ylowast zlowast) isin int(domhlowast)times dom plowast of (D)

Proof Since the solution set to (P) is assumed to be nonempty the optimal valueof (P) is finite From Assumption 31(a) we have that domh = Y and hlowast is stronglyconvex [43 Proposition 1260] Then by Fenchelrsquos duality theorem [40 Corollary3121] we know that the solution set to (D) is nonempty and the optimal valueof (D) is finite and equal to the optimal value of (P) That is the solution set tothe KKT system associated with (P) and (D) is nonempty The uniqueness of theoptimal solution (ylowast zlowast) isin int(domhlowast) times X of (D) then follows directly from thestrong convexity of hlowast By combining this uniqueness with [42 Theorem 4] one caneasily obtain the boundedness of (yk zk) and other desired results readily

We need the the following stopping criteria for the local convergence analysis

(B1) $\quad \Psi_k(y^{k+1}, z^{k+1}) - \inf \Psi_k \le (\delta_k^2/2\sigma_k)\|x^{k+1} - x^k\|^2, \qquad \sum_{k=0}^{\infty} \delta_k < +\infty;$

(B2) $\quad \mathrm{dist}(0, \partial\Psi_k(y^{k+1}, z^{k+1})) \le (\delta_k'/\sigma_k)\|x^{k+1} - x^k\|, \qquad 0 \le \delta_k' \to 0,$

where $x^{k+1} = x^k - \sigma_k(\mathcal{A}^*y^{k+1} + z^{k+1} - c)$.

Theorem 3.3. Assume that Assumption 3.1 holds and that the solution set $\Omega$ to (P) is nonempty. Suppose that $\mathcal{T}_f$ satisfies the error bound condition (2) for the origin with modulus $a_f$. Let $\{(y^k, z^k, x^k)\}$ be any infinite sequence generated by Algorithm Ssnal with stopping criteria (A) and (B1). Then the sequence $\{x^k\}$ converges to $x^* \in \Omega$ and, for all $k$ sufficiently large,

(19) $\quad \mathrm{dist}(x^{k+1}, \Omega) \le \theta_k\,\mathrm{dist}(x^k, \Omega),$

where $\theta_k = \big(a_f(a_f^2 + \sigma_k^2)^{-1/2} + 2\delta_k\big)(1 - \delta_k)^{-1} \to \theta_\infty = a_f(a_f^2 + \sigma_\infty^2)^{-1/2} < 1$ as $k \to +\infty$. Moreover, the sequence $\{(y^k, z^k)\}$ converges to the unique optimal solution $(y^*, z^*) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$ of (D).

Moreover, if $\mathcal{T}_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with modulus $a_l$ and the stopping criterion (B2) is also used, then for all $k$ sufficiently large,

$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta_k'\|x^{k+1} - x^k\|,$$

where $\theta_k' = a_l(1 + \delta_k')/\sigma_k$ with $\lim_{k\to\infty} \theta_k' = a_l/\sigma_\infty$.

Proof. The first part of the theorem follows from [34, Theorem 2.1], [42, Proposition 7 and Theorem 5], and Theorem 3.2. To prove the second part, we recall that


if $\mathcal{T}_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with the modulus $a_l$ and $(y^k, z^k, x^k) \to (y^*, z^*, x^*)$, then for all $k$ sufficiently large,

$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| + \mathrm{dist}(x^{k+1}, \Omega) \le a_l\,\mathrm{dist}\big(0, \mathcal{T}_l(y^{k+1}, z^{k+1}, x^{k+1})\big).$$

Therefore, by the estimate (4.21) in [42] and the stopping criterion (B2), we obtain that for all $k$ sufficiently large,

$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \frac{a_l(1 + \delta_k')}{\sigma_k}\|x^{k+1} - x^k\|.$$

This completes the proof of Theorem 3.3.

Remark 3.4. In the above theorem, when $\sigma_\infty = +\infty$, the inequality (19) directly implies that $\{x^k\}$ converges Q-superlinearly. Moreover, recent advances in [12] also established the asymptotic R-superlinear convergence of the dual iteration sequence $\{(y^k, z^k)\}$. Indeed, from [12, Proposition 4.1], under the same conditions of Theorem 3.3, we have that for $k$ sufficiently large,

(20) $\quad \|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta_k'\|x^{k+1} - x^k\| \le \theta_k'(1 - \delta_k)^{-1}\mathrm{dist}(x^k, \Omega),$

where $\theta_k'(1 - \delta_k)^{-1} \to a_l/\sigma_\infty$. Thus, if $\sigma_\infty = \infty$, the Q-superlinear convergence of $\{x^k\}$ and (20) further imply that $\{(y^k, z^k)\}$ converges R-superlinearly.

We should emphasize here that, by combining Remarks 2.8 and 3.4 and Theorem 3.3, our Algorithm Ssnal is guaranteed to produce an asymptotically superlinearly convergent sequence when used to solve (D) for many commonly used regularizers and loss functions. In particular, the Ssnal algorithm is asymptotically superlinearly convergent when applied to the dual of (1).

3.2. Solving the augmented Lagrangian subproblems. Here we shall propose an efficient semismooth Newton algorithm to solve the inner subproblems in the augmented Lagrangian method (18). That is, for some fixed $\sigma > 0$ and $x \in \mathcal{X}$, we aim to solve

(21) $\quad \min_{y,z}\ \Psi(y,z) := \mathcal{L}_\sigma(y,z;x).$

Since $\Psi(\cdot,\cdot)$ is a strongly convex function, we have that for any $\alpha \in \Re$, the level set $L_\alpha = \{(y,z) \in \mathrm{dom}\, h^* \times \mathrm{dom}\, p^* \mid \Psi(y,z) \le \alpha\}$ is a closed and bounded convex set. Moreover, problem (21) admits a unique optimal solution, denoted as $(\bar{y}, \bar{z}) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$.

Denote, for any $y \in \mathcal{Y}$,

$$\psi(y) := \inf_z \Psi(y,z) = h^*(y) + p^*\big(\mathrm{Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^*y + c)\big) + \frac{1}{2\sigma}\big\|\mathrm{Prox}_{\sigma p}\big(x - \sigma(\mathcal{A}^*y - c)\big)\big\|^2 - \frac{1}{2\sigma}\|x\|^2.$$

Therefore, if $(\bar{y}, \bar{z}) = \arg\min \Psi(y,z)$, then $(\bar{y}, \bar{z}) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$ can be computed simultaneously by

$$\bar{y} = \arg\min \psi(y), \qquad \bar{z} = \mathrm{Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^*\bar{y} + c).$$
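For the lasso specialization $p = \lambda\|\cdot\|_1$, the conjugate $p^*$ is the indicator of the $\ell_\infty$-ball of radius $\lambda$, so $\mathrm{Prox}_{p^*/\sigma}$ is a componentwise projection independent of $\sigma$. A minimal MATLAB sketch of the recovery of $\bar{z}$ from $\bar{y}$ (the helper name recover_z and the explicit dense matrix A are our own illustrative assumptions):

    % Recover zbar from ybar when p = lambda*||.||_1: project onto {z : ||z||_inf <= lambda}.
    function zbar = recover_z(ybar, x, sigma, A, c, lambda)
        u    = x/sigma - A'*ybar + c;          % argument of Prox_{p^*/sigma}
        zbar = max(min(u, lambda), -lambda);   % componentwise projection onto [-lambda, lambda]
    end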


Note that $\psi(\cdot)$ is strongly convex and continuously differentiable on $\mathrm{int}(\mathrm{dom}\, h^*)$ with

$$\nabla\psi(y) = \nabla h^*(y) - \mathcal{A}\,\mathrm{Prox}_{\sigma p}\big(x - \sigma(\mathcal{A}^*y - c)\big) \qquad \forall\, y \in \mathrm{int}(\mathrm{dom}\, h^*).$$

Thus, $\bar{y}$ can be obtained via solving the following nonsmooth equation:

(22) $\quad \nabla\psi(y) = 0, \qquad y \in \mathrm{int}(\mathrm{dom}\, h^*).$

Let $y \in \mathrm{int}(\mathrm{dom}\, h^*)$ be any given point. Since $h^*$ is a convex function with a locally Lipschitz continuous gradient on $\mathrm{int}(\mathrm{dom}\, h^*)$, the following operator is well defined:

$$\hat{\partial}^2\psi(y) := \partial(\nabla h^*)(y) + \sigma\mathcal{A}\,\partial\mathrm{Prox}_{\sigma p}\big(x - \sigma(\mathcal{A}^*y - c)\big)\mathcal{A}^*,$$

where $\partial(\nabla h^*)(y)$ is the Clarke subdifferential of $\nabla h^*$ at $y$ [10], and $\partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y - c))$ is the Clarke subdifferential of the Lipschitz continuous mapping $\mathrm{Prox}_{\sigma p}(\cdot)$ at $x - \sigma(\mathcal{A}^*y - c)$. Note that from [10, Proposition 2.3.3 and Theorem 2.6.6], we know that

$$\partial^2\psi(y)\,d \subseteq \hat{\partial}^2\psi(y)\,d \qquad \forall\, d \in \mathcal{Y},$$

where $\partial^2\psi(y)$ denotes the generalized Hessian of $\psi$ at $y$. Define

(23) $\quad V := H + \sigma\mathcal{A}U\mathcal{A}^*$

with $H \in \partial^2 h^*(y)$ and $U \in \partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y - c))$. Then we have $V \in \hat{\partial}^2\psi(y)$. Since $h^*$ is a strongly convex function, we know that $H$ is symmetric positive definite on $\mathcal{Y}$, and thus $V$ is also symmetric positive definite on $\mathcal{Y}$.

Under the mild assumption that $\nabla h^*$ and $\mathrm{Prox}_{\sigma p}$ are strongly semismooth (the definition of which is given next), we can design a superlinearly convergent semismooth Newton method to solve the nonsmooth equation (22).

Definition 3.5 (semismoothness [35, 38, 45]). Let $F: O \subseteq \mathcal{X} \to \mathcal{Y}$ be a locally Lipschitz continuous function on the open set $O$. $F$ is said to be semismooth at $x \in O$ if $F$ is directionally differentiable at $x$ and, for any $V \in \partial F(x + \Delta x)$ with $\Delta x \to 0$,
$$F(x + \Delta x) - F(x) - V\Delta x = o(\|\Delta x\|).$$
$F$ is said to be strongly semismooth at $x$ if $F$ is semismooth at $x$ and
$$F(x + \Delta x) - F(x) - V\Delta x = O(\|\Delta x\|^2).$$
$F$ is said to be a semismooth (respectively, strongly semismooth) function on $O$ if it is semismooth (respectively, strongly semismooth) everywhere in $O$.

Note that it is widely known in the nonsmooth optimization/equation community that continuous piecewise affine functions and twice continuously differentiable functions are all strongly semismooth everywhere. In particular, $\mathrm{Prox}_{\|\cdot\|_1}$, as a Lipschitz continuous piecewise affine function, is strongly semismooth. See [14] for more semismooth and strongly semismooth functions.

Now we can design a semismooth Newton (Ssn) method to solve (22) as follows, and could expect to get a fast superlinear or even quadratic convergence.


Algorithm SSN: A semismooth Newton algorithm for solving (22) (SSN($y^0, x, \sigma$)).

Given $\mu \in (0, 1/2)$, $\eta \in (0,1)$, $\tau \in (0,1]$, and $\delta \in (0,1)$. Choose $y^0 \in \mathrm{int}(\mathrm{dom}\, h^*)$. Iterate the following steps for $j = 0, 1, \ldots$.
Step 1. Choose $H_j \in \partial(\nabla h^*)(y^j)$ and $U_j \in \partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^*y^j - c))$. Let $V_j := H_j + \sigma\mathcal{A}U_j\mathcal{A}^*$. Solve the linear system

(24) $\quad V_j d + \nabla\psi(y^j) = 0$

exactly or by the conjugate gradient (CG) algorithm to find $d^j$ such that

$$\|V_j d^j + \nabla\psi(y^j)\| \le \min\big(\eta, \|\nabla\psi(y^j)\|^{1+\tau}\big).$$

Step 2. (Line search) Set $\alpha_j = \delta^{m_j}$, where $m_j$ is the first nonnegative integer $m$ for which

$$y^j + \delta^m d^j \in \mathrm{int}(\mathrm{dom}\, h^*) \quad \text{and} \quad \psi(y^j + \delta^m d^j) \le \psi(y^j) + \mu\delta^m\langle\nabla\psi(y^j), d^j\rangle.$$

Step 3. Set $y^{j+1} = y^j + \alpha_j d^j$.
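For concreteness, the following MATLAB sketch instantiates the above steps for the lasso case, where $h$ is the squared loss so that $\nabla h^*(y) = y + b$, $H_j = I_m$, and $\mathrm{dom}\, h^* = \Re^m$ (so the domain check in Step 2 is vacuous); the function name ssn_lasso, the fixed parameter values, and the direct backslash solve in place of CG are our own illustrative choices.

    % Minimal sketch of Algorithm SSN for the lasso case (illustrative only).
    function y = ssn_lasso(y, x, sigma, A, b, c, lambda, tol)
        m = size(A, 1);
        mu = 1e-4; delta = 0.5;                                % line search parameters
        prox_l1 = @(v, t) sign(v) .* max(abs(v) - t, 0);       % Prox_{t*||.||_1}
        psi = @(y) 0.5*norm(y)^2 + b'*y ...                    % h^*(y) for the squared loss
              + norm(prox_l1(x - sigma*(A'*y - c), sigma*lambda))^2/(2*sigma) ...
              - norm(x)^2/(2*sigma);
        for j = 1:100
            xtil = x - sigma*(A'*y - c);
            gpsi = (y + b) - A*prox_l1(xtil, sigma*lambda);    % grad psi(y)
            if norm(gpsi) <= tol, break; end
            J  = abs(xtil) > sigma*lambda;                     % support of U_j
            AJ = A(:, J);
            V  = speye(m) + sigma*(AJ*AJ');                    % V_j = H_j + sigma*A*U_j*A'
            d  = -V \ gpsi;                                    % Newton direction from (24)
            alpha = 1;                                         % Step 2: backtracking line search
            while psi(y + alpha*d) > psi(y) + mu*alpha*(gpsi'*d)
                alpha = delta*alpha;
            end
            y = y + alpha*d;
        end
    end

The direct solve of (24) above can be replaced by the sparsity-exploiting procedure described in section 3.3.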

The convergence results for the above Ssn algorithm are stated in the next theorem.

Theorem 3.6. Assume that $\nabla h^*(\cdot)$ and $\mathrm{Prox}_{\sigma p}(\cdot)$ are strongly semismooth on $\mathrm{int}(\mathrm{dom}\, h^*)$ and $\mathcal{X}$, respectively. Let the sequence $\{y^j\}$ be generated by Algorithm Ssn. Then $\{y^j\}$ converges to the unique optimal solution $\bar{y} \in \mathrm{int}(\mathrm{dom}\, h^*)$ of the problem in (22), and

$$\|y^{j+1} - \bar{y}\| = O(\|y^j - \bar{y}\|^{1+\tau}).$$

Proof. Since, by [54, Proposition 3.3], $d^j$ is a descent direction, Algorithm Ssn is well defined. By (23), we know that for any $j \ge 0$, $V_j \in \hat{\partial}^2\psi(y^j)$. Then we can prove the conclusion of this theorem by following the proofs of [54, Theorems 3.4 and 3.5]. We omit the details here.

We shall now discuss the implementations of stopping criteria (A), (B1), and (B2) for Algorithm Ssn to solve the subproblem (18) in Algorithm Ssnal. Note that when Algorithm Ssn is applied to minimize $\Psi_k(\cdot)$ to find

$$y^{k+1} = \mathrm{Ssn}(y^k, x^k, \sigma_k) \quad \text{and} \quad z^{k+1} = \mathrm{Prox}_{p^*/\sigma_k}(x^k/\sigma_k - \mathcal{A}^*y^{k+1} + c),$$

we have, by simple calculations and the strong convexity of $h^*$, that

$$\Psi_k(y^{k+1}, z^{k+1}) - \inf \Psi_k = \psi_k(y^{k+1}) - \inf \psi_k \le (1/2\alpha_h)\|\nabla\psi_k(y^{k+1})\|^2$$

and $(\nabla\psi_k(y^{k+1}), 0) \in \partial\Psi_k(y^{k+1}, z^{k+1})$, where $\psi_k(y) := \inf_z \Psi_k(y,z)$ for all $y \in \mathcal{Y}$. Therefore, the stopping criteria (A), (B1), and (B2) can be achieved by the following implementable criteria:

(A') $\quad \|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h/\sigma_k}\,\varepsilon_k, \qquad \sum_{k=0}^{\infty} \varepsilon_k < \infty;$

(B1') $\quad \|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h\sigma_k}\,\delta_k\|\mathcal{A}^*y^{k+1} + z^{k+1} - c\|, \qquad \sum_{k=0}^{\infty} \delta_k < +\infty;$

(B2') $\quad \|\nabla\psi_k(y^{k+1})\| \le \delta_k'\|\mathcal{A}^*y^{k+1} + z^{k+1} - c\|, \qquad 0 \le \delta_k' \to 0.$


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as $\|\nabla\psi_k(y^{k+1})\|$ is sufficiently small.
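In an implementation, these tests therefore reduce to comparing $\|\nabla\psi_k(y^{k+1})\|$ with scalars that are cheap to form. A small MATLAB sketch of the check performed after each inner solve (all names are ours):

    % Implementable stopping tests (A'), (B1'), (B2') for the inner problem (illustrative only).
    function stop_inner = check_inner_stop(gpsi, res, alpha_h, sigma_k, eps_k, delta_k, deltap_k)
        normg   = norm(gpsi);    % ||grad psi_k(y^{k+1})||
        normres = norm(res);     % ||A^* y^{k+1} + z^{k+1} - c||
        okA  = normg <= sqrt(alpha_h/sigma_k) * eps_k;               % (A')
        okB1 = normg <= sqrt(alpha_h*sigma_k) * delta_k * normres;   % (B1')
        okB2 = normg <= deltap_k * normres;                          % (B2')
        stop_inner = okA && okB1 && okB2;
    end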

3.3. An efficient implementation of SSN for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we shall discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer $p$ is chosen to be $\lambda\|\cdot\|_1$ for some $\lambda > 0$. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction $d^j$ from the linear system (24). So we shall first discuss the solving of this linear system.

Let $(x, y) \in \Re^n \times \Re^m$ and $\sigma > 0$ be given. We consider the following Newton linear system:

(25) $\quad (H + \sigma AUA^T)d = -\nabla\psi(y),$

where $H \in \partial(\nabla h^*)(y)$, $A$ denotes the matrix representation of $\mathcal{A}$ with respect to the standard bases of $\Re^n$ and $\Re^m$, and $U \in \partial\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ with $\tilde{x} := x - \sigma(A^Ty - c)$. Since $H$ is a symmetric and positive definite matrix, (25) can be equivalently rewritten as
$$\big(I_m + \sigma(L^{-1}A)U(L^{-1}A)^T\big)(L^Td) = -L^{-1}\nabla\psi(y),$$

where $L$ is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of $H$ such that $H = LL^T$. In many applications, $H$ is usually a sparse matrix. Indeed, when the function $h$ in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrices $H$ are in fact diagonal matrices. That is, the costs of computing $L$ and its inverse are negligible in most situations. Therefore, without loss of generality, we can consider a simplified version of (25) as follows:

(26) $\quad (I_m + \sigma AUA^T)d = -\nabla\psi(y),$

which is precisely the Newton system associated with the standard lasso problem (1). Since $U \in \Re^{n\times n}$ is a diagonal matrix, at first glance, the costs of computing $AUA^T$ and the matrix-vector multiplication $AUA^Td$ for a given vector $d \in \Re^m$ are $O(m^2n)$ and $O(mn)$, respectively. These computational costs are too expensive when the dimensions of $A$ are large and can make the commonly employed approaches, such as the Cholesky factorization and the conjugate gradient method, inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of $U$ is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level such that they are negligible, or at least insignificant, compared to other costs. Next we shall show how this can be done by taking full advantage of the sparsity of $U$. This sparsity will be referred to as the second order sparsity of the underlying problem.

For $\tilde{x} = x - \sigma(A^Ty - c)$, in our computations, we can always choose $U = \mathrm{Diag}(u)$, the diagonal matrix whose $i$th diagonal element is given by $u_i$ with

$$u_i = \begin{cases} 0 & \text{if } |\tilde{x}_i| \le \sigma\lambda, \\ 1 & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, n.$$

Since $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x}) = \mathrm{sign}(\tilde{x}) \circ \max\{|\tilde{x}| - \sigma\lambda, 0\}$, it is not difficult to see that $U \in \partial\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$. Let $\mathcal{J} := \{j \mid |\tilde{x}_j| > \sigma\lambda,\ j = 1, \ldots, n\}$ and $r = |\mathcal{J}|$, the cardinality of $\mathcal{J}$. By taking the special 0-1 structure of $U$ into consideration, we have that


[Fig. 1. Reducing the computational costs from $O(m^2n)$ (computing $AUA^T$) to $O(m^2r)$ (computing $A_{\mathcal{J}}A_{\mathcal{J}}^T$).]

(27) $\quad AUA^T = (AU)(AU)^T = A_{\mathcal{J}}A_{\mathcal{J}}^T,$

where $A_{\mathcal{J}} \in \Re^{m\times r}$ is the submatrix of $A$ with those columns not in $\mathcal{J}$ being removed from $A$. Then, by using (27), we know that now the costs of computing $AUA^T$ and $AUA^Td$ for a given vector $d$ are reduced to $O(m^2r)$ and $O(mr)$, respectively. Due to the sparsity-promoting property of the regularizer $p$, the number $r$ is usually much smaller than $n$. Thus, by exploiting the aforementioned second order sparsity, we can greatly reduce the computational costs in solving the linear system (26) when using the Cholesky factorization. More specifically, the total computational costs of using the Cholesky factorization to solve the linear system are reduced from $O(m^2(m+n))$ to $O(m^2(m+r))$. See Figure 1 for an illustration of the reduction. This means that even if $n$ happens to be extremely large (say, larger than $10^7$), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization as long as both $m$ and $r$ are moderate (say, less than $10^4$). If, in addition, $r \ll m$, which is often the case when $m$ is large and the optimal solutions to the underlying problem are sparse, instead of factorizing an $m\times m$ matrix, we can make use of the Sherman-Morrison-Woodbury formula [22] to get the inverse of $I_m + \sigma AUA^T$ by inverting a much smaller $r\times r$ matrix as follows:

$$(I_m + \sigma AUA^T)^{-1} = (I_m + \sigma A_{\mathcal{J}}A_{\mathcal{J}}^T)^{-1} = I_m - A_{\mathcal{J}}(\sigma^{-1}I_r + A_{\mathcal{J}}^TA_{\mathcal{J}})^{-1}A_{\mathcal{J}}^T.$$

See Figure 2 for an illustration of the computation of $A_{\mathcal{J}}^TA_{\mathcal{J}}$. In this case, the total computational costs for solving the Newton linear system (26) are reduced significantly further, from $O(m^2(m+r))$ to $O(r^2(m+r))$. We should emphasize here that this dramatic reduction in computational costs results from the wise combination of the careful examination of the existing second order sparsity in the lasso-type problems and some "smart" numerical linear algebra.
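The following MATLAB sketch (our own illustration, not the paper's code) puts (27) and the Sherman-Morrison-Woodbury formula together to solve (26) for a right-hand side rhs $= -\nabla\psi(y)$; switching between the two branches simply when $r < m$ is an illustrative choice.

    % Solve (I_m + sigma*A*U*A')d = rhs by exploiting the 0-1 structure of U (illustrative only).
    function d = newton_direction(A, xtil, sigma, lambda, rhs)
        m  = size(A, 1);
        J  = abs(xtil) > sigma*lambda;        % support of Prox_{sigma*lambda*||.||_1}(xtil)
        AJ = A(:, J);
        r  = size(AJ, 2);
        if r < m
            % invert an r x r matrix via the Sherman-Morrison-Woodbury formula
            W = (1/sigma)*eye(r) + AJ'*AJ;    % sigma^{-1} I_r + A_J' A_J
            d = rhs - AJ*(W \ (AJ'*rhs));     % (I_m + sigma A_J A_J')^{-1} rhs
        else
            d = (eye(m) + sigma*(AJ*AJ')) \ rhs;   % factorize the m x m matrix directly
        end
    end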

From the above arguments, we can see that as long as the number of the nonzero components of $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is small, say, less than $\sqrt{n}$, and $H_j \in \partial(\nabla h^*)(y^j)$ is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low costs. In particular, this is true for the lasso problems admitting sparse


[Fig. 2. Further reducing the computational costs to $O(r^2m)$ via $A_{\mathcal{J}}^TA_{\mathcal{J}}$.]

solutions. Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). Note that one may argue that, even if the original problem has only sparse solutions, at certain stages one may still encounter the situation that the number of the nonzero components of $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice since we always start with a sparse feasible point, e.g., the zero vector. Second, even if at certain steps this phenomenon does occur, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter $\sigma$ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with $p(\cdot)$ being chosen to be $\lambda\|\cdot\|_1$.

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two recent papers [18, 36], which appears to suggest that for some large-scale sparse reconstruction problems, mfIPM$^1$ and FPC_AS$^2$ have mostly outperformed the other solvers. Hence, in this section, we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP$^3$ [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparisons, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB, with the step length set to be 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later, they lack the ability to efficiently solve difficult problems such as the large-scale regression problems when the data $\mathcal{A}$ is badly conditioned.

For testing purposes, the regularization parameter $\lambda$ in the lasso problem (1) is chosen as

$$\lambda = \lambda_c\|\mathcal{A}^*b\|_\infty,$$

where $0 < \lambda_c < 1$. In our numerical experiments, we measure the accuracy of an approximate optimal solution $x$ for (1) by using the following relative KKT residual:

$$\eta = \frac{\|x - \mathrm{prox}_{\lambda\|\cdot\|_1}\big(x - \mathcal{A}^*(\mathcal{A}x - b)\big)\|}{1 + \|x\| + \|\mathcal{A}x - b\|}.$$

$^1$http://www.maths.ed.ac.uk/ERGO/mfipmcs
$^2$http://www.caam.rice.edu/~optimization/L1/FPC_AS
$^3$http://yelab.net/software/SLEP
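A short MATLAB sketch of how this accuracy measure can be evaluated for a given approximate solution $x$ (the helper name kkt_residual is ours):

    % Relative KKT residual eta for the lasso problem (1) (illustrative only).
    function eta = kkt_residual(A, b, x, lambda)
        g   = A'*(A*x - b);                                   % gradient of the smooth part
        px  = sign(x - g) .* max(abs(x - g) - lambda, 0);     % prox_{lambda*||.||_1}(x - g)
        eta = norm(x - px) / (1 + norm(x) + norm(A*x - b));
    end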


For a given tolerance $\varepsilon > 0$, we will stop the tested algorithms when $\eta < \varepsilon$. For all the tests in this section, we set $\varepsilon = 10^{-6}$. The algorithms will also be stopped when they reach the maximum number of iterations (1000 iterations for our algorithm and mfIPM, and 20000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to the default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core Intel Xeon E5-2650, 2.60GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms with the test instances $(\mathcal{A}, b)$ obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in $A$ (the matrix representation of $\mathcal{A}$) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order 5 polynomial is used to generate the basis functions. This naming convention is also used in the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$, which is denoted as $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$.

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, $m$ denotes the number of samples, $n$ denotes the number of features, and "nnz" denotes the number of nonzeros in the solution $x$ obtained by Ssnal using the following estimation:

$$\mathrm{nnz} := \min\Big\{k \;\Big|\; \sum_{i=1}^{k} |\hat{x}_i| \ge 0.999\|x\|_1\Big\},$$

where $\hat{x}$ is obtained by sorting $x$ such that $|\hat{x}_1| \ge \cdots \ge |\hat{x}_n|$. An entry marked as "Error" indicates that the algorithm breaks down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and the entry "00" in the column means that the computation time is less than 0.5 second.
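A short MATLAB sketch of this estimation (the helper name estimate_nnz is ours):

    % nnz estimate: smallest k whose k largest-magnitude entries carry 99.9% of ||x||_1.
    function k = estimate_nnz(x)
        s = sort(abs(x), 'descend');
        k = find(cumsum(s) >= 0.999*sum(s), 1, 'first');
    end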

One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances

Table 1
Statistics of the UCI test instances.

Probname              m; n               λmax(AA*)
E2006.train           16087; 150348      1.91e+05
log1p.E2006.train     16087; 4265669     5.86e+07
E2006.test            3308; 72812        4.79e+04
log1p.E2006.test      3308; 1771946      1.46e+07
pyrim5                74; 169911         1.22e+06
triazines4            186; 557845        2.07e+07
abalone7              4177; 6435         5.21e+05
bodyfat7              252; 116280        5.29e+04
housing7              506; 77520         3.28e+05
mpg7                  392; 3432          1.28e+04
space_ga9             3107; 5005         4.01e+03


Table 2
The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy $\varepsilon = 10^{-6}$). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM. For each instance (Probname, $m$; $n$) and each $\lambda_c$, the table reports nnz, $\eta$ for a|b|c|d|e|f, and Time for a|b|c|d|e|f.
[Table entries omitted.]


to the required accuracy after 20000 iterations or 7 hours. In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances. In fact, for 3 test instances, it breaks down due to some internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with $\lambda_c = 10^{-3}$, we can see that Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20000 iterations and consumes about 2 hours but only produces a rather inaccurate solution.

On the other hand, one can observe that the two second order information based methods, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds ($\lambda_c = 10^{-3}$). Among these two algorithms, clearly Ssnal is far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where the factor can be up to 300 times faster. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with the accuracy of $10^{-1}$ when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior compared to the interior point method in exploiting the sparsity in the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix $A$ (the matrix representation of $\mathcal{A}$) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter $\lambda$ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when $\lambda$ is not a scalar, based on the recommendation of the referee, we use another popular active-set based solver, PSSas$^4$ [44], to replace FPC_AS for the testing. All the parameters for PSSas are set to the default values. In our tests, PSSas will be stopped when it reaches the maximum number of 20000 iterations or the maximum computation time of 7 hours. We note that, by default, PSSas will terminate when its progress is smaller than the threshold $10^{-9}$.

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions based on Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM type of solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy. In fact, PSSas fails on all the test instances when $\lambda_c = 10^{-4}$. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with $\eta = 3.2\times 10^{-3}$. (Actually, we also run PSSas on these test instances without scaling

$^4$https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3
The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy $\varepsilon = 10^{-6}$). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM. For each instance (Probname, $m$; $n$) and each $\lambda_c$, the table reports nnz, $\eta$ for a|b|c1|d|e|f, and Time for a|b|c1|d|e|f.
[Table entries omitted.]


and obtain similar performances. Detailed results are omitted to conserve space.) Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for the Sparco collection. In this subsection, the test instances $(\mathcal{A}, b)$ are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce a 60dB noise to $b$ (as in [18]) by using the MATLAB command b = awgn(b,60,'measured'). For these test instances, the matrix representations of the linear maps $\mathcal{A}$ are not available. Hence, ADMM will not be tested in this subsection, as it will be extremely expensive, if not impossible, to compute and factorize $I + \sigma\mathcal{A}\mathcal{A}^*$.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of $\lambda_c$, i.e., $\lambda_c = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}$. As one can observe, as $\lambda_c$ decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions $(m, n)$ and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$ ($\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, as well as the computation times (in the format of hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well for these test instances with small Lipschitz constants $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. As such, Ssnal does not have a clear advantage as shown in Table 2. Moreover, since the matrix representations of the linear maps $\mathcal{A}$ and $\mathcal{A}^*$ involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases with $\lambda_c = 10^{-3}$ and $10^{-4}$. We can observe that FPC_AS

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy $\varepsilon = 10^{-6}$, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

λc     nnz    Iteration (a|b|c|d|e)        η (a|b|c|d|e)                              Time (a|b|c|d|e)

srcsep1: m = 29166, n = 57344, λmax(AA*) = 3.56
0.16    380   14 | 28 | 42 | 401 | 89      6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7      07 | 09 | 05 | 15 | 07
0.10    726   16 | 38 | 42 | 574 | 161     4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7      11 | 15 | 05 | 22 | 13
0.06   1402   19 | 41 | 63 | 801 | 393     1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7      18 | 18 | 10 | 30 | 32
0.03   2899   19 | 56 | 110 | 901 | 337    2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7      28 | 53 | 16 | 33 | 28
0.01   6538   17 | 88 | 223 | 1401 | 542   7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7      1:21 | 2:15 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, λmax(AA*) = 4.95
0.16    423   15 | 29 | 42 | 501 | 127     3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7      14 | 13 | 07 | 25 | 15
0.10    805   16 | 37 | 84 | 601 | 212     8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7      21 | 19 | 16 | 30 | 26
0.06   1549   19 | 40 | 84 | 901 | 419     1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7      32 | 28 | 18 | 44 | 50
0.03   3254   20 | 69 | 128 | 901 | 488    1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7      1:06 | 1:33 | 26 | 44 | 59
0.01   7400   21 | 94 | 259 | 2201 | 837   8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7      1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy $\varepsilon = 10^{-6}$, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

Probname       λc      nnz     η (a | b | c | d | e)                          Time (a | b | c | d | e)
m;n

blknheavi      10^-3     12    5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7          01 | 01 | 55 | 08 | 06
1024;1024      10^-4     12    9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7          01 | 01 | 49 | 07 | 07
srcsep1        10^-3  14066    1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7          5:41 | 42:34 | 13:25 | 1:56 | 4:16
29166;57344    10^-4  19306    9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7          9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
srcsep2        10^-3  16502    3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7          9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
29166;86016    10^-4  22315    7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7          19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
srcsep3        10^-3  27314    6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7          33 | 6:24 | 8:51 | 47 | 49
196608;196608  10^-4  83785    9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7          2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
soccer1        10^-3      4    1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7          01 | 03 | 13:51 | 2:35 | 02
3200;4096      10^-4      8    8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7          01 | 02 | 13:23 | 3:07 | 02
soccer2        10^-3      4    3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7          00 | 03 | 13:46 | 1:40 | 02
3200;4096      10^-4      8    2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7          01 | 03 | 13:27 | 3:07 | 02
blurrycam      10^-3   1694    1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7          03 | 09 | 03 | 02 | 07
65536;65536    10^-4   5630    1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7          05 | 1:35 | 08 | 03 | 29
blurspike      10^-3   1954    3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7          03 | 05 | 6:38 | 03 | 27
16384;16384    10^-4  11698    3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7          10 | 08 | 6:43 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when $\lambda_c = 10^{-3}$ and $10^{-4}$, respectively. For a few cases, FPC_AS can only achieve a poor accuracy ($10^{-1}$). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi ($m = n = 1024$), whose corresponding $\lambda_{\max}(\mathcal{A}\mathcal{A}^*) = 709$. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases, LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$, one can safely conclude that the first order algorithms can only be used to solve problems with relatively small $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. Moreover, in order to obtain efficient and robust algorithms for lasso problems, or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method of an asymptotic superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving $\ell_1$-regularized least squares (LS) problems. With the intelligent


incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale $\ell_1$-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candès, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrmann, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex $\ell_1$-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Tech., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation par éléments finis d'ordre un et la résolution par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationnelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wächter, A second-order method for convex $\ell_1$-regularized optimization with active-set prediction, Optim. Methods Software, 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed $C^{1,1}$ optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A two-phase proximal augmented Lagrangian method for convex quadratic semidefinite programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for $\ell_1$-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate $O(1/k^2)$, Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with $\ell_1$-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact regularized proximal Newton method: Provable convergence guarantees for non-smooth convex minimization without strong convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.
[56] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.
[57] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.



SEMISMOOTH NEWTON ALM 439

Dividing both sides of (11) by tk and then taking limits we obtain

(12) Alowastξ1 + ξ2 = 0

which further implies that

(13) 〈nablahlowast(y) ξ1〉 = 〈Ax ξ1〉 = minus〈x ξ2〉

Since xk isin partplowast(zk) we know that for all k sufficiently large

zk = z + tkξ2 + o(tk) isin dom plowast

That is ξ2 isin Tdom plowast(z)According to the structure of plowast we separate our discussions into two cases

Case I There exists a nonempty polyhedral convex set P such that plowast(z) =δlowastP (z) for all z isin X Then for each k it holds that

xk = ΠP (zk + xk)

By [14 Theorem 411] we have

(14) xk = ΠP (zk+ xk) = ΠP (z+ x)+ΠC(zkminus z+ xkminus x) = x+ΠC(z

kminus z+ xkminus x)

where C is the critical cone of P at z + x ie

C equiv CP (z + x) = TP (x) cap zperp

Since C is a polyhedral cone we know from [14 Proposition 414] that ΠC(middot) is apiecewise linear function ie there exist a positive integer l and orthogonal projectorsB1 Bl such that for any x isin X

ΠC(x) isin B1x Blx

By restricting to a subsequence if necessary we may further assume that there exists1 le jprime le l such that for all k ge 1

(15) ΠC(zk minus z + xk minus x) = Bjprime(z

k minus z + xk minus x) = ΠRange(Bjprime )(zk minus z + xk minus x)

Denote L = Range(Bjprime) Combining (14) and (15) we get

C cap L 3 (xk minus x) perp (zk minus z) isin C cap Lperp

where C is the polar cone of C Since xk minus x isin C we have 〈xk minus x z〉 = 0 whichtogether with 〈xkminus x zkminus z〉 = 0 implies 〈xkminus x zk〉 = 0 Thus for all k sufficientlylarge

〈zk xk〉 = 〈zk x〉 = 〈z + tkξ2 + o(tk) x〉

and it follows that

(16)

(plowast)prime(z ξ2) = limkrarrinfin

δlowastP (z + tkξ2)minus δlowastP (z)

tk= limkrarrinfin

δlowastP (zk)minus δlowastP (z)

tk

= limkrarrinfin

〈zk xk〉 minus 〈z x〉tk

= 〈x ξ2〉

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

440 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

By (12) (13) and (16) we know that (ξ1 ξ2) isin C(y z) Since C cap L is a polyhedralconvex cone we know from [40 Theorem 193] that A(C cap L) is also a polyhedralconvex cone which together with (10) implies

(nablahlowast)prime(y ξ1) isin A(C cap L)

Therefore there exists a vector η isin C cap L such that

〈ξ1 (nablahlowast)prime(y ξ1)〉 = 〈ξ1 Aη〉 = minus〈ξ2 η〉 = 0

where the last equality follows from the fact that ξ2 isin C cap Lperp Note that the lastinclusion holds since the polyhedral convex cone C cap Lperp is closed and

ξ2 = limkrarrinfin

zk minus ztk

isin C cap Lperp

As 0 6= ξ = (ξ1 ξ2) isin C(y z) but 〈ξ1 (nablahlowast)prime(y ξ1)〉 = 0 this contradicts the assump-tion that the second order sufficient condition holds for (4) at (y z) Thus we haveproved our claim for Case I

Case II There exists a nonempty polyhedral convex set P such that plowast(z) =δP (z) for all z isin X Then we know that for each k

zk = ΠP (zk + xk)

Since (δP )prime(z d2) = 0 for all d2 isin Tdom(plowast)(z) the critical cone in (7) now takes thefollowing form

C(y z) =

(d1 d2) isin Y times X | Alowastd1 + d2 = 0 〈nablahlowast(y) d1〉 = 0 d2 isin Tdom(plowast)(z)

Similar to Case I without loss of generality we can assume that there exists a subspaceL such that for all k ge 1

C cap L 3 (zk minus z) perp (xk minus x) isin C cap Lperp

whereC equiv CP (z + x) = TP (z) cap xperp

Since C cap L is a polyhedral convex cone we know

ξ2 = limkrarrinfin

zk minus ztk

isin C cap L

and consequently 〈ξ2 x〉 = 0 which together with (12) and (13) implies ξ =(ξ1 ξ2) isin C(y z) By (10) and the fact that A(C cap Lperp) is a polyhedral convexcone we know that

(nablahlowast)prime(y ξ1) isin A(C cap Lperp)

Therefore there exists a vector η isin C cap Lperp such that

〈ξ1 (nablahlowast)prime(y ξ1)〉 = 〈ξ1 Aη〉 = minus〈ξ2 η〉 = 0

Since ξ = (ξ1 ξ2) 6= 0 we arrive at a contradiction to the assumed second ordersufficient condition So our claim is also true for this case

In summary we have proven that there exists a neighborhood U of (x y z) suchthat for any w close enough to the origin (9) holds for any solution (x(w) y(w)

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 441

z(w)) isin U to the perturbed KKT system (8) Next we show that Tl is metricallysubregular at (y z x) for the origin

Define the mapping ΘKKT X times Y times X timesW rarr Y timesX times X by

ΘKKT (x y z w) =

nablahlowast(y)minusAxminus u1z minus Proxplowast(z + x+ u2)Alowasty + z minus cminus v

forall(x y z w) isin X times Y times X timesW

and define the mapping θ X rarr Y times X times X as follows

θ(x) = ΘKKT(x y z 0) forallx isin X

Then we have x isin M(y z) if and only if θ(x) = 0 Since Proxplowast(middot) is a piecewiseaffine function θ(middot) is a piecewise affine function and thus a polyhedral multifunctionBy using Proposition 22 and shrinking the neighborhood U if necessary for any wclose enough to the origin and any solution (x(w) y(w) z(w)) isin U of the perturbedKKT system (8) we have

dist(x(w)M(y z)) = O(θ(x(w)))

= O(ΘKKT(x(w) y z 0)minusΘKKT(x(w) y(w) z(w) w))

= O(w+ (y(w) z(w))minus (y z))

which together with (9) implies the existence of a constant κ ge 0 such that

(17) (y(w) z(w))minus (y z)+ dist(x(w)M(y z)) le κw

Thus by Definition 23 we have proven that Tl is metrically subregular at (y z x)for the origin The proof of the theorem is complete

Remark 2.8. For convex piecewise linear-quadratic programming problems, such as the $\ell_1$ and elastic net regularized LS problems, we know from J. Sun's thesis [46] on the characterization of the subdifferentials of convex piecewise linear-quadratic functions (see also [43, Proposition 12.30]) that the corresponding operators $\mathcal{T}_l$ and $\mathcal{T}_f$ are polyhedral multifunctions, and thus, by Proposition 2.2, the error bound condition holds. Here we emphasize again that the error bound condition holds for the lasso problem (1). Moreover, $\mathcal{T}_f$ associated with the $\ell_1$ or elastic net regularized logistic regression model, i.e., where, for a given vector $b \in \Re^m$, the loss function $h: \Re^m \to \Re$ in problem (3) takes the form

$$h(y) = \sum_{i=1}^m \log(1 + e^{-b_i y_i}) \quad \forall y \in \Re^m,$$

also satisfies the error bound condition [33, 47]. Meanwhile, from Theorem 2.7, we know that $\mathcal{T}_l$ corresponding to the $\ell_1$ or the elastic net regularized logistic regression model is metrically subregular at any solution to the KKT system (6) for the origin.

3. An augmented Lagrangian method with asymptotic superlinear convergence. Recall the general convex composite model (3):

$$({\rm P}) \qquad \max\ \big\{ -\big(f(x) := h(\mathcal{A}x) - \langle c, x\rangle + p(x)\big) \big\}$$

and its dual

$$({\rm D}) \qquad \min\ \big\{ h^*(y) + p^*(z) \mid \mathcal{A}^* y + z = c \big\}.$$


In this section, we shall propose an asymptotically superlinearly convergent augmented Lagrangian method for solving (P) and (D). We make the following standing assumptions regarding the function $h$.

Assumption 3.1.
(a) $h: \mathcal{Y} \to \Re$ is a convex differentiable function whose gradient is $1/\alpha_h$-Lipschitz continuous, i.e.,

$$\|\nabla h(y') - \nabla h(y)\| \le (1/\alpha_h)\|y' - y\| \quad \forall y', y \in \mathcal{Y}.$$

(b) $h$ is essentially locally strongly convex [21], i.e., for any compact and convex set $K \subset {\rm dom}\,\partial h$, there exists $\beta_K > 0$ such that

$$(1-\lambda)h(y') + \lambda h(y) \ge h\big((1-\lambda)y' + \lambda y\big) + \tfrac{1}{2}\beta_K \lambda(1-\lambda)\|y' - y\|^2 \quad \forall y', y \in K,$$

for all $\lambda \in [0, 1]$.

Many commonly used loss functions in the machine learning literature satisfy the above mild assumptions. For example, $h$ can be the loss function in the linear regression, logistic regression, and Poisson regression models. While the strict convexity of a convex function is closely related to the differentiability of its conjugate, the essential local strong convexity in Assumption 3.1(b) for a convex function was first proposed in [21] to obtain a characterization of the local Lipschitz continuity of the gradient of its conjugate function.

The aforementioned assumptions on $h$ imply the following useful properties of $h^*$. First, by [43, Proposition 12.60], we know that $h^*$ is strongly convex with modulus $\alpha_h$. Second, by [21, Corollary 4.4], we know that $h^*$ is essentially smooth and $\nabla h^*$ is locally Lipschitz continuous on ${\rm int}({\rm dom}\, h^*)$. If the solution set to the KKT system associated with (P) and (D) is further assumed to be nonempty, then, similar to what we have discussed in the last section, one only needs to focus on ${\rm int}({\rm dom}\, h^*) \times \mathcal{X}$ when solving (D). Given $\sigma > 0$, the augmented Lagrangian function associated with (D) is given by

$$\mathcal{L}_\sigma(y, z; x) = l(y, z; x) + \frac{\sigma}{2}\|\mathcal{A}^* y + z - c\|^2 \quad \forall (y, z, x) \in \mathcal{Y} \times \mathcal{X} \times \mathcal{X},$$

where the Lagrangian function $l(\cdot)$ is defined in (5).

3.1. SSNAL: A semismooth Newton augmented Lagrangian algorithm for (D). The detailed steps of our algorithm Ssnal for solving (D) are given as follows. Since a semismooth Newton method will be employed to solve the subproblems involved in this prototype of the inexact augmented Lagrangian method [42], it is natural for us to call our algorithm Ssnal.

Algorithm SSNAL: An inexact augmented Lagrangian method for (D).

Let $\sigma_0 > 0$ be a given parameter. Choose $(y^0, z^0, x^0) \in {\rm int}({\rm dom}\, h^*) \times {\rm dom}\, p^* \times \mathcal{X}$. For $k = 0, 1, \ldots$, perform the following steps in each iteration.

Step 1. Compute

$$(18) \qquad (y^{k+1}, z^{k+1}) \approx \arg\min\big\{\Psi_k(y, z) := \mathcal{L}_{\sigma_k}(y, z; x^k)\big\}.$$

Step 2. Compute $x^{k+1} = x^k - \sigma_k(\mathcal{A}^* y^{k+1} + z^{k+1} - c)$ and update $\sigma_{k+1} \uparrow \sigma_\infty \le \infty$.
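To make the outer iteration concrete, the following minimal Python sketch mirrors Steps 1 and 2 for the lasso case of (1) (where $c = 0$). It is for illustration only: the inner solver is passed in as a callable standing in for the Ssn method of section 3.2, and the stopping rule and the update rule for $\sigma_k$ are placeholder choices, not the ones used in our implementation.

```python
import numpy as np

def ssnal_lasso(A, b, lam, solve_subproblem, sigma0=1.0, max_iter=100, tol=1e-6):
    # Sketch of Algorithm SSNAL for the lasso dual; c = 0 for problem (1).
    m, n = A.shape
    x = np.zeros(n)                # multiplier x^k (primal lasso variable)
    sigma = sigma0
    for k in range(max_iter):
        # Step 1: approximately minimize Psi_k(y, z) = L_{sigma_k}(y, z; x^k)
        y, z = solve_subproblem(A, b, lam, x, sigma)
        # Step 2: x^{k+1} = x^k - sigma_k (A^T y^{k+1} + z^{k+1} - c)
        dual_res = A.T @ y + z
        x = x - sigma * dual_res
        if np.linalg.norm(dual_res) <= tol:   # simple surrogate stopping rule
            break
        sigma = min(2.0 * sigma, 1e6)         # sigma_k increasing toward sigma_inf
    return x, y, z
```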

Next, we shall adapt the results developed in [41, 42, 34] to establish the global and local superlinear convergence of our algorithm.


Since the inner problems cannot be solved exactly, we use the following standard stopping criterion studied in [41, 42] for approximately solving (18):

$$({\rm A}) \qquad \Psi_k(y^{k+1}, z^{k+1}) - \inf \Psi_k \le \varepsilon_k^2/(2\sigma_k), \quad \sum_{k=0}^\infty \varepsilon_k < \infty.$$

Now we can state the global convergence of Algorithm Ssnal from [41, 42] without much difficulty.

Theorem 3.2. Suppose that Assumption 3.1 holds and that the solution set to (P) is nonempty. Let $\{(y^k, z^k, x^k)\}$ be the infinite sequence generated by Algorithm Ssnal with stopping criterion (A). Then the sequence $\{x^k\}$ is bounded and converges to an optimal solution of (P). In addition, $\{(y^k, z^k)\}$ is also bounded and converges to the unique optimal solution $(y^*, z^*) \in {\rm int}({\rm dom}\, h^*) \times {\rm dom}\, p^*$ of (D).

Proof. Since the solution set to (P) is assumed to be nonempty, the optimal value of (P) is finite. From Assumption 3.1(a), we have that ${\rm dom}\, h = \mathcal{Y}$ and $h^*$ is strongly convex [43, Proposition 12.60]. Then, by Fenchel's duality theorem [40, Corollary 31.2.1], we know that the solution set to (D) is nonempty, and the optimal value of (D) is finite and equal to the optimal value of (P). That is, the solution set to the KKT system associated with (P) and (D) is nonempty. The uniqueness of the optimal solution $(y^*, z^*) \in {\rm int}({\rm dom}\, h^*) \times \mathcal{X}$ of (D) then follows directly from the strong convexity of $h^*$. By combining this uniqueness with [42, Theorem 4], one can easily obtain the boundedness of $\{(y^k, z^k)\}$ and the other desired results readily.

We need the following stopping criteria for the local convergence analysis:

$$({\rm B1}) \qquad \Psi_k(y^{k+1}, z^{k+1}) - \inf \Psi_k \le (\delta_k^2/2\sigma_k)\|x^{k+1} - x^k\|^2, \quad \sum_{k=0}^\infty \delta_k < +\infty,$$

$$({\rm B2}) \qquad {\rm dist}\big(0, \partial \Psi_k(y^{k+1}, z^{k+1})\big) \le (\delta_k'/\sigma_k)\|x^{k+1} - x^k\|, \quad 0 \le \delta_k' \to 0,$$

where $x^{k+1} = x^k + \sigma_k(\mathcal{A}^* y^{k+1} + z^{k+1} - c)$.

Theorem 3.3. Assume that Assumption 3.1 holds and that the solution set $\Omega$ to (P) is nonempty. Suppose that $\mathcal{T}_f$ satisfies the error bound condition (2) for the origin with modulus $a_f$. Let $\{(y^k, z^k, x^k)\}$ be any infinite sequence generated by Algorithm Ssnal with stopping criteria (A) and (B1). Then the sequence $\{x^k\}$ converges to $x^* \in \Omega$, and, for all $k$ sufficiently large,

$$(19) \qquad {\rm dist}(x^{k+1}, \Omega) \le \theta_k\, {\rm dist}(x^k, \Omega),$$

where $\theta_k = \big(a_f(a_f^2 + \sigma_k^2)^{-1/2} + 2\delta_k\big)(1 - \delta_k)^{-1} \to \theta_\infty = a_f(a_f^2 + \sigma_\infty^2)^{-1/2} < 1$ as $k \to +\infty$. Moreover, the sequence $\{(y^k, z^k)\}$ converges to the unique optimal solution $(y^*, z^*) \in {\rm int}({\rm dom}\, h^*) \times {\rm dom}\, p^*$ of (D).

Moreover, if $\mathcal{T}_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with modulus $a_l$ and the stopping criterion (B2) is also used, then, for all $k$ sufficiently large,

$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta_k' \|x^{k+1} - x^k\|,$$

where $\theta_k' = a_l(1 + \delta_k')/\sigma_k$ with $\lim_{k\to\infty} \theta_k' = a_l/\sigma_\infty$.

Proof. The first part of the theorem follows from [34, Theorem 2.1], [42, Proposition 7 and Theorem 5], and Theorem 3.2. To prove the second part, we recall that if $\mathcal{T}_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with the modulus $a_l$ and $(y^k, z^k, x^k) \to (y^*, z^*, x^*)$, then, for all $k$ sufficiently large,


$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| + {\rm dist}(x^{k+1}, \Omega) \le a_l\, {\rm dist}\big(0, \mathcal{T}_l(y^{k+1}, z^{k+1}, x^{k+1})\big).$$

Therefore, by the estimate (4.21) in [42] and the stopping criterion (B2), we obtain that, for all $k$ sufficiently large,

$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \frac{a_l(1 + \delta_k')}{\sigma_k}\|x^{k+1} - x^k\|.$$

This completes the proof of Theorem 3.3.

Remark 3.4. In the above theorem, when $\sigma_\infty = +\infty$, the inequality (19) directly implies that $\{x^k\}$ converges Q-superlinearly. Moreover, recent advances in [12] also established the asymptotic R-superlinear convergence of the dual iteration sequence $\{(y^k, z^k)\}$. Indeed, from [12, Proposition 4.1], under the same conditions as in Theorem 3.3, we have that, for $k$ sufficiently large,

$$(20) \qquad \|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta_k'\|x^{k+1} - x^k\| \le \theta_k'(1 - \delta_k)^{-1}{\rm dist}(x^k, \Omega),$$

where $\theta_k'(1 - \delta_k)^{-1} \to a_l/\sigma_\infty$. Thus, if $\sigma_\infty = \infty$, the Q-superlinear convergence of $\{x^k\}$ and (20) further imply that $\{(y^k, z^k)\}$ converges R-superlinearly.

We should emphasize here that, by combining Remarks 2.8 and 3.4 and Theorem 3.3, our Algorithm Ssnal is guaranteed to produce an asymptotically superlinearly convergent sequence when used to solve (D) for many commonly used regularizers and loss functions. In particular, the Ssnal algorithm is asymptotically superlinearly convergent when applied to the dual of (1).

3.2. Solving the augmented Lagrangian subproblems. Here we shall propose an efficient semismooth Newton algorithm to solve the inner subproblems in the augmented Lagrangian method (18). That is, for some fixed $\sigma > 0$ and $x \in \mathcal{X}$, we aim to solve

$$(21) \qquad \min_{y,z}\ \Psi(y, z) := \mathcal{L}_\sigma(y, z; x).$$

Since $\Psi(\cdot, \cdot)$ is a strongly convex function, we have that, for any $\alpha \in \Re$, the level set $\mathcal{L}_\alpha := \{(y, z) \in {\rm dom}\, h^* \times {\rm dom}\, p^* \mid \Psi(y, z) \le \alpha\}$ is a closed and bounded convex set. Moreover, problem (21) admits a unique optimal solution, denoted as $(\bar{y}, \bar{z}) \in {\rm int}({\rm dom}\, h^*) \times {\rm dom}\, p^*$.

Denote, for any $y \in \mathcal{Y}$,

$$\psi(y) := \inf_z \Psi(y, z) = h^*(y) + p^*\big({\rm Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^* y + c)\big) + \frac{1}{2\sigma}\big\|{\rm Prox}_{\sigma p}\big(x - \sigma(\mathcal{A}^* y - c)\big)\big\|^2 - \frac{1}{2\sigma}\|x\|^2.$$

Therefore, if $(\bar{y}, \bar{z}) = \arg\min \Psi(y, z)$, then $(\bar{y}, \bar{z}) \in {\rm int}({\rm dom}\, h^*) \times {\rm dom}\, p^*$ can be computed simultaneously by

$$\bar{y} = \arg\min \psi(y), \qquad \bar{z} = {\rm Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^* \bar{y} + c).$$


Note that $\psi(\cdot)$ is strongly convex and continuously differentiable on ${\rm int}({\rm dom}\, h^*)$ with

$$\nabla\psi(y) = \nabla h^*(y) - \mathcal{A}\,{\rm Prox}_{\sigma p}\big(x - \sigma(\mathcal{A}^* y - c)\big) \quad \forall y \in {\rm int}({\rm dom}\, h^*).$$

Thus, $\bar{y}$ can be obtained via solving the following nonsmooth equation:

$$(22) \qquad \nabla\psi(y) = 0, \quad y \in {\rm int}({\rm dom}\, h^*).$$

Let $y \in {\rm int}({\rm dom}\, h^*)$ be any given point. Since $h^*$ is a convex function with a locally Lipschitz continuous gradient on ${\rm int}({\rm dom}\, h^*)$, the following operator is well defined:

$$\hat{\partial}^2\psi(y) := \partial(\nabla h^*)(y) + \sigma\mathcal{A}\,\partial{\rm Prox}_{\sigma p}\big(x - \sigma(\mathcal{A}^* y - c)\big)\mathcal{A}^*,$$

where $\partial(\nabla h^*)(y)$ is the Clarke subdifferential of $\nabla h^*$ at $y$ [10], and $\partial{\rm Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c))$ is the Clarke subdifferential of the Lipschitz continuous mapping ${\rm Prox}_{\sigma p}(\cdot)$ at $x - \sigma(\mathcal{A}^* y - c)$. Note that, from [10, Proposition 2.3.3 and Theorem 2.6.6], we know that

$$\hat{\partial}^2\psi(y)(d) \subseteq \partial^2\psi(y)(d) \quad \forall d \in \mathcal{Y},$$

where $\partial^2\psi(y)$ denotes the generalized Hessian of $\psi$ at $y$. Define

$$(23) \qquad V := H + \sigma\mathcal{A}U\mathcal{A}^*$$

with $H \in \partial^2 h^*(y)$ and $U \in \partial{\rm Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c))$. Then we have $V \in \hat{\partial}^2\psi(y)$. Since $h^*$ is a strongly convex function, we know that $H$ is symmetric positive definite on $\mathcal{Y}$, and thus $V$ is also symmetric positive definite on $\mathcal{Y}$.

Under the mild assumption that $\nabla h^*$ and ${\rm Prox}_{\sigma p}$ are strongly semismooth (the definition of which is given next), we can design a superlinearly convergent semismooth Newton method to solve the nonsmooth equation (22).

Definition 3.5 (semismoothness [35, 38, 45]). Let $F: \mathcal{O} \subseteq \mathcal{X} \to \mathcal{Y}$ be a locally Lipschitz continuous function on the open set $\mathcal{O}$. $F$ is said to be semismooth at $x \in \mathcal{O}$ if $F$ is directionally differentiable at $x$ and, for any $V \in \partial F(x + \Delta x)$ with $\Delta x \to 0$,

$$F(x + \Delta x) - F(x) - V\Delta x = o(\|\Delta x\|).$$

$F$ is said to be strongly semismooth at $x$ if $F$ is semismooth at $x$ and

$$F(x + \Delta x) - F(x) - V\Delta x = O(\|\Delta x\|^2).$$

$F$ is said to be a semismooth (respectively, strongly semismooth) function on $\mathcal{O}$ if it is semismooth (respectively, strongly semismooth) everywhere in $\mathcal{O}$.

Note that it is widely known in the nonsmooth optimization/equation community that continuous piecewise affine functions and twice continuously differentiable functions are all strongly semismooth everywhere. In particular, ${\rm Prox}_{\|\cdot\|_1}$, as a Lipschitz continuous piecewise affine function, is strongly semismooth. See [14] for more semismooth and strongly semismooth functions.

Now we can design a semismooth Newton (Ssn) method to solve (22) as follows, and could expect to get a fast superlinear or even quadratic convergence.


Algorithm SSN: A semismooth Newton algorithm for solving (22) (SSN($y^0$, $x$, $\sigma$)).

Given $\mu \in (0, 1/2)$, $\eta \in (0, 1)$, $\tau \in (0, 1]$, and $\delta \in (0, 1)$, choose $y^0 \in {\rm int}({\rm dom}\, h^*)$. Iterate the following steps for $j = 0, 1, \ldots$.

Step 1. Choose $H_j \in \partial(\nabla h^*)(y^j)$ and $U_j \in \partial{\rm Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y^j - c))$. Let $V_j := H_j + \sigma\mathcal{A}U_j\mathcal{A}^*$. Solve the linear system

$$(24) \qquad V_j d + \nabla\psi(y^j) = 0$$

exactly or by the conjugate gradient (CG) algorithm to find $d^j$ such that

$$\|V_j d^j + \nabla\psi(y^j)\| \le \min\big(\eta, \|\nabla\psi(y^j)\|^{1+\tau}\big).$$

Step 2. (Line search) Set $\alpha_j = \delta^{m_j}$, where $m_j$ is the first nonnegative integer $m$ for which

$$y^j + \delta^m d^j \in {\rm int}({\rm dom}\, h^*) \quad {\rm and} \quad \psi(y^j + \delta^m d^j) \le \psi(y^j) + \mu\delta^m\langle\nabla\psi(y^j), d^j\rangle.$$

Step 3. Set $y^{j+1} = y^j + \alpha_j d^j$.
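As an illustration of how Steps 1-3 fit together, here is a minimal Python sketch of Ssn specialized to the lasso subproblem, assuming the squared loss $h(u) = \frac{1}{2}\|u-b\|^2$ (so that $\nabla h^*(y) = y + b$, $H = I$, and ${\rm dom}\, h^* = \Re^m$) and $c = 0$. The Newton system is solved densely here for clarity; the efficient reductions are discussed in section 3.3. This is a hedged sketch under the stated assumptions, not our MATLAB implementation.

```python
import numpy as np

def soft_threshold(v, t):
    # Prox_{t||.||_1}(v) = sign(v) * max(|v| - t, 0)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ssn_lasso(A, b, lam, x, sigma, y0=None, mu=1e-4, delta=0.5,
              max_iter=50, tol=1e-10):
    # Semismooth Newton for (22) in the lasso case: grad h*(y) = y + b, H = I.
    m, n = A.shape
    y = np.zeros(m) if y0 is None else y0.copy()

    def psi_grad(y):
        w = x - sigma * (A.T @ y)              # x - sigma*(A^T y - c), c = 0
        px = soft_threshold(w, sigma * lam)    # Prox_{sigma*lam*||.||_1}(w)
        val = 0.5 * y @ y + b @ y + (px @ px) / (2 * sigma) - (x @ x) / (2 * sigma)
        grad = y + b - A @ px                  # grad h*(y) - A Prox_{sigma p}(.)
        return val, grad, w

    for _ in range(max_iter):
        val, g, w = psi_grad(y)
        if np.linalg.norm(g) <= tol:
            break
        # Step 1: V = I + sigma * A_J A_J^T with J the support of the prox
        J = np.abs(w) > sigma * lam
        AJ = A[:, J]
        V = np.eye(m) + sigma * (AJ @ AJ.T)    # dense solve; see section 3.3
        d = np.linalg.solve(V, -g)
        # Step 2: backtracking line search (dom h* = R^m, so no domain check)
        alpha = 1.0
        while psi_grad(y + alpha * d)[0] > val + mu * alpha * (g @ d):
            alpha *= delta
        # Step 3
        y = y + alpha * d
    # z = Prox_{p*/sigma}(x/sigma - A^T y), i.e., projection onto the l_inf ball
    z = np.clip(x / sigma - A.T @ y, -lam, lam)
    return y, z
```

Under these assumptions, this routine can also serve as the `solve_subproblem` callable in the outer-loop sketch given after Algorithm Ssnal.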

The convergence results for the above Ssn algorithm are stated in the next theorem.

Theorem 3.6. Assume that $\nabla h^*(\cdot)$ and ${\rm Prox}_{\sigma p}(\cdot)$ are strongly semismooth on ${\rm int}({\rm dom}\, h^*)$ and $\mathcal{X}$, respectively. Let the sequence $\{y^j\}$ be generated by Algorithm Ssn. Then $\{y^j\}$ converges to the unique optimal solution $\bar{y} \in {\rm int}({\rm dom}\, h^*)$ of the problem in (22), and

$$\|y^{j+1} - \bar{y}\| = O\big(\|y^j - \bar{y}\|^{1+\tau}\big).$$

Proof. Since, by [54, Proposition 3.3], $d^j$ is a descent direction, Algorithm Ssn is well defined. By (23), we know that, for any $j \ge 0$, $V_j \in \hat{\partial}^2\psi(y^j)$. Then we can prove the conclusion of this theorem by following the proofs of [54, Theorems 3.4 and 3.5]. We omit the details here.

We shall now discuss the implementation of stopping criteria (A), (B1), and (B2) for Algorithm Ssn to solve the subproblem (18) in Algorithm Ssnal. Note that when Algorithm Ssn is applied to minimize $\Psi_k(\cdot)$ to find

$$y^{k+1} = {\rm Ssn}(y^k, x^k, \sigma_k) \quad {\rm and} \quad z^{k+1} = {\rm Prox}_{p^*/\sigma_k}(x^k/\sigma_k - \mathcal{A}^* y^{k+1} + c),$$

we have, by simple calculations and the strong convexity of $h^*$, that

$$\Psi_k(y^{k+1}, z^{k+1}) - \inf \Psi_k = \psi_k(y^{k+1}) - \inf \psi_k \le (1/2\alpha_h)\|\nabla\psi_k(y^{k+1})\|^2$$

and $(\nabla\psi_k(y^{k+1}), 0) \in \partial\Psi_k(y^{k+1}, z^{k+1})$, where $\psi_k(y) := \inf_z \Psi_k(y, z)$ for all $y \in \mathcal{Y}$. Therefore, the stopping criteria (A), (B1), and (B2) can be achieved by the following implementable criteria:

$$({\rm A}') \qquad \|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h/\sigma_k}\, \varepsilon_k, \quad \sum_{k=0}^\infty \varepsilon_k < \infty,$$

$$({\rm B1}') \qquad \|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h \sigma_k}\, \delta_k \|\mathcal{A}^* y^{k+1} + z^{k+1} - c\|, \quad \sum_{k=0}^\infty \delta_k < +\infty,$$

$$({\rm B2}') \qquad \|\nabla\psi_k(y^{k+1})\| \le \delta_k' \|\mathcal{A}^* y^{k+1} + z^{k+1} - c\|, \quad 0 \le \delta_k' \to 0.$$


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as $\|\nabla\psi_k(y^{k+1})\|$ is sufficiently small.
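For concreteness, the small sketch below transcribes the checks (A'), (B1'), and (B2') directly; the function name and argument layout are ours, introduced only for illustration, and the tolerance sequences $\varepsilon_k$, $\delta_k$, $\delta_k'$ are user-supplied summable/vanishing sequences.

```python
import numpy as np

def inner_criteria(grad_norm, res_norm, alpha_h, sigma_k, eps_k, delta_k, delta_kp):
    # grad_norm = ||grad psi_k(y^{k+1})||, res_norm = ||A^* y^{k+1} + z^{k+1} - c||
    A_ok  = grad_norm <= np.sqrt(alpha_h / sigma_k) * eps_k
    B1_ok = grad_norm <= np.sqrt(alpha_h * sigma_k) * delta_k * res_norm
    B2_ok = grad_norm <= delta_kp * res_norm
    return A_ok, B1_ok, B2_ok
```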

3.3. An efficient implementation of SSN for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we shall discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer $p$ is chosen to be $\lambda\|\cdot\|_1$ for some $\lambda > 0$. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction $d^j$ from the linear system (24). So we shall first discuss the solving of this linear system.

Let $(x, y) \in \Re^n \times \Re^m$ and $\sigma > 0$ be given. We consider the following Newton linear system:

$$(25) \qquad (H + \sigma A U A^T)d = -\nabla\psi(y),$$

where $H \in \partial(\nabla h^*)(y)$, $A$ denotes the matrix representation of $\mathcal{A}$ with respect to the standard bases of $\Re^n$ and $\Re^m$, and $U \in \partial{\rm Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ with $\tilde{x} := x - \sigma(A^T y - c)$. Since $H$ is a symmetric and positive definite matrix, (25) can be equivalently rewritten as

$$\big(I_m + \sigma(L^{-1}A)U(L^{-1}A)^T\big)(L^T d) = -L^{-1}\nabla\psi(y),$$

where $L$ is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of $H$ such that $H = LL^T$. In many applications, $H$ is usually a sparse matrix. Indeed, when the function $h$ in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrices $H$ are in fact diagonal matrices. That is, the costs of computing $L$ and its inverse are negligible in most situations. Therefore, without loss of generality, we can consider a simplified version of (25) as follows:

$$(26) \qquad (I_m + \sigma A U A^T)d = -\nabla\psi(y),$$

which is precisely the Newton system associated with the standard lasso problem (1). Since $U \in \Re^{n\times n}$ is a diagonal matrix, at first glance, the costs of computing $AUA^T$ and the matrix-vector multiplication $AUA^T d$ for a given vector $d \in \Re^m$ are $O(m^2 n)$ and $O(mn)$, respectively. These computational costs are too expensive when the dimensions of $A$ are large and can make the commonly employed approaches, such as the Cholesky factorization and the conjugate gradient method, inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of $U$ is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level such that they are negligible or at least insignificant compared to other costs. Next we shall show how this can be done by taking full advantage of the sparsity of $U$. This sparsity will be referred to as the second order sparsity of the underlying problem.

For $\tilde{x} = x - \sigma(A^T y - c)$, in our computations, we can always choose $U = {\rm Diag}(u)$, the diagonal matrix whose $i$th diagonal element is given by $u_i$ with

$$u_i = \begin{cases} 0 & {\rm if}\ |\tilde{x}_i| \le \sigma\lambda, \\ 1 & {\rm otherwise}, \end{cases} \qquad i = 1, \ldots, n.$$

Since ${\rm Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x}) = {\rm sign}(\tilde{x}) \circ \max\{|\tilde{x}| - \sigma\lambda,\, 0\}$, it is not difficult to see that $U \in \partial{\rm Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$. Let $J := \{j \mid |\tilde{x}_j| > \sigma\lambda,\ j = 1, \ldots, n\}$ and $r := |J|$, the cardinality of $J$. By taking the special 0-1 structure of $U$ into consideration, we have that


Fig. 1. Reducing the computational costs from $O(m^2n)$ to $O(m^2r)$.

$$(27) \qquad AUA^T = (AU)(AU)^T = A_J A_J^T,$$

where $A_J \in \Re^{m\times r}$ is the submatrix of $A$ with those columns not in $J$ being removed from $A$. Then, by using (27), we know that now the costs of computing $AUA^T$ and $AUA^T d$ for a given vector $d$ are reduced to $O(m^2 r)$ and $O(mr)$, respectively. Due to the sparsity-promoting property of the regularizer $p$, the number $r$ is usually much smaller than $n$. Thus, by exploring the aforementioned second order sparsity, we can greatly reduce the computational costs in solving the linear system (26) when using the Cholesky factorization. More specifically, the total computational costs of using the Cholesky factorization to solve the linear system are reduced from $O(m^2(m+n))$ to $O(m^2(m+r))$. See Figure 1 for an illustration of the reduction. This means that even if $n$ happens to be extremely large (say, larger than $10^7$), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization as long as both $m$ and $r$ are moderate (say, less than $10^4$). If, in addition, $r \ll m$, which is often the case when $m$ is large and the optimal solutions to the underlying problem are sparse, instead of factorizing an $m\times m$ matrix, we can make use of the Sherman--Morrison--Woodbury formula [22] to get the inverse of $I_m + \sigma AUA^T$ by inverting a much smaller $r\times r$ matrix as follows:

$$(I_m + \sigma AUA^T)^{-1} = (I_m + \sigma A_J A_J^T)^{-1} = I_m - A_J(\sigma^{-1}I_r + A_J^T A_J)^{-1}A_J^T.$$

See Figure 2 for an illustration of the computation of $A_J^T A_J$. In this case, the total computational costs for solving the Newton linear system (26) are reduced significantly further, from $O(m^2(m+r))$ to $O(r^2(m+r))$. We should emphasize here that this dramatic reduction in computational costs results from the wise combination of the careful examination of the existing second order sparsity in the lasso-type problems and some "smart" numerical linear algebra.
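The following NumPy sketch illustrates the reduced solve just described: only the columns $A_J$ are formed, and the Sherman--Morrison--Woodbury identity above is applied when $r$ is small; the switching rule and the function name are illustrative assumptions, not part of our solver.

```python
import numpy as np

def newton_direction(A, grad_psi, x_tilde, sigma, lam):
    # Solve (I_m + sigma * A U A^T) d = -grad_psi exploiting second order sparsity;
    # x_tilde = x - sigma*(A^T y - c) determines the 0-1 diagonal U.
    m, _ = A.shape
    J = np.abs(x_tilde) > sigma * lam     # indices with u_j = 1
    AJ = A[:, J]                          # A_J in R^{m x r}, r = |J|
    r = AJ.shape[1]
    if r < m:
        # Sherman-Morrison-Woodbury: invert an r x r matrix instead of m x m
        W = np.eye(r) / sigma + AJ.T @ AJ          # sigma^{-1} I_r + A_J^T A_J
        return -grad_psi + AJ @ np.linalg.solve(W, AJ.T @ grad_psi)
    # otherwise factorize the m x m matrix I_m + sigma * A_J A_J^T directly
    V = np.eye(m) + sigma * (AJ @ AJ.T)
    return np.linalg.solve(V, -grad_psi)
```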

From the above arguments, we can see that as long as the number of nonzero components of ${\rm Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is small, say, less than $\sqrt{n}$, and $H_j \in \partial(\nabla h^*)(y^j)$ is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low costs. In particular, this is true for the lasso problems admitting sparse solutions.


Fig. 2. Further reducing the computational costs to $O(r^2m)$.

Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). Note that one may argue that, even if the original problem has only sparse solutions, at certain stages one may still encounter the situation that the number of nonzero components of ${\rm Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice since we always start with a sparse feasible point, e.g., the zero vector. Second, even if at certain steps this phenomenon does occur, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter $\sigma$ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with $p(\cdot)$ being chosen to be $\lambda\|\cdot\|_1$.

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two papers [18, 36], which appear to suggest that for some large-scale sparse reconstruction problems, mfIPM^1 and FPC_AS^2 have mostly outperformed the other solvers. Hence, in this section, we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP^3 [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparisons, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB with the step length set to be 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later, they lack the ability to efficiently solve difficult problems such as the large-scale regression problems when the data $\mathcal{A}$ is badly conditioned.

For testing purposes, the regularization parameter $\lambda$ in the lasso problem (1) is chosen as

$$\lambda = \lambda_c \|\mathcal{A}^* b\|_\infty,$$

where $0 < \lambda_c < 1$. In our numerical experiments, we measure the accuracy of an approximate optimal solution $x$ for (1) by using the following relative KKT residual:

$$\eta = \frac{\|x - {\rm prox}_{\lambda\|\cdot\|_1}\big(x - \mathcal{A}^*(\mathcal{A}x - b)\big)\|}{1 + \|x\| + \|\mathcal{A}x - b\|}.$$
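A direct transcription of this residual into NumPy reads as follows (the function name is ours, introduced only for illustration):

```python
import numpy as np

def relative_kkt_residual(A, b, lam, x):
    # eta = ||x - prox_{lam||.||_1}(x - A^T(Ax - b))|| / (1 + ||x|| + ||Ax - b||)
    r = A @ x - b
    v = x - A.T @ r
    prox = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
    return np.linalg.norm(x - prox) / (1.0 + np.linalg.norm(x) + np.linalg.norm(r))
```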

^1 http://www.maths.ed.ac.uk/ERGO/mfipmcs/
^2 http://www.caam.rice.edu/~optimization/L1/FPC_AS/
^3 http://yelab.net/software/SLEP/


For a given tolerance $\varepsilon > 0$, we will stop the tested algorithms when $\eta < \varepsilon$. For all the tests in this section, we set $\varepsilon = 10^{-6}$. The algorithms will also be stopped when they reach the maximum number of iterations (1000 iterations for our algorithm and mfIPM, and 20,000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to the default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core, Intel Xeon E5-2650 @ 2.60GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms with the test instances $(\mathcal{A}, b)$ obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in $A$ (the matrix representation of $\mathcal{A}$) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order 5 polynomial is used to generate the basis functions. This naming convention is also used in the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$, which is denoted as $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$.

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, $m$ denotes the number of samples, $n$ denotes the number of features, and "nnz" denotes the number of nonzeros in the solution $x$ obtained by Ssnal using the following estimation:

$${\rm nnz} := \min\Big\{k \;\Big|\; \sum_{i=1}^k |\hat{x}_i| \ge 0.999\|x\|_1\Big\},$$

where $\hat{x}$ is obtained by sorting $x$ such that $|\hat{x}_1| \ge \cdots \ge |\hat{x}_n|$. An entry marked as "Error" indicates that the algorithm breaks down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and the entry "00" in the time columns means that the computation time is less than 0.5 second.

One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances to the required accuracy after 20,000 iterations or 7 hours.

Table 1
Statistics of the UCI test instances.

Probname             m; n               $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$
E2006.train          16087; 150348      1.91e+05
log1p.E2006.train    16087; 4265669     5.86e+07
E2006.test           3308; 72812        4.79e+04
log1p.E2006.test     3308; 1771946      1.46e+07
pyrim5               74; 169911         1.22e+06
triazines4           186; 557845        2.07e+07
abalone7             4177; 6435         5.21e+05
bodyfat7             252; 116280        5.29e+04
housing7             506; 77520         3.28e+05
mpg7                 392; 3432          1.28e+04
space_ga9            3107; 5005         4.01e+03


Table 2
The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy $\varepsilon = 10^{-6}$). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM; for each problem (with its $m; n$) and each $\lambda_c$, the table reports nnz, the relative KKT residuals $\eta$, and the computation times of a | b | c | d | e | f.


In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances. In fact, for 3 test instances, it breaks down due to some internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with $\lambda_c = 10^{-3}$, we can see that Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20,000 iterations and consumes about 2 hours but only produces a rather inaccurate solution.

On the other hand, one can observe that the two second order information based methods, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds ($\lambda_c = 10^{-3}$). Between these two algorithms, clearly Ssnal is far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where the factor can be up to 300 times faster. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with the accuracy of $10^{-1}$ when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point method in exploiting the sparsity in the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix $A$ (the matrix representation of $\mathcal{A}$) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter $\lambda$ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when $\lambda$ is not a scalar, based on the recommendation of the referee, we use another popular active-set based solver, PSSas^4 [44], to replace FPC_AS for the testing. All the parameters for PSSas are set to the default values. In our tests, PSSas will be stopped when it reaches the maximum number of 20,000 iterations or the maximum computation time of 7 hours. We note that, by default, PSSas will terminate when its progress is smaller than the threshold $10^{-9}$.

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions based on Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM type of solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy. In fact, PSSas fails on all the test instances when $\lambda_c = 10^{-4}$. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with $\eta = 3.2\times 10^{-3}$. (Actually, we also run PSSas on these test instances without scaling

^4 https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3
The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy $\varepsilon = 10^{-6}$). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM; for each problem (with its $m; n$) and each $\lambda_c$, the table reports nnz, the relative KKT residuals $\eta$, and the computation times of a | b | c1 | d | e | f.


and obtain similar performances. Detailed results are omitted to conserve space.) Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for the Sparco collection. In this subsection, the test instances $(\mathcal{A}, b)$ are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce 60dB noise to $b$ (as in [18]) by using the MATLAB command b = awgn(b,60,'measured'). For these test instances, the matrix representations of the linear maps $\mathcal{A}$ are not available. Hence, ADMM will not be tested in this subsection, as it will be extremely expensive, if not impossible, to compute and factorize $I + \sigma\mathcal{A}\mathcal{A}^*$.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of $\lambda_c$, i.e., $\lambda_c = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}$. As one can observe, as $\lambda_c$ decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions $(m, n)$ and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$ ($\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, as well as the computation times (in the format of hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well for these test instances with small Lipschitz constants $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. As such, Ssnal does not have a clear advantage as shown in Table 2. Moreover, since the matrix representations of the linear maps $\mathcal{A}$ and $\mathcal{A}^*$ involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases, with $\lambda_c = 10^{-3}$ and $10^{-4}$. We can observe that FPC_AS performs very well when it succeeds in obtaining a solution with the desired accuracy.

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy $\varepsilon = 10^{-6}$, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

λc     nnz     Iteration (a | b | c | d | e)     η (a | b | c | d | e)                      Time (a | b | c | d | e)

srcsep1: m = 29166, n = 57344, λmax(AA*) = 3.56
0.16   380     14 | 28 | 42 | 401 | 89           6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7      07 | 09 | 05 | 15 | 07
0.10   726     16 | 38 | 42 | 574 | 161          4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7      11 | 15 | 05 | 22 | 13
0.06   1402    19 | 41 | 63 | 801 | 393          1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7      18 | 18 | 10 | 30 | 32
0.03   2899    19 | 56 | 110 | 901 | 337         2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7      28 | 53 | 16 | 33 | 28
0.01   6538    17 | 88 | 223 | 1401 | 542        7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7      1:21 | 2:15 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, λmax(AA*) = 4.95
0.16   423     15 | 29 | 42 | 501 | 127          3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7      14 | 13 | 07 | 25 | 15
0.10   805     16 | 37 | 84 | 601 | 212          8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7      21 | 19 | 16 | 30 | 26
0.06   1549    19 | 40 | 84 | 901 | 419          1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7      32 | 28 | 18 | 44 | 50
0.03   3254    20 | 69 | 128 | 901 | 488         1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7      1:06 | 1:33 | 26 | 44 | 59
0.01   7400    21 | 94 | 259 | 2201 | 837        8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7      1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy $\varepsilon = 10^{-6}$, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

Probname (m; n)            λc        nnz      η (a | b | c | d | e)                      Time (a | b | c | d | e)
blknheavi (1024; 1024)     10^{-3}   12       5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7      01 | 01 | 55 | 08 | 06
                           10^{-4}   12       9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7      01 | 01 | 49 | 07 | 07
srcsep1 (29166; 57344)     10^{-3}   14066    1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7      5:41 | 42:34 | 13:25 | 1:56 | 4:16
                           10^{-4}   19306    9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7      9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
srcsep2 (29166; 86016)     10^{-3}   16502    3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7      9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
                           10^{-4}   22315    7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7      19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
srcsep3 (196608; 196608)   10^{-3}   27314    6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7      33 | 6:24 | 8:51 | 47 | 49
                           10^{-4}   83785    9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7      2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
soccer1 (3200; 4096)       10^{-3}   4        1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7      01 | 03 | 13:51 | 2:35 | 02
                           10^{-4}   8        8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7      01 | 02 | 13:23 | 3:07 | 02
soccer2 (3200; 4096)       10^{-3}   4        3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7      00 | 03 | 13:46 | 1:40 | 02
                           10^{-4}   8        2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7      01 | 03 | 13:27 | 3:07 | 02
blurrycam (65536; 65536)   10^{-3}   1694     1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7      03 | 09 | 03 | 02 | 07
                           10^{-4}   5630     1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7      05 | 1:35 | 08 | 03 | 29
blurspike (16384; 16384)   10^{-3}   1954     3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7      03 | 05 | 6:38 | 03 | 27
                           10^{-4}   11698    3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7      10 | 08 | 6:43 | 05 | 35

However, it is not robust in that it fails to solve 4 out of 8 and 5 out of 8 problems when $\lambda_c = 10^{-3}$ and $10^{-4}$, respectively. For a few cases, FPC_AS can only achieve a poor accuracy ($10^{-1}$). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi ($m = n = 1024$), whose corresponding $\lambda_{\max}(\mathcal{A}\mathcal{A}^*) = 709$. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases, LADMM requires much more time than Ssnal. One can also observe that, for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$, one can safely conclude that the first order algorithms can only be used to solve problems with relatively small $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. Moreover, in order to obtain efficient and robust algorithms for lasso problems or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotically superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving $\ell_1$-regularized least squares (LS) problems.


With the intelligent incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale $\ell_1$-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and the anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candès, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrman, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex $\ell_1$-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Tech., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationnelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.
[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wächter, A second-order method for convex $\ell_1$-regularized optimization with active-set prediction, Optim. Methods Softw., 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed $C^{1,1}$ optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, in NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A two-phase proximal augmented Lagrangian method for convex quadratic semidefinite programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for $\ell_1$-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate $O(1/k^2)$, Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with $\ell_1$-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.
[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact regularized proximal Newton method: Provable convergence guarantees for non-smooth convex minimization without strong convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.
[56] H. Zhou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.
[57] H. Zhou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.

[58] Z Zhou and A M-C So A unified approach to error bounds for structured convex optimiza-tion problems Math Program 165 (2017) pp 689ndash728

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 8: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

440 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

By (12), (13), and (16), we know that $(\xi_1,\xi_2)\in\mathcal{C}(\bar y,\bar z)$. Since $C\cap L$ is a polyhedral convex cone, we know from [40, Theorem 19.3] that $\mathcal{A}(C\cap L)$ is also a polyhedral convex cone, which, together with (10), implies
\[
(\nabla h^*)'(\bar y;\xi_1)\in \mathcal{A}(C\cap L).
\]
Therefore, there exists a vector $\eta\in C\cap L$ such that
\[
\langle \xi_1, (\nabla h^*)'(\bar y;\xi_1)\rangle = \langle \xi_1, \mathcal{A}\eta\rangle = -\langle \xi_2,\eta\rangle = 0,
\]
where the last equality follows from the fact that $\xi_2\in C\cap L^{\perp}$. Note that the last inclusion holds since the polyhedral convex cone $C\cap L^{\perp}$ is closed and
\[
\xi_2=\lim_{k\rightarrow\infty}\frac{z^{k}-\bar z}{t_{k}}\in C\cap L^{\perp}.
\]
As $0\neq\xi=(\xi_1,\xi_2)\in\mathcal{C}(\bar y,\bar z)$ but $\langle \xi_1,(\nabla h^*)'(\bar y;\xi_1)\rangle=0$, this contradicts the assumption that the second order sufficient condition holds for (4) at $(\bar y,\bar z)$. Thus we have proved our claim for Case I.

Case II. There exists a nonempty polyhedral convex set $P$ such that $p^*(z)=\delta_P(z)$ for all $z\in\mathcal{X}$. Then we know that, for each $k$,
\[
z^{k}=\Pi_P(z^{k}+x^{k}).
\]
Since $(\delta_P)'(\bar z; d_2)=0$ for all $d_2\in\mathcal{T}_{\mathrm{dom}(p^*)}(\bar z)$, the critical cone in (7) now takes the following form:
\[
\mathcal{C}(\bar y,\bar z)=\bigl\{(d_1,d_2)\in\mathcal{Y}\times\mathcal{X}\mid \mathcal{A}^*d_1+d_2=0,\ \langle\nabla h^*(\bar y),d_1\rangle=0,\ d_2\in\mathcal{T}_{\mathrm{dom}(p^*)}(\bar z)\bigr\}.
\]
Similar to Case I, without loss of generality, we can assume that there exists a subspace $L$ such that, for all $k\ge 1$,
\[
C\cap L\ni (z^{k}-\bar z)\perp (x^{k}-\bar x)\in C\cap L^{\perp},
\]
where
\[
C\equiv\mathcal{C}_P(\bar z+\bar x)=\mathcal{T}_P(\bar z)\cap \bar x^{\perp}.
\]
Since $C\cap L$ is a polyhedral convex cone, we know
\[
\xi_2=\lim_{k\rightarrow\infty}\frac{z^{k}-\bar z}{t_{k}}\in C\cap L,
\]
and consequently $\langle\xi_2,\bar x\rangle=0$, which, together with (12) and (13), implies $\xi=(\xi_1,\xi_2)\in\mathcal{C}(\bar y,\bar z)$. By (10) and the fact that $\mathcal{A}(C\cap L^{\perp})$ is a polyhedral convex cone, we know that
\[
(\nabla h^*)'(\bar y;\xi_1)\in\mathcal{A}(C\cap L^{\perp}).
\]
Therefore, there exists a vector $\eta\in C\cap L^{\perp}$ such that
\[
\langle \xi_1,(\nabla h^*)'(\bar y;\xi_1)\rangle=\langle\xi_1,\mathcal{A}\eta\rangle=-\langle\xi_2,\eta\rangle=0.
\]
Since $\xi=(\xi_1,\xi_2)\neq 0$, we arrive at a contradiction to the assumed second order sufficient condition. So our claim is also true for this case.

In summary, we have proven that there exists a neighborhood $U$ of $(\bar x,\bar y,\bar z)$ such that, for any $w$ close enough to the origin, (9) holds for any solution


$(x(w),y(w),z(w))\in U$ to the perturbed KKT system (8). Next we show that $\mathcal{T}_l$ is metrically subregular at $(\bar y,\bar z,\bar x)$ for the origin.

Define the mapping $\Theta_{\rm KKT}:\mathcal{X}\times\mathcal{Y}\times\mathcal{X}\times\mathcal{W}\rightarrow\mathcal{Y}\times\mathcal{X}\times\mathcal{X}$ by
\[
\Theta_{\rm KKT}(x,y,z,w)=\begin{pmatrix}\nabla h^*(y)-\mathcal{A}x-u_1\\ z-\mathrm{Prox}_{p^*}(z+x+u_2)\\ \mathcal{A}^*y+z-c-v\end{pmatrix}\qquad\forall\,(x,y,z,w)\in\mathcal{X}\times\mathcal{Y}\times\mathcal{X}\times\mathcal{W},
\]
and define the mapping $\theta:\mathcal{X}\rightarrow\mathcal{Y}\times\mathcal{X}\times\mathcal{X}$ as follows:
\[
\theta(x):=\Theta_{\rm KKT}(x,\bar y,\bar z,0)\qquad\forall\,x\in\mathcal{X}.
\]
Then we have $x\in\mathcal{M}(\bar y,\bar z)$ if and only if $\theta(x)=0$. Since $\mathrm{Prox}_{p^*}(\cdot)$ is a piecewise affine function, $\theta(\cdot)$ is a piecewise affine function and thus a polyhedral multifunction. By using Proposition 2.2 and shrinking the neighborhood $U$ if necessary, for any $w$ close enough to the origin and any solution $(x(w),y(w),z(w))\in U$ of the perturbed KKT system (8), we have
\begin{align*}
\mathrm{dist}(x(w),\mathcal{M}(\bar y,\bar z)) &= O(\|\theta(x(w))\|)\\
&= O(\|\Theta_{\rm KKT}(x(w),\bar y,\bar z,0)-\Theta_{\rm KKT}(x(w),y(w),z(w),w)\|)\\
&= O(\|w\|+\|(y(w),z(w))-(\bar y,\bar z)\|),
\end{align*}
which, together with (9), implies the existence of a constant $\kappa\ge 0$ such that
\[
(17)\qquad \|(y(w),z(w))-(\bar y,\bar z)\|+\mathrm{dist}(x(w),\mathcal{M}(\bar y,\bar z))\le\kappa\|w\|.
\]
Thus, by Definition 2.3, we have proven that $\mathcal{T}_l$ is metrically subregular at $(\bar y,\bar z,\bar x)$ for the origin. The proof of the theorem is complete.

Remark 2.8. For convex piecewise linear-quadratic programming problems, such as the $\ell_1$ and elastic net regularized LS problems, we know from J. Sun's thesis [46] on the characterization of the subdifferentials of convex piecewise linear-quadratic functions (see also [43, Proposition 12.30]) that the corresponding operators $\mathcal{T}_l$ and $\mathcal{T}_f$ are polyhedral multifunctions, and thus, by Proposition 2.2, the error bound condition holds. Here we emphasize again that the error bound condition holds for the lasso problem (1). Moreover, $\mathcal{T}_f$ associated with the $\ell_1$ or elastic net regularized logistic regression model, i.e., where for a given vector $b\in\Re^m$ the loss function $h:\Re^m\rightarrow\Re$ in problem (3) takes the form
\[
h(y)=\sum_{i=1}^{m}\log(1+e^{-b_iy_i})\qquad\forall\,y\in\Re^m,
\]
also satisfies the error bound condition [33, 47]. Meanwhile, from Theorem 2.7, we know that $\mathcal{T}_l$ corresponding to the $\ell_1$ or the elastic net regularized logistic regression model is metrically subregular at any solution to the KKT system (6) for the origin.

3. An augmented Lagrangian method with asymptotic superlinear convergence. Recall the general convex composite model (3),
\[
(\mathrm{P})\qquad \max\ \bigl\{-\bigl(f(x):=h(\mathcal{A}x)-\langle c,x\rangle+p(x)\bigr)\bigr\},
\]
and its dual,
\[
(\mathrm{D})\qquad \min\ \bigl\{h^*(y)+p^*(z)\mid \mathcal{A}^*y+z=c\bigr\}.
\]


In this section we shall propose an asymptotically superlinearly convergent augmented Lagrangian method for solving (P) and (D). We make the following standing assumptions regarding the function $h$.

Assumption 3.1.
(a) $h:\mathcal{Y}\rightarrow\Re$ is a convex differentiable function whose gradient is $1/\alpha_h$-Lipschitz continuous, i.e.,
\[
\|\nabla h(y')-\nabla h(y)\|\le(1/\alpha_h)\|y'-y\|\qquad\forall\,y',y\in\mathcal{Y}.
\]
(b) $h$ is essentially locally strongly convex [21], i.e., for any compact and convex set $K\subset\mathrm{dom}\,\partial h$, there exists $\beta_K>0$ such that
\[
(1-\lambda)h(y')+\lambda h(y)\ge h((1-\lambda)y'+\lambda y)+\frac{1}{2}\beta_K\lambda(1-\lambda)\|y'-y\|^2\qquad\forall\,y',y\in K
\]
for all $\lambda\in[0,1]$.

Many commonly used loss functions in the machine learning literature satisfy the above mild assumptions; for example, $h$ can be the loss function in the linear regression, logistic regression, and Poisson regression models. While the strict convexity of a convex function is closely related to the differentiability of its conjugate, the essential local strong convexity in Assumption 3.1(b) for a convex function was first proposed in [21] to obtain a characterization of the local Lipschitz continuity of the gradient of its conjugate function.

The aforementioned assumptions on $h$ imply the following useful properties of $h^*$. First, by [43, Proposition 12.60], we know that $h^*$ is strongly convex with modulus $\alpha_h$. Second, by [21, Corollary 4.4], we know that $h^*$ is essentially smooth and $\nabla h^*$ is locally Lipschitz continuous on $\mathrm{int}\,(\mathrm{dom}\,h^*)$. If the solution set to the KKT system associated with (P) and (D) is further assumed to be nonempty, then, similar to what we have discussed in the last section, one only needs to focus on $\mathrm{int}\,(\mathrm{dom}\,h^*)\times\mathcal{X}$ when solving (D). Given $\sigma>0$, the augmented Lagrangian function associated with (D) is given by
\[
\mathcal{L}_{\sigma}(y,z;x):=l(y,z;x)+\frac{\sigma}{2}\|\mathcal{A}^*y+z-c\|^2\qquad\forall\,(y,z,x)\in\mathcal{Y}\times\mathcal{X}\times\mathcal{X},
\]
where the Lagrangian function $l(\cdot)$ is defined in (5).

3.1. SSNAL: A semismooth Newton augmented Lagrangian algorithm for (D). The detailed steps of our algorithm Ssnal for solving (D) are given as follows. Since a semismooth Newton method will be employed to solve the subproblems involved in this prototype of the inexact augmented Lagrangian method [42], it is natural for us to call our algorithm Ssnal.

Algorithm SSNAL: An inexact augmented Lagrangian method for (D).
Let $\sigma_0>0$ be a given parameter. Choose $(y^0,z^0,x^0)\in\mathrm{int}(\mathrm{dom}\,h^*)\times\mathrm{dom}\,p^*\times\mathcal{X}$. For $k=0,1,\ldots$, perform the following steps in each iteration.
Step 1. Compute
\[
(18)\qquad (y^{k+1},z^{k+1})\approx\arg\min\bigl\{\Psi_k(y,z):=\mathcal{L}_{\sigma_k}(y,z;x^k)\bigr\}.
\]
Step 2. Compute $x^{k+1}=x^k-\sigma_k(\mathcal{A}^*y^{k+1}+z^{k+1}-c)$ and update $\sigma_{k+1}\uparrow\sigma_{\infty}\le\infty$.
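To make the outer iteration concrete, the following is a minimal MATLAB sketch of Ssnal specialized to the lasso problem (1), where $h(w)=\frac12\|w-b\|^2$, $p=\lambda\|\cdot\|_1$, and $c=0$, so that $\mathrm{Prox}_{p^*/\sigma}$ is the projection onto $\{z:\|z\|_\infty\le\lambda\}$. The inner solver `ssn_lasso` stands for the semismooth Newton method of section 3.2 (a sketch of it is given there); the function names, the feasibility-based stopping test, and the simple doubling rule for $\sigma_k$ are illustrative choices, not the settings used in the experiments.

```matlab
% A minimal sketch of the SSNAL outer loop for the lasso dual (c = 0).
% Assumes a dense A (m x n) and an inner solver ssn_lasso (illustrative name).
function [x, y, z] = ssnal_lasso(A, b, lam, sigma, maxit)
  [m, n] = size(A);
  x = zeros(n, 1); y = zeros(m, 1);
  for k = 1:maxit
    % Step 1: solve the subproblem (18) approximately in y, then recover z
    y = ssn_lasso(A, b, x, lam, sigma, y);
    z = max(min(x/sigma - A'*y, lam), -lam);   % Prox_{p*/sigma} = projection onto ||z||_inf <= lam
    % Step 2: multiplier update and penalty update
    x = x - sigma*(A'*y + z);
    if norm(A'*y + z) <= 1e-6*(1 + norm(x)), break; end   % a simple feasibility-based stop
    sigma = min(2*sigma, 1e6);                            % sigma_k increasing toward sigma_inf
  end
end
```

Note that, with this choice of $z^{k+1}$, Step 2 reduces to $x^{k+1}=\mathrm{Prox}_{\sigma_k\lambda\|\cdot\|_1}(x^k-\sigma_k\mathcal{A}^*y^{k+1})$ by the Moreau identity, i.e., a soft-thresholding step on the multiplier.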

Next we shall adapt the results developed in [41, 42, 34] to establish the global and local superlinear convergence of our algorithm.


Since the inner problems cannot be solved exactly, we use the following standard stopping criterion studied in [41, 42] for approximately solving (18):
\[
(\mathrm{A})\qquad \Psi_k(y^{k+1},z^{k+1})-\inf\Psi_k\le\varepsilon_k^2/(2\sigma_k),\qquad\sum_{k=0}^{\infty}\varepsilon_k<\infty.
\]
Now we can state the global convergence of Algorithm Ssnal from [41, 42] without much difficulty.

Theorem 3.2. Suppose that Assumption 3.1 holds and that the solution set to (P) is nonempty. Let $\{(y^k,z^k,x^k)\}$ be the infinite sequence generated by Algorithm Ssnal with stopping criterion (A). Then the sequence $\{x^k\}$ is bounded and converges to an optimal solution of (P). In addition, $\{(y^k,z^k)\}$ is also bounded and converges to the unique optimal solution $(y^*,z^*)\in\mathrm{int}(\mathrm{dom}\,h^*)\times\mathrm{dom}\,p^*$ of (D).

Proof. Since the solution set to (P) is assumed to be nonempty, the optimal value of (P) is finite. From Assumption 3.1(a), we have that $\mathrm{dom}\,h=\mathcal{Y}$ and $h^*$ is strongly convex [43, Proposition 12.60]. Then, by Fenchel's duality theorem [40, Corollary 31.2.1], we know that the solution set to (D) is nonempty and the optimal value of (D) is finite and equal to the optimal value of (P). That is, the solution set to the KKT system associated with (P) and (D) is nonempty. The uniqueness of the optimal solution $(y^*,z^*)\in\mathrm{int}(\mathrm{dom}\,h^*)\times\mathcal{X}$ of (D) then follows directly from the strong convexity of $h^*$. By combining this uniqueness with [42, Theorem 4], one can readily obtain the boundedness of $\{(y^k,z^k)\}$ and the other desired results.

We need the following stopping criteria for the local convergence analysis:
\[
(\mathrm{B1})\qquad \Psi_k(y^{k+1},z^{k+1})-\inf\Psi_k\le(\delta_k^2/(2\sigma_k))\|x^{k+1}-x^k\|^2,\qquad \sum_{k=0}^{\infty}\delta_k<+\infty,
\]
\[
(\mathrm{B2})\qquad \mathrm{dist}(0,\partial\Psi_k(y^{k+1},z^{k+1}))\le(\delta_k'/\sigma_k)\|x^{k+1}-x^k\|,\qquad 0\le\delta_k'\rightarrow 0,
\]
where $x^{k+1}=x^k-\sigma_k(\mathcal{A}^*y^{k+1}+z^{k+1}-c)$ as in Step 2 of Algorithm Ssnal.

Theorem 3.3. Assume that Assumption 3.1 holds and that the solution set $\Omega$ to (P) is nonempty. Suppose that $\mathcal{T}_f$ satisfies the error bound condition (2) for the origin with modulus $a_f$. Let $\{(y^k,z^k,x^k)\}$ be any infinite sequence generated by Algorithm Ssnal with stopping criteria (A) and (B1). Then the sequence $\{x^k\}$ converges to $x^*\in\Omega$ and, for all $k$ sufficiently large,
\[
(19)\qquad \mathrm{dist}(x^{k+1},\Omega)\le\theta_k\,\mathrm{dist}(x^k,\Omega),
\]
where $\theta_k=\bigl(a_f(a_f^2+\sigma_k^2)^{-1/2}+2\delta_k\bigr)(1-\delta_k)^{-1}\rightarrow\theta_{\infty}=a_f(a_f^2+\sigma_{\infty}^2)^{-1/2}<1$ as $k\rightarrow+\infty$. Moreover, the sequence $\{(y^k,z^k)\}$ converges to the unique optimal solution $(y^*,z^*)\in\mathrm{int}(\mathrm{dom}\,h^*)\times\mathrm{dom}\,p^*$ to (D).

Moreover, if $\mathcal{T}_l$ is metrically subregular at $(y^*,z^*,x^*)$ for the origin with modulus $a_l$ and the stopping criterion (B2) is also used, then for all $k$ sufficiently large,
\[
\|(y^{k+1},z^{k+1})-(y^*,z^*)\|\le\theta_k'\|x^{k+1}-x^k\|,
\]
where $\theta_k'=a_l(1+\delta_k')/\sigma_k$ with $\lim_{k\rightarrow\infty}\theta_k'=a_l/\sigma_{\infty}$.

Proof. The first part of the theorem follows from [34, Theorem 2.1], [42, Proposition 7 and Theorem 5], and Theorem 3.2. To prove the second part, we recall that


if $\mathcal{T}_l$ is metrically subregular at $(y^*,z^*,x^*)$ for the origin with the modulus $a_l$ and $(y^k,z^k,x^k)\rightarrow(y^*,z^*,x^*)$, then for all $k$ sufficiently large,
\[
\|(y^{k+1},z^{k+1})-(y^*,z^*)\|+\mathrm{dist}(x^{k+1},\Omega)\le a_l\,\mathrm{dist}(0,\mathcal{T}_l(y^{k+1},z^{k+1},x^{k+1})).
\]
Therefore, by the estimate (4.21) in [42] and the stopping criterion (B2), we obtain that, for all $k$ sufficiently large,
\[
\|(y^{k+1},z^{k+1})-(y^*,z^*)\|\le a_l(1+\delta_k')/\sigma_k\,\|x^{k+1}-x^k\|.
\]
This completes the proof of Theorem 3.3.

Remark 3.4. In the above theorem, when $\sigma_{\infty}=+\infty$, the inequality (19) directly implies that $\{x^k\}$ converges Q-superlinearly. Moreover, recent advances in [12] also established the asymptotic R-superlinear convergence of the dual iteration sequence $\{(y^k,z^k)\}$. Indeed, from [12, Proposition 4.1], under the same conditions as in Theorem 3.3, we have that, for $k$ sufficiently large,
\[
(20)\qquad \|(y^{k+1},z^{k+1})-(y^*,z^*)\|\le\theta_k'\|x^{k+1}-x^k\|\le\theta_k'(1-\delta_k)^{-1}\mathrm{dist}(x^k,\Omega),
\]
where $\theta_k'(1-\delta_k)^{-1}\rightarrow a_l/\sigma_{\infty}$. Thus, if $\sigma_{\infty}=\infty$, the Q-superlinear convergence of $\{x^k\}$ and (20) further imply that $\{(y^k,z^k)\}$ converges R-superlinearly.

We should emphasize here that, by combining Remarks 2.8 and 3.4 and Theorem 3.3, our Algorithm Ssnal is guaranteed to produce an asymptotically superlinearly convergent sequence when used to solve (D) for many commonly used regularizers and loss functions. In particular, the Ssnal algorithm is asymptotically superlinearly convergent when applied to the dual of (1).

3.2. Solving the augmented Lagrangian subproblems. Here we shall propose an efficient semismooth Newton algorithm to solve the inner subproblems in the augmented Lagrangian method (18). That is, for some fixed $\sigma>0$ and $x\in\mathcal{X}$, we aim to solve
\[
(21)\qquad \min_{y,z}\ \Psi(y,z):=\mathcal{L}_{\sigma}(y,z;x).
\]
Since $\Psi(\cdot,\cdot)$ is a strongly convex function, we have that, for any $\alpha\in\Re$, the level set $\mathcal{L}_{\alpha}:=\{(y,z)\in\mathrm{dom}\,h^*\times\mathrm{dom}\,p^*\mid\Psi(y,z)\le\alpha\}$ is a closed and bounded convex set. Moreover, problem (21) admits a unique optimal solution, denoted as $(\bar y,\bar z)\in\mathrm{int}(\mathrm{dom}\,h^*)\times\mathrm{dom}\,p^*$.

Denote, for any $y\in\mathcal{Y}$,
\[
\psi(y):=\inf_{z}\Psi(y,z)=h^*(y)+p^*\bigl(\mathrm{Prox}_{p^*/\sigma}(x/\sigma-\mathcal{A}^*y+c)\bigr)+\frac{1}{2\sigma}\|\mathrm{Prox}_{\sigma p}(x-\sigma(\mathcal{A}^*y-c))\|^2-\frac{1}{2\sigma}\|x\|^2.
\]
Therefore, if $(\bar y,\bar z)=\arg\min\Psi(y,z)$, then $(\bar y,\bar z)\in\mathrm{int}(\mathrm{dom}\,h^*)\times\mathrm{dom}\,p^*$ can be computed simultaneously by
\[
\bar y=\arg\min\psi(y),\qquad \bar z=\mathrm{Prox}_{p^*/\sigma}(x/\sigma-\mathcal{A}^*\bar y+c).
\]


Note that $\psi(\cdot)$ is strongly convex and continuously differentiable on $\mathrm{int}(\mathrm{dom}\,h^*)$ with
\[
\nabla\psi(y)=\nabla h^*(y)-\mathcal{A}\,\mathrm{Prox}_{\sigma p}(x-\sigma(\mathcal{A}^*y-c))\qquad\forall\,y\in\mathrm{int}(\mathrm{dom}\,h^*).
\]
Thus $\bar y$ can be obtained via solving the following nonsmooth equation:
\[
(22)\qquad \nabla\psi(y)=0,\qquad y\in\mathrm{int}(\mathrm{dom}\,h^*).
\]
Let $y\in\mathrm{int}(\mathrm{dom}\,h^*)$ be any given point. Since $h^*$ is a convex function with a locally Lipschitz continuous gradient on $\mathrm{int}(\mathrm{dom}\,h^*)$, the following operator is well defined:
\[
\hat\partial^2\psi(y):=\partial(\nabla h^*)(y)+\sigma\mathcal{A}\,\partial\mathrm{Prox}_{\sigma p}(x-\sigma(\mathcal{A}^*y-c))\,\mathcal{A}^*,
\]
where $\partial(\nabla h^*)(y)$ is the Clarke subdifferential of $\nabla h^*$ at $y$ [10], and $\partial\mathrm{Prox}_{\sigma p}(x-\sigma(\mathcal{A}^*y-c))$ is the Clarke subdifferential of the Lipschitz continuous mapping $\mathrm{Prox}_{\sigma p}(\cdot)$ at $x-\sigma(\mathcal{A}^*y-c)$. Note that, from [10, Proposition 2.3.3 and Theorem 2.6.6], we know that
\[
\partial^2\psi(y)\,(d)\subseteq\hat\partial^2\psi(y)\,(d)\qquad\forall\,d\in\mathcal{Y},
\]
where $\partial^2\psi(y)$ denotes the generalized Hessian of $\psi$ at $y$. Define
\[
(23)\qquad V:=H+\sigma\mathcal{A}U\mathcal{A}^*
\]
with $H\in\partial^2h^*(y)$ and $U\in\partial\mathrm{Prox}_{\sigma p}(x-\sigma(\mathcal{A}^*y-c))$. Then we have $V\in\hat\partial^2\psi(y)$. Since $h^*$ is a strongly convex function, we know that $H$ is symmetric positive definite on $\mathcal{Y}$, and thus $V$ is also symmetric positive definite on $\mathcal{Y}$.

Under the mild assumption that $\nabla h^*$ and $\mathrm{Prox}_{\sigma p}$ are strongly semismooth (the definition of which is given next), we can design a superlinearly convergent semismooth Newton method to solve the nonsmooth equation (22).

Definition 3.5 (semismoothness [35, 38, 45]). Let $F:\mathcal{O}\subseteq\mathcal{X}\rightarrow\mathcal{Y}$ be a locally Lipschitz continuous function on the open set $\mathcal{O}$. $F$ is said to be semismooth at $x\in\mathcal{O}$ if $F$ is directionally differentiable at $x$ and, for any $V\in\partial F(x+\Delta x)$ with $\Delta x\rightarrow 0$,
\[
F(x+\Delta x)-F(x)-V\Delta x=o(\|\Delta x\|).
\]
$F$ is said to be strongly semismooth at $x$ if $F$ is semismooth at $x$ and
\[
F(x+\Delta x)-F(x)-V\Delta x=O(\|\Delta x\|^2).
\]
$F$ is said to be a semismooth (respectively, strongly semismooth) function on $\mathcal{O}$ if it is semismooth (respectively, strongly semismooth) everywhere in $\mathcal{O}$.

Note that it is widely known in the nonsmooth optimization/equation community that continuous piecewise affine functions and twice continuously differentiable functions are all strongly semismooth everywhere. In particular, $\mathrm{Prox}_{\|\cdot\|_1}$, as a Lipschitz continuous piecewise affine function, is strongly semismooth. See [14] for more semismooth and strongly semismooth functions.

Now we can design a semismooth Newton (Ssn) method to solve (22) as follows, and we could expect to get fast superlinear or even quadratic convergence.


Algorithm SSN: A semismooth Newton algorithm for solving (22) (SSN($y^0,x,\sigma$)).
Given $\mu\in(0,1/2)$, $\eta\in(0,1)$, $\tau\in(0,1]$, and $\delta\in(0,1)$, choose $y^0\in\mathrm{int}(\mathrm{dom}\,h^*)$. Iterate the following steps for $j=0,1,\ldots$.
Step 1. Choose $H_j\in\partial(\nabla h^*)(y^j)$ and $U_j\in\partial\mathrm{Prox}_{\sigma p}(x-\sigma(\mathcal{A}^*y^j-c))$. Let $V_j:=H_j+\sigma\mathcal{A}U_j\mathcal{A}^*$. Solve the linear system
\[
(24)\qquad V_jd+\nabla\psi(y^j)=0
\]
exactly, or by the conjugate gradient (CG) algorithm to find $d^j$ such that
\[
\|V_jd^j+\nabla\psi(y^j)\|\le\min\bigl(\eta,\|\nabla\psi(y^j)\|^{1+\tau}\bigr).
\]
Step 2. (Line search) Set $\alpha_j=\delta^{m_j}$, where $m_j$ is the first nonnegative integer $m$ for which
\[
y^j+\delta^md^j\in\mathrm{int}(\mathrm{dom}\,h^*)\quad\text{and}\quad\psi(y^j+\delta^md^j)\le\psi(y^j)+\mu\delta^m\langle\nabla\psi(y^j),d^j\rangle.
\]
Step 3. Set $y^{j+1}=y^j+\alpha_jd^j$.
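For concreteness, the following is a minimal MATLAB sketch of Ssn specialized to the lasso subproblem, where $\nabla h^*(y)=y+b$, $H_j=I_m$, and $U_j$ is the 0-1 diagonal matrix described in section 3.3. The function name `ssn_lasso`, the fixed iteration cap, and the plain backtracking loop are illustrative simplifications; the constant term $-\|x\|^2/(2\sigma)$ of $\psi$ is dropped since it does not affect the line search.

```matlab
% A minimal sketch of SSN for the lasso inner problem (h(w) = 0.5*||w - b||^2,
% p = lam*||.||_1, c = 0); assumes a dense A. Names are illustrative.
function y = ssn_lasso(A, b, x, lam, sigma, y)
  mu = 1e-4; delta = 0.5;                          % line-search parameters
  for j = 1:50
    w  = x - sigma*(A'*y);                         % point where Prox_{sigma*lam*||.||_1} is evaluated
    pw = sign(w).*max(abs(w) - sigma*lam, 0);      % Prox_{sigma*lam*||.||_1}(w)
    g  = y + b - A*pw;                             % grad psi(y) = grad h*(y) - A*Prox(...)
    if norm(g) <= 1e-10, break; end
    J  = find(abs(w) > sigma*lam);                 % indices where U has a unit diagonal entry
    AJ = A(:, J);
    V  = eye(length(y)) + sigma*(AJ*AJ');          % V = H + sigma*A*U*A' with H = I_m
    d  = -(V\g);                                   % Newton direction (direct solve; CG also possible)
    psi0  = 0.5*norm(y)^2 + b'*y + norm(pw)^2/(2*sigma);
    alpha = 1;                                     % backtracking (Armijo) line search on psi
    while true
      yn = y + alpha*d;
      wn = x - sigma*(A'*yn);
      pn = sign(wn).*max(abs(wn) - sigma*lam, 0);
      if 0.5*norm(yn)^2 + b'*yn + norm(pn)^2/(2*sigma) ...
           <= psi0 + mu*alpha*(g'*d) || alpha < 1e-8
        break;                                     % accept the last trial point
      end
      alpha = delta*alpha;
    end
    y = yn;
  end
end
```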

The convergence results for the above Ssn algorithm are stated in the next theorem.

Theorem 3.6. Assume that $\nabla h^*(\cdot)$ and $\mathrm{Prox}_{\sigma p}(\cdot)$ are strongly semismooth on $\mathrm{int}(\mathrm{dom}\,h^*)$ and $\mathcal{X}$, respectively. Let the sequence $\{y^j\}$ be generated by Algorithm Ssn. Then $\{y^j\}$ converges to the unique optimal solution $\bar y\in\mathrm{int}(\mathrm{dom}\,h^*)$ of the problem in (22) and
\[
\|y^{j+1}-\bar y\|=O(\|y^j-\bar y\|^{1+\tau}).
\]

Proof. Since, by [54, Proposition 3.3], $d^j$ is a descent direction, Algorithm Ssn is well defined. By (23), we know that, for any $j\ge 0$, $V_j\in\hat\partial^2\psi(y^j)$. Then we can prove the conclusion of this theorem by following the proofs of [54, Theorems 3.4 and 3.5]. We omit the details here.

We shall now discuss the implementations of stopping criteria (A), (B1), and (B2) for Algorithm Ssn to solve the subproblem (18) in Algorithm Ssnal. Note that, when Algorithm Ssn is applied to minimize $\Psi_k(\cdot)$ to find
\[
y^{k+1}=\mathrm{Ssn}(y^k,x^k,\sigma_k)\quad\text{and}\quad z^{k+1}=\mathrm{Prox}_{p^*/\sigma_k}(x^k/\sigma_k-\mathcal{A}^*y^{k+1}+c),
\]
we have, by simple calculations and the strong convexity of $h^*$, that
\[
\Psi_k(y^{k+1},z^{k+1})-\inf\Psi_k=\psi_k(y^{k+1})-\inf\psi_k\le(1/(2\alpha_h))\|\nabla\psi_k(y^{k+1})\|^2
\]
and $(\nabla\psi_k(y^{k+1}),0)\in\partial\Psi_k(y^{k+1},z^{k+1})$, where $\psi_k(y):=\inf_z\Psi_k(y,z)$ for all $y\in\mathcal{Y}$. Therefore, the stopping criteria (A), (B1), and (B2) can be achieved by the following implementable criteria:
\[
(\mathrm{A}')\qquad \|\nabla\psi_k(y^{k+1})\|\le\sqrt{\alpha_h/\sigma_k}\,\varepsilon_k,\qquad\sum_{k=0}^{\infty}\varepsilon_k<\infty;
\]
\[
(\mathrm{B1}')\qquad \|\nabla\psi_k(y^{k+1})\|\le\sqrt{\alpha_h\sigma_k}\,\delta_k\|\mathcal{A}^*y^{k+1}+z^{k+1}-c\|,\qquad\sum_{k=0}^{\infty}\delta_k<+\infty;
\]
\[
(\mathrm{B2}')\qquad \|\nabla\psi_k(y^{k+1})\|\le\delta_k'\|\mathcal{A}^*y^{k+1}+z^{k+1}-c\|,\qquad 0\le\delta_k'\rightarrow 0.
\]


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as $\|\nabla\psi_k(y^{k+1})\|$ is sufficiently small.
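As a small illustration, the checks $(\mathrm{A}')$, $(\mathrm{B1}')$, and $(\mathrm{B2}')$ can be evaluated from two scalars, the gradient norm and the dual feasibility residual. The sketch below follows the sufficient conditions written above; the helper name and the assumption that the tolerance sequences $\varepsilon_k$, $\delta_k$, $\delta_k'$ are supplied by the caller are illustrative.

```matlab
% A sketch of the implementable inner stopping tests (A'), (B1'), (B2'),
% given gnorm = ||grad psi_k(y^{k+1})|| and feas = ||A'*y^{k+1} + z^{k+1} - c||.
function [okA, okB1, okB2] = inner_stop(gnorm, feas, sigma, alphah, epsk, deltak, deltapk)
  okA  = gnorm <= sqrt(alphah/sigma)*epsk;          % criterion (A')
  okB1 = gnorm <= sqrt(alphah*sigma)*deltak*feas;   % criterion (B1')
  okB2 = gnorm <= deltapk*feas;                     % criterion (B2')
end
```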

3.3. An efficient implementation of SSN for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we shall discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer $p$ is chosen to be $\lambda\|\cdot\|_1$ for some $\lambda>0$. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction $d^j$ from the linear system (24). So we shall first discuss the solving of this linear system.

Let $(x,y)\in\Re^n\times\Re^m$ and $\sigma>0$ be given. We consider the following Newton linear system:
\[
(25)\qquad (H+\sigma AUA^T)d=-\nabla\psi(y),
\]
where $H\in\partial(\nabla h^*)(y)$, $A$ denotes the matrix representation of $\mathcal{A}$ with respect to the standard bases of $\Re^n$ and $\Re^m$, and $U\in\partial\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde x)$ with $\tilde x:=x-\sigma(A^Ty-c)$. Since $H$ is a symmetric and positive definite matrix, (25) can be equivalently rewritten as
\[
\bigl(I_m+\sigma(L^{-1}A)U(L^{-1}A)^T\bigr)(L^Td)=-L^{-1}\nabla\psi(y),
\]
where $L$ is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of $H$ such that $H=LL^T$. In many applications, $H$ is usually a sparse matrix. Indeed, when the function $h$ in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrix $H$ is in fact a diagonal matrix. That is, the costs of computing $L$ and its inverse are negligible in most situations. Therefore, without loss of generality, we can consider a simplified version of (25) as follows:
\[
(26)\qquad (I_m+\sigma AUA^T)d=-\nabla\psi(y),
\]
which is precisely the Newton system associated with the standard lasso problem (1). Since $U\in\Re^{n\times n}$ is a diagonal matrix, at first glance, the costs of computing $AUA^T$ and the matrix-vector multiplication $AUA^Td$ for a given vector $d\in\Re^m$ are $O(m^2n)$ and $O(mn)$, respectively. These computational costs are too expensive when the dimensions of $A$ are large and can make the commonly employed approaches, such as the Cholesky factorization and the conjugate gradient method, inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of $U$ is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level such that they are negligible or at least insignificant compared to other costs. Next, we shall show how this can be done by taking full advantage of the sparsity of $U$. This sparsity will be referred to as the second order sparsity of the underlying problem.

For $\tilde x=x-\sigma(A^Ty-c)$, in our computations we can always choose $U=\mathrm{Diag}(u)$, the diagonal matrix whose $i$th diagonal element is given by $u_i$ with
\[
u_i=\begin{cases}0 & \text{if } |\tilde x_i|\le\sigma\lambda,\\ 1 & \text{otherwise,}\end{cases}\qquad i=1,\ldots,n.
\]
Since $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde x)=\mathrm{sign}(\tilde x)\circ\max(|\tilde x|-\sigma\lambda,0)$, it is not difficult to see that $U\in\partial\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde x)$. Let $J:=\{j\mid|\tilde x_j|>\sigma\lambda,\ j=1,\ldots,n\}$ and $r:=|J|$, the cardinality of $J$. By taking the special 0-1 structure of $U$ into consideration, we have that


\[
(27)\qquad AUA^T=(AU)(AU)^T=A_JA_J^T,
\]
where $A_J\in\Re^{m\times r}$ is the submatrix of $A$ with those columns not in $J$ being removed from $A$. Then, by using (27), we know that now the costs of computing $AUA^T$ and $AUA^Td$ for a given vector $d$ are reduced to $O(m^2r)$ and $O(mr)$, respectively. Due to the sparsity-promoting property of the regularizer $p$, the number $r$ is usually much smaller than $n$. Thus, by exploiting the aforementioned second order sparsity, we can greatly reduce the computational costs in solving the linear system (26) when using the Cholesky factorization. More specifically, the total computational costs of using the Cholesky factorization to solve the linear system are reduced from $O(m^2(m+n))$ to $O(m^2(m+r))$. See Figure 1 for an illustration of the reduction. [Fig. 1. Reducing the computational costs from $O(m^2n)$ to $O(m^2r)$; schematic omitted.] This means that, even if $n$ happens to be extremely large (say, larger than $10^7$), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization as long as both $m$ and $r$ are moderate (say, less than $10^4$). If, in addition, $r\ll m$, which is often the case when $m$ is large and the optimal solutions to the underlying problem are sparse, instead of factorizing an $m\times m$ matrix, we can make use of the Sherman–Morrison–Woodbury formula [22] to get the inverse of $I_m+\sigma AUA^T$ by inverting a much smaller $r\times r$ matrix as follows:

\[
(I_m+\sigma AUA^T)^{-1}=(I_m+\sigma A_JA_J^T)^{-1}=I_m-A_J(\sigma^{-1}I_r+A_J^TA_J)^{-1}A_J^T.
\]

See Figure 2 for an illustration of the computation of $A_J^TA_J$. [Fig. 2. Further reducing the computational costs to $O(r^2m)$; schematic omitted.] In this case, the total computational costs for solving the Newton linear system (26) are reduced significantly further, from $O(m^2(m+r))$ to $O(r^2(m+r))$. We should emphasize here that this dramatic reduction in computational costs results from the wise combination of the careful examination of the existing second order sparsity in the lasso-type problems and some "smart" numerical linear algebra.
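As a small illustration of the two solve strategies, the sketch below forms the index set $J$ and then solves $(I_m+\sigma A_JA_J^T)d=-\nabla\psi(y)$ either through the $r\times r$ system given by the Sherman–Morrison–Woodbury identity above or through a direct $m\times m$ factorization. The function name and the simple `r <= m` switch are illustrative choices rather than the rule used in the solver.

```matlab
% A sketch of computing the Newton direction for (26) by exploiting the
% 0-1 diagonal structure of U, i.e., A*U*A' = A_J*A_J'. Assumes dense A.
function d = newton_direction(A, xtil, sigma, lam, g)
  % g = grad psi(y); xtil = x - sigma*(A'*y - c) as in the text above
  m  = size(A, 1);
  J  = find(abs(xtil) > sigma*lam);      % active index set, r = |J|
  AJ = A(:, J);  r = numel(J);
  if r <= m
      % SMW route: (I + sigma*AJ*AJ')^{-1} = I - AJ*(Ir/sigma + AJ'*AJ)^{-1}*AJ'
      T = eye(r)/sigma + (AJ'*AJ);       % r x r symmetric positive definite
      d = -g + AJ*(T\(AJ'*g));
  else
      % direct route: factorize the m x m matrix I + sigma*AJ*AJ'
      d = -((eye(m) + sigma*(AJ*AJ'))\g);
  end
end
```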

From the above arguments, we can see that, as long as the number of the nonzero components of $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde x)$ is small, say less than $\sqrt{n}$, and $H_j\in\partial(\nabla h^*)(y^j)$ is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low costs. In particular, this is true for the lasso problems admitting sparse



solutions. Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). Note that one may argue that, even if the original problem has only sparse solutions, at certain stages one may still encounter the situation that the number of the nonzero components of $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde x)$ is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice since we always start with a sparse feasible point, e.g., the zero vector. Second, even if this phenomenon does occur at certain steps, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter $\sigma$ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with $p(\cdot)$ chosen to be $\lambda\|\cdot\|_1$.

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two recent papers [18, 36], which appear to suggest that, for some large-scale sparse reconstruction problems, mfIPM¹ and FPC_AS² have mostly outperformed the other solvers. Hence, in this section, we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP³ [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparisons, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB with the step length set to be 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later, they lack the ability to efficiently solve difficult problems such as the large-scale regression problems when the data $A$ is badly conditioned.

For testing purposes, the regularization parameter $\lambda$ in the lasso problem (1) is chosen as
\[
\lambda=\lambda_c\|\mathcal{A}^*b\|_{\infty},
\]
where $0<\lambda_c<1$. In our numerical experiments, we measure the accuracy of an approximate optimal solution $x$ for (1) by using the following relative KKT residual:
\[
\eta=\frac{\|x-\mathrm{prox}_{\lambda\|\cdot\|_1}(x-\mathcal{A}^*(\mathcal{A}x-b))\|}{1+\|x\|+\|\mathcal{A}x-b\|}.
\]
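For a matrix representation $A$ of $\mathcal{A}$, both quantities above can be computed in a few lines. This is a small sketch with an illustrative wrapper name, not the exact instrumentation used in the experiments; $\lambda$ would be set as `lam = lamc*norm(A'*b, inf)`.

```matlab
function eta = kkt_residual(A, b, x, lam)
  % Relative KKT residual for the lasso problem (1), for a given x and lambda.
  res = A*x - b;
  v   = x - A'*res;
  px  = sign(v) .* max(abs(v) - lam, 0);    % prox_{lam*||.||_1}(x - A'(Ax - b))
  eta = norm(x - px) / (1 + norm(x) + norm(res));
end
```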

¹ http://www.maths.ed.ac.uk/ERGO/mfipmcs/
² http://www.caam.rice.edu/~optimization/L1/FPC_AS/
³ http://yelab.net/software/SLEP/


For a given tolerance $\epsilon>0$, we will stop the tested algorithms when $\eta<\epsilon$. For all the tests in this section, we set $\epsilon=10^{-6}$. The algorithms will also be stopped when they reach the maximum number of iterations (1000 iterations for our algorithm and mfIPM, and 20000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to the default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core Intel Xeon E5-2650 @ 2.60GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms with the test instances $(A,b)$ obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in $A$ (the matrix representation of $\mathcal{A}$) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order 5 polynomial is used to generate the basis functions. This naming convention is also used in the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$, which is denoted as $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$.

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, $m$ denotes the number of samples, $n$ denotes the number of features, and "nnz" denotes the number of nonzeros in the solution $x$ obtained by Ssnal using the following estimation:
\[
\mathrm{nnz}:=\min\Bigl\{k\ \Big|\ \sum_{i=1}^{k}|\hat x_i|\ge 0.999\|x\|_1\Bigr\},
\]
where $\hat x$ is obtained by sorting $x$ such that $|\hat x_1|\ge\cdots\ge|\hat x_n|$. An entry marked as "Error" indicates that the algorithm breaks down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and the entry "00" in the time column means that the computation time is less than 0.5 second.
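The nnz estimate above can be computed directly from the sorted magnitudes; the function name below is illustrative.

```matlab
function k = nnz_estimate(x)
  % Smallest k such that the k largest magnitudes of x carry 99.9% of ||x||_1.
  s = sort(abs(x), 'descend');
  k = find(cumsum(s) >= 0.999*sum(s), 1, 'first');
end
```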

One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances

Table 1
Statistics of the UCI test instances.

Probname              m; n               $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$
E2006.train           16087; 150348      1.91e+05
log1p.E2006.train     16087; 4265669     5.86e+07
E2006.test            3308; 72812        4.79e+04
log1p.E2006.test      3308; 1771946      1.46e+07
pyrim5                74; 169911         1.22e+06
triazines4            186; 557845        2.07e+07
abalone7              4177; 6435         5.21e+05
bodyfat7              252; 116280        5.29e+04
housing7              506; 77520         3.28e+05
mpg7                  392; 3432          1.28e+04
space_ga9             3107; 5005         4.01e+03


Table 2
The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy $\epsilon=10^{-6}$). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM; for each problem and each value of $\lambda_c$, the columns report nnz, the relative KKT residual $\eta$, and the computation time for each solver. (The detailed entries are not recoverable from this extraction and are omitted.)


to the required accuracy after 20000 iterations or 7 hours. In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances; in fact, for 3 test instances, it breaks down due to some internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with $\lambda_c=10^{-3}$, we can see that Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20000 iterations and consumes about 2 hours but only produces a rather inaccurate solution.

On the other hand, one can observe that the two second order information based methods, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds ($\lambda_c=10^{-3}$). Between these two algorithms, Ssnal is clearly far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where the speedup factor can be up to 300. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (those with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with an accuracy of $10^{-1}$ when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point method in exploiting the sparsity in the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix $A$ (the matrix representation of $\mathcal{A}$) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter $\lambda$ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when $\lambda$ is not a scalar, based on the recommendation of the referee, we use another popular active-set based solver, PSSas⁴ [44], to replace FPC_AS for the testing. All the parameters for PSSas are set to the default values. In our tests, PSSas will be stopped when it reaches the maximum number of 20000 iterations or the maximum computation time of 7 hours. We note that, by default, PSSas will terminate when its progress is smaller than the threshold $10^{-9}$.
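The column normalization just described amounts to replacing $A$ by $AD^{-1}$ with $D=\mathrm{Diag}(\|A_{:,1}\|,\ldots,\|A_{:,n}\|)$ and carrying the scaling into a per-coordinate weight on $\|x\|_1$. The sketch below is one way to realize this change of variables; the function name is illustrative.

```matlab
function [As, wvec] = normalize_columns(A, lam)
  % Scale A to unit column norms; lam becomes a per-coordinate weight vector,
  % since lam*|x_j| = (lam/||A(:,j)||)*|xtilde_j| after the change of variables.
  nrm  = sqrt(sum(A.^2, 1));  nrm(nrm == 0) = 1;   % guard against zero columns
  As   = bsxfun(@rdivide, A, nrm);                  % columns of As have unit 2-norm
  wvec = lam ./ nrm(:);                             % nonuniform weights for the new variable
end
```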

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions based on Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM type of solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy. In fact, PSSas fails on all the test instances when $\lambda_c=10^{-4}$. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with $\eta=3.2\times 10^{-3}$. (Actually, we also run PSSas on these test instances without scaling

⁴ https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3
The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy $\epsilon=10^{-6}$). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM; for each problem and each value of $\lambda_c$, the columns report nnz, the relative KKT residual $\eta$, and the computation time for each solver. (The detailed entries are not recoverable from this extraction and are omitted.)

59


and obtain similar performances; detailed results are omitted to conserve space.) Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for Sparco collection. In this subsection, the test instances $(\mathcal{A},b)$ are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce 60dB noise to $b$ (as in [18]) by using the MATLAB command b = awgn(b,60,'measured'). For these test instances, the matrix representations of the linear maps $\mathcal{A}$ are not available. Hence, ADMM will not be tested in this subsection, as it will be extremely expensive, if not impossible, to compute and factorize $I+\sigma\mathcal{A}\mathcal{A}^*$.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of $\lambda_c$, i.e., $\lambda_c=10^{-0.8},10^{-1},10^{-1.2},10^{-1.5},10^{-2}$. As one can observe, as $\lambda_c$ decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions $(m,n)$ and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$, $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. For all the tested algorithms, we present the iteration counts, the relative KKT residuals, as well as the computation times (in the format of hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well for these test instances with small Lipschitz constants $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. As such, Ssnal does not have as clear an advantage as shown in Table 2. Moreover, since the matrix representations of the linear maps $\mathcal{A}$ and $\mathcal{A}^*$ involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases, with $\lambda_c=10^{-3}$ and $10^{-4}$. We can observe that FPC_AS

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy $\epsilon=10^{-6}$, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

$\lambda_c$   nnz     Iteration: a | b | c | d | e       $\eta$: a | b | c | d | e                              Time: a | b | c | d | e

srcsep1: m = 29166, n = 57344, $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$ = 3.56
0.16    380     14 | 28 | 42 | 401 | 89        6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7       07 | 09 | 05 | 15 | 07
0.10    726     16 | 38 | 42 | 574 | 161       4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7       11 | 15 | 05 | 22 | 13
0.06    1402    19 | 41 | 63 | 801 | 393       1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7       18 | 18 | 10 | 30 | 32
0.03    2899    19 | 56 | 110 | 901 | 337      2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7       28 | 53 | 16 | 33 | 28
0.01    6538    17 | 88 | 223 | 1401 | 542     7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7       1:21 | 2:15 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$ = 4.95
0.16    423     15 | 29 | 42 | 501 | 127       3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7       14 | 13 | 07 | 25 | 15
0.10    805     16 | 37 | 84 | 601 | 212       8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7       21 | 19 | 16 | 30 | 26
0.06    1549    19 | 40 | 84 | 901 | 419       1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7       32 | 28 | 18 | 44 | 50
0.03    3254    20 | 69 | 128 | 901 | 488      1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7       1:06 | 1:33 | 26 | 44 | 59
0.01    7400    21 | 94 | 259 | 2201 | 837     8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7       1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy $\epsilon=10^{-6}$, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

probname (m; n)       $\lambda_c$   nnz       $\eta$: a | b | c | d | e                        Time: a | b | c | d | e
blknheavi             $10^{-3}$   12        5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7     01 | 01 | 55 | 08 | 06
(1024; 1024)          $10^{-4}$   12        9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7     01 | 01 | 49 | 07 | 07
srcsep1               $10^{-3}$   14066     1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7     5:41 | 42:34 | 13:25 | 1:56 | 4:16
(29166; 57344)        $10^{-4}$   19306     9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7     9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
srcsep2               $10^{-3}$   16502     3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7     9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
(29166; 86016)        $10^{-4}$   22315     7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7     19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
srcsep3               $10^{-3}$   27314     6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7     33 | 6:24 | 8:51 | 47 | 49
(196608; 196608)      $10^{-4}$   83785     9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7     2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
soccer1               $10^{-3}$   4         1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7     01 | 03 | 13:51 | 2:35 | 02
(3200; 4096)          $10^{-4}$   8         8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7     01 | 02 | 13:23 | 3:07 | 02
soccer2               $10^{-3}$   4         3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7     00 | 03 | 13:46 | 1:40 | 02
(3200; 4096)          $10^{-4}$   8         2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7     01 | 03 | 13:27 | 3:07 | 02
blurrycam             $10^{-3}$   1694      1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7     03 | 09 | 03 | 02 | 07
(65536; 65536)        $10^{-4}$   5630      1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7     05 | 1:35 | 08 | 03 | 29
blurspike             $10^{-3}$   1954      3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7     03 | 05 | 6:38 | 03 | 27
(16384; 16384)        $10^{-4}$   11698     3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7     10 | 08 | 6:43 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when $\lambda_c=10^{-3}$ and $10^{-4}$, respectively. For a few cases, FPC_AS can only achieve a poor accuracy ($10^{-1}$). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi ($m=n=1024$), whose corresponding $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)=709$. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases, LADMM requires much more time than Ssnal. One can also observe that, for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$, one can safely conclude that the first order algorithms can only be used to solve problems with relatively small $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. Moreover, in order to obtain efficient and robust algorithms for lasso problems, or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotically superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving $\ell_1$-regularized least squares (LS) problems. With the intelligent


incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale $\ell_1$-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and the anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candès, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrmann, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex $\ell_1$-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Natl. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation par éléments finis d'ordre un et la résolution par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationnelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.
[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B.-G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wächter, A second-order method for convex $\ell_1$-regularized optimization with active-set prediction, Optim. Methods Softw., 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed $C^{1,1}$ optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, in Proceedings of NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A two-phase proximal augmented Lagrangian method for convex quadratic semidefinite programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for $\ell_1$-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate $O(1/k^2)$, Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with $\ell_1$-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

458 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

[51] X Xiao Y Li Z Wen and L Zhang Semi-smooth second-order type methods for compositeconvex programs preprint arXiv160307870 2016

[52] L Yang D F Sun and K-C Toh SDPNAL+ A majorized semismooth Newton-CGaugmented Lagrangian method for semidefinite programming with nonnegative constraintsMath Program Comput 7 (2015) pp 331ndash366

[53] M-C Yue Z Zhou and A M-C So Inexact Regularized Proximal Newton Method Prov-able Convergence Guarantees for Non-Smooth Convex Minimization without Strong Con-vexity preprint arXiv160507522 2016

[54] X Y Zhao D F Sun and K-C Toh A Newton-CG augmented Lagrangian method forsemidefinite programming SIAM J Optim 20 (2010) pp 1737ndash1765

[55] X Zhang M Burger and S Osher A unified primal-dual algorithm framework based onBregman iteration J Sci Comput 49 (2011) pp 20ndash46

[56] H Zhou The adaptive lasso and its oracle properties J Amer Statist Assoc 101 (2006)pp 1418ndash1429

[57] H Zhou and T Hastie Regularization and variable selection via the elastic net J RoyStatist Soc Ser B 67 (2005) pp 301ndash320

[58] Z Zhou and A M-C So A unified approach to error bounds for structured convex optimiza-tion problems Math Program 165 (2017) pp 689ndash728

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p


$z(w)) \in U$ to the perturbed KKT system (8). Next we show that $T_l$ is metrically subregular at $(\bar{y}, \bar{z}, \bar{x})$ for the origin.

Define the mapping $\Theta_{\rm KKT}: \mathcal{X}\times\mathcal{Y}\times\mathcal{X}\times\mathcal{W} \to \mathcal{Y}\times\mathcal{X}\times\mathcal{X}$ by

$$\Theta_{\rm KKT}(x, y, z, w) = \begin{pmatrix} \nabla h^*(y) - \mathcal{A}x - u_1 \\ z - {\rm Prox}_{p^*}(z + x + u_2) \\ \mathcal{A}^* y + z - c - v \end{pmatrix}, \quad \forall (x, y, z, w) \in \mathcal{X}\times\mathcal{Y}\times\mathcal{X}\times\mathcal{W},$$

and define the mapping $\theta: \mathcal{X} \to \mathcal{Y}\times\mathcal{X}\times\mathcal{X}$ as follows:

$$\theta(x) := \Theta_{\rm KKT}(x, \bar{y}, \bar{z}, 0), \quad \forall x \in \mathcal{X}.$$

Then we have $x \in M(\bar{y}, \bar{z})$ if and only if $\theta(x) = 0$. Since ${\rm Prox}_{p^*}(\cdot)$ is a piecewise affine function, $\theta(\cdot)$ is a piecewise affine function and thus a polyhedral multifunction. By using Proposition 2.2 and shrinking the neighborhood $U$ if necessary, for any $w$ close enough to the origin and any solution $(x(w), y(w), z(w)) \in U$ of the perturbed KKT system (8), we have

$$\begin{aligned}
{\rm dist}(x(w), M(\bar{y}, \bar{z})) &= O(\|\theta(x(w))\|)\\
&= O(\|\Theta_{\rm KKT}(x(w), \bar{y}, \bar{z}, 0) - \Theta_{\rm KKT}(x(w), y(w), z(w), w)\|)\\
&= O(\|w\| + \|(y(w), z(w)) - (\bar{y}, \bar{z})\|),
\end{aligned}$$

which, together with (9), implies the existence of a constant $\kappa \ge 0$ such that

(17)  $\|(y(w), z(w)) - (\bar{y}, \bar{z})\| + {\rm dist}(x(w), M(\bar{y}, \bar{z})) \le \kappa\|w\|.$

Thus, by Definition 2.3, we have proven that $T_l$ is metrically subregular at $(\bar{y}, \bar{z}, \bar{x})$ for the origin. The proof of the theorem is complete.

Remark 2.8. For convex piecewise linear-quadratic programming problems, such as the $\ell_1$ and elastic net regularized LS problems, we know from J. Sun's thesis [46] on the characterization of the subdifferentials of convex piecewise linear-quadratic functions (see also [43, Proposition 12.30]) that the corresponding operators $T_l$ and $T_f$ are polyhedral multifunctions, and thus by Proposition 2.2 the error bound condition holds. Here we emphasize again that the error bound condition holds for the lasso problem (1). Moreover, $T_f$ associated with the $\ell_1$ or elastic net regularized logistic regression model, i.e., where for a given vector $b \in \Re^m$ the loss function $h: \Re^m \to \Re$ in problem (3) takes the form

$$h(y) = \sum_{i=1}^{m}\log(1 + e^{-b_i y_i}), \quad \forall y \in \Re^m,$$

also satisfies the error bound condition [33, 47]. Meanwhile, from Theorem 2.7 we know that $T_l$ corresponding to the $\ell_1$ or the elastic net regularized logistic regression model is metrically subregular at any solution of the KKT system (6) for the origin.

3. An augmented Lagrangian method with asymptotic superlinear convergence. Recall the general convex composite model (3),

(P)  $\max\;\{-(f(x) := h(\mathcal{A}x) - \langle c, x\rangle + p(x))\},$

and its dual

(D)  $\min\;\{h^*(y) + p^*(z) \mid \mathcal{A}^* y + z = c\}.$
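For concreteness (this instantiation is our illustration, consistent with (1) and the definitions above but not spelled out at this point in the paper), the lasso problem corresponds to the choices

$$h(y) = \tfrac12\|y - b\|^2, \qquad h^*(y) = \tfrac12\|y\|^2 + \langle b, y\rangle, \qquad p(x) = \lambda\|x\|_1, \qquad p^*(z) = \delta_{\{\|\cdot\|_\infty \le \lambda\}}(z),$$

with $c = 0$, so that (D) becomes

$$\min_{y}\; \tfrac12\|y\|^2 + \langle b, y\rangle \quad \text{s.t.} \quad \|\mathcal{A}^* y\|_\infty \le \lambda,$$

obtained by eliminating $z = -\mathcal{A}^* y$ from the constraint $\mathcal{A}^* y + z = c = 0$.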


In this section we propose an asymptotically superlinearly convergent augmented Lagrangian method for solving (P) and (D), under the following standing assumptions on the function $h$.

Assumption 3.1.
(a) $h: \mathcal{Y} \to \Re$ is a convex differentiable function whose gradient is $1/\alpha_h$-Lipschitz continuous, i.e.,

$$\|\nabla h(y') - \nabla h(y)\| \le (1/\alpha_h)\|y' - y\|, \quad \forall y', y \in \mathcal{Y}.$$

(b) $h$ is essentially locally strongly convex [21], i.e., for any compact and convex set $K \subset {\rm dom}\,\partial h$, there exists $\beta_K > 0$ such that

$$(1-\lambda)h(y') + \lambda h(y) \ge h((1-\lambda)y' + \lambda y) + \tfrac12\beta_K\lambda(1-\lambda)\|y' - y\|^2, \quad \forall y', y \in K,$$

for all $\lambda \in [0, 1]$.

Many commonly used loss functions in the machine learning literature satisfy the above mild assumptions; for example, $h$ can be the loss function in the linear regression, logistic regression, and Poisson regression models. While the strict convexity of a convex function is closely related to the differentiability of its conjugate, the essential local strong convexity in Assumption 3.1(b) for a convex function was first proposed in [21] to obtain a characterization of the local Lipschitz continuity of the gradient of its conjugate function.
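As a quick sanity check (an illustration added here, not part of the original argument), the squared loss used in the lasso problem (1) satisfies both conditions with explicit constants $\alpha_h = \beta_K = 1$:

$$h(y) = \tfrac12\|y - b\|^2 \;\Longrightarrow\; \|\nabla h(y') - \nabla h(y)\| = \|y' - y\|,$$

$$(1-\lambda)h(y') + \lambda h(y) - h((1-\lambda)y' + \lambda y) = \tfrac12\lambda(1-\lambda)\|y' - y\|^2, \quad \forall y', y \in \mathcal{Y},\; \lambda \in [0,1],$$

so Assumption 3.1(a) holds with $\alpha_h = 1$, and (b) holds with $\beta_K = 1$ for every compact convex $K$.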

The aforementioned assumptions on $h$ imply the following useful properties of $h^*$. First, by [43, Proposition 12.60], we know that $h^*$ is strongly convex with modulus $\alpha_h$. Second, by [21, Corollary 4.4], we know that $h^*$ is essentially smooth and $\nabla h^*$ is locally Lipschitz continuous on ${\rm int}({\rm dom}\,h^*)$. If the solution set to the KKT system associated with (P) and (D) is further assumed to be nonempty, then, similar to what we have discussed in the last section, one only needs to focus on ${\rm int}({\rm dom}\,h^*)\times\mathcal{X}$ when solving (D). Given $\sigma > 0$, the augmented Lagrangian function associated with (D) is given by

$$\mathcal{L}_\sigma(y, z; x) = l(y, z; x) + \frac{\sigma}{2}\|\mathcal{A}^* y + z - c\|^2, \quad \forall (y, z, x) \in \mathcal{Y}\times\mathcal{X}\times\mathcal{X},$$

where the Lagrangian function $l(\cdot)$ is defined in (5).

3.1. SSNAL: A semismooth Newton augmented Lagrangian algorithm for (D). The detailed steps of our algorithm Ssnal for solving (D) are given as follows. Since a semismooth Newton method will be employed to solve the subproblems involved in this prototype of the inexact augmented Lagrangian method [42], it is natural for us to call our algorithm Ssnal.

Algorithm SSNAL: An inexact augmented Lagrangian method for (D).

Let $\sigma_0 > 0$ be a given parameter. Choose $(y^0, z^0, x^0) \in {\rm int}({\rm dom}\,h^*)\times{\rm dom}\,p^*\times\mathcal{X}$. For $k = 0, 1, \ldots$, perform the following steps in each iteration:

Step 1. Compute

(18)  $(y^{k+1}, z^{k+1}) \approx \arg\min\{\Psi_k(y, z) := \mathcal{L}_{\sigma_k}(y, z; x^k)\}.$

Step 2. Compute $x^{k+1} = x^k - \sigma_k(\mathcal{A}^* y^{k+1} + z^{k+1} - c)$ and update $\sigma_{k+1} \uparrow \sigma_\infty \le \infty$.
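The outer loop is straightforward once an inner subproblem solver is available. The sketch below is a minimal NumPy rendition of Algorithm SSNAL specialized to the lasso (where $c = 0$); it is not the authors' MATLAB implementation. The callable `solve_subproblem` (for instance, the semismooth Newton sketch given in section 3.2 below), the geometric $\sigma_k$ schedule, and the simple stopping rule are illustrative assumptions.

```python
import numpy as np

def ssnal_lasso(A, b, lam, solve_subproblem, sigma0=1.0, sigma_growth=3.0,
                sigma_max=1e6, max_outer=100, tol=1e-6):
    """Inexact ALM outer loop (Algorithm SSNAL) for the lasso dual, with c = 0.

    solve_subproblem(A, b, lam, x, sigma, y0) -> (y, z) is a hypothetical
    callable returning an approximate minimizer of Psi_k in (18).
    """
    m, n = A.shape
    x, y, z = np.zeros(n), np.zeros(m), np.zeros(n)
    sigma = sigma0
    for k in range(max_outer):
        y, z = solve_subproblem(A, b, lam, x, sigma, y)   # Step 1 (inexact)
        step = sigma * (A.T @ y + z)                      # sigma*(A^T y + z - c), c = 0
        x = x - step                                      # Step 2: multiplier update
        if np.linalg.norm(step) <= tol * (1.0 + np.linalg.norm(x)):
            break                                         # crude stop rule, illustrative only
        sigma = min(sigma * sigma_growth, sigma_max)      # sigma_k increases toward sigma_inf
    return x, y, z
```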

Next we shall adapt the results developed in [41, 42, 34] to establish the global and local superlinear convergence of our algorithm.


Since the inner problems cannot be solved exactly, we use the following standard stopping criterion studied in [41, 42] for approximately solving (18):

(A)  $\Psi_k(y^{k+1}, z^{k+1}) - \inf\Psi_k \le \varepsilon_k^2/(2\sigma_k), \quad \sum_{k=0}^{\infty}\varepsilon_k < \infty.$

Now we can state the global convergence of Algorithm Ssnal, which follows from [41, 42] without much difficulty.

Theorem 3.2. Suppose that Assumption 3.1 holds and that the solution set to (P) is nonempty. Let $\{(y^k, z^k, x^k)\}$ be the infinite sequence generated by Algorithm Ssnal with stopping criterion (A). Then the sequence $\{x^k\}$ is bounded and converges to an optimal solution of (P). In addition, $\{(y^k, z^k)\}$ is also bounded and converges to the unique optimal solution $(y^*, z^*) \in {\rm int}({\rm dom}\,h^*)\times{\rm dom}\,p^*$ of (D).

Proof. Since the solution set to (P) is assumed to be nonempty, the optimal value of (P) is finite. From Assumption 3.1(a), we have that ${\rm dom}\,h = \mathcal{Y}$ and $h^*$ is strongly convex [43, Proposition 12.60]. Then, by Fenchel's duality theorem [40, Corollary 31.2.1], we know that the solution set to (D) is nonempty and the optimal value of (D) is finite and equal to the optimal value of (P). That is, the solution set to the KKT system associated with (P) and (D) is nonempty. The uniqueness of the optimal solution $(y^*, z^*) \in {\rm int}({\rm dom}\,h^*)\times\mathcal{X}$ of (D) then follows directly from the strong convexity of $h^*$. By combining this uniqueness with [42, Theorem 4], one can easily obtain the boundedness of $\{(y^k, z^k)\}$ and the other desired results.

We need the following stopping criteria for the local convergence analysis:

(B1)  $\Psi_k(y^{k+1}, z^{k+1}) - \inf\Psi_k \le (\delta_k^2/(2\sigma_k))\|x^{k+1} - x^k\|^2, \quad \sum_{k=0}^{\infty}\delta_k < +\infty,$

(B2)  ${\rm dist}(0, \partial\Psi_k(y^{k+1}, z^{k+1})) \le (\delta_k'/\sigma_k)\|x^{k+1} - x^k\|, \quad 0 \le \delta_k' \to 0,$

where $x^{k+1} = x^k - \sigma_k(\mathcal{A}^* y^{k+1} + z^{k+1} - c)$.

Theorem 3.3. Assume that Assumption 3.1 holds and that the solution set $\Omega$ to (P) is nonempty. Suppose that $T_f$ satisfies the error bound condition (2) for the origin with modulus $a_f$. Let $\{(y^k, z^k, x^k)\}$ be any infinite sequence generated by Algorithm Ssnal with stopping criteria (A) and (B1). Then the sequence $\{x^k\}$ converges to some $x^* \in \Omega$, and for all $k$ sufficiently large,

(19)  ${\rm dist}(x^{k+1}, \Omega) \le \theta_k\,{\rm dist}(x^k, \Omega),$

where $\theta_k = \big(a_f(a_f^2 + \sigma_k^2)^{-1/2} + 2\delta_k\big)(1 - \delta_k)^{-1} \to \theta_\infty = a_f(a_f^2 + \sigma_\infty^2)^{-1/2} < 1$ as $k \to +\infty$. Moreover, the sequence $\{(y^k, z^k)\}$ converges to the unique optimal solution $(y^*, z^*) \in {\rm int}({\rm dom}\,h^*)\times{\rm dom}\,p^*$ of (D).

Moreover, if $T_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with modulus $a_l$ and the stopping criterion (B2) is also used, then for all $k$ sufficiently large,

$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta_k'\|x^{k+1} - x^k\|,$$

where $\theta_k' = a_l(1 + \delta_k')/\sigma_k$ with $\lim_{k\to\infty}\theta_k' = a_l/\sigma_\infty$.

Proof. The first part of the theorem follows from [34, Theorem 2.1], [42, Proposition 7 and Theorem 5], and Theorem 3.2. To prove the second part, we recall that


if $T_l$ is metrically subregular at $(y^*, z^*, x^*)$ for the origin with the modulus $a_l$ and $(y^k, z^k, x^k) \to (y^*, z^*, x^*)$, then for all $k$ sufficiently large,

$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| + {\rm dist}(x^{k+1}, \Omega) \le a_l\,{\rm dist}(0, T_l(y^{k+1}, z^{k+1}, x^{k+1})).$$

Therefore, by the estimate (4.21) in [42] and the stopping criterion (B2), we obtain that for all $k$ sufficiently large,

$$\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le a_l(1 + \delta_k')/\sigma_k\,\|x^{k+1} - x^k\|.$$

This completes the proof of Theorem 3.3.

Remark 3.4. In the above theorem, when $\sigma_\infty = +\infty$, the inequality (19) directly implies that $\{x^k\}$ converges Q-superlinearly. Moreover, recent advances in [12] also establish the asymptotic R-superlinear convergence of the dual iteration sequence $\{(y^k, z^k)\}$. Indeed, from [12, Proposition 4.1], under the same conditions as in Theorem 3.3, we have for $k$ sufficiently large that

(20)  $\|(y^{k+1}, z^{k+1}) - (y^*, z^*)\| \le \theta_k'\|x^{k+1} - x^k\| \le \theta_k'(1 - \delta_k)^{-1}\,{\rm dist}(x^k, \Omega),$

where $\theta_k'(1 - \delta_k)^{-1} \to a_l/\sigma_\infty$. Thus, if $\sigma_\infty = \infty$, the Q-superlinear convergence of $\{x^k\}$ and (20) further imply that $\{(y^k, z^k)\}$ converges R-superlinearly.

We should emphasize here that, by combining Remarks 2.8 and 3.4 and Theorem 3.3, our Algorithm Ssnal is guaranteed to produce an asymptotically superlinearly convergent sequence when used to solve (D) for many commonly used regularizers and loss functions. In particular, the Ssnal algorithm is asymptotically superlinearly convergent when applied to the dual of (1).

3.2. Solving the augmented Lagrangian subproblems. Here we propose an efficient semismooth Newton algorithm to solve the inner subproblems in the augmented Lagrangian method (18). That is, for some fixed $\sigma > 0$ and $x \in \mathcal{X}$, we aim to solve

(21)  $\min_{y,z}\;\Psi(y, z) := \mathcal{L}_\sigma(y, z; x).$

Since $\Psi(\cdot, \cdot)$ is a strongly convex function, we have that for any $\alpha \in \Re$, the level set $\mathcal{L}_\alpha := \{(y, z) \in {\rm dom}\,h^*\times{\rm dom}\,p^* \mid \Psi(y, z) \le \alpha\}$ is a closed and bounded convex set. Moreover, problem (21) admits a unique optimal solution, denoted as $(\bar{y}, \bar{z}) \in {\rm int}({\rm dom}\,h^*)\times{\rm dom}\,p^*$.

Denote, for any $y \in \mathcal{Y}$,

$$\psi(y) := \inf_z \Psi(y, z) = h^*(y) + p^*\big({\rm Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^* y + c)\big) + \frac{1}{2\sigma}\|{\rm Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c))\|^2 - \frac{1}{2\sigma}\|x\|^2.$$

Therefore, if $(\bar{y}, \bar{z}) = \arg\min\Psi(y, z)$, then $(\bar{y}, \bar{z}) \in {\rm int}({\rm dom}\,h^*)\times{\rm dom}\,p^*$ can be computed simultaneously by

$$\bar{y} = \arg\min\psi(y), \qquad \bar{z} = {\rm Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^*\bar{y} + c).$$


Note that $\psi(\cdot)$ is strongly convex and continuously differentiable on ${\rm int}({\rm dom}\,h^*)$ with

$$\nabla\psi(y) = \nabla h^*(y) - \mathcal{A}\,{\rm Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c)), \quad \forall y \in {\rm int}({\rm dom}\,h^*).$$

Thus, $\bar{y}$ can be obtained via solving the following nonsmooth equation:

(22)  $\nabla\psi(y) = 0, \quad y \in {\rm int}({\rm dom}\,h^*).$
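For the lasso instance of (1) (with $h(y) = \frac12\|y - b\|^2$, $c = 0$, and $p = \lambda\|\cdot\|_1$, so that $\nabla h^*(y) = y + b$ and the $p^*$ term vanishes because $p^*$ is an indicator function), the objects above take the fully explicit form below; this specialization is our illustration and is not written out at this point in the paper:

$$\psi(y) = \tfrac12\|y\|^2 + \langle b, y\rangle + \frac{1}{2\sigma}\big\|{\rm Prox}_{\sigma\lambda\|\cdot\|_1}(x - \sigma\mathcal{A}^* y)\big\|^2 - \frac{1}{2\sigma}\|x\|^2,$$

$$\nabla\psi(y) = y + b - \mathcal{A}\,{\rm Prox}_{\sigma\lambda\|\cdot\|_1}(x - \sigma\mathcal{A}^* y), \qquad {\rm Prox}_{\sigma\lambda\|\cdot\|_1}(v) = {\rm sign}(v)\circ\max(|v| - \sigma\lambda, 0).$$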

Let $y \in {\rm int}({\rm dom}\,h^*)$ be any given point. Since $h^*$ is a convex function with a locally Lipschitz continuous gradient on ${\rm int}({\rm dom}\,h^*)$, the following operator is well defined:

$$\hat{\partial}^2\psi(y) := \partial(\nabla h^*)(y) + \sigma\mathcal{A}\,\partial{\rm Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c))\,\mathcal{A}^*,$$

where $\partial(\nabla h^*)(y)$ is the Clarke subdifferential of $\nabla h^*$ at $y$ [10], and $\partial{\rm Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c))$ is the Clarke subdifferential of the Lipschitz continuous mapping ${\rm Prox}_{\sigma p}(\cdot)$ at $x - \sigma(\mathcal{A}^* y - c)$. Note that from [10, Proposition 2.3.3 and Theorem 2.6.6], we know that

$$\hat{\partial}^2\psi(y)\,(d) \subseteq \partial^2\psi(y)\,(d), \quad \forall d \in \mathcal{Y},$$

where $\partial^2\psi(y)$ denotes the generalized Hessian of $\psi$ at $y$. Define

(23)  $V := H + \sigma\mathcal{A} U \mathcal{A}^*$

with $H \in \partial(\nabla h^*)(y)$ and $U \in \partial{\rm Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c))$. Then we have $V \in \hat{\partial}^2\psi(y)$. Since $h^*$ is a strongly convex function, we know that $H$ is symmetric positive definite on $\mathcal{Y}$, and thus $V$ is also symmetric positive definite on $\mathcal{Y}$.

Under the mild assumption that $\nabla h^*$ and ${\rm Prox}_{\sigma p}$ are strongly semismooth (the definition of which is given next), we can design a superlinearly convergent semismooth Newton method to solve the nonsmooth equation (22).

Definition 3.5 (semismoothness [35, 38, 45]). Let $F: O \subseteq \mathcal{X} \to \mathcal{Y}$ be a locally Lipschitz continuous function on the open set $O$. $F$ is said to be semismooth at $x \in O$ if $F$ is directionally differentiable at $x$ and, for any $V \in \partial F(x + \Delta x)$ with $\Delta x \to 0$,

$$F(x + \Delta x) - F(x) - V\Delta x = o(\|\Delta x\|).$$

$F$ is said to be strongly semismooth at $x$ if $F$ is semismooth at $x$ and

$$F(x + \Delta x) - F(x) - V\Delta x = O(\|\Delta x\|^2).$$

$F$ is said to be a semismooth (respectively, strongly semismooth) function on $O$ if it is semismooth (respectively, strongly semismooth) everywhere in $O$.

It is widely known in the nonsmooth optimization/equation community that continuous piecewise affine functions and twice continuously differentiable functions are all strongly semismooth everywhere. In particular, ${\rm Prox}_{\|\cdot\|_1}$, as a Lipschitz continuous piecewise affine function, is strongly semismooth. See [14] for more examples of semismooth and strongly semismooth functions.

Now we can design a semismooth Newton (Ssn) method to solve (22) as follows, and we can expect a fast superlinear, or even quadratic, rate of convergence.


Algorithm SSN: A semismooth Newton algorithm for solving (22) (SSN($y^0, x, \sigma$)).

Given $\mu \in (0, 1/2)$, $\bar{\eta} \in (0, 1)$, $\tau \in (0, 1]$, and $\delta \in (0, 1)$, choose $y^0 \in {\rm int}({\rm dom}\,h^*)$. Iterate the following steps for $j = 0, 1, \ldots$:

Step 1. Choose $H_j \in \partial(\nabla h^*)(y^j)$ and $U_j \in \partial{\rm Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y^j - c))$. Let $V_j := H_j + \sigma\mathcal{A} U_j\mathcal{A}^*$. Solve the linear system

(24)  $V_j d + \nabla\psi(y^j) = 0$

exactly, or by the conjugate gradient (CG) algorithm, to find $d^j$ such that $\|V_j d^j + \nabla\psi(y^j)\| \le \min(\bar{\eta}, \|\nabla\psi(y^j)\|^{1+\tau})$.

Step 2 (Line search). Set $\alpha_j = \delta^{m_j}$, where $m_j$ is the first nonnegative integer $m$ for which

$$y^j + \delta^m d^j \in {\rm int}({\rm dom}\,h^*) \quad\text{and}\quad \psi(y^j + \delta^m d^j) \le \psi(y^j) + \mu\delta^m\langle\nabla\psi(y^j), d^j\rangle.$$

Step 3. Set $y^{j+1} = y^j + \alpha_j d^j$.
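To make the above concrete, here is a small self-contained NumPy sketch of Algorithm SSN for the lasso subproblem (where $\nabla h^*(y) = y + b$, so $H_j = I_m$, and $U_j$ can be taken diagonal as in section 3.3). It is an illustrative rendition under these assumptions rather than the authors' implementation; the dense direct solve stands in for the exact/CG solve of (24), and ${\rm dom}\,h^* = \Re^m$ here, so no domain check is needed in the line search.

```python
import numpy as np

def soft_threshold(v, t):
    """Prox of t*||.||_1 (componentwise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ssn_lasso_subproblem(A, b, lam, x, sigma, y0=None,
                         mu=1e-4, delta=0.5, tol=1e-9, max_iter=50):
    """Semismooth Newton for min_y psi(y) in the lasso case (c = 0).

    Returns (y, z), with z recovered via the Moreau decomposition,
    z = Prox_{p*/sigma}(x/sigma - A^T y).
    """
    m, n = A.shape
    y = np.zeros(m) if y0 is None else np.asarray(y0, dtype=float).copy()

    def psi_grad(y):
        w = x - sigma * (A.T @ y)                 # x_tilde = x - sigma*(A^T y - c), c = 0
        px = soft_threshold(w, sigma * lam)       # Prox_{sigma*lam*||.||_1}(w)
        val = 0.5 * y @ y + b @ y + (px @ px) / (2 * sigma) - (x @ x) / (2 * sigma)
        grad = y + b - A @ px                     # nabla h*(y) - A Prox_{sigma p}(w)
        return val, grad, w

    for _ in range(max_iter):
        val, grad, w = psi_grad(y)
        if np.linalg.norm(grad) <= tol:
            break
        # Step 1: generalized Hessian V_j = I_m + sigma * A_J A_J^T, J = {i : |w_i| > sigma*lam}
        AJ = A[:, np.abs(w) > sigma * lam]
        V = np.eye(m) + sigma * (AJ @ AJ.T)
        d = np.linalg.solve(V, -grad)             # exact Newton direction for (24)
        # Step 2: backtracking line search
        alpha = 1.0
        for _ in range(50):
            if psi_grad(y + alpha * d)[0] <= val + mu * alpha * (grad @ d):
                break
            alpha *= delta
        y = y + alpha * d                         # Step 3
    w = x - sigma * (A.T @ y)
    z = (w - soft_threshold(w, sigma * lam)) / sigma   # Moreau: Prox_{p*/sigma}(w/sigma)
    return y, z
```

For the lasso, ${\rm Prox}_{\sigma p}$ is piecewise affine and $\nabla h^*$ is linear, so both are strongly semismooth and Theorem 3.6 below applies to this setting. This sketch also composes with the outer-loop sketch of section 3.1 as its `solve_subproblem` argument.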

The convergence results for the above Ssn algorithm are stated in the next theorem.

Theorem 3.6. Assume that $\nabla h^*(\cdot)$ and ${\rm Prox}_{\sigma p}(\cdot)$ are strongly semismooth on ${\rm int}({\rm dom}\,h^*)$ and $\mathcal{X}$, respectively. Let the sequence $\{y^j\}$ be generated by Algorithm Ssn. Then $\{y^j\}$ converges to the unique optimal solution $\bar{y} \in {\rm int}({\rm dom}\,h^*)$ of the problem in (22), and

$$\|y^{j+1} - \bar{y}\| = O(\|y^j - \bar{y}\|^{1+\tau}).$$

Proof. Since, by [54, Proposition 3.3], $d^j$ is a descent direction, Algorithm Ssn is well defined. By (23), we know that for any $j \ge 0$, $V_j \in \hat{\partial}^2\psi(y^j)$. Then the conclusion of this theorem can be proven by following the proofs of [54, Theorems 3.4 and 3.5]. We omit the details here.

We shall now discuss the implementation of the stopping criteria (A), (B1), and (B2) when Algorithm Ssn is used to solve the subproblem (18) in Algorithm Ssnal. Note that when Algorithm Ssn is applied to minimize $\Psi_k(\cdot)$ to find

$$y^{k+1} = {\rm Ssn}(y^k, x^k, \sigma_k) \quad\text{and}\quad z^{k+1} = {\rm Prox}_{p^*/\sigma_k}(x^k/\sigma_k - \mathcal{A}^* y^{k+1} + c),$$

we have, by simple calculations and the strong convexity of $h^*$, that

$$\Psi_k(y^{k+1}, z^{k+1}) - \inf\Psi_k = \psi_k(y^{k+1}) - \inf\psi_k \le \frac{1}{2\alpha_h}\|\nabla\psi_k(y^{k+1})\|^2$$

and $(\nabla\psi_k(y^{k+1}), 0) \in \partial\Psi_k(y^{k+1}, z^{k+1})$, where $\psi_k(y) := \inf_z\Psi_k(y, z)$ for all $y \in \mathcal{Y}$. Therefore, the stopping criteria (A), (B1), and (B2) can be achieved by the following implementable criteria:

(A$'$)  $\|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h/\sigma_k}\,\varepsilon_k, \quad \sum_{k=0}^{\infty}\varepsilon_k < \infty,$

(B1$'$)  $\|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h\sigma_k}\,\delta_k\|\mathcal{A}^* y^{k+1} + z^{k+1} - c\|, \quad \sum_{k=0}^{\infty}\delta_k < +\infty,$

(B2$'$)  $\|\nabla\psi_k(y^{k+1})\| \le \delta_k'\|\mathcal{A}^* y^{k+1} + z^{k+1} - c\|, \quad 0 \le \delta_k' \to 0.$


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as $\|\nabla\psi_k(y^{k+1})\|$ is sufficiently small.
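A direct transcription of the implementable criteria (A$'$), (B1$'$), and (B2$'$) takes only a few lines; the sketch below is illustrative, with all tolerance sequences supplied by the caller.

```python
import numpy as np

def inner_criteria_met(grad_psi_norm, dual_res_norm, alpha_h, sigma_k,
                       eps_k, delta_k, delta_prime_k):
    """Check (A'), (B1'), (B2'); dual_res_norm = ||A^* y^{k+1} + z^{k+1} - c||."""
    A_ok  = grad_psi_norm <= np.sqrt(alpha_h / sigma_k) * eps_k
    B1_ok = grad_psi_norm <= np.sqrt(alpha_h * sigma_k) * delta_k * dual_res_norm
    B2_ok = grad_psi_norm <= delta_prime_k * dual_res_norm
    return A_ok and B1_ok and B2_ok
```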

3.3. An efficient implementation of SSN for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer $p$ is chosen to be $\lambda\|\cdot\|_1$ for some $\lambda > 0$. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction $d^j$ from the linear system (24). So we shall first discuss the solution of this linear system.

Let $(x, y) \in \Re^n\times\Re^m$ and $\sigma > 0$ be given. We consider the following Newton linear system:

(25)  $(H + \sigma A U A^T) d = -\nabla\psi(y),$

where $H \in \partial(\nabla h^*)(y)$, $A$ denotes the matrix representation of $\mathcal{A}$ with respect to the standard bases of $\Re^n$ and $\Re^m$, and $U \in \partial{\rm Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ with $\tilde{x} = x - \sigma(A^T y - c)$. Since $H$ is a symmetric and positive definite matrix, (25) can be equivalently rewritten as

$$\big(I_m + \sigma(L^{-1}A)U(L^{-1}A)^T\big)(L^T d) = -L^{-1}\nabla\psi(y),$$

where $L$ is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of $H$ such that $H = LL^T$. In many applications, $H$ is usually a sparse matrix; indeed, when the function $h$ in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrices $H$ are in fact diagonal. That is, the costs of computing $L$ and its inverse are negligible in most situations. Therefore, without loss of generality, we consider a simplified version of (25) as follows:

(26)  $(I_m + \sigma A U A^T) d = -\nabla\psi(y),$

which is precisely the Newton system associated with the standard lasso problem (1). Since $U \in \Re^{n\times n}$ is a diagonal matrix, at first glance the costs of computing $AUA^T$ and the matrix-vector multiplication $AUA^T d$ for a given vector $d \in \Re^m$ are $O(m^2 n)$ and $O(mn)$, respectively. These computational costs are too expensive when the dimensions of $A$ are large, and they can make the commonly employed approaches, such as the Cholesky factorization and the conjugate gradient method, inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of $U$ is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level such that they are negligible, or at least insignificant, compared to other costs. Next, we show how this can be done by taking full advantage of the sparsity of $U$. This sparsity will be referred to as the second order sparsity of the underlying problem.

For $\tilde{x} = x - \sigma(A^T y - c)$, in our computations we can always choose $U = {\rm Diag}(u)$, the diagonal matrix whose $i$th diagonal element $u_i$ is given by

$$u_i = \begin{cases} 0 & \text{if } |\tilde{x}_i| \le \sigma\lambda,\\ 1 & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, n.$$

Since ${\rm Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x}) = {\rm sign}(\tilde{x})\circ\max(|\tilde{x}| - \sigma\lambda, 0)$, it is not difficult to see that $U \in \partial{\rm Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$. Let $J := \{j \mid |\tilde{x}_j| > \sigma\lambda,\ j = 1, \ldots, n\}$ and $r := |J|$, the cardinality of $J$. By taking the special 0-1 structure of $U$ into consideration, we have


[Fig. 1. Reducing the computational costs from $O(m^2 n)$ to $O(m^2 r)$: the $m\times n$ product $AUA^T$ collapses to the $m\times r$ product $A_J A_J^T$. Figure omitted.]

(27)  $AUA^T = (AU)(AU)^T = A_J A_J^T,$

where $A_J \in \Re^{m\times r}$ is the submatrix of $A$ with those columns not in $J$ removed. Then, by using (27), the costs of computing $AUA^T$ and $AUA^T d$ for a given vector $d$ are reduced to $O(m^2 r)$ and $O(mr)$, respectively. Due to the sparsity-promoting property of the regularizer $p$, the number $r$ is usually much smaller than $n$. Thus, by exploiting the aforementioned second order sparsity, we can greatly reduce the computational costs of solving the linear system (26) when using the Cholesky factorization. More specifically, the total computational cost of using the Cholesky factorization to solve the linear system is reduced from $O(m^2(m+n))$ to $O(m^2(m+r))$. See Figure 1 for an illustration of the reduction. This means that even if $n$ happens to be extremely large (say, larger than $10^7$), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization, as long as both $m$ and $r$ are moderate (say, less than $10^4$). If, in addition, $r \ll m$, which is often the case when $m$ is large and the optimal solutions to the underlying problem are sparse, then instead of factorizing an $m\times m$ matrix we can make use of the Sherman–Morrison–Woodbury formula [22] to obtain the inverse of $I_m + \sigma AUA^T$ by inverting a much smaller $r\times r$ matrix, as follows:

$$(I_m + \sigma AUA^T)^{-1} = (I_m + \sigma A_J A_J^T)^{-1} = I_m - A_J(\sigma^{-1}I_r + A_J^T A_J)^{-1}A_J^T.$$

See Figure 2 for an illustration of the computation of $A_J^T A_J$. In this case, the total computational cost of solving the Newton linear system (26) is reduced significantly further, from $O(m^2(m+r))$ to $O(r^2(m+r))$. We should emphasize here that this dramatic reduction in computational cost results from the careful combination of the existing second order sparsity in lasso-type problems with some "smart" numerical linear algebra.
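The identity above translates directly into code. The following NumPy sketch (an illustration under the stated assumptions, not the authors' implementation) solves the Newton system (26) both ways and checks that the Sherman–Morrison–Woodbury route, which only factorizes an $r\times r$ matrix, returns the same direction.

```python
import numpy as np

def newton_direction_smw(A, u, sigma, rhs):
    """Solve (I_m + sigma * A Diag(u) A^T) d = rhs using A_J and the SMW formula.

    u is the 0-1 diagonal of U, so A Diag(u) A^T = A_J A_J^T with J = {i : u_i = 1}.
    Cost: O(r^2 m + r^3) instead of O(m^2 n + m^3), where r = |J|.
    """
    AJ = A[:, u > 0]                                   # m x r submatrix
    r = AJ.shape[1]
    # (I + sigma A_J A_J^T)^{-1} rhs = rhs - A_J (sigma^{-1} I_r + A_J^T A_J)^{-1} A_J^T rhs
    small = np.eye(r) / sigma + AJ.T @ AJ              # only an r x r system
    return rhs - AJ @ np.linalg.solve(small, AJ.T @ rhs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, sigma = 200, 5000, 10.0
    A = rng.standard_normal((m, n))
    u = (rng.random(n) < 0.01).astype(float)           # sparse 0-1 diagonal, r << n
    rhs = rng.standard_normal(m)
    d_smw = newton_direction_smw(A, u, sigma, rhs)
    d_ref = np.linalg.solve(np.eye(m) + sigma * (A * u) @ A.T, rhs)  # direct m x m solve
    print(np.allclose(d_smw, d_ref))                   # True
```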

From the above arguments, we can see that as long as the number of nonzero components of ${\rm Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is small, say less than $\sqrt{n}$, and $H_j \in \partial(\nabla h^*)(y^j)$ is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low cost. In particular, this is true for lasso problems admitting sparse


[Fig. 2. Further reducing the computational costs to $O(r^2 m)$: forming the $r\times r$ matrix $A_J^T A_J$ costs $O(r^2 m)$. Figure omitted.]

solutions. Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). One may argue that, even if the original problem has only sparse solutions, at certain stages one may still encounter the situation where the number of nonzero components of ${\rm Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice, since we always start from a sparse feasible point, e.g., the zero vector. Second, even if this phenomenon does occur at certain steps, we just need to apply a small number of conjugate gradient iterations to the linear system (24), since in this case the parameter $\sigma$ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with $p(\cdot)$ chosen to be $\lambda\|\cdot\|_1$.

4. Numerical experiments for lasso problems. In this section, we evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two recent papers [18, 36], which appear to suggest that, for some large-scale sparse reconstruction problems, mfIPM¹ and FPC_AS² have mostly outperformed the other solvers. Hence, in this section we compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP³ [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparison, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB, with the step length set to 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later, they lack the ability to efficiently solve difficult problems, such as large-scale regression problems where the data $A$ is badly conditioned.

For testing purposes, the regularization parameter $\lambda$ in the lasso problem (1) is chosen as

$$\lambda = \lambda_c\|\mathcal{A}^* b\|_\infty,$$

where $0 < \lambda_c < 1$. In our numerical experiments, we measure the accuracy of an approximate optimal solution $\hat{x}$ for (1) by using the following relative KKT residual:

$$\eta = \frac{\|\hat{x} - {\rm prox}_{\lambda\|\cdot\|_1}(\hat{x} - \mathcal{A}^*(\mathcal{A}\hat{x} - b))\|}{1 + \|\hat{x}\| + \|\mathcal{A}\hat{x} - b\|}.$$
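For reference, computing $\lambda$ and the relative KKT residual $\eta$ for a candidate solution is a direct NumPy transcription of the two formulas above (shown here as a sketch; the dense matrix-vector products stand in for whatever operator form $\mathcal{A}$ takes in practice).

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_lambda(A, b, lam_c):
    """lambda = lambda_c * ||A^* b||_inf."""
    return lam_c * np.max(np.abs(A.T @ b))

def relative_kkt_residual(A, b, lam, x_hat):
    """eta = ||x - prox_{lam||.||_1}(x - A^*(Ax - b))|| / (1 + ||x|| + ||Ax - b||)."""
    res = A @ x_hat - b
    prox_point = soft_threshold(x_hat - A.T @ res, lam)
    return np.linalg.norm(x_hat - prox_point) / (1.0 + np.linalg.norm(x_hat) + np.linalg.norm(res))
```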

¹ http://www.maths.ed.ac.uk/ERGO/mfipmcs/
² http://www.caam.rice.edu/~optimization/L1/FPC_AS/
³ http://yelab.net/software/SLEP/


For a given tolerance $\varepsilon > 0$, we stop the tested algorithms when $\eta < \varepsilon$. For all the tests in this section, we set $\varepsilon = 10^{-6}$. The algorithms are also stopped when they reach the maximum number of iterations (1000 iterations for our algorithm and mfIPM, and 20,000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to their default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core Intel Xeon E5-2650 @ 2.60GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms on test instances $(A, b)$ obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in $A$ (the matrix representation of $\mathcal{A}$) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order 5 polynomial is used to generate the basis functions. This naming convention is also used for the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$, which is denoted as $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$.

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, $m$ denotes the number of samples, $n$ denotes the number of features, and "nnz" denotes the number of nonzeros in the solution $\hat{x}$ obtained by Ssnal using the following estimation:

$${\rm nnz} := \min\Big\{k \;\Big|\; \sum_{i=1}^{k}|\hat{x}_i| \ge 0.999\|\hat{x}\|_1\Big\},$$

where the entries of $\hat{x}$ are sorted such that $|\hat{x}_1| \ge \cdots \ge |\hat{x}_n|$. An entry marked as "Error" indicates that the algorithm breaks down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and an entry "00" in the time column means that the computation time is less than 0.5 second.

One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances

Table 1
Statistics of the UCI test instances.

  Probname             m; n               λ_max(AA*)
  E2006.train          16087; 150348      1.91e+05
  log1p.E2006.train    16087; 4265669     5.86e+07
  E2006.test           3308; 72812        4.79e+04
  log1p.E2006.test     3308; 1771946      1.46e+07
  pyrim5               74; 169911         1.22e+06
  triazines4           186; 557845        2.07e+07
  abalone7             4177; 6435         5.21e+05
  bodyfat7             252; 116280        5.29e+04
  housing7             506; 77520         3.28e+05
  mpg7                 392; 3432          1.28e+04
  space_ga9            3107; 5005         4.01e+03


[Table 2. The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy ε = 10⁻⁶). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM. For each problem (m; n) and each λ_c ∈ {10⁻³, 10⁻⁴}, the table reports nnz, the relative KKT residual η, and the computation time for each solver. Table entries not reproduced here.]


to the required accuracy after 20,000 iterations or 7 hours. In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances; in fact, for 3 test instances it breaks down due to internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with $\lambda_c = 10^{-3}$, Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20,000 iterations, consumes about 2 hours, and only produces a rather inaccurate solution.

On the other hand, one can observe that the two methods based on second order information, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds ($\lambda_c = 10^{-3}$). Of these two algorithms, Ssnal is clearly far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where the speedup factor can be up to 300. While Ssnal solves all the instances to the desired accuracy, mfIPM fails on 6 instances (those with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with an accuracy of $10^{-1}$ when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point method in exploiting the sparsity of the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix $A$ (the matrix representation of $\mathcal{A}$) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter $\lambda$ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when $\lambda$ is not a scalar, based on the recommendation of the referee we use another popular active-set based solver, PSSas⁴ [44], to replace FPC_AS for this test. All the parameters for PSSas are set to their default values. In our tests, PSSas is stopped when it reaches the maximum number of 20,000 iterations or the maximum computation time of 7 hours. We note that, by default, PSSas also terminates when its progress is smaller than the threshold $10^{-9}$.

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions drawn from Table 2. The performance of Ssnal is essentially invariant under the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM-type solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy; in fact, PSSas fails on all the test instances when $\lambda_c = 10^{-4}$. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with $\eta = 3.2\times 10^{-3}$. (Actually, we also ran PSSas on these test instances without scaling

⁴ https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


[Table 3. The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy ε = 10⁻⁶). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM. For each problem (m; n) and each λ_c ∈ {10⁻³, 10⁻⁴}, the table reports nnz, the relative KKT residual η, and the computation time for each solver. Table entries not reproduced here.]


and obtained similar performance; detailed results are omitted to conserve space.) Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for the Sparco collection. In this subsection, the test instances $(A, b)$ are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce 60dB noise to $b$ (as in [18]) by using the MATLAB command b = awgn(b,60,'measured'). For these test instances, the matrix representations of the linear maps $\mathcal{A}$ are not available. Hence, ADMM is not tested in this subsection, as it would be extremely expensive, if not impossible, to compute and factorize $I + \sigma\mathcal{A}\mathcal{A}^*$.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of $\lambda_c$, i.e., $\lambda_c = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}$. As one can observe, as $\lambda_c$ decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions $(m, n)$ and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$ ($\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, and the computation times (in the format hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well on these test instances, which have small Lipschitz constants $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. As such, Ssnal does not have as clear an advantage as shown in Table 2. Moreover, since the matrix representations of the linear maps $\mathcal{A}$ and $\mathcal{A}^*$ involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases, with $\lambda_c = 10^{-3}$ and $10^{-4}$. We can observe that FPC_AS

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy ε = 10⁻⁶, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM; entries such as 6.4-7 denote 6.4×10⁻⁷, and times are in hours:minutes:seconds.

  λ_c    nnz    Iteration: a | b | c | d | e     η: a | b | c | d | e                       Time: a | b | c | d | e

  srcsep1, m = 29166, n = 57344, λ_max(AA*) = 3.56
  0.16   380    14 | 28 | 42 | 401 | 89          6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7       07 | 09 | 05 | 15 | 07
  0.10   726    16 | 38 | 42 | 574 | 161         4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7       11 | 15 | 05 | 22 | 13
  0.06   1402   19 | 41 | 63 | 801 | 393         1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7       18 | 18 | 10 | 30 | 32
  0.03   2899   19 | 56 | 110 | 901 | 337        2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7       28 | 53 | 16 | 33 | 28
  0.01   6538   17 | 88 | 223 | 1401 | 542       7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7       1:21 | 2:15 | 34 | 53 | 45

  srcsep2, m = 29166, n = 86016, λ_max(AA*) = 4.95
  0.16   423    15 | 29 | 42 | 501 | 127         3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7       14 | 13 | 07 | 25 | 15
  0.10   805    16 | 37 | 84 | 601 | 212         8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7       21 | 19 | 16 | 30 | 26
  0.06   1549   19 | 40 | 84 | 901 | 419         1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7       32 | 28 | 18 | 44 | 50
  0.03   3254   20 | 69 | 128 | 901 | 488        1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7       1:06 | 1:33 | 26 | 44 | 59
  0.01   7400   21 | 94 | 259 | 2201 | 837       8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7       1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy ε = 10⁻⁶, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM; entries such as 5.7-7 denote 5.7×10⁻⁷, and times are in hours:minutes:seconds.

  Probname (m; n)             λ_c     nnz      η: a | b | c | d | e                       Time: a | b | c | d | e
  blknheavi (1024; 1024)      10⁻³    12       5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7       01 | 01 | 55 | 08 | 06
                              10⁻⁴    12       9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7       01 | 01 | 49 | 07 | 07
  srcsep1 (29166; 57344)      10⁻³    14066    1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7       5:41 | 42:34 | 13:25 | 1:56 | 4:16
                              10⁻⁴    19306    9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7       9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
  srcsep2 (29166; 86016)      10⁻³    16502    3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7       9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
                              10⁻⁴    22315    7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7       19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
  srcsep3 (196608; 196608)    10⁻³    27314    6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7       33 | 6:24 | 8:51 | 47 | 49
                              10⁻⁴    83785    9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7       2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
  soccer1 (3200; 4096)        10⁻³    4        1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7       01 | 03 | 13:51 | 2:35 | 02
                              10⁻⁴    8        8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7       01 | 02 | 13:23 | 3:07 | 02
  soccer2 (3200; 4096)        10⁻³    4        3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7       00 | 03 | 13:46 | 1:40 | 02
                              10⁻⁴    8        2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7       01 | 03 | 13:27 | 3:07 | 02
  blurrycam (65536; 65536)    10⁻³    1694     1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7       03 | 09 | 03 | 02 | 07
                              10⁻⁴    5630     1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7       05 | 1:35 | 08 | 03 | 29
  blurspike (16384; 16384)    10⁻³    1954     3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7       03 | 05 | 6:38 | 03 | 27
                              10⁻⁴    11698    3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7       10 | 08 | 6:43 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when $\lambda_c = 10^{-3}$ and $10^{-4}$, respectively. For a few cases, FPC_AS can only achieve a poor accuracy ($10^{-1}$). The same lack of robustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. For example, both FPC_AS and APG fail to solve the rather small problem blknheavi ($m = n = 1024$), whose corresponding $\lambda_{\max}(\mathcal{A}\mathcal{A}^*) = 709$. On the other hand, LADMM, Ssnal, and mfIPM solve all the test instances successfully. Nevertheless, in some cases LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This again demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$, one can safely conclude that the first order algorithms can only be used to solve problems with relatively small $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. Moreover, in order to obtain efficient and robust algorithms for lasso problems, or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotically superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving $\ell_1$-regularized least squares (LS) problems. With the intelligent


incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale $\ell_1$-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and the anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candes, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrmann, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Tech., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation par elements finis d'ordre un et la resolution par penalisation-dualite d'une classe de problemes de Dirichlet non lineaires, Rev. Francaise d'Automatique Inform. Rech. Operationnelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B.-G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wachter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Software, 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C1,1 optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A two-phase proximal augmented Lagrangian method for convex quadratic semidefinite programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k²), Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

458 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

[51] X Xiao Y Li Z Wen and L Zhang Semi-smooth second-order type methods for compositeconvex programs preprint arXiv160307870 2016

[52] L Yang D F Sun and K-C Toh SDPNAL+ A majorized semismooth Newton-CGaugmented Lagrangian method for semidefinite programming with nonnegative constraintsMath Program Comput 7 (2015) pp 331ndash366

[53] M-C Yue Z Zhou and A M-C So Inexact Regularized Proximal Newton Method Prov-able Convergence Guarantees for Non-Smooth Convex Minimization without Strong Con-vexity preprint arXiv160507522 2016

[54] X Y Zhao D F Sun and K-C Toh A Newton-CG augmented Lagrangian method forsemidefinite programming SIAM J Optim 20 (2010) pp 1737ndash1765

[55] X Zhang M Burger and S Osher A unified primal-dual algorithm framework based onBregman iteration J Sci Comput 49 (2011) pp 20ndash46

[56] H Zhou The adaptive lasso and its oracle properties J Amer Statist Assoc 101 (2006)pp 1418ndash1429

[57] H Zhou and T Hastie Regularization and variable selection via the elastic net J RoyStatist Soc Ser B 67 (2005) pp 301ndash320

[58] Z Zhou and A M-C So A unified approach to error bounds for structured convex optimiza-tion problems Math Program 165 (2017) pp 689ndash728

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 10: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

442 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

3. An augmented Lagrangian method with asymptotic superlinear convergence. In this section, we shall propose an asymptotically superlinearly convergent augmented Lagrangian method for solving (P) and (D). We make the following standing assumptions regarding the function h.

Assumption 3.1.
(a) h : Y → ℜ is a convex differentiable function whose gradient is (1/α_h)-Lipschitz continuous, i.e.,

    ‖∇h(y′) − ∇h(y)‖ ≤ (1/α_h)‖y′ − y‖   ∀ y′, y ∈ Y.

(b) h is essentially locally strongly convex [21], i.e., for any compact and convex set K ⊂ dom ∂h, there exists β_K > 0 such that

    (1 − λ)h(y′) + λh(y) ≥ h((1 − λ)y′ + λy) + (β_K/2) λ(1 − λ)‖y′ − y‖²   ∀ y′, y ∈ K,

for all λ ∈ [0, 1].

Many commonly used loss functions in the machine learning literature satisfy the above mild assumptions. For example, h can be the loss function in the linear regression, logistic regression, and Poisson regression models. While the strict convexity of a convex function is closely related to the differentiability of its conjugate, the essential local strong convexity in Assumption 3.1(b) of a convex function was first proposed in [21] to obtain a characterization of the local Lipschitz continuity of the gradient of its conjugate function.
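For instance, in the lasso case the function h is the squared loss; the following small sketch (Python/NumPy, for illustration only) records h, its gradient, and its conjugate, for which Assumption 3.1 holds with α_h = 1.

```python
import numpy as np

# Squared loss h(w) = 0.5*||w - b||^2 used in the lasso problem (1).
def h(w, b):
    return 0.5 * np.linalg.norm(w - b) ** 2

def grad_h(w, b):
    # grad h(w) = w - b is 1-Lipschitz, so Assumption 3.1(a) holds with alpha_h = 1;
    # h is strongly convex, hence essentially locally strongly convex (Assumption 3.1(b)).
    return w - b

# Fenchel conjugate h*(y) = 0.5*||y||^2 + <b, y>, strongly convex with modulus
# alpha_h = 1, with globally Lipschitz gradient grad h*(y) = y + b.
def h_conj(y, b):
    return 0.5 * np.linalg.norm(y) ** 2 + b @ y

def grad_h_conj(y, b):
    return y + b
```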

The aforementioned assumptions on h imply the following useful properties of h^*. First, by [43, Proposition 12.60], we know that h^* is strongly convex with modulus α_h. Second, by [21, Corollary 4.4], we know that h^* is essentially smooth and ∇h^* is locally Lipschitz continuous on int(dom h^*). If the solution set to the KKT system associated with (P) and (D) is further assumed to be nonempty, then, similar to what we have discussed in the last section, one only needs to focus on int(dom h^*) × X when solving (D). Given σ > 0, the augmented Lagrangian function associated with (D) is given by

    L_σ(y, z; x) = l(y, z; x) + (σ/2)‖A^*y + z − c‖²,   ∀ (y, z, x) ∈ Y × X × X,

where the Lagrangian function l(·) is defined in (5).

3.1. SSNAL: A semismooth Newton augmented Lagrangian algorithm for (D). The detailed steps of our algorithm Ssnal for solving (D) are given as follows. Since a semismooth Newton method will be employed to solve the subproblems involved in this prototype of the inexact augmented Lagrangian method [42], it is natural for us to call our algorithm Ssnal.

Algorithm SSNAL: An inexact augmented Lagrangian method for (D).

Let σ_0 > 0 be a given parameter. Choose (y^0, z^0, x^0) ∈ int(dom h^*) × dom p^* × X. For k = 0, 1, ..., perform the following steps in each iteration.
Step 1. Compute

(18)    (y^{k+1}, z^{k+1}) ≈ arg min { Ψ_k(y, z) := L_{σ_k}(y, z; x^k) }.

Step 2. Compute x^{k+1} = x^k − σ_k(A^* y^{k+1} + z^{k+1} − c) and update σ_{k+1} ↑ σ_∞ ≤ ∞.
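To make the two steps concrete, the following minimal sketch (in Python/NumPy; an illustration, not the implementation used in the paper) specializes the outer loop to the lasso case h(w) = ½‖w − b‖², p = λ‖·‖₁, c = 0. The argument `ssn_inner`, which inexactly minimizes Ψ_k over y, is a placeholder for the semismooth Newton solver of section 3.2; a matching sketch is given after Algorithm SSN below.

```python
import numpy as np

def prox_l1(v, t):
    # Prox_{t*||.||_1}(v): componentwise soft thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ssnal_lasso(A, b, lam, ssn_inner, sigma0=1.0, sigma_max=1e6, max_iter=50, tol=1e-6):
    """Sketch of Algorithm SSNAL for the lasso case (c = 0 in (D))."""
    m, n = A.shape
    x, y = np.zeros(n), np.zeros(m)
    sigma = sigma0
    for k in range(max_iter):
        # Step 1: inexactly minimize Psi_k; the z-part has the closed form
        # z = Prox_{p*/sigma}(x/sigma - A^T y), obtained via the Moreau identity.
        y = ssn_inner(A, b, lam, x, y, sigma)
        xt = x - sigma * (A.T @ y)
        z = (xt - prox_l1(xt, sigma * lam)) / sigma
        # Step 2: multiplier update x^{k+1} = x^k - sigma*(A^T y + z - c),
        # which here simplifies to the soft-thresholding step below.
        x_new = prox_l1(xt, sigma * lam)
        if np.linalg.norm(x_new - x) <= tol * (1.0 + np.linalg.norm(x)):
            x = x_new
            break
        x = x_new
        sigma = min(sigma * 3.0, sigma_max)   # sigma_k increasing toward sigma_inf
    return x, y, z
```

The stopping test on ‖x^{k+1} − x^k‖ is only a stand-in; the experiments in section 4 stop on the relative KKT residual η defined there.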

Next, we shall adapt the results developed in [41, 42, 34] to establish the global and local superlinear convergence of our algorithm.


Since the inner problems cannot be solved exactly, we use the following standard stopping criterion studied in [41, 42] for approximately solving (18):

(A)    Ψ_k(y^{k+1}, z^{k+1}) − inf Ψ_k ≤ ε_k²/(2σ_k),   ∑_{k=0}^∞ ε_k < ∞.

Now we can state the global convergence of Algorithm Ssnal, which follows from [41, 42] without much difficulty.

Theorem 3.2. Suppose that Assumption 3.1 holds and that the solution set to (P) is nonempty. Let {(y^k, z^k, x^k)} be the infinite sequence generated by Algorithm Ssnal with stopping criterion (A). Then the sequence {x^k} is bounded and converges to an optimal solution of (P). In addition, {(y^k, z^k)} is also bounded and converges to the unique optimal solution (y^*, z^*) ∈ int(dom h^*) × dom p^* of (D).

Proof. Since the solution set to (P) is assumed to be nonempty, the optimal value of (P) is finite. From Assumption 3.1(a), we have that dom h = Y and h^* is strongly convex [43, Proposition 12.60]. Then, by Fenchel's duality theorem [40, Corollary 31.2.1], we know that the solution set to (D) is nonempty and that the optimal value of (D) is finite and equal to the optimal value of (P). That is, the solution set to the KKT system associated with (P) and (D) is nonempty. The uniqueness of the optimal solution (y^*, z^*) ∈ int(dom h^*) × X of (D) then follows directly from the strong convexity of h^*. By combining this uniqueness with [42, Theorem 4], one readily obtains the boundedness of {(y^k, z^k)} and the other desired results.

We need the following stopping criteria for the local convergence analysis:

(B1)    Ψ_k(y^{k+1}, z^{k+1}) − inf Ψ_k ≤ (δ_k²/(2σ_k))‖x^{k+1} − x^k‖²,   ∑_{k=0}^∞ δ_k < +∞,

(B2)    dist(0, ∂Ψ_k(y^{k+1}, z^{k+1})) ≤ (δ_k′/σ_k)‖x^{k+1} − x^k‖,   0 ≤ δ_k′ → 0,

where x^{k+1} = x^k − σ_k(A^*y^{k+1} + z^{k+1} − c).

Theorem 3.3. Assume that Assumption 3.1 holds and that the solution set Ω to (P) is nonempty. Suppose that T_f satisfies the error bound condition (2) for the origin with modulus a_f. Let {(y^k, z^k, x^k)} be any infinite sequence generated by Algorithm Ssnal with stopping criteria (A) and (B1). Then the sequence {x^k} converges to x^* ∈ Ω, and for all k sufficiently large,

(19)    dist(x^{k+1}, Ω) ≤ θ_k dist(x^k, Ω),

where θ_k = (a_f(a_f² + σ_k²)^{−1/2} + 2δ_k)(1 − δ_k)^{−1} → θ_∞ = a_f(a_f² + σ_∞²)^{−1/2} < 1 as k → +∞. Moreover, the sequence {(y^k, z^k)} converges to the unique optimal solution (y^*, z^*) ∈ int(dom h^*) × dom p^* of (D).

Moreover, if T_l is metrically subregular at (y^*, z^*, x^*) for the origin with modulus a_l and the stopping criterion (B2) is also used, then for all k sufficiently large,

    ‖(y^{k+1}, z^{k+1}) − (y^*, z^*)‖ ≤ θ_k′‖x^{k+1} − x^k‖,

where θ_k′ = a_l(1 + δ_k′)/σ_k with lim_{k→∞} θ_k′ = a_l/σ_∞.

Proof. The first part of the theorem follows from [34, Theorem 2.1], [42, Proposition 7 and Theorem 5], and Theorem 3.2. To prove the second part, we recall that


if T_l is metrically subregular at (y^*, z^*, x^*) for the origin with the modulus a_l and (y^k, z^k, x^k) → (y^*, z^*, x^*), then for all k sufficiently large,

    ‖(y^{k+1}, z^{k+1}) − (y^*, z^*)‖ + dist(x^{k+1}, Ω) ≤ a_l dist(0, T_l(y^{k+1}, z^{k+1}, x^{k+1})).

Therefore, by the estimate (4.21) in [42] and the stopping criterion (B2), we obtain that for all k sufficiently large,

    ‖(y^{k+1}, z^{k+1}) − (y^*, z^*)‖ ≤ a_l(1 + δ_k′)/σ_k ‖x^{k+1} − x^k‖.

This completes the proof of Theorem 3.3.

Remark 3.4. In the above theorem, when σ_∞ = +∞, the inequality (19) directly implies that {x^k} converges Q-superlinearly. Moreover, recent advances in [12] also establish the asymptotic R-superlinear convergence of the dual iteration sequence {(y^k, z^k)}. Indeed, from [12, Proposition 4.1], under the same conditions as in Theorem 3.3, we have that for k sufficiently large,

(20)    ‖(y^{k+1}, z^{k+1}) − (y^*, z^*)‖ ≤ θ_k′‖x^{k+1} − x^k‖ ≤ θ_k′(1 − δ_k)^{−1} dist(x^k, Ω),

where θ_k′(1 − δ_k)^{−1} → a_l/σ_∞. Thus, if σ_∞ = ∞, the Q-superlinear convergence of {x^k} and (20) further imply that {(y^k, z^k)} converges R-superlinearly.

We should emphasize here that, by combining Remarks 2.8 and 3.4 and Theorem 3.3, our Algorithm Ssnal is guaranteed to produce an asymptotically superlinearly convergent sequence when used to solve (D) for many commonly used regularizers and loss functions. In particular, the Ssnal algorithm is asymptotically superlinearly convergent when applied to the dual of (1).

3.2. Solving the augmented Lagrangian subproblems. Here we shall propose an efficient semismooth Newton algorithm to solve the inner subproblems in the augmented Lagrangian method (18). That is, for some fixed σ > 0 and x ∈ X, we aim to solve

(21)    min_{y,z} Ψ(y, z) := L_σ(y, z; x).

Since Ψ(·, ·) is a strongly convex function, we have that for any α ∈ ℜ, the level set L_α := {(y, z) ∈ dom h^* × dom p^* | Ψ(y, z) ≤ α} is a closed and bounded convex set. Moreover, problem (21) admits a unique optimal solution, denoted as (ȳ, z̄) ∈ int(dom h^*) × dom p^*.

Denote, for any y ∈ Y,

    ψ(y) := inf_z Ψ(y, z) = h^*(y) + p^*(Prox_{p^*/σ}(x/σ − A^*y + c))
            + (1/(2σ))‖Prox_{σp}(x − σ(A^*y − c))‖² − (1/(2σ))‖x‖².

Therefore, if (ȳ, z̄) = arg min Ψ(y, z), then (ȳ, z̄) ∈ int(dom h^*) × dom p^* can be computed simultaneously by

    ȳ = arg min ψ(y),   z̄ = Prox_{p^*/σ}(x/σ − A^*ȳ + c).


Note that ψ(·) is strongly convex and continuously differentiable on int(dom h^*), with

    ∇ψ(y) = ∇h^*(y) − A Prox_{σp}(x − σ(A^*y − c)),   ∀ y ∈ int(dom h^*).

Thus, ȳ can be obtained by solving the following nonsmooth equation:

(22)    ∇ψ(y) = 0,   y ∈ int(dom h^*).
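For the lasso case, where h^*(y) = ½‖y‖² + ⟨b, y⟩, p = λ‖·‖₁, and c = 0, the quantities ψ(y) and ∇ψ(y) can be evaluated with one matrix-vector product each way, as in the following sketch (Python/NumPy; an illustrative specialization under the stated assumptions, not the paper's code). Since p^* is the indicator of {‖z‖_∞ ≤ λ}, the term p^*(z̄) vanishes.

```python
import numpy as np

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def psi_and_grad(y, A, b, lam, x, sigma):
    """psi(y) and grad psi(y) for h*(y) = 0.5*||y||^2 + <b,y>, p = lam*||.||_1, c = 0."""
    xt = x - sigma * (A.T @ y)          # x - sigma*(A^* y - c)
    px = prox_l1(xt, sigma * lam)       # Prox_{sigma p}(x - sigma*(A^* y - c))
    psi = 0.5 * (y @ y) + b @ y + (px @ px) / (2.0 * sigma) - (x @ x) / (2.0 * sigma)
    grad = y + b - A @ px               # grad h*(y) - A Prox_{sigma p}(...)
    return psi, grad
```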

Let y ∈ int(dom h^*) be any given point. Since h^* is a convex function with a locally Lipschitz continuous gradient on int(dom h^*), the following operator is well defined:

    ∂̂²ψ(y) := ∂(∇h^*)(y) + σ A ∂Prox_{σp}(x − σ(A^*y − c)) A^*,

where ∂(∇h^*)(y) is the Clarke subdifferential of ∇h^* at y [10], and ∂Prox_{σp}(x − σ(A^*y − c)) is the Clarke subdifferential of the Lipschitz continuous mapping Prox_{σp}(·) at x − σ(A^*y − c). Note that from [10, Proposition 2.3.3 and Theorem 2.6.6], we know that

    ∂̂²ψ(y)(d) ⊆ ∂²ψ(y)(d),   ∀ d ∈ Y,

where ∂²ψ(y) denotes the generalized Hessian of ψ at y. Define

(23)    V := H + σ A U A^*

with H ∈ ∂²h^*(y) and U ∈ ∂Prox_{σp}(x − σ(A^*y − c)). Then we have V ∈ ∂̂²ψ(y). Since h^* is a strongly convex function, we know that H is symmetric positive definite on Y, and thus V is also symmetric positive definite on Y.

Under the mild assumption that ∇h^* and Prox_{σp} are strongly semismooth (the definition of which is given next), we can design a superlinearly convergent semismooth Newton method to solve the nonsmooth equation (22).

Definition 3.5 (semismoothness [35, 38, 45]). Let F : O ⊆ X → Y be a locally Lipschitz continuous function on the open set O. F is said to be semismooth at x ∈ O if F is directionally differentiable at x and, for any V ∈ ∂F(x + Δx) with Δx → 0,

    F(x + Δx) − F(x) − VΔx = o(‖Δx‖).

F is said to be strongly semismooth at x if F is semismooth at x and

    F(x + Δx) − F(x) − VΔx = O(‖Δx‖²).

F is said to be a semismooth (respectively, strongly semismooth) function on O if it is semismooth (respectively, strongly semismooth) everywhere in O.

Note that it is widely known in the nonsmooth optimization/equation community that continuous piecewise affine functions and twice continuously differentiable functions are all strongly semismooth everywhere. In particular, Prox_{‖·‖₁}, as a Lipschitz continuous piecewise affine function, is strongly semismooth. See [14] for more semismooth and strongly semismooth functions.

Now we can design a semismooth Newton (Ssn) method to solve (22) as follows, and we can expect fast superlinear, or even quadratic, convergence.


Algorithm SSN: A semismooth Newton algorithm for solving (22) (SSN(y^0, x, σ)).

Given μ ∈ (0, 1/2), η̄ ∈ (0, 1), τ ∈ (0, 1], and δ ∈ (0, 1), choose y^0 ∈ int(dom h^*). Iterate the following steps for j = 0, 1, ....
Step 1. Choose H_j ∈ ∂(∇h^*)(y^j) and U_j ∈ ∂Prox_{σp}(x − σ(A^*y^j − c)). Let V_j := H_j + σ A U_j A^*. Solve the linear system

(24)    V_j d + ∇ψ(y^j) = 0

exactly, or by the conjugate gradient (CG) algorithm, to find d^j such that

    ‖V_j d^j + ∇ψ(y^j)‖ ≤ min(η̄, ‖∇ψ(y^j)‖^{1+τ}).

Step 2. (Line search) Set α_j = δ^{m_j}, where m_j is the first nonnegative integer m for which

    y^j + δ^m d^j ∈ int(dom h^*)  and  ψ(y^j + δ^m d^j) ≤ ψ(y^j) + μ δ^m ⟨∇ψ(y^j), d^j⟩.

Step 3. Set y^{j+1} = y^j + α_j d^j.
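A minimal sketch of Algorithm SSN for the lasso case is given below (Python/NumPy; it relies on the hypothetical helper `psi_and_grad` sketched after (22) and can be passed as the `ssn_inner` argument of the outer-loop sketch in section 3.1). Here H_j = I_m and dom h^* = ℜ^m, so the interior condition in Step 2 is automatic, and the generalized Hessian V_j = I_m + σ A_J A_Jᵀ is formed directly, anticipating the second order sparsity discussed in section 3.3; a CG solve with tolerance min(η̄, ‖∇ψ(y^j)‖^{1+τ}) could replace the direct solve.

```python
import numpy as np

def ssn_lasso(A, b, lam, x, y0, sigma, mu=1e-4, delta=0.5, max_iter=50, tol=1e-8):
    """Semismooth Newton sketch for (22) in the lasso case (H_j = I_m, c = 0)."""
    m = A.shape[0]
    y = y0.copy()
    for _ in range(max_iter):
        psi, grad = psi_and_grad(y, A, b, lam, x, sigma)   # assumed helper (section 3.2)
        if np.linalg.norm(grad) <= tol:
            break
        # Step 1: U_j = Diag(u), u_i = 1 iff |xt_i| > sigma*lam, so V_j = I_m + sigma*A_J A_J^T.
        xt = x - sigma * (A.T @ y)
        AJ = A[:, np.abs(xt) > sigma * lam]
        d = np.linalg.solve(np.eye(m) + sigma * (AJ @ AJ.T), -grad)
        # Step 2: backtracking line search, alpha_j = delta^{m_j}.
        alpha, gd = 1.0, grad @ d
        for _ in range(60):
            psi_trial, _ = psi_and_grad(y + alpha * d, A, b, lam, x, sigma)
            if psi_trial <= psi + mu * alpha * gd:
                break
            alpha *= delta
        y = y + alpha * d                                   # Step 3
    return y
```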

The convergence results for the above Ssn algorithm are stated in the next theorem.

Theorem 3.6. Assume that ∇h^*(·) and Prox_{σp}(·) are strongly semismooth on int(dom h^*) and X, respectively. Let the sequence {y^j} be generated by Algorithm Ssn. Then {y^j} converges to the unique optimal solution ȳ ∈ int(dom h^*) of the problem in (22), and

    ‖y^{j+1} − ȳ‖ = O(‖y^j − ȳ‖^{1+τ}).

Proof. Since, by [54, Proposition 3.3], d^j is a descent direction, Algorithm Ssn is well defined. By (23), we know that for any j ≥ 0, V_j ∈ ∂̂²ψ(y^j). Then we can prove the conclusion of this theorem by following the proofs of [54, Theorems 3.4 and 3.5]. We omit the details here.

We shall now discuss the implementation of the stopping criteria (A), (B1), and (B2) when Algorithm Ssn is used to solve the subproblem (18) in Algorithm Ssnal. Note that when Algorithm Ssn is applied to minimize Ψ_k(·) to find

    y^{k+1} = Ssn(y^k, x^k, σ_k)  and  z^{k+1} = Prox_{p^*/σ_k}(x^k/σ_k − A^*y^{k+1} + c),

we have, by simple calculations and the strong convexity of h^*, that

    Ψ_k(y^{k+1}, z^{k+1}) − inf Ψ_k = ψ_k(y^{k+1}) − inf ψ_k ≤ (1/(2α_h))‖∇ψ_k(y^{k+1})‖²

and (∇ψ_k(y^{k+1}), 0) ∈ ∂Ψ_k(y^{k+1}, z^{k+1}), where ψ_k(y) := inf_z Ψ_k(y, z) for all y ∈ Y. Therefore, the stopping criteria (A), (B1), and (B2) can be achieved by the following implementable criteria:

(A′)    ‖∇ψ_k(y^{k+1})‖ ≤ √(α_h/σ_k) ε_k,   ∑_{k=0}^∞ ε_k < ∞,

(B1′)   ‖∇ψ_k(y^{k+1})‖ ≤ √(α_h σ_k) δ_k ‖A^*y^{k+1} + z^{k+1} − c‖,   ∑_{k=0}^∞ δ_k < +∞,

(B2′)   ‖∇ψ_k(y^{k+1})‖ ≤ δ_k′ ‖A^*y^{k+1} + z^{k+1} − c‖,   0 ≤ δ_k′ → 0.
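In code, checking these criteria only requires ‖∇ψ_k(y^{k+1})‖ and the residual ‖A^*y^{k+1} + z^{k+1} − c‖; the small sketch below (Python/NumPy, continuing the lasso conventions above, where α_h = 1) is one possible realization, with ε_k and δ_k chosen as summable sequences such as 1/k².

```python
import numpy as np

def inner_criteria(grad_psi, Aty_plus_z_minus_c, sigma, eps_k, delta_k, delta_k_prime, alpha_h=1.0):
    """Return whether the implementable criteria (A'), (B1'), (B2') hold."""
    g = np.linalg.norm(grad_psi)            # ||grad psi_k(y^{k+1})||
    r = np.linalg.norm(Aty_plus_z_minus_c)  # ||A^* y^{k+1} + z^{k+1} - c||
    A_ok  = g <= np.sqrt(alpha_h / sigma) * eps_k
    B1_ok = g <= np.sqrt(alpha_h * sigma) * delta_k * r
    B2_ok = g <= delta_k_prime * r
    return A_ok, B1_ok, B2_ok
```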


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as ‖∇ψ_k(y^{k+1})‖ is sufficiently small.

3.3. An efficient implementation of SSN for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we shall discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer p is chosen to be λ‖·‖₁ for some λ > 0. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction d^j from the linear system (24). So we shall first discuss the solution of this linear system.

Let (x, y) ∈ ℜⁿ × ℜᵐ and σ > 0 be given. We consider the following Newton linear system:

(25)    (H + σ A U Aᵀ) d = −∇ψ(y),

where H ∈ ∂(∇h^*)(y), A denotes the matrix representation of A with respect to the standard bases of ℜⁿ and ℜᵐ, and U ∈ ∂Prox_{σλ‖·‖₁}(x̃) with x̃ := x − σ(Aᵀy − c). Since H is a symmetric and positive definite matrix, (25) can be equivalently rewritten as

    (I_m + σ(L^{-1}A) U (L^{-1}A)ᵀ)(Lᵀ d) = −L^{-1}∇ψ(y),

where L is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of H such that H = LLᵀ. In many applications, H is usually a sparse matrix. Indeed, when the function h in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrix H is in fact diagonal. That is, the costs of computing L and its inverse are negligible in most situations. Therefore, without loss of generality, we can consider a simplified version of (25) as follows:

(26)    (I_m + σ A U Aᵀ) d = −∇ψ(y),

which is precisely the Newton system associated with the standard lasso problem (1). Since U ∈ ℜⁿˣⁿ is a diagonal matrix, at first glance the costs of computing AUAᵀ and the matrix-vector multiplication AUAᵀd for a given vector d ∈ ℜᵐ are O(m²n) and O(mn), respectively. These computational costs are too expensive when the dimensions of A are large and can make commonly employed approaches, such as the Cholesky factorization and the conjugate gradient method, inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of U is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level at which they are negligible, or at least insignificant compared to other costs. Next, we shall show how this can be done by taking full advantage of the sparsity of U. This sparsity will be referred to as the second order sparsity of the underlying problem.

For x̃ = x − σ(Aᵀy − c), in our computations we can always choose U = Diag(u), the diagonal matrix whose ith diagonal element u_i is given by

    u_i = 0 if |x̃_i| ≤ σλ,  and  u_i = 1 otherwise,   i = 1, ..., n.

Since Prox_{σλ‖·‖₁}(x̃) = sign(x̃) ∘ max(|x̃| − σλ, 0), it is not difficult to see that U ∈ ∂Prox_{σλ‖·‖₁}(x̃). Let J := {j | |x̃_j| > σλ, j = 1, ..., n} and r := |J|, the cardinality of J. By taking the special 0-1 structure of U into consideration, we have that


Fig. 1. Reducing the computational costs from O(m²n) to O(m²r): AUAᵀ is replaced by A_J A_Jᵀ with A_J ∈ ℜ^{m×r}.

(27)    AUAᵀ = (AU)(AU)ᵀ = A_J A_Jᵀ,

where A_J ∈ ℜᵐˣʳ is the submatrix of A with those columns not in J removed. Then, by using (27), the costs of computing AUAᵀ and AUAᵀd for a given vector d are reduced to O(m²r) and O(mr), respectively. Due to the sparsity-promoting property of the regularizer p, the number r is usually much smaller than n. Thus, by exploiting the aforementioned second order sparsity, we can greatly reduce the computational costs of solving the linear system (26) with the Cholesky factorization. More specifically, the total computational cost of using the Cholesky factorization to solve the linear system is reduced from O(m²(m + n)) to O(m²(m + r)). See Figure 1 for an illustration of this reduction. This means that even if n happens to be extremely large (say, larger than 10⁷), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization as long as both m and r are moderate (say, less than 10⁴). If, in addition, r ≪ m, which is often the case when m is large and the optimal solutions to the underlying problem are sparse, then instead of factorizing an m × m matrix, we can make use of the Sherman-Morrison-Woodbury formula [22] to obtain the inverse of I_m + σAUAᵀ by inverting a much smaller r × r matrix as follows:

    (I_m + σAUAᵀ)^{-1} = (I_m + σA_J A_Jᵀ)^{-1} = I_m − A_J(σ^{-1} I_r + A_Jᵀ A_J)^{-1} A_Jᵀ.

See Figure 2 for an illustration of the computation of A_Jᵀ A_J. In this case, the total computational cost of solving the Newton linear system (26) is reduced significantly further, from O(m²(m + r)) to O(r²(m + r)). We should emphasize here that this dramatic reduction in computational costs results from combining a careful examination of the existing second order sparsity in lasso-type problems with some "smart" numerical linear algebra.

From the above arguments, we can see that as long as the number of nonzero components of Prox_{σλ‖·‖₁}(x̃) is small, say less than √n, and H_j ∈ ∂(∇h^*)(y^j) is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low cost. In particular, this is true for lasso problems admitting sparse


Fig. 2. Further reducing the computational costs to O(r²m) by forming A_Jᵀ A_J ∈ ℜ^{r×r}.

solutions. Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). Note that one may argue that even if the original problem has only sparse solutions, at certain stages one may still encounter the situation where the number of nonzero components of Prox_{σλ‖·‖₁}(x̃) is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice, since we always start with a sparse feasible point, e.g., the zero vector. Second, even if at certain steps this phenomenon does occur, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter σ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with p(·) chosen to be λ‖·‖₁.

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two recent papers [18, 36], which appear to suggest that for some large-scale sparse reconstruction problems, mfIPM¹ and FPC_AS² have mostly outperformed the other solvers. Hence, in this section, we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP³ [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparison, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB with the step length set to 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later, they lack the ability to efficiently solve difficult problems such as large-scale regression problems in which the data A is badly conditioned.

For testing purposes, the regularization parameter λ in the lasso problem (1) is chosen as

    λ = λ_c ‖A^*b‖_∞,

where 0 < λ_c < 1. In our numerical experiments, we measure the accuracy of an approximate optimal solution x for (1) by using the following relative KKT residual:

    η = ‖x − prox_{λ‖·‖₁}(x − A^*(Ax − b))‖ / (1 + ‖x‖ + ‖Ax − b‖).
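A direct transcription of this residual (Python/NumPy, for a dense or sparse matrix A; an illustrative sketch) is given below; here λ would be set as λ = λ_c ‖Aᵀb‖_∞ as above, e.g., lam = lam_c * np.linalg.norm(A.T @ b, np.inf).

```python
import numpy as np

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def kkt_residual(A, b, lam, x):
    """Relative KKT residual eta of an approximate lasso solution x."""
    res = A @ x - b
    num = np.linalg.norm(x - prox_l1(x - A.T @ res, lam))
    return num / (1.0 + np.linalg.norm(x) + np.linalg.norm(res))
```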

¹ http://www.maths.ed.ac.uk/ERGO/mfipmcs/
² http://www.caam.rice.edu/~optimization/L1/FPC_AS/
³ http://yelab.net/software/SLEP/


For a given tolerance ε > 0, we will stop the tested algorithms when η < ε. For all the tests in this section, we set ε = 10^{-6}. The algorithms will also be stopped when they reach the maximum number of iterations (1,000 iterations for our algorithm and mfIPM, and 20,000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to their default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core Intel Xeon E5-2650, 2.60 GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms with the test instances (A, b) obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in A (the matrix representation of A) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order 5 polynomial is used to generate the basis functions. This naming convention is also used for the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of AA^*, which is denoted as λ_max(AA^*).

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, m denotes the number of samples, n denotes the number of features, and "nnz" denotes the number of nonzeros in the solution x obtained by Ssnal using the following estimation:

    nnz := min { k | ∑_{i=1}^k |x̂_i| ≥ 0.999‖x‖₁ },

where x̂ is obtained by sorting x such that |x̂_1| ≥ ⋯ ≥ |x̂_n|. An entry marked as "Error" indicates that the algorithm broke down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and the entry "00" in the time column means that the computation time is less than 0.5 second.
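The estimator is cheap to compute; a possible implementation (Python/NumPy, an illustrative sketch) is:

```python
import numpy as np

def estimated_nnz(x, frac=0.999):
    """Smallest k such that the k largest |x_i| carry at least frac of ||x||_1."""
    s = np.sort(np.abs(x))[::-1]   # |x_hat_1| >= ... >= |x_hat_n|
    csum = np.cumsum(s)
    if csum[-1] == 0.0:
        return 0
    return int(np.searchsorted(csum, frac * csum[-1]) + 1)
```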

Table 1
Statistics of the UCI test instances.

Probname             m; n               λ_max(AA^*)
E2006.train          16087; 150348      1.91e+05
log1p.E2006.train    16087; 4265669     5.86e+07
E2006.test           3308; 72812        4.79e+04
log1p.E2006.test     3308; 1771946      1.46e+07
pyrim5               74; 169911         1.22e+06
triazines4           186; 557845        2.07e+07
abalone7             4177; 6435         5.21e+05
bodyfat7             252; 116280        5.29e+04
housing7             506; 77520         3.28e+05
mpg7                 392; 3432          1.28e+04
space_ga9            3107; 5005         4.01e+03


Table 2
The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy ε = 10^{-6}). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM; for each instance (m; n) and each λ_c ∈ {10^{-3}, 10^{-4}}, the columns report nnz, the relative KKT residuals η attained by a | b | c | d | e | f, and the computation times of a | b | c | d | e | f.


One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances to the required accuracy after 20,000 iterations or 7 hours. In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances; in fact, for 3 test instances it breaks down due to some internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with λ_c = 10^{-3}, we can see that Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20,000 iterations and consumes about 2 hours but only produces a rather inaccurate solution.

On the other hand, one can observe that the two second order information based methods, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds (λ_c = 10^{-3}). Of these two algorithms, Ssnal is clearly far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where the factor can be up to 300 times. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with an accuracy of 10^{-1} when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point method in exploiting the sparsity in the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix A (the matrix representation of A) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter λ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when λ is not a scalar, based on the recommendation of the referee we use another popular active-set based solver, PSSas⁴ [44], to replace FPC_AS in the testing. All the parameters for PSSas are set to their default values. In our tests, PSSas will be stopped when it reaches the maximum number of 20,000 iterations or the maximum computation time of 7 hours. We note that, by default, PSSas will terminate when its progress is smaller than the threshold 10^{-9}.

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that this simple normalization technique does not change the conclusions drawn from Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM-type solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy. In fact, PSSas fails on all the test instances when λ_c = 10^{-4}. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with η = 3.2 × 10^{-3}.

⁴ https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3
The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy ε = 10^{-6}). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM; for each instance (m; n) and each λ_c ∈ {10^{-3}, 10^{-4}}, the columns report nnz, the relative KKT residuals η attained by a | b | c1 | d | e | f, and the computation times of a | b | c1 | d | e | f.


(Actually, we also ran PSSas on these test instances without scaling and obtained similar performance; detailed results are omitted to conserve space.) Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for the Sparco collection. In this subsection, the test instances (A, b) are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce 60dB noise to b (as in [18]) by using the MATLAB command b = awgn(b, 60, 'measured'). For these test instances, the matrix representations of the linear maps A are not available. Hence, ADMM will not be tested in this subsection, as it would be extremely expensive, if not impossible, to compute and factorize I + σAA^*.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of λ_c, i.e., λ_c = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}. As one can observe, as λ_c decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions (m, n) and the largest eigenvalue of AA^* (λ_max(AA^*)). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, as well as the computation times (in the format of hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well on these test instances with small Lipschitz constants λ_max(AA^*). As such, Ssnal does not have as clear an advantage as that shown in Table 2. Moreover, since the matrix representations of the linear maps A and A^* involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases, with λ_c = 10^{-3} and 10^{-4}. We can observe that FPC_AS

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy ε = 10^{-6}, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

λ_c     nnz     Iteration (a | b | c | d | e)     η (a | b | c | d | e)                       Time (a | b | c | d | e)

srcsep1: m = 29166, n = 57344, λ_max(AA^*) = 3.56
0.16    380     14 | 28 | 42 | 401 | 89          6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7       07 | 09 | 05 | 15 | 07
0.10    726     16 | 38 | 42 | 574 | 161         4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7       11 | 15 | 05 | 22 | 13
0.06    1402    19 | 41 | 63 | 801 | 393         1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7       18 | 18 | 10 | 30 | 32
0.03    2899    19 | 56 | 110 | 901 | 337        2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7       28 | 53 | 16 | 33 | 28
0.01    6538    17 | 88 | 223 | 1401 | 542       7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7       1:21 | 2:15 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, λ_max(AA^*) = 4.95
0.16    423     15 | 29 | 42 | 501 | 127         3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7       14 | 13 | 07 | 25 | 15
0.10    805     16 | 37 | 84 | 601 | 212         8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7       21 | 19 | 16 | 30 | 26
0.06    1549    19 | 40 | 84 | 901 | 419         1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7       32 | 28 | 18 | 44 | 50
0.03    3254    20 | 69 | 128 | 901 | 488        1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7       1:06 | 1:33 | 26 | 44 | 59
0.01    7400    21 | 94 | 259 | 2201 | 837       8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7       1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy ε = 10^{-6}, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

probname (m; n)        λ_c        nnz      η (a | b | c | d | e)                       Time (a | b | c | d | e)

blknheavi              10^{-3}    12       5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7       01 | 01 | 55 | 08 | 06
(1024; 1024)           10^{-4}    12       9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7       01 | 01 | 49 | 07 | 07

srcsep1                10^{-3}    14066    1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7       5:41 | 42:34 | 13:25 | 1:56 | 4:16
(29166; 57344)         10^{-4}    19306    9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7       9:28 | 3:31:08 | 32:28 | 2:50 | 13:06

srcsep2                10^{-3}    16502    3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7       9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
(29166; 86016)         10^{-4}    22315    7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7       19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01

srcsep3                10^{-3}    27314    6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7       33 | 6:24 | 8:51 | 47 | 49
(196608; 196608)       10^{-4}    83785    9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7       2:03 | 1:42:59 | 3:40 | 1:15 | 3:06

soccer1                10^{-3}    4        1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7       01 | 03 | 13:51 | 2:35 | 02
(3200; 4096)           10^{-4}    8        8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7       01 | 02 | 13:23 | 3:07 | 02

soccer2                10^{-3}    4        3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7       00 | 03 | 13:46 | 1:40 | 02
(3200; 4096)           10^{-4}    8        2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7       01 | 03 | 13:27 | 3:07 | 02

blurrycam              10^{-3}    1694     1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7       03 | 09 | 03 | 02 | 07
(65536; 65536)         10^{-4}    5630     1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7       05 | 1:35 | 08 | 03 | 29

blurspike              10^{-3}    1954     3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7       03 | 05 | 6:38 | 03 | 27
(16384; 16384)         10^{-4}    11698    3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7       10 | 08 | 6:43 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when λ_c = 10^{-3} and 10^{-4}, respectively. For a few cases, FPC_AS can only achieve a poor accuracy (10^{-1}). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of λ_max(AA^*). For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi (m = n = 1024), whose corresponding λ_max(AA^*) = 709. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This again demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to λ_max(AA^*), one can safely conclude that the first order algorithms can only be used to solve problems with relatively small λ_max(AA^*). Moreover, in order to obtain efficient and robust algorithms for lasso problems, or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotically superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving ℓ1-regularized least squares (LS) problems. With the intelligent


incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale ℓ1-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and the anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183-202.
[3] S. Becker, J. Bobin, and E. J. Candès, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1-39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890-912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrman, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1-16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435-467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375-396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Tech., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33-61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65-82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293-314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sele. Topics Signal Process., 1 (2007), pp. 586-597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91-124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1-31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17-40.
[20] R. Glowinski and A. Marroco, Sur l'approximation par éléments finis d'ordre un et la résolution par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationelle, 9 (1975), pp. 41-76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263-270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883-891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591-604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281-325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wachter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Software, 31 (2016), pp. 605-621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C^{1,1} optimization problems, Math. Program., 88 (2000), pp. 169-180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting Risk from Financial Reports with Regression, NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420-1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A Two-Phase Proximal Augmented Lagrangian Method for Convex Quadratic Semidefinite Programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408-425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277-293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959-972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298-333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372-376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353-367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206-214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877-898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97-116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150-169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387-423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267-288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832-1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479-2493.

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

458 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

[51] X Xiao Y Li Z Wen and L Zhang Semi-smooth second-order type methods for compositeconvex programs preprint arXiv160307870 2016

[52] L Yang D F Sun and K-C Toh SDPNAL+ A majorized semismooth Newton-CGaugmented Lagrangian method for semidefinite programming with nonnegative constraintsMath Program Comput 7 (2015) pp 331ndash366

[53] M-C Yue Z Zhou and A M-C So Inexact Regularized Proximal Newton Method Prov-able Convergence Guarantees for Non-Smooth Convex Minimization without Strong Con-vexity preprint arXiv160507522 2016

[54] X Y Zhao D F Sun and K-C Toh A Newton-CG augmented Lagrangian method forsemidefinite programming SIAM J Optim 20 (2010) pp 1737ndash1765

[55] X Zhang M Burger and S Osher A unified primal-dual algorithm framework based onBregman iteration J Sci Comput 49 (2011) pp 20ndash46

[56] H Zhou The adaptive lasso and its oracle properties J Amer Statist Assoc 101 (2006)pp 1418ndash1429

[57] H Zhou and T Hastie Regularization and variable selection via the elastic net J RoyStatist Soc Ser B 67 (2005) pp 301ndash320

[58] Z Zhou and A M-C So A unified approach to error bounds for structured convex optimiza-tion problems Math Program 165 (2017) pp 689ndash728

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 11: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 443

Since the inner problems cannot be solved exactly, we use the following standard stopping criterion studied in [41, 42] for approximately solving (18):

(A)  Ψ_k(y^{k+1}, z^{k+1}) − inf Ψ_k ≤ ε_k²/(2σ_k),   Σ_{k=0}^∞ ε_k < ∞.

Now we can state the global convergence of Algorithm Ssnal, which follows from [41, 42] without much difficulty.

Theorem 3.2. Suppose that Assumption 3.1 holds and that the solution set to (P) is nonempty. Let {(y^k, z^k, x^k)} be the infinite sequence generated by Algorithm Ssnal with stopping criterion (A). Then the sequence {x^k} is bounded and converges to an optimal solution of (P). In addition, {(y^k, z^k)} is also bounded and converges to the unique optimal solution (y*, z*) ∈ int(dom h*) × dom p* of (D).

Proof. Since the solution set to (P) is assumed to be nonempty, the optimal value of (P) is finite. From Assumption 3.1(a), we have that dom h = Y and h* is strongly convex [43, Proposition 12.60]. Then, by Fenchel's duality theorem [40, Corollary 31.2.1], we know that the solution set to (D) is nonempty and the optimal value of (D) is finite and equal to the optimal value of (P). That is, the solution set to the KKT system associated with (P) and (D) is nonempty. The uniqueness of the optimal solution (y*, z*) ∈ int(dom h*) × X of (D) then follows directly from the strong convexity of h*. By combining this uniqueness with [42, Theorem 4], one can readily obtain the boundedness of {(y^k, z^k)} and the other desired results.

We need the following stopping criteria for the local convergence analysis:

(B1)  Ψ_k(y^{k+1}, z^{k+1}) − inf Ψ_k ≤ (δ_k²/(2σ_k)) ‖x^{k+1} − x^k‖²,   Σ_{k=0}^∞ δ_k < +∞,

(B2)  dist(0, ∂Ψ_k(y^{k+1}, z^{k+1})) ≤ (δ′_k/σ_k) ‖x^{k+1} − x^k‖,   0 ≤ δ′_k → 0,

where x^{k+1} = x^k + σ_k(A*y^{k+1} + z^{k+1} − c).

Theorem 3.3. Assume that Assumption 3.1 holds and that the solution set Ω to (P) is nonempty. Suppose that T_f satisfies the error bound condition (2) for the origin with modulus a_f. Let {(y^k, z^k, x^k)} be any infinite sequence generated by Algorithm Ssnal with stopping criteria (A) and (B1). Then the sequence {x^k} converges to x* ∈ Ω and, for all k sufficiently large,

(19)  dist(x^{k+1}, Ω) ≤ θ_k dist(x^k, Ω),

where θ_k = (a_f(a_f² + σ_k²)^{-1/2} + 2δ_k)(1 − δ_k)^{-1} → θ_∞ = a_f(a_f² + σ_∞²)^{-1/2} < 1 as k → +∞. Moreover, the sequence {(y^k, z^k)} converges to the unique optimal solution (y*, z*) ∈ int(dom h*) × dom p* of (D).

Moreover, if T_l is metrically subregular at (y*, z*, x*) for the origin with modulus a_l and the stopping criterion (B2) is also used, then for all k sufficiently large,

‖(y^{k+1}, z^{k+1}) − (y*, z*)‖ ≤ θ′_k ‖x^{k+1} − x^k‖,

where θ′_k = a_l(1 + δ′_k)/σ_k with lim_{k→∞} θ′_k = a_l/σ_∞.

Proof. The first part of the theorem follows from [34, Theorem 2.1], [42, Proposition 7 and Theorem 5], and Theorem 3.2. To prove the second part, we recall that


if T_l is metrically subregular at (y*, z*, x*) for the origin with the modulus a_l and (y^k, z^k, x^k) → (y*, z*, x*), then for all k sufficiently large,

‖(y^{k+1}, z^{k+1}) − (y*, z*)‖ + dist(x^{k+1}, Ω) ≤ a_l dist(0, T_l(y^{k+1}, z^{k+1}, x^{k+1})).

Therefore, by the estimate (4.21) in [42] and the stopping criterion (B2), we obtain that for all k sufficiently large,

‖(y^{k+1}, z^{k+1}) − (y*, z*)‖ ≤ a_l(1 + δ′_k)/σ_k ‖x^{k+1} − x^k‖.

This completes the proof of Theorem 3.3.

Remark 3.4. In the above theorem, when σ_∞ = +∞, the inequality (19) directly implies that {x^k} converges Q-superlinearly. Moreover, recent advances in [12] also established the asymptotic R-superlinear convergence of the dual iteration sequence {(y^k, z^k)}. Indeed, from [12, Proposition 4.1], under the same conditions as in Theorem 3.3, we have that for k sufficiently large,

(20)  ‖(y^{k+1}, z^{k+1}) − (y*, z*)‖ ≤ θ′_k ‖x^{k+1} − x^k‖ ≤ θ′_k(1 − δ_k)^{-1} dist(x^k, Ω),

where θ′_k(1 − δ_k)^{-1} → a_l/σ_∞. Thus, if σ_∞ = ∞, the Q-superlinear convergence of {x^k} and (20) further imply that {(y^k, z^k)} converges R-superlinearly.

We should emphasize here that, by combining Remarks 2.8 and 3.4 and Theorem 3.3, our Algorithm Ssnal is guaranteed to produce an asymptotically superlinearly convergent sequence when used to solve (D) for many commonly used regularizers and loss functions. In particular, the Ssnal algorithm is asymptotically superlinearly convergent when applied to the dual of (1).

3.2. Solving the augmented Lagrangian subproblems. Here we shall propose an efficient semismooth Newton algorithm to solve the inner subproblems in the augmented Lagrangian method (18). That is, for some fixed σ > 0 and x ∈ X, we aim to solve

(21)  min_{y,z} Ψ(y, z) := L_σ(y, z; x).

Since Ψ(·,·) is a strongly convex function, we have that for any α ∈ ℜ the level set L_α = {(y, z) ∈ dom h* × dom p* | Ψ(y, z) ≤ α} is a closed and bounded convex set. Moreover, problem (21) admits a unique optimal solution, denoted as (ȳ, z̄) ∈ int(dom h*) × dom p*.

Denote, for any y ∈ Y,

ψ(y) := inf_z Ψ(y, z) = h*(y) + p*(Prox_{p*/σ}(x/σ − A*y + c)) + (1/(2σ)) ‖Prox_{σp}(x − σ(A*y − c))‖² − (1/(2σ)) ‖x‖².

Therefore, if (ȳ, z̄) = argmin Ψ(y, z), then (ȳ, z̄) ∈ int(dom h*) × dom p* can be computed simultaneously by

ȳ = argmin ψ(y),   z̄ = Prox_{p*/σ}(x/σ − A*ȳ + c).
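The fact that z̄ is available in closed form once ȳ is known is just the Moreau identity for p; the following worked identity (a standard fact, recorded here only to make that step explicit) shows the relation:

\[
  w = \mathrm{Prox}_{\sigma p}(w) + \sigma\,\mathrm{Prox}_{p^*/\sigma}(w/\sigma),
  \qquad w := x - \sigma(A^{*}\bar{y} - c),
\]
so that
\[
  \bar{z} = \mathrm{Prox}_{p^*/\sigma}\bigl(x/\sigma - A^{*}\bar{y} + c\bigr)
          = \tfrac{1}{\sigma}\Bigl[\,x - \sigma(A^{*}\bar{y} - c)
            - \mathrm{Prox}_{\sigma p}\bigl(x - \sigma(A^{*}\bar{y} - c)\bigr)\Bigr].
\]

In particular, for p = λ‖·‖₁ this says that z̄ is simply the projection of x/σ − A*ȳ + c onto the ℓ∞-ball of radius λ.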


Note that ψ(·) is strongly convex and continuously differentiable on int(dom h*) with

∇ψ(y) = ∇h*(y) − A Prox_{σp}(x − σ(A*y − c))   ∀ y ∈ int(dom h*).

Thus, ȳ can be obtained via solving the following nonsmooth equation:

(22)  ∇ψ(y) = 0,   y ∈ int(dom h*).

Let y ∈ int(dom h*) be any given point. Since h* is a convex function with a locally Lipschitz continuous gradient on int(dom h*), the following operator is well defined:

∂̂²ψ(y) := ∂(∇h*)(y) + σA ∂Prox_{σp}(x − σ(A*y − c)) A*,

where ∂(∇h*)(y) is the Clarke subdifferential of ∇h* at y [10] and ∂Prox_{σp}(x − σ(A*y − c)) is the Clarke subdifferential of the Lipschitz continuous mapping Prox_{σp}(·) at x − σ(A*y − c). Note that from [10, Proposition 2.3.3 and Theorem 2.6.6] we know that

∂²ψ(y)(d) ⊆ ∂̂²ψ(y)(d)   ∀ d ∈ Y,

where ∂²ψ(y) denotes the generalized Hessian of ψ at y. Define

(23)  V := H + σAUA*

with H ∈ ∂²h*(y) and U ∈ ∂Prox_{σp}(x − σ(A*y − c)). Then we have V ∈ ∂̂²ψ(y). Since h* is a strongly convex function, we know that H is symmetric positive definite on Y, and thus V is also symmetric positive definite on Y.

Under the mild assumption that ∇h* and Prox_{σp} are strongly semismooth (the definition of which is given next), we can design a superlinearly convergent semismooth Newton method to solve the nonsmooth equation (22).

Definition 3.5 (semismoothness [35, 38, 45]). Let F : O ⊆ X → Y be a locally Lipschitz continuous function on the open set O. F is said to be semismooth at x ∈ O if F is directionally differentiable at x and, for any V ∈ ∂F(x + Δx) with Δx → 0,

F(x + Δx) − F(x) − VΔx = o(‖Δx‖).

F is said to be strongly semismooth at x if F is semismooth at x and

F(x + Δx) − F(x) − VΔx = O(‖Δx‖²).

F is said to be a semismooth (respectively, strongly semismooth) function on O if it is semismooth (respectively, strongly semismooth) everywhere in O.

Note that it is widely known in the nonsmooth optimization/equation community that continuous piecewise affine functions and twice continuously differentiable functions are all strongly semismooth everywhere. In particular, Prox_{‖·‖₁}, as a Lipschitz continuous piecewise affine function, is strongly semismooth. See [14] for more semismooth and strongly semismooth functions.

Now we can design a semismooth Newton (Ssn) method to solve (22) as follows, and we can expect a fast superlinear, or even quadratic, convergence.


Algorithm SSN: A semismooth Newton algorithm for solving (22) (SSN(y⁰, x, σ)).

Given μ ∈ (0, 1/2), η̄ ∈ (0, 1), τ ∈ (0, 1], and δ ∈ (0, 1), choose y⁰ ∈ int(dom h*). Iterate the following steps for j = 0, 1, ....

Step 1. Choose H_j ∈ ∂(∇h*)(y^j) and U_j ∈ ∂Prox_{σp}(x − σ(A*y^j − c)). Let V_j := H_j + σAU_jA*. Solve the linear system

(24)  V_j d + ∇ψ(y^j) = 0

exactly or by the conjugate gradient (CG) algorithm to find d^j such that

‖V_j d^j + ∇ψ(y^j)‖ ≤ min(η̄, ‖∇ψ(y^j)‖^{1+τ}).

Step 2. (Line search) Set α_j = δ^{m_j}, where m_j is the first nonnegative integer m for which

y^j + δ^m d^j ∈ int(dom h*)  and  ψ(y^j + δ^m d^j) ≤ ψ(y^j) + μ δ^m ⟨∇ψ(y^j), d^j⟩.

Step 3. Set y^{j+1} = y^j + α_j d^j.
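To make the steps above concrete, the following is a minimal illustrative sketch of Ssn specialized to the lasso case, i.e., under the assumptions that h(u) = ½‖u − b‖² (so that h*(y) = ½‖y‖² + ⟨b, y⟩, ∇h*(y) = y + b, and H_j = I), p = λ‖·‖₁, and c = 0. It is not the authors' MATLAB implementation; the function and variable names are ours, and the Newton system is solved densely purely for clarity (the structure-exploiting solves are the subject of section 3.3).

    import numpy as np

    def prox_l1(v, t):
        # Soft-thresholding: Prox_{t*||.||_1}(v), a piecewise affine map.
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ssn_lasso_subproblem(A, b, x, sigma, lam, y0=None,
                             mu=1e-4, delta=0.5, tol=1e-10, max_iter=50):
        """Illustrative SSN sketch for the inner subproblem with c = 0:
           psi(y) = 0.5*||y||^2 + <b,y>
                    + ||Prox_{sigma*lam*||.||_1}(x - sigma*A^T y)||^2 / (2*sigma)
                    - ||x||^2 / (2*sigma)."""
        m, n = A.shape
        y = np.zeros(m) if y0 is None else y0.copy()

        def psi_and_grad(y):
            w = x - sigma * (A.T @ y)          # x - sigma*(A^T y - c), with c = 0
            pw = prox_l1(w, sigma * lam)       # Prox_{sigma*p}(w)
            val = 0.5 * (y @ y) + b @ y + (pw @ pw) / (2 * sigma) - (x @ x) / (2 * sigma)
            grad = y + b - A @ pw              # grad psi(y) = grad h*(y) - A*Prox_{sigma p}(w)
            return val, grad, w

        for _ in range(max_iter):
            val, grad, w = psi_and_grad(y)
            if np.linalg.norm(grad) <= tol:
                break
            # One element of the Clarke Jacobian of the prox: U = Diag(u), u_i = 1 iff |w_i| > sigma*lam.
            J = np.abs(w) > sigma * lam
            AJ = A[:, J]
            V = np.eye(m) + sigma * (AJ @ AJ.T)    # V = H + sigma*A U A^T with H = I
            d = np.linalg.solve(V, -grad)          # Newton direction from (24)
            # Armijo backtracking (Step 2); dom h* = R^m here, so no domain check is needed.
            alpha = 1.0
            for _ in range(60):
                if psi_and_grad(y + alpha * d)[0] <= val + mu * alpha * (grad @ d):
                    break
                alpha *= delta
            y = y + alpha * d

        w = x - sigma * (A.T @ y)
        z = (w - prox_l1(w, sigma * lam)) / sigma  # Moreau identity; projection onto [-lam, lam]^n
        return y, z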

The convergence results for the above Ssn algorithm are stated in the next theorem.

Theorem 3.6. Assume that ∇h*(·) and Prox_{σp}(·) are strongly semismooth on int(dom h*) and X, respectively. Let the sequence {y^j} be generated by Algorithm Ssn. Then {y^j} converges to the unique optimal solution ȳ ∈ int(dom h*) of the problem in (22), and

‖y^{j+1} − ȳ‖ = O(‖y^j − ȳ‖^{1+τ}).

Proof. Since, by [54, Proposition 3.3], d^j is a descent direction, Algorithm Ssn is well defined. By (23), we know that for any j ≥ 0, V_j ∈ ∂̂²ψ(y^j). Then we can prove the conclusion of this theorem by following the proofs of [54, Theorems 3.4 and 3.5]. We omit the details here.

We shall now discuss the implementation of stopping criteria (A), (B1), and (B2) for Algorithm Ssn to solve the subproblem (18) in Algorithm Ssnal. Note that when Algorithm Ssn is applied to minimize Ψ_k(·) to find

y^{k+1} = Ssn(y^k, x^k, σ_k)  and  z^{k+1} = Prox_{p*/σ_k}(x^k/σ_k − A*y^{k+1} + c),

we have, by simple calculations and the strong convexity of h*, that

Ψ_k(y^{k+1}, z^{k+1}) − inf Ψ_k = ψ_k(y^{k+1}) − inf ψ_k ≤ (1/(2α_h)) ‖∇ψ_k(y^{k+1})‖²

and (∇ψ_k(y^{k+1}), 0) ∈ ∂Ψ_k(y^{k+1}, z^{k+1}), where ψ_k(y) := inf_z Ψ_k(y, z) for all y ∈ Y. Therefore, the stopping criteria (A), (B1), and (B2) can be achieved by the following implementable criteria:

(A′)  ‖∇ψ_k(y^{k+1})‖ ≤ √(α_h/σ_k) ε_k,   Σ_{k=0}^∞ ε_k < ∞,

(B1′)  ‖∇ψ_k(y^{k+1})‖ ≤ √(α_h σ_k) δ_k ‖A*y^{k+1} + z^{k+1} − c‖,   Σ_{k=0}^∞ δ_k < +∞,

(B2′)  ‖∇ψ_k(y^{k+1})‖ ≤ δ′_k ‖A*y^{k+1} + z^{k+1} − c‖,   0 ≤ δ′_k → 0.
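To see why, for instance, (A′) implies (A) (a one-line verification we add here for clarity), combine the bound displayed above with the choice of tolerance:

\[
  \Psi_k(y^{k+1}, z^{k+1}) - \inf \Psi_k
  \;\le\; \frac{1}{2\alpha_h}\,\|\nabla\psi_k(y^{k+1})\|^2
  \;\le\; \frac{1}{2\alpha_h}\cdot\frac{\alpha_h}{\sigma_k}\,\varepsilon_k^2
  \;=\; \frac{\varepsilon_k^2}{2\sigma_k},
\]

which is exactly criterion (A). The verifications for (B1′) and (B2′) are analogous, using x^{k+1} − x^k = σ_k(A*y^{k+1} + z^{k+1} − c).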


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as ‖∇ψ_k(y^{k+1})‖ is sufficiently small.

3.3. An efficient implementation of SSN for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we shall discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer p is chosen to be λ‖·‖₁ for some λ > 0. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction d^j from the linear system (24). So we shall first discuss the solving of this linear system.

Let (x, y) ∈ ℜ^n × ℜ^m and σ > 0 be given. We consider the following Newton linear system:

(25)  (H + σAUAᵀ) d = −∇ψ(y),

where H ∈ ∂(∇h*)(y), A denotes the matrix representation of A with respect to the standard bases of ℜ^n and ℜ^m, and U ∈ ∂Prox_{σλ‖·‖₁}(x̃) with x̃ := x − σ(Aᵀy − c). Since H is a symmetric and positive definite matrix, (25) can be equivalently rewritten as

(I_m + σ(L⁻¹A)U(L⁻¹A)ᵀ)(Lᵀd) = −L⁻¹∇ψ(y),

where L is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of H such that H = LLᵀ. In many applications, H is usually a sparse matrix. Indeed, when the function h in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrix H is in fact diagonal. That is, the costs of computing L and its inverse are negligible in most situations. Therefore, without loss of generality, we can consider a simplified version of (25) as follows:

(26)  (I_m + σAUAᵀ) d = −∇ψ(y),

which is precisely the Newton system associated with the standard lasso problem (1). Since U ∈ ℜ^{n×n} is a diagonal matrix, at first glance the costs of computing AUAᵀ and the matrix–vector multiplication AUAᵀd for a given vector d ∈ ℜ^m are O(m²n) and O(mn), respectively. These computational costs are too expensive when the dimensions of A are large, and they can make commonly employed approaches, such as the Cholesky factorization and the conjugate gradient method, inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of U is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level at which they are negligible, or at least insignificant, compared to other costs. Next, we shall show how this can be done by taking full advantage of the sparsity of U. This sparsity will be referred to as the second order sparsity of the underlying problem.

For x̃ = x − σ(Aᵀy − c), in our computations we can always choose U = Diag(u), the diagonal matrix whose ith diagonal element u_i is given by

u_i = 0 if |x̃_i| ≤ σλ,  u_i = 1 otherwise,   i = 1, ..., n.

Since Prox_{σλ‖·‖₁}(x̃) = sign(x̃) ∘ max(|x̃| − σλ, 0), it is not difficult to see that U ∈ ∂Prox_{σλ‖·‖₁}(x̃). Let J := {j | |x̃_j| > σλ, j = 1, ..., n} and r = |J|, the cardinality of J. By taking the special 0–1 structure of U into consideration, we have that


Fig. 1. Reducing the computational costs from O(m²n) to O(m²r).

(27)  AUAᵀ = (AU)(AU)ᵀ = A_J A_Jᵀ,

where A_J ∈ ℜ^{m×r} is the submatrix of A with those columns not in J removed. Then, by using (27), the costs of computing AUAᵀ and AUAᵀd for a given vector d are reduced to O(m²r) and O(mr), respectively. Due to the sparsity-promoting property of the regularizer p, the number r is usually much smaller than n. Thus, by exploiting the aforementioned second order sparsity, we can greatly reduce the computational costs of solving the linear system (26) when using the Cholesky factorization. More specifically, the total computational cost of using the Cholesky factorization to solve the linear system is reduced from O(m²(m + n)) to O(m²(m + r)). See Figure 1 for an illustration of this reduction. This means that even if n happens to be extremely large (say, larger than 10⁷), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization as long as both m and r are moderate (say, less than 10⁴). If, in addition, r ≪ m, which is often the case when m is large and the optimal solutions to the underlying problem are sparse, then instead of factorizing an m × m matrix we can make use of the Sherman–Morrison–Woodbury formula [22] to get the inverse of I_m + σAUAᵀ by inverting a much smaller r × r matrix as follows:

(I_m + σAUAᵀ)⁻¹ = (I_m + σA_J A_Jᵀ)⁻¹ = I_m − A_J(σ⁻¹I_r + A_JᵀA_J)⁻¹A_Jᵀ.

See Figure 2 for an illustration of the computation of A_JᵀA_J. In this case, the total computational cost of solving the Newton linear system (26) is reduced significantly further, from O(m²(m + r)) to O(r²(m + r)). We should emphasize here that this dramatic reduction in computational costs results from the wise combination of a careful examination of the existing second order sparsity in lasso-type problems and some "smart" numerical linear algebra.
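The following is a minimal sketch of this Sherman–Morrison–Woodbury step (our illustration, not the authors' code; it assumes the dense NumPy setting of the sketch given after Algorithm SSN, with w = x̃):

    import numpy as np

    def newton_direction_smw(A, w, grad, sigma, lam):
        """Solve (I_m + sigma*A_J A_J^T) d = -grad via the r x r system,
           where J = {i : |w_i| > sigma*lam}. Illustrative sketch only."""
        J = np.abs(w) > sigma * lam
        AJ = A[:, J]                              # m x r, r = |J|
        r = AJ.shape[1]
        if r == 0:
            return -grad                          # U = 0, so the system matrix is I_m
        # (I + sigma*AJ AJ^T)^{-1} g = g - AJ (sigma^{-1} I_r + AJ^T AJ)^{-1} AJ^T g
        small = np.eye(r) / sigma + AJ.T @ AJ     # r x r matrix
        return -grad - AJ @ np.linalg.solve(small, AJ.T @ (-grad))

Forming the r × r Gram matrix A_JᵀA_J costs O(r²m), which is exactly the reduction depicted in Figure 2.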

From the above arguments, we can see that as long as the number of nonzero components of Prox_{σλ‖·‖₁}(x̃) is small, say less than √n, and H_j ∈ ∂(∇h*)(y^j) is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low cost. In particular, this is true for lasso problems admitting sparse solutions.


Fig. 2. Further reducing the computational costs to O(r²m).

Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). Note that one may argue that, even if the original problem has only sparse solutions, at certain stages one may still encounter the situation that the number of nonzero components of Prox_{σλ‖·‖₁}(x̃) is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice, since we always start with a sparse feasible point, e.g., the zero vector. Second, even if at certain steps this phenomenon does occur, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter σ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with p(·) chosen to be λ‖·‖₁.

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two recent papers [18, 36], which appear to suggest that for some large-scale sparse reconstruction problems mfIPM¹ and FPC_AS² have mostly outperformed the other solvers. Hence, in this section we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP³ [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparison, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB with the step length set to 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later they lack the ability to efficiently solve difficult problems such as large-scale regression problems where the data A is badly conditioned.

For testing purposes, the regularization parameter λ in the lasso problem (1) is chosen as

λ = λ_c ‖A*b‖_∞,

where 0 < λ_c < 1. In our numerical experiments, we measure the accuracy of an approximate optimal solution x for (1) by using the following relative KKT residual:

η = ‖x − prox_{λ‖·‖₁}(x − A*(Ax − b))‖ / (1 + ‖x‖ + ‖Ax − b‖).

¹http://www.maths.ed.ac.uk/ERGO/mfipmcs/
²http://www.caam.rice.edu/~optimization/L1/FPC_AS/
³http://yelab.net/software/SLEP/
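For reference, the residual η is straightforward to compute; a small sketch (ours, reusing the prox_l1 helper defined in the sketch after Algorithm SSN):

    import numpy as np

    def relative_kkt_residual(A, b, lam, x):
        # eta = ||x - prox_{lam*||.||_1}(x - A^T(Ax - b))|| / (1 + ||x|| + ||Ax - b||)
        res = A @ x - b
        prox_step = prox_l1(x - A.T @ res, lam)
        return np.linalg.norm(x - prox_step) / (1.0 + np.linalg.norm(x) + np.linalg.norm(res))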


For a given tolerance ε > 0, we will stop the tested algorithms when η < ε. For all the tests in this section, we set ε = 10⁻⁶. The algorithms will also be stopped when they reach the maximum number of iterations (1,000 iterations for our algorithm and mfIPM, and 20,000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to the default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core Intel Xeon E5-2650 at 2.60GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms with the test instances (A, b) obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in A (the matrix representation of A) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order-5 polynomial is used to generate the basis functions. This naming convention is also used for the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of AA*, which is denoted as λ_max(AA*).
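As an illustration of this preprocessing step (our sketch only; the exact expansion and data handling used for the reported numbers follow [23] and are not reproduced here), a degree-5 polynomial basis expansion can be generated, for example, with scikit-learn:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    def expand_features(X, degree=5):
        # Map the original features of X (n_samples x n_features) to all
        # polynomial combinations up to the given degree, e.g. pyrim -> pyrim5.
        return PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X)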

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, m denotes the number of samples, n denotes the number of features, and "nnz" denotes the number of nonzeros in the solution x obtained by Ssnal, using the following estimation:

nnz := min { k | Σ_{i=1}^{k} |x̂_i| ≥ 0.999 ‖x‖₁ },

where x̂ is obtained by sorting x such that |x̂_1| ≥ ... ≥ |x̂_n|. An entry marked as "Error" indicates that the algorithm breaks down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and the entry "00" in the time column means that the computation time is less than 0.5 second.
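In code, this estimate amounts to the following (our sketch):

    import numpy as np

    def estimated_nnz(x, fraction=0.999):
        # Smallest k such that the k largest |x_i| already carry `fraction` of ||x||_1.
        mags = np.sort(np.abs(x))[::-1]
        cumulative = np.cumsum(mags)
        return int(np.searchsorted(cumulative, fraction * cumulative[-1]) + 1)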

One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances

Table 1
Statistics of the UCI test instances.

Probname             m; n                λ_max(AA*)
E2006.train          16087; 150348       1.91e+05
log1p.E2006.train    16087; 4265669      5.86e+07
E2006.test           3308; 72812         4.79e+04
log1p.E2006.test     3308; 1771946       1.46e+07
pyrim5               74; 169911          1.22e+06
triazines4           186; 557845         2.07e+07
abalone7             4177; 6435          5.21e+05
bodyfat7             252; 116280         5.29e+04
housing7             506; 77520          3.28e+05
mpg7                 392; 3432           1.28e+04
space_ga9            3107; 5005          4.01e+03


Table 2
The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy ε = 10⁻⁶). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM; for each instance and each λ_c ∈ {10⁻³, 10⁻⁴}, the columns report nnz, the relative KKT residuals η, and the computation times for the six solvers.


to the required accuracy after 20,000 iterations or 7 hours. In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances; in fact, for 3 test instances it breaks down due to some internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with λ_c = 10⁻³, Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20,000 iterations and consumes about 2 hours but only produces a rather inaccurate solution.

On the other hand, one can observe that the two second order information based methods, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds (λ_c = 10⁻³). Between these two algorithms, Ssnal is clearly far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where the factor can be up to 300 times. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with an accuracy of 10⁻¹ when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point approach in exploiting the sparsity of the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix A (the matrix representation of A) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter λ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when λ is not a scalar, based on the recommendation of the referee we use another popular active-set based solver, PSSas⁴ [44], to replace FPC_AS for this testing. All the parameters for PSSas are set to the default values. In our tests, PSSas is stopped when it reaches the maximum number of 20,000 iterations or the maximum computation time of 7 hours. We note that, by default, PSSas terminates when its progress is smaller than the threshold 10⁻⁹.

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that this simple normalization technique does not change the conclusions based on Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM-type solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy. In fact, PSSas fails on all the test instances when λ_c = 10⁻⁴. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with η = 3.2 × 10⁻³. (Actually, we also ran PSSas on these test instances without scaling

⁴https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3
The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy ε = 10⁻⁶). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM; for each instance and each λ_c ∈ {10⁻³, 10⁻⁴}, the columns report nnz, the relative KKT residuals η, and the computation times for the six solvers.


and obtained similar performance. Detailed results are omitted to conserve space.) Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for the Sparco collection. In this subsection, the test instances (A, b) are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce 60dB noise to b (as in [18]) by using the MATLAB command b = awgn(b,60,'measured'). For these test instances, the matrix representations of the linear maps A are not available. Hence, ADMM will not be tested in this subsection, as it would be extremely expensive, if not impossible, to compute and factorize I + σAA*.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of λ_c, i.e., λ_c = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}. As one can observe, as λ_c decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions (m, n) and the largest eigenvalue of AA* (λ_max(AA*)). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, and the computation times (in the format of hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well for these test instances with small Lipschitz constants λ_max(AA*). As such, Ssnal does not have as clear an advantage as shown in Table 2. Moreover, since the matrix representations of the linear maps A and A* involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases, with λ_c = 10⁻³ and 10⁻⁴. We can observe that FPC_AS

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy ε = 10⁻⁶, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

λ_c    nnz     Iteration: a | b | c | d | e      η: a | b | c | d | e                          Time: a | b | c | d | e

srcsep1: m = 29166, n = 57344, λ_max(AA*) = 3.56
0.16   380     14 | 28 | 42 | 401 | 89           6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7          07 | 09 | 05 | 15 | 07
0.10   726     16 | 38 | 42 | 574 | 161          4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7          11 | 15 | 05 | 22 | 13
0.06   1402    19 | 41 | 63 | 801 | 393          1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7          18 | 18 | 10 | 30 | 32
0.03   2899    19 | 56 | 110 | 901 | 337         2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7          28 | 53 | 16 | 33 | 28
0.01   6538    17 | 88 | 223 | 1401 | 542        7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7          1:21 | 2:15 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, λ_max(AA*) = 4.95
0.16   423     15 | 29 | 42 | 501 | 127          3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7          14 | 13 | 07 | 25 | 15
0.10   805     16 | 37 | 84 | 601 | 212          8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7          21 | 19 | 16 | 30 | 26
0.06   1549    19 | 40 | 84 | 901 | 419          1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7          32 | 28 | 18 | 44 | 50
0.03   3254    20 | 69 | 128 | 901 | 488         1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7          1:06 | 1:33 | 26 | 44 | 59
0.01   7400    21 | 94 | 259 | 2201 | 837        8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7          1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy ε = 10⁻⁶, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

probname (m; n)            λ_c     nnz      η: a | b | c | d | e                          Time: a | b | c | d | e

blknheavi                  10⁻³    12       5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7          01 | 01 | 55 | 08 | 06
(1024; 1024)               10⁻⁴    12       9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7          01 | 01 | 49 | 07 | 07
srcsep1                    10⁻³    14066    1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7          5:41 | 42:34 | 13:25 | 1:56 | 4:16
(29166; 57344)             10⁻⁴    19306    9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7          9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
srcsep2                    10⁻³    16502    3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7          9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
(29166; 86016)             10⁻⁴    22315    7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7          19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
srcsep3                    10⁻³    27314    6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7          33 | 6:24 | 8:51 | 47 | 49
(196608; 196608)           10⁻⁴    83785    9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7          2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
soccer1                    10⁻³    4        1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7          01 | 03 | 13:51 | 2:35 | 02
(3200; 4096)               10⁻⁴    8        8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7          01 | 02 | 13:23 | 3:07 | 02
soccer2                    10⁻³    4        3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7          00 | 03 | 13:46 | 1:40 | 02
(3200; 4096)               10⁻⁴    8        2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7          01 | 03 | 13:27 | 3:07 | 02
blurrycam                  10⁻³    1694     1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7          03 | 09 | 03 | 02 | 07
(65536; 65536)             10⁻⁴    5630     1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7          05 | 1:35 | 08 | 03 | 29
blurspike                  10⁻³    1954     3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7          03 | 05 | 6:38 | 03 | 27
(16384; 16384)             10⁻⁴    11698    3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7          10 | 08 | 6:43 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when λ_c = 10⁻³ and 10⁻⁴, respectively. For a few cases, FPC_AS can only achieve a poor accuracy (10⁻¹). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of λ_max(AA*). For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi (m = n = 1024), whose corresponding λ_max(AA*) = 709. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases LADMM requires much more time than Ssnal. One can also observe that for large-scale problems Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to λ_max(AA*), one can safely conclude that the first order algorithms can only be used to solve problems with relatively small λ_max(AA*). Moreover, in order to obtain efficient and robust algorithms for lasso problems, or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotically superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving ℓ₁-regularized least squares (LS) problems. With the intelligent


incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale ℓ₁-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and the anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candes, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrman, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intel. Syst. Tech., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation par elements finis d'ordre un et la resolution par penalisation-dualite d'une classe de problemes de Dirichlet non lineaires, Rev. Francaise d'Automatique Inform. Rech. Operationelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B.-G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wachter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Software, 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C1,1 optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A Two-Phase Proximal Augmented Lagrangian Method for Convex Quadratic Semidefinite Programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k²), Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact Regularized Proximal Newton Method: Provable Convergence Guarantees for Non-Smooth Convex Minimization without Strong Convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.
[56] H. Zhou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.
[57] H. Zhou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 12: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

444 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

if Tl is metrically subregular at (ylowast zlowast xlowast) for the origin with the modulus al and(yk zk xk)rarr (ylowast zlowast xlowast) then for all k sufficiently large

(yk+1 zk+1)minus (ylowast zlowast)+ dist(xk+1Ω) le al dist(0 Tl(yk+1 zk+1 xk+1))

Therefore by the estimate (421) in [42] and the stopping criterion (B2) we obtainthat for all k sufficiently large

(yk+1 zk+1)minus (ylowast zlowast) le al(1 + δprimek)σkxk+1 minus xk

This completes the proof of Theorem 33

Remark 34 In the above theorem when σinfin = +infin the inequality (19) directlyimplies that xk converges Q-superlinearly Moreover recent advances in [12] alsoestablished the asymptotic R-superlinear convergence of the dual iteration sequence(yk zk) Indeed from [12 Proposition 41] under the same conditions of Theorem33 we have that for k sufficiently large

(20) (yk+1 zk+1)minus (ylowast zlowast) le θprimekxk+1 minus xk le θprimek(1minus δk)minus1dist(xk Ω)

where θprimek(1 minus δk)minus1 rarr alσinfin Thus if σinfin = infin the Q-superlinear convergence ofxk and (20) further imply that (yk zk) converges R-superlinearly

We should emphasize here that by combining Remarks 28 and 34 and Theo-rem 33 our Algorithm Ssnal is guaranteed to produce an asymptotically super-linearly convergent sequence when used to solve (D) for many commonly used reg-ularizers and loss functions In particular the Ssnal algorithm is asymptoticallysuperlinearly convergent when applied to the dual of (1)

3.2. Solving the augmented Lagrangian subproblems. Here we shall propose an efficient semismooth Newton algorithm to solve the inner subproblems in the augmented Lagrangian method (18). That is, for some fixed $\sigma > 0$ and $x \in \mathcal{X}$, we aim to solve

(21) $\quad \min_{y,z}\ \Psi(y, z) := L_\sigma(y, z; x).$

Since $\Psi(\cdot,\cdot)$ is a strongly convex function, we have that for any $\alpha \in \Re$, the level set $L_\alpha := \{(y,z) \in \mathrm{dom}\, h^* \times \mathrm{dom}\, p^* \mid \Psi(y,z) \le \alpha\}$ is a closed and bounded convex set. Moreover, problem (21) admits a unique optimal solution, denoted as $(\bar{y}, \bar{z}) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$.

Denote, for any $y \in \mathcal{Y}$,

$$\psi(y) := \inf_z \Psi(y,z) = h^*(y) + p^*\big(\mathrm{Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^* y + c)\big) + \frac{1}{2\sigma}\big\|\mathrm{Prox}_{\sigma p}\big(x - \sigma(\mathcal{A}^* y - c)\big)\big\|^2 - \frac{1}{2\sigma}\|x\|^2.$$

Therefore, if $(\bar{y}, \bar{z}) = \arg\min \Psi(y,z)$, then $(\bar{y}, \bar{z}) \in \mathrm{int}(\mathrm{dom}\, h^*) \times \mathrm{dom}\, p^*$ can be computed simultaneously by

$$\bar{y} = \arg\min \psi(y), \qquad \bar{z} = \mathrm{Prox}_{p^*/\sigma}(x/\sigma - \mathcal{A}^* \bar{y} + c).$$
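For the lasso problem (1) these formulas simplify considerably. The following MATLAB sketch, given purely as an illustration and under our assumption that (1) is cast with $h(u) = \frac{1}{2}\|u-b\|^2$ (so $h^*(y) = \frac{1}{2}\|y\|^2 + \langle b, y\rangle$), $p = \lambda\|\cdot\|_1$ (so $p^*$ is the indicator of the $\ell_\infty$-ball of radius $\lambda$ and the $p^*(\cdot)$ term vanishes), and $c = 0$, evaluates $\psi(y)$ and recovers $\bar{z}$ from a candidate $\bar{y}$; the function and variable names are ours, not taken from the authors' solver.

% Illustrative evaluation of psi(y) and recovery of zbar for the lasso case:
% p = lambda*||.||_1, h(u) = 0.5*||u-b||^2 (so h*(y) = 0.5*||y||^2 + <b,y>), c = 0.
% Here A is stored as an explicit m-by-n matrix; x is the current multiplier.
function [val, zbar] = psi_lasso(y, A, b, x, sigma, lambda)
w    = x - sigma * (A' * y);                        % x - sigma*(A^*y - c) with c = 0
pw   = sign(w) .* max(abs(w) - sigma*lambda, 0);    % Prox_{sigma*p}(w) (soft-thresholding)
val  = 0.5*norm(y)^2 + b'*y + norm(pw)^2/(2*sigma) - norm(x)^2/(2*sigma);
zbar = min(max(x/sigma - A'*y, -lambda), lambda);   % Prox_{p*/sigma} = projection onto the l_inf ball
end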


Note that $\psi(\cdot)$ is strongly convex and continuously differentiable on $\mathrm{int}(\mathrm{dom}\, h^*)$ with

$$\nabla\psi(y) = \nabla h^*(y) - \mathcal{A}\,\mathrm{Prox}_{\sigma p}\big(x - \sigma(\mathcal{A}^* y - c)\big) \quad \forall\, y \in \mathrm{int}(\mathrm{dom}\, h^*).$$

Thus, $\bar{y}$ can be obtained by solving the following nonsmooth equation:

(22) $\quad \nabla\psi(y) = 0, \quad y \in \mathrm{int}(\mathrm{dom}\, h^*).$

Let $y \in \mathrm{int}(\mathrm{dom}\, h^*)$ be any given point. Since $h^*$ is a convex function with a locally Lipschitz continuous gradient on $\mathrm{int}(\mathrm{dom}\, h^*)$, the following operator is well defined:

$$\hat{\partial}^2\psi(y) := \partial(\nabla h^*)(y) + \sigma \mathcal{A}\, \partial\mathrm{Prox}_{\sigma p}\big(x - \sigma(\mathcal{A}^* y - c)\big)\, \mathcal{A}^*,$$

where $\partial(\nabla h^*)(y)$ is the Clarke subdifferential of $\nabla h^*$ at $y$ [10] and $\partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c))$ is the Clarke subdifferential of the Lipschitz continuous mapping $\mathrm{Prox}_{\sigma p}(\cdot)$ at $x - \sigma(\mathcal{A}^* y - c)$. Note that from [10, Proposition 2.3.3 and Theorem 2.6.6], we know that

$$\partial^2\psi(y)\,(d) \subseteq \hat{\partial}^2\psi(y)\,(d) \quad \forall\, d \in \mathcal{Y},$$

where $\partial^2\psi(y)$ denotes the generalized Hessian of $\psi$ at $y$. Define

(23) $\quad V := H + \sigma \mathcal{A} U \mathcal{A}^*$

with $H \in \partial^2 h^*(y)$ and $U \in \partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y - c))$. Then we have $V \in \hat{\partial}^2\psi(y)$. Since $h^*$ is a strongly convex function, we know that $H$ is symmetric positive definite on $\mathcal{Y}$, and thus $V$ is also symmetric positive definite on $\mathcal{Y}$.

Under the mild assumption that $\nabla h^*$ and $\mathrm{Prox}_{\sigma p}$ are strongly semismooth (the definition of which is given next), we can design a superlinearly convergent semismooth Newton method to solve the nonsmooth equation (22).

Definition 3.5 (semismoothness [35, 38, 45]). Let $F : O \subseteq \mathcal{X} \to \mathcal{Y}$ be a locally Lipschitz continuous function on the open set $O$. $F$ is said to be semismooth at $x \in O$ if $F$ is directionally differentiable at $x$ and, for any $V \in \partial F(x + \Delta x)$ with $\Delta x \to 0$,

$$F(x + \Delta x) - F(x) - V \Delta x = o(\|\Delta x\|).$$

$F$ is said to be strongly semismooth at $x$ if $F$ is semismooth at $x$ and

$$F(x + \Delta x) - F(x) - V \Delta x = O(\|\Delta x\|^2).$$

$F$ is said to be a semismooth (respectively, strongly semismooth) function on $O$ if it is semismooth (respectively, strongly semismooth) everywhere in $O$.

It is widely known in the nonsmooth optimization/equation community that continuous piecewise affine functions and twice continuously differentiable functions are all strongly semismooth everywhere. In particular, $\mathrm{Prox}_{\|\cdot\|_1}$, as a Lipschitz continuous piecewise affine function, is strongly semismooth. See [14] for more examples of semismooth and strongly semismooth functions.

Now we can design a semismooth Newton (Ssn) method to solve (22) as follows, and we can expect to obtain fast superlinear or even quadratic convergence.


Algorithm SSN: A semismooth Newton algorithm for solving (22) (SSN($y^0, x, \sigma$)).

Given $\mu \in (0, 1/2)$, $\bar\eta \in (0,1)$, $\tau \in (0,1]$, and $\delta \in (0,1)$, choose $y^0 \in \mathrm{int}(\mathrm{dom}\, h^*)$. Iterate the following steps for $j = 0, 1, \ldots$:

Step 1. Choose $H_j \in \partial(\nabla h^*)(y^j)$ and $U_j \in \partial\mathrm{Prox}_{\sigma p}(x - \sigma(\mathcal{A}^* y^j - c))$. Let $V_j := H_j + \sigma \mathcal{A} U_j \mathcal{A}^*$. Solve the linear system

(24) $\quad V_j d + \nabla\psi(y^j) = 0$

exactly or by the conjugate gradient (CG) algorithm to find $d^j$ such that $\|V_j d^j + \nabla\psi(y^j)\| \le \min\big(\bar\eta, \|\nabla\psi(y^j)\|^{1+\tau}\big)$.

Step 2 (Line search). Set $\alpha_j = \delta^{m_j}$, where $m_j$ is the first nonnegative integer $m$ for which

$$y^j + \delta^m d^j \in \mathrm{int}(\mathrm{dom}\, h^*) \quad \text{and} \quad \psi(y^j + \delta^m d^j) \le \psi(y^j) + \mu \delta^m \langle \nabla\psi(y^j), d^j \rangle.$$

Step 3. Set $y^{j+1} = y^j + \alpha_j d^j$.
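A minimal MATLAB sketch of this scheme is given below. It is an illustration only, with $\psi$, $\nabla\psi$, and the action of a generalized Hessian element supplied as function handles; psi, grad_psi, and Vmap are hypothetical names of ours, and the domain test in Step 2 is omitted because $\mathrm{dom}\, h^*$ is the whole space in the lasso case.

% Minimal sketch of Algorithm SSN: inexact semismooth Newton with line search
% for solving grad_psi(y) = 0.  Not the authors' implementation.
%   psi      : handle, y -> psi(y)
%   grad_psi : handle, y -> gradient of psi at y
%   Vmap     : handle, (y,d) -> action of V(y) = H(y) + sigma*A*U(y)*A^* on d
%   y0       : starting point
function y = ssn_sketch(psi, grad_psi, Vmap, y0)
mu = 1e-4; etabar = 0.5; tau = 0.5; delta = 0.5;   % parameters of Algorithm SSN
maxit = 50; tol = 1e-9;
y = y0;
for j = 1:maxit
    g = grad_psi(y);
    if norm(g) <= tol, break; end
    % Step 1: solve V_j d = -grad_psi(y^j) inexactly by CG up to tolerance (24)
    abstol = min(etabar, norm(g)^(1 + tau));
    [d, ~] = pcg(@(v) Vmap(y, v), -g, abstol/norm(g), 200);
    % Step 2: backtracking line search on psi (domain check omitted here)
    val = psi(y); slope = g' * d;        % slope < 0 since d is a descent direction
    alpha = 1;
    while psi(y + alpha*d) > val + mu*alpha*slope && alpha > 1e-12
        alpha = delta * alpha;
    end
    % Step 3: update the iterate
    y = y + alpha * d;
end
end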

The convergence results for the above Ssn algorithm are stated in the next theorem.

Theorem 3.6. Assume that $\nabla h^*(\cdot)$ and $\mathrm{Prox}_{\sigma p}(\cdot)$ are strongly semismooth on $\mathrm{int}(\mathrm{dom}\, h^*)$ and $\mathcal{X}$, respectively. Let the sequence $\{y^j\}$ be generated by Algorithm Ssn. Then $\{y^j\}$ converges to the unique optimal solution $\bar{y} \in \mathrm{int}(\mathrm{dom}\, h^*)$ of the problem in (22) and

$$\|y^{j+1} - \bar{y}\| = O\big(\|y^j - \bar{y}\|^{1+\tau}\big).$$

Proof. Since, by [54, Proposition 3.3], $d^j$ is a descent direction, Algorithm Ssn is well defined. By (23), we know that for any $j \ge 0$, $V_j \in \hat{\partial}^2\psi(y^j)$. Then we can prove the conclusion of this theorem by following the proofs of [54, Theorems 3.4 and 3.5]. We omit the details here.

We shall now discuss the implementation of the stopping criteria (A), (B1), and (B2) for Algorithm Ssn when it is used to solve the subproblem (18) in Algorithm Ssnal. Note that when Algorithm Ssn is applied to minimize $\Psi_k(\cdot)$ to find

$$y^{k+1} = \mathrm{Ssn}(y^k, x^k, \sigma_k) \quad \text{and} \quad z^{k+1} = \mathrm{Prox}_{p^*/\sigma_k}(x^k/\sigma_k - \mathcal{A}^* y^{k+1} + c),$$

we have, by simple calculations and the strong convexity of $h^*$, that

$$\Psi_k(y^{k+1}, z^{k+1}) - \inf \Psi_k = \psi_k(y^{k+1}) - \inf \psi_k \le \frac{1}{2\alpha_h}\,\|\nabla\psi_k(y^{k+1})\|^2$$

and $(\nabla\psi_k(y^{k+1}), 0) \in \partial\Psi_k(y^{k+1}, z^{k+1})$, where $\psi_k(y) := \inf_z \Psi_k(y,z)$ for all $y \in \mathcal{Y}$. Therefore, the stopping criteria (A), (B1), and (B2) can be achieved by the following implementable criteria:

(A') $\|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h/\sigma_k}\,\varepsilon_k$, $\quad \sum_{k=0}^{\infty} \varepsilon_k < \infty$;

(B1') $\|\nabla\psi_k(y^{k+1})\| \le \sqrt{\alpha_h \sigma_k}\,\delta_k \|\mathcal{A}^* y^{k+1} + z^{k+1} - c\|$, $\quad \sum_{k=0}^{\infty} \delta_k < +\infty$;

(B2') $\|\nabla\psi_k(y^{k+1})\| \le \delta_k' \|\mathcal{A}^* y^{k+1} + z^{k+1} - c\|$, $\quad 0 \le \delta_k' \to 0$.


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as $\|\nabla\psi_k(y^{k+1})\|$ is sufficiently small.

3.3. An efficient implementation of SSN for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we shall discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer $p$ is chosen to be $\lambda\|\cdot\|_1$ for some $\lambda > 0$. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction $d^j$ from the linear system (24). So we shall first discuss the solving of this linear system.

Let $(x, y) \in \Re^n \times \Re^m$ and $\sigma > 0$ be given. We consider the following Newton linear system:

(25) $\quad (H + \sigma A U A^T)\, d = -\nabla\psi(y),$

where $H \in \partial(\nabla h^*)(y)$, $A$ denotes the matrix representation of $\mathcal{A}$ with respect to the standard bases of $\Re^n$ and $\Re^m$, and $U \in \partial\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ with $\tilde{x} := x - \sigma(A^T y - c)$. Since $H$ is a symmetric and positive definite matrix, (25) can be equivalently rewritten as

$$\big(I_m + \sigma (L^{-1}A)\, U\, (L^{-1}A)^T\big)(L^T d) = -L^{-1}\nabla\psi(y),$$

where $L$ is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of $H$ such that $H = LL^T$. In many applications, $H$ is a sparse matrix. Indeed, when the function $h$ in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrices $H$ are in fact diagonal, so the costs of computing $L$ and its inverse are negligible in most situations. Therefore, without loss of generality, we can consider a simplified version of (25) as follows:

(26) $\quad (I_m + \sigma A U A^T)\, d = -\nabla\psi(y),$

which is precisely the Newton system associated with the standard lasso problem (1). Since $U \in \Re^{n\times n}$ is a diagonal matrix, at first glance the costs of computing $AUA^T$ and the matrix-vector multiplication $AUA^T d$ for a given vector $d \in \Re^m$ are $O(m^2 n)$ and $O(mn)$, respectively. These computational costs are too expensive when the dimensions of $A$ are large and can make commonly employed approaches, such as the Cholesky factorization and the conjugate gradient method, inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of $U$ is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level at which they are negligible, or at least insignificant, compared to other costs. Next we shall show how this can be done by taking full advantage of the sparsity of $U$. This sparsity will be referred to as the second order sparsity of the underlying problem.

For $\tilde{x} = x - \sigma(A^T y - c)$, in our computations we can always choose $U = \mathrm{Diag}(u)$, the diagonal matrix whose $i$th diagonal element $u_i$ is given by

$$u_i = \begin{cases} 0 & \text{if } |\tilde{x}_i| \le \sigma\lambda, \\ 1 & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, n.$$

Since $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x}) = \mathrm{sign}(\tilde{x}) \circ \max(|\tilde{x}| - \sigma\lambda,\, 0)$, it is not difficult to see that $U \in \partial\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$. Let $J := \{\, j \mid |\tilde{x}_j| > \sigma\lambda,\ j = 1, \ldots, n \,\}$ and $r := |J|$, the cardinality of $J$. By taking the special 0-1 structure of $U$ into consideration, we have that
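In MATLAB, for example, the proximal mapping, a generalized Jacobian element $U = \mathrm{Diag}(u)$, and the index set $J$ can be formed in a few lines; this is an illustrative helper of ours, not code from the paper.

% Soft-thresholding prox of sigma*lambda*||.||_1 at xtil, one element
% U = Diag(u) of its Clarke generalized Jacobian, and the index set J.
function [px, u, J] = prox_l1_jacobian(xtil, sigma, lambda)
t  = sigma * lambda;
px = sign(xtil) .* max(abs(xtil) - t, 0);   % Prox_{sigma*lambda*||.||_1}(xtil)
u  = double(abs(xtil) > t);                 % 0-1 diagonal entries of U
J  = find(u);                               % indices j with |xtil_j| > sigma*lambda
end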


[Figure 1, a schematic of the matrix products: forming $AUA^T$ directly costs $O(m^2 n)$, while forming $A_J A_J^T$ costs $O(m^2 r)$.]

Fig. 1. Reducing the computational costs from $O(m^2 n)$ to $O(m^2 r)$.

(27) $\quad AUA^T = (AU)(AU)^T = A_J A_J^T,$

where $A_J \in \Re^{m\times r}$ is the submatrix of $A$ with those columns not in $J$ removed. Then, by using (27), the costs of computing $AUA^T$ and $AUA^T d$ for a given vector $d$ are reduced to $O(m^2 r)$ and $O(mr)$, respectively. Due to the sparsity-promoting property of the regularizer $p$, the number $r$ is usually much smaller than $n$. Thus, by exploiting the aforementioned second order sparsity, we can greatly reduce the computational cost of solving the linear system (26) when using the Cholesky factorization. More specifically, the total computational cost of using the Cholesky factorization to solve the linear system is reduced from $O(m^2(m+n))$ to $O(m^2(m+r))$. See Figure 1 for an illustration of this reduction. This means that even if $n$ happens to be extremely large (say, larger than $10^7$), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization as long as both $m$ and $r$ are moderate (say, less than $10^4$). If, in addition, $r \ll m$, which is often the case when $m$ is large and the optimal solutions to the underlying problem are sparse, then instead of factorizing an $m\times m$ matrix we can make use of the Sherman–Morrison–Woodbury formula [22] to get the inverse of $I_m + \sigma AUA^T$ by inverting a much smaller $r\times r$ matrix as follows:

$$(I_m + \sigma AUA^T)^{-1} = (I_m + \sigma A_J A_J^T)^{-1} = I_m - A_J\big(\sigma^{-1}I_r + A_J^T A_J\big)^{-1}A_J^T.$$

See Figure 2 for an illustration of the computation of $A_J^T A_J$. In this case, the total computational cost for solving the Newton linear system (26) is reduced further, from $O(m^2(m+r))$ to $O(r^2(m+r))$. We should emphasize here that this dramatic reduction in computational costs results from combining a careful examination of the second order sparsity present in lasso-type problems with some "smart" numerical linear algebra.
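The following MATLAB sketch assembles the Newton direction for the reduced system (26) in the two regimes discussed above, either factorizing the $m\times m$ matrix $I_m + \sigma A_J A_J^T$ or, when $r < m$, applying the Sherman–Morrison–Woodbury identity. It is an illustrative sketch of ours (with $A$ stored explicitly and rhs $= -\nabla\psi(y)$), not the authors' implementation.

% Solve (I_m + sigma*A*U*A')*d = rhs exploiting the 0-1 structure of U,
% where J = find(diag(U)) and r = numel(J).  Illustrative sketch only.
function d = newton_direction(A, J, sigma, rhs)
[m, ~] = size(A);
AJ = A(:, J);                              % keep only the r columns indexed by J
r  = numel(J);
if r >= m
    % factorize the m-by-m matrix I_m + sigma*AJ*AJ'
    M = eye(m) + sigma * (AJ * AJ');
    d = M \ rhs;
else
    % Sherman-Morrison-Woodbury: invert an r-by-r matrix instead,
    % (I + sigma*AJ*AJ')^{-1} = I - AJ*(sigma^{-1}*I_r + AJ'*AJ)^{-1}*AJ'
    S = (1/sigma) * eye(r) + AJ' * AJ;     % r-by-r matrix, cost O(r^2 m)
    d = rhs - AJ * (S \ (AJ' * rhs));
end
end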

From the above arguments, we can see that as long as the number of nonzero components of $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is small, say less than $\sqrt{n}$, and $H_j \in \partial(\nabla h^*)(y^j)$ is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low cost. In particular, this is true for lasso problems admitting sparse solutions.


[Figure 2, a schematic of the $r\times m$ by $m\times r$ product $A_J^T A_J$.]

Fig. 2. Further reducing the computational costs to $O(r^2 m)$.

Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). Note that one may argue that even if the original problem has only sparse solutions, at certain stages one may still encounter the situation that the number of nonzero components of $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice, since we always start with a sparse feasible point, e.g., the zero vector. Second, even if at certain steps this phenomenon does occur, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter $\sigma$ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with $p(\cdot)$ chosen to be $\lambda\|\cdot\|_1$.

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two recent papers [18, 36], which appear to suggest that for some large-scale sparse reconstruction problems, mfIPM^1 and FPC_AS^2 have mostly outperformed the other solvers. Hence, in this section we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP^3 [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparison, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB with the step length set to 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later they lack the ability to efficiently solve difficult problems such as the large-scale regression problems where the data $A$ is badly conditioned.

For testing purposes, the regularization parameter $\lambda$ in the lasso problem (1) is chosen as

$$\lambda = \lambda_c \|\mathcal{A}^* b\|_\infty,$$

where $0 < \lambda_c < 1$. In our numerical experiments, we measure the accuracy of an approximate optimal solution $x$ for (1) by using the following relative KKT residual:

$$\eta = \frac{\big\|x - \mathrm{prox}_{\lambda\|\cdot\|_1}\big(x - \mathcal{A}^*(\mathcal{A}x - b)\big)\big\|}{1 + \|x\| + \|\mathcal{A}x - b\|}.$$

^1 http://www.maths.ed.ac.uk/ERGO/mfipmcs/
^2 http://www.caam.rice.edu/~optimization/L1/FPC_AS/
^3 http://yelab.net/software/SLEP/
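As an illustration, the relative KKT residual $\eta$ can be computed in a few lines of MATLAB; the function and handle names below (Amap for $\mathcal{A}$, ATmap for $\mathcal{A}^*$) are ours and not taken from any of the solvers being compared, and $\lambda$ is assumed to have been set to $\lambda_c\|\mathcal{A}^* b\|_\infty$ as above.

% Relative KKT residual for the lasso problem (1):
%   eta = ||x - prox_{lambda||.||_1}(x - A^*(Ax - b))|| / (1 + ||x|| + ||Ax - b||)
% Amap(x) = A*x and ATmap(y) = A^T*y are user-supplied function handles.
function eta = kkt_residual(Amap, ATmap, b, x, lambda)
res = Amap(x) - b;
v   = x - ATmap(res);
pv  = sign(v) .* max(abs(v) - lambda, 0);        % prox of lambda*||.||_1 at v
eta = norm(x - pv) / (1 + norm(x) + norm(res));
end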


For a given tolerance $\varepsilon > 0$, we stop the tested algorithms when $\eta < \varepsilon$. For all the tests in this section, we set $\varepsilon = 10^{-6}$. The algorithms are also stopped when they reach the maximum number of iterations (1000 iterations for our algorithm and mfIPM, and 20,000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to their default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core Intel Xeon E5-2650 @ 2.60GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms with test instances $(\mathcal{A}, b)$ obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in $A$ (the matrix representation of $\mathcal{A}$) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order 5 polynomial is used to generate the basis functions; this naming convention is also used for the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$, which is denoted as $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$.

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, $m$ denotes the number of samples, $n$ denotes the number of features, and "nnz" denotes the number of nonzeros in the solution $x$ obtained by Ssnal using the following estimation:

$$\mathrm{nnz} := \min\Big\{ k \;\Big|\; \sum_{i=1}^{k} |\hat{x}_i| \ge 0.999\, \|x\|_1 \Big\},$$

where $\hat{x}$ is obtained by sorting $x$ such that $|\hat{x}_1| \ge \cdots \ge |\hat{x}_n|$. An entry marked as "Error" indicates that the algorithm breaks down due to internal errors. The computation time is in the format "hours:minutes:seconds", and an entry of "00" in the time column means that the computation time is less than 0.5 second.
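A direct MATLAB transcription of this estimate (an illustration; the function name is ours):

% nnz estimate: smallest k such that the k largest |x_i| already carry
% 99.9% of ||x||_1.
function k = nnz_estimate(x)
s = sort(abs(x), 'descend');
k = find(cumsum(s) >= 0.999 * norm(x, 1), 1, 'first');
end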

One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances to the required accuracy within 20,000 iterations or 7 hours.

Table 1
Statistics of the UCI test instances.

Probname              m; n               λmax(AA*)
E2006.train           16087; 150348      1.91e+05
log1p.E2006.train     16087; 4265669     5.86e+07
E2006.test            3308; 72812        4.79e+04
log1p.E2006.test      3308; 1771946      1.46e+07
pyrim5                74; 169911         1.22e+06
triazines4            186; 557845        2.07e+07
abalone7              4177; 6435         5.21e+05
bodyfat7              252; 116280        5.29e+04
housing7              506; 77520         3.28e+05
mpg7                  392; 3432          1.28e+04
space_ga9             3107; 5005         4.01e+03


Table 2
The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy ε = 10^{-6}). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM. For each instance (with m; n as in Table 1) and each λc in {10^{-3}, 10^{-4}}, the table reports nnz, the relative KKT residual η attained by each solver, and the computation time of each solver.


In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances; in fact, for 3 test instances it breaks down due to internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with $\lambda_c = 10^{-3}$, Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20,000 iterations, consumes about 2 hours, and only produces a rather inaccurate solution.

On the other hand, one can observe that the two second order information based methods, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds ($\lambda_c = 10^{-3}$). Of these two algorithms, Ssnal is clearly far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where the speedup factor can be up to 300. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (those with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with an accuracy of $10^{-1}$ when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point method in exploiting the sparsity in the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix $A$ (the matrix representation of $\mathcal{A}$) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter $\lambda$ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when $\lambda$ is not a scalar, based on the recommendation of the referee we use another popular active-set based solver, PSSas^4 [44], to replace FPC_AS in this test. All the parameters for PSSas are set to their default values. In our tests, PSSas is stopped when it reaches the maximum number of 20,000 iterations or the maximum computation time of 7 hours. We note that, by default, PSSas terminates when its progress is smaller than the threshold $10^{-9}$.

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions drawn from Table 2. The performance of Ssnal is essentially invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM-type solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy; in fact, PSSas fails on all the test instances when $\lambda_c = 10^{-4}$. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with $\eta = 3.2\times 10^{-3}$. (We also ran PSSas on these test instances without scaling and obtained similar performance; detailed results are omitted to conserve space.)

^4 https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3
The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy ε = 10^{-6}). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM. For each instance (with m; n as in Table 1) and each λc in {10^{-3}, 10^{-4}}, the table reports nnz, the relative KKT residual η attained by each solver, and the computation time of each solver.


Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for the Sparco collection. In this subsection, the test instances $(\mathcal{A}, b)$ are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce 60dB noise to $b$ (as in [18]) by using the MATLAB command b = awgn(b,60,'measured'). For these test instances, the matrix representations of the linear maps $\mathcal{A}$ are not available. Hence, ADMM will not be tested in this subsection, as it would be extremely expensive, if not impossible, to compute and factorize $I + \sigma\mathcal{A}\mathcal{A}^*$.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of $\lambda_c$, i.e., $\lambda_c = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}$. As one can observe, as $\lambda_c$ decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions $(m, n)$ and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$, $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. For all the tested algorithms, we present the iteration counts, the relative KKT residuals, and the computation times (in the format hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well for these test instances with small Lipschitz constants $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$; as such, Ssnal does not have as clear an advantage as in Table 2. Moreover, since the matrix representations of the linear maps $\mathcal{A}$ and $\mathcal{A}^*$ involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases, with $\lambda_c = 10^{-3}$ and $10^{-4}$.

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy ε = 10^{-6}, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM; an entry such as 6.4-7 stands for 6.4 × 10^{-7}.

                       Iteration                     η                                        Time
λc      nnz    a | b | c | d | e           a | b | c | d | e                        a | b | c | d | e

srcsep1: m = 29166, n = 57344, λmax(AA*) = 3.56
0.16    380    14 | 28 | 42 | 401 | 89     6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7    07 | 09 | 05 | 15 | 07
0.10    726    16 | 38 | 42 | 574 | 161    4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7    11 | 15 | 05 | 22 | 13
0.06    1402   19 | 41 | 63 | 801 | 393    1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7    18 | 18 | 10 | 30 | 32
0.03    2899   19 | 56 | 110 | 901 | 337   2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7    28 | 53 | 16 | 33 | 28
0.01    6538   17 | 88 | 223 | 1401 | 542  7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7    1:21 | 2:15 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, λmax(AA*) = 4.95
0.16    423    15 | 29 | 42 | 501 | 127    3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7    14 | 13 | 07 | 25 | 15
0.10    805    16 | 37 | 84 | 601 | 212    8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7    21 | 19 | 16 | 30 | 26
0.06    1549   19 | 40 | 84 | 901 | 419    1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7    32 | 28 | 18 | 44 | 50
0.03    3254   20 | 69 | 128 | 901 | 488   1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7    1:06 | 1:33 | 26 | 44 | 59
0.01    7400   21 | 94 | 259 | 2201 | 837  8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7    1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy ε = 10^{-6}, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM; an entry such as 5.7-7 stands for 5.7 × 10^{-7}.

probname          λc       nnz              η                                       Time
(m; n)                            a | b | c | d | e                        a | b | c | d | e

blknheavi         10^{-3}  12     5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7    01 | 01 | 55 | 08 | 06
1024; 1024        10^{-4}  12     9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7    01 | 01 | 49 | 07 | 07
srcsep1           10^{-3}  14066  1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7    5:41 | 42:34 | 13:25 | 1:56 | 4:16
29166; 57344      10^{-4}  19306  9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7    9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
srcsep2           10^{-3}  16502  3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7    9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
29166; 86016      10^{-4}  22315  7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7    19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
srcsep3           10^{-3}  27314  6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7    33 | 6:24 | 8:51 | 47 | 49
196608; 196608    10^{-4}  83785  9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7    2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
soccer1           10^{-3}  4      1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7    01 | 03 | 13:51 | 2:35 | 02
3200; 4096        10^{-4}  8      8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7    01 | 02 | 13:23 | 3:07 | 02
soccer2           10^{-3}  4      3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7    00 | 03 | 13:46 | 1:40 | 02
3200; 4096        10^{-4}  8      2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7    01 | 03 | 13:27 | 3:07 | 02
blurrycam         10^{-3}  1694   1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7    03 | 09 | 03 | 02 | 07
65536; 65536      10^{-4}  5630   1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7    05 | 1:35 | 08 | 03 | 29
blurspike         10^{-3}  1954   3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7    03 | 05 | 6:38 | 03 | 27
16384; 16384      10^{-4}  11698  3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7    10 | 08 | 6:43 | 05 | 35

We can observe that FPC_AS performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when $\lambda_c = 10^{-3}$ and $10^{-4}$, respectively. In a few cases, FPC_AS can only achieve a poor accuracy ($10^{-1}$). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. For example, both FPC_AS and APG fail to solve the rather small problem blknheavi ($m = n = 1024$), whose corresponding $\lambda_{\max}(\mathcal{A}\mathcal{A}^*) = 709$. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This again demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$, one can safely conclude that the first order algorithms can only be used to solve problems with relatively small $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. Moreover, in order to obtain efficient and robust algorithms for lasso problems, or for more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotically superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving $\ell_1$-regularized least squares (LS) problems.


With the intelligent incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale $\ell_1$-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and the anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candès, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrmann, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex $\ell_1$-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Natl. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationnelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B.-G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wächter, A second-order method for convex $\ell_1$-regularized optimization with active-set prediction, Optim. Methods Softw., 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C^{1,1} optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, in Proceedings of NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A Two-Phase Proximal Augmented Lagrangian Method for Convex Quadratic Semidefinite Programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for $\ell_1$-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with $\ell_1$-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact Regularized Proximal Newton Method: Provable Convergence Guarantees for Non-Smooth Convex Minimization without Strong Convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.
[56] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.
[57] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 13: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 445

Note that ψ(middot) is strongly convex and continuously differentiable on int(domhlowast) with

nablaψ(y) = nablahlowast(y)minusAProxσp(xminus σ(Alowasty minus c)) forally isin int(domhlowast)

Thus y can be obtained via solving the following nonsmooth equation

(22) nablaψ(y) = 0 y isin int(domhlowast)

Let y isin int(domhlowast) be any given point Since hlowast is a convex function with a locallyLipschitz continuous gradient on int(domhlowast) the following operator is well defined

part2ψ(y) = part(nablahlowast)(y) + σApartProxσp(xminus σ(Alowasty minus c))Alowast

where part(nablahlowast)(y) is the Clarke subdifferential of nablahlowast at y [10] and partProxσp(x minusσ(Alowastyminusc)) is the Clarke subdifferential of the Lipschitz continuous mapping Proxσp(middot)at xminusσ(Alowastyminusc) Note that from [10 Proposition 233 and Theorem 266] we knowthat

part2ψ(y) (d) sube part2ψ(y) (d) forall d isin Y

where part2ψ(y) denotes the generalized Hessian of ψ at y Define

(23) V = H + σAUAlowast

with H isin part2hlowast(y) and U isin partProxσp(x minus σ(Alowasty minus c)) Then we have V isin part2ψ(y)Since hlowast is a strongly convex function we know that H is symmetric positive definiteon Y and thus V is also symmetric positive definite on Y

Under the mild assumption that nablahlowast and Proxσp are strongly semismooth (thedefinition of which is given next) we can design a superlinearly convergent semismoothNewton method to solve the nonsmooth equation (22)

Definition 35 (semismoothness [35 38 45]) Let F O sube X rarr Y be a locallyLipschitz continuous function on the open set O F is said to be semismooth at x isin Oif F is directionally differentiable at x and for any V isin partF (x+ ∆x) with ∆xrarr 0

F (x+ ∆x)minus F (x)minus V∆x = o(∆x)

F is said to be strongly semismooth at x if F is semismooth at x and

F (x+ ∆x)minus F (x)minus V∆x = O(∆x2)

F is said to be a semismooth (respectively strongly semismooth) function on O if itis semismooth (respectively strongly semismooth) everywhere in O

Note that it is widely known in the nonsmooth optimizationequation commu-nity that continuous piecewise affine functions and twice continuously differentiablefunctions are all strongly semismooth everywhere In particular Proxmiddot1 as a Lips-chitz continuous piecewise affine function is strongly semismooth See [14] for moresemismooth and strongly semismooth functions

Now we can design a semismooth Newton (Ssn) method to solve (22) as followsand could expect to get a fast superlinear or even quadratic convergenceD

ownl

oade

d 02

22

18 to

158

132

175

120

Red

istr

ibut

ion

subj

ect t

o SI

AM

lice

nse

or c

opyr

ight

see

http

w

ww

sia

mo

rgjo

urna

lso

jsa

php

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

446 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

Algorithm SSN A semismooth Newton algorithm for solving (22)(SSN(y0 x σ))

Given micro isin (0 12) η isin (0 1) τ isin (0 1] and δ isin (0 1) Choose y0 isin int(domhlowast)Iterate the following steps for j = 0 1 Step 1 Choose Hj isin part(nablahlowast)(yj) and Uj isin partProxσp(x minus σ(Alowastyj minus c)) Let Vj =

Hj + σAUjAlowast Solve the linear system

(24) Vjd+nablaψ(yj) = 0

exactly or by the conjugate gradient (CG) algorithm to find dj such that

Vjdj +nablaψ(yj) le min(η nablaψ(yj)1+τ )

Step 2 (Line search) Set αj = δmj where mj is the first nonnegative integer m forwhich

yj+δmdj isin int(domhlowast) and ψ(yj+δmdj)leψ(yj)+microδm〈nablaψ(yj) dj〉

Step 3 Set yj+1 = yj + αj dj

The convergence results for the above Ssn algorithm are stated in the nexttheorem

Theorem 36 Assume that nablahlowast(middot) and Proxσp(middot) are strongly semismooth onint(domhlowast) and X respectively Let the sequence yj be generated by AlgorithmSsn Then yj converges to the unique optimal solution y isin int(domhlowast) of theproblem in (22) and

yj+1 minus y = O(yj minus y1+τ )

Proof Since by [54 Proposition 33] dj is a descent direction Algorithm Ssn is

well defined By (23) we know that for any j ge 0 Vj isin part2ψ(yj) Then we can provethe conclusion of this theorem by following the proofs of [54 Theorems 34 and 35]We omit the details here

We shall now discuss the implementations of stopping criteria (A) (B1) and (B2)for Algorithm Ssn to solve the subproblem (18) in Algorithm Ssnal Note that whenAlgorithm Ssn is applied to minimize Ψk(middot) to find

yk+1 = Ssn(yk xk σk) and zk+1 = Proxplowastσk(xkσk minusAlowastyk+1 + c)

we have by simple calculations and the strong convexity of hlowast that

Ψk(yk+1 zk+1)minus inf Ψk = ψk(yk+1)minus inf ψk le (12αh)nablaψk(yk+1)2

and (nablaψk(yk+1) 0) isin partΨk(yk+1 zk+1) where ψk(y) = infz Ψk(y z) for all y isin YTherefore the stopping criteria (A) (B1) and (B2) can be achieved by the followingimplementable criteria

(Aprime) nablaψk(yk+1) leradicαhσk εk

infinsumk=0

εk ltinfin

(B1prime) nablaψk(yk+1) leradicαhσk δkAlowastyk+1 + zk+1 minus c

infinsumk=0

δk lt +infin

(B2prime) nablaψk(yk+1) le δprimekAlowastyk+1 + zk+1 minus c 0 le δprimek rarr 0

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 447

That is the stopping criteria (A) (B1) and (B2) will be satisfied as long asnablaψk(yk+1) is sufficiently small

33 An efficient implementation of SSN for solving subproblems (18)When Algorithm Ssnal is applied to solve the general convex composite optimizationmodel (P) the key part is to use Algorithm Ssn to solve the subproblems (18) Inthis subsection we shall discuss an efficient implementation of Ssn for solving theaforementioned subproblems when the nonsmooth regularizer p is chosen to be λ middot 1for some λ gt 0 Clearly in Algorithm Ssn the most important step is the computationof the search direction dj from the linear system (24) So we shall first discuss thesolving of this linear system

Let (x y) isin ltn times ltm and σ gt 0 be given We consider the following Newtonlinear system

(25) (H + σAUAT )d = minusnablaψ(y)

where H isin part(nablahlowast)(y) A denotes the matrix representation of A with respect to thestandard bases of ltn and ltm and U isin partProxσλmiddot1(x) with x = xminusσ(AT yminusc) SinceH is a symmetric and positive definite matrix (25) can be equivalently rewritten as(

Im + σ(Lminus1A)U(Lminus1A)T)(LT d) = minusLminus1nablaψ(y)

where L is a nonsingular matrix obtained from the (sparse) Cholesky decompositionof H such that H = LLT In many applications H is usually a sparse matrix Indeedwhen the function h in the primal objective is taken to be the squared loss or thelogistic loss functions the resulting matrices H are in fact diagonal matrices That isthe costs of computing L and its inverse are negligible in most situations Thereforewithout loss of generality we can consider a simplified version of (25) as follows

(26) (Im + σAUAT )d = minusnablaψ(y)

which is precisely the Newton system associated with the standard lasso problem(1) Since U isin ltntimesn is a diagonal matrix at first glance the costs of computingAUAT and the matrix-vector multiplication AUAT d for a given vector d isin ltm areO(m2n) and O(mn) respectively These computational costs are too expensive whenthe dimensions of A are large and can make the commonly employed approaches suchas the Cholesky factorization and the conjugate gradient method inappropriate forsolving (26) Fortunately under the sparse optimization setting if the sparsity of Uis wisely taken into the consideration one can substantially reduce these unfavorablecomputational costs to a level such that they are negligible or at least insignificantcompared to other costs Next we shall show how this can be done by taking fulladvantage of the sparsity of U This sparsity will be referred to as the second ordersparsity of the underlying problem

For x = xminusσ(AT yminus c) in our computations we can always choose U = Diag(u)the diagonal matrix whose ith diagonal element is given by ui with

ui =

0 if |xi| le σλ

1 otherwisei = 1 n

Since Proxσλmiddot1(x) = sign(x) max|x| minus σλ 0 it is not difficult to see that U isinpartProxσλmiddot1(x) Let J = j | |xj | gt σλ j = 1 n and r = |J | the cardinalityof J By taking the special 0-1 structure of U into consideration we have that

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

448 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

m

n

AUAT =O(m2n)

AJATJ =

m

r

= O(m2r)

Fig 1 Reducing the computational costs from O(m2n) to O(m2r)

(27) AUAT = (AU)(AU)T = AJATJ

where AJ isin ltmtimesr is the submatrix of A with those columns not in J being removedfrom A Then by using (27) we know that now the costs of computing AUAT andAUAT d for a given vector d are reduced to O(m2r) and O(mr) respectively Due tothe sparsity-promoting property of the regularizer p the number r is usually muchsmaller than n Thus by exploring the aforementioned second order sparsity we cangreatly reduce the computational costs in solving the linear system (26) when usingthe Cholesky factorization More specifically the total computational costs of usingthe Cholesky factorization to solve the linear system are reduced from O(m2(m+n))to O(m2(m+ r)) See Figure 1 for an illustration on the reduction This means thateven if n happens to be extremely large (say larger than 107) one can still solve theNewton linear system (26) efficiently via the Cholesky factorization as long as bothm and r are moderate (say less than 104) If in addition r m which is often thecase when m is large and the optimal solutions to the underlying problem are sparseinstead of factorizing an mtimesm matrix we can make use of the ShermanndashMorrisonndashWoodbury formula [22] to get the inverse of Im+σAUAT by inverting a much smallerr times r matrix as follows

$$ (I_m + \sigma AUA^T)^{-1} = (I_m + \sigma A_{\mathcal{J}} A_{\mathcal{J}}^T)^{-1} = I_m - A_{\mathcal{J}}\,(\sigma^{-1} I_r + A_{\mathcal{J}}^T A_{\mathcal{J}})^{-1} A_{\mathcal{J}}^T. $$

See Figure 2 for an illustration of the computation of $A_{\mathcal{J}}^T A_{\mathcal{J}}$. In this case, the total computational costs for solving the Newton linear system (26) are reduced significantly further, from $O(m^2(m+r))$ to $O(r^2(m+r))$. We should emphasize here that this dramatic reduction in computational costs results from the wise combination of a careful examination of the existing second order sparsity in lasso-type problems and some "smart" numerical linear algebra.
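To make the cost analysis above concrete, the following MATLAB-style sketch assembles the Newton direction for (26) by exploiting the 0-1 structure of $U$: it builds the index set $\mathcal{J}$ from $\tilde{x}$, extracts $A_{\mathcal{J}}$, and then either applies the Sherman–Morrison–Woodbury identity displayed above (when $r$ is smaller than $m$) or factorizes the $m\times m$ matrix directly. This is only an illustrative sketch; the variable names A, xtil, sigma, lambda, and rhs (standing for $-\nabla\psi(y)$) are our own assumptions and are not taken from the Ssnal implementation.

    % Sketch only: Newton direction for (26) using the second order sparsity of U.
    % Assumed inputs: A (m x n), xtil = x - sigma*(A'*y - c), sigma, lambda > 0,
    % and rhs = -grad psi(y); these names are illustrative, not from the paper.
    J  = abs(xtil) > sigma*lambda;      % active index set; U = Diag(J), r = nnz(J)
    AJ = A(:, J);                       % A_J: columns of A indexed by J
    [m, r] = size(AJ);
    if r < m
        % Sherman-Morrison-Woodbury: invert an r x r matrix instead of an m x m one.
        W = eye(r)/sigma + AJ'*AJ;      % sigma^{-1} I_r + A_J'*A_J, O(r^2 m) to form
        d = rhs - AJ*(W\(AJ'*rhs));     % d = (I_m - A_J W^{-1} A_J') rhs
    else
        % Otherwise form and factorize I_m + sigma*A_J*A_J', O(m^2 r) to form.
        M = eye(m) + sigma*(AJ*AJ');
        d = M\rhs;
    end

In both branches only the columns indexed by $\mathcal{J}$ are touched, which is precisely the source of the reduction from $O(m^2(m+n))$ to $O(m^2(m+r))$ or $O(r^2(m+r))$ described above.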

From the above arguments, we can see that as long as the number of the nonzero components of $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is small, say less than $\sqrt{n}$, and $H_j \in \partial(\nabla h^*)(y^j)$ is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low costs. In particular, this is true for the lasso problems admitting sparse


[Fig. 2. Further reducing the computational costs of forming $A_{\mathcal{J}}^T A_{\mathcal{J}}$ to $O(r^2 m)$.]

solutions. Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). Note that one may argue that, even if the original problem has only sparse solutions, at certain stages one may still encounter the situation that the number of the nonzero components of $\mathrm{Prox}_{\sigma\lambda\|\cdot\|_1}(\tilde{x})$ is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice since we always start with a sparse feasible point, e.g., the zero vector. Second, even if at certain steps this phenomenon does occur, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter $\sigma$ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with $p(\cdot)$ chosen to be $\lambda\|\cdot\|_1$.

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two recent papers [18, 36], which appear to suggest that for some large-scale sparse reconstruction problems, mfIPM¹ and FPC_AS² have mostly outperformed the other solvers. Hence, in this section we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP³ [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparisons, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB, with the step length set to be 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later, they lack the ability to efficiently solve difficult problems such as the large-scale regression problems when the data $A$ is badly conditioned.

For testing purposes, the regularization parameter $\lambda$ in the lasso problem (1) is chosen as
$$ \lambda = \lambda_c \|A^* b\|_\infty, $$

where $0 < \lambda_c < 1$. In our numerical experiments, we measure the accuracy of an approximate optimal solution $x$ for (1) by using the following relative KKT residual:
$$ \eta = \frac{\big\| x - \mathrm{prox}_{\lambda\|\cdot\|_1}\big(x - A^*(Ax - b)\big) \big\|}{1 + \|x\| + \|Ax - b\|}. $$
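As a small illustration of this setup, the MATLAB-style sketch below computes the test value of $\lambda$ and the relative KKT residual $\eta$ for a candidate solution; the variable names A, b, x, and lambda_c are illustrative assumptions rather than code from the paper.

    % Sketch only: regularization parameter and relative KKT residual eta.
    lambda = lambda_c * norm(A'*b, inf);          % lambda = lambda_c * ||A^* b||_inf
    res    = A*x - b;
    z      = x - A'*res;                          % x - A^*(Ax - b)
    proxz  = sign(z) .* max(abs(z) - lambda, 0);  % prox_{lambda ||.||_1}(z)
    eta    = norm(x - proxz) / (1 + norm(x) + norm(res));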

¹http://www.maths.ed.ac.uk/ERGO/mfipmcs/
²http://www.caam.rice.edu/~optimization/L1/FPC_AS/
³http://yelab.net/software/SLEP/


For a given tolerance $\varepsilon > 0$, we will stop the tested algorithms when $\eta < \varepsilon$. For all the tests in this section, we set $\varepsilon = 10^{-6}$. The algorithms will also be stopped when they reach the maximum number of iterations (1000 iterations for our algorithm and mfIPM, and 20000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to the default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core Intel Xeon E5-2650, 2.60GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms with the test instances $(A, b)$ obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in $A$ (the matrix representation of $\mathcal{A}$) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order 5 polynomial is used to generate the basis functions. This naming convention is also used in the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$, which is denoted as $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$.

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, $m$ denotes the number of samples, $n$ denotes the number of features, and "nnz" denotes the number of nonzeros in the solution $x$ obtained by Ssnal using the following estimation:

$$ \mathrm{nnz} = \min\Big\{ k \ \Big|\ \sum_{i=1}^{k} |\hat{x}_i| \ge 0.999\, \|x\|_1 \Big\}, $$
where $\hat{x}$ is obtained by sorting $x$ such that $|\hat{x}_1| \ge \cdots \ge |\hat{x}_n|$. An entry marked as "Error" indicates that the algorithm breaks down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and the entry "00" in the time column means that the computation time is less than 0.5 second.
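For concreteness, a minimal MATLAB-style sketch of this "nnz" estimate is given below; it simply counts how many of the largest-magnitude entries of the computed solution $x$ are needed to capture 99.9% of $\|x\|_1$ (the variable names are illustrative assumptions).

    % Sketch only: the nnz estimate reported in the tables.
    xs      = sort(abs(x), 'descend');            % |x_hat_1| >= ... >= |x_hat_n|
    nnz_est = find(cumsum(xs) >= 0.999*norm(x, 1), 1, 'first');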

One can observe from Table 2 that all the tested first order algorithms except ADMM, i.e., FPC_AS, APG, and LADMM, fail to solve most of the test instances

Table 1
Statistics of the UCI test instances.

Probname            m; n              $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$
E2006.train         16087; 150348     1.91e+05
log1p.E2006.train   16087; 4265669    5.86e+07
E2006.test          3308; 72812       4.79e+04
log1p.E2006.test    3308; 1771946     1.46e+07
pyrim5              74; 169911        1.22e+06
triazines4          186; 557845       2.07e+07
abalone7            4177; 6435        5.21e+05
bodyfat7            252; 116280       5.29e+04
housing7            506; 77520        3.28e+05
mpg7                392; 3432         1.28e+04
space_ga9           3107; 5005        4.01e+03


[Table 2. The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy $\varepsilon = 10^{-6}$). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM; for each problem (m; n) and each $\lambda_c \in \{10^{-3}, 10^{-4}\}$, the columns report nnz, the relative KKT residuals $\eta$ of a|b|c|d|e|f, and the computation times of a|b|c|d|e|f.]


to the required accuracy after 20000 iterations or 7 hours. In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances; in fact, for 3 test instances it breaks down due to some internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with $\lambda_c = 10^{-3}$, we can see that Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20000 iterations and consumes about 2 hours but only produces a rather inaccurate solution.

On the other hand, one can observe that the two second order information based methods, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds ($\lambda_c = 10^{-3}$). Between these two algorithms, Ssnal is clearly far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where it can be up to 300 times faster. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (those with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with an accuracy of $10^{-1}$ when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point method in exploiting the sparsity in the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix $A$ (the matrix representation of $\mathcal{A}$) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter $\lambda$ accordingly to a nonuniform weight vector. Since it is not easy to call FPC_AS when $\lambda$ is not a scalar, based on the recommendation of the referee, we use another popular active-set based solver, PSSas⁴ [44], to replace FPC_AS for the testing. All the parameters for PSSas are set to the default values. In our tests, PSSas will be stopped when it reaches the maximum number of 20000 iterations or the maximum computation time of 7 hours. We note that, by default, PSSas will terminate when its progress is smaller than the threshold $10^{-9}$.

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions based on Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM type of solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy; in fact, PSSas fails on all the test instances when $\lambda_c = 10^{-4}$. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with $\eta = 3.2\times 10^{-3}$. (Actually, we also run PSSas on these test instances without scaling

⁴https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


[Table 3. The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy $\varepsilon = 10^{-6}$). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM; for each problem (m; n) and each $\lambda_c \in \{10^{-3}, 10^{-4}\}$, the columns report nnz, the relative KKT residuals $\eta$ of a|b|c1|d|e|f, and the computation times of a|b|c1|d|e|f.]


and obtain similar performances; detailed results are omitted to conserve space.) Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for Sparco collection. In this subsection, the test instances $(A, b)$ are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce a 60dB noise to $b$ (as in [18]) by using the MATLAB command b = awgn(b, 60, 'measured'). For these test instances, the matrix representations of the linear maps $\mathcal{A}$ are not available. Hence, ADMM will not be tested in this subsection, as it would be extremely expensive, if not impossible, to compute and factorize $I + \sigma\mathcal{A}\mathcal{A}^*$.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of $\lambda_c$, i.e., $\lambda_c = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}$. As one can observe, as $\lambda_c$ decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions $(m, n)$ and the largest eigenvalue of $\mathcal{A}\mathcal{A}^*$, $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. For all the tested algorithms, we present the iteration counts, the relative KKT residuals, as well as the computation times (in the format of hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well for these test instances with small Lipschitz constants $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. As such, Ssnal does not have as clear an advantage as shown in Table 2. Moreover, since the matrix representations of the linear maps $\mathcal{A}$ and $\mathcal{A}^*$ involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases with $\lambda_c = 10^{-3}$ and $10^{-4}$. We can observe that FPC_AS

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy $\varepsilon = 10^{-6}$, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM; times are in the format hours:minutes:seconds.

λc   | nnz  | Iteration (a | b | c | d | e)  | η (a | b | c | d | e)                   | Time (a | b | c | d | e)

srcsep1: m = 29166, n = 57344, $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$ = 3.56
0.16 | 380  | 14 | 28 | 42 | 401 | 89       | 6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7    | 07 | 09 | 05 | 15 | 07
0.10 | 726  | 16 | 38 | 42 | 574 | 161      | 4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7    | 11 | 15 | 05 | 22 | 13
0.06 | 1402 | 19 | 41 | 63 | 801 | 393      | 1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7    | 18 | 18 | 10 | 30 | 32
0.03 | 2899 | 19 | 56 | 110 | 901 | 337     | 2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7    | 28 | 53 | 16 | 33 | 28
0.01 | 6538 | 17 | 88 | 223 | 1401 | 542    | 7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7    | 1:21 | 2:15 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$ = 4.95
0.16 | 423  | 15 | 29 | 42 | 501 | 127      | 3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7    | 14 | 13 | 07 | 25 | 15
0.10 | 805  | 16 | 37 | 84 | 601 | 212      | 8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7    | 21 | 19 | 16 | 30 | 26
0.06 | 1549 | 19 | 40 | 84 | 901 | 419      | 1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7    | 32 | 28 | 18 | 44 | 50
0.03 | 3254 | 20 | 69 | 128 | 901 | 488     | 1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7    | 1:06 | 1:33 | 26 | 44 | 59
0.01 | 7400 | 21 | 94 | 259 | 2201 | 837    | 8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7    | 1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy $\varepsilon = 10^{-6}$, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM; times are in the format hours:minutes:seconds.

Probname (m; n)          | λc      | nnz   | η (a | b | c | d | e)                   | Time (a | b | c | d | e)
blknheavi (1024; 1024)   | 10^{-3} | 12    | 5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7  | 01 | 01 | 55 | 08 | 06
                         | 10^{-4} | 12    | 9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7  | 01 | 01 | 49 | 07 | 07
srcsep1 (29166; 57344)   | 10^{-3} | 14066 | 1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7  | 5:41 | 42:34 | 13:25 | 1:56 | 4:16
                         | 10^{-4} | 19306 | 9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7  | 9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
srcsep2 (29166; 86016)   | 10^{-3} | 16502 | 3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7  | 9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
                         | 10^{-4} | 22315 | 7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7  | 19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
srcsep3 (196608; 196608) | 10^{-3} | 27314 | 6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7  | 33 | 6:24 | 8:51 | 47 | 49
                         | 10^{-4} | 83785 | 9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7  | 2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
soccer1 (3200; 4096)     | 10^{-3} | 4     | 1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7  | 01 | 03 | 13:51 | 2:35 | 02
                         | 10^{-4} | 8     | 8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7  | 01 | 02 | 13:23 | 3:07 | 02
soccer2 (3200; 4096)     | 10^{-3} | 4     | 3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7  | 00 | 03 | 13:46 | 1:40 | 02
                         | 10^{-4} | 8     | 2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7  | 01 | 03 | 13:27 | 3:07 | 02
blurrycam (65536; 65536) | 10^{-3} | 1694  | 1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7  | 03 | 09 | 03 | 02 | 07
                         | 10^{-4} | 5630  | 1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7  | 05 | 1:35 | 08 | 03 | 29
blurspike (16384; 16384) | 10^{-3} | 1954  | 3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7  | 03 | 05 | 6:38 | 03 | 27
                         | 10^{-4} | 11698 | 3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7  | 10 | 08 | 6:43 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust in that it fails to solve 4 out of 8 and 5 out of 8 problems when $\lambda_c = 10^{-3}$ and $10^{-4}$, respectively. For a few cases, FPC_AS can only achieve a poor accuracy ($10^{-1}$). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi ($m = n = 1024$), whose corresponding $\lambda_{\max}(\mathcal{A}\mathcal{A}^*) = 709$. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$, one can safely conclude that the first order algorithms can only be used to solve problems with relatively small $\lambda_{\max}(\mathcal{A}\mathcal{A}^*)$. Moreover, in order to obtain efficient and robust algorithms for lasso problems, or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method of an asymptotic superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving $\ell_1$-regularized least squares (LS) problems. With the intelligent


incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale $\ell_1$-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candès, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrmann, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intel. Syst. Tech., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation par éléments finis d'ordre un et la résolution par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationnelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wächter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Software, 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C^{1,1} optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting Risk from Financial Reports with Regression, NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A Two-Phase Proximal Augmented Lagrangian Method for Convex Quadratic Semidefinite Programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact Regularized Proximal Newton Method: Provable Convergence Guarantees for Non-Smooth Convex Minimization without Strong Convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.
[56] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.
[57] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.

Page 14: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

446 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

Algorithm SSN A semismooth Newton algorithm for solving (22)(SSN(y0 x σ))

Given micro isin (0 12) η isin (0 1) τ isin (0 1] and δ isin (0 1) Choose y0 isin int(domhlowast)Iterate the following steps for j = 0 1 Step 1 Choose Hj isin part(nablahlowast)(yj) and Uj isin partProxσp(x minus σ(Alowastyj minus c)) Let Vj =

Hj + σAUjAlowast Solve the linear system

(24) Vjd+nablaψ(yj) = 0

exactly or by the conjugate gradient (CG) algorithm to find dj such that

Vjdj +nablaψ(yj) le min(η nablaψ(yj)1+τ )

Step 2 (Line search) Set αj = δmj where mj is the first nonnegative integer m forwhich

yj+δmdj isin int(domhlowast) and ψ(yj+δmdj)leψ(yj)+microδm〈nablaψ(yj) dj〉

Step 3 Set yj+1 = yj + αj dj

The convergence results for the above Ssn algorithm are stated in the nexttheorem

Theorem 36 Assume that nablahlowast(middot) and Proxσp(middot) are strongly semismooth onint(domhlowast) and X respectively Let the sequence yj be generated by AlgorithmSsn Then yj converges to the unique optimal solution y isin int(domhlowast) of theproblem in (22) and

yj+1 minus y = O(yj minus y1+τ )

Proof Since by [54 Proposition 33] dj is a descent direction Algorithm Ssn is

well defined By (23) we know that for any j ge 0 Vj isin part2ψ(yj) Then we can provethe conclusion of this theorem by following the proofs of [54 Theorems 34 and 35]We omit the details here

We shall now discuss the implementations of stopping criteria (A) (B1) and (B2)for Algorithm Ssn to solve the subproblem (18) in Algorithm Ssnal Note that whenAlgorithm Ssn is applied to minimize Ψk(middot) to find

yk+1 = Ssn(yk xk σk) and zk+1 = Proxplowastσk(xkσk minusAlowastyk+1 + c)

we have by simple calculations and the strong convexity of hlowast that

Ψk(yk+1 zk+1)minus inf Ψk = ψk(yk+1)minus inf ψk le (12αh)nablaψk(yk+1)2

and (nablaψk(yk+1) 0) isin partΨk(yk+1 zk+1) where ψk(y) = infz Ψk(y z) for all y isin YTherefore the stopping criteria (A) (B1) and (B2) can be achieved by the followingimplementable criteria

(Aprime) nablaψk(yk+1) leradicαhσk εk

infinsumk=0

εk ltinfin

(B1prime) nablaψk(yk+1) leradicαhσk δkAlowastyk+1 + zk+1 minus c

infinsumk=0

δk lt +infin

(B2prime) nablaψk(yk+1) le δprimekAlowastyk+1 + zk+1 minus c 0 le δprimek rarr 0

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 447

That is the stopping criteria (A) (B1) and (B2) will be satisfied as long asnablaψk(yk+1) is sufficiently small

33 An efficient implementation of SSN for solving subproblems (18)When Algorithm Ssnal is applied to solve the general convex composite optimizationmodel (P) the key part is to use Algorithm Ssn to solve the subproblems (18) Inthis subsection we shall discuss an efficient implementation of Ssn for solving theaforementioned subproblems when the nonsmooth regularizer p is chosen to be λ middot 1for some λ gt 0 Clearly in Algorithm Ssn the most important step is the computationof the search direction dj from the linear system (24) So we shall first discuss thesolving of this linear system

Let (x y) isin ltn times ltm and σ gt 0 be given We consider the following Newtonlinear system

(25) (H + σAUAT )d = minusnablaψ(y)

where H isin part(nablahlowast)(y) A denotes the matrix representation of A with respect to thestandard bases of ltn and ltm and U isin partProxσλmiddot1(x) with x = xminusσ(AT yminusc) SinceH is a symmetric and positive definite matrix (25) can be equivalently rewritten as(

Im + σ(Lminus1A)U(Lminus1A)T)(LT d) = minusLminus1nablaψ(y)

where L is a nonsingular matrix obtained from the (sparse) Cholesky decompositionof H such that H = LLT In many applications H is usually a sparse matrix Indeedwhen the function h in the primal objective is taken to be the squared loss or thelogistic loss functions the resulting matrices H are in fact diagonal matrices That isthe costs of computing L and its inverse are negligible in most situations Thereforewithout loss of generality we can consider a simplified version of (25) as follows

(26) (Im + σAUAT )d = minusnablaψ(y)

which is precisely the Newton system associated with the standard lasso problem(1) Since U isin ltntimesn is a diagonal matrix at first glance the costs of computingAUAT and the matrix-vector multiplication AUAT d for a given vector d isin ltm areO(m2n) and O(mn) respectively These computational costs are too expensive whenthe dimensions of A are large and can make the commonly employed approaches suchas the Cholesky factorization and the conjugate gradient method inappropriate forsolving (26) Fortunately under the sparse optimization setting if the sparsity of Uis wisely taken into the consideration one can substantially reduce these unfavorablecomputational costs to a level such that they are negligible or at least insignificantcompared to other costs Next we shall show how this can be done by taking fulladvantage of the sparsity of U This sparsity will be referred to as the second ordersparsity of the underlying problem

For x = xminusσ(AT yminus c) in our computations we can always choose U = Diag(u)the diagonal matrix whose ith diagonal element is given by ui with

ui =

0 if |xi| le σλ

1 otherwisei = 1 n

Since Proxσλmiddot1(x) = sign(x) max|x| minus σλ 0 it is not difficult to see that U isinpartProxσλmiddot1(x) Let J = j | |xj | gt σλ j = 1 n and r = |J | the cardinalityof J By taking the special 0-1 structure of U into consideration we have that

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

448 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

m

n

AUAT =O(m2n)

AJATJ =

m

r

= O(m2r)

Fig 1 Reducing the computational costs from O(m2n) to O(m2r)

(27) AUAT = (AU)(AU)T = AJATJ

where AJ isin ltmtimesr is the submatrix of A with those columns not in J being removedfrom A Then by using (27) we know that now the costs of computing AUAT andAUAT d for a given vector d are reduced to O(m2r) and O(mr) respectively Due tothe sparsity-promoting property of the regularizer p the number r is usually muchsmaller than n Thus by exploring the aforementioned second order sparsity we cangreatly reduce the computational costs in solving the linear system (26) when usingthe Cholesky factorization More specifically the total computational costs of usingthe Cholesky factorization to solve the linear system are reduced from O(m2(m+n))to O(m2(m+ r)) See Figure 1 for an illustration on the reduction This means thateven if n happens to be extremely large (say larger than 107) one can still solve theNewton linear system (26) efficiently via the Cholesky factorization as long as bothm and r are moderate (say less than 104) If in addition r m which is often thecase when m is large and the optimal solutions to the underlying problem are sparseinstead of factorizing an mtimesm matrix we can make use of the ShermanndashMorrisonndashWoodbury formula [22] to get the inverse of Im+σAUAT by inverting a much smallerr times r matrix as follows

(Im + σAUAT )minus1 = (Im + σAJATJ )minus1 = Im minusAJ (σminus1Ir +ATJAJ )minus1ATJ

See Figure 2 for an illustration on the computation of ATJAJ In this case the totalcomputational costs for solving the Newton linear system (26) are reduced significantlyfurther from O(m2(m + r)) to O(r2(m + r)) We should emphasize here that thisdramatic reduction in computational costs results from the wise combination of thecareful examination of the existing second order sparsity in the lasso-type problemsand some ldquosmartrdquo numerical linear algebra

From the above arguments we can see that as long as the number of the nonzerocomponents of Proxσλmiddot1(x) is small say less than

radicn and Hj isin part(nablahlowast)(yj) is a

sparse matrix eg a diagonal matrix we can always solve the linear system (24)at very low costs In particular this is true for the lasso problems admitting sparse

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 449

rm

=ATJAJ = O(r2m)

Fig 2 Further reducing the computational costs to O(r2m)

solutions Similar discussions on the reduction of the computational costs can alsobe conducted for the case when the conjugate gradient method is applied to solvethe linear systems (24) Note that one may argue that even if the original problemhas only sparse solutions at certain stages one may still encounter the situation thatthe number of the nonzero components of Proxσλmiddot1(x) is large Our answer to thisquestion is simple First this phenomenon rarely occurs in practice since we alwaysstart with a sparse feasible point eg the zero vector Second even if at certainsteps this phenomenon does occur we just need to apply a small number of conjugategradient iterations to the linear system (24) as in this case the parameter σ is normallysmall and the current point is far away from any sparse optimal solution In summarywe have demonstrated how Algorithm Ssn can be implemented efficiently for solvingsparse optimization problems of the form (18) with p(middot) being chosen to be λ middot 1

4 Numerical experiments for lasso problems In this section we shallevaluate the performance of our algorithm Ssnal for solving large-scale lasso prob-lems (1) We note that the relative performance of most of the existing algorithmsmentioned in the introduction has recently been well documented in the two recentpapers [18 36] which appears to suggest that for some large-scale sparse reconstruc-tion problems mfIPM1 and FPC AS2 have mostly outperformed the other solversHence in this section we will compare our algorithm with these two popular solversNote that mfIPM is a specialized interior-point based second order method designedfor the lasso problem (1) whereas FPC AS is a first order method based on forward-backward operator splitting Moreover we also report the numerical performance oftwo commonly used algorithms for solving lasso problems the accelerated proximalgradient (APG) algorithm [37 2] as implemented by Liu Ji and Ye in SLEP3 [32]and the alternating direction method of multipliers (ADMM) [19 20] For the pur-pose of comparisons we also test the linearized ADMM (LADMM) [55] We haveimplemented both ADMM and LADMM in MATLAB with the step length set tobe 1618 Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems as one will see later they lack the ability toefficiently solve difficult problems such as the large-scale regression problems whenthe data A is badly conditioned

For testing purposes the regularization parameter λ in the lasso problem (1) ischosen as

λ = λcAlowastbinfin

where 0 lt λc lt 1 In our numerical experiments we measure the accuracy of anapproximate optimal solution x for (1) by using the following relative KKT residual

η =xminus proxλmiddot1(xminusAlowast(Axminus b))

1 + x+ Axminus b

1httpwwwmathsedacukERGOmfipmcs2httpwwwcaamriceedusimoptimizationL1FPC AS3httpyelabnetsoftwareSLEP

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

450 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

For a given tolerance ε gt 0 we will stop the tested algorithms when η lt ε Forall the tests in this section we set ε = 10minus6 The algorithms will also be stoppedwhen they reach the maximum number of iterations (1000 iterations for our algorithmand mfIPM and 20000 iterations for FPC AS APG ADMM and LADMM) or themaximum computation time of 7 hours All the parameters for mfIPM FPC ASand APG are set to the default values All our computational results are obtainedby running MATLAB (version 84) on a Windows workstation (16-core Intel XeonE5-2650 260GHz 64 G RAM)

41 Numerical results for large-scale regression problems In this sub-section we test all the algorithms with the test instances (A b) obtained from large-scale regression problems in the LIBSVM data sets [8] These data sets are collectedfrom 10-K Corpus [28] and the UCI data repository [30] For computational efficiencyzero columns in A (the matrix representation of A) are removed As suggested in[23] for the data sets pyrim triazines abalone bodyfat housing mpg andspace ga we expand their original features by using polynomial basis functions overthose features For example the last digit in pyrim5 indicates that an order 5 poly-nomial is used to generate the basis functions This naming convention is also usedin the rest of the expanded data sets These test instances shown in Table 1 can bequite difficult in terms of the problem dimensions and the largest eigenvalue of AAlowastwhich is denoted as λmax(AAlowast)

Table 2 reports the detailed numerical results for Ssnal mfIPM FPC AS APGLADMM and ADMM in solving large-scale regression problems In the table mdenotes the number of samples n denotes the number of features and ldquonnzrdquo denotesthe number of nonzeros in the solution x obtained by Ssnal using the followingestimation

nnz = min

k |

ksumi=1

|xi| ge 0999x1

where x is obtained by sorting x such that |x1| ge middot middot middot ge |xn| An entry marked asldquoErrorrdquo indicates that the algorithm breaks down due to some internal errors Thecomputation time is in the format of ldquohoursminutessecondsrdquo and the entry ldquo00rdquo inthe column means that the computation time is less than 05 second

One can observe from Table 2 that all the tested first order algorithms exceptADMM ie FPC AS APG and LADMM fail to solve most of the test instances

Table 1Statistics of the UCI test instances

Probname mn λmax(AAlowast)E2006train 16087150348 191e+05

log1pE2006train 160874265669 586e+07

E2006test 330872812 479e+04

log1pE2006test 33081771946 146e+07

pyrim5 74169911 122e+06

triazines4 186557845 207e+07

abalone7 41776435 521e+05

bodyfat7 252116280 529e+04

housing7 50677520 328e+05

mpg7 3923432 128e+04

space ga9 31075005 401e+03

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 451T

able

2T

he

perf

orm

an

ceo

fSsn

al

mfI

PM

F

PC

AS

A

PG

L

AD

MM

a

nd

AD

MM

on

11

sele

cted

regr

essi

on

pro

blem

s(a

ccu

racy

ε=

10minus6)

Inth

eta

ble

a=

Ssn

al

b=

mfI

PM

c

=F

PC

AS

d

=A

PG

e

=L

AD

MM

a

nd

f=

AD

MM

λc

nn

Tim

e

Pro

bn

am

ea|b

|c|d

|e|f

a|b

|c|d

|e|f

mn

E2006t

rain

10minus3

113

-7|3

9-7

|14

-3|1

1-8

|51

-14|9

1-7

01|0

8|1

241

8|0

1|0

4|1

04

1

160871

50348

10minus4

137

-7|1

6-9

|99

-4|1

5-7

|26

-14|7

3-7

01|1

0|1

272

9|0

1|0

5|1

05

3

log1p

E2006t

rain

10minus3

556

-7|8

8-1

|26

-1|1

5-5

|64

-4|9

9-7

19|3

064

8|7

000

1|2

083

5|2

252

5|3

34

7

160874

265669

10minus4

599

44

-7|7

9-1

|17

-1|1

5-4

|63

-3|9

8-7

51|3

055

4|7

000

0|2

124

7|2

295

9|3

52

1

E2006t

est

10minus3

116

-9|4

8-7

|37

-4|5

5-8

|40

-14|6

7-7

00|0

2|1

92

5|0

0|0

1|1

8

33087

2812

10minus4

121

-10|4

7-7

|27

-4|3

7-7

|29

-10|6

3-7

00|0

2|2

02

1|0

0|0

1|2

1

log1p

E2006t

est

10minus3

892

-7|3

6-7

|69

-1|1

2-4

|99

-7|9

9-7

09|1

33

6|4

564

9|3

95

5|1

15

2|3

10

33081

771946

10minus4

1081

22

-7|9

6-7

|87

-1|3

9-4

|99

-7|9

9-7

18|2

422

7|4

482

9|4

23

7|3

20

4|2

28

pyri

m5

10minus3

72

99

-7|9

3-7

|88

-1|9

2-4

|40

-4|3

2-5

05|1

65

8|1

125

5|7

59|9

09|2

11

4

741

69911

10minus4

78

71

-7|3

1-7

|98

-1|3

5-3

|37

-3|1

6-3

09|3

20

2|5

42

0|8

27|9

56|2

13

4

tria

zin

es4

10minus3

519

84

-7|8

8-1

|96

-1|2

3-3

|29

-3|3

2-4

30|3

92

0|7

000

5|1

000

3|1

015

9|2

130

0

1865

57845

10minus4

260

99

-7|8

6-1

|98

-1|1

1-2

|16

-2|1

1-3

13

5|3

70

8|7

470

3|1

065

0|1

052

1|1

551

6

ab

alo

ne7

10minus3

24

57

-7|3

5-7

|Error|5

3-5

|49

-6|9

9-7

02|4

9|E

rror|1

03

8|1

80

4|9

52

41776

435

10minus4

59

37

-7|4

7-7

|Error|3

9-3

|33

-5|9

9-7

03|3

19|E

rror|1

04

3|1

35

8|9

36

bod

yfa

t710minus3

238

-8|4

0-9

|36

-1|8

8-7

|90

-7|9

9-7

02|1

29|1

120

2|4

00|3

08|1

49

2521

16280

10minus4

346

-8|3

4-7

|19

-1|3

3-5

|98

-7|9

9-7

03|2

41|1

130

8|1

21

6|4

19|4

05

hou

sin

g7

10minus3

158

23

-7|8

1-1

|84

-1|2

6-4

|17

-4|9

9-7

04|5

131

9|1

410

1|1

65

2|2

01

8|2

21

2

5067

7520

10minus4

281

77

-7|8

5-1

|42

-1|5

3-3

|59

-4|6

6-6

08|5

000

9|1

393

6|1

70

4|2

05

3|4

23

6

mp

g7

10minus3

47

20

-8|6

5-7

|Error|1

9-6

|99

-7|9

8-7

00|0

4|E

rror|3

8|1

4|0

7

3923

432

10minus4

128

41

-7|8

5-7

|45

-1|1

2-4

|13

-6|9

9-7

00|1

3|4

09|3

9|1

01|1

1

space

ga9

10minus3

14

48

-7|2

6-7

|15

-2|2

5-7

|99

-7|9

9-7

01|1

6|3

05

5|1

46|4

2|3

7

31075

005

10minus4

38

37

-7|3

0-7

|40

-2|1

8-5

|99

-7|9

9-7

01|4

1|3

02

6|6

46|2

20|5

6

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

452 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

to the required accuracy after 20000 iterations or 7 hours In particular FPC ASfails to produce a reasonably accurate solution for all the test instances In fact for3 test instances it breaks down due to some internal errors This poor performanceindicates that these first order methods cannot obtain reasonably accurate solutionswhen dealing with difficult large-scale problems While ADMM can solve most of thetest instances it needs much more time than Ssnal For example for the instancehousing7 with λc = 10minus3 we can see that Ssnal is at least 330 times faster thanADMM In addition Ssnal can solve the instance triazines4 in 30 seconds whileADMM reaches the maximum of 20000 iterations and consumes about 2 hours butonly produces a rather inaccurate solution

On the other hand one can observe that the two second order information basedmethods Ssnal and mfIPM perform quite robustly despite the huge dimensions andthe possibly badly conditioned data sets More specifically Ssnal is able to solve theinstance log1pE2006train with approximately 43 million features in 20 seconds(λc = 10minus3) Among these two algorithms clearly Ssnal is far more efficient thanthe specialized interior-point method mfIPM for all the test instances especially forlarge-scale problems where the factor can be up to 300 times faster While Ssnalcan solve all the instances to the desired accuracy mfIPM fails on 6 instances (withhigh-dimensional features) out of 22 We also note that mfIPM can only reach asolution with the accuracy of 10minus1 when it fails to compute the corresponding Newtondirections These facts indicate that the nonsmooth approach employed by Ssnal isfar superior compared to the interior point method in exploiting the sparsity in thegeneralized Hessian The superior numerical performance of Ssnal indicates that itis a robust high-performance solver for high-dimensional lasso problems

As pointed out by one referee the polynomial expansion in our data-processingstep may affect the scaling of the problems Since first order solvers are not affine in-variant this scaling may affect their performance Hence we normalize each column ofthe matrix A (the matrix representation of A) to have unit norm and correspondinglychange the variables This scaling step also changes the regularization parameter λaccordingly to a nonuniform weight vector Since it is not easy to call FPC AS whenλ is not a scalar based on the recommendation of the referee we use another popularactive-set based solver PSSas4 [44] to replace FPC AS for the testing All the param-eters for PSSas are set to the default values In our tests PSSas will be stopped whenit reaches the maximum number of 20000 iterations or the maximum computationtime of 7 hours We note that by default PSSas will terminate when its progress issmaller than the threshold 10minus9

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM, with the normalization step, in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions based on Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM type of solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy. In fact, PSSas fails on all the test instances when λc = 10^{-4}. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with η = 3.2 × 10^{-3}. (Actually, we also run PSSas on these test instances without scaling

4. https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3. The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy ε = 10^{-6}). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM. [The individual table entries (λc, nnz, η, and times for the 11 problems of Table 1) are not legible in this copy.]


and obtain similar performances. Detailed results are omitted to conserve space.) Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for Sparco collection. In this subsection, the test instances (A, b) are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce a 60dB noise to b (as in [18]) by using the MATLAB command b = awgn(b, 60, 'measured'). For these test instances, the matrix representations of the linear maps A are not available. Hence, ADMM will not be tested in this subsection, as it will be extremely expensive, if not impossible, to compute and factorize I + σAA*.
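As a small illustration of this setup, the following MATLAB fragment adds the noise and forms the regularization parameter when A is available only through function handles; here ATfun is an illustrative name for the black-box adjoint map A*, not part of the Sparco interface.

b = awgn(b, 60, 'measured');              % add 60dB noise to b, as in the experiments
lambda = lambda_c * norm(ATfun(b), inf);  % lambda = lambda_c * ||A^* b||_inf, as in section 4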

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of λc, i.e., λc = 10^{-0.8}, 10^{-1}, 10^{-1.2}, 10^{-1.5}, 10^{-2}. As one can observe, as λc decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions (m, n) and the largest eigenvalue of AA* (λmax(AA*)). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, as well as the computation times (in the format of hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well for these test instances with small Lipschitz constants λmax(AA*). As such, Ssnal does not have as clear an advantage here as it does in Table 2. Moreover, since the matrix representations of the linear maps A and A* involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our Algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases, with λc = 10^{-3} and 10^{-4}. We can observe that FPC_AS

Table 4. The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy ε = 10^{-6}, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM. Times are in hours:minutes:seconds; η entries use the shorthand x.y-k for x.y × 10^{-k}.

λc | nnz | Iteration (a | b | c | d | e) | η (a | b | c | d | e) | Time (a | b | c | d | e)

srcsep1: m = 29166, n = 57344, λmax(AA*) = 3.56
0.16 | 380 | 14 | 28 | 42 | 401 | 89 | 6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7 | 07 | 09 | 05 | 15 | 07
0.10 | 726 | 16 | 38 | 42 | 574 | 161 | 4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7 | 11 | 15 | 05 | 22 | 13
0.06 | 1402 | 19 | 41 | 63 | 801 | 393 | 1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7 | 18 | 18 | 10 | 30 | 32
0.03 | 2899 | 19 | 56 | 110 | 901 | 337 | 2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7 | 28 | 53 | 16 | 33 | 28
0.01 | 6538 | 17 | 88 | 223 | 1401 | 542 | 7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7 | 121 | 215 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, λmax(AA*) = 4.95
0.16 | 423 | 15 | 29 | 42 | 501 | 127 | 3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7 | 14 | 13 | 07 | 25 | 15
0.10 | 805 | 16 | 37 | 84 | 601 | 212 | 8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7 | 21 | 19 | 16 | 30 | 26
0.06 | 1549 | 19 | 40 | 84 | 901 | 419 | 1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7 | 32 | 28 | 18 | 44 | 50
0.03 | 3254 | 20 | 69 | 128 | 901 | 488 | 1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7 | 106 | 133 | 26 | 44 | 59
0.01 | 7400 | 21 | 94 | 259 | 2201 | 837 | 8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7 | 142 | 533 | 59 | 205 | 143


Table 5. The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy ε = 10^{-6}, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM. Times are in hours:minutes:seconds; η entries use the shorthand x.y-k for x.y × 10^{-k}.

probname (m, n) | λc | nnz | η (a | b | c | d | e) | Time (a | b | c | d | e)

blknheavi (1024, 1024)
  10^{-3} | 12 | 5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7 | 01 | 01 | 55 | 08 | 06
  10^{-4} | 12 | 9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7 | 01 | 01 | 49 | 07 | 07
srcsep1 (29166, 57344)
  10^{-3} | 14066 | 1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7 | 541 | 4234 | 1325 | 156 | 416
  10^{-4} | 19306 | 9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7 | 928 | 33108 | 3228 | 250 | 1306
srcsep2 (29166, 86016)
  10^{-3} | 16502 | 3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7 | 951 | 10110 | 1627 | 257 | 849
  10^{-4} | 22315 | 7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7 | 1914 | 64021 | 20106 | 456 | 1601
srcsep3 (196608, 196608)
  10^{-3} | 27314 | 6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7 | 33 | 624 | 851 | 47 | 49
  10^{-4} | 83785 | 9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7 | 203 | 14259 | 340 | 115 | 306
soccer1 (3200, 4096)
  10^{-3} | 4 | 1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7 | 01 | 03 | 1351 | 235 | 02
  10^{-4} | 8 | 8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7 | 01 | 02 | 1323 | 307 | 02
soccer2 (3200, 4096)
  10^{-3} | 4 | 3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7 | 00 | 03 | 1346 | 140 | 02
  10^{-4} | 8 | 2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7 | 01 | 03 | 1327 | 307 | 02
blurrycam (65536, 65536)
  10^{-3} | 1694 | 1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7 | 03 | 09 | 03 | 02 | 07
  10^{-4} | 5630 | 1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7 | 05 | 135 | 08 | 03 | 29
blurspike (16384, 16384)
  10^{-3} | 1954 | 3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7 | 03 | 05 | 638 | 03 | 27
  10^{-4} | 11698 | 3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7 | 10 | 08 | 643 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when λc = 10^{-3} and 10^{-4}, respectively. For a few cases, FPC_AS can only achieve a poor accuracy (10^{-1}). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of λmax(AA*). For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi (m = n = 1024), whose corresponding λmax(AA*) = 709. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases, LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to λmax(AA*), one can safely conclude that the first order algorithms can only be used to solve problems with relatively small λmax(AA*). Moreover, in order to obtain efficient and robust algorithms for lasso problems, or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotically superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving ℓ1-regularized least squares (LS) problems. With the intelligent


incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale ℓ1-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and the anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.

[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.

[3] S. Becker, J. Bobin, and E. J. Candes, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.

[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.

[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrman, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.

[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435–467.

[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.

[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intel. Syst. Tech., 2 (2011), 27.

[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.

[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.

[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.

[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.

[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.

[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.

[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293–314.

[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.

[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.

[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.

[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.

[20] R. Glowinski and A. Marroco, Sur l'approximation par elements finis d'ordre un et la resolution par penalisation-dualite d'une classe de problemes de Dirichlet non lineaires, Rev. Francaise d'Automatique Inform. Rech. Operationelle, 9 (1975), pp. 41–76.

[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.

[23] L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.

[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.

[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.

[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wachter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Software, 31 (2016), pp. 605–621.

[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C^{1,1} optimization problems, Math. Program., 88 (2000), pp. 169–180.

[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting Risk from Financial Reports with Regression, NAACL-HLT 2009, Boulder, CO, 2009.

[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.

[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.

[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A Two-Phase Proximal Augmented Lagrangian Method for Convex Quadratic Semidefinite Programming, preprint, arXiv:1512.08872, 2015.

[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.

[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.

[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.

[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.

[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.

[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372–376.

[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.

[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.

[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.

[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.

[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.

[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.

[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.

[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.

[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.

[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.

[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.

[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.

[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.

[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.

[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact Regularized Proximal Newton Method: Provable Convergence Guarantees for Non-Smooth Convex Minimization without Strong Convexity, preprint, arXiv:1605.07522, 2016.

[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.

[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.

[56] H. Zhou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.

[57] H. Zhou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.

[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.


That is, the stopping criteria (A), (B1), and (B2) will be satisfied as long as ∇ψ_k(y^{k+1}) is sufficiently small.

3.3. An efficient implementation of SSN for solving subproblems (18). When Algorithm Ssnal is applied to solve the general convex composite optimization model (P), the key part is to use Algorithm Ssn to solve the subproblems (18). In this subsection, we shall discuss an efficient implementation of Ssn for solving the aforementioned subproblems when the nonsmooth regularizer p is chosen to be λ‖·‖₁ for some λ > 0. Clearly, in Algorithm Ssn, the most important step is the computation of the search direction d^j from the linear system (24). So we shall first discuss the solving of this linear system.

Let (x, y) ∈ ℜ^n × ℜ^m and σ > 0 be given. We consider the following Newton linear system:

(25)  (H + σAUA^T) d = −∇ψ(y),

where H ∈ ∂(∇h*)(y), A denotes the matrix representation of A with respect to the standard bases of ℜ^n and ℜ^m, and U ∈ ∂Prox_{σλ‖·‖₁}(x̃) with x̃ = x − σ(A^T y − c). Since H is a symmetric and positive definite matrix, (25) can be equivalently rewritten as

(I_m + σ(L^{-1}A) U (L^{-1}A)^T)(L^T d) = −L^{-1}∇ψ(y),

where L is a nonsingular matrix obtained from the (sparse) Cholesky decomposition of H such that H = LL^T. In many applications, H is usually a sparse matrix. Indeed, when the function h in the primal objective is taken to be the squared loss or the logistic loss function, the resulting matrices H are in fact diagonal matrices. That is, the costs of computing L and its inverse are negligible in most situations. Therefore, without loss of generality, we can consider a simplified version of (25) as follows:

(26)  (I_m + σAUA^T) d = −∇ψ(y),

which is precisely the Newton system associated with the standard lasso problem (1). Since U ∈ ℜ^{n×n} is a diagonal matrix, at first glance the costs of computing AUA^T and the matrix-vector multiplication AUA^T d for a given vector d ∈ ℜ^m are O(m²n) and O(mn), respectively. These computational costs are too expensive when the dimensions of A are large and can make the commonly employed approaches, such as the Cholesky factorization and the conjugate gradient method, inappropriate for solving (26). Fortunately, under the sparse optimization setting, if the sparsity of U is wisely taken into consideration, one can substantially reduce these unfavorable computational costs to a level such that they are negligible or at least insignificant compared to other costs. Next, we shall show how this can be done by taking full advantage of the sparsity of U. This sparsity will be referred to as the second order sparsity of the underlying problem.
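As a minimal MATLAB sketch of this reduction of (25) to the form (26), assuming H is a sparse symmetric positive definite matrix, u is the 0-1 column vector on the diagonal of U (defined below), and gradpsi holds ∇ψ(y) (all names are illustrative, not the authors' code):

m    = size(A, 1);
L    = chol(H, 'lower');                         % sparse Cholesky factor, H = L*L'
Atil = L \ A;                                    % L^{-1} A
rhs  = -(L \ gradpsi);                           % -L^{-1} grad psi(y)
M    = speye(m) + sigma * (Atil .* u') * Atil';  % I_m + sigma (L^{-1}A) U (L^{-1}A)^T
dtil = M \ rhs;                                  % solve for L^T d
d    = L' \ dtil;                                % recover the Newton direction d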

For x̃ = x − σ(A^T y − c), in our computations we can always choose U = Diag(u), the diagonal matrix whose ith diagonal element is given by u_i with

u_i = 0 if |x̃_i| ≤ σλ, and u_i = 1 otherwise, i = 1, ..., n.

Since Prox_{σλ‖·‖₁}(x̃) = sign(x̃) ∘ max{|x̃| − σλ, 0}, it is not difficult to see that U ∈ ∂Prox_{σλ‖·‖₁}(x̃). Let J := {j | |x̃_j| > σλ, j = 1, ..., n} and r = |J|, the cardinality of J. By taking the special 0-1 structure of U into consideration, we have that


Fig. 1. Reducing the computational costs from O(m²n) to O(m²r): the m×n by n×m product AUA^T is replaced by the m×r by r×m product A_J A_J^T.

(27)  AUA^T = (AU)(AU)^T = A_J A_J^T,

where A_J ∈ ℜ^{m×r} is the submatrix of A with those columns not in J being removed from A. Then, by using (27), we know that the costs of computing AUA^T and AUA^T d for a given vector d are now reduced to O(m²r) and O(mr), respectively. Due to the sparsity-promoting property of the regularizer p, the number r is usually much smaller than n. Thus, by exploiting the aforementioned second order sparsity, we can greatly reduce the computational costs in solving the linear system (26) when using the Cholesky factorization. More specifically, the total computational costs of using the Cholesky factorization to solve the linear system are reduced from O(m²(m+n)) to O(m²(m+r)). See Figure 1 for an illustration of the reduction. This means that even if n happens to be extremely large (say, larger than 10^7), one can still solve the Newton linear system (26) efficiently via the Cholesky factorization as long as both m and r are moderate (say, less than 10^4). If, in addition, r ≪ m, which is often the case when m is large and the optimal solutions to the underlying problem are sparse, then instead of factorizing an m×m matrix, we can make use of the Sherman–Morrison–Woodbury formula [22] to get the inverse of I_m + σAUA^T by inverting a much smaller r×r matrix as follows:

(I_m + σAUA^T)^{-1} = (I_m + σA_J A_J^T)^{-1} = I_m − A_J(σ^{-1}I_r + A_J^T A_J)^{-1}A_J^T.

See Figure 2 for an illustration of the computation of A_J^T A_J. In this case, the total computational costs for solving the Newton linear system (26) are reduced significantly further, from O(m²(m + r)) to O(r²(m + r)). We should emphasize here that this dramatic reduction in computational costs results from the wise combination of a careful examination of the existing second order sparsity in lasso-type problems and some "smart" numerical linear algebra.
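For concreteness, the following is a minimal MATLAB sketch of one way to solve (26) along the lines described above, assuming A is stored explicitly, xtilde = x − σ(A^T y − c) has been formed, and rhs = −∇ψ(y); the names are illustrative and this is only a sketch of the technique, not the solver used in the experiments.

m  = size(A, 1);
J  = find(abs(xtilde) > sigma * lambda);   % indices j with u_j = 1
r  = numel(J);
AJ = A(:, J);                              % drop the columns of A outside J
if r <= m
    % Sherman-Morrison-Woodbury: invert an r x r matrix instead of an m x m one
    W = (1/sigma) * eye(r) + AJ' * AJ;     % sigma^{-1} I_r + A_J^T A_J
    d = rhs - AJ * (W \ (AJ' * rhs));      % d = (I_m + sigma A_J A_J^T)^{-1} rhs
else
    M = eye(m) + sigma * (AJ * AJ');       % fall back to the m x m system
    d = M \ rhs;
end

When r ≪ m, as is typical once the iterates become sparse, the r×r solve dominates the work, in line with the O(r²(m + r)) estimate above.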

From the above arguments, we can see that as long as the number of the nonzero components of Prox_{σλ‖·‖₁}(x̃) is small, say less than √n, and H_j ∈ ∂(∇h*)(y^j) is a sparse matrix, e.g., a diagonal matrix, we can always solve the linear system (24) at very low costs. In particular, this is true for the lasso problems admitting sparse


Fig. 2. Further reducing the computational costs to O(r²m): only the r×m by m×r product A_J^T A_J needs to be formed.

solutions. Similar discussions on the reduction of the computational costs can also be conducted for the case when the conjugate gradient method is applied to solve the linear systems (24). Note that one may argue that, even if the original problem has only sparse solutions, at certain stages one may still encounter the situation that the number of the nonzero components of Prox_{σλ‖·‖₁}(x̃) is large. Our answer to this question is simple. First, this phenomenon rarely occurs in practice, since we always start with a sparse feasible point, e.g., the zero vector. Second, even if at certain steps this phenomenon does occur, we just need to apply a small number of conjugate gradient iterations to the linear system (24), as in this case the parameter σ is normally small and the current point is far away from any sparse optimal solution. In summary, we have demonstrated how Algorithm Ssn can be implemented efficiently for solving sparse optimization problems of the form (18) with p(·) being chosen to be λ‖·‖₁.
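When A and A* are available only as black-box operators (as for the Sparco instances in subsection 4.2), the same 0-1 structure of U can still be used inside an iterative solve; a minimal MATLAB sketch with pcg, where Afun and ATfun are illustrative handles for A and A*, and gradpsi holds ∇ψ(y):

mask = abs(xtilde) > sigma * lambda;                   % 0-1 diagonal of U, as defined above
newton_op = @(v) v + sigma * Afun(mask .* ATfun(v));   % v -> (I_m + sigma A U A^*) v
rhs = -gradpsi;                                        % right-hand side of (26)
[d, flag] = pcg(newton_op, rhs, 1e-8, 100);            % a small number of CG steps often suffices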

4. Numerical experiments for lasso problems. In this section, we shall evaluate the performance of our algorithm Ssnal for solving large-scale lasso problems (1). We note that the relative performance of most of the existing algorithms mentioned in the introduction has recently been well documented in the two recent papers [18, 36], which appear to suggest that, for some large-scale sparse reconstruction problems, mfIPM¹ and FPC_AS² have mostly outperformed the other solvers. Hence, in this section, we will compare our algorithm with these two popular solvers. Note that mfIPM is a specialized interior-point based second order method designed for the lasso problem (1), whereas FPC_AS is a first order method based on forward-backward operator splitting. Moreover, we also report the numerical performance of two commonly used algorithms for solving lasso problems: the accelerated proximal gradient (APG) algorithm [37, 2], as implemented by Liu, Ji, and Ye in SLEP³ [32], and the alternating direction method of multipliers (ADMM) [19, 20]. For the purpose of comparison, we also test the linearized ADMM (LADMM) [55]. We have implemented both ADMM and LADMM in MATLAB, with the step length set to be 1.618. Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems, as one will see later, they lack the ability to efficiently solve difficult problems such as the large-scale regression problems where the data A is badly conditioned.

For testing purposes, the regularization parameter λ in the lasso problem (1) is chosen as

λ = λc ‖A*b‖∞,

where 0 < λc < 1. In our numerical experiments, we measure the accuracy of an approximate optimal solution x for (1) by using the following relative KKT residual:

η = ‖x − prox_{λ‖·‖₁}(x − A*(Ax − b))‖ / (1 + ‖x‖ + ‖Ax − b‖).
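A minimal MATLAB sketch of this residual for a candidate solution x, using the soft-thresholding form of prox_{λ‖·‖₁} (names are illustrative and A is assumed to be stored as a matrix):

soft = @(z, t) sign(z) .* max(abs(z) - t, 0);   % prox of t*||.||_1 (soft thresholding)
res  = A * x - b;
px   = soft(x - A' * res, lambda);              % prox_{lambda||.||_1}(x - A^*(Ax - b))
eta  = norm(x - px) / (1 + norm(x) + norm(res));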

1. http://www.maths.ed.ac.uk/ERGO/mfipmcs/
2. http://www.caam.rice.edu/~optimization/L1/FPC_AS/
3. http://yelab.net/software/SLEP/


For a given tolerance ε > 0, we will stop the tested algorithms when η < ε. For all the tests in this section, we set ε = 10^{-6}. The algorithms will also be stopped when they reach the maximum number of iterations (1000 iterations for our algorithm and mfIPM, and 20000 iterations for FPC_AS, APG, ADMM, and LADMM) or the maximum computation time of 7 hours. All the parameters for mfIPM, FPC_AS, and APG are set to the default values. All our computational results are obtained by running MATLAB (version 8.4) on a Windows workstation (16-core Intel Xeon E5-2650, 2.60GHz, 64 GB RAM).

4.1. Numerical results for large-scale regression problems. In this subsection, we test all the algorithms with the test instances (A, b) obtained from large-scale regression problems in the LIBSVM data sets [8]. These data sets are collected from the 10-K Corpus [28] and the UCI data repository [30]. For computational efficiency, zero columns in A (the matrix representation of A) are removed. As suggested in [23], for the data sets pyrim, triazines, abalone, bodyfat, housing, mpg, and space_ga, we expand their original features by using polynomial basis functions over those features. For example, the last digit in pyrim5 indicates that an order 5 polynomial is used to generate the basis functions. This naming convention is also used in the rest of the expanded data sets. These test instances, shown in Table 1, can be quite difficult in terms of the problem dimensions and the largest eigenvalue of AA*, which is denoted as λmax(AA*).
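As a rough sanity check on the expanded dimensions in Table 1, if the order-k expansion of d original features is taken to consist of all monomials of degree at most k, its size is the binomial coefficient C(d+k, k); the following MATLAB lines reproduce several of the listed values of n, with d the feature count of the unexpanded LIBSVM data set (the smaller counts for pyrim5 and triazines4 are explained by the removal of zero columns mentioned above):

nchoosek(13 + 7, 7)   % housing7:   77520 features
nchoosek( 8 + 7, 7)   % abalone7:    6435 features
nchoosek(14 + 7, 7)   % bodyfat7:  116280 features
nchoosek( 7 + 7, 7)   % mpg7:        3432 features
nchoosek( 6 + 9, 9)   % space_ga9:   5005 features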

Table 2 reports the detailed numerical results for Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM in solving large-scale regression problems. In the table, m denotes the number of samples, n denotes the number of features, and "nnz" denotes the number of nonzeros in the solution x obtained by Ssnal using the following estimation:

nnz = min{ k | Σ_{i=1}^{k} |x̂_i| ≥ 0.999 ‖x‖₁ },

where x̂ is obtained by sorting x such that |x̂_1| ≥ ··· ≥ |x̂_n|. An entry marked as "Error" indicates that the algorithm breaks down due to some internal errors. The computation time is in the format of "hours:minutes:seconds", and the entry "00" in the column means that the computation time is less than 0.5 second.
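A minimal MATLAB sketch of this estimate for a computed solution x (illustrative and self-contained):

xs = sort(abs(x), 'descend');
nnz_est = find(cumsum(xs) >= 0.999 * norm(x, 1), 1, 'first');
if isempty(nnz_est), nnz_est = 0; end           % degenerate case x = 0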

One can observe from Table 2 that all the tested first order algorithms exceptADMM ie FPC AS APG and LADMM fail to solve most of the test instances

Table 1Statistics of the UCI test instances

Probname mn λmax(AAlowast)E2006train 16087150348 191e+05

log1pE2006train 160874265669 586e+07

E2006test 330872812 479e+04

log1pE2006test 33081771946 146e+07

pyrim5 74169911 122e+06

triazines4 186557845 207e+07

abalone7 41776435 521e+05

bodyfat7 252116280 529e+04

housing7 50677520 328e+05

mpg7 3923432 128e+04

space ga9 31075005 401e+03


Table 2. The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy ε = 10^{-6}). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM. [The individual table entries (λc, nnz, η, and times for the 11 problems of Table 1) are not legible in this copy.]

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

452 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

to the required accuracy after 20000 iterations or 7 hours In particular FPC ASfails to produce a reasonably accurate solution for all the test instances In fact for3 test instances it breaks down due to some internal errors This poor performanceindicates that these first order methods cannot obtain reasonably accurate solutionswhen dealing with difficult large-scale problems While ADMM can solve most of thetest instances it needs much more time than Ssnal For example for the instancehousing7 with λc = 10minus3 we can see that Ssnal is at least 330 times faster thanADMM In addition Ssnal can solve the instance triazines4 in 30 seconds whileADMM reaches the maximum of 20000 iterations and consumes about 2 hours butonly produces a rather inaccurate solution

On the other hand one can observe that the two second order information basedmethods Ssnal and mfIPM perform quite robustly despite the huge dimensions andthe possibly badly conditioned data sets More specifically Ssnal is able to solve theinstance log1pE2006train with approximately 43 million features in 20 seconds(λc = 10minus3) Among these two algorithms clearly Ssnal is far more efficient thanthe specialized interior-point method mfIPM for all the test instances especially forlarge-scale problems where the factor can be up to 300 times faster While Ssnalcan solve all the instances to the desired accuracy mfIPM fails on 6 instances (withhigh-dimensional features) out of 22 We also note that mfIPM can only reach asolution with the accuracy of 10minus1 when it fails to compute the corresponding Newtondirections These facts indicate that the nonsmooth approach employed by Ssnal isfar superior compared to the interior point method in exploiting the sparsity in thegeneralized Hessian The superior numerical performance of Ssnal indicates that itis a robust high-performance solver for high-dimensional lasso problems

As pointed out by one referee the polynomial expansion in our data-processingstep may affect the scaling of the problems Since first order solvers are not affine in-variant this scaling may affect their performance Hence we normalize each column ofthe matrix A (the matrix representation of A) to have unit norm and correspondinglychange the variables This scaling step also changes the regularization parameter λaccordingly to a nonuniform weight vector Since it is not easy to call FPC AS whenλ is not a scalar based on the recommendation of the referee we use another popularactive-set based solver PSSas4 [44] to replace FPC AS for the testing All the param-eters for PSSas are set to the default values In our tests PSSas will be stopped whenit reaches the maximum number of 20000 iterations or the maximum computationtime of 7 hours We note that by default PSSas will terminate when its progress issmaller than the threshold 10minus9

The detailed numerical results for Ssnal mfIPM PSSas APG LADMM andADMM with the normalization step in solving the large-scale regression problems arelisted in Table 3 From Table 3 one can easily observe that the simple normalizationtechnique does not change the conclusions based on Table 2 The performance ofSsnal is generally invariant with respect to the scaling and Ssnal is still much fasterand more robust than other solvers Meanwhile after the normalization mfIPMcan solve 2 more instances to the required accuracy On the other hand APG andthe ADMM type of solvers (ie LADMM and ADMM) perform worse than the un-scaled case Furthermore PSSas can only solve 6 out of 22 instances to the requiredaccuracy In fact PSSas fails on all the test instances when λc = 10minus4 For the in-stance triazines4 it consumes about 6 hours but only generates a poor solution withη = 32times 10minus3 (Actually we also run PSSas on these test instances without scaling

4httpswwwcsubccaschmidtmSoftwarethesishtml

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 453T

able

3T

he

perf

orm

an

ceo

fSsn

al

mfI

PM

P

SS

as

AP

G

LA

DM

M

an

dA

DM

Mo

n11

sele

cted

regr

essi

on

pro

blem

sw

ith

scali

ng

(acc

ura

cyε

=10minus6)

Inth

eta

ble

a=

Ssn

al

b=

mfI

PM

c1

=P

SS

as

d=

AP

G

e=

LA

DM

M

an

df

=A

DM

M

λc

nn

Tim

e

Pro

bn

am

ea|b

|c1|d

|e|f

a|b

|c1|d

|e|f

mn

E2006t

rain

10minus3

113

-7|3

9-7

|46

-12|1

8-7

|14

-4|9

9-7

01|1

31|0

1|6

32|2

11

2|1

184

1

160871

50348

10minus4

136

-9|5

5-7

|11

-3|9

1-7

|12

-4|2

9-4

02|2

27|0

5|1

20

1|1

94

5|3

194

8

log1p

E2006t

rain

10minus3

526

-7|3

6-7

|11

-6|5

5-3

|12

-3|3

3-6

32|5

94

6|1

343

6|2

332

7|3

040

2|7

000

2

160874

265669

10minus4

599

90

-7|2

3-7

|33

-5|7

9-3

|54

-3|3

4-4

22

3|2

422

1|6

063

9|2

482

9|2

582

7|7

000

1

E2006t

est

10minus3

116

-7|3

1-7

|65

-10|4

4-7

|29

-3|4

5-2

00|3

2|0

0|0

1|4

50|1

35

3

33087

2812

10minus4

142

-7|7

7-7

|18

-3|7

6-7

|37

-2|9

2-1

00|1

70

2|0

1|3

28|4

29|1

31

8

log1p

E2006t

est

10minus3

814

-7|5

1-7

|48

-7|9

4-4

|24

-5|1

3-5

13|1

12

9|5

94

2|4

83

5|1

083

3|1

515

9

33081

771946

10minus4

1081

33

-7|8

8-7

|10

-5|3

9-3

|10

-3|4

7-4

10

2|2

70

8|3

091

6|5

15

8|1

010

8|1

360

8

pyri

m5

10minus3

70

35

-7|8

1-7

|41

-3|2

6-3

|10

-3|1

2-4

05|1

04

2|1

24

3|7

57|1

22

6|2

55

0

741

69911

10minus4

78

53

-7|7

5-7

|74

-2|6

1-3

|58

-3|3

2-3

06|5

12

4|3

01

3|8

41|1

25

5|2

00

1

tria

zin

es4

10minus3

567

89

-7|7

8-1

|28

-4|2

3-3

|13

-3|1

3-4

28|4

35

7|1

573

8|1

002

1|1

071

6|2

230

8

1865

57845

10minus4

261

99

-7|8

6-1

|32

-3|1

3-2

|26

-2|2

1-2

12

3|4

23

1|6

031

2|1

090

3|1

093

2|2

151

2

ab

alo

ne7

10minus3

24

84

-7|1

6-7

|15

-7|2

0-3

|15

-4|1

8-6

03|2

22|2

12|1

32

3|1

32

0|4

34

8

41776

435

10minus4

59

37

-7|9

2-7

|18

-1|2

4-2

|58

-2|9

9-7

04|1

03

7|1

31

2|1

40

3|1

25

5|1

61

5

bod

yfa

t710minus3

212

-8|1

1-7

|21

-15|1

8-2

|20

-1|9

9-7

02|2

01|2

35|1

40

5|1

65

2|1

23

2

2521

16280

10minus4

370

-8|1

2-1

|19

-5|3

5-2

|35

-1|9

9-7

03|9

33|1

42

6|1

42

9|1

62

9|2

71

1

hou

sin

g7

10minus3

158

88

-7|8

5-7

|88

-7|2

2-4

|55

-5|9

9-7

03|6

02|8

42|1

82

6|2

24

4|2

80

5

5067

7520

10minus4

281

80

-7|5

9-1

|37

-5|6

0-3

|30

-3|1

2-5

05|4

500

4|2

91

6|2

11

9|2

15

5|5

32

3

mp

g7

10minus3

47

52

-7|7

3-7

|33

-6|2

7-4

|99

-7|9

9-7

00|0

7|1

6|4

0|4

8|1

2

3923

432

10minus4

128

62

-7|3

2-7

|26

-6|1

5-3

|30

-4|9

9-7

00|2

0|5

7|4

1|4

8|1

0

space

ga9

10minus3

14

43

-7|7

9-7

|17

-5|4

5-5

|11

-4|9

9-7

01|3

9|0

4|7

30|7

37|3

44

31075

005

10minus4

38

16

-7|8

7-7

|31

-5|1

7-3

|86

-4|9

9-7

01|1

31|1

16|7

34|7

03|7

59

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

454 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

and obtain similar performances Detailed results are omitted to conserve space)Therefore we can safely conclude that the simple normalization technique employedhere may not be suitable for general first order methods

42 Numerical results for Sparco collection In this subsection the testinstances (A b) are taken from 8 real-valued sparse reconstruction problems in theSparco collection [5] For testing purposes we introduce a 60dB noise to b (as in[18]) by using the MATLAB command b = awgn(b60rsquomeasuredrsquo) For thesetest instances the matrix representations of the linear maps A are not availableHence ADMM will not be tested in this subsection as it will be extremely expensiveif not impossible to compute and factorize I + σAAlowast

In Table 4 we report the detailed computational results obtained by SsnalmfIPM FPC AS APG and LADMM in solving two large-scale instances srcsep1and srcsep2 in the Sparco collection Here we test five choices of λc ie λc =10minus08 10minus1 10minus12 10minus15 10minus2 As one can observe as λc decreases the numberof nonzeros (nnz) in the computed solution increases In the table we list somestatistics of the test instances including the problem dimensions (mn) and the largesteigenvalue ofAAlowast (λmax(AAlowast)) For all the tested algorithms we present the iterationcounts the relative KKT residuals as well as the computation times (in the formatof hoursminutesseconds) One can observe from Table 4 that all the algorithmsperform very well for these test instances with small Lipschitz constants λmax(AAlowast)As such Ssnal does not have a clear advantage as shown in Table 2 Moreoversince the matrix representations of the linear maps A and Alowast involved are not storedexplicitly (ie these linear maps can only be regarded as black-box functions) thesecond order sparsity can hardly be fully exploited Nevertheless our AlgorithmSsnal is generally faster than mfIPM APG and LADMM while it is comparablewith the fastest algorithm FPC AS

In Table 5 we report the numerical results obtained by Ssnal mfIPM FPC ASAPG and LADMM in solving various instances of the lasso problem (1) For simplic-ity we only test two cases with λc = 10minus3 and 10minus4 We can observe that FPC AS

Table 4The performance of Ssnal mfIPM FPC AS APG and LADMM on srcsep1 and srcsep2

(accuracy ε = 10minus6 noise 60dB) In the table a = Ssnal b = mfIPM c = FPC AS d = APGand e = LADMM

Iteration η Time

λc nnz a | b | c | d | e a | b | c | d | e a | b | c | d | e

srcsep1 m = 29166 n = 57344 λmax(AAlowast) = 356

016 380 14 | 28 | 42 | 401 | 89 64-7| 90-7 | 27-8 | 53-7 | 95-7 07 | 09 | 05 | 15 | 07

010 726 16 | 38 | 42 | 574 | 161 47-7| 59-7 | 21-8 | 28-7 | 96-7 11 | 15 | 05 | 22 | 13

006 1402 19 | 41 | 63 | 801 | 393 15-7| 15-7 | 54-8 | 63-7 | 99-7 18 | 18 | 10 | 30 | 32

003 2899 19 | 56 | 110 | 901 | 337 27-7| 94-7 | 73-8 | 93-7 | 99-7 28 | 53 | 16 | 33 | 28

001 6538 17 | 88 | 223 | 1401 | 542 71-7| 57-7 | 11-7 | 99-7 | 99-7 121 | 215 | 34 | 53 | 45

srcsep2 m = 29166 n = 86016 λmax(AAlowast) = 495

016 423 15 | 29 | 42 | 501 | 127 32-7| 21-7 | 19-8 | 49-7 | 94-7 14 | 13 | 07 | 25 | 15

010 805 16 | 37 | 84 | 601 | 212 88-7| 92-7 | 26-8 | 96-7 | 97-7 21 | 19 | 16 | 30 | 26

006 1549 19 | 40 | 84 | 901 | 419 14-7| 31-7 | 54-8 | 47-7 | 99-7 32 | 28 | 18 | 44 | 50

003 3254 20 | 69 | 128 | 901 | 488 13-7| 66-7 | 89-7 | 94-7 | 99-7 106 | 133 | 26 | 44 | 59

001 7400 21 | 94 | 259 | 2201 | 837 88-7| 40-7 | 99-7 | 88-7 | 93-7 142 | 533 | 59 | 205 | 143

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 455

Table 5The performance of Ssnal mfIPM FPC AS APG and LADMM on 8 selected sparco problems

(accuracy ε = 10minus6 noise 60dB) In the table a = Ssnal b = mfIPM c = FPC AS d = APGand e = LADMM

λc nnz η Time

probname a | b | c | d | e a | b | c | d | e

mn

blknheavi 10minus3 12 57-7 | 92-7 | 13-1 | 20-6 | 99-7 01 | 01 | 55 | 08 | 06

10241024 10minus4 12 92-8 | 87-7 | 46-3 | 85-5 | 99-7 01 | 01 | 49 | 07 | 07

srcsep1 10minus3 14066 16-7 | 73-7 | 97-7 | 87-7 | 97-7 541 | 4234 | 1325 | 156 | 416

2916657344 10minus4 19306 98-7 | 95-7 | 99-7 | 99-7 | 95-7 928 | 33108 | 3228 | 250 | 1306

srcsep2 10minus3 16502 39-7 | 68-7 | 99-7 | 97-7 | 98-7 951 | 10110 | 1627 | 257 | 849

2916686016 10minus4 22315 79-7 | 95-7 | 10-3 | 96-7 | 95-7 1914 | 64021 | 20106 | 456 | 1601

srcsep3 10minus3 27314 61-7 | 96-7 | 99-7 | 96-7 | 99-7 33 | 624 | 851 | 47 | 49

196608196608 10minus4 83785 97-7 | 99-7 | 99-7 | 99-7 | 99-7 203 | 14259 | 340 | 115 | 306

soccer1 10minus3 4 18-7 | 63-7 | 52-1 | 84-7 | 99-7 01 | 03 | 1351 | 235 | 02

32004096 10minus4 8 87-7 | 43-7 | 52-1 | 33-6 | 96-7 01 | 02 | 1323 | 307 | 02

soccer2 10minus3 4 34-7 | 63-7 | 50-1 | 82-7 | 99-7 00 | 03 | 1346 | 140 | 02

32004096 10minus4 8 21-7 | 14-7 | 68-1 | 18-6 | 91-7 01 | 03 | 1327 | 307 | 02

blurrycam 10minus3 1694 19-7 | 65-7 | 36-8 | 41-7 | 94-7 03 | 09 | 03 | 02 | 07

6553665536 10minus4 5630 10-7 | 97-7 | 13-7 | 97-7 | 99-7 05 | 135 | 08 | 03 | 29

blurspike 10minus3 1954 31-7 | 95-7 | 74-4 | 99-7 | 99-7 03 | 05 | 638 | 03 | 27

1638416384 10minus4 11698 35-7 | 74-7 | 83-5 | 98-7 | 99-7 10 | 08 | 643 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accu-racy However it is not robust in that it fails to solve 4 out of 8 and 5 out of 8problems when λc = 10minus3 and 10minus4 respectively For a few cases FPC AS canonly achieve a poor accuracy (10minus1) The same nonrobustness also appears in theperformance of APG This nonrobustness is in fact closely related to the value ofλmax(AAlowast) For example both FPC AS and APG fail to solve a rather small prob-lem blknheavi (m = n = 1024) whose corresponding λmax(AAlowast) = 709 On theother hand LADMM Ssnal and mfIPM can solve all the test instances success-fully Nevertheless in some cases LADMM requires much more time than SsnalOne can also observe that for large-scale problems Ssnal outperforms mfIPM by alarge margin (sometimes up to a factor of 100) This also demonstrates the powerof Ssn based augmented Lagrangian methods over interior-point methods in solv-ing large-scale problems Due to the high sensitivity of the first order algorithms toλmax(AAlowast) one can safely conclude that the first order algorithms can only be used tosolve problems with relatively small λmax(AAlowast) Moreover in order to obtain efficientand robust algorithms for lasso problems or more general convex composite optimiza-tion problems it is necessary to carefully exploit the second order information in thealgorithmic design

5 Conclusion In this paper we have proposed an inexact augmented La-grangian method of an asymptotic superlinear convergence rate for solving the large-scale convex composite optimization problems of the form (P) It is particularly wellsuited for solving `1-regularized least squares (LS) problems With the intelligent

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

456 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

incorporation of the semismooth Newton method our Algorithm Ssnal is able to fullyexploit the second order sparsity of the problems Numerical results have convinc-ingly demonstrated the superior efficiency and robustness of our algorithm in solvinglarge-scale `1-regularized LS problems Based on extensive numerical evidence wefirmly believe that our algorithmic framework can be adapted to design robust andefficient solvers for various large-scale convex composite optimization problems

Acknowledgments The authors would like to thank the Associate Editor andanonymous referees for their helpful suggestions The authors would also like tothank Dr Ying Cui and Dr Chao Ding for numerous discussions on the error boundconditions and metric subregularity

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candes, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrman, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intel. Syst. Tech., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sele. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation par éléments finis d'ordre un et la résolution par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.

[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wächter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Software, 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C^{1,1} optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting Risk from Financial Reports with Regression, NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A Two-Phase Proximal Augmented Lagrangian Method for Convex Quadratic Semidefinite Programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.

[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact Regularized Proximal Newton Method: Provable Convergence Guarantees for Non-Smooth Convex Minimization without Strong Convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.
[56] H. Zhou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.
[57] H. Zhou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.




5 Conclusion In this paper we have proposed an inexact augmented La-grangian method of an asymptotic superlinear convergence rate for solving the large-scale convex composite optimization problems of the form (P) It is particularly wellsuited for solving `1-regularized least squares (LS) problems With the intelligent

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

456 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

incorporation of the semismooth Newton method our Algorithm Ssnal is able to fullyexploit the second order sparsity of the problems Numerical results have convinc-ingly demonstrated the superior efficiency and robustness of our algorithm in solvinglarge-scale `1-regularized LS problems Based on extensive numerical evidence wefirmly believe that our algorithmic framework can be adapted to design robust andefficient solvers for various large-scale convex composite optimization problems

Acknowledgments The authors would like to thank the Associate Editor andanonymous referees for their helpful suggestions The authors would also like tothank Dr Ying Cui and Dr Chao Ding for numerous discussions on the error boundconditions and metric subregularity

REFERENCES

[1] A Y Aravkin J V Burke D Drusvyatskiy M P Friedlander and S Roy Level-setmethods for convex optimization preprint arXiv160201506 2016

[2] A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linear inverseproblems SIAM J Imaging Sci 2 (2009) pp 183ndash202

[3] S Becker J Bobin and E J Candes NESTA A fast and accurate first-order method forsparse recovery SIAM J Imaging Sci 4 (2011) pp 1ndash39

[4] E van den Berg and M P Friedlander Probing the Pareto frontier for basis pursuitsolutions SIAM J Sci Comput 31 (2008) pp 890ndash912

[5] E van den Berg MP Friedlander G Hennenfent FJ Herrman R Saab andO Yılmaz Sparco A testing framework for sparse reconstruction ACM Trans MathSoftw 35 (2009) pp 1ndash16

[6] R H Byrd G M Chin J Nocedal and F Oztoprak A family of second-order methodsfor convex `1-regularized optimization Math Program 159 (2016) pp 435ndash467

[7] R H Byrd J Nocedal and F Oztoprak An inexact successive quadratic approximationmethod for L-1 regularized optimization Math Program 157 (2016) pp 375ndash396

[8] C-C Chang and C-J Lin LIBSVM A library for support vector machines ACM TransIntel Syst Tech 2 (2011) 27

[9] S S Chen D L Donoho and M A Saunders Atomic decomposition by basis pursuitSIAM J Sci Comput 20 (1998) pp 33ndash61

[10] F Clarke Optimization and Nonsmooth Analysis John Wiley and Sons New York 1983[11] A L Dontchev and R T Rockafellar Characterizations of Lipschitzian stability in

nonlinear programming in Mathematical programming with Data Perturbations LectureNotes in Pure and Appl Math 195 Dekker New York 1998 pp 65ndash82

[12] Y Cui D F Sun and K-C Toh On the asymptotic superlinear convergence of the aug-mented Lagrangian method for semidefinite programming with multiple solutions preprintarXiv161000875 2016

[13] A L Dontchev and R T Rockafellar Implicit Functions and Solution Mappings SpringerMonographs in Mathematics Springer New York 2009

[14] F Facchinei and J-S Pang Finite-Dimensional Variational Inequalities and Complemen-tarity Problems Springer New York 2003

[15] J Fan H Fang and L Han Challenges of big data analysis Nat Sci Rev 1 (2014)pp 293ndash314

[16] M A T Figueiredo R D Nowak and S J Wright Gradient projection for sparsereconstruction Application to compressed sensing and other inverse problems IEEE JSele Topics Signal Process 1 (2007) pp 586ndash597

[17] A Fischer Local behavior of an iterative framework for generalized equations with nonisolatedsolutions Math Program 94 (2002) pp 91ndash124

[18] K Fountoulakis J Gondzio and P Zhlobich Matrix-free interior point method for com-pressed sensing problems Math Program Comput 6 (2014) pp 1ndash31

[19] D Gabay and B Mercier A dual algorithm for the solution of nonlinear variational problemsvia finite element approximations Comput Math Appl 2 (1976) pp 17ndash40

[20] R Glowinski and A Marroco Sur lrsquoapproximation par elements finis drsquoordre un et laresolution par penalisation-dualite drsquoune classe de problemes de Dirichlet non lineairesRev Francaise drsquoAutomatique Inform Rech Oprsquoerationelle 9 (1975) pp 41ndash76

[21] R Goebel and R T Rockafellar Local strong convexity and local Lipschitz continuity ofthe gradient of convex functions J Convex Anal 15 (2008) pp 263ndash270

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 457

[22] G Golub and C F Van Loan Matrix Computations 3rd ed Johns Hopkins UniversityPress Baltimore MD 1996

[23] L Huang J Jia B Yu B G Chun P Maniatis and M Naik Predicting execution time ofcomputer programs using sparse polynomial regression in Advances in Neural InformationProcessing Systems 2010 pp 883ndash891

[24] A F Izmailov A S Kurennoy and M V Solodov A note on upper Lipschitz stabil-ity error bounds and critical multipliers for Lipschitz-continuous KKT systems MathProgram 142 (2013) pp 591ndash604

[25] K Jiang D F Sun and K-C Toh A partial proximal point algorithm for nuclear normregularized matrix least squares problems Math Program Comput 6 (2014) pp 281ndash325

[26] N Keskar J Nocedal F Oztoprak and A Wachter A second-order method for convex`1-regularized optimization with active-set prediction Optim Methods Software 31 (2016)pp 605ndash621

[27] D Klatte Upper Lipschitz behavior of solutions to perturbed C11 optimization problemsMath Program 88 (2000) pp 169ndash180

[28] S Kogan D Levin B R Routledge J S Sagi and N A Smith Predicting Risk fromFinancial Reports with Regression NAACL-HLT 2009 Boulder CO 2009

[29] J D Lee Y Sun and M A Saunders Proximal Newton-type methods for minimizingcomposite functions SIAM J Optim 24 (2014) pp 1420ndash1443

[30] M Lichman UCI Machine Learning Repository httparchiveicsuciedumldatasetshtml[31] X D Li D F Sun and K-C Toh QSDPNAL A Two-Phase Proximal Aug-

mented Lagrangian Method for Convex Quadratic Semidefinite Programming preprintarXiv151208872 2015

[32] J Liu S Ji and J Ye SLEP Sparse Learning with Efficient Projections Arizona StateUniversity 2009

[33] Z-Q Luo and P Tseng On the linear convergence of descent methods for convex essentiallysmooth minimization SIAM J Control Optim 30 (1992) pp 408ndash425

[34] F J Luque Asymptotic convergence analysis of the proximal point algorithm SIAM J ControlOptim 22 (1984) pp 277ndash293

[35] R Mifflin Semismooth and semiconvex functions in constrained optimization SIAM J Con-trol Optim 15 (1977) pp 959ndash972

[36] A Milzarek and M Ulbrich A semismooth Newton method with multidimensional filterglobalization for `1-optimization SIAM J Optim 24 (2014) pp 298ndash333

[37] Y Nesterov A method of solving a convex programming problem with convergence rateO(1k2) Soviet Math Dokl 27 (1983) pp 372ndash376

[38] L Qi and J Sun A nonsmooth version of Newtonrsquos method Math Program 58 (1993)pp 353ndash367

[39] S M Robinson Some continuity properties of polyhedral multifunctions in MathematicalProgramming at Oberwolfach Math Program Stud Springer Berlin Heidelberg 1981pp 206ndash214

[40] R T Rockafellar Convex Analysis Princeton University Press Princeton NJ 1970[41] R T Rockafellar Monotone operators and the proximal point algorithm SIAM J Control

Optim 14 (1976) pp 877ndash898[42] R T Rockafellar Augmented Lagrangians and applications of the proximal point algorithm

in convex programming Math Oper Res 1 (1976) pp 97ndash116[43] R T Rockafellar and R J-B Wets Variational Analysis Grundlehren Math Wiss

Springer-Verlag Berlin 1998[44] M Schmidt Graphical Model Structure Learning with `1-Regularization PhD thesis De-

partment of Computer Science The University of British Columbia 2010[45] D F Sun and J Sun Semismooth matrix-valued functions Math Oper Res 27 (2002)

pp 150ndash169[46] J Sun On Monotropic Piecewise Quadratic Programming PhD thesis Department of Math-

ematics University of Washington 1986[47] P Tseng and S Yun A coordinate gradient descent method for nonsmooth separable mini-

mization Math Program 125 (2010) pp 387ndash423[48] R Tibshirani Regression shrinkage and selection via the lasso J Roy Statist Soc Ser B

58 (1996) pp 267ndash288[49] Z Wen W Yin D Goldfarb and Y Zhang A fast algorithm for sparse reconstruction

based on shrinkage subspace optimization and continuation SIAM J Sci Comput 32(2010) pp 1832ndash1857

[50] S J Wright R D Nowak and M A T Figueiredo Sparse reconstruction by separableapproximation IEEE Trans Signal Process 57 (2009) pp 2479ndash2493

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

458 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

[51] X Xiao Y Li Z Wen and L Zhang Semi-smooth second-order type methods for compositeconvex programs preprint arXiv160307870 2016

[52] L Yang D F Sun and K-C Toh SDPNAL+ A majorized semismooth Newton-CGaugmented Lagrangian method for semidefinite programming with nonnegative constraintsMath Program Comput 7 (2015) pp 331ndash366

[53] M-C Yue Z Zhou and A M-C So Inexact Regularized Proximal Newton Method Prov-able Convergence Guarantees for Non-Smooth Convex Minimization without Strong Con-vexity preprint arXiv160507522 2016

[54] X Y Zhao D F Sun and K-C Toh A Newton-CG augmented Lagrangian method forsemidefinite programming SIAM J Optim 20 (2010) pp 1737ndash1765

[55] X Zhang M Burger and S Osher A unified primal-dual algorithm framework based onBregman iteration J Sci Comput 49 (2011) pp 20ndash46

[56] H Zhou The adaptive lasso and its oracle properties J Amer Statist Assoc 101 (2006)pp 1418ndash1429

[57] H Zhou and T Hastie Regularization and variable selection via the elastic net J RoyStatist Soc Ser B 67 (2005) pp 301ndash320

[58] Z Zhou and A M-C So A unified approach to error bounds for structured convex optimiza-tion problems Math Program 165 (2017) pp 689ndash728

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 17: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 449

rm

=ATJAJ = O(r2m)

Fig 2 Further reducing the computational costs to O(r2m)

solutions Similar discussions on the reduction of the computational costs can alsobe conducted for the case when the conjugate gradient method is applied to solvethe linear systems (24) Note that one may argue that even if the original problemhas only sparse solutions at certain stages one may still encounter the situation thatthe number of the nonzero components of Proxσλmiddot1(x) is large Our answer to thisquestion is simple First this phenomenon rarely occurs in practice since we alwaysstart with a sparse feasible point eg the zero vector Second even if at certainsteps this phenomenon does occur we just need to apply a small number of conjugategradient iterations to the linear system (24) as in this case the parameter σ is normallysmall and the current point is far away from any sparse optimal solution In summarywe have demonstrated how Algorithm Ssn can be implemented efficiently for solvingsparse optimization problems of the form (18) with p(middot) being chosen to be λ middot 1

4 Numerical experiments for lasso problems In this section we shallevaluate the performance of our algorithm Ssnal for solving large-scale lasso prob-lems (1) We note that the relative performance of most of the existing algorithmsmentioned in the introduction has recently been well documented in the two recentpapers [18 36] which appears to suggest that for some large-scale sparse reconstruc-tion problems mfIPM1 and FPC AS2 have mostly outperformed the other solversHence in this section we will compare our algorithm with these two popular solversNote that mfIPM is a specialized interior-point based second order method designedfor the lasso problem (1) whereas FPC AS is a first order method based on forward-backward operator splitting Moreover we also report the numerical performance oftwo commonly used algorithms for solving lasso problems the accelerated proximalgradient (APG) algorithm [37 2] as implemented by Liu Ji and Ye in SLEP3 [32]and the alternating direction method of multipliers (ADMM) [19 20] For the pur-pose of comparisons we also test the linearized ADMM (LADMM) [55] We haveimplemented both ADMM and LADMM in MATLAB with the step length set tobe 1618 Although the existing solvers can perform impressively well on some easy-to-solve sparse reconstruction problems as one will see later they lack the ability toefficiently solve difficult problems such as the large-scale regression problems whenthe data A is badly conditioned

For testing purposes the regularization parameter λ in the lasso problem (1) ischosen as

λ = λcAlowastbinfin

where 0 lt λc lt 1 In our numerical experiments we measure the accuracy of anapproximate optimal solution x for (1) by using the following relative KKT residual

η =xminus proxλmiddot1(xminusAlowast(Axminus b))

1 + x+ Axminus b

1httpwwwmathsedacukERGOmfipmcs2httpwwwcaamriceedusimoptimizationL1FPC AS3httpyelabnetsoftwareSLEP

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

450 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

For a given tolerance ε gt 0 we will stop the tested algorithms when η lt ε Forall the tests in this section we set ε = 10minus6 The algorithms will also be stoppedwhen they reach the maximum number of iterations (1000 iterations for our algorithmand mfIPM and 20000 iterations for FPC AS APG ADMM and LADMM) or themaximum computation time of 7 hours All the parameters for mfIPM FPC ASand APG are set to the default values All our computational results are obtainedby running MATLAB (version 84) on a Windows workstation (16-core Intel XeonE5-2650 260GHz 64 G RAM)

41 Numerical results for large-scale regression problems In this sub-section we test all the algorithms with the test instances (A b) obtained from large-scale regression problems in the LIBSVM data sets [8] These data sets are collectedfrom 10-K Corpus [28] and the UCI data repository [30] For computational efficiencyzero columns in A (the matrix representation of A) are removed As suggested in[23] for the data sets pyrim triazines abalone bodyfat housing mpg andspace ga we expand their original features by using polynomial basis functions overthose features For example the last digit in pyrim5 indicates that an order 5 poly-nomial is used to generate the basis functions This naming convention is also usedin the rest of the expanded data sets These test instances shown in Table 1 can bequite difficult in terms of the problem dimensions and the largest eigenvalue of AAlowastwhich is denoted as λmax(AAlowast)

Table 2 reports the detailed numerical results for Ssnal mfIPM FPC AS APGLADMM and ADMM in solving large-scale regression problems In the table mdenotes the number of samples n denotes the number of features and ldquonnzrdquo denotesthe number of nonzeros in the solution x obtained by Ssnal using the followingestimation

nnz = min

k |

ksumi=1

|xi| ge 0999x1

where x is obtained by sorting x such that |x1| ge middot middot middot ge |xn| An entry marked asldquoErrorrdquo indicates that the algorithm breaks down due to some internal errors Thecomputation time is in the format of ldquohoursminutessecondsrdquo and the entry ldquo00rdquo inthe column means that the computation time is less than 05 second

One can observe from Table 2 that all the tested first order algorithms exceptADMM ie FPC AS APG and LADMM fail to solve most of the test instances

Table 1Statistics of the UCI test instances

Probname mn λmax(AAlowast)E2006train 16087150348 191e+05

log1pE2006train 160874265669 586e+07

E2006test 330872812 479e+04

log1pE2006test 33081771946 146e+07

pyrim5 74169911 122e+06

triazines4 186557845 207e+07

abalone7 41776435 521e+05

bodyfat7 252116280 529e+04

housing7 50677520 328e+05

mpg7 3923432 128e+04

space ga9 31075005 401e+03

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 451T

able

2T

he

perf

orm

an

ceo

fSsn

al

mfI

PM

F

PC

AS

A

PG

L

AD

MM

a

nd

AD

MM

on

11

sele

cted

regr

essi

on

pro

blem

s(a

ccu

racy

ε=

10minus6)

Inth

eta

ble

a=

Ssn

al

b=

mfI

PM

c

=F

PC

AS

d

=A

PG

e

=L

AD

MM

a

nd

f=

AD

MM

λc

nn

Tim

e

Pro

bn

am

ea|b

|c|d

|e|f

a|b

|c|d

|e|f

mn

E2006t

rain

10minus3

113

-7|3

9-7

|14

-3|1

1-8

|51

-14|9

1-7

01|0

8|1

241

8|0

1|0

4|1

04

1

160871

50348

10minus4

137

-7|1

6-9

|99

-4|1

5-7

|26

-14|7

3-7

01|1

0|1

272

9|0

1|0

5|1

05

3

log1p

E2006t

rain

10minus3

556

-7|8

8-1

|26

-1|1

5-5

|64

-4|9

9-7

19|3

064

8|7

000

1|2

083

5|2

252

5|3

34

7

160874

265669

10minus4

599

44

-7|7

9-1

|17

-1|1

5-4

|63

-3|9

8-7

51|3

055

4|7

000

0|2

124

7|2

295

9|3

52

1

E2006t

est

10minus3

116

-9|4

8-7

|37

-4|5

5-8

|40

-14|6

7-7

00|0

2|1

92

5|0

0|0

1|1

8

33087

2812

10minus4

121

-10|4

7-7

|27

-4|3

7-7

|29

-10|6

3-7

00|0

2|2

02

1|0

0|0

1|2

1

log1p

E2006t

est

10minus3

892

-7|3

6-7

|69

-1|1

2-4

|99

-7|9

9-7

09|1

33

6|4

564

9|3

95

5|1

15

2|3

10

33081

771946

10minus4

1081

22

-7|9

6-7

|87

-1|3

9-4

|99

-7|9

9-7

18|2

422

7|4

482

9|4

23

7|3

20

4|2

28

pyri

m5

10minus3

72

99

-7|9

3-7

|88

-1|9

2-4

|40

-4|3

2-5

05|1

65

8|1

125

5|7

59|9

09|2

11

4

741

69911

10minus4

78

71

-7|3

1-7

|98

-1|3

5-3

|37

-3|1

6-3

09|3

20

2|5

42

0|8

27|9

56|2

13

4

tria

zin

es4

10minus3

519

84

-7|8

8-1

|96

-1|2

3-3

|29

-3|3

2-4

30|3

92

0|7

000

5|1

000

3|1

015

9|2

130

0

1865

57845

10minus4

260

99

-7|8

6-1

|98

-1|1

1-2

|16

-2|1

1-3

13

5|3

70

8|7

470

3|1

065

0|1

052

1|1

551

6

ab

alo

ne7

10minus3

24

57

-7|3

5-7

|Error|5

3-5

|49

-6|9

9-7

02|4

9|E

rror|1

03

8|1

80

4|9

52

41776

435

10minus4

59

37

-7|4

7-7

|Error|3

9-3

|33

-5|9

9-7

03|3

19|E

rror|1

04

3|1

35

8|9

36

bod

yfa

t710minus3

238

-8|4

0-9

|36

-1|8

8-7

|90

-7|9

9-7

02|1

29|1

120

2|4

00|3

08|1

49

2521

16280

10minus4

346

-8|3

4-7

|19

-1|3

3-5

|98

-7|9

9-7

03|2

41|1

130

8|1

21

6|4

19|4

05

hou

sin

g7

10minus3

158

23

-7|8

1-1

|84

-1|2

6-4

|17

-4|9

9-7

04|5

131

9|1

410

1|1

65

2|2

01

8|2

21

2

5067

7520

10minus4

281

77

-7|8

5-1

|42

-1|5

3-3

|59

-4|6

6-6

08|5

000

9|1

393

6|1

70

4|2

05

3|4

23

6

mp

g7

10minus3

47

20

-8|6

5-7

|Error|1

9-6

|99

-7|9

8-7

00|0

4|E

rror|3

8|1

4|0

7

3923

432

10minus4

128

41

-7|8

5-7

|45

-1|1

2-4

|13

-6|9

9-7

00|1

3|4

09|3

9|1

01|1

1

space

ga9

10minus3

14

48

-7|2

6-7

|15

-2|2

5-7

|99

-7|9

9-7

01|1

6|3

05

5|1

46|4

2|3

7

31075

005

10minus4

38

37

-7|3

0-7

|40

-2|1

8-5

|99

-7|9

9-7

01|4

1|3

02

6|6

46|2

20|5

6

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

452 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

to the required accuracy after 20000 iterations or 7 hours In particular FPC ASfails to produce a reasonably accurate solution for all the test instances In fact for3 test instances it breaks down due to some internal errors This poor performanceindicates that these first order methods cannot obtain reasonably accurate solutionswhen dealing with difficult large-scale problems While ADMM can solve most of thetest instances it needs much more time than Ssnal For example for the instancehousing7 with λc = 10minus3 we can see that Ssnal is at least 330 times faster thanADMM In addition Ssnal can solve the instance triazines4 in 30 seconds whileADMM reaches the maximum of 20000 iterations and consumes about 2 hours butonly produces a rather inaccurate solution

On the other hand one can observe that the two second order information basedmethods Ssnal and mfIPM perform quite robustly despite the huge dimensions andthe possibly badly conditioned data sets More specifically Ssnal is able to solve theinstance log1pE2006train with approximately 43 million features in 20 seconds(λc = 10minus3) Among these two algorithms clearly Ssnal is far more efficient thanthe specialized interior-point method mfIPM for all the test instances especially forlarge-scale problems where the factor can be up to 300 times faster While Ssnalcan solve all the instances to the desired accuracy mfIPM fails on 6 instances (withhigh-dimensional features) out of 22 We also note that mfIPM can only reach asolution with the accuracy of 10minus1 when it fails to compute the corresponding Newtondirections These facts indicate that the nonsmooth approach employed by Ssnal isfar superior compared to the interior point method in exploiting the sparsity in thegeneralized Hessian The superior numerical performance of Ssnal indicates that itis a robust high-performance solver for high-dimensional lasso problems

As pointed out by one referee the polynomial expansion in our data-processingstep may affect the scaling of the problems Since first order solvers are not affine in-variant this scaling may affect their performance Hence we normalize each column ofthe matrix A (the matrix representation of A) to have unit norm and correspondinglychange the variables This scaling step also changes the regularization parameter λaccordingly to a nonuniform weight vector Since it is not easy to call FPC AS whenλ is not a scalar based on the recommendation of the referee we use another popularactive-set based solver PSSas4 [44] to replace FPC AS for the testing All the param-eters for PSSas are set to the default values In our tests PSSas will be stopped whenit reaches the maximum number of 20000 iterations or the maximum computationtime of 7 hours We note that by default PSSas will terminate when its progress issmaller than the threshold 10minus9

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that this simple normalization technique does not change the conclusions drawn from Table 2. The performance of Ssnal is essentially invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM type solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy; in fact, PSSas fails on all the test instances when λc = 10^-4. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with η = 3.2 × 10^-3. (We also ran PSSas on these test instances without scaling and obtained similar performance; detailed results are omitted to conserve space.)

4 https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3
The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy ε = 10^-6). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM.
[The per-problem entries of Table 3 (η values and computation times for the same 11 UCI instances as in Table 2) are not recoverable from this extraction; see the published table.]


Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for Sparco collection. In this subsection, the test instances (A, b) are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce 60dB noise to b (as in [18]) by using the MATLAB command b = awgn(b, 60, 'measured'). For these test instances, the matrix representations of the linear maps A are not available. Hence, ADMM is not tested in this subsection, as it would be extremely expensive, if not impossible, to compute and factorize I + σAA*.
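For readers without the toolbox providing awgn, the 'measured' option at 60dB simply adds white Gaussian noise whose power is 60dB below the measured power of b. The MATLAB sketch below approximates the effect of the awgn call above for a real-valued b; it is a sketch, not the exact toolbox implementation.

    % Add white Gaussian noise at 60dB SNR relative to the measured power of b,
    % mimicking b = awgn(b, 60, 'measured') for a real-valued signal b.
    snr_dB   = 60;
    sigPow   = mean(b.^2);                      % measured signal power
    noisePow = sigPow / 10^(snr_dB / 10);       % noise power for 60dB SNR
    b        = b + sqrt(noisePow) * randn(size(b));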

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of λc, i.e., λc = 10^-0.8, 10^-1, 10^-1.2, 10^-1.5, 10^-2. As one can observe, as λc decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions (m, n) and the largest eigenvalue of AA* (λmax(AA*)). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, as well as the computation times (in the format of hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well on these test instances with small Lipschitz constants λmax(AA*). As such, Ssnal does not have the clear advantage shown in Table 2. Moreover, since the matrix representations of the linear maps A and A* involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.
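The nnz values reported here presumably follow the same estimation rule used for Table 2: the smallest k such that the k largest magnitudes of the computed solution x account for 99.9% of its 1-norm. A minimal MATLAB sketch of this rule, assuming x is the computed solution vector:

    % Estimate the number of nonzeros in x: the smallest k such that the k
    % largest entries in magnitude capture 99.9% of the l1 norm of x.
    xs      = sort(abs(x), 'descend');
    nnz_est = find(cumsum(xs) >= 0.999 * norm(x, 1), 1, 'first');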

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases, with λc = 10^-3 and 10^-4.

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy ε = 10^-6, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM. Times are in the format hours:minutes:seconds; η entries of the form x.y-z stand for x.y × 10^-z.

λc     nnz    Iteration: a | b | c | d | e       η: a | b | c | d | e                     Time: a | b | c | d | e

srcsep1: m = 29166, n = 57344, λmax(AA*) = 3.56
0.16    380   14 | 28 |  42 |  401 |  89    6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7    07 | 09 | 05 | 15 | 07
0.10    726   16 | 38 |  42 |  574 | 161    4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7    11 | 15 | 05 | 22 | 13
0.06   1402   19 | 41 |  63 |  801 | 393    1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7    18 | 18 | 10 | 30 | 32
0.03   2899   19 | 56 | 110 |  901 | 337    2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7    28 | 53 | 16 | 33 | 28
0.01   6538   17 | 88 | 223 | 1401 | 542    7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7    1:21 | 2:15 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, λmax(AA*) = 4.95
0.16    423   15 | 29 |  42 |  501 | 127    3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7    14 | 13 | 07 | 25 | 15
0.10    805   16 | 37 |  84 |  601 | 212    8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7    21 | 19 | 16 | 30 | 26
0.06   1549   19 | 40 |  84 |  901 | 419    1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7    32 | 28 | 18 | 44 | 50
0.03   3254   20 | 69 | 128 |  901 | 488    1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7    1:06 | 1:33 | 26 | 44 | 59
0.01   7400   21 | 94 | 259 | 2201 | 837    8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7    1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy ε = 10^-6, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM. Times are in the format hours:minutes:seconds; η entries of the form x.y-z stand for x.y × 10^-z.

probname          λc      nnz     η: a | b | c | d | e                         Time: a | b | c | d | e
(m, n)

blknheavi         10^-3      12   5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7        01 | 01 | 55 | 08 | 06
(1024, 1024)      10^-4      12   9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7        01 | 01 | 49 | 07 | 07
srcsep1           10^-3   14066   1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7        5:41 | 42:34 | 13:25 | 1:56 | 4:16
(29166, 57344)    10^-4   19306   9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7        9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
srcsep2           10^-3   16502   3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7        9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
(29166, 86016)    10^-4   22315   7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7        19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
srcsep3           10^-3   27314   6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7        33 | 6:24 | 8:51 | 47 | 49
(196608, 196608)  10^-4   83785   9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7        2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
soccer1           10^-3       4   1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7        01 | 03 | 13:51 | 2:35 | 02
(3200, 4096)      10^-4       8   8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7        01 | 02 | 13:23 | 3:07 | 02
soccer2           10^-3       4   3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7        00 | 03 | 13:46 | 1:40 | 02
(3200, 4096)      10^-4       8   2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7        01 | 03 | 13:27 | 3:07 | 02
blurrycam         10^-3    1694   1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7        03 | 09 | 03 | 02 | 07
(65536, 65536)    10^-4    5630   1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7        05 | 1:35 | 08 | 03 | 29
blurspike         10^-3    1954   3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7        03 | 05 | 6:38 | 03 | 27
(16384, 16384)    10^-4   11698   3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7        10 | 08 | 6:43 | 05 | 35

We can observe that FPC_AS performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when λc = 10^-3 and 10^-4, respectively. For a few cases, FPC_AS can only achieve a poor accuracy (10^-1). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of λmax(AA*). For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi (m = n = 1024), whose corresponding λmax(AA*) = 709. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to λmax(AA*), one can safely conclude that the first order algorithms can only be used to solve problems with relatively small λmax(AA*). Moreover, in order to obtain efficient and robust algorithms for lasso problems, or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.
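Since λmax(AA*) plays such a decisive role for the first order methods, it is worth recalling how this quantity can be estimated even when A is only accessible as a black-box operator, as for the Sparco instances. The MATLAB sketch below uses a plain power iteration; Amap and ATmap are assumed function handles for x -> Ax and y -> A*y, m denotes the number of samples as in the tables, and the fixed iteration count is an arbitrary choice made only for illustration.

    % Estimate lambda_max(A A*) by power iteration using only the black-box
    % operators Amap (A*x) and ATmap (A'*y).
    u = randn(m, 1);  u = u / norm(u);
    for iter = 1:100
        v   = Amap(ATmap(u));    % apply A A* to the current vector
        lam = norm(v);           % estimate of the largest eigenvalue
        u   = v / lam;           % renormalize for the next iteration
    end
    % lam approximates lambda_max(A A*), the Lipschitz constant governing
    % the step sizes (and hence the robustness) of the first order methods.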

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotically superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving ℓ1-regularized least squares (LS) problems.


With the intelligent incorporation of the semismooth Newton method, our algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale ℓ1-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and the anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183-202.
[3] S. Becker, J. Bobin, and E. J. Candes, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1-39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890-912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrmann, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1-16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435-467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375-396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Tech., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33-61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65-82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293-314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586-597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91-124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1-31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17-40.
[20] R. Glowinski and A. Marroco, Sur l'approximation par éléments finis d'ordre un et la résolution par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationnelle, 9 (1975), pp. 41-76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263-270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883-891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591-604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281-325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wachter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Software, 31 (2016), pp. 605-621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C1,1 optimization problems, Math. Program., 88 (2000), pp. 169-180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting Risk from Financial Reports with Regression, NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420-1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A Two-Phase Proximal Augmented Lagrangian Method for Convex Quadratic Semidefinite Programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408-425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277-293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959-972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298-333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372-376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353-367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206-214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877-898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97-116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150-169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387-423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267-288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832-1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479-2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331-366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact Regularized Proximal Newton Method: Provable Convergence Guarantees for Non-Smooth Convex Minimization without Strong Convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737-1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20-46.
[56] H. Zhou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418-1429.
[57] H. Zhou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301-320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689-728.




Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 451T

able

2T

he

perf

orm

an

ceo

fSsn

al

mfI

PM

F

PC

AS

A

PG

L

AD

MM

a

nd

AD

MM

on

11

sele

cted

regr

essi

on

pro

blem

s(a

ccu

racy

ε=

10minus6)

Inth

eta

ble

a=

Ssn

al

b=

mfI

PM

c

=F

PC

AS

d

=A

PG

e

=L

AD

MM

a

nd

f=

AD

MM

λc

nn

Tim

e

Pro

bn

am

ea|b

|c|d

|e|f

a|b

|c|d

|e|f

mn

E2006t

rain

10minus3

113

-7|3

9-7

|14

-3|1

1-8

|51

-14|9

1-7

01|0

8|1

241

8|0

1|0

4|1

04

1

160871

50348

10minus4

137

-7|1

6-9

|99

-4|1

5-7

|26

-14|7

3-7

01|1

0|1

272

9|0

1|0

5|1

05

3

log1p

E2006t

rain

10minus3

556

-7|8

8-1

|26

-1|1

5-5

|64

-4|9

9-7

19|3

064

8|7

000

1|2

083

5|2

252

5|3

34

7

160874

265669

10minus4

599

44

-7|7

9-1

|17

-1|1

5-4

|63

-3|9

8-7

51|3

055

4|7

000

0|2

124

7|2

295

9|3

52

1

E2006t

est

10minus3

116

-9|4

8-7

|37

-4|5

5-8

|40

-14|6

7-7

00|0

2|1

92

5|0

0|0

1|1

8

33087

2812

10minus4

121

-10|4

7-7

|27

-4|3

7-7

|29

-10|6

3-7

00|0

2|2

02

1|0

0|0

1|2

1

log1p

E2006t

est

10minus3

892

-7|3

6-7

|69

-1|1

2-4

|99

-7|9

9-7

09|1

33

6|4

564

9|3

95

5|1

15

2|3

10

33081

771946

10minus4

1081

22

-7|9

6-7

|87

-1|3

9-4

|99

-7|9

9-7

18|2

422

7|4

482

9|4

23

7|3

20

4|2

28

pyri

m5

10minus3

72

99

-7|9

3-7

|88

-1|9

2-4

|40

-4|3

2-5

05|1

65

8|1

125

5|7

59|9

09|2

11

4

741

69911

10minus4

78

71

-7|3

1-7

|98

-1|3

5-3

|37

-3|1

6-3

09|3

20

2|5

42

0|8

27|9

56|2

13

4

tria

zin

es4

10minus3

519

84

-7|8

8-1

|96

-1|2

3-3

|29

-3|3

2-4

30|3

92

0|7

000

5|1

000

3|1

015

9|2

130

0

1865

57845

10minus4

260

99

-7|8

6-1

|98

-1|1

1-2

|16

-2|1

1-3

13

5|3

70

8|7

470

3|1

065

0|1

052

1|1

551

6

ab

alo

ne7

10minus3

24

57

-7|3

5-7

|Error|5

3-5

|49

-6|9

9-7

02|4

9|E

rror|1

03

8|1

80

4|9

52

41776

435

10minus4

59

37

-7|4

7-7

|Error|3

9-3

|33

-5|9

9-7

03|3

19|E

rror|1

04

3|1

35

8|9

36

bod

yfa

t710minus3

238

-8|4

0-9

|36

-1|8

8-7

|90

-7|9

9-7

02|1

29|1

120

2|4

00|3

08|1

49

2521

16280

10minus4

346

-8|3

4-7

|19

-1|3

3-5

|98

-7|9

9-7

03|2

41|1

130

8|1

21

6|4

19|4

05

hou

sin

g7

10minus3

158

23

-7|8

1-1

|84

-1|2

6-4

|17

-4|9

9-7

04|5

131

9|1

410

1|1

65

2|2

01

8|2

21

2

5067

7520

10minus4

281

77

-7|8

5-1

|42

-1|5

3-3

|59

-4|6

6-6

08|5

000

9|1

393

6|1

70

4|2

05

3|4

23

6

mp

g7

10minus3

47

20

-8|6

5-7

|Error|1

9-6

|99

-7|9

8-7

00|0

4|E

rror|3

8|1

4|0

7

3923

432

10minus4

128

41

-7|8

5-7

|45

-1|1

2-4

|13

-6|9

9-7

00|1

3|4

09|3

9|1

01|1

1

space

ga9

10minus3

14

48

-7|2

6-7

|15

-2|2

5-7

|99

-7|9

9-7

01|1

6|3

05

5|1

46|4

2|3

7

31075

005

10minus4

38

37

-7|3

0-7

|40

-2|1

8-5

|99

-7|9

9-7

01|4

1|3

02

6|6

46|2

20|5

6

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

452 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

to the required accuracy after 20000 iterations or 7 hours In particular FPC ASfails to produce a reasonably accurate solution for all the test instances In fact for3 test instances it breaks down due to some internal errors This poor performanceindicates that these first order methods cannot obtain reasonably accurate solutionswhen dealing with difficult large-scale problems While ADMM can solve most of thetest instances it needs much more time than Ssnal For example for the instancehousing7 with λc = 10minus3 we can see that Ssnal is at least 330 times faster thanADMM In addition Ssnal can solve the instance triazines4 in 30 seconds whileADMM reaches the maximum of 20000 iterations and consumes about 2 hours butonly produces a rather inaccurate solution

On the other hand one can observe that the two second order information basedmethods Ssnal and mfIPM perform quite robustly despite the huge dimensions andthe possibly badly conditioned data sets More specifically Ssnal is able to solve theinstance log1pE2006train with approximately 43 million features in 20 seconds(λc = 10minus3) Among these two algorithms clearly Ssnal is far more efficient thanthe specialized interior-point method mfIPM for all the test instances especially forlarge-scale problems where the factor can be up to 300 times faster While Ssnalcan solve all the instances to the desired accuracy mfIPM fails on 6 instances (withhigh-dimensional features) out of 22 We also note that mfIPM can only reach asolution with the accuracy of 10minus1 when it fails to compute the corresponding Newtondirections These facts indicate that the nonsmooth approach employed by Ssnal isfar superior compared to the interior point method in exploiting the sparsity in thegeneralized Hessian The superior numerical performance of Ssnal indicates that itis a robust high-performance solver for high-dimensional lasso problems

As pointed out by one referee the polynomial expansion in our data-processingstep may affect the scaling of the problems Since first order solvers are not affine in-variant this scaling may affect their performance Hence we normalize each column ofthe matrix A (the matrix representation of A) to have unit norm and correspondinglychange the variables This scaling step also changes the regularization parameter λaccordingly to a nonuniform weight vector Since it is not easy to call FPC AS whenλ is not a scalar based on the recommendation of the referee we use another popularactive-set based solver PSSas4 [44] to replace FPC AS for the testing All the param-eters for PSSas are set to the default values In our tests PSSas will be stopped whenit reaches the maximum number of 20000 iterations or the maximum computationtime of 7 hours We note that by default PSSas will terminate when its progress issmaller than the threshold 10minus9

The detailed numerical results for Ssnal mfIPM PSSas APG LADMM andADMM with the normalization step in solving the large-scale regression problems arelisted in Table 3 From Table 3 one can easily observe that the simple normalizationtechnique does not change the conclusions based on Table 2 The performance ofSsnal is generally invariant with respect to the scaling and Ssnal is still much fasterand more robust than other solvers Meanwhile after the normalization mfIPMcan solve 2 more instances to the required accuracy On the other hand APG andthe ADMM type of solvers (ie LADMM and ADMM) perform worse than the un-scaled case Furthermore PSSas can only solve 6 out of 22 instances to the requiredaccuracy In fact PSSas fails on all the test instances when λc = 10minus4 For the in-stance triazines4 it consumes about 6 hours but only generates a poor solution withη = 32times 10minus3 (Actually we also run PSSas on these test instances without scaling

4httpswwwcsubccaschmidtmSoftwarethesishtml

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 453T

able

3T

he

perf

orm

an

ceo

fSsn

al

mfI

PM

P

SS

as

AP

G

LA

DM

M

an

dA

DM

Mo

n11

sele

cted

regr

essi

on

pro

blem

sw

ith

scali

ng

(acc

ura

cyε

=10minus6)

Inth

eta

ble

a=

Ssn

al

b=

mfI

PM

c1

=P

SS

as

d=

AP

G

e=

LA

DM

M

an

df

=A

DM

M

λc

nn

Tim

e

Pro

bn

am

ea|b

|c1|d

|e|f

a|b

|c1|d

|e|f

mn

E2006t

rain

10minus3

113

-7|3

9-7

|46

-12|1

8-7

|14

-4|9

9-7

01|1

31|0

1|6

32|2

11

2|1

184

1

160871

50348

10minus4

136

-9|5

5-7

|11

-3|9

1-7

|12

-4|2

9-4

02|2

27|0

5|1

20

1|1

94

5|3

194

8

log1p

E2006t

rain

10minus3

526

-7|3

6-7

|11

-6|5

5-3

|12

-3|3

3-6

32|5

94

6|1

343

6|2

332

7|3

040

2|7

000

2

160874

265669

10minus4

599

90

-7|2

3-7

|33

-5|7

9-3

|54

-3|3

4-4

22

3|2

422

1|6

063

9|2

482

9|2

582

7|7

000

1

E2006t

est

10minus3

116

-7|3

1-7

|65

-10|4

4-7

|29

-3|4

5-2

00|3

2|0

0|0

1|4

50|1

35

3

33087

2812

10minus4

142

-7|7

7-7

|18

-3|7

6-7

|37

-2|9

2-1

00|1

70

2|0

1|3

28|4

29|1

31

8

log1p

E2006t

est

10minus3

814

-7|5

1-7

|48

-7|9

4-4

|24

-5|1

3-5

13|1

12

9|5

94

2|4

83

5|1

083

3|1

515

9

33081

771946

10minus4

1081

33

-7|8

8-7

|10

-5|3

9-3

|10

-3|4

7-4

10

2|2

70

8|3

091

6|5

15

8|1

010

8|1

360

8

pyri

m5

10minus3

70

35

-7|8

1-7

|41

-3|2

6-3

|10

-3|1

2-4

05|1

04

2|1

24

3|7

57|1

22

6|2

55

0

741

69911

10minus4

78

53

-7|7

5-7

|74

-2|6

1-3

|58

-3|3

2-3

06|5

12

4|3

01

3|8

41|1

25

5|2

00

1

tria

zin

es4

10minus3

567

89

-7|7

8-1

|28

-4|2

3-3

|13

-3|1

3-4

28|4

35

7|1

573

8|1

002

1|1

071

6|2

230

8

1865

57845

10minus4

261

99

-7|8

6-1

|32

-3|1

3-2

|26

-2|2

1-2

12

3|4

23

1|6

031

2|1

090

3|1

093

2|2

151

2

ab

alo

ne7

10minus3

24

84

-7|1

6-7

|15

-7|2

0-3

|15

-4|1

8-6

03|2

22|2

12|1

32

3|1

32

0|4

34

8

41776

435

10minus4

59

37

-7|9

2-7

|18

-1|2

4-2

|58

-2|9

9-7

04|1

03

7|1

31

2|1

40

3|1

25

5|1

61

5

bod

yfa

t710minus3

212

-8|1

1-7

|21

-15|1

8-2

|20

-1|9

9-7

02|2

01|2

35|1

40

5|1

65

2|1

23

2

2521

16280

10minus4

370

-8|1

2-1

|19

-5|3

5-2

|35

-1|9

9-7

03|9

33|1

42

6|1

42

9|1

62

9|2

71

1

hou

sin

g7

10minus3

158

88

-7|8

5-7

|88

-7|2

2-4

|55

-5|9

9-7

03|6

02|8

42|1

82

6|2

24

4|2

80

5

5067

7520

10minus4

281

80

-7|5

9-1

|37

-5|6

0-3

|30

-3|1

2-5

05|4

500

4|2

91

6|2

11

9|2

15

5|5

32

3

mp

g7

10minus3

47

52

-7|7

3-7

|33

-6|2

7-4

|99

-7|9

9-7

00|0

7|1

6|4

0|4

8|1

2

3923

432

10minus4

128

62

-7|3

2-7

|26

-6|1

5-3

|30

-4|9

9-7

00|2

0|5

7|4

1|4

8|1

0

space

ga9

10minus3

14

43

-7|7

9-7

|17

-5|4

5-5

|11

-4|9

9-7

01|3

9|0

4|7

30|7

37|3

44

31075

005

10minus4

38

16

-7|8

7-7

|31

-5|1

7-3

|86

-4|9

9-7

01|1

31|1

16|7

34|7

03|7

59

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

454 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

and obtain similar performances Detailed results are omitted to conserve space)Therefore we can safely conclude that the simple normalization technique employedhere may not be suitable for general first order methods

42 Numerical results for Sparco collection In this subsection the testinstances (A b) are taken from 8 real-valued sparse reconstruction problems in theSparco collection [5] For testing purposes we introduce a 60dB noise to b (as in[18]) by using the MATLAB command b = awgn(b60rsquomeasuredrsquo) For thesetest instances the matrix representations of the linear maps A are not availableHence ADMM will not be tested in this subsection as it will be extremely expensiveif not impossible to compute and factorize I + σAAlowast

In Table 4 we report the detailed computational results obtained by SsnalmfIPM FPC AS APG and LADMM in solving two large-scale instances srcsep1and srcsep2 in the Sparco collection Here we test five choices of λc ie λc =10minus08 10minus1 10minus12 10minus15 10minus2 As one can observe as λc decreases the numberof nonzeros (nnz) in the computed solution increases In the table we list somestatistics of the test instances including the problem dimensions (mn) and the largesteigenvalue ofAAlowast (λmax(AAlowast)) For all the tested algorithms we present the iterationcounts the relative KKT residuals as well as the computation times (in the formatof hoursminutesseconds) One can observe from Table 4 that all the algorithmsperform very well for these test instances with small Lipschitz constants λmax(AAlowast)As such Ssnal does not have a clear advantage as shown in Table 2 Moreoversince the matrix representations of the linear maps A and Alowast involved are not storedexplicitly (ie these linear maps can only be regarded as black-box functions) thesecond order sparsity can hardly be fully exploited Nevertheless our AlgorithmSsnal is generally faster than mfIPM APG and LADMM while it is comparablewith the fastest algorithm FPC AS

In Table 5 we report the numerical results obtained by Ssnal mfIPM FPC ASAPG and LADMM in solving various instances of the lasso problem (1) For simplic-ity we only test two cases with λc = 10minus3 and 10minus4 We can observe that FPC AS

Table 4The performance of Ssnal mfIPM FPC AS APG and LADMM on srcsep1 and srcsep2

(accuracy ε = 10minus6 noise 60dB) In the table a = Ssnal b = mfIPM c = FPC AS d = APGand e = LADMM

Iteration η Time

λc nnz a | b | c | d | e a | b | c | d | e a | b | c | d | e

srcsep1 m = 29166 n = 57344 λmax(AAlowast) = 356

016 380 14 | 28 | 42 | 401 | 89 64-7| 90-7 | 27-8 | 53-7 | 95-7 07 | 09 | 05 | 15 | 07

010 726 16 | 38 | 42 | 574 | 161 47-7| 59-7 | 21-8 | 28-7 | 96-7 11 | 15 | 05 | 22 | 13

006 1402 19 | 41 | 63 | 801 | 393 15-7| 15-7 | 54-8 | 63-7 | 99-7 18 | 18 | 10 | 30 | 32

003 2899 19 | 56 | 110 | 901 | 337 27-7| 94-7 | 73-8 | 93-7 | 99-7 28 | 53 | 16 | 33 | 28

001 6538 17 | 88 | 223 | 1401 | 542 71-7| 57-7 | 11-7 | 99-7 | 99-7 121 | 215 | 34 | 53 | 45

srcsep2 m = 29166 n = 86016 λmax(AAlowast) = 495

016 423 15 | 29 | 42 | 501 | 127 32-7| 21-7 | 19-8 | 49-7 | 94-7 14 | 13 | 07 | 25 | 15

010 805 16 | 37 | 84 | 601 | 212 88-7| 92-7 | 26-8 | 96-7 | 97-7 21 | 19 | 16 | 30 | 26

006 1549 19 | 40 | 84 | 901 | 419 14-7| 31-7 | 54-8 | 47-7 | 99-7 32 | 28 | 18 | 44 | 50

003 3254 20 | 69 | 128 | 901 | 488 13-7| 66-7 | 89-7 | 94-7 | 99-7 106 | 133 | 26 | 44 | 59

001 7400 21 | 94 | 259 | 2201 | 837 88-7| 40-7 | 99-7 | 88-7 | 93-7 142 | 533 | 59 | 205 | 143

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 455

Table 5The performance of Ssnal mfIPM FPC AS APG and LADMM on 8 selected sparco problems

(accuracy ε = 10minus6 noise 60dB) In the table a = Ssnal b = mfIPM c = FPC AS d = APGand e = LADMM

λc nnz η Time

probname a | b | c | d | e a | b | c | d | e

mn

blknheavi 10minus3 12 57-7 | 92-7 | 13-1 | 20-6 | 99-7 01 | 01 | 55 | 08 | 06

10241024 10minus4 12 92-8 | 87-7 | 46-3 | 85-5 | 99-7 01 | 01 | 49 | 07 | 07

srcsep1 10minus3 14066 16-7 | 73-7 | 97-7 | 87-7 | 97-7 541 | 4234 | 1325 | 156 | 416

2916657344 10minus4 19306 98-7 | 95-7 | 99-7 | 99-7 | 95-7 928 | 33108 | 3228 | 250 | 1306

srcsep2 10minus3 16502 39-7 | 68-7 | 99-7 | 97-7 | 98-7 951 | 10110 | 1627 | 257 | 849

2916686016 10minus4 22315 79-7 | 95-7 | 10-3 | 96-7 | 95-7 1914 | 64021 | 20106 | 456 | 1601

srcsep3 10minus3 27314 61-7 | 96-7 | 99-7 | 96-7 | 99-7 33 | 624 | 851 | 47 | 49

196608196608 10minus4 83785 97-7 | 99-7 | 99-7 | 99-7 | 99-7 203 | 14259 | 340 | 115 | 306

soccer1 10minus3 4 18-7 | 63-7 | 52-1 | 84-7 | 99-7 01 | 03 | 1351 | 235 | 02

32004096 10minus4 8 87-7 | 43-7 | 52-1 | 33-6 | 96-7 01 | 02 | 1323 | 307 | 02

soccer2 10minus3 4 34-7 | 63-7 | 50-1 | 82-7 | 99-7 00 | 03 | 1346 | 140 | 02

32004096 10minus4 8 21-7 | 14-7 | 68-1 | 18-6 | 91-7 01 | 03 | 1327 | 307 | 02

blurrycam 10minus3 1694 19-7 | 65-7 | 36-8 | 41-7 | 94-7 03 | 09 | 03 | 02 | 07

6553665536 10minus4 5630 10-7 | 97-7 | 13-7 | 97-7 | 99-7 05 | 135 | 08 | 03 | 29

blurspike 10minus3 1954 31-7 | 95-7 | 74-4 | 99-7 | 99-7 03 | 05 | 638 | 03 | 27

1638416384 10minus4 11698 35-7 | 74-7 | 83-5 | 98-7 | 99-7 10 | 08 | 643 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accu-racy However it is not robust in that it fails to solve 4 out of 8 and 5 out of 8problems when λc = 10minus3 and 10minus4 respectively For a few cases FPC AS canonly achieve a poor accuracy (10minus1) The same nonrobustness also appears in theperformance of APG This nonrobustness is in fact closely related to the value ofλmax(AAlowast) For example both FPC AS and APG fail to solve a rather small prob-lem blknheavi (m = n = 1024) whose corresponding λmax(AAlowast) = 709 On theother hand LADMM Ssnal and mfIPM can solve all the test instances success-fully Nevertheless in some cases LADMM requires much more time than SsnalOne can also observe that for large-scale problems Ssnal outperforms mfIPM by alarge margin (sometimes up to a factor of 100) This also demonstrates the powerof Ssn based augmented Lagrangian methods over interior-point methods in solv-ing large-scale problems Due to the high sensitivity of the first order algorithms toλmax(AAlowast) one can safely conclude that the first order algorithms can only be used tosolve problems with relatively small λmax(AAlowast) Moreover in order to obtain efficientand robust algorithms for lasso problems or more general convex composite optimiza-tion problems it is necessary to carefully exploit the second order information in thealgorithmic design

5 Conclusion In this paper we have proposed an inexact augmented La-grangian method of an asymptotic superlinear convergence rate for solving the large-scale convex composite optimization problems of the form (P) It is particularly wellsuited for solving `1-regularized least squares (LS) problems With the intelligent

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

456 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

incorporation of the semismooth Newton method our Algorithm Ssnal is able to fullyexploit the second order sparsity of the problems Numerical results have convinc-ingly demonstrated the superior efficiency and robustness of our algorithm in solvinglarge-scale `1-regularized LS problems Based on extensive numerical evidence wefirmly believe that our algorithmic framework can be adapted to design robust andefficient solvers for various large-scale convex composite optimization problems

Acknowledgments The authors would like to thank the Associate Editor andanonymous referees for their helpful suggestions The authors would also like tothank Dr Ying Cui and Dr Chao Ding for numerous discussions on the error boundconditions and metric subregularity

REFERENCES

[1] A Y Aravkin J V Burke D Drusvyatskiy M P Friedlander and S Roy Level-setmethods for convex optimization preprint arXiv160201506 2016

[2] A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linear inverseproblems SIAM J Imaging Sci 2 (2009) pp 183ndash202

[3] S Becker J Bobin and E J Candes NESTA A fast and accurate first-order method forsparse recovery SIAM J Imaging Sci 4 (2011) pp 1ndash39

[4] E van den Berg and M P Friedlander Probing the Pareto frontier for basis pursuitsolutions SIAM J Sci Comput 31 (2008) pp 890ndash912

[5] E van den Berg MP Friedlander G Hennenfent FJ Herrman R Saab andO Yılmaz Sparco A testing framework for sparse reconstruction ACM Trans MathSoftw 35 (2009) pp 1ndash16

[6] R H Byrd G M Chin J Nocedal and F Oztoprak A family of second-order methodsfor convex `1-regularized optimization Math Program 159 (2016) pp 435ndash467

[7] R H Byrd J Nocedal and F Oztoprak An inexact successive quadratic approximationmethod for L-1 regularized optimization Math Program 157 (2016) pp 375ndash396

[8] C-C Chang and C-J Lin LIBSVM A library for support vector machines ACM TransIntel Syst Tech 2 (2011) 27

[9] S S Chen D L Donoho and M A Saunders Atomic decomposition by basis pursuitSIAM J Sci Comput 20 (1998) pp 33ndash61

[10] F Clarke Optimization and Nonsmooth Analysis John Wiley and Sons New York 1983[11] A L Dontchev and R T Rockafellar Characterizations of Lipschitzian stability in

nonlinear programming in Mathematical programming with Data Perturbations LectureNotes in Pure and Appl Math 195 Dekker New York 1998 pp 65ndash82

[12] Y Cui D F Sun and K-C Toh On the asymptotic superlinear convergence of the aug-mented Lagrangian method for semidefinite programming with multiple solutions preprintarXiv161000875 2016

[13] A L Dontchev and R T Rockafellar Implicit Functions and Solution Mappings SpringerMonographs in Mathematics Springer New York 2009

[14] F Facchinei and J-S Pang Finite-Dimensional Variational Inequalities and Complemen-tarity Problems Springer New York 2003

[15] J Fan H Fang and L Han Challenges of big data analysis Nat Sci Rev 1 (2014)pp 293ndash314

[16] M A T Figueiredo R D Nowak and S J Wright Gradient projection for sparsereconstruction Application to compressed sensing and other inverse problems IEEE JSele Topics Signal Process 1 (2007) pp 586ndash597

[17] A Fischer Local behavior of an iterative framework for generalized equations with nonisolatedsolutions Math Program 94 (2002) pp 91ndash124

[18] K Fountoulakis J Gondzio and P Zhlobich Matrix-free interior point method for com-pressed sensing problems Math Program Comput 6 (2014) pp 1ndash31

[19] D Gabay and B Mercier A dual algorithm for the solution of nonlinear variational problemsvia finite element approximations Comput Math Appl 2 (1976) pp 17ndash40

[20] R Glowinski and A Marroco Sur lrsquoapproximation par elements finis drsquoordre un et laresolution par penalisation-dualite drsquoune classe de problemes de Dirichlet non lineairesRev Francaise drsquoAutomatique Inform Rech Oprsquoerationelle 9 (1975) pp 41ndash76

[21] R Goebel and R T Rockafellar Local strong convexity and local Lipschitz continuity ofthe gradient of convex functions J Convex Anal 15 (2008) pp 263ndash270

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 457

[22] G Golub and C F Van Loan Matrix Computations 3rd ed Johns Hopkins UniversityPress Baltimore MD 1996

[23] L Huang J Jia B Yu B G Chun P Maniatis and M Naik Predicting execution time ofcomputer programs using sparse polynomial regression in Advances in Neural InformationProcessing Systems 2010 pp 883ndash891

[24] A F Izmailov A S Kurennoy and M V Solodov A note on upper Lipschitz stabil-ity error bounds and critical multipliers for Lipschitz-continuous KKT systems MathProgram 142 (2013) pp 591ndash604

[25] K Jiang D F Sun and K-C Toh A partial proximal point algorithm for nuclear normregularized matrix least squares problems Math Program Comput 6 (2014) pp 281ndash325

[26] N Keskar J Nocedal F Oztoprak and A Wachter A second-order method for convex`1-regularized optimization with active-set prediction Optim Methods Software 31 (2016)pp 605ndash621

[27] D Klatte Upper Lipschitz behavior of solutions to perturbed C11 optimization problemsMath Program 88 (2000) pp 169ndash180

[28] S Kogan D Levin B R Routledge J S Sagi and N A Smith Predicting Risk fromFinancial Reports with Regression NAACL-HLT 2009 Boulder CO 2009

[29] J D Lee Y Sun and M A Saunders Proximal Newton-type methods for minimizingcomposite functions SIAM J Optim 24 (2014) pp 1420ndash1443

[30] M Lichman UCI Machine Learning Repository httparchiveicsuciedumldatasetshtml[31] X D Li D F Sun and K-C Toh QSDPNAL A Two-Phase Proximal Aug-

mented Lagrangian Method for Convex Quadratic Semidefinite Programming preprintarXiv151208872 2015

[32] J Liu S Ji and J Ye SLEP Sparse Learning with Efficient Projections Arizona StateUniversity 2009

[33] Z-Q Luo and P Tseng On the linear convergence of descent methods for convex essentiallysmooth minimization SIAM J Control Optim 30 (1992) pp 408ndash425

[34] F J Luque Asymptotic convergence analysis of the proximal point algorithm SIAM J ControlOptim 22 (1984) pp 277ndash293

[35] R Mifflin Semismooth and semiconvex functions in constrained optimization SIAM J Con-trol Optim 15 (1977) pp 959ndash972

[36] A Milzarek and M Ulbrich A semismooth Newton method with multidimensional filterglobalization for `1-optimization SIAM J Optim 24 (2014) pp 298ndash333

[37] Y Nesterov A method of solving a convex programming problem with convergence rateO(1k2) Soviet Math Dokl 27 (1983) pp 372ndash376

[38] L Qi and J Sun A nonsmooth version of Newtonrsquos method Math Program 58 (1993)pp 353ndash367

[39] S M Robinson Some continuity properties of polyhedral multifunctions in MathematicalProgramming at Oberwolfach Math Program Stud Springer Berlin Heidelberg 1981pp 206ndash214

[40] R T Rockafellar Convex Analysis Princeton University Press Princeton NJ 1970[41] R T Rockafellar Monotone operators and the proximal point algorithm SIAM J Control

Optim 14 (1976) pp 877ndash898[42] R T Rockafellar Augmented Lagrangians and applications of the proximal point algorithm

in convex programming Math Oper Res 1 (1976) pp 97ndash116[43] R T Rockafellar and R J-B Wets Variational Analysis Grundlehren Math Wiss

Springer-Verlag Berlin 1998[44] M Schmidt Graphical Model Structure Learning with `1-Regularization PhD thesis De-

partment of Computer Science The University of British Columbia 2010[45] D F Sun and J Sun Semismooth matrix-valued functions Math Oper Res 27 (2002)

pp 150ndash169[46] J Sun On Monotropic Piecewise Quadratic Programming PhD thesis Department of Math-

ematics University of Washington 1986[47] P Tseng and S Yun A coordinate gradient descent method for nonsmooth separable mini-

mization Math Program 125 (2010) pp 387ndash423[48] R Tibshirani Regression shrinkage and selection via the lasso J Roy Statist Soc Ser B

58 (1996) pp 267ndash288[49] Z Wen W Yin D Goldfarb and Y Zhang A fast algorithm for sparse reconstruction

based on shrinkage subspace optimization and continuation SIAM J Sci Comput 32(2010) pp 1832ndash1857

[50] S J Wright R D Nowak and M A T Figueiredo Sparse reconstruction by separableapproximation IEEE Trans Signal Process 57 (2009) pp 2479ndash2493

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

458 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

[51] X Xiao Y Li Z Wen and L Zhang Semi-smooth second-order type methods for compositeconvex programs preprint arXiv160307870 2016

[52] L Yang D F Sun and K-C Toh SDPNAL+ A majorized semismooth Newton-CGaugmented Lagrangian method for semidefinite programming with nonnegative constraintsMath Program Comput 7 (2015) pp 331ndash366

[53] M-C Yue Z Zhou and A M-C So Inexact Regularized Proximal Newton Method Prov-able Convergence Guarantees for Non-Smooth Convex Minimization without Strong Con-vexity preprint arXiv160507522 2016

[54] X Y Zhao D F Sun and K-C Toh A Newton-CG augmented Lagrangian method forsemidefinite programming SIAM J Optim 20 (2010) pp 1737ndash1765

[55] X Zhang M Burger and S Osher A unified primal-dual algorithm framework based onBregman iteration J Sci Comput 49 (2011) pp 20ndash46

[56] H Zhou The adaptive lasso and its oracle properties J Amer Statist Assoc 101 (2006)pp 1418ndash1429

[57] H Zhou and T Hastie Regularization and variable selection via the elastic net J RoyStatist Soc Ser B 67 (2005) pp 301ndash320

[58] Z Zhou and A M-C So A unified approach to error bounds for structured convex optimiza-tion problems Math Program 165 (2017) pp 689ndash728

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 19: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

Table 2. The performance of Ssnal, mfIPM, FPC_AS, APG, LADMM, and ADMM on 11 selected regression problems (accuracy ε = 10^-6). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, e = LADMM, and f = ADMM. For each problem (E2006.train, log1p.E2006.train, E2006.test, log1p.E2006.test, pyrim5, triazines4, abalone7, bodyfat7, housing7, mpg7, and space_ga9), its dimensions (m, n), and each λc ∈ {10^-3, 10^-4}, the table lists nnz, the relative KKT residual η of each solver, and the computation times.


to the required accuracy after 20,000 iterations or 7 hours. In particular, FPC_AS fails to produce a reasonably accurate solution for all the test instances; in fact, for 3 test instances it breaks down due to some internal errors. This poor performance indicates that these first order methods cannot obtain reasonably accurate solutions when dealing with difficult large-scale problems. While ADMM can solve most of the test instances, it needs much more time than Ssnal. For example, for the instance housing7 with λc = 10^-3, we can see that Ssnal is at least 330 times faster than ADMM. In addition, Ssnal can solve the instance triazines4 in 30 seconds, while ADMM reaches the maximum of 20,000 iterations and consumes about 2 hours but only produces a rather inaccurate solution.

On the other hand, one can observe that the two second order information based methods, Ssnal and mfIPM, perform quite robustly despite the huge dimensions and the possibly badly conditioned data sets. More specifically, Ssnal is able to solve the instance log1p.E2006.train, with approximately 4.3 million features, in 20 seconds (λc = 10^-3). Between these two algorithms, Ssnal is clearly far more efficient than the specialized interior-point method mfIPM for all the test instances, especially for large-scale problems, where it can be up to 300 times faster. While Ssnal can solve all the instances to the desired accuracy, mfIPM fails on 6 instances (with high-dimensional features) out of 22. We also note that mfIPM can only reach a solution with an accuracy of 10^-1 when it fails to compute the corresponding Newton directions. These facts indicate that the nonsmooth approach employed by Ssnal is far superior to the interior point method in exploiting the sparsity in the generalized Hessian. The superior numerical performance of Ssnal indicates that it is a robust, high-performance solver for high-dimensional lasso problems.
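
To make the preceding point concrete, the following MATLAB fragment is a schematic sketch, under simplifying assumptions, of the kind of second order sparsity alluded to above; it is our own illustration (with a toy random A, a toy penalty parameter sigma, and a candidate support J read off the current iterate), not the authors' implementation. The point is only that a Newton-type system built from a small candidate support touches just the corresponding columns of A, which is what keeps its cost low even when n is huge.

% Illustrative sketch only (not the Ssnal code): with a small candidate
% support J, the Newton-type coefficient matrix involves only the columns
% A(:,J), so it can be formed and solved cheaply even for very large n.
rng(1); m = 100; n = 5000; sigma = 1;                   % toy sizes and penalty parameter (assumptions)
A = sprandn(m, n, 0.01);                                % toy sparse data matrix
x = sparse(n, 1); x(randperm(n, 20)) = randn(20, 1);    % iterate with a small support of 20 entries
J  = find(x);                                           % small candidate support
AJ = A(:, J);                                           % only |J| columns of A are needed
M  = speye(m) + sigma * (AJ * AJ');                     % m-by-m Newton-type coefficient matrix
d  = M \ randn(m, 1);                                   % solve against a placeholder right-hand side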

As pointed out by one referee, the polynomial expansion in our data-processing step may affect the scaling of the problems. Since first order solvers are not affine invariant, this scaling may affect their performance. Hence, we normalize each column of the matrix A (the matrix representation of A) to have unit norm and correspondingly change the variables. This scaling step also changes the regularization parameter λ accordingly into a nonuniform weight vector. Since it is not easy to call FPC_AS when λ is not a scalar, based on the recommendation of the referee, we use another popular active-set based solver, PSSas4 [44], to replace FPC_AS for the testing. All the parameters for PSSas are set to the default values. In our tests, PSSas is stopped when it reaches the maximum number of 20,000 iterations or the maximum computation time of 7 hours. We note that by default PSSas terminates when its progress is smaller than the threshold 10^-9.
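
As a concrete illustration of the normalization just described, the MATLAB fragment below is a minimal sketch of one way to carry it out on a small synthetic instance; the sizes, the placeholder value of lambda, and the variable names are our own assumptions, not part of the authors' code. Scaling each column of A to unit norm and substituting x = y ./ d' turns the scalar parameter λ into a nonuniform weight vector w.

% Minimal sketch (ours) of the column normalization described above:
% 0.5*||A*x - b||^2 + lambda*||x||_1 becomes 0.5*||Ah*y - b||^2 + sum(w.*abs(y))
% under x = y ./ d', where d holds the column norms of A.
rng(0); m = 50; n = 200;                 % small synthetic sizes (illustration only)
A = randn(m, n); b = randn(m, 1);
lambda = 0.1;                            % placeholder regularization parameter
d  = sqrt(sum(A.^2, 1));                 % column norms of A (assumed nonzero)
Ah = A ./ d;                             % each column of Ah now has unit norm
w  = lambda ./ d';                       % nonuniform weights replacing the scalar lambda
% A solver applied to the weighted problem in y yields x = y ./ d' for the original problem.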

The detailed numerical results for Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM with the normalization step in solving the large-scale regression problems are listed in Table 3. From Table 3, one can easily observe that the simple normalization technique does not change the conclusions based on Table 2. The performance of Ssnal is generally invariant with respect to the scaling, and Ssnal is still much faster and more robust than the other solvers. Meanwhile, after the normalization, mfIPM can solve 2 more instances to the required accuracy. On the other hand, APG and the ADMM type of solvers (i.e., LADMM and ADMM) perform worse than in the unscaled case. Furthermore, PSSas can only solve 6 out of 22 instances to the required accuracy; in fact, PSSas fails on all the test instances when λc = 10^-4. For the instance triazines4, it consumes about 6 hours but only generates a poor solution with η = 3.2 × 10^-3. (Actually, we also run PSSas on these test instances without scaling and obtain similar performances; detailed results are omitted to conserve space.)

4 https://www.cs.ubc.ca/~schmidtm/Software/thesis.html


Table 3. The performance of Ssnal, mfIPM, PSSas, APG, LADMM, and ADMM on 11 selected regression problems with scaling (accuracy ε = 10^-6). In the table, a = Ssnal, b = mfIPM, c1 = PSSas, d = APG, e = LADMM, and f = ADMM. The problems, dimensions, and reported quantities are the same as in Table 2.


Therefore, we can safely conclude that the simple normalization technique employed here may not be suitable for general first order methods.

4.2. Numerical results for Sparco collection. In this subsection, the test instances (A, b) are taken from 8 real-valued sparse reconstruction problems in the Sparco collection [5]. For testing purposes, we introduce 60dB noise to b (as in [18]) by using the MATLAB command b = awgn(b, 60, 'measured'). For these test instances, the matrix representations of the linear maps A are not available. Hence, ADMM will not be tested in this subsection, as it would be extremely expensive, if not impossible, to compute and factorize I + σAA*.

In Table 4, we report the detailed computational results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving two large-scale instances, srcsep1 and srcsep2, in the Sparco collection. Here we test five choices of λc, i.e., λc = 10^-0.8, 10^-1, 10^-1.2, 10^-1.5, 10^-2. As one can observe, as λc decreases, the number of nonzeros (nnz) in the computed solution increases. In the table, we list some statistics of the test instances, including the problem dimensions (m, n) and the largest eigenvalue of AA* (λmax(AA*)). For all the tested algorithms, we present the iteration counts, the relative KKT residuals, as well as the computation times (in the format hours:minutes:seconds). One can observe from Table 4 that all the algorithms perform very well on these test instances with small Lipschitz constants λmax(AA*). As such, Ssnal does not have as clear an advantage as that shown in Table 2. Moreover, since the matrix representations of the linear maps A and A* involved are not stored explicitly (i.e., these linear maps can only be regarded as black-box functions), the second order sparsity can hardly be fully exploited. Nevertheless, our algorithm Ssnal is generally faster than mfIPM, APG, and LADMM, while it is comparable with the fastest algorithm, FPC_AS.
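
Since λmax(AA*) is reported even though A is accessible only as a black-box map, it may help to indicate how such a quantity can be estimated. The MATLAB fragment below is a small sketch (our illustration, not the authors' code) of the standard power method applied to AA*; here Amap and ATmap are stand-in function handles built from an explicit toy matrix purely so that the snippet runs on its own.

% Sketch (ours): estimate lambda_max(A A*) by the power method when A and A*
% are available only as black-box function handles Amap and ATmap.
rng(2); m = 300; n = 1000;
A = randn(m, n) / sqrt(n);              % toy matrix used only to build the handles
Amap  = @(x) A * x;                     % stand-in forward map
ATmap = @(y) A' * y;                    % stand-in adjoint map
y = randn(m, 1); y = y / norm(y);       % random starting vector
for k = 1:200                           % fixed, modest iteration budget
    z   = Amap(ATmap(y));               % apply A A* to the current vector
    lam = norm(z);                      % current estimate of lambda_max(A A*)
    y   = z / lam;                      % normalize for the next iteration
end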

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases, with λc = 10^-3 and 10^-4.

Table 4. The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy ε = 10^-6, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

λc | nnz | Iteration: a | b | c | d | e | η: a | b | c | d | e | Time: a | b | c | d | e

srcsep1, m = 29166, n = 57344, λmax(AA*) = 3.56
0.16 | 380 | 14 | 28 | 42 | 401 | 89 | 6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7 | 07 | 09 | 05 | 15 | 07
0.10 | 726 | 16 | 38 | 42 | 574 | 161 | 4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7 | 11 | 15 | 05 | 22 | 13
0.06 | 1402 | 19 | 41 | 63 | 801 | 393 | 1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7 | 18 | 18 | 10 | 30 | 32
0.03 | 2899 | 19 | 56 | 110 | 901 | 337 | 2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7 | 28 | 53 | 16 | 33 | 28
0.01 | 6538 | 17 | 88 | 223 | 1401 | 542 | 7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7 | 121 | 215 | 34 | 53 | 45

srcsep2, m = 29166, n = 86016, λmax(AA*) = 4.95
0.16 | 423 | 15 | 29 | 42 | 501 | 127 | 3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7 | 14 | 13 | 07 | 25 | 15
0.10 | 805 | 16 | 37 | 84 | 601 | 212 | 8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7 | 21 | 19 | 16 | 30 | 26
0.06 | 1549 | 19 | 40 | 84 | 901 | 419 | 1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7 | 32 | 28 | 18 | 44 | 50
0.03 | 3254 | 20 | 69 | 128 | 901 | 488 | 1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7 | 106 | 133 | 26 | 44 | 59
0.01 | 7400 | 21 | 94 | 259 | 2201 | 837 | 8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7 | 142 | 533 | 59 | 205 | 143


Table 5. The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy ε = 10^-6, noise 60dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM.

probname (m; n) | λc | nnz | η: a | b | c | d | e | Time: a | b | c | d | e

blknheavi (1024; 1024) | 10^-3 | 12 | 5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7 | 01 | 01 | 55 | 08 | 06
blknheavi (1024; 1024) | 10^-4 | 12 | 9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7 | 01 | 01 | 49 | 07 | 07
srcsep1 (29166; 57344) | 10^-3 | 14066 | 1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7 | 541 | 4234 | 1325 | 156 | 416
srcsep1 (29166; 57344) | 10^-4 | 19306 | 9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7 | 928 | 33108 | 3228 | 250 | 1306
srcsep2 (29166; 86016) | 10^-3 | 16502 | 3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7 | 951 | 10110 | 1627 | 257 | 849
srcsep2 (29166; 86016) | 10^-4 | 22315 | 7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7 | 1914 | 64021 | 20106 | 456 | 1601
srcsep3 (196608; 196608) | 10^-3 | 27314 | 6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7 | 33 | 624 | 851 | 47 | 49
srcsep3 (196608; 196608) | 10^-4 | 83785 | 9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7 | 203 | 14259 | 340 | 115 | 306
soccer1 (3200; 4096) | 10^-3 | 4 | 1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7 | 01 | 03 | 1351 | 235 | 02
soccer1 (3200; 4096) | 10^-4 | 8 | 8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7 | 01 | 02 | 1323 | 307 | 02
soccer2 (3200; 4096) | 10^-3 | 4 | 3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7 | 00 | 03 | 1346 | 140 | 02
soccer2 (3200; 4096) | 10^-4 | 8 | 2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7 | 01 | 03 | 1327 | 307 | 02
blurrycam (65536; 65536) | 10^-3 | 1694 | 1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7 | 03 | 09 | 03 | 02 | 07
blurrycam (65536; 65536) | 10^-4 | 5630 | 1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7 | 05 | 135 | 08 | 03 | 29
blurspike (16384; 16384) | 10^-3 | 1954 | 3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7 | 03 | 05 | 638 | 03 | 27
blurspike (16384; 16384) | 10^-4 | 11698 | 3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7 | 10 | 08 | 643 | 05 | 35

We can observe that FPC_AS performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust, in that it fails to solve 4 out of 8 and 5 out of 8 problems when λc = 10^-3 and 10^-4, respectively. For a few cases, FPC_AS can only achieve a poor accuracy (10^-1). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of λmax(AA*). For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi (m = n = 1024), whose corresponding λmax(AA*) = 709. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to λmax(AA*), one can safely conclude that the first order algorithms can only be used to solve problems with relatively small λmax(AA*). Moreover, in order to obtain efficient and robust algorithms for lasso problems or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.
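
To see why the performance of the first order methods degrades as λmax(AA*) grows, recall that the step size of a proximal-gradient scheme is governed by the Lipschitz constant L = λmax(AA*) of the gradient of the smooth part. The MATLAB fragment below is a minimal sketch of plain ISTA iterations for (1) on synthetic data (our own illustration under the stated assumptions; APG adds an extrapolation step on top of the same update): a large L forces small steps of size 1/L and hence many iterations.

% Sketch (ours) of proximal-gradient (ISTA) steps for (1): the step size 1/L
% with L = lambda_max(A A*) is what makes first order methods sensitive to L.
rng(3); m = 200; n = 1000;
A = randn(m, n); b = randn(m, 1);
lambda = 0.1;                                   % placeholder regularization parameter
L = norm(A)^2;                                  % lambda_max(A A*) = ||A||_2^2
x = zeros(n, 1);
for k = 1:500
    g = A' * (A * x - b);                       % gradient of 0.5*||A*x - b||^2
    x = x - g / L;                              % gradient step with step size 1/L
    x = sign(x) .* max(abs(x) - lambda / L, 0); % soft-thresholding: prox of the l1 term
end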

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotically superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving ℓ1-regularized least squares (LS) problems.


With the intelligent incorporation of the semismooth Newton method, our algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale ℓ1-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candes, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrman, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intel. Syst. Tech., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Nat. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation par éléments finis d'ordre un et la résolution par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires, Rev. Française d'Automatique Inform. Rech. Opérationnelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wachter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Software, 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C1,1 optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A Two-Phase Proximal Augmented Lagrangian Method for Convex Quadratic Semidefinite Programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact Regularized Proximal Newton Method: Provable Convergence Guarantees for Non-Smooth Convex Minimization without Strong Convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.
[56] H. Zhou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.
[57] H. Zhou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 20: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

452 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

to the required accuracy after 20000 iterations or 7 hours In particular FPC ASfails to produce a reasonably accurate solution for all the test instances In fact for3 test instances it breaks down due to some internal errors This poor performanceindicates that these first order methods cannot obtain reasonably accurate solutionswhen dealing with difficult large-scale problems While ADMM can solve most of thetest instances it needs much more time than Ssnal For example for the instancehousing7 with λc = 10minus3 we can see that Ssnal is at least 330 times faster thanADMM In addition Ssnal can solve the instance triazines4 in 30 seconds whileADMM reaches the maximum of 20000 iterations and consumes about 2 hours butonly produces a rather inaccurate solution

On the other hand one can observe that the two second order information basedmethods Ssnal and mfIPM perform quite robustly despite the huge dimensions andthe possibly badly conditioned data sets More specifically Ssnal is able to solve theinstance log1pE2006train with approximately 43 million features in 20 seconds(λc = 10minus3) Among these two algorithms clearly Ssnal is far more efficient thanthe specialized interior-point method mfIPM for all the test instances especially forlarge-scale problems where the factor can be up to 300 times faster While Ssnalcan solve all the instances to the desired accuracy mfIPM fails on 6 instances (withhigh-dimensional features) out of 22 We also note that mfIPM can only reach asolution with the accuracy of 10minus1 when it fails to compute the corresponding Newtondirections These facts indicate that the nonsmooth approach employed by Ssnal isfar superior compared to the interior point method in exploiting the sparsity in thegeneralized Hessian The superior numerical performance of Ssnal indicates that itis a robust high-performance solver for high-dimensional lasso problems

As pointed out by one referee the polynomial expansion in our data-processingstep may affect the scaling of the problems Since first order solvers are not affine in-variant this scaling may affect their performance Hence we normalize each column ofthe matrix A (the matrix representation of A) to have unit norm and correspondinglychange the variables This scaling step also changes the regularization parameter λaccordingly to a nonuniform weight vector Since it is not easy to call FPC AS whenλ is not a scalar based on the recommendation of the referee we use another popularactive-set based solver PSSas4 [44] to replace FPC AS for the testing All the param-eters for PSSas are set to the default values In our tests PSSas will be stopped whenit reaches the maximum number of 20000 iterations or the maximum computationtime of 7 hours We note that by default PSSas will terminate when its progress issmaller than the threshold 10minus9

The detailed numerical results for Ssnal mfIPM PSSas APG LADMM andADMM with the normalization step in solving the large-scale regression problems arelisted in Table 3 From Table 3 one can easily observe that the simple normalizationtechnique does not change the conclusions based on Table 2 The performance ofSsnal is generally invariant with respect to the scaling and Ssnal is still much fasterand more robust than other solvers Meanwhile after the normalization mfIPMcan solve 2 more instances to the required accuracy On the other hand APG andthe ADMM type of solvers (ie LADMM and ADMM) perform worse than the un-scaled case Furthermore PSSas can only solve 6 out of 22 instances to the requiredaccuracy In fact PSSas fails on all the test instances when λc = 10minus4 For the in-stance triazines4 it consumes about 6 hours but only generates a poor solution withη = 32times 10minus3 (Actually we also run PSSas on these test instances without scaling

4httpswwwcsubccaschmidtmSoftwarethesishtml

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 453T

able

3T

he

perf

orm

an

ceo

fSsn

al

mfI

PM

P

SS

as

AP

G

LA

DM

M

an

dA

DM

Mo

n11

sele

cted

regr

essi

on

pro

blem

sw

ith

scali

ng

(acc

ura

cyε

=10minus6)

Inth

eta

ble

a=

Ssn

al

b=

mfI

PM

c1

=P

SS

as

d=

AP

G

e=

LA

DM

M

an

df

=A

DM

M

λc

nn

Tim

e

Pro

bn

am

ea|b

|c1|d

|e|f

a|b

|c1|d

|e|f

mn

E2006t

rain

10minus3

113

-7|3

9-7

|46

-12|1

8-7

|14

-4|9

9-7

01|1

31|0

1|6

32|2

11

2|1

184

1

160871

50348

10minus4

136

-9|5

5-7

|11

-3|9

1-7

|12

-4|2

9-4

02|2

27|0

5|1

20

1|1

94

5|3

194

8

log1p

E2006t

rain

10minus3

526

-7|3

6-7

|11

-6|5

5-3

|12

-3|3

3-6

32|5

94

6|1

343

6|2

332

7|3

040

2|7

000

2

160874

265669

10minus4

599

90

-7|2

3-7

|33

-5|7

9-3

|54

-3|3

4-4

22

3|2

422

1|6

063

9|2

482

9|2

582

7|7

000

1

E2006t

est

10minus3

116

-7|3

1-7

|65

-10|4

4-7

|29

-3|4

5-2

00|3

2|0

0|0

1|4

50|1

35

3

33087

2812

10minus4

142

-7|7

7-7

|18

-3|7

6-7

|37

-2|9

2-1

00|1

70

2|0

1|3

28|4

29|1

31

8

log1p

E2006t

est

10minus3

814

-7|5

1-7

|48

-7|9

4-4

|24

-5|1

3-5

13|1

12

9|5

94

2|4

83

5|1

083

3|1

515

9

33081

771946

10minus4

1081

33

-7|8

8-7

|10

-5|3

9-3

|10

-3|4

7-4

10

2|2

70

8|3

091

6|5

15

8|1

010

8|1

360

8

pyri

m5

10minus3

70

35

-7|8

1-7

|41

-3|2

6-3

|10

-3|1

2-4

05|1

04

2|1

24

3|7

57|1

22

6|2

55

0

741

69911

10minus4

78

53

-7|7

5-7

|74

-2|6

1-3

|58

-3|3

2-3

06|5

12

4|3

01

3|8

41|1

25

5|2

00

1

tria

zin

es4

10minus3

567

89

-7|7

8-1

|28

-4|2

3-3

|13

-3|1

3-4

28|4

35

7|1

573

8|1

002

1|1

071

6|2

230

8

1865

57845

10minus4

261

99

-7|8

6-1

|32

-3|1

3-2

|26

-2|2

1-2

12

3|4

23

1|6

031

2|1

090

3|1

093

2|2

151

2

ab

alo

ne7

10minus3

24

84

-7|1

6-7

|15

-7|2

0-3

|15

-4|1

8-6

03|2

22|2

12|1

32

3|1

32

0|4

34

8

41776

435

10minus4

59

37

-7|9

2-7

|18

-1|2

4-2

|58

-2|9

9-7

04|1

03

7|1

31

2|1

40

3|1

25

5|1

61

5

bod

yfa

t710minus3

212

-8|1

1-7

|21

-15|1

8-2

|20

-1|9

9-7

02|2

01|2

35|1

40

5|1

65

2|1

23

2

2521

16280

10minus4

370

-8|1

2-1

|19

-5|3

5-2

|35

-1|9

9-7

03|9

33|1

42

6|1

42

9|1

62

9|2

71

1

hou

sin

g7

10minus3

158

88

-7|8

5-7

|88

-7|2

2-4

|55

-5|9

9-7

03|6

02|8

42|1

82

6|2

24

4|2

80

5

5067

7520

10minus4

281

80

-7|5

9-1

|37

-5|6

0-3

|30

-3|1

2-5

05|4

500

4|2

91

6|2

11

9|2

15

5|5

32

3

mp

g7

10minus3

47

52

-7|7

3-7

|33

-6|2

7-4

|99

-7|9

9-7

00|0

7|1

6|4

0|4

8|1

2

3923

432

10minus4

128

62

-7|3

2-7

|26

-6|1

5-3

|30

-4|9

9-7

00|2

0|5

7|4

1|4

8|1

0

space

ga9

10minus3

14

43

-7|7

9-7

|17

-5|4

5-5

|11

-4|9

9-7

01|3

9|0

4|7

30|7

37|3

44

31075

005

10minus4

38

16

-7|8

7-7

|31

-5|1

7-3

|86

-4|9

9-7

01|1

31|1

16|7

34|7

03|7

59

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

454 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

and obtain similar performances Detailed results are omitted to conserve space)Therefore we can safely conclude that the simple normalization technique employedhere may not be suitable for general first order methods

42 Numerical results for Sparco collection In this subsection the testinstances (A b) are taken from 8 real-valued sparse reconstruction problems in theSparco collection [5] For testing purposes we introduce a 60dB noise to b (as in[18]) by using the MATLAB command b = awgn(b60rsquomeasuredrsquo) For thesetest instances the matrix representations of the linear maps A are not availableHence ADMM will not be tested in this subsection as it will be extremely expensiveif not impossible to compute and factorize I + σAAlowast

In Table 4 we report the detailed computational results obtained by SsnalmfIPM FPC AS APG and LADMM in solving two large-scale instances srcsep1and srcsep2 in the Sparco collection Here we test five choices of λc ie λc =10minus08 10minus1 10minus12 10minus15 10minus2 As one can observe as λc decreases the numberof nonzeros (nnz) in the computed solution increases In the table we list somestatistics of the test instances including the problem dimensions (mn) and the largesteigenvalue ofAAlowast (λmax(AAlowast)) For all the tested algorithms we present the iterationcounts the relative KKT residuals as well as the computation times (in the formatof hoursminutesseconds) One can observe from Table 4 that all the algorithmsperform very well for these test instances with small Lipschitz constants λmax(AAlowast)As such Ssnal does not have a clear advantage as shown in Table 2 Moreoversince the matrix representations of the linear maps A and Alowast involved are not storedexplicitly (ie these linear maps can only be regarded as black-box functions) thesecond order sparsity can hardly be fully exploited Nevertheless our AlgorithmSsnal is generally faster than mfIPM APG and LADMM while it is comparablewith the fastest algorithm FPC AS

In Table 5 we report the numerical results obtained by Ssnal mfIPM FPC ASAPG and LADMM in solving various instances of the lasso problem (1) For simplic-ity we only test two cases with λc = 10minus3 and 10minus4 We can observe that FPC AS

Table 4The performance of Ssnal mfIPM FPC AS APG and LADMM on srcsep1 and srcsep2

(accuracy ε = 10minus6 noise 60dB) In the table a = Ssnal b = mfIPM c = FPC AS d = APGand e = LADMM

Iteration η Time

λc nnz a | b | c | d | e a | b | c | d | e a | b | c | d | e

srcsep1 m = 29166 n = 57344 λmax(AAlowast) = 356

016 380 14 | 28 | 42 | 401 | 89 64-7| 90-7 | 27-8 | 53-7 | 95-7 07 | 09 | 05 | 15 | 07

010 726 16 | 38 | 42 | 574 | 161 47-7| 59-7 | 21-8 | 28-7 | 96-7 11 | 15 | 05 | 22 | 13

006 1402 19 | 41 | 63 | 801 | 393 15-7| 15-7 | 54-8 | 63-7 | 99-7 18 | 18 | 10 | 30 | 32

003 2899 19 | 56 | 110 | 901 | 337 27-7| 94-7 | 73-8 | 93-7 | 99-7 28 | 53 | 16 | 33 | 28

001 6538 17 | 88 | 223 | 1401 | 542 71-7| 57-7 | 11-7 | 99-7 | 99-7 121 | 215 | 34 | 53 | 45

srcsep2 m = 29166 n = 86016 λmax(AAlowast) = 495

016 423 15 | 29 | 42 | 501 | 127 32-7| 21-7 | 19-8 | 49-7 | 94-7 14 | 13 | 07 | 25 | 15

010 805 16 | 37 | 84 | 601 | 212 88-7| 92-7 | 26-8 | 96-7 | 97-7 21 | 19 | 16 | 30 | 26

006 1549 19 | 40 | 84 | 901 | 419 14-7| 31-7 | 54-8 | 47-7 | 99-7 32 | 28 | 18 | 44 | 50

003 3254 20 | 69 | 128 | 901 | 488 13-7| 66-7 | 89-7 | 94-7 | 99-7 106 | 133 | 26 | 44 | 59

001 7400 21 | 94 | 259 | 2201 | 837 88-7| 40-7 | 99-7 | 88-7 | 93-7 142 | 533 | 59 | 205 | 143

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 455

Table 5The performance of Ssnal mfIPM FPC AS APG and LADMM on 8 selected sparco problems

(accuracy ε = 10minus6 noise 60dB) In the table a = Ssnal b = mfIPM c = FPC AS d = APGand e = LADMM

λc nnz η Time

probname a | b | c | d | e a | b | c | d | e

mn

blknheavi 10minus3 12 57-7 | 92-7 | 13-1 | 20-6 | 99-7 01 | 01 | 55 | 08 | 06

10241024 10minus4 12 92-8 | 87-7 | 46-3 | 85-5 | 99-7 01 | 01 | 49 | 07 | 07

srcsep1 10minus3 14066 16-7 | 73-7 | 97-7 | 87-7 | 97-7 541 | 4234 | 1325 | 156 | 416

2916657344 10minus4 19306 98-7 | 95-7 | 99-7 | 99-7 | 95-7 928 | 33108 | 3228 | 250 | 1306

srcsep2 10minus3 16502 39-7 | 68-7 | 99-7 | 97-7 | 98-7 951 | 10110 | 1627 | 257 | 849

2916686016 10minus4 22315 79-7 | 95-7 | 10-3 | 96-7 | 95-7 1914 | 64021 | 20106 | 456 | 1601

srcsep3 10minus3 27314 61-7 | 96-7 | 99-7 | 96-7 | 99-7 33 | 624 | 851 | 47 | 49

196608196608 10minus4 83785 97-7 | 99-7 | 99-7 | 99-7 | 99-7 203 | 14259 | 340 | 115 | 306

soccer1 10minus3 4 18-7 | 63-7 | 52-1 | 84-7 | 99-7 01 | 03 | 1351 | 235 | 02

32004096 10minus4 8 87-7 | 43-7 | 52-1 | 33-6 | 96-7 01 | 02 | 1323 | 307 | 02

soccer2 10minus3 4 34-7 | 63-7 | 50-1 | 82-7 | 99-7 00 | 03 | 1346 | 140 | 02

32004096 10minus4 8 21-7 | 14-7 | 68-1 | 18-6 | 91-7 01 | 03 | 1327 | 307 | 02

blurrycam 10minus3 1694 19-7 | 65-7 | 36-8 | 41-7 | 94-7 03 | 09 | 03 | 02 | 07

6553665536 10minus4 5630 10-7 | 97-7 | 13-7 | 97-7 | 99-7 05 | 135 | 08 | 03 | 29

blurspike 10minus3 1954 31-7 | 95-7 | 74-4 | 99-7 | 99-7 03 | 05 | 638 | 03 | 27

1638416384 10minus4 11698 35-7 | 74-7 | 83-5 | 98-7 | 99-7 10 | 08 | 643 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accu-racy However it is not robust in that it fails to solve 4 out of 8 and 5 out of 8problems when λc = 10minus3 and 10minus4 respectively For a few cases FPC AS canonly achieve a poor accuracy (10minus1) The same nonrobustness also appears in theperformance of APG This nonrobustness is in fact closely related to the value ofλmax(AAlowast) For example both FPC AS and APG fail to solve a rather small prob-lem blknheavi (m = n = 1024) whose corresponding λmax(AAlowast) = 709 On theother hand LADMM Ssnal and mfIPM can solve all the test instances success-fully Nevertheless in some cases LADMM requires much more time than SsnalOne can also observe that for large-scale problems Ssnal outperforms mfIPM by alarge margin (sometimes up to a factor of 100) This also demonstrates the powerof Ssn based augmented Lagrangian methods over interior-point methods in solv-ing large-scale problems Due to the high sensitivity of the first order algorithms toλmax(AAlowast) one can safely conclude that the first order algorithms can only be used tosolve problems with relatively small λmax(AAlowast) Moreover in order to obtain efficientand robust algorithms for lasso problems or more general convex composite optimiza-tion problems it is necessary to carefully exploit the second order information in thealgorithmic design

5 Conclusion In this paper we have proposed an inexact augmented La-grangian method of an asymptotic superlinear convergence rate for solving the large-scale convex composite optimization problems of the form (P) It is particularly wellsuited for solving `1-regularized least squares (LS) problems With the intelligent

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

456 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

incorporation of the semismooth Newton method our Algorithm Ssnal is able to fullyexploit the second order sparsity of the problems Numerical results have convinc-ingly demonstrated the superior efficiency and robustness of our algorithm in solvinglarge-scale `1-regularized LS problems Based on extensive numerical evidence wefirmly believe that our algorithmic framework can be adapted to design robust andefficient solvers for various large-scale convex composite optimization problems

Acknowledgments The authors would like to thank the Associate Editor andanonymous referees for their helpful suggestions The authors would also like tothank Dr Ying Cui and Dr Chao Ding for numerous discussions on the error boundconditions and metric subregularity

REFERENCES

[1] A Y Aravkin J V Burke D Drusvyatskiy M P Friedlander and S Roy Level-setmethods for convex optimization preprint arXiv160201506 2016

[2] A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linear inverseproblems SIAM J Imaging Sci 2 (2009) pp 183ndash202

[3] S Becker J Bobin and E J Candes NESTA A fast and accurate first-order method forsparse recovery SIAM J Imaging Sci 4 (2011) pp 1ndash39

[4] E van den Berg and M P Friedlander Probing the Pareto frontier for basis pursuitsolutions SIAM J Sci Comput 31 (2008) pp 890ndash912

[5] E van den Berg MP Friedlander G Hennenfent FJ Herrman R Saab andO Yılmaz Sparco A testing framework for sparse reconstruction ACM Trans MathSoftw 35 (2009) pp 1ndash16

[6] R H Byrd G M Chin J Nocedal and F Oztoprak A family of second-order methodsfor convex `1-regularized optimization Math Program 159 (2016) pp 435ndash467

[7] R H Byrd J Nocedal and F Oztoprak An inexact successive quadratic approximationmethod for L-1 regularized optimization Math Program 157 (2016) pp 375ndash396

[8] C-C Chang and C-J Lin LIBSVM A library for support vector machines ACM TransIntel Syst Tech 2 (2011) 27

[9] S S Chen D L Donoho and M A Saunders Atomic decomposition by basis pursuitSIAM J Sci Comput 20 (1998) pp 33ndash61

[10] F Clarke Optimization and Nonsmooth Analysis John Wiley and Sons New York 1983[11] A L Dontchev and R T Rockafellar Characterizations of Lipschitzian stability in

nonlinear programming in Mathematical programming with Data Perturbations LectureNotes in Pure and Appl Math 195 Dekker New York 1998 pp 65ndash82

[12] Y Cui D F Sun and K-C Toh On the asymptotic superlinear convergence of the aug-mented Lagrangian method for semidefinite programming with multiple solutions preprintarXiv161000875 2016

[13] A L Dontchev and R T Rockafellar Implicit Functions and Solution Mappings SpringerMonographs in Mathematics Springer New York 2009

[14] F Facchinei and J-S Pang Finite-Dimensional Variational Inequalities and Complemen-tarity Problems Springer New York 2003

[15] J Fan H Fang and L Han Challenges of big data analysis Nat Sci Rev 1 (2014)pp 293ndash314

[16] M A T Figueiredo R D Nowak and S J Wright Gradient projection for sparsereconstruction Application to compressed sensing and other inverse problems IEEE JSele Topics Signal Process 1 (2007) pp 586ndash597

[17] A Fischer Local behavior of an iterative framework for generalized equations with nonisolatedsolutions Math Program 94 (2002) pp 91ndash124

[18] K Fountoulakis J Gondzio and P Zhlobich Matrix-free interior point method for com-pressed sensing problems Math Program Comput 6 (2014) pp 1ndash31

[19] D Gabay and B Mercier A dual algorithm for the solution of nonlinear variational problemsvia finite element approximations Comput Math Appl 2 (1976) pp 17ndash40

[20] R Glowinski and A Marroco Sur lrsquoapproximation par elements finis drsquoordre un et laresolution par penalisation-dualite drsquoune classe de problemes de Dirichlet non lineairesRev Francaise drsquoAutomatique Inform Rech Oprsquoerationelle 9 (1975) pp 41ndash76

[21] R Goebel and R T Rockafellar Local strong convexity and local Lipschitz continuity ofthe gradient of convex functions J Convex Anal 15 (2008) pp 263ndash270

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 457

[22] G Golub and C F Van Loan Matrix Computations 3rd ed Johns Hopkins UniversityPress Baltimore MD 1996

[23] L Huang J Jia B Yu B G Chun P Maniatis and M Naik Predicting execution time ofcomputer programs using sparse polynomial regression in Advances in Neural InformationProcessing Systems 2010 pp 883ndash891

[24] A F Izmailov A S Kurennoy and M V Solodov A note on upper Lipschitz stabil-ity error bounds and critical multipliers for Lipschitz-continuous KKT systems MathProgram 142 (2013) pp 591ndash604

[25] K Jiang D F Sun and K-C Toh A partial proximal point algorithm for nuclear normregularized matrix least squares problems Math Program Comput 6 (2014) pp 281ndash325

[26] N Keskar J Nocedal F Oztoprak and A Wachter A second-order method for convex`1-regularized optimization with active-set prediction Optim Methods Software 31 (2016)pp 605ndash621

[27] D Klatte Upper Lipschitz behavior of solutions to perturbed C11 optimization problemsMath Program 88 (2000) pp 169ndash180

[28] S Kogan D Levin B R Routledge J S Sagi and N A Smith Predicting Risk fromFinancial Reports with Regression NAACL-HLT 2009 Boulder CO 2009

[29] J D Lee Y Sun and M A Saunders Proximal Newton-type methods for minimizingcomposite functions SIAM J Optim 24 (2014) pp 1420ndash1443

[30] M Lichman UCI Machine Learning Repository httparchiveicsuciedumldatasetshtml[31] X D Li D F Sun and K-C Toh QSDPNAL A Two-Phase Proximal Aug-

mented Lagrangian Method for Convex Quadratic Semidefinite Programming preprintarXiv151208872 2015

[32] J Liu S Ji and J Ye SLEP Sparse Learning with Efficient Projections Arizona StateUniversity 2009

[33] Z-Q Luo and P Tseng On the linear convergence of descent methods for convex essentiallysmooth minimization SIAM J Control Optim 30 (1992) pp 408ndash425

[34] F J Luque Asymptotic convergence analysis of the proximal point algorithm SIAM J ControlOptim 22 (1984) pp 277ndash293

[35] R Mifflin Semismooth and semiconvex functions in constrained optimization SIAM J Con-trol Optim 15 (1977) pp 959ndash972

[36] A Milzarek and M Ulbrich A semismooth Newton method with multidimensional filterglobalization for `1-optimization SIAM J Optim 24 (2014) pp 298ndash333

[37] Y Nesterov A method of solving a convex programming problem with convergence rateO(1k2) Soviet Math Dokl 27 (1983) pp 372ndash376

[38] L Qi and J Sun A nonsmooth version of Newtonrsquos method Math Program 58 (1993)pp 353ndash367

[39] S M Robinson Some continuity properties of polyhedral multifunctions in MathematicalProgramming at Oberwolfach Math Program Stud Springer Berlin Heidelberg 1981pp 206ndash214

[40] R T Rockafellar Convex Analysis Princeton University Press Princeton NJ 1970[41] R T Rockafellar Monotone operators and the proximal point algorithm SIAM J Control

Optim 14 (1976) pp 877ndash898[42] R T Rockafellar Augmented Lagrangians and applications of the proximal point algorithm

in convex programming Math Oper Res 1 (1976) pp 97ndash116[43] R T Rockafellar and R J-B Wets Variational Analysis Grundlehren Math Wiss

Springer-Verlag Berlin 1998[44] M Schmidt Graphical Model Structure Learning with `1-Regularization PhD thesis De-

partment of Computer Science The University of British Columbia 2010[45] D F Sun and J Sun Semismooth matrix-valued functions Math Oper Res 27 (2002)

pp 150ndash169[46] J Sun On Monotropic Piecewise Quadratic Programming PhD thesis Department of Math-

ematics University of Washington 1986[47] P Tseng and S Yun A coordinate gradient descent method for nonsmooth separable mini-

mization Math Program 125 (2010) pp 387ndash423[48] R Tibshirani Regression shrinkage and selection via the lasso J Roy Statist Soc Ser B

58 (1996) pp 267ndash288[49] Z Wen W Yin D Goldfarb and Y Zhang A fast algorithm for sparse reconstruction

based on shrinkage subspace optimization and continuation SIAM J Sci Comput 32(2010) pp 1832ndash1857

[50] S J Wright R D Nowak and M A T Figueiredo Sparse reconstruction by separableapproximation IEEE Trans Signal Process 57 (2009) pp 2479ndash2493

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

458 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

[51] X Xiao Y Li Z Wen and L Zhang Semi-smooth second-order type methods for compositeconvex programs preprint arXiv160307870 2016

[52] L Yang D F Sun and K-C Toh SDPNAL+ A majorized semismooth Newton-CGaugmented Lagrangian method for semidefinite programming with nonnegative constraintsMath Program Comput 7 (2015) pp 331ndash366

[53] M-C Yue Z Zhou and A M-C So Inexact Regularized Proximal Newton Method Prov-able Convergence Guarantees for Non-Smooth Convex Minimization without Strong Con-vexity preprint arXiv160507522 2016

[54] X Y Zhao D F Sun and K-C Toh A Newton-CG augmented Lagrangian method forsemidefinite programming SIAM J Optim 20 (2010) pp 1737ndash1765

[55] X Zhang M Burger and S Osher A unified primal-dual algorithm framework based onBregman iteration J Sci Comput 49 (2011) pp 20ndash46

[56] H Zhou The adaptive lasso and its oracle properties J Amer Statist Assoc 101 (2006)pp 1418ndash1429

[57] H Zhou and T Hastie Regularization and variable selection via the elastic net J RoyStatist Soc Ser B 67 (2005) pp 301ndash320

[58] Z Zhou and A M-C So A unified approach to error bounds for structured convex optimiza-tion problems Math Program 165 (2017) pp 689ndash728

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 21: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 453T

able

3T

he

perf

orm

an

ceo

fSsn

al

mfI

PM

P

SS

as

AP

G

LA

DM

M

an

dA

DM

Mo

n11

sele

cted

regr

essi

on

pro

blem

sw

ith

scali

ng

(acc

ura

cyε

=10minus6)

Inth

eta

ble

a=

Ssn

al

b=

mfI

PM

c1

=P

SS

as

d=

AP

G

e=

LA

DM

M

an

df

=A

DM

M

λc

nn

Tim

e

Pro

bn

am

ea|b

|c1|d

|e|f

a|b

|c1|d

|e|f

mn

E2006t

rain

10minus3

113

-7|3

9-7

|46

-12|1

8-7

|14

-4|9

9-7

01|1

31|0

1|6

32|2

11

2|1

184

1

160871

50348

10minus4

136

-9|5

5-7

|11

-3|9

1-7

|12

-4|2

9-4

02|2

27|0

5|1

20

1|1

94

5|3

194

8

log1p

E2006t

rain

10minus3

526

-7|3

6-7

|11

-6|5

5-3

|12

-3|3

3-6

32|5

94

6|1

343

6|2

332

7|3

040

2|7

000

2

160874

265669

10minus4

599

90

-7|2

3-7

|33

-5|7

9-3

|54

-3|3

4-4

22

3|2

422

1|6

063

9|2

482

9|2

582

7|7

000

1

E2006t

est

10minus3

116

-7|3

1-7

|65

-10|4

4-7

|29

-3|4

5-2

00|3

2|0

0|0

1|4

50|1

35

3

33087

2812

10minus4

142

-7|7

7-7

|18

-3|7

6-7

|37

-2|9

2-1

00|1

70

2|0

1|3

28|4

29|1

31

8

log1p

E2006t

est

10minus3

814

-7|5

1-7

|48

-7|9

4-4

|24

-5|1

3-5

13|1

12

9|5

94

2|4

83

5|1

083

3|1

515

9

33081

771946

10minus4

1081

33

-7|8

8-7

|10

-5|3

9-3

|10

-3|4

7-4

10

2|2

70

8|3

091

6|5

15

8|1

010

8|1

360

8

pyri

m5

10minus3

70

35

-7|8

1-7

|41

-3|2

6-3

|10

-3|1

2-4

05|1

04

2|1

24

3|7

57|1

22

6|2

55

0

741

69911

10minus4

78

53

-7|7

5-7

|74

-2|6

1-3

|58

-3|3

2-3

06|5

12

4|3

01

3|8

41|1

25

5|2

00

1

tria

zin

es4

10minus3

567

89

-7|7

8-1

|28

-4|2

3-3

|13

-3|1

3-4

28|4

35

7|1

573

8|1

002

1|1

071

6|2

230

8

1865

57845

10minus4

261

99

-7|8

6-1

|32

-3|1

3-2

|26

-2|2

1-2

12

3|4

23

1|6

031

2|1

090

3|1

093

2|2

151

2

ab

alo

ne7

10minus3

24

84

-7|1

6-7

|15

-7|2

0-3

|15

-4|1

8-6

03|2

22|2

12|1

32

3|1

32

0|4

34

8

41776

435

10minus4

59

37

-7|9

2-7

|18

-1|2

4-2

|58

-2|9

9-7

04|1

03

7|1

31

2|1

40

3|1

25

5|1

61

5

bod

yfa

t710minus3

212

-8|1

1-7

|21

-15|1

8-2

|20

-1|9

9-7

02|2

01|2

35|1

40

5|1

65

2|1

23

2

2521

16280

10minus4

370

-8|1

2-1

|19

-5|3

5-2

|35

-1|9

9-7

03|9

33|1

42

6|1

42

9|1

62

9|2

71

1

hou

sin

g7

10minus3

158

88

-7|8

5-7

|88

-7|2

2-4

|55

-5|9

9-7

03|6

02|8

42|1

82

6|2

24

4|2

80

5

5067

7520

10minus4

281

80

-7|5

9-1

|37

-5|6

0-3

|30

-3|1

2-5

05|4

500

4|2

91

6|2

11

9|2

15

5|5

32

3

mp

g7

10minus3

47

52

-7|7

3-7

|33

-6|2

7-4

|99

-7|9

9-7

00|0

7|1

6|4

0|4

8|1

2

3923

432

10minus4

128

62

-7|3

2-7

|26

-6|1

5-3

|30

-4|9

9-7

00|2

0|5

7|4

1|4

8|1

0

space

ga9

10minus3

14

43

-7|7

9-7

|17

-5|4

5-5

|11

-4|9

9-7

01|3

9|0

4|7

30|7

37|3

44

31075

005

10minus4

38

16

-7|8

7-7

|31

-5|1

7-3

|86

-4|9

9-7

01|1

31|1

16|7

34|7

03|7

59

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

454 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

and obtain similar performances Detailed results are omitted to conserve space)Therefore we can safely conclude that the simple normalization technique employedhere may not be suitable for general first order methods

42 Numerical results for Sparco collection In this subsection the testinstances (A b) are taken from 8 real-valued sparse reconstruction problems in theSparco collection [5] For testing purposes we introduce a 60dB noise to b (as in[18]) by using the MATLAB command b = awgn(b60rsquomeasuredrsquo) For thesetest instances the matrix representations of the linear maps A are not availableHence ADMM will not be tested in this subsection as it will be extremely expensiveif not impossible to compute and factorize I + σAAlowast

In Table 4 we report the detailed computational results obtained by SsnalmfIPM FPC AS APG and LADMM in solving two large-scale instances srcsep1and srcsep2 in the Sparco collection Here we test five choices of λc ie λc =10minus08 10minus1 10minus12 10minus15 10minus2 As one can observe as λc decreases the numberof nonzeros (nnz) in the computed solution increases In the table we list somestatistics of the test instances including the problem dimensions (mn) and the largesteigenvalue ofAAlowast (λmax(AAlowast)) For all the tested algorithms we present the iterationcounts the relative KKT residuals as well as the computation times (in the formatof hoursminutesseconds) One can observe from Table 4 that all the algorithmsperform very well for these test instances with small Lipschitz constants λmax(AAlowast)As such Ssnal does not have a clear advantage as shown in Table 2 Moreoversince the matrix representations of the linear maps A and Alowast involved are not storedexplicitly (ie these linear maps can only be regarded as black-box functions) thesecond order sparsity can hardly be fully exploited Nevertheless our AlgorithmSsnal is generally faster than mfIPM APG and LADMM while it is comparablewith the fastest algorithm FPC AS

In Table 5, we report the numerical results obtained by Ssnal, mfIPM, FPC_AS, APG, and LADMM in solving various instances of the lasso problem (1). For simplicity, we only test two cases with λc = 10^{-3} and 10^{-4}.

Table 4
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on srcsep1 and srcsep2 (accuracy ε = 10^{-6}, noise 60 dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM; an entry such as 6.4-7 denotes 6.4 × 10^{-7}.

λc | nnz | Iteration (a | b | c | d | e) | η (a | b | c | d | e) | Time (a | b | c | d | e)

srcsep1: m = 29166, n = 57344, λmax(AA*) = 3.56
0.16 | 380  | 14 | 28 | 42 | 401 | 89   | 6.4-7 | 9.0-7 | 2.7-8 | 5.3-7 | 9.5-7 | 07 | 09 | 05 | 15 | 07
0.10 | 726  | 16 | 38 | 42 | 574 | 161  | 4.7-7 | 5.9-7 | 2.1-8 | 2.8-7 | 9.6-7 | 11 | 15 | 05 | 22 | 13
0.06 | 1402 | 19 | 41 | 63 | 801 | 393  | 1.5-7 | 1.5-7 | 5.4-8 | 6.3-7 | 9.9-7 | 18 | 18 | 10 | 30 | 32
0.03 | 2899 | 19 | 56 | 110 | 901 | 337 | 2.7-7 | 9.4-7 | 7.3-8 | 9.3-7 | 9.9-7 | 28 | 53 | 16 | 33 | 28
0.01 | 6538 | 17 | 88 | 223 | 1401 | 542 | 7.1-7 | 5.7-7 | 1.1-7 | 9.9-7 | 9.9-7 | 1:21 | 2:15 | 34 | 53 | 45

srcsep2: m = 29166, n = 86016, λmax(AA*) = 4.95
0.16 | 423  | 15 | 29 | 42 | 501 | 127  | 3.2-7 | 2.1-7 | 1.9-8 | 4.9-7 | 9.4-7 | 14 | 13 | 07 | 25 | 15
0.10 | 805  | 16 | 37 | 84 | 601 | 212  | 8.8-7 | 9.2-7 | 2.6-8 | 9.6-7 | 9.7-7 | 21 | 19 | 16 | 30 | 26
0.06 | 1549 | 19 | 40 | 84 | 901 | 419  | 1.4-7 | 3.1-7 | 5.4-8 | 4.7-7 | 9.9-7 | 32 | 28 | 18 | 44 | 50
0.03 | 3254 | 20 | 69 | 128 | 901 | 488 | 1.3-7 | 6.6-7 | 8.9-7 | 9.4-7 | 9.9-7 | 1:06 | 1:33 | 26 | 44 | 59
0.01 | 7400 | 21 | 94 | 259 | 2201 | 837 | 8.8-7 | 4.0-7 | 9.9-7 | 8.8-7 | 9.3-7 | 1:42 | 5:33 | 59 | 2:05 | 1:43


Table 5
The performance of Ssnal, mfIPM, FPC_AS, APG, and LADMM on 8 selected Sparco problems (accuracy ε = 10^{-6}, noise 60 dB). In the table, a = Ssnal, b = mfIPM, c = FPC_AS, d = APG, and e = LADMM; an entry such as 5.7-7 denotes 5.7 × 10^{-7}.

probname (m, n) | λc | nnz | η (a | b | c | d | e) | Time (a | b | c | d | e)

blknheavi (1024, 1024)   | 10^{-3} | 12    | 5.7-7 | 9.2-7 | 1.3-1 | 2.0-6 | 9.9-7 | 01 | 01 | 55 | 08 | 06
blknheavi (1024, 1024)   | 10^{-4} | 12    | 9.2-8 | 8.7-7 | 4.6-3 | 8.5-5 | 9.9-7 | 01 | 01 | 49 | 07 | 07
srcsep1 (29166, 57344)   | 10^{-3} | 14066 | 1.6-7 | 7.3-7 | 9.7-7 | 8.7-7 | 9.7-7 | 5:41 | 42:34 | 13:25 | 1:56 | 4:16
srcsep1 (29166, 57344)   | 10^{-4} | 19306 | 9.8-7 | 9.5-7 | 9.9-7 | 9.9-7 | 9.5-7 | 9:28 | 3:31:08 | 32:28 | 2:50 | 13:06
srcsep2 (29166, 86016)   | 10^{-3} | 16502 | 3.9-7 | 6.8-7 | 9.9-7 | 9.7-7 | 9.8-7 | 9:51 | 1:01:10 | 16:27 | 2:57 | 8:49
srcsep2 (29166, 86016)   | 10^{-4} | 22315 | 7.9-7 | 9.5-7 | 1.0-3 | 9.6-7 | 9.5-7 | 19:14 | 6:40:21 | 2:01:06 | 4:56 | 16:01
srcsep3 (196608, 196608) | 10^{-3} | 27314 | 6.1-7 | 9.6-7 | 9.9-7 | 9.6-7 | 9.9-7 | 33 | 6:24 | 8:51 | 47 | 49
srcsep3 (196608, 196608) | 10^{-4} | 83785 | 9.7-7 | 9.9-7 | 9.9-7 | 9.9-7 | 9.9-7 | 2:03 | 1:42:59 | 3:40 | 1:15 | 3:06
soccer1 (3200, 4096)     | 10^{-3} | 4     | 1.8-7 | 6.3-7 | 5.2-1 | 8.4-7 | 9.9-7 | 01 | 03 | 13:51 | 2:35 | 02
soccer1 (3200, 4096)     | 10^{-4} | 8     | 8.7-7 | 4.3-7 | 5.2-1 | 3.3-6 | 9.6-7 | 01 | 02 | 13:23 | 3:07 | 02
soccer2 (3200, 4096)     | 10^{-3} | 4     | 3.4-7 | 6.3-7 | 5.0-1 | 8.2-7 | 9.9-7 | 00 | 03 | 13:46 | 1:40 | 02
soccer2 (3200, 4096)     | 10^{-4} | 8     | 2.1-7 | 1.4-7 | 6.8-1 | 1.8-6 | 9.1-7 | 01 | 03 | 13:27 | 3:07 | 02
blurrycam (65536, 65536) | 10^{-3} | 1694  | 1.9-7 | 6.5-7 | 3.6-8 | 4.1-7 | 9.4-7 | 03 | 09 | 03 | 02 | 07
blurrycam (65536, 65536) | 10^{-4} | 5630  | 1.0-7 | 9.7-7 | 1.3-7 | 9.7-7 | 9.9-7 | 05 | 1:35 | 08 | 03 | 29
blurspike (16384, 16384) | 10^{-3} | 1954  | 3.1-7 | 9.5-7 | 7.4-4 | 9.9-7 | 9.9-7 | 03 | 05 | 6:38 | 03 | 27
blurspike (16384, 16384) | 10^{-4} | 11698 | 3.5-7 | 7.4-7 | 8.3-5 | 9.8-7 | 9.9-7 | 10 | 08 | 6:43 | 05 | 35

We can observe that FPC_AS performs very well when it succeeds in obtaining a solution with the desired accuracy. However, it is not robust in that it fails to solve 4 out of 8 and 5 out of 8 problems when λc = 10^{-3} and 10^{-4}, respectively. For a few cases, FPC_AS can only achieve a poor accuracy (10^{-1}). The same nonrobustness also appears in the performance of APG. This nonrobustness is in fact closely related to the value of λmax(AA*). For example, both FPC_AS and APG fail to solve a rather small problem, blknheavi (m = n = 1024), whose corresponding λmax(AA*) = 709. On the other hand, LADMM, Ssnal, and mfIPM can solve all the test instances successfully. Nevertheless, in some cases, LADMM requires much more time than Ssnal. One can also observe that for large-scale problems, Ssnal outperforms mfIPM by a large margin (sometimes up to a factor of 100). This also demonstrates the power of Ssn based augmented Lagrangian methods over interior-point methods in solving large-scale problems. Due to the high sensitivity of the first order algorithms to λmax(AA*), one can safely conclude that the first order algorithms can only be used to solve problems with relatively small λmax(AA*). Moreover, in order to obtain efficient and robust algorithms for lasso problems or more general convex composite optimization problems, it is necessary to carefully exploit the second order information in the algorithmic design.
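Since the discussion above turns on the size of λmax(AA*), it is worth recording how this quantity can be estimated even for a black-box linear map: a few power iterations on AA*, using only forward/adjoint products, suffice. The sketch below uses a random placeholder matrix; for an actual test instance the two products would be replaced by the corresponding operator calls.

% Sketch: power iteration to estimate lambda_max(A A^*) from matrix-vector
% products only, which is all that is available for black-box linear maps.
rng(3);
m = 500; n = 2000;
M = randn(m, n) / sqrt(m);            % placeholder for the linear map A
u = randn(m, 1);  u = u / norm(u);
lam = 0;
for k = 1:100
    w = M * (M' * u);                 % one application of A A^*
    lam = norm(w);                    % converges to lambda_max(A A^*)
    u = w / lam;
end
fprintf('estimated lambda_max(A A^*) = %.4f\n', lam);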

5. Conclusion. In this paper, we have proposed an inexact augmented Lagrangian method with an asymptotically superlinear convergence rate for solving large-scale convex composite optimization problems of the form (P). It is particularly well suited for solving ℓ1-regularized least squares (LS) problems. With the intelligent incorporation of the semismooth Newton method, our Algorithm Ssnal is able to fully exploit the second order sparsity of the problems. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm in solving large-scale ℓ1-regularized LS problems. Based on extensive numerical evidence, we firmly believe that our algorithmic framework can be adapted to design robust and efficient solvers for various large-scale convex composite optimization problems.

Acknowledgments. The authors would like to thank the Associate Editor and anonymous referees for their helpful suggestions. The authors would also like to thank Dr. Ying Cui and Dr. Chao Ding for numerous discussions on the error bound conditions and metric subregularity.

REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, preprint, arXiv:1602.01506, 2016.
[2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
[3] S. Becker, J. Bobin, and E. J. Candes, NESTA: A fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sci., 4 (2011), pp. 1–39.
[4] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM J. Sci. Comput., 31 (2008), pp. 890–912.
[5] E. van den Berg, M. P. Friedlander, G. Hennenfent, F. J. Herrmann, R. Saab, and O. Yılmaz, Sparco: A testing framework for sparse reconstruction, ACM Trans. Math. Softw., 35 (2009), pp. 1–16.
[6] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ1-regularized optimization, Math. Program., 159 (2016), pp. 435–467.
[7] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Math. Program., 157 (2016), pp. 375–396.
[8] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol., 2 (2011), 27.
[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput., 20 (1998), pp. 33–61.
[10] F. Clarke, Optimization and Nonsmooth Analysis, John Wiley and Sons, New York, 1983.
[11] A. L. Dontchev and R. T. Rockafellar, Characterizations of Lipschitzian stability in nonlinear programming, in Mathematical Programming with Data Perturbations, Lecture Notes in Pure and Appl. Math. 195, Dekker, New York, 1998, pp. 65–82.
[12] Y. Cui, D. F. Sun, and K.-C. Toh, On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions, preprint, arXiv:1610.00875, 2016.
[13] A. L. Dontchev and R. T. Rockafellar, Implicit Functions and Solution Mappings, Springer Monographs in Mathematics, Springer, New York, 2009.
[14] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[15] J. Fan, H. Fang, and L. Han, Challenges of big data analysis, Natl. Sci. Rev., 1 (2014), pp. 293–314.
[16] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process., 1 (2007), pp. 586–597.
[17] A. Fischer, Local behavior of an iterative framework for generalized equations with nonisolated solutions, Math. Program., 94 (2002), pp. 91–124.
[18] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Math. Program. Comput., 6 (2014), pp. 1–31.
[19] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Comput. Math. Appl., 2 (1976), pp. 17–40.
[20] R. Glowinski and A. Marroco, Sur l'approximation par elements finis d'ordre un et la resolution par penalisation-dualite d'une classe de problemes de Dirichlet non lineaires, Rev. Francaise d'Automatique Inform. Rech. Operationnelle, 9 (1975), pp. 41–76.
[21] R. Goebel and R. T. Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, J. Convex Anal., 15 (2008), pp. 263–270.


[22] G. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[23] L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik, Predicting execution time of computer programs using sparse polynomial regression, in Advances in Neural Information Processing Systems, 2010, pp. 883–891.
[24] A. F. Izmailov, A. S. Kurennoy, and M. V. Solodov, A note on upper Lipschitz stability, error bounds, and critical multipliers for Lipschitz-continuous KKT systems, Math. Program., 142 (2013), pp. 591–604.
[25] K. Jiang, D. F. Sun, and K.-C. Toh, A partial proximal point algorithm for nuclear norm regularized matrix least squares problems, Math. Program. Comput., 6 (2014), pp. 281–325.
[26] N. Keskar, J. Nocedal, F. Oztoprak, and A. Wachter, A second-order method for convex ℓ1-regularized optimization with active-set prediction, Optim. Methods Softw., 31 (2016), pp. 605–621.
[27] D. Klatte, Upper Lipschitz behavior of solutions to perturbed C^{1,1} optimization problems, Math. Program., 88 (2000), pp. 169–180.
[28] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith, Predicting risk from financial reports with regression, NAACL-HLT 2009, Boulder, CO, 2009.
[29] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443.
[30] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.
[31] X. D. Li, D. F. Sun, and K.-C. Toh, QSDPNAL: A Two-Phase Proximal Augmented Lagrangian Method for Convex Quadratic Semidefinite Programming, preprint, arXiv:1512.08872, 2015.
[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.
[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.
[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.
[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.
[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.
[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372–376.
[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.
[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.
[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.
[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.
[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.
[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.
[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.
[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.
[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.
[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.
[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact Regularized Proximal Newton Method: Provable Convergence Guarantees for Non-Smooth Convex Minimization without Strong Convexity, preprint, arXiv:1605.07522, 2016.
[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.
[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.
[56] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.
[57] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.
[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 22: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

454 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

and obtain similar performances Detailed results are omitted to conserve space)Therefore we can safely conclude that the simple normalization technique employedhere may not be suitable for general first order methods

42 Numerical results for Sparco collection In this subsection the testinstances (A b) are taken from 8 real-valued sparse reconstruction problems in theSparco collection [5] For testing purposes we introduce a 60dB noise to b (as in[18]) by using the MATLAB command b = awgn(b60rsquomeasuredrsquo) For thesetest instances the matrix representations of the linear maps A are not availableHence ADMM will not be tested in this subsection as it will be extremely expensiveif not impossible to compute and factorize I + σAAlowast

In Table 4 we report the detailed computational results obtained by SsnalmfIPM FPC AS APG and LADMM in solving two large-scale instances srcsep1and srcsep2 in the Sparco collection Here we test five choices of λc ie λc =10minus08 10minus1 10minus12 10minus15 10minus2 As one can observe as λc decreases the numberof nonzeros (nnz) in the computed solution increases In the table we list somestatistics of the test instances including the problem dimensions (mn) and the largesteigenvalue ofAAlowast (λmax(AAlowast)) For all the tested algorithms we present the iterationcounts the relative KKT residuals as well as the computation times (in the formatof hoursminutesseconds) One can observe from Table 4 that all the algorithmsperform very well for these test instances with small Lipschitz constants λmax(AAlowast)As such Ssnal does not have a clear advantage as shown in Table 2 Moreoversince the matrix representations of the linear maps A and Alowast involved are not storedexplicitly (ie these linear maps can only be regarded as black-box functions) thesecond order sparsity can hardly be fully exploited Nevertheless our AlgorithmSsnal is generally faster than mfIPM APG and LADMM while it is comparablewith the fastest algorithm FPC AS

In Table 5 we report the numerical results obtained by Ssnal mfIPM FPC ASAPG and LADMM in solving various instances of the lasso problem (1) For simplic-ity we only test two cases with λc = 10minus3 and 10minus4 We can observe that FPC AS

Table 4The performance of Ssnal mfIPM FPC AS APG and LADMM on srcsep1 and srcsep2

(accuracy ε = 10minus6 noise 60dB) In the table a = Ssnal b = mfIPM c = FPC AS d = APGand e = LADMM

Iteration η Time

λc nnz a | b | c | d | e a | b | c | d | e a | b | c | d | e

srcsep1 m = 29166 n = 57344 λmax(AAlowast) = 356

016 380 14 | 28 | 42 | 401 | 89 64-7| 90-7 | 27-8 | 53-7 | 95-7 07 | 09 | 05 | 15 | 07

010 726 16 | 38 | 42 | 574 | 161 47-7| 59-7 | 21-8 | 28-7 | 96-7 11 | 15 | 05 | 22 | 13

006 1402 19 | 41 | 63 | 801 | 393 15-7| 15-7 | 54-8 | 63-7 | 99-7 18 | 18 | 10 | 30 | 32

003 2899 19 | 56 | 110 | 901 | 337 27-7| 94-7 | 73-8 | 93-7 | 99-7 28 | 53 | 16 | 33 | 28

001 6538 17 | 88 | 223 | 1401 | 542 71-7| 57-7 | 11-7 | 99-7 | 99-7 121 | 215 | 34 | 53 | 45

srcsep2 m = 29166 n = 86016 λmax(AAlowast) = 495

016 423 15 | 29 | 42 | 501 | 127 32-7| 21-7 | 19-8 | 49-7 | 94-7 14 | 13 | 07 | 25 | 15

010 805 16 | 37 | 84 | 601 | 212 88-7| 92-7 | 26-8 | 96-7 | 97-7 21 | 19 | 16 | 30 | 26

006 1549 19 | 40 | 84 | 901 | 419 14-7| 31-7 | 54-8 | 47-7 | 99-7 32 | 28 | 18 | 44 | 50

003 3254 20 | 69 | 128 | 901 | 488 13-7| 66-7 | 89-7 | 94-7 | 99-7 106 | 133 | 26 | 44 | 59

001 7400 21 | 94 | 259 | 2201 | 837 88-7| 40-7 | 99-7 | 88-7 | 93-7 142 | 533 | 59 | 205 | 143

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 455

Table 5The performance of Ssnal mfIPM FPC AS APG and LADMM on 8 selected sparco problems

(accuracy ε = 10minus6 noise 60dB) In the table a = Ssnal b = mfIPM c = FPC AS d = APGand e = LADMM

λc nnz η Time

probname a | b | c | d | e a | b | c | d | e

mn

blknheavi 10minus3 12 57-7 | 92-7 | 13-1 | 20-6 | 99-7 01 | 01 | 55 | 08 | 06

10241024 10minus4 12 92-8 | 87-7 | 46-3 | 85-5 | 99-7 01 | 01 | 49 | 07 | 07

srcsep1 10minus3 14066 16-7 | 73-7 | 97-7 | 87-7 | 97-7 541 | 4234 | 1325 | 156 | 416

2916657344 10minus4 19306 98-7 | 95-7 | 99-7 | 99-7 | 95-7 928 | 33108 | 3228 | 250 | 1306

srcsep2 10minus3 16502 39-7 | 68-7 | 99-7 | 97-7 | 98-7 951 | 10110 | 1627 | 257 | 849

2916686016 10minus4 22315 79-7 | 95-7 | 10-3 | 96-7 | 95-7 1914 | 64021 | 20106 | 456 | 1601

srcsep3 10minus3 27314 61-7 | 96-7 | 99-7 | 96-7 | 99-7 33 | 624 | 851 | 47 | 49

196608196608 10minus4 83785 97-7 | 99-7 | 99-7 | 99-7 | 99-7 203 | 14259 | 340 | 115 | 306

soccer1 10minus3 4 18-7 | 63-7 | 52-1 | 84-7 | 99-7 01 | 03 | 1351 | 235 | 02

32004096 10minus4 8 87-7 | 43-7 | 52-1 | 33-6 | 96-7 01 | 02 | 1323 | 307 | 02

soccer2 10minus3 4 34-7 | 63-7 | 50-1 | 82-7 | 99-7 00 | 03 | 1346 | 140 | 02

32004096 10minus4 8 21-7 | 14-7 | 68-1 | 18-6 | 91-7 01 | 03 | 1327 | 307 | 02

blurrycam 10minus3 1694 19-7 | 65-7 | 36-8 | 41-7 | 94-7 03 | 09 | 03 | 02 | 07

6553665536 10minus4 5630 10-7 | 97-7 | 13-7 | 97-7 | 99-7 05 | 135 | 08 | 03 | 29

blurspike 10minus3 1954 31-7 | 95-7 | 74-4 | 99-7 | 99-7 03 | 05 | 638 | 03 | 27

1638416384 10minus4 11698 35-7 | 74-7 | 83-5 | 98-7 | 99-7 10 | 08 | 643 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accu-racy However it is not robust in that it fails to solve 4 out of 8 and 5 out of 8problems when λc = 10minus3 and 10minus4 respectively For a few cases FPC AS canonly achieve a poor accuracy (10minus1) The same nonrobustness also appears in theperformance of APG This nonrobustness is in fact closely related to the value ofλmax(AAlowast) For example both FPC AS and APG fail to solve a rather small prob-lem blknheavi (m = n = 1024) whose corresponding λmax(AAlowast) = 709 On theother hand LADMM Ssnal and mfIPM can solve all the test instances success-fully Nevertheless in some cases LADMM requires much more time than SsnalOne can also observe that for large-scale problems Ssnal outperforms mfIPM by alarge margin (sometimes up to a factor of 100) This also demonstrates the powerof Ssn based augmented Lagrangian methods over interior-point methods in solv-ing large-scale problems Due to the high sensitivity of the first order algorithms toλmax(AAlowast) one can safely conclude that the first order algorithms can only be used tosolve problems with relatively small λmax(AAlowast) Moreover in order to obtain efficientand robust algorithms for lasso problems or more general convex composite optimiza-tion problems it is necessary to carefully exploit the second order information in thealgorithmic design

5 Conclusion In this paper we have proposed an inexact augmented La-grangian method of an asymptotic superlinear convergence rate for solving the large-scale convex composite optimization problems of the form (P) It is particularly wellsuited for solving `1-regularized least squares (LS) problems With the intelligent

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

456 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

incorporation of the semismooth Newton method our Algorithm Ssnal is able to fullyexploit the second order sparsity of the problems Numerical results have convinc-ingly demonstrated the superior efficiency and robustness of our algorithm in solvinglarge-scale `1-regularized LS problems Based on extensive numerical evidence wefirmly believe that our algorithmic framework can be adapted to design robust andefficient solvers for various large-scale convex composite optimization problems

Acknowledgments The authors would like to thank the Associate Editor andanonymous referees for their helpful suggestions The authors would also like tothank Dr Ying Cui and Dr Chao Ding for numerous discussions on the error boundconditions and metric subregularity

REFERENCES

[1] A Y Aravkin J V Burke D Drusvyatskiy M P Friedlander and S Roy Level-setmethods for convex optimization preprint arXiv160201506 2016

[2] A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linear inverseproblems SIAM J Imaging Sci 2 (2009) pp 183ndash202

[3] S Becker J Bobin and E J Candes NESTA A fast and accurate first-order method forsparse recovery SIAM J Imaging Sci 4 (2011) pp 1ndash39

[4] E van den Berg and M P Friedlander Probing the Pareto frontier for basis pursuitsolutions SIAM J Sci Comput 31 (2008) pp 890ndash912

[5] E van den Berg MP Friedlander G Hennenfent FJ Herrman R Saab andO Yılmaz Sparco A testing framework for sparse reconstruction ACM Trans MathSoftw 35 (2009) pp 1ndash16

[6] R H Byrd G M Chin J Nocedal and F Oztoprak A family of second-order methodsfor convex `1-regularized optimization Math Program 159 (2016) pp 435ndash467

[7] R H Byrd J Nocedal and F Oztoprak An inexact successive quadratic approximationmethod for L-1 regularized optimization Math Program 157 (2016) pp 375ndash396

[8] C-C Chang and C-J Lin LIBSVM A library for support vector machines ACM TransIntel Syst Tech 2 (2011) 27

[9] S S Chen D L Donoho and M A Saunders Atomic decomposition by basis pursuitSIAM J Sci Comput 20 (1998) pp 33ndash61

[10] F Clarke Optimization and Nonsmooth Analysis John Wiley and Sons New York 1983[11] A L Dontchev and R T Rockafellar Characterizations of Lipschitzian stability in

nonlinear programming in Mathematical programming with Data Perturbations LectureNotes in Pure and Appl Math 195 Dekker New York 1998 pp 65ndash82

[12] Y Cui D F Sun and K-C Toh On the asymptotic superlinear convergence of the aug-mented Lagrangian method for semidefinite programming with multiple solutions preprintarXiv161000875 2016

[13] A L Dontchev and R T Rockafellar Implicit Functions and Solution Mappings SpringerMonographs in Mathematics Springer New York 2009

[14] F Facchinei and J-S Pang Finite-Dimensional Variational Inequalities and Complemen-tarity Problems Springer New York 2003

[15] J Fan H Fang and L Han Challenges of big data analysis Nat Sci Rev 1 (2014)pp 293ndash314

[16] M A T Figueiredo R D Nowak and S J Wright Gradient projection for sparsereconstruction Application to compressed sensing and other inverse problems IEEE JSele Topics Signal Process 1 (2007) pp 586ndash597

[17] A Fischer Local behavior of an iterative framework for generalized equations with nonisolatedsolutions Math Program 94 (2002) pp 91ndash124

[18] K Fountoulakis J Gondzio and P Zhlobich Matrix-free interior point method for com-pressed sensing problems Math Program Comput 6 (2014) pp 1ndash31

[19] D Gabay and B Mercier A dual algorithm for the solution of nonlinear variational problemsvia finite element approximations Comput Math Appl 2 (1976) pp 17ndash40

[20] R Glowinski and A Marroco Sur lrsquoapproximation par elements finis drsquoordre un et laresolution par penalisation-dualite drsquoune classe de problemes de Dirichlet non lineairesRev Francaise drsquoAutomatique Inform Rech Oprsquoerationelle 9 (1975) pp 41ndash76

[21] R Goebel and R T Rockafellar Local strong convexity and local Lipschitz continuity ofthe gradient of convex functions J Convex Anal 15 (2008) pp 263ndash270

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 457

[22] G Golub and C F Van Loan Matrix Computations 3rd ed Johns Hopkins UniversityPress Baltimore MD 1996

[23] L Huang J Jia B Yu B G Chun P Maniatis and M Naik Predicting execution time ofcomputer programs using sparse polynomial regression in Advances in Neural InformationProcessing Systems 2010 pp 883ndash891

[24] A F Izmailov A S Kurennoy and M V Solodov A note on upper Lipschitz stabil-ity error bounds and critical multipliers for Lipschitz-continuous KKT systems MathProgram 142 (2013) pp 591ndash604

[25] K Jiang D F Sun and K-C Toh A partial proximal point algorithm for nuclear normregularized matrix least squares problems Math Program Comput 6 (2014) pp 281ndash325

[26] N Keskar J Nocedal F Oztoprak and A Wachter A second-order method for convex`1-regularized optimization with active-set prediction Optim Methods Software 31 (2016)pp 605ndash621

[27] D Klatte Upper Lipschitz behavior of solutions to perturbed C11 optimization problemsMath Program 88 (2000) pp 169ndash180

[28] S Kogan D Levin B R Routledge J S Sagi and N A Smith Predicting Risk fromFinancial Reports with Regression NAACL-HLT 2009 Boulder CO 2009

[29] J D Lee Y Sun and M A Saunders Proximal Newton-type methods for minimizingcomposite functions SIAM J Optim 24 (2014) pp 1420ndash1443

[30] M Lichman UCI Machine Learning Repository httparchiveicsuciedumldatasetshtml[31] X D Li D F Sun and K-C Toh QSDPNAL A Two-Phase Proximal Aug-

mented Lagrangian Method for Convex Quadratic Semidefinite Programming preprintarXiv151208872 2015

[32] J Liu S Ji and J Ye SLEP Sparse Learning with Efficient Projections Arizona StateUniversity 2009

[33] Z-Q Luo and P Tseng On the linear convergence of descent methods for convex essentiallysmooth minimization SIAM J Control Optim 30 (1992) pp 408ndash425

[34] F J Luque Asymptotic convergence analysis of the proximal point algorithm SIAM J ControlOptim 22 (1984) pp 277ndash293

[35] R Mifflin Semismooth and semiconvex functions in constrained optimization SIAM J Con-trol Optim 15 (1977) pp 959ndash972

[36] A Milzarek and M Ulbrich A semismooth Newton method with multidimensional filterglobalization for `1-optimization SIAM J Optim 24 (2014) pp 298ndash333

[37] Y Nesterov A method of solving a convex programming problem with convergence rateO(1k2) Soviet Math Dokl 27 (1983) pp 372ndash376

[38] L Qi and J Sun A nonsmooth version of Newtonrsquos method Math Program 58 (1993)pp 353ndash367

[39] S M Robinson Some continuity properties of polyhedral multifunctions in MathematicalProgramming at Oberwolfach Math Program Stud Springer Berlin Heidelberg 1981pp 206ndash214

[40] R T Rockafellar Convex Analysis Princeton University Press Princeton NJ 1970[41] R T Rockafellar Monotone operators and the proximal point algorithm SIAM J Control

Optim 14 (1976) pp 877ndash898[42] R T Rockafellar Augmented Lagrangians and applications of the proximal point algorithm

in convex programming Math Oper Res 1 (1976) pp 97ndash116[43] R T Rockafellar and R J-B Wets Variational Analysis Grundlehren Math Wiss

Springer-Verlag Berlin 1998[44] M Schmidt Graphical Model Structure Learning with `1-Regularization PhD thesis De-

partment of Computer Science The University of British Columbia 2010[45] D F Sun and J Sun Semismooth matrix-valued functions Math Oper Res 27 (2002)

pp 150ndash169[46] J Sun On Monotropic Piecewise Quadratic Programming PhD thesis Department of Math-

ematics University of Washington 1986[47] P Tseng and S Yun A coordinate gradient descent method for nonsmooth separable mini-

mization Math Program 125 (2010) pp 387ndash423[48] R Tibshirani Regression shrinkage and selection via the lasso J Roy Statist Soc Ser B

58 (1996) pp 267ndash288[49] Z Wen W Yin D Goldfarb and Y Zhang A fast algorithm for sparse reconstruction

based on shrinkage subspace optimization and continuation SIAM J Sci Comput 32(2010) pp 1832ndash1857

[50] S J Wright R D Nowak and M A T Figueiredo Sparse reconstruction by separableapproximation IEEE Trans Signal Process 57 (2009) pp 2479ndash2493

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

458 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

[51] X Xiao Y Li Z Wen and L Zhang Semi-smooth second-order type methods for compositeconvex programs preprint arXiv160307870 2016

[52] L Yang D F Sun and K-C Toh SDPNAL+ A majorized semismooth Newton-CGaugmented Lagrangian method for semidefinite programming with nonnegative constraintsMath Program Comput 7 (2015) pp 331ndash366

[53] M-C Yue Z Zhou and A M-C So Inexact Regularized Proximal Newton Method Prov-able Convergence Guarantees for Non-Smooth Convex Minimization without Strong Con-vexity preprint arXiv160507522 2016

[54] X Y Zhao D F Sun and K-C Toh A Newton-CG augmented Lagrangian method forsemidefinite programming SIAM J Optim 20 (2010) pp 1737ndash1765

[55] X Zhang M Burger and S Osher A unified primal-dual algorithm framework based onBregman iteration J Sci Comput 49 (2011) pp 20ndash46

[56] H Zhou The adaptive lasso and its oracle properties J Amer Statist Assoc 101 (2006)pp 1418ndash1429

[57] H Zhou and T Hastie Regularization and variable selection via the elastic net J RoyStatist Soc Ser B 67 (2005) pp 301ndash320

[58] Z Zhou and A M-C So A unified approach to error bounds for structured convex optimiza-tion problems Math Program 165 (2017) pp 689ndash728

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 23: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 455

Table 5The performance of Ssnal mfIPM FPC AS APG and LADMM on 8 selected sparco problems

(accuracy ε = 10minus6 noise 60dB) In the table a = Ssnal b = mfIPM c = FPC AS d = APGand e = LADMM

λc nnz η Time

probname a | b | c | d | e a | b | c | d | e

mn

blknheavi 10minus3 12 57-7 | 92-7 | 13-1 | 20-6 | 99-7 01 | 01 | 55 | 08 | 06

10241024 10minus4 12 92-8 | 87-7 | 46-3 | 85-5 | 99-7 01 | 01 | 49 | 07 | 07

srcsep1 10minus3 14066 16-7 | 73-7 | 97-7 | 87-7 | 97-7 541 | 4234 | 1325 | 156 | 416

2916657344 10minus4 19306 98-7 | 95-7 | 99-7 | 99-7 | 95-7 928 | 33108 | 3228 | 250 | 1306

srcsep2 10minus3 16502 39-7 | 68-7 | 99-7 | 97-7 | 98-7 951 | 10110 | 1627 | 257 | 849

2916686016 10minus4 22315 79-7 | 95-7 | 10-3 | 96-7 | 95-7 1914 | 64021 | 20106 | 456 | 1601

srcsep3 10minus3 27314 61-7 | 96-7 | 99-7 | 96-7 | 99-7 33 | 624 | 851 | 47 | 49

196608196608 10minus4 83785 97-7 | 99-7 | 99-7 | 99-7 | 99-7 203 | 14259 | 340 | 115 | 306

soccer1 10minus3 4 18-7 | 63-7 | 52-1 | 84-7 | 99-7 01 | 03 | 1351 | 235 | 02

32004096 10minus4 8 87-7 | 43-7 | 52-1 | 33-6 | 96-7 01 | 02 | 1323 | 307 | 02

soccer2 10minus3 4 34-7 | 63-7 | 50-1 | 82-7 | 99-7 00 | 03 | 1346 | 140 | 02

32004096 10minus4 8 21-7 | 14-7 | 68-1 | 18-6 | 91-7 01 | 03 | 1327 | 307 | 02

blurrycam 10minus3 1694 19-7 | 65-7 | 36-8 | 41-7 | 94-7 03 | 09 | 03 | 02 | 07

6553665536 10minus4 5630 10-7 | 97-7 | 13-7 | 97-7 | 99-7 05 | 135 | 08 | 03 | 29

blurspike 10minus3 1954 31-7 | 95-7 | 74-4 | 99-7 | 99-7 03 | 05 | 638 | 03 | 27

1638416384 10minus4 11698 35-7 | 74-7 | 83-5 | 98-7 | 99-7 10 | 08 | 643 | 05 | 35

performs very well when it succeeds in obtaining a solution with the desired accu-racy However it is not robust in that it fails to solve 4 out of 8 and 5 out of 8problems when λc = 10minus3 and 10minus4 respectively For a few cases FPC AS canonly achieve a poor accuracy (10minus1) The same nonrobustness also appears in theperformance of APG This nonrobustness is in fact closely related to the value ofλmax(AAlowast) For example both FPC AS and APG fail to solve a rather small prob-lem blknheavi (m = n = 1024) whose corresponding λmax(AAlowast) = 709 On theother hand LADMM Ssnal and mfIPM can solve all the test instances success-fully Nevertheless in some cases LADMM requires much more time than SsnalOne can also observe that for large-scale problems Ssnal outperforms mfIPM by alarge margin (sometimes up to a factor of 100) This also demonstrates the powerof Ssn based augmented Lagrangian methods over interior-point methods in solv-ing large-scale problems Due to the high sensitivity of the first order algorithms toλmax(AAlowast) one can safely conclude that the first order algorithms can only be used tosolve problems with relatively small λmax(AAlowast) Moreover in order to obtain efficientand robust algorithms for lasso problems or more general convex composite optimiza-tion problems it is necessary to carefully exploit the second order information in thealgorithmic design

5 Conclusion In this paper we have proposed an inexact augmented La-grangian method of an asymptotic superlinear convergence rate for solving the large-scale convex composite optimization problems of the form (P) It is particularly wellsuited for solving `1-regularized least squares (LS) problems With the intelligent

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

456 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

incorporation of the semismooth Newton method our Algorithm Ssnal is able to fullyexploit the second order sparsity of the problems Numerical results have convinc-ingly demonstrated the superior efficiency and robustness of our algorithm in solvinglarge-scale `1-regularized LS problems Based on extensive numerical evidence wefirmly believe that our algorithmic framework can be adapted to design robust andefficient solvers for various large-scale convex composite optimization problems

Acknowledgments The authors would like to thank the Associate Editor andanonymous referees for their helpful suggestions The authors would also like tothank Dr Ying Cui and Dr Chao Ding for numerous discussions on the error boundconditions and metric subregularity

REFERENCES

[1] A Y Aravkin J V Burke D Drusvyatskiy M P Friedlander and S Roy Level-setmethods for convex optimization preprint arXiv160201506 2016

[2] A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linear inverseproblems SIAM J Imaging Sci 2 (2009) pp 183ndash202

[3] S Becker J Bobin and E J Candes NESTA A fast and accurate first-order method forsparse recovery SIAM J Imaging Sci 4 (2011) pp 1ndash39

[4] E van den Berg and M P Friedlander Probing the Pareto frontier for basis pursuitsolutions SIAM J Sci Comput 31 (2008) pp 890ndash912

[5] E van den Berg MP Friedlander G Hennenfent FJ Herrman R Saab andO Yılmaz Sparco A testing framework for sparse reconstruction ACM Trans MathSoftw 35 (2009) pp 1ndash16

[6] R H Byrd G M Chin J Nocedal and F Oztoprak A family of second-order methodsfor convex `1-regularized optimization Math Program 159 (2016) pp 435ndash467

[7] R H Byrd J Nocedal and F Oztoprak An inexact successive quadratic approximationmethod for L-1 regularized optimization Math Program 157 (2016) pp 375ndash396

[8] C-C Chang and C-J Lin LIBSVM A library for support vector machines ACM TransIntel Syst Tech 2 (2011) 27

[9] S S Chen D L Donoho and M A Saunders Atomic decomposition by basis pursuitSIAM J Sci Comput 20 (1998) pp 33ndash61

[10] F Clarke Optimization and Nonsmooth Analysis John Wiley and Sons New York 1983[11] A L Dontchev and R T Rockafellar Characterizations of Lipschitzian stability in

nonlinear programming in Mathematical programming with Data Perturbations LectureNotes in Pure and Appl Math 195 Dekker New York 1998 pp 65ndash82

[12] Y Cui D F Sun and K-C Toh On the asymptotic superlinear convergence of the aug-mented Lagrangian method for semidefinite programming with multiple solutions preprintarXiv161000875 2016

[13] A L Dontchev and R T Rockafellar Implicit Functions and Solution Mappings SpringerMonographs in Mathematics Springer New York 2009

[14] F Facchinei and J-S Pang Finite-Dimensional Variational Inequalities and Complemen-tarity Problems Springer New York 2003

[15] J Fan H Fang and L Han Challenges of big data analysis Nat Sci Rev 1 (2014)pp 293ndash314

[16] M A T Figueiredo R D Nowak and S J Wright Gradient projection for sparsereconstruction Application to compressed sensing and other inverse problems IEEE JSele Topics Signal Process 1 (2007) pp 586ndash597

[17] A Fischer Local behavior of an iterative framework for generalized equations with nonisolatedsolutions Math Program 94 (2002) pp 91ndash124

[18] K Fountoulakis J Gondzio and P Zhlobich Matrix-free interior point method for com-pressed sensing problems Math Program Comput 6 (2014) pp 1ndash31

[19] D Gabay and B Mercier A dual algorithm for the solution of nonlinear variational problemsvia finite element approximations Comput Math Appl 2 (1976) pp 17ndash40

[20] R Glowinski and A Marroco Sur lrsquoapproximation par elements finis drsquoordre un et laresolution par penalisation-dualite drsquoune classe de problemes de Dirichlet non lineairesRev Francaise drsquoAutomatique Inform Rech Oprsquoerationelle 9 (1975) pp 41ndash76

[21] R Goebel and R T Rockafellar Local strong convexity and local Lipschitz continuity ofthe gradient of convex functions J Convex Anal 15 (2008) pp 263ndash270

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 457

[22] G Golub and C F Van Loan Matrix Computations 3rd ed Johns Hopkins UniversityPress Baltimore MD 1996

[23] L Huang J Jia B Yu B G Chun P Maniatis and M Naik Predicting execution time ofcomputer programs using sparse polynomial regression in Advances in Neural InformationProcessing Systems 2010 pp 883ndash891

[24] A F Izmailov A S Kurennoy and M V Solodov A note on upper Lipschitz stabil-ity error bounds and critical multipliers for Lipschitz-continuous KKT systems MathProgram 142 (2013) pp 591ndash604

[25] K Jiang D F Sun and K-C Toh A partial proximal point algorithm for nuclear normregularized matrix least squares problems Math Program Comput 6 (2014) pp 281ndash325

[26] N Keskar J Nocedal F Oztoprak and A Wachter A second-order method for convex`1-regularized optimization with active-set prediction Optim Methods Software 31 (2016)pp 605ndash621

[27] D Klatte Upper Lipschitz behavior of solutions to perturbed C11 optimization problemsMath Program 88 (2000) pp 169ndash180

[28] S Kogan D Levin B R Routledge J S Sagi and N A Smith Predicting Risk fromFinancial Reports with Regression NAACL-HLT 2009 Boulder CO 2009

[29] J D Lee Y Sun and M A Saunders Proximal Newton-type methods for minimizingcomposite functions SIAM J Optim 24 (2014) pp 1420ndash1443

[30] M Lichman UCI Machine Learning Repository httparchiveicsuciedumldatasetshtml[31] X D Li D F Sun and K-C Toh QSDPNAL A Two-Phase Proximal Aug-

mented Lagrangian Method for Convex Quadratic Semidefinite Programming preprintarXiv151208872 2015

[32] J Liu S Ji and J Ye SLEP Sparse Learning with Efficient Projections Arizona StateUniversity 2009

[33] Z-Q Luo and P Tseng On the linear convergence of descent methods for convex essentiallysmooth minimization SIAM J Control Optim 30 (1992) pp 408ndash425

[34] F J Luque Asymptotic convergence analysis of the proximal point algorithm SIAM J ControlOptim 22 (1984) pp 277ndash293

[35] R Mifflin Semismooth and semiconvex functions in constrained optimization SIAM J Con-trol Optim 15 (1977) pp 959ndash972

[36] A Milzarek and M Ulbrich A semismooth Newton method with multidimensional filterglobalization for `1-optimization SIAM J Optim 24 (2014) pp 298ndash333

[37] Y Nesterov A method of solving a convex programming problem with convergence rateO(1k2) Soviet Math Dokl 27 (1983) pp 372ndash376

[38] L Qi and J Sun A nonsmooth version of Newtonrsquos method Math Program 58 (1993)pp 353ndash367

[39] S M Robinson Some continuity properties of polyhedral multifunctions in MathematicalProgramming at Oberwolfach Math Program Stud Springer Berlin Heidelberg 1981pp 206ndash214

[40] R T Rockafellar Convex Analysis Princeton University Press Princeton NJ 1970[41] R T Rockafellar Monotone operators and the proximal point algorithm SIAM J Control

Optim 14 (1976) pp 877ndash898[42] R T Rockafellar Augmented Lagrangians and applications of the proximal point algorithm

in convex programming Math Oper Res 1 (1976) pp 97ndash116[43] R T Rockafellar and R J-B Wets Variational Analysis Grundlehren Math Wiss

Springer-Verlag Berlin 1998[44] M Schmidt Graphical Model Structure Learning with `1-Regularization PhD thesis De-

partment of Computer Science The University of British Columbia 2010[45] D F Sun and J Sun Semismooth matrix-valued functions Math Oper Res 27 (2002)

pp 150ndash169[46] J Sun On Monotropic Piecewise Quadratic Programming PhD thesis Department of Math-

ematics University of Washington 1986[47] P Tseng and S Yun A coordinate gradient descent method for nonsmooth separable mini-

mization Math Program 125 (2010) pp 387ndash423[48] R Tibshirani Regression shrinkage and selection via the lasso J Roy Statist Soc Ser B

58 (1996) pp 267ndash288[49] Z Wen W Yin D Goldfarb and Y Zhang A fast algorithm for sparse reconstruction

based on shrinkage subspace optimization and continuation SIAM J Sci Comput 32(2010) pp 1832ndash1857

[50] S J Wright R D Nowak and M A T Figueiredo Sparse reconstruction by separableapproximation IEEE Trans Signal Process 57 (2009) pp 2479ndash2493

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

458 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

[51] X Xiao Y Li Z Wen and L Zhang Semi-smooth second-order type methods for compositeconvex programs preprint arXiv160307870 2016

[52] L Yang D F Sun and K-C Toh SDPNAL+ A majorized semismooth Newton-CGaugmented Lagrangian method for semidefinite programming with nonnegative constraintsMath Program Comput 7 (2015) pp 331ndash366

[53] M-C Yue Z Zhou and A M-C So Inexact Regularized Proximal Newton Method Prov-able Convergence Guarantees for Non-Smooth Convex Minimization without Strong Con-vexity preprint arXiv160507522 2016

[54] X Y Zhao D F Sun and K-C Toh A Newton-CG augmented Lagrangian method forsemidefinite programming SIAM J Optim 20 (2010) pp 1737ndash1765

[55] X Zhang M Burger and S Osher A unified primal-dual algorithm framework based onBregman iteration J Sci Comput 49 (2011) pp 20ndash46

[56] H Zhou The adaptive lasso and its oracle properties J Amer Statist Assoc 101 (2006)pp 1418ndash1429

[57] H Zhou and T Hastie Regularization and variable selection via the elastic net J RoyStatist Soc Ser B 67 (2005) pp 301ndash320

[58] Z Zhou and A M-C So A unified approach to error bounds for structured convex optimiza-tion problems Math Program 165 (2017) pp 689ndash728

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

  • Introduction
  • Preliminaries
  • An augmented Lagrangian method with asymptotic superlinear convergence
    • SSNAL A semismooth Newton augmented Lagrangian algorithm for (D)
    • Solving the augmented Lagrangian subproblems
    • An efficient implementation of SSN for solving subproblems (18)
      • Numerical experiments for lasso problems
        • Numerical results for large-scale regression problems
        • Numerical results for Sparco collection
          • Conclusion
          • References
Page 24: XUDONG LI optimization models with an emphasis …...xDepartment of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

456 XUDONG LI DEFENG SUN AND KIM-CHUAN TOH

incorporation of the semismooth Newton method our Algorithm Ssnal is able to fullyexploit the second order sparsity of the problems Numerical results have convinc-ingly demonstrated the superior efficiency and robustness of our algorithm in solvinglarge-scale `1-regularized LS problems Based on extensive numerical evidence wefirmly believe that our algorithmic framework can be adapted to design robust andefficient solvers for various large-scale convex composite optimization problems

Acknowledgments The authors would like to thank the Associate Editor andanonymous referees for their helpful suggestions The authors would also like tothank Dr Ying Cui and Dr Chao Ding for numerous discussions on the error boundconditions and metric subregularity

REFERENCES

[1] A Y Aravkin J V Burke D Drusvyatskiy M P Friedlander and S Roy Level-setmethods for convex optimization preprint arXiv160201506 2016

[2] A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linear inverseproblems SIAM J Imaging Sci 2 (2009) pp 183ndash202

[3] S Becker J Bobin and E J Candes NESTA A fast and accurate first-order method forsparse recovery SIAM J Imaging Sci 4 (2011) pp 1ndash39

[4] E van den Berg and M P Friedlander Probing the Pareto frontier for basis pursuitsolutions SIAM J Sci Comput 31 (2008) pp 890ndash912

[5] E van den Berg MP Friedlander G Hennenfent FJ Herrman R Saab andO Yılmaz Sparco A testing framework for sparse reconstruction ACM Trans MathSoftw 35 (2009) pp 1ndash16

[6] R H Byrd G M Chin J Nocedal and F Oztoprak A family of second-order methodsfor convex `1-regularized optimization Math Program 159 (2016) pp 435ndash467

[7] R H Byrd J Nocedal and F Oztoprak An inexact successive quadratic approximationmethod for L-1 regularized optimization Math Program 157 (2016) pp 375ndash396

[8] C-C Chang and C-J Lin LIBSVM A library for support vector machines ACM TransIntel Syst Tech 2 (2011) 27

[9] S S Chen D L Donoho and M A Saunders Atomic decomposition by basis pursuitSIAM J Sci Comput 20 (1998) pp 33ndash61

[10] F Clarke Optimization and Nonsmooth Analysis John Wiley and Sons New York 1983[11] A L Dontchev and R T Rockafellar Characterizations of Lipschitzian stability in

nonlinear programming in Mathematical programming with Data Perturbations LectureNotes in Pure and Appl Math 195 Dekker New York 1998 pp 65ndash82

[12] Y Cui D F Sun and K-C Toh On the asymptotic superlinear convergence of the aug-mented Lagrangian method for semidefinite programming with multiple solutions preprintarXiv161000875 2016

[13] A L Dontchev and R T Rockafellar Implicit Functions and Solution Mappings SpringerMonographs in Mathematics Springer New York 2009

[14] F Facchinei and J-S Pang Finite-Dimensional Variational Inequalities and Complemen-tarity Problems Springer New York 2003

[15] J Fan H Fang and L Han Challenges of big data analysis Nat Sci Rev 1 (2014)pp 293ndash314

[16] M A T Figueiredo R D Nowak and S J Wright Gradient projection for sparsereconstruction Application to compressed sensing and other inverse problems IEEE JSele Topics Signal Process 1 (2007) pp 586ndash597

[17] A Fischer Local behavior of an iterative framework for generalized equations with nonisolatedsolutions Math Program 94 (2002) pp 91ndash124

[18] K Fountoulakis J Gondzio and P Zhlobich Matrix-free interior point method for com-pressed sensing problems Math Program Comput 6 (2014) pp 1ndash31

[19] D Gabay and B Mercier A dual algorithm for the solution of nonlinear variational problemsvia finite element approximations Comput Math Appl 2 (1976) pp 17ndash40

[20] R Glowinski and A Marroco Sur lrsquoapproximation par elements finis drsquoordre un et laresolution par penalisation-dualite drsquoune classe de problemes de Dirichlet non lineairesRev Francaise drsquoAutomatique Inform Rech Oprsquoerationelle 9 (1975) pp 41ndash76

[21] R Goebel and R T Rockafellar Local strong convexity and local Lipschitz continuity ofthe gradient of convex functions J Convex Anal 15 (2008) pp 263ndash270

Dow

nloa

ded

022

218

to 1

581

321

751

20 R

edis

trib

utio

n su

bjec

t to

SIA

M li

cens

e or

cop

yrig

ht s

ee h

ttp

ww

ws

iam

org

jour

nals

ojs

aph

p

Copyright copy by SIAM Unauthorized reproduction of this article is prohibited

SEMISMOOTH NEWTON ALM 457

[22] G Golub and C F Van Loan Matrix Computations 3rd ed Johns Hopkins UniversityPress Baltimore MD 1996

[23] L Huang J Jia B Yu B G Chun P Maniatis and M Naik Predicting execution time ofcomputer programs using sparse polynomial regression in Advances in Neural InformationProcessing Systems 2010 pp 883ndash891

[24] A F Izmailov A S Kurennoy and M V Solodov A note on upper Lipschitz stabil-ity error bounds and critical multipliers for Lipschitz-continuous KKT systems MathProgram 142 (2013) pp 591ndash604

[25] K Jiang D F Sun and K-C Toh A partial proximal point algorithm for nuclear normregularized matrix least squares problems Math Program Comput 6 (2014) pp 281ndash325

[26] N Keskar J Nocedal F Oztoprak and A Wachter A second-order method for convex`1-regularized optimization with active-set prediction Optim Methods Software 31 (2016)pp 605ndash621

[27] D Klatte Upper Lipschitz behavior of solutions to perturbed C11 optimization problemsMath Program 88 (2000) pp 169ndash180

[28] S Kogan D Levin B R Routledge J S Sagi and N A Smith Predicting Risk fromFinancial Reports with Regression NAACL-HLT 2009 Boulder CO 2009

[29] J D Lee Y Sun and M A Saunders Proximal Newton-type methods for minimizingcomposite functions SIAM J Optim 24 (2014) pp 1420ndash1443

[30] M Lichman UCI Machine Learning Repository httparchiveicsuciedumldatasetshtml[31] X D Li D F Sun and K-C Toh QSDPNAL A Two-Phase Proximal Aug-

mented Lagrangian Method for Convex Quadratic Semidefinite Programming preprintarXiv151208872 2015

[32] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, 2009.

[33] Z.-Q. Luo and P. Tseng, On the linear convergence of descent methods for convex essentially smooth minimization, SIAM J. Control Optim., 30 (1992), pp. 408–425.

[34] F. J. Luque, Asymptotic convergence analysis of the proximal point algorithm, SIAM J. Control Optim., 22 (1984), pp. 277–293.

[35] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM J. Control Optim., 15 (1977), pp. 959–972.

[36] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization, SIAM J. Optim., 24 (2014), pp. 298–333.

[37] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372–376.

[38] L. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.

[39] S. M. Robinson, Some continuity properties of polyhedral multifunctions, in Mathematical Programming at Oberwolfach, Math. Program. Stud., Springer, Berlin, Heidelberg, 1981, pp. 206–214.

[40] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.

[41] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.

[42] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97–116.

[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Grundlehren Math. Wiss., Springer-Verlag, Berlin, 1998.

[44] M. Schmidt, Graphical Model Structure Learning with ℓ1-Regularization, PhD thesis, Department of Computer Science, The University of British Columbia, 2010.

[45] D. F. Sun and J. Sun, Semismooth matrix-valued functions, Math. Oper. Res., 27 (2002), pp. 150–169.

[46] J. Sun, On Monotropic Piecewise Quadratic Programming, PhD thesis, Department of Mathematics, University of Washington, 1986.

[47] P. Tseng and S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program., 125 (2010), pp. 387–423.

[48] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B, 58 (1996), pp. 267–288.

[49] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang, A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation, SIAM J. Sci. Comput., 32 (2010), pp. 1832–1857.

[50] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process., 57 (2009), pp. 2479–2493.


[51] X. Xiao, Y. Li, Z. Wen, and L. Zhang, Semi-smooth second-order type methods for composite convex programs, preprint, arXiv:1603.07870, 2016.

[52] L. Yang, D. F. Sun, and K.-C. Toh, SDPNAL+: A majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7 (2015), pp. 331–366.

[53] M.-C. Yue, Z. Zhou, and A. M.-C. So, Inexact Regularized Proximal Newton Method: Provable Convergence Guarantees for Non-Smooth Convex Minimization without Strong Convexity, preprint, arXiv:1605.07522, 2016.

[54] X. Y. Zhao, D. F. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM J. Optim., 20 (2010), pp. 1737–1765.

[55] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based on Bregman iteration, J. Sci. Comput., 49 (2011), pp. 20–46.

[56] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc., 101 (2006), pp. 1418–1429.

[57] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Statist. Soc. Ser. B, 67 (2005), pp. 301–320.

[58] Z. Zhou and A. M.-C. So, A unified approach to error bounds for structured convex optimization problems, Math. Program., 165 (2017), pp. 689–728.
