A LEVEL-SET METHOD FOR CONVEX OPTIMIZATION WITH A FEASIBLE SOLUTION PATH

QIHANG LIN*, SELVAPRABU NADARAJAH†, AND NEGAR SOHEILI‡

* Tippie College of Business, University of Iowa, USA, [email protected]
† College of Business Administration, University of Illinois at Chicago, USA, [email protected]
‡ College of Business Administration, University of Illinois at Chicago, USA, [email protected]

Abstract. Large-scale constrained convex optimization problems arise in several application domains. First-order methods are good candidates to tackle such problems due to their low iteration complexity and memory requirements. The level-set framework extends the applicability of first-order methods to problems with complicated convex objectives and constraint sets. Current methods based on this framework either rely on the solution of challenging subproblems or do not guarantee a feasible solution, especially if the procedure is terminated before convergence. We develop a level-set method that finds an ε-relative optimal and feasible solution to a constrained convex optimization problem with a fairly general objective function and set of constraints, maintains a feasible solution at each iteration, and relies only on calls to first-order oracles. We establish the iteration complexity of our approach, also accounting for the smoothness and strong convexity of the objective function and constraints when these properties hold. The dependence of our complexity on ε is similar to the analogous dependence in the unconstrained setting, which is not known to be true for level-set methods in the literature. Nevertheless, ensuring feasibility is not free: the iteration complexity of our method depends on a condition number, while existing level-set methods that do not guarantee feasibility can avoid such dependence.

Key words. constrained convex optimization, level-set technique, first-order methods, complexity analysis

AMS subject classifications. 90C25, 90C30, 90C52

1. Introduction. Large-scale constrained convex optimization problems arise in several business, science, and engineering applications. A commonly encountered form for such problems is

(1)    f* := min_{x ∈ X} f(x)
(2)    s.t.  g_i(x) ≤ 0,    i = 1, . . . , m,

where X ⊆ R^n is a closed and convex set, and f and g_i, i = 1, . . . , m, are convex real functions defined on X. Given ε > 0, a point x_ε is ε-optimal if it satisfies f(x_ε) − f* ≤ ε and ε-feasible if it satisfies max_{i=1,...,m} g_i(x_ε) ≤ ε. In addition, given 0 < ε ≤ 1 and a feasible and suboptimal solution e ∈ X, a point x_ε is ε-relative optimal with respect to e if it satisfies (f(x_ε) − f*)/(f(e) − f*) ≤ ε.

Level-set techniques [2, 3, 11, 17, 27, 28] provide a flexible framework for solving the optimization problem (1)-(2). Specifically, this problem can be reformulated as a convex root-finding problem L*(t) = 0, where L*(t) = min_{x∈X} max{f(x) − t; g_i(x), i = 1, . . . , m} (see Lemaréchal et al. [17]). Then a bundle method [16, 29, 14] or an accelerated gradient method [20, Section 2.3.5] can be applied to handle the minimization subproblem to estimate L*(t) and facilitate the root-finding process. In general, these methods require solving a nontrivial quadratic program at each iteration.

An important variant of (1)-(2) arising in statistics, image processing, and signal processing occurs when f(x) takes a simple form, such as a gauge function, and there is a single constraint with the structure ρ(Ax − b) ≤ λ, where ρ is a convex function, A is a matrix, b is a vector, and λ is a scalar. In this setting, Van Den Berg and Friedlander [27, 28] and Aravkin et al. [2] use a level-set approach to swap the roles of the objective function and constraints in (1)-(2) to obtain the convex root-finding problem v(t) = 0 with v(t) := min_{x∈X, f(x)≤t} ρ(Ax − b) − λ. Harchaoui et al. [11] proposed a similar approach. The root-finding problem arising in this approach is solved using inexact Newton and secant methods in [27, 28, 2] and [2], respectively. Since f(x) is assumed to be simple, projection onto the feasible set {x ∈ R^n | x ∈ X, f(x) ≤ t} is possible and the subproblem in the definition of v(t) can be solved by various first-order methods. Although the assumption on the objective function is important, the constraint structure is not limiting because (2) can be obtained by choosing A to be an identity matrix, b = 0, λ = 0, and ρ(x) = max_{i=1,...,m} g_i(x).

The level-set techniques discussed above find an ε-feasible and ε-optimal solution to (1)-(2); that is, the terminal solution may not be feasible. This issue can be overcome when the infeasible terminal solution is also super-optimal and a strictly feasible solution to the problem exists. In this case, a radial-projection scheme [2, 25] can be applied to obtain a feasible and ε-relative optimal solution. Super-optimality of the terminal solution holds if the constraint f(x) ≤ t is satisfied at termination. This condition can be enforced when projections onto the feasible set with this constraint are computationally cheap, which typically requires f(x) to be simple (e.g., see [2]). Otherwise, such a guarantee does not exist.

Building on this rich literature, we develop a feasible level-set method that (i) employs a first-order oracle to solve subproblems and compute a feasible and ε-relative optimal solution to (1)-(2) with a general objective function and set of constraints while also maintaining feasible intermediate solutions at each iteration (i.e., a feasible solution path), and (ii) avoids potentially costly projections. Our focus on maintaining feasibility is especially desirable in situations where level-set methods need to be terminated before convergence, for instance due to a time constraint. In this case, our level-set approach would return a feasible, albeit possibly suboptimal, solution that can be implemented, whereas the solution from existing level-set approaches may not be implementable due to potentially large constraint violations. Convex optimization problems where constraints model operating conditions that need to be satisfied by solutions for implementation arise in several applications, including network and nonlinear resource allocation, signal processing, and circuit design [7, 8, 23].

The radial sub-gradient method proposed by Renegar [25] and extended by Grimmer [10] also emphasizes feasibility. This method avoids projection and finds a feasible and ε-relative optimal solution by performing a line search at each iteration. This line search, while easy to execute for some linear or conic programs, can be challenging for general convex programs (1)-(2). Moreover, it is unknown whether the iteration complexity of this approach can be improved by utilizing the potential smoothness and strong convexity of the objective function and constraints. In contrast, the feasible level-set approach in this paper avoids the need for line search, in addition to projection, and its iteration complexity leverages the strong convexity and smoothness of the objective and constraints when these properties hold.

Our work is also related to first-order methods, which are popular for solving large-scale convex optimization problems due to their low per-iteration cost and memory requirement, for example compared to interior point methods [12]. Most existing first-order methods find a feasible and ε-optimal solution to a version of the optimization problem (1)-(2) with a simple feasible region (e.g., a box, simplex, or ball), which makes projection onto the feasible set at each iteration inexpensive. Ensuring feasibility using projection becomes difficult for general constraints (2). Some sub-gradient methods, including the method by Nesterov [20, Section 3.2.4] and its recent variants by Lan and Zhou [15] and Bayandina [4], can be applied to (1)-(2) without the need for projection mappings. However, these methods do not ensure feasibility of the intermediate solutions as we do; instead, they find an ε-feasible and ε-optimal solution to (1)-(2) at convergence.

Given the focus of our research, one naturally wonders whether there is an associated computational cost of ensuring feasibility in the context of level-set methods. Our analysis suggests that the answer depends on the perspective one takes. First, consider the dependence of the iteration complexity on a condition measure as a criterion. A nice property of the level-set method in [2] is that its iteration complexity is independent of a condition measure for finding an ε-feasible and ε-optimal solution. Instead, finding an ε-relative optimal and feasible solution leads to iteration complexities with dependence on condition measures when (i) combining the level-set approach [2] and the transformation in [25], or (ii) using our feasible level-set approach. This suggests that a cost of ensuring feasibility is the presence of a condition measure in the iteration complexity. Second, suppose we consider the dependence of the number of iterations on ε as our assessment criterion. The complexity result of the level-set approach provided by [2] can be interpreted as the product of the iteration complexity of the oracle used to solve the subproblem and an O(log(1/ε)) factor. The dependence of our algorithm would be similar under a non-adaptive analysis akin to [2]. However, using a novel adaptive analysis we find that the iteration complexity of our feasible level-set approach can in fact eliminate the O(log(1/ε)) factor. In other words, our complexity is analogous to the known complexity of first-order methods for unconstrained convex optimization, and it is unclear to us whether a similar adaptive analysis can be applied to the approach in [2]. Overall, in terms of the dependence on ε, there appears to be no cost to ensuring feasibility.

The rest of this paper is organized as follows. Section 2 introduces the level-set formulation and our feasible level-set approach, also establishing its outer iteration complexity to compute an ε-relative optimal and feasible solution to the optimization problem (1)-(2). Section 3 describes two first-order oracles that solve the subproblem of our level-set formulation when the functions f and g_i, i = 1, . . . , m, are smooth and non-smooth, and in each case accounts for the potential strong convexity of these functions. The inner iteration complexity of each oracle is also analyzed in this section. Section 4 presents an adaptive complexity analysis to establish the overall iteration complexity of our feasible level-set approach.

2. Feasible Level-set Method. Our methodological developments rely on the assumption below.

Assumption 1. There exists a solution e ∈ X such that max_{i=1,...,m} g_i(e) < 0 and f(e) > f*. We label such a solution g-strictly feasible.

Given t ∈ R, we define

(3)    L*(t) := min_{x∈X} L(t, x),

where L(t, x) := max{f(x) − t; g_i(x), i = 1, . . . , m} for all x ∈ X. Let ∂L*(t) denote the sub-differential of L* at t. Lemma 1 summarizes known properties of L*(t).

Lemma 1 (Lemmas 2.3.4, 2.3.5, and 2.3.6 in [20]). It holds that
(a) L*(t) is non-increasing and convex in t;
(b) L*(t) ≤ 0 if t ≥ f*, and L*(t) > 0 if t < f*. Moreover, under Assumption 1, it holds that L*(t) < 0 for any t > f*;
(c) given ∆ ≥ 0, we have L*(t) − ∆ ≤ L*(t + ∆) ≤ L*(t). Therefore, −1 ≤ ζ_{L*} ≤ 0 for all ζ_{L*} ∈ ∂L*(t) and t ∈ R.

Proof. We only prove the second part of item (b), as the proofs of the other items are provided in [20]. In particular, we show L*(t) < 0 for any t > f* under Assumption 1. Let x* denote an optimal solution to (1)-(2). If t > f*, we have f(x*) − t < 0. Since f is a continuous function, there exists δ > 0 such that for any x ∈ B_δ(x*) ∩ X we have f(x) − t < 0, where B_δ(x*) is a ball of radius δ centered at x*. Take x ∈ B_δ(x*) ∩ X such that x = λx* + (1 − λ)e for some λ ∈ (0, 1), where e is a solution that satisfies Assumption 1. By the convexity of the functions g_i, i = 1, . . . , m, the feasibility of x*, and the g-strict feasibility of e, we have g_i(x) ≤ λg_i(x*) + (1 − λ)g_i(e) < 0 for i = 1, . . . , m. In addition, since x ∈ B_δ(x*), we have f(x) − t < 0. Therefore, L*(t) ≤ max{f(x) − t; g_i(x), i = 1, . . . , m} < 0.

Lemma 1(b), specifically L*(t) < 0 for any t > f*, ensures that t* = f* is the unique root of L*(t) = 0 and that the solution x*(t*) attaining L*(t*) is an optimal solution to (1)-(2). In this case, it is possible to find an ε-relative optimal and feasible solution by closely approximating t* from above with a t and then finding a point x ∈ X with L(t, x) < 0. The latter condition ensures x is feasible. In order to obtain such a t, we start with an initial value t_0 > t* and generate a decreasing sequence t_0 ≥ t_1 ≥ t_2 ≥ · · · ≥ t*, where t_k is updated to t_{k+1} based on a term obtained by solving (3) with t = t_k up to a certain optimality gap. We present this approach in Algorithm 1, which utilizes an oracle A defined as follows.

Definition 1. An algorithm is called an oracle A if, given t ≥ f*, x ∈ X, α > 1, and γ ≥ L(t, x) − L*(t), it finds a point x+ ∈ X such that L*(t) ≥ αL(t, x+).

Algorithm 1 Feasible Level-Set Method
1: Input: g-strictly feasible solution e, and parameters α > 1 and 0 < ε ≤ 1.
2: Initialization: Set x_0 = e and t_0 = f(x_0).
3: for k = 0, 1, 2, . . . do
4:   Choose γ_k such that γ_k ≥ L(t_k, x_k) − L*(t_k).
5:   Generate x_{k+1} = A(t_k, x_k, α, γ_k) such that L*(t_k) ≥ αL(t_k, x_{k+1}).
6:   if αL(t_k, x_{k+1}) ≥ L(t_0, x_1)ε then
7:     Terminate and return x_{k+1}.
8:   end if
9:   Set t_{k+1} = t_k + (1/2)L(t_k, x_{k+1}).
10: end for
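For concreteness, the following is a minimal Python sketch of Algorithm 1. It assumes a user-supplied routine oracle(t, x, alpha, gamma) satisfying Definition 1 (for instance, Oracle 1 or Oracle 2 of §3) and callables f and g returning the objective value and the vector of constraint values; these names, the helper level_fn, and the safety cap max_outer are illustrative and not part of the paper.

import numpy as np

def level_fn(t, x, f, g):
    """L(t, x) = max{f(x) - t; g_1(x), ..., g_m(x)} as defined in Section 2."""
    return max(f(x) - t, float(np.max(g(x))))

def feasible_level_set(f, g, e, oracle, alpha=2.0, eps=1e-2, gamma0=None, max_outer=10_000):
    """Sketch of Algorithm 1 (feasible level-set method).

    oracle(t, x, alpha, gamma) must return x_plus with L*(t) >= alpha * L(t, x_plus)
    (Definition 1).  e is a g-strictly feasible point and gamma0 should upper bound
    L(t_0, e) - L*(t_0) (it may be None for oracles that ignore it, such as Oracle 2).
    """
    x, t = e, f(e)                             # x_0 = e, t_0 = f(x_0)
    gamma, L01 = gamma0, None                  # L01 will store L(t_0, x_1)
    for _ in range(max_outer):
        x = oracle(t, x, alpha, gamma)         # x_{k+1} with L*(t_k) >= alpha * L(t_k, x_{k+1})
        val = level_fn(t, x, f, g)             # upper bound on L*(t_k); val <= 0 keeps x feasible
        if L01 is None:
            L01 = val
        if alpha * val >= L01 * eps:           # stopping criterion of Algorithm 1
            break
        t = t + 0.5 * val                      # level update t_{k+1} = t_k + L(t_k, x_{k+1}) / 2
        gamma = -alpha * val                   # valid gap bound for the next call (Table 1, smooth strongly convex row)
    return x

Note that every returned iterate satisfies L(t_k, x_{k+1}) ≤ 0 and is therefore feasible, which is exactly the property established in Theorem 2 below.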

Given a g-strictly feasible solution e and α > 1, Algorithm 1 finds an ε-relative optimal and feasible solution by calling an oracle A with input tuple (t_k, x_k, α, γ_k) at each iteration k. This oracle provides a lower bound, αL(t_k, x_{k+1}) ≤ 0, and an upper bound, L(t_k, x_{k+1}) ≤ 0, for L*(t_k). We use the upper bound to update t_k and the lower bound in the stopping criterion. In particular, Algorithm 1 terminates when αL(t_k, x_{k+1}) is greater than or equal to L(t_0, x_1)ε. We discuss possible choices for oracle A in §3 and its input γ in Table 1. Theorem 2 shows that Algorithm 1 is a feasible level-set method, that is, the x_k generated at each iteration k is feasible, and that it terminates with an ε-relative optimal solution. It also establishes its outer iteration complexity. This complexity depends on a condition measure

    β := −L*(t_0)/(t_0 − t*)

for problem (1)-(2).

Fig. 1: Illustration of the condition measure β for t* = 0 and t_0 = 1: (a) well conditioned case; (b) poorly conditioned case.

The suboptimality of x_0 = e implies t_0 = f(x_0) > t*. By Lemma 1 we can show that β ∈ (0, 1]. Larger values of β signify a better conditioned problem, as illustrated in Figure 1, because we iteratively use a close approximation of L*(t) to approach t* from t_0; therefore, a larger |L*(t)| intuitively suggests faster movement towards t*. In addition, the following inequalities can be easily verified from the convexity of L*(t) and Lemma 1(c):

(4)    −β = L*(t_0)/(t_0 − t*) ≥ L*(t)/(t − t*) ≥ −1,    for all t ∈ (t*, t_0].
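For completeness, here is a short verification of (4), written in LaTeX and using only the convexity of L*, the fact that L*(t*) = 0, and Lemma 1(c); it is not spelled out in the original text.

\[
  \frac{L^*(t)}{t - t^*}
  \;=\; \frac{L^*(t) - L^*(t^*)}{t - t^*}
  \;\le\; \frac{L^*(t_0) - L^*(t^*)}{t_0 - t^*}
  \;=\; \frac{L^*(t_0)}{t_0 - t^*}
  \;=\; -\beta,
  \qquad
  \frac{L^*(t)}{t - t^*}
  \;\ge\; \frac{-(t - t^*)}{t - t^*}
  \;=\; -1,
  \qquad \forall\, t \in (t^*, t_0].
\]

The upper bound holds because the secant slope of the convex function L* through t* is non-decreasing in t and t ≤ t_0; the lower bound applies Lemma 1(c) with ∆ = t − t*, which gives L*(t) ≥ L*(t*) − (t − t*) = −(t − t*).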

Theorem 2. Given α > 1, 0 < ε ≤ 1, and a g-strictly feasible solution e, Algorithm 1 maintains a feasible solution at each iteration and ensures

(5)    t_{k+1} − t* ≤ (1 − β/(2α))(t_k − t*).

In addition, it returns an ε-relative optimal solution to (1)-(2) in at most

    K := ⌈log_{(1−β/(2α))^{−1}}(α²/(βε))⌉ + 1

outer iterations.

Proof. Notice that t_0 > t* since t_0 = f(e). Suppose t* ≤ t_k ≤ t_0 at iteration k. At iteration k + 1 we have

    t_{k+1} − t* = t_k − t* + (1/2)L(t_k, x_{k+1})
                 ≥ t_k − t* + (1/2)L*(t_k)
                 ≥ t_k − t* − (1/2)(t_k − t*)
                 ≥ 0,

where the first equality is obtained by substituting for t_{k+1} using the update equation for t in Algorithm 1, the first inequality using the definition of L*(t_k), the second inequality from (4) with t = t_k, and the final inequality from t_k − t* ≥ 0. This implies that t_k ≥ t* for all k. In addition, the condition L*(t_k) ≥ αL(t_k, x_{k+1}) satisfied by the oracle A, α > 1, and Lemma 1(b) imply that L(t_k, x_{k+1}) ≤ L*(t_k)/α ≤ 0. It thus follows that t_{k+1} ≤ t_k and that the x_{k+1} maintained by Algorithm 1 is feasible for all k.

Next we establish (5) and the outer iteration complexity of Algorithm 1. Using the definition of t_{k+1}, α > 1, and the condition L*(t_k) ≥ αL(t_k, x_{k+1}) satisfied by oracle A, we write

(6)    t_{k+1} − t* = t_k − t* + (1/2)L(t_k, x_{k+1}) ≤ t_k − t* + L*(t_k)/(2α).

Using (4) with t = t_k to substitute for L*(t_k) in (6) gives the required inequality (5). Recursively using (5) starting from k = 0 gives

(7)    t_k − t* ≤ (1 − β/(2α))^k (t_0 − t*).

To obtain the terminal condition, we note that, for k ≥ K − 1,

    −αL(t_k, x_{k+1}) ≤ −αL*(t_k)
                      ≤ α(t_k − t*)
                      ≤ α(1 − β/(2α))^{K−1}(t_0 − t*)
                      ≤ −(ε/α)L*(t_0)
                      ≤ −εL(t_0, x_1),

where the first inequality holds by the definition of L*(t); the second using (4) for t = t_k; the third using (7); and the fourth by applying the definitions of K and β. The last inequality follows from the condition satisfied by the oracle at t_0 and x_1, that is, αL(t_0, x_1) ≤ L*(t_0). Hence, Algorithm 1 terminates after at most K outer iterations.

Finally, we proceed to show that the terminal solution is ε-relative optimal. Observe that

(8)    f(x_{k+1}) − f* = f(x_{k+1}) − t* ≤ t_k − t*,

where the equality is a consequence of the definition of t* and the inequality is obtained using f(x_{k+1}) ≤ t_k, which holds because f(x_{k+1}) − t_k ≤ L(t_k, x_{k+1}) ≤ 0. Combining t_k − t* ≤ (t_0 − t*)L*(t_k)/L*(t_0) from (4) and (8), we obtain f(x_{k+1}) − f* ≤ (t_0 − t*)L*(t_k)/L*(t_0). Rearranging terms results in the inequality

    (f(x_{k+1}) − f*)/(f(e) − f*) ≤ L*(t_k)/L*(t_0),

where we use t* = f* and t_0 = f(e). Indeed, L*(t_k)/L*(t_0) ≤ ε when the algorithm terminates. To verify this, the terminal condition and the property satisfied by oracle A indicate that L*(t_k) ≥ αL(t_k, x_{k+1}) ≥ L(t_0, x_1)ε ≥ L*(t_0)ε, which further implies L*(t_k)/L*(t_0) ≤ ε.


3. First-order oracles. In this section, we adapt first-order methods to be used as the oracle A in Algorithm 1 to solve (3) under different properties of the convex functions f and g_i, i = 1, . . . , m. We adapt these methods in a manner that facilitates our adaptive complexity analysis in §4.

We assume the functions f and g_i, i = 1, . . . , m, are µ-convex for some µ ≥ 0. Namely, for h = f, g_1, . . . , g_m, we have

(9)    h(x) ≥ h(y) + ⟨ζ_h, x − y⟩ + (µ/2)‖x − y‖_2²

for any x and y in X and ζ_h ∈ ∂h(y), where ∂h(y) denotes the set of all sub-gradients of h at y ∈ X and ‖·‖_2 denotes the 2-norm. We call the functions f and g_i, i = 1, . . . , m, convex if µ = 0 and strongly convex if µ > 0. Let

    D := max_{x∈X} ‖x‖_2   and   M := max_{x∈X} max{‖ζ_f‖_2; ‖ζ_{g_i}‖_2, i = 1, . . . , m},

where ζ_f ∈ ∂f(x) and ζ_{g_i} ∈ ∂g_i(x), i = 1, . . . , m. When f and g_i, i = 1, . . . , m, are smooth, M := max_{x∈X} max{‖∇f(x)‖_2; ‖∇g_i(x)‖_2, i = 1, . . . , m}.

We make Assumption 2 in the rest of the paper.

Assumption 2. M < +∞ and either (i) µ > 0 or (ii) µ = 0 and D < +∞.

Assuming M to be finite is a common assumption in the literature when developing first-order methods for solving smooth min-max problems (see, e.g., [6, 24]) and non-smooth problems (see, e.g., [9, 18, 22]). The assumption of finite D is needed to obtain computable stopping criteria for the first-order oracles that we use.

3.1. Smooth convex functions. In this subsection, in addition to assumptions 1 and 2, we also assume that the functions in the objective and constraints of (1)-(2) are smooth, as formally stated below.

Assumption 3. The functions f and g_i, i = 1, . . . , m, are S-smooth for some S ≥ 0; namely, for h = f, g_1, . . . , g_m, h(x) ≤ h(y) + ⟨∇h(y), x − y⟩ + (S/2)‖x − y‖_2² for any x and y in X.

A first-order method achieves a fast linear convergence rate when the objective function is both smooth and strongly convex [20]. Despite this smoothness assumption, since the function L(t, x) is a maximum of the functions modeling the objective and constraints, it may be non-smooth in x. Nevertheless, we can approximate L(t, x) by a smooth and strongly convex function of x using a smoothing method [6, 21] and a standard regularization technique. Applying first-order methods to minimize this smooth approximation of L(t, x) leads to a lower iteration complexity than directly working with the original non-smooth function. The technique we utilize is a modified version of the smoothing and regularization reduction method by Allen-Zhu and Hazan [1]. However, since the formulation and the non-smooth structure of (3) are quite different from the problems considered in [1], we present below the reduction technique and conduct the complexity analysis under our framework.

σlog

(exp(σ(f(x)− t)) +

m∑i=1

exp(σgi(x))

)+λ

2‖x‖22,(10)271

272

where σ > 0 and λ ≥ 0 are smoothing and regularization parameters, respectively.273
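As an illustration (not part of the paper's development), L_{σ,λ}(t, x) in (10) can be evaluated in a numerically stable way with a log-sum-exp routine, which avoids overflow for large σ; the callables f and g below are placeholders for the problem data.

import numpy as np
from scipy.special import logsumexp

def smoothed_level_fn(t, x, f, g, sigma, lam):
    """Evaluate L_{sigma,lambda}(t, x) from (10): exponential (log-sum-exp) smoothing
    of max{f(x) - t; g_i(x)} plus the quadratic regularizer (lam/2)*||x||_2^2."""
    terms = np.concatenate(([f(x) - t], np.atleast_1d(g(x))))   # the m + 1 smoothed terms
    smooth_max = logsumexp(sigma * terms) / sigma                # stable (1/sigma) log(sum exp(sigma * terms))
    return smooth_max + 0.5 * lam * float(np.dot(x, x))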

Lemma 2 is based on related arguments in [6] and summarizes properties of L_{σ,λ}(t, x) and its "closeness" to L(t, x).


Lemma 2. Suppose assumptions 2 and 3 hold. Given σ > 0 and λ ≥ 0, the function L_{σ,λ}(t, x) is (i) (σM² + S)-smooth and (ii) (λ + µ)-convex in x. Moreover, we have

(11)    0 ≤ L_{σ,λ}(t, x) − L(t, x) ≤ log(m + 1)/σ + λD²/2.

Here, we follow the convention that 0·(+∞) = 0, so that λD² = 0 if λ = 0 and D = +∞.

Proof. Item (i) and the relationship (11) are minor modifications of Proposition 4.1 and Example 4.9 in [6], respectively. Item (ii) follows directly from the µ-convexity of f and g_i, i = 1, . . . , m, and the λ-convexity of the function (λ/2)‖x‖_2².

We solve the following version of (3), with L(t, x) replaced by L_{σ,λ}(t, x):

    min_{x∈X} L_{σ,λ}(t, x).

Specifically, we apply a variant of Nesterov's accelerated gradient method [5, 26] for a given t > t* and a sequence of (σ, λ) with σ increasing to infinity and λ decreasing to zero. We refer to this gradient method as the restarted smoothing gradient method (RSGM) and summarize its steps in Oracle 1. The choice of (σ, λ) and the number of iterations performed by RSGM depend on whether f and g_i, i = 1, . . . , m, are strongly convex (µ > 0) or convex (µ = 0). Moreover, RSGM is restarted after N_i accelerated gradient iterations with a smaller precision γ_i until this precision falls below the required threshold of (1 − α)L(t, x_i). Theorem 3 establishes that Oracle 1 returns a solution x+ ∈ X such that L*(t) ≥ αL(t, x+). Hence, RSGM can be used as an oracle A (see Definition 1).

Theorem 3. Suppose assumptions 1, 2, and 3 hold. Given a level t ∈ (t*, t_0], a solution x ∈ X, and parameters α > 1 and γ ≥ L(t, x) − L*(t), Oracle 1 returns a solution x+ ∈ X that satisfies L*(t) ≥ αL(t, x+) in at most

    (√(40S/µ) + 1) log_2( αγ/((1 − α)L*(t)) + 1 ) + 3M √( 160α log(m + 1)/(µ(1 − α)L*(t)) )

iterations of the accelerated gradient method when µ > 0, and at most

    12D √( 10Sα/((1 − α)L*(t)) ) + 4MDα √(80 log(m + 1))/((1 − α)L*(t)) + log_2( αγ/((1 − α)L*(t)) + 1 )

such iterations when µ = 0 and D < +∞.

Oracle 1 Restarted Smoothing Gradient Method: RSGM(t, x, α, γ)
1: Input: Level t, initial solution x ∈ X, constant α > 1, and initial precision γ ≥ L(t, x) − L*(t).
2: Set x_0 = x and γ_0 = γ.
3: for i = 0, 1, 2, . . . do
4:   if γ_i ≤ (1 − α)L(t, x_i) then
5:     Terminate and return x+ = x_i.
6:   end if
7:   Set
       σ_i = 4 log(m + 1)/γ_i and λ_i = 0             if µ > 0,
       σ_i = 8 log(m + 1)/γ_i and λ_i = γ_i/(4D²)     if µ = 0.
8:   Starting at x_i, apply the accelerated gradient method [26, Algorithm 1] to the optimization problem min_{x∈X} L_{σ_i,λ_i}(t, x) for
       N_i = ⌈max{ √(40S/µ), M√(160 log(m + 1))/√(µγ_i) }⌉              if µ > 0,
       N_i = ⌈max{ √(160SD²/γ_i), 4MD√(80 log(m + 1))/γ_i }⌉            if µ = 0,
     iterations and let x_{i+1} be the solution returned by this method.
9:   Set γ_{i+1} = γ_i/2.
10: end for

Proof. Let x*_{σ,λ}(t) := argmin_{x∈X} L_{σ,λ}(t, x) and L*_{σ,λ}(t) := L_{σ,λ}(t, x*_{σ,λ}(t)). We will first show by induction that at each iteration i = 0, 1, 2, . . . of Oracle 1 the following inequality holds:

(12)    L(t, x_i) − L*(t) ≤ γ_i.

The validity of (12) for i = 0 holds because x_0 = x, γ_0 = γ, and we know L(t, x) − L*(t) ≤ γ. Suppose (12) holds for iteration i. We claim that it also holds for iteration i + 1 in both the strongly convex and convex cases. In fact, at iteration i + 1, we have

    L(t, x_{i+1}) − L*(t) ≤ L_{σ_i,λ_i}(t, x_{i+1}) − L*_{σ_i,λ_i}(t) + log(m + 1)/σ_i + λ_i D²/2
                          ≤ 2(σ_i M² + S)‖x_i − x*_{σ_i,λ_i}(t)‖_2² / N_i² + log(m + 1)/σ_i + λ_i D²/2
                          ≤ 4(σ_i M² + S)(L_{σ_i,λ_i}(t, x_i) − L*_{σ_i,λ_i}(t)) / ((µ + λ_i)N_i²) + log(m + 1)/σ_i + λ_i D²/2
                          ≤ (4(σ_i M² + S)/((µ + λ_i)N_i²)) (γ_i + log(m + 1)/σ_i + λ_i D²/2) + log(m + 1)/σ_i + λ_i D²/2
                          = 5γ_i(σ_i M² + S)/((µ + λ_i)N_i²) + γ_i/4
                          ≤ γ_i/8 + γ_i/8 + γ_i/4
                          = γ_i/2 = γ_{i+1}.

The first inequality follows from Lemma 2; the second from the (σ_i M² + S)-smoothness of L_{σ_i,λ_i}(t, x) and the convergence property L_{σ_i,λ_i}(t, x_{i+1}) − L*_{σ_i,λ_i}(t) ≤ (2/N_i²)(σ_i M² + S)‖x_i − x*_{σ_i,λ_i}(t)‖_2² of the accelerated gradient method [26, Corollary 1]; the third using the (λ_i + µ)-convexity of L_{σ_i,λ_i}(t, x) in x; and the fourth by combining Lemma 2 and the induction hypothesis that L(t, x_i) − L*(t) ≤ γ_i. The remaining relationships follow from the inequalities 5γ_i σ_i M²/((µ + λ_i)N_i²) ≤ γ_i/8 and 5γ_i S/((µ + λ_i)N_i²) ≤ γ_i/8, which hold for both µ > 0 and µ = 0 according to the definitions of σ_i, λ_i, and N_i.

definitions of σi, λi, and Ni.326

It can be easily verified that Oracle 1 terminates in at most I iterations, where327

I := log2

(αγ

(1−α)L∗(t) + 1)≥ 0. In fact, according to (12), for any i ≥ I, we have328

L(t, xi)− L∗(t) ≤ γi =γ0

2i≤ γ

2I≤ (1−α)

α L∗(t), which implies that αL(t, xi) ≤ L∗(t).329

Hence γi ≤ 1−αα (αL(t, xi)) = (1− α)L(t, xi) and Oracle 1 terminates.330

We next analyze the total number of iterations across all calls of the accelerated331

gradient method [26, Algorithm 1] during the execution of Oracle 1.332

Suppose µ > 0. The number of iterations taken by Oracle 1 can be bounded as333

follows:334

I−1∑i=0

Ni ≤I−1∑i=0

√40S

µ+

I−1∑i=0

M

√160 log(m+ 1)

µγi+

I−1∑i=0

1335

=

(√40S

µ+ 1

)I +M

√160 log(m+ 1)

µ

I−1∑i=0

(1

γi

)1/2

336

=

(√40S

µ+ 1

)I +M

√160 log(m+ 1)

µγ0

I−1∑i=0

2i/2337

=

(√40S

µ+ 1

)I +M

√160 log(m+ 1)

µγ

(1− 2I/2

1−√

2

)338

(√40S

µ+ 1

)log2

(αγ

(1− α)L∗(t)+ 1

)+ 3M

√160α log(m+ 1)

µ(1− α)L∗(t).339

340

The first inequality above is obtained using the inequality dae ≤ a+1 for a ∈ R+, and341

by bounding the maximum determining Ni by a sum of its terms; the first equality342

from simplifying terms; the second equality from using the definition of γi; the third343

equality from γ0 = γ and the geometric sum formula∑I−1i=0 2i/2 = (1−2I/2)/(1−

√2);344

and the last inequality by substituting for I, using the relationship√a+ 1 ≤

√a+ 1345

for a ∈ R+ to obtain the inequality346

2I/2 − 1 =

(αγ

(1− α)L∗(t)+ 1

)1/2

− 1 ≤(

αγ

(1− α)L∗(t)

)1/2

,(13)347348

and bounding 1/(√

2− 1) above by 3.349

Suppose µ = 0. The iteration complexity can be bounded using the following steps:

    Σ_{i=0}^{I−1} N_i ≤ Σ_{i=0}^{I−1} 4D √(10S/γ_i) + Σ_{i=0}^{I−1} 4MD √(80 log(m + 1))/γ_i + Σ_{i=0}^{I−1} 1
                      = 4D √(10S/γ_0) Σ_{i=0}^{I−1} 2^{i/2} + (4MD √(80 log(m + 1))/γ_0) Σ_{i=0}^{I−1} 2^i + I
                      = 4D √(10S/γ_0) (1 − 2^{I/2})/(1 − √2) + (4MD √(80 log(m + 1))/γ_0) (1 − 2^I)/(1 − 2) + I
                      ≤ 12D √( 10Sα/((1 − α)L*(t)) ) + 4αMD √(80 log(m + 1))/((1 − α)L*(t)) + log_2( αγ/((1 − α)L*(t)) + 1 ).

The first inequality follows from replacing the maximum in the definition of N_i by the sum of its terms and using ⌈a⌉ ≤ a + 1 for a ∈ R+; the first and second equalities follow by rearranging terms and using the geometric sum formula, respectively; and the last inequality by substituting for I and using the inequality (13).
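To summarize the structure of Oracle 1, here is a minimal Python sketch of its restart loop. The routines level_fn (evaluating L(t, x)) and agm_solver (running a given number of accelerated gradient iterations on min_{x∈X} L_{σ,λ}(t, x), e.g., [26, Algorithm 1]) are user-supplied placeholders and not part of the paper; the constants follow the parameter choices stated in Oracle 1.

import math

def rsgm(t, x, alpha, gamma, level_fn, agm_solver, M, S, D, mu, m):
    """Sketch of the restart loop of Oracle 1 (RSGM)."""
    log_m1 = math.log(m + 1)
    while gamma > (1 - alpha) * level_fn(t, x):       # termination test of Oracle 1
        if mu > 0:                                    # strongly convex case: no regularization needed
            sigma, lam = 4 * log_m1 / gamma, 0.0
            n_iters = math.ceil(max(math.sqrt(40 * S / mu),
                                    M * math.sqrt(160 * log_m1) / math.sqrt(mu * gamma)))
        else:                                         # convex case: regularize with lam = gamma / (4 D^2)
            sigma, lam = 8 * log_m1 / gamma, gamma / (4 * D ** 2)
            n_iters = math.ceil(max(math.sqrt(160 * S * D ** 2 / gamma),
                                    4 * M * D * math.sqrt(80 * log_m1) / gamma))
        x = agm_solver(t, x, sigma, lam, n_iters)     # restart the accelerated method at the current x
        gamma /= 2                                    # halve the target precision: gamma_{i+1} = gamma_i / 2
    return x

By Theorem 3, for t > t* the test in the while condition is eventually met, so the loop terminates.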

3.2. Non-smooth functions. In this subsection, we only enforce assumptions 1 and 2. In other words, the functions f and g_i, i = 1, . . . , m, can be non-smooth. Under Assumption 2, we have max_{x∈X} ‖ζ_{L,t}(x)‖_2 ≤ M for any ζ_{L,t}(x) ∈ ∂_x L(t, x) and t ∈ R, where ∂_x L(t, x) denotes the set of all sub-gradients of L(t, ·) at x. As oracle A, we choose the standard sub-gradient method with varying stepsizes from [19] when µ = 0, and its variant in [13] when µ > 0, to directly solve (3). (The original algorithms proposed in [13] and [19] are stochastic sub-gradient methods; here, we apply them with deterministic sub-gradients.) We present this sub-gradient method for both choices of µ in Oracle 2. The iteration complexity of Oracle 2 is well known and given in Proposition 1 without proof; the proof for µ = 0 can be found in [19, §2.2] and the one for µ > 0 in [13, Section 3.2]. Here we only provide arguments to show that Oracle 2 returns a solution with the optimality gap required by Algorithm 1, which qualifies it to be used as an oracle A (see Definition 1).

Oracle 2 Sub-gradient Descent: SGD(t, x, α)
1: Input: Level t, initial solution x ∈ X, and constant α > 1.
2: Set x_0 = x.
3: for i = 0, 1, 2, . . . do
4:   Set
       η_i = 2/((i + 2)µ)        and E_i := M²/((i + 2)µ)          if µ > 0,
       η_i = √2 D/(√(i + 1) M)   and E_i := 2√2 DM/√(i + 1)        if µ = 0.
5:   Set
       x̄_i = (2/((i + 1)(i + 2))) Σ_{j=0}^{i} (j + 1)x_j    if µ > 0,
       x̄_i = (1/(i + 1)) Σ_{j=0}^{i} x_j                    if µ = 0.
6:   Let x_{i+1} = argmin_{x∈X} (1/2)‖x − x_i + η_i ζ_{L,t}(x_i)‖_2².
7:   if E_i ≤ (1 − α)L(t, x̄_i) then
8:     Terminate and return x+ = x̄_i.
9:   end if
10: end for
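A minimal Python sketch of Oracle 2 follows. The callables level_fn (evaluating L(t, x)), subgrad_L (a sub-gradient of L(t, ·)), and project (Euclidean projection onto X) are user-supplied placeholders, and max_iters is only a safety cap not present in the paper.

import math
import numpy as np

def sgd_oracle(t, x0, alpha, level_fn, subgrad_L, project, M, D, mu, max_iters=100_000):
    """Sketch of Oracle 2 (projected sub-gradient method) returning the averaged iterate."""
    x = np.asarray(x0, dtype=float)
    running_sum = np.zeros_like(x)
    weight_sum = 0.0
    x_bar = x.copy()
    for i in range(max_iters):
        if mu > 0:                                    # strongly convex step size, error bound, and weights
            eta = 2.0 / ((i + 2) * mu)
            err = M ** 2 / ((i + 2) * mu)
            w = i + 1.0                               # weight (j + 1) on x_j in the averaged iterate
        else:                                         # convex case: O(1/sqrt(i+1)) step size, plain averaging
            eta = math.sqrt(2) * D / (math.sqrt(i + 1) * M)
            err = 2 * math.sqrt(2) * D * M / math.sqrt(i + 1)
            w = 1.0
        running_sum += w * x
        weight_sum += w
        x_bar = running_sum / weight_sum              # averaged iterate bar{x}_i
        if err <= (1 - alpha) * level_fn(t, x_bar):   # stopping test E_i <= (1 - alpha) L(t, bar{x}_i)
            break
        x = project(x - eta * subgrad_L(t, x))        # projected sub-gradient step
    return x_bar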

Proposition 1 ([13, 19]). Given a level t ∈ (t*, t_0], a solution x ∈ X, and a parameter α > 1, Oracle 2 returns a solution x+ ∈ X that satisfies L*(t) ≥ αL(t, x+) in at most

(14)    αM²/(µ(1 − α)L*(t))

iterations of the sub-gradient method when µ > 0, and at most

(15)    8α²M²D²/((1 − α)L*(t))²

such iterations when µ = 0 and D < +∞.

Proof. We first consider the case when µ = 0 and D < +∞. Oracle 2 is then the sub-gradient method in [19] with the distance-generating function (1/2)‖·‖_2². From equation (2.24) in [19], we have

    L(t, x̄_i) − L*(t) ≤ 2√2 DM/√(i + 1) = E_i.

Hence, when i ≥ 8α²M²D²/((1 − α)L*(t))² − 1, we have L(t, x̄_i) − L*(t) ≤ E_i ≤ ((1 − α)/α)L*(t), which implies L*(t) ≥ αL(t, x̄_i). Hence, E_i ≤ ((1 − α)/α)L*(t) ≤ (1 − α)L(t, x̄_i), which indicates that Oracle 2 terminates in no more than 8α²M²D²/((1 − α)L*(t))² iterations.

Now suppose µ > 0. Oracle 2 is then the sub-gradient method in [13]. By Section 3.2 in [13], we have

    L(t, x̄_i) − L*(t) ≤ M²/((i + 2)µ).

Therefore, when i ≥ αM²/(µ(1 − α)L*(t)) − 2, we have L(t, x̄_i) − L*(t) ≤ E_i ≤ ((1 − α)/α)L*(t), which implies L*(t) ≥ αL(t, x̄_i). In addition, the algorithm terminates after at most this many iterations because E_i ≤ ((1 − α)/α)L*(t) ≤ (1 − α)L(t, x̄_i).

This proposition indicates that the sub-gradient method in [19] and its variant in [13] qualify as oracles A in Algorithm 1 when µ = 0 and µ > 0, respectively (see Definition 1). Since the iteration complexities (14) and (15) do not depend on the initial solution x, the worst-case complexity does not restrict this choice of x ∈ X. In addition, γ is not needed for executing Oracle 2; therefore, we omit it as an input to this algorithm even though the generic call to the oracle A in Algorithm 1 includes γ.

4. Overall iteration complexity. We present an adaptive complexity analysis to establish the overall complexity of Algorithm 1 when using the oracles A discussed in §3. We present two lemmas that are needed to show our main complexity result in Theorem 4.

Lemma 3. Suppose Algorithm 1 terminates at iteration K. Then for any 1 ≤ k ≤ K, we have

(16)    L*(t_k) < L*(t_0)ε/(2α²).

Proof. We first use the relationship t_k = t_{k−1} + (1/2)L(t_{k−1}, x_k) to show that L*(t_k) ≤ (1/2)L*(t_{k−1}) for any k ≥ 1. Defining x*_k := argmin_{x∈X} L(t_k, x), we proceed as follows:

         L*(t_k) ≤ max{f(x*_{k−1}) − t_{k−1} − (1/2)L(t_{k−1}, x_k); g_i(x*_{k−1}), i = 1, . . . , m}
                 = max{f(x*_{k−1}) − t_{k−1}; g_i(x*_{k−1}) + (1/2)L(t_{k−1}, x_k), i = 1, . . . , m} − (1/2)L(t_{k−1}, x_k)
                 ≤ L*(t_{k−1}) − (1/2)L(t_{k−1}, x_k)
(17)             ≤ (1/2)L*(t_{k−1}),

where the second inequality holds since L(t_{k−1}, x_k) ≤ (1/α)L*(t_{k−1}) ≤ 0, and the third inequality follows from the inequality L*(t_{k−1}) ≤ L(t_{k−1}, x_k). Since Algorithm 1 has not terminated at iteration k − 1 for 1 ≤ k ≤ K, we have

(18)    L*(t_0)ε ≥ αL(t_0, x_1)ε > α²L(t_{k−1}, x_k) ≥ α²L*(t_{k−1}),

where the first inequality holds by the condition satisfied by oracle A, the second by the stopping criterion of Algorithm 1 not being satisfied, and the third by the definition of L*(t_{k−1}). The inequality (16) then follows by combining (17) and (18).

Lemma 4. Suppose Algorithm 1 terminates at iteration K. The following hold:
1. Σ_{k=0}^{K} 1/(−L*(t_k)) ≤ 4α³/(−β²L*(t_0)ε);
2. Σ_{k=0}^{K} 1/√(−L*(t_k)) ≤ 4√2 α²/(β^{1.5} √(−L*(t_0)ε));
3. Σ_{k=0}^{K} 1/(L*(t_k))² ≤ 16α⁵/(3β³(L*(t_0)ε)²).

Proof. We only prove item 1; items 2 and 3 can be derived following analogous steps. We have

    Σ_{k=0}^{K} 1/(−L*(t_k)) ≤ Σ_{k=0}^{K} 1/(β(t_k − t*))
                             ≤ Σ_{k=0}^{K} (1 − β/(2α))^{K−k}/(β(t_K − t*))
                             ≤ (1/(−βL*(t_K))) Σ_{k=0}^{K} (1 − β/(2α))^{K−k}
                             ≤ (2α²/(−βL*(t_0)ε)) Σ_{k=0}^{K} (1 − β/(2α))^{K−k}
                             = (2α²/(−βL*(t_0)ε)) [ (1 − (1 − β/(2α))^{K+1})/(1 − (1 − β/(2α))) ]
                             ≤ 4α³/(−β²L*(t_0)ε),

where the first inequality follows from (4); the second from (5); the third from (4); the fourth from (16) with k = K; the equality from the geometric sum; and the fifth inequality by dropping negative terms.


Table 1: Choices of the oracle A and γ_k in Algorithm 1

  Case                                                   | Oracle | Choice of γ_k
  Smooth (asms 1, 2, & 3), strongly convex (µ > 0)       |   1    | any γ ≥ −L*(t_0) if k = 0; −αL(t_{k−1}, x_k) if k ≥ 1
  Smooth (asms 1, 2, & 3), convex (µ = 0 and D < +∞)     |   1    | 2MD
  Non-smooth (asms 1 & 2), strongly convex (µ > 0)       |   2    | N/A
  Non-smooth (asms 1 & 2), convex (µ = 0 and D < +∞)     |   2    | N/A

We summarize the choice of the oracle A and γ_k in each iteration of Algorithm 1 in Table 1 (asms abbreviates assumptions) and characterize the overall iteration complexity (measured by the total number of gradient iterations) of this algorithm in Theorem 4. The dependence on ε in our big-O complexity for solving the constrained convex optimization problem (1)-(2) and the known analogue of this complexity for the unconstrained version of this problem are, interestingly, the same. However, unlike the unconstrained case, the complexity terms in Theorem 4 depend on the condition measure β.

Theorem 4. Suppose the oracle A and γ_k in Algorithm 1 are chosen according to Table 1. Given 0 < ε ≤ 1, α > 1, and a g-strictly feasible solution e, Algorithm 1 returns an ε-relative optimal and feasible solution to (1)-(2) in at most

    O( 1/(β^{1.5} √(−L*(t_0)ε)) + log_{(1−β/(2α))^{−1}}(1/(βε)) + log_2(1/(−L*(t_0)ε)) ),
    O( 1/(−β² L*(t_0)ε) ),    O( 1/(−β² L*(t_0)ε) ),    and    O( 1/(β³ (L*(t_0)ε)²) )

gradient iterations, respectively, when the functions f and g_i, i = 1, . . . , m, are smooth strongly convex, smooth convex, non-smooth strongly convex, and non-smooth convex.

Notice that we have only presented the terms ε, β, and L*(t_0) in our big-O complexity. The exact numbers of iterations are given explicitly in the proof.

Remark 1. Based on Table 1, when the functions f and g_i, i = 1, . . . , m, are smooth and µ > 0, the input γ used in the first call of Oracle 1 should be bounded below by −L*(t_0). We do not specify the choice of such a γ_0 in our complexity bound (see (23)). However, one simple choice of γ_0 in this case is (1/(2µ))‖∇f(e)‖_2², since

    −L*(t_0) ≤ −min_{x∈X} max{ ⟨∇f(e), x − e⟩ + (µ/2)‖x − e‖_2², g_i(x), i = 1, . . . , m }
             ≤ −min_{x∈X} { ⟨∇f(e), x − e⟩ + (µ/2)‖x − e‖_2² }
             ≤ (1/(2µ))‖∇f(e)‖_2²,

where the first inequality follows from (9) and the third holds since x = e − (1/µ)∇f(e) solves min_{x∈R^n} { ⟨∇f(e), x − e⟩ + (µ/2)‖x − e‖_2² }.


Proof. [Smooth strongly convex case] We first show that we can guarantee γ_k ≥ L(t_k, x_k) − L*(t_k) by choosing γ_0 ≥ −L*(t_0) and γ_k = −αL(t_{k−1}, x_k) for k ≥ 1. Hence γ_k can be used as an input to Oracle 1. When k = 0, we have L(t_0, x_0) = 0 and hence L(t_0, x_0) − L*(t_0) = −L*(t_0) ≤ γ_0. Take k ≥ 1. We show that

         L(t_k, x_k) = max{f(x_k) − t_{k−1} − (1/2)L(t_{k−1}, x_k); g_i(x_k), i = 1, . . . , m}
                     = max{f(x_k) − t_{k−1}; g_i(x_k) + (1/2)L(t_{k−1}, x_k), i = 1, . . . , m} − (1/2)L(t_{k−1}, x_k)
                     ≤ L(t_{k−1}, x_k) − (1/2)L(t_{k−1}, x_k)
(19)                 ≤ 0,

where the first and the second inequalities hold because L(t_{k−1}, x_k) ≤ (1/α)L*(t_{k−1}) ≤ 0. Furthermore, since L*(t) is non-increasing in t and t_k ≤ t_{k−1}, we have L*(t_{k−1}) ≤ L*(t_k). Combining this with (19) yields L(t_k, x_k) − L*(t_k) ≤ −L*(t_k) ≤ −L*(t_{k−1}) ≤ −αL(t_{k−1}, x_k) = γ_k.

K∑k=0

[(√40S

µ+ 1

)log2

(αγk

(1− α)L∗(tk)+ 1

)+ 3M

√160α log(m+ 1)

µ(1− α)L∗(tk)

]483

=

(√40S

µ+ 1

)[log2

(αγ0

(1− α)L∗(t0)+ 1

)+

K∑k=1

log2

((α− 2α2)L(tk−1, xk)

(1− α)L∗(tk)

)]484

+ 3M

√160α log(m+ 1)

µ(α− 1)

K∑k=0

1√−L∗(tk)

,

(20)

485

486

where we have replaced γk by −αL(tk−1, xk) for k > 0, and used the inequality487

(1− α)L∗(tk) ≤ (1− α)L∗(tk−1) ≤ α(1− α)L(tk−1, xk). The first sum in (20) can be488

bounded as follows.489

K∑k=1

log2

((α− 2α2)L(tk−1, xk)

(1− α)L∗(tk)

)490

= log2

(K∏k=1

(α− 2α2)L(tk−1, xk)

(1− α)L∗(tk)

)491

= log2

((2α2 − αα− 1

)K· L(t0, x1)

L∗(tK)

K−1∏k=1

L(tk, xk+1)

L∗(tk)

)492

≤ log2

((2α2 − αα− 1

)K· L(t0, x1)

L∗(tK)

)493

=

[K log2

(2α2 − αα− 1

)+ log2

(L(t0, x1)

L∗(tK)

)]494

This manuscript is for review purposes only.

16 QIHANG LIN, SELVAPRABU NADARAJAH, AND NEGAR SOHEILI

≤(⌈

log(1− β2α )−1

(α2

βε

)⌉+ 1

)log2

(2α2 − αα− 1

)+ log2

(2α2

ε

),

(21)

495496

where the first inequality holds because L(tk, xk+1)/L∗(tk) ≤ 1 for all k = 0, 1, . . . ,K,497

and the second inequality follows from setting K equal to⌈log(1− β

2α )−1

(α2

βε

)⌉+ 1,498

using (16) for k = K, and the inequality L∗(t0) ≤ L(t0, x1) ≤ 0.499

The second sum in (20) can be upper bounded using item 2 of Lemma 4. Specifically,

(22)    Σ_{k=0}^{K} 3M √( 160α log(m + 1)/((1 − α)µL*(t_k)) ) ≤ (12Mα²/β^{1.5}) √( 320α log(m + 1)/(µ(1 − α)L*(t_0)ε) ).

Adding (21) and (22), we obtain the following upper bound on the total number of iterations (20) required by Algorithm 1:

(23)    (√(40S/µ) + 1) [ ( ⌈log_{(1−β/(2α))^{−1}}(α²/(βε))⌉ + 1 ) log_2( (2α² − α)/(α − 1) ) + log_2( 2α²/ε ) + log_2( αγ_0/((1 − α)L*(t_0)) + 1 ) ]
        + (12Mα²/β^{1.5}) ( 320α log(m + 1)/(µ(1 − α)L*(t_0)ε) )^{1/2}.

[Smooth convex case] Notice that γ_k = 2MD can be used as an input to Oracle 1 at every iteration k ≥ 1 since

    L(t_k, x_k) − L*(t_k) ≤ −⟨ζ_{L,t_k}(x_k), x*_k − x_k⟩ ≤ ‖ζ_{L,t_k}(x_k)‖_2 ‖x*_k − x_k‖_2 ≤ 2MD,

where the first inequality follows from the convexity of L(t, x) in x, the second from the Cauchy-Schwarz inequality, and the last from the definitions of M and D. By theorems 2 and 3, the total number of iterations required by Algorithm 1 is at most

    Σ_{k=0}^{K} [ 12D √( 10αS/((1 − α)L*(t_k)) ) + 4MDα √(80 log(m + 1))/((1 − α)L*(t_k)) + log_2( αγ_k/((1 − α)L*(t_k)) + 1 ) ].

An upper bound on this quantity is

    Σ_{k=0}^{K} [ 12D √( 10αS/((1 − α)L*(t_k)) ) + 4MDα √(80 log(m + 1))/((1 − α)L*(t_k)) + √( 2MDα/((1 − α)L*(t_k)) ) ],

which is obtained by replacing γ_k = 2MD for all k and using the inequality log(a) ≤ (a − 1)/√a for a > 1. Next, using Lemma 4, we obtain the required upper bound on the total number of iterations taken by Algorithm 1 as

    (48Dα²/β^{1.5}) ( 20Sα/((1 − α)L*(t_0)ε) )^{1/2} + (16MDα⁴/(β²(1 − α)L*(t_0)ε)) (80 log(m + 1))^{1/2} + (8α²/β^{1.5}) ( MDα/((1 − α)L*(t_0)ε) )^{1/2}.


[Non-smooth strongly convex case] By Lemma 4 and Proposition 1, the total number of iterations required by Algorithm 1 can be bounded above in a straightforward manner:

    Σ_{k=0}^{K} αM²/(µ(1 − α)L*(t_k)) ≤ 4M²α⁴/(µβ²(1 − α)L*(t_0)ε).

[Non-smooth convex case] In a similar fashion, by Lemma 4 and Proposition 1, the total number of iterations required by Algorithm 1 is upper bounded by

    Σ_{k=0}^{K} 8α²M²D²/((1 − α)L*(t_k))² ≤ 4³M²D²α⁷/(β³((1 − α)L*(t_0)ε)²).

REFERENCES

[1] Z. Allen-Zhu and E. Hazan, Optimal black-box reductions between optimization objectives, arXiv preprint arXiv:1603.05642, (2016).
[2] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, arXiv preprint arXiv:1602.01506, (2016).
[3] A. Y. Aravkin, J. V. Burke, and M. P. Friedlander, Variational properties of value functions, SIAM Journal on Optimization, 23 (2013), pp. 1689–1717.
[4] A. Bayandina, Adaptive mirror descent for constrained optimization, arXiv preprint arXiv:1705.02029, (2017).
[5] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.
[6] A. Beck and M. Teboulle, Smoothing and first order methods: A unified framework, SIAM Journal on Optimization, 22 (2012), pp. 557–580.
[7] K. M. Bretthauer and B. Shetty, The nonlinear knapsack problem–algorithms and applications, European Journal of Operational Research, 138 (2002), pp. 459–472.
[8] M. Chiang, Geometric programming for communication systems, Foundations and Trends in Communications and Information Theory, 2 (2005), pp. 1–154.
[9] R. M. Freund and H. Lu, New computational guarantees for solving convex optimization problems with first order methods, via a function growth condition measure, arXiv preprint arXiv:1511.02974, (2015).
[10] B. Grimmer, Radial subgradient method, arXiv preprint arXiv:1703.09280v2, (2017).
[11] Z. Harchaoui, A. Juditsky, and A. Nemirovski, Conditional gradient algorithms for norm-regularized smooth convex optimization, Mathematical Programming, 152 (2015), pp. 75–112.
[12] A. Juditsky and A. Nemirovski, First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods, in Optimization for Machine Learning, S. Sra, S. Nowozin, and S. Wright, eds., MIT Press, Cambridge, MA, USA, 2011, pp. 121–148.
[13] S. Lacoste-Julien, M. Schmidt, and F. Bach, A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method, arXiv preprint arXiv:1212.2002, (2012).
[14] G. Lan, Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization, Mathematical Programming, 149 (2015), pp. 1–45.
[15] G. Lan and Z. Zhou, Algorithms for stochastic optimization with expectation constraints, arXiv preprint arXiv:1604.03887, (2016).
[16] C. Lemaréchal, An extension of Davidon methods to non differentiable problems, Nondifferentiable Optimization, (1975), pp. 95–109.
[17] C. Lemaréchal, A. Nemirovskii, and Y. Nesterov, New variants of bundle methods, Mathematical Programming, 69 (1995), pp. 111–147.
[18] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609.
[19] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609.
[20] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer, Netherlands, 2004.
[21] Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming, 103 (2005), pp. 127–152.
[22] Y. Nesterov, Primal-dual subgradient methods for convex problems, Mathematical Programming, 120 (2009), pp. 221–259.
[23] M. Patriksson and C. Strömberg, Algorithms for the continuous nonlinear resource allocation problem–new implementations and numerical studies, European Journal of Operational Research, 243 (2015), pp. 703–722.
[24] E. Pee and J. O. Royset, On solving large-scale finite minimax problems using exponential smoothing, Journal of Optimization Theory and Applications, 148 (2011), pp. 390–421.
[25] J. Renegar, "Efficient" subgradient methods for general convex optimization, SIAM Journal on Optimization, 26 (2016), pp. 2649–2676.
[26] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, submitted to SIAM Journal on Optimization, (2008).
[27] E. Van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM Journal on Scientific Computing, 31 (2008), pp. 890–912.
[28] E. Van den Berg and M. P. Friedlander, Sparse optimization with least-squares constraints, SIAM Journal on Optimization, 21 (2011), pp. 1201–1229.
[29] P. Wolfe, A method of conjugate subgradients for minimizing nondifferentiable functions, (1975), pp. 145–173.