A LEVEL-SET METHOD FOR CONVEX OPTIMIZATION WITH A FEASIBLE SOLUTION PATH

QIHANG LIN∗, SELVAPRABU NADARAJAH†, AND NEGAR SOHEILI‡
Abstract. Large-scale constrained convex optimization problems arise in several application domains. First-order methods are good candidates to tackle such problems due to their low iteration complexity and memory requirement. The level-set framework extends the applicability of first-order methods to tackle problems with complicated convex objectives and constraint sets. Current methods based on this framework either rely on the solution of challenging subproblems or do not guarantee a feasible solution, especially if the procedure is terminated before convergence. We develop a level-set method that finds an ε-relative optimal and feasible solution to a constrained convex optimization problem with a fairly general objective function and set of constraints, maintains a feasible solution at each iteration, and only relies on calls to first-order oracles. We establish the iteration complexity of our approach, also accounting for the smoothness and strong convexity of the objective function and constraints when these properties hold. The dependence of our complexity on ε is similar to the analogous dependence in the unconstrained setting, which is not known to be true for level-set methods in the literature. Nevertheless, ensuring feasibility is not free. The iteration complexity of our method depends on a condition number, while existing level-set methods that do not guarantee feasibility can avoid such dependence.
Key words. constrained convex optimization, level-set technique, first-order methods, complexity analysis
AMS subject classifications. 90C25, 90C30, 90C52
1. Introduction. Large-scale constrained convex optimization problems arise in several business, science, and engineering applications. A commonly encountered form for such problems is

(1)    f∗ := min_{x∈X} f(x)
(2)    s.t.  gi(x) ≤ 0,  i = 1, . . . , m,
where X ⊆ Rn is a closed and convex set, and f and gi, i = 1, . . . , m, are convex real functions defined on X. Given ε > 0, a point xε is ε-optimal if it satisfies f(xε) − f∗ ≤ ε and ε-feasible if it satisfies max_{i=1,...,m} gi(xε) ≤ ε. In addition, given 0 < ε ≤ 1 and a feasible and suboptimal solution e ∈ X, a point xε is ε-relative optimal with respect to e if it satisfies (f(xε) − f∗)/(f(e) − f∗) ≤ ε.
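These three notions can be stated in a few lines of code. The sketch below is our own hypothetical illustration (the helper names and the toy instance f(x) = (x − 2)², g(x) = x − 1 with f∗ = 1 are ours, not the paper's):

```python
# Hypothetical helpers checking the three solution-quality notions above.

def is_eps_optimal(f, f_star, x, eps):
    return f(x) - f_star <= eps

def is_eps_feasible(gs, x, eps):
    return max(g(x) for g in gs) <= eps

def is_eps_relative_optimal(f, f_star, e, x, eps):
    # e is a feasible, suboptimal reference point, so f(e) - f_star > 0.
    return (f(x) - f_star) / (f(e) - f_star) <= eps

# Toy instance (ours): f(x) = (x - 2)^2, one constraint g(x) = x - 1, f* = 1.
f = lambda x: (x - 2.0) ** 2
gs = [lambda x: x - 1.0]
print(is_eps_feasible(gs, 1.05, 0.1))                  # g(1.05) = 0.05 <= 0.1
print(is_eps_relative_optimal(f, 1.0, 0.0, 0.9, 0.5))  # relative gap 0.21/3 = 0.07
```

Note that an ε-feasible point such as x = 1.05 above violates the constraint; the distinction between ε-feasible and (exactly) feasible is the central theme of the paper.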
Level-set techniques [2, 3, 11, 17, 27, 28] provide a flexible framework for solving the optimization problem (1)-(2). Specifically, this problem can be reformulated as a convex root finding problem L∗(t) = 0, where L∗(t) = min_{x∈X} max{f(x) − t; gi(x), i = 1, . . . , m} (see Lemarechal et al. [17]). Then a bundle method [16, 29, 14] or an accelerated gradient method [20, Section 2.3.5] can be applied to handle the minimization subproblem to estimate L∗(t) and facilitate the root finding process. In general, these methods require solving a nontrivial quadratic program at each iteration to handle the subproblem.
∗Tippie College of Business, University of Iowa, USA, [email protected]
†College of Business Administration, University of Illinois at Chicago, USA, [email protected]
‡College of Business Administration, University of Illinois at Chicago, USA, [email protected]

This manuscript is for review purposes only.

An important variant of (1)-(2) arising in statistics, image processing, and signal processing occurs when f(x) takes a simple form, such as a gauge function, and there is a single constraint with the structure ρ(Ax − b) ≤ λ, where ρ is a convex function, A is a matrix, b is a vector, and λ is a scalar. In this setting, Van Den Berg and Friedlander [27, 28] and Aravkin et al. [2] use a level-set approach to swap the roles of the objective function and constraints in (1)-(2) to obtain the convex root finding problem v(t) = 0 with v(t) := min_{x∈X, f(x)≤t} ρ(Ax − b) − λ. Harchaoui et al. [11] proposed a similar approach. The root finding problem arising in this approach is solved using inexact Newton and secant methods in [27, 28, 2] and [2], respectively. Since f(x) is assumed to be simple, projection onto the feasible set {x ∈ Rn | x ∈ X, f(x) ≤ t} is possible and the subproblem in the definition of v(t) can be solved by various first-order methods. Although the assumption on the objective function is important, the constraint structure is not limiting because (2) can be obtained by choosing A to be an identity matrix, b = 0, λ = 0, and ρ(x) = max_{i=1,...,m} gi(x).
The level-set techniques discussed above find an ε-feasible and ε-optimal solution to (1)-(2); that is, the terminal solution may not be feasible. This issue can be overcome when the infeasible terminal solution is also super-optimal and a strictly feasible solution to the problem exists. In this case, a radial-projection scheme [2, 25] can be applied to obtain a feasible and ε-relative optimal solution. Super-optimality of the terminal solution holds if the constraint f(x) ≤ t is satisfied at termination. This condition can be enforced when projections onto the feasible set with this constraint are computationally cheap, which typically requires f(x) to be simple (e.g., see [2]). Otherwise, such a guarantee does not exist.
Building on this rich literature, we develop a feasible level-set method that (i) employs a first-order oracle to solve subproblems and compute a feasible and ε-relative optimal solution to (1)-(2) with a general objective function and set of constraints while also maintaining feasible intermediate solutions at each iteration (i.e., a feasible solution path), and (ii) avoids potentially costly projections. Our focus on maintaining feasibility is especially desirable in situations where level-set methods need to be terminated before convergence, for instance due to a time constraint. In this case, our level-set approach would return a feasible, albeit possibly suboptimal, solution that can be implemented, whereas the solution from existing level-set approaches may not be implementable due to potentially large constraint violations. Convex optimization problems whose constraints model operating conditions that solutions must satisfy to be implementable arise in several applications, including network and nonlinear resource allocation, signal processing, and circuit design [7, 8, 23].
The radial sub-gradient method proposed by Renegar [25] and extended by Grimmer [10] also emphasizes feasibility. This method avoids projection and finds a feasible and ε-relative optimal solution by performing a line search at each iteration. This line search, while easy to execute for some linear or conic programs, can be challenging for general convex programs (1)-(2). Moreover, it is unknown whether the iteration complexity of this approach can be improved by utilizing the potential smoothness and strong convexity of the objective function and constraints. In contrast, the feasible level-set approach in this paper avoids the need for line search, in addition to projection, and its iteration complexity leverages the strong convexity and smoothness of the objective and constraints when these properties hold.
Our work is also related to first-order methods, which are popular for solving large-scale convex optimization problems [12] due to their low per-iteration cost and memory requirement, for example compared to interior point methods. Most existing first-order methods find a feasible and ε-optimal solution to a version of the optimization problem (1)-(2) with a simple feasible region (e.g., a box, simplex, or ball), which makes projection onto the feasible set at each iteration inexpensive. Ensuring feasibility using projection becomes difficult for general constraints (2). Some sub-gradient methods, including the method by Nesterov [20, Section 3.2.4] and its recent variants by Lan and Zhou [15] and Bayandina [4], can be applied to (1)-(2) without the need for projection mappings. However, these methods do not ensure feasibility of the returned solution as we do; instead, they find an ε-feasible and ε-optimal solution to (1)-(2) at convergence.
Given the focus of our research, one naturally wonders whether there is an associated computational cost of ensuring feasibility in the context of level-set methods. Our analysis suggests that the answer depends on the perspective one takes. First, consider the dependence of the iteration complexity on a condition measure as a criterion. A nice property of the level-set method in [2] is that its iteration complexity is independent of a condition measure for finding an ε-feasible and ε-optimal solution. Instead, finding an ε-relative optimal and feasible solution leads to iteration complexities with dependence on condition measures when (i) combining the level-set approach [2] and the transformation in [25], or (ii) using our feasible level-set approach. This suggests that a cost of ensuring feasibility is the presence of a condition measure in the iteration complexity. Second, suppose we consider the dependence of the number of iterations on ε as our assessment criterion. The complexity result of the level-set approach provided by [2] can be interpreted as the product of the iteration complexity of the oracle used to solve the subproblem and an O(log(1/ε)) factor. The dependence of our algorithm would be similar under a non-adaptive analysis akin to [2]. However, using a novel adaptive analysis, we find that the iteration complexity of our feasible level-set approach can in fact eliminate the O(log(1/ε)) factor. In other words, our complexity is analogous to the known complexity of first-order methods for unconstrained convex optimization, and it is unclear to us whether a similar adaptive analysis can be applied to the approach in [2]. Overall, in terms of the dependence on ε, there appears to be no cost to ensuring feasibility.
The rest of this paper is organized as follows. Section 2 introduces the level-set formulation and our feasible level-set approach, also establishing its outer iteration complexity to compute an ε-relative optimal and feasible solution to the optimization problem (1)-(2). Section 3 describes two first-order oracles that solve the subproblem of our level-set formulation when the functions f and gi, i = 1, . . . , m, are smooth and when they are non-smooth, in each case accounting for the potential strong convexity of these functions. The inner iteration complexity of each oracle is also analyzed in this section. Section 4 presents an adaptive complexity analysis to establish the overall iteration complexity of using our feasible level-set approach.
2. Feasible Level-set Method. Our methodological developments rely on the assumption below.

Assumption 1. There exists a solution e ∈ X such that max_{i=1,...,m} gi(e) < 0 and f(e) > f∗. We label such a solution g-strictly feasible.
Given t ∈ R, we define

(3)    L∗(t) := min_{x∈X} L(t, x),

where L(t, x) := max{f(x) − t; gi(x), i = 1, . . . , m} for all x ∈ X. Let ∂L∗(t) denote the sub-differential of L∗ at t. Lemma 1 summarizes known properties of L∗(t).
Lemma 1 (Lemmas 2.3.4, 2.3.5, and 2.3.6 in [20]). It holds that
(a) L∗(t) is non-increasing and convex in t;
(b) L∗(t) ≤ 0 if t ≥ f∗, and L∗(t) > 0 if t < f∗. Moreover, under Assumption 1, it holds that L∗(t) < 0 for any t > f∗.
(c) Given ∆ ≥ 0, we have L∗(t) − ∆ ≤ L∗(t + ∆) ≤ L∗(t). Therefore, −1 ≤ ζL∗ ≤ 0 for all ζL∗ ∈ ∂L∗(t) and t ∈ R.
Proof. We only prove the second part of Item (b), as the proofs of the other items are provided in [20]. In particular, we show L∗(t) < 0 for any t > f∗ under Assumption 1. Let x∗ denote an optimal solution to (1)-(2). If t > f∗, we have f(x∗) − t < 0. Since f is a continuous function, there exists δ > 0 such that for any x ∈ Bδ(x∗) ∩ X we have f(x) − t < 0, where Bδ(x∗) is a ball of radius δ centered at x∗. Take x ∈ Bδ(x∗) ∩ X such that x = λx∗ + (1 − λ)e for some λ ∈ (0, 1), where e is a solution that satisfies Assumption 1. By the convexity of the functions gi, i = 1, . . . , m, the feasibility of x∗, and the g-strict feasibility of e, we have gi(x) ≤ λgi(x∗) + (1 − λ)gi(e) < 0 for i = 1, . . . , m. In addition, since x ∈ Bδ(x∗), we have f(x) − t < 0. Therefore, L∗(t) ≤ max{f(x) − t; gi(x), i = 1, . . . , m} < 0.
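The sign behavior in Lemma 1(b) can be checked numerically. The sketch below uses a toy instance of our own choosing, f(x) = (x − 2)² and g(x) = x − 1 on X = [−3, 3] (so f∗ = 1 at x∗ = 1), and evaluates L∗(t) by ternary search, which is valid because L(t, ·) is convex in x:

```python
# Toy instance (ours, not the paper's): minimize f(x) = (x - 2)^2 over
# X = [-3, 3] subject to g(x) = x - 1 <= 0, so f* = 1 at x* = 1.

def L(t, x):
    return max((x - 2.0) ** 2 - t, x - 1.0)

def L_star(t, lo=-3.0, hi=3.0, iters=200):
    # Ternary search is valid because L(t, .) is convex in x.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if L(t, m1) <= L(t, m2):
            hi = m2
        else:
            lo = m1
    return L(t, lo)

f_star = 1.0
print(L_star(f_star))        # ~0: t* = f* is the root of L*(t) = 0
print(L_star(f_star + 0.5))  # strictly negative, as Lemma 1(b) predicts
print(L_star(f_star - 0.5))  # strictly positive
```

The root-finding view of the problem is exactly this picture: drive t down toward the unique root t∗ = f∗ of L∗(t) = 0.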
Lemma 1(b), specifically L∗(t) < 0 for any t > f∗, ensures that t∗ = f∗ is the unique root of L∗(t) = 0 and that the solution x∗(t∗) attaining L∗(t∗) is an optimal solution to (1)-(2). In this case, it is possible to find an ε-relative optimal and feasible solution by closely approximating t∗ from above with a t and then finding a point x ∈ X with L(t, x) < 0. The latter condition ensures x is feasible. In order to obtain such a t, we start with an initial value t0 > t∗ and generate a decreasing sequence t0 ≥ t1 ≥ t2 ≥ · · · ≥ t∗, where tk is updated to tk+1 based on a term obtained by solving (3) with t = tk up to a certain optimality gap. We present this approach in Algorithm 1, which utilizes an oracle A defined as follows.
Definition 1. An algorithm is called an oracle A if, given t ≥ f∗, x ∈ X, α > 1, and γ ≥ L(t, x) − L∗(t), it finds a point x+ ∈ X such that L∗(t) ≥ αL(t, x+).
Algorithm 1 Feasible Level-Set Method
1: Input: g-strictly feasible solution e, and parameters α > 1 and 0 < ε ≤ 1.
2: Initialization: Set x0 = e and t0 = f(x0).
3: for k = 0, 1, 2, . . . do
4:   Choose γk such that γk ≥ L(tk, xk) − L∗(tk).
5:   Generate xk+1 = A(tk, xk, α, γk) such that L∗(tk) ≥ αL(tk, xk+1).
6:   if αL(tk, xk+1) ≥ L(t0, x1)ε then
7:     Terminate and return xk+1.
8:   end if
9:   Set tk+1 = tk + (1/2)L(tk, xk+1).
10: end for
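A runnable sketch of Algorithm 1 on a toy instance of our own (not the paper's): f(x) = (x − 2)² on X = [−3, 3] with g(x) = x − 1 ≤ 0 and e = 0. Here the oracle A is replaced by a near-exact ternary-search minimizer of the convex function L(t, ·); this satisfies the oracle requirement L∗(t) ≥ αL(t, x+) because L∗(t) ≤ 0 whenever t ≥ t∗, whereas in practice a first-order method would play this role:

```python
def L(t, x):
    return max((x - 2.0) ** 2 - t, x - 1.0)

def oracle(t, lo=-3.0, hi=3.0, iters=200):
    # Near-exact minimizer of the convex function L(t, .), standing in for A.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if L(t, m1) <= L(t, m2):
            hi = m2
        else:
            lo = m1
    return lo

def feasible_level_set(e=0.0, alpha=1.5, eps=1e-3, max_iter=200):
    x = e
    t = (x - 2.0) ** 2          # t0 = f(x0)
    path, ref = [], None        # ref will hold L(t0, x1)*eps
    for k in range(max_iter):
        x_new = oracle(t)
        val = L(t, x_new)
        if ref is None:
            ref = val * eps
        path.append(x_new)
        if alpha * val >= ref:  # stopping criterion of Algorithm 1
            return x_new, path
        t = t + 0.5 * val       # level update t_{k+1} = t_k + L(t_k, x_{k+1})/2
    return x_new, path

x_eps, path = feasible_level_set()
print(all(x - 1.0 <= 1e-9 for x in path))        # every iterate stays feasible
print(((x_eps - 2.0) ** 2 - 1.0) / (4.0 - 1.0))  # relative gap, at most eps
```

The run illustrates the paper's central claim: the iterates form a feasible solution path, so stopping early still yields an implementable point.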
Given a g-strictly feasible solution e and α > 1, Algorithm 1 finds an ε-relative optimal and feasible solution by calling an oracle A with input tuple (tk, xk, α, γk) at each iteration k. This oracle provides a lower bound, αL(tk, xk+1) ≤ 0, and an upper bound, L(tk, xk+1) ≤ 0, for L∗(tk). We use the upper bound to update tk and the lower bound in the stopping criterion. In particular, Algorithm 1 terminates when αL(tk, xk+1) is greater than or equal to L(t0, x1)ε. We discuss possible choices for oracle A in §3 and its input γ in Table 1. Theorem 2 shows that Algorithm 1 is a feasible level-set method, that is, the xk generated at each iteration k is feasible, and that it terminates with an ε-relative optimal solution. It also establishes its outer iteration complexity. This complexity depends on a condition measure

β := −L∗(t0)/(t0 − t∗)
5
0 1
t∗ t0
β
L∗(t0)
t
L∗(t)
(a) Well conditioned case.
0 1
t∗ t0
βL∗(t0)
t
L∗(t)
(b) Poorly conditioned case.
Fig. 1: Illustration of condition measure β for t∗ and t0 equals 0 and 1, respectively.
for problem (1)-(2). The suboptimality of x0 = e implies t0 = f(x0) > t∗. By Lemma 1 we can show that β ∈ (0, 1]. Larger values of β signify a better conditioned problem, as illustrated in Figure 1, because we iteratively use a close approximation of L∗(t) to approach t∗ from t0; therefore, a larger L∗(t) in absolute value intuitively suggests faster movement towards t∗. In addition, the following inequalities can be easily verified from the convexity of L∗(t) and Lemma 1(c):

(4)    −β = L∗(t0)/(t0 − t∗) ≥ L∗(t)/(t − t∗) ≥ −1,    ∀t ∈ (t∗, t0].
Theorem 2. Given α > 1, 0 < ε ≤ 1, and a g-strictly feasible solution e, Algorithm 1 maintains a feasible solution at each iteration and ensures

(5)    tk+1 − t∗ ≤ (1 − β/(2α))(tk − t∗).

In addition, it returns an ε-relative optimal solution to (1)-(2) in at most

K := ⌈log_{(1−β/(2α))^{−1}}(α²/(βε))⌉ + 1

outer iterations.
Proof. Notice that t0 > t∗ since t0 = f(e). Suppose t∗ ≤ tk ≤ t0 at iteration k. At iteration k + 1 we have

tk+1 − t∗ = tk − t∗ + (1/2)L(tk, xk+1)
          ≥ tk − t∗ + (1/2)L∗(tk)
          ≥ tk − t∗ − (1/2)(tk − t∗)
          ≥ 0,
where the first equality is obtained by substituting for tk+1 using the update equation for t in Algorithm 1, the first inequality using the definition of L∗(tk), the second inequality from (4) with t = tk, and the final inequality from tk − t∗ ≥ 0. This implies that tk ≥ t∗ for all k. In addition, the condition L∗(tk) ≥ αL(tk, xk+1) satisfied by the oracle A, α > 1, and Lemma 1(b) imply that L(tk, xk+1) ≤ L∗(tk)/α ≤ 0. It thus follows that tk+1 ≤ tk and that the xk+1 maintained by Algorithm 1 is feasible for all k.
Next we establish (5) and the outer iteration complexity of Algorithm 1. Using the definition of tk+1, α > 1, and the condition L∗(tk) ≥ αL(tk, xk+1) satisfied by oracle A, we write

(6)    tk+1 − t∗ = tk − t∗ + (1/2)L(tk, xk+1) ≤ tk − t∗ + L∗(tk)/(2α).

Using (4) with t = tk to substitute for L∗(tk) in (6) gives the required inequality (5).
Recursively using (5) starting from k = 0 gives

(7)    tk − t∗ ≤ (1 − β/(2α))^k (t0 − t∗).
To obtain the terminal condition, we note that, for k ≥ K − 1,

−αL(tk, xk+1) ≤ −αL∗(tk)
             ≤ α(tk − t∗)
             ≤ α(1 − β/(2α))^{K−1}(t0 − t∗)
             ≤ −(ε/α)L∗(t0)
             ≤ −εL(t0, x1),

where the first inequality holds by the definition of L∗(t); the second using (4) for t = tk; the third using (7); and the fourth by applying the definitions of K and β. The last inequality follows from the condition satisfied by the oracle at t0 and x1, that is, αL(t0, x1) ≤ L∗(t0). Hence, Algorithm 1 terminates after K outer iterations.
Finally, we proceed to show that the terminal solution is ε-relative optimal. Observe that

(8)    f(xk+1) − f∗ = f(xk+1) − t∗ ≤ tk − t∗,

where the equality is a consequence of the definition of t∗ and the inequality is obtained using f(xk+1) ≤ tk, which holds because f(xk+1) − tk ≤ L(tk, xk+1) ≤ 0. Combining tk − t∗ ≤ (t0 − t∗)L∗(tk)/L∗(t0) from (4) and (8), we obtain f(xk+1) − f∗ ≤ (t0 − t∗)L∗(tk)/L∗(t0). Rearranging terms results in the inequality

(f(xk+1) − f∗)/(f(e) − f∗) ≤ L∗(tk)/L∗(t0),

where we use t∗ = f∗ and t0 = f(e). Indeed, L∗(tk)/L∗(t0) ≤ ε when the algorithm terminates. To verify this, the terminal condition and the property satisfied by oracle A indicate that L∗(tk) ≥ αL(tk, xk+1) ≥ L(t0, x1)ε ≥ L∗(t0)ε, which further implies

L∗(tk)/L∗(t0) ≤ ε.
3. First-order oracles. In this section, we adapt first-order methods to be used as the oracle A in Algorithm 1 to solve (3) under different properties of the convex functions f and gi, i = 1, . . . , m. We adapt these methods in a manner that facilitates our adaptive complexity analysis in §4.
We assume functions f and gi, i = 1, . . . , m, are µ-convex for some µ ≥ 0. Namely, for h = f, g1, . . . , gm, we have

(9)    h(x) ≥ h(y) + ⟨ζh, x − y⟩ + (µ/2)‖x − y‖₂²

for any x and y in X and ζh ∈ ∂h(y), where ∂h(y) denotes the set of all sub-gradients of h at y ∈ X and ‖ · ‖₂ denotes the 2-norm. We call functions f and gi, i = 1, . . . , m, convex if µ = 0 and strongly convex if µ > 0. Let

D := max_{x∈X} ‖x‖₂    and    M := max_{x∈X} max{‖ζf‖₂; ‖ζgi‖₂, i = 1, . . . , m},

where ζf ∈ ∂f(x) and ζgi ∈ ∂gi(x), i = 1, . . . , m. When f and gi, i = 1, . . . , m, are smooth, M := max_{x∈X} max{‖∇f(x)‖₂; ‖∇gi(x)‖₂, i = 1, . . . , m}.
We make Assumption 2 in the rest of the paper.

Assumption 2. M < +∞ and either (i) µ > 0 or (ii) µ = 0 and D < +∞.

The finiteness of M is a common assumption in the literature when developing first-order methods for solving smooth min-max problems (see e.g. [6, 24]) and non-smooth problems (see e.g. [9, 18, 22]). The assumption of finite D is present to obtain computable stopping criteria for the first-order oracles that we use.
3.1. Smooth convex functions. In this subsection, in addition to assumptions 1 and 2, we also assume that the functions in the objective and constraints of (1)-(2) are smooth, as formally stated below.

Assumption 3. Functions f and gi, i = 1, . . . , m, are S-smooth for some S ≥ 0, namely, for h = f, g1, . . . , gm, h(x) ≤ h(y) + ⟨∇h(y), x − y⟩ + (S/2)‖x − y‖₂² for any x and y in X.
A first-order method achieves a fast linear convergence rate when the objective function is both smooth and strongly convex [20]. Despite this smoothness assumption, since the function L(t, x) is a maximum of functions modeling the objective and constraints, it may be non-smooth in x. Nevertheless, we can approximate L(t, x) by a smooth and strongly convex function of x using a smoothing method [6, 21] and a standard regularization technique. Applying first-order methods to minimize this smooth approximation of L(t, x) leads to a lower iteration complexity than directly working with the original non-smooth function. The technique we utilize is a modified version of the smoothing and regularization reduction method by Allen-Zhu and Hazan [1]. However, since the formulation and the non-smooth structure of (3) are quite different from the problems considered in [1], we present below the reduction technique and conduct the complexity analysis under our framework. The exponentially smoothed and regularized version of L(t, x) is

(10)    Lσ,λ(t, x) := (1/σ) log( exp(σ(f(x) − t)) + Σ_{i=1}^m exp(σgi(x)) ) + (λ/2)‖x‖₂²,

where σ > 0 and λ ≥ 0 are smoothing and regularization parameters, respectively. Lemma 2 is based on related arguments in [6] and summarizes properties of Lσ,λ(t, x) and its “closeness” to L(t, x).
Lemma 2. Suppose assumptions 2 and 3 hold. Given σ > 0 and λ ≥ 0, the function Lσ,λ(t, x) is (i) (σM² + S)-smooth and (ii) (λ + µ)-convex. Moreover, we have

(11)    0 ≤ Lσ,λ(t, x) − L(t, x) ≤ log(m + 1)/σ + λD²/2.

Here, we follow the convention that 0 · (+∞) = 0, so that λD² = 0 if λ = 0 and D = +∞.

Proof. Item (i) and the relationship (11) are respectively minor modifications of Proposition 4.1 and Example 4.9 in [6]. Item (ii) directly follows from the µ-convexity of f, gi, i = 1, . . . , m, and the λ-convexity of the function (λ/2)‖x‖₂².
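The bound (11) is easy to verify numerically. The sketch below, on a toy instance of our own (a single constraint, so m = 1, with D = 3), evaluates the smoothed function through a numerically stable log-sum-exp and checks the smoothing gap at sample points:

```python
import math

def L(t, x):
    # Toy instance (ours): f(x) = (x - 2)^2, g(x) = x - 1 on X = [-3, 3].
    return max((x - 2.0) ** 2 - t, x - 1.0)

def L_smooth(t, x, sigma, lam):
    # Exponential smoothing of the max plus an l2 regularizer, as in (10),
    # computed via a shifted log-sum-exp to avoid overflow for large sigma.
    a = sigma * ((x - 2.0) ** 2 - t)
    b = sigma * (x - 1.0)
    mx = max(a, b)
    lse = mx + math.log(math.exp(a - mx) + math.exp(b - mx))
    return lse / sigma + 0.5 * lam * x ** 2

t, sigma, lam, D, m = 2.0, 50.0, 0.01, 3.0, 1
for x in [-3.0, -1.0, 0.0, 1.0, 2.5, 3.0]:
    gap = L_smooth(t, x, sigma, lam) - L(t, x)
    # The two-sided bound (11): 0 <= gap <= log(m + 1)/sigma + lam*D^2/2.
    assert 0.0 <= gap <= math.log(m + 1) / sigma + lam * D ** 2 / 2
print("smoothing gap bound (11) holds at all sample points")
```

Increasing σ and decreasing λ tightens the approximation, which is exactly the schedule Oracle 1 follows across restarts.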
We solve the following version of (3), with L(t, x) replaced by Lσ,λ(t, x):

min_{x∈X} Lσ,λ(t, x).

Specifically, we apply a variant of Nesterov's accelerated gradient method [5, 26] for a given t > t∗ and a sequence of (σ, λ) with σ increasing to infinity and λ decreasing to zero. We refer to this gradient method as the restarted smoothing gradient method (RSGM) and summarize its steps in Oracle 1. The choice of (σ, λ) and the number of iterations performed by RSGM depend on whether f and gi, i = 1, . . . , m, are strongly convex (µ > 0) or convex (µ = 0). Moreover, RSGM is restarted after Ni accelerated gradient method iterations with a smaller precision γi until this precision is below the required threshold of (1 − α)L(t, xi). Theorem 3 establishes that Oracle 1 returns a solution x+ ∈ X such that L∗(t) ≥ αL(t, x+). Hence, RSGM can be used as an oracle A (see Definition 1).
Theorem 3. Suppose assumptions 1, 2, and 3 hold. Given level t ∈ (t∗, t0], solution x ∈ X, and parameters α > 1 and γ ≥ L(t, x) − L∗(t), Oracle 1 returns a solution x+ ∈ X that satisfies L∗(t) ≥ αL(t, x+) in at most

(√(40S/µ) + 1) log₂(αγ/((1 − α)L∗(t)) + 1) + 3M√(160α log(m + 1)/(µ(1 − α)L∗(t)))

iterations of the accelerated gradient method when µ > 0, and

12D√(10Sα/((1 − α)L∗(t))) + 4MDα√(80 log(m + 1))/((1 − α)L∗(t)) + log₂(αγ/((1 − α)L∗(t)) + 1)

such iterations when µ = 0 and D < +∞.
Oracle 1 Restarted Smoothing Gradient Method: RSGM(t, x, α, γ)
1: Input: Level t, initial solution x ∈ X, constant α > 1, and initial precision γ ≥ L(t, x) − L∗(t).
2: Set x0 = x and γ0 = γ.
3: for i = 0, 1, 2, . . . do
4:   if γi ≤ (1 − α)L(t, xi) then
5:     Terminate and return x+ = xi.
6:   end if
7:   Set σi = 4 log(m + 1)/γi and λi = 0 if µ > 0, or σi = 8 log(m + 1)/γi and λi = γi/(4D²) if µ = 0.
8:   Starting at xi, apply the accelerated gradient method [26, Algorithm 1] to the optimization problem min_{x∈X} Lσi,λi(t, x) for
       Ni = ⌈max{√(40S/µ), M√(160 log(m + 1))/√(µγi)}⌉ if µ > 0, or
       Ni = ⌈max{√(160SD²/γi), 4MD√(80 log(m + 1))/γi}⌉ if µ = 0,
     iterations, and let xi+1 be the solution returned by this method.
9:   Set γi+1 = γi/2.
10: end for

Proof. Let x∗σ,λ(t) := arg min_{x∈X} Lσ,λ(t, x) and L∗σ,λ(t) := Lσ,λ(t, x∗σ,λ(t)). We will first show by induction that at each iteration i = 0, 1, 2, . . . of Oracle 1 the following inequality holds:

(12)    L(t, xi) − L∗(t) ≤ γi.

The validity of (12) for i = 0 holds because x0 = x, γ0 = γ, and we know L(t, x) − L∗(t) ≤ γ. Suppose (12) holds for iteration i. We claim that it also holds for iteration i + 1 in both the strongly convex and convex cases. In fact, at iteration i + 1, we have

L(t, xi+1) − L∗(t) ≤ Lσi,λi(t, xi+1) − L∗σi,λi(t) + log(m + 1)/σi + λiD²/2
  ≤ 2(σiM² + S)‖xi − x∗σi,λi(t)‖₂²/Ni² + log(m + 1)/σi + λiD²/2
  ≤ 4(σiM² + S)(Lσi,λi(t, xi) − L∗σi,λi(t))/((µ + λi)Ni²) + log(m + 1)/σi + λiD²/2
  ≤ (4(σiM² + S)/((µ + λi)Ni²))(γi + log(m + 1)/σi + λiD²/2) + log(m + 1)/σi + λiD²/2
  = 5γi(σiM² + S)/((µ + λi)Ni²) + γi/4
  ≤ γi/8 + γi/8 + γi/4
  = γi/2 = γi+1.
The first inequality follows from Lemma 2; the second inequality from the (σiM² + S)-smoothness of Lσi,λi(t, x) and the convergence property, i.e., Lσi,λi(t, xi+1) − L∗σi,λi(t) ≤ (2/Ni²)(σiM² + S)‖xi − x∗σi,λi(t)‖₂², of an accelerated gradient method [26, Corollary 1]; the third using the (λi + µ)-convexity of Lσi,λi(t, x) in x; and the fourth by combining Lemma 2 and the induction hypothesis that L(t, xi) − L∗(t) ≤ γi. The rest of the relationships follow from the inequalities 5γiσiM²/((µ + λi)Ni²) ≤ γi/8 and 5γiS/((µ + λi)Ni²) ≤ γi/8, which hold for both µ > 0 and µ = 0 according to the definitions of σi, λi, and Ni.
It can be easily verified that Oracle 1 terminates in at most I iterations, where I := log₂(αγ/((1 − α)L∗(t)) + 1) ≥ 0. In fact, according to (12), for any i ≥ I, we have L(t, xi) − L∗(t) ≤ γi = γ0/2^i ≤ γ/2^I ≤ ((1 − α)/α)L∗(t), which implies that αL(t, xi) ≤ L∗(t). Hence γi ≤ ((1 − α)/α)(αL(t, xi)) = (1 − α)L(t, xi) and Oracle 1 terminates.
We next analyze the total number of iterations across all calls of the accelerated gradient method [26, Algorithm 1] during the execution of Oracle 1.
Suppose µ > 0. The number of iterations taken by Oracle 1 can be bounded as follows:

Σ_{i=0}^{I−1} Ni ≤ Σ_{i=0}^{I−1} √(40S/µ) + Σ_{i=0}^{I−1} M√(160 log(m + 1)/(µγi)) + Σ_{i=0}^{I−1} 1
  = (√(40S/µ) + 1)I + M√(160 log(m + 1)/µ) Σ_{i=0}^{I−1} (1/γi)^{1/2}
  = (√(40S/µ) + 1)I + M√(160 log(m + 1)/(µγ0)) Σ_{i=0}^{I−1} 2^{i/2}
  = (√(40S/µ) + 1)I + M√(160 log(m + 1)/(µγ)) · (1 − 2^{I/2})/(1 − √2)
  ≤ (√(40S/µ) + 1) log₂(αγ/((1 − α)L∗(t)) + 1) + 3M√(160α log(m + 1)/(µ(1 − α)L∗(t))).
The first inequality above is obtained using the inequality ⌈a⌉ ≤ a + 1 for a ∈ R+, and by bounding the maximum determining Ni by a sum of its terms; the first equality from simplifying terms; the second equality from using the definition of γi; the third equality from γ0 = γ and the geometric sum formula Σ_{i=0}^{I−1} 2^{i/2} = (1 − 2^{I/2})/(1 − √2); and the last inequality by substituting for I, using the relationship √(a + 1) ≤ √a + 1 for a ∈ R+ to obtain the inequality

(13)    2^{I/2} − 1 = (αγ/((1 − α)L∗(t)) + 1)^{1/2} − 1 ≤ (αγ/((1 − α)L∗(t)))^{1/2},

and bounding 1/(√2 − 1) above by 3.
Suppose µ = 0. The iteration complexity can be bounded using the following steps:

Σ_{i=0}^{I−1} Ni ≤ Σ_{i=0}^{I−1} 4D√(10S/γi) + Σ_{i=0}^{I−1} 4MD√(80 log(m + 1))/γi + Σ_{i=0}^{I−1} 1
  = 4D√(10S/γ0) Σ_{i=0}^{I−1} 2^{i/2} + (4MD√(80 log(m + 1))/γ0) Σ_{i=0}^{I−1} 2^i + I
  = 4D√(10S/γ0) · (1 − 2^{I/2})/(1 − √2) + (4MD√(80 log(m + 1))/γ0) · (1 − 2^I)/(1 − 2) + I
  ≤ 12D√(10Sα/((1 − α)L∗(t))) + 4αMD√(80 log(m + 1))/((1 − α)L∗(t)) + log₂(αγ/((1 − α)L∗(t)) + 1).
The first inequality follows from replacing the maximum in the definition of Ni by the sum of its terms and using ⌈a⌉ ≤ a + 1 for a ∈ R+; the first and second equalities by rearranging terms and using the geometric sum formula, respectively; and the last inequality by substituting for I and using the inequality (13).
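The key inequalities closing the proof of Theorem 3, 5γiσiM²/((µ + λi)Ni²) ≤ γi/8 and 5γiS/((µ + λi)Ni²) ≤ γi/8, can be sanity-checked by coding Oracle 1's parameter schedule directly. The constants M, S, D below are illustrative values of our own choosing:

```python
import math

def schedule(gamma, mu, M, S, D, m):
    # sigma_i, lambda_i, and N_i as defined in Oracle 1 for precision gamma_i.
    if mu > 0:
        sigma = 4.0 * math.log(m + 1) / gamma
        lam = 0.0
        N = math.ceil(max(math.sqrt(40.0 * S / mu),
                          M * math.sqrt(160.0 * math.log(m + 1)) / math.sqrt(mu * gamma)))
    else:
        sigma = 8.0 * math.log(m + 1) / gamma
        lam = gamma / (4.0 * D ** 2)
        N = math.ceil(max(math.sqrt(160.0 * S * D ** 2 / gamma),
                          4.0 * M * D * math.sqrt(80.0 * math.log(m + 1)) / gamma))
    return sigma, lam, N

M, S, D, m = 2.0, 3.0, 3.0, 1   # illustrative constants
for mu in (0.0, 0.5):           # convex and strongly convex cases
    gamma = 1.0
    for _ in range(10):         # gamma_i halves at every restart
        sigma, lam, N = schedule(gamma, mu, M, S, D, m)
        assert 5.0 * sigma * M ** 2 / ((mu + lam) * N ** 2) <= 1.0 / 8.0
        assert 5.0 * S / ((mu + lam) * N ** 2) <= 1.0 / 8.0
        gamma /= 2.0
print("restart-schedule bounds hold in both cases")
```

The two assertions are exactly the terms bounded by γi/8 each in the induction step of the proof; the ceiling in Ni makes both hold with a little slack.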
3.2. Non-smooth functions. In this subsection, we only enforce assumptions 1 and 2. In other words, the functions f and gi, i = 1, . . . , m, can be non-smooth. Under Assumption 2, we have max_{x∈X} ‖ζL,t(x)‖₂ ≤ M for any ζL,t(x) ∈ ∂xL(t, x) and t ∈ R, where ∂xL(t, x) denotes the set of all sub-gradients of L(t, ·) at x with respect to x.

As oracle A, we choose the standard sub-gradient method with varying stepsizes from [19] and its variant in [13]¹, respectively, when µ = 0 and µ > 0 to directly solve (3). We present this sub-gradient method for both these choices of µ in Oracle 2. The iteration complexity of Oracle 2 is well known and is given in Proposition 1 without proof. The proof for µ = 0 can be found in [19, §2.2] and the one for µ > 0 is given in [13, Section 3.2]. Here we only provide arguments to show that Oracle 2 returns a solution with the optimality gap required by Algorithm 1, which qualifies it to be used as an oracle A (see Definition 1).
Oracle 2 Sub-gradient Descent: SGD(t, x, α)
1: Input: Level t, initial solution x ∈ X, and constant α > 1.
2: Set x0 = x.
3: for i = 0, 1, 2, . . . do
4:   Set ηi = 2/((i + 2)µ) and Ei := M²/((i + 2)µ) if µ > 0, or ηi = √2 D/(√(i + 1) M) and Ei := 2√2 DM/√(i + 1) if µ = 0.
5:   Set x̄i = (2/((i + 1)(i + 2))) Σ_{j=0}^i (j + 1)xj if µ > 0, or x̄i = (1/(i + 1)) Σ_{j=0}^i xj if µ = 0.
6:   Let xi+1 = arg min_{x∈X} (1/2)‖x − xi + ηiζL,t(xi)‖₂².
7:   if Ei ≤ (1 − α)L(t, x̄i) then
8:     Terminate and return x+ = x̄i.
9:   end if
10: end for
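The µ = 0 branch of Oracle 2 can be sketched directly. The toy instance below is our own (L(t, x) = max{(x − 2)² − t, x − 1} on X = [0, 2], so D = 2 and M = 4 bound the iterate and subgradient norms), with level t = 2 > t∗ = 1:

```python
import math

def L(t, x):
    return max((x - 2.0) ** 2 - t, x - 1.0)

def subgrad(t, x):
    # A subgradient of L(t, .): the derivative of whichever piece attains the max.
    return 2.0 * (x - 2.0) if (x - 2.0) ** 2 - t >= x - 1.0 else 1.0

def sgd_mu0(t, x0, alpha, D=2.0, M=4.0, max_iter=60000):
    x, s = x0, x0                      # s accumulates iterates for averaging
    for i in range(max_iter):
        x_bar = s / (i + 1)            # averaged iterate, as in step 5
        E_i = 2.0 * math.sqrt(2.0) * D * M / math.sqrt(i + 1.0)
        if E_i <= (1.0 - alpha) * L(t, x_bar):
            return x_bar               # stopping rule of Oracle 2
        eta = math.sqrt(2.0) * D / (math.sqrt(i + 1.0) * M)
        x = min(2.0, max(0.0, x - eta * subgrad(t, x)))  # projected step onto X
        s += x
    return s / (max_iter + 1)

x_plus = sgd_mu0(t=2.0, x0=0.0, alpha=2.0)
print(L(2.0, x_plus))  # close to L*(2) = (3 - sqrt(13))/2, about -0.30
```

The run terminates once the theoretical error bound Ei is dominated by (1 − α)L(t, x̄i), certifying L∗(t) ≥ αL(t, x̄i) without ever computing L∗(t).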
Proposition 1 ([13, 19]). Given level t ∈ (t∗, t0], solution x ∈ X, and a parameter α > 1, Oracle 2 returns a solution x+ ∈ X that satisfies L∗(t) ≥ αL(t, x+) in at most

(14)    αM²/(µ(1 − α)L∗(t))

iterations of the sub-gradient method when µ > 0, and at most

(15)    8α²M²D²/((1 − α)L∗(t))²

such iterations when µ = 0 and D < +∞.

¹The original algorithms proposed in [13] and [19] are stochastic sub-gradient methods. Here, we apply these methods with a deterministic sub-gradient.
Proof. We first consider the case when µ = 0 and D < +∞. Oracle 2 is the sub-gradient method in [19] with the distance-generating function (1/2)‖ · ‖₂². From equation (2.24) in [19], we have

L(t, x̄i) − L∗(t) ≤ 2√2 DM/√(i + 1) = Ei.

Hence, when i ≥ 8α²M²D²/((1 − α)L∗(t))² − 1, we have L(t, x̄i) − L∗(t) ≤ Ei ≤ ((1 − α)/α)L∗(t), which implies L∗(t) ≥ αL(t, x̄i). Hence, Ei ≤ ((1 − α)/α)L∗(t) ≤ (1 − α)L(t, x̄i), which indicates that Oracle 2 terminates in no more than 8α²M²D²/((1 − α)L∗(t))² iterations.
Now suppose µ > 0. Oracle 2 in this case is the sub-gradient method in [13]. By Section 3.2 in [13], we have

L(t, x̄i) − L∗(t) ≤ M²/((i + 2)µ).

Therefore, when i ≥ αM²/(µ(1 − α)L∗(t)) − 2, we have L(t, x̄i) − L∗(t) ≤ Ei ≤ ((1 − α)/α)L∗(t), which implies L∗(t) ≥ αL(t, x̄i). In addition, the algorithm terminates after at most this many iterations because Ei ≤ ((1 − α)/α)(αL(t, x̄i)) = (1 − α)L(t, x̄i).
This proposition indicates that the sub-gradient method in [19] and its variant in [13] qualify as oracles A in Algorithm 1 when µ = 0 and µ > 0, respectively (see Definition 1). Since the iteration complexities (14) and (15) do not depend on the initial solution x, the worst-case complexity does not restrict the choice of x ∈ X. In addition, γ is not needed for executing Oracle 2. We therefore omit it as an input to this algorithm even though the generic call to the oracle A in Algorithm 1 includes γ.
4. Overall iteration complexity. We present an adaptive complexity analysis to establish the overall complexity of Algorithm 1 when using the oracles A discussed in §3. We first present two lemmas that are needed to prove our main complexity result, Theorem 4.
Lemma 3. Suppose Algorithm 1 terminates at iteration $K$. Then for any $1 \le k \le K$, we have

(16)    $L^*(t_k) < \dfrac{L^*(t_0)\varepsilon}{2\alpha^2}.$
Proof. We first use the relationship $t_k = t_{k-1} + \frac{1}{2}L(t_{k-1}, x_k)$ to show that $L^*(t_k) \le \frac{1}{2}L^*(t_{k-1})$ for any $k \ge 1$. Defining $x^*_k := \arg\min_{x\in\mathcal{X}} L(t_k, x)$, we proceed as follows:

$L^*(t_k) \le \max\{f(x^*_{k-1}) - t_{k-1} - \tfrac{1}{2}L(t_{k-1}, x_k);\; g_i(x^*_{k-1}),\; i = 1, \dots, m\}$
$= \max\{f(x^*_{k-1}) - t_{k-1};\; g_i(x^*_{k-1}) + \tfrac{1}{2}L(t_{k-1}, x_k),\; i = 1, \dots, m\} - \tfrac{1}{2}L(t_{k-1}, x_k)$
$\le L^*(t_{k-1}) - \tfrac{1}{2}L(t_{k-1}, x_k)$
(17)    $\le \tfrac{1}{2}L^*(t_{k-1}),$

where the second inequality holds since $L(t_{k-1}, x_k) \le \frac{1}{\alpha}L^*(t_{k-1}) \le 0$ and the third inequality follows from the inequality $L^*(t_{k-1}) \le L(t_{k-1}, x_k)$. Since Algorithm 1 has not terminated at iteration $k-1$ for $1 \le k \le K$, we have

(18)    $L^*(t_0)\varepsilon \ge \alpha L(t_0, x_1)\varepsilon > \alpha^2 L(t_{k-1}, x_k) \ge \alpha^2 L^*(t_{k-1}),$

where the first inequality holds by the condition satisfied by oracle A, the second by the stopping criterion of Algorithm 1 not being satisfied, and the third by the definition of $L^*(t_{k-1})$. The inequality (16) then follows by combining (17) and (18).
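The halving behavior in (17) can be seen concretely on a one-dimensional instance where $L^*(t)$ is available in closed form. For the made-up instance $f(x) = x^2$, $g(x) = x - \frac{1}{2}$, $\mathcal{X} = [-1, 1]$, we have $L^*(t) = -t$ for $0 < t \le \frac{1}{2}$, so with an exact oracle ($L(t_{k-1}, x_k) = L^*(t_{k-1})$) the update $t_k = t_{k-1} + \frac{1}{2}L(t_{k-1}, x_k)$ halves $-L^*(t_k)$ at every iteration:

```python
import numpy as np

def L_star(t):
    # brute-force L*(t) = min_{x in [-1,1]} max{x^2 - t, x - 1/2};
    # for 0 < t <= 1/2 the minimizer is x = 0 and L*(t) = -t
    xs = np.linspace(-1.0, 1.0, 200001)
    return float(np.min(np.maximum(xs**2 - t, xs - 0.5)))

t = 0.4                              # t_0 in (t*, t_0] with t* = 0
vals = [L_star(t)]
for _ in range(5):
    t = t + 0.5 * L_star(t)          # level update with an exact oracle
    vals.append(L_star(t))

ratios = [vals[k + 1] / vals[k] for k in range(5)]
# each ratio equals 1/2 (up to grid error), matching L*(t_k) <= L*(t_{k-1})/2
```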
Lemma 4. Suppose Algorithm 1 terminates at iteration $K$. The following hold:

1. $\displaystyle\sum_{k=0}^{K} \frac{1}{-L^*(t_k)} \le \frac{4\alpha^3}{-\beta^2 L^*(t_0)\varepsilon}$;
2. $\displaystyle\sum_{k=0}^{K} \frac{1}{\sqrt{-L^*(t_k)}} \le \frac{4\sqrt{2}\,\alpha^2}{\beta^{1.5}\sqrt{-L^*(t_0)\varepsilon}}$;
3. $\displaystyle\sum_{k=0}^{K} \frac{1}{(L^*(t_k))^2} \le \frac{16\alpha^5}{3\beta^3 (L^*(t_0)\varepsilon)^2}$.
Proof. We only prove Item 1; Items 2 and 3 can be derived following analogous steps. We have

$\displaystyle\sum_{k=0}^{K} \frac{1}{-L^*(t_k)} \le \sum_{k=0}^{K} \frac{1}{\beta(t_k - t^*)} \le \sum_{k=0}^{K} \frac{\big(1 - \frac{\beta}{2\alpha}\big)^{K-k}}{\beta(t_K - t^*)} \le \frac{1}{-\beta L^*(t_K)} \sum_{k=0}^{K} \Big(1 - \frac{\beta}{2\alpha}\Big)^{K-k}$
$\displaystyle\le \frac{2\alpha^2}{-\beta L^*(t_0)\varepsilon} \sum_{k=0}^{K} \Big(1 - \frac{\beta}{2\alpha}\Big)^{K-k} = \frac{2\alpha^2}{-\beta L^*(t_0)\varepsilon} \cdot \frac{1 - \big(1 - \frac{\beta}{2\alpha}\big)^{K+1}}{1 - \big(1 - \frac{\beta}{2\alpha}\big)} \le \frac{4\alpha^3}{-\beta^2 L^*(t_0)\varepsilon},$

where the first inequality follows by (4); the second by (5); the third by (4); the fourth by using (16) with $k = K$; the equality by summing the geometric series; and the fifth inequality by dropping the negative term.
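The final two steps above rely only on the geometric-series bound $\sum_{k=0}^{K}\big(1 - \frac{\beta}{2\alpha}\big)^{K-k} \le \frac{2\alpha}{\beta}$, which a quick numerical check confirms (the sample values $\alpha = 2$, $\beta = \frac{1}{2}$ are ours, for illustration only):

```python
# Sanity check of the geometric-series step used in the proof of Lemma 4:
# sum_{k=0}^{K} (1 - beta/(2*alpha))**(K - k) <= 2*alpha/beta for any K >= 0.
alpha, beta = 2.0, 0.5              # sample values with alpha > 1, beta in (0, 1]
r = 1 - beta / (2 * alpha)          # common ratio, in (0, 1)
for K in (0, 1, 10, 50):
    s = sum(r ** (K - k) for k in range(K + 1))
    assert s <= 2 * alpha / beta    # partial sums never exceed 1/(1 - r)
```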
Table 1: Choices of the oracle A and $\gamma_k$ in Algorithm 1

Case                                            Oracle   Choice of $\gamma_k$
Smooth (Asms. 1, 2, & 3):
    Strongly convex ($\mu > 0$)                    1     any $\gamma \ge -L^*(t_0)$ if $k = 0$; $-\alpha L(t_{k-1}, x_k)$ if $k \ge 1$
    Convex ($\mu = 0$ & $D < +\infty$)             1     $2MD$
Non-smooth (Asms. 1 & 2):
    Strongly convex ($\mu > 0$)                    2     N/A
    Convex ($\mu = 0$ & $D < +\infty$)             2     N/A
We summarize the choice of the oracle A and $\gamma_k$ in each iteration of Algorithm 1 in Table 1 ("Asms." abbreviates assumptions) and characterize the overall iteration complexity of this algorithm, measured by the total number of gradient iterations, in Theorem 4. Interestingly, the dependence on $\varepsilon$ in our big-O complexity for solving the constrained convex optimization problem (1)-(2) is the same as the known analogous dependence for the unconstrained version of this problem. However, unlike the unconstrained case, the complexity terms in Theorem 4 depend on the condition measure $\beta$.
Theorem 4. Suppose the oracle A and $\gamma_k$ in Algorithm 1 are chosen according to Table 1. Given $0 < \varepsilon \le 1$, $\alpha > 1$, and a g-strictly feasible solution $e$, Algorithm 1 returns an $\varepsilon$-relative optimal and feasible solution to (1)-(2) in at most

$O\left(\dfrac{1}{\beta^{1.5}\sqrt{-L^*(t_0)\varepsilon}} + \log_{(1-\frac{\beta}{2\alpha})^{-1}}\left(\dfrac{1}{\beta\varepsilon}\right) + \log_2\left(\dfrac{1}{-L^*(t_0)\varepsilon}\right)\right),\quad O\left(\dfrac{1}{-\beta^2 L^*(t_0)\varepsilon}\right),\quad O\left(\dfrac{1}{-\beta^2 L^*(t_0)\varepsilon}\right),\quad \text{and}\quad O\left(\dfrac{1}{\beta^3 (L^*(t_0)\varepsilon)^2}\right)$

gradient iterations, respectively, when the functions $f$ and $g_i$, $i = 1, \dots, m$, are smooth strongly convex, smooth convex, non-smooth strongly convex, and non-smooth convex.

Notice that we have only presented the terms $\varepsilon$, $\beta$, and $L^*(t_0)$ in our big-O complexity. The exact numbers of iterations are given explicitly in the proof.
Remark 1. Based on Table 1, when the functions $f$ and $g_i$, $i = 1, \dots, m$, are smooth and $\mu > 0$, the input $\gamma_0$ used in the first call of Oracle 1 should be bounded below by $-L^*(t_0)$. We do not specify the choice of such a $\gamma_0$ in our complexity bound (see (23)). However, one simple choice of $\gamma_0$ in this case is $\frac{1}{2\mu}\|\nabla f(e)\|^2$, since

$-L^*(t_0) \le -\min_{x\in\mathcal{X}} \max\left\{\langle \nabla f(e), x - e\rangle + \frac{\mu}{2}\|x - e\|^2,\; g_i(x),\; i = 1, \dots, m\right\}$
$\le -\min_{x\in\mathcal{X}} \left\{\langle \nabla f(e), x - e\rangle + \frac{\mu}{2}\|x - e\|^2\right\}$
$\le \frac{1}{2\mu}\|\nabla f(e)\|^2,$

where the first inequality follows from (9) and the third holds since $x = e - \frac{1}{\mu}\nabla f(e)$ solves $\min_{x\in\mathbb{R}^n}\left\{\langle \nabla f(e), x - e\rangle + \frac{\mu}{2}\|x - e\|^2\right\}$.
Proof. [Smooth strongly convex case] We first show that choosing $\gamma_0 \ge -L^*(t_0)$ and $\gamma_k = -\alpha L(t_{k-1}, x_k)$ for $k \ge 1$ guarantees $\gamma_k \ge L(t_k, x_k) - L^*(t_k)$, so that $\gamma_k$ can be used as an input to Oracle 1.

When $k = 0$, we have $L(t_0, x_0) = 0$ and hence $L(t_0, x_0) - L^*(t_0) = -L^*(t_0) \le \gamma_0$. Now take $k \ge 1$. We have

$L(t_k, x_k) = \max\{f(x_k) - t_{k-1} - \tfrac{1}{2}L(t_{k-1}, x_k);\; g_i(x_k),\; i = 1, \dots, m\}$
$= \max\{f(x_k) - t_{k-1};\; g_i(x_k) + \tfrac{1}{2}L(t_{k-1}, x_k),\; i = 1, \dots, m\} - \tfrac{1}{2}L(t_{k-1}, x_k)$
$\le L(t_{k-1}, x_k) - \tfrac{1}{2}L(t_{k-1}, x_k)$
(19)    $\le 0,$

where the first and second inequalities hold because $L(t_{k-1}, x_k) \le \frac{1}{\alpha}L^*(t_{k-1}) \le 0$. Furthermore, since $L^*(t)$ is a decreasing function of $t$ and $t_k \le t_{k-1}$, we have $L^*(t_{k-1}) \le L^*(t_k)$. Combining this with (19) yields $L(t_k, x_k) - L^*(t_k) \le -L^*(t_{k-1}) \le -\alpha L(t_{k-1}, x_k) = \gamma_k$.
We next characterize the iteration complexity of the oracle A in each iteration of Algorithm 1 using Theorem 3. By virtue of this theorem, the total number of gradient iterations performed by Algorithm 1 is at most

$\displaystyle\sum_{k=0}^{K}\left[\left(\sqrt{\frac{40S}{\mu}} + 1\right)\log_2\left(\frac{\alpha\gamma_k}{(1-\alpha)L^*(t_k)} + 1\right) + 3M\sqrt{\frac{160\alpha\log(m+1)}{\mu(1-\alpha)L^*(t_k)}}\,\right]$

(20)    $\displaystyle= \left(\sqrt{\frac{40S}{\mu}} + 1\right)\left[\log_2\left(\frac{\alpha\gamma_0}{(1-\alpha)L^*(t_0)} + 1\right) + \sum_{k=1}^{K}\log_2\left(\frac{(\alpha - 2\alpha^2)L(t_{k-1}, x_k)}{(1-\alpha)L^*(t_k)}\right)\right] + 3M\sqrt{\frac{160\alpha\log(m+1)}{\mu(\alpha-1)}}\sum_{k=0}^{K}\frac{1}{\sqrt{-L^*(t_k)}},$

where we have replaced $\gamma_k$ by $-\alpha L(t_{k-1}, x_k)$ for $k > 0$ and used the inequality $(1-\alpha)L^*(t_k) \le (1-\alpha)L^*(t_{k-1}) \le \alpha(1-\alpha)L(t_{k-1}, x_k)$. The first sum in (20) can be bounded as follows:

$\displaystyle\sum_{k=1}^{K}\log_2\left(\frac{(\alpha - 2\alpha^2)L(t_{k-1}, x_k)}{(1-\alpha)L^*(t_k)}\right) = \log_2\left(\prod_{k=1}^{K}\frac{(\alpha - 2\alpha^2)L(t_{k-1}, x_k)}{(1-\alpha)L^*(t_k)}\right)$
$\displaystyle= \log_2\left(\left(\frac{2\alpha^2 - \alpha}{\alpha - 1}\right)^K \cdot \frac{L(t_0, x_1)}{L^*(t_K)}\prod_{k=1}^{K-1}\frac{L(t_k, x_{k+1})}{L^*(t_k)}\right)$
$\displaystyle\le \log_2\left(\left(\frac{2\alpha^2 - \alpha}{\alpha - 1}\right)^K \cdot \frac{L(t_0, x_1)}{L^*(t_K)}\right) = K\log_2\left(\frac{2\alpha^2 - \alpha}{\alpha - 1}\right) + \log_2\left(\frac{L(t_0, x_1)}{L^*(t_K)}\right)$

(21)    $\displaystyle\le \left(\left\lceil \log_{(1-\frac{\beta}{2\alpha})^{-1}}\left(\frac{\alpha^2}{\beta\varepsilon}\right)\right\rceil + 1\right)\log_2\left(\frac{2\alpha^2 - \alpha}{\alpha - 1}\right) + \log_2\left(\frac{2\alpha^2}{\varepsilon}\right),$

where the first inequality holds because $L(t_k, x_{k+1})/L^*(t_k) \le 1$ for all $k = 0, 1, \dots, K$, and the second inequality follows from setting $K$ equal to $\left\lceil \log_{(1-\frac{\beta}{2\alpha})^{-1}}\left(\frac{\alpha^2}{\beta\varepsilon}\right)\right\rceil + 1$, using (16) with $k = K$, and the inequality $L^*(t_0) \le L(t_0, x_1) \le 0$.
The second sum in (20) can be upper bounded using Item 2 of Lemma 4. Specifically,

(22)    $\displaystyle\sum_{k=0}^{K} 3M\sqrt{\frac{160\alpha\log(m+1)}{\mu(1-\alpha)L^*(t_k)}} \le \frac{12M\alpha^2}{\beta^{1.5}}\sqrt{\frac{320\alpha\log(m+1)}{\mu(1-\alpha)L^*(t_0)\varepsilon}}.$

Adding (21) and (22), we obtain the following upper bound on the total number of iterations (20) required by Algorithm 1:

(23)    $\displaystyle\left(\sqrt{\frac{40S}{\mu}} + 1\right)\left[\left(\left\lceil \log_{(1-\frac{\beta}{2\alpha})^{-1}}\left(\frac{\alpha^2}{\beta\varepsilon}\right)\right\rceil + 1\right)\log_2\left(\frac{2\alpha^2 - \alpha}{\alpha - 1}\right) + \log_2\left(\frac{2\alpha^2}{\varepsilon}\right) + \log_2\left(\frac{\alpha\gamma_0}{(1-\alpha)L^*(t_0)} + 1\right)\right] + \frac{12M\alpha^2}{\beta^{1.5}}\left(\frac{320\alpha\log(m+1)}{\mu(1-\alpha)L^*(t_0)\varepsilon}\right)^{1/2}.$
[Smooth convex case] Notice that $\gamma_k = 2MD$ can be used as an input to Oracle 1 at every iteration $k \ge 1$ since

$L(t_k, x_k) - L^*(t_k) \le -\langle \zeta_{L,t_k}(x_k), x^*_k - x_k\rangle \le \|\zeta_{L,t_k}(x_k)\|_2 \|x^*_k - x_k\|_2 \le 2MD,$

where the first inequality follows from the convexity of $L(t, x)$ in $x$, the second from the Cauchy-Schwarz inequality, and the last from the definitions of $M$ and $D$. By Theorems 2 and 3, the total number of iterations required by Algorithm 1 is equal to

$\displaystyle\sum_{k=0}^{K}\left[12D\sqrt{\frac{10\alpha S}{(1-\alpha)L^*(t_k)}} + \frac{4MD\alpha\sqrt{80\log(m+1)}}{(1-\alpha)L^*(t_k)} + \log_2\left(\frac{\alpha\gamma_k}{(1-\alpha)L^*(t_k)} + 1\right)\right].$

An upper bound on this quantity is

$\displaystyle\sum_{k=0}^{K}\left[12D\sqrt{\frac{10\alpha S}{(1-\alpha)L^*(t_k)}} + \frac{4MD\alpha\sqrt{80\log(m+1)}}{(1-\alpha)L^*(t_k)} + \sqrt{\frac{2MD\alpha}{(1-\alpha)L^*(t_k)}}\,\right],$

which is obtained by replacing $\gamma_k = 2MD$ for all $k$ and using the inequality $\log(a) \le \frac{a-1}{\sqrt{a}}$ for $a > 1$. Next, using Lemma 4, we obtain the required upper bound on the total number of iterations taken by Algorithm 1 as

$\displaystyle\frac{48D\alpha^2}{\beta^{1.5}}\left(\frac{20S\alpha}{(1-\alpha)L^*(t_0)\varepsilon}\right)^{1/2} + \frac{16MD\alpha^4}{\beta^2(1-\alpha)L^*(t_0)\varepsilon}\big(80\log(m+1)\big)^{1/2} + \frac{8\alpha^2}{\beta^{1.5}}\left(\frac{MD\alpha}{(1-\alpha)L^*(t_0)\varepsilon}\right)^{1/2}.$
[Non-smooth strongly convex case] By Lemma 4 and Proposition 1, the total number of iterations required by Algorithm 1 can be bounded above in a straightforward manner:

$\displaystyle\sum_{k=0}^{K}\frac{\alpha M^2}{\mu(1-\alpha)L^*(t_k)} \le \frac{4M^2\alpha^4}{\mu\beta^2(1-\alpha)L^*(t_0)\varepsilon}.$

[Non-smooth convex case] In a similar fashion, by Lemma 4 and Proposition 1, the total number of iterations required by Algorithm 1 is upper bounded by

$\displaystyle\sum_{k=0}^{K}\frac{8\alpha^2 M^2 D^2}{\big((1-\alpha)L^*(t_k)\big)^2} \le \frac{43M^2 D^2\alpha^7}{\beta^3\big((1-\alpha)L^*(t_0)\varepsilon\big)^2}.$
REFERENCES

[1] Z. Allen-Zhu and E. Hazan, Optimal black-box reductions between optimization objectives, arXiv preprint arXiv:1603.05642, (2016).
[2] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, arXiv preprint arXiv:1602.01506, (2016).
[3] A. Y. Aravkin, J. V. Burke, and M. P. Friedlander, Variational properties of value functions, SIAM Journal on Optimization, 23 (2013), pp. 1689–1717.
[4] A. Bayandina, Adaptive mirror descent for constrained optimization, arXiv preprint arXiv:1705.02029, (2017).
[5] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.
[6] A. Beck and M. Teboulle, Smoothing and first order methods: A unified framework, SIAM Journal on Optimization, 22 (2012), pp. 557–580.
[7] K. M. Bretthauer and B. Shetty, The nonlinear knapsack problem – algorithms and applications, European Journal of Operational Research, 138 (2002), pp. 459–472.
[8] M. Chiang, Geometric programming for communication systems, Foundations and Trends in Communications and Information Theory, 2 (2005), pp. 1–154.
[9] R. M. Freund and H. Lu, New computational guarantees for solving convex optimization problems with first order methods, via a function growth condition measure, arXiv preprint arXiv:1511.02974, (2015).
[10] B. Grimmer, Radial subgradient method, arXiv preprint arXiv:1703.09280v2, (2017).
[11] Z. Harchaoui, A. Juditsky, and A. Nemirovski, Conditional gradient algorithms for norm-regularized smooth convex optimization, Mathematical Programming, 152 (2015), pp. 75–112.
[12] A. Juditsky and A. Nemirovski, First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods, in Optimization for Machine Learning, S. Sra, S. Nowozin, and S. Wright, eds., MIT Press, Cambridge, MA, USA, 2011, pp. 121–148.
[13] S. Lacoste-Julien, M. Schmidt, and F. Bach, A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method, arXiv preprint arXiv:1212.2002, (2012).
[14] G. Lan, Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization, Mathematical Programming, 149 (2015), pp. 1–45.
[15] G. Lan and Z. Zhou, Algorithms for stochastic optimization with expectation constraints, arXiv preprint arXiv:1604.03887, (2016).
[16] C. Lemarechal, An extension of Davidon methods to non differentiable problems, Nondifferentiable Optimization, (1975), pp. 95–109.
[17] C. Lemarechal, A. Nemirovskii, and Y. Nesterov, New variants of bundle methods, Mathematical Programming, 69 (1995), pp. 111–147.
[18] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609.
[19] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609.
[20] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer, Netherlands, 2004.
[21] Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming, 103 (2005), pp. 127–152.
[22] Y. Nesterov, Primal-dual subgradient methods for convex problems, Mathematical Programming, 120 (2009), pp. 221–259.
[23] M. Patriksson and C. Stromberg, Algorithms for the continuous nonlinear resource allocation problem – new implementations and numerical studies, European Journal of Operational Research, 243 (2015), pp. 703–722.
[24] E. Pee and J. O. Royset, On solving large-scale finite minimax problems using exponential smoothing, Journal of Optimization Theory and Applications, 148 (2011), pp. 390–421.
[25] J. Renegar, "Efficient" subgradient methods for general convex optimization, SIAM Journal on Optimization, 26 (2016), pp. 2649–2676.
[26] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, submitted to SIAM Journal on Optimization, (2008).
[27] E. Van den Berg and M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM Journal on Scientific Computing, 31 (2008), pp. 890–912.
[28] E. Van den Berg and M. P. Friedlander, Sparse optimization with least-squares constraints, SIAM Journal on Optimization, 21 (2011), pp. 1201–1229.
[29] P. Wolfe, A method of conjugate subgradients for minimizing nondifferentiable functions, (1975), pp. 145–173.