
AN OPTIMAL ALGORITHM FOR CONSTRAINED

DIFFERENTIABLE CONVEX OPTIMIZATION

CLÓVIS C. GONZAGA∗, ELIZABETH W. KARAS†, AND DIANE R. ROSSETTO‡

Abstract. We describe two algorithms for solving differentiable convex optimization problems constrained to simple sets in Rⁿ, i.e., sets on which it is easy to project an arbitrary point. The algorithms are optimal in the sense that they achieve an absolute precision of ε in relation to the optimal value in O(1/√ε) iterations using only first order information. This complexity depends on an (unknown) Lipschitz constant L∗ for the function derivatives and on a (known) strong convexity constant µ∗ ≥ 0. The algorithms are extensions of well-known methods devised by Nesterov [9], without the need for estimates of L∗ and including (in the second algorithm) line searches and an adaptive procedure for estimating a strong convexity constant. The algorithms are such that all computed points are feasible, and the complexity analysis follows a simple geometric approach. Numerical tests for box-constrained quadratic problems are presented.

Key words. constrained convex optimization, optimal method

AMS subject classifications. 65K05, 68Q25, 90C25, 90C60

1. Introduction. We study the nonlinear programming problem

(P)   minimize f(x)   subject to x ∈ Ω,

where Ω ⊂ Rn is a closed convex set and f : Rn → R is convex and continuously

differentiable, with a Lipschitz constant L∗ > 0 for the gradient and a convexity parameter µ∗ ≥ 0 for f(·) on Ω. This means that for all x, y ∈ Ω,

‖∇f(x) − ∇f(y)‖ ≤ L∗‖x − y‖   (1.1)

and

f(x) ≥ f(y) + ∇f(y)T(x − y) + (µ∗/2)‖x − y‖².   (1.2)

If (1.2) holds for some µ∗ > 0, the function is said to be strongly convex. Note that µ∗ ≤ L∗.

Remark: A positive convexity constant is seldom available in practice. Whenever one is known, it will be used by the algorithms and will induce a linear convergence rate. Otherwise, we set µ∗ = 0.

Simple sets: we assume that Ω is a “simple” set, in the following sense: given an arbitrary point x ∈ Rⁿ, an oracle is available to compute PΩ(x) = argmin_{y∈Ω} ‖x − y‖, the orthogonal projection of x onto the set Ω. A well-known algorithm for solving Problem (P) is the projected gradient method described by Bertsekas [2]. Our methods will
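For the box-constrained problems used in the numerical tests of §4, the projection oracle is a componentwise clipping. A minimal sketch (the box and test point below are illustrative):

```python
import numpy as np

def project_box(x, lo, hi):
    """Orthogonal projection P_Omega(x) onto the box {z : lo <= z <= hi}.
    For a box, the nearest feasible point simply clips each coordinate."""
    return np.minimum(np.maximum(x, lo), hi)

# illustrative data: Omega = [0,1]^3, x has one coordinate below and one above the box
lo = np.zeros(3)
hi = np.ones(3)
x = np.array([-0.5, 0.3, 2.0])
p = project_box(x, lo, hi)   # -> [0.0, 0.3, 1.0]
```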

∗Department of Mathematics, Federal University of Santa Catarina. CP 5210, 88040-970, Florianópolis, SC, Brazil; e-mail: [email protected]. This author was partially supported by CNPq grants 308413/2009-1 and 472313/2011-8.

†Department of Mathematics, Federal University of Paraná. CP 19081, 81531-980, Curitiba, PR, Brazil; e-mail: [email protected]. This author was partially supported by CNPq grants 307714/2011-0 and 472313/2011-8.

‡Department of Mathematics, University of São Paulo. SP, Brazil; e-mail: [email protected]. This author was partially supported by CNPq grant 141740/2007-8.


be based only on first-order information, and each iteration will contain a projected gradient step.

Optimal methods: For a class of problems and a specified class of information accessible about each problem (the oracle), a method is called “optimal” if its worst case complexity is proved to be minimal. This is well explained in the classical book by Nemirovskii and Yudin [8]. For our problem, it is shown in this reference that the absolute precision attainable (in the worst case) in k steps of a method using only first order information is given by f(xk) − f∗ = O(1/k²), where f∗ is the value of an optimal solution. This is again proved by Nesterov [9], who also shows that the classical constant step steepest descent (as well as the projected gradient method) cannot ensure a performance better than O(1/k).

The main reference for this paper is the book by Yurii Nesterov [9], and our algorithms are extensions of his methods described in Chapter 2. The first of these methods [9, Algorithm 2.2.6] solves unconstrained convex problems, and the second – the constant step scheme [9, Algorithm 2.2.19] – solves constrained problems. Both methods depend on the availability of a Lipschitz constant for ∇f(·), and the method for constrained problems generates infeasible points. These drawbacks were removed by Nesterov in [10], where he describes a method for a more general problem, easily applied to (P) with the use of the indicator function for Ω. This method includes a scheme for estimating the Lipschitz constant, but does not take advantage of µ∗. Its iterations include at least two gradient computations, which makes them expensive.

Nesterov’s approach is applied to penalty methods by Lan, Lu and Monteiro [7], and interior descent methods based on Bregman distances are described by Auslender and Teboulle [1]. This method has been generalized to a non-interior method using projections by Rossetto [11].

In §2 we present an algorithm based on Nesterov’s methods with the following characteristics: the method does not depend on the knowledge of L∗, the strong convexity constant µ∗ is used whenever available, and all the action happens in the set Ω. The technique for proving the efficiency of the algorithm is based on results by Auslender and Teboulle [1].

We believe that the strongest point in this paper is the technique used to prove the efficiency of the method, based on very geometrical arguments and considerably simplified, incorporating the control of step lengths (which are not interpreted as Lipschitz constant estimates, although this association can be made), and projections of all the relevant points onto the set Ω.

In §3 we propose a method for using estimates of µ∗, similar to what was done in [5] for unconstrained problems, and show how to perform a line search at each iteration.

Finally, in §4 we show the results of a limited set of computational tests.

Here we describe some useful tools.

Simple quadratic functions

We shall call “simple” a quadratic function φ : Rⁿ → R with ∇²φ(x) = γI, γ ∈ R, γ > 0. The following facts are easily proved for such functions:

(i) φ(·) has a unique minimizer v ∈ Rⁿ (which we shall refer to as the center of the quadratic), and the function can be written as

x ∈ Rⁿ ↦ φ(v) + (γ/2)‖x − v‖².


(ii) Given a point x ∈ Rⁿ and a closed convex set Ω ⊂ Rⁿ,

argmin_{x∈Ω} φ(x) = PΩ(v).

(iii) Given x ∈ Rⁿ,

v = x − (1/γ)∇φ(x),   (1.3)

and

φ(x) − φ(v) = (1/2γ)‖∇φ(x)‖².   (1.4)
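These identities are easy to check numerically; a minimal sketch with an arbitrary simple quadratic (the values of γ, v and x are illustrative):

```python
import numpy as np

gamma, v = 2.0, np.array([1.0, -1.0])
# simple quadratic with center v, Hessian gamma*I, minimum value 3
phi = lambda x: 3.0 + 0.5 * gamma * np.dot(x - v, x - v)
grad = lambda x: gamma * (x - v)

x = np.array([4.0, 2.0])
# (1.3): the center is recovered from any point and its gradient
assert np.allclose(v, x - grad(x) / gamma)
# (1.4): the gap to the minimum value equals ||grad||^2 / (2*gamma)
assert np.isclose(phi(x) - phi(v), np.dot(grad(x), grad(x)) / (2.0 * gamma))
```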

Given a simple quadratic φ with center v ∈ Rⁿ, we can construct another simple quadratic φ+ with center v+ ∈ Ω, with the same Hessian, and such that φ+ ≤ φ in Ω.

Lemma 1.1. Let Ω be a convex set in Rⁿ, and consider a simple quadratic defined in Rⁿ by φ(x) = φ∗ + (γ/2)‖x − v‖². Define φ+(x) = φ∗+ + (γ/2)‖x − v+‖², with v+ = PΩ(v) and φ∗+ = φ(v+). Then for all x ∈ Ω, φ+(x) ≤ φ(x).

Proof. By definition, v+ = PΩ(v) = argmin_{x∈Ω} φ(x). Hence, by optimality, −∇φ(v+) belongs to the normal cone of Ω at v+, i.e., for all x ∈ Ω,

∇φ(v+)T(x − v+) ≥ 0.

It follows that for all x ∈ Ω,

φ(x) = φ(v+) + ∇φ(v+)T(x − v+) + (γ/2)‖x − v+‖² ≥ φ∗+ + (γ/2)‖x − v+‖² = φ+(x).
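Lemma 1.1 can be sanity-checked numerically on a box; a small sketch (the set Ω, the center v and the sample points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
lo, hi = np.zeros(3), np.ones(3)             # Omega = [0,1]^3
gamma = 1.5
v = np.array([2.0, -1.0, 0.5])               # center v lies outside Omega
v_plus = np.clip(v, lo, hi)                  # v+ = P_Omega(v)

phi = lambda x: 0.5 * gamma * np.dot(x - v, x - v)            # phi* = 0
phi_plus = lambda x: phi(v_plus) + 0.5 * gamma * np.dot(x - v_plus, x - v_plus)

# phi_plus <= phi at random feasible points, as the lemma asserts
for _ in range(100):
    x = rng.uniform(lo, hi)
    assert phi_plus(x) <= phi(x) + 1e-12
```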

2. The algorithm. Consider problem (P) and assume that a strong convexity constant µ = µ∗ ≥ 0 is given. As shown in Nesterov [9] (and also in the present paper), the knowledge of µ∗ > 0 improves the asymptotic convergence rate of the algorithms when applied to strongly convex functions. When no such constant is known, the methods use µ∗ = 0 and keep the optimal (but sublinear) complexity. The usage of a large convexity constant (when available) improves the performance of the algorithms significantly. In §3 we shall propose a method for estimating values of µ and profiting from this fact.

We now state the main algorithm and then study its properties. We include in the statement of the algorithm the definitions of the relevant functions (approximations of f(·) and the simple quadratic defined below).

Here we summarize the geometrical construction at an iteration k, represented in Fig. 2.1. The iteration starts with two points xk, vk ∈ Ω and a simple quadratic function φk(x) = f(xk) + (γk/2)‖x − vk‖², whose global minimizer vk is feasible.

A point yk = xk + θ(vk − xk) is chosen between xk and vk. All the action is centered on yk, with the following construction:

(i) Take a projected gradient step from yk, generating xk+1 ∈ Ω.


(ii) Define a simple quadratic lower approximation of f(·),

x ∈ Rⁿ ↦ ℓµ(x) = f(yk) + ∇f(yk)T(x − yk) + (µ/2)‖x − yk‖².

Note that when µ = 0 this is the linear approximation of f(·) around yk.

(iii) Compute a value α ∈ (0, 1), and define φα(x) = αℓµ(x) + (1 − α)φk(x), with Hessian γk+1 I = ∇²φα(x) = (αµ + (1 − α)γk) I, and let vk+1 be the minimizer of this (simple) quadratic on Ω. The iteration is completed by defining

φk+1(x) = f(xk+1) + (γk+1/2)‖x − vk+1‖².


Fig. 2.1. The mechanics of the algorithm.

Algorithm 1. Main Algorithm
Data: x0 ∈ Ω, v0 = x0, µ ≥ 0 (convexity constant), L0 > µ, γ0 ≥ L0.
k = 0, L = L0.
repeat
    Compute α ∈ (0, 1) such that Lα² = αµ + (1 − α)γk.
    Set θ = γkα/(γk + αµ) and y = xk + θ(vk − xk).
    Compute f(y) and g = ∇f(y).
    Use Routine 1 (steepest descent step) to compute x+ and update L.
    If the step is rejected, restart the iteration.
    Updates:
        xk+1 = x+
        γk+1 = αµ + (1 − α)γk
        Define x ↦ φk(x) = f(xk) + (γk/2)‖x − vk‖²
        Define x ↦ ℓµ(x) = f(y) + gT(x − y) + (µ/2)‖x − y‖²
        Define x ↦ φα(x) = αℓµ(x) + (1 − α)φk(x)
        vk+1 = argmin_{x∈Ω} φα(x) = PΩ( vk − (α/γk+1)(g + µ(vk − y)) )
        Define αk = α, θk = θ, yk = y, Lk = L, φ∗αk = φαk(vk+1) (for the analysis)
    k = k + 1.
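The steps above can be sketched in code. The following is a minimal sketch of Algorithm 1 (with the steepest descent step of Routine 1 inlined), assuming a quadratic objective and box constraints as in the numerical tests of §4; the problem data at the end are illustrative:

```python
import numpy as np

def solve_box_qp(Q, b, lo, hi, x0, L0, mu=0.0, iters=500):
    """Sketch of Algorithm 1 for f(x) = 0.5 x'Qx + b'x on the box [lo, hi].
    L0 is an initial estimate of a Lipschitz constant; mu a known convexity constant."""
    f = lambda x: 0.5 * x @ Q @ x + b @ x
    grad = lambda x: Q @ x + b
    proj = lambda x: np.clip(x, lo, hi)            # projection oracle P_Omega

    x = v = proj(np.asarray(x0, dtype=float))
    gamma, L = L0, L0                               # gamma_0 = L_0
    for _ in range(iters):
        while True:                                 # restart the iteration on rejection
            # alpha in (0,1) solving L*alpha^2 = alpha*mu + (1-alpha)*gamma
            c = gamma - mu
            alpha = (-c + np.sqrt(c * c + 4.0 * L * gamma)) / (2.0 * L)
            theta = gamma * alpha / (gamma + alpha * mu)
            y = x + theta * (v - x)
            fy, g = f(y), grad(y)
            x_plus = proj(y - g / L)                # projected gradient step
            d = x_plus - y
            if f(x_plus) <= fy + g @ d + 0.5 * L * (d @ d):   # f(x+) <= u_L(x+)
                break
            L *= 2.0                                # reject: shorten the step length 1/L
        gamma_plus = alpha * mu + (1.0 - alpha) * gamma
        v = proj(v - (alpha / gamma_plus) * (g + mu * (v - y)))
        x, gamma = x_plus, gamma_plus
        z = proj(y - 2.0 * g / L)                   # test a longer step for next iteration
        dz = z - y
        if f(z) <= fy + g @ dz + 0.25 * L * (dz @ dz):        # f(z) <= u_{L/2}(z)
            L *= 0.5
    return x

# illustrative problem: f(x) = 0.5 x'diag(1,10)x - (1,1)'x on [0,1]^2,
# whose minimizer is (1.0, 0.1); mu = 1 is a genuine strong convexity constant here
Q = np.diag([1.0, 10.0])
b = np.array([-1.0, -1.0])
x_sol = solve_box_qp(Q, b, np.zeros(2), np.ones(2), np.zeros(2), L0=2.0, mu=1.0)
```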

Initial values: γ0 = L0 should be some estimate of a Lipschitz constant for ∇f(·). The method is not very sensitive to the choice of these parameters, because in the initial iterations they will be automatically adjusted. Nevertheless, as we shall see in the complexity analysis, it is better to overestimate γ0 (because this parameter never increases) than to use a small value.

Steepest descent step: the next algorithm performs a projected steepest descent step x+ = PΩ(y − g/L). The step is accepted if f(x+) ≤ uL(x+), where uL(·) is a simple quadratic approximation of f with Hessian LI. It is easy to see that in the unconstrained case this is an Armijo condition.

If the step length 1/L is not accepted, then L is increased and the iteration must be restarted, because α depends on L.

Routine 1. Steepest descent step
Data: y ∈ Ω, f(y), g, L > µ ≥ 0.
Define uL(x) = f(y) + gT(x − y) + (L/2)‖x − y‖².
x+ = PΩ(y − g/L)
if f(x+) > uL(x+), set L = 2L and reject the step.
If the step is not rejected, test a reduced L (for the next iteration):
    z = PΩ(y − 2g/L)
    if f(z) ≤ uL/2(z), set L = L/2.
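The acceptance test and the doubling of L can be exercised in isolation. A small sketch, assuming a one-dimensional unconstrained f (so PΩ is the identity) and an illustrative starting estimate L:

```python
f = lambda x: 5.0 * x * x      # gradient Lipschitz constant is L* = 10
df = lambda x: 10.0 * x

y, L = 1.0, 1.0                # illustrative start; L underestimates L*
g = df(y)
while True:
    x_plus = y - g / L         # unconstrained, so the projection is the identity
    u_L = f(y) + g * (x_plus - y) + 0.5 * L * (x_plus - y) ** 2
    if f(x_plus) <= u_L:       # accept: u_L upper-bounds f at x_plus
        break
    L *= 2.0                   # reject: double L, i.e., halve the step length 1/L
# starting from L = 1, the test first holds at L = 16
```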

Updating L: The parameter α depends on L. The only requirement needed in our analysis for the steepest descent step is that f(x+) ≤ uL(x+). This is trivially true if L is a Lipschitz constant for ∇f(·) on Ω. The algorithm checks this condition and increases L whenever it is not satisfied.

Then we always have L ≤ L̄ := max{L0, 2L∗}. The number L should be interpreted as the inverse of the step length used by the steepest descent step. Whenever the step length 1/L is accepted, we test the step length 2/L: if it is also acceptable, we set L = L/2 for the next iteration. Note that changing L also affects the values of α and θ, which will occur in the following iteration.

One could also start all iterations by setting L = L/2, but then the step length 1/L would be rejected in most iterations, resulting in a large number of gradient computations.

Note that if the initial value of L is known to be a Lipschitz constant for ∇f(·), then no updates are needed and the algorithm needs no function computations at all, but then it will be a short step method, very sensitive to the value of L.

2.1. Analysis of the algorithm. The most important procedure in the algorithm is the choice of the parameter α, which then determines θ and y at each iteration. The choice made is the one devised by Nesterov in [9, Scheme (2.2.6)]. Instead of “discovering” the values of these parameters, we shall simply adopt them and show their properties. We shall also show simple geometrical properties of this choice, both in the unconstrained and in the constrained case. Then we show that his algorithm extends to the constrained case simply by projecting the vectors xk+1 and vk+1 that would be obtained in the unconstrained case. This treatment will then differ from Nesterov’s, making use of an idea by Auslender and Teboulle [1].

Once y is determined, two independent actions are taken:
(i) A projected steepest descent step from y computes xk+1.
(ii) A new simple quadratic is constructed by combining φk(·) and a lower approximation ℓµ(·) of f(·) about y:

φαk(x) = αkℓµ(x) + (1 − αk)φk(x).


In the unconstrained case, we know that f(xk+1) ≤ min_{x∈Rⁿ} φαk(x), as proved by Nesterov [9]. In the constrained case we shall define the simple quadratic φk+1(·) with center in Ω as in Algorithm 1, and prove similar results, guided by a construction made by Auslender and Teboulle [1].

Our scope will be to prove two facts:
(i) At any iteration k, φ∗αk ≥ f(xk+1).
(ii) For all x ∈ Ω, φk+1(x) − f(x) ≤ (1 − αk)(φk(x) − f(x)).
From these facts we shall conclude that f(xk) → f∗ with the same speed as γk − µ → 0, which easily leads to the desired complexity result.

Analysis of iteration k: Let us initially simplify the notation at iteration k. Without loss of generality we assume that xk = 0 and that ‖vk − xk‖ = 1. This can be obtained by a simple change of variables, except in the case xk = vk; in this case, we assume that xk = vk = yk = 0.

We also drop subscripts, so that all the points, functions and variables will be denoted as follows:

(0, v, y, v+, x+, α, θ, γ, γ+, L, φ, φα, φ+) ≡ (xk, vk, yk, vk+1, xk+1, αk, θk, γk, γk+1, Lk, φk, φαk, φk+1).

Given the point y, denote g = ∇f(y) and define:

• Lower approximation: x ↦ ℓ(x) = f(y) + gT(x − y) + (µ/2)‖x − y‖², with gradient ∇ℓ(x) = g + µ(x − y) and Hessian ∇²ℓ(x) = µI. By definition, for all x ∈ Rⁿ, ℓ(x) ≤ f(x).

• Upper approximation: x ↦ u(x) = f(y) + gT(x − y) + (L/2)‖x − y‖², with gradient ∇u(x) = g + L(x − y) and Hessian ∇²u(x) = LI. If L is a Lipschitz constant for ∇f(·), then for all x ∈ Rⁿ, f(x) ≤ u(x).

• Simple quadratic: φ(x) = f(0) + (γ/2)‖x − v‖².

In our algorithms L is not necessarily a Lipschitz constant, and in most of this section µ does not need to be a strong convexity constant: this will be the case in §3. So, let us state clearly the hypotheses used in the paper at all iterations, in our simplified notation. Let x+ = PΩ(y − g/L).

(H1) f(x+) ≤ u(x+) (always ensured by the algorithms).
(H2) f(0) ≥ ℓ(0) (always true for Algorithm 1).

Remarks: The actual function f(·) plays no role in the analysis below. Only its approximations are relevant, and they need only satisfy the hypotheses above.

The hypothesis (H2) is trivially true for Algorithm 1, because µ is a strong convexity constant and then ℓ(·) underestimates f(·). It is included here because it will be needed in §3, where µ may not have this property: in that case ℓ(·) will be required to underestimate f(·) at the point 0.

So, for the remainder of this section we assume that these hypotheses hold. During the iteration we construct the function

x ↦ φα(x) = αℓ(x) + (1 − α)φ(x).   (2.1)

Since y = θv and v − y = (1 − θ)v, we have

∇φα(v) = α(g + µ(1 − θ)v),   ∇²φα(v) = (αµ + (1 − α)γ)I.


Properties of the unconstrained problem

We begin by studying properties of the following points, shown in Fig. 2.2:

v̄ = argmin_{x∈Rⁿ} φα(x) = v − (1/γ+)∇φα(v) = v − (α/γ+)(g + µ(1 − θ)v),   (2.2)

x̄ = argmin_{x∈Rⁿ} u(x) = y − (1/L)g.   (2.3)


Fig. 2.2. Geometric properties of the unconstrained case.

The following expressions will be useful below. By construction,

θ = γα/(γ + αµ),   α² = γ+/L,   γ+ = αµ + (1 − α)γ.

By substitution, we get

1 − θ = γ+/(γ + αµ),   (2.4)

α − θ = α²µ/(γ + αµ) = (µ/L)(1 − θ).   (2.5)

The first lemma shows our main finding about the geometry of these points. All the action happens in the two-dimensional space defined by 0, v, v̄. Note the beautiful similarity of the various triangles in Fig. 2.2.

Lemma 2.1. Consider v̄, x̄ ∈ Rⁿ given by (2.2) and (2.3), respectively. Then x̄ = αv̄.

Proof. By definition of v̄,

αv̄ = αv(1 − αµ(1 − θ)/γ+) − (α²/γ+)g.   (2.6)

Using (2.4) and the expression of θ,

1 − αµ(1 − θ)/γ+ = 1 − αµ/(γ + αµ) = γ/(γ + αµ) = θ/α.

Substituting into (2.6),

αv̄ = θv − (α²/γ+)g.

Remembering that α² = γ+/L and that y = θv, we conclude that αv̄ = y − (1/L)g = x̄, completing the proof.
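The identity x̄ = αv̄ can be verified numerically from the parameter relations; a sketch with illustrative values of γ, µ and L (the vectors v and g are arbitrary):

```python
import numpy as np

gamma, mu, L = 3.0, 0.5, 4.0        # illustrative values with mu < gamma and L > mu
# alpha in (0,1) solving L*alpha^2 = alpha*mu + (1-alpha)*gamma
c = gamma - mu
alpha = (-c + np.sqrt(c * c + 4.0 * L * gamma)) / (2.0 * L)
gamma_plus = alpha * mu + (1.0 - alpha) * gamma    # note gamma_plus = L*alpha^2
theta = gamma * alpha / (gamma + alpha * mu)

rng = np.random.default_rng(1)
v = rng.standard_normal(4)
v /= np.linalg.norm(v)               # ||v|| = 1, with x_k = 0
g = rng.standard_normal(4)           # arbitrary gradient at y
y = theta * v

v_bar = v - (alpha / gamma_plus) * (g + mu * (1.0 - theta) * v)   # (2.2)
x_bar = y - g / L                                                 # (2.3)
assert np.allclose(x_bar, alpha * v_bar)                          # Lemma 2.1
```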

Now we can compare the values of u(x̄) and φα(v̄). This will be done in two steps (observe the parallel lines in Fig. 2.2).

Lemma 2.2. Consider v = argmin_{x∈Rⁿ} φ(x) and θ, α ∈ (0, 1). Let u(·) and ℓ(·) be the upper and lower approximations of f at y = θv, and φα(·) be the quadratic function defined by (2.1). Then

u(αv) ≤ φα(v).

Proof. On one hand, if ‖v‖ = 1,

φα(v) = αℓ(v) + (1 − α)φ(v) = α( f(y) + gT(1 − θ)v + (µ/2)(1 − θ)² ) + (1 − α)f(0).

By hypothesis (H2), f(0) ≥ ℓ(0) = f(y) − θgTv + θ²µ/2, and rearranging,

φα(v) ≥ f(y) + ((1 − θ)α − θ(1 − α))gTv + (µ/2)( α(1 − θ)² + (1 − α)θ² ),

and simplifying,

φα(v) ≥ f(y) + (α − θ)gTv + (µ/2)(α − 2αθ + θ²).   (2.7)

On the other hand,

u(αv) = f(y) + (α − θ)gTv + (L/2)(α − θ)².   (2.8)

Let us simplify the last term, using (2.5):

(L/2)(α − θ)(α − θ) = (µ/2)(1 − θ)(α − θ).

It follows that

u(αv) = f(y) + (α − θ)gTv + (µ/2)(α − αθ + θ² − θ).

Subtracting this from (2.7), we obtain immediately

φα(v) − u(αv) ≥ (µ/2)θ(1 − α) ≥ 0,

because θ, α ∈ (0, 1). If x = y = v = 0, then ℓ(0) = f(0) = u(0), and the result is straightforward, completing the proof.

Lemma 2.3. Let u(·) and ℓ(·) be the upper and lower approximations of f, and φα(·) be the quadratic function defined by (2.1). Consider v̄, x̄ ∈ Rⁿ given by (2.2) and (2.3), respectively. Then

u(x̄) ≤ φα(v̄).


Proof. Since x̄ and v̄ are respectively the global minimizers of u(·) and φα(·), we have for all x ∈ Rⁿ,

u(x) = u(x̄) + (L/2)‖x − x̄‖²,   φα(x) = φα(v̄) + (γ+/2)‖x − v̄‖².

We already know from the last lemma that u(αv) ≤ φα(v). Now we only need to show that

u(αv) − u(x̄) = φα(v) − φα(v̄).

The construction is shown in Fig. 2.2: since, by Lemma 2.1, x̄ = αv̄,

αv − x̄ = α(v − v̄).

By construction, α² = γ+/L; then

u(αv) − u(x̄) = (Lα²/2)‖v − v̄‖² = (γ+/2)‖v − v̄‖² = φα(v) − φα(v̄),

completing the proof.

This completes the treatment of the unconstrained case: we know that φα(v̄) ≥ u(x̄) and x̄ = αv̄. We are ready for the main result.

An iteration for the constrained problem

The geometry of an iteration for the constrained problem is shown in Fig. 2.3 for the cases µ = 0 (left) and µ > 0 (right). The figures also show the position of the point z which will be used in the complexity analysis. This point corresponds to what would be used by the generalization of the Auslender–Teboulle method [1] done by Rossetto [11].


Fig. 2.3. An iteration for the constrained problem with µ = 0 and µ > 0, respectively.

Remark: In an algorithm for the unconstrained problem, the new simple function at iteration k + 1 would have the same center as φα. Now φα is minimized in Ω to obtain v+ and φ∗α = φα(v+), by projecting v̄ into Ω. The new simple function will differ from φα: it will be a simple quadratic with minimizer v+ and Hessian γ+I. We shall prove that this function lies below φα in the set Ω, ensuring the desired decrease in the simple function.


Consider an iteration of Algorithm 1, in the simplified setting above. Define

x+ = argmin_{x∈Ω} u(x) = PΩ(x̄),   v+ = argmin_{x∈Ω} φα(x) = PΩ(v̄),   φ∗α = φα(v+),

x ↦ φ+(x) = f(x+) + (γ+/2)‖x − v+‖².

Theorem 2.4. For the construction above,
(i) f(x+) ≤ φ∗α.
(ii) For any x ∈ Ω, φ+(x) − ℓ(x) ≤ (1 − α)(φ(x) − ℓ(x)), or equivalently,

φ+(x) − ℓ(x) ≤ ((γ+ − µ)/(γ − µ)) (φ(x) − ℓ(x)).

(iii) If µ is a strong convexity constant on Ω, then for any x ∈ Ω,

φ+(x) − f(x) ≤ ((γ+ − µ)/(γ − µ)) (φ(x) − f(x)).

(iv) (Recursion, recovering the original notation) If µ is a strong convexity constant on Ω, then for any x ∈ Ω and k = 1, 2, . . .,

φk(x) − f(x) ≤ ((γk − µ)/(γ0 − µ)) (φ0(x) − f(x)).   (2.9)

Proof.
(i) By definition, v+ = PΩ(v̄) ∈ Ω. Define z = αv+. Then z ∈ Ω, because 0 ∈ Ω and v+ ∈ Ω. By definition, u(x+) ≤ u(z). Then

f(x+) ≤ u(x+) ≤ u(z) = u(x̄) + (L/2)‖z − x̄‖²,

where the first inequality is ensured by the algorithm. By Lemma 2.3, u(x̄) ≤ φα(v̄), and by Lemma 2.1, x̄ = αv̄, so ‖z − x̄‖ = α‖v+ − v̄‖. It follows that

f(x+) ≤ φα(v̄) + (Lα²/2)‖v+ − v̄‖² = φα(v̄) + (γ+/2)‖v+ − v̄‖² = φα(v+) = φ∗α,

proving (i).

(ii) By construction, for all x ∈ Rⁿ,

φα(x) = αℓ(x) + (1 − α)φ(x) = φα(v̄) + (γ+/2)‖x − v̄‖².   (2.10)

With v+ = PΩ(v̄) and φ∗α = φα(v+), a direct application of Lemma 1.1 gives, for any x ∈ Ω,

φ∗α + (γ+/2)‖x − v+‖² ≤ φα(x).

Using (i) and the definition of φ+, we get

φ+(x) = f(x+) + (γ+/2)‖x − v+‖² ≤ φ∗α + (γ+/2)‖x − v+‖² ≤ φα(x).

Subtracting ℓ(x) from both sides and using the definition (2.10) of φα, we obtain

φ+(x) − ℓ(x) ≤ (1 − α)(φ(x) − ℓ(x)).

The equivalent formulation follows from the definition γ+ = αµ + (1 − α)γ, from which we get γ+ − µ = (1 − α)(γ − µ), proving (ii).


(iii) Assume now that for any x ∈ Ω, ℓ(x) ≤ f(x), and define β = (γ+ − µ)/(γ − µ). For any x ∈ Ω, adding ℓ(x) − f(x) ≤ 0 to both sides of the expression in (ii), we obtain

φ+(x) − f(x) ≤ β(φ(x) − f(x)) + (1 − β)(ℓ(x) − f(x)) ≤ β(φ(x) − f(x)),

proving (iii).

(iv) follows directly from (iii) by recursion, completing the proof.

2.2. Complexity. The following lemma was proved by Nesterov [9, p. 77] with a different notation:

Lemma 2.5. Let ζ0, ζ1, . . . be a sequence satisfying ζ0 > 0 and, for k ∈ N,

ζk+1 = (1 − αk)ζk,   Lαk² ≥ ζk+1.

Then, for k ∈ N, ζk ≤ 4L/k².

Proof. See [5, Lemma 10]. The original proof assumes that Lαk² = ζk+1, but the result remains trivially true with the inequality in our statement.
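The bound can be checked numerically by iterating the recursion with equality, solving Lαk² = (1 − αk)ζk for αk ∈ (0, 1) at each step; the constants L and ζ0 below are illustrative:

```python
import math

L, zeta = 10.0, 50.0                   # illustrative constants; zeta_0 > 0
for k in range(1, 1001):
    # solve L*alpha^2 + zeta*alpha - zeta = 0 for the root in (0, 1)
    alpha = (-zeta + math.sqrt(zeta * zeta + 4.0 * L * zeta)) / (2.0 * L)
    zeta *= (1.0 - alpha)              # zeta now holds zeta_k
    assert zeta <= 4.0 * L / (k * k)   # Lemma 2.5: zeta_k <= 4L/k^2
```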

For the complexity analysis, define L̄ = max{2L∗, L0}, where L∗ is a Lipschitz constant for ∇f(·) and L0 is in the algorithm data. As we have noted above, the parameter L used in the algorithm will never be increased when L ≥ L∗, and hence L ≤ L̄ holds at all iterations.

Theorem 2.6. Consider the sequences generated by Algorithm 1 and assume that x∗ is an optimal solution. Then for k ∈ N,

γk − µ ≤ min{ 4L̄/k², (1 − √(µ/L̄))^k (γ0 − µ) },   (2.11)

and consequently

f(xk) − f∗ ≤ min{ 4L̄/((γ0 − µ)k²), (1 − √(µ/L̄))^k } ( f(x0) − f∗ + (γ0/2)‖x∗ − x0‖² ).

Proof. The algorithm sets at iteration k

γk+1 − µ = (1 − αk)(γk − µ).   (2.12)

Define ζk = γk − µ for k ∈ N. Then ζk+1 = (1 − αk)ζk. Also, by construction and by the reasoning above, L̄αk² ≥ Lkαk² = γk+1 ≥ γk+1 − µ = ζk+1. Using Lemma 2.5, γk − µ ≤ 4L̄/k². On the other hand, αk² = γk+1/Lk ≥ µ/L̄. Using this in (2.12),

γk − µ ≤ (1 − √(µ/L̄))^k (γ0 − µ),

proving (2.11). From Theorem 2.4 (iv), (2.9) holds in particular at x∗:

φk(x∗) − f∗ ≤ ((γk − µ)/(γ0 − µ)) (φ0(x∗) − f∗).


Using the fact that f(xk) = φk(vk) ≤ φk(x∗) and the definition of φ0, we get

f(xk) − f∗ ≤ ((γk − µ)/(γ0 − µ)) ( f(x0) + (γ0/2)‖x∗ − x0‖² − f∗ ).

Using (2.11), we complete the proof.

Remark: a good choice for the first step is γ0 = L0, where L0 is an estimate of L∗. If one chooses L0 ≫ L∗, the algorithm will rapidly reduce L, and then γk will also be reduced. In fact, if in some iteration (say, the first) γ0 > L, then it is easy to see that γ1 ≤ L: we have γ1 − µ = (1 − α)(γ0 − µ) and α² = γ1/L, so γ1 > L would give α > 1 and hence γ1 − µ < 0, which is impossible. With the (ideal) choice γ0 = L∗ + µ and L0 = L∗, we may take L̄ = L0, and the result implies

f(xk) − f∗ ≤ (4/k²) ( f(x0) − f∗ + ((L∗ + µ)/2)‖x∗ − x0‖² ).

Remark: The term f(x0) − f∗ might be removed by using the inequality

f(x0) ≤ f∗ + ∇f(x∗)T(x0 − x∗) + (L∗/2)‖x∗ − x0‖²,

but for constrained problems ∇f(x∗) ≠ 0 in general. In the unconstrained case, of course, ∇f(x∗) = 0, which simplifies the expression.

This concludes the complexity analysis of the main method, showing that it is an optimal algorithm. We hope to have unveiled the geometrical aspects of each iteration. All the action happens in a three-dimensional affine space determined by the points xk, vk, the gradient vector g (which define the two-dimensional space in which the unconstrained iteration would work), and vk+1.

3. Enhanced algorithm. In this section we propose two enhancements to the main algorithm:

Projected line search: in each iteration of the main algorithm we double the step length 1/L for the following iteration whenever the longer step is presently acceptable. Now we keep the same rule for the next iteration and add a projected line search for the computation of xk+1. One could also use the result of this search to increase the next step length to more than 2/L, but this proved inefficient in our tests.

The parameter µ: In general, no positive strong convexity constant µ∗ is known, especially if we think of the function restricted to the set Ω. Whenever µ∗ > 0 is known, its usage improves both the theoretical complexity and the performance of the algorithm. In [5] we proposed the usage of a decreasing sequence of values µk > µ∗ instead of the fixed value µ∗ ≥ 0, and proved that the optimal complexity is preserved for unconstrained problems. Now we adopt the same procedure in the constrained case, and briefly show how to adapt the analysis made in that paper.

Remark: Even when µ∗ > 0 is known, the usage of higher values of µk may improve the performance of the algorithm, keeping the same complexity properties. This will be the case when f(·) admits a higher convexity constant in the region where the algorithm acts. Of course, setting µ0 = µ∗ eliminates this feature.

In the algorithm below, we omit the definitions of φk, φα and ℓµ.

Algorithm 2. Enhanced Algorithm


Data: v0 = x0 ∈ Ω, L0 ≥ µ0 ≥ µ∗ ≥ 0, γ0 ≥ L0, β > 1, η ∈ (0, 1).
(Suggested values: µ0 = L0/100, β = 1.1, η = 0.2, µmin = 10⁻⁴.)
k = 0, L = L0, µ = µ0.
repeat
    If γk − µ∗ < β(µ − µ∗), set µ = max{µ∗, ηµ}.
    Compute α, θ, y and g as in Algorithm 1.
    Use Routine 2 to compute x+ and update L and µ.
    If the step is rejected, restart the iteration.
    Updates:
        xk+1 = x+
        γk+1 = αµ + (1 − α)γk
        vk+1 = argmin_{x∈Ω} φα(x) = PΩ( vk − (α/γk+1)(g + µ(vk − y)) )
        αk = α, θk = θ, yk = y, Lk = L, µk = µ.
    k = k + 1.

This algorithm uses in each iteration an estimated convexity constant µk, always satisfying γk − µ∗ ≥ β(µk − µ∗). The steepest descent routine below also deals with µ: it reduces µ whenever the hypothesis (H2) used in the previous section is violated. The constant µmin avoids a loop when the function is linear along the line segment from y to x+.

We omit the definitions of uL and ℓµ (see Routine 1).

Routine 2. Steepest descent step

Data: y ∈ Ω, f(y), g, L > µ ≥ 0, jmax ∈ N, µmin > 0.
(Suggested value: µmin = 10⁻⁴.)
j = 1
x+ = PΩ(y − g/L)
if f(x+) > uL(x+), set L = 2L and reject the step.
if f(x+) < ℓµ(x+), reduce µ and reject the step:
    µ = max{µ∗, ηµ}. If µ < µmin, set µ = µ∗.
    Reject the step.
If the step is not rejected, test a reduced L:
    z = PΩ(y − 2g/L)
    if f(z) ≤ uL/2(z), set L+ = L/2, else L+ = L
    Projected line search:
    while f(z) ≤ uL/2(z) and j < jmax
        x+ = z
        L = L/2
        z = PΩ(y − 2g/L)
        j = j + 1
    end while
    L = L+

The algorithm is well defined: whenever it detects that the parameter µ is too large, it reduces µ − µ∗ by a factor of η, and hence µk converges to µ∗. Besides this, the total number of µ updates in Routine 2 is bounded by the smallest j such that ηʲµ0 ≤ µmin.
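To illustrate the adaptive handling of L, here is a simplified Python sketch of a Routine 2-style step on a box-constrained quadratic. This is our own rendering under stated simplifications (it omits the µ test, the jmax-bounded line search, and the L+ bookkeeping), not the authors' Matlab code; uL is taken as the usual quadratic upper model uL(x) = f(y) + gᵀ(x − y) + (L/2)‖x − y‖².

```python
import numpy as np

def adaptive_step(f, grad, proj, y, L):
    """One projected steepest-descent step with adaptive L:
    double L while the upper model u_L is violated at x+,
    then accept the longer step 2/L once if u_{L/2} holds at z."""
    fy, g = f(y), grad(y)
    while True:
        x_plus = proj(y - g / L)
        d = x_plus - y
        if f(x_plus) <= fy + g @ d + 0.5 * L * (d @ d):   # u_L(x+) test
            break
        L *= 2.0                                          # step rejected
    z = proj(y - 2.0 * g / L)
    dz = z - y
    if f(z) <= fy + g @ dz + 0.25 * L * (dz @ dz):        # u_{L/2}(z) test
        x_plus, L = z, L / 2.0
    return x_plus, L

# toy box-constrained quadratic (illustrative data)
rng = np.random.default_rng(1)
n = 20
A = rng.normal(size=(n, n))
Q = A.T @ A + np.eye(n)
b = rng.normal(size=n)
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
proj = lambda x: np.clip(x, 0.0, 1.0)

x = proj(rng.normal(size=n))
f0, L = f(x), 1.0
for _ in range(100):
    x, L = adaptive_step(f, grad, proj, x, L)
assert f(x) <= f0   # each accepted step satisfies f(x+) <= u_L(x+) <= f(y)
```

Since x+ minimizes the model uL over Ω and uL(y) = f(y), every accepted step decreases the function value, which is what the test at the end checks.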


3.1. Complexity results. The complexity analysis for this method is quite similar to what was done in [5], with one difference: there, ∇f(x∗) = 0 at any optimal solution x∗. We only need to prove one lemma (akin to Lemma 9 in that reference), and then adapt the complexity results to our setting.

The following hypothesis is needed:
Hypothesis. Given x0 ∈ Ω, assume that the level set associated with x0 is bounded, i.e.,

D = sup{ ‖x − y‖ : x, y ∈ Ω, f(x) ≤ f(x0), f(y) ≤ f(x0) } < ∞.

Define Q = D²/2.

Lemma 3.1. Consider γ0 > µ∗. Let x∗ be an optimal solution of (P). Then, at every iteration of Algorithm 2,

φk(x∗) − f∗ ≤ ((γ0 Q + f(x0) − f∗) / (γ0 − µ∗)) (γk − µ∗).    (3.1)

Proof. Set ∆f = f(x0) − f(x∗). We proceed by induction. First consider k = 0: the result follows from

φ0(x∗) − f∗ = f(x0) − f∗ + (γ0/2)‖x∗ − x0‖² ≤ ∆f + γ0 Q.

For k > 0, by construction, (H1) and (H2) are satisfied, and hence by Theorem 2.4 (i), f(xk+1) ≤ φ∗αk. Using the definition of φαk, it follows that for any x ∈ Ω,

φk+1(x) ≤ φαk(x) = (1 − αk)φk(x) + αk( f(yk) + ∇f(yk)ᵀ(x − yk) + (µk/2)‖x − yk‖² ).

Using (1.2),

φk+1(x) ≤ (1 − αk)φk(x) + αk( f(x) + ((µk − µ∗)/2)‖x − yk‖² ).

Subtracting f∗ from both sides,

φk+1(x∗) − f∗ ≤ (1 − αk)(φk(x∗) − f∗) + αk Q(µk − µ∗).

Using the induction hypothesis,

φk+1(x∗) − f∗ ≤ (1 − αk) ((γ0 Q + ∆f)/(γ0 − µ∗)) (γk − µ∗) + αk Q(µk − µ∗).

Since (γ0 Q + ∆f)/(γ0 − µ∗) > Q, we can write

φk+1(x∗) − f∗ ≤ ((γ0 Q + ∆f)/(γ0 − µ∗)) ( (1 − αk)(γk − µ∗) + αk(µk − µ∗) ),

and expanding the last term,

φk+1(x∗) − f∗ ≤ ((γ0 Q + ∆f)/(γ0 − µ∗)) ( (1 − αk)γk + αk µk − µ∗ ).

Using the definition of γk+1 we get (3.1), completing the proof.


Now the adaptation of the results in [5] is straightforward, leading to the following result for any iteration k:

γk − µ∗ ≤ min{ (1 − ((β − 1)/β)√(µ∗/L))ᵏ (γ0 − µ∗), (4β²L/(β − 1)²) (1/k²) }.

Substituting this in (3.1) and using the fact that f(xk) ≤ φk(vk) ≤ φk(x∗), we get the optimal complexity result

f(xk) − f∗ ≤ min{ (1 − ((β − 1)/β)√(µ∗/L))ᵏ, (4β²L/((β − 1)²(γ0 − µ∗))) (1/k²) } (f(x0) − f∗ + γ0 Q).

4. Numerical Results. In this section, we report the results of a limited set of computational experiments, restricted to the minimization of quadratic problems in boxes. The codes are written in Matlab, and we used the norm of the projected gradient as the stopping criterion. We compare the performance of the following algorithms:

• [A1] Algorithm 1.
• [A2] Algorithm 2.
• [A3] Scheme 2.2.19 proposed by Nesterov in [9].
• [A4] Scheme (4.9) proposed by Nesterov in [10].
• [A5] Projected BFGS as presented in [6].
• [A6] Algorithm SPG proposed by Birgin, Martínez and Raydan in [3].

The MATLAB code of the projected BFGS method was contributed by C. T. Kelley and is available at http://www4.ncsu.edu/~ctk/.

Our main intent is to compare optimal algorithms, but we also include some tests comparing these methods with a projected BFGS method published and implemented by C. T. Kelley [6] and with the spectral projected gradient method (SPG) published by Birgin, Martínez and Raydan in [3]. As we only test quadratic problems in boxes, we used a BFGS method specific to this case. The other algorithms are fit for general convex problems with simple constraints.

We compare the performance of the algorithms for minimizing a quadratic function in a box:

minimize f(x) = (g∗)ᵀ(x − x∗) + (1/2)(x − x∗)ᵀQ(x − x∗)
subject to x ∈ Ω = {x ∈ Rⁿ | 0 ≤ x ≤ U}.

The box is constructed by generating the vector U with random values in [0, 1]. Then we generate a random vector z ∈ Rⁿ and define the optimal solution as x∗ = PΩ(z).

For the function, n, µ = 1 and L > µ are given. Using Matlab routines, we generate a random n × n matrix Q with eigenvalues in the interval [µ, L] and a random gradient vector g∗ at x∗, satisfying the optimality condition PΩ(x∗ − g∗) = x∗.

For each experiment in Rⁿ with n = 1000, we generate the problem as above and a random initial point x0 ∈ Ω.
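The construction just described can be sketched in NumPy as follows. This is our own illustrative rendering, not the authors' Matlab code: we build Q = V diag(λ) Vᵀ from a random orthogonal V, and choose g∗ from the active bounds so that the projected-gradient optimality condition PΩ(x∗ − g∗) = x∗ holds (sign conventions for this condition vary across references).

```python
import numpy as np

def make_box_qp(n, mu, L, seed=0):
    """Random quadratic on the box [0, U] with known minimizer x* (sketch)."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(0.0, 1.0, size=n)                 # box upper bounds
    xstar = np.clip(rng.normal(size=n), 0.0, U)       # x* = P_Omega(z)
    V, _ = np.linalg.qr(rng.normal(size=(n, n)))      # random orthogonal V
    lam = rng.uniform(mu, L, size=n)                  # eigenvalues in [mu, L]
    Q = (V * lam) @ V.T
    # gradient at x*: zero on free coordinates, pointing into active bounds
    gstar = np.zeros(n)
    low, up = xstar == 0.0, xstar == U
    gstar[low] = rng.uniform(0.0, 1.0, low.sum())     # g*_i >= 0 at x*_i = 0
    gstar[up] = -rng.uniform(0.0, 1.0, up.sum())      # g*_i <= 0 at x*_i = U_i
    f = lambda x: gstar @ (x - xstar) + 0.5 * (x - xstar) @ Q @ (x - xstar)
    return f, Q, gstar, xstar, U

f, Q, gstar, xstar, U = make_box_qp(50, 1.0, 100.0)
# first-order optimality on the box: x* = P_Omega(x* - g*)
assert np.allclose(np.clip(xstar - gstar, 0.0, U), xstar)
```

By construction g∗ᵀ(x − x∗) ≥ 0 for every feasible x, so x∗ minimizes f over the box and f(x∗) = 0.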

The algorithms A1–A3 are, to our knowledge, the only optimal methods capable of treating constrained problems with unknown L∗. The algorithm A4, described in [10, Scheme (4.9)], is very dependent on L: if a sharp value of L∗ is given, it performs well, but it becomes slow if L∗ is overestimated, and fails if L∗ is underestimated. We shall observe this behavior in our tests. Fig. 4.1 shows the variation of the function values against iterations in a typical situation. On the left, L0 is a good estimate of the Lipschitz constant, while in the right picture L0 overestimates L∗.

Fig. 4.1. Values of f(xk) against k. Left: L0 sharp Lipschitz constant; Right: L0 overestimated.

Fig. 4.2. Performance profiles for number of iterations (left) and the laboriousness measure (right).

We generated 400 instances of problems with n = 1000 and L∗ ranging from 10² to 10⁵. We consider that L0 is an overestimate of the Lipschitz constant, the strong convexity parameter is unknown, and µ0 = L0/100. These problems are "well behaved", in the sense that the eigenvalues of Q are positive and regularly distributed, and strict complementarity holds at the optimal solution.

Fig. 4.2 shows the performance profiles [4] of the methods for the number of iterations (left) and for a laboriousness measure defined by ng + (nf − ng)/3, where ng and nf are respectively the numbers of gradient and function evaluations. This measure is based on the assumption that the oracle for evaluating the function alone spends one third of the time used for computing both function and gradient.
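As a concrete rendering (function name is ours), the laboriousness measure just defined is:

```python
def laboriousness(nf, ng):
    """Oracle-cost proxy: ng combined function-plus-gradient calls at unit
    cost, plus nf - ng function-only calls weighted by 1/3 each."""
    return ng + (nf - ng) / 3.0

# e.g. 1 combined call plus 3 function-only calls costs 2 units
assert laboriousness(4, 1) == 2.0
```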

We conclude that our algorithm A2 is indeed faster than all the other optimal methods in number of iterations. For these experiments, the P-BFGS method used more iterations than A2, with about the same laboriousness. The SPG method always had the best laboriousness for these well-behaved problems. To our knowledge, there are presently no complexity results for this method.

Ill-behaved problems: We now check the effect on the algorithms of three kinds of bad behavior in the problem data. In all these tests, the relationship among the optimal methods was not significantly affected, and so we only compare the algorithms A2, BFGS and SPG.

Lack of strong convexity: We repeated the tests for problems with 20% of null eigenvalues in Q. The resulting performance curves and profiles are very similar to those obtained above, and so we do not show them.

Lack of strict complementarity: It is known that algorithm performance on constrained problems is usually affected by the absence of strict complementarity at optimal solutions. So, we compared the algorithms on three similar problems, whose performances are depicted in Fig. 4.3: on the left, a problem with strict complementarity at the optimal solution; in the center, a problem with half the optimal multipliers equal to zero; on the right, a problem with null optimal multipliers. We see that A2 was much less affected by the degeneracy than the other algorithms.

Fig. 4.3. Function values against iterations for examples with increasing degeneracy.

Badly distributed eigenvalues: We show in Fig. 4.4 the variation of the function values against iterations for an example in which the Hessian eigenvalues are badly distributed: a large number of eigenvalues in [0, 1] and a small number in [5000, 10000]. The interesting feature here is that the SPG algorithm performs very badly, while the other methods were not much affected.

Fig. 4.4. Function values against iterations for a problem with badly distributed Hessian eigenvalues.

5. Conclusion. We have presented two first-order algorithms for minimizing a differentiable convex function f with a Lipschitz constant L∗ > 0 (possibly unknown) for the gradient and a given convexity parameter µ∗ ≥ 0 (possibly null) for f, subject to a closed convex set Ω ⊂ Rⁿ. We assume that Ω is a "simple" set, in the sense that it is easy to compute the orthogonal projection of an arbitrary point onto Ω. The proposed algorithms are optimal in the sense that they achieve an absolute precision of ε in relation to the optimal value in O(1/√ε) iterations. The optimal complexity is proved by following a simple geometric approach. The algorithms are extensions of well-known methods devised by Nesterov [9], without the need for estimating a Lipschitz constant for the objective function gradient. The second algorithm includes an adaptive procedure for estimating the strong convexity constant. This algorithm was compared with other algorithms designed for problems on simple sets, using a set of box-constrained quadratic problems. It proved to be numerically efficient, even for "ill-behaved" problems.

Acknowledgements. The authors would like to thank the anonymous refereesfor their valuable comments and suggestions which very much improved this paper.

REFERENCES

[1] A. Auslender and M. Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization, 16(3):697–725, 2006.
[2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, USA, 1995.
[3] E. G. Birgin, J. M. Martínez, and M. Raydan. Algorithm SPG - software for convex-constrained optimization. ACM Transactions on Mathematical Software, 27:340–349, 2001.
[4] E. D. Dolan and J. J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91(2):201–213, 2002.
[5] C. C. Gonzaga and E. W. Karas. Fine tuning Nesterov's steepest descent algorithm for differentiable convex programming. Mathematical Programming, 138(1–2):141–166, 2013.
[6] C. T. Kelley. Iterative Methods for Optimization. SIAM, Philadelphia, 1999.
[7] G. Lan, Z. Lu, and R. D. C. Monteiro. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Mathematical Programming, 126(1):1–29, 2011.
[8] A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, New York, 1983.
[9] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston, 2004.
[10] Y. Nesterov. Gradient methods for minimizing composite objective function. Mathematical Programming, Series B, 140(1):125–161, 2013.
[11] D. R. Rossetto. Tópicos em métodos ótimos para otimização convexa. PhD thesis, Department of Applied Mathematics, University of São Paulo, Brazil, 2012. In Portuguese.