

Optimization Methods and Software, Vol. 22, No. 4, August 2007, 561–571

Scaled memoryless BFGS preconditioned conjugate gradient algorithm for unconstrained optimization

NECULAI ANDREI*

Research Institute for Informatics, Center for Advanced Modeling and Optimization, 8-10, Averescu Avenue, Bucharest 1, Romania

(Received 18 March 2005; revised 8 September 2005; in final form 23 May 2006)

A scaled memoryless BFGS preconditioned conjugate gradient algorithm for solving unconstrained optimization problems is presented. The basic idea is to combine the scaled memoryless BFGS method and the preconditioning technique in the frame of the conjugate gradient method. The preconditioner, which is also a scaled memoryless BFGS matrix, is reset when the Beale–Powell restart criterion holds. The parameter scaling the gradient is selected as the spectral gradient. In very mild conditions, it is shown that, for strongly convex functions, the algorithm is globally convergent. Computational results for a set consisting of 750 unconstrained optimization test problems show that this new scaled conjugate gradient algorithm substantially outperforms the known conjugate gradient methods including the spectral conjugate gradient by Birgin and Martínez [Birgin, E. and Martínez, J.M., 2001, A spectral conjugate gradient method for unconstrained optimization. Applied Mathematics and Optimization, 43, 117–128], the conjugate gradient by Polak and Ribière [Polak, E. and Ribière, G., 1969, Note sur la convergence de méthodes de directions conjuguées. Revue Française Informat. Recherche Opérationnelle, 16, 35–43], as well as the most recent conjugate gradient method with guaranteed descent by Hager and Zhang [Hager, W.W. and Zhang, H., 2005, A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM Journal on Optimization, 16, 170–192; Hager, W.W. and Zhang, H., 2004, CG-DESCENT, a conjugate gradient method with guaranteed descent. ACM Transactions on Mathematical Software, 32, 113–137].

Keywords: Unconstrained optimization; Conjugate gradient method; Spectral gradient method; Wolfe line search; BFGS preconditioning

2000 Mathematics Subject Classification: 49M07; 49M10; 90C06; 65K

1. Introduction

In this paper, the following unconstrained optimization problem is considered:

min f (x), (1)

where f : R^n → R is continuously differentiable and its gradient is available. Elaborating an algorithm is of interest for solving large-scale cases for which the Hessian of f is either not available or requires a large amount of storage and computational effort.

*Corresponding author. Email: [email protected]

Optimization Methods and Software ISSN 1055-6788 print / ISSN 1029-4937 online © 2007 Taylor & Francis

http://www.tandf.co.uk/journals
DOI: 10.1080/10556780600822260


The paper presents a conjugate gradient algorithm based on a combination of the scaled memoryless BFGS method and the preconditioning technique. For general non-linear functions, a good preconditioner is any matrix that approximates ∇²f(x∗)^{−1}, where x∗ is the solution of (1). In this algorithm, the preconditioner is a scaled memoryless BFGS matrix which is reset when the Powell restart criterion holds. The scaling factor in the preconditioner is selected as the spectral gradient.

The algorithm uses the conjugate gradient direction where the famous parameter βk is obtained by equating the conjugate gradient direction with the direction corresponding to the Newton method. Thus, a general formula for the direction computation is obtained, which can be particularized to include the Polak and Ribière [1] and the Fletcher and Reeves [2] conjugate gradient algorithms, the spectral conjugate gradient (SCG) by Birgin and Martínez [3] or the algorithm of Dai and Liao [4], for t = 1. This direction is then modified in a canonical manner, as it was considered earlier by Oren and Luenberger [5], Oren and Spedicato [6], Perry [7] and Shanno [8,9], by means of a scaled, memoryless BFGS preconditioner placed in the Beale–Powell restart technology. The scaling factor is computed in a spectral manner based on the inverse Rayleigh quotient, as suggested by Raydan [10]. The method is an extension of the SCG by Birgin and Martínez [3] or of a variant of the conjugate gradient algorithm by Dai and Liao [4] (for t = 1) to overcome the lack of positive definiteness of the matrix defining their search direction.

The paper is organized as follows. In section 2, the method is presented. Section 3 is dedicated to the scaled conjugate gradient (SCALCG) algorithm. The algorithm performs two types of steps: a standard one in which a double quasi-Newton updating scheme is used and a restart one where the current information is used to define the search direction. The convergence of the algorithm for strongly convex functions is proved in section 4. Finally, in section 5, the computational results on a set of 750 unconstrained optimization problems from the CUTE [11] collection along with some other large-scale unconstrained optimization problems are presented and the performance of the new algorithm is compared with the SCG conjugate gradient method by Birgin and Martínez [3], the Polak and Ribière conjugate gradient algorithm [1], as well as the recent conjugate gradient method with guaranteed descent (CG_DESCENT) by Hager and Zhang [12,13].

2. Method

The algorithm generates a sequence xk of approximations to the minimum x∗ of f , in which

xk+1 = xk + αkdk, (2)

dk+1 = −θk+1gk+1 + βksk, (3)

where gk = ∇f(xk), αk is selected to minimize f(x) along the search direction dk, βk is a scalar parameter, sk = xk+1 − xk and θk+1 is a parameter to be determined. The iterative process is initialized with an initial point x0 and d0 = −g0.

Observe that if θk+1 = 1, then the classical conjugate gradient algorithms are obtained according to the value of the scalar parameter βk. In contrast, if βk = 0, then another class of algorithms is obtained according to the selection of the parameter θk+1. Considering βk = 0, there are two possibilities for θk+1: a positive scalar or a positive definite matrix. If θk+1 = 1, then the steepest descent algorithm results. If θk+1 = ∇²f(xk+1)^{−1} or an approximation of it, then the Newton or the quasi-Newton algorithm is obtained, respectively. Therefore, it is seen that in the general case, when θk+1 ≠ 0 is selected in a quasi-Newton manner, and βk ≠ 0, (3) represents a combination between the quasi-Newton and the conjugate gradient


methods. However, if θk+1 is a matrix containing some useful information about the inverse Hessian of function f, it is better to use dk+1 = −θk+1 gk+1, because the addition of the term βk sk in (3) may prevent the direction dk+1 from being a descent direction unless the line search is sufficiently accurate. Therefore, in this paper, θk+1 will be considered a positive scalar which contains some useful information about the inverse Hessian of function f.

To determine βk, consider the following procedure. As is known, the Newton direction for solving (1) is given by dk+1 = −∇²f(xk+1)^{−1} gk+1. Therefore, from the equality

$$ -\nabla^2 f(x_{k+1})^{-1} g_{k+1} = -\theta_{k+1} g_{k+1} + \beta_k s_k, $$

it follows that

$$ \beta_k = \frac{s_k^T \nabla^2 f(x_{k+1})\, \theta_{k+1} g_{k+1} - s_k^T g_{k+1}}{s_k^T \nabla^2 f(x_{k+1})\, s_k}. \tag{4} $$

Using the Taylor development, after some algebra,

$$ \beta_k = \frac{(\theta_{k+1} y_k - s_k)^T g_{k+1}}{y_k^T s_k} \tag{5} $$

is obtained, where yk = gk+1 − gk. Birgin and Martínez [3] arrived at the same formula for βk, but using a geometric interpretation of quadratic function minimization. The direction corresponding to βk given in (5) is as follows:

$$ d_{k+1} = -\theta_{k+1} g_{k+1} + \frac{(\theta_{k+1} y_k - s_k)^T g_{k+1}}{y_k^T s_k}\, s_k. \tag{6} $$

This direction is used by Birgin and Martínez [3] in their SCG package for unconstrained optimization, where θk+1 is selected in a spectral manner, as suggested by Raydan [10]. The following particularizations are obvious. If θk+1 = 1, then (6) is the direction considered by Perry [7]. At the same time, it is seen that (6) is the direction given by Dai and Liao [4] for t = 1, obtained this time by an interpretation of the conjugacy condition. Additionally, if s_j^T g_{j+1} = 0, j = 0, 1, . . . , k, then from (6),

$$ d_{k+1} = -\theta_{k+1} g_{k+1} + \frac{\theta_{k+1}\, y_k^T g_{k+1}}{\alpha_k \theta_k\, g_k^T g_k}\, s_k, \tag{7} $$

which is the direction corresponding to a generalization of the Polak and Ribière formula. Of course, if θk+1 = θk = 1 in (7), the classical Polak and Ribière formula [1] is obtained. If s_j^T g_{j+1} = 0, j = 0, 1, . . . , k, and additionally the successive gradients are orthogonal, then from (6),

$$ d_{k+1} = -\theta_{k+1} g_{k+1} + \frac{\theta_{k+1}\, g_{k+1}^T g_{k+1}}{\alpha_k \theta_k\, g_k^T g_k}\, s_k, \tag{8} $$

which is the direction corresponding to a generalization of the Fletcher and Reeves formula [2]. Therefore, (6) is a general formula for direction computation in a conjugate gradient manner including the classical Fletcher and Reeves [2] and Polak and Ribière [1] formulas.
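As a small illustration, the SCG-type direction (6) with βk from (5) can be computed with a few vector operations; the sketch below assumes numpy arrays, and the function name is illustrative rather than part of any published code.

```python
import numpy as np

def scg_direction(g_new, s, y, theta_new):
    """Direction (6): d_{k+1} = -theta_{k+1} g_{k+1} + beta_k s_k,
    with beta_k computed from (5); assumes y_k^T s_k != 0."""
    ys = np.dot(y, s)
    beta = np.dot(theta_new * y - s, g_new) / ys   # formula (5)
    return -theta_new * g_new + beta * s           # formula (6)
```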

There is a result by Shanno [8,9] that says that the conjugate gradient method is precisely the BFGS quasi-Newton method for which the initial approximation to the inverse of the Hessian, at every step, is taken as the identity matrix. The extension to the scaled conjugate gradient is very simple. Using the same methodology as considered by Shanno [8], the following direction


dk+1 is obtained:

$$ d_{k+1} = -\theta_{k+1} g_{k+1} + \theta_{k+1}\,\frac{g_{k+1}^T s_k}{y_k^T s_k}\, y_k - \left[\left(1 + \theta_{k+1}\frac{y_k^T y_k}{y_k^T s_k}\right)\frac{g_{k+1}^T s_k}{y_k^T s_k} - \theta_{k+1}\frac{g_{k+1}^T y_k}{y_k^T s_k}\right] s_k, \tag{9} $$

involving only four scalar products. Again observe that if g_{k+1}^T s_k = 0, then (9) reduces to

$$ d_{k+1} = -\theta_{k+1} g_{k+1} + \theta_{k+1}\,\frac{g_{k+1}^T y_k}{y_k^T s_k}\, s_k. \tag{10} $$

Thus, in this case, the effect is simply one of multiplying the Hestenes and Stiefel [14] search direction by a positive scalar.
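For illustration, the direction (9) can be sketched as follows (a hypothetical numpy helper, not the author's Fortran implementation); when g_{k+1}^T s_k = 0 the middle term vanishes and the expression collapses to the scaled Hestenes–Stiefel direction (10).

```python
import numpy as np

def bfgs_direction(g_new, s, y, theta_new):
    """Restart direction (9), built from four scalar products; assumes y^T s > 0."""
    ys = np.dot(y, s)
    gs = np.dot(g_new, s)   # g_{k+1}^T s_k
    gy = np.dot(g_new, y)   # g_{k+1}^T y_k
    yy = np.dot(y, y)       # y_k^T y_k
    return (-theta_new * g_new
            + theta_new * (gs / ys) * y
            - ((1.0 + theta_new * yy / ys) * gs / ys - theta_new * gy / ys) * s)
```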

In order to ensure the convergence of algorithm (2), with dk+1 given by (9), the choice of αk should be constrained. Consider the line searches that satisfy the Wolfe conditions [15,16]:

$$ f(x_k + \alpha_k d_k) - f(x_k) \le \sigma_1 \alpha_k\, g_k^T d_k, \tag{11} $$

$$ \nabla f(x_k + \alpha_k d_k)^T d_k \ge \sigma_2\, g_k^T d_k, \tag{12} $$

where 0 < σ1 ≤ σ2 < 1.
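For a trial step αk, the two conditions can be verified directly; the snippet below is only a sketch with illustrative names and typical default values σ1 = 10⁻⁴ and σ2 = 0.9 (the analysis only requires 0 < σ1 ≤ σ2 < 1).

```python
import numpy as np

def satisfies_wolfe(f, grad, x, d, alpha, sigma1=1e-4, sigma2=0.9):
    """Return True if the step alpha satisfies (11) and (12) along d (numpy arrays)."""
    g0_d = np.dot(grad(x), d)
    x_new = x + alpha * d
    decrease_ok = f(x_new) - f(x) <= sigma1 * alpha * g0_d   # condition (11)
    curvature_ok = np.dot(grad(x_new), d) >= sigma2 * g0_d   # condition (12)
    return decrease_ok and curvature_ok
```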

THEOREM 1 Suppose that αk in (2) satisfies the Wolfe conditions (11) and (12), then the direction dk+1 given by (9) is a descent direction.

Proof As d0 = −g0, g_0^T d_0 = −‖g0‖² ≤ 0. Multiplying (9) by g_{k+1}^T,

$$ g_{k+1}^T d_{k+1} = \frac{1}{(y_k^T s_k)^2}\Big[ -\theta_{k+1}\|g_{k+1}\|^2 (y_k^T s_k)^2 + 2\theta_{k+1}(g_{k+1}^T y_k)(g_{k+1}^T s_k)(y_k^T s_k) - (g_{k+1}^T s_k)^2 (y_k^T s_k) - \theta_{k+1}(y_k^T y_k)(g_{k+1}^T s_k)^2 \Big]. $$

Applying the inequality u^T v ≤ (1/2)(‖u‖² + ‖v‖²) to the second term on the right-hand side of the above equality, with u = (s_k^T y_k) g_{k+1} and v = (g_{k+1}^T s_k) y_k,

$$ g_{k+1}^T d_{k+1} \le -\frac{(g_{k+1}^T s_k)^2}{y_k^T s_k} \tag{13} $$

is obtained. But, by the Wolfe condition (12), y_k^T s_k > 0. Therefore, g_{k+1}^T d_{k+1} < 0, for every k = 0, 1, . . .

Observe that the second Wolfe condition (12) is crucial for the descent character of direction (9). Besides, it is seen that estimation (13) is independent of the parameter θk+1.

Usually, all conjugate gradient algorithms are periodically restarted. The Powell restarting procedure [17,18] is to test if there is very little orthogonality left between the current gradient and the previous one. At step r, when

$$ \left| g_{r+1}^T g_r \right| \ge 0.2\, \|g_{r+1}\|^2, \tag{14} $$

we restart the algorithm using the direction given by (9). At step r, sr, yr and θr+1 are known. If (14) is satisfied, then a restart step is considered, i.e. the direction is computed as in (9). For k ≥ r + 1, the same philosophy used by Shanno


[8,9] is considered, where the gradient gk+1 is modified by a positive definite matrix which best estimates the inverse Hessian without any additional storage requirements, that is,

$$ v = \theta_{r+1} g_{k+1} - \theta_{r+1}\,\frac{g_{k+1}^T s_r}{y_r^T s_r}\, y_r + \left[\left(1 + \theta_{r+1}\frac{y_r^T y_r}{y_r^T s_r}\right)\frac{g_{k+1}^T s_r}{y_r^T s_r} - \theta_{r+1}\frac{g_{k+1}^T y_r}{y_r^T s_r}\right] s_r, \tag{15} $$

and

$$ w = \theta_{r+1} y_k - \theta_{r+1}\,\frac{y_k^T s_r}{y_r^T s_r}\, y_r + \left[\left(1 + \theta_{r+1}\frac{y_r^T y_r}{y_r^T s_r}\right)\frac{y_k^T s_r}{y_r^T s_r} - \theta_{r+1}\frac{y_k^T y_r}{y_r^T s_r}\right] s_r, \tag{16} $$

involving six scalar products. With these, at any non-restart step, the direction dk+1 for k ≥ r + 1 is computed using a double update scheme as in Shanno [8]:

$$ d_{k+1} = -v + \frac{(g_{k+1}^T s_k)\, w + (g_{k+1}^T w)\, s_k}{y_k^T s_k} - \left(1 + \frac{y_k^T w}{y_k^T s_k}\right)\frac{g_{k+1}^T s_k}{y_k^T s_k}\, s_k, \tag{17} $$

involving only four scalar products. Observe that y_k^T s_k > 0 is sufficient to ensure that the direction dk+1 given by (17) is well defined and it is always a descent direction.

Motivated by the efficiency of the spectral gradient method introduced by Raydan [10] and

used by Birgin and Martínez [3] in their SCG method for unconstrained optimization, in the algorithm given in this work θk+1 is defined as a scalar approximation to the inverse Hessian. It is given as the inverse of the Rayleigh quotient

$$ \frac{s_k^T \left[\int_0^1 \nabla^2 f(x_k + t s_k)\, \mathrm{d}t\right] s_k}{s_k^T s_k}, $$

i.e.

$$ \theta_{k+1} = \frac{s_k^T s_k}{y_k^T s_k}. \tag{18} $$

The Rayleigh quotient itself lies between the smallest and the largest eigenvalue of the Hessian average ∫_0^1 ∇²f(xk + t sk) dt, so θk+1 lies between the reciprocals of these eigenvalues. Again observe that y_k^T s_k > 0 is sufficient to ensure that θk+1 in (18) is well defined.
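To make the double update concrete, the sketch below computes θ from (18) and the standard direction (17) through the products (15) and (16); the helper names are illustrative, numpy is assumed, and y_r^T s_r > 0, y_k^T s_k > 0 are taken for granted. Note that, with (θr+1, sr, yr) replaced by (θk+1, sk, yk) and z = gk+1, the negative of memoryless_bfgs_apply reproduces the restart direction (9).

```python
import numpy as np

def spectral_theta(s, y):
    """Scaling factor (18): the inverse Rayleigh quotient s^T s / (y^T s)."""
    return np.dot(s, s) / np.dot(y, s)

def memoryless_bfgs_apply(z, s_r, y_r, theta_r):
    """Product of the scaled memoryless BFGS matrix built from (theta_r, s_r, y_r)
    with a vector z; z = g_{k+1} gives v in (15), z = y_k gives w in (16)."""
    ys = np.dot(y_r, s_r)
    zs = np.dot(z, s_r)
    zy = np.dot(z, y_r)
    return (theta_r * z
            - theta_r * (zs / ys) * y_r
            + ((1.0 + theta_r * np.dot(y_r, y_r) / ys) * zs / ys - theta_r * zy / ys) * s_r)

def standard_direction(g_new, s, y, s_r, y_r, theta_r):
    """Non-restart direction (17), assembled from v and w of (15) and (16)."""
    v = memoryless_bfgs_apply(g_new, s_r, y_r, theta_r)   # (15)
    w = memoryless_bfgs_apply(y, s_r, y_r, theta_r)       # (16)
    ys = np.dot(y, s)
    gs = np.dot(g_new, s)
    return (-v
            + (gs * w + np.dot(g_new, w) * s) / ys
            - (1.0 + np.dot(y, w) / ys) * (gs / ys) * s)
```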

3. SCALCG algorithm

Having in view the above developments and the definitions of gk, sk and yk, as well as the selection procedure for the computation of θk+1, the following SCALCG algorithm can be presented.

Step 1. Initialization. Select x0 ∈ R^n and the parameters 0 < σ1 ≤ σ2 < 1. Compute f(x0) and g0 = ∇f(x0). Set d0 = −g0 and α0 = 1/‖g0‖. Set k = 0.
Step 2. Line search. Compute αk satisfying the Wolfe conditions (11) and (12). Update the variables xk+1 = xk + αk dk. Compute f(xk+1), gk+1 and sk = xk+1 − xk, yk = gk+1 − gk.
Step 3. Test for continuation of iterations. If this test is satisfied, the iterations are stopped, else set k = k + 1.
Step 4. Scaling factor computation. Compute θk using (18).
Step 5. Restart direction. Compute the (restart) direction dk as in (9).
Step 6. Line search. Compute the initial guess αk = αk−1‖dk−1‖2/‖dk‖2. Using this initialization, compute αk satisfying the Wolfe conditions. Update the variables xk+1 = xk + αk dk. Compute f(xk+1), gk+1 and sk = xk+1 − xk, yk = gk+1 − gk.
Step 7. Store θ = θk, s = sk and y = yk.
Step 8. Test for continuation of iterations. If this test is satisfied, the iterations are stopped, else set k = k + 1.
Step 9. Restart. If the Powell restart criterion (14) is satisfied, then go to step 4 (a restart step); otherwise, continue with step 10 (a standard step).
Step 10. Standard direction. Compute the direction dk as in (17), where v and w are computed as in (15) and (16) with the saved values θ, s and y.
Step 11. Line search. Compute the initial guess αk = αk−1‖dk−1‖2/‖dk‖2. Using this initialization, compute αk satisfying the Wolfe conditions. Update the variables xk+1 = xk + αk dk. Compute f(xk+1), gk+1 and sk = xk+1 − xk, yk = gk+1 − gk.
Step 12. Test for continuation of iterations. If this test is satisfied, the iterations are stopped, else set k = k + 1 and go to step 9.

It is well known that if f is bounded below along the direction dk, then there exists a step length αk satisfying the Wolfe conditions. The initial selection of the step length crucially affects the practical behaviour of the algorithm. At every iteration k ≥ 1, the starting guess for the step αk in the line search is computed as αk−1‖dk−1‖2/‖dk‖2. This procedure was considered for the first time by Shanno and Phua [19] in CONMIN, and the same one is used by Birgin and Martínez [3] in SCG.

Concerning the stopping criterion to be used in steps 3, 8 and 12, consider

$$ \|g_k\|_\infty \le \varepsilon_g, \tag{19} $$

where ‖·‖∞ denotes the maximum absolute component of a vector and εg is a tolerance specified by the user.
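A compact driver tying steps 1–12 together might look as follows. It is only a sketch under simplifying assumptions: it reuses spectral_theta, standard_direction and the restart direction bfgs_direction sketched earlier, it uses scipy.optimize.line_search as a stand-in for the Wolfe line search (so the CONMIN-style warm start of steps 6 and 11 is omitted), and it is not the author's Fortran implementation.

```python
import numpy as np
from scipy.optimize import line_search

def scalcg(f, grad, x0, eps_g=1e-6, max_iter=5000):
    """Sketch of the SCALCG iteration: restart steps use (9) with theta from (18),
    standard steps use (17) with the information saved at the last restart."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                                            # step 1
    saved = None                                      # (theta_r, s_r, y_r) from the last restart
    for _ in range(max_iter):
        if np.linalg.norm(g, np.inf) <= eps_g:        # stopping criterion (19)
            break
        alpha = line_search(f, grad, x, d, gfk=g)[0]  # Wolfe line search (11)-(12)
        if alpha is None:                             # line search failure: fall back to steepest descent
            d, saved = -g, None
            continue
        x_new = x + alpha * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if saved is None or abs(np.dot(g_new, g)) >= 0.2 * np.dot(g_new, g_new):
            theta = spectral_theta(s, y)              # Powell restart (14): steps 4-5
            d = bfgs_direction(g_new, s, y, theta)    # restart direction (9)
            saved = (theta, s, y)                     # step 7
        else:
            theta_r, s_r, y_r = saved                 # step 10: standard direction (17)
            d = standard_direction(g_new, s, y, s_r, y_r, theta_r)
        x, g = x_new, g_new
    return x, f(x)
```

For instance, scalcg(lambda z: np.dot(z, z), lambda z: 2.0 * z, np.ones(1000)) minimizes a convex quadratic and drives the gradient norm below the tolerance.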

4. Convergence analysis for strongly convex functions

Throughout this section, it is assumed that f is strongly convex and its gradient ∇f is Lipschitz continuous on the level set

$$ L_0 = \{ x \in \mathbb{R}^n : f(x) \le f(x_0) \}. \tag{20} $$

That is, there exist constants μ > 0 and L such that

$$ (\nabla f(x) - \nabla f(y))^T (x - y) \ge \mu \|x - y\|^2 \tag{21} $$

and

$$ \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|, \tag{22} $$

for all x and y from L0. For the convenience of the reader, the following lemma [12] is included here.

LEMMA 1 Assume that dk is a descent direction and ∇f satisfies the Lipschitz condition

$$ \|\nabla f(x) - \nabla f(x_k)\| \le L \|x - x_k\|, \tag{23} $$

for every x on the line segment connecting xk and xk+1, where L is a constant. If the line search satisfies the second Wolfe condition (12), then

$$ \alpha_k \ge \frac{1 - \sigma_2}{L}\, \frac{|g_k^T d_k|}{\|d_k\|^2}. \tag{24} $$


Proof Subtracting g_k^T d_k from both sides of (12) and using the Lipschitz condition,

$$ (\sigma_2 - 1)\, g_k^T d_k \le (g_{k+1} - g_k)^T d_k \le L \alpha_k \|d_k\|^2. \tag{25} $$

Because dk is a descent direction and σ2 < 1, (24) follows immediately from (25). ∎

LEMMA 2 Assume that f is strongly convex and ∇f is Lipschitz continuous on L0. If θk+1 is selected by the spectral gradient (18), then the direction dk+1 given by (9) satisfies:

$$ \|d_{k+1}\| \le \left( \frac{2}{\mu} + \frac{2L}{\mu^2} + \frac{L^2}{\mu^3} \right) \|g_{k+1}\|. \tag{26} $$

Proof By Lipschitz continuity (22),

$$ \|y_k\| = \|g_{k+1} - g_k\| = \|\nabla f(x_k + \alpha_k d_k) - \nabla f(x_k)\| \le L \alpha_k \|d_k\| = L \|s_k\|. \tag{27} $$

In contrast, by strong convexity (21),

$$ y_k^T s_k \ge \mu \|s_k\|^2. \tag{28} $$

Selecting θk+1 as in (18), it follows that

$$ \theta_{k+1} = \frac{s_k^T s_k}{y_k^T s_k} \le \frac{\|s_k\|^2}{\mu \|s_k\|^2} = \frac{1}{\mu}. \tag{29} $$

Now, using the triangle inequality and the estimates (27)–(29), after some algebra on ‖dk+1‖, where dk+1 is given by (9), (26) is obtained. ∎
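For the reader's convenience, the omitted algebra can be sketched as follows (an expansion that is consistent with (9) and the estimates (27)–(29), using the Cauchy–Schwarz inequalities |g_{k+1}^T s_k| ≤ ‖gk+1‖‖sk‖ and |g_{k+1}^T y_k| ≤ ‖gk+1‖‖yk‖):

$$ \|d_{k+1}\| \le \theta_{k+1}\|g_{k+1}\| + \theta_{k+1}\frac{|g_{k+1}^T s_k|}{y_k^T s_k}\,\|y_k\| + \left[\left(1+\theta_{k+1}\frac{\|y_k\|^2}{y_k^T s_k}\right)\frac{|g_{k+1}^T s_k|}{y_k^T s_k} + \theta_{k+1}\frac{|g_{k+1}^T y_k|}{y_k^T s_k}\right]\|s_k\|, $$

and the four terms on the right-hand side are bounded, in order, by (1/μ)‖gk+1‖, (L/μ²)‖gk+1‖, (1/μ + L²/μ³)‖gk+1‖ and (L/μ²)‖gk+1‖, whose sum is exactly the bound (26).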

The convergence of the SCALCG algorithm when f is strongly convex is given by the following theorem.

THEOREM 2 Assume that f is strongly convex and ∇f is Lipschitz continuous on the level set L0. If at every step of the conjugate gradient method (2), dk+1 is given by (9) and the step length αk is selected to satisfy the Wolfe conditions (11) and (12), then either gk = 0 for some k or limk→∞ gk = 0.

Proof Suppose gk ≠ 0 for all k. By strong convexity,

$$ y_k^T d_k = (g_{k+1} - g_k)^T d_k \ge \mu \alpha_k \|d_k\|^2. \tag{30} $$

By Theorem 1, g_k^T d_k < 0. Therefore, the assumption gk ≠ 0 implies dk ≠ 0. As αk > 0, from (30), it follows that y_k^T d_k > 0. But f is strongly convex over L0, therefore f is bounded from


below. Now, summing over k the first Wolfe condition (11),

$$ \sum_{k=0}^{\infty} \alpha_k\, g_k^T d_k > -\infty. $$

Considering the lower bound for αk given by (24) in Lemma 1 and having in view that dk is a descent direction, it follows that

$$ \sum_{k=1}^{\infty} \frac{|g_k^T d_k|^2}{\|d_k\|^2} < \infty. \tag{31} $$

Now, from (13), using the inequality of Cauchy and (28),

$$ g_{k+1}^T d_{k+1} \le -\frac{(g_{k+1}^T s_k)^2}{y_k^T s_k} \le -\frac{\|g_{k+1}\|^2 \|s_k\|^2}{\mu \|s_k\|^2} = -\frac{\|g_{k+1}\|^2}{\mu}. $$

Therefore, from (31), it follows that

$$ \sum_{k=0}^{\infty} \frac{\|g_k\|^4}{\|d_k\|^2} < \infty. \tag{32} $$

Now, inserting the upper bound (26) for dk in (32) yields

$$ \sum_{k=0}^{\infty} \|g_k\|^2 < \infty, $$

which completes the proof. ∎

For general functions, the convergence of the algorithm comes from Theorem 1 and the restart procedure. Therefore, for strongly convex functions and under inexact line search, it is globally convergent. To a great extent, however, the SCALCG algorithm is very close to the Perry/Shanno computational scheme [8,9]. SCALCG is a scaled memoryless BFGS preconditioned algorithm where the scaling factor is the inverse of a scalar approximation of the Hessian. If the Powell restart criterion (14) is used for general functions f bounded from below with bounded second partial derivatives and bounded level set, using the same arguments considered by Shanno [9], it is possible to prove that either the iterates converge to a point x∗ satisfying ‖g(x∗)‖ = 0 or the iterates cycle. It remains for further study to determine a complete global convergence result and whether cycling can occur for general functions with bounded second partial derivatives and bounded level set.

More sophisticated criteria for restarting the algorithms have been proposed in the literature, but the interest here is in the performance of an algorithm that uses the Powell restart criterion in association with the scaled memoryless BFGS preconditioned direction choice for restart. Additionally, some convergence analysis with the Powell restart criterion was given by Dai and Yuan [20] and can be used in this context of the preconditioned and scaled memoryless BFGS algorithm.

5. Computational results and comparisons

In this section, the performance of a Fortran implementation of the SCALCG algorithm on a set of 750 unconstrained optimization test problems is presented. At the same time, the performance of


SCALCG is compared with the best SCG algorithm (betatype = 1, Perry-M1) by Birgin and Martínez [3], with CG_DESCENT by Hager and Zhang [12,13] and with the Polak–Ribière (PR) algorithm.

The SCALCG code is authored by Andrei, whereas the SCG and PR codes are co-authored by Birgin and Martínez and CG_DESCENT is co-authored by Hager and Zhang. All codes are written in Fortran and compiled with f77 (default compiler settings) on an Intel Pentium 4, 1.5 GHz. The CG_DESCENT code contains the variant implementing the Wolfe line search (W) and the variant corresponding to the approximate Wolfe conditions (aW). All algorithms implement the same stopping criterion as in (19), where εg = 10⁻⁶.

The test problems are the unconstrained problems in the CUTE [11] collection, along with other large-scale optimization problems. Seventy-five large-scale unconstrained optimization problems in extended or generalized form are selected. For each function, 10 numerical experiments with the number of variables n = 1000, 2000, . . . , 10,000 have been considered.

Let f_i^ALG1 and f_i^ALG2 be the optimal values found by ALG1 and ALG2 for problem i = 1, . . . , 750, respectively. It is said that, for the particular problem i, the performance of ALG1 was better than the performance of ALG2 if

$$ \left| f_i^{\mathrm{ALG1}} - f_i^{\mathrm{ALG2}} \right| < 10^{-3} $$

and the number of iterations, or the number of function-gradient evaluations, or the CPU time of ALG1 was less than the number of iterations, the number of function-gradient evaluations or the CPU time corresponding to ALG2, respectively.
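This comparison rule can be summarized by a small predicate (an illustrative sketch, not the scripts actually used to produce Tables 1–4):

```python
def alg1_better(f1, f2, cost1, cost2, tol=1e-3):
    """ALG1 beats ALG2 on a problem only if both reach essentially the same optimum
    and ALG1 needs strictly less of the chosen cost measure (# iterations,
    # function-gradient evaluations, or CPU time)."""
    return abs(f1 - f2) < tol and cost1 < cost2
```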

The numerical results concerning the number of iterations, the number of restart iterations, the number of function and gradient evaluations and the CPU time in seconds for each of these methods can be found in ref. [21].

Tables 1–4 show the number of problems, out of 750, for which SCALCG versus SCG, PR, CG_DESCENT(W) or CG_DESCENT(aW) achieved the minimum number of iterations (# iter), the minimum number of function evaluations (# fg) and the minimum CPU time (CPU), respectively.

For example, when comparing SCALCG and SCG (table 1), subject to the number of iterations, SCALCG was better in 486 problems (i.e. it achieved the minimum number of iterations in 486 problems), SCG was better in 98 problems, and they had the same number of iterations in 89 problems, etc. From these tables, it is seen that, at least for this set of 750 problems, the top performer is SCALCG. As these codes use the same Wolfe line search,

Table 1. Performance of SCALCG versus SCG (750 problems).

          SCALCG    SCG     =
# iter       486     98    89
# fg         425    128   120
CPU          592     43    38

Table 2. Performance of SCALCG versus PR (750 problems).

          SCALCG     PR     =
# iter       497     93    81
# fg         415    134   122
CPU          580     54    37


Table 3. Performance of SCALCG versus CG_DESCENT(W) (750 problems).

          SCALCG   CG_DESCENT(W)     =
# iter       440             173    28
# fg         444             164    33
CPU          515              91    35

Table 4. Performance of SCALCG versus CG_DESCENT(aW) (750 problems).

          SCALCG   CG_DESCENT(aW)     =
# iter       445              168    28
# fg         431              175    35
CPU          507               97    37

excepting CG_DESCENT(aW) which implements an approximate Wolfe line search, and the same stopping criterion, they differ in their choice of the search direction. Hence, among these conjugate gradient algorithms, SCALCG appears to generate the best search direction, on average.

6. Conclusion

A scaled memoryless BFGS preconditioned conjugate gradient algorithm, the SCALCG algorithm, for solving large-scale unconstrained optimization problems is presented. SCALCG can be considered as a modification of the best algorithm by Birgin and Martínez [3], which is mainly a scaled variant of Perry's [7], and of the Dai and Liao [4] algorithm (t = 1), in order to overcome the lack of positive definiteness of the matrix defining the search direction. This modification takes advantage of the quasi-Newton BFGS updating formula. Using the Beale–Powell restart technology, a SCALCG algorithm is obtained in which the parameter scaling the gradient is selected as the spectral gradient. Although the update formulas (9) and (15)–(17) are more complicated, the scheme proved to be efficient and robust in numerical experiments. The algorithm implements the Wolfe conditions, and it has been proved that the steps are along descent directions.

The performance of the SCALCG algorithm was higher than that of the SCG method by Birgin and Martínez, CG_DESCENT by Hager and Zhang and the scaled PR algorithm for a set of 750 unconstrained optimization problems with dimensions ranging between 10³ and 10⁴.

As each of the codes considered here is different, mainly in the amount of linear algebra required at each step, it is quite clear that different codes will be superior on different problem sets. Generally, one needs to solve thousands of different problems before trends begin to emerge. For any particular problem, almost any method can win. However, there is strong computational evidence that the scaled memoryless BFGS preconditioned conjugate gradient algorithm is the top performer among these conjugate gradient algorithms.

Acknowledgement

The author was awarded the Romanian Academy Grant 168/2003.


References

[1] Polak, E. and Ribière, G., 1969, Note sur la convergence de méthodes de directions conjuguées. Revue Française Informat. Recherche Opérationnelle, 16, 35–43.
[2] Fletcher, R. and Reeves, C.M., 1964, Function minimization by conjugate gradients. The Computer Journal, 7, 149–154.
[3] Birgin, E. and Martínez, J.M., 2001, A spectral conjugate gradient method for unconstrained optimization. Applied Mathematics and Optimization, 43, 117–128.
[4] Dai, Y.H. and Liao, L.Z., 2001, New conjugacy conditions and related nonlinear conjugate gradient methods. Applied Mathematics and Optimization, 43, 87–101.
[5] Oren, S.S. and Luenberger, D.G., 1976, Self-scaling variable metric algorithm. Part I. Management Science, 20, 845–862.
[6] Oren, S.S. and Spedicato, E., 1976, Optimal conditioning of self-scaling variable metric algorithms. Mathematical Programming, 10, 70–90.
[7] Perry, J.M., 1977, A class of conjugate gradient algorithms with a two step variable metric memory. Discussion paper 269, Center for Mathematical Studies in Economics and Management Science, Northwestern University.
[8] Shanno, D.F., 1978, Conjugate gradient methods with inexact searches. Mathematics of Operations Research, 3, 244–256.
[9] Shanno, D.F., 1978, On the convergence of a new conjugate gradient algorithm. SIAM Journal on Numerical Analysis, 15, 1247–1257.
[10] Raydan, M., 1997, The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7, 26–33.
[11] Bongartz, I., Conn, A.R., Gould, N.I.M. and Toint, P.L., 1995, CUTE: constrained and unconstrained testing environments. ACM Transactions on Mathematical Software, 21, 123–160.
[12] Hager, W.W. and Zhang, H., 2005, A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM Journal on Optimization, 16, 170–192.
[13] Hager, W.W. and Zhang, H., 2006, Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent. ACM Transactions on Mathematical Software, 32, 113–137.
[14] Hestenes, M.R. and Stiefel, E., 1952, Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, Section B, 48, 409–436.
[15] Wolfe, P., 1969, Convergence conditions for ascent methods. SIAM Review, 11, 226–235.
[16] Wolfe, P., 1971, Convergence conditions for ascent methods II: some corrections. SIAM Review, 13, 185–188.
[17] Powell, M.J.D., 1976, Some convergence properties of the conjugate gradient method. Mathematical Programming, 11, 42–49.
[18] Powell, M.J.D., 1977, Restart procedures for the conjugate gradient method. Mathematical Programming, 12, 241–254.
[19] Shanno, D.F. and Phua, K.H., 1976, Algorithm 500, minimization of unconstrained multivariate functions. ACM Transactions on Mathematical Software, 2, 87–94.
[20] Dai, Y.H. and Yuan, Y., 1998, Convergence properties of the Beale–Powell restart algorithm. Science in China (Series A), 41(11), 1142–1150.
[21] Andrei, N., Available online at: http://www.ici.ro/camo/neculai/anpaper.htm/