
Accelerating Gradient Boosting Machine

Haihao Lu *1 2, Sai Praneeth Karimireddy *3, Natalia Ponomareva 2, Vahab Mirrokni 2

Abstract

Gradient Boosting Machine (GBM) (Friedman, 2001) is an extremely powerful supervised learning algorithm that is widely used in practice. GBM routinely features as a leading algorithm in machine learning competitions such as Kaggle and the KDDCup. In this work, we propose Accelerated Gradient Boosting Machine (AGBM) by incorporating Nesterov's acceleration techniques into the design of GBM. The difficulty in accelerating GBM lies in the fact that weak (inexact) learners are commonly used, and therefore the errors can accumulate in the momentum term. To overcome this, we design a "corrected pseudo-residual" and fit the best weak learner to this corrected pseudo-residual in order to perform the z-update. Thus, we are able to derive novel computational guarantees for AGBM. This is the first GBM-type algorithm with a theoretically-justified accelerated convergence rate. Finally, we demonstrate with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.

1. Introduction

Gradient Boosting Machine (GBM) (Friedman, 2001) is a powerful supervised learning algorithm that combines multiple weak learners into an ensemble with excellent prediction performance. GBM works very well for a number of tasks like spam filtering, online advertising, fraud detection, anomaly detection, computational physics (e.g., the Higgs Boson discovery), etc., and has routinely featured as a top algorithm in Kaggle competitions and the KDDCup (Chen & Guestrin, 2016). GBM can naturally handle heterogeneous datasets (highly correlated data, missing data, categorical data, etc.). It is also quite easy to use, with several publicly

*Equal contribution. 1 MIT, Cambridge, MA, USA. 2 Google, New York, NY, USA. 3 EPFL, Lausanne, Switzerland. Correspondence to: Haihao Lu <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

available implementations: scikit-learn (Pedregosa et al., 2011), R gbm (Ridgeway et al., 2013), LightGBM (Ke et al., 2017), XGBoost (Chen & Guestrin, 2016), TF Boosted Trees (Ponomareva et al., 2017), etc.

In spite of the practical success of GBM, there is a considerable gap in its theoretical understanding. The traditional interpretation of GBM is to view it as a form of steepest descent in functional space (Mason et al., 2000; Friedman, 2001). While this interpretation serves as a good starting point, such a framework lacks rigorous non-asymptotic convergence guarantees, especially when compared to the growing body of literature on first-order convex optimization.

In the convex optimization literature, Nesterov's acceleration is a successful technique for speeding up the convergence of first-order methods. In this work, we show how to incorporate Nesterov momentum into the gradient boosting framework to obtain an accelerated gradient boosting machine.

1.1. Our contributions

We propose the first accelerated gradient boosting algorithm that comes with strong theoretical guarantees and can be used with any type of weak learner. In particular:

1. We propose a novel accelerated gradient boosting algorithm (AGBM) (Section 3) and prove (Section 4) that it reduces the empirical loss at a rate of O(1/m^2) after m iterations, improving upon the O(1/m) rate obtained by traditional gradient boosting methods.

2. We propose a computationally inexpensive practical variant of AGBM that takes advantage of strong convexity of the loss function and achieves linear convergence (Section 5). We also list the conditions (on the loss function) under which AGBM would be beneficial.

3. With a number of numerical experiments with weak tree learners (one of the most popular types of weak learners for GBMs), we confirm the effectiveness of AGBM.

4. Our accelerated boosting framework will be open-sourced at this url.1

Apart from these theoretical contributions, we pave the way to speeding up some practical applications of GBMs which currently require a large number of boosting iterations. For

1 Link removed for anonymity; will be added upon acceptance.


example, GBMs with boosted trees for multi-class problems are commonly implemented as a number of one-vs-rest learners, resulting in more complicated boundaries (Friedman et al., 1998) and potentially a larger number of boosting iterations required. Additionally, it is a common practice to build many very weak learners for problems where it is easy to overfit. Such large ensembles result not only in slow training, but also slower inference. AGBM can be potentially beneficial for all these applications.

1.2. Related Literature

Convergence Guarantees for GBM: After GBM was first introduced by Friedman (2001), several works established its guaranteed convergence, without explicitly stating the convergence rate (Collins et al., 2002; Mason et al., 2000). Subsequently, when the loss function is both smooth and strongly convex, (Bickel et al., 2006) proved an exponential convergence rate, more precisely that O(exp(1/ε^2)) iterations are sufficient to ensure that the training loss is within ε of its optimal value. (Telgarsky, 2012) then studied the primal-dual structure of GBM and demonstrated that in fact only O(log(1/ε)) iterations are needed. However, the constants in their rate were non-standard and less intuitive. This result was recently improved upon by (Freund et al., 2017) and (Lu & Mazumder, 2018), who showed a similar convergence rate but with more transparent constants such as the smoothness and strong convexity constants of the loss function, the density of the weak learners, etc. Additionally, if the loss function is assumed to be smooth and convex (but not necessarily strongly convex), (Lu & Mazumder, 2018) also showed that O(1/ε) iterations are sufficient. We refer the reader to (Telgarsky, 2012), (Freund et al., 2017) and (Lu & Mazumder, 2018) for a more detailed literature review on the theoretical results of GBM convergence.

Accelerated Gradient Methods: For optimizing a smooth convex function, (Nesterov, 1983) showed that the standard gradient descent algorithm can be made much faster, resulting in the accelerated gradient descent method. While gradient descent requires O(1/ε) iterations, accelerated gradient methods require only O(1/√ε). In general, this rate of convergence is optimal and cannot be improved upon (Nesterov, 2004). Although the method was introduced in 1983, the mainstream research community's interest in Nesterov's accelerated method started only around 15 years ago; yet even today most researchers struggle to find basic intuition as to what is really going on in accelerated methods. Such lack of intuition about the estimation-sequence proof technique used by (Nesterov, 2004) has motivated many recent works trying to explain this acceleration phenomenon (Su et al., 2016; Wilson et al., 2016; Hu & Lessard, 2017; Lin et al., 2015; Frostig et al., 2015; Allen-Zhu & Orecchia, 2014; Bubeck et al., 2015). Some have recently attempted to give a physical explanation of acceleration techniques by studying the continuous-time interpretation of accelerated gradient descent via dynamical systems (Su et al., 2016; Wilson et al., 2016; Hu & Lessard, 2017).

Accelerated Greedy Coordinate and Matching Pursuit Methods: Recently, (Locatello et al., 2018) and (Lu et al., 2018) discussed how to accelerate matching pursuit and greedy coordinate descent algorithms, respectively. Their methods, however, require a random step and are hence only 'semi-greedy', which does not fit into the boosting framework.

Accelerated GBM: Recently, (Biau et al., 2018) and (Fouillen et al., 2018) proposed an accelerated version of GBM by directly incorporating Nesterov's momentum into GBM; however, no theoretical justification was provided. Furthermore, as we argue in Section 5.2, their proposed algorithm may not converge to the optimum.

2. Gradient Boosting Machine

We consider a supervised learning problem with n training examples (x_i, y_i), i = 1, ..., n, such that x_i ∈ R^p is the feature vector of the i-th example and y_i is a label (in a classification problem) or a continuous response (in a regression problem). In the classical version of GBM (Friedman, 2001), we assume we are given a base class of learners B and that our target function class is the linear combination of such base learners (denoted by lin(B)). Let B = {b_τ(x) ∈ R} be a family of learners parameterized by τ ∈ T. The prediction corresponding to a feature vector x is given by an additive model of the form:

f(x) := \sum_{m=1}^{M} \beta_m b_{\tau_m}(x) \in \mathrm{lin}(B), \qquad (1)

where b_{\tau_m}(x) ∈ B is a weak learner and β_m is its corresponding additive coefficient. Here, β_m and τ_m are chosen in an adaptive fashion in order to improve the data-fidelity, as discussed below. Examples of learners commonly used in practice include wavelet functions, support vector machines, and classification and regression trees (Friedman et al., 2001). We assume the set of weak learners B is scalable, namely that the following assumption holds.

Assumption 2.1. If b(·) ∈ B, then λb(·) ∈ B for any λ > 0.

Assumption 2.1 holds for most sets of weak learners we are interested in. Indeed, scaling a weak learner is equivalent to modifying the coefficient of the weak learner, so it does not change the structure of B.

The goal of GBM is to obtain a good estimate of the function f that approximately minimizes the empirical loss:

L^* = \min_{f \in \mathrm{lin}(B)} \Big\{ L(f) := \sum_{i=1}^{n} \ell(y_i, f(x_i)) \Big\} \qquad (2)


where ℓ(y_i, f(x_i)) is a measure of the data-fidelity for the i-th sample under the loss function ℓ.

2.1. Best Fit Weak Learners

The original version of GBM by (Friedman, 2001), presented in Algorithm 1, can be viewed as minimizing the loss function (2) by applying an approximate steepest descent algorithm. GBM starts from the null function f^0 ≡ 0 and at each iteration computes the pseudo-residual r^m (namely, the negative gradient of the loss function with respect to the predictions of the current model f^m):

r_i^m = -\frac{\partial \ell(y_i, f^m(x_i))}{\partial f^m(x_i)}. \qquad (3)

Then a weak learner that best fits the current pseudo-residual in terms of the least-squares loss is computed as follows:

\tau_m = \arg\min_{\tau \in T} \sum_{i=1}^{n} (r_i^m - b_\tau(x_i))^2. \qquad (4)

This weak learner is added to the model with a coefficient found via a line search. As the iterations progress, GBM leads to a sequence of functions {f^m}_{m ∈ [M]} (where [M] is shorthand for the set {1, ..., M}). The usual intention of GBM is to stop early, before one is close to a minimum of Problem (2), with the hope that such a model will lead to good predictive performance (Friedman, 2001; Freund et al., 2017; Zhang & Yu, 2005; Buhlmann et al., 2007).

Algorithm 1 Gradient Boosting Machine (GBM) (Friedman, 2001)

Initialization. Initialize with f^0(x) = 0.
For m = 0, ..., M-1 do:
Perform Updates:
(1) Compute the pseudo-residual: r^m = -\left[\frac{\partial \ell(y_i, f^m(x_i))}{\partial f^m(x_i)}\right]_{i=1,\ldots,n}.
(2) Find the parameters of the best weak learner: \tau_m = \arg\min_{\tau \in T} \sum_{i=1}^{n} (r_i^m - b_\tau(x_i))^2.
(3) Choose the step-size \eta_m by line search: \eta_m = \arg\min_{\eta} \sum_{i=1}^{n} \ell(y_i, f^m(x_i) + \eta b_{\tau_m}(x_i)).
(4) Update the model: f^{m+1}(x) = f^m(x) + \eta_m b_{\tau_m}(x).
Output. f^M(x).

Perhaps the most popular set of learners are classification and regression trees (CART) (Breiman et al., 1984), resulting in Gradient Boosted Decision Tree models (GBDTs). These are the models that we use for our numerical experiments. At the same time, we would like to highlight that our algorithm is not tied to a particular type of weak learner and is a general algorithm.
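To make Algorithm 1 concrete, here is a minimal Python sketch of GBM for the least-squares loss with scikit-learn regression trees as weak learners; it uses a fixed step-size instead of the line search in step (3), and the function names are ours, purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_iters=100, eta=0.1, max_depth=3):
    """Minimal GBM (Algorithm 1) for the least-squares loss.

    For ell(y, f) = (y - f)^2 / 2 the pseudo-residual r^m is simply y - f^m(X),
    and a fixed step-size eta replaces the line search.
    """
    trees = []
    f = np.zeros(len(y))            # current ensemble predictions f^m(X)
    for _ in range(n_iters):
        r = y - f                   # pseudo-residual: negative gradient of the loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)              # best-fit weak learner (least-squares fit to r)
        f += eta * tree.predict(X)  # f^{m+1} = f^m + eta * b_{tau_m}
        trees.append(tree)
    return trees

def gbm_predict(trees, X, eta=0.1):
    # eta must match the value used during fitting.
    return eta * sum(tree.predict(X) for tree in trees)
```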

3. Accelerated Gradient Boosting Machine (AGBM)

Given the success of accelerated gradient descent as a first-order optimization method, it seems natural to attempt to accelerate GBMs. As a warm-up, we first look at how to obtain an accelerated boosting algorithm when our class of learners B is strong (complete) and can exactly fit any pseudo-residual. This assumption is quite unreasonable, but it serves to illustrate the connection between boosting and first-order optimization. We then describe our actual algorithm, which works for any class of weak learners.

3.1. Boosting with strong learners

In this subsection, we assume the class of learners B is strong, i.e. for any pseudo-residual r ∈ R^n, there exists a learner b(x) ∈ B such that

b(x_i) = r_i \quad \forall\, i \in [n].

Of course, the entire point of boosting is that the learners are weak and thus the class is not strong; therefore this is not a realistic assumption. Nevertheless, this section provides the intuition for how to develop AGBM.

In GBM we compute the pseudo-residual r^m in (3) as the negative gradient of the loss function over the predictions so far. A gradient descent step in functional space would try to find f^{m+1} such that, for i ∈ {1, ..., n},

f^{m+1}(x_i) = f^m(x_i) + \eta r_i^m.

Here η is the step-size of our algorithm. Since our class of learners is rich, we can choose b^m(x) ∈ B to exactly satisfy the above equation.

GBM (Algorithm 1) then has the following update:

f^{m+1} = f^m + \eta b^m,

where b^m(x_i) = r_i^m. In other words, GBM performs exactly functional gradient descent when the class of learners is strong, and so it converges at a rate of O(1/m). Akin to the above argument, we can perform functional accelerated gradient descent, which has the accelerated rate of O(1/m^2). In the accelerated method, we maintain three model ensembles: f, g, and h, of which f(x) is the only model that is finally used to make predictions at inference time. h(x) is the momentum sequence and g(x) is a weighted average of f and h (refer to Table 1 for a list of all notations used). These sequences are updated as follows, for a step-size η and θ_m = 2/(m+2):

g^m = (1 - \theta_m) f^m + \theta_m h^m
f^{m+1} = g^m + \eta b^m \qquad \text{(primary model)}
h^{m+1} = h^m + \frac{\eta}{\theta_m}\, b^m \qquad \text{(momentum model)} \qquad (5)


Parameter        | Dimension | Explanation
(x_i, y_i)       | R^p x R   | The features and the label of the i-th sample.
X                | R^{p x n} | X = [x_1, x_2, ..., x_n] is the feature matrix for all training data.
b_τ(x)           | function  | Weak learner parameterized by τ.
f^m(x)           | function  | Ensemble of weak learners at the m-th iteration.
g^m(x), h^m(x)   | functions | Auxiliary ensembles of weak learners at the m-th iteration.
r^m              | R^n       | Pseudo-residual at the m-th iteration.
c^m              | R^n       | Corrected pseudo-residual at the m-th iteration.
b_τ(X)           | R^n       | The vector of predictions [b_τ(x_i)]_i.
f(X)             | R^n       | The vector [f(x_i)]_i for any function f(x).

Table 1. List of notations used.

where b^m(x) satisfies, for i ∈ {1, ..., n},

b^m(x_i) = -\frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}. \qquad (6)

Note that the pseudo-residual is computed w.r.t. g instead of f. The update above can be rewritten as

f^{m+1} = f^m + \eta b^m + \theta_m (h^m - f^m).

If θ_m = 0, we recover standard functional gradient descent with step-size η. For θ_m ∈ (0, 1], there is an additional momentum in the direction of (h^m - f^m).

The three sequences f, g, and h match exactly those used in typical accelerated gradient methods (see (Nesterov, 2004; Tseng, 2008) for details).
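As a sanity check of (5), in the strong-learner case the updates reduce to ordinary accelerated gradient descent on the in-sample prediction vectors. A minimal NumPy sketch for the least-squares loss (where the pseudo-residual at g is simply y - g and can be fit exactly) follows; the function name is ours and purely illustrative.

```python
import numpy as np

def accelerated_boosting_strong(y, n_iters=100, eta=1.0):
    """Functional accelerated gradient descent (5) with a strong learner class,
    tracked on the in-sample prediction vectors only (least-squares loss)."""
    n = len(y)
    f = np.zeros(n)   # primary model predictions f^m(X)
    h = np.zeros(n)   # momentum model predictions h^m(X)
    for m in range(n_iters):
        theta = 2.0 / (m + 2)
        g = (1 - theta) * f + theta * h   # g^m = (1 - theta_m) f^m + theta_m h^m
        b = y - g                         # exact fit of the pseudo-residual at g
        f = g + eta * b                   # primary update
        h = h + (eta / theta) * b         # momentum update
    return f
```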

3.2. Boosting with weak learners

In this subsection, we consider the general case without assuming that the class of learners is strong. Indeed, the class of learners B is usually quite simple, and it is very likely that for any τ ∈ T it is impossible to exactly fit the residual r^m. We call this case boosting with weak learners. Our task then is to modify (5) to obtain a truly accelerated gradient boosting machine.

The full details are summarized in Algorithm 2, but we highlight two key differences from (5).

First, the update to the f sequence is replaced with a weak learner which best approximates r^m, similarly to (5). In particular, we compute the pseudo-residual r^m w.r.t. g as in (6) and find a parameter τ_{m,1} such that

\tau_{m,1} = \arg\min_{\tau \in T} \sum_{i=1}^{n} (r_i^m - b_\tau(x_i))^2.

Secondly, and more crucially, the update to the momentum model h is decoupled from the update to the f sequence. We use an error-corrected pseudo-residual c^m instead of directly using r^m. Suppose that at iteration m - 1, a weak learner b_{\tau_{m-1,2}} was added to h^{m-1}. Then the error-corrected residual is defined inductively as follows: for i ∈ {1, ..., n},

c_i^m = r_i^m + \frac{m+1}{m+2}\left(c_i^{m-1} - b_{\tau_{m-1,2}}(x_i)\right),

and then we compute

\tau_{m,2} = \arg\min_{\tau \in T} \sum_{i=1}^{n} (c_i^m - b_\tau(x_i))^2.

Thus at each iteration two weak learners are computed: b_{\tau_{m,1}}(x), which approximates the residual r^m, and b_{\tau_{m,2}}(x), which approximates the error-corrected residual c^m. Note that if our class of learners is complete, then c_i^{m-1} = b_{\tau_{m-1,2}}(x_i), c^m = r^m, and τ_{m,1} = τ_{m,2}. This reverts back to our accelerated gradient boosting algorithm for strong learners described in (5).
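A minimal Python sketch of the resulting procedure (Algorithm 2) is given here for the least-squares loss, where the pseudo-residual at g is simply y - g(X); it tracks only in-sample predictions, uses scikit-learn regression trees as weak learners, and the function and variable names are illustrative assumptions rather than our released implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def agbm_fit(X, y, n_iters=50, eta=0.1, gamma=0.1, max_depth=3):
    """Sketch of AGBM (Algorithm 2) for the least-squares loss.
    Two trees are fit per iteration: one to the pseudo-residual r^m and one
    to the corrected pseudo-residual c^m."""
    n = len(y)
    f = np.zeros(n)        # f^m(X), primary ensemble predictions
    h = np.zeros(n)        # h^m(X), momentum ensemble predictions
    c = np.zeros(n)        # corrected pseudo-residual c^{m-1}
    b2_prev = np.zeros(n)  # previous momentum tree predictions b_{tau_{m-1,2}}(X)
    trees_f, trees_h = [], []
    for m in range(n_iters):
        theta = 2.0 / (m + 2)
        g = (1 - theta) * f + theta * h                                  # step (1)
        r = y - g                                                        # step (2)
        t1 = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)        # step (3)
        f = g + eta * t1.predict(X)                                      # step (4)
        c = r if m == 0 else r + (m + 1.0) / (m + 2.0) * (c - b2_prev)   # step (5)
        t2 = DecisionTreeRegressor(max_depth=max_depth).fit(X, c)        # step (6)
        b2_prev = t2.predict(X)
        h = h + (gamma * eta / theta) * b2_prev                          # step (7)
        trees_f.append(t1)
        trees_h.append(t2)
    return trees_f, trees_h
```

Recovering an out-of-sample model additionally requires tracking the linear-combination coefficients that each tree inherits through the g, f, and h updates; the sketch omits this bookkeeping.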

4. Convergence Analysis of AGBM

We first formally define the assumptions required and then outline the computational guarantees for AGBM.

4.1. Assumptions

Let us introduce some standard regularity/continuity constraints on the loss function that we require in our analysis.

Definition 4.1. We denote by \frac{\partial \ell(y, f)}{\partial f} the derivative of the bivariate loss function ℓ w.r.t. the prediction f. We say that ℓ is σ-smooth if for any y and predictions f_1 and f_2, it holds that

\ell(y, f_1) \le \ell(y, f_2) + \frac{\partial \ell(y, f_2)}{\partial f}(f_1 - f_2) + \frac{\sigma}{2}(f_1 - f_2)^2.

We say ℓ is µ-strongly convex (with µ > 0) if for any y and predictions f_1 and f_2, it holds that

\ell(y, f_1) \ge \ell(y, f_2) + \frac{\partial \ell(y, f_2)}{\partial f}(f_1 - f_2) + \frac{\mu}{2}(f_1 - f_2)^2.

Note that µ ≤ σ always. Smoothness and strong convexity mean that the loss ℓ is upper and lower bounded by quadratic functions, respectively.


Algorithm 2 Accelerated Gradient Boosting Machine (AGBM)

Input. Starting function f^0(x) = 0, step-size \eta, momentum parameter \gamma \in (0, 1], and data X, y = (x_i, y_i)_{i \in [n]}.
Initialization. h^0(x) = f^0(x) and sequence \theta_m = \frac{2}{m+2}.
For m = 0, ..., M-1 do:
Perform Updates:
(1) Compute a linear combination of f and h: g^m(x) = (1 - \theta_m) f^m(x) + \theta_m h^m(x).
(2) Compute the pseudo-residual: r^m = -\left[\frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}\right]_{i=1,\ldots,n}.
(3) Find the best weak learner for the pseudo-residual: \tau_{m,1} = \arg\min_{\tau \in T} \sum_{i=1}^{n} (r_i^m - b_\tau(x_i))^2.
(4) Update the model: f^{m+1}(x) = g^m(x) + \eta b_{\tau_{m,1}}(x).
(5) Update the corrected residual: c_i^m = r_i^m if m = 0, and c_i^m = r_i^m + \frac{m+1}{m+2}\left(c_i^{m-1} - b_{\tau_{m-1,2}}(x_i)\right) otherwise.
(6) Find the best weak learner for the corrected residual: \tau_{m,2} = \arg\min_{\tau \in T} \sum_{i=1}^{n} (c_i^m - b_\tau(x_i))^2.
(7) Update the momentum model: h^{m+1}(x) = h^m(x) + \frac{\gamma\eta}{\theta_m}\, b_{\tau_{m,2}}(x).
Output. f^M(x).

Intuitively, smoothness implies that the gradient does not change abruptly, and hence ℓ is never 'sharp'. Strong convexity implies that ℓ always has some 'curvature' and is never 'flat'.
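As a concrete illustration (our own calculation, assuming the standard forms of the losses below), the constants in Definition 4.1 can be read off from the second derivative of ℓ in its second argument. For the least-squares loss, \ell(y, f) = \tfrac{1}{2}(y - f)^2, we have \frac{\partial^2 \ell}{\partial f^2} = 1, so σ = µ = 1. For the logistic loss with y ∈ {-1, +1}, \ell(y, f) = \log(1 + e^{-yf}), we have

\frac{\partial^2 \ell}{\partial f^2} = \frac{e^{-yf}}{(1 + e^{-yf})^2} \le \frac{1}{4},

so σ = 1/4 (hence the largest admissible step-size 1/σ = 4 used later in Section 6.2), while µ can be arbitrarily small over unbounded predictions.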

The notion of Minimal Cosine Angle (MCA) introduced in (Lu & Mazumder, 2018) plays a central role in our convergence rate analysis of GBM. MCA measures how well the weak learner b_{\tau(r)}(X) approximates the desired residual r.

Definition 4.2. Let r ∈ R^n be a vector. The Minimal Cosine Angle (MCA) is defined as the similarity between r and the output of the best-fit learner b_τ(X):

\Theta := \min_{r \in \mathbb{R}^n} \max_{\tau \in T} \cos(r, b_\tau(X)), \qquad (7)

where b_τ(X) ∈ R^n is the vector of predictions [b_τ(x_i)]_i.

The quantity Θ ∈ (0, 1] measures how "dense" the learners are in the prediction space. For strong learners (as in Section 3.1), the prediction space is complete and Θ = 1. For a complex class of learners T such as deep trees, we expect the prediction space to be dense and Θ ≈ 1. For a simpler class such as tree stumps, Θ would be much smaller. Refer to (Lu & Mazumder, 2018) for a discussion of Θ.
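For intuition (an illustrative computation of ours, not part of the paper's analysis), the inner quantity in (7), the cosine between a residual vector and the predictions of its best-fit tree, can be estimated empirically; deeper trees typically give values closer to 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cosine_fit_quality(X, r, max_depth):
    """cos(r, b_tau(X)) for the least-squares best-fit tree of a given depth."""
    b = DecisionTreeRegressor(max_depth=max_depth).fit(X, r).predict(X)
    return np.dot(r, b) / (np.linalg.norm(r) * np.linalg.norm(b))

# Example: deeper trees fit a random residual more closely (cosine nearer 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
r = rng.normal(size=200)
print(cosine_fit_quality(X, r, max_depth=1), cosine_fit_quality(X, r, max_depth=6))
```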

4.2. Computational Guarantees

We are now ready to state the main theoretical result of our paper.

Theorem 4.1. Consider the Accelerated Gradient Boosting Machine (Algorithm 2). Suppose ℓ is σ-smooth, the step-size η ≤ 1/σ, and the momentum parameter γ ≤ Θ^4/(4 + Θ^2), where Θ is the MCA introduced in Definition 4.2. Then for all M ≥ 0, we have:

L(f^M) - L(f^*) \le \frac{1}{2\eta\gamma (M+1)^2}\,\|f^*(X)\|_2^2.

Proof Sketch. Here we only give an outline; the full proof can be found in the Appendix (Section B). We use the potential-function based analysis of accelerated methods (cf. (Tseng, 2008; Wilson et al., 2016)). Recall that θ_m = 2/(m+2). For the proof, we introduce the following vector sequence of auxiliary ensembles \tilde{h}:

\tilde{h}^0(X) = 0, \qquad \tilde{h}^{m+1}(X) = \tilde{h}^m(X) + \frac{\eta\gamma}{\theta_m}\, r^m.

The sequence \tilde{h}^m(X) is in fact closely tied to the sequence h^m(X), as we demonstrate in the Appendix (Lemma B.1). Let f^* be any function which attains the optimal loss (2):

f^* \in \arg\min_{f \in \mathrm{lin}(B)} \Big\{ L(f) := \sum_{i=1}^{n} \ell(y_i, f(x_i)) \Big\}.

Let us define the following sequence of potentials:

V^m(f^*) = \begin{cases}
\frac{1}{2}\left\|f^*(X) - \tilde{h}^0(X)\right\|^2 & \text{if } m = 0, \\
\frac{\eta\gamma}{\theta_{m-1}^2}\left(L(f^m) - L^*\right) + \frac{1}{2}\left\|f^*(X) - \tilde{h}^m(X)\right\|^2 & \text{otherwise.}
\end{cases}

Typical proofs of accelerated algorithms show that the potential V^m(f^*) is a decreasing sequence. In boosting, we use the weak learner that fits the pseudo-residual of the loss. This guarantees sufficient decay of the first term of V^m(f^*), which involves the loss L(f). However, there is no guarantee that the same weak learner also provides sufficient decay of the second term, as we do not a priori know the optimal ensemble f^*. This is the major challenge in the development of AGBM.

We instead show that the potential increases by at most an error term δ_m:

V^{m+1}(f^*) \le V^m(f^*) + \delta_m,

where δ_m depends on Θ (see Lemma B.4 for the exact definition of δ_m and the proof of the claim). By telescoping, it holds that


\frac{\eta\gamma}{\theta_m^2}\left(L(f^{m+1}) - L(f^*)\right) \le V^{m+1}(f^*) \le \sum_{j=0}^{m} \delta_j + \frac{1}{2}\left\|f^*(X) - \tilde{h}^0(X)\right\|^2.

Finally, a careful analysis of the error term (Lemma B.5) shows that \sum_{j=0}^{m} \delta_j \le 0 for any m ≥ 0. Therefore,

L(f^{m+1}) - L(f^*) \le \frac{\theta_m^2}{2\eta\gamma}\,\|f^*(X)\|^2,

which furnishes the proof by letting m = M - 1 and substituting the value of θ_m.

Remark 4.1. Theorem 4.1 implies that to get a function f^M such that the error L(f^M) - L(f^*) ≤ ε, we need a number of iterations M ∼ O\big(\frac{1}{\Theta^2\sqrt{\varepsilon}}\big). In contrast, standard gradient boosting machines, as proved in (Lu & Mazumder, 2018), require M ∼ O\big(\frac{1}{\Theta^2\varepsilon}\big). This means that for small values of ε, AGBM (Algorithm 2) can require far fewer weak learners than GBM (Algorithm 1).
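For a rough sense of scale (our illustrative arithmetic, ignoring constants and treating Θ as fixed), take Θ = 0.5 and a target accuracy ε = 10^{-4}:

M_{\mathrm{GBM}} \sim \frac{1}{\Theta^2\varepsilon} = \frac{1}{0.25 \times 10^{-4}} = 4\times 10^{4}, \qquad M_{\mathrm{AGBM}} \sim \frac{1}{\Theta^2\sqrt{\varepsilon}} = \frac{1}{0.25 \times 10^{-2}} = 4\times 10^{2},

i.e. roughly a hundred-fold reduction in the number of boosting iterations.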

5. Extensions and Variants

In this section we study two more practical variants of AGBM. First we see how to restart the algorithm to take advantage of strong convexity of the loss function. Then we study a straightforward approach to accelerated GBM, which we call the vanilla accelerated gradient boosting machine (VAGBM), a variant of the algorithm recently proposed in (Biau et al., 2018), which however comes without any theoretical guarantees.

5.1. Restart and Linear Convergence

It is more common to show a linear rate of convergence for GBM methods by additionally assuming that the loss function is µ-strongly convex (e.g. (Lu & Mazumder, 2018)). It is then relatively straightforward to recover an accelerated linear rate of convergence by restarting Algorithm 2.

Theorem 5.1. Consider Accelerated Gradient Boosting with Restarts with Option 1 (Algorithm 3). Suppose that ℓ is σ-smooth and µ-strongly convex. If the step-size η ≤ 1/σ and the momentum parameter γ ≤ Θ^4/(4 + Θ^2), then for any p and optimal loss L(f^*),

L(f^{p+1}) - L(f^*) \le \frac{1}{2}\left(L(f^p) - L(f^*)\right).

Proof. The loss function ℓ is µ-strongly convex, which implies that

\frac{\mu}{2}\left\|f(X) - f^*(X)\right\|_2^2 \le L(f) - L(f^*).

Substituting this in Theorem 4.1 gives us that

L(f^M) - L(f^*) \le \frac{1}{\mu\eta\gamma (M+1)^2}\left(L(f^0) - L(f^*)\right).

Recalling that f^0(x) = f^p(x), f^M(x) = f^{p+1}(x), and M^2 = 2/(\eta\mu\gamma) gives us the required statement.

The restart strategy in Option 1 requires knowledge of the strong-convexity constant µ. Alternatively, one can also use the adaptive restart strategy (Option 2), which is known to have good empirical performance (Odonoghue & Candes, 2015).

Remark 5.1. Theorem 5.1 shows that M = O\big(\frac{1}{\Theta^2}\sqrt{\frac{\sigma}{\mu}}\log(1/\varepsilon)\big) weak learners are sufficient to obtain an error of ε using AGBMR (Algorithm 3). In contrast, standard GBM (Algorithm 1) requires M = O\big(\frac{1}{\Theta^2}\frac{\sigma}{\mu}\log(1/\varepsilon)\big) weak learners. Thus AGBMR is significantly better than GBM only if the condition number is large, i.e. σ/µ ≫ 1. When ℓ(y, f) is the least-squares loss (µ = σ = 1), we would see no advantage of acceleration. However, for more complicated loss functions with σ ≫ µ (e.g. logistic loss or exponential loss), AGBMR might result in an ensemble that is significantly better (e.g. obtaining lower training loss) than that of GBM for the same number of weak learners.

5.2. A Vanilla Accelerated Gradient Boosting Method

A natural question to ask is whether, instead of adding two learners at each iteration, we can get away with adding only one. Below we show what such an algorithm would look like and argue that it may not always converge.

Following the updates in Equation (5), we can get a direct acceleration of GBM by using the weak learner fitting the gradient. This leads to Algorithm 4.

Algorithm 4 is equivalent to the recently developed accelerated gradient boosting machine algorithms (Biau et al., 2018; Fouillen et al., 2018). Unfortunately, it may not always converge to the optimum, or may even diverge. This is because b_{\tau_m} from Step (3) is only an approximate fit to r^m, meaning that we only take an approximate gradient descent step. While this is not an issue in the non-accelerated version, in Step (5) of Algorithm 4 the momentum term pushes the h sequence to take a large step along the approximate gradient direction. This exacerbates the effect of the approximate direction and can lead to an additive accumulation of error, as shown in (Devolder et al., 2014). In Section 6.1, we see that this is not just a theoretical concern: Algorithm 4 also diverges in practice in some situations.

Remark 5.2. Our corrected residual c^m in Algorithm 2 was crucial to the theoretical proof of convergence in Theorem 4.1. One extension could be to introduce γ ∈ (0, 1) in step (5) of Algorithm 4, just as in Algorithm 2.


Algorithm 3 Accelerated Gradient Boosting Machine with Restart (AGBMR)

Input: Starting function f^0(x), step-size \eta, momentum parameter \gamma \in (0, 1], strong-convexity parameter \mu.
For p = 0, ..., P-1 do:
(1) Run AGBM (Algorithm 2) initialized with f^0(x) = f^p(x):
    Option 1: for M = \sqrt{\frac{2}{\eta\gamma\mu}} iterations.
    Option 2: as long as L(f^{M+1}) \le L(f^M).
(2) Set f^{p+1}(x) = f^M(x).
Output: f^P(x).
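A minimal Python sketch of this restart loop with Option 1 is shown below; `run_agbm` is a hypothetical warm-startable wrapper around Algorithm 2 (not a library call), and all names are illustrative.

```python
import numpy as np

def agbmr_fit(X, y, run_agbm, eta, gamma, mu, n_restarts=5):
    """Sketch of AGBMR (Algorithm 3), Option 1.

    run_agbm(X, y, f0, n_iters, eta, gamma) is assumed to run Algorithm 2
    warm-started from the in-sample prediction vector f0 and to return the
    updated predictions; it is a hypothetical interface, not a library call.
    """
    M = int(np.ceil(np.sqrt(2.0 / (eta * gamma * mu))))  # Option 1 inner length
    f = np.zeros(len(y))
    for _ in range(n_restarts):
        f = run_agbm(X, y, f0=f, n_iters=M, eta=eta, gamma=gamma)
    return f
```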

Algorithm 4 Vanilla Accelerated Gradient Boosting Machine (VAGBM)

Input. Starting function f^0(x) = 0, step-size \eta, momentum parameter \gamma \in (0, 1].
Initialization. h^0(x) = f^0(x), and sequence \theta_m = \frac{2}{m+2}.
For m = 0, ..., M-1 do:
Perform Updates:
(1) Compute a linear combination of f and h: g^m(x) = (1 - \theta_m) f^m(x) + \theta_m h^m(x).
(2) Compute the pseudo-residual: r^m = -\left[\frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}\right]_{i=1,\ldots,n}.
(3) Find the best weak learner for the pseudo-residual: \tau_m = \arg\min_{\tau \in T} \sum_{i=1}^{n} (r_i^m - b_\tau(x_i))^2.
(4) Update the model: f^{m+1}(x) = g^m(x) + \eta b_{\tau_m}(x).
(5) Update the momentum model: h^{m+1}(x) = h^m(x) + \frac{\eta}{\theta_m}\, b_{\tau_m}(x).
Output. f^M(x).

6. Numerical Experiments

In this section, we present the results of computational experiments and discuss the performance of AGBM with trees as weak learners. Subsection 6.1 demonstrates that the algorithm described in Section 5.2 may diverge numerically; Subsection 6.2 shows training and testing performance for GBM and AGBM with different parameters; and Subsection 6.3 compares the performance of GBM and AGBM with best tuned parameters. The code for the numerical experiments will also be open-sourced.

Datasets: Table 2 summarizes the basic statistics of the LIBSVM datasets that were used.

Dataset    | Task           | # samples | # features
a1a        | classification | 1605      | 123
w1a        | classification | 2477      | 300
diabetes   | classification | 768       | 8
german     | classification | 1000      | 24
housing    | regression     | 506       | 13
eunite2001 | regression     | 336       | 16

Table 2. Basic statistics of the (real) datasets used.

AGBM with CART trees: In our experiments, all algorithms use CART trees as the weak learners. For classification problems, we use the logistic loss, and for regression problems, we use the least-squares loss. To reduce the computational cost, for each split and each feature, we consider 100 quantiles (instead of potentially all n values). These strategies are commonly used in implementations of GBM like (Chen & Guestrin, 2016; Ponomareva et al., 2017).
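For instance, the candidate split thresholds for one feature can be taken as at most 100 empirical quantiles of its values rather than all n observed values; a small illustrative sketch (ours) of this preprocessing step:

```python
import numpy as np

def quantile_thresholds(feature_values, n_bins=100):
    """Candidate split thresholds for one feature: at most n_bins quantiles
    of the observed values instead of all distinct values."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]       # interior quantile levels
    return np.unique(np.quantile(feature_values, qs))  # deduplicate tied thresholds
```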

6.1. Evidence that VAGBM May Diverge

Figure 1 shows the training loss versus the number of trees for the housing dataset with step-sizes η = 1 and η = 0.3, for VAGBM and for AGBM with different parameters γ. The x-axis is the number of trees added to the ensemble (recall that our AGBM algorithm adds two trees to the ensemble per iteration, so the number of boosting iterations of VAGBM and AGBM is different). As we can see, when η is large, the training loss for VAGBM diverges very quickly, while our AGBM with a proper parameter γ converges. When η gets smaller, the training loss for VAGBM may decay faster than that of our AGBM at the beginning, but it gets stuck and never converges to the true optimal solution. Eventually the training loss of VAGBM may even diverge. On the other hand, our theory guarantees that AGBM always converges to the optimal solution.

6.2. AGBM Sensitivity to the hyperparameters

In this section we look at how the two parameters η and γ affect the performance of AGBM. Figure 2 shows the training loss and the testing loss versus the number of trees for the a1a dataset with two different step-sizes, η = 4 and η = 0.1


Figure 1. Training loss versus number of trees for VAGBM (which doesn't converge) and AGBM with different parameters γ. Left panel: η = 1; right panel: η = 0.3. y-axis: training loss; x-axis: number of trees.

(recall AGBM adds two trees per iteration). When the step-size η is large (with the logistic loss, the largest step-size that guarantees convergence is η = 1/σ = 4), the training loss decays very fast, and traditional GBM can converge even faster than our AGBM at the beginning. But the testing performance suffers, demonstrating that such fast convergence (due to the learning rate) can result in severe overfitting. In this case, our AGBM with a proper parameter γ has slightly better testing performance. When the step-size becomes smaller, the testing performance of all algorithms becomes more stable, though training becomes slower. AGBM with a proper γ may require fewer iterations/trees to reach a good training/testing performance.

Figure 2. Training and testing loss versus number of trees for logistic regression on a1a. Left panel: η = 4; right panel: η = 0.1. Top row: training loss; bottom row: testing loss; x-axis: number of trees.

6.3. Experiments with Fine Tuning

In this section we look at the testing performance of GBM, VAGBM, and AGBM on the six datasets with hyperparameter tuning. We use depth-5 trees as weak learners, and stop splitting early when the gain is smaller than 0.001 (roughly 1/n for these datasets). The hyper-parameters and their ranges we tuned are:

• step size (η): 0.01, 0.03, 0.1, 0.3, 1 for least-squares loss and 0.04, 0.12, 0.4, 1.2, 4 for logistic loss;

Dataset    | Method | Training | Testing | # iter | # trees
a1a        | GBM    | 0.2187   | 0.3786  | 97     | 97
a1a        | VAGBM  | 0.2454   | 0.3661  | 33     | 33
a1a        | AGBM   | 0.1994   | 0.3730  | 33     | 66
w1a        | GBM    | 0.0262   | 0.0578  | 84     | 84
w1a        | VAGBM  | 0.0409   | 0.0578  | 32     | 32
w1a        | AGBM   | 0.0339   | 0.0552  | 47     | 94
diabetes   | GBM    | 0.297    | 0.462   | 87     | 87
diabetes   | VAGBM  | 0.271    | 0.462   | 24     | 24
diabetes   | AGBM   | 0.297    | 0.458   | 47     | 94
german     | GBM    | 0.244    | 0.505   | 54     | 54
german     | VAGBM  | 0.288    | 0.514   | 51     | 51
german     | AGBM   | 0.305    | 0.485   | 35     | 70
housing    | GBM    | 0.2152   | 4.6603  | 93     | 93
housing    | VAGBM  | 0.5676   | 5.8090  | 73     | 73
housing    | AGBM   | 0.215    | 4.5074  | 35     | 70
eunite2001 | GBM    | 36.73    | 270.1   | 64     | 64
eunite2001 | VAGBM  | 28.99    | 245.2   | 58     | 58
eunite2001 | AGBM   | 26.74    | 245.4   | 24     | 48

Table 3. Performance after tuning hyper-parameters.

• number of trees: 10, 11, ..., 100;
• momentum parameter γ (only for AGBM): 0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.5, 1.

For each dataset, we randomly choose 80% of the samples as the training dataset and use the remainder as the final testing dataset. We use 5-fold cross validation on the training dataset to tune the hyperparameters. Instead of going through all possible hyper-parameter combinations, we utilize randomized search (RandomizedSearchCV in scikit-learn). As AGBM has more parameters (namely γ), we ran proportionally more iterations of random search for AGBM. Table 3 presents the performance of GBM, VAGBM, and AGBM with the tuned parameters. As we can see, the accelerated methods (AGBM and VAGBM) in general require fewer iterations to get similar or slightly better testing performance than GBM. Compared with VAGBM, AGBM adds two trees per iteration, which can be more expensive, but the performance of AGBM can be more stable; for example, the testing error of VAGBM on the housing dataset is much larger than that of AGBM.
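For reference, the tuning protocol can be mimicked with scikit-learn's RandomizedSearchCV; the sketch below is illustrative only: it substitutes sklearn's GradientBoostingClassifier and a built-in dataset for our AGBM implementation and the LIBSVM data, while keeping the logistic-loss step-size grid and tree-count range above.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Stand-in data and estimator; our AGBM is not a scikit-learn estimator.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

param_distributions = {
    "learning_rate": [0.04, 0.12, 0.4, 1.2, 4.0],  # logistic-loss step-size grid
    "n_estimators": randint(10, 101),              # number of trees in 10..100
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(max_depth=5),
    param_distributions,
    n_iter=30,      # sampled configurations instead of a full grid
    cv=5,           # 5-fold cross validation on the training split
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```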

7. Conclusion

In this paper, we proposed a novel Accelerated Gradient Boosting Machine (AGBM), proved its rate of convergence, and introduced a computationally inexpensive practical variant of AGBM that takes advantage of strong convexity of the loss function and achieves linear convergence. Finally, we demonstrated with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.


References

Allen-Zhu, Z. and Orecchia, L. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.

Biau, G., Cadre, B., and Rouviere, L. Accelerated gradient boosting. arXiv preprint arXiv:1803.02042, 2018.

Bickel, P. J., Ritov, Y., and Zakai, A. Some theory for generalized boosting algorithms. Journal of Machine Learning Research, 7(May):705-732, 2006.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Wadsworth, 1984. ISBN 0-534-98053-8.

Bubeck, S., Lee, Y. T., and Singh, M. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.

Buhlmann, P., Hothorn, T., et al. Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4):477-505, 2007.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794. ACM, 2016.

Collins, M., Schapire, R. E., and Singer, Y. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253-285, 2002.

Devolder, O., Glineur, F., and Nesterov, Y. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37-75, 2014.

Fouillen, E., Boyer, C., and Sangnier, M. Accelerated proximal boosting. arXiv preprint arXiv:1808.09670, 2018.

Freund, R. M., Grigas, P., Mazumder, R., et al. A new perspective on boosting in linear regression via subgradient optimization and relatives. The Annals of Statistics, 45(6):2328-2364, 2017.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:2000, 1998.

Friedman, J., Hastie, T., and Tibshirani, R. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.

Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189-1232, 2001.

Frostig, R., Ge, R., Kakade, S., and Sidford, A. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, 2015.

Hu, B. and Lessard, L. Dissipativity theory for Nesterov's accelerated method. arXiv preprint arXiv:1706.04381, 2017.

Karimireddy, S. P., Stich, S. U., and Jaggi, M. Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients. arXiv preprint arXiv:1806.00413, 2018.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146-3154, 2017.

Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.

Locatello, F., Raj, A., Karimireddy, S. P., Ratsch, G., Scholkopf, B., Stich, S., and Jaggi, M. On matching pursuit and coordinate descent. In International Conference on Machine Learning, pp. 3204-3213, 2018.

Lu, H. and Mazumder, R. Randomized gradient boosting machine. arXiv preprint arXiv:1810.10158, 2018.

Lu, H., Freund, R. M., and Mirrokni, V. Accelerating greedy coordinate descent methods. arXiv preprint arXiv:1806.02476, 2018.

Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. R. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pp. 512-518, 2000.

Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372-376, 1983.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2004.

Nesterov, Y. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159-181, 2008.

Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.

Odonoghue, B. and Candes, E. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715-732, 2015.


Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.

Ponomareva, N., Radpour, S., Hendry, G., Haykal, S., Colthurst, T., Mitrichev, P., and Grushetsky, A. TF Boosted Trees: A scalable TensorFlow based framework for gradient boosting. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 423-427. Springer, 2017.

Ridgeway, G., Southworth, M. H., and RUnit, S. Package gbm. Viitattu, 10(2013):40, 2013.

Sigrist, F. Gradient and Newton boosting for classification and regression. arXiv preprint arXiv:1808.03064, 2018.

Su, W., Boyd, S., and Candes, E. J. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17(153):1-43, 2016.

Sun, P., Zhang, T., and Zhou, J. A convergence rate analysis for LogitBoost, MART and their variant. In ICML, pp. 1251-1259, 2014.

Telgarsky, M. A primal-dual convergence analysis of boosting. Journal of Machine Learning Research, 13(Mar):561-606, 2012.

Tseng, P. On accelerated proximal gradient methods for convex-concave optimization. Preprint, 2008.

Wilson, A. C., Recht, B., and Jordan, M. I. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.

Zhang, T. and Yu, B. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538-1579, 2005.


A. Additional Discussions

Below we include some additional discussions which could not fit into the main paper but which nevertheless help to understand the relevance of our results when applied to frameworks typically used in practice.

A.1. Line search in Boosting

Traditionally, the analysis of gradient boosting methods has focused on algorithms which use a line search to select the step-size η (e.g. Algorithm 1). Analysis of gradient descent suggests that this is not necessary: using a fixed step-size of 1/β, where the loss is β-smooth, is sufficient (Lu & Mazumder, 2018). Our accelerated Algorithm 2 also adopts this fixed step-size strategy. In fact, even the standard boosting libraries (XGBoost and TFBT) typically use a fixed (but tuned) step-size and avoid an expensive line search.

A.2. Use of Hessian

Popular boosting libraries such as XGBoost (Chen & Guestrin, 2016) and TFBT (Ponomareva et al., 2017) compute the Hessian and perform a Newton boosting step instead of a gradient boosting step. Since the Newton step may not be well defined (e.g. if the Hessian is degenerate), an additional euclidean regularizer is also added. This has been shown to improve performance and reduce the need for a line search for the η parameter sequence (Sun et al., 2014; Sigrist, 2018). For LogitBoost (i.e. when the loss is the logistic loss), (Sun et al., 2014) demonstrate that a trust-region Newton's method can indeed significantly improve the convergence. Leveraging similar results from second-order methods for convex optimization (e.g. (Nesterov & Polyak, 2006; Karimireddy et al., 2018)) and adapting accelerated second-order methods (Nesterov, 2008) would be an interesting direction for future work.
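As an illustration of the Newton-boosting idea (a generic sketch of ours, not the exact XGBoost or TFBT implementation), for the logistic loss with labels in {-1, +1} the value of each tree leaf is computed from per-example gradients and Hessians with an added euclidean regularizer λ:

```python
import numpy as np

def newton_leaf_values(y, f, leaf_index, n_leaves, lam=1.0):
    """One Newton boosting step for the logistic loss with y in {-1, +1}.

    For each leaf L, the value is  -sum_{i in L} g_i / (sum_{i in L} h_i + lam),
    where g_i and h_i are the first and second derivatives of the loss at f_i.
    """
    p = 1.0 / (1.0 + np.exp(-f))   # model probability of the +1 class
    g = p - (y + 1) / 2.0          # gradient of log(1 + exp(-y f)) w.r.t. f
    h = p * (1.0 - p)              # Hessian, always in (0, 1/4]
    values = np.zeros(n_leaves)
    for leaf in range(n_leaves):
        mask = leaf_index == leaf  # samples routed to this leaf by the tree
        values[leaf] = -g[mask].sum() / (h[mask].sum() + lam)
    return values
```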

A.3. Out-of-sample Performance

Throughout this work we focus only on minimizing the empirical training loss L(f) (see Formula (2)). In reality, what we really care about is the out-of-sample error of our resulting ensemble f^M(x). A number of regularization tricks, such as i) early stopping (Zhang & Yu, 2005), ii) pruning (Chen & Guestrin, 2016; Ponomareva et al., 2017), iii) smaller step-sizes (Ponomareva et al., 2017), and iv) dropout (Ponomareva et al., 2017), are usually employed in practice to prevent over-fitting and improve generalization. Since AGBM requires far fewer iterations than GBM to achieve the same training loss, it outputs a much sparser set of learners. We believe this is partially the reason for its better out-of-sample performance. However, a joint theoretical study of the out-of-sample error along with the empirical error L(f) is much needed. It would also shed light on the effectiveness of the numerous ad-hoc regularization techniques currently employed.

B. Proof of Theorem 4.1

This section proves the major theoretical result of the paper:

Theorem 4.1. Consider the Accelerated Gradient Boosting Machine (Algorithm 2). Suppose ℓ is σ-smooth, the step-size η ≤ 1/σ, and the momentum parameter γ ≤ Θ^4/(4 + Θ^2). Then for all M ≥ 0, we have:

L(f^M) - L(f^*) \le \frac{1}{2\eta\gamma (M+1)^2}\,\|f^*(X)\|_2^2.

Let us start with some new notation. Define scalar constants s = γ/Θ^2 and t := (1 - s)/2 ∈ (0, 1). We mostly only need s + t ≤ 1; the specific values of γ and t are needed only in Lemma B.6. Then define

\alpha_m := \frac{\eta\gamma}{\theta_m} = \frac{\eta s \Theta^2}{\theta_m},


then the definitions of the sequences {r^m}, {c^m}, \tilde{h}^m(X), and {θ_m} from Algorithm 2 can be simplified as:

\theta_m = \frac{2}{m+2},
r^m = -\left[\frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}\right]_{i=1,\ldots,n},
c^m = r^m + \frac{\alpha_{m-1}}{\alpha_m}\left(c^{m-1} - b_{\tau_{m-1,2}}(X)\right),
\tilde{h}^{m+1}(X) = \tilde{h}^m(X) + \alpha_m r^m.

The sequence \tilde{h}^m(X) is in fact closely tied to the sequence h^m(X), as we show in the next lemma. For notational convenience, we define c^{-1} = b_{\tau_{-1,2}}(X) = 0 and similarly \frac{\alpha_{-1}}{\theta_{-1}} = 0 throughout the proof.

θ−1= 0 throughout the proof.

Lemma B.1. \tilde{h}^{m+1}(X) = h^{m+1}(X) + \alpha_m\left(c^m - b_{\tau_{m,2}}(X)\right).

Proof. Observe that

\tilde{h}^{m+1}(X) = \sum_{j=0}^{m} \alpha_j r^j \quad \text{and} \quad h^{m+1}(X) = \sum_{j=0}^{m} \alpha_j b_{\tau_{j,2}}(X).

Then we have

\tilde{h}^{m+1}(X) - h^{m+1}(X) = \sum_{j=0}^{m} \alpha_j \left(r^j - b_{\tau_{j,2}}(X)\right)
= \sum_{j=0}^{m} \alpha_j \left(r^j - \frac{\alpha_{j-1}}{\alpha_j} b_{\tau_{j-1,2}}(X)\right) - \alpha_m b_{\tau_{m,2}}(X)
= \sum_{j=0}^{m} \alpha_j \left(c^j - \frac{\alpha_{j-1}}{\alpha_j} c^{j-1}\right) - \alpha_m b_{\tau_{m,2}}(X)
= \sum_{j=0}^{m} \left(\alpha_j c^j - \alpha_{j-1} c^{j-1}\right) - \alpha_m b_{\tau_{m,2}}(X)
= \alpha_m \left(c^m - b_{\tau_{m,2}}(X)\right),

where the third equality is due to the definition of c^m.

Lemma B.2 presents the fact that there is sufficient decay of the loss function:

Lemma B.2. L(f^{m+1}) \le L(g^m) - \frac{\eta\Theta^2}{2}\,\|r^m\|^2.

Proof. Recall that τ_{m,1} is chosen such that

\tau_{m,1} = \arg\min_{\tau \in T} \left\|b_\tau(X) - r^m\right\|^2.

Since the class of learners T is scalable (Assumption 2.1), we have

\left\|b_{\tau_{m,1}}(X) - r^m\right\|^2 = \min_{\tau \in T} \min_{\sigma \in \mathbb{R}} \left\|\sigma b_\tau(X) - r^m\right\|^2
= \left\|r^m\right\|^2 \left(1 - \max_{\tau \in T} \cos\!\left(r^m, b_\tau(X)\right)^2\right)
\le \left\|r^m\right\|^2 \left(1 - \Theta^2\right), \qquad (8)


where the last inequality is because of the definition of Θ, and the second equality is due to the simple fact that for any two vectors a and b,

\min_{\sigma \in \mathbb{R}} \left\|\sigma a - b\right\|^2 = \left\|b\right\|^2 - 2\max_{\sigma \in \mathbb{R}}\left(\sigma\langle a, b\rangle - \frac{\sigma^2}{2}\left\|a\right\|^2\right) = \left\|b\right\|^2 - \left\|b\right\|^2\frac{\langle a, b\rangle^2}{\left\|a\right\|^2\left\|b\right\|^2}.

Now recall that L(f^{m+1}) = \sum_{i=1}^{n} \ell(y_i, f^{m+1}(x_i)) and that f^{m+1}(x) = g^m(x) + \eta b_{\tau_{m,1}}(x). Since the loss function \ell(y_i, \cdot) is σ-smooth and the step-size η ≤ 1/σ, it holds that

L(f^{m+1}) = \sum_{i=1}^{n} \ell\!\left(y_i, f^{m+1}(x_i)\right)
= \sum_{i=1}^{n} \ell\!\left(y_i, g^m(x_i) + \eta b_{\tau_{m,1}}(x_i)\right)
\le \sum_{i=1}^{n} \left( \ell\!\left(y_i, g^m(x_i)\right) + \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}\,\eta b_{\tau_{m,1}}(x_i) + \frac{\sigma}{2}\left(\eta b_{\tau_{m,1}}(x_i)\right)^2 \right)
\le \sum_{i=1}^{n} \left( \ell\!\left(y_i, g^m(x_i)\right) + \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}\,\eta b_{\tau_{m,1}}(x_i) + \frac{\eta}{2}\left(b_{\tau_{m,1}}(x_i)\right)^2 \right)
= \sum_{i=1}^{n} \left( \ell\!\left(y_i, g^m(x_i)\right) - r_i^m\,\eta b_{\tau_{m,1}}(x_i) + \frac{\eta}{2}\left(b_{\tau_{m,1}}(x_i)\right)^2 \right)
= L(g^m) - \eta\left\langle r^m, b_{\tau_{m,1}}(X)\right\rangle + \frac{\eta}{2}\left\|b_{\tau_{m,1}}(X)\right\|^2
= L(g^m) + \frac{\eta}{2}\left\|b_{\tau_{m,1}}(X) - r^m\right\|^2 - \frac{\eta}{2}\left\|r^m\right\|^2
\le L(g^m) - \frac{\Theta^2\eta}{2}\left\|r^m\right\|^2,

where the final inequality follows from (8). This furnishes the proof of the lemma.

Lemma B.3 is a basic fact about convex functions, and it is commonly used in the convergence analysis of accelerated methods.

Lemma B.3. For any function f and m ≥ 0,

L(g^m) + \theta_m\left\langle r^m, h^m(X) - f(X)\right\rangle \le \theta_m L(f) + (1 - \theta_m) L(f^m).

Proof. For any function f, it follows from the convexity of the loss function ℓ that

L(g^m) + \left\langle r^m, g^m(X) - f(X)\right\rangle = \sum_{i=1}^{n}\left( \ell\!\left(y_i, g^m(x_i)\right) + \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)}\left(f(x_i) - g^m(x_i)\right)\right) \le \sum_{i=1}^{n} \ell\!\left(y_i, f(x_i)\right) = L(f). \qquad (9)

Substituting f = f^m in (9), we get

L(g^m) + \left\langle r^m, g^m(X) - f^m(X)\right\rangle \le L(f^m). \qquad (10)

Also recall that g^m(X) = (1 - \theta_m) f^m(X) + \theta_m h^m(X). This can be rewritten as

\theta_m\left(g^m(X) - h^m(X)\right) = (1 - \theta_m)\left(f^m(X) - g^m(X)\right). \qquad (11)

Putting (9), (10), and (11) together:

L(g^m) + \theta_m\left\langle r^m, h^m(X) - f(X)\right\rangle
= L(g^m) + \theta_m\left\langle r^m, g^m(X) - f(X)\right\rangle + \theta_m\left\langle r^m, h^m(X) - g^m(X)\right\rangle
= \theta_m\left[L(g^m) + \left\langle r^m, g^m(X) - f(X)\right\rangle\right] + (1 - \theta_m)\left[L(g^m) + \left\langle r^m, g^m(X) - f^m(X)\right\rangle\right]
\le \theta_m L(f) + (1 - \theta_m) L(f^m),

which furnishes the proof.


We are ready to prove the key lemma which gives us the accelerated rate of convergence.

Lemma B.4. Define the following potential function V^m(f) for any given output function f:

V^m(f) = \frac{\alpha_{m-1}}{\theta_{m-1}}\left(L(f^m) - L(f)\right) + \frac{1}{2}\left\|f(X) - \tilde{h}^m(X)\right\|^2. \qquad (12)

At every step, the potential increases by at most δ_m:

V^{m+1}(f) \le V^m(f) + \delta_m,

where δ_m is defined as:

\delta_m := \frac{s\,\alpha_{m-1}^2}{2t}\left\|c^{m-1} - b_{\tau_{m-1,2}}(X)\right\|^2 - \frac{(1-s-t)\,\alpha_m^2}{2s}\left\|r^m\right\|^2. \qquad (13)

Proof. Recall that c^{-1} = b_{\tau_{-1,2}}(X) = 0 and \frac{\alpha_{-1}}{\theta_{-1}} = 0. It follows from Lemma B.2 that:

L(f^{m+1}) - L(g^m) + \frac{(1-s)\eta\Theta^2}{2}\|r^m\|^2
\le -\frac{s\eta\Theta^2}{2}\|r^m\|^2
= -\alpha_m\theta_m\|r^m\|^2 + \frac{\alpha_m\theta_m}{2}\|r^m\|^2
= \theta_m\left\langle r^m, \tilde{h}^m(X) - \tilde{h}^{m+1}(X)\right\rangle + \frac{\theta_m}{2\alpha_m}\left\|\tilde{h}^m(X) - \tilde{h}^{m+1}(X)\right\|^2
= \theta_m\left\langle r^m, \tilde{h}^m(X) - f(X)\right\rangle + \frac{\theta_m}{2\alpha_m}\left(\left\|f(X) - \tilde{h}^m(X)\right\|^2 - \left\|f(X) - \tilde{h}^{m+1}(X)\right\|^2\right),

where the second equality is by the definition of \tilde{h}^m(X) and the third is just a mathematical manipulation of the equation (it is also called the three-point property). By rearranging the above inequality, we have

L(f^{m+1}) + \frac{(1-s)\eta\Theta^2}{2}\|r^m\|^2
\le L(g^m) + \theta_m\left\langle r^m, \tilde{h}^m(X) - f(X)\right\rangle + \frac{\theta_m}{2\alpha_m}\left(\left\|f(X) - \tilde{h}^m(X)\right\|^2 - \left\|f(X) - \tilde{h}^{m+1}(X)\right\|^2\right)
= L(g^m) + \theta_m\left\langle r^m, h^m(X) - f(X)\right\rangle + \frac{\theta_m}{2\alpha_m}\left(\left\|f(X) - \tilde{h}^m(X)\right\|^2 - \left\|f(X) - \tilde{h}^{m+1}(X)\right\|^2\right) + \theta_m\left\langle r^m, \tilde{h}^m(X) - h^m(X)\right\rangle
\le \theta_m L(f) + (1 - \theta_m) L(f^m) + \frac{\theta_m}{2\alpha_m}\left(\left\|f(X) - \tilde{h}^m(X)\right\|^2 - \left\|f(X) - \tilde{h}^{m+1}(X)\right\|^2\right) + \theta_m\alpha_{m-1}\left\langle r^m, c^{m-1} - b_{\tau_{m-1,2}}(X)\right\rangle,

where the last inequality uses Lemma B.3 together with the fact that \tilde{h}^m(X) - h^m(X) = \alpha_{m-1}\left(c^{m-1} - b_{\tau_{m-1,2}}(X)\right) from Lemma B.1. Rearranging the terms and multiplying by \alpha_m/\theta_m leads to

\frac{\alpha_m}{\theta_m}\left(L(f^{m+1}) - L(f)\right) + \frac{1}{2}\left\|f(X) - \tilde{h}^{m+1}(X)\right\|^2
\le \underbrace{\frac{\alpha_m(1-\theta_m)}{\theta_m}}_{:=A}\left(L(f^m) - L(f)\right) + \frac{1}{2}\left\|f(X) - \tilde{h}^m(X)\right\|^2 + \underbrace{\alpha_m\alpha_{m-1}\left\langle r^m, c^{m-1} - b_{\tau_{m-1,2}}(X)\right\rangle - \frac{(1-s)\eta\Theta^2\alpha_m}{2\theta_m}\|r^m\|^2}_{:=B}.

Let us first examine the term A:

\frac{\alpha_m(1-\theta_m)}{\theta_m} = (\eta\Theta^2 s)\,\frac{1-\theta_m}{\theta_m^2} \le (\eta\Theta^2 s)\,\frac{1}{\theta_{m-1}^2} = \frac{\alpha_{m-1}}{\theta_{m-1}}.

We have thus far shown that

V^{m+1}(f) \le V^m(f) + B,


and we now need to show that B ≤ δ_m. Using the Mean-Value inequality, the first term in B can be bounded as

\alpha_m\alpha_{m-1}\left\langle r^m, c^{m-1} - b_{\tau_{m-1,2}}(X)\right\rangle \le \frac{\alpha_m^2 t}{2s}\|r^m\|^2 + \frac{\alpha_{m-1}^2 s}{2t}\left\|c^{m-1} - b_{\tau_{m-1,2}}(X)\right\|^2.

Substituting it in B shows:

B = \alpha_m\alpha_{m-1}\left\langle r^m, c^{m-1} - b_{\tau_{m-1,2}}(X)\right\rangle - \frac{(1-s)\eta\Theta^2\alpha_m}{2\theta_m}\|r^m\|^2
\le \frac{\alpha_m^2 t}{2s}\|r^m\|^2 + \frac{\alpha_{m-1}^2 s}{2t}\left\|c^{m-1} - b_{\tau_{m-1,2}}(X)\right\|^2 - \frac{(1-s)\alpha_m^2}{2s}\|r^m\|^2
= \frac{\alpha_{m-1}^2 s}{2t}\left\|c^{m-1} - b_{\tau_{m-1,2}}(X)\right\|^2 - \frac{(1-s-t)\alpha_m^2}{2s}\|r^m\|^2
= \delta_m,

which finishes the proof.

Unlike in typical proofs of accelerated algorithms, which usually show that the potential V^m(f) is a decreasing sequence, there is no guarantee that the potential V^m(f) is decreasing in the boosting setting, due to the use of weak learners. Instead, we are able to prove the following:

Lemma B.5. For any given m, it holds that \sum_{j=0}^{m} \delta_j \le 0.

Proof. We can rewrite the statement of the lemma as:

\sum_{j=0}^{m-1} \alpha_j^2 \left\|c^j - b_{\tau_{j,2}}(X)\right\|^2 \le \frac{t(1-s-t)}{s^2} \sum_{j=0}^{m} \alpha_j^2 \left\|r^j\right\|^2. \qquad (14)

Here, let us focus on the term \|c^{j+1} - b_{\tau_{j+1,2}}(X)\|^2 for a given j. We have that

\left\|c^{j+1} - b_{\tau_{j+1,2}}(X)\right\|^2 \le (1-\Theta^2)\left\|c^{j+1}\right\|^2
= (1-\Theta^2)\left\|r^{j+1} + \frac{\theta_{j+1}}{\theta_j}\left(c^j - b_{\tau_{j,2}}(X)\right)\right\|^2
\le (1-\Theta^2)(1+\rho)\left\|r^{j+1}\right\|^2 + (1-\Theta^2)(1+1/\rho)\left\|\frac{\theta_{j+1}}{\theta_j}\left(c^j - b_{\tau_{j,2}}(X)\right)\right\|^2
\le (1+\rho)(1-\Theta^2)\left\|r^{j+1}\right\|^2 + (1-\Theta^2)(1+1/\rho)\left\|c^j - b_{\tau_{j,2}}(X)\right\|^2,

where the first inequality follows from our assumption about the density of the weak-learner class B (the same argument as in (8)), the second inequality holds for any ρ > 0 due to the Mean-Value inequality, and the last inequality is from θ_{j+1} ≤ θ_j. This gives a recursive bound on the left side of (14). From this, (14) follows from an elementary fact about recursive sequences as stated in Lemma B.6, with a_j = \alpha_j^2 \left\|c^j - b_{\tau_{j,2}}(X)\right\|^2 and c_j = \alpha_j^2 \left\|r^j\right\|^2.

Remark B.1. If c^m = b_{\tau_{m,2}}(X) (i.e. our class of learners B is strong), then \delta_m = -\frac{(1-s-t)\alpha_m^2}{2s}\|r^m\|^2 \le 0.

Lemma B.6 is an elementary fact about recursive sequences used in the proof of Lemma B.5.

Lemma B.6. Given two sequences {a_j ≥ 0} and {c_j ≥ 0} such that the following holds for any ρ > 0,

a_{j+1} \le (1-\Theta^2)\left[(1 + 1/\rho)\,a_j + (1+\rho)\,c_{j+1}\right],

then the sum of the terms a_j can be bounded as

\sum_{j=0}^{m} a_j \le \frac{t(1-s-t)}{s^2} \sum_{j=0}^{m} c_j.


Proof. The recursive bound on a_j implies that

a_j \le (1-\Theta^2)\left[(1+1/\rho)\,a_{j-1} + (1+\rho)\,c_j\right]
\le \sum_{k=0}^{j} \left[(1+1/\rho)(1-\Theta^2)\right]^{j-k}(1+\rho)(1-\Theta^2)\,c_k.

Summing both terms gives

\sum_{j=0}^{m} a_j \le \sum_{j=0}^{m}\sum_{k=0}^{j} \left[(1+1/\rho)(1-\Theta^2)\right]^{j-k}(1+\rho)(1-\Theta^2)\,c_k
= \sum_{k=0}^{m}\sum_{j=k}^{m} \left[(1+1/\rho)(1-\Theta^2)\right]^{j-k}(1+\rho)(1-\Theta^2)\,c_k
\le \sum_{k=0}^{m}\left(\sum_{j=0}^{\infty} \left[(1+1/\rho)(1-\Theta^2)\right]^{j}\right)(1+\rho)(1-\Theta^2)\,c_k
= \frac{(1+\rho)(1-\Theta^2)}{1-(1+1/\rho)(1-\Theta^2)}\sum_{k=0}^{m} c_k
= \frac{(1+\rho)(1-\Theta^2)}{\Theta^2 - (1-\Theta^2)/\rho}\sum_{k=0}^{m} c_k
= \frac{2(1+\rho)(1-\Theta^2)}{\Theta^2}\sum_{k=0}^{m} c_k
= \frac{2(2-\Theta^2)(1-\Theta^2)}{\Theta^4}\sum_{k=0}^{m} c_k,

where in the last two equalities we chose \rho = \frac{2(1-\Theta^2)}{\Theta^2}. Now recall that s \le \frac{\Theta^2}{4+\Theta^2} \in (0, 1) and that t = (1-s)/2:

\sum_{j=0}^{m} a_j \le \frac{2(2-\Theta^2)(1-\Theta^2)}{\Theta^4}\sum_{k=0}^{m} c_k
\le \frac{4}{\Theta^4}\sum_{k=0}^{m} c_k
= \left(\frac{4+\Theta^2}{\Theta^2} - 1\right)^2 \frac{1}{4}\sum_{k=0}^{m} c_k
\le \left(\frac{1}{s} - 1\right)^2 \frac{1}{4}\sum_{k=0}^{m} c_k
= \frac{(1-s)^2}{4s^2}\sum_{k=0}^{m} c_k
= \frac{t(1-s-t)}{s^2}\sum_{k=0}^{m} c_k.

Lemma B.4 and Lemma B.5 directly result in our major theorem:

Proof of Theorem 4.1. It follows from Lemma B.4 and Lemma B.5 that

V^M(f^*) \le V^{M-1}(f^*) + \delta_{M-1} \le V^0(f^*) + \sum_{j=0}^{M-1}\delta_j \le \frac{1}{2}\left\|f^0(X) - f^*(X)\right\|^2.


Notice that V^M(f^*) \ge \frac{\alpha_{M-1}}{\theta_{M-1}}\left(L(f^M) - L(f^*)\right), as the term \frac{1}{2}\left\|f^*(X) - \tilde{h}^M(X)\right\|^2 \ge 0, which implies that

L(f^M) - L(f^*) \le \frac{\theta_{M-1}}{2\alpha_{M-1}}\left\|f^0(X) - f^*(X)\right\|^2 = \frac{1}{2\gamma\eta}\cdot\frac{\left\|f^0(X) - f^*(X)\right\|^2}{M^2}.