CobBO: Coordinate Backoff Bayesian Optimization

Jian Tan * 1 Niv Nayman * 2 Mengchang Wang 3 Feifei Li 3 Rong Jin 4

Abstract

Bayesian optimization is a popular method for optimizing expensive black-box functions. The objective functions of hard real-world problems are oftentimes characterized by a fluctuated landscape with many local optima. Bayesian optimization risks over-exploiting such traps, leaving an insufficient query budget for exploring the global landscape. We introduce Coordinate Backoff Bayesian Optimization (CobBO) to alleviate those challenges. CobBO captures a smooth approximation of the global landscape by interpolating the values of queried points projected to randomly selected promising subspaces. A smaller query budget thus suffices for the Gaussian process regressions applied over the lower dimensional subspaces. This approach can be viewed as a variant of coordinate ascent, tailored for Bayesian optimization, using a stopping rule for backing off from a certain subspace and switching to another coordinate subset. Extensive evaluations show that CobBO finds solutions comparable to or better than other state-of-the-art methods for dimensions ranging from tens to hundreds, while reducing the trial complexity.

1. Introduction

Bayesian optimization (BO) has emerged as an effective zero-order paradigm for optimizing expensive black-box functions. The entire sequence of iterations relies only on the function values of the already queried points, without information on their derivatives. Though highly competitive in low dimensions (e.g., dimension D ≤ 20 (Frazier, 2018)), Bayesian optimization based on Gaussian process (GP) regression has obstacles that impede its effectiveness, especially in high dimensions.

*Equal contribution 1 Alibaba Group, Sunnyvale, California, USA 2 Alibaba Group, Tel Aviv, Israel 3 Alibaba Group, Hangzhou, Zhejiang, China 4 Alibaba Group, Seattle, Washington, USA. Correspondence to: Jian Tan <[email protected]>, Niv Nayman <[email protected]>.

An implementation of CobBO is available at: https://github.com/Alibaba-MIIL/CobBO

Figure 1. The average error of the GP regression at the query points (solid curves, the higher the better) and the best observed function value (dashed curves, the lower the better) at iteration t for the fluctuated Rastrigin function on [−5, 10]^50 with 20 initial samples. GP on the full space Ω is prone to getting trapped at local minima. CobBO starts by capturing the global landscape less accurately (exploration), and then explores selected subspaces Ω_t more accurately (exploitation). This eventually better exploits those promising subspaces.

Approximation accuracy: GP regression assumes a class of random smooth functions in a probability space as surrogates that iteratively yield posterior distributions by conditioning on the queried points. For complex functions with numerous local optima and saddle points due to local fluctuations, always using the exact values at the queried points as the conditional events could misrepresent the global function landscape when suggesting new query points.

Curse of dimensionality: As a sample efficient method, Bayesian optimization often suffers in high dimensions. Computing the Gaussian process posterior and optimizing the acquisition function incur large computational costs and statistically insufficient exploration in high dimensions (Djolonga et al., 2013; Wang et al., 2017).

Stagnation at local optima: It is known that Bayesian optimization could stagnate at local optima (Qin et al., 2017; Bull, 2011; Snoek et al., 2012). An effective exploration could encourage the search for other promising regions.

To alleviate these issues, we design Coordinate Backoff Bayesian Optimization (CobBO), by challenging a seemingly natural intuition stating that it is always better for Bayesian optimization to have a more accurate approximation of the objective function at all times. We demonstrate that this is not necessarily true, by showing that smoothing out local fluctuations to serve as the conditional events can help capture the large-scale properties of the objective function f(x) and thus find better solutions more efficiently. The design principle is to capture the landscape of f(x) through interpolation of points projected to coordinate subspaces, inducing designated Gaussian processes over those subspaces. For iteration t, instead of directly computing the Gaussian process posterior distribution

$$ f(x) \,\big|\, \mathcal{H}_t = \{(x_i, y_i)\}_{i=1}^{t}, \quad x \in \Omega, $$

by conditioning on the observations y_i = f(x_i) at queried points x_i in the full space Ω ⊂ R^D for i = 1, ..., t, we change the conditional events and consider

$$ f(x) \,\big|\, R\big(P_{\Omega_t}(x_1, \ldots, x_t), \mathcal{H}_t\big), \quad x \in \Omega_t, \; \Omega_t \subset \Omega, $$

for a projection function P_{Ω_t}(·) to a random subspace Ω_t and an interpolation function R(·, ·) using the radial basis function (RBF). The projection P_{Ω_t}(·) maps the queried points to virtual points on a subspace Ω_t of a lower dimension (Rahimi & Recht, 2008). The interpolation function R(·, ·) estimates the objective values at the virtual points using the queried points and their values as specified by H_t.

This method can be viewed as a variant of block coordinate ascent tailored to Bayesian optimization by applying backoff stopping rules for switching coordinate blocks. While similar work exists (Moriconi et al., 2020; Oliveira et al., 2018), CobBO tries to address the following three issues:

1. Selecting a block of coordinates for ascending requires determining the block size as well as the coordinates therein. The selected coordinates should sample more promising regions with higher probabilities.

2. A coordinate subspace requires a sufficient number of query points acting as the conditional events for the GP regression. Without enough points and corresponding values, the function landscape within a subspace cannot be well characterized (Bull, 2011).

3. Querying a certain subspace, under some trial budget, comes at the expense of exploring other coordinate blocks. Yet prematurely shifting to different subspaces does not exploit the full potential of a given subspace. Hence, determining the number of consecutive function queries within a subspace trades off exploration against exploitation.

To this end, we design CobBO. The coordinate subsets are selected by applying a multiplicative weights update method (Arora et al., 2012) to the preference probability associated with each coordinate. The trade-off between exploration and exploitation is balanced by the subspace selection and the switching, governed by the preference probability and the backoff stopping rule, respectively. In addition, differently from (Acerbi & Ma, 2017; Gonzalez et al., 2016; McLeod et al., 2018; Eriksson et al., 2019), CobBO dynamically forms trust regions on two time scales to further tune this trade-off.

Through comprehensive evaluations, CobBO demonstrates appealing performance for dimensions ranging from tens to hundreds. It obtains comparable or better solutions with fewer queries, in comparison with the state-of-the-art methods, for most of the problems tested in Section 4.2.

2. Related work

Given the large body of work on Bayesian optimization (Frazier, 2018; Brochu et al., 2010; Shahriari et al., 2016), we summarize the most relevant ones in four topics: high dimensionality, trust regions, hybrid methods and batch sampling.

High dimensionality: To apply Bayesian optimization in high dimensions, certain assumptions are often imposed on the latent structure. Typical assumptions include low dimensional structures and additive structures. Their advantages manifest on problems with a low dimension or a low effective dimension. However, these assumptions do not necessarily hold for non-separable functions with no redundant dimensions.

Low dimensional structure: The black-box function f is assumed to have a low effective dimension (Kushner, 1964; Tyagi & Cevher, 2014), e.g., f(x) = g(Φx) with some function g(·) and a d × D matrix Φ, d << D. A number of different methods have been developed, including random embedding (Wang et al., 2013; Djolonga et al., 2013; Wang et al., 2016; Li et al., 2016; Munteanu et al., 2019; Zhang et al., 2019; Binois et al., 2020), low-rank matrix recovery (Djolonga et al., 2013; Tyagi & Cevher, 2014), and learning subspaces by derivative information (Djolonga et al., 2013; Eriksson et al., 2018). In contrast to existing work on subspace selections, e.g., LineBO (Kirschner et al., 2019), Hashing-enhanced Subspace BO (HeSBO) (Munteanu et al., 2019) and DROPOUT (Li et al., 2017), CobBO exploits subspace structure from the perspective of block coordinate ascent, independent of the dimensions, different from some algorithms that are suitable only for low dimensions, e.g., BADS (Acerbi & Ma, 2017).

Additive structure: A decomposition assumption is often made, $f(x) = \sum_{i=1}^{k} f^{(i)}(x_i)$, with x_i defined over low-dimensional components. In this case, the effective dimensionality of the model is the largest dimension among all additive groups (Mutny & Krause, 2018), which is usually small. The Gaussian process is structured as an additive model (Gilboa et al., 2013; Kandasamy et al., 2015), e.g., projected-additive functions (Li et al., 2016), ensemble Bayesian optimization (EBO) (Wang et al., 2018a), latent additive structural kernel learning (HDBBO) (Wang et al., 2017) and group additive models (Kandasamy et al., 2015; Li et al., 2016). However, learning the unknown structure incurs a considerable computational cost (Munteanu et al., 2019), and is not applicable for non-separable functions, for which CobBO can still be applied.

Kernel methods: Various kernels have been used for resolving the difficulties in high dimensions, e.g., a hierarchical Gaussian process model (Chen et al., 2019), a cylindrical kernel (Oh et al., 2018) and a compositional kernel (Duvenaud et al., 2013). CobBO can be integrated with other sophisticated methods (Snoek et al., 2012; Duvenaud et al., 2013; Oh et al., 2018; Chen et al., 2019; Mockus, 1975; Jones et al., 1998; Srinivas et al., 2010; Frazier et al., 2008; Scott et al., 2011; Wang & Jegelka, 2017), e.g., ATPE/TPE (Bergstra et al., 2011; ElectricBrain, 2018) and SMAC (Hutter et al., 2011).

Trust regions and space partitions: Trust region BO has proven effective for high-dimensional problems. A typical pattern is to alternate between global and local search regions. In the local trust regions, many efficient methods have been applied, e.g., local Gaussian models (TuRBO (Eriksson et al., 2019)), adaptive search on a mesh grid (BADS (Acerbi & Ma, 2017)) or quasi-Newton local optimization (BLOSSOM (McLeod et al., 2018)). TuRBO (Eriksson et al., 2019) uses Thompson sampling to allocate samples across multiple regions. CobBO dynamically forms a variable-size trust region around the optimum of the already queried points, which can be switched to a different region to escape stagnant local optima. A related approach uses space partitions, e.g., LA-MCTS (Wang et al., 2020), a Monte Carlo tree search algorithm that learns efficient partitions.

Hybrid methods: Combining BO with other techniques yields hybrid approaches. Bayesian adaptive direct search (BADS) (Acerbi & Ma, 2017) alternates between local BO and grid search, which only fits low dimensions, as commented in (Acerbi & Ma, 2017). Gradients can be used with BO, e.g., the derivative-enabled knowledge gradient (d-KG) (Wu et al., 2017). EGO-CMA (Mohammadi et al., 2015) combines EGO with CMA-ES (Hansen & Ostermeier, 2001). In this regard, CobBO can be viewed as a combination of block coordinate ascent and Bayesian optimization.

Batch sampling: Leveraging parallel computation for BO requires a batch of queries at each iteration. Popular methods include, e.g., Batch Upper Confidence Bound (BUCB) (Desautels et al., 2014), combining UCB with Pure Exploration by entropy reduction (UCB-PE) (Contal et al., 2013), local penalization (Gonzalez et al., 2016), determinantal point processes (Kathuria et al., 2016), Monte-Carlo simulation (Azimi et al., 2010), sampling according to a reward function (Desautels et al., 2014) and using the reparameterization trick for acquisition functions (Wilson et al., 2017). These methods can be combined with CobBO. In addition, CobBO can be parallelized in a batch mode by sampling multiple different subspaces simultaneously.

3. Algorithm

Without loss of generality, suppose that the goal is to solve a maximization problem x* = argmax_{x ∈ Ω} f(x) for a black-box function f : Ω → R. The domain is normalized to Ω = [0, 1]^D with the coordinates indexed by I = {1, 2, ..., D}.

For a sequence of points X_t = {x_1, x_2, ..., x_t}, with t indexing the most recent iteration, we observe H_t = {(x_i, y_i = f(x_i))}_{i=1}^{t}. A random subset C_t ⊆ I of the coordinates is selected, forming a subspace Ω_t ⊆ Ω at iteration t. As a variant of coordinate ascent, the subspace Ω_t contains a pivot point V_t, which presumably is the maximum point x_{M_t} = argmax_{x ∈ X_t} f(x) with M_t = f(x_{M_t}).

CobBO may set V_t different from x_{M_t} to escape local optima. Then, BO is conducted within Ω_t while fixing all the other coordinates C_t^c = I \ C_t, i.e., the complement of C_t. For BO in Ω_t, we use Gaussian processes as the random surrogates f = f_{Ω_t}(x) to describe the Bayesian statistics of f(x) for x ∈ Ω_t. At each iteration, the next query point is generated by solving

$$ x_{t+1} = \operatorname*{argmax}_{x \in \Omega_t,\, V_t \in \Omega_t} \; Q_{f_{\Omega_t}(x) \sim p(f \mid \mathcal{H}_t)}(x \mid \mathcal{H}_t), $$

where the acquisition function Q(x | H_t) incorporates the posterior distribution p(f | H_t) of the Gaussian processes. Typical acquisition functions include the expected improvement (EI) (Mockus, 1975; Jones et al., 1998), the upper confidence bound (UCB) (Auer, 2003; Srinivas et al., 2010; Srinivas et al., 2012), entropy search (Hennig & Schuler, 2012; Hernandez-Lobato et al., 2014; Wang & Jegelka, 2017), and the knowledge gradient (Frazier et al., 2008; Scott et al., 2011; Wu & Frazier, 2016).
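For concreteness, the following is a minimal sketch of this per-iteration step, fitting a GP with a Matern 5/2 kernel in the selected low-dimensional subspace and maximizing the expected improvement over random candidates. The function name, the candidate-sampling strategy and the scikit-learn-based GP are illustrative assumptions, not CobBO's actual implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def suggest_next_point(X_sub, y_sub, best_y, bounds, n_candidates=2000, rng=None):
    """Fit a GP in the low-dimensional subspace and pick the candidate with the
    largest expected improvement (maximization convention)."""
    rng = np.random.default_rng() if rng is None else rng
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_sub, y_sub)
    # Random candidates within the subspace bounds; a gradient-based or multi-start
    # optimizer could be substituted here.
    cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_candidates, bounds.shape[0]))
    mu, sigma = gp.predict(cand, return_std=True)
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best_y) / sigma
    ei = (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)
    return cand[np.argmax(ei)]
```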

Instead of directly computing the posterior distribution p(f | H_t), we replace the conditional events H_t by

$$ \tilde{\mathcal{H}}_t := R\big(P_{\Omega_t}(X_t), \mathcal{H}_t\big) = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{\tilde{t}} $$

with an interpolation function R(·, ·) and a projection function P_{Ω_t}(·),

$$ P_{\Omega_t}(x)^{(j)} = \begin{cases} x^{(j)} & \text{if } j \in C_t \\ V_t^{(j)} & \text{if } j \notin C_t \end{cases} \tag{1} $$

at coordinate j. It simply keeps the values of x whose corresponding coordinates are in C_t and replaces the rest by the corresponding values of V_t, as illustrated in Fig. 2.

Applying P_{Ω_t}(·) on X_t and discarding duplicates generates a new set of distinct virtual points X̃_t = {x̃_1, x̃_2, ..., x̃_t̃}, with x̃_i ∈ Ω_t for all 1 ≤ i ≤ t̃ ≤ t. The function values at x̃_i ∈ X̃_t are interpolated as ỹ_i = R(x̃_i, H_t) using the standard radial basis function (Buhmann, 2003; Buhmann & Buhmann, 2003) and the observed points in H_t.
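A minimal sketch of this projection-and-interpolation step is given below, using NumPy and SciPy's RBFInterpolator. The helper names are hypothetical and the RBF settings are left at their defaults; CobBO's implementation may configure them differently.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def project_to_subspace(X, pivot, coords):
    """Eq. (1): keep the coordinates in `coords`, fix the rest to the pivot's values."""
    X_proj = np.tile(np.asarray(pivot, dtype=float), (len(X), 1))
    X_proj[:, coords] = np.asarray(X)[:, coords]
    return X_proj

def virtual_points_and_values(X, y, pivot, coords):
    """Project the queried points, discard duplicates, and smooth their values
    by RBF interpolation over the original observations."""
    X_proj = project_to_subspace(X, pivot, coords)
    X_virtual = np.unique(X_proj, axis=0)            # distinct virtual points in the subspace
    rbf = RBFInterpolator(np.asarray(X, dtype=float), np.asarray(y, dtype=float))
    y_virtual = rbf(X_virtual)                       # interpolated (smoothed) values
    return X_virtual[:, coords], y_virtual           # restrict to the selected coordinates
```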


Algorithm 1 CobBO(f, τ, T)

  H_τ ← sample τ initial points and evaluate their values
  V_τ, M_τ ← the tuple with the maximal objective value in H_τ
  q_τ ← 0                                   [Initialize the number of consecutive failed queries]
  π_τ ← a uniform preference distribution over the coordinates
  for t ← τ to T do
      if Ω_{t−1} is switched by the backoff stopping rule (Section 3.3) then
          if q_t > Θ then
              V_t, M_t ← Escape local maxima (Section 3.1); q_t ← 0
          K_t ← Update the virtual clock (Eq. 3)
          Ω_t ← Form trust regions on two different time scales (Section 3.4)
          C_t ← Sample a promising coordinate block according to π_t (Section 3.2)
          Ω_t ← Take the subspace of Ω_t over the coordinate block C_t, such that V_t ∈ Ω_t
      else
          Ω_t ← Ω_{t−1}
      X̃_t ← P_{Ω_t}(X_t)                      [Project X_t onto Ω_t to obtain a set of virtual points (Eq. 1)]
      H̃_t ← R(X̃_t, H_t)                       [Smooth the function values on X̃_t by interpolation using H_t]
      p[f_{Ω_t}(x) | H̃_t] ← Compute the posterior distribution of the Gaussian process in Ω_t conditional on H̃_t
      x_{t+1} ← argmax_{x ∈ Ω_t} Q_{f ∼ p(f | H̃_t)}(x | H̃_t)     [Suggest the next query point in Ω_t (Section 3)]
      y_{t+1} ← f(x_{t+1})                     [Evaluate the black-box function]
      if y_{t+1} > M_t then
          V_{t+1} ← x_{t+1}, M_{t+1} ← y_{t+1}, q_{t+1} ← 0
      else
          V_{t+1} ← V_t, M_{t+1} ← M_t, q_{t+1} ← q_t + 1
      π_{t+1} ← Update π_t by the multiplicative weights update method (Eq. 2)
      H_{t+1} ← H_t ∪ {(x_{t+1}, y_{t+1})}, X_{t+1} ← X_t ∪ {x_{t+1}}
  end

Figure 2. Subspace projection and function value interpolation

Together, a two-step composite random kernel is formed in the subspace Ω_t: the first step is the random projection in conjunction with the smooth interpolation, and the second step is the Gaussian process regression with the chosen kernel, e.g., an Automatic Relevance Determination (ARD) Matern 5/2 kernel (Matern kernel). These two decoupled steps not only significantly reduce the computation time, due to the efficiency of RBF interpolation (Buhmann, 2003) and of the regression in low dimensions (Djolonga et al., 2013), but also eventually improve the accuracy of the GP regression in Ω_t due to better estimates of the length scales (scikit-learn) of the kernel functions in lower dimensions.

Figure 1 plots the average of the GP regression errors, |f_τ(x_{τ+1}) − f(x_{τ+1})|, τ < t, between the GP predictions and the true function values at the queried points at iteration t, for Rastrigin over [−5, 10]^50.

Figure 3. Comparison of the execution times of vanilla BO and CobBO

After querying sufficient observations, the accuracy of CobBO eventually outperforms that of vanilla BO in Ω_t. Moreover, smoothing out the local fluctuations shows benefits in capturing the global landscape. This eventually guides CobBO to promising local subspaces to explore more accurately and exploit. Figure 3 measures the computation time for training the GP regression model and maximizing the acquisition function at each iteration. CobBO is more efficient, e.g., 13× faster in this case.

Note that only a fraction of the points, those in X_t ∩ X̃_t, directly observe the exact function values. The function values of the remaining points in X̃_t \ X_t are estimated by interpolation, which captures the landscape of f(x) by smoothing out the local fluctuations. To control the trade-off between the inaccurate estimations and the exact observations in Ω_t, we design a stopping rule that optimizes the number of consecutive queries in Ω_t. Clearly, the more consecutive queries are conducted in a given subspace, the more accurate observations can be obtained, albeit at the expense of a smaller remaining budget for exploring other regions.

The key features of CobBO are listed in Algorithm 1. Further elaborations appear in the following sections.

3.1. Escaping trapped local maxima

CobBO can be viewed as a variant of block coordinate ascent. Each subspace Ω_t contains a pivot point V_t. If the fixed coordinates' values are chosen incorrectly, one is condemned to move within a suboptimal subspace. Since those values are determined by V_t, the pivot has to be changed in the face of many consecutive failures to improve over M_t, in order to escape such a trapped local maximum. We do that by decreasing the observed function value at V_t and setting V_{t+1} to a selected sub-optimal random point in X_t. Specifically, we randomly sample a few points (e.g., 5) in X_t with values above the median and pick the one furthest away from V_t.
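A minimal sketch of this pivot switch is shown below; the function name and the use of NumPy are assumptions, and the number of sampled candidates follows the example value of 5 mentioned above.

```python
import numpy as np

def escape_pivot(X, y, pivot, n_candidates=5, rng=None):
    """Sample a few queried points with above-median values and return the one
    furthest away from the current pivot, to be used as the next pivot."""
    rng = np.random.default_rng() if rng is None else rng
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    good = np.where(y > np.median(y))[0]
    picks = rng.choice(good, size=min(n_candidates, len(good)), replace=False)
    dists = np.linalg.norm(X[picks] - np.asarray(pivot, dtype=float), axis=1)
    return X[picks[np.argmax(dists)]]
```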

3.2. Block coordinate ascent and subspace selection

For Bayesian optimization, consider an infeasible assumption that each iteration can exactly maximize the function f(x) in Ω_t. This is not achievable in a single iteration, but only by consistently querying in Ω_t, since the queried points converge to the maximum, e.g., under the expected improvement acquisition function with fixed priors (Vazquez & Bect, 2010), and the convergence rate can be characterized for smooth functions in the reproducing kernel Hilbert space (Bull, 2011). However, even under this infeasible assumption, it is known that coordinate ascent with fixed blocks can stagnate at a non-critical point, e.g., for non-differentiable (Warga, 1963) or non-convex functions (Powell, 1972). This motivates us to select a subspace with a variable-size coordinate block C_t for each query. A good coordinate block can help the iterations escape trapped non-critical points. For example, one condition can be based on the result in (Grippo & Sciandrone, 2000), which assumes f(x) to be differentiable and strictly quasi-convex over a collection of blocks. In practice, we do not restrict ourselves to these assumptions.

In order to balance between exploitation and exploration, we alternate between two different approaches in selecting C_t. For the first approach, which emphasizes exploitation, we estimate the top performing coordinate directions. A similar method is used in (Mania et al., 2018). We select C_t to be the coordinates with the largest absolute gradient values of the RBF regression on the whole space Ω at the point V_t.

For the second approach, which favors exploration, we induce a preference distribution π_t over the coordinate set I, and sample a variable-size coordinate block C_t accordingly. This distribution is updated at iteration t through a multiplicative weights update method (Arora et al., 2012). Specifically, the values of π_t at coordinates in C_t increase in the face of an improvement or decrease otherwise, according to the multiplicative ratios α and β, respectively,

$$ \pi_{t,j} \propto \pi_{t-1,j} \cdot \begin{cases} \alpha & \text{if } j \in C_t \text{ and } y_t > M_{t-1} \\ 1/\beta & \text{if } j \in C_t \text{ and } y_t \le M_{t-1} \\ 1 & \text{if } j \notin C_t \end{cases} \tag{2} $$

Then, π_t is normalized. This update characterizes how likely a coordinate block is to generate a promising search subspace. The multiplicative ratio α is chosen to be relatively large, e.g., α = 2.0, and β relatively small, e.g., β = 1.1, since queries that improve the best observation (y_t > M_{t-1}) happen more rarely than the opposite (y_t ≤ M_{t-1}).
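A minimal sketch of this update, assuming the preference distribution is stored as a NumPy array; the function name is illustrative.

```python
import numpy as np

def update_preference(pi, selected, improved, alpha=2.0, beta=1.1):
    """Multiplicative weights update of the coordinate preference distribution (Eq. 2)."""
    pi = np.asarray(pi, dtype=float).copy()
    pi[list(selected)] *= alpha if improved else 1.0 / beta
    return pi / pi.sum()   # renormalize into a probability distribution
```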

How do we dynamically select the size |C_t|? It is known that Bayesian optimization works well in low dimensions (Frazier, 2018). Thus, we specify an upper bound on the dimension of the subspace (e.g., |C_t| ≤ 30). In principle, |C_t| can be any random number in this finite set. This is different from methods that partition the coordinates into fixed blocks and select one according to, e.g., cyclic order (Wright, 2015), random sampling or the Gauss-Southwell rule (Nutini et al., 2015).

The above method works well in low dimensions, where |C_t|/D is relatively large, as shown in Section 4.2.1. However, in high dimensions |C_t|/D could be small. In this case, we additionally encourage a cyclic order for exploration. With a certain probability p (e.g., p = 0.3), we select the |C_t| coordinates whose π_t values are the largest, and with probability 1 − p, we randomly sample a coordinate subset according to the distribution π_t without replacement. Picking the coordinates with the largest values approximately implements a cyclic order, because the weights update (Eq. 2) induces probability oscillations: since improvements tend to be less common than failures, the weights of the selected coordinates tend to decrease, and the probability of choosing the unselected coordinates increases in turn.
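A minimal sketch of this block selection, assuming π_t is a normalized NumPy array; the function name and the argument defaults are illustrative.

```python
import numpy as np

def sample_coordinate_block(pi, block_size, p_top=0.3, rng=None):
    """With probability p_top take the coordinates with the largest preference values
    (approximating a cyclic order); otherwise sample without replacement according to pi."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.asarray(pi, dtype=float)
    if rng.random() < p_top:
        return np.argsort(pi)[-block_size:]
    return rng.choice(len(pi), size=block_size, replace=False, p=pi / pi.sum())
```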

3.3. Backoff stopping rule for consecutive queries

Applying BO on Ω_t requires a strategy to determine the number of consecutive queries needed to make sufficient progress. This strategy is based on previous observations, thus forming a stopping rule. In principle, there are two different scenarios, exemplifying exploration and exploitation, respectively. Persistently querying a given subspace refrains from opportunistically exploring other coordinate combinations. Abruptly shifting to different subspaces does not fully exploit the potential of a given subspace.


Figure 4. Ablation study using Rastrigin on [−5, 10]^50 with 20 initial random samples

CobBO designs a heuristic stopping rule as a compromise. It takes the above two scenarios into joint consideration, by considering not only the number of consecutive queries that fail to improve the objective function but also other factors, including the improvement M_t − M_{t−1}, the point distance ||x_t − x_{t−1}||, the query budget T and the problem dimension D. On the one hand, switching to another subspace Ω_{t+1} (≠ Ω_t) prematurely, without fully exploiting Ω_t, incurs an additional approximation error associated with the interpolation of observations in Ω_t projected to Ω_{t+1}. On the other hand, it is also possible to over-exploit a subspace, spending a high query budget on marginal improvements around local optima. In order to mitigate this, even when a query leads to an improvement, the other factors are considered for sampling a new subspace.
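The exact rule is a heuristic combination of these factors; the sketch below is only an illustrative stand-in showing how such a backoff check might combine them, and does not reproduce CobBO's actual thresholds or logic.

```python
def should_back_off(n_fails, improvement, step_dist, budget_left, dim, base_patience=8):
    """Illustrative backoff check (not CobBO's exact rule): leave the current subspace
    after enough consecutive failures, staying longer while queries still make large
    improvements or large moves, and switching sooner when the remaining budget is
    small relative to the problem dimension."""
    patience = base_patience
    if improvement > 0 or step_dist > 0.1:   # recent progress: keep exploiting this subspace
        patience += 4
    if budget_left < 5 * dim:                # tight budget: favor switching earlier
        patience = max(2, patience - 4)
    return n_fails >= patience
```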

3.4. Forming trust regions on two time scales

Trust regions have been shown to be effective in Bayesian optimization (Eriksson et al., 2019; Acerbi & Ma, 2017; Gonzalez et al., 2016; McLeod et al., 2018). They are formed by shrinking the domain, e.g., by centering at V_t and halving the domain in each coordinate. CobBO forms coarse and fine trust regions on slow and fast time scales, respectively, and alternates between them. This brings yet another trade-off between exploration and exploitation. Since sampled points tend to reside near the boundaries in high dimensions (Oh et al., 2018), inducing trust regions encourages sampling densely in the interior. However, aggressively shrinking those trust regions too fast around V_t can lead to over-exploitation, getting trapped in a local optimum. Hence, we alternate between two trust regions following different time scales, with the fast ones formed inside the slow ones. While the former allow fast exploitation of local optima, the latter avoid getting trapped in them.
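A minimal sketch of one such shrinking step, assuming the domain is represented by per-coordinate [low, high] bounds; the function name and shrink factor argument are illustrative (the text above describes halving each coordinate range around V_t).

```python
import numpy as np

def shrink_trust_region(bounds, pivot, factor=0.5):
    """Center a trust region at the pivot and shrink each coordinate range by `factor`,
    clipping it to stay inside the original domain."""
    bounds = np.asarray(bounds, dtype=float)     # shape (D, 2): [low, high] per coordinate
    pivot = np.asarray(pivot, dtype=float)
    half = factor * (bounds[:, 1] - bounds[:, 0]) / 2.0
    lo = np.clip(pivot - half, bounds[:, 0], bounds[:, 1])
    hi = np.clip(pivot + half, bounds[:, 0], bounds[:, 1])
    return np.stack([lo, hi], axis=1)
```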

The refinements of the trust regions are triggered when a virtual clock K_t, characterizing the Bayesian optimization progress, reaches certain thresholds. Specifically,

$$ K_{t+1} = \begin{cases} K_t + 1 & \text{if } \Delta_t \le 0 \\ \gamma_t(\Delta_t, x_t, x_{t-1}) \cdot K_t & \text{if } 0 < \Delta_t \le \delta \\ 0 & \text{if } \Delta_t > \delta \end{cases} \tag{3} $$

where Δ_t = (M_t − M_{t−1}) / |M_{t−1}| is the relative improvement and, for example,

$$ \gamma_t(\Delta, x_t, x_{t-1}) = \left(1 - \frac{\Delta}{\delta}\right)\left(1 - \frac{\|x_t - x_{t-1}\|}{\sqrt{|C_t|}}\right). $$

Starting from the full domain Ω, on the slow time scale, every time K_t reaches a threshold κ_S (e.g., κ_S = 30), a coarse trust region Ω_S is formed, followed by setting K_{t+1} = 0. Within the coarse trust region, on the fast time scale, when the number of consecutive fails exceeds a threshold κ_F < κ_S (e.g., κ_F = 6), a fine trust region is formed. Upon an improvement, both trust regions revert to the previous refinement of the coarse one.
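The virtual clock of Eq. (3) can be implemented directly; the sketch below assumes a maximization problem and the default δ = 0.1 from the supplementary configuration, with an illustrative function name.

```python
import numpy as np

def update_virtual_clock(K, M_prev, M_curr, x_prev, x_curr, n_coords, delta=0.1):
    """Virtual clock update (Eq. 3): count failures, reset on large relative improvements,
    and otherwise discount the clock by gamma_t."""
    rel_improve = (M_curr - M_prev) / abs(M_prev)
    if rel_improve <= 0:
        return K + 1
    if rel_improve > delta:
        return 0
    gamma = (1.0 - rel_improve / delta) * \
            (1.0 - np.linalg.norm(np.asarray(x_curr) - np.asarray(x_prev)) / np.sqrt(n_coords))
    return gamma * K
```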

4. Numerical Experiments

This section presents detailed ablation studies of CobBO and comparisons with other algorithms.

4.1. Empirical analysis and ablation study

Ablation studies are designed to assess the contributions of the key components in Algorithm 1, by experimenting with the Rastrigin function on [−5, 10]^50 with 20 initial points.


Figure 5. Performance over low dimensional problems: Ackley (left), Levy (middle) and Rastrigin (right)

Backoff stopping rule: CobBO applies a stopping rule to query a variable number of points in the subspace Ω_t (Section 3.3). To validate its effectiveness, we compare it with schemes that use a fixed budget of queries for Ω_t. Figure 4 (a) shows that the stopping rule yields superior results.

Coordinate blocks of a varying size: CobBO selects a block of coordinates C_t of a varying size (Section 3.2). Figure 4 (b) shows that a varying size is better than a fixed one.

Forming trust regions on two different time scales: CobBO alternates between coarse and fine trust regions on slow and fast time scales, respectively (Section 3.4). Figure 4 (c) compares CobBO with two other schemes: without any trust regions, and forming only coarse trust regions. Using two time scales yields better results.

Escaping trapped optima: Figure 4 (d) shows that the way CobBO escapes local optima (Section 3.1), by decreasing M_{t−1} and setting V_t to a selected random point, is beneficial.

Preference probability over coordinates: To demonstrate the effectiveness of the coordinate selection (Section 3.2), we artificially let the function value depend only on the first 25 coordinates of its input and ignore the rest. This forms two separate sets of active and inactive coordinates, respectively. We expect CobBO to refrain from selecting inactive coordinates. Figure 4 (e) shows the entropy of the preference probability π_t over the coordinates and the overall probability of picking active and inactive coordinates at each iteration. We see that the entropy decreases, as the preference distribution concentrates on the significant active coordinates.

RBF interpolation: RBF interpolation is computationally efficient, which is especially beneficial in high dimensions. This is demonstrated in Figure 3, where the computation time of plain Bayesian optimization is compared to CobBO's. While the former applies GP regression with the Matern kernel directly in the high dimensional space, the latter applies RBF interpolation in the high dimensional space and GP regression with the Matern kernel in the low dimensional subspace. This two-step composite kernel leads to a significant speed-up. Other time efficient alternatives are, e.g., inverse distance weighting (IDW) and the simple approach of assigning the value of the nearest observed neighbour. Figure 4 (f) shows that RBF is the most favorable.

4.2. Comparisons with other methods

The default configuration of CobBO is specified in the supplementary materials. CobBO performs on par with or outperforms a collection of state-of-the-art methods across the following experiments. Most of the experiments are conducted using the same settings as in TuRBO (Eriksson et al., 2019), where it is compared with a comprehensive list of baselines, including BFGS, BOCK (Oh et al., 2018), BOHAMIANN, CMA-ES (Hansen & Ostermeier, 2001), BOBYQA, EBO (Wang et al., 2018a), GP-TS, HeSBO (Munteanu et al., 2019), Nelder-Mead and random search. To avoid repetition, we only show TuRBO and CMA-ES, which achieve the best performance among this list, and additionally compare CobBO with BADS (Acerbi & Ma, 2017), REMBO (Wang et al., 2016), Differential Evolution (Diff-Evo) (Storn & Price, 1997), the Tree Parzen Estimator (TPE) (Bergstra et al., 2011) and Adaptive TPE (ATPE) (ElectricBrain, 2018).

4.2.1. LOW DIMENSIONAL TESTS

To evaluate CobBO on low dimensional problems, we use the classic synthetic black-box functions (Surjanovic & Bingham, 2013), as well as two more challenging problems of lunar landing (LunarLander v2; Eriksson et al., 2019) and robot pushing (Wang et al., 2018b), following the setup in (Eriksson et al., 2019). Confidence intervals (95%) over 30 independent experiments for each problem are shown.

Classic synthetic black-box functions (minimization): Three synthetic 10 dimensional functions are chosen, including Ackley over [−5, 10]^10, Levy over [−5, 10]^10 and Rastrigin over [−3, 4]^10. TuRBO is configured the same as in (Eriksson et al., 2019), with a batch size of 10 and 5 concurrent trust regions, each with 10 initial points. The other algorithms use 20 initial points. The results are shown in Fig. 5. CobBO shows competitive or better performance. It finds the best optima on Ackley and Levy among all the algorithms and outperforms the others on the difficult Rastrigin function. Notably, BADS is more suitable for low dimensions, as commented in (Acerbi & Ma, 2017). Its performance is close to CobBO's except for Rastrigin.

Lunar landing (maximization): This controller learning problem (12 dimensions) is provided by the OpenAI gym (LunarLander v2) and evaluated in (Eriksson et al., 2019).


Figure 6. Performance over low (left), medium (middle) and high (right) dimensional problems

Each algorithm has 50 initial points and a budget of 1,500 trials. TuRBO is configured with 5 trust regions and a batch size of 50, as in (Eriksson et al., 2019). Fig. 6 (upper left) shows that, among the 30 independent tests, CobBO quickly exceeds 300 along some good sample paths.

Robot pushing (maximization): This control problem (14 dimensions) is introduced in (Wang et al., 2018b) and extensively tested in (Eriksson et al., 2019). We follow the setting in (Eriksson et al., 2019), where TuRBO is configured with a batch size of 50 and 15 trust regions, each of which has 30 initial points. Each experiment has a budget of 10,000 evaluations. On average CobBO exceeds 10 within 5,500 trials, as shown in Fig. 6 (lower left).

4.2.2. HIGH DIMENSIONAL TESTS

Since the duration of each experiment in this section is long, confidence intervals (95%) over 10 independent repetitions of each problem are presented.

Additive latent structure (minimization): As mentioned in Section 2, additive latent structures have been exploited in high dimensions. We construct an additive function of 56 dimensions, defined as f_56(x) = Ackley(x_1) + Levy(x_2) + Rastrigin(x_3) + Hartmann(x_4) + Rosenbrock(x_5) + Schwefel(x_6), where the first three terms use the exact functions and domains described in Section 4.2.1, the Hartmann function is on [0, 1]^6, and the Rosenbrock and Schwefel functions are on [−5, 10]^10 and [−500, 500]^10, respectively.
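A minimal sketch of how such a benchmark can be assembled is shown below. It assumes user-provided implementations of the standard test functions (ackley, levy, rastrigin, hartmann6, rosenbrock, schwefel), which are not defined here, and it simply concatenates the six blocks in order.

```python
import numpy as np
# Assumes implementations of the standard test functions are available,
# e.g., ackley, levy, rastrigin, hartmann6, rosenbrock, schwefel (hypothetical names).

def f56(x):
    """56-D additive benchmark: three 10-D blocks, one 6-D Hartmann block,
    then 10-D Rosenbrock and Schwefel blocks."""
    x = np.asarray(x, dtype=float)
    return (ackley(x[0:10]) + levy(x[10:20]) + rastrigin(x[20:30])
            + hartmann6(x[30:36]) + rosenbrock(x[36:46]) + schwefel(x[46:56]))
```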

We compare CobBO with TPE, BADS, CMA-ES and TuRBO, each with 100 initial points. Specifically, TuRBO is configured with 15 trust regions and a batch size of 100. ATPE is excluded as it takes more than 24 hours per run to finish. The results are shown in Fig. 6 (upper middle), where CobBO quickly finds the best solution among the algorithms tested.

Rover trajectory planning (maximization): This problem (60 dimensions) is introduced in (Wang et al., 2018b). The objective is to find a collision-avoiding trajectory given by a sequence of 30 positions in a 2-D plane. We compare CobBO with TuRBO, TPE and CMA-ES, each with a budget of 20,000 evaluations and 200 initial points. TuRBO is configured with 15 trust regions and a batch size of 100, as in (Eriksson et al., 2019). ATPE, BADS and REMBO are excluded for this problem and the following ones, as they all take more than 24 hours per run. Fig. 6 (lower middle) shows that CobBO performs well.

The 100-dimensional Levy and Rastrigin functions (minimization): We minimize the Levy and Rastrigin functions on [−5, 10]^100 with 500 initial points. TuRBO is configured with 15 trust regions and a batch size of 100. As commented in (Eriksson et al., 2019), these two problems are challenging and have no redundant dimensions. Fig. 6 (right) shows that CobBO can greatly reduce the trial complexity. For Levy, it finds solutions close to the final one within 1,000 trials, and eventually reaches the best solution among all the algorithms tested. For Rastrigin, within 2,000 trials CobBO and TuRBO surpass the final solutions of all the other methods, eventually by a large margin.

5. Conclusion

CobBO is a variant of coordinate ascent tailored for Bayesian optimization. The number of consecutive queries within each selected coordinate subspace is determined by a backoff stopping rule. Combining projection onto random subspaces, function value interpolation and GP regression, we provide a practical Bayesian optimization method. Empirically, CobBO consistently finds comparable or better solutions with reduced trial complexity in comparison with state-of-the-art methods across a variety of benchmarks.


References

Acerbi, L. and Ma, W. J. Practical bayesian optimization for model fitting with bayesian adaptive direct search. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 1834–1844, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.

Arora, S., Hazan, E., and Kale, S. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(6):121–164, 2012.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397–422, March 2003. ISSN 1532-4435.

Azimi, J., Fern, A., and Fern, X. Z. Batch bayesian optimization via simulation matching. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1, NIPS'10, pp. 109–117, Red Hook, NY, USA, 2010. Curran Associates Inc.

Bergstra, J. S., Bardenet, R., Bengio, Y., and Kegl, B. Algorithms for hyper-parameter optimization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 2546–2554. Curran Associates, Inc., 2011.

Binois, M., Ginsbourger, D., and Roustant, O. On the choice of the low-dimensional domain for global optimization via random embeddings. Journal of Global Optimization, 76(1):69–90, January 2020.

Brochu, E., Cora, V. M., and de Freitas, N. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. ArXiv, abs/1012.2599, 2010.

Buhmann, M. D. Radial Basis Functions: Theory and Implementations. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2003. doi: 10.1017/CBO9780511543241.

Buhmann, M. D. and Buhmann, M. D. Radial Basis Functions. Cambridge University Press, USA, 2003. ISBN 0521633389.

Bull, A. D. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(88):2879–2904, 2011.

Chen, Z., Mak, S., and Wu, C. F. J. A hierarchical expected improvement method for bayesian optimization, 2019.

Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. Parallel gaussian process optimization with upper confidence bound and pure exploration. In Proceedings of the 2013 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ECMLPKDD'13, pp. 225–240, Berlin, Heidelberg, 2013. Springer-Verlag. ISBN 9783642409875.

Desautels, T., Krause, A., and Burdick, J. W. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. Journal of Machine Learning Research, 15(119):4053–4103, 2014.

Djolonga, J., Krause, A., and Cevher, V. High-dimensional gaussian process bandits. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 26, pp. 1025–1033. Curran Associates, Inc., 2013.

Duvenaud, D., Lloyd, J., Grosse, R., Tenenbaum, J., and Zoubin, G. Structure discovery in nonparametric regression through compositional kernel search. In International Conference on Machine Learning, pp. 1166–1174, 2013.

ElectricBrain. Blog: Learning to optimize, 2018. URL https://www.electricbrain.io/post/learning-to-optimize.

Eriksson, D., Dong, K., Lee, E., Bindel, D., and Wilson, A. G. Scaling gaussian process regression with derivatives. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 6867–6877. Curran Associates, Inc., 2018.

Eriksson, D., Pearce, M., Gardner, J., Turner, R. D., and Poloczek, M. Scalable global optimization via local bayesian optimization. In Advances in Neural Information Processing Systems 32, pp. 5496–5507. Curran Associates, Inc., 2019.

Frazier, P. I. A tutorial on bayesian optimization, 2018.

Frazier, P. I., Powell, W. B., and Dayanik, S. A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim., 47(5):2410–2439, September 2008. ISSN 0363-0129.

Gilboa, E., Saatci, Y., and Cunningham, J. P. Scaling multidimensional Gaussian processes using projected additive approximations. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, pp. I–454–I–461. JMLR.org, 2013.

Gonzalez, J., Dai, Z., Hennig, P., and Lawrence, N. Batch bayesian optimization via local penalization. In Gretton, A. and Robert, C. C. (eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pp. 648–657, Cadiz, Spain, 09–11 May 2016. PMLR.

Grippo, L. and Sciandrone, M. On the convergence of the block nonlinear gauss-seidel method under convex constraints. Operations Research Letters, 26(3):127–136, 2000.

Hansen, N. and Ostermeier, A. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, June 2001. ISSN 1063-6560.

Hennig, P. and Schuler, C. J. Entropy search for information-efficient global optimization. J. Mach. Learn. Res., 13(1):1809–1837, June 2012. ISSN 1532-4435.

Hernandez-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pp. 918–926, Cambridge, MA, USA, 2014. MIT Press.

Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In Proc. of LION-5, pp. 507–523, 2011.

IDW. https://en.wikipedia.org/wiki/Inverse_distance_weighting.

Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

Kandasamy, K., Schneider, J., and Poczos, B. High dimensional bayesian optimization and bandits via additive models. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pp. 295–304. JMLR.org, 2015.

Kathuria, T., Deshpande, A., and Kohli, P. Batched gaussian process bandit optimization via determinantal point processes. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pp. 4213–4221, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.

Kirschner, J., Mutny, M., Hiller, N., Ischebeck, R., and Krause, A. Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3429–3438, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

Kushner, H. J. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, March 1964.

Li, C., Gupta, S., Rana, S., Nguyen, V., Venkatesh, S., and Shilton, A. High dimensional bayesian optimization using dropout. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 2096–2102, 2017.

Li, C.-L., Kandasamy, K., Poczos, B., and Schneider, J. High dimensional bayesian optimization via restricted projection pursuit models. In Gretton, A. and Robert, C. C. (eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pp. 884–892, Cadiz, Spain, 09–11 May 2016. PMLR.

LunarLander v2. https://gym.openai.com/envs/LunarLander-v2/.

Mania, H., Guy, A., and Recht, B. Simple random search of static linear policies is competitive for reinforcement learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 1800–1809. Curran Associates, Inc., 2018.

Matern kernel. https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.Matern.html.

McLeod, M., Osborne, M. A., and Roberts, S. J. Optimization, fast and slow: Optimally switching between local and bayesian optimization. In ICML, 2018.

Mockus, J. On bayesian methods for seeking the extremum. In Marchuk, G. I. (ed.), Optimization Techniques IFIP Technical Conference Novosibirsk, July 1–7, 1974, pp. 400–404, Berlin, Heidelberg, 1975. Springer Berlin Heidelberg. ISBN 978-3-540-37497-8.

Mohammadi, H., Le Riche, R., and Touboul, E. Making ego and cma-es complementary for global optimization. In International Conference on Learning and Intelligent Optimization, pp. 287–292. Springer, 2015.

Moriconi, R., Kumar, K. S. S., and Deisenroth, M. P. High-dimensional bayesian optimization with projections using quantile gaussian processes. Optimization Letters, 14:51–64, 2020.

Munteanu, A., Nayebi, A., and Poloczek, M. A framework for Bayesian optimization in embedded subspaces. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4752–4761, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

Mutny, M. and Krause, A. Efficient high dimensional bayesian optimization with additivity and quadrature fourier features. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 9005–9016. Curran Associates, Inc., 2018.

Nutini, J., Schmidt, M., Laradji, I. H., Friedlander, M., and Koepke, H. Coordinate descent converges faster with the gauss-southwell rule than random selection. In ICML'15: Proceedings of the 32nd International Conference on International Conference on Machine Learning, 37, July 2015.

Oh, C., Gavves, E., and Welling, M. BOCK: Bayesian optimization with cylindrical kernels. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3868–3877, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

Oliveira, R., Rocha, F., Ott, L., Guizilini, V., Ramos, F., and Jr, V. Learning to race through coordinate descent bayesian optimisation. In IEEE International Conference on Robotics and Automation (ICRA), February 2018.

Powell, M. On Search Directions for Minimization Algorithms. AERE-TP. AERE, Theoretical Physics Division, 1972.

Qin, C., Klabjan, D., and Russo, D. Improving the expected improvement algorithm. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 5381–5391. Curran Associates, Inc., 2017.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T. (eds.), Advances in Neural Information Processing Systems 20, pp. 1177–1184. Curran Associates, Inc., 2008.

scikit-learn. Gaussian processes. URL https://scikit-learn.org/stable/modules/gaussian_process.html.

Scott, W., Frazier, P., and Powell, W. The correlated knowledge gradient for simulation optimization of continuous parameters using gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026, 2011.

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 2951–2959. Curran Associates, Inc., 2012.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pp. 1015–1022, Madison, WI, USA, 2010.

Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. W. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.

Storn, R. and Price, K. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341–359, 1997.

Surjanovic, S. and Bingham, D. Optimization test problems, 2013. URL http://www.sfu.ca/~ssurjano/optimization.html.

Tyagi, H. and Cevher, V. Learning non-parametric basis independent models from point queries via low-rank methods. Applied and Computational Harmonic Analysis, 37(3):389–412, 2014. ISSN 1063-5203.

Vazquez, E. and Bect, J. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference, 140(11):3088–3095, 2010. ISSN 0378-3758.

Wang, L., Fonseca, R., and Tian, Y. Learning search space partition for black-box optimization using monte carlo tree search. ArXiv, abs/2007.00708, 2020.

Wang, Z. and Jegelka, S. Max-value entropy search for efficient Bayesian optimization. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3627–3635, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Wang, Z., Zoghi, M., Hutter, F., Matheson, D., and De Freitas, N. Bayesian optimization in high dimensions via random embeddings. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13, pp. 1778–1784. AAAI Press, 2013. ISBN 9781577356332.

Wang, Z., Hutter, F., Zoghi, M., Matheson, D., and De Freitas, N. Bayesian optimization in a billion dimensions via random embeddings. J. Artif. Int. Res., 55(1):361–387, January 2016. ISSN 1076-9757.

Wang, Z., Li, C., Jegelka, S., and Kohli, P. Batched high-dimensional bayesian optimization via structural kernel learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 3656–3664. JMLR.org, 2017.

Wang, Z., Gehring, C., Kohli, P., and Jegelka, S. Batched large-scale bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2018a.

Wang, Z., Gehring, C., Kohli, P., and Jegelka, S. Batched large-scale bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, pp. 745–754, 2018b.

Warga, J. Minimizing certain convex functions. Journal of the Society for Industrial and Applied Mathematics, 11(3):588–593, 1963.

Wilson, J., Moriconi, R., Hutter, F., and Deisenroth, M. P. The reparameterization trick for acquisition functions. In Proceedings of BayesOpt, 2017.

Wright, S. J. Coordinate descent algorithms. Mathematical Programming: Series A and B, June 2015.

Wu, J. and Frazier, P. I. The parallel knowledge gradient method for batch bayesian optimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pp. 3134–3142, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.

Wu, J., Poloczek, M., Wilson, A. G., and Frazier, P. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems 30, pp. 5267–5278. Curran Associates, Inc., 2017.

Zhang, M., Li, H., and Su, S. High dimensional bayesian optimization via supervised dimension reduction. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 4292–4298. International Joint Conferences on Artificial Intelligence Organization, July 2019.


CobBO: Coordinate Backoff Bayesian Optimization - Supplementary Materials

A. Default hyper-parameter configuration

Table 1 contains the default configuration of CobBO, which is used for all of the experiments in this paper.

Hyperparameter | Description | Default value
Θ | The threshold for the number of consecutive fails qt before changing Vt | 60 if T > 2000, else 30
α | Increase multiplicative ratio for the coordinate distribution update | 2.0
β | Decay multiplicative ratio for the coordinate distribution update | 1.1
p | Probability of selecting the coordinates with the largest πt values | 0.3
κS | The threshold for the virtual clock value Kt before shrinking the coarse trust region Ω^S | 30
κF | The threshold for the number of consecutive fails qt before shrinking the fine trust region Ω^F on the fast time scale | 6
τF | The number of consecutive fails qt in the fine trust region Ω^F | 6
δ | The relative improvement threshold governing the virtual clock update rule | 0.1
Kernel | Gaussian process kernel | Matern 5/2

Table 1. CobBO's hyperparameter configuration, used for all of the experiments
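For reference, a minimal sketch of this configuration as a plain Python dictionary is given below; the key and function names are illustrative and do not necessarily match the identifiers used in the released implementation.

def default_config(query_budget):
    # Illustrative defaults mirroring Table 1 (names are ours, not CobBO's);
    # T in Table 1 is assumed to denote the total query budget.
    return {
        "theta": 60 if query_budget > 2000 else 30,  # consecutive-fail threshold before changing the pivot Vt
        "alpha": 2.0,        # increase ratio for the coordinate distribution update
        "beta": 1.1,         # decay ratio for the coordinate distribution update
        "p": 0.3,            # probability of selecting the coordinates with the largest pi_t values
        "kappa_S": 30,       # virtual-clock threshold before shrinking the coarse trust region
        "kappa_F": 6,        # consecutive-fail threshold before shrinking the fine trust region
        "tau_F": 6,          # number of consecutive fails in the fine trust region
        "delta": 0.1,        # relative-improvement threshold for the virtual clock update
        "kernel": "Matern 5/2",  # Gaussian process kernel
    }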

B. Implementation

The proposed CobBO algorithm is implemented in Python 3 and is available at https://github.com/Alibaba-MIIL/CobBO.

C. Further ablation of escaping local optima

CobBO is described in Algorithm 1, where Line 8 escapes local maxima by changing the pivot point Vt when the number of consecutive fails exceeds a threshold, i.e., qt > Θ. In this case, we decrease the observed function value at Vt and set Vt+1 to a sub-optimal point selected at random from Xt. Specifically, we randomly sample 5 points in Xt with values above the median and pick the one furthest away from Vt.
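A minimal sketch of this escape step is given below, assuming the queried points Xt and their observed values are stored as NumPy arrays; the function name and signature are ours for illustration.

import numpy as np

def escape_pivot(X, y, v_t, num_candidates=5, rng=None):
    # Sample a few sub-optimal points with values above the median and
    # return the one furthest away from the current pivot v_t (sketch).
    rng = np.random.default_rng() if rng is None else rng
    above_median = np.flatnonzero(y > np.median(y))
    if above_median.size == 0:
        return v_t  # degenerate case: keep the current pivot
    k = min(num_candidates, above_median.size)
    candidates = rng.choice(above_median, size=k, replace=False)
    distances = np.linalg.norm(X[candidates] - v_t, axis=1)
    return X[candidates[np.argmax(distances)]]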

We use the experiments on the 100-dimensional Levy and Ackley functions, as described in Section 3.2, to measure the fraction of queries that improve upon the best observed value due to changing Vt according to Line 8.

Problem | Average # improved queries | Average # improved queries due to escaping
Ackley  | 228 | 15.3
Levy    | 155 | 3

Table 2. The number of improved queries due to escaping local maxima

We observe that optimizing the Levy function yields very few queries that improve the maximum by changing the pivot point, while optimizing the Ackley function benefits more from this mechanism.


D. Forming trust regions on two time scales

CobBO alternates between the two trust regions according to a duty cycle determined by κF and τF, as specified by Algorithm 1 and Table 1. The formation of trust regions is triggered when a virtual clock Kt, expressing the progress of the optimization, reaches certain thresholds. Specifically, the virtual clock evolves as follows:

K_{t+1} =
\begin{cases}
K_t + 1 & \text{if } \Delta_t \le 0 \\
\gamma_t(\Delta_t, x_t, x_{t-1}) \cdot K_t & \text{if } 0 < \Delta_t \le \delta \\
0 & \text{if } \Delta_t > \delta
\end{cases}

which is described in equation (3) in the main body of the paper.
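A direct transcription of this update rule is sketched below; γt is passed in as a number, since its exact form γt(∆t, xt, xt−1) is defined in the main body of the paper, and the function name is ours.

def update_virtual_clock(K_t, delta_t, gamma_t, delta=0.1):
    # Sketch of the virtual clock update in equation (3): count stagnation,
    # discount the clock on small gains, and reset it on a large relative improvement.
    if delta_t <= 0:        # no improvement: advance the clock
        return K_t + 1
    if delta_t <= delta:    # small improvement: discount the clock by gamma_t
        return gamma_t * K_t
    return 0                # sufficiently large improvement: reset the clock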

Algorithm 2 FormTrustRegions(Kt, yt, Mt−1)
Parameters: slow/fast thresholds κS and κF, respectively; fast duty cycle τF
Init: Ω^S_0 ← Ω, Ω_0 ← Ω
if yt > Mt−1 then
    Ω^S_t ← Double Ω^S_{t−1} around Vt
    Ωt ← Ω^S_t   [Ω^S_t is the trust region formed on the slow time scale]
else if Kt == κS then
    Ω^S_t ← Halve Ω^S_t around Vt
    Ωt ← Ω^S_t
    Reset Kt = 0
else
    Ω^S_t ← Ω^S_{t−1}
    if mod(Kt, κF + τF) == κF − 1 then
        Ωt ← Halve Ω^S_{t−1} around Vt   [the fine trust region Ω^F on the fast time scale]
    else if mod(Kt, κF + τF) == κF + τF − 1 then
        Ωt ← Ω^S_t
    else
        Ωt ← Ωt−1
Output: Trust region Ωt
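The two-time-scale logic of Algorithm 2 can be sketched in Python as follows, with trust regions represented as axis-aligned boxes (lower, upper); the helper and argument names are ours, and clipping of the resized boxes to the global domain Ω is omitted.

def resize_box(lower, upper, pivot, factor):
    # Resize a box [lower, upper] around the pivot by the given factor (sketch).
    half = 0.5 * factor * (upper - lower)
    return pivot - half, pivot + half

def form_trust_regions(K_t, y_t, M_prev, pivot, coarse, prev_region,
                       kappa_S=30, kappa_F=6, tau_F=6):
    # Sketch of Algorithm 2: the coarse region evolves on the slow time scale,
    # while a halved fine region is visited periodically on the fast time scale.
    if y_t > M_prev:                          # improvement: double the coarse region
        coarse = resize_box(*coarse, pivot, 2.0)
        region = coarse
    elif K_t == kappa_S:                      # slow-scale stagnation: halve the coarse region
        coarse = resize_box(*coarse, pivot, 0.5)
        region = coarse
        K_t = 0                               # reset the virtual clock
    else:                                     # fast-scale duty cycle between fine and coarse regions
        phase = K_t % (kappa_F + tau_F)
        if phase == kappa_F - 1:              # enter the fine (halved) trust region
            region = resize_box(*coarse, pivot, 0.5)
        elif phase == kappa_F + tau_F - 1:    # back off to the coarse trust region
            region = coarse
        else:                                 # otherwise keep the previous trust region
            region = prev_region
    return region, coarse, K_t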

In addition, when the number of queried points exceeds a threshold, e.g., 70% of the query budget, we shrink the total space Ω each time the fraction of queried points increases by another 10%.
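A small sketch of this schedule is given below; the bookkeeping of the consumed budget and the actual shrink of Ω (contracting it around the best observed point) are assumed to be handled by the caller, and all names are ours.

def should_shrink_domain(n_queried, budget, last_shrink_frac, start=0.7, step=0.1):
    # After 'start' of the budget is spent, trigger a shrink of the total space
    # each time the queried fraction grows by another 'step' (sketch).
    frac = n_queried / budget
    if frac >= start and frac >= last_shrink_frac + step:
        return True, frac       # shrink now and remember the fraction
    return False, last_shrink_frac

Initializing last_shrink_frac to start − step makes the first shrink happen right at 70% of the budget, with subsequent shrinks at 80%, 90%, and so on.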

E. Additional experiments

We provide additional experiments demonstrating the performance of CobBO. Confidence intervals (95%) are computed by repeating 30 and 10 independent experiments for the medium-sized functions and the 200-dimensional functions, respectively.

Medium-sized synthetic black-box functions (minimization): We test three synthetic functions of 30 dimensions: Ackley on [−5, 10]^30, Levy on [−5, 10]^30, and Rastrigin on [−3, 4]^30. In addition, we add experiments for an additive function of 36 dimensions, defined as f36(x) = Ackley(x1) + Levy(x2) + Rastrigin(x3) + Hartmann(x4), where the first three terms are the same functions over the same domains specified in Section 3.1 of this paper, and the Hartmann function is defined over [0, 1]^6. TuRBO is configured identically to Section 3.1, with a batch size of 10 and 5 trust regions with 10 initial points each. The other algorithms use 20 initial points. The results are shown in Fig. 7 and 8, where CobBO shows competitive or better performance compared to all of the methods tested across all of these problems.

Figure 7. Performance over medium dimensional problems: Ackley (left), Levy (middle) and Rastrigin (right)

Figure 8. Performance over an additive function of 36 dimensions

The 200-dimensional Levy and Ackley functions (minimization): We minimize the Levy and Ackley functions over [−5, 10]^200 with 500 initial points. TuRBO is configured with 15 trust regions and a batch size of 100. These two problems are challenging and have no redundant dimensions. For Levy, in Fig. 9 (left), CobBO reaches 100.0 within 2,000 trials, while CMA-ES and TuRBO obtain 200.0 after 8,000 trials. TPE cannot find a comparable solution within 10,000 trials. For Ackley, in Fig. 9 (right), CobBO reaches the best solution among all of the algorithms tested. The appealing trial complexity of CobBO suggests that it can be applied in a hybrid method, e.g., used in the first stage of the query process and combined with gradient estimation methods or CMA-ES.

Figure 9. Performance over high dimensional synthetic problems: Levy (left) and Ackley (right)