Automated Machine Learning via ADMM

Automated Machine Learning via ADMM

Sijia Liu∗ , Parikshit Ram∗ , Djallel Bouneffouf , Gregory Bramble , Andrew R Conn† , HorstSamulowitz and Alexander Gray

IBM Research AI{sijia.liu, parikshit.ram, djallel.bouneffouf, alexander.gray}@ibm.com, {gbramble,

samulowitz}@us.ibm.com

AbstractWe study the automated machine learning (Au-toML) problem of jointly selecting appropriate al-gorithms from an algorithm portfolio as well as op-timizing their hyper-parameters for certain learningtasks. The main challenges include a) the couplingbetween algorithm selection and hyper-parameteroptimization (HPO), and b) the black-box opti-mization nature of the problem where the optimizercannot access the gradients of the loss function butmay query function values. To circumvent thesedifficulties, we propose a new AutoML frameworkby leveraging the alternating direction method ofmultipliers (ADMM) scheme. Due to the split-ting properties of ADMM, algorithm selection andHPO can be decomposed through the augmentedLagrangian function. As a result, HPO with mixedcontinuous and integer constraints are efficientlyhandled through a query-efficient Bayesian opti-mization approach and Euclidean projection oper-ator that yields a closed-form solution. Algorithmselection in ADMM is naturally interpreted as acombinatorial bandit problem. The effectiveness ofour proposed methodology is compared to state-of-the-art AutoML schemes such as TPOT and Auto-sklearn on numerous benchmark data sets.

1 IntroductionAutomated machine learning (AutoML) has received in-creasing attention, starting with hyper-parameter optimiza-tion (HPO) to determine the most appropriate parameters fora machine learning (ML) model (for example, the number oftrees in a random forest [Snoek et al., 2012]), to automatingmultiple stages within a ML pipeline (such as data transfor-mations like feature selection combined with the predictivemodeling stage [Feurer et al., 2015]). AutoML offers a widerange of difficult technical challenges ranging from HPO, au-tomated feature engineering, to neural network design.

Auto-WEKA [Thornton et al., 2012; Kotthoff et al., 2017]and Auto-sklearn [Feurer et al., 2015] are the main represen-∗Equal contribution†Deceased 14 March 2019.

tatives for solving AutoML by so-called sequential parame-ter optimization. Both apply the general purpose algorithmconfiguration framework SMAC [Hutter et al., 2011] basedon Bayesian Optimization to find optimal machine learningpipelines. Auto-sklearn improves upon Auto-WEKA withtwo main innovations1: (i) a warm-start technique that uses“meta-features” of the datasets to determine good candidatesto be considered in the pipeline based on past experiences onsimilar datasets, (ii) Auto-sklearn can be configured to returnan ensemble of pipelines instead of a single pipeline. It isimportant to note that the greedy ensemble creation in Auto-sklearn is a post-processing step2and can be applied to anyAutoML solution. Tree-Parzen Estimator (TPE) [Bergstra etal., 2011] based Hyperopt-sklearn is a predecessor of Auto-sklearn. Recently, the Bayesian optimization used in Auto-sklearn has been combined with a bandit based strategy [Li etal., 2018] to solve HPO problems [Falkner et al., 2018]. Fusiet al., 2017 [Fusi et al., 2017] use meta-features of the datato build a recommender system that selects the most suitablepredictive model for the current input data. While this ap-proach can be successfully applied to select a single model,it is unable to perform HPO or compose elements to formlonger pipelines. Genetic algorithms have been used to spliceand concatenate randomly generated ML pipelines to createthe AutoML system TPOT [Olson and Moore, 2016].

Different from the aforementioned literature, we considerthe problem of joint algorithm selection and HPO, and studyit from the perspective of mixed nonlinear integer program-ming. We face two main challenges: (a) the coupling be-tween algorithm selection and HPO, and (b) the black-boxoptimization nature of the problem. The latter arises dueto the lack of explicit loss function gradients in any MLpipeline (for example, one including a decision tree), andfeedback is only available in the form of loss evaluations. Wepropose a new AutoML framework to address these issuesby leveraging the alternating direction method of multipliers(ADMM). ADMM offers a two-block alternative optimiza-tion procedure that splits an involved problem (with multiplevariables and constraints) into simpler sub-problems through

1Vanilla Auto-sklearn does not use warm starts nor ensembles.2The sequential parameter optimization is not involved in the en-

semble creation; after the optimization is complete, the ensemble iscreated with greedy forward-selection [Caruana et al., 2004] fromthe list of configurations attempted during the optimization.

arX

iv:1

905.

0042

4v1

[cs

.LG

] 1

May

201

9

the augmented Lagrangian function, which is a conventionalLagrangian penalized by the `2 norm regularization of equal-ity constraints [Boyd et al., 2011; Parikh et al., 2014].

With ADMM, the AutoML problem is decomposed intothree stages: (a) HPO with continuous constraints and vari-ables, (b) Euclidean projection onto integer constraints fordiscrete variables, and (c) the combinatorial problem of algo-rithm selection. The first two stages formally address the factthat hyper-parameters are mixed continuous and discrete vari-ables. The last stage characterizes the combinatorial nature ofalgorithm selection. We avoid using off-the-shelf integer pro-gram solvers by interpreting algorithm selection as a combi-natorial multi-armed bandit problem, and utilize a Thompsonsampling (TS) based algorithm [Durand and Gagne, 2014;Bouneffouf et al., 2017] for query (loss function evaluations)efficient algorithm selection. We summarize our contribu-tions as follows:• We explicitly model the coupling between hyper-

parameters and available algorithms, and exploit the hid-den structure in the AutoML problem.• We employ ADMM to decompose the problem into a se-

quence of sub-problems, which decouple the difficultiesin AutoML and can each be solved more efficiently.• We interpret the problem of algorithm selection in

ADMM as a combinatorial bandit problem that allowsus to solve it via a query-efficient bandit approach.• We demonstrate the effectiveness our approach against

state-of-the-art AutoML baselines Auto-sklearn andTPOT on numerous data sets.

We present the formal AutoML optimization problem in thefollowing Section 2 and propose an ADMM based joint opti-mization in Section 3. We thoroughly evaluate our proposedscheme against reasonable baselines in Section 4 and presentconcluding remarks and future directions in Section 5.

2 An Optimization Perspective to AutoMLIn this section, we formulate the AutoML problem for jointalgorithm selection and HPO as a constrained mixed inte-ger nonlinear programming problem. In contrast to Auto-sklearn [Feurer et al., 2015], we explicitly introduce Booleanvariables to encode whether or not an algorithm is selected.Each algorithm could belong to one of the functional mod-ule such as data prepossessing, feature engineering, and esti-mation (classification/regression). Each module correspondsto a set of specific algorithms, such as different dimen-sion reduction algorithms for data transformation. Algo-rithm hyper-parameters can be constrained continuous anddiscrete/integer variables. Our goal is to design a joint op-timizer to minimize the loss of a learning task.

We begin by defining the optimization variables of Au-toML. Let N be the number of functional modules and Kbe the number of algorithms. Algorithm selection is encodedwith Boolean variables z = [(z1)>, . . . , (zN )>]> ∈ {0, 1}K ,where zi ∈ {0, 1}Ki corresponds to the Ki algorithms formodule i, and K =

∑Ni=1Ki. Assuming only one algorithm

is selected in each module, we get a set of equality constraints1>zi = 1, i ∈ [N ], where [N ] is the integer set {1, . . . , N}.

We denote by θi ∈ Rmi the hyper-parameters of algorithmi, and let θ = [θ>1 , . . . ,θ

>K ]> ∈ Rm with m =

∑Ki=1mi.

We highlight that θi could be a vector of mixed continuousand discrete variables. For example, the scikit-learn trans-former ‘Nystroem’ [Pedregosa et al., 2011a] involves bothdiscrete (‘kernel type’: linear, polynomial, radial basis func-tion) and continuous hyper-parameters (‘Gamma’ parameterfor different kernels). For conditional hyper-parameters, theparameter tree could be flattened so that each conditional pa-rameter corresponds to a different algorithm. For example,the aforementioned ‘Nystroem’ transformer with a ‘linear’kernel is considered to be a different algorithm compared to‘Nystroem’ with a ‘polynomial’ kernel (the latter having ahyper-parameter ‘degree’ enabled only for ‘polynomial’ ker-nel). Let f(z,θ;A) denote the loss (validation/testing error)of a learning task when algorithm selection based on z utilizeshyper-parameter configuration θ on the data set A.

We now pose the AutoML problem of our interest as

minimizez,θ

f(z,θ;A)

subject to 1>zi = 1, zi ∈ {0, 1}Ki , i ∈ [N ]θi ∈ Ci, θi ∈ Di, i ∈ [K],

(1)

where Ci andDi denote continuous and discrete (integer) con-straints respectively for the hyper-parameters θi. Comparedto the previous ‘CASH’ formulation [Feurer et al., 2015], ourformulation (1) introduces two key differences: a) we use ex-plicit Boolean variables z to encode the algorithm selection,and b) we incorporate both continuous and integer constraintsof hyper-parameters θ. These features could better character-ize the properties of AutoML problems and thus enable moreefficient joint optimizers. With the aid of z, the objectivefunction of problem (1) can be explicitly expressed as

f(z,θ;A) =

K1∑j1=1

. . .

KN∑jN=1

[(N∏i=1

ziji

)f({ziji = 1}Ni=1,θ;A

)], (2)

where ziji denotes the ji-th element of zi, with ziji = 1representing the selection of the ji-th algorithm in modulei. f

({ziji = 1}Ni=1,θ;A

)denotes the loss achieved by algo-

rithms {j1, . . . , jN} under hyper-parameters θ, and the lossis measured only if

∏Ni=1 z

iji

= 1. Note that in (1) f is ablack-box function of θ with no gradients available; we areonly able to query values of f at any specified θ.

3 ADMM-Based Joint OptimizerIn this section, we show how alternating direction method ofmultipliers (ADMM) is suitable for AutoML. ADMM pro-vides a general and effective optimization framework to solvecomplex problems, like (1), with mixed variables and multi-ple constraints [Boyd et al., 2011; Liu et al., 2018].

We begin by reformulating problem (1) that lends itself tothe application of ADMM [Boyd et al., 2011; Parikh et al.,2014]. By introducing an auxiliary variable δ together witha new equality constraint, problem (1) can be written as thestandard form amenable to ADMM

minimizez,θ,δ

f(z,θ;A) + IZ(z) + IC(θ) + ID(δ)

subject to θ = δ,(3)

where IZ(z), IC(θ) and ID(δ) are indicator functions to en-code the constraints of problem (1):

IZ(z) =

{0 1>zi = 1, zi ∈ {0, 1}Ki , i ∈ [N ]∞ otherwise, (4)

IC(θ) =

{0 θi ∈ Ci, i ∈ [K]∞ otherwise, (5)

ID(δ) =

{0 δi ∈ Di, i ∈ [K]∞ otherwise. (6)

Here the indicator functions are introduced to express thehard constraints of problem (1) in the objective function of(3), e.g., IZ(z) =∞ when z is infeasible to problem (1).

ADMM is performed by minimizing the augmented La-grangian function of problem (3)

L(z,θ, δ,λ) =f(z,θ;A) + IZ(z) + IC(θ) + ID(δ)

+ λ>(θ − δ) +ρ

2‖θ − δ‖22, (7)

where λ is the Lagrangian multiplier, and ρ > 0 is a givenpenalty parameter of the augmented term. ADMM splits allof optimization variables into two blocks and alternativelyminimizes the augmented Lagrangian function. As a result,ADMM is executed at iteration t = 0, 1, . . .,

θ(t+1) = arg minθ

L(z(t),θ, δ(t),λ(t)), (8)

{δ(t+1), z(t+1)} = arg minδ,z

L(z,θ(t+1), δ,λ(t)), (9)

λt+1 = λt + ρ(θ(t+1) − δ(t+1)), (10)

where z(0), δ(0), and λ(0) are given initial values. Based on(4), (6) and (7), we note that the variables δ and z are inde-pendent, and thus the step (9) can be further decomposed intotwo independent subproblems,

minimizeδ

ρ

2‖δ − a‖22 + ID(δ), a := θ(t+1) +

1

ρλ(t), (11)

minimizez

f(z,θ(t+1);A) + IZ(z). (12)

Remark 1 It is worth mentioning that ADMM provides ageneral operator splitting framework. If we only update pri-mal variables θ, δ and z, then ADMM simplifies to the alter-nating optimization (AO) method that involves (8), (11) and(12) by dropping the dual variable λ and setting ρ = 0.

Clearly, the main advantage of ADMM is that it allowsus to split the original AutoML problem (3) into three sub-problems (8), (11) and (12). As will be evident later, prob-lem (8) can be cast as a Bayesian optimization (BO) problemwith only continuous constraints. Problem (11) gives the Eu-clidean projection operator, which projects the known point aonto the integer constraints {Di}. Problem (12) can be inter-preted as a combinatorial bandit problem.

3.1 BO for solving problem (8)Substituting (7) into (8) and completing squares with respectto θ in (8), we obtain the optimization problem

minimizeθ

f(z(t),θ;A) + ρ2‖θ − b‖22

subject to θi ∈ Ci, i ∈ [K],(13)

where b := δ(t) − (1/ρ)λ(t). In problem (13), since thealgorithm selection scheme z(t) is known at the current iter-ation, the loss function f only relies on hyper-parameters θicorresponding to the N selected algorithms with z(t)i = 1,where z(t)i denotes the ith element of z(t). Therefore, we can

divide θ into two parts θ = [θ>, θ>

]>, where θ are hyper-parameters corresponding to active algorithms with z(t)i = 1

for i ∈ [K], and hyper-parameters θ are associated with in-active algorithms with z(t)i = 0 for i ∈ [K].

By splitting hyper-parameters with respect to active andinactive algorithms, problem (13) can be equivalently trans-formed into

minimizeθ

ρ2‖θ − b‖22

subject to θi ∈ Ci, i ∈ [K],and (14)

minimizeθ

f(z(t), θ;A) + ρ2‖θ − b‖22

subject to θi ∈ Ci, i ∈ [K],(15)

where b = [b>, b>]>, and the partition of b is consistent

with that of θ = [θ>, θ>

]>.It is often the case that the continuous constraint Ci in (13)

is given by a bounded box Ci = {θi | ci ≤ θi ≤ di}, whereci and di are lower and upper bounds of hyper-parameters.Thus, problem (14) can be solved analytically for every θiwith z(t)i = 0,

θ(t+1)

i = Proj[ci,di](bi), (16)

where θi ∈ [ci, di], Proj(·) is Euclidean projection operator

[θ(t+1)

i

]k

=

[ci]k if [bi]k ≤ [ci]k[bi]k if [ci]k ≤ [bi]k ≤ [di]k[di]k if [bi]k ≥ [di]k,

(17)

and [x]k denotes the kth entry of x.We next turn to problem (15), which is a HPO problem but

only corresponds to a reduced set of hyper-parameters {θi}.Due the the black-box nature of HPO, we solve problem (15)by using BO, which has become a core component of someAutoML systems [Snoek et al., 2012; Shahriari et al., 2016;Ariafar et al., 2017]. Let us call the objective function ofproblem (15) as l(θ). BO assumes a statistical model, usuallya Gaussian process (GP) of l(θ). Based on the observed func-

tion values y = [l(θ(0)

), . . . , l(θ(t)

)]>, BO updates the GP

and determines the next query point θ(t+1)

by maximizingthe expected improvement (EI) over the posterior GP model[Shahriari et al., 2016; Ariafar et al., 2017].

To be more specific, we begin by modeling the loss f(θ)as a GP with a prior distribution f(·) ∼ N (0, κ(·, ·)), whereκ(·, ·) is a positive definite kernel. Since the quadratic termin l(θ) is deterministic, the probability distribution of l(θ) isalso a GP with mean ρ

2‖θ− b‖22 and the same kernel functionas f(θ). Given the observed function values y, the posteriorprobability of a new function evaluation l(θ) at iteration t+1

can be modelled as a Gaussian distribution with mean µ(θ)

and variance σ2(θ) [Shahriari et al., 2016, Sec. III-A],{µ(θ) = κ>[Γ + σ2

nI]−1y

σ2(θ) = κ(θ, θ)− κ>[Γ + σ2nI]−1κ,

(18)

where κ is a vector of covariance terms between θ and{θ

i}ti=0, and Γ denotes the covariance of {θ

i}ti=0, namely,

Γij = κ(θi, θj), and σ2

n is a small positive number to modelthe variance of the observation noise.Remark 2 To determine the GP model (18), we choose thekernel function κ(·, ·) as the ARD Matern 5/2 kernel [Snoeket al., 2012; Shahriari et al., 2016],

κ(x,x′) = τ20 exp(−√

5r)(1 +√

5r +5

3r2) (19)

for two vectors x,x′, where r2 =∑di=1(xi − x′i)2/τ2i , and

{τi}di=0 are kernel parameters. We determine the GP hyper-parameters ψ = {{τi}di=0, σ

2n} by minimizing the negative

log marginal likelihood log p(y|ψ) [Shahriari et al., 2016],

minimizeψ

log det(Γ + σ2nI) + y>

(Γ + σ2

nI)−1

y. (20)

Solving problem (20) via a standard gradient descent rou-tine is computationally intensive in a high dimensional settingsince acquiring gradients requires the inversion of large ma-trices. To circumvent this difficulty, we utilize a zeroth-ordersign-based gradient descent method [Liu et al., 2019] whichhas a provable fast convergence to moderate accuracy but isfree of gradient computation (and hence, matrix inversion).With the posterior model (18), the desired next query point

θ(t+1)

maximizes the EI acquisition function

θt+1

= arg max{θi∈Ci}

EI(θ) :=(y+ − l(θ)

)I(l(θ) ≤ y+)

= arg max{θi∈Ci}

(y+ − µ)Φ

(y+ − µσ

)+ σφ

(y+ − µσ

), (21)

where y+ = mini∈[t] l(θi), namely, the minimum observed

value, I(l(θ) ≤ y+) = 1 if l(x) ≤ y+, and 0 otherwise(indicating that the desired next query point θ should yielda smaller loss than the observed minimum loss), µ and σ2

are defined in (18), Φ denotes the cdf of the standard normaldistribution, and φ is its pdf. The derivation of EI is illustratedin the supplement [Authors, 2019, Section 1].

With the aid of (21), EI can be maximized via projectedgradient ascent. In practice, a customized bound-constrainedL-BFGS-B solver [Zhu et al., 1997] is often adopted.

3.2 Euclidean projection for solving problem (11)According to the definition of ID(δ) in (6), problem (11) canbe rewritten as the Euclidean projection operator

minimizeδ

ρ

2‖δ − a‖22 subject to δi ∈ Di∀i ∈ [K]. (22)

We recall from (3) that δ provides a copy of hyper-parametersθ, and {Di} represent their integer constraints. Similar to Ci,Di is often given by a bounded integer box in practice,

Di = {δi | ei ≤ δi ≤ fi and δi ∈ Nmi+ }, (23)

Algorithm 1 TS for CMAB with probabilistic rewards

1: Input: Beta distribution priors α0 and δ0, maximum it-erations L, upper bound f of loss f .

2: Set: nj(k) and rj(k) as the cumulative counts and re-wards respectively of arm j pulls at bandit iteration k.

3: for k ← 1, 2, . . . , L do4: for all arms j ∈ [K] do5: αj(k)← α0+rj(k), δj(k)← δ0+nj(k)−rj(k).6: Sample ωj ∼ Beta(αj(k), δj(k)).7: end for8: Determine the arm selection scheme z(k) by solving

maxz

N∑i=1

(zi)>ωis.t.1>zi = 1, zi ∈ {0, 1}Ki (25)

where ω = [(ω1)>, . . . , (ωN )>]> is the vector of{ωj}, and ωi is its subvector limited to module i.

9: Apply strategy z(k) and observe continuous reward r

r = 1−min

{max

{f(k + 1)

f, 0

}, 1

}(26)

where f(k + 1) is the loss value after applying z(k).10: Observe binary reward r ∼ Bernoulli(r).11: Update nj(k + 1)← nj(k) + zj(k).12: Update rj(k + 1)← rj(k) + zj(k)r.13: end for

where ei and fi are lower and upper bounds of δi, and Nmi+denotes the positive integer set of dimension mi.

Given (23), the solution of problem (22) is achieved byprojecting a onto Di,

δ(t+1)i = Round

(Proj[ei,fi](ai)

), (24)

where Proj[ei,fi](·) is defined in (16), and Round(·) is thefunction of rounding a number to its nearest integer.

3.3 Combinatorial bandits for problem (12)Utilizing (4) in the algorithm selection problem (12) gets us

minimizez∈{0,1}K

f(z,θ(t);A) subject to 1>zi = 1∀i ∈ [N ]. (27)

Based on (2), problem (27) can be solved as an integer pro-gram, but has two issues: (i)

∏Ni=1Ki function queries would

be needed in (2) at each ADMM iteration, and (ii) integer pro-gramming is difficult with problems such as (27) containinga product

∏Ni=1 z

iji

of integer variables.We propose a customized combinatorial multi-armed ban-

dit (CMAB) algorithm as a query-efficient alternative by in-terpreting problem (27) through combinatorial bandits: Wewish to select the optimal N algorithms (arms) from K =∑Ni=1Ki algorithms based on bandit feedback (‘reward’) r

inversely proportional to the loss f . CMAB problems canbe efficiently solved with Thompson sampling (TS) [Durandand Gagne, 2014]. However, the conventional algorithm uti-lizes binary rewards, and hence is not directly applicable toour case of continuous rewards (with r ∝ 1 − f where the

Algorithm 2 ADMM to solve AutoML problem (1)

1: Input: Maximum iterations T , parameter ρ, initial iter-ates z(0), δ(0) and λ(0).

2: for t← 0, 1, . . . , T − 1 do3: solve HPO problem (8) via (16) and (21)4: solve projection problem (11) via (24)5: solve model selection problem (12) by CMAB6: perform dual update (10)7: end for8: output: hyper-parameters δ(T ) & algorithms z(T )

loss f ∈ [0, 1] denotes the validation/testing objective). Weaddress this issue by using “probabilistic rewards” [Agrawaland Goyal, 2012].

We present the customized CMAB algorithm in Algorithm1 and elaborate on it in the supplement [Authors, 2019, Sec-tion 2]. The closed-form solution of problem (25) is givenby zij = 1 for j = arg maxj∈[Ki] ω

ij , and zij = 0 otherwise.

Step 9 of Algorithm 1 normalizes the continuous loss f withrespect to its upper bound f (lower bound is 0), and maps itto the continuous reward r within [0, 1]. In experiments, wefound that Algorithm 1 is not very sensitive to the choice off . Step 10 of Algorithm 1 converts a probabilistic reward to abinary reward. Lastly, steps 11-12 of Algorithm 1 update thepriors of TS for combinatorial bandits [Durand and Gagne,2014]. We conclude Section 3 with the summarized ADMM-based AutoML procedure in Algorithm 2.

4 ExperimentsIn this section, we evaluate the capabilities of the proposedADMM optimizer against widely used AutoML systemsAuto-sklearn [Feurer et al., 2015] and TPOT [Olson andMoore, 2016]. ADMM has two parameters – the penalty ρon the augmented term (7) and the loss upper-bound f in theCMAB algorithm (Algorithm 1). Our experiments show thatthe proposed approach is not very sensitive to the choice of ρand f ; see Section 3.1 of supplement. We start the ADMM

Table 1: Overview of the scikit-learn preprocessors, transformers,and estimators used in our empirical evaluation.

Module Algorithm # parameters

Preprocessor

None∗Normalizer

QuantileTransformerRobustScaler

MinMaxScalerStandardScaler

ImputerOneHotEncoder

none2d†2d2c†1d3d1d1d

TransformerNonePCA

PolynomialFeatures

none1c, 1d1c, 2d

EstimatorGaussianNB

KNeighborsClassifierRandomForestClassifier

none3d

1c, 5d∗None means no algorithm is selected and corresponds to a empty setof hyper-parameters. † ‘d’ and ‘c’ represents discrete and continuousvariables, respectively.

optimization with λ(0) = 0. Based on Remarks 1 & 2,we consider two variants of ADMM: (i) ADMM(BO, Ba):ADMM uses the basic Bayesian optimization (BO) and theCMAB algorithm (Ba) with ρ = 1, (ii) AO(BO2, Ba): Al-ternating optimization (AO) (ADMM with ρ = 0 and onlyprimal variable updates) using BO with the zeroth-order sign-based gradient descent (BO2) and Ba. In both these variants,f = 0.7. We optimize for 4096 seconds and generate timeto convergence curves with respect to the black-box objectivewith 10 repetitions. This procedure is detailed in Section 3.2of the supplement.

Baselines. We consider 2 versions of vanilla Auto-sklearn(Auto-sklearn without ensembles3and meta-learning) –ASKL:SMAC3, which uses the SMAC HPO [Hutter et al.,2011], and ASKL:RND, which uses random search, a fairlycompetitive baseline [Falkner et al., 2018; Li et al., 2018].For TPOT, we consider 2 instances TPOT:10 and TPOT:50with population sizes of 10 and 50 respectively to ensure thatTPOT is able to achieve a meaningful solution in the maxi-mum allotted optimization time. For all optimizers, we usethe same scikit-learn components [Pedregosa et al., 2011b]and restrict the set of composable elements to the methodslisted in Table 1. For each of these methods, we use the hyper-parameter ranges as set in Auto-sklearn4.

Remarks on TPOT. The proposed ADMM based schemesand Auto-sklearn are optimizing over the same search spaceby making a single selection from each of the functional mod-ules listed in Table 1 to generate a pipeline. In contrast, TPOTcan use multiple methods from the same functional modulewithin a single pipeline with stacking and chaining due to thenature of the splicing/crossover schemes in its underlying ge-netic algorithm. This gives TPOT access to a larger searchspace of more complex pipelines featuring longer as well asparallel compositions, rendering the comparison somewhatbiased towards TPOT. Notwithstanding this caveat, we con-sider TPOT as a baseline since it is a competitive open sourceAutoML alternative to Auto-sklearn. Further details regard-ing the baselines and the overall evaluation are in Section 3.2of the supplement.

Data and black-box objective function. We consider 26data sets from the UCI ML [Asuncion and Newman, 2007]& OpenML repositories, and Kaggle. All these data sets cor-respond to binary classification problems although our pro-posed scheme (and baselines) can work with multiclass clas-sification and regression. Details on the data sets are in Sec-tion 3.3 of the supplement. For binary classification, weconsider the Area under the ROC curve (AUROC) on a vali-dation set as the black-box optimization objective. Since the

3As mentioned in Section 1, the ensemble learning in Auto-sklearn is done post-hoc using the pipelines found during the op-timization – the optimization is not affected or driven by the ensem-ble learning. Hence we do not focus on this aspect of Auto-sklearnhere. This same post-hoc ensemble learning can be applied to ourproposed ADMM based optimization, and we demonstrate this ca-pability in Section 3.5 of the supplement.

4https://github.com/automl/auto-sklearn/tree/master/autosklearn/pipeline/components

https://github.com/automl/auto-sklearn/tree/master/autosklearn/pipeline/components


AUROC lies within [0, 1] and we solve a minimization prob-lem, our precise objective is (1−AUROC). We consider AU-ROC since it is a meaningful predictive performance metricregardless of the class imbalance in the data (as opposed tocommon classification error metric). However, the proposedscheme (and baselines) is a black-box optimizer and can workwith any black-box objective.

Convergence properties. We present the convergence be-havior of the proposed schemes and the baselines in Figure 1on 4 data sets as representative of the relative performance.The figures for all 26 data sets are presented in Figure A2 ofthe supplement. As detailed in Section 3.2 of the supple-ment, an optimizer is assigned the worst-case objective valueof 1.0 when it has not completed a single black-box functionevaluation up until that time. This is the reason for the suddendrop in the objective values for ASKL:SMAC3, TPOT:10 andTPOT:50, with TPOT:10 dropping before TPOT:50. Hencethe initial behavior of the various optimizers is more depen-dent on the initial heuristics than the optimization capabilitiesof the method. For larger optimization times, most methodsare fairly competitive, with most methods significantly out-performing ASKL:RND (random search baseline). No onemethod obviously dominates the rest at any particular opti-mization time – most of them are fairly competitive on manydata sets (for example, Home Credit Default Risk) However,as expected TPOT:50 is almost always close to the best bythe maximum optimization time of 4096 seconds (on accountof having a larger search space). However, there are sev-eral cases where our proposed schemes ADMM(BO, Ba) orAO(BO2, Ba) achieve the lowest objective (for example, OilSpill), as does ASKL:SMAC3. As expected, TPOT:10 dom-inates TPOT:50 for lower optimization times; given enoughtime, TPOT:50 always improves upon TPOT:10.

Aggregate behavior across multiple data sets. To fur-ther quantify the overall relative performance of the optimiz-ers, we consider the relative rank of each optimizer (withrespect to the median objective over 10 trials) for everytime stamp, and average this rank across all data sets (us-ing the minimum rank for ties) similar to the comparison inFeurer, et al., 2015 [Feurer et al., 2015]. The aggregate re-sults are presented in Figure 2. They indicate that, givenenough optimization time, all optimizers significantly out-perform random search ASKL:RND. AO(BO2, Ba) outper-forms ADMM(BO, Ba) for larger optimization times, demon-strating ADMM’s ability to gain from the improvement inthe underlying sub-optimizer (in this case, Bayesian opti-mization). Both ADMM(BO, Ba) and AO(BO2, Ba) outper-form ASKL:SMAC3, the vanilla Auto-sklearn, on aggregate.TPOT:50 improves upon TPOT:10 and performs very welloverall given enough time on account of having access to alarger search space with the potential for improved objectivevalues; TPOT:10’s performance saturates quickly because ofthe smaller population size. Our proposed AO(BO2, Ba) isvery competitive to TPOT:50, being within an average rankof ∼ 0.5 (across 26 data sets) when AO(BO2, Ba) is not bet-ter than TPOT:50 notwithstanding the smaller search space.

(a) Sonar (b) Oil spill

(c) Mammography (d) Home Credit Default Risk

Figure 1: Convergence of validation objective (median & the inter-quartile range over 10 trials) with respect to optimization time(please view in color). Please note the log scale on both axes.Smaller values on the vertical axis are better.

5 ConclusionsIn this paper, we proposed a new ADMM-based AutoMLapproach for joint algorithm selection and hyper-parameteroptimization. We delved into the structures of AutoMLproblems from the optimization perspective, and demon-strated how ADMM can effectively be integrated with othergradient-free solvers used in AutoML, such as Bayesian op-timization and the multi-armed bandit algorithm. We showedthat ADMM allows us to split the original complex AutoMLproblem (in terms of a mixed nonlinear integer program)into three easily-solved subproblems, a) Bayesian optimiza-tion (BO) with only continuous constraints, b) the Euclideanprojection operator, and c) the combinatorial bandit problem.We demonstrated that the proposed AutoML approach yieldscompetitive performance to the state-of-the-art AutoML al-gorithms including TPOT and auto-sklearn.

In the future, we will seek other better formulations and ap-proaches to tackle the issue of conditional hyper-parameters.Moreover, we will relax our assumption that only one algo-rithm is selected per module in a pipeline. Lastly, instead ofBO, we will consider other gradient-free methods, such astrust region, to further improve the query efficiency.

6 In MemoriamWe would like to use this opportunity to memorize our dearfriend and the deceased co-author, Andrew R Conn. AndrewR Conn, was born in London in 1946, received his PhD fromWaterloo University, where he became a Full Professor. Hejoined IBM in 1990 and made substantial contributions acrossthe board. His pioneering research in the field of numer-ical optimization has awarded him several prestigious hon-ors, including the Lagrange Prize in Continuous Optimization

Figure 2: Average ranking of the different optimizers for differentoptimization times across 26 data sets (please view in color). Notethe log scale on the horizontal axis. Smaller values on the verticalaxis are better. See text for further details.

and the Beale-Orchard-Hays Prize. Beyond his remarkablescholastic accomplishments, Andy was a dear friend withgreat empathy, sense of humor and charming personality. Hewas an incredible colleague with extraordinary insights, cu-riosity, and eagerness to solve complex problems. As a truemathematician, he has sure chosen a proper day (Pi day) toleave us, yet he will always be remembered and cherished byus

References[Agrawal and Goyal, 2012] S. Agrawal and N. Goyal. Anal-

ysis of thompson sampling for the multi-armed banditproblem. In Conference on Learning Theory, pages 39–1, 2012.

[Ariafar et al., 2017] S. Ariafar, J. Coll-Font, D. Brooks, andJ. Dy. An admm framework for constrained bayesian op-timization. NIPS Workshop on Bayesian Optimization,2017.

[Asuncion and Newman, 2007] Arthur Asuncion and DavidNewman. Uci machine learning repository, 2007.

[Authors, 2019] The Authors. Supplementary material.https://tinyurl.com/y2zyjlw4, 2019.

[Bergstra et al., 2011] James S Bergstra, Remi Bardenet,Yoshua Bengio, and Balazs Kegl. Algorithms for hyper-parameter optimization. In Advances in neural informa-tion processing systems, pages 2546–2554, 2011.

[Bouneffouf et al., 2017] Djallel Bouneffouf, Irina Rish,Guillermo A. Cecchi, and Raphael Feraud. Context atten-tive bandits: Contextual bandit with restricted context. InProceedings of the 26th International Joint Conference onArtificial Intelligence, IJCAI’17, pages 1468–1475. AAAIPress, 2017.

[Boyd et al., 2011] Stephen Boyd, Neal Parikh, Eric Chu,Borja Peleato, Jonathan Eckstein, et al. Distributed op-

timization and statistical learning via the alternating direc-tion method of multipliers. Foundations and Trends R© inMachine Learning, 3(1):1–122, 2011.

[Caruana et al., 2004] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selectionfrom libraries of models. In Proceedings of the twenty-firstinternational conference on Machine learning, page 18.ACM, 2004.

[Durand and Gagne, 2014] A. Durand and C. Gagne.Thompson sampling for combinatorial bandits and itsapplication to online feature selection. In Workshops at theTwenty-Eighth AAAI Conference on Artificial Intelligence,2014.

[Falkner et al., 2018] Stefan Falkner, Aaron Klein, andFrank Hutter. BOHB: Robust and efficient hyperparam-eter optimization at scale. In Jennifer Dy and AndreasKrause, editors, Proceedings of the 35th InternationalConference on Machine Learning, volume 80 of Proceed-ings of Machine Learning Research, pages 1437–1446,Stockholmsmassan, Stockholm Sweden, 10–15 Jul 2018.PMLR.

[Feurer et al., 2015] Matthias Feurer, Aaron Klein, Katha-rina Eggensperger, Jost Springenberg, Manuel Blum, andFrank Hutter. Efficient and robust automated machinelearning. In Advances in Neural Information ProcessingSystems, pages 2962–2970, 2015.

[Fusi et al., 2017] N. Fusi, S. Rishit, and H. M. Elibol.Probabilistic Matrix Factorization for Automated MachineLearning. ArXiv e-prints, 2017.

[Hutter et al., 2011] Frank Hutter, Holger H. Hoos, andKevin Leyton-Brown. Sequential model-based optimiza-tion for general algorithm configuration. In Proceedingsof the 5th International Conference on Learning and In-telligent Optimization, LION’05, pages 507–523, Berlin,Heidelberg, 2011. Springer-Verlag.

[Kotthoff et al., 2017] Lars Kotthoff, Chris Thornton, Hol-ger H. Hoos, Frank Hutter, and Kevin Leyton-Brown.Auto-weka 2.0: Automatic model selection and hyper-parameter optimization in weka. J. Mach. Learn. Res.,18(1):826–830, January 2017.

[Li et al., 2018] Lisha Li, Kevin Jamieson, Giulia DeSalvo,Afshin Rostamizadeh, and Ameet Talwalkar. Hyper-band: A novel bandit-based approach to hyperparame-ter optimization. Journal of Machine Learning Research,18(185):1–52, 2018.

[Liu et al., 2018] Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen,Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-orderstochastic variance reduction for nonconvex optimization.In Advances in Neural Information Processing Systems,pages 3731–3741, 2018.

[Liu et al., 2019] Sijia Liu, Pin-Yu Chen, Xiangyi Chen, andMingyi Hong. signSGD via zeroth-order oracle. In Inter-national Conference on Learning Representations, 2019.

[Olson and Moore, 2016] Randal S Olson and Jason HMoore. Tpot: A tree-based pipeline optimization tool for

https://tinyurl.com/y2zyjlw4

automating machine learning. In Workshop on AutomaticMachine Learning, pages 66–74, 2016.

[Parikh et al., 2014] Neal Parikh, Stephen Boyd, et al. Prox-imal algorithms. Foundations and Trends R© in Optimiza-tion, 1(3):127–239, 2014.

[Pedregosa et al., 2011a] F. Pedregosa, G. Varoquaux,A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas,A. Passos, D. Cournapeau, M. Brucher, M. Perrot, andE. Duchesnay. Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 12:2825–2830,2011.

[Pedregosa et al., 2011b] F. Pedregosa, G. Varoquaux,A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas,A. Passos, D. Cournapeau, M. Brucher, M. Perrot, andE. Duchesnay. Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 12:2825–2830,2011.

[Shahriari et al., 2016] B. Shahriari, K. Swersky, Z. Wang,R. P. Adams, and N. De Freitas. Taking the human out ofthe loop: A review of bayesian optimization. Proceedingsof the IEEE, 104(1):148–175, 2016.

[Snoek et al., 2012] J. Snoek, H. Larochelle, and R. P.Adams. Practical bayesian optimization of machine learn-ing algorithms. In Advances in neural information pro-cessing systems, pages 2951–2959, 2012.

[Thornton et al., 2012] Chris Thornton, Holger H. Hoos,Frank Hutter, and Kevin Leyton-Brown. Auto-weka: Au-tomated selection and hyper-parameter optimization ofclassification algorithms. CoRR, abs/1208.3719, 2012.

[Zhu et al., 1997] Ciyou Zhu, Richard H Byrd, Peihuang Lu,and Jorge Nocedal. Algorithm 778: L-bfgs-b: Fortran sub-routines for large-scale bound-constrained optimization.ACM Transactions on Mathematical Software (TOMS),23(4):550–560, 1997.

Supplementary Material to ‘Automated Machine Learning via ADMM’

1 Derivation of Expected Improvement (EI)

We recall that with the posterior model (18), the desired next query point θ(t+1)

maximizes the EI acquisition function

θt+1

= arg max{θi∈Ci}

EI(θ) = arg max{θi∈Ci}

El(θ)|y[(y+ − l(θ)

)I(l(θ) ≤ y+)

](28)

Substituting (18) into (28), the EI acquisition function can be simplified to

EI(θ)l′=

l(θ)−µσ= El′

[(y+ − l′σ − µ)I

(l′ ≤ y+ − µ

σ

)]= (y+ − µ)Φ

(y+ − µσ

)− σEl′

[l′I(l′ ≤ y+ − µ

σ

)]

= (y+ − µ)Φ

(y+ − µσ

)− σ

∫ y+−µσ

−∞l′φ(l′)dl′ = (y+ − µ)Φ

(y+ − µσ

)+ σφ

(y+ − µσ

), (29)

where µ and σ2 are defined in (18), Φ denotes the cdf of the standard normal distribution, φ is its pdf, and the last equality holdsdue to

∫xφ(x)dx = −φ(x) + C for some constant C. Here we omitted the constant C since it does not affect the solution to

the EI maximization problem (21).

2 Customized Combinatorial Multi-Armed Bandit (CMAB) Algorithm

We summarize the customized CMAB algorithm in Algorithm A1. In Step 1 of Algorithm A1, if we choose L <∏Ni=1Ki,

then Algorithm A1 has lower query complexity than the integer programming solver of (2). In Step 7 of Algorithm A1, problem(31) is equivalent to N independent problems

maximizezi

(zi)>ωi subject to 1>zi = 1, zi ∈ {0, 1}Ki∀i ∈ [N ]. (30)

The closed-form solution of problem (30) is given by zij = 1 for j = arg maxj∈[Ki] ωij , and zij = 0 otherwise. Step 8 of

Algorithm A1 normalizes the continuous loss f with respect to its upper bound f (lower bound is 0), and maps it to thecontinuous reward r within [0, 1]. In experiments, we found that Algorithm A1 is not very sensitive to the choice of f . Step 9of Algorithm A1 applies the idea of [Agrawal and Goyal, 2012] to convert a probabilistic reward to a binary reward. Lastly,step 10 of Algorithm A1 updates the priors of TS for combinatorial bandits [Durand and Gagne, 2014]. In our experiments, wechoose L = 30, α0 = β0 = 10, and f = 0.7 for running Algorithm A1.

3 Additional Experiments3.1 Parameter sensitivity check

We investigate how sensitive our proposed approach is to the ADMM parameter ρ and CMAB parameter f . For each param-eter combination of ρ ∈ {0.001, 0.01, 0.1, 1, 10} and f ∈ {0.5, 0.6, 0.7, 0.8, 0.9}, in Fig. A1 we present the validation error(averaged over 10 trials) by running our approach on the HCDR dataset (see Section 3.3 in this supplement). As we can see,our approach is not very sensitive to the choice of ρ and f . For consistency, we set ρ = 1 and f = 0.7 in other experiments.

3.2 Details on the baselines and evaluation schemeEvaluation scheme. The optimization is run for some maximum runtime T where each proposed configuration is trained ona set St and evaluated on Sv and the obtained score s is the objective that is being minimized by the optimizer. We ensure thatall the optimizers use the same train-validation split. Once the search is over, the history of attempted configurations is used togenerate a search time vs. holdout performance curve in the following manner for N timestamps:

• For each timestamp ti, i = 1, . . . , N, tN = T , we pick the best validation score si obtained by any configuration found bytime ti from the start of the optimization.

• Then we plot the (ti, si) pairs.

• The whole above process is repeated R times for the same T,N and tis to get inter-quartile ranges for the curves.

For the presented results, T = 4096 seconds, N = 256 and R = 10.

Algorithm A1 TS for CMAB with probabilistic rewards

1: Input: set α0 and β0 as priors of the Beta distribution, and the maximum number of iterations L. For each arm (algorithm)j, let nj(k) be the number of times arm j has been played up to bandit iteration k, and let rj(k) be the cumulative rewardassociated with arm j. Let f denote the upper bound on the loss f .

2: for k = 1, 2, . . . , L do3: for all arms j ∈ [K] do4: αj(k) = α0 + rj(k), δj(k) = δ0 + nj(k)− rj(k)5: sample ωj ∼ Beta(αj(k), δj(k))6: end for7: determine the arm selection scheme z(k) by solving

maximizez

∑Ni=1(zi)>ωi

subject to 1>zi = 1, zi ∈ {0, 1}Ki , i ∈ [N ],(31)

where ω = [(ω1)>, . . . , (ωN )>]> is the vector of {ωj}, and ωi is the subvector of ω corresponding to module i.8: apply z(k) and observe continuous reward r

r = 1−min

{max

{f(k + 1)

f, 0

}, 1

}(32)

where f(k + 1) is the loss value after applying z(k).9: observe binary reward r with Bernoulli success probability r.

10: update nj(k + 1) and rj(k + 1): nj(k + 1) = nj(k) + zj(k), rj(k + 1) = rj(k) + zj(k)r.11: end for

(a) Performance versus f for different values of ρ. (b) Heatmap of (a)

Figure A1: Validation error of our proposed ADMM-based AutoML approach against ADMM parameter ρ and CMAB parameter f

Parity with baselines. First we ensure that the operations (such as model training) are done single-threaded (to the extent pos-sible) to remove the effects of parallelism in the execution time. We set OPENBLAS NUM THREADS and OMP NUM THREADSto 1 before the evaluation of ADMM and the other baselines. ADMM can take advantage of the parallel model-training muchlike the other systems, but we want to demonstrate the optimization capability of the proposed scheme independent of theunderlying parallelization in model training. Beyond this, there are some details we note here regarding comparison of methodsbased on their internal implementation:

• For any time ti, if no predictive performance score (the objective being minimized) is available, we give that method theworst objective of 1.0 for ranking (and plotting purposes). After the first score is available, all following time stampsreport the best incumbent objective. So comparing the different baselines at the beginning of the optimization does notreally give a good view of the relative optimization capabilities – it just illustrates the effect of different starting heuristics.

• For ADMM, the first pipeline tried is Naive Bayes, which is why ADMM always has some reasonable solution even at theearliest timestamp.

• The per configuration run time and memory limits in Auto-sklearn are removed to allow Auto-sklearn to have access tothe same search space as the ADMM variants.• The ensembling and meta-learning capabilities of Auto-sklearn are disabled. The ensembling capability of Auto-sklearn

is discussed further in Section 3.5 of this supplement.• For ASKL:SMAC3, the first pipeline tried appears to be a Random Forest with 100 trees, which takes a while to be run.

For this reason, there is no score (or an objective of 1.0) for ASKL:SMAC3 until its objective suddenly drops to a morecompetitive level since Random Forests are very competitive out of the box.• For TPOT, the way the software is set up (to the best of our understanding and trials), scores are only available at the

end of any generation of the genetic algorithm. Hence, as with ASKL:SMAC3, both version of TPOT do not report anyscores until the first generation is complete (which implies worst-case objective of 1.0), and after that, the objective dropssignificantly. For the time limit considered (T = 4096 seconds), the default population size of 100 set in TPOT is unableto complete more than a couple of generations. So we reduce the population size to 10 and 50 to complete a reasonablenumber of generations within the set time.• As a baseline, TPOT has an advantage over ASKL and ADMM – TPOT is allowed to use multiple estimators, transformers

and preprocessors within a single pipeline via stacking and chaining due to the nature of the splicing and crossover schemesin its underlying genetic algorithm. This gives TPOT access to a larger search space of more complex pipelines featuringlonger as well as parallel compositions; all the remaining baselines are allowed to only use a single estimator, transformersand preprocessor. Hence the comparison is somewhat biased towards TPOT, allowing TPOT to potentially find a betterobjective in our experimental set up.• Barring the number of generations (which is guided by the maximum run time) and the population size (which are set to 10

and 50 to give us TPOT:10 and TPOT:50), the remaining parameters of mutation rate, crossover rate, subsample fractionand number of parallel threads to the default values of 0.9, 0.1, 1.0 and 1 respectively.

ASKL:RND is implemented based on the Auto-sklearn example for random search at https://automl.github.io/auto-sklearn/master/examples/example_random_search.html.

3.3 Details on the dataWe consider data sets corresponding to the binary classification task from the UCI machine learning repository [Asuncionand Newman, 2007], OpenML and Kaggle. The names, sizes and sources of the data sets are presented in Table A1. TheHCDR data set from Kaggle is a subset of the data presented in the recent Home Credit Default Risk competition (https://www.kaggle.com/c/home-credit-default-risk).

https://automl.github.io/auto-sklearn/master/examples/example_random_search.html

https://automl.github.io/auto-sklearn/master/examples/example_random_search.html

https://www.kaggle.com/c/home-credit-default-risk

https://www.kaggle.com/c/home-credit-default-risk

Table A1: Details of the data sets used for the empirical evaluations. The ‘Class ratios’ column corresponds to the ratio of the two classes inthe data set, quantifying the class imbalance in the data.

Data # rows # columns Source Class ratioSonar 208 61 UCI 1 : 0.87Heart statlog 270 14 UCI 1 : 0.8Ionosphere 351 35 UCI 1 : 1.79Oil spill 937 50 OpenML 1 : 0.05fri-c2 1000 11 OpenML 1 : 0.72PC3 1563 38 OpenML 1 : 0.11PC4 1458 38 OpenML 1 : 0.14Space-GA 3107 7 OpenML 1 : 0.98Pollen 3848 6 OpenML 1 : 1Sylvine 5124 21 OpenML 1 : 1Page-blocks 5473 11 OpenML 1 : 8.77Wind 6574 15 OpenML 1 : 1.14Delta-Ailerons 7129 6 OpenML 1 : 1.13Twonorm 7400 21 OpenML 1 : 1Bank8FM 8192 9 OpenML 1 : 1.48CPU small 8192 13 OpenML 1 : 0.43Delta-Elevators 9517 7 OpenML 1 : 0.99Japanese Vowels 9961 13 OpenML 1 : 0.19HCDR 10000 24 Kaggle 1 : 0.07Phishing websites 11055 31 UCI 1 : 1.26Mammography 11183 7 OpenML 1 : 0.02eeg-eye-state 14980 15 OpenML 1 : 0.81Cal housing 20640 9 OpenML 1 : 1.46MLSS 2017 CH#2 39948 12 OpenML 1 : 0.22D planes 40768 11 OpenML 1 : 1Electricity 45312 9 OpenML 1 : 0.74

3.4 Convergence plots for all data sets.The optimization time vs. median validation objective value curves (and the inter-quartile range) for all the data sets in Table A1are presented in Figure A2. The results indicate that no single optimizer (ADMM, ASKL or TPOT) dominate on all data setsand all time horizons. The two proposed schemes ADMM(BO, Ba) and AO(BO2, Ba) leads in some data sets – for example,see Heart-Statlog, Oil spill, Space GA, Pollen, Delta-elevators. ASKL:SMAC3 leads the pack on other data sets such as Page-blocks and CPU small. The two TPOT variants (TPOT:10 and TPOT:50) collectively lead (given enough time) on data sets suchas Fri-C2, Sylvine, Japanese Vowels, EEG-eye-state and Electricity. On many of the remaining sets, all methods converge tosimilar values – examples are Wind, Twonorm, Bank8FM, Cal housing and 2D planes. In the remaining sets, 2 out of ADMM,ASKL and TPOT collectively end up being the leaders (again, when TPOT is given enough time). Figure 2 in Section 4 of themain paper attempts at aggregating the behavior across the multiple data sets using relative ranks.

(a) Sonar (b) Heart-Statlog (c) Ionosphere (d) Oil spill (e) fri-c2

(f) PC3 (g) PC4 (h) Space GA (i) Pollen (j) Sylvine

(k) Page-blocks (l) Wind (m) Delta-Ailerons (n) Twonorm (o) Bank8FM

(p) CPU small (q) Delta elevators (r) Japanese Vowels (s) HCDR (t) Phishing websites

(u) Mammography (v) EEG-eye-state (w) Cal housing (x) MLSS2017#2 (y) 2D planes

(z) Electricity

Figure A2: Search/optimization time vs. median validation performance with the inter-quartile range over 10 trials (please view in color).The curves colored Aquamarine, Orange-Red, Grey, Blue, Fuchsia and Black correspond respectively to ADMM(BO, Ba), AO(BO2, Ba),ASKL:RND, ASKL:SMAC3, TPOT:10 and TPOT50.

(a) Bank8FM (b) CPU small (c) Delta Ailerons (d) Japanese Vowels (e) Page blocks

(f) Sylvine (g) Twonorm (h) Wind

Figure A3: Ensemble size vs. median performance on the test set and the inter-quartile range (please view in color). The Aquamarine andBlue curves correspond to ADMM(BO, Ba) and ASKL:SMAC3 respectively.

3.5 Learning ensembles with ADMMWe use the greedy selection based ensemble learning scheme proposed in Caruana, et al., 2004 [Caruana et al., 2004] and used inAuto-sklearn as a post-processing step [Feurer et al., 2015]. We run ASKL:SMAC3 and ADMM(BO, Ba) for T = 300 secondsand then utilize the following procedure to compare the ensemble learning capabilities of Auto-sklearn and our proposedADMM based optimizer:• We consider different ensemble sizes e1 = 1 < e2 = 2 < e3 = 4 . . . < emax = 32.• We perform library pruning on the pipelines found during the optimization run for a maximum search time T by picking

only the emax best models (best relative to their validation score found during the optimization phase).• Starting with the pipeline with the best s as the first member of the ensemble, for each ensemble size ej , we greedily add

the meta-model (with replacement) which results in the best performing bagged ensemble (best relative to the performances′j on the validation set Sv after being trained on the training set St).• Once the ensemble members (possibly with repetitions) are chosen for any ensemble size ej , the ensemble members are

retrained on the whole training set (the training + validation set) and the bagged ensemble is then evaluated on the unseenheld-out test set Sh to get s′j . This is done since the ensemble learning is done using the validation set and hence cannotbe used to generate a fair estimate of the generalization performance of the ensemble.• Plot the (ej , s

′j) pairs.

• The whole process is repeated R = 10 times for the same T and ejs to get error bars for s′j .

For ADMM(BO, Ba), we implement the Caruana, et al., 2006 scheme [Caruana et al., 2004] ourselves. For ASKL:SMAC3,we use the post-processing ensemble-learning based on the example presented in their documentation at https://automl.github.io/auto-sklearn/master/examples/example_sequential.html.

The inter-quartile range (over 10 trials) of the test performance of the post-processing ensemble learning for a subset of thedata sets in Table A1 is presented in Figure A3. The results indicate that the ensemble learning with ADMM is able to improvethe performance similar to the ensemble learning in Auto-sklearn. The overall performance is driven by the starting point (thetest error of the best single pipeline, corresponding to an ensemble of size 1) – if ADMM and Auto-sklearn have test objectivevalues that are close to each other (for example, in Page-blocks and Wind), their performance with increasing ensemble sizesare very similar as well.

https://automl.github.io/auto-sklearn/master/examples/example_sequential.html

https://automl.github.io/auto-sklearn/master/examples/example_sequential.html

4 Pairwise rank comparisonThese are the pairwise plots for the aggregated performance of all methods across 26 data sets. This is an expanded version ofFigure 2 in the original paper.

(a) ADMM(BO, Ba) vs. AO(BO2, Ba) (b) ADMM(BO, Ba) vs. ASKL:SMAC3

(c) AO(BO2, Ba) vs. ASKL:SMAC3 (d) AO(BO2, Ba) vs. TPOT:50

Figure A4: Pairwise rank comparison of the different AutoML schemes.

5 Experiments with larger search spaceWe choose a relative small size of model configurations in order to keep an efficient fair comparison across all baselines,auto-sklearn, TPOT and ADMM, with the same set of operators, including all imputation and rescaling. However, here is ajustification for the smaller search space:• First, a smaller search space does not necessarily imply a limited evaluation. With large search spaces, most black-box op-

timizers are just performing random search until much later in the algorithm, in which case, we are mostly just comparingthe initial “exploration” heuristic of all black-box optimizers, rendering the comparison somewhat uninformative. Witha relatively smaller search space, the black box optimizer actually gets the opportunity to perform some “exploitation”to demonstrate the convergence behaviour of the underlying algorithm. A larger search space would necessitate a muchlarger compute time for a meaningful in-depth comparison.• Also note that in most of the data sets, our experiments (Figure 1, figure A2) indicate that all algorithms (except TPOT:10)

keep improving up until the final time limit (of 4096 seconds), indicating that the limited search space is not causing theblack-box optimizers to saturate; the true “optimum” is probably yet to be found even in this small search space.• Moreover, there is a technical issue making a fair comparison of the underlying black-box optimization with parity across

all baselines hard with more operators – many of the operators in Auto-sklearn are custom preprocessors and estimators(kitchen sinks, extra trees classifier preprocessor, linear svc preprocessors, fastICA, KernelPCA, etc) or have some cus-tom handling in there (see https://github.com/automl/auto-sklearn/tree/master/autosklearn/pipeline/components). Inclusion of these operators makes it infeasible to have a fair comparison across all methods.

We did double our search space, detailed in Table A2. It represents a choice of 8×3×6 = 144 possible method combinations(contrast to Table 1). The experimental setup is exactly the same as in Section 4 in the paper. The relative performance on thedifferent datasets is presented in Figure A6 and the aggregate performance of all methods across the data sets in Figure A5.

Table A2: Overview of the scikit-learn preprocessors, transformers, and estimators used in our updated empirical evaluation.

Module Algorithm # parameters

Preprocessor

None∗Normalizer

QuantileTransformerRobustScaler

MinMaxScalerStandardScaler

ImputerOneHotEncoder

none2d†2d2c†1d3d1d1d

TransformerNonePCA

PolynomialFeatures

none1c, 1d1c, 2d

Estimator

GaussianNBQuadraticDiscriminantAnalysis

GradientBoostingClassifierKNeighborsClassifier

RandomForestClassifierExtraTreesClassifier

none1c

3c, 6d3d

1c, 5d1c, 5d

∗None means no algorithm is selected and corresponds to a empty set ofhyper-parameters. † ‘d’ and ‘c’ represents discrete and continuous variables,respectively.



Figure A5: Average ranking of the different optimizers for different optimization times across 26 data sets (please view in color). Note thelog scale on the horizontal axis. Smaller values on the vertical axis are better. See text for further details.

(a) Sonar (b) Heart-Statlog (c) Ionosphere (d) Oil spill (e) fri-c2

(f) Space GA (g) Pollen (h) Delta-Ailerons

Figure A6: Search/optimization time vs. median validation performance with the inter-quartile range over 10 trials (please view in color).The curves colored Aquamarine, Orange-Red, Grey, Blue, Fuchsia and Black correspond respectively to ADMM(BO, Ba), AO(BO2, Ba),ASKL:RND, ASKL:SMAC3, TPOT:10 and TPOT50.

6 TPOT pipelines: Variable length, order and non-sequentialThe genetic algorithm in TPOT does stitch pipelines together to get longer length as well as non-sequential pipelines, using thesame module multiple times and in different ordering. Given the abilities to

i have variable length and variable ordering of modules,ii reuse modules, and

iii have non-sequential parallel pipelines,

TPOT does have access to a much larger search space than auto-sklearn and ADMM. Here are some examples for our experi-ments:

[Sequential, length 3 with 2 estimators]Input --> PolynomialFeatures --> KNeighborsClassifier --> GaussianNB

GaussianNB(KNeighborsClassifier(

PolynomialFeatures(input_matrix,PolynomialFeatures__degree=2,PolynomialFeatures__include_bias=False,PolynomialFeatures__interaction_only=False

),KNeighborsClassifier__n_neighbors=7,KNeighborsClassifier__p=1,KNeighborsClassifier__weights=uniform

))

[Sequential, length 4 with 3 estimators]Input --> PolynomialFeatures --> GaussianNB --> KNeighborsClassifier --> GaussianNB

GaussianNB(KNeighborsClassifier(

GaussianNB(PolynomialFeatures(

input_matrix,PolynomialFeatures__degree=2,PolynomialFeatures__include_bias=False,PolynomialFeatures__interaction_only=False

)),KNeighborsClassifier__n_neighbors=7,KNeighborsClassifier__p=1,KNeighborsClassifier__weights=uniform

))

[Sequential, length 5 with 4 estimators]Input

--> RandomForestClassifier--> RandomForestClassifier

--> GaussianNB--> RobustScaler

--> RandomForestClassifier

RandomForestClassifier(RobustScaler(

GaussianNB(RandomForestClassifier(

RandomForestClassifier(input_matrix,RandomForestClassifier__bootstrap=False,RandomForestClassifier__criterion=gini,RandomForestClassifier__max_features=0.68,RandomForestClassifier__min_samples_leaf=16,RandomForestClassifier__min_samples_split=13,RandomForestClassifier__n_estimators=100

),RandomForestClassifier__bootstrap=False,RandomForestClassifier__criterion=entropy,RandomForestClassifier__max_features=0.9500000000000001,RandomForestClassifier__min_samples_leaf=2,RandomForestClassifier__min_samples_split=18,RandomForestClassifier__n_estimators=100

))

),RandomForestClassifier__bootstrap=False,RandomForestClassifier__criterion=entropy,RandomForestClassifier__max_features=0.48,RandomForestClassifier__min_samples_leaf=2,RandomForestClassifier__min_samples_split=8,RandomForestClassifier__n_estimators=100

)

[Non-sequential]Combine[ Input, Input --> GaussianNB --> PolynomialFeatures --> Normalizer ]

--> RandomForestClassifier

RandomForestClassifier(CombineDFs(

input_matrix,Normalizer(

PolynomialFeatures(GaussianNB(

input_matrix),PolynomialFeatures__degree=2,PolynomialFeatures__include_bias=True,PolynomialFeatures__interaction_only=False

),Normalizer__copy=True,Normalizer__norm=l2

)),RandomForestClassifier__bootstrap=False,RandomForestClassifier__criterion=entropy,RandomForestClassifier__max_features=0.14,RandomForestClassifier__min_samples_leaf=7,RandomForestClassifier__min_samples_split=8,RandomForestClassifier__n_estimators=100

)

Documents

Automated Machine Learning via ADMM