Weight-Sharing for Hyperparameter Optimization in ...mkhodak/docs/FL2020Workshop.pdf · Weight-Sharing for Hyperparameter Optimization in Federated Learning Mikhail Khodak 1Tian Li

Weight-Sharing for Hyperparameter Optimization in Federated Learning

Mikhail Khodak 1 Tian Li 1 Liam Li 2 Maria-Florina Balcan 1 Virginia Smith 1 Ameet Talwalkar 1 2

Abstract

Hyperparameter optimization is an importantcomponent of the machine learning pipeline, anda crucial but expensive part of automating theapplication of the latest methods. It is madeeven more difficult in the case of federated learn-ing (FL), where models are learned over a dis-tributed network of heterogeneous devices; herethe need for local training and keeping data on-device makes it difficult to train and evaluate dif-ferent configurations efficiently. We identify theneural architecture search (NAS) technique ofweight-sharing—the simultaneous optimizationof multiple neural networks using the same param-eters—as a way to accelerate tuning hyperparam-eters for algorithms such as Federated Averaging(FedAvg) in the case of federated learning withpersonalization. Using the connection betweenthis setting and meta-learning, and drawing in-spiration from differentiable NAS, we proposea practical method for federated hyperparameteroptimization called FedEx based upon the expo-nentiated gradient update. Measured using finaltest accuracy across five random trials, our algo-rithm outperforms a natural baseline for federatedhyperparameter tuning by 1.0-1.9% across severaldifferent settings on the Shakespeare dataset.

1. IntroductionFederated learning has emerged as a critical framework forapplying machine learning in settings where data is dis-tributed across computational units and training must bedone locally or privately (McMahan et al., 2017; Li et al.,2020c). While tuning hyperparameters of FL methods hasbeen identified as an important problem (Kairouz et al.,2019), the direction has seen limited work; this is despite

1Carnegie Mellon University 2Determined AI. Correspondence to:Mikhail Khodak <[email protected]>.

International Workshop on Federated Learning for User Privacyand Data Confidentiality in Conjunction with ICML 2020.

Copyright 2020 by the author(s).

major differences from the usual hyperparameter optimiza-tion setting that make applying standard algorithms possiblebut difficult.

We make progress on this challenge using the weight-sharing technique widely ued in neural architecture search(Pham et al., 2018; Liu et al., 2019; Cai et al., 2019). Whileearly NAS methods trained multiple architectures to comple-tion from scratch in order to evaluate them, weight-sharingdramatically reduces the training cost to that of a singlesuper-network encompassing all possible architectures inthe search space. Although the standard weight-sharingformulation only allows for accelerating the tuning of archi-tectural hyperparameters such as the type of layer or acti-vation function, and not critical FL hyperparameters suchas the settings of local stochastic gradient descent (SGD),we show that by considering the case of FL with person-alization (Smith et al., 2017) we are able to tune most ofthese as well. Furthermore, because many FL algorithmssuch as FedAvg (McMahan et al., 2017), FedProx (Liet al., 2020d), and SCAFFOLD (Karimireddy et al., 2019)are based on local training, the FedEx procedure we intro-duce leads to models that are good both globally and foron-device personalization.

Our contributions are as follows:

• We formalize the problem of hyperparameter optimiza-tion in FL and discuss key challenges and reasonablebaselines.

• By applying techniques from NAS with weight-sharingto FL with personalization, we introduce FedEx, asimple modification of local training-based FL meth-ods that enables hyperparameter tuning for a varietyof FL algorithms such as FedAvg, FedProx, andSCAFFOLD. Our method can, in-principle, also tuneinitialization-based meta-learning algorithms such asMAML (Finn et al., 2017) and Reptile (Nichol et al.,2018).

• We instantiate FedEx to tune hyperparameters ofFedAvg on the Shakespeare benchmark in FL (McMa-han et al., 2017; Caldas et al., 2018). Our methodoutperforms a strong baseline in both the i.i.d. andnon-i.i.d. settings and using both online evaluation andfinal test accuracy.


1.1. Related Work

While hyperparameter optimization has been the subject ofintensive empirical and theoretical study (Hutter et al., 2011;Bergstra & Bengio, 2012; Li et al., 2018), to our knowledgewe are the first to analyze its formulation and challenges inthe FL setting. Methodologically, our approach draws on thefact that FedAvg can be viewed as optimizing a surrogateobjective for on-device personalization (Khodak et al., 2019)and more broadly the similarity of the personalized FL setupand initialization-based meta-learning (Chen et al., 2018; Liet al., 2020a; Jiang et al., 2019). This connection and ourapplication of weight-sharing NAS techniques also makesresearch connecting NAS and meta-learning relevant (Lianet al., 2020; Elsken et al., 2019), but unlike these methodsour focus is on tuning non-architectural parameters for FLrather than on finding architectures in few-shot learning. Infact, we believe our work is the first to apply weight-sharingto regular hyperparameter optimization. Furthermore, thefew-shot learning setting does not have the data-access andcomputational restrictions of FL, where such methods arelikely impractical due to their use of the DARTS mixturerelaxation (Liu et al., 2019). Instead, FedEx employs thelower-overhead stochastic relaxation (Li & Talwalkar, 2019;Dong & Yang, 2019), and its exponentiated update is basedupon the recently proposed GAEA approach for NAS (Liet al., 2020b).

2. Federated Hyperparameter OptimizationIn this section we formalize the problem of hyperparame-ter optimization for federated learning and discussion theconnection of its personalized variant to meta-learning. Wealso review FedAvg (McMahan et al., 2017), a commonmethod for federated optimization, and present a reasonablebaseline approach for tuning its hyperparameters.

2.1. Global and Personalized FL

FL is concerned with learning and optimization over a net-work of heterogeneous clients i = 1, . . . , n, each with train-ing, validation, and testing sets Ti, Vi, and Ei, respectively.We will use LS(w) to denote the average loss over a datasetS of some w-parameterized ML model, for w ∈ Rd somereal vector. To formulate hyperparameter optimization inthis setting, we assume a class of algorithms Alga hyperpa-rameterized by a ∈ A that use federated access to trainingsets Ti to output some element of Rd. Here by federatedaccess we mean that each iteration corresponds to a com-munication round at which Alga has access to some batchof B clients1 that can do local training and validation.

Specifically, we assume Alga can be described by two

1For simplicity the number of clients is fixed at each round, but allmethods can be easily extended to the more general case.

subroutines whose hyperparameters are encoded by settingsb ∈ B and c ∈ C, so that a = (b, c) and A = B × C: cencodes hyperparameters of a local training algorithm SGDcthat takes a training set S and initialization w ∈ Rd as inputand outputs a model SGDc(S,w) ∈ Rd, while b encodeshyperparameters of an aggregation step Aggb that takes theinitialization w and several outputs of SGDc as input andoutputs a new model parameter in Rd. For example, in asimple instantiation of FedAvg, SGDc will be T steps ofgradient descent with learning rate η and Aggb will takethe average across clients of the outputs of SGDc, with eachoutput weighted by the size of the corresponding client’straining data; here we have c = (η, T ) and b = (). Thisassumption also holds for the FL methods FedProx (Liet al., 2020d) and SCAFFOLD (Karimireddy et al., 2019).

The global hyperparameter optimization problem is then

mina∈A

n∑i=1

|Vi|LVi(Alga({Tj}nj=1)) (1)

However, in many practical cases we are also interested inobtaining a device-specific local model, where we take amodel trained on all clients and finetune it on each individualclient before evaluating. A key assumption we make isthat the finetuning algorithm will be the same as the localtraining algorithm SGDc used by Alga. This assumptioncan be justified by recent work in meta-learning that showsthat algorithms that aggregate the outputs of local SGD canbe viewed as optimizing for personalization using local SGD(Khodak et al., 2019). Then, in the personalized setting, thehyperparameter optimization problem becomes

mina=(b,c)∈A

n∑i=1

|Vi|LVi(SGDc(Ti,Alga({Tj}nj=1)) (2)

Our approach will focus on the setting where the hyperpa-rameters c of local training make up a significant portion ofall hyperparameters a = (b, c); by considering the personal-ization objective we will be able to treat such hyperparame-ters as architectural and thus apply weight-sharing.

2.2. Tuning FL Methods: Challenges and Baselines

In the non-federated setting, the global objective (1) isamenable to regular hyperparameter optimization methods;for example, a random search approach would repeatedlysample a hyperparameter setting a from some distributionover A, run Alga to completion, and evaluate the objecive,saving the best setting and output (Bergstra & Bengio, 2012).With a reasonable distribution and enough configurationsamples this is guaranteed to converge, and the approachcan accelerated using early stopping (Li et al., 2018), inwhich Alga is not always run to completion if the desiredobjective is poor at intermediate stages, or by adapting the


sampling distribution using the results of previous objectiveevaluations (Snoek et al., 2012).

There are two main challenges in applying such methods toFL with personalization:

1. The validation data is federated and so the entiredataset is not available at any one time; instead a cen-tral server is given access to some number of devicesat each communication round, for one or at most a fewruns of local training and validation. Thus, becausethe standard measure of complexity in FL is the num-ber of communication rounds, computing the objectiveexactly dramatically increases the cost.

2. Even with non-federated data, applying standard meth-ods to the personalized objective (2) is costly becauseevaluating it requires running local training n times.

Our FedEx method handles both issues by effectively us-ing noisy validations at each round, in a manner similar tohow SGD can accelerate regular noiseless gradient descent.While the second issue is introduced by our considerationof a personalized objective, the fact that we can solve italso points to the potential of FedEx to tune hyperparame-ters of initialization-based meta-learning algorithms such asMAML (Finn et al., 2017) and Reptile (Nichol et al., 2018).

Before introducing our method, we first state a reasonablebaseline for tuning hyperparameters in FL, which is to usea regular hyperparameter optimization method but use vali-dation data from a single round as a noisy surrogate for thefull validation objective. Specifically, we use the successiveelimination (SEA) algorithm, a simple generalization of ran-dom search in which we sample a set of hyperparametersand partially run all of them for some number of communi-cation rounds before eliminating all but the top 1

η of them,repeating the process until only one configuration remains.Note that this is equivalent to a “bracket” in Hyperband (Liet al., 2018).

3. Weight-Sharing for Personalized FLWe now present FedEx, a method for tuning local hyperpa-rameters in FL based on the weight-sharing approach fromNAS. This section contains the general algorithm; we willinstantiate it in the next section for FedAvg.

3.1. Weight-Sharing for Architecture Search

We first review the weight-sharing approach in NAS. NASis often posed as the bilevel optimization problem

minc∈C

Lvalid(w, c) s.t. w ∈ arg minu∈Rd

Ltrain(u, c) (3)

where C is a set of network configurations and Ltrain, Lvalidevaluate a single set of network configurations and weights.

Algorithm 1: Baseline successive elimination algorithmfor tuning personalized FL. For the non-personalized ob-jective (1), replace LVti(wi) by LVti(wa). For randomsearch with N samples, set η = N and R = 1.Input: distribution D over hyperparameters A,

elimination rate η ∈ N, elimination roundsτ0 = 0, τ1, . . . , τR

sample set of ηR hyperparameters H ∼ D[ηR]

initialize a model wa ∈ Rd for each a ∈ Hfor elimination round r ∈ [R] do

for setting a = (b, c) ∈ H dofor comm. round t = τr−1 + 1, . . . , τr do

for client i = 1, . . . , B dosend wa, c to clientwi ← SGDc(Tti,wa)send wi, LVti(wi) to server

wa ← Aggb(wa, {wi}Bi=1)

sa ←∑Bi=1 |Vti|LVti(wi)/

∑Bi=1 |Vti|

H ← {a ∈ H : sa ≤ 1η -quantile({sa : a ∈ H})}

Output: remaining a ∈ H and associated model wa

If all hyperparameters are architectural, as is the case inNAS, then they are effectively themselves trainable modelparameters (Li et al., 2020b); thus we could instead con-sider solving the following “single-level” empirical riskminimization (ERM) problem

minc∈C,w∈Rd

L(w, c) (4)

Here L(w, c) = Ltrain(w, c) + Lvalid(w, c). Solving thisoptimization instead of the bilevel problem (3) has beenproposed in several recent papers (Li et al., 2019; 2020b).

Early approaches to solving either the bilevel or ERM for-mulation of NAS were costly due to the need for full orpartial training of many architectures in a very large searchspace. The weight-sharing paradigm of Pham et al. (2018)reduces the problem to that of training a single architecture,a “supernet” containing all architectures in the search spaceC. A straightforward way of constructing a supernet is viaa “stochastic relaxation” in which the loss is taken as anexpectation over drawing c from some distribution over C(Dong & Yang, 2019). Then the shared weights can be up-dated using SGD by first sampling an architecture c fromthis distribution and then using a standard unbiased estimateof ∇wL(w, c) to update w.

The distribution over C may itself be adapted or stay fixed;in the latter case the shared parameter obtained at the end oftraining must be a good surrogate for minimizers of L(·, c)for all architectures c ∈ C, something that surprisinglyholds true on certain benchmarks (Li & Talwalkar, 2019).We will focus on the other case, where we adapt some θ-


parameterized distribution Dθ over C for θ ∈ Θ; we thenhave the objective

minθ∈Θ,w∈Rd

Ec∼DθL(w, c) (5)

Since architectural hyperparameters are often discrete deci-sions, e.g. a choice of which of a fixed number of operationsto use or which two nodes to connect, a natural choice ofDθis as a product of categorical distributions over simplices.Note that in this case, any discretization of an optimum θ ofthe relaxed objective (5) whose support is in the support ofθ will be an optimum of the original objective (4). A naturalupdate scheme here is exponentiated gradient (Kivinen &Warmuth, 1997), in which each successive θ is proportionalto θ � exp(−η∇), where η is some learning rate and ∇is an unbiased estimate of ∇θEc∼DθL(w, c) that can becomputed using the re-parameterization trick (Rubinstein& Shapiro, 1993). By alternating this exponentiated updatewith the standard SGD update to w discussed earlier weobtain a simple block-stochastic minimization scheme thatis guaranteed to converge, under certain conditions, for theERM objective and also performs well in practice (Li et al.,2020b).

3.2. The FedEx Method

To obtain FedEx from weight-sharing we restrict to thecase where we are only interested in the hyperparameters cof the local training subroutine SGDc. We can then replacethe personalized FL objective (2) by

minc∈C,w∈Rd

n∑i=1

|Vi|LVi(SGDc(Ti,w)) (6)

Note that since Alga outputs an element of Rd this objectivewill be at least as good as the original (2). The new objectiveis in the form of the weight-sharing objective (4) and so wecan replace it by the stochastic relaxation

minθ∈Θ,w∈Rd

n∑i=1

|Vi|Ec∈DθLVi(SGDc(Ti,w)) (7)

Unlike in NAS, FL hyperparameters such as the learningrate are not extreme points of a simplex and so it is less clearwhat parameterized distribution Dθ to use. Nevertheless,we find that by crudely imposing a categorical distributionover k > 1 random samples from Unif[C] and updating θusing exponentiated gradient over the resulting k-simplexworks well. This is alternated with updating w ∈ Rd, whichin a NAS algorithm would involve an SGD update usingan unbiased estimate of the gradient of the objective at thecurrent w and θ. Following the Reptile method from meta-learning (Nichol et al., 2018), we can view the other sub-routine Aggb of Alga that aggregates the outputs of local

Algorithm 2: FedEx algorithm for tuning local trainingin personalized FL.Input: space C of local training hyperparameters, size

k of categorical distribution, setting b for Aggb,scheme for setting step-size ηt and baseline λt,total number of steps τ ≥ 1

sample k configurations c1, . . . , ck ∈ Unif[C]initialize θ ∈ Rk with θj = 1/k ∀ jinitialize shared weights w ∈ Rdfor comm. round t = 1, . . . , τ do

for client i = 1, . . . , B dosend w, θ to clientsample cij ∼ Dθwi ← SGDcij (Tti,w)send wi, cij , LVti(wi) to server

w← Aggb(w, {wi}Bi=1)set step size ηt and baseline λtset ∇j ←

∑Bi=1 |Vti|(LVti (wi)−λt)1cij=cj

θj∑Bi=1 |Vti|

∀ jθ ← θ � exp(−ηt∇)θ ← θ/‖θ‖1

Output: model w and hyperparameter distribution θ

training as performing an update using a gradient surrogateof the personalization objective (Khodak et al., 2019).

We call this alternating method for solving (7) FedEx anddescribe it for a general Alga consisting of sub-routinesAggb and SGDc in Algorithm 2. The main overhead on topof these sub-routines is in the last four lines of the outer loopfor the update to θ. Thus, as with weight-sharing, FedExcan be viewed as reducing the complexity of tuning localhyperparameters to that of training a single model. Eachupdate to θ requires a step-size ηt, setting which we describein Section 4.2 and the Appendix, and an approximation ∇of the gradient w.r.t. θ. As noted before, we can obtain anapproximation ∇j of each gradient entry using the reparam-eterization trick. The variance of this Monte Carlo estimatecan be reduced by subtracting a baseline λt, setting whichwe describe in the Appendix.

3.3. Wrapping FedEx

We can view FedEx as an algorithm of the form tunedby Algorithm 1 that implements federated training of asupernet parameter (w, θ), with the local training routineSGD including a step for sampling c ∼ Dθ and the serveraggregation routine including an exponentiated update of θ.Therefore we can wrap FedEx in Algorithm 1, which wefind useful for a variety of reasons:

• The wrapping algorithm can tune the settings of b forthe aggregation step Aggb, which FedEx cannot.


• FedEx itself has a few hyperparameters, e.g. how thebaseline λt is computed and the size k of the categori-cal distribution; rather than settting these arbitrarily wecan also tune them. We do fix a cutoff parameter forFedEx, which is an entropy threshold below whichwe stop the exponentiated update; we discuss the needfor this in Section 4.2.

• By running multiple seeds and potentially using earlystopping, we can run FedEx using more aggressivesteps-sizes and the wrapping algorithm will discardcases where this leads to poor results.

• We can directly compare FedEx to a regular hyper-parameter optimization scheme run over the originalalgorithm, e.g. FedAvg, by using the same scheme toboth wrap FedEx and tune FedAvg.

4. Empirical ResultsIn our experiments, we instantiate FedEx on FedAvg, themost popular algorithm for federated training of deep mod-els. At communication round t FedAvg uses the aggrega-tion sub-routine

Aggb(w, {wi}Bi=1) = (1− εt)w+εt∑B

i=1 |Tti|

B∑i=1

|Tti|wi

(8)for some learning rate εt > 0 that can vary through time.The local training sub-routine SGDc used by FedEx is SGDwith hyperparameters c over some objective defined by thetraining data Tti, which can also depend on c.

Note that the original FedAvg scheme is the special caseof (8) with εt = 1, i.e. the update is the average of theclient model output. Thus, strictly speaking we are tuning ageneralization of FedAvg with a server learning rate, alsoknown as Reptile (Nichol et al., 2018). We do this in order todecrease model variation over time by decaying εt over time,in contrast to FedAvg which accomplishes the same thingby decaying the learning rate of local SGD; the latter notmake sense for the personalization objective (2) as it makesthe amount the model is personalized time-dependent.

4.1. Baseline Comparisons

We tune several hyperparameters of both aggregation andlocal training; for the former we tune the server learningrate schedule and momentum, found to be helpful for per-sonalization by Jiang et al. (2019); for the latter we tunethe learning rate, momentum, weight-decay, the batch-size,dropout, and proximal regularization. Note that our tuningof the latter hyperparameter effectively means we are tuningthe FedProx generalization of FedAvg (Li et al., 2020d).

Because our baseline is running Algorithm 1, a standard

Figure 1. Evaluation of FedEx on the Shakespeare next-characterprediction dataset in the i.i.d. client data setting (top) and the fullynon-i.i.d. setting (bottom). We report global model performance onthe left and personalized performance on the right. All evaluationsare run for five trials.

hyperparameter tuning algorithm, to tune FedAvg, and be-cause we need to also wrap FedEx in such an algorithmfor the reasons described in Section 3, our empirical resultswill test the following question: does FedEx, wrapped by asuccessive elimination algorithm, do better than successiveelimination run with the same settings on FedAvg? Here‘better’ will mean both the final test accuracy obtained andthe online evaluation setting, which tests how well hyperpa-rameter optimization is doing at intermediate phases.

We run Algorithm 1 on the personalized objective and setthe elimination rate to be η = 3, as in Hyperband (Li et al.,2018). To obtain the elimination rounds in Algorithm 1,we set the number of eliminations to R = 3, fix a totalcommunication round budget, and fix a maximum numberof rounds to be allocated to any configuration a; as detailedin the Appendix, this allows us to determine T1, . . . , TR soas to use up as much of the budget as possible.

We use the Shakespeare dataset and consider two settings:

1. i.i.d. client data, in which client data is shuffled beforesplitting into train, validation, and test sets.

2. fully non-i.i.d. data, in which client data, consistingof a single actor’s lines, is split according to temporalposition in the play.

In both cases we use 80% for training and 10% each forvalidation and testing. We set a total communication roundbudget of 4000 rounds and allow each individual configura-tion to be run for at most 800 communication rounds. Theactual model trained follows McMahan et al. (2017), whouse a two-layer character-LSTM.


Figure 2. Modifying Algorithm 1 to use more of the validationdata for elimination decisions has minimal effect.

As reported in Figure 1, we see that FedEx does better thanthe baseline of tuning FedAvg directly in both data settings,for both global and personalized models, and in both onlineevaluation—FedEx is almost always better throughout thetuning process—and in terms of final test error, where theimprovement is between 1.0-1.9%.

4.2. Discussion

We now discuss two finer questions about our empirical com-parison, which will also allow us to discuss how FedExworks. First, one possible criticism of the baseline—tuningFedAvg using successive elimination—is that we use verylittle of the validation data. This is a result of the sa quantityin Algorithm 1, which measures the current performance ofconfiguration a, depending only on the validation data of theprevious communication round, which leads to noisy elimi-nation decisions. We could use validation data from moreclients to get a better estimate of the performance of eachconfiguration, but this would cost resources (communica-tion rounds). Alternatively, we could use a state estimationmethod to use previous sa values to construct an estimateof the true validation performance.

We try a crude state estimation approach, in which we makeelimination decisions in Algorithm 1 using a γ-discountedsum of the quantities sa over since the previous eliminationround, for γ = 1.0 (the average over all quantities) andγ = 0.5 (discounting by reciprocals of powers of two). Notethat γ = 0.0 just uses the very last quantity sa, as specifiedin Algorithm 1. We see that discounting only helps by anegligible amount compared to the benefit given by FedEx,and in any case this scheme, or a more complicated model-based state estimation approach such as a Kalman filter, canalso be applied to wrap FedEx.

A more interesting questions is the setting of the step-size ηtfor the exponentiated gradient update in FedEx. We exam-ine three different settings: a constant rate of ηt =

√2 log k,

an ‘adaptive’ schedule of ηt =√

2 log k/√∑

s≤t ‖∇s‖2∞,

and an ‘aggressive’ schedule of ηt =√

2 log k/‖∇t‖∞.Here ∇t is the stochastic gradient w.r.t. θ computed in Al-gorithm 2 at step t and the form of the step-size is derived

Figure 3. Comparison of different step-size schedules for ηt inFedEx using final test accuracy (left) and entropy over time (right).Results are from ten trials of Algorithm 2.

from standard settings for exponentiated gradient in onlinelearning (Shalev-Shwartz, 2011).

As shown in Figure 3, we find that the ‘aggressive’ sched-ule works best in practice, although all approaches workbetter than FedAvg and often successfully train the sharedweights w. A key issue with using the lower-variance ‘adap-tive’ approach is that it continues to assign high probabilityto several configurations late in the tuning process, as shownin the slow entropy decrease; this slows down training ofthe shared weights. One could consider a tradeoff betweenallowing FedEx to run longer than FedAvg while keepingthe total budget constant, but for simplicity we chose themore effective ‘aggressive’ schedule. Note, however, thatthis choice could lead to instability later during training, asshown in the entropy variability in Figure 3; this can leadto catastrophically poor performance if bad configurationsare chosen. As discussed earlier, we avoid this problemby stopping the exponentiated update after a low enoughentropy, specifically when it reaches 1E-4. Finally, note thatthe rapid decrease in entropy of the ‘aggressive’ schememeans that FedEx effectively picks a single configurationvery quickly, and can thus be viewed as a way to improvethe configuration distribution for successive elimination.

5. ConclusionFedEx is a simple, low-overhead algorithm for acceleratingthe tuning of hyperparameters in federated learning that isapplicable to many popular FL algorithms. It is inspired byour connection of this challenging setting to weight-sharingin NAS, as well as to meta-learning via personalization. Fur-thermore, it effectively tunes FedAvg on standard bench-marks. Finally, we note that the scope of application ofFedEx is very broad, including tuning actual architecturalhyperparameters rather than just settings of local SGD, i.e.doing federated NAS, and tuning initialization-based meta-learning algorithms such as Reptile and MAML.


ReferencesBergstra, J. and Bengio, Y. Random search for hyper-

parameter optimization. Journal of Machine LearningResearch, 13:281–305, 2012.

Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neuralarchitecture search on target task and hardware. In Pro-ceedings of the 7th International Conference on LearningRepresentations, 2019.

Caldas, S., Wu, P., Li, T., Konecny, J., McMahan, H. B.,Smith, V., and Talwalkar, A. LEAF: A benchmark forfederated settings. arXiv, 2018.

Chen, F., Dong, Z., Li, Z., and He, X. Federated meta-learning for recommendation. arXiv, 2018.

Dong, X. and Yang, Y. Searching for a robust neural archi-tecture in four GPU hours. In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition,2019.

Elsken, T., Staffler, B., Metzen, J. H., and Hutter, F. Meta-learning of neural architectures for few-shot learning.arXiv, 2019.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceed-ings of the 34th International Conference on MachineLearning, 2017.

Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequentialmodel-based optimization for general algorithm config-uration. In Proceedings of the International Conferenceon Learning and Intelligent Optimization, 2011.

Jiang, Y., Konecny, J., Rush, K., and Kannan, S. Improvingfederated learning personalization via model agnosticmeta learning. arXiv, 2019.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis,M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode,G., Cummings, R., D’Oliveira, R. G. L., Rouayheb, S. E.,Evans, D., Gardner, J., Garrett, Z., Gascon, A., Ghazi, B.,Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He,L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi,T., Joshi, G., Khodak, M., Konecny, J., Korolova, A.,Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P.,Mohri, M., Nock, R., Ozgur, A., Pagh, R., Raykova, M.,Qi, H., Ramage, D., Raskar, R., Song, D., Song, W., Stich,S. U., Sun, Z., Suresh, A. T., Tramer, F., Vepakomma, P.,Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H.,and Zhao, S. Advances and open problems in federatedlearning. arXiv, 2019.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S. J., Stich,S. U., and Suresh, A. T. SCAFFOLD: Stochastic con-trolled averaging for federated learning. arXiv, 2019.

Khodak, M., Balcan, M.-F., and Talwalkar, A. Adaptivegradient-based meta-learning methods. In Advances inNeural Information Processing Systems, 2019.

Kivinen, J. and Warmuth, M. K. Exponentiated gradientversus gradient descent for linear predictors. Informationand Computation, 132:1–63, 1997.

Li, G., Zhang, X., Wang, Z., Li, Z., and Zhang, T. Stac-NAS: Towards stable and consistent differentiable neuralarchitecture search. arXiv, 2019.

Li, J., Khodak, M., Caldas, S., and Talwalkar, A. Differ-entially private meta-learning. In Proceedings of the 8thInternational Conference on Learning Representations,2020a.

Li, L. and Talwalkar, A. Random search and reproducibilityfor neural architecture search. In Proceedings of the Con-ference on Uncertainty in Artificial Intelligence, 2019.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., andTalwalkar, A. Hyperband: A novel bandit-based approachto hyperparameter optimization. Journal of MachineLearning Research, 18(185):1–52, 2018.

Li, L., Khodak, M., Balcan, M.-F., and Talwalkar, A.Geometry-aware gradient algorithms for neural architec-ture search. arXiv, 2020b.

Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Feder-ated learning: Challenges, methods, and future directions.IEEE Signal Processing Magazine, 37, 2020c.

Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A.,and Smith, V. Federated optimization in heterogeneousnetworks. In Proceedings of the Conference on MachineLearning and Systems, 2020d.

Lian, D., Zheng, Y., Xu, Y., Lu, Y., Lin, L., Zhao, P., Huang,J., and Gao, S. Towards fast adaptation of neural archi-tectures with meta-learning. In Proceedings of the 8thInternational Conference on Learning Representations,2020.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiablearchitecture search. In Proceedings of the 7th Interna-tional Conference on Learning Representations, 2019.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., andAguera y Arcas, B. Communication-efficient learningof deep networks from decentralized data. In Proceed-ings of the 20th International Conference on ArtificalIntelligence and Statistics, 2017.

Nichol, A., Achiam, J., and Schulman, J. On first-ordermeta-learning algorithms. arXiv, 2018.


Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J.Efficient neural architecture search via parameter sharing.In Proceedings of the 35th International Conference onMachine Learning, 2018.

Rubinstein, R. Y. and Shapiro, A. Discrete Event Systems:Sensitivity Analysis and Stochastic Optimization by theScore Function Method. John Wiley & Sons, Inc., 1993.

Shalev-Shwartz, S. Online learning and online convex opti-mization. Foundations and Trends in Machine Learning,4(2):107—-194, 2011.

Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A.Federated multi-task learning. In Advances in NeuralInformation Processing Systems, 2017.

Snoek, J., Larochelle, H., and Adams, R. P. Practicalbayesian optimization of machine learning algorithms.In Advances in Neural Information Processing Systems,2012.


A. FedEx DetailsA.1. Stochastic Gradient used by FedEx

Below is a simple calculation showing that the stochastic gradient used to update the categorical architecture distribution ofFedEx is an unbiased approximation of the true gradient w.r.t. its parameters.

∇θjEcij |θLVti(wi) = ∇θjEcij |θ(LVti(wi)− λ)

= Ecij |θ((LVti(wi)− λ)∇θj logPθ(cij)

)= Ecij |θ

((LVti(wk)− λ)∇θj log

n∏i=1

Pθ(cij = cj)

)

= Ecij |θ

((LVti(wi)− λ)

n∑i=1

∇θj logPθ(cij = cj)

)

= Ecij |θ(

(LVti(wi)− λ)1cij=cjθj

)A.2. Hyperparameters of FedEx

As discussed in Section 3, FedEx itself has several settings that the wrapping method, Algorithm 1, can tune. Specifically,we tune the number k of configurations in the categorical distribution over the grid between {1, . . . , ηR}, which is thenumber of arms used by Algorithm 1. We also tune the computation of the baseline λt, which we set to

λt =1∑

s<t γt−s

∑s<t

γt−s∑Bi=1 |Vti|

B∑i=1

LVti(wi)

for discount factor γ ∈ [0, 1].

B. Experimental DetailsB.1. Settings of the Baseline/Wrapper Algorithm

We use the same settings of Algorithm 1 for both tuning FedAvg and for wrapping FedEx. Given an elimination rate η,number of elimination rounds R, resource budget B, and maximum rounds per arm M , we assign T1, . . . , TR s.t.

Ti − Ti−1 =T −M

ηn+1−1η−1 − n− 1

(recall T0 = 0)and assign any remaining resources to maximize resource use.

B.2. Hyperparameters of FedAvg

Server hyperparameters (learning rate εt = γt):

momentum : Unif{0, 0.9}log10(1− γ) : Unif{−4,−2}

Local training hyperparameters:

log10(lr) : Unif[−4, 0]

momentum : Unif{0.0, 0.9}log10(weight-decay) : Unif[−5,−1]

log2(batch) : Unif{3, 4, 5, 6, 7}dropout : Unif{0, 0.5}

Documents

Weight-Sharing for Hyperparameter Optimization in ...mkhodak/docs/FL2020Workshop.pdf · Weight-Sharing for Hyperparameter Optimization in Federated Learning Mikhail Khodak 1Tian Li