
FEDERATED MULTI-MINI-BATCH: AN EFFICIENT TRAINING APPROACH TO FEDERATED LEARNING IN NON-IID ENVIRONMENTS

Mohammad Bakhtiari * 1 Reza Nasirigerdeh * 2 Reihaneh Torkzadehmahani 2 Amirhossein Bayat 3

David B. Blumenthal 2 Markus List 2 Jan Baumbach 2

ABSTRACT
Federated learning is a well-established approach to privacy-preserving training of a joint model on heavily distributed data. Federated averaging (FedAvg) is a well-known communication-efficient algorithm for federated learning, which performs well if the data distribution across the clients is independently and identically distributed (IID). However, FedAvg provides a lower accuracy and still requires a large number of communication rounds to achieve a target accuracy when it comes to Non-IID environments. To address the former limitation, we present federated single mini-batch (FedSMB), where the clients train the model on a single mini-batch from their dataset in each iteration. We show that FedSMB achieves the accuracy of the centralized training in Non-IID configurations, but in a considerable number of iterations. To address the latter limitation, we introduce federated multi-mini-batch (FedMMB) as a generalization of FedSMB, where the clients train the model on multiple mini-batches (specified by the batch count) in each communication round. FedMMB decouples the batch size from the batch count and provides a trade-off between the accuracy and communication efficiency in Non-IID settings. This is not possible with FedAvg, in which a single parameter determines both the batch size and batch count. The simulation results illustrate that FedMMB outperforms FedAvg in terms of accuracy, communication efficiency, and computational efficiency, and is an efficient training approach to federated learning in Non-IID environments.

1 INTRODUCTION

Federated learning (Konečný et al., 2015; 2016; McMahan et al., 2017) is a distributed learning approach that enables multiple parties (clients) to learn a shared (global) model without moving their local data off-site. In federated learning, most of the training is performed by the clients and an aggregation strategy is employed by a central server to iteratively update the global model. The privacy-preserving nature of federated learning has made it popular for applications such as healthcare data analysis (Sheller et al., 2018; Brisimi et al., 2018; Chen et al., 2020) and mobile keyboard prediction (Hard et al., 2018; Yang et al., 2018), in which access to data is impossible due to strict privacy policies.

Federated averaging (FedAvg) (McMahan et al., 2017) is a communication-efficient approach to federated learning for gradient descent based models such as neural networks, which aims to reach an accurate global model with an efficient number of communication rounds between the clients and the server. The main idea behind FedAvg is to perform a large number of local updates in the clients and take a simple weighted average over the local model parameters on the server side. FedAvg can dramatically reduce the communication rounds if the data is independently and identically distributed (IID) across the clients.

*Equal contribution. 1 Independent researcher, Tehran, Iran. 2 TUM School of Life Sciences, Technical University of Munich, Munich, Germany. 3 Department of Informatics, Technical University of Munich, Munich, Germany. Correspondence to: Reza Nasirigerdeh <[email protected]>.

However, FedAvg faces some challenges when it comes to Non-IID settings (Li et al., 2019; Hsieh et al., 2019). The global model trained by FedAvg might not converge to the optimum in Non-IID environments, and as a result, FedAvg might not provide the same accuracy as it does for IID settings. Moreover, it might still require a huge number of communication rounds to achieve a target accuracy in Non-IID configurations.

To address the former constraint, we introduce the federated single mini-batch (FedSMB) training approach, which requires the clients to train the model on a single mini-batch of their local data in each iteration. FedSMB is computationally efficient, performing one local update per iteration at the clients. More importantly, FedSMB is robust from the accuracy perspective, achieving an accuracy close to that of centralized training even in Non-IID environments. However, FedSMB is not communication-efficient, since it requires a large number of communication rounds to reach a target accuracy.



To address this challenge, we present federated multi-mini-batch (FedMMB), a generalization of FedSMB, where the clients train the model on multiple mini-batches from their local data, performing multiple local updates in each communication round (one update per mini-batch). FedMMB decouples the batch size from the batch count (number of batches) and allows for specifying the number of local updates at the clients (batch count) independent of the batch size. FedAvg does not allow for this decoupling because a single parameter specifies both the batch size and the batch count. While FedAvg can reduce the local updates at the clients by increasing the batch size, larger batch sizes adversely impact the accuracy and communication efficiency (Section 5.3).

Our simulation results from FedMMB illustrate that the maximum accuracy and the number of communication rounds depend on the number of local updates at the clients. In the IID setting, more local updates dramatically reduce the communication rounds without affecting the accuracy. In Non-IID environments, more local updates lead to a lower maximum accuracy but considerably improve the communication efficiency in general. Given that, FedMMB can provide a trade-off between the maximum accuracy and the number of communication rounds by controlling the number of local updates through the batch count parameter. FedMMB also outperforms FedAvg in terms of accuracy as well as communication and computational efficiency in Non-IID environments.

2 BACKGROUND

Gradient descent is the most widely used optimization method for training neural networks. A neural network is characterized by the model parameters (weights) w, and gradient descent optimization aims to minimize the loss function ℓ(w) on a subset S of the training samples in the dataset. The optimization is an iterative process, where the gradients ∇ of the loss function (w.r.t. w) are computed and the model parameters are updated in the opposite direction of the gradients. The learning rate η specifies the step size of the update in iteration i (Ruder, 2016):

w_{i+1} = w_i − η_i × ∇_{w_i} ℓ(w_i; S_i)    (1)
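As a concrete illustration, the update in equation (1) can be written out directly; the following is a minimal sketch in Python, assuming a toy mean-squared-error loss (the names gradient and gd_step are illustrative, not from any library):

import numpy as np

def gradient(w, X_batch, y_batch):
    # gradient of the mean squared error 0.5 * ||X w - y||^2 / |S| w.r.t. w
    return X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)

def gd_step(w, X_batch, y_batch, eta):
    # one iteration of equation (1): w_{i+1} = w_i - eta_i * grad l(w_i; S_i)
    return w - eta * gradient(w, X_batch, y_batch)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = np.zeros(5)
for i in range(200):                 # iterations
    S = rng.choice(100, size=10)     # mini-batch S_i of size B = 10
    w = gd_step(w, X[S], y[S], eta=0.1)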

There are different variants of gradient descent depending on how the samples of the training dataset are employed to update the model parameters. In full gradient descent (FGD), all samples are leveraged to compute the gradients of the loss function; stochastic gradient descent (SGD) calculates the gradients using a single randomly selected sample of the training dataset; mini-batch gradient descent (MBGD) optimizes the loss function on a random small batch of samples (Hinton et al., 2012; Bottou, 2012). For large neural networks trained on very large training datasets, MBGD is typically the best choice because it is computationally efficient (Hinton et al., 2012).

A neural network model can be trained in a centralized or distributed environment. In the former, the whole dataset is located at a single site, and the model is iteratively trained on the dataset using one of the variants of gradient descent. An epoch indicates the number of iterations required to employ all samples of the dataset for training. Given a dataset with N samples, the epoch is 1 and N for FGD and SGD, respectively. For MBGD, the epoch is ⌈N/B⌉, where B is the batch size.

In distributed training, the data has been distributed across multiple sites (clients). Federated learning is a type of distributed learning in which a set of clients collaboratively learn a joint model under the orchestration of a centralized server while preserving the privacy of their data (Kairouz et al., 2019). The local data never leaves the site and only the model parameters are exchanged between each client and the server. We assume that the clients have different training samples but the same form of a neural network model; additionally, all clients are selected to participate in training.

In each iteration i of the federated training, all K clients obtain the global model parameters w^g_i from the server. Next, each client j computes the local model parameters w^l_{ij} by optimizing the loss function ℓ(w^g_i) on n_j samples from its local data using one of the variants of gradient descent. Afterwards, the server receives the local parameters from the clients and calculates the global model parameters for the next iteration by taking the weighted average over the local parameters:

w^g_{i+1} = ( Σ_{j=1}^{K} n_j × w^l_{ij} ) / ( Σ_{j=1}^{K} n_j )    (2)

The important consideration in equation (2) is that the clients use the same learning rate η_i in iteration i. As an equivalent method, each client j can compute local gradients ∇w^l_{ij} and the server can calculate the global model parameters for the next iteration using the following equation:

w^g_{i+1} = w^g_i − η_i × ( Σ_{j=1}^{K} n_j × ∇w^l_{ij} ) / ( Σ_{j=1}^{K} n_j )    (3)
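The server-side aggregation in equations (2) and (3) amounts to a weighted average; a minimal sketch, assuming the model parameters are flat NumPy vectors and all names are illustrative:

import numpy as np

def aggregate_weights(local_weights, local_sizes):
    # equation (2): weighted average of the local model parameters
    n = np.asarray(local_sizes, dtype=float)
    return (n[:, None] * np.stack(local_weights)).sum(axis=0) / n.sum()

def aggregate_gradients(w_global, local_grads, local_sizes, eta):
    # equation (3): equivalent update using the weighted average of local gradients
    n = np.asarray(local_sizes, dtype=float)
    avg_grad = (n[:, None] * np.stack(local_grads)).sum(axis=0) / n.sum()
    return w_global - eta * avg_grad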

Each iteration of the federated training updates the global model parameters once and requires one communication round between each client and the server. Therefore, iteration and communication round are used interchangeably in the federated environment. However, the clients might update their local model parameters once or multiple times in each iteration depending on the variant of gradient descent they employ for local optimization. If the clients leverage FGD, they will optimize the global model on the whole


dataset (performing one local update) and share the local model with the server. We refer to this training method as the federated full gradient descent (FedFGD) approach.

The FedAvg algorithm employs MBGD in the clients, aiming to reduce the number of communication rounds by performing more local updates at the clients. In FedAvg, each client j updates its local model parameters µ_j = E × ⌈N_j/B⌉ times, where E is the number of local epochs, B is the batch size, and N_j is the number of samples in the training set of client j. In other words, the clients run the MBGD algorithm E times on the local data before sending the local model parameters to the server.
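For reference, the per-round local work of a FedAvg client follows directly from this formula; a small sketch (the function name is illustrative):

import math

def fedavg_local_updates(N_j, B, E):
    # number of local updates a FedAvg client performs per communication round
    return E * math.ceil(N_j / B)

# e.g. a client with N_j = 5000 samples, B = 10, and E = 1 local epoch performs
# 1 * ceil(5000 / 10) = 500 local updates before sending its weights to the server
print(fedavg_local_updates(5000, 10, 1))  # 500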

The distribution of class labels across the clients can be IID or Non-IID. In the former, the training sets of the clients have similar (homogeneous) label distributions, while in the latter, the class labels have been heterogeneously distributed across the clients. As an example, assume a distributed environment in which the training sets of the clients include binary class labels. If the ratio of label 0 to label 1 in all clients is similar (e.g. around 0.7), this environment is considered IID. However, if this ratio is very different across the clients, the setting is Non-IID.

We use three different criteria to evaluate various MBGD-based training methods in the federated environment: accuracy, communication efficiency, and computational efficiency (Torkzadehmahani et al., 2020). The accuracy criterion is an indicator of the maximum accuracy that the global model (trained by a federated approach) can achieve. Ideally, it is comparable to the accuracy of the centralized training. Communication efficiency indicates the number of communication rounds (i.e. iterations) required to reach a target accuracy. Fewer communication rounds imply more communication efficiency. Computational efficiency represents the total number of local updates performed by the clients in all iterations to achieve a target accuracy. The fewer total local updates a federated approach performs, the more computationally efficient the approach is.
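These three criteria can be read off a training log; the sketch below assumes a list of per-round test accuracies and a constant number of local updates per round (all names are illustrative):

def evaluate_run(accuracies, updates_per_round, target):
    max_accuracy = max(accuracies)                       # accuracy criterion
    rounds_to_target = next(                             # communication efficiency
        (i + 1 for i, a in enumerate(accuracies) if a >= target), None)
    total_updates = (None if rounds_to_target is None    # computational efficiency
                     else rounds_to_target * updates_per_round)
    return max_accuracy, rounds_to_target, total_updates

print(evaluate_run([0.61, 0.68, 0.71, 0.73], updates_per_round=500, target=0.7))
# (0.73, 3, 1500)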

3 RELATED WORK

Learning from real-world Non-IID (heterogeneous) data is a crucial but still understudied problem. Recently, Hsieh et al. (Hsieh et al., 2019) empirically studied the challenges of training on Non-IID data by comparing the performance of several distributed learning algorithms, including FedAvg. They showed that data heterogeneity makes accurate federated learning very challenging, while the level of heterogeneity plays a major role in this problem. FedSMB addresses this challenge by training an accurate federated neural network model even in Non-IID environments with different levels of label heterogeneity (Section 5.1).

Li et al. (Li et al., 2019) theoretically analyzed the convergence of FedAvg in Non-IID settings and showed that FedAvg with E > 1 and full batch might not converge to the optimum. Our results (Section 5.3) indicate that this is also the case for FedAvg with E = 1 and mini-batches. In both configurations, the clients update the model multiple times, which affects the convergence of the model.

FedProx (Li et al., 2020) re-parameterized FedAvg and introduced a general federated optimization algorithm (with FedAvg as its special case). The algorithm adds a proximal term to the clients' loss function to bound their impact on the divergence of the global model, making the model converge more robustly. FedMA (Wang et al., 2020) proposed a layer-wise matching and averaging strategy that outperforms FedAvg and FedProx for Non-IID data. FedMMB takes a different approach and provides a trade-off between the accuracy and communication efficiency by controlling the number of local updates at the clients for a given batch size. This enables FedMMB to be an efficient alternative to FedAvg for federated learning in Non-IID environments.

4 METHOD

In this section, we delve into the details of the FedSMB training approach as well as FedMMB as a generalization of FedSMB. In FedSMB, each client trains the global model on a single mini-batch of its dataset (i.e. updates the model once) in each communication round. We will experimentally show (Section 5.1) that FedSMB with K clients and batch size B is similar to the centralized training using MBGD with a batch size of B′ = B × K.

This implies that: (1) federated training using FedSMB will result in similar learning curves for different values of B and K as long as B′ = B × K, independent of how data has been distributed across the clients (IID or Non-IID), (2) these learning curves are, in turn, similar to the learning curve from centralized training with batch size B′, and (3) both federated and centralized models provide accuracy values close to each other on the test set. For instance, centralized training with B′ = 1000, federated training with K = 10 and B = 100 in an IID setting, and K = 100 and B = 10 in a Non-IID configuration lead to similar learning curves and close loss and accuracy values (Section 5.1).
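One way to see why this equivalence holds: when all clients start a round from the same global weights and use the same learning rate, averaging K single-mini-batch updates (batch size B) gives exactly one centralized update on the union of those batches (batch size B′ = B × K), provided the loss averages per-sample terms and the clients hold equally many samples. A minimal numerical check under these assumptions, using a toy quadratic loss:

import numpy as np

rng = np.random.default_rng(1)
K, B, d = 10, 10, 5
X, y = rng.normal(size=(K * B, d)), rng.normal(size=K * B)
w0, eta = rng.normal(size=d), 0.1

def grad(w, Xb, yb):
    return Xb.T @ (Xb @ w - yb) / len(yb)     # mean-squared-error gradient

# FedSMB-style round: each client updates on one local batch, the server averages
locals_ = [w0 - eta * grad(w0, X[k*B:(k+1)*B], y[k*B:(k+1)*B]) for k in range(K)]
w_fed = np.mean(locals_, axis=0)

# one centralized MBGD step with batch size B' = B * K
w_cen = w0 - eta * grad(w0, X, y)
print(np.allclose(w_fed, w_cen))  # True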

FedSMB reaches an accuracy close to that of the centralized training regardless of how data has been distributed across the clients, but it is not communication-efficient. FedMMB can overcome this limitation by enforcing the clients to train the global model on multiple batches instead of one (i.e. updating the model weights multiple times) before sharing the updated model with the server (Algorithm 1). FedMMB takes the batch size B and the batch count C as hyper-parameters in the clients. In the initial step, the server initializes the global model; moreover, each client j shuffles its local dataset of size N_j and splits it into T_j = ⌈N_j/B⌉ batches of size B (except the last one, whose size might be less than B).

In the first iteration, the clients train the global model on the first C batches from their dataset, updating the model parameters C times. Afterwards, each client j sends the updated model as well as the number of samples used for training (n_j) to the server. n_j = B × C if all batches have the same size; otherwise, n_j = B × (C − 1) + B′′, where B′′ indicates the size of the last batch. The server takes the weighted average over the local models from the clients to compute the new global model. Likewise, the clients train the model on the second C batches of their dataset in the second iteration, and the training process is repeated for a pre-specified number of iterations. If there are not enough remaining batches in client j for training, which happens every f_j = ⌊T_j/C⌋ iterations, the client re-shuffles and re-splits its dataset.

Algorithm 1  Federated multi-mini-batch
B is the batch size; C is the batch count; I_max is the maximum number of iterations; w^g_i and η_i are the global weights and learning rate in iteration i; there are K clients (participants) and P_j indicates client j; w^l_j and n^l_j represent the local weights and the number of local samples used for training at client j, respectively; D_j is the local dataset of client j, containing N_j samples.

Server:
function train:
    w^g_0 ← initialize global weights
    for iteration i from 0 to (I_max − 1) do
        for client j from 1 to K do
            w^l_j, n^l_j ← P_j.update(i, w^g_i, η_i)
        end for
        w^g_{i+1} ← ( Σ_{j=1}^{K} n^l_j × w^l_j ) / ( Σ_{j=1}^{K} n^l_j )
    end for
    return w^g_{I_max}
end function

Client P_j:
function update:
    T ← ⌈N_j/B⌉; f ← ⌊T/C⌋
    if i % f == 0 then
        β_0 ... β_{T−1} ← shuffle and split D_j into batches of size B
    end if
    p ← (i % f) × C; q ← p + C − 1
    n ← 0; u ← 0; w_0 ← w^g_i
    for β from β_p to β_q do
        w_{u+1} ← w_u − η_i × ∇_{w_u} ℓ(w_u; β)
        u ← u + 1; n ← n + sizeof(β)
    end for
    return w_u, n
end function
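The client-side update of Algorithm 1 can be sketched in a few lines of Python; this is a minimal, illustrative implementation assuming NumPy arrays for the local data, a toy quadratic-loss gradient in place of a real model, and C ≤ T (all names are assumptions, not part of any library):

import math
import numpy as np

def gradient(w, Xb, yb):                       # toy quadratic-loss gradient (assumption)
    return Xb.T @ (Xb @ w - yb) / len(yb)

class Client:
    def __init__(self, X, y, B, C, seed=0):
        self.X, self.y, self.B, self.C = X, y, B, C
        self.rng = np.random.default_rng(seed)
        self.batches = None

    def _reshuffle(self):
        order = self.rng.permutation(len(self.y))
        T = math.ceil(len(self.y) / self.B)
        self.batches = [order[t*self.B:(t+1)*self.B] for t in range(T)]

    def update(self, i, w_global, eta):
        T = math.ceil(len(self.y) / self.B)
        f = T // self.C                        # rounds per pass over the local data
        if i % f == 0:
            self._reshuffle()                  # re-shuffle and re-split the dataset
        p = (i % f) * self.C                   # index of the first batch of this round
        w, n = w_global.copy(), 0
        for idx in self.batches[p:p + self.C]:
            w = w - eta * gradient(w, self.X[idx], self.y[idx])   # one local update
            n += len(idx)
        return w, n                            # local weights and sample count

Setting C = 1 in this sketch recovers the FedSMB client, while C = ⌈N_j/B⌉ corresponds to one full local epoch per round.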

FedSMB is a special case of FedMMB in which C = 1. Unlike FedSMB, FedMMB with C ≥ 2 might not reach the accuracy of the centralized training in Non-IID environments, but it can reduce the number of communication rounds. In general, higher values of C lead to a lower maximum accuracy but better communication efficiency. Thus, the parameter C enables FedMMB to provide a trade-off between accuracy and communication efficiency and makes it very flexible to cope with various Non-IID configurations (Section 5.2).

5 RESULTS

We first illustrate that FedSMB with K clients and batch size B provides similar results to those from the centralized training with batch size B′ = B × K. To this end, we leverage MNIST (LeCun et al., 2010) and Fashion-MNIST (Xiao et al., 2017) as datasets, which include 70000 gray-scale images (60000 for training and 10000 for testing) of shape 28 × 28 as well as 10 label values. Following (McMahan et al., 2017), we train two different neural network models on the datasets: (1) a fully connected neural network with two hidden layers of size 200 and (2) a convolutional neural network containing two 5 × 5 convolutional layers, each followed by a 2 × 2 max-pooling layer. The convolutional layers have 32 and 64 filters, respectively. The second max-pooling layer is followed by a fully connected layer of size 512. In the models, the convolutional or fully-connected layers use ReLU while the output layer utilizes the softmax activation function. We refer to the models as 2FNN and 3CFNN, respectively.

We also evaluate FedSMB and FedMMB and compare them with FedAvg using a more complex model and the CIFAR-10 dataset (Krizhevsky et al., 2009). The CIFAR-10 dataset contains 60000 color images (50000 train and 10000 test samples) of shape 32 × 32 and 10 class labels. The model consists of two 3 × 3 convolutional layers with 128 and 256 filters, respectively. Each convolutional layer is followed by a 2 × 2 max-pooling layer. The second max-pooling layer is followed by two fully-connected layers of size 1024 and 2048, respectively. The convolutional and fully-connected layers employ ReLU whereas the output layer has softmax as the activation function. We call this model 4CFNN.
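For concreteness, the 4CFNN model described above might look as follows in PyTorch; this is only a sketch, since the text does not specify the padding or the exact flattened feature size ('same' padding is assumed here, and the softmax is left to the loss function):

import torch
import torch.nn as nn

class CFNN4(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=3, padding=1), nn.ReLU(),    # 32x32 -> 32x32
            nn.MaxPool2d(2),                                           # -> 16x16
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),  # -> 16x16
            nn.MaxPool2d(2),                                           # -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, num_classes),      # softmax is applied inside the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(CFNN4()(torch.zeros(1, 3, 32, 32)).shape)  # torch.Size([1, 10])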

We distribute the datasets across the clients in two different ways: IID and Non-IID. In the former, the distribution of the label values is similar among the clients, and each client has samples from all ten labels. In the latter, the clients have heterogeneous label distributions. For the IID case, we first shuffle the dataset, then split it into K partitions with the same sample size, and give each partition to one


of the K clients. In the Non-IID configuration, we have a parameter L, which indicates the number of unique labels per client and determines the level of label distribution heterogeneity across the clients. For instance, L = 1 results in the extreme case, where each client only contains the samples of one label. For a Non-IID scenario, we group the samples according to their labels. Next, we divide each group into (K × L)/10 partitions and allocate L partitions with different labels to each client. We assume that the number of clients is divisible by 10. Notice that the number of samples in the clients might be different in the Non-IID configurations. We refer to a Non-IID scenario with parameter L as Non-IID-L (e.g. Non-IID-1 and Non-IID-2).

Figure 1. (a) 2FNN-MNIST, (b) 2FNN-Fashion-MNIST, (c) 3CFNN-MNIST, (d) 3CFNN-Fashion-MNIST. The federated models from FedSMB have similar learning curves as well as loss and accuracy values close to those from the centralized training with MBGD. The learning rate is η = 0.1 for all scenarios.
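The Non-IID-L partitioning described above can be realized in several ways; the following is one possible sketch, assuming integer labels 0–9 and K divisible by 10 (the shard-dealing scheme is an assumption, since the text does not fix a particular assignment):

import numpy as np

def non_iid_split(labels, K, L, seed=0):
    # group samples by label, cut each label group into (K * L) / 10 shards,
    # and deal the shards so that each client receives L shards with distinct labels
    rng = np.random.default_rng(seed)
    shards_per_label = (K * L) // 10
    shards = []
    for lab in range(10):
        idx = rng.permutation(np.where(labels == lab)[0])
        shards += list(np.array_split(idx, shards_per_label))
    clients = [[] for _ in range(K)]
    for j, shard in enumerate(shards):         # shards are ordered by label
        clients[j % K].append(shard)
    return [np.concatenate(c) for c in clients]

# e.g. Non-IID-2 with K = 10 clients: each client ends up with samples of 2 labels
parts = non_iid_split(np.repeat(np.arange(10), 100), K=10, L=2)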

5.1 FedSMB

To show the similarity between the federated training using FedSMB and the centralized training using MBGD, we train the 2FNN and 3CFNN on the MNIST and Fashion-MNIST datasets under five scenarios with a fixed learning rate η = 0.1:

1. Centralized MBGD with batch size B′ = 1000

2. FedSMB with batch size B = 10 and K = 100 clients in the IID setting

3. FedSMB with batch size B = 10 and K = 100 clients in the Non-IID-1 setting

4. FedSMB with batch size B = 100 and K = 10 clients in the IID setting

5. FedSMB with batch size B = 100 and K = 10 clients in the Non-IID-1 setting

Figure 2. FedSMB: The federated training with FedSMB achieves the best accuracy from the centralized training in both IID and Non-IID configurations. However, FedSMB is not communication-efficient, requiring a large number of communication rounds to provide a target accuracy even in the IID setting. For all scenarios, the 4CFNN model has been trained on the CIFAR-10 dataset with learning rate η = 0.08. There are K = 10 clients using batch size B = 10 in the federated environment, while the centralized training uses batch size B′ = 100.


Figure 3. (a) IID, (b) Non-IID-1, (c) Non-IID-2, (d) Non-IID-3, (e) Non-IID-4, (f) Non-IID-5. FedMMB: In the IID configuration, FedMMB improves the communication efficiency using higher batch count values without impacting the accuracy. In the Non-IID environments, FedMMB might not achieve the accuracy of the baseline, but it reduces the number of communication rounds. The exception is the Non-IID-1 scenario, where increasing the batch count does not improve the communication efficiency. The learning rate η = 0.05 is used for the IID scenario, for C = 50 in all Non-IID configurations, and for C = 20 in Non-IID-5; for the remaining cases, the learning rate is 0.08. The gray dashed line indicates the baseline (centralized) accuracy. In all scenarios, there are K = 10 clients, the batch size is B = 10, and the 4CFNN model has been trained on the CIFAR-10 dataset.

Figure 1 demonstrates the loss and accuracy of the models for each scenario. The learning curves from all five scenarios are similar to each other and they finally converge to similar accuracy values. This indicates that FedSMB with K clients, where each client uses batch size B, is similar to the centralized training with batch size B′ = B × K regardless of the label distribution across the clients.

To evaluate the performance of FedSMB, we train the 4CFNN model (with the SGD optimizer) on the CIFAR-10 dataset in federated settings with K = 10 clients and batch size B = 10 under the IID and Non-IID-1 to Non-IID-5 scenarios with a fixed learning rate η = 0.08 (Figure 2). The baseline is the best result (maximum accuracy of ≈ 0.742) from the centralized training on the aggregated dataset (batch size B′ = 100 and learning rate η = 0.08).

Figure 2 shows that FedSMB achieves the accuracy of the baseline with similar learning curves in all scenarios. We observe that:

1. FedSMB can reach the maximum possible accuracy even in Non-IID scenarios.

2. FedSMB training with K = 10 clients and batch size B = 10 is similar to the centralized training with batch size B′ = 100.

3. FedSMB requires a similar number of iterations to achieve the best accuracy in both IID and Non-IID configurations.

The last point is crucial from the practical perspective. It implies that although FedSMB is an efficient approach in


terms of accuracy, it is not communication-efficient, requiring a large number of communication rounds to reach a target accuracy even in the IID configuration. FedMMB addresses this limitation by providing a trade-off between the accuracy and communication efficiency in different Non-IID environments.

5.2 FedMMB

To investigate the efficiency of FedMMB, we employ a setting similar to the FedSMB case using the 4CFNN model (SGD optimizer), the CIFAR-10 dataset, 10 clients with a batch size of 10, and the centralized accuracy (≈ 0.742) as the baseline. We train the model using different values of C (batch count) under the IID and Non-IID-1 to Non-IID-5 scenarios (Figure 3).

In the IID configuration (Figure 3a), FedMMB can achieve the accuracy of the baseline using high batch count values (C = 20, 50). Additionally, the larger batch count (C = 50) requires fewer communication rounds to reach the maximum accuracy. Thus, increasing the batch count of FedMMB makes the approach more communication-efficient without compromising the accuracy in the IID environment.

Figure 4. Comparison among FedSMB, FedMMB, and FedAvg: FedSMB and FedMMB outperform FedAvg in terms of accuracy. Reducing the local updates by increasing the batch size in FedAvg does not improve the accuracy. The dashed line indicates the baseline accuracy. For all cases, there are K = 10 clients training the 4CFNN model on the CIFAR-10 dataset under the Non-IID-2 configuration. The learning rate is 0.08, 0.05, 0.02, and 0.05 for FedSMB, FedMMB, FedAvg (B = 10), and FedAvg (B = 100), respectively.

In all Non-IID scenarios, FedMMB with high batch counts never reaches the accuracy of the baseline; with low batch counts (C = 2, 5), FedMMB might achieve the baseline accuracy (Figure 3b); in general, FedMMB provides better accuracy with lower batch counts in all Non-IID environments. In the extremely Non-IID label distribution scenario (Non-IID-1), increasing the batch count does not improve the communication rounds. In the other Non-IID environments, however, higher batch counts lead to fewer communication rounds to reach a target accuracy. For Non-IID environments with more unique labels per client, larger batch counts make the approach more communication-efficient. For instance, FedMMB with C = 50 requires around 200 and 850 communication rounds to reach the accuracy of 0.7 in the Non-IID-5 and Non-IID-2 scenarios, respectively.

In summary, FedMMB with large values of C is a realistic choice for the IID environment because it can save a huge number of communication rounds without negatively affecting the accuracy. For Non-IID environments where the maximum accuracy is more important than communication efficiency, FedSMB is the best option. FedMMB with C ≥ 2 is a reasonable choice for Non-IID environments in which a trade-off between communication efficiency and accuracy is required. The best value of C can be determined based on the target accuracy and the label distribution across the clients.

5.3 FedMMB vs. FedAvg

We also compare the FedMMB approach (including FedSMB as its special case) with FedAvg from the accuracy as well as communication and computational efficiency viewpoints. We train the 4CFNN model on the CIFAR-10 dataset in a federated configuration with K = 10 clients, batch size B = 10, and the Non-IID-2 scenario using FedSMB (η = 0.08), FedMMB (C = 50, η = 0.05), and FedAvg (E = 1, η = 0.02). We use a lower learning rate for FedAvg because the model diverges for higher learning rates.

Figure 4 shows the accuracy curves for the different approaches. FedSMB, FedMMB, and FedAvg achieve the maximum accuracy of 0.742, 0.725, and 0.709, respectively. This indicates that FedSMB and FedMMB outperform FedAvg in terms of the accuracy in the Non-IID scenario. These results are also consistent with the previous ones (Subsections 5.1 and 5.2) regarding the relationship between the number of local updates and the maximum achievable accuracy in Non-IID environments, assuming the same batch size. With a batch size of 10, FedSMB, FedMMB, and FedAvg perform µ = 1, µ = 50, and µ = 5000/10 = 500 local updates per iteration, respectively (5000 is the sample size of each client). The approach with a lower number of local updates reaches a higher maximum accuracy.
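The per-round local update counts quoted above follow directly from the definitions; for the Section 5.3 setting (N_j = 5000 samples per client):

import math

N_j = 5000
mu_fedsmb = 1                            # one mini-batch per round
mu_fedmmb = 50                           # C = 50 batches per round
mu_fedavg = 1 * math.ceil(N_j / 10)      # E = 1 epoch with B = 10  -> 500
print(mu_fedsmb, mu_fedmmb, mu_fedavg)   # 1 50 500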

We also test FedAvg with a larger batch size (B = 100, E = 1, η = 0.05) to perform fewer (µ = 5000/100 = 50) local updates per iteration (Figure 4). FedAvg reaches the maximum accuracy of 0.696 in this case, which is lower compared with the accuracy from FedAvg with batch size of 10 (0.709). This indicates that increasing the batch size to reduce the local updates in FedAvg does not improve the accuracy and highlights the importance of decoupling the


batch size from the batch count, as in the FedMMB approach.

                Target accuracy
Approach        0.65           0.7            0.71           0.72           0.73           0.74
FedSMB          1990 | 19.9K   3300 | 33K     3650 | 36.5K   4530 | 45.3K   5040 | 50.4K   7820 | 78.2K
FedMMB          290 | 145K     850 | 425K     1260 | 630K    1800 | 900K    -              -
FedAvg          240 | 1200K    1550 | 7750K   -              -              -              -

Table 1. Communication rounds | local updates required to reach a target accuracy in the Non-IID-2 scenario with batch size B = 10. FedSMB is the least communication-efficient but the most computation-efficient approach. FedAvg is the least computation-efficient approach, performing many more local updates than FedMMB and FedSMB. The results in the table are based on those from Figure 4.

Table 1 lists the communication rounds and total local updates required to reach a target accuracy in each approach using the same batch size (B = 10). FedSMB is the least communication-efficient but the most computation-efficient approach, as expected. Compared with FedMMB, FedAvg requires 1.8 times more communication rounds (1550 vs. 850) and 18 times more local updates (7750K vs. 425K) to reach the accuracy of 0.7 and never reaches an accuracy above 0.71. This indicates that FedMMB not only provides a higher accuracy but also more communication and computation efficiency compared to FedAvg in Non-IID environments.
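The local-update column of Table 1 is simply communication rounds × per-client updates per round × K clients; for example, at the target accuracy of 0.7 with K = 10:

K = 10
print(3300 * 1 * K)     # FedSMB:            3300 rounds * 1 update    -> 33000   (33K)
print(850 * 50 * K)     # FedMMB (C = 50):    850 rounds * 50 updates  -> 425000  (425K)
print(1550 * 500 * K)   # FedAvg (B = 10):   1550 rounds * 500 updates -> 7750000 (7750K)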

6 CONCLUSION

In this paper, we address two main challenges of federated learning and FedAvg in Non-IID environments: accuracy and network communication efficiency. With respect to the former challenge, we introduce the FedSMB training approach, where the clients train the global model on a single mini-batch from their dataset in each iteration. We show that FedSMB with K clients and batch size B results in similar models as the centralized training using MBGD with batch size B′ = B × K regardless of the data distribution across the clients. Moreover, FedSMB can achieve the accuracy of the baseline even in Non-IID environments, and as a result, it is an efficient approach from the accuracy perspective. However, FedSMB is not communication-efficient, requiring a large number of communication rounds to achieve a target accuracy.

To cope with this limitation, we present FedMMB as a generalization of FedSMB. In FedMMB, each client trains the global model on multiple (C ≥ 2) batches from its local dataset, performing multiple local updates in each iteration instead of one. Our simulation results illustrate that in the IID setting, a large value of C can considerably reduce the communication rounds without adversely affecting the accuracy. In Non-IID environments, FedMMB with a lower value of C (i.e. fewer local updates) provides a better accuracy but requires more communication rounds to reach a target accuracy. The results also indicate that FedMMB outperforms FedAvg in terms of accuracy as well as communication and computation efficiency in Non-IID environments.

Unlike FedAvg, FedMMB decouples the batch size from the batch count and can control the number of local updates per iteration independent of the batch size. This decoupling enables FedMMB to provide a trade-off between the accuracy and communication efficiency and makes it a suitable training approach to federated learning in Non-IID environments.

REFERENCES

Bottou, L. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pp. 421–436. Springer, 2012.

Brisimi, T. S., Chen, R., Mela, T., Olshevsky, A., Paschalidis, I. C., and Shi, W. Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics, 112:59–67, 2018.

Chen, Y., Qin, X., Wang, J., Yu, C., and Gao, W. FedHealth: A federated transfer learning framework for wearable healthcare. IEEE Intelligent Systems, 2020.

Hard, A., Rao, K., Mathews, R., Ramaswamy, S., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.

Hinton, G., Srivastava, N., and Swersky, K. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent. 14(8), 2012.

Hsieh, K., Phanishayee, A., Mutlu, O., and Gibbons, P. B. The non-IID data quagmire of decentralized machine learning. arXiv preprint arXiv:1910.00189, 2019.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

Konečný, J., McMahan, B., and Ramage, D. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.


Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.

LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-IID data. In International Conference on Learning Representations, 2019.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.

Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.

Sheller, M. J., Reina, G. A., Edwards, B., Martin, J., and Bakas, S. Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation. In International MICCAI Brainlesion Workshop, pp. 92–104. Springer, 2018.

Torkzadehmahani, R., Nasirigerdeh, R., Blumenthal, D. B., Kacprowski, T., List, M., Matschinske, J., Spath, J., Wenke, N. K., Bihari, B., Frisch, T., Hartebrodt, A., Hausschild, A.-C., Heider, D., Holzinger, A., Hotzendorfer, W., Kastelitz, M., Mayer, R., Nogales, C., Pustozerova, A., Rottger, R., Schmidt, H. H. H. W., Schwalber, A., Tschohl, C., Wohner, A., and Baumbach, J. Privacy-preserving artificial intelligence techniques in biomedicine, 2020.

Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., and Khazaeni, Y. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440, 2020.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Yang, T., Andrew, G., Eichner, H., Sun, H., Li, W., Kong, N., Ramage, D., and Beaufays, F. Applied federated learning: Improving Google keyboard query suggestions. arXiv preprint arXiv:1812.02903, 2018.