
Multi-Stage Hybrid Federated Learning over Large-Scale D2D-Enabled Fog Networks

Seyyedali Hosseinalipour, Member, IEEE, Sheikh Shams Azam, Christopher G. Brinton, Senior Member, IEEE, Nicolo Michelusi, Senior Member, IEEE, Vaneet Aggarwal, Senior Member, IEEE,

David J. Love, Fellow, IEEE, and Huaiyu Dai, Fellow, IEEE

Abstract—Federated learning has generated significant interest, with nearly all works focused on a “star” topology where nodes/devices are each connected to a central server. We migrate away from this architecture and extend it through the network dimension to the case where there are multiple layers of nodes between the end devices and the server. Specifically, we develop multi-stage hybrid federated learning (MH-FL), a hybrid of intra- and inter-layer model learning that considers the network as a multi-layer cluster-based structure, each layer of which consists of multiple device clusters. MH-FL considers the topology structures among the nodes in the clusters, including local networks formed via device-to-device (D2D) communications. It orchestrates the devices at different network layers in a collaborative/cooperative manner (i.e., using D2D interactions) to form local consensus on the model parameters, and combines it with multi-stage parameter relaying between layers of the tree-shaped hierarchy. We derive the upper bound of convergence for MH-FL with respect to parameters of the network topology (e.g., the spectral radius) and the learning algorithm (e.g., the number of D2D rounds in different clusters). We obtain a set of policies for the D2D rounds at different clusters to guarantee either a finite optimality gap or convergence to the global optimum. We then develop a distributed control algorithm for MH-FL to tune the D2D rounds in each cluster over time to meet specific convergence criteria. Our experiments on real-world datasets verify our analytical results and demonstrate the advantages of MH-FL in terms of resource utilization metrics.

Index Terms—Fog learning, D2D-assisted federated learning, cooperative learning, distributed machine learning.

I. INTRODUCTION

Machine learning (ML) tasks have attracted significant attention recently. When provided with sufficient data for model training, ML has produced automated solutions to problems ranging from natural language processing to object detection and tracking [1], [2]. Traditionally, model training has been carried out at a central node (i.e., a server). In many contemporary applications of ML, however, the relevant data is generated at the end user devices. As these devices generate larger volumes of data, transferring it to a central server for model training has several drawbacks: (i) it can require significant energy consumption from battery-powered devices, (ii) the round-trip-times between data generation and model training can constitute prohibitive delays, and (iii) in privacy-sensitive applications, end users may not be willing to transmit their raw data in the first place.

Federated learning has emerged as a technique for distributing model training across devices while keeping the devices’

S. Hosseinalipour, S. Azam, C. Brinton, N. Michelusi, V. Aggarwal, and D. Love are with Purdue University: {hosseina,azam1,cgb,michelus,vaneet,djlove}@purdue.edu. H. Dai is with NC State University: {[email protected]}.

Fig. 1: Architecture of conventional federated learning resembling a star structure. User devices perform local model updates using their datasets, and send their learned parameters to the main server. The main server aggregates the received parameters into a new global model broadcast to the devices.

datasets local [3]. Its conventional architecture consists of a main server connected to multiple devices in a star topology (see Fig. 1). Each round of model training consists of two steps performed sequentially: (i) local updating, where each device updates its local model based on its dataset and the global model, typically using gradient descent methods, and (ii) global aggregation, where the server gathers devices’ local models and computes a new global model, which is then synchronized across the devices to begin the next round of training.
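The two-step round just described can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions (a least-squares stand-in task on synthetic data), not the specific algorithm developed in this paper:

```python
import numpy as np

def local_update(w_global, X, y, beta=0.1):
    """One local gradient-descent step on a least-squares loss (stand-in task)."""
    grad = X.T @ (X @ w_global - y) / len(y)  # gradient of (1/2|D_n|)||Xw - y||^2
    return w_global - beta * grad

def global_aggregation(local_models, dataset_sizes):
    """Weighted average of device models, weights proportional to |D_n|."""
    total = sum(dataset_sizes)
    return sum(s * w for w, s in zip(local_models, dataset_sizes)) / total

rng = np.random.default_rng(0)
w = np.zeros(3)                            # global model broadcast by the server
devices = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]
for _ in range(50):                        # global iterations
    locals_ = [local_update(w, X, y) for X, y in devices]
    w = global_aggregation(locals_, [len(y) for _, y in devices])
```

The weighting by dataset size is what makes devices with more data count proportionally more in the new global model.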

In conventional federated learning, only device-to-server (in step (i)) and server-to-device (in step (ii)) communications occur. This is limiting and prohibitive in contemporary large-scale network scenarios, where there are several layers of nodes between the end devices and the cloud (see Fig. 2a). In particular, it can lead to long delays, large bandwidth utilization, and high power consumption in transferring the local updates upstream to datacenters for aggregations [4]. We migrate federated learning from its star structure to a more distributed structure that accounts for the multi-layer network dimension and leverages topology structures among the devices.

A. Fog Learning: Federated Learning in Fog Environments

Fog computing is an emerging technology which aims to manage computation resources across the cloud-to-things continuum, encompassing the cloud, core, edge, metro, clients, and things [5]. We recently introduced the fog learning (FogL) paradigm [4], which advocates leveraging the fog computing architecture to handle ML tasks. Specifically, FogL requires extending federated learning to (i) incorporate fog network structures, (ii) account for device computation heterogeneity, and (iii) manage the proximity between resource-abundant and resource-constrained nodes. Our focus in this paper is (i), i.e., extending federated learning along the network dimension.

We consider the sample network architecture of FogL depicted in Fig. 2a. There are multiple layers between user devices and the cloud, including base stations (BSs) and edge servers. Compared with federated learning, FogL has two key characteristics: (i) It assumes a multi-layer cluster-based

arXiv:2007.09511v4 [cs.NI] 26 Oct 2020


(a): A schematic of model transfer stages for a large-scale ML task in a fog network. The parameters of the end devices are carried through multiple layers of the network consisting of base stations (BSs), road side units (RSUs), unmanned aerial vehicles (UAVs), high altitude platforms (HAPs), edge servers, and cloud servers before reaching the main server. Devices located at different layers of the network can engage in direct communications via mobile-mobile (M2M), vehicle-vehicle (V2V), UAV-UAV (U2U), inter-edge, and inter-cloud links.

(b): Partitioning the network layers into multiple LUT/EUT clusters for FogL, introducing a hybrid model training framework consisting of both horizontal and vertical parameter transfer. Inside each LUT cluster, the devices engage in collaborative/cooperative D2D communications through a time varying network topology and exchange their parameters. The parents of LUT clusters obtain the consensus of their children parameters by sampling the model parameter of one node, while the parents of EUT clusters receive all the parameters of their children.

Fig. 2: Architecture of (a) the multi-layer network structure of fog computing systems and (b) the network layers and parameter transfer for FogL.

structure with local parameter aggregations at different layers. (ii) In addition to inter-layer communications, it includes intra-layer communications via device-to-device (D2D) connectivity. D2D communication is being enabled today in 5G and IoT [6]. There exists a literature on D2D communication protocols for ad-hoc and sensor networks, including vehicular ad-hoc networks (VANET), mobile ad-hoc networks (MANET), and flying ad-hoc networks (FANET) [7]–[10]. FogL also considers server-to-server interactions [11] under the umbrella of D2D.

Characteristic (ii) mentioned above orchestrates the devices at each layer of the network in a collaborative/cooperative framework, which introduces a set of local device networks into the model training paradigm. This motivates studying the learning performance with explicit consideration of the topology structure among the devices. To do this, we propose partitioning the network layers into clusters, as depicted in Fig. 2b, where all nodes in a cluster are connected to the same node in the next layer up. There are two types of clusters: (i) limited uplink transmission (LUT) clusters, where D2D communications are enabled, and (ii) extensive uplink transmission (EUT) clusters, where, similar to conventional federated learning, all nodes only communicate with their upper layer.

Accommodating both inter- and intra-layer communications introduces a hybrid model for training that considers conventional server-device interactions in conjunction with collaborative/cooperative D2D communications. Thus, the methodology we develop in this paper is called multi-stage hybrid federated learning (MH-FL), and considers both intra-cluster consensus formation and inter-cluster aggregations for distributing ML in fog computing systems. In developing MH-FL, we incorporate the time-varying local network topologies among the devices as a dimension of federated learning, and demonstrate how it impacts ML model convergence and accuracy.

B. Related Work

Researchers have considered the effects of limited and imperfect communication capabilities in wireless networks – such as channel fading, packet loss, and limited bandwidth – on the operation of federated learning [12]–[15]. Also, communication techniques such as quantization [16] and sparsification of model updates (i.e., when only a fraction

of the model parameters are shared during model training) [17] have been studied. Recently, [18] analyzed the convergence bounds in the presence of edge network resource constraints.

Furthermore, research has considered the computation aspects of implementing federated learning in wireless networks [13], [18]–[22]. Part of this literature has focused on federated learning in the presence of straggler nodes, i.e., when a node has significantly lower computation capabilities than others and cannot train as quickly [13], [19], [21]. Another emphasis in this direction has been reducing the computation requirements of devices in the network, through coding techniques [20], intelligent raw data offloading between devices [21], and judicious selection of device training participation [23]. More general techniques for mitigating compute limitations, e.g., through model compression, have also been applied to distributed ML environments [24].

There has also been a recent line of work on hierarchical federated learning [25]–[27]. These works are mainly focused on specific use cases of two-tiered network structures above wireless cellular devices, e.g., edge clouds connected to a main server [25], [26] or small cell and macro cell base stations [27]. As compared to all the aforementioned works, which consider the star topology (or tree in the case of hierarchical considerations) of federated learning, our work is distinct for several reasons, including that: (i) we introduce a multi-layer cluster-based structure with an arbitrary height that encompasses all IoT elements between the end devices and the main server, which generalizes all the prior models, and (ii) more importantly, we explicitly consider the network dimension and topology structure among the devices at each network layer formed via cooperative/collaborative D2D communications. This migrates us from models used in contemporary literature and enables new analysis and considerations for hybrid intra- and inter-layer model learning in large-scale fog networks.

Beyond federated learning, there is a well developed literature on other distributed ML techniques (e.g., [28]–[30]). Our proposed framework for FogL inherits its model aggregation rule from federated learning, i.e., local gradient descent at the devices and weighted averaging to obtain the global model. We choose this due to specific characteristics that make it better


suited for fog: keeping the user data local, handling non-iid datasets across devices, and handling imbalances between sizes of local datasets [3]. These capabilities have made federated learning the most widely acknowledged distributed learning framework for future wireless systems [31], [32]. The multi-stage hybrid architecture of FogL could also be studied in the context of other distributed ML techniques, e.g., ADMM [28].

Finally, there exists a literature on distributed average consensus algorithms for different applications in multi-agent systems [33], [34], sensor networks [35], [36], and optimization problems [37]–[40]. The FogL scenario we consider is unique given its multi-layer network structure and focus on hybrid ML model training, where the goal is to propagate an expectation of the nodes’ parameters through the hierarchy to train an ML model. The results we obtain have thus not yet appeared in either the consensus-related or ML-related literature.

C. Summary of Contributions

Our contributions in this work can be summarized as follows:

• We formalize multi-stage hybrid federated learning (MH-FL), a new methodology for distributed ML. MH-FL extends federated learning along the network dimension, relaying the local updates of end devices through the network hierarchy via a novel multi-stage, cluster-based parameter aggregation technique. This paradigm introduces local aggregations achieved by an interplay between cooperative D2D communications and distributed consensus formation.

• We analytically characterize the upper bound of convergence of MH-FL. We demonstrate how this bound depends on characteristics of the ML model, the network topology, and the learning algorithm, including the number of model parameters, the communication graph structure, and the number of D2D rounds at different device clusters.

• We demonstrate that the model loss achieved by MH-FL under unlimited D2D rounds coincides with that of federated learning. Under the finite D2D rounds regime, we obtain a condition under which a constant optimality gap can be achieved asymptotically. We further show that under limited finely-tuned D2D rounds, MH-FL converges linearly to the optimal ML model. We further introduce a practical cluster sampling technique and investigate its convergence behavior.

• We obtain analytical relationships for tuning (i) the number of D2D rounds in different clusters at different layers and (ii) the number of global iterations to meet certain convergence criteria. We use these relationships to develop distributed control algorithms that each cluster can employ individually to adapt the number of D2D rounds it uses over time.

• Our experimental results on real-world datasets verify our theoretical findings and show that MH-FL can improve network resource utilization significantly with negligible impact on model training convergence speed and accuracy.

II. SYSTEM MODEL AND PROBLEM FORMULATION

In this section, we formalize the FogL network architecture (Sec. II-A) and the ML problem (Sec. II-B). Then, we formalize the hybrid learning paradigm via intra- and inter-cluster communications (Sec. II-C) and parameter sharing (Sec. II-D).

Fig. 3: Left: Network representation of FogL for an example network. The root of the tree corresponds to the main server, the leaves of the tree correspond to the end devices, and the nodes in-between correspond to intermediate fog nodes. The nodes belonging to the same layer that have the same parent node form clusters. In D2D-enabled clusters, the nodes form a certain topology over which the devices communicate with their neighbors for distributed model consensus formation. Right: The corresponding FogL augmented graph for analysis. Virtual nodes and clusters are highlighted in green.

A. Network Architecture and Graph Model

We consider the network architecture of FogL as depicted on the left in Fig. 3. In this network, both inter-layer and intra-layer communications will take place, for the purpose of training an ML model. The inter-layer communications are captured via a tree graph, with the main server at the root and end devices as the leaves. Each layer is partitioned into multiple clusters, with each device in a cluster sharing the same parent node in the next layer up. In general, each node may have a parent node located multiple layers above (e.g., an edge device can be directly connected to an edge server), and multiple clusters can share the same parent node (e.g., multiple groups of cellular devices share the same BS). Except at the bottom layer, each node in the hierarchy is the parent for a subset of the nodes, and is responsible for gathering the model parameters of its children nodes.

From this FogL network representation, we construct the FogL augmented network graph shown on the right of Fig. 3, where several virtual nodes and clusters have been added. Virtual nodes are added in such a way that each node has a single parent node in its immediate upper layer. Also, when multiple clusters share the same parent node, a layer is added to the FogL augmented graph that consists of multiple intermediate virtual nodes forming a virtual cluster, such that there is always a one-to-one mapping between the clusters and parent nodes. If necessary, the nodes in the highest layer before the main server will form a virtual cluster to preserve this one-to-one mapping. A node without any neighbors in its layer is also assumed to form a virtual singleton cluster. For convenience, we refer to the structure of Fig. 3 as a tree since in macro-scale it resembles a tree structure. However, in the micro-scale, nodes inside the clusters form connected graphs through which D2D communications are performed, differentiating the structure from a tree graph.
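The virtual-node padding described above can be sketched as a simple tree transformation. The dictionary representation below is our own illustration; it handles only the "single parent in the immediate upper layer" padding, not the virtual clusters for shared parents:

```python
def augment(parents, layer):
    """Insert virtual nodes so every parent link spans exactly one layer.

    parents: dict child -> parent; layer: dict node -> layer index
    (0 = main server, larger = further down the hierarchy). Returns a new
    (parents, layer) pair in which each child sits one layer below its parent.
    """
    new_parents, new_layer, counter = {}, dict(layer), 0
    for child, parent in parents.items():
        node = parent
        # create one virtual node per layer skipped by the original link
        for l in range(layer[parent] + 1, layer[child]):
            virtual = f"v{counter}"
            counter += 1
            new_parents[virtual] = node
            new_layer[virtual] = l
            node = virtual
        new_parents[child] = node
    return new_parents, new_layer

# a device 'd' connected directly to the server 's' three layers up
aug_parents, aug_layer = augment({"e": "s", "d": "s"}, {"s": 0, "e": 1, "d": 3})
```

After augmentation, every remaining link spans a single layer, so all root-to-leaf paths have equal length.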

The FogL augmented graph has the following properties: (i) each parent node has a single cluster associated with it in the layer immediately below it, (ii) the length of the paths from the root to each of the leaf nodes are the same, and (iii) the layer below the root always consists of one cluster. Formally, the network consists of $|\mathcal{L}|+1$ layers, where $\mathcal{L} \triangleq \{L_1, \cdots, L_{|\mathcal{L}|}\}$ denotes the set of layers below the server. The root/server is located at layer $L_0$ and the end devices are contained in $L_{|\mathcal{L}|}$. Inside each layer $L_j$, $1 \leq j \leq |\mathcal{L}|$, there exists a set of clusters


indexed by $L_{j,1}, L_{j,2}, \cdots$ (see Fig. 3). We let the nodes freely move between the clusters in the same layer. To capture these dynamics, we use $\mathcal{L}^{(k)}_{j,i}$ (i.e., calligraphic font) to refer to the set of nodes inside cluster $L_{j,i}$ at learning iteration $k$ (described in Sec. II-B). For ease of presentation, we assume that the number of clusters at each layer is time invariant, i.e., each cluster always contains at least one node and nodes do not form new clusters. We let $\mathcal{N}_j$ denote the set of nodes located in layer $L_j$, where $N_j \triangleq |\mathcal{N}_j|$ is not time varying since nodes do not join and leave the layers.

For convenience, we will sometimes use $C$ to denote an arbitrary cluster (i.e., any $L_{j,i}$) and $\mathcal{C}^{(k)}$ (i.e., calligraphic font) to refer to its set of nodes at iteration $k$. We also sometimes use $n$ to denote an arbitrary node. For each node $n$ located in layers $L_{|\mathcal{L}|-1}, \cdots, L_0$, we let $Q(n)$ denote the index of its child cluster and $\mathcal{Q}^{(k)}(n)$ denote the set of its children nodes (i.e., the nodes in $Q(n)$) at global iteration $k$.

B. Machine Learning Model

Each end device $n$ is associated with a dataset $\mathcal{D}_n$. Each element $d_i \in \mathcal{D}_n$ of a dataset, called a training sample, is represented via a feature vector $x_i$ and a label $y_i$ for the ML task of interest. For example, in image classification, $x_i$ may be the RGB colors of all pixels in the image, and $y_i$ may be the identity of the person in the image sample. The goal of the ML task is to learn the parameters $w \in \mathbb{R}^M$ of a particular $M$-dimensional model (e.g., an SVM or a neural network) that are expected to maximize the accuracy in mapping from $x_i$ to $y_i$ across any sample in the network. The model is associated with a loss $f(w, x_i, y_i)$, referred to as $f_i(w)$ for brevity, that quantifies the error of parameter realization $w$ on $d_i$. We refer to Table 1 in [18] for a list of common ML loss functions.

The global loss of the ML model is formulated as
$$F(w) = \frac{1}{D} \sum_{n \in \mathcal{N}_{|\mathcal{L}|}} |\mathcal{D}_n| f_n(w), \qquad D = \sum_{n \in \mathcal{N}_{|\mathcal{L}|}} |\mathcal{D}_n|, \tag{1}$$

where $f_n$ is the local loss at node $n$, i.e., $f_n(w) = \frac{1}{|\mathcal{D}_n|} \sum_{d_i \in \mathcal{D}_n} f_i(w)$. The goal of model training is to identify the optimal parameter $w^*$ that minimizes the global loss:
$$w^* = \arg\min_{w \in \mathbb{R}^M} F(w). \tag{2}$$
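As a quick numerical check of (1)–(2): for a stand-in quadratic local loss, the global loss is the dataset-size-weighted average of the local losses, and its minimizer is the pooled sample mean. The datasets and loss below are illustrative only, not from the paper:

```python
import numpy as np

def global_loss(w, datasets, local_loss):
    """F(w) from (1): local losses f_n weighted by |D_n| / D."""
    D = sum(len(d) for d in datasets)
    return sum(len(d) * local_loss(w, d) for d in datasets) / D

# stand-in local loss: mean squared deviation of the samples from scalar w
mse = lambda w, d: float(np.mean((np.asarray(d) - w) ** 2))

datasets = [[1.0, 2.0], [4.0], [0.0, 1.0, 2.0]]
w_star = np.mean(np.concatenate([np.asarray(d) for d in datasets]))  # solves (2) here
```

Note the weighting: the singleton dataset contributes only 1/6 of the total loss, which is exactly why larger local datasets pull the global optimum toward themselves.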

To achieve this in a distributed manner, training is conducted through consecutive global iterations. At the start of global iteration $k \in \mathbb{N}$, the main server possesses a parameter vector $w^{(k-1)} \in \mathbb{R}^M$, which propagates downstream through the hierarchy to the end devices. Each end device $n$ overrides its current local model parameter vector $w_n^{(k-1)}$ according to
$$w_n^{(k-1)} = w^{(k-1)}, \tag{3}$$
and then performs a local update using gradient descent [3] as
$$w_n^{(k)} = w_n^{(k-1)} - \beta \nabla f_n(w_n^{(k-1)}), \quad n \in \mathcal{N}_{|\mathcal{L}|}, \tag{4}$$
where $\beta$ is the step-size. The main server wishes to obtain the global model used for initiating the next global iteration, which is defined as a weighted average of end devices’ parameters:
$$w^{(k)} = \frac{\sum_{n \in \mathcal{N}_{|\mathcal{L}|}} |\mathcal{D}_n| w_n^{(k)}}{D}. \tag{5}$$

Similar to the literature on federated learning, the parameter $D$ in (5) is assumed to be known at the server. As we will see, our method does not require knowledge of the size of each user's dataset at the server, which eliminates communication overhead. On the downlink from the server to the edge devices, we assume that the (common) global parameter $w^{(k-1)}$ can be readily shared through the hierarchy to reach the end devices, e.g., through a broadcasting protocol. These devices will then conduct (3) and (4) locally. The challenge, then, is computing (5) at the main server. In federated learning, the devices will directly upload (4) to the server for this. This is prohibitive in a large-scale fog computing system, however. For one, it may require high energy consumption: uplink transmissions from battery-limited devices to nodes at a higher layer typically correspond to long physical distances, and can deplete individual device batteries. Also, it may induce high network traffic and long latencies: a neural network with even hundreds of parameters, which would be small by today's standards [24], would require transmission of billions of parameters in the upper layers during each iteration over a network with millions of nodes. Additionally, it may overload current cellular and vehicular architectures: these infrastructures are not designed to handle large jumps in the number of active users [41], which would be the case with simultaneous uplink transmissions at the bottom-most layer. These issues require a novel approach to parameter aggregations, which we develop in this paper.
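The traffic argument above is simple arithmetic; with illustrative numbers (our own, not the paper's), star-topology uploads scale as the product of device count and model dimension:

```python
# Parameters crossing the uplink per global iteration in star-topology
# federated learning: each of N devices transmits its full M-dimensional model.
def star_uplink_volume(num_devices, model_dim):
    return num_devices * model_dim

# e.g., ten million devices, each holding a 500-parameter model
volume = star_uplink_volume(10_000_000, 500)  # billions of parameters per iteration
```

Local aggregation at intermediate layers attacks exactly this product: each layer forwards a number of parameter vectors proportional to its node count rather than to the full device population below it.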

C. Hybrid Learning via Intra- and Inter-Layer Communications

The main server in FogL is only interested in the weighted average of the local parameters (5). Consequently, we propose local aggregations at each network layer. To achieve this, each cluster in Fig. 3 follows one of two mechanisms:
1) Distributed aggregation: The nodes engage in a cooperative scheme facilitated by D2D communications to realize the consensus/average of their local model parameters. The parent node then samples parameters of one of the children and scales it by the number of children to calculate an approximate sum of the children nodes’ parameters.
2) Instant aggregation: Each node instead uploads its local model to the parent node. The parent computes the aggregation directly as a sum of the children nodes’ parameters.
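The two mechanisms can be contrasted concretely. The sketch below is our own: `lut_aggregate` stands in for the D2D consensus procedure (detailed in Sec. III) using plain neighbor averaging on a fixed graph, while `eut_aggregate` is the direct sum:

```python
import numpy as np

def eut_aggregate(children):
    """Instant aggregation: parent receives every child's model and sums them."""
    return np.sum(children, axis=0)

def lut_aggregate(children, adjacency, rounds):
    """Distributed aggregation: children run `rounds` of average consensus over
    the D2D graph; the parent samples one node and scales by cluster size."""
    w = np.array(children, dtype=float)
    deg = adjacency.sum(axis=1, keepdims=True)
    alpha = 1.0 / (deg.max() + 1)                  # step size keeping mixing stable
    for _ in range(rounds):
        w = w + alpha * (adjacency @ w - deg * w)  # each node averages with neighbors
    return len(children) * w[0]                    # sample node 0, scale by |cluster|
```

With enough D2D rounds, sampling one node and scaling by the cluster size approaches the exact sum, at the cost of short-range local exchanges instead of long-range uplink transmissions.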

The communication requirement of the instant aggregation is significantly higher than the distributed aggregation: even though the distributed aggregation may require multiple communication rounds, it generally occurs over much shorter distances. We refer to mechanisms 1 and 2 as limited uplink transmission (LUT) and extensive uplink transmission (EUT), respectively. Not all clusters can operate in LUT mode, since not all are D2D enabled, e.g., due to sparse connections between devices. Still, we can expect significant advantages in terms of the volume of parameters uploaded through the system with a combination of EUT and LUT clusters, as depicted in Fig. 4. At the bottom layers where communication is mostly over the air, D2D can also be implemented as out-band D2D [42] with an additional advantage of not occupying the licensed spectrum.

We allow a cluster to switch between EUT and LUT over time. For instance, the connectivity between a fleet of miniature

Fig. 4: Example of the network traffic reduction provided by multi-layer aggregations in FogL, where each device trains a model with parameter length M. The sum of the lengths of parameters transferred among the layers (BTot = 42M, 20M, 12M, and 7M, respectively) is depicted at the top of each subfigure. (a) Network consisting of all EUT clusters, where the parent nodes upload all received parameters from their children. (b) Network consisting of all EUT clusters, where the parent nodes sum all received parameters and upload the result to the next layer. (c) Network with a portion of clusters in LUT mode, where each parent node only samples the parameters of one device after consensus formation. (d) Network consisting of all LUT clusters.

UAVs will vary as they travel, necessitating EUT when D2D is not feasible. To capture this dynamic, for each cluster C at global iteration k, the indicator 1_{\{C\}}^{(k)} captures the operating mode, which is 1 if the cluster operates in LUT mode and 0 otherwise. In each LUT cluster, a node will only communicate with its neighboring devices, which may not include the whole cluster. Further, each node's neighborhood may evolve over time. For example, when communications are conducted over the air as in Fig. 2a, the neighbors in one aggregation interval are identified based on the distances among the nodes and their transmit powers. We will explicitly consider such evolutions in cluster topology in our distributed consensus model in Sec. III.
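The per-level savings illustrated in Fig. 4 can be sketched numerically. In the snippet below (the cluster sizes are our own illustrative assumptions, not those of Fig. 4), an EUT cluster of size c uploads all c parameter vectors (c·M values), while an LUT cluster uploads only the single sampled vector (M values):

```python
# Minimal sketch of per-level uplink volume for EUT vs. LUT clusters;
# cluster sizes are illustrative assumptions.
def uplink_volume(cluster_sizes, lut_flags, M=1):
    """Total parameter volume the clusters at one level send upward."""
    return sum(M if is_lut else c * M
               for c, is_lut in zip(cluster_sizes, lut_flags))

# Three clusters of sizes 4, 4, 6, model parameter length M = 1:
all_eut = uplink_volume([4, 4, 6], [False] * 3)   # all children upload
all_lut = uplink_volume([4, 4, 6], [True] * 3)    # one sample per cluster
assert all_lut < all_eut
```

As expected, the LUT level uploads one vector per cluster regardless of cluster size, mirroring the BTot reduction from (a) to (d) in Fig. 4.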

In the simple case where there is only one layer below the server, i.e., |L| = 1, consisting of a single EUT cluster, the FogL architecture reduces to federated learning. If instead there is just one layer of one LUT cluster with no server, FogL resembles fully distributed learning [43], [44]. One of our contributions is developing and analyzing this generalized cluster-based multi-layer hybrid learning paradigm for FogL.

D. Parameter Sharing vs. Gradient Sharing

Note that the parameter update in (5) can be written as:

w^{(k)} = \frac{\sum_{n \in \mathcal{N}_{|L|}} |D_n| \left( w_n^{(k-1)} - \beta \nabla f_n\big(w_n^{(k-1)}\big) \right)}{D} = w^{(k-1)} - \beta \frac{\sum_{n \in \mathcal{N}_{|L|}} |D_n| \nabla f_n\big(w_n^{(k-1)}\big)}{D}.   (6)

This asserts that the global parameters can also be obtained via the gradients of the devices, implying that the devices can either share their gradients or their parameters during training. This equivalence arises from the one-step update in (4), which is a common assumption in federated learning [12], [13], [45]. However, recent work [18], [21] has advocated conducting multiple rounds of local updates between global aggregations to reduce communication costs; in this case, the parameters are required for each aggregation. In this paper, we focus on the more general case of parameter sharing, though we obtain one theoretical result (Proposition 3) based on gradient sharing.
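The equivalence in (6) is easy to check numerically. The snippet below is a minimal sketch assuming quadratic local losses f_n(w) = ½‖w − c_n‖² (our own illustrative choice), for which ∇f_n(w) = w − c_n:

```python
import numpy as np

# Verify (6): averaging the devices' one-step local updates (parameter
# sharing) equals one global weighted-gradient step (gradient sharing).
rng = np.random.default_rng(0)
beta = 0.1
sizes = np.array([3, 5, 2])            # |D_n| for three devices
D = sizes.sum()
centers = rng.normal(size=(3, 4))      # c_n defining each local loss
w_prev = rng.normal(size=4)            # w^(k-1)

grads = w_prev - centers               # local gradients at w^(k-1)
local = w_prev - beta * grads          # one-step local updates, eq. (4)

lhs = (sizes[:, None] * local).sum(axis=0) / D                    # eq. (5)
rhs = w_prev - beta * (sizes[:, None] * grads).sum(axis=0) / D    # eq. (6)
assert np.allclose(lhs, rhs)
```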

In LUT clusters, devices leverage D2D communications to obtain an approximate value of the average of their parameters. A basic approach would be to implement a message passing algorithm where nodes exchange parameters with their neighbors until each node in the cluster has all parameters stored locally. Each node can then readily calculate the aggregated value, and one of them can be sampled by the parent node. Collecting a table of parameters at each node may not be feasible, however, given how large these vectors can be for contemporary ML models, as discussed in Sec. II-B. Instead, we desire a technique that (i) does not require any node to store a table of all model parameters in the cluster, (ii) can be implemented in a distributed manner via D2D, and (iii) is generalizable across different network layers. In the following, we leverage distributed average consensus methods for this.

III. MULTI-STAGE HYBRID FEDERATED LEARNING (MH-FL)

In this section, we develop our MH-FL methodology (Sec. III-A & III-B). Then, we conduct a detailed performance analysis of our method (Sec. III-C). Finally, based on this analysis, we develop online control algorithms for tuning the number of D2D rounds in each cluster over time to guarantee different convergence behaviors for our method (Sec. III-D).

All proofs have been deferred to the supplementary material.

A. Distributed Average Consensus within Clusters

Referring to Fig. 3, during each iteration k of training, the LUT clusters engage in D2D communications, where each node aims to estimate the average value of the parameters inside its cluster. For each cluster C that is LUT during the kth global aggregation iteration (i.e., for which 1_{\{C\}}^{(k)} = 1), we let G_C^{(k)} = (C^{(k)}, E_C^{(k)}) denote its communication graph. In this graph, C^{(k)} is the set of nodes belonging to the cluster, and there is an edge (m, n) ∈ E_C^{(k)} between nodes m, n ∈ C^{(k)} iff they can communicate via D2D during k. We assume that G_C^{(k)} is undirected, connected, and static for the duration of one iteration k, though it may vary across iterations.

We employ the general family of linear distributed consensus algorithms [46], where during global iteration k, the nodes inside LUT cluster C conduct θ_C^{(k)} ∈ ℕ rounds of D2D.¹ Each round of D2D consists of parameter transfers between neighboring nodes. Formally, during global iteration k, each node n ∈ C^{(k)} engages in the following iterative updates for t = 0, ..., θ_C^{(k)} − 1:

z_n^{(t+1)} = v_{n,n}^{(k)} z_n^{(t)} + \sum_{m \in \zeta^{(k)}(n)} v_{n,m}^{(k)} z_m^{(t)},   (9)

where z_n^{(0)} = w̃_n^{(k)} corresponds to node n's initial parameter vector, and z_n^{(θ_C^{(k)})} denotes the parameter after the D2D consensus process. ζ^{(k)}(n) denotes the set of neighbors of node n during global iteration k, and v_{n,p}^{(k)}, p ∈ {n} ∪ ζ^{(k)}(n), are the consensus weights associated with node n during k. We assume that each node has knowledge of these weights at each global iteration. For instance, one common choice of these weights [46] gives

z_n^{(t+1)} = z_n^{(t)} + d_C^{(k)} \sum_{m \in \zeta^{(k)}(n)} \big( z_m^{(t)} - z_n^{(t)} \big),   0 < d_C^{(k)} < 1/D_C^{(k)},

where D_C^{(k)} is the maximum degree of the nodes in G_C^{(k)}. With this implementation, the nodes inside LUT cluster C only need knowledge of the parameter d_C^{(k)} to perform D2D communications, which can be

¹The number of D2D rounds is a design parameter obtained in Sec. III-C.


w̃_{n'}^{(k)} = \frac{\sum_{j \in Q^{(k)}(n)} |D_j|\, w_j^{(k-1)}}{|Q^{(k)}(n)|} - \frac{\sum_{j \in Q^{(k)}(n)} \beta |D_j|\, \nabla f_j\big(w_j^{(k-1)}\big)}{|Q^{(k)}(n)|} + \mathbf{1}_{\{Q(n)\}}^{(k)} c_{n'}^{(k)},   n' \in Q^{(k)}(n)   (7)

w̃_{n'}^{(k)} = \frac{\sum_{i \in Q^{(k)}(n)} \sum_{j \in Q^{(k)}(i)} |D_j|\, w_j^{(k-1)}}{|Q^{(k)}(n)|} - \frac{\sum_{i \in Q^{(k)}(n)} \sum_{j \in Q^{(k)}(i)} \beta |D_j|\, \nabla f_j\big(w_j^{(k-1)}\big)}{|Q^{(k)}(n)|} + \frac{\sum_{i \in Q^{(k)}(n)} \mathbf{1}_{\{Q(i)\}}^{(k)} |Q^{(k)}(i)|\, c_{i'}^{(k)}}{|Q^{(k)}(n)|} + \mathbf{1}_{\{Q(n)\}}^{(k)} c_{n'}^{(k)},   n' \in Q^{(k)}(n)   (8)

broadcast by the respective parent node. Note that due to time constraints (i.e., depending on the required time between global aggregations), the number of D2D rounds cannot be arbitrarily large, and thus the nodes inside a cluster often do not have a perfect estimate of the average value of their parameters. In Sec. III-C, we will analyze the effect of the number of D2D rounds for general choices of v_{n,p}^{(k)}, p ∈ {n} ∪ ζ^{(k)}(n), ∀n.
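The consensus iteration (9) with the max-degree weights can be sketched as follows; the 4-node ring graph, the scalar parameter values, and d = 0.3 are illustrative assumptions:

```python
import numpy as np

# Linear consensus update z_n <- z_n + d * sum_m (z_m - z_n) on a
# connected 4-node ring (max degree 2, so any 0 < d < 1/2 works).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
neighbors = {n: [] for n in range(4)}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

z = np.array([1.0, 3.0, 5.0, 7.0])   # initial (scalar) parameters
d = 0.3                               # 0 < d < 1/max_degree
for _ in range(50):                   # theta rounds of D2D
    z = z + d * np.array([sum(z[m] - z[n] for m in neighbors[n])
                          for n in range(4)])

assert np.allclose(z, 4.0, atol=1e-6)  # all nodes approach the average
```

With more D2D rounds the estimate tightens geometrically, which is exactly the role the exponent θ plays in the bounds of Sec. III-C.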

B. Local Aggregations and Parameter Propagation

In the following, we introduce a scaled parameter for each node n, denoted by w̃_n, and develop an approach to perform global aggregations based on manipulation and relaying of these parameters across different layers of the network. The definition of w̃_n depends on the layer where node n is located.

1) Nodes' parameters in the bottom-most layer: Given w^{(k-1)}, the end devices in layer L_{|L|} first perform the local update described in (4). Since the nodes' parameters go through multiple stages of aggregation (see Fig. 4), even if the server knows the number of data points of each device, it still cannot recover the individual parameters from the aggregated ones to calculate (5). To overcome this, each device n ∈ N_{|L|} obtains its scaled parameter as w̃_n^{(k)} = |D_n| w_n^{(k)} and shares it with its neighbors during the D2D process for global iteration k, i.e., its parameters weighted by its number of datapoints.² Using this weighting technique, the number of datapoints of each node is encoded in the multi-layer aggregations. Specifically, each node n ∈ C^{(k)} belonging to cluster C located in L_{|L|} engages in the local iterations described by (9), where z_n^{(0)} = w̃_n^{(k)}. Finally, the node stores w̃_n^{(k)} = z_n^{(θ_C^{(k)})}, which corresponds to the final weighted local parameter value after the D2D process.

2) Nodes' parameters in the middle layers: Once D2D communications are finished in layer L_{|L|}, each parent node n ∈ N_{|L|-1} of a cluster that operated in LUT mode selects a cluster head n' among its children Q^{(k)}(n) in layer L_{|L|} based on a selection/sampling distribution. This child n' uploads its parameter vector w̃_{n'}^{(k)} to the parent node. The resulting sampled parameter vector at the parent node is given by (7), where the first two terms correspond to the true average of the parameters of the nodes inside cluster Q(n), and c_{n'}^{(k)} ∈ R^M is the error arising from the consensus. This error is only applicable to LUT clusters, and captures the deviation from the true cluster average (which would be obtained from an EUT cluster). The parent node n ∈ N_{|L|-1} then computes its scaled parameter w̃_n^{(k)} by scaling the received vector by the number of its children, and stores the corresponding vector:

w̃_n^{(k)} = |Q^{(k)}(n)|\, w̃_{n'}^{(k)},   n ∈ N_{|L|-1},  n' ∈ Q^{(k)}(n),  \mathbf{1}_{\{Q(n)\}}^{(k)} = 1.   (10)

²Nodes inside EUT clusters of layer L_{|L|} directly share their scaled parameters with their parents.

Also, in layer L_{|L|-1}, each parent node n of a cluster that operated in EUT mode receives all the parameters of its children and computes w̃_n^{(k)} = \sum_{i \in Q^{(k)}(n)} w̃_i^{(k)}. Once this is completed, the LUT clusters located in layer L_{|L|-1} engage in distributed consensus formation via cooperative D2D.

This procedure continues up the hierarchy at each layer L_j, j = |L|-2, ..., 1. More precisely, for an LUT cluster C located in one of the middle layers, each node n ∈ C^{(k)} performs the local iterations described by (9) with initialization z_n^{(0)} = w̃_n^{(k)}. At the end of the D2D rounds, each of these nodes stores the last obtained parameter w̃_n^{(k)} = z_n^{(θ_C^{(k)})}, and passes it up if sampled by its parent. At the same time, each node n inside an EUT cluster located in one of the middle layers directly shares w̃_n^{(k)} for the calculation of the local aggregation.

Traversing up the layers, the consensus errors accumulate. For example, (8) gives the expression for the final consensus parameter vector of a node n' ∈ Q^{(k)}(n) inside an LUT cluster at layer L_{|L|-1}, which will be sampled by a parent node n ∈ N_{|L|-2} located in the upper layer. In this expression, i' denotes the sampled node chosen by parent node i.

3) Main server: Let w̃_0^{(k)} denote the scaled parameter at the main server. If the cluster in layer L_1 operates in LUT mode, then w̃_0^{(k)} = |L_{1,1}^{(k)}| w̃_m^{(k)}, where m ∈ L_{1,1}^{(k)} denotes the node sampled by the main server, with parameter w̃_m^{(k)} obtained after performing D2D rounds³. Otherwise, the main server sums all the received parameters, i.e., w̃_0^{(k)} = \sum_{m \in L_{1,1}^{(k)}} w̃_m^{(k)}. The main server then computes the global parameter vector as:

w^{(k)} = w̃_0^{(k)} / D,   (11)

which is then broadcast down the hierarchy to start the next global round k+1, beginning with local updates at the devices.
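The error-free (all-EUT) version of this scaled-parameter relay can be sketched on a toy two-layer tree; the cluster sizes and parameter values below are illustrative assumptions:

```python
import numpy as np

# Two bottom-layer clusters of end devices; each parent sums the scaled
# parameters |D_n| * w_n of its children, and the server divides by D
# as in (11), recovering the weighted average (5) exactly.
clusters = [
    {"sizes": np.array([2, 4]), "params": np.array([[1.0], [2.0]])},
    {"sizes": np.array([1, 3]), "params": np.array([[3.0], [5.0]])},
]
D = sum(c["sizes"].sum() for c in clusters)          # total datapoints

parent_tilde = [(c["sizes"][:, None] * c["params"]).sum(axis=0)
                for c in clusters]                    # per-parent sums
w_global = sum(parent_tilde) / D                      # server step, eq. (11)

# Matches the direct weighted average over all devices, eq. (5):
all_sizes = np.concatenate([c["sizes"] for c in clusters])
all_params = np.concatenate([c["params"] for c in clusters])
direct = (all_sizes[:, None] * all_params).sum(axis=0) / D
assert np.allclose(w_global, direct)
```

In LUT clusters the per-parent sums are replaced by |Q^{(k)}(n)| times a sampled post-consensus value, introducing the error terms c in (7) and (8).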

The MH-FL methodology we developed throughout this section is summarized in Algorithm 1. The lines beginning with ** and ## are enhancements for tuning the D2D rounds over time at different clusters, which we present in Sec. III-D.

C. Theoretical Analysis of MH-FL

One of the key contributions of MH-FL is the integration of D2D communications with multi-layer parameter transfers. As discussed, D2D communications are conducted through time-varying topology structures among the nodes, which introduce a network dimension to model training. Studying this effect under a limited D2D rounds regime is the main theme of our theoretical analysis. Before presenting our main results, we first introduce a few assumptions and a definition.

³According to the FogL augmented graph properties, L_1 consists of one cluster, and thus |L_{1,1}^{(k)}| = |N_1|, which is inherently time invariant. The superscript k is added for consistency in calligraphic notations.


Algorithm 1: Multi-stage hybrid federated learning (MH-FL)
input: number of global aggregations K, default number of consensus rounds θ_{L_{j,i}}^{(k)} ∀i, j, k.
output: Final global model w^{(K)}.
1  Initial operations at main server: Initialize the global parameter w^{(0)} and synchronize the edge devices with it.
2  ** Initial operations at main server: If asymptotic convergence to optimal is desired: (i) the server randomly sets ‖∇F(w^{(0)})‖ and broadcasts it, (ii) the server sets δ either arbitrarily or according to (24) to guarantee a certain accuracy, and broadcasts it. Otherwise, the server sets the D2D control parameters {σ_j}_{j=1}^{|L|} as described in Sec. III-D and broadcasts them.
3  for k = 1 to K do
4    for l = |L| down to l = 0 do
5      if l = |L| then
6        Given w^{(k-1)}, each node n obtains w_n^{(k)} using (4).
7        Each node n obtains its scaled parameter w̃_n^{(k)} = |D_n| w_n^{(k)}.
8        ** Each cluster operating in LUT mode runs Algorithm 2.
9        ## Each cluster operating in LUT mode runs Algorithm 3.
10       Nodes inside LUT clusters update their parameters using (9).
11     else if 1 ≤ l ≤ |L| − 1 then
12       Each parent node n of an LUT cluster Q(n) samples a child n' ∈ Q^{(k)}(n) and computes w̃_n^{(k)} = |Q^{(k)}(n)| w̃_{n'}^{(k)}.
13       Each parent node n of an EUT cluster Q(n) uses the received parameters to compute w̃_n^{(k)} = \sum_{i \in Q^{(k)}(n)} w̃_i^{(k)}.
14       ** Each cluster operating in LUT mode runs Algorithm 2.
15       ## Each cluster operating in LUT mode runs Algorithm 3.
16       Nodes inside LUT clusters update their parameters using (9).
17     else if l = 0 then
18       if 1_{\{L_{1,1}\}}^{(k)} = 1 then
19         The server computes w̃_0^{(k)} = |L_{1,1}^{(k)}| w̃_m^{(k)}, where m ∈ L_{1,1}^{(k)}.
20       else
21         The server computes w̃_0^{(k)} = \sum_{m \in L_{1,1}^{(k)}} w̃_m^{(k)}.
22       The server computes w^{(k)} using (11) and broadcasts it.
23       ** If asymptotic convergence to optimal is desired, the main server approximates the gradient of the loss function used for the next iteration as in Sec. III-D and broadcasts it.

Assumption 1. The global ML loss function (1) has the following properties: (i) µ-strong convexity, i.e., F(y) ≥ F(x) + (y − x)^⊤ ∇F(x) + (µ/2)‖y − x‖², for some µ > 0, ∀x, y; and (ii) η-smoothness, i.e., ‖∇F(x) − ∇F(y)‖ ≤ η‖x − y‖, for some η > µ, ∀x, y.

The above properties are common assumptions in the federated learning and ML literature [12], [45], [47]–[49]. Note that convex ML loss functions, such as squared-SVM and linear regression, are used in practice with an additional regularization term to improve convergence and avoid model overfitting, which makes them strongly convex.

We will also find it useful to write the consensus algorithm in matrix form. Letting W̃_C^{(k)} ∈ R^{|C^{(k)}| × M} denote the matrix of scaled parameters across all nodes in an LUT cluster C prior to consensus in iteration k, the evolution of the nodes' parameters described by (9) can be written as:

\big[ \hat{W}_C^{(k)} \big]_m = \big( V_C^{(k)} \big)^{θ_C^{(k)}} \big[ \tilde{W}_C^{(k)} \big]_m,   (12)

where [W̃_C^{(k)}]_m is the mth column of W̃_C^{(k)}, m = 1, ..., M, corresponding to the mth scaled parameter across nodes in the cluster, and Ŵ_C^{(k)} denotes the matrix of node parameters after the consensus. V_C^{(k)} ∈ R^{|C^{(k)}| × |C^{(k)}|} denotes the consensus matrix applied to each parameter vector to realize (9).

Assumption 2. The consensus matrix V_C^{(k)} for each LUT cluster C has the following properties [46], [50]: (i) (V_C^{(k)})_{m,n} = 0 if (m, n) ∉ E_C^{(k)}; (ii) V_C^{(k)} \mathbf{1} = \mathbf{1}; (iii) V_C^{(k)} = (V_C^{(k)})^⊤; and (iv) ρ(V_C^{(k)} − \mathbf{1}\mathbf{1}^⊤/|C^{(k)}|) ≤ λ_C^{(k)} < 1, where \mathbf{1} is the vector of 1s and ρ(A) is the spectral radius of matrix A.

In Assumption 2, λ_C^{(k)} can be interpreted as an upper bound on the spectral radius, which plays a key role in our results.
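For the max-degree weight design of Sec. III-A, the properties in Assumption 2 can be verified numerically; the 4-node ring graph and d = 0.3 below are illustrative assumptions:

```python
import numpy as np

# Check Assumption 2 for V = I - d*L on a 4-node ring (d < 1/max_degree).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)     # ring adjacency
L = np.diag(A.sum(axis=1)) - A                # graph Laplacian
V = np.eye(4) - 0.3 * L                       # consensus matrix

one = np.ones(4)
assert np.allclose(V @ one, one)              # (ii) V 1 = 1
assert np.allclose(V, V.T)                    # (iii) symmetric
rho = max(abs(np.linalg.eigvals(V - np.outer(one, one) / 4)))
assert rho < 1                                # (iv) spectral radius bound
```

Here ρ evaluates to 0.4, so λ can be taken as any value in [0.4, 1); a smaller ρ (better-connected cluster) yields faster consensus, consistent with the role of λ in the bounds below.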

Definition 1. The divergence of parameters in cluster C at iteration k, denoted by Υ_C^{(k)}, is defined as an upper bound on the difference of the belonging nodes' scaled parameters as follows:

‖w̃_q^{(k)} − w̃_{q'}^{(k)}‖ ≤ Υ_C^{(k)},   ∀q, q' ∈ C^{(k)}.   (13)

At the bottom-most layer, this quantity is indicative of the degree of data heterogeneity (i.e., the level of non-i.i.d.-ness) among the nodes in a cluster, whereas in the upper layers it captures the heterogeneity of the data contained in the sub-trees whose roots are the nodes in the cluster. We will see in Theorem 1 how it impacts the convergence bound, and in the subsequent results how it dictates the D2D rounds required for optimality guarantees. Then, in Sec. III-D, we will develop control algorithms that approximate the divergence in a distributed manner at every cluster of the network, and use it to control the convergence rate.

1) General convergence bound: In the following theorem, we establish the convergence bound of MH-FL (see Appendix A):

Theorem 1. With learning rate β = 1/η, after k global iterations of any realization of MH-FL, an upper bound on F(w^{(k)}) − F(w^*) is given in (14), where Φ = N_{|L|-1} + N_{|L|-2} + ··· + N_1 + 1 is the total number of nodes in the network besides the bottom layer, and Ξ^{(k−t)} is given by (15).

Remark 1. Each nested sum in (15), e.g., (b), encompasses the nodes located in a path starting from the main server and ending at a node in one of the layers. Different nested sums capture paths with different lengths. Also, each summand, e.g., (c), corresponds to the characteristics of the child cluster, e.g., Q(a_{|L|-1}) in (c), of the node in the last index of its associated nested sum, e.g., a_{|L|-1} in (b). This includes the operating mode, number of nodes, upper bound on the spectral radius, number of D2D rounds, and divergence of parameters.

Main takeaways. The bounds in (14) and (15) quantify how the convergence depends on several learning and system parameters. In particular, we see a dependence on (i) the characteristics of the loss function (i.e., η, µ), (ii) the number of nodes and clusters at each network layer (through the |Q| terms), (iii) the topology and characteristics of the communication graph among the nodes inside the clusters, captured via the spectral radius bounds (i.e., the λ), (iv) the number of D2D rounds performed at each cluster (i.e., the θ), (v) the dimension of the ML model (i.e., M), and (vi) the divergence among the node parameters at each cluster (i.e., the Υ). Given a fixed set of parameters at iteration k − 1, we observe that increasing the number of D2D rounds at each cluster in iteration k results in a smaller bound (since θ appears as the exponent of λ < 1), and thus a better expected model loss, as we would expect. Furthermore, for a fixed number of D2D rounds, a


F(w^{(k)}) - F(w^*) \le \underbrace{\Big(\frac{\eta-\mu}{\eta}\Big)^k \Big(F(w^{(0)}) - F(w^*)\Big)}_{(a)} + \frac{\eta \Phi M^2}{2D^2} \sum_{t=0}^{k-1} \Big(\frac{\eta-\mu}{\eta}\Big)^t \Xi^{(k-t)}   (14)

\Xi^{(k-t)} = \underbrace{\sum_{a_1 \in L_{1,1}^{(k-t)}} \sum_{a_2 \in Q^{(k-t)}(a_1)} \cdots \sum_{a_{|L|-1} \in Q^{(k-t)}(a_{|L|-2})}}_{(b)} \underbrace{\mathbf{1}_{\{Q(a_{|L|-1})\}}^{(k-t)} \big|Q^{(k-t)}(a_{|L|-1})\big|^4 \big(\lambda_{Q(a_{|L|-1})}^{(k-t)}\big)^{2\theta_{Q(a_{|L|-1})}^{(k-t)}} \big(\Upsilon_{Q(a_{|L|-1})}^{(k-t)}\big)^2}_{(c)}
+ \sum_{a_1 \in L_{1,1}^{(k-t)}} \sum_{a_2 \in Q^{(k-t)}(a_1)} \cdots \sum_{a_{|L|-2} \in Q^{(k-t)}(a_{|L|-3})} \mathbf{1}_{\{Q(a_{|L|-2})\}}^{(k-t)} \big|Q^{(k-t)}(a_{|L|-2})\big|^4 \big(\lambda_{Q(a_{|L|-2})}^{(k-t)}\big)^{2\theta_{Q(a_{|L|-2})}^{(k-t)}} \big(\Upsilon_{Q(a_{|L|-2})}^{(k-t)}\big)^2
+ \cdots + \sum_{a_1 \in L_{1,1}^{(k-t)}} \mathbf{1}_{\{Q(a_1)\}}^{(k-t)} \big|Q^{(k-t)}(a_1)\big|^4 \big(\lambda_{Q(a_1)}^{(k-t)}\big)^{2\theta_{Q(a_1)}^{(k-t)}} \big(\Upsilon_{Q(a_1)}^{(k-t)}\big)^2
+ \mathbf{1}_{\{L_{1,1}\}}^{(k-t)} \big|L_{1,1}^{(k-t)}\big|^4 \big(\lambda_{L_{1,1}}^{(k-t)}\big)^{2\theta_{L_{1,1}}^{(k-t)}} \big(\Upsilon_{L_{1,1}}^{(k-t)}\big)^2   (15)

smaller spectral radius, corresponding to a better connected cluster, results in a smaller bound. On the other hand, a larger parameter divergence results in a worse bound.

The term (a) in (14) corresponds to the case where there is no consensus error in the system, i.e., when all LUT clusters have θ_C^{(k)} → ∞ (infinite D2D rounds) or when the network consists of all EUT clusters. Since (η−µ)/η < 1, the overall rate of convergence of MH-FL is at best linear with rate 1 − µ/η. Achieving this would incur prohibitively long delays, however, which motivates us to carefully investigate the effects of the number of D2D rounds. Also, note that the terms ((η−µ)/η)^t inside the summation have a dampening effect: at global iteration k, the errors from global iteration t < k are multiplied by ((η−µ)/η)^{k−t}, meaning the initial errors for t ≪ k are dampened by very small coefficients, while the later errors have a more pronounced effect on the bound. At first glance, this seems to suggest that at higher global iteration indices, more D2D rounds are needed to reduce the errors. However, we can expect the parameters of the end devices to become more similar to one another with increasing global iteration count, which in turn decreases the divergence (i.e., Υ) within clusters over time. Further, the connectivity of the clusters (captured by λ) will differ across the layers of the hierarchy, causing the spectral radius to vary. This motivates us in Sec. III-D to adapt the D2D rounds over two dimensions: time (i.e., global iterations) and space (i.e., network layers).

2) Asymptotic optimality: We now explicitly connect the number of D2D rounds performed at different clusters with the asymptotic optimality of MH-FL (see Appendix B).

Proposition 1. For any realization of MH-FL, if the number of D2D rounds at different clusters in different layers of the network satisfies the following criterion (∀k, i, j):

\theta_{L_{j,i}}^{(k)} \ge
\begin{cases}
\left\lceil \dfrac{\log(\sigma_j) - 2\log\big(|L_{j,i}^{(k)}|^2\, \Upsilon_{L_{j,i}}^{(k)}\big)}{2\log\big(\lambda_{L_{j,i}}^{(k)}\big)} \right\rceil, & \text{if } \sigma_j \le |L_{j,i}^{(k)}|^4 \big(\Upsilon_{L_{j,i}}^{(k)}\big)^2, \\
0, & \text{otherwise},
\end{cases}   (16)

for non-negative constants σ_1, ..., σ_{|L|}, then the asymptotic upper bound on the distance from the optimal is given by:

\lim_{k \to \infty} F(w^{(k)}) - F(w^*) \le \frac{\eta^2 \Phi M^2}{2\mu D^2} \sum_{j=0}^{|L|-1} \sigma_{j+1} N_j.   (17)

Proposition 1 gives a guideline for designing the number of D2D rounds at different network clusters over time to achieve a desired (finite) upper bound on the optimality gap. It can be seen that optimality is tied to our introduced auxiliary variables {σ_j}_{j=1}^{|L|}, which we refer to as D2D control parameters.
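The rule (16) is straightforward for a parent node to evaluate per cluster. The sketch below assumes illustrative values for the cluster size, spectral-radius bound λ, divergence Υ, and control parameter σ:

```python
import math

# Minimum number of D2D rounds satisfying (16) for one LUT cluster.
def d2d_rounds(n_nodes, lam, upsilon, sigma):
    """theta >= ceil([log(sigma) - 2 log(n^2 * Upsilon)] / [2 log(lam)])
    when sigma <= n^4 * Upsilon^2, and theta >= 0 otherwise."""
    threshold = n_nodes**4 * upsilon**2
    if sigma > threshold:
        return 0                       # no D2D rounds required
    num = math.log(sigma) - 2 * math.log(n_nodes**2 * upsilon)
    return math.ceil(num / (2 * math.log(lam)))

theta = d2d_rounds(n_nodes=10, lam=0.5, upsilon=2.0, sigma=1e-3)
# A smaller spectral-radius bound (better-connected cluster) needs
# fewer rounds for the same sigma:
assert d2d_rounds(10, 0.2, 2.0, 1e-3) <= theta
```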

To eliminate the constant optimality gap in (17), we obtain an extra condition on the tuning of these parameters and make them time-varying to guarantee linear convergence to the optimal solution (see Appendix C):

Proposition 2. For any realization of MH-FL, suppose that the number of D2D rounds at different clusters of different network layers satisfies the following criterion (∀k, i, j):

\theta_{L_{j,i}}^{(k)} \ge
\begin{cases}
\left\lceil \dfrac{\log\big(\sigma_j^{(k)}\big) - 2\log\big(|L_{j,i}^{(k)}|^2\, \Upsilon_{L_{j,i}}^{(k)}\big)}{2\log\big(\lambda_{L_{j,i}}^{(k)}\big)} \right\rceil, & \text{if } \sigma_j^{(k)} \le |L_{j,i}^{(k)}|^4 \big(\Upsilon_{L_{j,i}}^{(k)}\big)^2, \\
0, & \text{otherwise},
\end{cases}   (18)

where the non-negative constants σ_1^{(k)}, ..., σ_{|L|}^{(k)} satisfy

\sum_{j=0}^{|L|-1} \sigma_{j+1}^{(k)} N_j \le \frac{D^2 \mu (\mu - \delta\eta)}{\eta^4 \Phi M^2} \big\| \nabla F(w^{(k-1)}) \big\|^2   (19)

for 0 < δ ≤ µ/η. Then, we have

\frac{F(w^{(k+1)}) - F(w^*)}{F(w^{(k)}) - F(w^*)} \le (1 - \delta),   \forall k,   (20)

which implies linear convergence of MH-FL and that \lim_{k \to \infty} F(w^{(k)}) - F(w^*) = 0.

Proposition 2 asserts that under a stricter tuning of the number of D2D rounds at different layers, i.e., (18) and (19), convergence to the optimal solution can be guaranteed with a rate that is upper bounded by 1 − µ/η according to (20). Furthermore, the number of D2D rounds is always finite when all the D2D tuning variables are greater than zero; according to (19), this can always be satisfied until reaching the optimal point (where the gradient of the loss becomes zero), at which the algorithm stops. Note that Proposition 2's condition boosts the required θ_{L_{j,i}}^{(k)} over time, since the norm of the gradient in (19) decreases over time, in turn lowering the values of {σ_j^{(k)}}_{j=1}^{|L|}. In Proposition 1, by contrast, the D2D rounds taper over time, since σ_j is fixed over k and the divergence of the parameters is expected to decrease over global iterations, especially when dealing with i.i.d. data. Our


\kappa \ge \frac{\log\Big(\epsilon - \frac{\eta^2 \Phi M^2}{2\mu D^2} \sum_{j=0}^{|L|-1} \sigma_{j+1} N_j\Big) - \log\Big(F(w^{(0)}) - F(w^*) - \frac{\eta^2 \Phi M^2}{2\mu D^2} \sum_{j=0}^{|L|-1} \sigma_{j+1} N_j\Big)}{\log(1 - \mu/\eta)}   (21)

\kappa \ge \left\lceil \frac{\log(\epsilon) - \log\big(F(w^{(0)}) - F(w^*)\big)}{\log(1-\delta)} \right\rceil   (22)

experiments in Sec. IV verify these effects.

Proposition 2's result also assumes knowledge of the global loss gradient norm ‖∇F(w^{(k−1)})‖, which is not known at the beginning of global iteration k, when w^{(k−1)} has just been sent down through the layers. In Sec. III-D, we will develop an approximation technique for implementing this result in practice. Finally, note that in both Propositions 1 and 2, a smaller spectral radius (corresponding to well-connected clusters) is tied to a lower number of D2D rounds (note that λ < 1).

3) Relationship between global iterations and D2D rounds: The following two corollaries to Propositions 1 and 2 investigate the impact of the number of global iterations on the required D2D rounds, and vice versa. First, we obtain the number of D2D rounds required at different clusters to reach a desired accuracy at a desired global iteration (see Appendix D):

Corollary 1. Let ε ∈ [(1 − µ/η)^κ (F(w^{(0)}) − F(w^*)), F(w^{(0)}) − F(w^*)). To guarantee that MH-FL obtains a solution to within ε of the optimal by global iteration κ, i.e., F(w^{(κ)}) − F(w^*) ≤ ε, a sufficient number of D2D rounds in different clusters of the network is given by either of the following conditions:

1) θ_{L_{j,i}}^{(k)}, ∀i, j, k, given by (16), where the values of σ_1, ..., σ_{|L|} satisfy the following inequality:

\sum_{j=0}^{|L|-1} \sigma_{j+1} N_j \le \frac{\epsilon - (1 - \mu/\eta)^\kappa \big(F(w^{(0)}) - F(w^*)\big)}{\big(1 - (1 - \mu/\eta)^\kappa\big) \frac{\eta^2 \Phi M^2}{2\mu D^2}}.   (23)

2) θ_{L_{j,i}}^{(k)}, ∀i, j, k, given by (18), where the values of σ_1^{(k)}, ..., σ_{|L|}^{(k)} satisfy (19) with δ given by:

\delta \ge 1 - \sqrt[\kappa]{\frac{\epsilon}{F(w^{(0)}) - F(w^*)}}.   (24)

Second, we obtain the number of global iterations required to reach a desired accuracy for a predetermined policy determining the D2D rounds in different clusters (see Appendix E):

Corollary 2. With ε as in Corollary 1, either of the following two conditions gives a sufficient number of global iterations κ to achieve F(w^{(κ)}) − F(w^*) ≤ ε:

1) If the θ_{L_{j,i}}^{(k)}, ∀i, j, k, satisfy (16) given σ_1, ..., σ_{|L|}, and ε ≥ \frac{\eta^2 \Phi M^2}{2\mu D^2} \sum_{j=0}^{|L|-1} \sigma_{j+1} N_j, then κ follows (21).

2) If the θ_{L_{j,i}}^{(k)}, ∀i, j, k, satisfy (18) and (19) given σ_1^{(k)}, ..., σ_{|L|}^{(k)} and δ, then κ follows (22).
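Condition (22) can be evaluated directly; the initial optimality gap, ε, and δ values below are illustrative assumptions:

```python
import math

# Global rounds needed in (22) so that the per-round contraction
# (1 - delta) shrinks the initial gap F(w^(0)) - F(w*) below epsilon.
def global_iterations(eps, initial_gap, delta):
    """Minimum kappa with (1 - delta)^kappa * initial_gap <= eps."""
    return math.ceil((math.log(eps) - math.log(initial_gap))
                     / math.log(1.0 - delta))

kappa = global_iterations(eps=1e-3, initial_gap=10.0, delta=0.1)
assert (1 - 0.1) ** kappa * 10.0 <= 1e-3
```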

4) Varying gradient step size: So far, we have been working with a constant step size β in (4). If we design a time-varying step size β_k that decreases over time, we can sharpen the convergence result in Proposition 1 when devices share gradients instead of parameters (see Sec. II-D). In particular, we can guarantee that MH-FL converges to the optimal solution, rather than having a finite optimality gap (see Appendix F):

Proposition 3. Suppose that the nodes share gradients using the same procedure described in Algorithm 1, and that each parent node samples one of its children uniformly at random. Also, assume that the end devices use a step size β_k = α/(k + λ), where λ > 1 and α > 1/µ, at global iteration k, with β_0 ≤ 1/η. If the D2D rounds are performed according to (16) with non-negative constants σ_1, ..., σ_{|L|}, we have:

E[F(w^{(k)}) - F(w^*)] \le \frac{\Gamma}{k + \lambda},   (25)

where

\Gamma = \max\left\{ \lambda \big(F(w^{(0)}) - F(w^*)\big),\; \frac{\eta \alpha^2 \Phi M^2 \sum_{j=0}^{|L|-1} \sigma_{j+1} N_j}{2D^2(\alpha\mu - 1)} \right\}.   (26)

Consequently, under such conditions, MH-FL converges to the optimal solution:

\lim_{k \to \infty} E[F(w^{(k)}) - F(w^*)] = 0.   (27)

The bound (25) implies a convergence rate of O(1/k), which is slower than the linear convergence obtained in Proposition 2, but also allows tapering of the D2D rounds over time as in Proposition 1.

5) Cluster sampling: In a large-scale fog network with millions of nodes, it may be desirable to reduce upstream communication demand even further than what is provided by LUT clusters in Fig. 3. In Appendix G, we develop a cluster sampling technique in which only a portion of the clusters in the bottom-most layer are activated in model training at each global iteration, and we extend Theorem 1 to this case. We leave further investigation of this technique to future work.

D. Control Algorithms for Distributed Consensus Tuning

With all else constant, fewer rounds of D2D result in lower power consumption and network load among the devices in LUT clusters. Motivated by this, we develop control algorithms for MH-FL that tune the number of D2D rounds through time (global aggregations) and space (network layers).

1) Adaptive D2D for loss functions satisfying Assumption 1: We aim to realize the two D2D consensus round tuning policies that we obtained in Propositions 1 and 2, which we refer to as Policies A and B, respectively. Policy A provides a finite optimality gap, with tapering of the D2D rounds through time, while Policy B provides linear convergence to the optimal, with boosting of the D2D rounds through time. We are interested in realizing these two policies in a distributed manner, where the number of D2D rounds for different device clusters is tuned by the corresponding parent nodes in real time. It is assumed that the parent node of cluster C has an estimate of the topology of the cluster, and thus an upper bound λ_C^{(k)} on the spectral radius of its children's cluster graph, ∀k.

According to (16) and (18), for both policies, tuning the D2D rounds for cluster C requires knowledge of the divergence of parameters Υ_C^{(k)}. Also, Policy A requires a set of fixed D2D control parameters σ_j for clusters located in layer L_j, while Policy B requires the norm of the global gradient at the broadcast weights, ‖∇F(w^{(k−1)})‖, and the real-time D2D control parameters σ_j^{(k)}.


Algorithm 2: Adaptive D2D round tuning at each cluster
input: Global aggregation count k, tuning parameter ω > 1, cluster index C = L_{j,i}.
output: Number of D2D rounds θ_{L_{j,i}}^{(k)} for the cluster.
1  Nodes inside the cluster C iteratively compute (a) and (b) of (28).
2  Parent node of the cluster samples one child and computes (28).
3  if asymptotic convergence to optimal desired then
4    Parent node uses (30) with the stored approximate gradient norm ‖∇̃F(w^{(k−1)})‖ and δ received from the server to compute σ_j^{(k)}.
5    Parent node uses (18) to compute θ_{L_{j,i}}^{(k)}.
6  else
7    Parent node uses the consensus tuning parameter σ_j received from the server in (16) to compute θ_{L_{j,i}}^{(k)}.

In the following, we first derive the divergence of parameters in a distributed manner. Then, we focus on realizing the remaining parameters specific to each policy.

Given Definition 1, at cluster C in layer L_j, 1 ≤ j ≤ |L|, the divergence of the parameters can be approximated as⁴

\Upsilon_C^{(k)} \approx \max_{q,q' \in C^{(k)}} \Big\{ \|\tilde{w}_q^{(k)}\| - \|\tilde{w}_{q'}^{(k)}\| \Big\} = \underbrace{\max_{q \in C^{(k)}} \|\tilde{w}_q^{(k)}\|}_{(a)} - \underbrace{\min_{q' \in C^{(k)}} \|\tilde{w}_{q'}^{(k)}\|}_{(b)}.   (28)

To obtain (a) and (b) in a distributed manner, at any given LUT cluster, each node n ∈ C^{(k)} first computes the scalar value ‖w̃_n^{(k)}‖. Nodes then share these scalar values with their neighbors iteratively. In each iteration, each node saves two scalars: the (i) maximum and (ii) minimum among the received values and the node's current value. It is easy to verify that for any given communication graph G_C^{(k)} among the cluster nodes, once the number of iterations exceeds the diameter of G_C^{(k)}, the saved values at each node will correspond to (a) and (b) for cluster C in (28). The parent node can then sample the value of one of its children to compute (28).

For Policy A, since the values of {σ_j}_{j=1}^{|L|} are fixed through time, one option is for the server to tune them once at the beginning of training and distribute them among all the nodes. If satisfaction of a given accuracy ε at a certain iteration κ is desired, we use the result of Corollary 1 and obtain the D2D control parameters as the solution of the following max-min optimization problem:

argmax_{{σ_j}_{j=1}^{|L|}}  min_{1≤j≤|L|} { N_{j−1} σ_j }   subject to (23).

It can be verified that the solution is given by

σ_j^⋆ = [ 2μD² ( ε − (1 − μ/η)^κ (F(w^{(0)}) − F(w^∗)) ) ] / [ (1 − (1 − μ/η)^κ) η² Φ M² N_{j−1} |L| ],   1 ≤ j ≤ |L|,   (29)

which can be broadcast at the beginning of training among the nodes. The reason behind the choice of the aforementioned max-min problem is two-fold. First, according to (16), for a given set of divergences of parameters Υ_C^{(k)} across the clusters C, a smaller number of D2D rounds at each layer L_j is associated with a larger value of σ_j, so larger values of the D2D control parameters are desired

⁴Here, for practical purposes, we use the lower bound of the divergence, | ‖a‖ − ‖b‖ | ≤ ‖a − b‖. The upper bound alternative ‖a − b‖ ≤ ‖a‖ + ‖b‖ can be arbitrarily large even when a = b.

Algorithm 3: Adaptive D2D round tuning at each cluster for non-convex loss functions
input: Tolerable error of aggregations ψ, global aggregation count k, cluster index C = L_{j,i}.
output: Number of D2D rounds θ^{(k)}_{L_{j,i}} for the cluster.
1 Nodes inside the cluster iteratively compute (a) and (b) of (28).
2 The parent node of the cluster samples one child and computes (28).
3 The parent node of the cluster computes the required rounds of D2D as follows, with σ_j = ψD²/(ΦM²N_{j−1}|L|):

θ^{(k)}_{L_{j,i}} ≥ [ log(σ_j) − 2 log( |L^{(k)}_{j,i}|² Υ^{(k)}_{L_{j,i}} ) ] / [ 2 log( λ^{(k)}_{L_{j,i}} ) ],   if σ_j ≤ |L^{(k)}_{j,i}|⁴ (Υ^{(k)}_{L_{j,i}})²;
θ^{(k)}_{L_{j,i}} ≥ 0,   otherwise.   (31)
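Line 3 of Algorithm 3 can be sketched as a short routine. The contraction factor `lam` stands for the spectral-radius-type quantity λ^{(k)}_{L_{j,i}} of the cluster's consensus iteration (0 < lam < 1); all numeric values are illustrative assumptions:

```python
import math

# Sketch of line 3 of Algorithm 3: the parent of cluster L_{j,i} converts the
# tolerable aggregation error psi into sigma_j and then into a sufficient
# D2D round count via (31). All parameter values here are illustrative.

def d2d_rounds(psi, Phi, M, D, N_prev, num_layers, n_cluster, upsilon, lam):
    sigma_j = psi * D**2 / (Phi * M**2 * N_prev * num_layers)
    threshold = n_cluster**4 * upsilon**2
    if sigma_j > threshold:          # bound already met: no D2D needed
        return 0
    bound = (math.log(sigma_j) - 2.0 * math.log(n_cluster**2 * upsilon)) \
            / (2.0 * math.log(lam))
    return max(0, math.ceil(bound))

theta = d2d_rounds(psi=0.1, Phi=3.0, M=1.0, D=125.0, N_prev=25,
                   num_layers=3, n_cluster=5, upsilon=2.0, lam=0.5)
```

A tighter error budget ψ shrinks σ_j and hence raises the required round count, as the experiments in Sec. IV-B confirm.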

at every layer of the network. Second, this choice of objective function results in smaller values of the D2D control parameters as we move down the layers (towards the end devices) and the number of nodes increases. This leads to larger numbers of D2D rounds and lower errors in the bottom network layers, which is desired in practice given the discussion in Sec. III-B that the errors from the bottom layers are propagated and amplified as we move up the network layers.

For Policy B, to obtain ‖∇F(w^{(k−1)})‖, we use (6) to approximate ∇F(w^{(k−2)}) as ∇F(w^{(k−2)}) ≈ (w^{(k−2)} − w^{(k−1)})/β. This is an approximation due to the error introduced in the consensus process. Using this, the main server estimates ‖∇F(w^{(k−1)})‖ via ‖∇̃F(w^{(k−1)})‖ = (1/ω)‖∇F(w^{(k−2)})‖, where ω > 1 based on the intuition that the norm of the gradient should be decreasing over k, and broadcasts it along with w^{(k−1)}. The choice of ω can be viewed as a tradeoff between the number of global aggregations k and the number of D2D rounds θ_C^{(k)} per aggregation: as ω increases, we tolerate less consensus error in (19), requiring more D2D rounds θ_C^{(k)} and fewer global iterations k to achieve a given accuracy. Then, the cluster heads obtain σ_j^{(k)}, ∀j, k, according to the following max-min problem:

argmax_{{σ_j^{(k)}}_{j=1}^{|L|}}  min_{1≤j≤|L|} { N_{j−1} σ_j^{(k)} }   subject to (19)

for a given δ. It can be verified that the solution is given by

σ_j^{(k)⋆} = [ D² μ(μ − δη) / ( η⁴ Φ M² N_{j−1} |L| ) ] ‖∇̃F(w^{(k−1)})‖²,   1 ≤ j ≤ |L|, ∀k.   (30)

The parameter δ can be tuned by the main server at the beginning of training to guarantee a desired linear convergence, or it can be tuned by (24) to satisfy a desired accuracy at a certain global iteration. In both cases, the server broadcasts this parameter among the nodes at the beginning of training. With δ and ‖∇̃F(w^{(k−1)})‖ in hand, along with the ML model characteristics (D, μ, η, M) and network-related parameters (Φ, |L|, N_{j−1}) that can be broadcast once by the server, all the parent nodes of the clusters can calculate (30) at each global iteration, which can then be used in (18) to tune the number of D2D rounds for the children nodes in real time.
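The Policy-B quantities can be sketched in two small routines: the server-side gradient-norm estimate built from consecutive global models, and the per-cluster evaluation of (30). All numeric inputs are hypothetical placeholders:

```python
# Sketch of Policy B: the server estimates the gradient norm from two
# consecutive global models via (6), damps it by omega > 1, and each parent
# evaluates (30). Concrete values below are illustrative assumptions only.

def estimated_grad_norm(w_prev2, w_prev1, beta, omega):
    """||grad F(w(k-2))|| ~ ||w(k-2) - w(k-1)|| / beta, shrunk by 1/omega."""
    diff = sum((a - b) ** 2 for a, b in zip(w_prev2, w_prev1)) ** 0.5
    return diff / (beta * omega)

def sigma_policy_b(grad_norm, delta, mu, eta, Phi, M, D, N_prev, num_layers):
    """sigma_j^(k)* from (30)."""
    return (D**2 * mu * (mu - delta * eta) * grad_norm**2) \
           / (eta**4 * Phi * M**2 * N_prev * num_layers)

g = estimated_grad_norm([1.0, 2.0], [0.4, 1.2], beta=0.5, omega=2.0)
s = sigma_policy_b(g, delta=0.05, mu=0.1, eta=1.0, Phi=3.0, M=1.0,
                   D=125.0, N_prev=25, num_layers=3)
```

Since σ_j^{(k)⋆} scales with the squared gradient-norm estimate, a shrinking gradient drives σ down and the required D2D rounds up, which is exactly the boosting-over-time behavior seen in Figs. 9 and 10.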

A summary of this procedure for tuning the D2D rounds at a cluster is given in Algorithm 2. In the full MH-FL method described in Algorithm 1, this is (optionally) called for each cluster in the lines marked via ∗∗.


2) Adaptive D2D tuning for non-convex loss functions: Some contemporary ML models, such as neural networks, possess non-convex loss functions for which Assumption 1 does not apply. In these cases, we can develop a heuristic approach to tune the D2D rounds of MH-FL if we specify a maximum tolerable error of aggregations ψ at each global iteration. The resulting procedure is given in Algorithm 3, which would be called once for each cluster in Algorithm 1 in place of Algorithm 2 (see the lines marked ##). To execute this, prior to the start of training, each parent node should receive ψ and the number of nodes in its layer. Using Algorithm 3, we can show that the 2-norm of the aggregation errors is always bounded by the parameter ψ (see Appendix H for the proof).

IV. EXPERIMENTAL EVALUATION

We conducted extensive numerical experiments to evaluate MH-FL. In this section, we present the setup (Sec. IV-A) and results (Sec. IV-B) for a popular real-world dataset and a sample fog topology. Additional results on more datasets and topologies can be found in Appendix I.

A. Experimental Setup

We consider a fog network consisting of a main server and three subsequent layers. There are 125 edge devices in the bottom layer (L3), clustered into groups of 5 nodes. Each of these clusters communicates with one of 25 parent nodes in layer L2. The nodes at this layer are in turn clustered into groups of 5, with 5 parent nodes at layer L1 that communicate with the main server. We consider the cases where (i) all clusters are configured to operate in LUT mode and (ii) all are EUT, which allows us to evaluate the performance differences in terms of model convergence, energy consumption, and parameters transferred. In the LUT case, connectivity within clusters follows a random geometric graph (see Appendix I).
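The 125-device, three-layer hierarchy above can be constructed programmatically; this is a minimal sketch with node names of our own choosing:

```python
# Build the experimental fog hierarchy: 125 devices in L3 grouped in fives
# under 25 L2 parents, which are grouped in fives under 5 L1 parents that
# talk to the main server. Node labels here are our own convention.

def build_hierarchy(n_devices=125, cluster_size=5, n_layers=3):
    layers = [[f"L{n_layers}-d{i}" for i in range(n_devices)]]  # bottom layer
    for depth in range(n_layers - 1, 0, -1):
        children = layers[-1]
        parents = [f"L{depth}-p{i}"
                   for i in range(len(children) // cluster_size)]
        layers.append(parents)
    layers.append(["server"])
    return layers  # bottom-up: [L3 devices, L2 parents, L1 parents, server]

layers = build_hierarchy()
sizes = [len(l) for l in layers]   # [125, 25, 5, 1]
```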

We consider a 10-class image classification task on the standard MNIST dataset of 70K handwritten digits (http://yann.lecun.com/exdb/mnist/). We evaluate with both regularized SVM and fully-connected neural network (NN) classifiers as loss functions; the SVM satisfies Assumption 1 while the NN does not. Unless stated otherwise, the results are presented using the SVM. Samples are distributed across devices in either an i.i.d. or non-i.i.d. manner: for i.i.d., each device has samples from each class, while for non-i.i.d., each device has samples from only one of the 10 classes, allowing us to evaluate the robustness of our method to different data distributions. More details on the dataset, classifiers, and hyperparameter tuning procedure are available in Appendix I; there, we also provide additional results on the Fashion MNIST (F-MNIST) dataset and for a network of 625 edge devices, finding that the results are qualitatively similar.
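The two partitions can be sketched as follows; toy labels stand in for the MNIST labels, and the splitting rule is our own minimal interpretation of the i.i.d./non-i.i.d. setup described above:

```python
import random

# Sketch of the two data partitions: in the i.i.d. split every device draws
# a random shard covering all classes; in the non-i.i.d. split each device
# receives samples of a single class (sort-by-label sharding).

def partition(labels, n_devices, iid=True, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    if iid:
        rng.shuffle(idx)                   # every device sees mixed classes
    else:
        idx.sort(key=lambda i: labels[i])  # sort by class: one class/device
    per = len(idx) // n_devices
    return [idx[d * per:(d + 1) * per] for d in range(n_devices)]

labels = [i % 10 for i in range(1000)]     # toy stand-in for MNIST labels
shards = partition(labels, n_devices=100, iid=False)
classes_per_device = {len({labels[i] for i in shard}) for shard in shards}
```

With 100 samples per class and shards of 10 aligned samples, each non-i.i.d. shard falls inside a single class block.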

B. Results and Discussion

1) MH-FL with fixed rounds of D2D: We consider a scenario in which the number of D2D rounds is set to a constant value θ over all clusters, which provides useful insights for the rest of the results. In Fig. 5, we compare the performance of MH-FL when all the clusters work in LUT mode with fixed rounds of D2D against the case where the network consists of all EUT clusters (referred to as the "EUT baseline"). The EUT case is identical (in terms of convergence) to carrying out centralized gradient descent over the entire dataset. We see that increasing the number of consensus rounds at the different layers increases the accuracy and stability of model training for MH-FL. Though convergence is not achieved in all cases (in particular, when θ is 1 and 2), if the number of D2D rounds chosen is larger than 15, performance comparable to EUT is achieved. This performance is characterized by linear convergence, as can be seen in Fig. 5(c) with logarithmic axis scales.

2) MH-FL with decaying step size: The effect of decreasing the gradient descent step size (Proposition 3) is depicted in Fig. 6. It verifies that the decay can suppress the finite optimality bound and provide convergence to the optimal model. Also, convergence occurs at a slower pace compared with the baseline, which is in line with our theoretical results (a convergence rate of O(1/k)). Fig. 6 further reveals the inherent trade-off between conducting a higher number of D2D rounds with a constant learning rate (higher power consumption from more rounds, but with a fast convergence speed) and performing fewer D2D rounds with a decaying learning rate (lower power consumption with slower convergence).

3) MH-FL with adaptive rounds of D2D: We next study the case where our distributed D2D tuning scheme is utilized. The results are depicted for both convergence criteria, i.e., when a finite gap of convergence is tolerable (Figs. 7, 8) and when linear convergence to the optimal solution is desired (Figs. 9, 10). Recall that Propositions 1 and 2 obtain the sufficient number of D2D rounds based on an upper bound; for this experiment, we observed that log(σ_j) and log(σ_j^{(k)}) in (16) and (18) can be scaled and used as log(χσ_j) and log(χσ_j^{(k)}), ∀j, where χ ∈ [1, 15], to obtain fewer rounds of D2D while satisfying the desired convergence characteristics.

Fig. 7 depicts the result for the case where the local datasets are i.i.d., with the values of {σ_j}_{j=1}^{|L|} depicted. In the figures, θ^{(k)} denotes the average number of D2D rounds employed by clusters over all the network layers at global iteration k, and θ^{(k)}_{L_j} denotes the average number of D2D rounds at iteration k in layer L_j. We observe that (i) the number of D2D rounds performed in the network is tapered through time (subplot c), and (ii) the number of D2D rounds performed at different network layers is tapered through space, where higher layers of the network perform fewer rounds (subplots d-f). We perform the same experiment with non-i.i.d datasets across the nodes in Fig. 8. Comparing Fig. 8 to Fig. 7, it can be observed that non-i.i.d. data introduces oscillations in the number of D2D rounds performed at the different network layers, and that smaller values of the D2D control parameters result in larger numbers of D2D rounds, which leads to more stable training.

The same experiment is repeated in Figs. 9 and 10 for the linear convergence case. We see that the number of D2D rounds is tapered through space (subplots d-f) and is boosted over time as the iteration index k increases (subplots c-f). This is due to the decrease of the gradient norm on the right-hand side of (30) over time, which calls for an increment in the number of D2D rounds for more precision as the gradient becomes smaller. Comparing Figs. 9 and 10 with Figs. 7 and 8, it can be concluded that guaranteeing linear convergence comes with the tradeoff of performing a larger number of D2D rounds over time. In Figs. 7 and 8, a small optimality gap is achieved,

Page 12: Multi-Stage Hybrid Federated Learning over Large-Scale ... · We develop multi-stage hybrid model training (MH-MT), a novel learning methodology for distributed ML in these scenarios

12

[Figure 5: panels (a) accuracy vs. k, (b) loss vs. k, (c) loss (log scale) vs. k; curves: EUT and θ ∈ {1, 2, 5, 15, 30}.]
Fig. 5: Performance comparison between the EUT baseline and MH-FL when a fixed number of D2D rounds θ is used at every cluster in the network, for non-i.i.d data. As the number of D2D rounds increases, MH-FL performs more similarly to the EUT baseline and the learning becomes more stable.

[Figure 6: panels (a) accuracy vs. k, (b) loss vs. k; curves: EUT, θ = 15, θ = 1 w/ decreasing step, θ = 1 w/o decreasing step.]
Fig. 6: Performance comparison between the EUT baseline and MH-FL with and without (w/o) decreasing the gradient descent step size. Decreasing the step size can provide convergence to the optimal solution in cases where a fixed step size is not capable, but also has a slower convergence speed.

[Figure 7: panels (a) accuracy, (b) loss, (c) θ^{(k)}, (d) θ^{(k)}_{L3}, (e) θ^{(k)}_{L2}, (f) θ^{(k)}_{L1}, all vs. k; curves: EUT and σ′ ∈ {1%, 10%, 60%, 90%}.]
Fig. 7: Performance comparison between the EUT baseline and MH-FL for i.i.d data when a finite optimality gap is tolerable. σ_j at L_j is fixed as σ_j = σ′ max_i Υ^{(1)}_{L_{j,i}}. The tapering of D2D rounds through time (c) and space (layers) (d)-(f) can be observed.

[Figure 8: panels (a) accuracy, (b) loss, (c) θ^{(k)}, (d) θ^{(k)}_{L3}, (e) θ^{(k)}_{L2}, (f) θ^{(k)}_{L1}, all vs. k; curves: EUT and σ′ ∈ {1%, 10%, 60%, 90%}.]
Fig. 8: Performance comparison between the EUT baseline and MH-FL for non-i.i.d data when a finite optimality gap is tolerable. σ_j is set as in Fig. 7. Smaller loss and higher accuracy are achieved with smaller σ′, implying more rounds of D2D are required.

[Figure 9: panels (a) accuracy, (b) loss, (c) θ^{(k)}, (d) θ^{(k)}_{L3}, (e) θ^{(k)}_{L2}, (f) θ^{(k)}_{L1}, all vs. k; curves: EUT and δ′ ∈ {99%, 95%, 90%, 80%}.]
Fig. 9: Performance comparison between the EUT baseline and MH-FL for i.i.d data when linear convergence to the optimal is desired. The value of δ is set at δ = δ′μ/η. Boosting of the D2D rounds through time can be observed in (c)-(f) as k is increased. Also, tapering through space can be observed by comparing the D2D rounds across the bottom subplots.

[Figure 10: panels (a) accuracy, (b) loss, (c) θ^{(k)}, (d) θ^{(k)}_{L3}, (e) θ^{(k)}_{L2}, (f) θ^{(k)}_{L1}, all vs. k; curves: EUT and δ′ ∈ {99%, 95%, 90%, 80%}.]
Fig. 10: Performance comparison between the EUT baseline and MH-FL for non-i.i.d data when linear convergence to the optimal is desired. The value of δ is set as in Fig. 9. Smaller values of loss and higher accuracy are both associated with larger values of δ, which result in lower error tolerance and more rounds of D2D.

while the number of D2D rounds is tapered over time.

4) MH-FL with adaptive rounds of D2D for non-convex ML models: Recall that we developed Algorithm 3 for non-convex ML loss functions. Figs. 11 and 12 give results with NNs for different values of the tolerable error of aggregations ψ, under i.i.d and non-i.i.d data distributions, respectively. These figures show the effect of the tolerable error of aggregations on the performance of the NNs: decreasing the tolerable error increases the number of D2D rounds and enhances the performance. These figures also reveal a tapering of the number of consensus rounds through time and space in the i.i.d case.

5) Analytical vs. experimental bound comparison: We next investigate the number of global iterations required to obtain a certain accuracy under the linear convergence property from Corollary 1. In Fig. 13, we compare the result obtained from Policy B implementing (30) to that observed numerically in our experiments. The results indicate that the theoretical results are reasonably tight bounds.

6) Network resource utilization of MH-FL: We have seen that MH-FL can achieve performance comparable to the EUT baseline in terms of model convergence. We now consider the difference in network resource utilization. In particular, we consider the following two metrics: (i) the amount of data transferred between the network layers, and (ii) the accumulated energy consumption of the edge devices. In both cases, EUT and MH-FL are trained up to reaching 98% of the final accuracy achieved after 50 iterations of centralized gradient descent. We consider four scenarios, corresponding to those used in the experiments of Figs. 7, 8, 11, and 12. To obtain the accumulated energy, we take the typical transmission power of end devices to be 10 dBm in D2D mode and 24 dBm in uplink mode [51], [52], and assume that the transmission of parameters at each round takes 0.25 s at a data rate of 1 Mbit/s with 32-bit quantization. The accumulated energy consumption of
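The per-transmission energy implied by these figures is a simple power-times-time product; the sketch below is our own back-of-envelope arithmetic using the transmit powers and timing stated above:

```python
# Back-of-envelope sketch of the energy accounting behind Fig. 14: each
# transmission lasts 0.25 s at 10 dBm (D2D) or 24 dBm (uplink). The helper
# below is our own accounting model, not the paper's exact procedure.

def dbm_to_watts(p_dbm):
    return 10 ** (p_dbm / 10) / 1000.0

TX_TIME = 0.25                        # seconds per parameter transmission
E_D2D = dbm_to_watts(10) * TX_TIME    # Joules per D2D transmission
E_UPLINK = dbm_to_watts(24) * TX_TIME # Joules per uplink transmission

def device_energy(n_global_iters, d2d_rounds_per_iter, uplink_per_iter=1):
    """Accumulated transmit energy of one device over training (Joules)."""
    return n_global_iters * (d2d_rounds_per_iter * E_D2D
                             + uplink_per_iter * E_UPLINK)
```

Since 24 dBm is roughly 25 times the power of 10 dBm, many cheap D2D rounds can substitute for one uplink transmission, which is the source of the savings in Fig. 14.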


[Figure 11: panels (a) accuracy, (b) loss, (c) θ^{(k)}, (d) θ^{(k)}_{L3}, (e) θ^{(k)}_{L2}, (f) θ^{(k)}_{L1}, all vs. k; curves: EUT and ψ ∈ {1e+05, 1e+04, 1e+03, 100, 10}.]
Fig. 11: Performance comparison between the EUT baseline and MH-FL under i.i.d data using NNs with different values of ψ. Tapering of the D2D rounds through time can be observed. Also, tapering through space can be observed by comparing the D2D rounds across the bottom subplots.

[Figure 12: panels (a) accuracy, (b) loss, (c) θ^{(k)}, (d) θ^{(k)}_{L3}, (e) θ^{(k)}_{L2}, (f) θ^{(k)}_{L1}, all vs. k; curves: EUT and ψ ∈ {1e+05, 1e+04, 1e+03, 100, 10}.]
Fig. 12: Performance comparison between the EUT baseline and MH-FL under non-i.i.d data using NNs with different values of ψ. Lower loss and higher accuracy are associated with smaller values of ψ, which result in lower error tolerance and larger numbers of D2D rounds.

[Figure 13: κ vs. ε′ ∈ [0.025, 0.055]; curves: theory and simulation.]
Fig. 13: Theoretical vs. simulation results regarding the number of global iterations to achieve an accuracy of ε′(F(w^{(0)}) − F(w^∗)) for different ε′. Convergence in practice is faster than the derived upper bound.

[Figure 14: accumulated energy consumption (×10⁴ Joules) of EUT vs. MH-FL across scenarios 1-4.]
Fig. 14: Accumulated energy consumption of EUT and MH-FL over scenario 1: σ′ = 0.1 from Fig. 7, scenario 2: σ′ = 0.1 from Fig. 8, scenario 3: ψ = 10⁴ from Fig. 11, and scenario 4: ψ = 10⁴ from Fig. 12.

[Figure 15: accumulated number of parameters transferred between network layers (×10⁷), EUT vs. MH-FL across scenarios 1-4.]
Fig. 15: Comparison of parameters transferred among layers in EUT vs. MH-FL over scenario 1: σ′ = 0.1 from Fig. 7, scenario 2: σ′ = 0.1 from Fig. 8, scenario 3: ψ = 10⁴ from Fig. 11, and scenario 4: ψ = 10⁴ from Fig. 12.

the edge devices through the training phase is depicted in Fig. 14, which reveals around 50% energy savings on average as compared to the EUT baseline. The accumulated number of parameters transferred over the network layers is shown in Fig. 15, revealing an 80% reduction in the number of parameters transferred over the layers as compared to the baseline.

V. CONCLUSION AND FUTURE WORK

We developed multi-stage hybrid federated learning (MH-FL), a novel methodology which migrates the star topology of federated learning to a multi-layer cluster-based distributed architecture incorporating cooperative D2D communications. We theoretically obtained the convergence bound of MH-FL, explicitly considering the time-varying network topology, the time-varying number of D2D rounds at different network clusters, and inherent ML model characteristics. We proposed a set of policies for the number of D2D rounds conducted at different network clusters under which convergence either to a finite optimality gap or to the global optimum can be achieved. We further used these policies to develop a set of adaptive distributed control algorithms that tune the number of D2D rounds at different clusters of the network in real time.

This paper motivates several directions for future work. Investigating more specific system characteristics that have been considered for federated learning – such as communication imperfections, interference management, mitigation of stragglers, and device scheduling – for the multi-stage hybrid structure of fog networks is promising. Also, the proposed network dimension of federated learning motivates work on device synchronization, network (re-)formation, congestion-aware data transfer and load balancing, and topology design.

REFERENCES

[1] D. Ciregan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proc. IEEE Conf. Comput. Vision Pattern Recog. (CVPR), 2012, pp. 3642–3649.
[2] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proc. Int. Conf. Mach. Learn., 2008, pp. 160–167.
[3] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. Artif. Intell. Stat., 2017, pp. 1273–1282.
[4] S. Hosseinalipour, C. G. Brinton, V. Aggarwal, H. Dai, and M. Chiang, "From federated to fog learning: Distributed machine learning over heterogeneous wireless networks," IEEE Commun. Mag. (to appear), 2020.
[5] M. Chiang, S. Ha, F. Risso, T. Zhang, and I. Chih-Lin, "Clarifying fog computing and networking: 10 questions and answers," IEEE Commun. Mag., vol. 55, no. 4, pp. 18–20, 2017.
[6] M. N. Tehrani, M. Uysal, and H. Yanikomeroglu, "Device-to-device communication in 5G cellular networks: challenges, solutions, and future directions," IEEE Commun. Mag., vol. 52, no. 5, pp. 86–92, 2014.
[7] M. Abolhasan, T. Wysocki, and E. Dutkiewicz, "A review of routing protocols for mobile ad hoc networks," Ad hoc Netw., vol. 2, no. 1, pp. 1–22, 2004.
[8] S. Zeadally, R. Hunt, Y.-S. Chen, A. Irwin, and A. Hassan, "Vehicular ad hoc networks (VANETS): status, results, and challenges," Telecommun. Syst., vol. 50, no. 4, pp. 217–241, 2012.
[9] I. Bekmezci, O. K. Sahingoz, and S. Temel, "Flying ad-hoc networks (FANETs): A survey," Ad hoc Netw., vol. 11, no. 3, pp. 1254–1270, 2013.
[10] K. Akkaya and M. Younis, "A survey on routing protocols for wireless sensor networks," Ad hoc Netw., vol. 3, no. 3, pp. 325–349, 2005.


[11] S. Maheshwari, D. Raychaudhuri, I. Seskar, and F. Bronzino, "Scalability and performance evaluation of edge cloud systems for latency constrained applications," in Proc. IEEE/ACM Symp. Edge Comput. (SEC), 2018, pp. 286–299.
[12] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," arXiv preprint arXiv:1909.07972, 2019.
[13] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong, "Federated learning over wireless networks: Optimization model design and analysis," in Proc. IEEE Int. Conf. Comput. Commun. (INFOCOM), 2019, pp. 1387–1395.
[14] G. Zhu, Y. Wang, and K. Huang, "Broadband analog aggregation for low-latency federated edge learning," IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, 2020.
[15] M. M. Amiri and D. Gunduz, "Federated learning over wireless fading channels," IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546–3557, 2020.
[16] N. Shlezinger, M. Chen, Y. C. Eldar, H. V. Poor, and S. Cui, "Federated learning with quantization constraints," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process. (ICASSP), 2020, pp. 8851–8855.
[17] C. Renggli, S. Ashkboos, M. Aghagolzadeh, D. Alistarh, and T. Hoefler, "SparCML: High-performance sparse communication for machine learning," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2019, pp. 1–15.
[18] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE J. Sel. Areas Commun. (JSAC), vol. 37, no. 6, pp. 1205–1221, 2019.
[19] D. Ye, R. Yu, M. Pan, and Z. Han, "Federated learning in vehicular edge computing: A selective model aggregation approach," IEEE Access, vol. 8, pp. 23920–23935, 2020.
[20] S. Dhakal, S. Prakash, Y. Yona, S. Talwar, and N. Himayat, "Coded federated learning," in IEEE Glob. Commun. Conf. Workshop (GLOBECOM Workshop), 2019, pp. 1–6.
[21] Y. Tu, Y. Ruan, S. Wang, S. Wagle, C. G. Brinton, and C. Joe-Wang, "Network-aware optimization of distributed learning for fog computing," in Proc. IEEE Int. Conf. Comput. Commun. (INFOCOM), 2020.
[22] H. T. Nguyen, V. Sehwag, S. Hosseinalipour, C. G. Brinton, M. Chiang, and H. V. Poor, "Fast-convergent federated learning," arXiv preprint arXiv:2007.13137, 2020.
[23] S. Ji, W. Jiang, A. Walid, and X. Li, "Dynamic sampling and selective masking for communication-efficient federated learning," arXiv preprint arXiv:2003.09603, 2020.
[24] W. Wang, Y. Sun, B. Eriksson, W. Wang, and V. Aggarwal, "Wide compression: Tensor ring nets," in Proc. IEEE Conf. Comput. Vision Pattern Recog. (CVPR), 2018, pp. 9329–9338.
[25] L. Liu, J. Zhang, S. Song, and K. B. Letaief, "Client-edge-cloud hierarchical federated learning," in Proc. IEEE Int. Conf. Commun. (ICC), 2020, pp. 1–6.
[26] S. Luo, X. Chen, Q. Wu, Z. Zhou, and S. Yu, "HFEL: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning," IEEE Trans. Wireless Commun., pp. 1–1, 2020.
[27] M. S. H. Abad, E. Ozfatura, D. Gunduz, and O. Ercetin, "Hierarchical federated learning across heterogeneous cellular networks," in Proc. IEEE ICASSP, 2020, pp. 8866–8870.
[28] A. Elgabli, J. Park, A. S. Bedi, M. Bennis, and V. Aggarwal, "GADMM: Fast and communication efficient framework for distributed machine learning," arXiv preprint arXiv:1909.00047, 2019.
[29] V. Smith, S. Forte, C. Ma, M. Takac, M. I. Jordan, and M. Jaggi, "CoCoA: A general framework for communication-efficient distributed optimization," J. Mach. Learn. Res., vol. 18, no. 1, pp. 8590–8638, 2017.
[30] P. Richtarik and M. Takac, "Distributed coordinate descent method for learning with big data," J. Mach. Learn. Res., vol. 17, no. 1, pp. 2657–2681, 2016.
[31] S. Niknam, H. S. Dhillon, and J. H. Reed, "Federated learning for wireless communications: Motivation, opportunities and challenges," arXiv preprint arXiv:1908.06847, 2019.
[32] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, "Toward an intelligent edge: Wireless communication meets machine learning," IEEE Commun. Mag., vol. 58, no. 1, pp. 19–25, 2020.
[33] C. L. P. Chen, G. Wen, Y. Liu, and F. Wang, "Adaptive consensus control for a class of nonlinear multiagent time-delay systems using neural networks," IEEE Trans. Neural Netw. Learning Syst., vol. 25, no. 6, pp. 1217–1226, 2014.
[34] T. Li and J.-F. Zhang, "Consensus conditions of multi-agent systems with time-varying topologies and stochastic communication noises," IEEE Trans. Autom. Control, vol. 55, no. 9, pp. 2043–2057, 2010.
[35] S. Kar and J. M. F. Moura, "Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise," IEEE Trans. Signal Process., vol. 57, no. 1, pp. 355–369, 2009.
[36] S. Manfredi, "Design of a multi-hop dynamic consensus algorithm over wireless sensor networks," Control Eng. Practice, vol. 21, no. 4, pp. 381–394, 2013.
[37] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, 2009.
[38] T. Chang, M. Hong, and X. Wang, "Multi-agent distributed optimization via inexact consensus ADMM," IEEE Trans. Signal Process., vol. 63, no. 2, pp. 482–497, 2015.
[39] T. Chang, A. Nedic, and A. Scaglione, "Distributed constrained optimization by consensus-based primal-dual perturbation method," IEEE Trans. Autom. Control, vol. 59, no. 6, pp. 1524–1538, 2014.
[40] B. Johansson, T. Keviczky, M. Johansson, and K. H. Johansson, "Subgradient methods and consensus algorithms for solving convex optimization problems," in Proc. IEEE Conf. Decis. Control, 2008, pp. 4185–4190.
[41] R. N. Clarke, "Expanding mobile wireless capacity: The challenges presented by technology and economics," Telecommun. Policy, vol. 38, no. 8-9, pp. 693–708, 2014.
[42] U. N. Kar and D. K. Sanyal, "An overview of device-to-device communication in cellular networks," ICT Express, vol. 4, no. 4, pp. 203–208, 2018.
[43] C. Hu, J. Jiang, and Z. Wang, "Decentralized federated learning: A segmented gossip approach," arXiv preprint arXiv:1908.07782, 2019.
[44] S. Savazzi, M. Nicoli, and V. Rampa, "Federated learning with cooperating devices: A consensus approach for massive IoT networks," IEEE Internet Things J., vol. 7, no. 5, pp. 4641–4654, 2020.
[45] T. Zeng, O. Semiari, M. Mozaffari, M. Chen, W. Saad, and M. Bennis, "Federated learning in the sky: Joint power allocation and scheduling with UAV swarms," arXiv preprint arXiv:2002.08196, 2020.
[46] L. Xiao and S. Boyd, "Fast linear iterations for distributed averaging," Syst. & Control Lett., vol. 53, no. 1, pp. 65–78, 2004.
[47] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, "FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization," arXiv preprint arXiv:1909.13014, 2019.
[48] C. Dinh, N. H. Tran, M. N. Nguyen, C. S. Hong, W. Bao, A. Zomaya, and V. Gramoli, "Federated learning over wireless networks: Convergence analysis and resource allocation," arXiv preprint arXiv:1910.13067, 2019.
[49] M. P. Friedlander and M. Schmidt, "Hybrid deterministic-stochastic methods for data fitting," SIAM J. Sci. Comput., vol. 34, no. 3, pp. A1380–A1405, 2012.
[50] L. Xiao, S. Boyd, and S.-J. Kim, "Distributed average consensus with least-mean-square deviation," J. Parallel Distrib. Comput., vol. 67, no. 1, pp. 33–46, 2007.
[51] M. Hmila, M. Fernandez-Veiga, M. Rodriguez-Perez, and S. Herreria-Alonso, "Energy efficient power and channel allocation in underlay device to multi device communications," IEEE Trans. Commun., vol. 67, no. 8, pp. 5817–5832, 2019.
[52] S. Dominic and L. Jacob, "Joint resource block and power allocation through distributed learning for energy efficient underlay D2D communication with rate guarantee," Comput. Commun., 2020.
[53] B. T. Polyak, "Gradient methods for minimizing functionals," Zh. Vychisl. Mat. Mat. Fiz, vol. 3, no. 4, pp. 643–653, 1963.
[54] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming. Athena Scientific, 1996.


APPENDIX A
PROOF OF THEOREM 1

Proof. We first aim to bound the per-iteration decrease of the gap between the function F(w^{(k)}) and F(w^∗). Using the Taylor expansion and the η-smoothness of the function F, the following quadratic upper bound can be obtained:

F(w^{(k)}) ≤ F(w^{(k−1)}) + (w^{(k)} − w^{(k−1)})^⊤ ∇F(w^{(k−1)}) + (η/2) ‖w^{(k)} − w^{(k−1)}‖²,   ∀k.   (32)

To find the relationship between w^{(k−1)} and w^{(k)}, we follow the procedure described in the main text. For parent node a_p, let a′_{p+1} denote the corresponding sampled node, ∀p; e.g., in the following nested sums, a′_{|L|} denotes the node in the last layer sampled by parent node a_{|L|−1} in the layer above it. This corresponds to an arbitrary realization of the children sampling at the different parent nodes. Let a′_1 denote the node in L_1 sampled by the main server. The model parameter of this node is given by:

w^{(k)}_{a′_1} = (1/|L^{(k)}_{1,1}|) Σ_{a_1∈L^{(k)}_{1,1}} Σ_{a_2∈Q^{(k)}(a_1)} Σ_{a_3∈Q^{(k)}(a_2)} ⋯ Σ_{a_{|L|}∈Q^{(k)}(a_{|L|−1})} |D_{a_{|L|}}| w^{(k−1)}_{a_{|L|}}
− (1/|L^{(k)}_{1,1}|) Σ_{a_1∈L^{(k)}_{1,1}} Σ_{a_2∈Q^{(k)}(a_1)} Σ_{a_3∈Q^{(k)}(a_2)} ⋯ Σ_{a_{|L|}∈Q^{(k)}(a_{|L|−1})} β |D_{a_{|L|}}| ∇f_{a_{|L|}}(w^{(k−1)}_{a_{|L|}})
+ (1/|L^{(k)}_{1,1}|) Σ_{a_1∈L^{(k)}_{1,1}} Σ_{a_2∈Q^{(k)}(a_1)} Σ_{a_3∈Q^{(k)}(a_2)} ⋯ Σ_{a_{|L|−1}∈Q^{(k)}(a_{|L|−2})} 1^{(k)}_{{Q(a_{|L|−1})}} |Q^{(k)}(a_{|L|−1})| c^{(k)}_{a′_{|L|}}
+ (1/|L^{(k)}_{1,1}|) Σ_{a_1∈L^{(k)}_{1,1}} Σ_{a_2∈Q^{(k)}(a_1)} Σ_{a_3∈Q^{(k)}(a_2)} ⋯ Σ_{a_{|L|−2}∈Q^{(k)}(a_{|L|−3})} 1^{(k)}_{{Q(a_{|L|−2})}} |Q^{(k)}(a_{|L|−2})| c^{(k)}_{a′_{|L|−1}}
+ ⋯
+ (1/|L^{(k)}_{1,1}|) Σ_{a_1∈L^{(k)}_{1,1}} 1^{(k)}_{{Q(a_1)}} |Q^{(k)}(a_1)| c^{(k)}_{a′_2}
+ 1^{(k)}_{{L_{1,1}}} c^{(k)}_{a′_1},   (33)

which the main server uses to obtain the next global parameter as follows (due to the existence of the indicator function in the last term of the above expression, the following expression holds regardless of the operating mode of the cluster at layer L_1):

w(k) =|L(k)

1,1|w(k)a′1

D. (34)

Based on (3), it can be verified that:

$$\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} |D_{a_{|L|}}|\, w^{(k-1)}_{a_{|L|}} = D\, w^{(k-1)}. \qquad (35)$$

Also, using (1), we have:

$$\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} \beta\,|D_{a_{|L|}}|\, \nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big) = \beta D\, \nabla F(w^{(k-1)}). \qquad (36)$$

Replacing the above two equations in (33) and performing the update given by (34), we get:

$$w^{(k)} = w^{(k-1)} - \beta\,\nabla F(w^{(k-1)}) + \frac{1}{D}\Bigg(\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}}$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}} + \cdots$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2} + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|\, c^{(k)}_{a'_1}\Bigg). \qquad (37)$$


Calculating w^{(k)} − w^{(k-1)} using the above equation and replacing the result in (32) yields:

$$F(w^{(k)}) - F(w^{(k-1)}) \le \left(\frac{\eta\beta^2}{2}-\beta\right)\left\|\nabla F(w^{(k-1)})\right\|^2 + \frac{1-\beta\eta}{D}\Bigg(\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}} + \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}} + \cdots + \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2} + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|\, c^{(k)}_{a'_1}\Bigg)^{\!\top}\nabla F(w^{(k-1)})$$

$$+ \frac{\eta}{2D^2}\Bigg\|\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}} + \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}} + \cdots + \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2} + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|\, c^{(k)}_{a'_1}\Bigg\|^2. \qquad (38)$$

Tuning the learning rate as β = 1/η, we obtain:

$$F(w^{(k)}) - F(w^{(k-1)}) \le -\frac{1}{2\eta}\left\|\nabla F(w^{(k-1)})\right\|^2 + \frac{\eta}{2D^2}\Bigg\|\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}} + \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}} + \cdots + \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2} + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|\, c^{(k)}_{a'_1}\Bigg\|^2. \qquad (39)$$

Considering the right-hand side of the above inequality, to bound ‖∇F(w^{(k-1)})‖² we use the strong convexity of F. Considering the strong convexity criterion in Assumption 1 with x replaced by w^{(k-1)} and minimizing both sides, the minimum of the left-hand side occurs at y = w^* and the minimum of the right-hand side occurs at y = w^{(k-1)} − (1/µ)∇F(w^{(k-1)}). Replacing these values in the strong convexity criterion of Assumption 1 results in the Polyak–Łojasiewicz inequality [53] in the following form:

$$\left\|\nabla F(w^{(k-1)})\right\|^2 \ge 2\mu\left(F(w^{(k-1)}) - F(w^*)\right), \qquad (40)$$

which yields:

$$F(w^{(k)}) - F(w^{(k-1)}) \le -\frac{\mu}{\eta}\left(F(w^{(k-1)}) - F(w^*)\right) + \frac{\eta}{2D^2}\Bigg[\Bigg\|\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}} + \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}} + \cdots + \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2} + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|\, c^{(k)}_{a'_1}\Bigg\|^2\Bigg]. \qquad (41)$$
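The key analytic step behind (40)–(41) is the Polyak–Łojasiewicz (PL) inequality implied by strong convexity. A minimal numeric sketch on a strongly convex quadratic (all problem data below are arbitrary, not from the paper):

```python
import numpy as np

# Check the PL inequality (40): for a mu-strongly convex F,
#   ||grad F(w)||^2 >= 2 * mu * (F(w) - F(w*)).
# Quadratic test problem; mu is the smallest eigenvalue of A.
rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T + np.eye(5)                  # symmetric positive definite
b = rng.standard_normal(5)

F = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b
w_star = np.linalg.solve(A, b)           # unique minimizer of F
mu = np.linalg.eigvalsh(A).min()         # strong-convexity constant

for _ in range(200):
    w = rng.standard_normal(5)
    lhs = grad(w) @ grad(w)
    rhs = 2 * mu * (F(w) - F(w_star))
    assert lhs >= rhs - 1e-9             # PL inequality holds everywhere
print("PL inequality (40) verified")
```

The PL inequality is what converts the gradient-norm decrease in (39) into the geometric contraction of the optimality gap used throughout the remaining appendices.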


Then, we perform the following algebraic steps to bound the second term on the right-hand side of the above inequality. For brevity, we write ∑_{a_1}⋯∑_{a_p} for ∑_{a_1∈L^{(k)}_{1,1}}∑_{a_2∈Q^{(k)}(a_1)}⋯∑_{a_p∈Q^{(k)}(a_{p-1})}:

$$\Bigg\|\sum_{a_1}\!\cdots\!\sum_{a_{|L|-1}} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}} + \sum_{a_1}\!\cdots\!\sum_{a_{|L|-2}} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}} + \cdots + \sum_{a_1} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2} + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|\, c^{(k)}_{a'_1}\Bigg\|^2$$

$$\le \Bigg(\Big\|\sum_{a_1}\!\cdots\!\sum_{a_{|L|-1}} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}}\Big\| + \Big\|\sum_{a_1}\!\cdots\!\sum_{a_{|L|-2}} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}}\Big\| + \cdots + \Big\|\sum_{a_1} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2}\Big\| + \Big\|1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|\, c^{(k)}_{a'_1}\Big\|\Bigg)^2$$

$$\le \Bigg(\sum_{a_1}\!\cdots\!\sum_{a_{|L|-1}} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\,\Big\|c^{(k)}_{a'_{|L|}}\Big\| + \sum_{a_1}\!\cdots\!\sum_{a_{|L|-2}} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\,\Big\|c^{(k)}_{a'_{|L|-1}}\Big\| + \cdots + \sum_{a_1} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\,\Big\|c^{(k)}_{a'_2}\Big\| + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|\,\Big\|c^{(k)}_{a'_1}\Big\|\Bigg)^2$$

$$\stackrel{(a)}{\le} \Phi\Bigg[\sum_{a_1}\!\cdots\!\sum_{a_{|L|-1}} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|^2\,\Big\|c^{(k)}_{a'_{|L|}}\Big\|^2 + \sum_{a_1}\!\cdots\!\sum_{a_{|L|-2}} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|^2\,\Big\|c^{(k)}_{a'_{|L|-1}}\Big\|^2 + \cdots + \sum_{a_1} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|^2\,\Big\|c^{(k)}_{a'_2}\Big\|^2 + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|^2\,\Big\|c^{(k)}_{a'_1}\Big\|^2\Bigg], \qquad (42)$$

where the triangle inequality, i.e., for vectors a_i, 1 ≤ i ≤ n, ‖∑_{i=1}^n a_i‖ ≤ ∑_{i=1}^n ‖a_i‖, is applied sequentially, and

$$\Phi = N_{|L|-1} + N_{|L|-2} + \cdots + N_1 + 1. \qquad (43)$$

Also, inequality (a) in (42) is obtained by exploiting the Cauchy–Schwarz inequality, ⟨a, a′⟩ ≤ ‖a‖·‖a′‖, which results in the following bound, where q = [q_1, ⋯, q_b]:

$$\left(\sum_{i=1}^{b} q_i\right)^2 = \big(\langle \mathbf{1}, q\rangle\big)^2 \le b\sum_{i=1}^{b} q_i^2. \qquad (44)$$
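Inequality (44) is a one-line consequence of Cauchy–Schwarz with the all-ones vector, and is easy to confirm numerically (the vectors below are arbitrary test data):

```python
import numpy as np

# Check inequality (44): (sum_i q_i)^2 = <1, q>^2 <= b * sum_i q_i^2,
# for random vectors q of random length b.
rng = np.random.default_rng(2)
for _ in range(100):
    b = int(rng.integers(1, 20))
    q = rng.standard_normal(b)
    assert q.sum() ** 2 <= b * (q @ q) + 1e-12
print("inequality (44) verified")
```

In (42) this is applied with b equal to the total number of summands Φ, which is why Φ appears as the multiplicative factor in step (a).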

To further bound each of the error terms on the right-hand side of (42), we need to bound ‖c^{(k)}_{a'_p}‖², 1 ≤ p ≤ |L|. For notational simplicity, we consider bounding ‖c^{(k)}_{a'_{|L|}}‖² for the case where sampling is conducted from the nodes of an LUT cluster C located in the bottom-most layer, a'_{|L|} ∈ C^{(k)}. Let c_1, ⋯, c_{|C^{(k)}|} denote the nodes belonging to cluster C, among which is a'_{|L|}.


The evolution of the nodes' parameters in this cluster can be described by (12) as:

$$\big[W^{(k)}_C\big]_m = \Big(V^{(k)}_C\Big)^{\theta^{(k)}_C}\big[\widetilde{W}^{(k)}_C\big]_m, \qquad (45)$$

where

$$\big[W^{(k)}_C\big]_m = \Big[\big(w^{(k)}_{c_1}\big)_m, \cdots, \big(w^{(k)}_{c_{|C^{(k)}|}}\big)_m\Big]^\top, \qquad (46)$$

and

$$\big[\widetilde{W}^{(k)}_C\big]_m = \Big[\big(w^{(k-1)}_{c_1}\big)_m, \cdots, \big(w^{(k-1)}_{c_{|C^{(k)}|}}\big)_m\Big]^\top, \qquad (47)$$

where (·)_m denotes the m-th element of the vector. Let matrix \bar{W}^{(k)}_C denote the average of the vector of parameters in cluster C. Each column of this matrix is given by:

$$\big[\bar{W}^{(k)}_C\big]_m = \frac{\mathbf{1}_{|C^{(k)}|}\mathbf{1}^\top_{|C^{(k)}|}}{|C^{(k)}|}\,\big[W^{(k)}_C\big]_m, \quad \forall m. \qquad (48)$$

It is straightforward to verify that:

$$\Bigg\|\big(w^{(k)}_{c_u}\big)_m - \frac{\Big(\sum_{b=1}^{|C^{(k)}|} w^{(k)}_{c_b}\Big)_m}{|C^{(k)}|}\Bigg\| \le \Big\|\big[W^{(k)}_C\big]_m - \big[\bar{W}^{(k)}_C\big]_m\Big\|, \quad 1 \le u \le |C^{(k)}|. \qquad (49)$$

Also, considering (48) and the fact that the consensus iterations preserve the cluster average, the following equality is immediate:

$$\big[\widetilde{W}^{(k)}_C\big]_m = \big[\bar{W}^{(k)}_C\big]_m + r, \qquad (50)$$

where for vector r we have ⟨1_{|C^{(k)}|}, r⟩ = 0. Based on this and the properties of the consensus matrix given in Assumption 2, we bound the right-hand side of (49) as follows:

$$\Big\|\big[W^{(k)}_C\big]_m - \big[\bar{W}^{(k)}_C\big]_m\Big\| = \Bigg\|\Big(V^{(k)}_C\Big)^{\theta^{(k)}_C}\big[\widetilde{W}^{(k)}_C\big]_m - \big[\bar{W}^{(k)}_C\big]_m\Bigg\| = \Bigg\|\Big(V^{(k)}_C\Big)^{\theta^{(k)}_C}\Big(\big[\bar{W}^{(k)}_C\big]_m + r\Big) - \big[\bar{W}^{(k)}_C\big]_m\Bigg\| = \Bigg\|\Big(V^{(k)}_C\Big)^{\theta^{(k)}_C} r\Bigg\| = \Bigg\|\Bigg(\Big(V^{(k)}_C\Big)^{\theta^{(k)}_C} - \frac{\mathbf{1}_{|C^{(k)}|}\mathbf{1}^\top_{|C^{(k)}|}}{|C^{(k)}|}\Bigg)\, r\Bigg\|. \qquad (51)$$

Note that (1_{|C^{(k)}|}1^⊤_{|C^{(k)}|}/|C^{(k)}|) is idempotent⁵. This immediately makes I − 1_{|C^{(k)}|}1^⊤_{|C^{(k)}|}/|C^{(k)}| idempotent, where I is the identity matrix. Using the properties of the consensus matrix given in Assumption 2, we then get:

$$\Big(V^{(k)}_C\Big)^{\theta^{(k)}_C} - \frac{\mathbf{1}_{|C^{(k)}|}\mathbf{1}^\top_{|C^{(k)}|}}{|C^{(k)}|} = \Big(V^{(k)}_C\Big)^{\theta^{(k)}_C}\Bigg(I - \frac{\mathbf{1}_{|C^{(k)}|}\mathbf{1}^\top_{|C^{(k)}|}}{|C^{(k)}|}\Bigg) = \Big(V^{(k)}_C\Big)^{\theta^{(k)}_C}\Bigg(I - \frac{\mathbf{1}_{|C^{(k)}|}\mathbf{1}^\top_{|C^{(k)}|}}{|C^{(k)}|}\Bigg)^{\theta^{(k)}_C} = \Bigg(V^{(k)}_C\Bigg(I - \frac{\mathbf{1}_{|C^{(k)}|}\mathbf{1}^\top_{|C^{(k)}|}}{|C^{(k)}|}\Bigg)\Bigg)^{\theta^{(k)}_C} = \Bigg(V^{(k)}_C - \frac{\mathbf{1}_{|C^{(k)}|}\mathbf{1}^\top_{|C^{(k)}|}}{|C^{(k)}|}\Bigg)^{\theta^{(k)}_C}. \qquad (52)$$

Using the above equality in (51) and the properties of the matrix norm, we get:

$$\Bigg\|\Bigg(\Big(V^{(k)}_C\Big)^{\theta^{(k)}_C} - \frac{\mathbf{1}_{|C^{(k)}|}\mathbf{1}^\top_{|C^{(k)}|}}{|C^{(k)}|}\Bigg) r\Bigg\| \le \Bigg\|\Bigg(V^{(k)}_C - \frac{\mathbf{1}_{|C^{(k)}|}\mathbf{1}^\top_{|C^{(k)}|}}{|C^{(k)}|}\Bigg)^{\theta^{(k)}_C}\Bigg\|\,\|r\| \le \Bigg\|V^{(k)}_C - \frac{\mathbf{1}_{|C^{(k)}|}\mathbf{1}^\top_{|C^{(k)}|}}{|C^{(k)}|}\Bigg\|^{\theta^{(k)}_C}\|r\| \le (\lambda_C)^{\theta^{(k)}_C}\,\|r\|, \qquad (53)$$

⁵A matrix X is called idempotent if X^y = X for every positive integer y.

Page 19: Multi-Stage Hybrid Federated Learning over Large-Scale ... · We develop multi-stage hybrid model training (MH-MT), a novel learning methodology for distributed ML in these scenarios

19

where the last inequality is due to the fact that for a symmetric matrix A we have ‖A‖ = ρ(A), where ρ and λ_C are defined in Assumption 2. According to Definition 1, we have:

$$\big\|w^{(k)}_{c_u} - w^{(k)}_{c_{u'}}\big\| \le \Upsilon^{(k)}_C, \quad \forall c_u, c_{u'} \in C^{(k)}, \qquad (54)$$

which results in:

$$\Big\|\big(w^{(k)}_{c_u}\big)_m - \big(w^{(k)}_{c_{u'}}\big)_m\Big\| \le \Upsilon^{(k)}_C, \quad \forall c_u, c_{u'} \in C^{(k)}. \qquad (55)$$

As a result, considering (50), we have:

$$\|(r)_u\| = \Bigg\|\big(w^{(k)}_{c_u}\big)_m - \frac{\Big(\sum_{b=1}^{|C^{(k)}|} w^{(k)}_{c_b}\Big)_m}{|C^{(k)}|}\Bigg\| = \frac{1}{|C^{(k)}|}\Bigg\||C^{(k)}|\big(w^{(k)}_{c_u}\big)_m - \Big(\sum_{b=1}^{|C^{(k)}|} w^{(k)}_{c_b}\Big)_m\Bigg\| = \frac{1}{|C^{(k)}|}\Bigg\|\sum_{b=1}^{|C^{(k)}|}\Big[w^{(k)}_{c_b} - w^{(k)}_{c_u}\Big]_m\Bigg\| \le \frac{|C^{(k)}|-1}{|C^{(k)}|}\,\Upsilon^{(k)}_C, \quad 1 \le u \le |C^{(k)}|. \qquad (56)$$

Thus, ‖r‖ ≤ (|C^{(k)}| − 1) Υ^{(k)}_C ≤ |C^{(k)}| Υ^{(k)}_C. It is worth mentioning the following remark here.

Remark 2. In Definition 1, the divergence could have been defined per parameter dimension, i.e., with a different bound on the right-hand side of (55) for each dimension, which would have resulted in tighter upper bounds. In practice, however, such quantities are hard to obtain in large-scale networks when the parameter vector is large.

Replacing the above result in (53) and combining with (49) gives us:

$$\Bigg\|\big(w^{(k)}_{c_u}\big)_m - \frac{\Big(\sum_{b=1}^{|C^{(k)}|} w^{(k)}_{c_b}\Big)_m}{|C^{(k)}|}\Bigg\| \le (\lambda_C)^{\theta^{(k)}_C}\, |C^{(k)}|\, \Upsilon^{(k)}_C, \quad 1 \le u \le |C^{(k)}|. \qquad (57)$$

Using the above inequality, we have:

$$\Bigg\|w^{(k)}_{c_u} - \frac{\sum_{b=1}^{|C^{(k)}|} w^{(k)}_{c_b}}{|C^{(k)}|}\Bigg\| \le \sum_{m=1}^{M}\Bigg\|\big(w^{(k)}_{c_u}\big)_m - \frac{\Big(\sum_{b=1}^{|C^{(k)}|} w^{(k)}_{c_b}\Big)_m}{|C^{(k)}|}\Bigg\| \le (\lambda_C)^{\theta^{(k)}_C}\, |C^{(k)}|\, \Upsilon^{(k)}_C\, M, \quad 1 \le u \le |C^{(k)}|. \qquad (58)$$

Considering the fact that the above inequality bounds the consensus error of every node with respect to the cluster average, and that the parent node samples one of the nodes of the cluster, for any sampled node a'_{|L|} ∈ C we get:

$$\Big\|c^{(k)}_{a'_{|L|}}\Big\|^2 \le (\lambda_C)^{2\theta^{(k)}_C}\, |C^{(k)}|^2\, \big(\Upsilon^{(k)}_C\big)^2 M^2, \quad a'_{|L|} \in C. \qquad (59)$$

The above proof generalizes to every cluster with slight modifications, which results in:

$$\Big\|c^{(k)}_{a'_p}\Big\|^2 \le (\lambda_C)^{2\theta^{(k)}_C}\, |C^{(k)}|^2\, \big(\Upsilon^{(k)}_C\big)^2 M^2, \quad a'_p \in C. \qquad (60)$$

Replacing the above inequality in (42), combined with (41), gives us:

$$F(w^{(k)}) - F(w^{(k-1)}) \le -\frac{\mu}{\eta}\left(F(w^{(k-1)}) - F(w^*)\right) + \frac{\eta\Phi M^2}{2D^2}\Bigg[\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|^4\, \Big(\lambda^{(k)}_{Q(a_{|L|-1})}\Big)^{2\theta^{(k)}_{Q(a_{|L|-1})}}\Big(\Upsilon^{(k)}_{Q(a_{|L|-1})}\Big)^2$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|^4\, \Big(\lambda^{(k)}_{Q(a_{|L|-2})}\Big)^{2\theta^{(k)}_{Q(a_{|L|-2})}}\Big(\Upsilon^{(k)}_{Q(a_{|L|-2})}\Big)^2 + \cdots$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|^4\, \Big(\lambda^{(k)}_{Q(a_1)}\Big)^{2\theta^{(k)}_{Q(a_1)}}\Big(\Upsilon^{(k)}_{Q(a_1)}\Big)^2 + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|^4\, \Big(\lambda^{(k)}_{L_{1,1}}\Big)^{2\theta^{(k)}_{L_{1,1}}}\Big(\Upsilon^{(k)}_{L_{1,1}}\Big)^2\Bigg]. \qquad (61)$$


Adding F(w^{(k-1)}) to both sides and subtracting F(w^*) from both sides, we get:

$$F(w^{(k)}) - F(w^*) \le \underbrace{\left(1-\frac{\mu}{\eta}\right)\left(F(w^{(k-1)}) - F(w^*)\right)}_{(a)} + \frac{\eta\Phi M^2}{2D^2}\Bigg[\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|^4\, \Big(\lambda^{(k)}_{Q(a_{|L|-1})}\Big)^{2\theta^{(k)}_{Q(a_{|L|-1})}}\Big(\Upsilon^{(k)}_{Q(a_{|L|-1})}\Big)^2$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|^4\, \Big(\lambda^{(k)}_{Q(a_{|L|-2})}\Big)^{2\theta^{(k)}_{Q(a_{|L|-2})}}\Big(\Upsilon^{(k)}_{Q(a_{|L|-2})}\Big)^2 + \cdots$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|^4\, \Big(\lambda^{(k)}_{Q(a_1)}\Big)^{2\theta^{(k)}_{Q(a_1)}}\Big(\Upsilon^{(k)}_{Q(a_1)}\Big)^2 + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|^4\, \Big(\lambda^{(k)}_{L_{1,1}}\Big)^{2\theta^{(k)}_{L_{1,1}}}\Big(\Upsilon^{(k)}_{L_{1,1}}\Big)^2\Bigg]. \qquad (62)$$

Expanding term (a) on the right-hand side of the inequality in a recursive manner leads to the theorem result. □

APPENDIX B
PROOF OF PROPOSITION 1

Consider the bound on the number of D2D rounds given in the proposition statement. For L_{j,i}, if σ_j ≤ |L^{(k)}_{j,i}|^4 (Υ^{(k)}_{L_{j,i}})^2, ∀i, the proposed number of D2D rounds guarantees

$$\theta^{(k)}_{L_{j,i}} \ge \frac{\log(\sigma_j) - 2\log\Big(|L^{(k)}_{j,i}|^2\, \Upsilon^{(k)}_{L_{j,i}}\Big)}{2\log\Big(\lambda^{(k)}_{L_{j,i}}\Big)},$$

which results in:

$$\theta^{(k)}_{L_{j,i}} \ge \frac{1}{2}\cdot\frac{\log\left(\dfrac{\sigma_j}{|L^{(k)}_{j,i}|^4\big(\Upsilon^{(k)}_{L_{j,i}}\big)^2}\right)}{\log\Big(\lambda^{(k)}_{L_{j,i}}\Big)} \;\Rightarrow\; \Big(\lambda^{(k)}_{L_{j,i}}\Big)^{2\theta^{(k)}_{L_{j,i}}} \le \frac{\sigma_j}{|L^{(k)}_{j,i}|^4\big(\Upsilon^{(k)}_{L_{j,i}}\big)^2}, \quad \forall k, \qquad (63)$$

where the last implication is due to the facts that log a / log b = log_b a, b^{log_b a} = a, and λ^{(k)}_{L_{j,i}} < 1. Also, for cluster L_{j,i}, if σ_j ≥ |L^{(k)}_{j,i}|^4 (Υ^{(k)}_{L_{j,i}})^2, any θ^{(k)}_{L_{j,i}} ≥ 0 ensures σ_j ≥ |L^{(k)}_{j,i}|^4 (Υ^{(k)}_{L_{j,i}})^2 (λ^{(k)}_{L_{j,i}})^{2θ^{(k)}_{L_{j,i}}}, ∀k. Replacing the above result in (14), we get:

$$F(w^{(k)}) - F(w^*) \le \frac{\eta\Phi M^2}{2D^2}\sum_{t=0}^{k-1}\left(\frac{\eta-\mu}{\eta}\right)^t\Bigg[\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k-t)}_{\{Q(a_{|L|-1})\}}\,\sigma_{|L|} + \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k-t)}_{\{Q(a_{|L|-2})\}}\,\sigma_{|L|-1} + \cdots + \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k-t)}_{\{Q(a_1)\}}\,\sigma_2 + 1^{(k-t)}_{\{L_{1,1}\}}\,\sigma_1\Bigg] + \left(\frac{\eta-\mu}{\eta}\right)^k\left(F(w^{(0)}) - F(w^*)\right)$$

$$\le \frac{\eta\Phi M^2}{2D^2}\sum_{t=0}^{k-1}\left(\frac{\eta-\mu}{\eta}\right)^t\Big[N_{|L|-1}\sigma_{|L|} + N_{|L|-2}\sigma_{|L|-1} + \cdots + N_1\sigma_2 + N_0\sigma_1\Big] + \left(\frac{\eta-\mu}{\eta}\right)^k\left(F(w^{(0)}) - F(w^*)\right). \qquad (64)$$


Taking the limit with respect to k, we get:

$$\lim_{k\to\infty} F(w^{(k)}) - F(w^*) \le \frac{\eta\Phi M^2}{2D^2}\left(\sum_{j=0}^{|L|-1}\sigma_{j+1} N_j\right)\frac{1}{1-\left(\frac{\eta-\mu}{\eta}\right)}, \qquad (65)$$

which concludes the proof.

APPENDIX C
PROOF OF PROPOSITION 2

Consider the per-iteration decrease of the objective function given by (62). Following a procedure similar to Appendix B, given the proposed number of D2D rounds in the proposition statement, we get:

$$F(w^{(k)}) - F(w^*) \le \left(1-\frac{\mu}{\eta}\right)\left(F(w^{(k-1)}) - F(w^*)\right) + \frac{\eta\Phi M^2}{2D^2}\left[\sum_{j=0}^{|L|-1}\sigma^{(k)}_{j+1} N_j\right]. \qquad (66)$$

Using the fact that ∇F(w^*) = 0 combined with the η-smoothness of F, we get:

$$\left\|\nabla F(w^{(k-1)})\right\| = \left\|\nabla F(w^{(k-1)}) - \nabla F(w^*)\right\| \le \eta\,\big\|w^{(k-1)} - w^*\big\|. \qquad (67)$$

Also, it is straightforward to verify that the strong convexity of F, expressed in Assumption 1, implies the following inequality:

$$\frac{\mu}{2}\left\|w^{(k-1)} - w^*\right\|^2 \le F(w^{(k-1)}) - F(w^*). \qquad (68)$$

Combining the above results with the condition given in the proposition statement, i.e., (19), we get:

$$\sum_{j=0}^{|L|-1}\sigma^{(k)}_{j+1} N_j \le \frac{D^2\mu(\mu-\delta\eta)}{\eta^4\Phi M^2}\left\|\nabla F(w^{(k-1)})\right\|^2 \le \frac{D^2\mu(\mu-\delta\eta)}{\eta^2\Phi M^2}\left\|w^{(k-1)} - w^*\right\|^2 \le \frac{2D^2(\mu-\delta\eta)}{\eta^2\Phi M^2}\left(F(w^{(k-1)}) - F(w^*)\right). \qquad (69)$$

By replacing the above inequality in (66), we get:

$$F(w^{(k)}) - F(w^*) \le (1-\mu/\eta)\left(F(w^{(k-1)}) - F(w^*)\right) + (\mu/\eta - \delta)\left(F(w^{(k-1)}) - F(w^*)\right), \qquad (70)$$

which readily leads to the proposition result.

APPENDIX D
PROOF OF COROLLARY 1

Regarding the first condition, at global iteration κ, using the number of D2D rounds given in the corollary statement, according to (64) we have:

$$F(w^{(\kappa)}) - F(w^*) \le \frac{\eta\Phi M^2}{2D^2}\left(\sum_{j=0}^{|L|-1}\sigma_{j+1} N_j\right)\frac{1-\left(1-\frac{\mu}{\eta}\right)^\kappa}{\mu/\eta} + \left(\frac{\eta-\mu}{\eta}\right)^\kappa\left(F(w^{(0)}) - F(w^*)\right)$$

$$= \left(1-\frac{\mu}{\eta}\right)^\kappa\left(F(w^{(0)}) - F(w^*) - \frac{\eta^2\Phi M^2}{2\mu D^2}\sum_{j=0}^{|L|-1}\sigma_{j+1} N_j\right) + \frac{\eta^2\Phi M^2}{2\mu D^2}\sum_{j=0}^{|L|-1}\sigma_{j+1} N_j. \qquad (71)$$

Thus, to satisfy the accuracy requirement, it is sufficient to have:

$$\left(1-\frac{\mu}{\eta}\right)^\kappa\left(F(w^{(0)}) - F(w^*) - \frac{\eta^2\Phi M^2}{2\mu D^2}\sum_{j=0}^{|L|-1}\sigma_{j+1} N_j\right) + \frac{\eta^2\Phi M^2}{2\mu D^2}\sum_{j=0}^{|L|-1}\sigma_{j+1} N_j \le \epsilon. \qquad (72)$$

Performing some algebraic steps leads to (23). Regarding the second condition, given the number of D2D rounds stated in the corollary statement, we first recursively expand the right-hand side of (70) to get:

$$F(w^{(\kappa)}) - F(w^*) \le (1-\delta)^\kappa\left(F(w^{(0)}) - F(w^*)\right). \qquad (73)$$


Thus, to satisfy the desired accuracy, it is sufficient to have:

$$(1-\delta)^\kappa\left(F(w^{(0)}) - F(w^*)\right) \le \epsilon, \qquad (74)$$

which readily leads to (24). Note that the criterion given in the corollary statement for ε guarantees that 0 < δ ≤ µ/η.

APPENDIX E
PROOF OF COROLLARY 2

Regarding the first condition, upon using the number of D2D rounds described in the corollary statement, we get (72), which can be written as:

$$\left(1-\frac{\mu}{\eta}\right)^\kappa \le \frac{\epsilon - \frac{\eta^2\Phi M^2}{2\mu D^2}\sum_{j=0}^{|L|-1}\sigma_{j+1} N_j}{F(w^{(0)}) - F(w^*) - \frac{\eta^2\Phi M^2}{2\mu D^2}\sum_{j=0}^{|L|-1}\sigma_{j+1} N_j}. \qquad (75)$$

To obtain κ, we need to take the logarithm with base 1 − µ/η, where 0 < 1 − µ/η < 1. Using the fact that a logarithm with a positive base less than one reverses the inequality, we get:

$$\kappa \ge \log_{1-\mu/\eta}\left(\left(\epsilon - \frac{\eta^2\Phi M^2}{2\mu D^2}\sum_{j=0}^{|L|-1}\sigma_{j+1} N_j\right)\left(F(w^{(0)}) - F(w^*) - \frac{\eta^2\Phi M^2}{2\mu D^2}\sum_{j=0}^{|L|-1}\sigma_{j+1} N_j\right)^{-1}\right), \qquad (76)$$

which can be written as (21). Regarding the second condition, upon using the number of D2D rounds described in the corollary statement, we get (74). To obtain κ, we take the logarithm with base 1 − δ of both sides; using the fact that 0 < 1 − δ < 1 and the inequality-reversing property of logarithms with a positive base less than one, we get:

$$\kappa \ge \log_{1-\delta}\left(\frac{\epsilon}{F(w^{(0)}) - F(w^*)}\right), \qquad (77)$$

which can be written as (22).
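The iteration count in (77) is directly computable. A minimal sketch, with δ, ε, and the initial optimality gap chosen as illustrative values rather than quantities from the paper:

```python
import math

# Compute the smallest number of global iterations kappa satisfying (74):
#   (1 - delta)^kappa * (F(w0) - F*) <= eps,
# i.e. kappa >= log_{1-delta}(eps / gap), as in (77).  Since the base
# 1 - delta lies in (0, 1), dividing by its (negative) log flips the
# inequality, and the ceiling gives the minimal integer solution.
def min_iterations(eps, gap, delta):
    return math.ceil(math.log(eps / gap) / math.log(1 - delta))

kappa = min_iterations(eps=1e-3, gap=10.0, delta=0.05)
assert (1 - 0.05) ** kappa * 10.0 <= 1e-3          # accuracy reached
assert (1 - 0.05) ** (kappa - 1) * 10.0 > 1e-3     # and kappa is minimal
print("kappa =", kappa)
```

The same pattern computes the κ of (76), with the numerator and denominator shifted by the residual term (η²ΦM²/(2µD²))Σσ_{j+1}N_j.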

APPENDIX F
PROOF OF PROPOSITION 3

Upon sharing gradients, the nodes in the bottom layer share their scaled gradients (each gradient multiplied by the node's number of data points), while the rest of the procedure, i.e., the traversal of the gradients over the hierarchy, is the same as for sharing parameters. For parent node a_p, let a'_{p+1} denote the corresponding sampled child node, ∀p; e.g., in the nested sums below, a'_{|L|} denotes the node in the last layer sampled by its parent a_{|L|-1} in the layer above. Let g^{(k)}_{a'_1} denote the value sampled by the main server at global iteration k. It can be verified that:

$$g^{(k)}_{a'_1} = \frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} |D_{a_{|L|}}|\,\nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big)}{|L^{(k)}_{1,1}|} + \frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}}}{|L^{(k)}_{1,1}|}$$

$$+ \frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}}}{|L^{(k)}_{1,1}|} + \cdots + \frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2}}{|L^{(k)}_{1,1}|} + 1^{(k)}_{\{L_{1,1}\}}\, c^{(k)}_{a'_1}. \qquad (78)$$

The main server then uses this vector as the estimate of the global gradient and builds the parameter vector for the next iteration as follows (note that although the root only receives the gradients, it has knowledge of the previous parameter vector that it broadcast, i.e., w^{(k-1)}):

$$w^{(k)}_{a'_1} = \frac{D\, w^{(k-1)}}{|L^{(k)}_{1,1}|} - \beta_{k-1}\Bigg[\frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} |D_{a_{|L|}}|\,\nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big)}{|L^{(k)}_{1,1}|} + \frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}}}{|L^{(k)}_{1,1}|}$$

$$+ \frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}}}{|L^{(k)}_{1,1}|} + \cdots + \frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2}}{|L^{(k)}_{1,1}|} + 1^{(k)}_{\{L_{1,1}\}}\, c^{(k)}_{a'_1}\Bigg], \qquad (79)$$

which is used to obtain the next global parameter (due to the indicator function in the last term of the above expression, the following expression holds regardless of the operating mode of the cluster at layer L_1):

$$w^{(k)} = \frac{|L^{(k)}_{1,1}|\, w^{(k)}_{a'_1}}{D}. \qquad (80)$$

According to (1), it can be verified that:

$$\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} |D_{a_{|L|}}|\,\nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big) = D\,\nabla F(w^{(k-1)}). \qquad (81)$$

Replacing the above equation in (79) and performing the update given by (80), we get:

$$w^{(k)} = w^{(k-1)} - \beta_{k-1}\Bigg[\nabla F(w^{(k-1)}) + \frac{1}{D}\Bigg(\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}}$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}} + \cdots + \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2} + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|\, c^{(k)}_{a'_1}\Bigg)\Bigg]. \qquad (82)$$

Using the above equality in (32), we have:

$$F(w^{(k)}) \le F(w^{(k-1)}) - \beta_{k-1}\big(\nabla F(w^{(k-1)}) + c^{(k)}\big)^\top\nabla F(w^{(k-1)}) + \frac{\eta}{2}\beta_{k-1}^2\big\|\nabla F(w^{(k-1)}) + c^{(k)}\big\|^2$$

$$= F(w^{(k-1)}) - \beta_{k-1}\left\|\nabla F(w^{(k-1)})\right\|^2 - \beta_{k-1}\big(c^{(k)}\big)^\top\nabla F(w^{(k-1)}) + \frac{\eta}{2}\beta_{k-1}^2\big\|\nabla F(w^{(k-1)}) + c^{(k)}\big\|^2$$

$$= F(w^{(k-1)}) - \beta_{k-1}\left\|\nabla F(w^{(k-1)})\right\|^2 - \beta_{k-1}\big(c^{(k)}\big)^\top\nabla F(w^{(k-1)}) + \frac{\eta\beta_{k-1}^2}{2}\left\|\nabla F(w^{(k-1)})\right\|^2 + \eta\beta_{k-1}^2\big(\nabla F(w^{(k-1)})\big)^\top c^{(k)} + \frac{\eta\beta_{k-1}^2}{2}\big\|c^{(k)}\big\|^2, \qquad (83)$$


where

$$c^{(k)} \triangleq \frac{1}{D}\Bigg(\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}}$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}} + \cdots$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2} + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|\, c^{(k)}_{a'_1}\Bigg). \qquad (84)$$

Taking the expectation of both sides (with respect to the consensus errors), and using the fact that under the consensus method, when one node is sampled uniformly at random, we have⁶ E[c^{(k)}_{a'_p}] = 0, ∀p, which implies E[c^{(k)}] = 0, ∀k, replacing which in (83) gives us:

$$E[F(w^{(k)})] \le F(w^{(k-1)}) - \left(1 - \frac{\eta\beta_{k-1}}{2}\right)\beta_{k-1}\left\|\nabla F(w^{(k-1)})\right\|^2 + \frac{\eta\beta_{k-1}^2}{2}\,E\big[\|c^{(k)}\|^2\big]. \qquad (85)$$

Using the fact that β_0 ≤ 1/η, we get β_k ≤ 1/η, and thus 1 − ηβ_k/2 ≥ 1/2, ∀k. Using this in the above inequality gives us:

$$E[F(w^{(k)})] \le F(w^{(k-1)}) - \frac{\beta_{k-1}}{2}\left\|\nabla F(w^{(k-1)})\right\|^2 + \frac{\eta\beta_{k-1}^2}{2}\,E\big[\|c^{(k)}\|^2\big]. \qquad (86)$$

Using strong convexity, we get the Polyak–Łojasiewicz inequality [53] in the form ‖∇F(w^{(k-1)})‖² ≥ 2µ(F(w^{(k-1)}) − F(w^*)); using this in the above inequality yields:

$$E[F(w^{(k)})] \le F(w^{(k-1)}) - \beta_{k-1}\mu\left(F(w^{(k-1)}) - F(w^*)\right) + \frac{\eta\beta_{k-1}^2}{2}\,E\big[\|c^{(k)}\|^2\big], \qquad (87)$$

or, equivalently:

$$E[F(w^{(k)})] - F(w^*) \le (1 - \beta_{k-1}\mu)\left(F(w^{(k-1)}) - F(w^*)\right) + \frac{\eta\beta_{k-1}^2}{2}\,E\big[\|c^{(k)}\|^2\big]. \qquad (88)$$

Taking total expectation of both sides, with respect to all the consensus errors up to iteration k, results in:

$$E\big[F(w^{(k)}) - F(w^*)\big] \le (1 - \beta_{k-1}\mu)\,E\big[F(w^{(k-1)}) - F(w^*)\big] + \frac{\eta\beta_{k-1}^2}{2}\,E\big[\|c^{(k)}\|^2\big]. \qquad (89)$$

We continue the proof by induction. The proposition result trivially holds for iteration 0. Assume that the result holds for iteration k, i.e., E[F(w^{(k)}) − F(w^*)] ≤ Γ/(k+λ). We aim to show that the result also holds for iteration k + 1. Using (89) with β_k = α/(k+λ), we get:

$$E\big[F(w^{(k+1)}) - F(w^*)\big] \le (1 - \beta_k\mu)\,E\big[F(w^{(k)}) - F(w^*)\big] + \frac{\eta\beta_k^2}{2}\,E\big[\|c^{(k+1)}\|^2\big], \qquad (90)$$

which results in:

$$E\big[F(w^{(k+1)}) - F(w^*)\big] \le \left(1 - \frac{\alpha\mu}{k+\lambda}\right)\frac{\Gamma}{k+\lambda} + \frac{\eta\alpha^2}{2(k+\lambda)^2}\,E\big[\|c^{(k+1)}\|^2\big] = \left(\frac{k+\lambda-\alpha\mu}{(k+\lambda)^2}\right)\Gamma + \frac{\eta\alpha^2}{2(k+\lambda)^2}\,E\big[\|c^{(k+1)}\|^2\big]$$

$$= \left(\frac{k+\lambda-1}{(k+\lambda)^2}\right)\Gamma - \frac{\alpha\mu-1}{(k+\lambda)^2}\,\Gamma + \frac{\eta\alpha^2}{2(k+\lambda)^2}\,E\big[\|c^{(k+1)}\|^2\big]. \qquad (91)$$

⁶Consider a set of n numbers x_1, ⋯, x_n with mean x̄, and let X denote a random variable with probability mass function p(X = x_i) = 1/n, 1 ≤ i ≤ n. It is straightforward to verify that E[X − x̄] = 0.


Note that using a method similar to Appendix A, we can get:⁷

$$E\big[\|c^{(k+1)}\|^2\big] \le \frac{\Phi M^2}{D^2}\Bigg[\sum_{a_1\in L^{(k+1)}_{1,1}}\sum_{a_2\in Q^{(k+1)}(a_1)}\cdots\sum_{a_{|L|-1}\in Q^{(k+1)}(a_{|L|-2})} 1^{(k+1)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k+1)}(a_{|L|-1})|^4\, \Big(\lambda^{(k+1)}_{Q(a_{|L|-1})}\Big)^{2\theta^{(k+1)}_{Q(a_{|L|-1})}}\Big(\Upsilon^{(k+1)}_{Q(a_{|L|-1})}\Big)^2$$

$$+ \sum_{a_1\in L^{(k+1)}_{1,1}}\sum_{a_2\in Q^{(k+1)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k+1)}(a_{|L|-3})} 1^{(k+1)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k+1)}(a_{|L|-2})|^4\, \Big(\lambda^{(k+1)}_{Q(a_{|L|-2})}\Big)^{2\theta^{(k+1)}_{Q(a_{|L|-2})}}\Big(\Upsilon^{(k+1)}_{Q(a_{|L|-2})}\Big)^2 + \cdots$$

$$+ \sum_{a_1\in L^{(k+1)}_{1,1}} 1^{(k+1)}_{\{Q(a_1)\}}\, |Q^{(k+1)}(a_1)|^4\, \Big(\lambda^{(k+1)}_{Q(a_1)}\Big)^{2\theta^{(k+1)}_{Q(a_1)}}\Big(\Upsilon^{(k+1)}_{Q(a_1)}\Big)^2 + 1^{(k+1)}_{\{L_{1,1}\}}\, |L^{(k+1)}_{1,1}|^4\, \Big(\lambda^{(k+1)}_{L_{1,1}}\Big)^{2\theta^{(k+1)}_{L_{1,1}}}\Big(\Upsilon^{(k+1)}_{L_{1,1}}\Big)^2\Bigg]. \qquad (92)$$

Using the number of D2D rounds given in the proposition, and similarly to the approach taken in Appendix B, it can be verified that E[‖c^{(k+1)}‖²] ≤ C = (ΦM²/D²)∑_{j=0}^{|L|-1} N_j σ_{j+1}, ∀k. Using this and the definition of Γ in (26), we get Γ ≥ ηα²C/(2(αµ−1)), ∀k. Using this result in the last line of (91), we get:

$$E\big[F(w^{(k+1)}) - F(w^*)\big] \le \left(\frac{k+\lambda-1}{(k+\lambda)^2}\right)\Gamma. \qquad (93)$$

Note that since k + λ > 1, we have (k+λ)² ≥ (k+λ−1)(k+λ+1). Using this fact in (93), we get:

$$E\big[F(w^{(k+1)}) - F(w^*)\big] \le \left(\frac{1}{k+\lambda+1}\right)\Gamma, \qquad (94)$$

which completes the induction and thus the proof.

⁷Note that if the inequality ‖X‖² < y holds for every realization of the random variable X, then E[‖X‖²] < y.
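The induction of (89)–(94) yields the classic O(1/k) rate under the diminishing step size β_k = α/(k+λ). A minimal simulation of the worst-case recursion, with µ, η, α, λ, C, and the initial gap chosen as illustrative constants satisfying the proposition's assumptions (αµ > 1 and β_0 ≤ 1/η):

```python
# Simulate the recursion (89) in the worst case:
#   e_k = (1 - beta_{k-1} * mu) * e_{k-1} + (eta / 2) * beta_{k-1}^2 * C,
# with beta_k = alpha / (k + lambda_), and check the induction result
#   e_k <= Gamma / (k + lambda_),
# where Gamma = max(eta*alpha^2*C / (2*(alpha*mu - 1)), lambda_ * e0).
mu, eta, alpha, lam_, C, e = 1.0, 4.0, 2.0, 8.0, 1.0, 0.1
assert alpha * mu > 1 and alpha / lam_ <= 1 / eta    # proposition's conditions
Gamma = max(eta * alpha**2 * C / (2 * (alpha * mu - 1)), lam_ * e)

for k in range(1, 2000):
    beta = alpha / (k - 1 + lam_)                    # beta_{k-1}
    e = (1 - beta * mu) * e + 0.5 * eta * beta**2 * C
    assert e <= Gamma / (k + lam_) + 1e-12           # O(1/k) bound holds
print("O(1/k) bound verified, Gamma =", Gamma)
```

The bound is tight in its scaling: the persistent noise term (η/2)β²_{k-1}C decays like 1/k², which the 1 − β_{k-1}µ contraction converts into a 1/k optimality gap.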

$$F(w^{(k)}) - F(w^*) \le \left[\prod_{l=1}^{k}\left(1-\frac{\mu}{\eta} + 8\,\frac{c_2}{D^2}\big(D - D^{(l)}_s\big)^2\right)\right]\left(F(w^{(0)}) - F(w^*)\right)$$

$$+ \sum_{t=1}^{k}\;\prod_{l=t+1}^{k}\left(1-\frac{\mu}{\eta} + 8\,\frac{c_2}{D^2}\big(D - D^{(l)}_s\big)^2\right)\Bigg\{\frac{\eta\Phi M^2}{\big(D^{(t)}_s\big)^2}\Bigg[\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(t)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|^4\, \Big(\lambda^{(t)}_{Q(a_{|L|-1})}\Big)^{2\theta^{(t)}_{Q(a_{|L|-1})}}\Big(\Upsilon^{(t)}_{Q(a_{|L|-1})}\Big)^2$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(t)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|^4\, \Big(\lambda^{(t)}_{Q(a_{|L|-2})}\Big)^{2\theta^{(t)}_{Q(a_{|L|-2})}}\Big(\Upsilon^{(t)}_{Q(a_{|L|-2})}\Big)^2 + \cdots$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}} 1^{(t)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|^4\, \Big(\lambda^{(t)}_{Q(a_1)}\Big)^{2\theta^{(t)}_{Q(a_1)}}\Big(\Upsilon^{(t)}_{Q(a_1)}\Big)^2 + 1^{(t)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|^4\, \Big(\lambda^{(t)}_{L_{1,1}}\Big)^{2\theta^{(t)}_{L_{1,1}}}\Big(\Upsilon^{(t)}_{L_{1,1}}\Big)^2\Bigg] + \frac{4}{\eta}\left(\frac{D - D^{(t)}_s}{D}\right)^2 c_1\Bigg\} \qquad (95)$$


APPENDIX G
CLUSTER SAMPLING

In a system of millions or billions of users, one technique the main server can use to reduce the network load is to engage only a portion of the devices in each global iteration. We realize this in FogL via cluster sampling, in which, at each global iteration, a portion of the clusters of the bottom-most layer are engaged in model training; we refer to these as active clusters. We assume that at each global iteration k, the main server engages a set of |S^{(k)}| clusters in the learning, where each element of the set S^{(k)} corresponds to one cluster in the bottom-most layer. Consequently, we partition the nodes in the different layers into active nodes (those on the path between an active cluster and the main server) and passive nodes. Similarly, a cluster in a middle layer is called active if it contains at least one active node. To capture these dynamics, with some abuse of notation, let 1^{(k)}_{\{C\}} take the value 1 if cluster C is both active and operates in LUT mode in global aggregation k, and 0 otherwise. To conduct the analysis, in addition to Assumptions 1 and 2, we also consider the following assumption, which is common in the stochastic optimization literature [54]:

$$\exists\, c_1 \ge 0,\; c_2 \ge 1:\quad \|\nabla f_i(x)\|^2 \le c_1 + c_2\,\|\nabla F(x)\|^2, \quad \forall i, x. \qquad (96)$$

Proposition 4. For global iteration k of MH-FL with cluster sampling, the convergence upper bound of the objective function is given by (95), where D^{(k)}_s denotes the total number of data points of the sampled devices at iteration k, i.e., D^{(k)}_s = ∑_{n∈N_{|L|}} 1^{(k)}_{\{B(n)\}} |D_n|, with B(n) referring to the cluster that node n belongs to.⁸

Proof. To find the relationship between w^{(k)} and w^{(k-1)}, we follow the procedure described in the main text. Let 1^{(k)}_{\{S(C)\}} take the value 1 when cluster C is active in global aggregation k, and 0 otherwise. Also, with some abuse of notation, let 1^{(k)}_{\{C\}} take the value 1 if cluster C is both active and operates in LUT mode in global aggregation k, and 0 otherwise. It can be verified that, at global iteration k, the parameter of the node located in L_1 sampled by the main server, referred to as a'_1, is given by (97), which is used by the server to obtain the next global parameter as follows:

$$w^{(k)} = \frac{|L^{(k)}_{1,1}|\, w^{(k)}_{a'_1}}{D^{(k)}_s}, \qquad (98)$$

where D^{(k)}_s = ∑_{n∈N_{|L|}} 1^{(k)}_{\{B(n)\}} |D_n|, with B(n) referring to the cluster that node n belongs to, is the total number of data points available at the sampled devices at global aggregation k, which is assumed to be known to the server (in this case the server needs knowledge of the number of data points available at the active clusters). Following a procedure similar to that of Appendix A, we obtain (99). Let us define ϖ^{(k)} as in (100).

⁸It is assumed that ∏_{j=k+1}^{k} c_j = 1, ∀c_j.

$$w^{(k)}_{a'_1} = \frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} 1^{(k)}_{\{S(Q(a_{|L|-1}))\}}\, |D_{a_{|L|}}|\, w^{(k-1)}_{a_{|L|}}}{|L^{(k)}_{1,1}|} \;-\; \frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} 1^{(k)}_{\{S(Q(a_{|L|-1}))\}}\, \beta\,|D_{a_{|L|}}|\, \nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big)}{|L^{(k)}_{1,1}|}$$

$$+\; \frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}}}{|L^{(k)}_{1,1}|} \;+\; \frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}}}{|L^{(k)}_{1,1}|}$$

$$+\;\cdots\;+\; \frac{\displaystyle\sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2}}{|L^{(k)}_{1,1}|} \;+\; 1^{(k)}_{\{L_{1,1}\}}\, c^{(k)}_{a'_1} \qquad (97)$$


$$w^{(k)} = w^{(k-1)} - \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} 1^{(k)}_{\{S(Q(a_{|L|-1}))\}}\,\frac{\beta\,|D_{a_{|L|}}|}{D^{(k)}_s}\,\nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big)$$

$$+ \frac{1}{D^{(k)}_s}\Bigg[\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}} + \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}} + \cdots$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2} + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|\, c^{(k)}_{a'_1}\Bigg] \qquad (99)$$

$$\varpi^{(k)} \triangleq \frac{1}{D^{(k)}_s}\Bigg[\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|\, c^{(k)}_{a'_{|L|}}$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|\, c^{(k)}_{a'_{|L|-1}} + \cdots + \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|\, c^{(k)}_{a'_2} + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|\, c^{(k)}_{a'_1}\Bigg]. \qquad (100)$$

By adding and subtracting a term, we rewrite (99) as follows:

$$w^{(k)} = w^{(k-1)} - \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} 1^{(k)}_{\{S(Q(a_{|L|-1}))\}}\,\frac{\beta\,|D_{a_{|L|}}|}{D^{(k)}_s}\,\nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big) + \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} \frac{\beta\,|D_{a_{|L|}}|}{D}\,\nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big)$$

$$- \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} \frac{\beta\,|D_{a_{|L|}}|}{D}\,\nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big) + \varpi^{(k)}, \qquad (101)$$

or, equivalently:

$$w^{(k)} = w^{(k-1)} - \beta\,\nabla F(w^{(k-1)}) - \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} 1^{(k)}_{\{S(Q(a_{|L|-1}))\}}\,\frac{\beta\,|D_{a_{|L|}}|}{D^{(k)}_s}\,\nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big)$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} \frac{\beta\,|D_{a_{|L|}}|}{D}\,\nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big) + \varpi^{(k)}. \qquad (102)$$

Let us define ϱ^{(k)} as follows:

$$\varrho^{(k)} \triangleq \beta\Bigg[-\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} 1^{(k)}_{\{S(Q(a_{|L|-1}))\}}\,\frac{|D_{a_{|L|}}|}{D^{(k)}_s}\,\nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big)$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\sum_{a_3\in Q^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in Q^{(k)}(a_{|L|-1})} \frac{|D_{a_{|L|}}|}{D}\,\nabla f_{a_{|L|}}\big(w^{(k-1)}_{a_{|L|}}\big) + \frac{1}{\beta}\,\varpi^{(k)}\Bigg]. \qquad (103)$$

For global iteration k, let S̄^{(k)} denote the set of passive clusters, which is the complement of S^{(k)}, i.e., S^{(k)} ∪ S̄^{(k)} = L_{|L|} and S^{(k)} ∩ S̄^{(k)} = ∅, where L_{|L|} denotes the set of all clusters located in the bottom-most layer. Let 1^{(k)}_{\{S̄(C)\}} take the value 1 when cluster C is passive in global aggregation k, and 0 otherwise. Following the procedure described in the proof of Appendix A, we first aim to bound E[‖ϱ^{(k)}‖²]. The procedure is described in (107). In that series of simplifications in (107),


$$\|\varpi^{(k)}\|^2 \le \frac{\Phi M^2}{\big(D^{(k)}_s\big)^2}\Bigg[\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|^4\, \Big(\lambda^{(k)}_{Q(a_{|L|-1})}\Big)^{2\theta^{(k)}_{Q(a_{|L|-1})}}\Big(\Upsilon^{(k)}_{Q(a_{|L|-1})}\Big)^2$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|^4\, \Big(\lambda^{(k)}_{Q(a_{|L|-2})}\Big)^{2\theta^{(k)}_{Q(a_{|L|-2})}}\Big(\Upsilon^{(k)}_{Q(a_{|L|-2})}\Big)^2 + \cdots$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|^4\, \Big(\lambda^{(k)}_{Q(a_1)}\Big)^{2\theta^{(k)}_{Q(a_1)}}\Big(\Upsilon^{(k)}_{Q(a_1)}\Big)^2 + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|^4\, \Big(\lambda^{(k)}_{L_{1,1}}\Big)^{2\theta^{(k)}_{L_{1,1}}}\Big(\Upsilon^{(k)}_{L_{1,1}}\Big)^2\Bigg] \qquad (105)$$

$$F(w^{(k)}) - F(w^*) \le \left(1-\frac{\mu}{\eta}\right)\left(F(w^{(k-1)}) - F(w^*)\right) + \frac{\eta}{2}\Bigg[\frac{2\Phi M^2}{\big(D^{(k)}_s\big)^2}\Bigg[\sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in Q^{(k)}(a_{|L|-2})} 1^{(k)}_{\{Q(a_{|L|-1})\}}\, |Q^{(k)}(a_{|L|-1})|^4\, \Big(\lambda^{(k)}_{Q(a_{|L|-1})}\Big)^{2\theta^{(k)}_{Q(a_{|L|-1})}}\Big(\Upsilon^{(k)}_{Q(a_{|L|-1})}\Big)^2$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}}\sum_{a_2\in Q^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in Q^{(k)}(a_{|L|-3})} 1^{(k)}_{\{Q(a_{|L|-2})\}}\, |Q^{(k)}(a_{|L|-2})|^4\, \Big(\lambda^{(k)}_{Q(a_{|L|-2})}\Big)^{2\theta^{(k)}_{Q(a_{|L|-2})}}\Big(\Upsilon^{(k)}_{Q(a_{|L|-2})}\Big)^2 + \cdots$$

$$+ \sum_{a_1\in L^{(k)}_{1,1}} 1^{(k)}_{\{Q(a_1)\}}\, |Q^{(k)}(a_1)|^4\, \Big(\lambda^{(k)}_{Q(a_1)}\Big)^{2\theta^{(k)}_{Q(a_1)}}\Big(\Upsilon^{(k)}_{Q(a_1)}\Big)^2 + 1^{(k)}_{\{L_{1,1}\}}\, |L^{(k)}_{1,1}|^4\, \Big(\lambda^{(k)}_{L_{1,1}}\Big)^{2\theta^{(k)}_{L_{1,1}}}\Big(\Upsilon^{(k)}_{L_{1,1}}\Big)^2\Bigg]$$

$$+ \frac{8}{\eta^2}\left(\frac{D - D^{(k)}_s}{D}\right)^2\Big(c_1 + 2c_2\eta\big(F(w^{(k-1)}) - F(w^*)\big)\Big)\Bigg] \qquad (106)$$

the triangle inequality is applied repeatedly. In inequality (a), we have used the fact that $(\|\mathbf{a}\|+\|\mathbf{b}\|)^2 \le 2(\|\mathbf{a}\|^2+\|\mathbf{b}\|^2)$; in inequality (b), we have used the fact that $\frac{1}{D^{(k)}_s} = \frac{1}{D} - \frac{D^{(k)}_s - D}{D\,D^{(k)}_s}$; in (c), we have used (96); and in inequality (d), we have used the smoothness definition in Assumption 1, which can also be written as:
\[
F(\mathbf{y}) \le F(\mathbf{x}) + (\mathbf{y}-\mathbf{x})^\top \nabla F(\mathbf{x}) + \frac{\eta}{2}\|\mathbf{y}-\mathbf{x}\|^2, \quad \forall \mathbf{x},\mathbf{y}. \qquad (104)
\]
Minimizing both sides of this inequality results in $\|\nabla F(\mathbf{w})\|^2 \le 2\eta\big(F(\mathbf{w})-F(\mathbf{w}^*)\big)$, $\forall \mathbf{w}$. Note that $\|\varpi^{(k)}\|^2$ can be bounded similarly to Appendix A, as in (105). Replacing this with $\beta = \frac{1}{\eta}$ in the bound in (107), and following the procedure of the proof in Appendix A, we get (106), which can be recursively expanded to obtain the bound in the proposition statement. ∎
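As a quick numerical illustration (separate from the proof), the inequality $\|\nabla F(\mathbf{w})\|^2 \le 2\eta(F(\mathbf{w})-F(\mathbf{w}^*))$ obtained above can be checked on a toy $\eta$-smooth function. The quadratic and all constants below are hypothetical choices for illustration only; any curvature $c \le \eta$ gives an $\eta$-smooth function.

```python
# Toy check of ||grad F(w)||^2 <= 2*eta*(F(w) - F(w*)) for an eta-smooth
# 1-D quadratic F(w) = (c/2)*(w - w_star)^2, which is eta-smooth when c <= eta.
c, eta, w_star = 4.0, 10.0, 1.5          # assumed toy constants

def F(w):
    return 0.5 * c * (w - w_star) ** 2

def gradF(w):
    return c * (w - w_star)

F_star = F(w_star)                        # optimal value (zero here)
for w in [-3.0, 0.0, 2.5, 7.0]:
    lhs = gradF(w) ** 2
    rhs = 2 * eta * (F(w) - F_star)
    assert lhs <= rhs + 1e-12, (w, lhs, rhs)
```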

Remark 3. The methodology used to derive all the previous results regarding the convergence and the number of D2D rounds can be studied for this scenario with cluster sampling, which we leave as future work. One key observation from (95) is that upon increasing the number of active clusters, often resulting in increasing $D^{(k)}_s$, $\forall k$, the right-hand side of (95) starts to decrease, which implies a higher training accuracy, and the similarity between the bounds (14) and (95) increases. In the limiting case $D^{(k)}_s = D$, $\forall k$, bound (95) can be written similarly to (14), where $\frac{\eta\Phi M^2}{2D^2}$ in (14) would be replaced by the larger value $\frac{\eta\Phi M^2}{D^2}$.


For brevity in (107), let $\sum_{\mathbf{a}}$ denote the nested sum $\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}}\sum_{a_2\in\mathcal{Q}^{(k)}(a_1)}\sum_{a_3\in\mathcal{Q}^{(k)}(a_2)}\cdots\sum_{a_{|L|}\in\mathcal{Q}^{(k)}(a_{|L|-1})}$, and let $\mathbf{1}^{(k)}_{\mathcal{S}}$ denote $\mathbf{1}^{(k)}_{\{\mathcal{S}(\mathcal{Q}(a_{|L|-1}))\}}$.
\[
\begin{aligned}
\frac{1}{\beta^2}\|\varrho^{(k)}\|^2 &= \Bigg\| -\sum_{\mathbf{a}} \mathbf{1}^{(k)}_{\mathcal{S}}\,\frac{|\mathcal{D}_{a_{|L|}}|}{D^{(k)}_s}\,\nabla f_{a_{|L|}}\big(\mathbf{w}^{(k-1)}_{a_{|L|}}\big) + \sum_{\mathbf{a}} \frac{|\mathcal{D}_{a_{|L|}}|}{D}\,\nabla f_{a_{|L|}}\big(\mathbf{w}^{(k-1)}_{a_{|L|}}\big) + \frac{1}{\beta}\varpi^{(k)} \Bigg\|^2 \\
&\le \Bigg( \Big\|\frac{1}{\beta}\varpi^{(k)}\Big\| + \Bigg\|\sum_{\mathbf{a}} \frac{|\mathcal{D}_{a_{|L|}}|}{D}\,\nabla f_{a_{|L|}}\big(\mathbf{w}^{(k-1)}_{a_{|L|}}\big) - \sum_{\mathbf{a}} \mathbf{1}^{(k)}_{\mathcal{S}}\,\frac{|\mathcal{D}_{a_{|L|}}|}{D^{(k)}_s}\,\nabla f_{a_{|L|}}\big(\mathbf{w}^{(k-1)}_{a_{|L|}}\big)\Bigg\| \Bigg)^2 \\
&\overset{(a)}{\le} 2\Big\|\frac{1}{\beta}\varpi^{(k)}\Big\|^2 + 2\Bigg\|\sum_{\mathbf{a}} \frac{|\mathcal{D}_{a_{|L|}}|}{D}\,\nabla f_{a_{|L|}}\big(\mathbf{w}^{(k-1)}_{a_{|L|}}\big) - \sum_{\mathbf{a}} \mathbf{1}^{(k)}_{\mathcal{S}}\,\frac{|\mathcal{D}_{a_{|L|}}|}{D^{(k)}_s}\,\nabla f_{a_{|L|}}\big(\mathbf{w}^{(k-1)}_{a_{|L|}}\big)\Bigg\|^2 \\
&\overset{(b)}{\le} \frac{2}{\beta^2}\big\|\varpi^{(k)}\big\|^2 + 2\Bigg\|\sum_{\mathbf{a}} \mathbf{1}^{(k)}_{\mathcal{S}}\,\frac{|\mathcal{D}_{a_{|L|}}|}{D}\,\nabla f_{a_{|L|}}\big(\mathbf{w}^{(k-1)}_{a_{|L|}}\big) - \sum_{\mathbf{a}} \mathbf{1}^{(k)}_{\mathcal{S}}\,\frac{(D-D^{(k)}_s)\,|\mathcal{D}_{a_{|L|}}|}{D\,D^{(k)}_s}\,\nabla f_{a_{|L|}}\big(\mathbf{w}^{(k-1)}_{a_{|L|}}\big)\Bigg\|^2 \\
&\le \frac{2}{\beta^2}\big\|\varpi^{(k)}\big\|^2 + 2\Bigg(\sum_{\mathbf{a}} \mathbf{1}^{(k)}_{\mathcal{S}}\,\frac{|\mathcal{D}_{a_{|L|}}|}{D}\,\big\|\nabla f_{a_{|L|}}\big(\mathbf{w}^{(k-1)}_{a_{|L|}}\big)\big\| + \sum_{\mathbf{a}} \mathbf{1}^{(k)}_{\mathcal{S}}\,\frac{(D-D^{(k)}_s)\,|\mathcal{D}_{a_{|L|}}|}{D\,D^{(k)}_s}\,\big\|\nabla f_{a_{|L|}}\big(\mathbf{w}^{(k-1)}_{a_{|L|}}\big)\big\|\Bigg)^2 \\
&\le \frac{2}{\beta^2}\big\|\varpi^{(k)}\big\|^2 + 2\Bigg(\sum_{\mathbf{a}} \mathbf{1}^{(k)}_{\mathcal{S}}\,\frac{|\mathcal{D}_{a_{|L|}}|}{D}\max_{a_{|L|}\in\mathcal{Q}^{(k)}(a_{|L|-1})}\big\|\nabla f_{a_{|L|}}\big(\mathbf{w}^{(k-1)}_{a_{|L|}}\big)\big\| + \sum_{\mathbf{a}} \mathbf{1}^{(k)}_{\mathcal{S}}\,\frac{(D-D^{(k)}_s)\,|\mathcal{D}_{a_{|L|}}|}{D\,D^{(k)}_s}\max_{a_{|L|}\in\mathcal{Q}^{(k)}(a_{|L|-1})}\big\|\nabla f_{a_{|L|}}\big(\mathbf{w}^{(k-1)}_{a_{|L|}}\big)\big\|\Bigg)^2 \\
&= \frac{2}{\beta^2}\big\|\varpi^{(k)}\big\|^2 + 2\Bigg(\frac{D-D^{(k)}_s}{D}\max_{a\in\mathcal{N}_{|L|}}\big\|\nabla f_a\big(\mathbf{w}^{(k-1)}_a\big)\big\| + \frac{(D-D^{(k)}_s)\,D^{(k)}_s}{D\,D^{(k)}_s}\max_{a\in\mathcal{N}_{|L|}}\big\|\nabla f_a\big(\mathbf{w}^{(k-1)}_a\big)\big\|\Bigg)^2 \\
&\le \frac{2}{\beta^2}\big\|\varpi^{(k)}\big\|^2 + 2\Bigg(2\,\frac{D-D^{(k)}_s}{D}\max_{a\in\mathcal{N}_{|L|}}\big\|\nabla f_a\big(\mathbf{w}^{(k-1)}_a\big)\big\|\Bigg)^2
= \frac{2}{\beta^2}\big\|\varpi^{(k)}\big\|^2 + 8\Big(\frac{D-D^{(k)}_s}{D}\Big)^2\Big(\max_{a\in\mathcal{N}_{|L|}}\big\|\nabla f_a\big(\mathbf{w}^{(k-1)}_a\big)\big\|\Big)^2 \\
&\overset{(c)}{\le} \frac{2}{\beta^2}\big\|\varpi^{(k)}\big\|^2 + 8\Big(\frac{D-D^{(k)}_s}{D}\Big)^2\big(c_1 + c_2\,\|\nabla F(\mathbf{w}^{(k-1)})\|^2\big)
\overset{(d)}{\le} \frac{2}{\beta^2}\big\|\varpi^{(k)}\big\|^2 + 8\Big(\frac{D-D^{(k)}_s}{D}\Big)^2\big(c_1 + 2 c_2 \eta\,(F(\mathbf{w}^{(k-1)}) - F(\mathbf{w}^*))\big)
\end{aligned} \qquad (107)
\]

The following appendix is the last appendix of the paper concerned with theoretical analysis; it is followed by another appendix containing extensive numerical simulations.


APPENDIX H
AGGREGATION ERROR UPON USING ALGORITHM 3

According to (33), the aggregation error at the $k$-th global aggregation is given by:
\[
\mathbf{e}^{(k)} = \frac{1}{D}\Bigg(\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}}\sum_{a_2\in\mathcal{Q}^{(k)}(a_1)}\sum_{a_3\in\mathcal{Q}^{(k)}(a_2)}\cdots\sum_{a_{|L|-1}\in\mathcal{Q}^{(k)}(a_{|L|-2})} \mathbf{1}^{(k)}_{\{\mathcal{Q}(a_{|L|-1})\}}\,\big|\mathcal{Q}^{(k)}(a_{|L|-1})\big|\,\mathbf{c}^{(k)}_{a'_{|L|}}
+\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}}\sum_{a_2\in\mathcal{Q}^{(k)}(a_1)}\sum_{a_3\in\mathcal{Q}^{(k)}(a_2)}\cdots\sum_{a_{|L|-2}\in\mathcal{Q}^{(k)}(a_{|L|-3})} \mathbf{1}^{(k)}_{\{\mathcal{Q}(a_{|L|-2})\}}\,\big|\mathcal{Q}^{(k)}(a_{|L|-2})\big|\,\mathbf{c}^{(k)}_{a'_{|L|-1}}
+\cdots
+\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}} \mathbf{1}^{(k)}_{\{\mathcal{Q}(a_1)\}}\,\big|\mathcal{Q}^{(k)}(a_1)\big|\,\mathbf{c}^{(k)}_{a'_2}
+\mathbf{1}^{(k)}_{\{\mathcal{L}_{1,1}\}}\,\big|\mathcal{L}^{(k)}_{1,1}\big|\,\mathbf{c}^{(k)}_{a'_1}\Bigg). \qquad (108)
\]

Following a similar procedure to that described in Appendix A, we get:
\[
\|\mathbf{e}^{(k)}\|^2 \le \frac{\Phi M^2}{D^2}\Bigg[\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}}\sum_{a_2\in\mathcal{Q}^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in\mathcal{Q}^{(k)}(a_{|L|-2})} \mathbf{1}^{(k)}_{\{\mathcal{Q}(a_{|L|-1})\}}\,\big|\mathcal{Q}^{(k)}(a_{|L|-1})\big|^4\big(\lambda^{(k)}_{\mathcal{Q}(a_{|L|-1})}\big)^{2\theta^{(k)}_{\mathcal{Q}(a_{|L|-1})}}\big(\Upsilon^{(k)}_{\mathcal{Q}(a_{|L|-1})}\big)^2
+\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}}\sum_{a_2\in\mathcal{Q}^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in\mathcal{Q}^{(k)}(a_{|L|-3})} \mathbf{1}^{(k)}_{\{\mathcal{Q}(a_{|L|-2})\}}\,\big|\mathcal{Q}^{(k)}(a_{|L|-2})\big|^4\big(\lambda^{(k)}_{\mathcal{Q}(a_{|L|-2})}\big)^{2\theta^{(k)}_{\mathcal{Q}(a_{|L|-2})}}\big(\Upsilon^{(k)}_{\mathcal{Q}(a_{|L|-2})}\big)^2
+\cdots+\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}} \mathbf{1}^{(k)}_{\{\mathcal{Q}(a_1)\}}\,\big|\mathcal{Q}^{(k)}(a_1)\big|^4\big(\lambda^{(k)}_{\mathcal{Q}(a_1)}\big)^{2\theta^{(k)}_{\mathcal{Q}(a_1)}}\big(\Upsilon^{(k)}_{\mathcal{Q}(a_1)}\big)^2
+\mathbf{1}^{(k)}_{\{\mathcal{L}_{1,1}\}}\,\big|\mathcal{L}^{(k)}_{1,1}\big|^4\big(\lambda^{(k)}_{\mathcal{L}_{1,1}}\big)^{2\theta^{(k)}_{\mathcal{L}_{1,1}}}\big(\Upsilon^{(k)}_{\mathcal{L}_{1,1}}\big)^2\Bigg], \qquad (109)
\]

where $\Phi = N_{|L|-1} + N_{|L|-2} + \cdots + N_1 + 1$. By tuning the number of D2D rounds according to (31), and following a similar procedure as in Appendix B, we get:

\[
\begin{aligned}
\|\mathbf{e}^{(k)}\|^2 &\le \frac{\Phi M^2}{D^2}\Bigg[\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}}\sum_{a_2\in\mathcal{Q}^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in\mathcal{Q}^{(k)}(a_{|L|-2})} \mathbf{1}^{(k)}_{\{\mathcal{Q}(a_{|L|-1})\}}\,\frac{\psi}{\frac{\Phi M^2}{D^2} N_{|L|-1}\,|L|}
+\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}}\sum_{a_2\in\mathcal{Q}^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in\mathcal{Q}^{(k)}(a_{|L|-3})} \mathbf{1}^{(k)}_{\{\mathcal{Q}(a_{|L|-2})\}}\,\frac{\psi}{\frac{\Phi M^2}{D^2} N_{|L|-2}\,|L|}+\cdots
+\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}} \mathbf{1}^{(k)}_{\{\mathcal{Q}(a_1)\}}\,\frac{\psi}{\frac{\Phi M^2}{D^2} N_1\,|L|}+\mathbf{1}^{(k)}_{\{\mathcal{L}_{1,1}\}}\,\frac{\psi}{\frac{\Phi M^2}{D^2} N_0\,|L|}\Bigg] \\
&\le \frac{\Phi M^2}{D^2}\Bigg[\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}}\sum_{a_2\in\mathcal{Q}^{(k)}(a_1)}\cdots\sum_{a_{|L|-1}\in\mathcal{Q}^{(k)}(a_{|L|-2})} \frac{\psi}{\frac{\Phi M^2}{D^2} N_{|L|-1}\,|L|}
+\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}}\sum_{a_2\in\mathcal{Q}^{(k)}(a_1)}\cdots\sum_{a_{|L|-2}\in\mathcal{Q}^{(k)}(a_{|L|-3})} \frac{\psi}{\frac{\Phi M^2}{D^2} N_{|L|-2}\,|L|}+\cdots
+\sum_{a_1\in\mathcal{L}^{(k)}_{1,1}} \frac{\psi}{\frac{\Phi M^2}{D^2} N_1\,|L|}+\frac{\psi}{\frac{\Phi M^2}{D^2} N_0\,|L|}\Bigg] \\
&= \frac{\Phi M^2}{D^2}\Bigg[\underbrace{\frac{\psi}{\frac{\Phi M^2}{D^2}\,|L|}+\frac{\psi}{\frac{\Phi M^2}{D^2}\,|L|}+\cdots+\frac{\psi}{\frac{\Phi M^2}{D^2}\,|L|}}_{|L|\ \text{terms}}\Bigg].
\end{aligned} \qquad (110)
\]
Thus, we have:
\[
\|\mathbf{e}^{(k)}\|^2 \le \psi. \qquad (111)
\]
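As a quick numerical illustration (not part of the derivation), the final cancellation in (110)–(111) can be checked directly: each of the $|L|$ bracket terms equals $\psi/(\frac{\Phi M^2}{D^2}|L|)$, so the prefactor $\frac{\Phi M^2}{D^2}$ cancels and the sum telescopes to $\psi$. The constants below are toy values chosen only for illustration ($\Phi = 125+25+5+1$ matches the 625-device network of Appendix I).

```python
# Sanity check of the step (110) -> (111): the prefactor Phi*M^2/D^2 cancels
# against the denominators, leaving exactly psi. Toy/hypothetical constants.
Phi, M, D, psi, L = 156, 7850, 60000, 1e4, 4
pref = Phi * M ** 2 / D ** 2
bound = pref * sum(psi / (pref * L) for _ in range(L))
assert abs(bound - psi) < 1e-9 * psi
```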


APPENDIX I
DETAILS OF THE SIMULATION SETTINGS AND FURTHER SIMULATIONS

In this section, we first present some details regarding the simulation settings and parameter tuning, and then present a series of simulation results regarding the choice of different datasets and larger network sizes as compared to the main text. The exact set of hyperparameters used on each network can be found at our implementation repository: https://github.com/shams-sam/Federated2Fog.

A. Simulation Settings

1) Setup: All simulations are performed on a single machine with 64GB RAM and 8GB GPU memory, which emulates distributed learning through the PySyft framework; PySyft spins off virtually disjoint nodes with mutually exclusive model parameters and datasets, and works on top of the PyTorch machine learning library.

2) Classifiers: We consider two different classifiers: a regularized Support Vector Machine (SVM) and a fully-connected Neural Network (NN), each initialized with a copy of the global model before the learning process begins on each node participating in the learning process.

The regularized SVM is tuned to satisfy strong convexity with µ = 0.1. We also use the estimated value of η = 10 (similar values are observed in [18]). The NN classifier is a simple fully-connected network with a single hidden layer and no convolutional units. Softmax activation at the output layer gives the class logits, and the overall training optimizes the negative log-likelihood loss function with L2 regularization.

The input size for both models, SVM and NN, is 28 × 28 = 784, with output size 10. The number of parameters optimized by the models, M, is given by M = (784 + 1) × 10 = 7850.

3) Datasets and Data Distribution among the Nodes: We consider two datasets, MNIST and F-MNIST (Fashion MNIST)9, each of which contains 60000 training samples and 10000 testing samples. MNIST consists of handwritten digits 0–9, while F-MNIST consists of images associated with 10 classes in the clothing domain. Each dataset consists of 28 × 28 grayscale images.

The datasets are distributed over the nodes such that all nodes have approximately equal numbers of training samples. However, the training samples may be either i.i.d. or non-i.i.d. distributed. Under the i.i.d. distribution, each node participating in the learning process has samples from each class of the dataset, while under the non-i.i.d. distribution, each node has access to only one of the classes. These are the extreme ends of the possible splits of data among nodes in terms of class distribution, helping us evaluate the overall robustness as well as the differences in the characteristics of our technique under different settings.
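The two partitioning schemes above can be sketched as follows. This is a minimal illustration with a hypothetical `partition` helper (not the repository code): for i.i.d., indices are shuffled before being split into equal shares; for non-i.i.d., they are first sorted by label so each share holds a single class.

```python
import random

# Hypothetical helper sketching the i.i.d. / non-i.i.d. split described above.
def partition(labels, num_nodes, iid=True, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    if iid:
        rng.shuffle(idx)                   # classes end up mixed across nodes
    else:
        idx.sort(key=lambda i: labels[i])  # group indices class-by-class
    share = len(idx) // num_nodes          # approximately equal shares
    return [idx[n * share:(n + 1) * share] for n in range(num_nodes)]

# Toy labels: 100 samples, 10 classes, 10 samples per class.
labels = [i % 10 for i in range(100)]
shards = partition(labels, num_nodes=10, iid=False)
assert all(len({labels[i] for i in s}) == 1 for s in shards)  # one class/node
shards_iid = partition(labels, num_nodes=10, iid=True)
assert all(len(s) == 10 for s in shards_iid)                  # equal shares
```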

4) Network Formation: We consider two network configurations: (i) the network consists of 125 edge devices; (ii) the network consists of 625 edge devices. For the former case, we consider a fog network consisting of a main server and three sub-layers. To build our fog network, we start with the 125 worker nodes in the bottom-most layer (L3) and dedicated local datasets sampled as explained above. The worker nodes update the local models with a copy of the parameters from the latest global model at the start of each iteration. The worker nodes are then clustered in groups of 5 to communicate with one of the 25 aggregators in their upper layer (i.e., L2), such that there is a 1-to-1 mapping between the clusters and the aggregators. Similarly, the nodes in layer L2 are clustered and communicate with the 5 aggregators in layer L1, followed by clustering these 5 nodes and communicating with the main server.

For the latter case, we consider a fog network consisting of a main server and four sub-layers. To build our fog network, we start with the 625 worker nodes in the bottom-most layer (i.e., L4) and dedicated local datasets sampled as explained above. The worker nodes update the local models with a copy of the parameters from the latest global model at the start of each iteration. The worker nodes are then clustered in groups of 5 to communicate with one of the 125 aggregators in their upper layer (i.e., L3), such that there is a 1-to-1 mapping between the clusters and the aggregators. Similarly, the nodes in L3 are again clustered and communicate with the 25 aggregators in the upper layer (i.e., L2). This is followed by clustering these nodes in groups of 5 and communicating with the 5 aggregators in layer L1, which then communicate with the main server.
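The resulting tree shapes can be sketched with a small hypothetical helper: grouping `branch = 5` children per aggregator, layer by layer, gives widths 125 → 25 → 5 → 1 for the first configuration and 625 → 125 → 25 → 5 → 1 for the second (the final width of 1 is the main server).

```python
# Hypothetical sketch of the layer-by-layer clustering described above: each
# group of `branch` nodes reports to one aggregator in the layer above, until
# only the main server remains.
def layer_widths(num_workers, branch=5):
    widths = [num_workers]
    while widths[-1] > 1:
        assert widths[-1] % branch == 0    # clusters must divide evenly
        widths.append(widths[-1] // branch)
    return widths

# For node i in a layer, its parent aggregator in the layer above is i // branch.
parent = lambda i, branch=5: i // branch

assert layer_widths(125) == [125, 25, 5, 1]
assert layer_widths(625) == [625, 125, 25, 5, 1]
assert parent(124) == 24                   # last L3 worker reports to aggregator 24
```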

The connectivity among the nodes within a cluster is simulated using random geometric graphs with increasing connectivity as we traverse from the bottom-most layer to the main server. For the case with 125 edge devices, layer L3 has an average node degree around 2, followed by layer L2 with an average degree around 3 and layer L1 with an average degree around 4. For the case with 625 edge devices, layers L3 and L4 have an average node degree around 2, followed by layer L2 with an average degree around 3 and layer L1 with an average degree around 4. We use the NetworkX10 library of Python for generating the graphs. We adjust the radius parameter of the graph generator such that the average degree of the graph is within a tolerance region of 0.2 from the desired degree.
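The radius tuning above can be sketched in pure Python (the actual simulations use NetworkX's `random_geometric_graph`; the helpers below are hypothetical stand-ins): place points uniformly in the unit square, connect pairs closer than `radius`, and bisect on the radius until the average degree is within 0.2 of the target.

```python
import math, random

# Hypothetical pure-Python stand-in for networkx.random_geometric_graph:
# n points uniform in the unit square, an edge between pairs within `radius`.
def average_degree(n, radius, seed=0):
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    edges = sum(1 for i in range(n) for j in range(i + 1, n)
                if math.dist(pts[i], pts[j]) < radius)
    return 2 * edges / n                    # each edge adds degree to 2 nodes

# Bisection on the radius until the average degree is within `tol` of target,
# mirroring the 0.2-tolerance tuning described above.
def tune_radius(n, target, tol=0.2, lo=0.0, hi=1.5, iters=60, seed=0):
    for _ in range(iters):
        mid = (lo + hi) / 2
        deg = average_degree(n, mid, seed)
        if abs(deg - target) <= tol:
            return mid, deg
        lo, hi = (mid, hi) if deg < target else (lo, mid)
    return mid, deg

radius, deg = tune_radius(n=25, target=3.0)  # e.g. an L2 cluster, degree ~3
assert abs(deg - 3.0) <= 0.2
```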

For the D2D communications, we consider the common choice of the weights [46] that gives
\[
\mathbf{z}^{(t+1)}_n = \mathbf{z}^{(t)}_n + d^{(k-1)}_C \sum_{m\in\varsigma^{(k-1)}(n)} \big(\mathbf{z}^{(t)}_m - \mathbf{z}^{(t)}_n\big), \quad 0 < d^{(k-1)}_C < 1/D^{(k-1)}_C,
\]
for any node $n$ in an arbitrary cluster $C$, where $\varsigma^{(k-1)}(n)$ denotes the set of D2D neighbors of node $n$ and $D^{(k-1)}_C$ is the maximum degree of the nodes in $G^{(k-1)}_C$. Using this implementation, the nodes inside LUT cluster $C$ only need to have knowledge of the parameter $d^{(k-1)}_C$, which is broadcast by the respective parent node.
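The consensus update quoted above can be sketched on a toy cluster. This is a minimal illustration with hypothetical values (a 5-node ring with scalar parameters): each node mixes its value with its neighbors' using a step size in (0, 1/D_C), and the cluster converges to the average while the sum (and hence the mean) is preserved at every round.

```python
# Toy 5-node ring cluster running the D2D update
#   z_n <- z_n + d_C * sum_{m in neighbors(n)} (z_m - z_n),  0 < d_C < 1/D_C.
neighbors = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
z = [0.0, 1.0, 2.0, 3.0, 4.0]                  # hypothetical initial parameters
D_C = max(len(v) for v in neighbors.values())  # maximum degree (= 2 on a ring)
d_C = 0.9 / D_C                                # any value in (0, 1/D_C) works

for _ in range(200):                           # theta rounds of consensus
    z = [z[n] + d_C * sum(z[m] - z[n] for m in neighbors[n]) for n in neighbors]

mean = sum([0.0, 1.0, 2.0, 3.0, 4.0]) / 5      # symmetric weights keep the mean
assert abs(sum(z) / 5 - mean) < 1e-9
assert all(abs(zn - mean) < 1e-6 for zn in z)  # all nodes reach the average
```

Symmetric mixing weights make the update matrix doubly stochastic, which is why the cluster average is invariant and each node's value contracts toward it at a rate governed by the spectral radius of the mixing matrix.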

9 https://github.com/zalandoresearch/fashion-mnist
10 https://networkx.github.io


Fig. 16: Performance comparison between baseline EUT and MH-MT when a fixed number of D2D rounds θ is used at every cluster of the network, for non-i.i.d. As the number of D2D rounds increases, MH-MT performs more similarly to the EUT baseline and the learning is more stable. (MNIST, 625 Edge Devices)

Fig. 17: Performance comparison between baseline EUT, and MH-MT with and without (w/o) decreasing the gradient descent step size. Decreasing the step size can provide convergence to the optimal solution in cases where a fixed step size is not capable, but also has a slower convergence speed. (MNIST, 625 Edge Devices)

Fig. 18: Performance comparison between baseline EUT and MH-MT for i.i.d when a finite optimality gap is tolerable. σ_j at L_j is fixed as σ_j = σ′ max_i Υ^(1)_{L_j,i}. Tapering of D2D rounds through time and space (layers) can be observed. (MNIST, 625 Edge Devices)

Fig. 19: Performance comparison between baseline EUT and MH-MT for non-i.i.d when a finite optimality gap is tolerable. σ_j is set as in Fig. 18. Smaller loss and higher accuracy are achieved with smaller σ′, implying more rounds of consensus. (MNIST, 625 Edge Devices)

B. Further Simulation Results

This section presents the plots from complementary experiments from Section IV. In the following, we explain the relationship between the figures presented in this appendix and the simulation results presented in the main text.

Fig. 5 from the main text is repeated in Fig. 16 for the MNIST dataset distributed over 625 edge devices, Fig. 27 for the FMNIST dataset distributed over 125 edge devices, and Fig. 38 for the FMNIST dataset distributed over 625 edge devices.

Fig. 6 from the main text is repeated in Fig. 17 for the MNIST dataset distributed over 625 edge devices, Fig. 28 for the FMNIST dataset distributed over 125 edge devices, and Fig. 39 for the FMNIST dataset distributed over 625 edge devices.

Fig. 7 from the main text is repeated in Fig. 18 for the MNIST dataset distributed over 625 edge devices, Fig. 29 for the FMNIST dataset distributed over 125 edge devices, and Fig. 40 for the FMNIST dataset distributed over 625 edge devices.

Fig. 8 from the main text is repeated in Fig. 19 for the MNIST dataset distributed over 625 edge devices, Fig. 30 for the FMNIST dataset distributed over 125 edge devices, and Fig. 41 for the FMNIST dataset distributed over 625 edge devices.

Fig. 9 from the main text is repeated in Fig. 20 for the MNIST dataset distributed over 625 edge devices, Fig. 31 for the FMNIST dataset distributed over 125 edge devices, and Fig. 42 for the FMNIST dataset distributed over 625 edge devices.

Fig. 10 from the main text is repeated in Fig. 21 for the MNIST dataset distributed over 625 edge devices, Fig. 32 for the FMNIST dataset distributed over 125 edge devices, and Fig. 43 for the FMNIST dataset distributed over 625 edge devices.

Fig. 11 from the main text is repeated in Fig. 22 for the MNIST dataset distributed over 625 edge devices, Fig. 33 for the FMNIST dataset distributed over 125 edge devices, and Fig. 44 for the FMNIST dataset distributed over 625 edge devices.

Fig. 12 from the main text is repeated in Fig. 23 for the MNIST dataset distributed over 625 edge devices, Fig. 34 for the FMNIST dataset distributed over 125 edge devices, and Fig. 45 for the FMNIST dataset distributed over 625 edge devices.

Fig. 13 from the main text is repeated in Fig. 24 for the MNIST dataset distributed over 625 edge devices, Fig. 35 for the FMNIST dataset distributed over 125 edge devices, and Fig. 46 for the FMNIST dataset distributed over 625 edge devices.

Fig. 14 from the main text is repeated in Fig. 25 for the MNIST dataset distributed over 625 edge devices, Fig. 36 for the FMNIST dataset distributed over 125 edge devices, and Fig. 47 for the FMNIST dataset distributed over 625 edge devices.

Fig. 15 from the main text is repeated in Fig. 26 for the MNIST dataset distributed over 625 edge devices, Fig. 37 for the FMNIST dataset distributed over 125 edge devices, and Fig. 48 for the FMNIST dataset distributed over 625 edge devices.


Fig. 20: Performance comparison between baseline EUT and MH-MT for i.i.d. when linear convergence to the optimal is desired. The value of δ is set at δ = δ′µ/η. Boosting of the D2D rounds through time can be observed. Also, tapering through space can be observed by comparing the D2D rounds in the bottom subplots. (MNIST, 625 Edge Devices)

Fig. 21: Performance comparison between baseline EUT and MH-MT for non-i.i.d when linear convergence to the optimal is desired. The value of δ is set as in Fig. 20. Smaller values of loss and higher accuracy are both associated with larger values of δ, which result in lower error tolerance and more rounds of consensus. (MNIST, 625 Edge Devices)

Fig. 22: Performance comparison between baseline EUT and MH-MT under i.i.d using NNs with different values of ψ. Tapering of the D2D rounds through time can be observed. Also, tapering through space can be observed by comparing the D2D rounds in the bottom subplots. (MNIST, 625 Edge Devices)

Fig. 23: Performance comparison between baseline EUT and MH-MT under non-i.i.d. using NNs with different values of ψ. Lower loss and higher accuracy are associated with smaller values of ψ, which result in lower error tolerance and larger values of D2D rounds over time. (MNIST, 625 Edge Devices)

Fig. 24: Comparison between the theoretical and simulation results regarding the number of global iterations to achieve an accuracy of ε′(F(w^(0)) − F(w^*)) for different ε′. Convergence in practice is faster than the derived upper bound. (MNIST, 625 Edge Devices)

Fig. 25: Comparison of accumulated energy consumption between EUT and MH-MT over scenario 1: σ′ = 0.1 from Fig. 18, scenario 2: σ′ = 0.1 from Fig. 19, scenario 3: ψ = 10^4 from Fig. 22, and scenario 4: ψ = 10^4 from Fig. 23. (MNIST, 625 Edge Devices)

Fig. 26: Comparison of parameters transferred among layers in EUT vs MH-MT over scenario 1: σ′ = 0.1 from Fig. 18, scenario 2: σ′ = 0.1 from Fig. 19, scenario 3: ψ = 10^4 from Fig. 22, and scenario 4: ψ = 10^4 from Fig. 23. (MNIST, 625 Edge Devices)


Fig. 27: Performance comparison between baseline EUT and MH-MT when a fixed number of D2D rounds θ is used at every cluster of the network, for non-i.i.d. As the number of D2D rounds increases, MH-MT performs more similarly to the EUT baseline and the learning is more stable. (FMNIST, 125 Edge Devices)

Fig. 28: Performance comparison between baseline EUT, and MH-MT with and without (w/o) decreasing the gradient descent step size. Decreasing the step size can provide convergence to the optimal solution in cases where a fixed step size is not capable, but also has a slower convergence speed. (FMNIST, 125 Edge Devices)

Fig. 29: Performance comparison between baseline EUT and MH-MT for i.i.d when a finite optimality gap is tolerable. σ_j at L_j is fixed as σ_j = σ′ max_i Υ^(1)_{L_j,i}. Tapering of D2D rounds through time and space (layers) can be observed. (FMNIST, 125 Edge Devices)

Fig. 30: Performance comparison between baseline EUT and MH-MT for non-i.i.d when a finite optimality gap is tolerable. σ_j is set as in Fig. 29. Smaller loss and higher accuracy are achieved with smaller σ′, implying more rounds of consensus. (FMNIST, 125 Edge Devices)

Fig. 31: Performance comparison between baseline EUT and MH-MT for i.i.d. when linear convergence to the optimal is desired. The value of δ is set at δ = δ′µ/η. Boosting of the D2D rounds through time can be observed. Also, tapering through space can be observed by comparing the D2D rounds in the bottom subplots. (FMNIST, 125 Edge Devices)

Fig. 32: Performance comparison between baseline EUT and MH-MT for non-i.i.d when linear convergence to the optimal is desired. The value of δ is set as in Fig. 31. Smaller values of loss and higher accuracy are both associated with larger values of δ, which result in lower error tolerance and more rounds of consensus. (FMNIST, 125 Edge Devices)


Fig. 33: Performance comparison between baseline EUT and MH-MT under i.i.d using NNs with different values of ψ. Tapering of the D2D rounds through time can be observed. Also, tapering through space can be observed by comparing the D2D rounds in the bottom subplots. (FMNIST, 125 Edge Devices)

Fig. 34: Performance comparison between baseline EUT and MH-MT under non-i.i.d. using NNs with different values of ψ. Lower loss and higher accuracy are associated with smaller values of ψ, which result in lower error tolerance and larger values of D2D rounds over time. (FMNIST, 125 Edge Devices)

Fig. 35: Comparison between the theoretical and simulation results regarding the number of global iterations to achieve an accuracy of ε′(F(w^(0)) − F(w^*)) for different ε′. Convergence in practice is faster than the derived upper bound. (FMNIST, 125 Edge Devices)

Fig. 36: Comparison of accumulated energy consumption between EUT and MH-MT over scenario 1: σ′ = 0.1 from Fig. 29, scenario 2: σ′ = 0.1 from Fig. 30, scenario 3: ψ = 10^4 from Fig. 33, and scenario 4: ψ = 10^4 from Fig. 34. (FMNIST, 125 Edge Devices)

Fig. 37: Comparison of parameters transferred among layers in EUT vs MH-MT over scenario 1: σ′ = 0.1 from Fig. 29, scenario 2: σ′ = 0.1 from Fig. 30, scenario 3: ψ = 10^4 from Fig. 33, and scenario 4: ψ = 10^4 from Fig. 34. (FMNIST, 125 Edge Devices)

Fig. 38: Performance comparison between baseline EUT and MH-MT when a fixed number of D2D rounds θ is used at every cluster of the network, for non-i.i.d. As the number of D2D rounds increases, MH-MT performs more similarly to the EUT baseline and the learning is more stable. (FMNIST, 625 Edge Devices)

Fig. 39: Performance comparison between baseline EUT, and MH-MT with and without (w/o) decreasing the gradient descent step size. Decreasing the step size can provide convergence to the optimal solution in cases where a fixed step size is not capable, but also has a slower convergence speed. (FMNIST, 625 Edge Devices)


Fig. 40: Performance comparison between baseline EUT and MH-MT for i.i.d when a finite optimality gap is tolerable. σ_j at L_j is fixed as σ_j = σ′ max_i Υ^(1)_{L_j,i}. Tapering of D2D rounds through time and space (layers) can be observed. (FMNIST, 625 Edge Devices)

Fig. 41: Performance comparison between baseline EUT and MH-MT for non-i.i.d when a finite optimality gap is tolerable. σ_j is set as in Fig. 40. Smaller loss and higher accuracy are achieved with smaller σ′, implying more rounds of consensus. (FMNIST, 625 Edge Devices)

Fig. 42: Performance comparison between baseline EUT and MH-MT for i.i.d. when linear convergence to the optimal is desired. The value of δ is set at δ = δ′µ/η. Boosting of the D2D rounds through time can be observed. Also, tapering through space can be observed by comparing the D2D rounds in the bottom subplots. (FMNIST, 625 Edge Devices)

Fig. 43: Performance comparison between baseline EUT and MH-MT for non-i.i.d when linear convergence to the optimal is desired. The value of δ is set as in Fig. 42. Smaller values of loss and higher accuracy are both associated with larger values of δ, which result in lower error tolerance and more rounds of consensus. (FMNIST, 625 Edge Devices)

Fig. 44: Performance comparison between baseline EUT and MH-MT under i.i.d using NNs with different values of ψ. Tapering of the D2D rounds through time can be observed. Also, tapering through space can be observed by comparing the D2D rounds in the bottom subplots. (FMNIST, 625 Edge Devices)

Fig. 45: Performance comparison between baseline EUT and MH-MT under non-i.i.d. using NNs with different values of ψ. Lower loss and higher accuracy are associated with smaller values of ψ, which result in lower error tolerance and larger values of D2D rounds over time. (FMNIST, 625 Edge Devices)


Fig. 46: Comparison between the theoretical and simulation results regarding the number of global iterations to achieve an accuracy of ε′(F(w^(0)) − F(w^*)) for different ε′. Convergence in practice is faster than the derived upper bound. (FMNIST, 625 Edge Devices)

Fig. 47: Comparison of accumulated energy consumption between EUT and MH-MT over scenario 1: σ′ = 0.1 from Fig. 40, scenario 2: σ′ = 0.1 from Fig. 41, scenario 3: ψ = 10^4 from Fig. 44, and scenario 4: ψ = 10^4 from Fig. 45. (FMNIST, 625 Edge Devices)

Fig. 48: Comparison of parameters transferred among layers in EUT vs MH-MT over scenario 1: σ′ = 0.1 from Fig. 40, scenario 2: σ′ = 0.1 from Fig. 41, scenario 3: ψ = 10^4 from Fig. 44, and scenario 4: ψ = 10^4 from Fig. 45. (FMNIST, 625 Edge Devices)

Fig. 48: Comparison of parameters transferredamong layers in EUT vs MH-MT over scenario 1:σ′ = 0.1 from Fig. 40, scenario 2: σ′ = 0.1 fromFig. 41, scenario 3: ψ = 104 from Fig. 44, andscenario 4: ψ = 104 from Fig. 45. (FMNIST, 625Edge Devices)