
arXiv:2008.01558v2 [cs.LG] 8 Jun 2021


Federated Learning with Sparsification-Amplified Privacy and Adaptive Optimization

Rui Hu, Yanmin Gong∗ and Yuanxiong Guo
The University of Texas at San Antonio

{rui.hu, yanmin.gong, yuanxiong.guo}@utsa.edu

Abstract

Federated learning (FL) enables distributed agents to collaboratively learn a centralized model without sharing their raw data with each other. However, data locality does not provide sufficient privacy protection, and it is desirable to equip FL with a rigorous differential privacy (DP) guarantee. Existing DP mechanisms introduce random noise with magnitude proportional to the model size, which can be quite large in deep neural networks. In this paper, we propose a new FL framework with sparsification-amplified privacy. Our approach integrates random sparsification with gradient perturbation on each agent to amplify the privacy guarantee. Since sparsification increases the number of communication rounds required to achieve a given target accuracy, which is unfavorable for the DP guarantee, we further introduce acceleration techniques to help reduce the privacy cost. We rigorously analyze the convergence of our approach and use Rényi DP to tightly account for the end-to-end DP guarantee. Extensive experiments on benchmark datasets validate that our approach outperforms previous differentially private FL approaches in both privacy guarantee and communication efficiency.

1 Introduction

Federated learning (FL) is a new distributed learning paradigm that enables multiple agents to collaboratively learn a shared model under the orchestration of the cloud without sharing their local data [McMahan et al., 2017]. By keeping data local, FL has advantages in privacy and communication efficiency over the traditional centralized learning paradigm. However, recent inference attacks [Fredrikson et al., 2015; Shokri et al., 2017] show that the local model updates shared between agents can also lead to privacy leakage, and it is desirable to protect the shared local model updates with a rigorous privacy guarantee.

To address this issue, several privacy-preserving frameworks have been proposed, among which differential privacy (DP) [Dwork and Roth, 2014] has become the de-facto standard due to its rigorous privacy guarantee and effectiveness in data analysis tasks [Abadi et al., 2016; Hu et al., 2020; Huang et al., 2019; Guo and Gong, 2018; Gong et al., 2016]. General DP mechanisms, such as the Gaussian or Laplacian mechanism, rely on injecting carefully calibrated noise directly into the output of an algorithm. This poses new challenges to achieving DP in FL because the added noise is proportional to the model size, which can be very large for modern deep neural networks (e.g., millions of model parameters), resulting in significantly degraded model accuracy. Under the local DP setting, where the cloud is not fully trusted, the challenges become more prominent because all the local updates shared with the cloud need to be protected.

∗ Corresponding author

Existing works on differentially private machine learning either consider centralized DP [Abadi et al., 2016], or rely on costly techniques such as secure multi-party computation [Truex et al., 2019] and shuffling via anonymous channels [Liu et al., 2020; Erlingsson et al., 2019] to remove the requirement of a trusted cloud and improve the model accuracy in local DP. How to efficiently strike a better balance between model accuracy and privacy protection in FL remains largely unknown.

In this paper, we propose a novel differentially private FL scheme, called Fed-SPA, to provide a strong privacy guarantee in the local DP setting while maintaining high model accuracy. In light of the observation that local model updates in FL are largely sparse, we design a sparsification-coded DP mechanism that integrates gradient perturbation with random sparsification to amplify the privacy guarantee with little sacrifice in model accuracy. Random sparsification transforms a large vector into a sparse one by keeping only a random subset of coordinates and setting the other coordinates to zero. As we will show in this paper, random sparsification not only introduces randomness to the scheme, but also reduces the sensitivity of the shared model updates with respect to raw data, thus resulting in a smaller privacy loss at every communication round. Furthermore, as sparsification can slow down the convergence of learning algorithms and increase the total number of communication rounds, we propose to further reduce the end-to-end privacy loss by using convergence acceleration techniques to offset the negative impact of sparsification. We provide theoretical analysis to demonstrate the convergence of our scheme and its rigorous privacy guarantee.

The main contributions of this paper are summarized below.

• We propose to use sparsification for privacy amplification in FL while also improving communication efficiency. Previous works that consider both communication efficiency and DP treat them as two separate goals and solve them in an uncoordinated manner. Unlike previous approaches, we focus on the interplay of those two goals and aim to kill two birds with one stone, i.e., use sparsification as a tool to improve DP and achieve communication efficiency at the same time. We theoretically analyze the impact of sparsification on the utility-privacy trade-off and design a sparsification-coded DP mechanism for FL that provides a stronger privacy guarantee with the same amount of random noise.

• To further improve the utility-privacy trade-off, we integrate the sparsification-coded DP mechanism with convergence acceleration techniques, which can reduce the number of required communication rounds and ensure faster convergence. Specifically, we adapt an acceleration strategy similar to that of the Adam optimizer to the FL setting and design an adaptive aggregation strategy on the cloud to reduce the number of communication rounds. The resulting scheme, called Fed-SPA, provides a better utility-privacy trade-off than state-of-the-art differentially private FL methods.

• We empirically evaluate our scheme on several benchmark datasets. The experimental results show that Fed-SPA significantly boosts the model accuracy and at the same time saves more than 80% of the bandwidth cost compared to state-of-the-art approaches under the same DP guarantee.

It is worth noting that our scheme aims to improve the utility-privacy trade-off while achieving communication efficiency. This distinguishes our paper from previous studies on communication-efficient and differentially private distributed learning, which focus only on ensuring privacy protection and achieving communication efficiency at the same time. For example, cpSGD [Agarwal et al., 2018] is a modified distributed SGD scheme that is private and communication-efficient via gradient quantization and the binomial mechanism. However, since quantization, unlike sparsification, does not provide any privacy amplification effect, the utility-privacy trade-off is not improved in that approach.

2 Preliminaries

DP is a rigorous notion of privacy that has become the de-facto standard for measuring privacy risk. In the context of FL, DP ensures that the exchanged model updates are nearly the same regardless of whether any single data sample is used. In this paper, we consider a relaxed DP definition called Rényi differential privacy (RDP), which is strictly stronger than $(\varepsilon, \delta)$-DP for $\delta > 0$ and allows tighter composition analysis.

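Definition 1 ($(\alpha, \rho)$-RDP). Given a real number $\alpha \in (1, +\infty)$ and privacy parameter $\rho \ge 0$, a randomized mechanism $\mathcal{M}$ satisfies $(\alpha, \rho)$-RDP if for any two neighboring datasets $D, D'$ that differ in one record, the Rényi $\alpha$-divergence between $\mathcal{M}(D)$ and $\mathcal{M}(D')$ satisfies

$$D_\alpha[\mathcal{M}(D) \,\|\, \mathcal{M}(D')] := \frac{1}{\alpha - 1} \log \mathbb{E}\left[\left(\frac{\mathcal{M}(D)}{\mathcal{M}(D')}\right)^{\alpha}\right] \le \rho,$$

where the expectation is taken over the output of $\mathcal{M}(D')$.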

Lemma 1 (RDP Composition [Mironov, 2017]). If $\mathcal{M}_1$ satisfies $(\alpha, \rho_1)$-RDP and $\mathcal{M}_2$ satisfies $(\alpha, \rho_2)$-RDP, then their composition $\mathcal{M}_1 \circ \mathcal{M}_2$ satisfies $(\alpha, \rho_1 + \rho_2)$-RDP.

Lemma 2 (Gaussian Mechanism [Mironov, 2017]). Let $h : \mathcal{D} \to \mathbb{R}^d$ be a vector-valued function over datasets. The Gaussian mechanism $\mathcal{M} = h(D) + b$ with $b \sim \mathcal{N}(0, \sigma^2 I_d)$ satisfies $(\alpha, \alpha\phi^2(h)/(2\sigma^2))$-RDP, where $\phi(h)$ is the L2 sensitivity of $h$ defined by $\phi(h) = \sup_{D, D'} \|h(D) - h(D')\|_2$, with $D, D'$ being two neighboring datasets in $\mathcal{D}$.
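To make Lemmas 1 and 2 concrete, the following minimal sketch evaluates the RDP parameter of a Gaussian mechanism, composes it over several releases at a fixed order, and converts the result to an $(\varepsilon, \delta)$ guarantee. The conversion $\varepsilon = \rho + \log(1/\delta)/(\alpha - 1)$ is the standard one from [Mironov, 2017]; the function names and the numeric values are illustrative assumptions, not settings from this paper.

```python
import math


def gaussian_rdp(alpha, sensitivity, sigma):
    """RDP parameter rho of the Gaussian mechanism at order alpha (Lemma 2)."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)


def compose_rdp(rhos):
    """Lemma 1: at a fixed order alpha, the RDP parameters simply add up."""
    return sum(rhos)


def rdp_to_dp(alpha, rho, delta):
    """Standard conversion from (alpha, rho)-RDP to (eps, delta)-DP."""
    return rho + math.log(1.0 / delta) / (alpha - 1.0)


# Illustrative values only (assumptions, not from the paper).
alpha, sigma, sensitivity, steps, delta = 8.0, 4.0, 1.0, 10, 1e-3
rho_step = gaussian_rdp(alpha, sensitivity, sigma)
rho_total = compose_rdp([rho_step] * steps)
eps = rdp_to_dp(alpha, rho_total, delta)
print(f"rho per step = {rho_step}, total rho = {rho_total}, eps = {eps:.4f}")
```

Because composition is additive at a fixed order, the total privacy cost grows linearly in the number of releases, which is why reducing the number of iterations matters later in the paper.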

3 Fed-SPA: Federated Learning with Sparsification-Amplified Privacy and Adaptive Optimization

Notation. We use $[n]$ to denote the set of integers $\{1, 2, \dots, n\}$ for any positive integer $n$, and $[\cdot]_j$ to denote the $j$-th coordinate of a vector. Let $\|\cdot\|$ be the $\ell_2$ vector norm.

3.1 Problem Formulation

A typical FL system consists of $n$ agents and a central server (e.g., the cloud). Each agent $i \in [n]$ has a local dataset with $m$ data samples, and all agents collaboratively train a global model $\theta$ on the collection of their local datasets under the orchestration of the central server. The agents in FL aim to find the optimal global model $\theta$ by solving the following empirical risk minimization problem while keeping their data local:

$$\min_{\theta \in \mathbb{R}^d} f(\theta) := \frac{1}{n} \sum_{i=1}^{n} f_i(\theta), \tag{1}$$

where $f_i(\theta) = \mathbb{E}_{z \sim \mathcal{D}_i}[l_i(\theta; z)]$ represents the loss function of the $i$-th agent (possibly non-convex), $\mathcal{D}_i$ is the data distribution of the $i$-th agent, and $z$ represents a data sample drawn from $\mathcal{D}_i$. For $i \ne j$, the data distributions $\mathcal{D}_i$ and $\mathcal{D}_j$ may be very different.

Threat Model. Before elaborating on the proposed solution, we first define the threat model considered in this paper. The adversary can be the “honest-but-curious” aggregation server or agents in the system. The aggregation server honestly follows the designed training protocol but is curious about agents’ private data and may infer it from the shared messages. Furthermore, some agents can collude with the aggregation server or with each other to infer private information about a specific victim agent. The adversary can also be a passive outside attacker. Such attackers can eavesdrop on all messages shared during the execution of the training protocol but will not actively inject false messages or interrupt message transmissions.

3.2 Classic FL Algorithm: Federated Averaging

As the most widely used algorithm in the FL setting, Federated Averaging (FedAvg) [McMahan et al., 2017] solves (1) by selecting and distributing the current global model to a subset of agents, running multiple steps of SGD in parallel on the selected agents, and then aggregating the model updates from those agents to improve the global model iteratively. Specifically, FedAvg involves $T$ communication rounds, and each round consists of four stages. First, at the beginning of round $t \in \{0, \dots, T-1\}$, the server selects a subset of agents $\mathcal{W} \subseteq [n]$ to participate and sends them the latest global model $\theta^t$. Second, each agent $i \in \mathcal{W}$ initializes its local model $\theta_i^{t,0}$ to be the global model $\theta^t$ and then performs $\tau$ iterations of SGD on its local dataset as follows:

$$\theta_i^{t,s+1} = \theta_i^{t,s} - \eta_l g_i^{t,s}, \quad s = 0, 1, \dots, \tau - 1, \tag{2}$$

where $\eta_l$ is the local learning rate. Here, $g_i^{t,s} := (1/B) \sum_{z \in \xi_i^{t,s}} \nabla l(\theta_i^{t,s}, z)$ represents the stochastic gradient computed on a mini-batch $\xi_i^{t,s}$ of $B$ samples, which is an unbiased estimate of $\nabla f_i(\theta_i^{t,s})$. Third, each agent $i \in \mathcal{W}$ uploads the final local model update $\theta_i^{t,\tau} - \theta^t$ to the server. Fourth, the server aggregates the local model updates from all participating agents and improves the global model as

$$\theta^{t+1} = \theta^t + \frac{1}{|\mathcal{W}|} \sum_{i \in \mathcal{W}} (\theta_i^{t,\tau} - \theta^t). \tag{3}$$

The same procedure repeats for the next round.
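The four stages of a FedAvg round can be sketched in a few lines. The example below is a toy illustration, not the authors' implementation: it assumes a hypothetical quadratic loss $l(\theta; z) = \frac{1}{2}\|\theta - z\|^2$ (so the per-sample gradient is $\theta - z$), synthetic two-dimensional data, and made-up hyperparameters.

```python
import random


def local_sgd(theta, data, tau, lr, batch_size, rng):
    """Run tau mini-batch SGD steps, as in (2), on the toy loss 0.5 * ||theta - z||^2."""
    theta = list(theta)
    for _ in range(tau):
        batch = rng.sample(data, batch_size)
        # Mini-batch gradient: average of (theta - z) over the batch.
        grad = [sum(theta[j] - z[j] for z in batch) / batch_size
                for j in range(len(theta))]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def fedavg_round(theta, datasets, num_selected, tau, lr, batch_size, rng):
    """One round: select agents, run local SGD, average the updates as in (3)."""
    selected = rng.sample(range(len(datasets)), num_selected)
    updates = []
    for i in selected:
        local = local_sgd(theta, datasets[i], tau, lr, batch_size, rng)
        updates.append([a - t for a, t in zip(local, theta)])
    avg = [sum(u[j] for u in updates) / len(updates) for j in range(len(theta))]
    return [t + a for t, a in zip(theta, avg)]


rng = random.Random(0)
# Hypothetical data: 5 agents, 20 two-dimensional samples each, centred near (1, -2).
datasets = [[(1 + rng.gauss(0, 0.1), -2 + rng.gauss(0, 0.1)) for _ in range(20)]
            for _ in range(5)]
theta = [0.0, 0.0]
for _ in range(30):
    theta = fedavg_round(theta, datasets, num_selected=3, tau=5, lr=0.1,
                         batch_size=4, rng=rng)
print(theta)  # should end up close to the data centre (1, -2)
```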

Privacy and Communication Drawbacks. Although FedAvg avoids direct information leakage by keeping data local, the intermediate updates exchanged during the collaboration process, such as $\theta_i^{t,\tau} - \theta^t$ and $\theta^t$, can still leak private information about the local data, as demonstrated by recent attacks such as model inversion attacks [Fredrikson et al., 2015] and membership inference attacks [Shokri et al., 2017]. Furthermore, in FedAvg, agents need to repeatedly upload local model updates (i.e., $\theta_i^{t,\tau} - \theta^t$) of large size (e.g., millions of model parameters for modern deep neural network models) to the server and download the newly updated global model (i.e., $\theta^t$) from the server in order to learn an accurate global model (e.g., about 1000 rounds for a CNN on the MNIST dataset or about 4000 rounds for an LSTM on the Shakespeare dataset to reach 99% accuracy [McMahan et al., 2017]). Since the privacy loss is proportional to the model dimension and the number of communication rounds, the amount of DP noise needed to provide a strong local DP guarantee can be very large, which degrades the model accuracy heavily. Besides, as the bandwidth between the server and agents can be rather limited in practice (e.g., wireless connections between the cloud and smartphones), especially for uplink transmissions, the overall communication cost can be very high. These drawbacks motivate us to develop a new privacy-preserving and communication-efficient FL scheme.

3.3 Proposed Fed-SPA Algorithm

In this subsection, we present Fed-SPA, our proposed FL scheme, whose goal is to improve user privacy protection and communication efficiency while maintaining high model accuracy. To ensure easy integration into existing packages and systems, Fed-SPA follows the same overall structure as FedAvg but differs in two key aspects: 1) local models are updated at each agent using a variant of SGD in which the gradients are perturbed by our proposed sparsification-coded DP mechanism; and 2) the global model is updated adaptively at the server, instead of by simple averaging, to accelerate convergence. The entire process of Fed-SPA is summarized in Algorithm 1.

User-Side Sparsification-Coded DP Mechanism. To address the privacy and communication aspects simultaneously, we design a sparsification-coded DP mechanism, which integrates the Gaussian mechanism and sparsification to reduce both the privacy loss of each local iteration and the size of the local model updates transmitted at each communication round. Specifically, we use the $\mathrm{rand}_k$ sparsifier, which reduces the message size by a factor of $d/k$ and is defined as follows:

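Definition 2 ($\mathrm{rand}_k$ Sparsifier). For parameter $k \in [d]$, the operator $\mathrm{rand}_k : \mathbb{R}^d \times \Omega_k \to \mathbb{R}^d$ is defined for a vector $x \in \mathbb{R}^d$ as

$$[\mathrm{rand}_k(x, \omega)]_j := \begin{cases} [x]_j, & \text{if } j \in \omega, \\ 0, & \text{otherwise,} \end{cases} \tag{4}$$

where $\Omega_k = \binom{[d]}{k}$ denotes the set of all $k$-element subsets of $[d]$. We omit the second argument whenever it is chosen uniformly at random, i.e., $\omega \sim_{\text{u.a.r.}} \Omega_k$.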
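As a small illustration of Definition 2, the sketch below applies $\mathrm{rand}_k$ to a short vector and builds the sparse message that would be transmitted. It is only a sketch under stated assumptions: indices are 0-based (the definition uses $[d] = \{1, \dots, d\}$), and the helper names and the (index, value) message encoding are illustrative choices, not part of the paper.

```python
import random


def rand_k(x, omega):
    """Keep the coordinates of x whose index is in omega; set the rest to zero."""
    return [xj if j in omega else 0.0 for j, xj in enumerate(x)]


def sample_active_set(d, k, rng):
    """Draw a k-element subset of {0, ..., d-1} uniformly at random."""
    return set(rng.sample(range(d), k))


rng = random.Random(0)
x = [0.5, -1.0, 2.0, 0.25, 3.0, -0.25, 1.5, 4.0]
d, k = len(x), 2
omega = sample_active_set(d, k, rng)
sparse = rand_k(x, omega)

# Only the k active (index, value) pairs need to be transmitted,
# instead of all d values.
message = [(j, x[j]) for j in sorted(omega)]
print(sparse, message)
```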

With the $\mathrm{rand}_k$ sparsifier, our sparsification-coded DP mechanism works as follows. At communication round $t$, each selected agent $i \in \mathcal{W}$ first generates its own sparsifier $\mathrm{rand}_k(\cdot)$ by randomly sampling a set of $k$ active coordinates $\omega_i^t$ before performing $\tau$ SGD updates. Then, at the $s$-th local iteration, it perturbs and sparsifies the stochastic gradient using Gaussian noise and the generated sparsifier. We use the same sparsifier (with active set $\omega_i^t$) across all local iterations of communication round $t$ on agent $i$, so that the transmitted model update $\theta_i^{t,\tau} - \theta^t$ is still a sparse vector with active coordinates $\omega_i^t$, preserving the communication efficiency benefit of sparsification. Letting $p = k/d$ denote the compression ratio, the update rule at each agent is

$$\theta_i^{t,s+1} = \theta_i^{t,s} - \eta_l S_i^t(g_i^{t,s} + b_i^{t,s}), \tag{5}$$

where $S_i^t(\cdot) := (1/p)\,\mathrm{rand}_k(\cdot)$ is a scaled variant of $\mathrm{rand}_k(\cdot)$, and $b_i^{t,s}$ is the noise sampled from the Gaussian distribution $\mathcal{N}(0, \sigma^2 I_d)$. We use the scaled sparsifier $S_i^t(\cdot)$ so that the sparsified noisy gradient is an unbiased estimate of the true gradient.
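The update rule (5) and the two properties claimed for it (a sparse transmitted update and an unbiased scaled sparsifier) can be checked with a short sketch. Everything here is an illustrative assumption: a hypothetical quadratic loss with gradient $\theta - c$, 0-based indices, and made-up values for $d$, $k$, $\tau$, $\sigma$, and the learning rate.

```python
import random


def scaled_sparsify(v, omega, p):
    """S(v) = (1/p) * rand_k(v, omega)."""
    return [vj / p if j in omega else 0.0 for j, vj in enumerate(v)]


def local_step(theta, grad, omega, p, sigma, lr, rng):
    """One step of update rule (5): theta - lr * S(grad + Gaussian noise)."""
    noisy = [g + rng.gauss(0.0, sigma) for g in grad]
    step = scaled_sparsify(noisy, omega, p)
    return [t - lr * s for t, s in zip(theta, step)]


rng = random.Random(1)
d, k, tau = 8, 2, 5
p = k / d
omega = set(rng.sample(range(d), k))  # drawn once per round, reused for all tau steps
target = [1.0] * d                    # hypothetical minimizer of the toy loss
theta0 = [0.0] * d
theta = list(theta0)
for _ in range(tau):
    grad = [t - c for t, c in zip(theta, target)]
    theta = local_step(theta, grad, omega, p, sigma=0.1, lr=0.05, rng=rng)
delta = [t - t0 for t, t0 in zip(theta, theta0)]  # the update to upload

# Empirical check that S is unbiased when omega is resampled uniformly:
g = [0.5, -1.0, 2.0, 0.0, 3.0, -0.25, 1.5, 4.0]
trials = 20000
avg = [0.0] * d
for _ in range(trials):
    om = set(rng.sample(range(d), k))
    avg = [a + s / trials for a, s in zip(avg, scaled_sparsify(g, om, p))]
print(delta)
print(avg)  # each entry should be close to the matching entry of g
```

Reusing one active set for the whole round is what keeps `delta` supported on `omega`; resampling at every local step would make the accumulated update dense.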

Server-Side Adaptive Update. The sparsification-coded DP mechanism inevitably slows down convergence because it increases the variance of the stochastic gradient used in each iteration, and the privacy loss of each agent increases proportionally with the number of iterations according to the composition property of DP in Lemma 1. To improve privacy, we speed up convergence by updating the model on the server in an adaptive manner similar to Adam [Kingma and Ba, 2014]. Adaptive optimizers like Adam can be modified by adding noise to their gradients to provide a better DP guarantee, in the sense of reducing the number of iterations needed to achieve a target model accuracy [Yu et al., 2018]. However, there are two main constraints unique to deploying an adaptive optimizer on each agent in the FL setting. First, it is often the case that each agent participates only once or a few times intermittently during the entire training process; hence, replacing SGD with an adaptive optimizer at each agent during the local update stage performs poorly due to stale historical information such as the momentum in Adam. Second, maintaining the historical information on resource-constrained agents (e.g., smartphones) is costly in terms of computation and storage.

Algorithm 1 The Fed-SPA Algorithm

Require: Initial model $\theta^0$, initial momentums $[v_{-1}]_j \ge \kappa^2, \forall j \in [d]$, $u_{-1} = 0_d$, noise magnitude $\sigma$, number of rounds $T$, number of local iterations $\tau$, compression ratio $p$, momentum parameters $\beta_1, \beta_2$, learning rates $\eta_l, \eta_g$, and batch size $B$.

1: for $t = 0$ to $T - 1$ do
2:   Randomly select a set of agents $\mathcal{W}$
3:   Broadcast $\theta^t$ to all agents in $\mathcal{W}$
4:   for each agent $i \in \mathcal{W}$ in parallel do
5:     Generate a new sparsifier $S_i^t(\cdot)$
6:     $\theta_i^{t,0} \leftarrow \theta^t$
7:     for $s = 0$ to $\tau - 1$ do
8:       Compute a stochastic gradient $g_i^{t,s}$ over a mini-batch $\xi_i^{t,s}$ of $B$ samples
9:       $\theta_i^{t,s+1} \leftarrow \theta_i^{t,s} - \eta_l S_i^t(g_i^{t,s} + b_i^{t,s})$ where $b_i^{t,s} \sim \mathcal{N}(0, \sigma^2 I_d)$
10:     end for
11:     $\Delta_i^t \leftarrow \theta_i^{t,\tau} - \theta^t$ and upload $\Delta_i^t$ to the server
12:   end for
13:   $u_t \leftarrow \beta_1 u_{t-1} + (1 - \beta_1) \sum_{i \in \mathcal{W}} \Delta_i^t / |\mathcal{W}|$
14:   $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) u_t^2$
15:   $\theta^{t+1} \leftarrow \theta^t + \eta_g u_t / (\sqrt{v_t} + \kappa)$
16: end for

To address the above issues, in Fed-SPA, the server, rather than the agents, carries out the adaptive update without any additional communication. The server maintains two momentum vectors $u, v \in \mathbb{R}^d$, which are updated at each round. Specifically, at round $t$, after $\tau$ local iterations, each agent $i \in \mathcal{W}$ uploads its model update $\Delta_i^t := \theta_i^{t,\tau} - \theta^t$ to the server, which improves the global model as follows:

$$\begin{aligned} u_t &= \beta_1 u_{t-1} + (1 - \beta_1) \sum_{i \in \mathcal{W}} \Delta_i^t / |\mathcal{W}|, \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) u_t^2, \\ \theta^{t+1} &= \theta^t + \eta_g u_t / (\sqrt{v_t} + \kappa), \end{aligned} \tag{6}$$

where $\beta_1, \beta_2 \in [0, 1)$ are momentum parameters, $\eta_g$ is the global learning rate, and $\kappa$ controls the degree of adaptivity. Note that the math operations in (6) are element-wise.
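The server-side update (6) can be sketched directly from its three element-wise equations. The function below is a minimal illustration; the function name and the two toy agent updates are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import math


def server_update(theta, u, v, deltas, beta1, beta2, lr_g, kappa):
    """One adaptive server step following (6); all operations are element-wise."""
    d = len(theta)
    # Average of the uploaded model updates over the participating agents.
    avg = [sum(dl[j] for dl in deltas) / len(deltas) for j in range(d)]
    u = [beta1 * uj + (1.0 - beta1) * aj for uj, aj in zip(u, avg)]
    v = [beta2 * vj + (1.0 - beta2) * uj ** 2 for vj, uj in zip(v, u)]
    theta = [tj + lr_g * uj / (math.sqrt(vj) + kappa)
             for tj, uj, vj in zip(theta, u, v)]
    return theta, u, v


kappa = 0.1
theta = [0.0, 0.0]
u = [0.0, 0.0]                 # u_{-1} = 0
v = [kappa ** 2, kappa ** 2]   # satisfies [v_{-1}]_j >= kappa^2
deltas = [[1.0, -2.0], [3.0, 0.0]]  # toy updates from two agents
theta, u, v = server_update(theta, u, v, deltas,
                            beta1=0.0, beta2=0.5, lr_g=1.0, kappa=kappa)
print(theta, u, v)
```

Dividing by $\sqrt{v_t} + \kappa$ normalizes each coordinate by a running estimate of its update magnitude, so coordinates with small but consistent updates still make progress; a larger $\kappa$ makes the step closer to plain averaging with momentum.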

4 Main Theoretical Results

In this section, we give the formal privacy guarantee and a rigorous convergence analysis of Fed-SPA. Before stating our results, we make the following assumptions:

Assumption 1 (Smoothness). The local objective function $f_i$ is $L$-smooth, i.e., for any $i \in [n]$ and $x, y \in \mathbb{R}^d$, we have $f_i(y) \le f_i(x) + \langle \nabla f_i(x), y - x \rangle + (L/2)\|y - x\|^2$.

Assumption 2 (Bounded Variance). Let $g_i$ be the stochastic gradient over the mini-batch sampled from the distribution $\mathcal{D}_i$. The function $f_i$ has a bounded local variance, i.e., $\mathbb{E}\|[g_i]_j - [\nabla f_i(x)]_j\|^2 \le \zeta_{l,j}^2$ for all $x \in \mathbb{R}^d$, $j \in [d]$ and $i \in [n]$. Moreover, the global variance is bounded, i.e., $(1/n) \sum_{i=1}^{n} \mathbb{E}\|[\nabla f_i(x)]_j - [\nabla f(x)]_j\|^2 \le \zeta_{g,j}^2$ for all $x \in \mathbb{R}^d$, $j \in [d]$ and $i \in [n]$. We also denote $\zeta_l^2 := \sum_{j=1}^{d} \zeta_{l,j}^2$ and $\zeta_g^2 := \sum_{j=1}^{d} \zeta_{g,j}^2$ for convenience.

Assumption 3 (Bounded Gradient). The loss function $l_i(x, z)$ has $G/\sqrt{d}$-bounded gradients, i.e., for any data sample $z$ from $\mathcal{D}_i$, we have $|[\nabla l_i(x, z)]_j| \le G/\sqrt{d}$ for all $x \in \mathbb{R}^d$, $j \in [d]$ and $i \in [n]$.

Assumption 1 is standard and implies that the global loss function $f$ is also $L$-smooth. Assumption 2 and Assumption 3 are fairly standard in the non-convex optimization literature [Reddi et al., 2020; Kingma and Ba, 2014]. Assumption 3 characterizes the sensitivity of each coordinate of the gradient $\nabla l(x, z)$ and implies $\mathbb{E}\|\nabla l(x, z)\|^2 \le G^2$; it can be enforced by the gradient clipping technique.
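One simple way to enforce the per-coordinate bound of Assumption 3 is to clip each coordinate of a per-sample gradient to $[-G/\sqrt{d}, G/\sqrt{d}]$. The sketch below shows this and checks the implied norm bound; the function name and the example values are assumptions, and the paper does not specify which clipping variant the authors used.

```python
import math


def clip_per_coordinate(grad, G):
    """Clip every coordinate of grad to [-G/sqrt(d), G/sqrt(d)]."""
    bound = G / math.sqrt(len(grad))
    return [max(-bound, min(bound, g)) for g in grad]


G = 2.0
grad = [3.0, -0.1, 0.2, -5.0]   # d = 4, so the per-coordinate bound is 1.0
clipped = clip_per_coordinate(grad, G)
norm = math.sqrt(sum(c * c for c in clipped))
print(clipped, norm)  # the L2 norm is at most G after clipping
```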

4.1 Privacy Analysis

In this subsection, we discuss how the sparsification in our sparsification-coded DP mechanism amplifies privacy and provide the end-to-end privacy guarantee of Fed-SPA.

In our sparsification-coded DP mechanism, the $\mathrm{rand}_k$ sparsifier does not provide any DP guarantee by itself, but it amplifies the privacy guarantee provided by the additive Gaussian noise. To analyze the end-to-end privacy, we first need to analyze the sensitivity of the sparsified gradient in Line 9 of Algorithm 1. With the $\mathrm{rand}_k$ sparsifier, the active coordinate set $\omega$ is chosen independently of the data and does not leak privacy; only the values of the active coordinates $\{[x]_j, j \in \omega\}$ contain private information and need to be protected. Therefore, in our sparsification-coded DP mechanism, only the values of the active coordinates of the gradient are actually perturbed by the Gaussian noise. More precisely, let $\omega_i^t$ denote the active coordinate set for participating agent $i$ at round $t$. The sparsified noisy gradient at each local iteration can then be represented as

$$S_i^t(g_i^{t,s} + b_i^{t,s}) = [g_i^{t,s} + b_i^{t,s}]_{\omega_i^t}/p = [g_i^{t,s}]_{\omega_i^t}/p + [b_i^{t,s}]_{\omega_i^t}/p,$$

where we can observe that the amount of added noise is proportional to $k$, the number of active coordinates, and is reduced by a factor of $d/k$ compared with the standard Gaussian mechanism. In the following, we analyze the sensitivity of $[g_i^{t,s}]_{\omega_i^t}$ and then compute the privacy guarantee after adding the noise $[b_i^{t,s}]_{\omega_i^t}$.

For agent $i$, consider any two neighboring datasets $\xi_i^{t,s}$ and ${\xi'}_i^{t,s}$ that have the same size $B$ but differ in one data sample (e.g., $z \in \xi_i^{t,s}$ and $z' \in {\xi'}_i^{t,s}$). The squared L2 sensitivity of $[g_i^{t,s}]_{\omega_i^t}$ can be written as

$$\phi^2_{\omega_i^t} = \max \|[g_i^{t,s}]_{\omega_i^t} - [\nabla f_i(\theta_i^{t,s}, {\xi'}_i^{t,s})]_{\omega_i^t}\|^2 = \max \|(1/B)[\nabla l(\theta_i^{t,s}, z) - \nabla l(\theta_i^{t,s}, z')]_{\omega_i^t}\|^2,$$

which is upper-bounded by $2pG^2/B^2$ following Assumption 3. We observe that this squared sensitivity is proportional to the compression ratio $p$, which reduces the privacy loss according to Lemma 2.

We then give the end-to-end DP guarantee of Fed-SPA in Theorem 1 and provide the proof in Appendix E. Given a fixed value of $\delta$, $\varepsilon$ is computed numerically by searching for the $\alpha$ that minimizes $\varepsilon$. We observe that the first term of $\varepsilon$ in Theorem 1 depends on $p$ and $\sigma$ only through the ratio $p/\sigma^2$, so the noise variance $\sigma^2$ needed for a given privacy level scales with $p$. Sparsification, i.e., $p < 1$, can therefore reduce the magnitude of the Gaussian noise and hence improve the model accuracy.

Theorem 1 (Privacy Guarantee). Suppose the mini-batch $\xi_i^{t,s}$ is sampled without replacement at each iteration. Let $q := B/m$ be the data sampling rate. Let $I_i$ represent the number of rounds in which agent $i$ participates during training. Under Assumption 3, if $\sigma'^2 = \sigma^2 B^2/(2pG^2) \ge 0.7$, Fed-SPA achieves $(\varepsilon, \delta)$-DP for agent $i$, where

$$\varepsilon = \frac{7 q^2 I_i \tau \alpha p G^2}{B^2 \sigma^2} + \frac{\log(1/\delta)}{\alpha - 1},$$

for any $\alpha \le (2/3)\sigma^2 \log\left(1/(q\alpha(1 + \sigma'^2))\right) + 1$ and $\delta \in (0, 1)$.
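The numerical search over $\alpha$ mentioned before Theorem 1 can be sketched as a simple grid search. This is an illustrative sketch, not the authors' accounting code: the function name, the grid of orders, and all parameter values are assumptions, and both the expression for $\varepsilon$ and the constraint on $\alpha$ are transcribed as they are printed in the theorem above.

```python
import math


def fed_spa_epsilon(q, rounds, tau, p, G, B, sigma, delta, alphas):
    """Evaluate the epsilon of Theorem 1 on a grid of orders and keep the smallest.

    Returns (epsilon, alpha), or None if no grid point satisfies the
    constraint on alpha.
    """
    sigma_prime_sq = sigma ** 2 * B ** 2 / (2.0 * p * G ** 2)
    if sigma_prime_sq < 0.7:
        raise ValueError("Theorem 1 requires sigma'^2 >= 0.7")
    best = None
    for alpha in alphas:
        # Constraint on alpha, transcribed as printed in Theorem 1.
        limit = (2.0 / 3.0) * sigma ** 2 * math.log(
            1.0 / (q * alpha * (1.0 + sigma_prime_sq))) + 1.0
        if alpha <= 1.0 or alpha > limit:
            continue
        eps = (7.0 * q ** 2 * rounds * tau * alpha * p * G ** 2
               / (B ** 2 * sigma ** 2)
               + math.log(1.0 / delta) / (alpha - 1.0))
        if best is None or eps < best[0]:
            best = (eps, alpha)
    return best


# Hypothetical settings, not taken from the paper's experiments.
grid = [1.0 + 0.05 * i for i in range(1, 200)]
result = fed_spa_epsilon(q=1e-4, rounds=10, tau=10, p=0.1, G=1.0, B=10,
                         sigma=2.0, delta=1e-3, alphas=grid)
print(result)
```

A finer grid or a one-dimensional optimizer could replace the coarse grid; the grid is used here only to keep the sketch dependency-free.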

4.2 Convergence Analysis

In this subsection, we present the convergence results of Fed-SPA for general loss functions satisfying Assumptions 1-3. For ease of illustration, we assume full participation, i.e., $|\mathcal{W}| = n$, and $\beta_1 = 0$, though our analysis can be easily extended to $\beta_1 > 0$ and partial participation (i.e., $|\mathcal{W}| < n$; see Appendix H for details). As $f(\cdot)$ may be non-convex, we study the gradient of the global model $\theta^t$ as $t$ increases. We give the result in Theorem 2 and the proof in the full version.

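Theorem 2 (Convergence Result). Let Assumptions 1-3 hold, and let $L, G, \zeta_l, \zeta_g$ be as defined therein. Suppose the local learning rate satisfies $\eta_l \le \min\{1/(8L\tau),\ (1/(8\tau)) \min\{\kappa\sqrt{d}/G,\ (\kappa^2\sqrt{d}/(G\eta_g L))^{1/2}\}\}$. Let $\zeta_{dp}^2 = (G^2 + \zeta_l^2)/p + pd\sigma^2$. Then the iterates of Algorithm 1 satisfy

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\theta^t)\|^2 = O\left(\left(\sqrt{\beta_2}\,\eta_l \tau G/\sqrt{d} + \kappa\right)(\Xi + \Xi')\right)$$

with

$$\Xi = \frac{f(\theta^0) - f^*}{\eta_l \eta_g \tau T} + \frac{5\eta_l^2 \tau L^2}{2}\left(\zeta_{dp}^2 + 6\tau\zeta_g^2\right),$$

$$\Xi' = \left(\frac{\eta_g L}{2} + \frac{G}{\sqrt{d}}\right)\left[\frac{4\eta_l}{n\kappa^2}\zeta_{dp}^2 + \frac{20\eta_l^3 \tau^2 L^2}{\kappa^2}\left(\zeta_{dp}^2 + 6\tau\zeta_g^2\right)\right],$$

where $f^*$ is the optimal objective value.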

We restate the above result for a specific choice of $\eta_l$, $\eta_g$ and $\kappa$ in Lemma 3 to highlight the dependence on $\tau$ and $T$. Note that when $T$ is sufficiently large compared to $\tau$, $O(1/\sqrt{n\tau T})$ is the dominant term in Lemma 3. Therefore, Fed-SPA converges at a rate of $O(1/\sqrt{n\tau T})$, which matches the best known rate for the general non-convex setting of interest [Reddi et al., 2020]. We also note that both sparsification and Gaussian noise slow down the convergence. However, a small compression ratio $p$ can reduce the amount of added noise (i.e., the term $pd\sigma^2$) by a factor of $1/p$, which reflects the privacy amplification effect of sparsification.

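Lemma 3. Choose the local learning rate $\eta_l = \Theta(1/(L\tau\sqrt{T}))$ that satisfies the condition in Theorem 2. Suppose $\eta_g = \Theta(\sqrt{n\tau})$ and $\kappa = G/(\sqrt{d}L)$. Then the iterates of Algorithm 1 satisfy

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\theta^t)\|^2 = O\left(\frac{f(\theta^0) - f^*}{\sqrt{n\tau T}} + \frac{2\zeta_{dp}^2 L}{G^2\sqrt{n\tau T}} + \frac{\zeta_{dp}^2 + 6\tau\zeta_g^2}{G\tau T} + \frac{(\zeta_{dp}^2 + 6\tau\zeta_g^2) L \sqrt{n}}{G^2 \sqrt{\tau}\, T^{3/2}}\right),$$

when $T$ is sufficiently large.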

5 Experimental Evaluation

The goal of this section is to evaluate the performance of Fed-SPA on several benchmark datasets. We evaluate Fed-SPA with different levels of compression and compare it with the following three schemes: 1) FedAvg: the classic FL algorithm; 2) DP-Fed: this baseline follows the FedAvg algorithm except that the stochastic gradient is perturbed by adding Gaussian noise $\mathcal{N}(0, \sigma^2 I_d)$; 3) cpSGD [Agarwal et al., 2018]: this baseline follows the classic distributed SGD algorithm except that the stochastic gradient is quantized into a discrete domain and then perturbed using the Binomial mechanism.

5.1 Experimental Setup

We use two widely used benchmark datasets in FL: MNIST [LeCun et al., 1998] and CIFAR-10 [Krizhevsky et al., 2009]. The MNIST dataset consists of 10 classes of 28 × 28 handwritten digit images. There are 60K training examples and 10K testing examples, which are partitioned among 100 agents, each containing 600 training and 100 testing examples. The CIFAR-10 dataset consists of 10 classes of 32 × 32 images. There are 50K training examples and 10K testing examples in the dataset, which are partitioned among 100 agents, each containing 500 training and 100 testing examples. We use a CNN model for the MNIST dataset, which has two 5 × 5 convolutional layers (the first with 10 filters, the second with 20 filters, each followed by 2 × 2 max pooling and ReLU activation), a fully connected layer with 50 units and ReLU activation, and a final softmax output layer. For the CIFAR-10 dataset, we use a CNN model that consists of three 3 × 3 convolutional layers (the first with 64 filters, the second with 128 filters, the third with 256 filters, each followed by 2 × 2 max pooling and ReLU activation), two fully connected layers (the first with 128 units, the second with 256 units, each followed by ReLU activation), and a final softmax output layer.

We set the privacy failure probability $\delta = 10^{-3}$ and the sampling ratio of agents $r = |\mathcal{W}|/n = 0.1$ for all experiments by default. Since cpSGD follows the classic distributed SGD scheme, we have $\tau = 1$ and $r = 1.0$ for cpSGD. We tune the hyperparameters using grid search. We set the number of local iterations $\tau = 300$ for MNIST and $\tau = 50$ for CIFAR-10. The details of other hyperparameter settings are given in the full version. The per-coordinate sensitivity $G/\sqrt{d}$ is selected during an initialization round for each scheme by taking the median value over $N$ absolute values.

Page 6: arXiv:2008.01558v2 [cs.LG] 8 Jun 2021

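| Compression ratio | Algorithm | Accuracy (%) | Cost (MB) | ε |
|---|---|---|---|---|
| p = 0.05 | Fed-SPA | 92.65 | 0.0197 | 1.0 |
| p = 0.05 | cpSGD | diverge | - | - |
| p = 0.1 | Fed-SPA | 92.16 | 0.0393 | 1.0 |
| p = 0.1 | cpSGD | diverge | - | - |
| p = 0.4 | Fed-SPA | 94.84 | 0.1572 | 1.0 |
| p = 0.4 | cpSGD | 87.32 | 1.5720 | 3.63 |
| p = 1.0 | Fed-SPA | 94.70 | 0.3931 | 1.0 |
| p = 1.0 | cpSGD | 88.48 | 3.9310 | 2.26 |
| p = 1.0 | DP-Fed | 91.41 | 0.3931 | 1.0 |
| p = 1.0 | FedAvg | 96.87 | 0.3931 | - |

Table 1: Summary of results on MNIST dataset.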

5.2 Experimental Results

Table 1 shows the best accuracy over 45 communication rounds for each scheme on the MNIST dataset, and Table 2 reports the best accuracy over 200 rounds for each scheme on the CIFAR-10 dataset. We assume each value of the model parameter is represented by a 32-bit floating-point number. The compression ratio for Fed-SPA is calculated as $p = k/d$. For cpSGD, $p = \log_2(m + b)/32$, where $b$ denotes $b$-bit quantization and $m$ is the parameter of the Binomial distribution $\mathrm{Bin}(m, 0.5)$. For FedAvg and DP-Fed, we have $p = 1.0$. Cost is the average bandwidth consumption, calculated as $p \times d \times 32 \times T \times r$, where $r$ is the sampling probability of devices and $T$ is the number of communication rounds. Note that since it is still unknown how to analyze the Binomial mechanism in cpSGD using RDP, we account for the privacy loss ε of cpSGD using the standard subsampling amplification [Balle et al., 2018] and advanced composition [Dwork et al., 2010].
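The cost formula above can be sketched in a few lines. The value d = 21,840 for the MNIST model is an assumption (a parameter count obtained for unpadded, stride-1 convolutions with biases), and megabytes are taken as 10^6 bytes; neither is stated in the text, so this is an illustrative check rather than the authors' script.

```python
def cost_mb(p, d, rounds, r, bits_per_value=32):
    """Average bandwidth p * d * 32 * T * r, converted from bits to megabytes."""
    bits = p * d * bits_per_value * rounds * r
    return bits / 8 / 1e6


D_MNIST = 21840  # assumed model size, not given in the paper

for p in (0.05, 0.1, 0.4, 1.0):
    # T = 45 rounds and r = 0.1 are the MNIST settings described in the text
    print(p, round(cost_mb(p, D_MNIST, rounds=45, r=0.1), 4))
```

With these assumptions the four printed costs round to the Fed-SPA entries of Table 1 (0.0197, 0.0393, 0.1572, 0.3931 MB), which suggests the assumed d is at least plausible.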

Without compression, i.e., when $p = 1.0$, Fed-SPA outperforms cpSGD and DP-Fed on both datasets, as the server in Fed-SPA updates the global model adaptively to speed up convergence, and its best accuracy under $(1.0, 10^{-3})$-DP is close to the best accuracy of FedAvg under the same communication cost. Note that since the Binomial noise used in cpSGD is not as concentrated as the Gaussian noise used in Fed-SPA and DP-Fed, the model accuracy of cpSGD degrades heavily, and hence the privacy loss of cpSGD is very large for a given target accuracy. As the compression ratio $p$ decreases, the performance of Fed-SPA on both datasets does not degrade much, compared with cpSGD which cannot

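| Compression ratio | Algorithm | Accuracy (%) | Cost (MB) | ε |
|---|---|---|---|---|
| p = 0.05 | Fed-SPA | 63.0 | - | 1.0 |
| p = 0.05 | cpSGD | diverge | - | - |
| p = 0.1 | Fed-SPA | 63.28 | 2.15 | 1.0 |
| p = 0.1 | cpSGD | diverge | - | - |
| p = 0.4 | Fed-SPA | 64.36 | 8.60 | 1.0 |
| p = 0.4 | cpSGD | 35.74 | 86.02 | 258 |
| p = 1.0 | Fed-SPA | 62.06 | 21.50 | 1.0 |
| p = 1.0 | cpSGD | 37.17 | 215.04 | 39.77 |
| p = 1.0 | DP-Fed | 61.13 | 21.50 | 1.0 |
| p = 1.0 | FedAvg | 67.16 | 21.50 | - |

Table 2: Summary of results on CIFAR-10 dataset.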

[Figure 1: Privacy-accuracy trade-off of Fed-SPA. Panels: (a) MNIST, (b) CIFAR-10. Horizontal axis: ε (0.2 to 1.0); vertical axis: testing accuracy. Curves: p = 0.05, 0.1, 0.2, 0.8, 1.0 for MNIST and p = 0.01, 0.05, 0.2, 0.8, 1.0 for CIFAR-10.]

achieve a reasonable privacy guarantee when $p \le 0.1$. More importantly, Fed-SPA for MNIST achieves a higher accuracy than DP-Fed under the same privacy cost, while saving 86% communication cost when $p = 0.05$. The reason is that decreasing $p$ also lowers the sensitivity of the sparsified gradient, which has a direct impact on the noise magnitude needed to achieve a target privacy guarantee, as explained in Section 4.1. We observe a similar trend for CIFAR-10 in Table 2: Fed-SPA performs better when the compression ratio is small, e.g., when $p \le 0.4$, as the sensitivity in this case is small.

Then, we show the privacy-accuracy trade-off of Fed-SPA for both datasets with different levels of compression in Figure 1. As mentioned above, a small compression ratio results in a smaller sensitivity, and hence less Gaussian noise is added to the model. On the other hand, the compression ratio cannot be arbitrarily small, as it increases the variance of the gradient, as explained in Section 4.2. Therefore, we should find an optimal $p$ that is small enough to reduce the size of the additive Gaussian noise but large enough to avoid a large compression error. For both datasets, we observe that when the privacy budget ε is limited (e.g., ε < 1.0), a small enough $p$ (e.g., $p = 0.8$ for MNIST and $p = 0.2$ for CIFAR-10) achieves higher accuracy as it reduces the size of the Gaussian noise. As ε increases, the size of the Gaussian noise decreases, and hence Fed-SPA with larger $p$ performs better due to the smaller compression error.

6 Conclusion

This paper has proposed Fed-SPA, a new FL scheme based on a sparsification-coded DP mechanism and server-side adaptive updates, to improve the privacy-accuracy trade-off and achieve communication efficiency at the same time. We have provided rigorous convergence and privacy analysis of Fed-SPA. Extensive experiments based on benchmark datasets have been conducted to verify the effectiveness of the proposed scheme and numerically show the trade-off between privacy guarantee and model accuracy. For future work, we plan to investigate the interplay of privacy protection with other communication-efficient techniques in FL.

Acknowledgements

The work of R. Hu and Y. Gong was supported by the U.S. National Science Foundation under grants US CNS-2029685 and CNS-1850523. The work of Y. Guo was supported by the U.S. National Science Foundation under grant US CNS-2029685.


References

[Abadi et al., 2016] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.

[Agarwal et al., 2018] Naman Agarwal, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and Brendan McMahan. cpSGD: Communication-efficient and differentially-private distributed SGD. In Advances in Neural Information Processing Systems, pages 7564–7575, 2018.

[Balle et al., 2018] Borja Balle, Gilles Barthe, and Marco Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. In Advances in Neural Information Processing Systems, pages 6277–6287, 2018.

[Dwork and Roth, 2014] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

[Dwork et al., 2010] Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51–60. IEEE, 2010.

[Erlingsson et al., 2019] Ulfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2468–2479. SIAM, 2019.

[Fredrikson et al., 2015] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322–1333, 2015.

[Gong et al., 2016] Yanmin Gong, Yuguang Fang, and Yuanxiong Guo. Private data analytics on biomedical sensing data via distributed computation. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 13(3):431–444, 2016.

[Guo and Gong, 2018] Yuanxiong Guo and Yanmin Gong. Practical collaborative learning for crowdsensing in the internet of things with differential privacy. In 2018 IEEE Conference on Communications and Network Security (CNS), pages 1–9. IEEE, 2018.

[Hu et al., 2020] Rui Hu, Yuanxiong Guo, Hongning Li, Qingqi Pei, and Yanmin Gong. Personalized federated learning with differential privacy. IEEE Internet of Things Journal, 2020.

[Huang et al., 2019] Zonghao Huang, Rui Hu, Yuanxiong Guo, Eric Chan-Tin, and Yanmin Gong. DP-ADMM: ADMM-based distributed learning with differential privacy. IEEE Transactions on Information Forensics and Security, 15:1002–1012, 2019.

[Kingma and Ba, 2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Krizhevsky et al., 2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[LeCun et al., 1998] Yann LeCun, Leon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[Liu et al., 2020] Ruixuan Liu, Yang Cao, Hong Chen, Ruoyang Guo, and Masatoshi Yoshikawa. FLAME: Differentially private federated learning in the shuffle model. arXiv preprint arXiv:2009.08063, 2020.

[McMahan et al., 2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, 2017.

[Mironov, 2017] Ilya Mironov. Renyi differential privacy. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th, pages 263–275. IEEE, 2017.

[Reddi et al., 2020] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konecny, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.

[Shokri et al., 2017] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017.

[Truex et al., 2019] Stacey Truex, Nathalie Baracaldo, Ali Anwar, Thomas Steinke, Heiko Ludwig, Rui Zhang, and Yi Zhou. A hybrid approach to privacy-preserving federated learning. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, pages 1–11, 2019.

[Wang et al., 2019a] Lingxiao Wang, Bargav Jayaraman, David Evans, and Quanquan Gu. Efficient privacy-preserving nonconvex optimization. arXiv preprint arXiv:1910.13659, 2019.

[Wang et al., 2019b] Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled Renyi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1226–1235. PMLR, 2019.

[Yu et al., 2018] Da Yu, Huishuai Zhang, and Wei Chen. Improve the gradient perturbation approach for differentially private optimization. NeurIPS 2018 Workshop, 2018.


A Experiments

A.1 Hyperparameter settings

We tune the hyperparameters using grid search and choose the following optimal values: for MNIST, we use batch size $B = 10$, number of local iterations $\tau = 300$ and local learning rate $\eta_l = 0.01$; for CIFAR-10, we use batch size $B = 50$, number of local iterations $\tau = 50$ and local learning rate $\eta_l = 0.1$. For Fed-SPA, we set $\beta_1 = 0.9$, $\beta_2 = 0.99$ and $\kappa = 10^{-3}$ for both datasets, and the global learning rates for MNIST and CIFAR-10 are $\eta_g = 0.01$ and $\eta_g = 0.005$, respectively. Note that we let both local and global learning rates decay at a rate of $1/\sqrt{t}$.

B Additional Lemmas of RDP

Lemma 4 (RDP to DP conversion [Wang et al., 2019b]). If $\mathcal{M}$ satisfies $(\alpha, \rho)$-RDP, then it also satisfies $\left(\rho + \frac{\log(1/\delta)}{\alpha - 1}, \delta\right)$-DP.
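The conversion in Lemma 4 is a one-line formula, and a typical use is to evaluate it over several orders α and keep the smallest ε. The sketch below illustrates this; the helper names `rdp_to_dp` and `best_epsilon`, and the example curve ρ(α) = 0.01α, are hypothetical and only serve the illustration.

```python
import math


def rdp_to_dp(alpha, rho, delta):
    """Lemma 4: (alpha, rho)-RDP implies (rho + log(1/delta)/(alpha - 1), delta)-DP."""
    if alpha <= 1 or not 0.0 < delta < 1.0:
        raise ValueError("need alpha > 1 and 0 < delta < 1")
    return rho + math.log(1.0 / delta) / (alpha - 1)


def best_epsilon(rho_of_alpha, delta, alphas):
    """Smallest epsilon obtained over a grid of RDP orders (hypothetical helper)."""
    return min(rdp_to_dp(a, rho_of_alpha(a), delta) for a in alphas)


# Illustrative RDP curve rho(alpha) = 0.01 * alpha, not taken from the paper.
print(best_epsilon(lambda a: 0.01 * a, delta=1e-3, alphas=range(2, 65)))
```

Searching over α matters because the first term usually grows with α while the second shrinks.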

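Lemma 5 (RDP for Subsampling Mechanism [Wang et al., 2019b; Wang et al., 2019a]). For a Gaussian mechanism $\mathcal{M}$ and any $m$-datapoint dataset $D$, define $\mathcal{M} \circ \text{SUBSAMPLE}$ as 1) subsample without replacement $B$ datapoints from the dataset (denote $q = B/m$ as the sampling ratio); and 2) apply $\mathcal{M}$ on the subsampled dataset as input. If $\mathcal{M}$ satisfies $(\alpha, \rho(\alpha))$-RDP with respect to the subsampled dataset for all integers $\alpha \ge 2$, then the new randomized mechanism $\mathcal{M} \circ \text{SUBSAMPLE}$ satisfies $(\alpha, \rho'(\alpha))$-RDP with respect to $D$, where

$$
\rho'(\alpha) \le \frac{1}{\alpha - 1} \log\left(1 + q^2 \binom{\alpha}{2} \min\left\{4\left(e^{\rho(2)} - 1\right), 2e^{\rho(2)}\right\} + \sum_{j=3}^{\alpha} q^j \binom{\alpha}{j} 2 e^{(j-1)\rho(j)}\right).
$$

If $\sigma'^2 = \sigma^2/\varphi^2(h) \ge 0.7$ and $\alpha \le (2/3)\sigma^2 \log\left(1/\left(q\alpha(1 + \sigma'^2)\right)\right) + 1$, then $\mathcal{M} \circ \text{SUBSAMPLE}$ satisfies $(\alpha, 3.5 q^2 \varphi^2(h) \alpha/\sigma^2)$-RDP.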

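C Useful inequalities

Lemma 6. For an arbitrary set of $n$ vectors $\{a_i\}_{i=1}^{n}$, $a_i \in \mathbb{R}^d$,

$$
\left\| \sum_{i=1}^{n} a_i \right\|^2 \le n \sum_{i=1}^{n} \|a_i\|^2. \tag{7}
$$

Lemma 7. For any two vectors $a, b \in \mathbb{R}^d$,

$$
\|a + b\|^2 \le (1 + \alpha)\|a\|^2 + (1 + \alpha^{-1})\|b\|^2, \quad \forall \alpha > 0. \tag{8}
$$

This inequality also holds for the sum of two matrices $A, B \in \mathbb{R}^{n \times d}$ in Frobenius norm.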
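For completeness, inequality (8) follows from a standard argument (expanding the square and applying Young's inequality to the cross term); the short derivation below is supplied here as a reminder and is not part of the original appendix.

```latex
\begin{align*}
\|a+b\|^2 &= \|a\|^2 + 2\langle a, b\rangle + \|b\|^2, \\
2\langle a, b\rangle
  &= 2\left\langle \sqrt{\alpha}\,a,\ \tfrac{1}{\sqrt{\alpha}}\,b \right\rangle
  \le \alpha\|a\|^2 + \alpha^{-1}\|b\|^2
  \qquad (\text{since } 2\langle x, y\rangle \le \|x\|^2 + \|y\|^2), \\
\|a+b\|^2 &\le (1+\alpha)\|a\|^2 + (1+\alpha^{-1})\|b\|^2 .
\end{align*}
```

Inequality (7) can be obtained in the same spirit from the Cauchy-Schwarz inequality applied to the sum of $n$ terms.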

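D Proof of Lemma 8

Lemma 8 (Unbiased Sparsification). Given a vector $x \in \mathbb{R}^d$, parameter $k \in [d]$, and sparsifier $S_i^t(x) := (d/k)\,\mathrm{rand}_k(x)$, let $p := k/d \in (0, 1]$ be the compression ratio. Then it holds that

$$
\mathbb{E}_{\omega}[S_i^t(x)] = x, \qquad \mathbb{E}_{\omega}\|S_i^t(x) - x\|^2 = \left(\frac{1}{p} - 1\right)\|x\|^2.
$$

Proof. By applying the expectation over the active set $\omega$, we have

$$
\mathbb{E}_{\omega}[S_i^t(x)] = \frac{d}{k}\left[\frac{k}{d}[x]_1, \ldots, \frac{k}{d}[x]_d\right] = x,
$$

and

$$
\mathbb{E}_{\omega}\|S_i^t(x) - x\|^2 = \sum_{j=1}^{d}\left(\frac{k}{d}\left([x]_j - \frac{d}{k}[x]_j\right)^2 + \left(1 - \frac{k}{d}\right)[x]_j^2\right) = \left(\frac{1}{p} - 1\right)\|x\|^2.
$$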
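Both identities in Lemma 8 can be checked exactly for a small vector by enumerating every possible active set. The sketch below does so with rational arithmetic; it assumes that $\mathrm{rand}_k$ keeps a uniformly random size-$k$ subset of coordinates and zeroes the rest, and the function name `rand_k_moments` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def rand_k_moments(x, k):
    """Exact mean and mean squared error of (d/k) * rand_k(x) over all size-k active sets."""
    d = len(x)
    scale = Fraction(d, k)
    subsets = list(combinations(range(d), k))
    mean = [Fraction(0)] * d
    sq_err = Fraction(0)
    for active in subsets:
        # Keep and rescale the active coordinates, zero out the others.
        s = [scale * x[j] if j in active else Fraction(0) for j in range(d)]
        mean = [m + v for m, v in zip(mean, s)]
        sq_err += sum((v - xj) ** 2 for v, xj in zip(s, x))
    n = len(subsets)
    return [m / n for m in mean], sq_err / n


x = [Fraction(v) for v in (3, -1, 2, 5, 4)]
mean, err = rand_k_moments(x, k=2)
p = Fraction(2, 5)
print(mean == x, err == (1 / p - 1) * sum(v * v for v in x))
```

Exhaustive enumeration is only practical for tiny d, but it avoids sampling noise when sanity-checking the scaling factor d/k.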


E Proof of Theorem 1

Proof. Let $\omega_i^t$ denote the active coordinate set for participating agent $i$ at round $t$. The sparsified noisy gradient at each local iteration can be represented as

$$
S_i^t(g_i^{t,s} + b_i^{t,s}) = \frac{[g_i^{t,s} + b_i^{t,s}]_{\omega_i^t}}{p} = \frac{[g_i^{t,s}]_{\omega_i^t} + [b_i^{t,s}]_{\omega_i^t}}{p},
$$

where we can observe that, due to the use of sparsification, Gaussian noise is only added to the values of the active coordinates in $\omega_i^t$ in Fed-SPA. Therefore, we need to analyze the privacy loss of $[g_i^{t,s}]_{\omega_i^t}$ after adding noise $[b_i^{t,s}]_{\omega_i^t}$. For agent $i$, consider any two neighboring mini-batches $\xi_i^{t,s}$ and $\xi_i'^{t,s}$ that have the same size $B$ but differ in only one data sample (e.g., $z$ and $z'$). Let $\varphi_{\omega_i^t}$ be the $L_2$ sensitivity of $[g_i^{t,s}]_{\omega_i^t}$; then we have

$$
\varphi_{\omega_i^t}^2 = \max \left\| [g_i^{t,s}]_{\omega_i^t} - \left[\nabla f_i(\theta_i^{t,s}, \xi_i'^{t,s})\right]_{\omega_i^t} \right\|^2 = \max \left\| \frac{1}{B}\left[\nabla l(\theta_i^{t,s}, z) - \nabla l(\theta_i^{t,s}, z')\right]_{\omega_i^t} \right\|^2 \le \frac{2pG^2}{B^2},
$$

where the last inequality comes from Assumption 3. Given that the noise added to each coordinate of $[g_i^{t,s}]_{\omega_i^t}$ is drawn from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$, each local iteration of Algorithm 1 achieves $(\alpha, \alpha p G^2/(B^2\sigma^2))$-RDP for the sub-sampled mini-batch $\xi_i^{t,s}$ by Lemma 2.

By the sub-sampling amplification property of RDP in Lemma 5, each local iteration of Algorithm 1 achieves $(\alpha, 7q^2\rho_i(\alpha))$-RDP for agent $i$'s local dataset $D_i$ if $\sigma'^2 = \sigma^2 B^2/(2pG^2) \ge 0.7$ and $\alpha \le (2/3)\sigma^2 \log\left(1/\left(q\alpha(1 + \sigma'^2)\right)\right) + 1$. After $T$ communication rounds, agent $i$ will perform $I_i\tau$ iterations of SGD. By Lemma 1, Fed-SPA satisfies $(\alpha, 7q^2 I_i \tau \rho_i(\alpha))$-RDP for agent $i$ across $T$ communication rounds. Therefore, Theorem 1 follows by Lemma 4.

For clarity of presentation, Theorem 1 relies on the amplification by sub-sampling result of [Wang et al., 2019a], which has a simple closed form. A tighter and more general result (without restriction on the values of $\sigma$, $q$, $m$, $p$) can be readily obtained by using the results of [Wang et al., 2019b].
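The accounting steps in this proof can be strung together numerically. The sketch below is only an illustration under stated assumptions: it reads $\rho_i(\alpha)$ as the per-iteration value $\alpha p G^2/(B^2\sigma^2)$ from the Lemma 2 step, treats `num_iters` as the product $I_i\tau$, and checks the two closed-form conditions exactly as written above. The function name and all numeric inputs are hypothetical.

```python
import math


def fed_spa_epsilon(alpha, p, G, B, sigma, q, num_iters, delta):
    """Epsilon for one agent, following the proof: Gaussian RDP -> subsampling -> composition -> DP."""
    sigma_prime_sq = sigma ** 2 * B ** 2 / (2 * p * G ** 2)
    alpha_cap = (2.0 / 3.0) * sigma ** 2 * math.log(1.0 / (q * alpha * (1 + sigma_prime_sq))) + 1
    if sigma_prime_sq < 0.7 or alpha > alpha_cap:
        raise ValueError("conditions of the closed-form amplification bound are not met")
    rho = alpha * p * G ** 2 / (B ** 2 * sigma ** 2)   # per-iteration RDP (assumed reading of rho_i)
    total_rdp = 7 * q ** 2 * num_iters * rho            # subsampling amplification and composition
    return total_rdp + math.log(1.0 / delta) / (alpha - 1)  # RDP to DP conversion (Lemma 4)


# Illustrative inputs only; none of these values come from the experiments.
for p in (0.5, 1.0):
    print(p, fed_spa_epsilon(alpha=2, p=p, G=1.0, B=1, sigma=2.0, q=0.001, num_iters=1000, delta=1e-3))
```

With everything else fixed, the RDP term scales linearly in p, which is the amplification effect of sparsification that the proof describes.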

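F Proof of Theorem 2

Proof. We note that both the moment and model updates on the server are element-wise as follows: $\forall j \in [d]$,

$$
\begin{aligned}
[u_t]_j &= \beta_1 [u_{t-1}]_j + (1 - \beta_1)[\Delta_t]_j, \\
[v_t]_j &= \beta_2 [v_{t-1}]_j + (1 - \beta_2)[u_t]_j^2, \\
[\theta_{t+1}]_j &= [\theta_t]_j + \eta_g \frac{[u_t]_j}{\sqrt{[v_t]_j} + \kappa}.
\end{aligned}
$$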
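The three element-wise rules above translate directly into code. The sketch below applies one server step to plain Python lists; the function name `server_update` and the example values are hypothetical, and the second-moment update uses $[u_t]_j^2$ exactly as written in this proof.

```python
import math


def server_update(theta, u, v, delta, eta_g, beta1, beta2, kappa):
    """One element-wise adaptive server step: momentum, second moment, then the model update."""
    new_u = [beta1 * uj + (1 - beta1) * dj for uj, dj in zip(u, delta)]
    new_v = [beta2 * vj + (1 - beta2) * uj * uj for vj, uj in zip(v, new_u)]
    new_theta = [tj + eta_g * uj / (math.sqrt(vj) + kappa)
                 for tj, uj, vj in zip(theta, new_u, new_v)]
    return new_theta, new_u, new_v


# Toy values chosen so that every intermediate quantity is exact in floating point.
theta, u, v = server_update(theta=[0.0], u=[0.0], v=[0.0], delta=[1.0],
                            eta_g=0.5, beta1=0.5, beta2=0.75, kappa=0.25)
print(theta, u, v)
```

Dividing by $\sqrt{[v_t]_j} + \kappa$ gives each coordinate its own effective step size, and $\kappa$ keeps the step bounded when $[v_t]_j$ is close to zero.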

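By the $L$-smoothness of function $f$, we have

$$
\begin{aligned}
\mathbb{E}_t[f(\theta_{t+1}) - f(\theta_t)] &\le \mathbb{E}_t\langle \nabla f(\theta_t), \theta_{t+1} - \theta_t\rangle + \frac{L}{2}\mathbb{E}_t\|\theta_{t+1} - \theta_t\|_2^2 \\
&= \eta_g \mathbb{E}_t\left\langle \nabla f(\theta_t), \frac{u_t}{\sqrt{v_t} + \kappa}\right\rangle + \frac{L\eta_g^2}{2}\sum_{j=1}^{d}\mathbb{E}_t\left\|\frac{[u_t]_j}{\sqrt{[v_t]_j} + \kappa}\right\|_2^2 \\
&= \eta_g \underbrace{\mathbb{E}_t\left\langle \nabla f(\theta_t), \frac{u_t}{\sqrt{\beta_2 v_{t-1}} + \kappa}\right\rangle}_{T_1} + \eta_g \underbrace{\mathbb{E}_t\left\langle \nabla f(\theta_t), \frac{u_t}{\sqrt{v_t} + \kappa} - \frac{u_t}{\sqrt{\beta_2 v_{t-1}} + \kappa}\right\rangle}_{T_2} \\
&\quad + \frac{L\eta_g^2}{2}\sum_{j=1}^{d}\mathbb{E}_t\left[\frac{[u_t]_j^2}{(\sqrt{[v_t]_j} + \kappa)^2}\right],
\end{aligned} \tag{9}
$$

where the expectation is taken over the randomness at round $t$.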


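Bounding $T_2$. We observe the following about $T_2$:

$$
\begin{aligned}
T_2 &= \mathbb{E}_t\sum_{j=1}^{d}[\nabla f(\theta_t)]_j \times \left[\frac{[u_t]_j}{\sqrt{[v_t]_j} + \kappa} - \frac{[u_t]_j}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa}\right] \\
&= \mathbb{E}_t\sum_{j=1}^{d}[\nabla f(\theta_t)]_j \times [u_t]_j \times \left[\frac{\sqrt{\beta_2[v_{t-1}]_j} - \sqrt{[v_t]_j}}{(\sqrt{[v_t]_j} + \kappa)(\sqrt{\beta_2[v_{t-1}]_j} + \kappa)}\right] \\
&= \mathbb{E}_t\sum_{j=1}^{d}[\nabla f(\theta_t)]_j \times [u_t]_j \times \left[\frac{-(1 - \beta_2)[u_t]_j^2}{(\sqrt{[v_t]_j} + \kappa)(\sqrt{\beta_2[v_{t-1}]_j} + \kappa)(\sqrt{\beta_2[v_{t-1}]_j} + \sqrt{[v_t]_j})}\right] \\
&\le (1 - \beta_2)\mathbb{E}_t\sum_{j=1}^{d}|[\nabla f(\theta_t)]_j| \times |[u_t]_j| \times \left[\frac{[u_t]_j^2}{(\sqrt{[v_t]_j} + \kappa)(\sqrt{\beta_2[v_{t-1}]_j} + \kappa)(\sqrt{\beta_2[v_{t-1}]_j} + \sqrt{[v_t]_j})}\right],
\end{aligned}
$$

given that $-(1 - \beta_2)[u_t]_j^2 = (\sqrt{\beta_2[v_{t-1}]_j} - \sqrt{[v_t]_j})(\sqrt{\beta_2[v_{t-1}]_j} + \sqrt{[v_t]_j})$. Moreover, we have that

$$
\begin{aligned}
T_2 &\le (1 - \beta_2)\mathbb{E}_t\sum_{j=1}^{d}|[\nabla f(\theta_t)]_j| \times \sqrt{\frac{(\sqrt{[v_t]_j} - \sqrt{\beta_2[v_{t-1}]_j})(\sqrt{\beta_2[v_{t-1}]_j} + \sqrt{[v_t]_j})}{1 - \beta_2}} \\
&\qquad \times \left[\frac{[u_t]_j^2}{(\sqrt{[v_t]_j} + \kappa)(\sqrt{\beta_2[v_{t-1}]_j} + \kappa)(\sqrt{\beta_2[v_{t-1}]_j} + \sqrt{[v_t]_j})}\right] \\
&\le \sqrt{1 - \beta_2}\,\mathbb{E}_t\sum_{j=1}^{d}|[\nabla f(\theta_t)]_j| \times \sqrt{\frac{\sqrt{[v_t]_j} - \sqrt{\beta_2[v_{t-1}]_j}}{\sqrt{\beta_2[v_{t-1}]_j} + \sqrt{[v_t]_j}}} \times \left[\frac{[u_t]_j^2}{(\sqrt{[v_t]_j} + \kappa)(\sqrt{\beta_2[v_{t-1}]_j} + \kappa)}\right] \\
&\le \sqrt{1 - \beta_2}\,\mathbb{E}_t\sum_{j=1}^{d}|[\nabla f(\theta_t)]_j| \times \left[\frac{[u_t]_j^2}{(\sqrt{[v_t]_j} + \kappa)(\sqrt{\beta_2[v_{t-1}]_j} + \kappa)}\right],
\end{aligned}
$$

using the fact that $[v_{t-1}]_j$ is increasing in $t$ and $(\sqrt{[v_t]_j} - \sqrt{\beta_2[v_{t-1}]_j})/(\sqrt{\beta_2[v_{t-1}]_j} + \sqrt{[v_t]_j}) \le 1$. Next, using the fact that $|[\nabla f(\theta_t)]_j| \le G/\sqrt{d}$ and $[v_{t-1}]_j \ge \tau^2$ since $[v_{-1}]_j \ge \tau^2$ and $[v_{t-1}]_j$ is increasing in $t$, one yields

$$
T_2 \le \sqrt{1 - \beta_2}\,\mathbb{E}_t\sum_{j=1}^{d}\frac{G}{\kappa\sqrt{d}}\left[\frac{[u_t]_j^2}{(\sqrt{[v_t]_j} + \kappa)(\sqrt{\beta_2} + 1)}\right] \le \sqrt{1 - \beta_2}\,\mathbb{E}_t\sum_{j=1}^{d}\frac{G}{\kappa\sqrt{d}}\left[\frac{[u_t]_j^2}{\sqrt{[v_t]_j} + \kappa}\right].
$$

For expository purposes, we assume $\beta_1 = 0$, so $u_t = \Delta_t$ and hence

$$
T_2 \le \sqrt{1 - \beta_2}\,\mathbb{E}_t\sum_{j=1}^{d}\frac{G}{\kappa\sqrt{d}}\left[\frac{[\Delta_t]_j^2}{\sqrt{[v_t]_j} + \kappa}\right]. \tag{10}
$$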

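Bounding $T_1$. Note that

$$
\begin{aligned}
T_1 &= \sum_{j=1}^{d}\left\langle [\nabla f(\theta_t)]_j, \mathbb{E}_t\left[\frac{[u_t]_j}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa}\right]\right\rangle \\
&= \sum_{j=1}^{d}\left\langle \frac{[\nabla f(\theta_t)]_j}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa}, \mathbb{E}_t\left[[u_t]_j - \eta_l\tau[\nabla f(\theta_t)]_j + \eta_l\tau[\nabla f(\theta_t)]_j\right]\right\rangle \\
&= -\eta_l\tau\sum_{j=1}^{d}\frac{[\nabla f(\theta_t)]_j^2}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa} + \underbrace{\left\langle \frac{\nabla f(\theta_t)}{\sqrt{\beta_2 v_{t-1}} + \kappa}, \mathbb{E}_t\left[u_t + \eta_l\tau\nabla f(\theta_t)\right]\right\rangle}_{T_3}.
\end{aligned} \tag{11}
$$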


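The last term $T_3$ in (11) can be bounded as follows: as $\beta_1 = 0$, we have

$$
\begin{aligned}
T_3 &= \left\langle \frac{\nabla f(\theta_t)}{\sqrt{\beta_2 v_{t-1}} + \kappa}, \mathbb{E}_t\left[\Delta_t + \eta_l\tau\nabla f(\theta_t)\right]\right\rangle \\
&= \left\langle \frac{\nabla f(\theta_t)}{\sqrt{\beta_2 v_{t-1}} + \kappa}, \mathbb{E}_t\left[-\frac{1}{n}\sum_{i=1}^{n}\sum_{s=0}^{\tau-1}\eta_l S_i^t(g_i^{t,s} + b_i^{t,s}) + \eta_l\tau\nabla f(\theta_t)\right]\right\rangle \\
&= \eta_l\tau\left\langle \frac{\nabla f(\theta_t)}{\sqrt{\beta_2 v_{t-1}} + \kappa}, \mathbb{E}_t\left[-\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\tau}\sum_{s=0}^{\tau-1}\nabla f_i(\theta_i^{t,s}) + \nabla f(\theta_t)\right]\right\rangle \\
&\le \frac{\eta_l\tau}{2}\sum_{j=1}^{d}\frac{[\nabla f(\theta_t)]_j^2}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa} + \frac{\eta_l\tau}{2}\mathbb{E}_t\left\|\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\tau}\sum_{s=0}^{\tau-1}\frac{\nabla f_i(\theta_i^{t,s})}{\sqrt{\sqrt{\beta_2 v_{t-1}} + \kappa}} - \frac{\nabla f(\theta_t)}{\sqrt{\sqrt{\beta_2 v_{t-1}} + \kappa}}\right\|^2,
\end{aligned}
$$

where we use the fact that $ab \le (a^2 + b^2)/2$. Given that $\nabla f(\theta_t) = (1/n)\sum_{i=1}^{n}\nabla f_i(\theta_t)$, one yields

$$
\begin{aligned}
T_3 &\le \frac{\eta_l\tau}{2}\sum_{j=1}^{d}\frac{[\nabla f(\theta_t)]_j^2}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa} + \frac{\eta_l}{2\tau n^2}\mathbb{E}_t\left\|\sum_{i=1}^{n}\sum_{s=0}^{\tau-1}\frac{\nabla f_i(\theta_i^{t,s}) - \nabla f_i(\theta_t)}{\sqrt{\sqrt{\beta_2 v_{t-1}} + \kappa}}\right\|^2 \\
&\le \frac{\eta_l\tau}{2}\sum_{j=1}^{d}\frac{[\nabla f(\theta_t)]_j^2}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa} + \frac{\eta_l}{2n\kappa}\mathbb{E}_t\left[\sum_{i=1}^{n}\sum_{s=0}^{\tau-1}\left\|\nabla f_i(\theta_i^{t,s}) - \nabla f_i(\theta_t)\right\|^2\right] \\
&\le \frac{\eta_l\tau}{2}\sum_{j=1}^{d}\frac{[\nabla f(\theta_t)]_j^2}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa} + \frac{\eta_l L^2}{2n\kappa}\mathbb{E}_t\left[\sum_{i=1}^{n}\sum_{s=0}^{\tau-1}\left\|\theta_i^{t,s} - \theta_t\right\|^2\right],
\end{aligned} \tag{12}
$$

where the second inequality results from (7), and the third inequality uses the $L$-smoothness of $f_i$ and $[v_{t-1}]_j > 0$, $j \in [d]$. Combining (11) and (12), one gets

$$
T_1 \le -\frac{\eta_l\tau}{2}\sum_{j=1}^{d}\frac{[\nabla f(\theta_t)]_j^2}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa} + \frac{\eta_l L^2}{2n\kappa}\mathbb{E}_t\left[\sum_{i=1}^{n}\sum_{s=0}^{\tau-1}\left\|\theta_i^{t,s} - \theta_t\right\|^2\right].
$$

Using Lemma 10, we get

$$
T_1 \le -\frac{\eta_l\tau}{4}\sum_{j=1}^{d}\frac{[\nabla f(\theta_t)]_j^2}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa} + 5\eta_l^3\tau^2L^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right). \tag{13}
$$

Combining (9), (10) and (13), we obtain that the loss gap at round $t$ is bounded as follows:

$$
\begin{aligned}
\mathbb{E}_t[f(\theta_{t+1}) - f(\theta_t)] &\le -\frac{\eta_l\eta_g\tau}{4}\sum_{j=1}^{d}\frac{[\nabla f(\theta_t)]_j^2}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa} + 5\eta_g\eta_l^3\tau^2L^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right) \\
&\quad + \frac{G\eta_g\sqrt{1 - \beta_2}}{\kappa\sqrt{d}}\sum_{j=1}^{d}\mathbb{E}_t\left[\frac{[\Delta_t]_j^2}{\sqrt{[v_t]_j} + \kappa}\right] + \frac{L\eta_g^2}{2}\sum_{j=1}^{d}\mathbb{E}_t\left[\frac{[\Delta_t]_j^2}{(\sqrt{[v_t]_j} + \kappa)^2}\right].
\end{aligned}
$$

Summing over $t = 0$ to $T - 1$ and using the telescoping sum, one gets

$$
\begin{aligned}
\mathbb{E}[f(\theta_T) - f(\theta_0)] &\le -\frac{\eta_l\eta_g\tau}{4}\sum_{t=0}^{T-1}\sum_{j=1}^{d}\frac{[\nabla f(\theta_t)]_j^2}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa} + 5\eta_g\eta_l^3\tau^2L^2T\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right) \\
&\quad + \frac{G\eta_g\sqrt{1 - \beta_2}}{\kappa\sqrt{d}}\sum_{t=0}^{T-1}\sum_{j=1}^{d}\mathbb{E}\left[\frac{[\Delta_t]_j^2}{\sqrt{[v_t]_j} + \kappa}\right] + \frac{L\eta_g^2}{2}\sum_{t=0}^{T-1}\sum_{j=1}^{d}\mathbb{E}\left[\frac{[\Delta_t]_j^2}{(\sqrt{[v_t]_j} + \kappa)^2}\right].
\end{aligned}
$$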


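Using Lemma 11 and the same idea as in Lemma 11 to bound $\sum_{t=0}^{T-1}\sum_{j=1}^{d}\mathbb{E}\left[\frac{[\Delta_t]_j^2}{\sqrt{[v_t]_j} + \kappa}\right]$, we finally obtain that

$$
\begin{aligned}
\mathbb{E}[f(\theta_T) - f(\theta_0)] &\le -\frac{\eta_l\eta_g\tau}{4}\sum_{t=0}^{T-1}\sum_{j=1}^{d}\frac{[\nabla f(\theta_t)]_j^2}{\sqrt{\beta_2[v_{t-1}]_j} + \kappa} + 5\eta_g\eta_l^3\tau^2L^2T\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right) \\
&\quad + \left(\frac{G\eta_g\sqrt{1 - \beta_2}}{\sqrt{d}} + \frac{\eta_g^2L}{2}\right)\left[\frac{4\eta_l^2\tau T}{n\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2\right) + \frac{20\eta_l^4\tau^3L^2T}{\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right)\right] \\
&\quad + \left(\frac{G\eta_g\sqrt{1 - \beta_2}}{\sqrt{d}} + \frac{\eta_g^2L}{2}\right)\frac{4\eta_l^2\tau^2}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta_t)\|^2.
\end{aligned} \tag{14}
$$

Based on the condition on $\eta_l$ in Theorem 2, we have

$$
\left(\frac{G\sqrt{1 - \beta_2}}{\sqrt{d}} + \frac{\eta_gL}{2}\right)\frac{4\eta_l\tau}{\kappa^2} \le \frac{1}{8(\sqrt{\beta_2}\eta_l\tau G/\sqrt{d} + \kappa)}.
$$

We also observe that

$$
\sqrt{\beta_2[v_{t-1}]_j} + \kappa \le \sqrt{\beta_2}\eta_l\tau G/\sqrt{d} + \kappa,
$$

so rearranging (14) and dividing both sides by $T$ yields

$$
\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta_t)\|^2 &\le \frac{8(\sqrt{\beta_2}\eta_l\tau G/\sqrt{d} + \kappa)}{T\eta_l\eta_g\tau}\left(f(\theta_0) - f^*\right) + \frac{8(\sqrt{\beta_2}\eta_l\tau G/\sqrt{d} + \kappa)}{\eta_l\tau}\,5\eta_l^3\tau^2L^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right) \\
&\quad + \frac{8(\sqrt{\beta_2}\eta_l\tau G/\sqrt{d} + \kappa)}{\eta_l\tau}\left(\frac{G}{\sqrt{d}} + \frac{\eta_gL}{2}\right)\left[\frac{4\eta_l^2\tau}{n\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2\right) + \frac{20\eta_l^4\tau^3L^2}{\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right)\right],
\end{aligned}
$$

where we use the fact that $f(\theta_T) \ge f^*$.

Note that in the above convergence analysis, we use $\beta_1 = 0$ just for expository purposes, though our analysis can be directly extended to $\beta_1 > 0$. Specifically, if $\beta_1 > 0$, then $u_t = \beta_1 u_{t-1} + (1 - \beta_1)\Delta_t = (1 - \beta_1)\sum_{h=1}^{t}\beta_1^{t-h}\Delta_h$, which only makes the expectation and variance of $u_t$ different. Assume $\mathbb{E}[\Delta_h] = \mathbb{E}[\Delta_t]$; then we have $\mathbb{E}[u_t] = (1 - \beta_1^t)\Delta_t$ and $\sum_{t=1}^{T}\|u_t\|^2 \le \sum_{t=1}^{T}\|\Delta_t\|^2/(1 - \beta_1)^2$. Then all the proof steps in Appendix F follow directly by substituting the above new expectation and variance bounds into $T_1$ and $T_2$.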
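The unrolled form of the momentum recursion used in the remark above can be checked numerically. The sketch below assumes the initialization $u_0 = 0$ (not stated in this appendix) and scalar updates; the function names are hypothetical.

```python
def momentum_recursive(deltas, beta1):
    """u_t = beta1 * u_{t-1} + (1 - beta1) * Delta_t, starting from the assumed u_0 = 0."""
    u = 0.0
    for d in deltas:
        u = beta1 * u + (1 - beta1) * d
    return u


def momentum_closed_form(deltas, beta1):
    """u_t = (1 - beta1) * sum_{h=1}^{t} beta1^(t-h) * Delta_h."""
    t = len(deltas)
    return (1 - beta1) * sum(beta1 ** (t - h) * deltas[h - 1] for h in range(1, t + 1))


deltas = [1.0, -2.0, 0.5, 3.0]
print(momentum_recursive(deltas, 0.9), momentum_closed_form(deltas, 0.9))
```

When every $\Delta_h$ equals the same value $\Delta$, the geometric sum collapses to $(1 - \beta_1^t)\Delta$, which matches the expectation stated above.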

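G Intermediate Results

Lemma 9 (Bounded Sparsified Noisy Gradient). The sparsified noisy gradient in Fed-SPA is unbiased and has a bounded variance, i.e., $\mathbb{E}[S_i^t(g_i^{t,s} + b_i^{t,s})] = \nabla f_i(\theta_i^{t,s})$ and $\mathbb{E}\|S_i^t(g_i^{t,s} + b_i^{t,s})\|^2 \le (G^2 + \zeta_l^2)/p + pd\sigma^2$.

Proof. By the unbiasedness property of the sparsification in Lemma 8, we have

$$
\mathbb{E}[S_i^t(g_i^{t,s} + b_i^{t,s})] = \mathbb{E}_{\xi_i^{t,s}}[g_i^{t,s}] + \mathbb{E}_{b_i^{t,s}}[b_i^{t,s}] = \nabla f_i(\theta_i^{t,s}),
$$

where the expectation $\mathbb{E}[\cdot]$ is taken over the random variables $\omega_i^t$, $\xi_i^{t,s}$ and $b_i^{t,s}$. Here, we use the fact that $\omega_i^t$, $\xi_i^{t,s}$ and $b_i^{t,s}$ are independent of each other. Then, applying the expectation $\mathbb{E}[\cdot]$ to the norm of the sparsified noisy gradient, one yields

$$
\begin{aligned}
\mathbb{E}\|S_i^t(g_i^{t,s} + b_i^{t,s})\|^2 &= \mathbb{E}\|S_i^t(g_i^{t,s})\|^2 + \mathbb{E}\|[b_i^{t,s}]_{\omega_i^t}\|^2 \\
&= \mathbb{E}\|S_i^t(g_i^{t,s}) - \nabla f_i(\theta_i^{t,s})\|^2 + \|\nabla f_i(\theta_i^{t,s})\|^2 + \mathbb{E}_{\omega_i^t}\sum_{j \in \omega_i^t}\mathbb{E}_{[b_i^{t,s}]_j}\|[b_i^{t,s}]_j\|^2 \\
&= \mathbb{E}\|S_i^t(g_i^{t,s}) - g_i^{t,s} + g_i^{t,s} - \nabla f_i(\theta_i^{t,s})\|^2 + \|\nabla f_i(\theta_i^{t,s})\|^2 + \frac{k}{d}\sum_{j \in [d]}\sigma^2 \\
&= \mathbb{E}\|S_i^t(g_i^{t,s}) - g_i^{t,s}\|^2 + \mathbb{E}\|g_i^{t,s} - \nabla f_i(\theta_i^{t,s})\|^2 + \|\nabla f_i(\theta_i^{t,s})\|^2 + pd\sigma^2 \\
&\le \left(\frac{1}{p} - 1\right)\mathbb{E}_{\xi_i^{t,s}}\|g_i^{t,s}\|^2 + \zeta_l^2 + \|\nabla f_i(\theta_i^{t,s})\|^2 + pd\sigma^2 \\
&= \left(\frac{1}{p} - 1\right)\mathbb{E}_{\xi_i^{t,s}}\|g_i^{t,s} - \nabla f_i(\theta_i^{t,s})\|^2 + \left(\frac{1}{p} - 1\right)\|\nabla f_i(\theta_i^{t,s})\|^2 + \zeta_l^2 + \|\nabla f_i(\theta_i^{t,s})\|^2 + pd\sigma^2 \\
&\le \frac{1}{p}\zeta_l^2 + \frac{1}{p}\|\nabla f_i(\theta_i^{t,s})\|^2 + pd\sigma^2 \\
&\le \frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2,
\end{aligned}
$$

using the fact that $\mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$, Lemma 8 and Assumption 3.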
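To make the object analyzed in Lemma 9 concrete, the sketch below draws one sparsified noisy gradient: it samples an active set of k coordinates, adds Gaussian noise only on those coordinates, and rescales by 1/p. The function name, the use of Python's `random` module, and the toy gradient are assumptions for illustration only.

```python
import random


def sparsify_noisy_gradient(grad, k, sigma, rng):
    """Return S(g + b) = [g + b]_omega / p for a uniformly random active set omega of size k."""
    d = len(grad)
    p = k / d
    active = set(rng.sample(range(d), k))    # active coordinate set omega
    out = [0.0] * d                           # inactive coordinates stay exactly zero
    for j in active:
        noise = rng.gauss(0.0, sigma)         # noise is drawn only for active coordinates
        out[j] = (grad[j] + noise) / p
    return out, active


rng = random.Random(0)
sparse, active = sparsify_noisy_gradient([0.5, -1.0, 2.0, 0.25, 1.5], k=2, sigma=0.1, rng=rng)
print(sorted(active), sparse)
```

Only k values and their indices need to be transmitted, which is where the communication saving comes from.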

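Lemma 10. The distance between the global model and the local model is bounded as follows: when $s \in [0, \tau)$,

$$
\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|\theta_i^{t,s} - \theta_t\right\|^2 \le 5\tau\eta_l^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right) + 30\tau^2\eta_l^2\,\mathbb{E}\|\nabla f(\theta_t)\|^2, \tag{15}
$$

if the local learning rate satisfies $\eta_l \le 1/(8L\tau)$.

Proof. The result trivially holds for $s = 0$ since $\theta_i^{t,0} = \theta_t$ for all $i \in [n]$. We observe that for any $s \in (0, \tau]$,

$$
\begin{aligned}
\mathbb{E}\left\|\theta_i^{t,s} - \theta_t\right\|^2 &= \mathbb{E}\left\|\theta_i^{t,s-1} - \eta_l S_i^t(g_i^{t,s-1} + b_i^{t,s}) - \theta_t\right\|^2 \\
&= \mathbb{E}\Big\|\theta_i^{t,s-1} - \theta_t - \eta_l S_i^t(g_i^{t,s-1} + b_i^{t,s}) + \eta_l\nabla f_i(\theta_i^{t,s-1}) - \eta_l\nabla f_i(\theta_i^{t,s-1}) + \eta_l\nabla f_i(\theta_t) - \eta_l\nabla f_i(\theta_t) \\
&\qquad + \eta_l\nabla f(\theta_t) - \eta_l\nabla f(\theta_t)\Big\|^2 \\
&= \mathbb{E}\left\|\theta_i^{t,s-1} - \theta_t - \eta_l\nabla f_i(\theta_i^{t,s-1}) + \eta_l\nabla f_i(\theta_t) - \eta_l\nabla f_i(\theta_t) + \eta_l\nabla f(\theta_t) - \eta_l\nabla f(\theta_t)\right\|^2 \\
&\quad + \eta_l^2\,\mathbb{E}\left\|S_i^t(g_i^{t,s-1} + b_i^{t,s}) - \nabla f_i(\theta_i^{t,s-1})\right\|^2 \\
&\le (1 + \alpha)\mathbb{E}\left\|\theta_i^{t,s-1} - \theta_t\right\|^2 + (1 + \alpha^{-1})\eta_l^2\,\mathbb{E}\left\|-\nabla f_i(\theta_i^{t,s-1}) + \nabla f_i(\theta_t) - \nabla f_i(\theta_t) + \nabla f(\theta_t) - \nabla f(\theta_t)\right\|^2 \\
&\quad + \eta_l^2\left(\mathbb{E}\left\|S_i^t(g_i^{t,s-1} + b_i^{t,s})\right\|^2 - \mathbb{E}\left\|\nabla f_i(\theta_i^{t,s-1})\right\|^2\right) \\
&\le \left(1 + \frac{1}{2\tau - 1}\right)\mathbb{E}\left\|\theta_i^{t,s-1} - \theta_t\right\|^2 + 6\tau\eta_l^2\,\mathbb{E}\left\|\nabla f_i(\theta_i^{t,s-1}) - \nabla f_i(\theta_t)\right\|^2 + 6\tau\eta_l^2\,\mathbb{E}\|\nabla f_i(\theta_t) - \nabla f(\theta_t)\|^2 \\
&\quad + 6\tau\eta_l^2\,\mathbb{E}\|\nabla f(\theta_t)\|^2 + \eta_l^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2\right) \\
&\le \left(1 + \frac{1}{2\tau - 1}\right)\mathbb{E}\left\|\theta_i^{t,s-1} - \theta_t\right\|^2 + 6\tau L^2\eta_l^2\,\mathbb{E}\left\|\theta_i^{t,s-1} - \theta_t\right\|^2 + 6\tau\eta_l^2\zeta_g^2 + 6\tau\eta_l^2\,\mathbb{E}\|\nabla f(\theta_t)\|^2 \\
&\quad + \eta_l^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2\right) \\
&= \left(1 + \frac{1}{2\tau - 1} + 6\tau L^2\eta_l^2\right)\mathbb{E}\left\|\theta_i^{t,s-1} - \theta_t\right\|^2 + 6\tau\eta_l^2\,\mathbb{E}\|\nabla f(\theta_t)\|^2 + \eta_l^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right),
\end{aligned}
$$

where the third equality uses the fact that $\mathbb{E}[S_i^t(g_i^{t,s-1} + b_i^{t,s})] = \nabla f_i(\theta_i^{t,s-1})$, the first inequality results from (8), the second inequality comes from choosing $\alpha = 1/(2\tau - 1)$, (7) and Lemma 9, and the last inequality uses the $L$-smoothness of function $f_i$ and Assumption 2.

Taking the average over the agents $i$ and choosing the learning rate $\eta_l \le 1/(8L\tau)$, we obtain that

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|\theta_i^{t,s} - \theta_t\right\|^2 &\le \left(1 + \frac{1}{2\tau - 1} + 6\tau L^2\eta_l^2\right)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|\theta_i^{t,s-1} - \theta_t\right\|^2 + 6\tau\eta_l^2\,\mathbb{E}\|\nabla f(\theta_t)\|^2 \\
&\quad + \eta_l^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right) \\
&\le \left(1 + \frac{1}{\tau - 1}\right)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|\theta_i^{t,s-1} - \theta_t\right\|^2 + 6\tau\eta_l^2\,\mathbb{E}\|\nabla f(\theta_t)\|^2 + \eta_l^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right),
\end{aligned}
$$

where we use the fact that $1/(2\tau - 1) \le 1/(2(\tau - 1))$ and $6\tau L^2\eta_l^2 \le 1/(2(\tau - 1))$. Unrolling the recursion, one yields

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|\theta_i^{t,s} - \theta_t\right\|^2 &\le \sum_{h=0}^{s-1}\left(1 + \frac{1}{\tau - 1}\right)^h\left(6\tau\eta_l^2\,\mathbb{E}\|\nabla f(\theta_t)\|^2 + \eta_l^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right)\right) \\
&\le (\tau - 1)\left[\left(1 + \frac{1}{\tau - 1}\right)^{\tau} - 1\right]\left(6\tau\eta_l^2\,\mathbb{E}\|\nabla f(\theta_t)\|^2 + \eta_l^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right)\right) \\
&\le 5\tau\eta_l^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right) + 30\tau^2\eta_l^2\,\mathbb{E}\|\nabla f(\theta_t)\|^2,
\end{aligned}
$$

where we use the fact that $(1 + 1/(\tau - 1))^{\tau} \le 5$ for any integer $\tau > 1$.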
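The numerical fact used in the last step is easy to verify for the values of $\tau$ that matter in practice. The sketch below evaluates $(1 + 1/(\tau - 1))^{\tau}$ for integer $\tau$ from 2 to 1000; the function name and the chosen range are arbitrary illustrations.

```python
def growth_factor(tau):
    """(1 + 1/(tau - 1)) ** tau, the factor bounded by 5 in the proof of Lemma 10."""
    return (1.0 + 1.0 / (tau - 1)) ** tau


values = {tau: growth_factor(tau) for tau in range(2, 1001)}
worst_tau = max(values, key=values.get)
print(worst_tau, values[worst_tau], values[1000])
```

Over integers the factor is largest at $\tau = 2$, where it equals 4, and it decreases toward $e$ as $\tau$ grows. The restriction to integers matters: for a non-integer value such as $\tau = 1.1$ the factor exceeds 5.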

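Lemma 11. The following upper bound holds for Algorithm 1:

$$
\begin{aligned}
\sum_{t=0}^{T-1}\sum_{j=1}^{d}\mathbb{E}\left[\frac{[\Delta_t]_j^2}{(\sqrt{[v_t]_j} + \kappa)^2}\right] &\le \frac{4\eta_l^2\tau T}{n\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2\right) + \frac{20\eta_l^4\tau^3L^2T}{\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right) \\
&\quad + \frac{4\eta_l^2\tau^2}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta_t)\|^2.
\end{aligned} \tag{16}
$$

Proof. We bound the desired quantity in the following manner:

$$
\begin{aligned}
\sum_{t=0}^{T-1}\sum_{j=1}^{d}\mathbb{E}\left[\frac{[\Delta_t]_j^2}{(\sqrt{[v_t]_j} + \kappa)^2}\right] &\le \sum_{t=0}^{T-1}\sum_{j=1}^{d}\mathbb{E}\left[\frac{[\Delta_t]_j^2}{[v_t]_j + \kappa^2}\right] \le \sum_{t=0}^{T-1}\sum_{j=1}^{d}\mathbb{E}\left[\frac{[\Delta_t]_j^2}{\kappa^2}\right] \\
&\le \frac{1}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta_t + \eta_l\tau\nabla f(\theta_t) - \eta_l\tau\nabla f(\theta_t)\|^2 \\
&\le \underbrace{\frac{2}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\|\Delta_t + \eta_l\tau\nabla f(\theta_t)\|^2}_{T_4} + \frac{2\eta_l^2\tau^2}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta_t)\|^2.
\end{aligned} \tag{17}
$$

Then, we bound $T_4$ as follows:

$$
\begin{aligned}
T_4 &= 2\sum_{t=0}^{T-1}\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\kappa}\sum_{s=0}^{\tau-1}\eta_l\left(-S_i^t(g_i^{t,s} + b_i^{t,s}) + \nabla f_i(\theta_i^{t,s}) - \nabla f_i(\theta_i^{t,s}) + \nabla f_i(\theta_t) - \nabla f_i(\theta_t) + \nabla f(\theta_t)\right)\right\|^2 \\
&= 2\eta_l^2\sum_{t=0}^{T-1}\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\kappa}\sum_{s=0}^{\tau-1}\left(S_i^t(g_i^{t,s} + b_i^{t,s}) - \nabla f_i(\theta_i^{t,s}) + \nabla f_i(\theta_i^{t,s}) - \nabla f_i(\theta_t)\right)\right\|^2 \\
&\le \frac{4\eta_l^2}{n^2}\sum_{t=0}^{T-1}\mathbb{E}\left\|\sum_{i=1}^{n}\frac{1}{\kappa}\sum_{s=0}^{\tau-1}\left(S_i^t(g_i^{t,s} + b_i^{t,s}) - \nabla f_i(\theta_i^{t,s})\right)\right\|^2 + \frac{4\eta_l^2}{n^2}\sum_{t=0}^{T-1}\mathbb{E}\left\|\sum_{i=1}^{n}\frac{1}{\kappa}\sum_{s=0}^{\tau-1}\left(\nabla f_i(\theta_i^{t,s}) - \nabla f_i(\theta_t)\right)\right\|^2 \\
&\le \frac{4\eta_l^2}{n^2\kappa^2}\sum_{t=0}^{T-1}\sum_{i=1}^{n}\sum_{s=0}^{\tau-1}\mathbb{E}\left\|S_i^t(g_i^{t,s} + b_i^{t,s}) - \nabla f_i(\theta_i^{t,s})\right\|^2 + \frac{4\eta_l^2\tau}{n\kappa^2}\sum_{t=0}^{T-1}\sum_{i=1}^{n}\sum_{s=0}^{\tau-1}\mathbb{E}\left\|\nabla f_i(\theta_i^{t,s}) - \nabla f_i(\theta_t)\right\|^2 \\
&\le \frac{4\eta_l^2\tau T}{n\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2\right) + \frac{4\eta_l^2\tau L^2}{n\kappa^2}\sum_{t=0}^{T-1}\sum_{i=1}^{n}\sum_{s=0}^{\tau-1}\mathbb{E}\left\|\theta_i^{t,s} - \theta_t\right\|^2 \\
&\le \frac{4\eta_l^2\tau T}{n\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2\right) + \frac{4\eta_l^2\tau^2L^2}{\kappa^2}\sum_{t=0}^{T-1}\left[5\tau\eta_l^2\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right) + 30\tau^2\eta_l^2\,\mathbb{E}\|\nabla f(\theta_t)\|^2\right] \\
&\le \frac{4\eta_l^2\tau T}{n\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2\right) + \frac{20\eta_l^4\tau^3L^2T}{\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right) + \frac{2\eta_l^2\tau^2}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta_t)\|^2,
\end{aligned}
$$

where the first inequality results from (7), the second inequality uses the independence of the unbiased stochastic gradients $g_i^{t,s}$, $i \in [n]$, $s \in [0, \tau)$ and (7), the second-to-last inequality comes from Lemma 10, and the last step uses $\eta_l \le 1/(8L\tau)$. The result follows by plugging the bound on $T_4$ into (17).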

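H Partial Participation

For partial participation, i.e., when $|\mathcal{W}| < n$, the main change in the proof is the upper bound of $T_4$. The rest of the proof is similar to that of Theorem 2, so we focus on bounding $T_4$ with partial participation here. Assume $\mathcal{W}$ is sampled uniformly from all subsets of $[n]$ with size $r$ at each round. In this case, we have

$$
\begin{aligned}
T_4 &= \frac{2}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\left\|\frac{1}{|\mathcal{W}|}\sum_{i \in \mathcal{W}}\sum_{s=0}^{\tau-1}\eta_l\left(-S_i^t(g_i^{t,s} + b_i^{t,s}) + \nabla f_i(\theta_i^{t,s}) - \nabla f_i(\theta_i^{t,s}) + \nabla f_i(\theta_t) - \nabla f_i(\theta_t) + \nabla f(\theta_t)\right)\right\|^2 \\
&\le \frac{6\eta_l^2}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\left\|\frac{1}{r}\sum_{i \in \mathcal{W}}\sum_{s=0}^{\tau-1}\left(S_i^t(g_i^{t,s} + b_i^{t,s}) - \nabla f_i(\theta_i^{t,s})\right)\right\|^2 + \frac{6\eta_l^2}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\left\|\frac{1}{r}\sum_{i \in \mathcal{W}}\sum_{s=0}^{\tau-1}\left(\nabla f_i(\theta_i^{t,s}) - \nabla f_i(\theta_t)\right)\right\|^2 \\
&\quad + \frac{6\eta_l^2}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\left\|\frac{1}{r}\sum_{i \in \mathcal{W}}\sum_{s=0}^{\tau-1}\left(\nabla f_i(\theta_t) - \nabla f(\theta_t)\right)\right\|^2 \\
&\le \frac{6\eta_l^2}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\left[\frac{1}{r^2}\sum_{i \in \mathcal{W}}\sum_{s=0}^{\tau-1}\left\|S_i^t(g_i^{t,s} + b_i^{t,s}) - \nabla f_i(\theta_i^{t,s})\right\|^2\right] + \frac{6\eta_l^2}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\left[\frac{\tau}{r}\sum_{i \in \mathcal{W}}\sum_{s=0}^{\tau-1}\left\|\nabla f_i(\theta_i^{t,s}) - \nabla f_i(\theta_t)\right\|^2\right] \\
&\quad + \frac{6\tau^2\eta_l^2}{r^2\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\left\|\sum_{i \in \mathcal{W}}\nabla f_i(\theta_t) - r\nabla f(\theta_t)\right\|^2 \\
&\le \frac{6T\tau\eta_l^2}{r\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2\right) + \frac{6\tau\eta_l^2L^2}{n\kappa^2}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\sum_{s=0}^{\tau-1}\mathbb{E}\left\|\theta_i^{t,s} - \theta_t\right\|^2 + \frac{6\tau^2T\eta_l^2\zeta_g^2}{\kappa^2}\left(1 - \frac{r}{n}\right) \\
&\le \frac{6T\tau\eta_l^2}{r\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2\right) + \frac{30T\tau^3\eta_l^4L^2}{\kappa^2}\left(\frac{1}{p}(G^2 + \zeta_l^2) + pd\sigma^2 + 6\tau\zeta_g^2\right) + \frac{60\eta_l^4\tau^2L^2}{\kappa^2}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\theta_t)\|^2 \\
&\quad + \frac{6\tau^2T\eta_l^2\zeta_g^2}{\kappa^2}\left(1 - \frac{r}{n}\right),
\end{aligned}
$$

where the first inequality results from (7), the second inequality uses the independence of the unbiased stochastic gradients $g_i^{t,s}$, $i \in \mathcal{W}$, $s \in [0, \tau)$ and (7), and the last inequality comes from Lemma 10.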