
Push-Pull Gradient Methods for Distributed Optimization in Networks

Shi Pu, Wei Shi, Jinming Xu, and Angelia Nedić

Abstract—In this paper, we focus on solving a distributed convex optimization problem in a network, where each agent has its own convex cost function and the goal is to minimize the sum of the agents' cost functions while obeying the network connectivity structure. In order to minimize the sum of the cost functions, we consider new distributed gradient-based methods where each node maintains two estimates, namely, an estimate of the optimal decision variable and an estimate of the gradient for the average of the agents' objective functions. From the viewpoint of an agent, the information about the gradients is pushed to the neighbors, while the information about the decision variable is pulled from the neighbors, hence giving the name "push-pull gradient methods". The methods utilize two different graphs for the information exchange among agents, and as such, unify the algorithms with different types of distributed architecture, including decentralized (peer-to-peer), centralized (master-slave), and semi-centralized (leader-follower) architecture. We show that the proposed algorithms and their many variants converge linearly for strongly convex and smooth objective functions over a network (possibly with unidirectional data links) in both synchronous and asynchronous random-gossip settings. In particular, under the random-gossip setting, "push-pull" is the first class of algorithms for distributed optimization over directed graphs. Moreover, we numerically evaluate our proposed algorithms in both scenarios, and show that they outperform other existing linearly convergent schemes, especially for ill-conditioned problems and networks that are not well balanced.

Index Terms—distributed optimization, convex optimization, directed graph, network structure, linear convergence, random-gossip algorithm, spanning tree.

I. INTRODUCTION

In this paper, we consider a system involving n agents whose goal is to collaboratively solve the following problem:

    min_{x ∈ R^p} f(x) := ∑_{i=1}^n f_i(x),    (1)

where x is the global decision variable and each function f_i : R^p → R is convex and known by agent i only. The agents

The first and second authors contributed equally. Parts of the results appear in Proceedings of the 57th IEEE Conference on Decision and Control [1]. (Corresponding authors: Shi Pu; Jinming Xu.)

This work was supported in part by the NSF grant CCF-1717391, the ONR grant no. N000141612245, and the SRIBD Startup Fund jcyj-sp2019090001.

S. Pu is with the Institute for Data and Decision Analytics, The Chinese University of Hong Kong, Shenzhen, China, and the Shenzhen Research Institute of Big Data (e-mail: [email protected]).

W. Shi was with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08854 USA (e-mail: [email protected]).

J. Xu is with the College of Control Science and Engineering, Zhejiang University, China (e-mail: [email protected]).

A. Nedić is with the School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, AZ 85281 USA (e-mail: [email protected]).

are embedded in a communication network, and their goal is to obtain an optimal and consensual solution through local neighbor communications and information exchange. This local exchange is desirable in situations where the exchange of a large amount of data is prohibitively expensive due to limited communication resources.

To solve problem (1) in a networked system of n agents, many algorithms have been proposed under various assumptions on the objective functions and on the underlying networks/graphs. Static undirected graphs were extensively considered in the literature [2], [3], [4], [5], [6]. References [7], [8], [9] studied time-varying and/or stochastic undirected networks. Directed graphs were discussed in [10], [11], [12], [9], [13], [14]. Centralized (master-slave) algorithms were discussed in [15], where extensive applications in learning can be found. Parallel, coordinated, and asynchronous algorithms were discussed in [16] and the references therein. The reader is also referred to the recent paper [17] and the references therein for a comprehensive survey on distributed optimization algorithms.

In the first part of this paper, we introduce a novel gradient-based algorithm (Push-Pull) for distributed (consensus-based) optimization in directed graphs. Unlike the push-sum type protocol used in the previous literature [9], [14], our algorithm uses a row-stochastic matrix for the mixing of the decision variables, while it employs a column-stochastic matrix for tracking the average gradients. Although motivated by a fully decentralized scheme, we show that Push-Pull can work both in fully decentralized networks and in two-tier networks.

Gossip-based communication protocols are popular choices for distributed computation due to their low communication costs [18], [19], [20], [21]. In the second part of this paper, we consider a random-gossip push-pull algorithm (G-Push-Pull) where at each iteration, an agent wakes up uniformly at random and communicates with one or two of its neighbors. Both Push-Pull and G-Push-Pull have different variants. We show that they all converge linearly to the optimal solution for strongly convex and smooth objective functions.

A. Related Work

Our emphasis in the literature review is on decentralized optimization, since our approach builds on a new understanding of decentralized consensus-based methods for directed communication networks. Most references, including [2], [3], [22], [23], [4], [6], [24], [25], [26], [27], [28], [29], restrict the underlying network connectivity structure or, more commonly, require doubly stochastic mixing matrices.

arXiv:1810.06653v4 [math.OC] 6 Feb 2020


The work in [2] has been the first to demonstrate the linear convergence of an ADMM-based decentralized optimization scheme. Reference [3] uses a gradient difference structure in the algorithm to provide the first-order decentralized optimization algorithm which is capable of achieving the typical convergence rates of a centralized gradient method, while references [22], [23] deal with second-order decentralized methods. By using Nesterov's acceleration, reference [4] has obtained a method whose convergence time scales linearly in the number of agents n, which is the best scaling with n currently known. More recently, for a class of so-termed dual friendly functions, papers [6], [24] have obtained an optimal decentralized consensus optimization algorithm whose dependency on the condition number¹ of the system's objective function achieves the best known scaling in the order of O(√κ). Work in [28], [29] investigates proximal-gradient methods which can tackle (1) with proximal friendly component functions. Paper [30] extends the work in [2] to handle asynchrony and delays. References [31], [32] considered a stochastic variant of problem (1) in asynchronous networks. A tracking technique has been recently employed to develop decentralized algorithms for tracking the average of the Hessian/gradient in second-order methods [23], allowing uncoordinated stepsizes [25], [26], handling non-convexity [27], and achieving linear convergence over time-varying graphs [9].

For directed graphs, to eliminate the need of constructing a doubly stochastic matrix in reaching consensus², reference [33] proposes the push-sum protocol. Reference [34] has been the first to propose a push-sum based distributed optimization algorithm for directed graphs. Then, based on the push-sum technique again, a decentralized subgradient method for time-varying directed graphs has been proposed and analyzed in [10]. Aiming to improve convergence for a smooth objective function and a fixed directed graph, the work in [13], [12] modifies the algorithm from [3] with the push-sum technique, thus providing a new algorithm which converges linearly for a strongly convex objective function on a static graph. However, the algorithm requires a careful selection of the stepsize, which may even be non-existent in some cases [13]. This stability issue has been resolved in [9] in a more general setting of time-varying directed graphs. The work in [14], [35] considers an algorithm that uses only row-stochastic mixing matrices and still achieves linear convergence over fixed directed graphs.

Simultaneously and independently, a paper [36] has proposed an algorithm that is similar to the synchronous variant proposed in this paper. By contrast, the work in [36] does not show that the algorithm unifies different architectures. Moreover, asynchronous or time-varying cases were not discussed therein.

¹The condition number of a smooth and strongly convex function is the ratio of its gradient Lipschitz constant to its strong convexity constant.

²Constructing a doubly stochastic matrix over a directed graph needs weight balancing, which requires an independent iterative procedure across the network; consensus is a basic coordination technique in decentralized optimization.

B. Main Contribution

The main contribution of this paper is threefold. First, we design new distributed optimization methods (Push-Pull and G-Push-Pull) and their many variants for directed graphs. These methods utilize two different graphs for the information exchange among agents, and as such, unify different computation and communication architectures, including decentralized (peer-to-peer), centralized (master-slave), and semi-centralized (leader-follower) architecture. To the best of our knowledge, these are the first algorithms in the literature that enjoy such a property.

Second, we establish the linear convergence of the proposed methods in both synchronous (Push-Pull) and asynchronous random-gossip (G-Push-Pull) settings. In particular, G-Push-Pull is the first class of gossip-type algorithms for distributed optimization over directed graphs.

Finally, in our proposed methods each agent in the network is allowed to use a different nonnegative stepsize, and only one such stepsize needs to be positive. This is a unique feature compared to the existing literature (e.g., [9], [14]).

Some of the results related to a variant of Push-Pull will appear in Proceedings of the 57th IEEE Conference on Decision and Control [1]. In contrast, the current work analyzes a different, more communication-efficient variant of Push-Pull, adopts an uncoordinated stepsize policy which generalizes the scheme in [1], and additionally introduces G-Push-Pull. It also contains detailed proofs omitted from the conference version.

C. Organization of the Paper

The structure of this paper is as follows. We first provide notation and state basic assumptions in Subsection I-D. Then we introduce the push-pull gradient method in Section II, along with the intuition behind its design and some examples explaining how it relates to (semi-)centralized and decentralized optimization. We establish the linear convergence of the push-pull algorithm in Section III. In Section IV we introduce the random-gossip push-pull method (G-Push-Pull) and demonstrate its linear convergence in Section V. In Section VI we conduct numerical experiments to verify our theoretical claims. Concluding remarks are given in Section VII.

D. Notation and Assumption

Throughout the paper, vectors default to columns if not otherwise specified. Let N = {1, 2, . . . , n} be the set of agents. Each agent i ∈ N holds a local copy x_i ∈ R^p of the decision variable and an auxiliary variable y_i ∈ R^p tracking the average gradients, where their values at iteration k are denoted by x_{i,k} and y_{i,k}, respectively. Let

    x := [x_1, x_2, . . . , x_n]ᵀ ∈ R^{n×p},   y := [y_1, y_2, . . . , y_n]ᵀ ∈ R^{n×p}.

Define F(x) to be an aggregate objective function of the local variables, i.e., F(x) := ∑_{i=1}^n f_i(x_i), and write

    ∇F(x) := [∇f_1(x_1), ∇f_2(x_2), . . . , ∇f_n(x_n)]ᵀ ∈ R^{n×p}.

We use the symbol tr{·} to denote the trace of a square matrix.

Page 3: Push-Pull Gradient Methods for Distributed …1 Push-Pull Gradient Methods for Distributed Optimization in Networks Shi Pu, Wei Shi, Jinming Xu, and Angelia Nedic´ Abstract—In this

3

Definition 1. Given an arbitrary vector norm ‖·‖ on R^n, for any x ∈ R^{n×p}, we define

    ‖x‖ := ‖ [ ‖x^(1)‖, ‖x^(2)‖, . . . , ‖x^(p)‖ ] ‖₂,

where x^(1), x^(2), . . . , x^(p) ∈ R^n are the columns of x, and ‖·‖₂ represents the 2-norm.
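As a concrete check, the composite norm of Definition 1 can be sketched in a few lines; the helper name `mixed_norm` and the use of NumPy are our own illustrative choices. When the per-column norm is itself the 2-norm, the definition collapses to the Frobenius norm, which gives a quick sanity test:

```python
import numpy as np

def mixed_norm(x, vec_norm=lambda v: np.linalg.norm(v, 2)):
    """Definition 1: apply `vec_norm` to each column of the n-by-p
    matrix x, then take the 2-norm of the resulting p-vector."""
    col_norms = np.array([vec_norm(x[:, j]) for j in range(x.shape[1])])
    return np.linalg.norm(col_norms, 2)

# With the plain 2-norm on columns, sqrt(sum of squared column norms)
# is exactly the Frobenius norm of x.
x = np.arange(6.0).reshape(3, 2)
assert np.isclose(mixed_norm(x), np.linalg.norm(x, "fro"))
```

The point of the extra `vec_norm` argument is that the analysis later uses weighted vector norms (the ‖·‖_R and ‖·‖_C of Section III) in place of the 2-norm on columns.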

We make the following assumption on the functions f_i in (1).

Assumption 1. Each f_i is µ-strongly convex and its gradient is L-Lipschitz continuous, i.e., for any x, x′ ∈ R^p,

    ⟨∇f_i(x) − ∇f_i(x′), x − x′⟩ ≥ µ‖x − x′‖₂²,
    ‖∇f_i(x) − ∇f_i(x′)‖₂ ≤ L‖x − x′‖₂.

Under Assumption 1, there exists a unique optimal solution x* ∈ R^p to problem (1).

We use directed graphs to model the interaction topology among agents. A directed graph (digraph) is a pair G = (N, E), where N is the set of vertices (nodes) and the edge set E ⊆ N × N consists of ordered pairs of vertices. If there is a directed edge from node i to node j in G, i.e., (i, j) ∈ E, then i is defined as the parent node and j is defined as the child node. Information can be transmitted from the parent node to the child node directly. A directed path in graph G is a sequence of edges (i, j), (j, k), (k, l), . . .. Graph G is called strongly connected if there is a directed path between any pair of distinct vertices. A directed tree is a digraph where every vertex, except for the root, has exactly one parent. A spanning tree of a digraph is a directed tree that connects the root to all other vertices in the graph. A subgraph S of graph G is a graph whose vertex set and edge set are subsets of those of G (see [37]).

Given a nonnegative matrix³ M = [m_ij] ∈ R^{n×n}, the digraph induced by the matrix M is denoted by G_M = (N, E_M), where N = {1, 2, . . . , n} and (j, i) ∈ E_M iff (if and only if) m_ij > 0. We let R_M denote the set of roots of all possible spanning trees in the graph G_M. For an arbitrary agent i ∈ N, we define its in-neighbor set N^in_{M,i} as the collection of all individual agents that i can actively and reliably pull data from; we also define its out-neighbor set N^out_{M,i} as the collection of all individual agents that can passively and reliably receive data from i. When a set is time-varying, we add a further subscript to indicate the resulting sequence of sets; for example, N^in_{M,i,k} is the in-neighbor set of i at time/iteration k.

II. A PUSH-PULL GRADIENT METHOD

To proceed, we first illustrate and highlight the proposed algorithm, which we call Push-Pull, in the following (Algorithm 1).

Algorithm 1 (Push-Pull) can be rewritten in the following aggregated form:

    x_{k+1} = R(x_k − αy_k),                        (2a)
    y_{k+1} = Cy_k + ∇F(x_{k+1}) − ∇F(x_k),         (2b)

³A matrix is nonnegative if all its elements are nonnegative.

Algorithm 1: Push-Pull
Each agent i chooses its local step size α_i ≥ 0, in-bound mixing/pulling weights R_ij ≥ 0 for all j ∈ N^in_{R,i}, and out-bound pushing weights C_li ≥ 0 for all l ∈ N^out_{C,i};
Each agent i initializes with an arbitrary x_{i,0} ∈ R^p and y_{i,0} = ∇f_i(x_{i,0});
for k = 0, 1, . . . do
    for each i ∈ N,
        agent i pulls (x_{j,k} − α_j y_{j,k}) from each j ∈ N^in_{R,i};
        agent i pushes C_li y_{i,k} to each l ∈ N^out_{C,i};
    for each i ∈ N,
        x_{i,k+1} = ∑_{j=1}^n R_ij (x_{j,k} − α_j y_{j,k});
        y_{i,k+1} = ∑_{j=1}^n C_ij y_{j,k} + ∇f_i(x_{i,k+1}) − ∇f_i(x_{i,k});
end for
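To make the aggregated form (2a)-(2b) concrete, the following is a minimal NumPy sketch on a toy instance: n quadratic objectives over a directed ring, with non-uniform self-weights chosen so that R is row-stochastic but not column-stochastic (and C the reverse). All numbers here (weights, stepsize, iteration count) are illustrative choices of ours, not values from the paper:

```python
import numpy as np

# Toy setup: f_i(x) = 0.5 * (x - b_i)^2 with p = 1, so the minimizer
# of the sum is the average of the b_i.
n = 4
b = np.array([1.0, 2.0, 3.0, 4.0])
grad = lambda x: x - b                 # stacked local gradients

# Directed ring: agent i pulls from i-1 and pushes to i+1, with
# non-uniform self-weights so that R and C are NOT doubly stochastic.
w = np.array([0.5, 0.6, 0.7, 0.8])
R = np.zeros((n, n)); C = np.zeros((n, n))
for i in range(n):
    R[i, i] = w[i]; R[i, (i - 1) % n] = 1 - w[i]   # rows sum to 1
    C[i, i] = w[i]; C[(i + 1) % n, i] = 1 - w[i]   # columns sum to 1

alpha = 0.05 * np.ones(n)              # uncoordinated (here equal) stepsizes
x = np.zeros(n)
y = grad(x)                            # y_{i,0} = grad f_i(x_{i,0})
for _ in range(2000):
    x_next = R @ (x - alpha * y)                   # (2a): pull step
    y = C @ y + grad(x_next) - grad(x)             # (2b): push step
    x = x_next

assert np.allclose(x, b.mean(), atol=1e-6)         # consensus on x* = 2.5
```

The final assertion checks that every agent reaches the global minimizer even though neither mixing matrix is doubly stochastic, which is precisely the regime Push-Pull is designed for.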

where α = diag{α_1, α_2, . . . , α_n} is a nonnegative diagonal matrix and R = [R_ij], C = [C_ij] ∈ R^{n×n}. We make the following assumption on the matrices R and C.

Assumption 2. The matrix R ∈ R^{n×n} is nonnegative row-stochastic and C ∈ R^{n×n} is nonnegative column-stochastic, i.e., R1 = 1 and 1ᵀC = 1ᵀ. In addition, the diagonal entries of R and C are positive, i.e., R_ii > 0 and C_ii > 0 for all i ∈ N.

As a result of C being column-stochastic, we have by induction that

    (1/n) 1ᵀy_k = (1/n) 1ᵀ∇F(x_k),  ∀k.    (3)

Relation (3) is critical for (a subset of) the agents to track the average gradient 1ᵀ∇F(x_k)/n through the y-update.
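Relation (3) is easy to verify numerically: it depends only on C being column-stochastic and on the initialization y_0 = ∇F(x_0), not on how x is updated. A small sketch with a random column-stochastic C and a toy gradient (all choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3
# Any nonnegative column-stochastic C preserves the invariant (3);
# here we simply normalize a random nonnegative matrix column-wise.
C = rng.random((n, n))
C /= C.sum(axis=0, keepdims=True)

grad = lambda x: x - np.arange(n)[:, None]   # toy gradients, one row per agent
x = rng.standard_normal((n, p))
y = grad(x)                                   # y_0 = grad F(x_0)

for _ in range(10):
    x_next = x - 0.1 * y       # any x-update works; (3) does not depend on it
    y = C @ y + grad(x_next) - grad(x)
    x = x_next
    # Column sums of y equal column sums of grad F(x) at every iteration,
    # since 1^T C = 1^T makes the C-mixing sum-preserving.
    assert np.allclose(y.sum(axis=0), grad(x).sum(axis=0))
```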

Remark 1. At each iteration, each agent "pushes" information about the gradients to its out-neighbors and "pulls" the decision variables from its in-neighbors, respectively. Each component of y_k plays the role of tracking the average gradient using the column-stochastic matrix C, while each component of x_k performs optimization seeking by average consensus using the row-stochastic matrix R. The structure of Algorithm 1 resembles that of the gradient tracking methods in [9], [25], with the doubly stochastic matrix split into a row-stochastic matrix and a column-stochastic matrix. Such an asymmetric R-C structure has already been used in the literature for achieving average consensus [38]. However, the proposed optimization algorithm cannot be interpreted as a linear system, since the gradient terms introduce nonlinear dynamics.

We now give the condition on the structures of the graphs G_R and G_{Cᵀ} induced by the matrices R and Cᵀ, respectively. Note that G_{Cᵀ} is identical to the graph G_C with all its edges reversed.

Assumption 3. The graphs G_R and G_{Cᵀ} each contain at least one spanning tree. Moreover, there exists at least one node that is a root of spanning trees for both G_R and G_{Cᵀ}, i.e., R_R ∩ R_{Cᵀ} ≠ ∅, where R_R (resp., R_{Cᵀ}) is the set of roots of all possible spanning trees in the graph G_R (resp., G_{Cᵀ}).


Assumption 3 is weaker than requiring that both G_R and G_C are strongly connected, which was assumed in most previous works (e.g., [9], [14], [36]). This relaxation offers us more flexibility in designing the graphs G_R and G_C. For instance, suppose that we have a strongly connected communication graph G. Then there are multiple ways to construct G_R and G_C satisfying Assumption 3. One trivial approach is to set G_R = G_C = G. Another way is to pick a node i_r ∈ N at random and let G_R (resp., G_C) be a spanning tree (resp., reversed spanning tree) contained in G with i_r as its root. Once the graphs G_R and G_C are established, the matrices R and C can be designed accordingly.

Remark 2. There are different ways for each agent to design the weights in R and C locally. For example, each agent i may choose R_ij = 1/(|N^in_{R,i}| + c_R) for some constant c_R > 0 for all j ∈ N^in_{R,i}, and let R_ii = 1 − ∑_{j ∈ N^in_{R,i}} R_ij. Similarly, agent i can choose C_li = 1/(|N^out_{C,i}| + c_C) for some constant c_C > 0 for all l ∈ N^out_{C,i}, and let C_ii = 1 − ∑_{l ∈ N^out_{C,i}} C_li. Such a choice of mixing weights renders R row-stochastic and C column-stochastic, thus satisfying Assumption 2.
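The local rule of Remark 2 is straightforward to implement. In the sketch below, `in_nbrs[i]` and `out_nbrs[i]` are hypothetical adjacency lists (in- and out-neighbors of agent i, excluding i itself), and `build_R_C` is our own helper name; the constants c_R and c_C follow the remark:

```python
import numpy as np

def build_R_C(in_nbrs, out_nbrs, c_R=1.0, c_C=1.0):
    """Build R (row-stochastic) and C (column-stochastic) from
    adjacency lists, following the local weight rule of Remark 2."""
    n = len(in_nbrs)
    R = np.zeros((n, n)); C = np.zeros((n, n))
    for i in range(n):
        for j in in_nbrs[i]:
            R[i, j] = 1.0 / (len(in_nbrs[i]) + c_R)
        R[i, i] = 1.0 - R[i].sum()            # R_ii makes row i sum to 1
        for l in out_nbrs[i]:
            C[l, i] = 1.0 / (len(out_nbrs[i]) + c_C)
        C[i, i] = 1.0 - C[:, i].sum()         # C_ii makes column i sum to 1
    return R, C

# Directed ring on 4 nodes: agent i pulls from i-1 and pushes to i+1.
in_nbrs  = [[(i - 1) % 4] for i in range(4)]
out_nbrs = [[(i + 1) % 4] for i in range(4)]
R, C = build_R_C(in_nbrs, out_nbrs)
assert np.allclose(R.sum(axis=1), 1) and np.allclose(C.sum(axis=0), 1)
assert (np.diag(R) > 0).all() and (np.diag(C) > 0).all()
```

Since c_R, c_C > 0, the leftover mass placed on the diagonal is strictly positive, so the positive-diagonal requirement of Assumption 2 is met automatically.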

We have the following result from Assumption 2 and Assumption 3.

Lemma 1. Under Assumption 2 and Assumption 3, the matrix R has a unique nonnegative left eigenvector uᵀ (w.r.t. eigenvalue 1) with uᵀ1 = n, and the matrix C has a unique nonnegative right eigenvector v (w.r.t. eigenvalue 1) with 1ᵀv = n (see [39]). Moreover, the eigenvector uᵀ (resp., v) is nonzero only on the entries associated with agents i ∈ R_R (resp., j ∈ R_{Cᵀ}), and uᵀv > 0.

Proof. See Appendix A-A.
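Lemma 1 can be sanity-checked numerically. The sketch below uses an illustrative strongly connected directed ring, so both root sets equal all of N and u, v are strictly positive; `eigvec_for_one` is our own helper extracting the eigenvector for eigenvalue 1 with the lemma's normalization:

```python
import numpy as np

n = 4
w = np.array([0.5, 0.6, 0.7, 0.8])     # illustrative self-weights
R = np.zeros((n, n)); C = np.zeros((n, n))
for i in range(n):
    R[i, i] = w[i]; R[i, (i - 1) % n] = 1 - w[i]   # row-stochastic
    C[i, i] = w[i]; C[(i + 1) % n, i] = 1 - w[i]   # column-stochastic

def eigvec_for_one(M):
    """Eigenvector of M for eigenvalue 1, scaled so its entries sum to n."""
    vals, vecs = np.linalg.eig(M)
    vec = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    return vec * len(vec) / vec.sum()

u = eigvec_for_one(R.T)   # left eigenvector of R:  u^T R = u^T, u^T 1 = n
v = eigvec_for_one(C)     # right eigenvector of C: C v = v,    1^T v = n
assert np.allclose(u @ R, u) and np.allclose(C @ v, v)
# Strong connectivity makes every node a root, so u, v are positive
# everywhere and u^T v > 0, as the lemma requires.
assert (u > 0).all() and (v > 0).all() and u @ v > 0
```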

Finally, we assume the following condition regarding the step sizes α_i.

Assumption 4. There is at least one agent i ∈ R_R ∩ R_{Cᵀ} whose step size α_i is positive.

Assumption 3 and Assumption 4 hint at the crucial role of the set R_R ∩ R_{Cᵀ}. In what follows, we provide some intuition for the development of Push-Pull and an interpretation of the algorithm from another perspective. The discussion will shed light on the rationale behind the assumptions.

To motivate the development of Push-Pull, let us consider the optimality condition for (1) in the following form:

    x* ∈ null{I − R},      (4a)
    1ᵀ∇F(x*) = 0,          (4b)

where x* := 1x*ᵀ and R satisfies Assumption 2. Consider the algorithm in (2). Suppose that the algorithm produces two sequences {x_k} and {y_k} converging to some points x∞ and y∞, respectively. Then from (2a) and (2b) we would have

    (I − R)(x∞ − αy∞) + αy∞ = 0,    (5a)
    (I − C)y∞ = 0.                   (5b)

If span{I − R} and α · null{I − C} are disjoint⁴, from (5) we would have x∞ ∈ null{I − R} and αy∞ = 0. Hence x∞ satisfies the optimality condition in (4a). In light of (5b), Assumption 4, and Lemma 1, we have y∞ ∈ null{α} ∩ null{I − C} = {0}. Then from (3) we know that 1ᵀ∇F(x∞) = 1ᵀy∞ = 0, which is exactly the optimality condition in (4b).

For another interpretation of Push-Pull, notice that under Assumptions 2 and 3, with linear rates of convergence,

    lim_{k→∞} R^k = 1uᵀ/n,   lim_{k→∞} C^k = v1ᵀ/n.    (6)

Thus with comparatively small step sizes, relation (6) together with (3) implies that x_k ≃ 1uᵀx_{k−K}/n (for some fixed K > 0) and y_k ≃ v1ᵀ∇F(x_k)/n. From the proof of Lemma 1, the eigenvector u (resp., v) is nonzero only on the entries associated with agents i ∈ R_R (resp., j ∈ R_{Cᵀ}). Hence x_k ≃ 1uᵀx_{k−K}/n indicates that only the state information of agents i ∈ R_R is pulled by the entire network, and y_k ≃ v1ᵀ∇F(x_k)/n implies that only agents j ∈ R_{Cᵀ} are pushed and tracking the average gradients. This "push" and "pull" information structure gives the algorithm its name. The assumption R_R ∩ R_{Cᵀ} ≠ ∅ essentially says that at least one agent needs to be both "pulled" and "pushed".
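Relation (6) can be checked directly on the star-network matrices given later in Section II-A, where u = v = (n, 0, 0, 0)ᵀ is supported only on the central node (for that particular example, C happens to equal Rᵀ):

```python
import numpy as np

# Star-network mixing matrices from Section II-A (node 1 is the center).
R = np.array([[1.0, 0, 0, 0],
              [0.5, 0.5, 0, 0],
              [0.5, 0, 0.5, 0],
              [0.5, 0, 0, 0.5]])
C = R.T                       # in this example C coincides with R^T
n = 4
u = v = np.array([4.0, 0, 0, 0])   # left/right eigenvectors for eigenvalue 1

# Powers of R and C converge geometrically to the rank-one limits in (6):
Rk = np.linalg.matrix_power(R, 50)
Ck = np.linalg.matrix_power(C, 50)
assert np.allclose(Rk, np.outer(np.ones(n), u) / n, atol=1e-9)
assert np.allclose(Ck, np.outer(v, np.ones(n)) / n, atol=1e-9)
```

Every row of R^k converges to uᵀ/n = (1, 0, 0, 0), reflecting that the whole network pulls node 1's state, and every column of C^k converges to v/n, reflecting that only node 1 aggregates the pushed gradient information.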

The above discussion has mathematically interpreted why the use of row-stochastic and column-stochastic matrices is reasonable. Now let us explain, from the implementation aspect, why Algorithm 1 is called "Push-Pull" and why it is more feasible to implement with "push" and "pull" at the same time. When information needs to be diffused/fused across agents, either an agent needs to know what scaling weights to put on the quantities it sends out to other agents, or it needs to know how to combine the incoming quantities with correct weights. In particular, we have the following specific weight assignment strategies:
A) For the networked system to maintain ∑_l C_li = 1, an apparently convenient way is to let agent i scale its data by C_li, ∀l, before sending/pushing out messages. In this way, it becomes agent i's responsibility to synchronize out-neighbors' receptions of messages, and it is natural to employ a reliable push-communication-protocol to implement such operations.
B) Unlike what happens in A), to maintain ∑_j R_ij = 1, the only seemingly feasible way is to let the receiver i perform the tasks of scaling and combination/addition, since it would be difficult for the sender to know or adjust the weights accordingly, especially when the network changes. Thus, it is natural to employ a pull-communication-protocol for the above operations.

A. Unifying Different Distributed Computational Architectures

We now demonstrate how the proposed algorithm (2) unifies different types of distributed architecture, including decentralized, centralized, and semi-centralized architecture. For the fully decentralized case, suppose we have a graph G that is

⁴This is a consequence of Assumption 4 and the relation uᵀv > 0 from Lemma 1.


undirected and connected. Then we can set G_R = G_C = G and let R = C be symmetric matrices, in which case the proposed algorithm reduces to the one considered in [9], [25]; if the graph is directed and strongly connected, we can also let G_R = G_C = G and design the weights for R and C correspondingly.

To illustrate the less straightforward situation of (semi-)centralized networks, let us give a simple example. Consider a four-node star network composed of {1, 2, 3, 4}, where node 1 is situated at the center and nodes 2, 3, and 4 are (bidirectionally) connected with node 1 but not connected to each other. In this case, the matrices R and C in our algorithm can be chosen as

    R = [ 1    0    0    0   ]      C = [ 1    0.5  0.5  0.5 ]
        [ 0.5  0.5  0    0   ]          [ 0    0.5  0    0   ]
        [ 0.5  0    0.5  0   ]          [ 0    0    0.5  0   ]
        [ 0.5  0    0    0.5 ],         [ 0    0    0    0.5 ].

For a graphical illustration, the corresponding network topologies of G_R and G_C are shown in Fig. 1. The central node

Fig. 1. On the left is the graph GR and on the right is the graph GC.

1's information regarding x_{1,k} is pulled by the neighbors (the entire network in this case) through G_R; the others only passively infuse the information from node 1. At the same time, node 1 is pushed information regarding y_{i,k} (i = 2, 3, 4) from the neighbors through G_C; the other nodes only actively comply with the requests from node 1. This motivates the algorithm's name, push-pull gradient method. Although nodes 2, 3, and 4 update their y_i's accordingly, these quantities do not have to contribute to the optimization procedure and will die out geometrically fast due to the weights in the last three rows of C. Consequently, in this special case, the local stepsizes α_i for agents 2, 3, and 4 can be set to 0. Without loss of generality, suppose f_1(x) = 0, ∀x. Then the algorithm becomes a typical centralized algorithm for minimizing ∑_{i=2}^4 f_i(x), where the master node 1 utilizes the slave nodes 2, 3, and 4 to compute the gradient information in a distributed way.
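This centralized special case can be sketched directly with the star-network R and C given above; the b_i, the stepsize, and the iteration count below are illustrative choices of ours. With f_1 ≡ 0 and only α_1 > 0, node 1 acts as the master while nodes 2, 3, 4 supply gradients:

```python
import numpy as np

# Star-network matrices from the example above.
R = np.array([[1.0, 0, 0, 0], [0.5, 0.5, 0, 0],
              [0.5, 0, 0.5, 0], [0.5, 0, 0, 0.5]])
C = np.array([[1.0, 0.5, 0.5, 0.5], [0, 0.5, 0, 0],
              [0, 0, 0.5, 0], [0, 0, 0, 0.5]])

# f_1 = 0 and f_i(x) = 0.5 * (x - b_i)^2 for the slaves i = 2, 3, 4.
b = np.array([0.0, 2.0, 4.0, 6.0])
grad = lambda x: np.where(np.arange(4) == 0, 0.0, x - b)  # grad f_1 = 0
alpha = np.array([0.1, 0.0, 0.0, 0.0])     # only the master's stepsize > 0

x = np.zeros(4)
y = grad(x)
for _ in range(400):
    x_next = R @ (x - alpha * y)           # slaves simply pull from node 1
    y = C @ y + grad(x_next) - grad(x)     # node 1 aggregates pushed grads
    x = x_next

# All agents agree on the minimizer of f_2 + f_3 + f_4, i.e. (2+4+6)/3 = 4.
assert np.allclose(x, 4.0, atol=1e-6)
```

Here y_1 effectively tracks the full gradient sum (the slaves' y_i decay by the factor 0.5 in C), so node 1 performs what amounts to a centralized gradient descent while the slaves merely copy its iterate.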

Taking the above as an example for explaining the semi-centralized case, it is worth noting that node 1 can be replaced by a strongly connected subnet in G_R and G_C, respectively. Correspondingly, nodes 2, 3, and 4 can all be replaced by subnets, as long as the information from the master layer in these subnets can be diffused to all the slave-layer agents in G_R, while the information from all the slave-layer agents can be diffused to the master layer in G_C. Specific requirements on the connectivity of slave subnets can be understood by using the concept of rooted trees. We refer to nodes as leaders if their roles in the network are similar to the role of node 1; the other nodes are termed followers. Note that after the replacement of the individual nodes by subnets, the network structure within each subnet is decentralized, while the relationship between the leader subnet and the follower subnets is master-slave. This is why we refer to such an architecture as semi-centralized.

Remark 3 (A class of Push-Pull algorithms). There can be multiple variants of the proposed algorithm depending on whether the Adapt-then-Combine (ATC) strategy [40] is used in the x-update and/or the y-update (see Remark 3 in [9] for more details). For readability, we illustrate only one algorithm in Algorithm 1 and call it Push-Pull above. We also use "Push-Pull" generically to refer to this class of algorithms, regardless of whether the ATC structure is used, when no confusion arises. Our forthcoming analysis can be adapted to these variants; our numerical tests in Section VI involve only some of them.
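The basic (non-ATC) synchronous update, xk+1 = R(xk − αyk), yk+1 = Cyk + ∇F(xk+1) − ∇F(xk), can be sketched in a few lines of NumPy. The sketch below runs it on the 4-node example above with illustrative quadratic local costs fi(x) = 0.5(x − bi)²; the data b and the stepsize α are placeholders, not values from the paper:

```python
import numpy as np

# Push-Pull on the 4-node master-slave example:
#   x_{k+1} = R (x_k - alpha * y_k)
#   y_{k+1} = C y_k + grad F(x_{k+1}) - grad F(x_k)
R = np.array([[1, 0, 0, 0], [.5, .5, 0, 0], [.5, 0, .5, 0], [.5, 0, 0, .5]])
C = np.array([[1, .5, .5, .5], [0, .5, 0, 0], [0, 0, .5, 0], [0, 0, 0, .5]])
b = np.array([1.0, 2.0, 3.0, 4.0])   # f_i(x) = 0.5 * (x - b_i)^2, scalar decision variable
grad = lambda x: x - b               # row i holds grad f_i evaluated at agent i's iterate

alpha = 0.05
x = np.zeros(4)
y = grad(x)                          # y_0 = grad F(x_0) initializes gradient tracking
for _ in range(1500):
    x_next = R @ (x - alpha * y)
    y = C @ y + grad(x_next) - grad(x)
    x = x_next

# All agents reach the minimizer of sum_i f_i, i.e., mean(b) = 2.5.
assert np.allclose(x, b.mean(), atol=1e-8)
```

Note that the invariant 1ᵀyk = 1ᵀ∇F(xk), enforced by the column-stochasticity of C and the initialization y0 = ∇F(x0), is what pins the consensual fixed point to the global minimizer.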

III. CONVERGENCE ANALYSIS FOR PUSH-PULL

In this section, we study the convergence properties of the proposed algorithm. We first define the following variables:

x̄k := (1/n)uᵀxk,  ȳk := (1/n)1ᵀyk.

Our strategy is to bound ‖x̄k+1 − x∗‖2, ‖xk+1 − 1x̄k+1‖R and ‖yk+1 − vȳk+1‖C in terms of linear combinations of their previous values, where ‖·‖R and ‖·‖C are specific norms to be defined later. In this way we establish a linear system of inequalities which allows us to derive the convergence results. The proof technique was inspired by [5], [14].

Before diving into the detailed analysis, we present the main convergence result for the Push-Pull algorithm in (2) in the following theorem.

Theorem 1. Suppose Assumptions 1-3 hold, α′ ≥ Mᾱ for some M > 0, and

ᾱ ≤ min { 2c3 / (c2 + √(c2² + 4c1c3)),  (1 − σC) / (2c0δC,2‖R‖2L) },   (7)

where c1, c2, c3 are given in (18)-(20), c0 := ‖I − v1ᵀ/n‖C, and σC and δC,2 are defined in Lemma 4 and Lemma 6, respectively. Then, the quantities ‖x̄k − x∗‖2, ‖xk − 1x̄k‖R and ‖yk − vȳk‖C all converge to 0 at the linear rate O(ρ(A)^k) with ρ(A) < 1, where ρ(A) denotes the spectral radius of the matrix A defined in (15).

Remark 4. Note that

α′ = (1/n)uᵀαv = ∑_{i∈RR∩RCᵀ} (1/n)u_i v_i α_i.

The condition α′ ≥ Mᾱ is automatically satisfied for a fixed M in various situations. For example, if max_{i∈N} αi = max_{i∈RR∩RCᵀ} αi (which is always true when both GR and GCᵀ are strongly connected), we can take M = u_j v_j / n for j = arg max_{i∈RR∩RCᵀ} αi. For another example, if all the αi are equal, then M = uᵀv/n.

In general, the constant M roughly measures the ratio between the stepsizes used by the agents in RR ∩ RCᵀ and those used by all the agents. According to condition (7) and definitions (18)-(20), a smaller M leads to a tighter upper bound on the maximum stepsize ᾱ.

Remark 5. The upper bound on the stepsizes can be exactly calculated from (7), assuming the weight matrices R and C are known. Note that since our proof technique is similar to those based on the small-gain theorem (see, e.g., [5], [9]), the upper bound obtained here may be conservative; it remains open how to develop new analytical tools that yield a tighter bound. It is worth mentioning, however, that in our numerical simulations, Push-Pull always allows for a much larger range of stepsize choices than other distributed optimization algorithms applicable to directed graphs.

Remark 6. When ᾱ is sufficiently small, it can be shown that ρ(A) ≈ 1 − α′µ, in which case the Push-Pull algorithm is comparable to the centralized gradient descent method with stepsize α′.⁵ Since α′ = ∑_{i∈RR∩RCᵀ} (1/n)u_i v_i α_i, only the stepsizes of the agents i ∈ RR ∩ RCᵀ contribute to the convergence speed of Push-Pull. This corresponds to our discussion in Section II-A.

A. Preliminary Analysis

From the algorithm (2) and Lemma 1, we have

x̄k+1 = (1/n)uᵀR(xk − αyk) = x̄k − (1/n)uᵀαyk,   (8)

and

ȳk+1 = (1/n)1ᵀ(Cyk + ∇F(xk+1) − ∇F(xk)) = ȳk + (1/n)1ᵀ(∇F(xk+1) − ∇F(xk)).   (9)

Let us further define gk := (1/n)1ᵀ∇F(1x̄k). Then, we obtain from relation (8) that

x̄k+1 = x̄k − (1/n)uᵀα(yk − vȳk + vȳk)
     = x̄k − (1/n)uᵀαvȳk − (1/n)uᵀα(yk − vȳk)
     = x̄k − α′gk − α′(ȳk − gk) − (1/n)uᵀα(yk − vȳk),   (10)

where

α′ := (1/n)uᵀαv.   (11)

We will show later that Assumptions 3 and 4 ensure α′ > 0. In view of (2) and Lemma 1, using (8) we have

xk+1 − 1x̄k+1 = R(xk − αyk) − 1x̄k + (1/n)1uᵀαyk
            = R(xk − 1x̄k) − (R − 1uᵀ/n)αyk
            = (R − 1uᵀ/n)(xk − 1x̄k) − (R − 1uᵀ/n)αyk,   (12)

⁵The proof of this statement is similar to that of Corollary 1 in [32] and is omitted for conciseness.

and from (9) we obtain

yk+1 − vȳk+1 = Cyk − vȳk + (I − v1ᵀ/n)(∇F(xk+1) − ∇F(xk))
            = (C − v1ᵀ/n)(yk − vȳk) + (I − v1ᵀ/n)(∇F(xk+1) − ∇F(xk)).   (13)

B. Supporting Lemmas

Before proceeding to prove the main result in Theorem 1, we state a few useful lemmas.

Lemma 2. Under Assumption 1, there holds

‖ȳk − gk‖2 ≤ (L/√n)‖xk − 1x̄k‖2,  ‖gk‖2 ≤ L‖x̄k − x∗‖2.

In addition, when α′ ≤ 2/(µ + L), we have

‖x̄k − α′gk − x∗‖2 ≤ (1 − α′µ)‖x̄k − x∗‖2, ∀k.

Proof. See Appendix A-B.
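The last inequality is the standard contraction property of one gradient step on a µ-strongly convex, L-smooth function. It can be checked numerically on a quadratic example (a sketch with an illustrative matrix A and point x_star, neither taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, L = 0.5, 2.0
# Average gradient g(x) = A (x - x_star), with the spectrum of A inside [mu, L].
eigs = rng.uniform(mu, L, size=5)
eigs[0], eigs[-1] = mu, L                      # pin the extreme eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag(eigs) @ Q.T
x_star = rng.standard_normal(5)
g = lambda x: A @ (x - x_star)

alpha = 2.0 / (mu + L)                          # largest stepsize covered by Lemma 2
for _ in range(100):
    x = x_star + rng.standard_normal(5)
    lhs = np.linalg.norm(x - alpha * g(x) - x_star)
    rhs = (1 - alpha * mu) * np.linalg.norm(x - x_star)
    # Contraction: one gradient step shrinks the distance by at least (1 - alpha * mu).
    assert lhs <= rhs + 1e-9
```

At α′ = 2/(µ + L) the contraction factor (1 − α′µ) = (L − µ)/(L + µ) is tight, which is why the lemma stops at this stepsize.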

Lemma 3. Suppose Assumptions 2-3 hold. Let ρR and ρC be the spectral radii of (R − 1uᵀ/n) and (C − v1ᵀ/n), respectively. Then, we have ρR < 1 and ρC < 1.

Proof. See Appendix A-C.

Lemma 4. There exist matrix norms ‖·‖R and ‖·‖C such that σR := ‖R − 1uᵀ/n‖R < 1, σC := ‖C − v1ᵀ/n‖C < 1, and σR and σC are arbitrarily close to ρR and ρC, respectively. In addition, given any diagonal matrix W ∈ Rn×n, we have ‖W‖R = ‖W‖C = ‖W‖2.

Proof. See [39, Lemma 5.6.10] and the discussions thereafter.

In the rest of this paper, with a slight abuse of notation, we do not distinguish between the vector norms on Rn and their induced matrix norms.

Lemma 5. Given an arbitrary norm ‖·‖, for any W ∈ Rn×n and x ∈ Rn×p, we have ‖Wx‖ ≤ ‖W‖‖x‖. For any w ∈ Rn×1 and x ∈ R1×p, we have ‖wx‖ = ‖w‖‖x‖2.

Proof. See Appendix A-D.

Lemma 6. There exist constants δC,R, δC,2, δR,C, δR,2 > 0 such that for all x ∈ Rn×p, we have ‖x‖C ≤ δC,R‖x‖R, ‖x‖C ≤ δC,2‖x‖2, ‖x‖R ≤ δR,C‖x‖C, and ‖x‖R ≤ δR,2‖x‖2. In addition, with a proper rescaling of the norms ‖·‖R and ‖·‖C, we have ‖x‖2 ≤ ‖x‖R and ‖x‖2 ≤ ‖x‖C for all x.

Proof. The above result follows from the equivalence of all norms on finite-dimensional vector spaces and Definition 1.

The following critical lemma establishes a linear system of inequalities that bounds ‖x̄k+1 − x∗‖2, ‖xk+1 − 1x̄k+1‖R and ‖yk+1 − vȳk+1‖C.


Lemma 7. Under Assumptions 1-3, when α′ ≤ 2/(µ + L), we have the following linear system of inequalities:

[ ‖x̄k+1 − x∗‖2 ; ‖xk+1 − 1x̄k+1‖R ; ‖yk+1 − vȳk+1‖C ] ≤ A [ ‖x̄k − x∗‖2 ; ‖xk − 1x̄k‖R ; ‖yk − vȳk‖C ],   (14)

where the inequality is taken component-wise, and the elements of the transition matrix A = [aij] are given by

a11 = 1 − α′µ,   a12 = α′L/√n,   a13 = ᾱ‖u‖2/n,
a21 = ᾱσR‖v‖R L,   a22 = σR(1 + ᾱ‖v‖R L/√n),   a23 = ᾱσRδR,C,
a31 = ᾱc0δC,2‖R‖2‖v‖2 L²,   a32 = c0δC,2 L(‖R − I‖2 + ᾱ‖R‖2‖v‖2 L/√n),   a33 = σC + ᾱc0δC,2‖R‖2 L,   (15)

where ᾱ := max_i αi and c0 := ‖I − v1ᵀ/n‖C.

Proof. See Appendix A-E.

In light of Lemma 7, ‖x̄k − x∗‖2, ‖xk − 1x̄k‖R and ‖yk − vȳk‖C all converge to 0 linearly at the rate O(ρ(A)^k) if the spectral radius of A satisfies ρ(A) < 1. The next lemma provides some sufficient conditions for the relation ρ(A) < 1 to hold.
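The mechanism behind this step can be illustrated numerically: any componentwise system v_{k+1} ≤ A v_k with nonnegative entries and ρ(A) < 1 drives v_k to zero geometrically (a sketch with an arbitrary nonnegative test matrix, not the A from (15)):

```python
import numpy as np

# Worst case of the linear system of inequalities: equality v_{k+1} = A v_k.
A = np.array([[0.90, 0.05, 0.02],
              [0.10, 0.80, 0.05],
              [0.05, 0.10, 0.70]])
rho = max(abs(np.linalg.eigvals(A)))
assert rho < 1                         # spectral radius below one

v = np.ones(3)
for _ in range(200):
    v = A @ v
assert v.max() < 1e-2                  # all three error quantities have decayed
# Asymptotically, the per-step contraction factor approaches rho(A) componentwise.
assert np.allclose((A @ v) / v, rho, atol=1e-6)
```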

Lemma 8 ([32, Lemma 5]). Given a nonnegative, irreducible matrix M = [mij] ∈ R3×3 with mii < λ∗ for all i = 1, 2, 3 and some λ∗ > 0, a necessary and sufficient condition for ρ(M) < λ∗ is det(λ∗I − M) > 0.

C. Proof of Theorem 1

In light of Lemma 8, it suffices to ensure a11, a22, a33 < 1 and det(I − A) > 0, i.e.,

det(I − A) = (1 − a11)(1 − a22)(1 − a33) − a12a23a31 − a13a21a32 − (1 − a22)a13a31 − (1 − a11)a23a32 − (1 − a33)a12a21
= (1 − a11)(1 − a22)(1 − a33)
  − α′ᾱ²σRc0δR,CδC,2‖R‖2‖v‖2 L³/√n
  − ᾱ²σRc0δC,2‖u‖2‖v‖R (‖R − I‖2 + ᾱ‖R‖2‖v‖2 L/√n) L²/n
  − ᾱ²c0δC,2‖R‖2‖v‖2‖u‖2 (L²/n)(1 − a22)
  − ᾱσRc0δR,CδC,2 L (‖R − I‖2 + ᾱ‖R‖2‖v‖2 L/√n)(1 − a11)
  − α′ᾱσR‖v‖R (L²/√n)(1 − a33) > 0.   (16)

We now provide some sufficient conditions under which a11, a22, a33 < 1 and (16) holds true. First, a11 < 1 is ensured by choosing α′ ≤ 2/(µ + L). Suppose 1 − a22 ≥ (1 − σR)/2 and 1 − a33 ≥ (1 − σC)/2 under properly chosen stepsizes; these relations hold when

ᾱ ≤ min { (1 − σR)√n / (2σR‖v‖R L),  (1 − σC) / (2c0δC,2‖R‖2 L) }.   (17)

Second, a sufficient condition for det(I − A) > 0 is obtained by substituting (1 − a22) (resp., (1 − a33)) in (16) with (1 − σR)/2 (resp., (1 − σC)/2) and taking α′ = Mᾱ. We then have c1ᾱ² + c2ᾱ − c3 < 0, where

c1 = MσRc0δR,CδC,2‖R‖2‖v‖2 L³/√n + σRc0δC,2‖u‖2‖v‖R‖R‖2‖v‖2 L³/(n√n) + MµσRc0δR,CδC,2‖R‖2‖v‖2 L²/√n
   = σRc0δC,2‖R‖2‖v‖2 (L²/(n√n)) [ MδR,C n(L + µ) + ‖u‖2‖v‖R L ],   (18)

c2 = σRc0δC,2‖u‖2‖v‖R‖R − I‖2 L²/n + c0δC,2‖R‖2‖v‖2‖u‖2 (1 − σR)L²/(2n) + MσRc0δR,CδC,2‖R − I‖2 µL + (M/2)σR‖v‖R(1 − σC) L²/√n,   (19)

and

c3 = (M/4)(1 − σC)(1 − σR)µ.   (20)

Hence

ᾱ ≤ 2c3 / (c2 + √(c2² + 4c1c3)).   (21)

Relations (17) and (21), together with the fact that A is irreducible (a 3 × 3 matrix with all entries positive), yield the final bound on ᾱ.

IV. A GOSSIP-LIKE PUSH-PULL METHOD (G-PUSH-PULL)

In this section, we introduce a generalized random-gossip push-pull algorithm. We call it G-Push-Pull and outline it below (Algorithm 2).⁶

Algorithm 2 illustrates the G-Push-Pull algorithm. At each "time slot" k, it is possible in practice that multiple agents (entities equivalent to the agent "ik" employed in the algorithm) are activated/selected. This random selection process is realized by placing a Poisson clock on each agent. Whenever a node is awakened by itself or is push-notified (or pull-alerted), it is temporarily locked for the current paired update. We note that in this gossip version (Algorithm 2), only the push communication protocol is employed. Other variants that involve only the pull protocol, or both protocols, also exist. For instance, consider a 4-agent network connected as 1 → 2 → 3 → 4 → 1 and 1 → 3 (each arrow represents a unidirectional data link, and this digraph is not balanced/regular).⁷ If we are to design a pull-only gossip algorithm and agent 3 is updating (pulling information from agents 1 and 2) at time k, the mixing matrices can be designed/implemented as

Rk = [ 1    0    0    0
       0    1    0    0
       1/3  1/3  1/3  0
       0    0    0    1 ],   Ck = [ 1/2  0    0    0
                                    0    1/2  0    0
                                    1/2  1/2  1    0
                                    0    0    0    1 ].

From the third rows of Rk and Ck, we can see that agent 3 aggregates the pulled information (x1,k, x2,k, (1/2)y1,k, and (1/2)y2,k); from the first and second columns of Ck, we can observe that agents 1 and 2 are "sharing part of y"

⁶In the algorithm description, the multiplication sign "×" is added simply to avoid visual confusion; it still denotes the usual scalar-scalar or scalar-vector multiplication.

⁷For simplicity, we assume GR = GC in this example.


Algorithm 2: G-Push-Pull
Each agent i chooses its local stepsize αi ≥ 0;
Each agent i initializes with an arbitrary xi,0 ∈ Rp and yi,0 = ∇fi(xi,0);
for time slot k = 0, 1, . . . do
  agent ik is uniformly randomly selected from N;
  agent ik uniformly randomly chooses the set N^out_{R,ik,k}, a subset of its out-neighbors in GR at "time" k;
  agent ik sends xik,k to all members of N^out_{R,ik,k};
  every agent jk in N^out_{R,ik,k} generates γR,jk,k ∈ (0, 1);
  agent ik uniformly randomly chooses the set N^out_{C,ik,k}, a subset of its out-neighbors in GC at "time" k;
  agent ik sends γC,ik,k × yik,k to all members of N^out_{C,ik,k}, where γC,ik,k is generated at agent ik such that γC,ik,k |N^out_{C,ik,k}| < 1;
  xik,k+1 = xik,k − αik yik,k;
  for all jk ∈ N^out_{R,ik,k} and lk ∈ N^out_{C,ik,k} do
    if jk == lk
      xjk,k+1 = (1 − γR,jk,k)xjk,k + γR,jk,k xik,k − 2αjk yjk,k;
    else
      xjk,k+1 = (1 − γR,jk,k)xjk,k + γR,jk,k xik,k − αjk yjk,k;
      xlk,k+1 = xlk,k − αlk ylk,k;
    end if
  end for
  yik,k+1 = (1 − γC,ik,k |N^out_{C,ik,k}|) yik,k + ∇fik(xik,k+1) − ∇fik(xik,k);
  for all lk ∈ N^out_{C,ik,k} do
    ylk,k+1 = ylk,k + γC,ik,k × yik,k + ∇flk(xlk,k+1) − ∇flk(xlk,k);
  end for
  for all jk ∈ N^out_{R,ik,k} with jk ∉ N^out_{C,ik,k} do
    yjk,k+1 = yjk,k + ∇fjk(xjk,k+1) − ∇fjk(xjk,k);
  end for
end for

and rescaling their own y. The gossip mechanism allows, and is in favor of, a push-only or pull-only network, which differs from what we require for the general static network running Algorithm 1 (see the discussion right before Section II-A). This difference is due to the fact that in gossip algorithms, at each "iteration" k, only one or a few isolated trees of depth 1 are activated; thus trivial weight-assignment mechanisms exist for the Ck graph. For instance, in the above example with a 4-agent network, the chosen Ck could be generated by letting the agents being pulled simply "halve the y variable before using it or sending it out". This trick for gossip algorithms is difficult, if not impossible, to implement in a synchronized network with other general topologies.

In the following, to keep the convergence analysis concise, we further restrict attention to the situation where αi = α > 0 for all i ∈ N, |N^out_{C,ik,k}| ≤ 1, |N^out_{R,ik,k}| ≤ 1, and γC,ik,k = γR,ik,k = γ for all ik ∈ N and all k = 0, 1, . . .. With this simplification, we can represent the recursion of G-Push-Pull in a compact matrix form:

xk+1 = Rk xk − αQk yk,   (22a)
yk+1 = Ck yk + ∇F(xk+1) − ∇F(xk),   (22b)

where the matrices Rk and Ck are given by

Rk = I + γ(ejk eᵀik − ejk eᵀjk),   (23a)
Ck = I + γ(elk eᵀik − eik eᵀik),   (23b)

respectively. Here, ei = [0, · · · , 0, 1, 0, · · · ]ᵀ ∈ Rn×1 is the unit vector whose ith component equals 1. Notice that each Rk is row-stochastic and each Ck is column-stochastic. The random matrix Qk = diag(eik + ejk + elk).
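The stochasticity of these rank-one perturbations of the identity is easy to verify (a minimal NumPy sketch; the agent indices ik, jk, lk below are an illustrative choice):

```python
import numpy as np

def e(i, n):
    """Unit vector e_i in R^n."""
    v = np.zeros(n)
    v[i] = 1.0
    return v

n, gamma = 4, 0.5
ik, jk, lk = 0, 1, 2   # activated agent and its R- and C-neighbors (illustrative)

# R_k = I + gamma (e_{j_k} e_{i_k}^T - e_{j_k} e_{j_k}^T)  -- per (23a)
Rk = np.eye(n) + gamma * (np.outer(e(jk, n), e(ik, n)) - np.outer(e(jk, n), e(jk, n)))
# C_k = I + gamma (e_{l_k} e_{i_k}^T - e_{i_k} e_{i_k}^T)  -- per (23b)
Ck = np.eye(n) + gamma * (np.outer(e(lk, n), e(ik, n)) - np.outer(e(ik, n), e(ik, n)))

assert np.allclose(Rk.sum(axis=1), 1.0)   # each R_k is row-stochastic
assert np.allclose(Ck.sum(axis=0), 1.0)   # each C_k is column-stochastic
```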

Remark 7. In practice, after receiving information from agent ik at step k, agents jk and lk can choose to perform their updates when they wake up at a future step.

V. CONVERGENCE ANALYSIS FOR G-PUSH-PULL

Define R := E[Rk] and C := E[Ck]. Denote by uᵀ (with uᵀ1 = n) the left eigenvector of R w.r.t. eigenvalue 1, and let v (with 1ᵀv = n) be the right eigenvector of C w.r.t. eigenvalue 1. Let x̄k = (1/n)uᵀxk and ȳk = (1/n)1ᵀyk as before. Our strategy is to bound E[‖x̄k+1 − x∗‖2²], E[‖xk+1 − 1x̄k+1‖S²] and E[‖yk+1 − vȳk+1‖D²] in terms of linear combinations of their previous values, where ‖·‖S and ‖·‖D are norms to be specified later. Then, based on the established linear system of inequalities, we prove the convergence of G-Push-Pull.

We first state the main convergence result for G-Push-Pull in the following theorem.

Theorem 2. Suppose Assumptions 1-3 hold and

γ < min { γR, γC, ηµ√(1 − σC) / (8L√(d4d7)) },   (24a)
α ≤ min { 2c6 / (c5 + √(c5² + 4c4c6)), … },   (24b)

where the constant γR > 0 is defined in Lemma 12, γC, σC > 0 are defined in Lemma 13, η > 0 is given in Lemma 14, c4-c6 are given in (34)-(36), and d4, d7 > 0 are defined in (31). Then, the quantities E[‖x̄k − x∗‖2²], E[‖xk − 1x̄k‖S²] and E[‖yk − vȳk‖D²] all converge to 0 at the linear rate O(ρ(B)^k) with ρ(B) < 1, where ρ(B) denotes the spectral radius of the matrix B defined in (30).

A. Preliminaries

From (22), we have the following recursive relations:

x̄k+1 = (uᵀ/n)(Rk xk − αQk yk)
     = x̄k − α(uᵀ/n)Qk yk + (uᵀ/n)(Rk − R)(xk − (1uᵀ/n)xk),

xk+1 − (1uᵀ/n)xk+1 = (I − 1uᵀ/n)Rk xk − (I − 1uᵀ/n)αQk yk,   (25)


and

yk+1 − (v1ᵀ/n)yk+1 = (I − v1ᵀ/n)Ck yk + (I − v1ᵀ/n)[∇F(xk+1) − ∇F(xk)].   (26)

To derive a linear system of inequalities from the above equations, we first provide some useful facts about Rk and Ck as well as their expectations R and C. Let

Tk := ejk eᵀik − ejk eᵀjk,  Ek := elk eᵀik − eik eᵀik.

Then from (23) we have Rk = I + γTk and Ck = I + γEk. Define T := E[Tk] and E := E[Ek]. We obtain

R = I + γT,  C = I + γE.

The matrices T and E have the following algebraic property.

Lemma 9. The matrix T (resp., E) has a unique eigenvalue 0; all its other eigenvalues lie in the circle of radius 1 centered at (−1, 0) in the complex plane.

Proof. Note that I + T is a nonnegative row-stochastic matrix corresponding to the graph GR. It has spectral radius 1, which is also its unique eigenvalue of modulus 1 due to the existence of a spanning tree in GR [41, Lemma 3.4]. Therefore, 0 is a unique eigenvalue of T, and all the other eigenvalues of T lie in the circle of radius 1 centered at (−1, 0) in the complex plane. The argument for E is similar.

Note that uᵀ is a left eigenvector of T w.r.t. eigenvalue 0, and 1ᵀ is a left eigenvector of E w.r.t. eigenvalue 0. We have the following result.

Lemma 10. The matrix T can be decomposed as T = S⁻¹JT S, where JT ∈ Cn×n has 0 as its top-left entry and differs from the Jordan form of T only in the superdiagonal entries.⁸ The square matrix S has uᵀ as its first row, and the rows of S are either left eigenvectors of T or generalized left eigenvectors of T (up to rescaling). In particular, the superdiagonal elements of JT can be made arbitrarily close to 0 by a proper choice of S.

The matrix E can be decomposed as E = D⁻¹JE D, where JE ∈ Cn×n has 0 as its top-left entry and differs from the Jordan form of E only in the superdiagonal entries. The square matrix D has 1ᵀ as its first row, and the rows of D are either left eigenvectors of E or generalized left eigenvectors of E (up to rescaling). In particular, the superdiagonal elements of JE can be made arbitrarily close to 0 by a proper choice of D.

Proof. Since uᵀ is a left eigenvector of T w.r.t. eigenvalue 0, the Jordan form of T can be written as T = S̄⁻¹J̄T S̄, where J̄T has 0 as its top-left entry and uᵀ is the first row of S̄ (see [39]). If T is diagonalizable, then J̄T is diagonal, in which case we can take JT = J̄T and S = S̄. If not, J̄T has superdiagonal elements equal to 1. We then let S differ from S̄ only in the rows corresponding to the generalized left eigenvectors of T. By scaling down these rows, the superdiagonal elements of JT can be made arbitrarily close to 0. The proof for E is similar.

⁸If T is diagonalizable, then JT is exactly the Jordan form of T. The same relation applies to JE and E in the next paragraph.

The following lemma is the final cornerstone we need to build our proof of the main results.

Lemma 11. Suppose T = S⁻¹JT S and E = D⁻¹JE D (as described in Lemma 10). Then the matrix R − 1uᵀ/n has the decomposition

R − 1uᵀ/n = S⁻¹JR S,  where JR = I − diag(e1) + γJT,

and the matrix C − v1ᵀ/n has the decomposition

C − v1ᵀ/n = D⁻¹JC D,  where JC = I − diag(e1) + γJE.

Proof. Recall that the rows of S are left (generalized) eigenvectors of T (up to rescaling). Since 1 is the right eigenvector of T w.r.t. eigenvalue 0, it is orthogonal to the left (generalized) eigenvectors of T w.r.t. eigenvalues other than 0. Thus we have S(1uᵀ/n) = e1uᵀ = diag(e1)S. Therefore,

R − 1uᵀ/n = S⁻¹(I + γJT)S − S⁻¹diag(e1)S = S⁻¹(I − diag(e1) + γJT)S.

Similarly, we can prove the second relation.

In the rest of this section, we assume that T = S⁻¹JT S and E = D⁻¹JE D for some fixed matrices S, D, JT and JE as described in Lemma 10. In particular, JT and JE are diagonal or close to diagonal.

B. Supporting Lemmas

Define the norms ‖·‖S and ‖·‖D such that for all x ∈ Rn, ‖x‖S := ‖Sx‖2 and ‖x‖D := ‖Dx‖2. Correspondingly, for any matrix W ∈ Rn×n, the induced matrix norms are given by ‖W‖S := ‖SWS⁻¹‖2 and ‖W‖D := ‖DWD⁻¹‖2, respectively. Denote T̃k := Tk − T, Ẽk := Ek − E, and R̃k := Rk − R = γT̃k, C̃k := Ck − C = γẼk. We have a few supporting lemmas.

Lemma 12. Under Assumption 3, we have

E[ ‖(I − 1uᵀ/n)Rk xk‖S² | xk ] ≤ σR ‖xk − 1x̄k‖S²,

where σR := ‖JR‖2² + γ²‖VT‖2 with

VT := E[ (S⁻¹)ᵀ T̃ᵀk (I − u1ᵀ/n) SᵀS (I − 1uᵀ/n) T̃k S⁻¹ ],

and

E[ ‖(I − 1uᵀ/n)Qk yk‖S² | Hk ] ≤ ‖VQ‖2 ‖yk‖S²,

where

VQ := E[ (S⁻¹)ᵀ Qᵀk (I − 1uᵀ/n)ᵀ SᵀS (I − 1uᵀ/n) Qk S⁻¹ ].

In particular, there exists γR > 0 such that for all γ ∈ (0, γR), we have σR < 1.

Proof. By definition,

‖(I − 1uᵀ/n)Rk xk‖S² = tr{ [S(I − 1uᵀ/n)Rk xk]ᵀ S(I − 1uᵀ/n)Rk xk } = tr{ xᵀk Rᵀk (I − u1ᵀ/n) SᵀS (I − 1uᵀ/n) Rk xk }.   (27)

In what follows, we omit the symbol tr{·} to simplify the notation; all matrices W ∈ Rp×p refer to tr{W}. The reader may conveniently assume p = 1 in the following.

Let Hk denote the history {x1, x2, . . . , xk, yk}. Taking the conditional expectation on both sides of (27), we have

E[ ‖(I − 1uᵀ/n)Rk xk‖S² | Hk ]
= xᵀk Rᵀ (I − u1ᵀ/n) SᵀS (I − 1uᵀ/n) R xk + E[ xᵀk R̃ᵀk (I − u1ᵀ/n) SᵀS (I − 1uᵀ/n) R̃k xk | Hk ]
= (xk − 1x̄k)ᵀ (R − 1uᵀ/n)ᵀ SᵀS (R − 1uᵀ/n)(xk − 1x̄k)
  + γ² (xk − 1x̄k)ᵀ Sᵀ E[ (S⁻¹)ᵀ T̃ᵀk (I − u1ᵀ/n) SᵀS (I − 1uᵀ/n) T̃k S⁻¹ ] S (xk − 1x̄k).   (28)

In light of Lemma 11 (see [39]), ‖R − 1uᵀ/n‖S = ‖S(R − 1uᵀ/n)S⁻¹‖2 = ‖JR‖2. Then from (28) and the definition of VT, we have

E[ ‖(I − 1uᵀ/n)Rk xk‖S² | Hk ] ≤ ‖R − 1uᵀ/n‖S² ‖xk − 1x̄k‖S² + γ²‖VT‖2 ‖xk − 1x̄k‖S² = (‖JR‖2² + γ²‖VT‖2) ‖xk − 1x̄k‖S².

The second relation follows from

E[ ‖(I − 1uᵀ/n)Qk yk‖S² | Hk ] = E[ yᵀk Sᵀ (S⁻¹)ᵀ Qᵀk (I − 1uᵀ/n)ᵀ SᵀS (I − 1uᵀ/n) Qk S⁻¹ S yk | Hk ] ≤ ‖VQ‖2 ‖yk‖S².

Given the definition of JR and Lemma 9, we know ‖JR‖2 = 1 − O(γ). Hence, when γ is sufficiently small, we have σR = ‖JR‖2² + γ²‖VT‖2 < 1.

Lemma 13. Under Assumption 3, we have

E[ ‖(I − v1ᵀ/n)Ck yk‖D² | Hk ] ≤ σC ‖yk − vȳk‖D² + 2γ²‖VE‖2 ‖vȳk‖D²,

where σC := ‖JC‖2² + 2γ²‖VE‖2 with VE := E[ (D⁻¹)ᵀ Ẽᵀk DᵀD Ẽk D⁻¹ ]. In particular, there exists γC > 0 such that for all γ ∈ (0, γC), we have σC < 1.

Proof. Note that

‖(I − v1ᵀ/n)Ck yk‖D² = [D(I − v1ᵀ/n)Ck yk]ᵀ D(I − v1ᵀ/n)Ck yk = yᵀk Cᵀk (I − 1vᵀ/n) DᵀD (I − v1ᵀ/n) Ck yk.

Taking the conditional expectation on both sides,

E[ ‖(I − v1ᵀ/n)Ck yk‖D² | Hk ]
= yᵀk Cᵀ (I − 1vᵀ/n) DᵀD (I − v1ᵀ/n) C yk + E[ yᵀk C̃ᵀk (I − 1vᵀ/n) DᵀD (I − v1ᵀ/n) C̃k yk | Hk ]
= (yk − vȳk)ᵀ (Cᵀ − 1vᵀ/n) DᵀD (C − v1ᵀ/n)(yk − vȳk) + yᵀk E[ C̃ᵀk DᵀD C̃k ] yk
≤ ‖C − v1ᵀ/n‖D² ‖yk − vȳk‖D² + γ² yᵀk Dᵀ E[ (D⁻¹)ᵀ Ẽᵀk DᵀD Ẽk D⁻¹ ] D yk
≤ ‖JC‖2² ‖yk − vȳk‖D² + γ²‖VE‖2 ‖yk‖D²
≤ (‖JC‖2² + 2γ²‖VE‖2) ‖yk − vȳk‖D² + 2γ²‖VE‖2 ‖vȳk‖D²,

where we invoked Lemma 11 for the second-to-last relation.

Lemma 14. Under Assumption 3, we have

E[ ‖x̄k+1 − x∗‖2² | Hk ] ≤ [ 1 − αηµ + (4α²/n²)‖E[Qᵀk uuᵀ Qk]‖2 ‖v‖2² L² ] ‖x̄k − x∗‖2²
+ ( 2αηL²/(µn) + (4α²L²/n³)‖E[Qᵀk uuᵀ Qk]‖2 ‖v‖2² + (1/n²)‖E[R̃ᵀk uuᵀ R̃k]‖2 ) ‖xk − 1x̄k‖2²
+ ( (2α/(ηµn²))‖uᵀE[Qk]‖2² + (2α²/n²)‖E[Qᵀk uuᵀ Qk]‖2 ) ‖yk − vȳk‖2²,

where η := uᵀE[Qk]v/n > 0.

Proof. See Appendix B-A.

Similar to Lemma 6, we have the following relations between the norms ‖·‖2, ‖·‖S and ‖·‖D.

Lemma 15. There exist constants δD,S, δD,2, δS,D, δS,2 > 0 such that for all x ∈ Rn×p, we have ‖x‖D ≤ δD,S‖x‖S, ‖x‖D ≤ δD,2‖x‖2, ‖x‖S ≤ δS,D‖x‖D, and ‖x‖S ≤ δS,2‖x‖2. In addition, with a proper rescaling of the norms ‖·‖S and ‖·‖D, we have ‖x‖2 ≤ ‖x‖S and ‖x‖2 ≤ ‖x‖D for all x.

In the following lemma, we establish a linear system of inequalities that bounds E[‖x̄k+1 − x∗‖2²], E[‖xk+1 − 1x̄k+1‖S²] and E[‖yk+1 − vȳk+1‖D²].

Lemma 16. Under Assumptions 1-3, we have the following linear system of inequalities:

[ E[‖x̄k+1 − x∗‖2²] ; E[‖xk+1 − 1x̄k+1‖S²] ; E[‖yk+1 − vȳk+1‖D²] ] ≤ B [ E[‖x̄k − x∗‖2²] ; E[‖xk − 1x̄k‖S²] ; E[‖yk − vȳk‖D²] ],   (29)

where the inequality is to be taken component-wise, and the


elements of the transition matrix B = [bij] are given by:

b11 = 1 − αηµ + (4α²L²/n²)‖E[Qᵀk uuᵀ Qk]‖2 ‖v‖2²,
b21 = 3α²L² ((1 + σR)/(1 − σR)) ‖VQ‖2 ‖v‖S²,
b31 = 2L² [ ((1 + σC)/σC) γ²‖VE‖2 ‖v‖D² + 4α²δD,2² L² ((1 + σC)/(1 − σC)) ‖I − v1ᵀ/n‖D² E[‖Qk‖2²] ‖v‖2² ],
b12 = 2αηL²/(µn) + (4α²L²/n³)‖E[Qᵀk uuᵀ Qk]‖2 ‖v‖2² + (1/n²)‖E[R̃ᵀk uuᵀ R̃k]‖2,
b22 = (1 + σR)/2 + (3α²L²/n) ((1 + σR)/(1 − σR)) ‖VQ‖2 ‖v‖S²,
b32 = 2δD,2² L² ((1 + σC)/(1 − σC)) ‖I − v1ᵀ/n‖D² [ E[‖Rk − I‖2²] + (4α²L²/n) E[‖Qk‖2²] ‖v‖2² ] + (2γ²L²/n) ((1 + σC)/σC) ‖VE‖2 ‖v‖D²,
b13 = (2α/(ηµn²)) ‖uᵀE[Qk]‖2² + (2α²/n²)‖E[Qᵀk uuᵀ Qk]‖2,
b23 = 3α² ((1 + σR)/(1 − σR)) ‖VQ‖2 δS,D,
b33 = (1 + σC)/2 + 4α²δD,2² L² ((1 + σC)/(1 − σC)) ‖I − v1ᵀ/n‖D² E[‖Qk‖2²].   (30)

Proof. See Appendix B-B.

Now we are ready to prove the main convergence result for G-Push-Pull.

C. Proof of Theorem 2

Let

d1 := (2/n²)‖E[Qᵀk uuᵀ Qk]‖2,
d2 := 3 ((1 + σR)/(1 − σR)) ‖VQ‖2,
d3 := 4δD,2² ((1 + σC)/(1 − σC)) ‖I − v1ᵀ/n‖D² E[‖Qk‖2²],
d4 := (1/n) ((1 + σC)/σC) ‖VE‖2 ‖v‖D²,
d5 := (1/n²)‖E[R̃ᵀk uuᵀ R̃k]‖2,
d6 := δD,2² ((1 + σC)/(1 − σC)) ‖I − v1ᵀ/n‖D² E[‖Rk − I‖2²],
d7 := (1/n²)‖uᵀE[Qk]‖2².   (31)

We can rewrite the elements of B as

b11 = 1 − αηµ + 2α²L²‖v‖2² d1,
b21 = α²L² d2 ‖v‖S²,
b31 = 2L² (γ²d4 + α²L²d3‖v‖2²),
b12 = 2αηL²/(µn) + 2α²L²‖v‖2² d1/n + d5,
b22 = (1 + σR)/2 + α²L² d2 ‖v‖S² / n,
b32 = 2L² (d6 + α²L²d3‖v‖2² + γ²d4),
b13 = (2α/(ηµ)) d7 + α²d1,
b23 = α²d2 δS,D,
b33 = (1 + σC)/2 + α²L²d3.

According to Lemma 8, a sufficient condition for ρ(B) < 1 is b11, b22, b33 < 1 and det(I − B) > 0, i.e.,

det(I − B) = (1 − b11)(1 − b22)(1 − b33) − b12b23b31 − b13b21b32 − (1 − b22)b13b31 − (1 − b11)b23b32 − (1 − b33)b12b21 > 0.   (32)

Let α satisfy the following inequalities:

b11 ≤ 1 − (1/2)αηµ,  b22 ≤ (3 + σR)/4,  b33 ≤ (3 + σC)/4,
(2α²L²‖v‖2² d1)/n ≤ αηL²/(µn),  α²d1 ≤ (α/(ηµ)) d7,  α²L²d3‖v‖2² ≤ d6.   (33)

Then it is sufficient that

(1/2)αηµ · ((1 − σR)/4) · ((1 − σC)/4)
− (3αηL²/(µn) + d5) · α²d2δS,D · 2L²(γ²d4 + α²L²d3‖v‖2²)
− (3α/(ηµ)) d7 · α²L²d2‖v‖S² · 2L²(d6 + α²L²d3‖v‖2² + γ²d4)
− ((1 − σR)/4) · (3α/(ηµ)) d7 · 2L²(γ²d4 + α²L²d3‖v‖2²)
− (1/2)αηµ · α²d2δS,D · 2L²(d6 + α²L²d3‖v‖2² + γ²d4)
− ((1 − σC)/4) · (3αηL²/(µn) + d5) · α²L²d2‖v‖S² > 0.

Since α²L²d3‖v‖2² ≤ d6 from (33), we only need

(1/32)ηµ(1 − σR)(1 − σC)
− (3αηL²/(µn) + d5) · 2αL²d2δS,D(γ²d4 + d6)
− (6α²L⁴/(ηµ)) d2d7‖v‖S² (2d6 + γ²d4)
− (3(1 − σR)/2) (L²d7/(ηµ)) (γ²d4 + α²L²d3‖v‖2²)
− ηµα²L²d2δS,D (2d6 + γ²d4)
− ((1 − σC)/4) (3αηL²/(µn) + d5) αL²d2‖v‖S² > 0.

We can rewrite the above inequality as c4α² + c5α − c6 < 0, where

c4 = (6ηL⁴/(µn)) d2δS,D(d6 + γ²d4) + (6L⁴/(ηµ)) d2d7‖v‖S²(2d6 + γ²d4) + (3(1 − σR)/2)(L⁴d3d7/(ηµ))‖v‖2² + ηµL²d2δS,D(2d6 + γ²d4) + (3(1 − σC)/4)(ηL⁴d2/(µn))‖v‖S²,   (34)

c5 = 2L²d2d5δS,D(d6 + γ²d4) + ((1 − σC)/4) L²d2d5‖v‖S²,   (35)

and

c6 = (1/32)ηµ(1 − σR)(1 − σC) − (3(1 − σR)/2)(L²d4d7/(ηµ)) γ².   (36)

In light of (24a), we have c6 > 0. Then

α ≤ 2c6 / (c5 + √(c5² + 4c4c6))

is sufficient.

VI. SIMULATIONS

In this section, we provide numerical comparisons of a few different algorithms under both synchronous and asynchronous random-gossip settings. The problem we consider is sensor fusion over a network, similar to the one considered in [8]. The estimation problem can be described as

min_{x∈Rp} ∑_{i=1}^{n} ( ‖zi − Hi x‖² + λi ‖x‖² ),

where x is the unknown parameter to be estimated, Hi ∈ Rs×p and zi ∈ Rs represent the measurement matrix and the noisy observation of sensor i, respectively, and λi is the regularization parameter of the local cost function of sensor i.
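The local cost and its gradient are straightforward to implement and check (a minimal sketch; the data Hi, zi and the point x below are random placeholders, not the paper's experimental data):

```python
import numpy as np

# Local sensor-fusion cost: f_i(x) = ||z_i - H_i x||^2 + lambda_i ||x||^2,
# with gradient grad f_i(x) = -2 H_i^T (z_i - H_i x) + 2 lambda_i x.
rng = np.random.default_rng(0)
s, p, lam = 1, 20, 0.01
H = rng.standard_normal((s, p))
z = rng.standard_normal(s)
x = rng.standard_normal(p)

f = lambda x: np.sum((z - H @ x) ** 2) + lam * np.sum(x ** 2)
grad = lambda x: -2 * H.T @ (z - H @ x) + 2 * lam * x

# Central finite-difference check of the gradient along a random direction.
eps, d = 1e-6, rng.standard_normal(p)
fd = (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
assert abs(fd - grad(x) @ d) < 1e-4
```

With s = 1 each Hi is a single row, so HiᵀHi is rank one and the local costs are ill-conditioned, as the experimental setup below exploits.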


We consider a sensor network randomly generated in the unit square, where two sensors are connected if they are within a certain sensing range. The sensors are assumed to have asymmetric sensing ranges so as to construct a directed network, i.e., 1.5√(log n / n) for outgoing links and 0.75√(log n / n) for incoming links. We set n = 20, p = 20, s = 1 and λi = 0.01 for all i ∈ N, so that each local cost function is ill-conditioned, necessitating coordination among agents to achieve fast convergence. The measurement matrix is randomly generated from a standard normal distribution and then normalized so that its Lipschitz constant equals 1. We design the weight matrices R and C based on the same underlying graph (i.e., GR = GC = G) and the constant weight rule, i.e., R = I − (1/(2d^in_max)) L_R, where d^in_max is the maximum in-degree and L_R is the Laplacian matrix corresponding to GR (using in-degrees). Similarly, C = I − (1/(2d^out_max)) L_C.
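The constant weight rule can be sketched as follows (an illustrative 4-node digraph, not the paper's random geometric graph):

```python
import numpy as np

# Adjacency convention: A[i, j] = 1 iff there is an edge j -> i (agent i pulls from j).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 0],
              [1, 0, 1, 0]], dtype=float)
d_in = A.sum(axis=1)                  # in-degree of each node
L_in = np.diag(d_in) - A              # in-degree Laplacian

# Constant weight rule: R = I - L_R / (2 * d_in_max).
R = np.eye(4) - L_in / (2 * d_in.max())

assert np.allclose(R.sum(axis=1), 1.0)   # row-stochastic (Laplacian rows sum to 0)
assert np.all(R >= 0)                    # nonnegative, since all weights are <= 1/2
```

The column-stochastic C is built the same way from the out-degree Laplacian, dividing by twice the maximum out-degree.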

We compare our proposed Push-Pull algorithm against Push-DIGing [9] and Xi-Row [14], which are also applicable to directed networks. Push-DIGing builds on push-sum protocols, works with merely column-stochastic matrices, and thus needs only push operations to disseminate information in the network. Xi-Row uses only row-stochastic matrices and thus requires only pull operations to fetch information from the network. By contrast, the proposed Push-Pull algorithm uses both row-stochastic and column-stochastic matrices for information diffusion. As we show shortly in our simulations, Push-Pull works much better, especially for ill-conditioned problems and when the graphs are not well balanced. This is because nodes with very few in-degree (resp., out-degree) neighbors (e.g., in significantly unbalanced directed graphs) become a bottleneck for the information flow under row-stochastic (resp., column-stochastic) weight matrices. In contrast, Push-Pull relies on both the out-degree and in-degree neighbors of the network and can thus diffuse information much faster. It should also be noted that the per-node storage complexity of Push-Pull (or Push-DIGing) is O(p), while that of Xi-Row is O(n + p); at each iteration, the amount of data transmitted over each link also scales at these respective orders. For large-scale networks (n ≫ p), Xi-Row may therefore suffer from high storage/bandwidth requirements compared with the other methods.

Fig. 2a illustrates the performance of the above algorithms on a randomly generated fixed network in terms of the (normalized) residual ‖xk − x∗‖2² / ‖x0 − x∗‖2². Since the upper bounds on the stepsizes for all the algorithms are derived from the small-gain theorem (or similar techniques) and can be very conservative, we hand-optimize the stepsize for each method to make a fair comparison. It can be seen from Fig. 2a that Push-Pull allows for a much larger value of the stepsize than Push-DIGing and Xi-Row. In addition, it also enjoys a much faster convergence speed.

Under the asynchronous random-gossip setting, we compare G-Push-Pull (see Section IV) against a variant of Push-DIGing that has been shown to be applicable to the time-varying scenario [9]. We use a random link activation model: at each iteration, a subset of links is randomly activated following a Bernoulli process B(0.1), and the associated nodes are awakened via "push-notification" or "pull-notification". These awakened nodes then communicate with each other according to the gossip protocol designed based on the aforementioned constant weight rule. The simulation results are averaged over 20 runs. Fig. 2b illustrates the performance of the proposed G-Push-Pull algorithm against Push-DIGing. It can be seen that Push-DIGing allows only very small values of the stepsize and suffers from "spikes" due to the division operations in the algorithm (note that the divisors can scale badly, at the order of Ω(n⁻ⁿ) [9]). In contrast, G-Push-Pull allows for a very large value of the stepsize and enjoys a much faster convergence speed. More importantly, it converges linearly and steadily to the optimal solution.

[Fig. 2a legend: Push-Pull: α = 1; Push-DIGing: α = 0.22; Xi-Row: α = 0.008; Push-Pull: α = 1.6; Push-DIGing: α = 0.25; Xi-Row: γ = 0.009. (a) Fixed directed network. Fig. 2b legend: Push-DIGing: α = 0.18; Push-DIGing: α = 0.2; Push-DIGing: α = 0.24; G-Push-Pull: α = 2. (b) Asynchronous directed network.]

Fig. 2. Plots of (normalized) residuals versus the number of iterations over (a) a fixed directed network and (b) an asynchronous directed network. The simulation results for the asynchronous setting are averaged over 20 runs.

VII. CONCLUSIONS

In this paper, we have studied the problem of distributedoptimization over a network. In particular, we proposed newdistributed gradient-based methods (Push-Pull and G-Push-Pull) where each node maintains estimates of the optimaldecision variable and the average gradient of the agents’


objective functions. From the viewpoint of an agent, the information about the gradients is pushed to its neighbors, while the information about the decision variable is pulled from its neighbors. The methods utilize two different graphs for the information exchange among agents and work for different types of distributed architecture, including decentralized, centralized, and semi-centralized architectures. We have shown that the algorithms converge linearly for strongly convex and smooth objective functions over a directed network, for both synchronous and asynchronous random-gossip updates. In the simulations, we have also demonstrated the effectiveness of the proposed algorithms as compared to state-of-the-art methods.

APPENDIX A
PROOFS FOR PUSH-PULL

A. Proof of Lemma 1

Denote by $u_i$ the $i$th element of $u$. We first prove that $u_i>0$ iff $i\in\mathcal{R}_R$. Note that there exists an ordering of the vertices under which $R$ can be rewritten as
$$R = \begin{bmatrix} R_1 & 0 \\ R_2 & R_3 \end{bmatrix},$$
where $R_1$ is a square matrix corresponding to the vertices in $\mathcal{R}_R$. Since the induced subgraph $\mathcal{G}_{R_1}$ is strongly connected, $R_1$ is row stochastic and irreducible. In light of the Perron-Frobenius theorem, $R_1$ has a strictly positive left eigenvector $u_1^\intercal$ (with $u_1^\intercal\mathbf{1}=n$) corresponding to eigenvalue $1$. It follows that $[u_1^\intercal, 0^\intercal]$ is a left eigenvector of $R$, which is also unique (up to scaling) from the Perron-Frobenius theorem. Since reordering the vertices only permutes the entries of the eigenvector in the same order, we have $u_i>0$ iff $i\in\mathcal{R}_R$. Similarly, we can show that $v_j>0$ iff $j\in\mathcal{R}_{C^\intercal}$. Since $\mathcal{R}_R\cap\mathcal{R}_{C^\intercal}\neq\emptyset$ from Assumption 3, we have $u^\intercal v>0$.
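As a quick numerical illustration of this sign pattern (not part of the proof; the matrix below is a made-up example), one can compute the left Perron eigenvector of a small row-stochastic matrix in the block form above:

```python
import numpy as np

# Hypothetical 4x4 row-stochastic matrix in the block form [[R1, 0], [R2, R3]]:
# vertices {0, 1} form the root set R_R (R1 is irreducible), while
# vertices {2, 3} only listen to the others.
R = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0],
    [0.2, 0.2, 0.6, 0.0],
    [0.1, 0.1, 0.4, 0.4],
])
n = R.shape[0]

# A left eigenvector of R for eigenvalue 1 is an eigenvector of R^T for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(R.T)
u = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
u = u * n / u.sum()  # normalize so that u^T 1 = n

print(np.round(u, 6))  # entries 0 and 1 are positive; entries 2 and 3 vanish
```

The printed vector has positive entries exactly on the root set, matching the lemma.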

B. Proof of Lemma 2

In light of Assumption 1 and (3),
$$\|\bar y_k - g_k\|_2 = \frac{1}{n}\left\|\mathbf{1}^\intercal\nabla F(\mathbf{x}_k) - \mathbf{1}^\intercal\nabla F(\mathbf{1}\bar x_k)\right\|_2 \le \frac{L}{n}\sum_{i=1}^n\|x_{i,k}-\bar x_k\|_2 \le \frac{L}{\sqrt{n}}\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2,$$
and
$$\|g_k\|_2 = \frac{1}{n}\left\|\mathbf{1}^\intercal\nabla F(\mathbf{1}\bar x_k) - \mathbf{1}^\intercal\nabla F(\mathbf{1}x^*)\right\|_2 \le \frac{L}{n}\sum_{i=1}^n\|\bar x_k - x^*\|_2 = L\|\bar x_k - x^*\|_2.$$
The proof of the last relation can be found in [5, Lemma 10].
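A small numerical sanity check of these two bounds (with made-up least-squares local objectives, an assumption for illustration only: $f_i(x)=\frac{1}{2}\|M_i x - b_i\|^2$, whose gradient $M_i^\intercal(M_i x - b_i)$ is Lipschitz with constant $\|M_i^\intercal M_i\|_2$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3

# Hypothetical local objectives f_i(x) = 0.5 * ||M_i x - b_i||^2 (illustrative only).
M = [rng.normal(size=(4, p)) for _ in range(n)]
b = [rng.normal(size=4) for _ in range(n)]
L = max(np.linalg.norm(Mi.T @ Mi, 2) for Mi in M)  # common Lipschitz constant

grad = lambda i, x: M[i].T @ (M[i] @ x - b[i])

# Stacked iterates x_k in R^{n x p} (rows are the agents' copies) and their average.
X = rng.normal(size=(n, p))
xbar = X.mean(axis=0)

# First bound: || ybar_k - g_k || <= (L / sqrt(n)) || x_k - 1 xbar_k ||.
ybar_minus_g = np.mean([grad(i, X[i]) - grad(i, xbar) for i in range(n)], axis=0)
consensus_err = np.linalg.norm(X - xbar)  # Frobenius norm of x_k - 1 xbar_k
assert np.linalg.norm(ybar_minus_g) <= L / np.sqrt(n) * consensus_err + 1e-12

# Second bound: || g_k || <= L || xbar_k - x* ||, with x* minimizing sum_i f_i.
H = sum(Mi.T @ Mi for Mi in M)
xstar = np.linalg.solve(H, sum(Mi.T @ bi for Mi, bi in zip(M, b)))
g = np.mean([grad(i, xbar) for i in range(n)], axis=0)
assert np.linalg.norm(g) <= L * np.linalg.norm(xbar - xstar) + 1e-12
```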

C. Proof of Lemma 3

In light of [41, Lemma 3.4], under Assumptions 2-3, the spectral radii of $R$ and $C$ are both equal to $1$ (the corresponding eigenvalues have multiplicity $1$). Suppose $\tilde u^\intercal\left(R-\frac{\mathbf{1}u^\intercal}{n}\right)=\lambda\tilde u^\intercal$ for some $\lambda\neq 0$ and $\tilde u\neq 0$ (the case $\lambda=0$ is trivial). Since $\mathbf{1}$ is a right eigenvector of $(R-\mathbf{1}u^\intercal/n)$ corresponding to eigenvalue $0$, we have $\tilde u^\intercal\mathbf{1}=0$ (see [39, Theorem 1.4.7]). It follows that $\tilde u^\intercal R=\lambda\tilde u^\intercal$, hence $\lambda$ is also an eigenvalue of $R$. Noticing that $u^\intercal\mathbf{1}=n$ while $\tilde u^\intercal\mathbf{1}=0$, we have $\tilde u^\intercal\neq c u^\intercal$ for any $c\neq 0$, so that $\lambda\neq 1$ and hence $|\lambda|<1$. We conclude that $\sigma_R<1$. Similarly we can obtain $\sigma_C<1$.
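The spectral-radius claim underlying this lemma is easy to probe numerically (a sketch on a made-up strongly connected digraph; the all-positive matrix below is an assumption that guarantees irreducibility):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# Hypothetical row-stochastic matrix of a strongly connected digraph:
# strictly positive entries make it irreducible (and primitive).
R = rng.uniform(0.1, 1.0, size=(n, n))
R = R / R.sum(axis=1, keepdims=True)

# Left Perron eigenvector u, normalized so that u^T 1 = n.
eigvals, eigvecs = np.linalg.eig(R.T)
u = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
u = u * n / u.sum()

# The deflated matrix R - 1 u^T / n has spectral radius strictly below 1:
# deflation replaces the Perron eigenvalue 1 by 0 and keeps the rest.
deflated = R - np.outer(np.ones(n), u) / n
rho = max(abs(np.linalg.eigvals(deflated)))
assert rho < 1 - 1e-6
```

Since the spectral radius of the deflated matrix is below 1, a norm $\|\cdot\|_R$ with $\sigma_R<1$ exists, which is what the lemma asserts.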

D. Proof of Lemma 5

By Definition 1, writing $x_i$ for the $i$th column of $\mathbf{x}\in\mathbb{R}^{n\times p}$,
$$\|W\mathbf{x}\| = \big\|[\|Wx_1\|, \|Wx_2\|, \ldots, \|Wx_p\|]\big\|_2 \le \big\|[\|W\|\|x_1\|, \|W\|\|x_2\|, \ldots, \|W\|\|x_p\|]\big\|_2 = \|W\|\big\|[\|x_1\|, \|x_2\|, \ldots, \|x_p\|]\big\|_2 = \|W\|\|\mathbf{x}\|,$$
and, for a column vector $w$ and a row vector $x=[x_1, x_2, \ldots, x_p]$,
$$\|wx\| = \big\|[\|wx_1\|, \|wx_2\|, \ldots, \|wx_p\|]\big\|_2 = \|w\|\big\|[|x_1|, |x_2|, \ldots, |x_p|]\big\|_2 = \|w\|\|x\|_2.$$
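Both identities can be checked numerically by instantiating the arbitrary vector norm of Definition 1 with a concrete choice (the infinity norm below is an illustrative assumption, and $\|W\|$ is then the induced matrix infinity norm):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 4, 3

# Instantiate the arbitrary vector norm of Definition 1 with the infinity norm.
vec_norm = lambda z: np.linalg.norm(z, np.inf)
mat_norm = lambda W: np.linalg.norm(W, np.inf)  # induced matrix infinity norm

# The matrix norm of Definition 1: 2-norm of the vector of columnwise norms.
block_norm = lambda X: np.linalg.norm([vec_norm(X[:, j]) for j in range(X.shape[1])])

W = rng.normal(size=(n, n))
X = rng.normal(size=(n, p))
w = rng.normal(size=(n, 1))       # column vector
x_row = rng.normal(size=(1, p))   # row vector

# Lemma 5: ||W x|| <= ||W|| ||x||, and ||w x|| = ||w|| ||x||_2 for an outer product.
assert block_norm(W @ X) <= mat_norm(W) * block_norm(X) + 1e-12
assert np.isclose(block_norm(w @ x_row), vec_norm(w.ravel()) * np.linalg.norm(x_row))
```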

E. Proof of Lemma 7

The three inequalities embedded in (14) come from (10), (12), and (13), respectively. First, by Lemma 2 and Lemma 6, we obtain from (10) that
$$\begin{aligned}
\|\bar x_{k+1}-x^*\|_2 &\le \|\bar x_k-\alpha' g_k-x^*\|_2 + \alpha'\|\bar y_k-g_k\|_2 + \frac{1}{n}\|u^\intercal\alpha(\mathbf{y}_k-v\bar y_k)\|_2\\
&\le (1-\alpha'\mu)\|\bar x_k-x^*\|_2 + \frac{\alpha' L}{\sqrt n}\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2 + \frac{\|u\|_2\|\alpha\|_2}{n}\|\mathbf{y}_k-v\bar y_k\|_2\\
&\le (1-\alpha'\mu)\|\bar x_k-x^*\|_2 + \frac{\alpha' L}{\sqrt n}\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_R + \frac{\hat\alpha\|u\|_2}{n}\|\mathbf{y}_k-v\bar y_k\|_C.
\end{aligned}$$

Second, by relation (12), Lemma 5 and Lemma 6, we see that
$$\begin{aligned}
\|\mathbf{x}_{k+1}-\mathbf{1}\bar x_{k+1}\|_R &\le \sigma_R\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_R + \sigma_R\|\alpha\mathbf{y}_k\|_R\\
&\le \sigma_R\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_R + \hat\alpha\sigma_R\|\mathbf{y}_k-v\bar y_k\|_R + \hat\alpha\sigma_R\|v\|_R\|\bar y_k\|_2\\
&\le \sigma_R\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_R + \hat\alpha\sigma_R\|\mathbf{y}_k-v\bar y_k\|_R + \hat\alpha\sigma_R\|v\|_R\Big(\frac{L}{\sqrt n}\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2 + L\|\bar x_k-x^*\|_2\Big)\\
&\le \sigma_R\Big(1+\hat\alpha\|v\|_R\frac{L}{\sqrt n}\Big)\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_R + \hat\alpha\sigma_R\delta_{R,C}\|\mathbf{y}_k-v\bar y_k\|_C + \hat\alpha\sigma_R\|v\|_R L\|\bar x_k-x^*\|_2.
\end{aligned}$$

Lastly, it follows from (13), Lemma 5 and Lemma 6 that
$$\begin{aligned}
\|\mathbf{y}_{k+1}-v\bar y_{k+1}\|_C &\le \sigma_C\|\mathbf{y}_k-v\bar y_k\|_C + c_0\delta_{C,2}L\|\mathbf{x}_{k+1}-\mathbf{x}_k\|_2\\
&= \sigma_C\|\mathbf{y}_k-v\bar y_k\|_C + c_0\delta_{C,2}L\|(R-I)(\mathbf{x}_k-\mathbf{1}\bar x_k)-R\alpha\mathbf{y}_k\|_2\\
&\le \sigma_C\|\mathbf{y}_k-v\bar y_k\|_C + c_0\delta_{C,2}L\big(\|R-I\|_2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2 + \|R\alpha(\mathbf{y}_k-v\bar y_k)+R\alpha v\bar y_k\|_2\big)\\
&\le \sigma_C\|\mathbf{y}_k-v\bar y_k\|_C + c_0\delta_{C,2}L\big(\|R-I\|_2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2 + \|R\|_2\|\alpha\|_2\|\mathbf{y}_k-v\bar y_k\|_2 + \|R\|_2\|\alpha\|_2\|v\|_2\|\bar y_k\|_2\big).
\end{aligned}$$
In light of Lemma 2,
$$\|\bar y_k\|_2 \le \frac{L}{\sqrt n}\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2 + L\|\bar x_k-x^*\|_2.$$
Hence
$$\begin{aligned}
\|\mathbf{y}_{k+1}-v\bar y_{k+1}\|_C &\le \big(\sigma_C+\hat\alpha c_0\delta_{C,2}\|R\|_2 L\big)\|\mathbf{y}_k-v\bar y_k\|_C + c_0\delta_{C,2}L\|R-I\|_2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2\\
&\quad + \hat\alpha c_0\delta_{C,2}\|R\|_2\|v\|_2 L\Big(\frac{L}{\sqrt n}\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2 + L\|\bar x_k-x^*\|_2\Big)\\
&\le \big(\sigma_C+\hat\alpha c_0\delta_{C,2}\|R\|_2 L\big)\|\mathbf{y}_k-v\bar y_k\|_C + c_0\delta_{C,2}L\Big(\|R-I\|_2+\hat\alpha\|R\|_2\|v\|_2\frac{L}{\sqrt n}\Big)\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_R\\
&\quad + \hat\alpha c_0\delta_{C,2}\|R\|_2\|v\|_2 L^2\|\bar x_k-x^*\|_2.
\end{aligned}$$

APPENDIX B
PROOFS FOR G-PUSH-PULL

A. Proof of Lemma 14

By (22),
$$\bar x_{k+1} = \frac{u^\intercal}{n}\big(R_k\mathbf{x}_k-\alpha Q_k\mathbf{y}_k\big) = \bar x_k - \alpha\frac{u^\intercal}{n}Q_k\mathbf{y}_k + \frac{u^\intercal}{n}R_k(\mathbf{x}_k-\mathbf{1}\bar x_k),$$
where the last step uses $R_k\mathbf{1}=\mathbf{1}$ and $u^\intercal\mathbf{1}=n$. It follows that
$$\begin{aligned}
\|\bar x_{k+1}-x^*\|_2^2 &= \Big\|\bar x_k - \alpha\frac{u^\intercal}{n}Q_k\mathbf{y}_k + \frac{u^\intercal}{n}R_k(\mathbf{x}_k-\mathbf{1}\bar x_k) - x^*\Big\|_2^2\\
&= \|\bar x_k-x^*\|_2^2 - 2\Big\langle \bar x_k-x^*,\ \alpha\frac{u^\intercal}{n}Q_k\mathbf{y}_k - \frac{u^\intercal}{n}R_k(\mathbf{x}_k-\mathbf{1}\bar x_k)\Big\rangle + \Big\|\alpha\frac{u^\intercal}{n}Q_k\mathbf{y}_k - \frac{u^\intercal}{n}R_k(\mathbf{x}_k-\mathbf{1}\bar x_k)\Big\|_2^2.
\end{aligned}$$

Taking conditional expectation on both sides,
$$\begin{aligned}
\mathbb{E}\big[\|\bar x_{k+1}-x^*\|_2^2\mid\mathcal{H}_k\big] &= \|\bar x_k-x^*\|_2^2 - 2\alpha\Big\langle \bar x_k-x^*, \frac{u^\intercal}{n}\mathbb{E}[Q_k]\mathbf{y}_k\Big\rangle\\
&\quad + \alpha^2\mathbb{E}\Big[\Big\|\frac{u^\intercal}{n}Q_k\mathbf{y}_k\Big\|_2^2\mid\mathcal{H}_k\Big] + \mathbb{E}\Big[\Big\|\frac{u^\intercal}{n}R_k(\mathbf{x}_k-\mathbf{1}\bar x_k)\Big\|_2^2\mid\mathcal{H}_k\Big].
\end{aligned}$$

We now bound the last two terms on the right-hand side. First,
$$\begin{aligned}
\mathbb{E}\Big[\Big\|\frac{u^\intercal}{n}Q_k\mathbf{y}_k\Big\|_2^2\mid\mathcal{H}_k\Big] &= \mathbf{y}_k^\intercal\mathbb{E}\Big[Q_k^\intercal\frac{uu^\intercal}{n^2}Q_k\Big]\mathbf{y}_k \le \frac{1}{n^2}\big\|\mathbb{E}[Q_k^\intercal uu^\intercal Q_k]\big\|_2\|\mathbf{y}_k\|_2^2\\
&\le \frac{2}{n^2}\big\|\mathbb{E}[Q_k^\intercal uu^\intercal Q_k]\big\|_2\big[\|\mathbf{y}_k-v\bar y_k\|_2^2 + \|v\|_2^2\|\bar y_k\|_2^2\big].
\end{aligned}$$
Second,
$$\mathbb{E}\Big[\Big\|\frac{u^\intercal}{n}R_k(\mathbf{x}_k-\mathbf{1}\bar x_k)\Big\|_2^2\mid\mathcal{H}_k\Big] = (\mathbf{x}_k-\mathbf{1}\bar x_k)^\intercal\mathbb{E}\Big[R_k^\intercal\frac{uu^\intercal}{n^2}R_k\Big](\mathbf{x}_k-\mathbf{1}\bar x_k) \le \frac{1}{n^2}\big\|\mathbb{E}[R_k^\intercal uu^\intercal R_k]\big\|_2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2^2.$$

The term with the inner product can be rewritten in the following way:
$$\begin{aligned}
\Big\langle \bar x_k-x^*, \frac{u^\intercal}{n}\mathbb{E}[Q_k]\mathbf{y}_k\Big\rangle &= \Big\langle \bar x_k-x^*, \frac{u^\intercal}{n}\mathbb{E}[Q_k]v\bar y_k\Big\rangle + \Big\langle \bar x_k-x^*, \frac{u^\intercal}{n}\mathbb{E}[Q_k](\mathbf{y}_k-v\bar y_k)\Big\rangle\\
&= \eta\langle \bar x_k-x^*, \bar y_k\rangle + \Big\langle \bar x_k-x^*, \frac{u^\intercal}{n}\mathbb{E}[Q_k](\mathbf{y}_k-v\bar y_k)\Big\rangle.
\end{aligned}$$

We now have
$$\begin{aligned}
\mathbb{E}\big[\|\bar x_{k+1}-x^*\|_2^2\mid\mathcal{H}_k\big] &\le \|\bar x_k-x^*\|_2^2 - 2\alpha\eta\langle\bar x_k-x^*,\bar y_k\rangle - 2\alpha\Big\langle \bar x_k-x^*, \frac{u^\intercal}{n}\mathbb{E}[Q_k](\mathbf{y}_k-v\bar y_k)\Big\rangle\\
&\quad + \frac{2\alpha^2}{n^2}\|\mathbb{E}[Q_k^\intercal uu^\intercal Q_k]\|_2\|\mathbf{y}_k-v\bar y_k\|_2^2 + \frac{2\alpha^2}{n^2}\|\mathbb{E}[Q_k^\intercal uu^\intercal Q_k]\|_2\|v\|_2^2\|\bar y_k\|_2^2\\
&\quad + \frac{1}{n^2}\|\mathbb{E}[R_k^\intercal uu^\intercal R_k]\|_2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2^2.
\end{aligned}$$

Notice that
$$\begin{aligned}
\|\bar x_k-x^*\|_2^2 - 2\alpha\eta\langle\bar x_k-x^*,\bar y_k\rangle &= \|\bar x_k-x^*\|_2^2 - 2\alpha\eta\langle\bar x_k-x^*, g_k\rangle - 2\alpha\eta\langle\bar x_k-x^*, \bar y_k-g_k\rangle\\
&\le (1-2\alpha\eta\mu)\|\bar x_k-x^*\|_2^2 - 2\alpha\eta\langle\bar x_k-x^*,\bar y_k-g_k\rangle\\
&\le (1-2\alpha\eta\mu)\|\bar x_k-x^*\|_2^2 + 2\alpha\eta\|\bar x_k-x^*\|_2\|\bar y_k-g_k\|_2,
\end{aligned}$$
and
$$\|\bar y_k\|_2^2 = \|\bar y_k-g_k+g_k\|_2^2 \le 2\|\bar y_k-g_k\|_2^2 + 2\|g_k\|_2^2.$$

We have
$$\begin{aligned}
\mathbb{E}\big[\|\bar x_{k+1}-x^*\|_2^2\mid\mathcal{H}_k\big] &\le (1-2\alpha\eta\mu)\|\bar x_k-x^*\|_2^2 + 2\alpha\eta\|\bar x_k-x^*\|_2\|\bar y_k-g_k\|_2\\
&\quad + \frac{2\alpha}{n}\|\bar x_k-x^*\|_2\|u^\intercal\mathbb{E}[Q_k]\|_2\|\mathbf{y}_k-v\bar y_k\|_2\\
&\quad + \frac{2\alpha^2}{n^2}\|\mathbb{E}[Q_k^\intercal uu^\intercal Q_k]\|_2\|\mathbf{y}_k-v\bar y_k\|_2^2\\
&\quad + \frac{4\alpha^2}{n^2}\|\mathbb{E}[Q_k^\intercal uu^\intercal Q_k]\|_2\|v\|_2^2\big(\|\bar y_k-g_k\|_2^2+\|g_k\|_2^2\big)\\
&\quad + \frac{1}{n^2}\|\mathbb{E}[R_k^\intercal uu^\intercal R_k]\|_2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2^2.
\end{aligned}$$

Note that
$$2\alpha\eta\|\bar x_k-x^*\|_2\|\bar y_k-g_k\|_2 \le \alpha\eta\Big(\frac{\mu}{2}\|\bar x_k-x^*\|_2^2 + \frac{2}{\mu}\|\bar y_k-g_k\|_2^2\Big),$$
and
$$\frac{2\alpha}{n}\|\bar x_k-x^*\|_2\|u^\intercal\mathbb{E}[Q_k]\|_2\|\mathbf{y}_k-v\bar y_k\|_2 \le \alpha\eta\Big(\frac{\mu}{2}\|\bar x_k-x^*\|_2^2 + \frac{2}{\eta^2\mu n^2}\|u^\intercal\mathbb{E}[Q_k]\|_2^2\|\mathbf{y}_k-v\bar y_k\|_2^2\Big).$$

From Lemma 2 we obtain
$$\begin{aligned}
\mathbb{E}\big[\|\bar x_{k+1}-x^*\|_2^2\mid\mathcal{H}_k\big] &\le (1-\alpha\eta\mu)\|\bar x_k-x^*\|_2^2 + \frac{2\alpha\eta}{\mu}\|\bar y_k-g_k\|_2^2 + \frac{2\alpha}{\eta\mu n^2}\|u^\intercal\mathbb{E}[Q_k]\|_2^2\|\mathbf{y}_k-v\bar y_k\|_2^2\\
&\quad + \frac{2\alpha^2}{n^2}\|\mathbb{E}[Q_k^\intercal uu^\intercal Q_k]\|_2\|\mathbf{y}_k-v\bar y_k\|_2^2 + \frac{4\alpha^2}{n^2}\|\mathbb{E}[Q_k^\intercal uu^\intercal Q_k]\|_2\|v\|_2^2\big(\|\bar y_k-g_k\|_2^2+\|g_k\|_2^2\big)\\
&\quad + \frac{1}{n^2}\|\mathbb{E}[R_k^\intercal uu^\intercal R_k]\|_2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2^2\\
&\le \Big[1-\alpha\eta\mu + \frac{4\alpha^2}{n^2}\|\mathbb{E}[Q_k^\intercal uu^\intercal Q_k]\|_2\|v\|_2^2 L^2\Big]\|\bar x_k-x^*\|_2^2\\
&\quad + \Big(\frac{2\alpha\eta L^2}{\mu n} + \frac{4\alpha^2 L^2}{n^3}\|\mathbb{E}[Q_k^\intercal uu^\intercal Q_k]\|_2\|v\|_2^2 + \frac{1}{n^2}\|\mathbb{E}[R_k^\intercal uu^\intercal R_k]\|_2\Big)\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2^2\\
&\quad + \Big(\frac{2\alpha}{\eta\mu n^2}\|u^\intercal\mathbb{E}[Q_k]\|_2^2 + \frac{2\alpha^2}{n^2}\|\mathbb{E}[Q_k^\intercal uu^\intercal Q_k]\|_2\Big)\|\mathbf{y}_k-v\bar y_k\|_2^2.
\end{aligned}$$

B. Proof of Lemma 16

The first inequality follows directly from Lemma 14 and Lemma 15.

For the second inequality, note that from (25) we have
$$\begin{aligned}
\|\mathbf{x}_{k+1}-\mathbf{1}\bar x_{k+1}\|_S^2 &\le \Big[\Big\|\Big(I-\frac{\mathbf{1}u^\intercal}{n}\Big)R_k\mathbf{x}_k\Big\|_S + \Big\|\Big(I-\frac{\mathbf{1}u^\intercal}{n}\Big)\alpha Q_k\mathbf{y}_k\Big\|_S\Big]^2\\
&= \Big\|\Big(I-\frac{\mathbf{1}u^\intercal}{n}\Big)R_k\mathbf{x}_k\Big\|_S^2 + \alpha^2\Big\|\Big(I-\frac{\mathbf{1}u^\intercal}{n}\Big)Q_k\mathbf{y}_k\Big\|_S^2 + 2\alpha\Big\|\Big(I-\frac{\mathbf{1}u^\intercal}{n}\Big)R_k\mathbf{x}_k\Big\|_S\Big\|\Big(I-\frac{\mathbf{1}u^\intercal}{n}\Big)Q_k\mathbf{y}_k\Big\|_S.
\end{aligned}$$
In light of Lemma 12,
$$\mathbb{E}\big[\|\mathbf{x}_{k+1}-\mathbf{1}\bar x_{k+1}\|_S^2\mid\mathcal{H}_k\big] \le \Big(\sigma_R+\frac{1-\sigma_R}{2}\Big)\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_S^2 + \alpha^2\Big(1+\frac{2\sigma_R}{1-\sigma_R}\Big)\|V_Q\|^2\|\mathbf{y}_k\|_S^2 = \frac{1+\sigma_R}{2}\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_S^2 + \alpha^2\Big(\frac{1+\sigma_R}{1-\sigma_R}\Big)\|V_Q\|^2\|\mathbf{y}_k\|_S^2.$$
Since
$$\|\mathbf{y}_k\|_S^2 = \|\mathbf{y}_k-v\bar y_k + v(\bar y_k-g_k) + vg_k\|_S^2 \le 3\big[\|\mathbf{y}_k-v\bar y_k\|_S^2 + \|v\|_S^2\|\bar y_k-g_k\|_2^2 + \|v\|_S^2\|g_k\|_2^2\big] \le 3\Big[\|\mathbf{y}_k-v\bar y_k\|_S^2 + \frac{L^2}{n}\|v\|_S^2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2^2 + L^2\|v\|_S^2\|\bar x_k-x^*\|_2^2\Big],$$
we have
$$\begin{aligned}
\mathbb{E}\big[\|\mathbf{x}_{k+1}-\mathbf{1}\bar x_{k+1}\|_S^2\mid\mathcal{H}_k\big] &\le \frac{1+\sigma_R}{2}\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_S^2 + 3\alpha^2\Big(\frac{1+\sigma_R}{1-\sigma_R}\Big)\|V_Q\|^2\Big[\|\mathbf{y}_k-v\bar y_k\|_S^2 + \frac{L^2}{n}\|v\|_S^2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2^2 + L^2\|v\|_S^2\|\bar x_k-x^*\|_2^2\Big]\\
&\le \Big[\frac{1+\sigma_R}{2} + \frac{3\alpha^2 L^2}{n}\Big(\frac{1+\sigma_R}{1-\sigma_R}\Big)\|V_Q\|^2\|v\|_S^2\Big]\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_S^2\\
&\quad + 3\alpha^2 L^2\Big(\frac{1+\sigma_R}{1-\sigma_R}\Big)\|V_Q\|^2\|v\|_S^2\|\bar x_k-x^*\|_2^2 + 3\alpha^2\Big(\frac{1+\sigma_R}{1-\sigma_R}\Big)\|V_Q\|^2\delta_{S,D}\|\mathbf{y}_k-v\bar y_k\|_D^2.
\end{aligned}$$

For the last inequality, from (26) and Lemma 13 we have
$$\begin{aligned}
\Big\|\mathbf{y}_{k+1}-v\frac{\mathbf{1}^\intercal}{n}\mathbf{y}_{k+1}\Big\|_D^2 &\le \Big(\Big\|\Big(I-\frac{v\mathbf{1}^\intercal}{n}\Big)C_k\mathbf{y}_k\Big\|_D + \Big\|\Big(I-\frac{v\mathbf{1}^\intercal}{n}\Big)[\nabla F(\mathbf{x}_{k+1})-\nabla F(\mathbf{x}_k)]\Big\|_D\Big)^2\\
&\le \Big(1+\frac{1-\sigma_C}{2\sigma_C}\Big)\big(\sigma_C\|\mathbf{y}_k-v\bar y_k\|_D^2 + 2\gamma^2\|V_E\|^2\|v\bar y_k\|_D^2\big) + \Big(1+\frac{2\sigma_C}{1-\sigma_C}\Big)\Big\|\Big(I-\frac{v\mathbf{1}^\intercal}{n}\Big)[\nabla F(\mathbf{x}_{k+1})-\nabla F(\mathbf{x}_k)]\Big\|_D^2\\
&= \frac{1+\sigma_C}{2}\|\mathbf{y}_k-v\bar y_k\|_D^2 + \frac{1+\sigma_C}{\sigma_C}\gamma^2\|V_E\|^2\|v\bar y_k\|_D^2 + \Big(\frac{1+\sigma_C}{1-\sigma_C}\Big)\Big\|\Big(I-\frac{v\mathbf{1}^\intercal}{n}\Big)[\nabla F(\mathbf{x}_{k+1})-\nabla F(\mathbf{x}_k)]\Big\|_D^2.
\end{aligned}$$

Notice that by (22) and Assumption 1,
$$\begin{aligned}
\|\nabla F(\mathbf{x}_{k+1})-\nabla F(\mathbf{x}_k)\|_D^2 &\le \delta_{D,2}^2 L^2\|\mathbf{x}_{k+1}-\mathbf{x}_k\|_2^2 = \delta_{D,2}^2 L^2\|R_k\mathbf{x}_k-\alpha Q_k\mathbf{y}_k-\mathbf{x}_k\|_2^2\\
&\le 2\delta_{D,2}^2 L^2\|(R_k-I)(\mathbf{x}_k-\mathbf{1}\bar x_k)\|_2^2 + 2\alpha^2\delta_{D,2}^2 L^2\|Q_k\|_2^2\|\mathbf{y}_k\|_2^2\\
&\le 2\delta_{D,2}^2 L^2\|R_k-I\|_2^2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2^2 + 4\alpha^2\delta_{D,2}^2 L^2\|Q_k\|_2^2\big(\|\mathbf{y}_k-v\bar y_k\|_2^2 + \|v\bar y_k\|_2^2\big).
\end{aligned}$$

We have
$$\begin{aligned}
\|\mathbf{y}_{k+1}-v\bar y_{k+1}\|_D^2 &\le \frac{1+\sigma_C}{2}\|\mathbf{y}_k-v\bar y_k\|_D^2 + \frac{1+\sigma_C}{\sigma_C}\gamma^2\|V_E\|^2\|v\bar y_k\|_D^2\\
&\quad + \delta_{D,2}^2 L^2\Big(\frac{1+\sigma_C}{1-\sigma_C}\Big)\Big\|I-\frac{v\mathbf{1}^\intercal}{n}\Big\|_D^2\big[2\|R_k-I\|_2^2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2^2 + 4\alpha^2\|Q_k\|_2^2\big(\|\mathbf{y}_k-v\bar y_k\|_2^2+\|v\bar y_k\|_2^2\big)\big]\\
&\le \Big[\frac{1+\sigma_C}{2} + 4\alpha^2\delta_{D,2}^2 L^2\Big(\frac{1+\sigma_C}{1-\sigma_C}\Big)\Big\|I-\frac{v\mathbf{1}^\intercal}{n}\Big\|_D^2\|Q_k\|_2^2\Big]\|\mathbf{y}_k-v\bar y_k\|_2^2\\
&\quad + 2\delta_{D,2}^2 L^2\Big(\frac{1+\sigma_C}{1-\sigma_C}\Big)\Big\|I-\frac{v\mathbf{1}^\intercal}{n}\Big\|_D^2\|R_k-I\|_2^2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2^2\\
&\quad + \Big[\frac{1+\sigma_C}{\sigma_C}\gamma^2\|V_E\|^2\|v\|_D^2 + 4\alpha^2\delta_{D,2}^2 L^2\Big(\frac{1+\sigma_C}{1-\sigma_C}\Big)\Big\|I-\frac{v\mathbf{1}^\intercal}{n}\Big\|_D^2\|Q_k\|_2^2\|v\|_2^2\Big]\|\bar y_k\|_2^2\\
&\le \Big[\frac{1+\sigma_C}{2} + 4\alpha^2\delta_{D,2}^2 L^2\Big(\frac{1+\sigma_C}{1-\sigma_C}\Big)\Big\|I-\frac{v\mathbf{1}^\intercal}{n}\Big\|_D^2\|Q_k\|_2^2\Big]\|\mathbf{y}_k-v\bar y_k\|_D^2\\
&\quad + 2\delta_{D,2}^2 L^2\Big(\frac{1+\sigma_C}{1-\sigma_C}\Big)\Big\|I-\frac{v\mathbf{1}^\intercal}{n}\Big\|_D^2\|R_k-I\|_2^2\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_S^2\\
&\quad + 2L^2\Big[\frac{1+\sigma_C}{\sigma_C}\gamma^2\|V_E\|^2\|v\|_D^2 + 4\alpha^2\delta_{D,2}^2 L^2\Big(\frac{1+\sigma_C}{1-\sigma_C}\Big)\Big\|I-\frac{v\mathbf{1}^\intercal}{n}\Big\|_D^2\|Q_k\|_2^2\|v\|_2^2\Big]\Big(\frac{1}{n}\|\mathbf{x}_k-\mathbf{1}\bar x_k\|_2^2 + \|\bar x_k-x^*\|_2^2\Big).
\end{aligned}$$

REFERENCES

[1] S. Pu, W. Shi, J. Xu, and A. Nedić, "A push-pull gradient method for distributed optimization in networks," in Proceedings of the 57th IEEE Conference on Decision and Control (CDC), 2018.

[2] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "On the linear convergence of the ADMM in decentralized consensus optimization," IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750-1761, 2014.

[3] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944-966, 2015.

[4] A. Olshevsky, "Linear time average consensus and distributed optimization on fixed graphs," SIAM Journal on Control and Optimization, vol. 55, no. 6, pp. 3990-4014, 2017.

[5] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," IEEE Transactions on Control of Network Systems, 2017.

[6] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié, "Optimal algorithms for smooth and strongly convex distributed optimization in networks," in International Conference on Machine Learning, 2017, pp. 3027-3036.

[7] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48-61, 2009.

[8] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, "Convergence of asynchronous distributed gradient methods over stochastic networks," IEEE Transactions on Automatic Control, 2017.

[9] A. Nedić, A. Olshevsky, and W. Shi, "Achieving geometric convergence for distributed optimization over time-varying graphs," SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597-2633, 2017.

[10] A. Nedić and A. Olshevsky, "Distributed optimization over time-varying directed graphs," IEEE Transactions on Automatic Control, vol. 60, no. 3, pp. 601-615, 2015.

[11] ——, "Stochastic gradient-push for strongly convex functions on time-varying directed graphs," IEEE Transactions on Automatic Control, vol. 61, no. 12, pp. 3936-3947, 2016.

[12] J. Zeng and W. Yin, "ExtraPush for convex smooth decentralized optimization over directed networks," Journal of Computational Mathematics, vol. 35, no. 4, 2017.

[13] C. Xi and U. A. Khan, "DEXTRA: A fast algorithm for optimization over directed graphs," IEEE Transactions on Automatic Control, vol. 62, no. 10, pp. 4980-4993, 2017.

[14] C. Xi, V. S. Mai, R. Xin, E. H. Abed, and U. A. Khan, "Linear convergence in optimization over directed graphs with row-stochastic matrices," IEEE Transactions on Automatic Control, 2018.

[15] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2011.

[16] Z. Peng, Y. Xu, M. Yan, and W. Yin, "ARock: An algorithmic framework for asynchronous parallel coordinate updates," SIAM Journal on Scientific Computing, vol. 38, no. 5, pp. A2851-A2879, 2016.

[17] A. Nedić, A. Olshevsky, and M. G. Rabbat, "Network topology and communication-computation tradeoffs in decentralized optimization," Proceedings of the IEEE, vol. 106, no. 5, pp. 953-976, 2018.

[18] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, "Randomized gossip algorithms," IEEE Transactions on Information Theory, vol. 52, no. 6, pp. 2508-2530, 2006.

[19] J. Lu, C. Y. Tang, P. R. Regier, and T. D. Bow, "Gossip algorithms for convex consensus optimization over networks," IEEE Transactions on Automatic Control, vol. 56, no. 12, pp. 2917-2923, 2011.

[20] S. Lee and A. Nedić, "Asynchronous gossip-based random projection algorithms over networks," IEEE Transactions on Automatic Control, vol. 61, no. 4, pp. 953-968, 2016.

[21] A. S. Mathkar and V. S. Borkar, "Nonlinear gossip," SIAM Journal on Control and Optimization, vol. 54, no. 3, pp. 1535-1557, 2016.

[22] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, "A decentralized second-order method with exact linear convergence rate for consensus optimization," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 507-522, 2016.

[23] D. Varagnolo, F. Zanella, A. Cenedese, G. Pillonetto, and L. Schenato, "Newton-Raphson consensus for distributed convex optimization," IEEE Transactions on Automatic Control, vol. 61, no. 4, pp. 994-1009, 2016.

[24] C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedić, "Optimal algorithms for distributed optimization," arXiv preprint arXiv:1712.00232, 2017.

[25] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, "Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes," in Proceedings of the 54th IEEE Conference on Decision and Control (CDC), 2015, pp. 2055-2060.

[26] A. Nedić, A. Olshevsky, W. Shi, and C. A. Uribe, "Geometrically convergent distributed optimization with uncoordinated step-sizes," in American Control Conference (ACC), 2017, pp. 3950-3955.

[27] P. Di Lorenzo and G. Scutari, "NEXT: In-network nonconvex optimization," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120-136, 2016.

[28] W. Shi, Q. Ling, G. Wu, and W. Yin, "A proximal gradient algorithm for decentralized composite optimization," IEEE Transactions on Signal Processing, vol. 63, no. 22, pp. 6013-6023, 2015.

[29] Z. Li, W. Shi, and M. Yan, "A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates," IEEE Transactions on Signal Processing, vol. 67, no. 17, pp. 4494-4506, 2019.

[30] T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed, "Decentralized consensus optimization with asynchrony and delays," in 50th Asilomar Conference on Signals, Systems and Computers, 2016, pp. 992-996.

[31] S. Pu and A. Garcia, "Swarming for faster convergence in stochastic optimization," SIAM Journal on Control and Optimization, vol. 56, no. 4, pp. 2997-3020, 2018.

[32] S. Pu and A. Nedić, "Distributed stochastic gradient tracking methods," arXiv preprint arXiv:1805.11454, 2018.

[33] D. Kempe, A. Dobra, and J. Gehrke, "Gossip-based computation of aggregate information," in Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, 2003, pp. 482-491.

[34] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, "Push-sum distributed dual averaging for convex optimization," in Proceedings of the 51st IEEE Conference on Decision and Control (CDC), 2012, pp. 5453-5458.

[35] R. Xin, C. Xi, and U. A. Khan, "FROST: Fast row-stochastic optimization with uncoordinated step-sizes," EURASIP Journal on Advances in Signal Processing, vol. 2019, no. 1, pp. 1-14, 2019.

[36] R. Xin and U. A. Khan, "A linear algorithm for optimization over directed graphs with geometric convergence," IEEE Control Systems Letters, vol. 2, no. 3, pp. 315-320, 2018.

[37] C. Godsil and G. F. Royle, Algebraic Graph Theory. Springer Science & Business Media, 2013, vol. 207.

[38] K. Cai and H. Ishii, "Average consensus on general strongly connected digraphs," Automatica, vol. 48, no. 11, pp. 2750-2761, 2012.

[39] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1990.

[40] A. Sayed, "Diffusion adaptation over networks," Academic Press Library in Signal Processing, vol. 3, pp. 323-454, 2013.

[41] W. Ren and R. W. Beard, "Consensus seeking in multiagent systems under dynamically changing interaction topologies," IEEE Transactions on Automatic Control, vol. 50, no. 5, pp. 655-661, 2005.

Shi Pu is currently an assistant professor in the Institute for Data and Decision Analytics, The Chinese University of Hong Kong, Shenzhen, China. He received a B.S. degree in Engineering Mechanics from Peking University in 2012, and a Ph.D. degree in Systems Engineering from the University of Virginia in 2016. He was a postdoctoral associate at the University of Florida from 2016 to 2017, a postdoctoral scholar at Arizona State University from 2017 to 2018, and a postdoctoral associate at Boston University from 2018 to 2019. His research interests include distributed optimization, network science, machine learning, and game theory.

Wei (Wilbur) Shi was a postdoc in the Electrical and Computer Engineering Department of Princeton University, Princeton, NJ, USA. He obtained his Ph.D. in Control Science and Engineering from the University of Science and Technology of China, Hefei, Anhui, China. He was a postdoc in the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL, USA. His current research interests lie in optimization, learning, and control, with applications in cyber-physical systems and the Internet of Things. He was a recipient of the Young Author Best Paper Award of the IEEE Signal Processing Society.

Jinming Xu received the B.S. degree in mechanical engineering from Shandong University, China, in 2009 and the Ph.D. degree in Electrical and Electronic Engineering from Nanyang Technological University (NTU), Singapore, in 2016. He was a research fellow of the EXQUISITUS center at NTU from 2016 to 2017; he also received postdoctoral training in the Ira A. Fulton Schools of Engineering, Arizona State University, from 2017 to 2018, and the School of Industrial Engineering, Purdue University, from 2018 to 2019. Currently, he is an assistant professor with the College of Control Science and Engineering at Zhejiang University, China. His research interests include distributed optimization and control, machine learning, and network science.

Angelia Nedić received a Ph.D. from Moscow State University, Moscow, Russia, in Computational Mathematics and Mathematical Physics (1994), and a Ph.D. from the Massachusetts Institute of Technology, Cambridge, USA, in Electrical Engineering and Computer Science (2002). She worked as a senior engineer in BAE Systems North America, Advanced Information Technology Division, at Burlington, MA. Currently, she is a faculty member of the School of Electrical, Computer and Energy Engineering at Arizona State University, Tempe. Prior to joining Arizona State University, she was a Willard Scholar faculty member at the University of Illinois at Urbana-Champaign. She is a recipient (jointly with her co-authors) of the Best Paper Awards at the Winter Simulation Conference 2013 and at the International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt) 2015. Her general research interests are in optimization, large-scale complex systems dynamics, variational inequalities, and games.