Distributed Subgradient Methods for
Multi-agent Optimization
Asu Ozdaglar
February 2009
Department of Electrical Engineering & Computer Science
Massachusetts Institute of Technology, USA
MIT Laboratory for Information and Decision Systems 1
Motivation
• Increasing interest in distributed control and coordination of networks
consisting of multiple autonomous agents
• Motivated by many emerging networking applications, such as ad hoc wireless
communication networks and sensor networks, characterized by:
– Lack of centralized control and access to information
– Time-varying connectivity
• Control and optimization algorithms for such networks should be:
– Completely distributed, relying only on local information
– Robust against changes in the network topology
Multi-Agent Optimization Problem
Goal: Develop a general computational model
for cooperatively optimizing a global system
objective through local interactions and
computations in a multi-agent system
• Global objective is a combination of
individual agent performance measures
Examples:
• Consensus problems: Alignment of estimates maintained by different agents
– Control of moving vehicles (UAVs), computing averages of initial values
• Parameter estimation in distributed sensor networks:
– Regression-based estimates using local sensor measurements
• Congestion control in data networks with heterogeneous users
Related Literature
• Parallel and Distributed Optimization Algorithms:
– General computational model for distributed asynchronous optimization
∗ Tsitsiklis 84, Bertsekas and Tsitsiklis 95
• Consensus and Cooperative Control:
– Analysis of group behavior (flocking) in dynamical-biological systems
∗ Vicsek 95, Reynolds 87, Toner and Tu 98
– Mathematical models of consensus and averaging
∗ Jadbabaie et al. 03, Olfati-Saber and Murray 04, Boyd et al. 05
• Game Theory/Mechanism Design for Distributed Cooperative Control:
– Assign each agent a local utility function such that:
∗ The equilibrium of the resulting game is the same as (or close to) the
global optimum
∗ Derive learning algorithms that reach equilibrium
– Marden, Arslan, Shamma 07 used this approach for the consensus problem
to deal with constraints
This Talk
• Development of a distributed subgradient method for multi-agent optimization
[Nedic, Ozdaglar 08]
– Convergence analysis and performance bounds for time-varying topologies
under general connectivity assumptions
• Effects of local constraints [Nedic, Ozdaglar, Parrilo 08]
• Effects of networked-system constraints: quantization, delay, asynchronism
Model
• We consider a network of m agents with node set V = {1, . . . , m}
– Agents want to cooperatively solve

    min_{x ∈ R^n} Σ_{i=1}^m f_i(x)

– The function f_i : R^n → R is a convex objective function known only by agent i

[Figure: network of m agents, with agent i holding the local objective f_i(x_1, . . . , x_n)]
• Agents update and send their information at discrete times t_0, t_1, t_2, . . .
• We use x_i(k) ∈ R^n to denote the estimate of agent i at time t_k

Agent Update Rule:
• Agent i locally minimizes f_i according to:

    x_i(k+1) = Σ_{j=1}^m a_ij(k) x_j(k) − α_i(k) d_i(k)

where the a_ij(k) are weights, α_i(k) is a stepsize, and d_i(k) is a subgradient of f_i at x_i(k)
• a_i(k) = (a_i1(k), . . . , a_im(k)) represents agent i's time-varying neighbor weights at slot k
• The model includes consensus as a special case (f_i(x) = 0 for all i)
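As a rough illustration, one synchronous round of the update rule can be sketched as follows (a minimal sketch, not the paper's implementation: it assumes a fixed complete communication graph with equal weights, a constant stepsize, and illustrative quadratic objectives f_i(x) = ‖x − c_i‖²):

```python
import numpy as np

def distributed_subgradient_step(x, A, subgrads, alpha):
    """One synchronous step of x_i(k+1) = sum_j a_ij(k) x_j(k) - alpha * d_i(k)."""
    # x: (m, n) array of agent estimates; A: (m, m) row-stochastic weight matrix
    d = np.array([g(x[i]) for i, g in enumerate(subgrads)])
    return A @ x - alpha * d

# Illustrative setup: agent i privately holds f_i(x) = ||x - c_i||^2,
# so sum_i f_i is minimized at the average of the c_i.
c = np.array([[0.0], [1.0], [2.0]])
subgrads = [lambda x, ci=ci: 2.0 * (x - ci) for ci in c]  # gradient of f_i
A = np.full((3, 3), 1.0 / 3.0)                            # equal weights, complete graph
x = np.zeros((3, 1))
for _ in range(2000):
    x = distributed_subgradient_step(x, A, subgrads, alpha=0.01)
# With a constant stepsize, all agents end up in an O(alpha) neighborhood
# of the global optimum x* = mean(c) = 1.
```

With a constant stepsize the method converges only to a neighborhood of the optimum; the error-floor tradeoff is quantified in the convergence result later in the deck.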
Linear Dynamics and Transition Matrices
• We let A(s) denote the matrix whose ith column is the vector a_i(s), and introduce the transition matrices

    Φ(k, s) = A(s)A(s+1) · · · A(k−1)A(k) for all k ≥ s

• We use these matrices to relate x_i(k+1) to the x_j(s) at a time s ≤ k:

    x_i(k+1) = Σ_{j=1}^m [Φ(k, s)]_ij x_j(s) − Σ_{r=s}^{k−1} ( Σ_{j=1}^m [Φ(k, r+1)]_ij α_j(r) d_j(r) ) − α_i(k) d_i(k).
• We analyze convergence properties of the distributed method by establishing:
– Convergence of transition matrices Φ(k, s) (consensus part)
– Convergence of an approximate subgradient method (effect of optimization)
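Forming Φ(k, s) numerically is a direct matrix product (a sketch; following the slide's convention that column i of A(s) is the stochastic weight vector a_i(s), each A(s) is column-stochastic, and so is the product):

```python
import numpy as np

def transition_matrix(A_seq, k, s):
    """Phi(k, s) = A(s) A(s+1) ... A(k-1) A(k), for k >= s."""
    Phi = A_seq[s]
    for t in range(s + 1, k + 1):
        Phi = Phi @ A_seq[t]
    return Phi

rng = np.random.default_rng(0)
# Random column-stochastic weight matrices: each column is a stochastic vector.
A_seq = []
for _ in range(10):
    M = rng.random((4, 4)) + 0.1
    A_seq.append(M / M.sum(axis=0, keepdims=True))

Phi = transition_matrix(A_seq, k=9, s=2)
# The product of column-stochastic matrices is again column-stochastic.
```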
Assumptions
Assumption (Weights) At all times k, we have:
(a) The weights a_ij(k) are nonnegative for all agents i, j.
(b) There exists a scalar η ∈ (0, 1) such that for all agents i, a_ii(k) ≥ η, and a_ij(k) ≥ η for all agents j communicating with agent i at time k.
(c) The agent weight vectors a_i(k) = [a_i1(k), . . . , a_im(k)]′ are stochastic, i.e., Σ_{j=1}^m a_ij(k) = 1 for all i and k.

Example: Equal neighbor weights

    a_ij(k) = 1 / (n_i(k) + 1)

where n_i(k) is the number of agents communicating with agent i at time k
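The equal-neighbor rule can be built from a snapshot of the communication graph (a sketch; the boolean adjacency-matrix representation is an illustrative assumption, not part of the slides):

```python
import numpy as np

def equal_neighbor_weights(adj):
    """Row i: weight 1/(n_i + 1) on agent i itself and on each of its n_i neighbors."""
    m = adj.shape[0]
    A = np.zeros((m, m))
    for i in range(m):
        n_i = int(adj[i].sum())        # number of agents communicating with i
        A[i, adj[i]] = 1.0 / (n_i + 1)
        A[i, i] = 1.0 / (n_i + 1)
    return A

# Path graph 1 - 2 - 3 with symmetric communication
adj = np.array([[False, True, False],
                [True, False, True],
                [False, True, False]])
A = equal_neighbor_weights(adj)
# Rows are stochastic, and every positive entry is >= 1/m, so the
# Weights assumption holds with eta = 1/m.
```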
Information Exchange
• Agent i influences every other agent infinitely often – connectivity
• Agent j sends its information to a neighboring agent i within a bounded time interval – bounded intercommunication interval

At slot k, information exchange may be represented by a directed graph (V, E_k) with

    E_k = {(j, i) | a_ij(k) > 0}

Assumption (Connectivity) The graph (V, E_∞) is connected, where

    E_∞ = {(j, i) | (j, i) ∈ E_k for infinitely many indices k}.

Assumption (Bounded Intercommunication Interval) There is some B ≥ 1 such that

    (j, i) ∈ E_k ∪ E_{k+1} ∪ · · · ∪ E_{k+B−1} for all (j, i) ∈ E_∞ and all k ≥ 0.
Properties of Transition Matrices
Lemma: Let the Weights and Information Exchange assumptions hold. We then have

    [Φ(s + (m−1)B − 1, s)]_ij ≥ η^{(m−1)B} for all s, i, and j,

where η is the lower bound on the weights and B is the intercommunication interval bound.

• We introduce the matrices D_k(s) as follows: for a fixed s ≥ 0,

    D_k(s) = Φ′(s + kB_0 − 1, s + (k−1)B_0) for k = 1, 2, . . . ,

where B_0 = (m−1)B.
• By the previous lemma, all entries of D_k(s) are positive.
Convergence of Transition Matrices
Lemma: Let the Weights and Information Exchange assumptions hold. For each s ≥ 0, we have:
(a) The limit D(s) = lim_{k→∞} D_k(s) · · · D_1(s) exists.
(b) The limit D(s) has identical rows and the rows are stochastic.
(c) For every j, the entries [D_k(s) · · · D_1(s)]_ji, i = 1, . . . , m, converge to the same limit φ_j(s) as k → ∞ at a geometric rate:

    | [D_k(s) · · · D_1(s)]_ji − φ_j(s) | ≤ 2 (1 + η^{−B_0}) (1 − η^{B_0})^k

where η is the lower bound on the weights, B is the intercommunication interval bound, and B_0 = (m−1)B.
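The geometric contraction behind parts (a)–(c) is easy to observe numerically (a sketch, not the proof: random row-stochastic matrices with every entry at least η stand in for the D_k, and the row disagreement of the left products shrinks geometrically):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_row_stochastic(m, eta):
    """Row-stochastic matrix with every entry >= eta (requires eta <= 1/m)."""
    M = rng.random((m, m))
    M = M / M.sum(axis=1, keepdims=True)
    return eta * np.ones((m, m)) + (1.0 - m * eta) * M  # mix toward uniform weights

m, eta = 4, 0.05
P = np.eye(m)
spreads = []
for k in range(30):
    P = random_row_stochastic(m, eta) @ P            # D_k ... D_1
    spreads.append(np.max(P.max(axis=0) - P.min(axis=0)))  # row disagreement, per column
# The spread across rows decays geometrically, so the product approaches a
# matrix with identical (stochastic) rows, as the lemma states.
```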
Proof Outline
• We show that the sequence {(D_k · · · D_1)x} converges for every x ∈ R^m
• Consider the sequence {x_k} with x_k = D_k · · · D_1 x, and write x_k as

    x_k = z_k + c_k e, where c_k = min_{1≤i≤m} [x_k]_i

• Using the property that each entry of the matrix D_k is positive, we show

    ‖z_k‖_∞ ≤ (1 − η^{B_0})^k ‖z_0‖_∞ for all k.

Hence z_k → 0 at a geometric rate.
• We then show that the sequence {c_k} converges to some c ∈ R and use the contraction constant to establish the rate estimate
• The final relation follows by picking x = e_j, the jth unit vector
Convergence of Transition Matrices
Proposition: Let the Weights and Information Exchange assumptions hold. For each s ≥ 0, we have:
(a) The limit Φ(s) = lim_{k→∞} Φ(k, s) exists.
(b) The limit matrix Φ(s) has identical columns and the columns are stochastic, i.e.,

    Φ(s) = φ(s)e′,

where φ(s) ∈ R^m is a stochastic vector.
(c) For every i, the entries [Φ(k, s)]_ji, j = 1, . . . , m, converge to the same limit φ_i(s) as k → ∞ at a geometric rate, i.e., for all i, j and all k ≥ s,

    | [Φ(k, s)]_ji − φ_i(s) | ≤ 2 · (1 + η^{−B_0}) / (1 − η^{B_0}) · (1 − η^{B_0})^{(k−s)/B_0}

where η is the lower bound on the weights, B is the intercommunication interval bound, and B_0 = (m−1)B.

• The rate estimate in part (c) was recently improved in [Nedic, Olshevsky, Ozdaglar, Tsitsiklis 08]
Convergence Analysis
• Recall the evolution of the estimates (with α_i(s) = α):

    x_i(k+1) = Σ_{j=1}^m [Φ(k, s)]_ij x_j(s) − α Σ_{r=s}^{k−1} ( Σ_{j=1}^m [Φ(k, r+1)]_ij d_j(r) ) − α d_i(k).

• Proof method: Consider a "stopped process", where after some time k̄ the agents stop computing subgradients but keep exchanging their estimates: d_i(k) = 0 for all k ≥ k̄ and all i.
• It can be seen that the stopped process takes the form

    x_i(k+1) = Σ_{j=1}^m [Φ(k, 0)]_ij x_j(0) − α Σ_{r=1}^{k̄} ( Σ_{j=1}^m [Φ(k, r)]_ij d_j(r−1) )

• Using lim_{k→∞} [Φ(k, s)]_ij = φ_j(s) for all i, we see that the limit vector lim_{k→∞} x_i(k) exists, is independent of i, but depends on the stopping time k̄:

    lim_{k→∞} x_i(k) = y(k̄)
Behavior of “Stopped Process”
• The stopped process is described by:

    y(k+1) = y(k) − α Σ_{j=1}^m φ_j(k) d_j(k)

– Subgradients d_j(k) of f_j are computed at x_j(k) instead of y(k)
• This would correspond to an approximate subgradient method for minimizing Σ_j f_j(x), provided that the values φ_j(k) are the same for all j
• Assumption (Doubly Stochastic Weights) The matrices A(k) are doubly stochastic, i.e., Σ_{i=1}^m a_ij(k) = 1 for all j and k.
– Can be ensured when the agents exchange their information simultaneously and coordinate the selection of the weights a_ij(k)
• In this case the stopped process reduces to

    y(k+1) = y(k) − (α/m) Σ_{j=1}^m d_j(k)
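The role of double stochasticity is that mixing preserves the network average, which is what turns the stopped process into an (approximate) subgradient step on (1/m) Σ_j f_j. A quick numerical check (a sketch; the concrete ring-of-four weight matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# A symmetric, doubly stochastic weight matrix: lazy mixing on a ring of 4 agents
# (the concrete weights are illustrative, not from the slides).
A = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

x = rng.random((4, 3))   # 4 agents, estimates in R^3
mixed = A @ x
# Columns of A sum to 1, so the network average is invariant under mixing:
# mean_i (A x)_i = mean_i x_i.
```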
Main Convergence Result
Proposition: Let
• the Weights, Doubly Stochastic Weights, and Information Exchange assumptions hold,
• the subgradients of each f_i be uniformly bounded by a constant L, and
• max_{1≤j≤m} ‖x_j(0)‖ ≤ αL.
Then, for the averages x̂_i(k) of the estimates x_i(0), . . . , x_i(k−1), we have

    f(x̂_i(k)) ≤ f* + m dist²(y(0), X*) / (2αk) + αL² (C/2 + 2mC_1)

where f* is the optimal value of f = Σ_i f_i, X* is the optimal set, and

    y(0) = (1/m) Σ_i x_i(0),   C = 1 + 8mC_1,
    C_1 = 1 + [m / (1 − (1 − η^{B_0})^{1/B_0})] · (1 + η^{−B_0}) / (1 − η^{B_0}),   B_0 = (m−1)B

• The estimates hold per iteration
• Captures the tradeoff between accuracy and computational complexity
Proof Outline
• We analyze the stopped process:

    y(k+1) = y(k) − (α/m) Σ_{j=1}^m d_j(k)

• Establish an approximate convergence relation for the running averages

    ŷ(k) = (1/k) Σ_{h=0}^{k−1} y(h)

• Using the convergence rate estimate for the transition matrices Φ(k, s), we show that the ŷ(k) are close to the agents' averages x̂_i(k) = (1/k) Σ_{h=0}^{k−1} x_i(h):

    ‖ŷ(k) − x̂_i(k)‖ ≤ 2αLC_1 for all i and k

• Infer the result for the running averages x̂_i(k)
Constrained Consensus Problem
• Estimates of agent i restricted to lie in a closed convex constraint set Xi
• We assume that the intersection set X = ∩_{i=1}^m X_i is nonempty
• Examples where constraints are important:
– Motion planning and alignment problems, where each agent’s position is
limited to a certain region or range
– Distributed constrained multi-agent optimization
• This talk: Pure consensus problem in the presence of constraints
Projected Consensus Algorithm
• For the constrained consensus problem, we develop a consensus algorithm based on projections.
• We use x_i(k) ∈ X_i to denote the estimate of agent i at time t_k
• Given a closed convex set X ⊂ R^n and a vector y ∈ R^n, we define:

    dist(y, X) = min_{x∈X} ‖y − x‖,   P_X(y) = argmin_{x∈X} ‖y − x‖

Agent Update Rule:
• Agent i updates its estimate subject to its constraint set:

    x_i(k+1) = P_{X_i} [ Σ_{j=1}^m a_ij(k) x_j(k) ]

where a_i(k) = (a_i1(k), . . . , a_im(k))′ is the weight vector
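The projected consensus update can be sketched as follows (illustrative assumptions: scalar estimates, interval constraint sets so that the projection is a clip, and equal weights on a complete graph):

```python
import numpy as np

def projected_consensus_step(x, A, proj):
    """x_i(k+1) = P_{X_i}[ sum_j a_ij(k) x_j(k) ]."""
    w = A @ x                                   # mix neighbors' estimates
    return np.array([proj[i](w[i]) for i in range(len(proj))])

# Illustrative scalar example: X_i are intervals whose intersection is [1.5, 2].
bounds = [(0.0, 2.0), (1.0, 3.0), (1.5, 2.5)]
proj = [lambda v, lo=lo, hi=hi: np.clip(v, lo, hi) for lo, hi in bounds]
A = np.full((3, 3), 1.0 / 3.0)                  # equal weights, complete graph
x = np.array([0.0, 3.0, 2.5])
for _ in range(200):
    x = projected_consensus_step(x, A, proj)
# The agents agree on a common point lying in the intersection X = [1.5, 2].
```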
Connection to Alternating Projections
• The update rule is similar to the classical alternating projection method

Alternating/Cyclic Projection Methods:
Given closed convex sets X_1, X_2 ⊂ R^n, find a point in X = X_1 ∩ X_2:

    x(k+1) = P_{X_1}(x(k))
    x(k+2) = P_{X_2}(x(k+1))

Convergence analysis: X_i affine [Von Neumann; Aronszajn 50], X_i convex [Gubin, Polyak, Raik 67]

Constrained Consensus Algorithm:
Given closed convex sets X_1, X_2 ⊂ R^n:

    w_1(k) = Σ_j a_1j(k) x_j(k)
    w_2(k) = Σ_j a_2j(k) x_j(k)
    x_1(k+1) = P_{X_1}(w_1(k))
    x_2(k+1) = P_{X_2}(w_2(k))
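For comparison, the classical von Neumann scheme on two affine sets (a sketch; here X_1 is the horizontal axis and X_2 the diagonal in R², meeting only at the origin):

```python
import numpy as np

def proj_xaxis(p):
    """Projection onto X1 = {(a, 0)}: drop the second coordinate."""
    return np.array([p[0], 0.0])

def proj_diagonal(p):
    """Projection onto X2 = {(a, a)}: average the coordinates."""
    t = (p[0] + p[1]) / 2.0
    return np.array([t, t])

p = np.array([1.0, 1.0])
for _ in range(50):
    p = proj_diagonal(proj_xaxis(p))   # alternate P_{X2} o P_{X1}
# The iterates converge to the unique intersection point (0, 0),
# as in the affine case analyzed by Von Neumann.
```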
Convergence Analysis
• Analysis of the impact of constraints
• Recall the update rule

    x_i(k+1) = Σ_{j=1}^m a_ij(k) x_j(k) + e_i(k),

where the projection error e_i(k) is given by

    e_i(k) = P_{X_i} [ Σ_{j=1}^m a_ij(k) x_j(k) ] − Σ_{j=1}^m a_ij(k) x_j(k) = x_i(k+1) − w_i(k).

Proposition: Assume that the intersection X = ∩_{i=1}^m X_i is nonempty. Let the Doubly Stochastic Weights assumption hold. Then:

    lim_{k→∞} e_i(k) = 0 for all i.

• Implication: The analysis translates into the unconstrained case
• The proof relies on two lemmas
Convergence of Projection Error
Lemma: Let w ∈ R^n and let X ⊂ R^n be closed and convex. For all x ∈ X,

    ‖w − P_X(w)‖² ≤ ‖w − x‖² − ‖P_X(w) − x‖².

Lemma: Let the Doubly Stochastic Weights assumption hold. Then:
• ‖x_i(k+1) − x‖ ≤ ‖w_i(k) − x‖ for all x ∈ X_i (nonexpansiveness of the projection)
• Σ_{i=1}^m ‖w_i(k) − x‖² ≤ Σ_{i=1}^m ‖x_i(k) − x‖² for all x (by double stochasticity)
– Of independent interest in the convergence analysis of doubly stochastic matrices

Implication:

    lim_{k→∞} Σ_{i=1}^m ( ‖w_i(k) − x‖² − ‖x_i(k+1) − x‖² ) = 0 for all x ∈ X
Convergence of the Estimates
• Recall that the transition matrices are defined by

    Φ(k, s) = A(s)A(s+1) · · · A(k−1)A(k) for all s and k with k ≥ s.

• Using the transition matrices and the decomposition of the estimate evolution,

    x_i(k+1) = Σ_{j=1}^m [Φ(k, s)]_ij x_j(s) + Σ_{r=s+1}^k ( Σ_{j=1}^m [Φ(k, r)]_ij e_j(r−1) ) + e_i(k).

• Use a two-time-scale analysis, where we define a similar "stopped process"

    y(k) = (1/m) Σ_{j=1}^m x_j(s) + (1/m) Σ_{r=s+1}^k ( Σ_{j=1}^m e_j(r−1) )

• It can be seen that the agent estimates reach a consensus:

    lim_{k→∞} ‖x_i(k) − y(k)‖ = 0 for all i

Proposition: (Convergence) Let the Weights, Doubly Stochastic Weights, and Information Exchange assumptions hold. We have, for some x̃ ∈ X,

    lim_{k→∞} ‖x_i(k) − x̃‖ = 0 for all i.
Rate Analysis
Assumption (Interior Point): There exists a vector x̄ such that

    x̄ ∈ int(X) = int( ∩_{i=1}^m X_i ),

i.e., there exists some scalar δ > 0 such that {z | ‖z − x̄‖ ≤ δ} ⊂ X.

Proposition: Let the Interior Point assumption hold, and let the weight vectors a_i(k) be given by a_i(k) = (1/m, . . . , 1/m)′ for all i and k. Then,

    Σ_{i=1}^m ‖x_i(k) − x̃‖² ≤ (1 − 1/(4R²))^k Σ_{i=1}^m ‖x_i(0) − x̃‖² for all k ≥ 0,

where x̃ ∈ X is the limit of the sequence {x_i(k)} and R = (1/δ) Σ_{i=1}^m ‖x_i(0) − x̄‖, i.e., the convergence rate is linear.

• The linear convergence rate extends to time-varying weights with X_i = X
• The convergence rate for time-varying weights and different local constraints is open
Summary
• We presented a distributed subgradient method for multi-agent optimization
• The method can operate over networks with time-varying connectivity
• We proposed a constrained consensus policy for the case when agents have local
constraints on their estimates
• This policy has connections to the classical alternating projection method
• We analyzed the convergence and rate of convergence of the algorithms
Putting Things Together: Constrained Distributed
Optimization
• Each agent has a closed convex local constraint set X_i [Nedic, Ozdaglar, Parrilo 08]
• Agent i updates its estimate by

    v_i(k) = Σ_{j=1}^m a_ij(k) x_j(k)
    x_i(k+1) = P_{X_i} [ v_i(k) − α(k) d_i(k) ],

where α(k) > 0 is a diminishing stepsize sequence and d_i(k) is a subgradient of f_i(x) at x = v_i(k).

Results:
• The agent estimates generated by this algorithm converge to the same optimal solution when the weights are constant and equal, and when the weights are time-varying but X_i = X for all i.
• Convergence analysis in the general case is open!
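Putting the pieces together, one sweep of mix, subgradient step, project can be sketched as follows (illustrative assumptions throughout: quadratic f_i, a common box constraint X_i = X = [−5, 5], constant equal weights, and stepsize α(k) = 0.5/(k+1)):

```python
import numpy as np

def step(x, A, cs, k, lo=-5.0, hi=5.0):
    """v_i = sum_j a_ij x_j;  x_i <- P_{X_i}[v_i - alpha(k) d_i(k)] with d_i taken at v_i."""
    v = A @ x                     # consensus mixing
    alpha = 0.5 / (k + 1)         # diminishing stepsize
    d = 2.0 * (v - cs)            # subgradient of f_i(x) = (x - c_i)^2 at v_i
    return np.clip(v - alpha * d, lo, hi)   # projection onto the box X_i

cs = np.array([0.0, 1.0, 2.0])    # agent i privately holds f_i(x) = (x - c_i)^2
A = np.full((3, 3), 1.0 / 3.0)    # constant, equal, doubly stochastic weights
x = np.zeros(3)
for k in range(1000):
    x = step(x, A, cs, k)
# With a diminishing stepsize, all agents approach the constrained optimum
# of sum_i f_i over X, here x* = mean(cs) = 1.
```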
Optimization over Random Networks
• Existing work focuses on deterministic models of network connectivity (i.e.,
worst-case assumptions about intercommunication intervals)
• Time-varying connectivity modeled probabilistically [Lobel, Ozdaglar 08]
• Each component l of the agent estimates evolves according to

    x_l(k+1) = A(k) x_l(k) − α(k) d_l(k)
• We assume that the matrix A(k) is a random matrix drawn independently over
time from a probability space of stochastic matrices.
• This allows edges at any time k to be correlated.
Results:
• We establish properties of random transition matrices by constructing positive
probability events in which information propagates from every node to every
other node in the network
• We provide convergence analysis of a stochastic subgradient method under
different stepsize rules
Limits on Communication
Quantization Effects:
• Agents have access to quantized estimates due to storage and communication
bandwidth constraints [Nedic, Olshevsky, Ozdaglar, Tsitsiklis 07]
• Agent i updates its estimate by

    x_i(k+1) = Σ_{j=1}^m a_ij(k) x_j^Q(k) − α d_i(k)
    x_i^Q(k+1) = ⌊x_i(k+1)⌋

where ⌊·⌋ represents rounding down to the nearest integer multiple of 1/Q
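The quantized update can be sketched as follows (illustrative assumptions: pure averaging with d_i = 0, equal weights on a complete graph; the quantizer rounds down to multiples of 1/Q):

```python
import numpy as np

def quantize(x, Q):
    """Round down to the nearest integer multiple of 1/Q."""
    return np.floor(x * Q) / Q

m, Q = 3, 10
A = np.full((m, m), 1.0 / m)       # equal weights, complete graph
x = np.array([0.0, 0.55, 1.0])
xQ = quantize(x, Q)                 # agents store and transmit quantized values
for _ in range(20):
    x = A @ xQ                      # consensus step on quantized estimates (d_i = 0)
    xQ = quantize(x, Q)
# The stored values agree up to the quantization resolution 1/Q.
```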
Delay Effects:
• Agents have access to outdated estimates due to communication delays
[Bliman, Nedic, Ozdaglar 08]
• In the presence of delays, agent i updates its estimate by

    x_i(k+1) = Σ_{j=1}^m a_ij(k) x_j(k − t_ij(k)) − α d_i(k)

where t_ij(k) is the delay in passing information from agent j to agent i
Applications to Social Networks
• Growing interest in dynamics in a social network of communicating agents
• A specific example is learning and information aggregation over networks
• Consensus policies can be used to develop and analyze myopic/quasi-myopic
learning models [Golub, Jackson 07], [Acemoglu, Ozdaglar, ParandehGheibi
08]
• In the context of social networks, these rules may be too myopic:
– Alternative approach: Bayesian learning over social networks [Acemoglu,
Dahleh, Lobel, Ozdaglar 08]
References
• A. Nedic and A. Ozdaglar, “Distributed Subgradient Methods for Multi-agent
Optimization,” IEEE Transactions on Automatic Control, 2008.
• A. Ozdaglar, “Constrained Consensus and Alternating Projections,” Proc. of
Allerton Conference, 2007.
• A. Nedic, A. Olshevsky, A. Ozdaglar, and J.N. Tsitsiklis, “On Distributed
Averaging Algorithms and Quantization Effects,” IEEE Transactions on
Automatic Control, 2008.
• A. Nedic, A. Ozdaglar, and P.A. Parrilo, “Constrained Consensus and
Optimization in Multi-agent Networks,” submitted for publication, 2008.
• I. Lobel and A. Ozdaglar, “Distributed Subgradient Methods over Random
Networks,” Proc. of Allerton Conference, 2008.
• D. Acemoglu, M. Dahleh, I. Lobel, and A. Ozdaglar, “Bayesian Learning in
Social Networks,” submitted for publication, 2008.
• D. Acemoglu, A. Ozdaglar, and A. ParandehGheibi “Spread of (Mis)Information
over Social Networks,” working paper, 2008.
All papers can be downloaded from web.mit.edu/asuman/www