Distributed Subgradient Methods for Saddle-Point Problems

David Mateos-Nunez, Jorge Cortes
University of California, San Diego
{dmateosn,cortes}@ucsd.edu

Conference on Decision and Control, Osaka, Japan, December 17, 2015
Context: decentralized, peer-to-peer systems
Examples: sensor networks, medical diagnosis, formation control, recommender systems
General agenda for today

Review of (consensus-based) distributed convex optimization

Part 1:

- Distributed optimization with separable constraints via agreement on the Lagrange multipliers
- General saddle-point problems with explicit agreement

Part 2:

- Convex-concave problems not arising from Lagrangians, e.g., with a strictly concave part
- Distributed low-rank matrix completion through a saddle-point characterization of the nuclear norm
Review: consensus based distributed convex optimization
x* ∈ argmin_{x ∈ R^d} ∑_{i=1}^N f^i(x)    (basic unconstrained problem)
Agent i has access to f^i
Agent i can share its estimate of x∗ with “neighboring” agents
[Figure: network of five agents with local functions f^1, ..., f^5]

A = [ 0     0     a_13  a_14  0    ]
    [ a_21  0     0     0     a_25 ]
    [ a_31  a_32  0     0     0    ]    (adjacency matrix)
    [ a_41  0     0     0     0    ]
    [ 0     a_52  0     0     0    ]
Parallel computations: Tsitsiklis 84, Bertsekas and Tsitsiklis 95
Consensus: Jadbabaie et al. 03, Olfati-Saber, Murray 04, Boyd et al. 05
Distributed multi-agent optimization: A. Nedic and A. Ozdaglar 07
Review: the Laplacian matrix
L = diag(A·1) − A = [ a_13+a_14   0           −a_13       −a_14   0     ]
                    [ −a_21       a_21+a_25   0           0       −a_25 ]
                    [ −a_31       −a_32       a_31+a_32   0       0     ]
                    [ −a_41       0           0           a_41    0     ]
                    [ 0           −a_52       0           0       a_52  ]
Nullspace is the agreement subspace ⇔ the graph has a spanning tree
Consensus via feedback on disagreement: −[Lx]_i = ∑_{j=1}^N a_ij (x_j − x_i)
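As a quick illustration, here is a minimal consensus sketch in Python for the five-agent digraph above (our own mock-up: unit edge weights and the stepsize value are assumptions, not from the slides):

```python
# Minimal sketch of Laplacian consensus feedback (names and values are ours).
import numpy as np

# Adjacency matrix of the five-agent digraph from the slides, with unit weights.
A = np.array([
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
], dtype=float)

L = np.diag(A @ np.ones(5)) - A      # L = diag(A·1) − A

x = np.random.rand(5)                # one scalar estimate per agent
sigma = 0.2                          # consensus stepsize (assumed small enough)
for _ in range(200):
    x = x - sigma * (L @ x)          # x_i += sigma * sum_j a_ij (x_j − x_i)

print(x)                             # entries approach a common value
```

Since the graph has a spanning tree, the disagreement feedback drives all estimates to a common value.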
Part 1: Distributed constrained convex optimization
min_{w^i ∈ W_i ∀i, D ∈ D}  ∑_{i=1}^N f^i(w^i, D)

s.t.  g^1(w^1, D) + · · · + g^N(w^N, D) ≤ 0

Constraints might couple decisions of agents that cannot communicate directly,
e.g., ∑_i ‖w^i‖₂² − 10 ≤ 0
Agent i only knows how w^i enters the constraint, through g^i
D is the usual decision vector the agents need to agree upon
Constraints are useful models for
Traffic and routing (Flow conservation)
Resource allocation (Budgets)
Optimal control (System evolution)
Network formation (Relative positions/angles)
Agenda for distributed constrained optimization
Previous work and limitations
Distributing the constraints through the Lagrangian decomposition
- Idea: agreement on the multiplier
General saddle-point problems with agreement constraints
Our distributed saddle-point dynamics with Laplacian averaging
- Convergence theorem: saddle-point evaluation error ∼ 1/√(# iter.)
Previous work by type of constraint & info. structure
min  ∑_{i=1}^N f^i(x)
s.t. g(x) ≤ 0

All agents know g:

2011 D. Yuan, S. Xu, and H. Zhao
2012 M. Zhu and S. Martínez
Increasing literature
min_{w^i ∈ W_i}  ∑_{i=1}^N f^i(w^i)
s.t.  ∑_{i=1}^N g^i(w^i) ≤ 0

Agent i knows only g^i; when the constraint is linear, A w ≤ 0, agent i knows
only column i of A (versus column i & row i)

Less studied information structure:

'10 D. Mosk-Aoyama, T. Roughgarden, and D. Shah
- (only linear constraints)
'13 M. Burger, G. Notarstefano, and F. Allgöwer
- Dual cutting-plane consensus methods
'13 T.-H. Chang, A. Nedic, and A. Scaglione
- Primal-dual perturbation methods
Distributing the constraint via agreement on multipliers
min_{w^i ∈ W_i, D ∈ D}  ∑_{i=1}^N f^i(w^i, D)   s.t.  g^1(w^1, D) + · · · + g^N(w^N, D) ≤ 0

same as

min_{w^i ∈ W_i, D ∈ D}  max_{z ∈ R^m_{≥0}}  ∑_{i=1}^N f^i(w^i, D) + z^⊤ ∑_{i=1}^N g^i(w^i, D)

= min_{w^i ∈ W_i, D ∈ D}  max_{z^i ∈ R^m_{≥0}, z^i = z^j ∀i,j}  ∑_{i=1}^N ( f^i(w^i, D) + z^{i⊤} g^i(w^i, D) )

= min_{w^i ∈ W_i, D^i ∈ D, D^i = D^j ∀i,j}  max_{z^i ∈ R^m_{≥0}, z^i = z^j ∀i,j}  ∑_{i=1}^N ( f^i(w^i, D^i) + z^{i⊤} g^i(w^i, D^i) )

(local terms, coupled through agreement)

(Existence of saddle points ⇒ max-min property = strong duality)
Saddle-point problems with explicit agreement
A more general framework
min_{w ∈ W, (D^1,...,D^N) ∈ D^N, D^i = D^j ∀i,j}   max_{µ ∈ M, (z^1,...,z^N) ∈ Z^N, z^i = z^j ∀i,j}   φ( w, (D^1,...,D^N), µ, (z^1,...,z^N) )

with φ convex in (w, (D^1,...,D^N)) and concave in (µ, (z^1,...,z^N))
Distributed setting unstudied in the literature. Inspiration from A. Nedic and A. Ozdaglar, 09, and K. Arrow et al., 1958
Particularizes to...
Convex-concave functions arising from Lagrangians
- The concave part is linear
Min-max formulation of nuclear norm regularization (later in talk)
- The concave part is quadratic
Our general algorithm
Projected saddle-point subgradient algorithm with Laplacian averaging
(provided a saddle point exists under agreement):

ŵ_{t+1} = w_t − η_t g_{w_t}
D̂_{t+1} = D_t − σ L_t D_t − η_t g_{D_t}          (Laplacian averaging)
µ̂_{t+1} = µ_t + η_t g_{µ_t}
ẑ_{t+1} = z_t − σ L_t z_t + η_t g_{z_t}           (Laplacian averaging)

(w_{t+1}, D_{t+1}, µ_{t+1}, z_{t+1}) = ( P_W(ŵ_{t+1}), P_{D^N}(D̂_{t+1}), P_M(µ̂_{t+1}), P_{Z^N}(ẑ_{t+1}) )

with orthogonal projections onto compact convex sets
g_{w_t}, g_{D_t} are subgradients and g_{µ_t}, g_{z_t} are supergradients of φ(w, D, µ, z)
Any initial conditions; not “anytime constraint satisfaction”
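A minimal sketch of one iteration of this scheme in Python (our own schematic: the subgradient oracles, the projections, and all array shapes are placeholders for a concrete problem instance):

```python
# Sketch of the projected saddle-point subgradient step with Laplacian
# averaging (all names are ours; grads and projections are problem-specific).
import numpy as np

def saddle_point_step(w, D, mu, z, L, eta, sigma, grads, projections):
    """One iteration; grads returns (g_w, g_D, g_mu, g_z) at the current point.

    D and z stack the agents' local copies row-wise, so L @ D and L @ z
    implement the Laplacian averaging across neighbors.
    """
    g_w, g_D, g_mu, g_z = grads(w, D, mu, z)
    w_half  = w  - eta * g_w                    # subgradient descent in w
    D_half  = D  - sigma * (L @ D) - eta * g_D  # Laplacian averaging + descent
    mu_half = mu + eta * g_mu                   # supergradient ascent in mu
    z_half  = z  - sigma * (L @ z) + eta * g_z  # Laplacian averaging + ascent
    P_W, P_D, P_M, P_Z = projections            # orthogonal projections onto
    return (P_W(w_half), P_D(D_half),           # the compact convex sets
            P_M(mu_half), P_Z(z_half))
```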
Theorem (Distributed saddle-point approximation)
Assume that

- φ(w, D, µ, z) is convex in (w, D) ∈ W × D^N and concave in (µ, z) ∈ M × Z^N
- The dynamics is bounded (may be achieved through projections)
- The sequence of weight-balanced communication digraphs is
  - δ-nondegenerate (a_ij > δ whenever a_ij > 0)
  - B-jointly-connected (unions of length B are strongly connected)

Then, for a suitable choice of consensus stepsize σ and (decreasing) subgradient
stepsizes {η_t}, and for any saddle point (w*, D*, µ*, z*) of φ, where
D* = D*⊗1 and z* = z*⊗1,

  −α/√(t−1) ≤ φ(w^av_t, D^av_t, µ^av_t, z^av_t) − φ(w*, D*, µ*, z*) ≤ α/√(t−1)
w^av_{t+1} := (1/(t+1)) ∑_{s=1}^{t+1} w_s = (t/(t+1)) w^av_t + (1/(t+1)) w_{t+1}    (can be computed recursively)
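The recursion is easy to check numerically; a small sketch with our own names:

```python
# Running average of iterates, computed recursively (names are ours).
import numpy as np

def update_running_average(w_av, w_new, t):
    """Average of the first t+1 iterates, given the average of the first t."""
    return (t / (t + 1)) * w_av + (1 / (t + 1)) * w_new

w_av = np.zeros(3)
for t, w_t in enumerate([1.0 * np.ones(3), 2.0 * np.ones(3), 3.0 * np.ones(3)]):
    w_av = update_running_average(w_av, w_t, t)

print(w_av)   # [2. 2. 2.], the mean of the three iterates
```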
Part 2: Beyond Lagrangians
Lagrangians are particular cases of convex-concave functions: the concave part
(in the Lagrange multipliers) is always linear.

Other min-max problems can benefit from distributed formulations, e.g., min-max
formulations of the nuclear norm.

Agenda for distributed optimization with nuclear norm regularization
Definition of nuclear norm
Application to low-rank matrix completion
Our dynamics for distributed optimization with nuclear norm
- Convergence theorem (corollary of the previous result)
Review: definition of nuclear norm
Given a matrix W = [ w_1 · · · w_N ] ∈ R^{d×N} with columns w_i,

‖W‖_* := sum of singular values of W = trace √(W W^⊤) = trace √( ∑_{i=1}^N w_i w_i^⊤ )
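The two expressions agree, as a quick numerical check shows (a sketch with our own names; scipy.linalg.sqrtm computes the matrix square root):

```python
# Numerical check that the sum of singular values equals trace sqrt(W W^T).
import numpy as np
from scipy.linalg import sqrtm

W = np.random.rand(4, 6)                                  # d x N matrix

nuclear_svd = np.linalg.svd(W, compute_uv=False).sum()    # sum of singular values
nuclear_trace = np.trace(sqrtm(W @ W.T)).real             # trace sqrt(W W^T)

print(np.isclose(nuclear_svd, nuclear_trace))             # True
```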
Optimization with nuclear norm regularization:

min_{w_i ∈ W_i}  ∑_{i=1}^N f_i(w_i) + γ ‖W‖_*

favors vectors {w_i}_{i=1}^N belonging to a low-dimensional subspace
Distributed low-rank matrix completion
[Table: partially revealed ratings matrix W = [W_{:,1} W_{:,2} W_{:,3} W_{:,4}],
one column per user (Tara, Philip, Mauricio, Miroslav), one row per movie
(Toy Story, Jurassic Park, ...); "?" marks missing entries]
Estimate W from the revealed entries {Zij}
min_{W ∈ R^{d×N}}  ∑_{(i,j) ∈ revealed} (W_ij − Z_ij)² + γ ‖W‖_*    (nuclear-norm regularization)
γ depends on application, dimensions... (Regularization, not penalty)
Netflix: users N ∼ 10^7 ≫ movies d ∼ 10^5

Why make it distributed? Because users may not want to share their ratings
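For concreteness, a small sketch of evaluating this objective (our own naming; Z_revealed is a hypothetical map from revealed (i, j) pairs to ratings):

```python
# Sketch of the matrix-completion objective (names are ours).
import numpy as np

def completion_objective(W, Z_revealed, gamma):
    """Squared error on revealed entries plus nuclear-norm regularization."""
    fit = sum((W[i, j] - z) ** 2 for (i, j), z in Z_revealed.items())
    nuclear = np.linalg.svd(W, compute_uv=False).sum()    # ||W||_*
    return fit + gamma * nuclear

W = np.zeros((2, 3))                                      # 2 movies x 3 users
Z_revealed = {(0, 0): 5.0, (1, 2): 3.0}                   # hypothetical ratings
print(completion_objective(W, Z_revealed, gamma=0.1))     # 34.0
```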
Formulation of nuclear norm as saddle-point problem
Drawing from another paper of the authors (ignore details)
min_{w_i ∈ W_i}  ∑_{i=1}^N f_i(w_i) + γ ‖[W | √ε I_d]‖_*

= min_{w_i ∈ W_i, D_i ∈ {D ⪰ c I_d}, D_i = D_j ∀i,j}   sup_{x_i ∈ R^d, Y_i ∈ R^{d×d}}   ∑_{i=1}^N F_i(w_i, D_i, x_i, Y_i)

(convex in (w_i, D_i), concave in (x_i, Y_i)) with convex-concave local functions
F_i : R^d × {D ⪰ c I_d} × R^d × R^{d×d} → R,

F_i(w, D, x, Y) := f_i(w) + γ trace( D( −x x^⊤ − (ε/N) Y Y^⊤ ) )        (quadratic concave part because D ⪰ 0)
                   − 2γ w^⊤ x − (2γε/N) trace(Y) + (γ/N) trace(D)       (terms linear in each variable)
See "Distributed optimization for multi-task learning via nuclear-norm
approximation," NecSys15, D. Mateos-Nunez, J. Cortes
Distributed saddle-point dynamics for nuclear optimization
w_i(k+1) = P_W( w_i(k) − η_k ( g_i(k) − 2γ x_i(k) ) )

D_i(k+1) = P_{D ⪰ c I_d}( D_i(k) − η_k γ( −x_i(k) x_i(k)^⊤ − (ε/N) Y_i(k) Y_i(k)^⊤ + (1/N) I_d )
                          + σ ∑_{j=1}^N a_ij,t ( D_j(k) − D_i(k) ) )      (“only” communication, size d × d)

x_i(k+1) = x_i(k) + η_k γ( −2 D_i(k) x_i(k) − 2 w_i(k) )

Y_i(k+1) = Y_i(k) + η_k γ( −(2ε/N) D_i(k) Y_i(k) − (2ε/N) I_d )
Convergence is a corollary of the previous theorem

User i does not need to share w_i with its neighbors!

D_i → √( ∑_{i=1}^N w_i w_i^⊤ + ε I_d ) conveys only mixed information
Complexity per iteration: orthogonal projection onto {D ⪰ c I_d}
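That projection can be computed by eigenvalue clipping; a hedged sketch (our implementation choice, under the Frobenius norm for symmetric matrices):

```python
# Projection onto {D : D - c*I_d is positive semidefinite} by clipping
# eigenvalues below c (our implementation; O(d^3) per call).
import numpy as np

def project_psd_shifted(D, c):
    """Orthogonal (Frobenius) projection of a symmetric D onto {D >= c I}."""
    D = 0.5 * (D + D.T)                     # symmetrize against numerical drift
    eigvals, eigvecs = np.linalg.eigh(D)    # spectral decomposition
    eigvals = np.maximum(eigvals, c)        # lift eigenvalues below c up to c
    return (eigvecs * eigvals) @ eigvecs.T  # V diag(clipped) V^T

D_proj = project_psd_shifted(np.random.rand(4, 4), c=0.1)
print(np.linalg.eigvalsh(D_proj).min() >= 0.1 - 1e-12)    # True
```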
Simulation of matrix completion
20 users × 8 movies. Each user rates 5 movies. Ratings are private
[Figure: three plots vs. iterations k, comparing the distributed saddle-point
dynamics against centralized subgradient descent:
- matrix fitting error ‖W(k) − Z‖_F / ‖Z‖_F
- network cost function ∑_{i=1}^N ∑_{j∈Υ_i} (W_ij(k) − Z_ij)² + γ ‖[W(k) | √ε I_d]‖_*
- disagreement of local matrices ( ∑_{i=1}^N ‖D_i(k) − (1/N) ∑_{i=1}^N D_i(k)‖_F² )^{1/2} ]
Conclusions
More details in
arXiv: "Distributed saddle-point subgradient algorithms with Laplacian
averaging," submitted to Transactions on Automatic Control
Our algorithms particularize to deal with

- Saddle points of Lagrangians for distributed constrained optimization
  - A less studied type of constraints/information structure in the literature
  - Constraints couple decisions of agents that can't communicate directly
- Min-max distributed formulations of the nuclear norm
  - "Distributed optimization for multi-task learning via nuclear-norm
    approximation", D. Mateos-Nunez, J. Cortes
  - First multi-agent treatment of nuclear norm regularization
Future directions
Bounds on Lagrange multipliers in a distributed way
- Necessary to guarantee boundedness of the dynamics' trajectories
- One such procedure in the arXiv version

Application to semidefinite constraints with chordal sparsity
- Agents update the entries corresponding to maximal cliques, subject to
  agreement on the intersections

Other applications that you can find, e.g., IEEE Spectrum's coverage of
Japan's orbital solar farm project
Thank you for listening!
(Back slide) Outline of the proof
Inequality techniques from A. Nedic and A. Ozdaglar, 2009
- Saddle-point evaluation error

    t φ(w^av_{t+1}, D^av_{t+1}, µ^av_{t+1}, z^av_{t+1}) − t φ(w*, D*, µ*, z*)    (1)

  at running-time averages, w^av_{t+1} := (1/t) ∑_{s=1}^t w_s, etc.
- Bound for (1) in terms of
  - initial conditions
  - bound on subgradients and states of the dynamics
  - disagreement
  - sum of learning rates
Input-to-state stability with respect to agreement:

‖L_K D_t‖₂ ≤ C_I ‖D_1‖₂ (1 − δ/(4N²))^⌈(t−1)/B⌉ + C_U max_{1≤s≤t−1} ‖d_s‖₂

with the subgradients d_s acting as disturbances
Doubling Trick scheme: for m = 0, 1, 2, ..., ⌈log₂ t⌉, take η_s = 1/√(2^m) in
each period of 2^m rounds, s = 2^m, ..., 2^{m+1} − 1
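A tiny sketch of this schedule (our naming; rounds are 1-indexed):

```python
# Doubling Trick stepsizes: eta_s = 1/sqrt(2^m) for rounds s = 2^m .. 2^(m+1)-1.
import math

def doubling_trick_stepsize(s):
    """Stepsize at round s >= 1 under the doubling scheme."""
    m = int(math.log2(s))          # period index: largest m with 2^m <= s
    return 1.0 / math.sqrt(2 ** m)

print([round(doubling_trick_stepsize(s), 3) for s in range(1, 9)])
# [1.0, 0.707, 0.707, 0.5, 0.5, 0.5, 0.5, 0.354]
```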