Distributed online optimization over jointly connected digraphs
David Mateos-Nunez and Jorge Cortes
University of California, San Diego
{dmateosn,cortes}@ucsd.edu
Mathematical Theory of Networks and Systems, University of Groningen, July 7, 2014
1 / 30
Overview of this talk
Distributed optimization
Online optimization
Case study: medical diagnosis
2 / 30
Motivation: a data-driven optimization problem
Medical findings, symptoms: age factor; amnesia before impact; deterioration in GCS score; open skull fracture; loss of consciousness; vomiting
Any acute brain finding revealed on Computerized Tomography?
(-1 = not present, 1 = present)
“The Canadian CT Head Rule for patients with minor head injury”
3 / 30
Binary classification
feature vector of patient s: w_s = ((w_s)_1, ..., (w_s)_{d−1})
true diagnosis: y_s ∈ {−1, 1}
wanted weights: x = (x_1, ..., x_d)
predictor: h(x, w_s) = x^T (w_s, 1)
margin: m_s(x) = y_s h(x, w_s)

Given the data set {w_s}_{s=1}^P, estimate x ∈ R^d by solving

min_{x∈R^d} ∑_{s=1}^P l(m_s(x))

where the loss function l : R → R is decreasing and convex
4 / 30
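The training problem above is easy to prototype. The sketch below is illustrative, not from the talk: it assumes the logistic-type loss l(m) = log(1 + e^(−2m)) that appears later in the simulations, and folds the bias into the last component of x.

```python
import numpy as np

def total_loss(x, W, y):
    """Training objective: sum over patients of l(m_s(x)), with the
    decreasing convex loss l(m) = log(1 + exp(-2m))."""
    margins = y * (W @ x[:-1] + x[-1])      # m_s = y_s * x^T (w_s, 1)
    return np.sum(np.log1p(np.exp(-2.0 * margins)))

def fit_weights(W, y, eta=0.1, iters=500):
    """Centralized gradient descent on the classification objective."""
    x = np.zeros(W.shape[1] + 1)            # feature weights plus bias
    for _ in range(iters):
        margins = y * (W @ x[:-1] + x[-1])
        dl_dm = -2.0 / (1.0 + np.exp(2.0 * margins))   # l'(m_s)
        x[:-1] -= eta * (dl_dm * y) @ W     # chain rule through m_s(x)
        x[-1] -= eta * np.sum(dl_dm * y)
    return x
```

On linearly separable data the fitted weights predict the labels by the sign of h(x, w).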
Why distributed online optimization?
Why distributed?
information is distributed across a group of agents
agents need to interact to optimize performance
Why online?
information becomes incrementally available
an adaptive solution is needed
5 / 30
Review of distributed convex optimization
6 / 30
A distributed scenario
In the diagnosis example, health center i ∈ {1, . . . , N} manages a set of patients P^i

f(x) = ∑_{i=1}^N ∑_{s∈P^i} l(y_s h(x, w_s)) = ∑_{i=1}^N f^i(x)

Goal: best predicting model w ↦ h(x*, w)

min_{x∈R^d} ∑_{i=1}^N f^i(x)

using “local information”
7 / 30
What do we mean by “using local information”?
Agent i maintains an estimate x^i_t of

x* ∈ argmin_{x∈R^d} ∑_{i=1}^N f^i(x)

Agent i has access to f^i

Agent i can share its estimate x^i_t with “neighboring” agents

[Figure: five agents f^1, ..., f^5 linked by a digraph with adjacency matrix A = (a_ij); nonzero entries a_13, a_14, a_21, a_25, a_31, a_32, a_41, a_52]
Distributed algorithms: Tsitsiklis 84, Bertsekas and Tsitsiklis 95
Consensus: Jadbabaie et al. 03, Olfati-Saber, Murray 04, Boyd et al. 05
8 / 30
How do agents agree on the optimizer?
local minimization vs local consensus
A. Nedic and A. Ozdaglar, TAC, 09
x^i_{k+1} = −η_k g^i_k + ∑_{j=1}^N a_{ij,k} x^j_k,

where g^i_k ∈ ∂f^i(x^i_k) and A_k = (a_{ij,k}) is doubly stochastic
J. C. Duchi, A. Agarwal, and M. J. Wainwright, TAC, 12
9 / 30
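A minimal numerical sketch of this consensus-plus-subgradient update (my own toy setup, not the cited authors' code: scalar quadratic local objectives and a fixed doubly stochastic matrix A are assumed):

```python
import numpy as np

def distributed_subgradient(c, A, T):
    """Runs x^i_{k+1} = sum_j a_ij x^j_k - eta_k g^i_k with eta_k = 1/sqrt(k),
    for scalar local objectives f^i(x) = (x - c_i)^2, whose sum is
    minimized at mean(c)."""
    N = len(c)
    x = np.zeros(N)                      # one scalar estimate per agent
    for k in range(1, T + 1):
        g = 2.0 * (x - c)                # gradient of f^i at agent i's estimate
        x = A @ x - (1.0 / np.sqrt(k)) * g
    return x
```

With a connected averaging matrix, all agents' estimates approach the global minimizer mean(c) even though each agent only ever sees its own f^i.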
Saddle-point dynamics
The minimization problem can be regarded as

min_{x∈R^d} ∑_{i=1}^N f^i(x) = min_{x^1,...,x^N∈R^d : x^1=···=x^N} ∑_{i=1}^N f^i(x^i) = min_{x∈(R^d)^N : Lx=0} f(x),

where (Lx)^i = ∑_{j=1}^N a_ij (x^i − x^j) and f(x) = ∑_{i=1}^N f^i(x^i)

Convex-concave augmented Lagrangian when L is symmetric:

F(x, z) := f(x) + (γ/2) x^T L x + z^T L x

Saddle-point dynamics

ẋ = −∂F(x, z)/∂x = −∇f(x) − γ L x − L z
ż = ∂F(x, z)/∂z = L x
10 / 30
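The continuous-time saddle-point dynamics can be explored with a forward-Euler discretization. This is an illustrative sketch under my own assumptions (quadratic f^i and a symmetric ring Laplacian), not part of the slides:

```python
import numpy as np

def run_saddle_point(c, L, gamma=1.0, dt=0.01, steps=20000):
    """Integrates xdot = -grad f(x) - gamma*L x - L z,  zdot = L x
    for f(x) = sum_i (x_i - c_i)^2, so grad f(x) = 2(x - c).
    At equilibrium L x = 0 (consensus) and the common value is mean(c)."""
    N = len(c)
    x, z = np.zeros(N), np.zeros(N)
    for _ in range(steps):
        x_dot = -2.0 * (x - c) - gamma * (L @ x) - L @ z
        z_dot = L @ x
        x, z = x + dt * x_dot, z + dt * z_dot
    return x
```

The z variable plays the role of the Lagrange multiplier enforcing the agreement constraint Lx = 0; summing the x-equation over agents shows the average converges to the centralized minimizer.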
In the case of directed graphs
Weight-balanced ⇔ 1^T L = 0 ⇔ L + L^T ⪰ 0

F(x, z) := f(x) + (γ/2) x^T L x + z^T L x   is still convex-concave

ẋ = −∂F(x, z)/∂x = −∇f(x) − (γ/2)(L + L^T) x − L^T z   (not distributed!)
  changed to −∇f(x) − γ L x − L z
ż = ∂F(x, z)/∂z = L x
J. Wang and N. Elia (with L^T = L), Allerton, 10
B. Gharesifard and J. Cortes, CDC, 12
11 / 30
Review of online convex optimization
12 / 30
Different kind of optimization: sequential decision making
In the diagnosis example:
Each round t ∈ {1, . . . , T}:
question (features, medical findings): w_t
decision (about using CT): h(x_t, w_t)
outcome (by CT findings / follow-up of patient): y_t
loss: l(y_t h(x_t, w_t))

Choose x_t & incur loss f_t(x_t) := l(y_t h(x_t, w_t))

Goal: sublinear regret

R(u, T) := ∑_{t=1}^T f_t(x_t) − ∑_{t=1}^T f_t(u) ∈ o(T)
using “historical observations”
13 / 30
Why regret?
If the regret is sublinear,
∑_{t=1}^T f_t(x_t) ≤ ∑_{t=1}^T f_t(u) + o(T),

then

lim_{T→∞} (1/T) ∑_{t=1}^T f_t(x_t) ≤ lim_{T→∞} (1/T) ∑_{t=1}^T f_t(u)

In temporal average, online decisions {x_t}_{t=1}^T perform as well as the best fixed decision in hindsight
“No regrets, my friend”
14 / 30
Why regret?
What about generalization error?
Sublinear regret does not imply x_{t+1} will do well with f_{t+1}

No assumptions about the sequence {f_t}; it can
- follow an unknown stochastic or deterministic model,
- or be chosen adversarially

In our example, f_t := l(y_t h(x_t, w_t)).
- If some model w ↦ h(x*, w) explains the data reasonably in hindsight,
- then the online models w ↦ h(x_t, w) perform just as well on average
15 / 30
Some classical results
Projected gradient descent:
x_{t+1} = Π_S(x_t − η_t ∇f_t(x_t)),   (1)

where Π_S is a projection onto a compact set S ⊆ R^d, and ‖∇f‖_2 ≤ H

Follow-the-Regularized-Leader:

x_{t+1} = argmin_{y∈S} ( ∑_{s=1}^t f_s(y) + ψ(y) )

Martin Zinkevich, 03
- (1) achieves O(√T) regret under convexity with η_t = 1/√t

Elad Hazan, Amit Agarwal, and Satyen Kale, 07
- (1) achieves O(log T) regret under p-strong convexity with η_t = 1/(p t)
- Others: Online Newton Step, Follow the Regularized Leader, etc.
16 / 30
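Zinkevich's update (1) is easy to exercise numerically. The setup below is my own toy example (alternating quadratic losses on S = [−1, 1], comparator u = 0); it shows the average regret shrinking as T grows:

```python
import numpy as np

def online_pgd(grad_seq, project, T):
    """Plays x_{t+1} = Pi_S(x_t - eta_t * grad f_t(x_t)) with eta_t = 1/sqrt(t);
    returns the decisions x_1, ..., x_T actually played."""
    x, played = 0.0, []
    for t in range(1, T + 1):
        played.append(x)                 # decision made before seeing f_t
        x = project(x - (1.0 / np.sqrt(t)) * grad_seq(t, x))
    return played

def regret(T):
    """Regret of online PGD against the fixed point u = 0 (the best fixed
    decision for alternating targets c_t = (-1)^t, with loss u^2... wait:
    f_t(u) = (u - c_t)^2, and u = 0 gives f_t(0) = 1 every round)."""
    c = lambda t: (-1.0) ** t
    played = online_pgd(lambda t, x: 2.0 * (x - c(t)),
                        lambda v: np.clip(v, -1.0, 1.0), T)
    return sum((x - c(t)) ** 2 - 1.0 for t, x in enumerate(played, start=1))
```

The regret grows roughly like √T here, so regret(T)/T, the per-round excess loss over the best fixed decision, vanishes.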
Done with the review. Now... we combine the two frameworks
Previous work and limitations
our coordination algorithm in the online case
theorems of O(√T ) & O(logT ) agent regrets
simulations
outline of the proof
17 / 30
Combining both frameworks
18 / 30
Returning to the diagnosis example:

Health center i ∈ {1, . . . , N} takes care of a set of patients P^i_t at time t

f^i(x) = ∑_{t=1}^T ∑_{s∈P^i_t} l(y_s h(x, w_s)) = ∑_{t=1}^T f^i_t(x)

Goal: sublinear agent regret

R^j(u, T) := ∑_{t=1}^T ∑_{i=1}^N f^i_t(x^j_t) − ∑_{t=1}^T ∑_{i=1}^N f^i_t(u) ∈ o(T)
using “local information” & “historical observations”
19 / 30
Challenge: Coordinate hospitals
[Figure: five hospitals with time-varying objectives f^1_t, ..., f^5_t linked by a communication digraph]
Need to design distributed online algorithms
20 / 30
Previous work on consensus-based online algorithms
F. Yan, S. Sundaram, S. V. N. Vishwanathan and Y. Qi, TAC: Projected Subgradient Descent
- log(T) regret (local strong convexity & bounded subgradients)
- √T regret (convexity & bounded subgradients)
- Both analyses require a projection onto a compact set

S. Hosseini, A. Chapman and M. Mesbahi, CDC, 13: Dual Averaging
- √T regret (convexity & bounded subgradients)
- General regularized projection onto a closed convex set

K. I. Tsianos and M. G. Rabbat, arXiv, 12: Projected Subgradient Descent
- Empirical risk as opposed to regret analysis

Limitations: in all cases the communication digraph is fixed and strongly connected
21 / 30
Our contributions (informally)
time-varying, weight-balanced communication digraphs under B-joint connectivity

unconstrained optimization (no projection step onto a bounded set)

log T regret (local strong convexity & bounded subgradients)
√T regret (convexity & β-centrality with β ∈ (0, 1] & bounded subgradients)

[Figure: a subgradient −ξ_x at x forming an acute angle with the direction toward a minimizer x*]

f^i_t is β-central in Z ⊆ R^d \ X*, where X* = {x* : 0 ∈ ∂f^i_t(x*)}, if for all x ∈ Z there exists x* ∈ X* such that

−ξ_x^T (x* − x) / (‖ξ_x‖_2 ‖x* − x‖_2) ≥ β

for each ξ_x ∈ ∂f^i_t(x).
22 / 30
Coordination algorithm
x^i_{t+1} = x^i_t − η_t g_{x^i_t} + σ ( γ ∑_{j=1, j≠i}^N a_{ij,t} (x^j_t − x^i_t) + ∑_{j=1, j≠i}^N a_{ij,t} (z^j_t − z^i_t) )

z^i_{t+1} = z^i_t − σ ∑_{j=1, j≠i}^N a_{ij,t} (x^j_t − x^i_t)

- Subgradient descent on previous local objectives, g_{x^i_t} ∈ ∂f^i_t
- Proportional (linear) feedback on disagreement with neighbors (the γ-term)
- Integral (linear) feedback on disagreement with neighbors (the z-term)
- Union of graphs over intervals of length B is strongly connected

[Figure: time-varying digraphs over agents f^1_t, ..., f^5_t at times t, t+1 (with edges a_{14,t+1}, a_{41,t+1}), and t+2]

Compact representation & generalization:

[x_{t+1}; z_{t+1}] = [x_t; z_t] − σ [γL_t, L_t; −L_t, 0] [x_t; z_t] − η_t [g_{x_t}; 0]

v_{t+1} = (I − σ E ⊗ L_t) v_t − η_t g_t

23 / 30
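A toy discretization of this coordination algorithm, under my own illustrative assumptions (scalar states, static quadratic objectives, two alternating weight-balanced digraphs whose union is strongly connected). Taking E = [[γ, 1], [−1, 0]] recovers the compact block matrix; γ = 3 makes E diagonalizable with positive real eigenvalues (3 ± √5)/2:

```python
import numpy as np

def coordination_step(x, z, L, g, eta, sigma, gamma):
    """One step of
        x_{t+1} = x_t - eta*g_t - sigma*(gamma*L x_t + L z_t)
        z_{t+1} = z_t + sigma*L x_t,
    equivalently v_{t+1} = (I - sigma E kron L) v_t - eta g_t with v = [x; z];
    L is the out-Laplacian of the digraph active at time t."""
    return (x - eta * g - sigma * (gamma * (L @ x) + L @ z),
            z + sigma * (L @ x))
```

Because each digraph is weight-balanced (1^T L = 0), the network average of x follows a plain subgradient step, while the proportional-integral feedback drives the agents toward agreement.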
Theorem (DMN-JC)
Assume that

{f^1_t, ..., f^N_t}_{t=1}^T are convex functions on R^d
- with H-bounded subgradient sets,
- nonempty and uniformly bounded sets of minimizers, and
- p-strongly convex in a sufficiently large neighborhood of their minimizers

The sequence of weight-balanced communication digraphs is
- nondegenerate, and
- B-jointly-connected

E ∈ R^{K×K} is diagonalizable with positive real eigenvalues

Then, taking σ ∈ [ δ/λ_min(E), (1−δ)/(λ_max(E) d_max) ] and η_t = 1/(p t),

R^j(u, T) ≤ C (‖u‖²_2 + 1 + log T),

for any j ∈ {1, . . . , N} and u ∈ R^d
24 / 30
Theorem (DMN-JC)
Under the same assumptions, substituting strong convexity by β-centrality with β ∈ (0, 1],

R^j(u, T) ≤ C ‖u‖²_2 √T,

for any j ∈ {1, . . . , N} and u ∈ R^d
24 / 30
Simulations: acute brain finding revealed on Computerized Tomography
25 / 30
Simulations
Agents’ estimates

[Figure, left: seventh component of the agents’ estimates, {(x^i_t)_7}_{i=1}^N, versus time t ∈ [0, 200], compared with the centralized solution]

Average regret

[Figure, right: max_j (1/T) R^j(x*_T, {∑_{i=1}^N f^i_t}_{t=1}^T) versus time horizon T (log scale), for proportional-integral disagreement feedback, proportional disagreement feedback, and the centralized method]

f^i_t(x) = ∑_{s∈P^i_t} l(y_s h(x, w_s)) + (1/10)‖x‖²_2

where l(m) = log(1 + e^{−2m})

26 / 30
Outline of the proof
Analysis of network regret
R_N(u, T) := ∑_{t=1}^T ∑_{i=1}^N f^i_t(x^i_t) − ∑_{t=1}^T ∑_{i=1}^N f^i_t(u)

ISS with respect to agreement:

‖L_K v_t‖_2 ≤ C_I ‖v_1‖_2 (1 − δ/(4N²))^⌈(t−1)/B⌉ + C_U max_{1≤s≤t−1} ‖u_s‖_2

Agent regret bound in terms of {η_t}_{t≥1} and initial conditions

Boundedness of online estimates under β-centrality (β ∈ (0, 1])

Doubling trick to obtain O(√T) agent regret

Local strong convexity ⇒ β-centrality

η_t = 1/(p t) to obtain O(log T) agent regret
27 / 30
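The doubling trick mentioned in the outline can be sketched as a step-size schedule (an illustrative sketch; the constant 1/√(period) is a placeholder for whatever constant step the per-period bound prescribes):

```python
def doubling_trick_etas(T):
    """Step sizes for the doubling trick: time is split into periods
    [2^m, 2^{m+1} - 1]; within each period a constant step ~ 1/sqrt(2^m)
    is used, so a sqrt(length)-type regret bound on each period sums to
    O(sqrt(T)) over the whole horizon."""
    etas, m = [], 0
    while len(etas) < T:
        period = 2 ** m                      # length of the m-th period
        etas.extend([1.0 / period ** 0.5] * period)
        m += 1
    return etas[:T]
```

Summing the per-period contributions gives a geometric series with ratio √2, which is why the total stays within a constant factor of √T.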
Conclusions and future work
Conclusions
Distributed online convex optimization over jointly connected digraphs, submitted to IEEE Transactions on Network Science and Engineering (available on the webpage of Jorge Cortes)
Relevant for regression & classification, which play a crucial role in machine learning, computer vision, etc.
Future work
Refine guarantees under a model for the evolution of the objective functions
28 / 30
Future directions in data-driven optimization problems
Distributed strategies for...
feature selection
semi-supervised learning
29 / 30
Thank you for listening!
Contact: David Mateos-Nunez and Jorge Cortes, at UC San Diego
30 / 30