

Page 1: slides_online_optimization_david_mateos

Distributed online optimization over jointly connected digraphs

David Mateos-Núñez    Jorge Cortés

University of California, San Diego

{dmateosn,cortes}@ucsd.edu

Mathematical Theory of Networks and Systems
University of Groningen, July 7, 2014

1 / 30


Overview of this talk

Distributed optimization

Online optimization

Case study: medical diagnosis

2 / 30


Motivation: a data-driven optimization problem

Medical findings, symptoms:
age factor
amnesia before impact
deterioration in GCS score
open skull fracture
loss of consciousness
vomiting

Any acute brain finding revealed on Computerized Tomography?

(-1 = not present, 1 = present)

“The Canadian CT Head Rule for patients with minor head injury”

3 / 30


Binary classification

feature vector of patient s:  w_s = ((w_s)_1, ..., (w_s)_{d−1})

true diagnosis:  y_s ∈ {−1, 1}

wanted weights:  x = (x_1, ..., x_d)

predictor:  h(x, w_s) = x^T (w_s, 1)

margin:  m_s(x) = y_s h(x, w_s)

Given the data set {w_s}_{s=1}^P, estimate x ∈ R^d by solving

    min_{x ∈ R^d}  ∑_{s=1}^P l(m_s(x))

where the loss function l : R → R is decreasing and convex

4 / 30
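The margin and a decreasing convex loss are easy to make concrete. Below is a minimal Python/NumPy sketch using the logistic loss l(m) = log(1 + e^{−2m}) that appears later on the simulations slide; the patient feature vector here is made up for illustration:

```python
import numpy as np

def predictor(x, w):
    """h(x, w) = x^T (w, 1): linear predictor with the bias folded into x."""
    return np.dot(x, np.append(w, 1.0))

def margin(x, w, y):
    """m_s(x) = y_s * h(x, w_s); positive iff the prediction has the right sign."""
    return y * predictor(x, w)

def logistic_loss(m):
    """l(m) = log(1 + exp(-2m)): decreasing and convex in the margin."""
    return np.log1p(np.exp(-2.0 * m))

# hypothetical patient: three findings encoded as features, true diagnosis y = +1
w_s = np.array([1.0, -1.0, 1.0])
y_s = 1
x = np.zeros(4)  # d = 4 weights (three features + one bias)

print(logistic_loss(margin(x, w_s, y_s)))  # at x = 0 the margin is 0, so the loss is log 2
```

A correct prediction with a large margin drives the loss toward zero, which is exactly why a decreasing convex l makes the sum over patients a sensible training objective.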



Why distributed online optimization?

Why distributed?

information is distributed across a group of agents; they need to interact to optimize performance

Why online?

information becomes incrementally available; an adaptive solution is needed

5 / 30


Review of distributed convex optimization

6 / 30


A distributed scenario

In the diagnosis example, health center i ∈ {1, . . . , N} manages a set of patients P^i

    f(x) = ∑_{i=1}^N ∑_{s ∈ P^i} l(y_s h(x, w_s)) = ∑_{i=1}^N f^i(x)

Goal: best predicting model w ↦ h(x*, w),

    min_{x ∈ R^d} ∑_{i=1}^N f^i(x)

using “local information”

7 / 30



What do we mean by “using local information”?

Agent i maintains an estimate x^i_t of

    x* ∈ argmin_{x ∈ R^d} ∑_{i=1}^N f^i(x)

Agent i has access to f^i

Agent i can share its estimate x^i_t with “neighboring” agents

[Figure: five agents with local objectives f^1, ..., f^5 on a digraph with adjacency matrix A; the nonzero weights shown are a13, a14, a21, a25, a31, a32, a41, a52]

Distributed algorithms: Tsitsiklis 84, Bertsekas and Tsitsiklis 95

Consensus: Jadbabaie et al. 03, Olfati-Saber and Murray 04, Boyd et al. 05

8 / 30



How do agents agree on the optimizer?

local minimization vs local consensus

A. Nedic and A. Ozdaglar, TAC, 09

    x^i_{k+1} = ∑_{j=1}^N a_{ij,k} x^j_k − η_k g^i_k,

where g^i_k ∈ ∂f^i(x^i_k) and A_k = (a_{ij,k}) is doubly stochastic

J. C. Duchi, A. Agarwal, and M. J. Wainwright, TAC, 12

9 / 30
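One synchronous round of this mix-then-descend update is straightforward to sketch. The toy below (Python/NumPy) uses made-up quadratic objectives f^i(x) = ½(x − c_i)², so the subgradient at x is x − c_i, and a complete graph with uniform doubly stochastic weights:

```python
import numpy as np

def consensus_subgradient_step(X, A, G, eta):
    """x^i_{k+1} = sum_j a_{ij,k} x^j_k - eta_k g^i_k, stacked over agents.
    X: (N, d) estimates, A: (N, N) doubly stochastic, G: (N, d) subgradients."""
    return A @ X - eta * G

N = 3
c = np.array([[0.0], [1.0], [2.0]])   # local minimizers; the global optimum is their mean, 1
A = np.full((N, N), 1.0 / N)          # complete graph, uniform weights (doubly stochastic)
X = np.array([[5.0], [-3.0], [10.0]])

for k in range(200):
    G = X - c                         # subgradient of f^i evaluated at x^i_k
    X = consensus_subgradient_step(X, A, G, eta=1.0 / (k + 1))

# all three estimates approach the global minimizer 1
```

The diminishing step size 1/(k+1) is the standard choice for this scheme; the mixing step keeps the estimates close to each other while each local gradient pulls toward the local data.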



Saddle-point dynamics

The minimization problem can be regarded as

    min_{x ∈ R^d} ∑_{i=1}^N f^i(x) = min_{x^1 = ... = x^N ∈ R^d} ∑_{i=1}^N f^i(x^i) = min_{x ∈ (R^d)^N, Lx = 0} f(x),

where (Lx)^i = ∑_{j=1}^N a_ij (x^i − x^j) and f(x) = ∑_{i=1}^N f^i(x^i)

Convex-concave augmented Lagrangian when L is symmetric:

    F(x, z) := f(x) + (γ/2) x^T L x + z^T L x

Saddle-point dynamics:

    ẋ = − ∂F(x, z)/∂x = −∇f(x) − γ L x − L z

    ż =   ∂F(x, z)/∂z = L x

10 / 30
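A forward-Euler discretization makes these continuous-time dynamics easy to try out. A sketch (Python/NumPy) on a symmetric path-graph Laplacian with toy quadratic objectives f^i(x) = ½(x − c_i)²; the step size h and the gain γ are arbitrary choices for illustration:

```python
import numpy as np

# path graph 1 - 2 - 3: L is symmetric, so F(x, z) is convex-concave
L = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  1.0]])
c = np.array([0.0, 1.0, 2.0])   # f^i(x) = 0.5*(x - c_i)^2, so grad f = x - c
gamma, h = 1.0, 0.01            # augmentation gain and Euler step (assumed values)

x = np.array([5.0, -3.0, 10.0])
z = np.zeros(3)
for _ in range(20000):
    x_dot = -(x - c) - gamma * L @ x - L @ z    # -dF/dx
    z_dot = L @ x                               #  dF/dz
    x, z = x + h * x_dot, z + h * z_dot

# x converges to consensus on the minimizer of sum_i f^i: the mean of c, i.e. 1
```

At the saddle point Lx = 0 enforces consensus, while z has drifted to the multiplier that balances the local gradients.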



In the case of directed graphs

Weight-balanced  ⇔  1^T L = 0  ⇔  L + L^T ⪰ 0

    F(x, z) := f(x) + (γ/2) x^T L x + z^T L x

is still convex-concave, but the saddle-point dynamics

    ẋ = − ∂F(x, z)/∂x = −∇f(x) − γ (1/2)(L + L^T) x − L^T z    (Not distributed!)

    ż =   ∂F(x, z)/∂z = L x

are changed to the distributed dynamics

    ẋ = −∇f(x) − γ L x − L z

    ż = L x

J. Wang and N. Elia (with L^T = L), Allerton, 10

B. Gharesifard and J. Cortés, CDC, 12

11 / 30


Review of online convex optimization

12 / 30


Different kind of optimization: sequential decision making

In the diagnosis example, each round t ∈ {1, . . . , T}:

question (features, medical findings): w_t
decision (about using CT): h(x_t, w_t)
outcome (by CT findings / follow-up of patient): y_t
loss: l(y_t h(x_t, w_t))

Choose x_t & incur loss f_t(x_t) := l(y_t h(x_t, w_t))

Goal: sublinear regret,

    R(u, T) := ∑_{t=1}^T f_t(x_t) − ∑_{t=1}^T f_t(u) ∈ o(T),

using “historical observations”

13 / 30
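The regret of any sequence of plays can be computed directly from this definition. A toy check (Python) with made-up quadratic losses f_t(x) = (x − a_t)² and a follow-the-leader player; everything here is illustrative, not the algorithm of the talk:

```python
def regret(fs, xs, u):
    """R(u, T) = sum_t f_t(x_t) - sum_t f_t(u)."""
    return sum(f(x) for f, x in zip(fs, xs)) - sum(f(u) for f in fs)

T = 200
a = [t % 2 for t in range(1, T + 1)]                 # targets alternate 1, 0, 1, 0, ...
fs = [lambda x, at=at: (x - at) ** 2 for at in a]

# follow-the-leader: play the minimizer of past losses (the running mean of past targets)
xs, hist = [], []
for at in a:
    xs.append(sum(hist) / len(hist) if hist else 0.0)
    hist.append(at)

r = regret(fs, xs, u=0.5)   # u = 0.5 is the best fixed decision for this alternating sequence
# r stays bounded (far below T), so the player's temporal average matches the comparator's
```

Even though no single play x_t is ever optimal for its round, the cumulative gap to the best fixed decision stays small, which is exactly what sublinear regret asks for.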



Why regret?

If the regret is sublinear,

    ∑_{t=1}^T f_t(x_t) ≤ ∑_{t=1}^T f_t(u) + o(T),

then

    lim_{T→∞} (1/T) ∑_{t=1}^T f_t(x_t) ≤ lim_{T→∞} (1/T) ∑_{t=1}^T f_t(u)

In temporal average, the online decisions {x_t}_{t=1}^T perform as well as the best fixed decision in hindsight

“No regrets, my friend”

14 / 30


Why regret?

What about generalization error?

Sublinear regret does not imply that x_{t+1} will do well with f_{t+1}

No assumptions are made about the sequence {f_t}; it can
- follow an unknown stochastic or deterministic model, or
- be chosen adversarially

In our example, f_t := l(y_t h(x_t, w_t)):
- if some model w ↦ h(x*, w) explains the data reasonably well in hindsight,
- then the online models w ↦ h(x_t, w) perform just as well on average

15 / 30



Some classical results

Projected gradient descent:

    x_{t+1} = Π_S(x_t − η_t ∇f_t(x_t)),    (1)

where Π_S is the projection onto a compact set S ⊆ R^d, and ‖∇f_t‖_2 ≤ H

Follow-the-Regularized-Leader:

    x_{t+1} = argmin_{y ∈ S} ( ∑_{s=1}^t f_s(y) + ψ(y) )

Martin Zinkevich, 03
- (1) achieves O(√T) regret under convexity with η_t = 1/√t

Elad Hazan, Amit Agarwal, and Satyen Kale, 07
- (1) achieves O(log T) regret under p-strong convexity with η_t = 1/(pt)
- Others: Online Newton Step, Follow-the-Regularized-Leader, etc.

16 / 30
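Update (1) with η_t = 1/√t takes only a few lines. A sketch (Python/NumPy) with S a Euclidean ball and a made-up sequence of absolute-value losses f_t(x) = |x − a_t| whose targets alternate, so no fixed point exists and regret is the right yardstick:

```python
import numpy as np

def project_ball(x, radius):
    """Projection Pi_S onto the Euclidean ball of the given radius."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

T, radius = 100, 2.0
a = np.array([(-1.0) ** t for t in range(1, T + 1)])   # alternating targets -1, 1, -1, ...
x = np.zeros(1)
losses = []
for t in range(1, T + 1):
    losses.append(abs(float(x[0]) - a[t - 1]))         # f_t(x_t) = |x_t - a_t|
    g = np.sign(x - a[t - 1])                          # a subgradient of f_t at x_t
    x = project_ball(x - g / np.sqrt(t), radius)       # eta_t = 1/sqrt(t)

# regret against the best fixed decision in hindsight, u = 0
regret = sum(losses) - np.sum(np.abs(0.0 - a))
# sublinear: the accumulated gap grows like sqrt(T), not like T
```

The diminishing step size is what keeps the iterate from chasing each adversarial target too aggressively.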



Done with the review. Now... we combine the two frameworks:

previous work and limitations

our coordination algorithm in the online case

theorems of O(√T) & O(log T) agent regrets

simulations

outline of the proof

17 / 30


Combining both frameworks

18 / 30


Returning to the diagnosis example:

Health center i ∈ {1, . . . , N} takes care of a set of patients P^i_t at time t

    f^i(x) = ∑_{t=1}^T ∑_{s ∈ P^i_t} l(y_s h(x, w_s)) = ∑_{t=1}^T f^i_t(x)

Goal: sublinear agent regret,

    R^j(u, T) := ∑_{t=1}^T ∑_{i=1}^N f^i_t(x^j_t) − ∑_{t=1}^T ∑_{i=1}^N f^i_t(u) ∈ o(T),

using “local information” & “historical observations”

19 / 30



Challenge: Coordinate hospitals

[Figure: five health centers with time-varying local objectives f^1_t, ..., f^5_t on a communication digraph]

Need to design distributed online algorithms

20 / 30


Previous work on consensus-based online algorithms

F. Yan, S. Sundaram, S. V. N. Vishwanathan, and Y. Qi, TAC — Projected Subgradient Descent
- log(T) regret (local strong convexity & bounded subgradients)
- √T regret (convexity & bounded subgradients)
- both analyses require a projection onto a compact set

S. Hosseini, A. Chapman, and M. Mesbahi, CDC, 13 — Dual Averaging
- √T regret (convexity & bounded subgradients)
- general regularized projection onto a closed convex set

K. I. Tsianos and M. G. Rabbat, arXiv, 12 — Projected Subgradient Descent
- empirical risk as opposed to regret analysis

Limitation: in all cases the communication digraph is fixed and strongly connected

21 / 30


Our contributions (informally)

time-varying communication digraphs under B-joint connectivity & weight-balanced

unconstrained optimization (no projection step onto a bounded set)

log T regret (local strong convexity & bounded subgradients)

√T regret (convexity & β-centrality with β ∈ (0, 1] & bounded subgradients)

[Figure: the negative subgradient −ξ_x at x forms an acute angle with the direction x* − x]

Definition (β-centrality): f^i_t is β-central in Z ⊆ R^d \ X*, where X* = {x* : 0 ∈ ∂f^i_t(x*)}, if for every x ∈ Z there exists x* ∈ X* such that

    −ξ_x^T (x* − x) / (‖ξ_x‖_2 ‖x* − x‖_2) ≥ β,

for each ξ_x ∈ ∂f^i_t(x).

22 / 30



Coordination algorithm

    x^i_{t+1} = x^i_t − η_t g_{x^i_t} + σ ( γ ∑_{j=1, j≠i}^N a_{ij,t} (x^j_t − x^i_t) + ∑_{j=1, j≠i}^N a_{ij,t} (z^j_t − z^i_t) )

    z^i_{t+1} = z^i_t − σ ∑_{j=1, j≠i}^N a_{ij,t} (x^j_t − x^i_t)

Subgradient descent on the previous local objective, g_{x^i_t} ∈ ∂f^i_t(x^i_t)

Proportional (linear) feedback on the disagreement with neighbors (the γ term)

Integral (linear) feedback on the disagreement with neighbors (the z dynamics)

The union of the communication graphs over intervals of length B is strongly connected

[Figure: the digraph among f^1_t, ..., f^5_t changes from time t to time t + 2; e.g., the weights a14,t+1 and a41,t+1 appear at time t + 1]

Compact representation & generalization:

    [x_{t+1}; z_{t+1}] = [x_t; z_t] − σ [γ L_t, L_t; −L_t, 0] [x_t; z_t] − η_t [g_{x_t}; 0],

that is,

    v_{t+1} = (I − σ E ⊗ L_t) v_t − η_t g_t

23 / 30
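The compact form above is convenient to code. A sketch (Python/NumPy) of one step, exercised on made-up static quadratic objectives f^i_t(x) = ½(x − c_i)² and two alternating single-edge graphs whose union is connected (B = 2); σ, γ, and η_t = 1/t are illustrative choices here, not the tuned constants of the theorem:

```python
import numpy as np

def coordination_step(x, z, Lt, g, eta, sigma, gamma):
    """x_{t+1} = x_t - sigma*(gamma*Lt x_t + Lt z_t) - eta_t g_t,
    z_{t+1} = z_t + sigma*Lt x_t   (the updates above, written with the Laplacian).
    x, z: (N,) stacked estimates; Lt: (N, N) Laplacian at time t; g: subgradients."""
    return (x - sigma * (gamma * Lt @ x + Lt @ z) - eta * g,
            z + sigma * Lt @ x)

LA = np.array([[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0], [0.0, 0.0, 0.0]])  # edge {1,2}
LB = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, -1.0], [0.0, -1.0, 1.0]])  # edge {2,3}
c = np.array([0.0, 1.0, 2.0])   # global minimizer of sum_i f^i is mean(c) = 1

x = np.array([5.0, -3.0, 10.0])
z = np.zeros(3)
for t in range(1, 5001):
    Lt = LA if t % 2 else LB    # jointly connected over windows of length B = 2
    x, z = coordination_step(x, z, Lt, x - c, eta=1.0 / t, sigma=0.3, gamma=1.0)

# the agents reach consensus near the global minimizer 1
```

Neither graph is connected on its own; it is only their union over each window that lets disagreement decay, which is the B-joint-connectivity assumption at work.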


Theorem (DMN-JC)

Assume that

{f^1_t, ..., f^N_t}_{t=1}^T are convex functions in R^d
- with H-bounded subgradient sets,
- nonempty and uniformly bounded sets of minimizers, and
- p-strongly convex in a sufficiently large neighborhood of their minimizers

The sequence of weight-balanced communication digraphs is
- nondegenerate, and
- B-jointly-connected

E ∈ R^{K×K} is diagonalizable with positive real eigenvalues

Then, taking σ ∈ [δ/λ_min(E), (1 − δ)/(λ_max(E) d_max)] and η_t = 1/(pt),

    R^j(u, T) ≤ C (‖u‖_2^2 + 1 + log T),

for any j ∈ {1, . . . , N} and u ∈ R^d

Substituting strong convexity by β-centrality, for β ∈ (0, 1],

    R^j(u, T) ≤ C ‖u‖_2^2 √T,

for any j ∈ {1, . . . , N} and u ∈ R^d
24 / 30


Simulations: acute brain finding revealed on Computerized Tomography

25 / 30


Simulations

[Figure, left: agents’ estimates {x^i_{t,7}}_{i=1}^N over time t ∈ [0, 200], compared against the centralized solution]

[Figure, right: average regret max_j (1/T) R^j(x*_T, {∑_i f^i_t}_{t=1}^T) versus the time horizon T, on a log scale from 10^{−3} to 10^1, for proportional-integral disagreement feedback, proportional disagreement feedback, and the centralized method]

    f^i_t(x) = ∑_{s ∈ P^i_t} l(y_s h(x, w_s)) + (1/10) ‖x‖_2^2,   where l(m) = log(1 + e^{−2m})

26 / 30


Outline of the proof

Analysis of the network regret

    R_N(u, T) := ∑_{t=1}^T ∑_{i=1}^N f^i_t(x^i_t) − ∑_{t=1}^T ∑_{i=1}^N f^i_t(u)

ISS with respect to agreement:

    ‖L_K v_t‖_2 ≤ C_I ‖v_1‖_2 (1 − δ/(4N^2))^⌈(t−1)/B⌉ + C_U max_{1≤s≤t−1} ‖u_s‖_2

Agent regret bound in terms of {η_t}_{t≥1} and initial conditions

Boundedness of the online estimates under β-centrality (β ∈ (0, 1])

Doubling trick to obtain O(√T) agent regret

Local strong convexity ⇒ β-centrality; η_t = 1/(pt) to obtain O(log T) agent regret

27 / 30
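The doubling trick mentioned above can be made concrete: run the algorithm in periods of length 1, 2, 4, ..., with a constant step size tuned to each period; if each period of length 2^m incurs O(√(2^m)) regret, the total over T rounds telescopes to O(√T). A small illustration (Python; the schedule shape is standard, the constant 1/√(period length) is an assumed tuning):

```python
import math

def doubling_schedule(T):
    """Step sizes: within the m-th period (length 2^m), use eta = 1/sqrt(2^m)."""
    etas, m = [], 0
    while len(etas) < T:
        etas.extend([1.0 / math.sqrt(2 ** m)] * (2 ** m))
        m += 1
    return etas[:T]

# the per-period regrets telescope: sum over periods of sqrt(2^m) is O(sqrt(T))
bound = sum(math.sqrt(2 ** m) for m in range(int(math.log2(1024)) + 1))
# for T = 1024 this geometric sum is about 3.4 * sqrt(T)
```

The point is that no knowledge of the horizon T is needed: restarting with a re-tuned constant step recovers the √T rate up to a constant factor.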


Conclusions and future work

Conclusions

“Distributed online convex optimization over jointly connected digraphs,” submitted to IEEE Transactions on Network Science and Engineering (available on Jorge Cortés’s webpage)

Relevant for regression & classification, which play a crucial role in machine learning, computer vision, etc.

Future work

Refine guarantees under a model for the evolution of the objective functions

28 / 30


Future directions in data-driven optimization problems

Distributed strategies for...

feature selection

semi-supervised learning

29 / 30


Thank you for listening!

Contact: David Mateos-Núñez and Jorge Cortés, at UC San Diego

30 / 30