

Page 1: slides_online_optimization_david_mateos

Distributed online optimization over jointly connected digraphs

David Mateos-Núñez    Jorge Cortés

University of California, San Diego

{dmateosn,cortes}@ucsd.edu

Mathematical Theory of Networks and Systems
University of Groningen, July 7, 2014

1 / 30


Overview of this talk

Distributed optimization

Online optimization

Case study: medical diagnosis

2 / 30


Motivation: a data-driven optimization problem

Medical findings, symptoms:
age factor
amnesia before impact
deterioration in GCS score
open skull fracture
loss of consciousness
vomiting

Any acute brain finding revealed on Computerized Tomography?

(-1 = not present, 1 = present)

“The Canadian CT Head Rule for patients with minor head injury”

3 / 30


Binary classification

feature vector of patient s:  w_s = ((w_s)_1, ..., (w_s)_{d−1})

true diagnosis:  y_s ∈ {−1, 1}

wanted weights:  x = (x_1, ..., x_d)

predictor:  h(x, w_s) = x^T (w_s, 1)

margin:  m_s(x) = y_s h(x, w_s)

Given the data set {w_s}_{s=1}^P, estimate x ∈ R^d by solving

    min_{x ∈ R^d}  ∑_{s=1}^P l(m_s(x))

where the loss function l : R → R is decreasing and convex

4 / 30
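The margin and a decreasing convex loss are easy to make concrete. Below is a minimal Python/NumPy sketch using the logistic loss l(m) = log(1 + e^{−2m}) that appears later on the simulations slide; the patient feature vector here is made up for illustration:

```python
import numpy as np

def predictor(x, w):
    """h(x, w) = x^T (w, 1): linear predictor with the bias folded into x."""
    return np.dot(x, np.append(w, 1.0))

def margin(x, w, y):
    """m_s(x) = y_s * h(x, w_s); positive iff the prediction has the right sign."""
    return y * predictor(x, w)

def logistic_loss(m):
    """l(m) = log(1 + exp(-2m)): decreasing and convex in the margin."""
    return np.log1p(np.exp(-2.0 * m))

# hypothetical patient: three findings encoded as features, true diagnosis y = +1
w_s = np.array([1.0, -1.0, 1.0])
y_s = 1
x = np.zeros(4)  # d = 4 weights (three features + one bias)

print(logistic_loss(margin(x, w_s, y_s)))  # at x = 0 the margin is 0, so the loss is log 2
```

A correct prediction with a large margin drives the loss toward zero, which is exactly why a decreasing convex l makes the sum over patients a sensible training objective.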



Why distributed online optimization?

Why distributed?

information is distributed across a group of agents; they need to interact to optimize performance

Why online?

information becomes incrementally available; an adaptive solution is needed

5 / 30


Review of distributed convex optimization

6 / 30


A distributed scenario

In the diagnosis example, health center i ∈ {1, . . . , N} manages a set of patients P^i

    f(x) = ∑_{i=1}^N ∑_{s ∈ P^i} l(y_s h(x, w_s)) = ∑_{i=1}^N f^i(x)

Goal: best predicting model w ↦ h(x*, w),

    min_{x ∈ R^d} ∑_{i=1}^N f^i(x)

using “local information”

7 / 30



What do we mean by “using local information”?

Agent i maintains an estimate x^i_t of

    x* ∈ argmin_{x ∈ R^d} ∑_{i=1}^N f^i(x)

Agent i has access to f^i

Agent i can share its estimate x^i_t with “neighboring” agents

[Figure: five agents with local objectives f^1, ..., f^5 on a digraph with adjacency matrix A; the nonzero weights shown are a13, a14, a21, a25, a31, a32, a41, a52]

Distributed algorithms: Tsitsiklis 84, Bertsekas and Tsitsiklis 95

Consensus: Jadbabaie et al. 03, Olfati-Saber and Murray 04, Boyd et al. 05

8 / 30



How do agents agree on the optimizer?

local minimization vs local consensus

A. Nedic and A. Ozdaglar, TAC, 09

    x^i_{k+1} = ∑_{j=1}^N a_{ij,k} x^j_k − η_k g^i_k,

where g^i_k ∈ ∂f^i(x^i_k) and A_k = (a_{ij,k}) is doubly stochastic

J. C. Duchi, A. Agarwal, and M. J. Wainwright, TAC, 12

9 / 30
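One synchronous round of this mix-then-descend update is straightforward to sketch. The toy below (Python/NumPy) uses made-up quadratic objectives f^i(x) = ½(x − c_i)², so the subgradient at x is x − c_i, and a complete graph with uniform doubly stochastic weights:

```python
import numpy as np

def consensus_subgradient_step(X, A, G, eta):
    """x^i_{k+1} = sum_j a_{ij,k} x^j_k - eta_k g^i_k, stacked over agents.
    X: (N, d) estimates, A: (N, N) doubly stochastic, G: (N, d) subgradients."""
    return A @ X - eta * G

N = 3
c = np.array([[0.0], [1.0], [2.0]])   # local minimizers; the global optimum is their mean, 1
A = np.full((N, N), 1.0 / N)          # complete graph, uniform weights (doubly stochastic)
X = np.array([[5.0], [-3.0], [10.0]])

for k in range(200):
    G = X - c                         # subgradient of f^i evaluated at x^i_k
    X = consensus_subgradient_step(X, A, G, eta=1.0 / (k + 1))

# all three estimates approach the global minimizer 1
```

The diminishing step size 1/(k+1) is the standard choice for this scheme; the mixing step keeps the estimates close to each other while each local gradient pulls toward the local data.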



Saddle-point dynamics

The minimization problem can be regarded as

    min_{x ∈ R^d} ∑_{i=1}^N f^i(x) = min_{x^1 = ... = x^N ∈ R^d} ∑_{i=1}^N f^i(x^i) = min_{x ∈ (R^d)^N, Lx = 0} f(x),

where (Lx)^i = ∑_{j=1}^N a_ij (x^i − x^j) and f(x) = ∑_{i=1}^N f^i(x^i)

Convex-concave augmented Lagrangian when L is symmetric:

    F(x, z) := f(x) + (γ/2) x^T L x + z^T L x

Saddle-point dynamics:

    ẋ = − ∂F(x, z)/∂x = −∇f(x) − γ L x − L z

    ż =   ∂F(x, z)/∂z = L x

10 / 30
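A forward-Euler discretization makes these continuous-time dynamics easy to try out. A sketch (Python/NumPy) on a symmetric path-graph Laplacian with toy quadratic objectives f^i(x) = ½(x − c_i)²; the step size h and the gain γ are arbitrary choices for illustration:

```python
import numpy as np

# path graph 1 - 2 - 3: L is symmetric, so F(x, z) is convex-concave
L = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  1.0]])
c = np.array([0.0, 1.0, 2.0])   # f^i(x) = 0.5*(x - c_i)^2, so grad f = x - c
gamma, h = 1.0, 0.01            # augmentation gain and Euler step (assumed values)

x = np.array([5.0, -3.0, 10.0])
z = np.zeros(3)
for _ in range(20000):
    x_dot = -(x - c) - gamma * L @ x - L @ z    # -dF/dx
    z_dot = L @ x                               #  dF/dz
    x, z = x + h * x_dot, z + h * z_dot

# x converges to consensus on the minimizer of sum_i f^i: the mean of c, i.e. 1
```

At the saddle point Lx = 0 enforces consensus, while z has drifted to the multiplier that balances the local gradients.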



In the case of directed graphs

Weight-balanced  ⇔  1^T L = 0  ⇔  L + L^T ⪰ 0

    F(x, z) := f(x) + (γ/2) x^T L x + z^T L x

is still convex-concave, but the saddle-point dynamics

    ẋ = − ∂F(x, z)/∂x = −∇f(x) − γ (1/2)(L + L^T) x − L^T z    (Not distributed!)

    ż =   ∂F(x, z)/∂z = L x

are changed to the distributed dynamics

    ẋ = −∇f(x) − γ L x − L z

    ż = L x

J. Wang and N. Elia (with L^T = L), Allerton, 10

B. Gharesifard and J. Cortés, CDC, 12

11 / 30


Review of online convex optimization

12 / 30


Different kind of optimization: sequential decision making

In the diagnosis example, each round t ∈ {1, . . . , T}:

question (features, medical findings): w_t
decision (about using CT): h(x_t, w_t)
outcome (by CT findings / follow-up of patient): y_t
loss: l(y_t h(x_t, w_t))

Choose x_t & incur loss f_t(x_t) := l(y_t h(x_t, w_t))

Goal: sublinear regret,

    R(u, T) := ∑_{t=1}^T f_t(x_t) − ∑_{t=1}^T f_t(u) ∈ o(T),

using “historical observations”

13 / 30
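The regret of any sequence of plays can be computed directly from this definition. A toy check (Python) with made-up quadratic losses f_t(x) = (x − a_t)² and a follow-the-leader player; everything here is illustrative, not the algorithm of the talk:

```python
def regret(fs, xs, u):
    """R(u, T) = sum_t f_t(x_t) - sum_t f_t(u)."""
    return sum(f(x) for f, x in zip(fs, xs)) - sum(f(u) for f in fs)

T = 200
a = [t % 2 for t in range(1, T + 1)]                 # targets alternate 1, 0, 1, 0, ...
fs = [lambda x, at=at: (x - at) ** 2 for at in a]

# follow-the-leader: play the minimizer of past losses (the running mean of past targets)
xs, hist = [], []
for at in a:
    xs.append(sum(hist) / len(hist) if hist else 0.0)
    hist.append(at)

r = regret(fs, xs, u=0.5)   # u = 0.5 is the best fixed decision for this alternating sequence
# r stays bounded (far below T), so the player's temporal average matches the comparator's
```

Even though no single play x_t is ever optimal for its round, the cumulative gap to the best fixed decision stays small, which is exactly what sublinear regret asks for.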



Why regret?

If the regret is sublinear,

    ∑_{t=1}^T f_t(x_t) ≤ ∑_{t=1}^T f_t(u) + o(T),

then

    lim_{T→∞} (1/T) ∑_{t=1}^T f_t(x_t) ≤ lim_{T→∞} (1/T) ∑_{t=1}^T f_t(u)

In temporal average, the online decisions {x_t}_{t=1}^T perform as well as the best fixed decision in hindsight

“No regrets, my friend”

14 / 30


Why regret?

What about generalization error?

Sublinear regret does not imply that x_{t+1} will do well with f_{t+1}

No assumptions are made about the sequence {f_t}; it can
- follow an unknown stochastic or deterministic model, or
- be chosen adversarially

In our example, f_t := l(y_t h(x_t, w_t)):
- if some model w ↦ h(x*, w) explains the data reasonably well in hindsight,
- then the online models w ↦ h(x_t, w) perform just as well on average

15 / 30



Some classical results

Projected gradient descent:

    x_{t+1} = Π_S(x_t − η_t ∇f_t(x_t)),    (1)

where Π_S is the projection onto a compact set S ⊆ R^d, and ‖∇f_t‖_2 ≤ H

Follow-the-Regularized-Leader:

    x_{t+1} = argmin_{y ∈ S} ( ∑_{s=1}^t f_s(y) + ψ(y) )

Martin Zinkevich, 03
- (1) achieves O(√T) regret under convexity with η_t = 1/√t

Elad Hazan, Amit Agarwal, and Satyen Kale, 07
- (1) achieves O(log T) regret under p-strong convexity with η_t = 1/(pt)
- Others: Online Newton Step, Follow-the-Regularized-Leader, etc.

16 / 30
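Update (1) with η_t = 1/√t takes only a few lines. A sketch (Python/NumPy) with S a Euclidean ball and a made-up sequence of absolute-value losses f_t(x) = |x − a_t| whose targets alternate, so no fixed point exists and regret is the right yardstick:

```python
import numpy as np

def project_ball(x, radius):
    """Projection Pi_S onto the Euclidean ball of the given radius."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

T, radius = 100, 2.0
a = np.array([(-1.0) ** t for t in range(1, T + 1)])   # alternating targets -1, 1, -1, ...
x = np.zeros(1)
losses = []
for t in range(1, T + 1):
    losses.append(abs(float(x[0]) - a[t - 1]))         # f_t(x_t) = |x_t - a_t|
    g = np.sign(x - a[t - 1])                          # a subgradient of f_t at x_t
    x = project_ball(x - g / np.sqrt(t), radius)       # eta_t = 1/sqrt(t)

# regret against the best fixed decision in hindsight, u = 0
regret = sum(losses) - np.sum(np.abs(0.0 - a))
# sublinear: the accumulated gap grows like sqrt(T), not like T
```

The diminishing step size is what keeps the iterate from chasing each adversarial target too aggressively.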



Done with the review. Now... we combine the two frameworks:

previous work and limitations

our coordination algorithm in the online case

theorems of O(√T) & O(log T) agent regrets

simulations

outline of the proof

17 / 30


Combining both frameworks

18 / 30


Returning to the diagnosis example:

Health center i ∈ {1, . . . , N} takes care of a set of patients P^i_t at time t

    f^i(x) = ∑_{t=1}^T ∑_{s ∈ P^i_t} l(y_s h(x, w_s)) = ∑_{t=1}^T f^i_t(x)

Goal: sublinear agent regret,

    R^j(u, T) := ∑_{t=1}^T ∑_{i=1}^N f^i_t(x^j_t) − ∑_{t=1}^T ∑_{i=1}^N f^i_t(u) ∈ o(T),

using “local information” & “historical observations”

19 / 30



Challenge: Coordinate hospitals

[Figure: five health centers with time-varying local objectives f^1_t, ..., f^5_t on a communication digraph]

Need to design distributed online algorithms

20 / 30


Previous work on consensus-based online algorithms

F. Yan, S. Sundaram, S. V. N. Vishwanathan, and Y. Qi, TAC — Projected Subgradient Descent
- log(T) regret (local strong convexity & bounded subgradients)
- √T regret (convexity & bounded subgradients)
- both analyses require a projection onto a compact set

S. Hosseini, A. Chapman, and M. Mesbahi, CDC, 13 — Dual Averaging
- √T regret (convexity & bounded subgradients)
- general regularized projection onto a closed convex set

K. I. Tsianos and M. G. Rabbat, arXiv, 12 — Projected Subgradient Descent
- empirical risk as opposed to regret analysis

Limitation: in all cases the communication digraph is fixed and strongly connected

21 / 30


Our contributions (informally)

time-varying communication digraphs under B-joint connectivity & weight-balanced

unconstrained optimization (no projection step onto a bounded set)

log T regret (local strong convexity & bounded subgradients)

√T regret (convexity & β-centrality with β ∈ (0, 1] & bounded subgradients)

[Figure: the negative subgradient −ξ_x at x forms an acute angle with the direction x* − x]

Definition (β-centrality): f^i_t is β-central in Z ⊆ R^d \ X*, where X* = {x* : 0 ∈ ∂f^i_t(x*)}, if for every x ∈ Z there exists x* ∈ X* such that

    −ξ_x^T (x* − x) / (‖ξ_x‖_2 ‖x* − x‖_2) ≥ β,

for each ξ_x ∈ ∂f^i_t(x).

22 / 30



Coordination algorithm

    x^i_{t+1} = x^i_t − η_t g_{x^i_t} + σ ( γ ∑_{j=1, j≠i}^N a_{ij,t} (x^j_t − x^i_t) + ∑_{j=1, j≠i}^N a_{ij,t} (z^j_t − z^i_t) )

    z^i_{t+1} = z^i_t − σ ∑_{j=1, j≠i}^N a_{ij,t} (x^j_t − x^i_t)

Subgradient descent on the previous local objective, g_{x^i_t} ∈ ∂f^i_t(x^i_t)

Proportional (linear) feedback on the disagreement with neighbors (the γ term)

Integral (linear) feedback on the disagreement with neighbors (the z dynamics)

The union of the communication graphs over intervals of length B is strongly connected

[Figure: the digraph among f^1_t, ..., f^5_t changes from time t to time t + 2; e.g., the weights a14,t+1 and a41,t+1 appear at time t + 1]

Compact representation & generalization:

    [x_{t+1}; z_{t+1}] = [x_t; z_t] − σ [γ L_t, L_t; −L_t, 0] [x_t; z_t] − η_t [g_{x_t}; 0],

that is,

    v_{t+1} = (I − σ E ⊗ L_t) v_t − η_t g_t

23 / 30
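The compact form above is convenient to code. A sketch (Python/NumPy) of one step, exercised on made-up static quadratic objectives f^i_t(x) = ½(x − c_i)² and two alternating single-edge graphs whose union is connected (B = 2); σ, γ, and η_t = 1/t are illustrative choices here, not the tuned constants of the theorem:

```python
import numpy as np

def coordination_step(x, z, Lt, g, eta, sigma, gamma):
    """x_{t+1} = x_t - sigma*(gamma*Lt x_t + Lt z_t) - eta_t g_t,
    z_{t+1} = z_t + sigma*Lt x_t   (the updates above, written with the Laplacian).
    x, z: (N,) stacked estimates; Lt: (N, N) Laplacian at time t; g: subgradients."""
    return (x - sigma * (gamma * Lt @ x + Lt @ z) - eta * g,
            z + sigma * Lt @ x)

LA = np.array([[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0], [0.0, 0.0, 0.0]])  # edge {1,2}
LB = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, -1.0], [0.0, -1.0, 1.0]])  # edge {2,3}
c = np.array([0.0, 1.0, 2.0])   # global minimizer of sum_i f^i is mean(c) = 1

x = np.array([5.0, -3.0, 10.0])
z = np.zeros(3)
for t in range(1, 5001):
    Lt = LA if t % 2 else LB    # jointly connected over windows of length B = 2
    x, z = coordination_step(x, z, Lt, x - c, eta=1.0 / t, sigma=0.3, gamma=1.0)

# the agents reach consensus near the global minimizer 1
```

Neither graph is connected on its own; it is only their union over each window that lets disagreement decay, which is the B-joint-connectivity assumption at work.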


Theorem (DMN-JC)

Assume that

{f^1_t, ..., f^N_t}_{t=1}^T are convex functions in R^d
- with H-bounded subgradient sets,
- nonempty and uniformly bounded sets of minimizers, and
- p-strongly convex in a sufficiently large neighborhood of their minimizers

The sequence of weight-balanced communication digraphs is
- nondegenerate, and
- B-jointly-connected

E ∈ R^{K×K} is diagonalizable with positive real eigenvalues

Then, taking σ ∈ [δ/λ_min(E), (1 − δ)/(λ_max(E) d_max)] and η_t = 1/(pt),

    R^j(u, T) ≤ C (‖u‖_2^2 + 1 + log T),

for any j ∈ {1, . . . , N} and u ∈ R^d

Substituting strong convexity by β-centrality, for β ∈ (0, 1],

    R^j(u, T) ≤ C ‖u‖_2^2 √T,

for any j ∈ {1, . . . , N} and u ∈ R^d
24 / 30


Simulations: acute brain finding revealed on Computerized Tomography

25 / 30


Simulations

[Figure, left: agents’ estimates {x^i_{t,7}}_{i=1}^N over time t ∈ [0, 200], compared against the centralized solution]

[Figure, right: average regret max_j (1/T) R^j(x*_T, {∑_i f^i_t}_{t=1}^T) versus the time horizon T, on a log scale from 10^{−3} to 10^1, for proportional-integral disagreement feedback, proportional disagreement feedback, and the centralized method]

    f^i_t(x) = ∑_{s ∈ P^i_t} l(y_s h(x, w_s)) + (1/10) ‖x‖_2^2,   where l(m) = log(1 + e^{−2m})

26 / 30


Outline of the proof

Analysis of the network regret

    R_N(u, T) := ∑_{t=1}^T ∑_{i=1}^N f^i_t(x^i_t) − ∑_{t=1}^T ∑_{i=1}^N f^i_t(u)

ISS with respect to agreement:

    ‖L_K v_t‖_2 ≤ C_I ‖v_1‖_2 (1 − δ/(4N^2))^⌈(t−1)/B⌉ + C_U max_{1≤s≤t−1} ‖u_s‖_2

Agent regret bound in terms of {η_t}_{t≥1} and initial conditions

Boundedness of the online estimates under β-centrality (β ∈ (0, 1])

Doubling trick to obtain O(√T) agent regret

Local strong convexity ⇒ β-centrality; η_t = 1/(pt) to obtain O(log T) agent regret

27 / 30
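The doubling trick mentioned above can be made concrete: run the algorithm in periods of length 1, 2, 4, ..., with a constant step size tuned to each period; if each period of length 2^m incurs O(√(2^m)) regret, the total over T rounds telescopes to O(√T). A small illustration (Python; the schedule shape is standard, the constant 1/√(period length) is an assumed tuning):

```python
import math

def doubling_schedule(T):
    """Step sizes: within the m-th period (length 2^m), use eta = 1/sqrt(2^m)."""
    etas, m = [], 0
    while len(etas) < T:
        etas.extend([1.0 / math.sqrt(2 ** m)] * (2 ** m))
        m += 1
    return etas[:T]

# the per-period regrets telescope: sum over periods of sqrt(2^m) is O(sqrt(T))
bound = sum(math.sqrt(2 ** m) for m in range(int(math.log2(1024)) + 1))
# for T = 1024 this geometric sum is about 3.4 * sqrt(T)
```

The point is that no knowledge of the horizon T is needed: restarting with a re-tuned constant step recovers the √T rate up to a constant factor.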


Conclusions and future work

Conclusions

“Distributed online convex optimization over jointly connected digraphs,” submitted to IEEE Transactions on Network Science and Engineering (available on Jorge Cortés’s webpage)

Relevant for regression & classification, which play a crucial role in machine learning, computer vision, etc.

Future work

Refine guarantees under a model for the evolution of the objective functions

28 / 30


Future directions in data-driven optimization problems

Distributed strategies for...

feature selection

semi-supervised learning

29 / 30


Thank you for listening!

Contact: David Mateos-Núñez and Jorge Cortés, at UC San Diego

30 / 30