Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Distributed Solvers for Online Data-Driven Network Optimization
UNIVERSITY OF CALIFORNIAUNOFFICIAL SEAL
Attachment B - “Unofficial” SealFor Use on Letterhead
Jorge Cortes
Mechanical and Aerospace Engineering University of California, San Diego http://carmenere.ucsd.edu/jorge
NREL Workshop: Innovative Optimization and Control Methods for Highly Distributed Autonomous Systems
Table Mountain Inn, Golden, CO April 11-12, 2019
´
Acknowledgments
Ashish Priyank Simon Erfan Nozari David Mateos Dean Richert Michael McCreesh Cherukuri Srivastava Niederlander
Pavan Tallapragada
Bahman Gharesifard
Solmaz Kia Chin-Yao Chang Miguel Vaquero
Sonia Martinez Enrique Steven Low Marcello Emiliano Dall’Anese Mallada Colombino
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 2 / 38
Uncertainty pervasive too: interconnected world with
unstructured environments
complex, changing network dynamics
humans in the loop
contested scenarios, adversaries
´
Network Optimization is Pervasive
Optimizing agent operation with limited network resources
Grid of the future Intelligent transportation Disaster response
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 3 / 38
´
Network Optimization is Pervasive
Optimizing agent operation with limited network resources
Grid of the future Intelligent transportation Disaster response
Uncertainty pervasive too: interconnected world with
unstructured environments complex, changing network dynamics humans in the loop contested scenarios, adversaries
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 3 / 38
´
Basic Taxonomy for Network Optimization
Simple, yet remarkably ubiquitous formulation in multi-agent applications
minimize ‘aggregate objective function’(x)
subject to ‘ineq constraints’(x)
‘eq constraints’(x)
Large-scale systems
coupling might come from objective, constraints, or both
individual agents may seek to find global solution or only own component
coupling topology versus network topology: varying degree of sparsity all the way to non-sparse at all
How do we solve optimization in a distributed way?
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 4 / 38
Optimizers of convex problems ⇔ saddle points of convex-concave Lagrangian
L(x , y , z) = f (x) + y>g(x) + z>(Ax − b)
Dynamical systems approach to algorithms
dynamical systems that solve problemsin linear algebra, systems, optimization
´
Distributed Solvers for Network Optimization
Network optimization w/ distributed structure (widespread in multi-agent scenarios)
nX minimize f (x) = fi (xi ) (separable objective)
i=1
subject to g(x) ≤ 0
Ax = b (locally expressible)
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 5 / 38
Dynamical systems approach to algorithms
dynamical systems that solve problemsin linear algebra, systems, optimization
´
Distributed Solvers for Network Optimization
Network optimization w/ distributed structure (widespread in multi-agent scenarios)
nX minimize f (x) = fi (xi ) (separable objective)
i=1
subject to g(x) ≤ 0
Ax = b (locally expressible)
Optimizers of convex problems ⇔ saddle points of convex-concave Lagrangian
L(x , y , z) = f (x) + y > g(x) + z >(Ax − b)
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 5 / 38
´
Distributed Solvers for Network Optimization
Network optimization w/ distributed structure (widespread in multi-agent scenarios)
nX minimize f (x) = fi (xi ) (separable objective)
i=1
subject to g(x) ≤ 0
Ax = b (locally expressible)
Optimizers of convex problems ⇔ saddle points of convex-concave Lagrangian
L(x , y , z) = f (x) + y > g(x) + z >(Ax − b)
Dynamical systems approach to algorithms
dynamical systems that solve problems in linear algebra, systems, optimization
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 5 / 38
Rich dynamical behavior
characterization of stability and convergence properties
appealing for large-scale systems: easily implementableby individual agents
higher-order methods difficult to “distribute”, errors incomm&sensing lead to errors in higher-order terms
´
Distributed Algorithm Design via Primal-Dual Dynamics
Saddle points of L can be found via saddle-point dynamics
x = −rx L(x , y , z)
y = [ry L(x , y , z)]+ y
z = rz L(x , y , z)
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 6 / 38
Rich dynamical behavior
characterization of stability and convergence properties
appealing for large-scale systems: easily implementableby individual agents
higher-order methods difficult to “distribute”, errors incomm&sensing lead to errors in higher-order terms
´
Distributed Algorithm Design via Primal-Dual Dynamics
Saddle points of L can be found via saddle-point dynamics
x = −rf (x) − y T rg(x) − AT z
y = [g(x)]+ y
z = Ax − b
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 6 / 38
Rich dynamical behavior
characterization of stability and convergence properties
appealing for large-scale systems: easily implementableby individual agents
higher-order methods difficult to “distribute”, errors incomm&sensing lead to errors in higher-order terms
´
Distributed Algorithm Design via Primal-Dual Dynamics
Saddle points of L can be found via saddle-point dynamics X X∂fi ∂gα xi = − (xi ) − yα (xi , xNi ) − zβ Aβi
∂xi ∂xiα β
yα = [gα(x)]+ yα X
zβ = Aβj xj − bβ
j
From agent viewpoint, problem structure gives rise to distributed dynamics
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 6 / 38
´
Distributed Algorithm Design via Primal-Dual Dynamics
Saddle points of L can be found via saddle-point dynamics X X∂fi ∂gα xi = − (xi ) − yα (xi , xNi ) − zβ Aβi
∂xi ∂xiα β
yα = [gα(x)]+ yα X
zβ = Aβj xj − bβ
j
From agent viewpoint, problem structure gives rise to distributed dynamics
Rich dynamical behavior
characterization of stability and convergence properties
appealing for large-scale systems: easily implementable by individual agents
higher-order methods difficult to “distribute”, errors in comm&sensing lead to errors in higher-order terms
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 6 / 38
A. Cherukuri, B. Gharesifard, and J. Cortes. Saddle-point dynamics: conditions for asymptotic stability of saddle points.
SIAM Journal on Control and Optimization, 55(1):486–511, 2017N. K. Dhingra, S. Z. Khong, and M. R. Jovanovic. The proximal augmented Lagrangian method for nonsmooth composite optimization.
IEEE Transactions on Automatic Control, 2018.
To appear
Systems & Control Letters, 87:10–15, 2016´
Stability of Primal-Dual Dynamics Saddle points not necessarily asymptotically stable
Primal-dual dynamics for convex-concave F (x , z) = xz [Samuelson, 58]
x = −rx F (x , z) = −z
z = rz F (x , z) = x
Saddle point (0, 0) is stable, not asymptotically stable -1.0 -0.5 0.0 0.5 1.0-1.0
-0.5
0.0
0.5
1.0
x
z
Stream of results to understand asymptotic convergence & properties
K. Arrow, L Hurwitz, and H. Uzawa. Studies in Linear and Non-Linear Programming.
Stanford University Press, Stanford, California, 1958 D. Feijer and F. Paganini. Stability of primal-dual gradient dynamics and applications to network optimization.
Automatica, 46:1974–1981, 2010 J. Wang and N. Elia. Control approach to distributed optimization.
In Allerton Conf. on Communications, Control and Computing, pages 557–561, Monticello, IL, October 2010 J. Wang and N. Elia. A control perspective for centralized and distributed convex optimization.
In IEEE Conf. on Decision and Control, pages 3800–3805, Orlando, Florida, 2011 J. Chen and V. K. N. Lau. Convergence analysis of saddle point problems in time varying wireless systems - control theoretical approach.
IEEE Transactions on Signal Processing, 60(1):443–452, 2012 E. Mallada, C. Zhao, and S. H. Low. Optimal load-side control for frequency regulation in smart grids.
IEEE Transactions on Automatic Control, 62(12):6294–6309, 2017 T. Holding and I. Lestas. Stability and instability in saddle point dynamics - Part I.
2017.
https://arxiv.org/abs/1707.07349 T. Holding and I. Lestas. Stability and instability in saddle point dynamics - Part II: The subgradient method.
2017.
https://arxiv.org/abs/1707.07351 A. Cherukuri, E. Mallada, and J. Cortes. Asymptotic convergence of constrained primal-dual dynamics.
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 7 / 38
´
LaSalle Functions for Asymptotic Convergence
Positive-definite functions with negative semi-definite derivative
distance to saddle point (x∗, y∗, z∗)
Vd (x , y , z) = 12
� kx − x∗k2 + ky − y∗k2 + kz − z∗k2
�
magnitude of vector field
Vm(x , y , z) = 12 (krx F (x , y , z)k2 + krz F (x , y , z)k2 X
+ ((ry F (x , y , z))j )2)
j active
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 8 / 38
´
LaSalle Functions for Asymptotic Convergence
Positive-definite functions with negative semi-definite derivative
distance to saddle point (x∗, y∗, z∗)
Vd (x , y , z) = 12
� kx − x∗k2 + ky − y∗k2 + kz − z∗k2
�
magnitude of vector field
Vm(x , y , z) = 12 (krx F (x , y , z)k2 + krz F (x , y , z)k2 X
+ ((ry F (x , y , z))j )2)
j active
Global convergence under
convexity-concavity plus local strong convexity-concavity
convexity-linearity plus property of sets of saddle-points
quasiconvexity-quasiconcavity plus property of sets of saddle-points
local second-order information about saddle function
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 8 / 38
´
LaSalle Functions for Asymptotic Convergence
Positive-definite functions with negative semi-definite derivative
distance to saddle point (x∗, y∗, z∗) � � Vd (x , y , z) = 1
2 kx − x∗k2 + ky − y∗k2 + kz − z∗k2
magnitude of vector field
Vm(x , y , z) = 12 (krx F (x , y , z)k2 + krz F (x , y , z)k2 X
+ ((ry F (x , y , z))j )2)
j active
LaSalle arguments show convergence, but not enough for
characterization of convergence rate
dealing with errors in computation/comm/sensing
characterization of robustness against disturbances
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 8 / 38
Lyapunov Function for Constrained Optimization
For f : Rn → R strongly convex, twice continuously differentiable,g : Rn → Rp convex, twice continuously differentiable,
V (x , y , z) = 12k(x , y , z)k2Saddle(F ) + Vm(x , y , z)
1 V positive definite with respect to Saddle(F )
2 V negative definite: t 7→ V (x(t), y(t), z(t)) right-continuous, a.e. differentiable,
ddt
V (x(t), y(t), z(t)) < 0 for t where derivative exists & (x(t), y(t), z(t)) 6∈ Saddle(F )
V (x(t0), y(t0), z(t0)) ≤ limt↑t0 V (x(t), y(t), z(t)) for all t0 ≥ 0
A. Cherukuri, E. Mallada, S. H. Low, and J. Cortes. The role of convexity in saddle-point dynamics: Lyapunov function and robustness.IEEE Transactions on Automatic Control, 63(8):2449–2464, 2018
´
Beyond Asymptotic Stability
Identification of Lyapunov functions of primal-dual dynamics —leading to systematic characterization of convergence properties
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 9 / 38
1 V positive definite with respect to Saddle(F )
2 V negative definite: t 7→ V (x(t), y(t), z(t)) right-continuous, a.e. differentiable,
ddt
V (x(t), y(t), z(t)) < 0 for t where derivative exists & (x(t), y(t), z(t)) 6∈ Saddle(F )
V (x(t0), y(t0), z(t0)) ≤ limt↑t0 V (x(t), y(t), z(t)) for all t0 ≥ 0
A. Cherukuri, E. Mallada, S. H. Low, and J. Cortes. The role of convexity in saddle-point dynamics: Lyapunov function and robustness.IEEE Transactions on Automatic Control, 63(8):2449–2464, 2018
´
Beyond Asymptotic Stability
Identification of Lyapunov functions of primal-dual dynamics —leading to systematic characterization of convergence properties
Lyapunov Function for Constrained Optimization
For f : Rn → R strongly convex, twice continuously differentiable, g : Rn → Rp convex, twice continuously differentiable,
V (x , y , z) = 1 2 k(x , y , z)k
2 Saddle(F )
+ Vm(x , y , z)
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 9 / 38
´
Beyond Asymptotic Stability
Identification of Lyapunov functions of primal-dual dynamics —leading to systematic characterization of convergence properties
Lyapunov Function for Constrained Optimization
For f : Rn → R strongly convex, twice continuously differentiable, g : Rn → Rp convex, twice continuously differentiable,
V (x , y , z) = 1 2 k(x , y , z)k
2 Saddle(F )
+ Vm(x , y , z)
1
2
V positive definite with respect to Saddle(F )
V negative definite: t 7→ V (x(t), y (t), z(t)) right-continuous, a.e. differentiable,
d V (x(t), y(t), z(t)) < 0 for t where derivative exists & (x(t), y(t), z(t)) 6∈ Saddle(F )dt
V (x(t0), y(t0), z(t0)) ≤ lim t↑t0 V (x(t), y(t), z(t)) for all t0 ≥ 0
A. Cherukuri, E. Mallada, S. H. Low, and J. Cortes. The role of convexity in saddle-point dynamics: Lyapunov function and robustness. IEEE Transactions on Automatic Control, 63(8):2449–2464, 2018
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 9 / 38
Real-time implementation of the dynamics
adjust stepsize opportunistically based on system stateaperiodic discrete-time implementation w/same convergence guarantees
Tracking in time-varying optimization problems, data-driven implementations
approx.dynamics = true dynamics + error
estimates/data in lieu of exact elements to close the loop/avoid expensivecentralized computations/circumvent complex dynamicsonline formulations with tracking guarantees & handling of streaming data
´
Implications
Algorithm robustness against disturbances
true dynamics + disturbances
characterization of input-to-state stability properties of primal-dual dynamics graceful performance degradation as a function of size of disturbance
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 10 / 38
Tracking in time-varying optimization problems, data-driven implementations
approx.dynamics = true dynamics + error
estimates/data in lieu of exact elements to close the loop/avoid expensivecentralized computations/circumvent complex dynamicsonline formulations with tracking guarantees & handling of streaming data
´
Implications
Algorithm robustness against disturbances
true dynamics + disturbances
characterization of input-to-state stability properties of primal-dual dynamics graceful performance degradation as a function of size of disturbance
Real-time implementation of the dynamics
adjust stepsize opportunistically based on system state aperiodic discrete-time implementation w/same convergence guarantees
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 10 / 38
´
Implications
Algorithm robustness against disturbances
true dynamics + disturbances
characterization of input-to-state stability properties of primal-dual dynamics graceful performance degradation as a function of size of disturbance
Real-time implementation of the dynamics
adjust stepsize opportunistically based on system state aperiodic discrete-time implementation w/same convergence guarantees
Tracking in time-varying optimization problems, data-driven implementations
approx.dynamics = true dynamics + error
estimates/data in lieu of exact elements to close the loop/avoid expensive centralized computations/circumvent complex dynamics online formulations with tracking guarantees & handling of streaming data
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 10 / 38
Journal of Machine Learning Research, 17:1–43, 2016
A. C. Wilson, B. Recht, and M. I. Jordan. A Lyapunov analysis of momentum methods in optimization.2018.Available at https://arxiv.org/pdf/1611.02635.pdf
W. Su, S. Boyd, and E. J. Candes. A differential equation for modeling Nesterov’s accelerated gradient´
Primal-Dual Dynamics as Versatile Tool
Characterization of rates, speed, and acceleration N. K. Dhingra, S. Z. Khong, and M. R. Jovanovic. The proximal augmented Lagrangian method for nonsmooth composite optimization. IEEE Transactions on Automatic Control, 2018. To appear
J. Cortes and S. K. Niederlander. Distributed coordination for nonsmooth convex optimization via saddle-point dynamics. Journal of Nonlinear Science, 2019. To appear
G. Qu and N. Li. On the exponential stability of primal-dual gradient dynamics. 2018. Available online at https://arxiv.org/abs/1803.01825
W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015
M. McCreesh, J. Cortes, and B. Gharesifard. Accelerated convergence of saddle-point dynamics for convex-concave quadratic functions. In IEEE Conf. on Decision and Control, Nice, France, December 2019. Submitted
Graceful degradation as a function of size of disturbance A. Cherukuri, E. Mallada, S. H. Low, and J. Cortes. The role of convexity in saddle-point dynamics: Lyapunov function and robustness. IEEE Transactions on Automatic Control, 63(8):2449–2464, 2018
H. D. Nguyen, T. L. Vu, K. Turitsyn, and J. Slotine. Contraction and robustness of continuous time primal-dual dynamics. IEEE Control Systems Letters, 2(4):755–760, 2018
Feedback-based, data-driven, online formulation, tracking guarantees A. Bernstein and E. Dall’Anese. Real-time feedback-based optimization of distribution grids: a unified approach. 2017. https://arxiv.org/abs/1711.01627
M. Colombino, E. Dall’Anese, and A. Bernstein. Online optimization as a feedback controller: Stability and tracking. IEEE Transactions on Control of Network Systems, 2018. Submitted
E. Dall’Anese, S. Guggilam, A. Simonetto, Y. C. Chen, and S. V. Dhople. Optimal regulation of virtual power plants. IEEE Transactions on Power Systems, 33(2):1868–1881, 2018
Continuous-time vs discrete-time dynamics, machine learning ´ method: theory and insights. J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 11 / 38
´
Beyond Convergence #1: ISS with Respect to Saddle(F )
Robustness to errors in the gradient computation, noise in state measurements, errors in the controller implementation?
Theorem (Equality-Constrained Optimization)
For f strongly convex and C 2 , g = 0, with mI � r2f (x) � MI ∀x ∈ Rn , � x z
�
=
�−rx F (x , z) rz F (x , z)
�
+
�ux
uz
�
=
�−rf (x) − A>z
Ax − b
�
+
�ux
uz
�
is ISS with respect to Saddle(F )
Proof: Vβ (x , z) = β 1 1 k(x , z)k2 + β2Vm(x , z) is ISS-Lyapunov function 2 Saddle(F )
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 12 / 38
´
Beyond Convergence #1: ISS with Respect to Saddle(F )
Robustness to errors in the gradient computation, noise in state measurements, errors in the controller implementation?
Conjecture (Constrained Optimization)
For f strongly convex and C 2 , g convex and C 2, with mI � r2f (x) � MI ∀x ∈ Rn , ⎡
⎣ x yz
⎤
⎦=
⎡
⎣ −rx F (x , y , z) [ry F (x , y , z)]+
y rz F (x , y , z)
⎤
⎦+
⎡
⎣ ux
uy
uz
⎤
⎦=
⎡
⎣ −rf (x) − yT rg(x) − A>z
[g(x)]+ y
Ax − b
⎤
⎦+
⎡
⎣ ux
uy
uz
⎤
⎦
is ISS with respect to Saddle(F )
Proof: ISS-Lyapunov function theory for switched systems?
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 12 / 38
´
Example: Input-to-State Stability
⎡1 1
⎤ ⎡2⎤
f (x) = kxk2 , A = 1 0 and b = 1⎣ ⎦ , ⎣ ⎦
0 1 1
� Saddle(F ) = (x , z) ∈ R2 × R3 | x = (1, 1), z = −(1, 1, 1) + λ(1, −1, −1), λ ∈ R
f is C2, strongly convex, r2f (x) = 2I
0 5 10 15 20 25
-3
-1
1
3
x1 x2 z1 z2 z3
0 5 10 15 20 25
0
1
2
3
saddle-point dynamics distance to saddle points
Vanishing disturbance
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 13 / 38
´
Example: Input-to-State Stability
⎡1 1
⎤ ⎡2⎤
f (x) = kxk2 , A = 1 0 and b = 1⎣ ⎦ , ⎣ ⎦
0 1 1
� Saddle(F ) = (x , z) ∈ R2 × R3 | x = (1, 1), z = −(1, 1, 1) + λ(1, −1, −1), λ ∈ R
f is C2, strongly convex, r2f (x) = 2I
0 5 10 15 20 25
-3
-1
1
3
x1 x2 z1 z2 z3
0 5 10 15 20 25
0
1
2
3
saddle-point dynamics distance to saddle points
“Constant + Sinusoid” disturbance
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 13 / 38
Given sequence of triggering time instants {tk}∞k=0,
x(t) = −rxF (x(tk), z(tk))
z(t) = rzF (x(tk), z(tk))
for t ∈ [tk , tk+1) and k ∈ Z≥0
Objective: Design criterium to opportunistically select {tk}∞k=0 such thatfeasible executions: inter-trigger times lower bounded by positive quantity
global asymptotic convergence is retained
-
´
Beyond Convergence #2: Real Time Implementation
Opportunistic state-triggered implementation
avoid continuous evaluation of the vector field
adjust stepsize opportunistically based on state of the system
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 14 / 38
Objective: Design criterium to opportunistically select {tk}∞k=0 such thatfeasible executions: inter-trigger times lower bounded by positive quantity
global asymptotic convergence is retained
-
´
Beyond Convergence #2: Real Time Implementation
Opportunistic state-triggered implementation
avoid continuous evaluation of the vector field
adjust stepsize opportunistically based on state of the system
Given sequence of triggering time instants {tk }∞ k=0,
x(t) = −rx F (x(tk ), z(tk ))
z(t) = rz F (x(tk ), z(tk ))
for t ∈ [tk , tk+1) and k ∈ Z≥0
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 14 / 38
-
´
Beyond Convergence #2: Real Time Implementation
Opportunistic state-triggered implementation
avoid continuous evaluation of the vector field
adjust stepsize opportunistically based on state of the system
Given sequence of triggering time instants {tk }∞ k=0,
x(t) = −rx F (x(tk ), z(tk ))
z(t) = rz F (x(tk ), z(tk ))
for t ∈ [tk , tk+1) and k ∈ Z≥0
Objective: Design criterium to opportunistically select {tk }∞ such that k=0
feasible executions: inter-trigger times lower bounded by positive quantity
global asymptotic convergence is retained
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 14 / 38
Opportunistic state-triggered paradigm
trade-offs: comp, comm, sensing, control
identify criteria to autonomously trigger actionsbased on task – ‘active’ asynchronism
efficient implementations, incorporates uncertainty
´
Resource-Aware Control and Coordination
Continuous or periodic implementation paradigm
costly-to-implement synchronization for information sharing, processing, decision making
‘passive’ asynchronism, fixed agent time schedules
inefficient implementations for processor usage, communication bandwidth, energy
Time
Agents
Time
Certificate
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 15 / 38
´
Resource-Aware Control and Coordination
Continuous or periodic implementation paradigm
costly-to-implement synchronization for information sharing, processing, decision making
‘passive’ asynchronism, fixed agent time schedules
inefficient implementations for processor usage, communication bandwidth, energy
Opportunistic state-triggered paradigm
trade-offs: comp, comm, sensing, control
identify criteria to autonomously trigger actions based on task – ‘active’ asynchronism
efficient implementations, incorporates uncertainty
Time
Agents
Time
Certificate
Time
Agents
Time
Certificate
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 15 / 38
´
A Picture (A Movie) is Worth a Thousand Words
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 16 / 38
Insights
feasibility: guaranteed monotonic decrease of function, but aperiodic executionsmight not be feasible (accumulation of triggered times, Zeno)
certificate: LaSalle function clearly not good enough
trigger: specific challenges for network systems, both in design (local triggers) andanalysis (asynchronism, Zeno)
trigger: stabilization versus optimization
´
How to Decide When to Update?
Simplified setup: system x = f (x , u) on Rn with stabilization via
controller: k : Rn → Rm
certificate: Lyapunov function V : Rn → R
Synthesis for x = f (x , k(x)), w/ x sampled version of x?
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 17 / 38
Insights
feasibility: guaranteed monotonic decrease of function, but aperiodic executionsmight not be feasible (accumulation of triggered times, Zeno)
certificate: LaSalle function clearly not good enough
trigger: specific challenges for network systems, both in design (local triggers) andanalysis (asynchronism, Zeno)
trigger: stabilization versus optimization
´
How to Decide When to Update?
Simplified setup: system x = f (x , u) on Rn with stabilization via
controller: k : Rn → Rm
certificate: Lyapunov function V : Rn → R
Synthesis for x = f (x , k(x)), w/ x sampled version of x?
V = rV (x) · f (x , k(x)) = rV (x) · f (x , k(x)) + rV (x) · (f (x , k(x)) − f (x , k(x)))
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 17 / 38
Insights
feasibility: guaranteed monotonic decrease of function, but aperiodic executionsmight not be feasible (accumulation of triggered times, Zeno)
certificate: LaSalle function clearly not good enough
trigger: specific challenges for network systems, both in design (local triggers) andanalysis (asynchronism, Zeno)
trigger: stabilization versus optimization
´
How to Decide When to Update?
Simplified setup: system x = f (x , u) on Rn with stabilization via
controller: k : Rn → Rm
certificate: Lyapunov function V : Rn → R
Synthesis for x = f (x , k(x)), w/ x sampled version of x?
V = rV (x) · f (x , k(x)) ≤ rV (x) · f (x , k(x)) + h(x) kx − xk
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 17 / 38
Insights
feasibility: guaranteed monotonic decrease of function, but aperiodic executionsmight not be feasible (accumulation of triggered times, Zeno)
certificate: LaSalle function clearly not good enough
trigger: specific challenges for network systems, both in design (local triggers) andanalysis (asynchronism, Zeno)
trigger: stabilization versus optimization
´
How to Decide When to Update?
Simplified setup: system x = f (x , u) on Rn with stabilization via
controller: k : Rn → Rm
certificate: Lyapunov function V : Rn → R
Synthesis for x = f (x , k(x)), w/ x sampled version of x?
−Lf (x,k(x))V (x)Trigger criterium: kx − xk ≤
h(x)
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 17 / 38
´
How to Decide When to Update?
Simplified setup: system x = f (x , u) on Rn with stabilization via
controller: k : Rn → Rm
certificate: Lyapunov function V : Rn → R
Synthesis for x = f (x , k(x)), w/ x sampled version of x?
−Lf (x,k(x))V (x)Trigger criterium: kx − xk ≤
h(x)
Insights
feasibility: guaranteed monotonic decrease of function, but aperiodic executions might not be feasible (accumulation of triggered times, Zeno)
certificate: LaSalle function clearly not good enough
trigger: specific challenges for network systems, both in design (local triggers) and analysis (asynchronism, Zeno)
trigger: stabilization versus optimization
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 17 / 38
´
State-Triggered Implementation for Primal-Dual Dynamics
Given sequence of triggering time instants {tk }∞ k=0,
x(t) = −rx F (x(tk ), z(tk ))
z(t) = rz F (x(tk ), z(tk ))
for t ∈ [tk , tk+1) and k ∈ Z≥0
Approach: Use Vβ to design criterium to opportunistically select {tk }∞ k=0
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 18 / 38
Approach: Use Vβ to design criterium to opportunistically select {tk}∞k=0
´
State-Triggered Implementation for Primal-Dual Dynamics
Given sequence of triggering time instants {tk }∞ k=0,
x(t) = −rx F (x(tk ), z(tk ))
z(t) = rz F (x(tk ), z(tk ))
for t ∈ [tk , tk+1) and k ∈ Z≥0
State-triggered criterium
Vβ (x(tk ), z(tk ))LXsptk+1 = tk − ξ(x(tk ), z(tk ))kXsp(x(tk ), z(tk ))k2
Next triggering time computable with information available at current one
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 18 / 38
Approach: Use Vβ to design criterium to opportunistically select {tk}∞k=0
´
State-Triggered Implementation for Primal-Dual Dynamics
Given sequence of triggering time instants {tk }∞ k=0,
x(t) = −rx F (x(tk ), z(tk ))
z(t) = rz F (x(tk ), z(tk ))
for t ∈ [tk , tk+1) and k ∈ Z≥0
Theorem For f strongly convex and C 2, with mI � r2f (x) � MI and x 7→ r2f (x) Lipschitz, A full row rank,
trajectories of self-triggered dynamics converge to (x∗, z∗)
inter-event times are lower bounded by positive quantity
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 18 / 38
´
Example: Self-Triggered Implementation
F (x , z) = kxk2 + z(x1 + x2 + x3 − 1).
f (x) = kxk2 satisfies hypotheses, A = [1, 1, 1] full row-rank 1 1Saddle(F ) = {(( 13 , 3 , ), − 2 )}3 3
0 5 10 15
-3
-2
-1
0
1
2
3
x1 x2 x3 z
0 5 10 15
0
10
20
30
40
10 15 20
0
0.5
1×10
-4
primal-dual dynamics Vβ
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 19 / 38
´
Beyond Convergence #3: Tracking in Time-Varying Opt
Time-varying optimization problems
minimize ft (x)
subject to gt (x) ≤ 0
ht (x) = 0
ISS characterization leads to tracking guarantees
lim sup τ →∞
k(x(τ ), y(τ), z(τ)) − (x ∗ (t), y ∗ (t), z ∗ (t))k ≤ γ(cδ)
where γ is class K function and
δ is upper bound on rate of change of saddle points
k d dt (x ∗ (t), y ∗ (t), z ∗ (t))k ≤ δ
c time scale of primal-dual dynamics
d dτ
= c d dt
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 20 / 38
Distributed primal-dual dynamics to solveMPC formulation
Two groups of 5 cars enter network attime 1 (red) and 3 (blue)
Capacity in cell 4 drops to zero at time5 due to an accident
M. Vaquero and J. Cortes. Distributed augmentation-regularization for robust online convex optimization.In IFAC Workshop on Distributed Estimation and Control in Networked Systems, pages 230–235, Groningen, The Netherlands, 2018
´
Dynamic Traffic Assignment
Dynamic traffic assignment problem tunes traffic flows to optimize total travel distance, total travel time given time-varying inflows to network
Cell transmission model
x(t) – traffic volume in cells at time t f (t) – flows between cells at time t λ(t) – inflow to the network +
µ(t) – outflow from the network
supply&demand functions of infrastructure
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 21 / 38
M. Vaquero and J. Cortes. Distributed augmentation-regularization for robust online convex optimization.In IFAC Workshop on Distributed Estimation and Control in Networked Systems, pages 230–235, Groningen, The Netherlands, 2018
´
Dynamic Traffic Assignment
Dynamic traffic assignment problem tunes traffic flows to optimize total travel distance, total travel time given time-varying inflows to network
Cell transmission model
x(t) – traffic volume in cells at time t f (t) – flows between cells at time t λ(t) – inflow to the network +
µ(t) – outflow from the network
supply&demand functions of infrastructure
Distributed primal-dual dynamics to solve MPC formulation
Two groups of 5 cars enter network at time 1 (red) and 3 (blue)
Capacity in cell 4 drops to zero at time 5 due to an accident 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5
Primal-Dual Iterations 104
0
5
10
15
20
25
30
35
Dis
tan
ce
to
op
tim
um
MP
C ite
ration =
1
MP
C ite
ration =
2
MP
C ite
ration =
3
MP
C ite
ration =
4
MP
C ite
ration =
5
MP
C ite
ration =
6
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 21 / 38
M. Vaquero and J. Cortes. Distributed augmentation-regularization for robust online convex optimization.In IFAC Workshop on Distributed Estimation and Control in Networked Systems, pages 230–235, Groningen, The Netherlands, 2018
´
Dynamic Traffic Assignment
Dynamic traffic assignment problem tunes traffic flows to optimize total travel distance, total travel time given time-varying inflows to network
Cell transmission model
x(t) – traffic volume in cells at time t f (t) – flows between cells at time t λ(t) – inflow to the network +
µ(t) – outflow from the network
supply&demand functions of infrastructure
Distributed primal-dual dynamics to solve MPC formulation
Two groups of 5 cars enter network at time 1 (red) and 3 (blue)
Capacity in cell 4 drops to zero at time 5 due to an accident
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 21 / 38
´
Dynamic Traffic Assignment
Dynamic traffic assignment problem tunes traffic flows to optimize total travel distance, total travel time given time-varying inflows to network
Cell transmission model
x(t) – traffic volume in cells at time t f (t) – flows between cells at time t λ(t) – inflow to the network +
µ(t) – outflow from the network
supply&demand functions of infrastructure
Distributed primal-dual dynamics to solve MPC formulation
Two groups of 5 cars enter network at time 1 (red) and 3 (blue)
Capacity in cell 4 drops to zero at time 5 due to an accident
M. Vaquero and J. Cortes. Distributed augmentation-regularization for robust online convex optimization. In IFAC Workshop on Distributed Estimation and Control in Networked Systems, pages 230–235, Groningen, The Netherlands, 2018
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 21 / 38
´
Beyond Convergence #4: Coupled w/Network Dynamics
Select optimized setpoints x (power injection of distributed energy resources, traffic flows among contiguous arterial roads)
min f (x , y(x))
subject to g(x , y(x)) ≤ 0
h(x , y(x)) = 0
that drive physical y(x) (bus frequencies/voltages, traffic density) dynamics
ξ = Φ(ξ, x)
y = Ψ(ξ, d)
Implementation of primal-dual dynamics requires evaluation of y(x), ∂∂xy (x)
substitute by data/high-fidelity sim/estimates ⇒ approx. primal-dual dynamics
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 22 / 38
´
Traffic Intersection Control with Simulation of Urban MObility (SUMO) simulator
Stages
Controlled traffic light (green-red stage time)
Total cycle time = 160 sec, is = is time of stage i
Flows = 10 cars/cycle, but and = 100 cars/cycle
F(s) is number of cars out given s and flows
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 23 / 38
´
Data-driven Optimization of Stage Times
SUMO is complex agent-based simulator that incorporates
network characteristics: road topology, traffic lights, sidewalks
number of vehicles entering network at any time and any lane
destination of vehicles: route, turning ratios, origin/destination
type of vehicle: car, truck, bus, van, bicycle, pedestrian (w/ length, width, acceleration, max speed)
driver’s behavior: intersection model, lane changing model, car following model
minimize − x
subject to x = Ft (s) 8X
si = 160 i=1
si ≥ 5
data-driven optimized strategy constant, equal stage times
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 24 / 38
´
Data-driven Optimization of Stage Times
SUMO is complex agent-based simulator that incorporates
network characteristics: road topology, traffic lights, sidewalks
number of vehicles entering network at any time and any lane
destination of vehicles: route, turning ratios, origin/destination
type of vehicle: car, truck, bus, van, bicycle, pedestrian (w/ length, width, acceleration, max speed)
driver’s behavior: intersection model, lane changing model, car following model
minimize − x
subject to x = Ft (s) 8X
si = 160 i=1
si ≥ 5
data-driven optimized strategy adaptive signal control – Webster
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 24 / 38
´
Taking the Basic Taxonomy Further
minimize ‘aggregate objective function’(x)
subject to ‘ineq constraints’(x)
‘eq constraints’(x)
Hedging against uncertainty via data-driven distributionally-robust optimization A. Cherukuri and J. Cortes. Cooperative data-driven distributionally robust optimization. IEEE Transactions on Automatic Control, 2018. Submitted
Protecting privacy of individual information via differential privacy E. Nozari, P. Tallapragada, and J. Cortes. Differentially private distributed convex optimization via functional perturbation. IEEE Transactions on Control of Network Systems, 5(1):395–408, 2018
Multi-layer optimization: competition+coordination in power systems A. Cherukuri and J. Cortes. Iterative bidding in electricity markets: rationality and robustness. IEEE Transactions on Network Science and Engineering, 2018. Submitted
P. Srivastava, C.-Y. Chang, and J. Cortes. Participation of microgrids in frequency regulation markets. In American Control Conference, pages 3834–3839, Milwaukee, WI, May 2018
Dealing with non-sparse constraints through data C.-Y. Chang, M. Colombino, J. Cortes, and E. Dall’Anese. Saddle-flow dynamics for distributed feedback-based optimization. IEEE Control Systems Letters, 2019. Submitted
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 25 / 38
´
Hedging Against Uncertainty
Stochastic optimization works well when distribution is known and datasets are large
Risky with “uncertainty about uncertainty” Published as a conference paper at ICLR 2015
+ .007⇥ =
x sign(rxJ(✓, x, y))x +
✏sign(rxJ(✓, x, y))“panda” “nematode” “gibbon”
57.7% confidence 8.2% confidence 99.3 % confidence
Figure 1: A demonstration of fast adversarial example generation applied to GoogLeNet (Szegedyet al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal tothe sign of the elements of the gradient of the cost function with respect to the input, we can changeGoogLeNet’s classification of the image. Here our ✏ of .007 corresponds to the magnitude of thesmallest bit of an 8 bit image encoding after GoogLeNet’s conversion to real numbers.
Let ✓ be the parameters of a model, x the input to the model, y the targets associated with x (formachine learning tasks that have targets) and J(✓, x, y) be the cost used to train the neural network.We can linearize the cost function around the current value of ✓, obtaining an optimal max-normconstrained pertubation of
⌘ = ✏sign (rxJ(✓, x, y)) .
We refer to this as the “fast gradient sign method” of generating adversarial examples. Note that therequired gradient can be computed efficiently using backpropagation.
We find that this method reliably causes a wide variety of models to misclassify their input. SeeFig. 1 for a demonstration on ImageNet. We find that using ✏ = .25, we cause a shallow softmaxclassifier to have an error rate of 99.9% with an average confidence of 79.3% on the MNIST (?) testset1. In the same setting, a maxout network misclassifies 89.4% of our adversarial examples withan average confidence of 97.6%. Similarly, using ✏ = .1, we obtain an error rate of 87.15% andan average probability of 96.6% assigned to the incorrect labels when using a convolutional maxoutnetwork on a preprocessed version of the CIFAR-10 (Krizhevsky & Hinton, 2009) test set2. Othersimple methods of generating adversarial examples are possible. For example, we also found thatrotating x by a small angle in the direction of the gradient reliably produces adversarial examples.
The fact that these simple, cheap algorithms are able to generate misclassified examples serves asevidence in favor of our interpretation of adversarial examples as a result of linearity. The algorithmsare also useful as a way of speeding up adversarial training or even just analysis of trained networks.
5 ADVERSARIAL TRAINING OF LINEAR MODELS VERSUS WEIGHT DECAY
Perhaps the simplest possible model we can consider is logistic regression. In this case, the fastgradient sign method is exact. We can use this case to gain some intuition for how adversarialexamples are generated in a simple setting. See Fig. 2 for instructive images.
If we train a single model to recognize labels y 2 {�1, 1} with P (y = 1) = ��w>x + b
�where
�(z) is the logistic sigmoid function, then training consists of gradient descent on
Ex,y⇠pdata⇣(�y(w>x + b))
where ⇣(z) = log (1 + exp(z)) is the softplus function. We can derive a simple analytical form fortraining on the worst-case adversarial perturbation of x rather than x itself, based on gradient sign
1This is using MNIST pixel values in the interval [0, 1]. MNIST data does contain values other than 0 or1, but the images are essentially binary. Each pixel roughly encodes “ink” or “no ink”. This justifies expectingthe classifier to be able to handle perturbations within a range of width 0.5, and indeed human observers canread such images without difficulty.
2 See https://github.com/lisa-lab/pylearn2/tree/master/pylearn2/scripts/papers/maxout. for the preprocessing code, which yields a standard deviation of roughly 0.5.
3
[Goodfellow, Shlens, Szegedy ’15]
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 26 / 38
´
Hedging Against Uncertainty
Stochastic optimization works well when distribution is known and datasets are large
Risky with “uncertainty about uncertainty” Published as a conference paper at ICLR 2015
+ .007⇥ =
x sign(rxJ(✓, x, y))x +
✏sign(rxJ(✓, x, y))“panda” “nematode” “gibbon”
57.7% confidence 8.2% confidence 99.3 % confidence
Figure 1: A demonstration of fast adversarial example generation applied to GoogLeNet (Szegedyet al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal tothe sign of the elements of the gradient of the cost function with respect to the input, we can changeGoogLeNet’s classification of the image. Here our ✏ of .007 corresponds to the magnitude of thesmallest bit of an 8 bit image encoding after GoogLeNet’s conversion to real numbers.
Let ✓ be the parameters of a model, x the input to the model, y the targets associated with x (formachine learning tasks that have targets) and J(✓, x, y) be the cost used to train the neural network.We can linearize the cost function around the current value of ✓, obtaining an optimal max-normconstrained pertubation of
⌘ = ✏sign (rxJ(✓, x, y)) .
We refer to this as the “fast gradient sign method” of generating adversarial examples. Note that therequired gradient can be computed efficiently using backpropagation.
We find that this method reliably causes a wide variety of models to misclassify their input. SeeFig. 1 for a demonstration on ImageNet. We find that using ✏ = .25, we cause a shallow softmaxclassifier to have an error rate of 99.9% with an average confidence of 79.3% on the MNIST (?) testset1. In the same setting, a maxout network misclassifies 89.4% of our adversarial examples withan average confidence of 97.6%. Similarly, using ✏ = .1, we obtain an error rate of 87.15% andan average probability of 96.6% assigned to the incorrect labels when using a convolutional maxoutnetwork on a preprocessed version of the CIFAR-10 (Krizhevsky & Hinton, 2009) test set2. Othersimple methods of generating adversarial examples are possible. For example, we also found thatrotating x by a small angle in the direction of the gradient reliably produces adversarial examples.
The fact that these simple, cheap algorithms are able to generate misclassified examples serves asevidence in favor of our interpretation of adversarial examples as a result of linearity. The algorithmsare also useful as a way of speeding up adversarial training or even just analysis of trained networks.
5 ADVERSARIAL TRAINING OF LINEAR MODELS VERSUS WEIGHT DECAY
Perhaps the simplest possible model we can consider is logistic regression. In this case, the fastgradient sign method is exact. We can use this case to gain some intuition for how adversarialexamples are generated in a simple setting. See Fig. 2 for instructive images.
If we train a single model to recognize labels y 2 {�1, 1} with P (y = 1) = ��w>x + b
�where
�(z) is the logistic sigmoid function, then training consists of gradient descent on
Ex,y⇠pdata⇣(�y(w>x + b))
where ⇣(z) = log (1 + exp(z)) is the softplus function. We can derive a simple analytical form fortraining on the worst-case adversarial perturbation of x rather than x itself, based on gradient sign
1This is using MNIST pixel values in the interval [0, 1]. MNIST data does contain values other than 0 or1, but the images are essentially binary. Each pixel roughly encodes “ink” or “no ink”. This justifies expectingthe classifier to be able to handle perturbations within a range of width 0.5, and indeed human observers canread such images without difficulty.
2 See https://github.com/lisa-lab/pylearn2/tree/master/pylearn2/scripts/papers/maxout. for the preprocessing code, which yields a standard deviation of roughly 0.5.
3
[Goodfellow, Shlens, Szegedy ’15]
Many scenarios w/decisions need to be made before large datasets can be collected
deadlines imposed by performance or safety considerations
timescale of system’s evolution faster than speed at which data can be collected
acquiring samples is expensive: adversary purposely hides in environment
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 26 / 38
´
Hedging Against Uncertainty
Stochastic optimization works well when distribution is known and datasets are large
Risky with “uncertainty about uncertainty” Published as a conference paper at ICLR 2015
+ .007⇥ =
x sign(rxJ(✓, x, y))x +
✏sign(rxJ(✓, x, y))“panda” “nematode” “gibbon”
57.7% confidence 8.2% confidence 99.3 % confidence
Figure 1: A demonstration of fast adversarial example generation applied to GoogLeNet (Szegedyet al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal tothe sign of the elements of the gradient of the cost function with respect to the input, we can changeGoogLeNet’s classification of the image. Here our ✏ of .007 corresponds to the magnitude of thesmallest bit of an 8 bit image encoding after GoogLeNet’s conversion to real numbers.
Let ✓ be the parameters of a model, x the input to the model, y the targets associated with x (formachine learning tasks that have targets) and J(✓, x, y) be the cost used to train the neural network.We can linearize the cost function around the current value of ✓, obtaining an optimal max-normconstrained pertubation of
⌘ = ✏sign (rxJ(✓, x, y)) .
We refer to this as the “fast gradient sign method” of generating adversarial examples. Note that therequired gradient can be computed efficiently using backpropagation.
We find that this method reliably causes a wide variety of models to misclassify their input. SeeFig. 1 for a demonstration on ImageNet. We find that using ✏ = .25, we cause a shallow softmaxclassifier to have an error rate of 99.9% with an average confidence of 79.3% on the MNIST (?) testset1. In the same setting, a maxout network misclassifies 89.4% of our adversarial examples withan average confidence of 97.6%. Similarly, using ✏ = .1, we obtain an error rate of 87.15% andan average probability of 96.6% assigned to the incorrect labels when using a convolutional maxoutnetwork on a preprocessed version of the CIFAR-10 (Krizhevsky & Hinton, 2009) test set2. Othersimple methods of generating adversarial examples are possible. For example, we also found thatrotating x by a small angle in the direction of the gradient reliably produces adversarial examples.
The fact that these simple, cheap algorithms are able to generate misclassified examples serves asevidence in favor of our interpretation of adversarial examples as a result of linearity. The algorithmsare also useful as a way of speeding up adversarial training or even just analysis of trained networks.
5 ADVERSARIAL TRAINING OF LINEAR MODELS VERSUS WEIGHT DECAY
Perhaps the simplest possible model we can consider is logistic regression. In this case, the fastgradient sign method is exact. We can use this case to gain some intuition for how adversarialexamples are generated in a simple setting. See Fig. 2 for instructive images.
If we train a single model to recognize labels y 2 {�1, 1} with P (y = 1) = ��w>x + b
�where
�(z) is the logistic sigmoid function, then training consists of gradient descent on
Ex,y⇠pdata⇣(�y(w>x + b))
where ⇣(z) = log (1 + exp(z)) is the softplus function. We can derive a simple analytical form fortraining on the worst-case adversarial perturbation of x rather than x itself, based on gradient sign
1This is using MNIST pixel values in the interval [0, 1]. MNIST data does contain values other than 0 or1, but the images are essentially binary. Each pixel roughly encodes “ink” or “no ink”. This justifies expectingthe classifier to be able to handle perturbations within a range of width 0.5, and indeed human observers canread such images without difficulty.
2 See https://github.com/lisa-lab/pylearn2/tree/master/pylearn2/scripts/papers/maxout. for the preprocessing code, which yields a standard deviation of roughly 0.5.
3
[Goodfellow, Shlens, Szegedy ’15]
Traditional robust optimization might be overly conservative
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 26 / 38
´
Hedging Against Uncertainty
Stochastic optimization works well when distribution is known and datasets are large
Risky with “uncertainty about uncertainty” Published as a conference paper at ICLR 2015
+ .007⇥ =
x sign(rxJ(✓, x, y))x +
✏sign(rxJ(✓, x, y))“panda” “nematode” “gibbon”
57.7% confidence 8.2% confidence 99.3 % confidence
Figure 1: A demonstration of fast adversarial example generation applied to GoogLeNet (Szegedyet al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal tothe sign of the elements of the gradient of the cost function with respect to the input, we can changeGoogLeNet’s classification of the image. Here our ✏ of .007 corresponds to the magnitude of thesmallest bit of an 8 bit image encoding after GoogLeNet’s conversion to real numbers.
Let ✓ be the parameters of a model, x the input to the model, y the targets associated with x (formachine learning tasks that have targets) and J(✓, x, y) be the cost used to train the neural network.We can linearize the cost function around the current value of ✓, obtaining an optimal max-normconstrained pertubation of
⌘ = ✏sign (rxJ(✓, x, y)) .
We refer to this as the “fast gradient sign method” of generating adversarial examples. Note that therequired gradient can be computed efficiently using backpropagation.
We find that this method reliably causes a wide variety of models to misclassify their input. SeeFig. 1 for a demonstration on ImageNet. We find that using ✏ = .25, we cause a shallow softmaxclassifier to have an error rate of 99.9% with an average confidence of 79.3% on the MNIST (?) testset1. In the same setting, a maxout network misclassifies 89.4% of our adversarial examples withan average confidence of 97.6%. Similarly, using ✏ = .1, we obtain an error rate of 87.15% andan average probability of 96.6% assigned to the incorrect labels when using a convolutional maxoutnetwork on a preprocessed version of the CIFAR-10 (Krizhevsky & Hinton, 2009) test set2. Othersimple methods of generating adversarial examples are possible. For example, we also found thatrotating x by a small angle in the direction of the gradient reliably produces adversarial examples.
The fact that these simple, cheap algorithms are able to generate misclassified examples serves asevidence in favor of our interpretation of adversarial examples as a result of linearity. The algorithmsare also useful as a way of speeding up adversarial training or even just analysis of trained networks.
5 ADVERSARIAL TRAINING OF LINEAR MODELS VERSUS WEIGHT DECAY
Perhaps the simplest possible model we can consider is logistic regression. In this case, the fastgradient sign method is exact. We can use this case to gain some intuition for how adversarialexamples are generated in a simple setting. See Fig. 2 for instructive images.
If we train a single model to recognize labels y 2 {�1, 1} with P (y = 1) = ��w>x + b
�where
�(z) is the logistic sigmoid function, then training consists of gradient descent on
Ex,y⇠pdata⇣(�y(w>x + b))
where ⇣(z) = log (1 + exp(z)) is the softplus function. We can derive a simple analytical form fortraining on the worst-case adversarial perturbation of x rather than x itself, based on gradient sign
1This is using MNIST pixel values in the interval [0, 1]. MNIST data does contain values other than 0 or1, but the images are essentially binary. Each pixel roughly encodes “ink” or “no ink”. This justifies expectingthe classifier to be able to handle perturbations within a range of width 0.5, and indeed human observers canread such images without difficulty.
2 See https://github.com/lisa-lab/pylearn2/tree/master/pylearn2/scripts/papers/maxout. for the preprocessing code, which yields a standard deviation of roughly 0.5.
3
[Goodfellow, Shlens, Szegedy ’15]
The best of both worlds: distributionally robust optimization
empirical samples + ‘what-if’ scenarios
Attractive b/c of formal guarantees valid for small datasets
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 26 / 38
Given N i.i.d samples {bξ k}Nk=1, discrete empirical probability distribution
PN =1
N
NXk=1
δbξ k
EPN [f (x , ξ)] =1
N
NXi=1
f (x , ξk)
Sample-average approximation has
almost sure convergence guarantee as N →∞poor out-of-sample performance for small N
´
Network Optimization under Uncertainty
Stochastic optimization: inf EP[f (x , ξ)]x∈X ⊂Rd
objective f : Rd × Rm → R encodes network goal —total travel time, total distance covered, -network outflow
decision variable x at central/aggregate/local level —max velocity, inflows at control nodes, routing at intersections
random variable ξ distributed w/ unknown probability P —inflows/outflows at non-control nodes, road densities, vehicle locations
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 27 / 38
Given N i.i.d samples {bξ k}Nk=1, discrete empirical probability distribution
PN =1
N
NXk=1
δbξ k
EPN [f (x , ξ)] =1
N
NXi=1
f (x , ξk)
Sample-average approximation has
almost sure convergence guarantee as N →∞poor out-of-sample performance for small N
´
Network Optimization under Uncertainty
Stochastic optimization: inf EP[f (x , ξ)]x∈X ⊂Rd
objective f : Rd × Rm → R encodes network goal —total travel time, total distance covered, -network outflow
decision variable x at central/aggregate/local level —max velocity, inflows at control nodes, routing at intersections
random variable ξ distributed w/ unknown probability P —inflows/outflows at non-control nodes, road densities, vehicle locations
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 27 / 38
EPN [f (x , ξ)] =1
N
NXi=1
f (x , ξk)
Sample-average approximation has
almost sure convergence guarantee as N →∞poor out-of-sample performance for small N
´
Network Optimization under Uncertainty
Stochastic optimization: inf EP[f (x , ξ)]x∈X ⊂Rd
objective f : Rd × Rm → R encodes network goal —total travel time, total distance covered, -network outflow
decision variable x at central/aggregate/local level —max velocity, inflows at control nodes, routing at intersections
random variable ξ distributed w/ unknown probability P —inflows/outflows at non-control nodes, road densities, vehicle locations
Given N i.i.d samples {b k=1, discrete empirical probability distribution ξ k }N
NX1PN = δξbkN
k=1
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 27 / 38
Sample-average approximation has
almost sure convergence guarantee as N →∞poor out-of-sample performance for small N
´
Network Optimization under Uncertainty
Stochastic optimization: inf EP[f (x , ξ)]x∈X ⊂Rd
objective f : Rd × Rm → R encodes network goal —total travel time, total distance covered, -network outflow
decision variable x at central/aggregate/local level —max velocity, inflows at control nodes, routing at intersections
random variable ξ distributed w/ unknown probability P —inflows/outflows at non-control nodes, road densities, vehicle locations
Given N i.i.d samples {b k=1, discrete empirical probability distribution ξ k }N
N NX X1 1 kPN = δξbk EPN [f (x , ξ)] = f (x , ξ )
N N k=1 i=1
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 27 / 38
´
Network Optimization under Uncertainty
Stochastic optimization: inf EP[f (x , ξ)]x∈X ⊂Rd
objective f : Rd × Rm → R encodes network goal —total travel time, total distance covered, -network outflow
decision variable x at central/aggregate/local level —max velocity, inflows at control nodes, routing at intersections
random variable ξ distributed w/ unknown probability P —inflows/outflows at non-control nodes, road densities, vehicle locations
Given N i.i.d samples {b k=1, discrete empirical probability distribution ξ k }N
N NX X1 1PN = δξbk EPN [f (x , ξ)] = f (x , ξk )
N N k=1 i=1
Sample-average approximation has
almost sure convergence guarantee as N →∞
poor out-of-sample performance for small N
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 27 / 38
Desirable properties in ambiguity set
rich enough to contain true data-generating distribution with high confidence
small enough to exclude pathological distributions (avoid conservativeness)
easy to parameterize from data
facilitate tractable solution of optimization problem
´
Distributionally Robust Optimization
Account for ignorance of true data-generating distribution P
Distributionally robust (DRO) formulation to hedge against uncertainty
inf sup EQ[f (x , ξ)] x∈X Q∈PbN
bPN is ambiguity set of probability distributions
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 28 / 38
´
Distributionally Robust Optimization
Account for ignorance of true data-generating distribution P
Distributionally robust (DRO) formulation to hedge against uncertainty
inf sup EQ[f (x , ξ)] x∈X Q∈PbN
bPN is ambiguity set of probability distributions
Desirable properties in ambiguity set
rich enough to contain true data-generating distribution with high confidence
small enough to exclude pathological distributions (avoid conservativeness)
easy to parameterize from data
facilitate tractable solution of optimization problem
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 28 / 38
´
Ambiguity Sets Via Wasserstein Metric
Wasserstein metric: cost of optimal transportation plan of probability mass
2 dW2 (Q1, Q2) =
�inf
nZ kξ1 − ξ2k2Π(dξ1, dξ2)
�� Π ∈ H(Q1, Q2)o� 1
Ξ2
—H(Q1, Q2) is set of distributions w/ marginals Q1 and Q2
Q1 Q2
Π is transportation plan for moving mass distribution
norm k · k encodes transportation cost
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 29 / 38
Out-of-sample guarantee & tractability [Esfahani & Kuhn ’16]
Let bJN , bxN be optimal value, optimizer of DRO. ThenPN�EP[f (bxN , ξ)] ≤ bJN� ≥ 1− β
Additionally, if x 7→ f (x , ξ) convex ∀ξ ∈ Ξ, then bJN equal to convex optimizationinf
λ≥0,x∈X
nλ�2N(β) +
1
N
NXk=1
maxξ∈Rm
�f (x , ξ)− λkξ − ξkk2
�o
´
Data-Driven DRO
Ambiguity set via Wasserstein metric
PbN = B�N (β)(PN )
Then, PN (P ∈ PbN ) ≥ 1 − β
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 30 / 38
Additionally, if x 7→ f (x , ξ) convex ∀ξ ∈ Ξ, then bJN equal to convex optimizationinf
λ≥0,x∈X
nλ�2N(β) +
1
N
NXk=1
maxξ∈Rm
�f (x , ξ)− λkξ − ξkk2
�o
´
Data-Driven DRO
Ambiguity set via Wasserstein metric
PbN = B�N (β)(PN )
Then, PN (P ∈ PbN ) ≥ 1 − β
Out-of-sample guarantee & tractability [Esfahani & Kuhn ’16]
Let bJN , bxN be optimal value, optimizer of DRO. Then
PN � EP[f (bxN , ξ)] ≤ bJN
� ≥ 1 − β
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 30 / 38
´
Data-Driven DRO
Ambiguity set via Wasserstein metric
PbN = B�N (β)(PN )
Then, PN (P ∈ PbN ) ≥ 1 − β
Out-of-sample guarantee & tractability [Esfahani & Kuhn ’16]
Let bJN , bxN be optimal value, optimizer of DRO. Then
PN � EP[f (bxN , ξ)] ≤ bJN
� ≥ 1 − β
Additionally, if x 7→ f (x , ξ) convex ∀ξ ∈ Ξ, then bJN equal to convex optimization
inf λ≥0,x∈X
n λ�2
N (β) + 1 N
NX
k=1
max ξ∈Rm
� f (x , ξ) − λkξ − ξk k2
�o
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 30 / 38
´
Network Optimization Problem
Cooperative network of n agents collecting data, communicating w/ neighbors, no central coordinator
communication modeled by graph (V, E) each agent knows objective function f , constraint set X
each agent gathers i.i.d samples Ξb i (Ξb i ∩ Ξb j = ∅)
data available across network ∪n Ξb i = {ξbk }N i=1 k=1
{ξ9, ξ10}
{ξ1, ξ2}
{ξ3, ξ4}
{ξ5, ξ6}{ξ7, ξ8}
f(x, ξ)
Leverage power-of-many to collectively improve out-of-sample guarantee
individual agents can solve DRO with own data, but
can benefit from others’ contributions to obtain higher-quality solution
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 31 / 38
| {z }agreement on λi ’s
| {z }agreement on xi ’s
Problems are equivalent (w/ f convex in x)
Given bxN , there exists λ∗ ≥ 0 s.t. (1n ⊗ bxN , λ∗1n) is optimizer of (?)
If (x∗v , λ∗v ) is optimizer of (?), then x∗v = 1n ⊗ bxN
Same optimal value bJNOptimization (?) has separable objective & locally computable constraints!
´
Distributed Reformulation of Data-Driven DRO
Each agent i with own estimate x i (and λi ) of optimal solution
N (β)1> N � ��2 X n λv 1 kmin + max f (x vk , ξ)−λvk kξ − ξb k2
xv ,λv ≥0n n N ξ∈Rm k=1
(?) subject to Lλv = 0n and (L ⊗ Id )xv = 0nd
1 n(Here xv = (x ; . . . ; x ), λv = (λ1; . . . ; λn ))
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 32 / 38
Problems are equivalent (w/ f convex in x)
Given bxN , there exists λ∗ ≥ 0 s.t. (1n ⊗ bxN , λ∗1n) is optimizer of (?)
If (x∗v , λ∗v ) is optimizer of (?), then x∗v = 1n ⊗ bxN
Same optimal value bJNOptimization (?) has separable objective & locally computable constraints!
´
Distributed Reformulation of Data-Driven DRO
Each agent i with own estimate x i (and λi ) of optimal solution
N (β)1> N � ��2 X n λv 1 kmin + max f (x vk , ξ)−λvk kξ − ξb k2
xv ,λv ≥0n n N ξ∈Rm k=1
(?) subject to Lλv = 0n and (L ⊗ Id )xv = 0nd| {z } | {z }
agreement on λi ’s agreement on xi ’s
1 n(Here xv = (x ; . . . ; x ), λv = (λ1; . . . ; λn ))
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 32 / 38
´
Distributed Reformulation of Data-Driven DRO
Each agent i with own estimate x i (and λi ) of optimal solution
N (β)1> N � ��2 X n λv 1 kmin + max f (x vk , ξ)−λvk kξ − ξb k2
xv ,λv ≥0n n N ξ∈Rm k=1
(?) subject to Lλv = 0n and (L ⊗ Id )xv = 0nd| {z } | {z }
agreement on λi ’s agreement on xi ’s
1 n(Here xv = (x ; . . . ; x ), λv = (λ1; . . . ; λn ))
Problems are equivalent (w/ f convex in x)
Given bxN , there exists λ ∗ ≥ 0 s.t. (1n ⊗ bxN , λ ∗ 1n) is optimizer of (?)
If (x ∗ v , λ ∗
v ) is optimizer of (?), then x ∗ v = 1n ⊗ bxN
Same optimal value bJN
Optimization (?) has separable objective & locally computable constraints!
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 32 / 38
Augmented Lagrangian: (for better convergence properties)
Laug(xv, λv, ν, η) := L(xv, λv, ν, η) +1
2x>v (L⊗ Id )xv +
1
2λ>v Lλv
= max{ξ(k)}
Laug(xv, λv, ν, η, {ξ(k)})
Laug(xv, λv, ν, η, {ξ(k)}) :=�2N(β)1
>n λv
n+
NXk=1
�f (xvk , ξ)−λvk kξ − bξkk2�
+ ν>Lλv + η>(L⊗ Id )xv +1
2x>v (L⊗ Id )xv +
1
2λ>v Lλv
´
Modified Lagrangian
Getting rid of inner maximization in Lagrangian
λv N
N n�2 (β)1> X � � Lagrangian: L(xv, λv, ν, η) := + max f (x vk , ξ)−λvk kξ − ξbk k2
n ξ∈Rm k=1
+ ν>Lλv + η>(L ⊗ Id )xv
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 33 / 38
= max{ξ(k)}
Laug(xv, λv, ν, η, {ξ(k)})
Laug(xv, λv, ν, η, {ξ(k)}) :=�2N(β)1
>n λv
n+
NXk=1
�f (xvk , ξ)−λvk kξ − bξkk2�
+ ν>Lλv + η>(L⊗ Id )xv +1
2x>v (L⊗ Id )xv +
1
2λ>v Lλv
´
Modified Lagrangian
Getting rid of inner maximization in Lagrangian
N � ��2 (β)1> X N n λv Lagrangian: L(xv, λv, ν, η) := + max f (x vk , ξ)−λvk kξ − ξbk k2
n ξ∈Rm k=1
+ ν>Lλv + η>(L ⊗ Id )xv
Augmented Lagrangian: (for better convergence properties)
>Laug(xv, λv, ν, η) := L(xv, λv, ν, η) + 1 x (L ⊗ Id )xv +
1 λ>Lλvv v2 2
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 33 / 38
´
Modified Lagrangian
Getting rid of inner maximization in Lagrangian
N � ��2 (β)1> X N n λv Lagrangian: L(xv, λv, ν, η) := + max f (x vk , ξ)−λvk kξ − ξbk k2
n ξ∈Rm k=1
+ ν>Lλv + η>(L ⊗ Id )xv
Augmented Lagrangian: (for better convergence properties)
1 > 1 Laug(xv, λv, ν, η) := L(xv, λv, ν, η) + x (L ⊗ Id )xv + λ>Lλvv v2 2
= max Laug(xv, λv, ν, η, {ξ(k)})˜{ξ(k)}
N � � Laug(xv, λv, ν, η, {ξ(k)}) := N n λv
+ f (x vk , ξ)−λvk kξ − ξbk k2�2 (β)1> X
n k=1
>+ ν>Lλv + η>(L ⊗ Id )xv + 1 x (L ⊗ Id )xv +
1 λ>Lλvv v2 2
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 33 / 38
Correspondence between optima and saddle points
1 If (x∗v , λ∗v , ν
∗, η∗) is saddle point of L, then ∃ {(ξ∗)(k)} such that((x∗v , λ
∗v , ν
∗, η∗, {(ξ∗)(k)}) is saddle point of Laug over λv ≥ 0n
2 If ((x∗v , λ∗v , ν
∗, η∗, {(ξ∗)(k)}) is saddle point of Laug over λv ≥ 0n, then(x∗v , λ
∗v , ν
∗, η∗) is saddle point of L
Laug is convex-concave in ((xv, λv), (ν, η, {ξ(k)})) over domain λv ≥ 0n
´
Modified Lagrangian
Saddle points of Laug exists implying
min max Laug(xv,λv, ν, η) = max min Laug(xv, λv, ν, η) xv ,λv ≥0n ν,η ν,η xv ,λv ≥0n
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 34 / 38
Correspondence between optima and saddle points
1 If (x∗v , λ∗v , ν
∗, η∗) is saddle point of L, then ∃ {(ξ∗)(k)} such that((x∗v , λ
∗v , ν
∗, η∗, {(ξ∗)(k)}) is saddle point of Laug over λv ≥ 0n
2 If ((x∗v , λ∗v , ν
∗, η∗, {(ξ∗)(k)}) is saddle point of Laug over λv ≥ 0n, then(x∗v , λ
∗v , ν
∗, η∗) is saddle point of L
Laug is convex-concave in ((xv, λv), (ν, η, {ξ(k)})) over domain λv ≥ 0n
´
Modified Lagrangian
Saddle points of Laug exists implying
min max Laug(xv,λv, ν, η) = max min Laug(xv, λv, ν, η) xv ,λv ≥0n ν,η ν,η xv ,λv ≥0n
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 34 / 38
Correspondence between optima and saddle points
1 If (x∗v , λ∗v , ν
∗, η∗) is saddle point of L, then ∃ {(ξ∗)(k)} such that((x∗v , λ
∗v , ν
∗, η∗, {(ξ∗)(k)}) is saddle point of Laug over λv ≥ 0n
2 If ((x∗v , λ∗v , ν
∗, η∗, {(ξ∗)(k)}) is saddle point of Laug over λv ≥ 0n, then(x∗v , λ
∗v , ν
∗, η∗) is saddle point of L
Laug is convex-concave in ((xv, λv), (ν, η, {ξ(k)})) over domain λv ≥ 0n
´
Modified Lagrangian
Saddle points of Laug exists implying
˜ ˜min max max Laug(·) = max min max Laug(·) xv,λv ≥0n ν,η {ξ(k)} ν,η xv ,λv ≥0n {ξ(k)}
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 34 / 38
assuming min-max operator on the right can be interchanged – requires formal proof
Correspondence between optima and saddle points
1 If (x∗v , λ∗v , ν
∗, η∗) is saddle point of L, then ∃ {(ξ∗)(k)} such that((x∗v , λ
∗v , ν
∗, η∗, {(ξ∗)(k)}) is saddle point of Laug over λv ≥ 0n
2 If ((x∗v , λ∗v , ν
∗, η∗, {(ξ∗)(k)}) is saddle point of Laug over λv ≥ 0n, then(x∗v , λ
∗v , ν
∗, η∗) is saddle point of L
Laug is convex-concave in ((xv, λv), (ν, η, {ξ(k)})) over domain λv ≥ 0n
´
Modified Lagrangian
Saddle points of Laug exists implying
˜ ˜min max Laug(·) = max min max Laug(·)ν,η xv ,λv ≥0n ν,η,{ξ(k)} xv ,λv ≥0n {ξ(k)}
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 34 / 38
Correspondence between optima and saddle points
1 If (x∗v , λ∗v , ν
∗, η∗) is saddle point of L, then ∃ {(ξ∗)(k)} such that((x∗v , λ
∗v , ν
∗, η∗, {(ξ∗)(k)}) is saddle point of Laug over λv ≥ 0n
2 If ((x∗v , λ∗v , ν
∗, η∗, {(ξ∗)(k)}) is saddle point of Laug over λv ≥ 0n, then(x∗v , λ
∗v , ν
∗, η∗) is saddle point of L
Laug is convex-concave in ((xv, λv), (ν, η, {ξ(k)})) over domain λv ≥ 0n
´
Modified Lagrangian
Saddle points of Laug exists implying
˜ ˜min max Laug(·) = max min max Laug(·) xv ,λv ≥0n ν,η,{ξ(k)} ν,η xv ,λv ≥0n {ξ(k)}
assuming min-max operator on the right can be interchanged – requires formal proof
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 34 / 38
Correspondence between optima and saddle points
1 If (x∗v , λ∗v , ν
∗, η∗) is saddle point of L, then ∃ {(ξ∗)(k)} such that((x∗v , λ
∗v , ν
∗, η∗, {(ξ∗)(k)}) is saddle point of Laug over λv ≥ 0n
2 If ((x∗v , λ∗v , ν
∗, η∗, {(ξ∗)(k)}) is saddle point of Laug over λv ≥ 0n, then(x∗v , λ
∗v , ν
∗, η∗) is saddle point of L
Laug is convex-concave in ((xv, λv), (ν, η, {ξ(k)})) over domain λv ≥ 0n
´
Modified Lagrangian
Saddle points of Laug exists implying
˜ ˜min max Laug(·) = max min Laug(·) xv ,λv≥0n xv ,λv ≥0nν,η,{ξ(k)} ν,η,{ξ(k)}
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 34 / 38
´
Modified Lagrangian
Saddle points of Laug exists implying
˜ ˜min max Laug(·) = max min Laug(·) xv ,λv≥0n xv ,λv ≥0nν,η,{ξ(k)} ν,η,{ξ(k)}
Correspondence between optima and saddle points
1 If (x ∗ v , λ ∗
v , ν ∗ , η ∗ ) is saddle point of L, then ∃ {(ξ ∗ )(k)} such that ((x ∗
v , λ ∗ v , ν ∗ , η ∗ , {(ξ ∗ )(k)}) is saddle point of Laug over λv ≥ 0n
2 If ((x ∗ v , λ ∗
v , ν ∗ , η ∗ , {(ξ ∗ )(k)}) is saddle point of Laug over λv ≥ 0n, then (x ∗ v , λ ∗
v , ν ∗ , η ∗ ) is saddle point of L
Laug is convex-concave in ((xv, λv), (ν, η, {ξ(k)})) over domain λv ≥ 0n
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 34 / 38
´
When Can Max-Min Operator Be Interchanged?
Theorem Assuming f satisfies technical condition on directions of recession. Max-min operator can be interchanged under either
1 convex-concave objective function f
2 convex-convex objective function f and
quadratic in ξ,
f (x , ξ) = ξ>Qξ + x >Rξ + `(x)
least-squares problem (w/ d = m),
f (x , ξ) = a(ξm − (ξ1:m−1; 1)> x)2
In either case, Laug is convex-concave in variables ((xv, λv), {ξ(k)})
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 35 / 38
´
Distributed Algorithm for Network Optimization
Primal-dual dynamics for Laug is distributed
dxv (k)˜= − PrX (rxv Laug(xv, λv, ν, η, {ξ }))dt dλv ˜= [−rλv Laug(xv, λv, ν, η, {ξ
(k)})]+ λvdt
dν ˜= rν Laug(xv, λv, ν, η, {ξ(k)})dt dη ˜= rη Laug(xv, λv, ν, η, {ξ(k)})dt
dξ(k) ˜= rξ(k) Laug(xv, λv, ν, η, {ξ(k)}), ∀k ∈ {1, . . . , N}
dt
Laug not necessarily strictly convex in (xv, λv), not linear in {ξk }
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 36 / 38
´
Distributed Algorithm for Network Optimization
Primal-dual dynamics for Laug is distributed � �dx i 1 X i i k
X i j i j= rxi gk (x , λ , ξ ) + (η − η ) + (x − x )
dt N k∈Ki j∈Ni h X X� �i+dλi �2 1N (β) i i k i j i j= + rλi gk (x , λ , ξ ) + (ν − ν ) + (λ − λ )
dt n N λi k∈Ki j∈Ni
dν i X = aij (λ
i − λj )dt
j∈Ni
dηi X = aij (x i − x j )
dt j∈Ni
dξk 1 i i k = rξgk (x , λ , ξ ), ∀k ∈ Ki [gk (x , λ, ξ) := f (x , ξ) − λkξ − kk2]dt N
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 36 / 38
´
Illustration
data ξbk = ( wbk , ybk ) ∈ R4 × R: input-output pairs goal: find predictor x ∈ R5 such that x >(w ; 1) ∼ y
quadratic loss f (x , ξ) = (x >(w ; 1) − y)2
dataset: w ∼ N (0, I4), y = (1, 4, 3, 2) ∗ w + v , v uniformly distributed over [−1, 1] each agent 30 i.i.d samples (300 network samples)
9
10
1
2
3
4
5
6
7
8
0 40 80 120 160 200
-2
-1
0
1
2
3
4
5
0 2 4 6 8 10
-2
0
2
4
1 2 3 4 5 6 7 8 9 100
5
10
15
20
25
30
evolution of optimizer estimates relative benefit of cooperation J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 37 / 38
´
Summary
Conclusions
network optimization via primal-dual dynamics
Lyapunov function: distance to saddle-point set + magnitude of vector field
robustness against disturbances, real-time state-triggered implementation, time-varying, data-driven formulations
Current&Future work
distributed regularization for strongly convex-concave formulations and impact on saddle points
robust stability via ISS for general convex optimization
nonconvex scenarios via sequential convex approx.
dynamic ambiguity sets and online data-driven distributionally robust optimization
trigger design for accelerated convergence
J. Cortes (UC San Diego) Distributed Solvers for Network Optimization April 11, 2019 38 / 38