
Page 1: GAMES, DYNAMICS & OPTIMIZATION

Overview · From flows to algorithms · From algorithms to flows · Flows in games · Monotone games · Spurious limits · References

GAMES, DYNAMICS & OPTIMIZATION

Panayotis Mertikopoulos

French National Center for Scientific Research (CNRS)
Laboratoire d’Informatique de Grenoble (LIG)
Criteo AI Lab

CoreLab, NTUA — October 5, 2020

P. Mertikopoulos CNRS & Criteo AI Lab

Page 2: GAMES, DYNAMICS & OPTIMIZATION

Outline

Overview

From flows to algorithms

From algorithms to flows

Flows in games

Monotone games

Spurious limits

Page 3: GAMES, DYNAMICS & OPTIMIZATION

Overview

What is the long-run behavior of first-order methods in optimization / games?

In optimization:
▸ Do first-order (= gradient-based) algorithms converge to critical points?
▸ Are local minimizers selected?

In games:
▸ Do gradient methods converge to Nash equilibrium?
▸ Are all Nash equilibria created equal?

Dynamics: from discrete to continuous and back again


Page 7: GAMES, DYNAMICS & OPTIMIZATION

About

N. Hallak · A. Kavis · Y.-P. Hsieh · V. Cevher · G. Piliouras · C. Papadimitriou · Z. Zhou

▸ M, Papadimitriou & Piliouras, Cycles in adversarial regularized learning, SODA 2018
▸ M & Zhou, Learning in games with continuous action sets and unknown payoff functions, Mathematical Programming, vol. 173, pp. 465–507, Jan. 2019
▸ M, Hallak, Kavis & Cevher, On the almost sure convergence of stochastic gradient descent in non-convex problems, NeurIPS 2020
▸ Hsieh, M & Cevher, The limits of min-max optimization algorithms: convergence to spurious non-critical sets, https://arxiv.org/abs/2006.09065


Page 9: GAMES, DYNAMICS & OPTIMIZATION

Basic problem

minimize_{x∈R^d} f(x)

▸ f non-convex [technical assumptions later]
▸ f unknown / difficult to manipulate in closed form [low-precision methods]
▸ Single-player game: calculate best responses [more in second part]

Page 10: GAMES, DYNAMICS & OPTIMIZATION

Gradient flows

Gradient flow of a function f: R^d → R:

ẋ(t) = −∇f(x(t))   (GF)

Main property: f is a (strict) Lyapunov function for (GF):

d f(x(t))/dt = −∥∇f(x(t))∥² ≤ 0, w/ equality iff ∇f(x) = 0
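As a numerical sanity check on the Lyapunov property above, here is a minimal sketch (not from the slides; the quadratic objective is an assumed toy example): a forward-Euler integration of (GF) for f(x) = ½∥x∥², verifying that f decreases along the flow and that x(t) approaches crit(f) = {0}.

```python
# Minimal sketch: integrate x'(t) = -grad f(x(t)) for f(x) = 0.5*||x||^2
# and check the Lyapunov property df/dt = -||grad f||^2 <= 0 numerically.
import numpy as np

def grad_f(x):
    # gradient of f(x) = 0.5*||x||^2
    return x

def integrate_gf(x0, dt=1e-3, steps=5000):
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = x - dt * grad_f(x)   # one Euler step of (GF)
        traj.append(x.copy())
    return np.array(traj)

traj = integrate_gf([3.0, -4.0])
f_vals = 0.5 * np.sum(traj**2, axis=1)
assert np.all(np.diff(f_vals) <= 1e-12)     # f weakly decreasing along the flow
assert np.linalg.norm(traj[-1]) < 0.1       # x(t) -> crit(f) = {0}
```
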

Page 11: GAMES, DYNAMICS & OPTIMIZATION

Convergence of gradient flows

Blanket assumptions
▸ Lipschitz smoothness:

∥∇f(x′) − ∇f(x)∥ ≤ L∥x′ − x∥ for all x, x′ ∈ R^d   (LS)

▸ Bounded sublevels:

L_c ≡ {x ∈ R^d : f(x) ≤ c} is bounded for all c < sup f   (sub)

Theorem
▸ Assume: (LS), (sub)
▸ Then: x(t) converges to crit(f) ≡ {x∗ ∈ R^d : ∇f(x∗) = 0}

[NB: setwise, not pointwise convergence, cf. Palis Jr. and de Melo, 1982]

Page 12: GAMES, DYNAMICS & OPTIMIZATION

From flows to algorithms: gradient descent

Forward Euler (explicit) ⟹ gradient descent (GD) [Cauchy, 1847]

X_{n+1} = X_n − γ_n ∇f(X_n)   (GD)

[Figure: one step of (GD), from x to x⁺ along −γ∇f(x)]

Page 13: GAMES, DYNAMICS & OPTIMIZATION

From flows to algorithms: proximal gradient

Backward Euler (implicit) ⟹ proximal gradient (PG) [Martinet, 1970]

X_{n+1} = X_n − γ_n ∇f(X_{n+1})   (PG)

[Figure: one step of (PG), from x to x⁺ along −γ∇f(x⁺)]

Page 14: GAMES, DYNAMICS & OPTIMIZATION

From flows to algorithms: extra-gradient

Midpoint Runge-Kutta (explicit) ⟹ extra-gradient (EG) [Korpelevich, 1976]

X_{n+1/2} = X_n − γ_n ∇f(X_n),  X_{n+1} = X_n − γ_n ∇f(X_{n+1/2})   (EG)

[Figure: one step of (EG): a half-step from x to x_lead along −γ∇f(x), then the update from x to x⁺ along −γ∇f(x_lead)]
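The (EG) template above can be sketched in a few lines (a minimal sketch on an assumed quadratic example, not from the slides): a half-step to a leading point, then the actual update from X_n using the gradient evaluated at that leading point.

```python
# Minimal sketch of one extra-gradient step per the (EG) template.
import numpy as np

def eg_step(x, grad, gamma):
    x_lead = x - gamma * grad(x)        # X_{n+1/2}: exploratory half-step
    return x - gamma * grad(x_lead)     # X_{n+1}: update with grad f(X_{n+1/2})

grad = lambda x: 2.0 * x                # f(x) = ||x||^2
x = np.array([1.0, -2.0])
for _ in range(100):
    x = eg_step(x, grad, gamma=0.1)
assert np.linalg.norm(x) < 1e-6         # converges to the minimizer 0
```
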


Page 17: GAMES, DYNAMICS & OPTIMIZATION

Stochastic gradient feedback

In many applications, perfect gradient information is unavailable / too costly:

▸ Machine learning: f(x) = ∑_{i=1}^N f_i(x) and only a batch of ∇f_i(x) is computable per iteration
▸ Control / Engineering: f(x) = E[F(x; ω)] and only ∇F(x; ω) can be observed for a random ω
▸ Game Theory / Bandit Learning: only f(x) is observable

Stochastic first-order oracle (SFO) feedback:

X_n ↦ V_n = ∇f(X_n) + Z_n + b_n   (SFO)
[V_n: feedback · ∇f(X_n): gradient · Z_n: noise · b_n: bias]

where Z_n is “zero-mean” and b_n is “small” (more later)


Page 19: GAMES, DYNAMICS & OPTIMIZATION

Stochastic gradient descent

Noisy Euler (explicit) ⟹ stochastic gradient descent (SGD)

X_{n+1} = X_n − γ_n [∇f(X_n) + W_n]   (SGD)
[W_n: noise]

[Figure: one step of (SGD), from x to x⁺ along −γ[∇f(x) + w]]

Page 20: GAMES, DYNAMICS & OPTIMIZATION

Example: zeroth-order feedback

Given f: R → R, estimate f′(x) at a target point x ∈ R:

f′(x) ≈ [f(x + δ) − f(x − δ)] / (2δ)

Pick u = ±1 with probability 1/2. Then:

E[f(x + δu) u] = ½ f(x + δ) − ½ f(x − δ)

⟹ estimate f′(x) with a single query of f at x̂ = x + δu

Algorithm: Simultaneous perturbation stochastic approximation [Spall, 1992]
1: Draw u uniformly from S^{d−1}
2: Query x̂ = x + δu
3: Get f̂ = f(x̂)
4: Set V = (d/δ) f̂ u
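The SPSA estimator above can be checked empirically; a minimal sketch follows (the quadratic test function is an assumed example). Averaging many single-query estimates V = (d/δ) f(x + δu) u recovers ∇f(x) up to the smoothing bias, which vanishes here because f is quadratic.

```python
# Minimal sketch of the one-point SPSA gradient estimator from the slide:
# draw u uniformly on the unit sphere, query f once, return (d/delta)*f(x+delta*u)*u.
import numpy as np

rng = np.random.default_rng(0)

def spsa_estimate(f, x, delta):
    u = rng.normal(size=x.size)
    u /= np.linalg.norm(u)                       # u ~ uniform on S^{d-1}
    return (x.size / delta) * f(x + delta * u) * u

f = lambda x: float(x @ x)                       # f(x) = ||x||^2, grad f(x) = 2x
x = np.array([1.0, 0.5, -1.0])
V = np.mean([spsa_estimate(f, x, delta=0.5) for _ in range(50000)], axis=0)
assert np.allclose(V, 2 * x, atol=0.25)          # sample mean approximates grad f(x)
```

The single-sample variance scales like 1/δ², which is the usual bias-variance trade-off of one-point zeroth-order estimators.
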


Page 22: GAMES, DYNAMICS & OPTIMIZATION

The Robbins-Monro template

Generalized Robbins-Monro:

X_{n+1} = X_n − γ_n [∇f(X_n) + Z_n + b_n]   (RM)

with ∑_n γ_n = ∞, γ_n → 0, and E[Z_n | X_n, …, X_1] = 0

Examples
▸ Gradient descent (det.): Z_n = 0, b_n = 0
▸ Proximal gradient (det.): Z_n = 0, b_n = ∇f(X_{n+1}) − ∇f(X_n)
▸ Extra-gradient (det.): Z_n = 0, b_n = ∇f(X_{n+1/2}) − ∇f(X_n)
▸ Stochastic gradient descent (stoch.): Z_n zero-mean, b_n = 0
▸ SPSA (stoch.): Z_n = (d/δ) f̂(X_n) U_n − ∇f_δ(X_n), b_n = ∇f_δ(X_n) − ∇f(X_n), where

f_δ(x) = (1 / vol(B_δ)) ∫_{B_δ} f(x + u) du

▸ ⋯
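To make the template concrete, here is a minimal sketch (toy quadratic, assumed for illustration) of (RM) in which plain GD is the instance Z_n = 0, b_n = 0, and SGD is the instance with zero-mean Z_n.

```python
# Minimal sketch of the generalized Robbins-Monro template
# X_{n+1} = X_n - gamma_n * (grad f(X_n) + Z_n + b_n).
import numpy as np

rng = np.random.default_rng(1)
grad = lambda x: 2.0 * x                          # f(x) = ||x||^2

def robbins_monro(x0, noise, bias, n_iter=2000):
    x = np.asarray(x0, dtype=float)
    for n in range(1, n_iter + 1):
        gamma = 1.0 / n                           # sum gamma_n = inf, gamma_n -> 0
        x = x - gamma * (grad(x) + noise(x) + bias(x))
    return x

zero = lambda x: np.zeros_like(x)
gd  = robbins_monro([2.0, -1.0], zero, zero)                          # GD instance
sgd = robbins_monro([2.0, -1.0], lambda x: rng.normal(size=2), zero)  # SGD instance
assert np.linalg.norm(gd) < 1e-3                  # both drift to crit(f) = {0}
assert np.linalg.norm(sgd) < 0.2
```
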


Page 24: GAMES, DYNAMICS & OPTIMIZATION

From algorithms to flows

Basic idea: if γ_n is “small”, the noise washes out and “lim_{t→∞} (RM) = lim_{t→∞} (GF)”

⟹ ODE method of stochastic approximation [Ljung, 1977; Benveniste et al., 1990; Kushner and Yin, 1997; Benaïm, 1999]

▸ Virtual time: τ_n = ∑_{k=1}^n γ_k
▸ Virtual trajectory: X(t) = X_n + [(t − τ_n)/(τ_{n+1} − τ_n)] (X_{n+1} − X_n) for t ∈ [τ_n, τ_{n+1}]
▸ Asymptotic pseudotrajectory (APT):

lim_{t→∞} sup_{0≤h≤T} ∥X(t + h) − Φ_h(X(t))∥ = 0 for all T > 0

where Φ_s(x) denotes the position at time s of an orbit of (GF) starting at x
▸ Long run: X(t) tracks (GF) with arbitrary accuracy over windows of arbitrary length

[Benaïm and Hirsch, 1995, 1996; Benaïm, 1999; Benaïm et al., 2005, 2006]
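The virtual-trajectory construction above can be sketched numerically (a minimal sketch on an assumed noiseless linear example): interpolate the iterates at the virtual times τ_n and compare X(t + h) against the flow Φ_h(X(t)) late in virtual time.

```python
# Minimal sketch: virtual time, affine virtual trajectory, and the APT property
# for noiseless (RM) with grad f(x) = x, whose flow is Phi_h(x) = x * exp(-h).
import numpy as np

gammas = 1.0 / np.arange(2, 502)                      # gamma_n -> 0, sum = inf
taus = np.concatenate([[0.0], np.cumsum(gammas)])     # tau_0 = 0, tau_n = sum of gamma_k

xs = [1.0]                                            # iterates X_{n+1} = X_n (1 - gamma_n)
for g in gammas:
    xs.append(xs[-1] * (1 - g))
xs = np.array(xs)

def X(t):
    """Piecewise-affine virtual trajectory through the points (tau_n, X_n)."""
    return np.interp(t, taus, xs)

phi = lambda x, h: x * np.exp(-h)                     # flow of x' = -x started at x

# Late in virtual time, X(t+h) tracks Phi_h(X(t)) over a window of length T = 1:
t = taus[-1] - 1.0
errs = [abs(X(t + h) - phi(X(t), h)) for h in np.linspace(0.0, 1.0, 50)]
assert max(errs) < 1e-2
```
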


Page 26: GAMES, DYNAMICS & OPTIMIZATION

Stochastic approximation criteria

When is a sequence generated by (RM) an APT?

(A) ▸ X_n is bounded
    ▸ f is Lipschitz continuous and smooth:

∣f(x′) − f(x)∣ ≤ G∥x′ − x∥   (LC)
∥∇f(x′) − ∇f(x)∥ ≤ L∥x′ − x∥   (LS)

(B) ▸ E[∑_n γ_n² ∥Z_n∥²] < ∞
    ▸ sup_n E[∥Z_n∥^q] < ∞ and ∑_n γ_n^{1+q/2} < ∞
    ▸ Z_n sub-Gaussian and γ_n = o(1/log n)

(C) ▸ ∑_n γ_n ∥b_n∥ < ∞ with probability 1

Proposition (Benaïm, 1999; Hsieh, M & Cevher, 2020)
▸ Assume: any of (A); any of (B); (C)
▸ Then: X_n is an APT of (GF) with probability 1

Page 27: GAMES, DYNAMICS & OPTIMIZATION

Convergence of APTs

Theorem (Benaïm and Hirsch, 1995, 1996)
▸ Assume: X_n is a bounded APT of (GF)
▸ Then: X_n converges to crit(f) with probability 1

Theorem (Ljung, 1977; Benaïm, 1999)
▸ Assume: (LC), (LS), (sub); sup_n ∥X_n∥ < ∞
▸ Then: X_n converges (a.s.) to a component of crit(f) on which f is constant

Boundedness: an implicit, algorithm-dependent assumption; non-verifiable!


Page 29: GAMES, DYNAMICS & OPTIMIZATION

Can boundedness be dropped?

Key obstacle: infinite plains of vanishing gradients [think f(x) = −exp(−x²)]

Countered if gradient sublevel sets do not extend to infinity:

M_ε ≡ {x ∈ R^d : ∥∇f(x)∥ ≤ ε} is bounded for some ε > 0   (Gsub)

[standard under regularization]

Proposition (M, Hallak, Kavis & Cevher, 2020)
▸ Assume: (LC), (LS), (sub), (Gsub)
▸ Then: for all ε > 0, there exists some τ = τ(ε) such that, for all t ≥ τ:

(a) f(x(t)) ≤ f(x(0)) − ε; or
(b) x(t) is within ε-distance of crit(f)

In words: (GF) either descends f by a uniform amount, or it is already near-critical


Page 32: GAMES, DYNAMICS & OPTIMIZATION

Can boundedness be dropped?

Proposition
▸ Assume: (LC), (LS), (sub), (Gsub); any of (B); (C)
▸ Then: with probability 1, a subsequence of X_n converges to crit(f)

Theorem (M, Hallak, Kavis & Cevher, 2020)
▸ Assume: (LC), (LS), (sub), (Gsub); any of (B); (C)
▸ Then: with probability 1, X_n converges to a (possibly random) component of crit(f) over which f is constant


Page 34: GAMES, DYNAMICS & OPTIMIZATION

Are all critical points desirable?

Figure: A hyperbolic ridge manifold, typical of ResNet loss landscapes [Li et al., 2018]

Page 35: GAMES, DYNAMICS & OPTIMIZATION

Are traps avoided?

Hyperbolic saddle (isolated non-minimizing critical point):

λ_min(Hess f(x∗)) < 0,  det(Hess f(x∗)) ≠ 0

⟹ (GF) is linearly unstable near x∗
⟹ convergence to x∗ unlikely

Theorem (Pemantle, 1990)
▸ Assume:
    ▸ x∗ is a hyperbolic saddle point
    ▸ Z_n is finite (a.s.) and uniformly exciting:

E[⟨Z, u⟩⁺] ≥ c for all unit vectors u ∈ S^{d−1}, x ∈ R^d

    ▸ γ_n ∝ 1/n
▸ Then: P(lim_{n→∞} X_n = x∗) = 0
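The instability mechanism can be seen on a toy example (a minimal sketch, not from the slides; the saddle f(x, y) = (x² − y²)/2 is an assumed illustration): GD started exactly on the stable manifold {y = 0} converges to the saddle, while the noise in SGD pushes the iterates off it and they escape along the unstable direction.

```python
# Minimal sketch: noise-driven escape from the strict saddle of
# f(x, y) = (x^2 - y^2)/2, whose gradient is (x, -y).
import numpy as np

rng = np.random.default_rng(2)
grad = lambda z: np.array([z[0], -z[1]])

def run(noise_std, n_iter=400, gamma=0.05):
    z = np.array([1.0, 0.0])                  # starts exactly on {y = 0}
    for _ in range(n_iter):
        z = z - gamma * (grad(z) + noise_std * rng.normal(size=2))
    return z

z_det = run(noise_std=0.0)
z_sto = run(noise_std=0.1)
assert np.linalg.norm(z_det) < 1e-6           # deterministic GD stuck at the saddle
assert abs(z_sto[1]) > 1.0                    # SGD escapes along the unstable y-axis
```
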


Page 37: GAMES, DYNAMICS & OPTIMIZATION

Are non-hyperbolic traps avoided?

Strict saddle: λ_min(Hess f(x∗)) < 0

Theorem (Ge et al., 2015)
▸ Given: confidence level ζ > 0
▸ Assume:
    ▸ f is bounded and satisfies (LS)
    ▸ Hess f(x) is Lipschitz continuous
    ▸ for all x ∈ R^d: (a) ∥∇f(x)∥ ≥ ε; or (b) λ_min(Hess f(x)) ≤ −β; or (c) x is δ-close to a local minimum x∗ of f around which f is α-strongly convex
    ▸ Z_n is finite (a.s.) and contains a component uniformly sampled from the unit sphere; also, b_n = 0
    ▸ γ_n ≡ γ with γ = O(1/log(1/ζ))
▸ Then: with probability at least 1 − ζ, SGD produces after O(γ⁻² log(1/(γζ))) iterations a point which is O(√(γ log(1/(γζ))))-close to x∗ (and hence away from any strict saddle)


Page 39: GAMES, DYNAMICS & OPTIMIZATION

Are non-hyperbolic traps avoided always?

Theorem (M, Hallak, Kavis & Cevher, 2020)
▸ Assume:
    ▸ f satisfies (LC) and (LS)
    ▸ Z_n is finite (a.s.) and uniformly exciting:

E[⟨Z, u⟩⁺] ≥ c for all unit vectors u ∈ S^{d−1}, x ∈ R^d

    ▸ γ_n ∝ 1/n^p for some p ∈ (0, 1]
▸ Then: P(X_n converges to a set of strict saddle points) = 0

Proof. Use Pemantle (1990) + differential geometric arguments of Benaïm and Hirsch (1995).


Page 41: GAMES, DYNAMICS & OPTIMIZATION

Single- vs. multi-agent setting

In single-agent optimization, first-order iterative schemes
▸ converge to the problem’s set of critical points
▸ avoid spurious, non-minimizing critical manifolds

Does this intuition carry over to games?

Do multi-agent learning algorithms
▸ converge to unilaterally stable/stationary points?
▸ avoid spurious, non-equilibrium points?


Page 43: GAMES, DYNAMICS & OPTIMIZATION

Online decision processes

Agents called to take repeated decisions with minimal information:

for n ≥ 1 do
    Choose action X_n   [focal agent choice]
    Incur loss ℓ_n(X_n)   [depends on all agents]
end for

Driving question: how to choose “good” actions?
▸ Unknown world: no beliefs, knowledge of the game, etc.
▸ Minimal information: feedback often limited to incurred losses
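The repeated-decision loop above can be sketched as follows (a minimal sketch; the loss mechanism and action ranges are hypothetical, chosen only to illustrate that the agent sees nothing but the incurred scalar loss).

```python
# Minimal sketch of the online decision loop: at each round the focal agent
# picks an action and observes only the resulting loss, which secretly
# depends on the actions of all agents.
import random

random.seed(0)

def play_round(action, other_actions):
    # Unknown to the agent: the loss couples its action with everyone else's.
    return (action - sum(other_actions)) ** 2

losses = []
for _ in range(10):
    X_n = random.uniform(-1, 1)               # focal agent's choice
    others = [random.uniform(-1, 1) for _ in range(2)]
    losses.append(play_round(X_n, others))    # only this scalar is revealed

assert len(losses) == 10 and all(l >= 0 for l in losses)
```
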

Page 44: GAMES, DYNAMICS & OPTIMIZATION

N-player games

The game
▸ Finite set of players i ∈ N = {1, …, N}
▸ Each player selects an action from a closed convex set X_i ⊆ R^{d_i}
▸ Loss of player i given by a loss function ℓ_i: X ≡ ∏_i X_i → R

Examples
▸ Finite games (mixed extensions)
▸ Divisible good auctions (Kelly)
▸ Traffic routing
▸ Power control/allocation problems
▸ Cournot oligopolies
▸ ⋯

Page 45: GAMES, DYNAMICS & OPTIMIZATION

Nash equilibrium

Nash equilibrium: an action profile x∗ = (x∗_1, …, x∗_N) ∈ X that is unilaterally stable:

ℓ_i(x∗_i; x∗_{−i}) ≤ ℓ_i(x_i; x∗_{−i}) for every player i ∈ N and every deviation x_i ∈ X_i

▸ Local Nash equilibrium: local version [stable under local deviations]
▸ Critical point: unilateral stationarity [x∗_i is stationary for ℓ_i(⋅; x∗_{−i})]

Individual loss gradients: V_i(x) = ∇_{x_i} ℓ_i(x_i; x_{−i}) ⟹ individually steepest variation

Variational characterization: if x∗ is a (local) Nash equilibrium, then

⟨V_i(x∗), x_i − x∗_i⟩ ≥ 0 for all i ∈ N, x_i ∈ X_i

Intuition: ℓ_i(x_i; x∗_{−i}) is weakly increasing along all rays emanating from x∗_i
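A concrete check of these definitions (a minimal sketch; the two-player losses below are a hypothetical example, chosen so the equilibrium is at the origin): verify the unilateral-stability inequality and the variational characterization numerically.

```python
# Minimal sketch: a two-player game with losses l1 = x1^2 + x1*x2 and
# l2 = x2^2 - x1*x2, individual gradients V = (2*x1 + x2, 2*x2 - x1),
# and Nash equilibrium x* = (0, 0).
import numpy as np

l1 = lambda x1, x2: x1**2 + x1 * x2
l2 = lambda x1, x2: x2**2 - x1 * x2
V = lambda x1, x2: np.array([2 * x1 + x2, 2 * x2 - x1])
x_star = (0.0, 0.0)

deviations = np.linspace(-2.0, 2.0, 101)
# Unilateral stability: no player gains by deviating alone.
assert all(l1(*x_star) <= l1(d, x_star[1]) for d in deviations)
assert all(l2(*x_star) <= l2(x_star[0], d) for d in deviations)
# Variational characterization: <V_i(x*), x_i - x_i*> >= 0 for all deviations.
assert all(V(*x_star)[i] * (d - x_star[i]) >= 0 for d in deviations for i in (0, 1))
```
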


Page 48: GAMES, DYNAMICS & OPTIMIZATION

Geometric interpretation

[Figure: the feasible region X with the tangent cone TC(x∗) and normal cone NC(x∗) at x∗, and the individual descent direction −V(x∗)]

At Nash equilibrium, individual descent directions are outward-pointing

Page 49: GAMES, DYNAMICS & OPTIMIZATION

First-order algorithms in games

Individual gradient field: V(x) = (V_1(x), …, V_N(x)), x = (x_1, …, x_N)

▸ Individual gradient descent:

X_{n+1} = X_n − γ_n V(X_n)

▸ Extra-gradient:

X_{n+1/2} = X_n − γ_n V(X_n),  X_{n+1} = X_n − γ_n V(X_{n+1/2})

▸ ⋯

Mean dynamics:

ẋ(t) = −V(x(t))   (MD)

⟹ no longer a gradient system


Page 51: GAMES, DYNAMICS & OPTIMIZATION

The dynamics of min-max games

Bilinear min-max games (saddle-point problems):

min_{x_1∈X_1} max_{x_2∈X_2} L(x_1, x_2) = (x_1 − b_1)⊺ A (x_2 − b_2)   (SP)

[no constraints: X_1 = R^{d_1}, X_2 = R^{d_2}]

Mean dynamics:

ẋ_1 = −A(x_2 − b_2),  ẋ_2 = A⊺(x_1 − b_1)

Energy function:

E(x) = ½∥x_1 − b_1∥² + ½∥x_2 − b_2∥²

Lyapunov property: dE/dt ≤ 0, w/ equality if A = A⊺

⟹ distance to solutions (weakly) decreasing along (MD)
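The scalar case L(x_1, x_2) = x_1 x_2 (i.e., A = 1, b = 0) makes the conservation phenomenon easy to check numerically; a minimal sketch follows: the mean dynamics are a Hamiltonian rotation, and a fine Runge-Kutta integration shows the energy stays constant, so trajectories cycle instead of converging.

```python
# Minimal sketch: RK4 integration of the mean dynamics x1' = -x2, x2' = x1
# for L(x1, x2) = x1*x2, checking that E = (x1^2 + x2^2)/2 is conserved.
import numpy as np

def rk4_step(z, dt):
    f = lambda z: np.array([-z[1], z[0]])    # (MD) for the bilinear game
    k1 = f(z); k2 = f(z + dt / 2 * k1); k3 = f(z + dt / 2 * k2); k4 = f(z + dt * k3)
    return z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

z = np.array([1.0, 0.0])
E0 = 0.5 * z @ z
for _ in range(10000):                       # integrate up to t = 100
    z = rk4_step(z, dt=0.01)
assert abs(0.5 * z @ z - E0) < 1e-6          # energy conserved: the flow cycles
```
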


Page 53: GAMES, DYNAMICS & OPTIMIZATION

Cycles

Roadblock: the energy might be a constant of motion [Hofbauer et al., 2009]

Figure: Hamiltonian flow of L(x_1, x_2) = x_1 x_2

Page 54: LetterSpace=2GAMES, DYNAMICS & OPTIMIZATION

36/51

Overview From flows to algorithms From algorithms to flows Flows in games Monotone games Spurious limits References

Poincaré recurrence

Definition (Poincaré, 1890’s)A dynamical system is Poincaré recurrent if almost all solution trajectories returninfinitely close to their starting point infinitely many times

Theorem (M, Papadimitriou, Piliouras, 2018; unconstrained version)(MD) is Poincaré recurrent in all bilinear min-max games that admit an equilibrium


Learning in min-max games: gradient descent

Individual gradient descent:

X_{n+1} = X_n − γ_n V(X_n)

Energy no longer a constant:

∥X_{n+1} − x*∥² = ∥X_n − x*∥² − 2γ_n ⟨V(X_n), X_n − x*⟩ + γ_n² ∥V(X_n)∥²

where the inner-product term vanishes (= 0 from (MD)) and the last term is the discretization error

…even worse
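On the prototypical bilinear game L(x1, x2) = x1x2, with V(x1, x2) = (x2, −x1), the effect is exact: a direct computation gives ∥X_{n+1}∥² = (1 + γ²)∥X_n∥², so the iterates spiral outward. A minimal sketch (an illustration, not from the slides):

```python
# Individual gradient descent on L(x1, x2) = x1*x2 with constant step gamma.
# The squared distance to the equilibrium (0, 0) grows by (1 + gamma^2)
# at every step: the discretization error pushes the iterates outward.
gamma = 0.1
x1, x2 = 1.0, 0.0
steps = 100
for _ in range(steps):
    v1, v2 = x2, -x1                           # V(X_n)
    x1, x2 = x1 - gamma * v1, x2 - gamma * v2  # X_{n+1} = X_n - gamma V(X_n)
growth = x1 * x1 + x2 * x2  # = (1 + gamma**2) ** steps, starting from norm 1
```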


Learning in min-max games: extra-gradient

Extra-gradient:

X_{n+1/2} = X_n − γ_n V(X_n)    X_{n+1} = X_n − γ_n V(X_{n+1/2})
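On the same bilinear game L(x1, x2) = x1x2, the look-ahead step evaluates V at an extrapolated point and the squared norm contracts by (1 − γ² + γ⁴) per iteration: an inward spiral instead of an outward one. A minimal sketch (an illustration, not from the slides):

```python
# Extra-gradient on L(x1, x2) = x1*x2: leading step, then update with the
# vector field evaluated at the extrapolated point X_{n+1/2}.
gamma = 0.1

def V(x1, x2):
    return (x2, -x1)

x1, x2 = 1.0, 0.0
steps = 1000
for _ in range(steps):
    v1, v2 = V(x1, x2)
    y1, y2 = x1 - gamma * v1, x2 - gamma * v2   # leading step X_{n+1/2}
    w1, w2 = V(y1, y2)
    x1, x2 = x1 - gamma * w1, x2 - gamma * w2   # update X_{n+1}
shrink = x1 * x1 + x2 * x2  # = (1 - gamma**2 + gamma**4) ** steps, -> 0
```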


Learning in min-max games

Long-run behavior of min-max learning algorithms:

▸ Mean dynamics: Poincaré recurrent (periodic orbits)
✗ Individual gradient descent: divergence (outward spirals)
✓ Extra-gradient: convergence (inward spirals)

Different outcomes despite the same underlying dynamics!


Monotonicity and strict monotonicity

Bilinear games are special cases of monotone games:

⟨V(x′) − V(x), x′ − x⟩ ≥ 0 for all x, x′ ∈ X   (MC)

[⟹ strictly monotone if (MC) is strict whenever x ≠ x′]

Equivalently: H(x) ≽ 0, where H is the game's Hessian matrix:

H_ij(x) = (1/2) ∇_{x_j}∇_{x_i} ℓ_i(x) + (1/2) (∇_{x_i}∇_{x_j} ℓ_j(x))ᵀ

Examples: bilinear games (not strict), Kelly auctions, Cournot markets, routing, …

Nomenclature:
▸ Diagonal strict convexity [Rosen, 1965]
▸ Stable games [Hofbauer and Sandholm, 2009]
▸ Contractive games [Sandholm, 2015]
▸ Dissipative games [Sorin and Wan, 2016]
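The two regimes of (MC) can be checked numerically. In the sketch below (an illustration, not from the slides), the monotonicity gap of a bilinear vector field vanishes identically (monotone but not strict), while adding a small quadratic regularizer, a hypothetical modification chosen here for contrast, makes the gap strictly positive.

```python
import random

# Monotonicity gap <V(x') - V(x), x' - x> for two 2D vector fields:
# a bilinear one (gap identically 0) and a regularized one (gap > 0).
a, alpha = 2.0, 0.5  # arbitrary coefficients for this sketch

def V_bilinear(z):
    x1, x2 = z
    return (a * x2, -a * x1)             # V for L = a*x1*x2

def V_regularized(z):
    x1, x2 = z                           # hypothetical strictly monotone field
    return (a * x2 + alpha * x1, -a * x1 + alpha * x2)

def gap(V, z, w):
    dv = tuple(p - q for p, q in zip(V(w), V(z)))
    dz = tuple(p - q for p, q in zip(w, z))
    return sum(p * q for p, q in zip(dv, dz))

random.seed(0)
pairs = [(tuple(random.uniform(-3, 3) for _ in range(2)),
          tuple(random.uniform(-3, 3) for _ in range(2))) for _ in range(100)]
max_bilinear_gap = max(abs(gap(V_bilinear, z, w)) for z, w in pairs)
min_strict_gap = min(gap(V_regularized, z, w) for z, w in pairs)
# max_bilinear_gap ~ 0; min_strict_gap = alpha * min ||w - z||^2 > 0
```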


Convergence to equilibrium

Different behavior under strict monotonicity:

∥X_{n+1} − x*∥² = ∥X_n − x*∥² − 2γ_n ⟨V(X_n), X_n − x*⟩ + γ_n² ∥V(X_n)∥²

where ⟨V(X_n), X_n − x*⟩ > 0 if X_n is not Nash, and the last term is the discretization error

Can the drift overcome the discretization error?

Theorem (M & Zhou, 2019)
▸ Assume: strict monotonicity; any of (A); any of (B); (C)
▸ Then: any generalized Robbins-Monro learning algorithm converges to the game's (unique) Nash equilibrium with probability 1

In strictly monotone games, gradient methods ↝ Nash equilibrium
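A minimal sketch of this statement on a concrete strictly monotone game (a hypothetical example chosen for illustration, not from the slides): L(x1, x2) = x1x2 + (1/2)x1² − (1/2)x2², whose unique Nash equilibrium is the origin, run with noisy gradients and Robbins-Monro step sizes γ_n = 1/(n+1).

```python
import math
import random

# Noisy individual gradient steps with vanishing step sizes on a strictly
# monotone game; the drift dominates the discretization error and the
# iterates converge to the unique Nash equilibrium at (0, 0).
random.seed(0)
sigma = 0.01                  # noise level of the gradient oracle

x1, x2 = 1.0, 1.0
for n in range(2000):
    gamma = 1.0 / (n + 1)     # Robbins-Monro step size
    v1 = x2 + x1 + sigma * random.gauss(0, 1)    # noisy V_1 = dL/dx1
    v2 = -x1 + x2 + sigma * random.gauss(0, 1)   # noisy V_2 = -dL/dx2
    x1, x2 = x1 - gamma * v1, x2 - gamma * v2
final_dist = math.hypot(x1, x2)  # close to the Nash equilibrium (0, 0)
```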


Outline

Overview

From flows to algorithms

From algorithms to flows

Flows in games

Monotone games

Spurious limits


Almost bilinear games

Consider the "almost bilinear" game

min_{x1∈X1} max_{x2∈X2} L(x1, x2) = x1x2 + εϕ(x2)

where ε > 0 and ϕ(x) = (1/2)x² − (1/4)x⁴

Properties:
▸ Unique critical point at the origin
▸ Not Nash; unstable under (MD)
▸ (MD) attracted to a unique, stable limit cycle from almost all initial conditions

[Hsieh, M & Cevher, 2020]
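These properties can be reproduced numerically. The sketch below assumes the perturbation enters through the maximizer's variable, so that the origin is unstable as stated, i.e. L(x1, x2) = x1x2 + εϕ(x2) with ϕ(x) = x²/2 − x⁴/4; the value ε = 0.5 is an arbitrary choice.

```python
import math

# (MD) for the almost bilinear game: dx1 = -x2, dx2 = x1 + eps*(x2 - x2**3).
# Trajectories leave the (unstable) origin and settle on a stable limit cycle.
eps = 0.5  # arbitrary perturbation strength for this sketch

def field(z):
    x1, x2 = z
    return (-x2, x1 + eps * (x2 - x2 ** 3))

def rk4_step(z, h):
    shift = lambda u, v, s: tuple(a + s * b for a, b in zip(u, v))
    k1 = field(z)
    k2 = field(shift(z, k1, h / 2))
    k3 = field(shift(z, k2, h / 2))
    k4 = field(shift(z, k3, h))
    return tuple(z[i] + h / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
                 for i in range(2))

def run(z, steps=10000, h=0.01):
    for _ in range(steps):
        z = rk4_step(z, h)
    return math.hypot(*z)

r_inside = run((0.01, 0.0))   # escapes the unstable origin
r_outside = run((3.0, 0.0))   # pulled inward toward the cycle
# both trajectories end up on the same bounded cycle, away from the origin
```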


Spurious limits in almost bilinear games

Trajectories of (RM) converge to a spurious cycle that contains no critical points

Figure: Left: (MD); center: SGD; right: stochastic extra-gradient (SEG)


Forsaken solutions

Another almost bilinear game

min_{x1∈X1} max_{x2∈X2} L(x1, x2) = x1x2 + ε[ϕ(x1) − ϕ(x2)]

where ε > 0 and ϕ(x) = (1/4)x² − (1/2)x⁴ + (1/6)x⁶

Properties:
▸ Unique critical point at the origin
▸ Local Nash equilibrium; stable under (MD)
▸ Two isolated periodic orbits:
  ▸ One unstable, shielding the equilibrium, but small
  ▸ One stable, attracting all trajectories of (MD) outside a small basin

[Hsieh, M & Cevher, 2020]
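A numerical sketch of this phase portrait, assuming ϕ(x) = x²/4 − x⁴/2 + x⁶/6 (the coefficients as best recoverable from the slide) and an arbitrary ε = 0.1: trajectories started inside the small shield converge to the equilibrium, while trajectories started outside settle on the outer stable periodic orbit, forsaking it.

```python
import math

# (MD) for the "forsaken" game: dx1 = -x2 - eps*phi'(x1),
# dx2 = x1 - eps*phi'(x2), with phi'(x) = x/2 - 2*x**3 + x**5.
eps = 0.1  # arbitrary small perturbation strength for this sketch

def dphi(x):
    return x / 2 - 2 * x ** 3 + x ** 5

def field(z):
    x1, x2 = z
    return (-x2 - eps * dphi(x1), x1 - eps * dphi(x2))

def rk4_step(z, h):
    shift = lambda u, v, s: tuple(a + s * b for a, b in zip(u, v))
    k1 = field(z)
    k2 = field(shift(z, k1, h / 2))
    k3 = field(shift(z, k2, h / 2))
    k4 = field(shift(z, k3, h))
    return tuple(z[i] + h / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
                 for i in range(2))

def run(z, steps=20000, h=0.01):
    for _ in range(steps):
        z = rk4_step(z, h)
    return math.hypot(*z)

r_near = run((0.1, 0.1))  # inside the shield: converges to the equilibrium
r_far = run((2.0, 0.0))   # outside: settles on the stable periodic orbit
```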


Forsaken solutions in almost bilinear games

With high probability, (RM) forsakes the game's unique (local) equilibrium

Figure: Left: (MD); center: SGD; right: SEG


The limits of gradient-based learning in games

Limit cycles ⟹ internally chain transitive (ICT) sets = invariant, no proper attractors

Examples of ICT sets:
▸ V = ∇ℓ ⟹ components of critical points
▸ L(x1, x2) = x1x2 ⟹ any annular region centered on (0, 0)
▸ Almost bilinear ⟹ isolated periodic orbits + unique stationary point

Theorem (Hsieh, M & Cevher, 2020)
▸ Assume: any of (A); any of (B); (C)
▸ Then:
  ▸ Xn converges to an ICT set of (MD) with probability 1
  ▸ (RM) converges to attractors of (MD) with arbitrarily high probability


Conclusions

In contrast to single-agent problems (optimization), game-theoretic learning:
▸ May have limit points that are neither stable nor stationary
▸ Cannot avoid spurious, non-equilibrium points with positive probability
▸ Calls for a different approach (mixed-strategy learning, multiple timescales, …)

What about finite games?
▸ Limit cycles may still appear
▸ Which Nash equilibria are stable under no-regret learning?

[stay tuned to CoreLab FM]


References I

M. Benaïm. Dynamics of stochastic approximation algorithms. In J. Azéma, M. Émery, M. Ledoux, and M. Yor, editors, Séminaire de Probabilités XXXIII, volume 1709 of Lecture Notes in Mathematics, pages 1–68. Springer Berlin Heidelberg, 1999.

M. Benaïm and M. W. Hirsch. Dynamics of Morse-Smale urn processes. Ergodic Theory and Dynamical Systems, 15(6):1005–1030, December 1995.

M. Benaïm and M. W. Hirsch. Asymptotic pseudotrajectories and chain recurrent flows, with applications. Journal of Dynamics and Differential Equations, 8(1):141–176, 1996.

M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348, 2005.

M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions, part II: Applications. Mathematics of Operations Research, 31(4):673–695, 2006.

A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer, 1990.

R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points — Online stochastic gradient for tensor decomposition. In COLT '15: Proceedings of the 28th Annual Conference on Learning Theory, 2015.

J. Hofbauer and W. H. Sandholm. Stable games and their dynamics. Journal of Economic Theory, 144(4):1665–1693, July 2009.

J. Hofbauer, S. Sorin, and Y. Viossat. Time average replicator and best reply dynamics. Mathematics of Operations Research, 34(2):263–269, May 2009.


References II

Y.-P. Hsieh, P. Mertikopoulos, and V. Cevher. The limits of min-max optimization algorithms: Convergence to spurious non-critical sets. https://arxiv.org/abs/2006.09065, 2020.

G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Èkonom. i Mat. Metody, 12:747–756, 1976.

H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Applications. Springer-Verlag, New York, NY, 1997.

H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS '18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.

L. Ljung. Analysis of recursive stochastic algorithms. IEEE Trans. Autom. Control, 22(4):551–575, August 1977.

B. Martinet. Régularisation d'inéquations variationnelles par approximations successives. ESAIM: Mathematical Modelling and Numerical Analysis, 4(R3):154–158, 1970.

P. Mertikopoulos and Z. Zhou. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465–507, January 2019.

P. Mertikopoulos, C. H. Papadimitriou, and G. Piliouras. Cycles in adversarial regularized learning. In SODA '18: Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, 2018.


References III

P. Mertikopoulos, N. Hallak, A. Kavis, and V. Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems. In NeurIPS '20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020.

J. Palis Jr. and W. de Melo. Geometric Theory of Dynamical Systems. Springer-Verlag, 1982.

R. Pemantle. Nonconvergence to unstable points in urn models and stochastic approximations. Annals of Probability, 18(2):698–712, April 1990.

R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.

J. B. Rosen. Existence and uniqueness of equilibrium points for concave N-person games. Econometrica, 33(3):520–534, 1965.

W. H. Sandholm. Population games and deterministic evolutionary dynamics. In H. P. Young and S. Zamir, editors, Handbook of Game Theory IV, pages 703–778. Elsevier, 2015.

S. Sorin and C. Wan. Finite composite games: Equilibria and dynamics. Journal of Dynamics and Games, 3(1):101–120, January 2016.

J. C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Autom. Control, 37(3):332–341, March 1992.
