Understanding the Role of Optimism in Minimax Optimization: A Proximal Point Approach

Aryan Mokhtari (UT Austin)

joint work with

Sarath Pattathil (MIT) & Asu Ozdaglar (MIT)

Workshop on Bridging Game Theory & Deep Learning, NeurIPS 2019

Vancouver, Canada, December 14, 2019

1


Introduction

We focus on minimax optimization (also known as the saddle point problem)

min_{x∈ℝ^m} max_{y∈ℝ^n} f(x, y)

Minimax optimization has been studied for a very long time:

Game Theory: [Başar & Olsder, ’99]

Control Theory: [Hast et al. ’13]

Robust Optimization: [Ben-Tal et al., ’09]

Recently popularized to train GANs - [Goodfellow et al., ’14].

2


Training GANs

A game between the discriminator and generator

Can be formulated as a minimax optimization problem.
Nonconvex-nonconcave, since both the discriminator and the generator are neural networks.

3


Algorithms

A simple algorithm to use is the Gradient Descent Ascent (GDA) method:

x_{k+1} = x_k − η ∇_x f(x_k, y_k),    y_{k+1} = y_k + η ∇_y f(x_k, y_k).

The Optimistic Gradient Descent Ascent (OGDA) method is

x_{k+1} = x_k − η ∇_x f(x_k, y_k) − η (∇_x f(x_k, y_k) − ∇_x f(x_{k−1}, y_{k−1}))

y_{k+1} = y_k + η ∇_y f(x_k, y_k) + η (∇_y f(x_k, y_k) − ∇_y f(x_{k−1}, y_{k−1}))

It has been observed that “negative momentum” or “optimism” helps in training GANs ([Gidel et al., ’18], [Daskalakis et al., ’18]).

4
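
To make the two update rules concrete, here is a minimal sketch (not from the slides) of one GDA step and one OGDA step for a generic smooth f; the gradient oracles grad_x and grad_y are assumed to be supplied by the caller.

```python
def gda_step(x, y, grad_x, grad_y, eta):
    """One Gradient Descent Ascent step: descend in x, ascend in y."""
    return x - eta * grad_x(x, y), y + eta * grad_y(x, y)

def ogda_step(x, y, x_prev, y_prev, grad_x, grad_y, eta):
    """One Optimistic GDA step: the extra gradient-difference term is the
    'optimism' (negative momentum) correction."""
    gx, gy = grad_x(x, y), grad_y(x, y)
    gx_prev, gy_prev = grad_x(x_prev, y_prev), grad_y(x_prev, y_prev)
    x_new = x - eta * gx - eta * (gx - gx_prev)
    y_new = y + eta * gy + eta * (gy - gy_prev)
    return x_new, y_new
```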


Impact of Optimism

Consider the following images generated by GANs (from [Daskalakis et al. ’18])

[Figures: GAN samples generated with Adam and with Optimistic Adam]

Adam: similar and lower-quality images.

Optimistic Adam: diverse and higher-quality images.

5


History of Optimism

It is clear that optimism helps achieve faster convergence.

Several results study optimistic methods:

Introduced in [Popov ’80] as a variant of the Extra-Gradient (EG) method

Studied in the context of Online Learning [Rakhlin et al. ’13]

A variant of the Forward-Backward algorithm [Malitsky & Tam, ’19]

Several works analyzing OGDA in special settings: [Daskalakis et al. ’18], [Gidel et al. ’19], [Liang & Stokes ’19], [Mertikopoulos et al. ’19]

We show that optimism or negative momentum helps to approximate the Proximal Point method more accurately!

OGDA can be considered an approximation of the Proximal Point method

6


Simple Example

Consider the following bilinear problem:

min_{x∈ℝ^d} max_{y∈ℝ^d} f(x, y) = x^T y

The solution is (x∗, y∗) = (0, 0).

The Gradient Descent Ascent (GDA) updates for this problem:

x_{k+1} = x_k − η y_k,    y_{k+1} = y_k + η x_k    (here ∇_x f(x_k, y_k) = y_k and ∇_y f(x_k, y_k) = x_k)

where η is the stepsize.

7


GDA

Running GDA on this problem, the iterates satisfy:

‖x_{k+1}‖² + ‖y_{k+1}‖² = (1 + η²)(‖x_k‖² + ‖y_k‖²)

GDA diverges as (1 + η²) > 1

8


OGDA

OGDA updates for the bilinear problem

x_{k+1} = x_k − η (2y_k − y_{k−1}),    y_{k+1} = y_k + η (2x_k − x_{k−1})

(here 2∇_x f(x_k, y_k) − ∇_x f(x_{k−1}, y_{k−1}) = 2y_k − y_{k−1} and 2∇_y f(x_k, y_k) − ∇_y f(x_{k−1}, y_{k−1}) = 2x_k − x_{k−1})

9


Proximal Point

The Proximal Point (PP) updates for the same problem:

x_{k+1} = x_k − η y_{k+1},    y_{k+1} = y_k + η x_{k+1}    (here ∇_x f(x_{k+1}, y_{k+1}) = y_{k+1} and ∇_y f(x_{k+1}, y_{k+1}) = x_{k+1})

where η is the stepsize.

The difference from GDA is that the gradient at the iterate (x_{k+1}, y_{k+1}) is used for the update instead of the gradient at (x_k, y_k).

Although for this problem it takes a simple form:

x_{k+1} = (x_k − η y_k) / (1 + η²),    y_{k+1} = (y_k + η x_k) / (1 + η²)

the Proximal Point method in general involves operator inversion and is not easy to implement.

10


Proximal Point

Running PP on this problem, the iterates satisfy:

‖x_{k+1}‖² + ‖y_{k+1}‖² = (1 / (1 + η²)) (‖x_k‖² + ‖y_k‖²)

PP converges as 1/(1 + η²) < 1

11


OGDA vs Proximal Point

It seems that OGDA approximates the Proximal Point method!

Their convergence paths are very similar

12
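
As a sanity check on the bilinear example from the previous slides, the following minimal NumPy script (an illustration, not part of the talk) runs GDA, OGDA, and the closed-form Proximal Point update on f(x, y) = x^T y from the same starting point; GDA's iterate norm grows, while OGDA and PP both shrink toward the saddle point (0, 0) along similar paths.

```python
import numpy as np

eta, steps = 0.1, 50
rng = np.random.default_rng(0)
x0, y0 = rng.normal(size=3), rng.normal(size=3)

def joint_norm(x, y):
    return np.linalg.norm(np.concatenate([x, y]))

# GDA: x_{k+1} = x_k - eta*y_k,  y_{k+1} = y_k + eta*x_k
x, y = x0.copy(), y0.copy()
for _ in range(steps):
    x, y = x - eta * y, y + eta * x
print("GDA  final norm:", joint_norm(x, y))

# OGDA: x_{k+1} = x_k - eta*(2y_k - y_{k-1}),  y_{k+1} = y_k + eta*(2x_k - x_{k-1})
x_prev, y_prev = x0.copy(), y0.copy()
x, y = x0.copy(), y0.copy()
for _ in range(steps):
    x_new = x - eta * (2 * y - y_prev)
    y_new = y + eta * (2 * x - x_prev)
    x_prev, y_prev, x, y = x, y, x_new, y_new
print("OGDA final norm:", joint_norm(x, y))

# Proximal Point, using the closed form for this problem shown earlier
x, y = x0.copy(), y0.copy()
for _ in range(steps):
    x, y = (x - eta * y) / (1 + eta**2), (y + eta * x) / (1 + eta**2)
print("PP   final norm:", joint_norm(x, y))
```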


Contributions & Outline

View OGDA as an approximation of PP for finding a saddle point

We show the iterates of OGDA are an o(η²) approximation of the PP iterates

We focus on the convex-concave setting in this talk

We use the PP approximation viewpoint to show that for OGDA

The iterates are bounded
Function value converges at a rate of O(1/k)

At the end, we also provide convergence rate results for

Bilinear problems
Strongly convex-strongly concave problems

We revisit the Extra-gradient (EG) method using the same approach

13


Problem

We consider finding the saddle point of the problem:

min_{x∈ℝ^m} max_{y∈ℝ^n} f(x, y)

f is convex in x and concave in y.

f(x, y) is continuously differentiable in x and y.

∇_x f and ∇_y f are Lipschitz in x and y; L denotes the Lipschitz constant.

(x∗, y∗) ∈ ℝ^m × ℝ^n is a saddle point if it satisfies

f(x∗, y) ≤ f(x∗, y∗) ≤ f(x, y∗),

for all x ∈ ℝ^m, y ∈ ℝ^n.

14


Proximal Point

The PP method at each step solves the following:

(x_{k+1}, y_{k+1}) = arg min_{x∈ℝ^m} max_{y∈ℝ^n} { f(x, y) + (1/(2η)) ‖x − x_k‖² − (1/(2η)) ‖y − y_k‖² }.

Using the first order optimality conditions leads to the following update:

x_{k+1} = x_k − η ∇_x f(x_{k+1}, y_{k+1}),    y_{k+1} = y_k + η ∇_y f(x_{k+1}, y_{k+1}).

Theorem (Convergence of Proximal Point)

The iterates generated by Proximal Point satisfy

[max_{y∈D} f(x_k, y) − f⋆] + [f⋆ − min_{x∈D} f(x, y_k)] ≤ (‖x_0 − x∗‖² + ‖y_0 − y∗‖²) / (η k),

where D := {(x, y) | ‖x − x∗‖² + ‖y − y∗‖² ≤ ‖x_0 − x∗‖² + ‖y_0 − y∗‖²}.

15


OGDA updates - How prediction takes place

One way of approximating the Proximal Point update is as follows

∇_x f(x_{k+1}, y_{k+1}) ≈ ∇_x f(x_k, y_k) + (∇_x f(x_k, y_k) − ∇_x f(x_{k−1}, y_{k−1}))

∇_y f(x_{k+1}, y_{k+1}) ≈ ∇_y f(x_k, y_k) + (∇_y f(x_k, y_k) − ∇_y f(x_{k−1}, y_{k−1}))

This leads to the OGDA update

x_{k+1} = x_k − η ∇_x f(x_k, y_k) − η (∇_x f(x_k, y_k) − ∇_x f(x_{k−1}, y_{k−1}))

y_{k+1} = y_k + η ∇_y f(x_k, y_k) + η (∇_y f(x_k, y_k) − ∇_y f(x_{k−1}, y_{k−1}))

16


OGDA vs PP

Given a point (x_k, y_k), let

(x_{k+1}^{PP}, y_{k+1}^{PP}) be the point obtained by performing one PP step from (x_k, y_k),
(x_{k+1}^{OGDA}, y_{k+1}^{OGDA}) be the point obtained by performing one OGDA step from (x_k, y_k).

For a given stepsize η > 0 we have

‖x_{k+1}^{PP} − x_{k+1}^{OGDA}‖ ≤ o(η²),    ‖y_{k+1}^{PP} − y_{k+1}^{OGDA}‖ ≤ o(η²).

The proof follows from the Taylor expansion of ∇_x f(x_{k+1}, y_{k+1}) and ∇_y f(x_{k+1}, y_{k+1}) at the point (x_k, y_k).

17
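
A small numerical illustration of this claim (again not from the slides) on the bilinear problem, where the PP step has the closed form given earlier: (x_k, y_k) is generated from (x_{k−1}, y_{k−1}) by one gradient step, so consecutive iterates are O(η) apart as they would be along the algorithm's path, and the one-step gap between OGDA and PP then shrinks faster than η².

```python
import numpy as np

rng = np.random.default_rng(1)
x_prev, y_prev = rng.normal(size=3), rng.normal(size=3)

for eta in [0.1, 0.05, 0.025, 0.0125]:
    # One GDA step to produce (x_k, y_k) from (x_{k-1}, y_{k-1}).
    x = x_prev - eta * y_prev
    y = y_prev + eta * x_prev

    # One OGDA step from (x_k, y_k), which reuses the previous gradients.
    x_og = x - eta * (2 * y - y_prev)
    y_og = y + eta * (2 * x - x_prev)

    # One exact Proximal Point step from (x_k, y_k) (closed form for x^T y).
    x_pp = (x - eta * y) / (1 + eta**2)
    y_pp = (y + eta * x) / (1 + eta**2)

    gap = np.linalg.norm(np.concatenate([x_pp - x_og, y_pp - y_og]))
    print(f"eta = {eta:<7} one-step gap = {gap:.2e}  gap/eta^2 = {gap / eta**2:.4f}")
```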


Convergence rates

Theorem (Convex-Concave case)

Let the stepsize η satisfy the condition 0 < η ≤ 1/(2L); then the iterates generated by OGDA satisfy

[max_{y∈D} f(x_k, y) − f⋆] + [f⋆ − min_{x∈D} f(x, y_k)] ≤ (‖x_0 − x∗‖² + ‖y_0 − y∗‖²) (8L + 1/(2η)) / k,

where D := {(x, y) | ‖x − x∗‖² + ‖y − y∗‖² ≤ 2(‖x_0 − x∗‖² + ‖y_0 − y∗‖²)}.

OGDA has a convergence rate of O(1/k) (same as PP)

First convergence guarantee for OGDA

18


Proximal Point with error

We study ‘inexact’ versions of the Proximal Point method, given by

x_{k+1} = x_k − η ∇_x f(x_{k+1}, y_{k+1}) + ε_k^x

y_{k+1} = y_k + η ∇_y f(x_{k+1}, y_{k+1}) − ε_k^y

The ‘errors’ ε_k^x and ε_k^y should be such that

The algorithm should be easy to implement.
It retains the convergence properties of PP.

Lemma (3-point equality for Proximal Point with error)

F(z_{k+1})^T (z_{k+1} − z) = (1/(2η)) ‖z_k − z‖² − (1/(2η)) ‖z_{k+1} − z‖² − (1/(2η)) ‖z_{k+1} − z_k‖² + (1/η) ε_k^T (z_{k+1} − z).

Here z = [x; y], F(z) = [∇_x f(x, y); −∇_y f(x, y)], and ε_k = [ε_k^x; −ε_k^y].

19
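
For reference, a short derivation (a sketch, not spelled out on the slides) of the error term that OGDA induces in this template, obtained by matching the OGDA update z_{k+1} = z_k − 2ηF(z_k) + ηF(z_{k−1}) against the inexact PP form above; the result is exactly the ε_k used on the next slide.

```latex
% OGDA in operator form:  z_{k+1} = z_k - 2\eta F(z_k) + \eta F(z_{k-1})
% Inexact PP template:    z_{k+1} = z_k - \eta F(z_{k+1}) + \varepsilon_k
\begin{align*}
\varepsilon_k &= z_{k+1} - z_k + \eta F(z_{k+1})\\
              &= -2\eta F(z_k) + \eta F(z_{k-1}) + \eta F(z_{k+1})\\
              &= \eta\,[\,F(z_{k+1}) - 2F(z_k) + F(z_{k-1})\,].
\end{align*}
```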


Convergence rates

Proof Sketch: (Here z = [x; y] and F(z) = [∇_x f(x, y); −∇_y f(x, y)].)

For OGDA, we have: ε_k = η[F(z_{k+1}) − 2F(z_k) + F(z_{k−1})].

Substitute this error into the 3-point lemma for PP with error to get:

Σ_{k=0}^{N−1} F(z_{k+1})^T (z_{k+1} − z) ≤ (1/(2η)) ‖z_0 − z‖² − (1/(2η)) ‖z_N − z‖² − (L/2) ‖z_N − z_{N−1}‖² + (F(z_N) − F(z_{N−1}))^T (z_N − z).

20


Convergence rates

Proof Sketch (continued):

Bounding the last two terms above, using the Lipschitz continuity of F and ab ≤ (a² + b²)/2, gives:

Σ_{k=0}^{N−1} F(z_{k+1})^T (z_{k+1} − z) ≤ (1/(2η)) ‖z_0 − z‖² − (1/(2η) − L/2) ‖z_N − z‖²

F(z)^T (z − z∗) ≥ 0 for all z  ⇒  the iterates are bounded:

‖z_N − z∗‖² ≤ c ‖z_0 − z∗‖²

Using the convexity-concavity of f(x, y) we have:

[max_{y∈D} f(x_N, y) − f⋆] + [f⋆ − min_{x∈D} f(x, y_N)] ≤ (1/N) Σ_{k=0}^{N−1} F(z_k)^T (z_k − z) ≤ C ‖z_0 − z∗‖² / N

20


Other Approximation of Proximal Point

The Proximal Point updates are:

x_{k+1} = x_k − η ∇_x f(x_{k+1}, y_{k+1}),    y_{k+1} = y_k + η ∇_y f(x_{k+1}, y_{k+1}).

This can also be written as:

x_{k+1} = x_k − η ∇_x f(x_k − η ∇_x f(x_{k+1}, y_{k+1}), y_k + η ∇_y f(x_{k+1}, y_{k+1})),

y_{k+1} = y_k + η ∇_y f(x_k − η ∇_x f(x_{k+1}, y_{k+1}), y_k + η ∇_y f(x_{k+1}, y_{k+1})).

Approximate the inner gradient at (x_{k+1}, y_{k+1}) by the gradient at (x_k, y_k):

x_{k+1} = x_k − η ∇_x f(x_k − η ∇_x f(x_k, y_k), y_k + η ∇_y f(x_k, y_k)),

y_{k+1} = y_k + η ∇_y f(x_k − η ∇_x f(x_k, y_k), y_k + η ∇_y f(x_k, y_k)).

which is nothing but the Extra-gradient Algorithm!

21


EG updates - How prediction takes place

The updates of EG

x_{k+1/2} = x_k − η ∇_x f(x_k, y_k),    y_{k+1/2} = y_k + η ∇_y f(x_k, y_k).

The gradients evaluated at the midpoints x_{k+1/2} and y_{k+1/2} are used to compute the new iterates x_{k+1} and y_{k+1} by performing the updates

x_{k+1} = x_k − η ∇_x f(x_{k+1/2}, y_{k+1/2}),

y_{k+1} = y_k + η ∇_y f(x_{k+1/2}, y_{k+1/2}).

22
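
A minimal sketch of one EG step in the same style as the GDA/OGDA sketch earlier (an illustration, not from the slides; grad_x and grad_y are supplied by the caller).

```python
def eg_step(x, y, grad_x, grad_y, eta):
    """One Extra-Gradient step: a gradient step to the midpoint, then an
    update from the original point using the midpoint gradients."""
    x_mid = x - eta * grad_x(x, y)
    y_mid = y + eta * grad_y(x, y)
    x_new = x - eta * grad_x(x_mid, y_mid)
    y_new = y + eta * grad_y(x_mid, y_mid)
    return x_new, y_new
```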


EG vs PP

Given a point (x_k, y_k), let

(x_{k+1}^{PP}, y_{k+1}^{PP}) be the point we obtain by performing one PP step from (x_k, y_k),
(x_{k+1}^{EG}, y_{k+1}^{EG}) be the point we obtain by performing one EG step from (x_k, y_k).

Then, for a given stepsize η > 0 we have

‖x_{k+1}^{PP} − x_{k+1}^{EG}‖ ≤ o(η²),    ‖y_{k+1}^{PP} − y_{k+1}^{EG}‖ ≤ o(η²).

The proof follows from the Taylor expansion of ∇_x f(x_{k+1}, y_{k+1}) and ∇_y f(x_{k+1}, y_{k+1}), as well as ∇_x f(x_{k+1/2}, y_{k+1/2}) and ∇_y f(x_{k+1/2}, y_{k+1/2}), about the point (x_k, y_k).

23


EG vs PP

EG also does a good job of approximating the Proximal Point method!

24


Convergence rates

Theorem (Convex-Concave case)

For η = σ/L, where σ ∈ (0, 1), the iterates generated by the EG method satisfy

[max_{y∈𝒟} f(x_k, y) − f⋆] + [f⋆ − min_{x∈𝒟} f(x, y_k)] ≤ D L (16 + 3/(32(1−σ²))) / k,

where D := ‖x_0 − x∗‖² + ‖y_0 − y∗‖² and 𝒟 := {(x, y) | ‖x − x∗‖² + ‖y − y∗‖² ≤ (2 + 2/(1−η²L²)) D}.

EG has a convergence rate of O(1/k) (same as Proximal Point)

An O(1/k) rate for the convex-concave case was shown in [Nemirovski ’04] when the feasible set is compact.

[Monteiro & Svaiter ’10] extended this to unbounded sets using a different termination criterion.

Our result shows a convergence rate of O(1/k) in terms of the function value, without assuming compactness.

25


Extensions

Bilinear:

f(x, y) = x^T B y
B ∈ ℝ^{d×d} is a square full-rank matrix.
(x∗, y∗) = (0, 0) is the unique saddle point.

The condition number is defined as κ := λ_max(B^T B) / λ_min(B^T B)

Strongly Convex - Strongly Concave:

f(x, y) is continuously differentiable in x and y.
f is μ_x-strongly convex in x and μ_y-strongly concave in y.
We define μ := min{μ_x, μ_y}.
The condition number is defined as κ := L/μ

26


Convergence rates

Theorem (Bilinear case (OGDA))

Let the stepsize be η = 1/(40 √λ_max(B^T B)); then the iterates generated by the OGDA method satisfy

‖x_{k+1}‖² + ‖y_{k+1}‖² ≤ (1 − 1/(c κ))^k (‖x_0‖² + ‖y_0‖²),

where c is a positive constant independent of the problem parameters.

Theorem (Bilinear case (EG))

Let the stepsize be η = 1/(2 √(2 λ_max(B^T B))); then the iterates generated by the EG method satisfy

‖x_{k+1}‖² + ‖y_{k+1}‖² ≤ (1 − 1/(c κ)) (‖x_k‖² + ‖y_k‖²),

where c is a positive constant independent of the problem parameters.

27
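
A quick numerical check of the OGDA statement (an illustrative script, not from the slides): run OGDA on f(x, y) = x^T B y with a random full-rank B and the stepsize from the theorem; the joint iterate norm should decay geometrically, with a rate that degrades as the condition number κ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
B = rng.normal(size=(d, d))                 # a generic B is full rank with prob. 1
evals = np.linalg.eigvalsh(B.T @ B)
kappa = evals.max() / evals.min()
eta = 1.0 / (40.0 * np.sqrt(evals.max()))   # stepsize from the OGDA theorem above

x_prev = x = rng.normal(size=d)
y_prev = y = rng.normal(size=d)
norm0 = np.linalg.norm(np.concatenate([x, y]))

for _ in range(5000):
    # grad_x f = B y, grad_y f = B^T x; OGDA uses 2*(current) - (previous) gradients.
    x_new = x - eta * (2 * B @ y - B @ y_prev)
    y_new = y + eta * (2 * B.T @ x - B.T @ x_prev)
    x_prev, y_prev, x, y = x, y, x_new, y_new

print(f"kappa = {kappa:.1f}, relative norm after 5000 steps:",
      np.linalg.norm(np.concatenate([x, y])) / norm0)
```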


Convergence rates

Theorem (Strongly convex-Strongly concave case (OGDA))

Let the stepsize be η = 1/(4L); then the iterates generated by OGDA satisfy

‖x_{k+1} − x∗‖² + ‖y_{k+1} − y∗‖² ≤ (1 − 1/(c κ))^k (‖x_0 − x∗‖² + ‖y_0 − y∗‖²),

where c is a positive constant independent of the problem parameters.

Theorem (Strongly convex-Strongly concave case (EG))

Let the stepsize be η = 1/(4L); then the iterates generated by the EG method satisfy

‖x_{k+1} − x∗‖² + ‖y_{k+1} − y∗‖² ≤ (1 − 1/(c κ)) (‖x_k − x∗‖² + ‖y_k − y∗‖²),

where c is a positive constant independent of the problem parameters.

28


Convergence rates

OGDA and EG have an iteration complexity of O(κ log(1/ε))

This is the same iteration complexity as the PP method.
Similar rates were obtained in [Tseng ’95], [Liang & Stokes ’19], [Gidel et al. ’19]

Performance of GDA in these settings

Bilinear: GDA may not converge.
Strongly convex-strongly concave: GDA has an iteration complexity of O(κ² log(1/ε))

29


Generalized OGDA

Using the interpretation that OGDA is an approximation of PP, we can extend this algorithm to a more general set of parameters.

The Generalized OGDA updates are:

x_{k+1} = x_k − α ∇_x f(x_k, y_k) − β (∇_x f(x_k, y_k) − ∇_x f(x_{k−1}, y_{k−1}))

y_{k+1} = y_k + α ∇_y f(x_k, y_k) + β (∇_y f(x_k, y_k) − ∇_y f(x_{k−1}, y_{k−1}))

Note that

If β = 0, this reduces to the GDA updates.
If β = α, this reduces to the standard OGDA updates.

30
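
In the same style as the earlier sketches, a one-step implementation of Generalized OGDA (illustrative, not from the slides; grad_x and grad_y are supplied by the caller). Setting beta = 0 or beta = alpha recovers the two special cases above.

```python
def generalized_ogda_step(x, y, x_prev, y_prev, grad_x, grad_y, alpha, beta):
    """One Generalized OGDA step: alpha scales the gradient term and beta
    scales the optimism (gradient-difference) term."""
    gx, gy = grad_x(x, y), grad_y(x, y)
    gx_prev, gy_prev = grad_x(x_prev, y_prev), grad_y(x_prev, y_prev)
    x_new = x - alpha * gx - beta * (gx - gx_prev)
    y_new = y + alpha * gy + beta * (gy - gy_prev)
    return x_new, y_new
```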


Conclusion

Conclusions and Future Work

Studied optimism through the lens of the Proximal Point method

Analyzed OGDA as an inexact version of the Proximal Point method

Convex-concave, bilinear, strongly convex-strongly concave

Revisited EG as an approximation of the Proximal Point method

This interpretation also aided in the design of Generalized OGDA

Extension to more general settings:

nonconvex-concave
weakly convex-weakly concave

31


References

Thanks!

Convergence Rate of O(1/k) for Optimistic Gradient and Extra-gradient Methods in Smooth Convex-Concave Saddle Point Problems. https://arxiv.org/pdf/1906.01115.pdf

A Unified Analysis of Extra-gradient and Optimistic Gradient Methods for Saddle Point Problems: Proximal Point Approach. https://arxiv.org/pdf/1901.08511.pdf

32