
Stochastic Games (Part I): Policy Improvement in Discounted (Noncompetitive) Markov Decision Processes

Paul Varkey

Multi Agent Systems Group, Department of Computer Science, UIC

4th Annual Graduate Student Probability Conference, Apr 30, 2010

Duke University, Durham, NC


Outline

1 The Model (definitions, notations and the problem statement)

2 Basic Theorems

3 The Algorithm

4 An Example


References

BLACKWELL, D. (1962): “Discrete Dynamic Programming”, The Annals of Mathematical Statistics, Vol. 33, No. 2 (Jun. 1962), pp. 719–726.

FILAR, J.A. and VRIEZE, O.J. (1996): “Competitive Markov Decision Processes: Theory, Algorithms, and Applications”, Springer-Verlag, New York, 1996.


Decision Processes, States and Actions

A decision process is a discrete stochastic process observed at discrete time points t = 0, 1, 2, 3, ... (called stages), at each of which a decision-maker (or controller) chooses an action.

S denotes the state space – the process may be in one of finitely many states

A denotes the action space – the action is chosen from among finitely many actions

Choosing action a in state s results in

(i) an immediate reward r(s, a)
(ii) a probabilistic transition to a state s′ given by p(s′|s, a)

The transition law p(s′|s, a) depends only on the current state and action – the (stationary) Markov assumption.


Decision Rules, Strategies, Transitions & Rewards

A decision rule f : S → A is a function that specifies the action that a controller chooses in a given state

Each decision rule f defines
- a transition probability matrix P(f), whose (s, s′)-th entry is given by P(f)[s, s′] = p(s′|s, f(s))
- a reward vector r(f), whose s-th entry is given by r(f)[s] = r(s, f(s))

A (Markov) strategy π is a sequence of decision rules {f_n, n = 0, 1, ...} such that f_n is used by the controller at stage n

A stationary strategy is a stage-independent strategy, i.e., one that uses the same decision rule f at every stage; it will be denoted f^∞
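
As a concrete illustration (a minimal NumPy sketch; the function name induced_matrices and the array layout p[s, a, s′], r[s, a] are my own choices, and the numbers are the two-state example that appears at the end of the talk), a decision rule f simply selects one row per state from the tables p(s′|s, a) and r(s, a) to form P(f) and r(f):

    # Building P(f) and r(f) for a decision rule f, given tables p(s'|s,a) and r(s,a).
    import numpy as np

    # p[s, a, s'] = probability of moving to s' when action a is chosen in state s
    p = np.array([[[0.6, 0.4],    # state s1, action a1
                   [1.0, 0.0]],   # state s1, action a2
                  [[0.6, 0.4],    # state s2, action a1
                   [0.0, 1.0]]])  # state s2, action a2
    # r[s, a] = immediate reward for choosing action a in state s
    r = np.array([[ 1.0, 0.0],
                  [-1.0, 0.0]])

    def induced_matrices(f):
        """Return the transition matrix P(f) and reward vector r(f)
        for a decision rule f, where f[s] is the action chosen in state s."""
        states = np.arange(len(f))
        P_f = p[states, f, :]   # row s is p(.|s, f(s))
        r_f = r[states, f]      # entry s is r(s, f(s))
        return P_f, r_f

    P_f, r_f = induced_matrices(np.array([0, 1]))   # the decision rule (a1, a2)
    print(P_f)   # [[0.6 0.4], [0.  1. ]]
    print(r_f)   # [1. 0.]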


Decision Rules, Strategies, Transitions & Rewards

Each strategy π defines a t-stage transition probability matrix as

P_t(π) := P(f_0)P(f_1)···P(f_{t−1}) for t = 1, 2, ...

where P_0(π) := I_N, the N × N identity matrix (N = |S|)

Associate with each strategy π the β-discounted value vector (β ∈ [0, 1) is the discount factor)

φ_β(π) = Σ_{t=0}^∞ β^t P_t(π) r(f_t)
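
For a stationary strategy f^∞ we have P_t(f^∞) = P(f)^t, so the series above is a Neumann series and sums to φ_β(f^∞) = [I − βP(f)]^{-1} r(f) – the formula used in Step 0 of the algorithm later in the talk. A minimal evaluation sketch (NumPy; P_f and r_f are those of the decision rule (a1, a2) from the example at the end of the talk):

    # beta-discounted value of a stationary strategy:
    # phi = sum_t beta^t P(f)^t r(f) = (I - beta P(f))^{-1} r(f), valid for 0 <= beta < 1.
    import numpy as np

    beta = 0.9
    P_f = np.array([[0.6, 0.4],
                    [0.0, 1.0]])
    r_f = np.array([1.0, 0.0])

    phi = np.linalg.solve(np.eye(2) - beta * P_f, r_f)
    print(phi)    # ~ [2.1739, 0.]

    # Sanity check: a long truncation of the series gives (numerically) the same vector.
    approx = sum(beta**t * np.linalg.matrix_power(P_f, t) @ r_f for t in range(500))
    print(approx)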


≥ Ordering for Value Vectors and Strategies

≥ Ordering Value Vectors

Denote u ≥ w whenever [u]_s ≥ [w]_s for all s ∈ S

Denote u > w whenever u ≥ w and u ≠ w

≥ Ordering Strategies

π_1 ≥ π_2 iff φ_β(π_1) ≥ φ_β(π_2)


Optimal Strategies

π^0 is an optimal strategy if

π^0 ≥ π for all π

or, in other words, π^0 “maximizes” φ_β(π)


The Problem

Given an MDP, find an optimal strategy


The L(·) operator

Let π^+ := {f_1, f_2, ...} (where π = {f_0, f_1, f_2, ...})
- Observe that

φ_β(π) = r(f_0) + βP(f_0) Σ_{t=1}^∞ β^{t−1} P_{t−1}(π^+) r(f_t) = r(f_0) + βP(f_0) φ_β(π^+)

For every decision rule g, we define the associated operator L(g) : R^{|S|} → R^{|S|} as

L(g)(u) := r(g) + βP(g)u ,  u ∈ R^{|S|}

- In this notation, φ_β(f, π) = L(f)(φ_β(π)), where (f, π) denotes the strategy that uses f at stage 0 and follows π thereafter; more generally,

φ_β(f_0, f_1, ..., f_{t−1}, π) = L(f_0)L(f_1)···L(f_{t−1})(φ_β(π))

- L(g) is a monotone operator:

u ≥ w ⇒ L(g)(u) ≥ L(g)(w)
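
A small sketch of L(g) as code (NumPy; the numbers are again the decision rule (a1, a2) from the talk's example). It also illustrates a fact used implicitly when letting T → ∞ in the proofs below: since 0 ≤ β < 1, iterating L(f) from any starting vector converges to φ_β(f^∞), the unique fixed point of L(f).

    # The operator L(g)(u) = r(g) + beta * P(g) u, and iteration to its fixed point.
    import numpy as np

    beta = 0.9
    P_g = np.array([[0.6, 0.4],
                    [0.0, 1.0]])
    r_g = np.array([1.0, 0.0])

    def L(u):
        """One application of L(g): u -> r(g) + beta * P(g) u."""
        return r_g + beta * P_g @ u

    u = np.zeros(2)          # any starting vector works; L(g) is a beta-contraction
    for _ in range(300):
        u = L(u)
    print(u)                 # ~ [2.1739, 0.] = phi_beta(f^inf) for f = (a1, a2)

    # Monotonicity: if u >= w entrywise then L(u) >= L(w) entrywise,
    # because beta * P(g) has nonnegative entries.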


Theorem 1

Let π^0 = {f_n^0, n = 0, 1, ...}. If π^0 ≥ (f, π^0) for every decision rule f, then π^0 is optimal.

Proof.

Let π = {f_n, n = 0, 1, ...} be an arbitrary (Markov) strategy.

π^0 ≥ (f, π^0) ⇒ φ_β(π^0) ≥ φ_β(f, π^0) = L(f)(φ_β(π^0)) for every decision rule f

Using the monotonicity of the L(f) operators we obtain, for example,
φ_β(π^0) ≥ L(f_{T−1})(φ_β(π^0)) ≥ L(f_{T−1})L(f_T)(φ_β(π^0))

Iterating this argument through the first (T + 1) decision rules of π gives, for every nonnegative integer T,
φ_β(π^0) ≥ L(f_0)L(f_1)···L(f_T)(φ_β(π^0))
         = φ_β(f_0, f_1, ..., f_T, π^0)
         = Σ_{t=0}^T β^t P_t(π) r(f_t) + Σ_{t=T+1}^∞ β^t P_t(f_0, ..., f_T, π^0) r(f^0_{t−T−1})

Letting T → ∞, the last term vanishes (the rewards are bounded and 0 ≤ β < 1), leading to
φ_β(π^0) ≥ Σ_{t=0}^∞ β^t P_t(π) r(f_t) = φ_β(π)

Since π was arbitrary, the proof is complete.


Theorem 2

Let π = {f_n, n = 0, 1, ...} be an arbitrary (Markov) strategy and f a decision rule such that (f, π) > π. Then f^∞ > π.

Proof.

(f, π) > π ⇒ φ_β(f, π) = L(f)(φ_β(π)) > φ_β(π)

As before, monotonicity of L(f) and repeated application of the above inequality give, for every T,
φ_β(f, f, ..., f, π) = L^T(f)(φ_β(π)) ≥ L(f)(φ_β(π)) = φ_β(f, π) > φ_β(π)
(here L^T(f) denotes T successive applications of L(f), i.e., T copies of f precede π)

Letting T → ∞, the contribution of the tail strategy π is of order β^T and vanishes, yielding φ_β(f^∞) = φ_β(f, f, f, ...) ≥ φ_β(f, π) > φ_β(π)

⇒ f^∞ > π


Policy Improvement Algorithm

Step 0. (Initialization) Set k := 0. Select any pure strategy (decision rule) f. Set f_0 := f and φ^0 := φ_β(f_0^∞) = [I − βP(f_0)]^{−1} r(f_0)

Step 1. (Check optimality) Let a_s^k be the action selected by f_k in state s. If the “optimality equation”

r(s, a_s^k) + β Σ_{s′=1}^N p(s′|s, a_s^k) φ^k(s′) = max_{a∈A} { r(s, a) + β Σ_{s′=1}^N p(s′|s, a) φ^k(s′) }     (⋆)

holds for each s ∈ S, STOP: the strategy f_k^∞ is optimal and φ^k is the discounted optimal value vector.


Policy Improvement Algorithm

Step 2. (Policy improvement) Let Ŝ be the (nonempty) subset of states for which equality is violated in (⋆), i.e., for which the left side is strictly smaller than the right side. For each s ∈ Ŝ define

a_s^{k+1} := arg max_{a∈A} { r(s, a) + β Σ_{s′=1}^N p(s′|s, a) φ^k(s′) }

and a new decision rule g by

g(s) := a_s^{k+1} if s ∈ Ŝ, and g(s) := f_k(s) otherwise.

Set f_{k+1} := g and φ^{k+1} := φ_β(g^∞)

Step 3. (Iteration) Set k := k + 1 and return to Step 1.
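
Putting Steps 0–3 together, here is a sketch of the whole procedure in Python/NumPy for a finite MDP given as arrays p[s, a, s′] and r[s, a] (the function names and array layout are my own, not from the talk; the data at the bottom is the two-state example discussed next):

    # Policy improvement (policy iteration) for a finite discounted MDP.
    import numpy as np

    def evaluate(f, p, r, beta):
        """Step 0 / policy evaluation: phi_beta(f^inf) = (I - beta P(f))^{-1} r(f)."""
        S = np.arange(p.shape[0])
        P_f, r_f = p[S, f, :], r[S, f]
        return np.linalg.solve(np.eye(len(S)) - beta * P_f, r_f)

    def policy_improvement(p, r, beta, f=None):
        n_states = p.shape[0]
        f = np.zeros(n_states, dtype=int) if f is None else f.copy()   # any pure strategy
        while True:
            phi = evaluate(f, p, r, beta)                 # current value vector phi^k
            # Step 1: Q[s, a] = r(s, a) + beta * sum_{s'} p(s'|s, a) phi^k(s')
            Q = r + beta * (p @ phi)
            current = Q[np.arange(n_states), f]
            best = Q.max(axis=1)
            if np.all(np.isclose(current, best)):         # optimality equation holds
                return f, phi                             # STOP: f is optimal
            # Step 2: in every state where (*) is violated, switch to a maximizing action
            improve = ~np.isclose(current, best)
            f[improve] = Q.argmax(axis=1)[improve]        # Step 3: iterate with the new rule

    # The two-state example of the talk (states s1, s2; actions a1, a2):
    p = np.array([[[0.6, 0.4], [1.0, 0.0]],
                  [[0.6, 0.4], [0.0, 1.0]]])
    r = np.array([[ 1.0, 0.0],
                  [-1.0, 0.0]])
    f_opt, phi_opt = policy_improvement(p, r, beta=0.9, f=np.array([1, 1]))  # start at (a2, a2)
    print(f_opt, phi_opt)    # [0 0] [2.8 0.8]  -- i.e. (a1, a1) is optimal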


Correctness & Termination

Improvement Lemma

f_{k+1}^∞ > f_k^∞

Proof.

Let f_{k+1} be the decision rule constructed in Step 2 at iteration k. By its definition, its one-step value r(s, f_{k+1}(s)) + β Σ_{s′} p(s′|s, f_{k+1}(s)) φ^k(s′) strictly exceeds φ^k(s) for every s ∈ Ŝ and equals φ^k(s) elsewhere, so that (recalling φ^k = φ_β(f_k^∞))

φ_β(f_{k+1}, f_k^∞) = r(f_{k+1}) + βP(f_{k+1}) φ_β(f_k^∞) > φ_β(f_k^∞)

i.e. that

(f_{k+1}, f_k^∞) > f_k^∞

Then, from Theorem 2, f_{k+1}^∞ > f_k^∞.


Correctness & Termination

Termination Lemma

The Policy Improvement Algorithm terminates in finitely many steps.

Proof.

At iteration k, if the algorithm has not already found an optimal strategy, it computes f_{k+1} such that f_{k+1}^∞ > f_k^∞ (Improvement Lemma). Therefore the algorithm never revisits a decision rule, i.e., it does not cycle. Since there are only finitely many pure decision rules (at most |A|^{|S|}), the algorithm must terminate, and by Theorem 1 it terminates exactly when the current strategy is optimal.


An Example

State s1:
  a    r(s1, a)   p([s1, s2] | s1, a)
  a1    1         [0.6, 0.4]
  a2    0         [1, 0]

State s2:
  a    r(s2, a)   p([s1, s2] | s2, a)
  a1   -1         [0.6, 0.4]
  a2    0         [0, 1]

A policy is denoted as (action in state s1, action in state s2); the corresponding value vector lists the value of the current policy when starting in s1 and in s2, respectively.

Let β = 0.9.

Policy f_0: (a2, a2)

Evaluate the value vector: φ^0 = (0, 0)

Policy improvement:
  For state s1:
    consider a1: 1 + 0.9(0.6·0 + 0.4·0) = 1
    consider a2: 0 + 0.9(1·0 + 0·0) = 0
  For state s2:
    consider a1: −1 + 0.9(0.6·0 + 0.4·0) = −1
    consider a2: 0 + 0.9(0·0 + 1·0) = 0

Improvement in state s1 (a1 beats a2) ⇒ new policy f_1: (a1, a2)


Policy f_1: (a1, a2)

Evaluate the value vector: φ^1 = (2.1739, 0)

Policy improvement:
  For state s1:
    consider a1: 1 + 0.9(0.6·2.1739 + 0.4·0) = 2.1739
    consider a2: 0 + 0.9(1·2.1739 + 0·0) = 1.9565
  For state s2:
    consider a1: −1 + 0.9(0.6·2.1739 + 0.4·0) = 0.1739
    consider a2: 0 + 0.9(0·2.1739 + 1·0) = 0

Improvement in state s2 (a1 beats a2) ⇒ new policy f_2: (a1, a1)


Policy f_2: (a1, a1)

Evaluate the value vector: φ^2 = (2.8, 0.8)

Policy improvement:
  For state s1:
    consider a1: 1 + 0.9(0.6·2.8 + 0.4·0.8) = 2.8
    consider a2: 0 + 0.9(1·2.8 + 0·0.8) = 2.52
  For state s2:
    consider a1: −1 + 0.9(0.6·2.8 + 0.4·0.8) = 0.8
    consider a2: 0 + 0.9(0·2.8 + 1·0.8) = 0.72

No improvement! ⇒ Optimal strategy: (a1, a1), with optimal value vector (2.8, 0.8)
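
A short self-contained check of this final optimality test (NumPy; a1 and a2 are indexed 0 and 1): at φ^2 = (2.8, 0.8), neither state has an action whose one-step lookahead value exceeds that of a1.

    # Q[s, a] = r(s, a) + beta * sum_{s'} p(s'|s, a) phi^2(s') at the final value vector.
    import numpy as np

    beta = 0.9
    p = np.array([[[0.6, 0.4], [1.0, 0.0]],   # p[s, a, s'] for states s1, s2 and actions a1, a2
                  [[0.6, 0.4], [0.0, 1.0]]])
    r = np.array([[ 1.0, 0.0],
                  [-1.0, 0.0]])
    phi2 = np.array([2.8, 0.8])               # value vector of f2 = (a1, a1)

    Q = r + beta * (p @ phi2)
    print(Q)    # [[2.8  2.52]
                #  [0.8  0.72]]  -- column a1 attains the max in both rows, so (a1, a1) is optimal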


Thank You! Any Questions?
