
Stochastic Games (Part I): Policy Improvement in Discounted (Noncompetitive) Markov Decision Processes

Paul Varkey

Multi Agent Systems Group, Department of Computer Science, UIC

4th Annual Graduate Student Probability Conference, April 30, 2010

Duke University, Durham NC


Outline

1 The Model (definitions, notation, and the problem statement)

2 Basic Theorems

3 The Algorithm

4 An Example


References

BLACKWELL, D. (1962): “Discrete Dynamic Programming”, The Annals of Mathematical Statistics, Vol. 33, No. 2 (Jun. 1962), pp. 719-726.

FILAR, J. A. and VRIEZE, O. J. (1996): “Competitive Markov Decision Processes: Theory, Algorithms, and Applications”, Springer-Verlag, New York, 1996.


Decision Processes, States and Actions

A decision process is a discrete stochastic process that is observed at discrete time points t = 0, 1, 2, 3, ... called stages, where a decision-maker (or controller) chooses an action at each stage.

S denotes the state space – the process may be in one of (finitely) many states

A denotes the action space – the actions may be chosen from one of (finitely) many actions

Choosing action a in state s results in
(i) an immediate reward r(s, a)
(ii) a probabilistic transition to a state s′, given by p(s′|s, a)
     (this is the (stationary) Markov assumption: the transition law depends only on the current state and action)
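To make the objects concrete, here is a minimal sketch (not part of the original slides) of how such a finite MDP can be stored numerically; the array names P and r are my own, and the data encode the two-state example used at the end of the talk, with actions indexed a1 = 0, a2 = 1.

```python
import numpy as np

# A finite MDP can be stored as two arrays:
#   r[s, a]     - immediate reward for choosing action a in state s
#   P[s, a, s'] - probability of moving to state s'; by the (stationary)
#                 Markov assumption it depends only on the current s and a
r = np.array([[ 1.0, 0.0],     # state s1: rewards of a1, a2
              [-1.0, 0.0]])    # state s2: rewards of a1, a2

P = np.array([[[0.6, 0.4],     # s1, a1
               [1.0, 0.0]],    # s1, a2
              [[0.6, 0.4],     # s2, a1
               [0.0, 1.0]]])   # s2, a2

# Every row P[s, a, :] must be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)
```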


Decision Rules, Strategies, Transitions & Rewards

A decision rule f : S → A is a function that specifies the action that the controller chooses in a given state

Each decision rule f defines
- a transition probability matrix P(f) whose (s, s′)-th entry is given by
      P(f)[s, s′] = p(s′|s, f(s))
- a reward vector r(f) whose s-th entry is given by
      r(f)[s] = r(s, f(s))

A (Markov) strategy π is a sequence of decision rules {f_n, n = 0, 1, ...} such that f_n is used by the controller at stage n

A stationary strategy is a stage-independent strategy, i.e. it uses the same decision rule f at each stage; it will be denoted simply by f
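As a small illustration (again a sketch of my own, not the speaker's code), a decision rule can be stored as an array f with one action index per state; it then induces P(f) and r(f) exactly as defined above.

```python
import numpy as np

# Two-state example (actions indexed a1 = 0, a2 = 1), as in the earlier sketch.
P = np.array([[[0.6, 0.4], [1.0, 0.0]],
              [[0.6, 0.4], [0.0, 1.0]]])
r = np.array([[1.0, 0.0], [-1.0, 0.0]])

def induced(P, r, f):
    """Return P(f) and r(f) for a decision rule f, where f[s] is the action
    chosen in state s:  P(f)[s, s'] = p(s'|s, f(s)),  r(f)[s] = r(s, f(s))."""
    states = np.arange(P.shape[0])
    return P[states, f, :], r[states, f]

Pf, rf = induced(P, r, np.array([0, 1]))   # f plays a1 in s1 and a2 in s2
print(Pf)   # P(f): rows [0.6, 0.4] and [0.0, 1.0]
print(rf)   # r(f): [1.0, 0.0]
```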


Decision Rules, Strategies, Transitions & Rewards

Each strategy π defines a t-stage transition probability matrix as

      P_t(π) := P(f_0) P(f_1) ... P(f_{t-1})   for t = 1, 2, ...

where P_0(π) := I_N (the N × N identity matrix, with N = |S|)

Associate with each strategy π the β-discounted value vector

      φβ(π) = Σ_{t=0}^{∞} β^t P_t(π) r(f_t)
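The β-discounted value can be approximated directly from this definition by truncating the sum; the sketch below (function and variable names are mine) does this for a Markov strategy given as a list of decision rules, repeated cyclically once the list runs out, which covers the stationary case with a single rule.

```python
import numpy as np

def discounted_value(rules, P, r, beta, T=500):
    """Truncated sum  phi_beta(pi) ~ sum_{t<T} beta^t P_t(pi) r(f_t).
    `rules` is a list of decision rules f_t (repeated cyclically for t >= len);
    the neglected tail is at most beta^T * max|r| / (1 - beta) per entry."""
    N = P.shape[0]
    states = np.arange(N)
    Pt = np.eye(N)                               # P_0(pi) = I_N
    phi = np.zeros(N)
    for t in range(T):
        f = rules[t % len(rules)]
        phi += beta**t * Pt @ r[states, f]       # add beta^t P_t(pi) r(f_t)
        Pt = Pt @ P[states, f, :]                # P_{t+1}(pi) = P_t(pi) P(f_t)
    return phi

# Stationary strategy (a1, a1) in the two-state example, beta = 0.9:
P = np.array([[[0.6, 0.4], [1.0, 0.0]],
              [[0.6, 0.4], [0.0, 1.0]]])
r = np.array([[1.0, 0.0], [-1.0, 0.0]])
print(discounted_value([np.array([0, 0])], P, r, beta=0.9))   # ~ [2.8, 0.8]
```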


≥ Ordering for Value Vectors and Strategies

≥ Ordering Value Vectors

Denote u ≥ w whenever [u]_s ≥ [w]_s for all s ∈ S

Denote u > w whenever u ≥ w and u ≠ w

≥ Ordering Strategies

π1 ≥ π2 iff φβ(π1) ≥ φβ(π2)


Optimal Strategies

π0 is an optimal strategy if

π0 ≥ π for all π

or, in other words,

π0 “maximizes” φβ(π)


The Problem

Given an MDP and a discount factor β ∈ [0, 1), find an optimal strategy


The L(·) operator

Let π+ := {f_1, f_2, ...} (where π = {f_0, f_1, f_2, ...})
- Observe that

      φβ(π) = r(f_0) + βP(f_0) Σ_{t=1}^{∞} β^{t-1} P_{t-1}(π+) r(f_t)
            = r(f_0) + βP(f_0) φβ(π+)

For every decision rule g, we define the associated operator L(g) : R^|S| → R^|S| as

      L(g)(u) := r(g) + βP(g)u ,   u ∈ R^|S|

- In this notation, writing (f, π) for the strategy that uses f at stage 0 and follows π thereafter,

      φβ(f, π) = L(f)(φβ(π))

  and, more generally,

      φβ(f_0, f_1, ..., f_{t-1}, π) = L(f_0)L(f_1)...L(f_{t-1})(φβ(π))

- L(g) is a monotone operator:

      u ≥ w ⇒ L(g)(u) ≥ L(g)(w)
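A quick numerical illustration of these properties (my own sketch; L below is just a closure over P(g), r(g) and β, not notation from the slides): monotonicity, and the fact, used in the proofs that follow, that repeated application of L(g) converges to φβ(g).

```python
import numpy as np

beta = 0.9
Pg = np.array([[0.6, 0.4],      # P(g) for the decision rule g = (a1, a1)
               [0.6, 0.4]])     # of the two-state example
rg = np.array([1.0, -1.0])      # r(g)

L = lambda u: rg + beta * Pg @ u          # L(g)(u) := r(g) + beta P(g) u

# Monotonicity: u >= w componentwise implies L(g)(u) >= L(g)(w).
u, w = np.array([2.0, 1.0]), np.array([1.0, 0.5])
assert np.all(L(u) >= L(w))

# Repeated application converges to the value of the stationary strategy g,
# phi_beta(g) = [I - beta P(g)]^{-1} r(g), which is a fixed point of L(g):
v = np.zeros(2)
for _ in range(300):
    v = L(v)
print(v)                                            # ~ [2.8, 0.8]
print(np.linalg.solve(np.eye(2) - beta * Pg, rg))   # [2.8, 0.8]
```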


Theorem 1

Let π0 = {f_n^0, n = 0, 1, ...}. If π0 ≥ (f, π0) for every decision rule f, then π0 is optimal.

Proof.

Let π = {f_n, n = 0, 1, ...} be an arbitrary (Markov) strategy.

π0 ≥ (f, π0) ⇒ φβ(π0) ≥ φβ(f, π0) = L(f)(φβ(π0)) for every decision rule f

Using this inequality (with f = f_{T-1} and f = f_T) together with the monotonicity of L(f_{T-1}), we obtain, for example,

      φβ(π0) ≥ L(f_{T-1})(φβ(π0)) ≥ L(f_{T-1})L(f_T)(φβ(π0))

Iterating this argument with the first (T + 1) decision rules of π gives

      φβ(π0) ≥ L(f_0)L(f_1)...L(f_T)(φβ(π0))
             = φβ(f_0, f_1, ..., f_T, π0)
             = Σ_{t=0}^{T} β^t P_t(π) r(f_t) + Σ_{t=T+1}^{∞} β^t P_t(f_0, ..., f_T, π0) r(f^0_{t-T-1})

for every nonnegative integer T.

Letting T → ∞, the last term vanishes (its entries are bounded by β^{T+1} max_{s,a} |r(s, a)| / (1 − β)), leading to

      φβ(π0) ≥ Σ_{t=0}^{∞} β^t P_t(π) r(f_t) = φβ(π)

Since π was arbitrary, the proof is complete.


Theorem 2

Let π = {f_n, n = 0, 1, ...} be an arbitrary (Markov) strategy and f a decision rule such that (f, π) > π. Then f > π (where f on the left denotes the stationary strategy that uses f at every stage).

Proof.

(f, π) > π ⇒ φβ(f, π) = L(f)(φβ(π)) > φβ(π)

As before, the monotonicity of L(f) and repeated application of the above inequality give, for every T,

      φβ(f, f, ..., f, π) = L^(T)(f)(φβ(π)) ≥ L(f)(φβ(π)) = φβ(f, π) > φβ(π)

where L^(T)(f) denotes the T-fold application of L(f), i.e. f is used for the first T stages.

Letting T → ∞ yields

      φβ(f) = φβ(f, f, f, ...) ≥ φβ(f, π) > φβ(π)

⇒ f > π


Policy Improvement Algorithm

Step 0. (Initialization) Set k := 0. Select any pure strategy f.
        Set f_0 := f and φ_0 := φβ(f_0) = [I − βP(f_0)]^{-1} r(f_0)

Step 1. (Check optimality) Let a_s^k be the action selected by f_k in state s.
        If the "optimality equation"

            r(s, a_s^k) + β Σ_{s′=1}^{N} p(s′|s, a_s^k) φ_k(s′)
              = max_{a ∈ A} { r(s, a) + β Σ_{s′=1}^{N} p(s′|s, a) φ_k(s′) }        (⋆)

        holds for each s ∈ S, STOP.
        The strategy f_k is optimal and φ_k is the discounted optimal value vector.
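Step 0's evaluation φ_0 = [I − βP(f_0)]^{-1} r(f_0) is a single linear solve; here is a hedged sketch (the function name evaluate and its argument conventions are mine, not the speaker's):

```python
import numpy as np

def evaluate(P, r, f, beta):
    """phi_beta(f) = [I - beta P(f)]^{-1} r(f) for a pure stationary strategy,
    where f[s] is the index of the action chosen in state s."""
    states = np.arange(P.shape[0])
    Pf, rf = P[states, f, :], r[states, f]
    return np.linalg.solve(np.eye(len(states)) - beta * Pf, rf)

# Two-state example, initial strategy f_0 = (a2, a2)  (actions a1 = 0, a2 = 1):
P = np.array([[[0.6, 0.4], [1.0, 0.0]],
              [[0.6, 0.4], [0.0, 1.0]]])
r = np.array([[1.0, 0.0], [-1.0, 0.0]])
print(evaluate(P, r, np.array([1, 1]), beta=0.9))   # [0. 0.]
```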


Policy Improvement Algorithm

Step 2. (Policy improvement) Let S̄ be the (nonempty) subset of states for which equality is violated in (⋆), i.e., for which the left side is strictly smaller than the right side. For each s ∈ S̄ define

            a_s^{k+1} := argmax_{a ∈ A} { r(s, a) + β Σ_{s′=1}^{N} p(s′|s, a) φ_k(s′) }

        and a new decision rule g by

            g(s) := a_s^{k+1}   if s ∈ S̄,
            g(s) := f_k(s)      if s ∉ S̄.

        Set f_{k+1} := g and φ_{k+1} := φβ(f_{k+1})

Step 3. (Iteration) Set k := k + 1 and return to Step 1.
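Steps 0-3 combine into a short loop; the following is a sketch under the same array conventions as before (Q, gap and the 1e-10 tolerance are my own choices, and evaluate is the policy-evaluation helper from the previous sketch, repeated so the block stands alone):

```python
import numpy as np

def evaluate(P, r, f, beta):
    """phi_beta(f) = [I - beta P(f)]^{-1} r(f)."""
    states = np.arange(P.shape[0])
    return np.linalg.solve(np.eye(len(states)) - beta * P[states, f, :],
                           r[states, f])

def policy_improvement(P, r, beta, f0):
    """Policy improvement for a discounted MDP given by r[s, a], P[s, a, s'].
    Returns an optimal pure stationary strategy and its value vector."""
    f = np.array(f0)
    states = np.arange(P.shape[0])
    while True:
        phi = evaluate(P, r, f, beta)            # Step 0 / re-evaluation
        Q = r + beta * P @ phi                   # Q[s, a] = r(s, a) + beta sum_s' p(s'|s, a) phi(s')
        gap = Q.max(axis=1) - Q[states, f]       # violation of the optimality equation (*)
        if np.all(gap <= 1e-10):                 # Step 1: (*) holds everywhere -> STOP
            return f, phi
        f = np.where(gap > 1e-10, Q.argmax(axis=1), f)   # Step 2: switch only in improvable states
        # Step 3: loop back to the optimality check with the new strategy
```

On the two-state MDP used later, starting from f_0 = (a2, a2), this loop visits the same sequence of strategies as the worked example that follows.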


Correctness & Termination

Improvement Lemma

f_{k+1} > f_k

Proof.

Let f_{k+1} be the decision rule produced in Step 2 (it defines the stationary strategy f_{k+1}). By its definition,

      φβ(f_{k+1}, f_k) = r(f_{k+1}) + βP(f_{k+1}) φβ(f_k) > φβ(f_k)

i.e.

      (f_{k+1}, f_k) > f_k

Then, from Theorem 2, f_{k+1} > f_k.


Correctness & Termination

Termination Lemma

The Policy Improvement Algorithm terminates in finitely many steps.

Proof.

At iteration k, if the algorithm has not already found an optimal strategy, it computes f_{k+1} such that f_{k+1} > f_k (Improvement Lemma). Therefore the algorithm never revisits a strategy, i.e. it does not cycle. Since there are only finitely many pure stationary strategies (at most |A|^|S|), the algorithm terminates when it finds the (an) optimal strategy.


An Example

State s1:
    a     r(s1, a)    p([s1, s2] | s1, a)
    a1    1           [0.6, 0.4]
    a2    0           [1, 0]

State s2:
    a     r(s2, a)    p([s1, s2] | s2, a)
    a1    -1          [0.6, 0.4]
    a2    0           [0, 1]

A policy is denoted as (action in state s1, action in state s2); the corresponding value vector gives the value of the current policy when starting in (state s1, state s2).

Let β = 0.9


Iteration 0

Policy f_0: (a2, a2)
Evaluate value vector: φ_0 = (0, 0)

Policy improvement:
  For state s1,
    consider a1: 1 + 0.9(0.6 ∗ 0 + 0.4 ∗ 0) = 1
    consider a2: 0 + 0.9(1 ∗ 0 + 0 ∗ 0) = 0
  For state s2,
    consider a1: −1 + 0.9(0.6 ∗ 0 + 0.4 ∗ 0) = −1
    consider a2: 0 + 0.9(0 ∗ 0 + 1 ∗ 0) = 0

Improvement in state s1 (a1 beats the current action a2) ⇒ new policy f_1: (a1, a2)


Iteration 1

Policy f_1: (a1, a2)
Evaluate value vector: φ_1 = (2.1739, 0)

Policy improvement:
  For state s1,
    consider a1: 1 + 0.9(0.6 ∗ 2.1739 + 0.4 ∗ 0) = 2.1739
    consider a2: 0 + 0.9(1 ∗ 2.1739 + 0 ∗ 0) = 1.9565
  For state s2,
    consider a1: −1 + 0.9(0.6 ∗ 2.1739 + 0.4 ∗ 0) = 0.1739
    consider a2: 0 + 0.9(0 ∗ 2.1739 + 1 ∗ 0) = 0

Improvement in state s2 (a1 beats the current action a2) ⇒ new policy f_2: (a1, a1)


Iteration 2

Policy f_2: (a1, a1)
Evaluate value vector: φ_2 = (2.8, 0.8)

Policy improvement:
  For state s1,
    consider a1: 1 + 0.9(0.6 ∗ 2.8 + 0.4 ∗ 0.8) = 2.8
    consider a2: 0 + 0.9(1 ∗ 2.8 + 0 ∗ 0.8) = 2.52
  For state s2,
    consider a1: −1 + 0.9(0.6 ∗ 2.8 + 0.4 ∗ 0.8) = 0.8
    consider a2: 0 + 0.9(0 ∗ 2.8 + 1 ∗ 0.8) = 0.72

No improvement! ⇒ Optimal strategy: (a1, a1)
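As a numerical cross-check, the three iterations above can be reproduced end to end with a short script of my own (assuming actions are indexed a1 = 0, a2 = 1):

```python
import numpy as np

P = np.array([[[0.6, 0.4], [1.0, 0.0]],      # p(.|s1, a1), p(.|s1, a2)
              [[0.6, 0.4], [0.0, 1.0]]])     # p(.|s2, a1), p(.|s2, a2)
r = np.array([[1.0, 0.0], [-1.0, 0.0]])      # r(s1, .), r(s2, .)
beta, states = 0.9, np.arange(2)

f = np.array([1, 1])                                        # f_0 = (a2, a2)
for k in range(10):
    # Evaluate:  phi = [I - beta P(f)]^{-1} r(f)
    phi = np.linalg.solve(np.eye(2) - beta * P[states, f, :], r[states, f])
    Q = r + beta * P @ phi                                  # improvement-test values
    print(f"f_{k} = {f}, phi_{k} = {np.round(phi, 4)}")
    gap = Q.max(axis=1) - Q[states, f]
    if np.all(gap <= 1e-10):
        print("No improvement: optimal strategy", f)        # here: [0 0] = (a1, a1)
        break
    f = np.where(gap > 1e-10, Q.argmax(axis=1), f)

# Expected iterates (up to print formatting):
#   f_0 = [1 1], phi_0 = [0. 0.]
#   f_1 = [0 1], phi_1 = [2.1739 0.]
#   f_2 = [0 0], phi_2 = [2.8 0.8]
#   No improvement: optimal strategy [0 0]
```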


Thank You! Any Questions?
