
Stochastic Games (Part I): Policy Improvement in Discounted (Noncompetitive) Markov Decision Processes

Paul Varkey

Multi Agent Systems Group, Department of Computer Science, UIC

4th Annual Graduate Student Probability Conference, Apr 30, 2010

Duke University, Durham, NC


Outline

1 The Model (definitions, notations and the problem statement)

2 Basic Theorems

3 The Algorithm

4 An Example


References

BLACKWELL, D. (1962): “Discrete Dynamic Programming”, The Annals of Mathematical Statistics, Vol. 33, No. 2 (Jun. 1962), pp. 719–726.

FILAR, J.A. and VRIEZE, O.J. (1996): “Competitive Markov Decision Processes: Theory, Algorithms, and Applications”, Springer-Verlag, New York, 1996.


Decision Processes, States and Actions

A decision process is a discrete stochastic process observed at discrete time points t = 0, 1, 2, 3, ... (called stages), at each of which a decision-maker (or controller) chooses an action.

S denotes the state space – the process may be in one of finitely many states

A denotes the action space – the action is chosen from among finitely many actions

Choosing action a in state s results in

(i) an immediate reward r(s, a)
(ii) a probabilistic transition to a state s′ given by p(s′|s, a)

The transition law p(s′|s, a) depends only on the current state and action – the (stationary) Markov assumption.


Decision Rules, Strategies, Transitions & Rewards

A decision rule f : S → A is a function that specifies the action that a controller chooses in a given state

Each decision rule f defines
- a transition probability matrix P(f), whose (s, s′)-th entry is given by P(f)[s, s′] = p(s′|s, f(s))
- a reward vector r(f), whose s-th entry is given by r(f)[s] = r(s, f(s))

A (Markov) strategy π is a sequence of decision rules {f_n, n = 0, 1, ...} such that f_n is used by the controller at stage n

A stationary strategy is a stage-independent strategy, i.e., one that uses the same decision rule f at every stage; it will be denoted f^∞
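
As a concrete illustration (a minimal NumPy sketch; the function name induced_matrices and the array layout p[s, a, s′], r[s, a] are my own choices, and the numbers are the two-state example that appears at the end of the talk), a decision rule f simply selects one row per state from the tables p(s′|s, a) and r(s, a) to form P(f) and r(f):

    # Building P(f) and r(f) for a decision rule f, given tables p(s'|s,a) and r(s,a).
    import numpy as np

    # p[s, a, s'] = probability of moving to s' when action a is chosen in state s
    p = np.array([[[0.6, 0.4],    # state s1, action a1
                   [1.0, 0.0]],   # state s1, action a2
                  [[0.6, 0.4],    # state s2, action a1
                   [0.0, 1.0]]])  # state s2, action a2
    # r[s, a] = immediate reward for choosing action a in state s
    r = np.array([[ 1.0, 0.0],
                  [-1.0, 0.0]])

    def induced_matrices(f):
        """Return the transition matrix P(f) and reward vector r(f)
        for a decision rule f, where f[s] is the action chosen in state s."""
        states = np.arange(len(f))
        P_f = p[states, f, :]   # row s is p(.|s, f(s))
        r_f = r[states, f]      # entry s is r(s, f(s))
        return P_f, r_f

    P_f, r_f = induced_matrices(np.array([0, 1]))   # the decision rule (a1, a2)
    print(P_f)   # [[0.6 0.4], [0.  1. ]]
    print(r_f)   # [1. 0.]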


Decision Rules, Strategies, Transitions & Rewards

Each strategy π defines a t-stage transition probability matrix as

P_t(π) := P(f_0)P(f_1)···P(f_{t−1}) for t = 1, 2, ...

where P_0(π) := I_N, the N × N identity matrix (N = |S|)

Associate with each strategy π the β-discounted value vector (β ∈ [0, 1) is the discount factor)

φ_β(π) = Σ_{t=0}^∞ β^t P_t(π) r(f_t)
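
For a stationary strategy f^∞ we have P_t(f^∞) = P(f)^t, so the series above is a Neumann series and sums to φ_β(f^∞) = [I − βP(f)]^{-1} r(f) – the formula used in Step 0 of the algorithm later in the talk. A minimal evaluation sketch (NumPy; P_f and r_f are those of the decision rule (a1, a2) from the example at the end of the talk):

    # beta-discounted value of a stationary strategy:
    # phi = sum_t beta^t P(f)^t r(f) = (I - beta P(f))^{-1} r(f), valid for 0 <= beta < 1.
    import numpy as np

    beta = 0.9
    P_f = np.array([[0.6, 0.4],
                    [0.0, 1.0]])
    r_f = np.array([1.0, 0.0])

    phi = np.linalg.solve(np.eye(2) - beta * P_f, r_f)
    print(phi)    # ~ [2.1739, 0.]

    # Sanity check: a long truncation of the series gives (numerically) the same vector.
    approx = sum(beta**t * np.linalg.matrix_power(P_f, t) @ r_f for t in range(500))
    print(approx)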


≥ Ordering for Value Vectors and Strategies

≥ Ordering Value Vectors

Denote u ≥ w whenever [u]_s ≥ [w]_s for all s ∈ S

Denote u > w whenever u ≥ w and u ≠ w

≥ Ordering Strategies

π_1 ≥ π_2 iff φ_β(π_1) ≥ φ_β(π_2)


Optimal Strategies

π^0 is an optimal strategy if

π^0 ≥ π for all π

or, in other words, π^0 “maximizes” φ_β(π)


The Problem

Given an MDP, find an optimal strategy


The L(·) operator

Let π^+ := {f_1, f_2, ...} (where π = {f_0, f_1, f_2, ...})
- Observe that

φ_β(π) = r(f_0) + βP(f_0) Σ_{t=1}^∞ β^{t−1} P_{t−1}(π^+) r(f_t) = r(f_0) + βP(f_0) φ_β(π^+)

For every decision rule g, we define the associated operator L(g) : R^{|S|} → R^{|S|} as

L(g)(u) := r(g) + βP(g)u ,  u ∈ R^{|S|}

- In this notation, φ_β(f, π) = L(f)(φ_β(π)), where (f, π) denotes the strategy that uses f at stage 0 and follows π thereafter; more generally,

φ_β(f_0, f_1, ..., f_{t−1}, π) = L(f_0)L(f_1)···L(f_{t−1})(φ_β(π))

- L(g) is a monotone operator:

u ≥ w ⇒ L(g)(u) ≥ L(g)(w)
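
A small sketch of L(g) as code (NumPy; the numbers are again the decision rule (a1, a2) from the talk's example). It also illustrates a fact used implicitly when letting T → ∞ in the proofs below: since 0 ≤ β < 1, iterating L(f) from any starting vector converges to φ_β(f^∞), the unique fixed point of L(f).

    # The operator L(g)(u) = r(g) + beta * P(g) u, and iteration to its fixed point.
    import numpy as np

    beta = 0.9
    P_g = np.array([[0.6, 0.4],
                    [0.0, 1.0]])
    r_g = np.array([1.0, 0.0])

    def L(u):
        """One application of L(g): u -> r(g) + beta * P(g) u."""
        return r_g + beta * P_g @ u

    u = np.zeros(2)          # any starting vector works; L(g) is a beta-contraction
    for _ in range(300):
        u = L(u)
    print(u)                 # ~ [2.1739, 0.] = phi_beta(f^inf) for f = (a1, a2)

    # Monotonicity: if u >= w entrywise then L(u) >= L(w) entrywise,
    # because beta * P(g) has nonnegative entries.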


Theorem 1

Let π^0 = {f_n^0, n = 0, 1, ...}. If π^0 ≥ (f, π^0) for every decision rule f, then π^0 is optimal.

Proof.

Let π = {f_n, n = 0, 1, ...} be an arbitrary (Markov) strategy.

π^0 ≥ (f, π^0) ⇒ φ_β(π^0) ≥ φ_β(f, π^0) = L(f)(φ_β(π^0)) for every decision rule f

Using the monotonicity of the L(f) operators we obtain, for example,
φ_β(π^0) ≥ L(f_{T−1})(φ_β(π^0)) ≥ L(f_{T−1})L(f_T)(φ_β(π^0))

Iterating this argument through the first (T + 1) decision rules of π gives, for every nonnegative integer T,
φ_β(π^0) ≥ L(f_0)L(f_1)···L(f_T)(φ_β(π^0))
         = φ_β(f_0, f_1, ..., f_T, π^0)
         = Σ_{t=0}^T β^t P_t(π) r(f_t) + Σ_{t=T+1}^∞ β^t P_t(f_0, ..., f_T, π^0) r(f^0_{t−T−1})

Letting T → ∞, the last term vanishes (the rewards are bounded and 0 ≤ β < 1), leading to
φ_β(π^0) ≥ Σ_{t=0}^∞ β^t P_t(π) r(f_t) = φ_β(π)

Since π was arbitrary, the proof is complete.


Theorem 2

Let π = {f_n, n = 0, 1, ...} be an arbitrary (Markov) strategy and f a decision rule such that (f, π) > π. Then f^∞ > π.

Proof.

(f, π) > π ⇒ φ_β(f, π) = L(f)(φ_β(π)) > φ_β(π)

As before, monotonicity of L(f) and repeated application of the above inequality give, for every T,
φ_β(f, f, ..., f, π) = L^T(f)(φ_β(π)) ≥ L(f)(φ_β(π)) = φ_β(f, π) > φ_β(π)
(here L^T(f) denotes T successive applications of L(f), i.e., T copies of f precede π)

Letting T → ∞, the contribution of the tail strategy π is of order β^T and vanishes, yielding φ_β(f^∞) = φ_β(f, f, f, ...) ≥ φ_β(f, π) > φ_β(π)

⇒ f^∞ > π


Policy Improvement Algorithm

Step 0. (Initialization) Set k := 0. Select any pure strategy (decision rule) f. Set f_0 := f and φ^0 := φ_β(f_0^∞) = [I − βP(f_0)]^{−1} r(f_0)

Step 1. (Check optimality) Let a_s^k be the action selected by f_k in state s. If the “optimality equation”

r(s, a_s^k) + β Σ_{s′=1}^N p(s′|s, a_s^k) φ^k(s′) = max_{a∈A} { r(s, a) + β Σ_{s′=1}^N p(s′|s, a) φ^k(s′) }     (⋆)

holds for each s ∈ S, STOP: the strategy f_k^∞ is optimal and φ^k is the discounted optimal value vector.


Policy Improvement Algorithm

Step 2. (Policy improvement) Let Ŝ be the (nonempty) subset of states for which equality is violated in (⋆), i.e., for which the left side is strictly smaller than the right side. For each s ∈ Ŝ define

a_s^{k+1} := arg max_{a∈A} { r(s, a) + β Σ_{s′=1}^N p(s′|s, a) φ^k(s′) }

and a new decision rule g by

g(s) := a_s^{k+1} if s ∈ Ŝ, and g(s) := f_k(s) otherwise.

Set f_{k+1} := g and φ^{k+1} := φ_β(g^∞)

Step 3. (Iteration) Set k := k + 1 and return to Step 1.
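
Putting Steps 0–3 together, here is a sketch of the whole procedure in Python/NumPy for a finite MDP given as arrays p[s, a, s′] and r[s, a] (the function names and array layout are my own, not from the talk; the data at the bottom is the two-state example discussed next):

    # Policy improvement (policy iteration) for a finite discounted MDP.
    import numpy as np

    def evaluate(f, p, r, beta):
        """Step 0 / policy evaluation: phi_beta(f^inf) = (I - beta P(f))^{-1} r(f)."""
        S = np.arange(p.shape[0])
        P_f, r_f = p[S, f, :], r[S, f]
        return np.linalg.solve(np.eye(len(S)) - beta * P_f, r_f)

    def policy_improvement(p, r, beta, f=None):
        n_states = p.shape[0]
        f = np.zeros(n_states, dtype=int) if f is None else f.copy()   # any pure strategy
        while True:
            phi = evaluate(f, p, r, beta)                 # current value vector phi^k
            # Step 1: Q[s, a] = r(s, a) + beta * sum_{s'} p(s'|s, a) phi^k(s')
            Q = r + beta * (p @ phi)
            current = Q[np.arange(n_states), f]
            best = Q.max(axis=1)
            if np.all(np.isclose(current, best)):         # optimality equation holds
                return f, phi                             # STOP: f is optimal
            # Step 2: in every state where (*) is violated, switch to a maximizing action
            improve = ~np.isclose(current, best)
            f[improve] = Q.argmax(axis=1)[improve]        # Step 3: iterate with the new rule

    # The two-state example of the talk (states s1, s2; actions a1, a2):
    p = np.array([[[0.6, 0.4], [1.0, 0.0]],
                  [[0.6, 0.4], [0.0, 1.0]]])
    r = np.array([[ 1.0, 0.0],
                  [-1.0, 0.0]])
    f_opt, phi_opt = policy_improvement(p, r, beta=0.9, f=np.array([1, 1]))  # start at (a2, a2)
    print(f_opt, phi_opt)    # [0 0] [2.8 0.8]  -- i.e. (a1, a1) is optimal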


Correctness & Termination

Improvement Lemma

f_{k+1}^∞ > f_k^∞

Proof.

Let f_{k+1} be the decision rule constructed in Step 2 at iteration k. By its definition, its one-step value r(s, f_{k+1}(s)) + β Σ_{s′} p(s′|s, f_{k+1}(s)) φ^k(s′) strictly exceeds φ^k(s) for every s ∈ Ŝ and equals φ^k(s) elsewhere, so that (recalling φ^k = φ_β(f_k^∞))

φ_β(f_{k+1}, f_k^∞) = r(f_{k+1}) + βP(f_{k+1}) φ_β(f_k^∞) > φ_β(f_k^∞)

i.e. that

(f_{k+1}, f_k^∞) > f_k^∞

Then, from Theorem 2, f_{k+1}^∞ > f_k^∞.


Correctness & Termination

Termination Lemma

The Policy Improvement Algorithm terminates in finitely many steps.

Proof.

At iteration k, if the algorithm has not already found an optimal strategy, it computes f_{k+1} such that f_{k+1}^∞ > f_k^∞ (Improvement Lemma). Therefore the algorithm never revisits a decision rule, i.e., it does not cycle. Since there are only finitely many pure decision rules (at most |A|^{|S|}), the algorithm must terminate, and by Theorem 1 it terminates exactly when the current strategy is optimal.


An Example

State s1:
  a    r(s1, a)   p([s1, s2] | s1, a)
  a1    1         [0.6, 0.4]
  a2    0         [1, 0]

State s2:
  a    r(s2, a)   p([s1, s2] | s2, a)
  a1   -1         [0.6, 0.4]
  a2    0         [0, 1]

A policy is denoted as (action in state s1, action in state s2); the corresponding value vector lists the value of the current policy when starting in s1 and in s2, respectively.

Let β = 0.9.

Policy f_0: (a2, a2)

Evaluate the value vector: φ^0 = (0, 0)

Policy improvement:
  For state s1:
    consider a1: 1 + 0.9(0.6·0 + 0.4·0) = 1
    consider a2: 0 + 0.9(1·0 + 0·0) = 0
  For state s2:
    consider a1: −1 + 0.9(0.6·0 + 0.4·0) = −1
    consider a2: 0 + 0.9(0·0 + 1·0) = 0

Improvement in state s1 (a1 beats a2) ⇒ new policy f_1: (a1, a2)


Policy f_1: (a1, a2)

Evaluate the value vector: φ^1 = (2.1739, 0)

Policy improvement:
  For state s1:
    consider a1: 1 + 0.9(0.6·2.1739 + 0.4·0) = 2.1739
    consider a2: 0 + 0.9(1·2.1739 + 0·0) = 1.9565
  For state s2:
    consider a1: −1 + 0.9(0.6·2.1739 + 0.4·0) = 0.1739
    consider a2: 0 + 0.9(0·2.1739 + 1·0) = 0

Improvement in state s2 (a1 beats a2) ⇒ new policy f_2: (a1, a1)


Policy f_2: (a1, a1)

Evaluate the value vector: φ^2 = (2.8, 0.8)

Policy improvement:
  For state s1:
    consider a1: 1 + 0.9(0.6·2.8 + 0.4·0.8) = 2.8
    consider a2: 0 + 0.9(1·2.8 + 0·0.8) = 2.52
  For state s2:
    consider a1: −1 + 0.9(0.6·2.8 + 0.4·0.8) = 0.8
    consider a2: 0 + 0.9(0·2.8 + 1·0.8) = 0.72

No improvement! ⇒ Optimal strategy: (a1, a1), with optimal value vector (2.8, 0.8)
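
A short self-contained check of this final optimality test (NumPy; a1 and a2 are indexed 0 and 1): at φ^2 = (2.8, 0.8), neither state has an action whose one-step lookahead value exceeds that of a1.

    # Q[s, a] = r(s, a) + beta * sum_{s'} p(s'|s, a) phi^2(s') at the final value vector.
    import numpy as np

    beta = 0.9
    p = np.array([[[0.6, 0.4], [1.0, 0.0]],   # p[s, a, s'] for states s1, s2 and actions a1, a2
                  [[0.6, 0.4], [0.0, 1.0]]])
    r = np.array([[ 1.0, 0.0],
                  [-1.0, 0.0]])
    phi2 = np.array([2.8, 0.8])               # value vector of f2 = (a1, a1)

    Q = r + beta * (p @ phi2)
    print(Q)    # [[2.8  2.52]
                #  [0.8  0.72]]  -- column a1 attains the max in both rows, so (a1, a1) is optimal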


Thank You! Any Questions?
