Between winning slow and losing fast


BMAC presentation. Department of Computer Science. Colorado State University. Feb. 08, 2010.


Winning slow, losing fast, and in between.

Reinaldo A Uribe Muriel

Colorado State University. Prof. C. Anderson
Oita University. Prof. K. Shibata

Universidad de Los Andes. Prof. F. Lozano

February 8, 2010

It’s all fun and games until someone proves a theorem.

Outline

1 Fun and games

2 A theorem

3 An algorithm

A game: Snakes & Ladders

Board: Crawford & Son, Melbourne, 1901. (Source: http://www.naa.gov.au/)

Player advances the number of steps indicated by a die.

Landing on a snake’s mouth sends the player back to the tail.

Landing on a ladder’s bottom moves the player forward to the top.

Goal: reaching state 100.


Boring! (No skill required, only luck.)


Variation: Decision Snakes and Ladders

Sets of “win” and “loss” terminal states.

Actions: either “advance” or “retreat,” to be decided before throwing the die.
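To make the setup concrete, here is a minimal sketch of the decision variant as an episodic environment. The snake, ladder, and terminal squares below are placeholders, not the Crawford & Son board or the win/loss sets used in the talk, and the function names are my own.

```python
import random

# Placeholder board (NOT the 1901 Crawford & Son layout used in the talk).
SNAKES  = {17: 4, 54: 19, 62: 18, 98: 79}   # snake's mouth -> tail
LADDERS = {3: 22, 11: 49, 28: 76, 80: 99}   # ladder's bottom -> top
WIN, LOSS = {100}, {0}                      # assumed "win" / "loss" terminal sets

def step(state, action, die=None):
    """One turn of Decision Snakes and Ladders.

    action is chosen before the die is thrown: +1 ("advance") or -1 ("retreat").
    """
    roll = die if die is not None else random.randint(1, 6)
    nxt = max(0, min(100, state + action * roll))
    nxt = SNAKES.get(nxt, nxt)
    nxt = LADDERS.get(nxt, nxt)
    return nxt, (nxt in WIN or nxt in LOSS)

# Example: play one episode with the trivial "always advance" policy.
s, done, steps = 1, False, 0
while not done and steps < 10_000:
    s, done = step(s, +1)
    steps += 1
```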

Reinforcement Learning: Finding the optimal policy.

“Natural” Rewards: ±1 on “win”/“lose”, 0 otherwise.

Optimal policy maximizes total expected reward.

Dynamic programming quickly finds the optimal policy.

Probability of winning: pw = 0.97222...

But...


Claim:

It is not always desirable to find the optimal policy for that problem.


Hint: mean episode length of the optimal policy, d = 84.58333 steps.
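For reference, pw and d for any fixed policy can be read off the policy's transition model by solving the standard absorbing-chain equations; a hedged sketch follows, with a made-up two-state model in place of the actual board.

```python
import numpy as np

def win_prob_and_length(P, win, start=0):
    """pw and mean episode length d for a fixed policy.

    P[i, j] : transition probability between transient states i -> j under the policy.
    win[i]  : probability of jumping from transient state i straight to a "win" terminal.
    (Both are assumed inputs; the talk's board would supply the real ones.)
    """
    n = P.shape[0]
    A = np.eye(n) - P
    pw = np.linalg.solve(A, win)           # absorption probabilities into "win"
    d = np.linalg.solve(A, np.ones(n))     # expected number of steps to termination
    return pw[start], d[start]

# Made-up example: from either state, 40% win, 10% lose, 50% keep playing.
P = np.array([[0.25, 0.25],
              [0.25, 0.25]])
win = np.array([0.4, 0.4])
print(win_prob_and_length(P, win))         # (0.8, 2.0)
```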

Optimal policy revisited.

Seek winning.

Avoid losing.

Stay safe.


A simple, yet powerful idea.

Introduce a step punishment term −rstep so the agent has an incentive to terminate faster.

At time t,

$$r(t) = \begin{cases} +1 - r_{\text{step}} & \text{“win”} \\ -1 - r_{\text{step}} & \text{“loss”} \\ -r_{\text{step}} & \text{otherwise.} \end{cases}$$

Origin: maze rewards, −1 except on termination. Problem: rstep = ? (i.e., the cost of staying in the game is usually incommensurable with the terminal rewards.)
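Written out as code, the shifted reward is just the natural ±1 reward minus rstep on every step (the helper name is mine):

```python
def shaped_reward(outcome, r_step):
    """r(t): outcome is "win", "loss", or None while the episode continues."""
    return {"win": 1.0, "loss": -1.0}.get(outcome, 0.0) - r_step
```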

Better than optimal?

Optimal policy for rstep = 0

Optimal policy for rstep = 0.08701

pw = 0.48673 (was 0.97222 — 50.06%)

d = 11.17627 (was 84.58333 — 13.21%)

This policy maximizes pw/d.


Chess: White wins* (Uribe Muriel. Journal of Fabricated Results, Vol. 06, No. 8, 2010)

* in 10⁸ ply.

Visits only about the fifth root of the total number of valid states¹, but, if a ply takes one second, an average game will last three years and two months.

¹ Shannon, 1950.

Certainly unlikely to be the case, but in fact finding policies of maximum winning probability remains the usual goal in RL.


The discount factor γ, used to ensure values are finite, has an effect on episode length, but it is unpredictable and suboptimal (for the pw/d problem).

Main result.

For a general ±1-rewarded problem, there exists an r∗step for which the value-optimal solution maximizes pw/d and the value of the initial state is −1:

$$\exists\, r^*_{\text{step}} \;\Big|\; \pi^* = \operatorname*{argmax}_{\pi \in \Pi} v = \operatorname*{argmax}_{\pi \in \Pi} \frac{p_w}{d}, \qquad v^*(s_0) = -1.$$

Stating the obvious.

Every policy has a mean episode length d ≥ 1 and probability of winning 0 ≤ pw ≤ 1.

$$v = 2 p_w - 1 - r_{\text{step}}\, d$$

(Lemma: extensible to vectors using indicator variables.)

The proof rests on a solid foundation of duh!
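A quick arithmetic check with the numbers quoted earlier (rstep = 0.08701, pw = 0.48673, d = 11.17627), my own calculation:

$$v = 2(0.48673) - 1 - (0.08701)(11.17627) \approx 0.97346 - 1 - 0.97245 \approx -1,$$

consistent with the main result that the initial-state value is −1 at r∗step.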

Key substitution. The w − l space

$$w = \frac{p_w}{d}, \qquad l = \frac{1 - p_w}{d}$$

Each policy is represented by a unique point in the w − l plane.

The policy cloud is limited by the triangle with vertices (1,0), (0,1), and (0,0).
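Inverting the substitution (an intermediate step left implicit on the slides) gives the quantities plotted next:

$$w + l = \frac{1}{d} \;\Rightarrow\; d = \frac{1}{w+l}, \qquad p_w = w\, d = \frac{w}{w+l}.$$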

Execution and speed in the w − l space.

[Contour plots over the w − l plane; axes w (horizontal) and l (vertical), both from 0 to 1.]

Winning probability: $p_w = \frac{w}{w+l}$

Mean episode length: $d = \frac{1}{w+l}$

Proof Outline - Value in the w − l space.

$$v = \frac{w - l - r_{\text{step}}}{w + l}$$
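This follows directly from v = 2pw − 1 − rstep·d and the substitution (my own spelled-out step):

$$v = \frac{2w}{w+l} - 1 - \frac{r_{\text{step}}}{w+l} = \frac{w - l - r_{\text{step}}}{w + l}.$$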

So...

All level sets intersect at the same point, (rstep, −rstep).

There is a one-to-one relationship between values and slopes.

Value (for all rstep), mean episode length, and winning probability level sets are lines.

Optimal policies are on the convex hull of the policy cloud.

And done!

$$\pi^* = \operatorname*{argmax}_{\pi} \frac{p_w}{d} = \operatorname*{argmax}_{\pi}\ w$$

(Vertical level sets.) When vt ≈ −1, we’re there.
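The equivalence is immediate from the definitions (my own one-line check):

$$\frac{p_w}{d} = \frac{w/(w+l)}{1/(w+l)} = w.$$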

Algorithm

Set ε. Initialize π0. rstep ← 0.

Repeat:
    Find π+, vπ+ (solve from π0 by any RL method).
    rstep ← r′step
    π0 ← π+
Until |vπ+(s0) + 1| < ε.

On termination, π+ ≈ π∗.

rstep update, using a learning rate µ > 0:

$$r'_{\text{step}} = r_{\text{step}} + \mu \left[ v^{\pi^+}(s_0) + 1 \right]$$
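A minimal sketch of this outer loop, assuming a black-box solve(r_step, warm_start) that returns a (near) value-optimal policy and its initial-state value for the given rstep; the name and interface of solve are placeholders, not the talk's implementation.

```python
def find_pw_over_d_policy(solve, mu=0.1, eps=1e-4, max_iter=1_000):
    """Adjust r_step until the optimal value of the initial state is ~ -1.

    solve(r_step, warm_start) is an assumed black box (any RL/DP method) that
    returns (policy, v), where v is the value of the initial state under the
    rewards shifted by -r_step on every step.
    """
    r_step, policy = 0.0, None
    for _ in range(max_iter):
        policy, v = solve(r_step, warm_start=policy)
        if abs(v + 1.0) < eps:           # termination test |v(s0) + 1| < eps
            break
        r_step += mu * (v + 1.0)         # r'_step = r_step + mu * [v(s0) + 1]
    return policy, r_step
```

On termination the returned policy plays the role of π+ ≈ π∗.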

Optimal rstep update.

Minimizing the interval of rstep uncertainty in the next iteration.

Requires solving a minmax problem: either the root of an 8th-degree polynomial in r′step or the zero of the difference of two rational functions of order 4. (Easy using the secant method.)

O(log(1/ε)) complexity.
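The slide only notes that the root is easy to find with the secant method, so here is a generic secant root-finder as a sketch; f stands for the (unspecified) 8th-degree polynomial or rational-function difference in r′step and is an assumed callable.

```python
def secant_root(f, x0, x1, tol=1e-10, max_iter=100):
    """Generic secant method: returns x with f(x) close to 0."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if abs(f1 - f0) < 1e-15:         # guard against division by ~0
            break
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
        f0, f1 = f1, f(x1)
        if abs(f1) < tol:
            break
    return x1
```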

Extensions.

Problems solvable through a similar method:

Convex (linear) tradeoff: $\pi^* = \operatorname*{argmax}_{\pi\in\Pi} \{\alpha p_w - (1-\alpha)\, d\}$

Greedy tradeoff: $\pi^* = \operatorname*{argmax}_{\pi\in\Pi} \left\{\frac{2 p_w - 1}{d}\right\}$

Arbitrary tradeoffs: $\pi^* = \operatorname*{argmax}_{\pi\in\Pi} \left\{\frac{\alpha p_w - \beta}{d}\right\}$

Asymmetric rewards: rwin = a, rloss = −b; a, b ≥ 0.

Games with tie outcomes.

Games with multiple win / loss rewards.
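These variants stay tractable for the same reason (my own algebra): after the w − l substitution their objectives still have straight-line level sets,

$$\frac{2 p_w - 1}{d} = w - l, \qquad \frac{\alpha p_w - \beta}{d} = \alpha w - \beta\,(w+l), \qquad \alpha p_w - (1-\alpha)\, d = \frac{\alpha w - (1-\alpha)}{w+l},$$

so the same level-set argument applies.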

Harder family of problems

Maximize the probability of having won before n steps / m episodes.

Why? Non-linear level sets / non-convex functions in the w − l space.

Outline of future research. Towards robustness.

Policy variation in tasks with fixed episode length. Inclusion of time as a component of the state space.

Defining policy neighbourhoods.
1 Continuous/discrete statewise action neighbourhoods.
2 Discrete policy neighbourhoods for structured tasks.
3 General policy neighbourhoods.

Feature-robustness.
1 Value/Speed/Execution neighbourhoods in the w − l space.
2 Robustness as a trading off of features.

Can traditional Reinforcement Learning methods still be used to handle the learning?

Thank you. muriel@cs.colostate.edu - r-uribe@uniandes.edu.co

Untitled by Li Wei, School of Design, Oita University, 2009.
