Between winning slow and losing fast


BMAC presentation. Department of Computer Science. Colorado State University. Feb. 08, 2010.


Winning slow, losing fast, and in between.

Reinaldo A Uribe Muriel

Colorado State University. Prof. C. Anderson
Oita University. Prof. K. Shibata

Universidad de Los Andes. Prof. F. Lozano

February 8, 2010

It’s all fun and games until someone proves a theorem.

Outline

1 Fun and games

2 A theorem

3 An algorithm

A game: Snakes & Ladders

Board: Crawford & Son, Melbourne, 1901. (Source: http://www.naa.gov.au/)

Player advances the number of steps indicated by a die.

Landing on a snake’s mouth sends the player back to the tail.

Landing on a ladder’s bottom moves the player forward to the top.

Goal: reaching state 100.


Boring! (No skill required, only luck.)


Variation: Decision Snakes and Ladders

Sets of “win” and “loss” terminal states.

Actions: either “advance” or “retreat,” to be decided before throwing the die.
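To make the setup concrete, here is a minimal sketch of the decision variant as an episodic environment. The snake, ladder, and terminal squares below are placeholders, not the Crawford & Son board or the win/loss sets used in the talk, and the function names are my own.

```python
import random

# Placeholder board (NOT the 1901 Crawford & Son layout used in the talk).
SNAKES  = {17: 4, 54: 19, 62: 18, 98: 79}   # snake's mouth -> tail
LADDERS = {3: 22, 11: 49, 28: 76, 80: 99}   # ladder's bottom -> top
WIN, LOSS = {100}, {0}                      # assumed "win" / "loss" terminal sets

def step(state, action, die=None):
    """One turn of Decision Snakes and Ladders.

    action is chosen before the die is thrown: +1 ("advance") or -1 ("retreat").
    """
    roll = die if die is not None else random.randint(1, 6)
    nxt = max(0, min(100, state + action * roll))
    nxt = SNAKES.get(nxt, nxt)
    nxt = LADDERS.get(nxt, nxt)
    return nxt, (nxt in WIN or nxt in LOSS)

# Example: play one episode with the trivial "always advance" policy.
s, done, steps = 1, False, 0
while not done and steps < 10_000:
    s, done = step(s, +1)
    steps += 1
```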

Reinforcement Learning: Finding the optimal policy.

“Natural” Rewards: ±1 on “win”/“lose”, 0 otherwise.

Optimal policy maximizes total expected reward.

Dynamic programming quickly finds the optimal policy.

Probability of winning: pw = 0.97222...

But...


Claim:

It is not always desirable to find the optimal policy for that problem.


Hint: mean episode length of the optimal policy, d = 84.58333 steps.
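For reference, pw and d for any fixed policy can be read off the policy's transition model by solving the standard absorbing-chain equations; a hedged sketch follows, with a made-up two-state model in place of the actual board.

```python
import numpy as np

def win_prob_and_length(P, win, start=0):
    """pw and mean episode length d for a fixed policy.

    P[i, j] : transition probability between transient states i -> j under the policy.
    win[i]  : probability of jumping from transient state i straight to a "win" terminal.
    (Both are assumed inputs; the talk's board would supply the real ones.)
    """
    n = P.shape[0]
    A = np.eye(n) - P
    pw = np.linalg.solve(A, win)           # absorption probabilities into "win"
    d = np.linalg.solve(A, np.ones(n))     # expected number of steps to termination
    return pw[start], d[start]

# Made-up example: from either state, 40% win, 10% lose, 50% keep playing.
P = np.array([[0.25, 0.25],
              [0.25, 0.25]])
win = np.array([0.4, 0.4])
print(win_prob_and_length(P, win))         # (0.8, 2.0)
```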

Optimal policy revisited.

Seek winning.

Avoid losing.

Stay safe.


A simple, yet powerful idea.

Introduce a step punishment term −rstep so the agent has an incentive to terminate faster.

At time t,

$$r(t) = \begin{cases} +1 - r_{\text{step}} & \text{“win”} \\ -1 - r_{\text{step}} & \text{“loss”} \\ -r_{\text{step}} & \text{otherwise.} \end{cases}$$

Origin: maze rewards, −1 except on termination. Problem: rstep = ? (i.e., the cost of staying in the game is usually incommensurable with the terminal rewards.)
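Written out as code, the shifted reward is just the natural ±1 reward minus rstep on every step (the helper name is mine):

```python
def shaped_reward(outcome, r_step):
    """r(t): outcome is "win", "loss", or None while the episode continues."""
    return {"win": 1.0, "loss": -1.0}.get(outcome, 0.0) - r_step
```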

Better than optimal?

Optimal policy for rstep = 0

Optimal policy for rstep = 0.08701

pw = 0.48673 (was 0.97222 — 50.06%)

d = 11.17627 (was 84.58333 — 13.21%)

This policy maximizes pw/d.


Chess: White wins* (Uribe Muriel. Journal of Fabricated Results, Vol. 06, No. 8, 2010)

* in 10⁸ ply.

Visits only about the fifth root of the total number of valid states¹, but, if a ply takes one second, an average game will last three years and two months.

¹ Shannon, 1950.

Certainly unlikely to be the case, but in fact finding policies of maximum winning probability remains the usual goal in RL.


The discount factor γ, used to ensure values are finite, has an effect on episode length, but it is unpredictable and suboptimal (for the pw/d problem).

Main result.

For a general ±1-rewarded problem, there exists an r∗step for which the value-optimal solution maximizes pw/d and the value of the initial state is −1:

$$\exists\, r^*_{\text{step}} \;\Big|\; \pi^* = \operatorname*{argmax}_{\pi \in \Pi} v = \operatorname*{argmax}_{\pi \in \Pi} \frac{p_w}{d}, \qquad v^*(s_0) = -1.$$

Stating the obvious.

Every policy has a mean episode length d ≥ 1 and probability of winning 0 ≤ pw ≤ 1.

$$v = 2 p_w - 1 - r_{\text{step}}\, d$$

(Lemma: extensible to vectors using indicator variables.)

The proof rests on a solid foundation of duh!
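A quick arithmetic check with the numbers quoted earlier (rstep = 0.08701, pw = 0.48673, d = 11.17627), my own calculation:

$$v = 2(0.48673) - 1 - (0.08701)(11.17627) \approx 0.97346 - 1 - 0.97245 \approx -1,$$

consistent with the main result that the initial-state value is −1 at r∗step.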

Key substitution. The w − l space

$$w = \frac{p_w}{d}, \qquad l = \frac{1 - p_w}{d}$$

Each policy is represented by a unique point in the w − l plane.

The policy cloud is limited by the triangle with vertices (1,0), (0,1), and (0,0).
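Inverting the substitution (an intermediate step left implicit on the slides) gives the quantities plotted next:

$$w + l = \frac{1}{d} \;\Rightarrow\; d = \frac{1}{w+l}, \qquad p_w = w\, d = \frac{w}{w+l}.$$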

Execution and speed in the w − l space.

[Contour plots over the w − l plane; axes w (horizontal) and l (vertical), both from 0 to 1.]

Winning probability: $p_w = \frac{w}{w+l}$

Mean episode length: $d = \frac{1}{w+l}$

Proof Outline - Value in the w − l space.

$$v = \frac{w - l - r_{\text{step}}}{w + l}$$
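This follows directly from v = 2pw − 1 − rstep·d and the substitution (my own spelled-out step):

$$v = \frac{2w}{w+l} - 1 - \frac{r_{\text{step}}}{w+l} = \frac{w - l - r_{\text{step}}}{w + l}.$$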

So...

All level sets intersect at the same point, (rstep, −rstep).

There is a one-to-one relationship between values and slopes.

Value (for all rstep), mean episode length, and winning probability level sets are lines.

Optimal policies are on the convex hull of the policy cloud.

And done!

$$\pi^* = \operatorname*{argmax}_{\pi} \frac{p_w}{d} = \operatorname*{argmax}_{\pi}\ w$$

(Vertical level sets.) When vt ≈ −1, we’re there.
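The equivalence is immediate from the definitions (my own one-line check):

$$\frac{p_w}{d} = \frac{w/(w+l)}{1/(w+l)} = w.$$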

Algorithm

Set ε. Initialize π0. rstep ← 0.

Repeat:
    Find π+, vπ+ (solve from π0 by any RL method).
    rstep ← r′step
    π0 ← π+
Until |vπ+(s0) + 1| < ε.

On termination, π+ ≈ π∗.

rstep update, using a learning rate µ > 0:

$$r'_{\text{step}} = r_{\text{step}} + \mu \left[ v^{\pi^+}(s_0) + 1 \right]$$
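A minimal sketch of this outer loop, assuming a black-box solve(r_step, warm_start) that returns a (near) value-optimal policy and its initial-state value for the given rstep; the name and interface of solve are placeholders, not the talk's implementation.

```python
def find_pw_over_d_policy(solve, mu=0.1, eps=1e-4, max_iter=1_000):
    """Adjust r_step until the optimal value of the initial state is ~ -1.

    solve(r_step, warm_start) is an assumed black box (any RL/DP method) that
    returns (policy, v), where v is the value of the initial state under the
    rewards shifted by -r_step on every step.
    """
    r_step, policy = 0.0, None
    for _ in range(max_iter):
        policy, v = solve(r_step, warm_start=policy)
        if abs(v + 1.0) < eps:           # termination test |v(s0) + 1| < eps
            break
        r_step += mu * (v + 1.0)         # r'_step = r_step + mu * [v(s0) + 1]
    return policy, r_step
```

On termination the returned policy plays the role of π+ ≈ π∗.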

Optimal rstep update.

Minimizing the interval of rstep uncertainty in the next iteration.

Requires solving a minmax problem: either the root of an 8th-degree polynomial in r′step or the zero of the difference of two rational functions of order 4. (Easy using the secant method.)

O(log(1/ε)) complexity.
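The slide only notes that the root is easy to find with the secant method, so here is a generic secant root-finder as a sketch; f stands for the (unspecified) 8th-degree polynomial or rational-function difference in r′step and is an assumed callable.

```python
def secant_root(f, x0, x1, tol=1e-10, max_iter=100):
    """Generic secant method: returns x with f(x) close to 0."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if abs(f1 - f0) < 1e-15:         # guard against division by ~0
            break
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
        f0, f1 = f1, f(x1)
        if abs(f1) < tol:
            break
    return x1
```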

Extensions.

Problems solvable through a similar method:

Convex (linear) tradeoff: $\pi^* = \operatorname*{argmax}_{\pi\in\Pi} \{\alpha p_w - (1-\alpha)\, d\}$

Greedy tradeoff: $\pi^* = \operatorname*{argmax}_{\pi\in\Pi} \left\{\frac{2 p_w - 1}{d}\right\}$

Arbitrary tradeoffs: $\pi^* = \operatorname*{argmax}_{\pi\in\Pi} \left\{\frac{\alpha p_w - \beta}{d}\right\}$

Asymmetric rewards: rwin = a, rloss = −b; a, b ≥ 0.

Games with tie outcomes.

Games with multiple win / loss rewards.
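These variants stay tractable for the same reason (my own algebra): after the w − l substitution their objectives still have straight-line level sets,

$$\frac{2 p_w - 1}{d} = w - l, \qquad \frac{\alpha p_w - \beta}{d} = \alpha w - \beta\,(w+l), \qquad \alpha p_w - (1-\alpha)\, d = \frac{\alpha w - (1-\alpha)}{w+l},$$

so the same level-set argument applies.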

Harder family of problems

Maximize the probability of having won before n steps / m episodes.

Why? Non-linear level sets / non-convex functions in the w − l space.

Outline of future research. Towards robustness.

Policy variation in tasks with fixed episode length. Inclusion of time as a component of the state space.

Defining policy neighbourhoods.
1 Continuous/discrete statewise action neighbourhoods.
2 Discrete policy neighbourhoods for structured tasks.
3 General policy neighbourhoods.

Feature-robustness.
1 Value/Speed/Execution neighbourhoods in the w − l space.
2 Robustness as a trading off of features.

Can traditional Reinforcement Learning methods still be used to handle the learning?

Thank you. muriel@cs.colostate.edu - r-uribe@uniandes.edu.co

Untitled by Li Wei, School of Design, Oita University, 2009.
