1
Life: play and win in 20 trillion moves
Stuart Russell
Based on work with Ron Parr, David Andre, Carlos Guestrin, Bhaskara Marthi, and Jason Wolfe
White to play and win in 2 moves
2
Life
100 years x 365 days x 24 hrs x 3600 seconds x 640 muscles x 10/second = 20 trillion actions
(Not to mention choosing brain activities!)
The world has a very large, partially observable state space and an unknown model
So, being intelligent is "provably" hard
How on earth do we manage?
3
4
Hierarchical structure!! (among other things)
• Deeply nested behaviors give modularity
  E.g., tongue controls for [t] are independent of everything else, given the decision to say "tongue"
• High-level decisions give scale
  E.g., "go to IJCAI 13" ≈ 3,000,000,000 actions
• Look further ahead, reduce computation exponentially
[NIPS 97, ICML 99, NIPS 00, AAAI 02, ICML 03, IJCAI 05, ICAPS 07, ICAPS 08, ICAPS 10, IJCAI 11]
Reminder: MDPs
A Markov decision process (MDP) is defined by:
• Set of states S
• Set of actions A
• Transition model P(s' | s,a) (Markovian, i.e., depends only on st and not on history)
• Reward function R(s,a,s')
• Discount factor γ
The optimal policy π* maximizes the value function Vπ(s) = Σt γ^t R(st, π(st), st+1)
Q(s,a) = Σs' P(s' | s,a) [ R(s,a,s') + γ V(s') ]
5
Reminder: policy invariance
Given an MDP M with reward function R(s,a,s') and any potential function Φ(s), any policy that is optimal for M is also optimal for any version M' with reward function R'(s,a,s') = R(s,a,s') + γΦ(s') - Φ(s)
So we can provide hints to speed up RL by shaping the reward function
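To make this concrete, here is a minimal Python sketch (not from the talk; the environment, reward, and potential below are invented for illustration) of potential-based reward shaping:

# Minimal sketch of potential-based reward shaping.
# `phi` is any potential function on states; gamma is the MDP's discount factor.

def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Return R'(s,a,s') = R(s,a,s') + gamma*phi(s') - phi(s).

    Any policy optimal for the original reward is optimal for the
    shaped reward, so phi can encode hints without changing the task.
    """
    return r + gamma * phi(s_next) - phi(s)

# Example: in a navigation task, negative distance-to-goal is a natural potential.
goal = (6, 5)
phi = lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))
print(shaped_reward(-1.0, (2, 3), (3, 3), phi))  # step cost plus shaping bonus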
6
Reminder: Bellman equation
Bellman equation for a fixed policy π:
Vπ(s) = Σs' P(s' | s,π(s)) [ R(s,π(s),s') + γ Vπ(s') ]
Bellman equation for the optimal value function:
V*(s) = maxa Σs' P(s' | s,a) [ R(s,a,s') + γ V*(s') ]
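A minimal sketch of iterating the optimal Bellman equation to convergence (value iteration); the tiny two-state MDP here is invented purely for illustration:

# Sketch: value iteration applying V(s) = max_a sum_s' P(s'|s,a)[R(s,a,s') + gamma V(s')].
# P[s][a] is a list of (prob, s_next, reward) outcomes.
P = {
    's0': {'stay': [(1.0, 's0', 0.0)], 'go': [(0.8, 's1', 1.0), (0.2, 's0', 0.0)]},
    's1': {'stay': [(1.0, 's1', 0.0)], 'go': [(1.0, 's0', 0.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in P}

for _ in range(100):  # repeat the Bellman backup until (near) convergence
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values())
         for s in P}

print({s: round(v, 3) for s, v in V.items()})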
7
Reminder: Reinforcement learning
Agent observes transitions and rewards: doing a in s leads to s' with reward r
Temporal-difference (TD) learning: V(s) <- (1-α) V(s) + α [ r + γ V(s') ]
Q-learning: Q(s,a) <- (1-α) Q(s,a) + α [ r + γ maxa' Q(s',a') ]
SARSA (after observing the next action a'): Q(s,a) <- (1-α) Q(s,a) + α [ r + γ Q(s',a') ]
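The three updates above translate directly into tabular code; a minimal sketch (tables and the values of alpha, gamma are placeholders, not from the talk):

# Tabular TD, Q-learning, and SARSA updates from this slide.
from collections import defaultdict

alpha, gamma = 0.1, 0.99
V = defaultdict(float)
Q = defaultdict(float)  # keyed by (state, action)

def td_update(s, r, s_next):
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])

def q_learning_update(s, a, r, s_next, actions):
    best = max(Q[(s_next, a2)] for a2 in actions)  # off-policy: best next action
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best)

def sarsa_update(s, a, r, s_next, a_next):
    # on-policy: uses the action a_next actually taken in s_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s_next, a_next)])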
8
Reminder: RL issues
• Credit assignment problem
• Generalization problem
9
Abstract states
First try: define an abstract MDP on abstract states -- “at home,” “at SFO,” “at IJCAI,” etc.
12
Oops – lost the Markov property! Action outcome depends on the low-level state, which depends on history
Abstract actions
Forestier & Varaiya, 1978: supervisory control
• An abstract action is a fixed policy for a region
• "Choice states" where the policy exits the region
• The decision process on choice states is semi-Markov
Semi-Markov decision processes
Add a random duration N for action executions: P(s',N | s,a)
Modified Bellman equation:
Vπ(s) = Σs',N P(s',N | s,π(s)) [ R(s,π(s),s',N) + γ^N Vπ(s') ]
Modified Q-learning with abstract actions:
• For transitions within an abstract action πa: rc <- rc + γc R(s,πa(s),s') ; γc <- γc γ
• For transitions that terminate an action begun at s0:
Q(s0,πa) <- (1-α) Q(s0,πa) + α [ rc + γc maxa' Q(s',πa') ]
14
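A sketch of this SMDP update in Python (the env and pi_a interfaces are assumptions, not the authors' code): accumulate the discounted reward rc and effective discount gc while the abstract action runs, then back up once at the state where it began.

# Sketch of SMDP Q-learning with abstract actions.
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)  # keyed by (state, abstract_action)

def execute_and_update(env, s0, pi_a, abstract_actions):
    """Run abstract action pi_a from s0 to termination, then do one SMDP backup."""
    rc, gc, s = 0.0, 1.0, s0
    while not pi_a.terminated(s):
        s_next, r = env.step(pi_a.act(s))  # follow the action's fixed internal policy
        rc += gc * r                       # rc <- rc + gc * R(s, pi_a(s), s')
        gc *= gamma                        # gc <- gc * gamma
        s = s_next
    target = rc + gc * max(Q[(s, a2)] for a2 in abstract_actions)
    Q[(s0, pi_a)] = (1 - alpha) * Q[(s0, pi_a)] + alpha * target
    return s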
Hierarchical abstract machines
Hierarchy of NDFAs; states come in 4 types:
• Action states execute a physical action
• Call states invoke a lower-level NDFA
• Choice points nondeterministically select the next state
• Stop states return control to the invoking NDFA
NDFA internal transitions are dictated by the percept
15
16
Running example
Peasants can move, pickup, and dropoff
• Penalty for collision
• Cost-of-living each step
• Reward for dropping off resources
Goal: gather 10 gold + 10 wood
≈ (3L)^n states s, 7^n primitive actions a (for n peasants)
Part of 1-peasant HAM
17
Joint SMDP
Physical states S, machine states Θ, joint state space Ω = S x Θ
Joint MDP: transition model for the physical state (from the physical MDP) and the machine state (as defined by the HAM)
A(ω) is fixed except when θ is at a choice point
Reduction to SMDP: states are just those where θ is at a choice point
18
Hierarchical optimality
Theorem: the optimal solution to the SMDP is hierarchically optimal for the original MDP, i.e., maximizes value among all behaviors consistent with the HAM
19
Partial programs
HAMs (and Dietterich's MAXQ hierarchies and Sutton's Options) are all partially constrained descriptions of behaviors with internal state
A partial program is like an ordinary program in any programming language, with a nondeterministic choice operator added to allow for ignorance about what to do
HAMs, MAXQs, and Options are trivially expressible as partial programs
Standard MDP: loop forever, choose an action
20
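To make the choice operator concrete, a minimal Python sketch (ALisp's actual semantics are richer; all interfaces here are invented): a choose that defers to a learned Q-function turns the trivial partial program above back into the standard MDP.

# Sketch: a partial program is ordinary code plus a nondeterministic `choose`;
# the learner supplies the completion by resolving each choice from a learned
# Q-function. The env and Q interfaces are illustrative assumptions.
import random

def choose(label, options, Q, omega, epsilon=0.1):
    """Resolve a choice point epsilon-greedily from the learned Q-function."""
    if random.random() < epsilon:
        return random.choice(options)
    return max(options, key=lambda u: Q.get((omega, label, u), 0.0))

def standard_mdp_program(env, Q, actions):
    """The degenerate partial program: loop forever, choose an action."""
    s = env.reset()
    while True:
        a = choose('top-choice', actions, Q, omega=s)
        s, r = env.step(a)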
21
RL and partial programs
[Diagram: the learning algorithm exchanges actions a and percepts s,r with the environment and computes a completion of the partial program]
Hierarchically optimal for all terminating programs
23
(defun top ()
  (loop do (choose 'top-choice
             (gather-wood)
             (gather-gold))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
Single-threaded ALisp program
Program state θ includes:
• Program counter
• Call stack
• Global variables
24
Q-functions
Represent completions using a Q-function
Joint state ω = [s,θ]: environment state + program state
MDP + partial program = SMDP over {ω}
Qπ(ω,u) = "expected total reward if I make choice u in ω and follow completion π thereafter"
Modified Q-learning [AR 02] finds the optimal completion

ω | u | Q(ω,u)
... | ... | ...
At nav-choice, Pos=(2,3), Dest=(6,5), ... | North | 15
At resource-choice, Pos=(2,3), Gold=7, Wood=3, ... | Gather-wood | -42
... | ... | ...
25
Internal state
Availability of internal state (e.g., the goal stack) can greatly simplify value functions and policies
E.g., while navigating to location (x,y), moving towards (x,y) is a good idea
The natural local shaping potential (distance from destination) is impossible to express in external terms
26
Temporal Q-decomposition
[Diagram: call tree Top → GatherWood / GatherGold → Nav(Forest1) / Nav(Mine2), with Q split into Qr, Qc, Qe]
• Temporal decomposition Q = Qr + Qc + Qe, where
• Qr(ω,u) = reward while doing u (may be many steps)
• Qc(ω,u) = reward in current subroutine after doing u
• Qe(ω,u) = reward after current subroutine
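A tiny sketch of how the decomposition is used at a choice point (tables and names invented for illustration): the three components are learned separately but summed when comparing choices.

# Sketch: Q(omega,u) = Qr + Qc + Qe; each component is a simple table here.
from collections import defaultdict

Qr = defaultdict(float)  # reward while doing u
Qc = defaultdict(float)  # reward in current subroutine after u
Qe = defaultdict(float)  # reward after the current subroutine exits

def q_total(omega, u):
    return Qr[(omega, u)] + Qc[(omega, u)] + Qe[(omega, u)]

def best_choice(omega, options):
    """Compare choices by the summed value, though components are learned separately."""
    return max(options, key=lambda u: q_total(omega, u))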
27
State abstraction
• Temporal decomposition => state abstraction
• E.g., while navigating, Qc is independent of gold reserves
• In general, local Q-components can depend on few variables => fast learning
28
Handling multiple effectors
Multithreaded agent programs:
• Threads = tasks (e.g., Get-wood, Get-gold, Defend-Base)
• Each effector assigned to a thread
• Threads can be created/destroyed
• Effectors can be reassigned
• Effectors can be created/destroyed
29
An example Concurrent ALisp program:

(defun top ()
  (loop do
    (until (my-effectors) (choose 'dummy))
    (setf peas (first (my-effectors)))
    (choose 'top-choice
      (spawn gather-wood peas)
      (spawn gather-gold peas))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

An example single-threaded ALisp program:

(defun top ()
  (loop do (choose 'top-choice
             (gather-gold)
             (gather-wood))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (my-pos) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
30
(defun gather-wood ()
  (setf dest (choose *forests*))
  (nav dest)
  (action 'get-wood)
  (nav *home-base-loc*)
  (action 'put-wood))

(defun gather-gold ()
  (setf dest (choose *goldmines*))
  (nav dest)
  (action 'get-gold)
  (nav *home-base-loc*)
  (action 'put-gold))

(defun nav (dest)
  (loop until (at-pos dest)
        do (setf d (choose '(N S E W R)))
           (action d)))

[Animation: two threads, one per peasant, run gather-wood (in nav(Wood1)) and gather-gold (in nav(Gold2)); across environment timesteps 26-27 each thread cycles between Running, Paused/making joint choice, and Waiting for joint action]
Concurrent ALisp semantics
31
• Threads execute independently until they hit a choice or action
• Wait until all threads are at a choice or action
• If all effectors have been assigned an action, do that joint action in the environment
• Otherwise, make a joint choice
32
Q-functions
To complete the partial program, at each choice state ω we need to specify choices for all choosing threads
So Q(ω,u) as before, except u is a joint choice
Suitable SMDP Q-learning gives the optimal completion
Example Q-function:

ω | u | Q(ω,u)
... | ... | ...
Peas1 at NavChoice, Peas2 at DropoffGold, Peas3 at ForestChoice, Pos1=(2,3), Pos3=(7,4), Gold=12, Wood=14 | (Peas1:East, Peas3:Forest2) | 15.7
... | ... | ...
34
Problems with concurrent activities
• Temporal decomposition of the Q-function is lost
• No credit assignment among threads
Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly: peasant 2 thinks he's done very well!!
Significantly slows learning as the number of peasants increases
35
Threadwise decomposition
Idea: decompose the reward among threads (Russell & Zimdars, 2003)
E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants
Qjπ(ω,u) = "expected total reward received by thread j if we make joint choice u and then follow π"
Threadwise Q-decomposition: Q = Q1 + … + Qn
Recursively distributed SARSA => global optimality
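A sketch of the idea in Python (a SARSA-style update following Russell & Zimdars, 2003; the tables and the per-thread reward decomposition are illustrative assumptions): each thread learns its own component from its own reward stream, while the joint choice maximizes the sum.

# Sketch of threadwise Q-decomposition with a SARSA-style update.
from collections import defaultdict
from itertools import product

alpha, gamma = 0.1, 0.99

def joint_choice(Qs, omega, options_per_thread):
    """Pick the joint choice u maximizing sum_j Qj(omega, u)."""
    return max(product(*options_per_thread),
               key=lambda u: sum(Q[(omega, u)] for Q in Qs))

def threadwise_sarsa_update(Qs, omega, u, rewards, omega2, u2):
    """Each thread j updates its own Qj toward rj + gamma * Qj(omega', u')."""
    for Q, rj in zip(Qs, rewards):
        Q[(omega, u)] = (1 - alpha) * Q[(omega, u)] + alpha * (rj + gamma * Q[(omega2, u2)])

# Example setup: three peasant threads, each with its own Q table.
Qs = [defaultdict(float) for _ in range(3)]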
36
Learning the threadwise decomposition
[Diagram: at an action state, the environment reward r is decomposed (decomp) into per-thread rewards r1, r2, r3 for the Peasant 1, Peasant 2, and Peasant 3 threads beneath the Top thread]
37
Learning the threadwise decomposition
[Diagram: at a choice state ω, each peasant thread maintains its own component Q1(ω,·), Q2(ω,·), Q3(ω,·); these are summed to Q(ω,·), the joint choice u = argmax is taken, and each thread performs its own Q-update]
38
Threadwise and temporal decomposition
Qj = Qj,r + Qj,c + Qj,e, where
• Qj,r(ω,u) = expected reward gained by thread j while doing u
• Qj,c(ω,u) = expected reward gained by thread j after u but before leaving the current subroutine
• Qj,e(ω,u) = expected reward gained by thread j after the current subroutine
39
[Plot: reward of learnt policy vs. num steps learning (x 1000) for Flat, Undecomposed, Threadwise, and Threadwise + Temporal learners; resource gathering with 15 peasants]
40
41
Summary of Part I
Structure in behavior seems essential for scaling up
Partial programs:
• Provide natural structural constraints on policies
• Decompose value functions into simple components
• Include internal state (e.g., "goals") that further simplifies value functions, shaping rewards
Concurrency:
• Simplifies description of multieffector behavior
• Messes up temporal decomposition and credit assignment (but threadwise reward decomposition restores it)
42
Outline
• Hierarchical RL with abstract actions
• Hierarchical RL with partial programs
• Efficient hierarchical planning
43
Abstract Lookahead
• k-step lookahead >> 1-step lookahead (e.g., chess)
• k-step lookahead is no use if the steps are too small (e.g., the first k characters of a NIPS paper)
• Abstract plans (high-level executions of partial programs) are shorter
• Much shorter plans for given goals => exponential savings
• Can look ahead much further into the future
[Diagram: a chess lookahead tree (Kc3 Rxa1 / Qa4 c2 / c2 Kxa1) beside a character-level tree over 'a' and 'b' (a, b; aa, ab, ba, bb)]
44
High-level actions
Start with a restricted classical HTN form of partial program. A high-level action (HLA):
• Has a set of possible immediate refinements into sequences of actions, primitive or high-level
• Each refinement may have a precondition on its use
• Executable plan space = all primitive refinements of Act (see the sketch below)
[Refinement tree: [Act] refines to [GoSFO, Act] or [GoOAK, Act]; GoSFO refines to [WalkToBART, BARTtoSFO], giving [WalkToBART, BARTtoSFO, Act]]
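A minimal sketch of this structure in Python (the names Act, GoSFO, GoOAK, WalkToBART, BARTtoSFO follow the slide; the data structure and expansion budget are illustrative, and actions without refinement entries are treated as primitive):

# Sketch: enumerate primitive refinements of a high-level plan.
refinements = {
    'Act':   [['GoSFO', 'Act'], ['GoOAK', 'Act'], []],   # [] = stop acting
    'GoSFO': [['WalkToBART', 'BARTtoSFO']],
}

def primitive_refinements(plan, budget=3):
    """Yield primitive refinements of `plan`, expanding HLAs left to right."""
    i = next((k for k, a in enumerate(plan) if a in refinements), None)
    if i is None:          # no HLA left: the plan is fully primitive
        yield plan
        return
    if budget == 0:
        return             # expansion budget exhausted
    for seq in refinements[plan[i]]:
        yield from primitive_refinements(plan[:i] + seq + plan[i + 1:], budget - 1)

for p in primitive_refinements(['Act']):
    print(p)   # e.g., ['WalkToBART', 'BARTtoSFO'], ['GoOAK'], ...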
Lookahead with HLAs
To do offline or online planning with HLAs, need a model
See classical HTN planning literature – not! Drew McDermott (AI Magazine, 2000):
The semantics of hierarchical planning have never been clarified … no one has ever figured out how to reconcile the semantics of hierarchical plans with the semantics of primitive actions
46
47
Downward refinement
Downward refinement property (DRP): every apparently successful high-level plan has a successful primitive refinement
"Holy Grail" of HTN planning: allows commitment to abstract plans without backtracking
Bacchus and Yang, 1991: it is naïve to expect the DRP to hold in general
MRW, 2007: Theorem: if assertions about HLAs are true, then the DRP always holds
Problem: how to say true things about effects of HLAs?
51
Angelic semantics for HLAs
[Diagrams: state space S with initial state s0 and goal g; reachable sets of HLAs h1 and h2 and of the sequence h1h2; since the reachable set of h1h2 contains g, h1h2 is a solution. A primitive sequence a1 a2 a3 a4 traces a single path through the state space]
• Begin with the atomic state-space view: start in s0, goal state g
• The central idea is the reachable set of an HLA from each state
• When extended to sequences of actions, this allows proving that a plan can or cannot possibly reach the goal
• May seem related to nondeterminism, but the nondeterminism is angelic: the "uncertainty" will be resolved by the agent, not an adversary or nature
52
Technical development
• NCSTRIPS to describe reachable sets: STRIPS add/delete plus "possibly add", "possibly delete", etc. E.g., GoSFO adds AtSFO, possibly adds CarAtSFO
• Reachable sets may be too hard to describe exactly: upper and lower descriptions bound the exact reachable set (and its cost) above and below; they still support proofs of plan success/failure, and possibly-successful plans must be refined
• Sound, complete, optimal offline planning algorithms
• Complete, eventually-optimal online (real-time) search
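A hedged sketch of the core test these bounds enable (all interfaces invented here, and reachable sets are represented as explicit state sets rather than the DNF formulae used in the papers): lower and upper descriptions let a planner commit to provably successful plans and prune provably failing ones.

# Sketch: angelic plan evaluation with reachable-set bounds.
def evaluate_plan(plan, start_states, goal, lower, upper):
    """plan: list of HLA names; lower/upper: name -> (state set -> state set)."""
    lo, hi = set(start_states), set(start_states)
    for hla in plan:
        lo, hi = lower[hla](lo), upper[hla](hi)
    if any(goal(s) for s in lo):
        return 'provably successful'   # safe to commit to this abstract plan
    if not any(goal(s) for s in hi):
        return 'provably fails'        # prune: no primitive refinement can succeed
    return 'refine further'            # bounds are inconclusive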
53
Example – Warehouse World
Has similarities to blocks and taxi domains, but more choices and constraints:
• Gripper must stay in bounds
• Can't pass through blocks
• Can only turn around at top row
Goal: have C on T4
• Can't just move directly
• Final plan has 22 steps:
Left, Down, Pickup, Up, Turn, Down, Putdown, Right, Right, Down, Pickup, Left, Put, Up, Left, Pickup, Up, Turn, Right, Down, Down, Putdown
54
Representing descriptions: NCSTRIPS
• Assume S = the set of truth assignments to propositions
• Descriptions specify the propositions (possibly) added/deleted by an HLA
• An efficient algorithm exists to progress state sets (represented as DNF formulae) through descriptions
Example: Navigate(xt,yt) (Pre: At(xs,ys))
Upper: -At(xs,ys), +At(xt,yt), FacingRight
Lower: IF (Free(xt,yt) ∧ Free(x,ymax)): -At(xs,ys), +At(xt,yt), FacingRight; ELSE: nil
55
Experiment: Instance 3
• 5x8 world, 90-step plan
• Flat/hierarchical planning without descriptions did not terminate within 10,000 seconds
56
Online Search: Preliminary Results
[Plot: y-axis shows steps to reach goal]