1
Life: play and win in 20 trillion moves
Stuart Russell
Based on work with Ron Parr, David Andre, Carlos Guestrin, Bhaskara Marthi, and Jason Wolfe
White to play and win in 2 moves
2
Life
100 years x 365 days x 24 hrs x 3600 seconds x 640 muscles x 10/second = 20 trillion actions
(Not to mention choosing brain activities!)
The world has a very large, partially observable state space and an unknown model
So, being intelligent is "provably" hard
How on earth do we manage?
3
4
Hierarchical structure!! (among other things)
• Deeply nested behaviors give modularity
  E.g., tongue controls for [t] are independent of everything else, given the decision to say "tongue"
• High-level decisions give scale
  E.g., "go to IJCAI 13" ≈ 3,000,000,000 actions
• Look further ahead, reduce computation exponentially
[NIPS 97, ICML 99, NIPS 00, AAAI 02, ICML 03, IJCAI 05, ICAPS 07, ICAPS 08, ICAPS 10, IJCAI 11]
Reminder: MDPs
A Markov decision process (MDP) is defined by:
• Set of states S
• Set of actions A
• Transition model P(s' | s,a) (Markovian, i.e., depends only on st and not on history)
• Reward function R(s,a,s')
• Discount factor γ
The optimal policy π* maximizes the value function Vπ(s) = Σt γ^t R(st, π(st), st+1)
Q(s,a) = Σs' P(s' | s,a) [ R(s,a,s') + γ V(s') ]
5
Reminder: policy invariance
Given an MDP M with reward function R(s,a,s') and any potential function Φ(s), any policy that is optimal for M is also optimal for any version M' with reward function R'(s,a,s') = R(s,a,s') + γΦ(s') - Φ(s)
So we can provide hints to speed up RL by shaping the reward function
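To make this concrete, here is a minimal Python sketch (not from the talk; the environment, reward, and potential below are invented for illustration) of potential-based reward shaping:

# Minimal sketch of potential-based reward shaping.
# `phi` is any potential function on states; gamma is the MDP's discount factor.

def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Return R'(s,a,s') = R(s,a,s') + gamma*phi(s') - phi(s).

    Any policy optimal for the original reward is optimal for the
    shaped reward, so phi can encode hints without changing the task.
    """
    return r + gamma * phi(s_next) - phi(s)

# Example: in a navigation task, negative distance-to-goal is a natural potential.
goal = (6, 5)
phi = lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))
print(shaped_reward(-1.0, (2, 3), (3, 3), phi))  # step cost plus shaping bonus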
6
Reminder: Bellman equation
Bellman equation for a fixed policy π:
Vπ(s) = Σs' P(s' | s,π(s)) [ R(s,π(s),s') + γ Vπ(s') ]
Bellman equation for the optimal value function:
V*(s) = maxa Σs' P(s' | s,a) [ R(s,a,s') + γ V*(s') ]
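A minimal sketch of iterating the optimal Bellman equation to convergence (value iteration); the tiny two-state MDP here is invented purely for illustration:

# Sketch: value iteration applying V(s) = max_a sum_s' P(s'|s,a)[R(s,a,s') + gamma V(s')].
# P[s][a] is a list of (prob, s_next, reward) outcomes.
P = {
    's0': {'stay': [(1.0, 's0', 0.0)], 'go': [(0.8, 's1', 1.0), (0.2, 's0', 0.0)]},
    's1': {'stay': [(1.0, 's1', 0.0)], 'go': [(1.0, 's0', 0.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in P}

for _ in range(100):  # repeat the Bellman backup until (near) convergence
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values())
         for s in P}

print({s: round(v, 3) for s, v in V.items()})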
7
Reminder: Reinforcement learning
Agent observes transitions and rewards: doing a in s leads to s' with reward r
Temporal-difference (TD) learning: V(s) <- (1-α) V(s) + α [ r + γ V(s') ]
Q-learning: Q(s,a) <- (1-α) Q(s,a) + α [ r + γ maxa' Q(s',a') ]
SARSA (after observing the next action a'): Q(s,a) <- (1-α) Q(s,a) + α [ r + γ Q(s',a') ]
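The three updates above translate directly into tabular code; a minimal sketch (tables and the values of alpha, gamma are placeholders, not from the talk):

# Tabular TD, Q-learning, and SARSA updates from this slide.
from collections import defaultdict

alpha, gamma = 0.1, 0.99
V = defaultdict(float)
Q = defaultdict(float)  # keyed by (state, action)

def td_update(s, r, s_next):
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])

def q_learning_update(s, a, r, s_next, actions):
    best = max(Q[(s_next, a2)] for a2 in actions)  # off-policy: best next action
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best)

def sarsa_update(s, a, r, s_next, a_next):
    # on-policy: uses the action a_next actually taken in s_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s_next, a_next)])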
8
Reminder: RL issues
• Credit assignment problem
• Generalization problem
9
Abstract states
First try: define an abstract MDP on abstract states -- “at home,” “at SFO,” “at IJCAI,” etc.
12
Oops – lost the Markov property! Action outcome depends on the low-level state, which depends on history
Abstract actions
Forestier & Varaiya, 1978: supervisory control
• An abstract action is a fixed policy for a region
• "Choice states" where the policy exits the region
• The decision process on choice states is semi-Markov
Semi-Markov decision processes
Add a random duration N for action executions: P(s',N | s,a)
Modified Bellman equation:
Vπ(s) = Σs',N P(s',N | s,π(s)) [ R(s,π(s),s',N) + γ^N Vπ(s') ]
Modified Q-learning with abstract actions:
• For transitions within an abstract action πa: rc <- rc + γc R(s,πa(s),s') ; γc <- γc γ
• For transitions that terminate an action begun at s0:
Q(s0,πa) <- (1-α) Q(s0,πa) + α [ rc + γc maxa' Q(s',πa') ]
14
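A sketch of this SMDP update in Python (the env and pi_a interfaces are assumptions, not the authors' code): accumulate the discounted reward rc and effective discount gc while the abstract action runs, then back up once at the state where it began.

# Sketch of SMDP Q-learning with abstract actions.
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)  # keyed by (state, abstract_action)

def execute_and_update(env, s0, pi_a, abstract_actions):
    """Run abstract action pi_a from s0 to termination, then do one SMDP backup."""
    rc, gc, s = 0.0, 1.0, s0
    while not pi_a.terminated(s):
        s_next, r = env.step(pi_a.act(s))  # follow the action's fixed internal policy
        rc += gc * r                       # rc <- rc + gc * R(s, pi_a(s), s')
        gc *= gamma                        # gc <- gc * gamma
        s = s_next
    target = rc + gc * max(Q[(s, a2)] for a2 in abstract_actions)
    Q[(s0, pi_a)] = (1 - alpha) * Q[(s0, pi_a)] + alpha * target
    return s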
Hierarchical abstract machines
Hierarchy of NDFAs; states come in 4 types:
• Action states execute a physical action
• Call states invoke a lower-level NDFA
• Choice points nondeterministically select the next state
• Stop states return control to the invoking NDFA
NDFA internal transitions are dictated by the percept
15
16
Running example
Peasants can move, pickup, and dropoff
• Penalty for collision
• Cost-of-living each step
• Reward for dropping off resources
Goal: gather 10 gold + 10 wood
≈ (3L)^n states s, 7^n primitive actions a (for n peasants)
Part of 1-peasant HAM
17
Joint SMDP
Physical states S, machine states Θ, joint state space Ω = S x Θ
Joint MDP: transition model for the physical state (from the physical MDP) and the machine state (as defined by the HAM)
A(ω) is fixed except when θ is at a choice point
Reduction to SMDP: states are just those where θ is at a choice point
18
Hierarchical optimality
Theorem: the optimal solution to the SMDP is hierarchically optimal for the original MDP, i.e., maximizes value among all behaviors consistent with the HAM
19
Partial programs
HAMs (and Dietterich's MAXQ hierarchies and Sutton's Options) are all partially constrained descriptions of behaviors with internal state
A partial program is like an ordinary program in any programming language, with a nondeterministic choice operator added to allow for ignorance about what to do
HAMs, MAXQs, and Options are trivially expressible as partial programs
Standard MDP: loop forever, choose an action
20
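To make the choice operator concrete, a minimal Python sketch (ALisp's actual semantics are richer; all interfaces here are invented): a choose that defers to a learned Q-function turns the trivial partial program above back into the standard MDP.

# Sketch: a partial program is ordinary code plus a nondeterministic `choose`;
# the learner supplies the completion by resolving each choice from a learned
# Q-function. The env and Q interfaces are illustrative assumptions.
import random

def choose(label, options, Q, omega, epsilon=0.1):
    """Resolve a choice point epsilon-greedily from the learned Q-function."""
    if random.random() < epsilon:
        return random.choice(options)
    return max(options, key=lambda u: Q.get((omega, label, u), 0.0))

def standard_mdp_program(env, Q, actions):
    """The degenerate partial program: loop forever, choose an action."""
    s = env.reset()
    while True:
        a = choose('top-choice', actions, Q, omega=s)
        s, r = env.step(a)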
21
RL and partial programs
[Diagram: the learning algorithm exchanges actions a and percepts s,r with the environment and computes a completion of the partial program]
Hierarchically optimal for all terminating programs
23
(defun top ()
  (loop do (choose 'top-choice
             (gather-wood)
             (gather-gold))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
Single-threaded ALisp program
Program state θ includes:
• Program counter
• Call stack
• Global variables
24
Q-functions
Represent completions using a Q-function
Joint state ω = [s,θ]: environment state + program state
MDP + partial program = SMDP over {ω}
Qπ(ω,u) = "expected total reward if I make choice u in ω and follow completion π thereafter"
Modified Q-learning [AR 02] finds the optimal completion

ω | u | Q(ω,u)
... | ... | ...
At nav-choice, Pos=(2,3), Dest=(6,5), ... | North | 15
At resource-choice, Pos=(2,3), Gold=7, Wood=3, ... | Gather-wood | -42
... | ... | ...
25
Internal state
Availability of internal state (e.g., the goal stack) can greatly simplify value functions and policies
E.g., while navigating to location (x,y), moving towards (x,y) is a good idea
The natural local shaping potential (distance from destination) is impossible to express in external terms
26
Temporal Q-decomposition
[Diagram: call tree Top → GatherWood / GatherGold → Nav(Forest1) / Nav(Mine2), with Q split into Qr, Qc, Qe]
• Temporal decomposition Q = Qr + Qc + Qe, where
• Qr(ω,u) = reward while doing u (may be many steps)
• Qc(ω,u) = reward in current subroutine after doing u
• Qe(ω,u) = reward after current subroutine
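A tiny sketch of how the decomposition is used at a choice point (tables and names invented for illustration): the three components are learned separately but summed when comparing choices.

# Sketch: Q(omega,u) = Qr + Qc + Qe; each component is a simple table here.
from collections import defaultdict

Qr = defaultdict(float)  # reward while doing u
Qc = defaultdict(float)  # reward in current subroutine after u
Qe = defaultdict(float)  # reward after the current subroutine exits

def q_total(omega, u):
    return Qr[(omega, u)] + Qc[(omega, u)] + Qe[(omega, u)]

def best_choice(omega, options):
    """Compare choices by the summed value, though components are learned separately."""
    return max(options, key=lambda u: q_total(omega, u))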
27
State abstraction
• Temporal decomposition => state abstraction
• E.g., while navigating, Qc is independent of gold reserves
• In general, local Q-components can depend on few variables => fast learning
28
Handling multiple effectors
Multithreaded agent programs:
• Threads = tasks (e.g., Get-wood, Get-gold, Defend-Base)
• Each effector assigned to a thread
• Threads can be created/destroyed
• Effectors can be reassigned
• Effectors can be created/destroyed
29
An example Concurrent ALisp program:

(defun top ()
  (loop do
    (until (my-effectors) (choose 'dummy))
    (setf peas (first (my-effectors)))
    (choose 'top-choice
      (spawn gather-wood peas)
      (spawn gather-gold peas))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

An example single-threaded ALisp program:

(defun top ()
  (loop do (choose 'top-choice
             (gather-gold)
             (gather-wood))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (my-pos) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
30
(defun gather-wood ()
  (setf dest (choose *forests*))
  (nav dest)
  (action 'get-wood)
  (nav *home-base-loc*)
  (action 'put-wood))

(defun gather-gold ()
  (setf dest (choose *goldmines*))
  (nav dest)
  (action 'get-gold)
  (nav *home-base-loc*)
  (action 'put-gold))

(defun nav (dest)
  (loop until (at-pos dest)
        do (setf d (choose '(N S E W R)))
           (action d)))

[Animation: two threads, one per peasant, run gather-wood (in nav(Wood1)) and gather-gold (in nav(Gold2)); across environment timesteps 26-27 each thread cycles between Running, Paused/making joint choice, and Waiting for joint action]
Concurrent ALisp semantics
31
• Threads execute independently until they hit a choice or action
• Wait until all threads are at a choice or action
• If all effectors have been assigned an action, do that joint action in the environment
• Otherwise, make a joint choice
32
Q-functions
To complete the partial program, at each choice state ω we need to specify choices for all choosing threads
So Q(ω,u) as before, except u is a joint choice
Suitable SMDP Q-learning gives the optimal completion
Example Q-function:

ω | u | Q(ω,u)
... | ... | ...
Peas1 at NavChoice, Peas2 at DropoffGold, Peas3 at ForestChoice, Pos1=(2,3), Pos3=(7,4), Gold=12, Wood=14 | (Peas1:East, Peas3:Forest2) | 15.7
... | ... | ...
34
Problems with concurrent activities
• Temporal decomposition of the Q-function is lost
• No credit assignment among threads
Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly: peasant 2 thinks he's done very well!!
Significantly slows learning as the number of peasants increases
35
Threadwise decomposition
Idea: decompose the reward among threads (Russell & Zimdars, 2003)
E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants
Qjπ(ω,u) = "expected total reward received by thread j if we make joint choice u and then follow π"
Threadwise Q-decomposition: Q = Q1 + … + Qn
Recursively distributed SARSA => global optimality
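A sketch of the idea in Python (a SARSA-style update following Russell & Zimdars, 2003; the tables and the per-thread reward decomposition are illustrative assumptions): each thread learns its own component from its own reward stream, while the joint choice maximizes the sum.

# Sketch of threadwise Q-decomposition with a SARSA-style update.
from collections import defaultdict
from itertools import product

alpha, gamma = 0.1, 0.99

def joint_choice(Qs, omega, options_per_thread):
    """Pick the joint choice u maximizing sum_j Qj(omega, u)."""
    return max(product(*options_per_thread),
               key=lambda u: sum(Q[(omega, u)] for Q in Qs))

def threadwise_sarsa_update(Qs, omega, u, rewards, omega2, u2):
    """Each thread j updates its own Qj toward rj + gamma * Qj(omega', u')."""
    for Q, rj in zip(Qs, rewards):
        Q[(omega, u)] = (1 - alpha) * Q[(omega, u)] + alpha * (rj + gamma * Q[(omega2, u2)])

# Example setup: three peasant threads, each with its own Q table.
Qs = [defaultdict(float) for _ in range(3)]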
36
Learning the threadwise decomposition
[Diagram: at an action state, the environment reward r is decomposed (decomp) into per-thread rewards r1, r2, r3 for the Peasant 1, Peasant 2, and Peasant 3 threads beneath the Top thread]
37
Learning the threadwise decomposition
[Diagram: at a choice state ω, each peasant thread maintains its own component Q1(ω,·), Q2(ω,·), Q3(ω,·); these are summed to Q(ω,·), the joint choice u = argmax is taken, and each thread performs its own Q-update]
38
Threadwise and temporal decomposition
Qj = Qj,r + Qj,c + Qj,e, where
• Qj,r(ω,u) = expected reward gained by thread j while doing u
• Qj,c(ω,u) = expected reward gained by thread j after u but before leaving the current subroutine
• Qj,e(ω,u) = expected reward gained by thread j after the current subroutine
39
[Plot: reward of learnt policy vs. num steps learning (x 1000) for Flat, Undecomposed, Threadwise, and Threadwise + Temporal learners; resource gathering with 15 peasants]
40
41
Summary of Part I
Structure in behavior seems essential for scaling up
Partial programs:
• Provide natural structural constraints on policies
• Decompose value functions into simple components
• Include internal state (e.g., "goals") that further simplifies value functions, shaping rewards
Concurrency:
• Simplifies description of multieffector behavior
• Messes up temporal decomposition and credit assignment (but threadwise reward decomposition restores it)
42
Outline
• Hierarchical RL with abstract actions
• Hierarchical RL with partial programs
• Efficient hierarchical planning
43
Abstract Lookahead
• k-step lookahead >> 1-step lookahead (e.g., chess)
• k-step lookahead is no use if the steps are too small (e.g., the first k characters of a NIPS paper)
• Abstract plans (high-level executions of partial programs) are shorter
• Much shorter plans for given goals => exponential savings
• Can look ahead much further into the future
[Diagram: a chess lookahead tree (Kc3 Rxa1 / Qa4 c2 / c2 Kxa1) beside a character-level tree over 'a' and 'b' (a, b; aa, ab, ba, bb)]
44
High-level actions
Start with a restricted classical HTN form of partial program. A high-level action (HLA):
• Has a set of possible immediate refinements into sequences of actions, primitive or high-level
• Each refinement may have a precondition on its use
• Executable plan space = all primitive refinements of Act (see the sketch below)
[Refinement tree: [Act] refines to [GoSFO, Act] or [GoOAK, Act]; GoSFO refines to [WalkToBART, BARTtoSFO], giving [WalkToBART, BARTtoSFO, Act]]
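A minimal sketch of this structure in Python (the names Act, GoSFO, GoOAK, WalkToBART, BARTtoSFO follow the slide; the data structure and expansion budget are illustrative, and actions without refinement entries are treated as primitive):

# Sketch: enumerate primitive refinements of a high-level plan.
refinements = {
    'Act':   [['GoSFO', 'Act'], ['GoOAK', 'Act'], []],   # [] = stop acting
    'GoSFO': [['WalkToBART', 'BARTtoSFO']],
}

def primitive_refinements(plan, budget=3):
    """Yield primitive refinements of `plan`, expanding HLAs left to right."""
    i = next((k for k, a in enumerate(plan) if a in refinements), None)
    if i is None:          # no HLA left: the plan is fully primitive
        yield plan
        return
    if budget == 0:
        return             # expansion budget exhausted
    for seq in refinements[plan[i]]:
        yield from primitive_refinements(plan[:i] + seq + plan[i + 1:], budget - 1)

for p in primitive_refinements(['Act']):
    print(p)   # e.g., ['WalkToBART', 'BARTtoSFO'], ['GoOAK'], ...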
Lookahead with HLAs
To do offline or online planning with HLAs, need a model
See classical HTN planning literature – not! Drew McDermott (AI Magazine, 2000):
The semantics of hierarchical planning have never been clarified … no one has ever figured out how to reconcile the semantics of hierarchical plans with the semantics of primitive actions
46
47
Downward refinement
Downward refinement property (DRP): every apparently successful high-level plan has a successful primitive refinement
"Holy Grail" of HTN planning: allows commitment to abstract plans without backtracking
Bacchus and Yang, 1991: it is naïve to expect the DRP to hold in general
MRW, 2007: Theorem: if assertions about HLAs are true, then the DRP always holds
Problem: how to say true things about effects of HLAs?
51
Angelic semantics for HLAs
[Diagrams: state space S with initial state s0 and goal g; reachable sets of HLAs h1 and h2 and of the sequence h1h2; since the reachable set of h1h2 contains g, h1h2 is a solution. A primitive sequence a1 a2 a3 a4 traces a single path through the state space]
• Begin with the atomic state-space view: start in s0, goal state g
• The central idea is the reachable set of an HLA from each state
• When extended to sequences of actions, this allows proving that a plan can or cannot possibly reach the goal
• May seem related to nondeterminism, but the nondeterminism is angelic: the "uncertainty" will be resolved by the agent, not an adversary or nature
52
Technical development
• NCSTRIPS to describe reachable sets: STRIPS add/delete plus "possibly add", "possibly delete", etc. E.g., GoSFO adds AtSFO, possibly adds CarAtSFO
• Reachable sets may be too hard to describe exactly: upper and lower descriptions bound the exact reachable set (and its cost) above and below; they still support proofs of plan success/failure, and possibly-successful plans must be refined
• Sound, complete, optimal offline planning algorithms
• Complete, eventually-optimal online (real-time) search
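A hedged sketch of the core test these bounds enable (all interfaces invented here, and reachable sets are represented as explicit state sets rather than the DNF formulae used in the papers): lower and upper descriptions let a planner commit to provably successful plans and prune provably failing ones.

# Sketch: angelic plan evaluation with reachable-set bounds.
def evaluate_plan(plan, start_states, goal, lower, upper):
    """plan: list of HLA names; lower/upper: name -> (state set -> state set)."""
    lo, hi = set(start_states), set(start_states)
    for hla in plan:
        lo, hi = lower[hla](lo), upper[hla](hi)
    if any(goal(s) for s in lo):
        return 'provably successful'   # safe to commit to this abstract plan
    if not any(goal(s) for s in hi):
        return 'provably fails'        # prune: no primitive refinement can succeed
    return 'refine further'            # bounds are inconclusive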
53
Example – Warehouse World
Has similarities to blocks and taxi domains, but more choices and constraints:
• Gripper must stay in bounds
• Can't pass through blocks
• Can only turn around at top row
Goal: have C on T4
• Can't just move directly
• Final plan has 22 steps:
Left, Down, Pickup, Up, Turn, Down, Putdown, Right, Right, Down, Pickup, Left, Put, Up, Left, Pickup, Up, Turn, Right, Down, Down, Putdown
54
Representing descriptions: NCSTRIPS
• Assume S = the set of truth assignments to propositions
• Descriptions specify the propositions (possibly) added/deleted by an HLA
• An efficient algorithm exists to progress state sets (represented as DNF formulae) through descriptions
Example: Navigate(xt,yt) (Pre: At(xs,ys))
Upper: -At(xs,ys), +At(xt,yt), FacingRight
Lower: IF (Free(xt,yt) ∧ Free(x,ymax)): -At(xs,ys), +At(xt,yt), FacingRight; ELSE: nil
55
Experiment: Instance 3
• 5x8 world, 90-step plan
• Flat/hierarchical planning without descriptions did not terminate within 10,000 seconds
56
Online Search: Preliminary Results
[Plot: y-axis shows steps to reach goal]