1
Hierarchical Lookahead Agents: A Preliminary Report
Bhaskara Marthi
Stuart Russell
Jason Wolfe
2
High-level actions
• For our purposes, a high-level action (HLA)
  • Has a set of possible immediate refinements into sequences of actions, primitive or high-level
  • Each refinement may have a precondition on its use
• Many of the actions we do are high-level
  • Go to work
  • Write a NIPS paper
• This definition generalizes others we’ve seen
  • Options don’t provide a choice of refinements
  • MAXQ tasks can’t represent sequencing constraints
3
Abstract Lookahead
• k-step lookahead >> 1-step lookahead
  • e.g., chess
• k-step lookahead is no use if steps are too small
  • e.g., the first k characters of a NIPS paper: one small part of a human life, yet ~20,000,000,000,000 primitive actions
• Abstract plans with HLAs are shorter
  • Much shorter plans for given goals => exponential savings
  • Can look ahead much further into the future
• Requires a model (i.e., transition and reward fns) for HLAs
  • No suitable models to be found in the HRL or HTN literature
  • We proposed angelic semantics [MRW 07], which we extend here
[Figure: two lookahead trees contrasting step granularity: a chess tree over single moves (Kc3, Rxa1, Qa4, c2, Kxa1) and a tree over strings built one character at a time ('a', 'b', then aa, ab, ba, bb)]
4
[Figure: reachable sets in state space S from s0 for the plans h2h1 and h1h2; the set for h1h2 contains t, so h1h2 is a solution]
Angelic semantics for HLAs

• Models HLAs in deterministic domains without rewards
  • Central idea is the reachable set of an HLA from some state
• When extended to sequences of actions, allows proving that a plan can or cannot possibly reach the goal
• May seem related to nondeterminism
  • But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not an adversary or nature
[Figure: reachable sets of actions a1, a2, a3, a4 in the state space, relative to the goal region]
5
Contributions
• Extended angelic semantics to account for rewards
• Developed novel algorithms that do lookahead
  • Hierarchical Forward Search
    • An optimal, offline algorithm
    • Can prove optimality of plans entirely at the high level
  • Hierarchical Real-Time Search
    • An online algorithm
    • Requires only bounded computation per time step
    • Significantly outperforms previous “flat” lookahead algorithms
• Both algorithms require three inputs:
  • a deterministic MDP model
  • an action hierarchy (HLAs)
  • models for HLAs (based on angelic semantics)
6
Deterministic MDPs

• A deterministic MDP consists of:
  • State space S
  • Initial state s0 ∈ S, (w.l.o.g.) a single terminal state t ∈ S
  • Primitive action set A
  • Transition function f : S × A → S
  • Reward function r : S × A → ℝ
• Undiscounted; assume no non-negative-reward cycles
[Figure: transition & reward functions for action a1, drawn as arrows in state space S from s0 toward t with rewards such as -5, 3, -8, -1, -4]
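The definition above is small enough to write down directly. A minimal sketch in Python, where the class name, field names, and toy instance are illustrative rather than from the talk:

```python
from dataclasses import dataclass
from typing import Callable

State = str
Action = str

@dataclass
class DeterministicMDP:
    # Container mirroring the slide's definition
    states: set
    s0: State                             # initial state
    t: State                              # single terminal state (w.l.o.g.)
    actions: set
    f: Callable[[State, Action], State]   # transition function f : S x A -> S
    r: Callable[[State, Action], float]   # reward function  r : S x A -> R

# Toy instance: one action moves s0 to t with reward -1 (negative rewards,
# consistent with the undiscounted no-non-negative-reward-cycle assumption).
mdp = DeterministicMDP(
    states={"s0", "t"}, s0="s0", t="t", actions={"go"},
    f=lambda s, a: "t",
    r=lambda s, a: -1.0,
)
print(mdp.f(mdp.s0, "go"), mdp.r(mdp.s0, "go"))  # t -1.0
```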
7
Example – Warehouse World
• Has similarities to the blocks and taxi domains, but more choices and constraints
  • Gripper must stay in bounds
  • Can’t pass through blocks
  • Can only turn around at the top row
• All actions have reward -1
• Goal: have C on T4
  • Can’t just move it directly
  • Final plan has 22 steps
[Figure: a sequence of warehouse-world grids (towers T1 to T4, blocks A, B, C) animating the 22-step solution:]

Left, Down, Pickup, Up, Turn, Down, Putdown, Right, Right, Down, Pickup, Left, Put, Up, Left, Pickup, Up, Turn, Right, Down, Down, Putdown
8
Some HLAs for warehouse world

• Each HLA has a set of immediate refinements into sequences of actions
• All plans of interest are refinements of the special HLA Act
• HLAs define a context-free grammar over primitive plans (Act is the start symbol)

[Figure: refinement tree: [Act] refines to [Move(B,C) Act] among others; [Move(B,C)] to [Get(B) PutOn(C)]; [Get(B)] to [GetL(B)] or [GetR(B)]; and so on]
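Such a hierarchy can be encoded literally as a grammar. A sketch, where the specific refinement sets (including an empty refinement letting Act terminate) are invented for illustration:

```python
# Each HLA maps to its list of immediate refinements, each a sequence of
# primitive or high-level actions; "Act" is the grammar's start symbol.
HLAS = {
    "Act": [["Move(B,C)", "Act"], []],       # keep acting, or stop (illustrative)
    "Move(B,C)": [["Get(B)", "PutOn(C)"]],
    "Get(B)": [["GetL(B)"], ["GetR(B)"]],
}

def is_primitive(a):
    # Anything without refinements is a primitive action
    return a not in HLAS

def refinements(plan, i):
    """All plans obtained by refining the HLA at position i of the plan."""
    assert not is_primitive(plan[i])
    return [plan[:i] + ref + plan[i + 1:] for ref in HLAS[plan[i]]]

print(refinements(["Get(B)", "PutOn(C)"], 0))
# [['GetL(B)', 'PutOn(C)'], ['GetR(B)', 'PutOn(C)']]
```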
9
Abstract Lookahead Trees

• ALTs are the primary data structures
• Represent a set of potential plans
  • Edges are (possibly abstract) actions
  • Basic operation: refine a plan (replace it with all refinements at some HLA)
• Nodes represent potential situations (pairs of upper & lower valuations)
[Figure: an ALT whose branches are plans such as Move(A,C) Act, Move(C,B) Act, Move(B,C) Act, and Get(C) PutOn(A) Act, with upper and lower reward bounds at each node]
Modeling HLAs
• An HLA is fully characterized by the primitive MDP + hierarchy
  • But without abstraction, we lose the benefits of hierarchy
• Extension of an idea from “Angelic Semantics for HLAs” [MRW ‘07]:
  • Valuation of HLA h from state s: for each s’, the best reward of any primitive refinement of h that takes s to s’
• Exact description of h = valuation of h from each s
  • But this description has no compact, efficient representation in general
[Figure: the exact valuation of HLA h from s0, shown as per-state best-reward labels in state space S]
Upper and lower valuations
• Instead, use approximate valuations
  • Upper valuations never underestimate the best achievable reward
  • Lower valuations never overestimate the best achievable reward
• We choose a simple form: a reachable set + a reward bound on the set
[Figure: an exact valuation (per-state rewards such as 7, 5, -2, 4, 6) bracketed by an upper approximation (bound 8) and a lower approximation (bound 1)]
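The chosen valuation form is simple enough to state as code. A sketch, with illustrative names and toy numbers, checking only the bracketing property the bullets require:

```python
from typing import NamedTuple

class Valuation(NamedTuple):
    reachable: frozenset   # set of states (possibly/provably reachable)
    bound: float           # one reward bound shared by the whole set

# For the same HLA from the same state, the two approximations must bracket
# the exact best achievable reward (here -3.0, a made-up value):
exact_best = -3.0
upper = Valuation(frozenset({"s1", "s2", "s3"}), -2.0)  # never underestimates
lower = Valuation(frozenset({"s2"}), -5.0)              # never overestimates
assert lower.bound <= exact_best <= upper.bound
assert lower.reachable <= upper.reachable  # provable is a subset of possible
```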
12
Representing descriptions: NCSTRIPS
• Assume S = set of truth assignments to propositions
• Descriptions specify propositions (possibly) added/deleted by the HLA
  • Also include a bound on rewards
• An efficient algorithm exists to progress valuations (represented as DNF formulae + a numeric bound) through descriptions
Navigate(xt,yt)  (Pre: At(xs,ys))
  Upper: -At(xs,ys), +At(xt,yt), -FacingRight, +FacingRight,
         +R(-|xs-xt| - |ys-yt|)
  Lower: IF (Free(xt,yt) ∧ Free(x,ymax)):
           -At(xs,ys), +At(xt,yt), -FacingRight, +FacingRight,
           +R(-|xs-xt| - ys - yt + 1)
         ELSE: nil
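Progression through such a description can be sketched over explicit state sets (standing in for the DNF representation the slide mentions; the function name and the Navigate-like example are illustrative):

```python
from itertools import chain, combinations

def powerset(props):
    props = list(props)
    return chain.from_iterable(combinations(props, k) for k in range(len(props) + 1))

def progress_upper(states, adds, deletes, possible_adds, reward_delta, bound):
    """Progress a set of states through an NCSTRIPS-style upper description:
    apply certain deletes/adds, then branch on every subset of the
    possibly-added propositions; shift the reward bound by the description's
    reward term."""
    out = set()
    for s in states:
        base = (frozenset(s) - deletes) | adds
        for extra in powerset(possible_adds):   # each may or may not be added
            out.add(base | frozenset(extra))
    return out, bound + reward_delta

# Navigate-like step: certainly arrive at the target, possibly FacingRight.
states = {frozenset({"At(1,1)", "FacingRight"})}
out, b = progress_upper(states, {"At(3,2)"}, {"At(1,1)", "FacingRight"},
                        {"FacingRight"}, -3.0, 0.0)
print(len(out), b)  # 2 -3.0
```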
13
Back to ALT nodes
• Initially, we know we are at s0 with reward 0
• Each ALT node is created from its parent & the descriptions of the action from the parent
  • Valuations give provable bounds on reachable states & rewards
• Lower valuation after Act is usually vacuous; upper valuation gives an admissible heuristic to t
[Figure: valuations progressed through Move(A,C) and then Act, starting from s0 with reward 0, with upper/lower bounds such as -5, -3, -9]
14
Pruning

Prune whenever a node’s lower valuation dominates another node’s upper valuation
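With valuations in the simple (reachable set, reward bound) form, the dominance test is one line. A sketch with an illustrative function name:

```python
# Node B is prunable when node A's lower valuation dominates B's upper
# valuation: everything B might possibly achieve, A provably achieves
# at least as well.
def prunable(a_lower_states, a_lower_bound, b_upper_states, b_upper_bound):
    return a_lower_states >= b_upper_states and a_lower_bound >= b_upper_bound

print(prunable({"t"}, -5.0, {"t"}, -7.0))  # True: a provable -5 beats a possible -7
print(prunable({"t"}, -9.0, {"t"}, -7.0))  # False
```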
15
Hierarchical Forward Search
• Construct the initial ALT
• Loop:
  • Select the plan with the highest upper reward
  • If this plan is primitive, return it
  • Otherwise, refine one of its HLAs
• Related to A*
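The loop above can be sketched as a best-first search; the three hooks are caller-supplied stand-ins (not the talk's API), and upper rewards play the role of an admissible heuristic, as in A*:

```python
import heapq

def hierarchical_forward_search(root_plan, upper_reward, is_primitive_plan, refine):
    # Max-heap via negated upper rewards
    frontier = [(-upper_reward(root_plan), root_plan)]
    while frontier:
        _, plan = heapq.heappop(frontier)      # plan with highest upper reward
        if is_primitive_plan(plan):
            return plan                        # optimal, since bounds are valid
        for child in refine(plan):             # replace one HLA by each refinement
            heapq.heappush(frontier, (-upper_reward(child), child))
    return None

# Tiny illustration: HLA "H" refines to either [a] (reward -1) or [b, b] (-2).
result = hierarchical_forward_search(
    ["H"],
    upper_reward=lambda p: -len([x for x in p if x != "H"]),
    is_primitive_plan=lambda p: "H" not in p,
    refine=lambda p: [["a"], ["b", "b"]],
)
print(result)  # ['a']
```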
16–23

Intuitive Picture

[Animation repeated across slides 16–23: the single abstract plan Act from s0 to t is progressively refined, from the highest-level HLAs down to primitive actions, until a concrete primitive path reaches t]
24
Analysis
• Optimality follows from the guarantees offered by upper descriptions
• More accurate upper descriptions → more directed search
• More accurate lower descriptions → more pruning
25
An online version?
• Hierarchical Forward Search plans all the way to t before it starts acting
  • Can’t act in real time
  • Not really useful for RL
  • In the stochastic case, must plan for all contingencies
• How can we choose a good first action with bounded computation?
26
Hierarchical Real-Time Search
• Loop until at t:
  • Run Hierarchical Forward Search for (at most) k steps
  • Choose a plan that
    • begins with a primitive action
    • has total upper reward ≥ that of the best unrefined plan
    • has minimal upper reward up to but not including Act
  • Do the first action of this plan in the world
  • Do a Bellman backup for the current state (prevents looping)
• Related to LRTA*
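The online loop can be sketched with two illustrative hooks: choose_action stands in for the bounded forward search plus the plan-selection rule, and backup for the Bellman backup that prevents looping, as in LRTA*:

```python
def hierarchical_real_time_search(s0, t, choose_action, execute, backup):
    s = s0
    trajectory = [s]
    while s != t:
        a = choose_action(s)   # bounded computation per time step
        backup(s)              # Bellman backup for the current state
        s = execute(s, a)      # do the first (primitive) action in the world
        trajectory.append(s)
    return trajectory

# Toy run on a 3-state chain 0 -> 1 -> 2, recording which states get backed up.
backed_up = []
traj = hierarchical_real_time_search(
    0, 2,
    choose_action=lambda s: "right",
    execute=lambda s, a: s + 1,
    backup=backed_up.append,
)
print(traj, backed_up)  # [0, 1, 2] [0, 1]
```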
27
Preliminary Results
[Plots: total reward vs. # ALT refinements per step for the Flat and Hierarchical agents, on Instance 1 (small; rewards 0 to -40, up to ~800 refinements per step) and Instance 2 (larger; rewards 0 to -4000, up to ~8000 refinements per step)]
28
Discussion
• We see a definite benefit to high-level lookahead
• Some subtleties still to be worked out:
  • Each HLA description has different biases, making direct comparisons between plans difficult
  • Current scheme is overly optimistic
• Other directions for future work:
  • Extensions to probabilistic / partially observable environments
  • Better meta-level control
  • Inferring/learning HLA models
  • Integration into RL algorithms
  • Abstract-level DP
  • …
Questions?