Classical Situation
[Figure: grid world with a heaven cell and a hell cell]
• World deterministic
• State observable
MDP-Style Planning
[Figure: grid world with a heaven cell and a hell cell]
• World stochastic
• State observable
[Koditschek 87, Barto et al. 89]
• Policy
• Universal Plan
• Navigation function
Stochastic, Partially Observable
[Figure: grid world with a sign; which cell is heaven and which is hell is unknown]
[Sondik 72] [Littman/Cassandra/Kaelbling 97]
Stochastic, Partially Observable
[Figure: two possible worlds, one with heaven on the left and hell on the right, the other mirrored; the sign indicates which world the robot is in]
Stochastic, Partially Observable
[Figure: the robot starts between the two mirrored heaven/hell worlds, each holding with 50% probability; it must read the sign to find out which]
Robot Planning Frameworks

                           Classical AI / robot planning
State/actions              discrete & continuous
State                      observable
Environment                deterministic
Plans                      sequences of actions
Completeness               yes
Optimality                 rarely
State space size           huge, often continuous, 6 dimensions
Computational complexity   varies
Markov Decision Process (discrete)
[Figure: five-state MDP (s1–s5) with stochastic transitions (probabilities between 0.1 and 0.99) and state rewards r=10, r=1, r=0]
[Bellman 57] [Howard 60] [Sutton/Barto 98]
Value Iteration
• Value function of policy $\pi$:
  $V^{\pi}(s) = E\!\left[\sum_{i \ge t} \gamma^{\,i-t}\, r(s_i) \;\middle|\; s_t = s,\ a_i = \pi(s_i)\right]$
• Bellman equation for the optimal value function:
  $V(s) = r(s) + \gamma \max_a \int p(s' \mid s, a)\, V(s')\, ds'$
• Value iteration: recursively estimating the value function:
  $\hat{V}(s) \leftarrow r(s) + \gamma \max_a \int p(s' \mid s, a)\, \hat{V}(s')\, ds'$
• Greedy policy:
  $\pi(s) = \operatorname{argmax}_a \int p(s' \mid s, a)\, V(s')\, ds'$
[Bellman 57] [Howard 60] [Sutton/Barto 98]
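As a concrete illustration of these updates, here is a minimal value iteration sketch for a small discrete MDP. The array names (P, r) and the discount value are assumptions for illustration, not from the slides:

```python
import numpy as np

# Minimal value iteration for a discrete MDP (illustrative sketch).
# P[a, s, s'] = p(s' | s, a); r[s] = immediate reward; gamma = discount factor.
def value_iteration(P, r, gamma=0.95, eps=1e-6):
    V = np.zeros(P.shape[1])
    while True:
        # Bellman backup: V(s) <- r(s) + gamma * max_a sum_s' p(s'|s,a) V(s')
        Q = P @ V                         # Q[a, s] = sum_s' p(s'|s,a) V(s')
        V_new = r + gamma * Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < eps:
            break
        V = V_new
    policy = np.argmax(P @ V, axis=0)     # greedy policy from the final values
    return V, policy
```

For the five-state example above, P would be an (actions, 5, 5) array whose entries encode the transition probabilities drawn in the figure.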
Value Iteration for Motion Planning
(assumes knowledge of robot’s location)
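A sketch of the same backup used as a motion planner on a 2D occupancy grid, assuming deterministic 4-connected motion and a known robot pose (grid layout and cost values are illustrative):

```python
import numpy as np

# Value iteration as a motion planner: V[y, x] is the cost-to-go to the goal.
# grid[y, x] == 1 marks an obstacle; moves are deterministic and 4-connected.
def plan(grid, goal, step_cost=1.0):
    V = np.full(grid.shape, np.inf)
    V[goal] = 0.0
    H, W = grid.shape
    changed = True
    while changed:                        # sweep until no value improves
        changed = False
        for y in range(H):
            for x in range(W):
                if grid[y, x] == 1 or (y, x) == goal:
                    continue
                best = step_cost + min(
                    V[ny, nx]
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1))
                    if 0 <= ny < H and 0 <= nx < W)
                if best < V[y, x]:
                    V[y, x] = best
                    changed = True
    return V    # follow the steepest descent of V from any start cell
```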
Continuous Environments
[Figure from A. Moore & C. G. Atkeson, "The Parti-Game Algorithm for Variable Resolution Reinforcement Learning in Continuous State Spaces," Machine Learning, 1995]
Approximate Cell Decomposition [Latombe 91]
[Figure from Moore & Atkeson, 1995]
Parti-Game [Moore 96]
[Figure from Moore & Atkeson, 1995]
Robot Planning Frameworks

                           Classical AI /           Value Iteration
                           robot planning           in MDPs           Parti-Game
State/actions              discrete & continuous    discrete          continuous
State                      observable               observable        observable
Environment                deterministic            stochastic        stochastic
Plans                      sequences of actions     policy            policy
Completeness               yes                      yes               yes
Optimality                 rarely                   yes               no
State space size           huge, often continuous,  millions          n/a
                           6 dimensions
Computational complexity   varies                   quadratic         n/a
Stochastic, Partially Observable
[Figure: from the start state the two mirrored heaven/hell worlds are equally likely (50% / 50%); the robot must travel to the sign to disambiguate before committing to a side]
A Quiz

sensors           actions         # states            size of belief space?
perfect           deterministic   3                   3: s1, s2, s3
perfect           stochastic      3                   3: s1, s2, s3
abstract states   deterministic   3                   2^3 − 1: s1, s2, s3, s12, s13, s23, s123
stochastic        deterministic   3                   2-dim continuous*: p(S=s1), p(S=s2)
none              stochastic      3                   2-dim continuous*: p(S=s1), p(S=s2)
stochastic        deterministic   1-dim continuous    ∞-dim continuous*
stochastic        stochastic      1-dim continuous    ∞-dim continuous*
stochastic        stochastic      ∞-dim continuous    aargh!

*) countable, but for all practical purposes continuous
Introduction to POMDPs (1 of 3)
[Figure: two-state POMDP (s1, s2) with actions a and b; the payoffs (100, 80, 40, 0, depending on state and action) define functions linear in the belief p(s1), and the value function plotted over p(s1) is their upper envelope]
[Sondik 72, Littman, Kaelbling, Cassandra '97]
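A minimal sketch of this construction: with two states, each action's expected payoff is linear in the belief p1 = p(s1), and the value function is the pointwise maximum of those lines. The payoff numbers below are illustrative stand-ins for the figure's values:

```python
import numpy as np

# Payoffs (r(s1, a), r(s2, a)) per action; the values here are illustrative.
payoffs = {"a": (100.0, 0.0),
           "b": (40.0, 80.0)}

def value(p1):
    # V(b) = max_a [ p(s1) * r(s1, a) + p(s2) * r(s2, a) ]: the upper envelope
    return max(p1 * r1 + (1.0 - p1) * r2 for r1, r2 in payoffs.values())

for p1 in np.linspace(0.0, 1.0, 5):
    print(f"p(s1) = {p1:.2f}  ->  V = {value(p1):6.1f}")
```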
Introduction to POMDPs (2 of 3)
[Figure: a third action c moves the state stochastically (80% / 20% between s1 and s2), mapping each belief p(s1) to a new belief p(s1'); the linear value functions are backed up through this belief transition]
[Sondik 72, Littman, Kaelbling, Cassandra '97]
Introduction to POMDPs (3 of 3)
[Figure: observations A and B occur with state-dependent probabilities (50% / 50% in s1, 30% / 70% in s2); each observation maps the belief p(s1) to a posterior p(s1'|A) or p(s1'|B)]
The value of a belief is the expectation over observations:
  $V(p(s_1)) = \sum_{z \in \{A, B\}} p(z)\, V\!\big(p(s_1 \mid z)\big)$
[Sondik 72, Littman, Kaelbling, Cassandra '97]
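A sketch of this backup for the two-state example; the observation likelihoods below are an assumed reading of the figure's percentages, and V can be any value function over p(s1), such as the upper-envelope value above:

```python
# Observation likelihoods (p(z | s1), p(z | s2)) -- assumed from the figure.
obs_model = {"A": (0.5, 0.3),
             "B": (0.5, 0.7)}

def obs_backup(p1, V):
    # V(p(s1)) = sum_z p(z) * V(p(s1 | z))
    total = 0.0
    for pz_s1, pz_s2 in obs_model.values():
        pz = p1 * pz_s1 + (1.0 - p1) * pz_s2   # p(z) under the current belief
        posterior = p1 * pz_s1 / pz            # Bayes rule: p(s1 | z)
        total += pz * V(posterior)
    return total
```

For example, obs_backup(0.6, value) evaluates the belief p(s1) = 0.6 one observation ahead, using the envelope value function from the earlier sketch.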
Value Iteration in POMDPs
• Value function of policy $\pi$:
  $V^{\pi}(b) = E\!\left[\sum_{i \ge t} \gamma^{\,i-t}\, r(b_i) \;\middle|\; b_t = b,\ a_i = \pi(b_i)\right]$
• Bellman equation for the optimal value function:
  $V(b) = r(b) + \gamma \max_a \int p(b' \mid b, a)\, V(b')\, db'$
• Value iteration: recursively estimating the value function:
  $\hat{V}(b) \leftarrow r(b) + \gamma \max_a \int p(b' \mid b, a)\, \hat{V}(b')\, db'$
• Greedy policy:
  $\pi(b) = \operatorname{argmax}_a \int p(b' \mid b, a)\, V(b')\, db'$
Substitute b for s
Missing Terms: Belief Space
• Expected reward:
  $r(b) = \int r(s)\, b(s)\, ds$
• Next-state density:
  $p(b' \mid b, a) = \int p(b' \mid b, a, o')\, p(o' \mid b, a)\, do'$
  where $p(o' \mid b, a) = \int\!\!\int p(o' \mid s')\, p(s' \mid s, a)\, b(s)\, ds\, ds'$,
  and $p(b' \mid b, a, o')$ is a Dirac distribution: $b'$ is computed from $b$, $a$, $o'$ by the Bayes filter.
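In the discrete case the integrals become sums and both terms are a few lines of code; a sketch with assumed array shapes (b[s] belief, r[s] rewards, P[a, s, s'] transitions, O[o, s'] observation likelihoods):

```python
import numpy as np

def expected_reward(b, r):
    return b @ r                    # r(b) = sum_s r(s) b(s)

def obs_probability(b, a, P, O):
    b_pred = b @ P[a]               # prediction: sum_s p(s'|s,a) b(s)
    return O @ b_pred               # p(o|b,a) = sum_s' p(o|s') b_pred(s')

def bayes_filter(b, a, o, P, O):
    b_pred = b @ P[a]               # prediction step
    b_new = O[o] * b_pred           # correction step
    return b_new / b_new.sum()      # normalize; p(b'|b,a,o) is a Dirac at this b'
```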
Value Iteration in Belief Space
[Figure: belief-space backup: from belief state b, taking action a has value Q(b, a); the underlying state s transitions to the next state s' with reward r', an observation o arrives, producing the next belief state b', which is evaluated by max Q(b', a)]
Why is This So Complex?
State Space Planning (no state uncertainty)  vs.  Belief Space Planning (full state uncertainty)
Augmented MDPs:
• Compress the belief into two statistics, its most likely state and its entropy:
  $\bar{b} = \big( \operatorname{argmax}_s\, b(s),\ H_b[s] \big)$
[Roy et al, 98/99]
[Figure: the augmented state space is the conventional state space extended by an uncertainty (entropy) dimension]
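A sketch of this compression for a discrete belief vector; the two returned statistics are exactly the most likely state and the belief entropy named above:

```python
import numpy as np

def augmented_state(b, eps=1e-12):
    s_star = int(np.argmax(b))                       # argmax_s b(s)
    entropy = -float(np.sum(b * np.log(b + eps)))    # H_b[s], the belief entropy
    return s_star, entropy                           # the augmented-MDP state
```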
Path Planning with Augmented MDPs
[Figure: a conventional planner's path vs. the probabilistic planner's path, which detours to gain information]
[Roy et al, 98/99]
Robot Planning Frameworks

                           Classical AI /           Value Iteration
                           robot planning           in MDPs          Parti-Game   POMDP         Augmented MDP
State/actions              discrete & continuous    discrete         continuous   discrete      discrete
State                      observable               observable       observable   partially     partially
                                                                                  observable    observable
Environment                deterministic            stochastic       stochastic   stochastic    stochastic
Plans                      sequences of actions     policy           policy       policy        policy
Completeness               yes                      yes              yes          yes           no
Optimality                 rarely                   yes              no           yes           no
State space size           huge, often continuous,  millions         n/a          dozens        thousands
                           6 dimensions
Computational complexity   varies                   quadratic        n/a          exponential   O(N^4)