Fast approximate POMDP planning:
Overcoming the curse of history!
Joelle Pineau, Geoff Gordon and Sebastian Thrun, CMU
Point-based value iteration: an anytime algorithm for POMDPs
Workshop on Advances in Machine Learning - June, 2003
Workshop on Advances in Machine Learning Joelle Pineau
Why use a POMDP?
• POMDPs provide a rich framework for sequential decision-making, which can model:
– varying rewards across actions and goals
– actions with random effects
– uncertainty in the state of the world
Existing applications of POMDPs
– Maintenance scheduling
» Puterman, 1994
– Robot navigation
» Koenig & Simmons, 1995;
Roy & Thrun, 1999
– Helicopter control
» Bagnell & Schneider, 2001;
Ng et al., 2002
– Dialogue modeling
» Roy, Pineau & Thrun, 2000;
Paek & Horvitz, 2000
– Preference elicitation
» Boutilier, 2002
POMDP Model
A POMDP is a tuple { S, A, Ω, T, O, R }:
S = state set
A = action set
Ω = observation set
T(s,a,s') = state-to-state transition probabilities
O(s,a,o) = observation generation probabilities
R(s,a) = reward function
[Figure: what goes on: states s_{t-1} → s_t with actions a_{t-1}, a_t; what we see: observations o_{t-1}, o_t; what we infer: beliefs b_{t-1}, b_t]
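The belief b_t on the slide above is computed from b_{t-1} by a Bayes filter over the model components T and O. A minimal sketch, under array conventions of my own choosing (the shapes and the toy numbers are illustrative assumptions, not from the talk):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes filter: b'(s') ∝ O(s',a,o) * sum_s T(s,a,s') * b(s).

    b: belief over states, shape (S,)
    T: transition probabilities, shape (A, S, S), T[a,s,s'] = P(s'|s,a)
    O: observation probabilities, shape (A, S, nO), O[a,s',o] = P(o|s',a)
    """
    predicted = b @ T[a]                 # sum_s b(s) T(s,a,s')
    unnormalized = O[a][:, o] * predicted
    return unnormalized / unnormalized.sum()

# Tiny 2-state example (illustrative numbers): 1 action, 2 observations.
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])
b = np.array([0.5, 0.5])
b1 = belief_update(b, a=0, o=0, T=T, O=O)
```

The normalizing constant `unnormalized.sum()` is exactly P(o | b, a), which is also what makes the belief a sufficient statistic for the observable history.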
Understanding the belief state
• A belief is a probability distribution over states, where Dim(B) = |S|-1
– E.g. S={s1, s2}: beliefs lie on a segment parameterized by P(s1)
– S={s1, s2, s3}: a triangle in the (P(s1), P(s2)) plane
– S={s1, s2, s3, s4}: a tetrahedron over (P(s1), P(s2), P(s3))
[Figure: belief simplices for |S| = 2, 3, 4]
The first curse of POMDP planning
• The curse of dimensionality:
– dimension of planning problem = # of states
– related to the MDP curse of dimensionality
POMDP value functions
V(b) = expected total discounted future reward starting from b
• Represent V as the upper surface of a set of hyper-planes.
• V is piecewise-linear and convex.
• Backup operator T: V → TV
V(b) = max_{a∈A} [ R(b,a) + γ Σ_{b'∈B} T(b,a,b') V(b') ]
[Figure: V(b) as the upper surface of linear segments over P(s1)]
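The piecewise-linear convex surface above is just a max over dot products with a set of alpha-vectors. A small illustrative sketch (the alpha-vectors are made-up numbers, not from the talk):

```python
import numpy as np

# A hypothetical value function for a 2-state POMDP, stored as a set of
# alpha-vectors (one entry per state).  V(b) is the upper surface:
#     V(b) = max_alpha (alpha . b)
alphas = np.array([[1.0, 0.0],    # best when P(s2) is high... i.e. b = (1,0) side
                   [0.5, 0.5],    # flat plane
                   [0.0, 1.0]])   # best on the opposite corner

def V(b):
    # max over linear functions -> piecewise-linear and convex in b
    return float(np.max(alphas @ b))

v_mid = V(np.array([0.5, 0.5]))     # all three planes tie here
v_corner = V(np.array([0.9, 0.1]))  # the first plane dominates
```

Each alpha-vector is the gradient of V on the region of belief space where it attains the max, which is what PBVI later exploits to generalize beyond the sampled points.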
Exact value iteration for POMDPs
• Simple problem: |S|=2, |A|=3, |Ω|=2
• Without pruning, the number of hyper-planes explodes:

Iteration   # hyper-planes
    0            1
    1            3
    2            27
    3            2187
    4            14,348,907

[Figure: value functions V0(b) through V2(b) plotted over P(s1)]
• Many hyper-planes can be pruned away:

Iteration   # hyper-planes
    0            1
    1            3
    2            5
    3            9
    4            7
    5            13
   10            27
   15            47
   20            59
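The unpruned counts in the table follow the recurrence |V_{n+1}| = |A|·|V_n|^|Ω|: each new hyper-plane picks one action and, for each of the |Ω| observations, one old hyper-plane. A quick check of the arithmetic:

```python
# Number of hyper-planes after an exact backup, before pruning:
#     |V_{n+1}| = |A| * |V_n| ** |Omega|
# Reproducing the slide's table for |A| = 3, |Omega| = 2:
A, Omega = 3, 2
counts = [1]
for _ in range(4):
    counts.append(A * counts[-1] ** Omega)
print(counts)   # [1, 3, 27, 2187, 14348907]
```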
Is pruning sufficient?
|S|=20, |A|=6, |Ω|=8

Iteration   # hyper-planes
    0            1
    1            5
    2            213
    3            ?????
…
Not for this problem!
Certainly not for this problem!
[Figure: map of the environment showing the physiotherapy room, patient room, and robot home]
|S|=576, |A|=19, |Ω|=17
State features: {RobotLocation, ReminderGoal, UserLocation, UserMotionGoal, UserStatus, UserSpeechGoal}
The second curse of POMDP planning
• The curse of dimensionality:
– the dimension of each hyper-plane = # of states
• The curse of history:
– the number of hyper-planes grows exponentially with the planning horizon

Complexity of POMDP value iteration:  |S|² · |A| · |Vn|^|Ω|
(the |S|² term is the curse of dimensionality; the |Vn|^|Ω| term is the curse of history)
Possible approximation approaches
• Ignore the belief: [Littman et al., 1995]
– overcomes both curses
– very fast
– performs poorly in high-entropy beliefs
• Discretize the belief: [Lovejoy, 1991; Brafman, 1997; Hauskrecht, 1998; Zhou & Hansen, 2001]
– overcomes the curse of history (sort of)
– scales exponentially with # states
• Compress the belief: [Poupart & Boutilier, 2002; Roy & Gordon, 2002]
– overcomes the curse of dimensionality
• Plan for trajectories: [Baxter & Bartlett, 2000; Ng & Jordan, 2002]
– can diminish both curses
– requires restricted policy class
– local minima, small gradients
A new algorithm: Point-based value iteration
• Main idea:
– Select a small set of belief points → Focus on reachable beliefs
– Plan for those belief points only → Learn value and its gradient
[Figure: value function V(b) over P(s1) with belief points b0, b1, b2; arrows labeled (a,o) show reachability between the points]
Point-based value update
• Initialize the value function (…and skip ahead a few iterations)
• For each b ∈ B:
– For each (a,o): Project forward b → b^{a,o} and find best value:
      α_b^{a,o} = argmax_{α ∈ Vn} (α · b^{a,o})
– Sum over observations:
      α_b^a(s) = R(s,a) + γ Σ_{o,s'} T(s,a,s') O(s',a,o) α_b^{a,o}(s')
– Max over actions:
      V_{n+1} ← argmax_{a ∈ A} (α_b^a · b)
[Figure: projected beliefs b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2} of a belief b; the candidate vectors α_b^{a1}, α_b^{a2}; and the resulting V_{n+1}(b) over belief points b0, b1, b2]
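The three steps of the update can be sketched in NumPy. This is a minimal reconstruction under my own array conventions (the shapes, names, and toy model are assumptions, not the authors' code):

```python
import numpy as np

def pbvi_backup(B, alphas, T, O, R, gamma=0.95):
    """One point-based backup.  Illustrative conventions:
       B:      (nB, S)    belief points
       alphas: (nV, S)    current value function V_n
       T:      (A, S, S)  T[a,s,s'] = P(s'|s,a)
       O:      (A, S, nO) O[a,s',o] = P(o|s',a)
       R:      (A, S)     R[a,s]
    Returns one new alpha-vector per belief point."""
    A = T.shape[0]
    nO = O.shape[2]
    new_alphas = []
    for b in B:
        best_val, best_vec = -np.inf, None
        for a in range(A):
            # Steps 1-2: for each o, project every alpha back through (a,o),
            # keep the one maximizing value at b, then sum over observations.
            alpha_ab = R[a].astype(float)
            for o in range(nO):
                # proj[s, v] = gamma * sum_{s'} T[a,s,s'] O[a,s',o] alphas[v,s']
                proj = gamma * (T[a] * O[a][:, o]) @ alphas.T
                best = np.argmax(b @ proj)       # best alpha for this (a,o)
                alpha_ab = alpha_ab + proj[:, best]
            # Step 3: max over actions at this belief point
            val = b @ alpha_ab
            if val > best_val:
                best_val, best_vec = val, alpha_ab
        new_alphas.append(best_vec)
    return np.vstack(new_alphas)

# Tiny sanity check (1 action, 1 observation): with V_n = {0}, the backup
# reduces to the immediate reward vector.
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])
O = np.ones((1, 2, 1))
R = np.array([[1.0, 0.0]])
B = np.array([[0.5, 0.5]])
new = pbvi_backup(B, np.zeros((1, 2)), T, O, R)
```

Note the key saving over exact value iteration: the max over projected alpha-vectors is taken per belief point, so the output has at most |B| vectors instead of |A|·|Vn|^|Ω|.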
Complexity of value update

                  Exact Update       Point-based Update
I - Projection    S²·A·Ω·Γn          S²·A·Ω·B
II - Sum          S·A·Γn^Ω           S·A·Ω·B²
III - Max         S·A·Γn^Ω           S·A·B

where: S = # states, A = # actions, Ω = # observations, Γn = # solution vectors at iteration n, B = # belief points
A bound on the approximation error
• Bound the error of the point-based backup operator.
• The bound depends on how densely we sample belief points.
– Let Δ̄ be the set of reachable beliefs.
– Let B be the set of belief points.

Theorem: For any belief set B and any horizon n, the error of the PBVI algorithm εn = ||Vn^B - Vn*|| is bounded by:

εn ≤ (Rmax - Rmin) · δB / (1 - γ)²

where δB = max_{b'∈Δ̄} min_{b∈B} ||b - b'||₁
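The quantities in the bound are easy to compute for a given belief set. A small sketch (the sample beliefs are invented for illustration):

```python
import numpy as np

def delta_B(reachable, B):
    """Sample density of belief set B w.r.t. the reachable beliefs (L1 metric):
       delta_B = max_{b' in reachable} min_{b in B} ||b - b'||_1"""
    dists = np.abs(reachable[:, None, :] - B[None, :, :]).sum(axis=2)
    return float(dists.min(axis=1).max())

def error_bound(delta, r_max, r_min, gamma):
    # epsilon_n <= (Rmax - Rmin) * delta_B / (1 - gamma)^2
    return (r_max - r_min) * delta / (1.0 - gamma) ** 2

B = np.array([[1.0, 0.0], [0.0, 1.0]])          # corners of the simplex
reachable = np.array([[0.5, 0.5], [0.8, 0.2]])  # hypothetical reachable beliefs
d = delta_B(reachable, B)                       # worst case is b' = (0.5, 0.5)
```

Since δB shrinks as B grows denser over the reachable set, the bound directly motivates the belief-selection heuristic that comes next: add points that reduce the worst-case L1 gap.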
Experimental results: Lasertag domain
State space = RobotPosition × OpponentPosition
Observable: RobotPosition – always; OpponentPosition – only if same as Robot
Action space = {North, South, East, West, Tag}
Opponent strategy: move away from robot w/ Pr=0.8
|S|=870, |A|=5, |Ω|=30
Performance of PBVI on Lasertag domain
[Figure: performance curves, annotated "Opponent tagged 70% of trials" and "Opponent tagged 17% of trials"]
Performance on well-known POMDPs

Maze33 (|S|=36, |A|=5, |Ω|=17):
Method   Reward   Time(s)   |B|
QMDP     0.198    0.19      -
Grid     0.94     n.v.      174
PBUA     2.30     12166     660
PBVI     2.25     3448      470

Hallway (|S|=60, |A|=5, |Ω|=20):
Method   Reward   Time(s)   |B|    %Goal
QMDP     0.261    0.51      -      47
Grid     n.v.     n.v.      n.v.   n.v.
PBUA     0.53     450       300    100
PBVI     0.53     288       86     95

Hallway2 (|S|=92, |A|=5, |Ω|=17):
Method   Reward   Time(s)   |B|    %Goal
QMDP     0.109    1.44      -      22
Grid     n.v.     n.v.      337    98
PBUA     0.35     27898     1840   100
PBVI     0.34     360       95     98
Selecting good belief points
• What can we learn from policy search methods?
– Focus on reachable beliefs.
• How can we avoid including all reachable beliefs?
– Reachability analysis considers all actions, but stochastic observation choice.
• What can we learn from our error bound?
– Select widely-spaced beliefs, rather than near-by beliefs:
      εn(B) ≤ (Rmax - Rmin)/(1 - γ)² · max_{b'} min_{b∈B} ||b - b'||₁
[Figure: one-step successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2} of a belief b over P(s1); only the sampled successors are kept]
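One way to implement the expansion heuristic sketched above: for each current belief, simulate one step forward for every action (sampling the observation stochastically), and keep the candidate successor farthest from the current set. This is a hypothetical implementation; the simulation and distance conventions are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def successor(b, a, o, T, O):
    # Bayes filter: b^{a,o}(s') ∝ O(s',a,o) * sum_s T(s,a,s') b(s)
    u = O[a][:, o] * (b @ T[a])
    return u / u.sum() if u.sum() > 0 else None

def expand(B, T, O):
    """For each b in B: try every action with a sampled observation and
    keep the successor farthest (L1) from the current set B."""
    new_points = []
    for b in B:
        candidates = []
        for a in range(T.shape[0]):
            s_next = b @ T[a]                            # predicted state dist.
            o = rng.choice(O.shape[2], p=s_next @ O[a])  # sample an observation
            nb = successor(b, a, o, T, O)
            if nb is not None:
                candidates.append(nb)
        dist = lambda c: min(np.abs(c - bb).sum() for bb in B)
        new_points.append(max(candidates, key=dist))
    return np.vstack([B] + new_points)                   # B doubles in size

# Deterministic toy model: identity transitions, fully observable states,
# so expanding the uniform belief yields a corner of the simplex.
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])
O = np.eye(2).reshape(1, 2, 2)
B2 = expand(np.array([[0.5, 0.5]]), T, O)
```

Sampling the observation (rather than branching on all of them) is what keeps the expansion linear in |B| per phase while still staying inside the reachable belief set.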
Validation of the belief expansion heuristic
• Hallway domain: |S|=60, |A|=5, |Ω|=20
Validation of the belief expansion heuristic
• Tag domain: |S|=870, |A|=5, |Ω|=30
The anytime PBVI algorithm
• Alternate between:
– Growing the set of belief points (e.g. B doubles in size every time)
– Planning for those belief points
• Terminate when you run out of time or have a good policy.
• Lasertag results:
– 13 phases: |B|=1334
– ran out of time!
• Hallway2 results:
– 8 phases: |B|=95
– found good policy.
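The alternation described above is just a loop. A skeleton with hypothetical callbacks (`improve`, `expand`, and `good_enough` stand in for the planning, belief-expansion, and policy-quality checks; they are placeholders, not the authors' API):

```python
import time

def anytime_pbvi(B0, improve, expand, time_budget_s, good_enough):
    """Anytime loop: alternate value updates on the current belief set with
    belief-set growth, until the time budget runs out or the policy is good."""
    B, V = B0, None
    start = time.monotonic()
    phases = 0
    while time.monotonic() - start < time_budget_s:
        V = improve(B, V)      # run point-based backups on the current set B
        if good_enough(V):
            break              # found a good policy
        B = expand(B)          # grow the set, e.g. B doubles each phase
        phases += 1
    return V, B, phases

# Toy run with dummy callbacks: the "value" is just |B|; stop once |B| >= 4.
V, B, phases = anytime_pbvi([0], lambda B, V: len(B),
                            lambda B: B + B, 5.0, lambda V: V >= 4)
```

Because the value function is reusable across phases, each doubling of B refines the previous solution rather than starting from scratch, which is what makes the algorithm genuinely anytime.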
Summary
• POMDPs suffer from the curse of history:
» # of beliefs grows exponentially with the planning horizon
• PBVI addresses the curse of history by limiting planning to a small set of likely beliefs.
• Strengths of PBVI include:
» anytime algorithm;
» polynomial-time value updates;
» bounded approximation error;
» empirical results showing we can solve problems up to 870 states.
Recent work
• Current hurdle to solving even larger POMDPs:
PBVI complexity is O(S²AΩB + SAΩB²)
– Addressing S²:
» Combine PBVI with belief compression techniques.
» But sparse transition matrices mean: S² → S
– Addressing B²:
» Use ball-trees to structure belief points.
» Find better belief selection heuristics.