Planning and Acting in Partially Observable Stochastic Domains

Leslie Pack Kaelbling*, Michael L. Littman**, Anthony R. Cassandra***
*Computer Science Department, Brown University, Providence, RI, USA
**Department of Computer Science, Duke University, Durham, NC, USA
***Microelectronics and Computer Technology Corporation (MCC), Austin, TX, USA
Artificial Intelligence, 1998

Minsoo Kang, February 6th, 2018
Partially Observable Markov Decision Process: Basics

               | Observable      | Partially Observable
  No actions   | Markov Process  | Hidden Markov Model
  Actions      | MDP             | POMDP
Given (as in a common MDP):
  S: set of states
  A: finite set of actions
  R: reward function
  P: transition probabilities

Added for the POMDP:
  Ω: finite set of observations
  O: observation function, the conditional observation probabilities O(o | s', a)
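As a concrete illustration, the tuple above can be held in a small Python structure; this is a sketch, and the field names and array layout are my own choices, not from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    """A finite POMDP (S, A, P, R, Omega, O) stored as index-based arrays."""
    P: np.ndarray  # transitions, shape (|A|, |S|, |S|): P[a, s, s'] = P(s' | s, a)
    R: np.ndarray  # rewards, shape (|A|, |S|): R[a, s] = r(s, a)
    O: np.ndarray  # observation fn, shape (|A|, |S|, |O|): O[a, s', o] = O(o | s', a)
```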
POMDP: Belief State & Value Function
Belief State: a probability distribution over the states of the underlying MDP (it satisfies the Markov property). The update from belief b to belief b' after taking action a and observing o is

  b'(s') = P(s' | o, a, b) = O(o | s', a) · Σ_s P(s' | s, a) · b(s) / P(o | a, b)

Value Function:

  V(b) = max_a [ r(b, a) + γ · Σ_o P(o | a, b) · V(b'_a,o) ]

Ex) Possible state probabilities for |S| = 3:
  b(s1, s2, s3)  = (0.3, 0.4, 0.3):  b(s1) = 0.3,  b(s2) = 0.4,  b(s3) = 0.3
  b'(s1, s2, s3) = (0.1, 0.2, 0.7):  b'(s1) = 0.1, b'(s2) = 0.2, b'(s3) = 0.7
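A minimal sketch of the belief update above, reusing the POMDP structure from the previous block and assuming states, actions, and observations are integer indices:

```python
def belief_update(m: POMDP, b: np.ndarray, a: int, o: int) -> np.ndarray:
    """SE(b, a, o): Bayes-filter update from belief b to belief b'."""
    # Unnormalized: b'(s') = O(o | s', a) * sum_s P(s' | s, a) * b(s)
    b_next = m.O[a, :, o] * (b @ m.P[a])
    norm = b_next.sum()          # this normalizer is exactly P(o | a, b)
    if norm == 0.0:
        raise ValueError("observation o is impossible under (b, a)")
    return b_next / norm
```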
Belief State

|S| = 2, |B| = ∞

MDP value iteration cannot be applied directly, since there are infinitely many states (beliefs): the belief space is continuous. Moreover, unlike in an MDP, the optimal policy at each time period is non-stationary (time-variant).
Belief States: Example for Larger Dimensions

Value Function for Belief State
State Estimator SE(b, a, o), where

  P(b' | b, a, o) = 1  if SE(b, a, o) = b'
  P(b' | b, a, o) = 0  otherwise

so the state estimator is binary (deterministic).

Sondik (1971): the updated belief and the value are built componentwise from per-state quantities:

  b' = ( P(s1 | o, a, b), P(s2 | o, a, b), … )
  V(b') is computed from the per-state values ( V(s1, a), V(s2, a), … )
POMDP: How to Solve? (Sondik 1971, Littman 1998)
Generalized Form

Let V_p(s) be the value of executing the t-step policy tree p starting from state s. Then, for a belief b,

  V_p(b) = Σ_s b(s) · V_p(s)

(Letting P be the finite set of t-step policy trees yields V_t(b) = max_{p ∈ P} V_p(b).)

Since V_t(b) is the maximum of finitely many linear functions of b, it can be represented geometrically as a piecewise-linear and convex value function.

The upper surface is the V_t(b) we are interested in, and each line segment indicates which action to take in the corresponding region of belief states.
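Each policy tree thus contributes one linear function (an "alpha vector" of per-state values), and V_t(b) is their upper surface. A sketch of evaluating such a representation (function names are mine):

```python
import numpy as np

def pwlc_value(alphas: list[np.ndarray], b: np.ndarray) -> float:
    """V_t(b) = max_p b . alpha_p -- the upper surface of the lines."""
    return max(float(alpha @ b) for alpha in alphas)

def pwlc_action(alphas: list[np.ndarray], actions: list, b: np.ndarray):
    """The root action of the policy tree whose line is highest at b."""
    best = max(range(len(alphas)), key=lambda i: float(alphas[i] @ b))
    return actions[best]
```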
POMDP: How to Solve? (Sondik 1971, Littman 1998)
1. Construct the one-step policy trees (just one action each): a1, a2.

The value function (not yet optimal) is calculated as below, where p0 is the probability of being in state 0:

  V_a1(b) = 2 · p0 + 0 · (1 - p0)   (reward for taking a1: 2 in state 0, 0 in state 1)
  V_a2(b) = 0 · p0 + 3 · (1 - p0)   (reward for taking a2: 0 in state 0, 3 in state 1)
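With the rewards on this slide, each one-step tree is just an alpha vector of immediate rewards; a quick check of the two lines:

```python
import numpy as np

alpha_a1 = np.array([2.0, 0.0])  # r(state 0, a1) = 2, r(state 1, a1) = 0
alpha_a2 = np.array([0.0, 3.0])  # r(state 0, a2) = 0, r(state 1, a2) = 3

p0 = 0.5                         # probability of being in state 0
b = np.array([p0, 1.0 - p0])
print(max(alpha_a1 @ b, alpha_a2 @ b))  # V1(b) = max(2*p0, 3*(1 - p0)) = 1.5
```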
POMDP: How to Solve? (Sondik 1971, Littman 1998)
2. Extend this to a 2-step time horizon: enumerate every possible 2-step policy tree and evaluate it with the value-function update.

3. Prune the value functions that are dominated by other value functions.

Given an action a at the root and a subtree p_o for each observation, the value function of a 2-step tree is

  V_p(b) = r(b, a) + γ · Σ_o P(o | a, b) · V_p_o(b'_a,o)

(In the figure, the light-blue lines are the ones pruned.)
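A hedged sketch of the pruning step. Full pruning, as in the paper, solves a linear program per vector to detect domination by the whole upper surface; the cheaper pointwise test below removes only vectors dominated by a single other vector:

```python
import numpy as np

def prune_pointwise(alphas: list[np.ndarray]) -> list[np.ndarray]:
    """Drop every alpha vector that is pointwise-dominated by another one.
    A conservative stand-in for the LP-based pruning used in practice."""
    kept = []
    for i, a_i in enumerate(alphas):
        dominated = any(
            j != i and np.all(a_j >= a_i) and np.any(a_j > a_i)
            for j, a_j in enumerate(alphas)
        )
        if not dominated:
            kept.append(a_i)
    return kept
```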
Example Problem
Example from Prof. Wolfram Burgard's lecture notes (Department of Computer Science, University of Freiburg).

Given: an action set, an observation set, a state set, rewards (costs), transition probabilities, and an observation function (no discount factor).
Example Problem
Let p1 be the probability of being in x1, so that b = (p1, 1 - p1). Then:

  r(b, a1) = -100 · p1 + 100 · (1 - p1)
  r(b, a2) =  100 · p1 -  50 · (1 - p1)
  r(b, a3) = -1

The 1-step horizon value function is V1(b), the maximum of the three lines above.
Example Problem
(The line for a3 is pruned: it lies below the upper surface everywhere.)

Optimal policy for the 1-step horizon:
  a1 if p1 < 3/7
  a2 if p1 ≥ 3/7

(The threshold is where the a1 and a2 lines cross: -200 · p1 + 100 = 150 · p1 - 50 gives p1 = 3/7.)
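A small check of this threshold, using the reward lines above:

```python
import numpy as np

alphas = {
    "a1": np.array([-100.0, 100.0]),  # r(x1, a1), r(x2, a1)
    "a2": np.array([ 100.0, -50.0]),
    "a3": np.array([  -1.0,  -1.0]),
}

def V1(p1: float) -> float:
    b = np.array([p1, 1.0 - p1])
    return max(float(a @ b) for a in alphas.values())

# a1 and a2 cross where -200*p1 + 100 = 150*p1 - 50, i.e. p1 = 3/7:
print(V1(3 / 7))  # both terminal lines give 100/7 here; a3 is dominated
```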
Example Problem
We now extend the time horizon to t = 2 and, by backward induction, consider V1 first. Suppose observation o1 is made: the belief is updated by the state estimator, and the term P(o1 | b) · V1(b'_o1) is again a maximum of linear functions of b, because the P(o1 | b) normalizer cancels.

If we do this similarly with o2 as well, summing over the observations gives the expected value of acting optimally for one more step:

  E_o[V1(b')] = Σ_o P(o | b) · V1(b'_o)
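A sketch of this observation backup. Note: the measurement probabilities below (P(o1 | x1) = 0.7, P(o1 | x2) = 0.3) are my assumption, since the slide's table of givens did not survive extraction; substitute the actual values from the lecture notes.

```python
import numpy as np

# ASSUMED measurement model (illustrative -- the original table is missing):
# rows are observations o1, o2; columns are states x1, x2.
P_o_given_x = np.array([[0.7, 0.3],
                        [0.3, 0.7]])

def expected_V1(b: np.ndarray, alphas: list[np.ndarray]) -> float:
    """E_o[V1(b')] = sum_o P(o | b) * V1(b'_o).
    The P(o | b) normalizer cancels, so each summand is a max of
    observation-reweighted alpha vectors."""
    total = 0.0
    for o in range(P_o_given_x.shape[0]):
        total += max(float((alpha * P_o_given_x[o]) @ b) for alpha in alphas)
    return total
```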
Example Problem
After expanding and pruning the dominated combinations, the result is again a piecewise-linear function of the belief.

The game ends when a1 or a2 is chosen at this point, since those actions are terminal. However, it is also possible that choosing a3 first is optimal, so we have to confirm whether it gives a higher value. So let the first action be a3; then there is a shift in the belief state, governed by a3's transition probabilities.
Example Problem
Given that a3 is chosen first, we add its cost and propagate the belief through a3's transition model, yielding

  V2(b) = max( r(b, a1), r(b, a2), r(b, a3) + E_o[V1(b'_a3)] )

i.e., the maximum of the value functions at t = 2, given the belief state.
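Continuing the previous sketch, the full two-step backup with the belief shift under a3; the a3 transition probabilities are again my assumption, labeled as such:

```python
import numpy as np

# ASSUMED transition model for a3 (illustrative): P_a3[x, x'] = P(x' | x, a3).
P_a3 = np.array([[0.2, 0.8],
                 [0.8, 0.2]])

terminal_alphas = [np.array([-100.0, 100.0]),   # a1
                   np.array([ 100.0, -50.0])]   # a2

def V2(p1: float) -> float:
    b = np.array([p1, 1.0 - p1])
    v_terminal = max(float(a @ b) for a in terminal_alphas)
    # a3 first: pay its cost of -1, shift the belief, then sense and act once.
    b_shifted = b @ P_a3
    v_a3_first = -1.0 + expected_V1(b_shifted, terminal_alphas)
    return max(v_terminal, v_a3_first)
```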
POMDP: Conclusion
• Pruning is crucial in lessening the combinatorial explosion. In the example above, the unpruned algorithm needs on the order of 10^547,864 linear equations by t = 20, whereas only 12 equations are needed to represent the pruned value function.
• Research shows that POMDP solutions perform better than MDP approximations in many contexts (with small state, action, and observation sets).
• However, solving a finite-horizon POMDP is PSPACE-complete, and the infinite-horizon problem is undecidable. (A polynomial-time algorithm for POMDPs would therefore imply P = PSPACE, and in particular settle the P = NP problem.)
• Thus there are many value-function approximation methods, which can be helpful, but exact solving remains limited to very small problems.
Thank you