Planning and Acting in Partially Observable Stochastic Domains

Leslie Pack Kaelbling*, Michael L. Littman**, Anthony R. Cassandra***
*Computer Science Department, Brown University, Providence, RI, USA
**Department of Computer Science, Duke University, Durham, NC, USA
***Microelectronics and Computer Technology Corporation (MCC), Austin, TX, USA
Artificial Intelligence, 1998

Minsoo Kang, February 6th, 2018
Partially Observable Markov Decision Process: Basics

               | Observable      | Partially Observable
  No actions   | Markov Process  | Hidden Markov Model
  Actions      | MDP             | POMDP
Given (as in a common MDP):
  S: set of states
  A: finite set of actions
  R: reward function
  P: transition probabilities

Added for the POMDP:
  Ω: finite set of observations
  O: observation function, the conditional observation probabilities O(o | s', a)
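As a concrete illustration, the tuple above can be held in a small Python structure; this is a sketch, and the field names and array layout are my own choices, not from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    """A finite POMDP (S, A, P, R, Omega, O) stored as index-based arrays."""
    P: np.ndarray  # transitions, shape (|A|, |S|, |S|): P[a, s, s'] = P(s' | s, a)
    R: np.ndarray  # rewards, shape (|A|, |S|): R[a, s] = r(s, a)
    O: np.ndarray  # observation fn, shape (|A|, |S|, |O|): O[a, s', o] = O(o | s', a)
```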
POMDP: Belief State & Value Function
Belief State: a probability distribution over the states of the underlying MDP (it satisfies the Markov property). The update from belief b to belief b' after taking action a and observing o is

  b'(s') = P(s' | o, a, b) = O(o | s', a) · Σ_s P(s' | s, a) · b(s) / P(o | a, b)

Value Function:

  V(b) = max_a [ r(b, a) + γ · Σ_o P(o | a, b) · V(b'_a,o) ]

Ex) Possible state probabilities for |S| = 3:
  b(s1, s2, s3)  = (0.3, 0.4, 0.3):  b(s1) = 0.3,  b(s2) = 0.4,  b(s3) = 0.3
  b'(s1, s2, s3) = (0.1, 0.2, 0.7):  b'(s1) = 0.1, b'(s2) = 0.2, b'(s3) = 0.7
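A minimal sketch of the belief update above, reusing the POMDP structure from the previous block and assuming states, actions, and observations are integer indices:

```python
def belief_update(m: POMDP, b: np.ndarray, a: int, o: int) -> np.ndarray:
    """SE(b, a, o): Bayes-filter update from belief b to belief b'."""
    # Unnormalized: b'(s') = O(o | s', a) * sum_s P(s' | s, a) * b(s)
    b_next = m.O[a, :, o] * (b @ m.P[a])
    norm = b_next.sum()          # this normalizer is exactly P(o | a, b)
    if norm == 0.0:
        raise ValueError("observation o is impossible under (b, a)")
    return b_next / norm
```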
Belief State

|S| = 2, |B| = ∞

MDP value iteration cannot be applied directly, since there are infinitely many states (beliefs): the belief space is continuous. Moreover, unlike in an MDP, the optimal policy at each time period is non-stationary (time-variant).
Belief States: Example for Larger Dimensions

Value Function for Belief State
State Estimator SE(b, a, o), where

  P(b' | b, a, o) = 1  if SE(b, a, o) = b'
  P(b' | b, a, o) = 0  otherwise

so the state estimator is binary (deterministic).

Sondik (1971): the updated belief and the value are built componentwise from per-state quantities:

  b' = ( P(s1 | o, a, b), P(s2 | o, a, b), … )
  V(b') is computed from the per-state values ( V(s1, a), V(s2, a), … )
POMDP: How to Solve? (Sondik 1971, Littman 1998)
Generalized Form

Let V_p(s) be the value of executing the t-step policy tree p starting from state s. Then, for a belief b,

  V_p(b) = Σ_s b(s) · V_p(s)

(Letting P be the finite set of t-step policy trees yields V_t(b) = max_{p ∈ P} V_p(b).)

Since V_t(b) is the maximum of finitely many linear functions of b, it can be represented geometrically as a piecewise-linear and convex value function.

The upper surface is the V_t(b) we are interested in, and each line segment indicates which action to take in the corresponding region of belief states.
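Each policy tree thus contributes one linear function (an "alpha vector" of per-state values), and V_t(b) is their upper surface. A sketch of evaluating such a representation (function names are mine):

```python
import numpy as np

def pwlc_value(alphas: list[np.ndarray], b: np.ndarray) -> float:
    """V_t(b) = max_p b . alpha_p -- the upper surface of the lines."""
    return max(float(alpha @ b) for alpha in alphas)

def pwlc_action(alphas: list[np.ndarray], actions: list, b: np.ndarray):
    """The root action of the policy tree whose line is highest at b."""
    best = max(range(len(alphas)), key=lambda i: float(alphas[i] @ b))
    return actions[best]
```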
POMDP: How to Solve? (Sondik 1971, Littman 1998)
1. Construct the one-step policy trees (just one action each): a1, a2.

The value function (not yet optimal) is calculated as below, where p0 is the probability of being in state 0:

  V_a1(b) = 2 · p0 + 0 · (1 - p0)   (reward for taking a1: 2 in state 0, 0 in state 1)
  V_a2(b) = 0 · p0 + 3 · (1 - p0)   (reward for taking a2: 0 in state 0, 3 in state 1)
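With the rewards on this slide, each one-step tree is just an alpha vector of immediate rewards; a quick check of the two lines:

```python
import numpy as np

alpha_a1 = np.array([2.0, 0.0])  # r(state 0, a1) = 2, r(state 1, a1) = 0
alpha_a2 = np.array([0.0, 3.0])  # r(state 0, a2) = 0, r(state 1, a2) = 3

p0 = 0.5                         # probability of being in state 0
b = np.array([p0, 1.0 - p0])
print(max(alpha_a1 @ b, alpha_a2 @ b))  # V1(b) = max(2*p0, 3*(1 - p0)) = 1.5
```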
POMDP: How to Solve? (Sondik 1971, Littman 1998)
2. Extend this to a 2-step time horizon: enumerate every possible 2-step policy tree and evaluate it with the value-function update.

3. Prune the value functions that are dominated by other value functions.

Given an action a at the root and a subtree p_o for each observation, the value function of a 2-step tree is

  V_p(b) = r(b, a) + γ · Σ_o P(o | a, b) · V_p_o(b'_a,o)

(In the figure, the light-blue lines are the ones pruned.)
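A hedged sketch of the pruning step. Full pruning, as in the paper, solves a linear program per vector to detect domination by the whole upper surface; the cheaper pointwise test below removes only vectors dominated by a single other vector:

```python
import numpy as np

def prune_pointwise(alphas: list[np.ndarray]) -> list[np.ndarray]:
    """Drop every alpha vector that is pointwise-dominated by another one.
    A conservative stand-in for the LP-based pruning used in practice."""
    kept = []
    for i, a_i in enumerate(alphas):
        dominated = any(
            j != i and np.all(a_j >= a_i) and np.any(a_j > a_i)
            for j, a_j in enumerate(alphas)
        )
        if not dominated:
            kept.append(a_i)
    return kept
```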
Example Problem
Example from Prof. Wolfram Burgard's lecture notes (Department of Computer Science, University of Freiburg).

Given: an action set, an observation set, a state set, rewards (costs), transition probabilities, and an observation function (no discount factor).
Example Problem
Let p1 be the probability of being in x1, so that b = (p1, 1 - p1). Then:

  r(b, a1) = -100 · p1 + 100 · (1 - p1)
  r(b, a2) =  100 · p1 -  50 · (1 - p1)
  r(b, a3) = -1

The 1-step horizon value function is V1(b), the maximum of the three lines above.
Example Problem
(The line for a3 is pruned: it lies below the upper surface everywhere.)

Optimal policy for the 1-step horizon:
  a1 if p1 < 3/7
  a2 if p1 ≥ 3/7

(The threshold is where the a1 and a2 lines cross: -200 · p1 + 100 = 150 · p1 - 50 gives p1 = 3/7.)
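A small check of this threshold, using the reward lines above:

```python
import numpy as np

alphas = {
    "a1": np.array([-100.0, 100.0]),  # r(x1, a1), r(x2, a1)
    "a2": np.array([ 100.0, -50.0]),
    "a3": np.array([  -1.0,  -1.0]),
}

def V1(p1: float) -> float:
    b = np.array([p1, 1.0 - p1])
    return max(float(a @ b) for a in alphas.values())

# a1 and a2 cross where -200*p1 + 100 = 150*p1 - 50, i.e. p1 = 3/7:
print(V1(3 / 7))  # both terminal lines give 100/7 here; a3 is dominated
```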
Example Problem
We now extend the time horizon to t = 2 and, by backward induction, consider V1 first. Suppose observation o1 is made: the belief is updated by the state estimator, and the term P(o1 | b) · V1(b'_o1) is again a maximum of linear functions of b, because the P(o1 | b) normalizer cancels.

If we do this similarly with o2 as well, summing over the observations gives the expected value of acting optimally for one more step:

  E_o[V1(b')] = Σ_o P(o | b) · V1(b'_o)
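A sketch of this observation backup. Note: the measurement probabilities below (P(o1 | x1) = 0.7, P(o1 | x2) = 0.3) are my assumption, since the slide's table of givens did not survive extraction; substitute the actual values from the lecture notes.

```python
import numpy as np

# ASSUMED measurement model (illustrative -- the original table is missing):
# rows are observations o1, o2; columns are states x1, x2.
P_o_given_x = np.array([[0.7, 0.3],
                        [0.3, 0.7]])

def expected_V1(b: np.ndarray, alphas: list[np.ndarray]) -> float:
    """E_o[V1(b')] = sum_o P(o | b) * V1(b'_o).
    The P(o | b) normalizer cancels, so each summand is a max of
    observation-reweighted alpha vectors."""
    total = 0.0
    for o in range(P_o_given_x.shape[0]):
        total += max(float((alpha * P_o_given_x[o]) @ b) for alpha in alphas)
    return total
```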
Example Problem
After expanding and pruning the dominated combinations, the result is again a piecewise-linear function of the belief.

The game ends when a1 or a2 is chosen at this point, since those actions are terminal. However, it is also possible that choosing a3 first is optimal, so we have to confirm whether it gives a higher value. So let the first action be a3; then there is a shift in the belief state, governed by a3's transition probabilities.
Example Problem
Given that a3 is chosen first, we add its cost and propagate the belief through a3's transition model, yielding

  V2(b) = max( r(b, a1), r(b, a2), r(b, a3) + E_o[V1(b'_a3)] )

i.e., the maximum of the value functions at t = 2, given the belief state.
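Continuing the previous sketch, the full two-step backup with the belief shift under a3; the a3 transition probabilities are again my assumption, labeled as such:

```python
import numpy as np

# ASSUMED transition model for a3 (illustrative): P_a3[x, x'] = P(x' | x, a3).
P_a3 = np.array([[0.2, 0.8],
                 [0.8, 0.2]])

terminal_alphas = [np.array([-100.0, 100.0]),   # a1
                   np.array([ 100.0, -50.0])]   # a2

def V2(p1: float) -> float:
    b = np.array([p1, 1.0 - p1])
    v_terminal = max(float(a @ b) for a in terminal_alphas)
    # a3 first: pay its cost of -1, shift the belief, then sense and act once.
    b_shifted = b @ P_a3
    v_a3_first = -1.0 + expected_V1(b_shifted, terminal_alphas)
    return max(v_terminal, v_a3_first)
```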
POMDP: Conclusion
• Pruning is crucial in lessening the combinatorial explosion. In the example above, the unpruned algorithm needs on the order of 10^547,864 linear equations by t = 20, whereas only 12 equations are needed to represent the pruned value function.
• Research shows that POMDP solutions perform better than MDP approximations in many contexts (with small state, action, and observation sets).
• However, solving a finite-horizon POMDP is PSPACE-complete, and the infinite-horizon problem is undecidable. (A polynomial-time algorithm for POMDPs would therefore imply P = PSPACE, and in particular settle the P = NP problem.)
• Thus there are many value-function approximation methods, which can be helpful, but exact solving remains limited to very small problems.
Thank you