Partially Observable Markov Decision Process
By
Nezih Ergin Özkucur
Contents
Markov Decision Process (MDP)
Value Iteration Algorithm
Reinforcement Learning
Partially Observable Markov Decision Process (POMDP)
POMDP vs MDP
Value Function Representation
Exact algorithms
Kalman Filtering
ART2A Network
ARKAQ Learning Algorithm
Markov Decision Process
Consider an agent that must act rationally in an environment.
At each discrete time step, the agent must choose one of the available actions.
In the long term, the agent tries to achieve good outcomes. An MDP is a way to model this kind of problem. Once the problem is modeled, we can run automated algorithms to solve it.
MDP Components
MDP can be defined by (S,A,T,R) where S is a finite set of states which describes the situation of
the environment. A is a finite set of actions, which agent must chose from
in each time step. T (State transition function) is a mapping from SxA to
probability distrubutions over S. T(s,a,s`) is the probability of being state s` when agent was in state s and have chosen action a.
R (Reward Function) is a mapping from SxA to real numbers.
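As a minimal sketch (not taken from the slides), the tuple (S, A, T, R) can be written down directly in Python; the two-state example and all its names are illustrative only:

```python
from typing import NamedTuple

class MDP(NamedTuple):
    states: list   # S: finite set of states
    actions: list  # A: finite set of actions
    T: dict        # T[(s, a)] -> {s_next: probability}
    R: dict        # R[(s, a)] -> real-valued reward

# A toy two-state example (names are illustrative only).
mdp = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    T={("s0", "stay"): {"s0": 1.0},
       ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
       ("s1", "stay"): {"s1": 1.0},
       ("s1", "go"):   {"s0": 0.8, "s1": 0.2}},
    R={("s0", "stay"): 0.0, ("s0", "go"): 0.0,
       ("s1", "stay"): 1.0, ("s1", "go"): 0.0},
)
```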
Value Iteration Algorithm
A policy (π) is a mapping from S to A that gives the action to select in each state.
The value of a state is the expected long-term return when starting from that state.
The algorithm's update rule:
$V(s) \leftarrow \max_{a} \left[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V(s') \right]$
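A minimal sketch of the resulting loop, reusing the MDP structure above; the discount γ and the stopping threshold are assumed values:

```python
def value_iteration(mdp, gamma=0.9, eps=1e-6):
    """Iterate V(s) <- max_a [R(s,a) + gamma * sum_s' T(s,a,s') V(s')]."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            v_new = max(
                mdp.R[(s, a)] + gamma * sum(p * V[s2]
                                            for s2, p in mdp.T[(s, a)].items())
                for a in mdp.actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < eps:       # stop once values change less than eps
            return V
```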
Q-Learning Algorithm
Action values: $Q(s,a)$ is the expected long-term return of taking action $a$ in state $s$ and acting optimally afterwards.
Update rule: $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$
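A minimal sketch of one tabular update together with an ε-greedy action choice; the step size α, discount γ, and ε are assumed defaults, and Q is assumed to hold an entry for every (state, action) pair:

```python
import random

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```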
Partially Observable Markov Decision Process
Consider an MDP in which the agent cannot observe the state completely.
We can model this problem with a POMDP. A POMDP has two more components: O, the finite observation set, and the observation function O(s, a, o), the probability of making observation o from state s after having taken action a.
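Extending the Python sketch above, the extra components can be carried alongside the MDP; the field names are illustrative:

```python
from typing import NamedTuple

class POMDP(NamedTuple):
    mdp: MDP            # the underlying (S, A, T, R)
    observations: list  # O: finite observation set
    O: dict             # O[(s, a)] -> {o: probability of observing o in state s after action a}
```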
Agent’s Internal State
The agent can represent the situation of the environment with belief states.
A belief state (b) is a probability distribution over S; b(s) is the probability of being in state s when the belief state is b.
The next belief can be calculated from the previous one. After taking action a and observing o:
$b'(s') = \dfrac{O(s',a,o) \sum_{s} T(s,a,s')\, b(s)}{\Pr(o \mid b, a)}$
where the denominator is a normalizing factor.
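A direct sketch of this update on the POMDP structure above; it assumes the observation o actually has nonzero probability under (b, a):

```python
def belief_update(pomdp, b, a, o):
    """b'(s') is proportional to O(s',a,o) * sum_s T(s,a,s') * b(s)."""
    m = pomdp.mdp
    b_next = {}
    for s2 in m.states:
        pred = sum(b[s] * m.T[(s, a)].get(s2, 0.0) for s in m.states)
        b_next[s2] = pomdp.O[(s2, a)].get(o, 0.0) * pred
    norm = sum(b_next.values())  # this is Pr(o | b, a); assumed nonzero
    return {s2: p / norm for s2, p in b_next.items()}
```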
MDP vs POMDP
In an MDP the agent acts directly on the true state; in a POMDP it must act on a belief state maintained from its actions and observations.
Belief State Example
Observations: [ goal, non-goal ]
Step 1: b = [ 0.33, 0.33, 0, 0.33 ]
Step 2: b = [ 0, 0.5, 0, 0.5 ]
Step 3: b = [ 0, 0, 0, 1 ]
Value Iteration Algorithm
We can rewrite the transition probabilities and the reward function over belief states, with $\rho(b,a) = \sum_{s} b(s)\, R(s,a)$, and try to apply the value iteration algorithm.
The problem is how to represent the value function, and how to iterate over the infinite belief space.
Value Function Representation
The value function can be represented by a finite set of vectors (α-vectors) and has the Piecewise Linear and Convex (PWLC) property: $V(b) = \max_{\alpha} \sum_{s} \alpha(s)\, b(s)$.
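A small sketch of evaluating such a PWLC value function from a set of α-vectors; the vectors in the example are illustrative:

```python
import numpy as np

def pwlc_value(b, alpha_vectors):
    """V(b) = max over alpha-vectors of alpha . b (piecewise linear, convex in b)."""
    b = np.asarray(b, dtype=float)
    return max(float(np.dot(alpha, b)) for alpha in alpha_vectors)

# Two illustrative alpha-vectors over a 2-state belief space.
gamma_set = [np.array([1.0, 0.0]), np.array([0.2, 0.9])]
print(pwlc_value([0.5, 0.5], gamma_set))  # -> 0.55
```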
Witness Algorithm
Start with a set of beliefs b at the corners of the belief space.
At each iteration, search for a witness point: a belief b at which the one-step backed-up value exceeds the value given by the current vector set.
Calculate the new α-vector at that point and add it to the vector set. Stop when no witness point remains.
Incremental Pruning Algorithm
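In outline, incremental pruning builds the next value function from cross-sums of α-vector sets and prunes dominated vectors after each cross-sum (Cassandra, Littman & Zhang, 1997). A rough sketch of the two core operations; only the cheap pointwise-dominance check is shown here, while the full algorithm also prunes vectors using linear programs:

```python
import numpy as np

def cross_sum(A, B):
    """All pairwise sums of alpha-vectors from the two sets."""
    return [a + b for a in A for b in B]

def prune_pointwise(vectors):
    """Drop any vector strictly dominated componentwise by another vector.
    (The full algorithm additionally removes vectors that are dominated
    only by combinations of others, found via a linear program.)"""
    kept = []
    for i, v in enumerate(vectors):
        dominated = any(
            j != i and np.all(w >= v) and np.any(w > v)
            for j, w in enumerate(vectors))
        if not dominated:
            kept.append(v)
    return kept
```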
Heuristic Search Value Iteration Algorithm (HSVI)
HSVI maintains upper and lower bounds on the optimal value function and uses heuristic search over the beliefs reachable from the initial belief to choose where to apply backups (Smith & Simmons, 2004).
ARKAQ Learning
ARKAQ combines Kalman filtering for estimating the hidden state and an ART2A network for segmenting the state space, with Q-learning running on top of the resulting internal states (Sardağ, 2006).
Result of ARKAQ Learning Algorithm
4x4 Grid Problem
References
Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.
Anthony R. Cassandra, Michael L. Littman, and Nevin L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proc. of UAI, 1997.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proc. of UAI, 2004.
Alp Sardağ. Autonomous Strategy Planning Under Uncertainty. PhD thesis, Boğaziçi University, 2006.