Partially Observable Markov Decision Process
By
Nezih Ergin Özkucur
Contents
Markov Decision Process (MDP)
Value Iteration Algorithm
Reinforcement Learning
Partially Observable Markov Decision Process (POMDP)
POMDP vs MDP
Value Function Representation
Exact algorithms
Kalman Filtering
ART2A Network
ARKAQ Learning Algorithm
Markov Decision Process
Consider an agent that must act rationally in an environment.
At each discrete time step, the agent must choose one of the available actions.
In the long term, the agent tries to achieve good outcomes. An MDP is a way to model this kind of problem. Once the problem is modeled, we can run automated algorithms to solve it.
MDP Components
MDP can be defined by (S,A,T,R) where S is a finite set of states which describes the situation of
the environment. A is a finite set of actions, which agent must chose from
in each time step. T (State transition function) is a mapping from SxA to
probability distrubutions over S. T(s,a,s`) is the probability of being state s` when agent was in state s and have chosen action a.
R (Reward Function) is a mapping from SxA to real numbers.
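As a minimal sketch (not taken from the slides), the tuple (S, A, T, R) can be written down directly in Python; the two-state example and all its names are illustrative only:

```python
from typing import NamedTuple

class MDP(NamedTuple):
    states: list   # S: finite set of states
    actions: list  # A: finite set of actions
    T: dict        # T[(s, a)] -> {s_next: probability}
    R: dict        # R[(s, a)] -> real-valued reward

# A toy two-state example (names are illustrative only).
mdp = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    T={("s0", "stay"): {"s0": 1.0},
       ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
       ("s1", "stay"): {"s1": 1.0},
       ("s1", "go"):   {"s0": 0.8, "s1": 0.2}},
    R={("s0", "stay"): 0.0, ("s0", "go"): 0.0,
       ("s1", "stay"): 1.0, ("s1", "go"): 0.0},
)
```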
Value Iteration Algorithm
A policy (π) is a mapping from S to A that gives the action to select in each state.
The value of a state is the expected long-term return when starting from that state.
The algorithm's update rule:
$V(s) \leftarrow \max_{a} \left[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V(s') \right]$
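A minimal sketch of the resulting loop, reusing the MDP structure above; the discount γ and the stopping threshold are assumed values:

```python
def value_iteration(mdp, gamma=0.9, eps=1e-6):
    """Iterate V(s) <- max_a [R(s,a) + gamma * sum_s' T(s,a,s') V(s')]."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            v_new = max(
                mdp.R[(s, a)] + gamma * sum(p * V[s2]
                                            for s2, p in mdp.T[(s, a)].items())
                for a in mdp.actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < eps:       # stop once values change less than eps
            return V
```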
Q-Learning Algorithm
Action values: $Q(s,a)$ is the expected long-term return of taking action $a$ in state $s$ and acting optimally afterwards.
Update rule: $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$
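A minimal sketch of one tabular update together with an ε-greedy action choice; the step size α, discount γ, and ε are assumed defaults, and Q is assumed to hold an entry for every (state, action) pair:

```python
import random

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```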
Partially Observable Markov Decision Process
Consider an MDP in which the agent cannot observe the state completely.
We can model this problem with a POMDP. A POMDP has two more components: O, the finite observation set, and the observation function O(s, a, o), the probability of making observation o from state s after having taken action a.
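Extending the Python sketch above, the extra components can be carried alongside the MDP; the field names are illustrative:

```python
from typing import NamedTuple

class POMDP(NamedTuple):
    mdp: MDP            # the underlying (S, A, T, R)
    observations: list  # O: finite observation set
    O: dict             # O[(s, a)] -> {o: probability of observing o in state s after action a}
```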
Agent’s Internal State
The agent can represent the situation of the environment with belief states.
A belief state (b) is a probability distribution over S; b(s) is the probability of being in state s when the belief state is b.
The next belief can be calculated from the previous one. After taking action a and observing o:
$b'(s') = \dfrac{O(s',a,o) \sum_{s} T(s,a,s')\, b(s)}{\Pr(o \mid b, a)}$
where the denominator is a normalizing factor.
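A direct sketch of this update on the POMDP structure above; it assumes the observation o actually has nonzero probability under (b, a):

```python
def belief_update(pomdp, b, a, o):
    """b'(s') is proportional to O(s',a,o) * sum_s T(s,a,s') * b(s)."""
    m = pomdp.mdp
    b_next = {}
    for s2 in m.states:
        pred = sum(b[s] * m.T[(s, a)].get(s2, 0.0) for s in m.states)
        b_next[s2] = pomdp.O[(s2, a)].get(o, 0.0) * pred
    norm = sum(b_next.values())  # this is Pr(o | b, a); assumed nonzero
    return {s2: p / norm for s2, p in b_next.items()}
```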
MDP vs POMDP
In an MDP the agent acts directly on the true state; in a POMDP it must act on a belief state maintained from its actions and observations.
Belief State Example
Observations: [ goal, non-goal ]
Step 1: b = [ 0.33, 0.33, 0, 0.33 ]
Step 2: b = [ 0, 0.5, 0, 0.5 ]
Step 3: b = [ 0, 0, 0, 1 ]
Value Iteration Algorithm
We can rewrite the transition probabilities and the reward function over belief states, with $\rho(b,a) = \sum_{s} b(s)\, R(s,a)$, and try to apply the value iteration algorithm.
The problem is how to represent the value function, and how to iterate over the infinite belief space.
Value Function Representation
The value function can be represented by a finite set of vectors (α-vectors) and has the Piecewise Linear and Convex (PWLC) property: $V(b) = \max_{\alpha} \sum_{s} \alpha(s)\, b(s)$.
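A small sketch of evaluating such a PWLC value function from a set of α-vectors; the vectors in the example are illustrative:

```python
import numpy as np

def pwlc_value(b, alpha_vectors):
    """V(b) = max over alpha-vectors of alpha . b (piecewise linear, convex in b)."""
    b = np.asarray(b, dtype=float)
    return max(float(np.dot(alpha, b)) for alpha in alpha_vectors)

# Two illustrative alpha-vectors over a 2-state belief space.
gamma_set = [np.array([1.0, 0.0]), np.array([0.2, 0.9])]
print(pwlc_value([0.5, 0.5], gamma_set))  # -> 0.55
```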
Witness Algorithm
Start with a set of beliefs b at the corners of the belief space.
At each iteration, search for a witness point: a belief b at which the one-step backed-up value exceeds the value given by the current vector set.
Calculate the new α-vector at that point and add it to the vector set. Stop when no witness point remains.
Incremental Pruning Algorithm
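In outline, incremental pruning builds the next value function from cross-sums of α-vector sets and prunes dominated vectors after each cross-sum (Cassandra, Littman & Zhang, 1997). A rough sketch of the two core operations; only the cheap pointwise-dominance check is shown here, while the full algorithm also prunes vectors using linear programs:

```python
import numpy as np

def cross_sum(A, B):
    """All pairwise sums of alpha-vectors from the two sets."""
    return [a + b for a in A for b in B]

def prune_pointwise(vectors):
    """Drop any vector strictly dominated componentwise by another vector.
    (The full algorithm additionally removes vectors that are dominated
    only by combinations of others, found via a linear program.)"""
    kept = []
    for i, v in enumerate(vectors):
        dominated = any(
            j != i and np.all(w >= v) and np.any(w > v)
            for j, w in enumerate(vectors))
        if not dominated:
            kept.append(v)
    return kept
```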
Heuristic Search Value Iteration Algorithm (HSVI)
HSVI maintains upper and lower bounds on the optimal value function and uses heuristic search over the beliefs reachable from the initial belief to choose where to apply backups (Smith & Simmons, 2004).
ARKAQ Learning
ARKAQ combines Kalman filtering for estimating the hidden state and an ART2A network for segmenting the state space, with Q-learning running on top of the resulting internal states (Sardağ, 2006).
Result of ARKAQ Learning Algorithm
4x4 Grid Problem
References
Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.
Anthony R. Cassandra, Michael L. Littman, and Nevin L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proc. of UAI, 1997.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proc. of UAI, 2004.
Alp Sardağ. Autonomous Strategy Planning Under Uncertainty. PhD thesis, Boğaziçi University, 2006.