Partially Observable Markov Decision Process By Nezih Ergin Özkucur


Page 1:

Partially Observable Markov Decision Process

By

Nezih Ergin Özkucur

Page 2:

Contents

- Markov Decision Process (MDP)
  - Value Iteration Algorithm
  - Reinforcement Learning
- Partially Observable Markov Decision Process (POMDP)
  - POMDP vs MDP
  - Value Function Representation
  - Exact algorithms
- Kalman Filtering
- ART2A Network
- ARKAQ Learning Algorithm

Page 3:

Markov Decision Process

Consider an agent that must act rationally in an environment.

At each discrete time step, the agent must choose one of the available actions.

In the long run, the agent tries to achieve good results. An MDP is a way to model this kind of problem; once the problem is modeled, we can run automated algorithms to solve it.

Page 4:

MDP Components

An MDP can be defined by a tuple (S, A, T, R) where:

- S is a finite set of states, which describes the situation of the environment.
- A is a finite set of actions, from which the agent must choose at each time step.
- T, the state transition function, is a mapping from S x A to probability distributions over S. T(s, a, s') is the probability of ending up in state s' when the agent was in state s and chose action a.
- R, the reward function, is a mapping from S x A to the real numbers.
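
As a purely illustrative sketch (not taken from the slides), the (S, A, T, R) tuple of a tiny two-state MDP might be encoded in Python like this:

```python
# Hypothetical two-state MDP encoded with plain Python dicts.
# T[(s, a)] is a probability distribution over next states;
# R[(s, a)] is the real-valued reward for taking a in s.

S = ["s0", "s1"]
A = ["left", "right"]

T = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.8, "s1": 0.2},
    ("s1", "right"): {"s0": 0.1, "s1": 0.9},
}

R = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 0.0, ("s1", "right"): 2.0,
}
```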

Page 5:

Value Iteration Algorithm

A policy (π) is a mapping from S to A, which gives the action to select in each state.

The value of a state is the expected long-term return starting from that state.

The algorithm's update rule is:

V(s) ← max_a [ R(s, a) + γ Σ_s' T(s, a, s') V(s') ]
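
A minimal sketch of this update rule in Python, reusing the hypothetical S, A, T, R dicts from the previous example; gamma and the stopping threshold theta are arbitrary choices, not from the slides:

```python
def value_iteration(S, A, T, R, gamma=0.95, theta=1e-6):
    """Repeat V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') * V(s') ]."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            new_v = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                for a in A
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:   # stop once no state value changes noticeably
            return V
```

Calling value_iteration(S, A, T, R) on the toy MDP above returns a dict of converged state values.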

Page 6:

Q-Learning Algorithm

Action values: Q(s, a) is the expected long-term return of taking action a in state s and acting optimally afterwards.

Update rule:

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ]
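
A sketch of tabular Q-learning with an epsilon-greedy behaviour policy; the learning rate alpha, discount gamma, and exploration rate eps are assumed values, not from the slides:

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s_next, A, alpha=0.1, gamma=0.95):
    """Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in A)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, A, eps=0.1):
    """Explore with probability eps, otherwise pick the greedy action."""
    if random.random() < eps:
        return random.choice(A)
    return max(A, key=lambda a: Q[(s, a)])

Q = defaultdict(float)   # unseen (state, action) pairs start at 0.0
```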

Page 7:

Partially Observable Markov Decision Process

Consider an MDP in which the agent cannot observe the state completely.

We can model this problem with a POMDP. A POMDP has two more components: O is the finite observation set, and O(s, a, o) is the probability of making observation o from state s after having taken action a.
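
Continuing the illustrative two-state example, the observation set and the probabilities O(s, a, o) could be encoded as below; the set is named Obs here so that O can keep naming the probabilities, and the numbers are made up:

```python
# Hypothetical observation model: O[(s, a)][o] is the probability of
# observing o from state s after having taken action a.

Obs = ["goal", "non-goal"]

O = {
    ("s0", "left"):  {"goal": 0.1, "non-goal": 0.9},
    ("s0", "right"): {"goal": 0.1, "non-goal": 0.9},
    ("s1", "left"):  {"goal": 0.8, "non-goal": 0.2},
    ("s1", "right"): {"goal": 0.8, "non-goal": 0.2},
}
```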

Page 8:

Agent’s Internal State

The agent can represent the situation of the environment with belief states.

A belief state b is a probability distribution over S; b(s) is the probability of being in state s when the belief state is b.

After taking action a and making observation o, the next belief state can be calculated from the previous one:

b'(s') = O(s', a, o) Σ_s T(s, a, s') b(s) / Pr(o | a, b)
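
A minimal sketch of this belief update, using the illustrative T and O dicts from the earlier examples; the normaliser is Pr(o | a, b):

```python
def belief_update(b, a, o, S, T, O):
    """b'(s') proportional to O(s',a,o) * sum_s T(s,a,s') * b(s)."""
    b_next = {}
    for s2 in S:
        predicted = sum(b[s] * T[(s, a)].get(s2, 0.0) for s in S)  # prediction step
        b_next[s2] = O[(s2, a)][o] * predicted                     # correction step
    norm = sum(b_next.values())    # this is Pr(o | a, b)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: p / norm for s2, p in b_next.items()}

# Example: update the uniform belief over the two hypothetical states.
b0 = {"s0": 0.5, "s1": 0.5}
b1 = belief_update(b0, "right", "goal", S, T, O)
```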

Page 9:

MDP vs POMDP

Page 10:

Belief State Example

Observations: [ goal, non-goal ]
Step 1: b = [ 0.33, 0.33, 0, 0.33 ]
Step 2: b = [ 0, 0.5, 0, 0.5 ]
Step 3: b = [ 0, 0, 0, 1 ]

Page 11:

Value Iteration Algorithm

We can rewrite the transition probabilities and the reward function in terms of belief states and try to apply the value iteration algorithm.

The problem is how to represent the value function and how to iterate over the infinite belief space.

Page 12:

Value Function Representation

The value function can be approximated with a finite set of vectors; the resulting function is piecewise linear and convex (PWLC): the value at a belief b is the maximum dot product of b with any of the vectors.
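
To make the PWLC representation concrete, here is a small sketch (not from the slides) where the value function is a set of alpha vectors and V(b) is the maximum dot product of b with any of them; the vectors themselves are placeholders:

```python
import numpy as np

def value_of_belief(b, alpha_vectors):
    """PWLC value function: V(b) = max_alpha (alpha . b)."""
    return max(float(np.dot(alpha, b)) for alpha in alpha_vectors)

def best_vector(b, alpha_vectors):
    """The maximising vector also determines the action to take at b."""
    return max(alpha_vectors, key=lambda alpha: float(np.dot(alpha, b)))

# Two placeholder alpha vectors over a 2-state belief space.
alphas = [np.array([1.0, 0.0]), np.array([0.2, 0.8])]
b = np.array([0.6, 0.4])
print(value_of_belief(b, alphas))   # -> 0.6
```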

Page 13:

Witness Algorithm

Start with a set of beliefs b at the corners of the belief space. At each iteration, find a witness point, i.e. a belief at which the current vector set underestimates the one-step lookahead value.

Calculate the new vector at that point and add it to the vector set. Stop when there is no witness point.
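
As a heavily simplified sketch of the witness test only (the actual algorithm searches for witness points with linear programs, which is not attempted here): a belief b is a witness if its one-step lookahead value exceeds the value the current vector set assigns to it. Beliefs and alpha vectors are dicts over S here, and belief_update is the function sketched earlier.

```python
def backup_value(b, S, A, Obs, T, R, O, alphas, gamma=0.95):
    """One-step lookahead (Bellman backup) value of belief b."""
    best = float("-inf")
    for a in A:
        q = sum(b[s] * R[(s, a)] for s in S)            # expected immediate reward
        for o in Obs:
            # Pr(o | a, b) = sum_s' O(s',a,o) * sum_s T(s,a,s') * b(s)
            pr_o = sum(O[(s2, a)][o] * sum(b[s] * T[(s, a)].get(s2, 0.0) for s in S)
                       for s2 in S)
            if pr_o > 0.0:
                b_next = belief_update(b, a, o, S, T, O)
                v_next = max(sum(alpha[s2] * b_next[s2] for s2 in S) for alpha in alphas)
                q += gamma * pr_o * v_next
        best = max(best, q)
    return best

def is_witness(b, S, A, Obs, T, R, O, alphas, gamma=0.95, eps=1e-9):
    """True if the current vector set underestimates the backed-up value at b."""
    current = max(sum(alpha[s] * b[s] for s in S) for alpha in alphas)
    return backup_value(b, S, A, Obs, T, R, O, alphas, gamma) > current + eps
```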

Page 14:

Incremental Pruning Algorithm

Page 15:

Heuristic Search Value Iteration Algorithm (HSVI)

Page 16:

ARKAQ Learning

Page 17:

Result of ARKAQ Learning Algorithm

4x4 Grid Problem

Page 18:

References

Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.

Anthony R. Cassandra, Michael L. Littman, and Nevin L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proc. of UAI, 1997.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proc. of UAI, 2004.

Alp Sardag. Autonomous Strategy Planning Under Uncertainty. PhD Thesis, Boğaziçi University, 2006.