
Page 1: POMDP Seminar Backup3

Planning and Acting in Partially Observable Stochastic Domains

By Leslie Pack Kaelbling, Michael L. Littman, Anthony R. Cassandra

Artificial Intelligence 101 (Jan 1998) 99-134

ISS Seminar, 11/16/06

Presented by Darin Hitchings

Page 2: POMDP Seminar Backup3

Introduction

• Essentially a planning problem: given a model of the world dynamics and a reward structure, find an optimal way to behave.
• Decisions and rewards come in stages.
• Limitation: discrete state representation (MDP case).
• Stochastic: actions have uncertain outcomes (and state uncertainty).
• Change the state of the world <=> gain information.

Page 3: POMDP Seminar Backup3

MDP Model

• Tradeoff: value of information vs. immediate reward vs. long-term cost.
• Uncertainty in actions' effects, but perfect perception.
• MDP described by the tuple (S, A, T, R): states, actions, state-transition function, reward function.
• Markov Property: the next state depends only on the current state and action (written out below).
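Here T(s, a, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a) is the state-transition function and R(s, a) is the expected immediate reward. The Markov property is

  \Pr(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = \Pr(s_{t+1} \mid s_t, a_t)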

Page 4: POMDP Seminar Backup3

Optimality Concepts

• Finite-horizon optimality: maximize the expected sum of rewards over the next k steps.
• Infinite-horizon optimality: maximize the expected discounted sum of rewards (geometric discounting); see the formulas below.
• Two kinds of policies: stationary and non-stationary (the action can depend on t, the t-th-to-last step).
• Finite-horizon model: the optimal policy is typically not stationary.
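Concretely, with k the horizon length, 0 ≤ γ < 1 the discount factor, and r_t the reward received at step t, the two criteria are

  \text{finite horizon:}\;\; \max\; E\!\left[\sum_{t=0}^{k} r_t\right],
  \qquad
  \text{infinite horizon (discounted):}\;\; \max\; E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]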

Page 5: POMDP Seminar Backup3

Optimality Cont.

• Finite horizon: the last move is usually very different from the first move.
• Infinite horizon: the discounted criterion is also optimal for a process that, after each step, stops with probability 1 - γ; in effect a finite-horizon problem with expected horizon 1/(1 - γ).
• Define the value function of being in a state at time t subject to policy π, or, for the stationary case, V_π(s) (see below).
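In standard notation, with E_π denoting expectation under policy π (non-stationary policies indexed by steps to go), the value functions referred to above are

  V_{\pi,t}(s) = E_{\pi}\!\left[\sum_{k=0}^{t-1} \gamma^{k} r_k \;\Big|\; s_0 = s\right],
  \qquad
  V_{\pi}(s) = E_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_k \;\Big|\; s_0 = s\right]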

Page 6: POMDP Seminar Backup3

Recursion Relations

• At the last step, the value is just the reward received at that step.
• In general, the t-step value is the current "local" reward, plus a discount factor γ (amortization for the time delay), times an expectation (over states) of the future reward.
• Non-stationary case: the t-step value function is written in terms of the (t-1)-step value function.
• Infinite-horizon case: the same function appears on both sides of the recursion (written out below).
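Written out, the recursions summarized above are

  V_{\pi,1}(s) = R(s, \pi_1(s)),
  \qquad
  V_{\pi,t}(s) = R(s, \pi_t(s)) + \gamma \sum_{s' \in S} T(s, \pi_t(s), s')\, V_{\pi,t-1}(s')

for the non-stationary (finite-horizon) case, and

  V_{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V_{\pi}(s')

for the infinite-horizon case, where the same function V_π appears on both sides.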

Page 7: POMDP Seminar Backup3

Recursion Relations Cont.

• Infinite horizon: the solution is unique; it is a simultaneous set of linear equations, one equation for each state s.
• Relies on having a discrete state space.
• We have shown policy => value function; what about value function => policy?
• Can compute a greedy (myopic) policy given the value function: pick the action maximizing expected immediate reward plus discounted expected reward of the next state (see below).
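Written out, with the two terms labeled as on the slide:

  \pi_V(s) = \arg\max_{a \in A}\Big[\,\underbrace{R(s,a)}_{\text{expected immediate reward}} \;+\; \gamma \underbrace{\sum_{s' \in S} T(s,a,s')\, V(s')}_{\text{expected reward of next state}}\,\Big]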

Page 8: POMDP Seminar Backup3

Bellman Equation

• Optimal policy for a single move? Just take the action with the highest expected immediate reward.
• Can figure out an expression for the general case by working backwards from the end of the horizon: the t-step optimal value function is derived from the (t-1)-step optimal value function and the one-step model (see below).
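Working backwards from the end of the horizon gives the Bellman equation:

  V^{*}_{1}(s) = \max_{a \in A} R(s, a),
  \qquad
  V^{*}_{t}(s) = \max_{a \in A}\Big[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^{*}_{t-1}(s') \Big]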

Page 9: POMDP Seminar Backup3

Stationary Policy Exists

• For infinite-horizon discounted problems, Howard showed that an optimal stationary policy always exists; its value function V* satisfies the stationary Bellman equation.
• An optimal policy, π*, is just a greedy policy with respect to V*.

Page 10: POMDP Seminar Backup3

Value Iteration Algorithm

• Algorithm 1. The value-iteration algorithm for finite-state-space MDPs (sketched below).
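A minimal sketch of that loop, assuming the model is given as NumPy arrays T of shape (|S|, |A|, |S|) and R of shape (|S|, |A|); the array names and stopping test are illustrative, not the paper's code:

import numpy as np

def value_iteration(T, R, gamma, epsilon):
    # T[s, a, s2] = T(s, a, s'), R[s, a] = R(s, a)
    n_states = T.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_{s'} T(s, a, s') V(s')
        Q = R + gamma * np.einsum('sap,p->sa', T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < epsilon:   # Bellman error magnitude test
            return V_new, Q.argmax(axis=1)        # values and a greedy policy
        V = V_new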

Page 11: POMDP Seminar Backup3

Value Iteration Cont.

• Q_t(s, a) is the t-step value of starting in state s, taking action a, then continuing with the optimal (t-1)-step nonstationary policy.
• max_s |V_t(s) - V_{t-1}(s)| is the Bellman error magnitude.
• If |V_t(s) - V_{t-1}(s)| < ε for all s, then it can be shown that the value of the greedy policy with respect to V_t differs from V* by no more than 2εγ/(1 - γ) at every state.
• Tighter bounds may be obtained by using the span semi-norm on the value function [49].

Page 12: POMDP Seminar Backup3

POMDP Model

• POMDP described by the tuple (S, A, T, R, Ω, O): S, A, T, R describe a Markov decision process; Ω is a finite set of observations of the world; O is the observation function (a probability distribution over possible observations for each resulting state and action).
• New state => a continuous distribution over the discrete states of the original MDP: the "belief state".
• The belief state is a sufficient statistic for the past history and initial belief state of the agent.
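In the paper's notation the observation function is

  O(s', a, o) = \Pr(o_t = o \mid a_{t-1} = a,\; s_t = s')

and a belief state b assigns a probability b(s) to each world state, with b(s) ≥ 0 and \sum_{s \in S} b(s) = 1.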

Page 13: POMDP Seminar Backup3

POMDP Example 1

POMDP example:
• |S| = 4, A = {East, West}, Ω = {not at goal, at goal}
• P(move W | a = W) = 0.90, P(move E | a = W) = 0.10
• P(move E | a = E) = 0.90, P(move W | a = E) = 0.10
• No movement if at a boundary (but no other information)
• Initial b = { 0.333, 0.333, 0.000, 0.333 }
• Belief state given a = East, o = not at goal => { 0.100, 0.450, 0.000, 0.450 }
• 2nd belief state (East twice, no goal) => { 0.100, 0.164, 0.000, 0.736 }
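These numbers can be checked with a small sketch, assuming the third state is the goal and the "at goal" observation is received exactly when the agent is in the goal state; all names are illustrative:

import numpy as np

n, goal = 4, 2                 # four states in a row; the third (index 2) is the goal
EAST, WEST = 0, 1

# T[a, s, s']: move in the intended direction with prob 0.9, the opposite with 0.1;
# a move that would cross a boundary leaves the state unchanged.
T = np.zeros((2, n, n))
for a in (EAST, WEST):
    step = 1 if a == EAST else -1
    for s in range(n):
        for d, p in ((step, 0.9), (-step, 0.1)):
            T[a, s, min(max(s + d, 0), n - 1)] += p

def update(b, a, at_goal):
    # b'(s') proportional to O(s', a, o) * sum_s T(s, a, s') b(s)
    pred = b @ T[a]
    like = np.where(np.arange(n) == goal, 1.0, 0.0)
    like = like if at_goal else 1.0 - like
    post = like * pred
    return post / post.sum()

b = np.array([1/3, 1/3, 0.0, 1/3])
b = update(b, EAST, at_goal=False)   # ~ [0.100, 0.450, 0.000, 0.450]
b = update(b, EAST, at_goal=False)   # ~ [0.100, 0.164, 0.000, 0.736]
print(np.round(b, 3))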

Page 14: POMDP Seminar Backup3

Computing New Belief States

• Know b(s) for all s ∈ S; must have Σ_s b(s) = 1 for b to be a valid distribution.
• Derivation: apply Bayes' rule, use the Markov property, then plug in the definitions of T and O.
• So we now have a new state-transition equation for belief states (see below).
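The resulting state-estimation (belief-update) equation is

  b'(s') = \Pr(s' \mid a, o, b)
         = \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)},
  \qquad
  \Pr(o \mid a, b) = \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)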

Page 15: POMDP Seminar Backup3

Belief Space State Transitions

• B, the set of belief states, is the new state space.
• A, the set of actions, remains the same.
• New state-transition function τ(b, a, b'): the probability of moving from belief b to belief b' under action a.
• Reward function ρ(b, a): the expected reward over world states, weighted by the belief (see below).
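In symbols, the belief-MDP transition and reward functions are

  \tau(b, a, b') = \Pr(b' \mid b, a) = \sum_{o \in \Omega} \Pr(b' \mid b, a, o)\, \Pr(o \mid a, b),
  \qquad
  \rho(b, a) = \sum_{s \in S} b(s)\, R(s, a)

where Pr(b' | b, a, o) is 1 if the belief update of b under (a, o) yields b', and 0 otherwise.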

Page 16: POMDP Seminar Backup3

Policy Trees

• Simplest case: a 1-step policy tree is just a single action.
• General case: a t-step nonstationary policy => a tree of depth t, with an action at each node and one branch per observation.
• Actions generate observations, which control belief (state) evolution.

Page 17: POMDP Seminar Backup3

Policy Trees Cont.

• If p is a t-step policy tree, a(p) is the action specified at its top node, and o_i(p) is the (t-1)-step policy subtree associated with observation o_i.
• The value of executing p from a state is the immediate reward plus a discounted expectation over possible future states (and observations) of the expected value of the corresponding subtree (see below).
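In symbols, the value of executing policy tree p from world state s is

  V_{p}(s) = R(s, a(p)) + \gamma \sum_{s' \in S} T(s, a(p), s') \sum_{o_i \in \Omega} O(s', a(p), o_i)\, V_{o_i(p)}(s')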

Page 18: POMDP Seminar Backup3

Value on Policy Trees

• Value of executing a policy tree p from some belief state b: just an expectation over world states of the value of executing p in each state.
• More compactly, if α_p is the vector of state values (V_p(s_1), ..., V_p(s_n)), then V_p(b) = b · α_p.
• Let P be the finite set of all t-step policy trees; then V_t(b) is the maximum of these linear functions over P. Important geometric consequences!
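In symbols:

  V_{p}(b) = \sum_{s \in S} b(s)\, V_{p}(s) = b \cdot \alpha_{p},
  \qquad
  V_{t}(b) = \max_{p \in P}\; b \cdot \alpha_{p}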

Page 19: POMDP Seminar Backup3

Geometric Import

• Each policy tree p induces a value function V_p that is linear in b.
• V_t is the upper surface of these value functions, so V_t is piecewise-linear and convex.
• Figure: with two states, the value function is 1-D; each policy component is a (linear) hyperplane, and the pointwise max of convex (here, linear) functions is convex.

Page 20: POMDP Seminar Backup3

Simplex Constraints

• For |S| = 3, the belief simplex can be drawn in 2-D with vertices (0,0), (0,1), (1,0).
• Makes intuitive sense: high value on the edges of the simplex, low value in the center (higher uncertainty).
• Just one policy tree is optimal within any colored region of the figure.
• Can find the optimal policy by projecting back down into belief space: within each region, the policy tree whose value function is maximal over that entire region is the one to execute.

Page 21: POMDP Seminar Backup3

More on Projection

• The optimal t-step situation-action mapping is found by projecting the value function onto belief space.
• We are especially interested in a(p), the action at the root node of policy tree p.
• Figure: 1-D belief space (|S| = 2). Remember there are multiple possible observations per action; also dim(a(p)) = S.

Page 22: POMDP Seminar Backup3

Parsimonious Representation

• Define R(p) to be the region of belief space where policy tree p dominates (its value exceeds that of every other tree).
• In the figure, p_d is never useful; p_c is never useful given p_a and p_b.
• Can find a point in R(p), if one exists, by linear programming (sketched below).
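A sketch of that linear program, assuming each policy tree is summarized by its α-vector (a length-|S| NumPy array of per-state values) and using scipy.optimize.linprog; the margin-maximizing formulation and all names are illustrative, not the paper's code:

import numpy as np
from scipy.optimize import linprog

def point_where_dominant(alpha, others):
    """Return a belief state b at which `alpha` strictly beats every
    vector in `others`, or None if its dominance region is empty."""
    n = len(alpha)
    # Variables: b_1 .. b_n and the margin delta; maximize delta <=> minimize -delta.
    c = np.append(np.zeros(n), -1.0)
    # For each competitor q:  b . (q - alpha) + delta <= 0
    A_ub = np.array([np.append(q - alpha, 1.0) for q in others])
    b_ub = np.zeros(len(others))
    # b must lie on the belief simplex: sum_s b(s) = 1, 0 <= b(s) <= 1.
    A_eq = np.ones((1, n + 1)); A_eq[0, -1] = 0.0
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.success and -res.fun > 1e-9:     # positive margin: alpha is useful somewhere
        return res.x[:n]
    return None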

Page 23: POMDP Seminar Backup3

Pruning and Vt

• How do we find Vt from Vt-1?
• Could construct a superset of all possible policy trees for Vt based on a useful (minimal) set of policy trees at time t-1.
• There are |A| |Vt-1|^|Ω| elements in that superset, so exhaustive enumeration wastes time.
• Pruning method? A simple technique is to remove nowhere-dominant policy trees; more complex / better techniques exist.

Page 24: POMDP Seminar Backup3

Directly Generating Vt

• Redefine Vt to be the (minimal) set of policy trees whose value functions form the upper surface Vt.
• Is there an algorithm for generating Vt that is polynomial in the sizes of the problem, Vt-1, and Vt? Whether one exists is equivalent to the long-standing question "Does NP = RP?"
• Instead we compute, for each action 'a', a set of t-step policy trees that have action 'a' at the root.
• We know Vt can be recovered from these per-action sets, i.e., by taking their union and pruning.

Page 25: POMDP Seminar Backup3

Witness Algorithm

• Algorithm 2. Outer loop of the witness algorithm.
• The basic structure is still value iteration: first construct policy subtrees using Vt-1, then prune using linear programs.
• The witness algorithm usually runs in polynomial time!

Page 26: POMDP Seminar Backup3

Witness Inner Loop

• Initialize the set Ua with one policy tree that is best for an arbitrary belief state b.
• Ask: is there a belief state b at which the true value of acting with 'a' at the root, computed by one-step lookahead with Vt-1, exceeds the value given by our current model for Vt, the set Ua?
• Such a point b is called a witness point, because it testifies that Ua is still incomplete.
• When we can prove that no more witness points exist, our model for Vt is exact!
• Note: the one-step lookahead value referred to above is the Q-function written out below.
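  Q^{t}_{a}(b) = \sum_{s \in S} b(s)\, R(s, a) \;+\; \gamma \sum_{o \in \Omega} \Pr(o \mid a, b)\, V_{t-1}\!\big(b^{a}_{o}\big)

where b^a_o is the belief state that results from b after taking action a and observing o (via the belief-update equation), and the witness question is whether this quantity exceeds the value of every tree currently in Ua at some b.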

Page 27: POMDP Seminar Backup3

Witness Theorem

• Let pnew be a policy tree p from the current set with just one observation subtree replaced by some tree from Vt-1.
• Then the true Q-function differs from the approximate one at some belief state b if and only if some such pnew has higher value than every tree in the current set at some b.
• Contrapositive: if no tree can be improved by replacing a single subtree, there are no witness points.
