Partially Observable Markov Decision Process in Reinforcement Learning
Shvechikov Pavel
National Research University Higher School of Economics, Yandex School of Data Analysis
April 6, 2018
Overview
1. What is wrong with MDP? (MDP reminder)
2. POMDP details (Definitions; Adapted policies; Sufficient information process)
3. Approximate Learning in POMDPs (Deep Recurrent Q-Learning; MERLIN)
What is MDP?
Definition of Markov Decision Process
MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, P, R \rangle$, where
1. $\mathcal{S}$ – set of states of the world
2. $\mathcal{A}$ – set of actions
3. $P : \mathcal{S} \times \mathcal{A} \mapsto \Delta(\mathcal{S})$ – state-transition function, giving us $p(s_{t+1} \mid s_t, a_t)$
4. $R : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ – reward function, giving us $\mathbb{E}\,[R(s_t, a_t) \mid s_t, a_t]$.

Markov property
$p(r_t, s_{t+1} \mid s_0, a_0, r_0, \ldots, s_t, a_t) = p(r_t, s_{t+1} \mid s_t, a_t)$
(next state, expected reward) depend only on the (current state, action)
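To make the tuple concrete, here is a tiny toy MDP written as NumPy tables; the two states, two actions, and all numbers are made up purely for illustration.

```python
import numpy as np

# Toy MDP <S, A, P, R>: 2 states, 2 actions (all numbers are illustrative).
n_states, n_actions = 2, 2

# P[s, a, s'] = p(s_{t+1} = s' | s_t = s, a_t = a); each P[s, a] sums to 1.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])

# R[s, a] = E[R(s_t, a_t) | s_t = s, a_t = a]
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

assert np.allclose(P.sum(axis=-1), 1.0)  # valid transition kernel
```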
MDP problems are closer than they seem
(game screenshots: Pong, Space Invaders)
What is a state here?
128 bytes of unobserved Atari simulator RAM
Sources of uncertainty
Typically, an autonomous agent's state is composed of
- measurement of the environment
- measurement of the agent itself

In real systems there is even more uncertainty:
1. imperfect self-sensing (position, torque, velocity, etc.)
2. imperfect environment perception
3. incomplete observation of the environment
How to incorporate uncertainty into decision making?
POMDP is a powerful mathematical abstraction
Industrial applications
- Machine maintenance (Shani et al., 2009)
- Wireless networking (Pajarinen et al., 2013)
- Wind farm management (Memarzadeh et al., 2014)
- Aircraft collision avoidance (Bai et al., 2012)
- Choosing sellers in e-marketplaces (Irissappane et al., 2016)

Assistive care
- Assistant for patients with dementia (Hoey et al., 2010)
- Home assistants (Pineau et al., 2003)

Robotics
- Grasping with a robotic arm (Hsiao et al., 2007)
- Navigating an office (Spaan et al., 2005)

Spoken dialog systems
- Uncertainty in voice recognition (Young et al., 2013)
Outline
1. What is wrong with MDP? (MDP reminder)
2. POMDP details (Definitions; Adapted policies; Sufficient information process)
3. Approximate Learning in POMDPs (Deep Recurrent Q-Learning; MERLIN)
POMDP’s place in a model world
POMDP’s siblings
POMDP model
Definition
Partially Observable Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \Omega, O \rangle$
1. $\mathcal{S}, \mathcal{A}, P, R$ are the same as in MDP
2. $\Omega$ – finite set of observations
3. $O : \mathcal{S} \times \mathcal{A} \mapsto \Delta(\Omega)$ – observation function, which gives for every $(s, a) \in \mathcal{S} \times \mathcal{A}$ a probability distribution over $\Omega$, i.e. $p(o \mid s_{t+1}, a_t)$ for all $o \in \Omega$
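Continuing the hypothetical toy example from the MDP slide, partial observability only adds an observation kernel (numbers again made up for illustration):

```python
import numpy as np

n_obs = 2
# O[s_next, a, o] = p(o | s_{t+1} = s_next, a_t = a); each row over o sums to 1.
O = np.array([[[0.8, 0.2], [0.7, 0.3]],
              [[0.3, 0.7], [0.1, 0.9]]])
assert np.allclose(O.sum(axis=-1), 1.0)  # valid observation kernel
```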
What if we ignore the partial observability?
Reactive (adapted) policies
Figure: dynamic Bayes net diagrams of an MDP (states $S_t$, actions $A_t$, rewards $R_t$) and of a POMDP, where the hidden states $S_t$ additionally emit observations $O_t$ and the agent must act on $O_t$ rather than on $S_t$.
Adapted policies (Singh et al., 1994)
Adapted policy
is a mapping $\pi : \Omega \to \Delta(\mathcal{A})$

Stationary adapted $\pi$'s in POMDP:
1. deterministic $\pi$ can be arbitrarily bad compared to the best stochastic $\pi$
2. stochastic $\pi$ can be arbitrarily bad compared to the optimal $\pi$ in the underlying MDP
What maximum return is achievable for a nonstationary policy?
Sufficient information process (Striebel, 1965)
Complete information state $I^C_t$ at time $t$:
$I^C_t = \langle \rho(s_0), o_0, a_0, \ldots, a_{t-1}, o_t \rangle$
where $\rho(s_0)$ is a distribution over initial states.

A sequence $I_t$ defines a sufficient information process when
1. $I_t = \tau(I_{t-1}, o_t, a_{t-1})$: it can be updated incrementally
2. $P(s_t \mid I_t) = P(s_t \mid I^C_t)$: it does not lose information about states
3. $P(o_t \mid I_{t-1}, a_{t-1}) = P(o_t \mid I^C_{t-1}, a_{t-1})$: it does not lose information about the next observation
Information tracking
Belief states and their updates (Bayes filter)
Belief state – distribution over the state space:
$b_t(s) \triangleq P(s_t = s \mid I^C_t)$

A POMDP is an MDP over properly updated beliefs (Astrom, 1965):
$b'(s') = p(s' \mid o', a, b) = \dfrac{p(s', o' \mid a, b)}{p(o' \mid a, b)} = \dfrac{p(o' \mid s', a)\, p(s' \mid a, b)}{\sum_{s''} p(o' \mid s'', a)\, p(s'' \mid a, b)} \propto p(o' \mid s', a) \sum_{s} p(s' \mid a, s)\, b(s)$
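A minimal sketch of this update for a finite state space, assuming the transition and observation tensors are laid out as P[s, a, s'] and O[s', a, o]; the layout and the toy numbers are illustrative, not from the slides.

```python
import numpy as np

def bayes_filter(b, a, o, P, O):
    """One belief update b -> b' after taking action a and observing o.

    b: (S,) belief over states; P: (S, A, S) with P[s, a, s'] = p(s' | s, a);
    O: (S, A, O) with O[s', a, o] = p(o | s', a).
    """
    predicted = b @ P[:, a, :]             # p(s' | a, b) = sum_s p(s'|s,a) b(s)
    unnormalized = O[:, a, o] * predicted  # p(o | s', a) * p(s' | a, b)
    return unnormalized / unnormalized.sum()

# Tiny usage with made-up numbers
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
O = np.array([[[0.8, 0.2], [0.7, 0.3]],
              [[0.3, 0.7], [0.1, 0.9]]])
b_next = bayes_filter(np.array([0.5, 0.5]), a=0, o=1, P=P, O=O)
```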
From POMDP to MDP over beliefs
Bellman optimality equation for $V^*(s_t)$:
$V^*(s) = \max_a \big[\, R(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*(s') \,\big]$

Bellman optimality equation for $V^*(b_t)$:
$V^*(b) = \max_a \big[\, R(b, a) + \gamma \sum_{b'} p(b' \mid b, a)\, V^*(b') \,\big]$
where
$p(b' \mid a, b) = \sum_{o', s', s} p(b' \mid a, b, o')\, p(o' \mid s', a)\, p(s' \mid s, a)\, b(s)$
$p(b' \mid a, b, o') = \mathbb{I}\big(b' = \mathrm{BayesFilter}(o', a, b)\big)$
$R(b, a) = \sum_s b(s)\, R(s, a)$
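To show how these quantities compose, here is a sketch of the one-step lookahead $Q(b, a)$ in the belief MDP; the function signature and the caller-supplied value estimate V are assumptions made only for illustration (it can be called with the toy P, O, R arrays from the earlier sketches).

```python
import numpy as np

def belief_q(b, a, P, O, R, V, gamma=0.99):
    """One-step lookahead Q(b, a) in the belief MDP.

    R(b, a) = sum_s b(s) R(s, a); successor beliefs are enumerated per
    observation o with weight p(o | b, a); V is any belief-value estimate.
    """
    q = b @ R[:, a]                                # R(b, a)
    predicted = b @ P[:, a, :]                     # p(s' | a, b)
    for o in range(O.shape[-1]):
        p_o = predicted @ O[:, a, o]               # p(o | b, a)
        if p_o > 0:
            b_next = O[:, a, o] * predicted / p_o  # Bayes filter update
            q += gamma * p_o * V(b_next)
    return q

# Example call: belief_q(np.array([0.5, 0.5]), 0, P, O, R, V=lambda b: 0.0)
```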
Reasoning about state uncertainty
Bad news: belief updating can be computed exactly only for
1. discrete low-dimensional state spaces
2. linear-Gaussian dynamics (leading to the Kalman filter), i.e.
   $s' \sim \mathcal{N}(s' \mid T_s s + T_a a,\ \Sigma_s)$
   $o' \sim \mathcal{N}(o' \mid O_s s' + O_a a,\ \Sigma_o)$
   $R(s, a) = s^\top R_s s + a^\top R_a a$
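In the linear-Gaussian case the belief stays Gaussian, and the Bayes filter reduces to the Kalman filter; below is a minimal predict-update step in the notation above (the function itself is an illustrative sketch, not taken from the slides).

```python
import numpy as np

def kalman_step(mu, Sigma, a, o, Ts, Ta, Os, Oa, Sigma_s, Sigma_o):
    """One Kalman-filter step: Gaussian belief N(mu, Sigma) -> N(mu', Sigma')."""
    # Predict through the dynamics s' ~ N(Ts s + Ta a, Sigma_s)
    mu_pred = Ts @ mu + Ta @ a
    Sigma_pred = Ts @ Sigma @ Ts.T + Sigma_s
    # Update with the observation o ~ N(Os s' + Oa a, Sigma_o)
    innovation = o - (Os @ mu_pred + Oa @ a)
    S = Os @ Sigma_pred @ Os.T + Sigma_o      # innovation covariance
    K = Sigma_pred @ Os.T @ np.linalg.inv(S)  # Kalman gain
    mu_new = mu_pred + K @ innovation
    Sigma_new = (np.eye(len(mu)) - K @ Os) @ Sigma_pred
    return mu_new, Sigma_new
```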
What if
1. states are of a complex nature (e.g. images)?
2. the state-transition function is non-linear and unknown?
Possible options
1. Use advanced tracking techniques: Deep Variational Bayes Filter (Karl et al., 2016)
2. Just forget all the math and use an LSTM / GRU: DRQN (Hausknecht et al., 2015), DARQN (Zhu et al., 2017), RDPG (Heess et al., 2015)
3. Preserve information with predictive state representations: Recurrent Predictive State Policy (Hefny et al., 2018)
4. Use human-like differentiable memory: Neural Map (Parisotto et al., 2017), MERLIN (Wayne et al., 2018)
Outline
1. What is wrong with MDP? (MDP reminder)
2. POMDP details (Definitions; Adapted policies; Sufficient information process)
3. Approximate Learning in POMDPs (Deep Recurrent Q-Learning; MERLIN)
Deep Recurrent Q-Learning (DRQN)
Q-learning: $Q(s_t, a_t) = \mathbb{E}_{r, s' \mid s_t, a_t} [\, r + \gamma \max_{a'} Q(s', a') \,]$

Problem: we don't know $s_t$

DRQN solution (Hausknecht et al., 2015), sketched below:
1. equip the agent with memory $h_t$
2. approximate $Q(s_t, a_t)$ with $Q(o_t, h_{t-1}, a_t)$
3. eliminate the dependence on $o_t$ by modelling $h_t = \mathrm{LSTM}(o_t, h_{t-1})$

Benefits:
1. simple approximate POMDP solver with a one-frame input
2. need only to model $Q(h_t, a_t)$
3. minor changes to the vanilla DQN architecture
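A minimal PyTorch sketch of such a recurrent Q-network (the layer sizes and the single-channel 84x84 input are illustrative choices, not the paper's exact configuration).

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Q(o_t, h_{t-1}, a): a conv encoder for o_t, an LSTM carrying h_t, a Q head."""
    def __init__(self, n_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=64 * 9 * 9, hidden_size=hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, 1, 84, 84) -- one frame per step, no frame stacking
        b, t = obs_seq.shape[:2]
        feats = self.encoder(obs_seq.flatten(0, 1)).view(b, t, -1)
        out, hidden_state = self.lstm(feats, hidden_state)
        return self.q_head(out), hidden_state  # per-step Q-values and the new h_t
```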
DRQN: architecture
RNN-like memory
Inputs:
1. $I_t$ – input image
2. $v_t$ – agent's velocity
3. $r_{t-1}$ – previous reward
4. $T_t$ – text instructions

Drawbacks:
1. truncated BPTT
2. sparse reward signal
DNC-like memory
Sensory data can instead be encoded and stored without trial and error, in a temporally local manner.
MERLIN (Wayne et al., 2018): design principles
Neuroscience motivation
1. predictive sensory coding: the brain continually generates models of the world, based on context and information from memory, to predict future sensory input
2. hippocampal representation theory (Gluck et al., 1993): representations pass through a compressive bottleneck and then reconstruct the input stimuli together with the task reward
3. temporal context model

Under the hood:
1. Variational Autoencoders
2. Differentiable Neural Computer
3. Recurrent Asynchronous Advantage Actor-Critic (A3C)
4. a 57-page paper by 24 authors from DeepMind
MERLIN (Wayne et al., 2018): architecture
Variational Autoencoder
Each component of $o_t = (I_t, v_t, a_{t-1}, r_{t-1}, T_t) \in \mathbb{R}^{10\,000}$ is independently encoded into $e_t \in \mathbb{R}^{100}$ by (6 ResNet blocks, MLP, None, None, 1-layer LSTM), respectively.

The decoding process uses the same architectures.

Return predictor decoder:
MLP: $V^\pi(z_t, \log \pi_t(a \mid M_{t-1}, z_{\le t}))$ is regressed on $G_t$
MLP: $A(z_t, a_t)$ is regressed on $G_t - V^\pi(\cdot, \cdot)$
Memory – simplified DNC model
Memory is a tensor $M_t$ with dimensions $(N_{mem}, 2|z|)$.
Each step we write the vector $[z_t,\ (1-\gamma)\sum_{t' > t} \gamma^{t'-t} z_{t'}]$ to $M_{t-1}$.
Denote $m_t = [m^1_t, \ldots, m^K_t]$ for the readout of $K$ read heads.

Reading from memory:
MBP LSTM: $[z_t, a_t, m_{t-1}] \to h^1_t$
Policy LSTM: $[z_t] \to h^2_t$
$\mathrm{Linear}([h^1_t, h^2_t]) \to i_t = [k^1_t, \ldots, k^K_t, \beta^1_t, \ldots, \beta^K_t]$
$c^{ij}_t = \mathrm{cosine}(k^i_t,\ M_{t-1}[j, \cdot])$
$w^i_t = \mathrm{Softmax}(\beta^i_t c^{i1}_t, \ldots, \beta^i_t c^{i N_{mem}}_t)$
readout memory: $m^i_t = M^\top_{t-1} w^i_t$
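A NumPy sketch of this content-based read for a single head (the small epsilon for numerical stability is an added assumption).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read_memory(M, key, beta):
    """Content-based read: cosine-similarity addressing over memory rows.

    M: (N_mem, 2*|z|) memory; key: (2*|z|,) read key k^i_t; beta: sharpness beta^i_t.
    """
    sims = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    weights = softmax(beta * sims)   # w^i_t over memory rows
    return M.T @ weights             # readout m^i_t = M^T w^i_t
```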
Writing to memory is performed after reading:
$v^{wr}_t[i] = \delta_{it}$
$v^{ret}_t = \gamma\, v^{ret}_{t-1} + (1-\gamma)\, v^{wr}_{t-1}$
$M_t = M_{t-1} + v^{wr}_t [z_t, 0]^\top + v^{ret}_t [0, z_t]^\top$

When $t > N_{mem}$, select the cell with the lowest utility
$u_{t+1}[k] = u_t[k] + \sum_i w^i_{t+1}[k]$
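A NumPy sketch transcribing the three write equations above (the retroactive and previous write weights are assumed to be carried between calls; the lowest-utility overwrite rule for $t > N_{mem}$ is omitted for brevity).

```python
import numpy as np

def write_memory(M, z, t, v_ret, v_wr_prev, gamma=0.9):
    """Write z_t into row t and retroactively update earlier rows (illustrative).

    M: (N_mem, 2*|z|); z: (|z|,); t: row index to write;
    v_ret, v_wr_prev: (N_mem,) retroactive and previous write weights.
    """
    n_mem, two_z = M.shape
    z_dim = two_z // 2
    v_wr = np.zeros(n_mem)
    v_wr[t] = 1.0                                    # v^wr_t[i] = delta_{it}
    v_ret = gamma * v_ret + (1 - gamma) * v_wr_prev  # retroactive weight update
    M = (M + np.outer(v_wr, np.concatenate([z, np.zeros(z_dim)]))
           + np.outer(v_ret, np.concatenate([np.zeros(z_dim), z])))
    return M, v_ret, v_wr
```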
Latent space
Prior, MLP:
$[h_{t-1}, m_{t-1}] \to \mu^{prior}_t, \log\sigma^{prior}_t$

Concatenate all information from this timestep:
$n_t = [e_t, h_{t-1}, m_{t-1}, \mu^{prior}_t, \log\sigma^{prior}_t]$

Posterior:
$[\mu^{post}_t, \log\sigma^{post}_t] = \mathrm{MLP}(n_t) + [\mu^{prior}_t, \log\sigma^{prior}_t]$

$z_t$ is a sample from the factorized Gaussian posterior
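A PyTorch sketch of the prior head and the residual posterior head (layer sizes and activations are illustrative assumptions).

```python
import torch
import torch.nn as nn

class LatentModel(nn.Module):
    """Prior from [h_{t-1}, m_{t-1}]; posterior is a residual correction using e_t."""
    def __init__(self, e_dim=100, h_dim=256, m_dim=200, z_dim=100, hidden=200):
        super().__init__()
        ctx_dim = h_dim + m_dim
        self.prior = nn.Sequential(nn.Linear(ctx_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 2 * z_dim))
        self.post = nn.Sequential(nn.Linear(e_dim + ctx_dim + 2 * z_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 2 * z_dim))

    def forward(self, e_t, h_prev, m_prev):
        ctx = torch.cat([h_prev, m_prev], dim=-1)
        prior = self.prior(ctx)                     # [mu_prior, log_sigma_prior]
        n_t = torch.cat([e_t, ctx, prior], dim=-1)  # all information from this timestep
        post = self.post(n_t) + prior               # residual parameterization of posterior
        mu, log_sigma = post.chunk(2, dim=-1)
        z_t = mu + log_sigma.exp() * torch.randn_like(mu)  # reparameterized sample
        return z_t, prior, post
```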
Loss function
Variational lower bound:
$\log p(o_{\le t}, r_{\le t}) \,\ge\, \sum_{\tau=0}^{t} \mathbb{E}_{q(z_{<\tau} \mid o_{<\tau})} [\, \mathrm{DataTerm} - \mathrm{KL} \,]$
$\mathrm{DataTerm} = \mathbb{E}_{q(z_\tau \mid z_{<\tau}, o_{\le\tau})} [\, \log p(o_\tau, r_\tau \mid z_\tau) \,]$
$\mathrm{KL} = D_{KL}\big( q(z_\tau \mid z_{<\tau}, o_{\le\tau}) \,\|\, p(z_\tau \mid z_{<\tau}, a_{\le\tau}) \big)$

where $\log p(o_\tau, r_\tau \mid z_\tau)$ is a linear combination of the 6 decoding losses
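A sketch of the resulting per-timestep training loss: the sum of the (already weighted) decoding losses plus the KL between the diagonal-Gaussian posterior and prior. This is an illustrative simplification, not MERLIN's exact loss code.

```python
import torch

def neg_vlb(decode_losses, mu_post, logsig_post, mu_prior, logsig_prior):
    """Negative variational lower bound for one timestep (illustrative).

    decode_losses: iterable of per-component reconstruction losses;
    the KL term is between two factorized Gaussians, KL(posterior || prior).
    """
    data_term = sum(decode_losses)
    kl = (logsig_prior - logsig_post
          + (torch.exp(2 * logsig_post) + (mu_post - mu_prior) ** 2)
          / (2 * torch.exp(2 * logsig_prior))
          - 0.5).sum(-1).mean()
    return data_term + kl
```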
Experiments
The most interesting experiments:
1. Goal-oriented navigation in a maze (3-7 rooms) (video)
2. Arbitrary Visuomotor Mapping (video)
3. T-maze (video)
Thank you!
References
Astrom, Karl J (1965). “Optimal control of Markov processes with incomplete state information”. In: Journal of Mathematical Analysis and Applications 10.1, pp. 174–205.
Bai, Haoyu et al. (2012). “Unmanned aircraft collision avoidance using continuous-state POMDPs”. In: Robotics: Science and Systems VII 1, pp. 1–8.
Gluck, Mark A et al. (1993). “Hippocampal mediation of stimulus representation: A computational theory”. In: Hippocampus 3.4, pp. 491–516.
Hausknecht, Matthew et al. (2015). “Deep recurrent Q-learning for partially observable MDPs”. In: arXiv preprint arXiv:1507.06527.
Heess, Nicolas et al. (2015). “Memory-based control with recurrent neural networks”. In: arXiv preprint arXiv:1512.04455.
Hefny, Ahmed et al. (2018). “Recurrent Predictive State Policy Networks”. In: arXiv preprint arXiv:1803.01489.
Hoey, Jesse et al. (2010). “Automated handwashing assistance for persons with dementia using video and a partially observable Markov decision process”. In: Computer Vision and Image Understanding 114.5, pp. 503–519.
Hsiao, Kaijen et al. (2007). “Grasping POMDPs”. In: Robotics and Automation, 2007 IEEE International Conference on. IEEE, pp. 4685–4692.
Irissappane, Athirai Aravazhi et al. (2016). “A Scalable Framework to Choose Sellers in E-Marketplaces Using POMDPs”. In: AAAI, pp. 158–164.
Karl, Maximilian et al. (2016). “Deep variational Bayes filters: Unsupervised learning of state space models from raw data”. In: arXiv preprint arXiv:1605.06432.
Memarzadeh, Milad et al. (2014). “Optimal planning and learning in uncertain environments for the management of wind farms”. In: Journal of Computing in Civil Engineering 29.5, p. 04014076.
Pajarinen, Joni et al. (2013). “Planning under uncertainty for large-scale problems with applications to wireless networking”. In:
Parisotto, Emilio et al. (2017). “Neural Map: Structured Memory for Deep Reinforcement Learning”. In: arXiv preprint arXiv:1702.08360.
Pineau, Joelle et al. (2003). “Towards robotic assistants in nursing homes: Challenges and results”. In: Robotics and Autonomous Systems 42.3, pp. 271–281.
Shani, Guy et al. (2009). “Improving existing fault recovery policies”. In: Advances in Neural Information Processing Systems, pp. 1642–1650.
Singh, Satinder P et al. (1994). “Learning without state-estimation in partially observable Markovian decision processes”. In: Machine Learning Proceedings 1994. Elsevier, pp. 284–292.
Spaan, Matthijs TJ et al. (2005). “Perseus: Randomized point-based value iteration for POMDPs”. In: Journal of Artificial Intelligence Research 24, pp. 195–220.
Striebel, Charlotte (1965). “Sufficient statistics in the optimum control of stochastic systems”. In: Journal of Mathematical Analysis and Applications 12.3, pp. 576–592.
Wayne, Greg et al. (2018). “Unsupervised Predictive Memory in a Goal-Directed Agent”. In: arXiv preprint arXiv:1803.10760.
Young, Steve et al. (2013). “POMDP-based statistical spoken dialog systems: A review”. In: Proceedings of the IEEE 101.5, pp. 1160–1179.
Zhu, Pengfei et al. (2017). “On Improving Deep Reinforcement Learning for POMDPs”. In: arXiv preprint arXiv:1704.07978.