Partially Observable Markov Decision Process in Reinforcement Learning
Shvechikov Pavel
National Research University Higher School of Economics, Yandex School of Data Analysis
April 6, 2018



Overview

1. What is wrong with MDP?
   - MDP reminder
2. POMDP details
   - Definitions
   - Adapted policies
   - Sufficient information process
3. Approximate Learning in POMDPs
   - Deep Recurrent Q-Learning
   - MERLIN

What is MDP?

Definition of Markov Decision Process
An MDP is a tuple ⟨S, A, P, R⟩, where

1. S – set of states of the world
2. A – set of actions
3. P : S × A → Δ(S) – state-transition function, giving us p(s_{t+1} | s_t, a_t)
4. R : S × A → ℝ – reward function, giving us E[R(s_t, a_t) | s_t, a_t].

Markov property

$$p(r_t, s_{t+1} \mid s_0, a_0, r_0, \ldots, s_t, a_t) = p(r_t, s_{t+1} \mid s_t, a_t)$$

(next state, expected reward) depend only on (current state, action)
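To make the ingredients concrete, here is a minimal tabular sketch (array names and sizes are illustrative, not from the slides): the transition function is a table of distributions, and the Markov property holds by construction, since `step` looks only at the current (s, a).

```python
import numpy as np

# A minimal finite MDP sketch (illustrative only): P[s, a] is a distribution
# over next states, R[s, a] is the expected immediate reward.
rng = np.random.default_rng(0)

n_states, n_actions = 4, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # shape (S, A, S)
R = rng.normal(size=(n_states, n_actions))                        # shape (S, A)

def step(s, a):
    """Sample s' ~ p(. | s, a): the next state and reward depend
    only on the current state-action pair."""
    s_next = rng.choice(n_states, p=P[s, a])
    return s_next, R[s, a]

print(step(0, 1))
```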

MDP problems are closer than they seem

[Figures: Pong, Space Invaders]

What is a state here?

128 bytes of unobserved Atari simulator RAM

Sources of uncertainty

Typically, an autonomous agent's state is composed of:
- a measurement of the environment
- a measurement of the agent itself

In a real system there is even more uncertainty:
1. imperfect self-sensing (position, torque, velocity, etc.)
2. imperfect environment perception
3. incomplete observation of the environment

How to incorporate uncertainty into decision making?

POMDP is a powerful mathematical abstraction

Industrial applications
- Machine maintenance (Shani et al., 2009)
- Wireless networking (Pajarinen et al., 2013)
- Wind farm management (Memarzadeh et al., 2014)
- Aircraft collision avoidance (Bai et al., 2012)
- Choosing sellers in e-marketplaces (Irissappane et al., 2016)

Assistive care
- Assistant for patients with dementia (Hoey et al., 2010)
- Home assistants (Pineau et al., 2003)

Robotics
- Grasping with a robotic arm (Hsiao et al., 2007)
- Navigating an office (Spaan et al., 2005)

Spoken dialog systems
- Uncertainty in voice recognition (Young et al., 2013)

POMDP’s place in a model world

POMDP’s siblings

POMDP model

Definition
A Partially Observable Markov Decision Process is a tuple ⟨S, A, P, R, Ω, O⟩, where

1. S, A, P, R are the same as in an MDP
2. Ω – finite set of observations
3. O : S × A → Δ(Ω) – observation function, which for every (s, a) ∈ S × A gives a probability distribution over Ω, i.e. p(o | s_{t+1}, a_t) for all o ∈ Ω
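A POMDP only adds the observation channel on top of the MDP. Below is a minimal sketch extending the tabular arrays from the MDP example above (again with illustrative names): the environment still evolves over hidden states, but the agent receives only (o, r).

```python
import numpy as np

# Illustrative tabular POMDP: the agent never sees s, only o ~ O(. | s', a).
rng = np.random.default_rng(1)

n_states, n_actions, n_obs = 4, 2, 3
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s' | s, a)
R = rng.normal(size=(n_states, n_actions))                        # E[r | s, a]
O = rng.dirichlet(np.ones(n_obs), size=(n_states, n_actions))     # p(o | s', a)

def step(s, a):
    s_next = rng.choice(n_states, p=P[s, a])
    o = rng.choice(n_obs, p=O[s_next, a])   # observation depends on s' and a
    return s_next, o, R[s, a]               # the agent only gets to see (o, r)
```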

What if we ignore the partial observability?

Reactive (adapted) policies

[Figures: graphical models of an MDP (states S_t, actions A_t, rewards R_t) and of a POMDP (the same chain, with observations O_t added)]

Adapted policies (Singh et al., 1994)

Adapted policy
An adapted policy is a mapping π : Ω → Δ(A).

Stationary adapted policies π in a POMDP:
1. a deterministic π can be arbitrarily bad compared to the best stochastic π
2. a stochastic π can be arbitrarily bad compared to the optimal π in the underlying MDP
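For illustration only (not from the slides): in the tabular setting a stochastic adapted policy is just one action distribution per observation, sampled from without any memory of the past.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_actions = 3, 2

# A stochastic adapted (memoryless) policy: one distribution over actions
# per observation, pi[o, a] = pi(a | o).
pi = rng.dirichlet(np.ones(n_actions), size=n_obs)

def act(o):
    """Sample a ~ pi(. | o); the policy sees only the current observation."""
    return rng.choice(n_actions, p=pi[o])
```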

What maximum return is achievable for a nonstationary policy?

Sufficient information process (Striebel, 1965)

Complete information state $I^C_t$ at time $t$:
$$I^C_t = \langle \rho(s_0), o_0, a_0, \ldots, a_{t-1}, o_t \rangle$$
where $\rho(s_0)$ is a distribution over initial states.

A sequence $I_t$ defines a sufficient information process when
1. $I_t = \tau(I_{t-1}, o_t, a_{t-1})$ (it can be updated incrementally)
2. $P(s_t \mid I_t) = P(s_t \mid I^C_t)$ (it does not lose information about states)
3. $P(o_t \mid I_{t-1}, a_{t-1}) = P(o_t \mid I^C_{t-1}, a_{t-1})$ (it does not lose information about the next observation)

Information tracking

Belief states and their updates (Bayes filter)

Belief state – a distribution over the state space:
$$b_t(s) \triangleq P(s_t = s \mid I^C_t)$$

A POMDP is an MDP over properly updated beliefs (Astrom, 1965):
$$b'(s') = p(s' \mid o', a, b) = \frac{p(s', o' \mid a, b)}{p(o' \mid a, b)} = \frac{p(o' \mid s', a)\, p(s' \mid a, b)}{\sum_{s''} p(o' \mid s'', a)\, p(s'' \mid a, b)} \propto p(o' \mid s', a) \sum_{s} p(s' \mid a, s)\, b(s)$$
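The update above translates directly into a few lines of numpy. This is a sketch assuming the tabular `P` and `O` arrays from the earlier POMDP example (index conventions are illustrative):

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """Discrete Bayes filter: b'(s') ∝ p(o | s', a) * sum_s p(s' | s, a) b(s).

    b: belief over states, shape (S,)
    P: transition tensor, P[s, a, s'] = p(s' | s, a)
    O: observation tensor, O[s', a, o] = p(o | s', a)
    """
    predicted = b @ P[:, a, :]              # sum_s p(s' | s, a) b(s), shape (S,)
    unnormalized = O[:, a, o] * predicted   # multiply by the likelihood p(o | s', a)
    return unnormalized / unnormalized.sum()
```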

From POMDP to MDP over beliefs

Bellman optimality equation for $V^*(s_t)$:
$$V^*(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*(s') \Big]$$

Bellman optimality equation for $V^*(b_t)$:
$$V^*(b) = \max_a \Big[ R(b, a) + \gamma \sum_{b'} p(b' \mid b, a)\, V^*(b') \Big]$$
$$p(b' \mid a, b) = \sum_{o', s', s} p(b' \mid a, b, o')\, p(o' \mid s', a)\, p(s' \mid s, a)\, b(s)$$
$$p(b' \mid a, b, o') = \mathbb{I}\big(b' = \mathrm{BayesFilter}(o', a, b)\big)$$
$$R(b, a) = \sum_s b(s)\, R(s, a)$$
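The belief-MDP quantities can be computed the same way: $R(b, a)$ is an expectation under $b$, and $p(b' \mid b, a)$ has support only on the beliefs produced by the Bayes filter, one per possible observation. A hedged sketch using the same tabular arrays as above:

```python
import numpy as np

def belief_reward(b, a, R):
    """R(b, a) = sum_s b(s) R(s, a)."""
    return b @ R[:, a]

def belief_successors(b, a, P, O):
    """Enumerate beliefs reachable from (b, a): one successor belief per
    observation o', weighted by p(o' | b, a). Together these form the
    distribution p(b' | b, a) in the belief-MDP Bellman equation."""
    predicted = b @ P[:, a, :]                      # p(s' | b, a)
    successors = []
    for o in range(O.shape[-1]):
        p_o = O[:, a, o] @ predicted                # p(o' | b, a)
        if p_o > 0:
            b_next = O[:, a, o] * predicted / p_o   # Bayes filter update
            successors.append((p_o, b_next))
    return successors
```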

Reasoning about state uncertainty

Bad news: the belief update can be computed exactly only for
1. discrete low-dimensional state spaces
2. linear-Gaussian dynamics (leading to the Kalman filter), i.e.
$$s' \sim \mathcal{N}(s' \mid T_s s + T_a a,\ \Sigma_s), \qquad o' \sim \mathcal{N}(o' \mid O_s s' + O_a a,\ \Sigma_o), \qquad R(s, a) = s^\top R_s s + a^\top R_a a$$
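In the linear-Gaussian case the belief stays Gaussian, so tracking its mean and covariance is exact. A minimal Kalman-filter sketch following the slide's matrix names (shapes are assumed compatible; this is illustrative code, not from the slides):

```python
import numpy as np

def kalman_belief_update(mu, Sigma, a, o, Ts, Ta, Sigma_s, Os, Oa, Sigma_o):
    """Exact belief update for linear-Gaussian dynamics:
    s' ~ N(Ts s + Ta a, Sigma_s), o' ~ N(Os s' + Oa a, Sigma_o)."""
    # Predict step: push the Gaussian belief through the linear dynamics.
    mu_pred = Ts @ mu + Ta @ a
    Sigma_pred = Ts @ Sigma @ Ts.T + Sigma_s
    # Update step: condition on the new observation o.
    S = Os @ Sigma_pred @ Os.T + Sigma_o           # innovation covariance
    K = Sigma_pred @ Os.T @ np.linalg.inv(S)       # Kalman gain
    mu_new = mu_pred + K @ (o - (Os @ mu_pred + Oa @ a))
    Sigma_new = (np.eye(len(mu)) - K @ Os) @ Sigma_pred
    return mu_new, Sigma_new
```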

What if
1. states are of a complex nature (e.g., images)?
2. the state-transition function is non-linear and unknown?

Possible options

1. Use advanced tracking techniques
   - Deep Variational Bayes Filters (Karl et al., 2016)
2. Just forget all the math and use an LSTM / GRU
   - DRQN (Hausknecht et al., 2015), DARQN (Zhu et al., 2017), RDPG (Heess et al., 2015)
3. Preserve information with predictive state representations
   - Recurrent Predictive State Policy networks (Hefny et al., 2018)
4. Use human-like differentiable memory
   - Neural Map (Parisotto et al., 2017), MERLIN (Wayne et al., 2018)

Deep Recurrent Q-Learning (DRQN)

Q-learning: $Q(s_t, a_t) = \mathbb{E}_{r, s' \mid s_t, a_t}\big[\, r + \gamma \max_{a'} Q(s', a') \,\big]$

Problem: we don't know $s_t$.

DRQN solution (Hausknecht et al., 2015):
1. equip the agent with memory $h_t$
2. approximate $Q(s_t, a_t)$ with $Q(o_t, h_{t-1}, a_t)$
3. eliminate the dependence on $o_t$ by modelling $h_t = \mathrm{LSTM}(o_t, h_{t-1})$

Benefits:
1. a simple approximate POMDP solver with a single-frame input
2. only $Q(h_t, a_t)$ needs to be modelled
3. minor changes to the vanilla DQN architecture
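A minimal PyTorch-style sketch of such a recurrent Q-network (layer sizes and the 84x84 single-channel input are illustrative assumptions, not the exact architecture of Hausknecht et al.):

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Recurrent Q-network sketch: Q(o_t, h_{t-1}, a) for all actions a."""

    def __init__(self, n_actions, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=64 * 9 * 9, hidden_size=hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, frames, hidden_state=None):
        # frames: (batch, time, 1, 84, 84) -- one frame per step, no frame stacking
        b, t = frames.shape[:2]
        feats = self.conv(frames.flatten(0, 1)).view(b, t, -1)
        out, hidden_state = self.lstm(feats, hidden_state)   # h_t = LSTM(o_t, h_{t-1})
        return self.q_head(out), hidden_state                 # Q(h_t, a), new hidden state
```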

DRQN: architecture

RNN-like memory

Inputs:
1. I_t – input image
2. v_t – agent's velocity
3. r_{t-1} – previous reward
4. T_t – text instructions

Drawbacks:
1. truncated BPTT
2. sparse reward signal

DNC-like memory

Sensory data can instead be encoded and stored without trial and error, in a temporally local manner.

MERLIN (Wayne et al., 2018): design principles

Neuroscience motivation:
1. predictive sensory coding
   - the brain continually generates models of the world, based on context and information from memory, to predict future sensory input
2. hippocampal representation theory (Gluck et al., 1993)
   - representations pass through a compressive bottleneck, then reconstruct input stimuli together with the task reward
3. temporal context model

Under the hood:
1. Variational Autoencoders
2. Differentiable Neural Computer
3. Recurrent Asynchronous Advantage Actor-Critic (A3C)
4. a 57-page paper by 24 authors from DeepMind

MERLIN (Wayne et al., 2018): architecture

Variational Autoencoder

Each component of
$$o_t = (I_t, v_t, a_{t-1}, r_{t-1}, T_t) \in \mathbb{R}^{10\,000}$$
is independently encoded into $e_t \in \mathbb{R}^{100}$ by
(6 ResNet blocks, MLP, None, None, 1-layer LSTM), respectively.

The decoding process uses the same architectures.

Return prediction decoder:
- MLP: $V^\pi(z_t, \log \pi_t(a \mid M_{t-1}, z_{\le t}))$ is regressed on $G_t$
- MLP: $A(z_t, a_t)$ is regressed on $G_t - V^\pi(\cdot, \cdot)$

Memory – simplified DNC model

Memory is a tensor $M_t$ with dimensions $(N_{mem}, 2|z|)$.
Each step we write the vector $[z_t,\ (1-\gamma) \sum_{t' > t} \gamma^{t'-t} z_{t'}]$ to $M_{t-1}$.
Denote by $m_t = [m^1_t, \ldots, m^K_t]$ the readout of the $K$ read heads.

Reading from memory:
- MBP LSTM: $[z_t, a_t, m_{t-1}] \to h^1_t$
- Policy LSTM: $[z_t] \to h^2_t$
- $\mathrm{Linear}([h^1_t, h^2_t]) \to i_t = [k^1_t, \ldots, k^K_t, \beta^1_t, \ldots, \beta^K_t]$
- $c^{ij}_t = \mathrm{cosine}(k^i_t, M_{t-1}[j, \cdot])$
- $w^i_t = \mathrm{Softmax}(\beta^i_t c^{i1}_t, \ldots, \beta^i_t c^{iN_{mem}}_t)$
- readout memory $m^i_t = M^\top_{t-1} w^i_t$
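Content-based reading with one read head reduces to a softmax over scaled cosine similarities. A small numpy sketch (the function name and the single-head simplification are mine, not the paper's):

```python
import numpy as np

def content_read(M, key, beta):
    """Content-based read from memory (simplified DNC): read weights are a
    softmax over cosine similarities between the key and each memory row,
    sharpened by the scalar beta."""
    M_norm = M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-8)
    k_norm = key / (np.linalg.norm(key) + 1e-8)
    scores = beta * (M_norm @ k_norm)          # cosine(key, M[j]) for every row j
    w = np.exp(scores - scores.max())
    w /= w.sum()                               # softmax read weights
    return M.T @ w, w                          # readout m = M^T w, plus the weights
```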

Writing to memory is performed after reading:
$$v^{wr}_t[i] = \delta_{it}$$
$$v^{ret}_t = \gamma v^{ret}_{t-1} + (1-\gamma)\, v^{wr}_{t-1}$$
$$M_t = M_{t-1} + v^{wr}_t [z_t, 0]^\top + v^{ret}_t [0, z_t]^\top$$

When $t > N_{mem}$, select the cell with the lowest utility
$$u_{t+1}[k] = u_t[k] + \sum_i w^i_{t+1}[k]$$
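The write step can be transcribed almost literally from these equations. A sketch with illustrative argument names; `row` plays the role of the index selected by $v^{wr}_t$, and M is assumed to have shape (rows, 2 * len(z)):

```python
import numpy as np

def memory_write(M, z, v_wr_prev, v_ret_prev, row, gamma=0.95):
    """One write step of the simplified DNC memory from the slide:
        v_ret_t = gamma * v_ret_{t-1} + (1 - gamma) * v_wr_{t-1}
        M_t     = M_{t-1} + v_wr_t [z_t, 0]^T + v_ret_t [0, z_t]^T
    v_wr_t is a one-hot vector selecting the row to write."""
    d = len(z)
    v_wr = np.zeros(M.shape[0])
    v_wr[row] = 1.0
    v_ret = gamma * v_ret_prev + (1.0 - gamma) * v_wr_prev
    M_new = (M
             + np.outer(v_wr, np.concatenate([z, np.zeros(d)]))     # first half: z_t
             + np.outer(v_ret, np.concatenate([np.zeros(d), z])))   # retroactive half
    return M_new, v_wr, v_ret
```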

Latent space

Prior, MLP:
$$[h_{t-1}, m_{t-1}] \to \mu^{prior}_t, \log \sigma^{prior}_t$$

Concatenate all information from this timestep:
$$n_t = [e_t, h_{t-1}, m_{t-1}, \mu^{prior}_t, \log \sigma^{prior}_t]$$

Posterior:
$$[\mu^{post}_t, \log \sigma^{post}_t] = \mathrm{MLP}(n_t) + [\mu^{prior}_t, \log \sigma^{prior}_t]$$

$z_t$ is a sample from the posterior factorized Gaussian.
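A PyTorch-style sketch of these prior/posterior heads (hidden sizes and the Tanh nonlinearity are assumptions; only the residual parameterization and the reparameterized sample follow the slide):

```python
import torch
import torch.nn as nn

class LatentDistributions(nn.Module):
    """Prior/posterior heads: the posterior MLP predicts a residual on top of
    the prior parameters; z_t is sampled from the resulting diagonal Gaussian."""

    def __init__(self, d_e, d_h, d_m, d_z, d_hidden=128):
        super().__init__()
        self.prior = nn.Sequential(nn.Linear(d_h + d_m, d_hidden), nn.Tanh(),
                                   nn.Linear(d_hidden, 2 * d_z))
        self.post = nn.Sequential(nn.Linear(d_e + d_h + d_m + 2 * d_z, d_hidden), nn.Tanh(),
                                  nn.Linear(d_hidden, 2 * d_z))

    def forward(self, e_t, h_prev, m_prev):
        prior_params = self.prior(torch.cat([h_prev, m_prev], dim=-1))
        n_t = torch.cat([e_t, h_prev, m_prev, prior_params], dim=-1)
        post_params = self.post(n_t) + prior_params           # residual parameterization
        mu, log_sigma = post_params.chunk(2, dim=-1)
        z_t = mu + log_sigma.exp() * torch.randn_like(mu)      # reparameterized sample
        return z_t, prior_params, post_params
```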

Loss function

Variational lower bound:
$$\log p(o_{\le t}, r_{\le t}) \ge \sum_{\tau=0}^{t} \mathbb{E}_{q(z_{<\tau} \mid o_{<\tau})} \big[\, \mathrm{DataTerm} - \mathrm{KL} \,\big]$$
$$\mathrm{DataTerm} = \mathbb{E}_{q(z_\tau \mid z_{<\tau}, o_{\le\tau})} \big[\, \log p(o_\tau, r_\tau \mid z_\tau) \,\big]$$
$$\mathrm{KL} = D_{KL}\big( q(z_\tau \mid z_{<\tau}, o_{\le\tau}) \,\|\, p(z_\tau \mid z_{<\tau}, a_{\le\tau}) \big)$$

where $p(o_\tau, r_\tau \mid z_\tau)$ is a linear combination of 6 decoding losses.
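The KL term between the diagonal-Gaussian posterior and prior has a closed form. A small sketch, parameterized by means and log standard deviations to match the $\mu, \log\sigma$ outputs above:

```python
import torch

def diag_gaussian_kl(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for factorized Gaussians,
    summed over latent dimensions -- the KL term of the lower bound above."""
    var_q, var_p = (2 * log_sigma_q).exp(), (2 * log_sigma_p).exp()
    kl = log_sigma_p - log_sigma_q + (var_q + (mu_q - mu_p) ** 2) / (2 * var_p) - 0.5
    return kl.sum(dim=-1)
```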

Experiments

The most interesting experiments:
1. Goal-oriented navigation in a maze (3–7 rooms) (video)
2. Arbitrary visuomotor mapping (video)
3. T-maze (video)

Thank you!

References

Astrom, Karl J (1965). “Optimal control of Markov processes with incomplete state information”. In: Journal of Mathematical Analysis and Applications 10.1, pp. 174–205.

Bai, Haoyu et al. (2012). “Unmanned aircraft collision avoidance using continuous-state POMDPs”. In: Robotics: Science and Systems VII 1, pp. 1–8.

Gluck, Mark A et al. (1993). “Hippocampal mediation of stimulus representation: A computational theory”. In: Hippocampus 3.4, pp. 491–516.

Hausknecht, Matthew et al. (2015). “Deep recurrent Q-learning for partially observable MDPs”. In: arXiv preprint arXiv:1507.06527.

Heess, Nicolas et al. (2015). “Memory-based control with recurrent neural networks”. In: arXiv preprint arXiv:1512.04455.

Hefny, Ahmed et al. (2018). “Recurrent Predictive State Policy Networks”. In: arXiv preprint arXiv:1803.01489.

Hoey, Jesse et al. (2010). “Automated handwashing assistance for persons with dementia using video and a partially observable Markov decision process”. In: Computer Vision and Image Understanding 114.5, pp. 503–519.

Hsiao, Kaijen et al. (2007). “Grasping POMDPs”. In: Robotics and Automation, 2007 IEEE International Conference on. IEEE, pp. 4685–4692.

Irissappane, Athirai Aravazhi et al. (2016). “A Scalable Framework to Choose Sellers in E-Marketplaces Using POMDPs”. In: AAAI, pp. 158–164.

Karl, Maximilian et al. (2016). “Deep variational Bayes filters: Unsupervised learning of state space models from raw data”. In: arXiv preprint arXiv:1605.06432.

Memarzadeh, Milad et al. (2014). “Optimal planning and learning in uncertain environments for the management of wind farms”. In: Journal of Computing in Civil Engineering 29.5, p. 04014076.

Pajarinen, Joni et al. (2013). “Planning under uncertainty for large-scale problems with applications to wireless networking”. In:

Parisotto, Emilio et al. (2017). “Neural Map: Structured Memory for Deep Reinforcement Learning”. In: arXiv preprint arXiv:1702.08360.

Pineau, Joelle et al. (2003). “Towards robotic assistants in nursing homes: Challenges and results”. In: Robotics and Autonomous Systems 42.3, pp. 271–281.

Shani, Guy et al. (2009). “Improving existing fault recovery policies”. In: Advances in Neural Information Processing Systems, pp. 1642–1650.

Singh, Satinder P et al. (1994). “Learning without state-estimation in partially observable Markovian decision processes”. In: Machine Learning Proceedings 1994. Elsevier, pp. 284–292.

Spaan, Matthijs TJ et al. (2005). “Perseus: Randomized point-based value iteration for POMDPs”. In: Journal of Artificial Intelligence Research 24, pp. 195–220.

Striebel, Charlotte (1965). “Sufficient statistics in the optimum control of stochastic systems”. In: Journal of Mathematical Analysis and Applications 12.3, pp. 576–592.

Wayne, Greg et al. (2018). “Unsupervised Predictive Memory in a Goal-Directed Agent”. In: arXiv preprint arXiv:1803.10760.

Young, Steve et al. (2013). “POMDP-based statistical spoken dialog systems: A review”. In: Proceedings of the IEEE 101.5, pp. 1160–1179.

Zhu, Pengfei et al. (2017). “On Improving Deep Reinforcement Learning for POMDPs”. In: arXiv preprint arXiv:1704.07978.
