Partially Observable Markov Decision Process in Reinforcement Learning
Shvechikov Pavel
National Research University Higher School of Economics, Yandex School of Data Analysis
April 6, 2018
Overview
1. What is wrong with MDP? (MDP reminder)
2. POMDP details (Definitions; Adapted policies; Sufficient information process)
3. Approximate Learning in POMDPs (Deep Recurrent Q-Learning; MERLIN)
What is MDP?
Definition of Markov Decision Process
MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, P, R \rangle$, where
1. $\mathcal{S}$ – set of states of the world
2. $\mathcal{A}$ – set of actions
3. $P : \mathcal{S} \times \mathcal{A} \mapsto \Delta(\mathcal{S})$ – state-transition function, giving us $p(s_{t+1} \mid s_t, a_t)$
4. $R : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ – reward function, giving us $\mathbb{E}\,[R(s_t, a_t) \mid s_t, a_t]$.

Markov property
$p(r_t, s_{t+1} \mid s_0, a_0, r_0, \ldots, s_t, a_t) = p(r_t, s_{t+1} \mid s_t, a_t)$
(next state, expected reward) depend only on the (current state, action)
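To make the tuple concrete, here is a tiny toy MDP written as NumPy tables; the two states, two actions, and all numbers are made up purely for illustration.

```python
import numpy as np

# Toy MDP <S, A, P, R>: 2 states, 2 actions (all numbers are illustrative).
n_states, n_actions = 2, 2

# P[s, a, s'] = p(s_{t+1} = s' | s_t = s, a_t = a); each P[s, a] sums to 1.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])

# R[s, a] = E[R(s_t, a_t) | s_t = s, a_t = a]
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

assert np.allclose(P.sum(axis=-1), 1.0)  # valid transition kernel
```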
MDP problems are closer than they seem
(game screenshots: Pong, Space Invaders)
What is a state here?
128 bytes of unobserved Atari simulator RAM
Sources of uncertainty
Typically, an autonomous agent's state is composed of
- measurement of the environment
- measurement of the agent itself

In real systems there is even more uncertainty:
1. imperfect self-sensing (position, torque, velocity, etc.)
2. imperfect environment perception
3. incomplete observation of the environment
How to incorporate uncertainty into decision making?
POMDP is a powerful mathematical abstraction
Industrial applications
- Machine maintenance (Shani et al., 2009)
- Wireless networking (Pajarinen et al., 2013)
- Wind farm management (Memarzadeh et al., 2014)
- Aircraft collision avoidance (Bai et al., 2012)
- Choosing sellers in e-marketplaces (Irissappane et al., 2016)

Assistive care
- Assistant for patients with dementia (Hoey et al., 2010)
- Home assistants (Pineau et al., 2003)

Robotics
- Grasping with a robotic arm (Hsiao et al., 2007)
- Navigating an office (Spaan et al., 2005)

Spoken dialog systems
- Uncertainty in voice recognition (Young et al., 2013)
Outline
1. What is wrong with MDP? (MDP reminder)
2. POMDP details (Definitions; Adapted policies; Sufficient information process)
3. Approximate Learning in POMDPs (Deep Recurrent Q-Learning; MERLIN)
POMDP’s place in a model world
POMDP’s siblings
POMDP model
Definition
Partially Observable Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \Omega, O \rangle$
1. $\mathcal{S}, \mathcal{A}, P, R$ are the same as in MDP
2. $\Omega$ – finite set of observations
3. $O : \mathcal{S} \times \mathcal{A} \mapsto \Delta(\Omega)$ – observation function, which gives for every $(s, a) \in \mathcal{S} \times \mathcal{A}$ a probability distribution over $\Omega$, i.e. $p(o \mid s_{t+1}, a_t)$ for all $o \in \Omega$
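Continuing the hypothetical toy example from the MDP slide, partial observability only adds an observation kernel (numbers again made up for illustration):

```python
import numpy as np

n_obs = 2
# O[s_next, a, o] = p(o | s_{t+1} = s_next, a_t = a); each row over o sums to 1.
O = np.array([[[0.8, 0.2], [0.7, 0.3]],
              [[0.3, 0.7], [0.1, 0.9]]])
assert np.allclose(O.sum(axis=-1), 1.0)  # valid observation kernel
```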
What if we ignore the partial observability?
Reactive (adapted) policies
Figure: dynamic Bayes net diagrams of an MDP (states $S_t$, actions $A_t$, rewards $R_t$) and of a POMDP, where the hidden states $S_t$ additionally emit observations $O_t$ and the agent must act on $O_t$ rather than on $S_t$.
Adapted policies (Singh et al., 1994)
Adapted policy
is a mapping $\pi : \Omega \to \Delta(\mathcal{A})$

Stationary adapted $\pi$'s in POMDP:
1. deterministic $\pi$ can be arbitrarily bad compared to the best stochastic $\pi$
2. stochastic $\pi$ can be arbitrarily bad compared to the optimal $\pi$ in the underlying MDP
What maximum return is achievable for a nonstationary policy?
Sufficient information process (Striebel, 1965)
Complete information state $I^C_t$ at time $t$:
$I^C_t = \langle \rho(s_0), o_0, a_0, \ldots, a_{t-1}, o_t \rangle$
where $\rho(s_0)$ is a distribution over initial states.

A sequence $I_t$ defines a sufficient information process when
1. $I_t = \tau(I_{t-1}, o_t, a_{t-1})$: it can be updated incrementally
2. $P(s_t \mid I_t) = P(s_t \mid I^C_t)$: it does not lose information about states
3. $P(o_t \mid I_{t-1}, a_{t-1}) = P(o_t \mid I^C_{t-1}, a_{t-1})$: it does not lose information about the next observation
Information tracking
Belief states and their updates (Bayes filter)
Belief state – distribution over the state space:
$b_t(s) \triangleq P(s_t = s \mid I^C_t)$

A POMDP is an MDP over properly updated beliefs (Astrom, 1965):
$b'(s') = p(s' \mid o', a, b) = \dfrac{p(s', o' \mid a, b)}{p(o' \mid a, b)} = \dfrac{p(o' \mid s', a)\, p(s' \mid a, b)}{\sum_{s''} p(o' \mid s'', a)\, p(s'' \mid a, b)} \propto p(o' \mid s', a) \sum_{s} p(s' \mid a, s)\, b(s)$
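A minimal sketch of this update for a finite state space, assuming the transition and observation tensors are laid out as P[s, a, s'] and O[s', a, o]; the layout and the toy numbers are illustrative, not from the slides.

```python
import numpy as np

def bayes_filter(b, a, o, P, O):
    """One belief update b -> b' after taking action a and observing o.

    b: (S,) belief over states; P: (S, A, S) with P[s, a, s'] = p(s' | s, a);
    O: (S, A, O) with O[s', a, o] = p(o | s', a).
    """
    predicted = b @ P[:, a, :]             # p(s' | a, b) = sum_s p(s'|s,a) b(s)
    unnormalized = O[:, a, o] * predicted  # p(o | s', a) * p(s' | a, b)
    return unnormalized / unnormalized.sum()

# Tiny usage with made-up numbers
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
O = np.array([[[0.8, 0.2], [0.7, 0.3]],
              [[0.3, 0.7], [0.1, 0.9]]])
b_next = bayes_filter(np.array([0.5, 0.5]), a=0, o=1, P=P, O=O)
```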
From POMDP to MDP over beliefs
Bellman optimality equation for $V^*(s_t)$:
$V^*(s) = \max_a \big[\, R(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*(s') \,\big]$

Bellman optimality equation for $V^*(b_t)$:
$V^*(b) = \max_a \big[\, R(b, a) + \gamma \sum_{b'} p(b' \mid b, a)\, V^*(b') \,\big]$
where
$p(b' \mid a, b) = \sum_{o', s', s} p(b' \mid a, b, o')\, p(o' \mid s', a)\, p(s' \mid s, a)\, b(s)$
$p(b' \mid a, b, o') = \mathbb{I}\big(b' = \mathrm{BayesFilter}(o', a, b)\big)$
$R(b, a) = \sum_s b(s)\, R(s, a)$
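To show how these quantities compose, here is a sketch of the one-step lookahead $Q(b, a)$ in the belief MDP; the function signature and the caller-supplied value estimate V are assumptions made only for illustration (it can be called with the toy P, O, R arrays from the earlier sketches).

```python
import numpy as np

def belief_q(b, a, P, O, R, V, gamma=0.99):
    """One-step lookahead Q(b, a) in the belief MDP.

    R(b, a) = sum_s b(s) R(s, a); successor beliefs are enumerated per
    observation o with weight p(o | b, a); V is any belief-value estimate.
    """
    q = b @ R[:, a]                                # R(b, a)
    predicted = b @ P[:, a, :]                     # p(s' | a, b)
    for o in range(O.shape[-1]):
        p_o = predicted @ O[:, a, o]               # p(o | b, a)
        if p_o > 0:
            b_next = O[:, a, o] * predicted / p_o  # Bayes filter update
            q += gamma * p_o * V(b_next)
    return q

# Example call: belief_q(np.array([0.5, 0.5]), 0, P, O, R, V=lambda b: 0.0)
```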
Reasoning about state uncertainty
Bad news: belief updating can be computed exactly only for
1. discrete low-dimensional state spaces
2. linear-Gaussian dynamics (leading to the Kalman filter), i.e.
   $s' \sim \mathcal{N}(s' \mid T_s s + T_a a,\ \Sigma_s)$
   $o' \sim \mathcal{N}(o' \mid O_s s' + O_a a,\ \Sigma_o)$
   $R(s, a) = s^\top R_s s + a^\top R_a a$
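In the linear-Gaussian case the belief stays Gaussian, and the Bayes filter reduces to the Kalman filter; below is a minimal predict-update step in the notation above (the function itself is an illustrative sketch, not taken from the slides).

```python
import numpy as np

def kalman_step(mu, Sigma, a, o, Ts, Ta, Os, Oa, Sigma_s, Sigma_o):
    """One Kalman-filter step: Gaussian belief N(mu, Sigma) -> N(mu', Sigma')."""
    # Predict through the dynamics s' ~ N(Ts s + Ta a, Sigma_s)
    mu_pred = Ts @ mu + Ta @ a
    Sigma_pred = Ts @ Sigma @ Ts.T + Sigma_s
    # Update with the observation o ~ N(Os s' + Oa a, Sigma_o)
    innovation = o - (Os @ mu_pred + Oa @ a)
    S = Os @ Sigma_pred @ Os.T + Sigma_o      # innovation covariance
    K = Sigma_pred @ Os.T @ np.linalg.inv(S)  # Kalman gain
    mu_new = mu_pred + K @ innovation
    Sigma_new = (np.eye(len(mu)) - K @ Os) @ Sigma_pred
    return mu_new, Sigma_new
```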
What if
1. states are of a complex nature (e.g. images)?
2. the state-transition function is non-linear and unknown?
Possible options
1. Use advanced tracking techniques: Deep Variational Bayes Filter (Karl et al., 2016)
2. Just forget all the math and use an LSTM / GRU: DRQN (Hausknecht et al., 2015), DARQN (Zhu et al., 2017), RDPG (Heess et al., 2015)
3. Preserve information with predictive state representations: Recurrent Predictive State Policy (Hefny et al., 2018)
4. Use human-like differentiable memory: Neural Map (Parisotto et al., 2017), MERLIN (Wayne et al., 2018)
Outline
1. What is wrong with MDP? (MDP reminder)
2. POMDP details (Definitions; Adapted policies; Sufficient information process)
3. Approximate Learning in POMDPs (Deep Recurrent Q-Learning; MERLIN)
Deep Recurrent Q-Learning (DRQN)
Q-learning: $Q(s_t, a_t) = \mathbb{E}_{r, s' \mid s_t, a_t} [\, r + \gamma \max_{a'} Q(s', a') \,]$

Problem: we don't know $s_t$

DRQN solution (Hausknecht et al., 2015), sketched below:
1. equip the agent with memory $h_t$
2. approximate $Q(s_t, a_t)$ with $Q(o_t, h_{t-1}, a_t)$
3. eliminate the dependence on $o_t$ by modelling $h_t = \mathrm{LSTM}(o_t, h_{t-1})$

Benefits:
1. simple approximate POMDP solver with a one-frame input
2. need only to model $Q(h_t, a_t)$
3. minor changes to the vanilla DQN architecture
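A minimal PyTorch sketch of such a recurrent Q-network (the layer sizes and the single-channel 84x84 input are illustrative choices, not the paper's exact configuration).

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Q(o_t, h_{t-1}, a): a conv encoder for o_t, an LSTM carrying h_t, a Q head."""
    def __init__(self, n_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=64 * 9 * 9, hidden_size=hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, 1, 84, 84) -- one frame per step, no frame stacking
        b, t = obs_seq.shape[:2]
        feats = self.encoder(obs_seq.flatten(0, 1)).view(b, t, -1)
        out, hidden_state = self.lstm(feats, hidden_state)
        return self.q_head(out), hidden_state  # per-step Q-values and the new h_t
```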
DRQN: architecture
RNN-like memory
Inputs:
1. $I_t$ – input image
2. $v_t$ – agent's velocity
3. $r_{t-1}$ – previous reward
4. $T_t$ – text instructions

Drawbacks:
1. truncated BPTT
2. sparse reward signal
DNC-like memory
Sensory data can instead be encoded and stored without trial and error, in a temporally local manner.
MERLIN (Wayne et al., 2018): design principles
Neuroscience motivation
1. predictive sensory coding: the brain continually generates models of the world, based on context and information from memory, to predict future sensory input
2. hippocampal representation theory (Gluck et al., 1993): representations pass through a compressive bottleneck and then reconstruct the input stimuli together with the task reward
3. temporal context model

Under the hood:
1. Variational Autoencoders
2. Differentiable Neural Computer
3. Recurrent Asynchronous Advantage Actor-Critic (A3C)
4. a 57-page paper by 24 authors from DeepMind
MERLIN (Wayne et al., 2018): architecture
Variational Autoencoder
Each component of $o_t = (I_t, v_t, a_{t-1}, r_{t-1}, T_t) \in \mathbb{R}^{10\,000}$ is independently encoded into $e_t \in \mathbb{R}^{100}$ by (6 ResNet blocks, MLP, None, None, 1-layer LSTM), respectively.

The decoding process uses the same architectures.

Return predictor decoder:
MLP: $V^\pi(z_t, \log \pi_t(a \mid M_{t-1}, z_{\le t}))$ is regressed on $G_t$
MLP: $A(z_t, a_t)$ is regressed on $G_t - V^\pi(\cdot, \cdot)$
Memory – simplified DNC model
Memory is a tensor $M_t$ with dimensions $(N_{mem}, 2|z|)$.
Each step we write the vector $[z_t,\ (1-\gamma)\sum_{t' > t} \gamma^{t'-t} z_{t'}]$ to $M_{t-1}$.
Denote $m_t = [m^1_t, \ldots, m^K_t]$ for the readout of $K$ read heads.

Reading from memory:
MBP LSTM: $[z_t, a_t, m_{t-1}] \to h^1_t$
Policy LSTM: $[z_t] \to h^2_t$
$\mathrm{Linear}([h^1_t, h^2_t]) \to i_t = [k^1_t, \ldots, k^K_t, \beta^1_t, \ldots, \beta^K_t]$
$c^{ij}_t = \mathrm{cosine}(k^i_t,\ M_{t-1}[j, \cdot])$
$w^i_t = \mathrm{Softmax}(\beta^i_t c^{i1}_t, \ldots, \beta^i_t c^{i N_{mem}}_t)$
readout memory: $m^i_t = M^\top_{t-1} w^i_t$
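A NumPy sketch of this content-based read for a single head (the small epsilon for numerical stability is an added assumption).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read_memory(M, key, beta):
    """Content-based read: cosine-similarity addressing over memory rows.

    M: (N_mem, 2*|z|) memory; key: (2*|z|,) read key k^i_t; beta: sharpness beta^i_t.
    """
    sims = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    weights = softmax(beta * sims)   # w^i_t over memory rows
    return M.T @ weights             # readout m^i_t = M^T w^i_t
```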
Writing to memory is performed after reading:
$v^{wr}_t[i] = \delta_{it}$
$v^{ret}_t = \gamma\, v^{ret}_{t-1} + (1-\gamma)\, v^{wr}_{t-1}$
$M_t = M_{t-1} + v^{wr}_t [z_t, 0]^\top + v^{ret}_t [0, z_t]^\top$

When $t > N_{mem}$, select the cell with the lowest utility
$u_{t+1}[k] = u_t[k] + \sum_i w^i_{t+1}[k]$
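A NumPy sketch transcribing the three write equations above (the retroactive and previous write weights are assumed to be carried between calls; the lowest-utility overwrite rule for $t > N_{mem}$ is omitted for brevity).

```python
import numpy as np

def write_memory(M, z, t, v_ret, v_wr_prev, gamma=0.9):
    """Write z_t into row t and retroactively update earlier rows (illustrative).

    M: (N_mem, 2*|z|); z: (|z|,); t: row index to write;
    v_ret, v_wr_prev: (N_mem,) retroactive and previous write weights.
    """
    n_mem, two_z = M.shape
    z_dim = two_z // 2
    v_wr = np.zeros(n_mem)
    v_wr[t] = 1.0                                    # v^wr_t[i] = delta_{it}
    v_ret = gamma * v_ret + (1 - gamma) * v_wr_prev  # retroactive weight update
    M = (M + np.outer(v_wr, np.concatenate([z, np.zeros(z_dim)]))
           + np.outer(v_ret, np.concatenate([np.zeros(z_dim), z])))
    return M, v_ret, v_wr
```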
Latent space
Prior, MLP:
$[h_{t-1}, m_{t-1}] \to \mu^{prior}_t, \log\sigma^{prior}_t$

Concatenate all information from this timestep:
$n_t = [e_t, h_{t-1}, m_{t-1}, \mu^{prior}_t, \log\sigma^{prior}_t]$

Posterior:
$[\mu^{post}_t, \log\sigma^{post}_t] = \mathrm{MLP}(n_t) + [\mu^{prior}_t, \log\sigma^{prior}_t]$

$z_t$ is a sample from the factorized Gaussian posterior
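A PyTorch sketch of the prior head and the residual posterior head (layer sizes and activations are illustrative assumptions).

```python
import torch
import torch.nn as nn

class LatentModel(nn.Module):
    """Prior from [h_{t-1}, m_{t-1}]; posterior is a residual correction using e_t."""
    def __init__(self, e_dim=100, h_dim=256, m_dim=200, z_dim=100, hidden=200):
        super().__init__()
        ctx_dim = h_dim + m_dim
        self.prior = nn.Sequential(nn.Linear(ctx_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 2 * z_dim))
        self.post = nn.Sequential(nn.Linear(e_dim + ctx_dim + 2 * z_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 2 * z_dim))

    def forward(self, e_t, h_prev, m_prev):
        ctx = torch.cat([h_prev, m_prev], dim=-1)
        prior = self.prior(ctx)                     # [mu_prior, log_sigma_prior]
        n_t = torch.cat([e_t, ctx, prior], dim=-1)  # all information from this timestep
        post = self.post(n_t) + prior               # residual parameterization of posterior
        mu, log_sigma = post.chunk(2, dim=-1)
        z_t = mu + log_sigma.exp() * torch.randn_like(mu)  # reparameterized sample
        return z_t, prior, post
```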
Loss function
Variational lower bound:
$\log p(o_{\le t}, r_{\le t}) \,\ge\, \sum_{\tau=0}^{t} \mathbb{E}_{q(z_{<\tau} \mid o_{<\tau})} [\, \mathrm{DataTerm} - \mathrm{KL} \,]$
$\mathrm{DataTerm} = \mathbb{E}_{q(z_\tau \mid z_{<\tau}, o_{\le\tau})} [\, \log p(o_\tau, r_\tau \mid z_\tau) \,]$
$\mathrm{KL} = D_{KL}\big( q(z_\tau \mid z_{<\tau}, o_{\le\tau}) \,\|\, p(z_\tau \mid z_{<\tau}, a_{\le\tau}) \big)$

where $\log p(o_\tau, r_\tau \mid z_\tau)$ is a linear combination of the 6 decoding losses
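A sketch of the resulting per-timestep training loss: the sum of the (already weighted) decoding losses plus the KL between the diagonal-Gaussian posterior and prior. This is an illustrative simplification, not MERLIN's exact loss code.

```python
import torch

def neg_vlb(decode_losses, mu_post, logsig_post, mu_prior, logsig_prior):
    """Negative variational lower bound for one timestep (illustrative).

    decode_losses: iterable of per-component reconstruction losses;
    the KL term is between two factorized Gaussians, KL(posterior || prior).
    """
    data_term = sum(decode_losses)
    kl = (logsig_prior - logsig_post
          + (torch.exp(2 * logsig_post) + (mu_post - mu_prior) ** 2)
          / (2 * torch.exp(2 * logsig_prior))
          - 0.5).sum(-1).mean()
    return data_term + kl
```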
Experiments
The most interesting experiments:
1. Goal-oriented navigation in a maze (3-7 rooms) (video)
2. Arbitrary Visuomotor Mapping (video)
3. T-maze (video)
Thank you!
References
Astrom, Karl J (1965). “Optimal control of Markov processes with incomplete state information”. In: Journal of Mathematical Analysis and Applications 10.1, pp. 174–205.
Bai, Haoyu et al. (2012). “Unmanned aircraft collision avoidance using continuous-state POMDPs”. In: Robotics: Science and Systems VII 1, pp. 1–8.
Gluck, Mark A et al. (1993). “Hippocampal mediation of stimulus representation: A computational theory”. In: Hippocampus 3.4, pp. 491–516.
Hausknecht, Matthew et al. (2015). “Deep recurrent Q-learning for partially observable MDPs”. In: arXiv preprint arXiv:1507.06527.
Heess, Nicolas et al. (2015). “Memory-based control with recurrent neural networks”. In: arXiv preprint arXiv:1512.04455.
Hefny, Ahmed et al. (2018). “Recurrent Predictive State Policy Networks”. In: arXiv preprint arXiv:1803.01489.
Hoey, Jesse et al. (2010). “Automated handwashing assistance for persons with dementia using video and a partially observable Markov decision process”. In: Computer Vision and Image Understanding 114.5, pp. 503–519.
Hsiao, Kaijen et al. (2007). “Grasping POMDPs”. In: Robotics and Automation, 2007 IEEE International Conference on. IEEE, pp. 4685–4692.
Irissappane, Athirai Aravazhi et al. (2016). “A Scalable Framework to Choose Sellers in E-Marketplaces Using POMDPs”. In: AAAI, pp. 158–164.
Karl, Maximilian et al. (2016). “Deep variational Bayes filters: Unsupervised learning of state space models from raw data”. In: arXiv preprint arXiv:1605.06432.
Memarzadeh, Milad et al. (2014). “Optimal planning and learning in uncertain environments for the management of wind farms”. In: Journal of Computing in Civil Engineering 29.5, p. 04014076.
Pajarinen, Joni et al. (2013). “Planning under uncertainty for large-scale problems with applications to wireless networking”. In:
Parisotto, Emilio et al. (2017). “Neural Map: Structured Memory for Deep Reinforcement Learning”. In: arXiv preprint arXiv:1702.08360.
Pineau, Joelle et al. (2003). “Towards robotic assistants in nursing homes: Challenges and results”. In: Robotics and Autonomous Systems 42.3, pp. 271–281.
Shani, Guy et al. (2009). “Improving existing fault recovery policies”. In: Advances in Neural Information Processing Systems, pp. 1642–1650.
Singh, Satinder P et al. (1994). “Learning without state-estimation in partially observable Markovian decision processes”. In: Machine Learning Proceedings 1994. Elsevier, pp. 284–292.
Spaan, Matthijs TJ et al. (2005). “Perseus: Randomized point-based value iteration for POMDPs”. In: Journal of Artificial Intelligence Research 24, pp. 195–220.
Striebel, Charlotte (1965). “Sufficient statistics in the optimum control of stochastic systems”. In: Journal of Mathematical Analysis and Applications 12.3, pp. 576–592.
Wayne, Greg et al. (2018). “Unsupervised Predictive Memory in a Goal-Directed Agent”. In: arXiv preprint arXiv:1803.10760.
Young, Steve et al. (2013). “POMDP-based statistical spoken dialog systems: A review”. In: Proceedings of the IEEE 101.5, pp. 1160–1179.
Zhu, Pengfei et al. (2017). “On Improving Deep Reinforcement Learning for POMDPs”. In: arXiv preprint arXiv:1704.07978.