Moscow 2016
Reinforcement Learning:
Beyond Markov Decision Processes
Alexey O. Seleznev
PhD in Computational Chemistry
5vision team
Deep Learning Moscow
Seminar # 9
OUTLINE
Introduction
Markov Decision Processes and their Limitations
Main Point of the Presentation
Partially Observable Markov Decision Processes
Bayesian Reinforcement Learning
Multi-agent Systems
References
Introduction
Reinforcement Learning (RL):
• Agent interacts with a dynamic, stochastic, and incompletely known
environment with the goal of finding a strategy (policy) that optimizes some
long-term performance measure
• Unlike supervised machine learning (ML), RL focuses on strategies, not on
forecasts
• Examples of tasks:
Markov Decision Processes and their Limitations
To solve RL tasks, we have to formalize the approach
It turns out that the most convenient way to do so is to use a Markov Decision
Process (MDP), which consists of:
• A set of available states: $S = \{s_1, s_2, \ldots, s_{|S|}\}$
• A set of available actions: $A = \{a_1, a_2, \ldots, a_{|A|}\}$
• A reward function: $R: S \times A \to \mathbb{R}$
• A transition function: $T^a_{ij} = P(S_{t+1} = j \mid S_t = i, a_t = a)$
• A discount factor: $\gamma \in [0, 1]$
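A minimal sketch of how these components can be encoded, assuming a toy MDP with made-up numbers (everything below is illustrative, not taken from the slides):

import numpy as np

# Toy MDP with |S| = 3 states and |A| = 2 actions (hypothetical example).
n_states, n_actions = 3, 2

# Reward function R: S x A -> R, stored as a |S| x |A| array.
R = np.zeros((n_states, n_actions))
R[2, :] = 1.0                       # reaching state 2 pays off regardless of action

# Transition function T[a, i, j] = P(S_{t+1} = j | S_t = i, a_t = a);
# every row must sum to 1.
T = np.zeros((n_actions, n_states, n_states))
T[0] = [[0.9, 0.1, 0.0],
        [0.0, 0.9, 0.1],
        [0.0, 0.0, 1.0]]
T[1] = [[0.5, 0.5, 0.0],
        [0.0, 0.5, 0.5],
        [0.0, 0.0, 1.0]]

gamma = 0.95                        # discount factor in [0, 1]
assert np.allclose(T.sum(axis=2), 1.0)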
Markov Decision Processes and their Limitations
Grid world is a good example of a task that can be formulated within the MDP
framework:
The agent’s goal is to find a policy $\pi(s)$ that results in the highest cumulative reward within a fixed number of steps
From the grid-world scheme:
• each cell number is a state
• actions: up, down, left, right
• transition matrix: 0 if the destination is a wall, 1 if not
• reward: as shown on the scheme
How to find such a policy?
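One standard answer is dynamic programming (value iteration). Below is a hedged sketch on an assumed 4x4 grid with a small step cost and a single goal cell; the layout and rewards are illustrative, not the exact scheme from the slide.

import numpy as np

# Hypothetical 4x4 grid: each cell index is a state; moving into a wall
# (outside the grid) keeps the agent in place, mirroring the "0 if the
# destination is a wall" rule above.
rows, cols = 4, 4
n_states = rows * cols
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
gamma = 0.9

reward = np.full(n_states, -0.04)               # assumed small step cost
reward[n_states - 1] = 1.0                      # assumed goal in the last cell

def step(s, a):
    r, c = divmod(s, cols)
    nr, nc = r + a[0], c + a[1]
    if 0 <= nr < rows and 0 <= nc < cols:
        return nr * cols + nc                   # legal move
    return s                                    # bump into a wall: stay put

# Value iteration: repeatedly apply the Bellman optimality backup.
V = np.zeros(n_states)
for _ in range(200):
    V = np.array([max(reward[s] + gamma * V[step(s, a)] for a in actions)
                  for s in range(n_states)])

# Greedy policy extraction: pick the action with the best one-step lookahead.
policy = {s: max(actions, key=lambda a: reward[s] + gamma * V[step(s, a)])
          for s in range(n_states)}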
Markov Decision Processes and their Limitations
Classical MDP methods:
• Model-based
  – Conventional model-based (Dynamic Programming)
  – Bayesian RL
  – PAC-MDP (E3, Rmax)
• Model-free
  – Actor-critic
  – Policy-based (pure actor): REINFORCE, finite-difference methods
  – Value-based (pure critic): Monte-Carlo, Temporal-Difference (SARSA, Q-Learning)
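To make the model-free, value-based branch concrete, here is a minimal tabular Q-learning loop; the env.reset()/env.step() interface is an assumption in the spirit of common RL environment APIs, not a specific library.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning; env is assumed to expose reset() -> s and
    step(a) -> (s_next, r, done)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            a = np.random.randint(n_actions) if np.random.rand() < epsilon \
                else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy TD target uses the greedy value of the next state
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next
    return Q

SARSA would differ only in the target: it uses the value of the action actually taken in the next state instead of the max.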
Markov Decision Processes and their Limitations
Limitations of the classical MDP framework:
The concept of state is the most unrealistic and stylized aspect of MDP
• What is a state? All information relevant to predicting subsequent
dynamics and rewards; this need not be a one-to-one mapping from what the
agent actually observes.
• The Markov property requires that states be organized in such a way that
history (previous states and actions) is not relevant for predicting
subsequent dynamics and rewards.
• So, one limitation of the MDP framework appears when the Markov property is
violated. Possible solution: augment states to “full states” by including (i)
relevant information from previous observations and/or (ii) a record of previous
actions. Example: the 4 stacked game screens in DQN (sketched below).
• Another limitation is that in some cases even the full history is not enough to
determine the underlying state. Examples: a frog in the mist, financial markets.
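A minimal sketch of the screen-stacking trick, assuming grayscale frames and a stack of k = 4; the class name and frame handling are illustrative, not DQN’s exact implementation.

from collections import deque
import numpy as np

class FrameStack:
    """Augment a non-Markov observation into an (approximate) 'full state'
    by concatenating the k most recent frames, as DQN does with k = 4."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        self.frames.clear()
        for _ in range(self.k):                 # pad the history with the first frame
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        self.frames.append(frame)
        return self.state()

    def state(self):
        return np.stack(self.frames, axis=0)    # shape (k, H, W)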
Markov Decision Processes and their Limitations
The majority of MDP methods face the exploration vs. exploitation dilemma
• Data used for learning in RL depend on the agent’s own behaviour
• Two goals: (i) exploration: learn as much as possible about the environment;
(ii) exploitation: obtain as much reward as possible
• What combination of the two objectives will result in the greatest long-term
reward?
• Existing methods use a variety of techniques to mitigate the dilemma,
among them (the first two are sketched below):
  – Epsilon-greedy strategy
  – Boltzmann sampling
  – Optimism in the face of uncertainty
  – Intrinsic motivation
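Minimal sketches of the first two techniques; the epsilon and temperature values are illustrative.

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore uniformly, otherwise exploit."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                        # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(q_values), p=probs))

As the temperature goes to zero, Boltzmann sampling approaches greedy action selection; large temperatures approach uniform exploration.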
Markov Decision Processes and their Limitations
How to operate in multi-agent environments?
Main Point of the Presentation
Solutions to the difficulties mentioned can be
formulated within the MDP framework, but with
specific choices of its components.
Partially Observable Markov Decision Processes
To get an idea of what a POMDP is, let us consider the tiger example:
N. Daw (2013)
Partially Observable Markov Decision Processes
A Partially Observable Markov Decision Process (POMDP) consists of:
• A set of available states: $S = \{s_1, s_2, \ldots, s_{|S|}\}$
• A set of available actions: $A = \{a_1, a_2, \ldots, a_{|A|}\}$
• A reward function: $R: S \times A \to \mathbb{R}$
• A set of observations: $\Omega = \{o_1, o_2, \ldots, o_{|\Omega|}\}$
• A transition function: $T^a_{ij} = P(S_{t+1} = j \mid S_t = i, a_t = a)$
• Conditional observation probabilities: $Z^a_{ij} = P(O_{t+1} = j \mid S_{t+1} = i, a_t = a)$
• A discount factor: $\gamma \in [0, 1]$
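Given T and Z, the belief over hidden states can be maintained with a Bayes filter. A minimal sketch, with array shapes assumed to match the definitions above:

import numpy as np

def belief_update(b, a, o, T, Z):
    """Bayes filter for a POMDP belief state.
    b: current belief, shape (|S|,)
    T[a, i, j] = P(S_{t+1} = j | S_t = i, a_t = a)
    Z[a, i, o] = P(O_{t+1} = o | S_{t+1} = i, a_t = a)
    Returns b'(j) proportional to Z[a, j, o] * sum_i T[a, i, j] * b(i)."""
    predicted = b @ T[a]                # predictive distribution over next states
    b_new = Z[a, :, o] * predicted      # weight by the observation likelihood
    return b_new / b_new.sum()          # normalize to a proper distribution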
Partially Observable Markov Decision Processes
How to solve a POMDP?
Model-based approach
1. Reformulate it as an MDP. For this purpose, use, e.g.:
• Belief-state MDPs
• Cross-product MDPs
2. Solve the resulting MDP by means of, e.g.:
• Policy Iteration
• Value Iteration
• Gradient methods
Model-free approach
• Incorporating memory (HMM, RNN, Finite State Controllers)
• Policy-gradient methods
D. Braziunas (2003)
Partially Observable Markov Decision Processes
There exists a direct connection between a POMDP and an MDP over belief states,
characterized by a quadruple (belief space, actions, belief transition function,
belief reward function):
D. Braziunas (2003)
Here b is a belief state: a probability distribution over hidden states.
Partially Observable Markov Decision Processes
Evolution of the belief state (example):
Partially Observable Markov Decision Processes
How to find an optimal policy for a POMDP?
Policy trees (for the finite-horizon case):
The optimal t-step value function can be found simply by enumerating all the
possible policy trees in the set $\Gamma_t$
D. Braziunas (2003)
Partially Observable Markov Decision Processes
The optimal t-step POMDP value function is piecewise linear and convex in b
Exact solution is nearly intractable due to its computational complexity; a set of
simplifications has been suggested.
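A hedged sketch of what piecewise linearity buys us: the value function can be stored as a finite set of alpha-vectors (one per policy tree), and V(b) is their upper envelope. The vectors below are illustrative numbers, not from the slides.

import numpy as np

# Each alpha-vector gives the expected value of following one policy tree
# from every hidden state; V(b) is the upper envelope over the set Gamma_t.
alpha_vectors = np.array([[ 1.0, -0.5],        # illustrative 2-state example
                          [-0.5,  1.0],
                          [ 0.2,  0.2]])

def pomdp_value(b, alphas):
    """V(b) = max over alpha in Gamma of alpha . b  (piecewise linear, convex in b)."""
    return float(np.max(alphas @ b))

def best_tree(b, alphas):
    """Index of the policy tree that is optimal at belief b."""
    return int(np.argmax(alphas @ b))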
Bayesian Reinforcement Learning
In Bayesian RL, we encode the unknown transition function $T(s_{t+1} \mid s_t, a_t)$ with
random variables $\theta^a_{ij}$ that parameterize a multinomial distribution.
The agent maintains a posterior belief b over all possible transition models {T} given
its previous experience and a prior (a Dirichlet distribution).
The task can be reformulated as either a POMDP or an MDP by redefining the state as
consisting of the observable part S and the unobservable parameters of the Dirichlet
distribution. The construction is called a superstate: $\tilde{S} = S \times \theta$
Due to the complexity of the belief state, Bayesian RL is typically intractable in
terms of both planning and updating the belief after an action. A recent approximate
solution to Bayesian RL is the Bayesian exploration bonus. Lopes et al. (2012)
Ross et al. (2011)
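A hedged sketch of the bookkeeping behind this idea: Dirichlet counts over transitions, a posterior-mean model, and a generic count-based exploration bonus. The 1/(n+1) bonus below is an illustration, not the exact formula of Lopes et al. (2012) or Ross et al. (2011).

import numpy as np

class DirichletModel:
    """Posterior over transition models T(. | s, a) as independent Dirichlets."""
    def __init__(self, n_states, n_actions, prior=1.0):
        # alpha[s, a, s'] are Dirichlet parameters; prior counts act as pseudo-data.
        self.prior_mass = n_states * prior
        self.alpha = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        """Observing one transition adds one count to the matching parameter."""
        self.alpha[s, a, s_next] += 1.0

    def mean_transition(self, s, a):
        """Posterior-mean estimate of T(. | s, a)."""
        return self.alpha[s, a] / self.alpha[s, a].sum()

    def exploration_bonus(self, s, a):
        """Generic count-based bonus that shrinks as (s, a) is visited more often."""
        visits = self.alpha[s, a].sum() - self.prior_mass
        return 1.0 / (visits + 1.0)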
Multi-agent Systems
Stochastic games extend MDPs to multiple agents. The main difference between the
standard MDP and an MDP for multiple players is that each agent independently
chooses actions and receives rewards, while the state transition matrix is defined
for the full joint action.
Mac Dermed et al. (2011)
Multi-agent Systems
How to solve stochastic games?
Replace V(s) in Bellman’s equation with an achievable set function
As a group of n agents follows a joint policy, each player receives rewards. The
discounted sum of these rewards is that player’s utility; the joint utility is the
vector of all players’ utilities (see the sketch below).
An achievable set contains all possible joint utilities that players can receive
using policies in equilibrium.
Mac Dermed et al. (2011)
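A minimal sketch of these ingredients for two players: per-player rewards, transitions indexed by the full joint action, and the joint-utility vector of a fixed joint policy. All sizes and numbers are illustrative assumptions, not from Mac Dermed et al.

import numpy as np

# Hypothetical 2-player stochastic game with |S| states and |A_i| actions per player.
n_states, n_actions, n_players = 2, 2, 2

# Each player has its own reward function R_i(s, a1, a2).
R = np.random.rand(n_players, n_states, n_actions, n_actions)

# The transition matrix is defined only for the full joint action (a1, a2):
# T[s, a1, a2, s'] = P(s' | s, a1, a2).
T = np.random.rand(n_states, n_actions, n_actions, n_states)
T /= T.sum(axis=-1, keepdims=True)              # normalize to proper distributions

def joint_utilities(policy, gamma=0.9, horizon=100, s0=0):
    """Expected discounted return of each player under a fixed deterministic
    joint policy: policy[s] = (a1, a2). Returns the joint-utility vector."""
    util = np.zeros(n_players)
    dist = np.zeros(n_states)
    dist[s0] = 1.0                              # distribution over the current state
    for t in range(horizon):
        new_dist = np.zeros(n_states)
        for s in range(n_states):
            a1, a2 = policy[s]
            util += (gamma ** t) * dist[s] * R[:, s, a1, a2]
            new_dist += dist[s] * T[s, a1, a2]
        dist = new_dist
    return util

# Example: evaluate one deterministic joint policy.
print(joint_utilities({0: (0, 1), 1: (1, 0)}))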
References
Nathaniel Daw. Advanced Reinforcement Learning. In Neuroeconomics, Chapter 16.
Elsevier Inc. (2013)
Darius Braziunas. POMDP Solution Methods. Tutorial, University of Toronto (2003)
Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-Yves Oudeyer. Exploration in
Model-based Reinforcement Learning by Empirically Estimating Learning Progress.
In NIPS Proceedings (2012)
Stephane Ross, Joelle Pineau, Brahim Chaib-draa, Pierre Kreitmann. A Bayesian
Approach for Learning and Planning in Partially Observable Markov Decision
Processes. Journal of Machine Learning Research 12 (2011) 1729-1770
Liam Mac Dermed, Charles L. Isbell, Lora Weiss. Markov Games of Incomplete
Information for Multi-Agent Reinforcement Learning. Interactive Decision Theory
and Game Theory: Papers from the 2011 AAAI Workshop
THANK YOU FOR YOUR ATTENTION!