Kaur Karus, Novin Shahroudi
Counterfactual Multi-Agent Policy Gradients
https://arxiv.org/pdf/1705.08926.pdf
Article by Foerster, J.N.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S.
• Reinforcement learning in a nutshell
• Deep RL for Multi-agent Systems
• Q-learning in a flashback
• COMA in a simple idea
• Results in an RTS
Introduction
Reinforcement learning in a nutshell
(Deep) Reinforcement Learning
Adapted from Wikipedia
(Deep) Reinforcement Learning
Markov State and Markov Property:
Adapted from David Silver's UCL Course on RL
(Deep) Reinforcement Learning
Markov Chain:
Adapted from Stanford CS234 Lecture 14
(Deep) Reinforcement Learning
Markov Decision Process (MDP):
Adapted from Stanford CS234 Lecture 14
(Deep) Reinforcement Learning
MDP
Adapted from Berkeley CS294 DRL
(Deep) Reinforcement Learning
Partially Observable MDP (PO-MDP)
Adapted from Berkeley CS294 DRL
(Deep) Reinforcement Learning
Agent:
• Policy (agent behavior is defined by the policy)
• Value function (how good state/action is)
• Model (agent’s representation of the environment)
(Deep) Reinforcement Learning
Policy (π)
• Maps states to actions
• Either stochastic π(a | s) or deterministic a = π(s)
Adapted from David Silver's UCL Course on RL
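The two policy types on this slide can be illustrated with a tiny sketch; the states ("s0", "s1") and actions ("left", "right") below are made-up toy values, not from the slides:

```python
import random

# A deterministic policy maps each state to one action: a = pi(s)
deterministic_pi = {"s0": "left", "s1": "right"}
# A stochastic policy gives action probabilities: pi(a | s)
stochastic_pi = {"s0": {"left": 0.8, "right": 0.2}}

def sample_action(state):
    """Draw an action from the stochastic policy pi(a | s)."""
    probs = stochastic_pi[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(deterministic_pi["s0"])   # always "left"
print(sample_action("s0"))      # "left" about 80% of the time
```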
(Deep) Reinforcement Learning
Goal
Adapted from Stanford CS234 Lecture 14
(Deep) Reinforcement Learning
Deep PO-MDP
Adapted from Berkeley CS294 DRL
(Deep) Reinforcement Learning
Value & Q-value function
Adapted from Stanford CS234 Lecture 14
(Deep) Reinforcement Learning
Bellman equation
Adapted from Stanford CS234 Lecture 14
(Deep) Reinforcement Learning
Example (Bellman expectation equation):
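The worked example on this slide was a figure; a comparable toy version of the Bellman expectation equation is iterative policy evaluation on a small MDP. The two-state chain below is an assumption of mine, not the slide's example:

```python
# Iterative policy evaluation with the Bellman expectation equation:
#   V(s) = sum_a pi(a|s) * [ R(s,a) + gamma * V(s') ]
# on a made-up two-state MDP with deterministic transitions.
gamma = 0.9
states = ["A", "B"]
pi = {s: {"stay": 0.5, "go": 0.5} for s in states}      # uniform policy
P = {("A", "stay"): "A", ("A", "go"): "B",              # next state
     ("B", "stay"): "B", ("B", "go"): "A"}
R = {("A", "stay"): 0.0, ("A", "go"): 1.0,              # reward
     ("B", "stay"): 0.5, ("B", "go"): 0.0}

V = {s: 0.0 for s in states}
for _ in range(200):                                    # contract to the fixed point
    V = {s: sum(pi[s][a] * (R[s, a] + gamma * V[P[s, a]]) for a in pi[s])
         for s in states}
print(V)   # converges to V(A) = 3.875, V(B) = 3.625
```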
(Deep) Reinforcement Learning
Q-learning (value iteration algorithm)
Adapted from Stanford CS234 Lecture 14
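As a concrete sketch of the value-iteration-style Q-learning update, here is tabular Q-learning with epsilon-greedy exploration on a four-state corridor (the environment and all constants are made-up toy values, not from the slides):

```python
import random

# Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
random.seed(0)
n_states, actions = 4, [-1, 1]          # actions: move left / move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, eps = 0.5, 0.9, 0.2

for _ in range(500):                    # episodes, each starting at the left end
    s = 0
    while s != n_states - 1:            # rightmost state is terminal, reward 1
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[s, act])
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        best_next = 0.0 if s2 == n_states - 1 else max(Q[s2, b] for b in actions)
        Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
        s = s2

# The learned greedy policy moves right in every non-terminal state
print([max(actions, key=lambda act: Q[s, act]) for s in range(n_states - 1)])
```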
(Deep) Reinforcement Learning
Experience replay
• Action-state feedback loop
• Bias toward correlated consecutive samples
• Random minibatches from a transition table:
Adapted from Stanford CS234 Lecture 14
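The transition table on this slide can be sketched as a minimal replay buffer; the class and method names below are my own, not from any particular library:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s', done) transitions and sample random minibatches
    to break the correlation between consecutive samples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(10):                            # fake consecutive transitions
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(4)                          # random minibatch of 4
print(len(batch))
```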
(Deep) Reinforcement Learning
Policy Gradient
• The Q-function can be complex and hard to learn
• Learning the policy directly can be simpler
Goal:
Adapted from Stanford CS234 Lecture 14
(Deep) Reinforcement Learning
REINFORCE Algorithm
Trajectories:
Adapted from Stanford CS234 Lecture 14, Berkeley CS294 DRL
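A minimal runnable sketch of the REINFORCE estimator, grad J ≈ grad log pi(a|s) * R, on a two-armed bandit where each "trajectory" is a single action; the reward values are made up:

```python
import math
import random

random.seed(1)
theta = [0.0, 0.0]               # softmax logits of the policy
true_reward = [0.2, 1.0]         # arm 1 pays more (toy numbers)
lr = 0.1

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

for _ in range(2000):
    p = softmax(theta)
    a = random.choices([0, 1], weights=p)[0]   # sample a one-step trajectory
    r = true_reward[a]
    # REINFORCE update: theta += lr * grad log pi(a) * R,
    # where grad log pi(a) = one_hot(a) - p for a softmax policy
    for i in range(2):
        theta[i] += lr * ((1.0 if i == a else 0.0) - p[i]) * r

print(softmax(theta))            # most probability mass ends up on arm 1
```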
(Deep) Reinforcement Learning
Addressing the high-variance problem
1. Push up action probabilities using only future rewards (t' ≥ t)
2. Add a discount factor for rewards at t' > t
3. Subtract a baseline:
Adapted from Stanford CS234 Lecture 14
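Point 3 can be demonstrated numerically: subtracting a constant baseline b from the return leaves the expected gradient unchanged but can shrink its variance substantially. The policy, returns, and baseline below are toy assumptions:

```python
import random
import statistics

random.seed(0)
p = [0.5, 0.5]                   # fixed softmax policy over two actions
R = [10.0, 12.0]                 # toy returns

def grad_samples(b, n=20000):
    """Samples of the theta_1 component of grad log pi(a) * (R - b)."""
    out = []
    for _ in range(n):
        a = random.choices([0, 1], weights=p)[0]
        score = (1.0 if a == 1 else 0.0) - p[1]   # d/dtheta_1 log pi(a)
        out.append(score * (R[a] - b))
    return out

no_baseline = grad_samples(b=0.0)
with_baseline = grad_samples(b=11.0)   # baseline near the mean return
# Means agree (~0.5); the variance drops dramatically with the baseline
print(statistics.mean(no_baseline), statistics.mean(with_baseline))
print(statistics.variance(no_baseline), statistics.variance(with_baseline))
```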
(Deep) Reinforcement Learning
Better baseline – a foundation for Actor-Critic
Adapted from Stanford CS234 Lecture 14
(Deep) Reinforcement Learning
Actor-Critic Algorithm
Adapted from Stanford CS234 Lecture 14
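A minimal one-step actor-critic on the same kind of toy bandit: the critic learns a value estimate V (the learned baseline), and the actor is updated with the TD error delta = r - V as its advantage estimate. All numbers are assumptions:

```python
import math
import random

random.seed(2)
theta = [0.0, 0.0]       # actor: softmax logits
V = 0.0                  # critic: value of the single state
rewards = [0.0, 1.0]     # toy rewards; arm 1 is better
lr_actor, lr_critic = 0.1, 0.1

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

for _ in range(3000):
    p = softmax(theta)
    a = random.choices([0, 1], weights=p)[0]
    r = rewards[a]
    delta = r - V                          # TD error (episodes of length 1)
    V += lr_critic * delta                 # critic update
    for i in range(2):                     # actor: grad log pi(a) * delta
        theta[i] += lr_actor * ((1.0 if i == a else 0.0) - p[i]) * delta

print(softmax(theta))                      # actor prefers the rewarding arm
```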
(Deep) Reinforcement Learning > recap
Policy vs. Value function
Adapted from David Silver's UCL Course on RL
(Deep) Reinforcement Learning
Agent Types
Adapted from David Silver's UCL Course on RL
Value Based
• No policy (implicit)
• Value function
Policy Based
• Policy
• No value function
Actor-Critic
• Policy
• Value function
(Deep) Reinforcement Learning > Recap
Anatomy of RL algorithm
Adapted from Berkeley CS294 DRL
(Deep) Reinforcement Learning > Recap
Different RL methods
Adapted from Berkeley CS294 DRL
(Deep) Reinforcement Learning > Recap
Reviewed concepts:
• MDP
• PO-MDP
• State, Reward, Action
• Discounted reward
• Policy
• Value function
• Optimal Policy
• Q-value function
• Q-Learning
• Experience Replay
• High variance issue
• Credit assignment
• Baselines
• Bellman equation
• Policy gradient
• Trajectory
• Actor-critic
• Advantage function
Deep RL for Multi-agent Systems
Deep RL for Multi-agent Systems
Variety of Behaviors
• Cooperative
• Competitive
• Mixed
Deep RL for Multi-agent Systems
Multi-agent MDP
• Global state:
• Individual action:
• State transition:
• Shared team reward:
Adapted from Shimon Whiteson's slides
Deep RL for Multi-agent Systems
Decentralized PO-MDP
• Observation function:
• Action-observation history:
• Decentralized policies:
• Decentralization types: natural, artificial
• From planning (centralized) to execution (decentralized)
Adapted from Shimon Whiteson's slides
Deep RL for Multi-agent Systems
MAS Challenges:
• Curse of dimensionality (state and action space)
• Multi-agent credit assignment problem
• Modeling other agents' state information
Adapted from Shimon Whiteson's slides
Deep RL for Multi-agent Systems
Independent Actor-Critic (IAC)
• First step towards RL for MAS
• Learn independently
• Each agent: its own actor and critic
• IAC-V & IAC-Q as baselines
• Treating other agents as part of environment
Adapted from Shimon Whiteson's slides
Q-learning
• Monte Carlo policy
• Temporal Difference
• Q-function
• Exploration?
COMA
• All-knowing single critic
• Counterfactual baseline
Counterfactual Multi-Agent Policy Gradients
All-knowing single critic
Counterfactual baseline
Gradient
Critic construction
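COMA's counterfactual baseline marginalises a single agent's action out of the centralised Q-function while keeping the other agents' joint action fixed. A tiny numeric sketch of that computation; the Q-values and policy below are made-up toy numbers (in the paper the centralised critic is a learned network):

```python
# Counterfactual advantage for agent a:
#   A^a(s, u) = Q(s, u) - sum_{u'_a} pi^a(u'_a | tau_a) * Q(s, (u^{-a}, u'_a))
def counterfactual_advantage(Q_given_others, pi_a, u_a):
    """Q_given_others: agent a's action -> centralized Q-value, with the
    other agents' joint action u^{-a} held fixed.
    pi_a: agent a's policy, action -> probability.
    u_a: the action agent a actually took."""
    baseline = sum(pi_a[u] * Q_given_others[u] for u in pi_a)
    return Q_given_others[u_a] - baseline

Q_given_others = {"left": 1.0, "right": 3.0}   # toy centralized Q-values
pi_a = {"left": 0.25, "right": 0.75}
adv = counterfactual_advantage(Q_given_others, pi_a, "right")
print(adv)   # 3.0 - (0.25*1.0 + 0.75*3.0) = 0.5
```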
Experimental Results
Experimental results
Testbed: StarCraft
StarCraft Environment
Experimental results
Experimental results
• 1000 evaluation episodes
• Highest mean in bold
• In parentheses: 95% confidence interval
A short video of the work described in the article (for whoever checks the slides afterwards): https://www.youtube.com/watch?v=3OVvjE5B9LU
Other References:
• https://youtu.be/lvoHnicueoE
• http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
• http://rll.berkeley.edu/deeprlcourse/
• https://github.com/deepmind/pysc2
• https://www.youtube.com/watch?v=URWXG5jRB-A
Extra stuff for cool people