Kaur Karus, Novin Shahroudi
Counterfactual Multi-Agent Policy Gradients
https://arxiv.org/pdf/1705.08926.pdf
Article by Foerster, J.N.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S.


Page 1:

Kaur Karus, Novin Shahroudi

Counterfactual Multi-Agent Policy Gradients

https://arxiv.org/pdf/1705.08926.pdf

Article by Foerster, J.N.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S.

Page 2:

Introduction

• Reinforcement learning in a nutshell

• Deep RL for Multi-agent Systems

• Q-learning in a flashback

• COMA in a simple idea

• Results in an RTS

Page 3:

Reinforcement learning in a nutshell

Page 4:

(Deep) Reinforcement Learning

Adopted from Wikipedia

Page 5:

(Deep) Reinforcement Learning

Markov state and the Markov property:
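
The property itself was shown as an equation image on the slide; a standard statement of it (reconstructed here, not copied from the original image) is:

```latex
% A state S_t is Markov iff the future is independent of the past given the present:
\mathbb{P}\,[\,S_{t+1} \mid S_t\,] \;=\; \mathbb{P}\,[\,S_{t+1} \mid S_1, \ldots, S_t\,]
```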

Adopted from David Silver's UCL Course on RL

Page 6:

(Deep) Reinforcement Learning

Markov Chain:
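
The definition on the slide was an image; the usual formulation is a tuple of states plus a transition model:

```latex
% Markov chain (Markov process): a tuple (S, P)
% S : finite set of states
% P : state-transition probability matrix
\mathcal{P}_{ss'} \;=\; \mathbb{P}\,[\,S_{t+1} = s' \mid S_t = s\,]
```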

Adopted from Stanford CS234 Lecture 14

Page 7:

(Deep) Reinforcement Learning

Markov Decision Process (MDP):
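
The slide's definition was an image; a standard MDP tuple, consistent with the discounted setting used in the rest of the deck, is:

```latex
% Markov Decision Process: a tuple (S, A, P, R, \gamma)
% S : set of states                  A : set of actions
% P(s' \mid s, a) : transition model
% R(s, a) : reward function          \gamma \in [0, 1] : discount factor
\mathcal{M} \;=\; (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)
```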

Adopted from Stanford CS234 Lecture 14

Page 8:

(Deep) Reinforcement Learning

MDP

Adopted from Berkeley CS294 DRL

Page 9:

(Deep) Reinforcement Learning

Partially Observable - MDP
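
This slide only carried a diagram; as a hedged sketch (emission-probability notation follows the Berkeley course), a PO-MDP adds observations and an observation model on top of the MDP, and the agent acts on observations rather than the hidden state:

```latex
% Partially observable MDP: the agent sees o_t, not s_t
% O               : set of observations
% p(o_t \mid s_t) : observation (emission) probability
\pi_\theta(a_t \mid o_t) \quad \text{instead of} \quad \pi_\theta(a_t \mid s_t)
```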

Adopted from Berkeley CS294 DRL

Page 10:

(Deep) Reinforcement Learning

Agent:

• Policy (agent behavior is defined by the policy)

• Value function (how good state/action is)

• Model (agent’s representation of the environment)

Page 11:

(Deep) Reinforcement Learning

Policy (π)

• Maps states to actions

• Either stochastic π(a | s) or deterministic a = π(s)

Adopted from David Silver's UCL Course on RL

Page 12:

(Deep) Reinforcement Learning

Goal
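
The objective shown on the slide was an image; the standard RL goal, matching the discounted-reward setting above, is to find the policy parameters that maximise expected return:

```latex
% Find the policy parameters that maximise expected discounted return
\theta^{*} \;=\; \arg\max_{\theta}\;
\mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t)\right]
```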

Adopted from Stanford CS234 Lecture 14

Page 13:

(Deep) Reinforcement Learning

Deep PO-MDP

Adopted from Berkeley CS294 DRL

Page 14:

(Deep) Reinforcement Learning

Value & Q-value function
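
The definitions were shown as images; the standard forms are:

```latex
% State-value: expected return when starting in s and following \pi
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s\right]

% Action-value (Q): expected return when starting in s, taking a, then following \pi
Q^{\pi}(s, a) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s,\; a_0 = a\right]
```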

Adopted from Stanford CS234 Lecture 14

Page 15:

(Deep) Reinforcement Learning

Bellman equation
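
The equation itself was an image; the Bellman expectation equation for Q (and the optimality form that Q-learning targets later) can be written as:

```latex
% Bellman expectation equation for Q^\pi
Q^{\pi}(s, a) \;=\; \mathbb{E}_{s'}\!\left[\, r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}\!\left[ Q^{\pi}(s', a') \right] \right]

% Bellman optimality equation for Q^*
Q^{*}(s, a) \;=\; \mathbb{E}_{s'}\!\left[\, r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \right]
```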

Adopted from Stanford CS234 Lecture 14

Page 16:

(Deep) Reinforcement Learning

Example (Bellman expectation equation):

Page 17:

(Deep) Reinforcement Learning

Q-learning (value iteration algorithm)
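
The update rule was shown as an image; the standard tabular Q-learning update is:

```latex
% One-step Q-learning update with learning rate \alpha
Q(s, a) \;\leftarrow\; Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]
```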

Adopted from Stanford CS234 Lecture 14

Page 18:

(Deep) Reinforcement Learning

Experience replay

• Actions-States feedback loop

• Bias to consecutive samples

• Random minibatches from a transition table:
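
As a minimal sketch of the transition table the last bullet refers to (class name and capacity are illustrative, not from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random minibatches break the bias towards consecutive,
        # highly correlated samples from the actions-states feedback loop.
        return random.sample(self.transitions, batch_size)
```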

Adopted from Stanford CS234 Lecture 14

Page 19:

(Deep) Reinforcement Learning

Policy Gradient

• The Q-function can be very complex to learn

• Learning the policy directly can be much simpler

Goal:
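
The goal and gradient were shown as images; in the standard form, we directly optimise the expected return of a parameterised policy and use the likelihood-ratio (score-function) trick to get a gradient we can sample:

```latex
% Objective: expected (discounted) return of the parameterised policy \pi_\theta
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[ R(\tau) \right],
\qquad R(\tau) \;=\; \sum_{t} \gamma^{t} r_t

% Likelihood-ratio (score-function) policy gradient
\nabla_{\theta} J(\theta) \;=\;
\mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[ R(\tau) \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right]
```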

Adopted from Stanford CS234 Lecture 14

Page 20:

(Deep) Reinforcement Learning

Reinforce Algorithm

Trajectories:
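
The trajectory notation was an image; a trajectory and the resulting Monte Carlo (REINFORCE) gradient estimate over N sampled rollouts can be written as:

```latex
% A trajectory sampled by running the policy in the environment
\tau \;=\; (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)

% REINFORCE: Monte Carlo estimate of the policy gradient from N rollouts
\nabla_{\theta} J(\theta) \;\approx\;
\frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t})\; R(\tau_i)
```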

Adopted from Stanford CS234 Lecture 14, Berkeley CS294 DRL

Page 21:

(Deep) Reinforcement Learning

Addressing high variance problem

1. Push up the probability of an action only by the cumulative reward that follows it (rewards from t' ≥ t, the "reward-to-go")

2. Add a discount factor so that rewards far in the future count less

3. Use a baseline:
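
Putting the three fixes together (reward-to-go, discounting, and a baseline b), the lower-variance estimator looks like:

```latex
% Reward-to-go, discounted, with a baseline subtracted
\nabla_{\theta} J(\theta) \;\approx\;
\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
\left(\sum_{t' \ge t} \gamma^{\,t' - t}\, r_{t'} \;-\; b(s_t)\right)
```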

Adopted from Stanford CS234 Lecture 14

Page 22:

(Deep) Reinforcement Learning

Better baseline – a foundation for Actor-Critic
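
The "better baseline" on the slide is the state-value function: subtracting V from Q gives the advantage, which is exactly what an actor-critic estimates:

```latex
% Choose the baseline b(s) = V^\pi(s); the weight on the gradient becomes the advantage
A^{\pi}(s, a) \;=\; Q^{\pi}(s, a) - V^{\pi}(s)
```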

Adopted from Stanford CS234 Lecture 14

Page 23:

(Deep) Reinforcement Learning

Actor-critic Algorithm
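
A minimal PyTorch-style sketch of one actor-critic update, assuming `actor(state)` returns log-probabilities over discrete actions and `critic(state)` returns a scalar value estimate (names and shapes are illustrative, not taken from the paper or the course):

```python
import torch
import torch.nn.functional as F

def actor_critic_update(actor, critic, optimizer, transition, gamma=0.99):
    """One advantage actor-critic step on a single (s, a, r, s', done) transition."""
    state, action, reward, next_state, done = transition

    value = critic(state)                          # V(s), differentiable
    with torch.no_grad():                          # bootstrap target r + gamma * V(s')
        target = reward + gamma * critic(next_state) * (1.0 - done)
    advantage = (target - value).detach()          # advantage estimate, no grad into the actor loss

    log_probs = actor(state)                       # log pi(.|s)
    actor_loss = -log_probs[action] * advantage    # policy-gradient term
    critic_loss = F.mse_loss(value, target)        # fit the critic to the TD target

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```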

Adopted from Stanford CS234 Lecture 14

Page 24:

(Deep) Reinforcement Learning > Recap

Policy / Value function

Adopted from David Silver's UCL Course on RL

Page 25:

(Deep) Reinforcement Learning

Agent Types

Adopted from David Silver's UCL Course on RL

Value Based

• No policy (implicit)

• Value function

Policy Based

• Policy

• No value function

Actor-Critic

• Policy

• Value function

Page 26:

(Deep) Reinforcement Learning > Recap

Anatomy of RL algorithm

Adopted from Berkeley CS294 DRL

Page 27:

(Deep) Reinforcement Learning > Recap

Different RL methods

Adopted from Berkeley CS294 DRL

Page 28:

(Deep) Reinforcement Learning > Recap

Reviewed concepts:

• MDP

• PO-MDP

• State, Reward, Action

• Discounted reward

• Policy

• Value function

• Optimal Policy

• Q-value function

• Q-Learning

• Experience Replay

• High variance issue

• Credit assignment

• Baselines

• Bellman equation

• Policy gradient

• Trajectory

• Actor-critic

• Advantage function

Page 29:

Deep RL for Multi-agent Systems

Page 30:

Deep RL for Multi-agent Systems

Variety of Behaviors

• Cooperative

• Competitive

• Mixed

Page 31:

Deep RL for Multi-agent Systems

Multi-agent MDP

• Global state:

• Individual action:

• State transition:

• Shared team reward:
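
The notation on this slide was in the images; the multi-agent MDP used in the COMA paper can be summarised as follows (symbols follow the paper):

```latex
% Multi-agent MDP with n cooperating agents
% s \in S                   : global state
% u^a \in U                 : action of agent a;  \mathbf{u} \in U^n is the joint action
% P(s' \mid s, \mathbf{u})  : state transition function
% r(s, \mathbf{u})          : single team reward shared by all agents
G_t \;=\; \sum_{l \ge 0} \gamma^{l}\, r_{t+l} \qquad \text{(shared discounted return)}
```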

Adapted from Shimon Whiteson's slides

Page 32:

Deep RL for Multi-agent Systems

Decentralized PO-MDP

• Observation function:

• Action-observation history:

• Decentralized policies:

• Decentralization types: natural, artificial

• From planning (centralized) to execution (decentralized)
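
Again the notation was in images; in the paper's Dec-POMDP formulation each agent receives a local observation, keeps an action-observation history, and conditions a decentralised policy on it:

```latex
% Decentralised PO-MDP ingredients (notation as in the COMA paper)
% z^a = O(s, a)               : observation of agent a, drawn via the observation function O
% \tau^a \in (Z \times U)^{*} : agent a's action-observation history
\pi^{a}(u^{a} \mid \tau^{a}) \qquad \text{(decentralised policy of agent } a\text{)}
```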

Adapted from Shimon Whiteson's slides

Page 33:

Deep RL for Multi-agent Systems

MAS Challenges:

• Curse of dimensionality (state and action space)

• Multi-agent credit assignment problem

• Modeling other agents' state information

Adapted from Shimon Whiteson's slides

Page 34:

Deep RL for Multi-agent Systems

Independent Actor-Critic (IAC)

• First step towards RL for MAS

• Learn independently

• Each agent: its own actor and critic

• IAC-V & IAC-Q as baselines

• Treating other agents as part of environment

Adapted from Shimon Whiteson's slides

Page 35:

Q-learning

• Monte Carlo policy

• Temporal Difference

• Q-function

• Exploration?

Page 36:

Q-learning

Page 37:

COMA

Page 38:

Counterfactual Multi-Agent Policy Gradients

• All-knowing single critic

• Counterfactual baseline

Page 39:

All-knowing single critic

Page 40:

Counterfactual baseline
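
The equation itself was an image; COMA's counterfactual advantage, taken from the paper, keeps the other agents' actions u^{-a} fixed and marginalises agent a's own action under its policy:

```latex
% Counterfactual advantage for agent a (centralised critic Q, other agents' actions u^{-a} fixed)
A^{a}(s, \mathbf{u}) \;=\; Q(s, \mathbf{u})
\;-\; \sum_{u'^{a}} \pi^{a}\!\left(u'^{a} \mid \tau^{a}\right) Q\!\left(s, (\mathbf{u}^{-a}, u'^{a})\right)
```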

Page 41:

Gradient
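
The gradient shown on the slide follows directly: each agent's policy is updated using the centralised critic's counterfactual advantage, summed over agents (form as in the paper):

```latex
% COMA policy gradient for the shared actor parameters \theta
g \;=\; \mathbb{E}_{\pi}\!\left[\sum_{a} \nabla_{\theta} \log \pi^{a}\!\left(u^{a} \mid \tau^{a}\right) A^{a}(s, \mathbf{u})\right]
```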

Page 42:

Page 43:

Critic construction

Page 44:

Experimental Results

Page 45:

Experimental results

Testbed: StarCraft

Page 46:

StarCraft Environment

Page 47:

Experimental results

Page 48:

Experimental results

• 1000 evaluation episodes

• Highest mean in bold

• In parentheses: 95% confidence interval

Page 49:

For a short video of the work done in the article (for whoever checks the slides afterwards): https://www.youtube.com/watch?v=3OVvjE5B9LU

Other References:

• https://youtu.be/lvoHnicueoE

• http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

• http://rll.berkeley.edu/deeprlcourse/

• https://github.com/deepmind/pysc2

• https://www.youtube.com/watch?v=URWXG5jRB-A

Extra stuff for cool people