Kaur Karus, Novin Shahroudi
Counterfactual Multi-Agent Policy Gradients
https://arxiv.org/pdf/1705.08926.pdf
Article by Foerster, J.N.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S.


Page 1:

Kaur Karus, Novin Shahroudi

Counterfactual Multi-Agent Policy Gradients

https://arxiv.org/pdf/1705.08926.pdf

Article by Foerster, J.N.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S.

Page 2:

Introduction

• Reinforcement learning in a nutshell

• Deep RL for Multi-agent Systems

• Q-learning in a flashback

• COMA in a simple idea

• Results in an RTS

Page 3:

Reinforcement learning in a nutshell

Page 4:

(Deep) Reinforcement Learning

Adopted from Wikipedia

Page 5:

(Deep) Reinforcement Learning

Markov state and the Markov property:
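
The property itself was shown as an equation image on the slide; a standard statement of it (reconstructed here, not copied from the original image) is:

```latex
% A state S_t is Markov iff the future is independent of the past given the present:
\mathbb{P}\,[\,S_{t+1} \mid S_t\,] \;=\; \mathbb{P}\,[\,S_{t+1} \mid S_1, \ldots, S_t\,]
```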

Adopted from David Silver's UCL Course on RL

Page 6:

(Deep) Reinforcement Learning

Markov Chain:
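
The definition on the slide was an image; the usual formulation is a tuple of states plus a transition model:

```latex
% Markov chain (Markov process): a tuple (S, P)
% S : finite set of states
% P : state-transition probability matrix
\mathcal{P}_{ss'} \;=\; \mathbb{P}\,[\,S_{t+1} = s' \mid S_t = s\,]
```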

Adopted from Stanford CS234 Lecture 14

Page 7:

(Deep) Reinforcement Learning

Markov Decision Process (MDP):
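
The slide's definition was an image; a standard MDP tuple, consistent with the discounted setting used in the rest of the deck, is:

```latex
% Markov Decision Process: a tuple (S, A, P, R, \gamma)
% S : set of states                  A : set of actions
% P(s' \mid s, a) : transition model
% R(s, a) : reward function          \gamma \in [0, 1] : discount factor
\mathcal{M} \;=\; (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)
```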

Adopted from Stanford CS234 Lecture 14

Page 8:

(Deep) Reinforcement Learning

MDP

Adopted from Berkeley CS294 DRL

Page 9:

(Deep) Reinforcement Learning

Partially Observable - MDP
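
This slide only carried a diagram; as a hedged sketch (emission-probability notation follows the Berkeley course), a PO-MDP adds observations and an observation model on top of the MDP, and the agent acts on observations rather than the hidden state:

```latex
% Partially observable MDP: the agent sees o_t, not s_t
% O               : set of observations
% p(o_t \mid s_t) : observation (emission) probability
\pi_\theta(a_t \mid o_t) \quad \text{instead of} \quad \pi_\theta(a_t \mid s_t)
```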

Adopted from Berkeley CS294 DRL

Page 10:

(Deep) Reinforcement Learning

Agent:

• Policy (agent behavior is defined by the policy)

• Value function (how good state/action is)

• Model (agent’s representation of the environment)

Page 11:

(Deep) Reinforcement Learning

Policy (π)

• Maps states to actions

• Either stochastic π(a | s) or deterministic a = π(s)

Adopted from David Silver's UCL Course on RL

Page 12:

(Deep) Reinforcement Learning

Goal
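
The objective shown on the slide was an image; the standard RL goal, matching the discounted-reward setting above, is to find the policy parameters that maximise expected return:

```latex
% Find the policy parameters that maximise expected discounted return
\theta^{*} \;=\; \arg\max_{\theta}\;
\mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t)\right]
```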

Adopted from Stanford CS234 Lecture 14

Page 13:

(Deep) Reinforcement Learning

Deep PO-MDP

Adopted from Berkeley CS294 DRL

Page 14:

(Deep) Reinforcement Learning

Value & Q-value function
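
The definitions were shown as images; the standard forms are:

```latex
% State-value: expected return when starting in s and following \pi
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s\right]

% Action-value (Q): expected return when starting in s, taking a, then following \pi
Q^{\pi}(s, a) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s,\; a_0 = a\right]
```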

Adopted from Stanford CS234 Lecture 14

Page 15:

(Deep) Reinforcement Learning

Bellman equation
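
The equation itself was an image; the Bellman expectation equation for Q (and the optimality form that Q-learning targets later) can be written as:

```latex
% Bellman expectation equation for Q^\pi
Q^{\pi}(s, a) \;=\; \mathbb{E}_{s'}\!\left[\, r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}\!\left[ Q^{\pi}(s', a') \right] \right]

% Bellman optimality equation for Q^*
Q^{*}(s, a) \;=\; \mathbb{E}_{s'}\!\left[\, r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \right]
```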

Adopted from Stanford CS234 Lecture 14

Page 16:

(Deep) Reinforcement Learning

Example (Bellman expectation equation):

Page 17:

(Deep) Reinforcement Learning

Q-learning (value iteration algorithm)
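
The update rule was shown as an image; the standard tabular Q-learning update is:

```latex
% One-step Q-learning update with learning rate \alpha
Q(s, a) \;\leftarrow\; Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]
```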

Adopted from Stanford CS234 Lecture 14

Page 18:

(Deep) Reinforcement Learning

Experience replay

• Actions-States feedback loop

• Bias to consecutive samples

• Random minibatches from a transition table:
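
As a minimal sketch of the transition table the last bullet refers to (class name and capacity are illustrative, not from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random minibatches break the bias towards consecutive,
        # highly correlated samples from the actions-states feedback loop.
        return random.sample(self.transitions, batch_size)
```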

Adopted from Stanford CS234 Lecture 14

Page 19:

(Deep) Reinforcement Learning

Policy Gradient

• The Q-function can be very complex to learn

• Learning the policy directly can be much simpler

Goal:
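
The goal and gradient were shown as images; in the standard form, we directly optimise the expected return of a parameterised policy and use the likelihood-ratio (score-function) trick to get a gradient we can sample:

```latex
% Objective: expected (discounted) return of the parameterised policy \pi_\theta
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[ R(\tau) \right],
\qquad R(\tau) \;=\; \sum_{t} \gamma^{t} r_t

% Likelihood-ratio (score-function) policy gradient
\nabla_{\theta} J(\theta) \;=\;
\mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[ R(\tau) \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right]
```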

Adopted from Stanford CS234 Lecture 14

Page 20:

(Deep) Reinforcement Learning

Reinforce Algorithm

Trajectories:
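
The trajectory notation was an image; a trajectory and the resulting Monte Carlo (REINFORCE) gradient estimate over N sampled rollouts can be written as:

```latex
% A trajectory sampled by running the policy in the environment
\tau \;=\; (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)

% REINFORCE: Monte Carlo estimate of the policy gradient from N rollouts
\nabla_{\theta} J(\theta) \;\approx\;
\frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t})\; R(\tau_i)
```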

Adopted from Stanford CS234 Lecture 14, Berkeley CS294 DRL

Page 21:

(Deep) Reinforcement Learning

Addressing high variance problem

1. Push up the probability of an action only by the cumulative reward that follows it (rewards from t' ≥ t, the "reward-to-go")

2. Add a discount factor so that rewards far in the future count less

3. Use a baseline:
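
Putting the three fixes together (reward-to-go, discounting, and a baseline b), the lower-variance estimator looks like:

```latex
% Reward-to-go, discounted, with a baseline subtracted
\nabla_{\theta} J(\theta) \;\approx\;
\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
\left(\sum_{t' \ge t} \gamma^{\,t' - t}\, r_{t'} \;-\; b(s_t)\right)
```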

Adopted from Stanford CS234 Lecture 14

Page 22:

(Deep) Reinforcement Learning

Better baseline – a foundation for Actor-Critic
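
The "better baseline" on the slide is the state-value function: subtracting V from Q gives the advantage, which is exactly what an actor-critic estimates:

```latex
% Choose the baseline b(s) = V^\pi(s); the weight on the gradient becomes the advantage
A^{\pi}(s, a) \;=\; Q^{\pi}(s, a) - V^{\pi}(s)
```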

Adopted from Stanford CS234 Lecture 14

Page 23:

(Deep) Reinforcement Learning

Actor-critic Algorithm
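
A minimal PyTorch-style sketch of one actor-critic update, assuming `actor(state)` returns log-probabilities over discrete actions and `critic(state)` returns a scalar value estimate (names and shapes are illustrative, not taken from the paper or the course):

```python
import torch
import torch.nn.functional as F

def actor_critic_update(actor, critic, optimizer, transition, gamma=0.99):
    """One advantage actor-critic step on a single (s, a, r, s', done) transition."""
    state, action, reward, next_state, done = transition

    value = critic(state)                          # V(s), differentiable
    with torch.no_grad():                          # bootstrap target r + gamma * V(s')
        target = reward + gamma * critic(next_state) * (1.0 - done)
    advantage = (target - value).detach()          # advantage estimate, no grad into the actor loss

    log_probs = actor(state)                       # log pi(.|s)
    actor_loss = -log_probs[action] * advantage    # policy-gradient term
    critic_loss = F.mse_loss(value, target)        # fit the critic to the TD target

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```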

Adopted from Stanford CS234 Lecture 14

Page 24:

(Deep) Reinforcement Learning > Recap

Policy / Value function

Adopted from David Silver's UCL Course on RL

Page 25:

(Deep) Reinforcement Learning

Agent Types

Adopted from David Silver's UCL Course on RL

Value Based

• No policy (implicit)

• Value function

Policy Based

• Policy

• No value function

Actor-Critic

• Policy

• Value function

Page 26:

(Deep) Reinforcement Learning > Recap

Anatomy of RL algorithm

Adopted from Berkeley CS294 DRL

Page 27:

(Deep) Reinforcement Learning > Recap

Different RL methods

Adopted from Berkeley CS294 DRL

Page 28:

(Deep) Reinforcement Learning > Recap

Reviewed concepts:

• MDP

• PO-MDP

• State, Reward, Action

• Discounted reward

• Policy

• Value function

• Optimal Policy

• Q-value function

• Q-Learning

• Experience Replay

• High variance issue

• Credit assignment

• Baselines

• Bellman equation

• Policy gradient

• Trajectory

• Actor-critic

• Advantage function

Page 29:

Deep RL for Multi-agent Systems

Page 30:

Deep RL for Multi-agent Systems

Variety of Behaviors

• Cooperative

• Competitive

• Mixed

Page 31:

Deep RL for Multi-agent Systems

Multi-agent MDP

• Global state:

• Individual action:

• State transition:

• Shared team reward:
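
The notation on this slide was in the images; the multi-agent MDP used in the COMA paper can be summarised as follows (symbols follow the paper):

```latex
% Multi-agent MDP with n cooperating agents
% s \in S                   : global state
% u^a \in U                 : action of agent a;  \mathbf{u} \in U^n is the joint action
% P(s' \mid s, \mathbf{u})  : state transition function
% r(s, \mathbf{u})          : single team reward shared by all agents
G_t \;=\; \sum_{l \ge 0} \gamma^{l}\, r_{t+l} \qquad \text{(shared discounted return)}
```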

Adapted from Shimon Whiteson's slides

Page 32:

Deep RL for Multi-agent Systems

Decentralized PO-MDP

• Observation function:

• Action-observation history:

• Decentralized policies:

• Decentralization types: natural, artificial

• From planning (centralized) to execution (decentralized)
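
Again the notation was in images; in the paper's Dec-POMDP formulation each agent receives a local observation, keeps an action-observation history, and conditions a decentralised policy on it:

```latex
% Decentralised PO-MDP ingredients (notation as in the COMA paper)
% z^a = O(s, a)               : observation of agent a, drawn via the observation function O
% \tau^a \in (Z \times U)^{*} : agent a's action-observation history
\pi^{a}(u^{a} \mid \tau^{a}) \qquad \text{(decentralised policy of agent } a\text{)}
```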

Adapted from Shimon Whiteson's slides

Page 33:

Deep RL for Multi-agent Systems

MAS Challenges:

• Curse of dimensionality (state and action space)

• Multi-agent credit assignment problem

• Modeling other agents' state information

Adapted from Shimon Whiteson's slides

Page 34:

Deep RL for Multi-agent Systems

Independent Actor-Critic (IAC)

• First step towards RL for MAS

• Learn independently

• Each agent: its own actor and critic

• IAC-V & IAC-Q as baselines

• Treating other agents as part of environment

Adapted from Shimon Whiteson's slides

Page 35:

Q-learning

• Monte Carlo policy

• Temporal Difference

• Q-function

• Exploration?

Page 36:

Q-learning

Page 37:

COMA

Page 38:

Counterfactual Multi-Agent Policy Gradients

• All-knowing single critic

• Counterfactual baseline

Page 39:

All-knowing single critic

Page 40:

Counterfactual baseline
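
The equation itself was an image; COMA's counterfactual advantage, taken from the paper, keeps the other agents' actions u^{-a} fixed and marginalises agent a's own action under its policy:

```latex
% Counterfactual advantage for agent a (centralised critic Q, other agents' actions u^{-a} fixed)
A^{a}(s, \mathbf{u}) \;=\; Q(s, \mathbf{u})
\;-\; \sum_{u'^{a}} \pi^{a}\!\left(u'^{a} \mid \tau^{a}\right) Q\!\left(s, (\mathbf{u}^{-a}, u'^{a})\right)
```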

Page 41:

Gradient
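
The gradient shown on the slide follows directly: each agent's policy is updated using the centralised critic's counterfactual advantage, summed over agents (form as in the paper):

```latex
% COMA policy gradient for the shared actor parameters \theta
g \;=\; \mathbb{E}_{\pi}\!\left[\sum_{a} \nabla_{\theta} \log \pi^{a}\!\left(u^{a} \mid \tau^{a}\right) A^{a}(s, \mathbf{u})\right]
```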

Page 42:

Page 43:

Critic construction

Page 44:

Experimental Results

Page 45:

Experimental results

Testbed: StarCraft

Page 46:

StarCraft Environment

Page 47:

Experimental results

Page 48:

Experimental results

• 1000 evaluation episodes

• Highest mean in bold

• In parentheses: 95% confidence interval

Page 49:

For a short video of the work done in the article (for whoever checks the slides afterwards): https://www.youtube.com/watch?v=3OVvjE5B9LU

Other References:

• https://youtu.be/lvoHnicueoE

• http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

• http://rll.berkeley.edu/deeprlcourse/

• https://github.com/deepmind/pysc2

• https://www.youtube.com/watch?v=URWXG5jRB-A

Extra stuff for cool people