Multi-Agent Reinforcement Learning: An Overview
Marcello Restelli
November 12, 2014
Outline
1 Introduction to Multi-Agent Reinforcement Learning
  Reinforcement Learning
  MARL vs RL
  MARL vs Game Theory
2 MARL algorithms
  Best-Response Learning
  Equilibrium Learners
  Team Games
  Zero-sum Games
  General-sum Games
Game Theory in Computer Science
Computing Solution Concepts
Compact Game Representations
Mechanism Design
Multi-agent Learning
Some Naming Conventions
Player = Agent
Payoff = Reward
Value = Utility
Matrix = Strategic form = Normal form
Strategy = Policy
Pure strategy = Deterministic policy
Mixed strategy = Stochastic policy
What is Multi-Agent Learning?
A difficult question... we will try to answer it in these slides.
It involves:
Multiple agents
Self-interest
Concurrent learning
It is strictly related to:
Game Theory
Reinforcement Learning
Multi-agent Systems
Shoham et al., 2002-2007: "If multi-agent is the answer, what is the question?"
Stone, 2007: "Multi-agent learning is not the answer, it is the question!"
Which Applications?
Distributed vehicle regulation
Air traffic control
Network management and routing
Electricity distribution management
Supply chains
Job scheduling
Computer games
Multi-agent Learning and RL
We are interested in learning in situations where multiple decision makers repeatedly interact.
Among the different machine learning paradigms, reinforcement learning is the most suited to approach such problems.
We will mainly focus on multi-agent RL, even if other (game-theoretic) learning approaches will be mentioned:
Fictitious play
No-regret learning
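As a minimal sketch of one such game-theoretic approach, here is fictitious play on matching pennies: each player best-responds to the empirical frequency of the opponent's past actions. The payoff matrix and all names below are illustrative assumptions, not taken from the slides.

```python
# Row player's payoff in matching pennies (zero-sum):
# +1 if the two actions match, -1 otherwise.
A = [[1.0, -1.0],
     [-1.0, 1.0]]

def best_response_row(col_mix):
    """Action maximizing the row player's expected payoff."""
    values = [sum(A[i][j] * col_mix[j] for j in range(2)) for i in range(2)]
    return max(range(2), key=lambda i: values[i])

def best_response_col(row_mix):
    """Action minimizing the row player's payoff (zero-sum opponent)."""
    values = [sum(row_mix[i] * A[i][j] for i in range(2)) for j in range(2)]
    return min(range(2), key=lambda j: values[j])

# Smoothed counts of how often each action has been played so far
row_counts, col_counts = [1.0, 1.0], [1.0, 1.0]
for _ in range(10000):
    row_mix = [c / sum(row_counts) for c in row_counts]
    col_mix = [c / sum(col_counts) for c in col_counts]
    row_counts[best_response_row(col_mix)] += 1
    col_counts[best_response_col(row_mix)] += 1

row_mix = [c / sum(row_counts) for c in row_counts]
# Empirical frequencies approach the mixed equilibrium (0.5, 0.5)
```

In zero-sum games like this one, the empirical action frequencies of fictitious play converge to a Nash equilibrium, even though the stage-by-stage play keeps cycling.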
History of RL
Psychology, trial and error:
Pavlov (1903): classical conditioning
Thorndike (1905): law of effect
Minsky (1961): credit-assignment problem
Optimal control:
Bellman (1957): dynamic programming
Howard (1960): policy iteration
Reinforcement learning:
Samuel (1956): checkers
Sutton & Barto (1984): temporal difference
Watkins (1989): Q-learning
Tesauro (1992): TD-Gammon
Littman (1994): minimax-Q
The Agent-Environment Interface
The agent interacts with the environment at discrete time steps t = 0, 1, 2, ...
Full observability: the agent directly observes the environment state.
Formally, this is a Markov Decision Process (MDP).
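The interaction loop can be sketched as follows; the toy corridor environment and its `step` interface are illustrative assumptions, not part of the slides.

```python
import random

class CorridorEnv:
    """Hypothetical 1-D corridor: states 0..4, start at 0,
    reward +1 on reaching the terminal state 4."""
    def __init__(self):
        self.state = 0

    def step(self, action):  # action in {-1, +1}
        self.state = min(4, max(0, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

env = CorridorEnv()
state, total_reward, done = env.state, 0.0, False
while not done:
    action = random.choice([-1, 1])   # a random policy, for illustration
    state, reward, done = env.step(action)   # fully observed next state
    total_reward += reward
```

At each step the agent observes the state, picks an action, and receives the next state and a reward; full observability means the returned state is the environment state itself.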
Markov Decision Processes
An MDP is formalized as a 4-tuple ⟨S, A, P, R⟩:
S: set of states; what the agent knows (complete observability)
A: set of actions; what the agent can do (it may depend on the state)
P: state transition model, P : S × A × S → [0, 1]
R: reward function, R : S × A × S → ℝ
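A small MDP can be written down explicitly as the four components above. The two-state example below is a hypothetical illustration, not from the slides.

```python
# States and actions
S = ['s0', 's1']
A = ['stay', 'go']

# P[s][a] maps each reachable next state s' to P(s'|s,a)
P = {
    's0': {'stay': {'s0': 1.0}, 'go': {'s0': 0.1, 's1': 0.9}},
    's1': {'stay': {'s1': 1.0}, 'go': {'s0': 0.9, 's1': 0.1}},
}

# R[s][a][s'] is the reward for the transition (s, a, s')
R = {
    's0': {'stay': {'s0': 0.0}, 'go': {'s0': 0.0, 's1': 1.0}},
    's1': {'stay': {'s1': 0.5}, 'go': {'s0': 0.0, 's1': 0.0}},
}

# Sanity check: P(.|s,a) is a probability distribution for every (s, a)
for s in S:
    for a in A:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```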
Markov Assumption
Let s_t be a random variable for the state at time t:
P(s_t | a_{t-1}, s_{t-1}, ..., a_0, s_0) = P(s_t | a_{t-1}, s_{t-1})
The Markov property is a special kind of conditional independence: the future is independent of the past given the current state.
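One practical consequence is that a trajectory's probability factors into one-step transition probabilities. A minimal sketch, using a hypothetical two-state chain (the transition numbers are made up for illustration):

```python
# P[s][a][s'] = P(s'|s,a) for a single action 'go'
P = {
    's0': {'go': {'s0': 0.2, 's1': 0.8}},
    's1': {'go': {'s0': 0.6, 's1': 0.4}},
}

def trajectory_prob(states, actions):
    """Probability of the state sequence (given s_0) under `actions`.
    The Markov property lets us multiply one-step probabilities only."""
    prob = 1.0
    for t, a in enumerate(actions):
        prob *= P[states[t]][a].get(states[t + 1], 0.0)
    return prob

p = trajectory_prob(['s0', 's1', 's0'], ['go', 'go'])
# 0.8 * 0.6 = 0.48: each factor depends only on the current state
```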
The Goal: a Policy
The goal is to find a policy that maximizes some cumulative function of the rewards.
What is a policy?
A mapping from states to distributions over actions:
deterministic vs stochastic
stationary vs non-stationary
Cost criteria:
finite horizon
infinite horizon: average reward, or discounted return
R_t = Σ_{k=0}^{∞} γ^k r_{t+k}
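The discounted return above can be computed for a (truncated) reward sequence by accumulating backwards, using R_t = r_t + γ R_{t+1}; the rewards below are an illustrative assumption.

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_k gamma^k * r_{t+k}, accumulated back-to-front."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

R0 = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
# 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```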
Value Functions
- MDP + stationary policy ⇒ Markov chain
- Given a policy π, it is possible to define the utility of each state: Policy Evaluation
- Value function (Bellman equation):
  V^π(s) = ∑_{a∈A} π(a|s) ∑_{s'∈S} P(s'|s,a) (R(s,a,s') + γ V^π(s'))
- For control purposes, rather than the value of each state, it is easier to consider the value of each action in each state
- Action-value function (Bellman equation):
  Q^π(s,a) = ∑_{s'∈S} P(s'|s,a) (R(s,a,s') + γ V^π(s'))
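The policy-evaluation recursion above can be run as a fixed-point iteration. A minimal sketch follows; the dictionary layout `P[s][a][s']`, the two-state MDP, and all names are assumptions for illustration, not from the slides:

```python
def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Fixed-point iteration on V(s) = sum_a pi(a|s) sum_s' P(s'|s,a)(R(s,a,s') + gamma V(s'))."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                   for s2 in P[s][a])
                    for a in pi[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Hypothetical two-state chain: the single action in state 0 reaches state 1
# with probability 0.1 (reward 1); state 1 is absorbing with reward 0.
P = {0: {0: {0: 0.9, 1: 0.1}}, 1: {0: {1: 1.0}}}
R = {0: {0: {0: 0.0, 1: 1.0}}, 1: {0: {1: 0.0}}}
pi = {0: {0: 1.0}, 1: {0: 1.0}}   # pi(a|s): the only action taken with probability 1
V = policy_evaluation(P, R, pi)
```

The iteration is a contraction with modulus γ, so it converges to the unique solution of the Bellman equation for V^π.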
Optimal Value Functions
Optimal Bellman equation (Bellman, 1957):
  V*(s) = max_a ∑_{s'∈S} P(s'|s,a) (R(s,a,s') + γ V*(s'))
  Q*(s,a) = ∑_{s'∈S} P(s'|s,a) (R(s,a,s') + γ max_{a'} Q*(s',a'))
- For each MDP there is at least one deterministic optimal policy
- All optimal policies have the same value function V*
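The optimal Bellman equation can be solved by value iteration, with the deterministic optimal policy read off greedily at the end. A minimal sketch, assuming a `P[s][a][s']` transition table; the two-state MDP is made up for illustration:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- max_a sum_s' P(s'|s,a)(R(s,a,s') + gamma V(s')) to a fixed point."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    # A deterministic optimal policy always exists: act greedily w.r.t. V*
    pi = {s: max(P[s], key=lambda a, s=s: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                              for s2 in P[s][a]))
          for s in P}
    return V, pi

# Hypothetical MDP: reward 1 whenever the next state is state 1
P = {0: {'stay': {0: 1.0}, 'go': {1: 1.0}},
     1: {'stay': {1: 1.0}, 'go': {0: 1.0}}}
R = {s: {a: {s2: float(s2 == 1) for s2 in P[s][a]} for a in P[s]} for s in P}
V, pi = value_iteration(P, R)
```

With γ = 0.9 the optimal policy is "go" in state 0 and "stay" in state 1, and V*(s) = 1/(1 − γ) = 10 in both states.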
Solving an MDP
- Policy search
  - brute force is infeasible (|A|^|S| deterministic policies)
  - policy gradient, stochastic optimization approaches
- Dynamic Programming (DP)
  - Value Iteration
  - Policy Iteration
- Linear Programming
  - LP worst-case convergence guarantees are better than those of DP methods
  - LP methods become impractical at a much smaller number of states than DP methods do
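Of the DP routes above, policy iteration can be sketched in a few lines: alternate policy evaluation with greedy improvement until the policy stops changing. The MDP encoding (`P[s][a][s']` dictionaries) and the toy two-state problem are assumptions for illustration:

```python
def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    """Alternate iterative policy evaluation with greedy improvement until stable."""
    pi = {s: next(iter(P[s])) for s in P}   # arbitrary initial deterministic policy
    V = {s: 0.0 for s in P}
    while True:
        # Evaluation: solve V^pi by fixed-point iteration
        while True:
            delta = 0.0
            for s in P:
                v = sum(P[s][pi[s]][s2] * (R[s][pi[s]][s2] + gamma * V[s2])
                        for s2 in P[s][pi[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Improvement: pick the greedy action under a one-step lookahead
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                               for s2 in P[s][a]))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V

# Hypothetical two-state problem: reward 1 whenever the next state is state 1
P = {0: {'stay': {0: 1.0}, 'go': {1: 1.0}},
     1: {'stay': {1: 1.0}, 'go': {0: 1.0}}}
R = {s: {a: {s2: float(s2 == 1) for s2 in P[s][a]} for a in P[s]} for s in P}
pi, V = policy_iteration(P, R)
```

Each improvement step can only increase the value function, and there are finitely many deterministic policies, so the loop terminates at an optimal policy.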
Dynamic Programming
- Dynamic Programming (DP) is a collection of algorithms for solving problems that exhibit optimal substructure
- When the transition model and the reward function are known, (offline) DP algorithms can be used to solve MDP problems
  - complete knowledge required
  - computationally expensive
- RL algorithms have been derived from DP algorithms
RL vs DP
- RL methods are used when the transition model or the reward function is unknown
- Through repeated interactions, the agent estimates the utility of each state
- Two approaches
  - Model-based
  - Model-free (e.g., Q-learning)
Q-learning (Watkins,’89)
- Q-learning is the most popular RL algorithm
  Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + α (r_t + γ max_a Q_t(s_{t+1},a) − Q_t(s_t,a_t))
  or, equivalently,
  Q_{t+1}(s_t,a_t) = (1 − α) Q_t(s_t,a_t) + α (r_t + γ max_a Q_t(s_{t+1},a))
- Off-policy TD algorithm
- Simple to implement
- If all state-action pairs are tried infinitely often and the learning rate decreases appropriately, it converges to the optimal solution
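A minimal tabular sketch of the update above. The toy environment (`env_step`), the fixed episode length, the constant learning rate, and the ε-greedy behavior policy are assumptions for illustration, not from the slides:

```python
import random

def q_learning(env_step, states, actions, episodes=500, steps=50,
               alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha * (r + gamma max_a' Q(s',a') - Q(s,a))."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(steps):
            # epsilon-greedy exploration over the current estimates
            a = random.choice(actions) if random.random() < eps \
                else max(actions, key=lambda a2: Q[(s, a2)])
            s2, r = env_step(s, a)
            # Watkins' update: bootstrap from the greedy value of the next state
            target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

# Hypothetical two-state world: 'go' switches state, reward 1 for landing in state 1
def env_step(s, a):
    s2 = 1 - s if a == 'go' else s
    return s2, float(s2 == 1)

random.seed(0)
Q = q_learning(env_step, states=[0, 1], actions=['stay', 'go'])
```

The update uses max over next actions regardless of the action the behavior policy actually takes next, which is exactly what makes Q-learning off-policy.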
Advanced Topics in RL
- High-dimensional problems
- Continuous MDPs
- Partially observable MDPs
- Multi-Objective MDPs
- Inverse RL
- Transfer of Knowledge
- Exploration vs Exploitation
- Multi-agent learning
Exploration vs Exploitation
- To accumulate high rewards, an agent needs to exploit actions that have been tried in the past and are known to be effective...
- ... but it has to explore untried actions in order to improve
- The dilemma is that both exploration and exploitation are necessary
- Many techniques have been studied
  - ε-greedy
  - Boltzmann: π(s,a) = e^{Q(s,a)/T} / ∑_{a'∈A} e^{Q(s,a')/T}
  - more efficient techniques (Multi-Armed Bandits)
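The Boltzmann (softmax) policy above can be sketched as follows; the action names and the max-subtraction stabilization trick are illustrative choices, not from the slides:

```python
import math

def boltzmann_policy(Q_s, T=1.0):
    """pi(a|s) = exp(Q(s,a)/T) / sum_a' exp(Q(s,a')/T); low T -> greedy, high T -> uniform."""
    m = max(Q_s.values())                              # subtract max for numerical stability
    w = {a: math.exp((q - m) / T) for a, q in Q_s.items()}
    z = sum(w.values())
    return {a: wa / z for a, wa in w.items()}

p = boltzmann_policy({'left': 1.0, 'right': 2.0}, T=1.0)   # favors 'right'
```

Lowering the temperature T sharpens the distribution toward the greedy action, which yields a simple exploration schedule: start with a high T and anneal it over time.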
How can RL be extended to MAS?

- RL research is mainly focused on single-agent learning
- We need to extend the MDP framework to consider other agents with possibly different reward functions
- So we will need to resort to game-theoretic concepts
Multi-agent vs Single-agent
- When do we have a MAL problem?
  - when there are multiple concurrent learners
  - more precisely, when some agents' policies depend on other agents' past actions
- MAL is much more difficult than SAL. Why?
  - problem dimensions typically grow with the number of agents
  - non-stationarity
  - "optimal" policies can be stochastic
  - learning cannot be separated from teaching
Multi-agent vs Single-agent
- Which is the goal? What do the agents have to learn?
- It actually depends on the learning strategy adopted by the other agents
  - Best-response
  - Equilibrium
- No learning procedure is optimal against all possible opponent behaviors
  - Self-play
  - Targeted optimality
- Desirable properties for learning strategies
  - Safety
  - Rationality
  - Universal consistency / no-regret
Multi-agent vs Single-agent
Which is the goal? What the agents have to learn?Actually it depends on the learning strategy adopted byother agents
Best-responseEquilibrium
No learning procedure is optimal against all possibleopponent beahviors
Self-playTargeted optimality
Desirable properties for learning strategiesSafetyRationalityUniversal consistency / no-regret
MARL
MarcelloRestelli
Introduction toMulti-AgentReinforcementLearningReinforcementLearning
MARL vs RL
MARL vs GameTheory
MARLalgorithmsBest-ResponseLearning
Equilibrium Learners
Team Games
Zero-sum Games
General-sumGames
Multi-agent vs Single-agent
Which is the goal? What the agents have to learn?Actually it depends on the learning strategy adopted byother agents
Best-responseEquilibrium
No learning procedure is optimal against all possibleopponent beahviors
Self-playTargeted optimality
Desirable properties for learning strategiesSafetyRationalityUniversal consistency / no-regret
MARL
MarcelloRestelli
Introduction toMulti-AgentReinforcementLearningReinforcementLearning
MARL vs RL
MARL vs GameTheory
MARLalgorithmsBest-ResponseLearning
Equilibrium Learners
Team Games
Zero-sum Games
General-sumGames
Multi-agent vs Single-agent
Which is the goal? What the agents have to learn?Actually it depends on the learning strategy adopted byother agents
Best-responseEquilibrium
No learning procedure is optimal against all possibleopponent beahviors
Self-playTargeted optimality
Desirable properties for learning strategiesSafetyRationalityUniversal consistency / no-regret
MARL
MarcelloRestelli
Introduction toMulti-AgentReinforcementLearningReinforcementLearning
MARL vs RL
MARL vs GameTheory
MARLalgorithmsBest-ResponseLearning
Equilibrium Learners
Team Games
Zero-sum Games
General-sumGames
Multi-agent vs Single-agent
Which is the goal? What the agents have to learn?Actually it depends on the learning strategy adopted byother agents
Best-responseEquilibrium
No learning procedure is optimal against all possibleopponent beahviors
Self-playTargeted optimality
Desirable properties for learning strategiesSafetyRationalityUniversal consistency / no-regret
MARL
MarcelloRestelli
Introduction toMulti-AgentReinforcementLearningReinforcementLearning
MARL vs RL
MARL vs GameTheory
MARLalgorithmsBest-ResponseLearning
Equilibrium Learners
Team Games
Zero-sum Games
General-sumGames
Matrix Games

A matrix (or strategic-form) game is a tuple 〈n, A, R〉:
- n: number of players
- A: joint action space, where Ai is the set of actions of player i
- R: vector of reward functions, where Ri is the reward function of player i

Matrix games are one-shot games, while learning requires repeated interactions:
- Repeated games
- Stochastic games
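The tuple 〈n, A, R〉 maps directly onto plain data structures. A minimal sketch, using the prisoner's dilemma as the stage game (the payoff numbers follow the standard convention; the `best_response` helper is illustrative, not from the slides):

```python
# A two-player matrix game <n, A, R> as plain Python data.
# C = cooperate, D = defect (prisoner's dilemma payoffs).
n = 2
A = [["C", "D"], ["C", "D"]]          # A[i]: action set of player i
R = {                                 # R[(a1, a2)]: reward vector for a joint action
    ("C", "C"): (-1, -1),
    ("C", "D"): (-3,  0),
    ("D", "C"): ( 0, -3),
    ("D", "D"): (-2, -2),
}

def best_response(player, other_action):
    """Pure-strategy best response of `player` to the opponent's action."""
    def reward(a):
        joint = (a, other_action) if player == 0 else (other_action, a)
        return R[joint][player]
    return max(A[player], key=reward)

print(best_response(0, "C"))  # 'D' -- defection dominates
print(best_response(0, "D"))  # 'D'
```

Since defecting is a best response to both opponent actions, (D, D) is the unique Nash equilibrium of this one-shot game.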
A Special Case: Repeated Games

In repeated games, the same one-shot game (called the stage game) is played repeatedly:
- E.g., the iterated prisoner's dilemma
- Infinitely vs. finitely repeated games
- Really, an extensive-form game

Subgame-perfect (SP) equilibria:
- One SP equilibrium is to repeatedly play some Nash equilibrium of the stage game (a stationary strategy)
- Are other equilibria possible?
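Repetition makes history-dependent (non-stationary) strategies possible, which is what opens the door to equilibria beyond the stage-game Nash. A hypothetical simulation of the iterated prisoner's dilemma (the payoff matrix and the strategy names are standard textbook choices, not taken from the slides):

```python
# Iterated prisoner's dilemma with the usual (T,R,P,S) = (5,3,1,0) payoffs.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(history):
    """History-dependent strategy: cooperate first, then copy the
    opponent's previous move."""
    return "C" if not history else history[-1][1]

def always_defect(history):
    """Stationary strategy: the stage-game Nash action, always."""
    return "D"

def play(s1, s2, stages=20):
    h1, h2, totals = [], [], [0, 0]
    for _ in range(stages):
        a1, a2 = s1(h1), s2(h2)
        r1, r2 = PAYOFF[(a1, a2)]
        totals[0] += r1
        totals[1] += r2
        h1.append((a1, a2))   # each player sees (own action, opponent action)
        h2.append((a2, a1))
    return totals

print(play(tit_for_tat, tit_for_tat))    # [60, 60]: sustained cooperation
print(play(tit_for_tat, always_defect))  # [19, 24]: one exploitation, then (D,D)
```

Two tit-for-tat players sustain mutual cooperation, a payoff profile that no stationary stage-game equilibrium achieves — a concrete answer to "are other equilibria possible?".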
Folk Theorem

It characterizes not strategy profiles but the payoffs that can be obtained. Informally:

"In an infinitely repeated game, the set of average rewards attainable in equilibrium is precisely the set of payoff profiles attainable under mixed strategies in the single stage game, with the constraint on the mixed strategies that each player's payoff is at least the amount he would receive if the other players adopted minimax strategies against him."
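The informal statement can be written compactly. Using the stage-game notation 〈n, A, R〉 from above (this formalization is the standard one, not spelled out on the slide):

```latex
% Minimax (security) value of player i in the stage game:
v_i \;=\; \min_{\sigma_{-i}} \max_{\sigma_i} R_i(\sigma_i, \sigma_{-i})

% Equilibrium average-payoff set of the infinitely repeated game:
\bigl\{\, r \in \operatorname{conv}\{R(a) : a \in A\} \;:\; r_i \ge v_i \ \ \forall i \,\bigr\}
```

That is, exactly the feasible payoff profiles (convex hull of the stage-game payoffs) that are individually rational, i.e., give every player at least his minimax value.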
MDP + Matrix = Stochastic Games (SGs)

A stochastic (or Markov) game is a tuple 〈n, S, A, P, R〉:
- n: number of players
- S: set of states
- A: joint action space, A1 × · · · × An
- P: state transition model
- R: vector of reward functions, one for each agent

An SG extends an MDP to multiple agents; SGs with one state are repeated games.
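The tuple can be sketched as a small container; the one-state instance below makes the "SGs with one state are repeated games" remark concrete (the class and the example game are illustrative assumptions, not from the slides):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StochasticGame:
    """A stochastic game <n, S, A, P, R>, fields named as in the slide."""
    n: int                 # number of players
    S: List[str]           # set of states
    A: List[List[str]]     # A[i]: action set of player i
    P: Callable            # P(s, joint_a) -> dict: next_state -> probability
    R: List[Callable]      # R[i](s, joint_a) -> reward of player i

# Degenerate case: a single state that always transitions to itself,
# i.e., a repeated matrix game.
sg = StochasticGame(
    n=2,
    S=["s0"],
    A=[["C", "D"], ["C", "D"]],
    P=lambda s, a: {"s0": 1.0},
    R=[lambda s, a: 3 if a == ("C", "C") else 1,
       lambda s, a: 3 if a == ("C", "C") else 1],
)
print(sg.P("s0", ("C", "C")))  # {'s0': 1.0}: the state never changes
```

With |S| = 1 the transition model carries no information, so only the reward functions matter — exactly a repeated game; with n = 1 the joint action space collapses and the tuple reduces to an MDP.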
Strategies in SG

Let ht = (s0, a0, s1, a1, . . . , st−1, at−1, st) denote a history of t stages of a stochastic game. The space of possible strategies is huge, but there are interesting restrictions:
- Behavioral strategy: returns the probability of playing an action given a history ht
- Markov strategy: a behavioral strategy in which the distribution over actions depends only on the current state
- Stationary strategy: a time-independent Markov strategy
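The three restrictions differ only in what the strategy is allowed to look at, which shows up directly in the function signatures. A sketch (the concrete behaviors are arbitrary illustrations; only the arguments matter):

```python
import random

def behavioral_strategy(history):
    """May depend on the full history h_t = (s_0, a_0, ..., s_t)."""
    return "cooperate" if len(history) % 2 == 0 else "defect"

def markov_strategy(state, t):
    """May depend only on the current state and the stage index t."""
    return "left" if t < 10 else "right"

def stationary_strategy(state):
    """Time-independent Markov strategy: a fixed map from states
    to distributions over actions."""
    dist = {"left": 0.5, "right": 0.5}
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(behavioral_strategy([]))    # 'cooperate'
print(markov_strategy("s0", 0))   # 'left'
```

Each class is strictly contained in the previous one: dropping the history gives a Markov strategy, and additionally dropping the time index gives a stationary one.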
Equilibria in SG

- Markov-perfect equilibrium: a profile of Markov strategies that yields a Nash equilibrium in every proper subgame
- Every n-player, general-sum, discounted-reward stochastic game has a Markov-perfect equilibrium
Stochastic Games: Example

Reward: +100 when the goal is reached, −1 on a collision, 0 otherwise.

[Figure: grid-world navigation example with some solutions; image omitted]
Summary: Problems (agents × states)

- Single agent, single state: Optimization
- Single agent, multiple states: MDP
- Multiple agents, single state: Matrix Game
- Multiple agents, multiple states: Stochastic Game
Summary: Learning (agents × states)

- Single agent, single state: Multi-Armed Bandit
- Single agent, multiple states: Reinforcement Learning
- Multiple agents, single state: Learning in Repeated Games
- Multiple agents, multiple states: Multi-Agent Learning
Learning vs Game Theory

Game theory predicts which strategies rational players will play. Unfortunately, in multi-agent learning, many agents are not able to behave rationally:
- The problem is unknown
- Real-time constraints
- Human players

In some problems, a non-equilibrium strategy is appropriate if one expects the others to play non-equilibrium strategies.
Why Learning?

- If the game is known, the agent wants to learn the strategies employed by the other agents
- If the game is unknown, the agent also wants to learn the structure of the game:
  - Unknown payoffs
  - Unknown transition probabilities
- Observability: do the agents see each others' actions, and/or each others' payoffs?
Desired Properties in MAL
Rationality: play a best response against stationary opponents
Convergence: play a Nash equilibrium in self-play
Safety: do no worse than the minimax strategy
Targeted optimality: approximate a best response against memory-bounded opponents
Cooperate and compromise: an agent must offer and accept compromises
Many algorithms have been proposed that exhibit some of these properties (WoLF, GIGA-WoLF, AWESOME, M-Qubed, ...)
Taxonomy of MARL Algorithms: Task Type

Fully cooperative
  - Static: JAL, FMQ
  - Dynamic: Team-Q, Distributed-Q, OAL
Fully competitive
  - Minimax-Q
Mixed
  - Static: Fictitious Play, MetaStrategy, IGA, WoLF-IGA, GIGA, GIGA-WoLF, AWESOME, Hyper-Q
  - Dynamic: Single-agent RL, Nash-Q, CE-Q, Asymmetric-Q, NSCP, WoLF-PHC, PD-WoLF, EXORL
Taxonomy of MARL Algorithms: Field of Origin

[Figure: MARL algorithms positioned by field of origin, spanning Temporal-Difference RL, Game Theory, and Direct Policy Search. Algorithms shown: single-agent RL, JAL, FMQ, Distributed-Q, EXORL, Hyper-Q, CE-Q, Nash-Q, Team-Q, minimax-Q, NSCP, Asymmetric-Q, OAL, Fictitious Play, AWESOME, MetaStrategy, WoLF-PHC, PD-WoLF, IGA, WoLF-IGA, GIGA, GIGA-WoLF.]
Equilibrium or not?
Why focus on equilibria?
  - An equilibrium identifies conditions under which learning can or should stop
  - It is easier to play an equilibrium than to keep computing
Why not focus on equilibria?
  - A Nash equilibrium strategy has no prescriptive force
  - There may be multiple potential equilibria
  - Using an oracle to uniquely identify an equilibrium is "cheating"
  - The opponent may not wish to play an equilibrium
  - Computing a Nash equilibrium for a large game can be intractable
Independent Learners

Typical conditions for Independent Learning (IL):
  - An agent is unaware of the existence of other agents
  - It cannot identify other agents' actions, or has no reason to believe that other agents are acting strategically
Independent learners try to learn best responses
Advantages:
  - Straightforward application of single-agent techniques
  - Scales with the number of agents
Disadvantages:
  - Convergence guarantees from the single-agent setting are lost
  - No explicit means for coordination

[Figure: traffic]
Independent Reinforcement Learners
Q-learning [Watkins '92]
Learning Automata [Narendra '74, Wheeler '86]
WoLF-PHC [Bowling '01]
FAQ-learning [Kaisers '10]
RESQ-learning [Hennes '10]
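Of these, WoLF-PHC is a representative policy-hill-climbing learner. Below is a rough sketch for a single-state (matrix) game, simplified from Bowling's formulation; the class name, hyperparameter values, and tie-breaking details are illustrative assumptions, not taken from the slides:

```python
import numpy as np

class WoLFPHC:
    """Sketch of WoLF policy hill-climbing for a single-state (matrix) game."""

    def __init__(self, n_actions, alpha=0.1, d_win=0.01, d_lose=0.04, seed=0):
        self.q = np.zeros(n_actions)
        self.pi = np.full(n_actions, 1.0 / n_actions)      # current policy
        self.avg_pi = np.full(n_actions, 1.0 / n_actions)  # average policy so far
        self.count = 0
        self.alpha, self.d_win, self.d_lose = alpha, d_win, d_lose
        self.rng = np.random.default_rng(seed)

    def act(self):
        return int(self.rng.choice(len(self.pi), p=self.pi))

    def update(self, a, r):
        # Ordinary Q-learning step on the agent's own action.
        self.q[a] += self.alpha * (r - self.q[a])
        # Incrementally track the average policy played so far.
        self.count += 1
        self.avg_pi += (self.pi - self.avg_pi) / self.count
        # "Win or Learn Fast": small step when winning (current policy beats
        # the average policy in expected value), large step when losing.
        winning = self.pi @ self.q > self.avg_pi @ self.q
        delta = self.d_win if winning else self.d_lose
        best = int(np.argmax(self.q))
        # Hill-climb: move probability mass toward the greedy action while
        # keeping pi a valid distribution.
        for other in range(len(self.pi)):
            if other != best:
                step = min(delta / (len(self.pi) - 1), self.pi[other])
                self.pi[other] -= step
                self.pi[best] += step
```

The variable learning rate (d_lose > d_win) is the ingredient that distinguishes WoLF-PHC from plain policy hill-climbing and yields its convergence behavior in self-play.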
Joint Action Learners
A joint action learner (JAL) is an agent that learns Q-values for joint actions
To estimate opponents' actions, empirical distributions can be used (as in fictitious play):

    f_i(a_{-i}) = ∏_{j≠i} φ_j(a_j)

The expected value of an individual action is the sum of joint Q-values, weighted by the estimated probability of the associated complementary joint-action profiles:

    EV(a_i) = ∑_{a_{-i} ∈ A_{-i}} Q(a_i ∪ a_{-i}) f_i(a_{-i})
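For a two-player game the opponent model f_i reduces to the single empirical distribution φ_j, and the EV computation is a matrix-vector product. A minimal sketch (function and variable names are my own illustration):

```python
import numpy as np

def expected_values(Q, phi):
    """EV(a_i) = sum over a_-i of Q[a_i, a_-i] * phi[a_-i] (two-player case).

    Q   : |A_i| x |A_-i| matrix of learned joint-action Q-values
    phi : empirical probability of each opponent action
    """
    return Q @ phi

# Joint-action Q-values for the coordination game used later in the slides.
Q = np.array([[10.0, 0.0],
              [0.0, 10.0]])
phi = np.array([0.7, 0.3])          # opponent played action 0 in 70% of plays

ev = expected_values(Q, phi)        # array([7., 3.])
best_response = int(np.argmax(ev))  # 0
```

With more than two opponents, phi would instead range over joint opponent profiles, with probabilities given by the product ∏_{j≠i} φ_j(a_j).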
Team Games
Team games are fully cooperative games in which all the agents share the same reward function
If learning is centralized, it is actually single-agent learning with multiple actuators
Multi-agent learning arises in distributed problems
Coordination Equilibria
In a coordination equilibrium, all the agents achieve their maximum possible payoff.
If π_1, π_2, ..., π_n are in coordination equilibrium, we have that

    ∑_{a_1,...,a_n} π_1(s, a_1) · ... · π_n(s, a_n) Q_i(s, a_1, ..., a_n) = max_{a_1,...,a_n} Q_i(s, a_1, ..., a_n)

for all 1 ≤ i ≤ n and all states s
If a game has a coordination equilibrium, then it has a deterministic coordination equilibrium
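The last property can be checked mechanically for a one-shot game: a deterministic coordination equilibrium is a joint action at which every player's payoff attains that player's global maximum. A small sketch (the function and example matrices are my own illustration):

```python
import numpy as np

def deterministic_coordination_equilibria(payoffs):
    """Return joint actions where every player's payoff equals their global max.

    payoffs: one payoff array per player, all with the same shape; entry
             [a1, a2, ...] is that player's payoff for the joint action.
    """
    maxima = [P.max() for P in payoffs]
    # Joint actions that simultaneously maximize every player's payoff.
    mask = np.logical_and.reduce([P == m for P, m in zip(payoffs, maxima)])
    return [tuple(int(i) for i in idx) for idx in zip(*np.where(mask))]

# Team (common-payoff) coordination game: both players share the matrix.
P = np.array([[10, 0],
              [0, 10]])
eq = deterministic_coordination_equilibria([P, P])  # [(0, 0), (1, 1)]
```

In a team game every joint maximizer of the shared payoff is a coordination equilibrium; in general-sum games the returned list may be empty.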
Independent vs Joint Action Learners
Example: Coordination game

        a0   a1
   b0   10    0
   b1    0   10

The agents use Boltzmann exploration
Both are able to converge to one of the optimal strategies
JALs can distinguish Q-values of different joint actions
The difference in performance is small due to the exploration strategy
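A minimal simulation of two independent Q-learners with Boltzmann exploration on this game might look as follows (hyperparameters and the cooling schedule are illustrative, not from the slides):

```python
import numpy as np

R = np.array([[10.0, 0.0],
              [0.0, 10.0]])        # common payoff: R[a, b] for joint action (a, b)

def boltzmann(q, temp):
    """Boltzmann (softmax) action probabilities at a given temperature."""
    z = np.exp(q / temp)
    return z / z.sum()

def run(episodes=2000, alpha=0.1, temp0=5.0, seed=0):
    rng = np.random.default_rng(seed)
    q1, q2 = np.zeros(2), np.zeros(2)   # each IL keeps Q over its OWN actions only
    for t in range(episodes):
        temp = max(temp0 * 0.995 ** t, 0.05)   # slowly cool the temperature
        a = rng.choice(2, p=boltzmann(q1, temp))
        b = rng.choice(2, p=boltzmann(q2, temp))
        r = R[a, b]
        q1[a] += alpha * (r - q1[a])    # each agent treats the other as part
        q2[b] += alpha * (r - q2[b])    # of a (non-stationary) environment
    return q1, q2

q1, q2 = run()
# Typically both agents lock onto the same diagonal entry, so their greedy
# actions coincide; which of the two equilibria is reached depends on the seed.
```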
Independent vs Joint Action Learners
Example: Penalty game

         a0   a1   a2
    b0   10    0    k
    b1    0    2    0
    b2    k    0   10

- For k < 0, the game has 3 pure equilibria.
- Suppose the penalty is k = −100.
- Both ILs and JALs will converge to the self-confirming equilibrium ⟨a1, b1⟩.
- The magnitude of the penalty k influences the probability of convergence to the optimal joint strategy.
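To see why the penalty drags both learners toward ⟨a1, b1⟩, compare the expected payoff of each action against a uniformly exploring opponent. This is only a back-of-the-envelope sketch of the early learning phase, not the actual learning dynamics:

```python
# Payoff matrix of the penalty game with k = -100
# (rows: b0..b2, columns: a0..a2).
k = -100
payoff = [
    [10, 0, k],
    [0, 2, 0],
    [k, 0, 10],
]

# Expected payoff of each column action against a uniform-random opponent.
n = len(payoff)
expected = [sum(payoff[row][col] for row in range(n)) / n for col in range(n)]

# a1 is the only action with positive expected value, so early exploration
# pushes both learners toward the "safe" equilibrium <a1, b1>.
best = max(range(n), key=lambda c: expected[c])
```

The optimal actions a0 and a2 average out to −30 under uniform exploration, while a1 stays positive; once both players favor a1/b1, the choice confirms itself.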
MARL
MarcelloRestelli
Introduction toMulti-AgentReinforcementLearningReinforcementLearning
MARL vs RL
MARL vs GameTheory
MARLalgorithmsBest-ResponseLearning
Equilibrium Learners
Team Games
Zero-sum Games
General-sumGames
Independent vs Joint Action Learners
Example: Climbing game

         a0   a1   a2
    b0   11  -30    0
    b1  -30    7    6
    b2    0    0    5

- Agents start playing ⟨a2, b2⟩.
- Agents converge to ⟨a1, b1⟩.
- Convergence to pure equilibria is almost sure.
Independent vs Joint Action Learners
Sufficient conditions

- The learning rate α decreases over time such that Σ_t α_t diverges and Σ_t α_t² converges.
- Each agent samples each of its actions infinitely often.
- The probability P_t^i(a) of agent i choosing action a is nonzero.
- Agents eventually become full exploiters with probability one:

    lim_{t→∞} P_t^i(X_t) = 0

  where X_t is a random variable denoting the event that (f_i, g_i) prescribe a sub-optimal action.
- Let E_t be a random variable denoting the probability of a (deterministic) equilibrium strategy profile being played at time t. Then, for both ILs and JALs, for any δ, ε > 0 there is a T(δ, ε) such that

    Pr(|E_t − 1| < ε) > 1 − δ   for all t > T(δ, ε).
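The first condition is the standard Robbins–Monro requirement on step sizes. A schedule such as α_t = 1/t satisfies it, as a quick numerical check suggests (snippet added for illustration, not from the slides):

```python
# Numerical check of the Robbins-Monro conditions for alpha_t = 1/t:
# the partial sums of alpha_t keep growing without bound (divergence),
# while the partial sums of alpha_t^2 stay bounded (they approach pi^2/6).
N = 100_000
sum_alpha = sum(1.0 / t for t in range(1, N + 1))        # ~ ln(N), unbounded
sum_alpha_sq = sum(1.0 / t ** 2 for t in range(1, N + 1))  # ~ 1.6449, bounded
```

A constant step size violates the second condition (the squared sum diverges), which is why Q-values then keep oscillating with the other agents' exploration instead of converging.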
Independent vs Joint Action Learners
Myopic Heuristics

- Neither ILs nor JALs ensure convergence to an optimal equilibrium.
- No hope with ILs, but JALs with a different exploration strategy...
- Myopic heuristics:
  - Optimistic Boltzmann (OB): for agent i and action a_i ∈ A_i, let MaxQ(a_i) = max_{Π−i} Q(Π−i, a_i). Choose actions with Boltzmann exploration (another exploitative strategy would suffice) using MaxQ(a_i) as the value of a_i.
  - Weighted OB (WOB): explore with Boltzmann using the factors MaxQ(a_i) · Pr_i(Π−i is an optimal match for a_i).
  - Combined: let C(a_i) = ρ MaxQ(a_i) + (1 − ρ) EV(a_i), for some 0 ≤ ρ ≤ 1. Choose actions using Boltzmann exploration with C(a_i) as the value of a_i.
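A sketch of the OB and Combined action values on the penalty game, from the row player's point of view. The uniform `opponent_model` and the weight `rho` are assumptions made for illustration:

```python
# Penalty game payoffs with k = -100 (rows: b0..b2, columns: a0..a2).
payoff = [
    [10, 0, -100],
    [0, 2, 0],
    [-100, 0, 10],
]
opponent_model = [1 / 3, 1 / 3, 1 / 3]  # assumed empirical frequencies

# Optimistic value MaxQ(a_i): best payoff achievable with *some* opponent action.
max_q = [max(row) for row in payoff]

# Expected value EV(a_i) under the opponent model (what a plain JAL would use).
ev = [sum(p * q for p, q in zip(opponent_model, row)) for row in payoff]

# Combined heuristic: blend optimism and realism with weight rho.
rho = 0.5
combined = [rho * m + (1 - rho) * e for m, e in zip(max_q, ev)]
```

Pure optimism rates b0 and b2 at 10 despite the −100 penalty, while the expected value alone collapses to the safe b1; the Combined score interpolates between the two, which is exactly the trade-off ρ controls.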
Independent vs Joint Action Learners
Myopic heuristics: Penalty game (results figure)
Distributed Q-learning [Lauer and Riedmiller, '01]

- Applies to deterministic cooperative SGs.
- Non-negative reward functions.
- Update rule:

    Q_0(s, a) = 0
    Q_{k+1}(s, a) = max( Q_k(s, a), R(s, a) + γ max_{a'∈A} Q_k(s', a') )

- This optimistic algorithm learns distributed Q-tables, provided that all state-action pairs occur infinitely often.
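The update above can be sketched in a few lines. In a deterministic environment with non-negative rewards the max makes the estimate monotonically non-decreasing, so a teammate's bad exploratory choice can never pull a learned value back down (state and action names below are placeholders):

```python
# One optimistic update of Distributed Q-learning.
from collections import defaultdict

gamma = 0.9
Q = defaultdict(float)  # Q_0(s, a) = 0 for every state-action pair

def distributed_q_update(Q, s, a, r, s_next, actions):
    """Q_{k+1}(s,a) = max(Q_k(s,a), r + gamma * max_{a'} Q_k(s',a'))."""
    best_next = max((Q[(s_next, a2)] for a2 in actions), default=0.0)
    Q[(s, a)] = max(Q[(s, a)], r + gamma * best_next)
    return Q[(s, a)]

# A single transition with reward 10 into an unvisited successor state:
v = distributed_q_update(Q, "s0", "a0", 10.0, "s1", ["a0", "a1"])
```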
Distributed Q-learning
Example: Climbing game

         a0   a1   a2
    b0   11  -30    0
    b1  -30    7    6
    b2    0    0    5

Distributed Q-tables:

                 a0   a1   a2
    Q1(s0, a)    11    7    6
    Q2(s0, a)    11    7    5
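These distributed Q-tables are the optimistic projections of the joint payoff matrix: each agent records, for each of its own actions, the best joint payoff attainable with some action of the other agent. A quick sketch of that projection:

```python
# Climbing game payoffs: rows = agent 2's actions (b0..b2),
# columns = agent 1's actions (a0..a2).
payoff = [
    [11, -30, 0],
    [-30, 7, 6],
    [0, 0, 5],
]

# Optimistic projection: for each own action, take the best payoff over all
# actions of the other agent (column maxima for agent 1, row maxima for agent 2).
q1 = [max(payoff[b][a] for b in range(3)) for a in range(3)]
q2 = [max(payoff[b][a] for a in range(3)) for b in range(3)]
```

Both tables rank a0/b0 highest, so the greedy joint policy is the optimal ⟨a0, b0⟩, which plain ILs and JALs fail to reach in this game.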
Distributed Q-learning
Example: Penalty game

         a0   a1   a2
    b0   10    0    k
    b1    0    2    0
    b2    k    0   10

Distributed Q-tables:

                 a0   a1   a2
    Q1(s0, a)    10    2   10
    Q2(s0, a)    10    2   10

- The tie requires an additional coordination mechanism between the agents.
- Update the current policy only if an improvement to the Q-value happens.
Distributed Q-learning
Stochastic environments

- Distributed Q-learning works fine in deterministic cooperative environments.
- The extension to stochastic environments is problematic.
- The main difficulty is that Q-values are affected by two kinds of uncertainty:
  - the behavior of the other agents
  - the influence of the stochastic environment
- Distinguishing between these two uncertainties is a key point in multi-agent learning.
Team Q-learning [Littman, '01]

- Requires observing the actions of the other agents.
- Update rule:

    Q1(s, a1, ..., an) ← (1 − α) Q1(s, a1, ..., an) + α ( r1 + γ max_{a'1,...,a'n} Q1(s', a'1, ..., a'n) )

- It does not use an opponent model.
- In a team game, team Q-learning will converge to the optimal Q-function with probability one.
- If the limit equilibrium is unique and the agent follows a GLIE policy, it will also converge in behavior with probability one.
- The main problem is to select an equilibrium when there are multiple coordination equilibria.
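The update rule is ordinary Q-learning, just indexed by observed joint actions. A minimal sketch of one update step (action names, α, and γ are illustrative):

```python
# One Team Q-learning update on a joint-action Q-table.
from collections import defaultdict
from itertools import product

alpha, gamma = 0.5, 0.9
Q = defaultdict(float)
joint_actions = list(product(["a0", "a1"], ["b0", "b1"]))

def team_q_update(Q, s, joint_a, r, s_next):
    """Standard Q-learning target, maximized over *joint* actions."""
    best_next = max(Q[(s_next, ja)] for ja in joint_actions)
    Q[(s, joint_a)] = (1 - alpha) * Q[(s, joint_a)] + alpha * (r + gamma * best_next)
    return Q[(s, joint_a)]

# Observed transition: joint action (a0, b0), reward 10, next state s1.
v = team_q_update(Q, "s0", ("a0", "b0"), 10.0, "s1")
```

Learning the joint Q-function is the easy part; the remaining difficulty noted above is that each agent must still pick its component of one equilibrium consistently with its teammates.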
Zero-sum Games

- Consider 2 players:
  - R1(i, j) = M(i, j)
  - R2(i, j) = −M(i, j)
- Player 1 is the maximizer; player 2 is the minimizer.
- Examples: matching pennies, rock-paper-scissors.
Minimax-Q [Littman, '94]

- In MDPs, a stationary, deterministic, and undominated optimal policy always exists.
- In MGs, the performance of a policy depends on the opponent's policy, so policies cannot be evaluated without context.
- New definition of optimality from game theory:
  - perform best in the worst case, compared with the alternatives;
  - at least one optimal policy exists, which may or may not be deterministic, because the agent is uncertain of its opponent's move.
Minimax-Q: Learning the Optimal Policy

Q-learning

Q(s, a) ← (1 − α) Q(s, a) + α (r + γ V(s′))
V(s) = max_a Q(s, a)

minimax-Q learning

Q(s, a, o) ← (1 − α) Q(s, a, o) + α (r_{s,a,o} + γ V(s′))
π(s, ·) ← argmax_{π′(s,·)} min_{o′} Σ_{a′} π′(s, a′) Q(s, a′, o′)
V(s) ← min_{o′} Σ_{a′} π(s, a′) Q(s, a′, o′)
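One minimax-Q update step can be sketched in pure Python as below. For readability the maximin over mixed policies is solved by a coarse grid search over the two-action simplex rather than by the linear program that solves it exactly; the toy payoffs and all names are illustrative, not from the slides.

```python
def maximin(Qs):
    """max over mixed policies pi of min over opponent actions o of
    sum_a pi(a) * Qs[a][o], for a 2-action agent (grid-search sketch;
    in general this is a linear program)."""
    best_v, best_pi = float("-inf"), None
    for i in range(301):                       # p = probability of action 0
        p = i / 300
        pi = (p, 1.0 - p)
        v = min(pi[0] * Qs[0][o] + pi[1] * Qs[1][o]
                for o in range(len(Qs[0])))
        if v > best_v:
            best_v, best_pi = v, pi
    return best_v, best_pi

def minimax_q_update(Q, V, s, a, o, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular minimax-Q step:
    Q(s,a,o) <- (1-alpha) Q(s,a,o) + alpha (r + gamma V(s'))."""
    Q[s][a][o] = (1 - alpha) * Q[s][a][o] + alpha * (r + gamma * V[s_next])
    V[s], pi_s = maximin(Q[s])                 # re-solve the stage game at s
    return pi_s
```

On a matching-pennies stage game the re-solved policy is the uniform mixture with value 0, matching the worst-case optimality criterion above.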
Minimax-Q: Considerations

In a two-player zero-sum multi-agent environment, an agent following minimax Q-learning converges to the optimal Q-function with probability one. Furthermore, if it follows a GLIE policy and the limit equilibrium is unique, it also converges in behavior with probability one.
In zero-sum SGs, even if the limit equilibrium is not unique, minimax-Q converges to a policy that always achieves the optimal value regardless of its opponent (safety).
Minimax Q-learning achieves the largest value possible in the absence of knowledge of the opponent's policy.
Minimax-Q is quite slow to converge w.r.t. Q-learning (but the latter learns only deterministic policies).
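The behavioral-convergence result assumes a GLIE ("greedy in the limit with infinite exploration") policy. One common way to realize this, sketched below as an illustration rather than anything prescribed by the slides, is epsilon-greedy with epsilon decaying as the inverse visit count:

```python
import random

def glie_epsilon_greedy(q_values, visit_count):
    """Epsilon-greedy with epsilon = 1/visit_count: every action keeps
    being explored (the epsilons sum to infinity), yet the policy
    becomes greedy in the limit as the state is visited more often."""
    epsilon = 1.0 / visit_count
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=q_values.__getitem__)       # exploit
```

Early on (small visit counts) the agent explores almost always; in the limit it acts greedily with respect to its current Q-values.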
Can we extend this approach to general-sum SGs?

Yes and no.
Nash-Q learning is such an extension.
However, it has much worse computational and theoretical properties.
Nash-Q [Hu & Wellman, ’98-’03]
NashQ^i_t(s′) = π^1(s′) · … · π^n(s′) · Q^i_t(s′)

that is, agent i's expected payoff at s′ when all agents play a Nash equilibrium (π^1(s′), …, π^n(s′)) of the stage game.
Each agent needs to maintain the Q-functions of all the other agents.
Nash-Q: Complexity

Space requirements: n · |S| · |A|^n
The algorithm's running time is dominated by the computation of Nash equilibria.
By contrast, the minimax operator can be computed in polynomial time (linear programming).
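The exponential blow-up hidden in the space requirement is easy to see with a small illustrative calculation (the numbers are chosen arbitrarily):

```python
def nash_q_space(n_agents, n_states, n_actions):
    """Table entries for Nash-Q: each of the n agents stores a Q-function
    over states and *joint* actions, i.e. n * |S| * |A|^n entries."""
    return n_agents * n_states * n_actions ** n_agents

print(nash_q_space(2, 100, 5))  # 2 * 100 * 25   = 5000
print(nash_q_space(5, 100, 5))  # 5 * 100 * 3125 = 1562500
```

Going from 2 to 5 agents with the same state and action spaces multiplies storage by more than 300x, before even touching the equilibrium computation that dominates the running time.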
Nash-Q: Convergence

Assumptions
Every state and joint action is visited infinitely often.
Learning rates suitably decrease.
One of the following assumptions holds:
Every stage game (Q^1_t(s), …, Q^n_t(s)), for all t and s, has a global optimal point, and the agents' payoffs at this equilibrium are used to update their Q-functions.
Every stage game (Q^1_t(s), …, Q^n_t(s)), for all t and s, has a saddle point, and the agents' payoffs at this equilibrium are used to update their Q-functions.

Theorem
Under these assumptions, the sequence Q_t = (Q^1_t, …, Q^n_t), updated by

Q^k_{t+1}(s, a^1, …, a^n) = (1 − α_t) Q^k_t(s, a^1, …, a^n) + α_t (r^k_t + γ π^1(s′) · … · π^n(s′) · Q^k_t(s′))   for k = 1, …, n

where (π^1(s′), …, π^n(s′)) is the appropriate type of Nash equilibrium solution for the stage game (Q^1_t(s′), …, Q^n_t(s′)), converges to the Nash Q-value Q* = (Q^1_*, …, Q^n_*).
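For two players, the update in the theorem can be sketched as follows: the bootstrap term π^1(s′) · Q^k_t(s′) · π^2(s′) is agent k's expected payoff at the stage-game equilibrium of s′. Computing that equilibrium is the hard part and is taken as given here; all names and numbers are illustrative.

```python
def nash_q_update(Qk_sa, reward, Qk_next, pi1, pi2, alpha=0.1, gamma=0.9):
    """One Nash-Q step for agent k on a single (s, a1, a2) entry.
    Qk_next[a1][a2] is agent k's stage game at s'; (pi1, pi2) is a Nash
    equilibrium of that stage game (computed elsewhere)."""
    # Expected equilibrium payoff: pi1^T Qk_next pi2.
    nash_value = sum(pi1[i] * Qk_next[i][j] * pi2[j]
                     for i in range(len(pi1)) for j in range(len(pi2)))
    return (1 - alpha) * Qk_sa + alpha * (reward + gamma * nash_value)
```

For instance, with the stage game [[1, 0], [0, 1]] at s′ and both players mixing uniformly, the equilibrium payoff is 0.5, and the update then behaves exactly like a Q-learning step bootstrapping on that value.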
Nash-Q: Convergence Result Analysis

The third assumption is really strong: it is unlikely that the stage games encountered during learning keep satisfying it.
The global-optimum assumption implies full cooperation between agents.
The saddle-point assumption implies no cooperation between agents.
Nonetheless, empirically the algorithm converges even when the assumptions are violated.
This suggests that the assumptions may be relaxed, at least for some classes of games.
Friend-or-Foe [Littman, ’01]

Friend-or-Foe Q-learning (FFQ) aims at removing the requirements on the intermediate Q-values during learning.
The idea is to tell the algorithm what kind of opponent to expect:

friend: coordination equilibrium

Nash_1(s, Q_1, Q_2) = max_{a_1 ∈ A_1, a_2 ∈ A_2} Q_1(s, a_1, a_2)

foe: adversarial equilibrium

Nash_1(s, Q_1, Q_2) = max_{π ∈ Π(A_1)} min_{a_2 ∈ A_2} Σ_{a_1 ∈ A_1} π(a_1) Q_1(s, a_1, a_2)

In FFQ the learner maintains only a Q-function for itself.
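The two FFQ operators can be sketched for two-action stage games as below. The toy payoffs and the grid-search maximin are illustrative only; as with minimax-Q, the foe operator is properly a linear program.

```python
def friend_value(Q1_s):
    """Friend-Q: coordination equilibrium, max over *joint* actions."""
    return max(q for row in Q1_s for q in row)

def foe_value(Q1_s):
    """Foe-Q: adversarial equilibrium, maximin over mixed policies
    (coarse grid over the 2-action simplex instead of an LP)."""
    return max(min(p * Q1_s[0][o] + (1 - p) * Q1_s[1][o]
                   for o in range(len(Q1_s[0])))
               for p in (i / 300 for i in range(301)))

Q1 = [[2.0, 0.0],
      [0.0, 1.0]]  # agent 1's stage-game payoffs Q1[a1][a2]
print(friend_value(Q1))  # 2.0: coordinate on the joint action (0, 0)
print(foe_value(Q1))     # ~0.667: value guaranteed against an adversary
```

The gap between the two values on the same game shows how strongly the friend/foe assumption shapes what the learner treats as the value of a state.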
Friend-or-Foe Q-learning: Convergence Results

Friend-or-foe Q-learning converges.
In general, the values learned by FFQ do not converge to those of any Nash equilibrium.
There are some special cases (independent of the opponent's behavior):
Foe-Q learns values for a Nash equilibrium policy if the game has an adversarial equilibrium.
Friend-Q learns values for a Nash equilibrium policy if the game has a coordination equilibrium.
Foe-Q learns a Q-function whose corresponding policy achieves at least the learned values, regardless of the opponent's selected policy.