Reinforcement Learning: Multi-agent Reinforcement Learning

Subramanian Ramamoorthy, School of Informatics
28 March 2017




Agents often face Strategic Adversaries

Key issue we seek to model: misaligned/conflicting interests

On Self-Interest

What does it mean to say that agents are self-interested?

• It does not necessarily mean that they want to cause harm to each other, or even that they care only about themselves.

• Instead, it means that each agent has his own description of which states of the world he likes (which can include good things happening to other agents) and that he acts in an attempt to bring about these states of the world. A better term: inter-dependent decision making.

A Simple Model of a Game

• Two decision makers:
  – Robot (has an action space: a)
  – Adversary (has an action space: θ)

• Cost or payoff (to use the term common in game theory) depends on actions of both decision makers: R(a, θ) – denoted as a matrix corresponding to the product space

This is the normal form – simultaneous choice over moves

Representing Payoffs

In a general, bi-matrix, normal-form game:

The combined actions (a1, a2, ..., an) form an action profile a ∈ A

Each player i has an action set Ai and a payoff function, a.k.a. utility, ui(a)
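For reference, the standard definition this slide is sketching: a finite, n-player normal-form game is a tuple (N, A, u), where N is the set of n players, A = A1 × ··· × An with Ai the set of actions of player i, and u = (u1, ..., un) with ui : A → ℝ the utility (payoff) function of player i.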

Example: Rock-Paper-Scissors

• Famous children's game
• Two players; each player simultaneously picks an action, which is evaluated as follows:
  – Rock beats Scissors
  – Scissors beats Paper
  – Paper beats Rock
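As a minimal sketch (not from the slides), the game can be represented by the row player's payoff matrix; the column player's payoffs are exactly the negatives, so the game is zero-sum:

    import numpy as np

    # Rows and columns ordered Rock, Paper, Scissors; each entry is the row
    # player's payoff: +1 for a win, 0 for a draw, -1 for a loss. The column
    # player's payoff matrix is -A, so payoffs always sum to zero.
    A = np.array([[ 0, -1,  1],
                  [ 1,  0, -1],
                  [-1,  1,  0]])

    assert (A == -A.T).all()  # the game is symmetric as well as zero-sum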

TCP Game

• Imagine there are only two internet users: you and me
• Internet traffic is governed by the TCP protocol, one feature of which is the backoff mechanism: when the network is congested, back off and reduce transmission rates for a while
• Imagine that there are two implementations: C (correct, does what is intended) and D (defective)
• If you both adopt C, packet delay is 1 ms; if you both adopt D, packet delay is 3 ms
• If one adopts C but the other adopts D, then the D user gets no delay and the C user suffers a 4 ms delay

TCP Game in Normal Form

Note that this is another way of writing a bi-matrix game: the first number represents the payoff of the row player and the second number is the payoff of the column player
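The matrix can be reconstructed from the delays described above by writing payoffs as negative delays (a sketch consistent with that description):

              C            D
    C      −1, −1       −4, 0
    D       0, −4      −3, −3

Each cell lists (row player's payoff, column player's payoff); e.g., if the row player adopts C and the column player adopts D, the row player suffers the 4 ms delay and the column player none.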

Some Famous Matrix Examples - What are they Capturing?

• Prisoner's Dilemma: Cooperate or Defect (same structure as the TCP game)

• Bach or Stravinsky (von Neumann called it Battle of the Sexes)

• Matching Pennies: one player tries to get the same outcome, Heads/Tails; the other tries to mismatch

Different Categorization: Common Payoff

A common-payoff game is a game in which for all action profiles a ∈ A1 × ··· × An and any pair of agents i, j, it is the case that ui(a) = uj(a)

Pure coordination: e.g., driving on a side of the road
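For example, for two drivers choosing a side (a sketch, not from the slides):

              Left     Right
    Left      1, 1      0, 0
    Right     0, 0      1, 1

Both (Left, Left) and (Right, Right) are good outcomes; all that matters is matching.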

Different Categorization: Constant Sum

A two-player normal-form game is constant-sum if there exists a constant c such that for each strategy profile a ∈ A1 × A2 it is the case that u1(a) + u2(a) = c

Pure competition: one player wants to coordinate; the other player does not!

Defining the “action space”

Strategies
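For reference, the standard notions this slide covers: a pure strategy selects a single action; a mixed strategy si for player i is a probability distribution over Ai. The expected utility of a mixed-strategy profile s = (s1, ..., sn) is

    ui(s) = Σa∈A ui(a) Πj sj(aj)

i.e., the payoff of each action profile weighted by its probability under s.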

Solution Concepts

Many ways of describing what one ought to do:
– Dominance
– Minimax
– Pareto Efficiency
– Nash Equilibria
– Correlated Equilibria

Remember that in the end game theory aspires to predict behaviour given a specification of the game.

Normatively, a solution concept is a rationale for behaviour

Concept: Dominance
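For reference: strategy si strictly dominates s′i if ui(si, s−i) > ui(s′i, s−i) for every opponent profile s−i; weak dominance replaces > with ≥ (strict for at least one s−i). A rational agent never plays a strictly dominated strategy.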

Concept: Iterated Dominance
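For example, in the TCP game above, D strictly dominates C for each player (no delay beats 1 ms against an opponent playing C, and 3 ms beats 4 ms against D), so iterated elimination of strictly dominated strategies leaves only the profile (D, D).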

Concept: Minimax
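For reference: in a two-player zero-sum game, player i's maximin (safety) value is maxsi mins−i ui(si, s−i), the best payoff i can guarantee against a worst-case opponent; the minimax value mins−i maxsi ui(si, s−i) is the worst the opponent can hold i to. Von Neumann's minimax theorem states that, over mixed strategies, the two coincide; this common value is the value of the game.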

Minimax

Computing Minimax: Linear Programming
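The row player's maximin strategy can be found by linear programming: maximize v subject to Σi xi A[i, j] ≥ v for every column j, Σi xi = 1, and x ≥ 0. A minimal sketch of this (the function name and array layout are mine, not the slide's):

    import numpy as np
    from scipy.optimize import linprog

    def matrix_game_value(A):
        """Maximin mixed strategy and value of the zero-sum game whose
        row-player payoff matrix is A, computed by linear programming."""
        m, n = A.shape
        c = np.zeros(m + 1)
        c[-1] = -1.0                               # minimize -v, i.e. maximize v
        A_ub = np.hstack([-A.T, np.ones((n, 1))])  # v - sum_i x_i A[i, j] <= 0
        b_ub = np.zeros(n)
        A_eq = np.ones((1, m + 1))
        A_eq[0, -1] = 0.0                          # probabilities sum to 1
        b_eq = [1.0]
        bounds = [(0, None)] * m + [(None, None)]  # x >= 0, v unbounded
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[:m], -res.fun

    # Matching Pennies: value 0 with the uniform strategy (1/2, 1/2)
    x, v = matrix_game_value(np.array([[1.0, -1.0], [-1.0, 1.0]]))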

Pick-a-Hand

• There are two players: chooser (player I) & hider (player II)

• The hider has two gold coins in his back pocket. At the beginning of a turn, he puts his hands behind his back and either takes out one coin and holds it in his left hand, or takes out both and holds them in his right hand.

• The chooser picks a hand and wins any coins the hider has hidden there.

• She may get nothing (if the hand is empty), or she might win one coin, or two.

Pick-a-Hand, Normal Form:

• Hider could minimize losses by placing 1 coin in his left hand; the most he can lose is 1

• If chooser can figure out hider's plan, he will surely lose that 1

• If hider thinks chooser might strategise, he has an incentive to play R2, ...

• All hider can guarantee is a max loss of 1 coin

• Similarly, chooser might try to maximise gain, picking R

• However, if hider strategizes, chooser ends up with zero

• So, chooser can't actually guarantee winning anything
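For reference, the matrix under discussion, reconstructed from the game's description (entries are the chooser's winnings, which are the hider's losses; L1 = one coin in the left hand, R2 = two coins in the right hand):

              L1      R2
    L          1       0
    R          0       2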

Pick-a-Hand, with Mixed Strategies

• Suppose that chooser decides to choose R with probability p and L with probability 1 − p

• If hider were to play pure strategy R2, his expected loss would be 2p

• If he were to play L1, expected loss is 1 − p

• Chooser maximizes her gains by choosing p so as to maximize min{2p, 1 − p}; the maximum is where the two lines cross, at 2p = 1 − p, i.e., p = 1/3

• Thus, by choosing R with probability 1/3 and L with probability 2/3, chooser assures an expected payoff of 2/3, regardless of whether hider knows her strategy

[Figure: chooser's guaranteed payoff min{2p, 1 − p} as a function of p, peaking at p = 1/3]

Mixed Strategy for the Hider

• Hider will play R2 with some probability q and L1 with probability 1 − q

• The payoff for chooser is 2q if she picks R, and 1 − q if she picks L

• If she knows q, she will choose the strategy corresponding to the maximum of the two values

• If hider knows chooser's plan, he will choose q = 1/3 to minimize this maximum, guaranteeing that his expected payout is 2/3 (because 2/3 = 2q = 1 − q)

• Chooser can assure an expected gain of 2/3, hider can assure an expected loss of no more than 2/3, regardless of what either knows of the other's strategy.

Safety Value as Incentive

• Clearly, without some extra incentive, it is not in hider's interest to play Pick-a-Hand, because he can only lose by playing.

• Thus, we can imagine that chooser pays hider to entice him into joining the game.

• 2/3 is the maximum amount that chooser should pay him in order to gain his participation.

Equilibrium as a Saddle Point
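For reference: in a zero-sum game, an equilibrium pair of mixed strategies (p*, q*) is a saddle point of the expected payoff, in that neither player can do better by deviating unilaterally; the row player's guaranteed gain equals the column player's guaranteed cap on loss, as with the value 2/3 in Pick-a-Hand.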

Concept: Nash Equilibrium

Nash Equilibrium
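For reference, the standard definition: a strategy profile s* = (s*1, ..., s*n) is a Nash equilibrium if, for every agent i and every alternative strategy si, ui(s*i, s*−i) ≥ ui(si, s*−i); that is, no agent can improve its payoff by deviating unilaterally. Nash's theorem guarantees that every finite game has at least one equilibrium in mixed strategies.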

Nash Equilibrium - Example

Many well-known techniques from reinforcement learning, e.g., value/policy iteration, can still be applied to solving these games

Stochastic Games (SG)

Defined by the tuple (n, S, A1,...,n, T, R1,...,n):
– n: the number of agents
– S: the set of states
– Ai: the set of actions available to agent i; the joint action space is A = A1 × A2 × ... × An
– T: the transition dynamics, T : S × A × S → [0, 1]
– Ri: the reward function of the ith agent, Ri : S × A → ℝ, with R = R1 × R2 × ... × Rn

We wish to learn a stationary, possibly stochastic, policy ρ : S → Pr(Ai). The objective continues to be maximization of expected future reward.

A First Algorithm for SG Solution [Shapley]

This classic algorithm (from 1953) is akin to Value Iteration for MDPs:
– The max operator has been replaced by “Value”, which refers to equilibrium
– i.e., the matrix game is being solved at each state (step 2b)
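A minimal sketch of the idea for the two-player zero-sum case, reusing matrix_game_value() from the linear-programming sketch above (the array layout, R[s]: (A1, A2) rewards and T[s]: (A1, A2, S) transition probabilities, is my assumption):

    import numpy as np

    def shapley_value_iteration(R, T, gamma=0.9, iters=200):
        """Value iteration in which the max over actions is replaced by the
        equilibrium value of the matrix game induced at each state."""
        S = len(R)
        V = np.zeros(S)
        for _ in range(iters):
            for s in range(S):
                G = R[s] + gamma * T[s] @ V     # one-step matrix game at s
                _, V[s] = matrix_game_value(G)  # solve it instead of maxing
        return V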

The Policy Iteration Algorithm for SGs

• This algorithm is akin to Policy Iteration for MDPs.
• Each player selects an equilibrium policy according to the current value function (using the same G matrix as in Shapley's algorithm)
• The value function is then updated based on rewards under the equilibrium policy
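A minimal sketch under the same assumptions; matrix_game_value() is as above, and matrix_game_value(-G.T) recovers the column player's equilibrium strategy in a zero-sum game:

    import numpy as np

    def sg_policy_iteration(R, T, gamma=0.9, iters=50):
        """Alternate between (1) selecting equilibrium policies of each
        state's matrix game G(s) and (2) evaluating V exactly under those
        fixed policies via a linear solve."""
        S = len(R)
        V = np.zeros(S)
        for _ in range(iters):
            r_pi = np.zeros(S)
            P_pi = np.zeros((S, S))
            for s in range(S):
                G = R[s] + gamma * T[s] @ V
                x, _ = matrix_game_value(G)     # row player's policy at s
                y, _ = matrix_game_value(-G.T)  # column player's policy at s
                r_pi[s] = x @ R[s] @ y          # expected one-step reward
                P_pi[s] = np.einsum('i,ijk,j->k', x, T[s], y)
            V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        return V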

Q-Learning for SGs

• Q-learning version of Shapley's algorithm (maintaining value over joint actions)

• The algorithm converges to the stochastic game's equilibrium, even if the other player doesn't, provided everyone executes all actions infinitely often.
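A minimal tabular sketch in the style of minimax-Q (Littman, 1994), which matches this description for the zero-sum case; Q[s] is assumed to be an (A, O) table over joint actions, and matrix_game_value() is reused from above:

    def minimax_q_update(Q, V, s, a, o, r, s_next, alpha=0.1, gamma=0.9):
        """One update from the transition (s, a, o, r, s_next), where a is
        our action and o the opponent's. Q holds values of joint actions."""
        Q[s][a, o] += alpha * (r + gamma * V[s_next] - Q[s][a, o])
        _, V[s] = matrix_game_value(Q[s])  # re-solve the stage game at s
        return Q, V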

What do we do if we have no Model? Fictitious Play [Robinson '51]

• Assumes opponents play stationary strategies
• Maintains information about the average value of each action
• Finds equilibria in zero-sum and some general-sum games
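A minimal self-contained sketch for a zero-sum bi-matrix game (the representation, with A as the row player's payoff matrix, is my assumption): each player best-responds to the empirical frequencies of the opponent's past play.

    import numpy as np

    def fictitious_play(A, steps=10000):
        """Fictitious play in a zero-sum game with row-player payoffs A
        (the column player receives -A). Returns empirical mixed strategies."""
        m, n = A.shape
        row_counts = np.ones(m)  # initialized to 1 to avoid dividing by zero
        col_counts = np.ones(n)
        for _ in range(steps):
            i = np.argmax(A @ (col_counts / col_counts.sum()))  # row best response
            j = np.argmin((row_counts / row_counts.sum()) @ A)  # column best response
            row_counts[i] += 1
            col_counts[j] += 1
        return row_counts / row_counts.sum(), col_counts / col_counts.sum()

    # On Matching Pennies, np.array([[1., -1.], [-1., 1.]]), both empirical
    # strategies approach (1/2, 1/2), the game's equilibrium.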

Summary: General Tactic for SGs

Matrix Game Solver + Temporal Differencing → Stochastic Game Solver

Summary: Many Approaches

Optional Reference/Acknowledgements

Learning algorithms for stochastic games are from the paper:
M. Bowling, M. Veloso, An analysis of stochastic game theory for multiagent reinforcement learning, CMU-CS-00-165, 2000.

Several slides are adapted from the following sources:
• Tutorial at IJCAI 2003 by Prof Peter Stone, University of Texas
• Y. Peres, Game Theory, Alive (Lecture Notes)