
Reinforcement Learning

Multi-agent Reinforcement Learning

Subramanian Ramamoorthy
School of Informatics

28 March, 2017

Agents often face Strategic Adversaries

Key issue we seek to model: misaligned/conflicting interests

On Self-Interest

What does it mean to say that agents are self-interested?

•  It does not necessarily mean that they want to cause harm to each other, or even that they care only about themselves.

•  Instead, it means that each agent has his own description of which states of the world he likes (which can include good things happening to other agents), and that he acts in an attempt to bring about these states of the world (a better term: inter-dependent decision making)

A Simple Model of a Game

•  Two decision makers:
   –  Robot (has an action space: a)
   –  Adversary (has an action space: θ)

•  Cost or payoff (to use the term common in game theory) depends on the actions of both decision makers: R(a, θ), denoted as a matrix corresponding to the product space

This is the normal form – simultaneous choice over moves


Representing Payoffs

In a general, bi-matrix, normal-form game:

•  Each player i has an action set Ai
•  Each player i has a payoff function (a.k.a. utility) ui(a), giving its payoff for each combination of actions
•  The combined actions (a1, a2, …, an) form an action profile a ∈ A

Example: Rock-Paper-Scissors

•  Famous children's game
•  Two players; each player simultaneously picks an action, which is evaluated as follows:
   –  Rock beats Scissors
   –  Scissors beats Paper
   –  Paper beats Rock
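These rules translate into a payoff matrix; the following is a minimal sketch in code (the +1/0/−1 win/draw/loss values are an assumed convention, and the Python representation is illustrative, not from the slides):

import numpy as np

# Row player's payoffs in Rock-Paper-Scissors; the column player gets the negative.
# Rows and columns are ordered Rock, Paper, Scissors; +1 = win, 0 = draw, -1 = loss.
RPS = np.array([
    [ 0, -1,  1],   # Rock:     draws Rock, loses to Paper, beats Scissors
    [ 1,  0, -1],   # Paper:    beats Rock, draws Paper, loses to Scissors
    [-1,  1,  0],   # Scissors: loses to Rock, beats Paper, draws Scissors
])

Because u1(a) = −u2(a) for every action profile, this is also an example of a constant-sum (here zero-sum) game, discussed below.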

TCP Game

•  Imagine there are only two internet users: you and me
•  Internet traffic is governed by the TCP protocol, one feature of which is the backoff mechanism: when the network is congested, back off and reduce transmission rates for a while
•  Imagine that there are two implementations: C (correct, does what is intended) and D (defective)
•  If you both adopt C, packet delay is 1 ms; if you both adopt D, packet delay is 3 ms
•  If one adopts C but the other adopts D, then the D user gets no delay and the C user suffers a 4 ms delay

TCP Game in Normal Form

Note that this is another way of writing a bi-matrix game: the first number represents the payoff of the row player and the second number is the payoff of the column player
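A sketch of that bi-matrix reconstructed from the delays above, taking each player's payoff to be the negative of their delay in milliseconds (this sign convention is an assumption; the slide shows the matrix as a figure):

# TCP game: each cell is (row player's payoff, column player's payoff) = -(delay in ms).
tcp_game = {
    ("C", "C"): (-1, -1),   # both correct: 1 ms delay each
    ("C", "D"): (-4,  0),   # row backs off, column defects: row suffers 4 ms, column none
    ("D", "C"): ( 0, -4),   # row defects, column backs off
    ("D", "D"): (-3, -3),   # both defective: 3 ms delay each
}

Whatever the other user does, D gives a strictly better payoff than C, yet (D, D) is worse for both players than (C, C); this is the Prisoner's Dilemma structure referred to on the next slide.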

Some Famous Matrix Examples – What are they Capturing?

•  Prisoner's Dilemma: Cooperate or Defect (same as the TCP game)
•  Bach or Stravinsky (von Neumann called it the Battle of the Sexes)
•  Matching Pennies: try to get the same outcome, Heads/Tails

Different Categorization: Common Payoff

A common-payoff game is a game in which for all action profiles a ∈ A1 × ··· × An and any pair of agents i, j, it is the case that ui(a) = uj(a)

Pure coordination: e.g., driving on a side of the road

Different Categorization: Constant Sum

A two-player normal-form game is constant-sum if there exists a constant c such that for each strategy profile a ∈ A1 × A2 it is the case that u1(a) + u2(a) = c

Pure competition: one player wants to coordinate; the other player does not!

Defining the “action space”

Strategies


Expected utility
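The slide's formula is not reproduced in this transcript; as a hedged restatement of the standard definition, if each player i plays a mixed strategy si (a probability distribution over Ai), the expected utility of player i under the profile s = (s1, …, sn) is

ui(s) = Σ_{a ∈ A} ui(a) · s1(a1) · s2(a2) · … · sn(an)

i.e., each pure action profile's utility weighted by the probability that the independent mixtures produce it.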

Solution Concepts

Many ways of describing what one ought to do:
–  Dominance
–  Minimax
–  Pareto Efficiency
–  Nash Equilibria
–  Correlated Equilibria

Remember that, in the end, game theory aspires to predict behaviour given a specification of the game.

Normatively, a solution concept is a rationale for behaviour

Concept: Dominance

Concept: Iterated Dominance

Concept: Minimax

Minimax

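The definitions themselves are on the slide images; as a hedged restatement of the standard one for a two-player zero-sum game with payoff matrix R for the row player (mixed strategies x for the row player, y for the column player):

maximin value = max_x min_y xᵀRy,    minimax value = min_y max_x xᵀRy

Von Neumann's minimax theorem states that these two quantities coincide at the value of the game, which is what the next slide computes by linear programming.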

Computing Minimax: Linear Programming

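The LP formulation itself is on the slide image; the following is a minimal sketch of one standard way to compute the row player's maximin strategy and the game value with a linear program (the function name and the use of scipy.optimize.linprog are my own choices, not the slide's):

import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(R):
    """Maximin mixed strategy and value for the row player of a zero-sum matrix game.

    R[i, j] is the row player's payoff when row action i meets column action j.
    Maximize v subject to: the mixture x guarantees at least v against every column.
    """
    R = np.asarray(R, dtype=float)
    m, n = R.shape
    c = np.concatenate([np.zeros(m), [-1.0]])                    # minimize -v, i.e. maximize v
    A_ub = np.hstack([-R.T, np.ones((n, 1))])                    # v - sum_i x_i R[i, j] <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)    # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]                    # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[m]

On the Rock-Paper-Scissors matrix sketched earlier this returns the uniform mixture and a value of 0, as expected.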

Pick-a-Hand

•  There are two players: chooser (player I) & hider (player II)
•  The hider has two gold coins in his back pocket. At the beginning of a turn, he puts his hands behind his back and either takes out one coin and holds it in his left hand, or takes out both and holds them in his right hand.
•  The chooser picks a hand and wins any coins the hider has hidden there.
•  She may get nothing (if the hand is empty), or she might win one coin, or two.

Pick-a-Hand, Normal Form:

•  The hider could minimize losses by placing 1 coin in his left hand; the most he can lose is 1
•  If the chooser can figure out the hider's plan, the hider will surely lose that 1
•  If the hider thinks the chooser might strategise, he has an incentive to play R2, …
•  All the hider can guarantee is a maximum loss of 1 coin

•  Similarly, the chooser might try to maximise gain, picking R
•  However, if the hider strategizes, the chooser ends up with zero
•  So, the chooser can't actually guarantee winning anything

Pick-a-Hand, with Mixed Strategies

•  Suppose that the chooser decides to choose R with probability p and L with probability 1 − p
•  If the hider were to play the pure strategy R2, his expected loss would be 2p
•  If he were to play L1, the expected loss is 1 − p
•  The chooser maximizes her gains by choosing p so as to maximize min{2p, 1 − p}
•  Thus, by choosing R with probability 1/3 and L with probability 2/3, the chooser assures an expected payoff of 2/3, regardless of whether the hider knows her strategy

[Figure: the chooser's guaranteed payoff min{2p, 1 − p} as a function of p]

Mixed Strategy for the Hider

•  The hider will play R2 with some probability q and L1 with probability 1 − q
•  The payoff for the chooser is 2q if she picks R, and 1 − q if she picks L
•  If she knows q, she will choose the strategy corresponding to the maximum of the two values.
•  If the hider knows the chooser's plan, he will choose q = 1/3 to minimize this maximum, guaranteeing that his expected payout is 2/3 (because 2/3 = 2q = 1 − q)
•  The chooser can assure an expected gain of 2/3, and the hider can assure an expected loss of no more than 2/3, regardless of what either knows of the other's strategy.
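A small numerical check of the last two slides (the 2×2 chooser-payoff matrix is reconstructed from the verbal description; the grid search is only a sanity check, not part of the slides):

import numpy as np

# Chooser's payoff: rows = chooser picks Left or Right hand,
# columns = hider plays L1 (one coin in left hand) or R2 (both coins in right hand).
pick_a_hand = np.array([
    [1.0, 0.0],   # pick Left:  wins 1 against L1, nothing against R2
    [0.0, 2.0],   # pick Right: nothing against L1, wins 2 against R2
])

# Chooser plays Right with probability p; her guaranteed payoff is min{2p, 1 - p}.
p = np.linspace(0.0, 1.0, 1001)
guaranteed = np.minimum(2 * p, 1 - p)
print(p[np.argmax(guaranteed)], guaranteed.max())   # approx. 1/3 and 2/3

Feeding pick_a_hand to the solve_zero_sum sketch from the linear-programming slide gives the same answer: mixture (2/3, 1/3) over (Left, Right) and value 2/3.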

Safety Value as Incentive

•  Clearly, without some extra incentive, it is not in the hider's interest to play Pick-a-Hand, because he can only lose by playing.
•  Thus, we can imagine that the chooser pays the hider to entice him into joining the game.
•  2/3 is the maximum amount that the chooser should pay him in order to gain his participation.

Equilibrium as a Saddle Point

Concept: Nash Equilibrium

Nash Equilibrium

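The formal definition is on the slide image; as a hedged restatement of the standard one: a strategy profile s* = (s*1, …, s*n) is a Nash equilibrium if, for every player i and every alternative strategy si,

ui(s*i, s*−i) ≥ ui(si, s*−i)

that is, no player can improve their expected utility by deviating unilaterally while the others keep playing s*.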

Nash Equilibrium – Example

Nash Equilibrium – Example

Many well-known techniques from reinforcement learning, e.g., value/policy iteration, can still be applied to solving these games

Stochastic Games (SG)

Defined by the tuple (n, S, A1,...,n, T, R1,...,n):

•  n: number of agents
•  S: set of states
•  Ai: set of actions available to agent i; the joint action space is A = A1 × A2 × ... × An
•  T: transition dynamics, T : S × A × S → [0, 1]
•  Ri: reward function of the i-th agent, Ri : S × A → ℝ, with R = R1 × R2 × ... × Rn

We wish to learn a stationary, possibly stochastic, policy ρ : S → Pr(Ai). The objective continues to be maximization of expected future reward.
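For concreteness, a minimal sketch of how these ingredients might be held in code (the names and array layout are illustrative choices, not from the slides):

from dataclasses import dataclass
import numpy as np

@dataclass
class StochasticGame:
    n_agents: int
    n_states: int
    n_actions: tuple      # n_actions[i] = |A_i| for agent i
    T: np.ndarray         # shape (S, A_1, ..., A_n, S): joint-action transition probabilities
    R: list               # R[i] has shape (S, A_1, ..., A_n): reward for agent i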

A First Algorithm for SG Solution [Shapley]

This classic algorithm (from 1953) is akin to Value Iteration for MDPs.
–  The max operator has been replaced by "Value", which refers to the equilibrium value
–  i.e., the matrix game is being solved at each state (step 2b)
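A sketch of the two-player zero-sum case, reusing the solve_zero_sum routine sketched on the linear-programming slide (the discount factor gamma, the array layout and the fixed iteration count are assumptions made for the sketch, not details from the slide):

import numpy as np

def shapley_value_iteration(R, T, gamma=0.9, iters=200):
    """Shapley's (1953) value iteration for a two-player zero-sum stochastic game.

    R[s, a, o]      : reward to player 1 in state s under joint action (a, o)
    T[s, a, o, s2]  : probability of moving from s to s2 under (a, o)
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)
    for _ in range(iters):
        for s in range(n_states):
            # Step 2a: back up V through the dynamics to get the one-shot matrix game G(s).
            G = R[s] + gamma * np.einsum("aot,t->ao", T[s], V)
            # Step 2b: "Value" = the equilibrium value of this matrix game, not a max.
            _, V[s] = solve_zero_sum(G)
    return V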

The Policy Iteration Algorithm for SGs

•  This algorithm is akin to Policy Iteration for MDPs.
•  Each player selects an equilibrium policy according to the current value function (using the same G matrix as in Shapley's algorithm)
•  The value function is then updated based on rewards as per the equilibrium policy

Q-Learning for SGs

•  Q-learning version of Shapley’s algorithm (maintaining value over joint actions)

•  The algorithm converges to the stochastic game's equilibrium, even if the other player doesn't, provided everyone executes all actions infinitely often.
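A sketch of the corresponding update rule for the zero-sum case, again reusing the solve_zero_sum sketch (the learning-rate and discount parameters are illustrative assumptions):

def minimax_q_update(Q, V, s, a, o, r, s2, alpha=0.1, gamma=0.9):
    """One Q-learning step over joint actions for a two-player zero-sum stochastic game.

    Q[s] is a matrix over joint actions (a = own action, o = opponent's action);
    V[s2] caches the matrix-game value of Q[s2], playing the role of max in ordinary Q-learning.
    """
    Q[s][a, o] += alpha * (r + gamma * V[s2] - Q[s][a, o])
    _, V[s] = solve_zero_sum(Q[s])   # re-solve the one-shot game at the updated state
    return Q, V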

What do we do if we have no Model? Fictitious Play [Robinson '51]

•  Assumes opponents play stationary strategies
•  Maintains information about the average value of each action
•  Finds equilibria in zero-sum and some general-sum games
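A minimal sketch for a zero-sum matrix game (the smoothing of the initial counts and the tie-breaking by argmax/argmin are assumptions):

import numpy as np

def fictitious_play(R, rounds=10000):
    """Fictitious play on a zero-sum matrix game R (row player's payoffs).

    Each player best-responds to the opponent's empirical action frequencies,
    i.e., treats the opponent as playing a stationary mixed strategy.
    """
    m, n = R.shape
    row_counts, col_counts = np.ones(m), np.ones(n)
    for _ in range(rounds):
        a = np.argmax(R @ (col_counts / col_counts.sum()))    # best reply to the column history
        b = np.argmin((row_counts / row_counts.sum()) @ R)    # column player minimizes row payoff
        row_counts[a] += 1
        col_counts[b] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

For Rock-Paper-Scissors the empirical frequencies approach the uniform equilibrium mixture, consistent with Robinson's convergence result for zero-sum games.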

Summary: General Tactic for SGs

Matrix Game Solver + Temporal Differencing → Stochastic Game Solver

Summary: Many Approaches

Optional Reference / Acknowledgements

Learning algorithms for stochastic games are from the paper:
M. Bowling, M. Veloso, An analysis of stochastic game theory for multiagent reinforcement learning, CMU-CS-00-165, 2000.

Several slides are adapted from the following sources:
•  Tutorial at IJCAI 2003 by Prof Peter Stone, University of Texas
•  Y. Peres, Game Theory, Alive (Lecture Notes)
