Reinforcement Learning
Multi-agent Reinforcement Learning
Subramanian Ramamoorthy, School of Informatics
28 March 2017
Agents often face Strategic Adversaries
Key issue we seek to model: misaligned/conflicting interests
On Self-Interest
What does it mean to say that agents are self-interested?
• It does not necessarily mean that they want to cause harm to each other, or even that they care only about themselves.
• Instead, it means that each agent has his own description of which states of the world he likes, which can include good things happening to other agents, and that he acts in an attempt to bring about these states of the world (a better term: inter-dependent decision making)
A Simple Model of a Game
• Two decision makers:
– Robot (has an action space: a)
– Adversary (has an action space: θ)
• Cost or payoff (to use the term common in game theory) depends on the actions of both decision makers: R(a, θ), denoted as a matrix corresponding to the product space
This is the normal form – simultaneous choice over moves
Representing Payoffs
In a general, bi-matrix, normal-form game:
The combined actions (a1, a2, …, an) form an action profile a ∈ A
Action sets of players: A_i. Payoff function (a.k.a. utility): u_i(a)
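The formal definition on this slide did not survive extraction; the standard statement it presumably showed: a (finite, n-person) normal-form game is a tuple (N, A, u), where N = {1, …, n} is the set of players, A = A_1 × ··· × A_n with A_i the finite action set of player i, and u = (u_1, …, u_n), where u_i : A → ℝ is the payoff (utility) function of player i.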
Example: Rock-Paper-Scissors
• Famous children's game
• Two players; each player simultaneously picks an action, which is evaluated as follows:
– Rock beats Scissors
– Scissors beats Paper
– Paper beats Rock
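For illustration (not on the slide), the payoffs written as a bi-matrix, scoring a win +1, a loss −1, and a tie 0:

# Rock-Paper-Scissors as a bi-matrix game. Entry [i][j] holds the
# (row player, column player) payoffs; action order: Rock, Paper, Scissors.
rps = [
    [(0, 0),  (-1, 1), (1, -1)],   # Rock   vs Rock, Paper, Scissors
    [(1, -1), (0, 0),  (-1, 1)],   # Paper
    [(-1, 1), (1, -1), (0, 0)],    # Scissors
]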
TCP Game
• Imagine there are only two internet users: you and me
• Internet traffic is governed by the TCP protocol, one feature of which is the backoff mechanism: when the network is congested, back off and reduce transmission rates for a while
• Imagine that there are two implementations: C (correct, does what is intended) and D (defective)
• If you both adopt C, packet delay is 1 ms; if you both adopt D, packet delay is 3 ms
• If one adopts C but the other adopts D, then the D user gets no delay and the C user suffers a 4 ms delay
TCP Game in Normal Form
Note that this is another way of writing a bi-matrix game: the first number represents the payoff of the row player and the second number the payoff of the column player
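The matrix itself did not survive extraction; reconstructed from the delays stated above, with payoff = negative delay (in ms) and the row player's payoff listed first:

            C          D
   C     -1, -1     -4,  0
   D      0, -4     -3, -3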
Some Famous Matrix Examples - What are they Capturing?
• Prisoner's Dilemma: Cooperate or Defect (same as the TCP game)
• Bach or Stravinsky (von Neumann called it Battle of the Sexes)
• Matching Pennies: try to get the same outcome, Heads/Tails
Different Categorization: Common Payoff
A common-payoff game is a game in which for all action profiles a ∈ A_1 × ··· × A_n and any pair of agents i, j, it is the case that u_i(a) = u_j(a)
Pure coordination: e.g., driving on a side of the road
Different Categorization: Constant Sum
A two-player normal-form game is constant-sum if there exists a constant c such that for each strategy profile a ∈ A_1 × A_2 it is the case that u_1(a) + u_2(a) = c
Pure competition: one player wants to coordinate; the other player does not!
Defining the "action space"
Strategies
Expected utility
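The body of these two slides (images) did not extract; the standard definitions they presumably showed: a mixed strategy s_i for player i is a probability distribution over A_i, and the expected utility of a mixed-strategy profile s = (s_1, …, s_n) is

$$u_i(s) = \sum_{a \in A} u_i(a) \prod_{j=1}^{n} s_j(a_j)$$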
Solution Concepts
Many ways of describing what one ought to do:
– Dominance
– Minimax
– Pareto Efficiency
– Nash Equilibria
– Correlated Equilibria
Remember that, in the end, game theory aspires to predict behaviour given a specification of the game.
Normatively, a solution concept is a rationale for behaviour
Concept: Dominance
Concept: Iterated Dominance
Concept: Minimax
Minimax
Computing Minimax: Linear Programming
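The slide's LP formulation did not extract; below is a minimal sketch of the standard linear program for a two-player zero-sum matrix game, implemented with scipy.optimize.linprog (the helper name solve_zero_sum is illustrative, not from the slides). The row player maximizes a value v that her mixed strategy x must guarantee against every column action:

import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Return (x, v): maximin mixed strategy and game value for the row player."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    # Decision variables: x_1..x_m (strategy) and v (value). linprog minimizes,
    # so we minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every column j:  v - sum_i x_i * A[i, j] <= 0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]  # x >= 0, v unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Rock-Paper-Scissors: the optimal strategy is uniform and the value is 0.
x, v = solve_zero_sum([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(x, v)  # approx [1/3, 1/3, 1/3], 0.0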
Pick-a-Hand
• There are two players: chooser (player I) & hider (player II)
• The hider has two gold coins in his back pocket. At the beginning of a turn, he puts his hands behind his back and either takes out one coin and holds it in his left hand, or takes out both and holds them in his right hand.
• The chooser picks a hand and wins any coins the hider has hidden there.
• She may get nothing (if the hand is empty), or she might win one coin, or two.
Pick-a-Hand, Normal Form:
• Hider could minimize losses by placing 1 coin in his left hand; the most he can lose is 1
• If chooser can figure out hider's plan, he will surely lose that 1
• If hider thinks chooser might strategise, he has an incentive to play R2, …
• All hider can guarantee is a max loss of 1 coin
• Similarly, chooser might try to maximise gain, picking R
• However, if hider strategizes, chooser ends up with zero
• So, chooser can't actually guarantee winning anything
Pick-a-Hand, with Mixed Strategies
• Suppose that chooser decides to choose R with probability p and L with probability 1 − p
• If hider were to play pure strategy R2, his expected loss would be 2p
• If he were to play L1, expected loss is 1 − p
• Chooser maximizes her gains by choosing p so as to maximize min{2p, 1 − p}
• Thus, by choosing R with probability 1/3 and L with probability 2/3, chooser assures an expected payoff of 2/3, regardless of whether hider knows her strategy
[Figure: chooser's expected payoff as a function of p; the lines 2p and 1 − p cross at p = 1/3]
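Worked out (not shown on the slide): the maximin choice equalizes the two expected payoffs:

$$2p = 1 - p \;\Rightarrow\; p = \tfrac{1}{3}, \qquad \min\{2p,\, 1-p\} = \tfrac{2}{3}$$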
Mixed Strategy for the Hider
• Hider will play R2 with some probability q and L1 with probability 1 − q
• The payoff for chooser is 2q if she picks R, and 1 − q if she picks L
• If she knows q, she will choose the strategy corresponding to the maximum of the two values.
• If hider knows chooser's plan, he will choose q = 1/3 to minimize this maximum, guaranteeing that his expected payout is 2/3 (because 2/3 = 2q = 1 − q)
• Chooser can assure an expected gain of 2/3, hider can assure an expected loss of no more than 2/3, regardless of what either knows of the other's strategy.
Safety Value as Incentive
• Clearly, without some extra incentive, it is not in hider's interest to play Pick-a-Hand, because he can only lose by playing.
• Thus, we can imagine that chooser pays hider to entice him into joining the game.
• 2/3 is the maximum amount that chooser should pay him in order to gain his participation.
Equilibrium as a Saddle Point
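The slide body did not extract; for Pick-a-Hand, the saddle-point property it presumably illustrated is that the chooser's guaranteed gain and the hider's guaranteed cap coincide at the game's value:

$$\max_{p}\,\min_{q}\,\mathbb{E}[\text{payoff}] \;=\; \min_{q}\,\max_{p}\,\mathbb{E}[\text{payoff}] \;=\; \tfrac{2}{3}$$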
Concept: Nash Equilibrium
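The definition itself did not extract; the standard statement: a strategy profile s* = (s_1*, …, s_n*) is a Nash equilibrium if, for every agent i and every alternative strategy s_i,

$$u_i(s_i^*, s_{-i}^*) \;\ge\; u_i(s_i, s_{-i}^*)$$

i.e., no agent can improve its payoff by deviating unilaterally.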
Nash Equilibrium
Nash Equilibrium - Example
Nash Equilibrium - Example
Many well-known techniques from reinforcement learning, e.g., value/policy iteration, can still be applied to solving these games
Stochastic Games (SG)
Defined by the tuple (n, S, A_1, …, A_n, T, R_1, …, R_n):
– n: the number of agents
– S: the set of states
– A_i: the set of actions available to agent i; the joint action space is A = A_1 × A_2 × … × A_n
– T: the transition dynamics, T : S × A × S → [0, 1]
– R_i: the reward function of the i-th agent, R_i : S × A → ℝ, with R = R_1 × R_2 × … × R_n
We wish to learn a stationary, possibly stochastic, policy ρ : S → Pr(A_i); the objective continues to be maximization of expected future reward.
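As a concrete illustration (not from the slides), a minimal container for this tuple in Python; all names are illustrative, not from any library:

# A minimal sketch of the stochastic-game tuple (n, S, A_1..A_n, T, R_1..R_n),
# stored as nested dictionaries.
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = int
JointAction = Tuple[int, ...]  # one action index per agent

@dataclass
class StochasticGame:
    n_agents: int
    states: List[State]                 # S
    n_actions: Tuple[int, ...]          # |A_i| for each agent i
    # T[s][a][s'] = probability of moving from s to s' under joint action a
    T: Dict[State, Dict[JointAction, Dict[State, float]]]
    # R[i][s][a] = reward to agent i for joint action a taken in state s
    R: Dict[int, Dict[State, Dict[JointAction, float]]]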
A First Algorithm for SG Solution [Shapley]
This classic algorithm (from 1953) is akin to Value Iteration for MDPs:
- the max operator has been replaced by "Value", which refers to the equilibrium value,
- i.e., the matrix game is being solved at each state (step 2b)
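A minimal sketch (not the slide's pseudocode), assuming a two-player zero-sum game with dense arrays R[s, a1, a2] and P[s, a1, a2, s']; it reuses the solve_zero_sum() helper from the linear-programming sketch above:

import numpy as np

def shapley_value_iteration(R, P, gamma=0.9, iters=200):
    """Value iteration where the max operator is replaced by the matrix-game value.

    R[s, a1, a2]: row player's reward; P[s, a1, a2, s2]: transition probabilities.
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)
    for _ in range(iters):
        for s in range(n_states):
            # Step 2b: build the auxiliary matrix game G(s) of immediate reward
            # plus discounted expected value of the successor state ...
            G = R[s] + gamma * P[s] @ V
            # ... and "solve" it: its equilibrium value becomes the new V(s).
            _, V[s] = solve_zero_sum(G)
    return V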
The Policy Iteration Algorithm for SGs
• This algorithm is akin to Policy Iteration for MDPs.
• Each player selects an equilibrium policy according to the current value function (using the same G matrix as in Shapley's algorithm).
• The value function is then updated based on rewards as per the equilibrium policy.
Q-Learning for SGs
• Q-learning version of Shapley's algorithm (maintaining value over joint actions)
• The algorithm converges to the stochastic game's equilibrium even if the other player doesn't converge, provided everyone executes all actions infinitely often.
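A minimal sketch (not the slide's pseudocode) of the corresponding update, again for the two-player zero-sum case and reusing solve_zero_sum() from the linear-programming sketch:

import numpy as np

def minimax_q_update(Q, V, s, a1, a2, r, s2, alpha=0.1, gamma=0.9):
    """One Q-learning step on transition (s, a1, a2, r, s2); updates Q, V in place.

    Q[s, a1, a2] stores value over *joint* actions, as in Shapley's algorithm.
    """
    Q[s, a1, a2] += alpha * (r + gamma * V[s2] - Q[s, a1, a2])
    # Re-solve the matrix game at s so V(s) tracks the equilibrium value of Q[s].
    _, V[s] = solve_zero_sum(Q[s])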
What do we do if we have no Model? Fictitious Play [Robinson '51]
• Assumes opponents play stationary strategies
• Maintains information about the average value of each action
• Finds equilibria in zero-sum and some general-sum games
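A minimal sketch, assuming a two-player zero-sum game given by a payoff matrix A for the row player (the function name is illustrative):

import numpy as np

def fictitious_play(A, rounds=10000):
    """Each player best-responds to the opponent's empirical action frequencies."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    row_counts = np.ones(m)  # empirical counts (initialized uniformly)
    col_counts = np.ones(n)
    for _ in range(rounds):
        i = np.argmax(A @ (col_counts / col_counts.sum()))   # row best response
        j = np.argmin((row_counts / row_counts.sum()) @ A)   # column best response
        row_counts[i] += 1
        col_counts[j] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

# For Rock-Paper-Scissors the empirical frequencies approach (1/3, 1/3, 1/3).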
Summary: General Tactic for SGs
[Diagram: Matrix Game Solver + Temporal Differencing → Stochastic Game Solver]
Summary: Many Approaches
Optional Reference / Acknowledgements
Learning algorithms for stochastic games are from the paper:
M. Bowling, M. Veloso, An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning, Technical Report CMU-CS-00-165, Carnegie Mellon University, 2000.
Several slides are adapted from the following sources:
• Tutorial at IJCAI 2003 by Prof Peter Stone, University of Texas
• Y. Peres, Game Theory, Alive (Lecture Notes)