
State-Coupled Replicator Dynamics

Daniel Hennes
Eindhoven University of Technology
P.O. Box 513, 5600 MB, Eindhoven, The Netherlands
[email protected]

Karl Tuyls
Eindhoven University of Technology
P.O. Box 513, 5600 MB, Eindhoven, The Netherlands
[email protected]

Matthias Rauterberg
Eindhoven University of Technology
P.O. Box 513, 5600 MB, Eindhoven, The Netherlands
[email protected]

ABSTRACT

This paper introduces a new model, i.e. state-coupled replicator dynamics, expanding the link between evolutionary game theory and multiagent reinforcement learning to multi-state games. More precisely, it extends and improves previous work on piecewise replicator dynamics, a combination of replicators and piecewise models. The contributions of the paper are twofold. One, we identify and explain the major shortcomings of piecewise replicators, i.e. discontinuities and occurrences of qualitative anomalies. Two, this analysis leads to the proposal of the new model for learning dynamics in stochastic games, named state-coupled replicator dynamics. The preceding formalization of piecewise replicators - general in the number of agents and states - is factored into the new approach. Finally, we deliver a comparative study of finite action-set learning automata to piecewise and state-coupled replicator dynamics. Results show that state-coupled replicators model learning dynamics in stochastic games more accurately than their predecessor, the piecewise approach.

Categories and Subject Descriptors

I.2.6 [Computing Methodologies]: Artificial Intelligence—Learning

General Terms

Algorithms, Theory

Keywords

Multi-agent learning, Evolutionary game theory, Replicator dynamics, Stochastic games

1. INTRODUCTION
The learning performance of contemporary reinforcement learning techniques has been studied in great depth experimentally as well as formally for a diversity of single agent control tasks [5]. Markov decision processes provide a mathematical framework to study single agent learning. However, in general they are not applicable to multi-agent learning. Once multiple adaptive agents simultaneously interact with each other and the environment, the process becomes highly dynamic and non-deterministic, thus violating the Markov property. Evidently, there is a strong need for an adequate theoretical framework modeling multi-agent learning. Recently, an evolutionary game theoretic approach has been employed to fill this gap [6]. In particular, in [1] the authors have derived a formal relation between multi-agent reinforcement learning and the replicator dynamics. The relation between replicators and reinforcement learning has been extended to different algorithms such as learning automata and Q-learning in [7].

Exploiting the link between reinforcement learning and evolutionary game theory is beneficial for a number of reasons. The majority of state-of-the-art reinforcement learning algorithms are black-box models. This makes it difficult to gain detailed insight into the learning process, and parameter tuning becomes a cumbersome task. Analyzing the learning dynamics helps to determine parameter configurations prior to actual employment in the task domain. Furthermore, the possibility to formally analyze multi-agent learning helps to derive and compare new algorithms, which has been successfully demonstrated for lenient Q-learning in [4].

The main limitation of this game theoretic approach to multi-agent systems is its restriction to stateless repeated games. Even though real-life tasks might be modeled statelessly, the majority of such problems naturally relates to multi-state situations. Vrancx et al. [9] have made the first attempt to extend replicator dynamics to multi-state games. More precisely, the authors have combined replicator dynamics and piecewise dynamics, called piecewise replicator dynamics, to model the learning behavior of agents in stochastic games. Recently, we have formalized this promising proof of concept in [2].

Piecewise models are a methodology in the area of dynamical system theory. The core concept is to partition the state space of a dynamical system into cells. The behavior of a dynamical system can then be described as the state vector's movement through this collection of cells. Dynamics within each cell are determined by the presence of an attractor or repeller. Piecewise linear systems make the assumption that each cell is governed by a specific attractor and that the induced dynamics can be approximated linearly.

In this work, we demonstrate the major shortcomings of piecewise modeling in the domain of replicator dynamics and subsequently propose a new method, named state-coupled replicator dynamics. Grounded on the formalization of piecewise replicators, the new model describes the direct coupling between states and thus overcomes the problem of anomalies induced by approximation.

Cite as: State-Coupled Replicator Dynamics, Daniel Hennes, Karl Tuyls, Matthias Rauterberg, Proc. of 8th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2009), Decker, Sichman, Sierra and Castelfranchi (eds.), May 10–15, 2009, Budapest, Hungary, pp. 789–796. Copyright © 2009, International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.


The rest of this article is organized as follows. Section 2 provides background information about the game theoretical framework and the theory of learning automata. In Section 3 we formally introduce piecewise replicator dynamics and hereafter discuss their shortcomings in Section 4. Section 5 presents the new state-coupled model and delivers a comparative study of piecewise and state-coupled replicator dynamics to learning automata. Section 6 concludes this article.

2. BACKGROUND
In this section, we summarize required background knowledge from the fields of multi-agent learning and evolutionary game theory. In particular, we consider an individual level of analogy between the related concepts of learning and evolution. Each agent has a set of possible strategies at hand. Which strategies are favored over others depends on the experience the agent has previously gathered by interacting with the environment and other agents. The pool of possible strategies can be interpreted as a population in an evolutionary game theory perspective. The dynamical change of preferences within the set of strategies can be seen as the evolution of this population as described by the replicator dynamics (Section 2.1). We leverage the theoretical framework of stochastic games (Section 2.2) to model this learning process and use learning automata as an example for reinforcement learning (Section 2.3).

2.1 Replicator dynamics
The continuous time two-population replicator dynamics are defined by the following system of ordinary differential equations:

$$\frac{d\pi_i}{dt} = \left[(A\sigma)_i - \pi' A \sigma\right]\pi_i, \qquad \frac{d\sigma_i}{dt} = \left[(B\pi)_i - \sigma' B \pi\right]\sigma_i, \qquad (1)$$

where A and B are the payoff matrices for player 1 and 2 respectively. The probability vector π describes the frequency of all pure strategies (replicators) for player 1. Success of a replicator i is measured by the difference between its current payoff $(A\sigma)_i$ and the average payoff of the entire population π against the strategy of player 2, $\pi' A \sigma$.
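As an illustration of (1), the following minimal sketch (our own, not part of the paper) integrates the dynamics with a simple Euler scheme for the symmetric Prisoners' Dilemma payoffs that reappear in Section 2.2.1. Note that in (1) both matrices are indexed by (own action, opponent action), so B equals A for a symmetric game.

```python
import numpy as np

def replicator_step(pi, sigma, A, B, dt=0.01):
    """One Euler step of the two-population replicator dynamics (1)."""
    dpi = ((A @ sigma) - pi @ A @ sigma) * pi        # success of player 1's replicators
    dsigma = ((B @ pi) - sigma @ B @ pi) * sigma     # success of player 2's replicators
    return pi + dt * dpi, sigma + dt * dsigma

# Symmetric Prisoners' Dilemma (actions: cooperate, defect); own action indexes the rows.
A = np.array([[3.0, 0.0], [10.0, 2.0]])
B = A.copy()
pi, sigma = np.array([0.6, 0.4]), np.array([0.5, 0.5])
for _ in range(5000):
    pi, sigma = replicator_step(pi, sigma, A, B)
print(pi, sigma)   # both populations drift toward the second replicator (defect)
```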

2.2 Stochastic games
Stochastic games allow us to model multi-state problems in an abstract manner. The concept of repeated games is generalized by introducing probabilistic switching between multiple states. In each stage, the game is in a specific state featuring a particular payoff function and an admissible action set for each player. Players take actions simultaneously and hereafter receive an immediate payoff depending on their joint action. A transition function maps the joint action space to a probability distribution over all states, which in turn determines the probabilistic state change. Thus, similar to a Markov decision process, actions influence the state transitions. A formal definition of stochastic games (also called Markov games) is given below.

Definition 1. The game $G = \langle n, S, A, q, \tau, \pi^1 \ldots \pi^n \rangle$ is a stochastic game with n players and k states. In each state $s \in S = (s^1, \ldots, s^k)$ each player i chooses an action $a^i$ from its admissible action set $A^i(s)$ according to its strategy $\pi^i(s)$.
The payoff function $\tau(s, a): \prod_{i=1}^{n} A^i(s) \mapsto \mathbb{R}^n$ maps the joint action $a = (a^1, \ldots, a^n)$ to an immediate payoff value for each player.
The transition function $q(s, a): \prod_{i=1}^{n} A^i(s) \mapsto \Delta^{k-1}$ determines the probabilistic state change, where $\Delta^{k-1}$ is the $(k-1)$-simplex and $q_{s'}(s, a)$ is the transition probability from state s to s' under joint action a.

In this work we restrict our consideration to the set of games where all states $s \in S$ are in the same ergodic set. The motivation for this restriction is twofold. In the presence of more than one ergodic set one could analyze the corresponding sub-games separately. Furthermore, the restriction ensures that the game has no absorbing states. Games with absorbing states are of no particular interest with respect to evolution or learning since any type of exploration will eventually lead to absorption. The formal definition of an ergodic set in stochastic games is given below.

Definition 2. In the context of a stochastic game G, $E \subseteq S$ is an ergodic set if and only if the following conditions hold: (a) for all $s \in E$, if G is in state s at stage t, then at $t + 1$: $\Pr(G \text{ in some state } s' \in E) = 1$, and (b) for all proper subsets $E' \subset E$, (a) does not hold.

Note that in repeated games player i either tries to maximize the limit of the average of stage rewards

$$\max_{\pi^i} \; \liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \tau^i(t) \qquad (2)$$

or the discounted sum of stage rewards $\sum_{t=1}^{T} \tau^i(t)\,\delta^{t-1}$ with $0 < \delta < 1$, where $\tau^i(t)$ is the immediate stage reward for player i at time step t. While the latter is commonly used in Q-learning, this work regards the former to derive a temporal difference reward update for learning automata in Section 2.3.1.
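As a tiny numerical illustration of the two criteria (the stage rewards below are hypothetical and only serve the example):

```python
rewards = [3, 0, 10, 2, 2, 3]   # hypothetical stage rewards tau^i(t), t = 1..T
delta = 0.9                     # discount factor, 0 < delta < 1

average = sum(rewards) / len(rewards)                                   # (1/T) sum_t tau^i(t)
discounted = sum(r * delta ** (t - 1) for t, r in enumerate(rewards, start=1))
print(average, discounted)      # approx. 3.33 (limit average) vs 15.64 (discounted sum)
```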

2.2.1 2-State Prisoners' Dilemma
The 2-State Prisoners' Dilemma is a stochastic game for two players. The payoff matrices are given by

$$\left(A^1, B^1\right) = \begin{pmatrix} 3,3 & 0,10 \\ 10,0 & 2,2 \end{pmatrix}, \qquad \left(A^2, B^2\right) = \begin{pmatrix} 4,4 & 0,10 \\ 10,0 & 1,1 \end{pmatrix},$$

where $A^s$ determines the payoff for player 1 and $B^s$ for player 2 in state s. The first action of each player is cooperate and the second is defect. Player 1 receives $\tau^1(s, a) = A^s_{a^1, a^2}$ while player 2 gets $\tau^2(s, a) = B^s_{a^1, a^2}$ for a given joint action $a = (a^1, a^2)$. Similarly, the transition probabilities are given by the matrices $Q^{s \to s'}$, where $q_{s'}(s, a) = Q^{s \to s'}_{a^1, a^2}$ is the probability for a transition from state s to state s':

$$Q^{s^1 \to s^2} = \begin{pmatrix} 0.1 & 0.9 \\ 0.9 & 0.1 \end{pmatrix}, \qquad Q^{s^2 \to s^1} = \begin{pmatrix} 0.1 & 0.9 \\ 0.9 & 0.1 \end{pmatrix}.$$

The probabilities to continue in the same state after the transition are $q_{s^1}(s^1, a) = Q^{s^1 \to s^1}_{a^1, a^2} = 1 - Q^{s^1 \to s^2}_{a^1, a^2}$ and $q_{s^2}(s^2, a) = Q^{s^2 \to s^2}_{a^1, a^2} = 1 - Q^{s^2 \to s^1}_{a^1, a^2}$.
Essentially a Prisoners' Dilemma is played in both states, and if regarded separately defect is still a dominating strategy. One might assume that the Nash equilibrium strategy in this game is to defect at every stage.


However, the only pure stationary equilibria in this game reflect strategies where one of the players defects in one state while cooperating in the other and the second player does exactly the opposite. Hence, a player betrays his opponent in one state while being exploited himself in the other state.
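For concreteness, the game data above can be encoded directly as arrays (a sketch; the array layout and function name are ours, not the paper's):

```python
import numpy as np

# Payoff bimatrices: A[s][a1, a2] is player 1's payoff and B[s][a1, a2] player 2's
# in state s, with actions 0 = cooperate and 1 = defect.
A = [np.array([[3.0, 0.0], [10.0, 2.0]]), np.array([[4.0, 0.0], [10.0, 1.0]])]
B = [np.array([[3.0, 10.0], [0.0, 2.0]]), np.array([[4.0, 10.0], [0.0, 1.0]])]

# Q[s][a1, a2] is the probability of switching to the other state;
# the game stays in s with probability 1 - Q[s][a1, a2].
Q = [np.array([[0.1, 0.9], [0.9, 0.1]]), np.array([[0.1, 0.9], [0.9, 0.1]])]

def next_state_distribution(s, a1, a2):
    """Return the distribution (p_state0, p_state1) over successor states."""
    switch = Q[s][a1, a2]
    dist = [0.0, 0.0]
    dist[1 - s] = switch
    dist[s] = 1.0 - switch
    return dist
```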

2.2.2 Common Interest Game
Another 2-player, 2-action and 2-state game is the Common Interest Game. Payoff and transition matrices are given below. Note that both players receive identical immediate rewards.

$$A^1 = B^1 = \begin{pmatrix} 5 & 6 \\ 6 & 7 \end{pmatrix}, \qquad A^2 = B^2 = \begin{pmatrix} 0 & 10 \\ 5 & 0 \end{pmatrix}$$

$$Q^{s^1 \to s^2} = \begin{pmatrix} 0.9 & 0.9 \\ 0.9 & 0.1 \end{pmatrix}, \qquad Q^{s^2 \to s^1} = \begin{pmatrix} 0.1 & 0.1 \\ 0.1 & 0.9 \end{pmatrix}$$

In this game the highest payoff is gained in state 2 under joint action (1, 2), which is associated with a low state transition probability $Q^{s^2 \to s^1}_{1,2} = 0.1$. If however the state is switched, players are encouraged to play either joint action (1, 2) or (2, 1) in order to transition back to state 2 with a high probability. While joint action (2, 2) maximizes immediate payoff in state 1, the low transition probability $Q^{s^1 \to s^2}_{2,2} = 0.1$ hinders the return to state 2.

2.3 Learning automata
A learning automaton (LA) uses the basic policy iteration reinforcement learning scheme. An initial random policy is used to explore the environment; by monitoring the reinforcement signal the policy is updated in order to learn the optimal policy and maximize the expected reward.

In this paper we focus on finite action-set learning automata (FALA). FALA are model-free, stateless and independent learners. This means interacting agents do not model each other; they only act upon the experience collected by experimenting with the environment. Furthermore, no environmental state is considered, which means that the perception of the environment is limited to the reinforcement signal. While these restrictions are not negligible, they allow for simple algorithms that can be treated analytically. Convergence for learning automata in single and specific multi-agent cases has been proven in [3].

The class of finite action-set learning automata considers only automata that optimize their policies over a finite action-set $A = \{1, \ldots, k\}$ with k some finite integer. One optimization step, called an epoch, is divided into two parts: action selection and policy update. At the beginning of an epoch t, the automaton draws a random action a(t) according to the probability distribution π(t), called the policy. Based on the action a(t), the environment responds with a reinforcement signal τ(t), called the reward. Hereafter, the automaton uses the reward τ(t) to update π(t) to the new policy π(t+1). The update rule for FALA using the linear reward-inaction ($L_{R-I}$) scheme is given below:

$$\pi_i(t+1) = \pi_i(t) + \begin{cases} \alpha\,\tau(t)\,(1 - \pi_i(t)) & \text{if } a(t) = i \\ -\alpha\,\tau(t)\,\pi_i(t) & \text{otherwise,} \end{cases}$$

where $\tau(t) \in [0, 1]$. The reward parameter $\alpha \in [0, 1]$ determines the learning rate.
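A minimal sketch of the $L_{R-I}$ scheme (the function names are ours):

```python
import random

def select_action(policy):
    """Draw an action a(t) according to the current policy pi(t)."""
    return random.choices(range(len(policy)), weights=policy, k=1)[0]

def lri_update(policy, action, reward, alpha):
    """Linear reward-inaction update; reward is tau(t) in [0, 1], alpha the learning rate."""
    return [p + alpha * reward * (1.0 - p) if i == action
            else p - alpha * reward * p
            for i, p in enumerate(policy)]

policy = [0.5, 0.5]
a = select_action(policy)
policy = lri_update(policy, a, reward=1.0, alpha=0.001)
```

Since the increase of the played action's probability equals the total decrease of the other probabilities, the policy remains a valid probability vector after every update.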

Situating automata in stateless games is straightforward and only a matter of unifying the different taxonomies of game theory and the theory of learning automata (e.g. "player" and "agent" are interchangeable, as are "payoff" and "reward"). However, multi-state games require an extension of the stateless FALA model.

2.3.1 Networks of learning automata
For each agent, we use a network of automata in which control is passed on from one automaton to another [8]. An agent associates a dedicated learning automaton with each state of the game. This LA tries to optimize the policy in that state using the standard update rule given in Section 2.3. Only a single LA is active and selects an action at each stage of the game. However, the immediate reward from the environment is not directly fed back to this LA. Instead, when the LA becomes active again, i.e. the next time the same state is played, it is informed about the cumulative reward gathered since the last activation and the time that has passed by.

The reward feedback $\tau^i$ for agent i's automaton $LA^i(s)$ associated with state s is defined as

$$\tau^i(t) = \frac{\Delta r^i}{\Delta t} = \frac{\sum_{l=t_0(s)}^{t-1} r^i(l)}{t - t_0(s)}, \qquad (3)$$

where $r^i(t)$ is the immediate reward for agent i in epoch t and $t_0(s)$ is the last-occurrence function, which determines when state s was visited last. The reward feedback in epoch t equals the cumulative reward $\Delta r^i$ divided by the time-frame $\Delta t$. The cumulative reward $\Delta r^i$ is the sum over all immediate rewards gathered in all states, beginning with epoch $t_0(s)$ and including the last epoch $t - 1$. The time-frame $\Delta t$ measures the number of epochs that have passed since automaton $LA^i(s)$ was last active. This means the state policy is updated using the average stage reward over the interim immediate rewards.
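One way an agent might bookkeep this feedback is sketched below (the class and method names are ours, not the paper's):

```python
class StateAutomaton:
    """Per-state bookkeeping for the reward feedback (3) of one agent."""

    def __init__(self, n_actions):
        self.policy = [1.0 / n_actions] * n_actions
        self.last_visit = None      # t0(s): epoch of the previous activation
        self.cum_reward = 0.0       # Delta r: rewards gathered since then

    def accumulate(self, r):
        """Add the immediate reward r(l) of an interim epoch l."""
        self.cum_reward += r

    def feedback(self, t):
        """Average stage reward Delta r / Delta t since the last activation, or None."""
        if self.last_visit is None or t == self.last_visit:
            return None
        return self.cum_reward / (t - self.last_visit)

    def activate(self, t):
        """Reset the bookkeeping when state s becomes active again at epoch t."""
        self.last_visit = t
        self.cum_reward = 0.0
```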

3. PIECEWISE REPLICATOR DYNAMICS
As outlined in the previous section, agents maintain an independent policy for each state, and this consequently leads to a very high dimensional problem. Piecewise replicator dynamics analyze the dynamics per state in order to cope with this problem. For each state of a stochastic game a so-called average reward game is derived. An average reward game determines the expected reward for each joint action in a given state, assuming fixed strategies in all other states. This method projects the limit average reward of a stochastic game onto a stateless normal-form game which can be analyzed using the multi-population replicator dynamics given in (1).

In general we cannot assume that strategies are fixed in all but one state. Agents adapt their policies in all states in parallel, and therefore the average reward game along with the linked replicator dynamics change as well. The core idea of piecewise replicator dynamics is to partition the strategy space into cells, where each cell corresponds to a set of attractors in the average reward game. This approach is based on the methodology of piecewise dynamical systems.

In dynamical system theory, the state vector of a system eventually enters an area of attraction and becomes subject to the influence of this attractor. In case of piecewise replicator dynamics the state vector is an element of the strategy space and attractors resemble equilibrium points in the average reward game. It is assumed that the dynamics in each cell are governed by a set of equilibria and therefore we can qualitatively describe the dynamics of each cell by a set of replicator equations.

Page 4: State-Coupled Replicator Dynamics€¦ · the evolution of this population as described by the repli-cator dynamics (Section 2.1). We leverage the theoretical framework of stochastic

AAMAS 2009 • 8th International Conference on Autonomous Agents and Multiagent Systems • 10–15 May, 2009 • Budapest, Hungary

792

We use this approach to model learning dynamics in stochastic games as follows. For each state of a stochastic game we derive the average reward game (Section 3.1) and consider the strategy space over all joint actions for all other states. This strategy space is then partitioned into cells (Section 3.2), where each cell corresponds to a set of equilibrium points in the average reward game. We sample the strategy space of each cell (Section 3.3) and compute the corresponding limit average reward for each joint action, eventually leading to a set of replicator equations for each cell (Section 3.4).

More precisely, each state features a number of cells, each related to a set of replicator dynamics. For each state, a single cell is active and the associated replicator equations determine the dynamics in that state, while the active cell of a particular state is exclusively determined by the strategies in all other states. Strategy changes occur in all states in parallel and hence mutually induce cell switching.

3.1 Average reward game
For a repeated automata game, the objective of player i at stage $t_0$ is to maximize the limit average reward $\bar{\tau}^i = \liminf_{T \to \infty} \frac{1}{T} \sum_{t=t_0}^{T} \tau^i(t)$ as defined in (2). The scope of this paper is restricted to stochastic games where the sequence of game states X(t) is ergodic (see Section 2.2). Hence, there exists a stationary distribution x over all states, where fraction $x_s$ determines the frequency of state s in X. Therefore, we can rewrite $\bar{\tau}^i$ as $\bar{\tau}^i = \sum_{s \in S} x_s P^i(s)$, where $P^i(s)$ is the expected payoff of player i in state s.

In piecewise replicator dynamics, states are analyzed separately to cope with the high dimensionality. Thus, let us assume the game is in state s at stage $t_0$ and players play a given joint action a in s and fixed strategies $\pi(s')$ in all states but s. Then the limit average payoff becomes

$$\bar{\tau}(s, a) = x_s \tau(s, a) + \sum_{s' \in S - \{s\}} x_{s'} P^i(s'), \qquad (4)$$

where

$$P^i(s') = \sum_{a' \in \prod_{i=1}^{n} A^i(s')} \left( \tau^i(s', a') \prod_{i=1}^{n} \pi^i_{a'_i}(s') \right).$$

An intuitive explanation of (4) goes as follows. At each stage players consider the infinite horizon of payoffs under current strategies. We untangle the current state s from all other states $s' \neq s$, and the limit average payoff $\bar{\tau}$ becomes the sum of the immediate payoff for joint action a in state s and the expected payoffs in all other states. Payoffs are weighted by the frequency of corresponding state occurrences. Thus, if players invariably play joint action a every time the game is in state s and their fixed strategies $\pi(s')$ for all other states, the limit average reward for $T \to \infty$ is expressed by (4).

Since a specific joint action a is played in state s, the stationary distribution x depends on s and a as well. A formal definition is given below.

Definition 3. For $G = \langle n, S, A, q, \tau, \pi^1 \ldots \pi^n \rangle$ where S itself is the only ergodic set in $S = (s^1 \ldots s^k)$, we say x(s, a) is a stationary distribution of the stochastic game G if and only if $\sum_{z \in S} x_z(s, a) = 1$ and

$$x_z(s, a) = x_s(s, a)\, q_z(s, a) + \sum_{s' \in S - \{s\}} x_{s'}(s, a)\, Q_z(s'),$$

where

$$Q_z(s') = \sum_{a' \in \prod_{i=1}^{n} A^i(s')} \left( q_z(s', a') \prod_{i=1}^{n} \pi^i_{a'_i}(s') \right).$$

Based on this notion of a stationary distribution and (4) we can define the average reward game as follows.

Definition 4. For a stochastic game G where S itself is the only ergodic set in $S = (s^1 \ldots s^k)$, we define the average reward game for some state $s \in S$ as the normal-form game

$$G(s, \pi^1 \ldots \pi^n) = \langle n, A^1(s) \ldots A^n(s), \bar{\tau}, \pi^1(s) \ldots \pi^n(s) \rangle,$$

where each player i plays a fixed strategy $\pi^i(s')$ in all states $s' \neq s$. The payoff function $\bar{\tau}$ is given by

$$\bar{\tau}(s, a) = x_s(s, a)\, \tau(s, a) + \sum_{s' \in S - \{s\}} x_{s'}(s, a)\, P^i(s').$$

This formalization of average reward games has laid the basis for the definition and analysis of pure equilibrium cells.
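To make Definitions 3 and 4 concrete, the following sketch (restricted to two players; the data layout and helper names are ours, not the paper's) computes x(s, a) by power iteration on the state chain induced by always playing a in s and the fixed strategies elsewhere, and then evaluates the average reward game payoff for player i. Here q[s][a1, a2] is the transition distribution q(s, a) over all states, tau[i][s] is player i's payoff matrix in state s, and pi[i][s] is player i's fixed strategy in s.

```python
import numpy as np

def expected_transition(q, pi, z):
    """Expected next-state distribution from state z under the mixed strategies pi."""
    dist = np.zeros(len(q))
    for a1, p1 in enumerate(pi[0][z]):
        for a2, p2 in enumerate(pi[1][z]):
            dist += p1 * p2 * q[z][a1, a2]
    return dist

def stationary_distribution(q, pi, s, a, iters=2000):
    """x(s, a) of Definition 3: state frequencies when joint action a is always
    played in state s and the fixed strategies pi are played everywhere else."""
    k = len(q)
    P = np.zeros((k, k))                     # P[z] is the successor distribution of state z
    for z in range(k):
        P[z] = q[s][a] if z == s else expected_transition(q, pi, z)
    x = np.full(k, 1.0 / k)
    for _ in range(iters):                   # power iteration on the ergodic chain
        x = x @ P
    return x

def average_reward_payoff(tau, q, pi, s, a, i):
    """Average reward game payoff of player i for joint action a in state s (Definition 4)."""
    x = stationary_distribution(q, pi, s, a)
    value = x[s] * tau[i][s][a]
    for sp in range(len(q)):
        if sp == s:
            continue
        expected = sum(p1 * p2 * tau[i][sp][a1, a2]      # expected payoff P^i(s') under pi(s')
                       for a1, p1 in enumerate(pi[0][sp])
                       for a2, p2 in enumerate(pi[1][sp]))
        value += x[sp] * expected
    return value
```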

3.2 Equilibrium cells
The average reward game projects the limit average reward for a given state onto a stateless normal-form game. This projection depends on the fixed strategies in all other states. In this section we explain how this strategy space can be partitioned into discrete cells, each corresponding to a set of equilibrium points of the average reward game.

First, we introduce the concept of a pure equilibrium cell. Such a cell is a subset of all strategy profiles under which a given joint action specifies a pure equilibrium in the average reward game. In a Nash equilibrium situation, no player can improve its payoff by unilateral deviation from its own strategy $\pi^i$. In the context of an average reward game, all strategies including $\pi^i(s')$ are fixed for all states $s' \neq s$. Therefore, the payoff $\bar{\tau}^i(s, a)$ (see (4)) depends only on the joint action a in state s. Hence, the equilibrium constraint translates to:

$$\forall i\ \forall a^{i\prime} \in A^i(s): \quad \bar{\tau}^i(s, a) \geq \bar{\tau}^i\!\left(s, a^1 \ldots a^{i-1}, a^{i\prime}, a^{i+1} \ldots a^n\right)$$

Consequently, this leads to the following definition of pure equilibrium cells.

Definition 5. We call C(s, a) a pure equilibrium cell of a stochastic game G if and only if C(s, a) is the subset of all strategy profiles $\pi = (\pi^1 \ldots \pi^n)$ under which the following condition holds:

$$\forall i\ \forall a': \quad \bar{\tau}^i(s, a) \geq \bar{\tau}^i(s, a'),$$

where $\bar{\tau}$ is the payoff function of the average reward game $G(s, \pi^1 \ldots \pi^n)$; a and a' are joint actions where $\forall j \neq i: a'^j = a^j$.

Thus, a is a pure equilibrium in state s for all strategy profiles $\pi \in C(s, a)$.

Note that $\bar{\tau}$ is independent of the players' strategies in s. Hence, we can express the cell boundaries in state $s = s^i$ as a function of the profiles $\pi(s^1) \ldots \pi(s^{i-1}), \pi(s^{i+1}) \ldots \pi(s^k)$, i.e. the players' strategies in all states but s. However, pure equilibrium cells might very well overlap for certain areas of this strategy space [9]. Therefore, we consider all possible combinations of equilibrium points within one state and partition the strategy space of all other states into corresponding discrete cells.

3.3 Strategy space sampling
The partitioned strategy space is sampled in order to compute particular average reward game payoffs that in turn are used to obtain the set of replicator equations. For each state and each discrete cell, the corresponding strategy space is scanned using an equally spaced grid. Each grid point defines a specific joint strategy of all states but the one under consideration. Average reward game payoffs are averaged over all grid points and the resulting payoffs are embedded in the set of replicator equations $RD_{c_s}$ for the specified cell $c_s$ of state s.

3.4 Definition
This section links the average reward game, pure equilibrium cells and strategy space sampling in order to obtain a coherent definition of piecewise replicator dynamics.

For each state $s = s^i$ the strategy space in all remaining states is partitioned into discrete cells. Each cell $c_s \subset A(s^1) \times \ldots \times A(s^{i-1}) \times A(s^{i+1}) \times \ldots \times A(s^k)$ refers to some combination of pure equilibria. This combination might as well resemble only a single equilibrium point or the empty set, i.e. no pure equilibrium in the average reward game. As explained in the previous section, the strategy subspace of each cell is sampled. As a result, we obtain payoff matrices which in turn lead to a set of replicator equations $RD_{c_s}$ for each cell. However, the limiting distribution over states under the strategy π has to be factored into the system. This means that different strategies result in situations where certain states are played more frequently than others. Since we model each cell in each state with a separate set of replicator dynamics, we need to scale the change of π(s) with the frequency $x_s$. The frequency $x_s$ determines the expected fraction of upcoming stages played in state s.

Definition 6. The piecewise replicator dynamics are defined by the following system of differential equations:

$$\frac{d\pi(s)}{dt} = RD_{c_s}(\pi(s))\; x_s,$$

where $c_s$ is the active cell in state s and $RD_{c_s}$ is the set of replicator equations that govern cell $c_s$. Furthermore, x is the stationary distribution over all states S under π, with $\sum_{s \in S} x_s(\pi) = 1$ and

$$x_s(\pi) = \sum_{z \in S} \left[ x_z(\pi) \sum_{a \in \prod_{i=1}^{n} A^i(z)} q_s(z, a) \prod_{i=1}^{n} \pi^i_{a_i}(z) \right].$$

Note that $x_s$ is defined similarly to Definition 3. However, here $x_s$ is independent of the joint action a in s but rather assumes strategy π(s) to be played instead.

4. ANOMALIES OF PIECEWISE REPLICATOR DYNAMICS
This section shows the shortcomings of piecewise replicators by comparing the dynamics of learning automata to predictions from the piecewise model. The layered sequence plot in Figure 1 is used to observe and describe the different dynamics. Each still image consists of three layers: the learning trace of automata ($L_{R-I}$ with α = 0.001), the cell partitioning and a vector field. The learning trace in state 1 is plotted together with the cell boundaries for state 2, and vice versa. Depending on the current end-point location of the trajectory in state 1 we illustrate the dynamics in state 2 by plotting the vector field of the corresponding set of replicator equations. For the particular example in Figure 1, the trajectory in state 2 does not cross any cell boundaries and therefore the vector field in state 1 remains unchanged during the sequence.

Figure 1: Trajectory plot for learning automata in the 2-State Prisoners' Dilemma. Each boundary intersection in state 1 (left column) causes a qualitative change of dynamics in state 2 (right column).


Figure 2: Cell partitioning and location of specific strategy profiles, i.e. (0.015, 0.015), (0.06, 0.01) and (0.01, 0.06), for vector field sensitivity analysis in the 2-State Prisoners' Dilemma.

The learning trace in state 1 follows this vector field and eventually converges to joint action cooperate-defect (C, D), resembling point (1, 0) in the strategy space. During this process, the trajectory intersects multiple cell boundaries and the vector field in state 2 changes accordingly. We consult Figure 2 to identify cell labels for the partitioning plotted in the strategy space of state 1.

The first row of Figure 1 shows that the current end-point of the learning trajectory in state 1 is within the boundaries of cell (C, D). This means cell (C, D) is active in state 2 and the dynamics are governed by this attractor. In fact, we see that the policies of learning automata are attracted by this equilibrium point and approximately follow the vector field toward (1, 0).

Let us now consider the third and fifth rows. Here, the current end-points of the trajectories in state 1 correspond to the cells (D, C) and (C, D) respectively. In the former case, the vector field plot shows arrows near the end of the trace that point toward (0, 1). However, the automata policies continue on an elliptic curve and therefore approximately progress in the direction of (1, 1) rather than (0, 1). The latter case shows even greater discrepancies between vector field and policy trajectory. The vector field predicts movement toward (0, 0), while the trajectory trace continues its convergence to (0, 1). We now attempt to give an explanation for these artifacts by performing a sensitivity analysis of vector field plots.

Figure 2 displays the strategy space partitioning for state 2 depending on strategies in state 1 (left) as well as a magnification of the subspace near the origin (right). We specifically focus on this subspace and compute the average reward games for three strategy profiles, i.e. (0.015, 0.015), (0.06, 0.01) and (0.01, 0.06). Each strategy profile corresponds to one of three cells, (C, C), (C, D) and (D, C) respectively, as indicated in the clipped section to the right in Figure 2. The computed average reward game payoff matrices are used to derive the replicator dynamics and visualize the vector field plots. Figure 3 compares the field plots of the three specific average reward games with the dynamics for the corresponding cells obtained by sampling. On the highest level, i.e. presence of and convergence to strong attractors, all pairs match. This is clear, since average reward games for specific points within a cell sustain the same equilibrium combination. However, in order to examine the anomaly in Figure 1 (fifth still image), we consider especially the direction of the field plots in the area around the trajectory end-point, which in this case is circa (0.15, 0.4).

Figure 3: Vector field sensitivity analysis for piecewise replicator dynamics. Cell replicator equations are compared to the dynamics of specific average reward games for the following strategy profiles: (0.015, 0.015), (0.06, 0.01) and (0.01, 0.06).

In this area, the dynamics for cells (C, D) and (D, C) show clear qualitative differences from the field plots for the corresponding specific strategy profiles (0.06, 0.01) and (0.01, 0.06). Furthermore, the field plots to the left in Figure 3 are consistent with the learning trajectory sequence displayed in Figure 1.

The vector field plots of piecewise replicator dynamics predict the learning dynamics only on the highest level (see Figure 3). However, due to the coupling of states, qualitative anomalies occur. Furthermore, piecewise replicators show discontinuities due to the discrete switching of dynamics each time a cell boundary is crossed. These discontinuities are neither intuitive nor reflected in real learning behavior. Additionally, these artifacts potentially have a profound impact if the number of states is increased.

The next section aims to alleviate the shortcomings of piecewise replicators by proposing a new approach to model the learning dynamics in stochastic games more accurately.

5. STATE-COUPLED REPLICATORS
Piecewise replicators are an implementation of the paradigm of piecewise dynamical systems and are therefore inherently limited to a qualitative approximation of the underlying, intrinsic behavior. Anomalies occur if this approximation is deficient for some area of the strategy space. This observation directly suggests either (a) refining the cell partitioning or (b) striving for direct state coupling, discarding the piecewise model.

Refining the cell partitioning is not straightforward.


One might consider separating the disjoint parts of the strategy subspace previously covered by a single cell. More precisely, this would lead to two separate cells each for the pure equilibria (C, D) and (D, C) in state 2 of the 2-State Prisoners' Dilemma. However, this approach is not in line with the argumentation that a cell should reflect the subspace in which a certain attractor reigns.

The second option is more promising since it eliminates the undesired discontinuities induced by discrete cell switching and furthermore avoids approximation anomalies. Accordingly, the next section derives a new model for state-coupled learning dynamics in stochastic games. We call this approach state-coupled replicator dynamics.

5.1 Definition
We reconsider the replicator equations for population π as given in (1):

$$\frac{d\pi_i}{dt} = \left[(A\sigma)_i - \pi' A \sigma\right]\pi_i \qquad (5)$$

Essentially, the payoff of an individual in population π, playing pure strategy i against population σ, is compared to the average payoff of population π. In the context of an average reward game G with payoff function $\bar{\tau}$, the expected payoff for player i and pure action j is given by

$$P^i_j(s) = \sum_{a' \in \prod_{l \neq i} A^l(s)} \left( \bar{\tau}^i(s, a) \prod_{l \neq i} \pi^l_{a_l}(s) \right),$$

where $a = (a_1 \ldots a_{i-1}, j, a_i \ldots a_n)$. This means that we enumerate all possible joint actions a with fixed action j for agent i. In general, for some mixed strategy ω, agent i receives an expected payoff of

$$P^i(s, \omega) = \sum_{j \in A^i(s)} \left[ \omega_j \sum_{a' \in \prod_{l \neq i} A^l(s)} \left( \bar{\tau}^i(s, a) \prod_{l \neq i} \pi^l_{a_l}(s) \right) \right].$$

If each player i is represented by a population $\pi^i$, we can set up a system of differential equations, each similar to (5), where the payoff matrix A is substituted by the average reward game payoff $\bar{\tau}$. Furthermore, σ now represents all remaining populations $\pi^l$ with $l \neq i$.

Definition 7. The multi-population state-coupled replicator dynamics are defined by the following system of differential equations:

$$\frac{d\pi^i_j(s)}{dt} = \left[ P^i(s, e_j) - P^i\!\left(s, \pi^i(s)\right) \right] \pi^i_j(s)\; x_s(\pi),$$

where $e_j$ is the j-th unit vector. $P^i(s, \omega)$ is the expected payoff for an individual of population i playing some strategy ω in state s. $P^i$ is defined as

$$P^i(s, \omega) = \sum_{j \in A^i(s)} \left[ \omega_j \sum_{a' \in \prod_{l \neq i} A^l(s)} \left( \bar{\tau}^i(s, a) \prod_{l \neq i} \pi^l_{a_l}(s) \right) \right],$$

where $\bar{\tau}$ is the payoff function of $G(s, \pi^1 \ldots \pi^n)$ and $a = (a_1 \ldots a_{i-1}, j, a_i \ldots a_n)$. Furthermore, x is the stationary distribution over all states S under π, with $\sum_{s \in S} x_s(\pi) = 1$ and

$$x_s(\pi) = \sum_{z \in S} \left[ x_z(\pi) \sum_{a \in \prod_{i=1}^{n} A^i(z)} q_s(z, a) \prod_{i=1}^{n} \pi^i_{a_i}(z) \right].$$

In total this system has $N = \sum_{s \in S} \sum_{i=1}^{n} |A^i(s)|$ replicator equations.

Figure 4: Comparison between a single trajectory trace of learning automata, piecewise and state-coupled replicator dynamics in the 2-State Prisoners' Dilemma. The piecewise replicator dynamics clearly show discontinuities, while the state-coupled replicators model the learning dynamics more accurately.

Piecewise replicator dynamics rely on a cell partitioning, where the dynamics in each cell are approximated by a static set of replicator equations. In contrast, the state-coupled replicator dynamics use direct state coupling by incorporating the expected payoff in all states under the current strategies, weighted by the frequency of state occurrences.
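A sketch of the right-hand side of Definition 7 for the two-player case is given below. It assumes that the average reward game payoffs $\bar{\tau}$ and the stationary distribution x have already been recomputed from the current strategies (for instance with helpers like those sketched after Definition 4), which is exactly where the state coupling enters; the function name and data layout are ours.

```python
import numpy as np

def state_coupled_rhs(pi, tau_bar, x):
    """Right-hand side of the state-coupled replicator dynamics (Definition 7),
    sketched for two players.

    pi      : pi[i][s] is player i's mixed strategy in state s
    tau_bar : tau_bar[i][s] is player i's average-reward-game payoff matrix in
              state s under the current strategies (rows: own actions,
              columns: opponent actions)
    x       : stationary distribution over states under pi
    """
    n_states = len(x)
    dpi = [[None] * n_states for _ in range(2)]
    for s in range(n_states):
        for i in range(2):
            opponent = pi[1 - i][s]
            payoffs = tau_bar[i][s] @ opponent   # P^i(s, e_j) for every action j
            average = pi[i][s] @ payoffs         # P^i(s, pi^i(s))
            dpi[i][s] = (payoffs - average) * pi[i][s] * x[s]
    return dpi
```

Integrating this system (e.g. with a simple Euler scheme, as in the earlier replicator sketch) while recomputing tau_bar and x at every step couples the states directly; no cell partitioning is involved.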

5.2 Results and discussion
This section sets the newly proposed state-coupled replicator dynamics in perspective by comparing their dynamics with learning automata and piecewise replicators.

Figure 4 plots a single trace each for learning automata as well as piecewise and state-coupled replicator dynamics in the 2-State Prisoners' Dilemma. All three trajectories converge to the same equilibrium point. The piecewise replicator dynamics clearly show discontinuities due to switching dynamics, triggered at each cell boundary intersection. Furthermore, the trace in state 2 enters the subspace for cell (D, D) in state 1, while both trajectories for learning automata and state-coupled replicators remain in cell (C, D).


Figure 5: Comparison between trajectory traces of learning automata, piecewise and state-coupled replicator dynamics in the 2-State Prisoners' Dilemma. Initial action probabilities are fixed in state 1 while a set of 8 random strategy profiles is used in state 2. Wrong convergence and discontinuities are observed for piecewise replicators.

This comparison clearly shows that state-coupled replicators model the learning dynamics more precisely.

In Figure 5 we compare multiple trajectory traces originating from one fixed strategy profile in state 1 and a set of randomly chosen strategies in state 2. This allows us to judge the predictive quality of piecewise and state-coupled replicator dynamics with respect to the learning curves of automata games.

Finally, Figure 6 presents trajectory plots for the Common Interest Game. Automata games and learning dynamics modeled by state-coupled replicator dynamics are compared. The strong alignment between the model and the real learning traces is evident in this game, just as for the 2-State Prisoners' Dilemma.

The newly proposed state-coupled replicator dynamics directly describe the coupling between states and hence no longer rely on an additional layer of abstraction like piecewise cell dynamics. We observe in a variety of results that state-coupled replicator dynamics model multi-agent reinforcement learning in stochastic games far more accurately than piecewise replicators.

6. CONCLUSIONS
We identified shortcomings of piecewise replicator dynamics, i.e. discontinuities and occurrences of qualitative anomalies, and ascertained cause and effect.

Figure 6: Comparison between trajectory traces of learning automata and state-coupled replicator dynamics in the Common Interest Game. Initial action probabilities are fixed in state 2 while a set of 8 random strategy profiles is used in state 1.

State-coupled replicator dynamics were proposed to alleviate these disadvantages. The preceding formalization of piecewise replicators was deliberately factored into the new approach. Finally, this work delivered a comparative study of finite action-set learning automata as well as piecewise and state-coupled replicator dynamics. State-coupled replicators have been shown to predict learning dynamics in stochastic games more accurately than their predecessor, the piecewise model.

This research was partially funded by the Netherlands Organisation for Scientific Research (NWO).

7. REFERENCES
[1] T. Borgers and R. Sarin. Learning through reinforcement and replicator dynamics. Journal of Economic Theory, 77(1), 1997.
[2] D. Hennes, K. Tuyls, and M. Rauterberg. Formalizing multi-state learning dynamics. In IAT, 2008.
[3] K. Narendra and M. Thathachar. Learning Automata: An Introduction. Prentice-Hall, Inc., New Jersey, 1989.
[4] L. Panait, K. Tuyls, and S. Luke. Theoretical advantages of lenient learners: An evolutionary game theoretic perspective. Journal of Machine Learning Research, 9:423–457, 2008.
[5] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[6] K. Tuyls and S. Parsons. What evolutionary game theory tells us about multiagent learning. Artificial Intelligence, 171(7):115–153, 2007.
[7] K. Tuyls, K. Verbeeck, and T. Lenaerts. A selection-mutation model for Q-learning in multi-agent systems. In AAMAS, 2003.
[8] K. Verbeeck, P. Vrancx, and A. Nowe. Networks of learning automata and limiting games. In ALAMAS, 2006.
[9] P. Vrancx, K. Tuyls, R. Westra, and A. Nowe. Switching dynamics of multi-agent learning. In AAMAS, 2008.