
Stochastic Direct Reinforcement: Application to Simple Games with Recurrence

John Moody*, Yufeng Liu*, Matthew Saffell* and Kyoungju Youn†
{moody, yufeng, saffell}@icsi.berkeley.edu, [email protected]

*International Computer Science Institute, Berkeley
†OGI School of Science and Engineering, Portland

Abstract

We investigate repeated matrix games with stochastic players as a microcosm for studying dynamic, multi-agent interactions using the Stochastic Direct Reinforcement (SDR) policy gradient algorithm. SDR is a generalization of Recurrent Reinforcement Learning (RRL) that supports stochastic policies. Unlike other RL algorithms, SDR and RRL use recurrent policy gradients to properly address temporal credit assignment resulting from recurrent structure. Our main goals in this paper are to (1) distinguish recurrent memory from standard, non-recurrent memory for policy gradient RL, (2) compare SDR with Q-type learning methods for simple games, (3) distinguish reactive from endogenous dynamical agent behavior and (4) explore the use of recurrent learning for interacting, dynamic agents.

We find that SDR players learn much faster and hence outperform recently-proposed Q-type learners for the simple game Rock, Paper, Scissors (RPS). With more complex, dynamic SDR players and opponents, we demonstrate that recurrent representations and SDR's recurrent policy gradients yield better performance than non-recurrent players. For the Iterated Prisoners' Dilemma, we show that non-recurrent SDR agents learn only to defect (Nash equilibrium), while SDR agents with recurrent gradients can learn a variety of interesting behaviors, including cooperation.

1 Introduction

We are interested in the behavior of reinforcement learning agents when they are allowed to learn dynamic policies and operate in complex, dynamic environments. A general challenge for research in multi-agent systems is for agents to learn to leverage or exploit the dynamic properties of the other agents and the overall environment. Even seemingly simple repeated matrix games can display complex behaviors when the policies are reactive, predictive or endogenously dynamic in nature.

In this paper we use repeated matrix games with stochastic players as a microcosm for studying dynamic, multi-agent interactions using the recently-proposed Stochastic Direct Reinforcement (SDR) policy gradient algorithm. Within this context, we distinguish recurrent memory from standard, non-recurrent memory for policy gradient RL (Section 2), compare policy gradient with Q-type learning methods for simple games (Section 3), contrast reactive with non-reactive, endogenous dynamical agent behavior and explore the use of recurrent learning for addressing temporal credit assignment with interacting, dynamic agents (Section 4).

Most studies of RL agents in matrix games use Q-Learning (Littman 1994; Sandholm & Crites 1995; Claus & Boutilier 1998; Hu & Wellman 1998; Tesauro & Kephart 2002; Bowling & Veloso 2002; Tesauro 2004). These methods, based on Markov Decision Processes (MDPs), learn a Q-value for each state-action pair and represent policies only implicitly. Though theoretically appealing, Q-Learning cannot easily be scaled up to the large state or action spaces which often occur in practice.

Direct reinforcement (DR) methods (policy gradient and policy search) (Williams 1992; Moody & Wu 1997; Moody et al. 1998; Baxter & Bartlett 2001; Ng & Jordan 2000) represent policies explicitly and do not require that a value function be learned. Policy gradient methods seek to improve the policy by using the gradient of the expected average or discounted reward with respect to the policy function parameters. More generally, they seek to maximize an agent's utility, which is a function of its rewards. We find that direct representation of policies allows us to handle mixed strategies for matrix games naturally and efficiently.

Most previous work with RL agents in repeated matrix games has used static, memory-less opponents, meaning that policy inputs or state-spaces for Q-tables do not encode previous actions. (One exception is the work of (Sandholm & Crites 1995).) In order to exhibit interesting temporal behaviors, a player with a fixed policy must have memory. We refer to such players as dynamic contestants. We note the following properties of the repeated matrix games and players we consider:

1. Known, stationary reward structure.

2. Simultaneous play with immediate rewards.

3. Mixed policies: stochastic actions.

4. Partial-observability: players do not know each others' strategies or internal states.

5. Standard Memory: dynamic players may have policies that use memory of their opponents' actions, which is represented by non-recurrent inputs or state variables.

Page 2: Stochastic Direct Reinforcement: Application to Simple Games … · 2006-01-11 · Stochastic Direct Reinforcement: Application to Simple Games with Recurrence John Moody*, Yufeng

6. Recurrent Memory: dynamic players may have policies that depend upon inputs or state variables representing their own previous actions.

While the use of standard memory in Q-type RL agents has been well studied (Lin & Mitchell 1992; McCallum 1995) and applied in other areas of AI, the importance of distinguishing between standard memory and recurrent memory within a policy gradient context has not previously been recognized.

With dynamic contestants, the game dynamics are recurrent. Dynamic contestants can be reactive or predictive of opponents' actions and can also generate endogenous dynamic behavior by using their own past actions as inputs (recurrence). When faced with a predictive opponent, a player must have recurrent inputs in order to predict the predictive responses of its opponent. The player must know what its opponent knows in order to anticipate the response and thereby achieve best play.

The feedback due to recurrent inputs gives rise to learning challenges that are naturally solved within a non-Markovian or recurrent learning framework. The SDR algorithm described in Section 2 is well-matched for use in games with dynamic, stochastic opponents of the kinds we investigate. In particular, SDR uses recurrent policy gradients to properly address temporal credit assignment that results from recurrence in games.

Section 3 provides a baseline comparison to traditional value function RL methods by pitting SDR players against two recently-proposed Q-type learners for the simple game Rock, Paper, Scissors (RPS). In these competitions, SDR players learn much faster and thus out-perform their Q-type opponents. The results and analysis of this section raise questions as to whether use of a Q-table is desirable for simple games.

Section 4 considers games with more complex SDR players and opponents. We demonstrate that recurrence gives rise to predictive abilities and dynamical behaviors. Recurrent representations and SDR's recurrent policy gradients are shown to yield better performance than non-recurrent SDR players when competing against dynamic opponents in zero-sum games. Case studies include Matching Pennies and RPS.

For the Iterated Prisoners' Dilemma (a general sum game), we consider SDR self-play and play against fixed opponents. We find that non-recurrent SDR agents learn only the Nash equilibrium strategy "always defect", while SDR agents trained with recurrent gradients can learn a variety of interesting behaviors. These include the Pareto-optimal strategy "always cooperate", a time-varying strategy that exploits a generous opponent and the evolutionarily stable strategy Tit-for-Tat.

2 Stochastic Direct Reinforcement

We summarize the Stochastic Direct Reinforcement (SDR) algorithm, which generalizes the recurrent reinforcement learning (RRL) algorithm of (Moody & Wu 1997; Moody et al. 1998; Moody & Saffell 2001) and the policy gradient work of (Baxter & Bartlett 2001). This stochastic policy gradient algorithm is formulated for probabilistic actions, partially-observed states and non-Markovian policies. With SDR, an agent represents actions explicitly and learns them directly without needing to learn a value function. Since SDR agents represent policies directly, they can naturally incorporate recurrent structure that is intrinsic to many potential applications. Via use of recurrence, the short term effects of an agent's actions on its environment can be captured, leading to the possibility of discovering more effective policies.

Here, we describe a simplified formulation of SDR appropriate for the repeated matrix games with stochastic players that we study. This formulation is sufficient to (1) compare SDR to recent work on Q-Learning in games and (2) investigate recurrent representations in dynamic settings. A more general and complete presentation of SDR will be presented elsewhere.

In the following subsections, we provide an overview of SDR for repeated, stochastic matrix games, derive the batch version of SDR with recurrent gradients, present the stochastic gradient approximation to SDR for more efficient and on-line learning, describe the model structures used for SDR agents in this paper and relate recurrent representations to more conventional "memory-based" approaches.

2.1 SDR Overview

Consider an agent that takes action $a_t$ with probability $p(a_t)$ given by its stochastic policy function $P$:

$$p(a_t) = P(a_t; \theta, \ldots). \quad (1)$$

The goal of the SDR learning algorithm is to find parameters $\theta$ for the best policy according to an agent's expected utility $U_T$ accumulated over $T$ time steps. For the time being, let us assume that the agent's policy (e.g. game strategy) and the environment (e.g. game rules, game reward structure, opponent's strategy) are stationary. The expected total utility of a sequence of $T$ actions can then be written in terms of marginal utilities $u_t(a_t)$ gained at each time:

$$U_T = \sum_{t=1}^{T} \sum_{a_t} u_t(a_t)\, p(a_t). \quad (2)$$

This simple additive utility is a special case of more general path-dependent utilities as described in (Moody et al. 1998; Moody & Saffell 2001), but is adequate for studying learning in repeated matrix games.

More specifically, a recurrent SDR agent has a stochastic policy function $P$

$$p(a_t) = P(a_t; \theta, I^{(n)}_{t-1}, A^{(m)}_{t-1}). \quad (3)$$

Here $A^{(m)}_{t-1} = \{a_{t-1}, a_{t-2}, \ldots, a_{t-m}\}$ is a partial history of $m$ recent actions (recurrent inputs), and the external information set $I^{(n)}_{t-1} = \{i_{t-1}, i_{t-2}, \ldots, i_{t-n}\}$ represents $n$ past observation vectors (non-recurrent, memory inputs) available to the agent at time $t$. Such an agent has recurrent memory of order $m$ and standard memory of length $n$. As described in Section 4, such an agent is said to have dynamics of order $(m, n)$.
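As a concrete illustration of Eqs. (1)-(3), the sketch below (not the authors' code; all names such as `SDRPolicy` are ours) shows a softmax policy over discrete actions that conditions on a flattened non-recurrent memory $I^{(n)}_{t-1}$ and a flattened one-hot recurrent history $A^{(m)}_{t-1}$, together with the additive expected utility of Eq. (2).

```python
import numpy as np

# Illustrative sketch only (not the authors' implementation).

class SDRPolicy:
    def __init__(self, n_actions, obs_dim, act_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One row of weights per action: bias, non-recurrent block, recurrent block.
        self.theta = 0.01 * rng.standard_normal((n_actions, 1 + obs_dim + act_dim))

    def probs(self, obs_hist, act_hist):
        """p(a_t) for every action given I^{(n)}_{t-1} and A^{(m)}_{t-1}."""
        x = np.concatenate([[1.0], obs_hist, act_hist])   # feature vector
        z = self.theta @ x                                 # one logit per action
        e = np.exp(z - z.max())                            # numerically stable softmax
        return e / e.sum()


def expected_utility(prob_seq, util_seq):
    """U_T = sum_t sum_{a_t} u_t(a_t) p(a_t), Eq. (2); each element of the two
    sequences is a vector indexed by the possible actions a_t."""
    return sum(float(np.dot(u_t, p_t)) for p_t, u_t in zip(prob_seq, util_seq))
```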


We refer to the combined action and information sets as the observed history. Note that the observed history is not assumed to provide full knowledge of the state $S_t$ of the world. While probabilities for previous actions $A^{(m)}_{t-1}$ depend explicitly on $\theta$, we assume here that components of $I^{(n)}_{t-1}$ have no known direct $\theta$ dependence, although there may be unknown dependencies of $I^{(n)}_{t-1}$ upon recent actions $A^{(m)}_{t-1}$.¹ For lagged inputs, it is the explicit dependence on policy parameters $\theta$ that distinguishes recurrence from standard memory. Within a policy gradient framework, this distinction is important for properly addressing the temporal credit assignment problem.

The matrix games we consider have unknown opponents, and are hence partially-observable. The SDR agents in this paper have knowledge of their own policy parameters $\theta$, the history of play, and the pay-off structure of the game $u$. However, they have no knowledge of the opponent's strategy or internal state. Denoting the opponent's action at time $t$ as $o_t$, a partial history of $n$ opponent plays is $I^{(n)}_{t-1} = \{o_{t-1}, o_{t-2}, \ldots, o_{t-n}\}$, which are treated as non-recurrent, memory inputs. The marginal utility or pay-off at time $t$ depends only upon the most recent actions of the player and its opponent: $u_t(a_t) = u(a_t, o_t)$. For the game simulations in this paper, we consider SDR players and opponents with $(m, n) \in \{0, 1, 2\}$.

For many applications, the marginal utilities of all possible actions $u_t(a_t)$ are known after an action is taken. When trading financial markets, for example, the profits of all possible one month trades are known at the end of the month.² For repeated matrix games, the complete reward structure is known to the players, and they act simultaneously at each game iteration. Once the opponent's action is observed, the SDR agent knows the marginal utility for each possible choice of its prior action. Hence, the SDR player can compute the second sum in Eq. (2). For other situations where only the utility of the action taken is known (e.g. in games where players alternate turns), the expected total reward may be expressed as $U_T = \sum_{t=1}^{T} u_t(a_t)\, p(a_t)$.

2.2 SDR: Recurrent Policy Gradients

In this section, we present the batch version of SDR with recurrent policy gradients. The next section describes the stochastic gradient approximation for on-line SDR learning.

$U_T$ can be maximized by performing gradient ascent on the utility gradient

$$\Delta\theta \propto \frac{dU_T}{d\theta} = \sum_{t} \sum_{a_t} u_t(a_t)\, \frac{d}{d\theta} p(a_t). \quad (4)$$

¹ In situations where aspects of the environment can be modeled, then $I_{t-1}$ may depend to some extent in a known way on $A_{t-1}$. This implicit dependence on policy parameters $\theta$ introduces additional recurrence which can be captured in the learning algorithm. In the context of games, this requires explicitly modeling the opponent's responses. Environment and opponent models are beyond the scope of this short paper.

² For small trades in liquid securities, this situation holds well. For very large trades, the non-trivial effects of market impact must be considered.

In this section, we derive expressions for the batch policy gradient $\frac{dp(a_t)}{d\theta}$ with a stationary policy. When the policy has $m$-th order recurrence, the full policy gradients are expressed as $m$-th order recursions. These recurrent gradients are required to properly account for temporal credit assignment when an agent's prior actions directly influence its current action. The $m$-th order recursions are causal (propagate information forward in time), but non-Markovian.

For the non-recurrent $m = 0$ case, the policy gradient is simply:

$$\frac{dp(a_t)}{d\theta} = \frac{dP(a_t; \theta, I^{(n)}_{t-1})}{d\theta}. \quad (5)$$

For first order recurrence $m = 1$, the probabilities of current actions depend upon the probabilities of prior actions. Note that $A^{(1)}_{t-1} = a_{t-1}$ and denote $p(a_t|a_{t-1}) = P(a_t; \theta, I^{(n)}_{t-1}, a_{t-1})$. Then we have

$$p(a_t) = \sum_{a_{t-1}} p(a_t|a_{t-1})\, p(a_{t-1}). \quad (6)$$

The recursive update to the total policy gradient is

$$\frac{dp(a_t)}{d\theta} = \sum_{a_{t-1}} \left\{ \frac{\partial p(a_t|a_{t-1})}{\partial\theta}\, p(a_{t-1}) + p(a_t|a_{t-1})\, \frac{dp(a_{t-1})}{d\theta} \right\} \quad (7)$$

where $\frac{\partial}{\partial\theta} p(a_t|a_{t-1})$ is calculated as $\frac{\partial}{\partial\theta} P(a_t; \theta, I^{(n)}_{t-1}, a_{t-1})$ with $a_{t-1}$ fixed.

For the case $m = 2$, $A^{(2)}_{t-1} = \{a_{t-1}, a_{t-2}\}$. Denoting $p(a_t|a_{t-1}, a_{t-2}) = P(a_t; \theta, I^{(n)}_{t-1}, a_{t-1}, a_{t-2})$, we have the second-order recursion

$$p(a_t) = \sum_{a_{t-1}} p(a_t|a_{t-1})\, p(a_{t-1}), \qquad p(a_t|a_{t-1}) = \sum_{a_{t-2}} p(a_t|a_{t-1}, a_{t-2})\, p(a_{t-2}). \quad (8)$$

The second-order recursive updates for the total policy gradient can be expressed with two equations. The first is

$$\frac{d}{d\theta} p(a_t) = \sum_{a_{t-1}} \left\{ \frac{\partial}{\partial\theta} p(a_t|a_{t-1})\, p(a_{t-1}) + p(a_t|a_{t-1})\, \frac{d}{d\theta} p(a_{t-1}) \right\}. \quad (9)$$

In contrast to Eq. (7) for $m = 1$, however, the first gradient on the RHS of Eq. (9) must be evaluated using

$$\frac{d}{d\theta} p(a_t|a_{t-1}) = \sum_{a_{t-2}} \left\{ \frac{\partial}{\partial\theta} p(a_t|a_{t-1}, a_{t-2})\, p(a_{t-2}) + p(a_t|a_{t-1}, a_{t-2})\, \frac{d}{d\theta} p(a_{t-2}) \right\}. \quad (10)$$

In the above expression, both $a_{t-1}$ and $a_{t-2}$ are treated as fixed and $\frac{\partial}{\partial\theta} p(a_t|a_{t-1}, a_{t-2})$ is calculated as $\frac{\partial}{\partial\theta} P(a_t; \theta, I^{(n)}_{t-1}, a_{t-1}, a_{t-2})$. These recursions can be extended in the obvious way for $m \ge 3$.
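The following sketch illustrates the $m = 1$ recursion of Eqs. (6)-(7) for a softmax policy. It is a minimal illustration under our own assumptions (softmax parameterization, uniform $p(a_0)$, zero initial gradient), not the authors' implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def conditional_probs_and_grads(theta, obs_feat, a_prev_onehot):
    """p(a_t | a_{t-1}) for a softmax policy and its partial derivative w.r.t.
    theta with a_{t-1} held fixed; dp has shape (n_actions,) + theta.shape."""
    x = np.concatenate([[1.0], obs_feat, a_prev_onehot])   # bias, I_{t-1}, A_{t-1}
    p = softmax(theta @ x)
    # dp[i, j, :] = d p(a_t = i | a_{t-1}) / d theta[j, :] = p_i (delta_ij - p_j) x
    dp = (np.diag(p) - np.outer(p, p))[:, :, None] * x[None, None, :]
    return p, dp

def recurrent_gradient_m1(theta, obs_feats, n_actions):
    """First-order recursion of Eqs. (6)-(7): propagate p(a_t) and dp(a_t)/dtheta
    forward through time.  Initial conditions are an illustrative choice."""
    A = n_actions
    p_prev = np.full(A, 1.0 / A)                  # uniform p(a_0)
    g_prev = np.zeros((A,) + theta.shape)         # dp(a_0)/dtheta = 0
    trajectory = []
    for obs in obs_feats:
        p_t, g_t = np.zeros(A), np.zeros_like(g_prev)
        for a_prev in range(A):                   # sum over prior actions a_{t-1}
            p_cond, dp_cond = conditional_probs_and_grads(theta, obs, np.eye(A)[a_prev])
            p_t += p_cond * p_prev[a_prev]                                        # Eq. (6)
            g_t += dp_cond * p_prev[a_prev] + p_cond[:, None, None] * g_prev[a_prev]  # Eq. (7)
        trajectory.append((p_t, g_t))
        p_prev, g_prev = p_t, g_t
    return trajectory
```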


2.3 SDR with Stochastic Gradients

The formulation of SDR of Section 2.2 provides the exact recurrent policy gradient. It can be used for learning off-line in batch mode or for on-line applications by retraining as new experience is acquired. For the latter, either a complete history or a moving window of experience may be used for batch training, and policy parameters can be estimated from scratch or simply updated from previous values.

We find in practice, though, that a stochastic gradient formulation of SDR yields much more efficient learning than the exact version of the previous section. This frequently holds whether an application requires a pre-learned solution or adaptation in real-time.

To obtain SDR with stochastic gradients, an estimate $g(a_t)$ of the recurrent policy gradient for time-varying parameters $\theta_t$ is computed at each time $t$. The on-line SDR parameter update is then the stochastic approximation

$$\Delta\theta_t \propto \sum_{a_t} u_t(a_t)\, g_t(a_t). \quad (11)$$

The on-line gradients $g_t(a_t)$ are estimated via approximate recursions. For $m = 1$, we have

$$g_t(a_t) \approx \sum_{a_{t-1}} \left\{ \frac{\partial p_t(a_t|a_{t-1})}{\partial\theta}\, p_{t-1}(a_{t-1}) + p_t(a_t|a_{t-1})\, g_{t-1}(a_{t-1}) \right\}. \quad (12)$$

Here, the subscripts indicate the times at which each quantity is evaluated. For $m = 2$, the stochastic gradient is approximated as

$$g_t(a_t) \approx \sum_{a_{t-1}} \left\{ \frac{\partial p_t(a_t|a_{t-1})}{\partial\theta}\, p_{t-1}(a_{t-1}) + p_t(a_t|a_{t-1})\, g_{t-1}(a_{t-1}) \right\}, \quad (13)$$

with

$$\frac{dp_t(a_t|a_{t-1})}{d\theta} \approx \sum_{a_{t-2}} \left\{ \frac{\partial}{\partial\theta} p_t(a_t|a_{t-1}, a_{t-2})\, p_{t-2}(a_{t-2}) + p_t(a_t|a_{t-1}, a_{t-2})\, g_{t-2}(a_{t-2}) \right\}. \quad (14)$$
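A minimal sketch of one on-line update of Eqs. (11)-(12) for $m = 1$ is given below; it reuses the `conditional_probs_and_grads` helper from the batch sketch above, and the learning-rate default and initialization choices are illustrative only.

```python
import numpy as np

def online_sdr_step(theta, obs_feat, p_prev, g_prev, utilities, eta=0.5):
    """One on-line SDR update for m = 1 (Eqs. 11-12).  `utilities[a]` is the
    known marginal pay-off u_t(a) of each possible action once the opponent's
    simultaneous move has been observed."""
    A = len(p_prev)
    p_t, g_t = np.zeros(A), np.zeros_like(g_prev)
    for a_prev in range(A):
        p_cond, dp_cond = conditional_probs_and_grads(theta, obs_feat, np.eye(A)[a_prev])
        p_t += p_cond * p_prev[a_prev]
        g_t += dp_cond * p_prev[a_prev] + p_cond[:, None, None] * g_prev[a_prev]
    # Eq. (11): stochastic parameter update, delta theta ~ sum_a u_t(a) g_t(a).
    theta = theta + eta * np.tensordot(utilities, g_t, axes=1)
    return theta, p_t, g_t
```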

Stochastic gradient implementations of learning algorithms are often referred to as "on-line". Regardless of the terminology used, one should distinguish the number of learning iterations (parameter updates) from the number of events or time steps of an agent's experience. The computational issue of how parameters are optimized should not be limited by how much information or experience an agent has accumulated.

In our simulation work with RRL and SDR, we have used various types of learning protocols based on stochastic gradient parameter updates: (1) a single sequential learning pass (or epoch) through a history window of length $\tau$, which results in one parameter update per observation, action or event and a total of $\tau$ parameter updates; (2) $n$ multiple learning epochs over a fixed window, yielding a total of $n\tau$ stochastic gradient updates; (3) hybrid on-line / batch training using history subsets (blocks) of duration $b$, so that a single pass through the history has $\tau/b$ parameter updates, which can reduce variance in the gradient estimates; (4) training on $N$ randomly resampled blocks from the history window; and (5) variations and combinations of the above. Note that (4) can be used to implement bagging or to reduce the likelihood of being caught in history-dependent local minima.

Following work by other authors in multi-agent RL, all simulation results presented in this paper use protocol (1). Hence, the number of parameter updates equals the length of the agent's history. In real-world use, however, the other learning protocols are usually preferred. RL agents with limited experience typically learn better with stochastic gradient methods when using multiple training epochs $n$ or a large number of parameter updates $N \gg \tau$.
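The protocols above might be organized as in the following sketch; `step_fn` is a hypothetical placeholder for an update such as `online_sdr_step`, and the structure is ours rather than the authors'.

```python
def protocol_single_pass(theta, experience, step_fn, state=None):
    """Protocol (1): one sequential pass through the history window, one
    parameter update per event (the protocol used for the results in this paper)."""
    for event in experience:
        theta, state = step_fn(theta, event, state)
    return theta, state

def protocol_multiple_epochs(theta, experience, step_fn, n_epochs):
    """Protocol (2): n epochs over a fixed window, n * tau updates in total."""
    state = None
    for _ in range(n_epochs):
        theta, state = protocol_single_pass(theta, experience, step_fn, state)
    return theta
```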

2.4 Model Structures

In this section we describe the two model structures we use in our simulations. The first model can be used for games with binary actions and is based on the tanh function. The second model is appropriate for multiple-action games and uses the softmax representation. Recall that the SDR agent with memory length $n$ and recurrence of order $m$ takes an action with probability $p(a_t) = P(a_t; \theta, I^{(n)}_{t-1}, A^{(m)}_{t-1})$. Let $\theta = \{\theta_0, \theta_I, \theta_A\}$ denote the bias, non-recurrent, and recurrent weight vectors respectively. The binary-action recurrent policy function has the form $P(a_t = +1; \theta, I^{(n)}_{t-1}, A^{(m)}_{t-1}) = (1 + f_t)/2$, where $f_t = \tanh(\theta_0 + \theta_I \cdot I^{(n)}_{t-1} + \theta_A \cdot A^{(m)}_{t-1})$ and $p(-1) = 1 - p(+1)$. The multiple-action recurrent policy function has the form

$$p(a_t = i) = \frac{\exp(\theta_{i0} + \theta_{iI} \cdot I^{(n)}_{t-1} + \theta_{iA} \cdot A^{(m)}_{t-1})}{\sum_j \exp(\theta_{j0} + \theta_{jI} \cdot I^{(n)}_{t-1} + \theta_{jA} \cdot A^{(m)}_{t-1})}. \quad (15)$$
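For concreteness, the binary tanh policy of this subsection can be written as in the short sketch below (the function name is ours; the multiple-action softmax case was sketched earlier in Section 2.1).

```python
import numpy as np

def binary_policy_probs(theta0, theta_I, theta_A, obs_hist, act_hist):
    """Binary-action recurrent policy of Section 2.4:
    P(a_t = +1) = (1 + f_t) / 2, f_t = tanh(theta0 + theta_I . I + theta_A . A),
    and P(a_t = -1) = 1 - P(a_t = +1)."""
    f_t = np.tanh(theta0 + np.dot(theta_I, obs_hist) + np.dot(theta_A, act_hist))
    p_plus = 0.5 * (1.0 + f_t)
    return {+1: p_plus, -1: 1.0 - p_plus}
```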

2.5 Relation to "Memory-Based" Methods

For the repeated matrix games with dynamic players that we consider in this paper, the SDR agents have both memory of opponents' actions $I^{(n)}_{t-1}$ and recurrent dependencies on their own previous actions $A^{(m)}_{t-1}$. These correspond to standard, non-recurrent memory and recurrent memory, respectively. To our knowledge, we are the first in the RL community to make this distinction, and to propose recurrent policy gradient algorithms such as RRL and SDR to properly handle the temporal dependencies of an agent's actions.

A number of "memory-based" approaches to value function RL algorithms such as Q-Learning have been discussed (Lin & Mitchell 1992; McCallum 1995). With these methods, prior actions are often included in the observation $I_{t-1}$ or state vector $S_t$. In a policy gradient framework, however, simply including prior actions in the observation vector $I_{t-1}$ ignores their dependence upon the policy parameters $\theta$. As mentioned previously, the probabilities of prior actions $A^{(m)}_{t-1}$ depend explicitly on $\theta$, but components of $I^{(n)}_{t-1}$ are assumed to have no known or direct dependence on $\theta$.³

³ Of course, agents do influence their environments, and game players do influence their opponents. Such interaction with the environment results in implicit dependencies of $I^{(n)}_{t-1}$ on $\theta$. We describe a more general formulation of SDR with environment models that capture such additional recurrence in another paper.


A standard memory-based approach of treating $A^{(m)}_{t-1}$ as part of $I_{t-1}$ by ignoring the dependence of $A^{(m)}_{t-1}$ on $\theta$ would in effect truncate the gradient recursions above. This would replace the probabilities of prior actions $p(a_{t-j})$ with Kronecker $\delta$'s (1 for the realized action and 0 for all other possible actions) and set $\frac{d}{d\theta} p(a_{t-1})$ and $\frac{d}{d\theta} p(a_{t-2})$ to zero. When this is done, the $m = 1$ and $m = 2$ recurrent gradients of equations (7) to (10) collapse to:

$$\frac{d}{d\theta} p(a_t) \approx \frac{\partial}{\partial\theta} p(a_t|a_{t-1}), \quad m = 1$$

$$\frac{d}{d\theta} p(a_t) \approx \frac{\partial}{\partial\theta} p(a_t|a_{t-1}, a_{t-2}), \quad m = 2.$$

These equations are similar to the non-recurrent $m = 0$ case in that they have no dependence upon the policy gradients computed at times $t - 1$ or $t - 2$. When total derivatives are approximated by partial derivatives, sequential dependencies that are important for solving the temporal credit assignment problem when estimating $\theta$ are lost.
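In code, the truncation just described corresponds to dropping the recursive terms; the sketch below (reusing the earlier `conditional_probs_and_grads` helper, and again only illustrative) shows the collapsed $m = 1$ gradient.

```python
import numpy as np

def truncated_gradient_m1(theta, obs_feat, a_prev_realized, n_actions):
    """'Memory-based' truncation: treat A_{t-1} as ordinary memory, i.e. replace
    p(a_{t-1}) by a Kronecker delta on the realized action and set
    dp(a_{t-1})/dtheta = 0.  The m = 1 recursion then collapses to the partial
    derivative of p(a_t | a_{t-1}) alone, losing the earlier time steps."""
    onehot = np.eye(n_actions)[a_prev_realized]
    p_cond, dp_cond = conditional_probs_and_grads(theta, obs_feat, onehot)
    return p_cond, dp_cond
```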

In our empirical work with RRL and SDR, we have found that removal of the recurrent gradient terms can lead to failure of the algorithms to learn effective policies. It is beyond the scope of this paper to discuss recurrent gradients and temporal credit assignment further or to present unrelated empirical results. In the next two sections, we use repeated matrix games with stochastic players as a microcosm to (1) compare the policy gradient approach to Q-type learning algorithms and (2) explore the use of recurrent representations and interacting, dynamic agents.

3 SDR vs. Value Function Methods for a Simple Game

The purpose of this section is to compare simple SDR agents with Q-type agents via competition in a simple game.

Nearly all investigations of reinforcement learning in multi-agent systems have utilized Q-type learning methods. Examples include bidding agents for auctions (Kephart, Hansen, & Greenwald 2000; Tesauro & Bredin 2002), autonomic computing (Boutilier et al. 2003), and games (Littman 1994; Hu & Wellman 1998), to name a few.

However, various authors have noted that Q-functions can be cumbersome for representing and learning good policies. Examples have been discussed in robotics (Anderson 2000), telecommunications (Brown 2000), and finance (Moody & Saffell 1999; 2001). The latter papers provide comparisons of trading agents trained with the RRL policy gradient algorithm and Q-Learning for the real-world problem of allocating assets between the S&P 500 stock index and T-Bills. A simple toy example that elucidates the representational complexities of Q-functions versus direct policy representations is the Oracle Problem described in (Moody & Saffell 2001).

To investigate the application of policy gradient methods to multi-agent systems, we consider competition between SDR and Q-type agents playing Rock-Paper-Scissors (RPS), a simple two-player repeated matrix game. We chose our opponents to be two recently proposed Q-type learners with mixed strategies, WoLF-PHC (Bowling & Veloso 2002) and Hyper-Q (Tesauro 2004). We selected RPS for the non-recurrent case studies in this section in order to build upon the empirical comparisons provided by Bowling & Veloso and Tesauro. Hence, we use memory-less Q-type players and follow key elements of these authors' experimental protocols. To enable fair competitions, the SDR players have access to only as much information as their PHC and Hyper-Q opponents. We also use the on-line learning paradigms of these authors, whereby a single parameter update (learning step) is performed after each game iteration (or "throw").

Rather than consider very long matches of millions of throws as previous authors have,⁴ we have chosen to limit all simulations in this paper to games of 20,000 plays or less. We believe that $O(10^4)$ plays is more than adequate to learn a simple matrix game. The Nash equilibrium strategy for RPS is just random play, which results in an average draw for both contestants. This strategy is uninteresting for competition, and the discovery of Nash equilibrium strategies in repeated matrix games has been well-studied by others. Hence, the studies in this section emphasize the non-equilibrium, short-run behaviors that determine a player's viability as an effective competitor in plausible matches. Since real tournaments keep score from the beginning of a game, it is cumulative wins that count, not potential asymptotic performance. Transient performance matters and so does speed of learning.

3.1 Rock-Paper-Scissors & Memory-Less Players

RPS is a two-player repeated matrix game where one of three actions {(R)ock, (P)aper, (S)cissors} is chosen by each player at each iteration of the game. Rock beats Scissors, which beats Paper, which in turn beats Rock. Competitions are typically sequences of throws. A +1 is scored when your throw beats your opponent's throw, a 0 when there is a draw, and a −1 when your throw is beaten. The Nash equilibrium for non-repeated play of this game is to choose the throw randomly.

In this section we will compare SDR players to two Q-type RPS opponents: the Policy Hill Climbing (PHC) variant of Q-Learning (Bowling & Veloso 2002) and the Hyper-Q algorithm (Tesauro 2004). The PHC and Hyper-Q players are memory-less, by which we mean that they have no encoding of previous actions in their Q-table state spaces. Thus, there is no sequential structure in their RPS strategies. These opponents' mixed policies are simple unconditional probabilities for throwing rock, paper or scissors, and are independent of their own or their opponents' recent past actions. These Q-type agents are in essence three-armed bandit players. When learning rates are zero, their memory-less strategies are static. When learning rates are positive, competing against these opponents is similar to playing a "restless" three-armed bandit.

⁴ See the discussion of RPS tournaments in Section 4.2. Computer RoShamBo and human world championship tournaments have much shorter games.


For the RPS competitions in this section, we use non-recurrent SDR agents. Since the Q-type players are memory-less and non-dynamic, the SDR agents do not require inputs corresponding to lagged actions of their own or their opponents to learn superior policies. Thus, the policy gradient players use the simple $m = 0$ case of the SDR algorithm.

The memory-less SDR and Q-type players have time-varying behavior due only to their learning. When learning rates are zero, the players' policies are static unconditional probabilities of throwing {R, P, S}, and the games are stationary with no sequential patterns in play. When learning rates are non-zero, the competitions between the adaptive RL agents in this section are analogous to having two restless bandits play each other. This is also the situation for all empirical results for RL in repeated matrix games presented by (Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002; Tesauro 2004).

SDR vs. PHC  PHC is a simple extension of Q-learning. The algorithm, in essence, performs hill-climbing in the space of mixed policies toward the greedy policy of its current Q-function. The policy function is a table of three unconditional probabilities with no direct dependence on recent play. We use the WoLF-PHC (win or learn fast) variant of the algorithm.

For RPS competition against WoLF-PHC, an extremely simple SDR agent with no inputs and only bias weights is sufficient. The SDR agent represents mixed strategies with softmax: $p(i) = e^{w_i} / \sum_j e^{w_j}$, for $i, j = R, P, S$. SDR without inputs resembles Infinitesimal Gradient Ascent (Singh, Kearns, & Mansour 2000), but assumes no knowledge of the opponent's mixed strategy.
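A sketch of such a no-input SDR player is given below (our own code and names, not the authors'); the softmax-gradient form and the RPS pay-off table are standard, while the learning-rate and weight-decay defaults simply mirror the values reported below for the experiments in this section.

```python
import numpy as np

# Rows: our throw (R, P, S); columns: opponent's throw.  +1 win, 0 draw, -1 loss.
RPS_PAYOFF = np.array([[ 0, -1,  1],
                       [ 1,  0, -1],
                       [-1,  1,  0]])

def mixed_strategy(w):
    """p(i) = exp(w_i) / sum_j exp(w_j) for i in {R, P, S}: the three-weight player."""
    e = np.exp(w - w.max())
    return e / e.sum()

def sdr_no_input_step(w, opp_throw, eta=0.5, weight_decay=0.001):
    """One on-line update.  With m = 0 and no inputs, dp(i)/dw_j = p_i (delta_ij - p_j),
    so the update is a plain gradient step on sum_i u_t(i) p(i)."""
    p = mixed_strategy(w)
    u = RPS_PAYOFF[:, opp_throw]                   # known pay-off of each possible throw
    grad = np.array([np.sum(u * p * ((np.arange(3) == j) - p[j])) for j in range(3)])
    return (1.0 - weight_decay) * w + eta * grad
```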

For the experiments in this section, the complexities of both players are quite simple. SDR maintains three weights. WoLF-PHC updates a Q-table with three states and has three policy parameters.

At the outset, we strove to design a fair competition, whereby neither player would have an obvious advantage. Both players have access to the same information, specifically just the current plays and reward. The learning rate for SDR is $\eta = 0.5$, and the weight decay rate is $\lambda = 0.001$. The policy update rates of WoLF-PHC for the empirical results we present are $\delta_l = 0.5$ when losing and $\delta_w = 0.125$ when winning. The ratio $\delta_l/\delta_w = 4$ is chosen to correspond to experiments done by (Bowling & Veloso 2002), who used ratios of 2 and 4. We select $\delta_l$ for PHC to equal SDR's learning rate $\eta$ to achieve a balanced baseline comparison. However, since $\delta_w = \eta/4$, PHC will not deviate from winning strategies as quickly or as much as will SDR. (While learning, some policy fluctuation occurs at all times due to the randomness of plays in the game.)

Figure 1 shows a simulation result with PHC's Q-learning rate set to 0.1, a typical value used by many in the RL community. Some might find the results to be surprising. Even with well-matched policy learning parameters for SDR and PHC, and a standard Q learning rate, the Q function of PHC does not converge. Both strategies evolve continuously, and SDR dominates WoLF-PHC.

Figure 1: The dynamic behavior of the SDR player when playing Rock-Paper-Scissors against the WoLF-PHC player with Q learning rate 0.1. The top panel shows the changing probabilities of the SDR player's strategy over time. The bottom panel shows the moving average of the percentage of wins, losses, and draws against WoLF-PHC (wins 0.486, draws 0.349, losses 0.165). Although chaotic limit cycles in play develop, SDR achieves a consistent advantage.

Upon investigation, we find that the behavior apparent in Figure 1 is due to PHC's relative sluggishness in updating its Q-table. The role of time delays (or mis-matched time-scales) in triggering such emergent chaotic behavior is well-known in nonlinear dynamical systems. The simple version of SDR used in this experiment has learning delays, too; it essentially performs a moving average estimation of its optimal mixed strategy based upon PHC's actions. Even so, SDR does not have the encumbrance of a Q-function, so it always learns faster than PHC. This enables SDR to exploit PHC, while PHC is left to perpetually play "catch-up" to SDR. As SDR's strategy continuously evolves, PHC plays catch-up, and neither player reaches the Nash equilibrium strategy of random play with parity in performance.⁵

To confirm this insight, we ran simulated competitions for various Q learning rates while keeping the policy update rates for PHC and SDR fixed. We also wished to explore how quickly WoLF-PHC would have to learn to be competitive with SDR. Figure 2 shows performance results for a range of Q learning rates from 1/32 to 1. WoLF-PHC is able to match SDR's performance by achieving net wins of zero (an average draw) only when its Q learning rate is set equal to 1.⁶

⁵ In further experiments, we find that the slowness of Q-function updates prevents PHC from responding directly to SDR's change of strategy even with hill-climbing rates $\delta_l$ approaching 1.0.

⁶ In this limit, the Q learning equations simplify, and Q-table elements for stochastic matrix games equal recently received rewards plus a stochastic constant value. Since the greedy policy depends only upon relative Q values for the possible actions, the Q-function can be replaced by the immediate rewards for each action. Hence, PHC becomes similar to a pure policy gradient algorithm, and both SDR and PHC players reach Nash equilibrium.


Figure 2: Average wins minus losses for PHC versus its Q learning rate (1/32 to 1). The Tukey box plots summarize the distributions of outcomes of 10 matches against SDR opponents for each case. Each match has 2.5×10⁴ throws, and wins minus losses are computed for the last 2×10⁴ throws. The results show that PHC loses to SDR on average for all Q learning rates less than 1, and the dominance of SDR is statistically significant. With a Q learning rate of 1, PHC considers only immediate rewards and behaves like a pure policy gradient algorithm. Only in this case is PHC able to match SDR's performance, with both players reaching Nash equilibrium and achieving an average draw.

SDR vs. Hyper-Q  The standard Q-Learning process of using discrete-valued actions is not suitable for stochastic matrix games, as argued in (Tesauro 2004). Hyper-Q extends the Q-function framework to joint mixed strategies. The hyper-Q-table is augmented by a Bayesian inference method for estimating other agents' strategies that applies a recency-weighted version of Bayes' rule to the observed action sequence. In Tesauro's simulation, Hyper-Q took 10⁶ iterations to converge when playing against PHC.

As for PHC, we strove to design a fair competition between SDR and Hyper-Q, whereby neither player would have an obvious information advantage. Both players have access to the same observations, specifically the Bayesian estimate of its opponent's mixed strategy and the most recent plays and reward. Following Tesauro's RPS experiments, our simulation of the Hyper-Q player has a discount factor $\gamma = 0.9$ and a constant learning rate $\alpha = 0.01$ and uses a Bayes estimate of the opponent's strategies. Similarly, the input of SDR is a vector consisting of three probabilities estimating Hyper-Q's mixed strategy. The estimation is performed by applying the same recency-weighted Bayes' rule as used in Hyper-Q on the observed actions of the opponent. The probability $p(y|H)$ that the opponent is using mixed strategy $y$ given the history $H$ of observed opponent's actions can be estimated using Bayes' rule $p(y|H) \propto p(H|y)\, p(y)$. A recency-weighted version of Bayes' rule is obtained using $p(H|y) = \prod_{k=0}^{t} p(o_k \mid y(t))^{w_k}$, where $w_k = 1 - \mu(t - k)$.

Figure 3: SDR vs Hyper-Q in a Rock-Paper-Scissors match. Following (Tesauro 2004), we plot the cumulative net wins (the sum of wins minus losses) of SDR over Hyper-Q versus number of throws. This graphical representation depicts the outcomes of matches of length up to 20,000 throws. The curve is on average increasing; SDR beats Hyper-Q more often than not as play continues. The failure of the curve to flatten out (which would correspond to eventual net draws on average) during the game length demonstrates the relatively faster learning behavior of the SDR player. Based upon Tesauro's empirical work, we anticipate that convergence of both players to the Nash equilibrium of random play would require on the order of 10⁶ throws. However, we limited all simulations in this paper to games of 20,000 throws or less.

In this simulation, SDR maintains nine weights, whereas Hyper-Q stores a Q-table of 105,625 elements.
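The recency-weighted Bayes estimate can be sketched as follows; the strategy grid, the value of µ and the use of a posterior mean are our own illustrative choices, not details taken from (Tesauro 2004).

```python
import numpy as np
from itertools import product

def strategy_grid(step=0.05):
    """Discretized simplex of candidate RPS mixed strategies y = (p_R, p_P, p_S)."""
    pts = np.arange(0.0, 1.0 + 1e-9, step)
    return np.array([(r, p, 1.0 - r - p) for r, p in product(pts, pts) if r + p <= 1.0])

def recency_weighted_strategy_estimate(opponent_actions, mu=0.005, grid=None):
    """log p(y | H) = const + sum_k w_k log p(o_k | y), with w_k = 1 - mu (t - k)
    and a uniform prior over the grid; returns the posterior-mean strategy.
    The grid resolution and the value of mu are illustrative assumptions."""
    grid = strategy_grid() if grid is None else grid
    t = len(opponent_actions) - 1
    log_post = np.zeros(len(grid))
    for k, o_k in enumerate(opponent_actions):          # o_k in {0, 1, 2} = {R, P, S}
        w_k = max(0.0, 1.0 - mu * (t - k))              # older observations count less
        log_post += w_k * np.log(grid[:, o_k] + 1e-12)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return post @ grid                                   # estimated (p_R, p_P, p_S)
```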

Figure 3 shows the cumulative net wins for the SDR player over the Hyper-Q player during learning. The SDR player exploits the slow learning of Hyper-Q before the players reach the Nash equilibrium strategy of random play. During our entire simulation of 2×10⁴ plays, the net wins curve is increasing and SDR dominates Hyper-Q.

3.2 Discussion

Our experimental results show that simple SDR players can adapt faster than more complex Q-type agents in RPS, thereby achieving winning performance. The results suggest that policy gradient methods like SDR may be sufficient for attaining superior performance in repeated matrix games, and also raise questions as to when use of a Q-function can provide benefit.

Regarding this latter point, we make two observations. First, repeated matrix games provide players with frequent, immediate rewards. Secondly, the implementations of PHC used by (Bowling & Veloso 2002) and of Hyper-Q used by (Tesauro 2004) are non-dynamic and memory-less⁷ in the sense that their policies are just the marginal probabilities of throwing rock, paper or scissors, and current actions are not conditioned on previous actions. There are no sequential patterns in their play, so adaptive opponents do not have to solve a temporal credit assignment problem. We leave it to proponents of Q-type algorithms like WoLF-PHC or Hyper-Q to investigate more challenging multi-agent systems applications where Q-type approaches may provide significant benefits and to explore dynamical extensions to their algorithms.

⁷ One can distinguish between "long-term memory" as encoded by learned parameters and "short-term memory" as encoded by input variables. Tesauro's moving average Bayes estimate of the opponent's mixed strategy is used as an input variable, but really captures long-term behavior rather than the short term, dynamical memory associated with recent play.

In the next section, we present studies of SDR in repeated matrix games with stochastic opponents that exhibit more complex behaviors. For these cases, SDR agents must solve a temporal credit assignment problem to achieve superior play.

4 Games with Dynamic Contestants

Players using memory-less, non-dynamic mixed strategies as in (Bowling & Veloso 2002; Tesauro 2004) show none of the short-term patterns of play which naturally occur in human games. Dynamics exist in strategic games due to the fact that the contestants are trying to discover patterns in each other's play, and are modifying their own strategies to attempt to take advantage of predictability to defeat their opponents. To generate dynamic behavior, game playing agents must have memory of recent actions. We refer to players with such memory as dynamic contestants.

We distinguish between two types of dynamical behaviors. Purely reactive players observe and respond to only the past actions of their opponents, while purely non-reactive players remember and respond to only their own past actions. Reactive players generate dynamical behavior only through interaction with an opponent, while non-reactive players ignore their opponents entirely and generate sequential patterns in play purely autonomously. We call such autonomous pattern generation endogenous dynamics. Reactive players have only non-recurrent memory, while non-reactive players have only recurrent memory. A general dynamic player has both reactive and non-reactive capabilities and exhibits both reactive and endogenous dynamics. A general dynamic player with non-reactive (recurrent) memory of length $m$ and reactive (non-recurrent) memory of length $n$ is said to have dynamics of order $(m, n)$. A player that learns to model its opponent's dynamics (whether reactive or endogenous) and to dominate that opponent becomes predictive.

In this section, we illustrate the use of recurrent SDR learning in competitions between dynamic, stochastic contestants. As case studies, we consider sequential structure in the repeated matrix games Matching Pennies (MP), Rock-Paper-Scissors (RPS), and the Iterated Prisoners' Dilemma (IPD). To effectively study the learning behavior of SDR agents and the effects of recurrence, we limit the complexity of the experiments by considering only non-adaptive opponents with well-defined policies or self-play.

4.1 Matching Pennies

Matching Pennies is a simple zero-sum game that involves two players, A and B. Each player conceals a penny in his palm with either head or tail up. Then they reveal the pennies simultaneously. By an agreement made before the play, if both pennies match, player A wins; otherwise B wins. The player who wins receives a reward of +1 while the loser receives −1.

Figure 4: Matching Pennies with a general dynamic opponent. Boxplots summarize the average reward (wins minus losses) of an SDR learner over 30 simulated games for the Non-Recurrent, Only Recurrent, and Both Types input sets. Each game consists of 2000 rounds. Since the opponent's play depends on both its own and its opponent's play, both recurrent and non-recurrent inputs are necessary to maximize rewards.

The Matching Pennies SDR player uses the softmax function as described in Section 2.4 and $m = 1$ recurrence. We tested versions of the SDR player using three input sets: the first, "Non-Recurrent" with $(m, n) = (0, 1)$, uses only its opponent's previous move $o_{t-1}$; the second, "Only Recurrent" with $(m, n) = (1, 0)$, uses only its own previous move $a_{t-1}$; and the third, "Both Types" with $(m, n) = (1, 1)$, uses both its own and its opponent's previous move. Here, $a_{t-1}$ and $o_{t-1}$ are represented as vectors $(1, 0)^T$ for heads and $(0, 1)^T$ for tails.

To test the significance of recurrent inputs, we design three kinds of fixed opponents that are described in the following subsections.

General Dynamic Opponent  The general dynamic opponent chooses its action based on both its own previous action (internal dynamics) and its opponent's previous action (reactive dynamics). In our test, we construct an opponent using the following strategy:

$$V = \begin{pmatrix} 0.6 & 0.4 & 0.2 & 0.8 \\ 0.4 & 0.6 & 0.8 & 0.2 \end{pmatrix}, \qquad p = V (a_{t-1} \otimes o_{t-1})$$

where $a_{t-1} \otimes o_{t-1}$ is the outer product of $a_{t-1}$ and $o_{t-1}$. The matrix $V$ is constant. $p = (p_H, p_T)^T$ gives the probability of presenting a head or tail at time $t$. Softmax is not needed. SDR players use the softmax representation illustrated in Eq. (15) with $I^{(1)}_{t-1} = o_{t-1}$ and $A^{(1)}_{t-1} = a_{t-1}$. The reactive and non-reactive weight matrices for the SDR player are $\theta_I$ and $\theta_A$, respectively.
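As an illustration, the general dynamic opponent defined above can be simulated as in the sketch below; the ordering of the outer-product components and the heads/tails encoding are assumptions on our part.

```python
import numpy as np

# Columns of V correspond to the flattened outer product a_{t-1} (x) o_{t-1};
# the component ordering and the heads/tails one-hot encoding are assumptions.
V_GENERAL = np.array([[0.6, 0.4, 0.2, 0.8],
                      [0.4, 0.6, 0.8, 0.2]])

def general_dynamic_opponent(a_prev, o_prev, rng):
    """p = V (a_{t-1} (x) o_{t-1}); a_prev / o_prev are one-hot vectors,
    e.g. (1, 0) for heads and (0, 1) for tails.  Returns 0 (heads) or 1 (tails)."""
    p = V_GENERAL @ np.outer(a_prev, o_prev).ravel()     # (p_H, p_T)
    return rng.choice(2, p=p)

# Example: rng = np.random.default_rng(0); general_dynamic_opponent([1, 0], [0, 1], rng)
```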

The results in Figure 4 show the importance of including both actions in the input set to maximize performance.

Purely Endogenous Opponent  The purely endogenous opponent chooses its action based only on its own previous actions. Such a player does not react to its opponent's actions, but generates patterns of play autonomously. Since they cannot learn strategies to predict their opponents' responses, endogenous players are neither very realistic nor very challenging to defeat.

Figure 5: Matching Pennies with a purely endogenous opponent for 30 simulations. Such a player does not react to its opponent's actions and cannot learn to directly predict its opponent's responses; these features make such opponents not very realistic or challenging. Boxplots of the average reward (wins minus losses) of an SDR learner. The results show that non-recurrent, memory inputs are sufficient to provide SDR players with the predictive abilities needed to beat a purely endogenous opponent.

The following is an example of the endogenous dynamic opponent's strategy:

$$V = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}, \qquad p = V o_{t-1}. \quad (16)$$

Figure 5 shows results for 30 games of 2000 rounds of the three SDR players against the purely endogenous opponent. The results show that including only the non-recurrent inputs provides enough information for competing successfully against it.

Purely Reactive Opponent  The purely reactive opponent chooses its action based only on its opponents' previous actions. The opponent has no endogenous dynamics of its own, but generates game dynamics through interaction. Such a player is more capable of learning to predict its opponents' behaviors and is thus a more realistic choice for competition than a purely endogenous player. The following is an example of the reactive opponent's strategy:

$$V = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}, \qquad p = V a_{t-1}. \quad (17)$$
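Both fixed opponents of Eqs. (16) and (17) can be sketched as follows (the one-hot encodings and function names are ours):

```python
import numpy as np

V_FIXED = np.array([[0.7, 0.3],
                    [0.3, 0.7]])

def endogenous_opponent(o_prev, rng):
    """Eq. (16): p = V o_{t-1}; conditions only on its own previous one-hot action."""
    return rng.choice(2, p=V_FIXED @ np.asarray(o_prev, dtype=float))

def reactive_opponent(a_prev, rng):
    """Eq. (17): p = V a_{t-1}; conditions only on the SDR player's previous action."""
    return rng.choice(2, p=V_FIXED @ np.asarray(a_prev, dtype=float))
```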

Figure 6 shows results for 30 games of 2000 rounds of the three SDR players against the purely reactive opponent. The results show that including only the recurrent inputs provides enough information for competing successfully against a purely reactive opponent.

Figure 6: Matching Pennies with a purely reactive opponent for 30 simulations. Boxplots of the average reward (wins minus losses) of an SDR learner. The results clearly show that non-recurrent, memory inputs provide no advantage, and that recurrent inputs are necessary and sufficient to provide SDR players with the predictive abilities needed to beat a purely reactive opponent.

4.2 Rock Paper Scissors

Classical game theory provides a static perspective on mixed strategies in games like RPS. For example, the Nash equilibrium strategy for RPS is to play randomly with equal probabilities of {R, P, S}, which is static and quite dull. Studying the dynamic properties of human games in the real world and of repeated matrix games with dynamic contestants is much more interesting. Dynamic games are particularly challenging when matches are short, with too few rounds for the law of large numbers to take effect. Human RPS competitions usually consist of at most a few dozen throws. The RoShamBo (another name for RPS) programming competition extends game length to just one thousand throws. Competitors cannot win games or tournaments on average with the Nash equilibrium strategy of random play, so they must attempt to model, predict and dominate their opponents. Quickly discovering deviations from randomness and patterns in their opponents' play is crucial for competitors to win in these situations.

In this section, we demonstrate that SDR players can learn winning strategies against dynamic and predictive opponents, and that recurrent SDR players learn better strategies than non-recurrent players. For the experiments shown in this section, we are not concerned with the speed of learning in game time. Hence, we continue to use the "naive" learning protocol (1) described in Section 2.3, which takes one stochastic gradient step per RPS throw. This protocol is sufficient for our purpose of studying recurrence and recurrent gradients. In real-time tournament competition, however, we would use a learning protocol with many stochastic gradient steps per throw that would make maximally efficient use of acquired game experience and converge to best possible play much more rapidly.

Human Champions Dataset & Dynamic Opponent  To simulate dynamic RPS competitions, we need a general dynamic opponent for SDR to compete against. We create this dynamic opponent using SDR by training on a dataset inspired by human play and then fixing its weights after convergence. The "Human Champions" dataset consists of triplets of plays called gambits that are often used in human world championship play (http://www.worldrps.com). (One example is "The Crescendo" gambit, P-S-R.) The so-called "Great Eight" gambits were sampled at random using prior probabilities such that the resulting unconditional probabilities of each action {R, P, S} are equal. Hence there were no trivial vulnerabilities in its play. The general dynamic player has a softmax representation and uses its opponent's two previous moves and its own two previous moves to decide its current throw. It is identical to an SDR player, but the parameters are fixed once training is complete. Thus the dynamic player always assumes its opponents are "human champions" and attempts to predict their throws and defeat them accordingly. The dynamic player is then used as a predictive opponent for testing different versions of SDR-Learners.

Importance of recurrence in RPS  Figures 7 and 8 show performance for three different types of SDR-Learners; Figure 7 shows a single match of RPS and Figure 8 summarizes 30 such matches. The "Non-Recurrent" players, $(m, n) = (0, 2)$, only have knowledge of their opponents' previous two moves. The "Only Recurrent" players, $(m, n) = (2, 0)$, know only about their own previous two moves. The "Both Types" players have dynamics of order $(m, n) = (2, 2)$, with both recurrent and non-recurrent memory. Each SDR-Learner plays against the same dynamic "Human Champions" opponent described previously. The SDR players adapt as the matches progress, and are able to learn to exploit weaknesses in their opponents' play. While each player is able to learn a winning strategy, the players with recurrent knowledge are able to learn to predict their opponents' reactions to previous actions. Hence, the recurrent SDR players gain a clear and decisive advantage.

4.3 Iterated Prisoners' Dilemma (IPD)

The Iterated Prisoners' Dilemma (IPD) is a general-sum game in which groups of agents can learn either to defect or to cooperate. For IPD, we consider SDR self-play and play against fixed opponents. We find that non-recurrent SDR agents learn only the Nash equilibrium strategy "always defect", while SDR agents trained with recurrent gradients can learn a variety of interesting behaviors. These include the Pareto-optimal strategy "always cooperate", a time-varying strategy that exploits a generous opponent, and the evolutionarily stable strategy Tit-for-Tat (TFT).

Learning to cooperate with TFT is a benchmark problem for game theoretic algorithms. Figure 9 shows IPD play with a dynamic TFT opponent. We find that non-recurrent SDR agents prefer the static, Nash equilibrium strategy of defection, while recurrent $m = 1$ players are able to learn to cooperate, and find the globally-optimal (Pareto) equilibrium. Without recurrent policy gradients, the recurrent SDR player also fails to cooperate with TFT, showing that recurrent learning and not just a recurrent representation is required. We find similar results for SDR self-play, where SDR agents learn to always defect, while recurrent gradient SDR players learn to always cooperate.

[Figure 7 plot: three panels of win/draw/loss fractions over a 20000-throw match. Non-Recurrent — Wins 0.357, Draws 0.383, Losses 0.260; Only Recurrent — Wins 0.422, Draws 0.305, Losses 0.273; Both Types — Wins 0.463, Draws 0.298, Losses 0.239.]

Figure 7: Learning curves showing the fraction of wins, losses and draws during an RPS match between SDR players and a dynamic, predictive opponent. The top player has non-recurrent inputs, the middle player has only recurrent inputs, and the bottom player has both types of inputs. Recurrent SDR players achieve higher fractions of wins than do the non-recurrent players. The length of each match is 20000 throws with one stochastic gradient learning step per throw. Fractions of wins, draws and losses are calculated using the last 4000 throws. Note that the learning traces shown are only an illustration; much faster convergence can be obtained using multiple stochastic gradient steps per game iteration.

[Figure 8 plot: Tukey boxplots of Wins − Losses and of win/draw/loss fractions for the Non-Recurrent, Only Recurrent and Both Types players.]

Figure 8: Performance distributions for SDR learners competing against predictive “human champions” RPS opponents. Tukey boxplots summarize the results of 30 matches such as those shown in Figure 7. The SDR players with recurrent inputs have a clear and decisive advantage.


Figure 9: IPD: SDR agents versus a TFT opponent. Plots show the frequencies of outcomes; for example, C/D indicates that TFT “cooperates” and SDR “defects”. The top panel shows that a simple m = 1 SDR player with recurrent inputs but no recurrent gradients learns to always defect (Nash equilibrium). This player is unable to solve the temporal credit assignment problem and fails to learn to cooperate with the TFT player. The bottom panel shows that an m = 1 SDR player with recurrent gradients can learn to cooperate with the TFT player, thus finding the globally-optimal Pareto equilibrium.

A Tit-for-Two-Tat (TF2T) player is more generous than a TFT player. TF2T considers two previous actions of its opponent and defects only after two opponent defections. When playing against TF2T, an m = 2 recurrent SDR agent learns the optimal dynamic strategy to exploit TF2T’s generosity. This exploitive strategy alternates cooperation and defection: C-D-C-D-C-D-C-... . Without recurrent gradients, SDR cannot learn this alternating behavior.
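A quick way to see why alternating C-D exploits TF2T is to compute average per-round payoffs, reusing the assumed Axelrod payoffs from the earlier IPD sketch; the TF2T rule below is our reading of “defect only after two consecutive opponent defections”.

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}   # assumed values, as before

def tf2t(opp_history):
    """Tit-for-Two-Tat: defect only after two consecutive opponent defections."""
    return "D" if opp_history[-2:] == ["D", "D"] else "C"

def average_payoff(my_cycle, rounds=1000):
    """Average per-round payoff of a fixed cyclic move sequence against TF2T."""
    my_hist, total = [], 0
    for t in range(rounds):
        mine = my_cycle[t % len(my_cycle)]
        theirs = tf2t(my_hist)          # TF2T reacts to my past moves
        total += PAYOFF[(mine, theirs)][0]
        my_hist.append(mine)
    return total / rounds

print(average_payoff(["C"]))        # always cooperate: 3.0 per round
print(average_payoff(["C", "D"]))   # alternate C-D: (3 + 5) / 2 = 4.0 per round
print(average_payoff(["D"]))        # always defect: approaches 1.0 per round

Under these assumed payoffs the alternating exploit earns 4 per round versus 3 for mutual cooperation, which is exactly the generosity that the recurrent m = 2 SDR agent learns to exploit.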

We next consider a multi-agent generalization of IPD in which an SDR learner plays a heterogeneous population of opponents. Table 1 shows the mixed strategy learned by SDR. The SDR player learns a stochastic Tit-for-Tat strategy (sometimes called “generous TFT”). Table 2 shows the Q-Table learned by a Q-Learner when playing IPD against a heterogeneous population of opponents. The table shows that the Q-Learner will only cooperate until an opponent defects, and from then on the Q-Learner will always defect regardless of the opponent’s actions. The SDR player, on the other hand, has a positive probability of cooperating even if the opponent has just defected, thus allowing for a switch to the more profitable cooperate/cooperate regime if the opponent is amenable.
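The generous-TFT behavior summarized in Table 1 below can be read as a small lookup table of cooperation probabilities conditioned on the previous joint action. The sketch here simply samples from that table to show how the stochastic policy occasionally forgives a defection; the probabilities are copied from Table 1, and everything else is illustrative.

import numpy as np

# P(cooperate | own previous action, opponent's previous action), from Table 1.
P_COOPERATE = {("C", "C"): 0.99, ("C", "D"): 0.3,
               ("D", "C"): 0.97, ("D", "D"): 0.1}

def generous_tft_action(own_prev, opp_prev, rng):
    """Sample the SDR player's stochastic (generous TFT) action."""
    return "C" if rng.random() < P_COOPERATE[(own_prev, opp_prev)] else "D"

rng = np.random.default_rng(0)
# After cooperating and being defected on, the player still forgives about 30%
# of the time, which is what allows an escape back to mutual cooperation.
samples = [generous_tft_action("C", "D", rng) for _ in range(10000)]
print(samples.count("C") / len(samples))   # close to 0.3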

a_t^SDR \ (a_{t-1}^SDR, a_{t-1}^opp)     CC      CD      DC      DD
C                                        0.99    0.3     0.97    0.1
D                                        0.01    0.7     0.03    0.9

Table 1: The mixed strategy learned by SDR when playing IPD against a heterogeneous population of opponents. The elements of the table show the probability of taking an action at time t given the previous action of the opponent and of the player itself. The SDR player learns a generous Tit-for-Tat strategy, wherein it will almost always cooperate after its opponent cooperates and sometimes cooperate after its opponent defects.

a_t^Q \ (a_{t-1}^Q, a_{t-1}^opp)         CC      CD      DC      DD
C                                        69      2       -14     -12
D                                        9.7     4       0       0

Table 2: The Q-Table learned by a Q-Learner when playing IPD against a heterogeneous population of opponents. The Q-Learner will continue to defect regardless of its opponent’s actions after a single defection by its opponent. Note the preference for defection indicated by the Q-values of column 3 as compared to the SDR player’s strong preference for cooperation shown in Table 1, column 3.

TFT is called an “evolutionary stable strategy” (ESS), since its presence in a heterogeneous population of adaptive agents is key to the evolution of cooperation throughout the population. Recurrent SDR’s ability to learn TFT may thus be significant for multi-agent learning; a recurrent SDR agent could in principle influence other learning agents to discover desirable individual policies that lead to Pareto-optimal behavior of a multi-agent population.

5 Closing Remarks
Our studies of repeated matrix games with stochastic players use a pure policy gradient algorithm, SDR, to compete against dynamic and predictive opponents. These studies distinguish between reactive and non-reactive, endogenous dynamics, emphasize the naturally recurrent structure of repeated games with dynamic opponents and use recurrent learning agents in repeated matrix games with stochastic actions.

Dynamic SDR players can be reactive or predictive of opponents’ actions and can also generate endogenous dynamic behavior by using their own past actions as inputs (recurrence). When faced with a reactive or predictive opponent, an SDR player must have such recurrent inputs in order to predict the responses of its opponent. The player must know what its opponent knows in order to anticipate the response. Hence, games with predictive competitors have feedback cycles that induce recurrence in policies and game dynamics.

The Stochastic Direct Reinforcement (SDR) algorithm used in this paper is a policy gradient algorithm for probabilistic actions, partially-observed states and recurrent, non-Markovian policies. The RRL algorithm of (Moody & Wu 1997; Moody et al. 1998) and SDR are policy gradient RL algorithms that distinguish between recurrent memory and standard, non-recurrent memory via the use of recurrent gradients during learning. SDR is well-matched to learning in repeated matrix games with unknown, dynamic opponents.


As summarized in Section 2, SDR’s recurrent gradients are necessary for properly training agents with recurrent inputs. Simply including an agent’s past actions in an observation or state vector (a standard “memory-based” approach) amounts to a truncation that neglects the dependence of the current policy gradient on previous policy gradients. By computing the full recurrent gradient, SDR captures dynamical structure that is important for solving the temporal credit assignment problem in RL applications.
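To state the truncation point in symbols (the notation here is schematic, in the style of RRL, rather than the exact stochastic expressions of Section 2): if a recurrent policy produces decisions $F_t = F(\theta; F_{t-1}, o_t)$ from parameters $\theta$, its own previous decision $F_{t-1}$ and the current observation $o_t$, then the full gradient obeys the recursion

\[
\frac{dF_t}{d\theta} \;=\; \frac{\partial F_t}{\partial \theta} \;+\; \frac{\partial F_t}{\partial F_{t-1}}\,\frac{dF_{t-1}}{d\theta},
\]

whereas the memory-based truncation keeps only the first term, treating the past action as an exogenous input. Propagating the second, recurrent term is what carries credit for an action’s influence on later actions; SDR extends this recurrent-gradient idea to stochastic policies.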

While we have focused on simple matrix games in this paper, the empirical results presented in Section 3 support the view that policy gradient methods may offer advantages in simplicity, learning efficiency, and performance over value function type RL methods for certain applications. The importance of recurrence as illustrated empirically in Section 4 may have implications for learning in multi-agent systems and other dynamic application domains. Of particular interest are our results for the Iterated Prisoners Dilemma, which include discovery of the Pareto-optimal strategy “always cooperate”, a time-varying strategy that exploits a generous opponent and the evolutionary stable strategy Tit-for-Tat.

Some immediate extensions to the work presented in this paper include opponent or (more generally) environment modeling that captures additional recurrence and incorporating internal states and adaptive memories in recurrent SDR agents. These issues are described in another paper. It is not yet clear whether pure policy gradient algorithms like SDR can learn effective policies for substantially more complex problems or in environments with delayed rewards. There are still many issues to pursue further in this line of research, and the work presented here is just an early glimpse.

References
Anderson, C. W. 2000. Approximating a policy can be easier than approximating a value function. Technical Report CS-00-101, Colorado State University.
Baxter, J., and Bartlett, P. L. 2001. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research 15:319–350.
Boutilier, C.; Das, R.; Kephart, J. O.; Tesauro, G.; and Walsh, W. E. 2003. Cooperative negotiation in autonomic systems using incremental utility elicitation. In UAI 2003.
Bowling, M., and Veloso, M. 2002. Multiagent learning using a variable learning rate. Artificial Intelligence 136:215–250.
Brown, T. X. 2000. Policy vs. value function learning with variable discount factors. Talk presented at the NIPS 2000 Workshop “Reinforcement Learning: Learn the Policy or Learn the Value Function?”.
Claus, C., and Boutilier, C. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. In AAAI/IAAI, 746–752.
Hu, J., and Wellman, M. P. 1998. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Fifteenth International Conference on Machine Learning, 242–250.
Kephart, J. O.; Hansen, J. E.; and Greenwald, A. R. 2000. Dynamic pricing by software agents. Computer Networks 32(6):731–752.
Lin, L. J., and Mitchell, T. 1992. Memory approaches to reinforcement learning in non-Markovian domains. Technical Report CMU-CS-92-138, Carnegie Mellon University, School of Computer Science.
Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In ICML-94, 157–163.
McCallum, A. 1995. Instance-based utile distinctions for reinforcement learning. In Twelfth International Conference on Machine Learning.
Moody, J., and Saffell, M. 1999. Reinforcement learning for trading. In Kearns, M.; Solla, S. A.; and Cohn, D. A., eds., Advances in Neural Information Processing Systems, volume 11, 917–923. MIT Press.
Moody, J., and Saffell, M. 2001. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks 12(4):875–889.
Moody, J., and Wu, L. 1997. Optimization of trading systems and portfolios. In Abu-Mostafa, Y.; Refenes, A. N.; and Weigend, A. S., eds., Decision Technologies for Financial Engineering, 23–35. London: World Scientific.
Moody, J.; Wu, L.; Liao, Y.; and Saffell, M. 1998. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting 17:441–470.
Ng, A., and Jordan, M. 2000. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence.
Sandholm, T. W., and Crites, R. H. 1995. Multi-agent reinforcement learning in the iterated prisoner’s dilemma. Biosystems 37:147–166.
Singh, S.; Kearns, M.; and Mansour, Y. 2000. Nash convergence of gradient dynamics in general-sum games. In Proceedings of UAI-2000, 541–548. Morgan Kaufmann.
Tesauro, G., and Bredin, J. 2002. Strategic sequential bidding in auctions using dynamic programming. In Proceedings of the First International Joint Conference on Autonomous Agents and Multi-Agent Systems, 591–598.
Tesauro, G., and Kephart, J. O. 2002. Pricing in agent economies using multi-agent Q-learning. Autonomous Agents and Multi-Agent Systems 5(3):289–304.
Tesauro, G. 2004. Extending Q-learning to general adaptive multi-agent systems. In Thrun, S.; Saul, L. K.; and Schölkopf, B., eds., Advances in Neural Information Processing Systems, volume 16. MIT Press.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229–256.