Multi-agent learning
Reinforcement Learning
Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department,
Faculty of Sciences, Utrecht University, The Netherlands.
Reinforcement learning: motivation
• The analysis of Nash equilibria in repeated games is static.
Dynamical analysis:
How do (or should) players develop their strategies and
behaviour in a repeated game?
“Do”: descriptive / economics; “should”: normative / agent theory.
• Reinforcement learning (RL) is a rudimentary learning technique.
1. RL is stimulus-response: it plays actions with the highest past payoff.
2. It is myopic: it is only interested in immediate success.
• Reinforcement learning can be applied to learning in games.
• When computer scientists mention RL, they usually mean multi-state RL.
• Single-state RL already has interesting and theoretically important
properties, especially when it is coupled to games.
Plan for today
Part I: Single-state RL. Parts of Ch. 2 of Sutton &amp; Barto (1998): “Evaluative
Feedback”: ǫ-greedy, optimistic, value-based, proportional.
Part II: Single-state RL in games. First half of Ch. 2 of Peyton Young (2004):
“Reinforcement and Regret”.
1. By average: (1/n)·r_1 + · · · + (1/n)·r_n.
2. With discounted past: γ^{n−1}·r_1 + γ^{n−2}·r_2 + · · · + γ·r_{n−1} + r_n.
3. With an aspiration level (Sutton &amp; Barto: “reference reward”).
Part III: Convergence to dominant strategies. Beginning of Beggs (2005): “On the
Convergence of Reinforcement Learning”.
               #Players   #Actions   Result
☞ Theorem 1:      1          2       Pr(dominant action) = 1
  Theorem 2:      1        ≥ 2       Pr(sub-dominant actions) = 0
  Theorem 3:    ≥ 1        ≥ 2       Pr(dom) = 1, Pr(sub-dom) = 0
Part I:
Single-state
reinforcement learning
Exploration vs. exploitation
Problem. You are at the beginning of a new study year. Every fellow
student is interesting as a possible new friend.
How do you divide your time between your classmates to optimise your
happiness?
Strategies:
A. You make friends whe{n|r}ever possible. You could be called an explorer.
B. You stick to the nearest fellow-student. You could be called an exploiter.
C. What most people do: first explore, then “exploit”.
We ignore:
1. How quality of friendships is measured.
2. How changing personalities of friends (so-called “moving targets”) are
dealt with.
An array of N slot machines
Exploitation vs. exploration
Given. An array of N slot
machines.
Suppose the yield of every machine is
normally distributed with mean and
variance unknown to us.
Random questions:
1. How long do you stick with your
first slot machine?
2. When do you leave the second?
3. If machine A so far yields more
than machine B, then would you
explore B ever again?
4. Try many machines, or opt for
security?
Experiment
Yield Machine 1   Yield Machine 1   Yield Machine 1
      2                 4                 2
      2                 5                 1
      3                 4
      6
The N-armed bandit problem
Sutton &amp; Barto: the N-armed bandit.
Computation of the quality (offline version)
A reasonable measure for the quality of a slot machine after n tries would be
the average benefit.
Formula for the quality of a slot machine after n tries:
    Q_n =Def (r_1 + · · · + r_n) / n
Simple formula, but:
– Every time Qn is computed, all values r1, . . . , rn must be retrieved.
– The idea is to draw conclusions only if you have all the data.
– The data is processed in batch.
– Learning proceeds off line.
Computation of the quality (online version)
Q_n = (r_1 + · · · + r_n) / n
    = (r_1 + · · · + r_{n−1}) / n + r_n / n
    = [(r_1 + · · · + r_{n−1}) / (n − 1)] · [(n − 1) / n] + r_n / n
    = Q_{n−1} · (n − 1)/n + r_n / n
    = Q_{n−1} − (1/n) · Q_{n−1} + (1/n) · r_n
    = Q_{n−1} + (1/n) · (r_n − Q_{n−1}),

that is,

    new value = old value + learning rate · (goal value − old value),

where r_n − Q_{n−1} is the error and (1/n) · (r_n − Q_{n−1}) is the correction.
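The same recursion in code — a minimal sketch, not from the slides; the sample
rewards 2, 2, 3, 6 are just illustrative numbers:

    # Minimal sketch (not from the slides): offline vs. online computation of Q_n.
    def quality_batch(rewards):
        # Offline: all rewards r_1, ..., r_n must be available at once.
        return sum(rewards) / len(rewards)

    def quality_incremental(q_prev, r_n, n):
        # Online: new value = old value + learning rate * (goal value - old value),
        # with learning rate 1/n.
        return q_prev + (1.0 / n) * (r_n - q_prev)

    rewards = [2, 2, 3, 6]                      # illustrative yields
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q = quality_incremental(q, r, n)
    assert abs(q - quality_batch(rewards)) < 1e-9   # both give 3.25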
Progress of quality Qn
– Amplitude of correction is determined by the learning rate.
– Here, the learning rate is 1/n and decreases through time.
Exploration: ǫ-greedy exploration
ǫ-greedy exploration. Let 0 < ǫ ≤ 1 be close to 0.
1. With probability 1 − ǫ, choose an optimal action.
2. With probability ǫ, choose a random action.
– Item 1: exploitation.
– Item 2: exploration.
– With probability one, every action is explored infinitely many times.
(Why?)
– Is it guaranteed that every action is explored infinitely many times?
– It would be an idea to explore sub-optimal actions with relatively high reward
more often. However, that is not how ǫ-greedy exploration works.
And we may lose convergence to optimal actions . . .
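As a minimal sketch (not from the slides), ǫ-greedy selection over a list of
estimated qualities could look like this; the names q_values and epsilon are
illustrative:

    import random

    def epsilon_greedy(q_values, epsilon):
        # With probability epsilon: explore (uniformly random action).
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        # Otherwise: exploit, breaking ties among optimal actions at random.
        best = max(q_values)
        return random.choice([a for a, q in enumerate(q_values) if q == best])

Combined with the incremental update above, this gives the usual ǫ-greedy bandit
scheme; every action keeps being selected with probability at least ǫ/N per round,
which is why each action is explored infinitely often with probability one.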
Optimistic initial values
An alternative for ǫ-greedy is to work
with optimistic initial values.
1. At the outset, an unrealistically
high quality is attributed to every
slot machine:
    Q^k_0 = high, for 1 ≤ k ≤ N.
2. As usual, for every slot machine
its average profit is maintained.
3. Without exception, always exploit
machines with highest Q-values.
Random questions:
q1: Initially, many actions are tried
⇒ all actions are tried?
q2: How high should “high” be?
q3: What to do in case of ties (more
than one optimal machine)?
q4: Can we speak of exploration?
q5: Is optimism (as a method)
suitable to explore an array of
(possibly) infinitely many slot
machines? Why (not)?
q6: ǫ-greedy: Pr( every action is
explored infinitely many
times ) = 1. Also with optimism?
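A minimal sketch (not from the slides) of optimistic initial values with purely
greedy play; the number of machines, the value of “high”, and the Gaussian payoff
model are illustrative assumptions:

    import random

    N_MACHINES = 5
    HIGH = 10.0                        # optimistic initial quality (q2: must exceed realistic payoffs)
    Q = [HIGH] * N_MACHINES            # Q_0^k = high for all k
    counts = [0] * N_MACHINES

    def pull(k):
        # Placeholder payoff model; in the slides the true distributions are unknown.
        return random.gauss(mu=k, sigma=1.0)

    for _ in range(1000):
        best = max(Q)
        k = random.choice([i for i, q in enumerate(Q) if q == best])  # greedy, ties at random (q3)
        r = pull(k)
        counts[k] += 1
        Q[k] += (r - Q[k]) / counts[k]   # maintain the average profit, as on the slide

Because every untried machine keeps its optimistic value, each machine is tried
at least once (q1), after which play is purely greedy on the observed averages.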
Optimistic initial values vs. ǫ-greedy
From: “Reinforcement Learning (...)”, Sutton and Barto, Sec. 2.8, p. 41.
Maintaining and exploring friendships: strategies
ǫ-Greedy: Almost always spend time with your best friends (greedy). Occasionally,
a fraction ǫ of the time, explore new friendships.

Optimistic: In the beginning, foster (unreasonably) high expectations of everyone.
You will be disappointed many times. Adapt your expectations based on experience.
Always spend time with your best friends.

Values: (This one is not discussed here, but cf. Sutton &amp; Barto.) Let 0 < α ≪ 1.
In the beginning rate everyone with a 6, say. If a friendship rated r involves a
new experience e ∈ [0, 10], then for example
    r_new = r_old + sign(e)·α.
(Watch the boundaries!) Other method:
    r_new = (1 − α)·r_old + α·e.

Proportions: Give everyone equal attention in the beginning. If there is a positive
experience, then give that person a little more attention in the future. (Similarly
with negative experiences.)
Part II:
Single-state
reinforcement learning
in games
Proportional techniques: basic setup
• There are two players: A (the subject) and B (the opponent, or “nature”).
• Play proceeds in (possibly an infinite number of) rounds 1, . . . , t, . . . .
• Identifiers X and Y denote finite sets of possible actions.
• Each round t, players A and B choose actions x ∈ X and y ∈ Y,
respectively. This is denoted by
    (x^t, y^t).
• A’s payoff is given by a fixed function
u : X × Y → R.
In other words, A’s payoff matrix is known.
• It follows that payoffs are time homogeneous, i.e.,
    x^s = x^t and y^s = y^t ⇒ u(x^s, y^s) = u(x^t, y^t).
Propensity, and mixed strategy of play
• Let t ≥ 0. The propensity of A to play x at t is denoted by θ^t_x. (Intuition:
the money thus far collected by playing x.)
• The vector of initial propensities, θ^0, is not the result of play.
• A simple model of propensity is cumulative payoff matching (CPM):
    θ^{t+1}_x = θ^t_x + u(x, y^t)   if x is played at round t,
    θ^{t+1}_x = θ^t_x               otherwise.
• As a vector: θ^{t+1} = θ^t + u^t·e^t, where e^t_x =Def 1 if x is played at t, and 0 otherwise.
• A plausible mixed strategy is to play at round t the normalised propensity of
x at t:
    q^t_x =Def θ^t_x / ∑_{x′∈X} θ^t_{x′}.
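A minimal sketch (not from the slides) of cumulative payoff matching; the
opponent’s behaviour and the payoff table are illustrative assumptions:

    import random

    X = ["x1", "x2", "x3"]                       # A's actions
    theta = {x: 1.0 for x in X}                  # initial propensities theta^0_x = 1

    def u(x, y):
        # Hypothetical payoff matrix for A; any fixed u : X x Y -> R_{>0} would do.
        table = {("x1", "y1"): 8, ("x1", "y2"): 3,
                 ("x2", "y1"): 6, ("x2", "y2"): 5,
                 ("x3", "y1"): 9, ("x3", "y2"): 2}
        return table[(x, y)]

    def mixed_strategy(theta):
        v = sum(theta.values())                  # v^t = sum_x theta^t_x
        return {x: theta[x] / v for x in theta}  # q^t_x = theta^t_x / v^t

    for t in range(100):
        q = mixed_strategy(theta)
        x = random.choices(X, weights=[q[a] for a in X])[0]   # sample A's action from q^t
        y = random.choice(["y1", "y2"])                       # B: a stationary (uniform) opponent
        theta[x] += u(x, y)                                   # only the played action is reinforced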
An Example
The total payoff at round t, the sum ∑_{x∈X} θ^t_x, is abbreviated v^t.

        θ^0 |  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 | θ^15
  x1      1 |  8  3  0  0  0  7  4  0  1  0  0  0  1  0  0 |  29
  x2      1 |  0  0  6  0  5  0  0  0  0  6  0  0  0  8  0 |  26
  x3      1 |  0  0  0  9  0  0  0  9  0  0  2  2  0  0  8 |  31
                                                      v^15 =  86
Remarks:
• It is the cumulative payoff from each action that matters, not the average
payoff.
• In this example, it is assumed that the initial propensities, θ^0_x, are one. In
general, they could be anything. However, ‖θ^0‖ = 0 is not very useful.
• Alternatively, as a scalar: v^t = ∑_{x∈X} θ^0_x + ∑_{s≤t} u^s.
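For instance, the resulting mixed strategy after round 15 is the normalised final
column: q^15 = (29/86, 26/86, 31/86) ≈ (0.34, 0.30, 0.36).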
Dynamics of the mixed strategy
We can obtain further insight into the dynamics of the process by examining how
the mixed strategy changes:
∆q^t_x = q^t_x − q^{t−1}_x = θ^t_x / v^t − θ^{t−1}_x / v^{t−1}

       = (v^{t−1} · θ^t_x − v^t · θ^{t−1}_x) / (v^{t−1} · v^t)

       = (v^{t−1} · (θ^{t−1}_x + e^t_x · u^t) − (v^{t−1} + u^t) · θ^{t−1}_x) / (v^{t−1} · v^t)

       = (v^{t−1} · e^t_x · u^t − u^t · θ^{t−1}_x) / (v^{t−1} · v^t)

       = (u^t / v^t) · (v^{t−1} · e^t_x − θ^{t−1}_x) / v^{t−1}

       = (u^t / v^t) · (e^t_x − θ^{t−1}_x / v^{t−1}) = (u^t / v^t) · (e^t_x − q^{t−1}_x).
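A quick numerical check of this identity (not from the slides; the propensities
and the payoff are arbitrary illustrative numbers):

    # Check: Delta q^t_x = (u^t / v^t) * (e^t_x - q^{t-1}_x) for a single CPM step.
    theta_prev = [5.0, 3.0, 2.0]          # theta^{t-1}
    v_prev = sum(theta_prev)              # v^{t-1} = 10
    u_t, played = 4.0, 1                  # the second action is played and earns u^t = 4

    theta = list(theta_prev)
    theta[played] += u_t                  # CPM update
    v = sum(theta)                        # v^t = v^{t-1} + u^t

    for x in range(3):
        e = 1.0 if x == played else 0.0
        lhs = theta[x] / v - theta_prev[x] / v_prev
        rhs = (u_t / v) * (e - theta_prev[x] / v_prev)
        assert abs(lhs - rhs) < 1e-12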
Dynamics of the mixed strategy: convergence
The dynamics of the mixed strategy in round t is given by
    ∆q^t = (u^t / v^t) · (e^t − q^{t−1}).
On coordinate x:
    ∆q^t_x = (u^t / v^t) · (e^t_x − q^{t−1}_x).
We have:
    ‖∆q^t‖ = (u^t / v^t) · ‖e^t − q^{t−1}‖ ≤ (u^t / v^t) · 2
           = [u^t / (u^1 + · · · + u^t)] · 2
           ≤ [max{u^s | s ≤ t} / (t · min{u^s | s ≤ t})] · 2 = C · (1/t).
Thus, since all terms except v^t are bounded, lim_{t→∞} ‖∆q^t‖ = 0 (which by itself
does not imply convergence).
Does q^t converge? If so, to the “right” (e.g., a Pareto optimal) strategy? Beggs
(2005) provides more clarity in certain circumstances.
Abstraction of past payoffs: t^p
In 1991 and 1993, B. Arthur proposed the following update formula:
    ∆q^t = [u^t / (C·t^p + u^t)] · (e^t − q^{t−1})
Consequently,
    ‖∆q^t‖ ≤ 1/t^p.
Remarks:
• Arthur’s notation differs considerably from that of Peyton Young (2004).
• If the parameter p is set to, e.g., 2, then there is convergence. However . . .
• In related research, where the value of p is determined through
psychological experiments, it is estimated that p < 1.
B. Arthur (1993): “On Designing Economic Agents that Behave Like Human Agents”. In: Journal of Evolutionary Economics 3, pp. 1–22.
Past payoffs at discount rate of λ
In 1995, Erev and Roth proposed the following update formula:
    θ^{t+1} = λ·θ^t + u^t·e^t.
Consequently,
    ∆q^t = [u^t / ∑_{s≤t} λ^{t−s}·u^s] · (e^t − q^{t−1}).
(For simplicity, we assume ‖θ^0‖ = 0.)
Since
    (∑_{s≤t} λ^{t−s}) · min{u^s | s ≤ t} ≤ ∑_{s≤t} λ^{t−s}·u^s ≤ (∑_{s≤t} λ^{t−s}) · max{u^s | s ≤ t}
and since
    1 + λ + λ^2 + · · · + λ^{t−1} = (1 − λ^t)/(1 − λ)
for λ ≠ 1, the denominator stays of order 1/(1 − λ), so the mixed strategy keeps
changing at a rate of order u^t·(1 − λ): unlike with CPM, the change does not
vanish as t → ∞.
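A minimal sketch (not from the slides) of this discounted variant; λ and the
payoff range are illustrative assumptions:

    import random

    lam = 0.9
    theta = [0.0, 0.0]                   # assuming ||theta^0|| = 0, as on the slide

    for t in range(1, 51):
        v = sum(theta)
        if v == 0.0:
            q = [0.5, 0.5]               # no information yet: play uniformly
        else:
            q = [th / v for th in theta]
        x = random.choices([0, 1], weights=q)[0]
        u_t = random.uniform(1.0, 2.0)               # bounded payoffs
        theta = [lam * th for th in theta]           # discount the past ...
        theta[x] += u_t                              # ... and reinforce the played action

Because the total propensity stays of order 1/(1 − λ) rather than growing with t,
the step sizes of the mixed strategy do not die out.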
Past payoffs represented by an aspiration level
Assume an aspiration level a^t ∈ R at every round. (Intuition: payoff with
which one would be satisfied.)
Idea:
    u^t_x > a^t ⇒ positively reinforce action x,
    u^t_x < a^t ⇒ negatively reinforce action x.
Correspondingly, the mixed strategy evolves according to
    ∆q^t = (u^t − a^t) · (e^t − q^{t−1}).
Typical definitions for aspiration:
• Average past payoffs: a^t =Def v^t/t. A.k.a. satisficing play (Crandall, 2005).
• Discounted past payoffs: a^t =Def ∑_{s≤t} λ^{t−s}·u^s (Erev &amp; Roth, 1995).
Börgers and Sarin (2000): “Naive Reinforcement Learning with Endogenous Aspirations”. In: International Economic Review 41, pp. 921–950.
Adequacy of reinforcement learning
Does reinforcement learning lead to optimal behaviour against B?
If A and B both converged to optimal behaviour, i.e., to mutual best responses,
this would yield a Nash equilibrium.
Less demanding:
Does reinforcement learning converge to optimal behaviour in a stationary
(and, perhaps, stochastic) environment?
• A history is a finite sequence of actions ξ^t : (x^1, y^1), . . . , (x^t, y^t).
• A strategy is a function g : H → ∆(X) that maps histories to probability
distributions over X. Write
    q^{t+1} =Def g(ξ^t).
Optimality against stationary opponents
• Assume that B plays a fixed probability distribution q∗ ∈ ∆(Y).
• The combination of θ^0, g and q^∗ yields a realisation
    ω = (x^1, y^1), . . . , (x^t, y^t), . . . .
• Define B(q^∗) =Def { x ∈ X | x is a best response to q^∗ }.
Definition. A strategy g is called optimal against q^∗ if, with probability one,
    for all x ∉ B(q^∗): lim_{t→∞} q^t_x = 0.   (1)
In this case, the phrase “with probability one” means that the set of realisations
satisfying (1) has probability one.
Theorem. Given finite action sets X and Y, cumulative payoff matching on
X is optimal against every stationary distribution on Y.
Peyton Young (2004, p. 17): “Its proof is actually quite involved (. . . )”. (FSs.)
Part III: Beggs, 2005
The learning model
Single-state proportional reinforcement learning (Erev & Roth, 1995).
As usual:
    A_i(n + 1) = A_i(n) + π_i(n + 1)   if action i is chosen,
    A_i(n + 1) = A_i(n)                otherwise.
As usual:
    Pr_i(n + 1) = A_i(n) / ∑_{j=1}^m A_j(n).
The following two assumptions are crucial:
1. All past, current and future payoffs πi(n) are bounded away from zero
and bounded from above. More precisely, there are 0 < k1 ≤ k2 such that
all payoffs are in [k1, k2].
2. Initial propensities Ai(0) are strictly positive.
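A minimal sketch (not from the slides) of this scheme with two actions, where the
first action’s payoffs dominate the second’s in expectation (the situation of
Theorem 1 below); the payoff distributions are illustrative assumptions:

    import random

    A = [1.0, 1.0]                     # A_i(0) > 0

    def payoff(i):
        # Bounded payoffs in [k_1, k_2] = [1, 3]; action 0 (the "a_1" of Theorem 1)
        # has a strictly higher expected payoff than action 1.
        return random.uniform(2.0, 3.0) if i == 0 else random.uniform(1.0, 1.5)

    for n in range(100_000):
        p0 = A[0] / (A[0] + A[1])      # probability of the dominant action
        i = 0 if random.random() < p0 else 1
        A[i] += payoff(i)

    print(A[0] / (A[0] + A[1]))        # tends to 1 over long runs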
Choice of actions
Lemma 1. Each action is chosen infinitely often with probability one.
Proof. From the above assumptions it follows that
    Pr_i(n + 1) = A_i(n) / ∑_{j=1}^m A_j(n) ≥ A_i(0) / (A_i(0) + n·k_2).
(This is the worst case for i: as if i was never chosen and, in all of the previous
n rounds, actions ≠ i received the maximum possible payoff.)
Apply the so-called conditional Borel–Cantelli lemma (a.k.a. the second
Borel–Cantelli lemma, or the Borel–Cantelli–Lévy lemma; Shiryaev, p. 518): if
{E_n}_n are events, and
    ∑_{n=1}^∞ Pr(E_n | X_1, . . . , X_{n−1})
is unbounded (we leave X_1, . . . , X_{n−1} undiscussed), then the probability that
an infinite number of the E_n occur is one. □
Unboundedness of propensities, and convergence
Lemma 2. For each i, Ai tends to infinity with probability one.
Proof. For each i, action i is chosen infinitely often with probability one. Since
the payoff per round is bounded from below by k_1, we have ∑_j k_1 ≤ A_i, where j
runs over the rounds in which i is chosen; this sum tends to infinity. □
Lemma 1 + Lemma 2 + theory of martingales suffice to prove convergence.
Suppose there are only two possible actions: a1 and a2. The expression
E[ π(ai) | history ]
denotes the expected payoff of action ai, given history of play up to and
including the choice to play ai itself.
Theorem 1. If
    E[ π(a_1) | history ] > γ·E[ π(a_2) | history ]   (2)
for some fixed γ > 1, then the probability that a_1 will be played converges to one.
Convergence to the dominant action: proof in a nutshell
• If a_1 is dominant (as in Eq. 2), the objective is to show that
    (A_2 / A_1)(n)   (3)
goes to zero with probability one.
• To this end, Beggs shows that, for all n ≥ N (some N), and for all 0 < ǫ < γ,
    (A_2^ǫ / A_1)(n)
is a so-called non-negative super-martingale. (Explained in a moment.)
• It is known that every non-negative super-martingale converges to a finite
limit with probability one. (Explained in a moment.)
• If we choose ǫ such that 1 < ǫ < γ (which is possible, since γ > 1), it
follows that (3) goes to zero. [Write A_2/A_1 = (A_2^ǫ/A_1)·(1/A_2^{ǫ−1}). Both
limits exist and their product is zero, for A_2 → ∞ and hence A_2^{ǫ−1} → ∞.]
Super-martingale
A super-martingale is a stochastic process in which the conditional expectation of
the next value, given the current and preceding values, is less than or equal to
the current value:
    E[ Z_{n+1} | Z_1, . . . , Z_n ] ≤ Z_n
A super-martingale embodies the concept of an unfair gambling game that
proceeds in rounds, for example roulette (remember the green pocket).
1. Taking expectations on both sides yields E[ Z_{n+1} ] ≤ E[ Z_n ].
2. From (1) and the monotone convergence theorem (ordinary mathematics), it follows
that the expectations of a non-negative super-martingale converge to a limit L
somewhere in [0, E[ Z_1 ]].
3. Doob’s Martingale Convergence Theorem: let {Z_n}_n be a martingale (or
sub-martingale, or super-martingale) such that E[ |Z_n| ] is bounded. Then
lim_{n→∞} Z_n exists and is finite.
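As an illustration (not from the slides), a non-negative super-martingale is easy
to simulate; the multiplicative noise below is an illustrative choice with mean
0.95 ≤ 1:

    import random

    Z = 1.0
    for n in range(10_000):
        Z *= random.uniform(0.0, 1.9)   # E[Z_{n+1} | Z_n] = 0.95 * Z_n <= Z_n, and Z_n >= 0
    print(Z)                            # by Doob's theorem Z_n converges; here the limit is 0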
To show that A_2^ǫ/A_1 is a non-neg super-martingale
E[ ∆(A_2^ǫ/A_1)(n + 1) | history ]
  = Pr(1 | history) · E[ ∆(A_2^ǫ/A_1)(n + 1) | 1, history ]
  + Pr(2 | history) · E[ ∆(A_2^ǫ/A_1)(n + 1) | 2, history ]
  = [A_1(n) / (A_1(n) + A_2(n))] · E[ A_2^ǫ(n)/(A_1(n) + π_1(n + 1)) − A_2^ǫ(n)/A_1(n) ]
  + [A_2(n) / (A_1(n) + A_2(n))] · E[ (A_2(n) + π_2(n + 1))^ǫ/A_1(n) − A_2^ǫ(n)/A_1(n) ].
To show that A_2^ǫ/A_1 is a non-neg super-martingale
Taylor expansion:
    f(x + h) = f(x) + h·f′(x) + (h²/2!)·f″(x) + (h³/3!)·f‴(x + θh),
where the last term is the Lagrange remainder, for some θ ∈ (0, 1). (Of course,
there is nothing special about n = 3.)
Applied to f(x) = x^{−1} (here the expansion is used up to first order, with a
second-order Lagrange remainder) we obtain
    (x + h)^{−1} = x^{−1} − h·x^{−2} + h²·(x + θh)^{−3}
                 = 1/x − h/x² + h²/(x + θh)³.
To show that A_2^ǫ/A_1 is a non-neg super-martingale
Applying Taylor expansion up to rest term n = 3 to f(x) = x^{−1} yields
    (x + h)^{−1} = x^{−1} − h·x^{−2} + h²·(x + θh)^{−3}
                 = 1/x − h/x² + h²/(x + θh)³.
For non-negative x and h we have x³ ≤ (x + θh)³, so that
    (x + h)^{−1} ≤ 1/x − h/x² + h²/x³.
This first inequality puts an upper bound with “pure” x and h on
    1/(A_1(n) + π_1(n + 1)).
To show that A_2^ǫ/A_1 is a non-neg super-martingale
Similarly, applying Taylor expansion up to rest term n = 3 to f(x) = x^ǫ
yields
    (x + h)^ǫ = x^ǫ + h·ǫ·x^{ǫ−1} + h²·(ǫ − 1)·ǫ·(x + θh)^{ǫ−2}.
For non-negative x and h and ǫ > 1, we have
    (ǫ − 1)·(x + θh)^{ǫ−2} ≤ C·x^{ǫ−2}
for some constant C, so that
    (x + h)^ǫ ≤ x^ǫ + h·ǫ·x^{ǫ−1} + h²·C·ǫ·x^{ǫ−2}.
This second inequality puts an upper bound with “pure” x and h on
    (A_2(n) + π_2(n + 1))^ǫ.
To show that A_2^ǫ/A_1 is a non-neg super-martingale
• Using E[aX + b] = a·E[X] + b and factoring out common terms, Beggs
obtains the upper bound
    [A_1/(A_1 + A_2)] · [A_2^ǫ/A_1²](n) · [ −E[π_1(n + 1)] + c_1·E[π_1(n + 1)²]/A_1(n) ]
  + [1/(A_1 + A_2)] · [ǫ·A_2^ǫ/A_1](n) · [ E[π_2(n + 1)] + c_2·E[π_2(n + 1)²]/A_2(n) ].
• Because payoffs are bounded, constants K_1, K_2, K_3 can be found such that
this is at most
    [A_2^ǫ/A_1] · [1/(A_1 + A_2)] · ( K_1·(ǫ − γ) + K_2/A_1 + K_3/A_2 )(n).
• For ǫ ∈ (1, γ) and for n large enough, this expression is non-positive. □
Generalisation of Begg’s Theorem 1, and application to games
Let there be m ≥ 2 alternative actions, a1, . . . , am (rather than m = 2).
Theorem 2. If the expected payoff (conditional on the history) of ai dominates the
expected payoff (conditional on the history) of aj, for all j 6= i, then the probability
that aj will be played converges to zero, for all j 6= i.
Applied to games:
Theorem 3. In a game with finitely many actions and players, if a player learns
according to the ER scheme, then:
a. With probability 1, the probability and empirical frequency with which he plays
any action that is strictly dominated by another pure strategy converge to zero.
b. Hence, if he has a strictly dominant strategy, then with probability 1 the
probability and empirical frequency with which he plays that action converge to 1.
(Beggs, 2005).
Summary
• There are several rules for
reinforcement learning on single
states.
• Sheer convergence is often easy
to prove.
• Convergence to best actions in a
stationary environment is much
more difficult.
• Convergence to best actions in
non-stationary environments, e.g.,
convergence to dominant actions,
or to best responses in self-play, is
state-of-the-art research.
What next?
• No-regret learning: this is a generalisation of reinforcement learning.
No-regret =Def play those actions that would have been successful in
the past.
• Similarities with reinforcement learning:
1. Driven by past payoffs.
2. Not interested in (the behaviour of) the opponent.
3. Myopic.
• Differences:
a) Keeping accounts of hypothetical actions rests on the assumption that a
player is able to estimate payoffs of actions that were not actually
played. [Knowledge of the payoff matrix definitely helps, but is an
even more severe assumption.]
b) It is somewhat easier to obtain results regarding performance.