Markov Decision Evolutionary Gamesmescal.imag.fr/membres/corinne.touati/events/hayel.pdfUniversity of Avignon, France Popeye seminar, 2008 1/39. logo Introduction ... invaders (mutations)

logo

IntroductionMarkov Decision Evolutionary Games

Application to energy management in wireless networksNumerical illustrations

Conclusions and perspectives

Markov Decision Evolutionary Games

Eitan Altman, Yezekael Hayel,Hamidou Tembine, Rachid El Azouzi

INRIA Sophia-Antipolis, FranceUniversity of Avignon, France

Popeye seminar, 2008

1 / 39

logo




Plan

1 Introduction

2 Markov Decision Evolutionary Games

3 Application to energy management in wireless networks

4 Numerical illustrations

5 Conclusions and perspectives

2 / 39

logo




Plan

1 Introduction





3 / 39

logo




Evolutionary Game Theory

Evolutionary Stable Strategy (ESS)

The ESS is characterized by a property of robustness againstinvaders (mutations). More specifically,

if an ESS is reached, then the proportions of each population donot change in time.at ESS, the populations are immune from being invaded by othersmall populations.restriction to interactions that are limited to pairwise.

This notion is stronger than Nash equilibrium in which it is onlyrequested that a single user would not benefit by a change (mutation)of its behavior.ESS is robust against a deviation of a whole fraction of thepopulation.

4 / 39

logo




Definitions

ESSJ(p, q) the expected immediate payoff for an individual if it uses astrategy p when meeting another individual who adopts thestrategy q.K available strategies which are called pure strategies.

Definition

A strategy q is said to be an ESS if for every p 6= q there exists someεq > 0 such that for all ε ∈ (0, εq):

J(q, εp + (1− ε)q) > J(p, εp + (1− ε)q)

5 / 39

logo




Important theorem

TheoremA strategy q is an ESS if and only if it satisfies

for all p 6= q, J(q, q) > J(p, q),

orfor all p 6= q, J(q, q) = J(p, q) and J(q, p) > J(p, p).

6 / 39

logo




Markov Decision Evolutionary Games (MDEG)

MDEGThe fitness of a player depends not only on the actions chosen inthe interaction but also on the individual state of the players.Players have finite life time and take during which theyparticipate in several local interactions.The actions taken by a player determine not only the immediatefitness but also the transition probabilities to its next individualstate.

7 / 39

logo




Interactions and system modelComputing the Weak (resp. Strong) ESSWESS and dynamic programming

Plan

1 Introduction





8 / 39

logo





Model for individual player

Individual MDPWe associate with each player a Markov Decision Process (MDP)embedded at the instants of the interactions. The parameters of theMDP are given by the tuple {S,A, Q} where

S is the set of possible individual states of the player.A is the set of available actions. For each state s, a subset As ofactions is available.Q is the set of transition probabilities; for each s, s′ ∈ S anda ∈ As, Qs′(s, a) is the probability to move from state s to state s′

taking action a.∑

s′∈S Qs′(s, a) is allowed to be smaller than 1.

9 / 39

logo





Model for individual player

PoliciesDefine further

The set of policies is U . A general policy u is a sequenceu = (u1, u2, . . .) where ui is a distribution over action space A attime i .The subset of mixed (resp. pure or deterministic) policies is UM(resp. UD). We define also the set of stationary policies US wheresuch policy does not depend on time.α(u) = {α(u; s, a)} is the fraction of the population at individualstate s and that use action a when all the population usesstrategy u.

10 / 39

logo





Interactions and system model

Notationsr(s, a, s′, b) be the immediate reward that a player receives whenit is at state s and it uses action a while interacting with a playerwho is in state s′ that uses action b.The expected immediate reward of a player in state St andplaying action At at time t is given by

Rt =∑s,a

αt(u; s, a)r(St , At , s, a).

The global expected fitness when using a policy v is then

Fη(v , u) =∞∑

t=1

Eη,v [Rt ],

where η is the initial state distribution.11 / 39

logo





Assumptions

Assumptions

A1 : the expected lifetime of a player Tη,u is finite for all u ∈ UD.A2 : When the whole population uses a policy u, then at any timet which is either fixed or is an individual time of an arbitraryplayer, αt(u) is independent of t and is given by

αt(u; s, a) =fη,u(s, a)

Tη,u

for all s, a and where fη,u(s, a) =∑+∞

t=1 pt(η, u; s, a) is theexpected number of time units during which it is at state s and itchooses action a.

12 / 39

logo





Defining the weak (resp. strong) ESS

Equivalent class of strategies

We shall say that two strategies u and u′ are equivalent if thecorresponding occupation measures are equal for all state. We shallwrite u =e u′.

Definition of the WESS (resp. SESS)

A strategy u is a weak (resp. strong) ESS, denoted by WESS (resp.SESS), for the MDEG if and only if it satisfies one of the following:

for all v 6=e u (resp. v 6= u), F (u, u) > F (u, v) (1)

for all v 6=e u (resp. v 6= u), F (u, u) = F (v , u) and F (u, v) > F (v , v)(2)

13 / 39

logo





Transforming the MDEG into a standard EG

MDEG into an EGThe fitness function is bilinear in the occupation measures of theplayers. The set of occupation measures is a polytope whoseextreme points correspond to strategies in UD.Consider the following standard evolutionary game EG:

the finite set of actions of a player is UD,the fitness of a player that uses v ∈ UD when the other use apolicy u ∈ US is given by

F̃ (v , u) =∑s,a

fη,v (s, a)∑s′,a′

fη,u(s′, a′)r(s, a, s′, a′).

Enumerate the strategies in UD such that UD = (u1, ..., um).Define γ = (γ1, ..., γm) where γi is the fraction of the populationthat uses ui .

14 / 39

logo





Transforming the MDEG into a standard EG

Proposition

Let γ̂ be an ESS for the game EG. Then it is a WESS for the originalMDEG.

15 / 39

logo





What about stationary policies ?

Theorem

(i) A necessary condition for a policy u to be WESS is thatF (u, u) ≥ F (v , u) for all stationary v.(ii) Assume that the following set of dynamic programming equationsholds: For all state s ∈ S,

Fs(u, u) = maxa

[r(u; s, a) +

∑s′

Qs′(s, a)Fs′(u, u)

]. (3)

Then F (u, u) ≥ F (v , u) for any v.(iii) If η(s) > 0 for all s, then the converse also holds: and (3) isequivalent to F (u, u) ≥ F (v , u) for all stationary v.

16 / 39

logo




ModelDP approachMatrix Game approach

Plan

1 Introduction





17 / 39

logo





Model of Energy Management in a DistributedAloha Network

Actions and statesA terminal i attempts transmissions during time slots.At each attempt, it has to take a decision on the transmissionpower based on his battery energy state.We assume that the state can take three values: {F , A, E} forFull, Almost empty or Empty.The transmission signal power of a terminal can be High (h) orLow (l).Transmission at high power is possible only when the mobile is instate F .

18 / 39

logo





Model of Energy Management in a DistributedAloha Network

Aloha-type game

A mobile transmits a packet with success during a slot if:the mobile is the only one to transmit during this slotthe mobile transmits with high power and all others transmittingnodes use low power

19 / 39

logo





Some notations

Notationsp is the probability for a mobile to be the only transmitter during aslot.Qi(a) is the probability of remaining at energy level i when usingaction a.α is the fraction of the population who use the action h at anygiven time (situation in which the system attains a stationaryregime).

20 / 39

logo





Policies and fitness

PoliciesA general policy u is a sequence u = (u1, u2, ...) where ui is theprobability of choosing h if at time i the state is F . We consider onlystationary policies, ui = β for all time i .

FitnessLet Rt denote the number of packets (zero or one) successfullytransmitted at time slot t and the fitness of the terminal to be given by∑∞

t=1 Rt .Vβ(i , α) is the total expected fitness (i.e. reward or valuation) of auser given that it uses policy β, that it is in state i and given theparameter α.

21 / 39

logo





Computing fitness and sojourn times

State E and AWhen the level of energy is in state E , the valuation is equal toV (E) = 0.When the state is A, the valuation is V (A) = p

1−QAand expected

time during which a mobile spends in state A is T (A) = 11−QA

.

State FDefine the dynamic programming operator Y (v , a, α) to be the totalexpected fitness of an individual starting at state F , if

It takes action a at time 1,If at time 2 the state is F then the total sum of expected fitnessfrom time 2 onwards is v .At each time the mobile attempts transmission, the probabilitythat another interfering mobile uses action h is α.

22 / 39

logo






State F

Y (v , l) = p + QF (l)v + p1−QF (l)

1−QA.

and

Y (v , h, α) = α(p + QF (h)v + (1−QF (h))V (A))

+(1− α)(1 + QF (h)v + (1−QF (h))V (A)),

= αp + (1− α) + QF (h)v + p1−QF (h)

1−QA.

23 / 39

logo






State FThe expected time it spends at state F is

T (F ) =1

1− βQF (h)− (1− β)QF (l).

The fraction of time that the mobile uses action h is then

α̂(β) = βT (F )

T (F ) + T (A)= β

1−QA

2−QA − βQF (h)− (1− β)QF (l).

24 / 39

logo






State FThe total expected utility Vβ(F , α) the mobile gains starting from stateF is the unique solution of v = (1−β)Y (v , l)+βY (v , h, α). This gives

Vβ(F , α) = V (A) +p + β(1− p)(1− α)

1−QF (l) + β(QF (l)−QF (h)).

some remarks

aggressive policy β = 1, V1(F , α) = V (A) + 1−α(1−p)1−QF (h) ,

passive policy β = 0, V0(F , α) = V (A) + p1−QF (l) ,

Vβ(F , α) is either constant or strictly monotone in β over thewhole interval [0, 1].

25 / 39

logo





Characterization of the ESS

Relation with EGThe fitness that is maximized is not the outcome of a singleinteraction but of the sum of fitnesses obtained during all theopportunities in the mobile’s lifetime.The ESS can be defined using the following fitness:

Vβ(F , α̂(β′)) = J(β, β′).

A necessary condition for β∗ to be an ESS is

for all β′ 6= β∗, Vβ∗(F , α̂(β∗)) ≥ Vβ′(F , α̂(β∗)).

26 / 39

logo





Pure equilibrium with high power

TheoremDefine

∆h :=1−QF (h)

2−QA −QF (h)(1− p)− QF (l)−QF (h)

1−QF (l)p

Let u be the pure aggressive strategy that uses always h at state F .(i) ∆h > 0 is a sufficient condition for u to be an ESS.(ii) ∆h ≥ 0 is a necessary condition for u to be an ESS.

Remarks:If QF (l) = QF (h), the strategy high power is obviously an ESS.If p = 0, there is no benefit from transmission with high power.∆h > 0 is a sufficient and necessary condition for u to be astrongly immune ESS.

27 / 39

logo





Pure equilibrium with low power

TheoremDefine

∆l := p(1−QF (h))− (1−QF (l))

Let v be the pure strategy that uses always l at state F .(i) ∆l > 0 is a sufficient condition for v to be an ESS.(ii) ∆l ≥ 0 is a necessary condition for v to be an ESS.

Remarks:If QF (l) = QF (h), the strategy low power is not an ESS.The condition for the policy v to be ESS does not depend on QA.∆l > 0 is a sufficient and necessary condition for v to be astrongly immune ESS.

28 / 39

logo





Mixed equilibrium

Theorem(a) Each one of the following conditions is necessary for there to exista Weakly Immune ESS:

Condition (i): ∆l ≤ 0,Condition (ii): ∆h ≤ 0,

(b) Assume that Condition (i) and (ii) hold. Then there exists a uniqueweakly immune ESS given by

β∗ =(QA + QF (l))[QF (l)− pQF (h)]

QApQF (l)− (QF (l)−QF (h))(QF (l)− pQF (h))

TheoremFor all QA, QF (l), QF (h) and p, the ESS β∗ of the stochasticevolutionary game exists and is unique.

29 / 39

logo





Price of Anarchy (equilibrium aggressivenesscomparison)

PoA

The strategy β̃ is globally optimal if it maximizes Vβ(F , α̂(β)).The global optimal solution is solution of a second orderpolynomial function.We compare the optimal global solution to the ESS.

Theorem

ESS strategy is more aggressive than the social optimum strategy, i.e.

β∗ ≥ β̃.

30 / 39

logo





Matrix Game of the EG

Matrix Game with deterministic policies

We restrict to the deterministic policies u1 = (l , l) and u2 = (l , h) (usealways high power in state F ). The WESS of the MDEG is the ESS ofa standard EG defined by through the related matrix game:

G̃ =

(p(X1 + X3)

2 p(X1 + X3)(X1 + X4)

(X1 + X3)(p(X1 + X4) + (1− p)X4) p(X1 + X4)2 + (1− p)X1X4

)with

X1 =1

1 − Q(1, l), X2 =

1

1 − Q(1, h), X3 =

1

1 − Q(2, l), X4 =

1

1 − Q(2, h).

31 / 39

logo





ESS of the EG

Proposition

The ESS γ̂ exists and is unique.

Proposition

Policies β∗ and γ̂ are in the same equivalent class,i.e.

β∗ =e γ̂.

32 / 39

logo




Plan

1 Introduction





33 / 39

logo




ESS and the global optimumComparison of the ESS and the global optimum depending on theprobability p.

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.2

0.4

0.6

0.8

1

p

popu

latio

ns

β social optimum

β* ESS

34 / 39

logo




ESS and the global optimumComparison of the ESS and the global optimum depending on thedifference QF (l)−QF (h).

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90

0.2

0.4

0.6

0.8

1

QF(l)−Q

F(h)

popu

latio

ns

β social optimum

β* ESS

35 / 39

logo




Comparison of the two approachesComparison of the two mixed ESS β∗ and γ̂ given by the twoapproaches.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.2

0.4

0.6

0.8

1

p

ES

S

β*

γ^

36 / 39

logo




Plan

1 Introduction





37 / 39

logo




Conclusions

ConclusionsExtension of evolutionary game paradigm considering actionstate dependance.Application to competitive energy management in wirelessterminals.Two different methods for computing ESS of a MDEG.

38 / 39

logo




Perspectives

Perspectives

Generalization of our energy management application.Develop the theoretical results to other rewards like the meanand the discounted ones.Notion of population dynamics into this framework.

39 / 39

Documents

Markov Decision Evolutionary Gamesmescal.imag.fr/membres/corinne.touati/events/hayel.pdfUniversity of Avignon, France Popeye seminar, 2008 1/39. logo Introduction ... invaders (mutations)