Dynamic Potential Games with Constraints: Fundamentals and Applications in Communications

Santiago Zazo, Member, IEEE, Sergio Valcarcel Macua, Student Member, IEEE, Matilde Sánchez-Fernández, Senior Member, IEEE, Javier Zazo

Abstract—In a noncooperative dynamic game, multiple agents operating in a changing environment aim to optimize their utilities over an infinite time horizon. Time-varying environments allow us to model more realistic scenarios (e.g., mobile devices equipped with batteries, wireless communications over a fading channel, etc.). However, solving a dynamic game is a difficult task that requires dealing with multiple coupled optimal control problems. We focus our analysis on a class of problems, named dynamic potential games, whose solution can be found through a single multivariate optimal control problem. Our analysis generalizes previous studies by considering that the set of environment's states and the set of players' actions are constrained, as required by most applications. We also show that the theoretical results are the natural extension of the analysis for static potential games. We apply the analysis and provide numerical methods to solve four key example problems, each with different features: i) energy demand control in a smart-grid network, ii) network flow optimization in which the relays have bounded link capacity and limited battery life, iii) uplink multiple access communication with users that have to optimize the use of their batteries, and iv) two optimal scheduling games with nonstationary channels.

Index Terms—Dynamic games, dynamic programming, game theory, multiple access, network flow, optimal control, resource allocation, scheduling, smart grid.

    I. INTRODUCTION

Game theory is a field of mathematics that studies conflict and cooperation between intelligent decision makers [1]. It has become a useful tool for modeling communication and networking problems, such as power control and resource sharing (see, e.g., [2]), wherein the strategies followed by the users (i.e., players) influence each other, and the actions have to be taken in a decentralized manner. However, one main assumption of classic game theory is that the users operate in a static environment, which is not influenced by the players' actions. This assumption is unrealistic in many communication and networking problems. For instance, wireless devices have to maximize throughput while facing time-varying fading channels, and mobile devices may have to control their transmitter power while saving their battery

This work has been partly funded by the Spanish Ministry of Economy and Competitiveness under the grant TEC2013-46011-C3-1-R, by the Spanish Ministry of Science and Innovation with the project ELISA (TEC2014-59255-C3-R3) and by an FPU doctoral grant to the fourth author.

S. Zazo, S. Valcarcel Macua and J. Zazo are with the Signals, Systems & Radiocommunications Dept., Universidad Politécnica de Madrid. E-mail: {santiago,sergio}@gaps.ssr.upm.es, [email protected].

M. Sánchez-Fernández is with the Signal Theory & Communications Department, Universidad Carlos III de Madrid. E-mail: [email protected].

level. These time-varying scenarios can be better modeled by dynamic games.

In a noncooperative dynamic game, the players compete in a time-varying environment, which we assume can be characterized by a deterministic discrete-time dynamical system equipped with a set of states and a Markovian state-transition equation. Each player has its utility function, which depends on the current state of the system and the players' current actions. Both the state and action sets are subject to constraints. Since the state-transitions induce a notion of time-evolution in the game, we consider the general case wherein utilities, state-transition function and constraints can be nonstationary. A dynamic game starts at an initial state. Then, the players take some action, based on the current state of the game, and receive some utility values. Then, the game moves to another state. This sequence of state-transitions is repeated at every time step over a (possibly) infinite time horizon. We consider the case in which the aim of each player is to find the sequence of actions that maximizes its long term cumulative utility, given other players' sequence of actions. Thus, a game can be represented as a set of coupled optimal-control-problems (OCP), which are difficult to solve in general. Fortunately, there is a class of dynamic games, named dynamic potential games (DPG), that can be solved through a single multivariate-optimal-control-problem (MOCP). The benefit of DPG is that solving a single MOCP is generally simpler than solving a set of coupled OCP (see [3] for a recent survey on DPG).

The pioneering work in the field of DPG is that of [4], later extended by [5] and [6]. There have been two main approaches to study DPG: the Euler-Lagrange equations and the Pontryagin's maximum (or minimum) principle. Recent analysis by [3] and [7] used the Euler-Lagrange equations with DPG in its reduced form, that is, when it is possible to isolate the action from the state-transition equation, so that the action is expressed as a function of the current and future (i.e., after transition) states. Consider, for example, that the future state is linear in the current action; then, it is easy to invert the state-transition function and rewrite the problem in reduced form, with the action expressed as a function of the current and future states. However, in many cases, it is not possible to find such reduced form of the game (i.e., we cannot isolate the action) because the state-transition function is not invertible (e.g., when the state transition function is quadratic in the action variable). The more general case of DPG in nonreduced form was studied with the Pontryagin's maximum principle approach by [5] and [8] for discrete and continuous time models, respectively. However, in all these studies [3]–[8],



the games have been analyzed without explicitly considering constraints for the state and action sets.

Other works that consider potential games with state-dynamics include [9]–[11]. However, these references study the myopic problem in which the agents aim to maximize their immediate reward. This is different from DPG, where the agents aim to maximize their long term utility by solving a control problem.

Dynamic games offer two kinds of possible analysis based on the type of control that players use. These cases are normally referred to as open loop (OL) and closed loop (CL) game analysis. In the open loop approach, in order to find the optimal action sequence, the players have to take into account other players' action sequences. On the other hand, in a closed loop approach, players find a strategy that is a function of the state, i.e., it is a mapping from states to actions. Thus, in order to find their optimal policies, they need to know the form of other players' policy functions. The OL analysis is, in general, more tractable than the CL analysis. Indeed, there are only a few known CL solutions for simple games, such as the fish war example presented in [12], oligopolistic Cournot games [13], or quadratic games [14].

The main theoretical contribution of this work is to analyze DPG with constrained action and state sets, as required by most applications (e.g., in a network flow problem, the aggregated throughput of multiple users is bounded by the maximum link capacity; or in cognitive radio, the aggregated power of all secondary users is bounded by the maximum interference allowed by the primary users). To do so, we apply the Euler-Lagrange equation to the Lagrangian (as is customary in the MOCP literature [15]), rather than to the utility function (as done by earlier works [3] and [7]). Using the Lagrangian, we can formulate the optimality condition in the general nonreduced form (i.e., it is not necessary to isolate the action in the transition equation). In addition, we establish the existence of a suitable conservative vector field as an easily verifiable condition for a dynamic game to be of the potential type. To the best of our knowledge, this is a novel extension of the conditions established for static games by [16] and [17].

The second main contribution of this work is to show that the proposed framework can be applied to several communication and networking problems in a unified manner. We present four examples with increasing complexity level. First, we model the energy demand control in a smart grid network as a linear-quadratic-dynamic-game (LQDG). This scenario is illustrative because the analytical solution of an LQDG is known. The second example is an optimal network flow problem, in which there are two levels of relay nodes equipped with finite batteries. The users aim to maximize their flow while optimizing the use of the nodes' batteries. This problem illustrates that, when the utilities have some separable form, it is straightforward to establish that the problem is a DPG. However, the analytical solution for this problem is unknown and we have to solve it numerically. It turns out that, since all batteries will deplete eventually, the game will get stuck in this depletion-state. Hence, we can approximate the infinite-horizon MOCP by an effective finite-horizon problem, which simplifies the numerical computation. The third example is

an uplink multiple access channel wherein the users' devices are also equipped with batteries (this example was introduced in the preliminary paper [18]). Again, the simple, but more realistic, extension of battery-usage optimization makes the game dynamic. In this example, instead of rewriting the utilities in a separable form, we perform a very general analysis to establish that the problem is a DPG. The fourth example studies two decentralized scheduling problems: proportional fair and equal rate scheduling, where multiple users share a time-varying channel (see the preliminary paper [19]). This example shows how to use the proposed framework in its most general form. The problems are nonconcave and the utilities have a nonobvious separable form. The problem is nonstationary, with the state-transition equation changing with time. Moreover, there is no reason that justifies a finite horizon approximation of the problem, so we have to use optimal control methods (e.g., dynamic programming) to solve it numerically.

Outline: Sec. II introduces the problem setting, its solution and the assumptions on which we base our analysis. In Sec. III, we review static potential games together with the instrumental notion of conservative vector field. In Sec. IV, we provide sufficient conditions for a dynamic game with constrained state and action sets to be a DPG, and show that a DPG can be solved through an equivalent MOCP. Sections V–VIII deal with application examples, the methods for solving them, and some illustrative simulations. We provide some conclusions in Sec. IX.

II. PROBLEM SETTING

Let Q ≜ {1, . . . , Q} denote the set of players and let X ⊆


vector of all players' actions except that of player i. Hence, by slightly abusing notation, we can rewrite ut = (uit, u−it).

The state transitions are determined by f : X × U × N → X, such that the nonstationary Markovian dynamic equation of the game is xt+1 = f(xt,ut,t), which can be split among components: xkt+1 = fk(xt,ut,t) for k = 1, . . . , S, such that f ≜ (fk)_{k=1}^{S}. The dynamic is Markovian because the state transition to xt+1 depends on the current state-action pair (xt,ut), rather than on the whole history of state-action pairs {(x0,u0), . . . , (xt,ut)}. We remark that f corresponds to a nonreduced form, such that there is no function ϕ such that ut = ϕ(xt,xt+1,t).

We include a vector of C nonstationary constraints g ≜ (gc)_{c=1}^{C}, as required by most applications, and define the sets Ct ≜ {X × U} ∩ {(xt,ut) : g(xt,ut,t) ≤ 0} ∩ {(xt,ut) : xt+1 = f(xt,ut,t)}.

Each player has its nonstationary utility function πi : Xi × U × N →


the concept of conservative vector field. The following lemma will be useful to this end.

Lemma 1. Let F(u) = (F1(u), . . . , FQ(u)) be a vector field with continuous derivatives defined over an open convex set U ∈


Lemma 3. Problem (1) is a DPG if the utility function of every player i ∈ Q can be expressed as the sum of a term that is common to all players plus another term that depends neither on its own action, nor on its own state-components:

πi(xit, uit, u−it, t) = Π(xt, uit, u−it, t) + Θ(x−it, u−it, t)          (16)

Proof: By taking the partial derivative of (16) we obtain (12). Therefore, we can apply Lemma 2 (see also [16, Prop. 1]).

However, posing the utility in the separable structure (16) may be difficult. We need a more general framework that allows us to check whether problem (3) is a DPG when the players' utilities have a nonobvious separable structure. This framework is formally introduced in the following lemma.

Lemma 4. Problem (1) is a DPG if all players' utilities satisfy the following conditions, ∀i, j ∈ Q, ∀m ∈ X(i), ∀n ∈ X(j):

∂²πi(xit,ut,t) / (∂xmt ∂ujt) = ∂²πj(xjt,ut,t) / (∂xnt ∂uit)          (17)

∂²πi(xit,ut,t) / (∂xmt ∂xnt) = ∂²πj(xjt,ut,t) / (∂xnt ∂xmt)          (18)

∂²πi(xit,ut,t) / (∂uit ∂ujt) = ∂²πj(xjt,ut,t) / (∂ujt ∂uit)          (19)

Proof: Under Assumption 1, we can introduce the following vector field:

F ≜ ( ∇x1t π1(x1t,ut,t)⊤, . . . , ∇xQt πQ(xQt,ut,t)⊤, ∂π1(x1t,ut,t)/∂u1t, . . . , ∂πQ(xQt,ut,t)/∂uQt )          (20)

where ∇xit πi(xit,ut,t) = ( ∂πi(xit,ut,t)/∂xmt )_{m∈X(i)}. From Lemma 2, we can express (20) as

F = ∇Π(xt,ut,t)          (21)

From Assumption 2 and Lemma 1.1, we know that F is conservative. Hence, Lemma 1.2 establishes that the second partial derivatives must satisfy (17)–(19).

Introduce the following MOCP:

P2 :  maximize_{{ut}∈∏_{t=0}^{∞} U}   ∑_{t=0}^{∞} β^t Π(xt,ut,t)
      s.t.  xt+1 = f(xt,ut,t),  x0 given
            g(xt,ut,t) ≤ 0                                          (22)

Let us consider the following assumption, which is needed for establishing equivalence between a DPG and the MOCP (22).

Assumption 4. The MOCP (22) has a nonempty solution set.

Sufficient (and easily verifiable) conditions to satisfy Assumption 4 are given by the following lemma, which is a standard result in optimal control theory.

Lemma 5. Let Π : X × U × N → [−∞,∞) be a proper continuous function, and let any one of the following conditions hold for t = 1, . . . , ∞:
1) The constraint sets Ct are bounded.
2) Π(xt,ut,t) → −∞ as ‖(xt,ut)‖ → ∞ (coercive).
3) There exists a scalar M such that the level sets, defined by {(xt,ut,t) | Π(xt,ut,t) ≥ M}_{t=1}^{∞}, are nonempty and bounded.

Then, ∀x0 ∈ X, there exists an optimal sequence of actions {u⋆t}_{t=0}^{∞} that is a solution to the MOCP (22). Moreover, there exists an optimal policy φ⋆ : X × N → U, which is a mapping from states to optimal actions, such that when applied over the state-trajectory {xt}_{t=0}^{∞}, it provides an optimal sequence of actions {u⋆t ≜ φ⋆(xt,t)}_{t=0}^{∞}.

Proof: Since Π is proper, it has some nonempty level set. Since Π is continuous, its bounded level sets are compact. Hence, we can use [23, Prop. 3.1.7] (see also [23, Sections 1.2 and 3.6]) to establish existence of an optimal policy.

The main theoretical result of this work is that we can find an NE of a DPG by solving the MOCP (22). This is proved in the following theorem.

Theorem 1. If problem (1) is a DPG, under Assumptions 1–4, the solution of the MOCP (22) is an NE of (1) when the objective function of the MOCP is given by

Π(xt,ut,t) = ∫_0^1 ∑_{i=1}^{Q} ( ∑_{m∈X(i)} [∂πi(η(λ),ut,t)/∂xmt] [dηm(λ)/dλ] + [∂πi(xt,ξ(λ),t)/∂uit] [dξi(λ)/dλ] ) dλ          (23)

where η(λ) ≜ (ηk(λ))_{k=1}^{S}, ξ(λ) ≜ (ξi(λ))_{i=1}^{Q}, and η(0)-ξ(0) and η(1)-ξ(1) correspond to the initial and final state-action conditions, respectively.

The usefulness of Theorem 1 is that, in order to find an NE of (1), instead of solving several coupled control problems, we can check whether (1) is a DPG (i.e., any one of Lemmas 2–4 holds). If so, we can find an NE by computing the potential function (23) and, then, by solving the equivalent MOCP (22).

Proof: The proof is structured in five steps. First, we compute the Euler equation of the Lagrangian of the dynamic game and derive the KKT optimality conditions. Assumption 3 is required to ensure that the KKT conditions hold at the optimal point and that there exist feasible dual variables [20, Prop. 3.3.8]. Second, we study when the necessary optimality conditions of the game become equal to those of the MOCP. Third, we show that having the same necessary optimality conditions is a sufficient condition for the dynamic game to be potential. Fourth, having established that the dynamic game is a DPG, we show that the solution to the MOCP (whose existence is guaranteed by Assumption 4) is also an NE of the DPG. Finally, we derive the per stage utility of the MOCP as the potential function of a suitable vector field. We proceed to explain the details.

First, for problem (1), introduce each player's Lagrangian ∀i ∈ Q:

Li(xt,ut,λit,μit) = ∑_{t=0}^{∞} β^t ( πi(xit,ut,t) + λit⊤ (f(xt,ut,t) − xt+1) + μit⊤ g(xt,ut,t) ) = ∑_{t=0}^{∞} β^t Φi(xt,ut,t,λit,μit)          (24)

where λit ≜ (λikt)_{k=1}^{S} and μit ≜ (μict)_{c=1}^{C} are the corresponding vectors of multipliers, and we introduced the shorthand:

Φi(xt,ut,t,λit,μit) ≜ πi(xit,ut,t) + λit⊤ (f(xt,ut,t) − xt+1) + μit⊤ g(xt,ut,t)          (25)

The discrete time Euler-Lagrange equations [15, Sec. 6.1] applied to each player's Lagrangian are given by:

∂Φi(xt−1,ut−1,t−1,λit−1,μit−1)/∂xmt + ∂Φi(xt,ut,t,λit,μit)/∂xmt = 0,   ∀m ∈ X(i)          (26)

∂Φi(xt−1,ut−1,t−1,λit−1,μit−1)/∂uit + ∂Φi(xt,ut,t,λit,μit)/∂uit = 0          (27)

Actually, note that (26)–(27) are the Euler-Lagrange equations in a more general form than the standard reduced form. As mentioned in Sec. II (see also, e.g., [15, Sec. 6.1], [3]), in the standard reduced form, the current action can be posed as a function of the current and future states: ut = ϕ(xt,xt+1,t), for some function ϕ : X × X × N → U. The reason why we introduced this general form of the Euler-Lagrange equations is that such function ϕ may not exist for an arbitrary state-transition function f. By substituting (25) into (26)–(27), and adding the corresponding constraints, we obtain the KKT conditions of the game for every player i ∈ Q, the state-components m ∈ X(i), and all extra constraints:

∂πi(xit,ut,t)/∂xmt + ∑_{k=1}^{S} λikt ∂fk(xt,ut,t)/∂xmt + ∑_{c=1}^{C} μict ∂gc(xt,ut,t)/∂xmt − λimt−1 = 0          (28)

∂πi(xit,ut,t)/∂uit + ∑_{k=1}^{S} λikt ∂fk(xt,ut,t)/∂uit + ∑_{c=1}^{C} μict ∂gc(xt,ut,t)/∂uit = 0          (29)

xt+1 = f(xt,ut,t),   g(xt,ut,t) ≤ 0          (30)

μit ≤ 0,   μit⊤ g(xt,ut,t) = 0          (31)

Second, we find the KKT conditions of the MOCP. To do so, we obtain the Lagrangian of (22):

LΠ(xt,ut,γt,δt) = ∑_{t=0}^{∞} β^t ( Π(xt,ut,t) + γt⊤ (f(xt,ut,t) − xt+1) + δt⊤ g(xt,ut,t) )          (32)

where γt ≜ (γkt)_{k=1}^{S} and δt ≜ (δct)_{c=1}^{C} are the corresponding multipliers. Again, from (32) we derive the Euler-Lagrange equations, which, together with the corresponding constraints, yield the KKT system of optimality conditions for all state-components, m = 1, . . . , S, and all actions, i = 1, . . . , Q:

∂Π(xt,ut,t)/∂xmt + ∑_{k=1}^{S} γkt ∂fk(xt,ut,t)/∂xmt + ∑_{c=1}^{C} δct ∂gc(xt,ut,t)/∂xmt − γmt−1 = 0          (33)

∂Π(xt,ut,t)/∂uit + ∑_{k=1}^{S} γkt ∂fk(xt,ut,t)/∂uit + ∑_{c=1}^{C} δct ∂gc(xt,ut,t)/∂uit = 0          (34)

xt+1 = f(xt,ut,t),   g(xt,ut,t) ≤ 0          (35)

δt ≤ 0,   δt⊤ g(xt,ut,t) = 0          (36)

In order for the MOCP (22) to have the same optimality conditions as the game (1), by comparing (28)–(31) with (33)–(36), we conclude that the following conditions must be satisfied ∀i ∈ Q:

∂πi(xit,ut,t)/∂xmt = ∂Π(xt,ut,t)/∂xmt,   ∀m ∈ X(i)          (37)

∂πi(xit,ut,t)/∂uit = ∂Π(xt,ut,t)/∂uit          (38)

λit = γt,   μit = δt          (39)

Third, when conditions (37)–(38) are satisfied, Lemma 2 states that problem (1) is a DPG.

Fourth, note that condition (39) represents a feasible point of the game. The reason is that if there exists an optimal primal variable, then the existence of dual variables in the MOCP is guaranteed by suitable regularity conditions. Since the existence of optimal primal variables of the MOCP is ensured by Assumption 4, the regularity conditions established by Assumption 3 guarantee that there exist some γt and δt that satisfy the KKT conditions of the MOCP. Substituting these dual variables of the MOCP in place of the individual λit and μit in (28)–(31) for every i ∈ Q results in a system of equations where the only unknowns are the user strategies. This system has exactly the same structure as the one already presented for the MOCP in the primal variables. Therefore, the MOCP primal solution also satisfies the KKT conditions of the DPG. Indeed, it is straightforward to see that an optimal solution of the MOCP is also an NE of the game. Let {u⋆t}_{t=0}^{∞} denote the MOCP solution, so that it satisfies the following inequality ∀uit ∈ Ui:

∑_{t=0}^{∞} β^t Π(xt, u⋆it, u⋆−it, t) ≥ ∑_{t=0}^{∞} β^t Π(xt, uit, u⋆−it, t)          (40)

From Definition 3, we conclude that the MOCP optimal solution is also an NE of game (1). The opposite may not be true in general. Indeed, this solution, in which dual variables are shared between players, is only a subclass of the possible NE of the game. Nevertheless, other NE that do not share this property have been referred to as unstable by [16] for static games.

Fifth, although we have shown that we can find an NE of the DPG by solving a MOCP, we still need to find the objective of the MOCP. In order to find Π, we deduce from (37), (38), (20) and (21) that the vector field (20) can be expressed as

F ≜ ∇Π(xt,ut,t)          (41)

Lemma 1 establishes that F is conservative. Thus, the objective of the MOCP is the potential of the field, which can be computed through the line integral (23).

In the next sections, we show how to apply this methodology (solving a DPG through an equivalent MOCP) to different practical problems.

V. ENERGY DEMAND IN THE SMART GRID AS A LINEAR QUADRATIC DYNAMIC GAME

Our first example consists in a linear-quadratic-dynamic-game (LQDG) that solves a smart grid resource allocation problem. LQDG are convenient because they are amenable to analytical and closed form solutions [24, Ch. 6]. Our analysis is novel though. To the best of our knowledge, LQDG have not been studied under the easier DPG framework before.

    A. Energy demand control DPG and equivalent MOCP

Consider a community of Q users (i.e., players) that use the smart grid resources in different activities (like communications, heating, lighting, home appliances or production needs). Suppose that the electrical grid has S types of energy resources (such as rechargeable batteries, coal, fuel, hydroelectric power or biomass). The state of the game is xt ∈ X. Defining the augmented state and the demand-mismatch variables

x̃t ≜ [x⊤t, x⊤t−1]⊤,   ũit ≜ Dixt − uit          (43)

we can rewrite (42) in the standard linear-quadratic form:

maximize_{{uit}∈∏_{t=0}^{∞} Ui}   ∑_{t=0}^{∞} β^t ( x̃⊤t R̃ x̃t + ũi⊤t Qi ũit )
s.t.  x̃t+1 = A x̃t − ∑_{i=1}^{Q} B̃i ũit,   x̃0 given          (44)

where

A ≜ [ C + ∑_{i∈Q} Bi Di,  0S×S ;  IS,  0S×S ],    B̃i ≜ [ Bi ;  0S×Ai ]          (45)

R̃ ≜ [ R,  −R ;  −R,  R ]          (46)

and where IS and 0S×S denote the identity and null matrices of size S × S, respectively. LQDG games in the form (44) have been presented in [24, Ch. 6], where an NE is found by i) solving the system of coupled finite horizon OCP, ii) finding the limit of this solution as the horizon tends to infinity, and then iii) verifying that this limiting solution provides an NE solution for the infinite-horizon game. Here we follow a different and simpler approach. First, we show that problem (44) can be expressed in the separable form (16):

πi(x̃t,ũt) = x̃⊤t R̃ x̃t + ũi⊤t Qi ũit = x̃⊤t R̃ x̃t + ∑_{p∈Q} ũp⊤t Qp ũpt − ∑_{j∈Q:j≠i} ũj⊤t Qj ũjt          (47)

We identify the potential and separable functions in (47):

Π(x̃t, ũt, t) = x̃⊤t R̃ x̃t + ∑_{p∈Q} ũp⊤t Qp ũpt          (48)

Θ(ũ−it, t) = − ∑_{j∈Q:j≠i} ũj⊤t Qj ũjt          (49)

From Lemma 3, we conclude that problem (42) is a DPG. Note also that Assumptions 1–2 hold. Moreover, the objective in (44) is concave and the state dynamics (the only equality constraint) are linear. Therefore, Slater's constraint qualification is satisfied and Assumption 3 holds. In addition, the matrices Qi ≺ 0, ∀i ∈ Q, and R ⪯ 0 make the potential (48) coercive. Hence, Lemma 5 states that Assumption 4 is satisfied. Since Assumptions 1–4 hold, Theorem 1 establishes that we can find an NE of (42) by solving an equivalent MOCP:

P3 :  maximize_{{ut}∈∏_{t=0}^{∞} U}   V(x̃0) ≜ ∑_{t=0}^{∞} β^t ( x̃⊤t R̃ x̃t + ∑_{p∈Q} ũp⊤t Qp ũpt )
      s.t.  x̃t+1 = A x̃t − ∑_{i=1}^{Q} B̃i ũit,   x̃0 given          (50)

where the cumulative objective function V is known as the value function in the optimal control literature (see, e.g., [23]). Let ũt ≜ (ũit)_{i=1}^{Q} be the vector of all players' augmented actions. Aggregate all players' demand matrices in a block diagonal matrix Q ≜ diag(Q1, . . . , QQ) of size ∑_{i=1}^{Q} Ai × ∑_{i=1}^{Q} Ai, and aggregate all players' expenditure weighting matrices in a 2S × ∑_{i=1}^{Q} Ai thick matrix B ≜ (B̃1, . . . , B̃Q). Then, we can rewrite the value and transition functions as follows:

V(x̃0) = ∑_{t=0}^{∞} β^t ( ũ⊤t Q ũt + x̃⊤t R̃ x̃t )          (51)

x̃t+1 = A x̃t − B ũt          (52)

B. Analytical solution to the MOCP and simulation results

It is well known that the value function satisfies a recursive relationship, known as the Bellman equation (see, e.g., [23]):

V(x̃t) = β^t ( ũ⊤t Q ũt + x̃⊤t R̃ x̃t ) + β^{t+1} V(x̃t+1)          (53)

Moreover, for an LQ control problem, it is known [24, Ch. 6] that the optimal value function can be expressed as a quadratic form of the state:

V(x̃t) = x̃⊤t P x̃t          (54)

for some negative semidefinite matrix P. We can use (54) to find a closed form expression for the sequence of optimal actions as follows. Expand (52) and (54) into (53):

V(x̃t) = β^t ( ũ⊤t Q ũt + x̃⊤t R̃ x̃t ) + β^{t+1} (A x̃t − B ũt)⊤ P (A x̃t − B ũt)          (55)

Now, we just have to maximize (55) over ũt. Since Q and P are negative definite and negative semidefinite matrices, respectively, a necessary and sufficient condition for the maximum is

∇ũt V(x̃t) = β^t Q ũt − β^{t+1} B⊤ P (A x̃t − B ũt) = 0          (56)

From (56), we obtain an analytical expression for the optimal action at any time step:

ũt = β (Q + β B⊤ P B)^{−1} B⊤ P A x̃t          (57)

If we are also interested in finding the optimal value, we can expand (57) into (55) and isolate P:

P = R̃ + β A⊤ P A − β² A⊤ P B (Q + β B⊤ P B)^{−1} B⊤ P A          (58)

Note that (58) is a discrete algebraic Riccati equation, which is known to be a contraction mapping if Q ≺ 0, R̃ ⪯ 0 and the spectral radius of A is smaller than one [25, Ch. 5] (the analysis can be performed under weaker conditions though [23], [26]). When (58) is a contraction, it has a unique solution P⋆ that can be approximated by iterating the following fixed point equation, such that limn→∞ Pn = P⋆:

Pn+1 = R̃ + β A⊤ Pn A − β² A⊤ Pn B (Q + β B⊤ Pn B)^{−1} B⊤ Pn A          (59)

We have simulated the smart grid model for Q = 8 players, S = 4 resources, Ai = 6 activities for every player, random negative definite matrices Qi, ∀i ∈ Q, and a random negative semidefinite matrix R (to build these negative matrices we build an intermediate matrix, e.g., Rint, by drawing random numbers from a uniform distribution, with support [0, 10] for Qi and [0, 5] for R, and compute R = −R⊤int Rint). Matrices C, Bi and Di are also random with elements drawn from the spherical normal distribution. Finally, the initial state was set to a vector of ones, and the discount factor to β = 0.9.

Figure 1-Top shows the instant utilities per player over time. Recall that the utilities have been defined as negative costs. Therefore, each player's utility starts at a negative value and converges to zero with time. This behaviour illustrates that all players attain an NE in which they are able to satisfy their demand as well as to hold just enough available resources. Figure 1-Bottom shows the evolution of the part of the cost corresponding to the individual coefficients ũit = Dixt − uit. These coefficients represent the mismatch among target demand, Dixt, and the actual player activities uit. We can see that the agents adjust their actions uit to satisfy the target demand. The equilibrium between target demand and players' activities is an expected consequence of the stability of the LQ game in infinite horizon [24, Ch. 6].

Fig. 1. Dynamic smart grid scenario with Q = 8 players. (Top) Instant utility values of players. (Bottom) Players' decision coefficients evolution in time.


VI. NETWORK FLOW CONTROL: INFINITE HORIZON APPROXIMATED BY A FINITE HORIZON DYNAMIC GAME

Several works (see, e.g., [27]–[30]) have considered network flow control as an optimization problem wherein each source is characterized by a utility function that depends on the transmission rate, and the goal is to maximize the aggregated utility. We generalize the standard model by considering that the nodes are equipped with batteries that are depleted proportionally to the outgoing flow. In addition, we consider several layers of relay nodes, each one with multiple links, so there are several paths between source and destination. When the batteries are completely depleted, no more transmissions are allowed and the game is over. Hence, although we formulate the problem as an infinite horizon dynamic game, the effective time horizon (before the batteries deplete) is finite. This problem has no known analytical solution, but the utilities are concave. Therefore, the finite horizon approximation is convenient because we can solve an equivalent concave optimization problem, significantly reducing the computational load with respect to other optimal control algorithms (e.g., dynamic programming).

    A. Network flow control dynamic game and equivalent MOCP

Let uiat denote the flow along path a for user i at time t. Suppose there are Ai possible paths for each player i ∈ Q, so that uit ≜ (uiat)_{a=1}^{Ai} denotes the i-th player's action vector. Let A = ∑_{i=1}^{Q} Ai denote the total number of available paths.

Suppose there are S relay nodes. Let xkt denote the battery level of relay node k. The state of the game is given by xt ≜ (xkt)_{k=1}^{S}, such that all players share all components of the state-vector (i.e., Xi = X and X(i) = {1, . . . , S}, ∀i ∈ Q). The battery level evolves with the following state-transition equation for all components k = 1, . . . , S:

xkt+1 = xkt − δ ∑_{i=1}^{Q} ∑_{uia∈Fk} uiat,   xk0 = Bkmax          (60)

where Fk denotes the subset of flows through node k, Bkmax is a positive scalar that stands for the maximum battery level of node k, and δ is a proportional factor.

Similar to the standard static flow control problem, each player intends to maximize a concave function Γ : Ui → ℝ of the sum of rates across all available paths. This function Γ can take different forms depending on the scenario under study, like the square root [31] or a capacity form. In addition to the transmission rate, we include the relay nodes' battery level in each player's utility, weighted by some positive parameter α. The combination of these two objectives can be understood as the player aiming to maximize its total transmission rate, while saving the batteries of the relays.

There is some capacity constraint on the maximum aggregated rate at every relay and destination node. Let cmax denote the vector of maximum capacities and M the corresponding connectivity matrix, so that Mut ≤ cmax (the term ε > 0 is only added to avoid differentiability issues when uiat = 0). Let X and Ui be open convex sets containing the Cartesian products of intervals [0, Bkmax] and [0, ∞), respectively. It follows that Assumptions 1–2 hold. Moreover, since Γ is concave and problem (61) has linear equality constraints and concave inequality constraints, Slater's condition holds, i.e., Assumption 3 is satisfied. Finally, since the constraint set in (61) is compact, Lemma 5.1 states that Assumption 4 holds. Hence, Theorem 1 establishes that we can find an NE of (61) by solving the following MOCP:

P4 :  maximize_{{ut}∈∏_{t=0}^{∞} U}   ∑_{t=0}^{∞} β^t ( ∑_{i∈Q} Γ( ∑_{a=1}^{Ai} uiat ) + α ∑_{k=1}^{S} xkt )
      s.t.  xkt+1 = xkt − δ ∑_{i∈Q} ∑_{uia∈Fk} uiat
            xk0 = Bkmax,   0 ≤ xkt ≤ Bkmax
            M ut ≤ cmax,   ut ≥ 0
            k = 1, . . . , S                                          (64)

    B. Finite horizon approximation and simulation results

As opposed to the LQ smart-grid problem, there is no known closed-form solution for problem (64). Thus, we have to rely on numerical methods to solve the MOCP. Suppose that we set the weight parameter α in Π low enough to incentivize some positive transmission. Eventually, the nodes' batteries will be depleted, so the system will get stuck in an equilibrium state, with no further state transitions. Thus, we can approximate the infinite-horizon problem (64) as a finite-horizon problem, with horizon bounded by the time-step at which all batteries have been depleted. Moreover, in our setting, we have assumed Γ to be concave. Therefore, we can effectively solve (64) with convex optimization solvers (we use the software described in [32]). The benefit of using a convex optimization solver is that standard optimal control algorithms are computationally demanding when the state and action spaces are subsets of vector spaces.
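To illustrate this finite-horizon convex approximation, the following CVXPY sketch encodes a small instance of (64). The solver choice, the horizon T, the link/relay incidence matrices and the choice Γ(z) = √(z + ε) are assumptions made for the example; only the parameter values mirror the setup described next.

```python
import numpy as np
import cvxpy as cp

# Finite-horizon convex approximation of the MOCP (64) for an assumed topology:
# Q = 2 users with 4 paths each, S = 4 relays, L = 6 capacity-limited links.
T, beta, alpha, delta, eps = 60, 0.9, 1.0, 0.05, 1e-3
cmax = np.array([0.5, 0.15, 0.5, 0.15, 0.4, 0.4])
M = np.array([[1, 1, 0, 0, 1, 1, 0, 0],        # link-path incidence (assumed routing)
              [0, 0, 1, 1, 0, 0, 1, 1],
              [1, 0, 1, 0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0, 1, 0, 1],
              [1, 1, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 1, 1]], dtype=float)
F = M[:4]                                       # relay k is drained by the flows on its link

u = cp.Variable((8, T), nonneg=True)            # path flows: rows 0-3 user 1, rows 4-7 user 2
x = cp.Variable((4, T + 1))                     # relay battery levels
constraints = [x[:, 0] == 1.0, x >= 0, x <= 1.0]
objective = 0
for t in range(T):
    rates = cp.sqrt(cp.sum(u[:4, t]) + eps) + cp.sqrt(cp.sum(u[4:, t]) + eps)  # Gamma = sqrt
    objective += beta ** t * (rates + alpha * cp.sum(x[:, t]))
    constraints += [x[:, t + 1] == x[:, t] - delta * (F @ u[:, t]),            # dynamics (60)
                    M @ u[:, t] <= cmax]                                       # capacity limits
problem = cp.Problem(cp.Maximize(objective), constraints)
problem.solve()
```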

For our numerical experiment, we consider Q = 2 players that share a network of S = 4 relay nodes, organized in two layers (see Figure 2). In this particular setting, each player is allowed to use four paths, A1 = A2 = 4. The connectivity matrix M can be obtained from Figure 2. The battery is initialized to Bmax = 1 for the four relay nodes, and we set the depleting factor δ = 0.05, discount factor β = 0.9, the weight α = 1, ε = 0.001 and the vector of maximum capacities cmax = [0.5, 0.15, 0.5, 0.15, 0.4, 0.4]⊤.

Fig. 2. Network scenario for two users and two levels of relaying nodes. Player S1 aims to transmit to destination D1, while S2 aims to transmit to destination D2. They have to share relay nodes N1, . . . , N4. We denote the L = 6 aggregated flows as L1, . . . , L6.

Figure 3 shows the evolution of the L = 6 aggregated flows, the A = 8 flows and the battery of each of the N = 4 relay nodes. Since we have included the battery level of the relay nodes in the users' utilities (i.e., α > 0), the users have an extra incentive to limit their flow rate. Thus, there are two effective reasons to limit the flow rate: satisfy the problem constraints and save battery. We can see that the aggregated flows with higher maximum capacity are not saturated (L1 < 0.5, L3 < 0.5, L5 < 0.4, and L6 < 0.4). The reason is that the users have limited their individual flow rates in order to save the relays' batteries. On the other hand, the aggregated flows with lower maximum capacity are saturated (L2 = L4 = 0.15) because the capacity constraint is more restrictive than the self-limitation incentive. When the batteries of the nodes with higher maximum capacity (N1, N3) are depleted (around t = 70), the flows through these nodes stop. This allows the other flows (u14t, u24t) to transmit at a higher rate. At this time, the capacity constraint in L2, L4 is more restrictive than the self-limitation incentive for saving the batteries, so that the users transmit at the maximum rate allowed by the capacity constraints (note that L2 = L4 = 0.15 remains constant). When the battery of every node is depleted, none of the users is allowed to transmit anymore and the system enters an equilibrium state.

We remark that the solution obtained is an NE based on an OL game analysis. Finally, the results shown in Figure 3 have been obtained with a centralized convex optimization algorithm, meaning that it should be run off-line by the system designer, before deploying the real system. Alternatively, we could have used the distributed algorithms proposed by reference [33], enabling the players to solve the finite horizon approximation of problem (64) in a decentralized manner, even with the coupled capacity constraints.

Fig. 3. Network flow control with Q = 2 players, S = 4 relay nodes and A1 = A2 = 4 available paths per node. (Top) Aggregated flow rates at L1, . . . , L6. (Middle) Flow for each of the A = 8 available paths. (Bottom) Battery level in each of the S = 4 relay nodes.

VII. DYNAMIC MULTIPLE ACCESS CHANNEL: NONSEPARABLE UTILITIES

In this section, we consider an uplink scenario in which every user i ∈ Q independently chooses its transmitter power, uit, aiming to achieve the maximum rate allowed by the channel [18]. If multiple users transmit at the same time, they will interfere with each other, which will decrease their rates, so they have to find an equilibrium. Let Rit denote the rate achieved by user i with normalized noise at time t:

Rit ≜ log( 1 + |hi|² uit / (1 + ∑_{j∈Q:j≠i} |hj|² ujt) )          (65)


    where hi denotes the fading channel coefficient of user i.

A. Multiple access channel DPG and equivalent MOCP

Let xit ∈ [0, Bimax] denote the battery level of each player i ∈ Q, which is discharged proportionally to the transmitted power uit. The state of the system is given by the vector of all individual battery levels: xt = (xit)i∈Q ∈ X. Thus, each player is only affected by its own battery, such that S = Q, X(i) = {i}, and player i's state-component is just its own battery level xit. Suppose the agents aim to maximize their transmission rates, while also saving their batteries. This scenario yields the following dynamic game:

G5 :  ∀i ∈ Q
      maximize_{{uit}∈∏_{t=0}^{∞} Ui}   ∑_{t=0}^{∞} β^t ( Rit + α xit )
      s.t.  xit+1 = xit − δ uit,   xi0 = Bimax
            0 ≤ uit ≤ Pimax,   0 ≤ xit ≤ Bimax          (66)

where α is the weight given for saving the battery, δ is the discharging factor, and Pimax and Bimax denote the maximum transmitter power and maximum available battery level for node i, respectively. Problem (66) is a dynamic infinite-horizon extension of the static problem proposed in [34].

Instead of looking for a separable structure in the players' utilities, we show that Lemma 4 holds and, hence, problem (66) is a DPG:

∂²πi(xit,ut,t) / (∂xit ∂ujt) = ∂²πj(xjt,ut,t) / (∂xjt ∂uit) = 0          (67)

∂²πi(xit,ut,t) / (∂xit ∂xjt) = ∂²πj(xjt,ut,t) / (∂xjt ∂xit) = 0          (68)

∂²πi(xit,ut,t) / (∂uit ∂ujt) = ∂²πj(xjt,ut,t) / (∂ujt ∂uit) = −|hi|² |hj|² / ( 1 + ∑_{p∈Q} |hp|² upt )²          (69)

In order to find an equivalent MOCP, let us define Xi and Ui as open convex sets containing the closed intervals [0, Bimax] and [0, Pimax], respectively, so that Assumptions 1–2 hold. Derive the potential function from (23):

Π(xt,ut,t) = log( 1 + ∑_{i=1}^{Q} |hi|² uit ) + α ∑_{i=1}^{Q} xit          (70)
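As a sanity check (not part of the paper), the potential (70) can be recovered symbolically for a two-user instance by evaluating the line integral (23) along the straight path η(λ) = λxt, ξ(λ) = λut, and its partial derivatives can be verified to match those of the individual utilities, i.e., conditions (37)–(38):

```python
import sympy as sp

# Two-user symbolic check: the line integral (23) along the straight path
# eta(lam) = lam*x, xi(lam) = lam*u recovers the potential (70).
u1, u2, x1, x2, h1, h2, a, lam = sp.symbols('u1 u2 x1 x2 h1 h2 a lam', positive=True)
pi1 = sp.log(1 + h1**2 * u1 / (1 + h2**2 * u2)) + a * x1   # utility of user 1, from (65)-(66)
pi2 = sp.log(1 + h2**2 * u2 / (1 + h1**2 * u1)) + a * x2   # utility of user 2

integrand = (sp.diff(pi1, x1).subs(x1, lam * x1) * x1       # state terms of (23)
             + sp.diff(pi2, x2).subs(x2, lam * x2) * x2
             + sp.diff(pi1, u1).subs({u1: lam * u1, u2: lam * u2}) * u1   # action terms of (23)
             + sp.diff(pi2, u2).subs({u1: lam * u1, u2: lam * u2}) * u2)
Pi = sp.simplify(sp.integrate(sp.cancel(integrand), (lam, 0, 1)))
print(Pi)   # expected: log(h1**2*u1 + h2**2*u2 + 1) + a*x1 + a*x2, i.e., eq. (70)

# Conditions (37)-(38): the partials of Pi match those of the individual utilities.
assert sp.diff(Pi, u1).equals(sp.diff(pi1, u1))
assert sp.diff(Pi, x2).equals(sp.diff(pi2, x2))
```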

Since (70) is concave and all equality and inequality constraints in (66) are linear, Assumption 3 is satisfied through Slater's condition. Moreover, since the constraint set is compact and the potential is continuous, Lemma 5.1 establishes that Assumption 4 holds. Therefore, Theorem 1 states that we can find an NE of (66) by solving the following MOCP:

P5 :  maximize_{{ut}∈∏_{t=0}^{∞} U}   ∑_{t=0}^{∞} β^t ( log( 1 + ∑_{i=1}^{Q} |hi|² uit ) + α ∑_{i=1}^{Q} xit )
      s.t.  xit+1 = xit − δ uit,   xi0 = Bimax
            0 ≤ uit ≤ Pimax,   0 ≤ xit ≤ Bimax,   ∀i ∈ Q          (71)

    B. Simulation results

Similar to Sec. VI-B, the system reaches an equilibrium state when the batteries have been depleted. Thus, the solution can be approximated by solving a finite horizon problem. Moreover, since the problem is concave, we can use convex optimization software, like [32]. Alternatively, we could solve the KKT system with an efficient ad-hoc distributed algorithm, like in [18].

We simulated a scenario with Q = 4 users. We set the maximum battery level Bimax = 33 for all users, the maximum power allowed per user Pimax = 5 for all users, the weight battery utility factor α = 0.001, the transmitter power battery depletion factor δ = 1, and the discount factor β = 0.95. The channel gains are |h1| = 2.019, |h2| = 1.002, |h3| = 0.514, and |h4| = 0.308.

Figure 4 shows appealing results: the solution of the MOCP (which is an NE of the game) is actually a schedule. In other words, instead of creating interference among users, they wait until the users with higher channel gain have depleted their batteries.

Fig. 4. Dynamic multiple access scenario with Q = 4 users. (Top) Sequence of transmitter power chosen by every user. (Bottom) Evolution of the transmission rates.

VIII. OPTIMAL SCHEDULING: NONSTATIONARY PROBLEM WITH DYNAMIC PROGRAMMING SOLUTION

In this section we present the most general form of the proposed framework, and show its applicability to two scheduling problems. First, one of the games has nonseparable utilities, so we have to verify the second order conditions (17)–(19). Second, the equivalent MOCP can neither be approximated by a finite horizon problem, nor are the utilities concave. Thus, we cannot rely upon convex optimization software and we have to use optimal control methods, like dynamic programming [23]. Finally, we consider a nonstationary scenario, in which the channel coefficients evolve with time. This makes the state-transition equations (and the utility for the equal rate problem) depend not only on the current state, but also on time. This problem was introduced in the preliminary paper [19].


A. Proportional fair and equal rate scheduling games and their equivalent MOCP

Let us redefine the rate achieved by user i at time t, so that we consider nonstationary channel coefficients:

Rit ≜ log( 1 + |hit|² uit / (1 + ∑_{j∈Q:j≠i} |hjt|² ujt) )          (72)

where uit is the transmitter power of player i, and |hit| is its time-varying channel coefficient.

We propose two different scheduling games, namely, proportional fair and equal rate scheduling.

1) Proportional fair scheduling: Proportional fair is a compromise-based scheduling algorithm. It aims to maintain a balance between two competing interests: trying to maximize total throughput while, at the same time, guaranteeing a minimal level of service for all users [35]–[37].

In order to achieve this tradeoff, we propose the following game:

G6 :  ∀i ∈ Q
      maximize_{{uit}∈∏_{t=0}^{∞} Ui}   ∑_{t=0}^{∞} β^t xit
      s.t.  xit+1 = (1 − 1/t) xit + Rit/t
            xi0 = 0,   0 ≤ uit ≤ Pimax          (73)

where the state of the system is the vector of all players' average rates xt = (xit)i∈Q. Since each player aims to maximize its own average rate, the state-components are unshared among players: S = Q and X(i) = {i}.

In order to show that problem (73) is a DPG, we evaluate Lemma 3 with positive result, and obtain Π from (16):

Π(xt,ut,t) = ∑_{i=1}^{Q} xit          (74)

Now, we show that we can derive an equivalent MOCP. It is clear that Assumptions 1–2 hold. By taking the gradient of the constraints of (73) and building a matrix with the gradients of the constraints (i.e., the gradient of each constraint is a column of this matrix), it is straightforward to show that the matrix is full rank. Hence, the linear independence constraint qualification holds (see, e.g., [20, Sec. 3.3.5], [21]), meaning that Assumption 3 is satisfied. Finally, since Rit ≥ 0 and xi0 = 0, we conclude that there exists some scalar M ≥ 0 such that the level set {xt | ∑_{i∈Q} xit ≥ M} is nonempty and bounded, so that Lemma 5.3 establishes that Assumption 4 is satisfied. Thus, from Theorem 1, we can find an NE of DPG (73) by solving the following MOCP:

P6 :  maximize_{{ut}∈∏_{t=0}^{∞} U}   ∑_{t=0}^{∞} β^t ∑_{i=1}^{Q} xit
      s.t.  xit+1 = (1 − 1/t) xit + Rit/t
            xi0 = 0,   0 ≤ uit ≤ Pimax          (75)

2) Equal rate scheduling: In this problem, the aim of each user is to maximize its rate, while at the same time keeping the users' cumulative rates as close as possible. Let xit denote the cumulative rate of user i. The state of the system is the vector of all users' cumulative rates xt = (xit)i∈Q. Again S = Q and X(i) = {i}. This problem is modeled by the following game:

G7 :  ∀i ∈ Q
      maximize_{{uit}∈∏_{t=0}^{∞} Ui}   ∑_{t=0}^{∞} β^t ( (1 − α) Rit − α ∑_{j∈Q:j≠i} (xit − xjt)² )
      s.t.  xit+1 = xit + Rit
            xi0 = 0,   0 ≤ uit ≤ Pimax          (76)

    where parameter α weights the contribution of both terms.

It is easy to verify that conditions (17)–(19) are satisfied. Hence, from Lemma 4, we know that problem (76) is a DPG. In order to obtain an equivalent MOCP, let us define Xi and Ui as open convex sets that contain the intervals [0, ∞) and [0, Pimax], respectively. It follows that Assumptions 1–2 hold. Similar to the proportional fair scheduling problem (73), Assumption 3 holds through the linear independence constraint qualification. Finally, let us check Assumption 4 as follows. Derive the potential Π by integrating (23):

Π(xt,ut,t) = (1 − α) log( 1 + ∑_{i=1}^{Q} |hit|² uit ) − α ∑_{i=1}^{Q−1} ∑_{j=i+1}^{Q} (xit − xjt)²          (77)

We distinguish two extreme cases: i) all players have exactly the same rate (i.e., xit = xjt, i, j = 1, . . . , Q); and ii) each player's rate is different from any other player's rate (i.e., xit ≠ xjt, i ≠ j). When all players have exactly the same rate, the terms (xit − xjt)² vanish for all (i, j) pairs, and (77) only depends on the actions (the state becomes irrelevant). Since the action constraint set is compact, existence of a solution is guaranteed by Lemma 5.1. When each player's rate is different from any other player's rate, the term −(xit − xjt)² is coercive, so that (77) becomes coercive too (since the constraint action set is compact, the term depending on uit is bounded). Thus, existence of an optimal solution is guaranteed by Lemma 5.2. Finally, the case where some players' rates are equal and some are different is a combination of the two cases already mentioned, so that the equal terms vanish and the different terms make (77) coercive. Hence, Theorem 1 states that we can find an NE of DPG (76) by solving the following MOCP:

P7 :  maximize_{{ut}∈∏_{t=0}^{∞} U}   ∑_{t=0}^{∞} β^t ( (1 − α) log( 1 + ∑_{i=1}^{Q} |hit|² uit ) − α ∑_{i=1}^{Q−1} ∑_{j=i+1}^{Q} (xit − xjt)² )
      s.t.  xit+1 = xit + Rit,   xi0 = 0
            0 ≤ uit ≤ Pimax          (78)


B. Solving the MOCP with dynamic programming and simulation results

Although Lemma 5 establishes the existence of an optimal solution to these MOCP, these problems are nonconcave and cannot be approximated by finite horizon problems. Thus, we cannot rely on efficient convex optimization software. In order to numerically solve these problems, we can use dynamic programming methods [23].

Standard dynamic programming methods assume that the MOCP is stationary. One standard method to cope with a nonstationary MOCP is to augment the state space so that it includes the time as an extra dimension for some time length T. Let the augmented state-vector at time t be denoted by x̃t = (xt, t) ∈ X̃ ≜ X × {0, . . . , T}. The state-transition equation in the augmented state space becomes f̃ : X̃ × U → X̃. Since we are tackling an infinite horizon problem, when augmenting the state space with the time dimension, it is convenient to impose a periodic time variation:

f̃(x̃t,ut) ≜ [ f(xt,ut,t) ;  t + 1 (if t < T) or 0 (if t = T) ]          (79)

Otherwise, it could be difficult to apply computational dynamic programming methods.
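In code, the periodic augmentation (79) is a one-liner; a minimal sketch, assuming the original nonstationary transition is available as a callable f(x, u, t):

```python
def f_tilde(x_t, u_t, t, f, T):
    """Augmented transition (79): apply the nonstationary dynamics and advance the
    time index, wrapping back to 0 after T so the augmented state space stays finite."""
    return f(x_t, u_t, t), (t + 1 if t < T else 0)
```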

One further difficulty for solving MOCP with continuous state and action spaces is that dynamic programming methods are mainly derived for discrete state-action spaces. Two common approaches to overcome this limitation are i) to use a parametric approximation of the value function (e.g., consider a neural network with inputs the continuous state-action variables that is trained by minimizing the error in the Bellman equation); or ii) to discretize the continuous spaces, so the value function is approximated in a set of points. For simplicity, we follow the discretization approach here. We remark that it may be problematic to finely discretize the state-action spaces in high-dimensional problems though, since the computational load increases exponentially with the number of states. These and other approximation techniques, usually known as approximate dynamic programming, are still an active area of research (see, e.g., [23, Ch. 6], [38]).

Introduce the optimal value function for the augmented state space:

V^\star(\tilde{\mathbf{x}}_0) \triangleq \max_{\{\mathbf{u}_t\} \in \prod_{t=0}^{\infty} \mathcal{U}} \sum_{t=0}^{\infty} \beta^t \Pi(\tilde{\mathbf{x}}_t, \mathbf{u}_t, t) = \sum_{t=0}^{\infty} \beta^t \Pi(\tilde{\mathbf{x}}_t, \phi^\star(\mathbf{x}_t, t), t) = \sum_{t=0}^{\infty} \beta^t \Pi(\tilde{\mathbf{x}}_t, \mathbf{u}_t^\star, t) \qquad (80)

where $\phi^\star : \tilde{\mathcal{X}} \rightarrow \mathcal{U}$ is the optimal policy that provides the sequence of actions $\{\mathbf{u}_t^\star \triangleq \phi^\star(\tilde{\mathbf{x}}_t)\}_{t=0}^{\infty}$ that is the solution to the MOCP, as explained by Lemma 5. Then, the Bellman optimality equation is given by

V^\star(\tilde{\mathbf{x}}_t) = \Pi(\tilde{\mathbf{x}}_t, \mathbf{u}_t^\star) + \beta V^\star\!\left(\tilde{f}(\tilde{\mathbf{x}}_t, \mathbf{u}_t^\star)\right) \qquad (81)

Among the available dynamic programming methods, we choose value iteration (VI) for its reduced complexity per iteration with respect to policy iteration (PI), which is especially relevant when the state grid has fine resolution (i.e., a large number of states). VI is obtained by turning the Bellman optimality equation (81) into an update rule, so that it generates a sequence of value functions $V_k$ that converges to the optimal value (i.e., $\lim_{k \to \infty} V_k = V^\star$, where $V_0$ is arbitrary). In particular, at every iteration $k$, we obtain the policy $\phi$ that maximizes $V_k$ (policy improvement). Then, we update the value function $V_{k+1}$ for the latest policy (policy evaluation). VI is summarized in Algorithm 1, where the operator $\lceil \tilde{\mathbf{x}} \rceil$ denotes the closest point to $\tilde{\mathbf{x}}$ in the discrete grid.

Algorithm 1: Value Iteration for the nonstationary MOCP
Inputs: number of states $S$, threshold $\epsilon$
Discretize the augmented space $\tilde{\mathcal{X}}$ into a grid of $S$ states
Initialize $\Delta = \infty$, $k = 0$ and $V_0(\tilde{\mathbf{x}}_s) = 0$ for $s = 1, \ldots, S$
while $\Delta > \epsilon$
    for every state $s = 1$ to $S$ do
        $\tilde{\mathbf{x}}_s \leftarrow$ the $s$-th point on the grid
        $\phi(\tilde{\mathbf{x}}_s) = \arg\max_{\mathbf{u}} \Pi(\tilde{\mathbf{x}}_s, \mathbf{u}) + \beta V_k(\lceil \tilde{f}(\tilde{\mathbf{x}}_s, \mathbf{u}) \rceil)$
        $V_{k+1}(\tilde{\mathbf{x}}_s) = \Pi(\tilde{\mathbf{x}}_s, \phi(\tilde{\mathbf{x}}_s)) + \beta V_k(\lceil \tilde{f}(\tilde{\mathbf{x}}_s, \phi(\tilde{\mathbf{x}}_s)) \rceil)$
    end for
    $\Delta = \max_s |V_{k+1}(\tilde{\mathbf{x}}_s) - V_k(\tilde{\mathbf{x}}_s)|$
    $k = k + 1$
end while
Return: $\phi(\tilde{\mathbf{x}}_s)$ and $V_k(\tilde{\mathbf{x}}_s)$ for $s = 1, \ldots, S$
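To make Algorithm 1 concrete, the following is a minimal Python sketch of the value-iteration loop on a discretized grid. The input names (grid, actions, reward, transition, nearest) and their structure are our own illustrative assumptions, consistent with the earlier sketches; this mirrors the update rules of Algorithm 1 but is not the exact code used for the simulations.

```python
import numpy as np

def value_iteration(grid, actions, reward, transition, nearest, beta=0.95, eps=1e-6):
    """Sketch of Algorithm 1 on a discretized augmented state space.

    grid       : list of augmented grid states (x, t)
    actions    : list of candidate joint actions u
    reward     : callable Pi(x_aug, u)
    transition : callable f~(x_aug, u) returning the (continuous) next augmented state
    nearest    : callable mapping a continuous augmented state to its closest grid index
    """
    S = len(grid)
    V = np.zeros(S)                      # V_0 = 0
    policy = [None] * S
    delta = np.inf
    while delta > eps:
        V_new = np.empty(S)
        for s, x in enumerate(grid):
            # Greedy action w.r.t. the current value estimate (policy improvement).
            q = [reward(x, u) + beta * V[nearest(transition(x, u))] for u in actions]
            best = int(np.argmax(q))
            policy[s] = actions[best]
            V_new[s] = q[best]           # Evaluation of the greedy action (policy evaluation)
        delta = float(np.max(np.abs(V_new - V)))
        V = V_new
    return policy, V
```

In the two-user scenario described below, grid would hold the $30^2 \times 20$ augmented states and actions the $20 \times 20$ joint power levels; the returned policy is then evaluated online at each time step of the game.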

Note that the output of the value iteration algorithm is a policy (i.e., a function), rather than a sequence of actions. This allows us to compute the optimal actions of every player from the current state at every time step of the game. When there is no reason to propose a finite-horizon approximation of the game, a policy is a more practical representation of the solution than an infinite sequence of actions.

We simulate a simple scenario with $Q = 2$ users. The channel coefficients are sinusoids with different frequency and different amplitude for each user (see Fig. 5). The maximum transmitter power is $P_{\max}^1 = P_{\max}^2 = 5$, with 20 possible power levels per user, which amounts to 400 possible joint actions. We discretize the state space (i.e., the users' rates) into a grid of 30 points per user. The nonstationarity of the environment is handled by augmenting the state space with $T = 20$ time steps. Hence, the augmented state space has a total of $30^2 \times 20 = 18{,}000$ states. For the equal-rate problem, the utility function uses $\alpha = 0.9$.

The solution of the proportional fair game leads to an efficient scheduler (see Figure 6), in which both users try to minimize interference so that they approach their respective maximum rates.

For the equal rate problem, we observe that the agents achieve a much lower rate, but very similar to each other (see Figure 7). The trend is that the user with the lower-gain channel (User 2, red dashed line) tries to achieve its maximum rate, while the user with the higher-gain channel (User 1, blue continuous line) reduces its transmitter power to match the rate of the other user. In other words, the user with the poorest channel sets a bottleneck for the other user.


Fig. 5. Periodic time variation of the channel coefficients $|h_t^i|^2$ for $Q = 2$ users. All possible combinations of coefficients are included in a window of $T = 20$ time steps.

Fig. 6. Proportional fair scheduling problem for $Q = 2$ users. (Top) Transmitter power $u_t^i$. (Bottom) Average rate $x_t^i$ given by (73). Both users achieve near-maximum average rates for their channel coefficients $|h_t^i|^2$.

Fig. 7. Equal rate problem for $Q = 2$ users. (Top) Transmitter power $u_t^i$. (Bottom) Average rate $x_t^i/t$ (recall that $x_t^i$ given by (76) denotes accumulated rate). User 1 reduces its average rate to match that of User 2, despite having a higher channel coefficient.

Finally, note that Algorithm 1 is centralized, so the results displayed in Figures 6–7 have been obtained assuming the existence of a central unit that knows the channel coefficients, transmission power and average rate of all users, so that it can update the value and policy functions for all states. We remark that the design and analysis of distributed dynamic programming algorithms when multiple players share state-vector components and/or have coupled constraints is a nontrivial task. Nevertheless, when the players share no state-vector components and have uncoupled constraints, there are distributed implementations of VI and PI that converge to the optimal solution [39]–[42]. This is indeed the case for problems (75) and (78), where each agent $i$ has a unique state-vector component $x_t^i$ and the constraints are uncoupled. Therefore, the agents could solve these problems in a decentralized manner.

    IX. CONCLUSIONS

DPG provide a useful framework for competitive multiagent applications in time-varying environments. On the one hand, DPG allow nonstationary scenarios and, thus, more realistic models. On the other hand, the analysis and solution of a DPG is affordable through an equivalent MOCP. We presented a complete description of DPG and provided conditions for a dynamic game with constrained state and action sets to be of the potential type. To the best of our knowledge, previous works have not dealt with DPG with constraints explicitly.

We also introduced a range of communication and networking examples: energy demand control in a smart-grid network, network flow with relays that have bounded link capacity and limited battery life, multiple access communication in which users have to optimize the use of their batteries, and two optimal scheduling games with nonstationary channels. Although each of these problems has different features (utilities in separable and nonseparable form, convex and nonconvex objectives, closed-form and numerical solutions, and solution methods based on convex optimization and dynamic programming algorithms), the proposed framework allowed us to analyze and solve them in a unified manner.

The DPG framework is promising in the sense that, once the equivalent MOCP has been formulated, it is possible to use ideas from optimal control theory to extend the current analysis. In particular, we have assumed that the agents can observe all the variables that influence their objective functions and constraints. This is known as perfect information. Although perfect information is reasonable in many applications, there are problems in which not all the information is available to all agents. An example of games with imperfect information is when the agents cannot directly observe the variables that influence their objective and constraints; rather, they only have access to another variable that depends on the state. The current framework could possibly be extended to this case by using a partially observable Markov decision process (POMDP) formulation [43], [44]. Nevertheless, other forms of imperfect information, such as when the agents cannot observe other players' actions, would require further study. Another possible direction to extend the current analysis is to allow stochastic state transitions and utilities (i.e., considering $\mathbf{x}_{t+1}$ and $\pi^i$ random variables). This can be done by applying the Euler equation to the stochastic Lagrangian in order to derive a set of stochastic optimality conditions. Finally, we could also consider the case where the agents know nothing about the problem; rather, they have to learn the optimal policy from trial-and-error experimentation.


To this end, we could apply reinforcement learning (RL) and approximate dynamic programming (ADP) techniques, such as Q-learning [45]–[47], [23, Ch. 6]. The main difficulty with standard ADP/RL techniques is that they have been developed for unconstrained MOCP, and some adaptation of these techniques is necessary.

    REFERENCES

[1] Z. Han, D. Niyato, W. Saad, T. Başar, and A. Hjørungnes, Game Theory in Wireless and Communication Networks: Theory, Models, and Applications. Cambridge University Press, 2012.
[2] G. Scutari, D. Palomar, F. Facchinei, and J.-S. Pang, "Convex optimization, game theory, and variational inequality theory," IEEE Signal Processing Magazine, vol. 27, no. 3, pp. 35–49, May 2010.
[3] O. Hernández-Lerma and D. González-Sánchez, Discrete Time Stochastic Control and Dynamic Potential Games: The Euler Equation Approach. Springer, Aug. 2013.
[4] W. D. Dechert, "Optimal control problems from second-order difference equations," Journal of Economic Theory, vol. 19, no. 1, pp. 50–63, Oct. 1978.
[5] ——, "Non cooperative dynamic games: a control theoretic approach," Tech. Rep., 1997.
[6] W. D. Dechert and W. A. Brock, "The lake game," Tech. Rep., 2000.
[7] D. González-Sánchez and O. Hernández-Lerma, "Dynamic potential games: The discrete-time stochastic case," Dynamic Games and Applications, pp. 1–20, Mar. 2014.
[8] D. Dragone, L. Lambertini, G. Leitmann, and A. Palestini, "Hamiltonian potential functions for differential games," Proceedings of IFAC CAO, vol. 9, 2009.
[9] J. R. Marden, "State based potential games," Automatica, vol. 48, no. 12, pp. 3075–3088, Dec. 2012.
[10] N. Li and J. Marden, "Designing games for distributed optimization," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 230–242, Apr. 2013.
[11] ——, "Decoupling coupled constraints through utility design," IEEE Transactions on Automatic Control, vol. 59, no. 8, pp. 2289–2294, Aug. 2014.
[12] R. Amir and N. Nannerup, "Information structure and the tragedy of the commons in resource extraction," Journal of Bioeconomics, vol. 8, no. 2, pp. 147–165, Aug. 2006.
[13] E. Dockner, "On the relation between dynamic oligopolistic competition and long-run competitive equilibrium," European Journal of Political Economy, vol. 4, no. 1, pp. 47–64, 1988.
[14] F. Kydland, "Noncooperative and dominant player solutions in discrete dynamic games," International Economic Review, pp. 321–335, 1975.
[15] A. P. Sage and C. C. White, Optimum Systems Control, 2nd ed. Prentice-Hall, 1977.
[16] M. E. Slade, "What does an oligopoly maximize?" The Journal of Industrial Economics, vol. 42, no. 1, pp. 45–61, Mar. 1994.
[17] D. Monderer and L. S. Shapley, "Potential games," Games and Economic Behavior, vol. 14, no. 1, pp. 124–143, May 1996.
[18] S. Zazo, J. Zazo, and M. Sánchez-Fernández, "A control theoretic approach to solve a constrained uplink power dynamic game," in Proc. European Signal Processing Conference (EUSIPCO), Sept. 2014, Lisbon, Portugal.
[19] S. Zazo, S. Valcarcel Macua, M. Sánchez-Fernández, and J. Zazo, "A new framework for solving dynamic scheduling games," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, Brisbane, QLD, Australia.
[20] D. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[21] Z. Wang, S.-C. Fang, and W. Xing, "On constraint qualifications: motivation, design and inter-relations," Journal of Industrial and Management Optimization, vol. 9, no. 4, pp. 983–1001, 2013.
[22] T. Apostol, Calculus: Multi-Variable Calculus and Linear Algebra, with Applications to Differential Equations and Probability. Wiley, 1969.
[23] D. P. Bertsekas, Dynamic Programming and Optimal Control, 3rd ed. Athena Scientific, 2007, vol. 2.
[24] T. Başar and G. J. Olsder, Dynamic Noncooperative Game Theory. Society for Industrial and Applied Mathematics, 1999.
[25] L. Ljungqvist and T. J. Sargent, Recursive Macroeconomic Theory. MIT Press, 2012.
[26] T. J. Sargent, Linear Optimal Control, Filtering, and Rational Expectations. Federal Reserve Bank of Minneapolis, Nov. 1988, no. 224.
[27] S. Low, "Optimization flow control with on-line measurement or multiple paths," in Int. Teletraffic Congress, 1999, Edinburgh, UK, pp. 237–249.
[28] J. Mo and J. Walrand, "Fair end-to-end window-based congestion control," IEEE/ACM Transactions on Networking, vol. 8, no. 5, pp. 556–567, Oct. 2000.
[29] W.-H. Wang, M. Palaniswami, and S. H. Low, "Optimal flow control and routing in multi-path networks," Performance Evaluation, vol. 52, no. 2-3, pp. 119–132, Apr. 2003.
[30] M. Chiang, S. Low, A. Calderbank, and J. Doyle, "Layering as optimization decomposition: a mathematical theory of network architectures," Proceedings of the IEEE, vol. 95, no. 1, pp. 255–312, Jan. 2007.
[31] A. Nedic and A. Ozdaglar, "Cooperative distributed multi-agent optimization," in Convex Optimization in Signal Processing and Communications. Cambridge University Press, 2010.
[32] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming, version 2.1," http://cvxr.com/cvx, Mar. 2014.
[33] F. Facchinei, V. Piccialli, and M. Sciandrone, "Decomposition algorithms for generalized potential games," Computational Optimization and Applications, vol. 50, no. 2, pp. 237–262, 2011.
[34] G. Scutari, S. Barbarossa, and D. Palomar, "Potential games: A framework for vector power control problems with coupled constraints," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), May 2006, Toulouse, France.
[35] F. Kelly, "Charging and rate control for elastic traffic," European Transactions on Telecommunications, vol. 8, no. 1, pp. 33–37, 1997.
[36] F. Kelly, A. Maulloo, and D. Tan, "Rate control in communication networks: shadow prices, proportional fairness and stability," Journal of the Operational Research Society, vol. 49, 1998.
[37] H. Zhou, P. Fan, and J. Li, "Global proportional fair scheduling for networks with multiple base stations," IEEE Transactions on Vehicular Technology, vol. 60, no. 4, pp. 1867–1879, May 2011.
[38] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[39] D. Bertsekas, "Distributed dynamic programming," IEEE Transactions on Automatic Control, vol. 27, no. 3, pp. 610–616, Jun. 1982.
[40] D. Bertsekas and J. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.
[41] A. Jalali and M. Ferguson, "On distributed dynamic programming," IEEE Transactions on Automatic Control, vol. 37, no. 5, pp. 685–689, May 1992.
[42] D. P. Bertsekas and H. Yu, "Distributed asynchronous policy iteration in dynamic programming," in Proc. Allerton Conf. on Communication, Control, and Computing, 2010, Allerton, IL, USA, pp. 1368–1375.
[43] J. M. Porta, N. Vlassis, M. T. Spaan, and P. Poupart, "Point-based value iteration for continuous POMDPs," Journal of Machine Learning Research, vol. 7, pp. 2329–2367, 2006.
[44] S. Brechtel, T. Gindele et al., "Solving continuous POMDPs: Value iteration with incremental learning of an efficient space representation," in Proc. Int. Conf. on Machine Learning (ICML), 2013, Atlanta, GA, USA, pp. 370–378.
[45] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[46] C. Szepesvari, Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2009.
[47] L. Busoniu, R. Babuska, D. Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, 2010.

