
Efficient Control for Multi-Agent Jump Processes

Master’s Thesis

submitted by

Alexander Schlegel
alexander.schlegel@bccn-berlin.de

Bernstein Center for Computational Neuroscience Berlin
Technische Universität Berlin

Humboldt-Universität zu Berlin

June 19, 2013

Supervisors:

Prof. Dr. Manfred Opper
Dr. Andreas Ruttor

Eidesstattliche Versicherung / Statutory Declaration

Die selbstständige und eigenhändige Ausfertigung versichert an Eides statt

I declare in lieu of oath that I have written thisthesis myself and have not used any sources orresources other than stated for its preparation

Datum / Date Ort / Place

Unterschrift / Signature


Abstract

Optimal control of multi-agent systems is a hard problem. Traditional methods include formulating control problems as Markov decision processes (MDPs) and using dynamic programming to solve them. Problematically, this is a non-linear optimization problem, and the state space in multi-agent control is exponential in the number of agents. Therefore, finding optimal solutions to multi-agent control problems using these methods is computationally very demanding and often not feasible.

Recently, linearly solvable MDPs (LSMDPs) have been introduced, a subclass of MDPs in which the cost function is restricted in a way that makes the control problem linear. Additionally, LSMDPs are equivalent to probabilistic inference problems, and approximate solutions can be found using approximate inference techniques.

In this thesis, I derive methods for multi-agent control on Markov jump processes, building on the principles of LSMDPs and using approximate inference techniques.

Five different methods are presented and tested. Results are promising.


Zusammenfassung

The optimal control of multi-agent systems is a hard problem. A traditional approach to solving it is to formulate such problems as Markov decision processes (MDPs) and to apply dynamic programming. One difficulty is that the optimization of MDPs is non-linear and that the state space of multi-agent systems grows exponentially with the number of agents. For these reasons, solving control problems for multi-agent systems is computationally very demanding and in most cases not achievable.

A recent direction of research in optimal control theory concerns linearly solvable MDPs (LSMDPs). Here, the cost function of the MDP is restricted in such a way that the optimization problem becomes linear. In addition, LSMDPs are equivalent to probabilistic inference, and approximately optimal solutions can be found by using approximate inference techniques.

In this thesis, I develop methods for the control of multi-agent systems that build on the principles of LSMDPs and on approximate inference methods.

Five different methods are presented and tested. Initial results are promising.


Contents

I. Introduction   8
  1. Outline   10

II. Background and Related Work   11
  2. Traditional MDPs and Stochastic Control   11
  3. Linearly Solvable MDPs   12
     3.1. LSMDPs and Inference   13
  4. Markov Jump Processes   14
     4.1. Properties of MJPs   14
     4.2. Monomolecular MJPs   15
     4.3. Analytical Solution to the Master Equation for Monomolecular MJPs   16
     4.4. Sampling   18
     4.5. Inference for Markov Jump Processes   19
  5. Approximate Inference for MJPs   20
     5.1. Weak Noise Approximation   21
     5.2. Variational Approximation   21

III. Methods   24
  6. Control for MJPs   24
  7. Simple Problems   25
     7.1. Poisson Control   25
          7.1.1. Examples   26
     7.2. Single-Agent Control   27
          7.2.1. Weak Noise Approximation for Single-Agent Control   28
          7.2.2. Marginal Probability for Single-Agent Systems   29
          7.2.3. Examples   30
  8. Exact Multi-Agent Control   35
     8.1. Multi-Agent Control with Linear Costs   35
          8.1.1. Example   36
     8.2. Solving the Backwards Equation   36
     8.3. Forward Solution   37
     8.4. Backward Solution   38
  9. Approximate Multi-Agent Control   39
     9.1. Weak Noise Approximation   39
     9.2. Variational Approximation   41
     9.3. Expectation Control   43
     9.4. Partial Evaluation of the Solution to the Forward Master Equation   44
     9.5. Gaussian Approximation   47
  10. Ergodic Control   48
     10.1. Single-Agent Ergodic Control   48
     10.2. Multi-Agent Ergodic Control   48
          10.2.1. Example   50
     10.3. Collision Avoidance   50

IV. Simulations   53
  11. Tasks   53
     11.1. Goal-Directed Control   53
     11.2. Ergodic Control   53
  12. Controllers   54
     12.1. Exact Control   55
     12.2. Variational Approximation   55
     12.3. Partial Evaluation of the Solution to the Master Equation   55
     12.4. Weak Noise Approximation   55
     12.5. Gaussian Approximation   55
  13. Sampling   55
  14. Measure of Performance   56
  15. Results   56
     15.1. Goal-Directed Control   56
     15.2. Ergodic Control   59
     15.3. Noise   59

V. Discussion & Conclusion   72
  16. Discussion   72
  17. Challenges   73
  18. Further Approaches   74
  19. Conclusion   74
  20. Acknowledgements   75

A. MJP Inference with Arbitrary Cost Function   76
B. Single Agent Control   77
C. Variational Approximation with Gaussian Marginal   77
D. EM-Formulation of Expectation Control   77
E. Implementation Details   78

Part I.

Introduction

Control, or, more precisely, optimal control, is "optimizing a sequence of actions to attain some future goal" [Kappen, 2007, p 3]. This goal is often formalized in terms of a cost function, which evaluates the actions of the controlled entity (the agent) and the states it is in. The purpose of control is then to minimize the accumulated costs over time. Optimal control problems are often dealt with in the framework of Markov decision processes (cf. [Sutton and Barto, 1998]), stochastic processes in which an agent may shape the probability of transition from the state it is currently in to the next state by choosing an action. Thereby, it is assumed that the future depends only on the present, not on the past (this is the Markov property). One may reach optimal control in an MDP by adhering to the following principle: assuming that it is known how to act optimally after taking the next step, it is relatively simple to choose the next step optimally. This gives rise to a recursive equation – the Bellman equation – which implies the solution to optimal control problems and gives rise to a collection of methods called dynamic programming. However, solving the Bellman equation is a non-linear optimization problem and requires iterating over the complete state and action space several times – optimal control is a difficult problem.

Recently, research in optimal control theory has taken a new direction with the advent of linearly solvable MDPs (LSMDPs, [Todorov, 2009]) – a class of MDPs in which the set of actions and the cost function are restricted in a way that makes the Bellman equation linear and thereby more efficiently solvable. In addition, LSMDPs are equivalent to probabilistic inference problems [Kappen et al., 2012], which allows one to use approximate methods from probabilistic inference to get near-optimal solutions to control problems.

The focus of the work on LSMDPs is on control problems in discrete time and space. Similar ideas have been applied to control in continuous time and space in the framework of path integral control [Kappen, 2005].

In this thesis, building on the work on LSMDPs, I investigate control on Markov jump processes (MJPs), stochastic processes on discrete state spaces that are continuous in time, something which has, to the best of my knowledge, not been done before. As a particular application I concentrate on the control of multi-agent systems.

Multi-agent control is the control of multiple autonomous agents which collaborate to reach a common goal, that is, to minimize a cost function which is defined over the joint state space of all agents. Here, "autonomous" means that agents behave independently from each other in the absence of control. A naive approach to controlling multi-agent systems is to conceive of them as traditional MDPs on the joint state space and apply the usual methods. However, since the joint state space grows exponentially with the number of agents, this is usually intractable. More sophisticated approaches include factored MDPs, which allow one to exploit the structure and independence properties of the system that is to be controlled and to apply approximations that provide significant boosts in efficiency [Guestrin et al., 2003].

The approach I follow is slightly different: building on the work of [Todorov, 2009] on LSMDPs, I formulate multi-agent control problems as inference problems and apply approximate methods to solve them.

A similar idea has been pursued by [van den Broek et al., 2008], who investigate multi-agent control using the path integral method. By using the path integral method, the authors restrict themselves to the continuous time and space setting. Working in continuous space has the advantage of evading the exponential blow-up of the size of the state space with a rising number of dimensions. On the other hand, agents may only move to neighbouring regions of space. With a discrete state space, arbitrary transitions are possible and the state space may be structured arbitrarily. To give an example, while the path integral method may be readily applied to agents moving in physical space, methods that work on discrete space are more appropriate for agents navigating through a network structure, such as the Internet¹.

¹ Still, it may be possible to apply the path integral method to networks using kernel methods.

Control theory in general has a wide range of real-world applications, including movement control, planning of actions in robots, optimization of financial investment policies and control of chemical plants [Kappen, 2005]. Applications of multi-agent control include traffic control [Chen and Cheng, 2010, Kesting et al., 2008], grid energy management [Roche et al., 2010], crowd behaviour modelling [Bandini et al., 2007] and the joint control of several robotic agents, for example in rescue scenarios [Kitano, 2000] or robot football [Chen and Dong, 2013].

Apart from this wealth of possible applications, the investigation of optimal control is an important component in the neuroscientific effort to understand control in animals. Here, the range of behaviour that is to be explained goes from limb control to decision making [Sugrue et al., 2005] in multi-agent systems [Barraclough et al., 2004].

Optimal control theory is abstract and its relation to the working of neurons is not self-evident. Still, as [Marr, 1982] has famously pointed out, to understand a neuronal system (or any information processing system) it is not sufficient to study the detailed wiring and interactions of neurons on a biological or physical level – it is also necessary to study the algorithms and representations which give rise to behaviour, and, this is the place optimal control theory occupies in this endeavour, to understand the abstract underlying principles that define a specific type of computation. As he puts it: "An algorithm is likely to be understood more readily by understanding the nature of the problem solved than by examining the mechanism (and the hardware) in which it is embodied." [Marr, 1982, p 27]. In relation to optimal control theory, this idea has been pursued by several authors (e.g. [Todorov, 2004], [Kording, 2007]). Common to all these works is the assumption that control in animals approaches optimality (the "optimality principle"), and thus optimal control is the right theoretical framework. Research on animal control that bases its models on optimal control theory seems to be fruitful; [Todorov, 2004, p 1], for example, claims that "Optimal control models of biological movement explain behavioural observations on multiple levels of analysis [. . . ] and have arguably been more successful than any other class of models."

1. Outline

In the first part of the thesis, I introduce the necessary background for the work that follows. This includes MDPs, LSMDPs and the relation of LSMDPs to probabilistic inference. I continue with the discussion of Markov jump processes and methods for probabilistic inference on them.

The second part provides the substantial contribution of this thesis. First, I introduce optimal control on Markov jump processes using inference methods. In the following, I apply this to control problems, beginning with simple (mostly single-agent) tasks that may be solved using exact methods and continuing with more complex, multi-agent tasks that require the use of approximate methods. Thereby, I concentrate on two scenarios: goal-directed control, where agents act jointly to reach some goal state at a final time T, and ergodic control, where the aim is to minimize a cost function that is independent of time.

In the third part of the thesis, I present results of simulations, with the aim of evaluating the methods presented before. The focus here is to assess the appropriateness of the approximations.

In the fourth part, I discuss the results, indicate benefits and drawbacks of the presented methods and point out directions for future work.

Throughout the thesis, I illustrate the theoretical discussion with examples. In all examples, agents move on a one-dimensional grid. This is mainly because paper is two-dimensional and one dimension is needed for time. All methods work on higher-dimensional grids as well, with the restriction that the increased size of the state space makes computations more expensive. In the experimental part of the thesis, I present results from simulations performed on two-dimensional grids. It should be pointed out that the methods are not restricted to grids; they should work equally well on arbitrarily structured state spaces. However, I do not investigate this in the thesis.

Part II.

Background and Related Work

In this part of the thesis, I set the ground for the work that is presented in the remainder. I begin by outlining the traditional theory of stochastic control. Then, I introduce linearly solvable Markov decision processes (LSMDPs), a relatively recent direction in control research that I follow in this thesis. Finally, since this thesis is mainly concerned with multi-agent control on Markov jump processes and LSMDPs are closely connected to probabilistic inference, I will introduce MJPs and probabilistic inference thereon.

2. Traditional MDPs and Stochastic Control

The task in optimal control is to make an agent act in such a way that the costs caused by its actions are minimized over time. The costs depend on the task the agent should perform; they specify, for instance, beneficial and detrimental courses of action, and paths or states to be desired or avoided. One example of optimal control is keeping a helicopter above a certain height. Here, the helicopter is the agent and the costs are high only when the helicopter gets below the crucial height. Another example is gripping a cup of coffee, where the cost is always high, except when the coffee is safely in the hand of the agent. Other examples are driving a car, performing a surgery or leading a successful life.

Optimal control is often formalized in the framework of Markov decision processes (MDPs) (cf. [Sutton and Barto, 1998]). The central feature of MDPs is that they satisfy the Markov property, that is, what happens next only depends on the present, not on the past. This can simplify matters dramatically, and all control problems in this thesis can be formalized as MDPs.

A (discrete-time) MDP is a 4-tuple (S, A, P, q) consisting of a state set S, a set of actions A, transition probabilities P and a function q : S × A → ℝ assigning costs to performing an action in a given state. Formally, the Markov property states that p(s_{t+1} = s' | a_t = a, s_t = s, s_{t-1} = s'', ...) = p(s_{t+1} = s' | a_t = a, s_t = s). MDPs can also be defined for continuous time. In that case, the transition probabilities are replaced with a transition rate function and the cost function is replaced with a cost-rate function. In both discrete-time and continuous-time MDPs, the state set and the action set can each be either discrete or continuous.

Formally, the task in optimal control is to find a policy π* that assigns to each state s a probability distribution over actions P(a) := π*(s) such that, choosing actions accordingly, the expected accumulated cost over a specified time period is minimized. One can evaluate the performance of a given policy using a value function

\[ V_\pi(s) = E_\pi \left\{ \sum_{t}^{T} q(s_t, a_t) \right\}, \]

which gives the expected accumulated cost until some time T (which I always assume to be finite) when following the policy π after starting in state s (the cost-to-go). Accordingly, we have V_{π*}(s) = min_π V_π(s). V_{π*} is called the optimal value function, which I will also denote V*. Famously, it holds that

\[ V^*(s) = \min_a \left\{ q(s, a) + E_{s' \sim p(\cdot|s,a)} \{ V^*(s') \} \right\} \]  (1)

This is the Bellman equation. It states that the optimal value function is the minimum (over actions) of the immediate cost plus the expected remaining cost-to-go. As a recursive formulation of the value function, the Bellman equation gives rise to a collection of methods for finding the optimal value function termed Dynamic Programming (e.g. Policy Iteration or Value Iteration). These methods all use the fact that, with the Bellman equation, one can easily compute the value function for a state if the value functions of its potential successor states are known.

Problematically, finding solutions to the Bellman equation using Dynamic Programming requires iterating over the product space of actions and states, which is often very large, making the application of Dynamic Programming prohibitively inefficient. Additional difficulties arise due to the stochastic nature of most problems.
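As a concrete illustration of the dynamic programming recursion, the following is a minimal finite-horizon value-iteration sketch. The array layout (a transition tensor P[a, s, s'], a cost table q[s, a] and a horizon T) is my own choice for the example and is not the implementation used in this thesis.

```python
import numpy as np

def value_iteration(P, q, T):
    """Finite-horizon value iteration via the Bellman equation (1).

    P[a, s, s'] -- transition probabilities for action a
    q[s, a]     -- immediate cost of taking action a in state s
    Returns the optimal cost-to-go V[t, s] and a greedy policy pi[t, s].
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros((T + 1, n_states))           # V[T] = 0: no cost after the horizon
    pi = np.zeros((T, n_states), dtype=int)
    for t in reversed(range(T)):
        # Bellman backup: immediate cost plus expected future cost-to-go
        Q = q.T + P @ V[t + 1]                # Q[a, s]
        V[t] = Q.min(axis=0)
        pi[t] = Q.argmin(axis=0)
    return V, pi
```

The nested dependence on both states and actions in the backup is exactly the product-space iteration that makes this approach expensive for large problems.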

3. Linearly Solvable MDPs

Recently, a new approach to rendering the solution of MDPs feasible was proposed by [Todorov, 2009]. In this approach, instead of allowing an arbitrary set of symbolic actions that would then be mapped to transition probabilities, agents are allowed to shape the transition probabilities directly, such that p(s'|s, a) = a(s'|s). The cost function is then restricted to a combination of a state-dependent part q(s) and a control-dependent part. While the state-dependent cost can be an arbitrary function, the cost that depends on the control is defined as the Kullback-Leibler (KL) divergence between the control distribution a(s'|s) and the passive dynamics p(s'|s) of the system. The passive dynamics can be interpreted as the behaviour of the system in the absence of control. Hence, the control cost reflects how much the control changes the behaviour of the system from its normal behaviour.

It turns out that by applying these restrictions, the Bellman equation becomes linear: with the restrictions, the cost function is

\[ l(s, a) = q(s) + \mathrm{KL}\big(a(\cdot|s)\,\|\,p(\cdot|s)\big) = q(s) + E_{s' \sim a(\cdot|s)} \left\{ \ln \frac{a(s'|s)}{p(s'|s)} \right\}, \]  (2)

where E_·{·} denotes the expectation. Introducing a desirability function z(s) = exp(−V*(s)), the Bellman equation can be written as

\[ -\ln z(s) = \min_a \left\{ q(s) + E_{s' \sim a(\cdot|s)} \left\{ \ln \frac{a(s'|s)}{p(s'|s)} \right\} - E_{s' \sim a(\cdot|s)} \{ \ln z(s') \} \right\} \]  (3)
\[ \phantom{-\ln z(s)} = q(s) + \min_a \left\{ E_{s' \sim a(\cdot|s)} \left\{ \ln \frac{a(s'|s)}{p(s'|s)\, z(s')} \right\} \right\} \]  (4)

Introducing a normalization term G[z](s) = E_{s' \sim p(\cdot|s)}[z(s')] and the action a*(s'|s) := p(s'|s) z(s') / G[z](s), the term to be minimized can be written as a KL-divergence:

\[ E_{s' \sim a(\cdot|s)} \left\{ \ln \frac{a(s'|s)}{p(s'|s)\, z(s')} \right\} = E_{s' \sim a(\cdot|s)} \left\{ \ln \frac{a(s'|s)}{\frac{p(s'|s)\, z(s')}{G[z](s)}\, G[z](s)} \right\} \]  (5)
\[ = E_{s' \sim a(\cdot|s)} \left\{ \ln \frac{a(s'|s)}{a^*(s'|s)\, G[z](s)} \right\} \]  (6)
\[ = E_{s' \sim a(\cdot|s)} \left\{ \ln \frac{a(s'|s)}{a^*(s'|s)} \right\} - \ln G[z](s) \]  (7)
\[ = \mathrm{KL}\big(a(\cdot|s)\,\|\,a^*(\cdot|s)\big) - \ln G[z](s) \]  (8)

Since the KL-divergence assumes its minimal value of 0 iff both distributions are equal, we see that a* is the optimal action and the desirability function becomes

\[ z(s) = \exp(-q(s))\, G[z](s) \]  (9)

This is a linear equation and it can be solved relatively efficiently, for example as an eigenvalue problem. In particular, its complexity only depends on the size of the state set S and not on the combined state-action set S × A.
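For a finite-horizon problem, equation 9 can be evaluated by a backward recursion over the desirability function; solving for a stationary z as an eigenvalue problem is analogous. The sketch below is a minimal illustration under the assumption of a known passive transition matrix P and state cost q; all names are placeholders rather than the thesis' implementation.

```python
import numpy as np

def desirability(P, q, z_T, T):
    """Backward recursion for the linear Bellman equation z = exp(-q) * G[z]
    (equation 9), here for a finite-horizon problem with final desirability z_T.

    P[s, s'] -- passive dynamics p(s'|s)
    q[s]     -- state cost per step
    """
    z = [None] * (T + 1)
    z[T] = z_T
    for t in reversed(range(T)):
        G = P @ z[t + 1]                 # G[z](s) = E_{s'~p(.|s)}[z(s')]
        z[t] = np.exp(-q) * G
    return z

def optimal_action(P, z_next):
    """Optimal controlled transitions a*(s'|s) = p(s'|s) z(s') / G[z](s)."""
    unnormalized = P * z_next[None, :]
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)
```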

3.1. LSMDPs and Inference

[Kappen et al., 2012] have shown that the above approach is closely related to probabilistic inference. By unfolding the Bellman equation,

\[ z(s_0) = \exp(-q(s_0)) \sum_{s_1} p(s_1|s_0) \exp(-q(s_1)) \sum_{s_2} p(s_2|s_1) \exp(-q(s_2)) \cdots \]  (10)
\[ = \sum_{s_{1:T}} p(s_{1:T}|s_0) \exp\left( -\sum_{t=0}^{T} q(s_t) \right), \]  (11)

with s_{1:T} denoting a sequence of states from time 1 to T, we see that the optimal action a* is

\[ a^*(s_1|s_0) = \frac{p(s_1|s_0)\, z(s_1)}{G[z](s_0)} \]  (12)
\[ = \frac{p(s_1|s_0) \sum_{s_{2:T}} p(s_{2:T}|s_1) \exp\left( -\sum_{t=1}^{T} q(s_t) \right)}{G[z](s_0)} \]  (13)
\[ = \frac{p(s_1|s_0) \sum_{s_{2:T}} p(s_{2:T}|s_1) \exp\left( -\sum_{t=0}^{T} q(s_t) \right)}{Z(s_0)} \]  (14)
\[ = \sum_{s_{2:T}} a^*(s_{1:T}|s_0), \]  (15)

where

\[ a^*(s_{1:T}|s_0) = \frac{p(s_{1:T}|s_0) \exp\left( -\sum_{t=0}^{T} q(s_t) \right)}{Z(s_0)} \]  (16)

This is a probabilistic inference problem – we can interpret p(s_{1:T}) as a prior probability of the state sequence s_{1:T}, exp(−Σ_{t=0}^{T} q(s_t)) as a likelihood and Z(s_0) as a partition function. Finding the optimal action corresponds to computing a posterior and marginalizing out all but the current action. For doing this, one can use all the machinery available in probabilistic inference, including approximate methods.

The prior probability in inference corresponds to the uncontrolled (or free) dynamics of the system that is to be controlled. The likelihood in inference corresponds to the exponential of the negative accumulated costs, and the posterior corresponds to the controlled dynamics.

These results have an analogue in the case of continuous time and space, the so-called path integral method by [Kappen, 2005].

4. Markov Jump Processes

The previous section outlined work that deals with control in discrete time and space. If time and space are both treated as continuous, control can be done in a similar way in the framework of path integral control. In contrast, this thesis is concerned with control problems in a discrete state space, but with continuous time. These types of Markov decision problems are Markov jump processes (MJPs) – stochastic processes in continuous time that have a countable state set S and obey the Markov property. In the following, I will introduce formalisms and properties of MJPs that are important for the remainder of the thesis, particularly for multi-agent control.

4.1. Properties of MJPs

The following is based on [Ruttor et al., 2009] and [Wilkinson, 2011].

The behaviour of an MJP is fully determined by its process rates f(X'|X). They determine the probability of a transition (a "jump") from a state X ∈ S to a state X' ∈ S in an infinitesimal time interval Δt:

\[ p(X'|X) \approx \delta_{X',X} + \Delta t\, f(X'|X), \]  (17)

where δ denotes the Kronecker delta. This approximation becomes exact in the limit Δt → 0. By normalization, f(X|X) = −Σ_{X'≠X} f(X'|X).

It is useful to give f some more structure, which I will do in the following. I deviate from the usual terminology in the literature on MJPs – which comes from chemistry – and use terms that relate more intuitively to agent control. When appropriate, I will mention the traditional terms.

Let S ⊆ ℕ^D. I will call one entry X_i of a state vector X a location. Its value represents the number of agents at that location². One can define jumps between states using a set of rules

\[ p_{11} X_1 + \dots + p_{1d} X_d \xrightarrow{h_1} q_{11} X_1 + \dots + q_{1d} X_d \]  (18)
\[ \vdots \]  (19)
\[ p_{n1} X_1 + \dots + p_{nd} X_d \xrightarrow{h_n} q_{n1} X_1 + \dots + q_{nd} X_d \]  (20)

² In the chemical literature, X_i is usually called a molecular species and its value represents the number of molecules of that species.

Each rule specifies one possible transition: p_{ij} determines the number of agents leaving location j whenever transition i occurs, q_{ij} determines the number of agents entering location j, and h_i gives the rate with which transition i happens. Thus, whenever transition i takes place, the value of X_j changes to X'_j = X_j + q_{ij} − p_{ij}. The transition rates h_i are functions of the state of the system. Usually,

\[ h_i(X) = c_i \prod_{j=1}^{d} \prod_{k=0}^{p_{ij}-1} (X_j - k) \]  (22)

This reflects that a transition depends only on the number of agents at each relevant location and some rate constant c_i. Now, the process rate f(X'|X) is the sum of the transition rates of all transitions leading from X to X':

\[ f(X'|X) = \sum_{i=1}^{n} \delta_{X',\, X - p_{i\cdot} + q_{i\cdot}}\, h_i(X) \]  (23)

The marginal probability p(X, t) of a state evolves according to the (forward) master equation:

\[ \partial_t p(X, t) = \sum_{X' \neq X} \big( p(X', t)\, f(X|X') - p(X, t)\, f(X'|X) \big), \]  (24)

which, intuitively, states that the probability of being in state X changes with the probability of jumping into it minus the probability of jumping away from it. The master equation specifies a system of about N^D differential equations (the number of possible states), D being the number of locations and N the typical number of agents. Since N^D is usually very large, the master equation can seldom be solved in practice.
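To make equations 22 and 23 concrete, the following sketch enumerates the reachable states and their total rates from a list of rules. The data layout (tuples of a rate constant c_i and the vectors p_i and q_i) is an assumption made for this example only.

```python
import numpy as np

def transition_rate(c, p, X):
    """Rate h_i(X) = c_i * prod_j prod_{k=0}^{p_ij - 1} (X_j - k), as in equation 22.

    c -- rate constant c_i
    p -- integer vector p_i. of the rule
    X -- current state (agents per location)
    """
    h = c
    for j, pij in enumerate(p):
        for k in range(pij):
            h *= (X[j] - k)
    return h

def possible_jumps(rules, X):
    """All reachable states X' with their total rates f(X'|X), as in equation 23.

    rules -- list of (c_i, p_i, q_i) tuples
    """
    rates = {}
    for c, p, q in rules:
        h = transition_rate(c, p, X)
        if h > 0:
            X_new = tuple(np.array(X) - np.array(p) + np.array(q))
            rates[X_new] = rates.get(X_new, 0.0) + h
    return rates
```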

4.2. Monomolecular MJPs

MJPs are called monomolecular if none of the transition rules has more than one location (traditionally: molecular species) as antecedent or consequent. In the context of multi-agent control in MJPs, the behaviour of agents in the absence of mutual interactions can be described by monomolecular MJPs. We will assume this to be the case in the absence of control, thus monomolecular processes are of particular importance.

In monomolecular MJPs, there are only three types of possible transitions:

\[ X_j \xrightarrow{c_{jk}} X_k \]  (25)
\[ X_j \xrightarrow{c_{j0}} \emptyset \]  (26)
\[ \emptyset \xrightarrow{c_{0k}} X_k \]  (27)

Using the vocabulary of agents, this means that a transition can only be such that one agent moves to a different location with rate c_{jk} (25), disappears with rate c_{j0} (26), or appears at some location k with rate c_{0k} (27) – in all cases independently of the overall state of the system. The process rates in monomolecular systems are

\[ f(X'|X) = \begin{cases} c_{jk} X_j & \text{if } X'_k = X_k + 1,\ X'_j = X_j - 1 \text{ and } X'_l = X_l \text{ for all } l \notin \{j, k\} \\ c_{j0} X_j & \text{if } X'_j = X_j - 1 \text{ and } X'_l = X_l \text{ for all } l \neq j \\ c_{0k} & \text{if } X'_k = X_k + 1 \text{ and } X'_l = X_l \text{ for all } l \neq k \end{cases} \]  (28)

The treatment of monomolecular MJPs is simpler than that of general MJPs and there are some results concerning this type of MJP that do not hold in general. This is mainly because of the absence of interactions between agents: since all agents act independently of each other, the whole system's evolution can be treated as the sum of what happens to individual agents.

One useful result is that for monomolecular MJPs (but not for MJPs in general), it is straightforward to calculate the expected state of the system at some given time t [Wilkinson, 2011, p 159], [Jahnke and Huisinga, 2007]: the expectation evolves as

\[ \partial_t E\{X(t)\} = M(t)^\top E\{X(t)\} + m, \]  (29)

where

\[ M(t)_{ij} = \begin{cases} c(t)_{ij} & \text{if } i \neq j \\ -\sum_{k=0}^{D} c(t)_{kj} & \text{else} \end{cases} \]  (30)
\[ m_i = c(t)_{0i} \]  (31)

using time-dependent rates c(t).
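A minimal illustration of how the expected state can be propagated in practice: the Euler integration below is written directly in terms of the rates c_jk, c_j0 and c_0k (assumed constant, with a zero diagonal for the movement rates) rather than through the matrix M, so the indexing stays explicit. The function and argument names are illustrative only.

```python
import numpy as np

def expected_state(c, c_in, c_out, E0, T, dt=1e-3):
    """Euler integration of the expected state of a monomolecular MJP
    (the content of equation 29), written directly in terms of the rates.

    c[j, k]  -- rate c_jk of one agent moving from location j to location k
    c_in[k]  -- rate c_0k of a new agent appearing at location k
    c_out[j] -- rate c_j0 of an agent at location j disappearing
    E0       -- initial expected state E[X(0)]
    """
    E = np.asarray(E0, dtype=float).copy()
    for _ in range(int(T / dt)):
        inflow = c.T @ E + c_in                 # expected arrivals per location
        outflow = (c.sum(axis=1) + c_out) * E   # expected departures per location
        E += dt * (inflow - outflow)
    return E
```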

4.3. Analytical Solution to the Master Equation for Monomolecular MJPs

For monomolecular MJPs, [Jahnke and Huisinga, 2007] have shown that there exists a closed-form solution to the master equation. This is an important result for the development of this thesis since, as we will see in Section 8.3, finding a solution to the master equation is a way to acquire controlled process rates for multi-agent systems. The idea behind the solution is based on the fact that in monomolecular systems, molecular species (i.e. agents on different locations) do not interact, and therefore whatever happens to the whole system can be described as a sum of what happens to parts of the system. Furthermore, the authors have shown that any system can be split up into subsystems for which the master equation can be solved easily. These subsystems are of two types. First, for a system with no inflow of new molecules (i.e. c_{0k} = 0 for all k) and with a multinomial initial distribution,

\[ \mathcal{M}(x, N, p) = \begin{cases} N!\, \frac{(1-|p|)^{N-|x|}}{(N-|x|)!} \prod_{k=1}^{n} \frac{p_k^{x_k}}{x_k!} & \text{if } |x| \leq N \text{ and } x \in \mathbb{N}^n \\ 0 & \text{else} \end{cases} \]  (32)

the marginal distribution at any time t will still be a multinomial distribution with parameters evolving according to

\[ \frac{dp(t)}{dt} = M(t)\, p(t) \]  (33)
\[ p(0) = p_0 \]  (34)

[Jahnke and Huisinga, 2007, proposition 1, p 7], with M as defined in the previous section (equation 30). Second, if the initial distribution is a product Poisson distribution,

\[ \mathcal{P}(x, \lambda) = \frac{\lambda_1^{x_1}}{x_1!} \cdots \frac{\lambda_n^{x_n}}{x_n!}\, e^{-|\lambda|}, \]  (35)

the marginal distribution will always remain a product Poisson distribution with parameters evolving as

\[ \frac{d\lambda(t)}{dt} = M(t)\, \lambda(t) \]  (36)
\[ \lambda(0) = \lambda_0 \]  (37)

[Jahnke and Huisinga, 2007, proposition 2, p 9].

Now, these results only apply to special cases of initial distributions. However, as we are dealing with monomolecular reaction systems, it is possible to split up the system in such a way that the subsystems have the right kind of initial distributions and afterwards combine the solutions. As it turns out, this is always possible: it suffices to show how to do this for deterministic initial conditions, since then any initial distribution can be dealt with using superpositions of solutions with deterministic initial conditions. Any deterministic state of the system can be split up into n + 1 groups, such that n groups contain molecules of one species (or agents at one location) each and one group contains no molecules. The marginal distribution of a monomolecular MJP with a deterministic initial condition and only molecules of species k at t = 0 is a multinomial with parameters p^{(k)} evolving according to equation 33, where the initial condition p^{(k)}(0) is a vector with p^{(k)}_i = 0 for all i ≠ k and p^{(k)}_k = 1. The remaining molecules, those that do not exist at t = 0, follow equation 35, since "nothing" follows a product Poisson distribution. The initial condition λ_0 is a zero vector.

Now, the state of the system at any time t is the sum of the states of the n + 1 subsystems, and the subsystems evolve independently from each other. The sum of independent random variables is distributed according to the convolution of the distributions of the individual random variables, thus the probability distribution of the whole system will be a convolution of the subsystems' probability distributions. A convolution can be defined as

\[ (P_1 \star P_2)(x) = \sum_{z} P_1(z)\, P_2(x - z), \]  (38)

where the sum is over all z ∈ ℕ^n with (x − z) ∈ ℕ^n. The solution to the master equation is thus

\[ P(t, \cdot) = \mathcal{P}(\cdot, \lambda(t)) \star \mathcal{M}(\cdot, \xi_1, p^{(1)}(t)) \star \cdots \star \mathcal{M}(\cdot, \xi_n, p^{(n)}(t)), \]  (39)

where ξ_i equals the number of molecules of species i at time t = 0.

For the expectation and covariance of the marginal, one gets

\[ E[X(t)] = \lambda(t) + \sum_{k=1}^{n} \xi_k\, p^{(k)}(t) \]  (40)
\[ \mathrm{Cov}(X_j, X_k) = \begin{cases} \sum_{i=1}^{n} \xi_i\, p^{(i)}_j (1 - p^{(i)}_j) + \lambda_j & \text{if } j = k \\ -\sum_{i=1}^{n} \xi_i\, p^{(i)}_j\, p^{(i)}_k & \text{else} \end{cases} \]  (41)

[Jahnke and Huisinga, 2007, p 14].

It needs to be pointed out that due to the complexity of the convolution, computing this exact solution to the master equation is intractable in all but the most simple cases.
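Even when the full convolution is out of reach, the mean and covariance in equations 40 and 41 are cheap to evaluate once the subsystem parameters are known. The sketch below performs only this final combination step and assumes the single-agent occupation probabilities p^(k)(t) and the Poisson parameters λ(t) have already been obtained by integrating equations 33 and 36; all names are illustrative.

```python
import numpy as np

def marginal_mean_cov(xi, p_t, lam_t):
    """Mean and covariance of the exact marginal (equations 40 and 41).

    xi[k]  -- number of agents initially at location k
    p_t[k] -- occupation probabilities p^(k)(t) for an agent started at location k
              (each a length-n vector obtained from equation 33)
    lam_t  -- Poisson parameters lambda(t) from equation 36
    """
    lam_t = np.asarray(lam_t, dtype=float)
    n = lam_t.shape[0]
    mean = lam_t.copy()
    cov = np.diag(lam_t.copy())
    for k in range(n):
        p = np.asarray(p_t[k], dtype=float)
        mean += xi[k] * p
        # diagonal: xi_k p_j (1 - p_j); off-diagonal: -xi_k p_j p_k
        cov += xi[k] * (np.diag(p) - np.outer(p, p))
    return mean, cov
```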

4.4. Sampling

For the simulations presented in this thesis, as examples or for evaluation, it is necessary to sample from MJPs, for which purpose several approaches exist [Wilkinson, 2011, p 125]. A simple approach is to discretize time and use that, for a small time interval Δt,

\[ p(X'(t + \Delta t)\,|\,X(t)) \simeq \delta_{X',X} + \Delta t\, f(X'|X) \]  (42)

Problematically, in order to get accurate samples using this approach, Δt has to be small, but with small Δt, sampling becomes inefficient (there will be many more time intervals than jumps). A more efficient approach is to separately sample the time that passes until the next jump and the state the system jumps to. This is known as Gillespie's method in the context of chemical reaction processes. The time until the next jump is exponentially distributed with rate Σ_{X'≠X} f(X'|X) (the rate of jumping out of state X), and the probability that the jump lands in state X⁺ is f(X⁺|X) / Σ_{X'≠X} f(X'|X). An issue here is that this is only exact if the process rates are time-independent, since changes of the rates between two jumps are not taken into account. We use this type of sampling in this thesis either for processes that have constant rates or rates that change slowly (in relation to the expected time between jumps), such that the resulting error can be neglected.
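A minimal Gillespie-style sampler along these lines might look as follows. It assumes time-independent (or slowly changing) rates and expects a function that returns the possible jumps out of a state together with their rates, for instance the rule-based enumeration sketched in Section 4.1; the interface is my own choice for this example.

```python
import numpy as np

def gillespie(rate_fn, X0, T, rng=None):
    """Gillespie-style sampling of one MJP trajectory with (approximately) constant rates.

    rate_fn(X) -- returns a dict {X_new: rate} of all possible jumps out of state X
    X0         -- initial state (tuple)
    Returns lists of jump times and visited states.
    """
    rng = rng or np.random.default_rng()
    t, X = 0.0, tuple(X0)
    times, states = [t], [X]
    while True:
        rates = rate_fn(X)
        total = sum(rates.values())
        if total == 0.0:
            break                              # absorbing state: no more jumps
        t += rng.exponential(1.0 / total)      # waiting time ~ Exp(total exit rate)
        if t > T:
            break
        targets = list(rates.keys())
        probs = np.array([rates[x] for x in targets]) / total
        X = targets[rng.choice(len(targets), p=probs)]
        times.append(t)
        states.append(X)
    return times, states
```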

4.5. Inference for Markov Jump Processes

As discussed in Section 3, there is a tight connection between probabilistic inference and control – an insight that gives rise to the methods for multi-agent control presented in later parts of the thesis. This section gives some background on the theory of inference on MJPs.

In inference, the task is, given N noisy observations D and a prior MJP p_prior, to compute a posterior process p_post(X|D). If the observation noise is independent across different time points, the posterior process will also be an MJP [Ruttor et al., 2009, p 242] and can hence be characterized using a, possibly time-dependent, rate function. In what comes next, I briefly sketch how this rate function can theoretically (but in most cases not practically) be computed in an exact way. These results are taken from [Ruttor et al., 2009].

Given N observations D_l, a noise model p and a prior p_prior, the posterior process is, according to Bayes' rule,

\[ p_{\text{post}}(X|D) = \frac{1}{Z}\, p_{\text{prior}}(X) \prod_{l=1}^{N} p(D_l\,|\,X(t_l)), \]  (43)

where Z = p(D_1, ..., D_N) is a normalization term. p_post minimizes

\[ \mathrm{KL}(q\,\|\,p_{\text{post}}) = \ln Z + \mathrm{KL}(q\,\|\,p_{\text{prior}}) - \sum_{l=1}^{N} E_q \{ \ln p(D_l\,|\,X(t_l)) \} \]  (44)

The KL-divergence between two MJPs is

\[ \mathrm{KL}(q\,\|\,p) = \int_0^T dt \sum_X q(X, t) \sum_{X' \neq X} \left( g_t(X'|X) \ln \frac{g_t(X'|X)}{f(X'|X)} + f(X'|X) - g_t(X'|X) \right) \]  (45)

where f is the process rate of p and g_t is the (time-dependent) process rate of q.

To compute the process rate g_t of p_post, one needs to minimize KL(q‖p_post) under the condition that the master equation holds. This is done by computing the stationary values of the Lagrangian

\[ L = \mathrm{KL}(q\,\|\,p_{\text{post}}) - \int_0^T dt \sum_X \lambda(X, t) \left[ \frac{\partial}{\partial t} q(X, t) - \sum_{X' \neq X} \big( g_t(X|X')\, q(X', t) - g_t(X'|X)\, q(X, t) \big) \right] \]  (46)

The functional derivatives with respect to q(X, t) and g_t(X'|X) are

\[ \frac{\delta L}{\delta q(X, t)} = \sum_{X' \neq X} \left( g_t(X'|X) \ln \frac{g_t(X'|X)}{f(X'|X)} - g_t(X'|X) + f(X'|X) \right) + \frac{\partial}{\partial t} \lambda(X, t) + \sum_{X'} g_t(X'|X) \big( \lambda(X', t) - \lambda(X, t) \big) - \sum_l \ln p(D_l\,|\,X(t))\, \delta(t - t_l) = 0 \]  (47–50)

\[ \frac{\delta L}{\delta g_t(X'|X)} = q(X, t) \left( \ln \frac{g_t(X'|X)}{f(X'|X)} + \lambda(X', t) - \lambda(X, t) \right) = 0 \]  (51–52)

Solving equation 52 yields

\[ \frac{g_t(X'|X)}{f(X'|X)} = \frac{r(X', t)}{r(X, t)}, \]  (53)

where r(X, t) = e^{−λ(X,t)}. By putting this into equation 50 one gets the system of linear differential equations

\[ \frac{\partial}{\partial t} r(X, t) = \sum_{X' \neq X} f(X'|X) \big( r(X, t) - r(X', t) \big) \]  (54)

and jump conditions at the times of observation,

\[ \lim_{t \to t_l^-} r(X, t) = p(D_l\,|\,X(t_l)) \lim_{t \to t_l^+} r(X, t) \]  (55)

By solving the system of equations 54 backwards in time, one gets the posterior rate function g_t using equation 53:

\[ g_t(X'|X) = f(X'|X)\, \frac{r(X', t)}{r(X, t)}. \]  (56)

Problematically, because 54 is a system of as many equations as there are states in the system, finding a solution is only feasible in very simple cases. Therefore one often has to rely on approximate methods, two of which I will introduce in the following section.

Importantly, r(X, t) can be interpreted as the likelihood of future observations D_{≥t} given the present state X of the system, i.e. r(X, t) = p(D_{≥t}|X(t) = X) [Ruttor and Opper, 2010].
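For a state space small enough to enumerate, equations 54 and 55 can be integrated numerically. The Euler-stepping sketch below illustrates the backward pass with observation jump conditions; the simple time discretization and all names are my own choices, not the thesis' implementation.

```python
import numpy as np

def backward_r(f, likelihoods, obs_times, T, dt=1e-3):
    """Euler integration of the backward equation (54) with jump conditions (55).

    f           -- rate matrix, f[x, x'] = f(x'|x), zero diagonal assumed
    likelihoods -- list of vectors p(D_l | X(t_l) = x), one per observation
    obs_times   -- observation times t_l
    Returns the time grid and r(x, t), integrated backwards from r(x, T) = 1.
    """
    n = f.shape[0]
    steps = int(T / dt)
    ts = np.linspace(0.0, T, steps + 1)
    r = np.ones((steps + 1, n))
    outflow = f.sum(axis=1)
    for i in range(steps, 0, -1):
        # jump condition: multiply by the likelihood when passing an observation time
        for t_l, lik in zip(obs_times, likelihoods):
            if ts[i - 1] < t_l <= ts[i]:
                r[i] = r[i] * lik
        drdt = outflow * r[i] - f @ r[i]   # sum_{x'!=x} f(x'|x) (r(x) - r(x'))
        r[i - 1] = r[i] - dt * drdt        # one Euler step backwards in time
    return ts, r
```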

5. Approximate Inference for MJPs

In the following, I will outline two methods for approximate probabilistic inference on Markov jump processes. Both methods are applied to multi-agent control in later parts of the thesis.

5.1. Weak Noise Approximation

[Ruttor et al., 2009] have proposed a method for approximate inference on MJPs, the weak noise approximation. The idea is to approximate the backward equation (equation 54, Section 4.5) by a Gaussian diffusion. To do that, a formal expansion parameter ε is introduced, such that r(X', t) = r(X + ε(X' − X), t). Now, the backward equation is expanded to second order in ε, giving

\[ \left[ \frac{\partial}{\partial t} + \varepsilon\, f(X)^\top \nabla + \frac{1}{2} \varepsilon^2\, \mathrm{tr}\big( D(X) \nabla \nabla^\top \big) \right] r(X, t) = 0 \]  (57)

This includes a drift vector f(X) and a diffusion matrix D(X), which are defined as

\[ f(X) = \sum_{X' \neq X} f(X'|X)\, (X' - X) \]  (58)
\[ D(X) = \sum_{X' \neq X} f(X'|X)\, (X' - X)(X' - X)^\top \]  (59)

Assuming that typical state vectors can be expected to be close to some time-dependent state b(t), one can write X = b(t) + εy and express r as a function of y: r(X, t) = Ψ(y, t). Requiring that

\[ \frac{db}{dt} = f(b(t)), \]  (60)

another expansion to second order in ε yields

\[ \left[ \frac{\partial}{\partial t} + y^\top A(b(t))^\top \nabla + \frac{1}{2}\, \mathrm{tr}\big( D(b(t)) \nabla \nabla^\top \big) \right] \Psi(y, t) = 0, \]  (61)

with A_{ij}(X) = ∂f_i/∂x_j. The solution to this is

\[ r(X, t) \approx \eta(t) \exp\left( -\frac{1}{2} (X - b(t))^\top B^{-1}(t) (X - b(t)) \right), \]  (62)

with

\[ \frac{dB}{dt} = A(b(t))\, B(t)\, A(b(t))^\top - D(b(t)), \]  (63)
\[ \frac{d\eta}{dt} = \eta(t)\, \mathrm{tr}\big( A(b(t)) \big). \]  (64)

This can be used to compute the posterior rate g_t(X'|X).

5.2. Variational Approximation

[Opper and Ruttor, 2010] have developed a method for approximate inference on monomolecular MJPs that is based on the optimization of a variational lower bound to the free energy of the process.

The goal is, again, to compute a posterior rate g_t for the process at hand. As we have seen, g_t(X'|X) = f(X'|X) r(X', t)/r(X, t), where r(X, t) = p(D_{T>t}|X(t) = X) is the likelihood of future observations. Thus, given a method for computing that likelihood, one would be able to compute posterior rates.

We have

\[ p(D_{T>t}\,|\,X(t) = X) = \sum_{X} p(D_{T>t}\,|\,X)\, p(X\,|\,X(t) = X) \]  (65)
\[ = E\left[ p(D_{T>t}\,|\,X)\,\big|\, X(t) = X \right], \]  (66)

where the sum goes over all possible trajectories X between t and T. The likelihood for observing some data D_{T>t}, p(D_{T>t}|X), could be defined as

\[ p(D_{T>t}\,|\,X) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_k \big\| D_k - L[X(t_k)] \big\|^2 \right), \]  (67)

where L is a linear operator. This models that data points D_k are noisy measurements of linear transformations of the state of the system at time t_k. Computing the expectation (equation 66) of this kind of likelihood is not feasible, in particular because the sum goes over an infinite number of trajectories X. However, if one uses a different definition of the likelihood,

\[ p(D_{T>t}\,|\,X) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_k u^\top X(t_k) \right), \]  (68)

for some u, computing the expectation becomes possible: one can show that

\[ r(X, t) = a(t) \exp\big( b(t)^\top X \big) \]  (69)

with r_i(t) := e^{b_i(t)} and a(t) obeying the system of equations

\[ \frac{dr_i}{dt} = -\sum_{k \neq 0} c_{ik} (r_k - 1) \]  (70)
\[ \frac{da}{dt} = -a \sum_{k \neq 0} c_{0k} (r_k - 1) \]  (71)

with jump conditions

\[ r_i(t_k^-) = r_i(t_k^+) \exp\big( u_i(t_k) \big) \]  (72)

at the times of observation. The c_{ik} represent transition rates.

Note that equation 68 does not correspond to any realistic measurement model³. Nevertheless, by finding appropriate values for u (which I denote φ), it can be used in a variational approach to approximate the more realistic likelihood in equation 66: by re-representing equation 67 using the convex duality transform (see e.g. [Bishop, 2006, p. 493]), [Opper and Ruttor, 2010] derive a lower bound to the free energy,

\[ -\ln Z \geq \max_{\{\phi_k\}_{k=1}^{K}} \left\{ -\frac{\sigma^2}{2} \sum_k \|\phi_k\|^2 + \sum_k \phi_k^\top D_k - \ln E\left[ \exp\left( \sum_k \phi_k^\top L(X(t_k)) \right) \Big|\, X(0) = X_0 \right] \right\} =: \mathcal{F}. \]  (73)

³ Although this is true for classical inference applications, this kind of "likelihood" may well make sense in a control setting. See Section 8.1 for more.

The maximum on the right-hand side of this inequality can be found using gradient ascent with the gradient

\[ \nabla_{\phi_k} \mathcal{F} = -\sigma^2 \phi_k + D_k - E\big( L[X(t_k)] \big), \]  (74)

where the expectation is under the posterior process with the current parameter vector φ. This expectation can be readily computed since the posterior process remains monomolecular [Opper and Ruttor, 2010, p. 5].
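The gradient ascent itself is straightforward once the posterior expectation E(L[X(t_k)]) can be evaluated. The sketch below shows only the outer loop; posterior_expectation is a hypothetical helper whose implementation depends on the concrete monomolecular model and is not given here.

```python
import numpy as np

def fit_variational_params(D, sigma2, posterior_expectation, n_iter=200, lr=0.05):
    """Gradient ascent on the variational lower bound (equation 73) using the
    gradient of equation 74.

    D                     -- list of observation vectors D_k
    posterior_expectation -- hypothetical helper: posterior_expectation(phi, k)
                             returns E[L[X(t_k)]] under the posterior process
                             defined by the current parameters phi
    """
    phi = [np.zeros_like(np.asarray(d, dtype=float)) for d in D]
    for _ in range(n_iter):
        for k, d in enumerate(D):
            grad = -sigma2 * phi[k] + d - posterior_expectation(phi, k)
            phi[k] = phi[k] + lr * grad
    return phi
```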


Part III.

Methods

6. Control for MJPs

Recall from Section 3 that the optimal sequence of actions for a discrete-time linearly solvable MDP is

\[ a^*(s_{1:T}\,|\,s_0) = \frac{1}{Z}\, p(s_{1:T}\,|\,s_0) \exp\left( -\sum_{t=0}^{T} q(s_t, t) \right) \]  (75)

In the case of continuous time, a* and p(s_{1:T}) become MJPs and we have

\[ a^*(X\,|\,X(0)) = \frac{1}{Z}\, p(X\,|\,X(0)) \exp\left( -\int_0^T q(X(t), t)\, dt \right) \]  (76)

This is Bayes' formula for MJPs with a prior process p(X|X(0)) and a "likelihood" exp(−∫_0^T q(X(t), t) dt). Thus, in order to find optimal actions in a continuous-time linearly solvable MDP, one has to solve the inference problem given by equation 76.

For the special case that the cost function q is non-zero only at a finite number N of time points t_l, equation 76 becomes

\[ a^*(X\,|\,X(0)) = \frac{1}{Z}\, p(X\,|\,X(0)) \exp\left( -\sum_{l=0}^{N} q(X(t_l), t_l) \right) \]  (77)
\[ = \frac{1}{Z}\, p(X\,|\,X(0)) \prod_{l=0}^{N} \exp\big( -q(X(t_l), t_l) \big) \]  (78)

This is exactly equation 43 with a "likelihood" p(D_l|X(t_l)) = exp(−q(X(t_l), t_l)), except that in the case of control it makes little sense to talk about observations D_l.

In control, as opposed to the classical applications of inference, where we have data at some discrete time points, we would like to be able to define continuous cost functions that are non-zero at more than finitely many times. The results from the previous section can easily be adapted to that situation: the backwards equation used for computing the posterior rate function (equation 54, Section 4.5) becomes

\[ \partial_t r(X, t) = \sum_{X' \neq X} f(X'|X) \big( r(X, t) - r(X', t) \big) + r(X, t)\, q(X, t) \]  (79)

and the jump conditions disappear (see Appendix A for a derivation).
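Equation 79 can be integrated backwards in time on a small, enumerable state space, after which the controlled rates follow from equation 56. The following Euler sketch assumes a rate matrix with zero diagonal and a final condition r(X, T) = 1 (i.e. no final cost); names and discretization are illustrative only.

```python
import numpy as np

def controlled_rates(f, cost, T, dt=1e-3):
    """Euler integration of the backward equation (79) with a continuous cost rate,
    followed by the controlled rates g_t(x'|x) = f(x'|x) r(x',t)/r(x,t).

    f          -- uncontrolled rate matrix, f[x, x'] = f(x'|x), zero diagonal
    cost(x, t) -- state cost rate q(x, t)
    """
    n = f.shape[0]
    steps = int(T / dt)
    ts = np.linspace(0.0, T, steps + 1)
    r = np.ones((steps + 1, n))                        # r(x, T) = 1: no final cost here
    outflow = f.sum(axis=1)
    for i in range(steps, 0, -1):
        q = np.array([cost(x, ts[i]) for x in range(n)])
        drdt = outflow * r[i] - f @ r[i] + r[i] * q    # right-hand side of equation (79)
        r[i - 1] = r[i] - dt * drdt                    # one Euler step backwards in time
    g = r[:, None, :] / r[:, :, None] * f[None, :, :]  # g[t, x, x'] = f(x'|x) r(x')/r(x)
    return ts, r, g
```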

7. Simple Problems

In this section, I present the solutions to some control problems with MJPs that are simple in the sense that exact solutions are usually feasible. The section starts with the discussion of Poisson process control, where analytical solutions are available in some cases. I proceed with single-agent Markov jump processes, which I discuss in some depth, because they lay the foundation for the later development of methods for multi-agent control.

7.1. Poisson Control

As a first example, take the control of a Poisson process. A Poisson process is an MJP with state space ℕ and rate function f(j|i) = λ(t) δ_{j, i+1} for some rate λ(t)⁴. In essence, the Poisson process counts through the natural numbers, with waiting times between counts that are exponentially distributed with rate λ(t). Although the state space is infinite, the rate function is such that the backward equation becomes relatively simple:

\[ \partial_t r(i, t) = \sum_{j \neq i} f(j|i) \big( r(i, t) - r(j, t) \big) + r(i, t)\, q(i, t) \]  (80)
\[ = \lambda(t) \big( r(i, t) - r(i+1, t) \big) + r(i, t)\, q(i, t) \]  (81)

⁴ Often, the rate is time-independent, but in this case I use the more general definition.

For an arbitrary cost function q, this can be solved numerically (with the restriction that one can only look at a finite number of states, which seems not to be a problem in most realistic cases).

Interestingly, for some special cases there exist analytical solutions. One instance is this: let the task be to count to a certain number N by time T, using a free dynamics with time-independent rate λ. The cost in this scenario is 0 for all times and states, except for t = T, where it is infinite for all states but the goal state. This is equivalent to an inference problem with one noiseless observation at time T and can be treated with the tools from Section 4.5. In this case, the cost function gives a boundary condition for the backward equation:

\[ r(i, T) = \delta_{i,\, i_{\text{goal}}} \]  (82)

The system of differential equations

\[ \partial_t r(i, t) = \lambda \big( r(i, t) - r(i+1, t) \big) \]  (83)

has the solution

\[ r(i, t) = \begin{cases} e^{-\lambda(T-t)}\, \frac{(\lambda(T-t))^{N-i}}{(N-i)!} & \text{if } i \leq N \\ 0 & \text{else} \end{cases} \]  (84)


Figure 1: Poisson control with goal state. The figure on the left shows twenty samples of a controlled Poisson process with goal state i = 50. The base-10 logarithm of the solution of the backwards equation, log10 r(t, i), is displayed on the right for different times.

Accordingly, the controlled process has the rate

\[ g_t(i|j) = f(i|j)\, \frac{r(i, t)}{r(j, t)} = \begin{cases} \frac{N - i + 1}{T - t} & \text{if } i = j + 1 \text{ and } i \leq N \\ 0 & \text{else} \end{cases} \]  (85)

Interestingly, this is independent of the rate λ of the uncontrolled process.
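Because the controlled rate in equation 85 is available in closed form, controlled trajectories can be sampled directly; for a rate of the form (N − i)/(T − t) out of state i, the waiting-time integral can even be inverted analytically. The sketch below does this for a process started in state 0; everything beyond the use of equation 85 (names, start state) is illustrative.

```python
import numpy as np

def sample_controlled_poisson(N, T, rng=None):
    """Sample one path of the controlled counting process of equation 85,
    i.e. with rate (N - i)/(T - t) out of state i, which reaches the goal
    state N at time T regardless of the uncontrolled rate.
    """
    rng = rng or np.random.default_rng()
    t, i = 0.0, 0
    times, states = [0.0], [0]
    while i < N:
        u = rng.exponential(1.0)
        # Solve  integral_t^s (N - i)/(T - tau) d tau = u  for the next jump time s:
        # (N - i) * ln((T - t)/(T - s)) = u
        s = T - (T - t) * np.exp(-u / (N - i))
        t, i = s, i + 1
        times.append(t)
        states.append(i)
    return times, states
```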

7.1.1. Examples

See Figures 1 and 2 for two examples of Poisson control. The first example shows Poisson control with goal state i = 50 at time T = 100. See Figure 1, (a), for 20 samples of the controlled process (note that the uncontrolled process is irrelevant here, since it has no influence on the solution for the controlled rate). Figure 1, (b), shows examples of the decimal logarithm of r, the likelihood of reaching the goal state from any state at different times. In this case, the uncontrolled process rate was 0.2 (here it is relevant).

Figure 2, (a), shows 20 samples of a controlled Poisson process using a continuous cost function q(i, t) with

\[ q(i, t) = \frac{(g(t) - i)^2}{20} \]  (86)

and

\[ g(t) = \begin{cases} 2t & \text{if } t < 20 \\ 40 & \text{if } 40 \leq t < 60 \\ t & \text{if } t \geq 60 \end{cases} \]  (87)

r is shown for all times and states in Figure 2, (b).


Figure 2: Poisson control with time-dependent cost function. The figure on the left shows twenty samples of a controlled Poisson process with a time-dependent cost function as defined in Section 7.1.1. The background of the graph is colored according to the values of the cost function. The figure on the right depicts values of the backwards solution, r(t, i), over time.

7.2. Single-Agent Control

Poisson process control is a special case (with very restricted dynamics) of single-agent MJP control. More generally, the situation is this: given D positions, we have a state space S = {0, ..., D} and an uncontrolled process rate function f(j|i). Given an arbitrary state-dependent cost function q(s, t), the control costs should be minimized. If D is a finite, not too large number, this problem can be solved directly, using the results from Sections 6 and 4.5.

First of all, we note that the prior rate of the process (that is, the uncontrolled dynamics of the system) can be characterized by a matrix C, with C_{ij} being the rate of the agent jumping to position j if it is at position i, f(j|i) = C_{ij}. In addition, we define a vector r with r_i(t) = r(i, t). The optimal action for this problem is given as the posterior MJP by equation 78. Its process rate can be derived using the method from Section 4.5. The system's backward equation becomes

\[ \partial_t r_i = \sum_{j \neq i} C_{ij} (r_i - r_j) + q_i(t)\, r_i(t), \]  (88)

with a cost vector q(t). This is a system of D linear differential equations. For the special case that there are only final costs at some time T, the solution is

\[ r(t) = \exp\big( C(T - t) \big)^\top r(T) \]  (89)

(see Appendix B for details), with boundary conditions

\[ r_i(T) = \exp(-q_i(T)). \]  (90)

Accordingly, we have for the posterior rate function g_t(j|i) := G_{ij}(t) = C_{ij} r_j(t)/r_i(t) for i ≠ j and g_t(i|i) := G_{ii} = −Σ_{j≠i} g_t(j|i).
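For a modest number of locations, equations 88 to 90 can be implemented directly with a matrix exponential. The sketch below does this for the final-cost case (using the generator form of equation 88 and assuming the rate matrix C has a zero diagonal) and then assembles the controlled rate matrix G(t); it is a minimal illustration, not the implementation used for the figures.

```python
import numpy as np
from scipy.linalg import expm

def single_agent_controller(C, q_T, T, t):
    """Exact single-agent control with final costs only (equations 88-90).

    C[i, j] -- uncontrolled rate of jumping from location i to j (zero diagonal)
    q_T[i]  -- final cost q(i, T)
    Returns r(t) and the controlled rate matrix G(t).
    """
    A = C - np.diag(C.sum(axis=1))          # generator; equation (88) reads dr/dt = -A r
    r_T = np.exp(-q_T)                      # boundary condition, equation (90)
    r_t = expm(A * (T - t)) @ r_T           # propagate the backward equation to time t
    G = C * (r_t[None, :] / r_t[:, None])   # G_ij(t) = C_ij r_j(t) / r_i(t)
    np.fill_diagonal(G, 0.0)
    G -= np.diag(G.sum(axis=1))             # diagonal: g_t(i|i) = -sum_{j!=i} g_t(j|i)
    return r_t, G
```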

7.2.1. Weak Noise Approximation for Single-Agent Control

Deriving the posterior rate for the single-agent case involves the solution of a D-dimensional system of linear differential equations. If D is large, this can become a problem.

With a structured state space and prior rate function, and a cost function q(i, t) that is non-zero only at some final time T, the solution can be simplified using the weak noise approximation by [Ruttor et al., 2009] (see Section 5.1): as an example, we represent a state by a vector X ∈ ℤ². This can be interpreted as the position of the agent in a two-dimensional grid, relative to some origin. If the agent can only move to adjacent positions, the uncontrolled rates can be specified as

\[ f\!\left( \begin{pmatrix} a' \\ b' \end{pmatrix}, \begin{pmatrix} a \\ b \end{pmatrix} \right) = \begin{cases} \lambda_{\text{left}}\big(t, (a, b)\big) & \text{if } a' - a = 1 \text{ and } b' = b \\ \lambda_{\text{right}}\big(t, (a, b)\big) & \text{if } a' - a = -1 \text{ and } b' = b \\ \lambda_{\text{up}}\big(t, (a, b)\big) & \text{if } b' - b = 1 \text{ and } a' = a \\ \lambda_{\text{down}}\big(t, (a, b)\big) & \text{if } b' - b = -1 \text{ and } a' = a \\ 0 & \text{else.} \end{cases} \]  (91)

Accordingly, the drift vector (equation 58, Section 5.1) becomes

\[ f(X) = \sum_{X' \neq X} f(X'|X)\, (X' - X) \]  (92)
\[ = \begin{pmatrix} \lambda_{\text{left}}(t, X) - \lambda_{\text{right}}(t, X) \\ \lambda_{\text{up}}(t, X) - \lambda_{\text{down}}(t, X) \end{pmatrix}, \]  (93)

and the diffusion matrix (equation 59, Section 5.1)

\[ D(X) = \sum_{X' \neq X} f(X'|X)\, (X' - X)(X' - X)^\top \]  (94)
\[ = \begin{pmatrix} \lambda_{\text{left}}(t, X) + \lambda_{\text{right}}(t, X) & 0 \\ 0 & \lambda_{\text{up}}(t, X) + \lambda_{\text{down}}(t, X) \end{pmatrix}. \]  (95)

Now,

\[ r(X, t) \propto \exp\left( -\frac{1}{2} (X - b(t))^\top B(t)^{-1} (X - b(t)) \right) \]  (96)


Figure 3: Single-agent weak noise approximation in a one-dimensional state space. On the left, the function r(i, t) computed according to the single-agent backwards equation (equation 88) is shown. The image in the center shows the corresponding weak noise approximation rwn(i, t). The graph on the right depicts the difference r(i, t) − rwn(i, t).

with b(t) and B(t) evolving according to equations 60 and 63 (Section 5.1). If the rates are independent of time and state, we get

\[ b(t) = b_T - \begin{pmatrix} \lambda_{\text{left}} - \lambda_{\text{right}} \\ \lambda_{\text{up}} - \lambda_{\text{down}} \end{pmatrix} (T - t) \]  (97)
\[ B(t) = B_T + \begin{pmatrix} \lambda_{\text{left}} + \lambda_{\text{right}} & 0 \\ 0 & \lambda_{\text{up}} + \lambda_{\text{down}} \end{pmatrix} (T - t), \]  (98)

where b_T = e^{−q(T)} and q_i(T) := q(i, T).

This result can easily be extended to N-dimensional state spaces. In any case, computing the posterior rate function involves solving systems of N linear differential equations, which simplifies matters substantially, since usually N ≪ D.

See Figure 3 for an example: here, the solution to the backwards equation (equation 88) of a single-agent control problem on a one-dimensional state space with 40 locations is shown on the left, next to the weak noise approximation for the same task. The rightmost picture shows the difference between the exact solution and the approximation. Note that the error is small except at the edge of the state space, which is due to the fact that the approximation presupposes an infinite state space.
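A minimal numerical illustration of equations 96 to 98 for constant rates on a two-dimensional grid; the dictionary layout for the rates and the function name are my own choices for this example.

```python
import numpy as np

def weak_noise_r(X, t, b_T, B_T, lam, T):
    """Weak noise approximation of r(X, t) for constant, state-independent rates
    (equations 96-98) on a two-dimensional grid.

    lam      -- dict with keys 'left', 'right', 'up', 'down' giving the constant rates
    b_T, B_T -- boundary mean and covariance at the final time T
    """
    drift = np.array([lam['left'] - lam['right'], lam['up'] - lam['down']])
    diffusion = np.diag([lam['left'] + lam['right'], lam['up'] + lam['down']])
    b = b_T - drift * (T - t)                  # equation (97)
    B = B_T + diffusion * (T - t)              # equation (98)
    d = np.asarray(X, dtype=float) - b
    return np.exp(-0.5 * d @ np.linalg.solve(B, d))
```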

7.2.2. Marginal Probability for Single-Agent Systems

Single-agent systems are trivially monomolecular – since there is only one agent, rates can never depend on interactions between agents. For that reason, the expected state of the system can be computed using equation 29 (Section 4.2). The agent's location at some time t is categorically distributed, and the marginal probability of the agent occupying location i is P(i, t) = E{X(t)_i}.

7.2.3. Examples

As an example, let the task be to control an agent in such a way that it ends up at some specified position s_goal at a given time T. The cost function encoding this problem could be given as follows:

\[ q_i(t) = \begin{cases} \infty & \text{if } t = T \text{ and } i \neq s_{\text{goal}} \\ 0 & \text{else} \end{cases} \]  (99)

Consequently,

\[ r_i(T) = \begin{cases} 1 & \text{if } i = s_{\text{goal}} \\ 0 & \text{else} \end{cases} \]  (100)

See Figures 4 and 5 for simulation results: the one agent in this task starts at state 10 and is supposed to reach state 35 at time T = 200. The uncontrolled process rate is set to a constant value λ for adjacent states and to 0 otherwise. Figure 4, (a), shows one sample of a controlled agent with λ = 1. For the sample in Figure 5, (a), λ = 10. Figures 4 and 5, (b), show expected values over time for all states of the system for the controlled processes. Samples were acquired using time-step sampling (see Section 4.4) with appropriately small time steps.

More complex cost functions are just as simple to handle. See Figure 6 for a simulation with two goal states that have equal costs.

Another example shows the effect of noise on control: in a setting with several goals, the probability of the agent reaching a specific goal depends on the level of noise in the uncontrolled dynamics. See Figure 7. Here, the agent starts at position 10 and may, by time T, either move to position 40 or return to position 10. With a low level of noise, the agent almost always returns to position 10, since reaching position 40 would necessitate a large deviation from the uncontrolled dynamics. In contrast, the agent ends up at position 40 more and more often with increasing level of noise, because it may move close to that goal by chance.

A related interesting feature of stochastic control is symmetry breaking [Kappen, 2005]: if an agent may choose between multiple goal states, control will be weak as long as the goal time is sufficiently far in the future. The agent will first wander around without control, according to its uncontrolled dynamics, and at the end steer to whichever goal turned out to be close. This can be seen in Figure 8. It shows the average deviation of the agent's jump rate from the uncontrolled rate over time, for different levels of noise in the uncontrolled dynamics, for the task shown in Figure 6. The figure illustrates that if noise is high (red line), control happens mostly in the final stage of the task, when the destination becomes clear and controlled movements will not be washed out by future noise. If noise is low, though (blue line), control is relatively uniform during the whole task.


Figure 4: Single-agent control with low noise. (a) shows three samples of a controlled single-agent process with start location 11 and goal location 35 at t = 20. (b) shows the expected value of the controlled process over time (left) and the solution to the backwards equation (right). Without control, transitions to adjacent locations occur with rate 1.


Figure 5: Single-agent control with high noise. (a) shows three samples of a controlled single-agent process with start location 11 and goal location 35 at t = 20. (b) shows the expected value of the controlled process over time (left) and the solution to the backwards equation (right). Without control, transitions to adjacent locations occur with rate 10.


Figure 6: Single-agent control with two goals. (a) shows three samples of a controlled single-agent process with start location 25 and goal locations 10 and 40 at t = 20. (b) shows the expected value of the controlled process over time (left) and the solution to the backwards equation (right). Without control, transitions to adjacent locations occur with rate 1.


Figure 7: The effect of noise. Three samples of a controlled single-agent process with two goal-locations are shown on the left. The uncontrolled rate for transitions to neighboring locations is 0.1 for the top sample and 10 for the other two samples. The graph on the right shows the probability of reaching the bottom goal ("goal 1") or the top goal ("goal 2"), depending on the transition rates of the uncontrolled process.


Figure 8: Symmetry breaking. Average control costs over time for single-agent goal-directed control with uncontrolled transition rates λ = 0.1 (blue), λ = 1 (green) and λ = 10 (red).


8. Exact Multi-Agent Control

This section introduces exact methods for multi-agent control in Markov jump processes. Note

that in some cases, numerical solutions of differential equations may be necessary. “Exact” refers

to the remaining aspects of the methods.

8.1. Multi-Agent Control with Linear Costs

Multi-agent control is simple if the state costs are linear functions of the state, that is, q(X, t) = q(t)^\top X, since in that case the likelihood term in the corresponding inference problem factorizes (i.e. \exp(-q(t)^\top X) = \prod_i \exp(-q_i(t) X_i)) and agents behave independently of each other.

Due to this, the posterior rate function can be computed by solving the problem for the single-agent case if no new agents can enter the system⁵ (i.e. c_{0i} = 0 for all i). For the posterior rate function, we get

g_t(X'|X) = \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_{X', X - 1_i + 1_j}\, G_{ij}(t)\, X_i,    (101)

where G_{ij}(t) is the single-agent posterior rate-matrix and 1_i is the ith column of the N × N identity matrix.

One can also solve multi-agent control with linear costs by computing

r(X, t) = a(t) \exp\!\left( \ln r(t)^\top X \right)    (102)

with

\frac{dr_i}{dt} = -\sum_{k \neq 0} c_{ik}(r_k - 1) - q_i(t)\, r_i    (103)

\frac{da}{dt} = -a \sum_{k \neq 0} c_{0k}(r_k - 1).    (104)

This result can be applied to systems with agents entering the system. See [Opper and Ruttor, 2010] for a derivation.

⁵ Agents leaving the system can be modelled by introducing an absorbing state X_0.


Figure 9: Multi-agent control with linear cost function. One sample of a multi-agent control task involving 40 agents with linear, time-dependent costs is shown on the left. The state cost-function is depicted on the right. Without control, agents switch to neighboring locations with rate 3.

8.1.1. Example

See Figure 9 for an example: Initially, each location is occupied by one agent. Agents are controlled using a time-dependent, linear cost-function q(t)^\top X with

q_i(t) = \begin{cases} -10 & \text{if } t \bmod 90 < 30 \text{ and } i < 11 \\ -10 & \text{if } 30 \le t \bmod 90 < 60 \text{ and } 10 < i < 30 \\ -10 & \text{if } 60 \le t \bmod 90 < 80 \text{ and } 29 < i \\ 10 & \text{else} \end{cases}    (105)

and uncontrolled dynamics which let them switch to adjacent locations with rate 3.
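As an illustration, the schedule of equation 105 can be written as a small function; a minimal sketch in Python (the function name and signature are not from the thesis):

def q(i, t):
    """Time-dependent linear state cost q_i(t) from equation 105."""
    phase = t % 90
    if phase < 30 and i < 11:
        return -10
    if 30 <= phase < 60 and 10 < i < 30:
        return -10
    if 60 <= phase < 80 and 29 < i:
        return -10
    return 10

Locations thus become attractive (cost -10) in three blocks that move across the grid with a period of 90 time units, which produces the drifting occupation pattern visible in Figure 9.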

8.2. Solving the Backwards Equation

The naïve approach to multi-agent control in Markov jump processes is deriving the controlled process rate g_t by solving the backwards equation

\partial_t r(s, t) = \sum_{s' \neq s} f(s'|s) \left( r(s, t) - r(s', t) \right) + r(s, t)\, c(s, t)    (106)

(see also Section 6) and using g_t(X'|X) = f(X'|X)\, \frac{r(X', t)}{r(X, t)}. However, as stated previously, this is a system of as many linear differential equations as there are states in the system – in the case of multi-agent control, there are D^N states, where D is the number of locations and N is the number of agents (assuming that all agents have access to all locations). Therefore, applying this approach directly is usually not feasible, even for systems involving only a few agents.


8.3. Forward Solution

A different approach exploits the fact that agents behave independently of each other without control, which implies that in the absence of control, a multi-agent system is monomolecular. Now, as stated in Section 4.5, r(X, t) = p(D_{>t} | X(t) = X) – in the context of inference, the solution of the backwards-equation equals the probability of future observations given the current state of the system. Transferred to control, this means that r(X, t) = p(X_T, T | X(t) = X), the likelihood of reaching state X_T at time T when starting at X_t. In a monomolecular system, one can compute this probability directly by solving the (forward) Master equation, using the results from [Jahnke and Huisinga, 2007, p. 11] (see Section 4.3), and derive the controlled process as g_t(X'|X) = f(X'|X)\, \frac{r(X', t)}{r(X, t)}.

For the marginal distribution p(\cdot, T | X_t = X), we have⁶

p(\cdot, T | X_t = X) = \mathcal{M}(\cdot, X^{(1)}_t, p^{(1)}(T - t)) \star \cdots \star \mathcal{M}(\cdot, X^{(n)}_t, p^{(n)}(T - t)),    (107)

where

\frac{dp^{(k)}}{dt} = A(t)\, p^{(k)}    (108)

p^{(k)}(0) = 1_k.    (109)

1_k is the kth column of the D × D identity matrix, D being the number of locations.

Here, we have one multinomial for each occupied location at the start-state X(t). This can be used to compute r(X, t) = p(X_T | X(t) = X) and thereby the process rate of the controlled system.

The intuition is this: The location of an agent at time T is a categorically distributed random variable with p(X(T) = i | X(t) = k) = p^{(k)}_i (see Section 7.2.2). Each agent behaves independently of all others, thus the locations at T of several agents starting at the same location are multinomially distributed. The final state X(T) is a sum over the states resulting from groups of agents starting from the same location. Since all those groups are independent, the resulting probability distribution will be a convolution of the individual probability distributions (the sum of independent random variables is distributed according to the convolution of the individual distributions).

In contrast to the naïve approach of solving the backwards equation directly, which required the solution of a system of D^N linear equations, this method requires solving at most D systems of D equations (again, D being the number of locations and N the number of agents). Still, the method can realistically only be applied to small systems because of the complexity of the convolution.

Note that for computing the controlled rate g_t, equation 108 needs to be solved at every timestep for each occupied location, which is computationally demanding and can be an issue when agents should be controlled in real-time. To circumvent this problem, one may precompute the parameter-vectors p for all times up to time T, the end of the trial, before starting it (when using numerical solution methods for integrating the parameter vectors, this comes at the relatively moderate cost of some additional memory). This reduces computation during actual control. However, it needs to be done not only for those locations that are occupied at t = 0, but for all locations that can possibly be reached.

⁶ I neglect the case that new agents can enter the system.

8.4. Backward Solution

Using the same basic idea as before, it is also possible to compute an exact solution using the single-agent backwards solution for r(X, t): As stated in the previous section, an entry of a parameter vector in the forward solution, p^{(k)}_i, can be interpreted as the single-agent marginal probability p(X(T) = i | X(t) = k) that the agent occupies location i at time T, given that it started at location k at t. Now, this is exactly the solution r^{(i)}_k of the single-agent backwards equation for a single-agent control task with goal state X_T = i (see Section 7.2). Hence, one can derive the marginal probabilities in the multi-agent case again as a convolution of multinomials (equation 107), but using p^{(k)}_i = r^{(i)}_k, where

\frac{dr^{(i)}}{dt} = -C(t)\, r^{(i)}    (110)

r^{(i)}(T) = 1_i,    (111)

with C defined as in Section 7.2. As before, this allows one to compute r(X, t) = p(X_T | X(t) = X) and consequently the controlled process rate g_t.
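For time-constant rates, equations 110 and 111 can be evaluated in closed form with a matrix exponential: with constant C, r^{(i)}(t) = expm(C (T - t)) 1_i. A minimal sketch (the assumption of a constant matrix C and the function name and interface are illustrative, not from the thesis):

import numpy as np
from scipy.linalg import expm

def goal_solutions(C, goals, tau):
    """Single-agent backward solutions of eqs. 110-111 at time-to-go tau = T - t.

    C     : (D, D) matrix of the single-agent backwards equation (assumed constant).
    goals : list of goal locations i.
    Returns an array with entry [k, m] = r(t)^{(k)}_{goals[m]}.
    """
    R = expm(C * tau)      # column i of expm(C * tau) is r^{(i)} at time-to-go tau
    return R[:, goals]

The columns obtained this way are the single-agent likelihoods r(t)^{(i)}_k that are reused by the approximate controllers in Sections 9.2 to 9.4.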

Computing the parameters this way has the advantage that, instead of solving the single-agent backwards equation for every single location, one can specify a number of single-agent goals and compute the solution for these goals. The marginal probability distribution is then over assignments of numbers of agents to single-agent goals instead of assignments of numbers of agents to locations. This is beneficial from a computational point of view, since only as many systems of equations need to be solved as there are goals. In addition, these goals need not consist in reaching specific locations; they can be derived from any kind of state-dependent cost function. In particular, this provides a way of doing a particular kind of multi-agent ergodic control (see Section 10.2).

See Figure 10 for an example: Here, there are two single-agent goals to be fulfilled at time T ,

one consists in reaching location 0 and the other consists in ending up in the upper half of the

state space (locations 20 to 40). Both goals should be reached by exactly two agents.

Although the backwards method for computing parameters for the marginal probability is

arguably more efficient than the forward method, the complexity of the convolution remains.


Figure 10: Multi-agent goal-directed control with four agents. In the uncontrolled process, transitions occur with rate 1.

9. Approximate Multi-Agent Control

One approach to render multi-agent MJP-control involving more than just a few agents feasible is

to exploit the equivalence of control to probabilistic inference and to apply approximate inference

techniques. In the following sections I present different methods based on this idea.

9.1. Weak Noise Approximation

One instance of an approximate inference method is the weak noise method as introduced in Section 5.1. The method can be used for control problems whose goal is to reach a certain state X_T at time T, with no further state-dependent costs.

To repeat, the solution of the backwards equation r(X, t) is approximated by

r(X, t) \approx \eta(t) \exp\!\left( -\tfrac{1}{2} (X - b(t))^\top B^{-1}(t) (X - b(t)) \right),    (112)

with b and B evolving according to

\frac{db}{dt} = f(b),    (113)

and

\frac{dB}{dt} = A(b(t))\, B(t)\, A(b(t))^\top - D(b(t)),    (114)

with

f(X) = \sum_{X' \neq X} f(X'|X)\, (X' - X)    (115)

D(X) = \sum_{X' \neq X} (X' - X)\, f(X'|X)\, (X' - X)^\top    (116)

A_{ij}(X) = \frac{\partial f_i}{\partial x_j}.    (117)

For the control task to reach a goal X_T at T, one gets boundary conditions b(T) = X_T and B(T) = \mathbb{1}\varepsilon with a small \varepsilon. Accordingly, the controlled rate of the process is approximated by

g_t(X'|X) = f(X'|X)\, \frac{r(X', t)}{r(X, t)}    (118)

\approx f(X'|X)\, \frac{\exp\!\left( -\tfrac{1}{2} (X' - b(t))^\top B^{-1}(t) (X' - b(t)) \right)}{\exp\!\left( -\tfrac{1}{2} (X - b(t))^\top B^{-1}(t) (X - b(t)) \right)}    (119)

= f(X'|X) \exp\!\left( -\tfrac{1}{2} (X' - X)^\top B^{-1}(t) (X' - X) + b(t)^\top B^{-1}(t) (X' - X) \right).    (120)

An issue with the weak-noise approximation in the context of multi-agent control is the approximation of r by an unnormalized Gaussian: Since no location can be occupied by a negative number of agents, this approximation is only valid if the mean b of the Gaussian is large (with respect to the covariance matrix) in all dimensions. Problematically, in typical multi-agent control scenarios, only few locations are occupied and many are empty. In particular, since b(T) = X_T, X_T should be large in all dimensions. Even if that is the case, some entries of b may quickly become small when evolved according to equation 113. This can happen, for example, if transitions from some location i to a location k occur at a high rate, but not the other way round. Given these considerations, one would expect the weak noise approximation to be appropriate only for control of large numbers of agents.

A second problem with the weak noise approximation in the context of control is the assumption that typical state vectors can be expected to be close to b(t). This assumption is needed for the second expansion in the derivation of the weak noise approximation (equation 61, Section 5.1). While the assumption is justified in a probabilistic inference setting, this is not the case for control: The assumption should hold for state vectors at all times, in particular for t = 0. Hence, X(0) should be close to b(0). However, as X(0) is the starting state of the control task, one should be able to choose it arbitrarily. See Figure 11 for an example illustrating the issue: I performed 100 simulations of a control problem with a goal-state X_T that has X^{(i)}_T = 100 for all i and an initial state X(0) with X(0)^{(i)} = 0 for all i except X(0)^{(5)} = 1000. The rate of the uncontrolled process is 1 for transitions to adjacent locations in either direction. Since there is no drift in the system, b(t)_i = 100 for all i and all t, and the unnormalized Gaussian approximating r is concentrated far from the boundary of state space at all times.

The overall average control costs using a controller based on the variational approximation (see next section) are 938. With the weak noise approximation, average control costs are larger than 5.8 · 10^8. In contrast, in a control task with the same goal-state, but with a starting-state that equals the goal-state, control costs using both methods are comparable.

Figure 11: Weak noise control. Left: Samples of a goal-directed control task using the variational method (top) and the weak noise method (bottom). Right: Average control costs over time using both methods.

9.2. Variational Approximation

One can approximate the posterior rate function of a multi-agent MJP using the variational method⁷ by [Opper and Ruttor, 2010], outlined in Section 5.2. In a setting with goal-states, that is, with state-dependent costs that depend only on the success of reaching a certain state X_T at time T, the method can be directly adapted. The costs can be modeled as

\frac{1}{2\sigma^2} \left\| X_T - L[X(T)] \right\|^2,    (121)

the squared distance of the linearly transformed state of the system at time T from the goal state X_T.⁸ σ^2 can be seen as a parameter that specifies the penalty due to deviation from the goal. Deviations will be tolerated more if σ^2 is large. In that case the posterior rate function g_t will be more similar to the prior rate f than in the case of small σ^2. We have seen in Section 6 that the cost-function defined in equation 121 corresponds to a "likelihood" p(X_T | X(T)) = \exp\!\left( -\frac{1}{2\sigma^2} \| X_T - L[X(T)] \|^2 \right). This is exactly the measurement model [Opper and Ruttor, 2010] use as a basis for their approximation (equation 67, Section 5.2), and the method can be applied without further ado: As discussed in Section 5.2, the solution to the backwards equation r(X, t) can be approximated by

r(X, t) \approx a(t) \exp\!\left( \ln r^\top X \right),    (122)

with r and a as defined in Section 5.2 and with a boundary condition r(T) = \exp(\phi). \phi maximizes a variational lower bound to the free energy (equation 73, Section 5.2). It can be found using gradient ascent methods with the gradient

\nabla_\phi f = -\sigma^2 \phi + X_T - E\!\left( L[X(t)] \right).    (123)

⁷ Note that this method only applies to monomolecular systems. In the case of multi-agent control it is applicable due to the independence of agents without control.

⁸ For simplicity, I will omit the linear transformation L in the following and assume it to be the identity function.

Here, the expectation is computed with respect to the current \phi. Computing this expectation can be expensive, as it involves solving a system of linear equations (equation 29, Section 4.2). However, as discussed in Section 8.1, in the case that no new agents can enter the system, multi-agent control with linear costs can be expressed as a combination of single-agent solutions:

g_t(S'|S) = \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_{S', S - 1_i + 1_j}\, G^t_{ij}\, S_i.    (124)

If there are N goal locations that should be occupied by at least one agent with positive probability at time T, one can express the controlled rate-matrix G as

G^t_{ij} = \begin{cases} f(j|i)\, \dfrac{\sum_{k=1}^{N} w_k\, r(t)^{(j)}_k}{\sum_{k=1}^{N} w_k\, r(t)^{(i)}_k} & \text{if } i \neq j \\ -\sum_{j \neq i} G^t_{ij} & \text{else,} \end{cases}    (125)

where the r(t)^{(j)}_k are the solutions to the N single-agent control problems corresponding to each goal and w_k = e^{\phi_k}. We have

E\left[ X(T)_k \,|\, w, X_t \right] = \sum_{i=0}^{D} X^{(i)}_t\, \frac{w_k\, r(t)^{(i)}_k}{Z(i)}.    (126)

Here, \frac{w_k\, r(t)^{(i)}_k}{Z(i)} =: P(k|i) can be interpreted as the likelihood that an agent occupying location i at time t reaches goal k at T in the controlled process. X^{(i)}_t denotes the number of agents at location i at t. Z(i) = \sum_{l=0}^{N} w_l\, r(t)^{(i)}_l is a normalization term.
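For concreteness, the off-diagonal structure of equation 125 is easy to assemble from the single-agent solutions; a minimal sketch (array layout and function name are assumptions made for illustration):

import numpy as np

def controlled_rate_matrix(f, r_t, w):
    """Single-agent controlled rate matrix G^t of eq. 125.

    f   : (D, D) uncontrolled rates, f[i, j] = rate of jumping from i to j.
    r_t : (D, N) array with r_t[i, k] = r(t)^{(i)}_k for goal k at location i.
    w   : (N,) goal weights w_k = exp(phi_k).
    """
    s = r_t @ w                          # s[i] = sum_k w_k r(t)^{(i)}_k
    G = f * (s[None, :] / s[:, None])    # off-diagonal entries of eq. 125
    np.fill_diagonal(G, 0.0)
    np.fill_diagonal(G, -G.sum(axis=1))  # diagonal: minus the off-diagonal row sums
    return G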

It is important to stress that the method cannot give a solution to the multi-agent control problem directly: The posterior rate function in the approximation is determined by the "likelihood"

r(X, t) \propto \exp\!\left( r(t)^\top X \right)    (127)

= \prod_{i=1}^{M} \exp\!\left( \phi(t)_i X_i \right)    (128)

(equation 69, Section 5.2). This likelihood factorizes. Consequently, the posterior factorizes if the prior factorizes, which we assume.⁹ This is a problem, since it implies that agents will behave independently not only in the uncontrolled process, but also in the controlled process, which contradicts the purpose of the whole procedure: in order to reach the goal state, agents cannot act independently of each other, since otherwise it is impossible to consistently get a specific number of agents per location. See the next section for a further discussion of this issue.

Another direction one can take here is to approximate the marginal distribution of X at T by a Gaussian, which gives a closed-form solution for \phi. However, preliminary experiments indicate that this is not appropriate for control. Moreover, one can use the Gaussian approximation directly for deriving controlled rates (see Section 9.5), without a further approximation using the method presented in this section. See appendix C for details.

9.3. Expectation Control

One can gain a further perspective on the above by approximating a solution to a non-linear control problem using a linear cost-function, by minimizing the divergence of the expected state of a controlled process at time T from some goal state X_T. Note that this does not solve the problem of reaching X_T in every individual trial.

In order to minimize the divergence of the expected state at T from the goal state X_T, we need to find a weight-vector w^* with

w^* = \arg\min_w \left\| E\left[ X(T) \,|\, w, X_t \right] - X_T \right\|^2,    (129)

where E[X(T)_k | w, X_t] is defined as above (equation 126). The optimal weights can be found using an iterative, EM-type procedure (see appendix D for a reformulation in the usual terms of the EM-algorithm). In the E-step, the normalization Z(i) is computed. In the M-step, the weights are optimized using

w_k = \frac{X^{(k)}_T}{\sum_{i=0}^{D} X^{(i)}_t\, \frac{r(t)^{(i)}_k}{Z(i)}},    (130)

which is found by differentiating the objective function

\partial_{w_k} \frac{1}{2} \left\| E\left[ X(T) \,|\, w, X_t \right] - X_T \right\|^2 = \left( w_k \sum_{i=0}^{D} X^{(i)}_t\, \frac{r(t)^{(i)}_k}{Z(i)} - X^{(k)}_T \right) \sum_{i=0}^{D} X^{(i)}_t\, \frac{r(t)^{(i)}_k}{Z(i)}    (131)

\overset{!}{=} 0.    (132)

The result is guaranteed to improve in each step [Bishop, 2006] and one can stop iteration after some error-criterion has been reached.
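The E- and M-steps above are straightforward to implement; a minimal sketch (array layout, function name and stopping rule are assumptions, with the threshold playing the role of the error criterion ε used in Section 12.2):

import numpy as np

def fit_weights(r_t, X_t, X_T, eps=1e-5, max_iter=1000):
    """EM-type fit of the goal weights w (eqs. 126 and 130).

    r_t : (D, N) array with r_t[i, k] = r(t)^{(i)}_k.
    X_t : (D,) current numbers of agents per location.
    X_T : (N,) desired numbers of agents per goal.
    """
    w = np.ones(r_t.shape[1])
    for _ in range(max_iter):
        Z = r_t @ w                      # E-step: Z(i) = sum_l w_l r(t)^{(i)}_l
        per_goal = (X_t / Z) @ r_t       # sum_i X_t^{(i)} r(t)^{(i)}_k / Z(i)
        expected = w * per_goal          # E[X(T)_k | w, X_t], eq. 126
        if np.sum((expected - X_T) ** 2) < eps:
            break
        w = X_T / per_goal               # M-step, eq. 130
    return w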

⁹ This is equivalent to the observation that the posterior process will remain monomolecular if the prior process is monomolecular, made by [Opper and Ruttor, 2010, p. 5].

Returning to the discussion from the previous section, we see that if σ = 0, the gradient in equation 123 can be rewritten in terms of w as

\frac{\partial f}{\partial w_k} = E\left[ X(T)_k \,|\, w, X_t \right] \left( \frac{X^{(k)}_T}{\sum_{i=1}^{D} X^{(i)}_t\, \frac{r(t)^{(i)}_k}{Z(i)}} - w_k \right).    (133)

The sign of the gradient depends only on \frac{X^{(k)}_T}{\sum_{i=1}^{D} X^{(i)}_t\, r(t)^{(i)}_k / Z(i)} - w_k; therefore, using it in gradient ascent will lead to the same solution as using the update rule in equation 130. Hence, finding the lower bound of the free energy (equation 73, Section 5.2) and minimizing the distance between the expected state of the process at T and some goal state are dual problems. Thus, a multi-agent system that is controlled according to the method presented in the previous section would reach the goal state in expectation, but not in each individual trial.

The above is an interesting result in itself, since it provides an efficient method for computing the expected evolution of a multi-agent system. Still, one can take a further step and derive a control method. We note that if the covariance matrix of the marginal distribution at the final time T under controlled dynamics is diagonal, that is, if the numbers of agents at different locations are uncorrelated, a system controlled according to the approximation would reach the goal state not only in expectation, but in each individual trial. This is true because correlation arises only if there are several ways to assign agents from start- to goal-locations – if there exist k and l with k ≠ l such that P(k|i) > 0 and P(l|i) > 0 for some i with X(t)^{(i)} > 0. If that is not the case, there is only one option and the expectation becomes a deterministic solution. One can expect correlation to be small if T − t is small with respect to the uncontrolled rates, since in that case divergence between X(t) and X(T) becomes expensive, which makes it likely that a single agent is assigned to just one goal location. Hence, one can construct an approximately exact controller by recomputing the above linear-cost approximation at every timepoint. As T − t becomes small, the control will become more exact. In practice, one can monitor the expected error ||E[X(T) | w, X_t] − X_T||^2 and adapt the weights only whenever it exceeds some predefined threshold. This adaptation should take only a few iterations, because weights that minimize the error can be expected to be close to the previous weights. Simulations indicate that the method works well, even if T − t is large (see Section 15).

9.4. Partial Evaluation of the Solution to the Forward Master Equation

We have seen in Sections 8.3 and 8.4 that one can get an exact solution for multi-agent control by computing the solution to the forward Master equation, which is a convolution of single-agent solutions. However, each convolution involves a sum over too many terms for its computation to be tractable for cases involving more than a handful of agents. One way to circumvent this problem is to compute the solution to the Master equation only partially, leaving out most parts of the sum and thereby approximating the solution.

Instead of expressing the probability r(X, t) = P(X_T, T | X(t) = X) as a convolution of multinomials, one can express it as a sum over possible assignments from agents at time t to single-agent goals, r(X, t) = \sum_{a \in as} r^{(a_1)}_1 \cdots r^{(a_n)}_n, where as is the set of all possible assignments of agents to goals – the probability of fulfilling a goal is the sum of probabilities of all possible ways to fulfill it. In a situation where some agents are likely to reach only a subset of the goals under the uncontrolled dynamics, some assignments of agents to goals are very unlikely and therefore contribute very little to this sum. For that reason, it is reasonable to approximate r(X, t) by a sum \sum_{a \in as^*} r^{(a_1)}_1 \cdots r^{(a_n)}_n with as^* = \{ a \in as \,|\, r^{(a_1)}_1 \cdots r^{(a_n)}_n > \varepsilon \}, the set containing assignments which occur with a likelihood larger than some small \varepsilon.

Constructing as^* directly is not helpful, since computing likelihoods for all assignments would be as demanding as computing the whole sum directly. Instead, one can construct the set heuristically – one method for doing this is to first assign agents to goals in such a way that those agents get assigned first for which the drop in likelihood from being assigned to a different goal would be steepest¹⁰ (see Algorithm 1 for the precise method), and then repeatedly create subsequent assignments by switching assignments for pairs of agents (see Algorithm 2). This can be done for all pairs¹¹ of the most likely assignment that has been created so far.

However, this computation of assignments is computationally demanding and should not be repeated each time r(X, t) needs to be evaluated (which is at every time step for the current state of the system and all possible successor states, since g_t(X'|X) = f(X'|X)\, \frac{r(X', t)}{r(X, t)}). Instead, one can compute assignments before control is started and subsequently adapt these as jumps occur. This is based on the idea that relative likelihoods of assignments stay roughly constant. In practice, it has proven useful to compute a number of assignments before the beginning of a control task, then select the n most likely assignments, and after each jump compute all possible resulting new assignments, again selecting the n most likely for the continuation of the procedure.

In practice, one should take into account repetitions of assignments of agents that are at the same location, by computing

r(X, t) = \sum_{a \in as'} \prod_{i=1}^{D} X_i! \; \frac{\prod_{j=1}^{D} (r^{(j)}_1)^{a_{j \to 1}} \cdots (r^{(j)}_n)^{a_{j \to n}}}{\prod_{k=1}^{N} \prod_{l=1}^{D} a_{l \to k}!}    (134)

instead of \sum_{a \in as^*} r^{(a_1)}_1 \cdots r^{(a_n)}_n. Here, as' is a set of assignments of agents to goals, with a_{i \to k} \in a denoting the number of agents at position i that are assigned to goal k.

The number of likely assignments of agents to goals decreases as t approaches the final time T. Thus, one can expect the approximation of r to be more accurate in late stages of control. See Figure 12 for data illustrating this point. Here, a goal-directed control task with 20 agents in one dimension was performed 10 times. The graph on the left shows the average normalized difference of two approximations of r(X, t) (r_approx) to its exact value (r) over time, i.e. (r − r_approx)/r. The graph on the right shows the average divergence of the process rates g_approx using the approximation from the optimal process rates g, which is calculated as |1 − g_approx/g|.

¹⁰ Assigning agents greedily would be another option, but has turned out to be less adequate.

¹¹ Selecting only the best candidates for switching is computationally inefficient.


Algorithm 1: initial assignment

Data: the set G = {1 ... N} of goal-indices; the set A = {1 ... D} of agent-indices
Result: an assignment a ∈ D^N from agents to goals

while G not empty do
    lowestRatio ← ∞
    foreach i ∈ A do
        bestGoal ← argmax_{k ∈ G} r^{(i)}_k
        best ← max_{k ∈ G} r^{(i)}_k
        secondBest ← max_{k ∈ G \ {bestGoal}} r^{(i)}_k
        ratio ← secondBest / best
        if ratio < lowestRatio then
            lowestRatio ← ratio
            nextAgent ← i
            nextGoal ← bestGoal
    a_{nextGoal} ← nextAgent
    remove nextGoal from G
    remove nextAgent from A

Algorithm 2: switching assignments

Data: the set as^* of assignments for which flips have not been computed
Result: as^* with added assignments

a^* ← argmax_{a ∈ as^*} r^{(a_1)}_1 ... r^{(a_n)}_n
for i ← 1 to n do
    for j ← i + 1 to n do
        a' ← a^*
        a'_i ← a^*_j
        a'_j ← a^*_i
        add a' to as^*

For the approximations, 1 assignment and 10 assignments from agents to goals were used to estimate the solution of the Master equation as explained above.

Note that the differences in r are large in the beginning of the task and small at the end. Interestingly, although the normalized difference of the approximation of r(X, t) to its exact value is very large in the beginning of the task when using an approximation based on one assignment from agents to goals, this is not reflected in the divergence of process rates. An explanation may be the following: In the beginning of the task, the likelihood of different assignments to goals is relatively uniform across different assignments. For this reason, the approximation of r(X, t) is rough if only a few assignments are evaluated. However, for the same reason, differences between r(X', t) and r(X, t) can be expected to be small for any X', both in the approximation and in the exact solution, so that g_t(X'|X) ≈ f(X'|X).


Figure 12: Left: Average normalized difference of the approximated likelihood r_approx to the exact likelihood r over time in a goal-directed control task with 20 agents. Right: Average divergence of the approximated controlled process rates from the exact controlled process rates. Averages are over 10 trials.

9.5. Gaussian Approximation

A further option to approximate the exact solution as given in Sections 8.3 and 8.4 is to use a Gaussian approximation to the solution of the Master equation: A multinomial distribution M(·, N, p) is well approximated by a multivariate normal distribution if N is large and p is not near the boundary of parameter space [Severini, 2005, p. 378]. The sum of several independent Gaussian random variables stays Gaussian [Severini, 2005, p. 235], thus the marginal probability of some state X at time t, being a sum of multiple independent multinomials¹², should be well approximated by a Gaussian if the individual multinomials satisfy the conditions for a Gaussian approximation. Even if that is not the case, a Gaussian approximation of the marginal can still be appropriate, since, due to the central limit theorem, the sum of several independent random variables is closer to a Gaussian than the individual random variables.

The conditions for the appropriateness of a normal approximation of a multinomial random variable have two implications for its application in multi-agent control: First, there should be many agents at each location (since that determines the parameters N of the multinomials). Second, the likelihood r^{(k)}_i of an agent at location i reaching goal k should be neither close to one nor close to zero for all agents and all goals (since, as shown in Section 8.4, the r^{(k)}_i constitute the parameters p of the multinomials). The latter condition is most likely to be fulfilled if the uncontrolled process rates are large in relation to T − t, the time remaining until the goal state is to be reached. In any task this will eventually cease to be the case as T − t approaches 0. Still, the Gaussian approximation may be an option for the early stages of control, where the methods presented in Sections 9.2 and 9.4 have their weaknesses.

Mean and covariance matrix of the normal approximation are given by equations 40 and 41, Section 4.3.

¹² Independence follows from the monomolecularity of the uncontrolled process.
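A minimal sketch of this approximation, using the standard multinomial moments (mean N p and covariance N(diag(p) − p pᵀ)) and summing the independent contributions of the occupied locations; the specific form of equations 40 and 41 is assumed to coincide with these standard moments, and the function names are illustrative:

import numpy as np

def gaussian_marginal(p, X_t):
    """Gaussian approximation of the marginal at time T (Section 9.5).

    p   : (D, D) array, p[k] = single-agent location distribution at T for an
          agent currently at location k (the parameter vector p^{(k)}).
    X_t : (D,) current numbers of agents per location.
    """
    mean = np.zeros(len(X_t))
    cov = np.zeros((len(X_t), len(X_t)))
    for k, n_k in enumerate(X_t):
        if n_k == 0:
            continue
        mean += n_k * p[k]                                   # multinomial means add
        cov += n_k * (np.diag(p[k]) - np.outer(p[k], p[k]))  # covariances add (independence)
    return mean, cov

def log_r(X, mean, cov):
    """log r(X, t) up to an additive constant under the Gaussian approximation."""
    d = np.asarray(X, float) - mean
    return -0.5 * d @ np.linalg.solve(cov, d)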

10. Ergodic Control

Ergodic control is control with a state-dependent cost function q(X) that is independent of time. Control should be maintained over an indefinite time, thus the goal is not to minimize accumulated costs, but to minimize a cost-rate.

10.1. Single-Agent Ergodic Control

With time-independent costs, the backward-equation for single-agent control becomes

\frac{dr}{dt} = -(C - \mathbb{I}q)\, r,    (135)

where q is a cost-vector and C is defined as in Section 7.2. For a final time T, this is solved by

r(T - t) = \sum_{i=1}^{D} e^{-\lambda_i (T - t)}\, v^{(i)},    (136)

where the \lambda_i and v^{(i)} are the eigenvalues and eigenvectors of -(C - \mathbb{I}q). Since, in ergodic control, there is no final time T, we let T − t go to infinity. Doing this, |r| goes to 0. However, for calculating the ergodic controlled rates g_e, we are only interested in ratios between the r_i. Hence,

g_e(l|k) = f(l|k) \lim_{t \to \infty} \frac{r_l(t)}{r_k(t)}    (137)

= f(l|k) \lim_{t \to \infty} \frac{\left( \sum_{i=1}^{D} e^{-\lambda_i t}\, v^{(i)} \right)_l}{\left( \sum_{i=1}^{D} e^{-\lambda_i t}\, v^{(i)} \right)_k}    (138)

= f(l|k)\, \frac{v^{(j)}_l}{v^{(j)}_k},    (139)

with j = \arg\min_i \mathrm{Re}(\lambda_i). The last equality holds because in the limit the sum is dominated by e^{-\lambda_j t}\, v^{(j)}.
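A minimal sketch of equation 139. The form of C is an assumption here: it is taken as the backward generator built from the uncontrolled rates, which is one reading of Section 7.2 consistent with the backwards equation 106; function name and interfaces are illustrative.

import numpy as np

def ergodic_rates(f, q):
    """Single-agent ergodic controlled rates g_e(l|k) via eq. 139.

    f : (D, D) uncontrolled rates, f[k, l] = rate of jumping from k to l.
    q : (D,) time-independent state costs.
    """
    D = len(q)
    off = f * (1.0 - np.eye(D))
    C = off - np.diag(off.sum(axis=1))         # assumed form of C (backward generator)
    lam, V = np.linalg.eig(-(C - np.diag(q)))  # matrix -(C - Iq) from eq. 135
    j = np.argmin(lam.real)                    # slowest-decaying mode dominates the limit
    v = V[:, j].real
    v = v if v.sum() > 0 else -v               # fix the sign; entries should be nonnegative
    return off * (v[None, :] / v[:, None])     # g_e(l|k) = f(l|k) v_l / v_k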

See Figure 13 for an example of single-agent ergodic control and the dominant eigenvector of the corresponding backward-equation that is used for control.

10.2. Multi-Agent Ergodic Control

The result from the previous section can be transferred to the multi-agent case in a relatively straightforward way if the costs can be constructed as a combination of single-agent costs.

As we have seen, the likelihood r for single agents goes to zero in the limit of ergodic control. Still, the controlled rates can be computed, since they depend on the ratio between two rs that go to zero with equal pace.

Figure 13: Single-agent ergodic control. Left: One sample of a single-agent ergodic control task with constant state costs q(i) = -2\,\delta_{i,20}. Uncontrolled transitions to adjacent locations occur with rate 2. Right: Normalized eigenvector corresponding to the largest eigenvalue for the control problem.

The same holds for the multi-agent case: As in Section 9.4, we express

r(X, t) as a sum \sum_{a \in as_a} r^{(a_1)}_1 \cdots r^{(a_n)}_n, where as_a is the set of all possible assignments of agents to goals and r^{(a_i)}_i is the single-agent likelihood for agent a_i fulfilling goal i. We have

\frac{r(X', t)}{r(X, t)} = \frac{\sum_{a \in as_a} r^{(a_1)}_1 \cdots r^{(a_n)}_n}{\sum_{a \in as_b} r^{(a_1)}_1 \cdots r^{(a_n)}_n}    (140)

= \lim_{t \to \infty} \frac{\sum_{a \in as_a} \sum_{i=1}^{D} e^{-\lambda_{1,i} t}\, v^{(i)}_{1,a_1} \cdots \sum_{j=1}^{D} e^{-\lambda_{n,j} t}\, v^{(j)}_{n,a_n}}{\sum_{a \in as_b} \sum_{i=1}^{D} e^{-\lambda_{1,i} t}\, v^{(i)}_{1,a_1} \cdots \sum_{j=1}^{D} e^{-\lambda_{n,j} t}\, v^{(j)}_{n,a_n}}    (141)

= \lim_{t \to \infty} \frac{\sum_{i=1}^{D} \cdots \sum_{j=1}^{D} e^{-(\lambda_{1,i} + \ldots + \lambda_{n,j}) t} \sum_{a \in as_a} v^{(i)}_{1,a_1} \cdots v^{(j)}_{n,a_n}}{\sum_{i=1}^{D} \cdots \sum_{j=1}^{D} e^{-(\lambda_{1,i} + \ldots + \lambda_{n,j}) t} \sum_{a \in as_b} v^{(i)}_{1,a_1} \cdots v^{(j)}_{n,a_n}}    (142)

= \frac{\sum_{a \in as_a} v^*_{1,a_1} \cdots v^*_{n,a_n}}{\sum_{a \in as_b} v^*_{1,a_1} \cdots v^*_{n,a_n}},    (143)

where v^*_i is the eigenvector corresponding to the smallest eigenvalue for the solution of goal i. Hence, the solution from Section 8.4 transfers to ergodic control.

As stated above, this works only if the multi-agent task can be expressed in terms of single-agent tasks – for instance if the goal is to maintain a certain number of agents at specific locations. In contrast, it cannot be applied to tasks such as collision avoidance (see Section 10.3), in which costs arise only due to interactions between agents. All approximations except the weak noise approximation can be applied to this setting.


Figure 14: Multi-agent ergodic control. 18 agents with uncontrolled rates λ = 2 in either direction are controlled to minimize the cost rate with a state-dependent cost-function as given in Section 10.2.1. All agents start at location 10.

10.2.1. Example

See Figure 14 for an example of multi-agent ergodic control in one dimension. Single-agent state-dependent costs are defined as

q_j(i) = \begin{cases} -2 & \text{if } i = j \\ 0 & \text{else} \end{cases}    (144)

with j ∈ {5, 15, 20, 25, 35}. The cost function q_{20} is applied to ten agents; all other cost functions are applied to two agents each. Uncontrolled process rates are λ = 2 for either direction.

10.3. Collision Avoidance

In collision avoidance, no two agents should ever occupy the same position. This can be expressed with a state-dependent cost function

q(X) = \begin{cases} \infty & \text{if } \max_i X_i > 1 \\ 0 & \text{else.} \end{cases}    (145)

This is a time-independent cost-function, thus the task is an instance of ergodic control. Unfortunately, one cannot easily express this as a combination of single-agent tasks, therefore the results from the previous section do not apply directly.

Instead, we compute the likelihood of a collision given the current state, which can be expressed as a product of likelihoods for pairwise collisions:

p(\text{collision} \,|\, X_t) = \prod_{a=1}^{N} \prod_{b=a+1}^{N} p(\text{collision}(a, b) \,|\, X_t)    (146)

One can compute r(a, b, X_t) \propto p(\text{collision}(a, b) \,|\, X_t) as a "single-agent" solution in the product space S × S, with uncontrolled process rates

f((j_a, j_b) \,|\, (i_a, i_b)) = \begin{cases} f(i_a | j_a) & \text{if } j_b = i_b \text{ and } j_a \neq i_a \\ f(i_b | j_b) & \text{if } j_a = i_a \text{ and } j_b \neq i_b \\ f(i_a | j_a) + f(i_b | j_b) & \text{if } j_b = i_b \text{ and } j_a = i_a \\ 0 & \text{else.} \end{cases}    (147)

Using this, one can compute the solution for ergodic control with a cost-function

q(i_a, i_b) = \begin{cases} \infty & \text{if } i_a = i_b \\ 0 & \text{else} \end{cases}    (148)

with the method from Section 10.1. The controlled rate for the process then becomes

g(X'|X) = f(X'|X)\, \frac{p(\text{collision} \,|\, X')}{p(\text{collision} \,|\, X)}    (149)

= f(X'|X)\, \frac{\prod_{a=1}^{N} \prod_{b=a+1}^{N} p(\text{collision}(a, b) \,|\, X')}{\prod_{a=1}^{N} \prod_{b=a+1}^{N} p(\text{collision}(a, b) \,|\, X)}    (150)

= f(X'|X)\, \frac{\prod_{a=1}^{N} \prod_{b=a+1}^{N} r(a, b, X')}{\prod_{a=1}^{N} \prod_{b=a+1}^{N} r(a, b, X)}.    (151)

Problematically, this requires the solution of the eigenvalue-problem for a D^2 × D^2 matrix. However, once that solution has been computed for a given state space, it can be reused indefinitely.

See Figure 15 for an example of collision avoidance in one dimension.

Figure 16 shows the solution of the ergodic control problem for collision avoidance in the two-agent product space.


Figure 15: Collision avoidance. The picture on the left shows a sample of a controlled multi-agent process with collision avoidance. For comparison, a sample of the uncontrolled process with the same initial state is shown on the right. Rates for transitions to neighboring locations are 1 in the uncontrolled case.


Figure 16: Solution to the backwards equation for the ergodic collision-avoidance problem in the two-agent product space.


Part IV.

Simulations

In this part of the thesis, I present a series of simulations designed to evaluate the appropriateness

of the different approximate methods for multi-agent control presented in the thesis. The purpose

is to get answers to the following questions:

1. Do approximations work?

2. How does the performance depend on noise in the uncontrolled process?

3. How does the performance depend on the number of agents?

4. How does the performance depend on the number of dimensions?

5. How do different parameter settings influence the results?

6. How efficient are different approximations?

In the following, I outline the tasks used for simulations and reiterate some particularities of the

controllers used.

Note that the simulations are restricted to a subset of possible scenarios and care must be

taken in extrapolating insights gleaned from these scenarios to the general case.

11. Tasks

I measure the performance of approximate controllers using two general types of tasks: goal

directed control and ergodic control.

11.1. Goal-Directed Control

In goal-directed control, agents should reach a fixed state X_T at time T. State-dependent costs are 0 if the goal state is reached and infinite if this is not the case. There are no further state-dependent costs. I perform simulations with different numbers of agents, different uncontrolled process rates, and in one-dimensional and two-dimensional state spaces. All simulations end at T = 1. Transition rates are constant over time and equal for all possible transitions. See Table 1 for the specifications of the simulations.

Simulation   number of agents   uncontrolled process rates   number of dimensions
1            5                  5                             1
2            5                  20                            1
3            5                  100                           1
4            20                 5                             1
5            20                 20                            1
6            100                1                             1
7            100                1                             1
8            5                  5                             2
9            5                  20                            2

Table 1: Simulations for goal-directed control.

11.2. Ergodic Control

In general, ergodic control is control with state-dependent costs which are independent of time. In the particular kind of ergodic control tested here, there are N single-agent ergodic goals that

should be fulfilled, each consisting in keeping an agent at some location i. Each single-agent goal q_i gives a reward (i.e. a negative cost) as long as it is fulfilled. The overall state-dependent cost q(X) is then

q(X) = \int_0^T \sum_{i=1}^{D} \min(X_i, n_i)\, q_i \, dt,    (152)

where -q_i is the reward gained from keeping an agent at i and n_i is the number of times that reward is available (for instance, n_i = 3 would imply that the reward at location i is maximized at t if X_i(t) = 3).

Two simulations are performed. In both, agents move on a one-dimensional grid with 21 locations. State-dependent costs are according to equation 152 with

q_i = \begin{cases} -2 & \text{if } i \in \{5, 10, 15\} \\ 0 & \text{else} \end{cases}    (153)

and

n_i = \begin{cases} 1 & \text{if } i \in \{5, 15\} \\ 3 & \text{if } i = 10 \\ 0 & \text{else.} \end{cases}    (154)

State transitions occur with rate 2 to adjacent locations in the first of the two simulations

(simulation 10) and with rate 10 in the second (simulation 11).

12. Controllers

Simulations are performed using all methods for multi-agent control discussed in the thesis, where

appropriate with multiple parameter-settings. In the following, I outline details related to the

different methods.


12.1. Exact Control

Exact control is performed according to the method presented in Section 8.4. Due to the complexity of exact control, it is only possible to perform it in simple scenarios with few agents. Where possible, it serves as a point of reference for the evaluation of the performance of the approximate methods.

12.2. Variational Approximation

The variational approximation is performed as discussed in Section 9.3. Whenever the mean squared error between the expected state at T and the goal state exceeds a predefined threshold ε, the weights of the controller are updated according to the update rule defined in equation 130. The approximation is tested using ε = 0.1, ε = 10^{-5} and ε = 10^{-10}.

12.3. Partial Evaluation of the Solution to the Master Equation

A further control method tested here is based on the approximation of the solution to the Master

equation introduced in section 9.4. Before the control task is started, 1000 assignments of agents

to goals are computed. The n most likely assignments are kept and adapted to state changes

during control. The approximation is tested for n = 1, n = 10 and n = 100.

12.4. Weak Noise Approximation

The weak noise approximation is applied as explained in Section 9.1. The covariance matrix of the unnormalized Gaussian approximating r is computed using the boundary condition B(T) = 0.1 I, where I is the identity matrix.

12.5. Gaussian Approximation

The Gaussian approximation is performed according to Section 9.5.

13. Sampling

State trajectories are sampled using Gillespie’s method (see Section 4.4). Note that samples

acquired using this method are only approximately correct, since rate-changes between jumps

are not accounted for. However, the error is small, since rates change slowly relative to the rate

of jumps.
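A minimal sketch of this sampling scheme (the transitions interface is an assumption: it should return the possible successor states of X together with their current controlled rates):

import numpy as np

def sample_trajectory(X0, transitions, T, seed=0):
    """Gillespie-style sampling of one trajectory, with rates frozen between jumps."""
    rng = np.random.default_rng(seed)
    t, X, path = 0.0, X0, [(0.0, X0)]
    while True:
        succ = transitions(X, t)                   # list of (X_next, rate) pairs
        rates = np.array([rate for _, rate in succ])
        total = rates.sum()
        if total <= 0.0:
            break
        t += rng.exponential(1.0 / total)          # exponential waiting time until next jump
        if t >= T:
            break
        X = succ[rng.choice(len(succ), p=rates / total)][0]
        path.append((t, X))
    return path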


14. Measure of Performance

Recall that the control-dependent costs are defined as the KL divergence between the controlled process g_t and the uncontrolled process f. The KL-divergence between two MJPs is

KL(q\|p) = \int_0^T dt \sum_X q(X, t) \sum_{X' \neq X} \left( g_t(X'|X) \ln \frac{g_t(X'|X)}{f(X'|X)} + f(X'|X) - g_t(X'|X) \right)    (155)

(equation 45, Section 4.5). One cannot evaluate this directly. However, it can be estimated as

KL(q\|p) = \sum_{i=0}^{T/\Delta t} \frac{\Delta t}{n} \sum_{X^+_{t_i}} \sum_{X' \neq X^+_{t_i}} \left( g_t(X'|X^+_{t_i}) \ln \frac{g_t(X'|X^+_{t_i})}{f(X'|X^+_{t_i})} + f(X'|X^+_{t_i}) - g_t(X'|X^+_{t_i}) \right)    (156)

by sampling n trajectories X^+. For the results presented here, n = 100 and \Delta t = 0.1. In addition to the overall control-dependent costs, the evolution of costs over time is shown.
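A minimal sketch of the estimator in equation 156 (the rate interfaces f and g are assumptions: both are taken to return the successor states of X with their rates, g additionally depending on time; the trajectories are assumed to be stored on the Δt grid):

import numpy as np

def estimate_control_cost(samples, g, f, dt=0.1):
    """Monte-Carlo estimate of the KL control cost (eq. 156).

    samples : list of trajectories; samples[m][i] is the state of the m-th
              sampled controlled trajectory at time i * dt.
    """
    n = len(samples)
    kl = 0.0
    for traj in samples:
        for i, X in enumerate(traj):
            t = i * dt
            f_rates = dict(f(X))                        # {X': f(X'|X)}
            for X_prime, g_rate in g(X, t):
                f_rate = f_rates[X_prime]
                kl += dt / n * (g_rate * np.log(g_rate / f_rate) + f_rate - g_rate)
    return kl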

15. Results

See Figures 17 to 28 for the results of the simulations. Error-bars correspond to the standard error of the mean. In the following, I briefly summarise the results.

15.1. Goal-Directed Control

Simulation 1 Goal directed control of five agents in a one dimensional grid with 41 locations.

Duration is 1, uncontrolled transitions occur with rate 5. See Figure 17 for results.

Control failed with the weak-noise approximation and with the variational approximation using an error threshold of ε = 0.1 (in both cases, the goal state was not reached). Average control costs

using the exact method were 20.14±0.58. Overall performance of all other methods was similar.

Average control costs using the Gaussian approximation were close to optimal (21.88 ± 0.69); however, control is strong in the beginning of the trial and weak at the end of the trial. With all other methods, this is reversed.

Simulation 2 Goal directed control of five agents in a one dimensional grid with 41 locations.

Duration is 1, uncontrolled transitions occur with rate 20. See Figure 18 for results.

Average control costs using the exact method were 11.34 ± 0.33. Costs using the variational

method were close to optimal. The method based on the approximation of the solution to the

Master equation performed close to optimal for n = 10. Using n = 1 average control costs were

16.21± 0.47. Costs were sub-optimal in the beginning of the trial.

Costs for the Gaussian approximation were considerably higher than using the exact method,

28.1± 1.9.

56

The weak-noise method failed, giving average control costs that were several orders of magni-

tude higher than those of any other method.

Simulation 3 Goal directed control of five agents in a one dimensional grid with 41 locations.

Duration is 1, uncontrolled transitions occur with rate 100. See Figure 19 for results.

Control costs using the exact method were 7.34± 0.41. Results using the variational approxi-

mation were close to optimal. Using the approximation of the solution to the Master equation,

control costs were above optimal, 19.82± 1.94 for n = 1 and 12.23± 0.28 for n = 10. For these

methods, costs were sub-optimal in the beginning of the task. The Gaussian approximation

performed worse than optimal. Control costs were 20.47± 1.21.

The weak noise approximation failed, with control costs that were several orders of magnitude

higher than optimal.

Simulation 4 Goal directed control of 20 agents in a one dimensional grid with 41 locations.

Duration is 1, uncontrolled transitions occur with rate 5. See Figure 20 for results.

The exact method was not employed in this simulation since the task was too complex to make

an exact solution feasible.

Results were best using the variational approximation with an error criterion ε = 10^{-5} (average control costs of 36.41 ± 0.94). The variational approximation with error criterion ε = 10^{-10} and the approximation of the solution to the Master equation with n = 100 were close to that result (39.75 ± 0.95 and 38.92 ± 0.88). The method based on the approximation of the solution to the

Master equation performed slightly worse for n = 1 and n = 10.

Average control costs using the weak noise approximation were far from optimal. Using the

method based on the variational approximation with an error criterion ε = 0.1 did not lead to

consistent fulfilment of the goal.

Simulation 5 Goal directed control of 20 agents in a one dimensional grid with 41 locations.

Duration is 1, uncontrolled transitions occur with rate 20. See Figure 21 for results.

Average control costs were lowest using the controller based on the variational approximation (25.12 ± 0.56 for ε = 10^{-5}, 25.32 ± 0.67 for ε = 10^{-10}).

Control costs using the partial evaluation of the marginal were considerably higher and clearly

depended on the number n of assignments computed for the approximation (135.3±1.2 for n = 1,

87.35± 0.79 for n = 10 and 66.07± 0.69 for n = 100).

The Gaussian approximation resulted in average control costs that were much higher (215.70 ± 139.0). Notably, the largest part of this excess cost comes from the last part of the task, shortly before control was finished.

Average control costs using the weak-noise approximation were several orders of magnitude

higher than using any other method.

The goal was not fulfilled with the variational approximation and ε = 0.1.


Simulation 6 Goal directed control of 100 agents in a one dimensional grid with 10 locations.

Start-state and goal state are equal with 10 agents occupying each location. Duration is 1, un-

controlled transitions occur with rate 1. See Figure 22 for results.

Results were best using the variational method (average control costs of 15.01 ± 0.75 using ε = 10^{-10} and 15.03 ± 0.84 using ε = 10^{-5}). Average control costs using the method based on the approximation of the solution to the Master equation were considerably higher (28.88 ± 0.72 for n = 1, 26.82 ± 0.53 for n = 10 and 27.38 ± 0.48 for n = 100). The Gaussian approximation performed in a similar range, but with much higher variation (31.02 ± 13.06).

Again, the weak noise approximation led to control costs that were several orders of magnitude

higher than using all other methods. The variational approximation with ε = 0.1 did not lead

to a consistent fulfilment of the goal.

Simulation 7 Goal directed control of 100 agents in a one dimensional grid with 10 locations.

At t = 0, one location is occupied by 100 agents, all others are empty. In the goal state, all

locations are occupied by 10 agents. Duration is 1, uncontrolled transitions occur with rate 1.

See Figure 23 for results.

Average control costs were lowest using the variational approximation (107.64 ± 1.86 for ε = 10^{-5}, 109.66 ± 1.55 for ε = 10^{-10}). Using the approximation of the solution to the Master

equation, costs were slightly higher (114.69± 1.55 for n = 1 and 113.42± 1.8 for n = 10).

Although the overall costs using both these methods were similar, the evolution of costs over

time was different: Controllers based on the approximation of the solution to the Master equation

led to higher costs than those based on the variational approximation in the beginning of the

task and lower costs in the end.

Both the Gaussian approximation and the weak noise approximation led to average control

costs that were several orders of magnitude higher.

Control based on the variational approximation with ε = 0.1 did not fulfil the goal.

Simulation 8 Goal directed control of five agents in a two dimensional grid with 225 locations.

Duration is 1, uncontrolled transitions occur with rate 5. See Figure 24 for results.

Control using the exact method led to average control costs of 24.42±1.15. Results for control

using the variational approximation and using the approximation of the solution to the Master

equation were in a similar range.

The weak noise approximation and the Gaussian approximation did not work.

Simulation 9 Goal directed control of five agents in a two dimensional grid with 225 locations.

Duration is 1, uncontrolled transitions occur with rate 20. See Figure 25 for results.

Average control costs using the exact method were 17.6 ± 0.51. Results using both the vari-

ational approximation and the approximation of the solution to the Master equation were in a

similar range. One exception is the approximation of the Master equation with n = 1, where

costs were higher (22.37± 0.7), which is due to increased costs in the beginning of the trial.


The Gaussian approximation and the weak noise approximation did not work.

15.2. Ergodic Control

Simulation 10 Ergodic control with 5 agents. Duration of one trial is 100. Uncontrolled tran-

sitions occur with rate 2. See Figure 26 for results.

Overall costs using exact control were −2.78± 0.02.

Results using the approximation of the solution to the Master equation were in a similar range

with n = 10 (−2.8± 0.02). For n = 1, costs were slightly higher (−2.68± 0.02).

The costs using the variational approximation were close to optimal (−2.7 ± 0.02 for ε = 0.1, −2.73 ± 0.017 for ε = 10^{-5} and −2.74 ± 0.02 for ε = 10^{-10}). The control-dependent costs under the variational approximation were lower than using exact control, while the state-dependent costs were higher (for the variational method with ε = 10^{-10}, control-dependent costs were 0.93 ± 0.01 and state-dependent costs −3.67 ± 0.02; using exact control, control-dependent costs were 1.32 ± 0.01 and state-dependent costs were −4.11 ± 0.05).

Average costs using the Gaussian approximation were sub-optimal (−2.27 ± 0.03). Notably,

the control-dependent costs were much higher than using any other method (3.7± 0.28) and the

state-dependent costs lower (−5.97± 0.02).

The weak noise method cannot be applied to ergodic control.

Simulation 11 Ergodic control with 5 agents. Duration of one trial is 100. Uncontrolled tran-

sitions to neighbouring locations occur with rate 10. See Figure 26 for results.

Overall costs using exact control were −1.55± 0.01.

Results for control using the variational approximation and using the approximation to the

solution of the Master equation with n = 10 were close to optimal. In all cases, average control-

dependent costs were low (0.1± 0.0001).

Average costs using control based on the approximation to the solution of the Master equation with n = 1 were much higher (4.32 ± 0.03), which is mainly due to increased control-dependent cost (5.96 ± 0.02); the state-dependent cost was comparable to that obtained using the other methods.

Average costs for the Gaussian approximation were high (3.35± 0.003).

15.3. Noise

Simulation 12 Goal directed control with 5 agents in a two-dimensional grid with 41 locations

and varying transition rates in the uncontrolled process. Duration of one trial is 1. See Figure

28.

Average control costs relative to the costs obtained using the exact method increased with

the uncontrolled process rates in the method based on the approximation of the solution to the

Master equation.


Figure 17: Simulation 1. Goal-directed control of five agents on a one-dimensional grid with uncontrolled transition rate λ = 5. Top left: One sample of the control task. Top right: Average control costs over time. Bottom left: Overall average control costs. Bottom right: Computation time. Controllers are: 1: Variational control, ε = 10^{-5}. 2: Variational control, ε = 10^{-10}. 3: Approximation of the solution to the Master equation, n = 1. 4: Approximation of the solution to the Master equation, n = 10. 5: Gaussian approximation. 6: Exact control.

Using the Gaussian approximation, average control costs were higher than with the exact

method, but, relative to the costs obtained with the exact method, constant with respect to the

uncontrolled process rates.

For all other methods, average control costs were close to those of the exact solution.


Figure 18: Simulation 2. Goal-directed control of five agents on a one-dimensional grid with uncontrolled transition rate λ = 20. Top left: One sample of the control task. Top right: Average control costs over time. Bottom left: Overall average control costs. Bottom right: Computation time. Controllers are: 1: Variational control, ε = 10^{-5}. 2: Variational control, ε = 10^{-10}. 3: Approximation of the solution to the Master equation, n = 1. 4: Approximation of the solution to the Master equation, n = 10. 5: Gaussian approximation. 6: Exact control.


Figure 19: Simulation 3. Goal-directed control of five agents on a one-dimensional grid with uncontrolled transition rate λ = 100. Top left: One sample of the control task. Top right: Average control costs over time. Bottom left: Overall average control costs. Bottom right: Computation time. Controllers are: 1: Variational control, ε = 10^{-5}. 2: Variational control, ε = 10^{-10}. 3: Approximation of the solution to the Master equation, n = 1. 4: Approximation of the solution to the Master equation, n = 10. 5: Gaussian approximation. 6: Exact control.


Figure 20: Simulation 4. Goal-directed control of 20 agents on a one-dimensional grid with uncontrolled transition rate λ = 5. Top left: One sample of the control task. Top right: Average control costs over time. Bottom left: Overall average control costs. Bottom right: Computation time. Controllers are: 1: Variational control, ε = 10^{-5}. 2: Variational control, ε = 10^{-10}. 3: Approximation of the solution to the Master equation, n = 1. 4: Approximation of the solution to the Master equation, n = 10. 5: Approximation of the solution to the Master equation, n = 100. 6: Gaussian approximation.


Figure 21: Simulation 5. Goal-directed control of 20 agents on a one-dimensional grid with uncontrolled transition rate λ = 20. Top left: One sample of the control task. Top right: Average control costs over time. Bottom left: Overall average control costs. Bottom right: Computation time. Controllers are: 1: Variational control, ε = 10^−5. 2: Variational control, ε = 10^−10. 3: Approximation of the solution to the Master equation, n = 1. 4: Approximation of the solution to the Master equation, n = 10. 5: Approximation of the solution to the Master equation, n = 100. 6: Gaussian approximation.


Figure 22: Simulation 6. Goal-directed control of 100 agents on a one-dimensional grid with uncontrolled transition rate λ = 1. Top left: One sample of the control task. Top right: Average control costs over time. Bottom left: Overall average control costs. Bottom right: Computation time. Controllers are: 1: Variational control, ε = 10^−5. 2: Variational control, ε = 10^−10. 3: Approximation of the solution to the Master equation, n = 1. 4: Approximation of the solution to the Master equation, n = 10. 5: Approximation of the solution to the Master equation, n = 100. 6: Gaussian approximation.


Figure 23: Simulation 7. Goal-directed control of 100 agents on a one-dimensional grid with uncontrolled transition rate λ = 1. Top left: One sample of the control task. Top right: Average control costs over time. Bottom left: Overall average control costs. Bottom right: Computation time. Controllers are: 1: Variational control, ε = 10^−5. 2: Variational control, ε = 10^−10. 3: Approximation of the solution to the Master equation, n = 1. 4: Approximation of the solution to the Master equation, n = 10. 5: Approximation of the solution to the Master equation, n = 100.


Figure 24: Simulation 8. Goal-directed control of five agents on a two-dimensional grid with uncontrolled transition rate λ = 5. Top left: Start and goal state of the task. Top right: Average control costs over time. Bottom left: Overall average control costs. Bottom right: Computation time. Controllers are: 1: Variational control, ε = 10^−5. 2: Variational control, ε = 10^−10. 3: Approximation of the solution to the Master equation, n = 1. 4: Approximation of the solution to the Master equation, n = 10. 5: Exact control.


Figure 25: Simulation 9. Goal-directed control of five agents on a two-dimensional grid with uncontrolled transition rate λ = 20. Top left: Start and goal state of the task. Top right: Average control costs over time. Bottom left: Overall average control costs. Bottom right: Computation time. Controllers are: 1: Variational control, ε = 10^−1. 2: Variational control, ε = 10^−5. 3: Variational control, ε = 10^−10. 4: Approximation of the solution to the Master equation, n = 1. 5: Approximation of the solution to the Master equation, n = 10. 6: Exact control.


Figure 26: Simulation 10. Ergodic control of five agents on a one-dimensional grid with uncontrolled transition rate λ = 5. Top left: One sample of the control task. Top right: Average control costs and state-dependent costs. Bottom left: Overall average control costs. Bottom right: Computation time. Controllers are: 1: Variational control, ε = 10^−1. 2: Variational control, ε = 10^−5. 3: Variational control, ε = 10^−10. 4: Approximation of the solution to the Master equation, n = 1. 5: Approximation of the solution to the Master equation, n = 10. 6: Gaussian approximation. 7: Exact control.


Figure 27: Simulation 11. Ergodic control of five agents on a one-dimensional grid with uncontrolled transition rate λ = 20. Top left: One sample of the control task. Top right: Average control costs and state-dependent costs. Bottom left: Overall average control costs. Bottom right: Computation time. Controllers are: 1: Variational control, ε = 10^−1. 2: Variational control, ε = 10^−5. 3: Variational control, ε = 10^−10. 4: Approximation of the solution to the Master equation, n = 1. 5: Approximation of the solution to the Master equation, n = 10. 6: Gaussian approximation. 7: Exact control.


Figure 28: Simulation 12. The effect of noise. Average control costs over 10 samples, depending on the transition rate of the uncontrolled system, in a goal-directed control task with 5 agents. Duration of the task was 1.


Part V.

Discussion & Conclusion

16. Discussion

The objective of this work was to develop efficient methods for multi-agent control in Markov jump processes, building on the LSMDP framework and using approximate inference techniques. Exact optimal control is feasible for single-agent systems and for multi-agent systems with state-dependent cost functions that depend linearly on the state vector of the controlled system. In general, exact optimal control for multi-agent systems is intractable and approximate methods are needed.

This thesis introduces approximate methods for two general types of tasks: goal-directed multi-agent control and ergodic multi-agent control. These methods include a weak-noise approximation, a variational approximation, a partial evaluation of the analytical solution to the forward Master equation and a Gaussian approximation.

Simulations indicate that both the variational approximation and the partial evaluation of the analytical solution to the Master equation are, with some restrictions, appropriate methods for approximate control of multi-agent systems.

The controller based on the variational approximation performed close to optimal in simulations where optimal results are available (Simulations 1, 2, 3, 8, 9, 10 and 11). Where this is not the case, its results are consistently among the best. The minimum error criterion of the method seems to have only a small effect both on the performance of the method and on computation time, with the restriction that, evidently, if the error criterion is too high, fulfillment of the goal cannot be guaranteed. Interestingly, control was good not only in the late stages of the tasks (when T − t is small), which was expected (as discussed in Section 9.3), but over the whole duration.

The method based on the partial evaluation of the solution to the Master equation is particularly successful in the case of low noise in the uncontrolled process and few agents (as in Simulations 1 and 8). If noise is high, the control costs are suboptimal in the early stages of control if the approximation is coarse (Simulations 2, 3 and 8, in particular Simulation 12). This is probably due to the fact that the approximation is based on selecting a subset of assignments from agents to goals. If noise is low, this selection is more likely to reflect the final state of the system than if noise is high. This effect relates to the phenomenon of symmetry breaking (cf. [Kappen, 2005]): in stochastic control with high noise, most control takes place in the last stages of a trial, since early control is less effective because of the noise that perturbs the state in the time remaining until the end of the task.

The weak-noise approximation failed to give satisfying solutions to control problems in all simulations. The reason for this may be that in all cases the number of agents was relatively low (see Section 9.1 for further discussion).

Performance of the Gaussian approximation of the solution to the Master equation was far from optimal in all simulations. Again, this may be due to the low number of agents used in the simulations. However, in Simulation 6, where the number of agents was comparably high (100), the method performed poorly only in the last stage of the task. This is expected, since the assumptions behind the Gaussian approximation are violated for small T − t, as discussed in Section 9.5. Thus the method may still be appropriate in some restricted cases.

In all methods, the computation time depends on the transition rate of the uncontrolled process, the number of agents and the dimensionality of the state space. The reasons are twofold: First, since sampling is based on Gillespie's method, the number of times the controlled process rate function is evaluated depends on the number of jumps, which in turn increases both with the number of agents and with the uncontrolled process rate. Second, the number of possible successors of a given state depends on the number of agents and the dimensionality of the state space. Since the controlled process rate function needs to be evaluated for all possible successor states, computation time rises with both.
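To make the first point concrete, the following is a minimal sketch of a Gillespie-style sampler for a controlled jump process. It is an illustration only, not the thesis implementation: the function controlled_rates(state, t), the state representation and the use of NumPy are assumptions introduced here. Each simulated jump requires one full evaluation of the controlled rates over all successor states.

import numpy as np

def gillespie_sample(state0, controlled_rates, T, rng=np.random):
    # controlled_rates(state, t) is assumed to return a list of
    # (successor_state, rate) pairs, i.e. the controlled rates g_t(s'|s)
    # for every possible successor s' of the current state s.
    t, state = 0.0, state0
    trajectory = [(t, state)]
    while True:
        successors = controlled_rates(state, t)        # one full evaluation per jump
        rates = np.array([rate for _, rate in successors], dtype=float)
        total_rate = rates.sum()
        if total_rate <= 0.0:                          # no transition possible
            break
        # Draw the waiting time; the rates are held fixed until the next jump,
        # which is an approximation when the controlled rates depend on time.
        t = t + rng.exponential(1.0 / total_rate)
        if t >= T:
            break
        idx = rng.choice(len(successors), p=rates / total_rate)
        state = successors[idx][0]
        trajectory.append((t, state))
    return trajectory

The number of loop iterations grows with the total jump rate, and hence with the number of agents and the uncontrolled rate, while the cost per iteration grows with the number of successor states; this matches the two effects described above.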

Note that, except for the weak-noise approximation, the number of states in the system influences computation time only through the computation of the single-agent solutions, which is done before a control trial starts. Computing single-agent solutions can become a problem if the number of states is large. In some situations it may be appropriate to use approximate methods such as the weak-noise approximation for single agents introduced in Section 7.2.1.

In the case of the variational approximation, the necessary computation increases with the number of goal locations due to the increased number of parameters in the optimization. Still, computation time was relatively low in all simulations. The dependence of computation time on the value of the error criterion ε was negligible. The reason for this may be that the weights are not recomputed each time the error threshold is exceeded; they are merely adapted. Since that adaptation starts with an error slightly above the error criterion ε, irrespective of its value, the time until the error falls below that threshold again should only weakly depend on it.

Computation time using the controller based on the partial evaluation of the solution to the Master equation depends heavily on the number of computed assignments. It can be very fast if the approximation is based on few assignments from agents to goals (using n = 1, the method was fastest in almost all simulations) and relatively slow if many assignments are computed.

17. Challenges

This thesis was the first attempt at tackling the problem of multi-agent control in Markov jump processes using methods borrowed from approximate inference. Although first results are promising, many challenges remain.

The methods developed for multi-agent control are only appropriate for goal-directed control and for ergodic control (if the cost functions are non-linear). In particular, they cannot be applied to cost functions that are time-dependent and yield non-zero costs for more than one time point. While it seems reasonable to extend the methods presented here to cost functions that are non-zero at multiple isolated time points, a transfer to tasks with continuous cost functions seems difficult. In these cases, methods based on sampling of state trajectories may be more appropriate.

In the case of ergodic control, the presented methods work only in the special cases in which state-dependent costs can be conceived as a combination of single-agent ergodic state-dependent costs, or for collision avoidance. Ergodic control should be investigated further on a more general level.

A further direction of future research concerns the application of the presented methods to real-world scenarios. A topic that has not been touched in this thesis, but which would be important for applications, is the translation of transition probabilities into action commands.

The simulations performed in this thesis were all on structured, grid-like state spaces in which transitions can only take place between adjacent locations. However, all methods introduced in this thesis (with the exception of the weak-noise approximation) can be applied to processes with arbitrary transition functions. How the methods perform in such contexts should be investigated, especially since this is a central feature that distinguishes control on Markov jump processes from control in continuous space, such as path integral control.

Throughout the thesis, I assume that the uncontrolled process rates are known. In real-world scenarios this is likely not to be the case, and efficient methods for estimating the uncontrolled process rates from data should be investigated.

18. Further Approaches

I investigated the possibility of multi-agent control using two further methods that are not discussed in this thesis. The first is based on computing the analytical solution to the forward Master equation using the Fourier transform; the second is an approximation of the solution to the Master equation using belief propagation. Both methods appear to be computationally too demanding to accomplish the task in a time appropriate for control applications. However, the investigation in these directions was only preliminary.

19. Conclusion

The purpose of this thesis was to develop efficient methods for multi-agent control on Markov jump processes, building on the framework of linearly solvable Markov decision processes and approximate inference. Five different methods were presented, discussed and tested in simulations.

Notably, a method based on an approximation using a variational lower bound to the free energy of the controlled process performed well in all tested scenarios.

The thesis opens a promising new direction for research in multi-agent control which should be further pursued.


20. Acknowledgements

I thank Prof. Manfred Opper and Dr. Andreas Ruttor for their supervision and patient support during the preparation of this thesis. I also thank Duncan Blythe for proofreading.


A. MJP Inference with Arbitrary Cost Function

With an arbitrary cost function c(s, t), the divergence KL(q || p_post) becomes

KL(q \| p_{\text{post}}) = \mathbb{E}_{x \sim q(\cdot)}\left\{ \ln \frac{q(x)}{p_{\text{post}}(x)} \right\}   (157)

= \mathbb{E}_{x \sim q(\cdot)}\left\{ \ln \frac{q(x)}{\frac{1}{Z}\, p_{\text{prior}}(x) \exp\!\left(-\int_0^T c(x(t), t)\, dt\right)} \right\}   (158)

= \ln Z + KL(q \| p_{\text{prior}}) + \mathbb{E}_{x \sim q(\cdot)}\left\{ \int_0^T c(x(t), t)\, dt \right\}   (159)

The partial derivative of this with respect to q(s, t) is

\frac{\delta}{\delta q(s, t)} KL(q \| p_{\text{post}}) = \frac{\delta}{\delta q(s, t)} KL(q \| p_{\text{prior}}) + c(s, t)   (160)

Hence the derivative of the Lagrangian becomes

\frac{\delta L}{\delta q(s, t)} = \sum_{s' \neq s} \left( g_t(s'|s) \ln \frac{g_t(s'|s)}{f(s'|s)} - g_t(s'|s) + f(s'|s) \right)   (161)
\quad + \frac{\partial}{\partial t} \lambda(s, t) + \sum_{s'} g_t(s'|s) \left( \lambda(s', t) - \lambda(s, t) \right)   (162)
\quad + c(s, t)   (163)
= 0   (164)

= \sum_{s' \neq s} \left( \frac{r(s', t)}{r(s, t)} f(s'|s) \ln \frac{r(s', t)}{r(s, t)} - \frac{r(s', t)}{r(s, t)} f(s'|s) + f(s'|s) \right)   (165)
\quad - \frac{\partial}{\partial t} \ln r(s, t)   (166)
\quad - \sum_{s'} \frac{r(s', t)}{r(s, t)} f(s'|s) \ln \frac{r(s', t)}{r(s, t)}   (167)
\quad + c(s, t)   (168)
= \sum_{s' \neq s} f(s'|s) \left( 1 - \frac{r(s', t)}{r(s, t)} \right) - \frac{\partial}{\partial t} \ln r(s, t) + c(s, t)   (169)

This yields the differential equation

\partial_t r(s, t) = \sum_{s' \neq s} f(s'|s) \left( r(s, t) - r(s', t) \right) + c(s, t)\, r(s, t)   (171)
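Equation (171) can be integrated numerically backwards in time from a terminal condition on r. The sketch below is only an illustration of that step: the rate matrix F with F[s, sp] = f(sp|s), the callable cost c(s, t) and the use of SciPy's odeint are assumptions made here (the thesis itself reports NumPy and the C++ odeint library).

import numpy as np
from scipy.integrate import odeint

def solve_backward(F, c, r_T, T, n_steps=200):
    # F[s, sp] = f(sp|s): uncontrolled rates; c(s, t): cost function;
    # r_T: terminal values r(s, T). Integrates eq. (171) backwards in time.
    S = len(r_T)

    def drdt(r, t):
        out = np.empty(S)
        for s in range(S):
            rates = np.delete(F[s], s)        # f(s'|s) for all s' != s
            diffs = r[s] - np.delete(r, s)    # r(s, t) - r(s', t)
            out[s] = np.dot(rates, diffs) + c(s, t) * r[s]
        return out

    # Substitute tau = T - t so that the integration runs forward in tau.
    taus = np.linspace(0.0, T, n_steps)
    r = odeint(lambda y, tau: -drdt(y, T - tau), r_T, taus)
    return taus, r    # r[k] approximates r(., T - taus[k])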


B. Single Agent Control

\partial_t p_i = \sum_{j \neq i} C_{ij} (p_i - p_j)   (172)
= p_i \sum_{j \neq i} C_{ij} - \sum_{j \neq i} C_{ij} p_j   (173)
= -p_i C_{ii} - \sum_{j \neq i} C_{ij} p_j   (174)
= -\sum_{j} C_{ij} p_j   (175)

\Rightarrow \quad \frac{d}{dt}\, p = -C p   (176)

This is solved by

p(t) = \exp\!\left(C (T - t)\right)^{\top} p(T).   (177)
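Since the state space of a single agent is small, (177) can be evaluated directly via a matrix exponential. A minimal sketch, assuming a rate matrix C and terminal vector p_T as NumPy arrays (SciPy is used here only for illustration):

import numpy as np
from scipy.linalg import expm

def single_agent_solution(C, p_T, t, T):
    # Evaluate p(t) = exp(C (T - t))^T p(T), eq. (177).
    return expm(C * (T - t)).T.dot(p_T)

For large single-agent state spaces the dense matrix exponential becomes expensive, which is the situation in which the approximate single-agent methods mentioned in the discussion (Section 7.2.1) may be preferable.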

C. Variational Approximation with Gaussian Marginal

We approximate the marginal distribution P(X, T) of state vectors X at time T by a Gaussian with mean µ and covariance Σ (from (41), Section 4.3). φ⊤X(T) will then be normally distributed with mean φ⊤µ and variance φ⊤Σφ. Given that assumption, exp(φ⊤X_T) is distributed according to a log-normal distribution and we have ln E[exp(φ⊤X(T))] = φ⊤µ + (1/2) φ⊤Σφ. The term to maximize is then

f = -\frac{\sigma^2}{2} \|\phi\|^2 + \phi^{\top} X_T - \phi^{\top}\mu - \frac{1}{2}\phi^{\top}\Sigma\phi.   (178)

By differentiating, one gets

\frac{\partial f}{\partial \phi_i} = -\sigma^2 \phi_i + X_{T,i} - \mu_i - \sum_j \Sigma_{ij}\,\phi_j.   (179)

Setting this to zero, the solution for φ turns out to be

\phi = (\sigma^2 I + \Sigma)^{-1} (X_T - \mu).   (180)
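The optimum (180) amounts to a single linear solve. A minimal sketch, with the variable names X_T, mu, Sigma and sigma2 assumed for illustration:

import numpy as np

def optimal_phi(X_T, mu, Sigma, sigma2):
    # Solve (sigma^2 I + Sigma) phi = X_T - mu, i.e. eq. (180).
    A = sigma2 * np.eye(len(mu)) + Sigma
    return np.linalg.solve(A, X_T - mu)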

D. EM-Formulation of Expectation Control

We introduce a likelihood function

p(X_t, Z \mid w) \propto \exp\!\left( -\left\| w \sum_{i=1}^{D} \frac{X_t^{(i)}}{r(t)^{(i)}}\, Z^{(i)} - X_T \right\|^2 \right).   (181)

The E-step consists in finding the likelihood p(Z | X_t, w), which we define as

p(Z \mid X_t, w) := \prod_{i=1}^{D} \delta\!\left( Z^{(i)} - \sum_{l=1}^{N} w_l\, r(t)_l^{(i)} \right).   (182)

The goal of the M-step is to find parameters w that maximize

Q(w, w^{\text{old}}) = \sum_{Z} p(Z \mid X_t, w^{(\text{old})}) \ln p(X_t, Z \mid w)   (183)

= -\left\| w \sum_{i=1}^{D} \frac{X_t^{(i)}}{r(t)^{(i)}} \sum_{l=1}^{N} w_l\, r(t)_l^{(i)} - X_T \right\|^2 + \text{const},   (184)

where const comprises terms that do not depend on w.

E. Implementation Details

All programs used for simulations in this thesis were implemented using Python (version 2.7.3) [Van Rossum and Drake Jr, 1995] and Numpy (version 1.6.1) [Oliphant, 2007]. Numerical integrations are performed using odeint [Ahnert and Mulansky, 2011] with machine precision.


References

[Ahnert and Mulansky, 2011] Ahnert, K. and Mulansky, M. (2011). Odeint – solving ordinary differential equations in C++. AIP Conf. Proc., 1389:1586–1589. arXiv e-print 1110.3397.

[Bandini et al., 2007] Bandini, S., Federici, M. L., and Vizzari, G. (2007). Situated cellular agents approach to crowd modeling and simulation. Cybernetics and Systems: An International Journal, 38(7):729–753.

[Barraclough et al., 2004] Barraclough, D. J., Conroy, M. L., and Lee, D. (2004). Prefrontal cortex and decision making in a mixed-strategy game. Nature Neuroscience, 7(4):404–410.

[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

[Chen and Cheng, 2010] Chen, B. and Cheng, H. H. (2010). A review of the applications of agent technology in traffic and transportation systems. IEEE Transactions on Intelligent Transportation Systems, 11(2):485–497.

[Chen and Dong, 2013] Chen, D. and Dong, S. (2013). The application of multi-agent system in robot football game.

[Guestrin et al., 2003] Guestrin, C., Koller, D., Gearhart, C., and Kanodia, N. (2003). Generalizing plans to new environments in relational MDPs. In International Joint Conference on Artificial Intelligence (IJCAI-03).

[Jahnke and Huisinga, 2007] Jahnke, T. and Huisinga, W. (2007). Solving the chemical master equation for monomolecular reaction systems analytically. Journal of Mathematical Biology, 54(1):1–26.

[Kappen, 2007] Kappen, B. (2007). An introduction to stochastic control theory, path integrals and reinforcement learning.

[Kappen, 2005] Kappen, H. J. (2005). Linear theory for control of nonlinear stochastic systems. Physical Review Letters, 95(20):200201.

[Kappen et al., 2012] Kappen, H. J., Gómez, V., and Opper, M. (2012). Optimal control as a graphical model inference problem. Machine Learning, pages 1–24.

[Kesting et al., 2008] Kesting, A., Treiber, M., and Helbing, D. (2008). Agents for traffic simulation. arXiv preprint arXiv:0805.0300.

[Kitano, 2000] Kitano, H. (2000). RoboCup Rescue: A grand challenge for multi-agent systems. In MultiAgent Systems, 2000. Proceedings. Fourth International Conference on, pages 5–12. IEEE.

[Kording, 2007] Kording, K. (2007). Decision theory: what "should" the nervous system do? Science, 318(5850):606–610.

[Marr, 1982] Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman and Company, San Francisco.

[Oliphant, 2007] Oliphant, T. E. (2007). Python for scientific computing. Computing in Science & Engineering, 9(3):10–20.

[Opper and Ruttor, 2010] Opper, M. and Ruttor, A. (2010). A note on inference for reaction kinetics with monomolecular reactions.

[Roche et al., 2010] Roche, R., Blunier, B., Miraoui, A., Hilaire, V., and Koukam, A. (2010). Multi-agent systems for grid energy management: A short review. In IECON 2010 – 36th Annual Conference of the IEEE Industrial Electronics Society, pages 3341–3346.

[Ruttor and Opper, 2010] Ruttor, A. and Opper, M. (2010). Approximate parameter inference in a stochastic reaction-diffusion model.

[Ruttor et al., 2009] Ruttor, A., Sanguinetti, G., and Opper, M. (2009). Approximate inference for stochastic reaction processes.

[Severini, 2005] Severini, T. A. (2005). Elements of Distribution Theory. Cambridge University Press.

[Sugrue et al., 2005] Sugrue, L. P., Corrado, G. S., and Newsome, W. T. (2005). Choosing the greater of two goods: neural currencies for valuation and decision making. Nature Reviews Neuroscience, 6(5):363–375.

[Sutton and Barto, 1998] Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

[Todorov, 2004] Todorov, E. (2004). Optimality principles in sensorimotor control. Nature Neuroscience, 7(9):907–915.

[Todorov, 2009] Todorov, E. (2009). Efficient computation of optimal actions. Proceedings of the National Academy of Sciences, 106(28):11478–11483.

[van den Broek et al., 2008] van den Broek, B., Wiegerinck, W., and Kappen, B. (2008). Graphical model inference in optimal control of stochastic multi-agent systems. Journal of Artificial Intelligence Research, 32(1):95–122.

[Van Rossum and Drake Jr, 1995] Van Rossum, G. and Drake Jr, F. L. (1995). Python reference manual. Centrum voor Wiskunde en Informatica.

[Wilkinson, 2011] Wilkinson, D. J. (2011). Stochastic Modelling for Systems Biology. CRC Press.
