
Simulation and Investigation of Multi-Agent Reinforcement Learning for Building Evacuation Scenarios∗

Ashley Wharton
St Catherine's College

May 18, 2009

∗Supervisor: Prof. Steve Roberts


Contents

1  Acknowledgements

2  Abstract

3  Glossary

4  Introduction

5  Background / Literature Review
   5.1  Building Evacuation Planning / Modelling
   5.2  Multi-Agent Systems
   5.3  Reinforcement Learning
   5.4  Markov Decision Processes
   5.5  Model of Optimality
   5.6  Dynamic Programming
        5.6.1  Policy Iteration
        5.6.2  Value Iteration
   5.7  Temporal-Difference Learning
   5.8  Q-Learning
   5.9  Exploitation vs. Exploration
        5.9.1  Greedy Strategies
        5.9.2  Randomised Strategies
        5.9.3  Epsilon-Greedy

6  Approach to Programming

7  Development of the Simulation
   7.1  Initial Environment
   7.2  Initial Single Agent
   7.3  Simulation Output
   7.4  Program Overview
   7.5  Analysis of First Results
   7.6  Investigating the Effects of Alpha and Lambda
   7.7  Introduction of Static Fire
   7.8  Larger State Space
   7.9  Uncertainty in Position
   7.10 Non-Stationary Environment
   7.11 Extension to Multi-Agent Simulation

8  Validation/Verification

9  Conclusions
   9.1  Further Work

10 References

11 Appendix - MATLAB Code


1 Acknowledgements

A great deal of thanks is owed to my supervisor, Prof. Steve Roberts, for suggesting the project topic to me and agreeing to supervise my work. His guidance and stimulating discussions have inspired a genuine interest in the subject, and the project as a whole has given me an invaluable insight into scientific research. This has ultimately resulted in my application to study for a DPhil with the Pattern Analysis and Machine Learning Research Group to enable me to continue working in the field.

I would also like to thank all the staff of the department for their help in my education over the last four years.

- Ashley Wharton


2 Abstract

Evacuation Modelling is an important consideration in modern building projects. The ability to generate an accurate quantitative assessment of how well a building design facilitates evacuations, as well as to easily, quickly and cheaply compare different building layouts, is very valuable. Evacuation Modelling Software is also of great use to safety professionals and evacuation management personnel, who can use it to improve and optimise the details of evacuation plans and responses, including, for example, how many Evacuation Officers a building would require and where they should be stationed to have the greatest effect in aiding the building's occupants.

This project investigates the applicability and usefulness of Multi-Agent Reinforcement Learning to Building Evacuation Simulations. Specifically, a method of Reinforcement Learning known as Temporal-Difference Learning is used to develop a basic simulation, which is extended and improved to model a large building containing a multi-agent, heterogeneous population attempting to evacuate in the presence of a non-stationary fire. The method used to develop the model demonstrates how Temporal-Difference Learning can be used to simulate uninformed human occupants of buildings during evacuations, as well as providing useful approaches to modelling uncertainty in position due to reduced visibility in a smoky environment and effective methods of communication between agents.

Consideration is also given to some of the many possibilities available to extend this work and allow the approach used to achieve its full potential in the construction of an accurate and versatile piece of Building Evacuation Simulation software.


3 Glossary

Artificial Intelligence (A.I.) - The intelligence of machines and software agents, evident when those systems perceive their environment and take actions which maximise their chances of completing their goals.

Agent - Although there is no generally accepted definition of an 'agent' in the A.I. field [1], we adopt the opinion that an agent is an entity within an environment that possesses some domain knowledge, can perform actions and has goals.

State - A state is a specific instantiation of the agent's environment.

Action - An action is a method an agent can use to affect its own state, its environment or another agent's state.

Reward - An immediate payoff resulting from taking a specific action in a specific state. The agent's aim is to maximise the sum of the rewards it receives.

Policy - A scheme for choosing which action to take in each state.

Trial - A trial is defined to be the period from the initiation of the agent in its environment to either the agent's attainment of its goal or the agent's death.

Monte Carlo Methods - A class of computational algorithms that rely on repeated random sampling to compute their results.


4 Introduction

The aim of this project is to investigate the usefulness of Reinforcement Learning as a tool to model humans in evacuation scenarios. Reinforcement Learning is a method used in the A.I. field to allow software agents to learn, in the absence of a 'teacher', from interactions with their environment. There are several ways this can be achieved, some of which fit very closely to the way animals, and indeed humans, learn.

The structure of this report is organised to reflect that of the project as a whole. We therefore begin with an explanation of the topic and a brief introduction to building evacuation modelling and the usefulness of multi-agent simulations. Time is then spent explaining the general principle of Reinforcement Learning and its key concepts, including the use of Markov Decision Process models and considerations of agent 'horizons'. We then look at a particularly useful form of Reinforcement Learning known as Temporal-Difference Learning, where the essential principles of this method are conveyed along with additional considerations of how agents can balance exploration of their environment with exploitation of the knowledge they have already gained.

As a demonstration of the usefulness of these approaches in modelling humans, a building evacuation simulation is created. This simulation is used to indicate the applicability of these learning techniques to modelling building evacuation scenarios. The simulation develops from a basic beginning, modelling a very simple building environment, to a reasonably complex level where it has the capacity to simulate a time-varying fire, uncertainty in position due to low visibility in a smoke-filled building, and communication between evacuees. A simulation of any reasonable complexity is rarely (if ever) constructed in one phase. This report takes the reader through the development of the model from the initial creation of a TD(λ) agent algorithm and a basic simulation of a simple building environment, through the introduction of extra elements of complexity, with testing and analysis of the results gained at each stage.


5 Background / Literature Review

In this section of the report the importance of building evacuation modelling software is explained, along with the common use of multi-agent environments. The general principle of Reinforcement Learning is then introduced and the key concepts considered. This leads into an explanation of a type of Reinforcement Learning known as Temporal-Difference Learning, and thought is then given to how this method could be implemented in the construction of an accurate and versatile building evacuation simulation.

5.1 Building Evacuation Planning / Modelling

Evacuation planning is an important consideration during the design stage of modern buildings. The use of evacuation models aids in the placement of structural features and exits within the building and provides a method of predicting and quantifying evacuation success under a wide range of conditions.

There exists a genuine need for flexible but accurate evacuation modelling software, and there are many alternatives to choose between. In 2003 the Fire Safety Engineering Group of the University of Greenwich conducted a survey of the area [2]. They estimated that more than 40 different evacuation models were being used by engineers, architects, designers and safety professionals for a variety of different environments, and that many more were being developed.

Of the numerous methods that currently exist, agent-based simulations have become by far the most common approach to evaluating the effectiveness of group evacuations, thanks to the advancements in computing power over the last few decades.

5.2 Multi-Agent Systems

Distributed Artificial Intelligence is a subfield of A.I. that has existed since the early 1980s. It is normally seen as being composed of two main disciplines [3]. One is known as Distributed Problem Solving, which is concerned with the information management aspects of multiple-component systems, e.g. task decomposition and solution synthesis. The other is known as Multi-Agent Systems (MAS) and deals with the behaviour management of multiple independent entities, or agents, that interact in a common environment.

Research conducted on MAS has tried to provide principles for the construction of complex systems involving multiple agents and the mechanisms required for coordinating agents' behaviour. An extensively used domain for research in the field has been the Predator/Prey or 'Pursuit' domain [4]. Within this orthogonal, toroidal world the prey (shown in white in Figure 1 below) moves randomly. The predators (shown in black) aim to trap the prey by coordinating their movements and blocking all the moves the prey could possibly make.

Although it is a 'toy domain', many of the issues that arise in MAS present themselves in the Pursuit Domain, and it has shown that a simple abstraction can be a very useful tool for studying a wide variety of approaches. It also provides an insight into the types of parameters to consider in creating a basic environment in which multiple agents can interact.


Figure 1: A Typical Instance Of The Pursuit Domain (left) And The Predators' Goal (right).

5.3 Reinforcement Learning

Reinforcement Learning is the problem faced by an agent that must learn behaviour through trial-and-error interactions with a dynamic environment. The two main methods used to deal with this problem have been either to search the behaviour space using genetic programming/genetic algorithm techniques, or to use statistical techniques and dynamic programming methods to estimate the worth of taking specific actions in specific states. The second of these methods takes advantage of the specific structure of reinforcement learning problems and can provide a relatively easy and quick way to develop solutions to quite complex problems.

At each step of interaction in the reinforcement learning model the agent receives information about the current state and must then choose one of the available actions. The action chosen then changes the state of the environment, and the value of the new environmental state is communicated back to the agent through the reinforcement signal (reward). Figure 2 shows a graphical representation of the structure of this Reinforcement Learning Model and the flow of information between agent and environment.

In general, the environment is considered non-deterministic, so taking the same action in a specific state on two different occasions may not produce the same next state or reward. Contrary to methods used in supervised learning, there is no presentation of input/output pairs for the agent to learn its best long-term strategy from, and the evaluation of the system is concurrent with learning (so-called on-line learning). The agent must try to develop an optimal policy for which actions to take in which states, so that the sum of the rewards it receives is maximised.

Agents usually need to consider future states as well as the immediate reward when deciding upon actions to take.


Figure 2: The Reinforcement Learning Model

Depending on the model of long-run optimality being used (chosen according to how long the agent expects to exist in the environment), the agent must look ahead to varying degrees and has to be able to learn from delayed reinforcement. Markov Decision Processes (MDPs) model such delayed-reinforcement problems well.

5.4 Markov Decision Processes

A Markov Decision Process is defined by the following elements:

• Discrete Time (t = nT).

• A Set of States (s ∈ S).

• A Set of Actions (a ∈ A).

• A Reward Function r(s_t, a_t, s_{t+1}) - the expected instantaneous reward received when performing action a_t in state s_t and ending up in state s_{t+1}.

• A State Transition Function T(s_t, a_t, s_{t+1}) - the probability of making a transition to state s_{t+1} from state s_t using action a_t.

• State transitions are independent of any previous environment states or agent actions.

General MDPs may have infinite state and action spaces, but it is possible to avoid this complexity, reduce the learning task, shorten computation time and lessen memory requirements by considering a discretised state space, thereby forming a finite-state, finite-action problem.

The set of actions in an MDP is non-uniform, meaning that the actions available to an agent are a function of state, that is:

a_t = \pi(s_t) \qquad (1)

where π is known as the policy function and maps the state space S into the action space A.


We wish to find an optimal policy π∗ (determining which action should be performed in each state) that will maximise the expected total reward over an agent's trial until a terminal state (an emergency exit or a fire state) is reached.
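As an illustration of how such a finite MDP can be held in memory, the sketch below stores the transition and reward functions of a tiny grid world as MATLAB arrays. All names and sizes are illustrative assumptions made for this discussion; they are not taken from the project code in the appendix.

% Illustrative only: a tiny finite MDP stored as arrays. States are the cells
% of a 3x3 grid (numbered column-wise); actions are indices 1..5.
nStates  = 9;
nActions = 5;                       % up, down, right, left, wait

% T(i,a,j): probability of moving from state i to state j under action a
T = zeros(nStates, nActions, nStates);
% R(i,a,j): expected instantaneous reward for that transition
R = -0.005 * ones(nStates, nActions, nStates);

% e.g. a deterministic 'wait' action (index 5) leaves the state unchanged
for i = 1:nStates
    T(i, 5, i) = 1;
end

% a policy (Equation 1) is then just a vector mapping each state to an action
policy = ones(nStates, 1);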

5.5 Model of Optimality

There exist a number of choices for a prospective agent's model of optimality (mentioned above); these include Finite Horizon Models, Infinite Horizon Models and Average-Reward Models, to name a few. Since the aim of this project is to model humans (beings of optimistically long life span in terms of the time period between agent/environment interactions), the Infinite Horizon Model is a sensible choice for the following analysis.

This discount model takes into account all the rewards an agent receives during a trial but geometrically reduces their significance the further from the present they were received, using the discount factor γ. This means the expected total reward over a trial can be expressed as:

E\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots\right] = E\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\right] \qquad (2)

Where the discount factor is in the range 0 ≤ γ ≤ 1 and the notation r_{t+1} is shorthand for r(s_t, a_t, s_{t+1}). This is an effective way to mathematically bound a potentially infinite sum of rewards, and γ can be seen to signify the probability of living for another step.
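As a quick illustration of Equation 2, the fragment below computes the discounted return of a short, finite reward sequence in MATLAB; the numbers are purely illustrative.

% Discounted return of a finite reward sequence (Equation 2, truncated).
gamma   = 0.9;
rewards = [-0.005 -0.005 -0.005 1];          % e.g. three steps and then the exit

discounts = gamma .^ (0:numel(rewards)-1);   % gamma^0, gamma^1, gamma^2, ...
G = sum(discounts .* rewards);               % expected total (discounted) reward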

5.6 Dynamic Programming

An agent’s goal is to perform behaviour that optimises Equation 2. This is difficult becauseeach action taken affects all rewards gained thereafter. Dynamic Programming [5] usesthe MDPs structure to perform this optimisation while avoiding the impossible task ofconsidering all possible action sequences. There are two main ways this can be achieved;they are Policy Iteration and Value Iteration.

5.6.1 Policy Iteration

A policy is a scheme for choosing which action to take in each state as shown in Equation1. In order to perform optimal behaviour and maximise Equation 2, therefore, we needto find an optimal policy π∗. Policy Iteration is a method of converging on this optimalpolicy, it consists of two stages; evaluation and improvement.

To evaluate a specific policy we need to determine the ‘value’ of states under it. Thiscan be expressed as a value function V π, which is a sum of expected future rewards. Thus,the value of an agent beginning a trial in state s0 and using policy π to reach a terminalstate would be:

V π(s0) = limN→∞

E

[N−1∑t=0

γtr(st, π(st), st+1)

](3)


Our goal is to find the optimal policy that maximises this value function:

V^{\pi^*}(s_0) = \max_{\pi} V^\pi(s_0) \qquad (4)

Where V^{\pi^*} is the optimal value function. From here on we will simplify this notation to just V^*. It is possible to separate the first and subsequent terms in the expression for the optimal value function to produce a relationship known as the Bellman Optimality Equation for discrete states and actions:

V^*(i) = \max_a \sum_j T_{ij}(a)\left[r(i, a, j) + \gamma V^*(j)\right] \qquad (5)

Essentially, this says that the true value of a state is the reward gained from moving into the next state added to the true value of that next state, which, when seen from the first state, must be discounted accordingly, hence the γ multiplier.

We can use the recursive nature of the Bellman Optimality Equation to iterate the value function:

V^\pi_{n+1}(i) = \sum_j T_{ij}(\pi(i))\left[r(i, \pi(i), j) + \gamma V^\pi_n(j)\right] \qquad (6)

It has been shown [6] that this method ensures the new value function is an improvement on the previous one, and therefore V^\pi_n converges to the optimal value function V^*.

The next step is to improve the policy π by simply updating it using:

\pi(i) = \arg\max_a \sum_j T_{ij}(a)\left[r(i, a, j) + \gamma V^\pi_n(j)\right] \qquad (7)

Once we reach a stage where there is no difference between successive policies, we have found the optimal policy.
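A minimal MATLAB sketch of this evaluation/improvement cycle is given below, assuming a small finite MDP stored as arrays T(i,a,j) and R(i,a,j) with discount gamma (as in the earlier sketch); it is illustrative only, not the project's code.

% Policy iteration (Equations 6 and 7) on a finite MDP given by T, R and gamma.
[nS, nA, ~] = size(T);
policy = ones(nS, 1);                         % arbitrary initial policy
V      = zeros(nS, 1);

stable = false;
while ~stable
    % policy evaluation: iterate Equation 6 until V stops changing
    for sweep = 1:1000
        Vold = V;
        for i = 1:nS
            a = policy(i);
            V(i) = squeeze(T(i,a,:))' * (squeeze(R(i,a,:)) + gamma * Vold);
        end
        if max(abs(V - Vold)) < 1e-6, break; end
    end
    % policy improvement: act greedily with respect to V (Equation 7)
    stable = true;
    for i = 1:nS
        Qsa = zeros(nA, 1);
        for a = 1:nA
            Qsa(a) = squeeze(T(i,a,:))' * (squeeze(R(i,a,:)) + gamma * V);
        end
        [~, best] = max(Qsa);
        if best ~= policy(i)
            policy(i) = best;
            stable    = false;
        end
    end
end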

5.6.2 Value Iteration

Value iteration works the opposite way round to policy iteration. Rather than beginning with a policy, evaluating it and then improving it, in value iteration we compute the optimal value function and then obtain the optimal policy.

Initially, an arbitrary value is assigned to all states, i.e. V_0(i) = 0 for all i, and these values are iteratively updated according to the equation:

V_{n+1}(i) = \max_a \sum_j T_{ij}(a)\left[r(i, a, j) + \gamma V_n(j)\right] \qquad (8)

Similar to the termination procedure in policy iteration, this value iteration cycle continues until the difference between successive approximations drops below a low threshold


value. When this occurs we can consider V_n to be a suitably close estimate of V^*. The optimal policy can then be easily determined using:

\pi^*(i) = \arg\max_a \sum_j T_{ij}(a)\left[r(i, a, j) + \gamma V^*(j)\right] \qquad (9)
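For comparison with the policy iteration sketch above, a minimal value-iteration loop under the same assumed T, R and gamma could look as follows (again illustrative, not the project's code).

% Value iteration (Equations 8 and 9) on a finite MDP given by T, R and gamma.
[nS, nA, ~] = size(T);
V = zeros(nS, 1);                             % arbitrary initial value function

delta = inf;
while delta > 1e-6
    Vold = V;
    for i = 1:nS
        Qsa = zeros(nA, 1);
        for a = 1:nA
            Qsa(a) = squeeze(T(i,a,:))' * (squeeze(R(i,a,:)) + gamma * Vold);
        end
        V(i) = max(Qsa);                      % Equation 8
    end
    delta = max(abs(V - Vold));
end

% the optimal policy then follows from Equation 9
policy = zeros(nS, 1);
for i = 1:nS
    Qsa = zeros(nA, 1);
    for a = 1:nA
        Qsa(a) = squeeze(T(i,a,:))' * (squeeze(R(i,a,:)) + gamma * V);
    end
    [~, policy(i)] = max(Qsa);
end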

Both of these dynamic programming algorithms have limitations. In computing the optimal value function they make use of a 'lookup' table to store the values of all states. For larger state spaces the 'curse of dimensionality' [7] means that the number of computations and the amount of memory required grow exponentially. These methods also require complete knowledge of the environment in the form of the transition probability function T_{ij}(a) and the reward function r(i, a, j); these are often difficult to attain in real-world problems.

5.7 Temporal-Difference Learning

By using a sequence of observations from a trial in the environment we can estimate the value of states rather than trying to compute them with knowledge of the transition probability and reward functions. Where previously we would have attempted to find:

V^\pi(s_t) = E\left[r(s_t, \pi(s_t), s_{t+1}) + \gamma V^\pi(s_{t+1})\right] \qquad (10)

We can now use the agent's sequence of observations to create a new estimate of the value of the current state, V_{n+1}(s_t), with the equation below:

V^\pi(s_t) \approx V_{n+1}(s_t) = r_{t+1} + \gamma V_n(s_{t+1}) \qquad (11)

Where the current estimate of the value of the next state, V_n(s_{t+1}), is used to replace the second term on the right-hand side of Equation 10, and the observations from the environment are of the form:

\left[s_0, \pi(s_0), s_1, r_1, \pi(s_1), s_2, r_2, \pi(s_2), s_3, r_3, \dots, s_{T-1}, \pi(s_{T-1}), s_T, r_T\right] \qquad (12)

Unlike other prediction-learning techniques that attempt to converge to a solution using the error between the prediction and the true value, TD(λ) learning uses the difference (error) between two successive predictions of the value of a state, known as the temporal difference:

\delta_t = V_{n+1}(s_t) - V_n(s_t) \qquad (13)
\;\;\; = \left[r_{t+1} + \gamma V_n(s_{t+1})\right] - V_n(s_t) \qquad (14)

This 'temporal error' term can then be used to adjust the current estimate with the learning rule given in Equation 15, where α determines the rate of learning.

V_n(s_t) \leftarrow V_n(s_t) + \alpha\,\delta_t \qquad (15)


In situations where a terminal state exists we could use a Monte Carlo approach to estimate the value of a state by using a complete set of observations until termination, like so:

V_{n+1}(s_t) \approx \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1} \qquad (16)

Expanding the expression above and determining the error between this updated estimate and the previous estimate yields:

V_{n+1}(s_t) - V_n(s_t) = \left(r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T\right) - V^\pi(s_t) \qquad (17)

We can rewrite this equation to contain similar terms to Equation 14, as shown below:

V_{n+1}(s_t) - V_n(s_t) = \left[r_{t+1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\right] + \gamma\left[r_{t+2} + \gamma V^\pi(s_{t+2}) - V^\pi(s_{t+1})\right] + \dots + \gamma^{T-t-2}\left[r_{T-1} + \gamma V^\pi(s_{T-1}) - V^\pi(s_{T-2})\right] + \gamma^{T-t-1}\left[r_T - V^\pi(s_{T-1})\right] \qquad (18)

Using temporal differences this can be expressed as:

V_{n+1}(s_t) - V_n(s_t) = \delta_t + \gamma\delta_{t+1} + \dots + \gamma^{T-t-1}\delta_{T-1} \qquad (19)

Which leads to another form of the temporal-difference update rule:

V_n(s_t) \leftarrow V_n(s_t) + \alpha\left[\delta_t + \gamma\delta_{t+1} + \dots + \gamma^{T-t-1}\delta_{T-1}\right] \qquad (20)

The update rule of Equation (15) only uses an estimated value of the next state, which biases the improved estimate of the value of the current state, whereas the update rule using the Monte Carlo estimate, Equation (20), is unbiased but has a large variance since it sums all the variances of the stochastic rewards. This is where the λ in TD(λ) learning comes in: it assigns an exponential discount to the future temporal differences, allowing a balance between bias and variance to be found:

V_n(s_t) \leftarrow V_n(s_t) + \alpha\left[\delta_t + (\gamma\lambda)\delta_{t+1} + \dots + (\gamma\lambda)^{T-t-1}\delta_{T-1}\right] \qquad (21)

Equations (14) and (16) can now be seen to be TD(0) and TD(1) respectively. There exists one problem with this method as it stands, though: we still require knowledge of future rewards (except in the TD(0) case). This makes online learning impossible, since these future rewards are unknown until termination of the trial.

The use of a memory variable associated with each state can, however, allow the desired online updating of the value function:

V_{n+1}(s_t) \leftarrow V_n(s_t) + \alpha\left[r_{t+1} + \gamma V_n(s_{t+1}) - V_n(s_t)\right]e_{n+1}(s_t) \qquad (22)

The memory variable, e, is known as the eligibility trace [8]. With each step in the environment the eligibility traces associated with every state are allowed to decay exponentially


by γλ, with the exception of the eligibility trace for the current state, which is incremented by 1.

e_{n+1}(s) \leftarrow \begin{cases} 1 + \gamma\lambda\, e_n(s) & \text{for } s = s_t \\ \gamma\lambda\, e_n(s) & \text{otherwise} \end{cases} \qquad (23)

The purpose of the eligibility trace is to 'remember' those states that were visited more recently than others, so that the global temporal-difference error can be used to update them proportionally. The states visited longer ago are given less credit for the temporal-difference error.

Note that in this interpretation λ assigns an exponential discount to past temporal differences instead of future temporal differences, solving the online learning problem. Sutton and Barto [9] showed that these two perspectives on TD(λ) learning (backward- and forward-facing) are essentially the same. The TD(λ) method is an extremely useful type of reinforcement learning which Sutton [10] demonstrated could make more efficient use of its experience than supervised-learning methods, converging more rapidly and making more accurate predictions along the way.
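To make the backward view concrete, the short MATLAB sketch below applies Equations 22 and 23 to a toy five-state chain; the chain, the rewards and all parameter values are illustrative assumptions, not the project's code.

% Backward-view TD(lambda) for state values (Equations 22 and 23) on a toy chain.
nStates = 5;
V = zeros(nStates, 1);              % state-value estimates
e = zeros(nStates, 1);              % eligibility traces
alpha = 0.5;  gamma = 0.9;  lambda = 0.7;

s = 1;                              % start in the first state
for step = 1:4
    sNext = s + 1;                  % walk right along the chain
    r = (sNext == nStates);         % reward of 1 only for reaching the end

    delta = r + gamma * V(sNext) - V(s);   % temporal-difference error
    e(s)  = e(s) + 1;                      % accumulate trace for the current state
    V     = V + alpha * delta * e;         % update all states in proportion to their traces
    e     = gamma * lambda * e;            % decay every trace before the next step

    s = sNext;
end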

5.8 Q-Learning

There is a type of temporal-difference learning based on value iteration called Q-Learning [11]. This method uses Action-Value functions, otherwise known as Q-functions, which are defined by:

Q(s_t, a_t) = E\left[r(s_t, a_t, s_{t+1}) + \gamma V(s_{t+1})\right] \qquad (24)

Temporal-difference learning is conducted, in a similar manner to that in Equations 22 and 23, according to:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r(s_t, a_t, s_{t+1}) + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]e(s_t, a_t) \qquad (25)

e(s_t, a_t) \leftarrow \begin{cases} 1 + \gamma\lambda\, e(s_t, a_t) & \text{for the observed } s_t, a_t \\ \gamma\lambda\, e(s_t, a_t) & \text{otherwise} \end{cases} \qquad (26)

This produces a 'Q-value' associated with taking each action in every state. The policy adopted is then simply:

\pi^*(s_t) = \arg\max_a Q^*(s_t, a) \qquad (27)

5.9 Exploitation vs. Exploration

A key consideration for an unsupervised learning agent is how and when the trade-off should be made between exploratory moves into its environment to gain knowledge and exploitative moves that take advantage of the knowledge already accumulated. This problem can be easily illustrated by a simple machine learning problem known as the k-armed bandit.


In this problem an agent is located in a room with a number (k) of gambling machines known as one-armed bandits (slot machines). The agent must maximise the payoff from a limited number of pulls of the machines. Each machine pays out a 1 or a 0 according to its underlying probability parameters, which are unknown to the agent, and each payoff is independent of all others.

The agent's strategy must depend on how many pulls of the machines it is allowed. The more pulls allowed, the worse the consequences of prematurely converging on a sub-optimal machine, and the more the agent should explore. There have been many strategies proposed for the k-armed bandit problem, the two extremes of which are explained below, followed by a commonly used compromise.

5.9.1 Greedy Strategies

These strategies are based on always choosing the action with the highest estimated payoff. Their failing can be that described above: the agent converges too quickly on a sub-optimal machine that it believes is optimal due to unlucky early sampling. The agent must explore for longer to alleviate this problem.

5.9.2 Randomised Strategies

At the other end of the spectrum of action-selection schemes are completely randomised strategies. An agent using such a strategy would act exactly as if it were completely uninformed, no matter how much learning had taken place, since it will always choose a random action out of the possible choices available in its current state.

5.9.3 Epsilon-Greedy

Between these two 'pure' schemes lies a set of exploration strategies that have some method of choosing between taking the action with the highest estimate of expected reward and actions chosen at random. The Epsilon-Greedy scheme is one very common decision mechanism of this type that can be implemented with relative computational ease. An agent following this method will, with probability 1 − ε, choose the action it currently expects would lead to the highest sum of rewards:

\pi(s_t) = \arg\max_a Q(s_t, a) \qquad (28)

The rest of the time (with probability ε) it adopts a uniform random strategy. The key to the usefulness of the Epsilon-Greedy scheme is that the value of ε changes over the course of an agent's trial. Initially this parameter is set high (typically ε_0 = 0.9), encouraging the selection of a random action to facilitate initial exploration, but after each action it is reduced via a relation such as:

\epsilon_{i+1} = 0.9999\,\epsilon_i \qquad (29)


This slow decrease of ε over time allows the agent to increasingly utilise the knowledge gained from the earlier period of exploration. In the limit t → ∞ the agent will stop exploring completely and only use the knowledge it has gained. Below is a segment of pseudo-code for an Epsilon-Greedy algorithm.

if random number (in range 0 - 1) < epsilon
    Choose an action randomly from the valid options
else
    Choose the action with the highest estimate of expected reward
end

The observation made above, that an agent will cease exploring its environment after it has become very familiar with it, seems reasonable. This behaviour is only appropriate, however, when the environment is static. If the environment is non-stationary (i.e. it continues to change over time) then the agent must continue some amount of exploration so that it will notice when the environment has been altered. This can be achieved most simply by adjusting the Epsilon-Greedy limit to some non-zero constant value, ε_∞ → k.
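As a concrete illustration, a minimal MATLAB implementation of such an epsilon-greedy selector could look like the sketch below. The function name matches the select_move helper referred to in Section 7.2, but this body is an illustrative assumption rather than the project's actual code.

function action = select_move(qRow, epsilon)
% Epsilon-greedy action selection: qRow holds the current Q-value estimates of
% the actions available in the current state (illustrative sketch only).
    if rand < epsilon
        action = randi(numel(qRow));      % explore: pick a uniformly random action
    else
        [~, action] = max(qRow);          % exploit: pick the highest-valued action
    end
end

In the main simulation loop, epsilon would then be decayed after each action as in Equation 29, optionally with a non-zero floor (e.g. epsilon = max(0.9999*epsilon, epsilonMin)) when the environment is non-stationary.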


6 Approach to Programming

MATLAB has been used as the development environment in which the programming work of this report has been conducted. This language was chosen for its relative simplicity and its suitability for rapid prototyping, allowing quick build and test times by making use of its large library of in-built functions.

If the program that has been created were to be used in a professional package outside of an academic context, the choice of a high-level, interpreted language such as MATLAB would result in a rather poor implementation. To achieve much faster execution times a low-level language (such as C/C++ or FORTRAN) would be ideal, and further development time would allow optimisation for speed and memory use as well as the creation of a suitable user interface.

Although this project has ultimately produced a single program, a substantial amount of time and effort has also been spent on programming minutiae and on gaining a full understanding of the underlying theory and related work in the field. The approach of this report will try to reflect this, where possible, with important or interesting points mentioned even if they did not make the final program.

Complete, commented source code is available in the appendix, with the full understanding that programming is indeed no trivial task. Inevitably, for the sake of brevity, much content, including all the previous versions of the program code, has been omitted. This has been a fairly large project, and unfortunately fairly large bodies of work have had to be summarised in a fashion that belies the true time involved. It is also worth noting that this project is original work.


7 Development of the Simulation

The domain required for a building fire-evacuation simulation needs to be an abstraction of the real-world scenario, making savings in state-action space and calculation time where possible. Initially we begin with a very basic state and action space, explained below, containing a single agent, in order to develop the essential aspects of what will later become our multi-agent building simulation.

Throughout the development stage of this multi-agent simulation we will have to make certain assumptions. With respect to the building environment, we assume that we can use an infinite-horizon, discrete-time Markov Decision Process to model reality in an evacuation scenario. We also state that the purpose of our model is to simulate the evacuation behaviours of human occupants; we assume our subjects are rational and that their goals are to stay alive and to minimise the time they take to exit the building.

7.1 Initial Environment

Here we set out the structure of the building environment and the domain rules an agent must adhere to. Some of these points follow on from the explanation of reinforcement learning techniques given in the previous section. Others have been chosen to simplify the learning task whilst still producing a relatively accurate building model, and are explained where necessary.

• The building space is initiated as a relatively small area divided into a discrete grid-like world of square spaces that represent possible agent states: S = (i, j), where i = {1, 2, ..., 10} and j = {1, 2, ..., 10}.

• This grid-like world requires boundaries (i.e. it is not toroidal like the Pursuit Domain) to represent the building's external walls.

Figure 3: Initial Building Layout


• Agents may stay still or move horizontally or vertically but not diagonally.

A = {Up,Down,Right, Left,Wait}

It is necessary to prohibit diagonal movement in an orthogonal environment using discrete time if it is desirable that an agent's speed is constant. If a diagonal move were allowed in the same time period as orthogonal moves then the agent would have travelled √2 times as fast in the diagonal direction as it would in an orthogonal one.

Figure 4: The 5 Actions Possible (including Wait)

Limiting the number of possible actions available in any state to 5, instead of the 9 if diagonal moves were permitted, also simplifies the learning task and reduces the memory requirements, since there are fewer state-action pairs to value.

• A non-uniform set of outcomes from possible actions in each state is required to represent the building's walls. In an interior state (i.e. not adjacent to a wall) all 5 possible actions are available to the agent, whereas when an agent is in one of the outer states (next to a wall), an action that attempts to move the agent through the adjacent wall results in the agent remaining in its current state.

Figure 5: Actions Possible for Agent Adjacent to Walls


• The goal state, i.e. the emergency exit, is modelled by a positive reward of +1.

• For each move taken in the building a reward of -0.005 is given. This figure has been chosen as an approximation to the relative 'reward' a human would receive from the negative health impact of smoke inhalation and the increasing risk of not being able to reach the exit. (A sketch of these reward and wall rules as simple MATLAB functions is given after this list.)

• Since the state and action spaces are relatively small, it is possible to use a table-based (or explicit) representation of the model. This method stores information on the value of taking specific actions in specific states in a State-Action Matrix (also known as a Q-Matrix), as explained in the previous section. The structure of such a matrix for our initial environment is shown below.

Figure 6: State-Action Matrix Structure

With such a straightforward way of storing an agent's knowledge of its environment in the expected results of the actions available, it is already possible to begin to see how multiple agents could share information, by simply exchanging values stored in their Q-Matrices, but we shall come to this in the later sections of this report.

Storing the information an agent has learnt in a Q-Matrix also allows easy saving and loading of knowledge within the program. This means an agent can use stored knowledge by loading a previously accumulated array at the beginning of a trial, thus allowing the program to use a common benchmark across multiple simulation runs.
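The sketch below illustrates the reward and wall rules listed above as two small MATLAB functions. The function names match the reward_info and state_transition helpers described in Section 7.2, but these bodies are illustrative assumptions (including the exit location and the action numbering); the project's actual implementations are given in the appendix.

function reward = reward_info(newState)
% Reward for entering newState = [row col] on the 10x10 grid (illustrative).
    exitState = [1 10];                 % emergency exit, as used in Section 7.5
    if isequal(newState, exitState)
        reward = 1;                     % reaching the emergency exit
    else
        reward = -0.005;                % per-step penalty (smoke inhalation)
    end
end

function newState = state_transition(state, action)
% Deterministic move on the 10x10 grid; actions 1..5 = up, down, right, left, wait.
    moves    = [-1 0; 1 0; 0 1; 0 -1; 0 0];
    newState = state + moves(action, :);
    % an action into an external wall leaves the agent in its current state
    if any(newState < 1) || any(newState > 10)
        newState = state;
    end
end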


7.2 Initial Single Agent

The agent script designed to run a backward-facing form of TD(λ) learning, known as Q-Learning, in this environment follows the simple algorithm shown in pseudo-code below:

% new trial started: get initial observation and choose action
[action] = select_move(Q(initial_state,:), epsilon);

while current_trial < max_trial

    % if this is the last move the agent will make
    if new_state == goal_state

        % get current observation
        [reward] = reward_info(new_state);

        % update value matrix Q and eligibility trace matrix e
        delta = reward - Q(current_state, current_action);
        e(current_state, current_action) = e(current_state, current_action) + 1;
        Q = Q + alpha * delta * e;      % update every state-action pair in proportion to its trace
        e = gamma * lambda * e;         % decay all eligibility traces

        % update trial information
        steps(current_trial) = steps(current_trial) + step;
        total_reward(current_trial) = total_reward(current_trial) + reward;

        % increment trial
        trial = trial + 1;

    % the agent will make another move
    else

        % get current observation
        [reward] = reward_info(new_state);

        % choose action
        [action] = select_move(Q(current_state,:), epsilon);

        % update value matrix Q and eligibility trace matrix e
        delta = reward + gamma * Q(new_state, new_action) - Q(current_state, current_action);
        e(current_state, current_action) = e(current_state, current_action) + 1;
        Q = Q + alpha * delta * e;      % update every state-action pair in proportion to its trace
        e = gamma * lambda * e;         % decay all eligibility traces

        % update trial information
        steps(current_trial) = steps(current_trial) + step;
        total_reward(current_trial) = total_reward(current_trial) + reward;

        % move to new state
        [new_state] = state_transition(current_state, current_action);

    end

    epsilon = epsilon * 0.9999;

end

% plot agent's final path, and the total reward received and number of steps
% taken vs. trial number
plot_results();

This code calls a number of functions. Below is a brief explanation of their purposes.

• select_move is an epsilon-greedy function designed to pick which action to take in the current state.

• reward_info is a function that returns the reward the agent is to receive for entering the new state.

• state_transition is a function that uses the current state and the current action choice to determine the new state.

The algorithm above runs the 'else' branch until the agent enters the goal state and the trial terminates. It stores information on each trial in variables for the total reward received and the number of steps taken, and once the full series of trials has been completed it plots the results.


7.3 Simulation Output

Two graphical outputs from the simulation have been created. The first is a MATLAB figure with subplots showing log steps vs. trial and total reward vs. trial. Log plots are used to show sufficient detail in the presence of trials with an uncommonly high number of steps. These plots allow the user to visually estimate the agent's learning from the shape of the graphs.

The second output from the program is a diagram showing the agent's path through the environment during the last trial. This is possible because the program stores the sequence of states the agent passes through. Once the final trial terminates, the program plots arrows linking these states in a graphical representation of the environment. This state transition information is stored for all trials, however, so a plot of the agent's path during any one of them can be produced as an output.

7.4 Program Overview

The initial structure of the single agent simulation program is shown below:

∗The arrow.m function used in plotting the agent’s path is not the author’s work [12].


7.5 Analysis of First Results

When this initial program is run with the agent starting in state (10,1), i.e. the bottom-left corner, and the emergency exit placed in state (1,10), i.e. the top-right corner, the outputs look like so:

Figure 7: Outputs From The Single Agent Program

The total reward graph converges to just less than unity, which one would expect given that the shortest path to the goal incurs a negative reward of 0.09 whilst reaching the goal provides a positive reward of 1. The number of steps taken by the agent converges on the minimum number of 18, and after 1500 trials the agent has completely settled on an optimum route through the environment.

Care should be taken when interpreting the agent's path through the building. At first glance the agent appears to be taking an almost diagonal path from the initial state to the exit state, as if this were the shortest distance the agent could travel to fulfil its goal and therefore the route it would take. One must remember, however, that the simulation domain is orthogonal (no diagonal moves), so the two paths shown in Figure 8 are actually of the same length.

Figure 8: Two Equidistant Agent Paths


When the agent is learning to traverse the building, the values in its Q-Matrix should converge so that the actions with the maximum values are those shown in Figure 9. Each of these actions should also have an equal value associated with it.

Figure 9: The Optimum Policy

This is true because choosing any 18 possible moves from among these ideal ones will result in the agent reaching the emergency exit and gaining the maximum possible reward. The agent can only pick one move at a time, though, and when given equally valued choices of action it will pick randomly. This means that paths passing through the centre of the board are more likely than those around the edges, since to achieve a path skirting the outside wall the agent would have had to randomly pick 9 'Up' moves in a row followed by 9 'Right' moves in a row (or vice versa).

7.6 Investigating the Effects of Alpha and Lambda

It is important to understand the effect different choices of alpha and lambda can have on the rate of learning. As a demonstration of the effect of these variables, a simple experiment has been devised where the program is run multiple times, each time with different values for the constants α and λ. In this experiment the agent is again tasked with moving from state (10,1) to state (1,10). A measure of how successful the agent is in doing this, and of how fast learning occurs, is given by the root mean square error, over 1000 trials, between the number of steps the agent takes to achieve its goal and the optimum number of 18 steps:

\mathrm{RMSE}_{\alpha,\lambda} = \sqrt{\frac{\sum_{\tau=1}^{1000} (N_\tau - 18)^2}{1000}} \qquad (30)

Where N_τ is the number of steps taken in trial τ. Each run is repeated 30 times and an average taken to attempt to smooth out the random element the epsilon-greedy decision mechanism introduces and to allow us to see the underlying effects that alpha and lambda have. The results from this experiment are shown in Figure 10.
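For reference, Equation 30 amounts to the one-line MATLAB calculation sketched below; the 'steps' vector is filled with placeholder values purely so the fragment runs, and the names are illustrative.

% Root mean square error of trial lengths against the 18-step optimum (Equation 30).
steps   = 18 * ones(1000, 1);     % placeholder: number of steps taken in each trial
optimum = 18;
rmse    = sqrt(sum((steps - optimum).^2) / 1000);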

This simple empirical assessment shows that, with the online learning method we are using, we get the best performance with intermediate values of α and λ for this environment.


Figure 10: The Effect of Alpha and Lambda on RMSE

When α = 0 no learning takes place and the effect is a random walk between the agent's starting point and the goal state. An example of the output over 1000 trials of such a simulation run is shown below in Figure 11.

Figure 11: Lack of Learning When Alpha = 0

The average RMS error for this value of alpha (across the whole range of λ, since this now has no effect; see Equations 25 and 26) was 6095. Most pairings of α and λ show a vast improvement on this, but as alpha increases the RMS values tend to infinity and the simulation does not finish. At the higher values of lambda this asymptotic behaviour occurs at lower values of alpha, as can be seen in the λ = 0.8 and λ = 0.9 lines. The only data point that could be attained for λ = 1 (apart from the α = 0 case) was when α = 0.1; for this set of values the RMS number of steps was 58, already the largest value across the range recorded. At the other extreme, when λ = 0, the algorithm does not propagate any proportion of the temporal errors backward to previous states and learning is relatively slow. The results gained through this empirical study illustrate how the generalisation of


Temporal-Difference and Monte Carlo methods in the TD(λ) algorithm can perform better than either of the two 'pure' methods.

Figure 12: The Variation in RMSE due to Lambda when Alpha = 0.5

The lowest value of the RMS error occurs when α = 0.5. Figure 12 shows the results gained with this value of alpha, with a clear minimum when λ = 0.7.

7.7 Introduction of Static Fire

Now that the initial environment has been created and the agent's learning within it analysed, we can introduce some static fires and see their effect. Areas where fires are situated are modelled by a negative reward of -1 in the corresponding states, with the surrounding states given a negative reward of -0.05 to simulate the heat radiating from the fire and the associated increased risk.

Figure 13: The Rewards Associated with a Fire in the Centre State
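As an illustration of this reward structure, the fragment below writes a single static fire into a reward map for the 10x10 grid; the layout, the variable names and the choice of state (5,6) are illustrative, not taken from the project code.

% Reward map for the 10x10 grid with one static fire (cf. Figure 13).
rewardMap = -0.005 * ones(10, 10);    % per-step penalty everywhere
rewardMap(1, 10) = 1;                 % emergency exit

fire = [5 6];                         % e.g. the fire position used in Figure 15
for di = -1:1
    for dj = -1:1
        i = fire(1) + di;  j = fire(2) + dj;
        if i >= 1 && i <= 10 && j >= 1 && j <= 10
            rewardMap(i, j) = -0.05;  % heat radiating into the surrounding states
        end
    end
end
rewardMap(fire(1), fire(2)) = -1;     % the fire state itself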

The agent must learn to avoid these negative rewards on its way to the goal state. When a fire is introduced in state (1,1), no real effect on the agent's learning is noticeable.


Figure 14: Program Output with Fire in State (1,1)

When the fire is moved to state (5,6), however, it is in a position the agent is more likely to try to pass on its path to the exit, and the agent must therefore learn to avoid the fire and its surrounding states. This is demonstrated by Figure 15, which shows a slight but noticeable difference in the learning and reward graphs.

Figure 15: Program Output with Fire in State (5,6)

If we really challenge the agent by constructing an environment like that on the right-hand side of Figure 16, setting up a 'wall' of fire states between the agent and the emergency exit, the learning task becomes much more difficult.

There are essentially two 'forces' acting to shape this learning. One is introduced by the negative reinforcement received for each step taken, which encourages the agent to reach the exit as fast as possible. The other is the negative reinforcement the agent associates with actions in states near the fires which attempt to move the agent toward them; this makes the agent act as if the fires have a sort of repulsive effect. The effect of these two


Figure 16: Program Output with Multiple Fire States

drives can be seen in the envelope of the graphs. Initially the agent takes a number of essentially random paths to the exit, passing through multiple fire states, hence the larger negative total reward. After this initial bout of exploration the agent has learnt to avoid most of the fire states, evident in the increased step numbers as it attempts to search for alternative routes, and in a decrease in the negative rewards received. As exploration of the environment nears completion, the number of steps begins to fall and the rewards received continue to rise as the agent converges on optimum behaviour. Only when an agent has sufficiently explored the environment will it learn that the optimum route to the exit is to 'bite the bullet' and minimise the negative rewards it encounters by passing through some of the states neighbouring the fires.

So far the simulation has been designed to terminate when the agent reaches the goal state containing the emergency exit; however, this is not the only way one's evacuation from a building could end. If a human were to walk directly into a fire, the chances are they would not survive. Since we are endeavouring to model humans, our simulation will be more accurate if the same applies to the agent. When this alteration is made, so that the simulation terminates upon the agent reaching the goal state or entering a fire state, the resulting step and reward graphs for an environment identical to that on the right of Figure 16 are shown in Figure 17.

Comparing these graphs to those in Figure 16 we can see several differences. In the first few trials, the number of steps taken is lower; with more 'simulation-terminating states' in the environment, this is due to the increased probability of a trial ending during exploration by the agent. With no chance of encountering multiple fire states in a single trial, the resulting total reward received from these trials is higher than before. During the 'mid game', when the agent has identified the locations of the fires, the number of steps grows as the agent avoids the fires and explores for longer in an attempt to find alternative routes to the exit; the total rewards decrease accordingly. Again, as the agent 'accepts' that it must pass through some negatively rewarding states in order to reach the exit, the number


Figure 17: Program Output with ‘Fire Death’

of steps comes down and the rewards increase. The 'blips' in the reward graph that occur after the 250th trial are due to the epsilon-greedy decision mechanism choosing random actions that nudge the agent off its optimum course and into fires in a last few attempts at finding a better route. The intensity of these blips appears worse now than before, but a quick check of the graph scales shows they are the difference between optimum routes with rewards close to +1 and routes ending in a fire state with a reward close to -1, as before. These epsilon-greedy nudges reduce in frequency as the number of trials grows (and the variable epsilon continues to decrease) until exploration ceases completely and the agent always takes an optimum route.

Looking at the envelope of the steps graph shows learning occurring in a similar manner to before, but with convergence to an optimum solution taking approximately 250 trials instead of the 150 required when fire states did not result in 'death'. Essentially, the agent's learning is proportional to how long it spends in the environment exploring states. Stopping the simulation when the agent enters a fire state cuts short a trial that would have continued to provide the agent with information on its environment, and simply means another trial is necessary later to gain that same experience. This is evident when the sum of the number of steps is taken over 1000 trials: by then both the original and the 'Fire Death' simulations have produced optimum behaviour, so the comparison is essentially of how long the agent needs to spend in the environment under each method. Averaging the number of steps taken over 10 simulation runs helps to remove the effect of the random element on the results. This yields an average value of 45,147 steps where only entering the exit state ended the simulation, and a value of 42,526 steps where fire states also result in death. The simulation's runtime is directly proportional to the number of steps the agent takes. Although the results above show there is not much difference in this respect between the two ways of terminating simulations, the runtime is slightly less when death results from entering fire states. This also brings the model closer to reality, since the simulation will not allow agents to pass through fire and carry on unharmed.


7.8 Larger State Space

The model of the building the agent has been evacuating from is, so far, a rather basic one. It is relatively small and has no features (internal walls, rooms etc.) that real evacuation modelling software would need to be able to simulate. To rectify this, the environment has been enlarged and redesigned; it is shown below in Figure 18.

Figure 18: Layout of the New Environment

This layout was created to have many of the features typical large buildings possess: a large hall, long corridors, smaller rooms off larger rooms, larger rooms off corridors, etc. In this environment the emergency exit is located in state (10,30). We continue to make use of the table-based representation of the State-Action Space since, even for a large building of this size, there are still only 3000 elements to calculate accurate values for, as shown in Figure 19.

Figure 19: Expanded State-Action Matrix


Using values of λ = 0.7 and α = 0.5 for the agent's algorithm, its learning in this environment is very quick, as shown by Figure 20.

Figure 20: Agent's Optimal Learning in Large Building

We can gain an insight into the agent's learning, policy and corresponding behaviour over the course of the trials by using some additional simulation outputs. First we introduce the optimum policy for this building when it contains no fire states.

Figure 21: The Optimum Policy With No Fire States

By storing the Q-Matrix at the end of each trial we can plot, for each state, the action that the agent 'believes' will lead to the most rewarding path. This can be done simply by finding the position of the maximum element value in each vertical column of the Q-Matrix. The policies over the range of trials can then be compared to each other and to the optimum.
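In MATLAB this policy-extraction step can be written as the short sketch below, assuming the Q-Matrix is stored with one column per state and one row per action; that orientation, and the placeholder sizes, are assumptions made purely for illustration.

% Greedy policy extraction: the best action for each state is the row index
% of the largest value in that state's column of the Q-Matrix.
Q = rand(5, 600);                     % placeholder Q-Matrix: 5 actions x 600 states
[~, greedyAction] = max(Q, [], 1);    % column-wise maximum over the actions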

In the first few trials the agent explores the environment almost entirely randomly. The pictures in Figure 22 show the agent's path (left) and the policy (right) at the end of trial 2. The pictorial representation of the agent's policy shows the actions for each state it


believes will lead to the most rewarding path. Where these are correct (determined by comparison to the optimum policy) the actions are shown in green; where they differ from the optimum they are shown in red. Red dots indicate that the agent considers the action 'Wait' to be the best option. There are no green dots, since the optimum policy will never waste a move staying still and incurring the negative step reward of -0.005 (representing smoke inhalation) when a different action could take it closer to the exit at no extra cost. States with multiple arrows indicate that the agent considers all of these actions equally valuable; these states tend to be in areas that the agent is yet to explore.

Figure 22: Agent's Path and Policy After Trial 2

By trial 10 the temporal difference due to discovering the positive reward of the goal state has already propagated back to the last few states, as can be seen in the latter section of the path taken by the agent in Figure 23. The first few steps are as yet unaffected by this, and so the agent begins the trial with essentially random exploration until it gets close enough to the goal.

Figure 23: Agent's Path and Policy After Trial 10

Figure 24 shows this trend continuing so that, by trial 20, only the first few states are yet to feel the effect of the back-propagation of temporal differences.

The process continues and, by trial 30, the agent has converged on an almost optimum policy, leading to a fast route to the goal state with no random exploration along the route, as shown in Figure 25.


Figure 24: Agent’s Path and Policy After Trial 20

Figure 25: Agent’s Path and Policy After Trial 30

There are areas the agent has yet to explore, however, and these could (as far as the agent knows) yield a more rewarding route. The epsilon-greedy decision mechanism continues to promote exploration at this stage, as can be seen in the results from trial 29 (shown in Figure 26). Both trials 29 and 30 share the same policy, but an epsilon-greedy random action generated in state (1,1) and around state (10,8) of trial 29 has led to a very different path, allowing further exploration of an area off the optimum path.

Figure 26: Agent’s Path and Policy After Trial 29


This epsilon-greedy effect can be seen again in Figure 27 where an epsilon-greedy ran-dom action in trial 320 sends the agent into the last unexplored section of the building.These exploratory paths driven by epsilon-greedy explain the two spikes in the learninggraph of Figure 20.

Figure 27: Agent’s Path and Policy After Trial 320

By the end of the simulation the random epsilon-greedy actions fail to cause the agent to deviate far from a completely optimum path. The exploratory routes these deviations produced earlier in the simulation have resulted in supporting policy actions in states buffering the optimum route. These act to move the agent back onto the optimum path. Clear examples of these supporting actions can be seen in the central hall and the doorways to the two divided offices on the right of the building in Figure 28. Most of the sub-optimal (red) actions are found beyond these supporting buffers, in areas the agent has explored enough to realise there is no advantage to be found in going there but not enough to have converged to a fully optimal policy.

Figure 28: Agent’s Path and Policy After Trial 1000

Memory problems were encountered during this version of the simulation program. These were not due to the single matrices actually used in the agent's learning (Q and e), since they only contain 3000 elements each, but resulted from storing each trial's Q-Matrix along with information for plotting each trial's path. In a simulation of 1000 trials, storing each trial's Q-Matrix in a single variable would result in a 4-dimensional matrix of 3,000,000 elements. To store the path information, four values are required for the (x, y) coordinates of the start and end points of each step, and with the maximum number of steps at roughly 6000 we would require a 3D matrix of 24,000,000 elements for a 1000-trial simulation. This working-memory problem was solved by saving the path and Q-Matrix information for each trial to the hard drive, clearing the variables and reinitialising them with zeros.
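
A minimal sketch of this disk-based workaround is shown below; the variable names (Qtrial, pathXY) and the placeholder contents are illustrative rather than taken from the actual program.

    % Sketch: write each trial's results to disk instead of accumulating them in memory.
    numTrials = 1000; numStates = 600; maxSteps = 6000;
    for trial = 1:numTrials
        Qtrial = zeros(5, numStates);   % placeholder for the end-of-trial Q-Matrix
        pathXY = zeros(4, maxSteps);    % placeholder for (x1, y1, x2, y2) of every step
        % ... the learning loop would fill Qtrial and pathXY here ...
        save(sprintf('trial_%04d.mat', trial), 'Qtrial', 'pathXY');  % store on the hard drive
        clear Qtrial pathXY             % release working memory before the next trial
    end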

7.9 Uncertainty in Position

In the model created so far, actions deterministically cause the corresponding state transitions. This is reasonable when the environment is clear, since a human taking a particular action would expect to arrive in the correct new state, but in a smoky environment the uncertainty in position can be modelled by a probability distribution centred on the expected state, as shown below. In the left image a smoke-free environment results in a probability of 1 centred on the expected state. The middle image shows this probability spreading out over the neighbouring states as the environment becomes smoky, and in the right image this continues as the density of the smoke increases.

Figure 29: Example Probability ‘Masks’ Used In State Uncertainty Model

In areas containing walls these 'masks' can be adapted so that the probability of position spreads only into states that the agent could actually move to. The masks for a fairly smoky environment in the presence of walls could, for example, look like those below:

Figure 30: Example Probability ‘Masks’ In The Presence Of Walls

Since walls would stop smoke spreading out as easily as in free space, this method is fairly accurate in modelling the distribution of smoke in an environment and its effect on assumed position.
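
The sketch below builds one such 3-by-3 mask; the numerical weights and the 'blocked' wall information are illustrative assumptions, not the values used in the project.

    % Sketch: build a 3x3 position-uncertainty mask for a given smoke level,
    % assigning no probability to neighbouring states that lie behind walls.
    smoke   = 0.4;                        % 0 = clear air, larger = smokier (assumed scale)
    blocked = logical([0 0 1 0]);         % hypothetical walls: [up down left right]
    mask = zeros(3, 3);
    mask(2, 2) = 1 - smoke;               % probability of being in the expected state
    neighbourIdx = [1 2; 3 2; 2 1; 2 3];  % mask positions of up, down, left, right
    openIdx = neighbourIdx(~blocked, :);  % only neighbours the agent could actually reach
    for k = 1:size(openIdx, 1)
        mask(openIdx(k, 1), openIdx(k, 2)) = smoke / size(openIdx, 1);
    end
    % mask now sums to 1, spreading probability only into reachable states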

When an agent is uncertain about its position it does not make sense to associate the temporal error only with the states the agent believes it has travelled through, since there is now a possibility that the path thought to have been taken is wrong. A realistic agent would spread these temporal errors out based on how smoky the environment was. By remembering that an agent selects which state-action pairs the temporal error alters by recording recently visited states in an eligibility matrix, we begin to see how this can be achieved.

The agent currently increments the eligibility of only the assumed current state (by 1), so that as it passes through the environment it makes changes to the eligibility matrix like those shown below:

Figure 31: Example Of Current Method Of Eligibility Update

The values in the eligibility matrix decay exponentially over time so that the most recent (and therefore presumably the most important) states and actions are affected to a greater extent than those from further in the past.

If the agent now centres the same masks that represent the smokiness of the environment on the state it believes itself to be in, and updates the eligibility matrix according to these values, then a more realistic set of state-action pairs is adjusted.

In order to implement the smoke model the simulation code needs to be altered so that there are two versions of the current position. One of these is the true current position, which is used to find the actual reward received. The other is the agent's estimate of its current position, which is used in the eligibility mask update mentioned above.
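
A sketch of this dual bookkeeping is given below; the grid size, mask values, positions and the rows-by-columns-by-actions layout of the eligibility matrix are all illustrative assumptions.

    % Sketch: reward comes from the true state, eligibility is spread around the belief.
    R = -0.005 * ones(20, 30);               % per-state step rewards (placeholder map)
    e = zeros(20, 30, 5);                    % eligibility traces: row x column x action
    mask = [0 0.1 0; 0.1 0.6 0.1; 0 0.1 0];  % smoke mask (illustrative values)
    truePos   = [10 5];                      % state the agent actually occupies
    beliefPos = [10 6];                      % state the agent believes it occupies
    action    = 2;                           % action just taken (index into the 5 actions)

    reward = R(truePos(1), truePos(2));      % actual reward received
    rows = beliefPos(1) + (-1:1);            % 3x3 window centred on the believed state
    cols = beliefPos(2) + (-1:1);
    e(rows, cols, action) = e(rows, cols, action) + mask;  % spread the eligibility update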

The figure below shows the results of a simulation with a reasonably smoky environment.

Figure 32: Agent’s Learning In Smoky Environment

This simulation was run in the same environment as that used to gain the results in Figure 20 (apart from the additional smoke model), so a direct comparison can be made between the two images. We can see that the agent's realistic spreading of temporal differences means learning still occurs at a similar rate, although the first few trials now require a higher number of steps. After about 30 trials the number of steps taken by the agent settles down a little, although there is more variation than in the smoke-free case.

The figure below shows the path the agent believes it took during the first trial (left) and the path the agent actually took in this trial (right). The code used to plot the agent's path in these pictures has been altered so that the colour varies over the course of the trial from blue through turquoise, green, yellow and orange until, at the trial's end, it is shown in red. This helps users visualise the states visited by the agent over time.

Figure 33: Agent’s View Of Path Taken (Left) and Actual Path Taken (Right) In Trial 1

The corresponding plots for the last trial are shown in Figure 34:

Figure 34: Agent's View Of Path Taken (Left) and Actual Path Taken (Right) In Trial 1000

7.10 Non-Stationary Environment

So far the environment the agent has had to contend with has been static. States and their rewards have remained unchanged with the passage of time as the agent has moved around the building. This is obviously not what occurs during real building evacuations, and a useful simulation should include a suitable model for fires that can evolve and change during the course of the simulation in a way that approximates the growth, movement and death of true fires.

Inspiration to solve the problem of creating a suitably accurate fire model has come from the cellular automaton known as The Game of Life, which was created by the Cambridge mathematician John Horton Conway in 1970. The Game of Life is played out on a two-dimensional grid of square cells, very similar to the building domain created for our simulation. Time is discrete in the game and every cell exists in one of two states, either dead or alive. Each cell's status is determined by the status of its neighbouring cells during the last iteration. A cell's neighbours are the 8 cells surrounding it. The rules that govern a cell's activation or deactivation are quite simple. During the previous iteration:

1. If a live cell had fewer than two live neighbouring cells it dies.

2. If a live cell had more than three live neighbouring cells it dies.

3. A live cell with two or three live neighbours stays alive.

4. A dead cell with three live neighbours becomes alive.

Each cell's status after every time step is based purely on the status of its neighbours before the time step. The initial pattern that starts the game is called the 'seed' of the system and is entered by the player; after this no other input is required. All subsequent generations are simply functions of the previous generations according to the rules above.

Some very interesting and quite complex patterns can develop from this system of simple rules, and one of the reasons the game has the name it does is the organic appearance of some of the behaviours that emerge. Another is the nature of the rules, which mimic those of real life. Rule 1 above can be compared to the result of underpopulation, while rule 2 can be seen to mimic overcrowding. Rule 3 allows life to continue where satisfactory conditions are found and rule 4 is akin to reproduction.
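
For reference, one time step of these standard rules can be written very compactly in MATLAB, as sketched below on a randomly seeded grid (the grid size and seed density are arbitrary choices for illustration).

    % Sketch: one Game of Life update using a 3x3 convolution to count live neighbours.
    alive  = rand(20, 30) > 0.7;                  % random seed purely for illustration
    kernel = [1 1 1; 1 0 1; 1 1 1];               % the 8 surrounding cells
    n = conv2(double(alive), kernel, 'same');     % live-neighbour count for every cell
    alive = (alive & (n == 2 | n == 3)) | (~alive & n == 3);  % rules 1-4 applied at once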

Instead of dead or alive cells we can alter the meaning of the states to create fire cells or non-fire cells. Fire has often been compared to a living organism; it consumes material, respires and even reproduces to an extent via embers, which suggests that rules similar to those used in the Game of Life could provide a suitably realistic model for fire evolution. We require these rules to be chosen carefully to ensure the behaviour imitates real-life fires and also obeys the rules of the building environment we have created:

1. Fire should only be able to spread to the four neighbouring cells (above, below, right and left), since this constraint is also applied to the agent's movements.

2. It should be possible for some small initial patterns to burn out early while others can take hold and spread through the building.

3. The fire should begin to die out in regions in which it has been burning for a considerable period of time and has used up available fuel sources, as well as in areas where it has become too sparse to generate enough heat to sustain ignition of new fuel.

4. It should not be possible for initial fire seeds to settle down into stable configurations that remain unchanged thereafter.

5. Fires should grow, move and die in a probabilistic manner so that symmetrical fire seeds do not grow into completely symmetrical fires, which would not happen in real life.

Initial experiments with the time-varying fire code generated fires that tended to grow unbounded from even quite small seeds and lacked the correct balance of probabilities to ensure they died out in sparse areas or after consuming all the fuel in an area. These initial attempts were based on a simple binary model in which states could either be on fire or not. These early fires also took no account of the building's features (i.e. external and internal walls, corridors, open spaces) and grew isotropically, as shown in the time-lapse frames below.

After a considerable amount of experimentation a much improved fire model was created. This met the requirements listed above as well as taking account of the building's features by utilising the same matrices used by the agent to determine the effect of actions in each state (i.e. where the walls are), meaning fires will not grow through walls but will spread down corridors and expand into rooms. This final fire model uses the same binary states of fire or no fire, but empty states that are near to the fire now have varying levels of heat in them depending on how many fire cells surround them; all these levels of heat translate into corresponding amounts of reward an agent would receive for moving into those states.
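
The mapping from fire cells to heat levels and rewards can be sketched as below; the penalty constants and the example fire patch are assumptions, and the wall shielding used in the actual model is omitted for brevity.

    % Sketch: derive heat levels and step rewards from a binary fire map.
    fire = false(20, 30); fire(10:12, 14:16) = true;   % example fire patch
    kernel = [1 1 1; 1 0 1; 1 1 1];
    heat = conv2(double(fire), kernel, 'same');        % number of burning neighbours per cell
    heat(fire) = 0;                                    % heat levels apply to non-fire cells only
    Rstep = -0.005 * ones(20, 30);                     % baseline step reward (smoke inhalation)
    Rstep = Rstep - 0.05 * heat;                       % hotter cells carry larger penalties (assumed scale)
    Rstep(fire) = -1;                                  % fire states themselves (assumed penalty)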

By iteration 15 the fire can be seen to have filled the central hall of the building and begun to spread into neighbouring corridors. Iteration 15 also shows the walls of the central hall shielding the corridors above and below from the heat of the fire, evident from the lack of yellow (low-heat) states surrounding the red (fire) states.

By iteration 45 the fire in the central hall of the building has used up all available fuel and burnt out, while the rest of the fire continues to engulf the left side of the building and spread down the corridors to the right-hand end of the building.

When iteration 95 has been reached the fire has burnt out in the majority of the building, apart from the divided offices in the top-right and bottom-right corners of the building, where the room shape gives the fire a difficult space to manoeuvre in. The fire takes another 30 iterations to burn out in these rooms. By iteration 125 the whole fire has burnt itself out by consuming all available fuel in the building.

We can include this time-varying fire model in our simulation by creating a global clock for the program. This clock's value is equal to the number of steps taken, since each step can be seen to occur within a fixed discrete time period. The fire is then iterated every n steps. The reward an agent receives from a state therefore depends on the step number when it enters that state. To save runtime we can pre-process the fire model for a simulation, storing the reward map for every fire iteration in an array, and then simply load the reward information from this array during learning.
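
A sketch of this arrangement is given below; the array name Rpre, the placeholder reward values and the example position are assumptions for illustration.

    % Sketch: a global step clock selects which pre-computed reward map applies.
    n = 20;                                       % fire evolves every n steps
    maxSteps = 6000;
    numIters = ceil(maxSteps / n);
    Rpre = repmat(-0.005, [20, 30, numIters]);    % placeholder reward maps, one per fire iteration
    % ... in the real program each Rpre(:,:,k) would come from iterating the fire model ...

    stepCount = 137;                              % global clock: total steps taken so far
    agentRow = 10; agentCol = 5;                  % example agent position
    fireIter = min(floor(stepCount / n) + 1, numIters);  % which environment the agent is in
    reward   = Rpre(agentRow, agentCol, fireIter);       % reward depends on the step number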

It is reasonable to imagine that learning would be much slower now, since the agent's Q-Matrix and Eligibility Matrix have increased to four dimensions. Each 3D matrix is 30 states long, 20 states wide and 5 actions high, and there are (maximum number of steps / n) of them, one for each iteration of the fire, since each iteration essentially creates a new environment. There is no physical limit to the maximum number of steps, so one might imagine the agent has an immense exploration and learning task ahead of it. This is not the case, however, as is evident from the learning graphs shown below.

These simulations were run with agent death resulting from entering a fire state. Figure 35 shows learning still occurring very quickly in a slowly changing environment. As the rate at which the environment changes increases, learning a near-optimum policy takes a higher number of trials: roughly 120 until major exploratory trials cease in the case where n = 20, as shown in Figure 36, compared to around 40 trials when n = 100.

Figure 35: Agent's Learning With Environment Changing Every 100 Steps

Figure 36: Agent's Learning With Environment Changing Every 20 Steps

Figure 37: Agent's Learning With Environment Changing Every 10 Steps

Figure 37 shows the agent's learning and reward graphs when the environment changes every 10 steps, simulating an exceptionally fast-burning fire. An almost optimum path avoiding all fire states is not found in this instance until around the 390th trial. The path taken after this stage, give or take a few step differences, is shown below in Figure 38. White squares near the fire indicate states that contained fire in the past but do not during the current iteration.

Figure 38: Path Taken In The 1000th Trial

The lower points on the graph found around trial 300 are due to these trials being stopped early when the agent enters a fire state. Figure 39 shows this occurring. Note that although it seems the agent enters the fire state twice, the first time the agent enters that state there is no fire in it; when the agent returns two steps later the fire has evolved and the state now results in the end of the trial.

Figure 39: Path Taken In The 300th Trial

The highest peak found in Figure 37 occurs at the 36th trial. In this trial the agent 'hides' from the fire in the room containing small offices (centre-left of the building) until the fire has swept through the rest of the building and burnt out. It then continues its exploration of other areas in the building before reaching the exit. The white arrows in the top left of Figure 40 show where the fire has engulfed areas the agent has previously visited.

Figure 40: Path Taken In The 36th Trial

Learning is still so quick in this time-varying environment because the agent does not have to fully populate its four-dimensional Q and e matrices. State-action pairs that lead to long, inefficient routes involving ever higher 'levels' of these TD(λ) learning matrices are still negatively rewarded, meaning they are discouraged, and the agent takes more efficient paths that need fewer 'matrix levels', which are more easily filled via moderate exploration.

7.11 Extension to Multi-Agent Simulation

In order to add additional agents to the simulation we must first consider their interactions; questions such as the following need to be answered:

• When do the agents ‘see’ each other?

• Can two agents occupy the same state?

• Do the agents move simultaneously or in turns?

• How will the agents exchange information?

There are a variety of ways these considerations can be addressed. In this project the agents' interactions were constructed so that they communicate when they come within a move of each other, and they can exist in the same state. Due to the sequential nature of computer code one agent's algorithm will run first, followed by the other's, but in terms of the agents' view of their environment they have moved simultaneously. Agents communicate by exchanging values held in their Q-Matrices and Eligibility Matrices. The simplest way to do this is to add the matrices together element by element.
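
The sketch below shows this element-by-element exchange; the matrix sizes are illustrative and it is assumed that both agents keep the summed result.

    % Sketch: two agents pool their knowledge when they come within a move of each other.
    Q1 = rand(5, 600);  Q2 = rand(5, 600);   % placeholder Q-Matrices for the two agents
    e1 = zeros(5, 600); e2 = zeros(5, 600);  % placeholder eligibility matrices
    combinedQ = Q1 + Q2;                     % element-by-element addition of value estimates
    combinedE = e1 + e2;
    Q1 = combinedQ;  Q2 = combinedQ;         % both agents continue with the pooled matrices
    e1 = combinedE;  e2 = combinedE;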

The learning graphs from a simulation containing 2 uninformed agents, where the environment changes every 100 steps due to the fire model, are shown in Figure 41.

Figure 41: Simulation Results Of Two Uninformed, Communicating Agents

Since both of these agents are uninformed, the number of trials taken exploring their environment until an efficient route to the exit is realised is about the same as in the single-agent case. There is much less variation in path length after the 50th trial though. This is due to the use of both agents' domain knowledge gained through exploration. When epsilon-greedy suggests a random action there is a higher chance that the new state has been explored by one of the agents and the correct action can be taken to put the agent back on course to the exit.

The program now outputs 2 agent path plots. The figure below shows the paths taken by the two agents in the 30th trial. Agent 1, on the left, can be seen exiting the building rather quickly, as evidenced by the fire still being in its first iteration and the agent's final arrows still being coloured blue. Agent 2, on the other hand, embarks on an exploratory route until it enters a fire state and the trial ends.

The pictures below show the results from a trial in which the agents meet up, communicate their knowledge of the environment to each other and then move together toward the exit. The black arrows indicate when the agents have taken the same moves together.

By the end of the simulation the agents essentially take optimum routes to the exit without meeting up, but still continue to use the knowledge gained from their previous encounters.


Figure 42: Paths Taken By The Two Agents In Trial 30

Figure 43: Paths Taken In Trial Where The Two Agents Meet

Figure 44: Paths Taken In Trial Where The Two Agents Meet

The plots below show the states in the environment each agent has explored. Red indicates states the agents have visited most frequently while, at the opposite end of the range, blue indicates states the agents have rarely visited.

Figure 45: States Explored By Each Agent

When the agents communicate and combine their knowledge of the environment, they have both effectively explored the building to the degree shown below.

Figure 46: Combined Exploration Of The Two Agents

An informed agent can be constructed using the optimal policy, but when fires are included in the simulation this agent must still run temporal difference learning in order to adapt to the environment and learn to avoid the fires. Allowing an uninformed agent to gain accurate information from an informed agent reduces the number of trials required for exploration and shortens the simulation's runtime, as the results on the right of the figure below show. The results in the graph on the left were generated with 2 uninformed agents in the same environment but with sub-optimal α and λ, so that a clear impression of the improvement can be gained. Since the informed agent in the right-hand figure is running an optimum strategy and no fire states initially exist between its starting state and the exit, this agent does not have to perform any learning and all of its trials have an optimum number of steps.


Figure 47: Sub-optimal Learning By 2 Uninformed Agents (Left) And The Equivalent Results When Agent 1 Is Informed (Right)

8 Validation/Verification

The use of an infinite-horizon, discrete-time Markov Decision Process has proved a useful abstraction of reality. The temporal-difference learning methods employed with this model have the capacity to simulate the evacuation behaviours of human occupants reasonably accurately. The results from the simulations conducted during the course of this project show that the agents act rationally in their avoidance of negatively rewarding states and in their attempts to minimise the time they take to exit the building. The spreading of temporal error association over the states surrounding the current estimate of position is a realistic approach to a smoke-filled environment, and the methods of agent communication lead to recognisable behaviour when agents meet up and continue toward the emergency exit together.

The use of other agents' environmental information can result in both beneficial and undesirable behaviour, which reflects what sometimes occurs in real evacuations when people unfamiliar with the structural details of a building make use of others' knowledge, however accurate it is, e.g. by following other evacuees. The method of temporal differences was originally created to explain the predictive behaviour of animals in psychological experiments [9] [13]. It was only later that it was used in dynamic programming. The fact that TD(λ) methods are based on observations of learning in actual mammals helps explain why they have produced the realistic behaviour of the agents in the simulation.

The method of representing agents' knowledge in the relative values of elements in a Q-matrix was chosen to mimic humans' long-term memory and has allowed a similar ability to quickly and simply choose which actions to take.

There are, however, elements of human behaviour that are not modelled in the simulation. It has been observed [14] that in real-life evacuations from an office-style environment humans delay their response to an evacuation alert by an average of 29 seconds. This time is usually spent saving current work on computer workstations and collecting belongings.

The simulation as it stands has other shortcomings. There is not a large enough variety of agents to accurately model all occupants of a typical building, e.g. no variation in agent speeds to model the old, ill, disabled, children, and overweight/unfit people. There is also no consideration of the physical space requirements of agents, which would be needed to observe behaviour such as congestion at doorways. Consideration also needs to be given to the ability of agents to negotiate certain terrain. This is not required so much for the building modelled in this project, but in a multi-floored building with stairs, agents representing the disabled would require different action functions for particular states. In states representing stairs, for example, a disabled agent would not have an 'Up' or 'Down' action as an able-bodied agent would.

Although we can make qualitative assessments of the model of human behaviour, the only way to truly know the accuracy of the simulation is by direct comparison with a physical evacuation of a similar population and setting. If such a physical evacuation were to be conducted it would make sense to test the most detailed simulation possible. There are several ways the evacuation model created in this project could be improved and these are mentioned in the next section.

9 Conclusions

While further testing of the results of the model would be needed for its complete validation, the initial process of development and testing has been completed. The simulation has been used to indicate the applicability of reinforcement learning techniques to modelling building evacuation scenarios. The program created during the course of this project is very versatile. There already exists a wide range of simulations a user can run; these include:

• The modelling of almost any building design.

• Multiple exits in single buildings.

• A multi-agent population, each with varying amounts of knowledge of the environment.

• A choice of the variables used in the learning algorithms.

• The inclusion of static fires or time-varying ones; either using the in-built fire model with random or user-generated seeds, or by creating a completely customised fire.

• The ability to model different types of fire by varying the parameters used in the time-varying fire model and by introducing a range of uncertainties in position mimicking the smokiness of the fire.

The ability of the techniques examined to adapt to such a wide range of scenarios and produce behaviours akin to those of humans demonstrates their usefulness in this field. The work conducted has also produced a variety of ways to view simulation results and agent behaviours, which allow a greater understanding of how temporal-difference learning is implemented and how easy humans would find evacuation from certain environments.


While TD learning was originally proposed to explain animals' learning in psychological experiments and may be considered an accurate description of natural learning methods, we, as humans, would obviously implement it differently to the agents in the simulation. The possession of external senses and already accumulated knowledge allows us to predict the effects of entering certain future states. A human, for example, does not need to walk into a fire to know it is there and will, under normal circumstances, never choose a completely random action when next to a fire. A simple method for modelling human external senses could be created by using a look-ahead reward-based decision mechanism. When an agent is considering its next move it could 'look ahead' to the reward it would receive in the new states that each possible action would take it to. This would mean that when an agent is in a state neighbouring a fire it would never take the action leading into that fire (and thereby receive the associated negative reward) but could pick randomly between any equivalent alternative options.
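
A sketch of such a look-ahead filter on the random (exploratory) choice is shown below; the reward map, the candidate next states and the fire penalty are illustrative assumptions.

    % Sketch: before choosing a random action, discard any action whose next state is on fire.
    Rstep = -0.005 * ones(20, 30);  Rstep(10, 6) = -1;   % example reward map with one fire state
    candidateStates = [9 5; 11 5; 10 4; 10 6; 10 5];     % next state for up, down, left, right, wait from (10,5)
    fireReward = -1;                                     % assumed fire penalty
    nextReward = zeros(5, 1);
    for a = 1:5
        nextReward(a) = Rstep(candidateStates(a, 1), candidateStates(a, 2));
    end
    safeActions = find(nextReward > fireReward);            % actions that do not lead into fire
    action = safeActions(ceil(rand * numel(safeActions)));  % random pick among the safe options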

9.1 Further Work

There are, as has been hinted at in the previous sections, a great many ways the work of this project could be extended and improved upon. Some of these points are interesting from an academic perspective, since they would allow further investigation of the potential of multi-agent learning, and some are useful considerations should a similar program ever be created for use in a professional package, outside of an academic context:

• In the current simulation the building space is divided into an orthogonal square-state world, which leads to inaccurate equalities of routes as shown in Figure 8. The use of a more accurate and higher-resolution model of space, such as a hexagon grid or even a continuous domain, could be useful. The second of these would require a different method of temporal difference learning, probably using Radial Basis Functions to estimate the value function and implementing TD learning on this parameterised vector. A new method of generating time-varying fires would also need to be created. These changes in domain would obviously increase the complexity of the learning task, but without a direct comparison and validation it remains to be seen whether they would improve the accuracy of the simulation by much.

• The simulation could be simply and quickly altered to accommodate state spaces for larger buildings with multiple floors by including 'stair states' in which a choice of the forward action would result in a state transition to another floor.

• As mentioned in the previous section, it has been observed that people do not respond immediately to an evacuation alert. A professional software package would need to take account of this by holding the agents in their initial positions for a certain period of time.

• Professional evacuation-modelling software should also have a wider range of distinct agents. This would allow the modelling of a more realistic population including wheelchair users, the visually impaired, those with hearing difficulties, the ill, children and the unfit. Each of these agent types could vary in speed, size and ability to negotiate terrain to more aptly represent the diversity of the population and their interaction with the environment. An agent representing a wheelchair user, for example, would not make the transition to another building floor if they were to take the forward action in a stairs-state.

• The possible inclusion of variation in agent size also reminds us that there has been no consideration of individuals' physical space requirements. This would be required to accurately model the interactions between agents, e.g. congestion at doorways, in corridors and at exits. This could be approximated in our simulation by not allowing more than a certain number of agents to exist in one state.

• Emergency exit arrows could be included in the simulation by changing the value function of particular state-action pairs surrounding the location of the arrow when the agent gets within a certain number of steps. If, for example, the actions leading in the direction of the exit were given the highest values in these states' Q-Matrix columns, this would have the effect of encouraging the agent to move toward the exit when in the vicinity of the emergency exit arrows. This realistically models the behaviour humans exhibit in such circumstances.

• The simulation could be used to test the effect of different action-selection schemes on the agents' behaviour. There are a number of alternatives to the Epsilon-Greedy method used in this project, including the Boltzmann method. In Epsilon-Greedy each non-greedy action has an equal chance of being chosen, whereas in Boltzmann Exploration these actions are weighted by their expected reward; a short sketch of this weighting is given after this list.

• It has also been proposed [15] that reinforcement learning could be used at higher levels of abstraction. This would require a redesigning of the agent and how it interacts with its environment. All the agents in this project's simulation have the same goal. This is not necessarily true in real life, since Evacuation Personnel also aim to search the building, find people and help them evacuate. Agents representing these personnel could gain rewards for meeting up with other agents in the environment or for exploring areas of the building. Rewards could also be given for communication by uninformed agents, to represent the positive gain from receiving the extra knowledge of the environment. Reinforcement Learning would then act at the level of behaviours instead of states and actions.
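
As referenced in the action-selection item above, a minimal sketch of Boltzmann (softmax) action selection for a single state's Q-values is given below; the Q-values and the temperature are illustrative.

    % Sketch: Boltzmann exploration weights each action by its expected value.
    q   = [0.2; -0.1; 0.05; 0.0; -0.5];      % example Q-Matrix column for the current state
    tau = 0.1;                               % temperature: high -> near uniform, low -> near greedy
    p = exp(q / tau) / sum(exp(q / tau));    % softmax action probabilities
    action = find(rand < cumsum(p), 1);      % sample one action from the distribution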


10 References

Listed below are the sources of information drawn upon throughout the course of this project and referenced in this text.

1. Artificial Intelligence: A Modern Approach: S J Russell and P Norvig, 1995.

2. Proceedings of the 2nd International Conference on Pedestrian and Evacuation Dynamics: E R Galea, 2003.

3. An Analysis of Problems and Research in DAI: A H Bond and L Gasser, 1988.

4. On Optimal Cooperation of Knowledge Sources - An Empirical Investigation: M Benda, V Jagannathan and R Dodhiawala, 1986.

5. Dynamic Programming: R E Bellman, 1957.

6. Dynamic Programming and Suboptimal Control: A Survey from ADP to MPC: D P Bertsekas, 2005.

7. Adaptive Control Processes: A Guided Tour: R E Bellman, 1961.

8. Temporal Credit Assignment in Reinforcement Learning: R S Sutton, 1984.

9. Reinforcement Learning: An Introduction: R S Sutton and A G Barto, 1998.

10. Learning to Predict by the Methods of Temporal Differences: R S Sutton, 1988.

11. Learning from Delayed Rewards: C J C H Watkins, 1989.

12. Arrow.m: E A Johnson, 2002. (http://www.mathworks.com/matlabcentral/fileexchange/278)

13. A Temporal-Difference Model Of Classical Conditioning: R S Sutton and A G Barto, 1987.

14. Quantification of Behaviour for Engineering Design Standards and Escape Time Calculations: D A Purser and M Bensilum, 2001.

15. Learning to Behave Socially: M J Mataric, 1994.


11 Appendix - MATLAB Code

The following pages provide the reader with an example of the source code used to implement the simulation. Comments are included to make clear the function of sections of this code.
