
RL Project

By: Eran Harel and Ran Tavori

Instructor: Ishai Menache

Project taken at the software lab of E.E., Technion, winter semester 2002


Abstract

RL (Reinforcement Learning) is the process of iteratively learning how to achieve the best performance by trial and error.

The general purpose of the project is to implement test-bed software for various RL algorithms in different domains. The next stage is to implement a couple of RL algorithms and a couple of environments and test them all.

The following document describes the requirements of the project, presents an overview of RL algorithms, describes the design of the implemented software and presents the outcome of the tests we have made.


Table Of Contents

Introduction
Requirements
RL Algorithms
    Value Functions
    TD
    Q-Learning
    Exploration vs. Exploitation: ε-Greedy
Function Approximation
    The Feature Space
    RBF
    In our case
Design
    The Three Main Interfaces
        ISimulation interface
        IAgent interface
        IEnvironment Interface
    Additional Interfaces
        ISensation Interface
        IAction Interface
    Package Overview
        Simulation Package
        Agent Package
        Environment Package
Implemented Environments
    The Maze Environment
    The PredatorAndPrayEnvironment
Experimental results
    Maze Environment Results
    Predator and Prey Environment Results
Operating manual
    Environment editor
    The simulation type chooser
    The simulation main window
Improvements that we have not implemented
List of Figures
Bibliography


Introduction

The following document describes the details of the software implemented in the project.

We first describe the requirements of the project.

Then we present a short theoretical introduction to RL and to function approximation.

We then dive into the details of the software design, explaining every important aspect of the platform design. In this part only abstract explanations are given, leaving the technical details to the Javadoc documentation auto-generated from the code.

After the design overview we present the outcome of the test runs that we made, including graphs that measure performance.

Then we present a user manual that details the various features of the software and how to use them.

Finally, we list a few of the interesting directions and open questions that we ran into, which we would have investigated further had time allowed.


Requirements

1. The main target of the project is to create a test bed for various agents that use reinforcement learning, in different environments.

2. The project will include a platform for testing the agents in the different environments (henceforth – the platform):

2.1. The platform must define a common interface for an environment.

2.2. The platform must define a common interface for an agent.

2.3. The platform must implement simulation software that gets an environment and an agent as input and runs a simulation based on that input.

3. The project will include test applications of agents and environments:

3.1. Agents:

3.1.1. Tabular Q-agent

3.1.2. Tabular Sarsa(λ) agent

3.1.3. Tabular TD(λ) agent

3.1.4. TD(λ) agent that uses function approximation

3.2. Environments:

3.2.1. A maze.

3.2.2. Predator and prey


Definitions

Agent – An agent is an entity that resides in an environment and interacts with it. An example might be a chess player: the player is the agent, whereas the board is the environment. In this example, the other player might be considered another agent, but it might also be considered part of the environment, since, as far as the first agent knows, the board and the other player are both external to it; hence they can both be considered the environment as a whole. In the software we wrote, this kind of abstraction is actually used.

Environment – The environment is the complement of the agent. An environment is considered to be everything that is external to the agent. Put another way, the environment is what the agent cannot control.

State – An agent has a state in the environment. For example, the state might be the location of the agent on a grid.

Action – For each state there is a set of actions that might be taken when the agent is in the state.

Goal – Usually the agent interacts with the environment in order to achieve a goal. A goal might be, for example, winning a chess game.

Episode – An episode is a series of states, in which the first state is an initial state and the last state is a terminal state. The terminal state is usually a goal.

Policy – A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic.

Reward Function – A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state. A reinforcement-learning agent's sole objective is to maximize the total reward it receives in the long run. The reward function defines what are the good and bad events for the agent. The reward function must necessarily be fixed. It may, however, be used as a basis for changing the policy. For example, if an action selected by the policy is followed by low reward then the policy may be changed to select some other action in that situation in the future. In general, reward functions may also be stochastic.

(The definitions in this section are taken from [1].)


RL Algorithms

When speaking about algorithms of artificial intelligence, one can divide them into two main classes:

Domain specific algorithms

General algorithms

This division is not precise, but for the purpose of this discussion we will keep it.

One might expect domain-specific algorithms to perform better in their domain than any general algorithm, simply because they were specially tailored to that domain. On the other hand, they would not work in any other domain. Furthermore, in order to fit an algorithm to a certain domain, one must have some advance knowledge about that domain, and this knowledge is not always available.

An example of a domain-specific algorithm is the A* algorithm for finding a path in a maze. Without getting into its details, it is sufficient to say that A* is formally proven to find the optimal route in the maze, with time complexity of O(n²) and space complexity of O(n) in the length of the path.

As for RL algorithms – they all belong to the second class, the general algorithms. They are not domain specific, and they can be applied to any domain that satisfies a small set of requirements:

When the algorithm does something “right” it gets a reward

When the algorithm does something “wrong” it gets a negative reward.

So the essence of all RL algorithms is the method of trial and error: let the algorithm try to solve the problem; if it succeeds, give it a reward so that it can sense it has done something right; and if it did not succeed, give it a negative reward so that it will refrain from doing that in the future.

This kind of approach is obviously inefficient in many cases, but in some cases it is the only acceptable method; in particular, when there is no prior knowledge of the domain it is the only method available.

(The overview of the RL algorithms follows [1] and [3].)


In this project we had to solve two problems: the maze problem and the predator-prey problem. We could have implemented domain-specific algorithms for solving them; however, we preferred to implement general RL algorithms, so that we would be able to learn and test the capabilities of such algorithms.

In the following discussion we will present a short introduction to some basic RL algorithms, including the algorithms that were implemented in the project.

Value Functions

Let us start by defining the term value function. A value function is a mapping of states to real values, V: S → ℝ. Here S represents the set of possible states in the given domain. For example, if the domain is an m by n grid, then S is a set of size m·n that includes all the states on the grid.

The essence of defining a value function is to be able to estimate the goodness of a state. The logic is simple: if an agent is in a state that is worse than a neighboring state, it should move to that other state and thereby improve its situation.

Of course, the real issue is how we obtain this value function. Since we assume no prior knowledge of the domain, we must also assume that at the beginning we have no value function at all. So we have to build the value function step by step using the method of trial and error.

Markov Decision Process

A process can be described as a sequence of states. The agent-environment interaction can be treated as such a process, in which, given the current state, the next state is determined by the agent's action, by the environment (that is, by the current state and possibly some other factors) and, optionally, by some stochastic function of them all.

A process is said to be Markovian if it has the following property:

The next state of the process is determined exclusively by the current state, the action taken, and optionally some stochastic function of them

A simple example of an MDP (Markov Decision Process) is an automaton: if the MDP is deterministic, the automaton is deterministic too, and if the MDP is stochastic, the automaton is non-deterministic.

This definition of an MDP will serve us in the following text.


TD

One method of estimating the value function is the method of TD (Temporal Difference).

The idea behind TD methods is:

given the current estimate of V(s)

and given the reward r, after taking an action a from s

and given the next state s’, after taking an action a from s

Then V(s) gets updated using the following rule:

V(s) ← V(s) + α[r + γ·V(s') − V(s)]

Equation 1: update rule for TD

where:

α is a parameter called the learning rate (0 < α ≤ 1)

γ is a parameter called the discount factor (0 ≤ γ ≤ 1)

The main idea of this update rule is: the value of s should equal the reward received for taking an action from s plus the (discounted) value of the next state.

This rule is used iteratively when trying to estimate the value function of a given policy. A whole algorithm that uses this rule is presented in Figure 1.

Figure 1: TD algorithm for estimating Vπ


Here π is the policy to be evaluated. The algorithm is used to estimate V(s) given the policy π, which maps states to actions: π: S → A.

The algorithm estimates Vπ for a given π. What we now have to do is make π the optimal policy. How do we do that? If we knew V, this would be a simple job: π would be the policy that maps a state to an action yielding the best (expected) next state. And what is the best next state? This is given to us by V. So, if we knew V, this would be simple... But we do have an estimate of V – this is exactly what the algorithm in Figure 1 computes. So we can use V in order to improve our policy.

After understanding the last paragraph, what we can do is the following: estimate Vπ using the presented algorithm, and then use Vπ in order to improve π.

This method is known to work in practice, and there is even a formal proof that it converges to the optimal policy.

There is a little more to TD than what is presented here, but for the purpose of this overview it should be enough.
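To make the update rule concrete, the following is a minimal Java sketch of the tabular TD(0) update described above (the value table, keyed by state IDs, and the constant values are our own illustration, not part of the platform):

    import java.util.HashMap;
    import java.util.Map;

    // Tabular TD(0): one entry of V per state ID, updated according to Equation 1.
    public class TdZeroTable {
        private final Map<String, Double> v = new HashMap<>();
        private final double alpha = 0.1;  // learning rate
        private final double gamma = 0.9;  // discount factor

        // Called after observing: state s, reward r, next state sNext (null if terminal).
        public void update(String s, double r, String sNext) {
            double vS = v.getOrDefault(s, 0.0);
            double vNext = (sNext == null) ? 0.0 : v.getOrDefault(sNext, 0.0);
            v.put(s, vS + alpha * (r + gamma * vNext - vS));
        }
    }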

Q-Learning

We have introduced the value function, which maps states to real values, V: S → ℝ. In this section we will introduce a slightly different approach.

But before doing so, let us first understand what the problem with the value-function method is. In this method there is an implicit assumption that the agent has some model of the environment. What does that mean? It means that when the agent has to choose the action to take from state s, it uses the value function V in order to decide which action is best. But how exactly does it use the value function to make this decision? This is where the implicit assumption of a model of the environment lies – the agent has to take an action that will result in the best reward and the best next state. But how does it know what the next state is going to be if it takes action a from state s? For this it needs a model of the environment. A model of the environment is a mapping from a (state, action) pair to the next state: T: S × A → S. So, given the current state s, a model of the environment determines what the next state s' will be if we take action a.

So, by using the method of value functions, the agent not only has to build a value function – it also has to have a model of the environment, and such a model is not always available. Sometimes the agent might have a model of the environment in advance; for example, if the environment is a simple grid, the agent can tell that if it is now at position (3,4), by taking the action SOUTH it would get to position (3,5). But this is not always true: what if there is an obstacle at that position? And what if there is wind that carries the agent to a different position? Sometimes the agent simply has to build the model of its environment dynamically.

Building a model of the environment is not always simple either. We have introduced a deterministic model of the environment, T: S × A → S. But what if the environment is not deterministic? Then we would have to use a different, stochastic model, T: S × A × S → [0,1], where T(s, a, s') = p means that the probability of getting to state s' by taking action a while in state s is p.

To summarize: with the value-function method it is not sufficient for the agent to learn the value function – it must also have a model of the environment, which is not always easy to obtain.

This is where the next method comes in handy.

We now introduce the method of Q-learning. This method does not require an explicit model of the environment; rather, it implicitly builds one.

Recall that a value function maps states to real values. In contrast, a Q function maps (state, action) pairs to real values: Q: S × A → ℝ.

What is the meaning of this mapping? The meaning is quite intuitive and straightforward: Q(s,a) measures the goodness of taking action a while in state s.

If the agent uses such a mapping, then all it has to do, when in state s, is take the action a that produces the maximal value of Q(s,a):

π(s) = argmax_a Q(s,a)

Equation 2: usage of a policy π that is derived from Q

The update rule for Q is the following rule:

Q(s,a) ← Q(s,a) + α[r + γ·max_{a'} Q(s',a') − Q(s,a)]

Equation 3: update rule for Q

where:

α is the learning rate (0 < α ≤ 1)

and γ is the discount factor (0 ≤ γ ≤ 1)


This rule resembles the update rule for V presented in Equation 1, only that it applies to a Q function. For this reason it also takes a max over the actions of the next state, while the update rule for V does not.

An algorithm that uses the rule is presented in the next figure.

Figure 2: Q-learning algorithm

There is one important issue about the algorithm that has not been introduced yet, and this is the ε-greedy issue. This will be covered in the following section, but in short, the line that says “choose a from s using policy derived from Q (e.g. ε-greedy)” means that the agent should choose the action a that maximizes Q(s,a), just as in Equation 2, but with probability ε it chooses a random action instead.

At the beginning of this section we discussed the advantages of learning a Q function over a V function. We said that in cases where the agent does not have a model of its environment, or where it is difficult to create one, it is possible to use Q-learning, which does not require one.

But what are the disadvantages of the method of Q-learning?

The answer lies in the size of the state space. The domain of a V function is the state space, so its size is |S| (recall that V: S → ℝ), whilst the domain of a Q function is the set of state-action pairs, so its size is |S|·|A| (recall that Q: S × A → ℝ).

This implies two main problems:


Since the domain of Q is larger than that of V, storing a Q function requires more memory capacity; the memory complexity increases by a factor of |A|.

In order to compute exactly V(s) for a given state s, or Q(s,a) for a given state s and action a, one has to visit s (or visit s and take action a, respectively) an infinite number of times. We will not present a formal proof of this statement, but it can be understood intuitively by noting that the update rules for both Q and V are iterative, so in order for an entry to converge to its correct value an infinite number of updates must be made to it. By the same logic, in order for V and Q to converge to the correct values within an arbitrary precision, an arbitrarily large number of updates must be made to each element of the function. It is then easy to see that, since V has fewer elements than Q, V will probably converge faster than Q.

To summarize, both the Q and V methods have their pros and cons. As a rule of thumb, one should use a value function when it is possible to construct a good-enough model of the environment. When it is impossible to obtain such a model, there is no other choice but to use a Q function. In the twilight zone – where it is possible to construct a model of the environment, but the model might be wrong or very difficult to build – one should try both methods and pick the one that best suits the domain.

In the software that we wrote we used both methods.

There are many algorithms that are based on learning a Q function. In the project we used two of them: the one presented in Figure 2 and an algorithm called Sarsa, presented next.

Figure 3: Sarsa algorithm

The main difference between the Sarsa algorithm and the Q-learning algorithm presented in Figure 2 is the update rule.
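The difference can be seen directly in code. Below is a sketch, using our own table representation (state and action IDs concatenated into a key), of the two tabular updates; Q-learning bootstraps from the best action available in the next state, while Sarsa bootstraps from the action that was actually chosen there:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the two tabular update rules; not part of the platform interfaces.
    public class QTableUpdates {
        private final Map<String, Double> q = new HashMap<>();
        private final double alpha = 0.1, gamma = 0.9;

        private double get(String s, String a) { return q.getOrDefault(s + "|" + a, 0.0); }

        // Q-learning update (Figure 2): uses the maximum Q value over the next state's actions.
        public void qLearning(String s, String a, double r, String sNext, Iterable<String> nextActions) {
            double best = Double.NEGATIVE_INFINITY;
            for (String aNext : nextActions) best = Math.max(best, get(sNext, aNext));
            if (best == Double.NEGATIVE_INFINITY) best = 0.0;  // terminal state: no next actions
            q.put(s + "|" + a, get(s, a) + alpha * (r + gamma * best - get(s, a)));
        }

        // Sarsa update (Figure 3): uses the action aNext actually selected (e.g. epsilon-greedily).
        public void sarsa(String s, String a, double r, String sNext, String aNext) {
            q.put(s + "|" + a, get(s, a) + alpha * (r + gamma * get(sNext, aNext) - get(s, a)));
        }
    }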


Exploration vs. Exploitation: ε-Greedy

We have already mentioned the idea of choosing an action that is not necessarily the best one according to the policy: choosing a random action with probability ε and the best action with probability 1−ε.

The question that needs to be asked is: what is this good for? Why not always choose the best action possible, thereby increasing the reward? The answer is that the policy – which, for the sake of the discussion, we assume uses a Q function – does not necessarily have the correct Q function; in fact, it almost never does. It only has an estimate of Q, and it uses this estimate to choose the next action. What if this estimate is not correct?

Let us look at a simple example. Suppose that we have a state s and an action a for which our estimate is Q(s,a)=1, and some other action a', where a' ≠ a, with Q(s,a')=0. Recall that Q is not necessarily correct – it is just an estimate. Now suppose that taking action a' would actually yield a higher reward than taking action a. If we always chose the action that maximizes Q, we would always choose a. So how can we learn that taking a' leads to a higher reward than taking a? In order to learn that, we have to take a' at least once; in fact, in order to know the true value of Q(s,a') we would have to take a' from s an infinite number of times. This is where the ε-greedy approach comes in handy: using it, we choose a random action with probability ε, so eventually we take all actions from all states an infinite number of times, and thereby obtain a proper estimate of Q(s,a) for every s and a.

Generally speaking, the ε-greedy approach embodies the dilemma of exploration vs. exploitation. If we set ε=0, i.e. use a perfectly greedy policy, we only exploit the Q function (or the value function); if we set ε=1 we explore all the time and never collect the rewards, even when we do know which actions are good and which are bad. The dilemma is when to explore, by taking a random action, and when to exploit, by taking the best action.

This dilemma is solved in the ε-greedy approach by taking a random action with probability ε and taking the best action with probability 1-ε. This method was used in the software we wrote.
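As an illustration, an ε-greedy choice over the Q estimates of the applicable actions might look as follows (a sketch with our own names; it is not the exact code of our policy classes):

    import java.util.List;
    import java.util.Random;

    // Epsilon-greedy selection: returns the index of the chosen action among the applicable ones.
    public class EpsilonGreedySelector {
        private final Random random = new Random();
        private final double epsilon;

        public EpsilonGreedySelector(double epsilon) { this.epsilon = epsilon; }

        // qValues.get(i) is the current estimate Q(s, a_i) for the i-th applicable action.
        public int select(List<Double> qValues) {
            if (random.nextDouble() < epsilon) {
                return random.nextInt(qValues.size());      // explore: random action
            }
            int best = 0;                                   // exploit: greedy action
            for (int i = 1; i < qValues.size(); i++) {
                if (qValues.get(i) > qValues.get(best)) best = i;
            }
            return best;
        }
    }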

The ε-greedy approach is also helpful in situations where the environment is changing dynamically (but slowly), for example a maze with moving walls. In this case there is generally no Q function that is correct all the time – even if we learned a perfect Q function for the current environment, once the environment changes the function is no longer perfect. The way for the policy to adjust itself to the changing environment is to explore from time to time and verify that “it is not missing anything”; using an ε-greedy algorithm is one acceptable approach for dealing with such cases.

In our implementation ε had a constant value (usually 0.1), but it is also possible for an algorithm to change ε over time. The decision of how and when to change ε is related to the rate at which rewards are obtained, i.e. to how often the goal is reached. It is common practice to set ε = ε(t), i.e. to give ε as a function of time, usually a decreasing one, which means much exploration at the beginning and less exploration later.
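For completeness, here is one common shape such a schedule might take (the numbers are illustrative only; our own runs used a constant ε):

    // Decreasing exploration: start close to 1 and decay toward a small floor.
    public final class EpsilonSchedule {
        public static double epsilon(int t) {
            final double eps0 = 1.0, epsMin = 0.05, decay = 0.999;
            return Math.max(epsMin, eps0 * Math.pow(decay, t));
        }
    }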


Function Approximation

We have so far assumed that our estimates of value functions are represented as a table with one entry for each state or for each state-action pair. This is a particularly clear and instructive case, but of course it is limited to tasks with small numbers of states and actions. The problem is not just the memory needed for large tables, but the time and data needed to accurately fill them. In other words, the key issue is that of generalization. How can experience with a limited subset of the state space be usefully generalized to produce a good approximation over a much larger subset?

This is a severe problem. In many tasks to which we would like to apply Reinforcement Learning, most states encountered will never have been experienced exactly before. This will almost always be the case when the state or action spaces include continuous variables or large number of sensors, such as a visual image. The only way to learn anything at all on these tasks is to generalize from previously experienced states to ones that have never been seen.

To summarize, function approximation (of both Q and V functions) is used to confront two main issues:

Memory capacity – the usage of function approximation usually consumes less memory.

Convergence speed and generalization – if two or more similar states are visited, then updating one also updates the others, due to the generalization achieved by the function approximation.

There are many methods for function approximation. To name a few: neural networks, decision and classification trees, nearest neighbor, k-nearest neighbor, radial basis functions, tile coding, CMAC, etc.

In the project we have implemented a linear function approximator based on RBFs (Radial Basis Functions). The following sections describe this work.

The Feature Space

A notion that is common to all function approximations is the notion of the feature space.

A feature is a measurement of some property of the state. One can think of a feature as being an abstraction of the state. An example feature would be the Manhattan distance on a grid. Yet another example would be the number of pieces in a Checkers game. A proper feature must be related to the goodness of the state. Moreover, if a feature is not related to the goodness of the state it is redundant.

(The function approximation overview follows [1], [2], [4], [5] and [6].)


The feature space is a multi-dimensional space (one dimension per feature). It may be discrete or continuous.

The feature space is usually much smaller than the state space of the original problem.

It is obvious that all features are domain specific. Moreover, one cannot select a meaningful feature without a good understanding of the domain one is dealing with. To put it mildly, choosing the correct features is crucial to the success of the approximation.

So why is this feature space so important? In order to understand this, let us recall a previously discussed subject – the value function. We will use value functions to explain the importance of the feature space, but it is important to note that Q functions have similar properties.

When we discussed value functions earlier, we said that a value function maps a state s to a real value v: V(s)=v. But what is a state? How is it represented? Of course, it is possible to represent a state in many different ways, such as an enumeration of values, a set of enumerations, etc. When speaking about value functions it is not important how the state is represented (as long as the representation is consistent), but when speaking about function approximation it is very important. Usually a function approximator is a mathematical function that gets a vector of real numbers as input. Thus, it cannot get a state as input – it must get a vector representation of it. This representation lives in the feature space, obtained by a transformation from the state space to the feature space. It is therefore important to choose the correct transformation to the feature space.

RBF

The idea of RBF is to approximate a function (in our case, the value function) by a linear combination of Gaussians.

The objective of the function approximation is to find an approximation f of V, where f is the following linear sum:

f(x) = Σ_i w_i·φ_i(x)

Equation 4: a linear sum of Gaussians is used to approximate V

where

φ_i(x) = exp(−‖x − c_i‖² / (2σ_i²))

Equation 5: the base functions are Gaussians centered at c_i

The norm is the distance in the feature space; c_i is the location of the i-th base function in the feature space.
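In code, evaluating the approximation at a point x of the feature space amounts to the following (a sketch with our own field names; centers, weights and sigma correspond to c_i, w_i and σ in Equations 4 and 5):

    // f(x) = sum_i w_i * exp(-||x - c_i||^2 / (2 * sigma^2))
    public class RbfValue {
        private final double[][] centers;  // c_i, one row per base function
        private final double[] weights;    // w_i
        private final double sigma;        // common standard deviation

        public RbfValue(double[][] centers, double[] weights, double sigma) {
            this.centers = centers; this.weights = weights; this.sigma = sigma;
        }

        public double value(double[] x) {
            double sum = 0.0;
            for (int i = 0; i < weights.length; i++) {
                double d2 = 0.0;
                for (int j = 0; j < x.length; j++) {
                    double diff = x[j] - centers[i][j];
                    d2 += diff * diff;                  // squared distance in the feature space
                }
                sum += weights[i] * Math.exp(-d2 / (2 * sigma * sigma));
            }
            return sum;
        }
    }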


The number of base functions, their positions, their standard deviations and the weight of each function are all parameters that the algorithm has to learn. The following section describes the RBF implementation that we have made.

In our case

We used a method in which the number of base functions is not determined in advance; rather, they are added dynamically according to a threshold and to the value of the gradient in the nearby surroundings. The location (center) of each function is fixed once the function has been added. The standard deviation is preset and is the same for all base functions. What the algorithm fully controls are the weights of the functions.
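One plausible reading of this scheme, reduced to its simplest form, is sketched below: a new Gaussian is placed at the current feature vector whenever no existing center lies within a given distance threshold (the gradient-based part of our actual criterion is omitted here):

    import java.util.ArrayList;
    import java.util.List;

    // Dynamically growing set of base functions; only the weights are adjusted afterwards.
    public class GrowingRbfCenters {
        private final List<double[]> centers = new ArrayList<>();
        private final List<Double> weights = new ArrayList<>();
        private final double threshold;

        public GrowingRbfCenters(double threshold) { this.threshold = threshold; }

        public void maybeAddCenter(double[] x) {
            for (double[] c : centers) {
                if (distance(x, c) < threshold) return;  // an existing Gaussian already covers x
            }
            centers.add(x.clone());                      // fix the new center at x
            weights.add(0.0);                            // its weight starts at zero and is learned
        }

        private static double distance(double[] a, double[] b) {
            double d2 = 0.0;
            for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; d2 += d * d; }
            return Math.sqrt(d2);
        }
    }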

The State Space

Using function approximation is sometimes the only way to go, because of the great memory (space) complexity involved in regular tabular solutions. In our test runs there were many cases where the agent simply ran out of memory.

The Algorithm We Used

In the project we have implemented a TD(λ) algorithm that uses RBF to approximate its value function. The following figure describes the algorithm:

Figure 4: TD(λ) algorithm that uses RBF as function approximation

In the figure, Policy is an ε-greedy policy, which means that with probability 1−ε it returns an action a such that taking a in state s leads to the state s' for which V(s') is maximal:

π(s) = argmax_a V(model(s, a))


V(s) is computed in the following way: V(s) = Σ_i w_i·exp(−d_i² / (2σ²)); hence it is an RBF.

d_i is the distance in the feature space between the center of the i-th Gaussian and the transformation of s into the feature space.

model is a model of the environment: model(s,a) is the expected next state when in state s and taking action a. The agent must use a model of the environment because it is using a value function (and not a Q function).

The Selected Features

It has already been mentioned that selecting an appropriate set of features is of great importance. The features that we chose are the following:

For each of the predators we use:

The distance from the predator to the prey

The angle between the predator and the prey

The distance of the prey from the closest corner

It is important not only to choose the correct features, but also to give them the correct weights. We found that the most important feature is the distance, followed by the angle and then the corner distance.
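For illustration, computing such a feature vector for one predator on a width × height grid might look as follows (the coordinate representation and the class name are hypothetical):

    // Feature vector for one predator: distance to the prey, angle to the prey,
    // and distance of the prey from the nearest corner of the grid.
    public class PredatorPreyFeatures {
        public static double[] compute(int px, int py, int qx, int qy, int width, int height) {
            double dx = qx - px, dy = qy - py;           // predator at (px,py), prey at (qx,qy)
            double distance = Math.sqrt(dx * dx + dy * dy);
            double angle = Math.atan2(dy, dx);
            double corner = Math.min(
                    Math.min(dist(qx, qy, 0, 0), dist(qx, qy, width - 1, 0)),
                    Math.min(dist(qx, qy, 0, height - 1), dist(qx, qy, width - 1, height - 1)));
            return new double[] { distance, angle, corner };
        }

        private static double dist(int x1, int y1, int x2, int y2) {
            double dx = x1 - x2, dy = y1 - y2;
            return Math.sqrt(dx * dx + dy * dy);
        }
    }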


Design

The Three Main Interfaces

In order to satisfy requirements 1 and 2 (see Requirements), the platform defines three main interfaces:

ISimulation interface

IAgent interface

IEnvironment interface

The ISimulation interface is implemented as part of the platform. The IAgent and IEnvironment interfaces are not implemented as part of the platform; rather, they remain as interfaces to be implemented by the user of the platform in order to satisfy requirement 3.

The following text and diagrams describe these three interfaces.

ISimulation interface

The purpose of the Simulation class is to manage the process of the agent interacting with the environment.

A UML diagram of the interface is presented and discussed:

Figure 5: ISimulation interface (methods: ISimulation(), init(), start(), steps(), trials(), setShowGUIProgress(), setShowGUIStatistics())

Discussion of the methods:

ISimulation.init(IEnvironment environment, Object[] envParams, IAgent agent, Object[] agentParams)

Initializes the simulation instance, the agent, and the environment.

The environment is an instance of the IEnvironment interface, and the array envParams contains the parameters that will be passed to the environment's init method.

The same goes for agent and agentParams.


ISimulation.start()

Starts a new trial.

The function calls Environment.start() and Agent.start(). This way the first sensation (the starting state) of the environment is forwarded to the agent and the agent returns its first action and the trial begins.

ISimulation.steps(int numOfSteps)

Runs the simulation for numOfSteps steps, starting from whatever state the environment is currently in – not necessarily the initial state. If the terminal state is reached, the simulation is immediately prepared for a new trial by calling Simulation.start(); the switch from the terminal state to the new starting state does not count as a step. Thus, this function allows the user to control the execution of the simulation by specifying the total number of steps directly.

ISimulation.trials(int numOfTrials, int maxStepsPerTrial)

Runs the simulation for numOfTrials trials, starting from whatever state the environment is in (just like steps()). Each trial can be no longer than maxStepsPerTrial steps: it begins by calling Simulation.start() and ends when the terminal state is reached or after maxStepsPerTrial steps, whichever comes first. Thus, this function allows the user to control the execution of the simulation by specifying the total number of trials directly.
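For illustration, assembling and running a simulation through this interface might look as follows (MazeEnvironment, QAgent and the contents of the parameter arrays are placeholders, not the exact classes and settings of our implementation):

    IEnvironment env = new MazeEnvironment();              // placeholder environment class
    IAgent agent = new QAgent();                           // placeholder agent class
    Object[] envParams = new Object[] { "maze1.dat" };     // whatever the environment's init expects
    Object[] agentParams = new Object[] { 0.1, 0.9 };      // e.g. learning rate and discount factor

    ISimulation sim = new Simulation();
    sim.init(env, envParams, agent, agentParams);
    sim.trials(100, 1000);                                 // 100 trials, at most 1000 steps each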

IAgent interface

The purpose of the agent is to solve an abstract problem. The problem is defined as a reinforcement learning problem: the agent searches for a policy that will produce maximal gain (maximal reinforcement) for it in some environment, not necessarily defined in advance.

The agent might be a complex class that incorporates many subclasses in order to achieve this goal, such as a Policy class, a Function Approximation class, etc. However, the interface of the Agent class, as defined by IAgent, is simple.

The IAgent interface is presented in the following UML diagram and the following text:


Figure 6: IAgent interface (methods: init(), start(), step(), save(), load())

The interface as a whole is not implemented by the platform, but is implemented by the user of the platform (henceforth – the user).

Following is a discussion of the interface’s methods.

IAgent.init(Object[] params)

In implementing this method the user can do whatever initialization the agent needs. The method is invoked by the Simulation.init() method when the simulation is initialized. Agent.init() should initialize the agent instance, creating any needed data structures. If the agent learns or changes in any way with experience, then this function should reset it to its original, naive condition.

The params array is the same array that was passed to the init method of the ISimulation class.

IAction IAgent.start(ISensation s)

This function is called at the beginning of each new trial. Agent.start() should perform any needed initialization of the agent to prepare it for beginning a new trial. As opposed to Agent.init(), this method should not delete the data structures (or the learned policy); it should only prepare the agent for the beginning of a new trial.

The input parameter s is the current state (sensation) the environment is in (normally the start state).

The method should return an action, which is the action that the agent selects, given the sensation s.

IAction IAgent.step(ISensation prevS, IAction prevA, ISensation nextS, double reward)

This is the main function for the Agent class, where all the learning takes place. It will be called once by the simulation instance on each step of the simulation. This method informs the agent that, in response to the sensation prevS and its (previously chosen) action prevA, the environment returned the payoff in reward and the sensation nextS. This function returns an action to be taken in response to the sensation nextS.

If the trial has terminated, nextS is null; in this situation the value returned by the method is ignored.


The sensation prevS and action prevA passed to one call of Agent.step() are always the same as the sensation nextS and the returned action of the previous call. Thus, there is a sense in which these arguments are unnecessary and are provided just as a convenience: they could simply be remembered by the agent from the previous call. This is permitted, and often necessary for efficient agent code (to prevent redundant processing of sensations and actions). For this to work, Agent.step() must never be called directly by the user.

void IAgent.save(String toFileName)

This method is used to save the agent to the disk.

What is actually saved might be only the policy that the agent holds.

In order to make the saving possible, all the relevant interfaces extend the Serializable interface.

void IAgent.load(String fromFileName)

This method is used to load the agent from a disk file.

What is actually loaded might be only the policy that the agent holds.

In order to make saving and loading possible, all the relevant interfaces extend the Serializable interface.
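Putting the four lifecycle methods together, a minimal agent implementation might be shaped as follows (the IPolicy field and the helper method are placeholders; only the IAgent methods follow the interface described above):

    // Skeleton of an agent; all learning is delegated to a policy object.
    public class SkeletonAgent implements IAgent {
        private IPolicy policy;

        public void init(Object[] params) {
            policy = createPolicy(params);                 // build fresh, "naive" data structures
        }

        public IAction start(ISensation s) {
            return policy.step(s);                         // first action of the trial, nothing to learn yet
        }

        public IAction step(ISensation prevS, IAction prevA, ISensation nextS, double reward) {
            policy.learn(prevS, prevA, nextS, reward);     // learning happens here
            if (nextS == null) return null;                // terminal state: the return value is ignored
            return policy.step(nextS);                     // action for the next step
        }

        public void save(String toFileName) { /* serialize the policy to toFileName */ }
        public void load(String fromFileName) { /* deserialize the policy from fromFileName */ }

        private IPolicy createPolicy(Object[] params) {
            return null;                                   // placeholder: construct the concrete policy here
        }
    }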

IEnvironment Interface

The purpose of the environment is to simulate a real-world environment.

This interface is implemented completely by the user of the platform.

There could be many kinds of environments: a maze, a Tetris game, a chess game, a robot trying to stand up, etc. However, the common interface of all environments is simple, and it is presented in the IEnvironment interface:

Figure 7: IEnvironment interface (methods: init(), start(), step())

Following is a discussion of the IEnvironment methods.

IEnvironment.init(Object[] params)

This method should initialize the instance of the environment, creating any needed data structures. If the environment changes in any way with experience, then this function should reset it to its original, naive condition. For example, if the environment is a Tetris game, then init should reset the state of the game to the initial state.


Normally the method is called once when the simulation is first assembled and initialized.

The params array is the same array that was passed to the init method of the ISimulation class.

ISensation IEnvironment.start()

The method is normally called at the beginning of each new trial. It should perform any needed initialization of the environment to prepare it for beginning a new trial, and it should return a reference to the first sensation of the trial.

double IEnvironment.step(IAction action)

This is the main function of the environment class. It will be called once by the simulation instance on each step of the simulation. This method causes the environment to undergo a transition from its current state to the next state, depending on the given action.

The function returns the payoff of the state transition as a return value.

Any data-structure updates and graphics operations should be done by the environment in this method.

ISensation IEnvironment.getSensation()

Returns the current state/sensation of the environment. If the last transition was into a terminal state, then the current sensation returned must have the special value null.

Additional Interfaces

In order for the classes that implement the three interfaces described above to communicate with each other in an orderly fashion, additional interfaces have to be defined.

These interfaces are:

ISensation interface

IAction interface

Following is a discussion of these interfaces.

ISensation Interface

A sensation, in the context of this application, is a generalization of a state. The environment might be in a certain state; the agent might perceive that exact state, but it might also be wrong and perceive a different one. The reason is that the agent does not always have perfect information about the environment. If the agent senses that the environment is in state s and the agent's sensors are perfect, then the environment really is in state s. But if the agent's sensors are not perfect (the general case), then the environment might actually be in some other state s'. For this reason we use the term sensation, rather than state, to represent the agent's point of view of the environment.

Previously we discussed MDPs. In that discussion we made an implicit assumption about the states: we assumed that the states are given to us perfectly, i.e. we have full information about the current state. In the general case this is not always true, and this is why we chose to use the term sensation rather than state. An MDP in a domain with imperfect information is referred to as a POMDP (Partially Observable MDP). The application implemented in the project is designed to deal with partially observable domains as well as fully observable ones.

A sensation is conceptually part of the environment.

The inner representation of a sensation might be a simple enumeration in the case of a discrete environment, an X and Y location in the case of a 2D environment (such as a 2D maze), or several continuous variables (as is the case in some real-world problems).

But in order to make the interface clear and sharp, only a few methods are needed in the sensation interface:

Figure 8: ISensation interface (methods: getActions(), getID())

Following is a discussion of the methods.

collection of IAction ISensation.getActions()

Given a sensation, there is a collection of actions that are applicable to it.

This is exactly what this method provides: a simple mapping s ↦ A(s); given this sensation, the method returns the actions that are applicable to it.

String ISensation.getID()

An agent must be able to identify a sensation in order to manage a policy. This method returns the ID of the sensation.

The way in which an ID of a sensation is built is fully dependent on the environment. If, for example, the environment is a simple grid-maze, then the ID might be the (X,Y) coordinates of this sensation.

The two important properties of the ID are that it must be unique and consistent. Unique means that two different sensations must not map to the same ID string. Consistent means that a sensation may never map to one string at an earlier time and to a different string at a later time; consistency can also be viewed as a deterministic mapping from the domain of sensations to the domain of strings.

A string was chosen to represent the ID because:

1. it is simple

2. it has a built-in hashCode() method

One must keep in mind that if the agent wants to use a policy that relies on function approximation, the agent must first “get to know” the domain – it needs to know what kinds of states exist in the domain and how they are represented. That is, it must be aware of what kind of objects this method returns.
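As an example, a sensation for the grid maze might be implemented along these lines (the class is hypothetical, and we assume getActions() returns a java.util.Collection of IAction):

    import java.util.Collection;

    // A hypothetical maze sensation: the ID is the cell coordinate pair.
    public class MazeSensation implements ISensation {
        private final int x, y;
        private final Collection<IAction> applicable;      // actions allowed at this cell

        public MazeSensation(int x, int y, Collection<IAction> applicable) {
            this.x = x; this.y = y; this.applicable = applicable;
        }

        public String getID() { return x + "," + y; }      // unique and consistent per cell
        public Collection<IAction> getActions() { return applicable; }
    }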

IAction Interface

IAction describes the action that an agent takes and that might (normally) have an influence on the next sensation of the environment.

The interface is straightforward:

Figure 9: IAction interface (method: getID())

Method discussion:

String IAction.getID()

An agent must be able to identify an action in order to manage a policy (just like the case with ISensation). This method returns the ID of the action.

The way in which an ID of an action is built is fully dependent on the environment. If, for example, the environment is a simple grid maze, then the ID might be one of {RIGHT, LEFT, UP, DOWN}.

The discussion about the method ISensation.getID() applies to this method too, and one must read it carefully.


Package Overview

After describing the interfaces, let us categorize them by dividing them into packages and define the interactions between the different interfaces.

Figure 10: The packages (Simulation, Agent, Environment, GUI)

The platform is divided into four main packages:

Simulation package

Agent package

Environment package

GUI package

The Simulation package uses (depends on) the Agent and Environment packages. This is possible because they have well-defined interfaces.

Following is a detailed discussion of each package.

Simulation Package

The Simulation package consists of one interface and one class only:

ISimulation interface

Simulation class

The ISimulation interface has already been defined earlier in the document (see ISimulation interface).

The Simulation class simply implements the ISimulation interface as it has already been defined.


Agent Package

The Agent package consists of three interfaces and at least three classes (at least one implementation for each interface):

IAgent interface

IPolicy interface

IFunctionApproximator interface

The IAgent interface has already been defined (see IAgent interface).

There could be many implementations of this interface, for example QAgent, TDAgent, etc.

It is the responsibility of the user of the platform to implement the agents; the implementations will be discussed in another document.

The other two interfaces, IPolicy and IFunctionApproximator, are merely a suggestion for the implementation of an agent.

IPolicy is an interface of a policy that the agent learns and uses. This is actually the heart of the agent that uses reinforcement learning as a learning algorithm.

The interface is presented below:

Figure 11: IPolicy interface (methods: step(), learn())

Method discussion:

IAction IPolicy.step(ISensation s)

The essence of a policy is: given a sensation, decide which action to take, based on that sensation. This is just what this method does – a simple mapping s ↦ a.

Of course, how the mapping is done depends on the Policy and on how much “knowledge” it has been able to gain so far.

IPolicy.learn(ISensation prevS, IAction action, ISensation nextS, double reward)

In order for the step method to be able to generate a good action, the Policy class must be able to learn.

In this method the class learns: given the previous sensation prevS and the previously chosen action action, the environment has moved to the next sensation nextS and returned the reward reward.


IFunctionApproximator is an interface whose essence is to provide an approximate mapping (as its name suggests) from one domain to another.

Of course, in an ideal world, the mapping would not be an approximation but rather a precise mapping. In this situation the class that implements the interface would consist of a large table that maps objects.

The mapping would typically be a mapping of sensations to actions.

But in many cases holding a large mapping table is not practical; in those cases one must use function approximation.

There are many ways to approximate functions: Neural Networks, Decision Trees, Gradient Descent and more. The implementer of the interface must choose one of these methods.

Figure 12: IFunctionApproximator interface (method: map())

IAction IFunctionApproximator.map(ISensation s)

This method maps sensations to actions. If the class implements a tabular mapping, then the mapping covers the visited states and their estimated values; if the class implements an approximate function, then it maps all states according to the approximation parameters learnt so far.

In order to understand the context of this class refer to Figure 13.
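For instance, a purely tabular implementation of the interface, usable while the state space is small enough, might look as follows (a sketch; the helper put() method is our own addition, not part of the interface):

    import java.util.HashMap;
    import java.util.Map;

    // An exact lookup table posing as an "approximator": sensation ID -> best known action.
    public class TabularMapping implements IFunctionApproximator {
        private final Map<String, IAction> table = new HashMap<>();

        public IAction map(ISensation s) {
            return table.get(s.getID());                   // null if the sensation was never visited
        }

        public void put(ISensation s, IAction a) {         // used while learning
            table.put(s.getID(), a);
        }
    }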


Environment Package

The Environment package encapsulates all the environment classes. These may be classes that manage the environment's logic, classes that display the environment graphically, etc.

The package exposes three interfaces to be used by the other packages. These are:

IEnvironment interface

ISensation interface

IAction interface

All three interfaces have been fully discussed in the preceding text (see IEnvironment interface, ISensation interface, IAction interface) and, as mentioned before, their implementation is left to the user of the platform.

To summarize, we present a diagram of all the main interfaces:


[UML diagram of the main interfaces:
ISimulation <<Interface>> (from Simulation): ISimulation(), init(), start(), steps(), trials(), collectData(), setShowGUIProgress(), setShowGUIStatistics()
IAgent <<Interface>> (from Agent): init(), start(), step(), save(), load()
IPolicy <<Interface>> (from Agent): step(), learn()
IFunctionApproximator <<Interface>> (from Agent): map()
IEnvironment <<Interface>> (from Environment): init(), start(), step()]

Figure 13: Interaction between the interfaces

The simulation instance has one instance of an environment (which implements IEnvironment) and one instance of an agent (which implements IAgent).

The simulation uses these instances to run the simulation.

The agent has a policy instance and uses it to decide what its next step will be. The policy, in turn, uses a function approximator to get the job done.
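To illustrate this interaction, the following sketch shows a generic agent-environment loop written around the IPolicy signatures given above. The EnvironmentModel wrapper and its methods (isGoal, applyAction, rewardOf) are hypothetical helpers invented for this sketch, since the exact IEnvironment.step() signature is defined elsewhere; this is not the platform's actual Simulation code.

    // Schematic agent-environment loop. Only the IPolicy signatures are taken
    // from this document; everything else is an assumption made for this sketch.
    public final class LoopSketch {

        // Run one episode of at most maxSteps steps.
        public static void runEpisode(IPolicy policy,
                                      ISensation start,
                                      EnvironmentModel env,
                                      int maxSteps) {
            ISensation s = start;
            for (int i = 0; i < maxSteps && !env.isGoal(s); i++) {
                IAction a = policy.step(s);                 // the policy picks an action
                ISensation next = env.applyAction(s, a);    // hypothetical transition
                double reward = env.rewardOf(next);         // hypothetical reward
                policy.learn(s, a, next, reward);           // the policy updates itself
                s = next;
            }
        }

        // Hypothetical minimal environment wrapper used only for this sketch.
        public interface EnvironmentModel {
            boolean isGoal(ISensation s);
            ISensation applyAction(ISensation s, IAction a);
            double rewardOf(ISensation s);
        }
    }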

Serialization

Serialization is Java's way of saving objects to disk files or sending them through a network connection.


Java makes it very easy for the programmer to save objects. All you have to do is declare that the class of your object implements the Serializable interface, and that the classes of all objects referenced within it implement Serializable as well. It is then possible to save a whole object to disk and load it back. When an object is saved to disk, all of the objects referenced by it are automatically saved too.

For this reason, in this application, where there is a need to save objects to disk (such as the agent's policy, the environment, and the environment's state), many of the interfaces discussed in this document extend the Serializable interface, so that the implementing classes are themselves serializable.
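As a minimal sketch, saving and loading a policy with standard Java serialization could look as follows; the PolicyStore class and the choice of file name are illustrative and not part of the platform.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;

    // Illustrative only: persisting an IPolicy with standard Java serialization.
    public final class PolicyStore {

        // Write the policy (and all objects it references) to the given file.
        public static void save(IPolicy policy, String fileName) throws IOException {
            ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileName));
            try {
                out.writeObject(policy);   // referenced objects are saved automatically
            } finally {
                out.close();
            }
        }

        // Read a previously saved policy back from the given file.
        public static IPolicy load(String fileName) throws IOException, ClassNotFoundException {
            ObjectInputStream in = new ObjectInputStream(new FileInputStream(fileName));
            try {
                return (IPolicy) in.readObject();
            } finally {
                in.close();
            }
        }
    }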

Following is a list of all classes/interfaces that implement the Serializable interface:

1. agent.IAgent

2. agent.IFunctionApproximator

3. agent.IPolicy

4. agent.QAgent.Q (an example of a class that an implementer of the IAgent interface uses, which also has to be saved to disk)

5. environment.IAction

6. environment.ISensation


Implemented Environments

To demonstrate the usage of the platform, we have implemented the following environments:

A Maze environment

A Predator & Prey environment

The following section describes the above environments in detail.

The Maze Environment

The maze environment is a 2D maze on a rectangular grid where the actor (the agent) must find the exit, which we call the goal state. During each episode the agent can get the current state ID and query the environment about the possible actions that can be taken. The possible actions are a subset of {NORTH, EAST, SOUTH, WEST}; some states offer fewer options, depending on the wall locations. Each time the agent takes a step, the maze's state changes to reflect the agent's new coordinate. If the agent takes a step that leads into a wall, its location does not change.

The implementation of the maze environment is straightforward. A matrix was used, where each cell in the matrix can contain a code representing the agent, the goal state, a wall, or an empty cell. This decision makes it easy to calculate the next state and the possible actions, and also makes it easy to create a graphic display. For simplicity of the code, the maze environment also keeps pointers to MazeSensations representing the goal state, the agent's location, and the initial state.
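For illustration only, the next-state calculation on such a matrix might look like the following sketch; the cell codes, the direction encoding, and the class name are assumptions and not the project's actual MazeEnvironment code.

    // Illustrative next-state calculation for a matrix-based maze.
    public class MazeSketch {

        static final int EMPTY = 0;   // assumed cell codes
        static final int WALL  = 1;

        private final int[][] grid;   // grid[row][col] holds a cell code

        public MazeSketch(int[][] grid) {
            this.grid = grid;
        }

        // Returns the agent's new (row, col) after moving in the given direction,
        // or the old position if the move would hit a wall or leave the grid.
        public int[] nextPosition(int row, int col, String direction) {
            int newRow = row, newCol = col;
            if ("NORTH".equals(direction)) newRow--;
            else if ("SOUTH".equals(direction)) newRow++;
            else if ("EAST".equals(direction)) newCol++;
            else if ("WEST".equals(direction)) newCol--;

            boolean inside = newRow >= 0 && newRow < grid.length
                          && newCol >= 0 && newCol < grid[0].length;
            if (!inside || grid[newRow][newCol] == WALL) {
                return new int[] { row, col };   // blocked: stay in place
            }
            return new int[] { newRow, newCol };
        }
    }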

The GridSensation class

The GridSensation class was created to allow a unified view of a grid location for environments that allow the above-mentioned four possible actions plus one additional action: STAY, which literally means stay in place. The STAY option is only used in the PredatorAndPrayEnvironment, which is discussed later.

A GridSensation is a representation of a coordinate on a grid. Each GridSensation has a unique ID, which is an enumeration of the cell it represents; that is, its ID is the cell's index when the environment's matrix is flattened into a one-dimensional array.
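For example, such a row-major enumeration could be computed as in the following small sketch (the helper class and method names are our own, for illustration only):

    // Illustrative only: converting between a (row, col) grid coordinate and
    // the unique ID obtained by flattening the matrix in row-major order.
    public final class GridIdSketch {

        // The ID of the cell at (row, col) in a grid with numCols columns.
        public static int idOf(int row, int col, int numCols) {
            return row * numCols + col;
        }

        // Recovering the coordinate from an ID.
        public static int rowOf(int id, int numCols) {
            return id / numCols;
        }

        public static int colOf(int id, int numCols) {
            return id % numCols;
        }
    }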

A GridSensation keeps a pointer to its enclosing environment to allow easy implementation of the next two methods, which are used by the agents for function approximation and environment modeling.

boolean GridSensation.isGoal()


Tests if this sensation is a goal state.

int GridSensation.distance(GridSensation otherSensation)

Returns the actual distance between this state and the other state.

The MazeSensation class

The MazeSensation is a GridSensation that does not allow the STAY action.

The GridAction class

The GridAction is nothing more than an abstraction of the above idea of five possible grid actions: NORTH, EAST, SOUTH, WEST, and STAY. One can think of this class as an enumeration of these actions.

The PredatorAndPrayEnvironment class

The predator and prey environment is a GridEnvironment just like the maze environment. On the grid there are an arbitrary number of predators (monsters) trying to catch the prey (the pacman). In other words, this is a pacman game in which the agent plays the monsters' role and tries to catch the pacman to achieve its goal. On each step the agent picks an action (a PredatorAndPrayAction, described in detail shortly), and the environment moves the monsters accordingly. At that point, if the pacman was not caught, it makes its move according to its deterministic getaway algorithm, which is encapsulated in the PacMan class. Note that the environment's step actually encapsulates two separate steps: the monsters' step and the pacman's step. This makes the game a round-robin game.

As previously stated, the agent's goal is achieved when at least one of the monsters is located on the same cell as the pacman. Each time the goal is achieved, the monsters are placed at random initial locations, and the chase starts again. This allows quicker improvement of the learned policy.

Apart from the common interface inherited from IEnvironment the PredatorAndPrayEnvironment class implements one important method which allows the agent package to implement an IEnvironmentModel in an easy manner:

Map PredatorAndPrayEnvironment.getProbableNextSensations(IAction action)

The method calculates all probable next states for the chosen action and returns a mapping from each possible next state to its probability of occurring.
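To illustrate how an agent-side environment model might use this method, the following hedged sketch computes the expected value of an action under the returned distribution; the ValueEstimate interface and the Double-valued map entries are assumptions made for this sketch.

    import java.util.Iterator;
    import java.util.Map;

    // Illustrative use of getProbableNextSensations(): computing the expected
    // value of an action under a hypothetical state-value estimate V(s).
    public final class ExpectedValueSketch {

        // Hypothetical value-function lookup, assumed for this sketch only.
        public interface ValueEstimate {
            double valueOf(ISensation s);
        }

        // Sums p(s') * V(s') over the map returned by the environment,
        // assuming keys are ISensations and values are Double probabilities.
        public static double expectedValue(Map nextSensations, ValueEstimate v) {
            double expected = 0.0;
            for (Iterator it = nextSensations.entrySet().iterator(); it.hasNext();) {
                Map.Entry entry = (Map.Entry) it.next();
                ISensation next = (ISensation) entry.getKey();
                double probability = ((Double) entry.getValue()).doubleValue();
                expected += probability * v.valueOf(next);
            }
            return expected;
        }
    }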


To allow the environment to calculate the real distances between two grid cells (remember that we are dealing with a maze, so the distances are not exactly Manhattan distances), a utility class was added to abstract this operation:

The FloydWarshall class

The FloydWarshall class implements the Floyd-Warshall algorithm to find the shortest distance between all pairs of nodes (cells) in a GridEnvironment. On creation it is given a pointer to its enclosing environment, and it calculates the shortest distances between every two cells. It also finds the maximal shortest distance on the grid for normalization purposes (this is important for the function approximation). The FloydWarshall class gives the PredatorAndPrayEnvironment a great deal of improvement, since it can use the real distances in the maze rather than relying on the Manhattan distance, which can be far from accurate.

The three main methods of the FloydWarshall class are:

int FloydWarshall.distance(GridSensation from, GridSensation to)

Returns the distance between two states (or cells) on the grid environment that created it.

int FloydWarshall.distance(int fromRow, int fromCol, int toRow, int toCol)

Returns the distance between cells on a grid environment. The cells are at the coordinates given in the parameters.

int FloydWarshall.getMaxMinDist()

Returns the maximal shortest distance between two points on the grid. In other words, this is the maximal value of distance(p1, p2) over all pairs of non-wall cells p1 and p2 on the grid.

For example, on a maze without walls this is simply the Manhattan distance between opposite corners of the grid.
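The core of the computation is the standard Floyd-Warshall triple loop. The following stand-alone sketch operates on a plain distance matrix; the INF encoding of unreachable pairs (e.g. walls) is an assumption, and this is not the project's actual FloydWarshall code.

    // Stand-alone sketch of the Floyd-Warshall all-pairs shortest path algorithm
    // on an n x n distance matrix. INF marks unreachable pairs.
    public final class FloydWarshallSketch {

        static final int INF = Integer.MAX_VALUE / 2;   // halved to avoid overflow when adding

        // dist[i][j] initially holds the edge length between cells i and j
        // (INF if there is no edge, 0 on the diagonal). On return it holds
        // the shortest distances between all pairs.
        public static void allPairsShortestPaths(int[][] dist) {
            int n = dist.length;
            for (int k = 0; k < n; k++) {
                for (int i = 0; i < n; i++) {
                    for (int j = 0; j < n; j++) {
                        if (dist[i][k] + dist[k][j] < dist[i][j]) {
                            dist[i][j] = dist[i][k] + dist[k][j];
                        }
                    }
                }
            }
        }

        // The maximal shortest distance between any two mutually reachable cells,
        // used for normalization as described above.
        public static int maxMinDistance(int[][] dist) {
            int max = 0;
            for (int i = 0; i < dist.length; i++) {
                for (int j = 0; j < dist.length; j++) {
                    if (dist[i][j] < INF && dist[i][j] > max) {
                        max = dist[i][j];
                    }
                }
            }
            return max;
        }
    }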

The PredatorAndPraySensation class

The PredatorAndPraySensation class can be thought of as a set of GridSensations. Each sensation is the coordinate of one of the actors on the grid; by actors we mean the predators and the pacman. A possible action in a PredatorAndPraySensation is then a set of actions, each virtually taken by one of the monsters (recall that there is actually only one agent, and therefore all of the monsters' actions are taken in one combined step).

The main methods in the PredatorAndPraySensation class, apart from the methods described in the interface are:


GridSensation PredatorAndPraySensation.getPacManSensation()

Returns the GridSensation that represents the pacman's location.

GridSensation[] PredatorAndPraySensation.getPredatorsSensations()

Returns an array of GridSensations that represent the monsters' locations.

boolean PredatorAndPraySensation.isGoalSensation()

Tests if this is a goal state.

The PredatorAndPrayAction class

A PredatorAndPrayAction is actually a set of GridActions taken by all the monsters in a single step. Each entry in the set is a GridAction that was chosen by one of the monsters.
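Since each joint action assigns one GridAction to every monster, the number of joint actions grows as 5^k for k monsters. The following hedged sketch enumerates these combinations; the String encoding of a single grid action and the class name are assumptions made for this sketch only.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative enumeration of all joint actions for k monsters, where each
    // monster chooses one of the five grid actions.
    public final class JointActionSketch {

        static final String[] GRID_ACTIONS = { "NORTH", "EAST", "SOUTH", "WEST", "STAY" };

        // Returns all 5^k combinations, each combination being one action per monster.
        public static List enumerate(int monsters) {
            List combinations = new ArrayList();
            combinations.add(new String[0]);                    // start with an empty prefix
            for (int m = 0; m < monsters; m++) {
                List extended = new ArrayList();
                for (int c = 0; c < combinations.size(); c++) {
                    String[] prefix = (String[]) combinations.get(c);
                    for (int a = 0; a < GRID_ACTIONS.length; a++) {
                        String[] next = new String[prefix.length + 1];
                        System.arraycopy(prefix, 0, next, 0, prefix.length);
                        next[prefix.length] = GRID_ACTIONS[a]; // extend the prefix by one monster
                        extended.add(next);
                    }
                }
                combinations = extended;
            }
            return combinations;
        }
    }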


Experimental results

The following section describes the execution results of the implemented agents on a few selected environments.

Maze Environment Results

The runs on the maze environment were quite impressive. The first maze we tested was an 11x11 maze:

The results are described in the graphs below. Note that the shortest path is 20 steps.

Figure 14: # of steps per episode for a Q agent


Figure 15: # of steps per episode for a Sarsa lambda agent.

Figure 16: # of steps per episode for a TD lambda agent.

The next maze we tested was a 20x20 maze. The shortest path on this maze was 45 steps.


And the results were:

Figure 17: # of steps per episode for a Q agent.

The results were quite similar for the Sarsa Lambda agent.


Figure 18: # of steps per episode for a TD Lambda agent.

Predator and Prey Environment Results

Since the maze results were quite good, we moved on to the Predator & Prey environment. First we tested a rather small 10x10 grid with 3 monsters.

Even on this small grid, the agents that did not use any sort of function approximation failed due to lack of memory once the policy table grew too large. We therefore present the results obtained using the RBF TD Lambda agent.


Figure 19: # of steps per episode for an RBF TD Lambda agent.

As can be seen, the RBF TD Lambda agent can be somewhat unstable. We believe this is due to the non-deterministic behavior in the first trial, which causes a large variance between different executions.

Next we tested much bigger mazes. The following maze is 29x29.

The heuristic agent achieved an average of 36 steps per episode in the good executions, while the RBF TD Lambda agent achieved an average of 49 steps per episode after ~3600 steps. An interesting phenomenon is that the RBF TD Lambda agent achieves far better results than the heuristic agent on an empty grid.

We attribute this to the features we used. Recall that one of the features was the distance from the corners; we believe this is the cause of this behavior. Had time allowed, we would have investigated this phenomenon in greater detail.

Figure 20: # of steps per episode for an RBF TD Lambda agent.


Operating manual

The following is a short description of the controls in the project's main windows, to help the user become familiar with the usage.

Environment editor

We created the environment editor to be able to quickly create test cases for the platform. The editor can create both maze initialization files and predator & prey initialization files. All you have to do is select the desired environment in the first dialog.

To create a new environment, click File → New (Ctrl-N) and select the desired size.

To save the environment, click File → Save (Ctrl-S).

To open an existing environment file, click File → Open (Ctrl-O).

To draw walls click the draw button and drag or click the mouse on the drawing panel. To erase, do the same with the right mouse button.

To add monsters, click the add monster button and click on the drawing panel in the desired places. NOTE: in the maze environment, only one monster will be selected as the agent's initial location.

To place the pacman in the predator and prey environment, click the place pacman button and click on the drawing panel where you want to place it.

To place the goal state in the maze environment, click the place pacman button and click on the drawing panel where you want to place it.


[Screenshot of the environment editor, with callouts for the File menu, the drawing panel, the draw mode button, the add monster button, and the place pacman button.]


The simulation type chooser

This dialog pops up when you run the platform's main application. It prompts the user to select an environment (the environment selector) and an agent (the agent selector).

The simulation main window

This frame displays the environment's state and some statistics:

The number of episodes (the number of times the agent achieved the goal state).

The number of steps so far.

The average number of steps per episode.

Time since startup.

The application’s controls are:

Learn button – freezes the graphic display to allow the agent to quickly learn a better policy.

Display button – resumes the graphic display of the environment state.

Delay slider – sets the delay time of the graphic display.


[Screenshot of the simulation main window, with callouts for the display panel, the learning mode button, the display mode button, the number of simulation steps, the number of complete episodes, the average steps per episode, the simulation delay slider, and the total time.]


Improvements that we have not implemented

During development we came up with many ideas for improving the implementation that we did not have the time to pursue. The list is long and quite interesting; the following are only a few of the main ideas.

Distributed implementation: implement the predators as two (or more) separate entities, each having its own policy. In this case the predators must learn to cooperate in order to catch the prey.

Add features to the function approximation.

Make the learning of the feature weights automatic.

Use a data structure for holding the base functions that has better time complexity. In the current implementation, getting the value of the approximation and updating the weights are linear in the number of base functions. A data structure such as a KD-tree could make this logarithmic in the number of base functions.

Use a different mechanism for function approximation. For example, CMAC has been successfully applied to similar problems. Neural networks have also been used successfully together with RL algorithms in the past (an example is TD-Gammon).


List of Figures

Figure 1: TD algorithm for estimating Vπ........................................................10

Figure 2: Q-learning algorithm.........................................................................13

Figure 3: sarsa algorithm.................................................................................14

Figure 4: TD(λ) algorithm that uses RBF as function approximation...............19

Figure 5: ISimulation interface.........................................................................21

Figure 6: IAgent interface................................................................................23

Figure 7: IEnvironment interface.....................................................................24

Figure 8: ISensation interface.........................................................................26

Figure 9: IAction interface...............................................................................27

Figure 10: The Packages................................................................................28

Figure 11: IPolicy interface..............................................................................29

Figure 12: IFunctionApproximator interface....................................................30

Figure 13: Interaction between the interfaces.................................................32

Figure 14: # of steps per episode for a Q agent................................................38

Figure 15: # of steps per episode for a Sarsa lambda agent.............................39

Figure 16: # of steps per episode for a TD lambda agent.................................39

Figure 17: # of steps per episode for a Q agent................................................40

Figure 18: # of steps per episode for a TD Lambda agent................................41

Figure 19: # of steps per episode for an RBF TD Lambda agent........................43

Figure 20: # of steps per episode for an RBF TD Lambda agent........................44


Bibliography

1. Sutton R. S. and Barto A. G., Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998

2. Mitchell T. M., Machine Learning, McGraw-Hill, 1997

3. Kaelbling L. P. and Littman M. L., Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, pages 237-285, 1996

4. Unemy T., Learning not to Fail – Instance-Based Learning from Negative Reinforcement, 1992

5. Krose B. and Smagt P., An Introduction to Neural Networks, University of Amsterdam, 1996

6. Gurney G., Computers and Symbols versus Nets and Neurons, UCL Press, 2001
