


CONCEPT EXTRACTION USING TEMPORAL-DIFFERENCE NETWORK EUROCON2009

Habib Karbasian1, Majid N. Ahmadabadi1, 2 and Babak N. Araabi1, 2

1 Control and Intelligent Processing Center of Excellence, School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
2 School of Cognitive Sciences, Institute for Studies in Theoretical Physics and Mathematics, Niavaran, Tehran, Iran

Abstract: In this paper, we propose a novel framework to extract temporally extended concepts in a grid world environment using a probabilistic data structure called a temporal-difference network. First, a reinforcement-learning agent learns its environment for the task of wall following. We then train a newly introduced temporal-difference network (TDN) inside the agent in order to obtain a predictive model of the environment. Finally, the most promising action-observation sequences of the given environment are sorted out based on their probability.

Index Terms: Concept, Reinforcement Learning, MDP, POMDP, Temporal-Difference Network.

I. INTRODUCTION

In this section, we describe the preliminary knowledge of what constitutes a concept, how the agent learns the environment using its previous experiences, and what a TD network is and how we modify it to meet the needs of this paper.

A. CONCEPTS

A concept can be defined as a generalization and abstraction of low-level data. According to this rough definition, we should try to realize such abstraction in an agent that tries to understand its surrounding environment. In POMDP problems the only information the agent perceives is its sensory-action data. Thus the structure of concepts cannot consist of anything more than the actions performed by the agent and the observations gained from the environment. Concepts are therefore defined as sequences of action-observation pairs, called temporally extended sequences of action-observation, or experiences. For instance, a temporally extended action-observation sequence of length n is:

$a_1 o_1 \, a_2 o_2 \, \ldots \, a_n o_n$   (1)
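To make the notation concrete, here is a minimal sketch (illustrative only, not the authors' code) of how such an experience can be represented, using the action letters f, r, l and the binary observations that appear later in Table 1:

```python
# An "experience" of length n is a sequence of (action, observation) pairs.
# Actions: 'f' = forward, 'r' = turn right, 'l' = turn left; observations: 0 or 1.
experience = (('f', 1), ('f', 1), ('r', 1))  # e.g. follow a wall, then turn right at a corner

def as_string(exp):
    """Render an action-observation sequence compactly, e.g. 'f1f1r1'."""
    return ''.join(f"{a}{o}" for a, o in exp)

print(as_string(experience))  # -> f1f1r1
```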

B. REINFORCEMENT LEARNING

In each problem where reinforcement learning is used as the learning tool, some elements must be devised carefully and intelligently in order to have the agent behave as intended.

One of the important elements is what constitutes the agent’s state. In this paper, we need a definition of state that guides the agent to gradually collect more fruitful information from the environment and to avoid visiting places where there is no helpful information for this purpose. To make this clearer, we take the example of a wall-following robot in the grid world environment shown in Fig. 1.

The environment is an 8x8 empty room surrounded by walls. Assume that the robot has just one sensor on its left, returning 1 if there is a block (wall) on its left cell and 0 otherwise. To retrieve the concepts of this environment, which are wall and corner, the robot should scan them using its sensor. In other words, it should move in the environment in such a way that its sensor always returns 1. Therefore, the robot should avoid entering zone A, marked by diagonal lines in Fig. 1, or moving counterclockwise along a wall, because in both cases the robot always gets 0 from its sensor. The reinforcement learning approach introduced here attempts to direct the agent so that it selects better actions to perform and consequently gathers helpful information for concept learning. As a result, paths such as the one shown in Fig. 1, which do not provide any helpful information for the robot, are unlikely to be traversed. In the RL framework, an agent learns by interacting with an environment over a series of discrete time steps. At each time step t, the agent observes the state of the environment, $s_t$, and chooses an action, $a_t \in A$, which changes the state of the environment to $s_{t+1}$ and gives the agent a reward $r_{t+1}$.

Fig.1 – A simple grid world environment for the wall-following task. The robot has one sensor on its left returning 1 if there is a block on its left cell and 0 otherwise. Zone A is the area where the agent’s sensory information is not useful for concept learning, because in this area the robot’s sensor always returns 0.



An environment has the Markov property if the reward and the next state depend only on the current state and action, although this dependency may be stochastic, such that there is a fixed probability $P_{ss'}^{a}$ of going to state $s'$ when taking action $a$ in state $s$. A Markov decision process (MDP) is an environment with a set of states S, a set of actions A, and a reward $R_{ss'}^{a}$ for each triplet consisting of a state $s$, an action $a$, and a next state $s'$, and that has the Markov property. The agent’s goal is to maximize its return, or accumulated discounted reward. The agent does this by creating a policy, which is a function that maps states to actions. In stochastic environments, this policy should maximize the expected accumulated discounted reward. A common strategy is to assign a value to every state-action pair, called a Q-value, which is an estimate of the expected return for choosing that action in that state and following an optimal policy thereafter. This value can be approximated using methods such as Q-learning as follows:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$   (2)

where $\alpha$ and $\gamma$ are the learning rate and discount factor, respectively. Since the agent does not have any information about its position in the environment, we need a criterion for distinguishing the agent’s states in the environment for the reinforcement learning algorithm. We utilize the agent’s experience as this criterion. Utilizing experience, as a history of the most recent actions and observations, to specify the agent’s state offers a feature that facilitates concept learning: by applying history, the situations in the environment where the agent has the same experience are recognized as one state. Consequently, the proposed reinforcement learning categorizes the environment into classes of different experiences. Since the structure of concepts in this paper consists of the agent’s experience, this feature plays an important role in concept extraction, as described later. To specify the state of the agent in the environment by utilizing its experience, we define the state $s_t$ in the action-value function $Q(s_t, a_t)$ as follows:

$s_t = a_{t-n+1} o_{t-n+1} \, a_{t-n+2} o_{t-n+2} \ldots a_t o_t$   (3)

where n denotes the length of the history used for the state, and the $a$’s and $o$’s are actions and observations, respectively. The length of the history used as the agent’s state is four, i.e. the most recent four action-observation pairs of the agent’s experience specify the current state of the agent in the environment. This allows us not to include the Cartesian location of the agent in the state, for which generalization would not be performed properly, because we are trying to solve a POMDP problem [2].
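As a minimal sketch of how equations (2) and (3) fit together (assumed names and a tabular Q-function, not the authors' implementation), the state can be kept as a deque of the most recent action-observation pairs and updated with the standard one-step Q-learning rule:

```python
import random
from collections import defaultdict, deque

ALPHA, GAMMA = 0.9, 0.9     # learning rate and discount factor of eq. (2)
ACTIONS = ['f', 'r', 'l']   # forward, turn right, turn left
HISTORY_LEN = 4             # four action-observation pairs define the state (eq. 3)

Q = defaultdict(float)                  # Q[(state, action)] -> estimated return
history = deque(maxlen=HISTORY_LEN)     # most recent (action, observation) pairs

def current_state():
    # The state s_t is simply the tuple of the most recent pairs (eq. 3).
    return tuple(history)

def q_update(s, a, r, s_next):
    # One-step Q-learning update of eq. (2).
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def epsilon_greedy(s, epsilon=0.5):
    # Action selection used during learning (epsilon-greedy, see Section III).
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])
```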

C. TEMPORAL DIFFERENCE NETWORK

Temporal-difference (TD) networks are a formalism for expressing and learning grounded knowledge about dynamical systems. TD networks represent the state of the dynamical system as a vector of predictions about future action-observation sequences. Each prediction is an estimate of the probability or expected value of some future event. For example, a prediction might estimate the probability of seeing a particular observation at the next time step. The predictions generated at each time step are thought of as “answers” to a set of “questions” asked by the TD network. Representations that encode the state of a dynamical system as a vector of predictions are known as predictive representations [3; 4; 5]. It has been shown in [6] that one particular representation, known as linear predictive state representations (or linear PSRs), can represent any n-th order Markov model or partially observable Markov decision process (POMDP) [3; 6]. Further, they showed that the size of the PSR scales at least as well as the existing approaches; a linear PSR model is at least as compact as the equivalent POMDP or n-th order Markov model. TD networks are a generalization of linear PSRs and therefore inherit their representational power [7]. TD networks have been applied successfully to simple environments, both fully and partially observable. In the following, the conventional TD network will be described, and a new TD network will be introduced to give more generalization capability to the conventional one.

In the following we describe a conventional TDN without history in its structure. For the general specification of TD networks we direct the reader to the original work [1]. The problem addressed by TD networks is the general one of learning to predict aspects of the interaction between a decision-making agent and its environment (a dynamical system). At each of a series of discrete time steps t, an agent takes an action $a_t \in A$ and the environment responds by generating an observation $o_t \in O$. In this work, we consider TD networks with two observations, 0 and 1. The action and observation events occur in sequence, $a_1 o_1 a_2 o_2 a_3 o_3 \ldots$. This sequence is called experience. We are interested in predicting not just each next observation, but more general, action-conditional functions of future experience. The focus of the current work is on partially observable environments, i.e. environments where the observation $o_t$ is not a sufficient statistic for making optimal predictions about future experience ($o_t$ does not uniquely identify the state of the environment).



We will use TD networks to learn a model of the environment that is accurate and can be maintained over time. A TD network is a network of nodes, each representing a single scalar prediction. The nodes are interconnected by links representing target relationships between predictions, observations, and actions. These nodes and links determine a set of questions being asked about the data and predictions, and accordingly are called the question network. Each node of the TD network is a function approximator that outputs a prediction using inputs such as the current observation, the previous action, and the predictions made at the previous time step. This computational part of the TD network is thought of as providing the answers to the questions, and accordingly is called the answer network. Fig. 2 shows a typical question network. The question of node $y_1$ at time t is ‘If the next action is a1, what is the probability that the next observation $o_{t+1}$ will be 1?’. Similarly, node $y_2$ asks ‘If the next action is a1, what will node $y_1$ predict at time $t+1$?’. This is a desired relationship between predictions, but also a question about the data. We can unroll this relationship to see how $y_2$ relates to the data, which yields the question ‘If the next two actions are a1, what is the probability that $o_{t+2}$ will be 1?’. In a fully observable (Markov) environment, it is natural for the question network to be a set of questions that are in some way interesting to the experimenter. In partially observable environments the structure of the question network has additional constraints; the answers should be a sufficient statistic that accurately represents the state and can be updated as new data become available. The question of how to discover a question network that expresses a minimal sufficient statistic is important. Formally, $y_i^t \in [0, 1]$, $i = 1, \ldots, n$, denotes the prediction for node $i$ at time step $t$. The column vector of predictions $y_t = (y_1^t, \ldots, y_n^t)^T$ is updated according to a vector-valued prediction function with modifiable parameter $W$:

$y_t = \sigma(W_t x_t)$   (4)

This prediction function corresponds to the answer network, where $x_t \in \mathbb{R}^m$ is a feature vector, $W_t$ is an $n \times m$ matrix of weights, and $\sigma$ is the n-vector form of either the identity function or the S-shaped logistic function $\sigma(s) = \frac{1}{1 + e^{-s}}$. The feature vector is a function of the preceding action, observation, and node values. The modifiable parameters in $W$ are updated with the gradient-descent algorithm shown in equation (5).

$\Delta w_{ij}^t = \alpha \, (z_i^t - y_i^t) \, c_i^t \, \frac{\partial y_i^t}{\partial w_{ij}^t}$   (5)

where $\alpha$ is a step-size parameter, $z_i^t$ is a target for $y_i^t$ defined by the question network, and $c_i^t \in \{0, 1\}$ corresponds to whether the action condition for $y_i^t$ was met at time $t$.
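The following is a minimal sketch of one prediction-and-update step of the answer network, equations (4) and (5), for the logistic choice of $\sigma$; the function and variable names are ours, not the paper's:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def tdn_step(W, x, z, c, alpha=0.1):
    """One answer-network prediction (eq. 4) and weight update (eq. 5).

    W : (n, m) weight matrix W_t
    x : (m,)   feature vector x_t built from the previous action, the current
               observation and the previous predictions
    z : (n,)   targets z_i^t supplied by the question network
    c : (n,)   action conditions c_i^t in {0, 1}
    """
    y = sigmoid(W @ x)                              # eq. (4): y_t = sigma(W_t x_t)
    # eq. (5): dw_ij = alpha * (z_i - y_i) * c_i * dy_i/dw_ij,
    # where dy_i/dw_ij = y_i * (1 - y_i) * x_j for the logistic sigma.
    grad = (y * (1.0 - y))[:, None] * x[None, :]
    W = W + alpha * ((z - y) * c)[:, None] * grad
    return y, W
```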

One special thing to note here is that the action selection algorithm for the TD network is nothing but a uniform random selector. Therefore no action has any priority over the others. In the case shown in Fig. 2, the probability of selecting $a_1$ is equal to that of selecting $a_2$ ($P(a_1) = P(a_2) = 0.5$).

II. NEW TEMPORAL DIFFERENCE NETWORK

We have described the whole idea of the temporal-difference network as a predictive model of the problem. In order to enrich the structure of the TD network, we also use the history-based feature vector of [8]. In Fig. 3(a) a conventional TD network is illustrated. In this problem the agent has only two possible actions, turn left and turn right, and the TD network has a depth of only two. For instance, one node answers the question of what the probability of seeing observation 1 is if the agent takes the action turn left, while a depth-two node tells us how probable it is that the agent perceives observation 1 if it takes two turn-left actions consecutively. The remaining predictive nodes have analogous interpretations. These answers are adequate when we do not need to know what the intervening observations are. If the agent interacts with an environment in which there are only two observations, {0, 1}, and the target observation of the TDN is for example 1 (o = 1), then when a predictive node answers the aforementioned question it cannot tell what the agent has seen just before it perceives the goal observation. We therefore augment the conventional TD network so that it is able to retain intervening observations along the path to the goal observation. Fig. 3(b) shows the augmented TD network, which is capable of retaining intervening observations.

Fig.2 – Symmetric action-conditional question network. The network forms a symmetric tree with a branching factor of |A|. This example has depth d = 4. Some of the labels have been left out of this diagram for clarity; each of these nodes should have a label $y_i$ and each is conditioned on some action.



As further explanation, the predictive node $y_i$ in Fig. 3(b) answers the question of what the probability of seeing $o_{t+2} = 1$ is when the first action taken is R, the intervening observation is 0, and the second action is L, as denoted mathematically below:

$y_i^t = P\{\, o_{t+2} = 1 \mid a_{t+1} = R,\; o_{t+1} = 0,\; a_{t+2} = L \,\}$   (6)

In order to modify the conventional TD network’s formulas to meet this condition, we first recall the main equations of the TD network:

$y_t = \sigma(W_t x_t)$   (7)

$\Delta w_{ij}^t = \alpha \, (z_i^t - y_i^t) \, c_i^t \, \frac{\partial y_i^t}{\partial w_{ij}^t}$   (8)

$c_i^t = c_i(a_t) \in \{0, 1\}$   (9)

Recall that $c_i^t$ is a condition indicating whether the action required by the corresponding node $y_i$ has been taken at time $t$. To have this structure conditioned on intervening observations as well, we introduce another condition into the formula, based on observations. The new update formula of the TD network takes the following form:

$\Delta \hat{w}_{ij}^t = \left[ \alpha \, (z_i^t - y_i^t) \, c_i^t \, \frac{\partial y_i^t}{\partial w_{ij}^t} \right] \cdot \mathbf{1}(o_{t+1} = o_i)$   (10)

The observation condition $\mathbf{1}(o_{t+1} = o_i)$ indicates whether the observation perceived at the next time step, $t+1$, matches the intervening observation required by the corresponding node $y_i$ at time $t$. This ensures that the observations seen along the path to the target observation are taken into account.
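A minimal sketch of the modified update of equation (10) (names assumed): relative to equation (5), the only change is an extra observation gate that zeroes the update whenever the intervening observation required by node i was not actually perceived:

```python
import numpy as np

def augmented_tdn_update(W, x, y, z, c, obs_cond, alpha=0.1):
    """Weight update of eq. (10): the update of eq. (5) gated by observations.

    obs_cond : (n,) indicator per node, 1 if o_{t+1} matches the intervening
               observation required by node i, 0 otherwise.
    """
    gate = c * obs_cond                              # action AND observation condition
    grad = (y * (1.0 - y))[:, None] * x[None, :]     # dy_i/dw_ij for the logistic sigma
    return W + alpha * ((z - y) * gate)[:, None] * grad
```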

III. IMPLEMENTATION AND RESULTS

In this section, we present how to extract temporally extended concepts with the aid of reinforcement learning and our modified TD network, in the environment depicted in Fig. 4, in order to attain some meaningful concepts of the environment based on the task of wall following. We encourage the reader to consider a similar task for a human: navigating around a room with the lights off. At all times you can feel whether you are touching a wall beside you or have hit a wall in front of you, and otherwise nothing. If the room were the map shown in Fig. 4, what sorts of patterns would be formed in your mind?

The agent’s location in the environment is

represented by the triangle in Fig. 4. The agent has two perceptual inputs: one sensor on its left indicating whether there is a block on the left side (|observation| = 2: wall and free space) and another sensor on the front indicating whether the agent collides with a block while moving forward. The agent also has a limited action space; it can either attempt to move forward or rotate 90 degrees clockwise (turn right, R) or counterclockwise (turn left, L) (|action| = 3).

The reason for using a sensor to detect collisions is that in the environment shown in Fig. 4 there are situations the reinforcement learning agent would need a longer history to distinguish if it did not have such a sensor. Since a longer history reduces the speed of the specialized reinforcement learning, a collision detector is used, and it dominates the information coming from the other sensor whenever a collision happens. Therefore, the agent receives sensory information from only one of its sensors at each time slice: it either senses a collision through the collision detector or gets information about its left side from the other sensor.


Fig.3 – (a) A conventional TDN where the agent has only two possible actions, R and L. As discussed in the text, this TD network is unable to retain the intervening observations along the path. (b) The augmented TD network, which has the capability of tracking the intervening observations.

Fig.4 – The environment used for our proposed framework. The triangle shows the current position of the agent.



All parameters used in the reinforcement learning for this environment are as follows:

Reinforcement Learning Settings:
Episodes = 50000
Steps in each episode = 50
Alpha = 0.9
Alpha decay power = 0.45
Lambda = 0.9
Actions = 3 (forward (F), turn right (R), turn left (L))
State = four pairs of previously performed actions and observations
Action selection algorithm = epsilon-greedy
Epsilon = 0.5
Epsilon decay power = 0.999

Reward-Punishment Policy for Learning:
1- Turn left then turn right, or turn right then turn left = -60
2- Four successive turns in one direction = -55
3- Two turns left then two turns right, or two turns right then two turns left = -50
4- Open space on the left side = -30
5- Collision = -10
6- Block on the left side = +50

It is interesting to mention here that the reward-punishment policy for learning is written from high to low priority. This means, for instance, that if the agent has made four right turns in a row and there is still a block on its left side, rule number 2 will be applied as a punishment rather than rule number 6 as a reward.
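As an illustration only (the helper names and the exact rule matching are our assumptions), the prioritized policy above can be read as a first-match-wins rule list over the most recent actions and the current sensor readings:

```python
def step_reward(recent_actions, collided, wall_on_left):
    """Return the reward for the current step.

    Rules are checked from highest to lowest priority; the first rule that
    matches determines the reward, as described in the text.
    recent_actions is a list of action letters, e.g. ['f', 'r', 'r', 'l'].
    """
    last2, last4 = recent_actions[-2:], recent_actions[-4:]

    if last2 in (['l', 'r'], ['r', 'l']):                                   # rule 1
        return -60
    if len(last4) == 4 and len(set(last4)) == 1 and last4[0] in ('l', 'r'): # rule 2
        return -55
    if last4 in (['l', 'l', 'r', 'r'], ['r', 'r', 'l', 'l']):               # rule 3
        return -50
    if not wall_on_left and not collided:                                   # rule 4
        return -30
    if collided:                                                            # rule 5
        return -10
    return +50                                                              # rule 6
```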

A TD network is used to form a predictive model of the surrounding environment. This TD network is the modified network described in the previous section. All parameters used in the TD network for this environment are as follows:

Temporal Difference Network (TDN) Learning Settings:
Branch factor (b) = 6 (|Actions| x |Observations| = 3 x 2)
TD network depth (d) = 3 (the number of levels of the network)
History depth = 6 (previously perceived observations used for the feature vector of the TD network)
Steps to learn TD network = 100000
Steps to extract concepts = 10000
Number of predictive nodes = 129 ($\sum_{k=1}^{d} |A|\, b^{k-1} = 3 + 18 + 108 = 129$)

The target observation of the TD network is whether a wall can be on the left side of the agent (o = 1). All intervening observations in the structure of the TD network are checked with the left sensor of the agent.
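One way to read this node count (our reconstruction, consistent with the concept strings in Table 1) is that every question is an action-observation sequence of length 1 to d whose final observation is fixed to the target o = 1, while every intervening pair ranges over all 3 x 2 = 6 combinations. A short sketch that enumerates the questions and confirms the count:

```python
from itertools import product

ACTIONS = ['f', 'r', 'l']
OBSERVATIONS = [0, 1]
DEPTH = 3
TARGET_OBS = 1            # the target observation: wall on the left

def question_nodes():
    """Enumerate question-network nodes as action-observation strings."""
    nodes = []
    for depth in range(1, DEPTH + 1):
        # every intervening pair branches over all action-observation combinations
        for prefix in product(ACTIONS, OBSERVATIONS, repeat=depth - 1):
            for last_action in ACTIONS:   # the last pair ends in the target observation
                pairs = list(zip(prefix[0::2], prefix[1::2])) + [(last_action, TARGET_OBS)]
                nodes.append(''.join(f"{a}{o}" for a, o in pairs))
    return nodes

nodes = question_nodes()
print(len(nodes))    # -> 129  (3 + 18 + 108)
print(nodes[:3])     # -> ['f1', 'r1', 'l1']
```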

The procedure of our proposed framework is as follows. First, the agent learns the environment through its experience for the task of wall following. The detailed structure of the agent has been given in the previous sections. After a certain number of episodes, the agent is able to perform its task desirably, according to the graph shown in Fig. 5.

At this stage, we need to obtain a predictive model of the environment. To do so, the newly devised TD network explained in the previous section is put into action for the specified number of training steps. Finally, to extract concepts, we move the agent for some steps in the environment to see which of these temporally extended sequences of actions-observations are more promising than the others. In this last stage the agent must traverse the environment, but how? At this point the agent has a Q-table in its memory and a predictive model of its surrounding environment. As mentioned before, the action selection of the TD network is based on a uniform random selector. But at this stage of the framework, where we need to differentiate sequences of actions-observations, we have the agent take actions based on its learned Q-table and update its TD network along the way. To put it another way, we have an agent following the wall, i.e. observing the wall on its left-hand side, and updating its TD network at each step. How, then, can we extract the promising concepts of the environment? As calculated above, the trained TD network has 129 predictive nodes, that is, 129 sequences of action-observation, or 129 temporally extended concepts. While the agent is travelling alongside the wall, the TD network’s predictive nodes tell us how likely the agent is to see the wall on its left-hand side. So, to extract the fruitful concepts among the others, we maintain the average of each predictive node while the agent is travelling along the wall.
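The extraction loop itself can be sketched as follows (env, agent, tdn and their methods are assumed interfaces, not the paper's code): the agent acts greedily from its Q-table, the TD network is updated at every step, and a running average of every predictive node is kept and finally sorted:

```python
import numpy as np

def extract_concepts(env, agent, tdn, node_labels, steps=10000, top_k=15):
    """Rank concepts by the average prediction of their TD-network nodes.

    node_labels[i] is the action-observation string of predictive node i,
    e.g. 'f1f1r1'; the returned ranking is analogous to Table 1.
    """
    sums = np.zeros(len(node_labels))
    for _ in range(steps):
        action = agent.greedy_action()          # act according to the learned Q-table
        observation = env.step(action)
        y = tdn.update(action, observation)     # current predictions y_t of all nodes
        sums += y
    averages = sums / steps
    ranking = sorted(zip(node_labels, averages), key=lambda p: -p[1])
    return ranking[:top_k]
```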

Fig.5 – The graph depicting the average expected reward of the reinforcement learning for the task of wall following (x-axis: episodes, y-axis: average expected reward).



Finally, this method gives us 129 average values indicating how probable each predictive node has been while the agent was following the wall. Table 1 shows the 15 most promising sequences of actions-observations for this particular task and environment, and most of them are illustrated in Fig. 6 for better visual perception.

Table 1. The 15 most probable concepts of the environment for the task of wall following, sorted in decreasing order of their average probability values.

Index   Concept   Average Probability
1       f1f1f1    0.723
2       r1f1f1    0.709
3       f1f1r1    0.685
4       f1f1      0.669
5       f0f1r1    0.641
6       r0f1r1    0.628
7       f1r1      0.615
8       l0f1r1    0.594
9       l1f1r1    0.590
10      r0f1f1    0.573
11      r1f1r1    0.571
12      l0f1f1    0.531
13      f1        0.523
14      l1f1f1    0.522
15      r1f1f1    0.521

As can be seen in Fig. 6, these concepts are meaningful to a human, such as wall following in a straight line, turning right to pass a right corner, or turning left to traverse a left corner, and so on. These surprising results show that a TD network, together with a reinforcement learning agent, can detect the sequences of actions-observations that are superior to the others in terms of the reward-punishment policy of the reinforcement learning.

Fig.6 – Visualization of some of the concepts obtained in Table 1.

IV. CONCLUSION

In this paper, our main goal has been to realize a framework that can extract the most rewarding sequences of actions-observations using a probabilistic data structure called a TD network. The first phase of the framework is to have an agent learn in a certain environment and build its own Q-table based on a reward-punishment policy. It is worth repeating that the state of the agent is composed of previously taken actions and perceived observations. The next step is to form a predictive model of the environment with the aid of a TD network. In order to incorporate intervening observations into the sequence of actions, we introduced an alteration in the structure of the TD network. This new TD network allows the previously perceived observations to be retained in the predictive model. Finally, we have the learned agent move around the environment based on its previously formed Q-table while maintaining and updating its TD network for a certain number of steps in order to extract fruitful concepts. The agent has some restrictions on its sensory-action capabilities: it has two sensors, on the left and front sides, where the front sensor is only used to detect collisions. The extracted concepts fairly highlight the features of the environment as patterns of actions and observations.

In the near future, we intend to utilize the proposed approach as a tool toward achieving an intelligent decision maker that can learn a designated task faster and more efficiently. In the next step, the knowledge created in this framework will be used to help a reinforcement learning agent learn the same task in another environment in a more rewarding manner.

V. ACKNOWLEDGEMENT

The authors would like to express their gratitude to Farzad Rastegar, Eddie Rafols and Brian Tanner for their great assistance with this work.

REFERENCES

[1]. Richard S. Sutton and Brian Tanner. Temporal-difference networks. In "Advances in Neural Information Processing Systems 17", Cambridge, MA, 2005, MIT Press, pp. 1377–1384.

[2]. Farzad Rastegar and Majid N. Ahmadabadi. "Grounding Abstraction in Sensory Experience". In "IEEE/ASME Int. Conf. on Advanced Intelligent Mechatronics (AIM)", Zurich, Switzerland, August 2007.

[3]. Michael L. Littman, Richard S. Sutton, and Satinder Singh. Predictive representations of state. In "Advances in Neural Information Processing Systems 14", Cambridge, MA, 2002, MIT Press.

[4]. Matthew Rosencrantz, Geoff Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In “ICML ’04: Twenty-first international conference on Machine learning”, ACM Press, 2004.

[5]. Herbert Jaeger. Discrete-time, discrete-valued observable operator models: a tutorial. Technical report, German National Research Center for Information Technology, 1998.

[6]. Satinder Singh, Michael R. James, and Matthew R. Rudary. Predictive state representations: A new theory for modeling dynamical systems. In “Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference”, 2004, pp.512–519.

[7]. Satinder Singh. Private communication, 2004.

[8]. Richard S. Sutton and Brian Tanner. Temporal-difference networks with history. In "Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence", Edinburgh, UK, 2005, pp. 865–870.
