

FDMS with Q-Learning: A Neuro-Fuzzy Approach to PartiallyObservable Markov Decision Problems

Toygar Karadeniz & Levent Akin Department of Computer Engineering, Bogazici University, Turkey

[email protected]; [email protected]

Abstract: Finding optimal solutions to Partially Observable Markov Decision Problems is known to be NP-hard. This paper describes a novel neuro-fuzzy approach to obtain fast, robust and easily interpreted solutions by utilizing a combination of several learning techniques including neural networks, fuzzy decision making and Q-learning. Keywords: POMDP, neuro-fuzzy, rule-based systems

1. Introduction

A Partially Observable Markov Decision Process / Problem (POMDP) (Littman et al., 1994) is a general model of a single agent interacting with a partially observable environment. Although POMDPs are mainly encountered in robot navigation, the model has a large spectrum of applicability including quality control, medical diagnosis, and various planning tasks (Cassandra, 1998). Several algorithms have been developed to solve POMDPs, but finding optimal solutions to POMDP problems is NP-hard (Papadimitriou and Tsitsiklis, 1987). Using discrete state-spaces with low-dimensional input-spaces is the common choice, since smaller state-spaces can be examined more easily. The usual techniques are useful only for the smallest problems (Cheng, 1988) and are too inefficient for all others. The Witness algorithm (Cassandra et al., 1994) can efficiently solve POMDPs with only up to 16 states. A much more difficult problem is the case of continuous state spaces. There is an implicit cost for the generality and expressiveness of the POMDP framework. Another major problem with the previous approaches is that they typically give the solutions in terms of cumbersome probability distributions. In this paper, we present a novel neuro-fuzzy approach to solve POMDPs, which not only produces high quality solutions efficiently for problems with a large number of states, but also generates human-readable rules for the actions to take. The organization of the paper is as follows. In section 2 the POMDP is defined. Conventional solution techniques are discussed in section 3. In section 4, the proposed new

solution strategies and structures are presented. Sample runs and results, which are obtained by running the algorithms on the Shuttle Docking Problem (Chrisman, 1992), are given in section 5. The conclusions are given in section 6.

2. Partially Observable Markov Decision Processes

A Partially Observable Markov Decision Process (POMDP) is a Markov Decision Process (MDP) (Papadimitriou & Tsitsiklis, 1987) in which the decisions must be based solely on noisy and incomplete observations of the system's states. Although this model can be applied to a wider range of problems than the MDP model, it has received much less attention, in part because solving even the smallest POMDP is computationally demanding. A POMDP is a six-tuple defined as (S, A, T, R, Ω, O), where

S is a finite set of states the agent can be in at a discrete time step. A is a finite set of actions that constitutes the agent's choice of behaviors. T is the transition function, which gives, for each pair of states, the probability of a transition from the first state to the second one; T(s, s′) is the probability of a transition from state s to state s′. R is the reward function, which gives, for each action taken on each pair of states and according to the observation received, a value of reward or punishment. R(a, s, s′, o) is the reward/punishment when action a causes a transition from state s to state s′ and yields observation o.


S, A, T, and R together describe a Markov decision process (S, A, T, R).

Ω is a finite set of observations the agent can experience of its world. O is the observation function, which gives, for each action and resulting state, a probability distribution over possible observations; O(s′, a, o) is the probability of making observation o given that the agent performs action a and ends up in state s′.
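As an illustration of this definition, the sketch below encodes the (S, A, T, R, Ω, O) tuple as plain Python dictionaries. The field names and the tiny two-state example are ours, not part of the paper, and T is keyed by (s, a, s′) as in the standard POMDP formulation, whereas the text above writes T(s, s′).

from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action, Obs = str, str, str

@dataclass
class POMDP:
    """Container for the (S, A, T, R, Omega, O) tuple described above."""
    states: List[State]                                   # S
    actions: List[Action]                                 # A
    observations: List[Obs]                               # Omega
    T: Dict[Tuple[State, Action, State], float]           # T(s, a, s'): transition probability
    R: Dict[Tuple[Action, State, State, Obs], float]      # R(a, s, s', o): reward/punishment
    O: Dict[Tuple[State, Action, Obs], float]             # O(s', a, o): observation probability

# A deliberately tiny, hypothetical two-state example just to show the shape of the data.
toy = POMDP(
    states=["s0", "s1"],
    actions=["go"],
    observations=["seen", "nothing"],
    T={("s0", "go", "s1"): 1.0, ("s1", "go", "s1"): 1.0},
    R={("go", "s0", "s1", "seen"): 1.0},
    O={("s1", "go", "seen"): 0.7, ("s1", "go", "nothing"): 0.3},
)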

Fig. 1. POMDP agent model

According to this definition, the only difference between a POMDP and an MDP is that in a POMDP the agent is unable to observe the current state with perfect sensor data. Instead, it obtains its observations according to the action taken and the state it ends up in after that action. The goal remains the same in both, which is to maximize expected discounted future rewards. The problem of controlling a POMDP can be divided into two parts, as shown in Fig. 1. The agent makes observations and generates actions. It maintains an internal belief state, which summarizes its previous experience. The state estimator component updates the belief state according to the last action, the current observation, and the previous belief state. The policy component generates actions as a function of the agent's belief state rather than the state of the world. Among others, one obvious choice may be to define the belief state as the most probable state of the world according to past experience. Although this may be suitable for determining actions in some cases, it is not sufficient in general. In order to act effectively, an agent must also take into account its own degree of uncertainty. If it is lost or confused, it might be appropriate for it to take sensing actions such as asking for directions, reading a map, or searching for a landmark. The best choice for a belief state is, in this case, the probability distribution over the states of the world. The probability distributions provide a basis for the agent to act reasonably under the uncertainty associated with the world. Moreover, they give a sufficient statistic for both the past history and the initial belief state of the agent. If

the agent's current belief state is properly computed with no deficiencies, then its past actions and observations give no additional useful information for determining the current state of the world.
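A minimal sketch of the state estimator component described above, assuming the standard Bayesian belief update b′(s′) ∝ O(s′, a, o) Σ_s T(s, a, s′) b(s) and re-using the toy POMDP container from the previous sketch; the function name and dictionary layout are our own illustration, not code from the paper.

def update_belief(belief, action, observation, pomdp):
    """State estimator: fold the last action and current observation into the belief."""
    new_belief = {}
    for s_next in pomdp.states:
        # Probability of reaching s_next from the old belief under `action` ...
        reach = sum(pomdp.T.get((s, action, s_next), 0.0) * belief[s] for s in pomdp.states)
        # ... weighted by how likely `observation` is from s_next after `action`.
        new_belief[s_next] = pomdp.O.get((s_next, action, observation), 0.0) * reach
    total = sum(new_belief.values())
    if total == 0.0:
        return belief            # observation impossible under the model: keep the old belief
    return {s: p / total for s, p in new_belief.items()}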

3. Conventional POMDP Solution Techniques

The great generality of the POMDP model means that no single method can be expected to solve all POMDPs effectively and efficiently. Indeed, the problem of finding optimal solutions for POMDPs is undecidable (Madani, Hanks, and Condon, 1999). Why is solving POMDPs difficult? The main problem is that the agent may have a large amount of uncertainty about the current state of the environment. It can become lost very easily. So, the optimal policy must be robust enough to overcome the problems of "getting lost" and "acting when lost". According to the survey of Lovejoy (1991) on existing POMDP algorithms, many current POMDP algorithms run by constructing a finite representation of a value function over belief states, then iteratively updating this representation and expanding the horizon of the policy it implies, until a desired depth is reached. For some classes of problems (Sondik, 1971) infinite-horizon policies will have finite representations, and value functions can be obtained for these problems by expanding the horizon until the value function converges to a stable value. In practice, infinite-horizon policies can often be approximated by extremely long finite horizons even if convergence is not obtained. Regardless of whether they are run to convergence, existing exact algorithms can take an exponential amount of space and time to compute a policy, even if the policy itself does not require an exponential-size representation. Current reinforcement learning algorithms (Lovejoy, 1991) require extensive repeated searches of the state-space in order to propagate information about the payoffs available. So, much of the work done in this field uses discrete state-spaces with low-dimensional input-spaces, as the smaller state-spaces can be examined more easily. But the real goal is to extend reinforcement learning algorithms to work efficiently in large state-spaces, and this requires generalization to spread information between similar states in order to reduce the amount of information that must be collected. This is especially important in the case of continuous state-spaces, where there are effectively an infinite number of states and the probability of ever returning to exactly the same state is negligible. In these cases, generalization of experience is not only desirable, but also a necessity. Without a good generalization principle, the number of states becomes the main bottleneck of the problems.

3.1. Standard Tabular Q-Learning Q-learning (Watkins, 1989) is a kind of reinforcement learning algorithm that does not need a model of its environment and can be used on-line. Q-learning algorithms work by estimating the values of state-action pairs. The value Q(s,a) is defined to be the expected


discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value. After being initialized to arbitrary numbers, the Q-values are estimated on the basis of the agent's experience, as given in the following algorithm:

1. From the current state s, select an action a. This causes receipt of an immediate payoff r and arrival at a next state s'.

2. Update Q(s, a) based upon this experience by a small change:

Q(s, a) \leftarrow Q(s, a) + x \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]   (1)

where x is the learning rate and 0 < γ < 1 is the discount factor.

3. If convergence is reached, then return Q(s, a) for each state s and every action a.

4. Go to 1.

This algorithm is guaranteed to converge to the correct Q-values if the environment is stationary and Markovian, i.e., the next state depends only on the current state and the action taken in it. A lookup table is used to store the Q-values, so the method is generally called Tabular Q-Learning. Tabular Q-Learning is a well-known and popular algorithm, especially for solving Markovian problems. By using all possible belief states as the actual states of the problem, this learning method can also be used on POMDPs. However, it is only feasible for small-scale problems, as it produces lookup tables as solutions. Unfortunately, in the real world, agents usually face larger-scale problems, which are too complex for a lookup-table-based technique. In that case, a function approximation scheme, which replaces the table with an approximating function, is necessary. Neural networks can implement that function, as they are universal approximators (Feuring and Lippe, 1995). More information about this algorithm can be found in the studies of Watkins and Dayan (1992) and Watkins (1989).
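A compact sketch of the tabular update of Eq. 1, with epsilon-greedy action selection standing in for step 1; the environment interface (env.reset, env.step, env.actions) and all parameter values are illustrative assumptions, not part of the paper.

import random
from collections import defaultdict

def tabular_q_learning(env, episodes=500, x=0.1, gamma=0.95, epsilon=0.1):
    """Plain lookup-table Q-learning following steps 1-4 and Eq. 1."""
    Q = defaultdict(float)                      # Q[(state, action)], arbitrary (zero) initial values
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Step 1: select an action (epsilon-greedy exploration, an assumed choice).
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            # Step 2: Eq. 1 update with learning rate x and discount factor gamma.
            best_next = max(Q[(s_next, b)] for b in env.actions)
            Q[(s, a)] += x * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q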

3.2. Back-Propagation with Q-Learning This section examines the use of multi-layer neural networks to represent Q-functions. The neural networks used in Back-Propagation with Q-Learning are multi-layer perceptrons (MLPs) (Brent, 1991), trained with back-propagation. The tabular Q-learning equation is used as before; however, the Q-value update is interpreted as the output error signal used in a backpropagation-style gradient calculation of the weight changes. The weight update mechanism is given by the following equation,

\Delta w = \eta \left[ r + \gamma \max_{a \in A} Q_{t+1} - Q_t \right] \nabla_w Q_t   (2)

One of the most important issues in using neural networks is the design of the input state representation scheme. Schemes that incorporate specialized knowledge of the domain can often do better than naive representation schemes. The belief state arrays may be used as the inputs. More domain specific knowledge can also be incorporated into the input nodes.
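The following sketch illustrates the idea of Eq. 2 with a one-hidden-layer network written in plain NumPy: the temporal-difference term r + γ max_a Q_{t+1} − Q_t is used as the output error that backpropagation turns into weight changes. The network size, learning rate and belief-vector input are illustrative assumptions, not the configuration used in the paper.

import numpy as np

class MLPQ:
    """One-hidden-layer Q-network: input = belief state vector, one output per action."""
    def __init__(self, n_states, n_actions, n_hidden=16, eta=0.01, gamma=0.95, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_states))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_actions, n_hidden))
        self.b2 = np.zeros(n_actions)
        self.eta, self.gamma = eta, gamma

    def forward(self, belief):
        belief = np.asarray(belief, dtype=float)
        h = np.tanh(self.W1 @ belief + self.b1)
        return h, self.W2 @ h + self.b2          # hidden activations, Q-values for all actions

    def update(self, belief, action, reward, next_belief, done):
        belief = np.asarray(belief, dtype=float)
        h, q = self.forward(belief)
        _, q_next = self.forward(next_belief)
        target = reward if done else reward + self.gamma * np.max(q_next)
        td_error = target - q[action]            # Eq. 2: output error signal for backprop
        grad_out = np.zeros_like(q)
        grad_out[action] = td_error              # error only on the output of the chosen action
        grad_h = (self.W2.T @ grad_out) * (1.0 - h ** 2)   # backprop through tanh
        self.W2 += self.eta * np.outer(grad_out, h)
        self.b2 += self.eta * grad_out
        self.W1 += self.eta * np.outer(grad_h, belief)
        self.b1 += self.eta * grad_h
        return td_error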

4. A New Approach: Fuzzy Decision Making System

4.1. Starting Point: Q-Learning The Q-learning algorithm (Watkins, 1989) seems to be a promising approach. Though it is not adequate by itself, using back-propagation neural networks to store the information it learns would be a good choice. Algorithms that can be applied on-line, in the sense that the updates to the neural networks are made at each time step during the trial without the need to store any state-action information, may give even better results, although this approach needs a wider range of training parameter values. Reinforcement learning problems use a scalar value, called a reward or cost, received by the control system for transitions from one state to another. The main goal is to find a policy that maximizes the expected future discounted sum of these rewards, called the "return". The value function is a prediction of the return available from each state,

V(x_t) = E\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \right]   (3)

where r_t is the payoff received for the transition from state vector x_t to x_{t+1} and γ is the discount factor (0 < γ < 1). Note that V(x_t) therefore represents the discounted sum of the payoffs received from time step t onwards, and that this sum will depend on the sequence of actions taken. The control system is required to find the policy which maximizes V(x_t) in each state. Q-learning (Watkins, 1989) is based on the idea of fitting a Q-function, which is a prediction of the return associated with each action in each state. This prediction can be updated with respect to the predicted return of the next state visited,

Q(x_t, a_t) = r_t + \gamma V(x_{t+1})   (4)

We have to maximize the overall rewards received, so Eq. 4 becomes,

Q(x_t, a_t) = r_t + \gamma \max_{a \in A} Q(x_{t+1}, a)   (5)

This is the one-step Q-learning update equation, which has been shown to converge for finite-state Markovian problems when a lookup table is used to store the values of the Q-function (Watkins, 1989). Once the Q-function has converged, the optimal policy is to take the action with the highest predicted return in each state. This is called the greedy policy. It is not possible for the Q-function to learn if the action with the highest Q is chosen at all times, as, during training, other actions with lower predicted payoffs may in fact be better. The method used to select actions has a direct effect on the rate at which the reinforcement learning algorithm converges to an optimal policy. Ideally, the system should only choose to perform a non-greedy action if it lacks confidence in the current greedy prediction. Although various methods have been suggested for use in discrete state-space systems (Thrun, 1992), they are not generally applicable to systems using continuous function approximators. Learning the Q-function requires some method of storing the current predictions at each state for each action, and updating them as new information is gathered. Storing a


separate value for each state-action pair quickly becomes impractical for problems with large state-spaces, and for continuous state-spaces it is simply not possible. Therefore, there is a need for some form of function approximation, which generalizes predictions between states, and thus provides predictions even in situations that have never been experienced before. Neural networks, or Multi-Layer Perceptrons, provide such a continuous function approximation technique, and have the advantage of scaling well to large input and state-spaces (unlike, for instance, CMACs (Albus, 1981) or Radial Basis Functions).

4.2. Q-Learning With Neural Networks: Connectionist Q-Learning In order to represent the Q-function using neural networks, either a single network with one output per action is required, or separate networks each with a single output. Lin (1992) used the one-step Q-learning equation (Eq. 5) to calculate the error and update the neural network weights according to,

\Delta w_t = \eta \left[ r_t + \gamma \max_{a \in A} Q_{t+1} - Q_t \right] \nabla_w Q_t   (6)

where η is the learning constant and ∇_w Q_t is the vector of output gradients calculated by backpropagation. Watkins (1989) proposed combining Q-learning with Temporal Difference learning in order to speed up training. In this formulation, the current update error is used to adjust not only the current estimate of Q_t, but also that of previous states, by keeping a weighted sum of earlier error gradients,

\Delta w_t = \eta \left[ r_t + \gamma \max_{a \in A} Q_{t+1} - Q_t \right] \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \nabla_w Q_k   (7)

where λ weights the relevance of the current error on earlier Q-value predictions. The one-step Q-learning equation is therefore a special instance of this equation with λ = 0. A faster method of convergence was found by Lin (1992), who used the update rule with λ > 0. Moreover, he made some changes in the structure of the equations too,

\Delta w_t = \eta \left[ Q'_t - Q_t \right] \nabla_w Q_t   (8)

where

Q'_t = r_t + \gamma \left[ (1 - \lambda) \max_{a \in A} Q_{t+1} + \lambda Q'_{t+1} \right]   (9)

Eq.9 shows that each Q't depends recursively on future Q't values. This means that updating can only occur at the end of each trial. So, we must have a way to store all the state-action pairs and then present them in a temporally backward order to propagate the prediction errors correctly. This is called backward replay. Notice that this is not a TD-learning algorithm in the true sense, if the greedy policy is not followed. The temporal difference errors will not add up correctly,

\sum_{k \ge 0} (\gamma\lambda)^k \left[ r_{t+k} + \gamma \max_{a \in A} Q_{t+k+1} - Q_{t+k} \right] \ne \sum_{k \ge 0} (\gamma\lambda)^k \left[ r_{t+k} + \gamma Q_{t+k+1} - Q_{t+k} \right]   (10)

unless the action corresponding to the maximum value is performed at every time step. Using λ = 0 whenever non-greedy actions are performed can overcome this problem

(Watkins, 1989). The update rule for Q-learning combined with temporal difference methods requires λ to be zeroed on every step on which a non-greedy action is taken. Since, from the above arguments, the greedy action could in fact be incorrect (especially in the early stages of learning), zeroing the effect of subsequent predictions on those prior to a non-greedy action is likely to be more of a problem than a help in converging on the required predictions. Furthermore, as the system converges to a solution, greedy actions will be used more to exploit the policy learnt by the system, so the greedy returns will be seen anyway. Therefore, using Modified Connectionist Q-Learning (MCQ-L) would be the best choice here,

\Delta w_t = \eta \left[ r_t + \gamma Q_{t+1} - Q_t \right] \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \nabla_w Q_k   (11)

This differs from normal Q-learning in the use of the Q_{t+1} associated with the action actually selected, rather than the greedy max_{a∈A} Q_{t+1} used in Q-learning. This ensures that the temporal difference errors add up correctly, regardless of whether greedy actions are taken or not, without the need to zero λ. If greedy actions are taken, however, then this equation is exactly equivalent to standard Q-learning, and so, in the limit when exploration has ceased and the greedy policy is being followed, the updates will be the same as for standard Q-learning. MCQ-L therefore samples from the distribution of possible future returns given the current exploration policy, rather than just the greedy policy as for normal Q-learning. Therefore, the Q-function will converge to,

Q(x_t, a_t) = E\left[ r_t + \gamma \sum_{a \in A} P(a \mid x_{t+1})\, Q(x_{t+1}, a) \right]   (12)

which is the expected return given the probabilities, P(a | x), of actions being selected in each state. Consequently, at any point during training, the Q-function should give an estimation of the expected returns that are available under the current exploration policy. As it is normal to reduce the amount of exploration as training proceeds, eventually the greedy action will be taken at each step, and so the Q-function will converge to the optimal values. Peng and Williams (1994) presented another method of combining Q-learning and TD learning, called Q(λ). This is based on performing a normal one-step Q-learning update to improve the current prediction Q_t, and then using the temporal differences between successive greedy predictions to update it from there on, regardless of whether greedy actions are performed or not. This means that λ does not need to be zeroed, but it requires two different error terms to be calculated at each step. At each time step, an update is made according to the one-step Q-learning rule (Eq. 6) and then a second update is made using,

\Delta w_t = \eta \left[ r_t + \gamma \max_{a \in A} Q_{t+1} - \max_{a \in A} Q_t \right] \sum_{k=0}^{t-1} (\gamma\lambda)^{t-k} \nabla_w Q_k   (13)

Note that the summation is only up to step t − 1. As we are dealing with a continuous state-space system, both updates affect the same weights and so result in an


overall update of

\Delta w_t = \eta \left[ r_t + \gamma \max_{a \in A} Q_{t+1} - Q_t \right] \nabla_w Q_t + \eta \left[ r_t + \gamma \max_{a \in A} Q_{t+1} - \max_{a \in A} Q_t \right] e_t   (14)

where e_t = \sum_{k=0}^{t-1} (\gamma\lambda)^{t-k} \nabla_w Q_k is the accumulated eligibility trace.

This results in a slightly altered and less efficient on-line update algorithm than for either standard Q-learning or MCQ-L.
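A sketch of the MCQ-L update of Eq. 11, shown with a linear function approximator so that the gradient ∇_w Q is simply the input vector; the eligibility trace e accumulates the discounted gradients Σ_k (γλ)^{t−k} ∇_w Q_k. The linear form and parameter values are our simplification for illustration only; the paper uses the FDMS network instead.

import numpy as np

class LinearMCQL:
    """Modified Connectionist Q-Learning (Eq. 11) with a linear Q and eligibility traces."""
    def __init__(self, n_inputs, n_actions, eta=0.05, gamma=0.95, lam=0.8):
        self.w = np.zeros((n_actions, n_inputs))   # one weight vector per action
        self.e = np.zeros_like(self.w)             # eligibility traces
        self.eta, self.gamma, self.lam = eta, gamma, lam

    def q(self, x):
        return self.w @ x                          # Q(x, a) for every action a

    def begin_trial(self):
        self.e[:] = 0.0                            # traces are reset at the start of each trial

    def step(self, x, a, r, x_next, a_next, done):
        # Decay all traces and add the gradient of the current prediction (just x for a linear Q).
        self.e *= self.gamma * self.lam
        self.e[a] += x
        # TD error uses Q of the action actually selected next (not the greedy max), as in MCQ-L.
        q_next = 0.0 if done else self.q(x_next)[a_next]
        delta = r + self.gamma * q_next - self.q(x)[a]
        self.w += self.eta * delta * self.e        # Eq. 11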

4.3. Goal: Adding Fuzziness To Connectionist Q-Learning MCQ-L may still be either impractical in the case of large state-action spaces, or impossible with continuous state spaces. One promising approach is extending MCQ-L into fuzzy environments (Watkins and Dayan, 1992). The motivations are twofold:

- Fuzzy Inference Systems are universal approximators (Feuring and Lippe, 1995) and are good candidates to store the Q-values.
- Prior knowledge can be embedded into the fuzzy rules, which can reduce training significantly.

The rules have the form:

if x is S_i then action = a_i, a_i ∈ A_i
if x is S_i and action = a_t then q-value = q(S_i, a_t)

where the state S_i is defined by "x_1 is S_{i,1} and x_2 is S_{i,2} … and x_n is S_{i,n}", (S_{i,j})_{j=1}^{n} are fuzzy labels and A_i is the set of possible actions that can be chosen in state S_i. The inferred action a(x) for input vector x = (x_1, ..., x_n) and rule consequents (a_i)_{i=1}^{N} is:

a(x) = \frac{\sum_{i=1}^{N} \alpha_i(x)\, a_i}{\sum_{i=1}^{N} \alpha_i(x)}   (15)

and the corresponding Q-value:

Q(x, a(x)) = \frac{\sum_{i=1}^{N} \alpha_i(x)\, q(S_i, a_i)}{\sum_{i=1}^{N} \alpha_i(x)}   (16)

The function α_i(x) represents the truth value of rule i given the input vector x. In summary, the learning system is supposed to be able to choose one action for each rule. The jth possible action in rule i is denoted a[i, j] and its corresponding q-value q[i, j]. With competing actions for each rule, a fuzzy inference system is given as:

if x is S_i then a[i, 1] with q[i, 1] or a[i, 2] with q[i, 2] or … or a[i, j] with q[i, j]

Still, the agent is after the best right-hand-side for each rule which is the action with the best q-value. The q-values are zeroed initially and are not significant in the first stages of the learning process. In fuzzy q-learning, to explore the set of possible actions and acquire experience

through the reinforcement signals, the actions are selected using an Exploration/Exploitation Policy (EEP). The well-known Boltzmann exploration is replaced by a pseudo-stochastic or a combined directed/non-directed exploration (Caironi and Dorigo, 1994), which maximizes the knowledge gain. Let i° be the action selected in rule i using the EEP, and let i* be such that q[i, i*] = max_j q[i, j]. The actual Q-value of the inferred action, a, is:

Q(x, a) = \frac{\sum_{i=1}^{N} \alpha_i(x)\, q[i, i^{\circ}]}{\sum_{i=1}^{N} \alpha_i(x)}   (17)

and the value of state x:

V(x) = \frac{\sum_{i=1}^{N} \alpha_i(x)\, q[i, i^{*}]}{\sum_{i=1}^{N} \alpha_i(x)}   (18)

Let x be a state, a the action applied to the system, y the new state and r the reinforcement signal. The difference between the old and the new estimate of Q(x, a) can be thought of as an error signal that can be used to update the action q-values. By an ordinary gradient descent:

\Delta q[i, i^{\circ}] = \eta \cdot \Delta Q \cdot \frac{\alpha_i(x)}{\sum_{i=1}^{N} \alpha_i(x)}   (19)

where η is a learning rate and ΔQ = r + γ V(y) − Q(x, a) is the error signal described above. To summarize: MCQ-L seems to be the best method for learning with incomplete knowledge using neural networks, but it needs a base network structure that is suitable for POMDPs. Moreover, as mentioned, for solving POMDPs with a large number of states, MCQ-L must be extended to fuzzy environments. Starting from here, we propose a POMDP solver system utilizing MCQ-L with the Fuzzy Decision Making System (described in more detail in Section 4.4) as the base network. The neuro-fuzzy network of FDMS utilizes Fuzzy Q-Learning. MCQ-L with FQ-L can generalize the solution easily to similar problems and also to problems with larger numbers of states. Certainly, the generalization depends on the degree of likeness between the states.
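The sketch below puts Eqs. 17-19 together (with the Gaussian memberships of Eq. 20 as truth values) for a rule base with competing discrete actions: each rule holds one q-value per action, the inferred Q-value and state value are the truth-value-weighted averages of Eqs. 17 and 18, and the q-values are adjusted by the gradient step of Eq. 19 with ΔQ = r + γV(y) − Q(x, a). The rule layout, the simple epsilon-greedy EEP and all constants are illustrative assumptions; the blending of continuous actions (Eq. 15) is omitted here.

import numpy as np

class FuzzyQ:
    """Fuzzy Q-learning over N rules with Gaussian antecedents and competing actions."""
    def __init__(self, centers, sigmas, n_actions, eta=0.1, gamma=0.95, epsilon=0.1):
        self.centers = np.asarray(centers, dtype=float)   # one center vector per rule (N x n_inputs)
        self.sigmas = np.asarray(sigmas, dtype=float)     # one width vector per rule
        self.q = np.zeros((len(self.centers), n_actions)) # q[i, j], zeroed initially
        self.eta, self.gamma, self.epsilon = eta, gamma, epsilon

    def truth(self, x):
        """alpha_i(x): product of Gaussian memberships of the rule antecedents (Eq. 20)."""
        return np.exp(-((x - self.centers) ** 2 / (2.0 * self.sigmas ** 2)).sum(axis=1))

    def act(self, x):
        """EEP: per rule, pick the greedy action or explore; return choices, truths and Q (Eq. 17)."""
        alpha = self.truth(x)
        n_rules, n_actions = self.q.shape
        chosen = np.where(np.random.rand(n_rules) < self.epsilon,
                          np.random.randint(n_actions, size=n_rules),
                          self.q.argmax(axis=1))
        q_chosen = self.q[np.arange(n_rules), chosen]
        Q = np.dot(alpha, q_chosen) / alpha.sum()          # Eq. 17
        return chosen, alpha, Q

    def value(self, x):
        """V(x) from the best action of each rule (Eq. 18)."""
        alpha = self.truth(x)
        return np.dot(alpha, self.q.max(axis=1)) / alpha.sum()

    def update(self, alpha, chosen, Q_old, r, x_next):
        """Gradient step of Eq. 19 with error dQ = r + gamma * V(x_next) - Q_old."""
        dQ = r + self.gamma * self.value(x_next) - Q_old
        self.q[np.arange(len(self.q)), chosen] += self.eta * dQ * alpha / alpha.sum()

A typical loop would call chosen, alpha, Q = agent.act(x), apply the inferred action, observe r and x_next, and then call agent.update(alpha, chosen, Q, r, x_next).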

Fig. 2. Fuzzy Decision Making System (FDMS)


4.4. The Network The neural-network-based architecture called Fuzzy Decision Making System (FDMS) proposed in this study is shown in Fig. 2, for only three linguistic terms to simplify the drawing. The system shown has two inputs and one output; thus there are two linguistic variables for the antecedent part and one linguistic variable for the consequent part. This architecture is utilized as the base model because of its convenient internal structure. It allows each step of fuzzy inference to be fully achieved. In addition, the structure makes it possible to construct a rule base from scratch. The membership function for each linguistic variable is Gaussian, with the formula given in Eq. 20,

\mu_A(x) = \exp\left( -\frac{(x - c)^2}{2\sigma^2} \right)   (20)

where A is the fuzzy set, x is the variable, and c and σ are the center and the width of the Gaussian function. FDMS consists of two phases. In the first phase, the rule set of the system is extracted using training examples. Thereafter, the extracted rules are fed into the network with their rule weights. Here, a rule weight is a measure specifying the importance of a rule in the rule set. In the second phase, the membership parameters of the input and output variables are tuned using fuzzy backpropagation.

4.5. Clustering The centers of both the input and output membership functions are found a priori by subtractive clustering (Grabusts, 2002). Since the input variables are semi-uniformly distributed in POMDP problems, there is no need to modify the centers of the input membership functions while training the network; it is adequate to update only their widths. However, this argument is not valid for the output membership functions, whose widths and centers both have to be modified.
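A minimal sketch of subtractive clustering in the usual potential-based formulation is given below: high-potential data points are picked as centers and their influence is then subtracted from the remaining potentials. The radius, squash factor and stopping ratio are assumptions for illustration, not values from the paper.

import numpy as np

def subtractive_clustering(data, ra=0.5, rb_factor=1.5, stop_ratio=0.15):
    """Pick cluster centers by the potential method: high-potential points become centers."""
    X = np.asarray(data, dtype=float)
    alpha = 4.0 / ra ** 2
    beta = 4.0 / (rb_factor * ra) ** 2
    # Potential of each point = sum of Gaussian-weighted distances to all points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    potential = np.exp(-alpha * d2).sum(axis=1)
    centers, first_peak = [], potential.max()
    while True:
        idx = int(np.argmax(potential))
        if potential[idx] < stop_ratio * first_peak:
            break                                   # remaining potentials are too small
        centers.append(X[idx].copy())
        # Subtract the influence of the new center from every point's potential.
        potential -= potential[idx] * np.exp(-beta * ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)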

4.6. Rule Extraction As the learning rule, we have replaced the competitive learn-only-if-win approach, which allows each rule node to fire at least once, with a different approach in which only the rule nodes that contribute sufficiently to the final decision fire. The degree of 'sufficient' is given by a threshold value in the algorithm. Each rule in the rule base consists of an antecedent part, which is formed by a conjunctive clause, and a consequent part. So, each rule strength should reflect the effects of its antecedents together with its consequent. When the rules are extracted using the FDMS Rule Extraction Algorithm (Fig. 3), they are applied to layer four of the network (Fig. 2). The weights, which are calculated as a result of the algorithm, are normalized and correspond to the weights associated with the links of layer four. Moreover, rather than using a bounded sum for the rule strengths, a weighted sum, which effectively reflects the firing strength of each rule node on its consequent, is preferred. Therefore, the rule strength of the ith node in layer four, RS_i, is given as:

RS_i = \sum_{j} w_{ij} \cdot FS_j^3   (21)

where FS_j^3 denotes the firing strength of the jth node in layer three.

4.7. Rule Reduction The rules generated in the first phase of rule extraction are usually larger in number than those of the next two phases. After the second phase, the rule set is reduced to a new set by a fuzzification process on both the antecedent and the consequent parts. This fuzzification is done in accordance with the membership values for the input and output pairs of each rule. Finally, the third phase removes the redundancy in the set and we obtain the final rule set as our solution to the problem. The final set always has far fewer rules than the initial set.

4.8. Q-Learning Functionality In Q-learning, the input is a belief state (Smallwood and Sondik, 1973)(Sondik, 1971), which is actually an array whose dimension is determined by the number of states. The output is again an array, of Q-values, whose dimension is determined by the number of actions. So, FDMS takes multi-dimensional inputs and produces multi-dimensional outputs. As the clusters are defined separately for each output and each input, the size of the neural structure depends closely on the size of the problem. Adding Q-learning functionality to FDMS provides a new solution system, which can attack POMDP problems not just by solving a particular problem but also by generating the appropriate rules for it. FDMS with Q-learning is mainly an approximation system which takes a general POMDP problem as input, generates a rule base for that particular problem and then tries to learn the problem's aspects by generalizing over the rules and training samples. The samples are created online during the rule extraction phase and then fed into the learning system. The POMDP learning process is done by the Q-learning component, which is integrated with the decision-making system. The learning component of FDMS with Q-learning is implemented inside the backward and forward learning passes. The rule extraction phase and the learning phase are totally separated in FDMS with Q-learning, so the system can also be used separately as a rule base generator or as a function approximator by itself. The input is a vector with a length equal to the number of POMDP problem states. Each vector element can take real values between 0 and 1, so the inputs correspond to the POMDP belief states. The output is also a vector, containing the Q-values for each of the corresponding actions; the length of this vector equals the number of actions in that particular problem. FDMS with Q-learning approximates the Q-function by taking the belief states, which are probability distributions over states, as the inputs, and producing Q-values, which are actually points on the overall Q-function of the problem; the Q-function is, by definition, defined over actions.
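As a usage illustration of this interface: the trained network is queried with a belief vector of length |S| and returns a Q-value vector of length |A|, from which the action is read off. The fdms object and its predict method are hypothetical stand-ins for the trained FDMS network, not an API defined in the paper.

import numpy as np

def select_action(fdms, belief, actions):
    """Query FDMS with a belief state (length |S|) and pick the action with the largest Q-value."""
    belief = np.asarray(belief, dtype=float)
    assert belief.min() >= 0.0 and abs(belief.sum() - 1.0) < 1e-6   # a probability distribution
    q_values = fdms.predict(belief)        # length |A| vector of Q-values (hypothetical API)
    return actions[int(np.argmax(q_values))]

# Example: full belief in one state of an 8-state problem, three Shuttle Docking actions.
# action = select_action(fdms, [0, 0, 0, 1, 0, 0, 0, 0], ["GoForward", "Backup", "TurnAround"])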


step (i): for each pattern taken from the training data set
    for each input and output variable of this pattern
        for each associated linguistic term
            calculate the membership value

step (ii): for each input and output variable of the training set
    find the linguistic term with the maximum membership value
    put the resulting triple in the matrix M

step (iii): for each (x1, x2, y) triple taken from M
    search M for the same triple in terms of membership
    if there is one, increment the weight of the rule represented by the triple
    put the triple in the matrix RM with its new rule weight
    loop until all triples are visited

step (iv): for each (x1, x2) pair taken from RM
    choose the linguistic term of y with the maximum membership value
    put this new rule into the rule set with its weight

Fig. 3. The FDMS Rule extraction algorithm
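A sketch of the Fig. 3 procedure for the two-input, one-output case of Fig. 2: for each training pattern the best-matching linguistic term of every variable is recorded (steps i-ii), identical triples are counted as rule weights (step iii), and conflicting consequents are resolved by keeping the heaviest rule, which is one reading of step (iv). The Gaussian membership helper and data layout are our assumptions.

import math
from collections import defaultdict

def membership(value, center, sigma):
    """Gaussian membership (Eq. 20)."""
    return math.exp(-((value - center) ** 2) / (2.0 * sigma ** 2))

def extract_rules(patterns, x1_terms, x2_terms, y_terms):
    """FDMS rule extraction (Fig. 3). Each *_terms is a dict: label -> (center, sigma)."""
    def best_term(value, terms):
        # Steps (i)-(ii): linguistic term with the maximum membership value.
        return max(terms, key=lambda lab: membership(value, *terms[lab]))

    # Step (iii): count identical (x1-term, x2-term, y-term) triples as rule weights.
    weights = defaultdict(int)
    for x1, x2, y in patterns:
        triple = (best_term(x1, x1_terms), best_term(x2, x2_terms), best_term(y, y_terms))
        weights[triple] += 1

    # Step (iv): for every antecedent pair keep the consequent with the largest weight.
    rules = {}
    for (t1, t2, ty), w in weights.items():
        if (t1, t2) not in rules or w > rules[(t1, t2)][1]:
            rules[(t1, t2)] = (ty, w)
    return rules        # antecedent pair -> (consequent term, rule weight)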

Fig. 4. Graphical Representation of Shuttle Docking Problem

5. Sample Runs and Results

In order to assess the performance of the proposed algorithm, it was tested on the Shuttle Docking Problem (Chrisman, 1992). This problem was chosen as a test-bed not only because it is a well-known POMDP problem but also because it describes a mid-to-large environment where the solutions can be interpreted visually.

5.1. Shuttle Docking Problem - Definition Shuttle Docking is a simulated docking application with incomplete perception, non-deterministic actions and noisy sensors. The scenario consists of two space stations separated by a small amount of free space, with loading docks located on each station. The task is to transport supplies between the two docks. Each time the agent successfully attaches to the least-recently visited station, it receives a reward of +10. In order to dock, the agent must position itself in front of the station with its back to the dock and back up. Whenever the agent collides with the station by propelling forward into the dock, it receives a penalty of -3. At all other times, it receives zero reinforcement. Three actions are available to the agent: GoForward, Backup and TurnAround. The agent is always facing one of the stations, and TurnAround causes it to face the other. Depending on the state, the GoForward action either detaches from the loading dock, launches into free space, approaches the next station from free space or collides into the space station directly ahead. Backup is almost the inverse of GoForward, except that it is extremely unreliable. Backup launches from a station with probability 0.3. From space, it approaches a station in reverse with probability 0.8. And from docking position, it fails to dock 30% of the time. When actions fail, the agent sometimes remains in the same position, but sometimes accidentally gets randomly turned around. The agent's perception is very limited. From a station, it sees the station or only empty space, depending on the way it is facing. In free space, perception is noisy: with probability 0.7 the agent sees the forward station, otherwise it sees nothing but empty space. The two stations appear identical, except that the least-recently visited


station displays an 'accepting deliveries' sign, which is visible to the agent when the station is visible. When docked, only the interior of the dock is visible. The +10 reward is also observable for one time unit after receipt. This problem was originally presented in a formal POMDP problem format. But to give insight into the problem, we converted the formal version into a graphical form (Fig. 4). With the help of this conversion, it is much easier to determine the possible action paths (policies) that can be taken by our agent through the problem solution. Here, the states are given as nodes. The transition function is depicted along with the actions. If it is possible for our agent to go from one state to another, this is shown as an arc between these two states whose head points to the resulting state. Moreover, the arc is labeled with the name of the action that makes the transition possible for the agent. The transition probability is also given in the label of the arc, if it is not 1.

5.2. Test configuration The training sets of this study are the same for each algorithm and consist of 1000 patterns, which are created by a 'semi-uniform random exploration' of the state space (Dearden, Friedman and Russell, 1998). For BP with Q-Learning, the training parameters are given separately in Fig. 6. Throughout our study, a constant learning rate is used rather than a time-varying schedule. This gives the final network a better performance at the expense of speed. In FDMS with Q-Learning, the clusters are selected automatically by subtractive clustering. The computer used in this study is

a Pentium 4 based system operating at 1.7 GHz with 256 MB of main memory. The algorithms are implemented in ANSI C using MS Visual Studio 6.0 as the development environment. The operating system is MS Windows XP Professional Version 2002.
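Since the training patterns are generated by semi-uniform random exploration, the small helper below shows the usual form of that policy: with a fixed probability the current greedy action is taken, otherwise an action is drawn uniformly at random. The probability value is an assumption; the paper does not state it.

import random

def semi_uniform_action(q_values, actions, p_best=0.75):
    """Semi-uniform random exploration: greedy with probability p_best, uniform otherwise."""
    if random.random() < p_best:
        return max(actions, key=lambda a: q_values[a])   # exploit the current estimate
    return random.choice(actions)                        # explore uniformly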

5.3. Solutions In the following subsections, we present the solutions as sub-graphs of the full graphical representation of the Shuttle Docking Problem given in Fig. 4.

5.3.1. Tabular Q-Learning Solution Tabular Q-Learning gives a set of Q-values for the actions, indexed by each of the states. By a simple iteration through the parsed problem structure with the help of the provided Q-value set, we can present the solution of Tabular Q-Learning as a finite state representation in Fig. 5. Note that the solution does not start from the DOCKEDMRV state; it may start randomly in any of the defined states. But the solution in each case is consistent with the problem definition.

5.3.2. BP with Q-Learning Solution The BP with Q-Learning algorithm produces a solution (Fig. 6) different from the tabular version, but this new solution is also valid and consistent with the problem definition. The agent takes slightly different actions in some of the states, but eventually it manages to dock itself at the least-recently visited station starting from the most-recently visited station. Function approximation, when used with Q-Learning, can provide satisfactory

Fig. 5. Tabular Q-Learning Solution


solutions. The question is whether it is possible to get better solutions, which provide more information on larger problems.

5.3.3. FDMS with Q-Learning Solution FDMS with Q-Learning starts with 5 automatically assigned input clusters and 11 output clusters. The learning component produces the solution as an FDMS network, not just a table of values. The output shown in Fig. 7 contains a Q-value table similar to the one produced by BP with Q-Learning. This table is obtained by feeding the network with belief states showing full belief in each state. In fact, the network can give an output for any given belief state.

5.3.3.1. Rule Set In the second and third rule sets for the Shuttle Docking problem, more than one action may be defined for some states. This phenomenon comes directly from the problem definition and it simply shows that each of these actions is possible for the solution if that particular state is encountered. For example, if the agent has a full belief of being in the FRONTLRV state, it can either BACKUP or TURNAROUND. Taking either of these actions will lead it to a solution. More interestingly, if it has a full belief of the DOCKEDMRV state, it may take any defined action it wishes. The degree of multiple action possibility is controlled by a threshold, which is defined in the algorithm. In Fig. 8, this threshold is defined as zero for the third rule set and infinite for the second rule set.

So, for the third set, no multi-action outputs are defined, but this is possible in the second set.

5.3.3.2. Effect of Training Set Size on Output With training sets of different sizes, FDMS with Q-Learning gives the output values shown in Table 1, Table 2 and Table 3. In order to determine the number of input and output clusters, we performed a batch of 10 runs, each with a different training set, and computed the mean and standard deviation. The numbers of input and output clusters do not differ considerably as the number of training patterns increases (Table 1 and Table 2). This is expected, since subtractive clustering is used and the clusters are automatically assigned according to a pre-defined coverage percentage. Similarly, to determine the running times, we performed a batch of 10 runs with the same training set and computed the mean and standard deviation. The running time of learning depends only on the number of training patterns and the number of rule nodes in the network (Table 3). So, although both rule set creation and learning show almost linearly increasing running times with the increase in the number of training patterns (Fig. 10), the duration of subtractive clustering increases exponentially (Fig. 9). The important point here is that, once the rule set creation is finished and reflected into the neural structure as initial knowledge, the iterative learning process is much faster. The rule set creation process serves not only as a solution summary for the problem but also as a valuable initial knowledge collection (Fig. 8).

Fig. 6. BP with Q-Learning Solution


5.3.4. Discussion of Results For the Shuttle Docking Problem, if only the optimal route of the problem is considered, FDMS with Q-Learning produces a solution that is exactly the same as that of Tabular Q-Learning. The optimal route of the Shuttle Docking Problem is given in Fig. 11. The other two states off the path are actually trapping states of the problem.

Fig. 7. FDMS with Q-Learning Solution

Fig. 8. Rule Sets


Table 1. Number of FDMS Input Clusters for Different Sizes of Training Sets

Table 2. Number of FDMS Output Clusters for Different Sizes of Training Sets

Table 3. Durations in seconds for Different Sizes of Training Sets

Fig. 9. Duration of Subtractive Clustering

The agent should never go through these states if it is behaving rationally. But what about the solution if the agent finds itself in one of these two states at the beginning? In this case, the solution of Tabular Q-Learning is better than ours. This is because 'semi-uniform exploration' is currently being used in FDMS with Q-Learning. So, simply, our algorithm focuses on the optimal solution path for a particular problem while relatively ignoring the rest.

6. Conclusion

Robot learning is an inherently hard task, and trying to succeed in uncertain domains is harder. The problems do not have satisfying solutions in most cases. Even in the cases where they have some solutions, these solutions are cumbersome because of the probability distributions, large arrays and exponential complexity in both space and time. So, they are hardly understandable, if not intractable. In this study, we developed a learning system that finds POMDP solutions that are as powerful as the current solutions and can be generalized more easily. Generalization is crucial in NP-hard

problems, and finding optimal solutions to POMDP problems is NP-hard (Papadimitriou & Tsitsiklis, 1987). In our system, the trade-off between quality and efficiency can be adjusted at run-time, so we can produce the most efficient closest-to-optimal solution for a particular problem. In addition, as a part of the solution, we get the best trade-off choice for a particular problem/computer-system pair. Moreover, our system produces a rule base, which summarizes the solution in a simpler, understandable and tractable form.

Fig. 10. Durations of Rule Set Creation & Learning

The achievements of FDMS with Q-Learning, which have not been attained until now, are as follows:

- a rule-base approach to POMDPs
- a solution system combining Q-learning, rule extraction and fuzzy decision making
- a fuzzy and parametric input style, so that the trade-offs can be adjusted for each particular POMDP
- readable and understandable solutions, along with the trained neural network ready to run on the problem
- almost linear time complexities for the rule base construction and learning modules


- random POMDP generation and solution simulation
- a fully autonomous system with almost no user interaction

Fig. 11. Optimal Path and Rules for SDP

Some of the possible future work options for FDMS with Q-Learning are as follows:

- In the current version, the rule base is constructed before learning starts and is used as a static reference. It is possible to make the rule base dynamic so that it can be updated during the learning process.
- The decision-making component of FDMS with Q-Learning is a neural network with a special structure. This part can be replaced with another compatible learning system; genetic algorithms may be a good choice.
- Improving the structural representations of the input states may result in better performance.
- There is a possibility of improving the neural network update mechanisms according to the inherent structures of partially observable environments.
- Although FDMS with Q-Learning performs considerably well on the optimal path of POMDP solutions, it does not do so well in the non-optimal trapping states. Improvement to solve this problem is possible and relatively straightforward.

7. References

Albus, J. S. (1981). "Brains, Behaviour and Robotics." BYTE Books, McGraw-Hill, chapter 6, pp. 139-179.

Brent, R. P., (1991) “Fast Training Algorithms for Multi-Layer Nets.”, IEEE Transactions on Neural Networks, Vol. 2, No. 3, pp. 346-354.

Caironi, P. and Dorigo, M., (1994) "Training Q-Agents", Tech. Rep. IRIDIA/94-14, Université Libre de Bruxelles, Belgium.

Cassandra, A. R., (1998) “A Survey of POMDP Applications.” Working Notes: AAAI Fall Symposium on Planning with Partially Observable Markov Decision Processes, pages 17-24.

Cassandra, A. R., Kaelbling, L. P., and Littman, M. L., (1994) “Acting optimally in partially observable stochastic domains”, Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, WA.

Cheng, H. T., (1988) “Algorithms for Partially Observable Markov Decision Processes”, Ph D. Dissertation, University of British Columbia, British Columbia, Canada.

Chrisman, L., (1992) “Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach.” AAAI 1992: 183-188.

Dearden, R., Friedman, N. and Russell, S., (1998), “Bayesian Q-Learning.” In Proc.AAAI, pages 761-768.

Feuring, Th. and Lippe, W. M., (1995), “Fuzzy neural networks are universal approximators.” Proc. of VI IFSA Congress, San Paulo, Brasil, 659-662.

Grabusts, P., (2002). “Clustering Methods In Neuro-Fuzzy Modeling.” Scientific Proceedings of Riga Technical University.

Lin, L., (1992). "Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching", Machine Learning, vol. 8, pp. 293-321.

Littman, M. L., Cassandra, A. R. and Kaelbling, L. P., (1995), "Learning Policies for Partially Observable Environments: Scaling Up." Proceedings of the Twelfth International Conference on Machine Learning, 362-370.

Lovejoy, William S. (1991), "A survey of algorithmic methods for partially observable Markov decision processes." Annals of Operations Research, 28:47-66.

Madani, O., Hanks, S., Condon, A., (1999) "On the Undecidability of Probabilistic Planning and Infinite-Horizon POMDPs." AAAI-99, AAAI Press/MIT Press, 541-548.

Papadimitriou, C. H. and Tsitsiklis, J. N. (1987). “The complexity of Markov decision processes.” Mathematics of Operations Research, 12(3):441-50.

Peng, J. and Williams, R. J., (1994) "Incremental multi-step Q-learning." In Proceedings of the Eleventh International Conference on Machine Learning, Rutgers University, New Brunswick, pp. 226-232, Morgan Kaufmann.

Smallwood, R. D. and Sondik, E. J. (1973). “The optimal control of partially observable Markov processes over a finite horizon.” Operations Research, 21:1071-1088.

Sondik, E. (1971), “The Optimal Control of Partially Observable Markov Processes.” PhD Thesis, Stanford University.

Thrun, S. B. (1992), "Efficient exploration in reinforcement learning." Tech. Rep. CMU-CS-92-102, Carnegie Mellon University.

Watkins, C. J. (1989). "Learning from Delayed Rewards." PhD thesis, Cambridge University.

Watkins, C. J. and Dayan, P., (1992) "Q-learning", Machine Learning, 8(3):279-292.