
Fuzzy Q-Learning and Dynamical Fuzzy Q-Learning

Pierre Yves Glorennec
Département d'Informatique

INSA de Rennes
glorennec@irisa.fr

December 8, 1993

Abstract

This paper proposes two reinforcement-based learning algorithms. The first, named Fuzzy Q-Learning, is an adaptation of Watkins' Q-Learning for Fuzzy Inference Systems. The second, named Dynamical Fuzzy Q-Learning, eliminates some drawbacks of both Q-Learning and Fuzzy Q-Learning. These algorithms are used to improve the rule base of a fuzzy controller.

Keywords: Q-Learning, reinforcement, Fuzzy Inference Systems, relative viability zone, process control, rule quality.

1 Introduction

In the reinforcement learning paradigm, an agent receives from its environment a scalar reward value called reinforcement. This feedback is rather poor: it can be boolean (true, false) or fuzzy (bad, fair, very good...), and, moreover, it may be delayed. A sequence of control actions is often executed before receiving any information on the quality of the whole sequence. Therefore, it is difficult to evaluate the contribution of one individual action.

This Credit Assignment Problem has been widely studied since the pioneering work of Barto, Sutton and Anderson [2]. The whole methodology is called the Temporal Difference method [13] and comprises a family of algorithms. Recently, Watkins [14] proposed a new algorithm of this family, Q-Learning. This paper proposes two fuzzy versions of Q-Learning, respectively called Fuzzy Q-Learning (FQL) and Dynamical Fuzzy Q-Learning (DFQL).

Section 2 describes the Q-Learning method and Section 3 presents a general class of Fuzzy Inference Systems (FIS). FQL is presented in Section 4 and DFQL in Section 5. The effectiveness of the proposed methods is shown with an example in Section 6.

2 Q-Learning

Q-Learning is a form of competitive learning which provides agents with the capability of learning to act optimally by evaluating the consequences of actions. Q-Learning keeps a Q-function which attempts to estimate the discounted future reinforcement for taking actions from given states. A Q-function is a mapping from state-action pairs to predicted reinforcement. In order to explain the method, we adopt the implementation proposed by Bersini [5].

1. The state space, $U \subset \mathbb{R}^n$, is partitioned into hypercubes or cells. Among these cells we can distinguish:

(a) one particular cell, called the target cell, to which the quality value +1 is assigned,

(b) a subset of cells, called viability zone, that the process must not leave. The quality value for the viability zone is 0. This notion of viability zone comes from Aubin [1] and eliminates strong constraints on a reference trajectory for the process,


(c) the remaining cells, called failure zone, with the quality value -1.

2. In each cell, a set of J agents compete to control the process. With M cells, agent $j$, $j \in \{1,\dots,J\}$, acting in cell $c$, $c \in \{1,\dots,M\}$, is characterized by its quality value $Q[c, j]$. The probability that agent $j$ will be selected in cell $c$ is given by a Boltzmann distribution.

3. The selected agent controls the process as long as the process stays in the cell. When the process leaves cell $c$ to enter cell $c'$ at time step $t$, another agent is selected for cell $c'$ and the Q-function of the previous agent is incremented by:

$$\Delta Q[c, j] = \alpha \left\{ r(t) + \gamma \max_{k} Q[c', k] - Q[c, j] \right\} \qquad (1)$$

where $\alpha$ is the learning rate ($\alpha < 1$), $\gamma$ the discount rate ($\gamma < 1$) and $r(t)$ the reinforcement.

$$r(t) = \begin{cases} +1 & \text{if } c' \text{ is the target cell (reward)} \\ -1 & \text{if } c' \text{ is in the failure zone (punishment)} \\ \;\;0 & \text{if } c' \text{ is in the viability zone} \end{cases} \qquad (2)$$

It is shown in [3] that $Q[c, j]$ tends towards the discounted cumulative reward $R_t$:

$$R_t = \sum_{n=0}^{\infty} \gamma^{n}\, r_{t+n} \qquad (3)$$
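A minimal Python sketch of this scheme may help fix the ideas: it implements the Boltzmann selection of step 2, the reinforcement of equation (2) and the update of equation (1) for a tabular Q-function. The sizes M and J, the cell encoding, the temperature and the learning parameters are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Illustrative sizes; the paper leaves M (cells) and J (agents) unspecified.
M, J = 64, 4
ALPHA, GAMMA, TEMPERATURE = 0.2, 0.95, 0.5

Q = np.zeros((M, J))          # Q[c, j]: quality of agent j in cell c

def select_agent(c):
    """Boltzmann (softmax) selection of an agent for cell c (step 2)."""
    p = np.exp(Q[c] / TEMPERATURE)
    p /= p.sum()
    return np.random.choice(J, p=p)

def reinforcement(c_next, target_cell, viability_zone):
    """Reinforcement r(t) of equation (2)."""
    if c_next == target_cell:
        return 1.0
    if c_next in viability_zone:
        return 0.0
    return -1.0               # failure zone

def update(c, j, c_next, r):
    """Q-Learning update of equation (1), applied when the process leaves cell c."""
    Q[c, j] += ALPHA * (r + GAMMA * Q[c_next].max() - Q[c, j])
```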

This method is very attractive because it performs an on-line, model-free optimization of a control policy, enabling the best operator to be identified at any time. Unfortunately, Q-Learning is slow, and several authors have proposed extensions that reduce the search space in order to accelerate learning. For example, Lin [10] proposes experience replay and teaching. Whitehead [16] adds a biasing function to the Q-function, with an external teacher providing shorter-latency feedback. Clouse and Utgoff [6] propose a teaching method with a teacher interacting in real time during learning. McCallum [11] uses transitional proximity: the states closest to the current state receive correspondingly more of the current reward.

3 Fuzzy Inference Systems

We consider a FIS with N fuzzy rules of the form:

$$\text{rule } i:\ \text{if } x_1 \text{ is } A_1^{(i)} \text{ and } x_2 \text{ is } A_2^{(i)} \ \dots\ \text{and } x_n \text{ is } A_n^{(i)} \text{ then } y \text{ is } B^i \qquad (4)$$

where $(A_k^{(i)})_{k=1,\dots,n}$ and $B^i$, $i = 1$ to $N$, are fuzzy subsets characterized by linguistic labels (e.g. "small", "medium", "positive large", ...) and by a function $x \mapsto \mu_A(x) \in [0, 1]$, called a membership function, quantifying the membership degree of $x$ to $A$.

This FIS is characterized by:

- Any parameterizable input membership function. The function parameters must allow dilation and translation. We suppose that each output membership function has one point, $b^i$, such that $\mu_{B^i}(b^i) = 1$.

- Any conjunction operator (a generalization of the boolean AND), e.g. minimum, product or Lukasiewicz conjunction.

- Product inference.

- Centroid defuzzification.

For an input vector, $x$, the inferred output, $y$, is:

$$y = \frac{\sum_{i=1}^{N} \alpha_i(x)\, b^i}{\sum_{i=1}^{N} \alpha_i(x)} \qquad (5)$$

where $\alpha_i(x)$ is the strength of rule $i$. It is shown in [8] that such FIS are universal approximators and that they can learn from examples.

The most important features of FIS are that they can incorporate human prior knowledge into their parameters and that these parameters have a clear physical meaning. Tuning a FIS consists in going from qualitative to quantitative: how can the parameters be tuned so that the FIS performs a desired nonlinear mapping? In the following section, we show that an adaptation of Q-Learning enables the best rule base among a set of possible rule bases to be found.
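The inference of equation (5) can be sketched in Python as follows, with triangular input memberships crossing at grade 0.5, product conjunction and weighted-average (centroid) defuzzification. The partition layout, the number of fuzzy sets and the example values are assumptions made for illustration.

```python
import numpy as np

def triangular_partition(domain, n_sets):
    """Centers of a strong fuzzy partition of [lo, hi]: triangles crossing at 0.5."""
    lo, hi = domain
    return np.linspace(lo, hi, n_sets)

def membership(x, centers):
    """Triangular membership degrees of a scalar x to each fuzzy set of the partition."""
    width = centers[1] - centers[0]
    return np.clip(1.0 - np.abs(x - centers) / width, 0.0, 1.0)

def fis_output(x, centers_per_input, b):
    """Inferred output y of equation (5): weighted average of the b^i,
    weighted by the rule strengths alpha_i(x) (product conjunction)."""
    degrees = [membership(xi, c) for xi, c in zip(x, centers_per_input)]
    strengths = degrees[0]
    for d in degrees[1:]:
        strengths = np.outer(strengths, d).ravel()   # one strength per rule
    return float(np.dot(strengths, b) / strengths.sum())

# Usage: 2 inputs with 3 fuzzy sets each -> 9 rules, random consequents b^i.
centers = [triangular_partition((-1.0, 1.0), 3)] * 2
b = np.random.uniform(-1.0, 1.0, size=9)
print(fis_output([0.2, -0.4], centers, b))
```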


4 Fuzzy Q-Learning

Fuzzy Q-Learning is an immediate extension of the basic method. For simplicity, we take a fuzzy partition of each input variable domain, for example with triangles crossing at grade 0.5. Therefore, for an input vector, $x \in \mathbb{R}^n$, only $2^n$ rules are fired. The set defined by:

$$H_{j(1),\dots,j(2^n)} = \left\{ x \in \mathbb{R}^n \ /\ \alpha_{j(1)}(x) \neq 0, \dots, \alpha_{j(2^n)}(x) \neq 0 \right\} \qquad (6)$$

is a hypercube in $\mathbb{R}^n$, with $j(i) \in \{1,\dots,N\}$ for $i = 1$ to $2^n$ and $j(i) \neq j(k)$ if $i \neq k$. At the vertices of these hypercubes, only one rule is fired, with truth value 1.

These hypercubes form a partition of the state space into cells which are divided into a target cell, a viability zone and a failure zone (see Section 2).

We define J sets of fuzzy rules sharing the same antecedent parts. In our implementation, an agent is equivalent to a fuzzy rule base: it acts on the whole state space but computes its control action for each hypercube (cell). Rule number $i$, $i = 1$ to $N$, of agent number $j$, $j = 1$ to $J$, is now:

$$\text{rule } i:\ \text{if } x_1 \text{ is } A_1^{(i)} \text{ and } x_2 \text{ is } A_2^{(i)} \ \dots\ \text{and } x_n \text{ is } A_n^{(i)} \text{ then } y \text{ is } B^{i(j)} \qquad (7)$$

The set of agents is implemented by a FIS with J outputs (see Figure 1).

Figure 1: Architecture for FQL

For each agent, we define:

1. A Q-function, $Q[c, j]$, where $c$ refers to the cell and $j$ to the agent.

2. A rule quality, $q[k, j]$, for rule number $k$ used by agent number $j$.

The relation between Q and q-functions is given by:

$$Q[c, j] = \frac{1}{2^n} \sum_{k \in H(c)} q[k, j] \qquad (8)$$

where $H(c)$ refers to the rules fired when the input vector is in cell number $c$: $Q[c, j]$ is the mean value of $q[k, j]$ for $k \in H(c)$.

If $j_0$ is the number of the active agent, its Q-function, $Q[c, j_0]$, is updated by (1) when the process leaves cell $c$ to enter a cell $c'$. Moreover, thanks to the notion of rule quality, we can dispatch the received punishment/reward to the rules used by the active agent in cell $c$. Let $\Delta t$ be the time step and let $t_0$ and $t_1 = t_0 + m\Delta t$ be the instants where the process enters and leaves a cell, $m$ being a variable integer. The mean relative activity of rule $k$ in cell $c$ is defined by:

$$\alpha_A[k] = \frac{1}{m} \sum_{t=t_0}^{t_1} \frac{\alpha_k(x_t)}{\sum_{i \in H(c)} \alpha_i(x_t)} \qquad (9)$$

We verify immediately that:

$$\sum_{k=1}^{N} \alpha_A[k] = 1 \qquad (10)$$


The quality $q[k, j_0]$ is incremented by:

$$\Delta q[k, j_0] = 2^n\, \alpha_A[k]\, \Delta Q \qquad (11)$$

and we have:

$$Q[c, j_0]\big|_{t=t_1} = Q[c, j_0]\big|_{t=t_0} + \Delta Q = \frac{1}{2^n} \sum_{k \in H(c)} q[k, j_0]\big|_{t=t_1} \qquad (12)$$

where $Q[\cdot,\cdot]|_t$ (resp. $q[\cdot,\cdot]|_t$) stands for the Q value (resp. q value) at time $t$.
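The following Python fragment sketches this FQL bookkeeping: the Q-value of an agent in a cell is the mean of its rule qualities over the $2^n$ fired rules, and the increment $\Delta Q$ of equation (1) is dispatched to those rule qualities in proportion to their mean relative activities. The array shapes, helper names and the form used for the activity of equation (9) are illustrative assumptions.

```python
import numpy as np

# q[k, j]: quality of rule k for agent j (N rules, J agents, n_inputs input variables).
N, J, n_inputs = 48, 4, 3
q = np.zeros((N, J))

def cell_Q(fired, j):
    """Equation (8): Q[c, j] is the mean of q[k, j] over the 2^n rules fired in cell c."""
    return q[fired, j].mean()

def mean_relative_activity(strength_history, fired):
    """Equation (9) (as reconstructed): average, over the visit of the cell, of the
    relative strength of each fired rule; rows of strength_history are alpha_k(x_t)."""
    rel = strength_history[:, fired]
    rel = rel / rel.sum(axis=1, keepdims=True)
    return rel.mean(axis=0)               # sums to 1 over the fired rules

def dispatch(fired, j0, delta_Q, activity):
    """Equations (11)-(12): spread delta_Q over the rule qualities of agent j0 so that
    the mean of q[k, j0] over the fired rules increases by exactly delta_Q."""
    q[fired, j0] += (2 ** n_inputs) * activity * delta_Q
```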

Comments:

- When an agent is assimilated to a FIS, prior knowledge can easily be incorporated.
- A FIS generates continuous control outputs.
- The Q-function is used to qualify the set of rules of an agent. Therefore, a "super" agent with the best rules of each agent can be built.
- The drawback of FQL is that it is necessary to know at any time in which cell the process is located. This drawback is eliminated with DFQL.

5 Dynamical Fuzzy Q-Learning

The main feature in this section is the introduction of a relative viability zone and the computation of the reinforcement with respect to this relative viability zone. At time $t$, the system state can be characterized by:

$$e(t) = F[\mathit{desired}(t) - \mathit{actual}(t)] \qquad (13)$$

where $F$ is a positive, monotonically non-decreasing real function such that $F(0) = 0$ (e.g. $F(z) = \|z\|$). We also introduce:

$$P(t) = \lambda e(t) + (1 - \lambda) P(t-1) \qquad (14)$$

with $\lambda < 1$. Clearly, the lower $P(t)$ is, the better the previous control actions are.

When an agent is selected to control the process, at time step $t_n$, it has a time limit, $\Delta t$, and a relative viability zone defined as:

$$\alpha_1 P(t_n) < P(t) < \alpha_2 P(t_n) \qquad (15)$$

with $t \in [t_n, t_n + \Delta t]$, $\alpha_1 < 1$ and $\alpha_2 > 1$. The agent is changed either

1. at time $t'$ if $(\exists t')\ (t' - t_n < \Delta t)\ (P(t') \geq \alpha_2 P(t_n))$; we set $t_{n+1} = t'$,

2. or at time $t = t_n + \Delta t$ if equation (15) still holds. We set $t_{n+1} = t_n + \Delta t$.
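A minimal sketch of the filtered error of equations (13)-(14) and of this agent-switching rule, assuming a scalar error and illustrative values for $\lambda$, $\alpha_1$, $\alpha_2$ and $\Delta t$:

```python
class ViabilityMonitor:
    """Tracks the filtered error P(t) of equation (14) and decides when the
    active agent must be replaced (relative viability zone, equation (15))."""

    def __init__(self, lam=0.5, a1=0.75, a2=1.5, limit_steps=20):
        self.lam, self.a1, self.a2 = lam, a1, a2
        self.limit_steps = limit_steps
        self.P = None
        self.P_ref = None          # P(t_n), frozen when an agent takes control
        self.steps = 0

    def start_agent(self):
        """Called at t_n, when a new agent is selected."""
        self.P_ref = self.P
        self.steps = 0

    def update(self, error):
        """error = F[desired(t) - actual(t)] >= 0, equation (13)."""
        self.P = error if self.P is None else self.lam * error + (1 - self.lam) * self.P
        self.steps += 1
        # Case 1: the filtered error leaves the relative viability zone early.
        if self.P_ref is not None and self.P >= self.a2 * self.P_ref:
            return "switch_failure"
        # Case 2: the time limit Delta t is reached while (15) still holds.
        if self.steps >= self.limit_steps:
            return "switch_timeout"
        return "keep"
```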

The reinforcement is computed by:

As in FQL, we define a rule quality, $q[k, j]$, related to rule $k$ used by agent $j$, and a rule activity, $\alpha_A[k]$, for rule $k$, given by equation (9), where $m$ is the number of time steps between $t_n$ and $t_{n+1}$.

The Q-function is evaluated, at time steps $t = t_n, t_{n+1}, \dots$, for each input vector, $x_t$, and for each agent, $j$, by:

$$Q[x_t, j] = \frac{\sum_{k=1}^{N} \alpha_k(x_t)\, q[k, j]}{\sum_{k=1}^{N} \alpha_k(x_t)} \qquad (17)$$
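Since this evaluation is just a strength-weighted average of rule qualities, all agents can be scored in one pass; a sketch with hypothetical array names, reusing the rule strengths and the q array introduced above:

```python
import numpy as np

def agent_Q_values(strengths, q):
    """Equation (17): for each agent j, Q(x_t, j) is the strength-weighted
    average of its rule qualities q[k, j].
    strengths: shape (N,), rule strengths alpha_k(x_t); q: shape (N, J)."""
    return strengths @ q / strengths.sum()   # shape (J,)

# The agent with the highest Q-value for the current input could then be
# selected, e.g. through a Boltzmann distribution as in Section 2.
```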

Equations (5) and (17) have the same structure. Therefore, each Q-function can be implemented as an output of a FIS with $(q[k, j])_{k \in \{1,\dots,N\},\, j \in \{1,\dots,J\}}$ as synaptic weights.

When the active agent, $j_0$, loses control of the process, we compute the quantity $\Delta Q$, which is dispatched to the rule qualities $(q[k, j_0])_{k=1,\dots,N}$. We have:


6 Example

In this paper, we only briefly show an application of DFQL used to find the best set of fuzzy rules for the control of a recurrent system proposed in [12].

The FIS used was built up in the following way:

1. Three inputs: $c_{k+1}$, $y_k$, $y_{k-1}$.

2. Three fuzzy subsets defined on $[-1, 1]$ for $c_{k+1}$, and four fuzzy subsets for $y_k$ and $y_{k-1}$, defined on $[-1.5, 1.5]$. The membership functions were obtained as the difference of two sigmoids (see the sketch after this list). There were 48 fuzzy rules.

3. The T-norm was a smoothed Lukasiewicz conjunction [8].

4. Four control outputs corresponding to four operators, each with its respective Q-function.
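The difference-of-sigmoids membership functions of item 2 can be sketched as follows; the slope value and the spacing of the centers are assumptions made for illustration, not values taken from the paper.

```python
import numpy as np

def sigmoid(x, c, slope):
    return 1.0 / (1.0 + np.exp(-slope * (x - c)))

def diff_of_sigmoids(x, center, width, slope=10.0):
    """Bell-shaped membership function built as the difference of two sigmoids,
    producing a bump of roughly the given width around the given center."""
    return sigmoid(x, center - width / 2, slope) - sigmoid(x, center + width / 2, slope)

# Example: four fuzzy sets for y_k on [-1.5, 1.5].
centers = np.linspace(-1.5, 1.5, 4)
width = centers[1] - centers[0]
degrees = [diff_of_sigmoids(0.3, c, width) for c in centers]
```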

The consequent parts of the rules were randomly generated (the central point $b^{i(j)}$ of $B^{i(j)}$ in equation (7)). We want to find the best agent, evaluate all the rules used by the agents, and build a "super" agent with parameters $(b^{k(j^*)})_{k=1,\dots,N}$, where $j^*$ is, for each rule $k$, the agent with the highest rule quality: $q[k, j^*] = \max_{j} q[k, j]$.
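A sketch of this "super" agent construction, with hypothetical array names for the rule qualities and consequents:

```python
import numpy as np

def build_super_agent(q, b):
    """q: rule qualities, shape (N, J); b: rule consequents b^{k(j)}, shape (N, J).
    For each rule k, keep the consequent of the agent j* with the best quality q[k, j]."""
    j_star = q.argmax(axis=1)                     # best agent for each rule
    return b[np.arange(b.shape[0]), j_star]       # consequents of the "super" agent
```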

The system output has to follow a sinusoidal reference signal $c_k$. We have simulated the control problem with the following parameters: $\alpha = 0.2$, $\gamma = 0.95$, $\alpha_1 = 0.75$, $\alpha_2 = 1.5$, $\lambda = 0.5$ and $\Delta t = 20$ time steps. At time step 1000, Figure 2 shows the performance of the best agent, and Figure 3 the performance of the "super" agent.

Figure 2: The best agent, discovered at time step 1000.

Starting with random parameters in the consequent parts of the fuzzy rules, DFQL is able to identify the best rules among a given set of initial random rules. If the optimal solution is not present in this first set, we can use either a reinforcement-based random search [9] or competitive learning with exploration [7].

7 Conclusion

Two fuzzy versions of Q-Learning have been proposed. Both of them enable easy incorporation of prior knowledge, if available, and the generation of continuous control actions. DFQL introduces a relative viability zone to avoid a fixed partition of the state space, which is a crucial point. The ability of DFQL to discover the best rules among a given randomly generated rule set has been shown.

Figure 3: The "super" agent, built at time step 1000.

References

[1] Aubin J.P., "Learning rules of cognitive processes", C.R. Acad. Sc. Paris, T. 308, Série I, 1989.

[2] Barto A., Sutton R., Anderson C., "Neuronlike adaptive elements that can solve difficult learning control problems", IEEE Trans. on SMC, Vol. 13, Sept. 1983.

[3] Barto A., Sutton R., Watkins C., "Sequential decision problems and neural networks", in Advances in Neural Information Processing Systems 2, D. Touretzky, Ed., Morgan Kaufmann, San Mateo, CA, 1990.

[4] Berenji H., Khedkar P., "Learning and tuning fuzzy logic controllers through reinforcement", IEEE Trans. on Neural Networks, 3(5), Sept. 1992.

[5] Bersini H., "Reinforcement learning and recruitment mechanism for adaptive distributed control", TR. IR/IRIDIA/92-4, Université Libre de Bruxelles, 1992.

[6] Clouse J.A., Utgoff P.E., "A teaching method for reinforcement learning", Proc. of 9th Workshop on Machine Learning ML'92, 1992.

[7] Glorennec P.Y., "A new reinforcement learning algorithm for optimization of a fuzzy rule base", Proc. of EUFIT'93, Aachen, Germany, Sept. 1993.

[8] Glorennec P.Y., "A general class of Fuzzy Inference Systems: application to identification and control", Proc. of 2nd European Congress on Systems Science, Prague, Czech Republic, Oct. 1993.

[9] Glorennec P.Y., "Neuro-Fuzzy Control Using Reinforcement Learning", Proc. of IEEE-SMC'93, Le Touquet, France, Oct. 1993.

[10] Lin L-J., "Self-improvement based on reinforcement learning, planning and teaching", Proc. of 8th Workshop on Machine Learning ML'91, 1991.

[11] McCallum R.A., "Using transitional proximity for faster reinforcement learning", Proc. of 9th Workshop on Machine Learning ML'92, 1992.

[12] Narendra K., Parthasarathy K., "Identification and control of dynamical systems using neural networks", IEEE Trans. Neural Networks, Vol. 1, pp. 4-27, 1990.

[13] Sutton R.S., "Learning to predict by the methods of temporal differences", Machine Learning, 3, pp. 9-44, 1988.

[14] Watkins C., "Learning from delayed rewards", PhD Thesis, University of Cambridge, England, 1989.

[15] Watkins C., Dayan P., "Q-Learning (Technical Note)", Machine Learning, Vol. 8, pp. 279-292, 1992.

[16] Whitehead S.D., "A complexity analysis of cooperative mechanisms in reinforcement learning", Proc. of the 9th AAAI Conf., 1991.