CONTINUOUS ACTION GENERATION OF Q-LEARNING IN MULTI-AGENT COOPERATION

Kao-Shing Hwang, Yu-Jen Chen, Wei-Cheng Jiang, and Tzung-Feng Lin

ABSTRACT

Conventional Q-learning requires pre-defined quantized state and action spaces. This is not practical for real robot applications, since a discrete, finite action set cannot precisely capture the variations among the different positions within the same state element on which the robot is located. In this paper, a continuous action generator composed with Q-learning, based on the fuzzy cerebellar model articulation controller (FCMAC), is presented to solve this problem. The FCMAC generates continuous actions by a linear combination of the weighting distribution over the state space, where the optimal policy of each state is derived from Q-learning. This provides better resolution of the weighting distribution for the state space in which the robot is located. The algorithm not only solves the single-agent problem but also extends to the multi-agent problem. An experiment is implemented in a task where two robots, connected by a straight bar, take actions independently. Their goal is to cooperate with each other to pass through a gate in the middle of a grid environment.

Key Words: Reinforcement learning, FCMAC, multi-agent.

I. INTRODUCTION

In recent years, reinforcement learning (RL) has become an attractive learning method for training robots [1–3]. More and more research projects have extended reinforcement learning methods to multi-agent systems [4], where agents know little about the other agents and the environment changes during learning. Applications of reinforcement learning in multi-agent systems include soccer [5, 6], pursuit games [7, 8], and coordination games [9–11]. When implementing real robot applications, common reinforcement learning methods, such as Q-learning, normally require well-defined quantized state and action spaces to estimate the value function. There are several ways to estimate the value function. Some researchers have used a lookup table [1]. The problem with these methods is how to discretize a continuous state space. The simplest method is to partition the state space into several uniform intervals along each dimension. A fine discretization, however, incurs enormous computational requirements, especially in a high-dimensional state space, while a coarse discretization may lead to poor learning performance. Adaptive resonance theory (ART)-based cognitive models can adaptively partition the state space [12, 13], but they require a considerable amount of computation time to evaluate the vigilance parameter of all nodes, especially when the number of nodes is large. They also cannot estimate the value when no representative state is activated. This paper proposes a fuzzy cerebellar model articulation controller (FCMAC)-based Q-learning controller that uses multiple overlapping tilings of the space to produce a quantized state space with fine resolution. The FCMAC approach is combined with Q-learning and applied to a task that requires two robot agents to cooperate. In this task, a bar connects the robots so that they constrain each other, and they must pass through a gate in the middle of a 9 × 5 grid, as depicted in Fig. 1.

This article is organized as follows. Single-agent and multi-agent reinforcement learning are introduced in Section II. The FCMAC-based Q-learning method is described in Section III. The combination of the FCMAC and multi-agent continuous valued Q-learning is illustrated in Section IV. Experimental results are presented in Section V, and Section VI concludes the paper.

II. BACKGROUND

2.1 Single-agent Q-learning

Q-learning is a model-free reinforcement learning method based on stochastic dynamic programming. It provides robots with the capability of learning to act optimally in a Markovian environment. The robots are assumed to be able to discriminate the set S of distinct environmental states and can take a set A of actions on the environment. Fig. 2 shows the procedure of the one-step Q-learning algorithm [14].
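To make the procedure of Fig. 2 concrete, the following sketch runs one-step tabular Q-learning on a hypothetical one-dimensional corridor; the environment, the ε-greedy exploration, and all constants are illustrative assumptions rather than the task studied later in this paper.

```python
import numpy as np

# Hypothetical 1-D corridor: states 0..4, goal at state 4; actions: 0 = left, 1 = right.
N_STATES, N_ACTIONS = 5, 2
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def step(s, a):
    """Toy transition: reward 1 on reaching the goal state, 0 otherwise."""
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s_next, 1.0 if s_next == N_STATES - 1 else 0.0

Q = np.zeros((N_STATES, N_ACTIONS))            # step 1: initialize Q(s, a) to 0
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0                                      # step 2: perceive the current state
    while s != N_STATES - 1:
        if rng.random() < EPSILON:             # step 3: epsilon-greedy action selection
            a = int(rng.integers(N_ACTIONS))
        else:                                  # greedy choice with random tie-breaking
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s_next, r = step(s, a)                 # step 4: act, observe s' and r
        V_next = np.max(Q[s_next])             # V(s') = max_a Q(s', a)
        # step 5: Q(s, a) <- (1 - alpha) Q(s, a) + alpha [r + gamma V(s')]
        Q[s, a] = (1 - ALPHA) * Q[s, a] + ALPHA * (r + GAMMA * V_next)
        s = s_next                             # step 6: continue from the new state

print(np.argmax(Q, axis=1))                    # greedy action per state (1 = move right)
```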

Manuscript received October 4, 2010; revised May 25, 2011; accepted May 14, 2012.

Kao-Shing Hwang is with the Department of Electrical Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan.

Yu-Jen Chen, Wei-Cheng Jiang, and Tzung-Feng Lin are with the Department of Electrical Engineering, National Chung Cheng University, Chiayi, Taiwan.

Wei-Cheng Jiang is the corresponding author (e-mail: [email protected]).

Asian Journal of Control, Vol. 15, No. 4, pp. 1011–1020, July 2013. Published online 26 September 2012 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asjc.614

© 2012 John Wiley and Sons Asia Pte Ltd and Chinese Automatic Control Society


2.2 Multi-agent Q-learning

As Littman noted in [15], no agent lives in a vacuum; it must interact with other agents in the environment to achieve its goal. Multi-agent systems differ from single-agent systems in that several agents exist in the environment, each modeling the others' goals and actions. From an individual agent's perspective, the most significant difference is that the environment's dynamics can be affected by the other agents. In addition to the inherent uncertainty of the system, other agents can intentionally affect the environment [16–18].

In the case of multiple agents, each learning simultaneously, a particular agent is learning the value of its actions in a non-stationary environment. Thus, the convergence of the aforementioned Q-learning algorithm is not necessarily guaranteed in a multi-agent setting. Under certain assumptions about the way actions are selected at each state over time, Q-learning converges to the optimal value function V*. The simplest way to extend this to the multi-agent stochastic game (SG) setting is to add a subscript to the formulation above, that is, to have the learning agent pretend the environment is passive:

$$Q_k(s, a_k) \leftarrow (1 - \alpha)\, Q_k(s, a_k) + \alpha \left[ R_k(s, a_k) + \gamma V_k(s') \right] \tag{1}$$

$$V_k(s) \leftarrow \max_{a_k \in A_k} Q_k(s, a_k) \tag{2}$$

The symbol a_k denotes the action selected by the k-th agent, and the symbol \vec{a} denotes the joint action selected by all of the agents. Several authors have tested variations of this algorithm [6, 8, 11]. Nevertheless, this approach is inaccurate because the definition of the Q-values incorrectly assumes they are independent of the actions selected by the other agents. The cure to this problem is to simply define the Q-values as a function of all agents' actions:

$$Q_k(s, \vec{a}) \leftarrow (1 - \alpha)\, Q_k(s, \vec{a}) + \alpha \left[ R_k(s, \vec{a}) + \gamma V_k(s') \right] \tag{3}$$

For (by definition, two-player) zero-sum SGs, Littman suggests the minimax-Q learning algorithm, in which V is updated with the minimax of the Q-values [15]:

$$V_1(s) \leftarrow \max_{P \in \Pi(A_1)} \; \min_{a_2 \in A_2} \; \sum_{a_1 \in A_1} P(a_1)\, Q_1\bigl(s, (a_1, a_2)\bigr) \tag{4}$$

The symbol A_k denotes the action space of player k (k = 1, 2), and Q_1(s, (a_1, a_2)) is the expected reward for taking action a_1 when the opponent chooses action a_2 in state s. The symbol P denotes a mixed strategy of player 1, that is, a probability distribution over the actions in A_1, and Π(A_1) is the set of such distributions.

Although (4) can be extended to general-sum SGs, minimax-Q is no longer well-motivated in those settings. In our problem, the two agents have to cooperate rather than compete with each other. Since they have to achieve the same goal, we can simply define V as follows:

$$V_1(s) \leftarrow \max_{a_1 \in A_1,\, a_2 \in A_2} Q_1\bigl(s, (a_1, a_2)\bigr) \tag{5}$$

Equation (5) shows that the Q-values of the players define a game in which there is a globally optimal action profile (meaning the payoff to any agent under that joint action is no less than its payoff under any other joint action).
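As a minimal sketch of the cooperative formulation in (3) and (5), the snippet below keeps a joint-action Q-table for one agent and updates it from a single illustrative transition; the table sizes, reward value, and transition used here are assumptions made only for this example.

```python
import numpy as np

N_STATES, N_A1, N_A2 = 4, 3, 3
ALPHA, GAMMA = 0.1, 0.9

# One table per agent, indexed by (state, a1, a2): Q_k(s, (a1, a2)).
Q1 = np.zeros((N_STATES, N_A1, N_A2))

def V1(s):
    """Equation (5): V_1(s) = max over the joint action (a1, a2) of Q_1(s, (a1, a2))."""
    return Q1[s].max()

def update(s, a1, a2, r, s_next):
    """Equation (3): Q_1(s, a) <- (1 - alpha) Q_1(s, a) + alpha [R_1(s, a) + gamma V_1(s')]."""
    Q1[s, a1, a2] = (1 - ALPHA) * Q1[s, a1, a2] + ALPHA * (r + GAMMA * V1(s_next))

# Example transition with an illustrative reward of 1.0:
update(s=0, a1=2, a2=1, r=1.0, s_next=1)
print(Q1[0, 2, 1], V1(0))
```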

2.3 Cerebellar model articulation controller

In the 1970s, Albus introduced the cerebellar model articulation controller (CMAC) [19]. The CMAC has the capability of local generalization, function approximation, and so on [20]. The architecture of the CMAC consists of two processing stages [21]. Here, we utilize the first stage to obtain a nonlinear transformation that maps the input state variable x ∈ R^N into a higher-dimensional vector v ∈ {0,1}^M, where N and M are the dimensions of the vectors x and v, respectively. The vector v is sparse: at most C of its components are excited (C is called the generalization parameter, that is, the ratio of the generalization width to the quantization width).

Fig. 1. The task environment and robots.

1. Initialize Q(s, a) to 0 for all states s and actions a.
2. Perceive the current state s.
3. Choose an action a according to the action-value function.
4. Carry out action a in the environment. Let s' be the next state and r the immediate reward.
5. Update the action-value function for the state-action pair (s, a):
   Q_{t+1}(s, a) ← (1 − α) Q_t(s, a) + α [r + γ V_t(s')],
   V_t(s) = max_{a ∈ A} Q_t(s, a),
   where α is the learning rate parameter and γ is a fixed discount factor between 0 and 1.
6. Return to step 2.

Fig. 2. Q-learning algorithm.


The corresponding cells are nonzero because of the capacity of a biological organism to generalize learning experience from one instance to another. This is referred to as local generalization.

The mapping is realized by feeding the binary output of the sensor layer into logical AND units. Each AND unit receives an input from each receptive field and is sparsely interconnected to the next logical OR unit. Fig. 3 depicts the schematic diagram of two-dimensional (2-D) CMAC operations with width C = 3. After the mapping operation, each subset is offset relative to the others along the hyperdiagonal of the input hyperspace, and each element falling within the subsets on the hyperspace has a value of one. The nonlinear mapping for a 2-D CMAC can be represented as an M1 × M2 matrix w(x). The result of the mapping is drawn in Fig. 4.
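The following sketch illustrates the first CMAC stage described above for a one-dimensional input: C = 3 overlapping tilings, each offset along the input axis, map a continuous input to a sparse binary vector in which exactly C cells are excited. The tile width, tile count, and input range are assumptions for illustration.

```python
import numpy as np

C = 3             # generalization parameter: number of overlapping tilings
TILE_WIDTH = 1.0  # quantization width of each tiling (assumed)
N_TILES = 12      # tiles per tiling over an assumed input range [0, 10]

def cmac_binary_map(x):
    """Map a scalar input x to a sparse binary vector v in {0,1}^(C*N_TILES).
    Each tiling is offset by TILE_WIDTH / C, so exactly C cells are excited."""
    v = np.zeros(C * N_TILES, dtype=int)
    for t in range(C):
        offset = t * TILE_WIDTH / C
        idx = int((x + offset) // TILE_WIDTH)       # which tile of tiling t is hit
        v[t * N_TILES + min(idx, N_TILES - 1)] = 1  # one active cell per tiling
    return v

v = cmac_binary_map(3.7)
print(v.sum())            # C active cells
print(np.nonzero(v)[0])   # indices of the excited cells
```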

III. SYSTEM IMPLEMENTATION

This section describes how to extend conventional discrete Q-learning to continuous valued Q-learning with the FCMAC. The system architecture of the FCMAC-based Q-learning controller is shown in Fig. 5. The controller receives the state from the environment, and the FCMAC quantizer generalizes the state s. The action selection function then selects an action u, according to the Q-values recorded in the memory, to act on the environment. The environment feeds back a reinforcement signal r, which the controller uses to update the Q-values. In this method, an interpolated action value is computed by the weighted linear combination of the action values of the representative states and actions. This contributes to realizing smooth motions of the real robot.

3.1 State representation and action interpolation

The basic idea for a continuous valued representation of the state, the action, and the reward in Q-learning is to describe them as contribution vectors over representative states, actions, and rewards, respectively. First, the state/action space is quantized adequately; the quantized states and actions serve as the representative states s_1, . . . , s_n and representative actions a_1, . . . , a_m. The state and action are then represented by contribution vectors over these representatives, x = (w_1^x, . . . , w_n^x) and u = (w_1^u, . . . , w_m^u) [22]. A contribution value indicates the closeness to the neighboring representative state or action, and the contribution values sum to one. The FCMAC is used to calculate the contributions.

Fig. 3. The nonlinear mapping of 2-D CMAC.

Fig. 4. The nonlinear mapping result of 2-D CMAC.

Fig. 5. The system architecture of the FCMAC-based Q-learning controller.


3.1.1 Fuzzy cerebellar model articulation controller

Unlike a conventional fuzzy rule-based controller, a neuro-fuzzy system does not require the extraction of control rules from experts. The learning is based on observations of the input/output relationship of the system. Careful inspection of the operation of the CMAC and of fuzzy logic algorithms reveals some striking similarities between the two systems. For example, both perform function approximation in an interpolating look-up-table manner with the principles of dichotomy and generalization. Moreover, the nonlinear mapping of the CMAC can be regarded as a subset of the aggregation operations in fuzzy sets.

The standard univariate basis function of the CMAC is binary, so the network's modeling capability is only piecewise constant. In contrast, univariate basis functions built from higher-order piecewise polynomials, which generate smoother outputs, have been investigated recently [23–25]. A crisp set can be considered a special form of fuzzy set, to which an instance either belongs or does not belong; this corresponds to whether a state variable excites a specific region in the sensor layer of the CMAC. The membership function μ_i(x) → [0, 1] associates each state variable x with a number representing its grade of membership in the i-th fuzzy set. Three triangles of different shapes are adopted for the generalization parameter C = 3, so only one maximal output exists in the hyperspace after the aggregation operation. The membership grades have a peak value at the center of the excited region and decrease as the input moves toward the edge of the excited region. Fig. 6 depicts the organization of the overlapping receptive fields in the input space. Different membership grades are assigned to the corresponding cell if one of the quantization regions is excited by a given state variable.

The ordering μ_13 > μ_12 > μ_11, μ_22 > μ_21 > μ_23, μ_31 > μ_32 > μ_33 holds for a state variable x that lies near the left edge of the excited region in Fig. 7. In the extreme case, when every state variable x falls on the center of its corresponding excited region, a fuzzy singleton μ = 1 is generated at the intersection of every subset in the hyperspace. The 2-D FCMAC operations, where each subset is offset relative to the others along the hyperdiagonal in the input hyperspace, are illustrated in Fig. 7.

The nonlinear mapping of the FCMAC is implemented by replacing the logical AND and OR operations in the CMAC with the commonly used T-norm and T-conorm operations, respectively. A comparison of the CMAC and FCMAC with some T-norm and T-conorm dual operators is listed in Table I.

The nonlinear mapping result of the FCMAC is shown in Fig. 8. The algebraic product T-norm and the algebraic sum T-conorm are adopted in this work, since they produce smoother output surfaces and make the system amenable to analysis.
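Below is a minimal sketch of the FCMAC mapping for a two-dimensional input with C = 3: each excited cell receives a firing strength obtained by combining triangular membership grades with the algebraic product (T-norm), and the C strengths are normalized so they can serve as the contribution weights w_i^x used in the next subsection. The membership geometry and the helper name fcmac_weights are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

C = 3
TILE_WIDTH = 1.0

def tri_membership(x, center, width):
    """Triangular membership: peak 1 at the tile center, 0 at distance `width`."""
    return max(0.0, 1.0 - abs(x - center) / width)

def fcmac_weights(x):
    """Map a 2-D input to (cell id, weight) pairs.
    Per tiling, the grade of each dimension is combined with the algebraic product,
    and the C firing strengths are normalized so the contributions sum to one."""
    cells, strengths = [], []
    for t in range(C):
        offset = t * TILE_WIDTH / C
        idx = tuple(int((xi + offset) // TILE_WIDTH) for xi in x)
        centers = tuple((i + 0.5) * TILE_WIDTH - offset for i in idx)
        grade = 1.0
        for xi, ci in zip(x, centers):                 # T-norm: algebraic product
            grade *= tri_membership(xi, ci, TILE_WIDTH)
        cells.append((t,) + idx)
        strengths.append(grade)
    total = sum(strengths)
    return [(c, s / total) for c, s in zip(cells, strengths)]

for cell, w in fcmac_weights((3.7, 1.2)):
    print(cell, round(w, 3))
```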

3.1.2 Continuous valued Q-learning

When the robot perceives the current sensor information as a continuous state x, we define a weighting factor w_i^x (i = 1, . . . , n; n is the number of representative states and i indexes the i-th representative state) for each representative state s_i. The weighting factor is calculated using the FCMAC method described above and represents how strongly the current continuous state x is influenced by the representative state s_i. The current state x is then described by the weighted linear combination of the representative states s_i as follows:

$$x = \sum_{i=1}^{n} w_i^x\, s_i \tag{6}$$

Fig. 6. Input membership functions of (a) Subset 1 for C = 3, (b) Subset 2 for C = 3, and (c) Subset 3 for C = 3.


Similarly, for the action space, a continuous action command u is denoted by the weighted linear combination of the representative actions a_j as follows:

$$u = \sum_{j=1}^{m} w_j^u\, a_j \tag{7}$$

The action weighting factor w_j^u (j = 1, . . . , m; m is the number of representative actions and j indexes the j-th representative action) represents how strongly a continuous action u is influenced by the representative action a_j.
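A small sketch of the contribution-vector representation in (6) and (7): given weighting factors over a handful of representative states and actions (the numerical values here are hypothetical), the continuous state and action are recovered as weighted linear combinations.

```python
import numpy as np

# Hypothetical representative states (1-D positions) and actions (1-D displacements).
s_rep = np.array([0.0, 1.0, 2.0, 3.0])      # s_1 ... s_n
a_rep = np.array([-1.0, 0.0, 1.0])          # a_1 ... a_m

# Contribution vectors: nonnegative and summing to one, as in (6) and (7).
w_x = np.array([0.0, 0.3, 0.7, 0.0])        # w_i^x, e.g. produced by the FCMAC
w_u = np.array([0.0, 0.25, 0.75])           # w_j^u

x = np.dot(w_x, s_rep)                      # x = sum_i w_i^x s_i  -> 1.7
u = np.dot(w_u, a_rep)                      # u = sum_j w_j^u a_j  -> 0.75
print(x, u)
```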

In conventional Q-learning, the optimal action at the current state is defined as the action maximizing the Q-value at that state, and the mapping from state space S to action space A is called a policy. In the same way, the optimal action u* at a continuous state x is defined in terms of the optimal actions of the representative states. The optimal action of a representative state is defined as follows:

$$\pi^*(s_i) = \arg\max_{a_j \in A} Q(s_i, a_j) \tag{8}$$

where a_j is a representative action. Therefore, the optimal action u* at state x is obtained from the products of the weighting factors of the representative states s_i and the representative actions a_j that maximize the Q-value at each s_i, and is defined as follows:

$$u^*(x) = \sum_{i=1}^{n} w_i^x\, \pi^*(s_i) = \sum_{i=1}^{n} w_i^x \arg\max_{a_j} Q(s_i, a_j) \tag{9}$$

The optimal action u* can also be written as follows:

$$u^* = \sum_{j=1}^{m} w_j^{u^*} a_j \tag{10}$$

From (9) and (10), the weighting vector of the representative actions can be obtained by:

$$w_j^{u^*} = \sum_{i=1}^{n} w'_{ij}, \qquad w'_{ij} = \begin{cases} w_i^x, & \text{if } a_j = \arg\max_{a} Q(s_i, a) \\ 0, & \text{otherwise} \end{cases} \tag{11}$$

Since the state value V(x) of the state x can be defined as the Q-value of the optimal action, Q(x, u*), it is linearly interpolated from the weighting factors of the representative states s_i and their values Q(s_i, u*). The value of a representative state is in turn linearly interpolated, based on the action weighting factors, from the action values Q(s_i, a_j). The state value V(x) is obtained by:

Fig. 7. The nonlinear mapping of 2-D FCMAC.

Table I. Comparison of CMAC and FCMAC.

CMAC                          FCMAC
Characteristic function       Membership function
AND operation: a ∧ b          T-norm:
                                Min(a, b)            (fuzzy intersection)
                                Max(0, a + b − 1)    (bounded product)
                                ab                   (algebraic product)
OR operation: a ∨ b           T-conorm:
                                Max(a, b)            (fuzzy union)
                                Min(1, a + b)        (bounded sum)
                                a + b − ab           (algebraic sum)

Fig. 8. The nonlinear mapping result of 2-D FCMAC.


$$V(x) = \sum_{i=1}^{n} w_i^x\, Q(s_i, u^*) = \sum_{i=1}^{n} \sum_{j=1}^{m} w_i^x\, w_j^{u^*}\, Q(s_i, a_j) \tag{12}$$
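The sketch below strings together (8)–(12) for a single agent: the greedy representative action is found per representative state, the continuous action u* is formed as in (9), the action weights w_j^{u*} are accumulated as in (11), and V(x) is the doubly weighted sum of (12). The Q-table entries and state weights are illustrative values, not learned ones.

```python
import numpy as np

n, m = 4, 3                                  # representative states and actions
a_rep = np.array([-1.0, 0.0, 1.0])           # representative action values
Q = np.array([[0.1, 0.5, 0.2],               # illustrative Q(s_i, a_j) table
              [0.0, 0.3, 0.9],
              [0.4, 0.1, 0.6],
              [0.2, 0.2, 0.2]])
w_x = np.array([0.0, 0.6, 0.4, 0.0])         # w_i^x from the FCMAC (assumed)

greedy = np.argmax(Q, axis=1)                # eq. (8): pi*(s_i) = argmax_j Q(s_i, a_j)
u_star = np.dot(w_x, a_rep[greedy])          # eq. (9): u* = sum_i w_i^x a_{pi*(s_i)}

w_u_star = np.zeros(m)                       # eq. (11): move each state weight onto
for i in range(n):                           # that state's greedy action
    w_u_star[greedy[i]] += w_x[i]

# eq. (10) gives the same u*: np.dot(w_u_star, a_rep)
V_x = np.sum(w_x[:, None] * w_u_star[None, :] * Q)   # eq. (12)
print(u_star, w_u_star, V_x)
```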

3.2 Updating the representative Q-values

When a continuous action u is executed at a continuous state x, the world transits from the current state-action pair (x, u) to the next state x', and a reward r is acquired based on the next state. The action value is then updated by the following equation:

$$Q(s_i, a_j) \leftarrow Q(s_i, a_j) + \alpha\, w_i^x\, w_j^u \left[ r + \gamma V(x') - Q(s_i, a_j) \right] \tag{13}$$

where α is the learning rate (between 0 and 1) and γ is the discount factor (between 0 and 1) controlling to what degree rewards in the distant future affect the total value of a policy. The update magnitude of each action value varies with its weighting factors for both the continuous state and the continuous action.
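A sketch of the update rule (13): every representative pair (s_i, a_j) is moved toward the TD target in proportion to the product w_i^x w_j^u, with V(x') evaluated as in (12) using the optimal-action weights at the next state. The weight vectors below are assumed values chosen only for illustration.

```python
import numpy as np

ALPHA, GAMMA = 0.1, 0.9
n, m = 4, 3
Q = np.zeros((n, m))                     # representative action values Q(s_i, a_j)

def update(Q, w_x, w_u, r, w_x_next, w_u_star_next):
    """Eq. (13): Q(s_i,a_j) += alpha * w_i^x * w_j^u * [r + gamma * V(x') - Q(s_i,a_j)]."""
    V_next = np.sum(w_x_next[:, None] * w_u_star_next[None, :] * Q)   # eq. (12) at x'
    td_error = r + GAMMA * V_next - Q    # one TD error per representative pair
    return Q + ALPHA * np.outer(w_x, w_u) * td_error

# Assumed weight vectors: w_x, w_u for the executed state/action, plus the weights of
# the next state x' and of the optimal action u* at x'.
w_x           = np.array([0.0, 0.6, 0.4, 0.0])
w_u           = np.array([0.0, 0.0, 1.0])
w_x_next      = np.array([0.0, 0.0, 0.7, 0.3])
w_u_star_next = np.array([0.0, 0.0, 1.0])

Q = update(Q, w_x, w_u, r=1.0, w_x_next=w_x_next, w_u_star_next=w_u_star_next)
print(np.round(Q, 3))
```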

IV. MULTI-AGENT CONTINUOUS VALUED Q-LEARNING

4.1 Algorithm

In this paper, we want to find a solution that helps our two robots cooperate to get through the gate. When implementing a real robot application, the performance of conventional discrete Q-learning is not sufficient to control the robot. Therefore, we combine it with continuous valued Q-learning to obtain smooth continuous output action commands. To do so, we use the FCMAC to calculate the weighting factors of the quantized representative states and obtain the output action as a linear combination of the quantized representative actions. The symbols of the FCMAC-based Q-learning controller are specified in Table II, and the details of our multi-agent FCMAC-based continuous valued Q-learning are as follows:

1. Initialize:
   For all representative states s_i and representative actions a_kj, let Q_k(s_i, (a_{1j}, . . . , a_{Pj})) = 0 (i = 1 to n, j = 1 to m, k = 1 to P).
   Initialize all agents' state x.

2. Loop:
   Use the FCMAC to calculate the weighting factors w_i^x from x (i = 1 to n).
   Calculate u_k* in terms of w_i^x:

   $$u_k^* = \sum_{i=1}^{n} w_i^x \arg\max_{a_{kj}} \; \max_{\{a_{1j},\ldots,a_{Pj}\} \setminus a_{kj}} Q_k\bigl(s_i, (a_{1j},\ldots,a_{Pj})\bigr)$$

   Use (11) to obtain all the weighting factors w_{jk}^{u_k*} (k = 1 to P).
   Take the actions u_1*, . . . , u_P*, observe r and the next state x'.
   Calculate V_k(x'):

   $$V_k(x') = \sum_{i=1}^{n} w_i^{x'}\, Q_k(s_i, \vec{u}\,'), \qquad Q_k(s_i, \vec{u}\,') = \sum_{j_1=1}^{m} \cdots \sum_{j_P=1}^{m} w_{j_1}^{u_1'} \cdots w_{j_P}^{u_P'}\, Q_k\bigl(s_i, (a_{1 j_1}, \ldots, a_{P j_P})\bigr)$$

   Update the representative Q-values:

   $$Q_k\bigl(s_i, (a_{1j},\ldots,a_{Pj})\bigr) \leftarrow Q_k\bigl(s_i, (a_{1j},\ldots,a_{Pj})\bigr) + \alpha\, w_i^x \prod_{k'=1}^{P} w_{j k'}^{u_{k'}^*} \Bigl[ r + \gamma V_k(x') - Q_k\bigl(s_i, (a_{1j},\ldots,a_{Pj})\bigr) \Bigr]$$
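The sketch below walks through the loop above for P = 2 agents. The environment stub, the placeholder FCMAC weight function, and the added ε-greedy exploration are assumptions introduced only to make the example self-contained and runnable; the update itself follows the listed steps: compute w_i^x, derive each agent's continuous action from the joint Q-tables, transfer the state weights with (11), observe (r, x'), evaluate V_k(x'), and update the representative Q-values.

```python
import numpy as np

P, n, m = 2, 6, 3                    # agents, representative states, actions per agent
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.2
a_rep = np.array([-1.0, 0.0, 1.0])   # representative action values (1-D displacements)
Q = [np.zeros((n, m, m)) for _ in range(P)]   # Q_k(s_i, (a_1j, a_2j))
rng = np.random.default_rng(0)

def fcmac_weights(x):
    """Placeholder for the FCMAC quantizer: contribution weights w_i^x, summing to one."""
    w = np.maximum(0.0, 1.0 - np.abs(np.arange(n) - x))
    return w / w.sum()

def greedy_per_state(Qk, k):
    """Agent k's best representative action at each s_i, assuming a cooperative partner."""
    best_over_partner = Qk.max(axis=2 if k == 0 else 1)  # max over the other agent's action
    return np.argmax(best_over_partner, axis=1)

def action_weights(w_x, action_idx):
    """Equation (11): move each state weight onto that state's chosen action."""
    w_u = np.zeros(m)
    np.add.at(w_u, action_idx, w_x)
    return w_u

def step_env(x, u):
    """Toy environment stub: position drifts with the summed actions; shaped reward."""
    x_next = float(np.clip(x + 0.5 * sum(u), 0.0, n - 1))
    return x_next, x_next / (n - 1)

x = 0.0
for t in range(300):
    w_x = fcmac_weights(x)
    idx = []
    for k in range(P):                          # epsilon-greedy choice of action indices
        g = greedy_per_state(Q[k], k)
        if rng.random() < EPS:
            g = np.full(n, rng.integers(m))
        idx.append(g)
    u = [float(np.dot(w_x, a_rep[idx[k]])) for k in range(P)]   # continuous u_k
    w_u = [action_weights(w_x, idx[k]) for k in range(P)]       # w_{jk}^{u_k}
    x_next, r = step_env(x, u)

    w_x_next = fcmac_weights(x_next)
    greedy_next = [greedy_per_state(Q[k], k) for k in range(P)]
    w_u_next = [action_weights(w_x_next, greedy_next[k]) for k in range(P)]
    for k in range(P):
        # V_k(x') with the next state's weights and the greedy action weights at x'.
        V_next = np.einsum('i,j,l,ijl->', w_x_next, w_u_next[0], w_u_next[1], Q[k])
        td = r + GAMMA * V_next - Q[k]
        Q[k] += ALPHA * np.einsum('i,j,l->ijl', w_x, w_u[0], w_u[1]) * td
    x = x_next

print(round(x, 2))   # final position; tends toward n - 1 once moving right is learned
```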

V. EXPERIMENT

5.1 Construction of experimental environment

To demonstrate the performance of the proposed algorithm, two robots connected by a bar are commanded to move through a narrow gate, as shown in Fig. 9. A digital video camera hung above the field provides the environmental information and the locations of the robots. The goal is for the robots to help each other through the gate in a 9 × 5 grid map. There are four representative actions for each agent: up, down, left, and right. The agents perform actions and update the Q-values simultaneously. The experiments are conducted with mobile robots built on the differential-drive platforms shown in Fig. 10. After the information about the locations of the robots is obtained, the main control and learning system interface analyzes it and determines an action according to the Q-learning policy. The action selected for each robot is a command specifying a relative displacement vector, (ΔX, ΔY), from the current location.

Table II. Specification of the symbols of the FCMAC-based Q-learning controller.

Symbol    Specification
s_i       The state received from the environment; i indexes the i-th representative state.
x         The state quantized by the FCMAC.
a_Pj      The j-th representative action of the P-th agent.
u         The continuous action generated by the FCMAC-based Q-learning controller.
w_i^x     The weighting factor of representative state s_i.
w_j^u     The weighting factor of representative action a_j.
r         The reinforcement signal.
α         The learning rate (between 0 and 1).
γ         The discount factor (between 0 and 1).


This kinematic information is transferred into motor control signals for each wheel by a low-level inverse dynamics function coded in the driver. The vision system and the control and learning system interface are shown in Fig. 11. Since an agent must take its partners' temporal states and actions into consideration in the multi-agent system, the dimensionality of the state space and the cardinality of the action sets are expanded multiplicatively with the number of partners, i.e., 45 × 45 states and 4 × 4 actions in the experiments.

5.2 Experimental results

Fig. 12 shows the training data of the two robots cooperating to pass through the gate. The x-axis is the number of trials, and the y-axis is the number of steps the multi-agent system required to pass through the gate successfully. During the first 100 trials, more than 500 steps were still needed to find a way through the gate. When the number of training trials exceeded 100, the average number of steps decreased to about 200. After 450 trials, the robots had found an optimal solution and took about 10 steps to cooperate in passing through the gate.

Photos of the experimental run are shown in Fig. 13. The two robots initially are at the top and the right side of the field. After moving 10 steps (from Step 1 to Step 10), they pass through the gate in the middle of the field and achieve their goal. Even when the agents are not exactly at the centers of the quantized representative states, the continuous valued Q-learning can recognize the variations in the continuous state space and generate the optimal policy to act in the environment.

Fig. 13. Implementation results at each time step: (a) Step 1 through (j) Step 10.

VI. CONCLUSION

This paper proposed an FCMAC approach that calculates the weighting factors of the agents' states and combines them with multi-agent Q-learning to generate continuous output action commands for real robot applications.

Fig. 9. The configuration of the experiments on the multi-agent system.

Fig. 10. The hardware of the robots.

Fig. 11. The vision system and the control and learning system interface.

Fig. 12. The training data of FCMAC-based continuous valued multi-agent Q-learning.


Although conventional discrete valued Q-learning learns well and can achieve convergence in a task with a well-defined discrete quantized state and action space, it performs poorly in real-world applications, because the state of an agent in the real world is continuous and a flexible continuous action output is better than an inflexible discrete one.

The task in this paper was simple and the environment was largely unchanging. The method described above may not learn well if the environment is highly changeable or contains more obstacles. In future work, we will apply our method to more complicated tasks and more changeable environments. The key point, in our view, is the method of updating the Q-values. We will investigate methods for calculating the state value function and observe the effects of different Q-value update schemes, to make our method more practical for real-world applications.

REFERENCES

1. Sutton, R. S. and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge (1998).

2. Gosavi, A., "Reinforcement learning: A tutorial survey and recent advances," INFORMS J. Computing, Vol. 21, No. 2, pp. 178–192 (2009).

3. Busoniu, L., R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Trans. Syst. Man, Cybern. Part C-Applicat. Reviews, Vol. 38, No. 2, pp. 156–172 (2008).

4. Masoumi, B. and M. R. Meybodi, "Learning automata based multi-agent system algorithms for finding optimal policies in Markov games," Asian J. Control, Vol. 14, No. 4, pp. 137–152 (2012).

5. Hwang, K. S., Y. J. Chen, and C. H. Lee, "Reinforcement learning in strategy selection for a coordinated multi-robot system," IEEE Trans. Syst. Man, Cybern. Part A-Syst. Humans, Vol. 37, No. 6, pp. 1151–1157 (2007).

6. Balch, T., "Learning roles: Behavioral diversity in robot teams," in S. Sen (Ed.), Collected Papers from the AAAI-97 Workshop on Multiagent Learning, AAAI Press, Menlo Park, CA (1997).

7. Tan, M., "Multi-agent reinforcement learning: Independent vs. cooperative agents," in Proc. 10th Int. Conf. Machine Learning, Amherst, MA, pp. 330–337 (1993).

8. De Jong, E., "Non-random exploration bonuses for online reinforcement learning," in S. Sen (Ed.), Collected Papers from the AAAI-97 Workshop on Multiagent Learning, AAAI Press, Menlo Park, CA (1997).

9. Xiao, D. and A.-H. Tan, "Self-organizing neural architectures and cooperative learning in a multiagent environment," IEEE Trans. Syst. Man, Cybern. Part B-Cybern., Vol. 37, No. 6, pp. 1567–1580 (2007).

10. Araabi, B. N., S. Mastoureshgh, and M. N. Ahmadabadi, "A study on expertise of agents and its effects on cooperative Q-learning," IEEE Trans. Syst. Man Cybern. Part B-Cybern., Vol. 37, No. 2, pp. 398–409 (2007).

11. Claus, C. and C. Boutilier, "The dynamics of reinforcement learning in cooperative multiagent systems," in Proc. Fifteenth National Conf. Artif. Intell., Madison, WI, pp. 746–752 (1998).

12. Tan, A. H., N. Lu, and D. Xiao, "Integrating temporal difference methods and self-organizing neural networks for reinforcement learning with delayed evaluative feedback," IEEE Trans. Neural Netw., Vol. 19, pp. 230–244 (2008).

13. Ueda, H., T. Naraki, Y. Nasu, K. Takahashi, and T. Miyahara, "State space segmentation for acquisition of agent behavior," in Proc. IEEE/WIC/ACM Int. Conf. Intell. Agent Technol. (IAT'06), pp. 440–446 (2006).

14. Watkins, C. J. C. H. and P. Dayan, "Technical note: Q-learning," Mach. Learn., Vol. 8, No. 3–4, pp. 279–292 (1992).

15. Littman, M. L., "Markov games as a framework for multi-agent reinforcement learning," in Proc. 11th Int. Conf. Machine Learning, San Francisco, CA, pp. 157–163 (1994).

16. Stone, P. and M. Veloso, "Multiagent systems: A survey from the machine learning perspective," Auton. Robot., Vol. 8, No. 3, pp. 345–383 (2000).

17. Fan, Y., G. Feng, and Y. Wang, "Combination framework of rendezvous algorithm for multi-agent systems with limited sensing ranges," Asian J. Control, Vol. 13, No. 2, pp. 283–294 (2011).

18. Peng, K. and Y. Yang, "Collective tracking control for multi-agent system on balanced graphs," Asian J. Control, Vol. 13, No. 4, pp. 505–512 (2011).

19. Albus, J. S., "A new approach to manipulator control: The cerebellar model articulation controller (CMAC)," Trans. ASME, J. Dynam. Syst. Meas., Control, Vol. 97, pp. 220–227 (1975).

20. Lin, C.-M. and T.-Y. Chen, "Self-organizing CMAC control for a class of MIMO uncertain nonlinear systems," IEEE Trans. Neural Netw., Vol. 20, No. 9, pp. 1377–1384 (2009).

21. Hwang, K.-S. and C.-S. Lin, "Smooth trajectory tracking of three-link robot: A self-organizing CMAC approach," IEEE Trans. Syst. Man Cybern. Part B-Cybern., Vol. 28, No. 5 (1998).

22. Takahashi, Y., M. Takeda, and M. Asada, "Continuous valued Q-learning for vision-guided behavior acquisition," in Proc. 1999 IEEE/SICE/RSJ Int. Conf. Multisensor Fusion and Integration for Intelligent Systems, pp. 255–260 (1999).

23. Su, S.-F., Z.-J. Lee, and Y.-P. Wang, "Robust and fast learning for fuzzy cerebellar model articulation controllers," IEEE Trans. Syst. Man Cybern. Part B-Cybern., Vol. 36, No. 1, pp. 203–208 (2006).

24. Brown, M. and C. J. Harris, "A perspective and critique of adaptive neurofuzzy systems used for modeling and control applications," Int. J. Neural Syst., Vol. 6, No. 2, pp. 197–220 (1995).

25. Jou, C.-C., "A fuzzy cerebellar model articulation controller," in Proc. IEEE Int. Conf. Fuzzy Syst., pp. 1171–1178 (1992).

Kao-Shing Hwang is a professor in the Department of Electrical Engineering at National Sun Yat-sen University and an adjunct professor in the Department of Electrical Engineering at National Chung Cheng University, Taiwan. He received the M.M.E. and Ph.D. degrees in Electrical and Computer Engineering from Northwestern University, Evanston, IL, U.S.A., in 1989 and 1993, respectively. Since August 1993 he has been with National Chung Cheng University, Taiwan. He was the chairman of the Electrical Engineering Department (2003–2006) and the chairman of the Opto-Mechatronics Institute of the university (2010–2011). He is also in charge of establishing the standardized exchange platform for heterogeneous educational resources under the Ministry of Education of Taiwan. His research interests include methodologies and analysis for various intelligent robot systems, machine learning, embedded system design, and ASIC design for robotic applications.

Yu-Jen Chen received the B.S. degree in electrical engineering from Tatung Institute of Technology, Taipei, Taiwan, in 1994, and the M.S. and Ph.D. degrees in electrical engineering from National Chung Cheng University, Chia-Yi, Taiwan, in 1997 and 2009, respectively. From 2004 to 2009, he was an adjunct lecturer in the Center for General Education, National Chung Cheng University. Since August 2010 he has been an assistant professor in the Department of Electrical Engineering at National Chung Cheng University. His research interests include machine learning, robotics, neural networks, and embedded systems.

Wei-Cheng Jiang was born in Taiwan in 1985. He received the B.S. degree in Computer Science and Information Engineering and the M.S. degree in Electro-Optical and Materials Science from National Formosa University. He is currently working toward the Ph.D. degree in the Department of Electrical Engineering, National Chung Cheng University, Chiayi, Taiwan. His current research interests are neural networks, learning systems, fuzzy control, and mobile robots.


Tzung-Feng Lin received the B.S. degree from the Department of Mechanical and Marine Engineering, National Taiwan Ocean University, Keelung, Taiwan, in 2002, and the M.S. degree from the Graduate Institute of Opto-Mechatronics, National Chung Cheng University, Chia-Yi, Taiwan, in 2004. He was a smartphone system software engineer at Arimacomm Corp., Taipei, Taiwan, during 2006–2007 and at ASUSTeK Corp., Taipei, Taiwan, during 2007–2008. Since 2009, he has been a senior Linux system software engineer at DeviceVM Inc., Taiwan Branch, Taipei, Taiwan. His research interests include machine learning, robotics, and Linux instant-on desktop systems.
