
Cooperative Reinforcement Learning: Brief Survey and Application to Bio-insect and Artificial Robot Interaction

Ji-Hwan Son and Hyo-Sung Ahn

Department of Mechatronics, Gwangju Institute of Science and Technology (GIST)

1 Oryong-dong, Buk-gu, Gwangju 500-712, Korea. E-mail: [email protected], [email protected]

Abstract— In this paper, we address our on-going research called BRIDS, which is an attempt to formulate bio-insect and artificial robot interaction based on reinforcement learning. First we briefly review relevant research works in this field. Since cooperation among mobile agents is a key technology, we examine cooperative reinforcement learning. Specifically, we briefly review the concept of area of expertise in the cooperative learning area. Bio-insect and artificial robot interaction has been studied in Leurre [1]. Our research, however, has a key difference from Leurre in that it aims to drive a bio-insect towards a desired point by coordination of a group of mobile robots. To the authors' best knowledge, driving a bio-insect towards a desired point based on fully autonomous coordination of multiple mobile agents has not been studied in existing publications.

Index Terms— Reinforcement learning, Bio-insect and artificial Robot Interaction on the base of Distributed Systems (BRIDS), Area of expertise

I. Introduction

In the mobile robot area, many researchers have studied robot intelligence based on dynamics. In nature, however, it is not easy to define dynamics due to complexity and uncertainty. Thus, it is challenging to apply robot intelligence theories to actual experimental environments. Our research goal is to answer the question: how can machine learning theory be applied to control a real animal or bio-insect? Indeed, it is a hard task to understand, predict and control bio-insects and animals using artificial intelligence. Our research is inspired by Leurre [1], a project to make a mixed society composed of cockroaches and robots. An interesting part of that research is that they understand the flock of cockroaches, so the robots can enter the flock and behave the same as cockroaches. In this paper, we introduce our on-going research, which is an attempt to formulate bio-insect and artificial robot interaction based on reinforcement learning. In Section II, we address our research and explain why it is important. In Section III, we explain cooperative reinforcement learning, which is a tool for interacting with the bio-insect. In Section IV, we introduce the area of expertise (AOE) in the cooperative reinforcement learning area. The concept of AOE can be found in [2]; AOE enables a group of mobile robots to learn the environment quickly and reliably. In Section V, we introduce our platform and research methods. In Section VI, we provide a conclusion.

II. Bio-insect and Artificial Robot Interaction

In this part, we introduce the motivation of this research.

A. Motivation and Goals

Our research seeks a study on bio-insect and artificial robot interaction to establish a new architectural framework for improving the intelligence of service robots.

Corresponding author: Hyo-Sung Ahn

One of the main research goals is to drive the bio-insect towards a desired point by coordination of a group of mobile robots. The research includes the establishment of hardware/software for bio-insect and artificial robot interaction and the synthesis of distributed sensing, distributed decision, and distributed control systems for building a community composed of bio-insects and artificial robots. Fig. 1 explains how the subsystems are composed and connected.

Fig. 1. Flowchart of the proposed community composed of distributed decision, distributed control and distributed sensing. Subsystems are connected in a feedback loop manner.

Distributed sensing is for the recognition and detection of mobile bio-systems, and for the construction of a wireless sensor network to locate the artificial robots and the bio-insect. Distributed decision contains the repetitive learning of the reactions of the bio-insect to certain forms of inputs. It aims at finding which commands and actuations drive the bio-insect towards a desired point or drive it away from the target position. A reinforcement learning algorithm will be designed to generate penalties or rewards on a set of actions. Distributed decision stores into memory the state of the current action and its outputs, which are closely associated with future events. It then selects commands and outcomes of past actions for the current closed-loop learning. Thus, the synthesis of a recursive learning algorithm on the basis of storage and selection procedures along the learning domain will be of main interest in distributed decision. Distributed control includes the control and deployment of multiple mobile robots via coordination, and the design of an optimally distributed control algorithm based on that coordination. It learns how the bio-insect reacts upon the relative speed, position and orientation between the multiple mobile robots and the bio-insect. The ultimate goal of this research is thus to establish a new theoretical framework for robot learning via a recursive sequential procedure of the distributed sensing, decision and control systems. For convenience, our on-going research is called Bio-insect and artificial Robot Interaction on the base of Distributed Systems (BRIDS). Fig. 2 illustrates the architecture of our research (BRIDS).

The research on bio-insect and artificial robot interaction will provide a fundamental theoretical framework for human and robot interaction. The applications include service robots, cleaning robots, intelligent monitoring systems, intelligent buildings, and ITS.


Fig. 2. Architecture of BRIDS: the figure shows how the individual subsystems are related. The first step is to construct distributed sensing, distributed decision and distributed control systems. Then, we make a closed system based on a feedback loop for learning and exchange of knowledge for sharing information.

This research is for the control of model-free bio-systems; thus it could be used for the control of complex systems, such as metropolitan transportation control and environmental monitoring, which cannot be readily modeled in advance. The result can also be used to attract and expel harmful insects such as cockroaches via interaction and intra-communication.

B. Related Works

1) Leurre: Leurre is a project on building and controlling mixed societies composed of insects and robots. Their main goal is to study how to develop and control such a mixed society. As seen in Fig. 3, cockroaches and Insbots are together in the testbed. To co-exist with cockroaches, the Insbot has some prior knowledge of their behavior and pheromones [3]. In a normal environment, cockroaches run away from the Insbot when it moves. The Insbot is controlled over a wireless network, and it is equipped with small-size IR proximity sensors, light sensors, a linear camera, a temperature sensor and sensors to detect pheromones [4]. The goal of this robot is to behave like an insect and to be able to influence the society.

Fig. 3. Cockroaches and Insbot robot [1].

2) Locomotion control of a bio-robotic system via electric stimulation: In [5], they show the direct control of a cockroach via electric stimulation. As seen in Fig. 4, they made a trackball-computer interface to observe the movement induced by a stimulus generator. The stimulus generator is connected to the cockroach through a portable stimulation unit that delivers pulse signals, and the resulting movement is measured by the trackball-computer interface. Based on this expertise, they made an autonomous electronic backpack that uses two photosensors to make a cockroach follow a black line.

Fig. 4. The trackball-computer interface [5].

3) Pheromone-guided mobile robots: In [6], they propose a mobile robot which detects a female silkworm moth pheromone. Some insects use pheromones for communication or copulation. When a male silkworm moth detects the pheromone of a female silkworm moth, it shows a characteristic behavior to find the female. This behavior is based on the silkworm moth's olfactory sensorimotor system, so they studied the process and behavior patterns based on biological knowledge. They made the PheGMot-III robot (Fig. 5) to reproduce a similar behavior. It has a pheromone sensor to detect pheromones and is the same size as a male silkworm moth.

Fig. 5. The third pheromone-guided mobile robot, PheGMot-III, compared to a 10 yen coin. The wingspan of a silkworm moth is about 4 cm, so PheGMot-III is as small as a silkworm moth [6].

4) Roach Bot¹: Roach Bot (Fig. 6, left) is a robot controlled by a cockroach. As seen in Fig. 6 (right), the cockroach is placed on top of a trackball. When an obstacle is detected by the sensors of the Roach Bot, it gives a stimulus using electric light. The cockroach is a nocturnal insect that avoids light; when it is exposed to light, it runs away. These movements are caught by the trackball sensor, and the robot moves by following the movement of the cockroach.

Fig. 6. Cockroach and Roach Bot.¹

From the literature survey, we could not find machine learning theory applied to the bio-insect and artificial robot interaction area. Some

¹Reference from: http://www.conceptlab.com/roachbot/.


papers have proposed direct control of insects or imitation of their behavior. In our research, we use cooperative reinforcement learning theory to control the insect. In the next section we give a brief review of cooperative reinforcement learning.

III. Cooperative Reinforcement Learning

Reinforcement learning [7], [8] is a very popular method in the machine learning area. In this section, we introduce reinforcement learning, Q-learning and cooperative reinforcement learning for our research.

A. Reinforcement Learning

This learning algorithm uses a reward signal that is based on trial-and-error interaction. The algorithm is applied to a Markov Decision Process (MDP), which consists of a discrete set of environment states S, a discrete set of agent actions A, and a set of scalar reinforcement signals in the range between 0 and 1. When an agent explores its environment, it always tries to maximize its rewards. To do so, it needs to find a policy π mapping states to actions. The goal in the MDP is to maximize some cumulative function of the reward (reinforcement signal). Fig. 7 illustrates the interaction between state, action, and reward.

Fig. 7. The agent-environment interaction in a discrete, finite world, choosing one from a finite collection of actions at every time step [8].

B. Q-learning

Q-learning [9] is an important part of reinforcement learning. It updates value estimates iteratively. First, it initializes a table Q(s, a) for each state s and action a, observes the current state s, and selects a discount factor γ (0 ≤ γ ≤ 1). It then selects an action a and executes it, receives an immediate reward r, observes the new state s′, and updates the table entry Q(s, a) by:

Q(s, a) = Q(s, a) + α ( r + γ max_{a′} Q(s′, a′) − Q(s, a) )   (1)

where α is the learning rate (0 ≤ α ≤ 1), which gives different weights to old and new data.
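As an illustration, a minimal tabular Q-learning sketch following Eq. (1) is given below. The one-dimensional corridor environment, the action set and the ε-greedy exploration are our assumptions for illustration only; they are not the BRIDS setup.

import random
from collections import defaultdict

# Minimal tabular Q-learning following Eq. (1).
# The 1-D corridor environment below is an assumed toy problem.
N_STATES = 5          # states 0..4, goal at state 4
ACTIONS = [-1, +1]    # move left / move right
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

Q = defaultdict(float)  # Q[(state, action)], initialized to 0

def step(s, a):
    # Toy transition: reward 1 on reaching the goal, 0 otherwise.
    s_next = min(max(s + a, 0), N_STATES - 1)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r, s_next == N_STATES - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s_next, r, done = step(s, a)
        # Eq. (1): Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
        best_next = max(Q[(s_next, x)] for x in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

After enough episodes, the greedy policy with respect to Q moves right in every state, which is the desired behavior in this toy problem.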

C. Cooperative Reinforcement Learning

We can easily answer the question of why cooperative learning is necessary in the machine learning area [10]. In bio-insect and artificial robot interaction, each agent needs more information. The problem is that if an agent works in a complicated environment with many variables, it will need more time to obtain useful information. Thus, if decisions are based on cooperation of a group of mobile robots, learning becomes faster. In cooperative reinforcement learning [10], different types of learning algorithms exist. Most methods are based on game theory using Nash equilibrium and zero-sum games, on hierarchical (layered) architectures, or on combinations with other learning algorithms. In this paper, we focus on cooperative reinforcement learning methods. In [11], they consider two agents that have diametrically opposed goals. This allows the use of a single reward function that one agent tries to maximize and the other, called the opponent, tries to minimize. This method is called a two-player zero-sum Markov game, or the minimax-Q learning algorithm, which can be summarized as follows.

V(s) = max_{π∈PD(A)} min_{o∈O} ∑_{a∈A} Q(s, a, o) π_a   (2)

Q(s, a, o) = R(s, a, o) + γ ∑_{s′} T(s, a, o, s′) V(s′)   (3)

Eq. (2) is the value of a state s in the Markov game, and (3) is the expected reward for taking action a when the opponent chooses action o. In [12], they use average reward-based learning, such as a Monte-Carlo algorithm, for task-level multirobot systems organized in two levels: action-level systems perform missions based on reactive behavior, whereas task-level systems perform missions at a higher level by decomposing them into subtasks. In [13], they use a two-level reinforcement learning with communication (2LRL) method. In the first level, agents learn how to select their target, and in the second level they select the action directed to that target.

Fig. 8. Flowchart of action selection in 2LRL-1.2 [13].

Here, different Q-tables (QFollow, QPrey, QOwnp, QOwnP, QOther) are used, each with its own purpose and learning mechanism. An interesting part of this research is that the standard Q-learning algorithm is extended to multi-agent environments. This two-level decision mechanism is tested on catching small and big prey in a discrete grid-world environment. In [14], they present an integrated sequential Q-learning and genetic algorithm. The sequential Q-learning equation is given by (4), where ε is the learning rate and µ is the discount rate. The three main genetic algorithm operators are selection, crossover and mutation. Using these operators, genes are changed and evaluated by a fitness function, which indicates which genes give good results. Fig. 9 shows the logical flow between sequential reinforcement learning and the genetic algorithm.

Q(s, a^1_j, a^2_j, ···, a^i_j) = (1 − ε) Q_i(s, a^1_j, a^2_j, ···, a^i_j) + ε ( r + µ max_{a^1, a^2, ···, a^i} Q[s′, a^1, a^2, ···, a^i] )   (4)
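To make the joint-action form of (4) concrete, a small sketch for two agents is given below; the action set, the table layout and the function names are our assumptions, and the genetic-algorithm part of [14] (selection, crossover, mutation) is not included.

import itertools

# Joint-action Q-table indexed by (state, a1, a2), illustrating Eq. (4)
# for two agents; EPSILON is the learning rate and MU the discount rate.
EPSILON, MU = 0.2, 0.9
ACTIONS = ["stay", "push"]            # assumed toy action set per agent
Q = {}                                # Q[(s, a1, a2)] -> value

def q(s, a1, a2):
    return Q.get((s, a1, a2), 0.0)

def sequential_q_update(s, a1, a2, r, s_next):
    # Eq. (4): blend the old value with the reward plus the discounted
    # maximum over all joint actions in the next state.
    best_next = max(q(s_next, b1, b2)
                    for b1, b2 in itertools.product(ACTIONS, ACTIONS))
    Q[(s, a1, a2)] = (1 - EPSILON) * q(s, a1, a2) + EPSILON * (r + MU * best_next)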

In [15], they propose a more developed Sparse Cooperative Q-learning that comes in agent-based and edge-based forms, built on the structure of a coordination graph (CG). Fig. 10 shows how each neighbor interacts through Qi and Qij. The agent-based decomposition method computes Qi by equation (5) from agent i and its neighbor agents j. The edge-based decomposition method updates Qij by equation (6) using the edge-based update, while the agent-based update is given by equation (7).

Q_i(s_i, a_i) = (1/2) ∑_{j∈Γ(i)} Q_{ij}(s_{ij}, a_i, a_j)   (5)


Fig. 9. The integration scheme of RL and GA [14].

Fig. 10. Agent-based decomposition (left) and edge-based decomposition (right) [15].

Fig. 11. Edge-based update (left) and agent-based update (right) [15].

Q_{ij}(s_{ij}, a_i, a_j) := Q_{ij}(s_{ij}, a_i, a_j) + α [ R_i(s, a)/|Γ(i)| + R_j(s, a)/|Γ(j)| + γ Q_{ij}(s′_{ij}, a*_i, a*_j) − Q_{ij}(s_{ij}, a_i, a_j) ]   (6)

Q_{ij}(s_{ij}, a_i, a_j) := Q_{ij}(s_{ij}, a_i, a_j) + α ∑_{k∈{i,j}} [ R_k(s, a) + γ Q_k(s′_k, a*_k) − Q_k(s_k, a_k) ] / |Γ(k)|   (7)

where s is the state, a is the joint action, and Γ(k) denotes the set of neighbors of agent k; a small illustrative sketch of (5) and (7) is given at the end of this section. From the literature search, we can find many papers in the field of cooperative reinforcement learning. In [16], they show many methods and problem domains. The recent survey article [17] also presents and classifies many cooperative learning algorithms. However, we could not find any survey on the area of expertise concept, so in the next section we introduce the area of expertise (AOE).
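A small numerical sketch of the agent-based decomposition (5) and the agent-based update (7) follows, reduced to two agents i and j sharing a single edge of the coordination graph. The action set and rewards are toy assumptions, and the joint maximization is done by brute force instead of the max-plus algorithm used in [15].

import itertools

# Sparse cooperative Q-learning reduced to two agents i, j on one edge.
ACTIONS = [0, 1]                 # assumed toy action set per agent
ALPHA, GAMMA = 0.1, 0.9
Qij = {}                         # edge Q-table: Qij[(s, ai, aj)]

def qij(s, ai, aj):
    return Qij.get((s, ai, aj), 0.0)

def agent_q(s, ai, aj):
    # Eq. (5): Qi(si, ai) = 1/2 * sum over neighbouring edges of Qij.
    # With a single edge, each agent's local value is half the edge value.
    return 0.5 * qij(s, ai, aj)

def best_joint(s):
    # Brute-force argmax over joint actions (max-plus in the general case).
    return max(itertools.product(ACTIONS, ACTIONS), key=lambda a: qij(s, *a))

def agent_based_update(s, ai, aj, r_i, r_j, s_next):
    # Eq. (7): each agent k in {i, j} contributes its temporal-difference
    # error divided by its number of neighbours (here |Gamma(k)| = 1).
    ai_star, aj_star = best_joint(s_next)
    td_i = r_i + GAMMA * agent_q(s_next, ai_star, aj_star) - agent_q(s, ai, aj)
    td_j = r_j + GAMMA * agent_q(s_next, ai_star, aj_star) - agent_q(s, ai, aj)
    Qij[(s, ai, aj)] = qij(s, ai, aj) + ALPHA * (td_i + td_j)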

IV. Area of Expertise

In this section, we introduce the area of expertise (AOE) in the cooperative reinforcement learning area. In the real world, people have one or more merits among their capabilities. In reinforcement learning, AOE has been introduced recently. The expertise concept of [2] explains why AOE is critical in cooperative reinforcement learning. A similar approach, called advice-exchange, was studied in [18]. The difference between advice-exchange and area of expertise is that advice-exchange is based on previous experience, whereas area of expertise is cooperation in learning based on where each agent has more expertise. So in this section, we explain the expertise measure, the meaning of area of expertise, and weight strategy sharing (WSS).

A. Area of Expertise

Area of expertise shows which agent has more expertise in a part of the knowledge domain. In [2], they explain two different aspects of expertise. From a behavioral, knowledge-based point of view, an agent with better and more rational behavior has more expertise; from a structural point of view, an agent with better and more reliable knowledge in some region has more expertise.

B. Expertise Measure

In [2], they introduce methods for expertise evaluation. In the cooperative reinforcement learning area, we do not know in advance which agent is more expert or which agent can find an optimal action. An expertise measure helps to calculate expertise. The various expertise measures of [2], [19], [20] show how to calculate or process the reward (reinforcement signal) of each agent. As seen in Table I, there are different methods for measuring expertise, each computing the reinforcement signals in a different way [2], [19], [20]. Each method will show different results in different environments. For this reason, if we want to adopt an expertise measure, it should be chosen based on the environment.
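As a concrete illustration, a minimal sketch of two of the measures listed in Table I, the normal and the certainty measures, is given below; the reward history, Q-values and temperature are toy assumptions.

import math

def expertness_normal(rewards):
    # "Normal" measure from Table I: the algebraic sum of past rewards.
    return sum(rewards)

def expertness_certainty(q_values, T=1.0):
    # "Certainty" measure from Table I: softmax probability of the greedy
    # action in one state; q_values maps actions to Q(x, a).
    exps = {a: math.exp(q / T) for a, q in q_values.items()}
    return max(exps.values()) / sum(exps.values())

# Toy usage with assumed data: a reward history and one state's Q-values.
print(expertness_normal([0.0, 1.0, -0.5, 1.0]))           # 1.5
print(expertness_certainty({"left": 0.2, "right": 1.4}))  # about 0.77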

C. Weight Strategy Sharing (WSS)

Weight strategy sharing (WSS) is also a method to find the area of expertise among multiple agents. WSS shows how to turn expertise values into weights. A similar weight concept in reinforcement learning is found in [21], where the reward is updated by taking into account other agents' location information:

R = ( w · R_w + ∑_{i=1}^{n} R_i ) / ( w + n )   (8)

where w is the agent's reward weight, R_w is the agent's own reward due to foraging food, and the summation is over the rewards R_i of the n agents located in the agent's visual range. Equation (8) shows how much emphasis is placed on the agent's own reward compared to that of the other agents.
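A direct reading of (8) as code is given below; the function and variable names are illustrative only.

def weighted_reward(w, own_reward, neighbour_rewards):
    # Eq. (8): blend the agent's own weighted foraging reward with the
    # rewards of the n agents currently in its visual range.
    n = len(neighbour_rewards)
    return (w * own_reward + sum(neighbour_rewards)) / (w + n)

# Toy usage: own reward 1.0 weighted by w = 3, two visible neighbours.
print(weighted_reward(3.0, 1.0, [0.5, 0.0]))  # (3.0 + 0.5) / 5 = 0.7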

Next, we review the weight strategy sharing algorithm [2], [19], [20]. For area of expertise, the weight W_{ipj} is calculated by

W_{ipj} = 1 − α_i,  if j = i;
W_{ipj} = α_i ( e_{pj} − e_{pi} ) / ∑_{k∈E_{pi}} ( e_{pk} − e_{pi} ),  if j ∈ E_{pi};
W_{ipj} = 0,  otherwise.   (9)

where i indicates the learner agent, j is an index over the other agents, p stands for different local parts of the Q-table, e_{pj} is the expertness of agent j on portion p of the Q-table, and α_i indicates how much each agent relies on the others. Reference [2] explains a more advanced case in which the agent knows the AOE. In [20], they perform an experiment using two Alice robots. An interesting part of this experiment is that each robot has a blinded sensor, called AliceBL (blinded left sensor) and AliceBR (blinded right sensor). In this situation, they adopt area of expertise to obtain good performance. In [22], they propose adaptive weight strategy sharing (AdpWSS) (10) and a regret measure (11).


TABLE I. Expertise measures [2], [19], [20].

Method — Formulation
Normal: e^{Nrm}_i = ∑_{t=1}^{now} r_i(t)
Absolute: e^{Abs}_i = ∑_{t=1}^{now} |r_i(t)|
Positive: e^{P}_i = ∑_{t=1}^{now} r⁺_i(t), where r⁺_i(t) = 0 if r_i(t) ≤ 0, and r_i(t) otherwise
Negative: e^{N}_i = ∑_{t=1}^{now} r⁻_i(t), where r⁻_i(t) = 0 if r_i(t) > 0, and −r_i(t) otherwise
Gradient: e^{G}_i = ∑_{t=c}^{now} r_i(t)
Average Move: e^{AM}_i = ( ∑_{trial=1}^{n_trial} m_i(trial) / n_trial )^{−1}
Entropy: e^{Ent}_i(x) = −∑_{a∈actions} Pr(a|x) ln(Pr(a|x))
Certainty: e^{Cer}_i(x) = max_{a_k∈actions} exp(Q(x, a_k)/T) / ∑_{a_k∈actions} exp(Q(x, a_k)/T)

Fig. 12. AliceBL observes all three states the same [20].

Equation (10) is the probability of sharing knowledge with another agent, based on the weight results of each agent:

Prob_s = 0,  if |w_i − w_j| ≤ Th_1;
Prob_s = 1,  if |w_i − w_j| ≥ Th_2;
Prob_s = ( |w_i − w_j| − Th_1 ) / ( Th_2 − Th_1 ),  otherwise.   (10)

where Th_1 and Th_2 are thresholds and w_i, w_j are the weight results of agents i and j. The regret measure can be used as an alternative to strategy sharing. Eq. (11) is based on uncertainty bounds of the two actions, where lb(·) and ub(·) denote the lower and upper bounds of the estimated state-action value, computed from (12).

regret(s_{t+1}) = −[ lb(Q(s_{t+1}, a_1)) − ub(Q(s_{t+1}, a_2)) ]   (11)

Bound(Q_t(s_t, a)) = Q_T(s_T, a) ± t_{α/2, k−1} · s/√k   (12)

Using this method, the expertness is calculated by the following equation:

expertness_m(s_t) = 1 − 1 / ( 1 + exp(−b · regret(s_t)) )   (13)

In [23], they use the expertise sharing concept for designing a bidding agent for electricity markets.
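To summarize this subsection, a small sketch of the WSS weights (9) and the AdpWSS sharing probability (10) is given below. The expertness values and thresholds are toy assumptions, and E_{pi} is read here as the set of agents more expert than agent i on portion p, which is our interpretation of the notation.

def wss_weights(i, expertness, alpha_i):
    # Eq. (9): weights W_ipj assigned by learner agent i to every agent j
    # on one portion p of the Q-table; `expertness` maps agent id -> e_pj.
    e_i = expertness[i]
    better = {j: e for j, e in expertness.items() if e > e_i}
    total = sum(e - e_i for e in better.values())
    weights = {}
    for j in expertness:
        if j == i:
            weights[j] = 1.0 - alpha_i
        elif j in better and total > 0:
            weights[j] = alpha_i * (expertness[j] - e_i) / total
        else:
            weights[j] = 0.0
    return weights

def adpwss_share_probability(w_i, w_j, th1, th2):
    # Eq. (10): probability of sharing knowledge, driven by the weight gap.
    gap = abs(w_i - w_j)
    if gap <= th1:
        return 0.0
    if gap >= th2:
        return 1.0
    return (gap - th1) / (th2 - th1)

# Toy usage with assumed expertness values and thresholds.
print(wss_weights("a", {"a": 2.0, "b": 5.0, "c": 8.0}, alpha_i=0.6))
print(adpwss_share_probability(0.9, 0.2, th1=0.1, th2=0.5))  # 1.0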

D. Similar Concept in the Neural Network Area

A weight concept is also found in the neural network area. Each neuron is connected through weights, so it is also important to choose and obtain efficient weights in the long run. As seen in Fig. 13, expert networks are connected by a gating network. Fig. 13, from [24], shows a modular connectionist architecture and how the networks are connected. Each weight is changed by Q-learning based on the neural network, and each expert network can be seen as analogous to an area of expertise. In [25], they explain the method to adjust the weights of the gating networks based on the backpropagation algorithm.

Fig. 13. A modular connectionist architecture [24].
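A minimal sketch of the gating idea, not the exact architecture of [24], [25], is given below: a softmax gating network weights the outputs of several expert modules; the network sizes and random parameters are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Toy mixture of experts: each "expert" is a linear map, and a softmax
# gating network decides how much weight each expert's output receives.
n_experts, n_in, n_out = 3, 4, 2
experts = [rng.normal(size=(n_out, n_in)) for _ in range(n_experts)]
gate_w = rng.normal(size=(n_experts, n_in))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mixture_output(x):
    g = softmax(gate_w @ x)                    # gating weights, sum to 1
    outs = np.stack([W @ x for W in experts])  # each expert's output
    return (g[:, None] * outs).sum(axis=0)     # gated combination

print(mixture_output(rng.normal(size=n_in)))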

V. Experimental Platform

For BRIDS, we use the e-puck robot². It was developed at the Swiss Federal Institute of Technology in Lausanne (EPFL) for education purposes. It has a dsPIC 30f6014 microcontroller, 8 IR proximity sensors, a 3D accelerometer, 3 microphones for finding direction, a VGA color camera, an IR receiver for remote control and a Bluetooth module for wireless connection. The robot is based on open hardware so that everyone can easily use and develop it.

Fig. 14. E-Puck robots.

For control of multiple e-puck robots, we use a Bluetooth access point that allows multiple serial connections. The e-puck robot is equipped with the sercom protocol, which makes it easy to control over a Bluetooth serial connection. To find the locations of each e-puck robot and the insect, we use image processing, as shown in Fig. 15. This method can easily obtain the locations of the bio-insect and the mobile robots. It is, however, hard to distinguish the individual e-puck robots. To solve this problem, we will use landmarks composed of different colors and shapes. The landmarks will help us to find a robot's location and heading angle. The selection of the insect is also an important part of our experiment. First of all, the size of the insect should be about the same as the e-puck robot. It also needs physical strength, a long life span, and a clear response to the robots' movements.

²Reference from: http://www.e-puck.org.


Fig. 15. Experimental setup. The platform is composed of a Bluetooth access point and a main computer for control and image processing.

For this reason, cockroaches are popularly used in related research. But the cockroach is so fast that it is not easy to control it with an e-puck robot. Because of this, we will test various species of insects. After choosing the bio-insect, three robots will drive the insect to a goal point; the robots learn the pattern of the bio-insect and its reactions to particular actions. While the robots work, a camera detects the movement of the insect. By defining rewards and states, and learning a policy, we can repeat the learning procedure until the desired goal is achieved.
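As an illustration of the last point, a hedged sketch of how rewards and states might be defined for the driving task is given below. The grid discretization, the goal region, and the progress-based reward shaping are our assumptions for illustration, not a finalized BRIDS design.

import math

GOAL = (0.0, 0.0)   # assumed goal point in the arena frame (metres)
CELL = 0.05         # assumed grid resolution for discretizing positions

def state(insect_xy, robot_xys):
    # Discretize the insect and robot positions into a grid-based state tuple.
    cells = [(round(x / CELL), round(y / CELL)) for x, y in [insect_xy] + robot_xys]
    return tuple(cells)

def reward(prev_insect_xy, insect_xy):
    # Assumed shaping: positive reward when the insect moves closer to the
    # goal, negative when it moves away, and a bonus inside the goal region.
    d_prev = math.dist(prev_insect_xy, GOAL)
    d_now = math.dist(insect_xy, GOAL)
    if d_now < 0.05:
        return 10.0
    return d_prev - d_now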

VI. Conclusion

In this paper, we have introduced our on-going research called BRIDS, along with some background theories and knowledge. Bio-insect and artificial robot interaction will play an important part in improving robot intelligence. We propose using distributed systems, which are based on distributed decision, distributed control and distributed sensing. Reinforcement learning can learn from events through reward-driven trial and error without explicit supervision, which is similar to the learning of humans or animals, so we apply the reinforcement learning algorithm to robot intelligence. There are many cooperative reinforcement learning algorithms; of particular interest is the area of expertise (AOE) concept. AOE extends cooperative reinforcement learning to multi-agent robot intelligence. The main contribution of this paper is to propose using AOE and cooperative reinforcement learning to drive a bio-insect towards a desired point based on fully autonomous coordination of multiple mobile agents. In our future publications, we will report the actual hardware testbed and actual experimental test results.

VII. Acknowledgment

The authors would like to acknowledge the financial support from the Korea Science and Engineering Foundation (KOSEF, Project No. R01-2008-000-10031-0).

References

[1] G. Caprari, A. Colot, R. Siegwart, J. Halloy, and J.-L. Deneubourg, "Building Mixed Societies of Animals and Robots," IEEE Robotics & Automation Magazine.
[2] B. N. Araabi, S. Mastoureshgh, and M. N. Ahmadabadi, "A Study on Expertise of Agents and Its Effects on Cooperative Q-Learning," IEEE Transactions on Systems, Man and Cybernetics, Part B, vol. 37, no. 2, pp. 398–409, 2007.
[3] R. Jeanson, C. Rivault, J.-L. Deneubourg, S. Blanco, R. Fournier, C. Jost, and G. Theraulaz, "Self-organised aggregation in cockroaches," Animal Behaviour, vol. 69, pp. 169–180, 2005.
[4] G. Caprari, A. Colot, R. Siegwart, J. Halloy, and J.-L. Deneubourg, "Insbot: Design of an Autonomous Mini Mobile Robot Able to Interact with Cockroaches," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2004), New Orleans, 2004, pp. 2418–2423.
[5] R. Holzer and I. Shimoyama, "Locomotion control of a bio-robotic system via electric stimulation," in Proceedings of the 1997 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '97), Grenoble, France, 1997, vol. 3, pp. 1514–1519.
[6] Y. Kuwana, S. Nagasawa, I. Shimoyama, and R. Kanzaki, "Synthesis of the pheromone-oriented behaviour of silkworm moths by a mobile robot with moth antennae as pheromone sensors," Biosensors and Bioelectronics, vol. 14, pp. 195–202, 1999.
[7] L. P. Kaelbling, M. Littman, and A. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[9] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.
[10] M. Tan, "Multi-agent reinforcement learning: Independent vs. cooperative agents," in Proceedings of the Tenth International Conference on Machine Learning, Amherst, 1993, vol. 1, pp. 330–337.
[11] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the Eleventh International Conference on Machine Learning, pp. 157–163, 1994.
[12] P. Tangamchit, J. Dolan, and P. Kosla, "The necessity of average rewards in cooperative multirobot learning," 2002.
[13] G. Erus and F. Polat, "A layered approach to learning coordination knowledge in multiagent environments," Applied Intelligence, vol. 27, no. 3, pp. 249–267, 2007.
[14] Y. Wang and C. W. de Silva, "A machine-learning approach to multi-robot coordination," Engineering Applications of Artificial Intelligence, vol. 21, no. 3, pp. 470–484, 2008.
[15] J. R. Kok and N. Vlassis, "Collaborative Multiagent Reinforcement Learning by Payoff Propagation," The Journal of Machine Learning Research, vol. 7, pp. 1789–1828, 2006.
[16] L. Panait and S. Luke, "Cooperative Multi-Agent Learning: The State of the Art," Autonomous Agents and Multi-Agent Systems, vol. 11, no. 3, pp. 387–434, 2005.
[17] L. Busoniu, R. Babuska, and B. De Schutter, "A Comprehensive Survey of Multiagent Reinforcement Learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 38, no. 2, pp. 156–172, 2008.
[18] L. Nunes and E. Oliveira, "Advice-Exchange Amongst Heterogeneous Learning Agents: Experiments in the Pursuit Domain," poster abstract, Autonomous Agents and Multiagent Systems (AAMAS 03), 2003.
[19] M. N. Ahmadabadi and M. Asadpour, "Expertness based cooperative Q-learning," IEEE Transactions on Systems, Man and Cybernetics, Part B, vol. 32, no. 1, pp. 66–76, 2002.
[20] M. N. Ahmadabadi, A. Imanipour, B. N. Araabi, M. Asadpour, and R. Siegwart, "Knowledge-based Extraction of Area of Expertise for Cooperation in Learning," in 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3700–3705, 2006.
[21] M. E. El-Telbany, A. H. Abdel-Wahab, and S. I. Shaheen, "Learning spatial and expertise distribution coordination in multiagent systems," 2001, vol. 2.
[22] P. Ritthipravat, T. Maneewarn, J. Wyatt, and D. Laowattana, "Comparison and Analysis of Expertness Measure in Knowledge Sharing Among Robots," Lecture Notes in Computer Science, vol. 4031, p. 60, 2006.
[23] A. Nouri, A. Fazeli, and A. Rahimi Kian, "Designing a bidding agent for electricity markets: a multi agent cooperative learning approach," accepted in the International Federation of Automatic Control (IFAC), 2008.
[24] C. W. Anderson and Z. Hong, "Reinforcement learning with modular neural networks for control," IEEE International Workshop on Neural Networks Application to Control and Image Processing, 1994.
[25] R. A. Jacobs and M. I. Jordan, "A modular connectionist architecture for learning piecewise control strategies," American Control Conference, vol. 28, 1982.
