
Multi-goal Q-learning of cooperative teams



Expert Systems with Applications 38 (2011) 1565–1574



Jing Li b,*, Zhaohan Sheng a, KwanChew Ng a

a School of Management Science and Engineering, Nanjing University, Nanjing, China
b School of Engineering, Nanjing Agricultural University, Nanjing, China
* Corresponding author at: School of Management Science and Engineering, Nanjing University, Nanjing, China. Tel.: +86 2583597501. E-mail addresses: [email protected] (J. Li), [email protected] (Z. Sheng), [email protected] (K. Ng).


Keywords: Q-learning; Cooperative team; Multi-agent learning; Multi-goal learning


Abstract

This paper studies a multi-goal Q-learning algorithm for cooperative teams. Each member of the cooperative team is simulated by an agent. In the virtual cooperative team, agents adapt their knowledge according to cooperative principles. The multi-goal Q-learning algorithm is proposed to address the multiple learning goals. In the virtual team, agents learn what knowledge to adopt and how much to learn (by choosing a learning radius); the learning radius is explained in Section 3.1. Five basic experiments are conducted to verify the validity of the multi-goal Q-learning algorithm. It is found that the learning algorithm causes agents to converge to optimal actions, based on agents' continually updated cognitive maps of how actions influence the learning goals. It is also shown that the learning algorithm is beneficial to achieving the multiple goals. Furthermore, the paper analyzes how sensitive the learning performance is to the parameter values of the learning algorithm.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

In cooperative teams, team members adopt knowledge to improve their own ability and the team's performance. They have more than one learning goal in cooperative teams. In this paper, team members' learning goals consist of the size of the team and the performance of the team and of individuals. A multi-agent system is used to simulate cooperative teams. The model of the virtual cooperative team is based on Gilbert and Ahrweiler's research (Ahrweiler, Pyka, & Gilbert, 2004; Gilbert, Ahrweiler, & Pyka, 2007; Gilbert, Pyka, & Ahrweiler, 2001). In their research, the "KENE" was used to describe the knowledge of members, and suppliers and customers were generated by computation over the KENE. Based on that research, this paper proposes a virtual cooperative team as the experimental framework for the learning algorithm. Details of the virtual cooperative team are presented in Section 3.1.

To solve the multi-goal problem, Gadanho (2003) presented the ALEC agent architecture, which has both emotive and cognitive decision-making capabilities, to adapt to a multi-goal survival task. Gadanho's research was beneficial for dealing with multi-goal tasks in which the goals may conflict with each other. An improved reinforcement learning algorithm was proposed to learn multi-goal dialogue strategies (Cuayáhuitl et al., 2006). Zhou and Coggins (2004) presented an emotion-based hierarchical reinforcement learning (HRL) algorithm for environments with multiple goals of reward.


The multi-goal Q-learning algorithm is proposed to improve the multi-goal learning ability of the agents (the virtual team members). The tendency of agents to explore unknown actions is incorporated in the learning algorithm. Agents with the learning algorithm can decide by themselves what knowledge to adopt and how much to learn (by choosing a learning radius) for the multiple goals. Experimental results show that the multiple goals can be achieved by agents with the learning algorithm. Moreover, two sets of sensitivity experiments are conducted in the paper.

2. Review of the related research

The learning algorithm is one of the key issues in agent-based systems. Vriend (2000) noted that an agent employs individual-level learning if it learns from its own past experiences and population-level learning if it learns from other agents. This paper focuses on both population-level and individual-level learning in cooperative teams: the agent learns its learning radius from its own experiences and learns other knowledge from other agents. Many algorithms can be used for individual-level and population-level learning, such as reactive reinforcement learning, belief-based learning, anticipatory learning, evolutionary learning, and connectionist learning.

Reinforcement learning is learning what to do and how to map situations to actions so as to maximize a numerical reward signal; the learner must discover which actions yield the most reward by trying them in each state (Sutton & Barto, 1998). Compared with other algorithms, reinforcement learning models make few assumptions about both the information available to an agent and the cognitive abilities of an agent. Wang proposed a two-layer multi-agent reinforcement learning algorithm to improve the performance of the agents (Wang, Gao, Chen, Xie, & Chen, 2007). A reinforcement learning model was also used in supply chains for ordering management (Chaharsooghi, Heydari, & Zegordi, 2008). Tuyls investigated reinforcement learning in multi-agent systems from an evolutionary dynamical perspective (Tuyls, Hoen, & Vanschoenwinkel, 2006). An incremental method for learning in a multi-agent system was proposed with reinforcement learning (Buffet, Dutech, & Charpillet, 2007).

Q-learning is one of the reinforcement learning models that has been studied extensively by researchers. Q-learning is a simple way for agents to learn how to act optimally in controlled Markovian domains (Watkins, 1989), and it is a well-known anticipatory learning approach. Watkins later presented and proved in detail a convergence theorem for Q-learning based on the earlier outline (Watkins, 1992). Many researchers have since improved the learning model, such as Even-Dar and Mansour (2003) and Akchurina (2008).

In the literature, Q-learning has been used in many fields. Cheng (2009) investigated how an intelligent agent uses the Q-learning approach to make optimal dynamic packaging decisions in the e-retailing setting. Park employed modular Q-learning to assign a proper action to an agent in a multi-agent system (Park, Kim, & Kim, 2001). Waltman and Kaymak (2008) studied the use of Q-learning for modeling the learning behavior of firms in repeated Cournot oligopoly games. Based on the Q-learning algorithm, Distante presented a solution to the problem of manipulation control: target identification and grasping (Distante, Anglani, & Taurisano, 2000). Tillotson, Wu, and Hughes (2004) proposed a multi-agent learning model to control routing within the Internet.

Based on these learning algorithms, this paper proposes a multi-goal Q-learning algorithm which is implemented in a virtual cooperative team. In the algorithm, agents can adjust their learning radius and knowledge adaptively. The remainder of this paper is organized as follows. Section 3 proposes the model of the virtual cooperative team and the multi-goal Q-learning algorithm. Section 4 describes the five experiments used to test the validity of our approach and the results obtained. Section 5 conducts two sets of sensitivity experiments with respect to the learning parameters. Future directions and concluding remarks end the paper in Section 6.

3. Model

In this paper, the multi-goal learning algorithm is based on Q-learning. The experiments on the algorithm are conducted in a virtual cooperative team. The model of the virtual cooperative team is proposed in Section 3.1 and the multi-goal Q-learning algorithm is described in Section 3.2.


3.1. The virtual cooperative team

The cooperative team consists of several team members who meet each other's demands. All team members cooperate to accomplish some work with their knowledge. Each team member is simulated by an agent in NetLogo 4.0.2. The virtual team $G$ ($G = \langle V, E \rangle$) consists of $N$ agents ($V = \{v_1, v_2, v_3, \ldots, v_N\}$), where each agent can be considered as a unique node in the cooperative team. The relations in the cooperative team are modeled by an adjacency matrix $E$, where an element $e_{ij} = 1$ if agent $v_i$ uses his knowledge to support $v_j$ in accomplishing its task ($M_{v_j}$) and $e_{ij} = 0$ otherwise. The relations among the agents are directed, so $e_{ij} \ne e_{ji}$. The relation between $v_i$ and $v_j$ is shown in Fig. 1 with an arrow. In the model, if $v_i$ supports $v_j$ to do something, $v_i$ is called the follower in the relation of $v_i$ and $v_j$; meanwhile, $v_j$ is called the leader.
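As an illustration only (the authors built the model in NetLogo 4.0.2; the Python below and its helper names are our own assumptions), the directed support relation of the team graph $G = \langle V, E \rangle$ can be sketched with an adjacency matrix:

```python
# Illustrative sketch (not the authors' NetLogo code): a directed team graph
# G = <V, E> stored as an adjacency matrix, where E[i, j] = 1 means agent v_i
# supports agent v_j (v_i is the follower and v_j the leader in that relation).
import numpy as np

N_AGENTS = 5  # hypothetical small team, for illustration only

E = np.zeros((N_AGENTS, N_AGENTS), dtype=int)

def add_support(E, i, j):
    """Record that agent v_i supports agent v_j (directed edge i -> j)."""
    E[i, j] = 1

def followers_of(E, j):
    """Agents that support v_j, i.e. all v_i with E[i, j] = 1."""
    return np.flatnonzero(E[:, j])

def leaders_of(E, i):
    """Agents supported by v_i, i.e. all v_j with E[i, j] = 1."""
    return np.flatnonzero(E[i, :])

add_support(E, 0, 1)       # v_0 supports v_1
add_support(E, 2, 1)       # v_2 supports v_1
print(followers_of(E, 1))  # -> [0 2]
print(leaders_of(E, 0))    # -> [1]
```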

3.1.1. Agent state

The state of $v_i$ is defined as $S_{v_i} = \{k_{v_i}, r_{v_i}, f_{v_i}\}$, where $k_{v_i}$ is the knowledge of the agent, $r_{v_i}$ is the learning radius and $f_{v_i}$ is the fitness of the agent. If $f_{v_i} \le 0$, $v_i$ will be deleted from the virtual cooperative team. If the agent $v_i$ gets the biggest reward in the last period and the reward $f^{\text{last-reward}}_{v_i}$ is more than $f^{\text{reward}}_{\text{threshold}}$, $v_i$ will create $\lfloor \log(f^{\text{last-reward}}_{v_i}) \rfloor$ agents. In the virtual team, the agent is a team member with an individual knowledge base. The knowledge of $v_i$ is represented as $k_{v_i} = \{\{k^F_{v_i}, k^T_{v_i}, k^E_{v_i}\}, \{k^F_{v_i}, k^T_{v_i}, k^E_{v_i}\}, \ldots, \{k^F_{v_i}, k^T_{v_i}, k^E_{v_i}\}\}$, where $k^F_{v_i}$ ($k^F_{v_i} \in [1, 100]$) is the research field, $k^T_{v_i}$ ($k^T_{v_i} \in [1, 10]$) is the special technology in the field $k^F_{v_i}$, and $k^E_{v_i}$ ($k^E_{v_i} \in [1, 10]$) is the experience of using $k^T_{v_i}$ in the field $k^F_{v_i}$. The length of $k_{v_i}$ is between $kl^{\min}_{v_i}$ and $kl^{\max}_{v_i}$.

In the model, the agent $v_i$ adjusts its learning radius $r_{v_i}$ from its own experiences and learns $k_{v_i}$ from other agents within the scope of $r_{v_i}$. An agent with a learning radius of 2 takes data from its followers and leaders up to two levels away. Fig. 1 shows the learning targets of an agent with $r_{v_i} = 2$: the agent's leaders in levels $l+1$ and $l+2$ are shown with black circles, and the agent's followers in levels $l-1$ and $l-2$ are shown with gray circles. In the figure, an arrow means that the agent at the tail of the arrow is the follower of the agent it points to.

The agent's performance in the model is represented by the fitness $f_{v_i}$. The fitness is the sum of the rewards in all past periods. In the paper, all revenues and costs are expressed in fitness units. Each new agent's fitness is $f_{\text{initial}}$.

Fig. 1. The learning targets ($r_{v_i} = 2$).
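A minimal Python sketch of the agent state $S_{v_i} = \{k_{v_i}, r_{v_i}, f_{v_i}\}$, assuming hypothetical class and field names and the initial values from Table 1 (this is not the authors' NetLogo data structure):

```python
# Illustrative sketch of the agent state S_vi = {k_vi, r_vi, f_vi}.
import random
from dataclasses import dataclass

@dataclass
class KnowledgeUnit:
    field_id: int    # k^F in [1, 100]: the research field
    technology: int  # k^T in [1, 10]: the special technology in that field
    experience: int  # k^E in [1, 10]: experience of using the technology

@dataclass
class Agent:
    knowledge: list            # k_vi: a list of KnowledgeUnit triples
    learning_radius: int = 1   # r_vi
    fitness: float = 2000.0    # f_vi, initialised here to f_initial from Table 1

def random_agent(kl_min=5, kl_max=40):
    """Create an agent with a random knowledge base of length in [kl_min, kl_max]."""
    length = random.randint(kl_min, kl_max)
    units = [KnowledgeUnit(random.randint(1, 100),
                           random.randint(1, 10),
                           random.randint(1, 10))
             for _ in range(length)]
    return Agent(knowledge=units)
```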

3.1.2. Adjacency matrix

In the paper, an adjacency matrix $E$ is used to model the relations between agents. Since the task of each agent must be supported by several sub-tasks, there are relations between agents. The task of $v_i$, $M_{v_i}$, is generated from its $k_{v_i}$ as

$$M_{v_i} = \left( \sum_{l=1}^{lk_{v_i}} k^F_{v_i l} \times k^T_{v_i l} \times k^E_{v_i l} \right) \bmod N \qquad (1)$$

where $lk_{v_i}$ is the length of the knowledge set $k_{v_i}$ and $N$ is the maximal knowledge ever possible within the model. Each task must be supported by other sub-tasks which are done by the followers of $v_i$. The count of appropriate followers (the sub-tasks can be done by these followers) for $v_i$ is $I_{v_i} = \min\{lk_{v_i}, I_{\text{initial}}\}$, where $I_{\text{initial}}$ is the system parameter for the maximal number of sub-tasks. $k_{v_i}$ is chopped into $I_{v_i}$ sections randomly, and then each section is mapped into a sub-task using Formula (1). $v_i$ must find enough followers to accomplish the sub-tasks. If there is more than one agent for a certain sub-task, the one ($v_j$) with the cheapest fitness price will be chosen, and then $e_{ji} = 1$. If any one of the sub-tasks is not accomplished, $v_i$ cannot accomplish his task. However, if the sub-task computed by Formula (1) is less than $M_{\min}$, then this sub-task is defined as a basic ability of agents and can be done by $v_i$ itself (under this condition, the agent pays rewards to the environment). If $M_{v_i} > M_{\max}$, then the task is accepted by the environment of the cooperative team and $v_i$ gets fitness rewards from the environment. Moreover, $v_i$ pays fitness rewards to his followers and gets fitness rewards from his leaders.
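The following sketch illustrates Formula (1) and the random chopping of $k_{v_i}$ into sub-tasks, reusing the hypothetical Agent/KnowledgeUnit classes above; the exact chopping scheme is an assumption, since the paper only states that the knowledge is chopped randomly:

```python
import random

def task_of(agent, n_max=100):
    """Formula (1): M_vi = (sum over the knowledge of k^F * k^T * k^E) mod N."""
    total = sum(u.field_id * u.technology * u.experience for u in agent.knowledge)
    return total % n_max

def sub_tasks_of(agent, i_initial=6, n_max=100):
    """Chop k_vi into I_vi = min(len(k_vi), I_initial) random sections and map
    each section to a sub-task with Formula (1).  The interleaved split below is
    an assumption; the paper does not specify the chopping procedure."""
    units = list(agent.knowledge)
    random.shuffle(units)
    i_vi = min(len(units), i_initial)
    sections = [units[s::i_vi] for s in range(i_vi)]  # i_vi random sections
    return [sum(u.field_id * u.technology * u.experience for u in sec) % n_max
            for sec in sections]
```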

3.1.3. Agent actions

A finite set of actions for agent $v_i$ is defined as $A_{v_i} = \{a^r_{v_i}, a^s_{v_i}\}$. $a^r_{v_i}$ denotes the increase or decrease operation on the learning radius $r_{v_i}$; the rules of $a^r_{v_i}$ will be discussed in Section 3.2. $a^s_{v_i}$ denotes the usage of research strategies by agent $v_i$. If the reward of agent $v_i$ in the last period is less than $f_{\min}$ (the minimal reward), $v_i$ will choose a new $\{k^F_{v_i}, k^T_{v_i}, k^E_{v_i}\}$ for its $k_{v_i}$. This is done by randomly changing one $k^F_{v_i}$ in $(1, 100)$; corresponding to the new $k^F_{v_i}$, $k^T_{v_i}$ and $k^E_{v_i}$ are set to the minimal value in their scope. Moreover, $v_i$ must pay taxes for this action.

The change process of $S_{v_i}$ is a Markov decision process (MDP) with the transition function

$$T(s^t_{v_i}, a^t_{v_i}, s^{t+1}_{v_i}) = P(S^{t+1}_{v_i} = s^{t+1}_{v_i} \mid S^t_{v_i} = s^t_{v_i}, A^t_{v_i} = a^t_{v_i}) \qquad (2)$$

where $s^t_{v_i}$ is the value of $S_{v_i}$ at period $t$, $a^t_{v_i}$ is the action of $v_i$ in period $t$, and $s^{t+1}_{v_i}$ is the value of $S_{v_i}$ at period $t+1$. $\{S_{v_i}, A_{v_i}, T_{v_i}, re_{v_i}\}$ is an MDP, where $re_{v_i}$ is a mapping from $S_{v_i} \times A_{v_i}$ to $\mathbb{R}$ that defines the reward gained by $v_i$ after each state transition. The action policy $\pi$ is a mapping from states to distributions over actions ($\pi: S_{v_i} \to P(A_{v_i})$), where $P(A_{v_i})$ is the probability distribution over actions. The problem is then to find $\pi$ based on $re_{v_i}$.

3.2. Multi-goal Q-learning algorithm

The multi-goal Q-learning algorithm can be viewed as a sampled, asynchronous method for estimating the optimal Q-function of the MDP $\langle S_{v_i}, A_{v_i}, T_{v_i}, re_{v_i} \rangle$. The reward $re_{v_i}$ at period $t$ is determined by the multi-goal reward function (Formula (3)):

$$re^t_{v_i} = \delta \Delta r^t_s + \eta \Delta r^t_a + \mu \Delta rp^t_{v_i} \qquad (3)$$

$\Delta r^t_s$ is the variation of the team size (the total number of team agents) from period $t-1$ to $t$. $\Delta r^t_a$ is the variation of the team performance (the sum of all agents' fitness). $\Delta rp^t_{v_i}$ is the variation of $v_i$'s fitness. $\delta$, $\eta$ and $\mu$ are the weights of the three learning goals in the reward function.
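A minimal sketch of the multi-goal reward of Formula (3); the function and argument names are illustrative assumptions:

```python
# Minimal sketch of Formula (3): re^t_vi = delta*Δr_s + eta*Δr_a + mu*Δrp_vi.

def multi_goal_reward(delta, eta, mu,
                      team_size_t, team_size_prev,
                      team_fitness_t, team_fitness_prev,
                      own_fitness_t, own_fitness_prev):
    """Weighted sum of the variations of team size, team fitness and own fitness."""
    d_size = team_size_t - team_size_prev        # Δr^t_s
    d_team = team_fitness_t - team_fitness_prev  # Δr^t_a
    d_own = own_fitness_t - own_fitness_prev     # Δrp^t_vi
    return delta * d_size + eta * d_team + mu * d_own

# Scenario 5 weights from the paper: delta = 0.25, eta = 0.25, mu = 0.5
# (the numeric arguments below are made-up illustrative values)
r = multi_goal_reward(0.25, 0.25, 0.5, 610, 600, 1.2e6, 1.1e6, 3200.0, 3000.0)
```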

The Q-function $Q(s_{v_i}, a_{v_i})$ defines the expected sum of the discounted reward attained by executing action $a_{v_i}$ ($a_{v_i} \in A_{v_i}$) in state $s_{v_i}$ ($s_{v_i} \in S_{v_i}$). The Q-function is updated using the agent's experience. The learning process is as follows.

First, agents observe the current state and select an action from $a^r_{v_i}$. This paper adopts the Boltzmann soft-max distribution to select the action. For each state of the learning radius $r_{v_i}$, $a^r_{v_i} = \{a^0_{r_{v_i}}, a^1_{r_{v_i}}, a^2_{r_{v_i}}\}$, where $a^0_{r_{v_i}}$ is the decrease operation on the learning radius $r_{v_i}$, $a^1_{r_{v_i}}$ is no operation on $r_{v_i}$, and $a^2_{r_{v_i}}$ is the increase operation on $r_{v_i}$. Suppose $\varepsilon_u$ is the tendency of agents to explore unknown actions. $A_u$ is defined as the set of unexplored actions from the current state, and $A_w$ is defined as the set of actions from the current state that have already been explored at least once. The probability of taking action $a^m_{v_i}$ can be calculated as:

$$P(a^m_{v_i} \mid S_{v_i}) = \frac{e^{Q(s, a^m_{v_i})}}{\sum_n e^{Q(s, a^n_{v_i})}} \quad \text{for } a^m_{v_i} \in A_w \qquad (4)$$

If one of the actions in $a^r_{v_i}$ belongs to $A_w$ and the others belong to $A_u$, then

$$P(a^m_{v_i} \mid S_{v_i}) = \begin{cases} 1 - \varepsilon_u & \text{for } a^m_{v_i} \in A_w \\ \varepsilon_u / 2 & \text{for } a^m_{v_i} \in A_u \end{cases} \qquad (5)$$

If only one action belongs to $A_u$, then

$$P(a^m_{v_i} \mid S_{v_i}) = \begin{cases} (1 - \varepsilon_u) \dfrac{e^{Q(s, a^m_{v_i})}}{\sum_{n=1}^{2} e^{Q(s, a^n_{v_i})}} & \text{for } a^m_{v_i} \in A_w \\ \varepsilon_u & \text{for } a^m_{v_i} \in A_u \end{cases} \qquad (6)$$
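The action-selection rule of Formulas (4)–(6) could be sketched as follows; the names are assumptions, and the case in which all three actions are still unexplored is not covered by the paper and is assumed uniform here:

```python
# Illustrative sketch of the exploration-aware Boltzmann selection of
# Formulas (4)-(6) for the three radius actions {decrease, keep, increase}.
import math
import random

def action_probabilities(q_values, explored, eps_u):
    """q_values: {action: Q(s, a)}; explored: set of already-tried actions."""
    actions = list(q_values)
    unexplored = [a for a in actions if a not in explored]
    if not unexplored:                        # Formula (4): pure Boltzmann soft-max
        z = sum(math.exp(q_values[a]) for a in actions)
        return {a: math.exp(q_values[a]) / z for a in actions}
    if len(unexplored) == len(actions):       # all actions unknown (not in (4)-(6);
        return {a: 1.0 / len(actions) for a in actions}  # assumed uniform)
    if len(unexplored) == 2:                  # Formula (5): one explored action
        return {a: (1 - eps_u) if a in explored else eps_u / 2 for a in actions}
    # Formula (6): exactly one unexplored action
    z = sum(math.exp(q_values[a]) for a in explored)
    return {a: eps_u if a not in explored
            else (1 - eps_u) * math.exp(q_values[a]) / z
            for a in actions}

def select_action(q_values, explored, eps_u=0.5):
    probs = action_probabilities(q_values, explored, eps_u)
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]
```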

After selecting an action, the agent observes the state at period $t+1$ and receives a reward from the system. The corresponding Q value for state $s_{v_i}$ and action $a_{v_i}$ is updated according to the following formula:

$$Q(s_{v_i}, a_{v_i}) = (1 - \alpha) Q(s_{v_i}, a_{v_i}) + \alpha \left( re_{v_i} + \gamma \max_{a'_{v_i}} Q(s'_{v_i}, a'_{v_i}) \right) \qquad (7)$$

where $\alpha$ ($0 \le \alpha < 1$, the learning rate) is the weight of the new information in updating $Q$, and $\gamma$ ($0 \le \gamma \le 1$, the discount rate) represents the importance of the value of future states in assessing the current state. At each period of the simulation, the agent uses the multi-goal Q-learning algorithm to update its Q-table and learns the optimal cognitive map of how actions influence the goals.
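A minimal sketch of the Q-value update of Formula (7), with the Q-table stored as a dictionary keyed by (state, action); the default α = γ = 0.5 follows Table 1, and the function name is an assumption:

```python
# Minimal sketch of Formula (7):
# Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (re + gamma * max_a' Q(s',a')).

def q_update(Q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=0.5):
    """Update and return Q[(state, action)]; unseen entries default to 0."""
    best_next = max(Q.get((next_state, a), 0.0) for a in next_actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = (1 - alpha) * old + alpha * (reward + gamma * best_next)
    return Q[(state, action)]

# usage: radius state 2, action "increase", hypothetical reward 1.5
Q = {}
q_update(Q, state=2, action="increase", reward=1.5, next_state=3,
         next_actions=["decrease", "keep", "increase"])
```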

In the model, $v_i$ learns $r_{v_i}$ from his own experiences and $k_{v_i}$ from other agents within radius $r_{v_i}$. After $v_i$ has used $A_{v_i}$, $v_i$ learns $k_{v_i}$ from the agents within $r_{v_i}$. The process of $v_i$ learning $k_{v_i}$ from $v_j$ is governed by the following rules. If $\exists \{k^F_{v_j}, k^T_{v_j}, k^E_{v_j}\}_m \notin k_{v_i}$ and $k^F_{v_j m} = k^F_{v_i n}$, then $k^T_{v_i n} = \max\{k^T_{v_i n}, k^T_{v_j m}\}$ and $k^E_{v_i n} = \max\{k^E_{v_i n}, k^E_{v_j m}\}$ ($m$ and $n$ are the positions of the knowledge in $v_j$ and $v_i$). If $\exists \{k^F_{v_j}, k^T_{v_j}, k^E_{v_j}\}_m \notin k_{v_i}$ and $k^F_{v_j m} \ne k^F_{v_i n}$ for all $n$, then $v_i$ will add $\{k^F_{v_j}, k^T_{v_j}, k^E_{v_j}\}_m$ to $k_{v_i}$ if $lk_{v_i} < kl^{\max}_{v_i}$ ($kl^{\max}_{v_i}$ is the system parameter for the maximal length of $k_{v_i}$). If neither condition holds, $v_i$ learns nothing from $v_j$. Fig. 2 shows the basic graphical model of the paper; dotted arrows indicate that the action is chosen depending on the state.

Fig. 2. Basic graphical model of the paper: in each period the agent in state $S^t_{v_i}$ cooperates with followers and leaders, does research, learns other knowledge, and applies multi-goal Q-learning, reaching state $S^{t+1}_{v_i}$.
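A sketch of these knowledge-learning rules, reusing the hypothetical classes from Section 3.1.1 and assuming, for simplicity, at most one knowledge unit per research field in the learner's knowledge base:

```python
# Illustrative sketch of v_i learning from v_j: if v_j holds a triple in a field
# v_i already has, v_i keeps the maxima of k^T and k^E; if the field is new and
# k_vi is not full, the triple is copied.  Not the authors' NetLogo code.

def learn_from(learner, teacher, kl_max=40):
    own_by_field = {u.field_id: u for u in learner.knowledge}
    for unit in teacher.knowledge:
        own = own_by_field.get(unit.field_id)
        if own is not None:
            # same research field: keep the better technology and experience
            own.technology = max(own.technology, unit.technology)
            own.experience = max(own.experience, unit.experience)
        elif len(learner.knowledge) < kl_max:
            # unknown field: copy the triple if the knowledge base is not full
            new_unit = KnowledgeUnit(unit.field_id, unit.technology, unit.experience)
            learner.knowledge.append(new_unit)
            own_by_field[unit.field_id] = new_unit
```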

4. Runs of the model

Five experiments were run on the virtual cooperative team to quantify the performance of the multi-goal Q-learning algorithm. Each experiment is simulated for 2000 periods. Agents carry out all the work illustrated in Fig. 2 in each period. The parameter settings for the experiments are given in Table 1. The interface of the experiments is presented in Fig. 3.

4.1. Scenario 1. No learning among team members

In the first scenario, there is no learning behavior among agents. This scenario is the basic model of cooperative teams. Fig. 4 shows the number of agents in the cooperative team over the 2000 periods of simulation. Fig. 5 shows the sum of fitness (log10) of all agents. Figs. 4 and 5 present the performance of the virtual cooperative team. Fig. 6 shows the average fitness (log10) of all agents; the performance of individuals is presented in Fig. 6. All of the data used in the following figures are taken from 2000 periods of simulation.

Table 1
Summary of experimental settings.

Parameter                                Value
Initial number of agents                 400
Tax                                      10%
$kl^{\min}_{v_i}$                        5
$kl^{\max}_{v_i}$                        40
$f_{\text{initial}}$                     2000
$f_{\min}$                               200
$f^{\text{reward}}_{\text{threshold}}$   800
$I_{\text{initial}}$                     6
$N$                                      100
$M_{\min}$                               10
$M_{\max}$                               90
$\alpha$                                 0.5
$\gamma$                                 0.5
$\varepsilon_u$                          0.5

Fig. 3. Interface of the experiments.

4.2. Scenario 2. Maximizing the personal performance

For this scenario, $\delta = 0$, $\eta = 0$ and $\mu = 1$. This means all agents learn to maximize their own fitness; the performance of the cooperative team is not considered by these agents. In the simulation, agents adjust their learning radius to learn knowledge from others by the multi-goal Q-learning algorithm. Figs. 7 and 8 show the performance of the virtual cooperative team. The average fitness of all agents is presented in Fig. 9.

Comparing Fig. 9 with Fig. 6, it is found that agents' personal performance is improved by the learning algorithm. However, the size of the cooperative team in Fig. 7 is smaller than the one in Fig. 4. This means the learning algorithm in Scenario 2 is beneficial to personal performance, but it cannot improve the size of the cooperative team.

4.3. Scenario 3. Maximizing the team’s performance

In order to set the learning goal to maximizing the team's performance, the parameters are set as $\delta = 0$, $\eta = 1$ and $\mu = 0$. In this scenario, agents choose the optimal action to maximize the sum of all agents' fitness. Fig. 11 shows the sum (log10) of all agents' fitness in the simulation. Fig. 10 shows the size of the cooperative team in Scenario 3. The average performance of the agents is shown in Fig. 12.

The sum of all agents' fitness in Fig. 11 is larger than the one in Fig. 5. This means the learning algorithm is able to improve the team's performance. Moreover, the number of agents is similar in the two scenarios.

4.4. Scenario 4. Maximizing the team’s size

For this scenario, $\delta = 1$, $\eta = 0$ and $\mu = 0$. The learning agents learn to maximize the total number of team members. Fig. 13 shows the size of the cooperative team in Scenario 4, i.e., the number of team members in the simulation. Fig. 14 shows the sum (log10) of all agents' fitness in the simulation. The average performance of the agents is shown in Fig. 15.

The number of agents in Fig. 13 is about 600 at period 2000 (the value on the trend line), whereas the number in Fig. 4 is about 500 at period 2000. The learning algorithm in Scenario 4 significantly improves the size of the cooperative team. Moreover, the performance of the team is improved at the same time.


Fig. 4. Number of agents in the cooperative team at scenario 1 (the dotted line is the trend line).
Fig. 5. Sum fitness of all agents in the cooperative team at scenario 1.
Fig. 6. Average fitness of all agents in the cooperative team at scenario 1.
Fig. 7. Number of agents in the cooperative team at scenario 2 (the dotted line is the trend line).

4.5. Scenario 5. Multi-goal learning

In this scenario, agents learn to improve the performance and the size of the cooperative team besides improving their own fitness. The weights of the different goals are set as $\delta = 0.25$, $\eta = 0.25$ and $\mu = 0.5$. The learning algorithm is used to achieve the multiple goals. Figs. 16–18 show the simulation results in this scenario.

The agents in Scenario 5 have multiple learning goals. The number of agents in Fig. 16 is about 600 and the number in Fig. 4 is about 500 at period 2000 (the values on the trend lines). This means the learning algorithm is beneficial to the size of the cooperative team. The performance in Figs. 17 and 18 is clearly higher than that in Figs. 5 and 6. This means the performance of individuals and of the team is improved by the agents with the learning algorithm. These five experiments show that the multi-goal Q-learning algorithm works for both single and multiple learning goals in cooperative teams.

Fig. 8. Sum fitness of all agents in the cooperative team at scenario 2.
Fig. 9. Average fitness of all agents in the cooperative team at scenario 2.
Fig. 10. Number of agents in the cooperative team at scenario 3 (the dotted line is the trend line).
Fig. 11. Sum fitness of all agents in the cooperative team at scenario 3.
Fig. 12. Average fitness of all agents in the cooperative team at scenario 3.
Fig. 13. Number of agents in the cooperative team at scenario 4 (the dotted line is the trend line).
Fig. 14. Sum fitness of all agents in the cooperative team at scenario 4.
Fig. 15. Average fitness of all agents in the cooperative team at scenario 4.
Fig. 16. Number of agents in the cooperative team at scenario 5 (the dotted line is the trend line).
Fig. 17. Sum fitness of all agents in the cooperative team at scenario 5.
Fig. 18. Average fitness of all agents in the cooperative team at scenario 5.

Table 2
The parameter ranges for the sensitivity analysis.

Parameter         Minimum   Maximum   Sensitivity step
$\gamma$          0.1       0.9       0.2
$\varepsilon_u$   0.1       0.9       0.2

5. Sensitivity experiments

This paper conducts two sets of sensitivity experiments with respect to the learning parameters $\gamma$ (discount rate) and $\varepsilon_u$ (tendency to explore unknown actions). The basic values of the parameters are set as in Table 1, and the weights of the different learning goals are set as $\delta = 0.25$, $\eta = 0.25$ and $\mu = 0.5$. Wide parameter ranges are chosen for the sensitivity analysis. One parameter is changed at a time, keeping the rest at their values in Table 1. For each parameter value, the experiment is simulated for 2000 periods. Since $\gamma$ and $\varepsilon_u$ must be between 0 and 1, Table 2 shows the parameter ranges used in the experiments.
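A sketch of this one-at-a-time sweep; run_simulation is a hypothetical stand-in for a 2000-period run of the virtual team, and the parameter names are our own assumptions:

```python
# Minimal sketch of the one-at-a-time sensitivity sweep described above.
BASELINE = {"alpha": 0.5, "gamma": 0.5, "eps_u": 0.5,
            "delta": 0.25, "eta": 0.25, "mu": 0.5}
SWEEP_VALUES = [0.1, 0.3, 0.5, 0.7, 0.9]  # Table 2: min 0.1, max 0.9, step 0.2

def sensitivity_sweep(run_simulation, parameter):
    """Vary one parameter over SWEEP_VALUES, keeping the others at baseline.
    run_simulation is assumed to return the final team size and fitness values."""
    results = {}
    for value in SWEEP_VALUES:
        params = dict(BASELINE)
        params[parameter] = value
        results[value] = run_simulation(periods=2000, **params)
    return results

# e.g. sensitivity_sweep(run_simulation, "gamma")
#      sensitivity_sweep(run_simulation, "eps_u")
```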

The final value of the number of agents (the value on the trend line) is shown in Fig. 19. The bottom X axis specifies the values of $\gamma$ and $\varepsilon_u$. The results suggest that the size of cooperative teams is sensitive to the parameter values. The size of the team is almost identical for the larger values of both parameters. Very small values of $\gamma$ tend to decrease the size of the team, while the size of the team is not sensitive to the middle values of $\gamma$. Moreover, agents can improve the size of the virtual team for all values of $\gamma$ and $\varepsilon_u$ except $\varepsilon_u = 0.1$. This means the learning goal of improving the team size can be achieved under most parameter settings. "No Learning" denotes the performance of the virtual team without learning in the following three figures.

Figs. 20 and 21 show the performance of individuals and of the cooperative team. It is found that middle levels of $\gamma$ and $\varepsilon_u$ can improve the performance of individuals and cooperative teams, while low or high levels of the parameters reduce agents' performance. Moreover, larger values of $\varepsilon_u$ are not beneficial to the performance of individuals and the team. The learning goals of improving the performance of individuals and the virtual team are achieved for all values of $\gamma$ and $\varepsilon_u$. This means the learning algorithm is valid for all parameter settings.

Fig. 19. The final number of agents in the parameter experiments ($\gamma$ and $\varepsilon_u$ sensitivity analysis versus the No Learning baseline).
Fig. 20. The final sum of fitness in the parameter experiments.
Fig. 21. The final average fitness in the parameter experiments.

6. Conclusion

In order to improve agents' multi-goal learning ability, this paper presents a multi-goal Q-learning algorithm for cooperative teams. First, the paper proposed a multi-agent cooperative team as the test bed for the learning algorithm. Second, the multi-goal Q-learning algorithm is modeled to solve the multiple-goal learning problem in the virtual team. The algorithm gives the agents the ability to adjust their learning radius based on their own experiences. With the knowledge learned from others, the agents are able to achieve the multiple goals. The five experiments illustrate that the multi-goal Q-learning algorithm can be used as an effective learning approach in cooperative teams. Furthermore, sensitivity analyses of two important learning parameters are conducted in Section 5.

The work presented here can be extended along several directions. For example, more complex learning mechanisms can be used to improve the agents' learning ability so that they can handle more complex problems. Such extensions of the learning ability may provide better mechanisms to resolve conflicts among cooperative team members. These are key issues for the research of cooperative teams.

Acknowledgements

This research was supported in part by: (i) the NSFC (National Natural Science Foundation of China) under Grant 70771045; and (ii) the NSFC key program under Grant 70731002.

References

Ahrweiler, P., Pyka, A., & Gilbert, N. (2004). Simulating knowledge dynamics in innovation networks. In R. Leombruni & M. Richiardi (Eds.), Industry and labor dynamics: The agent-based computational economics approach (pp. 284–296). Singapore: World Scientific Press.
Akchurina, N. (2008). Optimistic–pessimistic Q-learning algorithm for multi-agent systems. In R. Bergmann et al. (Eds.), MATES 2008. LNAI (Vol. 5244, pp. 13–24).
Buffet, O., Dutech, A., & Charpillet, F. (2007). Shaping multi-agent systems with gradient reinforcement learning. Autonomous Agents and Multi-agent Systems, 15, 197–220.
Chaharsooghi, S. K., Heydari, J., & Zegordi, S. H. (2008). A reinforcement learning model for supply chain ordering management: An application to the beer game. Decision Support Systems, 45, 949–959.
Cheng, Y. (2009). Dynamic packaging in e-retailing with stochastic demand over finite horizons: A Q-learning approach. Expert Systems with Applications, 36, 472–480.
Cuayáhuitl, H., Renals, S., Lemon, O., & Shimodaira, H. (2006). Learning multi-goal dialogue strategies using reinforcement learning with reduced state-action spaces. In INTERSPEECH 2006 – ICSLP, Pittsburgh, Pennsylvania (Vol. 9, pp. 17–21).
Distante, C., Anglani, A., & Taurisano, F. (2000). Target reaching by using visual information and Q-learning controllers. Autonomous Robots, 9, 41–50.
Even-Dar, E., & Mansour, Y. (2003). Learning rates for Q-learning. Journal of Machine Learning Research, 5, 1–25.
Gadanho, S. C. (2003). Learning behavior-selection by emotions and cognition in a multi-goal robot task. Journal of Machine Learning Research, 4, 385–412.
Gilbert, N., Ahrweiler, P., & Pyka, A. (2007). Learning in innovation networks: Some simulation experiments. Physica A, 378, 100–109.
Gilbert, N., Pyka, A., & Ahrweiler, P. (2001). Innovation networks – A simulation approach. Journal of Artificial Societies and Social Simulation, 4. <http://www.soc.surrey.ac.uk/JASSS/4/3/8.html>.
Park, K., Kim, Y., & Kim, J. (2001). Modular Q-learning based multi-agent cooperation for robot soccer. Robotics and Autonomous Systems, 35, 109–122.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tillotson, P. R. J., Wu, Q. H., & Hughes, P. M. (2004). Multi-agent learning for routing control within an Internet environment. Engineering Applications of Artificial Intelligence, 17, 179–185.
Tuyls, K., Hoen, P. J., & Vanschoenwinkel, B. (2006). An evolutionary dynamical analysis of multi-agent learning in iterated games. Autonomous Agents and Multi-agent Systems, 12, 115–153.
Vriend, N. J. (2000). An illustration of the essential difference between individual and social learning, and its consequences for computational analyses. Journal of Economic Dynamics and Control, 24, 1–19.
Waltman, L., & Kaymak, U. (2008). Q-learning agents in a Cournot oligopoly model. Journal of Economic Dynamics and Control, 32, 3275–3293.
Wang, B., Gao, Y., Chen, Z., Xie, J., & Chen, S. (2007). A two-layered multi-agent reinforcement learning model and algorithm. Journal of Network and Computer Applications, 30, 1366–1376.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, University of Cambridge, England.
Watkins, C. J. C. H. (1992). Technical note: Q-learning. Machine Learning, 8, 279–292.
Zhou, W., & Coggins, R. (2004). Biologically inspired reinforcement learning: Reward-based decomposition for multi-goal environments. In A. J. Ijspeert et al. (Eds.), BioADIT 2004. LNCS (Vol. 3141, pp. 80–94).