Multi-goal Q-learning of cooperative teams




    Jing Li b,*, Zhaohan Sheng a, KwanChew Ng a

    a School of Management Science and Engineering, Nanjing University, Nanjing, China
    b School of Engineering, Nanjing Agricultural University, Nanjing, China

    * Corresponding author at: School of Management Science and Engineering, Nanjing University, Nanjing, China. Tel.: +86 2583597501.

    E-mail addresses: (J. Li), (Z. Sheng), (K. Ng).

    Expert Systems with Applications 38 (2011) 1565–1574
    Contents lists available at ScienceDirect
    journal homepage: www.elsevier.com/locate/eswa
    0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
    doi:10.1016/j.eswa.2010.07.071

    Article info

    Keywords: Q-learning; Cooperative team; Multi-agent learning; Multi-goal learning

    Abstract

    This paper studies a multi-goal Q-learning algorithm for cooperative teams. Each member of a cooperative team is simulated by an agent. In the virtual cooperative team, agents adapt their knowledge according to cooperative principles. The multi-goal Q-learning algorithm is applied to the multiple learning goals. In the virtual team, agents learn what knowledge to adopt and how much to learn (choosing the learning radius). The learning radius is interpreted in Section 3.1. Five basic experiments are conducted to demonstrate the validity of the multi-goal Q-learning algorithm. It is found that the learning algorithm causes agents to converge to optimal actions, based on the agents' continually updated cognitive maps of how actions influence learning goals. It is also shown that the learning algorithm is beneficial to the multiple goals. Furthermore, the paper analyzes how sensitive the learning performance is to the parameter values of the learning algorithm.

    © 2010 Elsevier Ltd. All rights reserved.

    1. Introduction

    In cooperative teams, team members adopt knowledge to improve their ability and the teams' performance. They have more than one learning goal in cooperative teams. In this paper, team members' learning goals consist of the size of the team, the performance of the team and the performance of individuals. A multi-agent system is used to simulate cooperative teams. The model of the virtual cooperative team is based on Gilbert and Ahrweiler's research (Ahrweiler, Pyka, & Gilbert, 2004; Gilbert, Ahrweiler, & Pyka, 2007; Gilbert, Pyka, & Ahrweiler, 2001). In their research, the KENE was used to describe the knowledge of members. Suppliers and customers were generated by the computation of the KENE. Based on that research, this paper proposes a virtual cooperative team as the experimental skeleton of the learning algorithm. Details of the virtual cooperative team are given in Section 3.1.

    To solve the multi-goal problem, Gadanho (2003) presented the ALEC agent architecture, which has both emotive and cognitive decision-making capabilities, to adapt to a multi-goal survival task. Gadanho's research was beneficial for dealing with multi-goal tasks (the goals may conflict with each other). An improved reinforcement learning algorithm was proposed to learn multi-goal dialogue strategies (Cuayáhuitl, 2006). Zhou and Coggins (2004) presented an emotion-based hierarchical reinforcement learning (HRL) algorithm for environments with multiple goals of reward.

    The multi-goal Q-learning algorithm is proposed to improve the multi-goal learning ability of the agents (the virtual team members). The tendency of agents to explore unknown actions is discussed in the learning algorithm. Agents with the learning algorithm can decide by themselves what knowledge to adopt and how much to learn (choosing the learning radius) for multiple goals. Experimental results show that the multiple goals can be achieved by agents with the learning algorithm. Moreover, two sets of sensitivity experiments are conducted in the paper.

    2. Review of the related research

    The learning algorithm is one of the key issues in an agent-based system. Vriend (2000) considered that an agent employs individual-level learning if it learns from its own past experiences, and population-level learning if it learns from other agents. This paper focuses on both population-level and individual-level learning in cooperative teams. The agent in this paper learns the learning radius from its own experiences and learns other knowledge from other agents. Many algorithms can be used for individual-level and population-level learning, such as reactive reinforcement learning, belief-based learning, anticipatory learning, evolutionary learning, and connectionist learning.

    Reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal. The learner must discover which actions yield the most reward by trying them in each state (Sutton & Barto, 1998). Compared with other algorithms, reinforcement learning models make few assumptions about both the information available to an agent and the cognitive abilities of an agent. Wang proposed a two-layer multi-agent reinforcement learning algorithm to improve the performance of the agents (Wang, Gao, Chen, Xie, & Chen, 2007). A reinforcement learning model was also used for ordering management in a supply chain (Chaharsooghi, Heydari, & Zegordi, 2008). Tuyls investigated reinforcement learning in multi-agent systems from an evolutionary dynamical perspective (Tuyls, Hoen, & Vanschoenwinkel, 2006). An incremental method for learning in a multi-agent system was proposed with reinforcement learning (Buffet, Dutech, & Charpillet, 2007).

    Q-learning is one of the reinforcement learning models that have been studied extensively by researchers. Q-learning is a simple way for agents to learn how to act optimally in controlled Markovian domains (Watkins, 1989). It is a well-known anticipatory learning approach. Watkins presented and proved in detail a convergence theorem for Q-learning (Watkins, 1992). Many researchers have improved the learning model in their papers, such as Even-Dar and Mansour (2003) and Akchurina (2008).

    In the literature, Q-learning has been used in many fields. Cheng (2009) investigated how an intelligent agent used the Q-learning approach to make optimal dynamic packaging decisions in the e-retailing setting. Park employed modular Q-learning in assigning a proper action to an agent in a multi-agent system (Park, Kim, & Kim, 2001). Waltman and Kaymak (2008) studied the use of Q-learning for modeling the learning behavior of firms in repeated Cournot oligopoly games. Based on the Q-learning algorithm, Distante presented a solution to the problem of manipulation control: target identification and grasping (Distante, Anglani, & Taurisano, 2000). Tillotson, Wu, and Hughes (2004) proposed a multi-agent learning model to control routing within the Internet.

    Based on these earlier learning algorithms, the paper proposes a multi-goal Q-learning algorithm which is implemented in a virtual cooperative team. In the algorithm, agents can adjust their learning radius and knowledge adaptively. The remainder of this paper is organized as follows. Section 3 proposes the model of the virtual cooperative team and the multi-goal Q-learning algorithm. Section 4 describes the five experiments used to test the validity of our approach and the results obtained. In Section 5, the paper conducts two sets of sensitivity experiments with respect to the learning parameters. Future directions and concluding remarks end the paper in Section 6.

    3. Model

    In this paper, the multi-goal learning algorithm is based on Q-learning. The experiments on the algorithm are conducted on a virtual cooperative team. The model of the virtual cooperative team is presented in Section 3.1 and the multi-goal Q-learning algorithm is presented in Section 3.2.

    3.1. The virtual cooperative team

    The cooperative team consists of several team members who meet the demands of other members. All team members cooperate to accomplish some work with their knowledge. Each team member is simulated by an agent in NetLogo 4.0.2. The virtual team $G$ ($G = \langle V, E \rangle$) consists of $N$ agents ($V = \{v_1, v_2, v_3, \ldots, v_N\}$), where each agent can be considered as a unique node in a cooperative team. The relation in the cooperative team is modeled by an adjacency matrix $E$, where an element of the adjacency matrix $e_{ij} = 1$ if the agent $v_i$ uses its knowledge to support $v_j$ to accomplish its task $M_{v_j}$, and $e_{ij} = 0$ otherwise. The relations among the agents are directed, so $e_{ij} \neq e_{ji}$. The relation between $v_i$ and $v_j$ is shown in Fig. 1 with an arrow. In the model, if $v_i$ supports $v_j$ to do something, $v_i$ is called the follower in the relation of $v_i$ and $v_j$. Meanwhile, $v_j$ is called the leader.

    3.1.1. Agent state

    The state of $v_i$ is defined as $S_{v_i} = \{k_{v_i}, r_{v_i}, f_{v_i}\}$, where $k_{v_i}$ is the knowledge of the agent, $r_{v_i}$ is the learning radius and $f_{v_i}$ is the fitness of the agent. If $f_{v_i} \le 0$, $v_i$ will be deleted from the virtual cooperative team. If the agent $v_i$ gets the biggest reward in the last period and the reward $f^{\text{last-reward}}_{v_i}$ is more than $f^{\text{reward}}_{\text{threshold}}$, $v_i$ will create $\lfloor \log f^{\text{last-reward}}_{v_i} \rfloor$ agents. Each new agent's fitness is $f_{\text{initial}}$.

    In the virtual team, the agent is a team member with an individual knowledge base. This knowledge of $v_i$ is represented as $k_{v_i} = \{\{k^F_{v_i}, k^T_{v_i}, k^E_{v_i}\}, \{k^F_{v_i}, k^T_{v_i}, k^E_{v_i}\}, \ldots, \{k^F_{v_i}, k^T_{v_i}, k^E_{v_i}\}\}$, where $k^F_{v_i}$ ($k^F_{v_i} \in [1, 100]$) is the research field, $k^T_{v_i}$ ($k^T_{v_i} \in [1, 10]$) is the special technology in the field of $k^F_{v_i}$ and $k^E_{v_i}$ ($k^E_{v_i} \in [1, 10]$) is the experience of using $k^T_{v_i}$ in the field of $k^F_{v_i}$. The length of $k_{v_i}$ is between $kl^{\min}_{v_i}$ and $kl^{\max}_{v_i}$.

    In the model, the agent ($v_i$) adjusts its learning radius $r_{v_i}$ from its own experiences and learns $k_{v_i}$ from other agents within the scope of $r_{v_i}$. An agent with a learning radius of 2 takes data from two levels of its followers and two levels of its leaders. Fig. 1 shows the learning targets of the agent with $r_{v_i} = 2$. The agent's leaders in levels $l + 1$ and $l + 2$ are shown with the black circles. The agent's followers in two levels are shown with the gray circles. In the figure, an arrow means that the agent at the tail of the arrow is the follower of the agent it points to.

    Fig. 1. The learning targets ($r_{v_i} = 2$). Levels shown: $l - 2$, $l - 1$, $l$, $l + 1$, $l + 2$.

    The agent's performance in the model is presented as the fitness $f_{v_i}$. The fitness can be explained as the sum of rewards in all past periods. In the paper, all revenues and costs are in fitness units.

    3.1.2. Adjacency matrix

    In the paper, an adjacency matrix $E$ is used to model the relation between agents. Since the task for each agent must be supported
    by several sub-tasks, there are some relations between agents. The task for $v_i$, $M_{v_i}$, is generated from its $k_{v_i}$ as

    $$M_{v_i} = \left( \sum_{l=1}^{l_{k_{v_i}}} k^F_{v_i}(l)\, k^T_{v_i}(l)\, k^E_{v_i}(l) \right) \bmod N \qquad (1)$$

    where $l_{k_{v_i}}$ is the length of the knowledge set $k_{v_i}$ and $N$ is the maximal knowledge ever possible within the model. Each task must be supported by other sub-tasks which are done by the followers of $v_i$. The count of appropriate followers (the sub-tasks can be done by these followers) for $v_i$ is $I_{v_i} = \min\{l_{k_{v_i}}, I_{\text{initial}}\}$, where $I_{\text{initial}}$ is the system parameter of maximal sub-tasks. $k_{v_i}$ is chopped into $I_{v_i}$ sections randomly, and then each section is mapped into a sub-task using Formula (1). $v_i$ must find enough followers to accomplish the sub-tasks. If there is more than one agent for a certain sub-task, the one ($v_j$) at the cheapest fitness price will be chosen, and then $e_{ji} = 1$. If any one of the sub-tasks is not accomplished, $v_i$ cannot accomplish its task. However, if the sub-task computed by Formula (1) is less than $M_{\min}$, then this sub-task is defined as a basic ability of agents and can be done by the agent itself (under this condition, the agent should pay rewards to the environment). If $M_{v_i} > M_{\max}$, then this task is accepted by the environment of the cooperative team and $v_i$ gets fitness rewards from the environment. Moreover, $v_i$ pays fitness rewards to its followers and gets fitness rewards from its leaders.

    3.1.3. Agent actions

    A finite set of actions for agent $v_i$ is defined as $A_{v_i} = \{a_{r_{v_i}}, a_{s_{v_i}}\}$. $a_{r_{v_i}}$ means the increase or decrease operation on the learning radius $r_{v_i}$. The rules of $a_{r_{v_i}}$ will be discussed in Section 3.2. $a_{s_{v_i}}$ means the usage of research strategies for agent $v_i$. If the reward of agent $v_i$ in the last period is less than $f_{\min}$ (the minimal reward), $v_i$ will choose a new $\{k^F_{v_i}, k^T_{v_i}, k^E_{v_i}\}$ for its $k_{v_i}$. This is done by randomly changing one $k^F_{v_i}$ in (1, 100). Corresponding with the new $k^F_{v_i}$, $k^T_{v_i}$ and $k^E_{v_i}$ are set to the minimal value in their scope. Moreover, $v_i$ must pay taxes for this action.

    The change process of $S_{v_i}$ is a Markov decision process (MDP), since the next state depends only on the current state and action through the transition function

    $$T\left(s^t_{v_i}, a^t_{v_i}, s^{t+1}_{v_i}\right) = P\left(S^{t+1}_{v_i} = s^{t+1}_{v_i} \mid S^t_{v_i} = s^t_{v_i}, A^t_{v_i} = a^t_{v_i}\right) \qquad (2)$$

    where $s^t_{v_i}$ is the value of $S_{v_i}$ at period $t$, $a^t_{v_i}$ is the action of $v_i$ in period $t$, and $s^{t+1}_{v_i}$ is the value of $S_{v_i}$ at period $t + 1$. $\{S_{v_i}, A_{v_i}, T_{v_i}, re_{v_i}\}$ is an MDP, where $re_{v_i}$ is a mapping from $S_{v_i} \times A_{v_i}$ to $\mathbb{R}$ that defines the reward gained by $v_i$ after each state transition. The action policy $\pi$ is the mapping from states to distributions over actions ($\pi : S_{v_i} \to P(A_{v_i})$), where $P(A_{v_i})$ is the probability distribution over actions. The problem is then to find $\pi$ based on $re_{v_i}$.

    3.2. Multi-goal Q-learning algorithm

    The multi-goal Q-learning algorithm can be viewed as a sampled, asynchronous method for estimating the optimal Q-function for the MDP $\langle S_{v_i}, A_{v_i}, T_{v_i}, re_{v_i} \rangle$. The reward $re_{v_i}$ at period $t$ is determined by the multi-goal reward function (Formula (3))

    $$re^t_{v_i} = \delta\, \Delta r^t_s + \eta\, \Delta r^t_a + \mu\, \Delta rp^t_{v_i} \qquad (3)$$

    $\Delta r^t_s$ is the variation of team size (the total number of team agents) from period $t - 1$ to $t$. $\Delta r^t_a$ is the variation of team performance (the sum of all agents' fitness). $\Delta rp^t_{v_i}$ is the variation of the fitness of $v_i$. $\delta$, $\eta$ and $\mu$ are the weights of the three learning goals in the reward function.

    The Q-function $Q(s_{v_i}, a_{v_i})$ defines the expected sum of the discounted reward attained by executing action $a_{v_i}$ ($a_{v_i} \in A_{v_i}$) in state $s_{v_i}$ ($s_{v_i} \in S_{v_i}$). The Q-function is updated by using the agent's experience. The learning process is as follows.

    First, agents observe the current state and select an action from $a_{r_{v_i}}$. This paper adopts the Boltzmann soft-max distribution to select the action. For each state of the learning radius ($r_{v_i}$), $a_{r_{v_i}} = \{a^0_{r_{v_i}}, a^1_{r_{v_i}}, a^2_{r_{v_i}}\}$, where $a^0_{r_{v_i}}$ means the decrease operation on the learning radius $r_{v_i}$, $a^1_{r_{v_i}}$ means no operation on $r_{v_i}$ and $a^2_{r_{v_i}}$ means the increase operation on $r_{v_i}$. Suppose $\varepsilon_u$ is the tendency of agents to explore unknown actions. $A_u$ is defined as the set of unexplored actions from the current state and $A_w$ is defined as the set of actions from the current state that have already been explored at least once. The probability of taking action $a^m_{v_i}$ can be calculated as:

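    $$P\left(a^m_{v_i} \mid S_{v_i}\right) = \frac{e^{Q(s, a^m_{v_i})}}{\sum_{n} e^{Q(s, a^n_{v_i})}}, \quad \text{for } a^m_{v_i} \in A_w \qquad (4)$$

    If one of the actions in $a_{r_{v_i}}$ belongs to $A_w$ and the others belong to A...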
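
    The reward and action-selection steps described in Section 3.2 can be sketched in Python. This is only a minimal illustration under stated assumptions, not the authors' NetLogo implementation: the function names, the parameter names (`w_size`, `w_team`, `w_own`), the action labels and the example Q-values are all hypothetical. The sketch covers the weighted three-goal reward of Formula (3) and the soft-max probabilities of Formula (4) over actions that have already been explored; the treatment of unexplored actions (the set A_u and the exploration tendency) is not modeled here.

```python
import math
import random


def multi_goal_reward(d_team_size, d_team_fitness, d_own_fitness,
                      w_size=1.0, w_team=1.0, w_own=1.0):
    """Weighted sum of the three goal variations, as in Formula (3).

    The parameter names and the default weights are hypothetical.
    """
    return (w_size * d_team_size
            + w_team * d_team_fitness
            + w_own * d_own_fitness)


def boltzmann_probabilities(q_values):
    """Soft-max over the Q-values of explored actions, as in Formula (4).

    q_values maps each explored action to its current Q-value.
    """
    if not q_values:
        raise ValueError("need at least one explored action")
    # Subtracting the maximum avoids overflow and leaves the ratios unchanged.
    q_max = max(q_values.values())
    weights = {a: math.exp(q - q_max) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def select_action(q_values, rng=random):
    """Draw one action according to the soft-max probabilities."""
    probs = boltzmann_probabilities(q_values)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


if __name__ == "__main__":
    # Hypothetical Q-values for the three learning-radius actions.
    q = {"decrease": 0.0, "keep": 1.0, "increase": 2.0}
    print(boltzmann_probabilities(q))
    print(select_action(q, random.Random(0)))
    print(multi_goal_reward(2, -1.0, 0.5, w_size=0.5, w_team=0.3, w_own=0.2))
```

    Because the probabilities depend only on differences between Q-values, an action whose Q-value is one unit higher is chosen e times as often, so less rewarding actions are still tried occasionally rather than being excluded outright.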

