





Hybrid System of Reinforcement Learning and Flocking Control in Multi-robot Domain

Hung Manh La, Ronny Lim and Weihua Sheng

Abstract— In the multi-robot domain, one of the important problems is to achieve cooperation among robots. In this paper we propose a hybrid system that integrates reinforcement learning and flocking control in order to create adaptive and intelligent multi-robot systems. We study two problems of multi-robot concurrent learning of cooperative behaviors: (1) how to generate an efficient combination of high-level behaviors (discrete states and actions) and low-level behaviors (continuous states and actions) for multi-robot cooperation; (2) how to coordinate the concurrent learning process in a distributed fashion. To evaluate our theoretical framework we apply it to the problem of how a multi-robot network learns to avoid a predator while maintaining network topology and connectivity. Experiments and simulations are performed to demonstrate the effectiveness of the proposed hybrid system.

Keywords: Reinforcement learning, Flocking control, multi-robot systems, hybrid system.

I. INTRODUCTION

In multi-robot systems, one of the important research problems is to achieve cooperation among robots through a distributed control methodology. In recent years, machine learning techniques such as reinforcement learning have been developed for multi-robot systems to allow the robots to learn cooperation [1], [2]. However, traditional reinforcement learning assumes discrete and finite state/action spaces; therefore, it is difficult to apply directly to most real applications, which inherently involve continuous and infinite spaces. Furthermore, even if the states can be discretized, the learned elementary behaviors are still discrete. In addition, the switching of discrete behaviors usually causes the control of the robots to become non-smooth, which is undesirable in most applications. To tackle this issue, several methods have been proposed to make reinforcement learning work in continuous environments [3], [4], [5]. However, these methods mostly focus on action approximation, such as estimating the state/action values for known environments instead of using discrete lookup tables. In this paper, we propose a hybrid system that integrates reinforcement learning and flocking control. Furthermore, to evaluate our theoretical framework we apply it to the problem of how a mobile robot network learns to avoid predators while maintaining topology and connectivity.

From biology we know that there are several anti-predator functions in animal aggregations [6], [7], [8].

Hung Manh La, Ronny Lim and Weihua Sheng are with the School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK 74078, USA (hung.la, ronny.lim, weihua.sheng)@okstate.edu.

One reason why fish schools and bird flocks move together is that it becomes difficult for predators to pick out individual prey from groups, because many moving prey create a sensory overload of the predator's visual channel [6]. Based on the fact that all prey should form a group to avoid the predator efficiently, this paper focuses on a distributed decision-making problem in which individuals have a number of options (safe places) to choose from. Often in these decisions there is a benefit to consensus, where all individuals choose the same one of the available options. However, the consensus methods [9], [10] require a connected network in which all robots can communicate with each other. This may not be valid in real environments, because sometimes some robots may not be connected to the network. Therefore, in this paper we are interested in the problem of reaching consensus even when robots cannot connect to the network all the time, but can still make the right decisions based on the proposed reinforcement learning algorithm. We address two problems of multi-robot concurrent learning of cooperative behaviors:

• How to generate an efficient combination of high-level behaviors (discrete states and actions) and low-level behaviors (continuous states and actions) in flocking control for multi-robot cooperation.

• How to coordinate the concurrent learning process in a distributed fashion.

II. FLOCKING CONTROL ALGORITHM

In this section we present the distributed flocking control algorithm. Basically, this algorithm allows the robots to move together without collision, avoid obstacles, and track the target. First, we consider n robots moving in an m-dimensional (e.g., m = 2, 3) Euclidean space. The dynamic equations of each robot are described as:

q̇_i = p_i,
ṗ_i = u_i,   i = 1, 2, ..., n,                    (1)

where q_i, p_i ∈ R^m are the position and velocity of robot i, respectively, and u_i is the control input of robot i.
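As an illustration of how the double-integrator model in (1) can be simulated, the following sketch (not part of the original algorithm) steps the dynamics forward with simple Euler integration; the time step and the example inputs are assumptions made only for this example.

```python
import numpy as np

def step_double_integrator(q, p, u, dt=0.02):
    """One Euler step of the double-integrator dynamics (1).

    q, p, u are (n, m) arrays holding the positions, velocities, and control
    inputs of n robots in m-dimensional space; dt is an assumed step size.
    """
    q_next = q + dt * p   # position integrates velocity
    p_next = p + dt * u   # velocity integrates the control input
    return q_next, p_next

# Example: 3 robots in the plane with unit velocities and zero control input.
q = np.zeros((3, 2))
p = np.ones((3, 2))
q, p = step_double_integrator(q, p, np.zeros((3, 2)))
print(q, p)
```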

To describe the topology of flocks we consider a dynamic graph G consisting of a vertex set ϑ = {1, 2, ..., n} and an edge set E ⊆ {(i, j) : i, j ∈ ϑ, j ≠ i}. In this topology each vertex denotes one member of the flock, and each edge denotes the communication link between two members. We define the neighborhood set of robot i as follows:

N_i^α = { j ∈ ϑ : ‖q_j − q_i‖ ≤ r, j ≠ i },                    (2)

here r is the active range (the radius of the neighborhood circle in the two-dimensional case, m = 2, or of the neighborhood sphere in the three-dimensional case, m = 3), and ‖·‖ is



the Euclidean distance. The superscript α indicates the actual neighbors (α neighborhood robots) of robot i and is used to distinguish them from the virtual neighbors (β neighborhood robots) used for obstacle avoidance.

The set of β neighbors (virtual neighbors) [11] of robot i at time t with k obstacles is

N_i^β(t) = { k ∈ ϑ_β : ‖q̂_{i,k} − q_i‖ ≤ r′ },                    (3)

here ϑ_β = {1, 2, ..., k} is the set of obstacles, and r′ is selected to be less than r; in our simulations r′ = 0.6r. q̂_{i,k} and p̂_{i,k} are the position and velocity of robot i projected on obstacle k, respectively. The virtual neighbors are used to generate the repulsive force that pushes the robots away from the obstacles. The moving obstacles are considered as the predators.
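A minimal sketch of how the neighbor sets (2) and (3) could be computed from current positions is given below; the choice r′ = 0.6r follows the text, while the array layout and helper names are assumptions for illustration.

```python
import numpy as np

def alpha_neighbors(q, i, r):
    """Indices j with ||q_j - q_i|| <= r and j != i (Eq. (2))."""
    dist = np.linalg.norm(q - q[i], axis=1)
    return [j for j in range(len(q)) if j != i and dist[j] <= r]

def beta_neighbors(q_hat_i, q_i, r_prime):
    """Indices k of obstacles whose projected point q_hat_{i,k} is within r' of robot i (Eq. (3))."""
    dist = np.linalg.norm(q_hat_i - q_i, axis=1)
    return [k for k in range(len(q_hat_i)) if dist[k] <= r_prime]

# Example with the active range and r' = 0.6 r used in the paper.
r = 600.0                                      # active range in mm (Section V)
q = np.array([[0.0, 0.0], [500.0, 0.0], [1300.0, 0.0]])
print(alpha_neighbors(q, 0, r))                # -> [1]
q_hat_0 = np.array([[300.0, 0.0]])             # assumed projection of robot 0 on one obstacle
print(beta_neighbors(q_hat_0, q[0], 0.6 * r))  # -> [0]
```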

The geometry of flocks is modeled by an α-lattice [11] that meets the following condition:

‖q_j − q_i‖ = d,   j ∈ N_i^α,                    (4)

here d is a positive constant indicating the desired distance between robot i and its neighbor j. However, at a singular configuration (q_i = q_j) the collective potential used to construct the geometry of flocks is not differentiable. Therefore, the set of algebraic constraints in (4) is rewritten in terms of the σ-norm [11] as follows:

‖q_j − q_i‖_σ = d_α,   j ∈ N_i^α,

here the constraint d_α = ‖d‖_σ with d = r/k_c, where k_c is the scaling factor. The σ-norm ‖·‖_σ of a vector is a map R^m ⇒ R^+ defined as

‖z‖_σ = (1/ε)[√(1 + ε‖z‖²) − 1],   with ε > 0.

Unlike the Euclidean norm ‖z‖, which is not differentiable at z = 0, the σ-norm ‖z‖_σ is differentiable everywhere. This property allows the construction of a smooth collective potential function for the robots.
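Since the σ-norm has a simple closed form, its value and gradient are easy to compute; the following sketch (with ε = 0.1, the value used later in Section V) illustrates the definition and the smoothness at z = 0.

```python
import numpy as np

def sigma_norm(z, eps=0.1):
    """sigma-norm: (1/eps) * (sqrt(1 + eps * ||z||^2) - 1)."""
    return (np.sqrt(1.0 + eps * np.dot(z, z)) - 1.0) / eps

def sigma_norm_grad(z, eps=0.1):
    """Gradient of the sigma-norm: z / sqrt(1 + eps * ||z||^2); well defined at z = 0."""
    return z / np.sqrt(1.0 + eps * np.dot(z, z))

z = np.array([3.0, 4.0])
print(sigma_norm(z))                  # sigma-norm of a vector whose Euclidean norm is 5
print(sigma_norm_grad(np.zeros(2)))   # finite at the origin, unlike the gradient of ||z||
```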

The distributed flocking control algorithm proposed in our previous work [12] (an extension of the flocking control algorithm in [11]) controls all robots to form an α-lattice configuration. This algorithm is presented as follows:

u_i = c_1^α Σ_{j∈N_i^α} φ_α(‖q_j − q_i‖_σ) n_{ij} + c_2^α Σ_{j∈N_i^α} a_{ij}(q)(p_j − p_i)
    + c_1^β Σ_{k∈N_i^β} φ_β(‖q̂_{i,k} − q_i‖_σ) n̂_{i,k} + c_2^β Σ_{k∈N_i^β} b_{i,k}(q)(p̂_{i,k} − p_i)
    − c_1^t (q_i − q_t) − c_2^t (p_i − p_t)
    − c_1^{mc} (q̄_{N_i^α∪{i}} − q_t) − c_2^{mc} (p̄_{N_i^α∪{i}} − p_t).                    (5)

In this control protocol, c_1^α, c_2^α, c_1^β, c_2^β, c_1^t, c_2^t, c_1^{mc} and c_2^{mc} are positive constants. φ_α(z) is the action function that controls the attractive or repulsive forces between robot i and its neighbor j. φ_β(z) is the action function that controls the repulsive force between robot i and obstacle k. n_{ij} and n̂_{i,k} are the vectors along the lines connecting the pair (q_i, q_j) and the pair (q̂_{i,k}, q_i), respectively. a_{ij}(q) and b_{i,k}(q) are the corresponding adjacency matrices. q_t and p_t are the position and velocity of the target, respectively. In this paper we consider q_t to be the safe place that the robots need to reach to escape the predators. The pair (q̄_{N_i^α∪{i}}, p̄_{N_i^α∪{i}}) is defined as

q̄_{N_i^α∪{i}} = (1/|N_i^α ∪ {i}|) Σ_{j∈N_i^α∪{i}} q_j,
p̄_{N_i^α∪{i}} = (1/|N_i^α ∪ {i}|) Σ_{j∈N_i^α∪{i}} p_j,

here |N_i^α ∪ {i}| is the number of agents in agent i's local neighborhood, including agent i itself. For more details on these terms, see [11], [12].
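To make the structure of (5) concrete, the sketch below assembles the α-interaction, target-tracking, and neighborhood-average terms for one robot. It omits the obstacle (β) terms and replaces the action function φ_α with a simple spring-like surrogate, so it is only a structural illustration under assumed gains, not the controller used in our experiments.

```python
import numpy as np

def n_ij(q_j, q_i, eps=0.1):
    """Vector along the line from q_i to q_j (gradient direction of the sigma-norm)."""
    z = q_j - q_i
    return z / np.sqrt(1.0 + eps * np.dot(z, z))

def flocking_input(i, q, p, neighbors, q_t, p_t, d,
                   c1a=1.0, c2a=1.0, c1t=0.5, c2t=0.5, c1mc=0.2, c2mc=0.2):
    """Structural sketch of control law (5) without the obstacle (beta) terms.

    phi_alpha is replaced by a spring-like surrogate (||q_j - q_i|| - d) and
    a_ij is taken as 1 for every neighbor; both are simplifying assumptions.
    """
    u = np.zeros_like(q[i])
    for j in neighbors:
        # alpha terms: attraction/repulsion toward the desired spacing d, plus velocity matching
        u = u + c1a * (np.linalg.norm(q[j] - q[i]) - d) * n_ij(q[j], q[i])
        u = u + c2a * (p[j] - p[i])
    # target (safe-place) tracking terms
    u = u - c1t * (q[i] - q_t) - c2t * (p[i] - p_t)
    # neighborhood-average (cohesion) terms over N_i^alpha union {i}
    group = list(neighbors) + [i]
    q_bar, p_bar = q[group].mean(axis=0), p[group].mean(axis=0)
    u = u - c1mc * (q_bar - q_t) - c2mc * (p_bar - p_t)
    return u

# Example: 3 robots, robot 0 with neighbors [1, 2], tracking a safe place at the origin.
q = np.array([[0.0, 0.0], [550.0, 0.0], [0.0, 550.0]])
p = np.zeros((3, 2))
print(flocking_input(0, q, p, [1, 2], q_t=np.zeros(2), p_t=np.zeros(2), d=550.0))
```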

III. GENERAL FRAMEWORK OF HYBRID SYSTEM IN MULTI-ROBOT DOMAIN

Our goals in developing the cooperative learning algorithm in a distributed fashion are to:

• allow the robots to learn even with continuous states and actions;

• coordinate the concurrent learning process to generate an efficient control policy in a distributed fashion.

With regard to the limitation of discrete and finite spaces, we propose a hybrid system with reinforcement learning in discrete space and a network controller (flocking controller) in continuous space, as shown in Figure 1. This control architecture has two main parts: the reinforcement learning module (high-level learning) and the flocking controller module (low-level control).

Fig. 1. The hybrid system of reinforcement learning and flocking controller in the multi-robot domain.

The flocking controller (5) is the network controller that drives all robots to move together without collision and to track a stationary or moving target. In addition, the flocking controller allows the robots to avoid the predator based on a repulsive force generated from an artificial potential field induced by the predator. However, this repulsive force can make the network break up. Therefore, we need to combine flocking control and reinforcement learning so that all robots flock together to avoid the predator in an efficient way while maintaining formation (topology) and connectivity.

The reinforcement learning module is the key component of the controller, and it is integrated with the flocking controller.



Its goal is to select one of the safe places for the flocking controller so that the prey (robots) can escape from the predator. By retrieving the states and rewards, the reinforcement learning module finds the appropriate safe place for each high-level behavior so that the topology and connectivity of the network can be maintained.

All robots in a cooperative multi-robot system can influence each other. Hence, it is important to ensure that the actions selected by the individual robots result in optimal decisions for the whole group. We consider the problem of computing a coordinated action for a group of n robots. Each robot i selects an individual action a_i from its action set A_i. Then the joint action for the whole group is a = (a_1, ..., a_n), which generates a reward R(a) for the whole team. The coordination problem is to find the optimal joint action a* that maximizes R(a), i.e., a* = arg max_a R(a).
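For a small group, a* can be found by brute force over the joint action space, which also makes the scaling problem evident (with 15 robots and 4 actions the space already contains about one billion joint actions, as noted in Section V). The toy team reward below is an assumption used only to illustrate the arg max.

```python
from itertools import product

def best_joint_action(action_sets, team_reward):
    """Exhaustive search for a* = argmax_a R(a) over the joint action space."""
    return max(product(*action_sets), key=team_reward)

# Toy example: 3 robots with 4 safe places each. The assumed team reward pays 1
# for every pair of robots that agree, so a consensus joint action is optimal.
def team_reward(a):
    return sum(a[i] == a[j] for i in range(len(a)) for j in range(i + 1, len(a)))

actions = [[1, 2, 3, 4]] * 3
print(best_joint_action(actions, team_reward))   # e.g. (1, 1, 1)
```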

IV. MODELING AND COOPERATIVE LEARNING ALGORITHM

A. Model of Multi-robot Learning to Avoid a Predator

Before we model the multi-robot learning to avoid a predator, we make the following assumptions:

• Each robot (prey) can sense the position and velocity of its neighbors.

• All robots can flock together (based on flocking control) with a fixed topology in free space.

• Each robot can sense the location of the predator within its sensing range.

We would like to build a learning algorithm in which the robots (prey) cooperate to avoid the predator together. We are interested in the problem of how to design an intelligent robot network that avoids the predator in an efficient way. This problem is illustrated in Figure 2. In this figure, the robots try to learn to make the same decision (select the same safe place to go) so that the network will not break up, i.e., so that connectivity can be maintained.

Fig. 2. Illustration of the safe places to choose.

First, we build the model of the predator as follows:

• The predator tries to go into the center of the network. This behavior of the predator is usually modeled in most situations [13], [14].

• The velocity of the predator is faster than that of the prey (robot).

These behaviors of the predator will cause the prey network to break up. As a result, the prey will not flock together [6]. This is one of the reasons that the prey have to learn in a cooperative fashion, so that they can agree on the same solution, such as finding the same safe place to escape the predator. From these analyses, we model the prey (robots) as follows:

Fig. 3. Illustration of the predator and prey detection ranges. R1 is the active range of the robot (prey), R2 is the predator detection range, and Rp is the prey detection range.

• All robots flock together in free space and form an α-lattice formation [11] based on the distributed flocking control law (5).

• If the predators go into the detection range (R2), the robot should learn and select one of the safe places to go (see Figure 3).

• If the predators go into the risk area (R1), the robot should move away based on the repulsive force.

With the models of the predator and prey built, our learning goal is how the robots can cooperate to avoid the predator while maintaining network performance, such as connectivity and topology. To achieve this goal we propose a cooperative learning algorithm, discussed in the next subsection.

B. Cooperative Learning Algorithm

In this section we develop a cooperative reinforcement learning (Q-learning) algorithm and compare it with an independent reinforcement learning algorithm. Let the current state, action, and reward of robot i be s_i, a_i, r_i, respectively, and let the next state and action of robot i be s′_i, a′_i, respectively. At each moment we have a partially observable environment. This means that not all robots can see the predator, and each robot only communicates with its neighbors to exchange local information. We have the following models for the state, action, and reward.

The state: we assume that when the learning mode starts (all robots have flocked together and formed an α-lattice formation) the state is initialized. The state is defined by the number of predators n_p and the number of neighboring robots n_r in the viewing range of robot i, s_i = [n_p, n_r]. For example, if one predator and 6 neighboring robots are in the viewing range of robot i, then the state for robot i is (1, 6). If only 6 neighboring robots and no predator are in the viewing range of robot i, then the state for robot i is (0, 6). If robot i performs the action, i.e., follows the virtual leader, it will keep moving until the next state changes, s′_i ≠ s_i. The maximum number of states depends on the number of robots: the possible states (n_p, n_r) range over n_p ∈ {0, 1} and n_r ∈ {0, 1, ..., n−1}. In general, the maximum total number of states of robot i in the case of one predator equals n × 2. Since all robots want to maintain their connection to the network, they want to avoid the states with n_r = 0, i.e., (1, 0) and (0, 0).
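A minimal sketch of this state representation and its enumeration for one predator is shown below; only the definition s_i = [n_p, n_r] comes from the text, the helper names are illustrative.

```python
def observe_state(num_predators_seen, num_neighbors_seen):
    """State of robot i: s_i = (n_p, n_r), the numbers of predators and
    neighboring robots in its viewing range."""
    return (num_predators_seen, num_neighbors_seen)

def enumerate_states(n):
    """All possible states for one predator: n_p in {0, 1}, n_r in {0, ..., n-1}."""
    return [(n_p, n_r) for n_p in (0, 1) for n_r in range(n)]

print(observe_state(1, 6))        # -> (1, 6): one predator and 6 neighbors in view
print(len(enumerate_states(15)))  # -> 30 states for the 15-robot simulation
```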

The action: we assume that the predator can approach from any direction in the field to attack the prey (robots) in a convenient way; hence the desired action of a robot to avoid the predator is to go to one of four safe places to escape. If we encode the 4 safe places as numbers 1, 2, 3, 4, then each robot has the action set A_i = [1, 2, 3, 4]. In the case that the predator enters the risk area, the robot generates the repulsive force to keep away from the predator. Additional actions can be introduced if desired. An illustration of this scenario is shown in Figure 4.

Fig. 4. The predator avoidance strategy.

The action, selecting one of the safe places, is generated in the reinforcement learning module. Then this action is implemented in the flocking controller. Basically, it is one of four rendezvous points that all robots are desired to reach.

The reinforcement reward: the reward signal changes in the experiments depending on the input data that is received. Two behaviors of the robot are considered: avoiding the predator and increasing the number of neighboring robots up to a certain number. For this purpose, the reward is defined as follows.

The positive reward is given if the robot maintains the number of its neighbors. The negative reward is given if the robot loses its neighbors. Specifically, a larger negative reward is given if the robot loses more neighbors. In a hexagonal lattice configuration, a robot inside the network has 6 neighbors, and a robot on the border of the network has 1 to 5 neighbors. Based on this fact we can define the reward function as follows:

if |N_i^α| < 6 then
    r_i = |N_i^α|;
else
    r_i = 6 (keep an α-lattice configuration);

The delayed reward approach has been followed, so the reward is computed after each learning trial.

1) Cooperative Learning Algorithm Description: In the cooperative fashion, each robot collects its own Q value based on its own action/state and its neighbors' actions/states. The idea of this collection is illustrated in Figure 5.

Fig. 5. Q value collection based on the robot's action/state and its neighbors' actions/states.

Algorithm 1: Majority Action Following Algorithm (MAF)

for each robot i do
    Step 1. Ask/observe its neighbors' decisions.
    Step 2. Select the action that most robots in the inclusive neighborhood set (i ∪ N_i^α) follow.
    if the numbers of robots in the set (i ∪ N_i^α) selecting each action are the same then
        Go to Step 3;
        Step 3. Robot i keeps its own decision;
    else
        Go back to Step 2.

For decision making, each prey first selects its next action based on the maximum Q value. Then the final decision is made with the Majority Action Following Algorithm (MAF) shown in Algorithm 1. In general, we propose the cooperative learning algorithm shown in Algorithm 2.
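A sketch of the MAF decision step for a single robot is given below, assuming each robot can query the actions currently selected by its neighbors; ties are resolved by keeping the robot's own decision, following Algorithm 1.

```python
from collections import Counter

def maf_decision(own_action, neighbor_actions):
    """Majority Action Following: adopt the most common action in the inclusive
    neighborhood {i} U N_i^alpha; on a tie, keep the robot's own decision."""
    votes = Counter(neighbor_actions + [own_action])
    (top_action, top_count), *rest = votes.most_common()
    # A tie means several actions share the highest count -> keep own decision.
    if rest and rest[0][1] == top_count:
        return own_action
    return top_action

print(maf_decision(2, [1, 1, 3, 1]))  # -> 1 (clear majority among the neighbors)
print(maf_decision(2, [1, 1, 2]))     # -> 2 (tie between actions 1 and 2: keep own)
```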

V. EXPERIMENT AND SIMULATION RESULTS

In this section we test our cooperative learning algorithm (Algorithm 2) combined with the distributed flocking control algorithm (5) in terms of connectivity, topology, and reward. For the experiment we use 8 real robots; however, to evaluate connectivity, topology, and reward we use 15 robots in simulation and run our algorithm over several training episodes.

Experimental setup: In this experiment we use 8 Rovio robots [15] that

have omni-directional motion capability.



Algorithm 2: Cooperative Learning in Distributed Fashion

Set parameters α and γ.
Build the state list and action list for each robot.
for each episode do
    for each robot i do
        Initialization phase:
        - Initialize the matrix Q_i.
        - Find the initial state (s_i) that corresponds to one in the state list.
        - Randomly select one of the actions (a_i) in the action list.
        Update phase (after robot i performs the selected action):
        - Update the next state (s′_i).
        - Select the next action (a′_i) based on the maximum Q_i value.
        - Compute the reward r_i.
        - Compute the Q_i value:
              Q_i(s_i, a_i) ⇐ Q_i(s_i, a_i) + α[r_i + γ Q_i(s′_i, a′_i) − Q_i(s_i, a_i)]
        - Update Q_i based on its neighbors:
              Q_i(s_i, a_i) ⇐ Q_i(s_i, a_i) + Σ_{j=1}^{|N_i^α|} Q_j(s_j, a_j)
        - Set the next state as the current state.
        Action implementation phase:
        - The final action is selected based on the Majority Action Following Algorithm (Algorithm 1).
    end
end
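The two Q-value updates in Algorithm 2 can be sketched for one robot as follows; the dictionary-based Q-table, the parameter values, and the example numbers are assumptions for illustration only.

```python
from collections import defaultdict

def cooperative_q_update(Q_i, s_i, a_i, r_i, s_next, actions,
                         neighbor_q_values, alpha=0.1, gamma=0.9):
    """One update of robot i's Q-table following Algorithm 2.

    Q_i: dict mapping (state, action) -> value (assumed storage format).
    neighbor_q_values: list of Q_j(s_j, a_j) values received from the alpha-neighbors.
    """
    # Greedy next action a'_i and temporal-difference update.
    a_next = max(actions, key=lambda a: Q_i[(s_next, a)])
    td_error = r_i + gamma * Q_i[(s_next, a_next)] - Q_i[(s_i, a_i)]
    Q_i[(s_i, a_i)] += alpha * td_error
    # Cooperative step: add the neighbors' Q-values for their own state/action pairs.
    Q_i[(s_i, a_i)] += sum(neighbor_q_values)
    return Q_i, a_next

# Example with an empty table, one predator in view, 6 neighbors, and reward 6.
Q = defaultdict(float)
Q, a_next = cooperative_q_update(Q, (1, 6), 1, 6.0, (1, 6), [1, 2, 3, 4], [0.5, 0.5])
print(Q[((1, 6), 1)], a_next)
```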

Specifically, 7 Rovio robots are used as prey, and one Rovio robot is used as a predator. Basically, these robots can move freely in 6 directions. The dynamic model of the Rovio robot can be approximated by Equation (1). However, the localization accuracy of the Rovio robot is low, and the robot does not have any sensing device to sense the pose (position and velocity) of its neighbors or the obstacles. Hence we use a VICON system [16] set up in our lab (Figure 6), which includes 12 infrared cameras to track moving objects. This tracking system gives the location and velocity of each moving object with high accuracy.

The active range of each robot is r = 600 mm. The desired distance among robots is d = 550 mm, and ε = 0.1 for the σ-norm.

The number of actions is 4 (4 safe places to escape the predator). This results in 4^15 = 1,073,741,824 (≈ 1 billion) possible joint actions (a huge search space).

In each learning episode:

- The robots have different (random) initial deployments.

- The environments are changed (different obstacle distributions, and different trajectories and initial locations of the predator).

- The learning task is to find one of the 4 optimal joint actions.

For example, if we encode the 4 safe places (4 actions) as numbers 1, 2, 3, 4, then the 4 optimal joint actions are 111111111111111, 222222222222222, 333333333333333, and 444444444444444. This means that all prey have to agree on the same action in order to satisfy the tasks, such as maintaining the same topology and connectivity.

Fig. 6. Infrared camera tracking system for the experimental setup.

Figure 7 shows the result of the first training episode. Since the robots have no prior experience of the environment (the behavior of the predator) the first time, they fail to agree on the same action. Hence, the network is broken. In the second episode (not shown here due to space limits), the learning result is better, since more than 50 percent of the robots agree on the same safe place to go. In the third episode the learning has converged and all of the robots choose the same action (the same safe place); see Figure 8. Therefore the connectivity is maintained.

Connectivity Evaluation: To analyze the connectivity of the network we define a connectivity matrix [c_ij(t)] as follows:

c_ij(t) = 1 if j ∈ N_i^α(t), i ≠ j,
c_ij(t) = 0 if j ∉ N_i^α(t), i ≠ j,

and c_ii = 0. Since the rank of the Laplacian [11] of a connected graph [c_ij(t)] of order n is at most n−1, i.e., rank(L(t)) ≤ n−1, the relative connectivity of the network at time t is defined as C(t) = (1/(n−1)) rank(L(t)), where L(t) is the Laplacian of [c_ij(t)]. If 0 ≤ C(t) < 1 the network is broken, and if C(t) = 1 the network is connected.
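A sketch of how C(t) could be computed from the robot positions is given below; building the Laplacian from the connectivity matrix and using a numerical matrix rank are implementation choices assumed for this illustration.

```python
import numpy as np

def relative_connectivity(q, r):
    """Relative connectivity C(t) = rank(L) / (n - 1), where L is the Laplacian
    of the connectivity matrix [c_ij] built from the active range r."""
    n = len(q)
    c = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and np.linalg.norm(q[j] - q[i]) <= r:
                c[i, j] = 1.0
    laplacian = np.diag(c.sum(axis=1)) - c
    return np.linalg.matrix_rank(laplacian) / (n - 1)

# Example: 3 robots in a line with spacing 550 mm and active range 600 mm.
q = np.array([[0.0, 0.0], [550.0, 0.0], [1100.0, 0.0]])
print(relative_connectivity(q, 600.0))   # -> 1.0, the network is connected
print(relative_connectivity(q, 500.0))   # -> 0.0, the network is broken
```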

Fig. 9. Connectivity evaluation for our proposed cooperative learning algorithm (a, b); (b) is a zoom-in at the 4th episode of (a).

From the result in Figure 9 we can see that with the proposed cooperative learning algorithm the connectivity of the network is maintained after 3 training episodes.



Fig. 7. The trajectories of 7 Rovio robots and one predator in the first learning episode. The small green squares are the safe places, and the filled red squares are the obstacles.

Topology Evaluation: The topology of the network is evaluated based on the following algorithm:

for each robot i do
    if |N_i^α| changes then
        Topology: T_i = abs(|N_i^α(k)| − |N_i^α(k−1)|)   (k is the time step or iteration)
    else if |N_i^α| does not change, but the indices of N_i^α change then
        Topology: T_i = number of index changes
    else
        Topology: T_i = 0
For the whole network: T = Σ_{i=1}^{n} T_i

We clearly see that if T = 0 the topology of the network does not change, and if T > 0 the topology of the network changes. In addition, if T is large the topology changes a lot; otherwise it changes a little. From the result in Figure 10 we can see that with our proposed cooperative learning algorithm the topology of the network does not change after 3 training episodes.
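A sketch of this topology metric computed from consecutive neighbor sets is given below; interpreting the "number of index changes" as the number of neighbors swapped out between two steps is an assumption about the metric.

```python
def topology_change(prev_neighbors, curr_neighbors):
    """Per-robot topology change T_i between two consecutive time steps.

    prev_neighbors / curr_neighbors: sets of neighbor indices N_i^alpha(k-1), N_i^alpha(k).
    When the sizes match, 'number of index changes' is interpreted here as the
    number of neighbors that were swapped out (an assumed reading of the metric).
    """
    if len(curr_neighbors) != len(prev_neighbors):
        return abs(len(curr_neighbors) - len(prev_neighbors))
    return len(prev_neighbors - curr_neighbors)   # 0 if the same robots remain neighbors

def network_topology_change(prev_sets, curr_sets):
    """Whole-network metric T = sum_i T_i; T = 0 means the topology is unchanged."""
    return sum(topology_change(p, c) for p, c in zip(prev_sets, curr_sets))

prev = [{1, 2}, {0, 2}, {0, 1}]
curr = [{1, 2}, {0, 3}, {0, 1}]              # robot 1 swapped neighbor 2 for neighbor 3
print(network_topology_change(prev, curr))   # -> 1
```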

Reward Evaluation: The global reward is computed as R = Σ_{i=1}^{n} r_i.

Fig. 10. Topology evaluation for our proposed cooperative learning algorithm.

From the result in Figure 11, with our proposed cooperative learning algorithm we obtain a maximum global reward value of 62 after about 2000 iterations. For more details about these results, please see the video files available at the following website: http://www.hungla.org/research.php

VI. CONCLUSION

We proposed a hybrid system that integrates flocking control and reinforcement learning to allow robots to behave intelligently in continuous space. The reinforcement learning is developed based on cooperative Q-learning and the Majority Action Following (MAF) strategy. We evaluated the proposed hybrid system for the case of multi-robot learning to avoid a predator.



Fig. 8. The trajectories of 7 Rovio robots and one predator in the third learning episode. The small green squares are the safe places, and the filled red squares are the obstacles.

Fig. 11. Global reward evaluation.

We showed that the proposed cooperative Q-learning allows the network to quickly find the optimal joint action. This also allows the network to maintain its topology and connectivity while avoiding the predator. Experimental and simulation results were collected to demonstrate the effectiveness of our proposed system.

REFERENCES

[1] L. Busoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, 38(2):156–172, 2008.

[2] F. Fernandez, D. Borrajo, and L. Parker. A reinforcement learning algorithm in cooperative multi-robot domain. Journal of Intelligent and Robotic Systems, 43:161–174, 2005.

[3] C. Gaskett, D. Wettergreen, and A. Zelinsky. Q-learning in continuous state and action spaces. AI'99, LNAI 1747, Springer-Verlag, Berlin, Heidelberg, pages 417–428, 1999.

[4] J. C. Santamaria, R. S. Sutton, and A. Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2):163–218, 1998.

[5] W. D. Smart and L. P. Kaelbling. Practical reinforcement learning in continuous spaces. Proceedings of the International Conference on Machine Learning, pages 903–910, 2000.

[6] M. Milinski and R. Heller. Influence of a predator on the optimal foraging behavior of sticklebacks. Nature, 275:642–644, 1978.

[7] G. Roberts. Why individual vigilance declines as group size increases. Animal Behaviour, 51:1077–1086, 1996.

[8] J. Krause, G. Ruxton, and D. Rubenstein. Is there always an influence of shoal size on predator hunting success? Journal of Fish Biology, 52:494–501, 1998.

[9] Z. Wang and D. Gu. A survey on application of consensus protocol to flocking control. Proceedings of the 12th Chinese Automation and Computing Society Conference in the UK, England, pages 1–8, 2006.

[10] W. Ren, R. W. Beard, and E. M. Atkins. A survey of consensus problems in multi-agent coordination. Proceedings of the American Control Conference, pages 1859–1864, 2005.

[11] R. Olfati-Saber. Flocking for multi-agent dynamic systems: Algorithms and theory. IEEE Transactions on Automatic Control, 51(3):401–420, 2006.

[12] H. M. La and W. Sheng. Flocking control of a mobile sensor network to track and observe a moving target. Proceedings of the 2009 IEEE International Conference on Robotics and Automation, 2009.

[13] S. H. Lee. Predator's attack-induced phase-like transition in prey flock. Physics Letters A, 357:270–274, 2006.

[14] K. Morihiro, H. Nishimura, T. Isokawa, and N. Matsui. Learning grouping and anti-predator behaviors for multi-agent systems. Proceedings of the International Conference on Knowledge-Based Intelligent Information and Engineering Systems, pages 426–433, 2008.

[15] Rovio robot: http://www.wowwee.com/en/support/rovio.

[16] VICON motion system: http://www.vicon.com/.

Proceedings of the 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK.