
Embedding learning in behaviour-based architectures: a conceptual approach

JOÃO SEQUEIRA,¹ JOSÉ DEL R. MILLÁN,² M. ISABEL RIBEIRO¹ and JOÃO G. M. GONÇALVES²

¹Instituto Superior Técnico/Instituto de Sistemas e Robótica, Av. Rovisco Pais 1, 1096 Lisboa Codex, Portugal
²European Commission, Joint Research Centre, TP 270, 21020 Ispra (VA), Italy

Received February and accepted October 1997

This paper addresses the identification of the key elements to be present in a generic, behaviour-based architecture for robot control. These elements include the definition of behaviours aiming at rough task execution and the use of learning to improve the performance of the primitive behaviour set. Experimental results with real and simulated systems, demonstrating cooperative behaviour between a mobile platform and a manipulator, are presented. © 1998 Chapman & Hall

Keywords: Behaviour, learning, cooperation

1. Introduction

Although there are several successful application examples of behaviour-based architectures (Brooks, 1986; Košecká, 1995; Millán, 1996; Sequeira et al., 1996; Ward and Arkin, 1994), there is still a lack of methodologies for representation, synthesis and analysis of this approach. This paper proposes a representation with properties suitable for formal analysis, namely concerning controllability.

The paper is organized as follows. Section 2 defines the concepts used in the rest of the paper. Section 3 proposes an architecture supported on the previously defined concepts. Section 4 introduces the use of learning techniques in the architecture, as a framework related to controllability concepts. Section 5 presents an experimental setup used to validate the concepts defined in the paper, together with results of several experiments using robots with different kinematic constraints acting cooperatively in two kinds of missions. Finally, Section 6 concludes the paper.

2. Agent fundamentals

The concept of an agent, with roots in the artificial intelligence community, encapsulates most aspects of the behaviour approaches, allowing problems with multiple heterodox robotic mechanisms to be treated under the same terminology. We consider an agent as being defined by the information pipeline

Environment → Sensors → Behaviours → Actuators → Robotic device

The information on the agent state and the task to be executed is contained in the following sets:

S: any internal state variables, e.g. battery level, sensing data;
Q: robot configuration space, or C-space;
A: set of available motion strategies or actions;
T = Q × A: agent state set.

The sensor and/or state information circulates between an agent and the environment in bursts representing the system events. At each event the agent takes an action, defined as a mapping a : Q → Q.* The domain set contains the initial conditions of the action. The range set contains the trajectory in the C-space generated, in time, by the action.†

The configuration, q, and the action, a, taken by the agent at an event completely determine the trajectory of the robot until the next event is detected. Therefore, an appropriate state variable for the agent is defined by a tuple (q, a) ∈ T. This definition contains the information about the current motion strategy of the agent and also some

* We use the term action with the meaning of a trajectory generator, without semantic concerns, and the term behaviour if some semantic content is implicit.
† The time parameter can be assumed implicit.

0956-5515 © 1998 Chapman & Hall

Journal of Intelligent Manufacturing (1998) 9, 201–207


memory of the recent past, embedded in the initial condition for the action. If an agent is fed back with the complete state of another agent, it will know exactly what kind of motion strategy the other is taking, and may use that information when making its own decisions.
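The state tuple (q, a) and the event-driven update above can be sketched in code. The following is a minimal illustration, under the simplifying assumption (ours, not the paper's) that an action is reduced to the map from its initial condition to the configuration reached at the next event:

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Config = Tuple[float, ...]           # a configuration q in the C-space Q
Action = Callable[[Config], Config]  # a : Q -> Q (reduced here to its endpoint map)

@dataclass
class AgentState:
    """Agent state t = (q, a) in T = Q x A: configuration plus motion strategy."""
    q: Config
    a: Action

def on_event(state: AgentState, new_action: Action) -> AgentState:
    """At an event the agent commits to a new action; the tuple (q, a)
    then fully determines the trajectory until the next event."""
    return AgentState(q=new_action(state.q), a=new_action)

# Illustrative action: translate one unit along the first coordinate.
step_x: Action = lambda q: (q[0] + 1.0,) + q[1:]

s = AgentState(q=(0.0, 0.0), a=step_x)
s = on_event(s, step_x)   # s.q is now (1.0, 0.0)
```

Because the new state carries the action as well as the configuration, feeding it back to another agent conveys the motion strategy, as described above.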

3. A conceptual behaviour-based architecture

The basic structure that allows the generation of motion in the state set consists of a closed operation, s_T : T × T → T, which gives the set the structure of an algebraic group. Consider an agent in the state t_i = (q_i, a_i). At an event, some decision mechanism requires the agent to switch to the state t_j = (q_j, a_j). The operation s_T can be interpreted as the mechanism that generates a path in Q connecting t_i with t_j. This interpretation is compatible with the well-known group operation properties, requiring the existence of:

(1) A null element of the group, meaning that at each event it must always be possible to do nothing, or to do something that does not compromise future decisions;

(2) An inverse for each element of the group, thus enabling the possibility of coming back to any state, e.g. for a replanning operation;

(3) The associativity property, meaning that the group operation does not make any special distinction between states.
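A concrete toy instance may help fix ideas: if actions are planar displacements composed by vector addition, the three properties above hold directly. This is an illustration of the required group structure, not the paper's formal construction:

```python
def s_T(d1, d2):
    """Group operation: the net displacement of doing d1 then d2."""
    return (d1[0] + d2[0], d1[1] + d2[1])

NULL = (0.0, 0.0)          # (1) null element: do nothing at an event

def inverse(d):            # (2) inverse: come back to any state, e.g. to replan
    return (-d[0], -d[1])

a, b, c = (1.0, 0.0), (0.0, 2.0), (-3.0, 1.5)
assert s_T(a, NULL) == a                        # doing nothing changes nothing
assert s_T(a, inverse(a)) == NULL               # every move can be undone
assert s_T(s_T(a, b), c) == s_T(a, s_T(b, c))   # (3) associativity
```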

Figure 1 shows the conceptual architecture based on the group structure. The Task Controller is responsible for the combination of the basic behaviours, b_1, …, b_p, belonging to A, into more complex behaviours. The operation s_T is expressed in the Σ block. λ represents the control variable for the combination of the basic behaviours, b_i.

The rightmost levels of the agent information pipeline are represented by a continuous dynamical system, q̇ = f(q, u), with q ∈ Q and u being the control variable. Under the hypothesis of weak controllability of this system, the inner level of the agent can be expressed by the blocks {K, L_1, …, L_m, Σ, ∫} for some finite m (Sequeira and Ribeiro, 1997). The L_i represent vector fields able to generate the trajectories in Q, and K is a controller able to drive the robot through the reference configurations q_ref. The weak controllability hypothesis is not too restrictive. In fact, the behaviour paradigm is meaningful only if there exists a K able to drive the robot between any two points of its configuration space. Figure 2 shows a graphical representation of this model, merged with the abstract model of Fig. 1. The merging of the Behaviour Structure, defined by the blocks {b_1, …, b_p, Task Controller, Σ}, with the Robot Structure, defined by {K, L_1, …, L_m, Σ, ∫}, is made through an Interface Block, represented by the symbol H. This block is controlled by the Task Controller and can also be used to bypass the Behaviour Structure. Any configuration imposed at input i_2 is directly transferred to the output, overriding the contents of i_1.

The Task Controller block uses specific information about the class of tasks the agent is able to perform, e.g. the criteria that define which action to apply under the range of environmental conditions considered, to define the control variable λ sent to the Σ block and, when necessary, point-to-point actions, sent directly to the H block.

4. Task controller design, agent controllability and learning

The Task Controller design relies on the principle that the agent is equipped with a set of behaviours powerful enough to handle all the events that may arise in the environment where it operates. It is the responsibility of the programmer to provide this set of behaviours. This principle represents the controllability concept used in the theory of Discrete Event Systems (DES) (Ramadge and Wonham, 1987). Agent definition using DES theory considers an event as a symbol of some alphabet, and a behaviour as a trajectory (i.e. a string of symbols) in that alphabet (Antsaklis et al., 1994; Košecká, 1995). In DES-based frameworks, an agent is controllable if it satisfies a specified behaviour (in the DES sense), independently of the events that may occur.

The DES controllability concept also provides the methodology for the design of the Task Controller. It identifies all the environment situations that can be classified as events and, for each event, provides a behaviour

Fig. 1. Conceptual behaviour architecture.
Fig. 2. The complete behaviour-based architecture.

202 Sequeira et al.


able to handle it. This results in a Finite State Machine (FSM) based design for the Task Controller, making the agent controllable in the sense used in DES (and therefore in the usual control sense), i.e. the agent is able to handle the set of situations that were explicitly considered in the initial design.

A one-to-one correspondence between a behaviour and a coordinate of R^p (the space of the λs) can be assumed.‡ It is possible to define a number, m, of subspaces such that ⋃_{i=1}^m R^{p_i} = R^p and R^{p_i} ∩ R^{p_j} = ∅ for p_i ≠ p_j, and a one-to-one correspondence between events and subspaces R^{p_i} is created.

The initial map events → R^p is defined by the Task Controller FSM, which, by the previous discussion about DES, assures the agent controllability. In general, this map may have as its range only some constrained subspace of R^p. This is the case, for example, in a 'potential field'-based obstacle avoidance, where the λ represent weights that depend on some distance between the robot and the obstacles. The next logical step is to incorporate the means to improve the performance with experience, without destroying the controllability property. This leads to the use of learning techniques.

The division of the space R^p into smaller, unconstrained

subspaces enables the use of Incremental Topology Preserving Maps (ITPM), together with reinforcement learning techniques (Millán, 1996, 1997).§ These maps preserve the topology knowledge being acquired in units, and the time history in links between the units. Knowledge acquired through time, and preserved in the ITPM, can be used to optimize the decision when facing a situation similar to one already encountered in the past. ITPMs can be used in two different ways in the presented architecture, defined below.

(1) Local learning: to improve the basic performance of the agent, expressed in local (i.e. in each of the subspaces R^{p_i}) searches for optimal actions. The ITPM either defines the current best λ for each given sensorial situation, or indicates that an a priori fixed λ must be used.

(2) Global learning: to improve the global performance, different subspaces R^{p_i} can be searched. The initial encoding, defined by the above relations, represents the a priori knowledge about the map Perception → Behaviours. Often, this initial encoding will not be optimal. Optimization can be carried out at three levels:

(i) By allowing different combinations for the construction of each subspace;

(ii) By allowing the dimensions, p_i, of the subspaces to be time-varying;

(iii) By accepting some overlap between the subspaces.

The first level is defined by all the permutations of the m indices that define the coordinates of the space R^p. The search space that represents the two last levels is composed of all of the m-tuples (p_1, p_2, …, p_m) that verify

∑_{k=1}^m p_k = m + d  ∧  p_i ≤ m, i = 1, …, m,

where d represents the total amount of subspace overlaps (initially d = 0). ITPMs can be used to generate:

(i) sequences of m-tuples;
(ii) local variations (in the sense that they respect the above relation) of each m-tuple.
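The search space for the two last levels can be enumerated directly. The sketch below assumes, beyond what the text states, that each subspace has at least one dimension (p_i ≥ 1):

```python
from itertools import product

def m_tuples(m, d):
    """Enumerate all m-tuples (p_1, ..., p_m) with sum(p_k) = m + d and
    1 <= p_i <= m, where d is the total amount of subspace overlap.
    (The lower bound p_i >= 1 is an assumption: each subspace is taken
    to have at least one dimension.)"""
    return [t for t in product(range(1, m + 1), repeat=m) if sum(t) == m + d]

# With no overlap (d = 0) the only admissible tuple is all ones.
print(m_tuples(3, 0))   # -> [(1, 1, 1)]
print(m_tuples(2, 1))   # -> [(1, 2), (2, 1)]
```

An ITPM could then propose sequences over this finite set, or local variations of a given tuple that keep the sum constraint satisfied.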

In this work one ITPM is used to learn, in local learning mode, a trajectory for a mobile platform with a constraint on the motion in one of the coordinates (the orientation) of the C-space. Thus, the ITPM works only in the remaining two coordinates (the position).

5. Cooperation using behaviours

The architecture of Fig. 2 was tested, without ITPM stages, in the control of a real mobile manipulator (see Fig. 6, Section 5.4), to evaluate the performance of the basic behaviour architecture. Simulation studies, using one ITPM stage (for the mobile platform), were also performed. Ongoing work includes the test of the complete architecture in the real experimental setup shown in Fig. 6. Two missions were considered:

(1) Traversal of a corridor to execute a task;
(2) Task execution in a close neighbourhood of an obstacle.

The experimental setup is organized in three agents:

(1) The mobile platform;
(2) The first three joints of the manipulator, in charge of the positioning in (x, y, z);
(3) The last three joints of the manipulator, in charge of the orientation (α, β, γ).

The partitioning of the manipulator into two agents simplifies the behaviour design. It keeps the gripper out of operation until the task neighbourhood is reached, thus minimizing potential conflicts between agents during mission execution. The system may approach the task (x, y, z) position using a trajectory that will make it impossible for the gripper to achieve the task orientation. In this case some manoeuvring, defined by a specific behaviour, will be needed.

The Behaviour Structures for the three agents were

implemented in a Sun SPARC 10 workstation, with a 40 MHz processor, while the Robot Structures for the mobile platform and the manipulator were implemented using 68020 processors at 16 MHz in the real setup, and in the aforementioned workstation in the simulation cases.

The Task Controller is an FSM in which the output is of the form λ = (…, 0, 1, 0, …), i.e. only one behaviour is

‡ In some sense each coordinate switches on/off and/or weights a basic behaviour.
§ Although the ITPMs work in constrained spaces (surfaces), one of the advantages of this optimization technique lies in the simplicity of the implementation when dealing with unconstrained subspaces.

Embedding learning in behaviour-based architectures 203


active at a time. For all three agents the corresponding FSMs are fully interconnected networks, where each state corresponds to a single behaviour.
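The Task Controller just described can be sketched as a fully interconnected FSM emitting a one-hot λ. The behaviour names and the transition table below are illustrative assumptions, not taken from the paper:

```python
class TaskControllerFSM:
    """FSM with one state per behaviour; any event may switch to any state."""

    def __init__(self, behaviours, transitions, initial=0):
        self.behaviours = behaviours    # one FSM state per behaviour
        self.transitions = transitions  # dict: (state, event) -> next state
        self.state = initial

    def step(self, event):
        """On each detected event, possibly switch behaviour and emit
        lambda = (..., 0, 1, 0, ...): exactly one behaviour active."""
        self.state = self.transitions.get((self.state, event), self.state)
        return tuple(1 if i == self.state else 0
                     for i in range(len(self.behaviours)))

fsm = TaskControllerFSM(
    behaviours=["obstacle_avoidance", "go_straight_to_goal", "null_action"],
    transitions={(0, "path_clear"): 1, (1, "obstacle"): 0, (1, "task_done"): 2},
)
lam = fsm.step("path_clear")   # -> (0, 1, 0): only go_straight_to_goal active
```

Because the transition table may pair every state with every event, the network is fully interconnected, matching the design above.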

5.1. Vehicle behaviours

The vehicle used in the experiments is of non-holonomic type, with a non-slide kinematic constraint. The following classes of behaviours were used (the numbers code the behaviours in the graphics of Fig. 5, Section 5.4).

5.1.1. Obstacle avoidance (3–4)

This class of behaviours is divided into two minor subclasses: far-from-task navigation (3), used whenever the vehicle is out of a neighbourhood, of radius 1.5 m, centred on the task, and near-to-task navigation (4), the opposite. This subdivision is justified by a particular task execution philosophy: if an obstacle is placed in the trajectory of the vehicle while it is far from the task goal, obstacle circumvention is executed. When the vehicle is in the proximity of the task, the environment must be scanned in greater detail than before and obstacle circumvention might not be possible. These classes of behaviours were chosen because they match the data available from the ring of 24 ultrasonic sensors.

The first class of obstacle avoidance behaviours contains two behaviours, used respectively for forward and backward motion. If an obstacle is placed near the left or the right side of the vehicle, then the vehicle moves away in the forward/backward direction, this strategy being compatible with the non-slide constraint. It also avoids collisions generated by the sweep of a large area around the vehicle rotation axis. The second class of obstacle avoidance behaviours contains three classes of vector fields, to deal with front/back walls and side walls.

Figure 3 illustrates some of the vector fields used to define the behaviours, for forward and lateral obstacle avoidance. Each vector is placed over the ultrasonic sensor that generates it.

5.1.2. Null action (10)

This behaviour is used whenever the vehicle receives information from the arm indicating that the arm can complete the task without further help from the vehicle.

5.1.3. Go straight to goal (11) and single step to goal (7)

These behaviours determine an error vector between the present and the goal Cartesian positions, and use a low-level controller that generates appropriate controls for the mobile platform aiming at correcting this error vector. If the vehicle is the task leader, then a single command is sent to the robot (thus saving bandwidth).

Mission 2 needs an escape from gripper singularity behaviour, triggered whenever the gripper signals the vehicle that it cannot execute its task, for instance due to physical limitations in the joint values. The vehicle moves according to an algorithm specified a priori. Figure 4 shows a typical pattern for the resulting trajectory.

5.2. Arm behaviours

5.2.1. Go straight to goal (2)

This behaviour is similar to the one described for the vehicle. If the arm becomes stretched above a pre-specified limit, it disables the translation motion, keeping just the rotational component.

5.2.2. Approach to goal (1)

This behaviour is similar to go straight to goal, except that the stretching limit is higher, to allow task completion in situations where it is no longer physically necessary to move the vehicle.

5.2.3. Get out of type 1 singularity region (5)

This behaviour takes the arm out of the singular configurations characterized by the arm being almost completely stretched. The behaviour is specified through the weighted sum of two vector fields. The first one points to the manipulator base position, while the second specifies the rotational component to have the arm heading to the task.

Fig. 3. Obstacle avoidance behaviours.
Fig. 4. Escape from gripper singularity local vector field.



5.2.4. Get out of type 2 singularity region (6)

This singularity region is defined as a cylinder with a symmetry axis along the vertical axis of the manipulator base. This region includes those configurations in which the two manipulator links are close to each other. This behaviour is also specified through a weighted sum of two vector fields, specifying a rotational component until the arm is heading to the task, and a translational component pointing to the task.

5.2.5. Get out of double singularity region (4)

This singularity region is defined by a small volume generated by the intersection of the volumes of singularity regions 1 and 2. In practical terms, the arm is in this region when it is in the upward position. The vector field associated with the behaviour specifies a fixed direction of motion that drives the arm toward a region of higher manoeuvrability.

5.3. Gripper behaviours

This agent has a single behaviour: go straight to goal. The agent is always trying to reach the orientation specified by the task. It generates a singularity warning to the other two agents whenever the whole system is in a 5 cm neighbourhood of the task location (x, y, z).

5.4. Mission 1 – corridor traversal

The approach and corridor traversal mission is defined in Table 1. This mission uses the concept of a Task Leader, meaning that each task of the mission has an agent relative to which the task is defined. Figure 5 displays the typical results of two runs. The left-hand graphs correspond to the mission of Table 1. In the right-hand graphs the system starts in a random configuration near the one of the first run. The lower graphs display the activation sequence of the behaviours, each being identified by the integer codes previously specified. The dotted lines in the xy plane indicate the position of the corridor structure. Figure 6 shows the typical views at the beginning (1), in the vicinity of the approach task (2–3) and at the end of the mission (4).

In the first test the vehicle starts heading directly to the first task. When it is completed, the vehicle heads to the second task until the corridor entrance is detected. Obstacle avoidance behaviours are used to traverse the corridor until a point is reached such that the arm is able to execute the task, thus activating the null behaviour. The arm starts in the upward configuration and therefore activates the behaviours that will get it out of the singularity regions. When out of the singularities, the arm approaches a point on the vertical of the (x, y) point that corresponds to the vehicle task. This means that the arm tries to complete a task that has no

Fig. 5. Experimental results – trajectories and behaviour activation sequences.

Table 1. Mission 1 specification (α = β = 0 rad)

Task                    Task leader   x     y     z     θ   γ
Initial configuration   –             3.93  2.07  0.92  0   0
Approach corridor       Vehicle       3.38  3.06  –     –   –
Corridor traversal      Gripper       0.97  3.06  0.3   –   1.57



significance, from the point of view of the mission, since the leader for this task was defined to be the vehicle. This strategy causes the arm to enter singularity region 2, thus activating the corresponding behaviour. By the time the vehicle is trying to enter the corridor, the arm is already trying to go straight to its goal.

Two forms of travelling to the goal were used for the vehicle: (i) the single step to goal (used in the second test) sends a direct position to the position controller of the vehicle, while (ii) the go straight to goal (used in the first test) computes local vectors that direct the vehicle to the task. No significant difference in performance was found. The vehicle single-steps to the first task, and the behaviour activation pattern for the second task is similar to the one in the first test. As for the arm, it starts by going straight to the goal and, as the vehicle approaches the first task, it switches to the approaching behaviour. The remaining activation pattern is similar to the first test.

5.5. Mission 2 – testing learning

The learning stage of the architecture is composed of one ITPM (Millán, 1997) that, if necessary, 'adapts' the actions suggested by the behaviours of the mobile platform.¶ The results presented were obtained with a simulated mobile manipulator having the same characteristics as the experimental setup. The ITPM was merged with the behaviour architecture using a client-server software architecture. Whenever the client (the platform) is not able to communicate with the server (the ITPM), it uses the basic behaviours previously established. Otherwise, the action suggested by the ITPM is used.

Figure 7 illustrates the ITPM after three mission runs (the mission was successfully completed in the third run). Units are numbered sequentially, with lines connecting the related ones (see Millán (1997) for the possible relations between units). The short line associated with each ITPM unit indicates the motion direction advised by that unit. The first and second runs ended at unit 5, due to the short distance between the mobile manipulator and the obstacle. This was caused by a low obstacle detection threshold. In the third run the system was able to use the previously acquired knowledge to start avoiding the obstacle sooner. This is expressed in the interconnections between units 3 to 6. The interconnections in the topmost part of the image express the potential of the ITPM to adapt the actions it proposes in a single run. The simple basic behaviours used, together with the nonholonomic characteristic of the mobile platform, cause these manoeuvres. By tuning the obstacle detection threshold, it is possible to execute the mission using only the basic set of behaviours. However, the use of the ITPM enables the use of a lower threshold.
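The client-server fallback described above can be sketched as follows; the function and exception names are illustrative, not taken from the paper's implementation:

```python
def select_action(basic_action, query_itpm):
    """The platform (client) uses the ITPM-suggested action when the
    server is reachable, and otherwise falls back to the basic behaviour.
    query_itpm: callable returning the ITPM-adapted action, or raising
    ConnectionError/TimeoutError when the server cannot be reached."""
    try:
        return query_itpm()
    except (ConnectionError, TimeoutError):
        return basic_action

# When the server answers, its adapted action wins.
assert select_action("basic", lambda: "adapted") == "adapted"

# When communication fails, the previously established behaviour is kept.
def unreachable():
    raise ConnectionError("ITPM server down")
assert select_action("basic", unreachable) == "basic"
```

This design keeps the controllability guarantee of the basic behaviour set intact: losing the learning server degrades performance, not safety.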

Fig. 6. Four views of the door traversal experiment.

¶ Note that, in some sense, the ITPM action in this mission is similar to an online adaptation of the λ control.



6. Conclusions

The experimental results show consistency in the performance of the system. Despite the sensory limitations of the vehicle, it was possible to traverse a corridor using, in practice, blind navigation. The important aspect is that the obstacle avoidance behaviours were able to position the vehicle at the corridor entrance in such a way that blind navigation was an acceptable technique for the mission execution.

The work presented in this paper is still ongoing. The learning stages described are now being carried out on the real system. The behaviours used for the corridor traversal example may cause some overshooting in the vehicle orientation at the corridor entrance. This is another example where the use of the ITPM stage (already implemented) to modify the actions proposed by the basic behaviours is expected to increase the performance of the system.

Acknowledgements

This research was partially sponsored by the programme PRAXIS XXI, project COOPERA – 2/2.1/TPAR/2087/95. J. Sequeira acknowledges a NATO grant, 9/A/95/PO.

References

Antsaklis, P., Lemmon, M. and Stiver, J. (1994) Modeling and design of hybrid control systems, in Proceedings of the 2nd IEEE Mediterranean Symposium on New Directions in Control and Automation, Maleme-Chania, Crete, Greece, June.

Brooks, R. A. (1986) A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1), 14–23.

Košecká, J. (1995) Supervisory control theory for autonomous mobile agents, PhD Dissertation Proposal, University of Pennsylvania.

Millán, J. (1996) Rapid, safe, and incremental learning of navigation strategies. IEEE Transactions on Systems, Man and Cybernetics – Part B, 26(3), 408–420.

Millán, J. (1997) Incremental acquisition of local networks for the control of autonomous robots, in Proceedings of the 7th International Conference on Artificial Neural Networks, Lausanne, Switzerland, October.

Ramadge, P. and Wonham, W. (1987) Supervisory control of a class of discrete event processes. SIAM Journal of Control and Optimization, 25(1), 207–230.

Sequeira, J. and Ribeiro, M. I. (1997) A behaviour based kernel architecture for robot control, in Proceedings of SYROCO'97, Nantes, France, September.

Sequeira, J., Ribeiro, M. I. and Gonçalves, J. M. G. (1996) Cooperation among multiple robotic mechanisms, in Proceedings of SIRS'96, Lisbon, Portugal, July.

Ward, K. and Arkin, R. (1994) Reactive control of a mobile manipulator using pseudo-joint damping, Technical Report, Georgia Institute of Technology.

Fig. 7. Final ITPM after two runs of mission 2.


