

Robotics and Autonomous Systems 55 (2007) 735–740. www.elsevier.com/locate/robot

Autonomous and fast robot learning through motivation

M. Rodríguez a,*, R. Iglesias a, C.V. Regueiro b, J. Correa a, S. Barro a

a Department of Electronics and Computer Science, University of Santiago de Compostela, 15782, Santiago de Compostela, Spain
b Department of Electronics and Systems, University of A Coruña, 15701, A Coruña, Spain

Available online 10 May 2007

Abstract

Research on robot-learning techniques that are fast, user-friendly, and require little application-specific knowledge from the user is increasingly encouraged in a society where the demand for home-care and domestic-service robots grows continuously. In this context we propose a methodology that combines reinforcement learning and genetic algorithms to teach a robot how to perform a task when only the main restrictions of the desired behaviour are specified. Through this combination, both paradigms are merged in such a way that they influence each other to achieve fast convergence towards a good robot-control policy and to reduce the random exploration the robot needs to carry out in order to find a solution.

Another advantage of our proposal is that it can easily incorporate any kind of domain-dependent knowledge about the task. This is very useful for improving a robot controller, for applying a robot controller to a different robot platform, or when we have certain “feelings” about how the task should be solved.

The performance of our proposal is shown through its application to solve a common problem in mobile robotics.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Reinforcement learning; Robot control; Autonomous agents; Genetic algorithms

1. Introduction

In an ageing population where the demands on healthcare, social services and domestic care are continuously increasing, robots that are capable of learning tasks on their own (in domestic environments, for example) are required more and more. In this context, reinforcement learning (RL) seems to be a very interesting strategy, since the robot learns from its interaction with the environment thanks to an extrinsic motivation, called reinforcement, which tells the robot how well or badly it has performed but nothing about the desired set of actions it should have carried out. Although the psychologist's concept of motivation is not usually associated with machine learning, there is a parallelism between the motivated behaviour of animals and the behaviour of an RL system that tries to maximize the amount of reward it receives [1].

* Corresponding author. E-mail addresses: [email protected] (M. Rodríguez), [email protected] (C.V. Regueiro).

0921-8890/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.robot.2007.05.005

The maximization of the reinforcement allows the robot to learn a utility function of states and actions, called the Q-function, which reflects the consequences of executing every possible action in each state. We will assume that the robot interacts with the environment at discrete time steps and that it is able to translate the different situations it may detect through its sensors into a finite set of states, S. Table 1 shows one of the many reinforcement learning algorithms that can be applied to teach a robot: the truncated temporal differences algorithm, TTD(λ, m) [2]. Basically, the robot begins with an initial set of random negative Q-values, Q(s, a) ≤ 0, ∀ s, a, and then performs a stochastic exploration of its environment. While the robot is moving around, it keeps track of the last m executed actions, so that their corresponding Q-values are decreased or increased depending on whether or not the robot receives negative reinforcements. The parameters γ, λ and β (present in Table 1) determine how much the Q-values change for every positive or negative reinforcement the robot receives. As the learning process proceeds, the robot should tend to execute those actions that seem to be the best ones according to the Q-values, as given by Eq. (1).


Fig. 1. Schematic representation of our proposal. Initially the robot moves using the greedy control policy until it finds a situation it does not know how to solve (a); a genetic algorithm is applied to find a solution (b); once the problem is solved, the greedy policy is applied again (c).

Table 1
Truncated temporal differences algorithm, TTD(λ, m)

1. Observe the current state, s(t): s[0] = s(t)
2. Select an action a(t) for s(t): a[0] = a(t)
3. Perform action a(t)
4. Observe the new state s(t + 1) and the reinforcement value r(t)
5. r[0] = r(t), u[0] = max_a Q_t(s(t + 1), a)
6. For k = 0, 1, ..., m − 1 do:
   if k = 0 then z = r[k] + γ u[k]
   else z = r[k] + γ (λ z + (1 − λ) u[k]),   0 < γ, λ ≤ 1
7. Update the Q-values:
   δ = z − Q_t(s[m − 1], a[m − 1])
   ΔQ_t(s[m − 1], a[m − 1]) = β δ

π*(s) = arg max_a {Q(s, a)}    (1)

π* is called the greedy policy.

Despite the benefits of the RL paradigm in autonomous robot learning, there are important problems to consider when it is applied. While the robot is learning, it has to choose between executing actions it has already tried in the past and executing new, randomly selected actions, which can either improve the robot's behaviour or, under particular circumstances and if they are wrong, destabilize the learning process.
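To make the procedure in Table 1 and Eq. (1) concrete, the following minimal Python sketch shows one possible tabular implementation of the TTD(λ, m) update and of the greedy policy. It is only an illustration of the algorithm as summarized above: the number of actions and the buffer handling are assumptions, while the parameter values are those reported in Section 5.1.

```python
import numpy as np

# Illustrative sizes; the parameter values follow Section 5.1 of the paper.
N_STATES, N_ACTIONS = 220, 8          # 8 actions is an assumption
GAMMA, LAM, BETA, M = 0.95, 0.8, 0.35, 30

# Q(s, a) <= 0 for all s, a: start from small random negative values.
Q = -np.random.rand(N_STATES, N_ACTIONS)

def greedy_action(s):
    """Eq. (1): pi*(s) = arg max_a Q(s, a)."""
    return int(np.argmax(Q[s]))

def ttd_update(history):
    """Steps 5-7 of Table 1 applied to the last (at most M) transitions.

    `history` is a list of (s, a, r, s_next) tuples, newest first.
    """
    if not history:
        return
    s0, a0, r0, s1 = history[0]
    z = r0 + GAMMA * np.max(Q[s1])                     # k = 0
    for (_, _, r_k, s_k1) in history[1:]:              # k = 1 ... m-1
        u_k = np.max(Q[s_k1])
        z = r_k + GAMMA * (LAM * z + (1.0 - LAM) * u_k)
    s_old, a_old, _, _ = history[-1]                   # oldest recorded pair
    Q[s_old, a_old] += BETA * (z - Q[s_old, a_old])    # delta = z - Q
```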

2. Description of our proposal

We propose a learning strategy which combines reinforcement learning (RL) and a genetic algorithm (GA). Reinforcement learning is used to memorize all the experiences the robot goes through: whenever the robot is being moved, the Q-values are updated, reflecting the consequences of executing every possible action in every possible state. The genetic algorithm is used to find sequences of actions (solutions) to those situations where the robot does not know how to move. Henceforth we will refer to these situations as local problems, as they represent particular configurations of the environment around the robot in which the greedy policy is unable to move the robot without it receiving a sequence of negative reinforcements.

Basically, when our proposal is applied, the robot goes through three cyclical and clearly differentiated stages: (a) looking for a new starting position or convergence, (b) exploration, and (c) generation of a new population of solutions (chromosomes) for the next exploration stage. During the first stage the robot uses the greedy policy to move according to what it has learnt so far. When the robot finds a new situation where it does not know how to move, i.e. a local problem (Fig. 1(a)), it applies a genetic algorithm to find a local solution (Fig. 1(b)). Once the solution has been achieved, a new population of chromosomes is generated and the robot moves again using the greedy policy until a new local problem is detected (Fig. 1(c)).

3. How a GA improves RL

3.1. Solving local problems

Our strategy applies a GA [3,4] to solve each local problem the robot finds. The GA starts with a population of solutions called chromosomes. Each chromosome, represented as πi, i = 1, ..., chromosome population size, determines the action the robot has to carry out in each state s: πi(s). As we can see in Fig. 1(b), the first population of chromosomes is random: the actions proposed for each state are selected randomly. The population of chromosomes is then evaluated according to an objective function, called the fitness function, which reflects for how long a chromosome is able to properly control the movement of the robot. Fig. 1(b) shows the movement of the robot while a population of chromosomes is being evaluated: the starting position of the robot is always the same, and once the robot receives a sequence of negative reinforcements it goes back to the common starting position, where the next chromosome is then used to control it.

Once the population of chromosomes has been evaluated, a new population has to be obtained from the previous one according to the fitness values: the better a chromosome seems to be, the higher the probability that it is selected for the new population. In order to keep some randomness in the process, certain genetic operators have to be applied, such as mutation, which carries out random changes in the chromosomes, or crossover, which combines chromosomes to generate new solutions.


Fig. 2. Representation of our proposal to combine GA with RL.

The evolution of the chromosome population allows the robot to find a local solution after facing the same environment configuration several times. This would not be possible with other methods merely aimed at building random sequences of actions.

The fact that the GA decides the actions the robot must carry out during the exploration phase represents a strong influence of the GA on the RL, as the Q-values being updated are those corresponding to the actions the robot executes. Genetic algorithms also help RL with generalization and selectivity [5], since the knowledge about bad decisions is not explicitly preserved.
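The following Python sketch illustrates the kind of genetic algorithm described above: each chromosome maps the 220 states to actions, fitness reflects for how long a chromosome properly controls the robot, and fitter chromosomes are more likely to be selected as parents. The number of actions, the mutation rate and the use of simple one-point crossover are assumptions; the Q-value-biased operators and the inclusion of the greedy policy, described in Section 4, are omitted here.

```python
import random

N_STATES, N_ACTIONS = 220, 8          # assumed sizes (220 states as in Section 5)
POP_SIZE, RANDOM_CHROMOSOMES = 20, 6  # values reported in Section 5.2

def random_chromosome():
    """A chromosome assigns one action to every state: pi_i(s)."""
    return [random.randrange(N_ACTIONS) for _ in range(N_STATES)]

def next_generation(population, fitness):
    """Build a new population from the evaluated one.

    `fitness[i]` is assumed to measure for how long chromosome i kept the
    robot free of a sequence of negative reinforcements (obtained by running
    the robot from the common starting position; not reproduced here).
    """
    new_population = [random_chromosome() for _ in range(RANDOM_CHROMOSOMES)]
    while len(new_population) < POP_SIZE:
        # Fitness-proportional selection of two parents.
        p1, p2 = random.choices(population, weights=fitness, k=2)
        cut = random.randrange(1, N_STATES)            # one-point crossover
        child = p1[:cut] + p2[cut:]
        for s in range(N_STATES):                      # plain random mutation
            if random.random() < 0.05:                 # assumed mutation rate
                child[s] = random.randrange(N_ACTIONS)
        new_population.append(child)
    return new_population
```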

3.2. Injection of prior knowledge

There are many situations where some kind of knowledge about the task to be learnt is available and can speed up the learning procedure. In our learning framework (Fig. 2), this prior knowledge about the task is projected onto a set of independent blocks, B1, . . . , BJ, which constitute the Support Module (SM). The design of each of these blocks can follow different strategies (neural networks, fuzzy logic, classical algorithms, etc.), depending on the complexity of the knowledge included in each of them. It is important to notice that the blocks included within the support module do not have to use the same representation of the environment (mapping function) as the reinforcement learning algorithm. We call advice the information supplied by each block; it suggests the exploration of one or more actions. To include the advice coming from the support module, the initial set of possible actions A is conveniently augmented so that any chromosome can suggest the execution of the advised actions:

A* = A ∪ {advice(B1), . . . , advice(BJ)}    (2)

where A represents the initial set of possible actions.
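As a small illustration of Eq. (2), and assuming that every block of the support module exposes an advice(state) function returning the actions it suggests, the augmented action set could be built as in the following sketch; the interface is hypothetical, not taken from the paper.

```python
def augmented_action_set(base_actions, support_module, state):
    """Eq. (2): A* = A united with the advice of every block B_j.

    `base_actions` is the initial action set A; `support_module` is assumed
    to be a list of blocks, each exposing an advice(state) method that
    returns one or more suggested actions.
    """
    advised = set()
    for block in support_module:
        advised.update(block.advice(state))
    return set(base_actions) | advised
```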

4. How RL improves GA

We assume that the Q-values can be used to bias the genetic operators and thus reduce the number of chromosomes required to find a solution. Therefore the mutation probability could be high for those policies and states where the selected action is poor according to the Q-values, and the crossover between two policies could take place in such a way that the exchanged actions have similar Q-values, etc. Nevertheless, to cope with non-static environments and keep a suitable ratio of exploration, a small percentage of the chromosomes must always be random.

In our proposal, the probability that mutation changes the action that a chromosome π suggests for state s, π(s), depends on how many actions look better or worse, according to the Q-values, than the one suggested by the chromosome:

N1 = cardinality{a_j | Q(s, a_j) ≥ Q(s, π(s))}
N2 = cardinality{a_k | Q(s, a_k) < Q(s, π(s))}
P_mutation(π(s)) = N1 / (N1 + N2)

N1 represents the number of actions that look better than π(s), and N2 the number of actions that look worse than π(s). P_mutation(π(s)) establishes the probability that the mutation changes the action suggested by the policy π in state s. Nevertheless, if the mutation is going to be carried out, there is still great uncertainty about which action should be selected instead of π(s). In our case, instead of a random selection, every action a_i is picked with a probability P_s(s, a_i) that depends on the Q-values:

P_s(s, a_i) = exp(Q(s, π(s))/Q(s, a_i)) / Σ_j exp(Q(s, π(s))/Q(s, a_j)).    (3)

To understand Eq. (3), it is important to bear in mind that Q(s, a_i) ≤ 0, ∀ a_i.

Finally, one of the chromosomes should always be the greedy policy, as it brings together everything the robot has already learnt and represents the best chance of a fast convergence towards the desired solution.
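A minimal sketch of these Q-value-biased operators is given below; it assumes a tabular Q with strictly negative entries (consistent with the negative initialization described in Section 1) and only covers mutation, not the Q-value-guided crossover.

```python
import math
import random

def mutation_probability(Q_s, suggested):
    """P_mutation(pi(s)) = N1 / (N1 + N2): N1 counts the actions whose
    Q-value is at least as good as the suggested one, N2 the worse ones."""
    q_sel = Q_s[suggested]
    n1 = sum(1 for q in Q_s if q >= q_sel)
    n2 = sum(1 for q in Q_s if q < q_sel)
    return n1 / (n1 + n2)

def biased_replacement(Q_s, suggested):
    """Eq. (3): pick the replacement action with probability proportional
    to exp(Q(s, pi(s)) / Q(s, a_i)); assumes strictly negative Q-values."""
    q_sel = Q_s[suggested]
    weights = [math.exp(q_sel / q) for q in Q_s]
    return random.choices(range(len(Q_s)), weights=weights, k=1)[0]

def mutate(Q, chromosome):
    """Apply the Q-value-biased mutation to every gene of a chromosome."""
    mutated = list(chromosome)
    for s, action in enumerate(chromosome):
        if random.random() < mutation_probability(Q[s], action):
            mutated[s] = biased_replacement(Q[s], action)
    return mutated
```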

4.1. Injection of prior knowledge

The presence of the Q-values is also useful for estimating when it is advisable to consider the information provided by the SM:


P_SM(s) = ( max_{a ∈ A*−A} {Q(s, a)} − min_{a ∈ A*} {Q(s, a)} ) / ( max_{a ∈ A*} {Q(s, a)} − min_{a ∈ A*} {Q(s, a)} )

A* is the augmented set of actions described in Eq. (2), and P_SM(s) is the probability that one of the actions proposed by the SM is selected as the action to be taken in state s ∈ S when a new random chromosome is constructed.
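Assuming a dictionary Q_s that maps every action in the augmented set A* to its Q-value in state s, the probability above could be computed as in the following sketch; the dictionary interface and the handling of degenerate cases are assumptions.

```python
def support_module_probability(Q_s, base_actions):
    """P_SM(s): probability of picking an advised action when a new random
    chromosome is built for state s.

    `Q_s` is assumed to map every action of the augmented set A* to Q(s, a),
    and `base_actions` is the initial set A, so the advised actions are
    those in A* - A.
    """
    advised = [a for a in Q_s if a not in base_actions]    # A* - A
    q_values = list(Q_s.values())
    q_min, q_max = min(q_values), max(q_values)
    if not advised or q_max == q_min:
        return 0.0                                         # degenerate cases
    best_advised = max(Q_s[a] for a in advised)
    return (best_advised - q_min) / (q_max - q_min)
```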

5. Experimental results

We applied both the reinforcement learning algorithm TTD(λ, m) and the GA + RL mechanism we propose to teach a robot a common task: “wall following”. The robot must learn how to follow a wall located on its right.

The robot we used is a Nomad 200 equipped with 16 ultrasound sensors encircling its upper part and with bumpers. To learn this task the robot receives a reinforcement which penalizes those situations where the robot is too close to or too far from the wall being followed (r < 0); otherwise r = 0. The finite set of states through which the environment around the robot is identified is the same regardless of the learning paradigm used. In particular, we used a set of two layered Kohonen networks to translate the large number of different situations that the ultrasound sensors located on the front and right side of the robot may detect into a finite set of 220 neurones, i.e. states [6]. Thus, at every instant, the environment surrounding the robot is identified through one of these 220 neurones. In order to train the Kohonen neural networks we used a set of sensor readings collected while the robot was moved close to a wall; therefore these networks are able to recognize the most frequent situations in a wall-following behaviour.
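A minimal sketch of this kind of reinforcement signal is shown below; the distance thresholds and the reinforcement magnitude are assumptions, since the paper only states that r < 0 in the penalized situations and r = 0 otherwise.

```python
def wall_following_reinforcement(distance_to_wall, too_close=0.3, too_far=1.0):
    """Negative reinforcement when the robot is too close to or too far from
    the wall being followed, and r = 0 otherwise. The thresholds (in metres)
    and the value -1.0 are illustrative assumptions."""
    if distance_to_wall < too_close or distance_to_wall > too_far:
        return -1.0
    return 0.0
```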

To run the simulated experiments, standard laptop computers with Pentium IV processors (about 2 GHz) and the Nomad 200 robot's simulator were used.

5.1. Application of TTD(λ, m)

We first learnt the robot controller using the TTD(λ, m) learning algorithm described in Section 1. The linear velocity of the robot was kept constant in all the experiments (15.24 cm/s). To determine the most suitable values of the learning parameters, we analysed the time required to learn the task as they were changed. In particular we tried β = {0.25, 0.35, 0.45, 0.55}, λ = {0.4, 0.5, 0.6, 0.7, 0.8}, and m = {10, 20, 30, 40, 50}. The best combination of parameters we found is β = 0.35, λ = 0.8, γ = 0.95 and m = 30. We used this combination in all the experiments described below, whether RL or our GA + RL proposal was applied.

In order to determine the action to be executed at every instant (step 2 in the TTD(λ, m) algorithm, Table 1), we used a strategy based on the Boltzmann distribution. According to this strategy, the probability of executing action a* in state s(t) is given by the following expression:

Prob(s(t), a*) = exp(Q(s(t), a*)/T(s(t))) / Σ_a exp(Q(s(t), a)/T(s(t))),    (4)

where T(s(t)) > 0 is a temperature that controls the randomness. Very low temperatures cause a nearly deterministic action selection, while high values result in random performances. As there is a significant difference in the relative frequency of the 220 possible states, the temperature value is independent for each of the states: T(s). Each time the robot is faced with a particular state s, its temperature is reduced by a constant ratio: T_{n+1}(s) = 0.999 T_n(s).
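A small sketch of this exploration strategy, with one temperature per state decayed by the factor 0.999 at every visit, might look as follows; the tabular Q, the per-state temperature table T and its initial values are assumptions.

```python
import math
import random

def boltzmann_action(Q_s, temperature):
    """Eq. (4): sample an action with probability proportional to
    exp(Q(s, a) / T(s)). Subtracting the maximum Q-value keeps the
    exponentials numerically stable without changing the distribution."""
    q_max = max(Q_s)
    weights = [math.exp((q - q_max) / temperature) for q in Q_s]
    return random.choices(range(len(Q_s)), weights=weights, k=1)[0]

def select_and_cool(Q, T, s, decay=0.999):
    """Pick an action for state s and decay its own temperature,
    T_{n+1}(s) = 0.999 T_n(s), each time the state is visited."""
    action = boltzmann_action(Q[s], T[s])
    T[s] *= decay
    return action
```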

When the greedy policy is able to properly control the robot for an interval of 10 min, convergence is detected and the learning procedure is stopped.

Finally, to compare the movement of the robot when different robot controllers were used, the robot received a command with the action to be carried out every 300 ms. The robot was taught how to perform the task in a simulated training environment, but the performance of the robot controller was tested not in the training environment but in a different one, the testing environment. After 6 experiments the average learning time was 97 min and 30 s.

5.2. Application of GA + RL

When we applied our proposal, the first thing we noticed was that it proved to be much faster than the TTD algorithm. After 19 experiments, the average learning time was only 25 min and 9 s. There are 20 policies (chromosomes) being evaluated and evolved; 6 of them are random policies, and one of the remaining ones is the greedy policy. Convergence is detected and the learning procedure is stopped when the greedy policy is able to properly control the movement of the robot for an interval of 10 min.

The movement of the real robot in the real and noisy environment shown in Fig. 3 proves that the behaviours learnt through our GA + RL proposal are useful.

To understand how the use of a GA shortens the learning time, we can compare one of the classical exploration strategies, such as the one based on the Boltzmann distribution, Eq. (4), with the GA (Section 3.1). When the Boltzmann-based exploration strategy is used, at the early stages of the learning process the Q-values are nearly random and so are the actions taken by the robot (Eq. (4)). The degree of randomness is very high and the convergence of the Q-values is very slow. Under the same circumstances, i.e. at the early stages of the learning process, but when the GA is used, we have noticed that whenever a chromosome makes the robot behave much better than the rest of the chromosomes in the same population, there is a high probability that similar sequences of actions are tried in the next generation of chromosomes. There is an important reduction in the degree of randomness, which allows a fast convergence of the Q-values. On the other hand, once the learning process is almost over, when the Q-values are almost learnt and the temperature is very low, causing almost deterministic actions (Eq. (4)), if the robot finds an environment where the greedy policy proves to be unable to move the robot, the RL system would be trapped and unable to change the Q-values. When the GA is used instead, this does not happen so easily, as the percentage of random chromosomes in the initial population would trigger the search for new solutions again.


Fig. 3. Real robot's trajectory in a noisy and real environment when a control policy learnt through GA + RL is used to control the robot's movement. Points A and B in the graph are the same; the misalignment is due to the odometry error. The small dots in the graph correspond to the ultrasound readings.

Fig. 4. Rules which summarize the knowledge about the wall following behaviour.


5.3. Using prior knowledge about the task

In this work we claim that our proposal allows a straightforward injection of prior knowledge about the task. In order to show some preliminary results which prove this, we defined a very simple set of fuzzy rules which summarize some knowledge about the wall following behaviour.

In particular, we implemented a very simple fuzzy controller which tries to make the robot follow a lateral wall at a certain distance, which we denote the threshold distance. The input linguistic variables we use are: the distance to the lateral wall, expressed as an increment in relation to the threshold distance (INCR THRESHOLD); the angle that the forward movement of the robot makes with the aforementioned wall (ANGLE); and finally, the distance to the nearest wall detected with the frontal sensors (DIST FRONT). As the output linguistic variable we make use of the angular velocity (ANG SPEED). A very simple set of rules (Fig. 4) and some basic membership functions were used to approximate the desired behaviour.

In order to see how the quality of the advice may influence the stability of the learning process or the time it requires, we used two different pieces of advice:

Advice1 = α1 θfuzzy;    Advice2 = α2 θfuzzy

where α1 = 1, α2 = 1.7, and θfuzzy is the angular velocity proposed by the fuzzy system at every instant. Although the actions advice1 proposes are suitable for some of the states, it is not able to make the robot move properly at the required velocity. Advice2 amplifies the angular velocities proposed by the fuzzy controller in such a way that its performance is significantly improved; nevertheless, there are certain situations (like concave corners) where the actions advice2 suggests are not adequate according to the reinforcement values.

When advice1 is combined with our GA + RL approach in the way described in Sections 3.2 and 4.1, the required learning time is reduced to 35.68 min, and 24.12 min were enough to learn a suitable robot behaviour when advice2 was used. The decreasing convergence times reflect how GA + RL is able to properly use the knowledge provided by the fuzzy system.

6. Summary and conclusion

Reinforcement learning is an extremely useful paradigm in which specifying the restrictions of the robot's intended behaviour is enough to start a random search for the desired solution. Nevertheless, a successful application of this paradigm requires a good exploration strategy, able to find a solution without a deep search through all the possible combinations of actions. We propose a combination of genetic algorithms and reinforcement learning in which the mutual influence of both paradigms strengthens their potential and corrects their drawbacks. Genetic algorithms reach local solutions faster than reinforcement learning, whereas reinforcement learning is able to combine and integrate all the local solutions found through the genetic algorithms. As a further advantage, our proposal also enables us to incorporate prior knowledge of the task to be solved, as well as certain objectives of interest.

Our experimental results after applying our proposal to a particular problem in mobile robotics clearly confirm that, through a combination of GA and RL, the learning process is focused, achieving rapid convergence to the desired solution.


We also combined a very simple fuzzy controller with our GA + RL approach. The results obtained in this case, although preliminary, confirm the interest of the strategy we propose, as the learning time is reduced and the stability of the process is also increased. This happens despite the fact that the fuzzy controller is unable to solve the task on its own at the required robot velocity.

Acknowledgements

The authors thank the support received through the research grants PGIDIT04TIC206011PR, TIC2003-09400-C04-03 and TIN2005-03844.

References

[1] G. Konidaris, A. Barto, An adaptive robot motivational system, in: From Animals to Animats 9: Proceedings of the Ninth International Conference on Simulation of Adaptive Behaviour, LNAI, vol. 4095, Springer-Verlag, 2006.

[2] P. Cichosz, Reinforcement learning by truncating temporal differences, Ph.D. Thesis, Department of Electronics and Information Technology, Warsaw University of Technology, 1997.

[3] Y. Davidor, Genetic Algorithms and Robotics: A Heuristic Strategy for Optimization, World Scientific, 1991.

[4] H.G. Cobb, J.J. Grefenstette, Genetic algorithms for tracking changing environments, in: Proceedings of the Fifth International Conference on Genetic Algorithms, 1993.

[5] D.E. Moriarty, A.C. Schultz, J.J. Grefenstette, Evolutionary algorithms for reinforcement learning, Journal of Artificial Intelligence Research 11 (1999) 241–276.

[6] R. Iglesias, C.V. Regueiro, J. Correa, S. Barro, Improving wall following behaviour in a mobile robot using reinforcement learning, in: ICSC International Symposium on Engineering of Intelligent Systems, EIS'98, 1998.

Miguel A. Rodríguez received the B.S. and Ph.D. degrees in Maths from the University of Santiago de Compostela, Spain, in 1975 and 1997, respectively. He is currently an Assistant Professor in the Department of Electronics and Computer Science at the University of Santiago de Compostela, Spain. His research interests focus on control and navigation in mobile robotics and machine learning, mainly reinforcement learning and artificial neural networks.

Roberto Iglesias received the B.S. and Ph.D. degrees in physics from the University of Santiago de Compostela, Spain, in 1996 and 2003, respectively. He is currently an Assistant Professor in the Department of Electronics and Computer Science at the University of Santiago de Compostela, Spain, and an honorary lecturer at the University of Essex, UK. His research interests focus on control and navigation in mobile robotics, machine learning (mainly reinforcement learning and artificial neural networks), and scientific methods in mobile robotics (modelling and characterization of robot behaviour).

Carlos V. Regueiro received the B.S. and Ph.D. degrees in physics from the University of Santiago de Compostela, Spain, in 1992 and 2002, respectively. Since December 1993 he has been an Associate Professor in the Faculty of Computer Science at the University of A Coruña, Spain. His research interests focus on software control architectures, perception, control and navigation in mobile robotics, soft computing and machine learning.

José Luís Correa Pombo received the B.S. and Ph.D. degrees in physics from the University of Santiago de Compostela, Spain, in 1988 and 1993, respectively. After a postdoctoral position, he joined the Department of Electronics and Computer Science at the University of Santiago de Compostela, Spain, as an assistant professor in 1994, and he has been an associate professor there since 2003. His current research interests comprise mobile robotics, mainly robot perception, robot localization and robot learning.

Senén Barro was born in A Coruña, Spain, in 1962. He received the B.S. and Ph.D. degrees (with honours) in Physics from the University of Santiago de Compostela, Spain, in 1985 and 1988, respectively. He is currently a Professor of Computer Science and Rector of the University of Santiago de Compostela. Before joining this University in 1989, he was an Associate Professor at the Faculty of Informatics, University of A Coruña, Spain. His research focuses on signal and knowledge processing (mainly in medical domains), real-time systems, intelligent fuzzy systems, mobile robotics and artificial neural networks (applications and biological modelling). He is editor of five books and author of over 200 scientific papers in these fields.