Towards Teaching a Robot to Count Objects

Towards Teaching a Robot to Count Objects

Julien Vitay

Loria / INRIA

Campus Scientifique, B.P. 239

54506 Vandoeuvre-les-Nancy Cedex, FRANCE

[email protected]

Abstract

We present here an example of incrementallearning between two computational modelsdealing with different modalities: a model al-lowing to switch spatial visual attention and amodel allowing to learn the ordinal sequenceof phonetical numbers. Their merging via acommon reward signal allows anyway to pro-duce a cardinal counting behaviour that canbe implemented on a robot.

1. Context

The constructivist theory of learning (Piaget, 1972,Vygotsky, 1986) states that cognitive developmentrelies on relatively discrete stages, where the in-fant learns new schemes on the basis of formerlyacquired schemes in the previous stage. Two tran-sitions between stages are of particular interest forthe neurobotics community: the acquiring of sen-sorimotor schemes from motor reflexes; the acquir-ing of basic language abilities like semantics fromsensorimotor schemes. In particular for the sec-ond transition, the quite recent discovery of the so-called ”mirror-neurons” in the premotor area F5 ofthe monkey (Rizzolatti et al., 1996) (which respondequally for the execution and the observation of anaction) has lead (Rizzolatti and Arbib, 1998) to ex-plain the acquiring of language via the common ab-stract representation of sensorimotor schemes be-tween the learner and his social environment.

As this indicates that the semantics of an action(either performed or recognized) is linked to its mo-tor preparation, the same seems to be true withthe semantics of an object. The sensorimotor con-tingency theory by (O’Regan and Noe, 2001) statesthat seeing is not building an internal representationof the whole visual information but rather explor-ing via visuomotor schemes (for example saccades)the behaviourally relevant location and ignoring theothers. A striking evidence is given by the ”changeblindness” experiments which showed how the disap-pearance of a massive part of an image can be totallyunreported by a subject if this part were not relevantfor the understanding of the scene.

This idea of using previously acquired sensorimo-tor schemes to learn the semantics of an action oran object is in our view the major issue in au-tonomous robotics: the work done by Aude Billard(Schaal et al., 2003), Luc Steels (Steels, 2003) andJun Tani (Sugita and Tani, 2002) for example en-lights the advantages of that approach compared toclassical artificial intelligence (based only on explicitrepresentations).

In this paper, we present an example of incremen-tal learning of a cognitive ability (counting objectsin a scene) using a previously acquired sensorimo-tor skill (switch of attention on salient targets). Wewill first briefly describe the proposed task and thenpresent the two different models and their merging.

2. Numbering Objects

Interaction of a robot with its environment needsnon-linear and complex computations to achieve asuccessful behaviour. In particular in natural scenes,targets are not always unique: the task “bring methree apples” does not specify which apples are to bebrought. In such a task, a robot would have first todetermine if there is enough apples in the scene: de-termining the size of a set is called cardinality. Whenperforming the task itself, the robot has to know thatthe first apple is followed by the second one and thenby the third: determining the position of an item ina sequence is called ordinality.

The relationship between these two aspects ofnumbering in developmental psychology is not yetclear (see (Brannon and Van de Walle, 2001) for adebate). Young infants (< 2 years) seem to have acardinal ability limited to 3 or 4 items called “subitiz-ing”, but this ability only improves with the acquir-ing of verbal counting (with counting rhymes for ex-ample) at the age of 3.5 or 4 years. It is only whenthey master the verbal sequence “one two three fourfive..” that they are able to tell that four objects areless than five. In other words, they have to make thecorrespondance between the “four” word related tothe verbal sequence and the four objects in front ofthem, what is a cross-modal task (between phono-logical inputs and visual properties).

In this paper we will not consider the subitizing

125

Berthouze, L., Kaplan, F., Kozima, H., Yano, H., Konczak, J., Metta, G., Nadel, J., Sandini, G., Stojanov, G. and Balkenius, C. (Eds.)Proceedings of the Fifth International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems

Lund University Cognitive Studies, 123. ISBN 91-974741-4-2

process and make the assumption that counting ob-jects has to grow on two distinct but parallel abili-ties: the ability to sequentially focus the interestingobjects in the visual scene and the ability to learnthe counting sequence “one two three four...”. In thetwo next sections, we will briefly present biologically-inspired computational models for these two abilitiesand then show how then can be coupled to producecardinal comparisons.

3. A Spatial Attention-Switching

Mechanism

There is no need in a robotic task to analyze ev-ery pixel in the image grabbed by the camera ofthe robot. Important targets or clues are salientlocations on the image, with respect to differentfeatures (colour, orientation, luminosity, movement,etc.) and at different scales. Analysing a visual sceneis then sequentially focusing these salient locationsuntil the correct target is found. Furthermore, thefeatures involved in the computation of salience canbe biased by task requirements.

In (Vitay et al., 2005), we presented a compu-tational model of dynamical attention-switchingbased on the Continuum Neural Field Theory(Amari, 1977, Taylor, 1999), a framework of dy-namical lateral interactions within a neural map.Each neuron is described by a dynamical equationwhich asynchronously takes into account the activ-ity of the neighbouring neurons via a “mexican-hat” lateral-weight function. We pinpointedsome interesting properties of that framework in(Rougier and Vitay, 2005) like denoising and spatio-temporal stability. Combining different neural mapswith the same dynamical equation while playing withafferent and lateral weights, we were then able tomake emerge a sequential behaviour from this com-pletely distributed substrate.

Figure 1: Attention Switching Architecture: empty

arrows represent excitatory connections, round ar-

rows represent inhibitory ones. For details, see

(Vitay et al., 2005).

Figure 1 shows the architecture of the proposedsystem, but its description would need too much neu-rophysiological jargon for the audience of this work-shop. We just need to say that it is composed of foursub-structures:

- a visual representation system fed by the saliencyimage that can filters out noise and allows only onesalient location to be represented in the focus of

attention map.

- a working memory system that enables to dy-namically remember previously focused objects.

- an inhibition mechanism which can move the fo-cus of attention to a new salient location (with theinformation given by the working memory).

- a basal-ganglia-like channel which can control thetime of the switch of attention. The key signal is aphasic burst of dopaminergic activity in the reward

unit.

As a consequence of this distributed and dynami-cal architecture, the serial behaviour that emerges isthe sequential focusing of the different salient pointson the image, without ever focusing twice the sameobject. Moreover, the time of the switch is controlledby the dopaminergic burst in the reward unit. Wewill use that property for the counting task presentedlater. This model has been successfully implementedon a PeopleBot R© robot, whose task was to succes-sively focus with its mobile camera a given numberof green lemons. A nice feature of this model is thatit can work either in the covert mode of attention(without eye movement) or in the overt mode (witheye movements, because of the dynamic updating ofthe working memory with visual information).

4. A Sequence Learning Mechanism

Figure 2: Ordinality Learning Architecture: empty ar-

rows represent excitatory connections, round arrows rep-

resent inhibitory ones.

This system relies on the basal ganglia (BG) ar-chitecture, as summarized by (Hikosaka et al., 2000)and is directly inspired by the model made by(Berns and Sejnowski, 1998). BG are known to be

126

composed of several segregated channels, each ofwhich being involved in a particular functional loopwith the frontal cortex and its corresponding part ofthe thalamus to achieve selection of action. Basically,a thalamo-cortical network (bidirectional excitatoryconnections) is tonically inhibited by the BG output(here the internal segment of the Globus PallidusGPi). The core mechanism of BG is disinhibition,which means that inhibition of the GPi (by the Stria-tum or by the external part of the Globus PallidusGPe) disinhibits the corresponding part of the tha-lamus, thus allowing reciprocal excitation betweenthe thalamus and the cortex. This mechanism hasalready been used in the attention-switching system.

Without giving too much detail, there are two op-posite pathways in the BG architecture: the directpathway that favorizes disinhibition and an indirectone which prenvents disinhibition. The balance be-tween these two pathways is ensured by the dopamin-ergic signal produced by the Substancia Nigra parscompacta (SNc).

This complex architecture with bidi-rectional connections, internal loops anddopaminergic modulation allows to learnsequences and do selection of action(Berns and Sejnowski, 1998, Gurney et al., 2004).We will describe here how our model is able to learnsimple sequences like ordinal numbers (cf. Figure2).

“zero” “one” “two” “three”

Figure 3: The distributed representation of the phoneti-

cal words ’zero’ ’one’ ’two’ ’three’ in the number map.

- We first organized a neural map called number

representation map with phonological inputs repre-senting numbers (zero, one, two, three). The hearingof one of these numbers therefore implies the appear-ance of a bubble of activity in the number map ata given location on the map (Figure 3). Please notethat this coding is not compact and would not scaleto large numerosities.

- We then present successively the four numbersto the BG model with a phasic burst of dopamine inSNc at the time of the switch. As dopamine standsfor a kind of “reward” signal (Schultz et al., 1992),we can justify this by saying that hearing a voice(one’s mother’s voice for example) is intrinsically re-warding.

- The inner dynamics of the system (not describedhere) ensure that the association between the corticalrepresentation of a number and its follower is learnedby the connections between stn and gpe.

After learning, when some stn neurons are activefor the current number (via their cortical inputs),they tend to excite gpe at the location of the nextnumber, which in turn inhibits gpi. This artificiallycreates disinhibition in the thalamus that can be usedto predict the next number.

- A high tonic level of dopamine favorizes the di-rect pathway so that the representation of the cur-rent number in the number map is stable even with-out any phonological input (it is mainly the samereverbatory mechanism as in the attention-switchingsystem).

- A sudden depletion of dopamine leads to an ad-vantage for the indirect pathway that predicts thelocation of the next number: the cortical representa-tion in the number map switches to the next numberwithout any corresponding phonological input, like akind of mental voice.

5. Merging the two Systems

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 50 100 150 200

"reward_trace"

a) b)

Figure 4: a) Timecourse of the reward signal when three

objects are presented in the visual scene. Each depletion

corresponds to a switch of focus of attention. b) Corre-

sponding image.

We briefly showed how the sequence-learning sys-tem could learn to reproduce a phonological sequenceof numbers. After a tonic level of dopamine is appliedin SNc to start the counting task, each dopaminedepletion switches the cortical representation to thefollowing number. If we add a inhibitory connec-tion from the focus map in the attention-switchingmechanism to its reward unit, the system focuseseach salient point in the image once and then stops.The timecourse of the activity of the reward unitis shown in Figure 4 for three objects. It is al-most the same timecourse needed for the SNc ofthe sequence-learning mechanism to reproduce thelearned sequence.

Having noticed this analogy, our idea was to linkthe two systems only by their dopaminergic unit: thereward unit of the attention-switching system be-comes the SNc unit of the other model and then con-trols the restitution of the learned sequence. In otherwords, each time an object is focused, the currentnumber is incremented. At the end, when no moresalient object can be found, the sequence-learning

127

system has stopped on the representation of the ex-act number of objects in the scene. The cooperation

0 0.2 0.4 0.6 0.8

1

0 20 40 60 80 100 120 140 160 180

0 0.2 0.4 0.6 0.8

1

0 20 40 60 80 100 120 140 160 180

0 0.2 0.4 0.6 0.8

1

0 20 40 60 80 100 120 140 160 180

0 0.2 0.4 0.6 0.8

1

0 20 40 60 80 100 120 140 160 180

Figure 5: Activity of four neurons in the number map.

From top to bottom, these neurons respond preferentially

for the words ’zero’ ’one’ ’two’ and ’three’.

between the two systems has been tested and Figure5 shows the evolution of the activity of four neuronson the number map when the reward activity is pro-duced by the attention-switching mechanism. Theseneurons activate sequentially at each dopamine de-pletion so that at end of the visual search the onlyneuron remaining active is the neuron representing“three”. This works with one, two or three visualobjects, but for more objects we would just needmore neurons in the number map thanks to the dis-tributed architecture of the system.

6. Conclusion

We presented here two computational models thatseem in a first view totally independent as they dealwith different modalities (vision and reproduction ofphonetical sequences), but that can cooperate to pro-duce a new behaviour, namely a cardinal countingtask. Our hypothesis was that counting objects in ascene needs to sequentially focus these objects (whatis a motor ability) and to associate this sequencewith the remembered ordinal sequence of numbers.These two models have two separate basal gangliachannels that communicate via a unique dopamin-ergic unit, what is coherent with the diffuse inner-vation of dopamine throughout the Striatum. Thefirst model works in real-time on a robot but for rea-sons of computational cost the merging of the twosystems has only been tested in simulation. Never-theless, there are some problems: the coding of num-bers from phonological inputs is not enough compactand stands for numbers up to ten maybe. We wouldneed another architecture to deal with greater num-bers. The sequence-learning system also learns thesequence offline (without the attention mechanism),it would be an interesting feature if the learning oc-cured online.

References

Amari, S.-I. (1977). Dynamical study of formation ofcortical maps. Biological Cybernetics, 27:77–87.

Berns, G. and Sejnowski, T. (1998). A computationalmodel of how the basal ganglia produce sequences.Journal of Cognitive Neuroscience, 10:108–121.

Brannon, E. M. and Van de Walle, G. A. (2001). The de-velopment of ordinal numerical competence in youngchildren. Cognitive Psychology, 43:53–81.

Gurney, K., Prescott, T. J., Wickens, J. R., and Red-grave, P. (2004). Computational models of the basalganglia: from robots to membranes. Trends in Neu-rosciences, 27(8):453–459.

Hikosaka, O., Takikawa, Y., and Kawagoe, R. (2000).Role of the basal ganglia in the control of purpo-sive saccadic eye movements. Physiological Reviews,80(3):953–978.

O’Regan, K. and Noe, A. (2001). A sensorimotor ac-count of vision and visual consciousness. Behavioraland Brain Sciences, 24:939–1031.

Piaget, J. (1972). The psychology of the child. NewYork: Basic Books, New York.

Rizzolatti, G. and Arbib, M. A. (1998). Lan-guage within our grasp. Trends in Neurosciences,21(5):188–194.

Rizzolatti, G., Fadiga, L., Gallese, V., and Fogassi, L.(1996). Premotor cortex and the recognition of mo-tor actions. Cognitive Brain Research, 3:131–141.

Rougier, N. and Vitay, J. (2005). Emergence of atten-tion within a neural population. Accepted to NeuralNetworks.

Schaal, S., Ijspeert, A. J., and Billard, A. (2003). Com-putational approaches to motor learning by imita-tion. Philosophical Transactions: Biological Sciences(The Royal Society), 358(1431):537–554.

Schultz, W., Apicella, P., Scarnati, E., and Ljungberg,T. (1992). Neuronal activity in monkey ventral stria-tum related to the expectation of reward. Journalof Neuroscience, 12:4595–4610.

Steels, L. (2003). Evolving grounded communication forrobots. Trends in Cognitive Science, 7(7):308–312.

Sugita, Y. and Tani, J. (2002). A connectionist modelwhich unifies the behavioral and the linguistic pro-cesses: Results from robot learning experiments. InStamenov, M. I. and Gallese, V., (Eds.), MirrorNeurons and the Evolution of Brain and Language.John Benjamins.

Taylor, J. G. (1999). Neural bubble dynamics in twodimensions: foundations. Biological Cybernetics,80:5167–5174.

Vitay, J., Rougier, N. P., and Alexandre, F. (2005).A distributed model of spatial visual attention. InWermter, S. and Palm, G., (Eds.), Neural Learningfor Intelligent Robotics. Springer-Verlag.

Vygotsky, L. (1986). Thought and language. MIT Press,Boston.

128

Documents

Towards Teaching a Robot to Count Objects