
Learning by seeing—associative learning of visual features through mental simulation of observed action

Malte Schilling 1,2

1 International Computer Science Institute, Berkeley, CA 94704
2 Center of Excellence 'Cognitive Interaction Technology', University of Bielefeld, D-33501 Bielefeld, Germany

[email protected]

Abstract

Internal representations employed in cognitive tasks have to be embodied. The flexible use of such grounded models allows for higher-level functions like planning ahead, cooperation, and communication. But at the same time, this flexibility presupposes that the utilized internal models interrelate multiple modalities. In this article we present how an internal body model serving motor control tasks can be recruited for learning to recognize movements performed by another agent. We show that—as the movements are governed by the same underlying internal model—it is sufficient to observe the other agent performing a series of movements, and that no supervised learning is necessary, i.e. the learning agent does not require access to the performing agent's postural information (joint configurations). Instead, through the shared underlying dynamics, the mapping can be bootstrapped by the observing agent from the sequence of visual input features.

Introduction

Internal representations are essential in higher-level cognitive tasks. Following the view of embodied cognition, internal models have to be grounded and are therefore nowadays assumed to be directly linked to the action-perception-cycle. Grounded models appear to be a byproduct which originally served a quite specific action and co-evolved in this context (Steels, 2003). But at a later stage, cognition has taken over and the same models could be applied in a more flexible way outside the original context of the grounding action. An example are targeted movements, which can be found even in quite simple lifeforms such as insects. Nonetheless, making a targeted movement presupposes an internal model that allows choosing the correct muscle activation to reach a target which was perceived before in three-dimensional space. The ability to use this model not only in the context of one specific type of movement, but to also use the incorporated knowledge—i.e. how muscle activations and target positions are related—in a broader context appears to be essential for cognition. It is assumed that internal simulation is a key mechanism to recruit internal models for high-level tasks (Hesslow, 2002).

Planning ahead can be understood in this way as using only the internal representation, decoupled from the represented body, to simulate behaviors and predict their consequences. This makes it possible to try out possibly hazardous actions and to evaluate possible alternatives or slight modifications. Findings from fields as diverse as neuroscience, psychology, and the behavioral sciences have contributed over the last years and shaped this view (Jeannerod, 2006). It is now more and more apparent that such a mechanism is at the core of cognition, but also subserves—and is grounded in—action and perception. Perception seems to be shaped by the encoded knowledge, and when perceiving others performing actions, it seems that internal models of one's own body are used (Schacter et al., 2007). Perception tries to fit the perceived input to the representation grounded in motor control in order to make sense of what is perceived in terms of the language of one's own motor system. This becomes even more important in cooperation or communication, as different roles of different subjects require different representations of what is going on.

A central issue is the multimodal nature of the underlying representational system. It appears that our conceptual system is—besides organizing concepts—binding diverse sets of features from different modalities and on different levels of abstraction. But how is it possible to come up with these connections and interrelate multiple modalities? In this paper, we want to address how associations between an internal body model used for motor control and visual representations can be established in an unsupervised way, i.e. only through observing another agent performing body movements, an observing agent can come up with a mapping of the perceived visual features onto its own body model representation (segment orientations). As an example we use a three-segmented arm, and we will first introduce a neural network which allows for motor control and makes targeted movements. In the third section we will explain how this body model can be incorporated into the perception loop and how it can subserve perception, as it provides predictions of the movement and helps to disambiguate or filter noisy input. Even though only visual input is accessible when observing another agent performing movements, we afterwards show in principle that it is possible to come up with a mapping from the visual features onto one's own motor system.

Figure 1: Graphic representation of a three-segmented arm, consisting of upper arm (L1), lower arm (L2), and hand (L3). Vector R points to the position of the end effector (tip of the hand). D1 and D2 describe additional diagonal vectors. The arm is restricted to work in a plane (coronal plane).

The shared governing dynamics are enough to allow bootstrapping this mapping from the sequence of visual features. First results from computer simulations will be presented, indicating that this mapping can be established quickly.

MMC Networks as a Body Model

Mean of Multiple Computation networks are a type of recurrent neural network (Cruse and Steinkühler, 1993; Steinkühler and Cruse, 1998). They are based on the Mean of Multiple Computation (MMC) principle, which allows using known constraints to set up the network instead of training the weights. In principle, the constraints are given as equations which form the attractor space. The network is in this way similar to a self-organizing map, as the constraints are enforced on any input given to the network. MMC networks have been used in the past for diverse kinematic tasks, and we will use a simple example from this domain to explain the principle. The general approach will be illustrated using a three-segmented arm which can be moved around in a plane. The orientation of each segment is described as a two-dimensional vector for illustrative purposes. A joint angle representation, as is usually applied in robotics, can be used as well, and the approach has been extended to three-dimensional movements (Schilling, 2011) and to describing movement dynamics (Schilling, 2009); a hierarchical organization in order to represent complex structures has also been introduced (Schilling and Cruse, 2007). It is important to note that even though the task used here appears quite simple, an analytical solution is not feasible, as there are more degrees of freedom to be controlled than there are in the target space (Bernstein, 1967). The arm is redundant. Therefore, even in this simple example, all the demanding characteristics we face in the control of complex movements, e.g. of a robotic or a human arm, are present.

The manipulator is shown in fig. 1, and kinematic equations describing the arm can easily be set up.

Figure 2: The MMC network consists of two identical networks, one for the x-components (black lines) and one for the y-components (grey lines) of the vectors. The units represent the components of the six vectors L1, L2, L3, D1, D2, and R of the planar arm. Connections with a positive weight are indicated by a black arrowhead and negative weights are shown as black dots. All connections are bidirectional. The example equation $x_{D1} = x_{L1} + x_{L2}$ is shown on the right: the connections between the three nodes on the right encode all the equations derived from it, e.g. $x_{L1}$ is given through $x_{D1}$ and the negative value of $x_{L2}$.

The main idea of the MMC approach is not to compile the kinematic relations into one single equation, e.g. representing the end position of the arm, but to establish a set of local relationships capturing the redundancy of the arm. As illustrated in the figure, additional diagonal vectors are introduced. A local relationship then corresponds to a triangle formed by three vectors; e.g. the first two segments and the first diagonal constitute such a triangle. As each triangle establishes a closed polygon chain, these relationships can be expressed as an equation; for the example above we get $x_{D1} = x_{L1} + x_{L2}$ (and an analogous equation for the y-component). Overall, a set of equations can be compiled following this approach when all possible triangle relationships are constructed. Each vector variable takes part in multiple of these equations. In the next step for setting up the network, for each variable, all equations containing that variable are solved with respect to that variable. In our example, the first segment is contained in one additional equation. Solving these two equations for the first segment variable we get:

$$x_{L1} = x_R - x_{D2}$$
$$x_{L1} = x_{D1} - x_{L2} \qquad (1)$$
Following the Mean of Multiple Computation principle, these multiple computations for one variable are integrated by calculating their mean value. In order to prevent abrupt and fast changes in one equation from affecting the whole process, the weighted old value is usually included in the mean computation as an additional term, which introduces a sort of damping (Makarov et al., 2008).

Figure 3: Application of the network to solve kinematic tasks. Initially, the network is in a stable state reflecting the current configuration of the arm (see fig. 1). In a) it is shown how the net solves the forward kinematic task, i.e. when the segment orientations are known, the end effector position can be computed. In b) the application for solving the inverse kinematic task is shown. A target position is given as input to the network, and the network adjusts the segment vectors accordingly. If an input is given, the corresponding recurrent channels are suppressed (symbolized by the open arrowheads).

As a result, in our example this leads to

$$x_{L1}(t+1) = \frac{1}{d}\left(x_R(t) - x_{D2}(t)\right) + \frac{1}{d}\left(x_{D1}(t) - x_{L2}(t)\right) + \frac{d-2}{d}\,x_{L1}(t) \qquad (2)$$

This set of equations describes the relations between the variables and can be understood as defining the connections of a neural network. The network is shown in fig. 2. As the resulting network is a recurrent neural net, its activation develops over time, and the state of the network can be calculated in an iterative fashion. The encoded constraints enforce this behavior, and the attractor space reflects states fulfilling all the kinematic equations. Obviously, when we give a valid configuration of the arm to the network, all constraints are met and the network is in a stable state (Steinkühler and Cruse, 1998). The interesting cases are those in which we only provide partial information. Acting like a self-organizing map, the MMC net completes the given input pattern into a corresponding activation of the whole network which matches the requirements. In this way the net is able to solve any kinematic problem.
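As an illustration, one relaxation step of such a network can be sketched in Python as follows. This is a minimal sketch, not the original implementation: the function name mmc_step, the dict-based state representation, and the default damping value are our own choices, and the triangle equations for the remaining variables are derived in the same way as equation (1). The normalization step discussed at the end of this section is already included, with the segment lengths of 0.1 units used in the example run below.

```python
import numpy as np

def mmc_step(state, d=5.0, inputs=None):
    """One relaxation step of the linear MMC network (cf. equation 2).

    state  -- dict of 2D numpy vectors for 'L1', 'L2', 'L3', 'D1', 'D2', 'R';
              the same update applies to the x- and y-component, matching
              the two identical subnetworks in fig. 2
    d      -- damping factor; larger d gives slower, smoother relaxation
    inputs -- dict of clamped variables whose recurrent channels are
              suppressed (the open arrowheads in fig. 3)
    """
    s = state
    # two computations per variable (one per triangle relation), each
    # weighted 1/d, plus the damped old value weighted (d-2)/d
    new = {
        'L1': (s['D1'] - s['L2'] + s['R'] - s['D2']) / d + (d - 2) / d * s['L1'],
        'L2': (s['D1'] - s['L1'] + s['D2'] - s['L3']) / d + (d - 2) / d * s['L2'],
        'L3': (s['D2'] - s['L2'] + s['R'] - s['D1']) / d + (d - 2) / d * s['L3'],
        'D1': (s['L1'] + s['L2'] + s['R'] - s['L3']) / d + (d - 2) / d * s['D1'],
        'D2': (s['L2'] + s['L3'] + s['R'] - s['L1']) / d + (d - 2) / d * s['D2'],
        'R':  (s['L1'] + s['D2'] + s['D1'] + s['L3']) / d + (d - 2) / d * s['R'],
    }
    # normalization step: keep the segment vectors at their constant length
    for seg, length in (('L1', 0.1), ('L2', 0.1), ('L3', 0.1)):
        new[seg] = length * new[seg] / np.linalg.norm(new[seg])
    # clamped inputs overwrite the corresponding recurrent results
    if inputs:
        new.update(inputs)
    return new
```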

The forward kinematic problem can be solved in a straightforward way. As an input, the segment vectors are fed into the network (fig. 3 a). The corresponding diagonal and end-effector vectors are approached in a few time steps (depending on the damping factor, i.e. the weight of the recurrent connection). Importantly, the input is applied to the network the whole time and directly sets the input variables.

For the inverse kinematic task, we only give the desired end-effector position as an input to the network (shown in fig. 3 b) after initializing the network with a valid starting configuration.

Figure 4: Solution of the inverse kinematic problem through the linear MMC model. A planar arm with three segments (i.e., one extra DoF) should point to a given position, marked by a cross, starting from an initial configuration. The state of the arm is shown for every second iteration step; a second panel shows the distance of the tip of the hand from the target over time.

Through enforcing the new end effector value onto the network, a disturbance is introduced and the network is no longer in an attractor state. Over time, this activity spreads to all variables. The encoded kinematic constraints enforce that the network settles back onto its solution space: the network relaxes to a stable state in which the target end effector value still holds true and the other variables have adopted corresponding values. Figure 4 shows an example run of the network. Initially, the arm is fully stretched to the right (bright line, end effector position x = 0.3, y = 0, with all segments having an equal length of 0.1 units). For every second iteration step, the current configuration of the arm is shown (dashed grey line), until the 25th iteration, in which the arm has reached the target position (x = 0, y = 0.2, drawn as a solid dark line).
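Using the sketch above, this run can be reproduced in a few lines; the initial state and the clamped target follow the description, while the damping value d=5.0 is an assumption of the sketch:

```python
# initial configuration: arm fully stretched to the right (x = 0.3, y = 0)
state = {'L1': np.array([0.1, 0.0]), 'L2': np.array([0.1, 0.0]),
         'L3': np.array([0.1, 0.0]), 'D1': np.array([0.2, 0.0]),
         'D2': np.array([0.2, 0.0]), 'R':  np.array([0.3, 0.0])}

target = {'R': np.array([0.0, 0.2])}      # desired end effector position
for t in range(25):
    state = mmc_step(state, d=5.0, inputs=target)

# the segments have relaxed to a posture whose tip lies close to the target
print(state['L1'] + state['L2'] + state['L3'])
```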

As can be seen in this example—and as has been shown in the past (Steinkühler and Cruse, 1998)—the MMC network is able to solve the inverse kinematic task in only a few iteration steps. The linear MMC network presented above has one serious drawback, as it allows the variables to change freely: there are no cross connections between the x- and the y-components of the network. As the x- and y-components of the variables can be modified independently, the length of the vectors can change. This is usually unwanted and problematic for the segment variables, which should stay at a constant length. The problem can easily be solved through a normalization step. Even though this introduces non-linearities into the network, it does not disrupt the overall performance (Steinkühler and Cruse, 1998). In this article we apply such a normalization step to the MMC segment variables (not shown in the diagrams) after each iteration step; it could be circumvented entirely when using other representations like a joint angle representation (Schilling, 2011).

Figure 5: Application of the internal model in action (left) and perception (right). The internal model is used in one agent for motor control: given a target vector as input, it comes up with a movement to the target. On the other side, an equal internal model is utilized in another agent during perception. The embedded dynamics of movements allow establishing a connection from visual features to body postures and recognizing postures of another actor when seeing them.

A property one immediately recognizes when looking at the behavior of the network is that the arm moves very fast in the beginning and dramatically slows down later on: the distance to the target decreases exponentially. Biological movements, e.g. human arm movements, are characterized by very different properties (Morasso, 1981). We have therefore introduced MMC networks in the past which incorporate dynamic influences and which fit experimental data for human reaching movements well (Schilling, 2009).

Application of the Body Model in Perception

Internal models are used in motor control; e.g. in reaching tasks, inverse models transform target points into joint positions or muscle activations. The introduced MMC network implements such an internal model of the own body and allows for making targeted movements. But the same internal models have been found active in other tasks, e.g. perception, planning ahead, or communication (Grush, 2004). It appears that internal models are recruited by these other functions. While in this way the utilized internal model is grounded in action, it remains unclear how it can be connected to seemingly quite different tasks. As we want to show in this paper, the underlying organization of the body model provides enough structure (in time) to allow for establishing such connections. We are focusing on the use of an internal body model in the perception of movements. A key question is how humans and even simple animals are able to recognize and understand movements of conspecifics. It has been pointed out that mapping an observed behavior to one's own body model is essential (Decety and Grèzes, 1999). But how can this mapping be established? We want to analyze this relation between perception and motor control by applying our simple body model in perception. As during this learning one only has access to the resulting

perceived visual input, the learning has to take place in an unsupervised manner. The acquisition of such a mapping from seeing someone moving around to one's own movement system therefore seems quite difficult, if not intractable. The main idea in our approach is that the introduction of the body model into the processing chain of perception dramatically simplifies the acquisition of the mapping. Both processes share the underlying body model in our setting, and in this way the dynamic development of both processes is constrained in the same way. We want to show that this is enough to come up with a mapping and how it simplifies finding the mapping.

Figure 5 shows how the two models are connected and how they are incorporated into their respective systems. On the left side, an acting agent is shown. Here the internal MMC network model is used in the same way as explained in the preceding section: a target value is set as an input to the model, and the network approaches a solution and at the same time moves the connected arm. The movement of the arm is perceived by the observing agent on the right side. In a preprocessing step, characteristic visual features are extracted from the visual image. The aim is to correlate the visual features with the assumed body configuration. This has to be done in an unsupervised fashion, as only the evolving visual features are available and the observer has no information on the segment vectors (except in the initial situation, in which a resting position is assumed). But the observing agent can exploit its knowledge about the dynamics of the unknown segment vectors, as these dynamics are shared between both agents and are encoded in the body model. The general idea is that the observing agent tries to hook the body model up to the visual features and close the loop by trying to predict the visual features. The underlying assumption is that the predicted change of the visual features can only be correctly produced by the dynamics of the observer's body model when it is in a similar state as the actor's body model.

Figure 6: Steps when applying the internal body model in perception and how this allows for learning associations to visual features. At first (in a), the visual features at time t are projected onto the current body model activation through the weights w_in, and this projection is learned associatively. After one iteration step of the internal body model (in b), connections w_out back to perception are learned, which associate the new predicted values of the body model with the visual features at time t+1. Initially the connections are random and correlations will occur only by accident; these are strengthened over time, and mappings between the two spaces evolve.

We want to test this assumption for our simple model, and in addition how easily this allows bootstrapping the connection from the body model to the visual features.

The internal model in the observer is used as a predictor. It can be regarded as a hidden, mediating layer of a neural network linked to the visual features. The input layer consists of the visual features at a certain time t, and the first task is to learn projections from these visual features onto the body model (fig. 6 a). After one time step in the mediating body model layer, the activations of the body model should be routed back to the visual features, which now hold new values for time t+1. This mapping should also be learned (fig. 6 b), and as we only have access to the visual features—given as input and output—both mappings have to be learned at the same time. The basic idea is that this is possible because the sequence of observed features is correlated with the body model dynamics. Hebbian-type learning should be sufficient to identify the associations and establish the mapping (Hebb, 1949). Figure 7 shows a different perspective on the whole network: the network is spread out into a three-layer neural network.

Figure 7: Schematic sketch of the network architecture used for learning the input mapping w_in and the output mapping w_out between visual features and body model: an input layer holding the visual features at time t, a hidden layer whose recurrent weights are fixed and set up as a MMC network (the body model), and an output layer holding the prediction of the visual features at time t+1; both mappings are learned by backpropagation. As the network shall be used in the same mode as when used for motor control, the target vector R corresponds to the predicted target position estimated from the known dynamics of the network.

The input layer is given through the visual features at a certain point in time t. The body model constitutes the middle layer. As the dynamics of the two models are essential for establishing a coupling, this network must be driven in the same way as the original network. Therefore, the R vector (shaded in the figure) does not represent the current end effector position, but the target position of a movement. This is unknown to the network, and the network can only observe the current state; but incorporating knowledge about the known dynamics, the end state can easily be estimated (see Schilling, 2009). The weights of the hidden layer are predetermined, as it represents the MMC network, but the activation of this layer is hidden during learning. The output layer represents the predicted output for one time step later (t+1). Here we have simplified the view on the overall architecture, as we introduce this output level for representing the visual features at time t+1. The back projection onto the visual features in the overall framework is more complex, as the function of these connections depends on the context. In perception, the body model is not supposed to re-activate the visual features in general. But in specific cases it would be an advantage to use the prediction, e.g. when part of the movement cannot be observed (the arm might move behind an object and be occluded for a short time). Therefore, it must be possible to use these connections in different ways depending on the context and to inhibit their reactivation during perception. During learning, these connections are essential for correlating the predicted state of the body model with the new visual features. Introducing these new visual features as separate units in an output layer allows us to come up with the simple general structure shown in fig. 7 and to use standard backpropagation-through-time learning (Rumelhart et al., 1986) to learn the two weight matrices at the same time. The feature values from one time step ahead can in this way be used as the target output values.
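A single learning step of this architecture can be sketched as follows, under simplifying assumptions: the hidden body model is taken in its linear form as a fixed matrix M (ignoring the normalization step and the estimation of the target vector R), and the names W_in, W_out, and train_step are ours, not from the original implementation. With the hidden recurrence fixed and linear, the gradients of the one-step prediction error with respect to both mappings can be written down directly; this is a single-step variant of the backpropagation scheme described above.

```python
import numpy as np

def train_step(W_in, W_out, M, v_t, v_next, lr=0.01):
    """One gradient step on the one-step visual prediction error.

    W_in   -- projection from visual features onto the body model state
    W_out  -- projection from the predicted body model state to features
    M      -- fixed recurrent weight matrix of the (linear) MMC body model
    v_t    -- visual features at time t (here: the centroid, 2 values)
    v_next -- observed visual features at time t+1 (the training target)
    """
    h      = W_in @ v_t           # 1. project features onto the body model
    h_pred = M @ h                # 2. one iteration step of the body model
    v_pred = W_out @ h_pred       # 3. predict the next visual features
    err    = v_pred - v_next      # the prediction error drives learning
    # gradients of 0.5*||err||^2; only W_in and W_out adapt, M stays fixed
    grad_out = np.outer(err, h_pred)
    grad_in  = np.outer(M.T @ (W_out.T @ err), v_t)
    W_out -= lr * grad_out
    W_in  -= lr * grad_in
    return W_in, W_out, float(np.sum(err ** 2))
```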

In a preprocessing stage, visual features are extracted from the perceived image of the arm.


We use visual image moments. Image moments (Mukundan and Ramakrishnan, 1998) reflect characteristics of a foreground object in a given image. They capture the statistical regularities of the object pixels and in this way describe shape properties of the foreground object, e.g. size and orientation. The main advantages of image moments are that they provide a descriptive representation and at the same time are inexpensive to compute. They can easily be calculated from a binary pixel-based image with the intensity function I(x, y), where all object pixels are represented as a one and all other pixels have a value of zero:

$$M_{pq} = \sum_x \sum_y x^p y^q \, I(x, y) \qquad (3)$$

Usually a set of image moments of different orders is used, with the order of an image moment given as the sum of the two exponents p and q used in the equation above. The zeroth order moment is a count of the object pixels, and from the first order image moments one can derive the visual center of gravity (COG, the centroid $(\bar{x}, \bar{y})$ of the object):

$$\bar{x} = \frac{M_{10}}{M_{00}}, \qquad \bar{y} = \frac{M_{01}}{M_{00}} \qquad (4)$$

Higher order moments allow computing the orientation and shape properties of the object shown in the image.

We only use the centroid information in our simulations. Using higher order image moments would of course allow for a better reconstruction of the visual image. But the focus of our work is on how the body model contributes to recognizing and tracking the seen arm; relying only on insufficient information emphasizes the contribution of the body model.

The centroid information can be directly calculated from the segment vectors of the moving arm. The overall center of gravity is constituted as the mean of the individual segments' visual COGs (we assume uniform length and width of the segments). The equations describing the segment COGs can be integrated by calculating the mean value (analogously for the y-component):

$$x_{ges} = \frac{1}{3}\left(x_{L1} + x_{L2} + x_{L3}\right) \qquad (5)$$
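As a sketch, this feature extraction can be summarized in a few lines of Python; the function names are ours, and the segment-based variant simply mirrors equation (5):

```python
import numpy as np

def raw_moment(img, p, q):
    """Raw image moment M_pq of a binary image (equation 3)."""
    ys, xs = np.nonzero(img)            # coordinates of all object pixels
    return np.sum(xs**p * ys**q)

def centroid(img):
    """Visual center of gravity from the first order moments (equation 4)."""
    m00 = raw_moment(img, 0, 0)         # zeroth order: object pixel count
    return raw_moment(img, 1, 0) / m00, raw_moment(img, 0, 1) / m00

def centroid_from_segments(L1, L2, L3):
    """The same feature computed directly from the segment vectors,
    following equation (5); assumes uniform segment length and width."""
    return (L1 + L2 + L3) / 3.0
```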

Results

We mainly want to focus on the qualitative result that the network is able to establish input and output connections in such a way that both networks' activities are coupled. We used a simple back-propagation learning rule on a set of initially random weights. Back-propagation is known to depend on the initial configuration and to converge onto local minima; therefore, we started a series of simulations with different initial weights covering the whole space of weights. While the network converged in many simulations, it was often not able to predict the next visual features at all; the network got stuck in a local minimum.

Figure 8: The initial configuration of the three-segmented arm is shown as solid black lines. The 12 targets are shown as white crosses.

In these cases, a constant value was returned, or simply the input value. In the following we concentrate on the other simulations, which were able to successfully predict the next visual features, and examine what the internal model was doing while predicting sequences. In general, the behavior of all these networks was similar, and in the following we use one example simulation series.

The network was trained from an initial arm configuration (shown in fig. 8 with all 12 targets). Both the moving arm and the perceiving arm were initialized in this configuration; this means we assume for the simulations that there is a certain resting posture from which all movements start. There are 12 targets around this resting posture, and we selected 9 for training and 3 for later testing on generalization. The input and the output network (fig. 7) were then set to initial values. The perception network was trained on the visual data which resulted from the movement controlled by the movement network (see equation 5): a target was given to the movement network, and the visual data before one iteration step in the movement control network was used as input to the perception network. The visual features of the arm after the iteration step of the control network had been carried out were then used as the target value for the perception network, which should learn to predict this value from the visual input. A movement lasted 15 iteration steps, and the network was trained on a random order of the 9 training targets for 250 epochs (as mentioned above, the weights are not completely random, as we only cover a subset of the whole weight space here; see also the discussion). The arm did not reach the target during the 15 iteration steps, but as the movement of the classical MMC slows down at the end, we only used the part of the movement containing rich dynamics, i.e. in which the arm is still considerably moving.
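Putting the sketches above together, the training procedure corresponds roughly to the following loop. This is again a sketch under assumptions: mmc_step, centroid_from_segments, and train_step are the hypothetical helpers defined earlier, initial_state is a hypothetical helper returning the resting posture, M is the matrix form of the linear MMC update, and training_targets is assumed to be an array of the 9 training target positions.

```python
rng = np.random.default_rng(0)
W_in  = rng.normal(scale=0.1, size=(12, 2))   # visual features -> body model
W_out = rng.normal(scale=0.1, size=(2, 12))   # body model -> visual features

for epoch in range(250):                      # 250 epochs, random target order
    for target in rng.permutation(training_targets):
        actor = initial_state()               # both agents start at rest
        for t in range(15):                   # 15 iteration steps per movement
            v_t = centroid_from_segments(actor['L1'], actor['L2'], actor['L3'])
            actor = mmc_step(actor, inputs={'R': target})
            v_next = centroid_from_segments(actor['L1'], actor['L2'], actor['L3'])
            W_in, W_out, loss = train_step(W_in, W_out, M, v_t, v_next)
```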

In fig. 9, a movement to an example target is shown for both arms. The body model used in the perception loop follows the leading moving arm; both networks are in good agreement and synchronized. One advantage of the presented approach is that it also extends to movements not shown before, as underlying knowledge about kinematics and body constraints is incorporated in the perception process. Figure 10 shows the behavior of the perception model when the moving arm approaches a novel target.

Figure 9: An example of the perceived arm movement towards the target (0, 0.3). The course of time goes from left to right, top to bottom; shown are snapshots of iterations 0, 5, 10, and 15. In the first panel (top left), the initial configuration is shown in light gray. The moving arm is shown as a dashed line, and the current state of the MMC model used for perception is drawn as a dark grey line.

There was no observable difference between targets used for training and novel targets. During each test run, the moving arm reached out towards one target from the initial configuration during a period of 15 iteration steps. We are not interested in finally reaching the target, as during the last part of the movement the arm moves only slowly in the classic MMC approach; the interesting part for our comparison is the more dynamic starting phase. In general, the two networks converged to their respective endpoints for the final part of the movement. The observing network adopted in all cases a qualitatively similar configuration (as shown in the example, i.e. the segments of both networks are oriented in a similar way). To evaluate differences in the configurations of the network states, we compared the single segment orientations: the difference angle between the segment orientations of the perceived arm and the moving arm was computed for each segment. The mean difference over all segments was 0.125 rad (standard deviation ±0.396 rad). Mostly differences of the last segments were responsible for the high variation. This can be explained by the fact that the orientation of the first segment is weighted very heavily in the computation of the visual features.
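The evaluation measure can be sketched as follows; segment_angle_error is a hypothetical helper comparing the actor's and the observer's states in the dict representation used in the earlier sketches:

```python
def segment_angle_error(state_actor, state_observer):
    """Difference angles (rad) between corresponding segment orientations
    of the moving arm and the perceived arm."""
    errors = []
    for seg in ('L1', 'L2', 'L3'):
        a, b = state_actor[seg], state_observer[seg]
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        errors.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(errors)   # averaging over segments and runs gives the
                              # reported mean difference of 0.125 rad
```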

Discussion

We have shown first results indicating that sharing a common principle organizing movement dynamics is sufficient to bootstrap associations from the internal control network to visual features. After successful learning, the body model is coordinated with the motor control network solely through the simple visual features, which in themselves would not be sufficient to estimate the manipulator configuration. Until now, the simulation results are a first step providing a qualitative finding, and the high variation is also a result of the simple visual features used to describe the postures.

One problem with the presented approach is that the simple back-propagation learning method is not able to converge on its own in all cases, as the method depends on the initial configuration. Therefore, we started a series of simulations with different initial weight configurations (only for the input weights) covering large parts of the weight matrix space. To test that—in the successful cases—the success was not already predetermined through the selection of a suitable weight matrix, we tested the impact of the input network: even in a supervised case, it was not possible to learn the projections of the visual features onto the manipulator variables. The visual features in themselves do not carry enough information to predict the manipulator state. Therefore, the success of the network seems not to be given through the input transformation alone, but depends on the interplay between all the parts. In the future, we want to extend our approach and apply a more powerful learning algorithm (like a least-squares method) which is able to overcome local minima and does not depend on the initial weight configuration. In addition, we want to perform a correlational analysis of the resulting weight matrices.

Other approaches to learning internal models of the body usually apply a supervised learning method. A nice example is the learning of a visual body model by Spranger (Steels and Spranger, 2008), in which a robot performs actions in front of a mirror and starts learning to associate proprioceptive features with the observed visual features. Hoffmann et al. (2010) give a thorough review of other approaches along the same lines and of the integration of other modalities into the body schema in robots.

In the future, we will apply our approach in a real-world robot scenario in which a robot first learns to recognize what another robot is doing when both apply their internal MMC-type body model. The task shall be implemented as a communicative scenario, and in a second step a group of robots shall come up with a shared conceptualization of body postures through performing short interactions (language games; Steels and Belpaeme, 2005). Besides additional preprocessing steps, this would require incorporating more descriptive visual features, such as higher-order centralized visual moments. The implementation on the robots is done in cooperation with the CSL group (Luc Steels, Paris).

Figure 10: Movement to a novel target (0.141, 0.141). The course of time goes from left to right in two rows; shown are snapshots of iterations 0, 5, 10, and 15. In the first panel (top left), the initial configuration is shown in light gray. The moving arm is shown as a dashed line, and the current state of the MMC model used for perception is drawn as a dark grey line.

In the final system, each agent would have developed a conceptual space from a simple grounded internal representation of its own body, which is then multimodal in nature. This internal model and the mapping onto visual features can therefore be utilized in perceiving others making movements and in coming up with conventional symbols agreed on within a population. This would open the door for a simple form of communication and cooperation inside the rules given through the language game.

Acknowledgements

This work was supported by a DAAD grant to Malte Schilling.

References

Bernstein, N. A. (1967). The Co-ordination and Regulation of Movements. Pergamon Press, Oxford.

Cruse, H. and Steinkühler, U. (1993). Solution of the direct and inverse kinematic problems by a common algorithm based on the mean of multiple computations. Biological Cybernetics, 69:345–351.

Decety, J. and Grèzes, J. (1999). Neural mechanisms subserving the perception of human actions. Trends in Cognitive Sciences, 3(5):172–178.

Grush, R. (2004). The emulation theory of representation: Motor control, imagery, and perception. Behavioral and Brain Sciences, 27:377–442.

Hebb, D. O. (1949). The Organization of Behavior. John Wiley, New York.

Hesslow, G. (2002). Conscious thought as simulation of behaviour and perception. Trends in Cognitive Sciences, 6(6):242–247.

Hoffmann, M., Marques, H., Arieta, A. H., Sumioka, H., Lungarella, M., and Pfeifer, R. (2010). Body schema in robotics: a review. IEEE Transactions on Autonomous Mental Development, 2(4):304–324.

Jeannerod, M. (2006). Motor Cognition: What Actions Tell the Self. Oxford University Press, Oxford.

Makarov, V., Song, Y., Velarde, M., Hübner, D., and Cruse, H. (2008). Elements for a general memory structure: properties of recurrent neural networks used to form situation models. Biological Cybernetics, 98(5):371–395.

Morasso, P. (1981). Spatial control of arm movements. Experimental Brain Research, 42(2):223–227.

Mukundan, R. and Ramakrishnan, K. (1998). Moment Functions in Image Analysis: Theory and Applications. World Scientific, London, UK.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation, pages 318–362. MIT Press, Cambridge, MA, USA.

Schacter, D. L., Addis, D. R., and Buckner, R. (2007). Remembering the past to imagine the future: the prospective brain. Nature Reviews Neuroscience, 8(7):657–661.

Schilling, M. (2009). Dynamic equations in MMC networks: Construction of a dynamic body model. In Proceedings of the 12th International Conference on Climbing and Walking Robots and the Support Technologies for Mobile Machines (CLAWAR).

Schilling, M. (2011). Universally manipulable body models—dual quaternion representations in layered and dynamic MMCs. Autonomous Robots, 30(4):399–425.

Schilling, M. and Cruse, H. (2007). Hierarchical MMC networks as a manipulable body model. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2007), Orlando, FL, pages 2141–2146.

Steels, L. (2003). Intelligence with representation. Philosophical Transactions: Mathematical, Physical and Engineering Sciences, 361(1811):2381–2395.

Steels, L. and Belpaeme, T. (2005). Coordinating perceptually grounded categories through language: A case study for colour. Behavioral and Brain Sciences, 28(4):469–489.

Steels, L. and Spranger, M. (2008). The robot in the mirror. Connection Science, 20(4):337–358.

Steinkühler, U. and Cruse, H. (1998). A holistic model for an internal representation to control the movement of a manipulator with redundant degrees of freedom. Biological Cybernetics, 79(6):457–466.