
Deep Forward and Inverse Perceptual Models for Tracking and Prediction

Alexander Lambert, Amirreza Shaban, Zhen Liu and Byron Boots
Institute for Robotics & Intelligent Machines

Georgia Institute of Technology, Atlanta, GA, USA
{alambert6, amirreza, liuzhen1994}@gatech.edu, [email protected]

Abstract—We present a non-parametric perceptual model for generating video frames with deep networks, and provide a framework for its use in tracking and prediction tasks on a real robotic system. This model is shown to greatly outperform standard deconvolutional methods for image generation, producing clear, photo-realistic images. For tracking, we incorporate the sensor model into an Extended Kalman Filter and estimate robot trajectories. As a comparison, we introduce a secondary framework consisting of a discriminative, inverse model for state estimation, and compare this approach to the one using a generative model.

I. INTRODUCTION & RELATED WORK

Several fundamental problems in robotics, such as state estimation, prediction, and motion planning, rely on accurate models that can map state to measurements (forward models) or measurements to state (inverse models). Classic examples include the measurement models for global positioning systems, inertial measurement units, or beam sensors that are frequently used in simultaneous localization and mapping [13], or the forward and inverse kinematic models that map joint configurations to the workspace and vice versa. Some of these models can be very difficult to derive analytically, and, in these cases, roboticists have often resorted to machine learning to infer accurate models directly from data [2, 14, 5, 10].

Despite the important role that forward and inverse models have played in robotics, there has been little progress in defining these models for very high-dimensional sensor data like images and video. This has been disappointing: cameras are a cheap, reliable source of information about the robot and its environment, but the precise relationship between a robot pose or configuration, the environment, and the generated image is extremely complex. A possible solution to this problem is to learn a forward model that directly maps the robot pose or configuration to the high-dimensional perceptual space, or an inverse model that maps new images to the robot pose or configuration.

In this paper, we explore the idea of directly learning forward and inverse perceptual models that relate high-dimensional images to low-dimensional robot state. In particular, we use deep neural networks to learn both forward and inverse perceptual models for a camera pointed at the manipulation space of a Barrett WAM arm. While recent work on convolutional neural networks (CNNs) provides a fairly straightforward framework for learning inverse models that can map images to robot configurations, learning accurate generative (forward) models remains a challenge.

Generative neural networks have recently shown much promise in addressing the problem of mapping low-dimensional encodings to a high-dimensional pixel space [3, 9, 11]. The generative capacity of these approaches is heavily dependent on learning a strictly parametric model to map input vectors to images. Using deconvolutional networks for learning controllable, kinematic transformations of objects has previously been demonstrated, as in [4, 12]. However, these models have difficulty reproducing clear images with matching textures, and have mainly been investigated on affine transformations of simulated objects.

Instead of generating images directly after applying transformations in a low-dimensional encoding, another approach is to learn a transformation in the high-dimensional space of the output. One can then re-use pixel information from observed images to reconstruct new views of the same scene, as proposed in [15]. Here, the authors learn a model to generate a flow-field transformation from an input image-pose pair derived from synthetic data. This flow field is then applied to a reference frame, effectively rotating the original image to a previously unseen viewpoint. The authors also propose using confidence masks to combine multiple flow fields generated from different reference frames; however, these frames are selected randomly from the training data.

In the current work, we propose deep forward and inverse perceptual models for a camera pointed at the manipulation space of a Barrett WAM arm. The forward model maps a 4-dimensional arm configuration to a (256×256×3)-dimensional RGB image, and the inverse model maps (256×256×3)-dimensional images to 4-dimensional configurations. A major contribution of this work is a new forward model that extends the image-warping model in [15] with a non-parametric reference-frame selection component. We show that this model can generate sharp, near photo-realistic images from never-before-seen 4-dimensional arm configurations and greatly outperforms state-of-the-art deconvolutional networks [4, 12]. We also design a convolutional inverse model, and use the two models in tandem for tracking the configuration of the robot from an unknown initial configuration and for predicting future images on the real robot.

II. SENSOR MODEL DESIGN

Tracking and prediction are important tasks in robotics. Consider the problem of tracking state and predicting future images, as illustrated in Fig. 1.


Given a history of RGB-image observations $O_t = (o_i)_{i=0}^{t}$, where $o \in \mathbb{R}^m$, we wish to track the state of the system $x \in \mathbb{R}^n$ over the corresponding sequence $X_t = (x_i)_{i=0}^{t}$. Additionally, given a designated goal image $o_T$, we wish to define a target end state $x_T$. Both of these state-estimation objectives can be accomplished with an inverse sensor model, a parametrized function $g_{\mathrm{inv}} : \mathbb{R}^m \mapsto \mathbb{R}^n$. Having mapped the planning problem into joint space, we can perform online motion planning to acquire a sequence of expected states $\bar{X}_t = (x_i)_{i=t+1}^{T}$.
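To make the two models' roles concrete, the following is a minimal Python sketch of the Fig. 1 pipeline; `inverse_model`, `forward_model`, and `plan_motion` are hypothetical stand-ins for the networks and planner described in the remainder of this section.

```python
def track_and_predict(frames, goal_image, inverse_model, forward_model, plan_motion):
    """Sketch of the Fig. 1 framework (names are illustrative stand-ins).
    frames: observed RGB images (o_0, ..., o_t); goal_image: o_T."""
    # Inverse sensor model g_inv: R^m -> R^n, applied frame-by-frame.
    states = [inverse_model(o) for o in frames]       # tracked states X_t
    x_goal = inverse_model(goal_image)                # target end state x_T
    # Plan a joint-space trajectory from the current state to the goal.
    planned_states = plan_motion(states[-1], x_goal)  # (x_{t+1}, ..., x_T)
    # Forward sensor model g: R^n -> R^m, used to predict future frames.
    predicted_frames = [forward_model(x) for x in planned_states]
    return states, predicted_frames
```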

Fig. 1. The proposed framework makes use of an inverse sensor model (blue) for inference of a state $x_i$ from an observed frame $o_i$. A motion plan can be generated given a simulated model of the system. The resulting trajectory can be mapped into image space using a forward sensor model (green).

A. Forward Sensor Model

We use a generative observation model $g$ to perform a mapping from joint-space values $x_i$ to pixel-space values $o_i$ for a trajectory sequence of future states $\bar{X}_t$. This is performed using image-warping layers as proposed in [15]. Given an input $x_i$ and a reference state-image pair $(x_r, o_r)$, we learn to predict the pixel flow field $\vec{h}(x_i, x_r)$ from the reference image $o_r$ to the image $o_i$. The final prediction is made by warping the reference image $o_r$ with the predicted flow field $\vec{h}$:

$$g(x_i) = \mathrm{warp}(o_r, \vec{h}). \qquad (1)$$

As long as there is a high correlation between the visual appearance of the reference image $o_r$ and the output image $o_i$, warping the reference image results in higher-fidelity predictions compared to models which map the input directly to RGB values [12].
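To illustrate what the warp in Eq. (1) computes, here is a minimal NumPy sketch of bilinear warping of a reference image by a dense flow field. The convention that the flow stores per-pixel sampling offsets into the reference image is our assumption for illustration.

```python
import numpy as np

def warp(ref, flow):
    """Warp reference image o_r with a flow field (cf. Eq. 1).
    ref:  (H, W, 3) reference image.
    flow: (2, H, W) assumed per-pixel (dx, dy) offsets into ref."""
    H, W, _ = ref.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    # Source coordinates in the reference image for every output pixel.
    sx = np.clip(xs + flow[0], 0, W - 1)
    sy = np.clip(ys + flow[1], 0, H - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = (sx - x0)[..., None], (sy - y0)[..., None]
    # Bilinear blend of the four neighboring reference pixels.
    top = (1 - wx) * ref[y0, x0] + wx * ref[y0, x1]
    bottom = (1 - wx) * ref[y1, x0] + wx * ref[y1, x1]
    return (1 - wy) * top + wy * bottom
```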

We propose a variation of this idea that is extremely effective in practice. Our network is comprised of two parts: 1) a non-parametric component, where given a state value $x_i$, a reference pair $(x_r, o_r)$ is found such that $o_r$ has a similar appearance to the output image $o_i$; and 2) a parametric component, where given the same state value $x_i$ and the reference pair $(x_r, o_r)$, the inverse flow field $\vec{h}$ is computed and used to warp the reference image to produce the output image $o_i$. The overall architecture is shown in Fig. 2.

Fig. 2. Forward sensor model architecture. Top: forward parametric branch example. Bottom: complete architecture, containing a Nearest Neighbor (NN) module producing (image, state)-pair outputs. These are used by the individual forward branches to produce a warped reference image.

For the non-parametric component, we take a subset of the training set $\mathcal{F}_p \subseteq \mathcal{F}$ consisting of configuration-image pairs, and store the data in a KD-tree. We can quickly find a reference pair by searching for the stored pair closest to the current joint configuration $x_i$ in configuration space. Since the state space is low-dimensional, we can find this nearest neighbor very efficiently. To prevent over-fitting, we randomly sample one of the k nearest neighbors during the training phase.
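A minimal sketch of this non-parametric lookup, assuming SciPy's cKDTree; the class and parameter names are illustrative, but the random choice among the k nearest neighbors at training time follows the over-fitting remedy just described.

```python
import numpy as np
from scipy.spatial import cKDTree

class ReferenceSelector:
    """Reference-pair lookup over F_p, a stored set of (configuration, image) pairs."""
    def __init__(self, configs, images, k=5):
        self.tree = cKDTree(configs)          # KD-tree over 4-D joint configurations
        self.configs, self.images, self.k = configs, images, k

    def query(self, x, training=False):
        """Return a reference pair (x_r, o_r) for query configuration x."""
        _, idx = self.tree.query(x, k=self.k)
        # Training: sample one of the k nearest neighbors to prevent over-fitting;
        # test time: use the single closest stored pair.
        i = np.random.choice(idx) if training else idx[0]
        return self.configs[i], self.images[i]
```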

For the parametric component, we use the deep neural network architecture depicted in Fig. 2. Given the current and reference state vectors as input, the network is trained to produce a flow tensor with which to warp the reference image (similarly to [15]). The input is first mapped to a high-dimensional space using six fully-connected layers. The resulting vector is reshaped into a tensor of size $256 \times 8 \times 8$, and the spatial resolution is increased from $8 \times 8$ to $2^8 \times 2^8$ using 5 deconvolution layers. Following the deconvolutions with convolutional layers was found to qualitatively improve the output images. Finally, we use the predicted flow field of size $2 \times 256 \times 256$ to warp the reference image. To allow end-to-end training, we use a differentiable image sampling method with a bilinear interpolation kernel in the warping layer [6]. Using an L2 prediction loss, the final optimization objective is

$$\min_W \; L(W) = \sum_i \left\| o_i - \mathrm{warp}\!\left(o_r, \vec{h}_W(x_i, x_r)\right) \right\|^2 + \lambda \|W\|^2 \qquad (2)$$

in which $W$ contains the network parameters and $\lambda$ is the regularization coefficient. Training was conducted using the Caffe [7] library, which we extended to support the warping layer; the ADAM optimization algorithm [8] was used as the solver method.
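The original implementation extends Caffe with a custom warping layer; as a rough modern equivalent under that assumption, the sketch below expresses the differentiable bilinear warp and the data term of Eq. (2) in PyTorch, where `grid_sample` provides the bilinear sampling kernel of [6] and optimizer weight decay would play the role of the $\lambda\|W\|^2$ term.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(ref, flow):
    """Differentiable bilinear warp: ref (B, 3, H, W), flow (B, 2, H, W) in pixels."""
    B, _, H, W = ref.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys)).float().to(ref.device)  # (2, H, W) pixel grid
    grid = base + flow                                   # per-pixel sampling locations
    # grid_sample expects (x, y) coordinates normalized to [-1, 1].
    gx = 2.0 * grid[:, 0] / (W - 1) - 1.0
    gy = 2.0 * grid[:, 1] / (H - 1) - 1.0
    return F.grid_sample(ref, torch.stack((gx, gy), dim=-1), align_corners=True)

# Data term of Eq. (2) on stand-in tensors; gradients flow through the sampler,
# so the flow-predicting network of Fig. 2 can be trained end-to-end.
o_r = torch.rand(1, 3, 256, 256)                         # reference image
o_i = torch.rand(1, 3, 256, 256)                         # target image
flow = torch.zeros(1, 2, 256, 256, requires_grad=True)   # stand-in predicted flow
loss = ((warp_with_flow(o_r, flow) - o_i) ** 2).sum()
loss.backward()
```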

The idea of warping the nearest neighbor can be extended to warping an ensemble of k-nearest-neighbor reference images, with each neighbor contributing separately to the prediction. In addition to the flow field, each network in the ensemble also predicts a $256 \times 256$ confidence map (as shown in Fig. 2). We use these confidence maps to compute a weighted sum of the individual predictions in the ensemble, yielding the final prediction.
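A sketch of this weighted combination; since the paper does not specify how the raw confidence maps are normalized, a per-pixel softmax across the ensemble is assumed here.

```python
import numpy as np

def ensemble_predict(warped, confidences):
    """Combine k warped reference images into one prediction.
    warped:      (k, H, W, 3) images, one per nearest-neighbor branch.
    confidences: (k, H, W) raw confidence maps predicted by each branch."""
    # Normalize confidences across the ensemble at every pixel (softmax assumed).
    c = np.exp(confidences - confidences.max(axis=0, keepdims=True))
    weights = c / c.sum(axis=0, keepdims=True)        # sums to 1 over k at each pixel
    return (weights[..., None] * warped).sum(axis=0)  # per-pixel weighted average
```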

B. Tracking using the Forward Model

A common approach to state estimation is to use a Bayesian filter with a generative sensor model to provide a correction


conditioned on measurement uncertainty. For instance, the forward model described in Section II-A could be used to provide such a state update, potentially allowing a single model to be used for both tracking and prediction. We therefore consider this approach as a suitable comparison to the inverse-forward framework being proposed.

In order to track a belief-state distribution, we can derive an Extended Kalman Filter from the forward sensor model. This is a straightforward process, as network models are inherently amenable to linearization. The correction step is performed by computing the Jacobian $J$ as the product of the layer-wise Jacobians:

$$J = J^{(L)} \times J^{(L-1)} \times \cdots \times J^{(0)}, \qquad (3)$$

where $L$ is the total number of layers in the network. The dimensionality of the observations makes it impractical to compute the Kalman gain $K = \Sigma J^{\top} (J \Sigma J^{\top} + R)^{-1}$ directly. Instead, we use a low-rank approximation of the sensor-noise covariance matrix $R$ and perform the inversion in the projected space:

$$(J \Sigma J^{\top} + R)^{-1} \approx U (U^{\top} J \Sigma J^{\top} U + S_R)^{-1} U^{\top} \qquad (4)$$

where $S_R$ contains the top singular values of $R$ and $U$ the corresponding singular vectors. For simplicity, the state-prediction covariance $Q$ was approximated as $Q = \gamma I_{n \times n}$, with $\gamma = 10^{-6}$. A first-order transition model is used, where the next-state joint-value estimate is $x'_{t+1} = x_t + (x_t - x_{t-1})\Delta t$ for a fixed time step $\Delta t$. We provide an initial prior over the state values with identical covariance, and an arbitrary mean offset from the ground truth.
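A NumPy sketch of this EKF step, assuming the stacked Jacobian $J$ of Eq. (3) and the low-rank factors $U$, $S_R$ of the sensor noise ($R \approx U S_R U^{\top}$) have already been computed; projecting with $U$ keeps every inversion at rank $r \ll m$.

```python
import numpy as np

def ekf_correct(x_pred, Sigma, o, J, U, S_R, forward_model):
    """EKF correction with the projected low-rank inversion of Eq. (4).
    x_pred: (n,) predicted state;      Sigma: (n, n) state covariance.
    o: (m,) flattened observed image;  J: (m, n) layer-wise Jacobian product.
    U: (m, r) top singular vectors of R;  S_R: (r, r) top singular values."""
    P = U.T @ J                                   # (r, n) projected Jacobian
    M = P @ Sigma @ P.T + S_R                     # (r, r), cheap to invert
    K_proj = Sigma @ P.T @ np.linalg.inv(M)       # gain in projected space, (n, r)
    residual = U.T @ (o - forward_model(x_pred))  # innovation projected to rank r
    x_new = x_pred + K_proj @ residual
    Sigma_new = Sigma - K_proj @ P @ Sigma        # (I - K J) Sigma with K = K_proj U^T
    return x_new, Sigma_new

def predict(x_t, x_prev, dt):
    """First-order transition model: x'_{t+1} = x_t + (x_t - x_{t-1}) * dt."""
    return x_t + (x_t - x_prev) * dt
```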

C. Tracking using an Inverse Sensor Model

In order to infer the latent joint states $x_i$ from observed images $o_i$, we can also define a discriminative model $g_{\mathrm{inv}}$. Given the capacity of convolutional neural networks to perform regression on high-dimensional input data, we use a modified version of the VGG CNN-S network [1] containing 5 convolutional layers and 4 fully-connected layers (shown in Fig. 3). The model was trained on $256 \times 256 \times 3$ input images and corresponding joint labels, optimizing an L2 loss on joint values and regularized with dropout layers.


Fig. 3. Inverse sensor model architecture.
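A compact PyTorch sketch of such a regressor, with layer widths following the Fig. 3 labels (conv1-conv5: 96/256/512/512/512; fc6-fc9: 4096/1024/1024); this is an illustration of the design, not the authors' exact VGG CNN-S modification.

```python
import torch.nn as nn

class InverseModel(nn.Module):
    """Illustrative CNN regressor from a 256x256x3 image to joint values,
    loosely following the VGG CNN-S-style layout of Fig. 3 (not the exact net)."""
    def __init__(self, n_joints=4):
        super().__init__()
        widths = [3, 96, 256, 512, 512, 512]              # conv1..conv5 channels
        blocks = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            blocks += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                       nn.MaxPool2d(2)]                   # 256 -> 8 after 5 pools
        self.features = nn.Sequential(*blocks)
        self.regressor = nn.Sequential(                   # 4 fully-connected layers
            nn.Flatten(), nn.Linear(512 * 8 * 8, 4096), nn.ReLU(), nn.Dropout(),
            nn.Linear(4096, 1024), nn.ReLU(), nn.Dropout(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_joints))                    # trained with an L2 loss
    def forward(self, img):
        return self.regressor(self.features(img))
```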

III. EXPERIMENTAL RESULTS

A. Datasets

The experiments were conducted using a Barrett WAM manipulator, a cable-actuated robotic arm. Data was captured

from raw joint-encoder traces and a statically-positioned camera, collecting 640×480 RGB frames at a rate of 30 fps. Due to non-linear effects arising from joint flexibility and cable stretch, large joint velocities and accelerations induce discrepancies between the recorded joint values and the actual positions seen in the images. In order to mitigate this aliasing, joint velocities were kept under 10°/s. This and other practical constraints imposed limitations on obtaining an adequate sampling density of the four-dimensional joint space. As such, the training data was collected while executing randomly generated trajectories using only the first four joints (trajectories were made linear for simplicity). A total of 225,000 camera frames and corresponding joint values were captured, with 50,000 reserved for the nearest-neighbor dataset and the remainder used as training data. Test data was collected for arbitrary joint trajectories with varying velocities and accelerations (without concern for joint flexibility and stretch).

Fig. 4. Sequences of generated images for an example trajectory. The first row corresponds to ground-truth images, the middle row to the predicted output, and the bottom row to the DECONV baseline. Times indicated for t = 0, 20, ..., 80, from left to right.

B. Forward Sensor Model Evaluation

We first examine the predicted observations generated by the proposed forward sensor model (kNN-FLOW). A deconvolutional network (DECONV) similar to that used in [4] is chosen as a baseline, without the branch for predicting segmentation masks (as these are not readily available from RGB data). Qualitative comparisons between the ground truth, the forward-sensor predictions, and the DECONV outputs are depicted in Fig. 4 for a sequence of state-input values at various times along a pair of test trajectories. Detailed textures and features are preserved in the kNN-FLOW predictions, and the generated robot poses closely match the ground-truth images. The DECONV outputs suffer from blurred reconstructions, as expected, and fail to render certain components of the robot arm (such as the end-effector). Quantitative results are shown in Table I, where it is apparent that our model outperforms the DECONV baseline in both mean L1 and RMS pixel error.


RGB Error           Mean L1    RMS
kNN-FLOW (ours)     0.01109    0.023097
DECONV (baseline)   0.03066    0.061374

TABLE I: Forward model baseline comparison.

C. Tracking Task Evaluation

Given a sequence of observations $\{o_0, o_1, \ldots, o_{t-1}, o_t\}$, we wish to infer the corresponding sequence of state values $\{x_0, x_1, \ldots, x_{t-1}, x_t\}$, which includes the current state. This tracking task can be accomplished with a discriminative (inverse) sensor model, which provides deterministic state estimates independently of previously observed frames. An example of tracking accuracy, using the model described in Section II-C, is shown in Fig. 5, which demonstrates frame-by-frame tracking capability within a margin of 3 degrees of the ground-truth data.

We evaluate the accuracy of the EKF tracker on the same test trajectories as the inverse model. A single-trajectory example is shown in Fig. 5 for 5- and 10-degree mean offsets at t = 0, where the ground-truth joint values are identical to those shown for the inverse model. The results demonstrate the ability of the tracker to converge the state estimate to the true trajectory over time, given an initial prior.

Fig. 5. Inferred joint-state values using (a) the inverse model, and (b) the forward sensor model in an EKF update, for a 10-degree offset from ground truth at t = 0.

No latent dynamics model is assumed in the inverse model, and state estimates are produced independently given the currently observed frame. The arm is essentially fully observed in the given dataset, with few instances of self-occlusion. Future work will entail comparing the models on scenes containing partial occlusion.

REFERENCES

[1] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.

[2] Anthony Dearden and Yiannis Demiris. Learning forward models for robots. In IJCAI, volume 5, page 1440, 2005.

[3] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758-2766, 2015.

[4] Alexey Dosovitskiy, Jost Springenberg, Maxim Tatarchenko, and Thomas Brox. Learning to generate chairs, tables and cars with convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

[5] Aaron D'Souza, Sethu Vijayakumar, and Stefan Schaal. Learning inverse kinematics. In Proceedings of the 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 1, pages 298-303. IEEE, 2001.

[6] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017-2025, 2015.

[7] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[8] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[9] Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539-2547, 2015.

[10] Duy Nguyen-Tuong, Jan R. Peters, and Matthias Seeger. Local Gaussian process regression for real time online model learning. In Advances in Neural Information Processing Systems, pages 1193-1200, 2009.

[11] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[12] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Single-view to multi-view: Reconstructing unseen views with a convolutional network. CoRR, abs/1511.06702, 2015.

[13] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics. MIT Press, 2005.

[14] Sethu Vijayakumar and Stefan Schaal. Locally weighted projection regression: Incremental real time learning in high dimensional space. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 1079-1086. Morgan Kaufmann Publishers Inc., 2000.

[15] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. View synthesis by appearance flow. In European Conference on Computer Vision, pages 286-301. Springer, 2016.