

Auton Robot (2014) 37:191–207
DOI 10.1007/s10514-014-9388-x

Sparse pose manifolds

Rigas Kouskouridas · Kostantinos Charalampous · Antonios Gasteratos

Received: 1 March 2013 / Accepted: 25 February 2014 / Published online: 25 March 2014
© Springer Science+Business Media New York 2014

Abstract The efficient manipulation of randomly placed objects relies on the accurate estimation of their 6 DoF geometrical configuration. In this paper we tackle this issue by following the intuitive idea that different objects, viewed from the same perspective, should share identical poses and, moreover, that these should be efficiently projected onto a well-defined and highly distinguishable subspace. This hypothesis is formulated here through the introduction of pose manifolds relying on a bunch-based structure, which incorporates unsupervised clustering of the abstracted visual cues and encapsulates the appearance and geometrical properties of the objects. The resulting pose manifolds represent the displacements between each of the extracted bunch points and the two foci of an ellipse fitted over the members of the bunch-based structure. We post-process the established pose manifolds via l1 norm minimization so as to build sparse and highly representative input vectors characterized by large discrimination capabilities. While other approaches to robot grasping build high-dimensional input vectors, thus increasing the complexity of the system, our method establishes highly distinguishable manifolds of low dimensionality. This paper represents the first integrated research endeavor in formulating sparse pose manifolds, with experimental results providing evidence of low generalization error, thus justifying our theoretical claims.

R. Kouskouridas (B)
Computer Vision & Learning Lab, Department of Electrical & Electronic Engineering, Imperial College London, South Kensington, London SW7 2AZ, UK
e-mail: [email protected]

K. Charalampous · A. Gasteratos
Laboratory of Robotics & Automation, Department of Production & Management Engineering, Democritus University of Thrace, Vasilissis Sophias 12, GR-671 00 Xanthi, Greece
e-mail: [email protected]

A. Gasteratos
e-mail: [email protected]

Keywords Object manipulation · Sparse representation · Manifold modeling · Neural networks

1 Introduction

The performance of any robotic device capable of performing manipulation tasks is directly related to its ability not only to recognize objects, but also to provide accurate estimations of their 3D geometrical configuration. Object manipulation is a very complex procedure involving the accurate estimation of the 6 DoF pose of the testing target, the efficient avoidance of possible obstacles in the arm's path and the design of a grasping strategy for hand(finger)-object alignment (Saxena et al. 2011). Most human operations are related to object manipulation, such as drinking, eating, using a tool, opening a door, driving, etc. Driven by its fundamental importance and its applicability to an abundance of diverse applications, several research endeavors have been directed towards object manipulation (Rasolzadeh et al. 2010; Mason et al. 2011; Lippiello et al. 2011; Oikonomidis et al. 2011; Saxena et al. 2008, 2006) or object searching (Shubina and Tsotsos 2010; Andreopoulos and Tsotsos 2009). In recent years, scholarly activity has focused on solving object recognition and 3D pose estimation through methods involving the construction of large databases of object images, to be compared and matched with similar ones during the testing phase (Detry and Piater 2011; Hsiao et al. 2010; Ferrari et al. 2006; Kouskouridas et al. 2012; Popovic et al. 2010). However, such solutions fail to attract stakeholders' interest since, in most cases, a robot's working environment could well contain unknown objects positioned


randomly. Towards this end, emphasis is placed on designing and implementing novel, robust image understanding techniques that can solve the problem of grasping unknown objects under minimum supervision. Despite the substantial endeavors and certain achievements so far, however, there is no general solution to the manipulation problem that combines large generalization capacity with modest computational complexity.

This paper proposes a generalized solution to the 6 DoF object pose estimation problem, with a view to solving manipulation tasks, assuming an obstacle-free path and a straightforward grasping strategy. We believe that, under these assumptions, our approach represents a more general solution to object grasping that could be easily adopted by any robotic architecture. The proposed Sparse Pose Manifolds (SPM) method aims at solving the problem of unknown object manipulation by unifying 3D pose manifolds and grasping points into a cohesive framework for robot grasping, based on the following intuitive ideas: (a) an object observed under shifting viewpoints holds properties that can be sufficiently modeled and projected onto a well-defined subspace, highly distinguishable among its neighbors; and (b) even totally different objects captured under similar perspectives should share identical 3D pose attributes, leading to similar measurements that can indeed establish highly discriminative manifolds, capable of assigning accurate 3D pose tags to different object models. The proposed pose manifolds, formulated via a sophisticated architecture, stand for a low-dimensional data representation that enables direct modification of the configuration of the robot's wrist according to the estimated grasping points. Unlike contemporary systems requiring extensive supervision and large repositories of object images, our focus is on providing a general solution with large generalization capacity, based on unsupervised learning. Towards this end, we employ manifold modeling procedures that emphasize processing the available visual data in such a way that their projection onto the corresponding subspaces is sparse, compact and highly representative. Figure 1 illustrates the basic ideas upon which the proposed framework is built. Contrary to existing solutions for object manipulation, which require a priori knowledge of the geometry of the object, our method builds distinguishable pose manifolds that categorize identical poses of different objects into the same classes.

Initially, given an image of an object with a specific pose, we derive a feature vector through the proposed ellipse-fitting module. This procedure iterates over the entire dataset, comprising object images belonging to 6 different classes. We then employ l1 norm minimization in order to derive sparse representation vectors and, hence, to enhance the generalization capabilities of our method. As a follow-up step, the derived manifolds are mapped, through supervised learning, to a well-defined and highly representative subspace. The

Fig. 1 The key idea underlying the proposed framework is the establishment of highly representative manifolds that lie on subspaces of low dimensionality. Different objects shot under the same aspects share identical 6 DoF estimations, which are categorized into the corresponding classes to form the pose manifolds. In this figure, these classes are represented by buckets of a specific pose, whilst the ultimate goal of the proposed visual grasping method is to efficiently guide the gripper to grasp an unknown object

ultimate goal of the proposed module is to efficiently estimate the 6 DoF pose of an unknown testing object, with a view to facilitating grasping operations. After the efficient estimation of the pose of the object through the manifold modeling procedure, its geometrical attributes (pose relative to the robot's frame) are given as input to the robot's controller to perform a grasp.

We believe that this work makes a serious contribution to the present state of the art in the following ways:

• The proposed Sparse Pose Manifolds method lays its foundations on a novel module for pose manifold modeling that establishes compact and highly representative subspaces, whilst experimental verification provides evidence for the existence of such manifolds, sufficient to facilitate unknown object grasping.
• Pose manifolds depend on a novel bunch-based architecture (introduced here for the first time) which, unlike previous works (Savarese and Fei-Fei 2007; Hinterstoisser et al. 2007), is able to bypass the part selection process using unsupervised clustering, whilst, by extracting local patches, it encapsulates both the appearance and the geometrical attributes of the objects.
• Extending earlier work reported in Mei et al. (2009, 2011), where 3D pose estimation is limited to cars, we utilize numerous available databases of real objects, further expanded by new artificially generated ones. In addition, compared to previous work, our method offers higher generalization capacity, mainly due to efficient learning based on a large a priori training set containing numerous examples of real and artificial data.


• The established manifolds exhibit low dimensionality, thus avoiding the conventional dimensionality reduction schemes widely employed in several computer vision applications. Moreover, through the minimization of the l1 norm we build sparse and compact manifolds that are highly representative, tolerant to common imaging disturbances (e.g. noise) and superior in terms of discrimination capabilities.
• Comparative evaluation of our work against other related works demonstrates its superiority in 3D pose estimation and object manipulation tasks. Quantitative and qualitative experimental results provide evidence of high 3D pose recovery rates with low generalization error, whilst real-time execution boosts the potential of our work.

This paper is organized as follows. In the following Subsection 1.1 we provide an overview of the related work in the fields of eye-in(to)-hand target manipulation, sparse feature representation and 3D pose estimation. In Sect. 2 we formulate the proposed Sparse Pose Manifolds (SPM) method. Experimental results demonstrating the quality of our work are presented in Sect. 3, while some final notes and conclusions are drawn in Sect. 4.

1.1 Related work

An essential phase in any object classification and pattern recognition task is the feature extraction procedure. Most classifiers attempt to compute a separating hyperplane between the corresponding classes, which has led to a variety of published methodologies (Bishop 2006). Considering the simple two-class problem, in which a linear discriminant function should be computed, the decision hyperplane is g(x) = w^T x + w_0, where w is the weight vector, w_0 is the bias term and x is a sample vector. The representation of an object in a low-dimensional space, aiming to provide more discriminative information, is the main objective of any feature extraction technique, as the initial step towards minimum-error classification and generalization capacity. Feature extraction algorithms are divided into three categories, namely unsupervised, supervised and semi-supervised ones, depending on the provision of the labels of the corresponding samples. The majority of these methods seek projection vectors A = [a_1, a_2, …, a_l] that minimize a particular cost function, following a certain criterion, ultimately supplying an optimized low-dimensional representation of a dataset X = [x_1, x_2, …, x_n]. Regarding the unsupervised methodologies, Principal Component Analysis (PCA) (Jolliffe 1986), in which the optimization function seeks a subspace that maximizes the variance of the data, is considered the most popular. Linear Discriminant Analysis (Duda et al. 2001), on the other hand, another well-known, supervised technique, attempts to

maximize the ratio of the between-class to the within-class scatter matrix. Furthermore, the non-linear extension induced via support vector theory (Schölkopf and Smola 2002), known as the "kernel trick", has led to many kernelized versions of linear approaches, such as Kernel-PCA (Schölkopf et al. 1997) and Generalized Discriminant Analysis (GDA) (Baudat and Anouar 2000). A more recently developed framework is Graph Embedding (GE) (Yan et al. 2007), in which all data samples are considered to form a graph G = {X, Q}, where Q is a similarity matrix incorporating either the geometrical or the statistical relations that prevail in the data set. The latter yields a low-dimensional manifold encapsulating the relations of the initial high-dimensional space. Many known techniques can be classified under GE and, depending on the definition of the corresponding similarity measure, the resulting manifold may be identical to that produced by another algorithm, such as PCA; in other words, GE encapsulates the majority of feature extraction methods. Sparse representation has been introduced in signal processing and pattern recognition via l1 norm minimization, which can be solved using standard linear programming (Chen et al. 2001). The inherent properties that the l1 norm makes available, including discrimination capability and noise tolerance, have drawn the attention of scholars in the computer vision and pattern recognition fields (Qiao et al. 2010). l1 norm minimization has already been exploited in graph techniques (Cheng et al. 2010), dimensionality reduction (Zou et al. 2004; Pang et al. 2010), classification (Wright et al. 2009) and action recognition (Guha and Ward 2012; Wang et al. 2013; Castrodad and Sapiro 2012), leading to optimal results.

Based on the literature, two major camera configurations for vision-based scene understanding are discerned in robotics research. The eye-in-hand and eye-to-hand architectures entail, respectively, mounting the vision sensor(s) onto the robot's end-effector, or installing the cameras separately from the robot but posed so as to observe its entire working space. Such configurations are utilized in several applications providing solutions to the problems of 3D object modeling (Torabi and Gupta 2012; Krainin et al. 2011), visual servoing (Chan et al. 2011; Lippiello et al. 2007) and object manipulation (Agrawal et al. 2010; Kragic et al. 2005; Ben Amor et al. 2012; Hebert et al. 2012; Bohg and Kragic 2009). Srinivasa et al. (2010) proposed a 6 DoF pose estimation technique that shares common ideas with the work of Choi et al. (2008), while incorporating the process of matching images of objects in the test scene against 3D object models of a database constructed offline. Two eye-to-hand cameras allow stereo-based depth analysis which, when combined with data transmitted from the eye-in-hand sensor, is accurate enough for 3D pose measurement of the testing patterns. Moreover, feature extraction is accomplished via the Scale Invariant Feature Transform (SIFT) whilst, for the fine adjustment of the pose parameters,


the problem is formulated as an optimization one and solved by the Levenberg-Marquardt algorithm.

The term pose/grasping manifolds was first coined in Tsoli and Jenkins (2008), where a manifold construction scheme was proposed based on data representing the motion of a human hand during an object manipulation task. However, this approach corresponds to a rather trivial method that primarily builds high-dimensional manifolds, whilst employing conventional dimensionality reduction frameworks for subspace embedding. Further drawbacks of the work presented in Tsoli and Jenkins (2008) are thought to be: (a) the configuration of the arm, obtained through a pose simulation that needs to be provided by the user; and (b) the lack of a solution to the grasping point calculation problem. In contrast, in this paper the term pose manifold is adopted to indicate that there is a direct relation between pose-related data and the desired arrangement of the robot's end-effector.

Mei et al. (2009) have shown that, given a highly representative dataset and an adequate manifold modeling subroutine, the performance of 3D object pose estimation is directly tied to the compactness of the established manifolds. In addition, they demonstrated that part-based architectures, unlike holistic ones, are capable of handling the usual imaging disturbances, such as partial occlusions and background clutter, more efficiently. The key idea underlying part-based methods is the pursuit of points or groups of features that can be efficiently processed to build highly discriminative object models (Berg et al. 2005; Fergus et al. 2007; Leibe et al. 2004; Kouskouridas et al. 2013). According to Savarese and Fei-Fei (2007), the most important advantage of part-based frameworks stems from their ability to produce condensed object models by computing the relations between the parts of an object across several views.

2 Method description

Initially, we collect images from large databases suitable for 3D object pose recovery, containing several objects shot from varying perspectives. Our method utilizes the COIL-100 (Nayar et al. 1996) and CVL (Viksten et al. 2009) datasets, whilst incorporating synthetically rendered data of 3D models available at http://www.evermotion.org/. At this stage we hold a large number of training examples, i.e. images depicting various objects in arbitrary geometrical configurations, along with the corresponding ground truth measurements stored as feature vectors y_i ∈ R^6. The training set was chosen to be broad, since the proposed methodology does not employ an on-line training scheme and, thus, new training data cannot be introduced. The utilized datasets consist of images of objects captured at discrete intervals (e.g. of 5 degrees) in the 6-dimensional pose space,

yet the regressor derives outputs in the continuous space. From every image in the training set we extract the respective SIFT (Lowe 2004) features, which are post-processed by a homography-based RANSAC (Fischler and Bolles 1981) for outlier removal. Through the unsupervised clustering of the abstracted keypoints we define our bunch-based architecture. This is an additional feature representation subroutine, which builds new features considering both the appearance and the geometrical attributes of the training object. As a follow-up step, the locations of the members of the architecture are provided as input to an ellipse fitting process in order to establish the initial pose manifolds. The latter represent the distances of each member of the part-based structure from the two foci of the fitted ellipse. At this point we would like to state that the main goal of the manifold modeling technique is to design a pose model that projects similar poses of different objects onto neighboring regions of the corresponding subspace. To this end, we consulted basic ellipse properties in Euclidean space: a 2D ellipse, as opposed to a circle or a parabola, holds 5 DoF (position, scale, shape and orientation) and thus stands for a five-dimensional manifold.
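The front end of this pipeline, SIFT extraction followed by homography-based RANSAC outlier removal, can be sketched in a few lines. The snippet below is a minimal illustration assuming OpenCV; the function name, the use of a template view of the object and Lowe's ratio-test threshold are our own choices, not details taken from the paper.

```python
import cv2
import numpy as np

def matched_keypoints(template, image, ratio=0.75):
    """SIFT features filtered by a homography-based RANSAC, a minimal
    stand-in for the feature extraction step of Sect. 2. Returns the
    (x, y) locations of the surviving (inlier) keypoints in `image`."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(template, None)
    kp2, des2 = sift.detectAndCompute(image, None)

    # Lowe's ratio test over 2-NN matches.
    good = []
    for pair in cv2.BFMatcher().knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC homography; the inlier mask discards outlier correspondences.
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return dst.reshape(-1, 2)[mask.ravel() == 1]
```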

In the next step the initial manifolds are reprojected, via minimization of the l1 norm, onto a highly representative and sparse subspace with large discrimination capabilities. The resulting training set is denoted as {r_t, y_t}, with r_t ∈ V representing the constructed sparse manifolds (input vectors) and y_t ∈ Y the respective ground truth measurements (output vectors). The process of building the training set is an iterative one that operates over several images of different objects. Our sparse pose manifold modeling method constructs input vectors r ∈ V, which are taken into account during the learning of the Radial Basis Function regressor g, portraying the function φ(w). The latter represents the link that needs to be learnt in order to establish the correct mapping between the input vectors r_t ∈ V and the modeled output vectors y_t = y(r_t; φ(w)) ∈ Y, with φ(w): V → Y. The ultimate goal of the proposed method is to achieve accurate estimations (y*) of the 3D pose of an unknown object o*, represented by testing input vectors r*, in order to facilitate its efficient manipulation. A brief description of the method is illustrated by means of the block diagram of Fig. 2. The proposed sparse pose manifold method lays its foundations on four modules that are of fundamental importance and directly affect the discrimination capability of the established manifolds. These are: (i) a novel, robust bunch-based structure; (ii) the ellipse fitting process and the initial manifold calculation; (iii) the reprojection of the resulting manifolds onto a sparser and more representative subspace; and (iv) the training of the regressor. The above-mentioned modules, all of which are discussed in detail later in this section, contribute to the present state of the art either on their own or in combination and, insofar as the authors are aware, appear for the first time in the literature,


Fig. 2 The figure depicts the two distinct phases of the proposed algorithm, namely the training (continuous line) and the testing (dashed line) ones. The main building blocks of the proposed method are: (a) the collection of labeled datasets dedicated to 3D pose estimation, along with the corresponding ground truth measurements y_t; (b) the module of building the bunch-based architecture through unsupervised clustering of the extracted visual cues; (c) the process of constructing initial manifolds by fitting an ellipse over the members of the bunch-based architecture; (d) the establishment of the highly representative subspace of sparse pose manifolds, accomplished by l1 norm minimization; (e) the training of the RBF-based regressor g from the a priori set {r_t, y_t} with t training examples, in order to estimate the continuous and invertible function φ(w): V → Y, with w representing the vector of adjustable parameters; and (f) the process of executing a manipulation task for the unknown testing example r* of object o*

offering an elegant and efficient method for the manipulation of an arbitrarily placed object.

2.1 Bunch-based architecture

This module entails the subroutine of building a part-based formation that processes the abstracted visual cues of an object in an unsupervised manner. Its most notable feature is that it builds a highly representative topology, capable of modeling the available visual data in such a way as to increase the generalization capacity of the resulting method. We employ a novel feature representation scheme to extract parts of objects that possess both the appearance and the geometrical properties of the target. Although our method shares common ground with the work of Mei et al. (2011), we consider the generalization capabilities of their technique to be rather restricted, owing to their supervised feature extraction process. The latter extracts keypoints through the realization of a Probability Density Function (PDF) associated with the joint distribution of the appearance and geometry properties of the target.

Our bunch-based structure portrays appearance-based characteristics by extracting invariant features of high dimensionality with the SIFT descriptor (Lowe 2004), followed by RANSAC (Fischler and Bolles 1981) for outlier removal. Additionally, the geometrical attributes of the object are aggregated by employing an unsupervised clustering technique over the matched SIFT keypoints. In particular, we treat the clustering problem from a Bayesian perspective by applying the Expectation Maximization (EM) algorithm to a mixture decomposition scheme (please refer to Sect. 3.1 for additional information regarding the number of clusters). An analytical presentation of the mathematical formulation of the proposed part-based module is available in the Appendix. With a view to the reader's better understanding, and in order to clarify the unsupervised clustering approach employed, we present the respective pseudocode in Algorithm 1.

Algorithm 1 Calculate the positions θ_k of the γ clusters
1: Inputs: locations v_ζ, ζ = 1, …, ρ, of the ρ extracted SIFT features of the object o_i with pose q_j
2: Initialize: θ ← θ(0), P ← P(0), ε ← a very small number, τ = 0
3: while ||Θ(τ + 1) − Θ(τ)|| > ε do
4:   Compute: P(b_k | v_ζ; Θ(τ)) ← p(v_ζ | b_k; θ_k(τ)) P_k(τ) / Σ_{k=1}^{γ} p(v_ζ | b_k; θ_k(τ)) P_k(τ), with ζ = 1, …, ρ and k = 1, …, γ
5:   Find θ_k(τ + 1) through the maximization of Σ_{ζ=1}^{ρ} Σ_{k=1}^{γ} P(b_k | v_ζ; Θ(τ)) (∂/∂θ_k) ln p(v_ζ | b_k; θ_k), with respect to θ_k, for k = 1, …, γ
6:   Set: P_k(τ + 1) ← (1/ρ) Σ_{ζ=1}^{ρ} P(b_k | v_ζ; Θ(τ)), with k = 1, …, γ
7:   τ = τ + 1
8: end while
9: Outputs: locations θ_k of the γ clusters over the surface of the object
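In practice, the EM iteration of Algorithm 1 amounts to fitting a mixture of Gaussians to the matched keypoint locations. A minimal sketch using scikit-learn's GaussianMixture as a stand-in for the paper's mixture decomposition scheme (the function name is ours; γ = 8 follows the choice reported in Sect. 3.1):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bunch_members(keypoint_xy, gamma=8, seed=0):
    """EM clustering of the matched SIFT keypoint locations, in the
    spirit of Algorithm 1. Returns the gamma cluster centres theta_k
    that form the bunch-based structure."""
    gmm = GaussianMixture(n_components=gamma, covariance_type="full",
                          random_state=seed)
    gmm.fit(np.asarray(keypoint_xy, dtype=float))
    return gmm.means_  # locations theta_k of the gamma clusters
```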

2.2 Establishing initial non-sparse manifolds

The goal of this module is to establish manifolds of low dimensionality capable of moderately distinguishing similar poses of different objects within the same pose subspace. Towards this end, we build a generalized model that emphasizes


the encoding of viewpoint-related data, invariant to object type and category. The resulting model should be able to project pose-related data onto the respective manifolds and to minimize intra-class variability. The proposed generalized module makes use of fundamental properties of the Euclidean space and its geometrical shapes. Essentially, our model focuses on fitting an ellipse to the members of the bunch-based structure of an object. It should be noted that fitting an ellipse to the contours of the Gaussians (estimated through the EM algorithm) resulted in a pose model that did not project similar poses of different objects onto neighboring regions. Thus, our manifold modeling module employs the implicit conic equation of an ellipse with points (X, Y) in Cartesian coordinates, which takes the form ψ(Δ) = AX^2 + BXY + CY^2 + DX + EY + F = 0. Here Δ = [A, B, C, D, E, F] corresponds to the vector of parameters to be estimated. Furthermore, we formulate the ellipse fitting problem as the minimization of the following generalized least-squares cost function:

Q(Δ) = ||b − ψ||^2 = (1/γ) Σ_{k=1}^{γ} ||A X_k^2 + B X_k Y_k + C Y_k^2 + D X_k + E Y_k + F|| + λ Σ_{N=1}^{5} Δ_N^2        (1)

An object o_i with pose q is represented by the extracted clusters b_k, k = 1, …, γ, whose locations (X_k, Y_k) enter the data term, while N indexes the parameters of the cost function Q. The regularization parameter λ is experimentally set to 0.1, and the regularization term is summed over the first five elements of the vector Δ, excluding F, which is regarded as a bias. Additionally, the minimization of the cost function in Eq. 1 is accomplished via PSO (Eberhart et al. 2001), which stands for an evolutionary extrema estimation through the "social interaction" of a population of particles. We utilize PSO to find the particular particle (atom) in the search space, of dimensionality equal to that of the vector Δ, that holds the best score across all generations. A more detailed analysis of PSO falls beyond the scope of this paper; interested readers may refer to Eberhart et al. (2001) for additional information. We note that utilizing the same regularization weight for different parameters, although definitely suboptimal in most cases (e.g. regularized linear regression), is commonplace in PSO. Additionally, PSO was selected over other optimization techniques, e.g. gradient descent or Newton's method, because it is computationally less expensive and converges faster. Moreover, in most cases gradient descent failed to converge to the global minimum, as opposed to PSO.
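For illustration, a bare-bones particle swarm minimizing the cost of Eq. 1 could look as follows. Note that Q(Δ) as written is trivially minimized by Δ = 0; the sketch therefore fixes the bias term F = −1 as a normalization, which, together with the swarm hyper-parameters, is an assumption of ours rather than a detail stated in the paper.

```python
import numpy as np

def fit_ellipse_pso(points, lam=0.1, n_particles=40, iters=200, seed=0):
    """Minimise the regularised cost of Eq. 1 over the gamma cluster
    centres `points` (a gamma x 2 array) with a minimal particle swarm.
    The trivial minimiser Delta = 0 is excluded by fixing F = -1 (our
    normalisation). Returns Delta = [A, B, C, D, E, F]."""
    rng = np.random.default_rng(seed)
    x, y = points[:, 0], points[:, 1]
    design = np.column_stack([x * x, x * y, y * y, x, y])  # 5 free params

    def cost(p):                            # p = [A, B, C, D, E]
        residual = np.abs(design @ p - 1.0)  # F = -1 moved to the RHS
        return residual.mean() + lam * np.sum(p ** 2)

    pos = rng.normal(scale=1e-3, size=(n_particles, 5))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_cost = np.array([cost(p) for p in pos])
    gbest = pbest[pbest_cost.argmin()].copy()

    for _ in range(iters):                  # standard PSO velocity update
        r1, r2 = rng.random((2, n_particles, 1))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        c = np.array([cost(p) for p in pos])
        improved = c < pbest_cost
        pbest[improved], pbest_cost[improved] = pos[improved], c[improved]
        gbest = pbest[pbest_cost.argmin()].copy()
    return np.append(gbest, -1.0)           # Delta = [A, B, C, D, E, F]
```

With the conic normalized this way, the data term measures how far each cluster centre (X_k, Y_k) is from satisfying the ellipse equation.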

2.3 Build sparse and representative pose manifolds

Given a sample ω ∈ R^m and a matrix D = [d_1, d_2, …, d_n] ∈ R^{m×n}, where D is an overcomplete dictionary with n elements, sparse representation attempts to derive a coefficient vector s ∈ R^n possessing as many zero coefficients as possible, while simultaneously reconstructing ω as a linear combination of the columns of D. This is formally expressed as:

min_s ||s||_0   subject to   ω = Ds        (2)

where ||s||_0 denotes the l0 norm, i.e. the number of non-zero elements of the vector s. However, solving Eq. 2 is an NP-hard problem: the only method guaranteed to find a global solution is to exhaustively try all possible supports of s. Nevertheless, the equivalence between the l0 and l1 minimization problems has been shown to hold under the assumption that the solution s_0 is sparse enough (Tsaig and Donoho 2006). Making use of this theoretical finding, the desired sparse solution s can be extracted via the minimization of the equivalent l1 problem, which is solvable in polynomial time, i.e.:

min_s ||s||_1   subject to   ω = Ds        (3)

Moreover, the strict equality constraint in Eq. 3 has been extended to an error-tolerant scheme, owing to the existence of noise in real-life applications, leading to the more robust form:

min_s ||s||_1   subject to   ||ω − Ds|| ≤ ε        (4)

where ε ∈ R is the error tolerance. Once the data matrix M = [m_1, m_2, …, m_t] ∈ R^{2||b_k||×t} has been extracted, where 2||b_k|| is the number of features and t is the number of samples, the l1 norm minimization is applied to every sample, expressing each one as a linear combination of the remaining data set. Moreover, the minimization criterion is enhanced by the addition of one extra constraint and is formulated as follows:

min_{r_i} ||r_i||_1   subject to   m_i = M r_i,  i = 1, 2, …, t,   1 = 1^T r_i        (5)

For each r_i ∈ R^t the i-th entry equals zero, since the corresponding sample m_i has been removed prior to the minimization. The elements of a vector r_i denote the degree to which the remaining samples participate in the reconstruction of m_i. The reconstruction weights r_i provide significant geometric capabilities, such as invariance to rotations and scalings, while the newly imposed sum-to-one constraint adds consistency under translations. Through this process we reproject the initial pose manifolds onto more compact, highly representative subspaces.
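Since the objective in Eq. 5 is an l1 norm under linear equality constraints, each subproblem can be solved as a linear program via the standard splitting r = u − v with u, v ≥ 0. A minimal sketch with scipy.optimize.linprog (the dense per-sample loop and the function name are ours; any off-the-shelf l1 solver would serve equally well):

```python
import numpy as np
from scipy.optimize import linprog

def sparse_codes(M):
    """Solve Eq. 5 for every sample: each column m_i of M is expressed
    as a sparse, sum-to-one combination of the remaining columns via an
    l1-norm linear program. Returns R with R[i, i] = 0."""
    d, t = M.shape
    R = np.zeros((t, t))
    for i in range(t):
        keep = [j for j in range(t) if j != i]   # drop sample i itself
        Mi = M[:, keep]                          # d x (t-1)
        # Equality constraints Mi r = m_i and 1^T r = 1, with r = u - v.
        A_eq = np.vstack([np.hstack([Mi, -Mi]),
                          np.hstack([np.ones(t - 1), -np.ones(t - 1)])])
        b_eq = np.append(M[:, i], 1.0)
        c = np.ones(2 * (t - 1))                 # minimise sum(u) + sum(v)
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None),
                      method="highs")
        if res.success:
            u, v = res.x[:t - 1], res.x[t - 1:]
            R[i, keep] = u - v
    return R
```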


Fig. 3 The process of building the training set {r_t, y_t} used for the 3D pose estimation module incorporates the division of the labeled datasets into the training and the testing subsets, respectively. The built training set is fed to the Radial Basis Function-based regressor g in order to produce an accurate mapping between the input space V and the output space Y

2.4 Training phase

As described above, the members of the bunch-based architecture (clusters b_k) are fed to the ellipse fitting module for the purpose of efficient manifold modeling. The aforementioned procedure iterates over several images of numerous objects. Along with this repeated process there is a need to cope with the 2D-2D feature correspondence problem, so as to recall patterns that hold highly distinguishable data. We have gathered a large collection of object images, shot every 5°, accompanied by the corresponding ground truth 6 DoF pose measurements stored in the vector y_t. Figure 3 illustrates the process of building the training set {r_t, y_t}, the task of sparse pose manifold modeling and the network simulation.
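As a rough illustration of this training step, the snippet below fits a kernel regressor on placeholder data. scikit-learn's KernelRidge with an RBF kernel is used as a stand-in for the paper's RBF network regressor g, and all array sizes and hyper-parameters are assumptions made for the sake of the example.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Placeholders for the training set {r_t, y_t}: t x (2*|b_k|) sparse
# manifold vectors and t x 6 ground truth poses.
rng = np.random.default_rng(0)
r_train = rng.normal(size=(500, 16))
y_train = rng.normal(size=(500, 6))

# Kernel ridge regression with an RBF kernel, standing in for the
# paper's RBF network: it learns the mapping phi(w): V -> Y.
g = KernelRidge(kernel="rbf", gamma=0.5, alpha=1e-3)
g.fit(r_train, y_train)

r_star = rng.normal(size=(1, 16))   # a testing input vector r*
y_star = g.predict(r_star)          # 6 DoF pose estimate y*
```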

3 Experimental results

In this section we present both qualitative and quantitative results obtained by the proposed method. The effectiveness of the method presented in this paper relies on its efficiency in building highly representative and sparse manifolds, with a view to providing accurate 6 DoF measurements of the geometrical configurations of objects in 3D space. Towards this end, we compared the efficiency of the proposed framework with other state-of-the-art approaches to the 3D object pose recovery problem. More specifically, the methods chosen:

– Incorporate a part-based module for 3D pose estimation, like that in Hinterstoisser et al. (2007) (abbreviated as N3M)
– Emphasize manifold modeling for 3D pose recovery, like that in Mei et al. (2011) (abbreviated as SMM)
– Utilize a common least squares-based benchmark technique (Lowe 1999), also adopted in more recently proposed frameworks (Ma et al. 2011) (abbreviated as LS-based)
– Encompass a conventional dimensionality reduction scheme, as presented in Yuan and Niemann (2001) (abbreviated as PCA-based).

At this point, we wish to point out the critical lack of databases containing images of objects in shifting 3D geometrical configurations together with the corresponding ground truth measurements. Here, we make use of the only available datasets, viz. COIL-100 (Nayar et al. 1996) and CVL (Viksten et al. 2009), whilst we address the issue of the limited number of 3D datasets by expanding our training set with artificially rendered images of a collection of objects available at http://www.evermotion.org/. The training set is essentially comprised of object images from the COIL-100 (Nayar et al. 1996), CVL (Viksten et al. 2009) (images with uncluttered background) and Evermotion (2012) databases. Moreover, in order to derive accurate and representative results, we performed a leave-one-out evaluation with


respect to the classes. Additionally, the efficiency of the proposed technique is further assessed using real images of objects in cluttered environments (Viksten et al. 2009).

3.1 Number of clusters b_k and size of the training set {r_t, y_t}

The constructed sparse pose manifolds, fed to the RBF-based regressor to produce the accurate input-output space mapping, are of 2||b_k|| dimensionality (with ||b_k|| representing the number of members of the bunch-based structure extracted via unsupervised clustering). According to the literature, there are two ways to choose the correct number of clusters: (a) the "elbow" method, where the right number of clusters is taken to be the one producing the maximum discrepancy between two neighboring values of the cost function; or (b) evaluating the clustering based on a downstream purpose (in our case, 6 DoF object pose estimation). We adopted the second approach, since we wish to solve the object pose estimation problem rather than an efficient clustering one. Accordingly, and after exhaustive testing, the number of clusters was set to 8, as this exhibits low generalization error at high performance levels. We should further point out that our method is not necessarily limited to the selection of SIFT or of the mixture decomposition scheme; it is, nevertheless, true that the aforementioned algorithms have proven more effective in representing both the appearance and the geometry of the objects. Moreover, since affine changes in illumination may affect the performance of the descriptor, each feature vector is normalized to unit length so as to preserve invariance under such circumstances. Accurate 3D target pose measurements are provided through the simulation of a regressor whose input vectors are preprocessed by an RBF kernel. The same principle applies here: our system does not depend on the selection of the kernel, even though the RBF provided the most accurate results.
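The unit-length normalization mentioned above is a one-liner worth making explicit; a minimal sketch (the function name is ours):

```python
import numpy as np

def normalize_descriptors(des):
    """Normalize each SIFT descriptor to unit length, preserving
    invariance under affine illumination changes (Sect. 3.1)."""
    norms = np.linalg.norm(des, axis=1, keepdims=True)
    return des / np.maximum(norms, 1e-12)  # guard against zero vectors
```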

Our labeled training set {r_t, y_t} is built through an iterative procedure that constructs t training examples by means of the architecture presented in Sect. 2. Input vectors r_t of 2||b_k|| dimensionality are accompanied by the respective 6-dimensional ground truth estimations y_t corresponding to known object poses. Objects used for training are either placed on a turntable and shot every 5° with changing camera poses (Viksten et al. 2009) or artificially rendered (http://www.evermotion.org/) under varying viewpoints through Blender (2011). In contrast to other related works, we do not utilize conventional dimensionality reduction schemes, e.g. PCA, which are prone to an inevitable loss of information. The size of the training set, without the addition of noisy vectors or of ones depicting partially occluded objects, is [2||b_k|| × 100,000], i.e. 1,000 images/object. Ground truth measurements y_t are determined by storing the camera's rotational and translational parameters along a predefined moving path in 3D space (6 DoF), with a given time step. With the addition of noisy input vectors and partial occlusion-related ones, the size of the training set {r_t, y_t} is [2||b_k|| × 250,000]. This results in a large computational burden even for a state-of-the-art workstation, typically requiring approximately 200 h on the expanded version of the training set. Preliminary experimental results for both un-occluded and partially occluded targets are depicted in Fig. 4.

3.2 Dealing with occlusions

One of the most frequent problems in computer vision is the partial occlusion of objects, with which contemporary frameworks are severely restricted in coping. We address this issue by introducing (a) noisy input vectors and (b) images of objects with artificially generated partial occlusions of various degrees. The percentage of obstruction lies in the range [0–60] %, with a black rectangle of random size being arbitrarily overlaid on the surface of the training object. As shown in Bishop (2006), training with noise approximates Tikhonov regularization, whilst bootstrapping the performance of the learning procedure. In the Appendix we demonstrate how training with noisy input vectors is adjusted adequately in our case. Figure 4 illustrates the visual outcome of the proposed 3D object pose estimation module for non-occluded and partially occluded objects. We deal with cluttered backgrounds by utilizing a contemporary image segmentation technique, such as that presented in Shi and Malik (2000). In order to evaluate the performance of the proposed 3D pose estimation approach under various degrees of partial occlusion, we make use of the criterion stated in Hinterstoisser et al. (2007). According to this particular metric, a measurement of the 3D pose of a target is taken as accurate whenever the sum of the errors of the rotation parameters is less than 5°. As depicted in Fig. 5(a), our sparse pose manifold modeling approach is shown to be more tolerant to partial occlusions (spanning 0 to 95 % coverage) than the N3M (Hinterstoisser et al. 2007), SMM (Mei et al. 2011), LS-based (Ma et al. 2011) and PCA-based (Yuan and Niemann 2001) approaches. We have further tested the accuracy of our method, since both our training set and those of other related works contain images of objects shot every 5°, which might have resulted in overfitting the built models. Figure 5(b) depicts the respective comparative results for a permissible error of 3°, whereas both experimental sets of Fig. 5 show results obtained from simulated networks with over 250 testing examples.
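The artificial occlusions described above can be generated as in the sketch below; the square-root sizing of the rectangle (so that the drawn fraction matches the occluded area) and the RNG handling are implementation choices of ours.

```python
import numpy as np

def occlude(image, max_fraction=0.6, rng=None):
    """Overlay a black rectangle of random size and position, emulating
    the artificial partial occlusions of Sect. 3.2 (coverage drawn
    uniformly from [0, 60] % of the image area)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    frac = rng.uniform(0.0, max_fraction)   # occluded area fraction
    rh = max(1, int(h * np.sqrt(frac)))     # rectangle height
    rw = max(1, int(w * np.sqrt(frac)))     # rectangle width
    y0 = rng.integers(0, h - rh + 1)
    x0 = rng.integers(0, w - rw + 1)
    out = image.copy()
    out[y0:y0 + rh, x0:x0 + rw] = 0         # black rectangle overlay
    return out
```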


Fig. 4 The visual outcome of the proposed approach for un-occluded and partially occluded (under varying percentages) objects in cluttered backgrounds

3.3 Accuracy of 6 DoF pose estimation and object manipulation

In this section we demonstrate the accuracy of the Sparse Pose Manifolds method for the particular tasks of 6 DoF pose recovery and object manipulation. So that our method can be compared with other state-of-the-art methodologies, the experimental setup for eye-in-hand performance testing shares a common spirit with the one presented in Ma et al. (2011). However, we possess a 5 DoF robotic arm, which does not allow a complete evaluation of the 6 DoF estimation of our method. We have addressed this issue by introducing rendered images of different objects captured by a moving camera in an artificial environment (http://www.blender.org/).

The pose parameters of the camera are considered to be the ground truth measurements. Furthermore, in order to demonstrate the generalization capabilities of our method, we utilize unknown objects belonging to 6 general object classes, i.e. box-shaped objects (e.g. a Rubik's cube), cars, motorcycles, balls, 4-legged animals and cups. Beyond these 6 object classes, the proposed method is able to generalize to arbitrarily shaped objects. Ground truth measurements correspond to the pose parameters of a moving camera fixated on the test object lying at a predefined position in 3D space. Figures 6 and 7 illustrate the efficiency of our method for all the test objects of the 6 different object classes. Thereupon, the proposed method is shown to exhibit high efficiency in recovering the 3 DoF translation parameters, since they


Fig. 5 Comparative evaluation of the tolerance of the proposed method under various degrees of partial occlusion, for a permissible error of 5° (a) and 3° (b), respectively

match, almost exactly, the ground truth measurements provided through rendering. Moreover, the estimated rotational parameters for all the objects under test differ only slightly from the ground truth.

The proposed method is capable of providing accurate 6 DoF pose estimations regardless of the class of the object (see Figs. 6 and 7). From a detailed study of the experimental simulations, several conclusions can be drawn about the underlying capacities of the Sparse Pose Manifolds. In particular, objects belonging to the cup and 4-legged animal classes are seen to cause larger oscillations in the estimated rotational parameters, leading to a slightly decreased performance of the object manipulation module as a result. In essence, the more regular the shape of an object, the higher the performance of the 3D pose recovery and grasping modules. More specifically, it is noted that ellipses with lower eccentricities lead to more distinguishable manifolds, while circle-like extracted ellipses lead to low-dimensional data that result in notable degradation of the performance of the 6 DoF pose estimation method. As a result, prismatic objects, such as cars and motorcycles, evinced higher grasping success frequencies than balls, 4-legged animals and cups. As the estimation of the pose of any object relies on the ellipse fitting module, it is straightforward to claim that the designed pose model is feature-sensitive rather than shape-based. This particular attribute of our method facilitates the efficient handling of issues raised by symmetrical objects, e.g. a ball. Table 1 presents the condensed results after executing 250 grasping operations for 6 different objects. Although we possess a 5 DoF robotic arm, we tackle the non-holonomic problem by positioning the objects on frames perpendicular to the working table, whilst the respective response of the network was set to zero. Moreover, a manipulation operation is considered successful when the robot arm flawlessly grasps the testing object. We would like to note that the mechanical attributes of the gripper enabled


Fig. 6 6 DoF object pose estimation for a box-shaped object, a motorcycle and a 4-legged animal, along with the corresponding ground truth measurements

Fig. 7 6 DoF object pose estimation for a car, a ball and a cup, along with the corresponding ground truth measurements

the manipulation of any object belonging to the respective classes.
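For reference, the eccentricity values reported in Table 1 can be recovered from a fitted conic Δ by removing the translation and reading the semi-axes off the eigenvalues of the quadratic form; the sketch below uses standard conic algebra and is ours, not code from the paper.

```python
import numpy as np

def ellipse_eccentricity(delta):
    """Eccentricity of the conic A x^2 + B xy + C y^2 + D x + E y + F = 0,
    assuming it is a real ellipse (B^2 - 4AC < 0). Lower values correlate
    with higher grasping success rates (Table 1)."""
    A, B, C, D, E, F = delta
    # Centre of the conic: the gradient of the quadratic form vanishes.
    x0, y0 = np.linalg.solve(np.array([[2 * A, B], [B, 2 * C]]), [-D, -E])
    # Constant term after translating the conic to its centre.
    F0 = A * x0**2 + B * x0 * y0 + C * y0**2 + D * x0 + E * y0 + F
    # Semi-axes from the eigenvalues of the quadratic-form matrix.
    lam = np.linalg.eigvalsh(np.array([[A, B / 2], [B / 2, C]]))
    a, b = sorted(np.sqrt(-F0 / lam), reverse=True)
    return np.sqrt(1.0 - (b / a) ** 2)
```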

The proposed methodology was also compared to the classic KNN classification algorithm. The training matrix for KNN was the same as that used for the regressor, and the corresponding labels are the poses (in degrees) of the training samples. In deciding whether the pose of a test sample has been accurately classified, we allowed an error tolerance of ±5° from the ground truth. KNN was tested using 1 and 3 neighbors, i.e. K=1 and K=3. The results are summarized in Table 2. The accuracy of KNN is significantly inferior to that of the regressor of the proposed technique, for the aforementioned reasons. Besides the reduction in the success rate, the standard deviation


Table 1 This summary table illustrates the accuracy, along with the corresponding standard deviations (σ), of the proposed method in accomplishing grasping operations with respect to multiple objects belonging to 6 different classes

Object class       Success rate %   Std (σ)   Mean eccentricity   Std (σ)
Box-shaped         97.2             0.9       0.31                0.09
Car                95.6             1.1       0.44                0.12
Motorcycle         92.8             1.6       0.61                0.11
Ball               88.4             2.5       0.76                0.08
4-legged animal    87.9             2.9       0.84                0.09
Cup                86.6             3.3       0.87                0.07

As can be seen, the success rate is closely related to the mean eccentricity of the extracted ellipses, since smaller eccentricity values imply higher manipulation rates

Table 2 Summarizing results for the KNN classification experiments

Object class       Success rate % (K=1)   Std (σ) (K=1)   Success rate % (K=3)   Std (σ) (K=3)
Box shaped         94.7                   2.6             94.2                   2.3
Car                91.5                   3.2             94.5                   3.0
Motorcycle         92.0                   1.8             95.5                   1.1
Ball               86.2                   3.4             86.1                   2.5
4-legged animal    83.3                   3.7             83.8                   2.9
Cup                82.6                   3.4             82.8                   3.2

increased dramatically. For some object classes, such as the box-shaped and the car, the standard deviation more than doubled. The comparison between the numbers of neighbors suggests that with K=3 the system improved its success rate in most cases, yet for two object classes, box-shaped and ball, K=1 provided slightly better results. Moreover, with K=3 the standard deviation improved for five out of six classes, indicating that with K=3 neighbors the system provides more stable results.
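The KNN baseline of Table 2 is straightforward to reproduce in outline. The snippet below uses random placeholders for the sparse manifold vectors and assumes pose labels on a 5° grid (an assumption consistent with the capture interval reported in Sect. 2.4); only the ±5° tolerance and the K values come from the text.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholders for the sparse manifold vectors and their pose labels
# (degrees on a 5-degree grid); wraparound at 360 is ignored here.
rng = np.random.default_rng(0)
r_train = rng.normal(size=(500, 16))
pose_train = rng.integers(0, 72, 500) * 5
r_test = rng.normal(size=(50, 16))
pose_test = rng.integers(0, 72, 50) * 5

for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k).fit(r_train, pose_train)
    pred = knn.predict(r_test)
    # A test pose counts as correct within +/-5 degrees of ground truth.
    acc = np.mean(np.abs(pred - pose_test) <= 5)
    print(f"K={k}: success rate {100 * acc:.1f}%")
```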

3.4 Real-time execution and testing with real objects

The calibration procedure set the camera's extrinsic parameters, corresponding to its pose in the 3D working space, in addition to holding information about its position relative to the end effector's coordinate system and the frame of the robot itself. The experiments ran on an Intel i5 2.4 GHz dual-core processor with 8 GB RAM. The resolution of the images was 640×480 pixels, while the frame rate was 5 Hz. The most time-demanding module of the proposed method (apart from the training session, which is performed off-line) is the extraction of SIFT features. To achieve real-time execution, SIFT is applied over the complete test image only for the first frame, with a view to obtaining an initial set of SIFT features. Then, at each subsequent time step, SIFT extraction is limited to a smaller region of interest (ROI) within the image, containing features whose distance from their mean is smaller than 3 times (set experimentally) their standard deviation. We have further evaluated the performance of our method in several experimental tests that involve either eye-in-hand or eye-to-hand camera configurations with the use of real objects. In the videos accompanying this paper¹, it can be seen: (a) how our method is capable of efficiently guiding a robotic gripper to grasp an unknown object (eye-to-hand version); and (b) the performance of the 6 DoF pose estimation module (eye-in-hand version). Figures 8 and 9 present in a more illustrative form the execution stages of the proposed method, in its eye-to-hand version, from the process of ellipse fitting to the efficient accomplishment of manipulation tasks for a car and a box, respectively. The lighting conditions were kept controlled, as the main goal of the conducted experiments was to prove that sparse pose manifolds are feasible and can provide efficient solutions to the pose estimation problem. In a more demanding scenario, the proposed algorithm could be enriched with illumination compensation capabilities (Vonikakis et al. 2013).
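The ROI update described above can be sketched as follows; we interpret the 3σ rule per axis and add a fixed padding margin, both of which are implementation choices of ours rather than details from the paper.

```python
import numpy as np

def next_roi(feature_xy, k=3.0, pad=20):
    """Region of interest for the next frame: keep features within k
    standard deviations of their mean (k = 3 in the paper, applied here
    per axis) and pad the bounding box; `pad` is an assumed margin."""
    pts = np.asarray(feature_xy, dtype=float)
    mean, std = pts.mean(axis=0), pts.std(axis=0)
    inliers = pts[np.all(np.abs(pts - mean) <= k * std, axis=1)]
    x0, y0 = inliers.min(axis=0) - pad
    x1, y1 = inliers.max(axis=0) + pad
    return int(x0), int(y0), int(x1), int(y1)
```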

4 Discussion

In this paper we presented a novel solution to the automatic manipulation of unknown objects, based on the first integrated research attempt to formulate Sparse Pose Manifolds. The key ideas underlying the proposed method are that: (a) different objects viewed under the same perspective share identical poses, which can be efficiently projected onto a well-defined and highly distinguishable subspace; and (b) the reprojection of the initial manifolds onto sparse and highly representative subspaces leads to increased generalization capabilities of the method. Towards this end, we propose an integrated image understanding architecture built upon a bunch-based structure, involving the unsupervised clustering of the extracted features, whilst encapsulating both the appearance and the geometrical attributes of the objects. Through the process of ellipse fitting we build initial pose manifolds of low dimensionality, thereby minimizing the complexity of the system and avoiding the usage of conventional dimensionality reduction schemes. In addition, we post-process the established pose manifolds via l1 norm minimization to build sparse and compact input vectors characterized by large discrimination and generalization capacities. We believe our work makes the following contributions: (a) we show that there are

1 Grasping a pair of pliers: http://www.youtube.com/watch?v=J_gpPu6ZYQQ; grasping a box-shaped object: http://www.youtube.com/watch?v=cOF0RdeJ6Zg; pose estimation of a car: http://www.youtube.com/watch?v=CSoFMk48DmM; pose estimation of a box-shaped object: http://www.youtube.com/watch?v=RTe6usXm9qs.


Fig. 8 Upper: Images obtained through the eye-to-hand camera are fed to the ellipse fitting process to provide the corresponding results. Here, the black circles represent the two foci of the ellipse, while the green ones stand for the members of the bunch-based architecture. Lower: After the establishment of the initial pose manifolds and their reprojection onto a sparser and highly representative subspace, the robotic arm is adequately configured to grasp an unknown test object (e.g. a car)

Fig. 9 Upper: Regardless of the class of the unknown test object, the ellipse fitting process is performed every 5 frames in order to produce low-dimensional sparse input vectors. Lower: The latter are fed to the regressor, which provides accurate estimations of the pose of the unknown test object (e.g. a box). These measurements are then transferred to the controller of the robot which, in turn, adjusts its 5 DoF joints to perform a grasping task

indeed manifolds that can be sufficiently modeled through a training session with several objects under varying poses; (b) the proposed bunch-based architecture extracts new features that enjoy both the appearance and the geometrical attributes of the objects, whilst demanding less supervision during training; and (c) through the minimization of the l1 norm we establish sparse and representative pose manifolds that are fed to an RBF-based regressor to recover the mapping between the input and output spaces.

We can furthermore claim that our work is shown to be more tolerant to partial occlusions as compared to other related methods, primarily for the following reasons. Our bunch-based architecture extracts features that enjoy both appearance and topological attributes, in contrast to, for example, the N3M method, which requires a large number of inter-part connections for accurate measurements. Regarding the SMM method, it is noted that it abstracts parts of objects in a supervised manner through the realization of the PDF associated with the joint distribution of appearance and geometry properties of the object. However, the main drawback of SMM is that it is very sensitive to partial occlusions, due to the fact that the extracted parts correspond to specific regions on the surface of the object (e.g. the tire of a car), notwithstanding the limited dataset used for training (as it contains only cars). Moreover, the SMM method relies on a statistical manifold modeling approach that is based on the operations of expansion and alignment in order to minimize the labeling ambiguity. The authors of SMM utilize video sequences in place of still images, hence decreasing the reliability of the ground truth measurements. These operations are used in order to facilitate the efficient categorization of testing objects into the respective classes (which hold data of poses similar to those used for training). In contrast, the proposed SPM modeling technique establishes manifolds that, through the unsupervised clustering subroutine and the ellipse fitting procedure, hold highly distinguishable data that can be easily classified into the respective classes. Furthermore, our method is also less affected by partial occlusions as compared to the


LS-based and PCA-based methods, primarily due to the linear mapping between input and output space and, secondly, due to the high dimensional input vectors of limited discrimination capacities. More conclusively, experimental results provide evidence of low generalization error, thus justifying our theoretical claims. It should also be noted that box-shaped objects (with high eccentricity values), as opposed to objects belonging to the classes of 4-legged animals, cups and balls (with lower eccentricity values), achieve higher rates of success in the particular task of performing a grasp on unknown targets. Although sphere-like objects are the most regular objects in the real world, their projection onto a 2D image forms a circle, which possesses fewer DoFs than an ellipse and thus leads to singular configurations. Moreover, the extracted features of targets belonging to the classes of 4-legged animals, cups and balls fit a circle rather than an ellipse and, as a result, exhibit slightly reduced rates of success in object manipulation tasks (a minimal eccentricity computation is sketched below). Finally, with regards to future work, we aim to concentrate on designing and implementing an advanced manifold modeling structure that lays its foundations in the attributes of a 3D ellipse fitted over 3D features obtained from an RGB-D camera. Intuitively, such a new architecture, accompanied by an exhaustive training session with numerous examples, should represent a new stream of solutions to the object manipulation problem.
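To make the eccentricity argument concrete, here is a minimal sketch (assuming OpenCV's `fitEllipse` over at least five 2D feature locations; the function and variable names are illustrative) that computes the eccentricity of the fitted ellipse, where values near 0 correspond to the near-circular, singular configurations discussed above:

```python
import cv2
import numpy as np

def feature_eccentricity(points: np.ndarray) -> float:
    """Fit an ellipse to >= 5 2D feature locations and return its eccentricity.

    Values near 0 mean the features fit a circle (ball- or cup-like targets),
    the degenerate case noted above; values near 1 indicate the elongated
    ellipses produced by box-shaped objects.
    """
    (cx, cy), (d1, d2), angle = cv2.fitEllipse(points.astype(np.float32))
    a, b = max(d1, d2) / 2.0, min(d1, d2) / 2.0  # semi-major / semi-minor axes
    return float(np.sqrt(1.0 - (b / a) ** 2))
```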

Acknowledgments The authors would like to thank Nikolaos Metaxas-Mariatos for his help in conducting the experimental validation of the proposed method.

Appendix

Formulation of bunch-based architecture

Given an image of an object with a certain pose, we first extract the 2D locations of the $\rho$ SIFT keypoints, denoted as $v_\zeta \in \mathbb{R}^2$. Then we perform clustering over the locations of the extracted interest points (input vectors $v_\zeta$, $\zeta = 1, 2, \dots, \rho$) in order to account for the topological attributes of the object. Supposing there are $\gamma$ clusters denoted as $b_k$, $k = 1, 2, \dots, \gamma$, we follow basic Bayesian rules, noting that a vector $v_\zeta$ belongs to a cluster $b_k$ if $P(b_k|v_\zeta) > P(b_j|v_\zeta)$ for all $j = 1, 2, \dots, \gamma$, $j \neq k$. The expectation of the unknown parameters, conditioned on the current estimates $\Theta(\tau)$ ($\tau$ denotes the iteration step) and the training samples (E-step of the EM algorithm), is:

$$J(\Theta; \Theta(\tau)) = E\left[\sum_{\zeta=1}^{\rho} \ln\big(p(v_\zeta|b_k; \theta)\, P_k\big)\right] = \sum_{\zeta=1}^{\rho} \sum_{k=1}^{\gamma} P(b_k|v_\zeta; \Theta(\tau))\, \ln\big(p(v_\zeta|b_k; \theta_k)\, P_k\big) \quad (6)$$

with $P_{1\times\gamma} = [P_1, \dots, P_\gamma]^T$ denoting the a priori probabilities of the respective clusters, $\theta_{2\times\gamma} = [\theta_1^T, \dots, \theta_\gamma^T]^T$ corresponding to the parameter vector $\theta_k$ of the $k$-th cluster, and $\Theta_{3\times\gamma} = [\theta^T, P^T]^T$. According to the M-step of the EM algorithm, the parameters of the $\gamma$ clusters in the respective subspace are estimated through the maximization of the expectation:

$$\Theta(\tau + 1) = \arg\max_{\Theta} J(\Theta; \Theta(\tau)) \quad (7)$$

resulting in

$$\sum_{\zeta=1}^{\rho} \sum_{k=1}^{\gamma} P(b_k|v_\zeta; \Theta(\tau))\, \frac{\partial}{\partial \theta_k} \ln\big(p(v_\zeta|b_k; \theta_k)\big) = 0 \quad (8)$$

while maximization with respect to the a priori probabilities $P$ gives:

$$P_k = \frac{1}{\rho} \sum_{\zeta=1}^{\rho} P(b_k|v_\zeta; \Theta(\tau)), \quad k = 1, \dots, \gamma \quad (9)$$

It is apparent that the optimization of Eq. 8 with respect to $P$ constitutes a constrained maximization problem that has to obey $P_k \geq 0$, $k = 1, \dots, \gamma$, and $\sum_{k=1}^{\gamma} P_k = 1$. We recall the Lagrangian theory, which states: given a function $f(x)$ to be optimized subject to several constraints $g_i(x) = 0$, build the corresponding Lagrangian function as $L(x, \lambda) = f(x) - \sum_i \lambda_i\, g_i(x)$.

Following on from the above statement, we denote the Lagrangian function corresponding to Eq. 6 as:

$$J(P, \lambda) = J(\Theta; \Theta(\tau)) - \lambda \left(\sum_{k=1}^{\gamma} P_k - 1\right)$$

We obtain $\lambda$ and $P_k$ through:

$$\frac{\partial J(P, \lambda)}{\partial P_k} = 0 \;\Rightarrow\; \frac{\partial}{\partial P_k} \left(\sum_{\zeta=1}^{\rho} \sum_{k=1}^{\gamma} P(b_k|v_\zeta; \Theta(\tau))\, \ln\big(p(v_\zeta|b_k; \theta)\, P_k\big)\right) - \frac{\partial}{\partial P_k}\, \lambda \left(\sum_{k=1}^{\gamma} P_k - 1\right) = 0 \;\Rightarrow\; P_k = \frac{1}{\lambda} \sum_{\zeta=1}^{\rho} P(b_k|v_\zeta; \Theta(\tau))$$

Since $\sum_{k=1}^{\gamma} P_k = 1$, we can derive that $\lambda = \rho$, resulting in the final a priori probability of the $k$-th cluster of Eq. 9:


$$P_k = \frac{1}{\rho} \sum_{\zeta=1}^{\rho} P(b_k|v_\zeta; \Theta(\tau)), \quad k = 1, \dots, \gamma$$
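For illustration, the clustering step above can be approximated with an off-the-shelf EM implementation; the sketch below uses scikit-learn's `GaussianMixture` as a stand-in for the derivation, with placeholder data and illustrative names:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
keypoints = rng.standard_normal((200, 2))  # 2D SIFT keypoint locations (placeholder)

gamma = 5                                                  # number of clusters b_k
gmm = GaussianMixture(n_components=gamma).fit(keypoints)   # EM iterations over v_zeta

posteriors = gmm.predict_proba(keypoints)  # P(b_k | v_zeta) for every keypoint
labels = posteriors.argmax(axis=1)         # assign v_zeta to its most probable cluster
P = posteriors.mean(axis=0)                # a priori probabilities P_k of Eq. 9
```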

Training with noise

The performance of the proposed regressor-based 3D pose estimation module is bootstrapped by adding noise to the input vectors fed to the RBF kernel during training. In the following passage we present a slightly modified version of the Tikhonov regularization theorem, as adjusted to the needs of our case. In cases where the inputs do not contain noise and the size $t$ of the training dataset tends to infinity, the error function containing the joint distributions $p(y_\lambda, r)$ (of the desired values $y_\lambda$ for the network outputs $g_\lambda$) assumes the form:

$$E = \lim_{t \to \infty} \frac{1}{2t} \sum_{k=1}^{t} \sum_{\lambda} \{g_\lambda(r_k; w) - y_{\lambda k}\}^2 = \frac{1}{2} \sum_{\lambda=1}^{m} \iint \{g_\lambda(r; w) - y_\lambda\}^2\, p(y_\lambda, r)\, dy_\lambda\, dr = \frac{1}{2} \sum_{\lambda=1}^{m} \iint \{g_\lambda(r; w) - y_\lambda\}^2\, p(y_\lambda|r)\, p(r)\, dy_\lambda\, dr$$

Let $\eta$ be a random vector describing the input noise, with probability distribution $p(\eta)$. In most cases the noise distribution is chosen to have zero mean ($\int \eta_i\, p(\eta)\, d\eta = 0$) and to be uncorrelated ($\int \eta_i \eta_j\, p(\eta)\, d\eta = \nu\, \delta_{ij}$, where $\nu$ denotes the noise variance). In cases where each input data point contains additive noise and is repeated infinitely many times, the error function over the expanded dataset can be written as:

$$\tilde{E} = \frac{1}{2} \sum_{\lambda=1}^{m} \iiint \{g_\lambda(r + \eta; w) - y_\lambda\}^2\, p(y_\lambda|r)\, p(r)\, p(\eta)\, dy_\lambda\, dr\, d\eta$$

Expanding the network function as a Taylor series in powers of $\eta$ produces:

$$g_\lambda(r + \eta; w) = g_\lambda(r; w) + \sum_i \eta_i \left.\frac{\partial g_\lambda}{\partial r_i}\right|_{\eta=0} + \frac{1}{2} \sum_i \sum_j \eta_i \eta_j \left.\frac{\partial^2 g_\lambda}{\partial r_i \partial r_j}\right|_{\eta=0} + O(\eta^3)$$

By substituting the Taylor series expansion into the error function, we obtain the form of the regularization term that governs Tikhonov regularization:

$$\tilde{E} = E + \nu\, \Omega$$

with

$$\Omega = \frac{1}{2} \sum_{\lambda=1}^{m} \sum_i \iint \left\{\left(\frac{\partial g_\lambda}{\partial r_i}\right)^2 + \frac{1}{2}\, \{g_\lambda(r) - y_\lambda\}\, \frac{\partial^2 g_\lambda}{\partial r_i^2}\right\} p(y_\lambda|r)\, p(r)\, dr\, dy_\lambda$$
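A minimal sketch of this noise-injection strategy follows (placeholder data and illustrative names; `KernelRidge` with an RBF kernel stands in for the paper's RBF network):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))  # sparse pose-manifold input vectors (placeholder)
Y = rng.standard_normal((200, 6))   # 6 DoF pose targets (placeholder)

# Replicate each input with additive zero-mean, uncorrelated Gaussian noise;
# per the derivation above this behaves like Tikhonov regularization with
# strength equal to the noise variance nu.
nu, copies = 0.01, 10
X_noisy = np.repeat(X, copies, axis=0)
X_noisy += rng.normal(0.0, np.sqrt(nu), X_noisy.shape)
Y_noisy = np.repeat(Y, copies, axis=0)

reg = KernelRidge(kernel="rbf", gamma=0.5).fit(X_noisy, Y_noisy)
```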

References

Agrawal, A., Sun, Y., Barnwell, J., & Raskar, R. (2010). Vision-guided robot system for picking objects by casting shadows. IJRR, 29, 155–173.

Andreopoulos, A., & Tsotsos, J. (2009). A theory of active object localization. ICCV (pp. 903–910).

Baudat, G., & Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12, 2385–2404.

Ben Amor, H., Kroemer, O., Hillenbrand, U., Neumann, G., & Peters, J. (2012). Generalization of human grasping for multi-fingered robot hands. IROS (pp. 2043–2050).

Berg, A., Berg, T., & Malik, J. (2005). Shape matching and object recognition using low distortion correspondences. CVPR, 1, 26–33.

Bishop, C. (2006). Pattern recognition and machine learning. New York: Springer.

Blender. (2011). Blender 3D model creator. http://www.blender.org/

Bohg, J., & Kragic, D. (2009). Grasping familiar objects using shape context. International Conference on Advanced Robotics (pp. 1–6).

Castrodad, A., & Sapiro, G. (2012). Sparse modeling of human actions from motion imagery. International Journal of Computer Vision, 100, 1–15.

Chan, A., Croft, E., & Little, J. (2011). Constrained manipulator visual servoing (CMVS): Rapid robot programming in cluttered workspaces. IROS (pp. 2825–2830).

Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Review, 43, 129–159.

Cheng, B., Yang, J., Yan, S., Fu, Y., & Huang, T. S. (2010). Learning with l1-graph for image analysis. IEEE Transactions on Image Processing, 19, 858–866.

Choi, C., Baek, S., & Lee, S. (2008). Real-time 3D object pose estimation and tracking for natural landmark based visual servo. IROS (pp. 3983–3989).

Detry, R., & Piater, J. (2011). Continuous surface-point distributions for 3D object pose estimation and recognition. ACCV (pp. 572–585).

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley.

Eberhart, R., Shi, Y., & Kennedy, J. (2001). Swarm intelligence. The Morgan Kaufmann Series in Evolutionary Computation. San Francisco: Morgan Kaufmann.

Evermotion, T. M. (2012). Evermotion 3D models. http://www.evermotion.org/

Fergus, R., Perona, P., & Zisserman, A. (2007). Weakly supervised scale-invariant learning of models for visual recognition. IJCV, 71, 273–303.

Ferrari, V., Tuytelaars, T., & Van Gool, L. (2006). Simultaneous object recognition and segmentation from single or multiple model views. IJCV, 67, 159–188.

Fischler, M., & Bolles, R. (1981). Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24, 381–395.

Guha, T., & Ward, R. K. (2012). Learning sparse representations for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34, 1576–1588.

Hebert, P., Hudson, N., Ma, J., Howard, T., Fuchs, T., Bajracharya, M., & Burdick, J. (2012). Combined shape, appearance and silhouette for simultaneous manipulator and object tracking. ICRA (pp. 2405–2412).

Hinterstoisser, S., Benhimane, S., & Navab, N. (2007). N3M: Natural 3D markers for real-time object detection and pose estimation. ICCV (pp. 1–7).

Hsiao, E., Collet, A., & Hebert, M. (2010). Making specific features less discriminative to improve point-based 3D object recognition. CVPR (pp. 2653–2660).

Jolliffe, I. (1986). Principal component analysis. New York: Springer-Verlag.

Kouskouridas, R., Gasteratos, A., & Badekas, E. (2012). Evaluation of two-part algorithms for objects' depth estimation. IET Computer Vision, 6, 70–78.

Kouskouridas, R., Gasteratos, A., & Emmanouilidis, C. (2013). Efficient representation and feature extraction for neural network-based 3D object pose estimation. Neurocomputing, 120, 90–100.

Kragic, D., Björkman, M., Christensen, H., & Eklundh, J. (2005). Vision for robotic object manipulation in domestic settings. RAS, 52, 85–100.

Krainin, M., Henry, P., Ren, X., & Fox, D. (2011). Manipulator and object tracking for in-hand 3D object modeling. IJRR, 30, 1311–1327.

Leibe, B., Leonardis, A., & Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape model. Workshop, ECCV (pp. 17–32).

Lippiello, V., Siciliano, B., & Villani, L. (2007). Position-based visual servoing in industrial multirobot cells using a hybrid camera configuration. IEEE Transactions on Robotics, 23, 73–86.

Lippiello, V., Ruggiero, F., & Siciliano, B. (2011). Floating visual grasp of unknown objects using an elastic reconstruction surface. IJRR, 70, 329–344.

Lowe, D. (1999). Object recognition from local scale-invariant features. ICCV, 2, 1150–1157.

Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. IJCV, 60, 91–110.

Ma, J., Chung, T., & Burdick, J. (2011). A probabilistic framework for object search with 6-DoF pose estimation. IJRR, 30, 1209–1228.

Mason, M., Rodriguez, A., & Srinivasa, S. (2012). Autonomous manipulation with a general-purpose simple hand. IJRR, 31, 688–703.

Mei, L., Liu, J., Hero, A., & Savarese, S. (2011). Robust object pose estimation via statistical manifold modeling. ICCV (pp. 967–974).

Mei, L., Sun, M., Carter, K. M., Hero, A. O., III, & Savarese, S. (2009). Object pose classification from short video sequences. BMVC.

Nayar, S., Nene, S., & Murase, H. (1996). Columbia object image library (COIL-100). Technical Report CUCS-006-96, Department of Computer Science, Columbia University.

Oikonomidis, I., Kyriazis, N., & Argyros, A. (2011). Full DoF tracking of a hand interacting with an object by modeling occlusions and physical constraints. ICCV (pp. 2088–2095).

Pang, Y., Li, X., & Yuan, Y. (2010). Robust tensor analysis with l1-norm. IEEE Transactions on Circuits and Systems for Video Technology, 20, 172–178.

Popovic, M., Kraft, D., Bodenhagen, L., Baseski, E., Pugeault, N., Kragic, D., et al. (2010). A strategy for grasping unknown objects based on co-planarity and colour information. RAS, 58, 551–565.

Qiao, L., Chen, S., & Tan, X. (2010). Sparsity preserving projections with applications to face recognition. Pattern Recognition, 43, 331–341.

Rasolzadeh, B., Björkman, M., Huebner, K., & Kragic, D. (2010). An active vision system for detecting, fixating and manipulating objects in the real world. IJRR, 29, 133–154.

Savarese, S., & Fei-Fei, L. (2007). 3D generic object categorization, localization and pose estimation. ICCV (pp. 1–8).

Saxena, A., Driemeyer, J., Kearns, J., Osondu, C., & Ng, A. (2008). Learning to grasp novel objects using vision. In Experimental Robotics (pp. 33–42). Berlin: Springer.

Saxena, A., Driemeyer, J., Kearns, J., & Ng, A. (2006). Robotic grasping of novel objects. Neural Information Processing Systems, 19, 1209–1216.

Saxena, A., Wong, L., Quigley, M., & Ng, A. Y. (2011). A vision-based system for grasping novel objects in cluttered environments. Robotics Research, 18, 337–348.

Schölkopf, B., Smola, A. J., & Müller, K. R. (1997). Kernel principal component analysis. ICANN (pp. 583–588).

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. PAMI, 22, 888–905.

Shubina, K., & Tsotsos, J. (2010). Visual search for an object in a 3D environment using a mobile robot. CVIU, 114, 535–547.

Srinivasa, S., Ferguson, D., Helfrich, C., Berenson, D., Collet, A., Diankov, R., et al. (2010). HERB: A home exploring robotic butler. Autonomous Robots, 28, 5–20.

Torabi, L., & Gupta, K. (2012). An autonomous six-DoF eye-in-hand system for in situ 3D object modeling. IJRR, 31, 82–100.

Tsaig, Y., & Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52, 1289–1306.

Tsoli, A., & Jenkins, O. (2008). Neighborhood denoising for learning high-dimensional grasping manifolds. IROS (pp. 3680–3685).

Viksten, F., Forssen, P., Johansson, B., & Moe, A. (2009). Comparison of local image descriptors for full 6 degree-of-freedom pose estimation. ICRA (pp. 2779–2786).

Vonikakis, V., Kouskouridas, R., & Gasteratos, A. (2013). A comparison framework for the evaluation of illumination compensation algorithms. IST 2013 IEEE International Conference (pp. 264–268).

Wang, J., Sun, X., Liu, P., She, M. F., & Kong, L. (2013). Sparse representation of local spatial-temporal features with dimensionality reduction for motion recognition. Neurocomputing, 100, 134–143.

Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S., & Ma, Y. (2009). Robust face recognition via sparse representation. PAMI, 31, 210–227.

Yan, S., Xu, D., Zhang, B., Zhang, H. J., Yang, Q., & Lin, S. (2007). Graph embedding and extensions: A general framework for dimensionality reduction. PAMI, 29, 40–51.

Yuan, C., & Niemann, H. (2001). Neural networks for the recognition and pose estimation of 3D objects from a single 2D perspective view. IMAVIS, 19, 585–592.

Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. JCGS, 15, 265–286.


Rigas Kouskouridas holds a Diploma (2006) and a Ph.D. from the Department of Production and Management Engineering at the Democritus University of Thrace, Xanthi, Greece. He is currently a postdoctoral research associate at the Computer Vision and Learning Lab, Department of Electrical and Electronic Engineering, Imperial College London. His areas of interest include computer vision, pattern recognition, machine learning and robotics. He is involved in several national (Greek) and international (European) research projects in the field of machine vision systems. Dr. Kouskouridas is a member of the IEEE, euCognition II and the Technical Chamber of Greece (TEE).

Kostantinos Charalampous holds a Bachelor's (2009) and a Master's (2011) degree, both from the School of Informatics, Aristotle University of Thessaloniki. He is currently a Ph.D. candidate at the Department of Production and Management Engineering, Democritus University of Thrace.

Antonios Gasteratos is an Assistant Professor of "Mechatronics and Artificial Vision" at the DPME. He teaches the courses of "Robotics", "Automatic Control Systems", "Measurements Technology" and "Electronics". He holds a Diploma and a Ph.D. from the Department of Electrical and Computer Engineering, DUTH, Greece, 1994 and 1999, respectively. During 1999–2000 he was a post-doctoral fellow at the Laboratory of Integrated Advanced Robotics (LIRA-Lab), DIST, University of Genoa, Italy. His research interests are mainly in mechatronics and robot vision. He has published one textbook, 3 book chapters and more than 90 scientific papers. Dr. Gasteratos is a member of the EURON, euCognition and I*PROMS European networks.
