Generating Multi-Fingered Robotic Grasps via Deep Learning

Jacob Varley, Jonathan Weisz, Jared Weiss, Peter Allen

Abstract— This paper presents a deep learning architecture for detecting the palm and fingertip positions of stable grasps directly from partial object views. The architecture is trained using RGBD image patches of fingertip and palm positions from grasps computed on complete object models using a grasping simulator. At runtime, the architecture is able to estimate grasp quality metrics without the need to explicitly calculate the given metric. This ability is useful as the exact calculation of these quality functions is impossible from an incomplete view of a novel object without any tactile feedback. This architecture for grasp quality prediction provides a framework for generalizing grasp experience from known to novel objects.

I. INTRODUCTION

Despite the existence of a vast amount of literature on robotic grasping, grasp planning for objects from partial views remains a challenging task. This paper presents a system relying on deep learning methods to generate grasps for objects that may or may not have been seen before and whose current full geometry is unknown. The proposed system uses a Convolutional Neural Network to quickly calculate grasp affordances such as force closure [4] directly from RGBD images. These methods make our system efficient, allow the use of a large context window for determining grasp quality, and enable the system to generalize precomputed grasps for known models to novel 3D objects.

The proposed approach involves training a deep learning network on thousands of quality grasps generated using known object models from BigBIRD [19], the Berkeley Instance Recognition Dataset. Training grasps are generated in an environment where the full object geometry is known using GraspIt! [14], a grasp simulator. For each grasp, the locations of the fingertips and palm are recorded along with a grasp quality score representing the stability of the given grasp in this known environment. Then, RGBD images of the grasp can be rendered, and RGBD image patches centered on the fingertip and palm locations can be extracted. These patches, paired with the recorded grasp quality score, are used to train a deep network to recognize stable fingertip and palm locations for different canonical grasp types directly from RGBD images.

At runtime, an RGBD image and a pointcloud for a scene are generated from a single point of view. The pointcloud is segmented and meshes are created for the segmented objects. Since the pointcloud was generated from a single view, these meshes are incomplete and represent only the faces of the object that were visible to the camera. At the same time that the pointcloud is processed, the RGBD image passes through the deep learning network in order to generate heatmaps, or dense per-pixel labels, revealing how well every location in the image would act as a fingertip or palm location for a number of different canonical hand configurations. The incomplete meshes are then imported into GraspIt!, and Simulated Annealing is run to find stable grasps that align well with the visible portions of the mesh model to be grasped, and whose fingertip and palm positions project into low-energy locations in the heatmaps produced by the deep network.

II. RELATED WORK

Below, we provide a brief overview of the work most relevant to the presented system. There is a significant body of work related to grasping with incomplete sensor data, as well as different metrics for generating grasps for known objects, for which we refer the reader to [1].

1) Data Driven Grasping: In the approach taken by [6][5], a grasping simulator plans grasps offline that maximize a given quality metric in order to generate grasps for known 3D objects. Then, when attempting to grasp an object in a scene, the scene is segmented, the segmented object's closest model is found in the database, and the precomputed grasps from the closest matching model are used on the new object. This approach works reliably when the object to be grasped can easily be segmented out of the scene and is not markedly different from the models in the database, but it is difficult to fine-tune a retrieved grasp with this system, especially when the full geometry of the object is unknown. Other related works in this area include [3][11].

2) Deep Learning for Parallel Jaw Grippers: In another approach [12], the grasping system learns grasps for a parallel jaw gripper from manually labeled RGBD images. The labels consist of rectangles with two of the edges representing where the gripper should be placed. This approach, while able to learn arbitrary grasp types based on whatever training data is given, has several drawbacks. Each training image has to be hand labeled, the grasp rectangles that are evaluated to determine grasp quality only contain the region of the image that will lie inside of the gripper, and the system is not easily extensible to more complicated robotic hands.

3) Fingertip Space and Precision Grasping: The concept of a Fingertip Space and an efficient algorithm for producing stable grasps within this space has been proposed by a number of researchers, including [8][15][13]. For example, [8] can synthesize precision grasps as long as the observed parts of the object are graspable. This stands in contrast to this work, where information about the observed parts of the object is supplemented with knowledge gained from prior grasps generated on known objects.

Fig. 1: Overview of the grasp training data generation stages.

III. METHOD

A. Grasp Training Data Generation

The generation of training data is a multi-stage process. Grasps are generated in GraspIt! using complete object models. The locations of the palm and fingertips for grasps in prototypical hand configurations are saved along with the energy of the given grasp. RGBD images of these palm and fingertip locations are then captured in simulation, and patches around these locations are saved. An overview of the data generation process is shown in Fig. 1.

Fig. 2: Barrett Hand with virtual contacts shown in red.

1) Grasp Generation: Training of the grasp generation system required the generation of quality grasps on known objects. In order to generate these grasps, models from BigBIRD, the Berkeley Instance Recognition Dataset, were uploaded into GraspIt!. The dataset contains mesh models for approximately 125 everyday objects. Several of the models had large holes due to transparent regions and were not used, as valid training grasps could not be generated for them. In GraspIt!, each model was uploaded along with a Barrett Hand model as shown in Fig. 2. From each object, points were chosen uniformly over the mesh surface, and GraspIt!'s Simulated Annealing planner was run for 30 seconds at a time, seeded at the chosen points on the object. For the grasp data generation, GraspIt!'s Contact and Potential Energy function was used to score the generated grasps. The Potential Energy function ensures that the generated grasps exhibit force closure. Only the grasps with quality scores below a threshold were kept. This generated 41,708 grasps to be used for training.
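The uniform seeding of the planner over the object surface can be reproduced with standard area-weighted triangle sampling. The sketch below is a generic illustration of that procedure; the function name and the n_samples default are not from the paper.

```python
import numpy as np

def sample_surface_points(vertices, faces, n_samples=20):
    """Area-weighted uniform sampling of points on a triangle mesh.

    vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices.
    """
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    # Triangle areas from the cross product of two edge vectors.
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    probs = areas / areas.sum()
    tri_idx = np.random.choice(len(faces), size=n_samples, p=probs)

    # Uniform barycentric coordinates inside each chosen triangle.
    r1, r2 = np.random.rand(n_samples), np.random.rand(n_samples)
    sqrt_r1 = np.sqrt(r1)
    a, b, c = 1.0 - sqrt_r1, sqrt_r1 * (1.0 - r2), sqrt_r1 * r2
    return (a[:, None] * v0[tri_idx]
            + b[:, None] * v1[tri_idx]
            + c[:, None] * v2[tri_idx])
```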

2) Grasp Type Labeling: The generated grasps were then categorized into different grasp types based on the 7 joint values and the roll of the wrist. Each joint value of the hand and the wrist roll were given nBins = 5 bins. This provided nBins^nJoints = 5^8 = 390,625 different grasp types. The grasp type is simply an integer between 0 and 390,624 that is determined by the coarse configuration of the hand. The exact formulation is given by:

grasp_type = sum_{i=0}^{nJoints-1} bin(joint[i]) * nBins^i    (1)
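A direct transcription of Eq. 1 in Python; how the bin boundaries are stored (joint_mins, joint_maxs) is an assumption:

```python
import numpy as np

N_BINS = 5  # bins per joint (and per wrist roll)

def grasp_type(joint_values, joint_mins, joint_maxs):
    """Map a hand configuration (7 joints + wrist roll) to an integer grasp type per Eq. 1."""
    joint_values = np.asarray(joint_values, dtype=float)
    # Discretize each value into one of N_BINS bins over its range.
    bins = np.clip(
        ((joint_values - joint_mins) / (joint_maxs - joint_mins) * N_BINS).astype(int),
        0, N_BINS - 1)
    # grasp_type = sum_i bin(joint[i]) * N_BINS**i
    return int(sum(b * N_BINS ** i for i, b in enumerate(bins)))
```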

Fig. 3: Image patches extracted from a grasp on an All brand detergent bottle. The center of each image patch corresponds to where the palm and fingertips from the grasp project into the image plane.

3) Grasp Type Reduction: The vast majority of the above grasp types had no grasp examples. For instance, the grasp type corresponding to the hand being completely open never results in a stable grasp. Only the n = 8 most common grasp types were kept as the set of canonical grasps. The set of 41,708 quality grasps from above contained 3,770 grasps of the 8 most common grasp types. The number of examples per grasp type was no longer sufficient past the 8th most common grasp type.
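The reduction to the canonical set amounts to a frequency count; a minimal sketch, assuming grasps is a list of (grasp_type, energy) pairs:

```python
from collections import Counter

def canonical_grasp_types(grasps, n_types=8):
    """Keep only grasps whose type is among the n_types most common types."""
    counts = Counter(g_type for g_type, _ in grasps)
    keep = {g_type for g_type, _ in counts.most_common(n_types)}
    return [g for g in grasps if g[0] in keep]
```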

4) Grasp Rendering: For each grasp, the target object was placed into Gazebo [10]. The position of the palm relative to the object was calculated, a simulated Kinect sensor was positioned 2 meters back from the object along the approach direction of the hand, and an RGBD image was captured.
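One plausible way to compute such a camera pose is a standard look-at construction that backs the camera off along the approach direction; this is a generic sketch, not the authors' Gazebo configuration:

```python
import numpy as np

def kinect_pose(object_position, approach_dir, offset=2.0):
    """Camera position and 3x3 rotation looking along the hand approach direction.

    approach_dir points from the hand toward the object; the camera is backed
    off `offset` meters from the object along this direction.
    """
    z = approach_dir / np.linalg.norm(approach_dir)   # camera forward (viewing) axis
    position = object_position - offset * z           # 2 m back along the approach
    up = np.array([0.0, 0.0, 1.0])
    if abs(np.dot(up, z)) > 0.99:                     # avoid a degenerate up vector
        up = np.array([0.0, 1.0, 0.0])
    x = np.cross(up, z)
    x /= np.linalg.norm(x)                            # camera right axis
    y = np.cross(z, x)                                # completes the right-handed frame
    return position, np.column_stack([x, y, z])
```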

5) Training Patch Extraction: From each RGBD image, patches around the locations of the palm and fingertips are extracted. These training patches allow the deep learning model to learn what good fingertip locations look like for specific grasp types, for a known object, from a specific viewing angle. Fig. 3 demonstrates how 4 patches are generated from a single grasp of a known model.

Fig. 5: Patches and label vectors for a single grasp with energy e.
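Extracting a patch reduces to projecting each 3D palm or fingertip location into the image with a pinhole model and cropping around the resulting pixel. The sketch below assumes standard Kinect-style intrinsics (fx, fy, cx, cy) and the 72x72 patch size used later for training:

```python
import numpy as np

PATCH = 72  # patch side length used for training

def project_to_pixel(p_cam, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Pinhole projection of a 3D point in camera coordinates to (row, col)."""
    x, y, z = p_cam
    return int(round(fy * y / z + cy)), int(round(fx * x / z + cx))

def extract_patch(rgbd, p_cam):
    """Crop a PATCH x PATCH window of the HxWx4 RGBD image centered on the projection of p_cam."""
    row, col = project_to_pixel(p_cam)
    half = PATCH // 2
    h, w = rgbd.shape[:2]
    if half <= row < h - half and half <= col < w - half:
        return rgbd[row - half:row + half, col - half:col + half]
    return None  # point projects too close to the image border
```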

B. Deep Learning Model

1) Training Data Labels: For each training patch, a label vector is created whose length is the number of canonical grasp types (n) multiplied by the number of palm and fingertip locations (f), i.e. len(label_vec) = n * f = 8 * 4 = 32. The vector contains zeros except at the entry corresponding to the grasp and finger type of the patch. This entry is filled with the grasp quality score of the grasp that the patch was extracted from. Fig. 5 shows how the training labels for a set of patches from the same grasp are composed.
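A sketch of how one such label vector can be assembled; the (grasp type, finger) index ordering is an assumption, as the paper only fixes the vector length n * f = 32:

```python
import numpy as np

N_GRASP_TYPES = 8   # n: canonical grasp types
N_FINGERS = 4       # f: palm + three fingertips

def make_label(grasp_type_idx, finger_idx, energy):
    """Zero vector of length n*f with the grasp energy at the (grasp type, finger) entry."""
    label = np.zeros(N_GRASP_TYPES * N_FINGERS, dtype=np.float32)
    label[grasp_type_idx * N_FINGERS + finger_idx] = energy
    return label
```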


Fig. 4: Deep Model with local contrast normalization (LCN), 5 convolutional layers, and 3 fully connected layers.

Layer          1                2                3        4        5        6        7        8
Stage          conv + max pool  conv + max pool  conv     conv     conv     full     full     full
# channels     96               256              256      256      256      1000     100      f * n
Filter Size    11x11            3x3              3x3      3x3      3x3      -        -        -
Filter Stride  2x2              -                -        -        -        -        -        -
Pool Size      2x2              2x2              -        -        -        -        -        -
Pool Stride    2x2              2x2              -        -        -        -        -        -
Input Size     480x640          118x158          58x78    56x76    54x74    52x72    52x72    52x72

TABLE I: Architecture for the model. The number of channels in the final stage is the number of grasp types (n) multiplied by the number of interest points on the hand (f), i.e. the palm and fingertips.
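For readers working in a modern framework, the following PyTorch sketch approximates Table I; it is not the authors' Pylearn2 implementation. The LCN stage is omitted, the three "full" layers are written as 1x1 convolutions so the network can be applied densely to a whole image, and padding=1 on the first convolution reproduces the spatial sizes of Table I under PyTorch's floor rounding:

```python
import torch.nn as nn

N_GRASP_TYPES, N_FINGERS = 8, 4  # n and f from the paper

# Approximation of Table I; LCN omitted, "full" layers expressed as 1x1 convolutions.
model = nn.Sequential(
    nn.Conv2d(4, 96, kernel_size=11, stride=2, padding=1), nn.ReLU(),  # layer 1: conv + max pool
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(96, 256, kernel_size=3), nn.ReLU(),                      # layer 2: conv + max pool
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(256, 256, kernel_size=3), nn.ReLU(),                     # layers 3-5: conv
    nn.Conv2d(256, 256, kernel_size=3), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3), nn.ReLU(),
    nn.Conv2d(256, 1000, kernel_size=1), nn.ReLU(),                    # layers 6-8: "full" layers
    nn.Conv2d(1000, 100, kernel_size=1), nn.ReLU(),
    nn.Conv2d(100, N_GRASP_TYPES * N_FINGERS, kernel_size=1)           # one output per (grasp, finger)
)
```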

Fig. 6: The output of the Deep Network can be thought of as a set of heatmaps representing how well each location in the image acts as a fingertip or palm location for each grasp type.

2) Model Description: Our model was implemented using Pylearn2 [7] and consists of a local contrast normalization (LCN) stage [9] followed by five convolutional rectified linear layers and then three fully connected linear layers. The model was trained using an Nvidia GTX 780 GPU. The parameters of the model are shown in Table I, and the architecture is shown in Fig. 4. This network is small compared to networks built for the ImageNet challenge [16], which have 1000 output categories compared to our 32.

3) Heatmaps: While the model was trained on image patches of 72x72, when running it takes as input an entire RGBD image (480x640) and outputs a label vector for every pixel in the image, with the exception of the border pixels where a full 72x72 patch is not available. The output matrix is downsampled due to the filter strides and max pooling in the first two convolutional layers. This output matrix can be thought of as a set of potentials showing how well every pixel would work as a fingertip location for every fingertip type for each of the canonical grasp types. Fig. 6 demonstrates how the output of the deep learning model can be interpreted.

Fig. 7: Overview of the Grasp Generation System. After an image is captured, the scene is segmented and meshed while the heatmaps are generated in parallel. The meshed scene and the heatmaps are used within GraspIt! to plan grasps.

4) Training: The model is trained using stochastic gradient descent (SGD) with minibatches of size 20 for 151 epochs. The learning rate was 0.1, and an L1 term was used to induce sparsity in the weights. In addition, small amounts of Gaussian noise were randomly added to the training patches.
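A corresponding training-loop sketch (again a PyTorch stand-in, assuming the model above and a loader yielding minibatches of 20 RGBD patches of shape 4x72x72 with their 32-dimensional label vectors; the MSE loss, L1 weight, and noise scale are assumptions, while the optimizer, batch size, epoch count, and learning rate come from the paper):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # SGD with learning rate 0.1
loss_fn = torch.nn.MSELoss()                               # regression onto label vectors (assumed)
L1_WEIGHT, NOISE_STD = 1e-4, 0.01                          # illustrative values

for epoch in range(151):                                   # 151 epochs
    for patches, labels in loader:                         # minibatches of 20 patches
        noisy = patches + NOISE_STD * torch.randn_like(patches)     # Gaussian noise augmentation
        preds = model(noisy).flatten(1)                    # (20, 32, 1, 1) -> (20, 32)
        l1 = sum(p.abs().sum() for p in model.parameters())         # L1 sparsity penalty
        loss = loss_fn(preds, labels) + L1_WEIGHT * l1
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```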

C. Grasping from Partial Views

In order to realize a grasp of an object using the model above, several additional steps must be taken. An RGBD image must be captured, the scene must be segmented, and meshes must be generated for the segmented objects in the scene. Then Simulated Annealing can be run within GraspIt! on the mesh of the scene in order to find an optimal grasp that is well aligned with the visible portions of the object and is stable according to the deep network. An overview of this process is shown in Fig. 7.

1) Scene Segmentation: The system segments the pointcloud associated with the RGBD image using Euclidean Cluster Extraction [18][17]. The largest plane in the pointcloud is found using RANSAC and then removed. Euclidean clustering is then run on the remaining points, and any objects that were previously resting on the largest plane in the scene now form separate clusters.
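The same plane-removal-plus-clustering step can be prototyped with Open3D rather than PCL; note that this swaps Euclidean Cluster Extraction for DBSCAN, so it only approximates the pipeline described above, and all thresholds are illustrative:

```python
import numpy as np
import open3d as o3d

def segment_objects(pcd, plane_dist=0.01, cluster_eps=0.02, min_points=100):
    """Remove the dominant plane with RANSAC, then cluster the remaining points."""
    _, inliers = pcd.segment_plane(distance_threshold=plane_dist,
                                   ransac_n=3, num_iterations=1000)
    objects = pcd.select_by_index(inliers, invert=True)        # drop the table plane
    labels = np.array(objects.cluster_dbscan(eps=cluster_eps, min_points=min_points))
    n_clusters = int(labels.max()) + 1 if len(labels) else 0
    return [objects.select_by_index(np.where(labels == k)[0].tolist())
            for k in range(n_clusters)]                        # one cloud per object
```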

2) Mesh Generation: The segmented clusters from the previous step are then meshed, and the set of meshed partial objects is returned.

3) Heatmap Generation: In parallel with the generation of the meshes, the RGBD image is sent through the deep learning network. A given input RGBD image creates a set of f * n heatmaps, one for each of the f fingertip and palm locations for each of the n canonical grasp types.
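Once the dense output is reshaped to (n, f, H, W), looking up the heatmap for a particular grasp type and finger is a single index; the ordering below matches the label layout sketched earlier and is an assumption:

```python
import numpy as np

N_GRASP_TYPES, N_FINGERS = 8, 4

def get_heatmap(network_output, grasp_type, finger):
    """network_output: (n*f, H, W) array of per-pixel energies from the deep model."""
    heatmaps = network_output.reshape(N_GRASP_TYPES, N_FINGERS, *network_output.shape[1:])
    return heatmaps[grasp_type, finger]

# Example: pixel with the lowest predicted energy for the palm of grasp type 3.
# best = np.unravel_index(np.argmin(get_heatmap(out, 3, 0)), out.shape[1:])
```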

4) Grasp Planning: Within GraspIt!, the meshes are loaded into a planning scene, a partial mesh is selected for grasp planning, and GraspIt!'s Simulated Annealing planner is started [2]. At each step, the planner calculates the energy associated with the current hand configuration as a weighted sum of a Contact Energy term and a Heatmap Energy term, both of which are described in the following section. The grasp planner stores a list of the top grasps, ranked by energy, that are calculated over a pre-specified number of iterations of the planner. At this point, either the lowest energy grasp that is reachable can be executed automatically or a user can select a preferred grasp from the list generated by the planner.

D. Grasp Energy Calculation

Fig. 8: Minimizing Contact Energy is equivalent to minimizing both o_i · n_i and the length of o_i to stay close to and aligned with the object.

1) Contact Energy: The Contact Energy function makes use of Virtual Contacts, or user-specified points of interest on the hand, as shown in Fig. 8. In our case, we spread a total of 16 virtual contacts uniformly over the Barrett Hand. The Contact Energy metric penalizes grasps whose virtual contacts are either distant from the object or whose normals do not align well with the object. The algorithm is fully described in Algorithm 1. The energy combines two components: the distance from each virtual contact to the closest point on the object to be grasped, and the dot product of the normalized vector extending from the virtual contact to that closest point with the normal of the virtual contact on the hand. γ is used to scale how large a contribution the alignment term makes relative to the distance term. Two versions of the isVisible function were explored. For forward-facing grasps, using only the 4 virtual contacts on the palm was sufficient; to handle all cases, the orientation of the vector extending from the virtual contact to the nearest point on the mesh was compared to the vector extending from the camera location to the same point, and the sign of the dot product was used to ensure that contact was being made on a visible face of the mesh. The Contact Energy term encourages grasps to align well with the visible portions of the meshed object.

Algorithm 1 Grasp Energy Calculations

 1: procedure CONTACTENERGY()
 2:   numVisible = 0
 3:   energy = 0
 4:   for vc in virtualContacts do
 5:     ip = getIntersectionPoint(vc, body)
 6:     if isVisible(ip) then
 7:       numVisible++
 8:       energy += dist(vc, ip)
 9:       energy += γ * (1 - (vc.norm · ip.norm))
10:   return energy / numVisible

11: procedure HEATMAPENERGY()
12:   energy = 0
13:   graspType, distance = getGraspType()
14:   if distance > α then
15:     return MAX_ENERGY
16:   for vc in virtualContacts do
17:     fingerType = getFingerType(vc)
18:     if fingerType in [Palm, F0, F1, F2] then
19:       u, v = contactToImagePlane(vc.x, vc.y, vc.z)
20:       index = getIndex(graspType, fingerType)
21:       energy += heatmaps[index][u][v]
22:   energy += µ * distance
23:   return energy
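Below is a self-contained Python rendering of the two procedures in Algorithm 1. The nearest-neighbor lookup stands in for getIntersectionPoint, visibility is assumed to be precomputed as a per-point mask, the fingertip pixel locations are assumed to be already projected into the image, and α and MAX_ENERGY are illustrative values.

```python
import numpy as np

GAMMA = 500.0                              # alignment weight (Table II caption)
MU, ALPHA, MAX_ENERGY = 10.0, 1.0, 1e6     # ALPHA and MAX_ENERGY are illustrative

def contact_energy(contact_points, contact_normals,
                   object_points, object_normals, visible_mask):
    """Contact Energy over the hand's virtual contacts, following Algorithm 1.

    contact_points/normals: (K, 3) virtual contacts on the hand.
    object_points/normals:  (M, 3) points on the (partial) object surface.
    visible_mask:           (M,) bool, True for points visible to the camera.
    """
    num_visible, energy = 0, 0.0
    for p, n in zip(contact_points, contact_normals):
        idx = np.argmin(np.linalg.norm(object_points - p, axis=1))   # closest object point
        if not visible_mask[idx]:
            continue
        num_visible += 1
        energy += np.linalg.norm(object_points[idx] - p)             # distance term
        energy += GAMMA * (1.0 - np.dot(n, object_normals[idx]))     # alignment term
    return energy / num_visible if num_visible else np.inf

def heatmap_energy(finger_pixels, grasp_type, type_distance, heatmaps):
    """Heatmap Energy: finger_pixels maps finger index -> (row, col) image location,
    heatmaps has shape (n_types, n_fingers, H, W)."""
    if type_distance > ALPHA:
        return MAX_ENERGY                  # hand is far from every canonical grasp type
    energy = sum(heatmaps[grasp_type, f, r, c] for f, (r, c) in finger_pixels.items())
    return energy + MU * type_distance     # penalize drifting from the canonical type
```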

2) Heatmap Energy: The second component of the energy function comes from the heatmaps generated by the deep learning network. The distance from the current hand configuration to the nearest canonical grasp type is computed. If there are no nearby grasp types, then the Heatmap Energy is returned as a large value, implying that the current hand configuration is not well associated with any of the learned grasp types. If there is a nearby grasp type, then the corresponding heatmaps are used to determine the Heatmap Energy. The current locations of the palm and fingertips of the hand are projected back into the plane of the image captured by the Kinect, and the heatmap values for the individual fingers and the palm for the current grasp type are added to the energy score for the current hand configuration. In addition, the distance to the nearest grasp type is multiplied by µ and added to the energy. This penalizes grasps for moving away from the canonical hand configurations. Higher µ values keep the hand in configurations close to the canonical grasp hand configuration. A full description of this algorithm is shown in Algorithm 1. Overall, the Heatmap Energy function favors hand configurations where the palm and fingertips project into low-energy areas of the heatmaps, areas that the deep model has associated with stable grasps.


Real World Grasps

Trial #  Scene         Target Object  Success  Scene Segmentation (s)  Mesh Generation (s)  Heatmap Generation (s)  Simulated Annealing (s)  Total Planning Time (s)
1        All bottle    All bottle     yes      0.168                   3.014                3.122                   10.982                   14.164
2        water bottle  water bottle   yes      0.159                   2.977                3.141                   10.471                   13.612
3        drill         drill          yes      0.157                   2.991                3.119                   12.745                   15.893
4        shampoo       shampoo        yes      0.339                   3.277                3.122                   12.253                   15.869
5        tool case     tool case      no       0.175                   3.026                3.242                   14.328                   17.570
6        crowded       shampoo        yes      0.181                   3.307                3.127                   14.328                   17.816
7        crowded       drill          no       0.181                   3.167                3.118                   15.940                   19.288
8        crowded       All bottle     yes      0.188                   3.338                3.118                   15.507                   19.033

Success: 6/8                                                                                                       Average: 16.655

TABLE II: Results of grasp attempts for varying scenes and objects. ρ = 0.1, µ = 10, and γ = 500.

Fig. 9: Scene from the Grasping Rectangles Dataset [12], showing the RGBD input, the heatmaps for the left gripper plate, palm, and right gripper plate, and the found finger positions. Positive grasp examples were scored +1 and negative examples -1. Red corresponds to good gripper or palm positions, and the argmax for each heatmap is displayed in the overlaid image.


3) Combined Energy: The Contact Energy and Heatmap Energy terms are combined using a mixing parameter ρ, as shown in the following equation:

ρ * heatmapEnergy + (1 - ρ) * contactEnergy    (2)

Higher values of ρ lead to grasps where the fingertips are able to move into lower energy positions at the expense of increasing the Contact Energy. ρ balances how much a grasp is rewarded for aligning nicely with the visible portions of the object against how closely the current hand configuration matches the learned notion of what a stable grasp should look like.
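Combining the two terms per Eq. 2 and ranking candidate hand configurations is then a one-liner (ρ = 0.1 as in the experiments; the candidates structure is assumed):

```python
RHO = 0.1

def combined_energy(contact_e, heatmap_e, rho=RHO):
    """Eq. 2: weighted mix of the two energy terms (lower is better)."""
    return rho * heatmap_e + (1.0 - rho) * contact_e

# Rank candidate hand configurations by combined energy:
# best = min(candidates, key=lambda c: combined_energy(c.contact_e, c.heatmap_e))
```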

IV. EXPERIMENTAL RESULTS

A. Grasping Rectangle Dataset

In order to demonstrate the ability of the system to learn arbitrary grasp quality functions, we trained our system on the Grasping Rectangle Dataset from [12]. In this dataset, two of the edges of a rectangle represent where to place the gripper. Image patches were extracted from the center of each edge of the rectangle to represent each finger, and from the center of the rectangle for the palm. The grasps were discretized based on the rotation of the rectangle, so that the grasp types corresponded to the rotation of the wrist. The system was able to reproduce results similar to those found by [12]. Fig. 9 shows an example scene with a set of heatmaps produced by the deep learning network and the best fingertip positions found by our system.

B. BigBIRD Generated Dataset

Using the deep learning model trained on the BigBIRD dataset, grasps were generated for several different real world scenes and executed using a Barrett Hand attached to a Staubli TX60 robotic arm. None of the objects in the scenes were present in the training data for the deep learning model. For each trial, a single RGBD image and pointcloud of the scene were captured. The Simulated Annealing planner was run for 10,000 steps using the Combined Energy function, and the lowest energy reachable grasp generated within that time was executed by the arm. Success was determined by whether or not the object was contained in the hand and lifted off of the table. The total time is computed with the following formula, as the heatmap generation and scene segmentation are run in parallel:

t_total = max(t_segment + t_mesh, t_heatmap) + t_sim_ann    (3)
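As a worked check, plugging the timings from trial 1 of Table II into Eq. 3 reproduces the reported total:

```python
t_segment, t_mesh, t_heatmap, t_sim_ann = 0.168, 3.014, 3.122, 10.982  # trial 1, Table II
t_total = max(t_segment + t_mesh, t_heatmap) + t_sim_ann               # Eq. 3
print(round(t_total, 3))  # 14.164, matching the Total Planning Time column
```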

The results for the experiments are shown in Table II. While ρ = 0.1 places more weight on the Contact Energy, the unweighted Contact Energy is smaller than the unweighted Heatmap Energy, so this value balances the two. The experimental results demonstrate that our system is able to generalize grasps generated on the BigBIRD models to these new objects. Several example grasps are shown in Fig. 10. These experiments reveal not only that our system generalizes well, but that it is effective in crowded scenes and with semi-transparent objects. As long as some portion of the object is visible and able to be meshed, the system is able to generate grasps around that portion of the object. While the current system failed to generate a stable grasp for the tool case, we attribute this partially to the fact that very few of the objects in the training set have handles, and none of them had horizontal handles like the tool case. For the failure case involving the drill, the partial mesh of the drill was missing portions of the handle. While closing, the hand collided with a portion of the drill that had not been meshed; this is difficult to avoid, as we can only plan trajectories and generate hand configurations that avoid collisions with portions of objects that are meshed.


Fig. 10: Example scenes and lowest energy generated grasps.

Fig. 11: Grasps calculated for the same scene using different energy functions. Top: Contact Energy, Middle: Heatmap Energy, Bottom: Combined Energy.


C. Comparison of Grasp Energy Components

Using a single scene, grasps were planned using Contact Energy, Heatmap Energy, and Combined Energy to demonstrate how the different components of the Combined Energy function work in isolation. The Contact Energy function aligns the palm to the object to be grasped, but does not place the fingers into any specific configuration. The Heatmap Energy alone places the fingers into positions that project into low energy locations within the heatmaps, but often finds low energy regions of the heatmaps that belong to other objects, or that place the hand either too far into the scene or too close to the camera. The Combined Energy function uses the Contact Energy to ensure that the hand is near the target object, while the Heatmap Energy refines the fingertip and palm locations locally. Fig. 11 shows several low energy grasps computed with the different energy functions.

V. CONCLUSION

We have shown that traditional methods for calculating grasps on known 3D objects can be paired with a deep learning architecture in order to efficiently estimate arbitrary grasp affordances from RGBD images. The deep architecture can learn grasp quality metrics that may take arbitrarily long to calculate on the training models, enabling the use of thorough and meaningful grasp quality scores to generate training grasps. The generated grasps can then be used to train a deep model to quickly recognize quality grasps without explicitly calculating the grasp quality or even requiring a 3D model of the object to be grasped. This work is supported by NSF Grant IIS-1208153.

REFERENCES

[1] Jeannette Bohg, Antonio Morales, Tamim Asfour, and Danica Kragic. Data-driven grasp synthesis - a survey. Robotics, IEEE Transactions on, 30(2):289–309, 2014.

[2] Matei T Ciocarlie and Peter K Allen. Hand posture subspaces for dexterous robotic grasping. The International Journal of Robotics Research, 28(7):851–867, 2009.

[3] Renaud Detry, Carl Henrik Ek, Marianna Madry, and Danica Kragic. Learning a dictionary of prototypical grasp-predicting parts from grasping experience. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 601–608. IEEE, 2013.

[4] Carlo Ferrari and John Canny. Planning optimal grasps. In Robotics and Automation, 1992. Proceedings., 1992 IEEE International Conference on, pages 2290–2295. IEEE, 1992.

[5] Corey Goldfeder and Peter K Allen. Data-driven grasping. Autonomous Robots, 31(1):1–20, 2011.

[6] Corey Goldfeder, Matei Ciocarlie, Jaime Peretzman, Hao Dang, and Peter K Allen. Data-driven grasping with partial sensor data. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 1278–1283. IEEE, 2009.

[7] Ian J. Goodfellow, David Warde-Farley, Pascal Lamblin, Vincent Dumoulin, Mehdi Mirza, Razvan Pascanu, James Bergstra, Frederic Bastien, and Yoshua Bengio. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214, 2013.

[8] Kaiyu Hang, Johannes A Stork, and Danica Kragic. Hierarchical fingertip space for multi-fingered precision grasping. In Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on, pages 1641–1648. IEEE, 2014.

[9] Kevin Jarrett, Koray Kavukcuoglu, M Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pages 2146–2153. IEEE, 2009.

[10] Nathan Koenig and Andrew Howard. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In Intelligent Robots and Systems, 2004 (IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on, volume 3, pages 2149–2154. IEEE, 2004.

[11] Marek Kopicki, Renaud Detry, Florian Schmidt, Christopher Borst, Rustam Stolkin, and John L Wyatt. Learning dexterous grasps that generalise to novel objects by combining hand and contact models. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 5358–5365. IEEE, 2014.

[12] Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. IJRR, 2014.

[13] Yun Lin and Yu Sun. Robot grasp planning based on demonstrated grasp strategies. The International Journal of Robotics Research, 34(1):26–42, 2015.

[14] Andrew T Miller and Peter K Allen. GraspIt! A versatile simulator for robotic grasping. Robotics & Automation Magazine, IEEE, 11(4):110–122, 2004.

[15] Carlos Rosales, Lluís Ros, Josep M. Porta, and Raúl Suárez. Synthesizing grasp configurations with specified contact regions. The International Journal of Robotics Research, 30(4):431–443, 2011.

[16] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.

[17] Radu Bogdan Rusu. Semantic 3D object maps for everyday manipulation in human living environments. KI - Künstliche Intelligenz, 24(4):345–348, 2010.

[18] Radu Bogdan Rusu and Steve Cousins. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 9-13 2011.

[19] Arjun Singh, James Sha, Karthik S Narayan, Tudor Achim, and Pieter Abbeel. BigBIRD: A large-scale 3D database of object instances. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 509–516. IEEE, 2014.