8
Sequential View Grasp Detection For Inexpensive Robotic Arms Marcus Gualtieri and Robert Platt * Abstract—In this paper, we consider the idea of improving the performance of grasp detection by viewing an object to be grasped from a series of different perspectives. Grasp detection is an approach to perception for grasping whereby robotic grasp configurations are detected directly from point cloud or RGB sensor data. This paper focuses on the situation where the camera or depth senor is mounted near the robotic hand. In this context, there are at least two ways in which viewpoint can affect grasp performance. First, a “good” viewpoint might enable the robot to detect more/better grasps because it has a better view of graspable parts of an object. Second, by detecting grasps from arm configurations nearby the final grasp config- uration, it might be possible to reduce the impact of kinematic modelling errors on the last stage of grasp synthesis just prior to contact. Both of these effects are particularly relevant to inexpensive robotic arms. We evaluate them experimentally both in simulation and on-line with a robot. We find that both of the effects mentioned above exist, but that the second one (reducing kinematic modelling errors) seems to have the most impact in practice. I. I NTRODUCTION Recently, grasp detection has been demonstrated to be a practical approach to perception for robotic grasping [6], [11], [7]. In grasp detection, machine learning methods are used to identify places in an image or point cloud where a grasp would be feasible. For example, Figure 1(a) illustrates grasps detected by our algorithm given a point cloud of a shampoo bottle as input. The yellow grippers denote robotic hand configurations from which a good grasp of the object is expected to be feasible. Since grasp configurations are detected without first estimating the object pose, grasp de- tection typically works well for grasping novel objects when precise information about object geometry is not known in advance. Most current approaches to grasp detection operate based on an image or point cloud obtained by viewing the scene from a fixed viewpoint. For example, Levine’s work at Google uses input from a single RGB camera [14]. Some of our own prior work creates a point cloud by registering together information from a pair of RGBD cameras [18]. In this paper, we consider the idea of sequential view grasp detection, where a camera or depth sensor is mounted near the robotic hand (Figure 1(b)) and views the target object from a series of perspectives until finally attempting a grasp. There are a couple of reasons why multiple views could be better than one in the context of grasp detection. First, it might be possible to improve grasp detection performance by planning viewpoints relative to a specific object/grasp of interest. Second, it might be possible to reduce the effect * College of Computer and Information Science, Northeastern University, Boston, Massachusetts, USA [email protected] (a) (b) Fig. 1. (a) Illustration of grasp detection. (b) Depth sensor mounted near a robotic hand. of manipulator kinematic model errors on grasp success by viewing the object from an arm configuration close to the final grasp configuration. These questions are particularly important in the context of grasping with inexpensive robotic arms such as the Baxter arm where it is impossible to measure end effector pose using forward kinematics calcu- lations accurately. Moreover, whereas it is easy to register multiple views together into a single point cloud (or truncated signed distance function, TSDF) with accurate kinematic estimates, this becomes very difficult if one must rely SLAM techniques because ICP-like methods do not work well in empty/uncluttered near-field scenes. This paper explores the possibilities described above by performing experiments off-line using the BigBird object dataset [17] and on-line using our Baxter robot. Although all experiments were performed using variations of our own grasp detection algorithm, we expect the results to generalize to other algorithms similar to ours such as [11], [7], [4], [5]. We have several interesting findings. First, we find that viewpoint does indeed affect grasp detection performance, although the nature of the effect can depend on object type. Second, we find that there is little benefit to viewing the object from many additional viewpoints once a “good” viewpoint is obtained unless the different-viewpoint data is registered together. Third, we find that there is a significant advantage to viewing the object at close range from a kinematic configuration close to the expected grasp configuration. Doing so seems to make the overall algorithm less sensitive to manipulator kinematic model errors and can improve grasp success rates by as much as 9%. arXiv:1609.05247v1 [cs.RO] 16 Sep 2016

Marcus Gualtieri and Robert Platt - arXiv · measure end effector pose using forward kinematics calcu- ... we use a four-layer-deep convolutional neural ... An alternative to using

Embed Size (px)

Citation preview

Sequential View Grasp Detection For Inexpensive Robotic Arms

Marcus Gualtieri and Robert Platt∗

Abstract— In this paper, we consider the idea of improvingthe performance of grasp detection by viewing an object to begrasped from a series of different perspectives. Grasp detectionis an approach to perception for grasping whereby roboticgrasp configurations are detected directly from point cloud orRGB sensor data. This paper focuses on the situation wherethe camera or depth senor is mounted near the robotic hand.In this context, there are at least two ways in which viewpointcan affect grasp performance. First, a “good” viewpoint mightenable the robot to detect more/better grasps because it has abetter view of graspable parts of an object. Second, by detectinggrasps from arm configurations nearby the final grasp config-uration, it might be possible to reduce the impact of kinematicmodelling errors on the last stage of grasp synthesis just priorto contact. Both of these effects are particularly relevant toinexpensive robotic arms. We evaluate them experimentallyboth in simulation and on-line with a robot. We find that bothof the effects mentioned above exist, but that the second one(reducing kinematic modelling errors) seems to have the mostimpact in practice.

I. INTRODUCTION

Recently, grasp detection has been demonstrated to be apractical approach to perception for robotic grasping [6],[11], [7]. In grasp detection, machine learning methods areused to identify places in an image or point cloud where agrasp would be feasible. For example, Figure 1(a) illustratesgrasps detected by our algorithm given a point cloud of ashampoo bottle as input. The yellow grippers denote robotichand configurations from which a good grasp of the objectis expected to be feasible. Since grasp configurations aredetected without first estimating the object pose, grasp de-tection typically works well for grasping novel objects whenprecise information about object geometry is not known inadvance.

Most current approaches to grasp detection operate basedon an image or point cloud obtained by viewing the scenefrom a fixed viewpoint. For example, Levine’s work atGoogle uses input from a single RGB camera [14]. Someof our own prior work creates a point cloud by registeringtogether information from a pair of RGBD cameras [18]. Inthis paper, we consider the idea of sequential view graspdetection, where a camera or depth sensor is mounted nearthe robotic hand (Figure 1(b)) and views the target objectfrom a series of perspectives until finally attempting a grasp.There are a couple of reasons why multiple views could bebetter than one in the context of grasp detection. First, itmight be possible to improve grasp detection performanceby planning viewpoints relative to a specific object/grasp ofinterest. Second, it might be possible to reduce the effect

∗College of Computer and Information Science, Northeastern University,Boston, Massachusetts, USA [email protected]

(a) (b)

Fig. 1. (a) Illustration of grasp detection. (b) Depth sensor mounted neara robotic hand.

of manipulator kinematic model errors on grasp success byviewing the object from an arm configuration close to thefinal grasp configuration. These questions are particularlyimportant in the context of grasping with inexpensive roboticarms such as the Baxter arm where it is impossible tomeasure end effector pose using forward kinematics calcu-lations accurately. Moreover, whereas it is easy to registermultiple views together into a single point cloud (or truncatedsigned distance function, TSDF) with accurate kinematicestimates, this becomes very difficult if one must rely SLAMtechniques because ICP-like methods do not work well inempty/uncluttered near-field scenes.

This paper explores the possibilities described above byperforming experiments off-line using the BigBird objectdataset [17] and on-line using our Baxter robot. Althoughall experiments were performed using variations of ourown grasp detection algorithm, we expect the results togeneralize to other algorithms similar to ours such as [11],[7], [4], [5]. We have several interesting findings. First,we find that viewpoint does indeed affect grasp detectionperformance, although the nature of the effect can dependon object type. Second, we find that there is little benefit toviewing the object from many additional viewpoints once a“good” viewpoint is obtained unless the different-viewpointdata is registered together. Third, we find that there is asignificant advantage to viewing the object at close rangefrom a kinematic configuration close to the expected graspconfiguration. Doing so seems to make the overall algorithmless sensitive to manipulator kinematic model errors and canimprove grasp success rates by as much as 9%.

arX

iv:1

609.

0524

7v1

[cs

.RO

] 1

6 Se

p 20

16

II. BACKGROUND

A. Viewpoint selection

Optimal viewpoint selection falls into a large body ofresearch called active sensing. We only attempt to mentiona few works in this area in order to demonstrate where thepresent work fits in. Viewpoints can either be consideredindependently as in [20] or sequentially along a trajectoryas in [13], [21]. In this paper we examine the effects ofboth individual and multiple viewpoints on grasp detectionperformance.

In this work we do not attempt to reconstruct a scenefrom multiple views; although, this can sometimes be apowerful approach. We and others have explored this withsignificant benefit to grasp detection performance [9], [6],but the primary difficulty faced with this is that the ICP-based SLAM algorithms do not register multiple views wellin near-field, uncluttered scenes.

Many prior works see viewpoint selection as a problemin maximizing the amount of information gained from athe prospective viewpoint. This can be expressed in termsof entropy [20], KL divergence [19], or Fisher information[13]. The main decision in these various applications is theprobability distribution used for computing the informationobtained from the sensing action. A key challenge addressedby these approaches is the ability to correctly predict whatprobability distribution will be seen from a yet unobservedposition given the current information about the scene.

The advantage of our approach is that it is very easyto generate a distribution of view quality in the frameworkof grasp detection, and no additional training/prediction isneeded beyond what is provided by existing grasp detectionalgorithms. Also, this provides a very natural way to incorpo-rate any prior information that may be known about the typeof task being completed, since this distribution is somewhatobject-specific as we show later.

For overcoming kinematic errors when the arm is near theend of the task, visual servoing is frequently used. Generallythis requires a specific feature of an object or a fiducialon the object [1], which is difficult to get in open-worldenvironments. Levine et al. overcome this problem usingend-to-end deep learning [14]. The present work attemptsto limit these sorts of errors by viewing the scene as closeas possible to the planned grasp configuration.

B. Grasp detection

Traditional approaches to perception for robotic graspingare object-based [2]. First, they plan grasps with respect toa known mesh model of an object of interest. Then, theyestimate the 6-DOF pose of the object and project the graspconfigurations planned relative to the object mesh into thebase frame of the robot. While these methods can workwell in structured environments where object geometries areknown ahead of time, they are less well suited for unstruc-tured real world environments. In contrast, grasp detectionmethods treat perception for grasping like a typical computervision problem [16], [12], [15], [5], [3], [7], [10], [18], [6].

Instead of attempting to estimate object pose, grasp detectionestimates grasp configurations directly from image or pointcloud sensor data. These detected grasp configurations are 6-DOF poses of a robotic hand from which a grasp is expectedto be successful. Importantly, there is no notion of “object”here and there is no segmentation as a pre-processing step. Asuccessful grasp configuration is simply a hand configurationfrom which it is expected that the fingers will establish agrasp when they are closed.

(a) (b) (c)

Fig. 2. (a) input point cloud; (b) grasp candidates that denote potentialgrasp configurations; (c) high scoring grasps.

Most grasp detection methods have the following twosteps: grasp candidate generation and grasp scoring. Duringcandidate generation, the algorithm hypothesizes a largenumber of robotic hand configurations that could potentiallybe but are not necessarily grasps. Then, during grasp scoring,some form of machine learning is used to rank the candidatesaccording to the likelihood that they will turn out to begrasps. There are a variety of ways of implementing thesetwo steps. Figure 2 illustrates ours. Figure 2(b) shows a largeset of grasp candidates that denote potential grasp configura-tions. These grasp candidates are generated by searching forhand configurations that are obstacle free with respect to thepoint cloud and that contain some portion of the cloud be-tween the two fingers. Figure 2(c) shows a set of grasps thatwere scored very highly by the machine learning subsystem.In our case, we use a four-layer-deep convolutional neuralnetwork to make grasp predictions based on projections ofthe portion of the point cloud contained between the fingers.More details on this method can be found in [6].

III. FUSING SEQUENTIAL GRASP DETECTIONS

A natural question to ask is whether it is possible toimprove grasp detection accuracy by incorporating data frommultiple viewpoints. Our prior work has shown that it ispossible to improve grasp success rates by 9% just by usinga SLAM package to create a more complete point cloudof the scene [6]. In that work, we mounted a depth sensornear the wrist and used InfiniTAM [8] to integrate pointcloud data obtained by moving the hand in an arc over thescene. Unfortunately, because we typically view the sceneat relatively close range (approximately 45 cm), InfiniTAMfrequently loses track of the scene relative to the voxel gridwhile moving the arm. This was a particular problem when

(a) (b)

(c)

Fig. 3. (a) 20 viewpoints around a box a CheezIts; (b) grasp candidateson the box; (c) grasp detection accuracy for averaged scores (red) versusaccuracy for each view (blue).

viewing a near-empty tabletop where there is little structureto facilitate correct SLAM matching using ICP. In addition,because we were using the Baxter robot, we sometimesexperienced grasp failures due to kinematic modelling er-rors. Both of these problems are serious challenges whenattempting to do grasp detection using inexpensive robotichardware.

An alternative to using a SLAM package is to run graspdetection directly on each of multiple point clouds and tosomehow fuse the results. One simple approach is to createa single set of grasp candidates, score those candidatesagainst each of the multiple point clouds, and then to makepredictions based on the average score per grasp candidate.We performed this experiment off-line using data from theBigBird dataset [17]. BigBird is a database comprised of aset of 125 objects where each object is associated with 600different point clouds taken from different views at roughly aconstant radius about the object. In this experiment, we usedpoint clouds taken from the 20 different viewpoints shownin Figure 3(a) around a box of CheezIts (the black objectin the upper right of Figure 3(a)). We used the first pointcloud (denoted by a red circle in Figure 3(a)) to generate thegrasp candidates shown in Figure 3(b). Then, we scored eachof the grasp candidates using each of the 20 point clouds.The blue line in Figure 3(c) shows the accuracy of graspdetection as a function of view, which varies between 80%and 100%. The red line shows the detection accuracy thatis obtained by averaging scores from all 20 viewpoints. The

accuracy obtained by averaging scores exceeds the accuracyobtained from most of the individual viewpoints. However,there are two things to notice. First, if there is no wayof knowing which viewpoint is better than another, thenthere is value in averaging scores from different viewpoints.Second, the fact that the accuracy obtained from the bestviewpoint exceeds the average suggests that it might be betterto not average, but to obtain a good viewpoint and do graspdetection using that single cloud. We explore this idea in thenext section. Although the findings presented above are onlyfor the CheezIts box, other objects produce similar results.

IV. CHOOSING THE BEST VIEW FOR GRASP DETECTION

In the last section, we sampled grasps once and then eval-uated the accuracy of making predictions based on averagegrasp scores. For inexpensive robotic arms, this is a best-casescenario because it requires accurately registering the pointcloud in which the grasp candidate was obtained with thepoint cloud obtained from the new perspective. One solutionwould be to encode this uncertainty as “process noise” inthe context of sequential Bayesian filtering. However, sincethis will reduce the “averaged” accuracy shown in Figure 3even futher, it is probably not worth it. Therefore, in thissection, we consider the entire grasp detection process (bothcandidate generation and scoring) rather than just lookingat how to improve the accuracy of scoring. This work ispredicated on the idea that there is an approximate “targetgrasp” that describes how we would like to grasp a particularobject. For example, we envision that the target grasp on aflashlight would be a handle grasp from which the robot wereable to actuate the switch. Given the target grasp, we willdevelop a method of reasoning about how to view that graspin order to maximize the chances of detecting a large numberof grasps with high precision.

First, we use the BigBird dataset to characterize thenumber of ground truth positives versus negatives as afunction of viewpoint. Importantly, viewpoint is expressedin the reference frame of each individual grasp candidate(in terms of azimuth and elevation relative to the candidate).Figure 4(a-c) shows the result as a function of viewpointfor three different objects. This plot was calculated usingkernel density estimation with a spherical Gaussian kerneland a standard deviation of 0.15. Notice that for the tworectangular objects (the CheezIts and the Dove soap), therelative density of positives is maximized in the vicinity ofθ = π±0.6. Looking at the grasp directly along the approachvector works very poorly while looking at the grasp from oneside or other other helps a lot. The result is different for thecylindrical object (the Pringles) where the relative density ofpositives is high from most viewpoints with a peak occurringat θ = π.

The result described above is important because a highproportion of ground truth positives versus negatives in thecandidate set could enable us also to detect more grasppositives with high precision. We tested this hypothesis usingthe BigBird dataset. First, using a mesh provided by BigBirdfor a given object, we generate a ground truth grasp positive

(a) CheezIts (b) Dove Soap (c) Pringles (d) CheezIts (e) Dove Soap (f) Pringles

Fig. 4. (a-c) Density of positives as a function of viewpoint in the reference frame of the grasp candidate (θ =azimuth, φ =elevation) for the curvature-based grasp candidate generation strategy. Yellow denotes areas with high density of ground truth positives. Blue denotes areas with dense ground truthnegatives. (d-f) Same as Figure 4 except for the normals-based candidate generation strategy.

at random. This represents the “target grasp” that describeshow we would like to grasp the object. Our goal will beto detect grasps in the vicinity (within ±2 cm and ±20degrees) of the target grasp with high confidence. Usingthe relationships shown in Figure 4(a-c), we assign a scoreto each of the 600 viewpoints contained in the BigBirddataset for that object. Finally, we select the highest scoringview (the “smart viewpoint”), detect grasps in the corre-sponding cloud, and evaluate the number of ground truthgrasp candidates found. The results are shown in Table I.On average, the grasp candidates generated from the smartviewpoint contain several times more ground truth positivesthan the grasp candidates generated from random viewpoints.Moreover, these views give us better recall at high precision.In grasp detection, the cost of a false positive is particularlyhigh because it can cause a grasp failure. As such, we areparticularly interested in the characteristics of the detectionsystem at high precision. Our prior work introduced ameasure known as recall at 99% precision (equal to thepoint on the precision-recall curve at 99% precision) [6]. Theassumption here is that we want to recall as many grasps aspossible while maintaining no more than 1% false positives.It turns out that for grasps in the vicinity of the target grasp,recall at 99% precision is significantly higher for the plannedviewpoints relative to random viewpoint (the “Rec @ 99%”column in Table I). Moreover, the raw number of detectedtrue positives is higher at 99% precision as well (the “Pos @99%” column in Table I). This suggests a significant benefitto the planned view both in terms of the chances that thehighest scoring grasps is a true positive and in terms of thenumber of high-precision grasps that are available for therobot to choose from.

Why does viewpoint have such an impact? One can thinkabout grasp candidate generation as a set of heuristics thatsuggest potential grasp configurations. In this work, weuse a candidate generation strategy that searches discretizedorientations about the axis of maximum principle curvatureof the local object surface (we will call this the “curvature-based strategy”). Our choice to use this curvature axis hasan important effect on the shape of the plots shown in Fig-ure 4(a-c). Other grasp candidate generation strategies resultin different preferred viewpoint directions. For example, bothHerzog et al. [7] and Kappler et al. [10] use a grasp candidatemechanism that searches discretized orientations about the

Object View Type Num Pos Rec @ 99% Pos at 99%CheezeIts Best 194 0.807 177CheezeIts Random 41 0.220 41

Dove Soap Best 355 0.809 343Dove Soap Random 48 0.132 38

Pringles Best 456 0.584 300Pringles Random 122 0.170 100

TABLE IPERFORMANCE OF GRASP DETECTION FROM BEST VIEW VERSUS

RANDOM VIEW. Num Pos DENOTES THE NUMBER OF GROUND TRUTH

POSITIVES IN THE CANDIDATE SET; Rec @ 99% DENOTES RECALL AT

99% PRECISION (SEE TEXT); Pos at 99% DENOTES NUMBER OF

POSITIVES DETECTED AT 99% PRECISION.

local surface normal instead of a curvature axis (we call thisthe “normal-based strategy”). While this method also resultsin a strong relationship between viewpoint and density ofground truth positives, as shown in Figure 4(d-f), notice thatthe exact nature of the relationship is completely different.Therefore, while smart viewpoint selection is likely to helpregardless of the exact candidate generation strategy used, itis important to calculate the correct viewpoint relationshipsfor the the chosen candidate strategy. As a baseline, wealso evaluated this relationship for a strategy that searchedfor grasp candidate in uniformly random orientations. Thatstrategy also exhibited a strong relationship between view-point and the density of grasp positives. However, sincethis random strategy typically generates many times fewergrasp candidates overall than either of the other candidategeneration strategies, we do not consider it further.

V. SEQUENTIAL VIEW GRASP DETECTION

The experiments described at the end of Section IV sug-gest that it should be possible to improve grasp performanceby viewing grasps from the right viewpoint. In particular,if the system has prior knowledge about approximately howan object should be grasped, then it should be possible toplan views that make it easier to detect the desired grasps.However, this is only one way that viewpoint can help.Another way is to reduce the impact of inaccurate kinematicson grasp success rates. Many inexpensive robotic systems,such as the Baxter research robot, have inaccurate forwardkinematics. One way to reduce this impact is to detectgrasps relative to an arm configuration nearby the expected

final grasp configuration. Assuming that the magnitude ofkinematic errors grow with c-space distance, detecting thegrasps from a nearby configuration should help. Based onthese two ideas, we propose a three-view sequential graspdetection strategy that combines these ideas. We evaluatethe impact of this strategy on grasp success rates using theBaxter research robot in our lab.

A. A three-view sequential grasp detection strategy

We formulate the following three-view sequential graspdetection strategy. First, we view the scene from an arbitraryviewpoint in order to establish an approximate “target grasp”(first view). This target grasp expresses how we would liketo grasp an object. For example, if the objective is to graspand actuate a flashlight, then the target grasp would be ahandle grasp that is close enough to the switch to enablethe robot to actuate it. In the experiments for this paper,the first viewpoint was chosen randomly with the constraintsthat the sensor be 35 cm away from a fixed point on thetable and that the line-of-sight be within 53 degrees of thegravity direction. Grasps were then detected in the pointcloud acquired from this viewpoint, and we then selectedone to be the “target grasp” according to the same procedurein our prior work [6]. (Essentially this involves checking forreachability, collisions, grasp width, and heuristics of task-specific interest.)

Second, based on the target grasp, we select a viewpointfrom which that grasp would best be viewed (second view).This is accomplished using the previously modeled densityof positives versus negatives as illustrated in Figure 4. Ifthe object identity is known, then we use a model such asshown in Figure 4 directly. If object identity is unknown,then we use an averaged density model, given whatever priorinformation is available. In these experiments the densitieswere generated from box-like objects, since high-scoringviews for boxes seem to work well on cylindrical objectsas well. In our experiments, we constrain the second view tobe 30 cm from the target grasp. Once the view is obtained,we perform a second round of grasp detection and select agrasp within a threshold radius (8 cm in our experiments) ofthe target grasp. This selected grasp becomes the final targetgrasp.

Once the final target grasp is obtained, we move thesensor to a pre-grasp arm configuration and obtain a thirdview. Here, the sensor is as close to the target grasp aspossible while still satisfying the minimum-range constraintsof the depth sensor (about 24 cm in our experiments).Using the third view, we perform one final round of graspcandidate generation without performing the correspondinggrasp scoring step. The grasp candidate with the 6-DOF poseclosest to the final target grasp is selected for execution. Thefact that we do not score the grasps on this final step issignificant – the third view is not intended to get a goodlabel for the grasp, but simply to register the target graspthe point cloud. Essentially, the third view “shifts” the targetgrasp so that it is not in collision with the point cloud butstill close to the grasp found in the previous step.

This three-view process is illustrated for a single object ona tabletop in Figure 5. Figure 5(a) shows the first view wherean approximate target grasp is established in the middleof the box. Figure 5(b) shows the second view planed tomaximize the precision of grasp detection. Figure 5(c) showsthe third view that occurs just prior to grasping. Figure 5(d)shows the completed grasp. Figure 6 shows depth imagescorresponding to the second and third views.

(a) Second View (b) Third View

Fig. 6. Depth images that were used to generate point clouds for the secondand third views shown in Figure 5. The fingers of the robot are visible inthe top-center, and shadows of the fingers are on the top sides.

B. Experiments

We evaluated the preceding three-view strategy on ourBaxter robot. Our goal was to evaluate the benefit of theproposed three-view grasp detection strategy relative to vari-ous ablations of the strategy. We consider the following fourvariations:

1) The three-view strategy as described in Section V-A.2) Same as (1) except that we skip the final (third) view.3) Same as (1) except that we skip the middle (second)

view.4) Same as (1) except that we skip the final two views

and only perform the initial arbitrary view.

(a) (b)

Fig. 8. (a) All 25 objects used in robot experiments. (b) Cluttered pile of10 objects that the robot must clear.

The experimental protocol followed for each variation issimilar to the one proposed in our prior work [6]. First,10 objects are selected at random from a set of 25 and“poured” into a pile in front of the robot. (See Figure 8(a)for the object set and Figure 8(b) for an example pile ofclutter.) Second, the robot proceeds to automatically removethe objects one-by-one as the experimenter records successesand failures. This continues until either all of the objects

have been removed, the same failure occurs on the sameobject three times in a row, or no grasps were found after3 attempts. The sensor (Intel RealSense SR300) and gripperhardware used in the experiment are shown in Figure 1(b).Figure 7 shows the robot performing the first half of oneround of an experiment.

Views 1-2-3 Views 1-2 Views 1-3 Views 1Attempts 131 141 154 153Failures 17 31 26 39

Success Rate 0.87 0.78 0.83 0.75

TABLE IIGRASP SUCCESS RATES FOR THE 4 EXPERIMENTAL STRATEGIES.

(Error Type) Views 1-2-3 Views 1-2 Views 1-3 Views 1FK 10 21 9 28

Grasp 4 10 14 9Other 3 0 3 2

TABLE IIICOUNTS BY FAILURE TYPE FOR THE 4 EXPERIMENTAL STRATEGIES.“FK” MEANS A FORWARD KINEMATICS ERROR LED TO THE FAILURE

AND “GRASP” MEANS A DEFECT IN THE DETECTED GRASP LED TO A

FAILURE.

Grasp success rates for the 4 experimental strategies areshown in Table II. A grasp was counted as a success if theobject was lifted from the table and placed into a box on thefloor adjacent to the table. Failures were counted wheneverthe robot attempted a grasp but the object did not make it tothe box. The “attempts” row of the table shows the number ofgrasp attempts made using each strategy (579 grasp attemptstotal over all four strategies). The “failures” row shows thenumber of failures for each strategy and the “success rate”row is one minus the ratio between failures and attempts.These results suggest that both the second and third viewscontribute to successful grasps. Notice that the third viewseems to be more important than the second view in terms ofgrasp success rates: the grasp success rate is higher when thesecond view is removed as opposed to when the third viewis removed. Since grasp detection for the third view can runsquickly (only the sampling step is needed, classification isnot needed), this may be a good strategy for time-sensitivetasks.

The grasp failures in our experiments primarily fell intotwo categories as shown in Table III. The “FK” row inTable III denotes the number of grasp failures caused bya misalignment between the planned grasp and the actualobject in the real world. The “Grasp” row denotes the numberof grasp failures caused by a detection error. (Notice thatthe columnwise sum of the three rows in Table III is equalto the corresponding total number of failures indicated inTable II. The failure modes seem to reinforce intuition aboutwhat should happen if either view 2 or 3 is skipped. Ifview 3 is skipped, then we obtain a large number of FKfailures, presumably because we are not using the last viewto align the detected grasp with the final target grasp. If view

2 is skipped, then we obtain relatively more Grasp failuresbecause the algorithm does not get the best view of the targetgrasp.

VI. CONCLUSIONS AND FUTURE WORK

This paper describes work that demonstrates that view-point can have an important effect on grasp detection. Al-though all experiments are performed using a grasp detectionstrategy we developed in our prior work, we expect thatthey should extend to similar grasp detection strategies suchas [11], [7], [4], [5]. To summarize, we have found thatviewpoint can have an important impact on the relativeproportion of ground truth positives contained within theset of grasp candidates. This manifests itself in terms ofthe recall of grasp detection at high rates of precision (animportant measure which we seek to optimize). We have alsofound that viewing the object as close as possible to the targetgrasp configuration can significantly reduce the chances of amisalignment between the planned grasp configuration andan actual viable grasp. We proposed a sequential three-view strategy for grasp detection that we demonstrate tosignificantly improve practical grasp success rates relativeto a single random view.

In the future we would like to score our views withrespect to information-theoretic measures introduced in priorwork to see if there is a common relationship between theseapproaches. Our approach is nice because it fits in wellwith grasp detection and requires no additional expensivecomputation besides a table look-up to evaluate the scoresof viewpoints. More importantly, in the future we would liketo look at improving the way grasps are shifted on the finalapproach, perhaps by making the process continuous untilthe sensor leaves the minimum viewing range. We hopeto improve grasp success rates even further by optimizingperformance of grasping in the last few centimeters.

ACKNOWLEDGEMENTS

This work has been supported in part by the NationalScience Foundation through IIS-1427081, NASA throughNNX16AC48A and NNX13AQ85G, and ONR throughN000141410047.

(a) First view (b) Second View (c) Third View (d) Grasp

Fig. 5. Simple illustration of the three-view grasp detection strategy.

Fig. 7. First 5 grasps of a typical experiment trial. All grasps were successful. This was run with all 3 views included.

REFERENCES

[1] Francois Chaumette and Seth Hutchinson. Visual servo control, part i:Basic approaches. IEEE Robotics & Automation Magazine, 13(4):82–90, 2006.

[2] Sachin Chitta, E Gil Jones, Matei Ciocarlie, and Kaijen Hsiao. Percep-tion, planning, and execution for mobile manipulation in unstructuredenvironments. IEEE Robotics and Automation Magazine, Special Issueon Mobile Manipulation, 19(2):58–71, 2012.

[3] Renaud Detry, Carl H. Ek, Mariannna Madry, and Danica Kragic.Learning a dictionary of prototypical grasp-predicting parts fromgrasping experience. In IEEE Int’l Conf. on Robotics and Automation,pages 601–608, 2013.

[4] David Fischinger and Markus Vincze. Empty the basket - a shapebased learning approach for grasping piles of unknown objects. InIEEE Int’l Conf. on Intelligent Robots and Systems, pages 2051–2057,2012.

[5] David Fischinger, Markus Vincze, and Yun Jiang. Learning grasps forunknown objects in cluttered scenes. In IEEE Int’l Conf. on Roboticsand Automation, pages 609–616, 2013.

[6] Marcus Gualtieri, Andreas ten Pas, Kate Saenko, and Robert Platt.High precision grasp pose detection in dense clutter. In IEEE Int’lConf. on Intelligent Robots and Systems, 2016.

[7] Alexander Herzog, Peter Pastor, Mrinal Kalakrishnan, LudovicRighetti, Tamim Asfour, and Stefan Schaal. Template-based learningof grasp selection. In IEEE Int’l Conf. on Robotics and Automation,pages 2379–2384, 2012.

[8] Olaf Kahler, Prisacariu V. Adrian, Ren Carl Y. Yuheng, Xin Sun, PhilipTorr, and David Murray. Very high frame rate volumetric integrationof depth images on mobile device. In IEEE Int’l Symp. on Mixed andAugmented Reality, volume 22, pages 1241–1250, 2015.

[9] Gregory Kahn, Peter Sujan, Sachin Patil, Shaunak Bopardikar, JulianRyde, Ken Goldberg, and Pieter Abbeel. Active exploration using tra-jectory optimization for robotic grasping in the presence of occlusions.In IEEE Int’l Conf. on Robotics and Automation, pages 4783–4790,2015.

[10] Daniel Kappler, Jeannette Bohg, and Stefan Schaal. Leveragingbig data for grasp planning. In IEEE Int’l Conf. on Robotics andAutomation, 2015.

[11] Daniel Kappler, Sstefan Schaal, and Jeannette Bohg. Optimizing forwhat matters: the top grasp hypothesis. In IEEE Int’l Conf. on Roboticsand Automation, 2016.

[12] Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning fordetecting robotic grasps. The Int’l Journal of Robotics Research, 34(4-5):705–724, 2015.

[13] Daniel Levine, Brandon Luders, and Jonathan P. How. Information-rich path planning with general constraints using rapidly-exploringrandom trees. In AIAA Infotech@ Aerospace Conf., 2010.

[14] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen.Learning hand-eye coordination for robotic grasping with deep learn-ing and large-scale data collection. arXiv preprint arXiv:1603.02199,2016.

[15] Joseph Redmon and Anelia Angelova. Real-time grasp detection usingconvolutional neural networks. In IEEE Int’l Conf. on Robotics andAutomation, pages 1316–1322, 2015.

[16] Ashutosh Saxena, Justin Driemeyer, and Andrew Y. Ng. Roboticgrasping of novel objects using vision. The Int’l Journal of RoboticsResearch, 27(2):157–173, 2008.

[17] Arjun Singh, James Sha, Karthik S. Narayan, Tudor Achim, and PieterAbbeel. Bigbird: A large-scale 3d database of object instances. InIEEE Int’l Conf. on Robotics and Automation, pages 509–516, 2014.

[18] Andreas ten Pas and Robert Platt. Using geometry to detect graspposes in 3d point clouds. In Int’l Symp. on Robotics Research, 2015.

[19] Herke Van Hoof, Oliver Kroemer, Heni Ben Amor, and Jan Peters.Maximally informative interaction learning for scene exploration. InIEEE Int’l Conf. on Intelligent Robots and Systems, pages 5152–5158,2012.

[20] Pere-Pau Vazquez, Miquel Feixas, Mateu Sbert, and Wolfgang Hei-drich. Viewpoint selection using viewpoint entropy. In Int’l Conf. onVision Modeling and Visualization, volume 1, pages 273–280, 2001.

[21] Javier Velez, Garrett Hemann, Albert S. Huang, Ingmar Posner, andNicholas Roy. Planning to perceive: Exploiting mobility for robustobject detection. In Int’l Conf. on Automated Planning and Scheduling,2011.