
Detection of hand grasping an object from complex background based on machine learning co-occurrence of local image feature

Shinya Morioka∗, Yasuhiro Hiramoto∗, Nobutaka Shimada∗, Tadashi Matsuo∗, Yoshiaki Shirai∗

∗Ritsumeikan University, Shiga, Japan

Abstract— This paper proposes a method of detecting a hand grasping an object from one image based on machine learning. The learning process progresses as a bootstrap scheme from hand-only detection to object-hand relationship modeling. In the first step, a Randomized Trees classifier is built that learns the positional and visual relationship between image features of the hand region and the wrist position and that provides the probability distribution of wrist positions. Then the wrist position is detected by voting the distributions for multiple features in newly captured images. In the second step, the suitability as hand is calculated for each feature by evaluating the voting result. Based on that, the object features specified by the grasping type are extracted, and the positional and visual relationship between the object and the hand is learned so that the object center can be predicted from the hand shape. On the local coordinate system constructed from the predicted object center and the wrist position, a One Class SVM learns where the object, the hand region and the background are located for the target grasping type. Finally, the grasping hand, the grasped object and the background are respectively extracted from the cluttered background.

I. INTRODUCTION

There has been research on detecting unknown objects from images based on machine learning, such as [1]. In addition, other approaches have been proposed that exploit further image cues such as human actions and the kind of objects. In order to recognize various actions of the human hands, such as holding or picking up something, the hands must be detected and their postures estimated from images. In some previous studies the object category is recognized by using the functional relationship between an object and the hand action [2][3]. They simultaneously evaluate the image appearances of the hand and the object. However, when a hand is holding something, it is difficult to detect the hand and precisely classify its posture because some parts of the hand region may be occluded by the grasped object. This paper proposes a method that detects the position of a hand holding an object based on local hand features in spite of the occlusion. By learning local features from hands grasping various tools in various ways, the proposed method estimates the position, scale and direction of the hand and the region of the grasped object. Through experiments, we demonstrate that the proposed method can detect them in cluttered backgrounds.

II. WRIST POSITION DETECTION BASED ON RANDOMIZED TREES LEARNING RELATIVE POSITIONS OF THE LOCAL IMAGE FEATURES

A. Training of Randomized Trees model providing the probability distribution of wrist positions

Randomized Trees (RT) [4] is a multi-class classifier that accepts multi-dimensional features and provides probability distributions over the class indices. Here we construct a classifier that provides the probability distribution of wrist positions when SURF [5] image features are input into it. It is necessary to estimate the distribution even if the hand appears at an image position, scale and orientation different from those of the training images. For that purpose, the training data of the wrist positions are first normalized based on the position, scale and orientation of each SURF feature of the grasping hands seen in the training images, and then the pairs of normalized wrist positions and features are fed to the classifier for training. Since the probability distribution predicted by an RT is discrete, however, a continuous distribution over the image space cannot be directly treated by the RT. Therefore, based on the normalized coordinate, the normalized image space is divided into discrete grid blocks, each of which has a unique class label; the class label of the block on which the wrist position lies is determined, and then an RT is constructed that outputs the probability distribution of the wrist position over the grid blocks for an input 128-dimensional SURF image feature.

A local coordinate space defined for each SURF vector is segmented into blocks by a grid (Fig.1-(c)). The number of blocks is finite and they are indexed. A label j is an index that denotes either a block or the background.

The j-th block C_j is a region on the local coordinate space defined as

$$ C_j = \left\{ \begin{bmatrix} u \\ v \end{bmatrix} \;\middle|\; |u - u_c(j)| < S_{block}/2,\; |v - v_c(j)| < S_{block}/2 \right\}, \qquad (1) $$

where (u_c(j), v_c(j))^t denotes the center of C_j and S_block denotes the size of a block.

When a wrist position is x_wrist = (x_wrist, y_wrist)^t and a SURF vector v is detected on a hand, the label j of the SURF vector v is defined as the block index j such that T_v(x_wrist) ∈ C_j, where T_v denotes the translation into the local coordinate, defined as

$$ T_v(x, y) = \frac{1}{S_v} \begin{bmatrix} (x - x_v)\cos\theta_v + (y - y_v)\sin\theta_v \\ -(x - x_v)\sin\theta_v + (y - y_v)\cos\theta_v \end{bmatrix}, \qquad (2) $$

and (x_v, y_v) denotes the position where the SURF v is detected, and θ_v and S_v denote the direction and scale of the SURF v, respectively.
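As a concrete illustration of Eqs. (1) and (2), the following Python sketch (not part of the original paper) maps an image point into a keypoint's local frame and quantizes it into a block label. The block size, grid extent and helper names (T_v, block_label) are assumptions chosen for illustration only.

    import numpy as np

    S_BLOCK = 0.5          # block size S_block on the normalized coordinate (assumed value)
    GRID_HALF = 4          # the grid spans [-4, 4) blocks in u and v (assumed extent)
    LABEL_BACKGROUND = -1  # special label j_back for background features

    def T_v(x, y, kp_x, kp_y, kp_theta, kp_scale):
        """Map an image point into the local frame of a SURF keypoint, Eq. (2)."""
        dx, dy = x - kp_x, y - kp_y
        u = (dx * np.cos(kp_theta) + dy * np.sin(kp_theta)) / kp_scale
        v = (-dx * np.sin(kp_theta) + dy * np.cos(kp_theta)) / kp_scale
        return u, v

    def block_label(u, v):
        """Index j of the grid block containing (u, v), or j_back outside the grid, Eq. (1)."""
        iu = int(np.floor(u / S_BLOCK)) + GRID_HALF
        iv = int(np.floor(v / S_BLOCK)) + GRID_HALF
        if 0 <= iu < 2 * GRID_HALF and 0 <= iv < 2 * GRID_HALF:
            return iv * (2 * GRID_HALF) + iu
        return LABEL_BACKGROUND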

A label of a SURF vector detected on the background is defined as a special index j_back and it is distinguished from the block indexes.

To learn the relation between a SURF vector v and its label j, we collect many pairs of a SURF vector and its label from many training images where the true wrist position is known. Then, we train the RT with the pairs so that we can calculate a probability distribution P(j|v).
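A minimal sketch of this training step, assuming the descriptor/label pairs have already been collected with helpers like those above and stored in hypothetical .npy files; scikit-learn's ExtraTreesClassifier is used here as a stand-in for the Randomized Trees of [4], and the tree parameters are illustrative.

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier

    descriptors = np.load("surf_descriptors.npy")   # (N, 128) SURF vectors (assumed file)
    labels = np.load("block_labels.npy")            # (N,) block index j or j_back (assumed file)

    rt = ExtraTreesClassifier(n_estimators=50, max_depth=15, random_state=0)
    rt.fit(descriptors, labels)

    # P(j | v) for newly detected SURF descriptors: one row per feature,
    # one column per class label known to the forest (rt.classes_).
    p_j_given_v = rt.predict_proba(descriptors[:5])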

Fig. 1. Conversion of wrist position and positions of features to relative positional relationship class labels

B. Wrist position detection based on votes from the probability distribution of wrist positions

A wrist position is estimated by “voting” on the image space, based on the probability distribution P(j|v) learned by the RT. The vote function V_wrist(x, y) is defined as

$$ V_{wrist}(x, y) = \sum_{v \,\in\, \text{all SURF vectors}} V_v(x, y), \qquad (3) $$

where

$$ V_v(x, y) = \begin{cases} P(j = \tilde{j} \mid v) & \text{if } \exists\, C_{\tilde{j}} \text{ s.t. } T_v(x, y) \in C_{\tilde{j}}, \\ 0 & \text{otherwise}, \end{cases} \qquad (4) $$

and P(j|v) is the probability distribution calculated by the trained RT.
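The voting of Eqs. (3) and (4) might be implemented along the following lines, reusing the constants and T_v convention of the earlier sketch; the inverse mapping from block centers back to image coordinates and the pixel-level accumulator are implementation choices, not prescribed by the paper.

    import numpy as np

    def block_center(j):
        """Center (u_c(j), v_c(j)) of block j on the normalized coordinate."""
        iu = j % (2 * GRID_HALF)
        iv = j // (2 * GRID_HALF)
        return (iu - GRID_HALF + 0.5) * S_BLOCK, (iv - GRID_HALF + 0.5) * S_BLOCK

    def vote_wrist(keypoints, probas, classes, image_shape):
        """Accumulate V_wrist(x, y); keypoints = (x, y, theta, scale) per feature."""
        votes = np.zeros(image_shape, dtype=float)
        for (kx, ky, theta, scale), p in zip(keypoints, probas):
            for j, pj in zip(classes, p):
                if j == LABEL_BACKGROUND:
                    continue
                u, v = block_center(j)
                # inverse of Eq. (2): local frame -> image coordinates
                x = kx + scale * (u * np.cos(theta) - v * np.sin(theta))
                y = ky + scale * (u * np.sin(theta) + v * np.cos(theta))
                xi, yi = int(round(x)), int(round(y))
                if 0 <= yi < image_shape[0] and 0 <= xi < image_shape[1]:
                    votes[yi, xi] += pj
        return votes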

The position with the maximum votes, $\arg\max_{x,y} V_{wrist}(x, y)$, is considered a position suitable for a wrist. However, the global maximum may not be the true position, because the votes represent only a local likelihood and a global comparison of them is not meaningful. Therefore, we allow multiple candidate wrist positions that have locally maximal votes.

To find local maxima without explicitly defining local regions, we use mean-shift clustering. Seeds for the clustering are placed at regular intervals in the image space as shown in Fig.2(a), where a blue point denotes a seed, a green point denotes a trajectory, a red point denotes a limit of convergence, and the green and blue circles denote the first and second wrist candidates, respectively. The image space is clustered by the limit positions of convergence (Fig.2(b)). For each cluster, the position with the maximum votes is taken as a candidate (Fig.2(c)).
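A rough sketch of this candidate search, with a simple weighted mean-shift written out explicitly over the vote map: seeds on a regular grid drift toward local maxima, converged seeds are grouped, and the highest-voted pixel near each group becomes a candidate. The seed spacing, bandwidth and merging radius are illustrative assumptions.

    import numpy as np

    def mean_shift_candidates(votes, seed_step=20, bandwidth=15, n_iter=30, merge_dist=10):
        h, w = votes.shape
        ys, xs = np.mgrid[0:h, 0:w]
        seeds = [(float(x), float(y)) for y in range(0, h, seed_step)
                                      for x in range(0, w, seed_step)]
        limits = []
        for sx, sy in seeds:
            for _ in range(n_iter):
                wgt = votes * np.exp(-((xs - sx) ** 2 + (ys - sy) ** 2) / (2 * bandwidth ** 2))
                total = wgt.sum()
                if total <= 0:
                    break
                nx, ny = (wgt * xs).sum() / total, (wgt * ys).sum() / total
                if abs(nx - sx) < 0.5 and abs(ny - sy) < 0.5:
                    break
                sx, sy = nx, ny
            limits.append((sx, sy))
        # group converged seeds whose limits are close; keep the best-voted pixel per group
        candidates, used = [], [False] * len(limits)
        for i, (lx, ly) in enumerate(limits):
            if used[i]:
                continue
            group = [(lx, ly)]
            for k in range(i + 1, len(limits)):
                if not used[k] and (limits[k][0] - lx) ** 2 + (limits[k][1] - ly) ** 2 < merge_dist ** 2:
                    used[k] = True
                    group.append(limits[k])
            cx, cy = np.mean(group, axis=0)
            patch = votes[max(0, int(cy) - merge_dist):int(cy) + merge_dist,
                          max(0, int(cx) - merge_dist):int(cx) + merge_dist]
            candidates.append(((cx, cy), patch.max() if patch.size else 0.0))
        return sorted(candidates, key=lambda c: -c[1])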

Fig. 2. Flow of wrist candidate detection

If the background region is larger than the hand region, the above voting is disturbed by SURF features on the background. To overcome this, we roughly distinguish SURF features that seemingly originate from a hand from the others by an SVM [6]. We then perform the above voting algorithm with all SURF features except those apparently originating from non-hand regions.

An example of finding candidates is shown in Fig. 3. First, SURF features are extracted from an input image (Fig. 3(a)). Using the SVM, they are roughly classified, and SURF features apparently originating from non-hand regions are excluded (Fig. 3(b),(c)). For each of the remaining SURF vectors v, as mentioned above, the local coordinate is defined and the conditional probability distribution P(j|v) is predicted by the RT classifier (Fig. 3(d),(e)). By voting the probability distributions, candidates of wrist positions are determined (Fig. 3(f),(g)). The red rectangle in Fig. 3(g) is the detected wrist position, showing that the wrist is successfully detected.

Fig. 3. Flow of wrist position detection based on votes from the probability distribution of wrist positions

III. EXTRACTION OF HAND AND OBJECT REGION BASED ON LEARNING OF CO-OCCURRENCE OF HAND AND OBJECT FEATURES

To find a wrist position from features observed on a hand grasping an object, it is necessary to classify them into features on the hand and features on the object. We introduce a One Class SVM [7] for this classification. It is trained with object-relative positions represented in a wrist-object coordinate system. First, we estimate the center of gravity of the object by an RT and derive the wrist-object coordinate system from the estimated center. Then, we distinguish between feature points on the object and those on the hand by the One Class SVM.

A. Estimating gravity center of a hand by coarse classification

A wrist position can be found by the algorithm described in Section II. We can then derive a wrist-object coordinate system by finding the center of gravity of the object. Here, we are looking at images showing object-grasping states, and we need to classify the image features into those that belong to the hand and those that belong to the object. The suitability as hand h represents how much a detected SURF feature contributed to the voting for the wrist position when it was determined (i.e. the probability of the votes) (Fig.4-(b)). The position (x, y) within the image is classified using h. To calculate the suitability as hand h, let i be the index of a SURF feature that voted for v_max, and let P^i_wrist be the value voted by feature i at that position, as expressed in equation (5):

$$ v_{max} = \sum_i P^i_{wrist}. \qquad (5) $$

The higher the value of P^i_wrist, the more strongly feature i has contributed to determining the wrist position. In other words, the suitability as hand of each feature is given by equation (6):

$$ h_i = P^i_{wrist}. \qquad (6) $$
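One way to realize Eqs. (5) and (6) is to re-run the voting of Eqs. (3)-(4) one feature at a time and read back each feature's contribution at the chosen wrist position; the sketch below reuses the vote_wrist helper from the earlier sketch and is only an assumed reading of the procedure.

    import numpy as np

    def suitability_as_hand(keypoints, probas, classes, wrist_xy, image_shape):
        """h_i = P^i_wrist, the vote feature i cast at the chosen wrist position."""
        x_star, y_star = wrist_xy
        h = np.zeros(len(keypoints))
        for i, (kp, p) in enumerate(zip(keypoints, probas)):
            votes_i = vote_wrist([kp], [p], classes, image_shape)   # vote of feature i alone
            h[i] = votes_i[int(round(y_star)), int(round(x_star))]  # P^i_wrist, Eq. (6)
        v_max = h.sum()                                             # Eq. (5)
        return h, v_max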

K-means clustering is used to segment the positions of the features in the image, together with their suitability as hand, x = (x, y, h), into groups, in order to determine whether they are hand-class or object-class features. The h values of the cluster centers x_m1 and x_m2 are compared, and the cluster with the larger value is judged to be the hand class L_hand and the other the object class L_obj:

$$ h_{m_1} > h_{m_2} \;\rightarrow\; \begin{cases} m_1 \rightarrow m_{L_{hand}} \\ m_2 \rightarrow m_{L_{obj}} \end{cases} \qquad \text{otherwise} \;\rightarrow\; \begin{cases} m_1 \rightarrow m_{L_{obj}} \\ m_2 \rightarrow m_{L_{hand}} \end{cases} \qquad (7) $$

From this, using each key point x_i and the Euclidean distance d(x_i, x_m) to the cluster center vectors x_m, we determine the class to which each key point belongs. The hand class L_hand and the object class L_obj are assigned by equation (8):

$$ \begin{cases} d(x_i, x_{m_{L_{hand}}}) < d(x_i, x_{m_{L_{obj}}}) \;\rightarrow\; i \in L_{hand} \\ \text{otherwise} \;\rightarrow\; i \in L_{obj} \end{cases} \qquad (8) $$
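Eqs. (7) and (8) amount to a two-class k-means on (x, y, h) followed by a nearest-center assignment; a minimal sketch with scikit-learn, assuming the feature positions and h values come from the previous step:

    import numpy as np
    from sklearn.cluster import KMeans

    def split_hand_object(xy, h):
        """Return a boolean mask: True where a feature is assigned to the hand class."""
        X = np.column_stack([xy, h])                      # x_i = (x, y, h)
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
        centers = km.cluster_centers_
        hand_c = int(np.argmax(centers[:, 2]))            # Eq. (7): larger h -> hand class
        obj_c = 1 - hand_c
        d_hand = np.linalg.norm(X - centers[hand_c], axis=1)
        d_obj = np.linalg.norm(X - centers[obj_c], axis=1)
        return d_hand < d_obj                             # Eq. (8)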

This yields the centers of gravity of the wrist position and of the object-class features, as well as the wrist-object coordinate system (Fig.4-(c)). Fig.4 shows the hand region based on the effective scope of the SURF features classified into the hand class (a darker color indicates greater suitability as hand). (Pink rectangle: hand feature class, blue rectangle: object feature class.)

B. Learning of an RT to output the probability distribution of object positions

Using the process in the previous section, it is possible to segment the respective hand and object areas in the sample grasping images used for training. By using these segmented training images of object grasping, the relationship between the wrist position and the center position of the object in such grasping patterns can newly be learned.

Fig. 4. Flow of classification of hand and object features

Thus, as in Section II-A, an RT is trained on the center of gravity of the object in the grasping-hand sample training images extracted in the previous section and its relationship to the hand SURF features in the same images, and a predictor is built that estimates the object's location relative to the local features. After detecting the wrist position, the probability distribution output by this object-location predictor is voted into the image space relative to the SURF features that have high suitability as hand (h), and the object's center of gravity is detected. Fig.5 shows an example detection result. (Red rectangle: wrist position, blue rectangle: center of gravity of the object.)
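A sketch of this second predictor, reusing T_v, block_label and vote_wrist from the earlier sketches: each hand-class feature is labelled with the grid block containing the object center in its own local frame, a second forest is trained on these labels, and its run-time votes are weighted by h. The h-weighting is an assumed reading of "relative to SURF features that have high suitability as hand".

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier

    def train_object_center_rt(hand_descriptors, hand_keypoints, object_centers):
        """Train an RT mapping hand SURF descriptors to object-center block labels."""
        labels = []
        for (kx, ky, theta, scale), (ox, oy) in zip(hand_keypoints, object_centers):
            u, v = T_v(ox, oy, kx, ky, theta, scale)      # Eq. (2) relative to this feature
            labels.append(block_label(u, v))              # Eq. (1) block index
        rt_obj = ExtraTreesClassifier(n_estimators=50, max_depth=15, random_state=0)
        rt_obj.fit(hand_descriptors, labels)
        return rt_obj

    def vote_object_center(rt_obj, descriptors, keypoints, h, image_shape):
        """Vote the object-center distribution, weighting each feature by its h."""
        probas = rt_obj.predict_proba(descriptors) * h[:, None]
        return vote_wrist(keypoints, probas, rt_obj.classes_, image_shape)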

Fig. 5. Detection of center of gravity of objects.

C. Learning of a One Class SVM to identify hand and object features based on the wrist-object coordinate system

With the classifiers obtained in the previous section, it becomes possible, given an image of a hand grasping an object, to predict at what position within the image the center of the object is likely to appear. Since the wrist-object coordinate system can then be set up from the detected wrist position and the predicted central point of the object as explained above, it is possible to learn, in these reference coordinates, where the grasped object area appears relative to the hand, and thus to segment the object area and the background area.

In the wrist-object coordinate system defined in Section III-A, both a hand-feature One Class SVM and an object-feature One Class SVM are built in order to estimate the range of the hand and object regions in the image. A One Class SVM is a classifier that learns from only positive instances to classify items into two classes. For hand features, it learns a classifier from the input position and the likeness to a hand, (x, y, h), in the wrist-object coordinate system. The conversion of the image coordinates x = (x, y, h) into the wrist-object coordinates x′ is expressed in equation (9):

$$ x' = (x', y', h). \qquad (9) $$

Specifically, the classifier f_hand(x′) for hand-class features becomes equation (10):

$$ \begin{cases} f_{hand}(x') > 0 \;\rightarrow\; x' \in L_{hand} \\ \text{otherwise} \;\rightarrow\; x' \notin L_{hand} \end{cases} \qquad (10) $$
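A hand-feature One Class SVM as in Eqs. (9) and (10) could be sketched with scikit-learn as follows, assuming the (x', y') positions have already been transformed into the wrist-object coordinate system; the kernel and nu values are illustrative.

    import numpy as np
    from sklearn.svm import OneClassSVM

    def fit_hand_ocsvm(hand_xy_prime, hand_h):
        """Train on positive hand samples x' = (x', y', h), Eq. (9)."""
        X = np.column_stack([hand_xy_prime, hand_h])
        return OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X)

    def is_hand_feature(ocsvm, xy_prime, h):
        """Eq. (10): accept a feature as hand class when f_hand(x') > 0."""
        X = np.column_stack([xy_prime, h])
        return ocsvm.decision_function(X) > 0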

For object features, a “suitability as object class” common to the appearance of all objects that are typically grasped cannot be evaluated, so the classifier is learned from only the two-dimensional position (x, y) in the wrist-object coordinate system of the image. The object-class feature classifier f_obj(y′), which uses y′ = (x′, y′) as the wrist-object coordinates converted from the image coordinates, is expressed in equation (11):

$$ \begin{cases} f_{obj}(y') > 0 \;\rightarrow\; y' \in L_{obj} \\ \text{otherwise} \;\rightarrow\; y' \notin L_{obj} \end{cases} \qquad (11) $$

x′ and y′ are calculated from the input image. Since countless objects can appear for any given shape of a hand grasping an object, f_hand(x′) is trusted more than f_obj(y′). Thus, recognition of the object candidate class is expressed as in equation (12):

$$ \begin{cases} f_{obj}(y') > 0 \,\cap\, f_{hand}(x') \le 0 \;\rightarrow\; y' \in L_{cobj} \\ \text{otherwise} \;\rightarrow\; y' \notin L_{cobj} \end{cases} \qquad (12) $$

The classification function f_l(x′, y′) for the hand class L_hand, the object candidate class L_cobj and the background class L_back, obtained from equations (10), (11) and (12), is expressed in equation (13):

$$ f_l(x', y') = \begin{cases} L_{hand} & f_{hand}(x') > 0 \\ L_{cobj} & f_{obj}(y') > 0 \,\cap\, f_{hand}(x') \le 0 \\ L_{back} & \text{otherwise} \end{cases} \qquad (13) $$
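The decision rule of Eq. (13) then reduces to a few comparisons of the two decision functions; in the sketch below the label constants and the two fitted One Class SVMs (e.g. from fit_hand_ocsvm above and an analogous object-feature model trained on (x', y') only) are illustrative assumptions.

    import numpy as np

    L_HAND, L_COBJ, L_BACK = 0, 1, 2

    def classify_features(hand_ocsvm, obj_ocsvm, xy_prime, h):
        """Label every feature as hand, object candidate, or background, Eq. (13)."""
        f_hand = hand_ocsvm.decision_function(np.column_stack([xy_prime, h]))
        f_obj = obj_ocsvm.decision_function(xy_prime)
        labels = np.full(len(xy_prime), L_BACK)
        labels[f_hand > 0] = L_HAND                        # Eq. (10)
        labels[(f_obj > 0) & (f_hand <= 0)] = L_COBJ       # Eq. (12)
        return labels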

Fig.6 shows an example result of classification into the hand, object, and background classes (features that are neither hand nor object) using these two classifiers. (Pink rectangle: hand feature, blue rectangle: object feature, green rectangle: background feature, red rectangle: wrist position.)

Fig. 6. Result of feature class classification

IV. CONCLUSION

We constructed an RT that outputs the probability distribution of wrist positions, trained on the wrist positions relative to the SURF features of the hand class, and then searched for wrist positions against a complex background. We calculated the suitability as hand of the SURF features from the detected wrist positions and classified the features into the hand class and the object class. From the results of this classification, we constructed an RT that outputs the probability distribution of the object position relative to the wrist, trained on its relationship with the location of the object's center of gravity, and thereby set up the wrist-object coordinate system. To infer the hand and object areas in this wrist-object coordinate system, we constructed a One Class SVM of the hand features and a One Class SVM of the object features, and carried out the segmentation of the hand, object and background areas under a complex background using the two constructed classifiers. In the future, how various objects change the grip of the hand will be studied in order to extract objects based on that relationship.

REFERENCES

[1] S. Kawabata, S. Hiura, and K. Sato, “A Rapid Anomalous Region Extraction Method by Iterative Projection onto Kernel Eigen Space,” The Institute of Electronics, Information and Communication Engineers, D, J91-D(11), pp. 2673-2683, November 2008.

[2] N. Kamakura, Shape of hand and Hand motion, Ishiyaku Publishers, 1989.

[3] H. Kasahara, J. Matsuzaki, N. Shimada, and H. Tanaka, “Object Recognition by Observing Grasping Scene from Image Sequence,” Meeting on Image Recognition and Understanding 2008, IS2-5, pp. 623-628, 2008.

[4] V. Lepetit and P. Fua, “Keypoint recognition using randomized trees,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), pp. 1465-1479, 2006.

[5] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “SURF: Speeded Up Robust Features,” Computer Vision and Image Understanding, Vol. 110, No. 3, pp. 346-359, 2008.

[6] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.

[7] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural Computation, 13(7), pp. 1443-1471, 2001.