

Image and Vision Computing 30 (2012) 26–37


Real-time viewpoint-invariant hand localization with cluttered backgrounds☆,☆☆

Enver Sangineto a,⁎, Marco Cupelli b

a Pattern Analysis and Computer Vision (PAVIS), Istituto Italiano di Tecnologia, via Morego 30, 16163, Genova, Italy
b Gepin S.p.A., via degli Artificieri, 53-00143 Rome, Italy

☆ This paper has been recommended for acceptance by Maja Pantic. ☆☆ This paper has been recommended for acceptance by Vladimir Pavlovic.

⁎ Corresponding author. E-mail address: [email protected] (E. Sangineto).

0262-8856/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.imavis.2011.11.004

Article info

Article history: Received 4 May 2010; Received in revised form 13 October 2011; Accepted 4 November 2011

Keywords: Hand detection; Articulated object recognition; Model based techniques; Geometric constraints; Graph matching; Curve matching

Abstract

Over the past few years there has been a growing interest in visual interfaces based on gestures. Using gestures as a means to communicate with a computer can be helpful in applications such as gaming platforms, domotic environments, augmented reality or sign language interpretation, to name a few. However, a serious bottleneck for such interfaces is the current lack of accurate hand localization systems, which are necessary for tracking (re-)initialization and hand pose understanding. In fact, the human hand is an articulated object with a very large degree of appearance variability, which is difficult to deal with. For instance, recent attempts to solve this problem using machine learning approaches have shown poor generalization capabilities over different viewpoints and finger spatial configurations.
In this article we present a model based approach to articulated hand detection which splits this variability problem by separately searching for simple finger models in the input image. A generic finger silhouette is localized in the edge map of the input image by combining curve and graph matching techniques. Cluttered backgrounds and thickly textured images, which usually make it hard to compare edge information with silhouette models (e.g., using chamfer distance or voting based methods), are dealt with in our approach by simultaneously using connected curves and topological information. Finally, detected fingers are clustered using geometric constraints. Our system is able to localize in real time a hand with variable finger configurations in images with complex backgrounds, different lighting conditions and different positions of the hand with respect to the camera. Experiments with real images and videos and a simple visual interface are presented to validate the proposed method.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Recognizing and tracking hands in video sequences can be exploited in the construction of "smart" and "natural" Human-Computer Interfaces (HCI) in which the user communicates with a computer by means of simple gestures and a digital camera. Examples of possible applications are: visual interfaces [7], automatic/semiautomatic domotic environments [16] in which the user controls a television and other household electric appliances using gestures, augmented reality [19], video games [16], sign language interpretation [26], etc. (see [27] and [10] for a review).

Despite the noticeable application interest, developing reliable algorithms for the detection and tracking of articulated objects such as the human body and its limbs is still an open problem in computer vision. Significant progress has been made in recent years in generic object detection using machine learning techniques (e.g., [36,35,12]). In spite of this, state-of-the-art object detectors usually assume they are dealing with rigid objects. Difficulties with articulated objects arise from the fact that the large appearance variability of a generic object class becomes even larger for articulated object classes, due to the variable position of each of the object's sub-parts with respect to the others. This extreme variability makes the application of machine learning techniques hard, both because a huge number of training samples is needed to represent all the possible situations and because of the non-linearities introduced in the class boundaries.

Most of the solutions proposed so far for this problem are based on fixed configurations of the rigid components of the object to be recognized (e.g., in the hand case, fixed positions of each finger with respect to the others) or on constraints on the application domain, such as on the nature of the image/video background. Currently, recognizing articulated objects with full degrees of freedom (DOFs) in images with cluttered backgrounds, and with real-time performance, is a very hard problem.

In this paper we present an approach to articulated object detection, applied to the problem of hand detection, which is based on two phases. In the first phase we separately search for the object's rigid sub-parts (i.e., the fingers in our case) using a curve matching technique. Curve matching is preferred to other edge-based comparison techniques such as the chamfer distance because it is more accurate, as shown in Section 4, resulting in a very low False Alarm Rate (FAR). However, curve matching is based on the point-to-point comparison of a pair of connected lines, whereas edge detection is an unstable and noisy process and the resulting object borders are usually non-connected (e.g., see Figs. 2 (a) and 3 (b)). In order to address these problems we pre-organize the edge information using a graph structure and suitable graph visit algorithms to collect the (possibly non-connected) edge segments associated with the graph elements. Finally, in the second phase the relative positions of the detected fingers are analyzed in order to validate the correct detection of a hand by checking a set of geometric constraints.

This combination of curve and graph-based techniques makes our system able to localize a hand in cluttered images, with different relative positions of a finger with respect to the others and with variable positions of the whole hand with respect to the scene. Minor out-of-plane rotations of the hand are tolerated as well.

We have constructed a simple demonstrative gesture-based automatic interface using this approach. The proposed interface allows a user facing a camera to communicate with the computer without using the keyboard. Simple commands have been developed to control a graphic tool. This tool is able to grab a snapshot of the current camera frame, to save it, and to draw lines and surfaces on the frame following the user's hand movements, like a brush on a virtual paper.

The article is organized as follows. In Section 2 we give an overview of related work on articulated object detection. In Section 3 we show some pre-processing techniques we adopt to select a Region of Interest (ROI) in the currently analyzed frame. In Sections 4 and 5 we show, respectively, how we represent and localize rigid sub-parts (fingers) of the hand, while in Section 6 we describe how the detected fingers are clustered for checking the articulated hand configuration. Experimental results as well as a brief presentation of the example interface are given in Section 7. Finally, we conclude in Section 8.

2. Related works

Boosting-based machine learning techniques for hand detection tasks have been adopted by Kolsch et al. [18,19], who use the Viola–Jones object detector [36] in order to recognize hands in fixed finger configurations. They selected a set of hand shapes (e.g., open hand, closed hand, fist, etc.) and trained a different classifier for each shape. The most accurate classifier, corresponding to the open hand shape, has then been selected to be used by the tracking system, which is able to recognize and track open hands in images with a difficult cluttered background. However, hand detection, necessary for the tracking initialization, can deal with only limited pose variations (e.g., only up-right hands) and a fixed shape: the training and testing finger configurations (open hand) must be similar.

Articulated objects with variable shapes (the human body) are dealt with in [37] using the multiclass version of Boosting proposed in [35], in which features can be shared among different classes, while a ("standard") Boosting approach for hand detection is also used in [25]. In the latter case hand shape variability is addressed by clustering training samples using shape context descriptors [3] before performing the Boosting-based training. The final system is organized into a two-layer tree structure, where each node of the tree is a classifier specialized in a different hand shape. However, the generalization potential of the method is not clear, since testing results are shown only with images having the same, quite simple background as the training samples, missing an evaluation of the proposed method in more challenging scenarios (e.g., outdoors or with cluttered backgrounds).

Stenger et al. use a set of several thousand different 2D shape models, organized in a three-level tree, exhaustively matching these models to the edge map using the chamfer distance [33]. This large set of models should cover position changes in a viewer-centered approach. In fact, each possible discrete value of the hand's position parameters is represented by a specific 2D hand model, synthesized off-line using a 3D model. The hierarchical structure in which the models are clustered alleviates the number of model-image comparisons required to analyze each video frame, and the chamfer distance is used to further speed up the search. However, the whole process is still quite slow, requiring 2–3 s per image on a standard PC as shown in the experimental results [33]. Moreover, the finger configuration is fixed in order to restrict the number of 2D shape models the system needs to compare with the input image.

A similar approach is proposed by Athitsos et al. in [2], where 15 hand joints and 20 DOFs are represented using a hand configuration vector. The 20 parameters of this vector, together with the 3 parameters describing the camera point of view, are estimated by comparing the input image with a huge set of 107,328 images of synthetic hands produced off-line using a 3D hand model and computer graphics techniques. Different metrics for hand shape comparison are combined (e.g., chamfer distance and line matching). For a correct hand localization the proposed system requires that a bounding box is provided a priori to select the image portion containing the searched-for hand.

As is evident from the previous examples, the large number of the hand's DOFs together with the different points of view make the number of possible hand appearances difficult to deal with. In the Analysis by Synthesis paradigm the hand's DOFs are typically represented as points in the configuration space, where each point corresponds to a specific value of the joint angles and/or the hand position [27,23,22,38]. A 3D hand model is used to produce on-line a hand shape corresponding to the current point in the configuration space, which is then compared with the input image. The difference with respect to the works of Stenger et al. and Athitsos et al. is that the synthetic hand shape images are produced on-line rather than stored in the system's database. In order to make the configuration space search tractable, geometric constraints are usually adopted to restrict the admissible DOFs. In [21,22] Moeslund et al. propose a method for detection and tracking of a human arm when the user is sitting in front of the camera. Geometric and kinematic constraints on the possible movements of the upper and the lower arm with respect to the torso are used to restrict the configuration space. Moreover, collision constraints rule out those configurations in which more than one body element occupies the same 3D space. However, the system assumes that the hand is a "stiff" extension of the lower arm along its axis and relies on accurate color-based skin segmentation.

Other authors represent the hand shape deformations using statistical models such as Active Shape Models (ASMs) [8]. In the training phase a Point Distribution Model (PDM) is learned in order to describe the shape variability. The concatenation of the 2D coordinates of specific landmark points on the hand forms a shape vector, and the PDM is a generative model of such shape vectors represented in a reduced dimensionality space obtained by means of Principal Component Analysis (PCA). At detection time a greedy algorithm alternates between local point matching and adjusting the current shape parameter values [20,9]. In [20] this approach is adopted using a skin color based segmentation of the image to extract the hand silhouette. The hand landmarks are the points corresponding to the highest curvature values on the silhouette. In [9] a similar technique is proposed using the distance of the silhouette points from a reference center to estimate the fingertips composing the landmark set. Both approaches rely on an accurate skin color segmentation to extract the hand shape. Moreover, ASMs are known to suffer from different drawbacks, the most important of which is the necessity of an initial, quite accurate localization of the hand shape position for the greedy algorithm to converge toward a global optimum.


Fig. 1. An example of input image (a) and the corresponding final ROI mask (b).


The previous framework is extended in [39] by introducing a statistical treatment of the dynamics of the 20 hand DOFs. An Iterative Closest Point (ICP) algorithm [4] is adopted to perform matching between edge and model points. The kinematics of the finger configuration is then represented in a reduced dimensionality space obtained by means of Eigen Dynamic Analysis [39].

Shimada et al. [31] propose a curve matching approach to compare the hand silhouette with a set of stored model silhouettes. The deformation due to the relative finger movements is dealt with by using maxima points in the input silhouette (corresponding to fingertips) in order to warp the input shape. Nevertheless, the assumption of a completely connected curve representing the input hand silhouette, extracted by comparing the hand and the background brightness, is hard to satisfy in realistic scenarios. In the present paper we also use curve matching to assess the model-input point-to-point error. However, unlike [31], we set out to work with noisy edge detection with possibly unconnected lines, and we rely on a graph-based representation of the edge map to deal with non-connectivity.

The possibility of separately searching for each of the object's sub-parts in the image and then assembling the rigid detection results has been explored by different works [13,15,28]. In [28] the authors describe an articulated object (e.g., the human body) by a tree structure with nodes associated with the rigid components and arcs representing spatial relations between pairs of sub-parts linked by a joint. Rigid components are separately searched for in the image using ribbon-like colored templates. After that, detected ribbons are progressively grouped and, for each group, perspective invariant features are computed taking into account all the components of the group. The corresponding feature vector is used to classify the object using a previously trained statistical classifier.

Ribbons organized in a tree structure are also used by Felzenszwalb et al. [13]. They extend distance transform techniques in order to deal with "template and spring" [14] metrics and reduce the clustering time of the object's rigid components from O(m²n) to O(mn) (m being the number of discrete position values of a component in the image and n the number of components). Nevertheless, the proposed approach is still slow, requiring about 10 s on a standard PC to recognize an 11-component object in a very low-resolution image. Moreover, spatial relations described by a "template and spring" metric assume a "preferred" configuration of the object's components with respect to other possible configurations, which is not always realistic.

Other authors propose a probabilistic framework in order to model both the uncertainty related to rigid sub-part detection and the mutual spatial relations of the sub-parts. For instance, Sigal et al. [32] represent the conditional probability of the position of a given human body limb with respect to the positions of the adjacent limbs using a non-parametric probabilistic formulation. Moreover, a variation of particle filtering is proposed to combine probabilistic information in formulating a final articulated body configuration hypothesis. Unfortunately, the proposed system needs multiple views of the scene, given by 4 calibrated cameras, to perform limb detection. Moreover, the conditional probabilities governing mutual limb positions need to be learned using a huge amount of ground truth data (specifically, the authors used 4928 ground truth frames [32]), which is not always easy to collect.

As a matter of fact, state-of-the-art machine learning based methods have up to now shown poor results in dealing with articulated objects, due to the high variability of their appearances. On the other hand, model based techniques, such as those which rely on either the off-line or the on-line production of synthetic images of all the possible hand shapes, have computational problems as well as open image-model matching issues. In our part-based approach, we use simple models of the finger silhouette and a powerful combination of curve and graph matching in order to efficiently search for candidate fingers in the edge map and assemble the results.

3. Pre-processing

Hand movement in human gestures is difficult to predict, and common tracking algorithms based on assumptions on the motion model (e.g., the Kalman filter) usually show very low performance in this domain [33]. For this reason we adopt a tracking-by-detection strategy, i.e., we perform hand detection in every frame of the video sequence (e.g., [33]). To speed up detection, we first apply to each frame a set of pre-processing operations aimed at selecting a Region of Interest (ROI) on which the subsequent localization process can focus attention.

Basically, we perform combined motion and skin detection in order to select a ROI in the current frame. Motion detection is realized by means of background subtraction and a simple running average-based recursive technique for the background updating [6]. On the foreground pixels resulting from the motion analysis, we perform a rule-based skin detection on an R-G projection of the RGB space (see [29] for more details). Let us call R_t a binary mask representing the ROI in frame t, defined as:

$$R_t(p) = \begin{cases} 1 & \text{if } p \text{ is a skin-colored foreground pixel} \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

Finally, on the ROI so selected, we perform Canny edge detection [5] (hysteresis threshold values used: 30 and 70) and we prune small edge lines. We indicate with D_t the pruned (binary) edge map of I_t. From now on we will omit the subscript t when it is clear that the reference is to the current frame t. Figs. 1 (b) and 2 (a) show, respectively, the ROI and the edge map of the same input image.

It is worth noticing that only edge detection is strictly necessary for the subsequent processing steps, the ROI selection being used only to speed up computation by discarding part of the image to analyze. Moreover, when the system is applied to still images, the ROI mask R_t() is computed by means of the skin detection step only. Furthermore, it is well known that both color and motion based segmentation algorithms are prone to either false negatives (missing areas) or false positives (unimportant areas selected as relevant). However, our interest here is only to discard parts of the image in which the hand does not need to be searched for. For these reasons, we tuned our segmentation thresholds in order to obtain a ROI mask with possible false positive regions, minimizing the probability of false negatives. As will be clear later (Section 5), connected regions of positive elements of R_t() are analyzed in turn searching for hand shapes, which leads to rejecting false positive regions. Unlike other approaches which rely on color and/or motion segmentation to extract the hand silhouette (see Section 2), our proposed shape matching algorithm is mainly based on edge information (Sections 4–5). For this reason we are more robust with respect to color segmentation errors and we can handle a larger range of lighting conditions.
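As a rough sketch of this pre-processing stage (not the authors' actual implementation), the following Python fragment combines running-average background subtraction, a simple rule on normalized R-G components and Canny edge detection. Only the Canny hysteresis thresholds (30, 70) come from the text; the update rate, motion threshold, skin rules and the omitted small-segment pruning are illustrative assumptions.

```python
import cv2
import numpy as np

ALPHA = 0.05                     # background running-average update rate (assumed)
MOTION_THR = 25                  # foreground threshold on |frame - background| (assumed)
CANNY_LOW, CANNY_HIGH = 30, 70   # Canny hysteresis thresholds used in the paper

def roi_and_edges(frame_bgr, background):
    """Return a ROI mask R_t (Eq. 1) and an edge map D_t for the current frame.

    `background` is a float32 gray image updated in place; the small-segment
    pruning of the edge map mentioned in the text is omitted here.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)

    # Motion detection: background subtraction + running-average update [6].
    foreground = np.abs(gray - background) > MOTION_THR
    cv2.accumulateWeighted(gray, background, ALPHA)

    # Rule-based skin detection on normalized R-G components (the actual rules
    # are in [29]; these thresholds are placeholders).
    b, g, r = cv2.split(frame_bgr.astype(np.float32) + 1.0)
    total = b + g + r
    skin = (r / total > 0.35) & (g / total > 0.28)

    roi = (foreground & skin).astype(np.uint8)   # R_t of Eq. (1)

    # Canny edge detection restricted to the ROI.
    edges = cv2.Canny(gray.astype(np.uint8), CANNY_LOW, CANNY_HIGH)
    edges[roi == 0] = 0
    return roi, edges
```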


Fig. 2. (a) An example of edge detection result after small segment pruning. (b) A shape (left) and an appearance (right) model of a finger.


4. The finger model

As a first step in our proposed hand detection approach, we search for a possible match between the image and an affine-transformed representation of the finger's appearance and shape models. The finger's appearance model M^(A) is a simple geometric and color model, which can possibly be extended to include texture information (e.g., when dealing with articulated objects different from the human hand). M^(A) is given by a set of points describing the inner part of a finger (see Fig. 2 (b)), represented in a 2D reference frame and projected into the input image to verify the presence of skin and motion pixels and the absence of noisy edge pixels.

More formally, let T be a rigid affine transformation from the model reference frame to the image reference frame and {M_s^(A)}_{s=1,...} a set of appearance models built off-line for different discrete scale values s. In turn, M_s^(A) is a set of 2D points: M_s^(A) = {r_1, ..., r_{h_s}}. If we assume that the movements of the whole hand and of the single fingers happen in a plane parallel to the camera plane, we can restrict our attention to similarity transformations. Hence, T can be defined by a 4-parameter vector v = (t_x, t_y, θ, s)^T, t_x and t_y being the translation offsets, θ the rotation angle and s the scaling factor:

$$T_v(q) = \begin{pmatrix} t_x + x \cdot s \cdot \cos\theta - y \cdot s \cdot \sin\theta \\ t_y + x \cdot s \cdot \sin\theta + y \cdot s \cdot \cos\theta \end{pmatrix}, \qquad (2)$$

where q = (x, y)^T. With these assumptions, the appearance error E^(A) obtained in projecting the finger model M_s^(A) into the image using T_v (with v = (t_x, t_y, θ, s)^T) is given by:

$$E^{(A)} = E^{(R)} + \lambda E^{(D)}, \qquad (3)$$

where E^(R) is the error computed on the ROI mask, E^(D) the error on the edge map and λ a weighting factor:

$$E^{(R)} = \sum_{i=1}^{h_s} \left[ 1 - R(T_v(r_i)) \right] = h_s - \sum_{i=1}^{h_s} R(T_v(r_i)), \qquad (4)$$

$$E^{(D)} = \sum_{i=1}^{h_s} D(T_v(r_i)). \qquad (5)$$

Eq. (4) states that the error in projecting M^(A) into the ROI of the current frame is given by the number of pixels for which R(T_v(r_i)) = 0. Conversely, in Eq. (5) the error is proportional to the number of noisy edge pixels possibly present in the inner part of the finger (D(T_v(r_i)) = 1). While in principle we could rely on only one appearance model M^(A), we prefer to compute off-line different versions M_{s_1}^(A), M_{s_2}^(A), ..., each having its own cardinality (h_{s_1}, h_{s_2}, ...), in order to avoid expensive run-time scaling operations. We currently use 5 different appearance models, whose scales are a factor of 1.2 apart.
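A minimal sketch of how the similarity transform of Eq. (2) and the appearance error of Eqs. (3)–(5) could be computed, assuming the model points, the ROI mask R and the edge map D are NumPy arrays; the weighting factor value and the handling of points projected outside the image are assumptions of this sketch, not details given in the paper.

```python
import numpy as np

def transform(points, v):
    """Apply the similarity transform T_v of Eq. (2) to an (n, 2) array of points."""
    tx, ty, theta, s = v
    c, si = np.cos(theta), np.sin(theta)
    x, y = points[:, 0], points[:, 1]
    return np.stack([tx + s * (x * c - y * si),
                     ty + s * (x * si + y * c)], axis=1)

def appearance_error(M_A, v, R, D, lam=1.0):
    """E^(A) = E^(R) + lambda * E^(D), Eqs. (3)-(5); R and D are binary maps.

    lam and the treatment of points falling outside the image are assumptions.
    """
    proj = np.rint(transform(np.asarray(M_A, float), v)).astype(int)
    h, w = R.shape
    inside = (proj[:, 0] >= 0) & (proj[:, 0] < w) & (proj[:, 1] >= 0) & (proj[:, 1] < h)
    xs, ys = proj[inside, 0], proj[inside, 1]
    e_R = np.sum(1 - R[ys, xs]) + np.sum(~inside)   # Eq. (4): model pixels outside the ROI
    e_D = np.sum(D[ys, xs])                         # Eq. (5): noisy edge pixels inside the finger
    return e_R + lam * e_D
```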

The shape model M^(S) = {q_1, ..., q_m} is a set of pairwise adjacent points forming an open curve which represents the finger silhouette (see Fig. 2 (b)). During the finger detection process we search for a rigid transformation T_v and a curve C in D such that T_v(M^(S)) and C are perceptually similar. If m is the cardinality of M^(S), then the similarity between C and T_v(M^(S)) can be checked by uniformly subsampling C in order to extract the same number m of points and then comparing corresponding points in the two resulting curves. If C' = {p_1, ..., p_m} is obtained by subsampling C, then the shape error between C and T_v(M^(S)) is given by:

$$E^{(S)}_{TOT} = \sum_{i=1}^{m} \| p_i - T_v(q_i) \|^2 \qquad (6)$$

and:

$$E^{(S)}_{PUNCTUAL} = \sum_{i=1}^{m} \phi(p_i, T_v(q_i), v), \qquad (7)$$

where:

$$\phi(p, q, v) = \begin{cases} 1 & \text{if } \| p - q \|_2 > s \, \gamma_p \\ 0 & \text{otherwise.} \end{cases} \qquad (8)$$

If E^(S)_{PUNCTUAL} = 0, all the points of C' are at most s × γ_p pixels far from the projection of the model points into D, where γ_p is a prefixed threshold and, as usual, v = (t_x, t_y, θ, s)^T. The value of γ_p has been chosen by maximizing the system's accuracy on a validation set (see Section 7.1).
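The two shape errors of Eqs. (6)–(8) reduce to a few array operations. The sketch below reuses the transform() helper and the NumPy import from the previous fragment; the default γ_p = 10 is the value reported in Section 7.1, while everything else is illustrative.

```python
def shape_errors(C_sub, M_S, v, gamma_p=10.0):
    """Shape errors of Eqs. (6)-(8).

    C_sub   : (m, 2) points subsampled from the candidate edge curve C
    M_S     : (m, 2) shape-model points q_1..q_m
    v       : (tx, ty, theta, s) similarity-transform parameters
    gamma_p : point-to-point tolerance; 10 is the value chosen in Section 7.1
    """
    s = v[3]
    proj = transform(np.asarray(M_S, float), v)    # T_v(q_i), see previous sketch
    d = np.linalg.norm(np.asarray(C_sub, float) - proj, axis=1)
    e_tot = np.sum(d ** 2)                         # Eq. (6)
    e_punctual = np.sum(d > s * gamma_p)           # Eqs. (7)-(8)
    return e_tot, e_punctual
```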

Shape comparison performed using point-to-point squared distances as in Eqs. (6)–(7) is much more accurate than common model based shape verification techniques (e.g., the chamfer distance, the Hausdorff distance or Hough-like techniques). In fact, we do not only check that the generic point q_j of M^(S), once projected into D (i.e., T(q_j)), is sufficiently close to any edge point p' of D. We also impose that the distance must be computed exactly with the j-th point p_j of C'. This ordering implicitly introduces context information in the point matching procedure. Conversely, the chamfer distance and other common methods for model based shape verification are context-free. For instance, in chamfer map based techniques, the point p' of D which is the closest to T(q_j) can possibly be a background point (not belonging to the searched-for curve C). Even if p' is a point of the searched-for object shape C, its relative position with respect to the other points of C is not guaranteed to be the same as, or similar to, the position of q_j with respect to the other points of M^(S) (see [1] for details). In images with thick textures and cluttered backgrounds this characteristic can produce a high number of false positives [1]. Similar considerations apply to the Hausdorff distance and to Hough-like techniques. Our experimental results (Section 7) confirm the low number of false positives obtained using curve matching.

However, in order to be effective, the search for all the possible transformations T_v in the parameter space and all the possible curves C in D must be performed efficiently. In the next section we show how this can be done.

5. Efficient curve matching mixing topological and geometric information

Given the above mentioned assumption to deal with only similarity transformations, the correspondence between 2 pairs of points (e.g., \hat{p}_1, \hat{p}_2 ∈ D and \hat{q}_1, \hat{q}_2 ∈ M^(S)) is sufficient to uniquely determine the values of the parameter vector v characterizing T_v. In fact, from T_v(\hat{q}_i) = \hat{p}_i (i = 1, 2), we obtain the following system:

$$x_1 = t_x + x'_1 \cdot s \cdot \cos\theta - y'_1 \cdot s \cdot \sin\theta \qquad (9)$$
$$y_1 = t_y + x'_1 \cdot s \cdot \sin\theta + y'_1 \cdot s \cdot \cos\theta \qquad (10)$$
$$x_2 = t_x + x'_2 \cdot s \cdot \cos\theta - y'_2 \cdot s \cdot \sin\theta \qquad (11)$$
$$y_2 = t_y + x'_2 \cdot s \cdot \sin\theta + y'_2 \cdot s \cdot \cos\theta, \qquad (12)$$

where \hat{p}_1 = (x_1, y_1)^T, \hat{p}_2 = (x_2, y_2)^T, \hat{q}_1 = (x'_1, y'_1)^T, \hat{q}_2 = (x'_2, y'_2)^T, and whose solutions are given by:

$$s = \| p \|_2 / \| q \|_2, \qquad (13)$$
$$\theta = \arccos \frac{q^T p}{\| q \|_2 \cdot \| p \|_2}, \qquad (14)$$
$$t_x = x_1 - s \cdot (x'_1 \cdot \cos\theta - y'_1 \cdot \sin\theta), \qquad (15)$$
$$t_y = y_1 - s \cdot (x'_1 \cdot \sin\theta + y'_1 \cdot \cos\theta), \qquad (16)$$

with q = \hat{q}_2 - \hat{q}_1 and p = \hat{p}_2 - \hat{p}_1.

We fix \hat{q}_1 = q_1 and \hat{q}_2 = q_m (remember that M^(S) = {q_1, ..., q_m}) and we search for high curvature points in D as candidate points for the base of the finger (i.e., candidate good matches for q_1 and q_m).

We use the Kanade, Lucas and Tomasi features (KLT features) [30] to select in the current frame I those points having at least two strong edge directions (we refer to [30] for more details). Such "salient" corner points do not necessarily correspond to edge points of D, but they are usually at most 2–3 pixels far from the edge points of D. It is then quite easy to project KLT points onto D points by selecting for each KLT point p the closest edge point p' in a 5×5 local neighborhood of p. Let us call K the set of all such p' points in D (see Fig. 3 (a)). Matching a pair of points of K with the pair (q_1, q_m) makes it possible to find the parameters v of T_v using Eqs. (13)–(16).
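A minimal sketch of Eqs. (13)–(16), recovering (t_x, t_y, θ, s) from the correspondence of two image points with the two extremes of the shape model. One labeled deviation: Eq. (14) uses arccos, which only yields an unsigned angle, so the sketch uses atan2 of the cross and dot products to keep the sign of the rotation.

```python
import numpy as np

def similarity_from_two_points(p1, p2, q1, q2):
    """Recover v = (tx, ty, theta, s) such that T_v(q1) = p1 and T_v(q2) = p2
    (Eqs. 13-16). p1, p2 are image points; q1, q2 are model points.
    """
    p = np.asarray(p2, float) - np.asarray(p1, float)
    q = np.asarray(q2, float) - np.asarray(q1, float)
    s = np.linalg.norm(p) / np.linalg.norm(q)                    # Eq. (13)
    # Signed rotation between q and p (Eq. (14) uses arccos and drops the sign).
    theta = np.arctan2(q[0] * p[1] - q[1] * p[0],
                       q[0] * p[0] + q[1] * p[1])
    c, si = np.cos(theta), np.sin(theta)
    tx = p1[0] - s * (q1[0] * c - q1[1] * si)                    # Eq. (15)
    ty = p1[1] - s * (q1[0] * si + q1[1] * c)                    # Eq. (16)
    return tx, ty, theta, s
```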

Concerning C, we observe that it is composed of consecutively adjacent points of D, just as M^(S) is a set of pairwise adjacent points represented in the model reference frame. For this reason, candidate C curves to be matched with M^(S) can be found by following the edge adjacency relation among points in D. However, it is well known that the edge detection process is quite unstable, produces a lot of noise and, most importantly, edge line segments are frequently interrupted due to low contrast in the image (e.g., see Figs. 2 (a) and 3 (b)). Nevertheless, searching for C can be efficiently achieved by adequately structuring the edge information. To this aim we construct a graph G = (N, A) representing the topological structure of D: salient edge points are associated with nodes of the graph (N) and connected lines with arcs (A). More formally, let us call an endpoint an edge point which is adjacent to only one other edge point (i.e., it is the end of a line) and a junction point an edge point adjacent to more than two points (hence, it is the conjunction of two or more lines). If S is the set of such topologically salient edge points of D, we can construct a graph G = (N, A) by biuniquely associating each node in N with a point in Z = K ∪ S. Moreover, if l is a line (i.e., a set of pairwise adjacent edge points) connecting p_1 ∈ Z with p_2 ∈ Z, and ν ∈ N is associated with p_1 and υ ∈ N is associated with p_2, then the arc e = (ν, υ) is added to A and the set of points of l is associated with e. The length of l (number of points of l) is associated with e as well. Referring to the example in Fig. 4, the points p_1, p_2, p_4, p_5 and p_6 are endpoints, respectively associated with the nodes n_1, n_2, n_4, n_5 and n_6; p_3 is a junction point associated with n_3 (p_1, p_2, p_3, p_4, p_5, p_6 ∈ S), and k_1 is a point in K associated with the node n_7. Moreover, the lines l_1–l_5 with their corresponding lengths in pixels are respectively associated with the arcs e_1–e_5.

Fig. 3. (a) An example of KLT features extracted from an input image. (b) The finger detection results on the same image: the shape model of a finger has been superimposed by the system on two out of the three open fingers.

To deal with missing edge points due to low contrast in the image, we add virtual arcs to A. For each pair of endpoints (p_1, p_2), respectively associated with ν and υ, whose Euclidean distance d = ||p_1 − p_2|| is such that d ≤ γ_w, the virtual arc w = (ν, υ) is added to A and associated with the length value d. γ_w is a threshold used to check that p_1 and p_2 are close enough to each other (see Section 7.1). In Fig. 4 the nodes n_5 and n_6, corresponding to the endpoints p_5 and p_6, are connected by means of the virtual arc w_1. Vice versa, other nodes corresponding to different endpoint pairs, such as p_1 and p_2 or p_2 and p_4, are not connected because the distance of the associated endpoints is greater than γ_w.

Finally, G is partitioned into connected components G_1, G_2, etc. Each connected component will then be separately analyzed to search for the presence of a hand shape. G and its connected components can be constructed in O(n + n_Z²), where n and n_Z are, respectively, the total number of edge points and the cardinality of Z.
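A possible, very reduced data-structure sketch of the graph G described above, assuming the salient points Z and the edge lines connecting them have already been extracted; γ_w = 10 is the value reported in Section 7.1, while the container choices and names are purely illustrative.

```python
from dataclasses import dataclass, field
from itertools import combinations
import numpy as np

@dataclass
class EdgeGraph:
    """Topological graph G = (N, A) over the salient points Z = K ∪ S."""
    nodes: list                                  # salient point coordinates, one per node
    arcs: dict = field(default_factory=dict)     # (i, j) -> arc attributes

    def add_line(self, i, j, line_points):
        """Real arc: an edge line connecting salient points i and j."""
        self.arcs[(i, j)] = {"points": line_points,
                             "length": float(len(line_points)),
                             "virtual": False}

    def add_virtual_arcs(self, endpoint_ids, gamma_w=10.0):
        """Virtual arcs between endpoint pairs closer than gamma_w (Section 5)."""
        for i, j in combinations(endpoint_ids, 2):
            d = float(np.linalg.norm(np.asarray(self.nodes[i], float) -
                                     np.asarray(self.nodes[j], float)))
            if d <= gamma_w and (i, j) not in self.arcs:
                self.arcs[(i, j)] = {"points": [], "length": d, "virtual": True}
```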

Given a connected component G_j of G and the corresponding set of salient points Z_j, for each pair of points a, b ∈ Z_j, if a and b are neither too far from nor too close to each other with respect to the expected dimensions of a finger in the image, we match a with q_1 and b with q_m (see above). Let v_{a,b} = (t_x^{a,b}, t_y^{a,b}, θ^{a,b}, s^{a,b})^T be the parameter vector defining the transformation T_{v_{a,b}} such that T_{v_{a,b}}(q_1) = a and T_{v_{a,b}}(q_m) = b. v_{a,b} is computed using Eqs. (13)–(16).

Using T_{v_{a,b}} we project the points of M_{s^{a,b}}^(A) into D and R and we compute E^(R) and E^(D) using Eqs. (4) and (5). The match hypothesis (a with q_1 and b with q_m) is rejected if E^(R) and E^(D) are greater than two prefixed thresholds depending on the cardinality of M_{s^{a,b}}^(A) (h_{s^{a,b}}). This first test using the appearance model helps in rejecting a lot of false match hypotheses, but it is clearly not sufficient to find a finger shape without using M^(S). Then, if E^(R) and E^(D) are small enough, we search for a curve C in D to compare with M^(S). A sketch of the algorithm is presented below.

FingerDetection(R, D, G_j, {M_s^(A)}_{s=1,...}, M^(S))

0  for each ν, υ ∈ N_j:
1      a := PointOf(ν), b := PointOf(υ), δ := ||a − b||.
2      Compute v_{a,b} = (t_x^{a,b}, t_y^{a,b}, θ^{a,b}, s^{a,b})^T using Eqs. (13)–(16), assuming T_{v_{a,b}}(q_1) = a and T_{v_{a,b}}(q_m) = b.
3      If s^{a,b} is too large or too small w.r.t. the image size then go to 1.
       *** Check the appearance model: ***
4      Let M_{s^{a,b}}^(A) be the appearance model whose scale is the closest to s^{a,b}.
5      Compute the errors E^(R) and E^(D) using Eqs. (4)–(5).
6      If E^(R) and E^(D) are too large w.r.t. h_{s^{a,b}} then go to 1.
       *** Check the shape model: ***
7      x := ∅.
8      For each e ∈ A_j: marked[e] := false.
9      If CheckShape(D, M^(S), v_{a,b}, ν, υ, 0, length(T_{v_{a,b}}(M^(S))), x)
10         then return <v_{a,b}, x>. *** A finger has been detected. ***
11 return Nil. *** No finger has been detected in G_j. ***

Fig. 4. An example of graph-based representation (b) of an edge map (a).

In Line 9 the CheckShape() procedure is called with the following parameter values: D, the edge map; M^(S), the finger shape model; v_{a,b}, the current geometric transformation parameters; ν and υ, respectively, the source and the target nodes of G_j in the graph-traversal algorithm (see below); 0, the initial virtual component of the line matched with M^(S); length(T_{v_{a,b}}(M^(S))), the total length (in pixels) of the curve T_{v_{a,b}}(M^(S)); and x = ∅, the initially empty path from ν to υ. In CheckShape() we perform a depth-first visit of G_j, starting from ν, searching for a path x = {e_1, e_2, ...} which possibly connects ν with υ and such that the curve C given by the concatenation of the lines l_1, l_2, ... corresponding to e_1, e_2, ... is similar to M^(S). The path x is incrementally built by recursively calling the procedure CheckShape().

In a generic call to CheckShape() (see the pseudo-code below), M is the set of model points still to be matched with points of D, x is a partial path in G_j starting from ν and n_1 is the current node. In Line 1 of CheckShape() all the unvisited arcs of n_1 are analyzed in turn. If e = (n_1, n_2) is the current (real) arc and l is the line (set of edge points) associated with e, then we compare l with the corresponding portion of M, called M_1. In turn, M_1 is obtained (Line 8) by comparing the length of l with the length of T_v(M) and computing the number of model points m_1 (Lines 5–6) which have to be extracted from M. m_1 is also used to subsample l. Once l' and M_1 are composed of the same number of points, Eqs. (6) and (7) can be used to compute the error introduced by adding e to x. If E^(S)_{PUNCTUAL} = 0 and E^(S)_{TOT} is lower than the threshold ρ, then e is added to x and CheckShape() is recursively called in Line 17 using the set M_2 of unmatched points of M for the subsequent comparisons, until either a failure occurs or the node υ has been reached. Note that M^(S), M, M_1 and M_2 are dealt with as ordered sequences of points in CheckShape() (e.g., Lines 8, 16, 24). Finally, if e is a virtual arc whose length is d, we add e to x if the cumulative length of the virtual part of x (VirtualRatio) remains below a given threshold (η).

CheckShape(D, M, v, n_1, υ, VirtualRatio, TotL, x)

0  ok := false.
1  For each e ∈ ArcsOf(n_1) s.t. ¬marked[e] and while ¬ok:
2      Let e = (n_1, n_2), M = {q_1, ..., q_m}.
3      marked[e] := true.
4      If Real(e) then:
       *** Case e is a "real" arc: ***
5          l := LineOf(e), L := length(l), L_S := length(T_v(M)).
6          m_1 := ⌊L / L_S · m⌋.
7          l' := Subsample(l, m_1). Let l' = {p_1, ..., p_{m_1}}.
8          M_1 := {q_1, ..., q_{m_1}}. *** M_1 is the set of the first m_1 points of M ***
9          Compute E^(S)_{TOT} and E^(S)_{PUNCTUAL} using l' and M_1 and Eqs. (6)–(7).
10         If E^(S)_{TOT} > ρ or E^(S)_{PUNCTUAL} ≠ 0
11             then go to 1
12         else
13             Push(x, e).
14             If n_2 = υ then return true
15             else:
16                 M_2 := {q_{m_1}, ..., q_m}.
17                 ok := CheckShape(D, M_2, v, n_2, υ, VirtualRatio, TotL, x).
       else
       *** Case e is a virtual arc: ***
18         p_1 := PointOf(n_1), p_2 := PointOf(n_2), d := ||p_1 − p_2||.
19         r := d / TotL.
20         If VirtualRatio + r > η then go to 1
21         else
22             Push(x, e).
23             q_{m_1} := argmin_{q ∈ M} ||T_v(q) − p_2||.
24             M_2 := {q_{m_1}, ..., q_m}.
25             ok := CheckShape(D, M_2, v, n_2, υ, VirtualRatio + r, TotL, x).
26 return ok.

No backtracking is performed during the visit of G_j (no arc is removed from x once inserted) and, since all the arcs in A_j are set to unmarked in Line 8 of the FingerDetection() procedure before calling CheckShape() and are never unmarked again in CheckShape(), each arc is visited at most once. For this reason, the worst case computational complexity of CheckShape() is bounded by O(km), where k is the cardinality of A_j and m the number of points of M^(S). Moreover, the outermost loop of FingerDetection() (Line 0) is repeated no more than n² times, where n is the cardinality of N_j. Thus the worst case complexity of the whole finger detection algorithm is O(n²(h_{s^{a,b}} + km)), where Θ(h_{s^{a,b}}) is necessary to compute the errors E^(R) and E^(D) using Eqs. (4) and (5) in Lines 5–6. However, the upper bound O(n²(h_{s^{a,b}} + km)) is not tight, because very few lines in D are perceptually similar to the searched-for shape (a finger-like curve, in our case). Thus, wrong partial paths x are cut out as soon as the accumulated error becomes excessive, which typically happens when we meet a point in D which is farther than s^{a,b} γ_p pixels from T_{v_{a,b}}(M^(S)) (Eqs. (7) and (8)). Fig. 3 (b) shows the result of the finger detection process described above in a noisy edge map: two out of the three open fingers have been correctly localized.

6. Searching for valid finger configurations

The process described in the previous section is extended to find the shape of a hand, composed of different fingers. Using the same terms as in the previous section, once a valid path x, corresponding to curve C, is found between ν and υ, we continue to search for other neighboring fingers in the same connected graph component G_j. The presence of neighboring fingers lowers the likelihood that x corresponds to a false positive, because the shape found in D becomes more and more complex.


Fig. 6. Image samples of the parameter validation set. (a) A hand image. (b) A background image.


Assuming that Salient(x) returns the set of points in Z corresponding to the nodes of x, searching for a second finger is performed by repeating the process described in the previous section using a second pair of starting points a', b' selected in such a way that:

$$a', b' \in \left( Z_j - Salient(x) \right) \cup \{a, b\} \;\wedge\; \neg\left( (a' = a) \wedge (b' = b) \right). \qquad (17)$$

Note that in Eq. (17) we permit that one (but not both) of a and b can be reused in order to allow the system to detect adjacent fingers sharing a salient point (Fig. 3).

In order to take into account the relative positions of different fingers, we transform a search in parameter space into a constraint satisfaction problem. Basically, instead of searching for all the possible (discrete) positions of a finger with respect to another, we check that groups of separately found fingers have a valid displacement with respect to one another. Suppose we have found n_F fingers in G_j, each one associated with a model-to-image rigid transformation: T_{v_{a_1,b_1}}, ..., T_{v_{a_{n_F},b_{n_F}}}. A hand H = {v_{a_{i_1},b_{i_1}}, ..., v_{a_{i_o},b_{i_o}}} is given by grouping a subset (with cardinality o, o ≥ 2) of the n_F separate detections such that:

(1) s^{a_{i_1},b_{i_1}} ≃ ... ≃ s^{a_{i_o},b_{i_o}},
(2) (θ^{a_{i_j},b_{i_j}} − θ^{a_{i_{j+1}},b_{i_{j+1}}}) mod 2π ≤ γ_θ,
(3) ||(t_x, t_y)^{a_{i_j},b_{i_j}} − (t_x, t_y)^{a_{i_{j+1}},b_{i_{j+1}}}|| ≤ γ_t,

γ_θ and γ_t being two prefixed thresholds. Constraint 1 assures that all the fingers of H have more or less the same scale, Constraint 2 determines the maximum rotation range between two adjacent fingers and Constraint 3 imposes that separate detections must be close to each other. Special care must be taken for the possible detection of the thumb, which is farther from the other fingers. We remark that these constraints hold because of the assumption to deal with hand movements in a plane parallel to the camera plane (Section 4). Minor out-of-plane rotations can be dealt with as well, because Constraint 1 does not require an exact equivalence of the involved scales, while different finger configurations and missing fingers satisfy Constraints 2 and 3 with suitable choices of the thresholds (e.g., γ_θ = π/2).
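A small, hedged sketch of how Constraints 1–3 might be checked and used to cluster candidate finger detections into a hand hypothesis; only γ_θ ≈ π/2 is suggested by the text, while the scale tolerance, the translation threshold and the greedy grouping strategy are assumptions of this illustration (the paper evaluates subsets of detections rather than growing clusters greedily).

```python
import numpy as np

GAMMA_THETA = np.pi / 2   # maximum rotation gap between adjacent fingers (Section 6)
GAMMA_T = 80.0            # maximum translation gap in pixels (assumed value)
SCALE_TOL = 0.25          # tolerance for "roughly equal" scales in Constraint 1 (assumed)

def compatible(v1, v2):
    """Check Constraints 1-3 for two finger detections v = (tx, ty, theta, s)."""
    tx1, ty1, th1, s1 = v1
    tx2, ty2, th2, s2 = v2
    same_scale = abs(s1 - s2) <= SCALE_TOL * max(s1, s2)          # Constraint 1
    # Constraint 2, read here as a wrapped absolute angular difference.
    d_theta = abs((th1 - th2 + np.pi) % (2 * np.pi) - np.pi)
    small_rotation = d_theta <= GAMMA_THETA
    close = np.hypot(tx1 - tx2, ty1 - ty2) <= GAMMA_T             # Constraint 3
    return same_scale and small_rotation and close

def cluster_fingers(detections):
    """Greedily grow hand hypotheses from mutually compatible finger detections;
    a cluster H with two or more fingers is accepted as a hand."""
    hands = []
    for v in detections:
        for H in hands:
            if all(compatible(v, u) for u in H):
                H.append(v)
                break
        else:
            hands.append([v])
    return [H for H in hands if len(H) >= 2]
```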

It is worth noticing that, even if the candidate finger assembling and constraint satisfaction process is exponential in the total number of finger candidates (n_F), the number of detected fingers in a given connected component of G is usually very low. In Fig. 3 (b) a hand (finger cluster) H is shown which is composed of o = 2 elements.

As a final check we find the center of the detected fingers (see the big dot drawn on the images of Fig. 5) and we use an appearance model for the palm of the hand to verify the presence of sufficient pixels in the ROI mask (E^(R)) and the absence of noisy edge pixels (E^(D)) just below the fingers' center. Finally, the whole process is repeated for all the connected graph components G_1, G_2, etc.

Fig. 5. Some examples of correct detections showing the position and the direction of the hand found by the system.

The three simple constraints mentioned above allow us to detect a very large set of finger configurations and hand postures. For instance, we can detect hands having one or more closed fingers and/or with different rotation angles among the open fingers (e.g., see Fig. 5). To deal with minor out-of-plane hand rotations we use different (shorter) shape models; other shape models are necessary for dealing with the thumb. A total of only 9 shape models is used, iterating the CheckShape() procedure for 9 different model curves M_1^(S), ..., M_9^(S). No training is necessary to learn the shape or the appearance models and the system can deal with different human beings without any identity-specific parameter tuning. Concerning the scenario characteristics, our system is able to work with different lighting conditions, cluttered backgrounds, possible occlusions, and the presence in the scene of other moving objects or other skin-colored patches. The presence of objects with a skin-like color just behind the hand is not a problem either, but it usually deteriorates the detection rate since it degrades the object-background contrast, hence making the edge detection process harder. Partial hand occlusions as well as self-occlusions (fingers not visible because they are closed or occluded by other fingers or objects) are addressed by the fact that the above Constraints (1)–(3) do not depend on finger adjacency and the finger searching process does not assume that the initial curve points (a and b) are connected with the other already found fingers (Eq. (17)). As a consequence, a finger cluster H is composed of a set of detected fingers with variable cardinality and with possibly missing or non-adjacent fingers.

Fig. 7. Results obtained with the validation set by varying the values of γ_p (a) and γ_w (b). Note that the abscissa values are not sampled uniformly.

When pre-processing operations (see Section 3) make it possible to select a ROI mask, detection runs at 10–20 fps on 320×240 images on a modest Pentium IV 2.4 GHz with non-optimized code. The hand detection process is performed in each frame of the input video sequence and the results of the past frames are used to improve the detection rate. In fact, if no successful detection has been obtained in the current frame I_t, the system takes into account the past 5 frames (I_{t−5}, ..., I_{t−1}) with a majority vote: if a hand has been detected in at least three out of the 5 frames, the result for the current frame is a positive hand detection as well, whose position parameters are obtained by averaging the past 5 positions.
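The temporal majority vote just described could look roughly like the following sketch; the buffer layout, the handling of frames without a detection and the naive averaging of the pose vector are implementation assumptions, not details given in the paper.

```python
from collections import deque
import numpy as np

class TemporalVote:
    """Majority vote over the past 5 frames (Section 6): if the current frame has
    no detection but a hand was found in at least 3 of the last 5 frames, report
    a detection whose pose is the average of those past poses."""

    def __init__(self, window=5, min_hits=3):
        self.past = deque(maxlen=window)   # per-frame results, None = no detection
        self.min_hits = min_hits

    def update(self, detection):
        """detection: (tx, ty, theta, s) or None for the current frame."""
        result = detection
        if detection is None:
            hits = [d for d in self.past if d is not None]
            if len(hits) >= self.min_hits:
                # Naive component-wise averaging; assumes theta does not wrap around.
                result = tuple(np.mean(np.asarray(hits, dtype=float), axis=0))
        self.past.append(detection)
        return result
```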

7. Results

7.1. Performance of the system as a function of its parameter values

In order to choose the parameter and threshold values of the proposed algorithm, we tested the system with a validation set composed of 6000 images. They were acquired using a webcam in indoor environments and are composed of 1000 frames showing an open hand in different positions as the foreground object and 5000 background images (e.g., see Fig. 6). In order for the validation process to be feasible, we assume that each parameter and threshold of the system is independent of the others and we separately select the best value for each parameter, the others being fixed.

The parameter values are chosen by maximizing the fraction of correct images, given by:

$$Correct = \frac{CD + CR}{N}, \qquad (18)$$

where N = 6000, CD (Correctly Detected) is the number of images with hands correctly detected by the system and CR (Correctly Rejected) is the number of background images with a "no hand" response.
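Eq. (18) is a plain accuracy-style ratio; a trivial helper, with the validation-set size N = 6000 from the text as a default:

```python
def correct_rate(cd, cr, n=6000):
    """Eq. (18): correctly detected hand images (CD) plus correctly rejected
    background images (CR) over the validation-set size N."""
    return (cd + cr) / n
```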

Fig. 8. Some frames taken from a video sequence in which a user utilizes the visual interface to draw two closed lines. The first 6 frames (from left to right, top to bottom) show the drawing of the first white line. In Frame 7 the default color is changed from white to gray in "non-drawing modality". From Frame 8 to Frame 11 a second line is drawn in "drawing modality". In Frame 12 the current tool is changed from "pencil" to "fill-in" (the system being in "non-drawing modality"). Finally, in Frames 13 and 14 the system switches back to "drawing modality" and fills in the inner figure.

It is worth noticing that, since the ratio of hand versus background images composing the validation set is 1/5, false positives have a stronger influence on expression (18). However, our interest is in obtaining a very low False Alarm Rate (FAR) in order to use detections in past frames to deduce the presence of a hand in the current frame (see Section 6).

For lack of space, we show results concerning only the two system thresholds which have the strongest impact on the overall performance: γ_p and γ_w (respectively, Fig. 7 (a) and (b)). The experimental results presented in the rest of this section have been obtained with the following values: γ_p = 10 and γ_w = 10.

Finally, we remark that, even if we use a validation set to tune the parameter values of our system, the proposed approach cannot be considered a machine learning based method, since it does not build a statistical model of the object class appearance, which is instead described by a priori models. Moreover, while common machine learning based techniques have up to now shown poor performance with articulated objects (see Section 2), our approach is less sensitive to the "training" set composition because the samples are used only for threshold fixing.

7.2. Qualitative evaluation

Figs. 5 and 9 show some examples of the system's output in images with cluttered backgrounds; different non-hand skin-colored objects (e.g., faces, etc.); different finger configurations and hand positions, including minor out-of-plane rotations and hands of different persons.

Fig. 8 shows some frames of a video sequence in which our system interacts with a simple visual interface that we have developed in order to demonstrate the potential of the proposed hand localization approach. Using a low cost webcam and without touching the keyboard, a user can control a graphic tool which utilizes the user's hand as a virtual brush to draw lines. The drawing made by the user is superimposed in real time on the frames grabbed by the camera and can be saved.


Table 1
A comparison of hand detection systems with similar assumptions on the possible hand postures and the number of used cameras.

Method                      Clutter   Markers   Speed (f/s)
Our proposed system         Y         N         10–20
Holden et al. [17]          N         Y         –
Thayananthan et al. [34]    Y         N         1–2
Nolker et al. [24]          N         N         –



A line can be drawn by selecting the "drawing" modality (opening the hand upwards with all 5 fingers visible), then closing the hand and moving it to sketch the desired shape (Fig. 8). While the hand is completely closed, tracking is realized using an OpenCV version of the KLT feature-based tracking [30], initialized with the last open hand position. However, hand detection is performed in every frame of the video sequence, including the tracking period, since we need to recognize open-finger gestures, which are used by the system for state changing. Fig. 8 shows a sequence in which two closed lines are traced.

Other commands are: change the current drawing tool ("pencil", "eraser" and "fill-in"), save the current drawing or change the current color (white, black or gray); these are defined by different orientations of the hand and different numbers of open fingers. Note that the pencil cursor is not positioned in the center of the palm but between the thumb and the forefinger, in order to give the impression of grasping a pencil while tracing a line (see Fig. 8).

Finally, we remark that performing a quantitative comparison with other systems is very difficult because of the lack of common benchmarks. Most of the approaches published so far usually present just a qualitative evaluation, showing the output of the proposed system on a few input images [10]. The ones which propose an extensive and quantitative evaluation usually focus on tracking errors rather than hand detection (or "tracking initialization") aspects, or use synthetic data rather than real images [10]. As underlined in [10], one of the reasons for the lack of quantitative evaluations is the difficulty in obtaining ground truth data for articulated moving objects.

Another problem in performing cross-system evaluation is the large variability of finger configurations and hand postures taken into account by the existing systems. In Table 1 we compare our proposed approach with some methods which, like our system, assume the palm is parallel to the camera plane. The data presented in Table 1 have been taken from the qualitative comparison done in [10], from which we have selected only those systems based on a monocular camera with no depth information.

The first column of Table 1 refers to the specific systems (the first one being our proposed approach), while the second specifies whether the corresponding algorithm deals with cluttered backgrounds, the third whether artificial markers on the user's hand need to be used and the fourth the system's speed in frames per second. From the data reported in the table it is evident that our system is the only one able to work in real time with a cluttered background and without artificial markers.

Finally, we summarize in the following the most important aspects of our approach. In Fig. 9, first row, our system correctly detects out-of-plane rotated hands where the angle between the palm and the camera plane ranges in [−π/2, π/2]. The second row shows the robustness of the system with respect to missing fingers and occlusions. Two open fingers are sufficient to detect a hand in the input image (see Section 6: a hand is defined as a cluster of 2 or more fingers). Other examples of different finger configurations are shown in Fig. 5. Figs. 5 and 9 show that our system can localize a hand independently of the person's identity (e.g., notice the presence of a baby in the first row, last column of Fig. 5 and in the third and fourth columns, middle row of Fig. 9). Figs. 5 and 9 also show the full invariance of our approach to 2D (in-plane) transformations. In the last row of Fig. 9, hand detection with minor localization errors has been performed in some difficult lighting conditions (including shadows). As mentioned in Section 3, color segmentation is used only to build the ROI mask, the shape matching approach being mainly based on edge information. As a consequence, our approach does not strictly depend on the skin detection results and can handle a large number of lighting conditions. However, difficulties can arise in images with low contrast that produce a poor edge map. This is shown in Fig. 10, in which the finger detection step has failed mainly due to missing edge pixels.

Fig. 10. Finger detection errors: false negatives due to low contrast images.
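As noted above, color segmentation serves only to restrict the search area while matching itself relies on edges. A minimal, hypothetical sketch of such a skin-color ROI mask is given below; the color space, the threshold values and the morphological cleanup are assumptions, not the values used in the paper.

```python
import cv2
import numpy as np

# Hedged sketch of a skin-colour ROI mask used only to restrict where edges
# and finger silhouettes are searched for.  The HSV bounds are illustrative
# and would need tuning per camera and lighting setup.

def skin_roi_mask(bgr_frame):
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)    # loose hue/saturation bounds
    upper = np.array([25, 180, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    # Morphological closing removes small holes so the edge map computed
    # inside the ROI is not needlessly fragmented.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
```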

More schematically, the most important positive features of our system are:

• No user customization is necessary. Hands of different persons, including babies and adults, can be dealt with using very few shape models.

• Low computational complexity and real-time performance.

• Robustness to cluttered backgrounds and to the presence of other (possibly skin-colored) objects.

• Different lighting conditions which preserve a sufficient image contrast are addressed as well.

• Different finger configurations, including different angles between the open fingers and hands with missing/occluded fingers, can be recognized using simple geometric constraints.

• Full invariance with respect to 2D transformations (in-plane rotations, scale changes and translations) and robustness to minor out-of-plane rotations.

Conversely, the main drawbacks can be summarized as follows:

• Low contrast images with a poor edge map can lead to a low finger recognition rate and, hence, to false negatives of the whole hand detection process. This is currently the main drawback of the proposed method. We are working on extending the system using multiple edge maps, each one obtained with a different combination of hysteresis thresholds of the Canny edge detector (a minimal sketch of this idea is given after this list).

• At the moment only open finger silhouettes are included in the system shape models, hence hands with all closed or partially closed fingers cannot be detected. Our current work is focused on including shape models of partially closed fingers.

• Out-of-plane rotations greater than ±π/2 are hardly captured because the perspective effects on both the hand appearance and the finger geometric relations do not fit our model.
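As anticipated in the first drawback above, one mitigation under study for poor edge maps is to compute several Canny maps with different hysteresis threshold pairs and to repeat the finger matching on each of them. The sketch below only illustrates the idea; the threshold pairs are placeholders, not the authors' settings.

```python
import cv2

# Hedged sketch of the multi-edge-map extension: several Canny maps are
# computed with different hysteresis threshold pairs, and the finger/curve
# matching stage would be iterated on each map.  Pairs are illustrative only.

HYSTERESIS_PAIRS = [(30, 90), (50, 150), (80, 200)]

def multi_canny(gray_frame):
    """Return one edge map per hysteresis configuration."""
    return [cv2.Canny(gray_frame, low, high) for (low, high) in HYSTERESIS_PAIRS]

# Each returned map could then be fed independently to the matching stage,
# keeping the hypotheses with the best matching score across maps.
```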

7.3. Quantitative evaluation

We have quantitatively evaluated the proposed system performing two different types of tests: the first using a database of still images and the second with annotated video sequences. The test images are different from the ones used in the validation set (Section 7.1). Both the test images and the test videos show hand gestures of different subjects acquired in different environments and with different cameras.

In all the experiments a detection is considered valid if the center of the hand detected by the system overlaps with the real hand represented in the input image and the direction error is less than 30 degrees with respect to the real direction (see Fig. 5 for some examples). In the still image test the detection rate is lower than in the video sequence experiments because in the former case the system does not use information about past detections to guess the presence of a hand in the current frame (see the end of Section 6).
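For concreteness, the validity criterion just stated can be written as a small check. The data structures assumed below (an axis-aligned bounding box for the annotated hand, orientations in degrees) are our own choices for illustration.

```python
# Hedged sketch of the validity criterion described above: the detected hand
# centre must fall inside the annotated hand region and the orientation error
# must stay below 30 degrees.

def is_valid_detection(det_center, det_angle_deg, gt_box, gt_angle_deg):
    x, y = det_center
    gx, gy, gw, gh = gt_box                      # annotated hand bounding box
    inside = gx <= x <= gx + gw and gy <= y <= gy + gh
    # Smallest absolute angular difference, in degrees.
    diff = abs(det_angle_deg - gt_angle_deg) % 360.0
    angle_ok = min(diff, 360.0 - diff) < 30.0
    return inside and angle_ok
```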

For testing the system's classification accuracy with still images we used the well known Caltech 101 dataset [11], taking 8994 images out of 9145 (we discarded only the images that were too large). None of the selected images contains hands. We then added 400 other images containing hands in various positions and with different finger configurations, taken both in indoor and outdoor environments and with a nonuniform background. Each image contains just one hand. The hands belong to different individuals whom we asked to randomly open some fingers and move their hand to a random position, taking care to keep the palm roughly parallel to the camera view. Both sides of the hand (and both the right and the left hand) are represented in the collection (see Figs. 5 and 11 for a few examples). Table 2 shows the detection rate and FAR achieved. In this still image experiment the ROI mask is realized using only skin detection (due to the absence of motion).

Fig. 11. Some examples of still images used for testing the system in “static” mode (see Table 2). Top row: images containing hands collected by us. Bottom row: images without hands taken from the Caltech 101 dataset [11].

Table 2
The detection results achieved by the proposed system using still images (first row) and video sequences (second row).

                     Number of test images   FAR      Detection rate
Static detection     9394                    0.0321   0.83
Dynamic detection    1645                    0.0036   0.9019

Fig. 13. ROC curve of the proposed hand detector on a test set of 800 images.

For testing the system with video sequences, we grabbed two videos in two different indoor environments with different lighting conditions. The two video sequences contain altogether 1645 frames in which scenes with the presence of a hand (989 frames) alternate with scenes without any hand (656 frames). Also in this case the images contain a cluttered background and hands in different positions and with different shapes, sometimes overlapping with the face or other skin-colored objects. Fig. 12 shows some frames taken from the two video sequences. The results are shown in Table 2. Note that the simple technique of averaging with respect to the past 5 frames (Section 6) makes it possible to largely increase the detection rate. Also, note that the FAR is extremely low (less than 4%) in both tests.
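The past-frame averaging is only referenced here (its details are given in Section 6), but one plausible reading of it is a sliding window over the last five per-frame detection responses, as in the hedged sketch below; the score semantics and the 0.5 threshold are assumptions of ours.

```python
from collections import deque

# Hedged sketch of temporal smoothing over the past 5 frames.  What exactly
# is averaged (a confidence score, a binary response, the hand position) is
# an assumption here; the paper defers the details to Section 6.

class TemporalSmoother:
    def __init__(self, window=5, threshold=0.5):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def update(self, frame_score):
        """frame_score: per-frame detection confidence in [0, 1]."""
        self.scores.append(frame_score)
        return (sum(self.scores) / len(self.scores)) >= self.threshold
```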

Finally, we have investigated the relationship between the system's detection rate and its FAR. For this purpose we have used a subset of the still image test set mentioned above, composed of all the 400 images containing hands (remember that each image contains just one hand) and 400 other Caltech images randomly chosen among the 8994 of the previous test. The Receiver Operating Characteristic (ROC) curve drawn in Fig. 13 has been obtained by varying the values of γp and γw (Eqs. (6)–(8)) from 0 to +∞. For each pair of γp and γw values we have run our system on the 800 test images and averaged the obtained detection rates over those trials having the same number of false positives. We have not been able to obtain a detection rate greater than 0.88 even with high FARs. This is due to the presence of some low contrast images in the test set which produce a quite poor edge map, preventing the system from detecting the finger silhouettes even with very low threshold values. As mentioned in Section 7.2, we are working on an extension of the current approach in which different edge maps are created using different threshold values in the Canny edge detection step and the shape matching process is separately iterated in each map. However, we believe that the current detection rates and false acceptance rates are surprisingly good for the recognition of an articulated object class.
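The ROC construction just described can be reproduced schematically as follows. Here `run_detector`, the parameter grids and the grouping by false-positive count stand in for the actual system and for the γp/γw sweep; they are assumptions, not the authors' code.

```python
import numpy as np

# Schematic reconstruction of the ROC sweep described above.  "run_detector"
# is a placeholder for the full hand detection pipeline run with the given
# (gamma_p, gamma_w) acceptance thresholds on the test set; it is assumed to
# return the number of false positives and the detection rate.

def roc_points(test_images, gamma_p_grid, gamma_w_grid, run_detector):
    by_fp = {}                                   # false-positive count -> rates
    for gp in gamma_p_grid:
        for gw in gamma_w_grid:
            n_fp, det_rate = run_detector(test_images, gp, gw)
            by_fp.setdefault(n_fp, []).append(det_rate)
    # Average the detection rates of all trials sharing the same number of
    # false positives, as done for the curve in Fig. 13.
    return sorted((fp, float(np.mean(rates))) for fp, rates in by_fp.items())
```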

Fig. 12. Some examples of frames randomly taken from the two video sequences used for testing the system in “dynamic” mode (see Table 2). Top row: frames containing hands. Bottom row: frames without any hand.

Finally, as shown in Table 2, by exploiting the implicit dependence between consecutive frames of a video sequence, the detection rate of our proposed approach can be further increased.

8. Conclusions

We have shown a hand detection system working in real time and able to localize hands independently of the person's identity and the hand position, and which is robust to lighting condition changes as well as to cluttered background images. The proposal presented in this paper is focused on a part-based representation of the human hand in which the basic components are finger models. In turn, each finger is described by an appearance and a shape model corresponding to specific matching errors. Finger models can be efficiently matched with the edge map of the input image by combining geometric and topological information. Finally, different spatial configurations of the detected fingers are verified using simple anatomic constraints. The proposed recognition approach is an efficient and accurate combination of curve and graph matching strategies. Our method makes few assumptions about the articulated object's nature, and the proposed solution can be applied to the recognition of other articulated objects as well.

We showed experiments in a “real world” scenario using both still images and video sequences. The false alarm rate achieved is very low due to the accuracy of the curve matching procedure. The detection rate is also quite high, especially when the recognition responses of the past frames can be exploited. Finally, we have presented a simple visual interface, based on the proposed hand detection approach, to show the potential of the real-time, viewpoint-invariant framework.

References

[1] A. Ardovini, L. Cinque, E. Sangineto, Identifying elephant photos by multi-curve matching, Pattern Recognit. 41 (2008) 1867–1877.
[2] V. Athitsos, S. Sclaroff, Estimating 3D hand pose from a cluttered image, CVPR 03, 2003, pp. 432–439.
[3] S. Belongie, J. Malik, Matching with shape contexts, IEEE Workshop on Content-Based Access of Image and Video Libraries, 2000, pp. 20–26.
[4] P.J. Besl, N.D. McKay, A method for registration of 3-D shapes, IEEE Trans. PAMI 14 (1992) 239–256.
[5] J. Canny, A computational approach to edge detection, IEEE Trans. PAMI 8 (6) (1986) 679–698.
[6] S.C. Cheung, C. Kamath, Robust techniques for background subtraction in urban traffic video, Video Communications and Image Processing, 2004.
[7] C. Colombo, A. Del Bimbo, A. Valli, Visual capture and understanding of hand pointing actions in a 3-D environment, IEEE Trans. PAMI 33 (4) (2003) 677–686.
[8] T.F. Cootes, C.J. Taylor, D.H. Cooper, J. Graham, Active shape models—their training and application, Comput. Vision Image Understanding 61 (1) (1995) 38–59.
[9] J. Doublet, O. Lepetit, M. Revenu, Hand detection for contact less biometrics identification, Proc. of Cognitive Systems with Interactive Sensors (COGIS 06), 2006.
[10] A. Erol, G. Bebis, M. Nicolescu, R.D. Boyle, X. Twombly, Vision-based hand pose estimation: a review, Comput. Vision Image Understanding 108 (2007) 52–73.
[11] L. Fei-Fei, R. Fergus, P. Perona, Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories, CVPR Workshop on Generative Model-Based Vision, 2004.
[12] L. Fei-Fei, R. Fergus, P. Perona, One-shot learning of object categories, IEEE Trans. PAMI 28 (4) (2006) 594–611.
[13] P.F. Felzenszwalb, D.P. Huttenlocher, Efficient matching of pictorial structures, Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2000.
[14] M. Fischler, R. Elschlager, The representation and matching of pictorial structures, IEEE Trans. Comput. C-26 (1977) 236–242.
[15] D.A. Forsyth, M.M. Fleck, Body plans, IEEE Conference on Computer Vision and Pattern Recognition, 1997, pp. 678–683.
[16] W.T. Freeman, D.B. Anderson, P. Beardsley, C.N. Dodge, M. Roth, C.D. Weissman, W.S. Yerazunis, H. Kage, I. Kyuma, Y. Miyake, K. Tanaka, Computer vision for interactive computer graphics, IEEE Comput. Graphics Appl. 18 (3) (1998) 42–53.
[17] E. Holden, Visual recognition of hand motion, Ph.D. thesis, Department of Computer Science, University of Western Australia, 1997.
[18] M. Kolsch, M. Turk, Robust hand detection, Int. Conf. on Automatic Face and Gesture Recognition, 2004.
[19] M. Kolsch, M. Turk, T. Höllerer, J. Chainey, Vision-based interfaces for mobility, Int. Conference on Mobile and Ubiquitous Systems (MobiQuitous), 2004.
[20] N. Liu, B.C. Lovell, Hand gesture extraction by Active Shape Models, Proc. of Digital Image Computing: Techniques and Applications (DICTA 05), 2005, pp. 59–64.
[21] T.B. Moeslund, E. Granum, 3D human pose estimation using 2D data and an alternative phase space representation, Workshop on Human Modeling, Analysis and Synthesis at CVPR 2000, 2000.
[22] T.B. Moeslund, E. Granum, Modelling and estimating the pose of a human arm, Mach. Vis. Appl. 14 (2003) 237–247.
[23] T.B. Moeslund, E. Granum, A survey of advances in vision-based human motion capture and analysis, Comput. Vision Image Understanding 104 (2006) 90–126.
[24] C. Nolker, H. Ritter, GREFIT: visual recognition of hand postures, Gesture-Based Communication in Human-Computer Interaction, International Gesture Workshop, 1999, pp. 61–72.
[25] E.J. Ong, R. Bowden, A boosted classifier tree for hand shape detection, Proc. Int. Conf. on Automatic Face and Gesture Recognition, 2004.
[26] S. Ong, S. Ranganath, Automatic sign language analysis: a survey and the future beyond lexical meaning, IEEE Trans. PAMI 27 (6) (2005) 873–891.
[27] V.I. Pavlovic, R. Sharma, T.S. Huang, Visual interpretation of hand gestures for human-computer interaction: a review, IEEE Trans. PAMI 19 (7) (1997) 677–695.
[28] D. Ramanan, D.A. Forsyth, Finding and tracking people from the bottom up, CVPR 2003, Madison, WI, 2003.
[29] J. Ruiz-del-Solar, A. Shats, R. Verschae, Real-time tracking of multiple persons, Int. Conf. on Image Analysis and Processing (ICIAP 03), 2003.
[30] J. Shi, C. Tomasi, Good features to track, Conf. on Computer Vision and Pattern Recognition (CVPR'94), 1994.
[31] N. Shimada, K. Kimura, Y. Shirai, Real-time 3-D hand posture estimation based on 2-D appearance retrieval using monocular camera, IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, 2001, pp. 23–30.
[32] L. Sigal, S. Bhatia, S. Roth, M.J. Black, M. Isard, Tracking loose-limbed people, Conf. on Computer Vision and Pattern Recognition (CVPR 2004), 2004, pp. 421–428.
[33] B. Stenger, A. Thayananthan, P.H.S. Torr, R. Cipolla, Model-based hand tracking using a hierarchical Bayesian filter, IEEE Trans. PAMI 28 (9) (2006) 1372–1384.
[34] A. Thayananthan, B. Stenger, P.H.S. Torr, R. Cipolla, Learning a kinematic prior for tree-based filtering, British Machine Vision Conf., vol. 2, 2003, pp. 589–598.
[35] A. Torralba, K. Murphy, W.T. Freeman, Sharing visual features for multiclass and multiview object detection, IEEE Trans. PAMI 29 (5) (2007) 854–869.
[36] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, CVPR 01, 2001, pp. 511–518.
[37] J.P. Wachs, M. Kolsch, D. Goshorn, Human posture recognition for intelligent vehicles, J. Real Time Image Process. 5 (2010) 231–244.
[38] Y. Wu, T.S. Huang, Hand modeling, analysis and recognition, IEEE Signal Process. Mag. 18 (2001) 51–60.
[39] H. Zhou, T.S. Huang, Tracking articulated hand motion with Eigen dynamics analysis, Proc. 9th Int. Conf. on Computer Vision, 2003, pp. 1102–1109.