
Image and Vision Computing 31 (2013) 649–657


Recognizing hand gestures using the weighted elastic graph matching (WEGM) method

Yu-Ting Li, Juan P. Wachs
Industrial Engineering, Purdue University, West Lafayette, IN 47906, USA

This paper has been recommended for acceptance by Xavier Roca.
Corresponding author at: School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, USA. Tel.: +1 765 496 7380. E-mail address: [email protected] (J.P. Wachs).
0262-8856/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.imavis.2013.06.008

Article history: Received 26 June 2012; Received in revised form 24 March 2013; Accepted 19 June 2013

Keywords: Elastic bunch graph; Graph matching; Feature weight; Hand gesture recognition

Abstract

This paper proposes a weighted scheme for elastic graph matching hand posture recognition. Visual features scattered on the elastic graph are assigned weights according to their relative ability to discriminate between gestures. The weights' values are determined using adaptive boosting. A dictionary representing the variability of each gesture class is expressed in the form of a bunch graph. The positions of the nodes in the bunch graph are determined using three techniques: manually, semi-automatically, and automatically. Experimental results also show that the semi-automatic annotation method is efficient and accurate in terms of three performance measures: assignment cost, accuracy, and transformation error. In terms of recognition accuracy, our results show that the hierarchical weighting on features has more significant discriminative power than the classic method (uniform weighting). The weighted elastic graph matching (WEGM) approach was used to classify a lexicon of ten hand postures, and it was found that the poses were recognized with a recognition accuracy of 97.08% on average. Using the weighted scheme, computing cycles can be decreased by computing the features only for those nodes whose weight is relatively high and ignoring the remaining nodes. It was found that only 30% of the nodes need to be computed to obtain a recognition accuracy of over 90%.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

One of the main goals in the human–computer interaction (HCI) field is the study of innovative ways to enhance the user experience through natural communication and to develop the technology that enables such interaction. In this context, new trends include the development of a new generation of smaller, cheaper, and more versatile sensors [1,2]. Users' subjective satisfaction favors those systems that provide an enhanced interaction experience based on the naturalness and expressiveness that they offer [3]. Among the modalities relying on natural interaction, gestures appear explicitly as one of the main channels to interact with computers on many fronts, such as sign language interpretation [2], assistive technologies [4], and game control applications [5], to mention a few. Gestures are also being adopted in areas where touch can be a vehicle of infection transmission (e.g. browsing medical images in the operating room) [6] or in outpatient clinics [7]. Common approaches for vision-based hand posture recognition involve [8] (1) 3D model based methods [9], (2) appearance based models [10], and (3) shape analysis [11]. See [12] for a detailed review on gesture based interfaces.


1.1. Elastic graphs

Elastic graph matching (EGM) is a technique used for object recognition [13], where the object is represented by a labeled bunch graph. The bunch graph consists of a connected graph where the most salient features in the image are represented as a series of nodes. A bunch graph is built on a set of template images (also called a 'dictionary'). To compare the similarity between one template image within the bunch and a target image, the graph obtained from the template image is matched against the target image. Filter responses are computed at each node in the graph, and a cost function is minimized based on a metric applied to the node assignment. Over the years, EGM was implemented for tasks such as face recognition [13,14], face verification [15], and gesture recognition [16]. In Wiskott et al. [13], EGM was used to recognize the facial expressions in images where features were extracted from typical face parts (e.g. the pupils, the beard, the nose, and the corners of the mouth). Triesch et al. employed EGM to develop a classification approach for hand gestures against complex backgrounds [16]. EGM was also shown to have better performance than eigenfaces [17] and auto-association and classification neural networks [18]; EGM outperformed these two methods due to its robustness to lighting variation, face position, and expression changes. Another variant of EGM is morphological elastic graph matching (MEGM) [19], which was applied to frontal face authentication based on multi-scale dilation–erosion operations. One of the main drawbacks of this method is the computational complexity involved in the detection and classification processes.

Fig. 1. Similarity responses of bunch graph matched to an example image (A) with weight; (B) without weight.

Fig. 2. Weight distribution on an example image.

1.2. Motivation

One significant contribution of this paper is a procedure to determine the graph nodes' weights, thus validating the importance of weighting the features for classification purposes. We propose the weighted elastic graph matching (WEGM) method for hand posture recognition. In our method, those features with a higher likelihood to appear in the target image receive higher weight than those features which are less consistent with the graph model. Using weights allows us to allocate more computational resources to those features that are more discriminative while ignoring those features with lower importance [20]. Three metrics are used in our experiments to show that features with more discriminative power dominate the recognition performance of the system. A secondary contribution is a comparative study on efficient annotation techniques to create the bunch graphs.

The rest of the paper is organized as follows: in Section 2 the elastic bunch graph matching (EBGM) and adaptive boosting algorithms are described. In Section 3 the proposed annotation methods and the weighted hand gesture recognition algorithm (WEGM) are presented. Experimental results in Section 4 demonstrate the feasibility and efficiency of the proposed techniques. Finally, the discussion and conclusions are presented in Section 5.

2. Fundamentals of proposed algorithms

2.1. Elastic bunch graph

The section below briefly describes the principles of the elastic bunch graph; for more details see [16]. Bunch graphs were used to represent and recognize hand postures [16,21] in grayscale images. Each bunch graph is a collection of individual graphs representing a posture. Salient points on the underlying image are labeled as nodes of a graph over the object. The links connecting the nodes express some topological metric, such as the Euclidean distance. A Gabor jet is defined as the set of responses at specific locations in the images obtained when convolving a set of images (the dictionary set) with a bank of Gabor filters. The jet is a vector of complex responses at a given pixel x which follows the form:

\psi_k(x) = \frac{k^2}{\sigma^2} \exp\left(-\frac{k^2 x^2}{2\sigma^2}\right) \left[\exp(i\,k \cdot x) - \exp\left(-\frac{\sigma^2}{2}\right)\right]   (1)

where ψ_k(x) is the Gabor-based kernel with wave vector k, which describes the variation of spatial frequencies and orientations, represented by the indices ν ∈ [0, …, L − 1] and μ ∈ [0, …, D − 1]. Different values of k are found using:

k_{\nu\mu} = k_\nu \begin{pmatrix} \cos\phi_\mu \\ \sin\phi_\mu \end{pmatrix}, \quad \text{with } k_\nu = k_{\max} f^{\nu}, \quad \phi_\mu = \frac{\mu\pi}{D}   (2)

where L is the number of frequency levels and D is the number of orientations. The following parameters were chosen based on empirical studies [16]: f = 1/√2 and k_max = 1.7. The width of the Gaussian envelope function is σ/k with σ = 2.5. The jet is a complex vector consisting of L × D filter responses and is defined as J_j = a_j exp(iφ_j). J_j is used to compute the similarity of a target image and a bunch graph (obtained from dictionary images), whose node positions are annotated a priori. In this paper, the objects of interest are hand postures. Thus, the classification of a given image as a gesture is obtained by measuring the likelihood of two jets (one from the target image and one from the bunch graph). The similarity function using the magnitudes a_j and phases φ_j of the two jets to find a matching score between the target image and the bunch graph is stated as follows:

S_{pha}(J, J') = \frac{1}{2}\left(1 + \frac{\sum_j a_j a'_j \cos(\phi_j - \phi'_j)}{\sqrt{\sum_j a_j^2 \sum_j a'^2_j}}\right)   (3)

where a'_j and φ'_j are obtained from J'_j = a'_j exp(iφ'_j), the jet derived from the target image. The phase information varies rapidly between contiguous pixels, thus providing an advantageous means to obtain a good initial estimate of the position of the hand within the target image.

Fig. 3. 10 classes of sample hand gesture images after the matching process.
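To make Eqs. (1)–(3) concrete, the sketch below computes a Gabor jet at one pixel and the phase similarity between two jets. It is a minimal illustration, not the authors' implementation: the kernel grid size, the border handling, and the default values of L and D are assumptions, while k_max, f, and σ follow the text.

```python
import numpy as np

def gabor_kernel(k_vec, sigma=2.5, size=33):
    """psi_k of Eq. (1) sampled on a size x size grid around the origin."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    k2 = k_vec @ k_vec                                  # ||k||^2
    x2 = x**2 + y**2                                    # ||x||^2 per pixel
    envelope = (k2 / sigma**2) * np.exp(-k2 * x2 / (2 * sigma**2))
    wave = np.exp(1j * (k_vec[0] * x + k_vec[1] * y)) - np.exp(-sigma**2 / 2)
    return envelope * wave

def jet(image, pos, L=3, D=6, k_max=1.7, f=1 / np.sqrt(2)):
    """Complex jet J_j = a_j exp(i phi_j): L x D responses at pixel pos.

    pos is assumed far enough from the image border for the kernel window
    to fit; a real implementation would pad or crop.
    """
    r, c = pos
    responses = []
    for nu in range(L):
        k_nu = k_max * f**nu                            # Eq. (2)
        for mu in range(D):
            phi = mu * np.pi / D
            k_vec = np.array([k_nu * np.cos(phi), k_nu * np.sin(phi)])
            kern = gabor_kernel(k_vec)
            half = kern.shape[0] // 2
            patch = image[r - half:r + half + 1, c - half:c + half + 1]
            responses.append(np.sum(patch * kern))
    return np.asarray(responses)

def s_pha(J, Jp):
    """Phase-sensitive similarity of Eq. (3) between two jets."""
    a, phi = np.abs(J), np.angle(J)
    ap, phip = np.abs(Jp), np.angle(Jp)
    num = np.sum(a * ap * np.cos(phi - phip))
    den = np.sqrt(np.sum(a**2) * np.sum(ap**2))
    return 0.5 * (1.0 + num / den)
```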

2.2. Elastic bunch graph matching procedure

The classification task is done by finding the position of the template which maximizes the similarity between the bunch graph and the target image. The detailed elastic bunch graph matching (EBGM) procedure consists of three steps [16]:

• Approximately position the graph: the bunch graph is applied to the image and scanned in steps of 3 pixels in both the x and y directions. All the nodes in each bunch graph are visited and compared; the similarity score of the matching is given by a linear combination of the scores between the nodes in the bunch graph and the target image.

• Rescale the graph: the bunch graph can be resized by up to +20% and −20% (five scales are used) without relative changes of the edge lengths.

• Refine the position of each node: all nodes are allowed to vary by up to ±3 pixels from the position found in step 1. A penalty cost is introduced to prevent great distortion of the graph:

C = \frac{1}{M} \sum_i d(E_i)   (4)

where d(E_i) is the cost of the difference of the edges before and after shifting the graph, relative to the original lengths. Considering the distortions of the nodes, the total score of the matching becomes:

S_{total} = S_{pha} - \lambda C   (5)

where λ determines the extent to which solutions that depart from the original positions are penalized. In this paper, the value of λ is chosen to be the same as in the state-of-the-art approach [16] in order to perform the comparison analysis.

During the overall matching process, the best fitting jet is selected according to the maximum similarity score in Eq. (5) among the bunch graphs. The classification is determined by the maximum score over all the detectors (Max–Wins rule [22]).
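A condensed sketch of the three-step procedure above is shown below. It reuses jet() and s_pha() from the previous sketch. The bunch-graph container, the border margin, and the use of node displacement as a stand-in for the edge cost d(E_i) of Eq. (4) are assumptions; the 3-pixel scan, the five scales, and the ±3-pixel refinement follow the text, and λ is left as a parameter since its value is taken from [16].

```python
import numpy as np

def node_score(image, pos, bunch_jets_k):
    """Best-fitting jet in the bunch for one node (max over the dictionary)."""
    J = jet(image, pos)
    return max(s_pha(Jb, J) for Jb in bunch_jets_k)

def match_bunch_graph(image, offsets, bunch_jets, lam, margin=20):
    """offsets: (K, 2) node positions relative to the graph origin."""
    H, W = image.shape
    best = (-np.inf, None, None)
    # Steps 1-2: coarse scan in 3-pixel steps over five scales (+/-20%).
    for scale in (0.8, 0.9, 1.0, 1.1, 1.2):
        offs = np.round(np.asarray(offsets) * scale).astype(int)
        for r in range(margin, H - margin, 3):
            for c in range(margin, W - margin, 3):
                s = np.mean([node_score(image, (r + dr, c + dc), bj)
                             for (dr, dc), bj in zip(offs, bunch_jets)])
                if s > best[0]:
                    best = (s, (r, c), offs)
    # Step 3: refine each node by +/-3 pixels and apply the penalty of Eq. (4).
    _, (r, c), offs = best
    refined, total = [], 0.0
    for (dr, dc), bj in zip(offs, bunch_jets):
        cands = [(node_score(image, (r + dr + u, c + dc + v), bj), u, v)
                 for u in range(-3, 4) for v in range(-3, 4)]
        s, u, v = max(cands)
        refined.append((dr + u, dc + v))
        total += s
    shifts = [np.hypot(a - dr, b - dc)           # node shift as d(E_i) proxy
              for (a, b), (dr, dc) in zip(refined, offs)]
    return total / len(offs) - lam * np.mean(shifts)   # S_total of Eq. (5)
```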

2.3. Adaptive boosting

In this paper we use boosting to assign weighted values to the nodes within the bunch graph to maximize the recognition accuracy. These weights are in practice coefficients that maximize the discriminative function between feature vectors that are retrieved from specific positions in the hand and negative observations.

Boosting [23–25] is a general machine learning technique used to design, train, and test classifiers by combining a series of weak classifiers to create a strong classifier. This technique was adopted in our posture recognition algorithm to reflect the weight of nodes in the bunch graphs. In the boosting technique, a family of weak classifiers forms an additive model:

F(v) = \sum_{m=1}^{M} f_m(v)   (6)

where f_m(v) denotes a weak detector, v is a feature vector, and M is the number of iterations (or number of weak detectors) used to form a strong classifier F(v). When training, a weight is associated with each of the training samples and updated in each iteration. The updates increase the weights of the samples which are misclassified at the current iteration and decrease the weights of those which were correctly classified. The weights ω_i = e^{−z_i F(v_i)} for each training sample i with class label z_i are defined so that the cost of misclassification is minimized by adding a new optimal weak classifier that meets:

\arg\min_{f_m} \sum_{i=1}^{N} \omega_i \left(z_i - f_m(v_i)\right)^2   (7)

Upon choosing the weak classifier and adding it to F(v_i), the estimates are updated: F(v_i) = F(v_i) + f_m(v_i). Accordingly, the weights over the samples are updated by:

\omega_i = \omega_i e^{-z_i f_m(v_i)}   (8)


Fig. 4. Example RGB images of hand gestures after the matching process.


In this paper, the gentleboost cost function [23] is used to minimize the error.
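A minimal sketch of the boosting loop of Eqs. (6)–(8) follows, using regression stumps as the weak detectors f_m. The stump fitter is an illustrative choice rather than the exact gentleboost variant of [23]; labels z_i are assumed to be ±1.

```python
import numpy as np

def fit_stump(V, z, w):
    """Weighted LS regression stump f(v) = a if v_d > t else b (Eq. (7))."""
    best = (np.inf, None)
    for d in range(V.shape[1]):
        for t in np.unique(V[:, d]):
            m = V[:, d] > t
            a = (w[m] * z[m]).sum() / (w[m].sum() + 1e-12)    # mean above t
            b = (w[~m] * z[~m]).sum() / (w[~m].sum() + 1e-12)  # mean below t
            err = (w * (z - np.where(m, a, b)) ** 2).sum()
            if err < best[0]:
                best = (err, (d, t, a, b))
    return best[1]

def boost(V, z, M=50):
    """Additive model F(v) = sum_m f_m(v) of Eq. (6)."""
    n = len(z)
    w = np.full(n, 1.0 / n)          # omega_i, uniform at the start
    F = np.zeros(n)
    stumps = []
    for _ in range(M):
        d, t, a, b = fit_stump(V, z, w)
        f = np.where(V[:, d] > t, a, b)
        F += f                        # F(v_i) <- F(v_i) + f_m(v_i)
        w *= np.exp(-z * f)           # Eq. (8)
        w /= w.sum()
        stumps.append((d, t, a, b))
    return stumps, F
```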

3. Hand gesture recognition methodology

3.1. Node annotation techniques

The bunch graph was created by selecting a set of nodes for each image which belongs to the dictionary set. Each node has to represent the same landmark in the hand in every image in the set. The process of selecting these nodes is called "annotation". Two types of nodes were annotated: edge nodes (nodes lying on the contour of the hand) and inner nodes (nodes lying inside the contour). Three methods to accomplish the annotation task were compared in this paper: manual, semi-automatic, and automatic. Among these, the semi-automatic and automatic approaches were proposed as alternatives to the standard manual annotation approach. The manual method consists of selecting by hand every landmark in every image and ensuring that every landmark corresponds roughly to the same point in all the images in the dictionary set. In the automatic method, the landmarks are selected automatically by applying a Harris corner detector [26], which responds to highly textured regions within the hand. The semi-automatic approach is the same as the automatic approach except that it allows the user to manually correct those points that were detected automatically but had an offset with respect to visually identified landmarks. All the methods rely on the fact that the contour in every image was annotated manually for precise alignment.

Fig. 5. ROC curve for weight-based hand gesture recognition.

The difference among these three methods is the manner in which the nodes are selected within the hand region. For the two methods (automatic and semi-automatic), one reference graph is chosen and the remaining five graphs are aligned with respect to it. A linear assignment problem (LAP) is solved to find the points in each graph in the bunch that best correspond to the points in the reference graph. The objective is to find the least-displacement pairs of nodes from a larger set of candidates of the current graph. This is a minimization problem whose formulation is provided in Eqs. (9) and (10):

\min_{z_i} \sum_{i=1}^{N_1 \times N_2} z_i d_i   (9)

\text{s.t.} \quad \sum_{j=1}^{N_2} z_j = N_1, \quad z_j \in \{0, 1\}   (10)

where d_i = ‖(x_i1, y_i1) − (x_i2, y_i2)‖ is the Euclidean distance between the nodes (i = 1 … N_1, j = 1 … N_2), (x_i1, y_i1) is a node of the reference graph, and (x_i2, y_i2) is a node of the graph to be matched. The detailed process is summarized in the following Algorithm table.

Fig. 6. Confusion matrix for 10 gestures.

Algorithm. Node Annotation
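The body of the algorithm table did not survive extraction. As a hedged reconstruction of the alignment step it summarizes (Eqs. (9)–(10)), the sketch below matches the corner candidates of a graph to the reference graph; the use of SciPy's Hungarian solver and all names are implementation choices, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_to_reference(ref_pts, cand_pts):
    """Pick, for each reference node, the candidate minimizing total distance.

    ref_pts  : (N1, 2) reference-graph node coordinates.
    cand_pts : (N2, 2) candidate corners of the current graph, N2 >= N1.
    Returns the (N1, 2) matched subset of cand_pts, in reference order.
    """
    # d_i of Eq. (9): Euclidean distance of every reference/candidate pair.
    d = np.linalg.norm(ref_pts[:, None, :] - cand_pts[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(d)   # z assignment under Eq. (10)
    return cand_pts[cols[np.argsort(rows)]]
```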

The effectiveness of the proposed annotation methods is evaluated by three different metrics: (1) costs entailed to match the nodes: relative displacements of the nodes with respect to each other in the different graphs result in a matching 'cost'; the Euclidean distances of each pair of nodes are summed to give the total matching cost. (2) Transformation errors: errors resulting from affine transformation disparities between the reference graph and the ones aligned to it [27] (see Eq. (11) below). (3) Errors in recognition accuracy: errors observed once the bunch graph is built and used to classify the postures in the testing stage.

E = \left[\Omega_1 - R^* \Omega_2 - t^*\right]^T \left[\Omega_1 - R^* \Omega_2 - t^*\right]   (11)

where R* is the optimal rotation (θ) and scaling (s) matrix (a least-squares minimization approach is used to reach the optimum) applied to Ω:

R^* = \begin{pmatrix} s\cos\theta & -s\sin\theta \\ s\sin\theta & s\cos\theta \end{pmatrix}   (12)

where Ω is the vector representation of the coordinates of the points in each image, (x_il, y_il), l ∈ {1,2}. Also, t* is the optimal translation parameter.
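Eqs. (11)–(12) amount to fitting the best scaled rotation and translation between two node sets and reporting the residual. The closed-form complex Procrustes solution below is an assumption about the paper's "least-square minimization approach"; function and variable names are illustrative.

```python
import numpy as np

def transformation_error(omega1, omega2):
    """E of Eq. (11) for two (N, 2) node-coordinate arrays."""
    z1 = omega1[:, 0] + 1j * omega1[:, 1]
    z2 = omega2[:, 0] + 1j * omega2[:, 1]
    z1c, z2c = z1 - z1.mean(), z2 - z2.mean()   # centering absorbs t*
    # Optimal scaled rotation alpha = s * exp(i*theta) of Eq. (12),
    # by least squares: alpha = <z2c, z1c> / <z2c, z2c>.
    alpha = np.vdot(z2c, z1c) / np.vdot(z2c, z2c)
    residual = z1c - alpha * z2c
    return float(np.sum(np.abs(residual) ** 2))
```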

The semi-automatic approach allows the user to manually correct those points that were detected automatically. The correction is done by subjective observation, while the automatic method does not allow repositioning the nodes once found. To this end, the tradeoff between the semi-automatic and automatic approaches is the time saved versus the higher matching cost and transformation error, which in turn affect the recognition accuracy.

3.2. Weighting scheme on features

We propose to assign a weight to each node of the graph. The standard approach assumes that equal weights are given to every node in the bunch graph when determining the similarity function for graph matching. However, some features of the hand are more dominant than others in terms of their discriminative power. Thus, the importance (weight) of the nodes should be considered to reflect this attribute within the total similarity metric S_pha. The similarity metric is weighted by the coefficient vector c that represents the discriminatory degree of each node:

S_{pha} = \sum_{k=1}^{K} c_k \, S_{pha}\left(B^{(k)}, J\left(x^{(k)}\right)\right)   (13)


where B is the bunch graph with node index k, and J(x) is the jet computed from the target image at node position x. The adaptive boosting described in Section 2.3 is used to train a strong classifier to classify the observed vectors based on the score S_pha. For different hand postures, the classifiers are trained separately. Positive samples (true hits) are created by extracting the feature vectors assigned to the nodes in the positive images. Negative samples are feature vectors extracted by searching for the best matching location of a bunch graph in the negative set of images from the training set (this method is broadly used to find negative instances that could potentially be recognized as true hits). Fig. 1 shows the similarity response of a sample image when the similarity metric is computed with and without weight assignment (the bunch graph is scanned over the entire image with an increment of 4 pixels).

As can be seen, the similarity response when weight is used (the left image) is more 'focused' on a single point than the response without weight (the right image). In other words, the similarity scores over the entire image exhibit a clear global maximum when weight is applied. The more focused the response, the fewer the local maxima, which provides a more effective and reliable decision criterion.

Fig. 7. Confusion matrix for 16 facial expressions.
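A small sketch of the weighted score of Eq. (13) and the Max–Wins decision used for classification is given below. Here node_scores holds S_pha(B^(k), J(x^(k))) for each of the K nodes and c holds the boosted coefficients c_k; all names are illustrative.

```python
import numpy as np

def weighted_similarity(node_scores, c):
    """S_pha = sum_k c_k * S_pha(B(k), J(x(k))), Eq. (13)."""
    return float(np.dot(c, node_scores))

def classify(per_class_scores):
    """Max-Wins rule [22]: the gesture detector with the highest score wins."""
    return int(np.argmax(per_class_scores))
```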

Fig. 2 shows the importance of the nodes represented by a heat map (the edges are omitted to emphasize the nodes' coloring system). Warm colors represent high weight, while cold ones represent low weight. As can be seen, those nodes with positions that blend with the background are assigned lower weights (yellow color). On the other hand, those nodes over the rim of the hand are assigned higher weights (warmer colors), since they are more distinct from the background and more descriptive of the hand.

Fig. 8. Examples of 16 facial expression images used with matched bunch graphs.

3.3. Determining dominance of features

According to the testing results shown in Section 3.2, the ability to better discriminate features leads to a better decision surface, which enables more reliable classification. Furthermore, the fact that some features are assigned lower hierarchies indicates that their effect on the classification performance is low compared to those features assigned higher hierarchies. Thus, the computation of these features can be skipped without substantially affecting the recognition accuracy. To explore the effect of the number of selected features on the algorithm, the performance (the classification results) of three scenarios was studied, where in each the features (the nodes) were selected in a different fashion (a short sketch follows the list):

1. Selection by weight: sort the features by their assigned weights in descending order and select the N highest ranked features.

2. Selection by the magnitude of similarity: sort the features by their magnitude of similarity score and select the N highest ranked features.



3. Random selection: randomly order the features in a list, and select the N highest ranked features.

Fig. 9. Recognition accuracy vs. reduced features.
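For completeness, the three selection policies compared in Fig. 9 are sketched below. Only the top fraction of nodes is then kept when evaluating Eq. (13); the function signature and names are illustrative.

```python
import numpy as np

def select_nodes(weights, node_scores, frac, policy, rng=None):
    """Return the indices of the nodes kept for classification."""
    K = len(weights)
    n = max(1, int(round(frac * K)))
    if policy == "weight":          # 1. highest boosted weights c_k
        order = np.argsort(weights)[::-1]
    elif policy == "similarity":    # 2. highest similarity magnitude
        order = np.argsort(node_scores)[::-1]
    else:                           # 3. random order
        rng = rng if rng is not None else np.random.default_rng()
        order = rng.permutation(K)
    return order[:n]
```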

4. Experimental results

The proposed methods were validated with Triesch's hand posture dataset [28]. The dataset consists of 10 different hand gestures against complex, light, and dark backgrounds performed by 24 people, resulting in a total of 710 gray-scale 128 × 128 pixel images. Each bunch graph was created by selecting two instances of a given posture performed by three subjects against light and dark backgrounds (a total of six instances in each bunch graph). This constitutes the dictionary used. The geometry of the nodes (their positions) on the bunch graph was averaged from the six graphs. Overall, 60 images were used to create the bunch graphs. The remaining 650 images were used for the training and testing datasets. The results presented correspond to the classification performance on the features extracted from those 650 images. Examples showing the WEGM's detection performance are shown in Fig. 3. The corresponding bunch graphs were fitted to 10 hand postures. Each image was scanned in increments of 4 pixels in the horizontal and vertical directions.

Fig. 10. Performance measures (matching cost, recognition error, transformation error) for different annotation techniques.

Colors from warm to cold were used to represent the nodes' weights. Light blue lines indicate the edges. The edges were allowed to distort to reflect the variation of gesture among images within the same category.

Several RGB images were captured to test the WEGM detection algorithm. The images were resized to 128 × 128 pixels and the bunch graphs were scanned over each image in increments of 4 pixels. The matching results for several examples of three hand gestures against light, dark, and complex backgrounds are shown in Fig. 4.

4.1. Hand gesture classification

The receiver operating characteristic (ROC) curves are presented in Fig. 5. The curves were generated using 5-fold cross-validation for the 10 hand gestures. A true positive was determined based on whether the classification score was greater than a given threshold (found empirically); otherwise it was regarded as a miss. When an observation was classified as a certain gesture but was in fact a true negative, the event was considered a false alarm. Following this guideline, ROC curves were plotted to show the relationship between true positives and false alarms among the 10 classes, one for each hand gesture. The average recognition accuracy was 91.84%. This value was found by averaging all 10 recognition accuracies at the operational point (the point closest to the top left corner of the graph).

The second metric used to evaluate the hand gesture classification performance was the maximum score over the 10 classifiers (Max–Wins rule). This metric always assures a single detection (correct or incorrect) and no false positive cases. If the maximum score fell on the incorrect class, that gesture was misclassified (it was considered a confusion). The confusion matrix (see Fig. 6) was calculated by comparing the scores delivered by each classifier on a given sample image, and taking the maximum over all the classifiers. The average accuracy of correct classification over the confusion matrix reached 97.08%. Both these values show better performance than those reported in the literature [16,21]. In order to show that the improvement is significant, a paired two-sample t-test (650 observations each) for equal means was conducted on the classification results of WEGM and EGM [16]. The one-tailed p-value (1.5665E−06 < .05) of the statistical test indicated that the improvement in classification performance is statistically significant.

4.2. Facial expression classification

To illustrate the generality of the weighted feature approach, the algorithm was tested on the MacBrain Face Stimulus database [29]. The dataset consists of 16 emotions performed by 40 people. The dictionary includes 96 faces used to build the bunch graph per face. The nodes of the bunch graph are annotated on fiducial points such as the inner and outer corners of the eyes, the inner and outer ends of the eyebrows, the tip of the nose, and the corners of the mouth. We use 480 images (30 faces per facial expression) to conduct 5-fold cross-validation. The confusion matrix (see Fig. 7) of classified emotions presents 96.88% accuracy of correct classification. Fig. 8 shows the 16 facial expressions when the bunch graph showing the weighted nodes is applied. The results show that the WEGM algorithm is also applicable to other types of human feature (facial expression) classification.

4.3. Weight-based feature selection

Three different scenarios were studied to validate the effect of the number of features selected (and how they were selected) on the classification accuracy. Fig. 9 shows the recognition results when applying the three different feature reduction policies (weights, magnitude of similarity, and random). Once the features/nodes were sorted, only the top N percent of the sorted list was selected to determine the effects on recognition accuracy. Nine cases were evaluated, from 100% (no reduction) to 10%, in decrements of 10% of the total number of features. The responses are presented in Fig. 9. It can be seen that up to 30% of the nodes can be discarded without reducing the recognition accuracy below 90% when the first selection policy was applied. The recognition accuracy decreases at a pace slower than in the other two scenarios (selection by magnitude of similarity and random selection). The worst results occurred when features were discarded randomly. When the second scenario was applied (the features were selected by the sorted magnitude of similarity score), 50% of the nodes were required to assure 90% recognition accuracy. It can be seen that in this scenario the overall performance was not as good as in the first scenario, but still better than when the selection of nodes was random. Thus, the experimental results show that using the WEGM method, the computation time can be reduced by 30% by discarding those nodes which have no significant effect on the overall recognition accuracy.

4.4. Performance on different annotation techniques

In this section the performance of each annotation technique used to create the bunch graph is discussed. In the automatic and semi-automatic methods, candidate nodes were found in highly textured regions inside the hand. The semi-automatic method allowed nodes to be adjusted manually after being detected. The results displayed in Fig. 10 illustrate the performance measures for the three different methods used to annotate the nodes in the bunch graphs. Three classifiers were trained using the three different annotation methods, and tested with light and dark background images. When using the semi-automatic technique tested with light and dark background images, the recognition error (7.88%) was less than for the other two methods (9.07% and 10.74% for manual and automatic, respectively). The normalized matching cost was the highest for the automatic technique due to the inconsistency of the nodes' positions among the graphs. For a similar reason, the normalized transformation error was also the highest for the automatic technique. However, the costs and errors of matching between the manual and semi-automatic approaches were comparable. The recognition error was slightly greater for the manual case. Although the matching costs and errors of the semi-automatic method were slightly greater than those of the manual method, these measures were substantially less than those of the automatic method. Therefore, there is a trade-off between recognition error and the speed of creating the annotation, which is expressed by the high matching costs and transformation errors. The proposed semi-automatic technique is an efficient annotation method for building up the bunch graph faster when the desired recognition accuracy is acceptable.

5. Conclusion

This research proposed an enhanced graph-based approach incorporating the concept of node weight (WEGM) to recognize a lexicon of ten hand gestures. The WEGM algorithm was validated using a standard dataset of postures against three different backgrounds: light, dark, and complex. The WEGM algorithm classified the postures with a recognition accuracy of 97.08% on average. This shows that introducing weight in the bunch graphs improves the overall performance. The reason for this is that WEGM augments the discriminatory power of the nodes for each gesture with respect to the remaining gestures. Furthermore, by computing the features of only the nodes with a relatively high weight and discarding the rest, the recognition performance is not significantly affected. Thus, the WEGM approach improves the recognition performance while reducing the computational time required for computing the features.

Additionally, semi-automatic and automatic annotation techniques were proposed which allow the flexible selection of nodes that are consistent between images of the same posture. The semi-automatic approach was shown to deliver the highest recognition accuracy (lowest recognition error), though not the least matching and transformation costs, compared to the manual and automatic methods for constructing the bunch graphs.

Future work includes extending the WEGM algorithm to include depth information along with color. One simple approach would be to use the range information to obtain a good initial region of interest for matching the WEGM with the target image. This will result in a smaller search space and will reduce the overall computation time. In addition, we are interested in experimenting with multimodal images (thermal, depth, and color) and suggesting an efficient method to combine these modalities to enhance overall performance. We plan to experiment with other features, like wavelets and HOGs, and to include larger and more complex datasets. Finally, we will develop a parallel implementation of this algorithm for real-time detection and classification of hand gestures.

References

[1] R. Poppe, R. Rienks, Evaluating the future of HCI: challenges for the evaluation of emerging applications, Proceedings of the International Conference on Artificial Intelligence for Human Computing, 4451, 2007, pp. 234–250.

[2] S.M.M. Roomi, R.J. Priya, H. Jayalakshmi, Hand gesture recognition for human–computer interaction, J. Comput. Sci. 6 (9) (2010) 1002–1007.

[3] S.S. Rautaray, A. Agrawal, Interaction with virtual game through hand gesture recognition, International Conference on Multimedia, Signal Processing and Communication Technologies (IMPACT), Dec 2011, pp. 244–247.

[4] Y.-J. Chang, S.-F. Chen, A.-F. Chuang, A gesture recognition system to transition autonomously through vocational tasks for individuals with cognitive impairments, Res. Dev. Disabil. 32 (6) (2011) 2064–2068.

[5] T. Leyvand, C. Meekhof, Yi-Chen Wei, Jian Sun, Baining Guo, Kinect identity: technology and experience, Computer 44 (4) (2011) 94–96.

[6] J.P. Wachs, H.I. Stern, Y. Edan, M. Gillam, J. Handler, C. Feied, M. Smith, A gesture-based tool for sterile browsing of radiology images, J. Am. Med. Inform. Assoc. 15 (3) (2008) 321–323.

[7] K. Wood, C.E. Lathan, K.R. Kaufman, Development of an interactive upper extremity gestural robotic feedback system: from bench to reality, Annual International Conference of the IEEE on Engineering in Medicine and Biology Society (EMBC), Sept 2009, pp. 5973–5976.

[8] S. Bilal, R. Akmeliawati, M.J. El Salami, A.A. Shafie, Vision-based hand posture detection and recognition for sign language – a study, 2011 4th International Conference on Mechatronics (ICOM), May 2011, pp. 1–6.

[9] M. de La Gorce, D.J. Fleet, N. Paragios, Model-based 3D hand pose estimation from monocular video, IEEE Trans. Pattern Anal. Mach. Intell. 33 (9) (2011) 1793–1805.

[10] S. Koelstra, M. Pantic, I. Patras, A dynamic texture-based approach to recognition of facial actions and their temporal models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (11) (Nov 2010) 1940–1954.

[11] Weiqi Yuan, Lantao Jing, Hand-shape feature selection and recognition performance analysis, 2011 International Conference on Hand-Based Biometrics, Nov 2011, pp. 1–6.

[12] J.P. Wachs, M. Kölsch, H. Stern, Y. Edan, Vision-based hand-gesture applications, Commun. ACM 54 (2) (2011) 60–71.

[13] L. Wiskott, J.-M. Fellous, N. Kruger, C. von der Malsburg, Face recognition by elastic bunch graph matching, Int. Conf. Image Process. 1 (1997) 129–132.

[14] H.-C. Shin, S.-D. Kim, H.-C. Choi, Generalized elastic graph matching for face recognition, Pattern Recognit. Lett. 28 (9) (2007) 1077–1082.

[15] A. Tefas, A. Kotropoulos, I. Pitas, Face verification using elastic graph matching based on morphological signal decomposition, Signal Process. 82 (6) (2002) 833–851.

[16] J. Triesch, C. von der Malsburg, Robust classification of hand postures against complex backgrounds, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Oct 1996, pp. 170–175.

[17] M.A. Turk, A.P. Pentland, Face recognition using eigenfaces, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Jun 1991, pp. 586–591.

[18] Jun Zhang, Yong Yan, M. Lades, Face recognition: eigenface, elastic matching, and neural nets, Proc. IEEE 85 (9) (Sept 1997) 1423–1435.

[19] C. Kotropoulos, A. Tefas, I. Pitas, Frontal face authentication using morphological elastic graph matching, IEEE Trans. Image Process. 9 (4) (Apr 2000) 555–560.

[20] Y.-T. Li, J.P. Wachs, Hierarchical elastic graph matching for hand gesture recognition, to appear, 17th Iberoamerican Congress on Pattern Recognition (CIARP 2012), 2012.

[21] P.P. Kumar, P. Vadakkepat, Loh Ai Poh, Graph matching based hand posture recognition using neurobiologically inspired features, 11th International Conference on Control Automation Robotics & Vision (ICARCV), Dec 2010, pp. 1151–1156.

[22] J.H. Friedman, Another approach to polychotomous classification, Technical Report, Stanford Department of Statistics, 1996.

[23] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Ann. Stat. 28 (2) (2000) 337–374.

[24] A. Torralba, K.P. Murphy, W.T. Freeman, Sharing features: efficient boosting procedures for multiclass object detection, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2, July 2004, pp. II-762–II-769.

[25] A. Torralba, K.P. Murphy, W.T. Freeman, Sharing visual features for multiclass and multiview object detection, IEEE Trans. Pattern Anal. Mach. Intell. 29 (5) (May 2007) 854–869.

[26] C. Harris, M. Stephens, A combined corner and edge detector, Fourth Alvey Vision Conference, Manchester, UK, 1988, pp. 147–151.

[27] M. Sonka, V. Hlavac, R. Boyle, Image Processing, Analysis, and Machine Vision, 3rd ed., Thomson Engineering, Toronto, Canada, 2008.

[28] Sebastien Marcel hand posture and gesture datasets: Jochen Triesch static hand posture database, http://www.idiap.ch/resource/gestures/.

[29] N. Tottenham, J. Tanaka, A.C. Leon, T. McCarry, M. Nurse, T.A. Marcus, A. Westerlund, B.J. Casey, C.A. Nelson, The NimStim set of facial expressions: judgments from untrained research participants, Psychiatry Res. 168 (3) (2009) 242–249.