

Classifier Fusion for Gesture Recognition using a Kinect Sensor

Ye Gu, Qi Cheng and Weihua Sheng

School of Electrical and Computer Engineering

Oklahoma State University

Stillwater, OK, 74078, U.S.A

{ye.gu, qi.cheng, weihua.sheng}@okstate.edu

Abstract

Gesture recognition is becoming a more and more popular research topic since it can be applied to many areas, such as vision-based interfaces, communication and interaction. In this paper, experiments are implemented to verify the potential to improve vision-based gesture recognition performance using multiple classifiers. The proposed approach implements gesture recognition by combining decisions from a Dynamic Time Warping (DTW) based classifier and a Hidden Markov Model (HMM) based classifier. Both classifiers share the same features, which are extracted from the human skeleton model generated by a Kinect sensor. Fusion rules are then designed to make a global decision. The experimental results indicate that with the proposed fusion methods, the performance can be improved compared with either classifier alone.

1 Introduction

A gesture is a motion of the body that contains certain information. Edward T. Hall, a social anthropologist, claims that 60% of all our communication is nonverbal [1]. Gestures are widely used, from expressing emotions to conveying information. Therefore, more and more effort is devoted to gesture recognition. It has become a popular topic in computer science and language technology, with the goal of interpreting human gestures via mathematical algorithms [2].

Gesture recognition has a wide range of applications, including Human Machine Interaction (HMI), Human Robot Interaction (HRI) and Socially Assistive Robotics (SAR). This technology has the potential to change the way users interact with computers or robots by eliminating input devices such as joysticks, mice and keyboards for HMI, and robot controllers for HRI.

1.1 Related Works

Traditionally, gesture recognition is based on vision information. Depending on the type of input data, a gesture can be interpreted in different ways. The two most widely used approaches are skeletal-based and appearance-based algorithms. The former makes use of 3-D information about key body parts in order to obtain several important parameters, such as palm position or joint angles. Appearance-based systems, on the other hand, take 2-D images or videos for direct interpretation [3]. Some vision-based related work is listed in [4–6].

Besides vision-based approaches, wearable-sensor-based gesture recognition has been gaining attention. Due to advances in MEMS and VLSI technologies, several researchers use multiple sensors worn on the human body to record data of human movements; some related work can be found in [7–9].

In order to improve recognition performance, the concept of information fusion has been explored. Most of the fusion methods are decision-level fusion, and some fuse the decisions from different sensors. Zhang et al. [10] presented a framework for hand gesture recognition based on the information fusion of a three-axis accelerometer (ACC) and multichannel electromyography (EMG) sensor signals. A decision tree and multistream HMMs are utilized for decision-level fusion to generate a final decision. Chen et al. [11] presented a robust visual system that allows effective recognition of multiple-angle hand gestures in finger-guessing games. Three Support Vector Machine (SVM) classifiers were trained for the construction of the hand gesture recognition system, and the classified outputs were fused to improve system performance. The system can effectively recognize hand gestures, at over 93% accuracy, for different angles, sizes, and skin colors. Feature-level fusion has also been explored. In He's paper [12], a new feature fusion method for gesture recognition based on a single tri-axis accelerometer is proposed. Both time-domain and frequency-domain features are extracted, and recognition of the gestures is performed using SVMs. The average accuracy using the proposed fusion method is 89.89%, which is an improvement over using only one type of features.

These results suggest that if multiple sensors are used together, the performance can be improved. Unlike the previous work, the focus of this paper is to seek improvement in recognition performance from a single data source.

978-1-880843-88-8/ISCA CAINE/November 2012



Figure 1: Flowchart of the gesture recognition system.

We aim to find complementary information among different recognition algorithms, which may push the performance to its maximum when only limited sensors are available. Specifically, in this work we attempt to verify the possibility of performance improvement using a single sensor. We perform temporal human gesture recognition using a Kinect sensor. Despite the limitations of the device, it has achieved a large and growing adoption in the marketplace, which has brought gesture recognition through 3-D sensors into the mainstream. Through this camera, non-color-based features of human gestures can be extracted. These features are not sensitive to changes in lighting conditions or to common image noise; it is also convenient to integrate this platform with robots, such as mobile robots, for human-robot interaction purposes.

The rest of the paper is organized as follows. In Section II, the recognition and fusion approaches are introduced. Section III presents the implementation of the experiments. In Section IV, the experimental results are analyzed. Finally, in Section V, conclusions are drawn and some potential research directions are discussed.

2 Methodology

The problem arises naturally from the need to improve the classification rates obtained from individual classifiers. Fusion of data/information can be carried out at three levels of abstraction closely connected with the flow of the classification process: data-level fusion, feature-level fusion, and classifier fusion [13]. We apply the same features to different classifiers; therefore, it is reasonable to adopt classifier fusion approaches for performance improvement.

We have already developed a gesture recognition system in which HMMs are chosen to model and recognize the dynamics of the gestures [14]. In addition to the HMM-based classifier, a DTW classifier is developed which uses the same features, i.e., time sequences of the four left-arm-related joint angles. Currently we consider four joint angles; however, even if more joint angles were involved, after data preprocessing they would always be mapped into 1-D symbols, so the feature space does not change much. The flowchart is shown in Fig. 1. In the data preprocessing procedure of the DTW classifier, besides segmentation and symbolization, a GMM (Gaussian Mixture Model) and GMR (Gaussian Mixture Regression) are applied to the multiple training data sets to generate the template for each gesture.
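As a rough illustration of this template-generation step, the sketch below fits a GMM over (time, joint-angle) pairs pooled from several demonstrations and applies GMR, i.e., the conditional expectation of the angles given time. The use of scikit-learn, the function names, and the number of mixture components are our assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmr_template(demos, n_components=4, n_points=30):
    """Fit a GMM over (time, angles) pairs from several demonstrations and
    regress a generalized trajectory via GMR: E[angles | time].

    demos: list of (T_i, D) arrays of joint angles; returns an (n_points, D) template.
    """
    # Stack all demonstrations, prepending a normalized time index as column 0.
    data = np.vstack([np.column_stack([np.linspace(0.0, 1.0, len(d)), d])
                      for d in demos])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='full', random_state=0).fit(data)

    dims = data.shape[1] - 1
    template = np.zeros((n_points, dims))
    for idx, t in enumerate(np.linspace(0.0, 1.0, n_points)):
        cond_means, weights = [], []
        for k in range(n_components):
            mu, cov = gmm.means_[k], gmm.covariances_[k]
            mu_t, mu_x = mu[0], mu[1:]
            s_tt, s_xt = cov[0, 0], cov[1:, 0]
            # Conditional mean of the angles given the query time t.
            cond_means.append(mu_x + s_xt / s_tt * (t - mu_t))
            # Responsibility of component k at time t (1-D Gaussian in t).
            weights.append(gmm.weights_[k]
                           * np.exp(-0.5 * (t - mu_t) ** 2 / s_tt) / np.sqrt(s_tt))
        template[idx] = np.average(cond_means, axis=0, weights=np.asarray(weights))
    return template
```

The resulting multi-dimensional trajectory would then be symbolized with the saved clustering centroids to obtain a 1-D template per gesture, as described above.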

2.1 Dynamic Time Warping (DTW)

DTW is an algorithm for measuring the similarity between two sequences which may differ in timing or speed. DTW has been applied to video, audio, and graphics; indeed, any data which can be turned into a linear representation can be analyzed with DTW [15]. A well-known application is automatic speech recognition, where it copes with different speaking speeds. Due to the similarity between gesture and speech signals, it can be applied to gesture recognition as well. The first step is to find the DTW template for each gesture given the training data; here, GMM/GMR is used to find the generalized trajectory. In the recognition phase, the new testing data is compared with each template, and the total distance is calculated to measure the similarity.
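A minimal dynamic-programming sketch of the DTW distance and the resulting template matching is given below. The per-step cost (absolute symbol difference) and the length normalization are plausible choices we assume here, not necessarily the exact ones used in the paper.

```python
import numpy as np

def dtw_distance(seq, template):
    """Classic DTW: cost of the best monotone alignment between two 1-D sequences."""
    n, m = len(seq), len(template)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq[i - 1] - template[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j],      # insertion
                                   acc[i, j - 1],      # deletion
                                   acc[i - 1, j - 1])  # match
    return acc[n, m]

def classify_dtw(seq, templates):
    """Return (index of the closest template, normalized distances to every template)."""
    dists = np.array([dtw_distance(seq, t) / (len(seq) + len(t)) for t in templates])
    return int(np.argmin(dists)), dists
```

The classifier simply returns the index of the closest template together with the normalized distances, which later feed into the fusion rules.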

2.2 Fusion Method

Two mainstream fusion methods suitable for our case are selected. The first one attempts to determine a single classifier which is the most likely to produce the correct classification label for an input sample, while the other takes the information from each classifier into consideration to obtain a combined result.


2.2.1 Dynamic Classifier Selection (DCS)

DCS methods reflect the tendency to extract a single best classifier instead of mixing many different classifiers [16, 17]. As a result, only the output of the selected classifier is taken as the final decision. In our case, the two classifiers use different measures for classification: the HMM-based classifier decides on the model with the maximum likelihood, while the DTW classifier selects the template with the minimum total distance. Therefore, a measure which can be used to compare the performance of the different classifiers should be defined for the fusion purpose.

Here we define a parameter called the decision confidence. Its purpose is to compare the classification confidence of the two classifiers: if one classifier has higher confidence than the other, the contribution of the other classifier to the classification is ignored, so the final global decision is made based on the best local decision. The parameter is defined in the following equations.

For the HMM-based classifier, the decision confidence αH is defined as

\alpha_H = \frac{1/\log P(O \mid \lambda^*)}{\sum_{i=1}^{5} 1/\log P(O \mid \lambda_i)}    (1)

where O is the observation sequence, λi is the ith HMM, and λ∗ is the HMM with the highest likelihood for the given observation sequence O.

Similarly, for the DTW-based classifier, the decision confidence αD is defined as

\alpha_D = \frac{1/\mathrm{Dist}^*}{\sum_{i=1}^{5} 1/\mathrm{Dist}_i}    (2)

where Disti is the normalized total distance to template i, and Dist∗ is the minimum of these distances over all templates. The parameters defined for the local classifiers reflect the degree of ambiguity of the local decisions. Each parameter lies within the [0, 1] interval; the higher the parameter, the more confident the classifier is. The output of the classifier with the higher decision confidence is chosen as the global decision.
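A direct transcription of Eqs. (1)–(2) and of the DCS selection rule might look as follows; the inputs are assumed to be the five per-model log-likelihoods and the five normalized DTW distances computed for the current window.

```python
import numpy as np

def confidence_hmm(log_likelihoods):
    """Eq. (1): alpha_H from the per-model log-likelihoods log P(O|lambda_i)."""
    inv = 1.0 / np.asarray(log_likelihoods)
    best = int(np.argmax(log_likelihoods))   # model with the highest likelihood
    return inv[best] / inv.sum(), best

def confidence_dtw(distances):
    """Eq. (2): alpha_D from the normalized DTW distances Dist_i."""
    inv = 1.0 / np.asarray(distances)
    best = int(np.argmin(distances))         # template with the smallest distance
    return inv[best] / inv.sum(), best

def fuse_dcs(log_likelihoods, distances):
    """Dynamic classifier selection: keep the local decision with the higher confidence."""
    a_h, k_h = confidence_hmm(log_likelihoods)
    a_d, k_d = confidence_dtw(distances)
    return k_h if a_h >= a_d else k_d
```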

2.2.2 Classifier Structuring and Grouping (CSG)

Meanwhile, another fusion method belonging to the CSG family is also designed. Instead of taking only the best local decision, the outputs of the classifiers are combined into one decision. The idea of CSG fusion is to organize different classifiers in parallel, simultaneously and separately obtaining their outputs as input to a combination function, or alternatively applying several combination functions sequentially [16]. Here a combination function is designed as

\arg\max_k \left[ \frac{1/\log P(O \mid \lambda_k)}{\sum_{i=1}^{5} 1/\log P(O \mid \lambda_i)} + \frac{1/\mathrm{Dist}_k}{\sum_{i=1}^{5} 1/\mathrm{Dist}_i} \right]    (3)

In this case, the global decision is made collaboratively by the two classifiers. Given an observation sequence, the decision confidence for each class is calculated for both classifiers. Then the “total confidence” is obtained by adding up the decision confidences from each classifier. Finally, the class with the highest “total confidence” is the global decision.
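Equation (3) amounts to normalizing the two score vectors per class and summing them; a short sketch with the same assumed inputs as above:

```python
import numpy as np

def fuse_csg(log_likelihoods, distances):
    """Eq. (3): pick the class whose summed per-class confidences are largest."""
    inv_ll = 1.0 / np.asarray(log_likelihoods)
    inv_d = 1.0 / np.asarray(distances)
    total = inv_ll / inv_ll.sum() + inv_d / inv_d.sum()
    return int(np.argmax(total))
```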

3 Implementation

Five gestures are defined for the experiments: come, go, wave, rise up and sit down. One demonstration of the gesture “wave” is shown in Fig. 2. Each gesture is modeled by an HMM; meanwhile, a template for each gesture is created for DTW matching. All gestures are made with the left arm. Features are extracted from four joint angles: left elbow yaw and roll, and left shoulder yaw and pitch.

Figure 2: Snapshots of the gesture “wave”.

3.1 Training phase

The flowchart of the HMM training phase is shown in Fig. 3.


Figure 3: Flowchart of the HMM training phase.

Each model is trained with 15 sets of training data from one subject, recorded at a sampling rate of 20 Hz. A rule-based method is adopted for training data segmentation.


A starting pose is defined, and each training data set consists of the thirty data points (around 1.5 s) after the starting point. K-means clustering is then used to convert the joint-angle vectors into the observable symbols for the HMMs, and the centroids are saved for clustering further testing data. To balance computational complexity, efficiency and accuracy, we set the HMM parameters as follows: the number of states in the model is thirty, and the number of distinct observation symbols is six.
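A hedged sketch of this symbolization step, assuming scikit-learn's k-means as a tooling choice on our part: the centroids learned from the pooled training frames define the six-symbol observation alphabet and are reused at test time.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_symbolizer(segments, n_symbols=6):
    """Cluster all training frames; the centroids define the HMM observation alphabet.

    segments: list of (30, 4) arrays of joint angles (30 frames at 20 Hz per gesture).
    """
    frames = np.vstack(segments)                          # (num_frames, 4)
    return KMeans(n_clusters=n_symbols, n_init=10, random_state=0).fit(frames)

def symbolize(km, segment):
    """Map a (T, 4) joint-angle segment to a 1-D sequence of discrete symbols."""
    return km.predict(segment)
```

The same `symbolize` call would be applied to sliding windows of test data so that both the HMMs and the DTW templates operate on identical symbol sequences.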

3.2 Recognition phase

For real-time processing, the Robot Operating System (ROS) framework is adopted. The difference compared with our previous work [14] is the real-time classification node: both the HMM-based classifier and the DTW-based classifier are applied here, followed by the implementation of the two fusion methods.

To remove noise and reduce the false alarm rate, for HMM-based decisions we first use the variance of the input to judge whether it is a gesture at all; second, we set a threshold for each HMM, and if the likelihood is smaller than the threshold, the input is treated as noise. For DTW-based decisions, a similar threshold is also set. The flowchart of the recognition phase is shown in Fig. 4.
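The gating logic described above might be sketched as follows; the variance threshold, the per-model thresholds, and the exact rule for combining the two rejection tests are illustrative assumptions rather than the paper's tuned values.

```python
import numpy as np
# fuse_csg / fuse_dcs as defined in the earlier sketches.

def gated_decision(window, log_likelihoods, distances,
                   var_min, ll_thresholds, dist_thresholds):
    """Reject non-gesture windows before fusion.

    window: (T, 4) raw joint angles from the current sliding window.
    Returns None when the window is treated as noise.
    """
    ll = np.asarray(log_likelihoods)
    d = np.asarray(distances)
    # 1) A nearly static window (low variance in every joint angle) is not a gesture.
    if np.var(window, axis=0).max() < var_min:
        return None
    # 2) Per-model thresholds: if no HMM clears its likelihood threshold and no DTW
    #    template is close enough, the window is treated as noise.
    if not np.any(ll >= ll_thresholds) and not np.any(d <= dist_thresholds):
        return None
    return fuse_csg(ll, d)   # or fuse_dcs(ll, d)
```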


Figure 4: Flowchart of the recognition phase.

4 Experiment Results

4.1 Training results

The training results of the HMMs are shown in Fig. 5. As the number of training iterations increases, the likelihood of each model converges. The templates created for the DTW algorithm are shown in Fig. 6: the GMM/GMR algorithm outputs a generalized template for each gesture, and the centroids obtained from the training data are then used to cluster the 4-D templates into 1-D templates.

[Plot: log(likelihood) versus training iteration (0–40) for HMM1–HMM5.]

Figure 5: HMM training results.

[Plots: GMM/GMR output trajectories for templates 1–5 (top row) and templates 1–5 after clustering into discrete symbols (bottom row).]

Figure 6: Templates for each gesture.

4.2 Recognition results

• Offline results

For the offline experiment, the data is saved to a file for post-processing. Since the processing is offline, we ignore the computation cost and use a sliding window of 30 data points with a step of one data point. Two offline testing results are shown in Fig. 7 and Fig. 8. Each gesture is made with one stroke; that is, the subject stays still for a couple of seconds between any two consecutive gestures. Most of the gestures are recognized by both classifiers, while some gestures are recognized by only one classifier. Type I errors occur for the HMM-based classifier, and both Type I and Type II errors occur for the DTW-based classifier. One of the gestures shown in Fig. 7 and Fig. 8 is recognized by neither classifier. With the two fusion methods, most errors are eliminated; therefore, the performance of the fused decision is better than that of either single classifier. As the offline results indicate, the two fusion methods show no obvious difference in performance. In the testing phase, we have two subjects: one who participated in the training data collection (trainer) and one who did not (tester). However, it is found that once the tester is familiar with the predefined gestures, the performance of the trainer and the tester shows no significant difference.

[Plots: raw testing data (joint angles LER, LEY, LSR, LSP in degrees versus time, 50 ms steps); HMM and DTW recognition results against ground truth; and DCS and CSG fusion results.]

Figure 7: Offline recognition results 1.

[Plots: raw testing data, recognition results, and fusion results for a second test sequence, with the same layout as Fig. 7.]

Figure 8: Offline recognition results 2.

• Online results

For application purposes, the system should allow the user to perform gestures with any number of strokes, and most of the gestures should be recognized. Some statistical results are collected from the real-time experiment. Table 1 and Table 2 show the accuracy of the HMM and DTW classifiers, respectively. The sum of row i is the total number of times gesture i was made; column i of row i gives the number of times gesture i was detected; column 6 gives the number of missed detections for each gesture; and the final column gives the accuracy.

The results show that the HMM classifier performs better than the DTW classifier. The statistical results for the two fusion approaches are shown in Table 3 and Table 4, respectively. They indicate that the fusion approaches can improve the recognition accuracy compared to either single classifier. The DCS-based fusion method improves the performance because it captures the complementary characteristics of the two classifiers: as the offline results show, one classifier may detect a gesture that is missed by the other. The CSG-based fusion method improves the performance because it can detect gestures missed by both classifiers: since it takes into account the contributions of both individual classifiers, the total decision confidence can be high enough even when neither classifier is confident in its own detection.

Table 1: Accuracy of different gestures with trainer (HMM classifier)

Ground truth    1    2    3    4    5    6 (missed)    Test accuracy
1              37    0    0    0    0        4            .9024
2               0   40    0    2    0        4            .8696
3               2    0   36    0    0        3            .8780
4               0    0    0   45    0        5            .9000
5               3    0    0    0   44        3            .8800

Table 2: Accuracy of different gestures with trainer (DTW classifier)

Ground truth    1    2    3    4    5    6 (missed)    Test accuracy
1              35    0    0    0    0        6            .8537
2               0   38    4    0    0        4            .8261
3               0    0   36    0    0        5            .8780
4               0    3    0   42    0        8            .8400
5               0    0    6    0   39        5            .7800

Table 3: Accuracy of different gestures with trainer (DCS classifier)

Ground truth    1    2    3    4    5    6 (missed)    Test accuracy
1              38    0    0    0    0        3            .9268
2               0   42    0    0    0        4            .9130
3               0    0   39    0    0        2            .9512
4               0    0    0   46    0        4            .9200
5               0    0    0    0   44        6            .8800

Table 4: Accuracy of different gestures with trainer (CSG classifier)

Ground truth    1    2    3    4    5    6 (missed)    Test accuracy
1              38    0    0    0    0        3            .9268
2               0   43    0    0    0        3            .9347
3               0    0   38    0    0        3            .9268
4               0    0    0   45    0        5            .9000
5               0    0    0    0   44        6            .8800


5 CONCLUSIONS AND FUTURE WORK

In this work, two fusion methods are proposed for non-intrusive human gesture recognition through a Kinect sensor. Both HMM and DTW algorithms are used for the preliminary classification: HMMs are statistical models, while DTW is a deterministic method. The results show that, with appropriately designed fusion approaches, the classification performance can be improved.

The fusion methods presented here are basic and straightforward; more effort should be devoted to a theoretical explanation of their impact on the experimental results. Currently the two classifiers use the same information source. Different sensors, such as motion sensors, could be used together with the Kinect sensor to increase the diversity of the data and verify the possibility of further improvement, since fusing data with greater diversity may yield more promising results.

References

[1] G. Imai, “Body language and non-verbal communication.” [Online]. Available: http://www.csupomona.edu/~tassi/gestures.htm/

[2] M. Rehm, N. Bee, and E. André, “Wave like an Egyptian – accelerometer based gesture recognition for culture specific interactions,” in HCI 2008 Culture, Creativity, Interaction, 2008.

[3] V. I. Pavlovic, R. Sharma, and T. S. Huang, “Visual interpretation of hand gestures for human-computer interaction: A review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 677–695, 1997.

[4] S. Marcel, O. Bernier, and D. Collobert, “Hand gesture recognition using input-output hidden Markov models.”

[5] E. Sanchez-Nielsen, L. Anton-Canalis, and M. Hernandez-Tejera, “Hand gesture recognition for human-machine interaction,” in WSCG, 2004, pp. 395–402.

[6] J. Sung, C. Ponce, B. Selman, and A. Saxena, “Human activity detection from RGBD images,” CoRR, vol. abs/1107.0169, 2011.

[7] C. Zhu, W. Sun, and W. Sheng, “Wearable sensors based human intention recognition in smart assisted living systems,” in International Conference on Information and Automation, June 2008, pp. 954–959.

[8] M. Bashir, G. Scharfenberg, and J. Kempf, “Person authentication by handwriting in air using a biometric smart pen device,” in BIOSIG, 2011, pp. 219–226.

[9] C. Lee and Y. Xu, “Online, interactive learning of gestures for human/robot interfaces,” in IEEE International Conference on Robotics and Automation, 1996, pp. 2982–2987.

[10] X. Zhang, X. Chen, Y. Li, V. Lantz, K. Wang, and J. Yang, “A framework for hand gesture recognition based on accelerometer and EMG sensors,” IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. PP, pp. 1–13, Mar. 2011.

[11] Y.-T. Chen and K.-T. Tseng, “Multiple-angle hand gesture recognition by fusing SVM classifiers,” in IEEE International Conference on Automation Science and Engineering, Sept. 2007, pp. 527–530.

[12] Z. He, “A new feature fusion method for gesture recognition based on 3D accelerometer,” in 2010 Chinese Conference on Pattern Recognition (CCPR), Oct. 2010, pp. 1–5.

[13] D. Ruta and B. Gabrys, “An overview of classifier fusion methods,” Computing and Information Systems, vol. 7, no. 1, pp. 1–10, 2000.

[14] Y. Gu, H. Do, J. Evert, and W. Sheng, “Human gesture recognition through a Kinect sensor,” in IEEE International Conference on Robotics and Biomimetics (ROBIO), submitted, December 2012.

[15] P. Senin, “Dynamic Time Warping Algorithm Review,” Department of Information and Computer Sciences, University of Hawaii, Honolulu, Hawaii 96822, Tech. Rep. CSDL-08-04, Dec. 2008.

[16] T. K. Ho, J. Hull, and S. Srihari, “Decision combination in multiple classifier systems,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 1, pp. 66–75, Jan. 1994.

[17] K. Woods, W. P. Kegelmeyer, and K. Bowyer, “Combination of multiple classifiers using local accuracy estimates,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 4, pp. 405–410, Apr. 1997.