
Tentap: a piano-playing gesture recognition system based on ten fingers for virtual piano

Kyeongeun Seo, Korea University, Sejong City, South Korea, [email protected]

Hyeonjoong Cho* (corresponding author), Korea University, Sejong City, South Korea, [email protected]

ABSTRACT

We propose a system that recognizes 32 gestures covering all possible tap combinations and is applicable to the head-mounted display (HMD) for playing a virtual piano with an RGB-D camera. While several existing hand-interaction algorithms provide mid-air interaction using a sensor installed in front of the user, our system recognizes hand interaction with a planar object using a sensor installed over the user's head. It detects the location of the hands and recognizes tap-down and tap-up gestures on a planar object. The proposed system consists of three procedures: hand detection, hand pose estimation, and gesture classification. In particular, the hand pose estimation is performed with a 3D convolutional neural network (3DCNN) that uses both temporal and spatial information. The gesture classification is performed with support vector machine (SVM) classifiers on normalized 3D hand positions that are invariant to scale, viewpoint, and hand orientation. To train and validate the system, we collected 240K samples, each consisting of a depth image from an RGB-D camera and a hand pose from an optical motion-capture system. Preliminary results show that our method achieves a hand pose estimation error of about 10 mm and a gesture classification accuracy of 83%, about a 10% improvement over a state-of-the-art method.

CCS CONCEPTS

• Human-centered computing → Human computer interaction (HCI); Interaction techniques

ADDITIONAL KEYWORDS AND PHRASES

Hand Pose Estimation, Gesture Classification, Tap Detection, Human Computer Interaction

1 INTRODUCTION

In recent years, a significant amount of research has addressed estimating hand poses and recognizing hand gestures using consumer depth sensors [1,7]. We found that most of this work assumes that the hands are held in mid-air. This assumption limits its utility for virtual-instrument applications for two reasons [5]: users tire quickly, and there is no tactile feedback.


Figure 1: Experimental setup. (a) Front view of the environment, (b) top view of the environment, (c) the Kinect V2 depth camera, (d) an IR camera, (e) a hand with six attached IR tags.

To overcome these drawbacks, Barehanded Music allows users to place their hands on a planar object and tap their fingers on it [5]. That system, however, supports only six simple tap gestures.

We propose a system that recognizes 32 gestures for interacting with a planar object to play a virtual piano, using a sensor installed over the user's head as shown in Fig. 1. Our setup is easily applicable to an HMD, i.e., a depth sensor attached to the headset. We recognize 32 gestures consisting of a no-tap gesture and 31 tap gestures covering all possible tap-down combinations of five fingers, which is sufficient for piano chords. This system poses three significant challenges. First, a fingertip is heavily occluded at the moment a finger bends to tap on a planar object, because the sensor is installed above the user's head rather than above the hand. Second, it is hard to recognize gestures with a naive threshold method that uses only spatial information, because its classification performance depends more strongly on the accuracy of the input than a method that uses both spatial and temporal information. Third, there is no publicly released dataset of labeled hand gesture sequences with depth images and 3D hand joint positions with which to train and validate our algorithm.

To tackle these problems, we propose a system, Tentap, consisting of three procedures: hand detection, hand pose estimation, and gesture classification. In hand detection, we locate the hand region with traditional image processing techniques. In hand pose estimation, we estimate a hand pose expressed as the 3D positions of the five fingertips and the wrist. The hand pose is estimated by a trained 3DCNN [4] that takes a series of preceding images ending with the current one. In gesture classification, we gather a series of 3D hand positions to capture both spatial and temporal information, normalize them to obtain invariant positions, and train SVM classifiers on the normalized data. To train and validate the system, we collected a 240K dataset of real depth images with high-quality hand pose annotations, captured with a Kinect V2 and an OptiTrack system using five infrared (IR) cameras and six IR tags.


Figure 2: 3DCNN structure for hand pose estimation.

2 TENTAP SYSTEM

Our system takes a stream of depth images and produces three outputs: a translation offset, a hand pose, and gesture classes. The translation offset and the hand pose together give the absolute 3D position of the hand on the planar object. The gesture classes provide the 32 gestures. To extract these outputs, we perform the three procedures in sequence.

In hand detection, we obtain a segmented depth image as the input for hand pose estimation, together with a translation offset, which is the position of the user's wrist since the wrist is a stable landmark. First, we pre-process a depth image as follows: (1) build a background model, (2) subtract the background from the depth image using that model, (3) binarize the background-removed image, and (4) find the contours of the binarized image. Second, we find a rectangle by contour approximation and crop the depth image with the rectangle's coordinates. Third, we obtain the translation offset by comparing the widths of the contour, scanning from the end of the user's arm toward the hand, because the width just above the wrist is noticeably larger than that of the arm.
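A minimal sketch of this detection step is given below, assuming OpenCV and NumPy; the threshold value, the helper name detect_hand, and the width-jump heuristic for locating the wrist are our own illustrative choices rather than the authors' exact implementation.

```python
# Hypothetical hand-detection sketch (OpenCV + NumPy); thresholds and the
# wrist heuristic are illustrative assumptions, not the paper's exact code.
import cv2
import numpy as np

def detect_hand(depth, background, diff_thresh_mm=15):
    """Return (cropped depth image, wrist offset) or (None, None)."""
    # (1)-(2) subtract the background model from the incoming depth frame
    foreground = np.abs(depth.astype(np.int32) - background.astype(np.int32))
    # (3) binarize the background-removed image
    mask = (foreground > diff_thresh_mm).astype(np.uint8) * 255
    # (4) find contours and keep the largest one as the arm-plus-hand region
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None, None
    region = max(contours, key=cv2.contourArea)
    # bounding rectangle from the contour; crop the depth image with it
    x, y, w, h = cv2.boundingRect(region)
    crop = depth[y:y + h, x:x + w]
    # wrist (translation offset): scan row widths from the arm end toward the
    # hand and take the first row where the width grows sharply
    widths = (mask[y:y + h, x:x + w] > 0).sum(axis=1).astype(int)
    jumps = np.diff(widths) > 0.3 * widths.mean()
    wrist_row = int(np.argmax(jumps)) if jumps.any() else 0
    return crop, (x, y + wrist_row)
```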

In hand pose estimation, we infer 3D positions from a set of 20 segmented images using a 3DCNN [4], as shown in Fig. 2. We use residual connections for faster and easier optimization, and max-pooling layers to reduce dimensionality and provide some translation invariance. The FC1 layer reduces the dimensionality, and the FC2 layer produces the 18 outputs. To train the 3DCNN, we pre-process the inputs and outputs. For inputs, we resize the segmented images to 48×48 resolution and normalize the depth values to [0,1] by mean normalization. For outputs, we use relative 3D positions, obtained by subtracting the wrist position from each joint's 3D position. During training, we optimize the network by minimizing a loss defined as the difference between the expected and the estimated 3D positions. For the hyper-parameters, we set the learning rate to 0.0005 and use a momentum of 0.9; the batch size is 128 and the number of epochs is 65. We apply two regularization techniques, weight decay (r = 0.5%) and dropout (p = 0.5), to reduce overfitting.
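As a concrete illustration, the following PyTorch sketch mirrors the layer sizes of Fig. 2; the residual wiring, the padding, the dropout placement, and the reading of weight decay r = 0.5% as an L2 penalty of 0.005 are our assumptions rather than the authors' exact implementation.

```python
# A minimal PyTorch sketch following the layer sizes of Fig. 2; the residual
# wiring, padding, and loss/optimizer details are our assumptions.
import torch
import torch.nn as nn

def conv_block(c_in, c_out, k):
    # "same"-padded 3D convolution followed by ReLU
    return nn.Sequential(nn.Conv3d(c_in, c_out, k, padding=k // 2), nn.ReLU())

class TentapPoseNet(nn.Module):
    def __init__(self, num_joints=6):
        super().__init__()
        self.conv1 = conv_block(1, 4, 5)     # -> 4 x 20 x 48 x 48
        self.conv2 = conv_block(4, 4, 5)     # residual add with conv1 output
        self.pool1 = nn.MaxPool3d((1, 3, 3), stride=2, padding=(0, 1, 1))
        self.conv3 = conv_block(4, 16, 3)    # -> 16 x 10 x 24 x 24
        self.conv4 = conv_block(16, 16, 3)   # residual add with conv3 output
        self.pool2 = nn.MaxPool3d((1, 3, 3), stride=2, padding=(0, 1, 1))
        self.conv5 = conv_block(16, 32, 3)   # -> 32 x 5 x 12 x 12
        self.conv6 = conv_block(32, 32, 3)   # residual add with conv5 output
        self.pool3 = nn.MaxPool3d((1, 3, 3), stride=2, padding=(0, 1, 1))
        self.fc1 = nn.Linear(32 * 3 * 6 * 6, 512)
        self.drop = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(512, num_joints * 3)   # 18 relative coordinates

    def forward(self, x):                    # x: (batch, 1, 20, 48, 48)
        x = self.conv1(x); x = x + self.conv2(x); x = self.pool1(x)
        x = self.conv3(x); x = x + self.conv4(x); x = self.pool2(x)
        x = self.conv5(x); x = x + self.conv6(x); x = self.pool3(x)
        x = torch.flatten(x, 1)
        return self.fc2(self.drop(torch.relu(self.fc1(x))))

# Training setup matching the reported hyper-parameters.
model = TentapPoseNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0005, momentum=0.9,
                            weight_decay=0.005)   # r = 0.5% read as L2 penalty
loss_fn = nn.MSELoss()   # expected vs. estimated relative 3D positions
```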

In gesture classification, we recognize the 32 gestures from the outputs of five SVM classifiers. For both training and classification, a window of 20 consecutive 3D hand poses is collected, and each bundle of five consecutive poses is averaged. The averaged 3D positions are normalized with the Skeleton Quad descriptor described in [2]. Before training, we apply the SMOTE+Tomek technique to the training set to mitigate the class imbalance problem.
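A hedged sketch of how this stage could look with scikit-learn and imbalanced-learn follows; the per-finger binary split across the five classifiers and the simple wrist-centered normalization standing in for the Skeleton Quad descriptor of [2] are our reading of the text, not the authors' exact pipeline.

```python
# Illustrative gesture-classification sketch; joint index 5 as the wrist and
# one binary SVM per finger (2^5 = 32 combinations) are our assumptions.
import numpy as np
from sklearn.svm import SVC
from imblearn.combine import SMOTETomek

def pose_features(poses):
    """poses: (20, 6, 3) window of relative 3D joint positions."""
    # average each bundle of 5 consecutive frames -> 4 averaged poses
    avg = poses.reshape(4, 5, 6, 3).mean(axis=1)
    # simple wrist-centered, scale-normalized stand-in for the Skeleton Quad
    # descriptor of [2] (assumes the wrist is joint index 5)
    centered = avg - avg[:, 5:6, :]
    scale = np.linalg.norm(centered, axis=-1).max() + 1e-6
    return (centered / scale).reshape(-1)

def train_finger_classifiers(windows, finger_labels):
    """windows: (N, 20, 6, 3); finger_labels: (N, 5) binary tap-down flags."""
    X = np.stack([pose_features(w) for w in windows])
    classifiers = []
    for f in range(5):
        # rebalance each finger's training set with SMOTE + Tomek links
        Xb, yb = SMOTETomek(random_state=0).fit_resample(X, finger_labels[:, f])
        classifiers.append(SVC(kernel="rbf").fit(Xb, yb))
    return classifiers
```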

Figure 3: Percentage of frames whose hand pose prediction error falls within a given distance threshold.

3 EXPERIMENT

We set up an environment to collect data as shown in Fig. 1. The Kinect V2 resolution is 512 × 424 at a frame rate of 56 to 62 FPS. An OptiTrack system with IR cameras produced accurate 3D positions of the IR tags attached to a hand. Five subjects each performed 25 sessions, with every session including all gestures performed with both hands. We asked subjects to move their hands side to side, to reduce or increase the spacing between fingers, and to tap quickly or slowly on an unmarked plane. We captured 240K depth images, including 6.3K images of tap gestures. A tap gesture is defined as the moment after a finger moves down and finally reaches the planar-object threshold. We located the image of that moment and labeled it, together with its previous nine images, as a tap gesture.
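For illustration, a hypothetical labeling helper along these lines might look as follows; the fingertip-height input, the plane threshold, and the ten-frame window are our reading of the protocol described above.

```python
# Hypothetical helper for the tap-labeling rule: once a fingertip's height
# first drops below the plane threshold, that frame and its nine predecessors
# are marked as the tap gesture.
import numpy as np

def label_tap_frames(fingertip_height_mm, plane_threshold_mm, window=10):
    heights = np.asarray(fingertip_height_mm)
    labels = np.zeros(len(heights), dtype=bool)
    below = np.where(heights <= plane_threshold_mm)[0]
    if below.size:
        contact = below[0]                    # first frame reaching the plane
        labels[max(0, contact - window + 1):contact + 1] = True
    return labels
```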

To train and validate our algorithms, we performed hold-out cross-validation with a 70/30% train/test split of our dataset. We compared our 3DCNN to a state-of-the-art architecture, the basic network of [3], and obtained an average error of 10 mm. Fig. 3 shows the performance under the two evaluation metrics described in [6].
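For reference, the sketch below shows one way to compute these two metrics, i.e., the average joint error and the fraction of frames whose worst joint error stays below a distance threshold (the curve in Fig. 3); the function and argument names are ours.

```python
# Illustrative computation of the evaluation metrics of [6]; names are ours.
import numpy as np

def pose_metrics(pred, gt, thresholds_mm=(10, 20, 30, 40, 50)):
    """pred, gt: (N, 6, 3) joint positions in millimetres."""
    per_joint_err = np.linalg.norm(pred - gt, axis=-1)        # (N, 6)
    mean_error_mm = per_joint_err.mean()                       # ~10 mm reported
    worst_per_frame = per_joint_err.max(axis=1)                # (N,)
    frames_within = {t: 100.0 * float((worst_per_frame <= t).mean())
                     for t in thresholds_mm}
    return mean_error_mm, frames_within
```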

To evaluate our gesture classification algorithm, we implemented Barehanded Music [5] for comparison. The accuracy of our algorithm was 83%, a 10% improvement over Barehanded Music (75%). For 27 of the 32 gestures, the accuracy exceeded 80%.

4 CONCLUSION

We presented a system, Tentap, that estimates 3D hand positions and detects 32 gestures for a virtual piano. We collected a new dataset for training and validating our algorithms and outperformed existing methods on hand pose estimation and gesture classification in our setup. In future work, we will consider a piano's physical action and the processing time required to provide a more realistic virtual piano.

ACKNOWLEDGEMENTS

This work was supported by a grant from Industry and Energy (10085608, 2017).

REFERENCES
[1] De Smedt, Q., Wannous, H., & Vandeborre, J. P. 2016. Skeleton-based dynamic hand gesture recognition. In CVPRW 2016, pp. 1206-1214.
[2] Evangelidis, G., Singh, G., & Horaud, R. 2014. Skeletal quads: Human action recognition using joint quadruples. In ICPR 2014, pp. 4513-4518.
[3] Guo, H., Wang, G., Chen, X., Zhang, C., Qiao, F., & Yang, H. 2017. Region ensemble network: Improving convolutional network for hand pose estimation. arXiv preprint arXiv:1702.02447.
[4] Ji, S., Xu, W., Yang, M., & Yu, K. 2013. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[5] Liang, H., Wang, J., Sun, Q., Liu, Y. J., Yuan, J., Luo, J., & He, Y. 2016. Barehanded music: real-time hand interaction for virtual piano. In I3D 2016. ACM.
[6] Oberweger, M., Wohlhart, P., & Lepetit, V. 2015. Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807.
[7] Yi, X., Yu, C., Zhang, M., Gao, S., Sun, K., & Shi, Y. 2015. ATK: Enabling ten-finger freehand typing in air based on 3D hand tracking data. In UIST 2015. ACM.

[Figure 2 layer labels: 5×5×5 conv1 and conv2 (20×48×48×4); 1×3×3 max pool1 (10×24×24×4); 3×3×3 conv3 and conv4 (10×24×24×16); 1×3×3 max pool2 (5×12×12×16); 3×3×3 conv5 and conv6 (5×12×12×32); 1×3×3 max pool3 (3×6×6×32); FC1 (512); FC2 (18); output of 18; residual (+) connections.]

[Figure 3 plot: x-axis, distance threshold (mm), 0 to 50; y-axis, fraction of frames within distance (%), 0 to 100; curves: Basic resnet and Ours.]
