
RnD Project

Hand Pose Estimation

Author:Pratik Kalshetti

Supervisor:Parag Chaudhuri

May 1, 2017

Department of Computer Science and EngineeringIndian Institute of Technology Bombay

India


Abstract

Real-time and accurate hand pose estimation can open new doors for making the entire world more interactive. Existing systems for hand pose estimation fail to produce an accurate and physically valid pose in real-time. The approach in this project tackles these problems by applying a discriminative model for real-time prediction of joint locations and by incorporating kinematic constraints to produce a geometrically valid pose, thus leading to accurate pose estimation.

The discriminative model is a deep network consisting of convolutional, fully connected and dropout layers. The kinematic constraints are incorporated as a kinematic layer towards the end of the network, which acts as a prior for the hand pose. The results are evaluated on the NYU hand pose dataset and compared with state-of-the-art methods.


Contents

1 Introduction
  1.1 Background
  1.2 Applications
  1.3 Challenges
  1.4 Aim

2 Related Work
  2.1 Types of Techniques
    2.1.1 Model-based (Generative)
    2.1.2 Learning-based (Discriminative)
    2.1.3 Hybrid (Discriminative and Generative)
  2.2 Issues in Existing Techniques
    2.2.1 Hand Prior
    2.2.2 Non-linear Regression
    2.2.3 Interaction with Scene

3 Approach
  3.1 Overview
  3.2 Preprocessing
  3.3 Deep Learning
    3.3.1 Base Network
    3.3.2 Constraint Network
  3.4 Joint Loss Function

4 Evaluation
  4.1 Implementation
  4.2 Data
  4.3 Qualitative Evaluation
  4.4 Quantitative Evaluation


List of Figures

1.1 Rendering human hand in real-time (Image Source: [Wan])
1.2 Applications of hand pose estimation (Image Source: [Wan])
1.3 Challenges of hand pose estimation
1.4 Aim of hand pose estimation

2.1 Existing approaches (Image Source: [The15])
2.2 Bottleneck layer to enforce constraints (Image Source: [OWL15a])
2.3 Multi-view CNN method (Image Source: [Ge+16])

3.1 Hand detection (Image Source: [Zha+17])
3.2 Network architecture
3.3 Hand model

4.1 NYU hand pose dataset
4.2 Results of experiments
4.3 Comparison with state-of-the-art methods


Chapter 1

Introduction

1.1 Background

An immersive virtual environment integrates computer graphics with various input and display technologies to create the illusion of immersion in a computer-generated reality. The user interacts with a world containing seemingly real 3-D objects in a 3-D space that respond interactively to each other and to the user [Bry05].

Keyboards and mice work fine for computers, but to feel fully immersed in their virtual surroundings, users need to feel that what they are doing with their body corresponds accurately to what they are doing on-screen, which is where pose estimation comes into play.

(a) Real Environment (b) Virtual Environment

Figure 1.1: Rendering human hand in real-time (Image Source: [Wan])

Due to recent rapid advances in display and tracking technologies, virtual reality (VR) and augmented reality (AR) are becoming popular. But perhaps just as important as the ability to display virtual objects to the user is the ability of the user to interact with those virtual objects and the environment as naturally as possible. In the real world, we use our hands to reach out and touch objects, which react to our interactions according to the laws of physics [Tay+16]. So it is necessary to develop systems which allow the use of hands for interaction (see Fig. 1.1).

1.2 Applications

The human hand is remarkably dextrous, capable of high-bandwidth communication such as typing and sign language. Hand motion is a crucial component of non-verbal communication, plays an important role in the animation of humanoid avatars, and is central to numerous human-computer interfaces. Hand pose estimation is now gaining traction in the research community as a natural step towards a complete system for online human communication in virtual environments. Recent industrial trends in interaction systems for virtual environments have led to the development of software packages for the processing of RGBD data, like the Intel RealSense SDK, and of purpose-designed hardware, like the Leap Motion and Nimble sensors.

(a) Commercial products (b) Education

Figure 1.2: Applications of hand pose estimation (Image Source: [Wan])

Computer interfaces based on the human hand have so far been limited in their ability to accurately and reliably estimate the pose of a user's hand in real-time. If these limitations can be lifted, hand pose estimation will become a foundational interaction technology for a wide range of applications including immersive virtual reality, assistive technologies, robotics, home automation and gaming.

1.3 Challenges

Hand pose estimation is challenging: hands can form a variety of complex poses due to their many degrees of freedom (DoF), and come in different shapes and sizes.

(a) Self-occlusion (b) Noisy data (c) Self-similarity

Figure 1.3: Challenges of hand pose estimation

Despite some notable successes, solutions that augment the user's hand with gloves or markers can be cumbersome and inaccurate. Much recent effort has thus focused on camera-based systems. However, cameras, even modern consumer depth cameras, pose further difficulties: the fingers can be hard to disambiguate visually and are often occluded by other parts of the hand (see Fig. 1.3). Even state-of-the-art academic and commercial systems are thus sometimes inaccurate and susceptible to loss of track, e.g. due to fast motion. Many approaches address these concerns by severely constraining the pose estimation setup, for example by supporting only close-range and front-facing scenarios, or by using multiple cameras to help with occlusions. Thus fully articulated hand pose estimation has not yet become the user interface of choice for AR and VR.

1.4 Aim

The aim of this project is to develop a hand pose estimation system that provides accurate results in real-time. The input to the system is a depth image and the output is the location of the joints, as shown in Fig. 1.4. The contributions of this project are: 1) understanding state-of-the-art approaches to solving the hand pose estimation problem, and 2) implementing a hand pose estimation system that outperforms many recent approaches and achieves state-of-the-art performance. This system will then serve as the baseline for further improvement.

(a) Input - Depth Image (b) Output - Joint Locations

Figure 1.4: Aim of hand pose estimation

In Chapter 2 the recent techniques in hand pose estimation are discussed. The approach used in this project is outlined in Chapter 3. The implementation details and experimental results are presented in Chapter 4.


Chapter 2

Related Work

Hand pose estimation is important for many human-computer interaction applications and has been intensely studied for decades. The relevant previous work is summarized below.

2.1 Types of Techniques

2.1.1 Model-based (Generative)

These methods [RK94; WLH01; Ste+06; OKA11; MKO13; Qia+14; MKA15; Tag+15; Sha+15] synthesize an image from hand geometry and define an energy function to quantify the discrepancy between the synthesized and observed images. This function is then optimized to obtain the hand pose. The advantage of these methods is that they are accurate and guaranteed to produce valid poses, but they suffer from problems of initialization and local minima (see Fig. 2.1a).

(a) Generative Methods (b) Discriminative Methods

Figure 2.1: Existing approaches (Image Source: [The15])


2.1.2 Learning-based (Discriminative)

These methods [AS03; WP09; WPP11; RKK10; Kes+12; TYK13; Sun+15; Li+15] learn a direct regression function that maps the image appearance to the hand pose (see Fig. 2.1b). These methods have the advantage of being efficient and thus produce real-time results. However, the results are coarse and can also violate hand geometry.

2.1.3 Hybrid (Discriminative and Generative)

These methods [Bal+12; ZCX12; Tom+14; Sri+15; OWL15b] use a discriminative method for initialization and then refine the result with a generative method. Thus the overall system is divided into two separate stages, which may lead to sub-optimal performance.

2.2 Issues in Existing Techniques

2.2.1 Hand Prior

The hand geometry (kinematic and physical constraints) is not exploited in discriminative approaches. Hybrid approaches tackle this problem by adding further stages to their pipeline. [Tom+14] applies post-processing to tackle this problem: inverse kinematics is used to optimize the hand skeleton from the joints. However, this step is separate from training and thus sub-optimal.

Another recent approach [OWL15a] inserts a linear layer (bottleneck) in the network that projects high-dimensional joints into a low-dimensional space (see Fig. 2.2). This indirectly applies a prior on the pose space by constraining it to a lower-dimensional space. However, invalid poses still persist, due to the linear projection.

Figure 2.2: Bottleneck layer to enforce constraints. (Image Source: [OWL15a])


2.2.2 Non-linear Regression

Due to the success of convolutional neural networks (CNNs) and the availability of large hand pose datasets, many recent works on hand pose estimation use them to achieve high accuracy. However, the direct mapping with CNNs from image features to continuous 2D/3D locations is highly non-linear and complex, and has low generalization ability. A better method is to map image features to heatmaps (one per joint, denoting its likelihood). A good evaluation metric is then the L2 norm of the difference. However, the depth information is not utilized in such a case. This problem is tackled by multi-view CNNs.
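As an illustration of the heatmap representation described above, the sketch below renders a per-joint 2-D Gaussian centered at a joint's projected pixel location and measures the L2 discrepancy between heatmaps. The resolution and σ are illustrative choices, not values taken from any cited method.

```python
import numpy as np

def joint_heatmap(u, v, size=64, sigma=2.0):
    """Render a 2-D Gaussian heatmap for one joint at pixel (u, v)."""
    xs = np.arange(size)
    gx = np.exp(-((xs - u) ** 2) / (2 * sigma ** 2))
    gy = np.exp(-((xs - v) ** 2) / (2 * sigma ** 2))
    return np.outer(gy, gx)  # shape (size, size), peak 1 at row v, col u

def heatmap_l2(pred, target):
    """L2 loss between predicted and target heatmaps (one per joint)."""
    return 0.5 * np.sum((pred - target) ** 2)
```

Regressing such a map per joint gives the network a spatially smooth target, at the cost of discarding the depth value at each pixel, which is exactly the limitation the multi-view approach below tries to address.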

To tackle this problem, in [Ge+16] the depth images are projected onto three orthogonal planes and regressed to 2D heatmaps to estimate joint positions, which are then fused to produce the final pose, as shown in Fig. 2.3. However, multi-view CNNs still cannot fully exploit the 3D spatial information in the depth image, since the projection from 3D to 2D loses some information. Although increasing the number of views may improve performance, the computational complexity also increases with more views.

Figure 2.3: Multi-view CNN method (Image Source: [Ge+16])

2.2.3 Interaction with Scene

Most recent methods assume that the hand is the closest object to the sensor, which simplifies the hand detection stage whose output then forms the input to the hand pose estimation system. However, this severely constrains the environment and thus restricts the applications. It becomes difficult to extract the hand while it is manipulating an object in the scene. Also, these systems do not perform well in the presence of multiple hands.


Chapter 3

Approach

3.1 Overview

The approach to solving the problem of hand pose estimation aims at achieving accurate results in real-time. The input to the system is a depth image, which is pre-processed (hand detection), resulting in an output that consists of pixels belonging to the hand or the background. This image is then passed through a deep network that predicts joint locations in 3D. A kinematic layer is added towards the end of the network, acting as a hand prior that produces geometrically valid poses. The approach follows a strategy similar to that of [Zho+16].

3.2 Preprocessing

The depth images acquired from the sensor contain pixels that do not belong to the hand. Thus the hand needs to be segmented, which is done by pixel-level classification using a random forest [Zha+17] (see Fig. 3.1). A fixed-size cube is extracted around the hand and the depth values are normalized to the range [-1, 1]. The input to the current implementation is assumed to be preprocessed, as is done in most recent works.

(a) Scene - depth image (b) Output - detected hand

Figure 3.1: Hand detection (Image Source: [Zha+17])
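The cube extraction and normalization step can be sketched as follows. The cube size, focal lengths and the handling of missing pixels here are illustrative assumptions, not values taken from the report.

```python
import numpy as np

def crop_and_normalize(depth, center_uvz, cube_mm=250.0, fx=588.0, fy=587.0):
    """Crop a fixed-size cube around the hand and map depths to [-1, 1].

    depth      : 2-D array of depth values in mm (0 = missing)
    center_uvz : (u, v, z) hand center in pixels plus depth in mm
    cube_mm, fx, fy : cube edge length and focal lengths (assumed values)
    """
    u, v, z = center_uvz
    # Project the cube's half-extent into a pixel radius at depth z
    ru = int(cube_mm / 2 * fx / z)
    rv = int(cube_mm / 2 * fy / z)
    h, w = depth.shape
    patch = depth[max(0, v - rv):min(h, v + rv),
                  max(0, u - ru):min(w, u + ru)].astype(np.float32)
    # Missing or far-away pixels are pushed to the far face of the cube
    patch[(patch == 0) | (patch > z + cube_mm / 2)] = z + cube_mm / 2
    patch = np.clip(patch, z - cube_mm / 2, z + cube_mm / 2)
    # Normalize depths to [-1, 1] around the hand center depth
    return (patch - z) / (cube_mm / 2)
```

A fixed metric cube makes the crop invariant to the hand's distance from the camera, so the network sees the hand at a consistent scale.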


3.3 Deep Learning

The network consists of two sections: a base network and a constraint network.

3.3.1 Base Network

The base network is the standard network used in state-of-the-art techniques, consisting of convolutional layers, fully connected layers and dropout layers. The input to this section is a depth image and the output is the pose parameter vector (26-D). This output is then fed into the constraint network.

The base network starts with 3 convolutional layers with kernel sizes 5, 5 and 3 respectively, each followed by max pooling with strides 4, 2 and 1 (no padding), respectively. All the convolutional layers have 8 channels. The output from these layers is passed to two fully connected layers, each with 1024 neurons and followed by a dropout layer with dropout ratio 0.3. The activation function for all layers is the ReLU.
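With the layer sizes listed above, the base network can be sketched in Keras (one of the frameworks the report lists among its experimental tools). The input resolution and the pooling window sizes are assumptions; the report specifies only the kernel sizes, channel counts, pooling strides and dropout ratio.

```python
from tensorflow.keras import layers, models

def build_base_network(input_size=128):
    """Sketch of the base network: 3 conv layers (kernels 5, 5, 3; 8
    channels each) with max pooling strides 4, 2, 1, then two 1024-unit
    fully connected layers, each followed by dropout 0.3, ending in the
    26-D pose parameter vector. Input size and pool windows are assumed."""
    return models.Sequential([
        layers.Input(shape=(input_size, input_size, 1)),
        layers.Conv2D(8, 5, activation="relu"),
        layers.MaxPooling2D(pool_size=4, strides=4, padding="valid"),
        layers.Conv2D(8, 5, activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=2, padding="valid"),
        layers.Conv2D(8, 3, activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=1, padding="valid"),
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1024, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(26),  # 26-D pose parameter vector
    ])
```

Note that the final layer has no activation, since the pose parameters are unbounded angles and translations consumed by the constraint network.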

3.3.2 Constraint Network

The constraint network imposes the kinematic constraints on the hand pose. The mapping function is the forward kinematics function F, which takes a pose vector as input and produces the joint positions as output. Thus this network is free of parameters. Also, F is differentiable and can be accommodated into gradient-descent-like optimization.

The overall network is shown in Fig. 3.2.

Figure 3.2: Network Architecture.


Hand Model. The hand model has pose parameters θ with 26 degrees of freedom (DoF), as shown in Fig. 3.3. There are 6 DoF for the global palm position and orientation; the remaining DoF are rotation angles at the joints.

Figure 3.3: Hand Model.

The canonical pose is taken to be the zero vector. All other poses are represented by parameters with respect to the canonical pose. The forward kinematics function maps the pose parameters (26-D) to the joint positions (31 joints). Each joint is associated with a local 3-D transformation corresponding to its bone length and joint angle.
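A minimal forward kinematics sketch for a single finger chain shows how fixed bone lengths and pose angles determine joint positions; the report's kinematic layer composes analogous local 3-D transforms for the full 26-DoF model. A planar (2-D) chain and the bone lengths used here are simplifications for illustration.

```python
import numpy as np

def finger_fk(base, bone_lengths, angles):
    """Forward kinematics for one planar finger chain: each joint adds a
    rotation (flexion angle) to the chain direction, then translates
    along the bone. Returns all joint positions including the base."""
    joints = [np.asarray(base, dtype=float)]
    direction = 0.0  # accumulated rotation along the chain
    for length, angle in zip(bone_lengths, angles):
        direction += angle
        step = length * np.array([np.cos(direction), np.sin(direction)])
        joints.append(joints[-1] + step)
    return np.stack(joints)  # (num_bones + 1, 2) positions

# Zero angles reproduce the canonical (straight) pose
canonical = finger_fk((0, 0), [40, 25, 20], [0, 0, 0])
```

Because each position is built from sines and cosines of the angles, the whole map is differentiable in the pose parameters, which is what lets it sit inside the network as a parameter-free layer.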

3.4 Joint Loss Function

The output of the overall network is the joint position vector, which is a function F (forward kinematics) of the pose parameter vector θ. Thus the loss function L, which is the joint position error, can be represented in terms of the pose parameters. Another term can be added to this loss to take into account the physical constraints on the rotation angle ranges.

The joint location loss is the standard Euclidean loss.

L(θ) = (1/2) ‖F(θ) − Y‖²    (3.1)

where Y ∈ R^(21×3) is the matrix of ground-truth joint locations.

For optimization, standard stochastic gradient descent is used with batch size 512, learning rate 0.003 and momentum 0.9. Training proceeds until convergence.
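The loss and one optimization step can be sketched as follows. The gradient here is taken by finite differences over a toy differentiable F purely for illustration; the actual system backpropagates analytically through the kinematic layer and uses SGD with momentum and batch size 512 as stated above.

```python
import numpy as np

def loss(theta, F, Y):
    """Joint-location loss L(theta) = 0.5 * ||F(theta) - Y||^2 (Eq. 3.1)."""
    return 0.5 * np.sum((F(theta) - Y) ** 2)

def gd_step(theta, F, Y, lr=0.003, eps=1e-6):
    """One plain gradient step via finite differences, an illustrative
    stand-in for backpropagation through the kinematic layer."""
    grad = np.zeros_like(theta)
    base = loss(theta, F, Y)
    for i in range(theta.size):
        t = theta.copy()
        t[i] += eps
        grad[i] = (loss(t, F, Y) - base) / eps
    return theta - lr * grad

# Toy differentiable "forward kinematics": positions vary smoothly with theta
F = lambda th: np.stack([np.cos(th), np.sin(th)], axis=1)
```

Because F is smooth, the loss is differentiable end-to-end in θ, so the joint error can drive updates of the pose parameters directly.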


Chapter 4

Evaluation

4.1 Implementation

The approach described above has been implemented in C++ and Python. The libraries used include Caffe and OpenCV. Background experiments were conducted using Keras with a TensorFlow backend, Torch and MATLAB.

The system specifications are as follows: CPU - Intel Core i7-7700, 3.60 GHz; RAM - 32 GB; GPU - Nvidia GeForce GTX 760 Ti.

4.2 Data

The NYU hand pose dataset (90 GB) contains 72,757 training samples and 8,252 test samples. The input is a depth image and the output is a position vector of 36 joints (see Fig. 4.1). A PrimeSense camera was used for acquiring the images. Ground-truth joints are annotated using an accurate offline PSO algorithm. This dataset has the largest pose variation and is the most challenging dataset publicly available.

The training was performed on a part of the dataset (10,000 training samples and 1,200 test samples). 31 joints were considered throughout the experiments.

(a) Input - Depth Image (b) Label - Joint Locations in 3-D

Figure 4.1: NYU hand pose dataset

4.3 Qualitative Evaluation

The predicted joint locations for various poses, along with the ground truth, are shown in Table 4.1. The results suggest that the predictions are quite accurate.


Input Prediction Ground Truth

Table 4.1: Qualitative results of hand pose estimation. Each row corresponds to apose. The colored marker denotes the joint position.


A comparative study between the results obtained without any prior and those with the kinematic layer suggests that the latter produces geometrically valid poses, as is evident from Fig. 4.2.

(a) Input (b) Output - without prior (c) Output - with kinematic layer

Figure 4.2: Results of experiments

4.4 Quantitative Evaluation

Metric. Different metrics from the existing literature were studied, and the following evaluation metrics were used:

1. Average joint error over all test images

2. Fraction of images whose maximum joint error is below a threshold
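Both metrics can be computed directly from the predicted and ground-truth joint arrays. The array shapes and the unit (mm) are assumptions for illustration.

```python
import numpy as np

def average_joint_error(pred, gt):
    """Mean Euclidean distance over all joints and all test frames.
    pred, gt: arrays of shape (num_frames, num_joints, 3), in mm."""
    return np.linalg.norm(pred - gt, axis=2).mean()

def fraction_within_threshold(pred, gt, threshold_mm):
    """Fraction of frames whose *maximum* joint error is below threshold;
    a single badly-placed joint makes the whole frame count as a failure."""
    worst = np.linalg.norm(pred - gt, axis=2).max(axis=1)
    return (worst < threshold_mm).mean()
```

The second metric is the stricter of the two, which is why it is usually reported as a curve over a range of thresholds.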

The quantitative results for three different cases (without any prior, with the existing best prior, and with the kinematic layer) are shown in Fig. 4.3. The results are competitive with state-of-the-art approaches to hand pose estimation.

(a) Fraction of frames within threshold

Technique             Error
No prior              6395.45
Existing best prior   4699.16
Kinematic prior       3079.38

(b) Average joint position error

Figure 4.3: Comparison with state-of-the-art methods

A forward pass took about 8 ms, corresponding to roughly 120 frames per second at test time. Thus the aim of accurate, real-time performance is achieved.


Conclusion

The project successfully implements a hand pose estimation system which achieves competitive accuracy in real-time. The notion of a hand prior is smoothly embedded into the deep network, making the overall pipeline relatively simple. The results obtained are comparable to the best approaches in this domain. This system will serve as a baseline for further improvements.

Some limitations still exist. Temporal information could be utilized for a further gain in accuracy. A physical constraint could be added on top of the kinematic layer to limit the joint angles. Another major concern is the non-linear mapping from the depth image to the joint positions, where volume information is lacking. The system could also be made more robust to scenes containing interaction with objects. All of these together will lead to a world where users will not just control the virtual world through input devices, but be a part of it.


Acknowledgements

I thank Prof. Parag Chaudhuri for his guidance throughout the RnD project.

The project was developed in the Vision, Graphics and Imaging Laboratory (ViGIL) of the Department of Computer Science and Engineering, Indian Institute of Technology Bombay.


Bibliography

[AS03] Vassilis Athitsos and Stan Sclaroff. "Estimating 3D hand pose from a cluttered image". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. IEEE. 2003, pp. 432–439.

[Bal+12] Luca Ballan et al. "Motion capture of hands in action using discriminative salient points". In: European Conference on Computer Vision (ECCV). Springer. 2012, pp. 640–653.

[Bry05] Steve Bryson. "Survey of Virtual Environment Technologies and Techniques". In: ACM Transactions on Graphics (SIGGRAPH). Vol. 92. 2005, p. 1.

[Ge+16] Liuhao Ge et al. "Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 3593–3601.

[Kes+12] Cem Keskin et al. "Hand pose estimation and hand shape classification using multi-layered randomized decision forests". In: European Conference on Computer Vision (ECCV). Springer. 2012, pp. 852–863.

[Li+15] Peiyi Li et al. "3-D hand pose estimation using randomized decision forest with segmentation index points". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2015, pp. 819–827.

[MKA15] Alexandros Makris, Nikolaos Kyriazis, and Antonis A. Argyros. "Hierarchical particle filtering for 3D hand tracking". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 2015, pp. 8–17.

[MKO13] Stan Melax, Leonid Keselman, and Sterling Orsten. "Dynamics based 3D skeletal hand tracking". In: Proceedings of Graphics Interface. Canadian Information Processing Society. 2013, pp. 63–70.

[OKA11] Iason Oikonomidis, Nikolaos Kyriazis, and Antonis A. Argyros. "Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE. 2011, pp. 2088–2095.

[OWL15a] Markus Oberweger, Paul Wohlhart, and Vincent Lepetit. "Hands deep in deep learning for hand pose estimation". In: arXiv preprint arXiv:1502.06807 (2015).

[OWL15b] Markus Oberweger, Paul Wohlhart, and Vincent Lepetit. "Training a feedback loop for hand pose estimation". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2015, pp. 3316–3324.

[Qia+14] Chen Qian et al. "Realtime and robust hand tracking from depth". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2014, pp. 1106–1113.

[RK94] James M. Rehg and Takeo Kanade. "Visual tracking of high DOF articulated structures: an application to human hand tracking". In: European Conference on Computer Vision (ECCV). Springer. 1994, pp. 35–46.

[RKK10] Javier Romero, Hedvig Kjellstrom, and Danica Kragic. "Hands in action: real-time 3D reconstruction of hands in interaction with objects". In: International Conference on Robotics and Automation (ICRA). IEEE. 2010, pp. 458–463.

[Sha+15] Toby Sharp et al. "Accurate, robust, and flexible real-time hand tracking". In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM. 2015, pp. 3633–3642.

[Sri+15] Srinath Sridhar et al. "Fast and robust hand tracking using detection-guided optimization". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, pp. 3213–3221.

[Ste+06] Bjorn Stenger et al. "Model-based hand tracking using a hierarchical Bayesian filter". In: Transactions on Pattern Analysis and Machine Intelligence (PAMI) 28.9 (2006), pp. 1372–1384.

[Sun+15] Xiao Sun et al. "Cascaded hand pose regression". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, pp. 824–832.

[Tag+15] Andrea Tagliasacchi et al. "Robust Articulated-ICP for Real-Time Hand Tracking". In: Computer Graphics Forum. Vol. 34. 5. Wiley Online Library. 2015, pp. 101–114.

[Tay+16] Jonathan Taylor et al. "Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences". In: ACM Transactions on Graphics (SIGGRAPH) 35.4 (2016), p. 143.

[The15] Christian Theobalt. "Real-time Capture of Hands in Motion". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 2015, pp. 44–52.

[Tom+14] Jonathan Tompson et al. "Real-time continuous pose recovery of human hands using convolutional networks". In: ACM Transactions on Graphics (SIGGRAPH) 33.5 (2014), p. 169.

[TYK13] Danhang Tang, Tsz-Ho Yu, and Tae-Kyun Kim. "Real-time articulated hand pose estimation using semi-supervised transductive regression forests". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2013, pp. 3224–3231.

[Wan] Robert Wang. Nimble VR Kickstarter. URL: https://www.youtube.com/watch?v=v_U3BmDlmtc (visited on 04/27/2017).

[WLH01] Ying Wu, John Y. Lin, and Thomas S. Huang. "Capturing natural hand articulation". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). Vol. 2. IEEE. 2001, pp. 426–432.

[WP09] Robert Y. Wang and Jovan Popovic. "Real-time hand-tracking with a color glove". In: ACM Transactions on Graphics (SIGGRAPH). Vol. 28. 3. ACM. 2009, p. 63.

[WPP11] Robert Wang, Sylvain Paris, and Jovan Popovic. "6D hands: markerless hand-tracking for computer aided design". In: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST). ACM. 2011, pp. 549–558.

[ZCX12] Wenping Zhao, Jinxiang Chai, and Ying-Qing Xu. "Combining marker-based mocap and RGB-D camera for acquiring high-fidelity hand motion data". In: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association. 2012, pp. 33–42.

[Zha+17] Zhao Zhang et al. "Accurate per-pixel hand detection from a single depth image". In: Optical Engineering 56.3 (2017), p. 33107.

[Zho+16] Xingyi Zhou et al. "Model-based Deep Hand Pose Estimation". In: International Joint Conference on Artificial Intelligence (IJCAI). 2016.
