

CVPR 2006 Submission #1101. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Regression-based Hand Pose Estimation from Multiple Cameras

Anonymous CVPR submission

Paper ID 1101

Abstract

The RVM-based learning method for whole body pose estimation proposed by Agarwal and Triggs is adapted to hand pose recovery. To help overcome the difficulties presented by the greater degree of self-occlusion and the wider range of poses exhibited in hand imagery, the adaptation proposes a method for combining multiple views. Comparisons of performance using single versus multiple views are reported for both synthesized and real imagery, and the effects of the number of image measurements and the number of training samples on performance are explored.

1. Introduction

Interpreting views of objects undertaking articulated motion remains a considerable challenge to computer vision researchers more than two decades after Hogg demonstrated visual tracking of a walking person modelled using 3D cylinders [9]. Two quite different approaches to the problem are apparent in the literature. The first, and more traditional, is the generative approach, in which an estimate of the pose updates the prediction of the object's appearance in the image, say by projecting a 3D model into the image. Measurements of the discrepancy between predicted and actual images allow the pose update to be determined. A wide variety of methods have been proposed for generative tracking of articulated bodies, reflecting the difficulty of minimisation in high dimensional spaces.

Recently, discriminative approaches have been more widely explored [5, 12, 7, 10, 1]. The idea is to recover a direct, but not physically-based, mapping between a robust representation of appearance and the model parameters, such as joint angles. As Wu et al. [15] note, the approach exploits the fact that the typically explored range of hand poses is much smaller than the potential range.

One approach to relating image measurements qualitatively to 3D poses is that of classification, where a discrete set of 3D poses constitutes the set of classes. Training samples are generated using synthetic images of a hand model at several poses [5, 12, 10]. Although high accuracy can be obtained, these frameworks demand a large set of classes if a comprehensive range of recoverable poses is desired, which makes the computation time prohibitive.

An alternative is to use simpler image measurements and stronger temporal priors. Brand [7] uses ten scale-invariant central moments on low resolution silhouette images as observations, and a hidden Markov model (HMM) to model temporal sequences. This allows video-rate implementation but, with evidence as weak as image moments, the learnt prior can dominate the reconstruction. This can be particularly bad for hands, as, unlike walking, cyclic sequences of movements are not the norm.

Agarwal and Triggs [1] used richer image measurements: shape contexts [6]. Unlike the classification methods, theirs is a regression-based method and provides a continuous map between image measurements and 3D poses. This mapping is learnt using a Relevance Vector Machine (RVM) [14]. In [2], the regression-based tracking concept developed in [3] was extended to include a dynamical model. Based on these works, a framework including multiple hypothesis detection and tracking with a particle filter was developed by Agarwal and Triggs in [4]. A similar method was proposed contemporaneously by Sminchisescu et al. [11].

In this paper, a learning-based approach to hand pose recovery is taken, following in part Agarwal and Triggs' work on whole body pose estimation.

However, hand pose recovery is in general a more difficult problem, not least because of the far greater degree of actual occlusion, and of "apparent" occlusion where finger bounding contours are lost. For this reason the paper proposes an extension of the single view method to multiple cameras, an approach which, as Erol et al. [8] point out, has not been widely explored for this problem. An experimental comparison of single and multiple view performance is presented, taking into account variation in the number of image measurements and training samples needed.

2. Extracting Multiple View Image Descriptors

The initial step of the method (in both the training and application phases) is the conversion of each image of a hand into a silhouette contour, and thence into a compact description using shape contexts [6].


Because of the wide variation in scale and orientation of hands in imagery, we explore possibilities for scale and rotation invariance that can be achieved with shape contexts. A novel modification for rotation invariance is proposed to increase the discriminatory power.

Recovery of the silhouette of the hand, assumed ungloved, is achieved using a trained histogram-based skin colour classifier applied, for robustness, to the Cr and Cb (chrominance) channels of the YCbCr colour space. Images are subsampled to reduce computation cost, and the shape contexts are computed only from positions on the silhouette contour, which is easily derived by edge detection in the resulting skin/not-skin binary image (see Fig. 1).

Figure 1. A hand image (a), and the pixel-wise silhouette contour obtained from skin and edge detection (b).
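As a concrete illustration of this step, the sketch below segments skin in the Cr/Cb plane and traces the silhouette contour. It is a minimal sketch, not the authors' implementation: the trained histogram classifier is replaced by an illustrative fixed threshold box, OpenCV's contour tracing stands in for the edge detection, and the function name hand_silhouette_contour is ours.

import cv2
import numpy as np

def hand_silhouette_contour(bgr_image, cr_range=(135, 180), cb_range=(85, 135)):
    """Segment skin in the Cr/Cb plane and return the silhouette contour.

    The paper uses a trained histogram-based classifier; a fixed box
    threshold on Cr and Cb stands in for it here, and the ranges are
    illustrative. Requires OpenCV 4 (cv2).
    """
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    cr, cb = ycrcb[:, :, 1], ycrcb[:, :, 2]
    skin = ((cr >= cr_range[0]) & (cr <= cr_range[1]) &
            (cb >= cb_range[0]) & (cb <= cb_range[1])).astype(np.uint8) * 255
    # The silhouette contour: boundary pixels of the largest skin component.
    contours, _ = cv2.findContours(skin, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        raise ValueError("no skin region found")
    hand = max(contours, key=cv2.contourArea)
    return hand.reshape(-1, 2)        # (x, y) positions along the silhouette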

A shape context [6] is a local non-parametric description of shape computed at each point on the silhouette contour. At any point on the contour, neighbouring contour pixels are accumulated in 60 bins arranged in log-polar fashion, five along the radial direction and twelve around the polar angle, spaced equally in log-distance and angle, respectively. To provide a first layer of scale invariance, the inner radius is set to $\bar{d}/8$, where $\bar{d}$ is the mean of the distances between all the pairs of points in the silhouette. The radius increases in octaves to $2\bar{d}$, typically covering all of the hand silhouette. The resulting 60-bin histogram is normalised, providing again for scale invariance. For image $i$ the complete image description is generated as the set of $n_i$ 60-bin histograms computed at the $n_i$ points along the silhouette contour.
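A minimal sketch of the descriptor computation follows, assuming the five radial bin edges run $\bar{d}/8, \bar{d}/4, \bar{d}/2, \bar{d}, 2\bar{d}$ as read off the octave scheme above; shape_contexts and its defaults are illustrative, not the authors' code.

import numpy as np

def shape_contexts(points, n_r=5, n_theta=12):
    """Compute a 60-bin log-polar shape context at every contour point.

    Radial bin edges follow the octave scheme described above:
    d/8, d/4, d/2, d, 2d, where d is the mean pairwise distance, so the
    innermost bin is the disk of radius d/8 and points beyond 2d are
    ignored. Returns an (n_points, n_r * n_theta) array of normalised
    histograms.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]          # diff[i, j] = p_j - p_i
    dist = np.linalg.norm(diff, axis=2)
    d_mean = dist[np.triu_indices(n, k=1)].mean()
    r_edges = d_mean * np.array([1/8, 1/4, 1/2, 1.0, 2.0])
    r_bin = np.digitize(dist, r_edges)                # 0..5; 5 means beyond 2d
    angles = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)
    t_bin = (angles * n_theta / (2 * np.pi)).astype(int) % n_theta
    descs = np.zeros((n, n_r * n_theta))
    for i in range(n):
        for j in range(n):
            if i != j and r_bin[i, j] < n_r:
                descs[i, r_bin[i, j] * n_theta + t_bin[i, j]] += 1
    # Normalise each 60-bin histogram (second layer of scale invariance).
    descs /= np.maximum(descs.sum(axis=1, keepdims=True), 1)
    return descs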

Belongie et al. ensured rotational invariance by aligning the fiducial $0^\circ$ line of the shape context with the tangent to the silhouette contour at each point. While this works well if the contour is smooth (which in our experience requires either large images or fitting parametrized curves to the edges), the result in low resolution images, using pixel contour points, was found to be noisy. A more robust alternative is to use the geometric centre of the silhouette and set the fiducial line to be perpendicular to the line from the centre to the contour point. The rotation invariance of both the tangent-based and centroid-based methods is obtained at the cost of reducing the amount of global information about the shape of the silhouette.

The solution adopted in this paper is to orient the shape context with the axis that links the wrist to the tip of the hand. For simplicity, it is assumed that two points of the silhouette lie on the image borders, and these points are taken to be either side of the forearm. The results in Fig. 2 show that this maintains the discriminative power of non-rotation invariant shape contexts and adds robustness to planar rotations. A sketch of this axis recovery is given below.

Figure 2. Two sets (a) and (b) of nearest-neighbour classification results using multiple view descriptors obtained from the silhouettes shown in the first (input) column. Column (i) uses non-rotation invariant shape contexts; (ii)-(iv) use the tangent-based, centroid-based and principal axis methods of rotation invariance, respectively. Note that using the principal axis (iv), the finger ambiguity is avoided (sample a). Also, even though the hand is roughly aligned with the training data, tangent-based and principal axis-based rotation invariant shape contexts provided better results than shape contexts without rotation invariance (sample b).
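The sketch below recovers that axis under exactly the border-point assumption just stated; hand_axis_angle and its border tolerance are illustrative.

import numpy as np

def hand_axis_angle(contour, image_shape, border=2):
    """Recover the wrist-to-fingertip axis used to orient the shape contexts.

    Assumptions as in the text: the contour points lying on the image
    border belong to either side of the forearm; their midpoint stands
    in for the wrist, and the contour point farthest from it for the
    fingertip. `contour` is an (n, 2) array of (x, y) pixels.
    """
    h, w = image_shape[:2]
    x, y = contour[:, 0], contour[:, 1]
    on_border = (x <= border) | (x >= w - 1 - border) | \
                (y <= border) | (y >= h - 1 - border)
    if not on_border.any():
        raise ValueError("no contour points on the image border")
    wrist = contour[on_border].mean(axis=0)
    tip = contour[np.argmax(np.linalg.norm(contour - wrist, axis=1))]
    return np.arctan2(tip[1] - wrist[1], tip[0] - wrist[0])

The returned angle would then be subtracted from the polar angles before binning each shape context, yielding invariance to planar rotations of the hand.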

The number of dimensions required to describe an image is reduced by quantising the shape context manifold into a codebook using $k$-means. Since shape contexts are histograms, the natural dissimilarity measure is the $\chi^2$ test statistic [6]. To soften the effects of spatial quantisation, histograms are built by allowing context vectors to vote with Gaussian weights into the few centres nearest to them [1] (see the sketch below). These histograms are normalised w.r.t. the number of points in the silhouette contour, again for scale invariance.
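A sketch of this soft-voting step, assuming $\chi^2$ distances to the $k$-means centres and illustrative values for the number of nearest centres and the Gaussian width:

import numpy as np

def soft_vote_histogram(contexts, centres, n_nearest=4, sigma=0.25):
    """Soft-vote shape contexts into a K-bin descriptor as outlined above.

    `centres` is the (K, 60) codebook from k-means on training shape
    contexts. Each context votes with Gaussian weights into its
    `n_nearest` centres under the chi-squared distance; `n_nearest` and
    `sigma` are illustrative values, not taken from the paper.
    """
    hist = np.zeros(len(centres))
    for c in contexts:
        # Chi-squared distance between histogram c and every centre.
        d = 0.5 * np.sum((c - centres) ** 2 / (c + centres + 1e-10), axis=1)
        near = np.argsort(d)[:n_nearest]
        w = np.exp(-d[near] ** 2 / (2 * sigma ** 2))
        hist[near] += w / w.sum()
    # Normalise w.r.t. the number of contour points, for scale invariance.
    return hist / len(contexts)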

Three ways to combine multiple view information have been considered. The low level approach is to group all the shape contexts from all the images together before clustering to build the histograms. The problem with this approach is that the improvement obtained by using multiple views may not be very significant, as one set of measurements can be associated with more than one global orientation.

An alternative is to estimate the pose from each view individually and combine the results at a high level using, for example, a graphical model. If global pose parameters can be estimated using triangulation, and if regressors can be trained with a comprehensive sample set, then the same regressor can be applied for all the cameras, and the camera setup need not be the same as in training. However, as discussed later, it is not realistic to use comprehensive training sets.

The approach proposed here is to combine the information at an intermediate level, by generating description vectors $x$ for each camera individually and concatenating them into a higher dimensional vector that describes the current measurements from all the cameras. The regressor is then trained using these concatenated vectors. In our implementation, $K$ was set to 30 for each of the three views, so the concatenated vector $x$ has length 90. The projection onto the two principal axes of the 90-d manifold for the training data is shown in Figure 3, using hand axis-oriented shape contexts.


Figure 3. The 90-d manifold of multiple view descriptor vectors $x$ obtained from the training data, visualised by projection onto the first two principal components (1st PC horizontal, 2nd PC vertical). The silhouettes of the hand at some key poses (6 poses for each $\theta_z$ angle) are shown at their locations in the manifold.


Note that the first and second principal components are roughly aligned with the variation in $\theta_z$ and with the overall degree of flexion of the fingers, respectively. This hints that this dataset of hand appearances can roughly be represented with two degrees of freedom. This effect cannot be observed for single view descriptors $x$; in that case, the manifold seems to need at least three dimensions to show more separability between hand poses.
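A brief sketch of the concatenation and of the principal component projection behind Figure 3 (the function name and the use of an SVD are ours; any PCA routine would do):

import numpy as np

def multiview_descriptor_pca(per_view_descriptors):
    """Concatenate per-view descriptors and project onto two principal axes.

    `per_view_descriptors` is a list of (n_frames, 30) arrays, one per
    camera, giving (n_frames, 90) concatenated vectors; the 2-d
    projection is the visualisation of Figure 3.
    """
    X = np.hstack(per_view_descriptors)          # (n_frames, 90)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return X, Xc @ Vt[:2].T                      # descriptors, 2-d projection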

3. Obtaining and Testing the Training Data

An essential input to the later regression process is, of course, the association of each $x_i$ with a set of known joint angles $y_i$. For this paper, training pairs $(x_i, y_i)$ were obtained by generating imagery synthesized from a hand model using joint angle data from the hand trajectory database prepared by Stenger et al. [12].

A hand model including forearm, palm, thumb and fingers was created using generalised cylinders and spheres. The data set used was obtained from a glove that does not include sensors between the forearm and the palm, so the two degrees of freedom of this joint were set to constant values. The palm is rigid, and each finger is modelled as a planar mechanism with 3 dof for flexion and 1 dof for abduction and adduction with the palm. The same model is used for the thumb, but its plane is not parallel to the fingers' planes. This gives a total of 20 internal dof plus 2 inactive dof for the wrist, and 6 dof of global pose parameters. Thus the hand pose is described by a vector $y \in \mathbb{R}^{28}$.

Figure 4. 1st row: sample images from camera 2 with modifications in orientation, translation and scale. The nearest-neighbour classification results using a single view with scale and rotation invariant descriptors are shown in the 2nd row. The 3rd row shows the same, using multiple views.

3.1. Training Sets

In this paper, we use two training sets. The first set, dubbed open-close, consists of a trajectory that starts with all the fingers stretched and in which a grasping gesture is performed over 78 frames. The glove used to generate this data did not have a global position and orientation sensor, so the trajectory was duplicated seven times for $15^\circ$-spaced values $0^\circ \leq \theta_z \leq 90^\circ$, giving a total of 546 poses. For multi-camera application, the hands were rendered from three different viewpoints.

The second training set, dubbed complex, was generated from a sequence of 239 internal poses in which fingers move independently. As before, the trajectory was reproduced for seven instances of $\theta_z$, giving a total of 1673 three-dimensional poses.

3.2. Training Set Assessment

In order to assess the discriminatory power of the image descriptors $x_i$, a nearest neighbour classification experiment was performed with 36 hand images: 9 hand poses taken from 4 orientations. The results, shown in Figure 4, suggest that the image descriptor is robust enough to provide a good qualitative description of the hand shape from images that are not in the training set, even though the hand model is not accurate. The same figure also shows that the use of multiple views can improve the nearest neighbour classification result.
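The assessment amounts to a simple lookup over the training descriptors; the sketch below assumes a Euclidean distance on the descriptor vectors, which is illustrative rather than taken from the paper.

import numpy as np

def nearest_neighbour_pose(x_query, X_train, Y_train):
    """Nearest-neighbour lookup used in the assessment above: return the
    training pose whose (single or multiview) descriptor is closest to
    the query descriptor."""
    i = int(np.argmin(np.linalg.norm(X_train - x_query, axis=1)))
    return Y_train[i], i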

4. Learning to Relate Descriptors to 3D Poses

To relate the image descriptors $x_i$ to the 3D joint and pose settings $y_i$, Agarwal and Triggs [1] proposed the use of a regression method that learns the relation between $n$ pairs of vectors $(x_i, y_i)$ by estimating the coefficients or weights of a linear combination of basis functions $\phi_k$. The problem is described as:

$$y_i = \sum_{k=1}^{p} a_k \, \phi_k(x_i) + \epsilon \;\equiv\; A f(x_i) + \epsilon \quad (1)$$

where $\epsilon$ is a residual error vector, $y_i \in \mathbb{R}^m$ ($i = 1, 2, \ldots, n$) and $a_k \in \mathbb{R}^m$ ($k = 1, 2, \ldots, p$). For compactness, the weight vectors can be gathered into an $m \times p$ matrix $A \equiv (a_1 \; a_2 \; \cdots \; a_p)$ and the basis functions into an $\mathbb{R}^p$-valued function $f(x) = (\phi_1(x) \; \phi_2(x) \; \cdots \; \phi_p(x))^T$. As discussed later, $p$ equals the descriptor dimension for the linear kernel, and $p = n$ for the Gaussian kernel.

For $n$ training pairs, the estimation problem takes the form:

$$\hat{A} = \arg\min_A \left\{ \sum_{i=1}^{n} \| A f(x_i) - y_i \|^2 + R(A) \right\} \quad (2)$$

where $R(\cdot)$ is a regulariser on $A$. Gathering the training vectors into an $m \times n$ matrix $Y \equiv (y_1 \; y_2 \; \cdots \; y_n)$ and a $p \times n$ feature matrix $F \equiv (f(x_1) \; f(x_2) \; \cdots \; f(x_n))$, equation (2) can be rewritten as:

$$\hat{A} = \arg\min_A \left\{ \| A F - Y \|^2 + R(A) \right\}. \quad (3)$$

For unidimensional outputs $y$, Tipping [14] proposed the Relevance Vector Machine (RVM), a method based on sparse Bayesian learning, to estimate efficiently a good approximation of $A \in \mathbb{R}^{1 \times p}$ with large sparsity. This sparsity can save computational time and space. A straightforward extension to multidimensional outputs can be achieved by regressing the input vectors $x$ against each individual parameter $y_j$ of the vector $y$; the resulting row vectors of weights are then concatenated into the matrix $A \in \mathbb{R}^{m \times p}$.

With the open-close data set, using $K = 90$ (i.e. $K = 30$ for each view) and linear kernel functions ($f(x) = x$), the resulting $A$ matrix is shown in Figure 5 (top row). For samples in the training set, the mean absolute error was computed as $\frac{1}{n} \sum_{i=1}^{n} \| A f(x_i) - y_i \|$; the maximum average error and standard deviation both occurred for the interphalangeal joint of the thumb, which is occluded in many of the training images. (For hand joint nomenclature, see [13].)

A problem with regressing parameters independently is that noisy data can potentially produce impossible output poses. For example, a regressor trained to recover the 3D pose of walking humans might output poses having both legs to the front. Furthermore, training each row of $A$ individually can be computationally costly.

In [1], Agarwal and Triggs describe an adaptation of Tipping's method that estimates the whole matrix $A$ in a single process, creating a linear combination of relations with multi-dimensional output.

Figure 5. Map of the non-zero elements of matrix $A$ resulting from: (top) RVM regression of the individual parameters separately, using a threshold to select an average of 10 relevant vectors per dof (nz = 243); (bottom) linear regression using Agarwal and Triggs' method, selecting 10 relevant vectors in total (nz = 280).

The first step of this algorithm is to initialise $A$ with ridge regression. The regulariser is chosen to be $R(A) \equiv \lambda \|A\|^2$, where $\lambda$ is a regularisation parameter. The problem can be described as the minimisation of

$$\| A \tilde{F} - \tilde{Y} \|^2 \;\equiv\; \| A F - Y \|^2 + \lambda \| A \|^2, \quad (4)$$

where $\tilde{F} \equiv (F \;\; \sqrt{\lambda} I)$ and $\tilde{Y} \equiv (Y \;\; 0)$. $A$ can be estimated by solving the linear system $A \tilde{F} = \tilde{Y}$ in least squares. Ridge solutions are not equivariant under scaling of inputs, so both $x$ and $y$ vectors are scaled to have zero mean and unit variance before solving.
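A minimal sketch of this initialisation, solving the augmented system of Equation 4 directly in least squares; ridge_init and the value of lam are illustrative:

import numpy as np

def ridge_init(F, Y, lam=1e-2):
    """Initialise A by ridge regression (Equation 4).

    F: (p, n) feature matrix with columns f(x_i); Y: (m, n) pose matrix;
    lam is an illustrative value. Inputs are assumed already scaled to
    zero mean and unit variance. Solves A (F | sqrt(lam) I) = (Y | 0)
    in least squares.
    """
    p, m = F.shape[0], Y.shape[0]
    F_tilde = np.hstack([F, np.sqrt(lam) * np.eye(p)])   # (p, n + p)
    Y_tilde = np.hstack([Y, np.zeros((m, p))])           # (m, n + p)
    # A F_tilde = Y_tilde  <=>  F_tilde^T A^T = Y_tilde^T.
    return np.linalg.lstsq(F_tilde.T, Y_tilde.T, rcond=None)[0].T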

The next step is to apply a modification of RVM that successively approximates the penalty terms with "quadratic bridges". With $a$ an element of $A$, the regularisers $R(a) = \nu \log \|a\|$ are approximated by $\frac{\nu}{2} (\|a\| / \|a_{\text{scale}}\|)^2 + \text{const}$, which has the same gradient as the original function at $\|a\| = \|a_{\text{scale}}\|$. If const is set to $\nu (\log \|a_{\text{scale}}\| - \frac{1}{2})$, the regularising function values also match at $a_{\text{scale}}$, though this is irrelevant for the least squares solution. The quadratic bridge approximation allows parameters to pass through zero if they need to, with less risk of premature trapping and over-fitting.

Agarwal and Triggs proposed the use of a column-wise set of priors in the regulariser $R(A)$: with $a$ a column of $A$, $R(a) \propto \frac{\nu}{2} (\|a\| / \|a_{\text{scale}}\|)^2 + \text{const}$, implying that the estimated matrix $A$ has some whole columns $a_k = 0$. Depending on the kernel function used, two different aspects of cost reduction for pose estimation can be achieved:

- If linear basis functions are used, i.e., $f(x) = x$, the nil vectors $a_k$ indicate which components of the vectors $x$ can be removed without compromising the regression result. Therefore, RVM can be used as a feature selection method, resulting in a reduction in the number of shape descriptors needed.

- Alternatively, kernel basis functions can be used. They are expressed by $\phi_k(x) = K(x, x_k)$, making $f(x) = (K(x, x_1), K(x, x_2), \ldots, K(x, x_n))^T$, where $K(x, x_k)$ is a function that relates $x$ with the training sample $x_k$. For example (as used in this paper), one can use a Gaussian kernel $K(x, x_i) = e^{-\beta \|x - x_i\|^2}$, with $\beta$ estimated from the scatter matrix of the training data. In this case, the column-wise sparsity of $A$ acts as a method to select relevant training samples.

The estimation of $A$ is then performed in a similar fashion to Equation 4, by iteratively solving the linear system:

$$A \, (F \;\; \Omega) = (Y \;\; 0), \quad (5)$$

where $0$ is an $m \times p$ matrix of zeros and $\Omega$ is a $p \times p$ diagonal matrix whose entries are $\sqrt{\nu} / \|a_{\text{scale}}\|$, with $\|a_{\text{scale}}\|$ the norm of each corresponding column vector of $A$ from the previous iteration. To reinforce sparsity, the columns of $A$ whose norms are small are set to zero. This process is repeated until convergence of $A$.
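The whole iteration might be sketched as follows; this is our reading of the procedure, with plain least squares at each step and illustrative settings for nu, tau, lam and the iteration count, not the authors' code:

import numpy as np

def column_sparse_regression(F, Y, nu=1.0, tau=1e-3, lam=1e-2, n_iter=50):
    """Iteratively reweighted least squares with column pruning (Equation 5).

    F: (p, n) feature matrix; Y: (m, n) pose matrix. nu, tau, lam and
    n_iter are illustrative settings, not values from the paper.
    """
    p, n = F.shape
    m = Y.shape[0]
    # Ridge initialisation (Equation 4): A (F | sqrt(lam) I) = (Y | 0).
    A = np.linalg.lstsq(np.hstack([F, np.sqrt(lam) * np.eye(p)]).T,
                        np.hstack([Y, np.zeros((m, p))]).T, rcond=None)[0].T
    active = np.ones(p, dtype=bool)
    for _ in range(n_iter):
        norms = np.linalg.norm(A, axis=0)
        active &= norms > tau                 # zero columns with small norm
        A_prev, A = A, np.zeros((m, p))
        Fa = F[active]                        # rows of F for surviving columns
        # Omega is diagonal with entries sqrt(nu)/||a_k|| from the last
        # iterate, so the penalty ||A Omega||^2 realises the quadratic bridge.
        Omega = np.diag(np.sqrt(nu) / norms[active])
        F_aug = np.hstack([Fa, Omega])
        Y_aug = np.hstack([Y, np.zeros((m, Fa.shape[0]))])
        A[:, active] = np.linalg.lstsq(F_aug.T, Y_aug.T, rcond=None)[0].T
        if np.allclose(A, A_prev, atol=1e-8):
            break
    return A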

Figure 5 (bottom row) shows the $A$ matrix obtained by this method using linear kernel functions on the open-close data set, with multiple view 90-d descriptors $x$. The threshold $\tau$ on the column norms $\|a_k\|$ was tuned to select 10 relevant features, resulting in the selection of 5 features from camera 1 (side view), 3 features from camera 2 (top view), and 2 features from camera 3 (another side view). For samples in the training set, regression with this matrix again gave the maximum average error and standard deviation for the interphalangeal joint of the thumb. This represents an improvement in comparison to the results obtained by regressing the dofs individually, with a simplification of matrix $A$ allowing feature and sample selection. It is interesting to note that many of the vectors selected using Tipping's method coincide with those selected by Agarwal and Triggs' method, confirming the common theoretical basis of both.

It has been observed that Gaussian kernel functions can provide better results at the expense of being slower than linear kernel functions [1]. Indeed, the results shown later suggest that linear functions are less stable to noise than Gaussian kernel functions. The alternative proposed here is to combine both: first reduce the dimensionality of the image descriptors $x$ with feature selection, then use regression with Gaussian kernel functions to select the most relevant samples. Since the dimension of the vectors $x$ is reduced in the first stage, all the distance calculations required to compute $f(x)$ with Gaussian kernels are speeded up.
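A sketch of this combined two-stage scheme, with plain least squares standing in for the RVM fit and all names illustrative:

import numpy as np

def gaussian_features(X_train, X, beta):
    """Columns f(x) for the Gaussian kernel K(x, x_i) = exp(-beta ||x - x_i||^2)."""
    d2 = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-beta * d2).T                      # (n_train, n_queries)

def two_stage_regression(X, Y, selected, beta):
    """Two-stage scheme sketched above: keep only the descriptor components
    chosen by linear-kernel feature selection, then regress on Gaussian
    kernel features of the reduced vectors.

    X: (n, d) training descriptors; Y: (m, n) poses; `selected` is the
    index set from the linear stage; beta would be set from the scatter
    of the training data.
    """
    Xr = X[:, selected]                              # reduced descriptors
    F = gaussian_features(Xr, Xr, beta)              # (n, n) kernel features
    A = np.linalg.lstsq(F.T, Y.T, rcond=None)[0].T   # (m, n) weight matrix
    def predict(X_query):
        return A @ gaussian_features(Xr, X_query[:, selected], beta)
    return A, predict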

Figure 6. Number of selected relevant vectors for the linear (left) and Gaussian (right) kernels, for single and multiple views, as a function of the threshold $\tau$, evaluated on the open-close training set.

Views  Data set    Kernel  Sel. ftrs.  Sel. smpls.  Avg. error  STD  Worst result  Which dof
1      open-close  lin.    3           273          -           -    -             $\theta_z$
1      open-close  gauss.  90          10           -           -    -             T IP
1      open-close  both    13          29           -           -    -             $\theta_z$
1      complex     lin.    31          839          -           -    -             M DIP
1      complex     gauss.  90          42           -           -    -             M DIP
1      complex     both    35          36           -           -    -             M DIP
3      open-close  lin.    2           273          -           -    -             T IP
3      open-close  gauss.  90          10           -           -    -             T IP
3      open-close  both    12          29           -           -    -             T IP
3      complex     lin.    31          839          -           -    -             M DIP
3      complex     gauss.  90          41           -           -    -             M DIP
3      complex     both    34          36           -           -    -             M DIP

Table 1. Results with synthetic data obtained using 273 and 839 training samples for the open-close and complex data sets, respectively. The same number of samples was used for testing, with no intersection between the sets. Both data sets have 90 features in total. Errors are in degrees.

5. Experiments and Results

5.1. Number of Relevant Vectors

The graphs of Figure 6 show the number of selected relevance vectors as a function of the threshold $\tau$. Note that the same threshold leads to the selection of more relevance vectors for a single view: even though the same number of training samples (of the same dimensionality) is used in both cases, fewer relevance vectors are selected for multiple views, indicating that their measurements are more discriminative. Fewer samples and fewer features are needed to achieve the same relevance for multiple views.

5.2. Synthetic Images

For the experiments with synthetic images, the data set was evenly split into a training set and a testing set (with no intersection), and ground truth data is available.

Table 1 shows a quantitative evaluation of the results for both data sets using synthetic images. The columns 'sel. ftrs.' and 'sel. smpls.' indicate how many relevant vectors were selected with the linear and Gaussian kernels, respectively. The column 'worst result' shows the average error for the parameter (dof) whose estimate was the worst, indicated in the column 'which dof'. The abbreviation T IP refers to the thumb's interphalangeal joint, and M DIP to the middle finger's distal interphalangeal joint.

Figure 7. Left: Silhouettes obtained from a sample pose in the training set from camera 1 (top) and camera 2 (bottom), highlighting (with a red '*') the points whose shape contexts are taken into account after the selection of two relevant features. Right: The manifold of the 30 clusters of all the shape contexts from camera 2, projected onto its first two principal components, with the centroid of the selected cluster indicated by a blue circle.

As expected, the worst estimates occurred in two cases: (i) for dofs related to parts of the hand whose contour was occluded in many of the images, and (ii) for the rotation $\theta_z$ when a single view is used, as this is not a rotation parallel to the top view image plane.

The sequence of movements in the open-close data set can roughly be described by two degrees of freedom: flexion of all the joints and a twisting movement of the hand about the forearm axis ($\theta_z$). In order to verify the ability of the regressor to identify this, a feature selection experiment was performed, tuning the threshold $\tau$ to select only two relevant vectors. For a single view, however, three features were selected, because any greater threshold resulted in only one feature. For three views, one vector from the top view and another from one of the side views (camera 1) were selected, as shown in Figure 7.

Note that, for both views, the selected centroids are close to the wrist rather than the finger tips. A possible reason is that features closer to the finger tips present too much variation between samples and are absent from some of the samples, e.g. those with the hand in a fist pose. This has also been observed for a single view.

The regression results obtained (see Table 1) show that the regressor is able to give a rough approximation of the pose using a minimal set of selected vectors (in this case, image features). Even using fewer features, multiple views can achieve higher accuracy than a single view. It was also observed that, for a single view, the pose estimate gets poorer as $\theta_z$ grows, because the top view does not offer enough distinct features on its own when the fingers become nearly aligned with the camera axis.

Figure 8. Regression results combining both feature selection and sample selection, for the PIP flexion of the index finger (a) and the rotation $\theta_z$ (b), plotted in degrees against time (frames) together with the ground truth, for single and multiple views. The parameters were tuned to select 13 or 12 features for single and multiple views, respectively, and 29 samples.

When using Gaussian kernels, it is harder to intuit the minimal set of samples needed to estimate the pose. The threshold $\tau$ was chosen so that 10 relevant samples were selected from the training set; the results are shown in Table 1.

Both for single and multiple views, the selected samples are mostly from "near-fist" hand poses. This may seem odd, but it is not unusual in RVM for the most relevant samples to be distant from the obtained pose estimates, and not to be the most comprehensive samples in terms of the variability of state (poses) [14].

Figure 8 reports the application of feature selection followed by sample selection, combining speed and performance. Note that the superiority obtained with multiple views is most evident for $\theta_z$. The pose of the hand was estimated individually for each frame, which explains the jittery trajectories.

In general, the improvement obtained by using multiple views is evident, particularly when the number of features used is small. However, the improvement is view-dependent: if a single view captures the most meaningful silhouette, the improvement is diminished. A further reduction in improvement arises because the synthetic images used so far are noise free. As shown in the next section, the improvement is restored when using real images.


Figure 9. Results obtained from real images (top row) for single view (middle row) and multiple views (bottom row), using a Gaussian kernel with all the samples and all the features.

Figure 10. Results obtained from real images (top row) for singleview (middle row) and multiple views (bottom row), using a subsetof 32 features and 38 selected samples.


5.3. Real Images

For the real images, whole training sets were used, giving 1679 training pairs for the complex data set. Since there is no ground truth available for the real images, only qualitative results are shown.

Fig. 9 shows that multiple views provide a significant improvement over single view data. The improvement becomes more evident when a small selection of features and samples is used, as shown in Fig. 10. Note that, for a single view, the regressor seems unable to recover some of the poses, probably because the measurements correspond to poses that extrapolate beyond the space of trained poses.

5.4. Computational Cost

As expected, the training phase, which is done off-line, is very demanding both in terms of memory and CPU usage, especially in the clustering for vector quantisation. However, once the histogram descriptors $x$ are obtained, training the regressor is not so expensive: it takes between 7 s for linear kernel functions using 32 features and 328 s for Gaussian kernels using all features and 38 samples. This is reduced to 305 s if only 32 selected features are used. These measurements were obtained running a MATLAB implementation on a 2.4 GHz Pentium 4 computer, averaging the timing over the complex data set.

In the application phase, the extraction of the image descriptors $x$ is the only step whose computational cost is $O(C)$, where $C$ is the number of cameras. The average time for this step is 170 ms per image.

The actual pose estimation process is extremely fast, taking between 7.2 ms for linear kernel functions using 32 features and 35.7 ms for Gaussian kernels using all features and 38 samples. This is reduced to 25.4 ms if a subset of 32 selected features is used. Therefore, using the most expensive parameters, the application of the algorithm takes 652 ms per frame if three cameras are used.

In terms of memory usage, the computation of the histograms $x$ is the most expensive part, since all the shape contexts must be held for clustering. If Gaussian kernel functions are used, matrix $A$ is $O(m \times n)$, which has proved not so demanding even using all the 1679 training samples.

6. Conclusions

This paper has presented a regression-based method for the estimation of hand pose in 3D from multiple view image descriptors, advancing the single-view method that Agarwal and Triggs [1] proposed for human pose estimation.

Skin silhouettes were extracted from colour imagery, and their contour points described using the shape contexts of Belongie et al. [6]. The considerable variation in hand pose typically observed in imagery requires care to be taken to ensure scale and rotational invariance in the contexts. The use of contexts aligned with the axis of the forearm was found to be optimal. By ensuring rotational and scale invariance, the number of training samples needed was reduced, provided triangulation was first used to recover the global pose parameters.

A global image descriptor for each view was obtained by coding the manifold of shape contexts using vector quantisation, and the descriptors were combined at an intermediate level into a multiview descriptor by concatenation. The mapping between multiview descriptors and 3D poses was learned using Agarwal and Triggs' [1] extension of Tipping's Relevance Vector Machine [14].

Our experiments have, inter alia, examined the effects of feature selection (linear kernel functions) and sample selection (Gaussian kernel functions), both on the quality of pose determination and on the computational time, using both synthetic and real imagery. We have found that linear kernel functions have the advantage of a computational cost independent of the amount of training data used. However, we have found Gaussian kernel functions to be more robust. Our experiments have also shown that, for general views, fewer relevance vectors are needed in the multiple view case. Their measurements are more discriminative, allowing correct pose estimates to be recovered in cases where a single view all but fails.


An obvious modification to the current method of learning would be to use Jurie and Triggs' blah blah. However, the main thrust of future work will be to XXX.

References

[1] A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance vector regression. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Washington, DC, June-July 2004. IEEE Computer Society Press.

[2] A. Agarwal and B. Triggs. Learning to track 3D human motion from silhouettes. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.

[3] A. Agarwal and B. Triggs. Tracking articulated motion using a mixture of autoregressive models. In T. Pajdla and J. Matas, editors, Proceedings of the European Conference on Computer Vision, number 3023 in Lecture Notes in Computer Science, pages 54-65, Prague, Czech Republic, May 2004. Springer-Verlag Berlin Heidelberg.

[4] A. Agarwal and B. Triggs. Monocular human motion capture with a mixture of regressors. In Workshop on Vision for Human Computer Interaction (V4HCI), in conjunction with CVPR, San Diego, CA, June 2005. IEEE Computer Society Press.

[5] V. Athitsos, J. Alon, S. Sclaroff, and G. Kollios. BoostMap: A method for efficient approximate similarity rankings. In Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, 2004.

[6] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509-522, April 2002.

[7] M. Brand. Shadow puppetry. In Proc. 7th Int. Conf. on Computer Vision, Corfu, pages 1237-1244. IEEE Computer Society Press, 1999.

[8] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly. A review on vision-based full DOF hand motion estimation. In Workshop on Vision for Human Computer Interaction (V4HCI), in conjunction with CVPR, San Diego, CA, June 2005. IEEE Computer Society Press.

[9] D. C. Hogg. Model-based vision: a program to see a walking person. Image and Vision Computing, 1(1):5-20, 1983.

[10] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. In Proc. 9th Int. Conf. on Computer Vision, Nice, France, October 2003.

[11] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative density propagation for 3D human motion estimation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Diego, CA, June 2005.

[12] B. Stenger, A. Thayananthan, P. Torr, and R. Cipolla. Filtering using a tree-based estimator. In Proceedings of the ICCV. IEEE, 2003.

[13] D. J. Sturman. Whole-Hand Input. PhD thesis, Media Arts and Science Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA, February 1992. http://xenia.media.mit.edu/~djs/thesis.ftp.html.

[14] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, June 2001. http://jmlr.csail.mit.edu/.

[15] Y. Wu, J. Y. Lin, and T. S. Huang. Capturing natural hand articulation. In Proceedings of the International Conference on Computer Vision (ICCV), Vancouver, Canada, 2001. IEEE.
