arXiv:1410.0117v2 [cs.CV] 6 Oct 2014


Coupling Top-down and Bottom-up Methods for 3D Human Pose and Shape Estimation from Monocular Image Sequences

Atul Kanaujia

ObjectVideo, Inc.

Reston, [email protected]

Abstract

Until recently, Intelligence, Surveillance, and Reconnaissance (ISR) focused on acquiring behavioral information about targets and their activities. The continuous evolution of the intelligence gathered on human-centric activities has put increased focus on the humans themselves, especially on inferring their innate characteristics: size, shape and physiology. These biosignatures, extracted from surveillance sensors, can be used to deduce age, ethnicity, gender and actions, and to further characterize human actions in unseen scenarios. However, recovering the pose and shape of humans in such monocular videos is an inherently ill-posed problem, marked by frequent depth- and view-based ambiguities due to self-occlusion, foreshortening and misalignment. The likelihood function often yields a highly multimodal posterior that is difficult to propagate even with the most advanced particle filtering (PF) algorithms. Motivated by the recent success of discriminative approaches in efficiently predicting 3D poses directly from 2D images, we present several principled approaches that integrate predictive cues from learned regression models to sustain the multimodality of the posterior during tracking. Additionally, these learned priors can be actively adapted to the test data using a likelihood-based feedback mechanism. The estimated 3D poses are then used to fit a 3D human shape model to each frame independently in order to infer anthropometric biosignatures. The proposed system is fully automated, robust to noisy test data, and able to recover swiftly from tracking failures even after significant errors. We evaluate the system on a large number of monocular human motion sequences.

1. Introduction

Extracting biosignatures from fieldable surveillance sensors is a desired capability for human intelligence gathering and for identifying and engaging human threats from significant standoff distances. This entails fully automated 3D pose and shape analysis of the human targets in videos, recognition of their activities, and characterization of their behavior. However, 3D human pose and shape inference in monocular videos is an extremely difficult problem, involving high-dimensional state spaces, one-to-many correspondences between the visual observations and the pose states, strong non-linearities in human motion, and a lack of discriminative image descriptors that generalize across the hugely varying appearance space of humans. Traditionally, top-down generative modeling methods have been employed to infer these high-dimensional states by generating hypotheses in an anthropometrically constrained parameter space; the hypotheses are continuously refined by an image-based likelihood function. However, top-down modeling, being a somewhat indirect way of approaching the problem, faces challenges due to the computationally demanding likelihood function and its need for accurate physical human models to simulate and differentiate ambiguous observations (see Fig. 1). The failures of top-down models have motivated the development of bottom-up discriminative methods: fast feed-forward approaches that directly predict states from observations using learned mapping functions. Bottom-up methods, while simple to apply, are frequently plagued by a lack of representative features to model foreshortening and self- and partial occlusion, which limits their performance in unseen scenarios. In this work we attempt to combine the two approaches under a common framework, a non-parametric density propagation system based on particle filtering. Fig. 2 shows the key components of the 3D pose tracking and human shape analysis system.

Particle filtering forms a popular class of Monte Carlo simulation methods for approximately and optimally estimating non-Gaussian posteriors in systems with non-linear measurements and analytically intractable state transfer functions. Intrinsically, particle filtering is a non-parametric generative density propagation algorithm that uses recursive prediction and correction steps to estimate the posterior over the high-dimensional state space from a temporal sequence of observations. Generative simulation filters have been widely applied to various tracking problems in vision and are well understood. However, techniques that overcome their drawbacks, such as irrecoverable tracking failure due to noisy observations, by incorporating discriminative (predictive) cues are less well explored. We develop three principled techniques to incorporate discriminative cues into the particle-filter-based 3D human pose tracking framework. The techniques aim to overcome the limitations of particle filtering by improving both the proposal density model and the likelihood computation. Unlike past approaches, we not only use bottom-up methods to initialize the top-down methods and supply them with discriminative cues for improved tracking, but also add a feedback mechanism that adapts the bottom-up models through online learning from the top-down model.

In particle filtering (PF), sampling from a marginal distribution is made tractable by recursively computing particle weights, which causes degeneracy of the particles. Degeneracy is efficiently overcome by re-sampling, which, however, causes sample impoverishment over longer sequences. This is a more difficult problem, and currently no principled mechanism exists to overcome it. In the context of 3D human pose tracking, a lack of particle diversity may cause a failure to preserve the multimodality of the posterior density. Fig. 1 illustrates the severity of tracking failures caused by the inability to track all the modes within a particle filtering framework. In this work we tackle the problem of characterizing the multivaluedness of the dataset and develop algorithms to sustain it during tracking. We identify the cause of sample impoverishment as the underlying generative, model-based tracking mechanism of the PF, which cannot model large deviations from the examples that are typical of a data set. In contrast, predictive (discriminative) methods provide an alternative approach for handling difficult cases not modeled by generative methods. Specifically, learned priors can be used to furnish additional particles during tracking and to maintain multimodality for enhanced 3D pose recovery. Although purely discriminative methods such as [14] have been used in the past to propagate the posterior at each time step using a learned conditional, our work attempts to combine the two approaches in a mutually complementary framework, overcoming the setbacks of one with the strengths of the other.

The tracked 3D poses are then used to estimate 3D human shapes using simulated annealing. The 3D shapes, estimated independently at each frame, are then used to infer attributes such as gender, weight, height and the dimensions of various body parts by averaging over the sequence of frames. In principle, the tracked 3D poses in a sequence of frames can be used to iteratively infer the posterior over the 3D shape parameters using a forward-backward algorithm. We anticipate that this extension will improve shape analysis compared to the current framework; it is planned as future work during the course of the project. We provide extensive evaluation results for the various components of the system and demonstrate the efficacy of our algorithms in overcoming strong ambiguities observed in the data.

Figure 1. Ambiguous observations with one-to-many correspondences between the 2D image and the 3D pose. The likelihood function based on silhouette overlap gives similar likelihood scores for these images.

Figure 2. Overview of our system for 3D human pose and shape estimation and analysis from monocular image sequences.

Contributions: (a) We develop a fully automated system for estimating and analyzing the shapes of humans in monocular video sequences; (b) We develop a novel measure to characterize multimodality in a dataset; (c) We develop principled techniques to incorporate predictive cues into the particle filtering framework and preserve multimodality during 3D pose tracking; (d) We demonstrate principled integration of online learning into the tracking framework. The learning progressively adapts the predictive models to the dataset by including accurately predicted examples in the basis set, reducing errors due to training bias.

System Overview: Fig. 2 sketches the control flow and lists the key components of our system: (a) Silhouettes extracted using background subtraction are used to compute shape descriptors; (b) A low-dimensional representation of 3D human body pose is learned offline using a non-linear Latent Variable Model (LVM) [7]; (c) Bottom-up (discriminative) models are trained offline from labeled examples obtained from motion capture data. Training involves learning a hierarchical mixture of experts by partitioning the data set based on viewpoint at level 1 and on one-to-many mapping ambiguity at level 2; (d) The bottom-up models are used to initialize the global orientation and the 3D joint angles (pose) from the 2D silhouette shape. Translation in 3D space is estimated using the camera calibration parameters. Joint angles and global orientation are optimized using simulated annealing and the average 3D shape; (e) Tracking is performed using annealed particle filtering, sampling particles from both the dynamics and the bottom-up proposal distributions; (f) A statistical 3D human shape model is learned offline by non-rigidly deforming a 3D template mesh to laser scan data; (g) The 3D shape is fitted to the observed silhouettes using annealed particle filtering; (h) The estimated 3D shape is used to extract biosignatures and physiological attributes of the human.

2. Related Work

Since its introduction about 20 years ago, particle filtering has been widely applied in various domains of target tracking and optimization. Comprehensive tutorials on particle filtering methods are given in [5][4][1]. A number of enhancements to particle filtering already exist in the literature (such as the Auxiliary PF, Gaussian Sum PF, Unscented PF, Swarm Intelligence based PF and Rao-Blackwellised Particle Filtering for DBNs), specifically focused on the setbacks of simulation-based filtering. However, only a few works have addressed possible ways of incorporating discriminative information into the filtering process. Relevant works that have attempted to combine the two approaches include [15, 12] for articulated body pose recovery in static images, [6, 3, 8] for improving tracking, and [11] for non-rigid deformable surface reconstruction. Sminchisescu et al. [15] proposed an efficient learning algorithm that combines generative and discriminative information by incorporating a feedback mechanism from the generative models to improve the predictions of the discriminative model. Rosales and Sclaroff [10] employed a discriminative model based on a mixture of neural networks as a verification step for generative pose estimation in static images. Urtasun et al. [11] proposed a combined framework that explicitly constrains the outputs of discriminative regression methods using additional constraints learned as a generative model. Notable among these are the approaches proposed in [6, 3, 8, 12], which employ simulation-based methods to recover 3D pose in monocular image sequences. Sigal et al. [12] used discriminative models as an initialization step for pose optimization in static images. Lee and Nevatia [9] developed 3D human pose tracking using data-driven MCMC. They used a rich combination of bottom-up belief maps as the proposal distribution to sample pose candidates in their component-based Metropolis-Hastings approach. However, their mixing ratios are pre-determined and chosen in an ad hoc fashion, whereas we propose a more principled approach that adaptively determines these ratios to overcome specific limitations of simulation-based filters. The approach to combining top-down and bottom-up information in [6, 3] also employs a pre-defined importance sampler from a data-driven belief map using distributions that are not learned, a significantly different approach than ours.

Sustaining multimodality in particle filtering has been addressed in the past by Vermaak et al. [17], albeit in the context of limiting re-sampling to a set of mixture components fitted to the particles. That approach may be somewhat restrictive if the number of mixture components is too small. To the best of our knowledge, no principled mechanism has been proposed that combines discriminative and generative information to overcome the deficiencies of PF tracking while simultaneously adapting the predictive priors. Active online learning has been widely applied in all major learning-based frameworks of computer vision. However, no past work has applied it to the problem of 3D human pose recovery from monocular image sequences.

Figure 3. Degree of multimodality between input (image features) and output (3D joint angles) for the HumanEva motion capture data [13] with $N = 6940$ data pairs and $N_{\text{clusters}} = 1500$.

3. Measure of Multimodality

We develop an information-theoretic measure to quantify the multivaluedness of the mapping from 2D image to 3D pose in a human motion capture dataset. Our approach extends the multimodal data representation presented in [14], which models the degree of multivaluedness present in the data as the number of unique 3D pose ($x_t$) clusters in correspondence with the elements of the 2D image ($r_t$) clusters. The clusters, obtained using K-Means for both $r_t$ and $x_t$, model perturbations due to noise in the observation and pose spaces respectively. Fig. 3 shows the histogram obtained from the multimodality analysis of the data. We make two modifications to the weights given to these associations: (a) Large observation (input) clusters associated with single or multiple pose (output) clusters imply stronger multimodality in the dataset, so we explicitly give weights proportional to cluster sizes when generating the histogram; (b) Within each input cluster, the distribution of points associated with different output clusters reflects the multimodality of the data. For example, an input cluster with output cluster association indices $A = [1, 1, 1, 1, 2, 3]$ reflects weaker tri-modality than $B = [1, 1, 2, 2, 3, 3]$, even though both have the same cluster size. We encode this information using Shannon's entropy $H(x) = -\sum_i p(x_i)\log_2 p(x_i)$, which captures the regularity of the probability distribution with which the input cluster points are associated with different output clusters. For our case $H(A) = 1.8136$ and $H(B) = 2.1972$. The weights for the correspondence between the input clusters ($x$) and the output clusters ($y$) can be formulated as:

\[
h_n(x) = \frac{N(x)\,\exp\!\left(-\sum_{i=1}^{n} p(y_i)\log p(y_i)\right)}{n} \tag{1}
\]

where $h_n(x)$ is the weight attached to the associations between the $N(x)$ elements of the $x$ cluster and the corresponding outputs $y$ spread across $n$ clusters. Fig. 3 shows the multimodality plot obtained from the HumanEva dataset [13] for $N = 6940$ data points and $N_{\text{clusters}} = 1500$. As discussed in the experiments (Section 7), we use this measure to quantify the degree of multimodality maintained by the tracked hypotheses in particle filtering.
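A minimal Python sketch of the entropy and of the weight in Eq. (1) for a single input cluster is given below. It is an illustration under stated assumptions, not the implementation behind Fig. 3: the function names are hypothetical, the entropy inside Eq. (1) is taken in nats (the equation writes a plain log), and the base-2 entropies it computes need not coincide with the values of $H(A)$ and $H(B)$ quoted above, whose normalization is not specified.

```python
import math
from collections import Counter


def association_entropy(assoc, base=2.0):
    """Shannon entropy of the output-cluster labels seen in one input cluster."""
    counts = Counter(assoc)
    total = len(assoc)
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())


def multimodality_weight(assoc):
    """Eq. (1): h_n(x) = N(x) * exp(-sum_i p(y_i) log p(y_i)) / n (natural log assumed)."""
    n_points = len(assoc)        # N(x): size of the input cluster
    n_outputs = len(set(assoc))  # n: number of distinct output clusters
    entropy_nats = association_entropy(assoc, base=math.e)
    return n_points * math.exp(entropy_nats) / n_outputs


# Association indices from the example in the text.
A = [1, 1, 1, 1, 2, 3]
B = [1, 1, 2, 2, 3, 3]
weight_A = multimodality_weight(A)
weight_B = multimodality_weight(B)
print(association_entropy(A), association_entropy(B), weight_A, weight_B)
```

Under these assumptions the evenly spread association $B$ receives a larger weight than $A$, which agrees with the ordering argued for in the text.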

4. Predictive Models for 3D Human Poses

We work with temporally ordered sequences of vectors: $Y_n = \{y_1, \cdots, y_n\}$ denotes the 3D human body pose as vectors of joint angles, $X_n = \{x_1, \cdots, x_n\}$ the states (latent space of the 3D pose vectors) learned using a non-linear latent variable model, and $R_n = \{r_1, \cdots, r_n\}$ the image observations in the form of silhouettes obtained using background subtraction. We use the Spectral Latent Variable Model (SLVM) [7] to compute low-dimensional representations of the pose vectors. In principle, any latent variable model (such as GPLVM, GPDM or GTM) that supports structure-preserving, bi-directional mappings can be used here to remove correlations between redundant dimensions of the joint angle space. We work in the latent space $X$ both for learning predictive models and for filtering. The original joint angle space $Y$ is used for likelihood computation, 3D pose visualization and rendering. To preserve the diversity of the particles and the multimodality of the posterior, we replenish the particles by sampling from a multimodal prior learned as a hierarchical Bayesian Mixture of Experts (hBME), which models the multivalued relation between the 2D image space and the 3D human pose space. In hBME, each expert (functional predictor) is paired with an observation-dependent gate function that scores its competence in predicting states when presented with different inputs (images). For different inputs, different experts may be active and their rankings (relative probabilities) may change. The conditional distribution over predicted states has the form:

\[
p(x \mid r, \Omega) = \sum_{v=1}^{N_v} g_v(r \mid \gamma_v) \sum_{i=1}^{N_d} g_i(r \mid v, \lambda_i)\, p_i(x \mid r, W_i, \Sigma_i^{-1})
\]

where $\Omega = \{W, \gamma, \lambda, \Sigma\}$ are the parameters of the classification ($g_v$ and $g_i$) and regression functions. At the highest level, the gate functions $g_v$ partition the data into $N_v$ view-specific clusters to model view-based ambiguities. Within each cluster we further partition the data into $N_d$ predictive sets to model depth-based ambiguities. For each set, we train regression models using the Relevance Vector Machine [16], with the predictive distribution of the experts $p_i$ learned as Gaussian functions (2) centered at the expert predictions (non-linear regressors with weights $W_i$).

\[
p_i(x \mid r, W_i, \Sigma_i^{-1}) = \mathcal{N}\!\left(W_i\Phi(r),\; \sigma_D + \Phi(r)\Sigma_i\Phi(r)^T\right) \tag{2}
\]

where the predictive variance is the sum of a fixed noise term in the training data, $\sigma_D$, and an input-specific variance $\Phi(r)\Sigma_i\Phi(r)^T$ due to uncertainty in the weights $W_i$. The gate functions $g_v$ and $g_i$ are input-dependent linear classifiers modeled as softmax functions with weights $\lambda_i$, normalized to sum to 1 for any given input $r$: $g_i(r) = \exp(\lambda_i^\top r) / \sum_k \exp(\lambda_k^\top r)$, where $r$ are the image descriptors and $x$ are the state outputs in the latent space, in correspondence with the 3D pose $y$ in the original joint angle space. The gate function $g_v(r \mid \gamma_v)$ is trained to recognize the view $v$ from the observation $r$. Within each view $v$, $g_i(r \mid v, \lambda_i)$ outputs the confidence in using expert $i$ for predicting the state.

Bayesian Online Learning for hBME: The performance of predictive models depends on the assumption that the training examples are representative of the test data. We develop an online learning algorithm that dynamically adapts the predictive models to the test data during the generative filtering process. Accurate 3D pose hypotheses generated by the PF are used to improve the accuracy of the predictive priors and to specialize them to the test domain. This involves both updating the parameters and adaptively updating the bases sets of the gates and experts of the hBME. We use a Bayesian relevance criterion to add elements to, or delete them from, the bases of the learned models; the criterion attempts to maximize the marginal likelihood (ML) of the observations with respect to the hyper-parameters of the model. The hyper-parameters are the parameters of the hierarchical priors that control the sparsity of the models through the Automatic Relevance Determination (ARD) mechanism [16]. A new labeled data point $(r_i, x_i)$ is included in the bases set of an RVM classifier (or regressor) if its inclusion improves the ML of the model. The decomposition of the covariance matrix aids the computation of the change in ML due to an individual element:

\[
|\Sigma^{-1}| = |\Sigma_{-i}^{-1}| - \frac{\Sigma_{-i}^{-1}\Phi(r_i)\Phi(r_i)^T\Sigma_{-i}^{-1}}{\alpha_i + \Phi(r_i)^T\Sigma_{-i}^{-1}\Phi(r_i)} \tag{3}
\]

where $\Sigma^{-1}$ and $\Sigma_{-i}^{-1}$ are the covariance with and without the new data, and $\alpha_i$ is the hyperparameter expressing the uncertainty that the weights $W_i$ of the new basis element are zero. The change in the ML due to an added basis element, $L(\alpha) = L(\alpha_{-i}) + l(\alpha_i)$, can be analyzed independently using $l(\alpha_i)$ to decide whether to include the element in the basis set. Including a new basis element may make other elements redundant; these can subsequently be re-evaluated to support the inclusion (or deletion) of further elements in the bases set. The new bases set is then used to re-estimate the parameters of the models.
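Read as a matrix identity, i.e. without the bars (an assumption about the intended meaning), Eq. (3) has the form of the Sherman-Morrison rank-one update: if $C = C_{-i} + \alpha_i^{-1}\phi\phi^T$, then $C^{-1} = C_{-i}^{-1} - C_{-i}^{-1}\phi\phi^T C_{-i}^{-1} / (\alpha_i + \phi^T C_{-i}^{-1}\phi)$. The NumPy sketch below checks this identity numerically on random data; the names `C_minus`, `phi` and `alpha` are illustrative stand-ins, and the sketch does not reproduce the RVM training code itself.

```python
import numpy as np


def rank_one_inverse_update(C_minus_inv, phi, alpha):
    """Inverse of (C_minus + phi phi^T / alpha) computed from inv(C_minus).

    Assumes C_minus_inv is symmetric, so that phi^T C_minus_inv equals v^T below.
    """
    v = C_minus_inv @ phi
    return C_minus_inv - np.outer(v, v) / (alpha + phi @ v)


rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
C_minus = M @ M.T + 5.0 * np.eye(5)   # symmetric positive definite stand-in
phi = rng.standard_normal(5)          # basis response of the candidate element
alpha = 2.0                           # its precision hyperparameter

direct = np.linalg.inv(C_minus + np.outer(phi, phi) / alpha)
updated = rank_one_inverse_update(np.linalg.inv(C_minus), phi, alpha)
print(np.max(np.abs(direct - updated)))
```

The point of such an update is that the effect of one candidate basis element can be evaluated without re-inverting the full matrix.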

5. Incorporating Predictive Cues in Annealed Particle Filtering

Tracking is initialized using the predictive models. Approximate translation is estimated geometrically from the camera calibration parameters, assuming no camera tilt and an upright human of height 1.78 m. The generative filtering algorithm recursively propagates the posterior over the state sequence at each time step $n$ using the following prediction and correction step:

\[
p(X_n \mid R_n) \propto p(r_n \mid x_n) \int p(x_n \mid X_{n-1}, R_{n-1})\, p(X_{n-1} \mid R_{n-1})\, dx
\]

Particle filtering propagates the posterior as a set of $N_s$ weighted particles (hypotheses) at each time step $n$, $\{x_n^i, w_n^i\}_{i=1 \cdots N_s}$. It computes these importance weights at successive time steps recursively from the weights of the previous time step. This avoids the growing computational cost of recomputing the weights for the entire sequence $X_n$ at every time step $n$:

\[
w_n^i = w_{n-1}^i\, \frac{p(r_n \mid x_n^i)\, p(x_n^i \mid x_{n-1}^i)}{q(x_n^i \mid X_{n-1}^i, R_n)} \tag{4}
\]

where the importance density at time step $n$ is further approximated as $q(x_n \mid X_{n-1}, R_n) \approx q(x_n \mid x_{n-1}, r_n)$. Simulated Annealing (SA) is a stochastic optimization algorithm that runs a series of re-sampling and diffusion steps to attain an approximate global optimum. APF, introduced by Deutscher et al. [3], employs simulated annealing at each time step to diffuse the particles to other modes of the cost function. We extend the Annealed Particle Filtering (APF) algorithm to integrate predictive cues from the learned priors. APF approximates the importance density as $q(x_n \mid x_{n-1}^i, r_n) = p(x_n \mid x_{n-1}^i)$. The weight update equation thus becomes $w_n^i \propto w_{n-1}^i\, p(r_n \mid x_n^i)$. Re-sampling and simulated annealing optimization are then performed alternately at each iteration.
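The alternation of weighting, re-sampling and diffusion can be sketched on a one-dimensional toy state as follows. This is a simplified illustration in the spirit of APF [3], not the tracker evaluated here: the Gaussian toy likelihood, the annealing exponents `betas`, the halving diffusion schedule and all function names are assumptions made for the example.

```python
import math
import random


def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def resample(particles, weights, rng):
    """Multinomial re-sampling: draw particles in proportion to their weights."""
    return rng.choices(particles, weights=weights, k=len(particles))


def annealed_step(particles, likelihood, rng, betas=(0.25, 0.5, 1.0), sigma=1.0):
    """One time step: per annealing layer, weight, re-sample, then diffuse."""
    for layer, beta in enumerate(betas):
        # Smoothed likelihood p(r|x)^beta; beta grows towards 1 across layers.
        weights = normalize([likelihood(x) ** beta for x in particles])
        particles = resample(particles, weights, rng)
        scale = sigma * (0.5 ** layer)  # shrink the diffusion in later layers
        particles = [x + rng.gauss(0.0, scale) for x in particles]
    # Final weights follow w ∝ p(r|x), since re-sampling reset the old weights.
    weights = normalize([likelihood(x) for x in particles])
    return particles, weights


def toy_likelihood(x):
    """Unnormalized Gaussian likelihood peaked at x = 3 (toy stand-in)."""
    return math.exp(-0.5 * (x - 3.0) ** 2)


rng = random.Random(0)
particles = [rng.uniform(-10.0, 10.0) for _ in range(500)]
particles, weights = annealed_step(particles, toy_likelihood, rng)
estimate = sum(w * x for w, x in zip(weights, particles))
print(estimate)
```

The early layers use a flattened likelihood so that particles far from the mode survive re-sampling long enough to migrate towards it.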

5.1. Optimal Proposal Filtering (OPF)

Figure 4. Comparison of tracking results for the 4 different particle filtering algorithms on the HumanEva I data set for the Jogging sequence of subject S2, frames 13, 34, 127, 174, 198 and 243. We show the estimated pose and the deformed average 3D human mesh model. All the particles were initialized using pose estimates from the bottom-up model in frame 13. (Top row) Tracking results using Annealed Particle Filtering (APF). Notice the tracking failures in frames 34 and 198 due to left-right leg ambiguity and in frame 127 due to viewpoint ambiguity; (Second row) Tracking results using the Optimal Proposal Density. The learned distribution accurately resolves ambiguities and prevents tracking failure; (Third row) Tracking results using Joint Particle Filtering, which achieves the best accuracy, with parts accurately aligned to the observation; (Fourth row) Tracking results using Joint Likelihood Modeling.

The optimal importance density [4] is given by $q_{opt}(x_n \mid x_{n-1}, r_n) = p(x_n \mid x_{n-1}, r_n)$. This density is called optimal because sampling from it gives the following recursive update of the weight of the $i$th particle: $w_n^i \propto w_{n-1}^i\, p(r_n \mid x_{n-1}^i)$, which makes the new weights effectively independent of the sampled particles $x_n^i$. Our first method for incorporating predictive information learns this distribution as a conditional Bayesian Mixture of Experts (cBME). The form of the model is similar to that discussed in Section 4, with only one level of gate functions. A key issue in learning this conditional is to accurately model the relative scales of the state space data points $x_{t-1}$ and the observations $r_t$. This is required to prevent the prediction in the current frame from being driven entirely by either the current observation or the previous state. Therefore, for training the experts and gates in our BME, we use a kernel basis of the form:

\[
\Phi(x, r) = K_{\sigma_x}(x, x_i)\, K_{\sigma_r}(r, r_i) \tag{5}
\]

where the RBF kernel has the form $K_{\sigma_x}(x, x_i) = e^{-\sigma_x \|x - x_i\|^2}$. The scales $\sigma_x$ and $\sigma_r$ determine how well the learned conditional $p(x_n \mid x_{n-1}, r_n)$ generalizes to test examples. Too narrow a width for $x$ may turn off the kernels if the estimated pose from the previous time step differs even slightly from the training exemplars, and may cause the predictors to output an average pose. If the scale is too wide, the regression model may simply average over the multiple observation-based predictions. As is true of any predictive modeling, this method assumes that both training and test exemplars are representative samples from a common underlying distribution.
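The product kernel of Eq. (5) and the effect of the state scale can be illustrated with the short Python sketch below. The exemplar values and function names are made up for the example; note that in the parameterization $e^{-\sigma\|\cdot\|^2}$ a larger $\sigma$ corresponds to a narrower kernel.

```python
import math


def rbf(u, v, sigma):
    """K_sigma(u, v) = exp(-sigma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sigma * squared_distance)


def joint_basis(x, r, exemplars, sigma_x, sigma_r):
    """Eq. (5): one product-kernel response per training pair (x_i, r_i)."""
    return [rbf(x, x_i, sigma_x) * rbf(r, r_i, sigma_r) for x_i, r_i in exemplars]


# Hypothetical exemplars: (previous latent state, observation descriptor).
exemplars = [((0.0, 0.0), (1.0,)), ((1.0, 1.0), (2.0,))]

# Query that coincides with the first exemplar.
exact = joint_basis((0.0, 0.0), (1.0,), exemplars, sigma_x=1.0, sigma_r=1.0)

# Slightly perturbed previous state with a very narrow state kernel:
# every response collapses towards zero, i.e. the kernels "turn off".
narrow = joint_basis((0.1, 0.0), (1.0,), exemplars, sigma_x=1000.0, sigma_r=1.0)
print(exact, narrow)
```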

5.2. Joint Particle Filtering (JPF)

The importance density should be as close as possible to the posterior to achieve optimal tracking performance. Techniques already exist (based on partitioned sampling and bridging densities) to overcome sub-optimal proposal densities. Choosing an appropriate importance density can reduce the effect of sample impoverishment in particle filtering and consequently improve its ability to recover from errors. Narrower predictive distributions from the learned bottom-up models provide a useful proposal for generating particles with higher weights that can competently span the posterior state space. The predictive proposal distribution is a mixture of Gaussians summed across all viewpoints and expert models, representing all plausible poses for a given observation. At each time step we replace a few particles with samples from the predictive proposal $p_B(x_n \mid r_n)$ to maintain particle diversity. The importance density is modeled as:

\[
q(x_n \mid x_{n-1}, r_n) = (1 - \gamma_n)\, p(x_n \mid x_{n-1}) + \gamma_n\, p_B(x_n \mid r_n) \tag{6}
\]

Critical to this approach is the dynamic adjustment of the fraction $\gamma_n$ at each time step $n$. $\gamma_n$ balances the predictively and the dynamically sampled particles. This enables effective recovery from errors when the proposal density fails to generate any particles near the true posterior. In our experiments, we found the tracking accuracy to be greatly influenced by the fraction $\gamma_n$. One possible way to estimate $\gamma_n$ is to use a traditional data fusion model that combines probabilistic densities using the Central Limit Theorem (CLT), weighting the individual densities by the inverse of their variances. Both proposal densities, from the learned dynamic model and from the bottom-up models, are probabilistic non-linear regression functions learned using the Relevance Vector Machine (RVM) [16]. The proposal densities have the same form as specified in eqn. (2). The predictive variance for a test input $r$ is $\sigma = \sigma_D + \Phi(r)\Omega\Phi(r)^T$. Here $\sigma_D$ is the fixed maximum likelihood estimate of the variance due to the training data. The second, data-dependent term denotes the confidence of the regression function in its prediction for a given input $r$. CLT sets the fraction as $\gamma_n = \frac{\sigma_2}{\sigma_1 + \sigma_2}$, where $\sigma_1$ and $\sigma_2$ are the predictive variances of the dynamical model $p(x_n \mid x_{n-1})$ and the predictive proposal $p(x_n \mid r_n)$ respectively. In practice, however, we found this approach less effective in balancing the particles sampled from the two distributions. The fraction tends to depend strongly on the learned models rather than on the observations. Ideally, $\gamma_n$ should increase when the particles sampled from the dynamic model have lower weights and should stay low when the dynamic model is doing well. We adopt a simple approach to control the number of particles sampled from the dynamics model and from the discriminative map: the fraction is adjusted purely on the basis of the weights of the particles sampled from each distribution. At each time step we compute $\gamma_n$ as

\[
\gamma_n = \frac{\sum_{i \in N^{BU}_{n-1}} w^{(i)}_{n-1}}{\sum_{i \in N^{BU}_{n-1}} w^{(i)}_{n-1} + \sum_{i \in N^{DYN}_{n-1}} w^{(i)}_{n-1}} \tag{7}
\]

where $N^{BU}_{n-1}$ and $N^{DYN}_{n-1}$ denote the sets of particles sampled from the predictive proposal map and from the dynamic distribution respectively. The $w^{(i)}_{n-1}$ are the particle weights before re-sampling in the previous time step. The motivation behind this weighting scheme is to assign a high weight to the proposal that is generating particles closer to the true posterior. We initialize $\gamma_0$ to 0.5 in the first frame and update the fraction at each time step to adaptively control sample diversity. In principle, the dependence of the fraction $\gamma_n$ on the particle weights can be extended to include a longer history of particle weights from the previous $N > 1$ frames. However, we found the current implementation to be sufficient and to significantly improve the accuracy of the APF tracker.
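A minimal Python sketch of the adaptive fraction of Eq. (7) and of sampling from the mixture proposal of Eq. (6) follows. The function names, the boolean flags that record each particle's origin, the fallback value used when all weights vanish, and the toy samplers are assumptions made for illustration only.

```python
import random


def update_gamma(weights, from_bottom_up, fallback=0.5):
    """Eq. (7): share of the previous weight mass carried by bottom-up particles.

    weights        -- particle weights before re-sampling at time n-1
    from_bottom_up -- True where the particle was drawn from p_B(x|r)
    fallback       -- value returned if all weights are zero (an assumption)
    """
    bottom_up_mass = sum(w for w, flag in zip(weights, from_bottom_up) if flag)
    total_mass = sum(weights)
    return bottom_up_mass / total_mass if total_mass > 0.0 else fallback


def sample_proposal(x_prev, gamma, sample_dynamics, sample_bottom_up, rng):
    """Eq. (6): draw from p_B with probability gamma, otherwise from the dynamics.

    Returns the sample and a flag recording which proposal produced it.
    """
    if rng.random() < gamma:
        return sample_bottom_up(), True
    return sample_dynamics(x_prev), False


gamma = update_gamma([0.1, 0.3, 0.6], [True, False, True])

# Toy samplers standing in for the learned dynamics and the bottom-up predictor.
rng = random.Random(1)
draws = [
    sample_proposal(0.0, gamma,
                    lambda x: x + rng.gauss(0.0, 0.1),
                    lambda: rng.gauss(2.0, 0.1),
                    rng)
    for _ in range(2000)
]
bottom_up_fraction = sum(flag for _, flag in draws) / len(draws)
print(gamma, bottom_up_fraction)
```

Keeping the origin flag with every particle is what makes the next evaluation of Eq. (7) possible.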

5.3. Joint Likelihood Modeling (JLM)

The likelihood distribution computes the belief in the particles in the light of the current observations. Ambiguities in 2D-to-3D mappings are primarily due to the failure of the likelihood function to assign different weights to seemingly similar but different 3D poses. Likelihood computation can be enhanced by incorporating richer low-level features that are discriminative yet generalize to different test scenarios. We propose an effective method to improve the likelihood model by treating the learned conditional $p_B(x \mid r)$ from the bottom-up models as a prior distribution over the state space $x$ conditioned on the input $r$. The joint likelihood is modeled as:

\[
p_L(r_n \mid x_n, H(r_n)) \propto p(r_n \mid x_n)^{1-\beta}\; p_B(x_n \mid H(r_n))^{\beta} \tag{8}
\]

where $H$ denotes the extracted descriptor of the observed silhouette. The fraction $\beta$, which balances the likelihood distribution against the predictive conditional, is chosen by cross-validation and is fixed to 0.35 in all the experiments. In our case, $p(r_n \mid x_n)$ is modeled as a complex non-linear transformation that projects a synthetic 3D mesh-based computer graphics model of a human in the pose $x$ (see Section 2) and computes the degree of overlap between the projected and the observed silhouettes. The bottom-up models, in contrast, employ shape information ($H(r_n)$) extracted from the outer contour of the silhouette (shape context followed by vector quantization), which is in some sense complementary to the silhouette overlap information used in $p(r_n \mid x_n)$. As $p_B(\ldots)$ has the analytical form of a mixture of Gaussians (see eqn. (2)), it can be readily evaluated for any particle $x_n^{(i)}$. The conditional prior simply reweighs the likelihood cost according to how close the hypothesized state of the particle $x_n^{(i)}$ is to the discriminatively predicted state $x$. In theory, a linear combination of the two distributions (with adjustable weights) may also be used to compute the likelihood weights of the particles. In the next section we compare the results of the extensive evaluation we performed for the three filtering algorithms with the baseline annealed particle filtering based tracker.
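In the log domain, Eq. (8) becomes a weighted sum of the silhouette log-likelihood and the log of the bottom-up mixture density. The sketch below evaluates it for a one-dimensional toy state; the mixture parameters, the placeholder log-likelihood value and the function names are assumptions, and only the default $\beta = 0.35$ is taken from the text.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a 1D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_mixture(x, components):
    """log p_B(x | H(r)) for a 1D Gaussian mixture given as (weight, mean, var)."""
    terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def joint_log_likelihood(log_lik, x, components, beta=0.35):
    """Eq. (8) in the log domain, up to an additive constant."""
    return (1.0 - beta) * log_lik + beta * log_mixture(x, components)


# Two predicted modes, e.g. two poses that explain the same silhouette.
components = [(0.5, -1.0, 0.1), (0.5, 1.0, 0.1)]

# Two particles with the same silhouette-overlap score (placeholder value).
same_overlap = -2.0
near_mode = joint_log_likelihood(same_overlap, 1.0, components)
far_from_modes = joint_log_likelihood(same_overlap, 0.0, components)
print(near_mode, far_from_modes)
```

Particles that the silhouette overlap cannot tell apart are thus separated by their distance to the predicted modes, while both modes of the mixture remain supported.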

6. 3D Human Shape Modeling and Estimation

Figure 5. Tracking results using Joint Particle Filtering on sequences significantly different from the training data. (First row) HumanEva I sequence for the test subject S4 jogging in arbitrary orientation. JPF can successfully track the sequence because the learned dynamics do not include root orientation; (Second row) and (Third row) Tracking results using Joint Particle Filtering on the Boxing and Walking sequences for the subjects S3 and S1 respectively; (Fourth row) Tracking results on the HumanEva II data set.

Principal Component Analysis (PCA) is used to calculate a global shape subspace that models the variation of 3D human shapes across 1500 subjects. To learn the shape space, we register a template reference mesh with 1200 vertices to the CAESAR [2] laser scan data, parameterizing human body shapes as 3600-dimensional vectors. This reference model is a hole-filled mesh with standard anthropometry, standing in a pose similar to that of the subjects in the CAESAR dataset. The CAESAR dataset has 73 landmark points at various positions, which can be used to guide the 3D shape registration. The registration process consists of the following steps: (1) Using the MAYA graphics software, we generate a reference mesh model that has a pose similar to that of the models in the CAESAR dataset; (2) We annotate landmark points on this reference model (as illustrated in Fig. 2(f)); (3) We then deform the reference model to fit the CAESAR model. The vertices of the template and the CAESAR model are brought into one-to-one correspondence using an unsupervised non-rigid point set registration algorithm. The goal of 3D point set registration is to establish correspondence and recover the transformation between the vertices of the template mesh and the CAESAR model. Our 3D registration process is based on iterative gradient-based optimization of an energy function with three cost terms: (i) the cost of fitting the non-landmark vertices to the nearest surface point of the laser scan ($E_V$); (ii) the cost of fitting the landmark points ($E_L$); and (iii) an internal regularization term that preserves the shape ($E_R$). The combined cost function is given by:

$$E = \alpha E_V + \beta E_L + \gamma E_R \tag{9}$$

The weights $\alpha$, $\beta$, $\gamma$ are adjusted to balance the smoothness of the registered shape against the accuracy of alignment. The optimization process is illustrated in Fig. 2(f). Using this method, we have registered about 1500 CAESAR North American Standing scans. With this registration, the CAESAR scan data are transformed to a common parametrization scheme.
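As an illustration only, the combined registration cost of Eq. (9) can be sketched as follows. The function name `registration_energy`, the nearest scan point (rather than nearest surface point) data term, and the edge-length regularizer are assumptions made for this sketch; the exact forms of the three terms are not specified in the text above.

```python
import numpy as np

def registration_energy(template, scan_points, landmark_idx, scan_landmarks,
                        edges, rest_lengths, alpha=1.0, beta=1.0, gamma=1.0):
    """Toy version of E = alpha*E_V + beta*E_L + gamma*E_R (hypothetical forms)."""
    is_landmark = np.zeros(len(template), dtype=bool)
    is_landmark[landmark_idx] = True

    # E_V: squared distance from each non-landmark vertex to its nearest scan
    # point (a simplification of "nearest surface point").
    free = template[~is_landmark]
    d2 = ((free[:, None, :] - scan_points[None, :, :]) ** 2).sum(axis=2)
    e_v = d2.min(axis=1).sum()

    # E_L: squared distance between corresponding landmark points.
    e_l = ((template[landmark_idx] - scan_landmarks) ** 2).sum()

    # E_R: assumed regularizer that penalizes changes in mesh edge lengths.
    lengths = np.linalg.norm(template[edges[:, 0]] - template[edges[:, 1]], axis=1)
    e_r = ((lengths - rest_lengths) ** 2).sum()

    return alpha * e_v + beta * e_l + gamma * e_r
```

In such a sketch, increasing `gamma` favors a smoother, more template-like result, while increasing `alpha` and `beta` favors a tighter fit to the scan, which is the trade-off the weights are tuned for.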

For a test image sequence, we estimate the 3D shape of the human target by searching the low-dimensional shape space learned using PCA. The 3D shape fitting algorithm searches the learned subspace of human 3D shapes for the PCA coefficients with the highest likelihood (the same likelihood as used for pose optimization). Sampling the shape space, however, only models anthropometric variability and generates shapes of humans standing in a canonical pose. Each sampled shape hypothesis is therefore non-rigidly deformed under the current pose. To make the deformation smooth, each vertex in the 3D mesh is associated with multiple joints (at most 6). To optimize the 3D shape, we use Annealed Particle Filtering to obtain the shape parameters that best align with the observation when projected onto the 2D image plane. Fig. 6 shows the entire 3D shape estimation framework. An anthropometric skeleton is critical to the accuracy of the 3D shape fitting algorithm, as it determines the alignment and realistic deformation of the 3D shape under the influence of the skeletal pose. We estimate the skeleton for a 3D shape by estimating skeletal link lengths from the vertices and fitting the skeleton to the original joint locations using a Levenberg-Marquardt (LM) optimizer. This optimization re-estimates the joint angles specific to the new skeleton and shape.

Figure 6. Overview of shape estimation algorithm
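The PCA shape space that the fitting algorithm searches can be sketched as below. This is a minimal illustration, assuming registered meshes are stacked as rows of a matrix; the helper names `learn_shape_space` and `synthesize_shape` are hypothetical and not taken from the system described here.

```python
import numpy as np

def learn_shape_space(shapes, n_components):
    """PCA on registered meshes; `shapes` has one flattened (3 * n_vertices) mesh per row."""
    mean = shapes.mean(axis=0)
    # Rows of vt are orthonormal principal directions of the centered data.
    _, s, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    std = s[:n_components] / np.sqrt(len(shapes) - 1)  # per-component standard deviation
    return mean, vt[:n_components], std

def synthesize_shape(mean, basis, coeffs):
    """Map PCA coefficients back to an (n_vertices, 3) canonical-pose mesh."""
    return (mean + coeffs @ basis).reshape(-1, 3)
```

A shape hypothesis in a particle filter would then be a coefficient vector, for example drawn as `coeffs = std * rng.standard_normal(len(std))`, which is turned into a mesh with `synthesize_shape` before being deformed by the current pose.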

Extracting BioSignatures: Biosignatures extracted by our system include height, weight, gender and anthropometric measurements of the 3D human shape. Standard anthropometric distance measurements are made using geodesic distances. However, geodesic measurements are often difficult to simulate. For instance, the CAESAR neck base circumference is determined by resting an adjustable chain necklace on the subject, then measuring the length of the chain. Such a procedure would require a full physical simulation to reproduce in software. Hence, we restrict ourselves to a more tractable class of geodesic measurements: horizontal body part circumferences, in particular the chest, waist, hips, right thigh and right calf, as shown in Fig. 7 (right). On a 3D scan, these measurements can be approximated by finding the curve of intersection between the mesh and the plane of measurement (a plane parallel to the ground plane), taking its 2D convex hull to better simulate a taut tape or band, and measuring the length of the closed curve. The first step, finding the intersection of the plane and the mesh, requires some filtering to ensure that only the correct body part is measured. For the groundtruth laser scan data, we manually assign vertices to body parts. In addition, we manually assign part labels to each vertex of an average shape (a one-time labeling). Notice that in Fig. 6 different parts of the human body are color coded. This allows the intersection to be restricted to the vertices of a specific body part. Additionally, limit marker vertices were chosen for each measurement, denoting the vertical extents of the region to be measured. This step ensured that the chest would be measured below the armpit, the thigh below the crotch, etc. The results of these filtering steps for the chest are shown on the right in Fig. 7. For cases where the circumference was to be maximized or minimized, the smoothness of the function was exploited by using Levenberg-Marquardt optimization to quickly find the height at which the circumference attains its extremum. Finally, taking the convex hull of the initial intersection curve proved very important for accuracy; the hips in particular often have deep concavities in the regions of the buttocks and crotch, as shown in the cross-sectional view in Fig. 7.

Figure 7. Data used for evaluating the 3D shape estimation framework. Part circumferences computed from the estimated 3D shapes are compared against groundtruth measurements estimated from the laser scan shape of the same subject.

Height, Weight and Gender Estimation: The height of a human body can be computed directly from the estimated 3D shape using specific vertices at the top (head) and bottom (feet) of the 3D mesh. However, in most poses the human shape appears bent or is not perfectly aligned in a standing pose, so the distance between the two vertices usually gives a poor estimate of the height. Similarly, to compute the weight of a human body, we can compute the volume of each body part and use the average mass density of the different parts to estimate the weight. However, different subjects may have different part densities, which may give an inaccurate estimate of the subject's weight. We therefore employ learning based methods to estimate these attributes from the overall 3D shape of the subject. The overall anthropometric shape of a human subject is strongly correlated with the subject's height, weight and gender. We use non-linear regression (Relevance Vector Machine) functions to classify the gender and predict height and age from the shape coefficients.
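The taut-tape circumference measurement described under Extracting BioSignatures can be sketched as follows. The sketch assumes the plane-mesh intersection has already been computed and filtered to one body part, so that `section_points` holds the 2D coordinates of the cross-section within the measurement plane; the function names are hypothetical.

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return np.array(pts, dtype=float)

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return np.array(lower[:-1] + upper[:-1], dtype=float)

def taut_circumference(section_points):
    """Perimeter of the convex hull of a planar cross-section (taut tape length)."""
    hull = convex_hull(section_points)
    return float(np.linalg.norm(hull - np.roll(hull, -1, axis=0), axis=1).sum())
```

Because concave points (such as those in the buttock and crotch regions of a hip cross-section) are discarded by the hull, the measured length behaves like a tape pulled tight around the body part rather than one following the surface.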

7. Experiments

We evaluate our tracking algorithms on the HumanEva data set and report pose estimation accuracy in terms of joint angles and joint center locations. The data set contains three subjects (S1, S2 and S3) performing three different activities (Walking, Jogging and Boxing). We used only the C2 sensor for training and, with one exception, for testing: one of the testing sequences also includes data captured from the C3 sensor (a viewpoint not used in our training data). For error reporting and testing, we partitioned the data set into training and testing sets. From each activity sequence, the first 200 frames were used for testing and the rest were used for training the bottom-up models as well as the optimal proposal density $p(x_n \mid x_{n-1}, r_n)$. Both distributions were trained using the optimal set of parameters selected by cross-validation, with the validation set containing a randomly selected 10% of the training data.

Feature Extraction: We use only shape descriptors extracted from silhouettes to train the predictive models. Our initial experiments using silhouette shapes together with internal edges, both in the image descriptors and for likelihood computation, gave worse results than using silhouettes alone. In all the experiments, the results were generated using the shape context histogram (SCH) as the input image descriptor for learning the predictive models. SCH is computed from the outer contour by uniformly sampling 100 pixels from it and voting into 5 radial and 12 angular bins, with the relative distance of pixels ranging from 1/8 to 3 on a log scale. The shape context is a robust shape descriptor that encodes the relative locations of the sampled pixels w.r.t. a reference pixel. The features are vector quantized by clustering them into K = 400 prototype cluster centers and modeling the distribution of these features using normalized inverse distances from these learned prototypes.
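A shape context histogram with the bin layout given above (5 radial by 12 angular bins, relative distances from 1/8 to 3 on a log scale) might be computed along the following lines. Normalizing distances by the mean pairwise distance is an assumption of this sketch, as is the function name `shape_context`; the vector quantization step is not shown.

```python
import numpy as np

def shape_context(points, n_radial=5, n_angular=12, r_min=1 / 8, r_max=3.0):
    """Return one (n_radial * n_angular) log-polar histogram per contour point."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]           # diff[i, j] = p_j - p_i
    dist = np.linalg.norm(diff, axis=2)
    rel = dist / (dist.sum() / (n * (n - 1)))           # assumed scale normalization
    theta = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    # Log-spaced radial bin edges between r_min and r_max.
    r_edges = np.logspace(np.log10(r_min), np.log10(r_max), n_radial + 1)
    r_bin = np.searchsorted(r_edges, rel, side="right") - 1
    a_bin = np.minimum((theta / (2 * np.pi) * n_angular).astype(int), n_angular - 1)

    # Drop self-pairs and pairs whose relative distance falls outside the radial range.
    valid = (r_bin >= 0) & (r_bin < n_radial) & ~np.eye(n, dtype=bool)
    hist = np.zeros((n, n_radial * n_angular))
    rows = np.nonzero(valid)[0]
    cols = (r_bin * n_angular + a_bin)[valid]
    np.add.at(hist, (rows, cols), 1)                    # accumulate votes per reference point
    return hist
```

Each row of the result describes where the remaining contour samples lie relative to one reference pixel, which is what makes the descriptor insensitive to translation and, after normalization, to scale.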

HumanEva 3D Pose Representation: The HumanEva data set represents an articulated 3D human body pose as a set of 20 joint locations. Joint location data cannot be readily used to animate a deformable mesh. We therefore pre-process the data by fitting a skeleton with 30 joints (≈ 55 degrees of freedom) to extract Euler angles for each joint. The skeleton for each data set was estimated from the average link lengths over the first 100 frames of the motion capture data. We used these joint angles as groundtruth for validating our framework. The average loss of joint location accuracy due to skeleton fitting ranged from 5 to 7 mm. The skeleton is fitted using LM-based damped least squares optimization. In doing so, we impose angular limit constraints to accurately estimate the feet and wrist joints (not present in the HumanEva data set). These constraints are useful for overcoming ambiguity in the twist angles of some of the joints. The global orientation of the human body is represented in cyclic coordinates using a cos/sin transform. 3D pose data in the original joint angle space has high dimensionality (≈ 90) and is reduced to 5 dimensions using SLVM [7]. A separate SLVM is trained for each activity and provides a bi-directional mapping between the ambient and the latent space. The overall parameter space of the 3D pose is 11-dimensional (6 for rigid body motion and 5 for the 3D pose).
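The cos/sin representation of global orientation mentioned above avoids the discontinuity at the 0°/360° wrap-around. A minimal sketch, with hypothetical function names:

```python
import numpy as np

def encode_orientation(theta):
    """Map an angle in radians to a point (cos, sin) on the unit circle."""
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

def decode_orientation(cs):
    """Recover the angle in [0, 2*pi) from a (cos, sin) pair, not necessarily unit length."""
    cs = np.asarray(cs)
    return np.arctan2(cs[..., 1], cs[..., 0]) % (2 * np.pi)
```

For example, averaging the encodings of 350° and 10° and decoding gives an orientation at the wrap-around point (0°), whereas averaging the raw angles would give 180°. This is why regression targets are better expressed in the encoded space.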

Predictive Models: Both predictive distributions, $p_B(x \mid r)$ and $p(x_n \mid x_{n-1}, r_n)$, are learned using Bayesian Mixture of Experts. $p_B(x \mid r)$ is modeled using a two-level hierarchical Bayesian Mixture of Experts (hBME) model. At the first level, we cluster the data points based on the global pose orientation of the human target and train a classifier to recognize the orientation of the human body with respect to the camera image plane. We quantize the 360° human orientation span into 8 views and train a classifier to recognize the view based on the shape descriptor. At the second level, we train 2 view-dependent expert predictors to output the 3D pose. $p(x_n \mid x_{n-1}, r_n)$ is, however, trained using a simpler Bayesian Mixture of Experts model with 5 experts. As the predictors output points in the decorrelated latent state space, we train an independent hBME for each latent space dimension. For learning both the classification and the regression models, we used the Relevance Vector Machine [16]. For each viewpoint, we also learn 3 view-dependent regression functions to estimate the exact orientation of the human.

Figure 8. Multimodal distribution propagation: ratio of the degree of multimodality propagated by each of the particle filtering algorithms (best viewed in color)
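To make the structure of a mixture-of-experts predictor concrete, the sketch below evaluates a conditional Gaussian mixture over one latent dimension and draws samples from it. It is a simplified stand-in: the gates here are a plain softmax over linear scores and the experts are linear regressors, whereas the models above use Relevance Vector Machines. All names and parameters are assumptions of this sketch.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_mixture(r, gate_w, expert_w):
    """Gate probabilities and expert means of p(x | r) for a descriptor r."""
    gates = softmax(gate_w @ r)   # (K,) mixing weights, sum to 1
    means = expert_w @ r          # (K,) one predicted value per expert
    return gates, means

def sample_proposals(r, gate_w, expert_w, expert_std, n, rng):
    """Draw n samples from the K-component Gaussian mixture predicted for r."""
    gates, means = moe_mixture(r, gate_w, expert_w)
    k = rng.choice(len(gates), size=n, p=gates)   # pick an expert per sample
    return means[k] + expert_std[k] * rng.standard_normal(n)
```

Sampling from the mixture rather than taking its mean is what lets a bottom-up proposal keep several competing pose hypotheses alive when the image evidence is ambiguous.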

Likelihood Computation and Hardware Optimization: Likelihood computation is the costliest operation in generative tracking. For the likelihood function during tracking, we use an average 3D human shape and deform it using an average-sized skeleton. Using the average shape regularizes the 3D pose optimizer at each time step and avoids local optima caused by a specialized 3D shape. A one-time manual skeletal alignment and weight-painting process is required for generating arbitrary human shapes and poses. We employed 200 particles in the PF tracker, with 10 simulated annealing iterations at each time step. As particle filtering involves independent computation of the likelihood function for each of the N particles, the processing can be parallelized. We use an efficient Graphics Processing Unit (GPU) based implementation for computing the likelihood function, which is the bottleneck operation of the entire optimization process.

Sustaining Multimodality in PF: To characterize and compare the degree of multimodality propagated by different particle filtering algorithms, we use the weighting scheme in eqn. 1 to weigh a correspondence between an input and an output cluster. For the input cluster corresponding to the input data, we compute which of the associated output clusters have been observed (or covered) by the current set of particles. For each association between the input and output cluster, we add the corresponding weight and compute the ratio with the maximum weight. Fig. 8 shows the degree of multimodality preserved by different PF algorithms for the jogging sequence. Note that the baseline APF preserves the least multimodality compared to the other three PF enhancements.
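The annealing loop run at each time step (200 particles, 10 annealing iterations in the configuration above) can be sketched generically as follows. The linear annealing schedule, the noise parameters and the function name are assumptions of this sketch; the original annealed particle filter [3] chooses its schedule adaptively, and the likelihood used in this work is the GPU-evaluated silhouette likelihood rather than the callable used here.

```python
import numpy as np

def annealed_particle_step(particles, log_likelihood, rng,
                           n_layers=10, noise0=0.5, shrink=0.8):
    """One time step of annealed particle filtering on an (n, d) particle set."""
    n = len(particles)
    betas = np.linspace(1.0 / n_layers, 1.0, n_layers)  # assumed linear schedule
    for m, beta in enumerate(betas):
        # Weights from the likelihood raised to the power beta (smoothed early on).
        logw = beta * log_likelihood(particles)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        if m == n_layers - 1:
            break
        # Resample, then diffuse with noise that shrinks from layer to layer.
        idx = rng.choice(n, size=n, p=w)
        noise = noise0 * shrink ** m * rng.standard_normal(particles.shape)
        particles = particles[idx] + noise
    return particles, w
```

Since `log_likelihood` is evaluated independently for every particle in each layer, this call is the natural unit to batch or offload, which is the parallelization opportunity noted above.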

Online Active Learning of Predictive Models: For adapting the hBME model to the test domain, we apply Bayesian relevance based updating to the two levels of gate distributions and the expert regression functions. From a set of particle hypotheses, we select the N = 1-5% with the highest likelihood weights to update the basis sets of the gates $g_v$, $g_e$ and the experts. For N > 5%, approximate poses caused degradation in the accuracy of the predictors. Gate clusters for $g_v$ and $g_e$ are identified based on viewpoint and proximity to the expert predictors. After the parameters and bases are updated, EM iterations are run to refine the gate and expert cluster distributions. The entire update mechanism is fast and is performed online. Fig. 9 shows the average joint angle prediction accuracy in degrees for the hBME model for a jogging sequence of subject S3. We used JPF for tracking and updated the hBME at every frame using the 1, 4 and 10 particles with the highest weights. Notice also that the average error of each active learning scheme decreases as more particles are used to update the hBME model.

Figure 9. Online active learning: pose prediction accuracy of online active learning of the predictive models (best viewed in color)

Pose Estimation Results: Table 1 shows quantitative evaluation results of our framework on the HumanEva data set for the sensors C2 and C3. Since we trained only on data corresponding to sensor C2, for evaluating pose estimation results in sensor C3 we estimated the 3D pose using the calibration parameters of sensor C2. To compute errors, we transform the estimated root joint orientation by first removing the camera rotation due to C2 and then applying the rotation due to C3. That is, for the camera projection matrices $P_{C2} = [R_{C2} \; T_{C2}]$ and $P_{C3} = [R_{C3} \; T_{C3}]$, the pose is transformed using the rotation matrix $R_{C3} R_{C2}^T$ before computing the error scores. The results in Table 1 demonstrate the improved accuracies both in terms of joint locations and joint angles. Some significant discrepancies between the error rates of joint locations and joint angles arise because the joint angle error does not take into account the error in the orientation of the pose but only the body joint angles. Even a small error in the root joint can cause a sizable difference in the joint locations but none in the joint angles. Based on the results, JPF clearly outperforms both the baseline algorithm based on APF and in most cases PF based on JLM and

Optimal Proposal Density. Higher accuracy of JPF is due to the very detailed bottom-up predictive models (in total 5 latent dimensions × 8 views × 2 experts regression functions in the hBME and 8 views × 3 experts regression functions for orientation). The bottom-up proposal density with 16 Gaussian components can efficiently represent any depth and view based ambiguity, providing a diverse set of samples that have higher weights and are closer to the true posterior. APF based on the learned optimal proposal density performs well on certain sequences; for other sequences, however, it may output states far from the training data, causing it to recursively output mean predictions (as the combined feature and state prediction from the previous step differs significantly from the training exemplars). Such errors are usually difficult to recover from. JLM based APF in most cases outperforms the baseline APF and the Optimal Proposal Density based APF. Fig. 4 compares the pose estimation results from the four trackers. Notice that JPF is able to overcome the errors due to view-based and left-right leg forward ambiguities. The generic bottom-up model can be applied to estimate pose in any orientation.

| Sequence | Metric | APF | OPF | JPF | JLM |
|---|---|---|---|---|---|
| Jog (S1 in C2) | Joint Loc. | 38.48 | 78.39 | 34.55* | 41.24 |
| Jog (S1 in C2) | Joint Angle | 9.32 | 12.54 | 8.19* | 12.11 |
| Jog (S2 in C2) | Joint Loc. | 43.04 | 35.05 | 31.01* | 58.58 |
| Jog (S2 in C2) | Joint Angle | 11.07 | 9.50 | 7.25* | 9.06 |
| Jog (S3 in C2) | Joint Loc. | 78.18 | 75.41 | 38.74* | 62.57 |
| Jog (S3 in C2) | Joint Angle | 11.03 | 11.26 | 9.06* | 11.55 |
| Box (S2 in C2) | Joint Loc. | 67.4 | 34.73 | 43.58 | 27.65* |
| Box (S2 in C2) | Joint Angle | 18.18 | 12.55 | 14.08 | 8.39* |
| Box (S1 in C2) | Joint Loc. | 43.56 | 33.27 | 25.19 | 23.12* |
| Box (S1 in C2) | Joint Angle | 13.22 | 11.70 | 10.05 | 7.05* |
| Box (S3 in C2) | Joint Loc. | 49.61 | 68.6 | 37.37* | 55.15 |
| Box (S3 in C2) | Joint Angle | 15.41 | 23.75 | 12.11* | 13.02 |
| Walk (S1 in C2) | Joint Loc. | 26.43 | 30.42 | 25.01* | 26.64 |
| Walk (S1 in C2) | Joint Angle | 7.61 | 7.23 | 5.04 | 4.10* |
| Walk (S2 in C2) | Joint Loc. | 60.40 | 37.04 | 34.61* | 35.06 |
| Walk (S2 in C2) | Joint Angle | 9.71 | 9.06 | 6.01* | 6.74 |
| Walk (S3 in C2) | Joint Loc. | 54.09 | 63.09 | 27.61* | 64.25 |
| Walk (S3 in C2) | Joint Angle | 9.52 | 9.32 | 4.60* | 7.49 |
| Jog (S2 in C3) | Joint Loc. | 51.43 | 40.12 | 38.91* | 54.33 |
| Jog (S2 in C3) | Joint Angle | 11.49 | 12.57 | 10.5* | 13.28 |

Table 1. 3D pose estimation accuracies in average joint location error and joint angle error, for various PF algorithms. Values marked with * denote the best (lowest error) of the 4 algorithms: APF - Annealed Particle Filtering, OPF - learned Optimal Proposal Density based PF, JPF - Joint Particle Filtering and JLM - Joint Likelihood Modeling. JPF clearly outperforms the baseline APF and the other two improvements proposed in the work.

Shape Estimation Results: In order to evaluate the accuracy of our 3D shape estimation framework, we use laser scan data of a subject as the groundtruth shape and apply our shape estimation algorithm to reconstruct the subject's 3D shape for a walking motion image sequence (see Fig. 7). We use the 3D body part measurements of the laser scan data to compute the error in the 3D body part circumference estimation. Fig. 10 shows the circumferences estimated from the fitted 3D body shape for some frames of the image sequence, together with the corresponding groundtruth measurements. Table 2 compares the groundtruth body part girths with the circumferences computed from the estimated 3D body shape, averaged over 20 frames.

Figure 10. Shape estimation results for a subject performing walking motion. We compare the error in the estimated circumference of various body parts (Chest, Waist, Hips, Thigh and Calf) with respect to groundtruth circumferences computed from the laser scan data of the subject.

| Measurement | Chest (cm) | Waist (cm) | Hips (cm) | Thigh (cm) | Calf (cm) |
|---|---|---|---|---|---|
| Groundtruth | 106.4 | 83.46 | 103.18 | 59.6 | 38.65 |
| Estimated | 108.85 | 89.59 | 112.44 | 67.08 | 45.76 |
| Error (%) | 2.3% | 7.35% | 8.98% | 12.55% | 18.41% |

Table 2. Part circumference estimation accuracy
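The percentage errors reported for the part circumferences are consistent with a relative error with respect to the groundtruth girth. The short sketch below recomputes them from the tabulated values under that assumption; the function name is hypothetical.

```python
def percent_error(estimated, groundtruth):
    """Relative error of an estimate with respect to the groundtruth, in percent."""
    return abs(estimated - groundtruth) / groundtruth * 100.0

# Circumferences in cm, as reported for the walking sequence.
groundtruth = {"Chest": 106.4, "Waist": 83.46, "Hips": 103.18, "Thigh": 59.6, "Calf": 38.65}
estimated = {"Chest": 108.85, "Waist": 89.59, "Hips": 112.44, "Thigh": 67.08, "Calf": 45.76}

errors = {part: percent_error(estimated[part], groundtruth[part]) for part in groundtruth}
for part, err in errors.items():
    print(f"{part}: {err:.2f}%")
```

Recomputing from the rounded table entries reproduces the reported percentages to within about 0.01 to 0.02 percentage points; the calf value comes out near 18.40% rather than 18.41%, presumably because the published errors were computed from unrounded measurements.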

Attributes Estimation: Attribute estimation accuracy was evaluated on 4 targets. 3D shapes fitted to 250 frames of the video sequence were used to infer attributes using the learned regression functions. Fig. 11 shows the results for the first two subjects, where subject 1 is male and subject 2 is female. Fig. 12 shows the same for subjects 3 and 4, where subject 3 is male and subject 4 is female. For gender prediction, the classifier outputs the score of being male, which gave the best accuracy when the threshold was set to 0.3. Table 3 shows the average prediction error for height and weight, and the prediction accuracy for gender, over the 250 frames. We also extracted from each subject's sequence the 20 frames that had the best shape fitting likelihood. The average prediction accuracy improved when only the best fitted shapes were used for attribute estimation, in all cases except the weight error of subject 3, as shown by the results in parentheses in Table 3.
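The two post-processing steps described above, restricting attribute estimates to the best-fitted frames and thresholding the gender score, can be sketched as follows. The function names and the per-frame array inputs are assumptions of this sketch.

```python
import numpy as np

def best_frames_average(predictions, likelihoods, k=20):
    """Average a per-frame prediction over the k frames with the highest fitting likelihood."""
    order = np.argsort(likelihoods)[::-1][:k]
    return float(np.mean(np.asarray(predictions)[order]))

def gender_accuracy(male_scores, is_male, threshold=0.3):
    """Fraction of frames classified correctly when 'male' means score > threshold."""
    predicted_male = np.asarray(male_scores) > threshold
    return float(np.mean(predicted_male == is_male))
```

For instance, `best_frames_average(height_per_frame, likelihood_per_frame, k=20)` would give the height estimate from the 20 best-fitted frames, where both argument names stand for hypothetical per-frame arrays.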

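| Subject | Height error (cm) | Weight error (kg) | Gender accuracy (%) |
|---|---|---|---|
| Subject 1 | 3.0 (2.1) | 6.55 (4.32) | 67.5 (72.1) |
| Subject 2 | 5.33 (3.74) | 10.3 (9.9) | 68.75 (77.5) |
| Subject 3 | 5.10 (3.22) | 2.46 (2.9) | 65.0 (70.5) |
| Subject 4 | 3.70 (2.23) | 19.17 (15.3) | 60.3 (67.9) |

Table 3. Attributes prediction accuracy. Values in parentheses use only the 20 frames with the best shape fitting likelihood.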

Figure 11. Attributes estimation results for subjects 1 and 2

Figure 12. Attributes estimation results for subjects 3 and 4

8. Conclusion

We have developed a fully automated system for 3D pose and shape estimation and analysis from monocular image sequences. We develop three principled approaches to enhance particle filtering by integrating bottom-up information either as a proposal density for obtaining more diverse particles or as complementary cues to improve likelihood computation during the correction step. Through extensive experimental evaluation we have demonstrated that our algorithms enhance the ability of particle filtering to propagate multimodality for effective reconstruction of 3D poses from 2D images. In this work, we also demonstrated that a feedback mechanism from top-down modeling can further adapt and improve the bottom-up predictors to enhance the overall tracking performance.

Acknowledgements: This work was supported by Air Force Research Lab, contract number FA8650-10-C-6125.

References

[1] S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 2002.

[2] CAESAR: Civilian American and European Surface Anthropometry Resource project. http://store.sae.org/caesar/, volume 1, 2002.

[3] J. Deutscher, A. Blake, and I. D. Reid. Articulated body motion capture by annealed particle filtering. In CVPR, 2000.

[4] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 2000.

[5] A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: fifteen years later, 2011.

[6] M. Isard and A. Blake. ICondensation: Unifying low-level and high-level tracking in a stochastic framework. In ECCV, 1998.

[7] A. Kanaujia, C. Sminchisescu, and D. Metaxas. Spectral latent variable models for perceptual inference. In ICCV, 2007.

[8] M. W. Lee and I. Cohen. Proposal maps driven MCMC for estimating human body pose in static images. In CVPR (2), pages 334–341, 2004.

[9] M. W. Lee and R. Nevatia. Human pose tracking in monocular sequence using multilevel structured models. Trans. Pattern Anal. Mach. Intell., 2009.

[10] R. Rosales and S. Sclaroff. Learning body pose via specialized maps. In NIPS, pages 1263–1270, 2001.

[11] M. Salzmann and R. Urtasun. Combining discriminative and generative methods for 3D deformable surface and articulated pose reconstruction. In CVPR, 2010.

[12] L. Sigal, A. O. Balan, and M. J. Black. Combined discriminative and generative articulated pose and non-rigid shape estimation. In NIPS, 2007.

[13] L. Sigal, A. O. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. Technical Report, 2010.

[14] C. Sminchisescu, A. Kanaujia, Z. Li, and D. N. Metaxas. Discriminative density propagation for 3D human motion estimation. In CVPR (1), 2005.

[15] C. Sminchisescu, A. Kanaujia, and D. N. Metaxas. Learning joint top-down and bottom-up processes for 3D visual inference. In CVPR (2), 2006.

[16] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 2001.

[17] J. Vermaak, A. Doucet, and P. Perez. Maintaining multimodality through mixture tracking. In ICCV, 2003.
