
International Journal of Computer Vision 53(3), 199–223, 2003. © 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

Human Body Model Acquisition and Tracking Using Voxel Data

IVANA MIKIĆ
Q3DM, Inc., 10110 Sorrento Valley Rd., Suite B, San Diego, CA 92121, USA

[email protected]

MOHAN TRIVEDI
Department of Electrical and Computer Engineering, University of California, San Diego,

9500 Gilman Drive 0434, La Jolla, CA 92093-0434, USA
[email protected]

EDWARD HUNTER
Q3DM, Inc., 10110 Sorrento Valley Rd., Suite B, San Diego, CA 92121, USA

[email protected]

PAMELA COSMAN
Department of Electrical and Computer Engineering, University of California, San Diego,

9500 Gilman Drive 0407, La Jolla, CA 92093-0407, USA
[email protected]

Received February 14, 2002; Revised November 5, 2002; Accepted January 15, 2003

Abstract. We present an integrated system for automatic acquisition of the human body model and motion tracking using input from multiple synchronized video streams. The video frames are segmented and the 3D voxel reconstructions of the human body shape in each frame are computed from the foreground silhouettes. These reconstructions are then used as input to the model acquisition and tracking algorithms.

The human body model consists of ellipsoids and cylinders and is described using the twists framework, resulting in a non-redundant set of model parameters. Model acquisition starts with a simple body part localization procedure based on template fitting and growing, which uses prior knowledge of average body part shapes and dimensions. The initial model is then refined using a Bayesian network that imposes human body proportions onto the body part size estimates. The tracker is an extended Kalman filter that estimates model parameters based on the measurements made on the labeled voxel data. A voxel labeling procedure that handles large frame-to-frame displacements was designed, resulting in very robust tracking performance.

Extensive evaluation shows that the system performs very reliably on sequences that include different types of motion such as walking, sitting, dancing, running and jumping, and people of very different body sizes, from a nine-year-old girl to a tall adult male.

Keywords: human body model acquisition, motion capture, pose estimation



1. Introduction

Tracking of the human body, also called motion capture or posture estimation, is the problem of estimating the parameters of a human body model (such as joint angles) from video data as the position and configuration of the tracked body change over time.

A reliable motion capture system would be valuable in many applications (Gavrila, 1999; Moeslund and Granum, 2001). One class of applications are those where the extracted body model parameters are used directly, for example to interact with a virtual world, drive an animated avatar in a video game, or for computer graphics character animation. Another class of applications uses the extracted parameters to classify and recognize people, gestures or motions, as in surveillance systems, intelligent environments, or advanced user interfaces (sign language translation, gesture-driven control, gait or pose recognition). Finally, the motion parameters can be used for motion analysis in applications such as personalized sports training, choreography, or clinical studies of orthopedic patients.

Human body tracking algorithms usually assume that the body model of the tracked person is known and placed close to the true position at the beginning of the tracking process. These algorithms then estimate the model parameters over time to reflect the motion of the person. For a fully automated motion capture system, the model acquisition problem needs to be solved in addition to tracking. The goal of model acquisition is to estimate the parameters of the human body model that correspond to the specific shape and size of the tracked person, and to place and configure the model to accurately reflect the position and configuration of the body at the beginning of the motion capture process.

In this paper we present a fully automated system for motion capture that includes both model acquisition and motion tracking. While most researchers have taken the approach of working directly with the image data, we use 3D voxel reconstructions of the human body shape at each frame as input to model acquisition and tracking (Mikic et al., 2001). This approach leads to simple and robust algorithms that take advantage of the unique qualities of voxel data. The price is an additional preprocessing step where the 3D voxel reconstructions are computed from the image data, a process that can be performed in real time using dedicated hardware.

Section 2 describes the related work and Section 3 gives an overview of the system. The algorithm for computing 3D voxel reconstructions is presented in Section 4, the human body model is described in Section 5 and the tracking in Section 6. Section 7 describes the model acquisition and Section 8 presents the results of the system evaluation. The conclusion follows in Section 9. Additional details can be found in Mikic (2002).

2. Related Work

Currently available commercial systems for motion capture require the subject to wear special markers, body suits or gloves. In the past few years, the problem of markerless, unconstrained posture estimation using only cameras has received much attention from computer vision researchers.

Many existing posture estimation systems require manual initialization of the model and then perform tracking. Very few approaches exist where the model is acquired automatically, and in those cases the person is usually required to perform a set of calibration movements that identify the body parts to the system (Kakadiaris and Metaxas, 1998; Cheung et al., 2000). Once the model is available in the first frame, a very common approach to tracking is to iterate four steps until good agreement between the model and the data is achieved: prediction of the model position in the next frame, projection of the model to the image plane(s), comparison of the projection with the data in the new frame, and adjustment of the model position based on this comparison.

Algorithms have been developed that take input from one camera (Rehg and Kanade, 1995; Hunter, 1999; Bregler, 1997; DiFranco et al., 2001; Howe et al., 1999; Wachter and Nagel, 1999; Sminchisescu and Triggs, 2001; Ioffe and Forsyth, 2001) or multiple cameras (Kakadiaris and Metaxas, 1996; Bregler and Malik, 1998; Gavrila and Davis, 1996; Delamarre and Faugeras, 2001; Yamamoto et al., 1998; Hilton, 1999; Jung and Wohn, 1997). Rehg and Kanade (1995) developed a system for tracking a 3D articulated model of a hand based on a layered template representation of self-occlusions. Hunter (1999) and Hunter et al. (1997) developed an algorithm based on the Expectation-Maximization (EM) procedure that assigns foreground pixels to body parts and then updates body part positions to explain the data. An extra processing step, based on virtual work static equilibrium conditions, integrates object kinematic structure into the EM procedure, guaranteeing that the resulting posture is kinematically valid. An algorithm using products of exponential maps to relate the parameters of the human body model to optical flow measurements was described by Bregler and Malik (1998). Wren (2000) designed the DYNA system, driven by 2D blob features from multiple cameras that are probabilistically integrated into a 3D human body model. The system also includes feedback from the 3D body model to the 2D feature tracking by setting the appropriate prior probabilities using the extended Kalman filter. This framework accounts for 'behaviors': the aspects of motion that cannot be explained by passive physics but represent purposeful human motion. Gavrila and Davis (1996) used a human body model composed of tapered super-quadrics to track (multiple) people in 3D. They use a constant-acceleration kinematic model to predict the positions of body parts in the next frame. These locations are then adjusted using the undirected normalized chamfer distance between image contours and contours of the projected model (in multiple images). The search is decomposed into stages: positions of the head and the torso are adjusted first, then arms and legs. Kakadiaris and Metaxas have developed a system for 3D human body model acquisition (1998) and tracking (1996) using three cameras placed in a mutually orthogonal configuration. The person under observation is requested to perform a set of movements according to a protocol that incrementally reveals the structure of the human body. Once the model has been acquired, the tracking is performed using a physics-based framework (Metaxas and Terzopoulos, 1993). Based on the expected body position, the difference between the predicted and actual images is used to calculate forces that are applied to the model. The dynamics are modeled using the extended Kalman filter. The tracking result, a new body pose, is the result of the applied forces acting on the physics-based model. The problem of occlusions is solved by choosing, for every body part at every frame, from the available cameras those that provide visibility of the part and observability of its motion. Delamarre and Faugeras (2001) describe an algorithm that computes human body contours based on optical flow and intensity. Forces are then applied that attempt to align the outline of the model with the contours extracted from the data. This procedure is repeated until good agreement is achieved. Deutscher et al. (2000, 2001) developed a system based on the CONDENSATION algorithm (particle filter) (Isard and Blake, 1996). Deutscher introduced a modified particle filter to handle the high-dimensional configuration space of human motion capture. It uses a continuation principle, based on annealing, to gradually introduce the influence of narrow peaks in the fitness function. Two image features are used in combination: edges and foreground silhouettes. Good tracking results are achieved using this approach.

Promising results have been reported using depth data obtained from stereo (Covell et al., 2000; Jojic et al., 1999; Plankers and Fua, 1999, 2001) for pose estimation. The first attempt at using voxel data obtained from multiple cameras to estimate body pose was reported in Cheung et al. (2000). A simple six-part body model is fitted to the 3D voxel reconstruction. The tracking is performed by assigning the voxels in the new frame to the closest body part from the previous frame and by recomputing the new position of the body part based on the voxels assigned to it. This simple approach does not guarantee that two adjacent body parts will not drift apart, and it can also lose track easily for moderately fast motions.

In algorithms that work with the data in the image planes, the 3D body model is repeatedly projected onto the image planes to be compared against the extracted image features. A simplified camera model is often used to enable efficient model projection (Hunter, 1999; Bregler, 1997). Another problem in working with image plane data is that different body parts appear in different sizes and may be occluded, depending on the relative position of the body to the camera and on the body pose.

When using 3D voxel reconstructions as input to the motion capture system, very detailed camera models can be used, since the computations that use the camera model can be done off-line and stored in lookup tables. The voxel data is in the same 3D space as the body model; therefore, there is no need for repeated projections of the model to the image space. Also, since the voxel reconstruction has the same dimensions as the real person's body, the sizes of different body parts are stable and do not depend on the person's position and pose. This allows the design of simple algorithms that take advantage of our knowledge of average shapes and sizes of body parts.

3. System Overview

The system flowchart is shown in Fig. 1. The main components are the 3D voxel reconstruction, model acquisition and motion tracking. The 3D voxel reconstruction takes multiple synchronized, segmented video streams and computes the reconstruction of the shape represented by the foreground pixels. The system computes the volume of interest from the foreground bounding boxes in each of the camera views. Then, for each candidate voxel in the volume of interest, it is checked whether its projections onto each of the image planes coincide with a foreground pixel. If so, that voxel is assigned to the reconstruction; otherwise the voxel is set to zero. The cameras are calibrated, and the calibration parameters are used by the system to compute the projections of the voxels onto the image planes.

Figure 1. The system flowchart.
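The per-voxel consistency check described here can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the `project` callables are hypothetical stand-ins for the calibrated camera mapping, each voxel is tested only at its center (the paper samples voxels uniformly), and the lookup-table optimization is omitted.

```python
import numpy as np

def carve_voxels(volume_corner, volume_size, voxel_size, projections, silhouettes):
    """Brute-force shape from silhouettes over a volume of interest.

    projections: one callable per camera, mapping an (N, 3) array of world
                 points to (N, 2) integer pixel coordinates (col, row).
    silhouettes: one boolean foreground mask per camera.
    Returns a boolean occupancy grid: True where all cameras see foreground.
    """
    nx, ny, nz = np.ceil(np.asarray(volume_size) / voxel_size).astype(int)
    # Voxel centers on a regular grid inside the volume of interest.
    ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz),
                             indexing="ij")
    centers = np.asarray(volume_corner) \
        + (np.stack([ii, jj, kk], axis=-1) + 0.5) * voxel_size
    pts = centers.reshape(-1, 3)

    occupied = np.ones(len(pts), dtype=bool)
    for project, mask in zip(projections, silhouettes):
        px = project(pts)
        h, w = mask.shape
        inside = (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
        fg = np.zeros(len(pts), dtype=bool)
        fg[inside] = mask[px[inside, 1], px[inside, 0]]
        occupied &= fg  # a voxel survives only if every view agrees
    return occupied.reshape(nx, ny, nz)
```

With real cameras, `project` would apply the calibrated model of Section 4; here it can be any function with the stated signature.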

Model acquisition is performed in two stages. In the first frame of the sequence, a simple template fitting and growing procedure is used to locate different body parts. This procedure takes advantage of prior knowledge of the average sizes and shapes of different body parts. This initial model is then refined using a Bayesian network that incorporates the knowledge of human body proportions into the estimates of body part sizes. This procedure converges rapidly.

Finally, the tracking procedure executes the predict-update cycle at each frame. Using the prediction of the model position and configuration from the previous frame, the voxels in the new frame are assigned to one of the body parts. Measurements of the locations of specific points on the body are extracted from the labeled voxels, and a Kalman filter is used to adjust the model position and configuration to best fit the extracted measurements. The voxel labeling procedure combines template fitting and distance minimizing approaches. This algorithm can handle large frame-to-frame displacements and results in robust tracking performance.
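The voxel-to-part assignment can be illustrated with a distance-based baseline. Note that this is only the distance-minimizing half of the labeling procedure (the template-fitting half, which handles large displacements, is what makes the full algorithm robust); the function name and the ellipsoid-covariance representation of parts are our illustrative assumptions.

```python
import numpy as np

def label_voxels(voxels, part_means, part_covs):
    """Assign each voxel to the body part with the smallest squared
    Mahalanobis distance to the part's predicted ellipsoid.

    voxels:     (N, 3) voxel centers
    part_means: (P, 3) predicted part centroids
    part_covs:  (P, 3, 3) predicted part covariances
    Returns an (N,) array of part indices.
    """
    dists = np.empty((len(voxels), len(part_means)))
    for j, (mu, cov) in enumerate(zip(part_means, part_covs)):
        d = voxels - mu
        inv = np.linalg.inv(cov)
        # Squared Mahalanobis distance of every voxel to part j.
        dists[:, j] = np.einsum("ni,ij,nj->n", d, inv, d)
    return np.argmin(dists, axis=1)
```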

The human body model used in this project is described using the twists framework developed in the robotics community (Murray et al., 1993). In this formulation, the constraints that ensure kinematically valid postures, allowed by the degrees of freedom in the joints and the connectedness between specific body parts, are incorporated into the model. This results in a non-redundant set of model parameters and in a simple and stable tracking algorithm, since there is no need to impose these constraints during the tracking process.

4. 3D Voxel Reconstruction

To compute the voxel reconstruction, the camera images are first segmented using the algorithm described in Horprasert et al. (1999), which eliminates shadows and highlights and produces good quality silhouettes. Based on the centroids and bounding boxes of the 2D silhouettes, a bounding volume of the person is computed. Cameras are calibrated using Tsai's algorithm (Tsai, 1987).

Figure 2. 3D voxel reconstruction: (a) original input images, (b) extracted 2D silhouettes, (c) two views of the resulting 3D reconstruction at voxel size of 50 mm and (d) two views of the 3D reconstruction at voxel size of 25 mm.

Reconstructing a 3D shape using silhouettes from multiple images is called voxel carving or shape from silhouettes. The octree (Szeliski, 1993) is one of the best known approaches to voxel carving. The volume of interest is first represented by one cube, which is progressively subdivided into eight subcubes. Once it is determined that a subcube is entirely inside or entirely outside the 3D reconstruction, its subdivision is stopped. Cubes are organized in a tree, and once all the leaves stop dividing, the tree gives an efficient representation of the 3D shape. The more straightforward approach is to check, for each voxel, whether it is consistent with all silhouettes. This approach, with several clever speed-up methods, is described in Cheung et al. (2000). In that system, the person is known to always be inside a predetermined volume of interest. A projection of each voxel in that volume onto each of the image planes is precomputed and stored in a lookup table. Then, at runtime, the process of checking whether a voxel is consistent with a 2D silhouette is very fast, since the use of the lookup table eliminates most of the necessary computations.

Our goal is to allow a person unconstrained movement in a large space by using multiple pan-tilt cameras. Therefore, designing a lookup table that maps voxels to pixels in each camera image is not practical. Instead, we pre-compute a lookup table that maps points from undistorted sensor-plane coordinates (divided by focal length and quantized) in Tsai's model to the image pixels. Then, the only computation performed at runtime is the mapping from world coordinates to undistorted sensor-plane coordinates. A voxel is uniformly sampled and included in the 3D reconstruction if the majority of the sample points agree with all image silhouettes. The voxel size is chosen by the user. An example frame is shown in Fig. 2, with reconstruction at two resolutions.

The equations are given below, where $x_w$ is the voxel's world coordinate, $x_c$ is its coordinate in the camera coordinate system, $X_u$ and $Y_u$ are undistorted sensor-plane coordinates, $X_d$ and $Y_d$ are distorted (true) sensor-plane coordinates, and $X_f$ and $Y_f$ are pixel coordinates. The lookup table is fixed for a camera regardless of its orientation (it would also work for a pan/tilt camera; only the rotation matrix $R$ and the translation vector $T$ change in this case).

Lookup table computations, $(X_u/f,\ Y_u/f) \to (X_f,\ Y_f)$:

$$X_u = X_d\left(1 + \kappa_1\left(X_d^2 + Y_d^2\right)\right)$$
$$Y_u = Y_d\left(1 + \kappa_1\left(X_d^2 + Y_d^2\right)\right)$$
$$X_f = d_x^{-1} X_d s_x + C_x$$
$$Y_f = d_y^{-1} Y_d + C_y$$

Run-time computations, $(x_w, y_w, z_w) \to (X_u/f,\ Y_u/f)$:

$$x_c = R x_w + T, \qquad X_u/f = x_c/z_c, \qquad Y_u/f = y_c/z_c$$
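The two halves of this mapping can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: inverting the radial distortion (needed to fill the lookup table) has no closed form, so a fixed-point iteration is used here, while the focal-length scaling and the quantization of the table index are omitted, and the function names are ours.

```python
import math
import numpy as np

def undistorted_to_pixel(Xu, Yu, kappa1, sx, dx, dy, Cx, Cy, iters=20):
    # Offline (lookup-table) half: invert X_u = X_d (1 + kappa1 (X_d^2 + Y_d^2))
    # by fixed-point iteration on the radial distance, then map to pixels.
    ru = math.hypot(Xu, Yu)
    rd = ru
    for _ in range(iters):
        rd = ru / (1.0 + kappa1 * rd * rd)
    s = rd / ru if ru > 0 else 1.0
    Xd, Yd = Xu * s, Yu * s
    return sx * Xd / dx + Cx, Yd / dy + Cy  # (X_f, Y_f)

def world_to_sensor(xw, R, T):
    # Run-time half: x_c = R x_w + T, then perspective division gives the
    # undistorted sensor-plane coordinates divided by focal length.
    xc = R @ np.asarray(xw, dtype=float) + T
    return xc[0] / xc[2], xc[1] / xc[2]
```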

5. Human Body Model

In designing a robust motion capture system that produces kinematically valid posture estimates, the choice of the human body modeling framework is critical. Valid configurations of the model are defined by a number of constraints that can be incorporated into the model itself or imposed during tracking. It is desirable to incorporate as many constraints as possible into the model, since that results in more robust and stable tracking performance. However, it is also desirable that the relationship between the model parameters and the locations of specific points on the body be simple, since it represents the measurement equation of the Kalman filter. The twists framework (Murray et al., 1993) for describing kinematic chains satisfies both requirements. The model is captured with a non-redundant set of parameters (i.e., with most of the constraints incorporated into the model), and the relationship between the model parameters and the locations of specific points on the body is simple.

5.1. Twists and the Product of Exponentials Formula for Kinematic Chains

Let us consider a rotation of a rigid object about a fixed axis. Let the unit vector along the axis of rotation be $\omega \in \mathbb{R}^3$ and let $q \in \mathbb{R}^3$ be a point on the axis. Assuming that the object rotates with unit velocity, the velocity of a point $p(t)$ on the object is:

$$\dot{p}(t) = \omega \times (p(t) - q) \qquad (1)$$

This can be rewritten in homogeneous coordinates as:

$$\begin{bmatrix} \dot{p} \\ 0 \end{bmatrix} = \begin{bmatrix} \hat{\omega} & -\omega \times q \\ 0 & 0 \end{bmatrix} \begin{bmatrix} p \\ 1 \end{bmatrix} = \hat{\xi} \begin{bmatrix} p \\ 1 \end{bmatrix}, \qquad \dot{\bar{p}} = \hat{\xi}\,\bar{p} \qquad (2)$$

where $\bar{p} = [p\ \ 1]^T$ is the homogeneous coordinate of the point $p$, and $\omega \times x = \hat{\omega} x$ for all $x \in \mathbb{R}^3$, i.e.,

$$\hat{\omega} = \begin{bmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{bmatrix} \qquad (3)$$

and

$$\hat{\xi} = \begin{bmatrix} \hat{\omega} & -\omega \times q \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} \hat{\omega} & v \\ 0 & 0 \end{bmatrix} \qquad (4)$$

is defined as the twist associated with the rotation about the axis defined by $\omega$ and $q$. The solution of the differential Eq. (1) is:

$$\bar{p}(t) = e^{\hat{\xi} t}\,\bar{p}(0) \qquad (5)$$

Here $e^{\hat{\xi} t}$ is the mapping (the exponential map associated with the twist $\hat{\xi}$) from the initial location of a point $p$ to its new location after rotating $t$ radians about the axis defined by $\omega$ and $q$. It can be shown that

$$M = e^{\hat{\xi} \theta} = \begin{bmatrix} e^{\hat{\omega} \theta} & (I - e^{\hat{\omega} \theta})(\omega \times v) + \omega \omega^T v\,\theta \\ 0 & 1 \end{bmatrix} \qquad (6)$$

where

$$e^{\hat{\omega} \theta} = I + \frac{\hat{\omega}}{\|\omega\|} \sin(\|\omega\| \theta) + \frac{\hat{\omega}^2}{\|\omega\|^2}\left(1 - \cos(\|\omega\| \theta)\right) \qquad (7)$$

is the rotation matrix associated with the rotation of $\theta$ radians about the axis $\omega$. If we have an open kinematic chain with $n$ axes of rotation, it can be shown that:

$$g_P(\theta) = e^{\hat{\xi}_1 \theta_1} e^{\hat{\xi}_2 \theta_2} \cdots e^{\hat{\xi}_n \theta_n}\, g_P(0) \qquad (8)$$

where $g_P(\theta)$ is the rigid body transformation between the base of the chain and a point on the last link of the chain, in the configuration described by the $n$ angles $\theta = [\theta_1\ \theta_2\ \cdots\ \theta_n]^T$; $g_P(0)$ represents the rigid body transformation between the same points for a reference configuration of the chain. Equation (8) is called the product of exponentials formula for an open kinematic chain, and it can be shown that it is independent of the


Figure 3. Articulated body model. Sixteen axes of rotation (marked by circled numbers) in body joints are modeled using twists relative to the torso-centered coordinate system. To describe an axis of rotation, a unit vector along the axis and the coordinate of a point on the axis in the "initial" position of the body are needed. As the initial position, we chose the one where legs and arms are straight and arms are pointing away from the body, as shown in the figure. Dimensions of body parts are determined in the initialization procedure and are held fixed thereafter. Body part dimensions are denoted by λ; the subscript refers to the body part number and the superscript to the dimension order: 0 is for the smallest and 2 for the largest of the three. For all body parts except the torso, the two smaller dimensions are set to be equal.

order in which the rotations are performed. The angles are numbered going from the chain base toward the last link.
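Equations (6)–(8) translate directly into code. The sketch below is a minimal illustration under the paper's conventions (unit rotation axes, angles in radians); the helper names `hat`, `rot_exp`, `twist_exp` and `poe` are ours, not the paper's.

```python
import numpy as np

def hat(w):
    # Skew-symmetric matrix such that hat(w) @ x == np.cross(w, x)  (Eq. (3)).
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rot_exp(w, theta):
    # Rodrigues' formula (Eq. (7)) for a unit axis w, so ||w|| = 1.
    W = hat(w)
    return np.eye(3) + W * np.sin(theta) + W @ W * (1.0 - np.cos(theta))

def twist_exp(w, q, theta):
    # Exponential map (Eq. (6)) of the twist for rotation about the axis (w, q),
    # returned as a 4x4 homogeneous transform, with v = -w x q.
    v = -np.cross(w, q)
    R = rot_exp(w, theta)
    t = (np.eye(3) - R) @ np.cross(w, v) + np.outer(w, w) @ v * theta
    M = np.eye(4)
    M[:3, :3], M[:3, 3] = R, t
    return M

def poe(axes, thetas, p0):
    # Product of exponentials (Eq. (8)): chain the maps base-to-tip and
    # apply them to the point's position in the initial configuration.
    M = np.eye(4)
    for (w, q), th in zip(axes, thetas):
        M = M @ twist_exp(np.asarray(w, float), np.asarray(q, float), th)
    return (M @ np.append(np.asarray(p0, float), 1.0))[:3]
```

For a pure rotation about an axis through $q$, the translation part reduces to $(I - e^{\hat{\omega}\theta})q$, which is what the `twist_exp` formula evaluates to for a unit axis.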

5.2. Twist-Based Human Body Model

The articulated body model we use is shown in Fig. 3. It consists of five open kinematic chains: torso-head, torso-left arm, torso-right arm, torso-left leg and torso-right leg. Sizes of body parts are denoted as $2\lambda_i^{(j)}$, where $i$ is the body part index and $j$ is the dimension order: the smallest dimension is 0 and the largest is 2. For all parts except the torso, the two smaller dimensions are set equal to the average of the two dimensions estimated during initialization. The positions of joints are fixed relative to the body part dimensions in the torso coordinate system (for example, the hip is at $[0\ \ d_H \lambda_0^{(1)}\ \ -\lambda_0^{(2)}]^T$).

Sixteen axes of rotation are modeled in different joints: two in the neck, three in each shoulder, two in each hip, and one in each elbow and knee. We take the torso-centered coordinate system as the reference.

The range of allowed values is set for each angle. For example, the rotation in the knee can go from 0 to 180 degrees; the knee cannot bend forward. The rotations about these axes (relative to the torso) are modeled using exponential maps, as described in the previous section.

Even though the axes of rotation change as the body moves, in the twists formulation the descriptions of the axes stay fixed and are determined in the initial body configuration. We chose the configuration with extended arms and legs, and with arms pointing to the side of the body (shown in Fig. 3), as the initial body configuration. In this configuration, all angles θi are equal to zero. The figure also shows the values of the vectors ωi and qi for each axis. The location of a point on the body with respect to the torso is determined by its location in the initial configuration and the product of exponential maps for the axes that affect the position of that point.

Knowing the dimensions of body parts and using the body model shown in Fig. 3, the configuration of the body is completely captured by the angles of rotation about each of the axes ($\theta_1$–$\theta_{16}$) and by the centroid ($t_0$) and orientation (rotation matrix $R_0$) of the torso. The orientation of the torso is parameterized by a unit vector $\omega_0$ and an angle $\theta_0$ (Eq. (9)). The position and orientation of the torso are therefore captured using seven parameters: three coordinates for the centroid location and four for the orientation. The configuration of the described model is thus fully captured by 23 parameters: $\theta_1$–$\theta_{16}$, $t_0$, $\omega_0$ and $\theta_0$.

$$R_0 = e^{\hat{\omega}_0 \theta_0} = I + \frac{\hat{\omega}_0}{\|\omega_0\|} \sin(\|\omega_0\| \theta_0) + \frac{\hat{\omega}_0^2}{\|\omega_0\|^2}\left(1 - \cos(\|\omega_0\| \theta_0)\right) \qquad (9)$$

The exponential maps associated with each of the sixteen axes of rotation are easily computed using Eq. (6) and the vectors $\omega_i$ and $q_i$, for each $i = 1, \ldots, 16$, given in Fig. 3.

During the tracking process, the locations of specific points on the body, such as the upper arm centroid or the neck, are used to adjust the model configuration to the data. To formulate the tracker, it is necessary to derive the equations that relate the locations of these points in the reference coordinate system to the parameters of the body model.

For a point $p$, we define the significant rotations as those affecting the position of the point; if $p$ is the wrist, there are four: three in the shoulder and one in the elbow. The set of angles $\theta_p$ contains the angles associated with the significant rotations. The position of a point $p_t(\theta_p)$ with respect to the torso is given by the product of the exponential maps corresponding to the set of significant rotations and of the position of the point in the initial configuration $p_t(0)$ (in homogeneous coordinates):

$$p_t(\theta_p) = M_{i1} M_{i2} \cdots M_{in}\, p_t(0), \quad \text{where } \theta_p = \{\theta_{i1}, \theta_{i2}, \ldots, \theta_{in}\} \qquad (10)$$

We denote by $M_0$ the mapping that corresponds to the torso position and orientation:

$$M_0 = \begin{bmatrix} R_0 & t_0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} e^{\hat{\omega}_0 \theta_0} & t_0 \\ 0 & 1 \end{bmatrix} \qquad (11)$$

where $R_0 = e^{\hat{\omega}_0 \theta_0}$ and $t_0$ is the torso centroid. The homogeneous coordinates of a point with respect to the world coordinate system can now be expressed as:

$$p_0(\theta_p, \omega_0, t_0) = M_0\, p_t(\theta_p) \qquad (12)$$

It follows that the Cartesian coordinate of this point is:

$$p_0(\theta_p) = R_0\left(R_{i1}\left(R_{i2}\left(\cdots\left(R_{in}\, p_t(0) + t_{in}\right) + \cdots\right) + t_{i2}\right) + t_{i1}\right) + t_0 \qquad (13)$$

6. Tracking

The algorithm for model acquisition, which estimates body part sizes and their locations at the beginning of the sequence, is presented in the next section. For now, we assume that the dimensions of all body parts and their approximate locations at the beginning of the sequence are known. For every new frame, the tracker updates the model position and configuration to reflect the motion of the tracked person. The algorithm flowchart is shown in Fig. 4.

The tracker is an extended Kalman filter (EKF) that estimates the parameters of the model given the measurements extracted from the data. For each new frame, the prediction of the model position and configuration produced from the previous frame is used to label the voxel data and compute the locations of the chosen measurement points. Those measurements are then used by the EKF to update the model parameters and to produce the prediction for the next frame. In this section, the extended Kalman filter is formulated and the voxel labeling algorithm is described.

6.1. The Extended Kalman Filter

The Kalman filter tracker for our problem is defined by:

$$x[k+1] = F[k]\, x[k] + v[k] \qquad (14)$$
$$z[k] = h[k, x[k]] + w[k]$$

where $v[k]$ and $w[k]$ are sequences of zero-mean, white, Gaussian noise with covariance matrices $Q[k]$ and $R[k]$, respectively. The initial state $x(0)$ is assumed to be Gaussian with mean $x[0/0]$ and covariance $P[0/0]$. The 23 parameters of the human body model described in Section 5 constitute the Kalman filter state $x$ (Table 1):

$$x = \left[t_0^T\ \ \omega_0^T\ \ \theta_0\ \ \theta_1\ \cdots\ \theta_{16}\right]^T \qquad (15)$$

Table 1. State variables.

t0          Torso centroid
ω0, θ0      Torso orientation
θ1, θ2      Neck angles
θ3, θ5, θ7  Shoulder angles (L)
θ4, θ6, θ8  Shoulder angles (R)
θ9          Elbow (L)
θ10         Elbow (R)
θ11, θ13    Hip angles (L)
θ12, θ14    Hip angles (R)
θ15         Knee (L)
θ16         Knee (R)

Figure 4. Flow chart of the tracking algorithm. For each new frame, the prediction of the model position is used to label the voxels and compute the locations of measurement points. The tracker then updates the model parameters to fit the new measurements.

Figure 5. Measurement points.

The state transition matrix is set to the identity matrix. For the measurements of the Kalman filter (contained in the vector z[k]) we chose 23 points on the human body, which include centroids and endpoints of different body parts (Fig. 5 and Table 2):

z = [p0^T · · · p22^T]^T        (16)

The measurement equation is defined by Eq. (13). The measurement Jacobian h_x[k + 1] = [∇x h^T[k + 1, x]]^T, evaluated at x = x[k + 1/k], is easily computed. For example, the last row of the measurement Jacobian (for the right foot) is determined from:

∂p22/∂ω0 = (∂R0/∂ω0)(R12(R14(R16 p22(0) + t16) + t14) + t12)        (17)

∂p22/∂θ0 = (∂R0/∂θ0)(R12(R14(R16 p22(0) + t16) + t14) + t12)        (18)

∂p22/∂θ12 = R0((∂R12/∂θ12)(R14(R16 p22(0) + t16) + t14) + ∂t12/∂θ12)        (19)

∂p22/∂θ14 = R0 R12((∂R14/∂θ14)(R16 p22(0) + t16) + ∂t14/∂θ14)        (20)

∂p22/∂θ16 = R0 R12 R14((∂R16/∂θ16) p22(0) + ∂t16/∂θ16)        (21)

Usually, several iterations of the extended Kalman filter algorithm are performed in each frame, with the measurement Jacobian updated at every iteration.
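A minimal numpy sketch of such an iterated update is shown below (illustrative only, not the authors' implementation; `h` and `jac_h` stand in for the measurement function of Eq. (13) and its Jacobian):

```python
import numpy as np

def iterated_ekf_update(x_pred, P_pred, z, h, jac_h, R, n_iter=3):
    """One measurement update of an iterated EKF.

    x_pred, P_pred : predicted state and covariance (the state transition
                     is the identity, so x_pred is the previous estimate
                     and P_pred its covariance plus Q).
    z              : stacked measurement vector (here, the 23 3D points).
    h, jac_h       : measurement function and its Jacobian H = dh/dx.
    """
    x = x_pred.copy()
    for _ in range(n_iter):                     # re-linearize at each pass
        H = jac_h(x)
        S = H @ P_pred @ H.T + R                # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
        # iterated form: innovation referred back to the linearization point
        x = x_pred + K @ (z - h(x) - H @ (x_pred - x))
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred  # updated state covariance
    return x, P
```

For a linear measurement function this reduces to the standard Kalman update after the first pass; the extra passes only matter when h is nonlinear, as it is here.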

6.2. Voxel Labeling

Initially, we labeled the voxels based on the Mahalanobis distance from the predicted positions of body parts (Mikic et al., 2001). However, in many cases this led to loss of track, because labeling based purely on distance cannot produce a good result when the model prediction is not very close to the true positions of body parts. We have, therefore, designed an algorithm that takes advantage of the qualities of voxel data to perform reliable labeling even for very large frame-to-frame displacements. The head and torso are located first, not by their distance from the prediction but by their unique shapes and sizes. Next, the predictions of limb locations are modified to preserve joint angles with respect to the new positions of the head and the torso. This is the key step that enables tracking for large displacements. The limb voxels are then labeled based on distance from the modified predictions. In the remainder of this section, a detailed description of the voxel labeling algorithm is given.

Table 2. Variables in the measurement vector.

p0   Torso centroid        p8   Calf cent. (L)    p16  Elbow (R)
p1   Head centroid         p9   Calf cent. (R)    p17  Fingertips (L)
p2   Upper arm cent. (L)   p10  Neck              p18  Fingertips (R)
p3   Upper arm cent. (R)   p11  Shoulder (L)      p19  Knee (L)
p4   Thigh cent. (L)       p12  Shoulder (R)      p20  Knee (R)
p5   Thigh cent. (R)       p13  Hip (L)           p21  Foot (L)
p6   Lower arm cent. (L)   p14  Hip (R)           p22  Foot (R)
p7   Lower arm cent. (R)   p15  Elbow (L)

Figure 6. Head location procedure illustrated in a 2D cross-section. (a) Search for the location of the center of a spherical crust template that contains the maximum number of surface voxels. The small dashed circle is the prediction of the head pose from the previous frame; it determines the search area (large dashed circle) for the new head location. (b) The best location is found. (c) Voxels that are inside the sphere of the larger diameter are labeled as belonging to the head. (d) Head voxels, the head center and the neck.

Due to its unique shape and size, the head is the easiest part to find and is located first (see Fig. 6). We create a spherical crust template whose inner and outer diameters correspond to the smallest and largest head dimensions. The template center location that maximizes the number of surface voxels inside the crust is chosen as the head center. Then, the voxels that are inside the sphere of the larger diameter, centered at the chosen head center, are labeled as belonging to the head, and the true center and orientation of the head are recomputed from those voxels. The approximate location of the neck is found as an average over the head voxels that have at least one non-head body voxel as a neighbor. Since the prediction of the head location is available, the search for the head center can be limited to a neighborhood of that prediction, which speeds up the search and reduces the likelihood of error.

Figure 7. Fitting the torso. The torso template is placed so that its base is at the neck and its main axis passes through the centroid of non-head voxels. Voxels that are inside the template are used to calculate the new centroid, and the template is rotated to align the main axis with the new centroid. The process is repeated until the template stops moving, which happens when it is entirely inside the torso, or is well centered over it.
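The crust search described above can be sketched as follows (an illustrative implementation; the candidate set and the millimeter units are assumptions, with candidates in practice forming a grid around the predicted head center):

```python
import numpy as np

def find_head_center(surface_voxels, candidates, r_inner, r_outer):
    """Pick the candidate center whose spherical crust [r_inner, r_outer]
    contains the most surface voxels.

    surface_voxels : (N, 3) array of surface-voxel coordinates (mm).
    candidates     : (M, 3) array of candidate centers, e.g. a grid in the
                     neighborhood of the predicted head position.
    """
    best_center, best_count = None, -1
    for c in candidates:
        d = np.linalg.norm(surface_voxels - c, axis=1)   # distance to center
        count = np.count_nonzero((d >= r_inner) & (d <= r_outer))
        if count > best_count:
            best_center, best_count = c, count
    return best_center, best_count
```

Voxels within `r_outer` of the returned center would then be labeled as head voxels.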

The torso is located next. A template of the size of the person's torso (with circular cross-section whose radius equals the larger of the two torso radii) is placed with its base at the neck and with its axis going through the centroid of non-head body voxels. The voxels inside the template are then used to recompute a new centroid, and the template is rotated so that its axis passes through it (Fig. 7). The template is anchored to the neck at the center of its base at all times. This procedure is repeated until the template stops moving, which happens when it is entirely inside the torso or is well centered over it.
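This iterative re-orientation loop can be sketched as follows, under the simplifying assumption that the template is a cylinder anchored at the neck (names and the convergence tolerance are illustrative, not the authors' code):

```python
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v)

def fit_torso_axis(voxels, neck, length, radius, n_iter=20, tol=1e-3):
    """Iteratively orient a cylindrical torso template anchored at the neck.

    voxels : (N, 3) non-head body voxel coordinates.
    neck   : (3,) anchor point; the cylinder base stays here throughout.
    """
    axis = _unit(voxels.mean(axis=0) - neck)      # point toward body centroid
    inside = np.ones(len(voxels), dtype=bool)
    for _ in range(n_iter):
        rel = voxels - neck
        along = rel @ axis                        # signed distance along axis
        radial = np.linalg.norm(rel - np.outer(along, axis), axis=1)
        inside = (along >= 0) & (along <= length) & (radial <= radius)
        if not inside.any():
            break
        new_axis = _unit(voxels[inside].mean(axis=0) - neck)
        if np.linalg.norm(new_axis - axis) < tol: # template stopped moving
            axis = new_axis
            break
        axis = new_axis
    return axis, inside
```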

Next, the predictions for the four limbs are modified to maintain the predicted hip and shoulder angles with the new torso position, which usually moves them much closer to the true positions of the limbs. The remaining voxels are then assigned to the four limbs based on Mahalanobis distance from these modified positions. To locate the upper arms and thighs, the same fitting procedure used for the torso is repeated, including only the appropriate limb voxels, with templates anchored at the shoulders/hips. When the voxels belonging to upper arms and thighs are labeled, the remaining voxels in each of the limbs are labeled as lower arms or calves.
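The limb assignment step amounts to a nearest-blob classification by Mahalanobis distance; a sketch (illustrative; `limb_means` and `limb_covs` would come from the modified limb predictions):

```python
import numpy as np

def label_limb_voxels(voxels, limb_means, limb_covs):
    """Assign each remaining voxel to the nearest limb by Mahalanobis distance.

    voxels     : (N, 3) coordinates of voxels not labeled head or torso.
    limb_means : list of (3,) predicted limb centroids.
    limb_covs  : list of (3, 3) covariances describing each limb's extent.
    """
    d2 = np.empty((len(voxels), len(limb_means)))
    for j, (mu, cov) in enumerate(zip(limb_means, limb_covs)):
        diff = voxels - mu
        # squared Mahalanobis distance of every voxel to limb j
        d2[:, j] = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(cov), diff)
    return d2.argmin(axis=1)          # limb index per voxel
```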

Once all the voxels are labeled, the 23 measurement points are easily computed as centroids or endpoints of the appropriate blobs. The extended Kalman filter tracker is then used to adjust the model to the measurements in the new frame and to produce the prediction for the next frame. Figure 8 illustrates the voxel labeling and tracking.

7. Model Acquisition

The human body model is chosen a priori and is the same for all humans. However, the actual sizes of body parts vary from person to person. Obviously, for each captured sequence, the initial locations of different body parts will vary as well. Model acquisition, therefore, involves both locating the body parts and estimating their true sizes from the data at the beginning of a sequence. It is performed in two stages (Fig. 9). First, rough estimates of body part locations and sizes in the first frame are generated using a simple template fitting and growing algorithm. In the second stage, this estimate is refined over several subsequent frames using a Bayesian network that takes into account both the measured body dimensions and the known proportions of the human body. During this refinement process, the Bayesian network is inserted into the tracking loop, using the body part size measurements produced by the voxel labeling to modify the model, which is then adjusted to best fit the data using the extended Kalman filter. When the body part sizes stop changing, the Bayesian network is "turned off" and regular tracking continues.

7.1. Initial Estimation of Body Part Locations and Sizes

This procedure is similar to the voxel labeling described in Section 6.2. However, the prediction from the previous frame does not exist (this is the first frame), and the sizes of body parts are not known. Therefore, several modifications and additional steps are needed.

The algorithm illustrated in Fig. 6 is still used to locate the head; however, the inner and outer diameters of the spherical crust template are now set to the smallest and largest head diameters we expect to see. Also, the whole volume has to be searched. Errors are more likely than during voxel labeling for tracking, but they are still quite rare: in our experiments on 600 frames, this version located the head correctly in 95% of the frames.

To locate the torso, the same fitting procedure described for voxel labeling is used (Fig. 7), but with a template of an average-sized torso. Then, the torso template is shrunk to a small predetermined size in its new location and grown in all dimensions until further growth starts including empty voxels. At every step of the growing, the torso is reoriented as shown in Fig. 7 to ensure that it remains well centered during growth. In the direction of the legs, the growing stops at the place where the legs part. The voxels inside this new template are labeled as belonging to the torso (Fig. 10).

Figure 8. Voxel labeling and tracking. (a) Tracking result in the previous frame; (b) model prediction in the new frame; (c) head and torso located; (d) limbs moved to preserve the predicted hip and joint angles for the new torso position and orientation; (e) the four limbs are labeled by minimizing the Mahalanobis distance from the limb positions shown in (d); (f) upper arms and thighs are labeled by fitting them inside the limbs, anchored at the shoulder/hip joints; the remaining limb voxels are labeled as lower arms and calves; (g) the measurement points are easily computed from the labeled voxels; (h) the tracker adjusts the body model to fit the data in the new frame.

Figure 9. Flow chart of the model acquisition process.

Figure 10. Torso locating procedure illustrated in a 2D cross-section. (a) The initial torso template is fitted to the data; (b) it is then replaced by a small template of predetermined size, which is anchored at the same neck point and oriented the same way; (c) the template is then grown and reoriented at every step of growing to ensure the growth does not go in the wrong direction; (d) the growing is stopped when it starts including empty voxels; (e) voxels inside the final template are labeled as belonging to the torso.

Next, the four regions belonging to the limbs are found as the four largest connected regions of the remaining voxels. The hip and shoulder joints are located as the centroids of the voxels at the border of the torso and each of the limbs. Then, the same fitting and growing procedure described for the torso is repeated for the thighs and upper arms. The lower arms and calves are found by locating the connected components closest to the identified upper arms and thighs. Figure 11 shows the described initial body part localization on real voxel data.
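Finding the largest connected regions can be sketched with a simple 6-connected flood fill over the binary occupancy grid (an illustrative implementation, not the authors' code):

```python
import numpy as np
from collections import deque

def connected_components(occ):
    """6-connected components of a binary voxel grid, largest first.

    occ : 3D boolean occupancy array of the remaining (unlabeled) voxels.
    Returns a list of components, each a list of (i, j, k) voxel indices.
    """
    visited = np.zeros_like(occ, dtype=bool)
    comps = []
    for start in zip(*np.nonzero(occ)):
        if visited[start]:
            continue
        comp, queue = [], deque([start])
        visited[start] = True
        while queue:                          # breadth-first flood fill
            v = queue.popleft()
            comp.append(v)
            for axis in range(3):
                for step in (-1, 1):
                    n = list(v)
                    n[axis] += step
                    n = tuple(n)
                    if all(0 <= n[i] < occ.shape[i] for i in range(3)) \
                            and occ[n] and not visited[n]:
                        visited[n] = True
                        queue.append(n)
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)
```

Taking the first four returned components gives the four limb regions.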

7.2. Model Refinement

The estimates of body part sizes and locations in the first frame are produced using the algorithm described in the previous section. It performs robustly, but the sizes of the torso and the limbs are often very inaccurate and depend on the body pose in the first frame. For example, if the person is standing with legs straight and close together, the initial torso will be very long and include much of the legs, and the estimates of the thigh and calf sizes will be very small. Obviously, an additional mechanism for estimating true body part sizes is needed.

In addition to the initial estimate of the body part sizes and of the person's height, general knowledge of human body proportions is available. To take that important knowledge into account when reasoning about body part sizes, we use Bayesian networks (BNs). A BN is inserted into the tracking loop (Fig. 12), modifying the estimates of body part lengths at each new frame. The EKF tracker adjusts the new model position and configuration to the data, and the voxel labeling procedure provides the measurements in the following frame, which are then used by the BN to update the estimates of body part lengths. This procedure is repeated until the body part lengths stop changing, which is usually achieved in three to four frames.

The domain knowledge that is useful for designing the Bayesian network is: the human body is symmetric, i.e., the corresponding body parts on the left and right sides have the same dimensions; the lengths of the head, the torso, the thigh and the calf add up to the person's height; and the proportions of the human body are known.

The measurements that can be made from the data are the sizes of all body parts and the person's height. The height of the person, the dimensions of the head and the two width dimensions of all other body parts are measured quite accurately. The lengths of different body parts are the ones that are inaccurately measured, because the measured lengths depend on the borders between body parts, which are hard to locate accurately. For example, if the leg is extended, it is very hard to determine where the thigh ends and the calf begins, but the two width dimensions can be determined very accurately from the data.


Figure 11. Initial body part localization. (a) 3D voxel reconstruction; (b) head located; (c) initial torso template anchored at the neck and centered over the non-head voxels; (d) start of the torso growing; (e) final result of torso growing, with torso voxels labeled; (f) four limbs labeled as the four largest remaining connected components; (g) upper arms and thighs are grown, anchored at the shoulders/hips, with the same procedure used for the torso; (h) lower arms and calves are fitted to the remaining voxels; (i) all voxels are labeled; (j) current model adjusted to the data using the EKF to ensure a kinematically valid posture estimate.

Figure 12. Body part size estimation.


Figure 13. The Bayesian network for estimating body part lengths. Each node represents a length. The leaf nodes are measurements (Thm represents the new thigh measurement, Thm0 reflects the past measurements, etc.). Nodes Torso, Thigh, Calf, UpperArm and LowerArm are random variables that represent true body part lengths.

Taking into account what is known about the human body and what can be measured from the data, we conclude that there is no need to refine our estimates of the head dimensions or the width dimensions of other body parts, since they can be accurately estimated from the data, and our knowledge of body proportions would not be of much help in these cases anyway. However, for body part lengths the refinement is necessary, and the available prior knowledge is very useful. Therefore, we have built the Bayesian network shown in Fig. 13, which estimates the lengths of body parts taking into account both what is known and what can be measured.

Each node represents a continuous random variable. Leaf nodes Thm, Cm, UAm and LAm are the measurements of the lengths of the thigh, calf, upper arm and lower arm in the current frame. Leaf node Height is the measurement of the person's height (minus the head length) computed in the first frame. If the person's height is significantly smaller than the sum of the measured lengths of the appropriate body parts, we take that sum as the true height, in case the person is not standing up. Leaf nodes Thm0, Cm0, UAm0 and LAm0 are used to increase the influence of past measurements and speed up the convergence. Each of these nodes is updated with the mean of the marginal distribution of its parent from the previous frame. The other nodes (Torso, Thigh, Calf, UpperArm and LowerArm) are random variables that represent true body part lengths. Due to body symmetry, we include only one node for each of the lengths of the limb body parts, and update the corresponding measurement node with the average of the measurements from the left and right sides. The measurement of the torso length is not used because the voxel labeling procedure simply fits the known torso to the data; the torso length measurement is therefore essentially the same as the torso length in the model from the previous frame.

All variables are Gaussian, and the distribution of a node Y with continuous parents Z is of the form:

p(Y | Z = z) = N(α + β^T z, σ²)        (22)

Therefore, for each node with n parents, a set of n weights β = [β1 . . . βn]^T, a standard deviation σ and possibly a constant α are the parameters that need to be chosen. These parameters have a clear physical interpretation and are quite easy to select. The selected parameters for the network in Fig. 13 are shown in Table 3. Nodes Thigh, Calf, UpperArm and LowerArm each have only one parent (Torso), and their weight parameters represent known body proportions. Node Height has a Gaussian distribution with mean equal to the sum of the thigh, calf and torso lengths (hence all three weights are equal to 1). Each of the nodes Thm0, Cm0, UAm0 and LAm0 is updated with the mean of the marginal distribution of its parent in the previous frame, hence the weight of 1.
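For a chain of nodes of the form in Eq. (22), the update is a standard linear-Gaussian posterior. A sketch (illustrative, not the authors' inference engine; the worked numbers below assume the Table 3 parameters, where marginalizing out Thigh makes the thigh measurement act on Torso with weight 0.9 and variance 100² + 200²):

```python
def linear_gaussian_posterior(mu0, var0, beta, alpha, y, var_y):
    """Posterior of X ~ N(mu0, var0) after observing y = alpha + beta*x + noise,
    with noise ~ N(0, var_y) -- the node model of Eq. (22) with one parent."""
    prec = 1.0 / var0 + beta * beta / var_y          # posterior precision
    mean = (mu0 / var0 + beta * (y - alpha) / var_y) / prec
    return mean, 1.0 / prec

# Torso prior N(500, 300^2); a thigh measurement of 400 mm reaches Torso
# through Thigh = 0.9*Torso + noise(100^2) and Thm = Thigh + noise(200^2):
mean, var = linear_gaussian_posterior(500.0, 300.0**2, 0.9, 0.0,
                                      400.0, 100.0**2 + 200.0**2)
# the posterior mean lies between the prior (500) and the value the
# measurement alone implies (400 / 0.9), and the variance shrinks
```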

7.3. Determining Body Orientation

The initial body part localization procedure does not include determining which side of the body is left and which is right. This is important because it affects the angle range for each joint. For example, if the y-axis of the torso points to the person's left, the angle range for either knee is [0, 180] degrees; if it points to the right, the knee range is [−180, 0]. To determine this overall body orientation, we first adjust the model to the initially estimated body part positions using the EKF described in the previous section, with an arbitrary body orientation. Then we switch the body orientation (i.e., the angle limits) and repeat the adjustment. The quality of each fit is quantified by the sum of Euclidean distances between the measurement points and the corresponding points in the model. If a significant difference in the quality of fit exists, the orientation that produces the smaller error is chosen; otherwise, the decision is deferred to the next frame.

Table 3. The parameters used in the Bayesian network shown in Fig. 13 (all values in mm). Torso is the only node without parents, and therefore the only node that needs an a priori distribution. Nodes Thigh, Calf, UpperArm and LowerArm each have only one parent, Torso, and their probabilities conditioned on the torso length are Gaussian with a mean that depends linearly on the torso length; the weight parameters represent known body proportions. Node Height has a Gaussian distribution with mean equal to the sum of the thigh, calf and torso lengths. Each of the nodes Thm0, Cm0, UAm0 and LAm0 is updated with the mean of the marginal distribution of its parent in the previous frame, hence the weight of 1.

Node        β, σ            Node     β, σ         Node     β, σ
Thigh       [0.9], 100      Thm      [1], 200     Thm0     [1], 100
Calf        [1], 100        Cm       [1], 200     Cm0      [1], 100
UpperArm    [0.55], 100     UAm      [1], 200     UAm0     [1], 100
LowerArm    [0.75], 100     LAm      [1], 200     LAm0     [1], 100
Torso       (µ, σ) = (500, 300)      Height   [1, 1, 1]^T, 100

Figure 14. Determining the body orientation. Left: top-down view; right: side view. (a) In the first frame, using the erroneous orientation, the EKF cannot properly fit one leg because it is trying to fit the left leg to the measurements from the right leg; (b) using the other orientation, the fit is much closer, and this orientation is accepted as the correct one.

This is illustrated in Fig. 14. Figure 14(a) shows the model adjustment with the erroneous orientation. Figure 14(b) shows the correct orientation, with the knees appropriately bent.
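The orientation test can be sketched as follows (illustrative; `fit_model` stands in for running the EKF adjustment under a given set of angle limits and returning the adjusted model points):

```python
import numpy as np

def choose_orientation(fit_model, measurements, threshold):
    """Try both left/right orientations and keep the one with the clearly
    smaller total Euclidean fit error; otherwise defer the decision.

    fit_model    : callable orientation -> (K, 3) adjusted model points.
    measurements : (K, 3) measurement points extracted from the voxels.
    """
    errors = []
    for orientation in ('left', 'right'):
        model_pts = fit_model(orientation)
        # sum of point-to-point Euclidean distances quantifies the fit
        errors.append(np.linalg.norm(model_pts - measurements, axis=1).sum())
    if abs(errors[0] - errors[1]) < threshold:
        return None                    # no clear winner: defer to next frame
    return 'left' if errors[0] < errors[1] else 'right'
```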

8. Results

The system presented in this paper was evaluated in a set of experiments with sequences containing people of different heights and body types, from a nine-year-old girl to a tall adult male, performing different types of motions such as walking, dancing, jumping, sitting, stair climbing, etc. (Fig. 15). For each sequence, six synchronized full-frame (640 × 480 pixels) video streams were captured at approximately 10 Hz. The voxel size was set to 25 × 25 × 25 mm.


Figure 15. The performance of the system was evaluated on sequences with people of different body sizes and with different motions.

In this section, we present some of the results of these experiments. However, the quality of the results this system produces can be fully appreciated only by viewing the movies that show the model overlaid on the 3D voxel reconstructions, which can be found at http://cvrr.ucsd.edu/∼ivana/projects.htm. The details of the system evaluation can be found in Mikic (2002).

8.1. Model Acquisition

The iterations of the model refinement are stopped when the sum of body part length changes falls below 3 mm, which usually happens within three to four frames. Figure 16 shows the model acquisition process in a sequence where the person is climbing a step that is behind her. In the first frame, the initial model has a very long torso and short thighs and calves. The model refinement converges in three frames, producing a very good estimate of body part lengths.

Figure 17 shows the body part size estimates in six sequences recorded with the same person, together with the true body part sizes measured on the person. Some variability is present, but it is quite small: about 3–4 voxel lengths, i.e., 75–100 mm.

Figure 18 shows the original camera views of five people, and Fig. 19 shows the corresponding acquired models. The models successfully capture the main features of these very different human bodies. All five models were acquired using the same algorithm and the same Bayesian network with fixed parameters.

8.2. Tracking

Figure 16. Model refinement. The person is climbing a step that is behind her. For this sequence, the procedure converged in three frames. The person is tracked while the model is acquired. (a) Initial estimate of body part sizes and locations in the first frame; (b)–(d) model refinement.

Figure 17. Body part size estimates for six sequences showing the same person, and the true body part sizes. Body part numbers are: 0—torso, 1—head, 2—upper arm (L), 3—upper arm (R), 4—thigh (L), 5—thigh (R), 6—lower arm (L), 7—lower arm (R), 8—calf (L), 9—calf (R).

Figure 18. Original views of the five people shown in Fig. 19. (a) Aditi (height: 1295.4 mm); (b) Natalie (height: 1619.25 mm); (c) Ivana (height: 1651 mm); (d) Andrew (height: 1816.1 mm); (e) Brett (height: 1879 mm).

Figure 19. Estimated models for the five people that participated in the experiments. The models are viewed from similar viewpoints; the size differences are due to the true differences in body sizes. (a) Aditi (height: 1295 mm); (b) Natalie (height: 1619 mm); (c) Ivana (height: 1651 mm); (d) Andrew (height: 1816 mm); (e) Brett (height: 1879 mm).

The ground truth to which the tracking results should be compared is very difficult to obtain. We have taken the approach, as in Hunter (1999), of verifying the tracking accuracy by subjective evaluation: careful visual inspection of the result movies. Once we are convinced that the results are accurate, i.e., unbiased, a smooth curve fitted to the data is taken as a substitute for the ground truth, and the precision of the tracking output is evaluated using quantitative measures. A Savitzky-Golay filter (Press et al., 1993) is used to compute the smooth curve fit. Its output at each point is equal to the value of a polynomial of order M, fitted (using least squares) to the points in a (2n + 1)-point window centered at the point that is currently evaluated. It achieves smoothing without much attenuation of the important data features. The values of M and n are chosen by the user, who subjectively trades off the level of smoothing against the preservation of important data features. In our experiments, we used M = 3 and n = 7.
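The filter can be sketched directly from this description (an illustrative implementation, not the authors' code; each output sample is the value at the current point of a least-squares polynomial fit over a (2n + 1)-point window, with the window clamped at the ends of the data):

```python
import numpy as np

def savgol_smooth(y, M=3, n=7):
    """Savitzky-Golay smoothing: fit a degree-M polynomial to the
    (2n + 1)-point window around each sample and evaluate it there.
    Requires len(y) >= 2n + 1; the paper's M=3, n=7 gives a 15-point window."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    out = np.empty(N)
    for i in range(N):
        lo = min(max(i - n, 0), N - (2 * n + 1))   # clamp window to the data
        t = np.arange(lo, lo + 2 * n + 1)
        coeffs = np.polyfit(t, y[lo:lo + 2 * n + 1], M)
        out[i] = np.polyval(coeffs, i)
    return out
```

A useful sanity check of this construction is that a signal that is itself a polynomial of degree at most M passes through unchanged.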

Figure 20. Tracking results for the walking sequence and one of the six original camera views.

Figure 21. Walking: trajectory of the torso centroid.

Figure 22. Hip and knee angles [rad] as functions of frame number for the walking sequence.

The tracking results look very good by visual inspection. We show sample frames from three sequences in this section, and invite the reader to view the result movies at http://cvrr.ucsd.edu/∼ivana/projects.htm. The precision analysis showed an average absolute error for joint angles of three to four degrees. Shoulder angles θ7 and θ8, which capture the rotation about the main axis of the upper arm, contribute the most to the average error. When the arm is not bent at the elbow, that angle is not defined and is difficult to estimate; in these cases, the deviation from the "ground truth" is not really an error. The average error drops below three degrees when these two angles are not considered. The detailed precision figures can be found in Mikic (2002).

Figure 20 shows five sample frames of a walking sequence. Figure 21 shows the trajectory of the torso centroid as the person walked around the room. Figure 22 shows plots of the hip and knee angles as functions of time. This sequence contained 19 steps, 10 by the left and 9 by the right leg, which can easily be correlated with the shown plots.

Figure 23 shows six sample frames from a dance sequence. The results for a running and jumping sequence are shown in Fig. 24.

Figure 23. Dance sequence: one of the original camera views, the voxel reconstructions and the tracking results for six sample frames.

Figure 24. Running and jumping sequence: one of the original camera views, the voxel reconstructions and the tracking results for six sample frames.

Our experiments also reveal some limitations of the current version of the model. Figure 25 shows a case where a rotation at the waist, which is not captured in our model, causes a tracking error. To improve the results in such cases, we can split the torso into two parts and add one rotation (about the z axis) between them. Such refinements can be easily incorporated in the proposed framework.

Figure 25. Tracking error due to a rotation at the waist that is not modeled in our body model. Left: original view. Right: the tracking result shown from two different angles.

9. Conclusion

We have presented a fully automated system for human body model acquisition and tracking using multiple cameras. The system does not perform the tracking directly on the image data, but on the 3D voxel reconstructions computed from the 2D foreground silhouettes. This approach removes all the computations related to the transition between the image planes and the 3D space from the tracking and model acquisition algorithms, making them simple and robust.

The advantages of this approach are twofold. First, the transition from the image space to the 3D space that the real person and the model inhabit is performed once, during the preprocessing stage, when the 3D voxel reconstructions are computed. Approaches where the image data is used directly for tracking require repeated projections of the model onto the image planes, often performed several times per frame. Second, analysis of the voxel data is in many ways simpler than analysis of the image data. Voxel data is in the same 3D space as the model and the real person. Therefore, the measurements made on the voxel data are very easily related to the parameters of the human body model, which makes the tracking algorithm simple and stable. Also, the dimensions and shapes of different body parts are the same in the data as in the real world, which leads to simple algorithms for locating body parts that rely on knowledge of their average shapes and sizes. For example, finding the head in the voxel data is an easy task, since the head has a unique and stable shape and size. However, this is a much harder problem when image data is analyzed directly: the head may be occluded in some views, and its size will depend on the relative distance to each of the cameras.

We are using the twists framework to describe the human body model. The constraints that ensure physically valid body configurations are inherent to the model, resulting in a non-redundant set of parameters. Only the constraints for joint angle limits have to be imposed during the tracking. The body is described in a reference configuration, where the axes of rotation in different joints coincide with basis vectors of the world coordinate system. The coordinates of any point on the body are easily computed from the model parameters using this framework, which results in a simple tracker formulation that uses the locations of different points as measurements to which the model is adjusted.

We have developed an automated model acquisition procedure that does not require any special movements by the tracked person. Model acquisition starts with an initial estimation of body part sizes and locations using a template fitting and growing procedure that takes advantage of our knowledge of the average shapes and sizes of body parts. Those estimates are then refined using a system whose main component is a Bayesian network, which incorporates the knowledge of human body proportions. The Bayesian network is inserted into the tracking loop, modifying the model as the tracking is performed.

The tracker relies on a hybrid voxel labeling procedure to obtain quality measurements. It combines minimization of the distance from the model prediction with template fitting to produce reliable labeling results, even for large frame-to-frame displacements.
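The distance-based half of such a labeling step can be sketched as assigning each voxel to the nearest predicted body part; the part names, centroid representation, and toy data here are illustrative assumptions (the paper's procedure additionally uses template fitting for large displacements):

```python
import numpy as np

def label_voxels(voxels, part_centroids):
    """Assign each voxel to the closest predicted body part (sketch only).

    voxels: (N, 3) array of voxel centers in world coordinates.
    part_centroids: dict mapping part name -> (3,) predicted part center.
    Returns a list of part names, one per voxel.
    """
    names = list(part_centroids)
    centers = np.array([part_centroids[n] for n in names])        # (P, 3)
    # Squared distance from every voxel to every predicted center.
    d2 = ((voxels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return [names[i] for i in d2.argmin(axis=1)]

voxels = np.array([[0.0, 0.0, 1.7], [0.0, 0.0, 1.0]])
parts = {"head": np.array([0.0, 0.0, 1.75]),
         "torso": np.array([0.0, 0.0, 1.1])}
labels = label_voxels(voxels, parts)   # -> ['head', 'torso']
```

Pure nearest-part assignment breaks down when the person moves far between frames, which is precisely why the hybrid procedure falls back on template fitting in those cases.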

We have conducted an extensive set of experiments, involving multiple people of heights ranging from 1.3 to 1.9 m, and complex motions ranging from sitting and walking to dancing and jumping. The system performs very reliably in capturing these different types of motion and in acquiring models for different people.

References

Bregler, C. 1997. Learning and recognizing human dynamics in video sequences. In IEEE International Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico.

Bregler, C. and Malik, J. 1998. Tracking people with twists and exponential maps. In IEEE International Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA.

Cheung, G., Kanade, T., Bouguet, J., and Holler, M. 2000. A real time system for robust 3D voxel reconstruction of human motions. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC, USA, vol. 2, pp. 714–720.

Covell, M., Rahimi, A., Harville, M., and Darrell, T. 2000. Articulated-pose estimation using brightness- and depth-constancy constraints. In IEEE Int. Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC, pp. 438–445.

Delamarre, Q. and Faugeras, O. 2001. 3D articulated models and multi-view tracking with physical forces. Computer Vision and Image Understanding (special issue on modeling people), 81(3):328–357.

Deutscher, J., Blake, A., and Reid, I. 2000. Articulated body motion capture by annealed particle filtering. In IEEE Int. Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC.

Deutscher, J., Davison, A., and Reid, I. 2001. Automatic partitioning of high dimensional search spaces associated with articulated body motion capture. In IEEE Int. Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii.

DiFranco, D., Cham, T., and Rehg, J. 2001. Reconstruction of 3D figure motion from 2D correspondences. In IEEE Int. Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii.

Gavrila, D. 1999. Visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73(1):82–98.

Gavrila, D. and Davis, L. 1996. 3D model-based tracking of humans in action: A multi-view approach. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, pp. 73–80.

Hilton, A. 1999. Towards model-based capture of persons shape, appearance and motion. In International Workshop on Modeling People at ICCV'99, Corfu, Greece.

Horprasert, T., Harwood, D., and Davis, L.S. 1999. A statistical approach for real-time robust background subtraction and shadow detection. In Proc. IEEE ICCV'99 FRAME-RATE Workshop, Kerkyra, Greece.

Howe, N., Leventon, M., and Freeman, W. 1999. Bayesian reconstruction of 3D human motion from single-camera video. In Neural Information Processing Systems, Denver, Colorado.

Hunter, E. 1999. Visual estimation of articulated motion using the expectation-constrained maximization algorithm. Ph.D. Dissertation, University of California, San Diego.

Hunter, E., Kelly, P., and Jain, R. 1997. Estimation of articulated motion using kinematically constrained mixture densities. In IEEE Nonrigid and Articulated Motion Workshop, San Juan, Puerto Rico.

Ioffe, S. and Forsyth, D. 2001. Human tracking with mixtures of trees. In IEEE International Conference on Computer Vision, Vancouver, Canada.

Isard, M. and Blake, A. 1996. Visual tracking by stochastic propagation of conditional density. In Proc. 4th European Conference on Computer Vision, Cambridge, England.

Jojic, N., Turk, M., and Huang, T. 1999. Tracking self-occluding articulated objects in dense disparity maps. In IEEE Int. Conference on Computer Vision, Corfu, Greece.

Jung, S. and Wohn, K. 1997. Tracking and motion estimation of the articulated object: A hierarchical Kalman filter approach. Real-Time Imaging, 3:415–432.

Kakadiaris, I. and Metaxas, D. 1996. Model-based estimation of 3D human motion with occlusion based on active multi-viewpoint selection. In Proc. IEEE International Conference on Computer Vision and Pattern Recognition, San Francisco, CA.

Kakadiaris, I. and Metaxas, D. 1998. Three-dimensional human body model acquisition from multiple views. International Journal of Computer Vision, 30(3):191–218.

Metaxas, D. and Terzopoulos, D. 1993. Shape and nonrigid motion estimation through physics-based synthesis. IEEE Trans. Pattern Analysis and Machine Intelligence, 15(6):580–591.

Mikic, I. 2002. Human body model acquisition and tracking using multi-camera voxel data. Ph.D. Dissertation, University of California, San Diego.

Mikic, I., Trivedi, M., Hunter, E., and Cosman, P. 2001. Articulated body posture estimation from multi-camera voxel data. In IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii.

Moeslund, T. and Granum, E. 2001. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81:231–268.

Murray, R., Li, Z., and Sastry, S. 1993. A Mathematical Introduction to Robotic Manipulation. CRC Press.

Plankers, R. and Fua, P. 1999. Articulated soft objects for video-based body modeling. In International Workshop on Modeling People at ICCV'99, Corfu, Greece.

Plankers, R. and Fua, P. 2001. Tracking and modeling people in video sequences. Computer Vision and Image Understanding, 81:285–302.

Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. 1993. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press.

Rehg, J. and Kanade, T. 1995. Model-based tracking of self-occluding articulated objects. In IEEE International Conference on Computer Vision, Cambridge.

Sminchiescu, C. and Triggs, B. 2001. Covariance scaled sampling for monocular 3D body tracking. In IEEE International Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii.

Szeliski, R. 1993. Rapid octree construction from image sequences. CVGIP: Image Understanding, 58(1):23–32.

Tsai, R. 1987. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, RA-3(4):323–344.

Wachter, S. and Nagel, H. 1999. Tracking persons in monocular image sequences. Computer Vision and Image Understanding, 74(3):174–192.

Wren, C. 2000. Understanding expressive action. Ph.D. Dissertation, Massachusetts Institute of Technology.

Yamamoto, M., Sato, A., Kawada, S., Kondo, T., and Osaki, Y. 1998. Incremental tracking of human actions from multiple views. In IEEE International Conference on Computer Vision and Pattern Recognition.