
Human Pose Estimation and Activity Recognition From Multi-View Videos: Comparative Explorations of Recent Developments

Michael B. Holte, Student Member, IEEE, Cuong Tran, Student Member, IEEE, Mohan M. Trivedi, Fellow, IEEE, and Thomas B. Moeslund, Member, IEEE

Abstract—This paper presents a review and comparative study of recent multi-view approaches for human 3D pose estimation and activity recognition. We discuss the application domain of human pose estimation and activity recognition and the associated requirements, covering: advanced human–computer interaction (HCI), assisted living, gesture-based interactive games, intelligent driver assistance systems, movies, 3D TV and animation, physical therapy, autonomous mental development, smart environments, sport motion analysis, video surveillance, and video annotation. Next, we review and categorize recent approaches which have been proposed to comply with these requirements. We report a comparison of the most promising methods for multi-view human action recognition using two publicly available datasets: the INRIA Xmas Motion Acquisition Sequences (IXMAS) Multi-View Human Action Dataset, and the i3DPost Multi-View Human Action and Interaction Dataset. To compare the proposed methods, we give a qualitative assessment of methods which cannot be compared quantitatively, and analyze some prominent 3D pose estimation techniques for applications where not only the performed action needs to be identified but also a more detailed description of the body pose and joint configuration is required. Finally, we discuss some of the shortcomings of multi-view camera setups and outline our thoughts on future directions of 3D body pose estimation and human action recognition.

Index Terms—3-D, comparative study, human action recognition, human pose estimation, i3DPost, INRIA Xmas Motion Acquisition Sequences (IXMAS), marker-less, multi-view, survey, view-invariance, vision-based, volumetric reconstruction.

Manuscript received August 02, 2011; revised December 06, 2011; accepted March 25, 2012. Date of publication May 01, 2012; date of current version August 10, 2012. This work was supported in part by the Danish National Research Councils (FTP) under the research project: “Big Brother is watching you!”, in part by the European Cooperation in Science and Technology under COST 2101 Biometrics for Identity Documents and Smart Cards, and in part by the European Community’s Seventh Framework Program (FP7/2007-2013) under Grant 211471 (i3DPost). This work was also supported in part by the UC Discovery Digital Media Innovation (DiMI) program, National Science Foundation, and Volkswagen. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Aydin Alatan.

M. B. Holte and T. B. Moeslund are with the Visual Analysis of People Laboratory, Department of Architecture, Design and Media Technology, Aalborg University (AAU), 9220 Aalborg, Denmark (e-mail: [email protected]; [email protected]).

C. Tran and M. M. Trivedi are with the Computer Vision and Robotics Research Laboratory (CVRR), University of California, San Diego (UCSD), La Jolla, CA 92093-0434 (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSTSP.2012.2196975

I. INTRODUCTION

“LOOKING at People” is a promising field within computer vision with many applications. Most rely on pose estimation (which in turn relies on human body modeling) and recognition. It is therefore interesting to get an overview of recent progress in these fields, including how the different methods compare. In recent years, a wide range of applications using 3D human pose estimation and activity recognition have been introduced. Several key applications are illustrated in Fig. 1, including:

• Advanced human-computer interaction (HCI): Beyond traditional media like the computer mouse and keyboard, it is desirable to develop better, more natural interfaces between intelligent systems and humans, in which understanding visual human gestures is an important channel. Examples include using hand movements to control presentation slides [1] or recognizing manufacturing steps to help workers learn and improve their skills [2].

• Assisted living: Pose estimation and activity recognition can also be applied in assisting handicapped and elderly people, as well as able-bodied people. Examples include a system that detects when a person falls [3] or a robot controlled by blinking [4].

• Gesture-based interactive games: Here the player uses non-intrusive body movements to interact with games, for example an interactive balloon game [5] or the well-known Microsoft Kinect for Xbox [6].

• Intelligent driver assistance systems: Looking at the driver is a key part of a holistic approach to intelligent driver assistance systems [7]. Examples of driver assistance systems using posture and behavior analysis are: monitoring driver awareness based on head pose tracking [8], combining driver head pose and hand tracking for distraction alerts [9], modeling driver foot behavior to mitigate pedal misapplications [10], developing a smart airbag system based on sitting posture analysis [11], and predicting driver turn intent [12].

• Movies, 3D TV and animation: Human motion capture is also applied extensively in movies, 3D TV and animation, for example in the Avatar movie, in a digital dance lesson [13], or for recording and representing data for 3D TV [14].

• Physical therapy: Modern biomechanics and physical therapy applications require the accurate capture of normal and pathological human movement without the artifacts of intrusive marker-based motion capture systems. Therefore, marker-less posture estimation and gesture analysis approaches have also been developed for this area [15], [16].



Fig. 1. Application domain of human pose estimation and activity recognition.

• Autonomous mental development: The study of the development of human mental capabilities by observing an agent's real-time interactions with the environment using its own sensors and effectors, e.g., studying the cognitive development and learning process of young children [17]. Instead of manually observing the data for analysis, such studies can utilize recent advances in pose estimation and activity analysis to automate the process and enable analysis at a larger scale.

• Smart environments: Spaces in which humans and the environment collaborate. Smart environments need to extract and maintain an awareness of a wide range of events and human activities occurring in these spaces [18], for example, monitoring the focus of attention and interaction of participants in a meeting room [19], [20].

• Sport motion analysis: Several sports like golf, ballet, or skating require accurate body posture and movement; therefore, posture estimation and gesture analysis can be applied in this area for performance analysis and training.

• Video surveillance: Video surveillance is used in many places such as critical infrastructure, public transportation, office buildings, parking lots, and homes. However, manually monitoring all these cameras is becoming infeasible. Therefore, approaches for automatic video surveillance, including outdoor human activity analysis, e.g., [21], [22], will be needed.

• Video annotation: With the development of hardware technology, very large amounts of video data can easily be stored, much of it human-related, such as surveillance videos, sport videos, or movies. Instead of manually scanning through these large video databases to find the needed information, human motion analysis can be used to annotate the videos, e.g., approaches to annotate video of a soccer game [23] or, more generally, outdoor sports broadcasts [24].

Many approaches have been proposed to comply with the requirements of these applications, based on different kinds of sensor systems for data acquisition: marker-based systems, laser-range scanners [25], structured light [26], time-of-flight (ToF) sensors [27], [28], the Microsoft Kinect sensor [6], and multi-camera systems [29]. Table I gives an overview of the application domain of human body modeling and motion analysis and the associated requirements. As can be seen, the requirements vary significantly depending on the desired application. This results in the need for approaches which can, e.g., operate at different abstraction levels, in uncontrolled environments, with high precision, under hard real-time constraints, and for large database search.


TABLE I: DIFFERENT APPLICATIONS AND THEIR REQUIREMENTS FOR 3D HUMAN POSE ESTIMATION AND ACTIVITY RECOGNITION APPROACHES

A number of surveys have been published during the last decade reviewing approaches for human motion capture, body modeling, pose estimation and activity recognition more generally [30]–[35], [25]. This paper differs from these in the sense that it focuses exclusively on recent work on multi-view human pose estimation and action recognition, based both on 2D multi-view data and on reconstructed 3D data acquired with standard cameras. Multi-view camera systems have the advantage that they enable full 3D reconstruction of the human body and, to some extent, handle self-occlusion. In contrast, single 3D imaging devices, like ToF sensors and the Kinect, only acquire the 3D surface structure visible from a single viewpoint. We give a more detailed description and comparison of some prominent and diverse 3D pose estimation techniques, which are representative of the contributions to this field. Additionally, we present a quantitative comparison of several promising multi-view human action recognition approaches using two publicly available datasets: the INRIA Xmas Motion Acquisition Sequences (IXMAS) Multi-View Human Action Dataset [36] and the i3DPost Multi-View Human Action and Interaction Dataset [29].

A. Human Pose Estimation

Vision-based pose estimation and tracking of the articulated human body is the problem of estimating the kinematic parameters of a body model (such as joint positions and joint angles) from a static image or a video sequence. Typically, the shape and dimensions of body parts are assumed fixed, and the interdependence between body parts consists only of the kinematic constraints at body joints. Related research in this area includes body pose estimation, hand pose estimation, and head pose estimation. Among these, the most extensive sub-field is body pose estimation, which refers to an articulated body model normally with torso, head, and four limbs, but without details of the hands, feet, or facial variation. Several important applications explicitly require detailed 3D posture information, including movies and 3D animation, sport motion analysis, physical therapy, as well as some applications in advanced HCI or smart environments (e.g., robot control or applications using pointing gestures). Moreover, the output 3D pose information is also a rich and view-invariant representation for action recognition [34]. Developing an efficient and robust body pose estimation system is, however, a challenging task. One major reason is the very high dimensionality of the pose configuration space, e.g., in [37], 19 degrees of freedom (DoF) are used for the body model and 27 DoF for the hand model. As concluded in [38], although human tracking is considered mostly solved in constrained situations, i.e., with a large number of calibrated cameras, people wearing tight clothes, and a static environment, key challenges remain, including tracking with fewer cameras, dealing with complex environments, variations in object appearance (e.g., general clothes, hair, etc.), automatically adapting to different body shapes, and automatically recovering from failure.

Surveys of techniques for human body pose modeling and tracking can be found in [31]–[33], [25], each with a different focus and taxonomy. Werghi [25] provided a general overview of both 3D human body scanner technologies and approaches dealing with such scanned data, focusing on one or more of the following topics: body landmark detection, segmentation of body scanned data, body modeling, and body tracking. Poppe [33] surveyed pose estimation techniques, distinguishing between 2D and 3D approaches, depending on whether the goal is a 2D or 3D pose representation, and between model-based and model-free approaches, depending on whether an a priori kinematic body model is employed. Moeslund et al. [31] split the pose estimation process into initialization, tracking, pose estimation, and recognition. In [32], they also provided an updated review of advances in human motion capture for the period from 2000 to 2006. We see that it is not easy to establish a unified taxonomy for the broad area of human body modeling and tracking. Quite similarly to [31], we categorize related research based on the common components of a generic body pose estimation system. As shown in Fig. 2, we first need a component to extract useful features from the input vision data, and then a procedure to infer body pose from the extracted features.


Fig. 2. Block diagram of a generic human body pose estimation system. The dashed line indicates that the underlying kinematic model may or may not be used. Gray boxes show the focus of this paper: model-based methods using voxel data that aim to extract full 3D posture.

In this paper, we focus on representative model-based approaches using multi-view video input that aim to extract real 3D posture. In comparison to a monocular view, multi-view data can help to reduce the self-occlusion issue and provide more information, making the pose estimation task easier and improving accuracy. The underlying kinematic body model in model-based approaches can help to improve accuracy and robustness, although it also raises the issue of model initialization and re-initialization.

B. Human Action Recognition

While 2D human action recognition has received high interest during the last decade, 3D human action recognition is still a largely unexplored field. Relatively few authors have so far reported work on 3D human action recognition [30], [32], [34], [35]. Human actions are performed in real 3D environments; however, traditional cameras only capture the 2D projection of the scene. Vision-based analysis of 2D activities carried out in the image plane will therefore only see a projection of the actual actions. As a result, the projection of the actions will depend on the viewpoint and not contain full information about the performed activities. To overcome this shortcoming, 3D representations of reconstructed 3D data have been introduced through the use of two or more cameras [39], [29], [40], [41], [36]. In this way, the surface structure or a 3D volume of the person can be reconstructed, e.g., by shape-from-silhouette (SfS) techniques [42], and thereby a more descriptive representation for action recognition can be established.
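As a concrete illustration of SfS, the following is a minimal voxel-carving sketch, assuming calibrated cameras given as 3x4 projection matrices and binary silhouette masks. All function and variable names are illustrative, and real systems use considerably more refined reconstruction than this naive intersection of silhouette cones:

```python
import numpy as np

def carve_voxels(silhouettes, projections, grid_min, grid_max, res=64):
    """Naive shape-from-silhouette: keep a voxel only if it projects
    inside the foreground silhouette of every calibrated view."""
    axes = [np.linspace(grid_min[i], grid_max[i], res) for i in range(3)]
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    pts = np.stack([xs, ys, zs, np.ones_like(xs)], axis=-1).reshape(-1, 4)

    occupied = np.ones(len(pts), dtype=bool)
    for sil, P in zip(silhouettes, projections):      # P is a 3x4 matrix
        uvw = pts @ P.T
        w = uvw[:, 2]
        valid = w > 1e-9                              # point lies in front of the camera
        u = np.zeros(len(pts), dtype=int)
        v = np.zeros(len(pts), dtype=int)
        u[valid] = np.round(uvw[valid, 0] / w[valid]).astype(int)
        v[valid] = np.round(uvw[valid, 1] / w[valid]).astype(int)
        h, wd = sil.shape
        inside = valid & (u >= 0) & (u < wd) & (v >= 0) & (v < h)
        fg = np.zeros(len(pts), dtype=bool)
        fg[inside] = sil[v[inside], u[inside]] > 0
        occupied &= fg                                # intersect the silhouette cones
    return occupied.reshape(res, res, res)
```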

The use of 3D data allows for efficient analysis of 3D human activities. However, we are still faced with the problem that the orientation of the subject in 3D space should be known. Therefore, approaches have been proposed without this assumption by introducing view-invariant or view-independent representations. Another strategy which has been explored is the application of multiple views of a scene to improve recognition by extracting features from different 2D image views, or to achieve view-invariance.

The ultimate goal is to be able to perform reliable action recognition applicable to, e.g., video annotation, advanced human–computer interaction, video surveillance, driver assistance, automatic activity analysis, and behavior understanding. We contribute to this field by providing a review and comparative study of recent research on 2D and 3D human action recognition for multi-view camera systems (see Table III), to give people interested in the field an easy overview of the proposed approaches, and an idea of the performance and direction of the research.

Fig. 3. Common steps in model-based methods for articulated human body pose estimation using multi-view input. Dashed boxes indicate steps that some methods may omit.

Methods for 3D human action recognition can be either model-free or model-based. Mostly a model-free strategy is applied, which has the advantage that it can use a wide range of image, static shape/pose, motion, or statistical features, and does not depend on a predefined human body model. However, such approaches usually do not capture any information about the 3D human body pose, joint positions, etc. This limits their usability to a specific set of applications where the exact pose and joint configuration of the body parts are not explicitly required (see Fig. 1 and Table I). In contrast, model-based methods, which require a human body model and are usually applied in conjunction with human body modeling and pose estimation, allow for a description of the exact pose of the respective body parts. This opens up another set of applications.

The remainder of the paper is organized as follows. Section II is a review of selected recent model-based methods for human body pose estimation using multi-view data. Section III gives a review of 2D and 3D approaches for human action recognition, followed by a description of multi-view datasets (Section IV) and a quantitative comparison of promising methods (Section V). Finally, in Section VI, we present a discussion and directions for future work.

II. 3D HUMAN POSE ESTIMATION

As mentioned in Section I-A, in this paper we focus on model-based approaches that use multi-view video input and aim to extract real 3D posture. Fig. 3 shows the common steps in model-based approaches for human pose estimation using multi-view input: camera calibration/data capture, voxel reconstruction, initialization/segmentation (segmenting voxel data into different body parts), modeling/estimation (estimating pose using the current frame only), and tracking (using temporal information from previous frames when estimating the body pose in the current frame). In each step, different methods may make different choices: there are methods using 3D features (e.g., voxel data) reconstructed from multiple views, while others use 2D features (e.g., silhouette, contour) extracted from each view. They may have a manual or automatic initialization step. Some methods have no tracking step. Some methods are generic, while others are application-specific for efficiency. Table II summarizes selected representative model-based methods for human body pose estimation using multi-view data (see Fig. 4). In the following sections, we discuss the factors mentioned above in more detail.


Fig. 4. Prominent 3D human body model and human motion representations [37], [43]–[45], [36].

A. Using 2D Versus 3D Features From Multi-View

Among multi-view approaches, some methods use 3D features reconstructed from multiple views [49], [55], [37], [56], [47], [46], [44], [45], e.g., volumetric (voxel) data, while others use 2D features [57]–[59], e.g., color, edges, and silhouettes. Since the real body pose is in 3D, using voxel data avoids the repeated projection of the 3D body model onto the image planes to compare against extracted 2D features. Furthermore, reconstructed voxel data avoid the image scale issue. These advantages allow the design of simple algorithms that make use of our knowledge about the shapes and sizes of body parts. For example, Mikic et al. [44] used specific information about the shape and size of the head and torso in a hierarchical growing procedure (detecting the head first, then the torso, then the limbs) for body model acquisition, which can be used effectively even when there are large displacements between frames. The fact that several methods are based on voxel data indicates that voxel data is a strong cue for body pose estimation. Of course, there is an additional computational cost for voxel reconstruction, but efficient techniques for this task have also been developed [49], [56], [47], [60].

B. Tracking-Based Versus Single Frame-Based Approaches

The modeling and tracking steps can be considered as a mapping M from the input space of voxel data V and information in the predefined model Φ (e.g., kinematic constraints) to the body model configuration space X:

x = M(V, Φ), x ∈ X (1)

The body model configuration contains both static parameters (i.e., the shape and size of each body component) and dynamic parameters (i.e., the mean position and orientation of each body component), where the static parameters are estimated in the initialization step. Methods differ in the way they use and implement the mapping procedure M. Methods that have a modeling step but no tracking step are called single frame-based methods, e.g., [45], while methods with a tracking step are called tracking-based methods, e.g., [49], [37], [52], [51], [53], [44], [5]. Because the tracker in tracking-based methods can lose track over long sequences, multiple hypotheses at each frame can be used to improve the robustness of tracking. The single frame-based approach is a harder problem because it makes no assumption of temporal coherence. On the other hand, tracking-based methods encounter the issue of initialization and re-initialization of the tracked model.
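To illustrate the distinction, here is a deliberately simplified toy instance of the mapping M in (1): body parts are reduced to blob centers fitted with k-means-style updates. This is not KC-GMM or any other cited method, and all names are illustrative. The single-frame variant re-initializes from the data each frame, while the tracking variant seeds each frame with the previous estimate:

```python
import numpy as np

def estimate_pose(points, n_parts, init=None, iters=20, seed=0):
    """Toy mapping M of (1): fit n_parts blob centers to voxel centers
    (points, shape (N, 3)). `init` carries the previous frame's
    configuration for tracking-based use; None triggers single-frame
    (re-)initialization from the data itself."""
    if init is None:
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), n_parts, replace=False)].astype(float)
    else:
        centers = np.array(init, dtype=float)
    for _ in range(iters):
        # Assign each voxel to its nearest blob center, then re-estimate.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_parts):
            members = points[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers

def track_sequence(frames, n_parts):
    """Tracking-based estimation: propagate the configuration over time."""
    poses, x = [], None
    for voxel_points in frames:
        x = estimate_pose(voxel_points, n_parts, init=x)
        poses.append(x)
    return poses
```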

C. Manual Versus Automatic Initialization

Some methods have an automatic initialization step, like [47], [54], [44], [45], and [5], while others require a priori known or manually initialized static parameters, e.g., [37], [46], [51], [53], and [48]. In [44], the specific shape and size of the head were used to design a hierarchical growing procedure for initialization. In [52], a database of human body shapes was used for initial pose-shape registration. In [50] and [5], the user is asked to start in a specific pose (e.g., a stretch pose) to aid the automatic initialization. In [45], Sundaresan et al. discovered an interesting property of the Laplacian Eigenspace (LE) transformation: by mapping into a high-dimensional (e.g., 6D) LE, voxel data of body chains like limbs, whose length is greater than their thickness, form smooth 1D curves, which can then be used to segment the voxel data into different body chains. They then use a spline-fitting process to segment the curves, which results in the segmentation of their respective body chains. This is, however, a single frame-based approach: the segmented voxel clusters are registered to their actual body chains using a probabilistic registration procedure at each frame. Their results seem to be sensitive to noise in the voxel data (loss of track in the test with the public HumanEvaII dataset).

TABLE II: SUMMARY OF SELECTED MODEL-BASED METHODS FOR MULTI-VIEW BODY POSE ESTIMATION AND TRACKING

On the other hand, the kinematically constrained Gaussian mixture model (KC-GMM) method proposed by Cheng and Trivedi [37] is a tracking-based method and showed good results on the HumanEvaII dataset (it won first prize in the CVPR EHuM2 2007 competition, the Workshop on Evaluation of Articulated Human Motion and Pose Estimation). However, it requires a careful manual initialization. A framework combining the KC-GMM method and LE-based voxel segmentation was proposed in [61] for a more powerful human body modeling and tracking system. The LE-based voxel segmentation fills the gap of automatic initialization for the KC-GMM method. Conversely, combining the LE-based method with a tracking-based method like KC-GMM, instead of performing voxel segmentation at every frame, helps to overcome its sensitivity to voxel noise to some extent.
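To make the embedding idea concrete, the following is a minimal sketch of a generic graph-Laplacian eigenmap over voxel adjacency, not Sundaresan et al.'s exact formulation; the k-NN graph construction, the 6D embedding dimension, and the assumption of a single connected voxel component are illustrative choices:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh
from scipy.spatial import cKDTree

def laplacian_embedding(voxel_centers, k=8, dim=6):
    """Embed voxel centers (N, 3) into a low-dimensional Laplacian
    eigenspace; elongated body chains (limbs) tend to map to smooth
    1D curves there. Assumes one connected component."""
    n = len(voxel_centers)
    tree = cKDTree(voxel_centers)
    _, idx = tree.query(voxel_centers, k=k + 1)   # neighbor 0 is the point itself
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()
    A = csr_matrix((np.ones(n * k), (rows, cols)), shape=(n, n))
    A = A.maximum(A.T)                            # symmetrize the k-NN adjacency
    L = laplacian(A, normed=True)
    # Smallest eigenpairs; the first (near-zero) eigenvector is trivial.
    vals, vecs = eigsh(L, k=dim + 1, which="SM")
    return vecs[:, 1:dim + 1]
```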


D. Generic-Purpose Versus Application-Specific Approaches for Efficiency

Depending on the application, human pose tracking may focus on different body parts, including full body pose, upper body pose [32], [33], hand pose [62], and head pose [63]. Due to the complexity of the human body pose estimation task, there are trade-offs between developing a generic approach and one tailored to specific cases for efficiency. For example, the KC-GMM method [37] is generic and was applied successfully to both HumanEvaII body data and synthesized hand data. However, this method is not real-time because of the required manual initialization step and the related computational cost. For efficiency, some methods are designed to be application-specific. For example, [5] and [50] focus on situations in which most of the influential information of body motion is carried by the upper body and arms, while the user is typically in a fixed position. These situations arise in several realistic applications, such as driver activity analysis and user activity analysis in a smart teleconference or meeting room. In [5], the problem of upper body pose tracking is broken into two subproblems: first, the extremities, i.e., head and hand blobs, are tracked; then the 3D movements of the head and hands are used to infer the corresponding upper body movements as an inverse kinematics problem (illustrated below). Since the head and hand regions are typically well defined and undergo less occlusion, tracking is more reliable. Moreover, by breaking the high-dimensional search problem of upper body pose tracking into two subproblems, the complexity is reduced considerably, achieving real-time performance. However, they need to deal with possible ambiguity due to the kinematic redundancy of the body model.
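The inverse-kinematics step in such a head/hand-driven decomposition can be illustrated with the classic planar two-link arm solution. This is a textbook simplification, not the cited system's 3D solver, and it makes the ambiguity concrete: mirroring the elbow angle gives a second, equally valid configuration for the same hand position:

```python
import numpy as np

def two_link_ik(target, l1, l2):
    """Planar 2-link inverse kinematics: given the hand position relative
    to the shoulder and the limb lengths l1, l2, recover shoulder and
    elbow angles (elbow-down branch; negating `elbow` gives the mirrored
    solution, i.e., the kinematic redundancy mentioned above)."""
    x, y = target
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    c2 = np.clip(c2, -1.0, 1.0)        # clamp for unreachable targets
    elbow = np.arccos(c2)
    shoulder = np.arctan2(y, x) - np.arctan2(l2 * np.sin(elbow),
                                             l1 + l2 * np.cos(elbow))
    return shoulder, elbow
```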

Another way to gain efficiency is to use a prior motion model learned from training sequences. Representative approaches include [49], which learns a prior motion model with a variable length Markov model (VLMM) that can explain high-level behaviors over a long history, and [53], which uses a coordinated mixture of factor analyzers to learn the prior model. Compared to approaches for generic body motions [37], [52], [51], [54], [44], these approaches use the prior motion models to reduce the search space for more efficient and robust pose tracking. The downside is that these methods are limited to the types of motion in the training data (i.e., they have difficulties with “unseen” movements).

III. MULTI-VIEW HUMAN ACTION RECOGNITION

In this section, we review and compare multi-view approaches for human action recognition (see Table III). First, we will give an outline of approaches which solely apply 2D multi-view image data, and then full 3D-based techniques.

A. 2D Approaches

One line of work concentrates solely on the 2D image data acquired by multiple cameras. Action recognition can range from pointing gestures to complex multi-signal actions, e.g., including both coarse-level body movement and fine-level hand gestures.

1) Shape and Silhouette Features: In the work of Souvenir et al. [75], the data acquired from five calibrated and synchronized cameras is further projected to 64 evenly spaced virtual cameras used for training. Actions are described in a view-invariant manner by computing R-transform surfaces of silhouettes and manifold learning. Gkalelis et al. [80] exploit the circular-shift invariance property of discrete Fourier transform (DFT) magnitudes (verified numerically in the short check below), and use fuzzy vector quantization (FVQ) and linear discriminant analysis (LDA) to represent and classify actions from multi-view silhouettes. Another approach was proposed by Iosifidis et al. [83], where binary body masks from frames of a multi-camera setup, used to produce the i3DPost Multi-View Human Action Dataset [29], are concatenated into multi-view binary masks. These masks are rescaled and vectorized to create feature vectors in the input space. FVQ is performed to associate input feature vectors with movement representations, and LDA is used to map movements into a low-dimensional discriminant feature space.
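The invariance property used in [80] is easy to verify numerically: a circular shift of a 1D signature (e.g., caused by a body rotation in a circular silhouette sampling) leaves the DFT magnitudes unchanged. The signature below is random data, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(64)                    # e.g., a circular silhouette signature
shifted = np.roll(x, 17)              # circular shift, e.g., from body rotation
assert np.allclose(np.abs(np.fft.fft(x)), np.abs(np.fft.fft(shifted)))
print("DFT magnitudes are invariant to circular shifts")
```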

2) Motion Features: Some authors perform action recognition using motion features or a combination of static shape and motion features from image sequences at different viewing angles. Ahmad et al. [65] apply principal component analysis (PCA) to optical flow velocity and human body shape information, and then represent each action using a set of multi-dimensional discrete hidden Markov models (HMMs) for each action and viewpoint. Matikainen et al. [90] proposed a method for multi-user, prop-free pointing detection using two camera views. The observed motion is analyzed and used to infer candidate pointing rotation centers and then estimate the 2D pointer configurations in each image. Based on the extrinsic camera parameters, these 2D pointer configurations are merged across views to obtain 3D pointing vectors. Cherla et al. [70] show how view-invariant recognition can be performed by fusing data from two orthogonal views. An action basis is built using eigenanalysis of walking sequences of different people, and projections of the width profile of the actor and spatio-temporal features are applied. Finally, dynamic time warping (DTW) is used for recognition.
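Dynamic time warping, used in [70] for the final matching, can be sketched in a few lines; this is the standard textbook recursion, not the authors' exact implementation:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two feature sequences a (n, d) and b (m, d):
    minimal cumulative frame-to-frame cost over monotone alignments."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```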

3) Synthetic Training Data: Others use synthetic data rendered from a wide range of viewpoints to train their model and then classify actions in a single view, e.g., Lv et al. [68], where shape context is applied to represent key poses from silhouettes, and Viterbi path searching is used for classification. A similar approach was proposed by Fihl et al. [91] for gait analysis.

4) Cross-View Recognition: Another topic which has been explored by several authors over the last couple of years is cross-view action recognition. This is the difficult task of recognizing actions by training on one view and testing on a completely different view (e.g., the side view versus the top view of a person in IXMAS). A number of techniques have been proposed, ranging from multiple features [73] and information maximization [74] to dynamic scene geometry [85], self-similarities [72], [86], and transfer learning [71], [87].

5) Other Techniques: A number of other techniques have been employed, like metric learning [76], representing actions by feature trees [81], or ballistic dynamics [78]. In [84], Weinland et al. propose an approach which is robust to occlusions and viewpoint changes, using local partitioning and hierarchical classification of 3D histogram of oriented gradients (3DHOG) volumes. For additional related work on view-invariant approaches, please refer to the recent survey by Ji et al. [30].


TABLE III: PUBLICATIONS ON MULTI-VIEW HUMAN ACTION RECOGNITION

B. 3D Approaches

Another line of work utilizes the fully reconstructed 3D data for feature extraction and description. Fig. 4 shows some examples of the more prominent model-based and non-model-based representations of the human body and its motion. These are reviewed in the following, along with a number of other recent 3D approaches.

1) 3D Shape and Pose Features: Johnson and Hebert proposed the spin image [92], and Osada et al. the shape distribution [93]. Ankerst et al. introduced the shape histogram [94], which is similar to the 3D extended shape context [95] presented by Körtgen et al. [96], and Kazhdan et al. applied spherical harmonics to represent the shape histogram in a view-invariant manner [97]. Later, Huang et al. extended the shape histogram with color information [98]. Recently, Huang et al. compared these shape descriptors combined with self-similarities, with the shape histogram (3D shape context) as the top performing descriptor [82].

2) Temporal Information and Alignment: A common characteristic of all these approaches is that they are solely based on static features, like shape and pose description, while the most popular and best performing 2D image descriptors apply motion information or a combination of the two [32], [35]. Some authors add temporal information by capturing the evolution of static descriptors over time, i.e., shape and pose changes [66], [99], [64], [24], [67], [36], [69], [79]. The common trends are to accumulate static descriptors over time, track human shape or pose information, or apply sliding windows to capture the temporal contents [32], [67], [36], [35].

Recently, Huang et al. proposed 3D shape matching in temporal sequences by time filtering and shape flows [82]. Kilner et al. [24] applied the shape histogram and evaluated similarity measures for action matching and key-pose detection at sporting events, using 3D data available in the multi-camera broadcast environment. Cohen et al. [99] use 3D human body shapes and support vector machines (SVMs) for view-invariant identification of human body postures. They apply a cylindrical histogram and compute an invariant measure of the distribution of reconstructed voxels, which was later used by Pierobon et al. [67] for human action recognition. Another example is seen in the work of Huang and Trivedi [64], where a 3D cylindrical shape context is presented to capture the human body configuration for gesture analysis of volumetric data. The temporal information of an action is modeled using an HMM. However, this study does not address the view-independence aspect. Instead, the subjects are asked to rotate while training the system.

Pehlivan et al. [88] presented a view-independent representation based on human poses. The volume of the human body is first divided into a sequence of horizontal layers, and then the intersections of the body segments with each layer are coded with enclosing circles. Three circular features are computed in all layers: 1) the number of circles; 2) the area of the outer circle; and 3) the area of the inner circle. These are used to generate a pose descriptor, and the pose descriptors of all frames in an action sequence are combined to generate corresponding motion descriptors. Action recognition is then performed with a simple nearest neighbor classifier.
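A rough sketch of such layer-wise circular features follows, assuming a binary voxel volume with the vertical axis last. The enclosing circle is approximated as centered on each segment's centroid, which simplifies Pehlivan et al.'s construction; names and feature choices are illustrative:

```python
import numpy as np
from scipy import ndimage

def layer_circle_features(volume):
    """For each horizontal slice of a binary volume (X, Y, Z): the number
    of body segments, and the areas of the enclosing circles of the
    largest and smallest segments."""
    feats = []
    for z in range(volume.shape[2]):
        labels, n = ndimage.label(volume[:, :, z])
        if n == 0:
            feats.append((0, 0.0, 0.0))
            continue
        areas = []
        for c in range(1, n + 1):
            ys, xs = np.nonzero(labels == c)
            cx, cy = xs.mean(), ys.mean()
            # Radius of a circle centered on the centroid that encloses the segment.
            r = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2).max()
            areas.append(np.pi * r ** 2)
        feats.append((n, max(areas), min(areas)))
    return np.array(feats)
```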

3) Model-Based Human Pose Tracking: More detailed 3D pose information (i.e., from tracking the kinematic model of the human body) is a rich and view-invariant representation for action recognition, but challenging to derive [34]. Human body pose tracking is itself an important area with many related research studies. Research started with monocular views and 2D features; more recently (roughly the past decade), multi-view setups and 3D features like volumetric data have been applied for body pose estimation and tracking [100]. One of the earliest methods for multi-view 3D human pose tracking using volume data was proposed by Mikic et al. [44], in which a hierarchical procedure starts by locating the head using its specific shape and size, and then grows to other body parts. Though this method showed good visual results for several complex motion sequences, it is quite computationally expensive.

Cheng and Trivedi [37] proposed a method that incorporates the kinematic constraints of a human body model into a Gaussian mixture model framework, which was applied to track both body and hand models from volume data. Although this method was highly rated, with good body tracking accuracy on the HumanEva dataset [41], it requires a manual initialization and cannot run in real time. We see that there are always trade-offs between achieving detailed information of human body pose on the one hand, and the computational cost as well as the robustness on the other. In [89], Song et al. focus on gestures with more limited body movements. They therefore only use the depth information from two camera views to track 3D upper body poses using a Bayesian inference framework with a particle filter, as well as classifying several hand poses based on their appearance. The temporal information of both upper body and hand pose is then input into a hidden conditional random field (HCRF) framework for aircraft handling gesture recognition. To deal with the long-range temporal dependencies in some gestures, they also incorporate a Gaussian temporal smoothing kernel into the HCRF inference framework.

4) 3D Motion Features: The motion history volume (MHV) was proposed by Weinland et al. [36] as a 3D extension of motion history images (MHIs) (see Fig. 4). MHVs are created by accumulating static human postures over time in a cylindrical representation, which is made view-invariant with respect to the vertical axis by applying the Fourier transform in cylindrical coordinates. The same representation was used by Turaga et al. [77] in combination with more sophisticated action learning and classification based on Stiefel and Grassmann manifolds. Later, Weinland et al. [69] proposed a framework where actions are modeled using 3D occupancy grids, built from multiple viewpoints, in an exemplar-based HMM. Learned 3D exemplars are used to produce 2D image information which is compared to the observations; hence, 3D reconstruction is not required during the recognition phase.
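The MHV recursion mirrors the familiar 2D MHI update applied per voxel; a minimal sketch, assuming a stream of binary occupancy volumes (the cylindrical re-sampling and Fourier magnitude step that yields rotation invariance around the vertical axis is omitted here):

```python
import numpy as np

def update_mhv(mhv, occupancy, tau=20):
    """One motion-history-volume step: occupied (moving) voxels are set
    to tau; all other voxels decay by one per frame (cf. 2D MHIs)."""
    return np.where(occupancy, tau, np.maximum(mhv - 1, 0))

# Usage over a toy sequence of random binary 64^3 occupancy grids:
mhv = np.zeros((64, 64, 64), dtype=int)
for occ in (np.random.rand(64, 64, 64) > 0.99 for _ in range(5)):
    mhv = update_mhv(mhv, occ)
```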

Canton-Ferrer et al. [66] propose another view-invariant representation based on 3D MHIs and 3D invariant statistical moments. A different strategy is presented by Yan et al. [79]. They propose a 4D action feature model (4D-AFM) for recognizing actions from arbitrary views based on spatio-temporal features of spatio-temporal volumes (STVs). The extracted features are mapped from the STVs to a sequence of reconstructed 3D visual hulls over time, resulting in the 4D-AFM model, which is used for matching actions.

Two 3D descriptors directly based on rich, detailed motion information are the 3D motion context (3D-MC) [43] and the harmonic motion context (HMC) [43] proposed by Holte et al. The 3D-MC descriptor is a motion-oriented 3D version of the shape context [95], [96], which incorporates motion information implicitly by representing estimated 3D optical flow (see Fig. 4) as embedded histograms of 3D optical flow (3D-HOF) in a spherical histogram. The HMC descriptor is an extended version of the 3D-MC descriptor that achieves view-invariance by decomposing the representation into a set of spherical harmonic basis functions.
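A simplified version of binning 3D flow vectors into a direction histogram on the sphere is sketched below, in the spirit of 3D-HOF but not the exact 3D-MC/HMC construction; the bin counts and normalization are arbitrary illustrative choices:

```python
import numpy as np

def spherical_flow_histogram(flow, n_azimuth=8, n_elevation=4):
    """Histogram 3D flow vectors (N, 3) by direction on the sphere,
    weighted by magnitude (a basic 3D histogram of optical flow)."""
    mag = np.linalg.norm(flow, axis=1)
    valid = mag > 1e-6
    f, m = flow[valid], mag[valid]
    azimuth = np.arctan2(f[:, 1], f[:, 0])               # in [-pi, pi]
    elevation = np.arcsin(np.clip(f[:, 2] / m, -1, 1))   # in [-pi/2, pi/2]
    a_bin = ((azimuth + np.pi) / (2 * np.pi) * n_azimuth).astype(int) % n_azimuth
    e_bin = np.clip(((elevation + np.pi / 2) / np.pi * n_elevation).astype(int),
                    0, n_elevation - 1)
    hist = np.zeros((n_azimuth, n_elevation))
    np.add.at(hist, (a_bin, e_bin), m)                   # magnitude-weighted votes
    hist /= hist.sum() + 1e-12                           # normalize to unit mass
    return hist.ravel()
```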


Fig. 5. Image and 3D voxel-based volume examples for the 13 actions from the IXMAS Multi-View Human Action Dataset. The figure is organized such that the columns correspond to the 13 different actions performed by the 12 actors. The first five rows depict images captured from the five camera views, while the sixth row shows the corresponding 3D volumes.

IV. MULTI-VIEW DATASETS

This section presents a description of popular multi-view datasets. A number of multi-view human action datasets are publicly available; a frequently used one is the IXMAS Multi-View Human Action Dataset1 [36]. It consists of 12 nonprofessional actors performing 13 daily-life actions 3 times: check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, point, pick up, and throw. The dataset has been recorded by five calibrated and synchronized cameras, where the actors freely chose their position and orientation, and consists of image sequences (390 × 291 pixels) and reconstructed 3D volumes (64 × 64 × 64 voxels), resulting in a total of 2340 action instances over all five cameras. Fig. 5 shows multi-view actor/action images and voxel-based volume examples from the IXMAS dataset.

Recently, a new high-quality dataset has been produced: the i3DPost Multi-View Human Action and Interaction Dataset2 [29]. This dataset, which has been generated within the Intelligent 3D Content Extraction and Manipulation for Film and Games EU-funded research project, consists of eight actors performing ten different actions, where six are single actions: walk, run, jump, bend, hand-wave and jump-in-place, and four are combined actions: sit-stand-up, run-fall, walk-sit and run-jump-walk. Additionally, the dataset also contains two interactions:

1 The IXMAS dataset is available at http://4drepository.inrialpes.fr/public/viewgroup/6

2 The i3DPost dataset is available at http://kahlan.eps.surrey.ac.uk/i3dpost_action.

handshake and pull, and six basic facial expressions. The subjects have different body sizes and clothing, and are of different sexes and nationalities. The multi-view videos have been recorded by a calibrated and synchronized eight-camera setup in high-definition resolution (1920 × 1080), resulting in a total of 640 videos (excluding videos of interactions and facial expressions). For each video frame, a 3D mesh model of relatively high detail (20 000–40 000 vertices and 40 000–80 000 triangles) of the actor and the associated camera calibration parameters are available. The mesh models were reconstructed using a global optimization method proposed by Starck and Hilton [42]. Fig. 6 shows multi-view actor/action images and 3D mesh model examples from the i3DPost dataset.

Another interesting multi-view dataset is the Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion (HumanEva) [41], containing six simple actions performed by four actors, captured by seven calibrated video cameras (four grayscale and three color), which have been synchronized with 3D body poses obtained from a motion capture system. Among other less frequently used multi-view datasets are the CMU Motion of Body (MoBo) Database [101], the Multi-camera Human Action Video Dataset (MuHAVi) [39] and the KU Gesture Dataset [40].

V. COMPARISON

In this section, we report a qualitative and quantitative comparison of several reviewed methods for human action recognition, based on evaluations on two publicly available datasets: the IXMAS Multi-View Human Action Dataset [36] and the i3DPost Multi-View Human Action and Interaction Dataset [29].

TABLE IV: RECOGNITION ACCURACIES (%) FOR THE IXMAS DATASET. THE COLUMN NAMED “DIM” STATES WHETHER THE METHODS APPLY 2D IMAGE DATA OR 3D DATA; THE OTHER COLUMNS STATE HOW MANY ACTIONS ARE USED FOR EVALUATION, AND WHETHER THE RESULTS ARE BASED ON ALL VIEWS OR CROSS-VIEW RECOGNITION

Fig. 6. Image and 3D mesh model examples for the ten actions from the i3DPost Multi-View Human Action Dataset. The figure is organized such that the columns correspond to the ten different actions performed by the eight actors, where the first six columns show the single actions and the last four columns show the combined actions. The first eight rows depict images captured from the eight camera views, while the ninth row shows the corresponding 3D mesh models.

In Table IV, the recognition accuracies of several 2D and 3D approaches evaluated on IXMAS are listed. It is interesting to note that all the 3D approaches except one are among the top performing methods. In particular, the methods proposed by Turaga et al. [77] and Weinland et al. [36], which are both based on motion history volumes (MHVs), produce superior recognition accuracies. The approach in [77] is based on the prior work of Weinland et al., but applies more sophisticated action learning and classification based on Stiefel and Grassmann manifolds, which leads to a significant improvement.

Another interesting approach with high performance is the work by Pehlivan et al. [88], which uses 3D pose features represented by horizontal circular pose features over time. This shows that methods based on full 3D shape and pose information are also promising 3D action recognition strategies. In contrast, the work on 3D action recognition of Yan et al. [79], where a 4D action feature model (4D-AFM) based on spatio-temporal features of spatio-temporal volumes (STVs) is developed, results in a lower recognition rate than some 2D methods. A reason might be that the low-quality multi-view video data of IXMAS produces noisy sequences of reconstructed 3D visual hulls over time, and therefore distorts the extraction of finer-detailed features from the sequence of STVs.

The best performing 2D methods are the approaches proposed by Vitaladevuni et al. [78], Weinland et al. [84] and Liu et al. [74]. In [78], Vitaladevuni et al. use motion history image features and ballistic dynamics, where actions are represented and classified by a Bayesian model. Both [74] and [84] use local features in the form of spatio-temporal interest points: [74] extracts a bag of visual words of cuboid features, while [84] applies local partitioning and hierarchical classification of 3D histogram of oriented gradients volumes. In both cases, support vector machines are applied for classification. Moreover, Weinland et al. directly show their method's robustness to occlusions and viewpoint changes, and report near real-time performance.

This evaluation indicates that using the fully reconstructed 3D information is superior to applying 2D image data from multiple views when it comes to recognition accuracy. However, the computational cost of working in 3D is usually also higher. Hence, with respect to the application and the demand for real-time performance, 2D approaches might still be the best choice. It should be noted that some results are reported using cross-view evaluation, which is more challenging than applying data from multiple and identical viewpoints; still, some of these methods perform very well. In particular, the approaches proposed by Haq et al. [85], using dynamic scene geometry, and Liu et al. [87], adopting a transfer learning model, give superior cross-view recognition. When both types of results are available in the original work, we have reported the results for all views, since these are more comparable to the 3D results, where all views are used to reconstruct the 3D data.

TABLE V: RECOGNITION ACCURACIES (%) FOR THE i3DPOST DATASET. *GKALELIS ET AL. [80] TEST ON FIVE SINGLE ACTIONS

Table V shows the recognition accuracies of a few other approaches evaluated on the i3DPost dataset. The evaluation has been carried out for eight actions by combining the six single actions in the dataset with two additional single actions, sit down and fall, obtained by splitting two of the four combined actions. Again, the approach based on full 3D motion information in the form of 3D optical flow and HMC by Holte et al. [43] outperforms the 2D methods by Gkalelis et al. [80] and Iosifidis et al. [83], which both use shape features from multi-view body masks of 2D human silhouettes, fuzzy vector quantization and linear discriminant analysis. This strengthens the similar outcome of the more extensive comparison using IXMAS.

Generally, the top performing approaches for the two datasets are the 3D methods based on 3D motion features by Turaga et al. [77], Weinland et al. [36] and Holte et al. [43]. However, it should be noted that all these methods for human action recognition are basically model-free, which means that they do not apply a specific human body model to model and estimate the exact position and configuration of the body parts and joints. Hence, these methods are only applicable to a subset of the applications in Table I. This results in a need for model-based approaches for 3D pose estimation and exact modeling of the human body.

VI. DISCUSSION AND FUTURE DIRECTIONS

In this paper, we provide a review and comparative study of recent developments in human pose estimation and activity recognition using multi-view data. We give an overview of the different application areas and their associated requirements for successful operation.

First, we review the subarea of model-based methods for full human body pose estimation using volumetric data. After a brief overview to put the topic into context, we focus on analyzing and comparing several selected methods, especially recent methods proposed in the past two years, to highlight their important results. These include increasing generality, real-time performance, and a new general LE-based method for voxel segmentation. There are some related open-ended research areas that should be mentioned. The first is the issue of human body pose estimation at multiple levels (e.g., body level, head level, hand level), which was raised in [102]. The benefits of such a multi-level human body pose estimation system are twofold: combined information from different levels of detail is more useful (e.g., in an intelligent environment, the combination of body pose, hand pose, and head pose would give a better interpretation of human status and intention), and information from the different levels can complement each other and help to improve estimation performance. However, typical approaches in this area deal with each of these tasks (body pose estimation, hand pose estimation, head pose estimation, etc.) separately. Therefore, it is useful to conduct further studies analyzing why typical approaches only handle one task at a time, and to find a way to achieve the goal of a full body model (e.g., including body, head, and hands). Another related open-ended research area that deserves more emphasis is the issue of pose estimation and tracking of multiple objects simultaneously.
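For intuition on the LE-based voxel segmentation mentioned above, the following sketch is ours and only loosely in the spirit of Laplacian-eigenspace segmentation such as [45]: the voxel "body" is synthetic, and k-means clustering stands in for the curve-fitting used in actual systems. It embeds the voxel adjacency graph via graph-Laplacian eigenvectors and groups voxels in that space:

    # Laplacian-eigenspace voxel grouping (toy illustration).
    import numpy as np
    from sklearn.manifold import SpectralEmbedding
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    # Toy "body": a vertical tube of voxels (torso) plus a perpendicular tube (arm).
    torso = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1, 300),
                             rng.uniform(0, 40, 300)])
    arm = np.column_stack([rng.uniform(0, 25, 150), rng.normal(0, 1, 150),
                           rng.normal(40, 1, 150)])
    voxels = np.vstack([torso, arm])

    # Spectral embedding = eigenvectors of the Laplacian of the k-NN voxel graph.
    coords_le = SpectralEmbedding(n_components=3, affinity="nearest_neighbors",
                                  n_neighbors=10, random_state=0).fit_transform(voxels)

    # Cluster in eigenspace to obtain a rough body-part labeling.
    parts = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(coords_le)
    print("voxels per segment:", np.bincount(parts))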

Next, the subarea of multi-view action recognition is reviewed, covering both 2D and 3D multi-view approaches as well as publicly available multi-view datasets. A qualitative comparison of several promising approaches based on the IXMAS and i3DPost datasets reveals that methods using 3D representations of the data outperform the 2D methods. The main strength of multi-view setups is the high-quality, full-volume 3D data which can be provided by 3D reconstruction using shape-from-silhouettes and refinement techniques. It also helps to uncover occluded action regions from different views in the global 3D data, and allows for the extraction of informative features in a richer 3D space than the one captured from a single view. However, although the reviewed approaches show promising results for multi-view human pose estimation and action recognition, 3D data reconstructed from multi-view camera systems has some shortcomings. First of all, the quality of the silhouettes is crucial for the outcome of shape-from-silhouettes; shadows, holes, and other errors due to inaccurate foreground segmentation will affect the final quality of the reconstructed 3D data. Second, the number of views and the image resolution will influence the achievable level of detail, and self-occlusion is a known problem when reconstructing 3D data from multi-view image data, resulting in merged body parts. Finally, 3D data can only be reconstructed in the limited space where multiple camera views overlap.
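The core of shape-from-silhouettes can be stated in a few lines: a voxel belongs to the visual hull only if it projects inside the foreground silhouette in every calibrated view. The sketch below is ours; the projection matrix and silhouette in the usage example are toy stand-ins for calibrated cameras and real foreground masks:

    # Minimal shape-from-silhouettes (visual hull) carving.
    import numpy as np

    def carve(voxels, proj_mats, silhouettes):
        """voxels: (N,3); proj_mats: list of 3x4 matrices; silhouettes: list of HxW bools."""
        keep = np.ones(len(voxels), dtype=bool)
        hom = np.hstack([voxels, np.ones((len(voxels), 1))])    # homogeneous coordinates
        for P, sil in zip(proj_mats, silhouettes):
            uvw = hom @ P.T                                     # project into the image
            u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
            v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
            h, w = sil.shape
            inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
            keep &= inside                                      # outside the image: carve away
            keep[inside] &= sil[v[inside], u[inside]]           # background pixel: carve away
        return voxels[keep]

    # Toy usage: a 20^3 voxel grid, one camera looking down z, a square silhouette.
    grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 20)] * 3), -1).reshape(-1, 3)
    P = np.array([[50., 0., 0., 60.], [0., 50., 0., 60.], [0., 0., 0., 1.]])
    sil = np.zeros((120, 120), dtype=bool)
    sil[30:90, 30:90] = True
    print("voxels kept:", len(carve(grid, [P], [sil])))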

In recent years, other prominent vision-based sensors for acquiring 3D data have been developed. ToF range cameras, which are single sensors capable of measuring depth information, have become popular in the computer vision community. Especially with the introduction of the Microsoft Kinect sensor [6], these single, direct 3D imaging devices have become widespread and commercially available at low cost. Their applicability is broader due to the convenience of using a single sensor, which avoids the difficulties inherent to classical stereo and multi-view approaches (the correspondence problem, careful camera placement, and calibration). However, in contrast to the rich full-volume 3D data which can be provided by 3D reconstruction from multi-view data, these sensors only capture 3D data of the frontal surfaces of humans and other objects. Additionally, these sensors are usually limited to a range of about 6–7 meters, and the estimated data can become distorted by scattered light from reflective surfaces.
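For comparison with full-volume reconstruction, the following sketch illustrates why such sensors yield frontal-surface data only: each depth pixel back-projects to a single 3D point along its viewing ray. The pinhole intrinsics (fx, fy, cx, cy) are assumed, Kinect-like values:

    # Back-projecting a depth image into a 3D point cloud (pinhole model).
    import numpy as np

    def depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
        """depth: (H,W) array of metric depths; zeros mark invalid pixels."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))          # pixel coordinates
        x = (u - cx) * depth / fx                               # back-project along rays
        y = (v - cy) * depth / fy
        pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
        return pts[pts[:, 2] > 0]                               # drop invalid (zero) depths

    # Toy usage: a flat wall two meters from the sensor.
    cloud = depth_to_points(np.full((480, 640), 2.0))
    print(cloud.shape)                                          # (307200, 3)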


ACKNOWLEDGMENT

The authors would like to thank their colleagues at the CVRR-LISA Laboratory, especially Dr. S. Cheng, Dr. K. Huang, and Dr. I. Mikic, for their valuable inputs, and I. Pitas and N. Nikolaidis, Informatics and Telematics Institute, Center for Research and Technology Hellas, Greece, and Department of Informatics, Aristotle University of Thessaloniki, Greece, for their support on the i3DPost dataset.

REFERENCES

[1] H. Lee and J. H. Kim, "An HMM-based threshold model approach for gesture recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 10, pp. 961–973, 1999.
[2] A. Postawa, M. Kleinsorge, J. Krueger, and G. Seliger, "Automated image based recognition of manual work steps in the remanufacturing of alternators," Adv. Sustain. Manuf., vol. 5, pp. 209–214, 2011.
[3] C. Rougier, J. Meunier, A. St-Arnaud, and J. Rousseau, "Fall detection from human shape and motion history using video surveillance," in Proc. 21st Int. Conf. Adv. Inf. Netw. Applicat. Workshops, 2007.
[4] A. Alonso, R. Rosa, L. Val, M. Jimenez, and S. Franco, "A robot controlled by blinking for ambient assisted living," in Proc. 10th Int. Work-Conf. Artif. Neural Netw., Part II: Distrib. Comput., Artif. Intell., Bioinf., Soft Comput., Ambient Assist. Living, 2009.
[5] C. Tran and M. Trivedi, "Introducing XMOB: Extremity movement observation framework for upper body pose tracking in 3D," in Proc. IEEE Int. Symp. Multimedia, 2009, pp. 446–447.
[6] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in Proc. Comput. Vis. Pattern Recognit., 2011.
[7] M. Trivedi and S. Cheng, "Holistic sensing and active displays for intelligent driver support systems," IEEE Comput. Mag., vol. 40, no. 5, pp. 60–68, May 2007.
[8] E. Murphy-Chutorian and M. Trivedi, "Head pose estimation and augmented reality tracking: An integrated system and evaluation for monitoring driver awareness," IEEE Trans. Intell. Transport. Syst., vol. 11, no. 2, pp. 300–311, Jun. 2010.
[9] C. Tran and M. Trivedi, "Driver assistance for 'keeping hands on the wheel and eyes on the road'," in Proc. IEEE Int. Conf. Vehicular Electronics and Safety, 2009, pp. 97–101.
[10] C. Tran, A. Doshi, and M. Trivedi, "Pedal errors prediction by driver foot gesture analysis: A vision-based inquiry," in Proc. IEEE Intell. Veh. Symp., 2011, pp. 577–582.
[11] M. Trivedi, S. Cheng, E. Childers, and S. Krotosky, "Occupant posture analysis with stereo and thermal infrared video: Algorithms and experimental evaluation," IEEE Trans. Veh. Technol., Special Iss. In-Veh. Vis. Syst., vol. 53, no. 6, pp. 1698–1712, Nov. 2004.
[12] S. Cheng and M. Trivedi, "Turn-intent analysis using body pose for intelligent driver assistance," IEEE Pervas. Comput., vol. 5, no. 4, pp. 28–37, Oct.–Dec. 2006.
[13] J. Geigel and M. Schweppe, "Motion capture for realtime control of virtual actors in live, distributed, theatrical performances," in Proc. FG'11, 2011, pp. 774–779.
[14] A. Alatan, Y. Yemez, U. Güdükbay, X. Zabulis, K. Müller, C. Erdem, C. Weigel, and A. Smolic, "Scene representation technologies for 3DTV—A survey," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1587–1605, Nov. 2007.
[15] J. Radmer and J. Krueger, "Depth data-based capture of human movement for biomechanical application in clinical rehabilitation use," in Proc. 5th Int. Symp. Health Inform. Bioinformat., 2010, pp. 144–148.
[16] L. Muendermann, S. Corazza, A. Chaudhari, T. Andriacchi, A. Sundaresan, and R. Chellappa, "Measuring human movement for biomechanical applications using markerless motion capture," in Proc. SPIE Three-Dimensional Image Capture Applicat., 2006.
[17] Y. Chen, L. Smith, S. Hongwei, A. Pereira, and T. Smith, "Active information selection: Visual attention through the hands," IEEE Trans. Auton. Mental Develop., vol. 1, no. 2, pp. 141–151, Aug. 2009.
[18] M. Trivedi, K. Huang, and I. Mikic, "Dynamic context capture and distributed video arrays for intelligent spaces," IEEE Trans. Syst., Man, Cybern. A, vol. 35, no. 1, pp. 145–163, Jan. 2005.
[19] E. Murphy-Chutorian and M. Trivedi, "3D tracking and dynamic analysis of human head movements and attentional targets," in Proc. IEEE/ACM Int. Conf. Distrib. Smart Cameras, 2008, pp. 1–8.
[20] A. Waibel, R. Stiefelhagen, R. Carlson, J. Casas, J. Kleindienst, L. Lamel, O. Lanz, D. Mostefa, M. Omologo, F. Pianesi, L. Polymenakos, G. Potamianos, J. Soldatos, G. Sutschet, and J. Terken, "Computers in the human interaction loop," in Handbook of Ambient Intelligence and Smart Environments. New York: Springer, 2010.
[21] S. Park and M. Trivedi, "Understanding human interactions with track and body synergies (TBS) captured from multiple views," Comput. Vis. Image Understand., vol. 111, no. 1, pp. 2–20, 2008.
[22] A. Utasi and C. Benedek, "A 3-D marked point process model for multi-view people detection," in Proc. Comput. Vis. Pattern Recognit., 2011, pp. 3385–3392.
[23] J. Assfalg, M. Bertini, C. Colombo, A. Bimbo, and W. Nunziati, "Semantic annotation of soccer videos: Automatic highlights identification," Comput. Vis. Image Understand., vol. 92, no. 2–3, pp. 285–305, 2003.
[24] J. Kilner, J.-Y. Guillemaut, and A. Hilton, "3D action matching with key-pose detection," in Proc. Int. Conf. Comput. Vis. Workshops, 2009, pp. 1–8.
[25] N. Werghi, "Segmentation and modeling of full human body shape from 3-D scan data: A survey," IEEE Trans. Syst., Man, Cybern. C, vol. 37, no. 6, pp. 1122–1136, Nov. 2007.
[26] D. Fofi, T. Sliwa, and Y. Voisin, "A comparative survey on invisible structured light," Proc. SPIE, vol. 5303, pp. 90–98, 2004.
[27] A. Kolb, E. Barth, R. Koch, and R. Larsen, "Time-of-flight sensors in computer graphics," in Proc. Eurographics (State of the Art Reports), 2009.
[28] E. Stoykova, A. Alatan, P. Benzie, N. Grammalidis, S. Malasitis, J. Ostermann, S. Piekh, V. Sainov, C. Theobalt, T. Thevar, and X. Zabulis, "3-D time-varying scene capture technologies: A survey," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1568–1586, Nov. 2007.
[29] N. Gkalelis, H. Kim, A. Hilton, N. Nikolaidis, and I. Pitas, "The i3DPost multi-view and 3D human action/interaction database," in Proc. CVMP, 2009, pp. 159–168.
[30] X. Ji and H. Liu, "Advances in view-invariant human motion analysis: A review," IEEE Trans. Syst., Man, Cybern. C, vol. 40, no. 1, pp. 13–24, Jan. 2010.
[31] T. Moeslund and E. Granum, "A survey of computer vision-based human motion capture," Comput. Vis. Image Understand., vol. 81, no. 3, pp. 231–268, 2001.
[32] T. Moeslund, A. Hilton, and V. Krüger, "A survey of advances in vision-based human motion capture and analysis," Comput. Vis. Image Understand., vol. 104, no. 2–3, pp. 90–126, 2006.
[33] R. Poppe, "Vision-based human motion analysis: An overview," Comput. Vis. Image Understand., vol. 108, no. 1–2, pp. 4–18, 2007.
[34] R. Poppe, "A survey on vision-based human action recognition," Image Vis. Comput., vol. 28, no. 6, pp. 976–990, 2010.
[35] D. Weinland, R. Ronfard, and E. Boyer, "A survey of vision-based methods for action representation, segmentation and recognition," INRIA Tech. Rep. RR-7212, pp. 54–111, 2010 [Online]. Available: http://hal.inria.fr/inria-00459653/PDF/RR-7212.pdf
[36] D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Comput. Vis. Image Understand., vol. 104, no. 2, pp. 249–257, 2006.
[37] S. Y. Cheng and M. M. Trivedi, "Articulated human body pose inference from voxel data using a kinematically constrained Gaussian mixture model," in Proc. Comput. Vis. Pattern Recognit. Workshops, 2007.
[38] L. Sigal and M. Black, "Guest editorial: State of the art in image and video based human pose and motion estimation," Int. J. Comput. Vis., vol. 87, pp. 1–3, 2010.
[39] MuHAVi Dataset Instructions [Online]. Available: http://dipersec.king.ac.uk/MuHAVi-MAS/
[40] B.-W. Hwang, S. Kim, and S.-W. Lee, "A full-body gesture database for automatic gesture recognition," in Proc. FG'06, 2006 [Online]. Available: http://gesturedb.korea.ac.kr/
[41] L. Sigal and M. Black, "HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion," Tech. Rep., 2006.
[42] J. Starck and A. Hilton, "Surface capture for performance based animation," IEEE Comput. Graphics Applicat., vol. 27, no. 3, pp. 21–31, May–Jun. 2007.
[43] M. Holte, T. Moeslund, N. Nikolaidis, and I. Pitas, "3D human action recognition for multi-view camera systems," in Proc. 3DIMPVT, 2011.
[44] I. Mikic, M. M. Trivedi, E. Hunter, and P. Cosman, "Human body model acquisition and tracking using voxel data," Int. J. Comput. Vis., vol. 53, no. 3, pp. 199–223, 2003.


[45] A. Sundaresan and R. Chellappa, "Model driven segmentation of articulating humans in Laplacian eigenspace," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 10, pp. 1771–1785, Oct. 2008.
[46] Q. Delamarre and O. Faugeras, "3D articulated models and multiview tracking with physical forces," Comput. Vis. Image Understand., vol. 81, no. 3, pp. 328–357, 2001.
[47] G. Cheung, S. Baker, and T. Kanade, "Shape-from-silhouette of articulated objects and its use for human body kinematic estimation and motion capture," in Proc. Comput. Vis. Pattern Recognit., 2003.
[48] J. Ziegler, K. Nickel, and R. Stiefelhagen, "Tracking of the articulated upper body on multi-view stereo image sequences," in Proc. Comput. Vis. Pattern Recognit., 2006.
[49] F. Caillette, A. Galata, and T. Howard, "Real-time 3-D human body tracking using learnt models of behaviour," Comput. Vis. Image Understand., vol. 109, pp. 112–125, 2008.
[50] O. Bernier, P. Cheung-Mon-Chana, and A. Bougueta, "Fast nonparametric belief propagation for real-time stereo articulated body tracking," Comput. Vis. Image Understand., vol. 113, no. 1, pp. 29–47, 2009.
[51] J. Gall, B. Rosenhahn, T. Brox, and H. Seidel, "Optimization and filtering for human motion capture: A multi-layer framework," Int. J. Comput. Vis., vol. 87, no. 1–2, pp. 75–92, 2010.
[52] S. Corazza, L. Mundermann, E. Gambaretto, G. Ferrigno, and T. Andriacchi, "Markerless motion capture through visual hull, articulated ICP and subject specific model generation," Int. J. Comput. Vis., vol. 87, no. 1–2, pp. 156–169, 2010.
[53] R. Li, T. Tian, S. Sclaroff, and M. Yang, "3D human motion tracking with a coordinated mixture of factor analyzers," Int. J. Comput. Vis., vol. 87, no. 1–2, 2010.
[54] M. Hofmann and D. Gavrila, "Multi-view 3D human pose estimation in complex environment," Int. J. Comput. Vis., 2011.
[55] S. Y. Cheng and M. M. Trivedi, "Multimodal voxelization and kinematically constrained Gaussian mixture model for full hand pose estimation: An integrated systems approach," in Proc. Int. Conf. Comput. Vis. Syst., 2006.
[56] G. Cheung and T. Kanade, "A real-time system for robust 3D voxel reconstruction of human motions," in Proc. Comput. Vis. Pattern Recognit., 2000.
[57] Z. Husz and A. Wallace, "Evaluation of a hierarchical partitioned particle filter with action primitives," in Proc. Comput. Vis. Pattern Recognit. Workshop, 2007.
[58] D. Knossow, R. Ronfard, and R. Horaud, "Human motion tracking with a kinematic parameterization of extremal contours," Int. J. Comput. Vis., vol. 79, no. 3, pp. 247–269, 2008.
[59] R. Poppe, "Evaluating example-based pose estimation: Experiments on the HumanEva sets," in Proc. Comput. Vis. Pattern Recognit. Workshops, 2007.
[60] G. Slabaugh, B. Culbertson, and T. Malzbender, "A survey of methods for volumetric scene reconstruction from photographs," in Proc. Int. Conf. Vol. Graphics, 2001.
[61] C. Tran and M. Trivedi, "Hand modeling and tracking from voxel data: An integrated framework with automatic initialization," in Proc. IEEE Int. Conf. Pattern Recognit., 2008.
[62] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly, "Vision-based hand pose estimation: A review," Comput. Vis. Image Understand., vol. 108, no. 1–2, 2007.
[63] E. Murphy-Chutorian and M. Trivedi, "Head pose estimation in computer vision: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 4, pp. 607–626, Apr. 2009.
[64] K. Huang and M. Trivedi, "3D shape context based gesture analysis integrated with tracking using omni video array," in Proc. Comput. Vis. Pattern Recognit. Workshops, 2005.
[65] M. Ahmad and S.-W. Lee, "HMM-based human action recognition using multiview image sequences," in Proc. Int. Conf. Pattern Recognit., 2006.
[66] C. Canton-Ferrer, J. Casas, and M. Pardás, "Human model and motion based 3D action recognition in multiple view scenarios," in Proc. EUSIPCO, 2006.
[67] M. Pierobon, M. Marcon, A. Sarti, and S. Tubaro, "3-D body posture tracking for human action template matching," in Proc. Int. Conf. Acoust., Speech, Signal Process., 2006, pp. 501–504.
[68] F. Lv and R. Nevatia, "Single view human action recognition using key pose matching and Viterbi path searching," in Proc. Comput. Vis. Pattern Recognit., 2007.
[69] D. Weinland, R. Ronfard, and E. Boyer, "Action recognition from arbitrary views using 3D exemplars," in Proc. Int. Conf. Comput. Vis., 2007.
[70] S. Cherla, K. Kulkarni, A. Kale, and V. Ramasubramanian, "Towards fast, view-invariant human action recognition," in Proc. Comput. Vis. Pattern Recognit. Workshops, 2008.
[71] A. Farhadi and M. Tabrizi, "Learning to recognize activities from the wrong view point," in Proc. Eur. Conf. Comput. Vis., 2008.
[72] I. Junejo, E. Dexter, I. Laptev, and P. Pérez, "Cross-view action recognition from temporal self-similarities," in Proc. Eur. Conf. Comput. Vis., 2008.
[73] J. Liu, S. Ali, and M. Shah, "Recognizing human actions using multiple features," in Proc. Comput. Vis. Pattern Recognit., 2008.
[74] J. Liu and M. Shah, "Learning human actions via information maximization," in Proc. Comput. Vis. Pattern Recognit., 2008.
[75] R. Souvenir and J. Babbs, "Learning the viewpoint manifold for action recognition," in Proc. Comput. Vis. Pattern Recognit., 2008.
[76] D. Tran and A. Sorokin, "Human activity recognition with metric learning," in Proc. Eur. Conf. Comput. Vis., 2008.
[77] P. Turaga, A. Veeraraghavan, and R. Chellappa, "Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision," in Proc. Comput. Vis. Pattern Recognit., 2008.
[78] S. Vitaladevuni, V. Kellokumpu, and L. Davis, "Action recognition using ballistic dynamics," in Proc. Comput. Vis. Pattern Recognit., 2008.
[79] P. Yan, S. Khan, and M. Shah, "Learning 4D action feature models for arbitrary view action recognition," in Proc. Comput. Vis. Pattern Recognit., 2008.
[80] N. Gkalelis, N. Nikolaidis, and I. Pitas, "View independent human movement recognition from multi-view video exploiting a circular invariant posture representation," in Proc. ICME, 2009.
[81] K. Reddy, J. Liu, and M. Shah, "Incremental action recognition using feature-tree," in Proc. Int. Conf. Comput. Vis., 2009.
[82] P. Huang, A. Hilton, and J. Starck, "Shape similarity for 3D video sequences of people," Int. J. Comput. Vis., vol. 89, pp. 362–381, 2010.
[83] A. Iosifidis, N. Nikolaidis, and I. Pitas, "Movement recognition exploiting multi-view information," in Proc. Multimedia Signal Process., 2010.
[84] D. Weinland, M. Özuysal, and P. Fua, "Making action recognition robust to occlusions and viewpoint changes," in Proc. Eur. Conf. Comput. Vis., 2010.
[85] A. Haq, I. Gondal, and M. Murshed, "On dynamic scene geometry for view-invariant action matching," in Proc. Comput. Vis. Pattern Recognit., 2011.
[86] I. Junejo, E. Dexter, I. Laptev, and P. Pérez, "View-independent action recognition from temporal self-similarities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 172–185, Jan. 2011.
[87] J. Liu, M. Shah, B. Kuipers, and S. Savarese, "Cross-view action recognition via view knowledge transfer," in Proc. Comput. Vis. Pattern Recognit., 2011.
[88] S. Pehlivan and P. Duygulu, "A new pose-based representation for recognizing actions from multiple cameras," Comput. Vis. Image Understand., vol. 115, pp. 140–151, 2011.
[89] Y. Song, D. Demirdjian, and R. Davis, "Multi-signal gesture recognition using temporal smoothing hidden conditional random fields," in Proc. FG'11, 2011.
[90] P. Matikainen, P. Pillai, L. Mummert, R. Sukthankar, and M. Hebert, "Prop-free pointing detection in dynamic cluttered environments," in Proc. FG'11, 2011.
[91] P. Fihl and T. B. Moeslund, "Invariant gait continuum based on the duty-factor," Signal Image Video Process., vol. 3, no. 4, pp. 391–402, 2008.
[92] A. Johnson and M. Hebert, "Using spin images for efficient object recognition in cluttered 3D scenes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 5, pp. 433–449, May 1999.
[93] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, "Shape distributions," ACM Trans. Graph., vol. 21, pp. 807–832, 2002.
[94] M. Ankerst, G. Kastenmüller, H.-P. Kriegel, and T. Seidl, "3D shape histograms for similarity search and classification in spatial databases," in Proc. Int. Symp. Spatial Databases, 1999.
[95] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
[96] M. Körtgen, M. Novotni, and R. Klein, "3D shape matching with 3D shape contexts," in Proc. Central Eur. Seminar Comput. Graph., 2003.
[97] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz, "Rotation invariant spherical harmonic representation of 3D shape descriptors," in Proc. SIGGRAPH, 2003.
[98] P. Huang and A. Hilton, "Shape-colour histograms for matching 3D video sequences," in Proc. 3DIM, 2009.


[99] I. Cohen and H. Li, "Inference of human postures by classification of 3D human body shape," in Proc. Int. Workshop Anal. Modeling of Faces and Gestures, 2003.
[100] C. Tran and M. Trivedi, "Human body modeling and tracking using volumetric representation: Selected recent studies and possibilities for extensions," in Proc. ACM Workshops, 2008.
[101] R. Gross and J. Shi, "The CMU Motion of Body (MoBo) Database," Tech. Rep., 2001.
[102] M. Trivedi, "Human movement capture and analysis in intelligent environments," Mach. Vis. Applicat., vol. 14, no. 4, pp. 215–217, 2003.

Michael B. Holte (S'11) received the M.Sc.E.E. degree in informatics (computer vision and graphics) and the Ph.D. degree from Aalborg University (AAU), Aalborg, Denmark, in 2005 and 2012, respectively.

His primary research interests are human motion analysis, gesture and action recognition, behavior analysis, machine vision, pattern recognition, interactive systems, and computer graphics.

Cuong Tran (S'11) received the B.S. degree in computer science from Hanoi University of Technology, Hanoi, Vietnam, and the M.S. and Ph.D. degrees in computer science from the University of California (UC) at San Diego, La Jolla, in 2004, 2008, and 2012, respectively.

He is currently a Researcher in the Computer Vision and Robotics Research Laboratory, UC San Diego. His research interests include vision-based human pose estimation and activity analysis for interactive applications, intelligent driver assistance, human–machine interfaces, and behavior prediction.

Dr. Tran is a Vietnam Education Foundation (VEF) Fellow.

Mohan M. Trivedi (F'08) received the B.E. degree (with honors) in electronics from the Birla Institute of Technology and Science, Pilani, India, in 1974, and the M.E. and Ph.D. degrees from Utah State University in 1976 and 1979, respectively.

He is a Professor of electrical and computer engineering and the Founding Director of the Computer Vision and Robotics Research Laboratory and the Laboratory for Intelligent and Safe Automobiles (LISA) at the University of California at San Diego, La Jolla. He and his team are currently pursuing research in machine and human perception, machine learning, human-centered multimodal interfaces, intelligent transportation, driver assistance, and active safety systems. His team has played a key role in several major collaborative research initiatives. These include an autonomous robotic team for Shinkansen track maintenance, autonomous vision-based robots for nuclear environments, human-centered vehicle collision avoidance systems, a vision-based passenger protection system for "smart airbags," and several vision systems for transportation and homeland security applications. His team designed and deployed the "Eagle Eyes" system on the U.S.–Mexico border in 2006.

Prof. Trivedi is a coauthor of a number of papers winning "Best Paper" awards. Two of his students were awarded Best Dissertation Awards by the IEEE ITS Society (Dr. S. Cheng 2008 and Dr. B. Morris 2010), and his advisee Dr. A. Doshi's dissertation was selected as the UCSD entry and judged among the five finalists in the 2011 dissertation competition of the Western (USA and Canada) Association of Graduate Schools. He has received the Distinguished Alumnus Award from Utah State University, and the Pioneer Award (Technical Activities) and Meritorious Service Award from the IEEE Computer Society. He has given over 65 keynote/plenary talks at major conferences. He presented the Mel Webber Memorial Lecture at the UCTC conference in 2009. Prof. Trivedi serves as a consultant to industry and government agencies in the U.S. and abroad, including the National Academies, major auto manufacturers, and research initiatives in Asia and Europe. He is a Fellow of the IEEE "for contributions to intelligent transportation systems field," a Fellow of the IAPR "for contributions to vision systems for situational awareness and human-centered vehicle safety," and a Fellow of the SPIE "for distinguished contributions to the field of optical engineering."

Thomas B. Moeslund (M'12) received the M.Sc.E.E. and Ph.D. degrees from Aalborg University, Aalborg, Denmark, in 1996 and 2003, respectively.

He is Head of a computer vision group at Aalborg University. His research is focused on all aspects of computer vision and image analysis. He has been involved in 12 national and international research projects, as coordinator, WP leader, and researcher. He is a reviewer for all major journals within the field. He serves as an associate editor and editorial board member for four international journals.

Dr. Moeslund has coedited one special journal issue, cochaired four international workshops/tutorials, and acted as PC member/reviewer for a number of conferences. In 2012, he is an invited speaker at AMDO12, an Advisory Board Member of ICIEV12, an organizer of a CVPR12 tutorial, and a coeditor of a journal special issue. He has published 90 peer-reviewed journal and conference papers. His awards include a most cited paper award in 2009, a best IEEE paper award in 2010, and a teacher-of-the-year award in 2010.