
    Computer Vision and Image Understanding

Vol. 73, No. 3, March, pp. 428–440, 1999

    Article ID cviu.1998.0744, available online at http://www.idealibrary.com on

    Human Motion Analysis: A Review

    J. K. Aggarwal and Q. Cai

    Computer and Vision Research Center, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712

    Received October 22, 1997; accepted September 25, 1998

Human motion analysis is receiving increasing attention from computer vision researchers. This interest is motivated by a wide spectrum of applications, such as athletic performance analysis, surveillance, man-machine interfaces, content-based image storage and retrieval, and video conferencing. This paper gives an overview of the various tasks involved in motion analysis of the human body. We focus on three major areas related to interpreting human motion: (1) motion analysis involving human body parts, (2) tracking a moving human from a single view or multiple camera perspectives, and (3) recognizing human activities from image sequences. Motion analysis of human body parts involves the low-level segmentation of the human body into segments connected by joints and recovers the 3D structure of the human body using its 2D projections over a sequence of images. Tracking human motion from a single view or multiple perspectives focuses on higher-level processing, in which moving humans are tracked without identifying their body parts. After successfully matching the moving human image from one frame to another in an image sequence, understanding the human movements or activities comes naturally, which leads to our discussion of recognizing human activities. © 1999 Academic Press

    1. INTRODUCTION

Human motion analysis is receiving increasing attention from computer vision researchers. This interest is motivated by applications over a wide spectrum of topics. For example, segmenting the parts of the human body in an image, tracking the movement of joints over an image sequence, and recovering the underlying 3D body structure are particularly useful for analysis of athletic performance, as well as medical diagnostics. The capability to automatically monitor human activities using computers in security-sensitive areas such as airports, borders, and building lobbies is of great interest to the police and military. With the development of digital libraries, the ability to automatically interpret video sequences will save tremendous human effort in sorting and retrieving images or video sequences using content-based queries. Other applications include building man-machine user interfaces, video conferencing, etc. This paper gives an overview of recent developments in human motion analysis from image sequences using a hierarchical approach.

In contrast to our previous review of motion estimation of a

human body, which is a nonrigid form. Our discussion covers three areas: (1) motion analysis of the human body structure, (2) tracking human motion without using the body parts from a single view or multiple perspectives, and (3) recognizing human activities from image sequences. The relationship among these three areas is depicted in Fig. 1. Motion analysis of the human body usually involves low-level processing, such as body part segmentation, joint detection and identification, and the recovery of 3D structure from the 2D projections in an image sequence. Tracking moving individuals from a single view or multiple perspectives involves applying visual features to detect the presence of humans directly, i.e., without considering the geometric structure of the body parts. Motion information such as position and velocity, along with intensity values, is employed to establish matching between consecutive frames. Once feature correspondence between successive frames is solved, the next step is to understand the behavior of these features throughout the image sequence. Therefore, our discussion turns to recognition of human movements and activities.

Two typical approaches to motion analysis of human body parts are reviewed, depending on whether a priori shape models are used. Figure 2 lists a number of publications in this area over the past years. In both model-based and nonmodel-based approaches, the representation of the human body evolves from stick figures to 2D contours to 3D volumes as the complexity of the model increases. The stick figure representation is based on the observation that human motion is essentially the movement of the supporting bones. The use of 2D contours is directly associated with the projection of the human figure in images. Volumetric models, such as generalized cones, elliptical cylinders, and spheres, attempt to describe the details of a human body in 3D and, therefore, require more parameters for computation. Various levels of representation could be used for graphical animation with different resolutions [2].

With regard to the tracking of human motion without involving body parts, we differentiate the work based on whether the subject is imaged at one time instant by a single camera or from multiple perspectives using different cameras. In both configurations, the features to be tracked vary from points to 2D blobs and 3D volumes. There is always a trade-off between feature complexity and tracking efficiency. Lower-level features, such as points, are easier to extract but relatively more difficult to track than higher-level features such as blobs and 3D volumes. Most of the work in this area is listed in Fig. 3.

FIG. 1. Relationship among the three areas of human motion analysis addressed in the paper.

To recognize human activities from an image sequence, two types of approaches were addressed: approaches based on a state-space model and those using the template matching technique. In the first case, the features used for recognition could be points, lines, and 2D blobs. Methods using template matching usually apply meshes of a subject image to identify a particular movement. Figure 4 gives an overview of the research in this area. In some of the publications, recognition is conducted using only parts of the human figure. Since these methods can be naturally extended to recognition of a whole body movement, we also include them in our discussion.

FIG. 2. Past research on motion analysis of human body parts.

This paper is an extension of the review in [3]. The organization of the paper is as follows: Section 2 reviews work on motion analysis of the human body structure. Section 3 covers research on the higher-level task of tracking a human body as a whole. Section 4 extends the discussion to recognition of human activity in image sequences based upon successfully tracking the features between consecutive frames. Finally, Section 5 concludes the paper and delineates possible directions for future research.

  • 7/29/2019 Q. Cai, Human Motion Analysis a Review

    3/13

    430 AGGARWAL AND CAI

    FIG. 3. Past research on tracking of human motion without using body parts.

    2. MOTION ANALYSIS OF HUMAN BODY PARTS

    This section focuses on motion analysis of human body parts,

    .e., approaches which involve 2D or 3D analysis of the hu-

    man body structure throughout image sequences. Convention-

    ally, human bodies are represented as stick figures, 2D contours,

    or volumetric models [4]; therefore, body segments can be ap-

    proximated as lines, 2D ribbons, and 3D volumes, accordingly.

    Figures5,6,andand7showexamplesofthestickfigure,2Dcon-

    our, and elliptical cylinder representations of the human body,

    espectively. Human body motion is typically addressed by the

    movement of the limbs and hands [58], such as the velocitiesof the hand or limb segments, or the angular velocity of various

    body parts.

    Two general strategies are used, depending upon whether in-

    ormation about the object shape is employed in the motion

    analysis, namely, model-based approaches and methods which

    do not rely on a priori shape models. Both methodologies fol-

    ow the general framework of: (1) feature extraction, (2) feature

    orrespondence, and (3) high-level processing. The major differ-

    FIG. 4. Past work on human activity recognition.

    ence between the two methodologies is in establishing f

    correspondence between consecutive frames. Methods

    assume a priori shape models match the real images to a p

    fined model. Feature correspondence is automatically ach

    once matching between the real images and the model is

    lished. When no a priori shape models are available, how

    correspondence between successive frames is based upo

    diction or estimation of features related to position, vel

    shape, texture, and color. These two methodologies can a

    combined at various levels to verify the matching betwee

    secutive frames and, finally, to accomplish more complex

    Since we have surveyed these methods previously in [4], wrestrict out discussion to very recent work.

    2.1. Motion Analysis without a priori Shape Models

    Most approaches to 2D or 3D interpretation of human

    structure focus on motion estimation of the joints of bod

    ments. When no a priori shape models are assumed, he

    assumptions are usually used to establish the correspon

  • 7/29/2019 Q. Cai, Human Motion Analysis a Review

    4/13

    HUMAN MOTION ANALYSIS

    FIG. 5. A stick-figure human model (based on Chen and Lees work [11]).

of joints between successive frames. These assumptions impose constraints on feature correspondence, decrease the search space, and eventually result in a unique match.

The simplest representation of a human body is the stick figure, which consists of line segments linked by joints. The motion of joints provides the key to motion estimation and recognition of the whole figure. This concept was initially considered by Johansson [9], who marked joints as moving light displays (MLD). Along this vein, Rashid [10] attempted to recover a connected human structure with projected MLD by assuming that points belonging to the same object have higher correlations in

FIG. 6. A 2D contour human model (similar to Leung and Yang's model [27]).

FIG. 7. A volumetric human model (derived from Hogg's work [29]).

projected positions and velocities. Later, Webb and Aggarwal [11, 12] extended this to 3D structure recovery of Johansson-type figures in motion. They imposed the fixed axis assumption, which assumes that the motion of each rigid object (or part of an articulated object) is constrained so that its axis of rotation remains fixed in direction. Therefore, the depth of the joints can be estimated from their 2D projections. A detailed review of [9–12] can be found in [4]. All of these approaches inevitably demand a high degree of accuracy in extracting body segments and joints. The segmentation problem is avoided by directly using MLDs, which implies their restrictions when applied to human images with natural clothing.

Another way to describe the human body is using 2D contours. In such descriptions, the human body segments are analogous to 2D ribbons or blobs. For example, Shio and Sklansky [15] focused their work on 2D translational motion of human blobs. The blobs were grouped based on the magnitude and direction of the pixel velocity, which were obtained using techniques similar to the optical flow method [16]. The velocity of each part was considered to converge to a global average value over several frames. This average velocity corresponds to the motion of the whole human body and leads to identification of the whole body via region grouping of blobs with a similar smoothed velocity.
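The grouping step can be made concrete with a small sketch: features carrying image velocities are merged whenever their velocities agree within a tolerance, so blobs moving with the whole body end up in one group. This is a toy illustration of the idea only, not Shio and Sklansky's algorithm; the data and tolerance below are hypothetical.

```python
import numpy as np

def group_by_velocity(positions, velocities, tol=0.5):
    """Greedily group features whose image velocities agree within `tol`.

    Toy version of velocity-based region grouping: blobs moving with a
    similar velocity are merged into one body region.
    """
    n = len(positions)
    labels = -np.ones(n, dtype=int)
    next_label = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        # absorb every still-unlabeled feature with a similar velocity
        for j in range(i + 1, n):
            if labels[j] == -1 and np.linalg.norm(velocities[i] - velocities[j]) < tol:
                labels[j] = next_label
        next_label += 1
    return labels

# Two people walking in opposite directions: velocity separates them
pos = np.array([[10, 20], [12, 22], [80, 20], [82, 19]], dtype=float)
vel = np.array([[1.0, 0.0], [1.1, 0.1], [-1.0, 0.0], [-0.9, 0.1]])
print(group_by_velocity(pos, vel))  # two groups: [0 0 1 1]
```

Grouping on velocity alone, as here, is what makes the approach robust to intensity changes but blind to two bodies moving in step.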

Kurakake and Nevatia [17] attempted to locate the joint locations in images of walking humans by establishing correspondence between extracted ribbons. They assumed small motion between two consecutive frames, and feature correspondence was conducted using various geometric constraints. Joints were finally identified as the center of the area where two ribbons overlap. Recent work by Kakadiaris et al. [18, 19] focused on body part decomposition and joint location from image sequences of a moving subject using a physics-based framework. Unlike [17], where joints are located from shapes, here the joint is revealed only when the body segments connected to it involve motion. In the beginning, the subject image is assumed to be one deformable model. As the subject moves and new postures occur,

  • 7/29/2019 Q. Cai, Human Motion Analysis a Review

    5/13

    432 AGGARWAL AND CAI

multiple new models are produced to replace the old ones, with each of them representing an emerging subpart. Joints are determined based on the relative motion and shape of two moving subparts. These methods usually require small image motion between successive frames, which will not be a major concern as the video sampling rate increases. Our main concern is still segmentation under normal circumstances. Among the addressed methods, Shio and Sklansky [15] relied on motion as the cue for segmentation, while the latter two approaches are based on intensity or texture. To the best of our knowledge, robust segmentation technologies are yet to be developed. Since this problem is the major obstacle to most of the work involving low-level processing, we will not mention it repeatedly in later discussions.

Finally, we want to address the recent work by Rowley and Rehg [20], which focuses on the segmentation of optical flow fields of articulated objects. It is an extension of motion analysis of rigid objects using the expectation-maximization (EM) algorithm [21]. Compared to [21], the major contribution of [20] is to add kinematic motion constraints to each pixel's data. The strength of this work lies in its combination of motion segmentation and estimation in the EM computation, i.e., segmentation is accomplished in the E-step, and motion analysis in the M-step. These two steps are computed iteratively in a forward-backward manner to minimize the overall energy function of the whole image. The motion addressed in the paper is restricted to 2D affine transforms. We would expect to see its extension to 3D cases under perspective projections.
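The E-step/M-step alternation just described can be illustrated in its simplest form: flow vectors generated by two unknown 2D affine motions, hard-assigned to models in the E-step and refit by least squares in the M-step. This is a toy sketch of the general idea, not the energy formulation of [20] or [21]; all data below are synthetic.

```python
import numpy as np

def fit_affine(pts, flows):
    """Least-squares fit of a 2D affine motion field v = A^T [x, y, 1]."""
    X = np.hstack([pts, np.ones((len(pts), 1))])   # N x 3 design matrix
    A, *_ = np.linalg.lstsq(X, flows, rcond=None)  # 3 x 2 affine parameters
    return A

def em_segment(pts, flows, n_models=2, iters=15, seed=0):
    """Alternate hard E-step assignments and M-step affine refits."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_models, size=len(pts))
    X = np.hstack([pts, np.ones((len(pts), 1))])
    models = [fit_affine(pts[labels == k], flows[labels == k])
              for k in range(n_models)]
    for _ in range(iters):
        # E-step: assign each flow vector to the best-explaining model
        resid = np.stack([np.linalg.norm(X @ A - flows, axis=1) for A in models],
                         axis=1)
        labels = resid.argmin(axis=1)
        # M-step: refit each model to its assigned pixels (skip empty groups)
        for k in range(n_models):
            if (labels == k).any():
                models[k] = fit_affine(pts[labels == k], flows[labels == k])
    return labels

# Synthetic scene: left half translates right, right half translates up
pts = np.array([[x, y] for x in range(10) for y in range(10)], dtype=float)
flows = np.where(pts[:, :1] < 5, [2.0, 0.0], [0.0, 2.0])
labels = em_segment(pts, flows)
```

Hard assignment keeps the sketch short; the cited works use soft responsibilities and additional constraints, which change the update formulas but not the alternating structure.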

    2.2. Model-Based Approaches

In the above subsection, we examined several approaches to motion analysis that do not require a priori shape models. In this type of approach, which is necessary when no a priori shape models are available, it is typically more difficult to establish feature correspondence between consecutive frames. Therefore, most methods for the motion analysis of human body parts apply predefined models for feature correspondence and body structure recovery. Our discussion will be based on the representations of various models.

Chen and Lee [22] recovered the 3D configuration of a moving subject according to its projected 2D image. Their model used 17 line segments and 14 joints to represent the features of the head, torso, hip, arms, and legs (shown in Fig. 5). Various constraints were imposed for the basic analysis of the gait. The method was computationally expensive, as it searched through all possible combinations of 3D configurations, given the known 2D projection, and required accurate extraction of 2D stick figures.
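The stick-figure representation itself is easy to make concrete: a set of named joints plus the segments connecting them, from which segment lengths and joint angles, the quantities tracked in gait analysis, follow directly. The joint list below is illustrative only, not Chen and Lee's exact 17-segment, 14-joint parameterization.

```python
import math

# Illustrative subset of a stick figure: joints as 2D image coordinates
joints = {
    "head": (0.0, 7.0), "neck": (0.0, 6.0), "pelvis": (0.0, 3.0),
    "l_knee": (-0.5, 1.5), "l_ankle": (-0.6, 0.0),
    "r_knee": (0.5, 1.5), "r_ankle": (0.7, 0.0),
}
# A segment is simply a pair of joint names
segments = [("head", "neck"), ("neck", "pelvis"),
            ("pelvis", "l_knee"), ("l_knee", "l_ankle"),
            ("pelvis", "r_knee"), ("r_knee", "r_ankle")]

def length(a, b):
    """Euclidean length of the segment joining joints a and b."""
    (xa, ya), (xb, yb) = joints[a], joints[b]
    return math.hypot(xb - xa, yb - ya)

def joint_angle(a, b, c):
    """Interior angle at joint b formed by segments b-a and b-c (degrees)."""
    (xa, ya), (xb, yb), (xc, yc) = joints[a], joints[b], joints[c]
    u, v = (xa - xb, ya - yb), (xc - xb, yc - yb)
    cosang = (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(cosang))

torso_len = length("neck", "pelvis")                # 3.0
knee_angle = joint_angle("pelvis", "l_knee", "l_ankle")
```

Tracking such a figure over a sequence amounts to re-estimating the joint coordinates per frame; the angles and lengths then give the kinematic signals the gait-analysis work measures.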

Bharatkumar et al. [7] also used stick figures to model the lower limbs of the moving human body. They aimed at constructing a general kinematic model for gait analysis in human walking. Medial-axis transformations were applied to extract 2D stick figures of the lower limbs. The body segment angle and joint displacement were measured and smoothed from real image sequences, and then a common kinematic pattern was detected for each walking cycle. A high correlation (>0.95) was found between the real images and the model. The drawback to this model is that it is view-based and sensitive to changes of the perspective angle at which the images are captured. Huber's human model [23] is a refined version of the stick figure representation. Joints are connected by line segments with a certain degree of relaxation as virtual springs. Thus, this articulated kinematic model behaves analogously to a mass-spring-damper system. Motion and stereo measurements of joints are confined to a three-dimensional space called proximity space (PS). The human head serves as the starting point for tracking all PS locations. In the end, particular gestures were recognized based on the states of the joints associated with the head, torso, and hands. The key to solving this problem requires the 3D positions of the joints in an image sequence. A recent publication by Iwasawa et al. [24] focused on real-time extraction of stick figures from monocular thermal images. The height of the human image and the distance between the subject and the camera were precalibrated. Then the orientation of the upper half of the body was calculated as the principal axis of inertia of the human silhouette. Significant points, such as the top of the head and the tips of the hands and feet, were heuristically located as the extreme points farthest from the center of the silhouette. Finally, major joints such as the elbows and knees were estimated based on the positions of these detected points through genetic learning. The drawback of this method is, again, that it is view-oriented and gesture-restricted. For example, if the human arms are placed in front of the body, there is no way to extract the fingertips using this method; therefore, it fails to locate the elbow joints.

Niyogi and Adelson [8] pursued another route to estimate the joint motion of human body segments. They first examined the spatial-temporal (XYT) braided pattern produced by the lower limb trajectories of a walking human and conducted gait analysis for coarse human recognition. Then the projection of head movements in the spatial-temporal domain was located, followed by the identification of other joint trajectories. These joint trajectories were then utilized to outline the contour of a walking human, based on the observation that the human body is spatially contiguous. Finally, a more accurate gait analysis was performed using the outlined 2D contour, which led to fine-level recognition of specific humans. The major concern we have with this work is how to obtain these XYT trajectories in real image sequences without attaching specific sensors to the head and feet during image acquisition.

In Akita's work [25], both stick figures and cone approximations were integrated and processed in a coarse-to-fine manner. A key frame sequence of stick figures indicates the approximate order of the motion and spatial relationships between the body parts. A cone model is included to provide knowledge of the rough shape of the body parts, whose 2D segments correspond to the counterparts of the stick figure model. The preliminary condition to this approach is to obtain these key frames prior to body part segmentation and motion estimation. Perales and Torres also made use of both stick figure and volumetric representations in their work [26]. They introduced a predefined library with two levels of biomechanical graphical models. Level one is a stick figure tree with nodes for body segments and arcs for joints. Level two is composed of descriptions of surface and body segments using 3D graphical primitives. Both levels of the model are applied in various matching stages. The experiments were conducted in a relatively restricted environment with one subject and multiple calibrated cameras. The distance between the camera and the subject was also known. We would like to see extensions of their methods when the imaging conditions are much more relaxed.

Leung and Yang [27] applied a 2D ribbon model to recognize poses of a human performing gymnastic movements. They estimated motion solely from the outline of the moving human subject. This indicates that the proposed framework is not restricted by the small motion assumption. The system consists of two major processes: extraction of human outlines and interpretation of human motion. In the first process, a sophisticated 2D ribbon model (shown in Fig. 6) was applied to explore the structural and shape relationships between the body parts and their associated motion constraints. A spatial-temporal relaxation process was proposed to determine if an extracted 2D ribbon belongs to a part of the body or that of the background. In the end, a description of the body parts and the appropriate body joints is obtained, based on the structure and motion. Their work is one of the most complete systems for human motion analysis, ranging from low-level segmentation to the high-level task of body part labeling. It would be very interesting to see how the system reacts to more general human motion sequences.

As we know, the major drawback of 2D models is their restriction to the viewing camera angle. Applying a 3D volumetric model alleviates this problem, but it requires more parameters and more expensive computation during the matching process. In a pioneering work in model-based 3D human motion analysis [28], O'Rourke and Badler applied a very elaborate volumetric model consisting of 24 rigid segments and 25 joints. The surface of each segment is defined by a collection of overlapping sphere primitives. A coordinate system is embedded in the segments along with various motion constraints of human body parts. The system consists of four main processes: prediction, simulation, image analysis, and parsing (shown in Fig. 8). First, the image analysis phase accurately locates the body parts based on the previous prediction results. After the range of the predicted 3D location of the body parts becomes smaller, the parser fits these location-time relationships into certain linear functions. Then, the prediction phase estimates the position of the parts in the next frame using the determined linear functions. Finally, a simulator embedded with extensive knowledge of the human body translates the prediction data into corresponding 3D regions, which will be verified by the image analysis phase in the next loop. Our question is whether the computation converges in the four-process loop when the initial prediction is random, and whether the use of an elaborate representation of a human figure is necessary when we merely want to estimate the motion of body parts.

FIG. 8. The overview of O'Rourke and Badler's system.

Elliptical cylinders are another commonly used representation for modeling human forms in 3D volumes. Compared to O'Rourke and Badler's elaborate model, an elliptical cylinder can be represented by fewer parameters: the length of the axis and the major and minor axes of the ellipse cross section. Hogg [29] and Rohr [30] used the cylinder model originated by Marr and Nishihara [31], in which the human body is represented by 14 elliptical cylinders. The origin of the coordinate system is located at the center of the torso. Both Hogg and Rohr intended to generate 3D descriptions of a human walking by modeling. Hogg [29] presented a computer program (WALKER) which attempted to recover the 3D structure of a walking person. He applied eigenvector line fitting to outline the human image and then fitted the 2D projections into the 3D human model using a similar distance measure. In the same spirit, Wachter and Nagel [32] also attempted to establish the correspondence between a 3D human model connected by elliptical cones and a real image sequence. Both edge and region information were incorporated in determining body joints, their degrees of freedom (DOFs), and orientations to the camera by an iterated extended Kalman filter.

A fair amount of work in human motion analysis has focused on the motion of certain body parts. For example, Rehg et al. were interested in tracking the motion of self-occluded fingers by matching an image sequence of hand motion to a 3D finger model. The two occluded fingers were rendered, again, with several cylinders. The major drawback to this work is that they assumed an invariant visibility order of 2D templates of the two occluded fingers to the viewing camera. This assumption simplified the motion prediction and matching process between the 3D shape model and its projection in the image plane. However, it also limits its application to image sequences captured under normal conditions. Recent work by Goncalves et al.

  • 7/29/2019 Q. Cai, Human Motion Analysis a Review

    7/13

    434 AGGARWAL AND CAI

addressed the problem of motion estimation of a human arm in 3D using a calibrated camera. Both the upper and lower arm were modeled as truncated circular cones, and the shoulder and elbow joints were assumed to be spherical joints. They used perspective projection of a 3D arm model to fit the blurred image of a real arm. Matching was conducted by recursively minimizing the error between the model projection and the real image through dynamically adapting the size and orientation of the model. Iterative searching for the global minimum might be a concern for this approach.
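The recursive error-minimization loop can be illustrated on a much simpler stand-in: a planar two-link arm whose two joint angles are fitted to observed elbow and wrist positions by finite-difference gradient descent. The link lengths, learning rate, and observations below are all hypothetical, and the original work fits 3D truncated cones, not this 2D toy.

```python
import math

L1, L2 = 3.0, 2.0   # hypothetical upper/lower arm lengths

def forward(theta1, theta2):
    """Planar two-link arm: returns elbow and wrist positions."""
    ex, ey = L1 * math.cos(theta1), L1 * math.sin(theta1)
    wx = ex + L2 * math.cos(theta1 + theta2)
    wy = ey + L2 * math.sin(theta1 + theta2)
    return (ex, ey), (wx, wy)

def error(angles, observed):
    """Summed squared distance between predicted and observed joints."""
    (ex, ey), (wx, wy) = forward(*angles)
    (oex, oey), (owx, owy) = observed
    return (ex - oex) ** 2 + (ey - oey) ** 2 + (wx - owx) ** 2 + (wy - owy) ** 2

def fit(observed, angles=(0.0, 0.0), lr=0.02, steps=2000, h=1e-5):
    """Finite-difference gradient descent on the model-to-image error."""
    a = list(angles)
    for _ in range(steps):
        grad = []
        for i in range(2):
            a[i] += h
            e_plus = error(a, observed)
            a[i] -= 2 * h
            e_minus = error(a, observed)
            a[i] += h                      # restore the angle
            grad.append((e_plus - e_minus) / (2 * h))
        a = [ai - lr * g for ai, g in zip(a, grad)]
    return a

# Synthesize an observation from known angles, then recover them
true_angles = (0.6, -0.4)
obs = forward(*true_angles)
est = fit(obs)
```

With both elbow and wrist observed, the minimum here is unique; the concern raised in the text is precisely that real, noisy, higher-dimensional versions of this loop can stall in local minima.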

Kakadiaris and Metaxas [35], on the other hand, consider the problem of 3D model-based tracking of human body parts from three calibrated cameras placed in a mutually orthogonal configuration. The 3D human model was extracted based on its 2D projections from the view of each camera using the physics-based framework [19] discussed in the previous section. Cameras are switched for tracking, based on the visibility of the body parts and the observability of their predicted motion. Correspondence between the points on the occluding contours and points on the 3D shape model is established using concepts from projective geometry. The Kalman filter is used for prediction of the body part motion. Since the authors in [19] claimed that their method was general to multiple calibrated cameras, Gavrila and Davis [36, 37] extended it to four calibrated cameras for 3D human body modeling. The class of tapered super-quadrics, including cylinders, spheres, ellipsoids, and hyper-rectangles, was used for the 3D graphical model with 22 DOFs. The general framework for human pose recovery and tracking of body parts was similar to that proposed in O'Rourke and Badler's [28]. Matching between the model view and the actual scene was based on the similarity of arbitrary edge contours using a variant of chamfer matching [38]. Finally, we address the recent work by Hunter et al. [39], who applied mixture density modeling for 3D human pose recovery from a real image sequence. The 3D human model consists of five ellipsoidal component shapes representing the arms and the trunk with 14 DOFs. A modified EM algorithm was employed to solve a constrained mixture estimation of the low-level segmented images. Bregler [40] also applied the mixture of Gaussian estimation to 2D blobs in their low-level tracking process. However, their ultimate goal was to recognize human dynamics, which will be discussed in later sections.
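Several of the trackers above rely on Kalman filtering to predict where a body part will appear in the next frame. A minimal linear, constant-velocity sketch for a single 1D joint coordinate is given below; the noise covariances and measurements are hypothetical, and the cited systems use richer (including extended) variants of the filter.

```python
import numpy as np

# State [position, velocity]; constant-velocity motion model
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition (dt = 1 frame)
H = np.array([[1.0, 0.0]])               # we only measure position
Q = np.eye(2) * 1e-4                     # process noise (hypothetical)
R = np.array([[0.25]])                   # measurement noise (hypothetical)

def kalman_track(measurements):
    """Filter a sequence of noisy 1D position measurements."""
    x = np.array([[measurements[0]], [0.0]])   # initial state estimate
    P = np.eye(2)                              # initial state covariance
    estimates = []
    for z in measurements[1:]:
        # predict the next state from the motion model
        x = F @ x
        P = F @ P @ F.T + Q
        # correct with the new position measurement
        y = np.array([[z]]) - H @ x            # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P
        estimates.append(float(x[0, 0]))
    return estimates

# Noisy observations of a joint moving at roughly 1 pixel per frame
zs = [0.0, 1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
est = kalman_track(zs)
```

The predict step is what supplies the search region for the next frame's image analysis; the correct step then reconciles the prediction with what was actually measured.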

All of these approaches encounter the problem of matching a human image to its abstract representation by models of different complexity. The problem is itself nontrivial. The complexity of the matching process is governed by the number of model parameters and the efficiency of human body segmentation. When fewer model parameters are used, it is easier to match the features to the model, but it is more difficult to extract the features. For example, the stick figure is the simplest way to represent a human body, and thus it is relatively easier to fit the extracted lines onto the corresponding body segments. However, extracting a stick figure from real images needs more care than searching for 2D blobs or 3D volumes. In some of the approaches mentioned above, the Kalman filter and the EM algorithm were used to obtain good estimation and prediction of body part positions. However, these algorithms are computationally expensive and do not guarantee convergence if the initial conditions are ill defined.

3. TRACKING HUMAN MOTION WITHOUT USING BODY PARTS

In the previous section, we discussed human motion analysis involving identification of joint-connected body parts. As we know, the tasks of labeling body segments and then locating their connected joints are certainly nontrivial. On the other hand, applications such as surveillance require only tracking the movement of the subject of interest. It will be computationally more efficient to track moving humans by directly focusing on the motion of the whole body rather than its parts. Knowing that the motion of individual body parts usually converges to that of the whole body in the long run, we could employ the motion of the whole body as a guide for predicting and estimating the movement of body parts; i.e., we could use the human body as the coordinate frame for local motion estimation of its body parts. In this section, we will review methods for tracking a moving human without involving its body parts.

As is well known to computer vision researchers, tracking is equivalent to establishing correspondence of the image structures between consecutive frames based on features related to position, velocity, shape, texture, and color. Typically, the tracking process involves matching between images using pixels, points, lines, and blobs, based on their motion, shape, and other information [41]. There are two general classes of correspondence models, namely iconic models, which use correlation templates, and structural models, which use image features [41]. Iconic models are generally suitable for any object, but only when the motion between two consecutive frames is small enough so that the object images in these frames are highly correlated. Since our interest is in tracking moving humans, who retain a certain degree of nonrigidity, our literature review is limited to the use of structural models.

Feature-based tracking typically starts with feature extraction, followed by feature matching over a sequence of images. The criteria for selecting a good feature are its brightness, contrast, size, and robustness to noise. To establish feature correspondence between successive frames, well-defined constraints are usually imposed to find a unique match. There is again a trade-off between feature complexity and tracking efficiency. Lower-level features such as points are easier to extract but relatively more difficult to track than higher-level features such as lines, blobs, and polygons.

We will organize our discussion based on two scenarios: tracking from a single camera view or from multiple perspectives. In both cases, various levels of features can be used to establish matching between successive frames. The major difference is that the features used for matching from multiple perspectives must be projected onto the same spatial reference, while tracking from a single view does not have this requirement.


    3.1. Tracking from a Single View

    Most methods focus on tracking humans in image sequences

    from a single camera view. Features used for tracking have

    evolved from points to motion blobs.

Polana and Nelson's work [42] is a good example of point feature tracking. They considered that the movements of the arms and legs converge to that of the torso. In [42], each walking subject image was bounded by a rectangular box, and the centroid of the bounding box was used as the feature to track. Tracking was successful even when two subjects occluded each other in the middle of the image sequence, as long as the velocities of the centroids could be differentiated. The advantage of this approach is its simplicity and the use of body motion to resolve occlusion. However, it only considers 2D translational motion, and its tracking robustness could be further improved by incorporating other features such as texture, color, and shape. Cai et al. [43] also focused on tracking the movements of the whole human body, using a viewing camera with 2D translational movement. They focused on dynamic recovery of still or changing background images. The image motion of the viewing camera was estimated by matching the line segments of the background image. Then, three consecutive frames were adjusted into the same spatial reference with motion compensation. At the final stage, subjects were tracked using the centers of the bounding boxes and the estimated motion information. Compared to [42], their work is able to handle a moving camera. However, it also shares the strengths and weaknesses of the previous approach. Segen and Pingali's [44] people tracking system utilized the corner points of moving contours as the features for correspondence. These feature points were matched in a forward-backward manner between two successive frames using a distance measure related to the position and curvature values of the points. The matching process implies that a certain degree of rigidity of the moving human body and small motion between consecutive frames were assumed. Finally, short-lived or partially overlapped trajectories were merged into long-lived paths. Compared to the one-trajectory approaches in [42, 43], their work relies on multiple trajectories and thus is more robust if all matches are correctly resolved. However, the robustness of corner points for matching is questionable, given natural clothing and nonrigid human shapes.
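The centroid-plus-velocity matching used in [42, 43] can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation; the function name, the constant-velocity prediction, and the gating threshold are our own assumptions.

```python
import numpy as np

def track_centroids(tracks, detections, dt=1.0, gate=30.0):
    """Assign detected bounding-box centroids to existing tracks.

    tracks: list of dicts with 'pos' and 'vel' (2D numpy arrays).
    detections: list of 2D centroid positions in the current frame.
    Each track predicts its next position with a constant-velocity
    model; the nearest unclaimed detection within `gate` pixels wins,
    which is how tracks with distinct velocities survive a crossing.
    """
    claimed = set()
    for tr in tracks:
        pred = tr['pos'] + tr['vel'] * dt          # constant-velocity prediction
        dists = [np.linalg.norm(pred - d) if i not in claimed else np.inf
                 for i, d in enumerate(detections)]
        j = int(np.argmin(dists)) if dists else -1
        if j >= 0 and dists[j] < gate:
            tr['vel'] = (detections[j] - tr['pos']) / dt   # update velocity estimate
            tr['pos'] = detections[j]
            claimed.add(j)
    # unclaimed detections start new tracks
    for i, d in enumerate(detections):
        if i not in claimed:
            tracks.append({'pos': np.asarray(d, float), 'vel': np.zeros(2)})
    return tracks
```

In this sketch, occlusion handling amounts to trusting the velocity prediction while a detection is missing, in the spirit of [42].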

Another commonly used feature is 2D blobs or meshes. In Okawa and Hanatani's work [45], the background was estimated by voting for the most frequent value each pixel takes during the image sequence [15]. The meshes belonging to a moving foot were then detected by incorporating low-pass filtering and masking. Finally, the distance between the human foot and the viewing camera was computed by comparing the relative foot position to the precalibrated CAD floor model. Rossi and Bozzoli [46] also used moving blobs to track and count people crossing the field of view of a vertically mounted camera. Occlusion of multiple subjects was avoided due to the viewing angle of the camera, and tracking was performed using position estimation during the period when the subject enters the top or the bottom of the image. Intille and Bobick [47, 48] solved their tracking problem by taking advantage of the knowledge of a so-called closed-world. A closed-world is a space-time domain in which all possible objects present in the image sequences are known. They illustrated their tracking algorithm using the example of a football game, where the background and the rules of play are known a priori. Camera motion was removed by establishing a homographic transformation between the football field image and its model using landmarks in the field. The players were detected as moving blobs, and tracking was performed by template matching the neighboring region of the player image between consecutive frames. These approaches are well justified, given the restricted environments of their applications. We would not recommend using them for image sequences acquired under general conditions.
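The pixel-voting background model of [45, 15] might be sketched as below. This is an illustrative approximation under the assumption that the background is simply the most frequent intensity each pixel takes over the sequence; the function names and threshold are ours.

```python
import numpy as np

def background_by_voting(frames):
    """Estimate the background as the per-pixel mode (most frequent
    intensity) over a stack of grayscale uint8 frames, approximating
    the voting scheme of [45, 15]."""
    stack = np.stack(frames)                     # shape (T, H, W)
    out = np.zeros(stack.shape[1:], dtype=stack.dtype)
    for idx in np.ndindex(out.shape):
        vals = stack[(slice(None),) + idx]       # all T values at this pixel
        out[idx] = np.bincount(vals).argmax()    # most frequent value wins
    return out

def foreground_mask(frame, background, thresh=25):
    """Pixels far from the background model are foreground candidates."""
    return np.abs(frame.astype(int) - background.astype(int)) > thresh
```

The resulting mask would then feed the low-pass filtering and masking stages described above.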

Pentland et al. [49, 50] explored the blob feature thoroughly. In their work, blobs were not restricted to regions defined by motion and could be any homogeneous areas based on color, texture, brightness, motion, shading, or a combination of these. Statistics such as the mean and covariance were used to model the blob features in both 2D and 3D. In [49], the feature vector of a pixel is formulated as (x, y, Y, U, V), consisting of spatial (x, y) and color (Y, U, V) information. A human body is constructed from blobs representing various body parts such as the head, torso, hands, and feet. Meanwhile, the surrounding scene is modeled as a texture surface. Gaussian distributions were assumed for the models of both the human body blobs and the background scene. In this way, pixels belonging to the human body were assigned to the different body part blobs using a log-likelihood measure. Later, Azarbayejani and Pentland [50] recovered the 3D geometry of the blobs from a pair of 2D blob features via nonlinear modeling and recursive estimation. The 3D geometry included the shape and orientation of the 3D blobs, along with their relative rotation and translation to the binocular camera. Tracking of the 2D blobs was inherent in the recovery process, which iteratively searches for the equilibria of the nonlinear state space model across image sequences. The strength of these approaches is their comprehensive feature selection, which could result in more robust performance when the cost of computation is not a major concern.
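The per-pixel blob assignment of [49] can be sketched as follows, assuming each blob is a Gaussian over the (x, y, Y, U, V) feature vector. The function names and the blob statistics in the usage below are purely illustrative, not values from [49].

```python
import numpy as np

def log_likelihood(x, mean, cov):
    """Log of a multivariate Gaussian density (up to a constant),
    used to score a pixel's (x, y, Y, U, V) vector against a blob."""
    d = x - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ inv @ d + logdet)

def assign_pixel(feature, blobs):
    """Assign a pixel feature vector to the blob (body part) with the
    highest log-likelihood, in the spirit of [49].

    blobs: dict mapping a part name to its (mean, covariance) pair.
    """
    scores = {name: log_likelihood(feature, m, c)
              for name, (m, c) in blobs.items()}
    return max(scores, key=scores.get)
```

For example, with hypothetical 'head' and 'hand' blob statistics, a pixel near the head's spatial and color mean would be labeled 'head'.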

Another trend is to use color meshes [51] or clusters [52] for human motion tracking. In the work by Gunsel et al. [51], video objects such as moving humans were identified as groups of meshes with similar motion and color (YUV) information. Nodes of the meshes were tracked in image sequences via block matching. In particular, they focused on the application of content-based indexing of video streams. Heisele et al. [52] initially adopted a clustering process to categorize objects into various prototypes based on color in RGB space and the corresponding image locations. These prototypes then serve as the seeds for parallel k-means clustering, during which the centroids of the clusters are implicitly tracked in subsequent images. Color is also applied for tracking human faces [53, 54] and identifying nude pictures on the internet [55]. Since our interest is in motion analysis of the whole body, we will not discuss these approaches in detail.

Overall, we believe color is a vital cue for segmentation and tracking of moving humans, as long as the extra computation and storage are not a major concern.
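The seeded clustering of [52] might be sketched as below: k-means is run on per-pixel feature vectors, initialized with the previous frame's centroids, so the cluster centers drift with the object. This is a sketch under our own assumptions; [52] uses (R, G, B) color plus image location, which we stack into one feature vector here.

```python
import numpy as np

def kmeans_step(features, centroids, iters=10):
    """One frame of cluster tracking: run k-means on (R, G, B, x, y)
    pixel features, seeded with the centroids carried over from the
    previous frame, so the centers are implicitly tracked (cf. [52]).

    features: (N, D) array; centroids: (K, D) array, updated in place.
    """
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # assign every feature to its nearest centroid
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = features[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids, labels
```

Calling this once per frame with the previous centroids as seeds gives the implicit tracking behavior described above.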

    3.2. Tracking from Multiple Perspectives

The disadvantage of tracking human motion from a single view is that the monitored area is relatively narrow due to the limited field of view of a single camera. To increase the monitored area, one strategy is to mount multiple cameras at various locations of the area of interest. As long as the subject is within the area of interest, it will be imaged from at least one of the perspectives. Tracking from multiple perspectives also helps to solve ambiguities in matching when subject images occlude each other from a certain viewing angle. In recent years, more work on tracking of human motion from multiple perspectives has emerged [56-59]. Compared to tracking moving humans from a single view, establishing feature correspondence between images captured from multiple perspectives is more challenging because the features are recorded in different spatial coordinates. These features must be adjusted to the same spatial reference before matching is performed.

Recent work by Cai and Aggarwal [59, 60] uses multiple points belonging to the medial axis of the human upper body as the features to track. These points are sparsely sampled and assumed to be independent of each other, which preserves a certain degree of nonrigidity of the human body. The location and average intensity of the feature points are used to find the most likely match between two consecutive frames imaged from different viewing angles. Each feature point is assumed to match to its corresponding epipolar line via motion estimation and precalibration of the cameras. Camera switching was automatically implemented, based on the position and velocity of the subject relative to the viewing cameras. Experimental results of tracking moving humans in indoor environments using a prototype system equipped with three cameras indicated robust performance and a potential for real-time implementation. The strength of this approach is its comprehensive framework and relatively low computational cost, given the degree of the problem's complexity. However, the approach still relies heavily on the accuracy of the segmentation results, and more powerful and sophisticated segmentation methods would improve its performance.
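Point-to-epipolar-line matching of the kind used in [59, 60] can be sketched with a precalibrated fundamental matrix F. This is a generic sketch of the epipolar constraint, not the authors' system; the greedy pairing, tolerance, and the rectified-pair F in the usage below are our own illustrative choices.

```python
import numpy as np

def epipolar_distance(p1, p2, F):
    """Distance from point p2 (in view 2) to the epipolar line F @ p1
    of point p1 (in view 1).  Points are (x, y) tuples; F is the 3x3
    fundamental matrix obtained from camera precalibration."""
    x1 = np.array([p1[0], p1[1], 1.0])
    x2 = np.array([p2[0], p2[1], 1.0])
    line = F @ x1                                  # epipolar line a*x + b*y + c = 0
    return abs(x2 @ line) / np.hypot(line[0], line[1])

def match_across_views(points1, points2, F, tol=2.0):
    """Greedily pair each view-1 feature point with the view-2 point
    lying nearest to (and within `tol` pixels of) its epipolar line."""
    pairs = []
    for i, p1 in enumerate(points1):
        dists = [epipolar_distance(p1, p2, F) for p2 in points2]
        j = int(np.argmin(dists))
        if dists[j] < tol:
            pairs.append((i, j))
    return pairs
```

For a rectified camera pair, F reduces to a form whose epipolar lines are image rows, which makes the constraint easy to verify by hand.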

Sato et al. [56], on the other hand, treated a moving human as a combination of various blobs of its body parts. All distributed cameras are calibrated in a world coordinate system that corresponds to a CAD model of the surrounding indoor environment. The blobs of body parts were matched over image sequences using their area, average brightness, and rough 3D position in world coordinates. The 3D position of a 2D blob was estimated, based on height information, by measuring the distance between the center of gravity of the blob and the floor. These small blobs were then merged into an entire region of the human image using spatial and motion parameters from several frames. Kelly et al. [57, 58] adopted a similar strategy [56] to construct a 3D environmental model. They introduced a feature called voxels, which are sets of cubic volume elements containing information such as the object to which each element belongs and the history of that object. The depth information of the voxels is also obtained using height estimation. Moving humans are tracked as a group of these voxels from the best angle of the viewing system. A significant amount of effort is made in [57, 58] to construct a friendly graphical interface for browsing multiple image sequences of the same event. Both methods require a CAD model of the surrounding environment, and tracking relies on finding the 3D positions of blobs in the world coordinate system.

    4. HUMAN ACTIVITY RECOGNITION

In this section, we move on to recognizing human movements or activities from image sequences. A large portion of the literature in this area is devoted to human facial motion recognition and emotion detection, which fall into the category of elastic nonrigid motion. In this paper, we are interested only in motion involving the whole human body, i.e., human body motion as a form of articulated motion. The difference between articulated motion and elastic motion is well addressed in [4]. There also exists a body of work on human image recognition that applies geometric features such as 2D clusters [15, 60], profile projections [61], and texture and color information [53, 55]. Since motion information is not incorporated in the recognition process, we will not include these approaches in further discussion.

For human activity or behavior recognition, most efforts have used state-space approaches [62] to understand the human motion sequence [5, 13, 14, 40, 63-65]. An alternative is to use the template matching technique [42, 66-68] to convert an image sequence into a static shape pattern and then compare it to prestored prototypes during the recognition process. The advantage of the template matching technique is its lower computational cost; however, it is usually more sensitive to the variance of the movement duration. Approaches based on state-space models, on the other hand, define each static posture as a state. These states are connected by certain probabilities. Any motion sequence is considered as a tour going through various states of these static poses. Joint probabilities are computed through these tours, and the maximum value is selected as the criterion for classifying activities. In such a scenario, the duration of motion is no longer an issue because each state can repeatedly visit itself. However, approaches of this kind usually apply intrinsic nonlinear models and do not have a closed-form solution. As we know, nonlinear modeling typically requires searching for a global optimum in the training process, which demands expensive computing iterations. Meanwhile, selecting the proper number of states and the dimension of the feature vector to avoid underfitting or overfitting remains an issue. In the following subsections, both template matching and state-space approaches [62] will be discussed.

    4.1. Approaches Using Template Matching

The work [68] by Boyd and Little is perhaps the only one which uses MLDs as the feature for recognition based on template matching. They assumed that a single pedestrian is walking parallel to the camera and that the movement of the subject is periodic. First, they computed n optical flow images from n + 1 frames and then chose m characteristics, such as the centroid of the MLDs and moments, from each optical flow image. Thus the kth characteristic (1 ≤ k ≤ m) in n frames forms a time-varying sequence with a phase φ_k. The differences between these phases, F_k = φ_k − φ_m (1 ≤ k ≤ m − 1), are grouped into the feature vector for recognition. As can be seen from this procedure, the approach is restricted by the relative position between the walking subject and the viewing camera angle.

Perhaps the most commonly used feature for recognition of human motion is 2D meshes. For example, Polana and Nelson [42] computed the optical flow fields [16] between consecutive frames and divided each flow frame into a spatial grid in both the X and Y directions. The motion magnitude in each cell was summed, forming a high-dimensional feature vector used for recognition. To normalize the duration of the movement, they assumed that human motion is periodic and divided the entire sequence into a number of cycles of the activity. Motion in a single cycle was averaged over the number of cycles and divided into a fixed number of temporal divisions. Finally, activities were recognized using the nearest neighbor algorithm. Recent work by Bobick and Davis [66] follows the same vein but extracts the motion feature differently. They interpret human motion in an image sequence using motion-energy images (MEI) and motion-history images (MHI). The motion images in a sequence are calculated via differencing between successive frames and are then thresholded into binary values. These motion images are accumulated in time to form the MEI, which are binary images containing motion blobs. The MEI are later enhanced into the MHI, where each pixel value is proportional to the duration of motion at that position. Each action is represented by MEIs and MHIs computed from image sequences with various viewing angles. Moment-based features are extracted from the MEIs and MHIs and employed for recognition using template matching. Cui and Weng [67] intended to segment hand images with different gestures from cluttered backgrounds using a learning-based approach. They first used the motion cue to fix an attention window containing the moving object. Then they searched the training set to find the closest match, which served as the basis for predicting the pattern in the next frame. This predicted pattern is later applied to the input image for verification. If the pattern is verified, a refinement process follows to obtain an accurate representation of the hand image with any cluttered background excluded. As can be seen from these approaches, the key to the recognition problem is finding vital and robust feature sets. Due to the nature of 2D meshes, all the features addressed above are view-based. There are several alternatives. One is to use multiple views and incorporate features from different perspectives for recognition. Another is to directly apply 3D features, such as the trajectories of 3D joints. Once the features are extracted, any good method for machine learning and classification, such as nearest neighbor, decision trees, or regression models, could be employed.

FIG. 9. The basic structure of a hidden Markov model.
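The MEI/MHI construction of [66] might be sketched as follows. The difference threshold and decay duration τ below are illustrative choices, not values from [66].

```python
import numpy as np

def motion_history(frames, tau=10, thresh=15):
    """Build a motion-energy image (MEI) and motion-history image (MHI)
    from grayscale frames, following the scheme of [66]: frame
    differences are thresholded to binary motion images, the MEI is
    their union over time, and the MHI keeps a decaying timestamp of
    the most recent motion at each pixel."""
    h, w = frames[0].shape
    mei = np.zeros((h, w), dtype=bool)
    mhi = np.zeros((h, w), dtype=float)
    for prev, cur in zip(frames, frames[1:]):
        moving = np.abs(cur.astype(int) - prev.astype(int)) > thresh
        mei |= moving                                        # accumulate motion support
        mhi = np.where(moving, tau, np.maximum(mhi - 1, 0))  # recency-weighted history
    return mei, mhi
```

Moment-based features would then be computed from these two images for template matching, as described above.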

    4.2. State-Space Approaches

State-space models have been widely used to predict, estimate, and detect time series over a long period of time. One representative model is the hidden Markov model (HMM), which is a probabilistic technique for the study of discrete time series. HMMs have been used extensively in speech recognition, but only recently have they been adopted for recognition of human motion sequences in computer vision [5]. The model structure can be summarized as a hidden Markov chain and a finite set of output probability distributions [69]. The basic structure of an HMM is shown in Fig. 9, where Si represents each state, connected with certain probabilities to the other states or to itself, and y(t) is the observation derived from each state. The main tool in HMM is the Baum-Welch (forward-backward) algorithm for maximum likelihood estimation of the model parameters. Features for recognition in each state vary from points and lines to 2D blobs. We will organize our discussion according to the complexity of feature extraction.
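For concreteness, the forward calculation used to score a sequence against a discrete HMM (the recognition step in [5]) can be sketched as below. The two-state model in the usage example is purely illustrative.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm.  pi: initial state
    probabilities (N,); A: state transition matrix (N, N); B: emission
    probabilities (N, M); obs: list of symbol indices.  Recognition
    picks the activity model with the highest score, as in [5]."""
    alpha = pi * B[:, obs[0]]             # joint prob. of first symbol and state
    s = alpha.sum()
    loglik, alpha = np.log(s), alpha / s  # rescale to avoid numerical underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate states, absorb evidence
        s = alpha.sum()
        loglik += np.log(s)
        alpha /= s
    return loglik
```

Training the per-class parameters (pi, A, B) is the job of the Baum-Welch algorithm mentioned above; the forward pass alone suffices at recognition time.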

Bobick [13] and Campbell [14] applied 2D or 3D Cartesian tracking data sensed by MLDs of body joints for activity recognition. To state it more specifically, the trajectories of multiple body joints form a high-dimensional phase space, and this phase space or its subspace is employed as the feature for recognition. Both researchers intend to convert the continuous motion into a set of discrete symbolic representations. The feature vector of each frame in a motion sequence is portrayed as a point in the phase space belonging to a certain state. A typical gesture is defined as an ordered sequence of these states restricted by motion constraints. In Campbell [14], the learning/training process is accomplished by fitting the unique curve of a gesture in a subset of the phase space with low-order polynomials. Gesture recognition is based on finding the maximum correlation between the predictor and the current motion. Bobick [13], on the other hand, applied the k-means algorithm to find the center of each cluster (or state) from the sample points of the trajectories. Matching between the trajectories of 3D MLDs and a gesture was accomplished through dynamic programming. The major concern with these approaches is how to reliably extract the 3D trajectories from general video streams.

Goddard's human movement recognition [63] focused on the lower limb segments of the human stick figure. 2D projections of the joints were used directly, without interpretation, as inputs, and features for recognition were encoded by the coarse orientation and coarse angular speed of the line segments in the image plane. Although Goddard [63] did not directly apply HMMs in his work for the discrimination of human gaits, he also considered a movement as a composition of events linked by time intervals. Each event is independent of the movement. The links between these events are restrained by feature and temporal constraints. This type of structure, called a scenario, is easily extended by assembling scenarios into composite scenarios. Matching real scene kinematics to modeled movements involves visual motion features in the scene, a priori events matched to the scenario, and the temporal constraints on the sequence and time intervals.

Another commonly used feature for identifying human movements or activities is 2D blobs, which can be obtained from any homogeneous regions based on motion, color, texture, etc. The work by Yamato et al. [5] is perhaps the first on recognition of human action in this category. Mesh features of binary moving human blobs are used as the low-level feature for learning and recognition. Learning was accomplished by training the HMMs to generate symbolic patterns for each class. Optimization of the model parameters is achieved using the Baum-Welch algorithm. Finally, recognition is based on the output probability of the given image sequence using the forward calculation. Recent work by Starner and Pentland [64] applied a similar method to the recognition of American Sign Language. Instead of directly using the low-level mesh feature, they introduced the position of a hand blob, the angle of its axis of least inertia, and the eccentricity of its bounding ellipse into the feature vector. Brand et al. [65] applied coupled HMMs to recognize the actions of two associated blobs (for example, the right- and left-hand blobs in motion). Instead of directly composing the two interacting processes into a single HMM whose structural complexity is the product of that of each individual process, they coupled the two HMMs in such a way that the structural complexity is merely twice that of each individual HMM. Bregler's paper [40] presents a comprehensive framework for recognizing human movements using statistical decomposition of human dynamics at various levels of abstraction. The recognition process starts with low-level processing, during which blobs are estimated as mixtures of Gaussian models based on motion and color similarity, spatial proximity, and prediction from the previous frame using the EM algorithm. During this process, blobs of various body parts are implicitly tracked over image sequences. In the mid-level processing, blobs with coherent motion are fitted to simple movements of dynamic systems. For example, a walking cycle is considered a composition of two simple movements: one when the leg has ground support and one when the leg swings in the air. At the highest level, HMMs are used as a mixture of these mid-level dynamic systems to represent the complex movement. Recognition is completed by maximizing the posterior probability of the HMM trained from data, given the input image sequence. Most of these methods include low-level processing for feature extraction. But again, all recognition methods relying on 2D features are view-based.

    5. CONCLUSION

We have presented an overview of past developments in human motion analysis. Our discussion has focused on three major tasks: (1) motion analysis of human body parts, (2) high-level tracking of human motion from a single view or multiple perspectives, and (3) recognition of human movements or activities based on successfully tracking features over an image sequence. Motion analysis of human body parts is essentially the 3D interpretation of the human body structure using the motion of the body parts over image sequences, and involves low-level tasks such as body part segmentation and joint location. Tracking human motion is performed at a higher level, in which the parts of the human body are not explicitly identified; e.g., the human body is considered as a whole when establishing matches between consecutive frames. Our discussion is categorized by whether the subject is imaged at one time instant from a single view or from multiple perspectives. Tracking a subject in image sequences from multiple perspectives requires that the features be projected into a common spatial reference. The task of recognizing human activity in image sequences assumes that feature tracking for recognition has been accomplished. Two types of techniques are addressed: state-space approaches and template matching approaches. Template matching is simple to implement but sensitive to noise and to the time intervals of the movements. State-space approaches, on the other hand, overcome these drawbacks but usually involve complex iterative computation.

The key to the successful execution of the high-level tasks is establishing feature correspondence between consecutive frames, which still remains a bottleneck in most of the approaches. Typically, constraints on human motion or human models are imposed in order to decrease the ambiguity during the matching process. However, these assumptions may not conform to general imaging conditions or may introduce other intractable issues, such as the difficulty of estimating model parameters from real image data. Recognition of human motion is still in its infancy, and there exists a trade-off between computational cost and motion duration accuracy for methodologies based on state-space models and template matching. We look forward to new techniques that improve performance and, meanwhile, decrease the computational cost.

    ACKNOWLEDGMENTS

The work was supported in part by the United States Army Research Office under Contracts DAAH04-95-I-0494 and DAAG55-98-1-0230, and by the Texas Higher Education Coordinating Board under Advanced Technology Program 95-ATP-442 and Advanced Research Project 97-ARP-275. Special thanks also go to Ms. Debi Paxton and Mr. Shishir Shah for their generous help in editing and commenting on the paper.

REFERENCES

1. J. K. Aggarwal and N. Nandhakumar, On the computation of motion of sequences of images - A review, Proc. IEEE 76(8), 1988, 917-934.
2. S. Bandi and D. Thalmann, A configuration space approach for efficient animation of human figures, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Puerto Rico, 1997, pp. 38-45.
3. J. K. Aggarwal and Q. Cai, Human motion analysis: A review, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Puerto Rico, 1997, pp. 90-102.
4. J. K. Aggarwal, Q. Cai, W. Liao, and B. Sabata, Non-rigid motion analysis: Articulated & elastic motion, CVGIP: Image Understanding, 1997.
5. J. Yamato, J. Ohya, and K. Ishii, Recognizing human action in time-sequential images using Hidden Markov Model, in Proc. IEEE Conf. CVPR, Champaign, IL, June 1992, pp. 379-385.
6. W. Kinzel, Pedestrian recognition by modelling their shapes and movements, in Progress in Image Analysis and Processing III, Proc. of the 7th Intl. Conf. on Image Analysis and Processing 1993, Singapore, 1994, pp. 547-554.
7. A. G. Bharatkumar, K. E. Daigle, M. G. Pandy, Q. Cai, and J. K. Aggarwal, Lower limb kinematics of human walking with the medial axis transformation, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Austin, TX, 1994, pp. 70-76.
8. S. A. Niyogi and E. H. Adelson, Analyzing and recognizing walking figures in XYT, in Proc. IEEE CS Conf. on CVPR, Seattle, WA, 1994, pp. 469-474.

9. G. Johansson, Visual motion perception, Sci. Am. 232(6), 1975, 76-88.
10. R. F. Rashid, Towards a system for the interpretation of moving light displays, IEEE Trans. PAMI 2(6), 1980, 574-581.
11. J. A. Webb and J. K. Aggarwal, Visually interpreting the motion of objects in space, IEEE Comput., August 1981, 40-46.
12. J. A. Webb and J. K. Aggarwal, Structure from motion of rigid and jointed objects, Artif. Intell. 19, 1982, 107-130.
13. A. F. Bobick and A. D. Wilson, A state-based technique for the summarization and recognition of gesture, in Proc. of 5th Intl. Conf. on Computer Vision, 1995, pp. 382-388.
14. L. W. Campbell and A. F. Bobick, Recognition of human body motion using phase space constraints, in Proc. of 5th Intl. Conf. on Computer Vision, 1995, pp. 624-630.
15. A. Shio and J. Sklansky, Segmentation of people in motion, in Proc. of IEEE Workshop on Visual Motion, October 1991, pp. 325-332.
16. B. K. P. Horn and B. G. Schunck, Determining optical flow, Artif. Intell. 17, 1981, 185-204.
17. S. Kurakake and R. Nevatia, Description and tracking of moving articulated objects, in 11th Intl. Conf. on Pattern Recognition, Hague, Netherlands, 1992, Vol. 1, pp. 491-495.
18. I. A. Kakadiaris, D. Metaxas, and R. Bajcsy, Active part-decomposition, shape and motion estimation of articulated objects: A physics-based approach, in Proc. IEEE CS Conf. on CVPR, Seattle, WA, 1994, pp. 980-984.
19. I. A. Kakadiaris and D. Metaxas, 3D human body model acquisition from multiple views, in Proc. of 5th Intl. Conf. on Computer Vision, 1995, pp. 618-623.
20. H. A. Rowley and J. M. Rehg, Analyzing articulated motion using expectation-maximization, in Proc. of Intl. Conf. on Pattern Recognition, Puerto Rico, 1997, pp. 935-941.
21. A. Jepson and M. Black, Mixture models for optical flow computation, in Proc. of Intl. Conf. on Pattern Recognition, New York City, 1993, pp. 760-761.

22. Z. Chen and H. J. Lee, Knowledge-guided visual perception of 3D human gait from a single image sequence, IEEE Trans. Systems, Man, Cybern. 22(2), 1992, 336-342.
23. E. Huber, 3D real-time gesture recognition using proximity spaces, in Proc. of Intl. Conf. on Pattern Recognition, Vienna, Austria, August 1996, pp. 136-141.
24. S. Iwasawa, K. Ebihara, J. Ohya, and S. Morishima, Real-time estimation of human body posture from monocular thermal images, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 15-20.
25. K. Akita, Image sequence analysis of real world human motion, Pattern Recognit. 17(1), 1984, 73-83.
26. F. J. Perales and J. Torres, A system for human motion matching between synthetic and real images based on a biomechanic graphical model, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Austin, TX, 1994, pp. 83-88.
27. M. K. Leung and Y. H. Yang, First sight: A human body outline labeling system, IEEE Trans. PAMI 17(4), 1995, 359-377.
28. J. O'Rourke and N. I. Badler, Model-based image analysis of human motion using constraint propagation, IEEE Trans. PAMI 2, 1980, 522-536.
29. D. Hogg, Model-based vision: A program to see a walking person, Image and Vision Computing 1(1), 1983, 5-20.
30. K. Rohr, Towards model-based recognition of human movements in image sequences, CVGIP: Image Understanding 59(1), 1994, 94-115.
31. D. Marr and H. K. Nishihara, Representation and recognition of the spatial organization of three-dimensional shapes, Proc. R. Soc. London, Ser. B, pp. 269-294.
32. S. Wachter and H. H. Nagel, Tracking of persons in monocular image sequences, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Puerto Rico, 1997, pp. 2-9.

    33. J. M. Rehg and T. Kanade, Model-based tracking of self-occluding articulated objects, in Proc. of 5th Intl. Conf. on Computer Vision, Cambridge, MA, 1995, pp. 612–617.

    34. L. Goncalves, E. D. Bernardo, E. Ursella, and P. Perona, Monocular tracking of the human arm in 3D, in Proc. of 5th Intl. Conf. on Computer Vision, Cambridge, MA, 1995, pp. 764–770.

    35. I. A. Kakadiaris and D. Metaxas, Model based estimation of 3D human motion with occlusion based on active multi-viewpoint selection, in Proc. IEEE CS Conf. on CVPR, San Francisco, CA, 1996, pp. 81–87.

    36. D. M. Gavrila and L. S. Davis, Towards 3-D model-based tracking and recognition of human movement: a multi-view approach, in Proc. Intl. Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, June 1995.

    37. D. M. Gavrila and L. S. Davis, 3-D model-based tracking of humans in action: A multi-view approach, in Proc. IEEE CS Conf. on CVPR, San Francisco, CA, 1996, pp. 73–80.

    38. H. G. Barrow, Parametric correspondence and chamfer matching: Two new techniques for image matching, in Proc. IJCAI, 1977, pp. 659–663.

    39. E. A. Hunter, P. H. Kelly, and R. C. Jain, Estimation of articulated motion using kinematically constrained mixture densities, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Puerto Rico, 1997, pp. 10–17.

    40. C. Bregler, Learning and recognizing human dynamics in video sequences, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 568–574.

    41. J. K. Aggarwal, L. S. Davis, and W. N. Martin, Correspondence processes in dynamic scene analysis, Proc. IEEE 69(5), 1981, 562–572.

    42. R. Polana and R. Nelson, Low level recognition of human motion (or how to get your man without finding his body parts), in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Austin, TX, 1994, pp. 77–82.

    440 AGGARWAL AND CAI

    43. Q. Cai, A. Mitiche, and J. K. Aggarwal, Tracking human motion in an indoor environment, in Proc. of 2nd Intl. Conf. on Image Processing, Washington, D.C., October 1995, Vol. 1, pp. 215–218.

    44. J. Segen and S. Pingali, A camera-based system for tracking people in real time, in Proc. of Intl. Conf. on Pattern Recognition, Vienna, Austria, August 1996, pp. 63–67.

    45. Y. Okawa and S. Hanatani, Recognition of human body motions by robots, in Proc. IEEE/RSJ Intl. Conf. Intelligent Robots and Systems, Raleigh, NC, 1992, pp. 2139–2146.

    46. M. Rossi and A. Bozzoli, Tracking and counting people, in 1st Intl. Conf. on Image Processing, Austin, Texas, November 1994, pp. 212–216.

    47. S. S. Intille and A. F. Bobick, Closed-world tracking, in Proc. Intl. Conf. Comp. Vis., 1995, pp. 672–678.

    48. S. S. Intille, J. W. Davis, and A. F. Bobick, Real-time closed-world tracking, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 697–703.

    49. C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, Pfinder: Real-time tracking of the human body, in Proc. SPIE, Bellingham, WA, 1995.

    50. A. Azarbayejani and A. Pentland, Real-time self-calibrating stereo person tracking using 3D shape estimation from blob features, in Proc. of Intl. Conf. on Pattern Recognition, Vienna, Austria, August 1996, pp. 627–632.

    51. B. Gunsel, A. M. Tekalp, and P. J. L. Beek, Object-based video indexing for virtual studio production, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 769–774.

    52. B. Heisele, U. Kreßel, and W. Ritter, Tracking non-rigid moving objects based on color cluster flow, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 257–260.

    53. J. Yang and A. Waibel, A real-time face tracker, in Proc. of IEEE Computer Society Workshop Applications on Computer Vision, Sarasota, FL, 1996, pp. 142–147.

    54. P. Fieguth and D. Terzopoulos, Color-based tracking of heads and other mobile objects at video frame rate, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 21–27.

    55. D. A. Forsyth and M. M. Fleck, Identifying nude pictures, in Proc. of IEEE Computer Society Workshop Applications on Computer Vision, Sarasota, FL, 1996, pp. 103–108.

    56. K. Sato, T. Maeda, H. Kato, and S. Inokuchi, CAD-based object tracking with distributed monocular camera for security monitoring, in Proc. 2nd CAD-Based Vision Workshop, Champion, PA, February 1994, p. 297.

    57. R. Jain and K. Wakimoto, Multiple perspective interactive video, in Proc. of Intl. Conf. on Multimedia Computing and Systems, 1995, pp. 202–211.

    58. P. H. Kelly, A. Katkere, D. Y. Kuramura, S. Moezzi, S. Chatterjee, and R. Jain, An architecture for multiple perspective interactive video, in Proc. of ACM Conf. on Multimedia, 1995, pp. 201–212.

    59. Q. Cai and J. K. Aggarwal, Tracking human motion using multiple cameras, in Proc. of Intl. Conf. on Pattern Recognition, Vienna, Austria, August 1996, pp. 68–72.

    60. Q. Cai and J. K. Aggarwal, Automatic tracking of human motion in indoor scenes across multiple synchronized video streams, in Intl. Conf. on Computer Vision, Bombay, India, January 1998.

    61. Y. Kuno and T. Watanabe, Automatic detection of human for visual surveillance system, in Proc. of Intl. Conf. on Pattern Recognition, Vienna, Austria, August 1996, pp. 865–869.

    62. J. Farmer, M. Casdagli, S. Eubank, and J. Gibson, State-space reconstruction in the presence of noise, Physica D 51, 1991, 52–98.

    63. N. H. Goddard, Incremental model-based discrimination of articulated movement from motion features, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Austin, TX, 1994, pp. 89–95.

    64. T. Starner and A. Pentland, Visual recognition of American sign language using hidden Markov models, in Proc. Intl. Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, June 1995.

    65. M. Brand, N. Oliver, and A. Pentland, Coupled hidden Markov models for complex action recognition, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 994–999.

    66. A. F. Bobick and J. Davis, Real-time recognition of activity using temporal templates, in Proc. of IEEE Computer Society Workshop Applications on Computer Vision, Sarasota, FL, 1996, pp. 39–42.

    67. Y. Cui and J. J. Weng, Hand segmentation using learning-based prediction and verification for hand sign recognition, in Proc. IEEE CS Conf. on CVPR, Puerto Rico, 1997, pp. 88–93.

    68. J. E. Boyd and J. J. Little, Global versus structured interpretation of motion: Moving light displays, in Proc. of IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Puerto Rico, 1997.

    69. A. B. Poritz, Hidden Markov models: A guided tour, in Proc. IEEE Intl. Conf. on Acoust., Speech, and Signal Proc., 1988, pp. 7–13.