
Model-based Human Posture Estimation for Gesture Analysis in an Opportunistic Fusion Smart Camera Network

Chen Wu and Hamid Aghajan

Department of Electrical Engineering, Stanford University, USA

    Abstract

In multi-camera networks, rich visual data is available both spatially and temporally. In this paper a method of human posture estimation is described that incorporates the concept of an opportunistic fusion framework, aiming to employ manifold sources of visual information across space, time, and feature levels. One motivation for the proposed method is to reduce raw visual data in a single camera to elliptical parameterized segments for efficient communication between cameras. A 3D human body model serves as the convergence point of spatiotemporal and feature fusion. It maintains both the geometric parameters of the human posture and adaptively learned appearance attributes, all of which are updated from the three dimensions of space, time, and features of the opportunistic fusion. When confidence levels are sufficient, the parameters of the 3D human body model are in turn fed back to aid subsequent in-node vision analysis. The color distribution registered in the model is used to initialize segmentation. Perceptually Organized Expectation Maximization (POEM) is then applied to refine the color segments with observations from a single camera. The geometric configuration of the 3D skeleton is estimated by Particle Swarm Optimization (PSO).

    1. Introduction

In a multi-camera network, access to multiple sources of visual data often allows for making more comprehensive interpretations of events and gestures. It also creates a pervasive sensing environment for applications where it is impractical for the users to wear sensors. Example applications include surveillance, smart home care, gaming, etc.

In this paper we propose a method of human posture estimation using an opportunistic fusion framework to employ manifold sources of information obtained from the camera network in a principled way. The framework spans the three dimensions of space (different camera views), time (each camera collecting data over time), and feature levels (selecting and fusing different feature subsets).

Our work aims for intelligent and efficient vision interpretation in a camera network. One underlying constraint of the network is its relatively low bandwidth. Therefore, for efficient collaboration between cameras, we expect concise descriptions instead of raw image data as the outputs of local processing in a single camera. This process inevitably removes certain details from the images of a single camera, which requires the camera to have some intelligence about its observations (smart cameras), i.e., some knowledge of the subject. This is one motivation for opportunistic data fusion between cameras, which compensates for partial observations in individual cameras: the output of opportunistic data fusion (a model of the subject) is fed to local processing. Conversely, the outputs of local processing in single cameras enable opportunistic data fusion by contributing local descriptions from multiple views. It is this interactive loop that brings the potential for achieving both efficient and adequate vision-based analysis in the camera network. An example of the communication model between five cameras to reconstruct the person's model is shown in Fig. 1; the circled numbers represent the sequence of events.

In our approach a 3D human body model embodies up-to-date information from both current and historical observations of all cameras in a concise way. It has the following components: 1. geometric configuration (body part lengths and angles); 2. color or texture of body parts; 3. motion of body parts. All three components are updated from the three dimensions of space, time, and features of the opportunistic fusion. The 3D human model takes on two roles. One is as an intermediate step for high-level, application-pertinent gesture interpretation; the other is to create a feedback path from spatiotemporal and feature fusion operations to low-level vision processing in each camera. It is true that for a number of gestures a human body model may not be needed to interpret the gesture. There is existing work on hand gesture recognition [1] where only part of the body is analyzed, and some gestures can be detected through spatiotemporal motion patterns of certain body parts [2, 3]. However, as the set of gestures to differentiate expands, it becomes increasingly difficult to devise methods for gesture recognition based on only a few cues. A 3D human body model provides a unified interface for a variety of gesture interpretations.


Figure 1: Communication for collaboration in the camera network. Camera 5 wants to update its knowledge of the subject: (1) it broadcasts a request for collaboration to CAM 1-4; (2) the other cameras send the requested descriptions (vectors of descriptions from local processing); (3) fusion updates the local knowledge of the subject; (4) the updated knowledge is fed back as a vector of model parameters; (5) the up-to-date knowledge of the subject is used in local processing at each camera.

On the other hand, instead of being a passive output representing decisions from spatiotemporal and feature fusion, the 3D model implicitly enables more interaction between the three dimensions by being actively involved in vision analysis. For example, although predefined appearance attributes are generally not reliable, adaptively learned appearance attributes can be used to identify the person or body parts. Such attributes are usually more distinguishable than generic features such as edges.

Fitting human models to images or videos has been an interesting topic for which a variety of methods have been developed. Some reconstruct 3D representations of human models from a single camera's view [4, 5]. Due to the self-occlusive nature of the human body, which causes ambiguity from a single view, most of these methods rely on a restricted dynamic model of behaviors; tracking can then easily fail for sudden motions or other movements that differ substantially from the dynamic model. In 3D model reconstruction from multi-view cameras [6, 7], most methods start from silhouettes in the different cameras, from which the points occupied by the subject are estimated, and finally a 3D model with the principal body parts is fit in the 3D space [8]. This approach relies heavily on the silhouettes obtained from each image and is also sensitive to the accuracy of camera calibration. However, in many situations background subtraction for silhouettes suffers in quality or is almost impossible due to cluttered backgrounds or camouflaged foregrounds. Another aspect of the human model fitting problem is the choice of image features. All human model fitting methods are based on some image features as targets to fit the model. Most are based on generic features such as silhouettes or edges [9, 7]. Some use skin color, but such methods are prone to failure since lighting usually has a strong influence on color and skin color varies from person to person.

In this paper, we first introduce the opportunistic fusion framework as well as an implementation of its concepts for human gesture analysis in Section 2. In Section 3, image segmentation in a single camera is described in detail. The color distribution maintained in the model is used to initialize segmentation. Perceptually Organized Expectation Maximization (POEM) is then applied to refine the color segments with observations from a single camera, followed by a watershed algorithm that assigns segment labels to all pixels based on spatial relationships. Finally, ellipse fitting is used to parameterize the segments in order to create concise segment descriptions for communication. In Section 4, Particle Swarm Optimization (PSO) is used for 3D model fitting, and examples demonstrate the capability of the elliptical segments for posture estimation.

2. Opportunistic Fusion for Human Gesture Analysis

We propose a framework of opportunistic fusion in multi-camera networks in order to both employ the rich visual information provided by the cameras and incorporate learned knowledge of the subject into active vision analysis. The opportunistic fusion framework is composed of three dimensions: space, time, and feature levels. In the rest of the paper, the problem of human gesture analysis is elaborated to show how these concepts can be implemented.

    2.1. The Fusion Framework Overview

The opportunistic fusion framework for gesture analysis is shown in Fig. 2. At the top of Fig. 2 are the spatial fusion modules; in parallel is the progression of the 3D human body model. Suppose at time t0 we have the model with the collection of parameters M0. At the next instant t1, the current model M0 is input to the spatial fusion module for t1, and the output decisions are used to update M0, from which we get the new 3D model M1.

Now we look into a specific spatial fusion module (the lower part of Fig. 2) for the detailed process. In the lowest level of the layered gesture analysis, image features are extracted by local processing. No explicit collaboration between cameras is done at this stage, since communication is not expected until images/videos are reduced to short descriptions. Distinct features (e.g., colors) specific to the subject are registered in the current model M0 and used for analysis (arrow 1 in Fig. 2). The intuition is that we adaptively learn which attributes distinguish the subject, save them as marks in the 3D model, and then use those marks to look for the subject. After local processing, data is shared between cameras to derive a new estimate of the model.


Figure 2: Spatiotemporal fusion framework for human gesture analysis. Four description layers (1: images; 2: features; 3: gesture elements; 4: gestures) are paired with three decision layers (1: within a single camera; 2-3: collaboration between cameras). Along the time axis, the 3D human model is updated from the model history and new observations: the old model supports active vision in local processing (temporal fusion, arrow 1), decisions from local processing and spatial collaboration in the camera network are fed back to update the model (spatial fusion, arrow 2), and the updated model yields gesture interpretations (arrow 3).

Parameters in M0 specify a smaller space of possible M1's. Decisions from the spatial fusion of cameras are then used to update M0 to obtain the new model M1 (arrow 2 in Fig. 2). Therefore, every update of the model M combines space (spatial collaboration between cameras), time (the previous model M0), and feature levels (the choice of image features in local processing, from both new observations and subject-specific attributes in M0). Finally, the new model M1 is used for high-level gesture deductions in a given scenario (arrow 3 in Fig. 2).

2.2. 3D Body Model Reconstruction Overview

An implementation of 3D human body posture estimation is presented in this paper. Elements of the opportunistic fusion framework described above are incorporated in the algorithm as illustrated in Fig. 3. Local processing in a single camera includes segmentation and ellipse fitting for a concise parameterization of segments. We assume the 3D model is initialized with a distinct color distribution for the subject. For each camera, the color distribution is first refined using the EM (Expectation Maximization) algorithm and then used for segmentation. Pixels left undetermined by EM are assigned labels through watershed segmentation. For spatial collaboration, the ellipses from all cameras are merged to find the geometric configuration of the 3D skeleton model. Candidate configurations are examined using PSO (Particle Swarm Optimization). Details and experimental results are presented in Sections 3 and 4.

    3. In-Node Feature Extraction

The goal of local processing in a single camera is to reduce raw images/videos to simple descriptions so that they can be efficiently transmitted between the cameras. The output of the algorithm is the set of ellipses fitted to the segments together with the mean color of each segment. As shown in the upper part of Fig. 3, local processing includes image segmentation for the subject and ellipse fitting to the extracted segments.
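To make the bandwidth argument concrete, the sketch below packs one segment descriptor (five ellipse parameters plus a mean color) into a fixed-size binary message. The paper does not specify a wire format, so the field layout here is purely a hypothetical illustration:

```python
import struct

# Hypothetical wire format for one segment descriptor: the paper states only
# that ellipses and mean segment colors are transmitted, so the exact fields
# and packing below are illustrative assumptions.
def pack_segment(cx, cy, major, minor, angle, mean_rgb):
    """Pack one elliptical segment into 23 bytes:
    5 float32 ellipse parameters + 3 uint8 color channels."""
    return struct.pack("<5f3B", cx, cy, major, minor, angle, *mean_rgb)

def unpack_segment(buf):
    cx, cy, major, minor, angle, r, g, b = struct.unpack("<5f3B", buf)
    return (cx, cy, major, minor, angle), (r, g, b)

msg = pack_segment(120.5, 88.0, 40.2, 15.7, 0.61, (173, 54, 42))
assert len(msg) == 23  # versus thousands of bytes for the raw segment pixels
```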

We assume the subject is characterized by a distinct color distribution. The foreground area is obtained through background subtraction. Pixels with very high or very low illumination are also removed, since their chrominance may not be reliable. Then a rough segmentation of the foreground is done based either on K-means clustering of the chrominance of the foreground pixels, or on the color distributions from the model. In the initialization stage, when the model has not yet been well established or when we do not have high confidence in the model, we need to start from the image itself and use a method such as K-means to find the color distribution of the subject. However, when a model with a reliable color distribution is available, we can directly assign pixels to segments based on the existing color distribution. In practice, the color distribution maintained by the model may not be uniformly accurate for all cameras due to effects such as color map changes or illumination differences. The subject's appearance may also change within a single camera due to movement or lighting conditions. Therefore, the color distribution of the model is only used for a rough initialization of the segmentation; an EM algorithm then refines the color distribution for the current image. The initial color distribution estimate plays an important role because it can prevent EM from being trapped in local minima.
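A minimal sketch of the rough initialization step is given below, assuming OpenCV, a YCrCb chrominance representation, and illustrative luminance thresholds (the paper names none of these specifics):

```python
import cv2
import numpy as np

def rough_color_init(image_bgr, fg_mask, n_modes, lum_lo=30, lum_hi=225):
    """Cluster foreground chrominance with K-means to seed the color modes.
    The color space (YCrCb) and thresholds are illustrative assumptions."""
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    # Drop pixels whose luminance makes chrominance unreliable.
    valid = (fg_mask > 0) & (y > lum_lo) & (y < lum_hi)
    samples = np.float32(np.stack([cr[valid], cb[valid]], axis=1))
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(samples, n_modes, None, criteria,
                                    5, cv2.KMEANS_PP_CENTERS)
    return labels.ravel(), centers  # per-pixel mode labels, mode centers
```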

Figure 3: Algorithm flowchart for 3D human skeleton model reconstruction. Local processing performs background subtraction, rough segmentation, EM refinement of the color models, watershed segmentation, and ellipse fitting. To combine the three views into the 3D skeleton geometric configuration, test configurations are generated, scored, and updated using PSO until the stop criterion is met. The 3D human body model maintains the current model (color/texture, motion) and supplies the previous color distribution and the previous geometric configuration and motion.

Suppose the color distribution is a mixture of $N$ Gaussian modes with parameters $\Theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$, where $\theta_l = \{\mu_l, \Sigma_l\}$ are the mean and covariance matrix of mode $l$, and with mixing weights $A = \{\alpha_1, \alpha_2, \ldots, \alpha_N\}$. We need to find the probability of each pixel $x_i$ belonging to a certain mode $l$: $\Pr(y_i = l \mid x_i)$. From standard EM for Gaussian Mixture Models (GMM) we have the E step:

$$\Pr^{(k+1)}(y_i = l \mid x_i) \propto \alpha_l^{(k)} P_l^{(k)}(x_i), \quad l = 1, \ldots, N, \qquad \sum_{l=1}^{N} \Pr^{(k+1)}(y_i = l \mid x_i) = 1, \qquad (1)$$

and the M step:

$$\mu_l^{(k+1)} = \frac{\sum_{i=1}^{M} x_i \Pr(y_i = l \mid x_i, \Theta^{(k)})}{\sum_{i=1}^{M} \Pr(y_i = l \mid x_i, \Theta^{(k)})}, \qquad (2)$$

$$\Sigma_l^{(k+1)} = \frac{\sum_{i=1}^{M} (x_i - \mu_l^{(k)})(x_i - \mu_l^{(k)})^T \Pr(y_i = l \mid x_i, \Theta^{(k)})}{\sum_{i=1}^{M} \Pr(y_i = l \mid x_i, \Theta^{(k)})}, \qquad (3)$$

and

$$\alpha_l^{(k+1)} = \frac{1}{M} \sum_{x_i} \Pr^{(k+1)}(y_i = l \mid x_i), \qquad (4)$$

where $k$ is the iteration number, and the M step is obtained by maximizing the log-likelihood

$$L(x; \Theta) = \sum_{i=1}^{M} \sum_{l=1}^{N} \Pr(y_i = l \mid x_i) \log \Pr(x_i \mid \theta_l). \qquad (5)$$
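For reference, one iteration of this standard EM recursion, Eqs. (1)-(4), can be written compactly with NumPy; the vectorization below is an implementation choice of ours, not prescribed by the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(x, alpha, mu, sigma):
    """One EM iteration for an N-mode GMM over M pixel colors x (M x d).
    alpha: (N,) mixing weights; mu: (N, d) means; sigma: (N, d, d) covariances."""
    M, N = x.shape[0], alpha.shape[0]
    # E step, Eq. (1): responsibilities proportional to alpha_l * P_l(x_i).
    resp = np.stack([alpha[l] * multivariate_normal.pdf(x, mu[l], sigma[l])
                     for l in range(N)], axis=1)
    resp = resp / np.maximum(resp.sum(axis=1, keepdims=True), 1e-300)
    # M step, Eqs. (2)-(4): responsibility-weighted updates.
    nl = resp.sum(axis=0)                   # effective pixel count per mode
    mu_new = (resp.T @ x) / nl[:, None]     # Eq. (2)
    sigma_new = np.empty_like(sigma)
    for l in range(N):                      # Eq. (3), with the previous means
        d = x - mu[l]
        sigma_new[l] = (resp[:, l, None] * d).T @ d / nl[l]
    alpha_new = nl / M                      # Eq. (4)
    return resp, alpha_new, mu_new, sigma_new
```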

However, this basic EM algorithm treats each pixel independently, without considering the fact that pixels belonging to the same mode are usually spatially close to each other. In [10] Perceptually Organized EM (POEM) is introduced. In POEM, the influence of neighbors is incorporated by a weighting measure:

$$w(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma_1^2}} \, e^{-\frac{\|s(x_i) - s(x_j)\|^2}{2\sigma_2^2}}, \qquad (6)$$

where $s(x_i)$ is the spatial coordinate of $x_i$. Then, the votes for $x_i$ from its neighborhood are given by

$$V_l(x_i) = \sum_{x_j} \pi_l(x_j) \, w(x_i, x_j), \quad \text{where } \pi_l(x_j) = \Pr(y_j = l \mid x_j). \qquad (7)$$

Based on this voting scheme, the following modifications are made to the EM steps. In the E step, $\alpha_l^{(k)}$ is replaced by $\alpha_l^{(k)}(x_i)$; that is, every pixel $x_i$ has its own mixing weights over the modes, reflecting in part the influence of its neighbors. In the M step, the mixing weights are updated by

$$\alpha_l^{(k)}(x_i) = \frac{e^{\beta V_l(x_i)}}{\sum_{k'=1}^{N} e^{\beta V_{k'}(x_i)}}, \qquad (8)$$

in which $\beta$ controls the softness of the neighbors' votes. If $\beta$ is 0, the mixing weights are always uniform; as $\beta$ approaches infinity, the mixing weight of the mode with the largest vote goes to 1.

Figure 4: Ellipse fitting. (a) original image; (b) segments; (c) simple ellipse fitting to connected regions; (d) improved ellipse fitting.
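A sketch of the POEM vote computation, Eqs. (6)-(8), is shown below, assuming the votes are truncated to a small square window around each pixel (the paper does not state a neighborhood size):

```python
import numpy as np

def poem_mixing_weights(resp, colors, sigma1, sigma2, beta, radius=2):
    """Per-pixel mixing weights from neighborhood votes, Eqs. (6)-(8).
    resp:   (H, W, N) current responsibilities Pr(y_j = l | x_j)
    colors: (H, W, d) pixel feature vectors
    radius: half-width of the voting window (an assumed truncation)."""
    H, W, N = resp.shape
    votes = np.zeros((H, W, N))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            # np.roll wraps at image borders; a real implementation would mask edges.
            shifted_c = np.roll(colors, (dy, dx), axis=(0, 1))
            shifted_r = np.roll(resp, (dy, dx), axis=(0, 1))
            color_d2 = ((colors - shifted_c) ** 2).sum(axis=2)
            spatial_d2 = float(dy * dy + dx * dx)   # ||s(x_i) - s(x_j)||^2
            w = np.exp(-color_d2 / (2 * sigma1 ** 2)
                       - spatial_d2 / (2 * sigma2 ** 2))    # Eq. (6)
            votes += w[:, :, None] * shifted_r              # Eq. (7)
    # Eq. (8): per-pixel softmax over modes, with a stability shift.
    e = np.exp(beta * (votes - votes.max(axis=2, keepdims=True)))
    return e / e.sum(axis=2, keepdims=True)
```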

After refinement of the color distribution with POEM, we set pixels with a high probability (e.g., larger than 99.9%) of belonging to a certain mode as markers for that mode. A watershed segmentation algorithm is then applied to assign labels to the undecided pixels. Finally, in order to obtain a concise parameterization of each segment, an ellipse is fitted to it. Note that a segment refers to a spatially connected region of the same mode; a single mode can therefore have several segments. When the segment is generally convex and has a shape similar to an ellipse, the fitted ellipse represents the segment well. However, when the segment's shape differs considerably from an ellipse, a direct fitting step may not be sufficient. To address such cases, we first test the similarity between the segment and an ellipse by fitting an ellipse to the segment and measuring their overlap. If the similarity is low, the segment is split into two, and this process is carried out recursively on every segment until all segments meet the similarity criterion. In Fig. 4, direct ellipse fitting to every segment yields Fig. 4(c), while the test-and-split procedure produces the correct ellipses shown in Fig. 4(d). Experimental results are shown in Fig. 5.
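The test-and-split procedure can be sketched with OpenCV primitives as follows; the 50% overlap threshold and the coordinate K-means used to split a poorly fitted segment are our assumptions, since the paper specifies neither:

```python
import cv2
import numpy as np

def fit_ellipses(segment_mask, min_overlap=0.5):
    """Recursively fit ellipses to one segment mask, splitting when the fit is poor."""
    results, stack = [], [segment_mask]
    while stack:
        mask = stack.pop()
        pts = cv2.findNonZero(mask)
        if pts is None or len(pts) < 5:       # fitEllipse needs >= 5 points
            continue
        ellipse = cv2.fitEllipse(pts)
        drawn = np.zeros_like(mask)
        cv2.ellipse(drawn, ellipse, 255, -1)  # filled ellipse region
        inter = np.count_nonzero(cv2.bitwise_and(mask, drawn))
        union = np.count_nonzero(cv2.bitwise_or(mask, drawn))
        if inter / union >= min_overlap:      # segment is ellipse-like enough
            results.append(ellipse)
        else:                                 # split in two and recurse
            xy = np.float32(pts.reshape(-1, 2))
            criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
            _, labels, _ = cv2.kmeans(xy, 2, None, criteria, 3,
                                      cv2.KMEANS_PP_CENTERS)
            for k in (0, 1):
                part = np.zeros_like(mask)
                sel = xy[labels.ravel() == k].astype(np.int32)
                part[sel[:, 1], sel[:, 0]] = 255
                stack.append(part)
    return results  # list of ((cx, cy), (major, minor), angle)
```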


Figure 5: Experiment results of local processing. (a) original images; (b) segments; (c) fitted ellipses.

4. Collaborative Posture Estimation

Human posture estimation is essentially treated as an optimization problem, in which we aim to minimize the distance between the posture and the ellipses from the multiple cameras. There are several ways to find the 3D skeleton model from observations in multi-view images. One method is to solve directly for the unknown parameters through geometric calculation. This requires first establishing correspondences between points or segments in different cameras, which is itself a hard problem: commonly observed points are rare for the human body, and body parts may take on very different appearances from different views, so it is difficult to resolve ambiguity in 3D space based on 2D observations. A second method is to cast this as an optimization problem, in which we find the optimal $\theta_i$'s and $\phi_i$'s minimizing an objective function (e.g., the difference between the projections of a candidate 3D model and the actual segments) by exploiting properties of the objective function; however, if the problem is highly nonlinear or non-convex, it may be very difficult or time-consuming to solve. Therefore, search strategies that do not explicitly depend on the formulation of the objective function are desirable.

Motivated by [11, 12], Particle Swarm Optimization (PSO) is used for our optimization problem. The lower part of Fig. 3 shows the estimation process. Ellipses from the local processing of single cameras are merged to reconstruct the skeleton (Fig. 6). Here we consider a simplified problem in which only the arms change position while the other body parts are kept in their default locations. The elevation angles ($\theta_i$) and azimuth angles ($\phi_i$) of the left/right upper/lower arm parts are the parameters (Fig. 6(b)). We assume that the projection matrices from the 3D skeleton to the 2D image planes are known.

Figure 6: 3D skeleton model fitting. (a) Top view of the experiment setting: CAM 1, CAM 2, and CAM 3 placed around the person. (b) The 3D skeleton reconstructed from the ellipses of the multi-view cameras.

They can be obtained either from the locations of the cameras and the subject, or calculated from known projective correspondences between the 3D subject and points in the images, without knowing the exact locations of the cameras or the subject.
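Scoring a candidate configuration then involves the standard homogeneous projection of each 3D joint into each view before comparison with the ellipses; a minimal sketch (the function and argument names are illustrative):

```python
import numpy as np

def project_points(P, joints_3d):
    """Project 3D joint positions into an image with a 3x4 projection matrix P.
    joints_3d: (K, 3) array of skeleton joint coordinates."""
    homog = np.hstack([joints_3d, np.ones((joints_3d.shape[0], 1))])  # (K, 4)
    uvw = homog @ P.T                                                 # (K, 3)
    return uvw[:, :2] / uvw[:, 2:3]   # pixel coordinates (u, v)
```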

PSO is suited to posture estimation as an evolutionary optimization mechanism. It starts from a group of initial particles; during the evolution, the particles are directed toward good positions while keeping some randomness to explore the search space. Suppose there are $N$ particles (test configurations) $x_i$, each a vector of $\theta$'s and $\phi$'s, and let $v_i$ denote the velocity of $x_i$. Let $\hat{x}_i$ be the best position of $x_i$ so far, and $\hat{g}$ the global best position over all $x_i$'s so far. The objective function is $f(\cdot)$, for which we wish to find the position $x$ minimizing $f(x)$. The PSO algorithm is as follows:

1. Initialize $x_i$ and $v_i$. The value of $v_i$ is usually set to 0, and $\hat{x}_i = x_i$. Evaluate $f(x_i)$ and set $\hat{g} = \arg\min_{\hat{x}_i} f(\hat{x}_i)$.

2. While the stop criterion is not satisfied, do for every $x_i$:

$$v_i \leftarrow \omega v_i + c_1 r_1 \odot (\hat{x}_i - x_i) + c_2 r_2 \odot (\hat{g} - x_i), \qquad x_i \leftarrow x_i + v_i.$$

If $f(x_i) < f(\hat{x}_i)$, set $\hat{x}_i = x_i$; if $f(\hat{x}_i) < f(\hat{g})$, set $\hat{g} = \hat{x}_i$.

Stop criterion: after all $N$ particles have been updated once, if the improvement in $f(\hat{g})$ falls below a threshold, the algorithm exits.

Figure 7: Experiment results for 3D skeleton reconstruction. Original images from the 3 camera views and the estimated skeletons are shown.

Here $\omega$ is the inertia coefficient, while $c_1$ and $c_2$ are the social coefficients; $r_1$ and $r_2$ are random vectors with each element uniformly distributed on $[0, 1]$. The choice of $\omega$, $c_1$, and $c_2$ controls the convergence of the evolution. If $\omega$ is large, the particles have more inertia and tend to keep their own directions to explore the search space, which gives a better chance of finding the true global optimum when the group of particles is currently around a local optimum. If $c_1$ and $c_2$ are large, the particles are more social and move quickly toward the best positions known to the group. In our experiment, $N = 16$, $\omega = 0.3$, and $c_1 = c_2 = 1$.
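A compact sketch of this PSO loop with the stated parameters is given below; the initialization spread, stopping threshold, and iteration cap are illustrative assumptions not given in the paper:

```python
import numpy as np

def pso_minimize(f, init_center, spread=0.2, n_particles=16,
                 omega=0.3, c1=1.0, c2=1.0, tol=1e-4, max_iter=200):
    """PSO over joint-angle vectors. Particles are seeded around the previous
    posture (init_center), as the paper does; spread, tol, and max_iter are
    our own assumptions."""
    dim = init_center.size
    x = init_center + spread * np.random.randn(n_particles, dim)
    v = np.zeros_like(x)                     # velocities start at 0
    pbest = x.copy()                         # per-particle best positions
    pbest_f = np.array([f(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()       # global best position
    g_f = pbest_f.min()
    for _ in range(max_iter):
        r1 = np.random.rand(n_particles, dim)   # element-wise random factors
        r2 = np.random.rand(n_particles, dim)
        v = omega * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        fx = np.array([f(p) for p in x])
        better = fx < pbest_f                # update per-particle bests
        pbest[better], pbest_f[better] = x[better], fx[better]
        new_g_f = pbest_f.min()
        improvement = g_f - new_g_f
        if new_g_f < g_f:                    # update global best
            g, g_f = pbest[pbest_f.argmin()].copy(), new_g_f
        if improvement < tol:                # stop: improvement below threshold
            break
    return g, g_f
```

Here `f` would evaluate a candidate angle vector by projecting the resulting skeleton into each view and measuring its distance to the observed ellipses.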

Like other search techniques, PSO is likely to converge to a local optimum unless the initial particles are chosen carefully. In the experiment we assume that the 3D skeleton does not change much over one time interval. Therefore, at time $t_1$ the search space formed by the particles is centered around the optimal geometric configuration found at time $t_0$; that is, temporal consistency of postures is used to initialize the particles. Examples showing images from the 3 views and the corresponding posture estimates are given in Fig. 7.

    5. Conclusion and Future Work

There are two main motivations for our work on gesture analysis in a multi-camera network. One is to reduce image data to short descriptions through local processing in each camera, for efficient communication among cameras; the other is to exploit the consistency and distinctiveness of the subject through opportunistic fusion of information across space, time, and feature levels. We studied the use of a 3D human model to maintain both geometric and appearance parameters of the subject. In some of our experiments the problem of PSO converging to a local minimum persists, especially when a sudden move places the initial search space relatively far from the new posture. Future work includes defining a more versatile model and using more information from local features of the cameras to better initialize the search space. The current method assumes calibrated cameras; however, we found that fitting is sensitive to the accuracy of calibration. Since accurate camera calibration is not always practical in posture recognition applications, we are exploring solutions based on uncalibrated cameras.

    References

[1] Andrew D. Wilson and Aaron F. Bobick, "Parametric hidden Markov models for gesture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pp. 884-900, 1999.

[2] Yanxi Liu, Robert Collins, and Yanghai Tsin, "Gait sequence analysis using frieze patterns," in Proceedings of the 7th European Conference on Computer Vision (ECCV '02), May 2002.

[3] Y. Rui and P. Anandan, "Segmenting visual actions based on spatio-temporal motion patterns," in CVPR '00.

[4] Hedvig Sidenbladh, Michael J. Black, and Leonid Sigal, "Implicit probabilistic models of human motion for synthesis and tracking," in ECCV '02: Proceedings of the 7th European Conference on Computer Vision, Part I, London, UK, 2002, pp. 784-800, Springer-Verlag.

[5] J. Deutscher, A. Blake, and I. D. Reid, "Articulated body motion capture by annealed particle filtering," in CVPR '00, 2000, pp. II:126-133.

[6] Kong Man Cheung, Simon Baker, and Takeo Kanade, "Shape-from-silhouette across time, part II: Applications to human modeling and markerless motion tracking," International Journal of Computer Vision, vol. 63, no. 3, pp. 225-245, August 2005.

[7] Clement Menier, Edmond Boyer, and Bruno Raffin, "3D skeleton-based body pose recovery," in Proceedings of the 3rd International Symposium on 3D Data Processing, Visualization and Transmission, Chapel Hill, USA, June 2006.

[8] Ivana Mikic, Mohan Trivedi, Edward Hunter, and Pamela Cosman, "Human body model acquisition and tracking using voxel data," International Journal of Computer Vision, vol. 53, no. 3, pp. 199-223, 2003.

[9] H. Sidenbladh and M. J. Black, "Learning the statistics of people in images and video," IJCV, vol. 54, no. 1-3, pp. 183-209, August 2003.

[10] Y. Weiss and E. Adelson, "Perceptually organized EM: A framework for motion segmentation that combines information about form and motion," Tech. Rep. 315, M.I.T. Media Lab, 1995.

[11] S. Ivecovic and E. Trucco, "Human body pose estimation with PSO," in IEEE Congress on Evolutionary Computation, 2006, pp. 1256-1263.

[12] C. Robertson and E. Trucco, "Human body posture via hierarchical evolutionary optimization," in BMVC '06, 2006, p. III:999.
