Model-based Human Posture Estimation for Gesture Analysis
in an Opportunistic Fusion Smart Camera Network
Chen Wu and Hamid Aghajan
Department of Electrical Engineering
Stanford University, USA
Abstract
In multi-camera networks rich visual data is provided both
spatially and temporally. In this paper a method of human
posture estimation is described incorporating the concept of
an opportunistic fusion framework aiming to employ mani-
fold sources of visual information across space, time, and
feature levels. One motivation for the proposed method
is to reduce raw visual data in a single camera to ellipti-
cal parameterized segments for efficient communication be-
tween cameras. A 3D human body model is employed as
the convergence point of spatiotemporal and feature fusion.
It maintains both geometric parameters of the human pos-
ture and the adaptively learned appearance attributes, all
of which are updated from the three dimensions of space,
time, and features of the opportunistic fusion. At sufficient
confidence levels, parameters of the 3D human body model
are again used as feedback to aid subsequent in-node vision
analysis. Color distribution registered in the model is used
to initialize segmentation. Perceptually Organized Expectation Maximization (POEM) is then applied to refine color
segments with observations from a single camera. Geomet-
ric configuration of the 3D skeleton is estimated by Particle
Swarm Optimization (PSO).
1. Introduction
In a multi-camera network, access to multiple sources of
visual data often allows for making more comprehensive
interpretations of events and gestures. It also creates a per-
vasive sensing environment for applications where it is im-
practical for the users to wear sensors. Example applica-
tions include surveillance, smart home care, gaming, etc.
In this paper we propose a method of human posture esti-
mation using an opportunistic fusion framework to employ
manifold sources of information obtained from the camera
network in a principled way. The framework spans three
dimensions of space (different camera views), time (each
camera collecting data over time), and feature levels (se-
lecting and fusing different feature subsets).
Our work aims for intelligent and efficient vision inter-
pretations in a camera network. One underlying constraint
of the network is the relatively low bandwidth. Therefore,
for efficient collaboration between cameras, we expect con-
cise descriptions instead of raw image data as outputs from
local processing in a single camera. This process inevitably
removes certain details in images of a single camera, which
requires the camera to have some intelligence on its ob-
servations (smart cameras) , i.e., some knowledge of thesubject. This derives one of the motivations for opportunis-
tic data fusion between cameras, which compensates for
partial observations in individual cameras. The output
from opportunistic data fusion (a model of the subject) is
fed back to local processing. On the other hand, outputs of local
processing in single cameras enable opportunistic data fu-
sion by contributing local descriptions from multiple views.
This interactive loop creates the potential for
achieving both efficient and adequate vision-based analy-
sis in the camera network. An example of the communica-
tion model between five cameras to reconstruct the person's
model is shown in Fig. 1. The circled numbers represent the
sequence of events.
In our approach a 3D human body model embodies up-
to-date information from both current and historical obser-
vations of all cameras in a concise way. It has the follow-
ing components: 1. Geometric configuration: body part
lengths, angles. 2. Color or texture of body parts. 3. Mo-
tion of body parts. The three components are all updated
from the three dimensions of space, time, and features of
the opportunistic fusion. The 3D human model takes up
two roles. One is as an intermediate step for high-level
application-pertinent gesture interpretation, the other is to
create a feedback path from spatiotemporal and feature fu-
sion operations to low-level vision processing in each cam-
era. It is true that for a number of gestures a human body
model may not be needed to interpret the gesture. There is
existing work for hand gesture recognition [1] where only
part of the body is analyzed. Some gestures can also be
detected through spatiotemporal motion patterns of some
body parts [2, 3]. However, as the set of gestures to differ-
entiate expands, it becomes increasingly difficult to devise
methods for gesture recognition based on only a few cues.
A 3D human body model provides a unified interface for
[Figure 1: cameras CAM 1-5; the circled numbers mark the sequence: (1) Camera 5 broadcasts a request for collaboration to update its knowledge of the subject; (2) the other cameras send the requested descriptions (vectors of descriptions from local processing); (3) fusion updates the local knowledge of the subject; (4) the updated knowledge of the subject is fed back (a vector of model parameters); (5) the up-to-date knowledge of the subject is used in local processing.]
Figure 1: Communication for collaboration in the camera
network.
a variety of gesture interpretations. On the other hand, in-
stead of being a passive output to represent decisions from
spatiotemporal and feature fusion, the 3D model implic-
itly enables more interactions between the three dimensions
by being actively involved in vision analysis. For exam-
ple, although predefined appearance attributes are generally
not reliable, adaptively learned appearance attributes can be
used to identify the person or body parts. Those attributes
are usually more distinguishable than generic features such
as edges.
Fitting human models to images or videos has been an
interesting topic for which a variety of methods have been
developed. Some reconstruct 3D representations of human
models from a single camera's view [4, 5]. Due to the self-
occlusive nature of the human body, which causes ambiguity
from a single view, most of these methods rely on a restricted
dynamic model of behaviors. But tracking can easily fail in
the case of sudden motions or other movements that differ greatly
from the dynamic model. In 3D model reconstruction from
multi-view cameras [6, 7], most methods start from silhou-
ettes in different cameras, from which points occupied by
the subject are estimated, and finally a 3D model with prin-
cipal body parts is fit in the 3D space [8]. This approach
heavily relies on the silhouettes obtained from each image.
It is also sensitive to the accuracy of camera calibration.
However, in many situations background subtraction for sil-
houettes suffers in quality or is almost impossible due to a
cluttered background or a camouflaged foreground. Another
aspect of the human model fitting problem is the choice of
image features. All human model fitting methods are based
on some image features as targets to fit the model. Most
of them are based on generic features such as silhouettes
or edges [9, 7]. Some use skin color, but such methods are
prone to failure in many situations, since lighting usually has
a strong influence on colors and skin color varies from person
to person.
In this paper, we first introduce the opportunistic fusion
framework as well as an implementation of its concepts
through human gesture analysis in Section 2. In Section
3, image segmentation in a single camera is described in detail. The color distribution maintained in the model is used
to initialize segmentation. Perceptually Organized Expec-
tation Maximization (POEM) is then applied to refine color
segments with observations from a single camera, followed
by a watershed algorithm to assign segment labels to all pix-
els based on spatial relationships. Finally, ellipse fitting is
used to parameterize segments in order to create concise
segment descriptions for communication. In Section 4, Par-
ticle Swarm Optimization (PSO) is used for 3D model fit-
ting. Examples are shown to demonstrate the capability of
the elliptical segments for posture estimation.
2. Opportunistic Fusion for Human
Gesture Analysis
We propose a framework of opportunistic fusion in multi-
camera networks in order to both employ the rich visual
information provided by cameras and incorporate learned
knowledge of the subject into active vision analysis. The
opportunistic fusion framework is composed of three di-
mensions: space, time, and feature levels. In the rest of the
paper, the problem of human gesture analysis is elaborated
on to show how those concepts can be implemented.
2.1. The Fusion Framework Overview
The opportunistic fusion framework for gesture analysis is
shown in Fig. 2. On the top of Fig. 2 are spatial fusion
modules. In parallel is the progression of the 3D human
body model. Suppose at time t0 we have the model with
the collection of parameters as M0. At the next instant t1,
the current model M0 is input to the spatial fusion module
for t1, and the output decisions are used to update M0 from
which we get the new 3D model M1.
Now we look into a specific spatial fusion module (the
lower part of Fig. 2) for the detailed process. In the low-
est level of the layered gesture analysis, image features are
extracted by local processing. No explicit collaboration be-
tween cameras is done in this stage since communication
is not expected until images/videos are reduced to short de-
scriptions. Distinct features (e.g., colors) specific to the
subject are registered in the current model M0 and are used
for analysis (arrow 1 in Fig. 2). The intuition here is that we
adaptively learn which attributes distinguish the
subject, save them as marks in the 3D model, and then use
those marks to look for the subject. After local process-
ing, data is shared between cameras to derive a new estimate
[Figure 2 shows four description layers (1: images; 2: features; 3: gesture elements; 4: gestures) mapped to three decision layers (1: within a single camera; 2 and 3: collaboration between cameras), alongside the 3D human model progressing over time: the old model is updated to the new model through model history and new observations via local processing and spatial collaboration in the camera network. Arrow 1 denotes active vision (temporal fusion), arrow 2 the decision feedback that updates the model (spatial fusion), and arrow 3 the mapping from model to gesture interpretations.]
Figure 2: Spatiotemporal fusion framework for human ges-
ture analysis.
of the model. Parameters in M0 specify a smaller space of
possible M1s. Then decisions from spatial fusion of cam-
eras are used to update M0 to obtain the new model M1
(arrow 2 in Fig. 2). Therefore, every update of the model M
combines space (spatial collaboration between cameras),
time (the previous model M0), and feature levels (choice of
image features in local processing from both new observations
and subject-specific attributes in M0). Finally, the new model
M1 is used for high-level gesture deductions in a certain
scenario (arrow 3 in Fig. 2).
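To make one update cycle concrete, the sketch below walks through the loop under assumed names: BodyModel, local_processing, and the trivial averaging fusion rule are illustrative stand-ins, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class BodyModel:
    angles: np.ndarray       # geometric configuration (body part angles)
    part_colors: np.ndarray  # adaptively learned appearance (mean color per part)
    velocity: np.ndarray     # motion of body parts

def local_processing(view: np.ndarray, model: BodyModel) -> np.ndarray:
    # Feature level (arrow 1): reduce raw pixels to a concise description,
    # guided by the subject-specific attributes stored in the model.
    return view  # placeholder for ellipse parameters extracted from the view

def fusion_cycle(m0: BodyModel, views: List[np.ndarray]) -> BodyModel:
    descriptions = [local_processing(v, m0) for v in views]  # per-camera
    decision = np.mean(descriptions, axis=0)                 # spatial fusion
    # Time (arrow 2): the previous model m0 constrains the update toward M1.
    angles1 = 0.5 * m0.angles + 0.5 * decision
    return BodyModel(angles1, m0.part_colors, angles1 - m0.angles)
```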
2.2. 3D Body Model Reconstruction Overview
An implementation of the 3D human body posture estima-
tion is presented in this paper. Elements in the opportunis-
tic fusion framework described above are incorporated in
this algorithm as illustrated in Fig. 3. Local processing in
a single camera includes segmentation and ellipse fitting
for a concise parameterization of segments. We assume
the 3D model is initialized with a distinct color distribu-
tion for the subject. For each camera, the color distribution
is first refined using the EM (Expectation Maximization)
algorithm and then used for segmentation. Undetermined
pixels from EM are assigned labels through watershed seg-
mentation. For spatial collaboration, ellipses from all cameras are merged to find the geometric configuration of the
3D skeleton model. Candidate configurations are examined
using PSO (Particle Swarm Optimization). Details and ex-
periment results of the algorithm are presented in Section 3
and Section 4.
3. In-Node Feature Extraction
The goal of local processing in a single camera is to reduce
raw images/videos to simple descriptions so that they can
be efficiently transmitted between the cameras. The output
of the algorithm will be ellipses fitted from segments and
the mean color of the segments. As shown in the upper part
of Fig. 3, local processing includes image segmentation for
the subject and ellipse fitting to the extracted segments.
We assume the subject is characterized by a distinct color
distribution. The foreground area is obtained through background subtraction. Pixels with high or low illumination are
also removed since for those pixels chrominance may not
be reliable. Then a rough segmentation for the foreground
is done either based on K-means on the chrominance of the
foreground pixels, or color distributions from the model. In
the initialization stage when the model has not been well
established, or when we do not have a high confidence in
the model, we need to start from the image itself and use
a method such as K-means to find color distribution of the
subject. However, when a model with a reliable color distri-
bution is available, we can directly assign pixels to different
segments based on the existing color distribution. In practice, the color distribution maintained by the model may not
be uniformly accurate for all cameras due to effects such
as color map changes or illumination differences. Also the
subject's appearance may change in a single camera due to
movement or lighting conditions. Therefore, the color
distribution of the model is only used for a rough segmen-
tation in initialization of segmentation. Then an EM algo-
rithm is used to refine the color distribution for the current
image. The initial estimated color distribution plays an im-
portant role because it can prevent EM from being trapped
in local minima.
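To illustrate the rough-initialization step, here is a minimal sketch that clusters foreground chrominance with K-means; the YCrCb color space, the illumination thresholds, and the number of modes are assumptions for the example, not values from the paper.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def rough_segmentation(image_bgr, fg_mask, n_modes=4):
    """Cluster foreground pixels by chrominance to initialize color modes."""
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    luma = ycrcb[..., 0]
    # Drop very dark or very bright pixels, whose chrominance is unreliable.
    reliable = fg_mask.astype(bool) & (luma > 40) & (luma < 220)
    chroma = ycrcb[reliable][:, 1:3].astype(np.float32)  # (Cr, Cb) only
    labels = KMeans(n_clusters=n_modes, n_init=10).fit_predict(chroma)
    label_map = np.full(fg_mask.shape, -1, dtype=np.int32)
    label_map[reliable] = labels
    return label_map  # -1 marks background or unreliable pixels
```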
Suppose the color distribution is a mixture of $N$ Gaussian
modes with parameters $\Theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$, where
$\theta_l = \{\mu_l, \Sigma_l\}$ are the mean and covariance matrix of mode $l$.
The mixing weights of the modes are $A = \{\alpha_1, \alpha_2, \ldots, \alpha_N\}$.
We need to find the probability of each pixel $x_i$ belonging to a
certain mode $l$: $Pr(y_i = l \mid x_i)$. From standard EM for
Gaussian Mixture Models (GMM) we have the E step as:

$$Pr^{(k+1)}(y_i = l \mid x_i) \propto \alpha_l^{(k)} P_{\theta_l^{(k)}}(x_i), \quad l = 1, \ldots, N, \qquad \sum_{l=1}^{N} Pr^{(k+1)}(y_i = l \mid x_i) = 1 \tag{1}$$
and the M step as:

$$\mu_l^{(k+1)} = \frac{\sum_{i=1}^{M} x_i\, Pr(y_i = l \mid x_i, \Theta^{(k)})}{\sum_{i=1}^{M} Pr(y_i = l \mid x_i, \Theta^{(k)})} \tag{2}$$

$$\Sigma_l^{(k+1)} = \frac{\sum_{i=1}^{M} (x_i - \mu_l^{(k)})(x_i - \mu_l^{(k)})^T\, Pr(y_i = l \mid x_i, \Theta^{(k)})}{\sum_{i=1}^{M} Pr(y_i = l \mid x_i, \Theta^{(k)})} \tag{3}$$

and

$$\alpha_l^{(k+1)} = \frac{1}{M} \sum_{i=1}^{M} Pr^{(k+1)}(y_i = l \mid x_i) \tag{4}$$

where $k$ is the iteration index, and the M step is obtained by
maximizing the log-likelihood:

$$L(x; \Theta) = \sum_{i=1}^{M} \sum_{l=1}^{N} Pr(y_i = l \mid x_i) \log Pr(x_i \mid \theta_l). \tag{5}$$

[Figure 3 flowchart: in local processing, each camera performs background subtraction, rough segmentation, EM refinement of the color models, watershed segmentation, and ellipse fitting; ellipses from local processing of all cameras are combined to obtain the 3D skeleton geometric configuration by generating, scoring, and updating test configurations with PSO until the stop criterion is met; the 3D human body model maintains the current model, supplies the previous color distribution and the previous geometric configuration and motion, and is updated with color/texture and motion.]

Figure 3: Algorithm flowchart for 3D human skeleton model reconstruction.
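For reference, a minimal numpy/scipy sketch of one EM iteration implementing Eqs. (1)-(4); the variable names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(x, mu, sigma, alpha):
    """One EM iteration for a GMM. x: (M, d) pixel features; mu: (N, d);
    sigma: (N, d, d); alpha: (N,) mixing weights."""
    M, N = x.shape[0], mu.shape[0]
    # E step (Eq. 1): responsibilities proportional to alpha_l * P_theta_l(x_i).
    resp = np.stack([alpha[l] * multivariate_normal.pdf(x, mu[l], sigma[l])
                     for l in range(N)], axis=1)
    resp /= resp.sum(axis=1, keepdims=True)
    # M step (Eqs. 2-4): responsibility-weighted updates.
    w = resp.sum(axis=0)                          # effective counts per mode
    mu_new = (resp.T @ x) / w[:, None]            # Eq. (2)
    sigma_new = np.empty_like(sigma)
    for l in range(N):
        d = x - mu[l]                             # Eq. (3) uses the old mean
        sigma_new[l] = (resp[:, l, None] * d).T @ d / w[l]
    alpha_new = w / M                             # Eq. (4)
    return mu_new, sigma_new, alpha_new, resp
```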
However, this basic EM algorithm takes each pixel indepen-
dently, without considering the fact that pixels belonging to
the same mode are usually spatially close to each other. In
[10] Perceptually Organized EM (POEM) is introduced. In
POEM, the influence of neighbors is incorporated by a weighting measure:

$$w(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma_1^2}\right)\exp\!\left(-\frac{\|s(x_i) - s(x_j)\|^2}{2\sigma_2^2}\right) \tag{6}$$

where $s(x_i)$ is the spatial coordinate of $x_i$. Then, the votes for $x_i$ from its neighborhood are given by

$$V_l(x_i) = \sum_{x_j} \pi_l(x_j)\, w(x_i, x_j), \quad \text{where } \pi_l(x_j) = Pr(y_j = l \mid x_j) \tag{7}$$
Based on this voting scheme, the following modifications
are made to the EM steps. In the E step, $\alpha_l^{(k)}$ is
changed to $\alpha_l^{(k)}(x_i)$, which means that for every pixel $x_i$,
mixing weights for different modes are different. This is
partially due to the influence of neighbors. In the M step,
mixing weights are updated by

$$\alpha_l^{(k)}(x_i) = \frac{e^{\beta V_l(x_i)}}{\sum_{k=1}^{N} e^{\beta V_k(x_i)}} \tag{8}$$
in which $\beta$ controls the softness of the neighbors' votes. If
$\beta$ is as small as 0, the mixing weights are always uniform; if
$\beta$ approaches infinity, the mixing weight for the mode with
the largest vote will be 1.

Figure 4: Ellipse fitting. (a) original image; (b) segments;
(c) simple ellipse fitting to connected regions; (d) improved
ellipse fitting.
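The following sketch illustrates the POEM modification of Eqs. (6)-(8), computing neighborhood votes and the resulting per-pixel mixing weights; the spatial window radius and the bandwidths sigma1, sigma2 are assumed values for the example.

```python
import numpy as np

def poem_mixing_weights(resp, feats, coords, sigma1, sigma2, beta, radius=2.0):
    """Per-pixel mixing weights from neighborhood votes (Eqs. 6-8).
    resp: (M, N) responsibilities Pr(y_j = l | x_j); feats: (M, d) pixel
    features x_j; coords: (M, 2) spatial coordinates s(x_j)."""
    M, N = resp.shape
    alpha = np.empty((M, N))
    for i in range(M):
        # Restrict the vote sum to a spatial window for tractability.
        near = np.linalg.norm(coords - coords[i], axis=1) <= radius
        w = (np.exp(-np.sum((feats[near] - feats[i]) ** 2, axis=1)
                    / (2.0 * sigma1 ** 2))
             * np.exp(-np.sum((coords[near] - coords[i]) ** 2, axis=1)
                      / (2.0 * sigma2 ** 2)))        # Eq. (6)
        votes = w @ resp[near]                        # V_l(x_i), Eq. (7)
        e = np.exp(beta * (votes - votes.max()))      # Eq. (8), stabilized
        alpha[i] = e / e.sum()
    return alpha
```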
After refinement of the color distribution with POEM,
we set pixels that belong to a certain mode with high probability (e.g., larger than 99.9%) as markers for that mode.
Then a watershed segmentation algorithm is implemented
to assign labels for undecided pixels. Finally, in order to
obtain a concise parameterization for each segment, an el-
lipse is fitted to it. Note that a segment refers to a spa-
tially connected region of the same mode. Therefore, a sin-
gle mode can have several segments. When the segment is
generally convex and has a shape similar to an ellipse, the
fitted ellipse represents the segment well. However, when
the segment's shape differs considerably from an ellipse, a
direct fitting step may not be sufficient. To address such
cases, we first test the similarity between the segment and
an ellipse by fitting an ellipse to the segment and comparing
their overlap. If similarity is low, the segment is split into
two segments and this process is carried out recursively on
every segment until they all meet the similarity criterion. In
Fig. 4, if we use a direct ellipse fitting to every segment, we
obtain Fig. 4(c). If we adopt the test-and-split procedure,
correct ellipses are obtained as shown in Fig. 4(d). Experi-
mental results are shown in Fig. 5.
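A compact sketch of the marker-based watershed step and the test-and-split ellipse fitting described above, assuming OpenCV and scikit-image; the 0.999 marker threshold follows the text, while the 0.9 overlap threshold and the median split along the longer extent are assumptions for illustration.

```python
import cv2
import numpy as np
from skimage.segmentation import watershed

def segments_to_ellipses(prob, fg_mask, sim_thresh=0.9):
    """prob: (H, W, N) per-mode probabilities after POEM; fg_mask: (H, W) 0/1.
    Returns ellipses as ((cx, cy), (major, minor), angle) tuples."""
    markers = np.zeros(fg_mask.shape, dtype=np.int32)
    for l in range(prob.shape[2]):
        markers[prob[..., l] > 0.999] = l + 1          # high-confidence markers
    labels = watershed(-prob.max(axis=2), markers, mask=fg_mask.astype(bool))
    ellipses = []
    for mode in np.unique(labels[labels > 0]):
        n, comps = cv2.connectedComponents((labels == mode).astype(np.uint8))
        for c in range(1, n):                          # one mode, several segments
            stack = [np.column_stack(np.nonzero(comps == c))]  # (row, col)
            while stack:
                pts = stack.pop()
                if len(pts) < 5:                       # fitEllipse needs >= 5 points
                    continue
                ell = cv2.fitEllipse(pts[:, ::-1].astype(np.int32))  # (x, y) order
                ell_mask = np.zeros(fg_mask.shape, dtype=np.uint8)
                cv2.ellipse(ell_mask, ell, 1, -1)
                seg_mask = np.zeros(fg_mask.shape, dtype=np.uint8)
                seg_mask[pts[:, 0], pts[:, 1]] = 1
                overlap = (np.logical_and(ell_mask, seg_mask).sum()
                           / np.logical_or(ell_mask, seg_mask).sum())
                if overlap >= sim_thresh:
                    ellipses.append(ell)
                else:                                  # split along the longer extent
                    extent = pts.max(axis=0) - pts.min(axis=0)
                    axis = int(np.argmax(extent))
                    med = np.median(pts[:, axis])
                    lo, hi = pts[pts[:, axis] <= med], pts[pts[:, axis] > med]
                    if len(lo) and len(hi):
                        stack += [lo, hi]
    return ellipses
```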
Figure 5: Experiment results of local processing. (a) origi-
nal images; (b) segments; (c) fitted ellipses.
4. Collaborative Posture Estimation
Human posture estimation is essentially treated as an op-
timization problem, in which we aim to minimize the dis-
tance between the posture and ellipses from the multiple
cameras. There can be several different ways to find the
3D skeleton model based on observations from multi-view
images. One method is to directly solve for the unknown
parameters through geometric calculation. In this method
one needs to first establish correspondences between
points/segments in different cameras, which is itself a hard prob-
lem. Common point observations are rare for human
subjects, and body parts may take on very different ap-
pearances from different views. Therefore, it is difficult to
resolve ambiguity in the 3D space based on 2D observa-
tions. A second method would be to cast this as an opti-
mization problem, in which we find the optimal $\theta$'s and $\phi$'s
to minimize an objective function (e.g., difference between
projections due to a certain 3D model and the actual seg-
ments) based on properties of the objective function. How-
ever, if the problem is highly nonlinear or non-convex, it
may be very difficult or time consuming to solve. There-
fore, searching strategies which do not explicitly depend on
the objective function formulation are desired.
Motivated by [11, 12], Particle Swarm Optimization
(PSO) is used for our optimization problem. The lower part
of Fig. 3 shows the estimation process. Ellipses from local
processing of single cameras are merged together to recon-
struct the skeleton (Fig. 6). Here we consider a simplified
problem in which only the arms change position while other
body parts are kept in the default location. Elevation angles
($\theta_i$) and azimuth angles ($\phi_i$) of the left/right upper/lower
parts of the arms are specified as parameters (Fig. 6(b)).
The assumption is that projection matrices from 3D skele-
ton to 2D image planes are known. This can be achieved
either from locations of cameras and the subject, or it can
Figure 6: 3D skeleton model fitting. (a) Top view of the
experiment setting. (b) The 3D skeleton reconstructed from
ellipses from multi-view cameras.
be calculated from some known projective correspondences
between the 3D subject and points in the images, without
knowing exact locations of cameras or the subject.
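For reference, a minimal sketch of how a known 3x4 projection matrix maps 3D skeleton joints to image coordinates (plain homogeneous projection; the function name is illustrative).

```python
import numpy as np

def project_joints(P, joints_3d):
    """P: (3, 4) projection matrix; joints_3d: (J, 3) joint positions.
    Returns (J, 2) pixel coordinates via homogeneous projection."""
    homog = np.hstack([joints_3d, np.ones((len(joints_3d), 1))])  # (J, 4)
    uvw = homog @ P.T                                             # (J, 3)
    return uvw[:, :2] / uvw[:, 2:3]                               # perspective divide
```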
PSO is suitable for posture estimation as an evolutionary
optimization mechanism. It starts from a group of initial
particles. During the evolution the particles are directed to-
ward good positions while keeping some randomness to ex-
plore the search space. Suppose there are $N$ particles (test
configurations) $x_i$, each being a vector of $\theta$'s and $\phi$'s. The
velocity of $x_i$ is denoted by $v_i$. Assume the best position of
$x_i$ up to now is $x_i^*$, and the global best position of all $x_i$'s
up to now is $g$. The objective function is $f(\cdot)$, for which we
wish to find the optimal position $x$ minimizing $f(x)$. The
PSO algorithm is as follows:

1. Initialize $x_i$ and $v_i$. The value of $v_i$ is usually set to 0,
and $x_i^* = x_i$. Evaluate $f(x_i)$ and set $g = \arg\min_{x_i^*} f(x_i^*)$.

2. While the stop criterion is not satisfied, do for every $x_i$:

$$v_i \leftarrow \omega v_i + c_1 r_1 \odot (x_i^* - x_i) + c_2 r_2 \odot (g - x_i)$$
$$x_i \leftarrow x_i + v_i$$

If $f(x_i) < f(x_i^*)$, set $x_i^* = x_i$; if $f(x_i^*) < f(g)$, set $g = x_i^*$.

Stop criterion: after all $N$ $x_i$'s have been updated once, if the
decrease in $f(g)$ falls below a threshold, the algorithm exits.
Here $\omega$ is the inertia coefficient, while $c_1$ and $c_2$ are the
social coefficients. $r_1$ and $r_2$ are random vectors with
each element uniformly distributed on [0, 1]. The choice of $\omega$,
$c_1$, and $c_2$ controls the convergence of the evolution.
If $\omega$ is large, the particles have more inertia and tend to keep
their own directions to explore the search space; this allows
a better chance of finding the true global optimum when the
group of particles is currently around a local optimum. If
$c_1$ and $c_2$ are large, the particles are more social and move
quickly to the best positions known by the group. In our
experiment, $N = 16$, $\omega = 0.3$, and $c_1 = c_2 = 1$.

Figure 7: Experiment results for 3D skeleton reconstruc-
tion. Original images from 3 camera views and the skele-
tons are shown.
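A minimal numpy sketch of this PSO loop with the stated settings; the objective f is a stand-in for scoring a candidate configuration against the observed ellipses, and initializing particles around the previous frame's configuration reflects the time-consistency initialization described below.

```python
import numpy as np

def pso(f, init, n_particles=16, omega=0.3, c1=1.0, c2=1.0,
        tol=1e-4, max_iter=200, spread=0.1):
    """Minimize f over angle configurations. init: (D,) vector, e.g. the
    previous frame's optimal configuration (time-consistency init)."""
    rng = np.random.default_rng(0)
    x = init + spread * rng.standard_normal((n_particles, len(init)))
    v = np.zeros_like(x)                      # velocities start at 0
    p_best = x.copy()                         # per-particle best positions
    p_val = np.array([f(xi) for xi in x])
    g = p_best[np.argmin(p_val)].copy()       # global best position
    g_val = p_val.min()
    for _ in range(max_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = omega * v + c1 * r1 * (p_best - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.array([f(xi) for xi in x])
        better = vals < p_val                 # update per-particle bests
        p_best[better], p_val[better] = x[better], vals[better]
        improvement = g_val - p_val.min()
        if p_val.min() < g_val:               # update global best
            g_val = p_val.min()
            g = p_best[np.argmin(p_val)].copy()
        if improvement < tol:                 # stop when f(g) stops decreasing
            break
    return g

# Example: recover 8 arm angles from a toy quadratic objective.
# best = pso(lambda a: np.sum((a - np.zeros(8))**2), np.full(8, 0.2))
```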
Similar to other search techniques, PSO is likely to
converge to a local optimum if the initial particles are not
chosen carefully. In the experiment we assume that the 3D
skeleton will not undergo a large change within a time inter-
val. Therefore, at time t1 the search space formed by the
particles is centered around the optimal solution of the geo-
metric configuration at time t0. That is, time consistency in
postures is used to initialize particles for searching. Some
examples showing images from 3 views and the posture es-
timates are shown in Fig. 7.
5. Conclusion and Future Work
There are two main motivations for our work for gesture
analysis in a multi-camera network. One is to reduce image
data to short descriptions by local processing in each cam-
era for efficient communication among cameras; the other
is to explore the consistency and distinctiveness of the
subject by opportunistic fusion of information across space,
time, and feature levels. We studied the use of a 3D human
model to keep both geometric and appearance parameters
of the subject. In some of our experiments the problem
of PSO converging to a local minimum still exists, espe-
cially when there is a sudden movement that causes the initial
search space to be relatively far away from the new pos-
ture. Future work includes defining a more versatile model,
and using more information from local features of cameras
to better initialize the search space. In the current method
calibrated cameras are assumed. However, we found that
fitting is sensitive to accuracy of calibration. Since accurate
camera calibration is not always practical in applications
of posture recognition, we are exploring solutions based on
uncalibrated cameras.
References
[1] Andrew D. Wilson and Aaron F. Bobick, "Parametric hidden
Markov models for gesture recognition," IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 21, no. 9,
pp. 884-900, 1999.

[2] Yanxi Liu, Robert Collins, and Yanghai Tsin, "Gait sequence
analysis using frieze patterns," in Proceedings of the 7th
European Conference on Computer Vision (ECCV'02), May
2002.

[3] Y. Rui and P. Anandan, "Segmenting visual actions based on
spatio-temporal motion patterns," in CVPR'00, 2000.

[4] Hedvig Sidenbladh, Michael J. Black, and Leonid Sigal,
"Implicit probabilistic models of human motion for synthesis
and tracking," in ECCV'02: Proceedings of the 7th Euro-
pean Conference on Computer Vision, Part I, London, UK,
2002, pp. 784-800, Springer-Verlag.

[5] J. Deutscher, A. Blake, and I. D. Reid, "Articulated body
motion capture by annealed particle filtering," in CVPR'00,
2000, pp. II:126-133.

[6] Kong Man Cheung, Simon Baker, and Takeo Kanade,
"Shape-from-silhouette across time, Part II: Applications to
human modeling and markerless motion tracking," Interna-
tional Journal of Computer Vision, vol. 63, no. 3, pp. 225-245,
August 2005.

[7] Clement Menier, Edmond Boyer, and Bruno Raffin, "3D
skeleton-based body pose recovery," in Proceedings of the
3rd International Symposium on 3D Data Processing, Visu-
alization and Transmission, Chapel Hill, USA, June 2006.

[8] Ivana Mikic, Mohan Trivedi, Edward Hunter, and Pamela
Cosman, "Human body model acquisition and tracking using
voxel data," International Journal of Computer Vision, vol. 53,
no. 3, pp. 199-223, 2003.

[9] H. Sidenbladh and M. J. Black, "Learning the statistics of
people in images and video," International Journal of Computer
Vision, vol. 54, no. 1-3, pp. 183-209, August 2003.

[10] Y. Weiss and E. Adelson, "Perceptually organized EM: A
framework for motion segmentation that combines informa-
tion about form and motion," Tech. Rep. 315, MIT Media
Lab, 1995.

[11] S. Ivekovic and E. Trucco, "Human body pose estimation
with PSO," in IEEE Congress on Evolutionary Computation,
2006, pp. 1256-1263.

[12] C. Robertson and E. Trucco, "Human body posture via hi-
erarchical evolutionary optimization," in BMVC'06, 2006, p.
III:999.