Model-based Human Posture Estimation for Gesture Analysis
in an Opportunistic Fusion Smart Camera Network
Chen Wu and Hamid Aghajan
Department of Electrical Engineering
Stanford University, USA
Abstract
In multi-camera networks rich visual data is provided both
spatially and temporally. In this paper a method of human
posture estimation is described incorporating the concept of
an opportunistic fusion framework aiming to employ mani-
fold sources of visual information across space, time, and
feature levels. One motivation for the proposed method
is to reduce raw visual data in a single camera to ellipti-
cal parameterized segments for efficient communication be-
tween cameras. A 3D human body model is employed as
the convergence point of spatiotemporal and feature fusion.
It maintains both geometric parameters of the human pos-
ture and the adaptively learned appearance attributes, all
of which are updated from the three dimensions of space,
time, and features of the opportunistic fusion. At sufficient
confidence levels, parameters of the 3D human body model
are again used as feedback to aid subsequent in-node vision
analysis. Color distribution registered in the model is used
to initialize segmentation. Perceptually Organized Expectation Maximization (POEM) is then applied to refine color
segments with observations from a single camera. Geomet-
ric configuration of the 3D skeleton is estimated by Particle
Swarm Optimization (PSO).
1. Introduction
In a multi-camera network, access to multiple sources of
visual data often allows for making more comprehensive
interpretations of events and gestures. It also creates a per-
vasive sensing environment for applications where it is im-
practical for the users to wear sensors. Example applica-
tions include surveillance, smart home care, gaming, etc.
In this paper we propose a method of human posture esti-
mation using an opportunistic fusion framework to employ
manifold sources of information obtained from the camera
network in a principled way. The framework spans three
dimensions of space (different camera views), time (each
camera collecting data over time), and feature levels (se-
lecting and fusing different feature subsets).
Our work aims for intelligent and efficient vision inter-
pretations in a camera network. One underlying constraint
of the network is the relatively low bandwidth. Therefore,
for efficient collaboration between cameras, we expect con-
cise descriptions instead of raw image data as outputs from
local processing in a single camera. This process inevitably
removes certain details in images of a single camera, which
requires the camera to have some intelligence on its ob-
servations (smart cameras) , i.e., some knowledge of thesubject. This derives one of the motivations for opportunis-
tic data fusion between cameras, which compensates for
partial observations in individual cameras. The output
from opportunistic data fusion (a model of the subject) is
fed back to local processing. On the other hand, outputs of local
processing in single cameras enable opportunistic data fu-
sion by contributing local descriptions from multiple views.
This interactive loop creates the potential for
achieving both efficient and adequate vision-based analy-
sis in the camera network. An example of the communica-
tion model between five cameras to reconstruct the person's
model is shown in Fig. 1. The circled numbers represent the
sequence of events.
In our approach a 3D human body model embodies up-
to-date information from both current and historical obser-
vations of all cameras in a concise way. It has the follow-
ing components: 1. Geometric configuration: body part
lengths, angles. 2. Color or texture of body parts. 3. Mo-
tion of body parts. The three components are all updated
from the three dimensions of space, time, and features of
the opportunistic fusion. The 3D human model takes up
two roles. One is as an intermediate step for high-level
application-pertinent gesture interpretation, the other is to
create a feedback path from spatiotemporal and feature fu-
sion operations to low-level vision processing in each cam-
era. It is true that for a number of gestures a human body
model may not be needed to interpret the gesture. There is
existing work for hand gesture recognition [1] where only
part of the body is analyzed. Some gestures can also be
detected through spatiotemporal motion patterns of some
body parts [2, 3]. However, as the set of gestures to differ-
entiate expands, it becomes increasingly difficult to devise
methods for gesture recognition based on only a few cues.
A 3D human body model provides a unified interface for
[Figure 1: cameras CAM 1-5; the circled numbers mark the sequence: (1) Camera 5 broadcasts a request for collaboration to update its knowledge of the subject; (2) the other cameras send the requested descriptions (vectors of descriptions from local processing); (3) fusion updates the local knowledge of the subject; (4) the updated knowledge of the subject is fed back (a vector of model parameters); (5) the up-to-date knowledge of the subject is used in local processing.]
Figure 1: Communication for collaboration in the camera
network.
a variety of gesture interpretations. On the other hand, in-
stead of being a passive output to represent decisions from
spatiotemporal and feature fusion, the 3D model implic-
itly enables more interactions between the three dimensions
by being actively involved in vision analysis. For exam-
ple, although predefined appearance attributes are generally
not reliable, adaptively learned appearance attributes can be
used to identify the person or body parts. Those attributes
are usually more distinguishable than generic features such
as edges.
Fitting human models to images or videos has been an
interesting topic for which a variety of methods have been
developed. Some reconstruct 3D representations of human
models from a single camera's view [4, 5]. Due to the self-
occlusive nature of the human body, which causes ambiguity
from a single view, most of these methods rely on a restricted
dynamic model of behaviors. But tracking can easily fail in
the case of sudden motions or other movements that differ greatly
from the dynamic model. In 3D model reconstruction from
multi-view cameras [6, 7], most methods start from silhou-
ettes in different cameras, from which points occupied by
the subject are estimated, and finally a 3D model with prin-
cipal body parts is fit in the 3D space [8]. This approach
heavily relies on the silhouettes obtained from each image.
It is also sensitive to the accuracy of camera calibration.
However, in many situations background subtraction for sil-
houettes suffers in quality or is almost impossible due to a
cluttered background or a camouflaged foreground. Another
aspect of the human model fitting problem is the choice of
image features. All human model fitting methods are based
on some image features as targets to fit the model. Most
of them are based on generic features such as silhouettes
or edges [9, 7]. Some use skin color, but such methods are
prone to failure in many situations, since lighting usually has
a strong influence on colors and skin color varies from person
to person.
In this paper, we first introduce the opportunistic fusion
framework as well as an implementation of its concepts
through human gesture analysis in Section 2. In Section
3, image segmentation in a single camera is described in detail. The color distribution maintained in the model is used
to initialize segmentation. Perceptually Organized Expec-
tation Maximization (POEM) is then applied to refine color
segments with observations from a single camera, followed
by a watershed algorithm to assign segment labels to all pix-
els based on spatial relationships. Finally, ellipse fitting is
used to parameterize segments in order to create concise
segment descriptions for communication. In Section 4, Par-
ticle Swarm Optimization (PSO) is used for 3D model fit-
ting. Examples are shown to demonstrate the capability of
the elliptical segments for posture estimation.
2. Opportunistic Fusion for Human
Gesture Analysis
We propose a framework of opportunistic fusion in multi-
camera networks in order to both employ the rich visual
information provided by cameras and incorporate learned
knowledge of the subject into active vision analysis. The
opportunistic fusion framework is composed of three di-
mensions: space, time, and feature levels. In the rest of the
paper, the problem of human gesture analysis is elaborated
on to show how those concepts can be implemented.
2.1. The Fusion Framework Overview
The opportunistic fusion framework for gesture analysis is
shown in Fig. 2. On the top of Fig. 2 are spatial fusion
modules. In parallel is the progression of the 3D human
body model. Suppose at time t0 we have the model with
the collection of parameters as M0. At the next instant t1,
the current model M0 is input to the spatial fusion module
for t1, and the output decisions are used to update M0 from
which we get the new 3D model M1.
Now we look into a specific spatial fusion module (the
lower part of Fig. 2) for the detailed process. In the low-
est level of the layered gesture analysis, image features are
extracted by local processing. No explicit collaboration be-
tween cameras is done in this stage since communication
is not expected until images/videos are reduced to short de-
scriptions. Distinct features (e.g., colors) specific to the
subject are registered in the current model M0 and are used
for analysis (arrow 1 in Fig. 2). The intuition here is that we
adaptively learn which attributes distinguish the
subject, save them as marks in the 3D model, and then use
those marks to look for the subject. After local process-
ing, data is shared between cameras to derive a new estimate
[Figure 2 shows four description layers (1: images; 2: features; 3: gesture elements; 4: gestures) mapped to three decision layers (1: within a single camera; 2 and 3: collaboration between cameras), alongside the 3D human model progressing over time: the old model is updated to the new model through model history and new observations via local processing and spatial collaboration in the camera network. Arrow 1 denotes active vision (temporal fusion), arrow 2 the decision feedback that updates the model (spatial fusion), and arrow 3 the mapping from model to gesture interpretations.]
Figure 2: Spatiotemporal fusion framework for human ges-
ture analysis.
of the model. Parameters in M0 specify a smaller space of
possible M1s. Then decisions from spatial fusion of cam-
eras are used to update M0 to obtain the new model M1
(arrow 2 in Fig. 2). Therefore, every update of the model M
combines space (spatial collaboration between cameras),
time (the previous model M0), and feature levels (choice of
image features in local processing from both new observations
and subject-specific attributes in M0). Finally, the new model
M1 is used for high-level gesture deductions in a certain
scenario (arrow 3 in Fig. 2).
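To make one update cycle concrete, the sketch below walks through the loop under assumed names: BodyModel, local_processing, and the trivial averaging fusion rule are illustrative stand-ins, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class BodyModel:
    angles: np.ndarray       # geometric configuration (body part angles)
    part_colors: np.ndarray  # adaptively learned appearance (mean color per part)
    velocity: np.ndarray     # motion of body parts

def local_processing(view: np.ndarray, model: BodyModel) -> np.ndarray:
    # Feature level (arrow 1): reduce raw pixels to a concise description,
    # guided by the subject-specific attributes stored in the model.
    return view  # placeholder for ellipse parameters extracted from the view

def fusion_cycle(m0: BodyModel, views: List[np.ndarray]) -> BodyModel:
    descriptions = [local_processing(v, m0) for v in views]  # per-camera
    decision = np.mean(descriptions, axis=0)                 # spatial fusion
    # Time (arrow 2): the previous model m0 constrains the update toward M1.
    angles1 = 0.5 * m0.angles + 0.5 * decision
    return BodyModel(angles1, m0.part_colors, angles1 - m0.angles)
```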
2.2. 3D Body Model Reconstruction Overview
An implementation of the 3D human body posture estima-
tion is presented in this paper. Elements in the opportunis-
tic fusion framework described above are incorporated in
this algorithm as illustrated in Fig. 3. Local processing in
a single camera includes segmentation and ellipse fitting
for a concise parameterization of segments. We assume
the 3D model is initialized with a distinct color distribu-
tion for the subject. For each camera, the color distribution
is first refined using the EM (Expectation Maximization)
algorithm and then used for segmentation. Undetermined
pixels from EM are assigned labels through watershed seg-
mentation. For spatial collaboration, ellipses from all cameras are merged to find the geometric configuration of the
3D skeleton model. Candidate configurations are examined
using PSO (Particle Swarm Optimization). Details and ex-
periment results of the algorithm are presented in Section 3
and Section 4.
3. In-Node Feature Extraction
The goal of local processing in a single camera is to reduce
raw images/videos to simple descriptions so that they can
be efficiently transmitted between the cameras. The output
of the algorithm will be ellipses fitted from segments and
the mean color of the segments. As shown in the upper part
of Fig. 3, local processing includes image segmentation for
the subject and ellipse fitting to the extracted segments.
We assume the subject is characterized by a distinct color
distribution. The foreground area is obtained through background subtraction. Pixels with high or low illumination are
also removed since for those pixels chrominance may not
be reliable. Then a rough segmentation for the foreground
is done either based on K-means on the chrominance of the
foreground pixels, or color distributions from the model. In
the initialization stage when the model has not been well
established, or when we do not have a high confidence in
the model, we need to start from the image itself and use
a method such as K-means to find color distribution of the
subject. However, when a model with a reliable color distri-
bution is available, we can directly assign pixels to different
segments based on the existing color distribution. In practice, the color distribution maintained by the model may not
be uniformly accurate for all cameras due to effects such
as color map changes or illumination differences. Also the
subject's appearance may change in a single camera due to
movement or lighting conditions. Therefore, the color
distribution of the model is only used for a rough segmen-
tation in initialization of segmentation. Then an EM algo-
rithm is used to refine the color distribution for the current
image. The initial estimated color distribution plays an im-
portant role because it can prevent EM from being trapped
in local minima.
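To illustrate the rough-initialization step, here is a minimal sketch that clusters foreground chrominance with K-means; the YCrCb color space, the illumination thresholds, and the number of modes are assumptions for the example, not values from the paper.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def rough_segmentation(image_bgr, fg_mask, n_modes=4):
    """Cluster foreground pixels by chrominance to initialize color modes."""
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    luma = ycrcb[..., 0]
    # Drop very dark or very bright pixels, whose chrominance is unreliable.
    reliable = fg_mask.astype(bool) & (luma > 40) & (luma < 220)
    chroma = ycrcb[reliable][:, 1:3].astype(np.float32)  # (Cr, Cb) only
    labels = KMeans(n_clusters=n_modes, n_init=10).fit_predict(chroma)
    label_map = np.full(fg_mask.shape, -1, dtype=np.int32)
    label_map[reliable] = labels
    return label_map  # -1 marks background or unreliable pixels
```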
Suppose the color distribution is a mixture of $N$ Gaussian
modes with parameters $\Theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$, where
$\theta_l = \{\mu_l, \Sigma_l\}$ are the mean and covariance matrix of mode $l$.
The mixing weights of the modes are $A = \{\alpha_1, \alpha_2, \ldots, \alpha_N\}$.
We need to find the probability of each pixel $x_i$ belonging to a
certain mode $l$: $Pr(y_i = l \mid x_i)$. From standard EM for
Gaussian Mixture Models (GMM) we have the E step as:

$$Pr^{(k+1)}(y_i = l \mid x_i) \propto \alpha_l^{(k)} P_{\theta_l^{(k)}}(x_i), \quad l = 1, \ldots, N, \qquad \sum_{l=1}^{N} Pr^{(k+1)}(y_i = l \mid x_i) = 1 \tag{1}$$
and the M step as:

$$\mu_l^{(k+1)} = \frac{\sum_{i=1}^{M} x_i\, Pr(y_i = l \mid x_i, \Theta^{(k)})}{\sum_{i=1}^{M} Pr(y_i = l \mid x_i, \Theta^{(k)})} \tag{2}$$

$$\Sigma_l^{(k+1)} = \frac{\sum_{i=1}^{M} (x_i - \mu_l^{(k)})(x_i - \mu_l^{(k)})^T\, Pr(y_i = l \mid x_i, \Theta^{(k)})}{\sum_{i=1}^{M} Pr(y_i = l \mid x_i, \Theta^{(k)})} \tag{3}$$

and

$$\alpha_l^{(k+1)} = \frac{1}{M} \sum_{i=1}^{M} Pr^{(k+1)}(y_i = l \mid x_i) \tag{4}$$

where $k$ is the iteration index, and the M step is obtained by
maximizing the log-likelihood:

$$L(x; \Theta) = \sum_{i=1}^{M} \sum_{l=1}^{N} Pr(y_i = l \mid x_i) \log Pr(x_i \mid \theta_l). \tag{5}$$

[Figure 3 flowchart: in local processing, each camera performs background subtraction, rough segmentation, EM refinement of the color models, watershed segmentation, and ellipse fitting; ellipses from local processing of all cameras are combined to obtain the 3D skeleton geometric configuration by generating, scoring, and updating test configurations with PSO until the stop criterion is met; the 3D human body model maintains the current model, supplies the previous color distribution and the previous geometric configuration and motion, and is updated with color/texture and motion.]

Figure 3: Algorithm flowchart for 3D human skeleton model reconstruction.
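For reference, a minimal numpy/scipy sketch of one EM iteration implementing Eqs. (1)-(4); the variable names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(x, mu, sigma, alpha):
    """One EM iteration for a GMM. x: (M, d) pixel features; mu: (N, d);
    sigma: (N, d, d); alpha: (N,) mixing weights."""
    M, N = x.shape[0], mu.shape[0]
    # E step (Eq. 1): responsibilities proportional to alpha_l * P_theta_l(x_i).
    resp = np.stack([alpha[l] * multivariate_normal.pdf(x, mu[l], sigma[l])
                     for l in range(N)], axis=1)
    resp /= resp.sum(axis=1, keepdims=True)
    # M step (Eqs. 2-4): responsibility-weighted updates.
    w = resp.sum(axis=0)                          # effective counts per mode
    mu_new = (resp.T @ x) / w[:, None]            # Eq. (2)
    sigma_new = np.empty_like(sigma)
    for l in range(N):
        d = x - mu[l]                             # Eq. (3) uses the old mean
        sigma_new[l] = (resp[:, l, None] * d).T @ d / w[l]
    alpha_new = w / M                             # Eq. (4)
    return mu_new, sigma_new, alpha_new, resp
```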
However, this basic EM algorithm takes each pixel indepen-
dently, without considering the fact that pixels belonging to
the same mode are usually spatially close to each other. In
[10] Perceptually Organized EM (POEM) is introduced. In
POEM, the influence of neighbors is incorporated by a weighting measure:

$$w(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma_1^2}\right)\exp\!\left(-\frac{\|s(x_i) - s(x_j)\|^2}{2\sigma_2^2}\right) \tag{6}$$

where $s(x_i)$ is the spatial coordinate of $x_i$. Then, the votes for $x_i$ from its neighborhood are given by

$$V_l(x_i) = \sum_{x_j} \pi_l(x_j)\, w(x_i, x_j), \quad \text{where } \pi_l(x_j) = Pr(y_j = l \mid x_j) \tag{7}$$
Based on this voting scheme, the following modifications
are made to the EM steps. In the E step, $\alpha_l^{(k)}$ is
changed to $\alpha_l^{(k)}(x_i)$, which means that for every pixel $x_i$,
mixing weights for different modes are different. This is
partially due to the influence of neighbors. In the M step,
mixing weights are updated by

$$\alpha_l^{(k)}(x_i) = \frac{e^{\beta V_l(x_i)}}{\sum_{k=1}^{N} e^{\beta V_k(x_i)}} \tag{8}$$
in which $\beta$ controls the softness of the neighbors' votes. If
$\beta$ is as small as 0, the mixing weights are always uniform; if
$\beta$ approaches infinity, the mixing weight for the mode with
the largest vote will be 1.

Figure 4: Ellipse fitting. (a) original image; (b) segments;
(c) simple ellipse fitting to connected regions; (d) improved
ellipse fitting.
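The following sketch illustrates the POEM modification of Eqs. (6)-(8), computing neighborhood votes and the resulting per-pixel mixing weights; the spatial window radius and the bandwidths sigma1, sigma2 are assumed values for the example.

```python
import numpy as np

def poem_mixing_weights(resp, feats, coords, sigma1, sigma2, beta, radius=2.0):
    """Per-pixel mixing weights from neighborhood votes (Eqs. 6-8).
    resp: (M, N) responsibilities Pr(y_j = l | x_j); feats: (M, d) pixel
    features x_j; coords: (M, 2) spatial coordinates s(x_j)."""
    M, N = resp.shape
    alpha = np.empty((M, N))
    for i in range(M):
        # Restrict the vote sum to a spatial window for tractability.
        near = np.linalg.norm(coords - coords[i], axis=1) <= radius
        w = (np.exp(-np.sum((feats[near] - feats[i]) ** 2, axis=1)
                    / (2.0 * sigma1 ** 2))
             * np.exp(-np.sum((coords[near] - coords[i]) ** 2, axis=1)
                      / (2.0 * sigma2 ** 2)))        # Eq. (6)
        votes = w @ resp[near]                        # V_l(x_i), Eq. (7)
        e = np.exp(beta * (votes - votes.max()))      # Eq. (8), stabilized
        alpha[i] = e / e.sum()
    return alpha
```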
After refinement of the color distribution with POEM,
we set pixels that belong to a certain mode with high probability (e.g., larger than 99.9%) as markers for that mode.
Then a watershed segmentation algorithm is implemented
to assign labels for undecided pixels. Finally, in order to
obtain a concise parameterization for each segment, an el-
lipse is fitted to it. Note that a segment refers to a spa-
tially connected region of the same mode. Therefore, a sin-
gle mode can have several segments. When the segment is
generally convex and has a shape similar to an ellipse, the
fitted ellipse represents the segment well. However, when
the segment's shape differs considerably from an ellipse, a
direct fitting step may not be sufficient. To address such
cases, we first test the similarity between the segment and
an ellipse by fitting an ellipse to the segment and comparing
their overlap. If similarity is low, the segment is split into
two segments and this process is carried out recursively on
every segment until they all meet the similarity criterion. In
Fig. 4, if we use a direct ellipse fitting to every segment, we
obtain Fig. 4(c). If we adopt the test-and-split procedure,
correct ellipses are obtained as shown in Fig. 4(d). Experi-
mental results are shown in Fig. 5.
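A compact sketch of the marker-based watershed step and the test-and-split ellipse fitting described above, assuming OpenCV and scikit-image; the 0.999 marker threshold follows the text, while the 0.9 overlap threshold and the median split along the longer extent are assumptions for illustration.

```python
import cv2
import numpy as np
from skimage.segmentation import watershed

def segments_to_ellipses(prob, fg_mask, sim_thresh=0.9):
    """prob: (H, W, N) per-mode probabilities after POEM; fg_mask: (H, W) 0/1.
    Returns ellipses as ((cx, cy), (major, minor), angle) tuples."""
    markers = np.zeros(fg_mask.shape, dtype=np.int32)
    for l in range(prob.shape[2]):
        markers[prob[..., l] > 0.999] = l + 1          # high-confidence markers
    labels = watershed(-prob.max(axis=2), markers, mask=fg_mask.astype(bool))
    ellipses = []
    for mode in np.unique(labels[labels > 0]):
        n, comps = cv2.connectedComponents((labels == mode).astype(np.uint8))
        for c in range(1, n):                          # one mode, several segments
            stack = [np.column_stack(np.nonzero(comps == c))]  # (row, col)
            while stack:
                pts = stack.pop()
                if len(pts) < 5:                       # fitEllipse needs >= 5 points
                    continue
                ell = cv2.fitEllipse(pts[:, ::-1].astype(np.int32))  # (x, y) order
                ell_mask = np.zeros(fg_mask.shape, dtype=np.uint8)
                cv2.ellipse(ell_mask, ell, 1, -1)
                seg_mask = np.zeros(fg_mask.shape, dtype=np.uint8)
                seg_mask[pts[:, 0], pts[:, 1]] = 1
                overlap = (np.logical_and(ell_mask, seg_mask).sum()
                           / np.logical_or(ell_mask, seg_mask).sum())
                if overlap >= sim_thresh:
                    ellipses.append(ell)
                else:                                  # split along the longer extent
                    extent = pts.max(axis=0) - pts.min(axis=0)
                    axis = int(np.argmax(extent))
                    med = np.median(pts[:, axis])
                    lo, hi = pts[pts[:, axis] <= med], pts[pts[:, axis] > med]
                    if len(lo) and len(hi):
                        stack += [lo, hi]
    return ellipses
```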
Figure 5: Experiment results of local processing. (a) origi-
nal images; (b) segments; (c) fitted ellipses.
4. Collaborative Posture Estimation
Human posture estimation is essentially treated as an op-
timization problem, in which we aim to minimize the dis-
tance between the posture and ellipses from the multiple
cameras. There can be several different ways to find the
3D skeleton model based on observations from multi-view
images. One method is to directly solve for the unknown
parameters through geometric calculation. In this method
one needs to first establish correspondences between
points/segments in different cameras, which is itself a hard prob-
lem. Common point observations are rare for human
subjects, and body parts may take on very different ap-
pearances from different views. Therefore, it is difficult to
resolve ambiguity in the 3D space based on 2D observa-
tions. A second method would be to cast this as an opti-
mization problem, in which we find the optimal $\theta$'s and $\phi$'s
to minimize an objective function (e.g., difference between
projections due to a certain 3D model and the actual seg-
ments) based on properties of the objective function. How-
ever, if the problem is highly nonlinear or non-convex, it
may be very difficult or time consuming to solve. There-
fore, searching strategies which do not explicitly depend on
the objective function formulation are desired.
Motivated by [11, 12], Particle Swarm Optimization
(PSO) is used for our optimization problem. The lower part
of Fig. 3 shows the estimation process. Ellipses from local
processing of single cameras are merged together to recon-
struct the skeleton (Fig. 6). Here we consider a simplified
problem in which only the arms change position while other
body parts are kept in the default location. Elevation angles
($\theta_i$) and azimuth angles ($\phi_i$) of the left/right upper/lower
parts of the arms are specified as parameters (Fig. 6(b)).
The assumption is that projection matrices from 3D skele-
ton to 2D image planes are known. This can be achieved
either from locations of cameras and the subject, or it can
Figure 6: 3D skeleton model fitting. (a) Top view of the
experiment setting. (b) The 3D skeleton reconstructed from
ellipses from multi-view cameras.
be calculated from some known projective correspondences
between the 3D subject and points in the images, without
knowing exact locations of cameras or the subject.
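For reference, a minimal sketch of how a known 3x4 projection matrix maps 3D skeleton joints to image coordinates (plain homogeneous projection; the function name is illustrative).

```python
import numpy as np

def project_joints(P, joints_3d):
    """P: (3, 4) projection matrix; joints_3d: (J, 3) joint positions.
    Returns (J, 2) pixel coordinates via homogeneous projection."""
    homog = np.hstack([joints_3d, np.ones((len(joints_3d), 1))])  # (J, 4)
    uvw = homog @ P.T                                             # (J, 3)
    return uvw[:, :2] / uvw[:, 2:3]                               # perspective divide
```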
PSO is suitable for posture estimation as an evolutionary
optimization mechanism. It starts from a group of initial
particles. During the evolution the particles are directed to-
ward good positions while keeping some randomness to ex-
plore the search space. Suppose there are $N$ particles (test
configurations) $x_i$, each being a vector of $\theta$'s and $\phi$'s. The
velocity of $x_i$ is denoted by $v_i$. Assume the best position of
$x_i$ up to now is $x_i^*$, and the global best position of all $x_i$'s
up to now is $g$. The objective function is $f(\cdot)$, for which we
wish to find the optimal position $x$ minimizing $f(x)$. The
PSO algorithm is as follows:

1. Initialize $x_i$ and $v_i$. The value of $v_i$ is usually set to 0,
and $x_i^* = x_i$. Evaluate $f(x_i)$ and set $g = \arg\min_{x_i^*} f(x_i^*)$.

2. While the stop criterion is not satisfied, do for every $x_i$:

$$v_i \leftarrow \omega v_i + c_1 r_1 \odot (x_i^* - x_i) + c_2 r_2 \odot (g - x_i)$$
$$x_i \leftarrow x_i + v_i$$

If $f(x_i) < f(x_i^*)$, set $x_i^* = x_i$; if $f(x_i^*) < f(g)$, set $g = x_i^*$.

Stop criterion: after all $N$ $x_i$'s have been updated once, if the
decrease in $f(g)$ falls below a threshold, the algorithm exits.
Here $\omega$ is the inertia coefficient, while $c_1$ and $c_2$ are the
social coefficients. $r_1$ and $r_2$ are random vectors with
each element uniformly distributed on [0, 1]. The choice of $\omega$,
$c_1$, and $c_2$ controls the convergence of the evolution.
If $\omega$ is large, the particles have more inertia and tend to keep
their own directions to explore the search space; this allows
a better chance of finding the true global optimum when the
group of particles is currently around a local optimum. If
$c_1$ and $c_2$ are large, the particles are more social and move
quickly to the best positions known by the group. In our
experiment, $N = 16$, $\omega = 0.3$, and $c_1 = c_2 = 1$.

Figure 7: Experiment results for 3D skeleton reconstruc-
tion. Original images from 3 camera views and the skele-
tons are shown.
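A minimal numpy sketch of this PSO loop with the stated settings; the objective f is a stand-in for scoring a candidate configuration against the observed ellipses, and initializing particles around the previous frame's configuration reflects the time-consistency initialization described below.

```python
import numpy as np

def pso(f, init, n_particles=16, omega=0.3, c1=1.0, c2=1.0,
        tol=1e-4, max_iter=200, spread=0.1):
    """Minimize f over angle configurations. init: (D,) vector, e.g. the
    previous frame's optimal configuration (time-consistency init)."""
    rng = np.random.default_rng(0)
    x = init + spread * rng.standard_normal((n_particles, len(init)))
    v = np.zeros_like(x)                      # velocities start at 0
    p_best = x.copy()                         # per-particle best positions
    p_val = np.array([f(xi) for xi in x])
    g = p_best[np.argmin(p_val)].copy()       # global best position
    g_val = p_val.min()
    for _ in range(max_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = omega * v + c1 * r1 * (p_best - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.array([f(xi) for xi in x])
        better = vals < p_val                 # update per-particle bests
        p_best[better], p_val[better] = x[better], vals[better]
        improvement = g_val - p_val.min()
        if p_val.min() < g_val:               # update global best
            g_val = p_val.min()
            g = p_best[np.argmin(p_val)].copy()
        if improvement < tol:                 # stop when f(g) stops decreasing
            break
    return g

# Example: recover 8 arm angles from a toy quadratic objective.
# best = pso(lambda a: np.sum((a - np.zeros(8))**2), np.full(8, 0.2))
```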
Similar to other search techniques, PSO is likely to
converge to a local optimum if the initial particles are not
chosen carefully. In the experiment we assume that the 3D
skeleton will not undergo a large change within a time inter-
val. Therefore, at time t1 the search space formed by the
particles is centered around the optimal solution of the geo-
metric configuration at time t0. That is, time consistency in
postures is used to initialize particles for searching. Some
examples showing images from 3 views and the posture es-
timates are shown in Fig. 7.
5. Conclusion and Future Work
There are two main motivations for our work for gesture
analysis in a multi-camera network. One is to reduce image
data to short descriptions by local processing in each cam-
era for efficient communication among cameras; the other
is to explore the consistency and distinctiveness of the
subject by opportunistic fusion of information across space,
time, and feature levels. We studied the use of a 3D human
model to keep both geometric and appearance parameters
of the subject. In some of our experiments the problem
of PSO converging to a local minimum still exists, espe-
cially when there is a sudden movement that causes the initial
search space to be relatively far away from the new pos-
ture. Future work includes defining a more versatile model,
and using more information from local features of cameras
to better initialize the search space. In the current method
calibrated cameras are assumed. However, we found that
fitting is sensitive to accuracy of calibration. Since accurate
camera calibration is not always practical in applications
of posture recognition, we are exploring solutions based on
uncalibrated cameras.
References
[1] Andrew D. Wilson and Aaron F. Bobick, "Parametric hidden
Markov models for gesture recognition," IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 21, no. 9,
pp. 884-900, 1999.

[2] Yanxi Liu, Robert Collins, and Yanghai Tsin, "Gait sequence
analysis using frieze patterns," in Proceedings of the 7th
European Conference on Computer Vision (ECCV'02), May
2002.

[3] Y. Rui and P. Anandan, "Segmenting visual actions based on
spatio-temporal motion patterns," in CVPR'00, 2000.

[4] Hedvig Sidenbladh, Michael J. Black, and Leonid Sigal,
"Implicit probabilistic models of human motion for synthesis
and tracking," in ECCV'02: Proceedings of the 7th Euro-
pean Conference on Computer Vision, Part I, London, UK,
2002, pp. 784-800, Springer-Verlag.

[5] J. Deutscher, A. Blake, and I. D. Reid, "Articulated body
motion capture by annealed particle filtering," in CVPR'00,
2000, pp. II:126-133.

[6] Kong Man Cheung, Simon Baker, and Takeo Kanade,
"Shape-from-silhouette across time, Part II: Applications to
human modeling and markerless motion tracking," Interna-
tional Journal of Computer Vision, vol. 63, no. 3, pp. 225-245,
August 2005.

[7] Clement Menier, Edmond Boyer, and Bruno Raffin, "3D
skeleton-based body pose recovery," in Proceedings of the
3rd International Symposium on 3D Data Processing, Visu-
alization and Transmission, Chapel Hill, USA, June 2006.

[8] Ivana Mikic, Mohan Trivedi, Edward Hunter, and Pamela
Cosman, "Human body model acquisition and tracking using
voxel data," International Journal of Computer Vision, vol. 53,
no. 3, pp. 199-223, 2003.

[9] H. Sidenbladh and M. J. Black, "Learning the statistics of
people in images and video," International Journal of Computer
Vision, vol. 54, no. 1-3, pp. 183-209, August 2003.

[10] Y. Weiss and E. Adelson, "Perceptually organized EM: A
framework for motion segmentation that combines informa-
tion about form and motion," Tech. Rep. 315, MIT Media
Lab, 1995.

[11] S. Ivekovic and E. Trucco, "Human body pose estimation
with PSO," in IEEE Congress on Evolutionary Computation,
2006, pp. 1256-1263.

[12] C. Robertson and E. Trucco, "Human body posture via hi-
erarchical evolutionary optimization," in BMVC'06, 2006, p.
III:999.