8/9/2019 [2011][Acpr]Zhang Chenguang
http://slidepdf.com/reader/full/2011acprzhang-chenguang 1/5
Video Object Segmentation by Hierarchical
Localized Classification of Regions
Chenguang Zhang, Haizhou Ai
Dept. of Computer Science and Technology
Tsinghua University, Beijing, P.R. China
[email protected], [email protected]
Abstract: Video Object Segmentation (VOS) aims to cut out a selected object from video sequences, where the main difficulties are shape deformation, appearance variations and background clutter. To cope with these difficulties, we propose a novel method named Hierarchical Localized Classification of Regions (HLCR). We suggest that appearance models, together with the spatial and temporal coherence between frames, are the keys to overcoming this bottleneck. Locally, in order to identify foreground regions, we propose Hierarchical Localized Classifiers, which organize regional features as decision trees. Globally, we adopt Gaussian Mixture Color Models (GMMs). After integrating the local and global results into a probability mask, we obtain the final segmentation result by graph cut. Experiments on various challenging video sequences demonstrate the efficiency and adaptability of the proposed method.
Index Terms: video object segmentation, classification, tracking, graph cut
I. INTRODUCTION
In computer vision, Video Object Segmentation (VOS) is an attractive task with many applications, such as video editing, video composition and object recognition. A VOS system generally faces two basic computer vision problems: object tracking and segmentation. Numerous algorithms address object tracking [1], such as mean shift [2], particle filters [3], online boosting [4] and random forests [5]. There is also a great deal of work on object segmentation, such as level set methods [6], graph cut [7] and grab cut [8].
It is well known that handling general video sequences is an extremely challenging objective for a VOS system, owing to appearance variations, irregular motion and background clutter. Building on object tracking and segmentation, various approaches have been proposed for VOS in recent years. Li et al. [9] directly extend the traditional graph cut algorithm [7] from 2D images to 3D image sequences, and optimize a global energy function to yield the segmentation result. Apart from relying heavily on Gaussian mixture color models, this 3D graph cut method is quite time-consuming and does not allow user interaction. Afterward, localized color and shape models were introduced by Bai et al. [10] in the Video SnapCut system, which shows increased discriminative ability and proves to be more efficient. However, because optical flow produces unexpected errors when the object is occluded by itself or by others, classifying on the object boundary and shifting local windows are not reliable. An alternative method by Brendel et al. [11], which focuses on tracking regions across frames, is attractive for its computational benefit and spatial-temporal coherence. However, because it can fail to match region contours, this method lacks the ability to deal with complex deformation of non-rigid objects. Meanwhile, Niebles et al. [12] demonstrate how to combine model-based information (e.g., part-based detection results for humans) with appearance approaches to extract human body regions. Nevertheless, high-performance detectors are usually not available for general objects, which limits the generalization of that method.
Inspired by previous work on localized windows [10] and region tracking [11], we propose a novel method, named Hierarchical Localized Classification of Regions (HLCR), for video object segmentation. The main contribution of our approach is to overcome the limitations of directly shifting local windows and of unreliable region tracking, by using the spatial-temporal relationship between corresponding regions in neighboring frames as the inference strategy.
The rest of this paper is organized as follows. In Section II, we first give a formulation and then a brief overview of our system. Section III introduces the whole pipeline of our approach. Experimental results on different video sequences are presented in Section IV. Finally, in Section V, we conclude and discuss future work.
II. PROBLEM FORMULATION AND SYSTEM OVERVIEW
Given an input video sequence $I = \{I_0, I_1, \ldots, I_{N-1}\}$, the VOS system is initialized by a selected key frame $I_k$ with known foreground mask $F(I_k)$. The output of a typical VOS system is the foreground mask $M(I_t)$ for each frame $I_t$.
Taking the foreground mask in a particular frame as input, as illustrated in Fig. 1, our system is designed to generate the foreground mask in the next frame. With the help of the Regional Back-Track Method for motion estimation, we assign regions to a series of Hierarchical Localized Classifiers, which predict potential foreground and background regions locally. Combining the classification result with Gaussian Mixture Color Models (GMMs), we produce a probability mask, which is then optimized with the graph cut algorithm [7] to yield the final segmentation result.
Fig. 1. Outline of our approach
III. OUR APPROACH
First of all, the initial foreground mask $F(I_k)$ is provided by the user. Since video frames are spatially and temporally coherent, we can propagate the foreground mask between neighboring frames. From the reference frame (Fig. 2(a)) to the target frame (Fig. 2(b)), propagation is feasible in both directions. Without loss of generality, the following analysis only explains the forward direction, from frame $I_t$ to frame $I_{t+1}$. Using the selected key frame $I_k$ as the first reference frame and repeatedly applying this propagation procedure, we obtain foreground masks in all frames.
For computational benefit as well as distinctiveness and robustness, each frame is over-segmented into SLIC superpixels [13], which converts the original pixel-connected graph $G_P$ (Fig. 2(b)) to a region-connected graph $G_R$ (Fig. 2(c)).
Fig. 2. An example of the procedure of processing a single frame: (a) Frame $I_t$; (b) Frame $I_{t+1}$; (c) SLIC Regions; (d) Optical Flow; (e) Classification; (f) GMMs Prob.; (g) Graph Prob.; (h) Seg Result.
A. Regional Back-Track Method
For a region in frame $t+1$, the Regional Back-Track Method is introduced to find the best matching region in frame $t$ and to determine whether the two regions truly correspond.
Pixel-level optical flow (Fig. 2(d)) is not reliable when heavy occlusion happens. Although it is claimed in [10] that averaging the flow in a local window generates a more robust result, it still produces a meaningless motion vector when there are no truly matched regions. Based on this observation, we suggest that a reliable region tracking method should not only be insensitive to minor optical flow errors, but also judge whether the matched regions truly correspond. For an arbitrary region $R_a$ in frame $t+1$, the Regional Back-Track Method is defined as
$$\mathrm{BackTrack}(R_a) = \min_{\| c_{R_a} - v_{R_a} - c_{R_b} \| \le \delta} \mathrm{Diff}(R_a, R_b) \qquad (1)$$
where $R_b$ is in frame $t$, $c_{R_a}$ denotes the center of region $R_a$, $v_{R_a}$ denotes the motion vector averaged over all pixels in region $R_a$, and $\mathrm{Diff}(R_a, R_b)$ denotes the difference between regions $R_a$ and $R_b$. A larger $\delta$ is more robust to optical flow errors but more likely to admit mistaken regions. On the other hand, $\delta$ is closely related to the radius $r_{R_a}$, since the centers of large regions drift more easily than those of small ones. Consequently, in our experiments, $\delta$ is set to $r_{R_a}$ and $\mathrm{Diff}(R_a, R_b)$ is set to the Euclidean distance between the mean colors of the two regions.
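The search in Eq. (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys (`center`, `flow`, `color`, `radius`) and the function name `back_track` are hypothetical, and the function returns both the best candidate and its color difference rather than the minimum alone.

```python
import math

def back_track(region_a, candidates):
    """Return (best_candidate, color_difference) for region_a, or (None, inf).

    Hypothetical layout: region_a has 'center' (x, y), 'flow' (vx, vy),
    'color' (r, g, b) and 'radius'; each candidate has 'center' and 'color'.
    """
    cx, cy = region_a["center"]
    vx, vy = region_a["flow"]
    # Center of region_a projected back into frame t by its mean motion vector.
    px, py = cx - vx, cy - vy
    delta = region_a["radius"]  # delta is set to the region radius, as in the text
    best, best_diff = None, math.inf
    for region_b in candidates:
        bx, by = region_b["center"]
        if math.hypot(px - bx, py - by) > delta:
            continue  # candidate lies outside the search radius
        # Diff is the Euclidean distance between the mean colors.
        diff = math.dist(region_a["color"], region_b["color"])
        if diff < best_diff:
            best, best_diff = region_b, diff
    return best, best_diff
```

A candidate whose center is farther than `delta` from the projected center is never considered, so a region with no nearby candidate yields `(None, inf)`.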
A key issue of the Regional Back-Track Method is how to convert $\mathrm{Diff}(R_a, R_b)$ to a binary decision. Traditional methods, such as selecting a global threshold or using a Chi-square test, are difficult to tune and unstable. Here, inspired by the Statistical Region Merging method [14], we choose the independent bounded difference inequality as the decision function, treating each pixel in $R_a$ as a bounded independent random variable. The resulting predicate is shown below.
$$B(R_a, R_b) = \begin{cases} 1 & \text{if } |\bar{R}_a - \bar{R}_b| \le \sqrt{b^2(R_a) + b^2(R_b)} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
To summarize, for an arbitrary region $R_a$ in frame $t+1$, the Regional Back-Track Method provides the best matching region $R_b$ in frame $t$ if the two truly correspond. Otherwise, the method marks $R_a$ as a “mismatched” region.
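The binary decision of Eq. (2) might look as follows in Python. The text does not define $b(\cdot)$, so this sketch assumes the simple form that follows from the independent bounded difference inequality in [14], $b^2(R) = g^2 \ln(2/\delta') / (2 Q |R|)$, and compares the mean values under a square-root bound as in [14]. The parameter names and default values (`g`, `q`, `delta_conf`) are illustrative assumptions, not settings reported in this paper.

```python
import math

def deviation_bound_sq(num_pixels, g=256.0, q=32.0, delta_conf=0.05):
    """Assumed squared bound b^2(R) for a region of num_pixels pixels."""
    return g * g * math.log(2.0 / delta_conf) / (2.0 * q * num_pixels)

def is_corresponding(mean_a, size_a, mean_b, size_b, **kwargs):
    """Return 1 if the two mean values differ by no more than the combined bound, else 0."""
    bound = math.sqrt(deviation_bound_sq(size_a, **kwargs)
                      + deviation_bound_sq(size_b, **kwargs))
    return 1 if abs(mean_a - mean_b) <= bound else 0
```

Under this assumed form the bound shrinks as regions grow, so large regions must agree more closely in mean value before they are accepted as a match.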
B. Hierarchical Localized Classifiers
In this section, we introduce Hierarchical Localized Classifiers to evaluate the probability that a region in frame $t+1$ belongs to the foreground.
Localized classifiers for VOS were introduced in the Video SnapCut system [10], in which a series of overlapping local windows of fixed size are created along the foreground boundary and then propagated through frames. However, due to large boundary variation and local window drift, that method is limited when facing topology changes. In addition, since the size of the local windows is fixed, it sacrifices the ability to benefit from multi-scale space. To overcome these limitations, we propose a new solution called Hierarchical Localized Classifiers.
Given a foreground mask $M(I_t)$ and the corresponding foreground bounding box $B(I_t)$ in reference frame $t$, we define a potential searching box $S(I_t)$ by extending $B(I_t)$ by a fixed ratio $\beta$ ($\beta = 0.3$ in our experiments), using the following equations.
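$$\begin{aligned} \mathrm{center}(S(I_t)) &= \mathrm{center}(B(I_t)) \\ \mathrm{height}(S(I_t)) &= (1 + \beta)\,\mathrm{height}(B(I_t)) \\ \mathrm{width}(S(I_t)) &= (1 + \beta)\,\mathrm{width}(B(I_t)) \end{aligned} \qquad (3)$$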
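As a small worked example of Eq. (3), the following Python sketch enlarges a box about its center. The `(x_min, y_min, width, height)` box layout and the function name `search_box` are assumptions made for illustration. A practical version would probably also clip the result to the frame, which the text does not discuss.

```python
def search_box(bbox, beta=0.3):
    """Enlarge bbox = (x_min, y_min, width, height) about its center by ratio beta."""
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0              # the center is kept fixed
    new_w, new_h = (1.0 + beta) * w, (1.0 + beta) * h
    return (cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h)
```

With the default $\beta = 0.3$, a 100 × 50 bounding box becomes a 130 × 65 searching box with the same center.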
Next, we build a hierarchical quad-tree structure by splitting the searching box $S(I_t)$, in which each tree node corresponds to a local window. The partition rules are shown in Fig. 3. Then, we generate a localized classifier $L(W_i)$ for each
window $W_i$, trained on all regions inside it, which have already been labeled as foreground or background according to the foreground mask $M(I_t)$. Here, we build a multi-dimensional feature vector $f(R) = (r, g, b, y, u, v, c_x, c_y)$ for region $R$, where $(r, g, b, y, u, v)$ denotes the average value of all pixels in region $R$ in the RGB and YUV color spaces and $(c_x, c_y)$ denotes the center of region $R$. If $W_i$ contains both foreground and background regions, we use a decision tree for classification. Otherwise, the localized classifier $L(W_i)$ degenerates into a constant function (returning 1 if the window contains only foreground, and 0 otherwise).
Fig. 3. Hierarchical Localized Classifiers based on quad-tree partition. If a local window is larger than a fixed size $\lambda$ and contains both foreground and background regions, e.g. $W_i$, we split it into four sub-windows. Otherwise, the partition terminates and the window becomes a leaf node, e.g. $W_j$. For each window $W_i$, a localized classifier $L(W_i)$ is trained on all the regions inside it.
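The partition rule can be sketched recursively in Python. Several details here are assumptions, since the text does not fix them: a region is assigned to a window by its center, "larger than $\lambda$" is read as the longer window side exceeding `min_size`, and the nested-dictionary layout is hypothetical. Training the per-window classifiers is omitted.

```python
def build_quadtree(window, regions, min_size):
    """Recursively partition window = (x, y, w, h).

    regions is a list of (cx, cy, label) tuples, with label 1 for
    foreground and 0 for background. 'children' is empty for leaf nodes.
    """
    x, y, w, h = window
    # Assumption: a region belongs to the window that contains its center.
    inside = [r for r in regions if x <= r[0] < x + w and y <= r[1] < y + h]
    labels = {r[2] for r in inside}
    node = {"window": window, "regions": inside, "children": []}
    # Split only if the window is large enough and mixes both labels.
    if max(w, h) > min_size and labels == {0, 1}:
        hw, hh = w / 2.0, h / 2.0
        for sub in ((x, y, hw, hh), (x + hw, y, hw, hh),
                    (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)):
            node["children"].append(build_quadtree(sub, inside, min_size))
    return node
```

A window that holds only foreground or only background regions stays a leaf, which matches the case where the classifier degenerates into a constant function.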
As for prediction, instead of shifting local windows, we assign each region $R_a$ in frame $t+1$ to a series of windows $\{W_{i_0}, W_{i_1}, \ldots, W_{i_{n-1}}\}$ in frame $t$. Recall the Regional Back-Track Method introduced in Section III-A. Assuming we have found the best matching region $R_b$ in frame $t$ (Section III-C discusses how to handle a mismatched $R_a$), $R_b$ is covered by a unique leaf node of the quad-tree partition. Tracing back through all the ancestor nodes in the quad-tree, we get a series of windows $\{W_{i_0}, W_{i_1}, \ldots, W_{i_{n-1}}\}$. For each window $W_{i_k}$, we use the pre-trained localized classifier $L(W_{i_k})$ to predict whether $R_a$ belongs to the foreground. (Note that here we use $(r, g, b, y, u, v, c_x - v^x_{R_a}, c_y - v^y_{R_a})$ as the feature vector, where $(v^x_{R_a}, v^y_{R_a})$ is the averaged motion vector of $R_a$.)
To produce the final classification result $q_{R_a}$, we integrate the localized classifiers using the following equation:
$$q_{R_a} = \frac{\sum_{k=0}^{n-1} \omega_k q_k}{\sum_{k=0}^{n-1} \omega_k} \qquad (4)$$
where $q_k$ denotes the binary prediction of $L(W_{i_k})$ and $\omega_k$ denotes the weight of classifier $L(W_{i_k})$. Classifiers with high confidence should be weighted more than those with low confidence. Therefore, in our experiments, the classification accuracy on the training set is used as $\omega_k$.
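Eq. (4) is a weighted average of the binary votes. A minimal Python sketch, with a hypothetical function name and an added guard for the degenerate case of all-zero weights:

```python
def integrate_predictions(predictions, weights):
    """Weighted average of binary predictions q_k with classifier weights w_k."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one classifier weight must be positive")
    return sum(w * q for w, q in zip(weights, predictions)) / total
```

For example, three classifiers voting (1, 1, 0) with weights (0.9, 0.8, 0.3) give a response of about 0.85, so the low-confidence dissenting classifier only slightly lowers the result.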
In summary, for an arbitrary region $R_a$ in frame $t+1$ that has a corresponding region $R_b$ in frame $t$, the Hierarchical Localized Classifiers make an integrated prediction of the probability that $R_a$ will be included in the foreground mask.
C. Combined Probability Mask and Iterative Refinement
The Combined Probability Mask is introduced to integrate the localized classification result with the global GMMs. The graph cut algorithm then uses this mask to optimize the segmentation result.
For the graph cut method, we need to optimize the following energy function
$$E = \lambda \sum_{i} E_d(R_i) + \sum_{i \ne j} E_c(R_i, R_j) \qquad (5)$$
where $E_d(R_i)$ is the data energy and $E_c(R_i, R_j)$ is the regional connection energy. In our framework, $E_c(R_i, R_j)$ is the color difference between regions $R_i$ and $R_j$, the same as in the traditional graph cut method [7], and $E_d(R_i)$ is the combined probability of the global Gaussian Mixture Color Models (GMMs) and the Hierarchical Localized Classifiers predictions, described as follows.
GMMs are widely used in segmentation and tracking tasks and have proved quite effective. In our system, both foreground and background GMMs are acquired by clustering regions in the reference frame $t$ according to the given mask. Note that directly updating the foreground GMMs is very risky. Because the initial foreground mask provided by user input in the key frame is extremely important, we combine the foreground in the initial key frame with that in the reference frame. In general, although the discrimination ability of the Hierarchical Localized Classifiers is better than that of GMMs, they may suffer from over-fitting and cannot handle the mismatched regions of Section III-A. Consequently, we combine these two responses to generate a more reliable foreground probability $p(R_a)$, using the formulas shown below.
1) If $R_a$ has a corresponding region $R_b$ in frame $t$, then
$$p(R_a) = \frac{q_{fg}(R_a) \cdot q_{R_a}}{q_{fg}(R_a) \cdot q_{R_a} + q_{bg}(R_a) \cdot (1 - q_{R_a})} \qquad (6)$$
2) Otherwise, $R_a$ is mismatched. Since $q_{R_a}$ is not available, we have
$$p(R_a) = \frac{q_{fg}(R_a)}{q_{fg}(R_a) + q_{bg}(R_a)} \qquad (7)$$
where $q_{fg}(R_a)$ is the probability of $R_a$ under the foreground GMMs, $q_{bg}(R_a)$ is the probability of $R_a$ under the background GMMs, and $q_{R_a}$ is the classification response of Section III-B.
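The two cases of Eqs. (6) and (7) can be folded into one small Python function. This is a sketch: the function name, the use of `None` to mark a mismatched region, and the fallback value for a zero denominator are assumptions rather than details given in the text.

```python
def combined_probability(q_fg, q_bg, q_local=None):
    """Foreground probability p(R_a); q_local is None for a mismatched region."""
    if q_local is None:
        num, den = q_fg, q_fg + q_bg            # Eq. (7): GMMs only
    else:
        num = q_fg * q_local                    # Eq. (6): GMMs and local classifiers
        den = num + q_bg * (1.0 - q_local)
    if den == 0:
        return 0.5  # assumption: uninformative value when both terms vanish
    return num / den
```

Note that Eq. (6) reduces to Eq. (7) when $q_{R_a} = 0.5$, so an undecided local response leaves the GMM estimate unchanged.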
Using the combined probability $p(R_i)$ of each region $R_i$ as its data energy $E_d(R_i)$, we can solve this two-label graph cut problem with the max-flow method. However, since complex videos often contain unexpected noise, the combined probability may drift in a few regions. Therefore, we apply an iterative refinement to the graph cut result, as follows.
1) Perform graph cut based on the combined probability $p(R_a)$ to get foreground regions.
2) Detect the largest connected component of the foreground regions to filter out false-alarm regions.
3) Update the foreground and background GMMs and the combined probability $p(R_a)$. Repeat Steps 1) and 2) until convergence.
In our experiments, only 2 or 3 iterations of the refinement are needed to produce a convincing result.
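Step 2) of the refinement can be illustrated with a breadth-first search over the region adjacency graph. This Python sketch assumes that regions are identified by integer ids, that adjacency is given as a dictionary, and that the largest component is the one with the most regions (the text does not say whether size is counted in regions or pixels). The graph cut itself is not shown.

```python
from collections import deque

def largest_component(foreground, adjacency):
    """Keep only the largest connected set of foreground region ids.

    foreground: set of region ids labeled foreground by graph cut.
    adjacency: dict mapping a region id to an iterable of neighboring ids.
    """
    seen, best = set(), set()
    for start in foreground:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for nb in adjacency.get(node, ()):
                if nb in foreground and nb not in seen:
                    seen.add(nb)
                    component.add(nb)
                    queue.append(nb)
        if len(component) > len(best):
            best = component
    return best
```

Foreground regions outside the returned set would be relabeled as background before the GMMs are updated in Step 3).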
IV. EXPERIMENTS
Since there are currently no standard datasets for video segmentation, the test sequences in our experiments are collected from [15] and [12]. The first video clip is water-skiing from [15] (97 frames, 544 × 280). The second is diving from [15] (179 frames, 880 × 488). The third is skating from [15] (573 frames, 552 × 310). The fourth is dancing from [12] (138 frames, 320 × 240). Note that these videos are very challenging in terms of dynamic camera, background clutter, blurred motion, object shadows, etc.
We quantitatively analyze our approach on these test sequences. We randomly select 10 frames from each video clip for evaluation and label the true foreground manually. The metric is the standard F-Measure, defined as
$$F\text{-Measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (8)$$
where Precision is the probability that an auto-segmented foreground pixel is a true foreground pixel and Recall is the probability that a true foreground pixel is detected.
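For reference, Eq. (8) and the pixel-level precision and recall can be computed as in the following Python sketch. The flat binary-mask representation and the function names are illustrative assumptions.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, as in Eq. (8)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def precision_recall(predicted, truth):
    """Pixel-level precision and recall for two equal-length binary masks."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

As a check against the reported numbers, a precision of 0.823 and a recall of 0.849 (the Grab Cut result on the diving clip) give an F-Measure of about 0.836.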
Since no source code or executable binary is available for current VOS methods such as [10] and [11], we use the Grab Cut algorithm [8] for comparison, drawing foreground bounding boxes several times and selecting the best result for each frame. Table I summarizes the comparison, from which we can see that our approach achieves a higher F-Measure than Grab Cut on all four clips. Note that our method works very well when handling visually similar foreground and background (such as dark legs and black background in Fig. 4(d)), where it improves F-Measure by as much as twenty percentage points. Some examples are shown in Fig. 4, which demonstrate that our method significantly improves the subjective quality of segmentation.
TABLE I. EXPERIMENTAL RESULTS
Video Clip     Method       Precision   Recall   F-Measure
Water-skiing   Grab Cut     0.753       0.911    0.836
               Our Method   0.938       0.849    0.891
               +/-          0.185       -0.062   0.067
Diving         Grab Cut     0.823       0.849    0.836
               Our Method   0.914       0.950    0.931
               +/-          0.091       0.101    0.096
Skating        Grab Cut     0.956       0.905    0.930
               Our Method   0.973       0.919    0.945
               +/-          0.017       0.014    0.015
Dancing        Grab Cut     0.873       0.620    0.725
               Our Method   0.946       0.947    0.947
               +/-          0.073       0.327    0.221
In terms of complexity, our method takes only about 300 milliseconds per frame on an Intel Core Quad 2.40 GHz CPU with 3 GB of memory. With the help of the initial labeled foreground mask and a reliable frame-by-frame inference strategy, our method can deal with very complex videos. Nevertheless, our method fails when a sudden, unexpected change of foreground appearance occurs.
V. CONCLUSION
In this paper, we propose a novel method that treats VOS as a problem of tracking and classifying regions in local windows. The Regional Back-Track Method, which is based on optical flow, is applied to track regions across frames. The Hierarchical Localized Classifiers are introduced to predict potential foreground regions. A combined probability mask based on the classification results and GMMs is used by the graph cut algorithm with iterative refinement, which produces reliable segmentation results. Experiments on four challenging videos show a higher F-Measure than Grab Cut on every clip.
The current version uses only single-frame propagation, which may lead to unexpected drift in certain extreme scenarios. Although the foreground GMMs in the initial key frame are used as global constraints, which enhances the stability of our method, we believe that multi-frame propagation will benefit more from the spatial-temporal space. Another direction is extending this work to multi-object cutout, which has broader application prospects. We expect to investigate these issues in our future work.
ACKNOWLEDGMENT
This work is supported by National Science Foundation of
China under grant No.61075026.
REFERENCES
[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, p. 13, 2006.
[2] D. Comaniciu and P. Meer, “Mean shift analysis and applications,” in The Proceedings of IEEE International Conference on Computer Vision, vol. 2, 1999, pp. 1197–1203.
[3] K. Nummiaro, E. Koller-Meier, and L. J. V. Gool, “An adaptive color-based particle filter,” Image Vision Comput., vol. 21, no. 1, pp. 99–110, 2003.
[4] H. Grabner and H. Bischof, “On-line boosting and vision,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2006, pp. 260–267.
[5] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof, “On-line random forests,” in IEEE International Conference on Computer Vision Workshops, 2009, pp. 1393–1400.
[6] A.-R. Mansouri and J. Konrad, “Motion segmentation with level sets,” in IEEE International Conference on Image Processing, 1999, pp. 126–130.
[7] Y. Boykov and M.-P. Jolly, “Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images,” in IEEE International Conference on Computer Vision, 2001, pp. 105–112.
[8] C. Rother, V. Kolmogorov, and A. Blake, “Grab cut: interactive foreground extraction using iterated graph cuts,” ACM Transactions on Graphics, vol. 23, pp. 309–314, 2004.
[9] Y. Li, J. Sun, and H.-Y. Shum, “Video object cut and paste,” ACM Transactions on Graphics, vol. 24, pp. 595–600, 2005.
[10] X. Bai, J. Wang, D. Simons, and G. Sapiro, “Video snapcut: robust video object cutout using localized classifiers,” vol. 28, 2009.
[11] W. Brendel and S. Todorovic, “Video object segmentation by tracking regions,” in IEEE International Conference on Computer Vision, 2009, pp. 833–840.
[12] J. C. Niebles, B. Han, A. Ferencz, and L. Fei-Fei, “Extracting moving people from internet videos,” in European Conference on Computer Vision, 2008, pp. 527–540.
[13] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels,” EPFL, Tech. Rep., Jun. 2010.
[14] R. Nock and F. Nielsen, “Statistical region merging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 1452–1458, 2004.
[15] M. Grundmann, V. Kwatra, M. Han, and I. Essa, “Efficient hierarchical graph based video segmentation,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2010, pp. 2141–2148.
(a) Water-skiing Sequence on Frame 27, 48, 57, 67
(b) Diving Sequence on Frame 35, 64, 83, 122
(c) Skating Sequence on Frame 12, 18, 63, 111
(d) Dancing Sequence on Frame 5, 20, 101, 130
Fig. 4. Experimental Results. From left to right, 1st row: Original Key Frame Image, Segmentation Results of Our Approach; 2nd row: Initial Labeled Foreground Mask, Segmentation Results of Grab Cut [8]. Please zoom in to check for more segmentation details.