
978-1-4673-0024-7/10/$26.00 ©2012 IEEE

2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2012)

Segmentation of Depth Image using Graph cut

Jiangming Yu, Jieyu Zhao
Research Institute of Computer Science and Technology, Ningbo University, Ningbo, China, 315211

Abstract - A large number of tasks in computer vision involve separating a target from the background of an image, a task also known as the foreground/background discrimination problem. Various methods have been developed to solve this problem [1, 2, 10, 11, 12]. Recent general-purpose object extraction techniques use both color and edge information for segmentation. In this paper, we apply graph cut methods to images with depth information obtained from a Kinect camera. We also combine the approach with a pyramid representation, which greatly reduces the time taken by the iterative graph cut method: we estimate the statistical model on the bottom layer of the pyramid and run graph cut on the top layer. This speeds up the whole segmentation process while keeping good segmentation quality. When the object's color resembles that of the background but its depth differs, our method can still achieve a good result.

Keywords - graph cut; depth image; pyramid representation; Gaussian mixture model

I. INTRODUCTION

Interactive image segmentation is becoming more and more popular as a way to alleviate the problems inherent in fully automatic segmentation, which never seems to be perfect. The ultimate goal is to extract an object from the background of an input image with as few user interactions as possible. Foreground/background discrimination aims to separate an image into two distinct parts. To achieve good quality, the information that can be used includes the color, texture, and depth of the pixels in each segment. Some prior on the segmentation is usually needed to obtain a good result. The prior is usually expressed as appearance models [1, 2] that better distinguish the foreground from the background. Intuitively, the foreground and background priors provide constraints on what the user intends to segment. A good segmentation should be smooth within regions while preserving the sharp discontinuities that exist at object boundaries.

A good way to use both the color (texture) information and the contrast (edge) information is the graph cut method [1, 2, 3].

The theory of graph cut was first applied in computer vision in the paper by Greig, Porteous and Seheult [4] of Durham University. In the Bayesian statistical context of smoothing noisy (or corrupted) images, they showed how the maximum a posteriori estimate of a binary image can be obtained exactly by maximizing the flow through an associated image network. Before this result, approximate techniques such as simulated annealing (as proposed by the Geman brothers [5]) or iterated conditional modes (a type of greedy algorithm suggested by Julian Besag [6]) were used to solve such image smoothing problems.

To use graph cut, the first step is to build a graph. The graph has two kinds of nodes. Neighbor nodes correspond to the pixels of the image. Terminal nodes are the other kind, and only two exist in the graph: the object terminal and the background terminal. The links between nodes also fall into two kinds. A neighbor link connects two neighbor nodes and corresponds to the prior term in the energy function. A terminal link connects a neighbor node to a terminal node and corresponds to the likelihood term in the energy function. To estimate the neighbor links, the neighbor nodes are usually treated as a probability field, typically a Markov random field (MRF); the neighbor links preserve sharp boundaries. The terminal links are estimated as the likelihood under the priors that the user gives, subject to the hard constraints of the user's interaction. The two kinds of links are the penalties for segmenting some regions as object and others as background. Once the graph is built, a fast segmentation can be obtained with a new max-flow algorithm [7]. In brief, the process starts with some user interaction that provides hard constraints for the segmentation. The graph is then built with these constraints on the MRF, and graph cut is used to find the globally optimal segmentation of the image. After the minimum cut is found, the object and background regions are naturally defined by the cut. The result gives the best balance of boundary and region properties among all segmentations satisfying the constraints.
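The graph construction can be illustrated on a toy example. The following is a minimal sketch, not the max-flow algorithm of [7]: it runs a plain Edmonds-Karp search on a hypothetical four-pixel, one-dimensional image, and every capacity is a made-up number chosen for illustration. It assumes the convention of Boykov and Jolly [1], in which the link from the object terminal `S` to a pixel carries the penalty for labeling that pixel as background, and the link from a pixel to the background terminal `T` carries the penalty for labeling it as object.

```python
from collections import deque

def min_cut(capacity, source, sink):
    """Edmonds-Karp max-flow; returns (flow value, set of nodes on the source side)."""
    # Residual graph: copy the capacities and add reverse edges with zero capacity.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            # No path left: the nodes still reachable from the source form the object side.
            return flow, set(parent)
        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Hypothetical capacities. S = object terminal, T = background terminal.
capacity = {
    "S": {"p0": 9, "p1": 6, "p2": 2, "p3": 1},   # terminal links (cost of "background")
    "p0": {"T": 1, "p1": 5},                     # pixel -> T is the cost of "object"
    "p1": {"T": 3, "p0": 5, "p2": 1},            # weak neighbor link p1-p2: likely boundary
    "p2": {"T": 7, "p1": 1, "p3": 5},
    "p3": {"T": 9, "p2": 5},
}
flow, source_side = min_cut(capacity, "S", "T")
objects = sorted(node for node in source_side if node != "S")
print(flow, objects)   # 8 ['p0', 'p1']
```

The cut severs the weak neighbor link between `p1` and `p2` and the cheaper terminal link of each pixel, so the total cut cost of 8 equals the minimum energy over all sixteen possible labelings of the four pixels.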

Boykov and Jolly [1] derive a general-purpose interactive segmentation technique that divides an image into two segments. They impose two kinds of constraints, which they call hard constraints and soft constraints. The hard constraints provide clues on what the user intends to segment. The rest of the image is segmented automatically by computing a global optimum among all segmentations satisfying the hard constraints. The main advantages of their method are that it applies to N-dimensional segmentation and that its cost function is clearly defined. They note that many previous techniques have no clear cost function at all [8], and some compute only an approximate solution. In their method, by contrast, imperfections of a globally optimal solution are directly related to the definition of the cost function. Their cost function is derived from the one in [3] in the context of MAP-MRF estimation. Their technique is based on powerful graph cut algorithms from combinatorial optimization [9, 10]. They apply their method only to gray-scale images.

Rother, Kolmogorov, and Blake [2] extend the method of Boykov and Jolly [1]. They use Gaussian mixture models (GMMs) instead to construct the statistical model in RGB color space, following a practice already used for soft segmentation [13, 14]. They use two GMMs, one for the foreground and one for the background, and they develop an iterative version of the optimization. Rother, Kolmogorov, and Blake [2] advance the graph cut approach in three respects. First, they apply graph cut iteratively. Second, they reduce the user's interaction to dragging a rectangle around the desired object. They call this “incomplete labeling”, since not all pixels in the rectangle belong to the foreground; their method adapts itself during the iterative process. Third, when the segmentation is done, they use a matting strategy to refine the contour they obtain. The drawback is that their method is time-consuming, and it fails when the object and the background resemble each other in color space.

More recently, Vicente, Kolmogorov and Rother [10] use graph cut with higher-order MRFs. They propose graph cut based image segmentation with connectivity priors and formulate several versions of the connectivity constraint on top of the two-term energy function. However, minimizing their energy function is NP-hard. Other methods also consider higher-order MRFs, see [11, 12]. These methods cannot guarantee a globally optimal segmentation.

In this paper, we extend the iterative graph cut method of Rother, Kolmogorov, and Blake [2]. The user drags a rectangle to cover the object in a given image. We also use the depth information obtained from a Kinect camera. Specifically, we use the depth information in both terms of the graph cut energy function: the data term and the spatial coherency term. Our paper makes two main contributions.

First, we revise the two terms of the energy function to accommodate the additional depth information obtained from the Kinect camera. We add a single Gaussian model to the foreground GMM and a uniform distribution to the background GMM. Because we treat the depth information as independent of the color information, the probability model of the depth can be combined with the color model by multiplication. The foreground probability model is therefore a GMM multiplied by a single Gaussian on the depth channel, and the background probability model is a GMM multiplied by a uniform distribution on the depth channel. We model the background depth with a simple uniform distribution because background pixels are usually complex and spread over a wide range of depths. We furthermore add the depth information to the second term of the energy function, the prior term. This means that the boundary of our segmentation must respect both sharp color discontinuities and depth discontinuities. This overcomes the drawback that arises when the colors of the background and the object resemble each other. With our energy function, even if we cannot distinguish the object from the background using image color alone, the added depth term still preserves the depth discontinuity and extracts the object.

Second, we notice that although we use a new max-flow algorithm [7], this step still accounts for most of the running time of the whole graph cut procedure, even when the image is small, about 300×300 pixels. Since the time spent on graph cut is mostly spent in the max-flow algorithm, we use a trick that affects the result only slightly while still achieving good quality. We represent the image at different scales. The method resembles the pyramid method but in a simpler form with fewer layers: we use only two layers. We estimate the probability model on the bottom layer and then run graph cut on the top layer. The boundary obtained on the top layer is projected onto the bottom layer, and a new cycle begins. In our experiments, this takes less time at roughly the same quality.

This paper is organized as follows: Section II presents the mathematical formulation of the segmentation problem. Section III describes our two-layer iterative graph cut algorithm. The implementation details and experimental results are presented in Section IV, and Section V concludes the paper.

II. PROBLEM FORMULATION

The problem of segmenting an image can be seen as a labeling problem. We assign a label $l_i$ $(i = 1, \dots, T)$ to every pixel in the image, where $T$ is the number of pixels in the image and $l_i$ specifies the assignment of pixel $i$. We then use $L = (l_1, \dots, l_i, \dots, l_T)$, $l_i \in \{0, 1\}$, to represent the labeling of the whole image. Each $l_i$ is either 0 or 1, where 0 denotes the object and 1 denotes the background. The vector $L$ defines the segmentation. Furthermore, we use $o_i$ to denote the observation at pixel $i$. The observation in a color image is RGB; in our problem we simultaneously obtain the depth information from the Kinect camera sensor, so the observation data is RGBD. We use $O = (o_1, \dots, o_i, \dots, o_T)$ to represent the whole image data. We can now solve the segmentation problem in a probabilistic framework, that is, $\hat{l} = \arg\max_{l} p(l \mid o)$: we maximize the posterior $p(l \mid o)$ to obtain the optimal contours. This posterior can be written as $p(l \mid o) \propto p(o \mid l)\, p(l)$, where the first factor is the observation likelihood (calculated from the hard constraints of the user's interaction) and the second factor is the prior (calculated in the Markov random field). Because this formulation cannot be computed directly, we rewrite it as follows:

$$p(l \mid o) \propto \exp(-E) \tag{1}$$

where

$$E = \lambda A(l, o) + B(l) \tag{2}$$

$$A(l, o) = \sum_{i} f_i(o_i \mid l_i) \tag{3}$$

$$B(l) = \sum_{(i, j) \in N} g_i(l_i, l_j) \tag{4}$$

The maximization problem can now be computed as the minimization of the “energy” $E$. The first term $A(l, o)$ in Eq. (2) is known as the regional term; it sums the individual penalties for assigning pixel $i$ to the object or to the background. This term is calculated from the hard constraints of the user's interaction. If the RGBD data of a pixel is close to the probability model constructed from the user's interaction, the penalty for labeling this pixel accordingly is small; otherwise the penalty is large. The term $B(l)$ comprises the boundary properties of the segmentation $L$ and is interpreted as a penalty for a discontinuity between neighboring pixels $i$ and $j$. This penalty is large when the RGBD observations of two neighbors in the MRF neighborhood system are similar, and small when they differ greatly, so that the cut prefers to pass between dissimilar pixels. The coefficient $\lambda$ specifies the relative importance of the region properties term $A(l, o)$ versus the boundary properties term $B(l)$.
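The trade-off controlled by $\lambda$ can be seen on a toy example. The sketch below evaluates the energy for every labeling of a hypothetical four-pixel chain by exhaustive search. It assumes that $\lambda$ multiplies the regional term, and the penalty tables `unary` and `pair_weight` are invented numbers, not values from our experiments.

```python
from itertools import product

def energy(labels, unary, pair_weight, lam):
    """E = lam * A + B for a 1-D chain of pixels (label 0 = object, 1 = background)."""
    # Regional term A: sum of the per-pixel penalties f_i(o_i | l_i).
    regional = sum(unary[i][label] for i, label in enumerate(labels))
    # Boundary term B: neighbors pay pair_weight[i] only when their labels differ.
    boundary = sum(w * abs(labels[i] - labels[i + 1]) for i, w in enumerate(pair_weight))
    return lam * regional + boundary

def best_labeling(unary, pair_weight, lam):
    """Exhaustive search over all labelings; feasible only for toy sizes."""
    candidates = product((0, 1), repeat=len(unary))
    return min(candidates, key=lambda labels: energy(labels, unary, pair_weight, lam))

# Hypothetical penalties: unary[i] = (cost of object label, cost of background label).
unary = [(1, 9), (3, 6), (7, 2), (9, 1)]
pair_weight = [5, 1, 5]   # the weak middle link marks a likely object boundary

print(best_labeling(unary, pair_weight, lam=1.0))    # (0, 0, 1, 1)
print(best_labeling(unary, pair_weight, lam=0.05))   # (1, 1, 1, 1)
```

With a balanced weight the minimum places the boundary at the weak link. With a very small weight on the regional term, the boundary term dominates and the minimum collapses to a single label, which is why the coefficient has to be chosen with care.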

The likelihood term $A(l, o)$ measures how well the observation data in the image agree with the foreground and background probability models. From the form $f_i(o_i \mid l_i)$ we can see that, when the label $l_i$ is given the value 0 or 1, the term measures how well the observation $o_i$ fits the foreground or the background model, respectively. The main problem in this term is how to construct the foreground and background probability models. We adopt the interaction in which the user drags a rectangle covering the object in the given image, which reduces the user's interaction drastically. This interaction is used in [2], but no depth information is included in their model. They use two GMMs to estimate the foreground and background probability models in RGB color space. Their formulation is as follows:

$$f(o_i \mid \theta_f) = \sum_{j=1}^{K} \pi_j \, n(c_i; \mu_j, \Sigma_j) \tag{5}$$

$$f(o_i \mid \theta_b) = \sum_{j=1}^{K} \pi_j \, n(c_i; \mu_j, \Sigma_j) \tag{6}$$

The symbols $\theta_f$ and $\theta_b$ denote the foreground model and the background model, respectively; each model has its own mixture parameters $\pi_j$, $\mu_j$, $\Sigma_j$. $n(c_i; \mu_j, \Sigma_j)$ is a normal distribution, also called a single Gaussian model, and $K$ is the number of single Gaussian models in the mixture. $\pi_j$ is a coefficient that gives the proportion of the $j$-th single Gaussian model in the GMM (Gaussian mixture model). In our paper, we add the depth information to Eqs. (5) and (6): we add a single Gaussian model to the foreground model and a uniform distribution to the background model. The reason is that the object is usually concentrated in the depth channel, whereas the background is complex and spread out, so a uniform distribution fits it. The form is as follows:

$$p(o_i \mid \theta_f) = \sum_{j=1}^{K} \pi_j \, n(c_i; \mu_j, \Sigma_j) \, n(d_i; \mu_d, \Sigma_d) \tag{7}$$

$$p(o_i \mid \theta_b) = T^{-1} \sum_{j=1}^{K} \pi_j \, n(c_i; \mu_j, \Sigma_j) \tag{8}$$

Here $c_i$ is the RGB color data observed in the image and $d_i$ is the depth information sensed by the Kinect camera. $T$ is the number of pixels in the image, as defined above.

More specifically, we use the GMMs in RGBD space to set the regional penalties in $A(l, o)$ as negative log-likelihoods. The final form is:

$$f_i(o_i \mid l_i = 0) = -\ln p(o_i \mid \theta_f) \tag{9}$$

$$f_i(o_i \mid l_i = 1) = -\ln p(o_i \mid \theta_b) \tag{10}$$
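The effect of the depth factor in Eqs. (7) to (10) can be sketched in a few lines. This is an illustration under simplifying assumptions, not our implementation: covariances are taken to be diagonal, each mixture has a single hypothetical component, and the colors, depths, and variances are invented. The function names are illustrative only.

```python
import math

def gaussian(x, mean, var):
    """Gaussian density with diagonal covariance; x, mean, var have equal length."""
    density = 1.0
    for xi, mi, vi in zip(x, mean, var):
        density *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
    return density

def color_mixture(color, components):
    """GMM density: sum of pi_j * n(c; mu_j, Sigma_j) over (pi_j, mu_j, var_j) triples."""
    return sum(pi * gaussian(color, mu, var) for pi, mu, var in components)

def foreground_cost(color, depth, components, depth_mean, depth_var):
    """Eqs. (7) and (9): color GMM times a single Gaussian on depth, as a negative log."""
    p = color_mixture(color, components) * gaussian([depth], [depth_mean], [depth_var])
    return -math.log(p)

def background_cost(color, components, num_pixels):
    """Eqs. (8) and (10): color GMM times the uniform factor 1/T, as a negative log."""
    return -math.log(color_mixture(color, components) / num_pixels)

# Hypothetical scene: object and background share the same red color model.
red = [(1.0, (200.0, 30.0, 30.0), (100.0, 100.0, 100.0))]
T = 300 * 300                                # number of pixels in the image
near_pixel = ((200.0, 30.0, 30.0), 805.0)    # depth close to the foreground mean
far_pixel = ((200.0, 30.0, 30.0), 900.0)     # same color, farther from the camera

for color, depth in (near_pixel, far_pixel):
    fg = foreground_cost(color, depth, red, depth_mean=800.0, depth_var=400.0)
    bg = background_cost(color, red, T)
    print("object" if fg < bg else "background")
```

Both pixels have exactly the same color likelihood under the two models, so color alone cannot separate them; the depth factor is what makes the near pixel cheaper to label as object and the far pixel cheaper to label as background.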

As for the term $B(l)$, we use the MRF neighborhood system, here the four-neighborhood system. That is, whether a pixel is labeled 0 or 1 depends only on its four nearest neighbor pixels. The term is nonzero only when two neighboring pixels have different labels. The penalty is calculated as follows:

$$p(o_i, o_j) = \exp\left(-\frac{\lVert o_i - o_j \rVert^2}{2\sigma^2}\right) \tag{11}$$

$o_i$ and $o_j$ are neighbors in the MRF neighborhood system. The function penalizes heavily when two neighboring pixels have similar values, $\lVert o_i - o_j \rVert < \sigma$. On the other hand, if the pixels are very different, $\lVert o_i - o_j \rVert > \sigma$, the penalty is small. Intuitively, this function corresponds to the distribution of noise among neighboring pixels of an image. We can further refine the equation into the form:

$$g_i(l_i, l_j) = \lvert l_i - l_j \rvert \, p(o_i, o_j) \tag{12}$$

This form means that only links crossing the contour are penalized. It defines the soft constraints used to compute the global optimum of the boundaries.
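Eqs. (11) and (12) are short enough to write out directly. The sketch below is illustrative only: the RGBD tuples and the value of $\sigma$ are made up, and the depth channel is assumed to have been scaled to a range comparable to the color channels, which the text does not specify.

```python
import math

def similarity(o_i, o_j, sigma):
    """Eq. (11): exp(-||o_i - o_j||^2 / (2 sigma^2)) for two RGBD observations."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(o_i, o_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def boundary_penalty(l_i, l_j, o_i, o_j, sigma):
    """Eq. (12): neighbors pay the similarity only when their labels differ."""
    return abs(l_i - l_j) * similarity(o_i, o_j, sigma)

# Hypothetical RGBD observations (R, G, B, D): three pixels with identical color.
can_a = (200, 30, 30, 80)     # on the object
can_b = (200, 30, 30, 82)     # neighbor on the same surface
book = (200, 30, 30, 140)     # neighbor on the background, farther away
sigma = 20.0

same_surface = boundary_penalty(0, 1, can_a, can_b, sigma)
across_depth_edge = boundary_penalty(0, 1, can_a, book, sigma)
print(same_surface > across_depth_edge)   # True
```

With color alone, both pairs would have zero distance and the same maximal penalty, so the cut would have no preferred location. The depth component lowers the penalty across the depth edge, which is where the minimum cut then passes.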


III. ALGORITHM

In this section, we introduce our two-layer iterative graph cut algorithm for RGBD images. The core of our algorithm is to use the depth information to handle the situation in which the probability models of the foreground and the background resemble each other. With the additional depth channel sensed by the Kinect camera, we can easily extract the object from the background. We also find that the graph cut algorithm is somewhat time-consuming when computing the minimum energy, so we use an image pyramid and apply the new fast max-flow algorithm on the upper layer of the pyramid, which achieves relatively good quality in less time. We also tried more layers in the image pyramid, but the results were not as good as with two layers.

To start our algorithm, we first need the user to drag a rectangle to cover the object in the image. We then use k-means to estimate the GMMs of the foreground and background models, together with an additional single Gaussian model and a uniform distribution for the depth channel. After that, we calculate the two terms of the energy function. This step differs from the formulation given above because we use an image pyramid. We extract one pixel out of every four neighboring pixels in the image, so the number of pixels passed to the max-flow algorithm decreases by three quarters. The two energy terms are computed on this reduced image, and the graph for the max-flow algorithm is constructed on it as well. The optimal boundary is obtained from the min-cut found by the max-flow algorithm. This boundary is projected onto the bottom layer of the image pyramid, and a new iteration begins. The key steps of our algorithm are listed as follows:

1. Given an image observation $o_{i,j}$ $(i \in \{1, \dots, I\},\ j \in \{1, \dots, J\},\ I \times J = T)$, the number of iterations $IterNum$, the foreground and background observation stacks, and the parameter $\sigma$.

2. Read the color and depth data from the Kinect camera, set $IterNum = 0$ and the number of GMM components $K = 5$. Extract the top layer image $o_{p,q}$ $(p = [i/2],\ q = [j/2])$.

3. If $IterNum = 0$, use the user's rectangle to fill the stacks: pixels inside the rectangle are put into the foreground stack and the others into the background stack. If $IterNum \neq 0$, pixels of the bottom layer image that belong to the foreground are put into the foreground stack and the others into the background stack. Set $IterNum = IterNum + 1$.

4. Use the k-means algorithm [15] to estimate the GMMs of the foreground and background models on the bottom image layer. Compute the values of the links with the formulas $-\ln \sum_{j=1}^{K} \pi_j \, n(c_{p,q}; \mu_j, \Sigma_j) \, n(d_{p,q}; \mu_d, \Sigma_d)$ for the link to the object terminal, $-\ln \left( T^{-1} \sum_{j=1}^{K} \pi_j \, n(c_{p,q}; \mu_j, \Sigma_j) \right)$ for the link to the background terminal, and $\exp\left(-\frac{\lVert o_{p,q} - o_{p_1,q_1} \rVert^2}{2\sigma^2}\right)$ for the neighbor link, where $(p_1, q_1)$ is a coordinate neighboring $(p, q)$.

5. Use the new fast max-flow algorithm [7] to find the optimal boundary in the top image and project it onto the bottom layer.

6. If the result satisfies the user, stop; otherwise go to step 3.
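The two pyramid operations used in steps 2 and 5 can be sketched as follows. This is a minimal illustration on nested Python lists, assuming that the retained pixel is the top-left one of each 2×2 block and that a top-layer label is copied to every bottom-layer pixel $(i, j)$ with $p = [i/2]$, $q = [j/2]$ (zero-based indices here). The function names are hypothetical.

```python
def top_layer(image):
    """Keep one pixel from every 2x2 block, so the graph has about a quarter of the nodes."""
    return [row[::2] for row in image[::2]]

def project_to_bottom(top_labels, height, width):
    """Copy each top-layer label to the bottom-layer pixels it stands for."""
    return [[top_labels[i // 2][j // 2] for j in range(width)] for i in range(height)]

# Hypothetical 4x6 bottom-layer image; each value encodes its own row and column.
image = [[10 * r + c for c in range(6)] for r in range(4)]
small = top_layer(image)
print(small)   # [[0, 2, 4], [20, 22, 24]]

# Labels found by graph cut on the top layer (0 = object, 1 = background).
top_labels = [[0, 0, 1],
              [0, 1, 1]]
bottom_labels = project_to_bottom(top_labels, height=4, width=6)
for row in bottom_labels:
    print(row)
```

The projected labels are blocky at the boundary, which is acceptable here because they only select the pixels used to re-estimate the statistical models on the bottom layer in the next iteration.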

IV. EXPERIMENTAL RESULTS

In this section, we apply our method to several images and compare it with the ordinary iterative graph cut method. We evaluate our algorithm in two steps. First, we use the two-layer iterative graph cut on the standard starfish image.

Figure 1. Our two-layer iterative graph cut method on the standard starfish image: (a) first iteration, (b) second iteration, (c) third iteration.

Figure 2. Ordinary iterative graph cut method on the standard starfish image: (a) first iteration, (b) second iteration, (c) third iteration.

The upper part of Figure 1 shows the segmentation result obtained on the top layer of the image. The blue line in the lower part shows the contour obtained on the bottom layer. The results show that our two-layer iterative graph cut produces almost the same segmentation as the ordinary method. The time used in each iteration is listed in Table 1.

Table 1: Time used to obtain the results in Figure 1 and Figure 2

Method                          Iteration 1   Iteration 2   Iteration 3
Two-layer iterative graph cut   3235 ms       1813 ms       1563 ms
Ordinary iterative graph cut    4422 ms       2203 ms       1859 ms


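Table 1 shows that the two-layer part of our method reduces the running time while preserving the quality of the segmentation. Summed over the three iterations, the two-layer method takes 6611 ms against 8484 ms for the ordinary method, a reduction of about 22%.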

Next, we add the depth information to the graph cut method; both energy terms must be changed to accommodate the depth channel. We compare the results obtained using only color information with those of our method, which uses both color and depth information. In particular, our method outperforms the ordinary iterative graph cut method when the object in the image resembles the background in RGB color space. Our results are as follows:

Figure 3. The test RGB image, containing a red book and a red can of Coke, and the corresponding depth image: (a) RGB image, (b) depth image. The depth information is encoded in the blue (higher 8 bits) and green (lower 8 bits) channels.

Figure 4. Segmentation results of the ordinary iterative graph cut method without depth information: (a) first iteration, (b) second iteration, (c) third iteration. It fails to separate the red can from the book behind it.

Figure 5. Segmentation results of our method with the additional depth information sensed by the Kinect camera: (a) first iteration, (b) second iteration, (c) third iteration. With the depth information added to the two terms of the energy function, we can easily extract the object from the background even when the foreground and the background are similar in color space.

V. CONCLUSION

In this paper, we extend the iterative graph cut method in two ways. First, we use a strategy that constructs the statistical model on the bottom layer of a two-layer pyramid and computes the min-cut on the top layer. This reduces the running time while maintaining the required quality. Second, we add depth information to the graph cut method to handle the situation in which the object and the background have similar color distributions. The energy terms are revised to take the additional depth information of the image into account. Experimental results show the effectiveness of our method.

ACKNOWLEDGMENT

This work is supported by the Twelfth Five-Year Hi-Tech project of the Ministry of Science and Technology, the discipline project of Ningbo University (xkl09154), the Natural Science Foundation of Zhejiang (D1080807), and the Scientific Research Foundation of Ningbo University (G11JA017).

REFERENCES

[1] Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, 2001.

[2] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: interactive foreground extraction using iterated graph cuts. SIGGRAPH, August 2004.

[3] D. Greig, B. Porteous, and A. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B, 51(2):271-279, 1989.

[4] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B, 51(2):271-279, 1989.

[5] D. Geman and S. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6:721-741, 1984.

[6] J. E. Besag. On the statistical analysis of dirty pictures (with discussion). Journal of the Royal Statistical Society, Series B, 48:259-302, 1986.

[7] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. In 3rd Intl. Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR). Springer-Verlag, September 2001.

[8] R. M. Haralick and L. G. Shapiro. Computer and Robot Vision. Addison-Wesley Publishing Company, 1992.

[9] A. Goldberg and R. Tarjan. A new approach to the maximum flow problem. Journal of the Association for Computing Machinery, 35(4):921-940, October 1988.

[10] S. Vicente, V. Kolmogorov, and C. Rother. Graph cut based image segmentation with connectivity priors. In CVPR, 2008.

[11] S. Vicente, V. Kolmogorov, and C. Rother. Joint optimization of segmentation and appearance models. In ICCV, 2009.

[12] O. J. Woodford, C. Rother, and V. Kolmogorov. A global perspective on MAP inference for low-level vision. Microsoft Research Technical Report, 2009.

[13] M. Ruzon and C. Tomasi. Alpha estimation in natural images. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000.

[14] Y.-Y. Chuang, B. Curless, D. Salesin, and R. Szeliski. A Bayesian approach to digital matting. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001.

[15] M. Inaba, N. Katoh, and H. Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proc. 10th ACM Symposium on Computational Geometry, pp. 332-339, 1994.
