
Moving Object Detection by Connected Component Labeling of Point Cloud Registration Outliers on the GPU

Michael Korn, Daniel Sanders and Josef Pauli
Intelligent Systems Group, University of Duisburg-Essen, 47057 Duisburg, Germany

{michael.korn, josef.pauli}@uni-due.de, [email protected]

Keywords: 3D Object Detection, Iterative Closest Point (ICP), Point Cloud Registration, Connected Component Labeling (CCL), Depth Images, GPU, CUDA.

Abstract: Using a depth camera, the KinectFusion with Moving Objects Tracking (KinFu MOT) algorithm permits tracking the camera poses and building a dense 3D reconstruction of the environment which can also contain moving objects. The GPU processing pipeline allows this simultaneously and in real-time. During the reconstruction, yet untraced moving objects are detected and new models are initialized. The original approach to detect unknown moving objects is not very precise and may include wrong vertices. This paper describes an improvement of the detection based on connected component labeling (CCL) on the GPU. To achieve this, three CCL algorithms are compared. Afterwards, the migration into KinFu MOT is described. It incorporates the 3D structure of the scene, and three plausibility criteria refine the detection. In addition, potential benefits on the CCL runtime of CUDA Dynamic Parallelism and of skipping termination condition checks are investigated. Finally, the enhancement of the detection performance and the reduction of response time and computational effort is shown.

1 INTRODUCTION

One key skill to understand complex and dynamic environments is the ability to separate the different objects in a scene. During a 3D reconstruction, the movement of the camera and the movement of one or several observed objects not only aggravate the registration of the data over several frames, but also provide further information for the decomposition of the elements of the scene. This paper focuses on the detection of moving objects in depth images on the GPU in real-time. In the context of this paper, two phases of the reconstruction process are distinguished. Firstly, the major process of tracking and constructing models of all known objects and the static background (due to camera motion) of the scene. Typically, depth images are processed as point clouds and tracking can be accomplished by point cloud registration with the Iterative Closest Point (ICP) algorithm, a well-studied algorithm for 3D shape alignment (Rusinkiewicz and Levoy, 2001). Afterwards, the models can be created and continually enhanced by fusing all depth image data into voxel grids (Curless and Levoy, 1996). Secondly, and the focus of this paper, after the registration of each frame the data must be filtered for new moving objects for which no models have been previously considered. Obviously, all pixels which can be matched sufficiently with existing models do not indicate a new object. But pixels that remain as outliers of the registration process may be caused by a yet untraced moving object. At the same time, outliers are spread over the whole depth data due to sensor noise, reflection, point cloud misalignment, et cetera. Accordingly, only large clusters (in three-dimensional space) of outliers can be candidates for the initialization of a new object model. These clusters can be found by Connected Component Labeling (CCL) on the GPU. Subsequently, the shape of the candidates has to be verified with some plausibility criteria.

This paper is organized as follows. In section 2 the required background knowledge concerning the 3D reconstruction process is summarized; in addition, subsection 2.3 discusses the detection of new objects in particular. At the beginning of section 3, three GPU based CCL algorithms are introduced and compared. Afterwards, the migration into an existing 3D reconstruction algorithm is described and two further runtime optimizations are discussed. Section 4 details our experiments and results. In the final section we discuss our conclusions and future directions.


2 BACKGROUND

Our implementation builds on the KinectFusion (KinFu) algorithm, which was published in (Izadi et al., 2011; Newcombe et al., 2011) and has become a state-of-the-art method for 3D reconstruction in terms of robustness, real-time capability and map density. The next subsection outlines the original KinFu publication. Afterwards, an enhanced version is introduced. Finally, the detection of new moving objects is described in detail.

2.1 KinectFusion

The KinectFusion algorithm generates a 3D reconstruction of the environment on a GPU in real-time by integrating all available depth images from a depth sensor into a discretized Truncated Signed Distance Function (TSDF) representation (Curless and Levoy, 1996). The measurements are collected in a voxel grid in which each voxel stores a truncated distance to the closest surface, including a related weight that is proportional to the certainty of the stored value. To integrate the depth data into the voxel grid, every incoming depth image is transformed into a vertex map and normal map pyramid. Another deduced vertex map and normal map pyramid is obtained by ray casting the voxel grid based on the last known camera pose. According to the camera view, the grid is sampled by rays searching in steps for zero crossings of the TSDF values. Both pyramids are registered by an ICP procedure and the resulting transformation determines the current camera pose. For runtime optimization, the matching step of the ICP is accomplished by projective data association and the alignment is rated by a point-to-plane error metric (see (Chen and Medioni, 1991)). Subsequently, the voxel grid data is updated by iterating over all voxels and projecting each voxel into the image plane of the camera. The new TSDF values are calculated using a weighted running average. The weights are also truncated, allowing the reconstruction of scenes with some degree of dynamic objects. Finally, the maps created by the raycaster are used to generate a rendered image of the implicit environment model.
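The following sketch illustrates the weighted running average update of the voxel grid described above. It is a minimal example, not the actual KinFu implementation: the grid layout, the pose representation (row-major rotation Rcw and translation tcw) and all parameter names are assumptions made for illustration.

#include <cuda_runtime.h>

struct Intrinsics { float fx, fy, cx, cy; };

// One thread per (x, y) voxel column; each thread marches through z.
__global__ void updateTsdf(float* tsdf, float* weight,
                           const unsigned short* depthMm, int cols, int rows,
                           Intrinsics intr, const float* Rcw, const float* tcw,
                           int dimX, int dimY, int dimZ,
                           float voxelSize, float truncDist, float maxWeight)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dimX || y >= dimY) return;

    for (int z = 0; z < dimZ; ++z) {
        // voxel center in world coordinates
        float3 pw = make_float3((x + 0.5f) * voxelSize,
                                (y + 0.5f) * voxelSize,
                                (z + 0.5f) * voxelSize);
        // transform into the camera frame: pc = Rcw * pw + tcw
        float3 pc;
        pc.x = Rcw[0] * pw.x + Rcw[1] * pw.y + Rcw[2] * pw.z + tcw[0];
        pc.y = Rcw[3] * pw.x + Rcw[4] * pw.y + Rcw[5] * pw.z + tcw[1];
        pc.z = Rcw[6] * pw.x + Rcw[7] * pw.y + Rcw[8] * pw.z + tcw[2];
        if (pc.z <= 0.f) continue;

        // project the voxel into the image plane of the camera
        int u = __float2int_rn(pc.x / pc.z * intr.fx + intr.cx);
        int v = __float2int_rn(pc.y / pc.z * intr.fy + intr.cy);
        if (u < 0 || u >= cols || v < 0 || v >= rows) continue;

        float d = depthMm[v * cols + u] * 0.001f;  // measured depth in meters
        if (d <= 0.f) continue;                    // missing measurement

        float sdf = d - pc.z;                      // signed distance along the ray
        if (sdf < -truncDist) continue;            // far behind the surface
        float f = fminf(1.f, sdf / truncDist);     // truncated SDF

        // weighted running average with truncated weights
        int idx = (z * dimY + y) * dimX + x;
        float w = weight[idx];
        tsdf[idx] = (tsdf[idx] * w + f) / (w + 1.f);
        weight[idx] = fminf(w + 1.f, maxWeight);
    }
}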

2.2 KinectFusion with Moving Objects Tracking

Based on the open-source publication of KinFu, released in the Point Cloud Library (PCL; (Rusu and Cousins, 2011)), Korn and Pauli extended in (Korn and Pauli, 2015) the scope of KinFu with the ability to reconstruct the static background and several moving rigid objects simultaneously and in real-time. Independent models are constructed for the background and each object. Each model is stored in its own voxel grid. During the registration process, the best matching model among all models is determined for each pixel of the depth image. This yields a second output of the ICP beside the alignment of all existing models: the correspondence map. This map contains the assignment of each pixel of the depth image to a model, or information why the matching failed. First of all, the correspondence map is needed to construct the equation systems which minimize the alignment errors. Afterwards, it is used to detect new moving objects. It is highly unlikely that the entire shape and extent of an object can be observed during its initial detection. During the processing of further frames, the initial model will be updated and extended. Because of this, the initially allocated voxel grid may turn out to be too small, and in this case the voxel grid will grow dynamically.

2.3 Detection of New Moving Objects

The original detection approach from (Korn and Pauli, 2015) is illustrated in Fig. 1 by the computed correspondence maps. The figure shows chosen frames from a dataset recorded with a handheld Microsoft Kinect for Xbox 360. In the center of the scene, a mid-sized robot with a differential drive and caster is moving. On its aluminum frame, a laptop, a battery and an open box on top are transported. At first, most pixels can be matched with the static background model. Then more and more registration outliers occur (dark and light red). In addition, potential outliers (yellow) were introduced in (Korn and Pauli, 2015). These are vertices with a point-to-plane distance that is small enough (< 3.5 cm) to be treated as an inlier in the alignment process. On the other hand, the distance is noticeably large (> 2 cm) and such matchings cannot be considered as totally certain. Because of this, potential outliers are processed like outliers during the detection phase.

The basic idea of the detection is that small accumulations of outliers can occur for manifold reasons, but large clusters are mainly caused by movement in the scene. The detection in (Korn and Pauli, 2015) is performed window-based for each pixel. The neighborhood with a size of 51 × 51 pixels of each marked outlier is investigated. If 90% of the neighbor pixels are marked as outliers or potential outliers, then the pixel in the center of the window is marked as a new moving object. In the next step, each (potential) outlier in a much smaller 19 × 19 neighborhood of a pixel marked as a moving object is also marked as a moving object.


Figure 1: The first image provides an overview of a substantial part of the evaluation scene. The remaining images show the visualization of the correspondence maps of frames 160 to 173 (row-wise from left to right) of a dataset which contains a moving robot in the center of the scene. Each correspondence map shows the points matched with the background (light green), the points matched with the object (dark green), potential outliers (yellow), ICP outliers (dark red), potential new objects (light red), missing points in the depth image (black) and missing correspondences due to the volume boundary (blue).

Subsequently, the found cluster is initialized as a new object model (dark green in frame 168 in Fig. 1), including the allocation of another voxel grid, if three conditions are met. In particular, a moving object needs to be detected in at least 5 consecutive previous frames (frames 163-167 in Fig. 1). Two further requirements concern environments with several moving objects. However, these are beyond the scope of this paper due to its focus on the detection phase. This detection strategy has three core objectives:

• The initial model size must be extensive enough to provide sufficient 3D structure for the following registration process. In the original approach of Korn and Pauli the minimal size is approximately 19 × 19 pixels.

• The initial model may not include extraneous vertices. This is the major weakness of the implementation in (Korn and Pauli, 2015), since the detection does not depend on the 3D structure (e.g. the original vertex map). Vertices from surfaces far away from the moving object may become part of the cluster of outliers even though they are not connected in 3D. This is likely to happen at the borders of the object, which is why a margin of unused outliers exists (light red is enclosed by dark red in Fig. 1). It should be noted that the algorithm almost always works even if a few wrong outliers are assigned to the object, but the initial voxel grid becomes much too large and the performance (video memory and runtime) decreases.

• The appearance of a new object should be stable. A detection based on just one frame would be risky. If large clusters of outliers are found in 5 sequential frames, a detection is much more reliable. The time delay of 5 · 33 ms = 165 ms is bearable, but the reconstruction process also misassigns the information of 5 frames. This is partially compensated because the detected clusters are likely to be larger in subsequent frames and they should at least contain the vertices of the previous frames (compare frame 163 with 167). The lost frames could also be buffered and reprocessed, but this would not be real-time capable with current GPUs.

In the remaining parts of this paper the improvement of the detection process is investigated. By deciding for each particular vertex whether it belongs to the largest cluster of outliers or not, the first two points of the list above can be enhanced. By incorporating the 3D structure, the inclusion of extraneous vertices is prevented and the full size of the clusters can be determined. Seed-based approaches (e.g. region growing) are unfavorable on a GPU. Furthermore, it would be necessary to find good seed points, which would increase the computational effort. However, several CCL algorithms have been presented which are optimized for the GPU architecture and able to preserve the real-time capability of KinFu.


3 CONNECTED COMPONENT LABELING ALGORITHMS

In order to develop a CCL algorithm that works on correspondence maps and incorporates the 3D structure, three GPU based CCL algorithms are briefly introduced in the next subsections. More details and pseudocode can be found in the referenced papers. Subsequently, the algorithms are compared and our algorithm is derived. To begin with, a binary input for the CCL algorithms is assumed: a correspondence map can be reduced to a binary image in which a pixel either is a (potential) outlier or not. The result of all CCL algorithms is a label image (or label array), in which each pixel is labeled with its component ID. All pixels with the same ID are direct neighbors or are recursively connected in the input data. In the last subsection, depth information is additionally integrated.

3.1 Label Equivalence

Hawick et al. give in (Hawick et al., 2010) an overview of several parallel graph component labeling algorithms with GPGPUs. This subsection summarizes the label equivalence method (Kernel D in (Hawick et al., 2010)), which uses one independent GPU thread for the processing of each pixel. One input array Dd is needed, which assigns a class ID to each pixel. In our case just 1 (outlier) or 0 (anything else) are used, but more complex settings are possible, too. Before the start of an iterative process, a label array Ld is initialized. It stores the label ID for each pixel and all pixels start with their own one-dimensional array index (see left side of Fig. 2). Furthermore, a reference list Rd is allocated, which is a copy of Ld at the beginning. The iterations of this method are subdivided into three GPU kernels:

• Scanning: For each element of Ld the smallest label ID in the Von Neumann neighborhood (4-connected pixels) with similar class ID in Dd is searched (see red arrows in Fig. 2). If the found ID is smaller than the original ID which was stored for the processed element in Ld, the result is written into the reference list Rd at the position which is referenced by the current element in Ld. This means the root node of the current connected component is updated and hooked into the other component, and not just the current node. Since several threads may update the root node concurrently, atomicMin is used by Hawick et al. We waive this atomic operation because it increases the runtime. Without it, the update of a root node is sometimes not optimal and an additional iteration may be necessary, but in general this costs less than the repeated use of an expensive atomic operation.

• Analysis: The references of each element of the reference list Rd are resolved recursively until the root node is reached (see blue arrows in Fig. 2). For example, position 7 gets the assignment 0 in Fig. 2 due to the reference chain 7 → 2 → 1 → 0. Element 0 is a root node because it references itself.

• Labeling: The labeling kernel updates the labeling in Ld by copying the results from the reference list Rd (right side of Fig. 2).

The iterations terminate if no further modification appears during the scanning phase.

Figure 2: Visualization of a single iteration of the label equivalence algorithm. The background color (light gray or dark gray) represents the class of each pixel. Red arrows show the results of the scanning phase. Blue arrows show the results of the analysis phase. The left part of the figure illustrates the initial labels in Ld and the right part the labels after running the labeling kernel at the end of the first iteration. The red arrows on the right side additionally indicate how the root nodes of the connected components are updated during the scanning phase of the second iteration.
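A minimal sketch of the three kernels, assuming a row-major image layout and waiving the atomicMin as discussed above. The kernel signatures and the changed flag are illustrative assumptions, not the reference implementation of Hawick et al.

#include <cuda_runtime.h>

// Scanning: search the smallest neighbor label with the same class in Dd
// and hook the root node of the current component into the smaller one.
__global__ void scanning(const int* D, const int* L, int* R,
                         int cols, int rows, int* changed)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= cols || y >= rows) return;
    int id = y * cols + x;

    int best = L[id];   // smallest label in the Von Neumann neighborhood
    if (x > 0        && D[id - 1]    == D[id]) best = min(best, L[id - 1]);
    if (x < cols - 1 && D[id + 1]    == D[id]) best = min(best, L[id + 1]);
    if (y > 0        && D[id - cols] == D[id]) best = min(best, L[id - cols]);
    if (y < rows - 1 && D[id + cols] == D[id]) best = min(best, L[id + cols]);

    if (best < L[id]) {
        // non-atomic root update: occasionally suboptimal, which only
        // costs an extra iteration instead of an expensive atomicMin
        R[L[id]] = min(R[L[id]], best);
        *changed = 1;
    }
}

// Analysis: resolve each reference chain until the root node is reached.
__global__ void analysis(int* R, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    int ref = R[id];
    while (ref != R[ref]) ref = R[ref];
    R[id] = ref;
}

// Labeling: copy the resolved references back into the label array.
__global__ void labeling(int* L, const int* R, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    L[id] = R[L[id]];
}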

3.2 Improved Label Equivalence

Based on the label equivalence method, Kalentev et al. propose in (Kalentev et al., 2011) a few improvements in terms of runtime and memory requirements. The reference list Rd is replaced by writing the progress of the scanning and analysis phases directly into the label array Ld. This also makes the whole labeling step obsolete. Besides the obvious memory saving, it can reduce the computational effort, too, since some threads may benefit from updates in Ld made by other threads. Additionally, the original input data is padded with a one pixel wide border. This allows removing the if condition in the scanning phase which checks the original image borders. Furthermore, it frees the ID 0 to mark irrelevant background pixels, including the padded area. Kalentev et al. also removed Dd; consequently, only two classes of pixels are considered: unprocessed background pixels and connected foreground pixels. Finally, Kalentev et al. also proposed avoiding the atomicMin operation, as described in the previous subsection.
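The scanning phase of this variant might look like the following sketch: the label array is updated in place, label 0 marks background (including the one pixel padding) and no class array is needed. Again, the exact signatures are assumptions.

// In-place scanning of the improved label equivalence method. The analysis
// phase resolves the chains directly in L in the same way as before, and
// the separate labeling kernel is no longer needed.
__global__ void scanningInPlace(int* L, int cols, int rows, int* changed)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x + 1;  // skip the padding
    int y = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (x >= cols - 1 || y >= rows - 1) return;
    int id = y * cols + x;
    if (L[id] == 0) return;   // irrelevant background pixel

    int best = L[id], n;      // smallest non-zero 4-connected neighbor label
    if ((n = L[id - 1])    != 0) best = min(best, n);
    if ((n = L[id + 1])    != 0) best = min(best, n);
    if ((n = L[id - cols]) != 0) best = min(best, n);
    if ((n = L[id + cols]) != 0) best = min(best, n);

    if (best < L[id]) {
        L[L[id]] = min(L[L[id]], best);  // hook the root into the smaller component
        *changed = 1;
    }
}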


Figure 3: Schematic overview of the hierarchical merging approach (Stava and Benes, 2010): Kernel 1 solves the CCL subproblem in shared memory, Kernel 2 merges the solution on the tile borders per level, Kernel 3 updates the labels on the borders and Kernel 4 finally updates all labels.

3.3 Hierarchical Merging CCL

Stava and Benes published in (Stava and Benes, 2010) a CCL algorithm which is specifically adapted to the GPU architecture. In particular, Kernel 1 (see Fig. 3) applies the improved label equivalence method within the shared memory of the GPU. This memory type can only be used block-wise and its size is very limited, therefore Kernel 1 processes the input data in small tiles. Then Kernel 2 merges the components along the borders of the tiles by updating the root node with the lower ID. Subsequently, the non-root nodes along the borders are updated by the optional Kernel 3, which enhances the performance. All tiles are merged hierarchically with much concurrency. Finally, Kernel 4 flattens each equivalence tree with an approach similar to the analysis phase of the label equivalence method. The main disadvantage lies in the limited formats of the input data. The implementation of Stava and Benes only works with arrays whose size is a power of two in both dimensions. In addition, only quadratic images run with the reference implementation provided with the online resources of (Stava and Benes, 2010). The original publication detects connected components with the Moore neighborhood, but we reduced it to the neighborhood with just 4-connected pixels. This improves the performance and the impact on the quality of the results is negligible.

3.4 Comparison between the CCL Algorithms

All three CCL algorithms provide substantially deterministic (just the IDs of the components can change) and correct results. The runtimes of the algorithms were compared with three categories of test images without depth information, shown in Fig. 4. The first image was generated by binarization of a typical correspondence map with a moving robot slightly left of the image center. In addition, two more extensive categories of test images were evaluated to find an upper limit of the runtime effort.

Figure 4: Test images for the comparison of the runtime of the three CCL algorithms: correspondence map, random circles and spiral.

Since KinFu runs in real-time, not only the average runtime with typical outlier distributions is interesting, but also the worst-case lag (response time), because this could let the ICP fail. Images with random circles represent situations with many outlier pixels which are very unlikely but possible, for example after fast camera movements. The spiral is deemed to be a worst-case scenario for CCL algorithms: the connectivity of the spiral must be traced with very high effort over the entire image and in all directions (left, right, up and down).

The runtimes were measured with an Intel Core i7-2600 @ 3.40 GHz, DDR3 SDRAM @ 1600 MHz and an Nvidia GeForce GTX 970. The time for memory allocation and transfers of the input and output data between the host and the device was not measured. These operations depend in general on the application, and in KinFu MOT the input and output data only exist in the device memory. The evaluation was performed for several resolutions so that it remains usable for higher depth camera resolutions in the future or for other applications. Due to the limitations of the hierarchical merging CCL algorithm, only square images were used. The correspondence map test image was scaled with nearest neighbor interpolation.

Fig. 5(a) shows the runtimes with the binary correspondence map. The algorithm of Hawick et al. performs worst and the hierarchical merging CCL overall best. It is noticeable that the runtime of the hierarchical merging CCL can decrease with increasing image size (compare 1664² with 1792² pixels). Fig. 5(b) highlights that all algorithms suffer from highly occupied input images. In particular, the hierarchical merging CCL loses its lead and is partially outperformed by improved label equivalence.


Figure 5: Runtimes of the CCL algorithms (label equivalence, improved label equivalence and hierarchical merging CCL) with different images and resolutions: (a) correspondence map, (b) random circles, (c) spiral. The runtimes are logarithmically scaled and the averages of 100 runs are shown.

Stava and Benes already noted in their paper that their algorithm performs best when the distribution of connected components in the analyzed data is sparse. In Fig. 5(c), improved label equivalence shows the overall best result. The algorithm of Stava and Benes has major trouble resolving the high number of merges.

As expected, the basic label equivalence algorithm is outperformed by the other two. But the runtime differences need to be put in context. Hawick et al. use an array Dd to distinguish between classes or, e.g., distances. It makes this algorithm more powerful than the other two and it is very near to our aim of using depth information during the labeling. The effort for using Dd should not be underestimated: the additional computations are low-cost, but the device memory bandwidth is limited.

To determine the best basis for our KinFu MOT extension, we need to take a closer look at the possible input data sizes. The size of the correspondence maps in the intended application is the same as the size of the depth images. The depth image resolution of the Kinect for Xbox 360 is 640 × 480 pixels. Because of the structured light technique, this nominal resolution is far above the effective resolution of the first Kinect generation. As a further example, the Kinect for Xbox One has a depth image resolution of 512 × 424 pixels, which can be assumed to be effective due to the time-of-flight technique.

The input data formats of the reference implementation of Stava and Benes are very restricted and it would be necessary to pad the correspondence maps in KinFu MOT. Therefore, runtimes with sparse VGA resolution inputs are slightly better with improved label equivalence than with hierarchical merging CCL. Additionally, the runtimes of the hierarchical approach may rise more with dense input data, and the complex source code is hard to maintain or modify. Altogether, the improved label equivalence algorithm is a good basis for the detection of moving objects in KinFu MOT.

3.5 Migration into KinFu MOT

Our enhancement of KinFu MOT is primarily based on the improved label equivalence method from subsection 3.2. The input label array is initialized on the basis of the latest correspondence map. Pixels marked as outlier or potential outlier get their respective array index and all others get 0 (background) as label ID. In addition, the scanning kernel relies on the current depth image (bound into texture memory) during the neighborhood search. Two neighbor pixels are only connected if the difference of their depth values is


smaller than 1 cm. This threshold seems very low, but to establish a connection one path out of several possible paths is sufficient. If one element of the Von Neumann neighborhood does not pass the threshold due to noise, then probably one of the remaining three will pass. Furthermore, the padding with a one pixel wide border is removed because the additional effort is too high for small images: in our case it would be necessary to pad the input label array and the depth image and to remove the padding again from the output label array.
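The initialization and the depth-aware connectivity test might look like the following sketch. The correspondence map class constants, the texture reference name and the millimeter depth units are assumptions for illustration.

#include <cuda_runtime.h>

enum { OUTLIER = 1, POTENTIAL_OUTLIER = 2 };            // illustrative classes

texture<unsigned short, cudaTextureType2D> depthTex;    // depth image in mm

// Label initialization: (potential) outliers start with their own array
// index, all other pixels get 0 and are skipped by the scanning kernel.
__global__ void initLabels(const int* corresp, int* L, int cols, int rows)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= cols || y >= rows) return;
    int id = y * cols + x;
    bool outlier = (corresp[id] == OUTLIER || corresp[id] == POTENTIAL_OUTLIER);
    L[id] = outlier ? id : 0;
}

// Depth-aware connectivity test used during the neighborhood search:
// two neighbors are only connected if their depth difference is below 1 cm.
__device__ bool connected(int x0, int y0, int x1, int y1)
{
    unsigned short d0 = tex2D(depthTex, x0 + 0.5f, y0 + 0.5f);
    unsigned short d1 = tex2D(depthTex, x1 + 0.5f, y1 + 0.5f);
    if (d0 == 0 || d1 == 0) return false;               // missing measurement
    return abs((int)d0 - (int)d1) < 10;                 // 10 mm threshold
}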

Only the largest component is considered for an object initialization. To find it, a summation array of the same size as the label array is initialized with zeros. Each element of the output label array is processed by its own thread in a summation kernel: for IDs ≠ 0, the summation array at index = ID is incremented with atomicAdd. The maximum is found with parallel reduction techniques. In contrast to conventional implementations, a second array is needed to carry the ID of the related maximum value.
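A sketch of the summation kernel and one step of the argmax reduction, assuming the ids array is initialized with ids[i] = i and the sizes array with zeros:

// Count the pixels of each component: one thread per pixel.
__global__ void componentSizes(const int* L, int* sizes, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    if (L[id] != 0) atomicAdd(&sizes[L[id]], 1);
}

// One parallel reduction step; beside the maximum size, the second array
// carries the ID of the related maximum value. The host repeats this with
// m halved each time: for (int m = n; m > 1; m = (m + 1) / 2) { ... }
__global__ void argmaxStep(int* sizes, int* ids, int m)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int half = (m + 1) / 2;
    if (id >= half || id + half >= m) return;
    if (sizes[id + half] > sizes[id]) {
        sizes[id] = sizes[id + half];
        ids[id]   = ids[id + half];
    }
}

After the reduction, sizes[0] holds the size of the largest component and ids[0] its label ID.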

The average runtime with VGA resolution inputs with the described implementation (without the modifications which are presented below) is 43% higher than the runtime of the improved label equivalence method from subsection 3.2. This is caused by the additional effort for incorporating depth information and particularly for the determination of the largest component.

3.5.1 Dynamic Parallelism

One bottleneck of the label equivalence approach is the termination condition. To check if a further iteration should be performed, the host needs a synchronous memory copy to determine if modifications appeared during the last scanning phase. With CUDA compute capability 3.5, Dynamic Parallelism was introduced, which allows parent kernels to launch nested child kernels. The obvious approach is to replace the host control loop by a parent kernel with just one thread. The synchronization between host and device can be avoided, but unfortunately cudaDeviceSynchronize must be called in the parent kernel to get the changes from the child kernels. Altogether, this idea turned out to be slower than the current approach. Over all image sizes, the runtime of the bare CCL algorithm rises by approx. 10% (measured without component size computation and outside of KinFu MOT). Due to the use of streams in KinFu MOT, cudaDeviceSynchronize can have an even worse impact. Dynamic Parallelism was removed from the final implementation.
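For reference, the tested and later discarded variant might look like this sketch (requiring compute capability 3.5 and compilation with -rdc=true; the kernel names follow the sketches above):

// Single-thread parent kernel replacing the host control loop.
__global__ void cclControl(const int* D, int* L, int* R,
                           int cols, int rows, int* changed)
{
    dim3 block(32, 8);
    dim3 grid((cols + 31) / 32, (rows + 7) / 8);
    int n = cols * rows;
    int threads = 256, blocks = (n + 255) / 256;

    do {
        *changed = 0;
        scanning<<<grid, block>>>(D, L, R, cols, rows, changed);
        analysis<<<blocks, threads>>>(R, n);
        labeling<<<blocks, threads>>>(L, R, n);
        // The parent must wait for the children before it can read the
        // flag, which costs the time saved on host synchronization.
        cudaDeviceSynchronize();
    } while (*changed);
}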

3.5.2 Effort for Modification Checking

The effort for modification checking within the scanning phase can be reduced if further CCL iterations are executed without verification of the termination condition. Fig. 2 shows that, even in the case of simple input data, two iterations are necessary. During our evaluation on real depth images, mostly three CCL iterations were needed and never fewer than two. Accordingly, the second iteration can be launched directly after the first iteration without checking the condition. This reduces the runtime with the correspondence map by approx. 0.01 ms (4.2% with VGA resolution). If the second condition check is also dropped, the runtime is reduced by approx. 7.2%.
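A host-side sketch of this optimization; the skipFirstChecks parameter (1 or 2 in the measurements above) is an assumption for illustration:

// Control loop that skips the termination checks of the first iterations.
void runCcl(const int* D, int* L, int* R, int cols, int rows,
            int* dChanged, int skipFirstChecks)
{
    dim3 block(32, 8), grid((cols + 31) / 32, (rows + 7) / 8);
    int n = cols * rows, threads = 256, blocks = (n + 255) / 256;
    int hChanged = 1;

    for (int iter = 0; hChanged; ++iter) {
        cudaMemset(dChanged, 0, sizeof(int));
        scanning<<<grid, block>>>(D, L, R, cols, rows, dChanged);
        analysis<<<blocks, threads>>>(R, n);
        labeling<<<blocks, threads>>>(L, R, n);
        if (iter >= skipFirstChecks)       // at least two iterations always run
            cudaMemcpy(&hChanged, dChanged, sizeof(int),
                       cudaMemcpyDeviceToHost);   // synchronous check
    }
}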

3.5.3 Plausibility Criteria

To prevent bad object initializations, three criteria are verified in addition to the preconditions mentioned in subsection 2.3.

First, the size of the largest found component must exceed a threshold. Only clusters of outliers with sufficient elements are supposed to be caused by a moving object. Additionally, dropping false negatives is not a drawback in this case, since the following point cloud registration can fail without sufficient 3D structure. Checking the first criterion is very convenient because the size is previously calculated to find the largest component. If it is fulfilled, then additional effort has to be spent on the other two criteria.

It would be valuable to calculate the oriented bounding box (OBB) of the component, demand a minimum length of the minor axis and use a threshold for the filling degree. But the computation of an OBB takes too much runtime. This idea is replaced by two other characteristics which are admittedly slightly less rotationally invariant but faster.

For each row i of the label array, an additional CUDA kernel determines the first column index xfirst_i and the last column index xlast_i which belong to the found component. Moreover, rowsSet counts the number of rows which contain at least one element of the component, and xfirst_i = xlast_i = 0 for empty rows. A further kernel does the same for all columns of the label array, which yields yfirst_j, ylast_j and colsSet. The second criterion is met if the relation between the number of elements of the component, componentPixels, and the approximated occupied area

\frac{2 \cdot componentPixels}{\sum_{i=0}^{rows-1} (xlast_i - xfirst_i) + \sum_{j=0}^{cols-1} (ylast_j - yfirst_j)}    (1)

exceeds a threshold of 75%, where rows and cols specify the dimensions of the whole label array. This ensures that the filling degree is not too low, and using horizontal and vertical scans improves the rotational invariance. The third criterion is fulfilled if the average expansions of the component in the horizontal and vertical directions

\frac{\sum_{i=0}^{rows-1} (xlast_i - xfirst_i)}{rowsSet} \quad \text{and} \quad \frac{\sum_{j=0}^{cols-1} (ylast_j - yfirst_j)}{colsSet}    (2)

are at least 15 pixels. This is useful to avoid thin but long shapes, which appear very often at the border between objects. Fig. 6 illustrates how both criteria take effect on two different shapes. Particularly, example (b) can be observed often due to misalignment of tables (shown in Fig. 6(b)) and doorframes.

Figure 6: Examples of shapes which are rejected by the second criterion (a) and the third criterion (b). The green lines indicate xlast_i - xfirst_i and ylast_j - yfirst_j for some rows and columns.
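A sketch of the per-row extent kernel and the host-side evaluation of both criteria; the one-thread-per-row parallelization and all names are illustrative assumptions:

// Determine xfirst/xlast per row for the largest component (label 'target')
// and count the non-empty rows; an analogous kernel handles the columns.
__global__ void rowExtents(const int* L, int target, int cols, int rows,
                           int* xfirst, int* xlast, int* rowsSet)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows) return;
    int first = 0, last = 0;
    bool any = false;
    for (int x = 0; x < cols; ++x) {
        if (L[i * cols + x] == target) {
            if (!any) { first = x; any = true; }
            last = x;
        }
    }
    xfirst[i] = first;                 // xfirst = xlast = 0 for empty rows
    xlast[i]  = last;
    if (any) atomicAdd(rowsSet, 1);
}

// Host side, after reducing the spans: sumRowSpans is the sum over all
// xlast_i - xfirst_i, sumColSpans the sum over all ylast_j - yfirst_j.
bool checkCriteria(int componentPixels, int sumRowSpans, int sumColSpans,
                   int rowsSet, int colsSet)
{
    float fill = 2.f * componentPixels / (sumRowSpans + sumColSpans);
    bool crit2 = fill > 0.75f;                         // Equation (1)
    bool crit3 = (float)sumRowSpans / rowsSet >= 15.f  // Equation (2)
              && (float)sumColSpans / colsSet >= 15.f;
    return crit2 && crit3;
}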

4 EVALUATION AND RESULTS

In this section, the CCL based object detection is evaluated with regard to the detection results and runtimes compared to the original procedure described in subsection 2.3. The evaluation is mainly based on six datasets of the scene which was introduced in subsection 2.3. The datasets differ in the camera movement and the path the robot took (reflected by the dataset names). Finally, a qualitative evaluation gives an impression of the enhanced potential of KinFu MOT.

4.1 Detection Performance

The detection results are compared pixelwise based on the correspondence maps, depending on the threshold for the minimal size of the found connected component (first criterion in subsection 3.5). The thresholds of the original approach remain unchanged. Fig. 7 shows for several frames of the dataset robot frontal to camera the relative enhancement of the new CCL based algorithm

enh = \begin{cases} \frac{nCCL + nCommon}{nOrig + nCommon} - 1, & \text{if } nCCL \geq nOrig \\ -\frac{nOrig + nCommon}{nCCL + nCommon} + 1, & \text{otherwise} \end{cases}    (3)

where nCommon is the count of pixels which are marked as a found moving object by both the new and the original approach, nCCL counts the pixels only assigned by the new process and nOrig is the count detected exclusively by the original approach. The scale of enh is clipped at -4 and 4, and division by 0 in Equation 3 is meaningfully resolved. After the object initialization, the object model is continually extended by the reconstruction process and the counted pixels depend only indirectly on the detection. The shown heatmap is representative for all six datasets. The original KinFu MOT algorithm initializes the robot model with 11.2% of the ground truth object pixels at frame 138. With a size threshold ≤ 5000 pixels, the proposed approach initializes the robot between frame 124 (31.6% of the object pixels) and frame 136 (dark red area). Even after frame 138, the object size detected by the original approach needs a few frames to reach the results of the new approach (gradient from dark red via red, orange and yellow to green). For larger size thresholds the detection is delayed (dark blue), but the original algorithm is caught up immediately (instantly from dark blue to green) or even outperformed (dark blue to orange, yellow and green).
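A host-side sketch of Equation 3 with the clipping; how the division by zero is resolved is not specified in the text, so a guard is omitted here:

#include <math.h>

// Relative enhancement of the CCL based detection over the original one.
float enhancement(int nCCL, int nOrig, int nCommon)
{
    float e;
    if (nCCL >= nOrig)
        e = (float)(nCCL + nCommon) / (nOrig + nCommon) - 1.f;
    else
        e = -(float)(nOrig + nCommon) / (nCCL + nCommon) + 1.f;
    return fminf(4.f, fmaxf(-4.f, e));   // clip the scale at -4 and 4
}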

In our experiments, the size threshold is the most important parameter. If it is too small, tiny components can be detected as moving objects without containing sufficient 3D structure for the registration, or they are even false positives. A high threshold unnecessarily delays the detection. Altogether, the new approach is never worse than the original algorithm, which it mostly outperforms. The heatmaps of the other datasets support the described observations. The specific characteristics of the stepped shape of the heatmaps differ, but the conclusions are similar. Over all datasets, 3500 pixels is a suitable size threshold.

4.2 Runtime Performance

Table 1 shows that, in comparison with the original approach in KinFu MOT, the runtime is reduced by the new approach despite the higher complexity of the algorithm.


Figure 7: The relative enhancement of the new CCL based algorithm compared to the original approach in dependency of the size threshold and the frame number.

Table 1: Total runtimes with SD for the object detection of the proposed process described in subsection 3.5 and the original approach in KinFu MOT described in (Korn and Pauli, 2015).

Dataset                   avg. runtime per frame [ms]
                          CCL based     original approach
parallel to image plane   0.85 ± 0.22   1.69 ± 0.38
robot towards camera      0.71 ± 0.32   1.15 ± 0.21
robot frontal to camera   0.63 ± 0.17   1.72 ± 0.27
robot away from camera    0.56 ± 0.12   1.10 ± 0.23
complete robot pass       0.55 ± 0.10   1.10 ± 0.28
robot rotation            0.53 ± 0.09   1.24 ± 0.32

This can be traced back to the extensive global memory access of the original approach. Overall, the standard deviation (SD) is slightly reduced, too. While the SD of the original approach is dominated by the amount of registration outliers, this influence is less important for the proposed process. Mostly, the SD of the CCL based approach depends on the number of large connected components which pass the first criterion and trigger the computation of the other two criteria. This observation is particularly supported by the first dataset in Fig. 8. Overall, in the datasets with more detection effort it is very difficult to register the moving robot with the model, because mainly parallel planes are visible first (especially in the upper three datasets) and the alignment is unstable due to the point-to-plane metric of KinFu. Large outlier regions occur frequently and criteria 2 and 3 have to be computed. In summary, Table 1 shows that the response time is roughly halved. But this reflects the computational effort insufficiently due to the many synchronizations between host and device in the proposed approach. The underlying data of Fig. 8 indicates that the total GPU computation time for all kernels is approximately quartered in comparison with the original approach in KinFu MOT described in (Korn and Pauli, 2015). The GPU time for the first 600 frames was reduced from 869 ms to 272 ms for parallel to image plane and from 980 ms to 174 ms for robot frontal to camera.

4.3 Qualitative Evaluation

The newer Kinect for Xbox One provides very challenging data. The dataset in Fig. 9 was created with a system for optical inspection, which can be seen as a three-dimensional cartesian robot. First, it was required to extend KinFu with a model for radial camera distortion. In addition, the higher spatial depth image resolution yields many details, but the data contains excessive noise and reflections. Due to the complex shapes of the outlier clusters, the original detection approach mostly fails. The new CCL approach successfully leads to one background model and three additional object models (the three moved axes of the robot). The impact of sensory distortions is reduced because the component connectivity is also determined based on the depth data. Besides, the detection does not demand square regions of outliers with a 90% filling degree, which allows early object initializations.

5 CONCLUSIONS

It was ascertained that improved label equivalence, proposed by Kalentev et al., behaves well on the GPU with small images like common depth images, whereas the hierarchical merging CCL algorithm does not bring any benefits for small images and its performance even declines if the data is not sparse. A Dynamic Parallelism approach leads to an increased runtime of label equivalence algorithms. However, the response time can be reduced by skipping the first two verifications of the termination condition.

The migration into KinFu MOT was extended by a search for the largest component and three plausibility criteria.


Figure 8: Comparison of the runtimes for six different datasets. Shown is the total runtime over the first 600 frames for selected CUDA kernels (just GPU time). The first three kernels (Init, Scanning, Analysis) are mentioned in subsection 3.1. Criterion 1 shows the effort for counting the size of each component and finding the maximum. Criteria 2 & 3 involves all kernels which are needed to compute both criteria.

Figure 9: Reconstruction of three moving elements and the background with the Kinect for Xbox One and the CCL based detection (shown: RGB overview, correspondence map, rendered first object and rendered background model). The four different green tones in the correspondence map show four independent object models.

The reduction of the response time and the computational effort was shown, and the improvement of the detection performance is conclusive. Additionally, the depth data is considered and false connections are inhibited, which is a necessary extension for a reliable detection, especially in complex datasets.

For future work, the plausibility criteria should be refined. Furthermore, the strategy for the stability objective mentioned in subsection 2.3 can be extended: the large clusters of outliers which need to be found in 5 sequential frames should match in position and perhaps even in shape. This is expensive because the camera can move and the problem needs to be solved in 3D.

REFERENCES

Chen, Y. and Medioni, G. (1991). Object modeling by registration of multiple range images. In Robotics and Automation, 1991. Proceedings., 1991 IEEE International Conference on, pages 2724-2729 vol. 3.

Curless, B. and Levoy, M. (1996). A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '96, pages 303-312, New York, NY, USA. ACM.

Hawick, K. A., Leist, A., and Playne, D. P. (2010). Parallel Graph Component Labelling with GPUs and CUDA. Parallel Computing, 36(12):655-678.

Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., et al. (2011). KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pages 559-568. ACM.

Kalentev, O., Rai, A., Kemnitz, S., and Schneider, R. (2011). Connected component labeling on a 2D grid using CUDA. Journal of Parallel and Distributed Computing, 71(4):615-620.

Korn, M. and Pauli, J. (2015). KinFu MOT: KinectFusion with Moving Objects Tracking. In Proceedings of the 10th International Conference on Computer Vision Theory and Applications (VISAPP 2015).

Newcombe, R., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A. J., Kohi, P., Shotton, J., Hodges, S., and Fitzgibbon, A. (2011). KinectFusion: Real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, pages 127-136. IEEE.

Rusinkiewicz, S. and Levoy, M. (2001). Efficient Variants of the ICP Algorithm. In Proc. of the 3rd International Conference on 3-D Digital Imaging and Modeling, pages 145-152.

Rusu, R. B. and Cousins, S. (2011). 3D is here: Point Cloud Library (PCL). In International Conference on Robotics and Automation, Shanghai, China.

Stava, O. and Benes, B. (2010). Connected Component Labeling in CUDA. In Hwu, W. W. (Ed.), GPU Computing Gems.
