
Bidirectional Information Exchange


Abstract— In an autonomous driving environment, it is essential to collect and analyze the environmental information surrounding the vehicle. To do this, real-time information such as the location, orientation, and size of objects must be analyzed. In particular, handling occlusion and truncation helps ensure safe driving by providing exact information about the vehicle's surroundings. In this paper, we propose a new method that generates object proposals around a vehicle from projected LiDAR information and performs object detection more accurately by exploiting a 2D image-based classifier. The method generates proposals by filtering the LiDAR information into 2D edges with a simple but strong affinity measure. Objects in each proposal are then classified by R-FCN, and the resulting class labels are mapped back to the edge points used to build the proposal. Each class label is combined with the orientation of its edge in 3D space to complete the 3D box of the object. This compensates for the difficulty a CCD-based CNN classifier has in obtaining spatial information, and addresses problems such as occlusion. We compared our method with state-of-the-art results on the KITTI dataset and obtained good results in terms of both speed and accuracy (especially IoU with the ground truth).

Keywords—LiDAR; CCD; Proposal generation; Object Detection; R-FCN

I. INTRODUCTION

In recent years, CNNs have made many contributions to fields such as object classification and object detection. In particular, as autonomous driving has become a prominent issue, deep-network research on object recognition and behavior analysis in the driving environment is being actively carried out. Recent deep-network-based object detection research falls into two categories: two-stage and one-stage approaches. In the two-stage type, the proposal generator and the classifier are implemented as independent components, whereas in the one-stage type they are performed together. Among two-stage methods, R-CNN [1] shows good performance by generating region proposals and scoring the features extracted from each proposed region. This significantly reduces the computational cost compared to the sliding-window method, and thus has an advantage in speed. Faster R-CNN [2] devised the Region Proposal Network (RPN), which generates proposals through feature sharing, exploiting the fact that the convolution layers themselves contain enough information to represent objects. HyperNet [3] compensates for the fact that the top-layer features used by Faster R-CNN are insufficient for small objects: it pools features from several layers into a hyper feature set and uses it to match objects of various sizes. There is, however, a dilemma between the translation invariance desired for classification and the translation variance desired for detection. To resolve this dilemma, R-FCN [4] divides each ROI into a k x k grid and locates objects through local voting over position-sensitive scores.

In the one-stage type, bounding-box prediction and class-probability computation are performed together, as in YOLO [5]. YOLO generates a class probability map and a bounding-box confidence map over a grid without generating separate proposals, and classifies each bounding box immediately, greatly simplifying the overall computation. Fast YOLO [5] simplifies the network slightly and reaches a computation speed of 155 fps. In a similar approach, the Single Shot Detector (SSD) [6] evaluates boxes and confidences over multi-scale features in the convolution layers, like HyperNet.

However, for autonomous driving, more is needed than detecting and classifying objects well. The most important additional information concerns occlusion and truncation among objects: these two factors are key to understanding the actual position and arrangement of nearby objects in the driving environment. Existing methods of both types can classify occluded and truncated objects, but they are weak at handling the obscured part of an object. In both types, a bounding box is created from the features of the visible part of an occluded object, and the object is classified from that visible part. So, although the class of the object can be found, the actual extent of the object in the hidden area cannot be estimated.

Another problem is obtaining spatial information such as the 3D location and orientation of a detected object and its actual size. It is very difficult to estimate accurate 3D spatial information from a single RGB image, which is needed to grasp the traffic flow around the vehicle for autonomous driving. Using 3D depth information is therefore the most intuitive approach. 3DOP [7] generates a depth map from a stereo image pair and projects each pixel of the RGB image into 3D space; it defines the relationship between pixels as an MRF energy function over several properties and classifies objects with a linear SVM. However, the sophisticated stereo-based depth-map generation used in 3DOP incurs a large computational cost. Depth can also be obtained without stereo, for example from laser sensors such as LiDAR, which collect 3D spatial information very quickly. Among such studies, Vote3D [8] learns directly in 3D space from LiDAR data through 3D voxels. 3DVP organizes the LiDAR point cloud into 3D voxels and applies 2D alignment with a 3D CAD model. There are also studies in which LiDAR points are projected onto 2D to generate 2D depth maps. However, LiDAR data is inherently sparse, and it is difficult to learn stably from it because it gives non-uniform results depending on the surface of the environment or the object.

Based on the RPN-plus-classifier structure of Faster R-CNN, SubCNN [9] adds subcategory information to both stages and corrects each region according to the size and orientation information of the subclass. In the 'Car' category in particular, 3DVP [10] was applied to identify the 3D location and orientation of the vehicle and to perform 2D and 3D segmentation. However, this approach applies only to rigid-body models, and many additional samples are required for each subcategory class. In addition, manual sorting is required for the alignment of all of them, and the sparse LiDAR point cloud makes it difficult to construct sufficient voxels for distant objects. The computation speed is also not fast enough for autonomous driving environments, where real-time processing is essential. Therefore, this study proposes a fast and accurate 2D and 3D object detection and classification method suitable for the autonomous driving environment by combining the CCD image with LiDAR sensor information. The proposed method uses the LiDAR points as 2D data, collapsing all points along the z-axis onto one plane instead of building 3D voxels, which guarantees a much higher point density than the conventional representation. We filter these points and group the resulting edges by affinity to generate 2D proposals. Each generated proposal is used as an ROI of the R-FCN network, and the bounding box of the object in 3D space is created by mapping the classification result back to the constituent edge points. Finally, projecting this box onto 2D again allows the box to be expanded into the hidden region that does not appear in the 2D image.

The main contributions of this study are as follows:

- The method replaces the existing proposal generator and improves speed by combining simplified LiDAR sensor information with a powerful CCD-based CNN architecture.

- 3D spatial information about the objects around the vehicle is acquired at the same time, using only the information generated while making the proposals, without any additional process.

- The bounding box is expanded into the occluded or truncated region, bringing the box closer to the actual size of the object.

Figure 1. System overview. The upper network scores each region through voting over a k x k grid with R-FCN. In this study, the ROI of this network is generated by filtering the LiDAR point cloud. Through this ROI, the object is classified by voting, and the classified class label is mapped back onto the points to expand it in 3D space. This creates the final bounding box for each object.

To this end, Section II-A shows how to perform a mutual projection between 2D and 3D points using a simple method. Section II-B uses this to detect objects with the CCD-based state-of-the-art classifier, the R-FCN network. Section II-C expands each box based on the detected class label and the orientation information obtained in 3D space in II-A, to determine the optimal bounding box for the actual object. Section III quantitatively verifies the average precision (AP) on the object and tracking sequences of the KITTI dataset, compares it with previous studies, and illustrates the expansion results for obscured areas.

II. BIDIRECTIONAL INFORMATION EXCHANGE

A. Proposal Generation with LiDAR

We propose 'bidirectional projection filtering', which exchanges 2D and 3D information through the calibration matrix between the CCD image and the 3D point cloud, as shown in Figure 2. Through this method, it is possible to generate object proposals, estimate obscured areas, and obtain the actual size and 3D spatial information of each object detected in the CCD image. In this section, we describe how to remove point information that is not needed for proposal generation and how to exchange information between 2D and 3D space through projection. As shown in Figure 2-(b), we remove the ground to isolate object points and eliminate points outside the CCD angle of view. Next, we project along the z-axis onto one plane to obtain a dense 2D object shape. Then erroneous points are filtered out, the remaining edges are grouped based on the affinity between neighboring edges, and a proposal is generated for each resulting edge group.

Figure 2. Point filtering through 2D projection of LiDAR. (a) Original image, (b) LiDAR point cloud with the ground removed, (c) removal of points outside the angle of view and dimension reduction to increase density, (d) 2D projection and filtering of points, (e) 3D point cloud before filtering, and (f) 3D point cloud after filtering.

Ground plane removal – In the LiDAR environment, the simplest way to remove the ground is to treat the height of the vehicle carrying the sensor as the ground plane and remove all LiDAR points below that height. However, the road is not infinitely flat and obstacle-free, so in this paper we propose a surface detection method that accounts for the curvature of the road surface. Although there are many possible methods, this study uses the fact that objects in the driving environment have a volume with a certain height. Using a voxel grid of 20 x 20 x 10 cm, the height extent of the points constituting each cell is computed for every XY cell in which points exist.

H_{PS_{ij}} = \max_{p \in PS_{ij}} p_z - \min_{p \in PS_{ij}} p_z        (1)

G(i,j) = \{ C_{ij} \mid H_{PS_{ij}} = 0 \}        (2)

When the coordinates of the XY plane are i and j, each grid cell is denoted C_ij, and the point set belonging to C_ij is PS_ij. In equation (1), H_PSij is the height extent of each cell. The ground G(i,j) in equation (2) is the set of grid cells with zero height extent. Figure 2-(b) shows the point cloud with the ground area removed in this way.
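As an illustration, the cell-height test of equations (1) and (2) can be sketched in NumPy as below. This is a minimal sketch, not the authors' code; the function name, the near-zero height tolerance, and the array layout are assumptions.

```python
import numpy as np

def remove_ground(points, cell=0.2, height_tol=0.05):
    """Drop points lying in grid cells whose height extent is (near) zero.

    points: (N, 3) LiDAR points (x, y, z) in meters.
    cell: XY grid resolution, matching the 20 x 20 cm cells in the text.
    height_tol: tolerance standing in for the exact zero-height test (assumed).
    """
    ij = np.floor(points[:, :2] / cell).astype(np.int64)   # cell index per point
    cells, inv = np.unique(ij, axis=0, return_inverse=True)
    # Equation (1): height extent H_PSij = max z - min z per occupied cell.
    zmax = np.full(len(cells), -np.inf)
    zmin = np.full(len(cells), np.inf)
    np.maximum.at(zmax, inv, points[:, 2])
    np.minimum.at(zmin, inv, points[:, 2])
    # Equation (2): ground cells are those with (near-)zero height extent.
    ground_cell = (zmax - zmin) < height_tol
    return points[~ground_cell[inv]]
```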

Outside point elimination – The point set with the ground removed must be simplified further. First, the LiDAR points are multiplied by the calibration matrix, all points outside the x-axis range of the CCD image are removed, and only the remaining points are processed. The first reason for performing the CCD projection is that LiDAR points outside the angle of view can easily be removed. Second, since the horizontal-axis search domain changes from radians to pixels through the projection, filtering along the horizontal axis can be performed in a very simple manner. Figure 2-(c) shows that the range of points to be processed is greatly reduced.
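The projection and angle-of-view filtering can be sketched as follows; here P stands for a 3x4 camera projection matrix (for KITTI this would be assembled from the published calibration files). The helper name and return convention are ours, not from the paper.

```python
import numpy as np

def project_to_image(points, P, img_w, img_h):
    """Project 3D points into the CCD image plane and keep in-view points.

    P: 3x4 calibration (projection) matrix mapping LiDAR to image coords.
    Returns pixel coords (M, 2), depths (M,), and indices into `points`,
    so 2D filtering decisions can be mapped back to 3D (Section II-A).
    """
    homo = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    cam = homo @ P.T                                       # (N, 3) image-plane coords
    front = cam[:, 2] > 0                                  # drop points behind the camera
    uv = cam[front, :2] / cam[front, 2:3]                  # perspective divide
    idx = np.flatnonzero(front)
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < img_w) &
              (uv[:, 1] >= 0) & (uv[:, 1] < img_h))        # inside the CCD angle of view
    return uv[inside], cam[idx[inside], 2], idx[inside]
```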

Z-axis elimination – After the ground is removed, the sparse LiDAR points are arranged in a two-dimensional plane by projecting all points along the z-axis, both to increase density and to speed up computation. Since the objects of interest reach a height of 50 cm or more, PS^z_ij = -0.9 is set for every point set PS_ij, placing the plane 50 cm above the ground given the height at which the vehicle's LiDAR sensor is installed. However, as shown in Figure 2-(b), since the 3D points of the original space are pressed into one plane, the point density increases but some regions look like noise because of depth differences.
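The flattening step itself is a one-line assignment per point; a minimal sketch, assuming a sensor frame in which the desired plane sits at z = -0.9:

```python
def flatten_z(points, z_plane=-0.9):
    """Collapse all points onto one horizontal plane to densify the cloud.

    z_plane = -0.9 follows PS^z_ij = -0.9 in the text, i.e. a plane 50 cm
    above the ground for the assumed sensor mounting height.
    """
    flat = points.copy()
    flat[:, 2] = z_plane
    return flat
```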

Edge filtering in CCD – Using 2D LiDAR point clouds via CCD projection has several advantages. In 3D space, LiDAR spreads radially from the origin and acquires distance information, so extracting the nearest outline of an object from the shooting point requires computing distances over angles, which are continuous values. If we instead project onto the CCD, the discrete pixel grid becomes the reference domain, as shown in Figure 2-(d). As shown in Figure 3, duplicate points can easily be removed according to height or depth within each pixel column. In addition, since erroneous pixels can easily be identified through a gradient operation proportional to the distance between neighboring pixels, the edge closest to the capturing vehicle can be generated in each column of the x-axis.

h_{PS_i} = \min_{p \in Cl_i} h(p)        (3)

In equation (3), Cl_i denotes each x-axis column containing at least one projected LiDAR point, and h_PSi denotes the minimum height within Cl_i. h_PSi is computed for all i, and noise is removed with a one-dimensional median filter [11]. At this time, not only the height of the point selected by the median filter but also the index of the corresponding point is kept, so that the noise point can be removed in 3D space as well. Figure 2-(f) shows the three-dimensional point cloud cleaned up in this way.
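A sketch of this column-wise edge extraction follows, reusing project_to_image from the earlier sketch. Here we keep the closest point per pixel column and reject columns that disagree with a 1D median filter; the kernel size and pixel tolerance are assumptions, as the paper only cites the filter [11].

```python
import numpy as np
from scipy.signal import medfilt

def column_edges(uv, depth, idx, img_w, kernel=5, tol_px=3.0):
    """Keep the nearest projected point per image column, then reject noise.

    uv, depth, idx: output of project_to_image() on the flattened cloud.
    Returns indices (into the original cloud) of the retained edge points,
    so the same points can also be cleaned up in 3D space.
    """
    cols = uv[:, 0].astype(int)
    nearest = np.full(img_w, -1)          # index of the closest point per column
    best = np.full(img_w, np.inf)         # its depth
    for k in range(len(cols)):
        if depth[k] < best[cols[k]]:
            best[cols[k]], nearest[cols[k]] = depth[k], k
    valid = np.flatnonzero(nearest >= 0)  # columns Cl_i with at least one point
    heights = uv[nearest[valid], 1].astype(float)
    smooth = medfilt(heights, kernel)     # 1D median filter over column heights [11]
    keep = np.abs(heights - smooth) < tol_px
    return idx[nearest[valid][keep]]
```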

Segmentation by edge affinity – The point map determined through the above process is segmented under two conditions. First, consecutive edges are grouped by the edge-grouping method introduced in Edge Boxes [12]. Here the edge magnitude m_p is taken as the inverse of the Euclidean distance between adjacent points P_i and P_j of two neighboring edge groups, as shown in equation (4).

m_p = \frac{1}{\lVert P_i - P_j \rVert_2}        (4)

(5)

The affinity score a(s_i, s_j) in equation (5) is obtained for all edge groups s_i, and the edges are segmented at the boundaries of groups whose score changes rapidly. For the points p_i and p_j located at the boundary of each edge set, the ground-height points PG^z_i and PG^z_j, and the points PG^{z+1.5}_i and PG^{z+1.5}_j located 1.5 m above the ground, are projected onto the CCD image, and a bounding box is created by connecting them horizontally to the adjacent boundary.
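A sketch of the grouping step, under our assumptions: the edge points are already ordered left to right by image column, and a group boundary is declared where the neighbor distance (the inverse of the affinity m_p of equation (4)) jumps relative to the median gap. This ratio test is an assumed stand-in for the paper's exact score-change criterion.

```python
import numpy as np

def split_edge_groups(points3d, order, ratio=2.5):
    """Group consecutive edge points; split where the affinity drops sharply.

    points3d: the filtered cloud; order: retained edge-point indices sorted
    by image column. The affinity between neighbors is the inverse Euclidean
    distance m_p = 1 / ||P_i - P_j|| of equation (4); `ratio` is an assumed
    split threshold.
    """
    pts = points3d[order]
    gaps = np.linalg.norm(np.diff(pts, axis=0), axis=1)  # neighbor distances
    boundaries = np.flatnonzero(gaps > ratio * np.median(gaps)) + 1
    return np.split(order, boundaries)                   # list of edge groups
```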

Proposal generation based on actual size – We create a bounding box between segmented boundaries. The bottom of each boundary point is determined from the ground position obtained during ground removal. The height of the bounding box is set to 1.5 m, which covers the three classes mainly detected in the driving environment, and the box is projected onto the CCD image to generate the proposal. The yellow line in Figure 4 represents the projected LiDAR edge, and the orange lines represent the boundaries divided by the affinity of the LiDAR edges. The green boxes at the bottom of Figure 4 are built between these boundaries with a height of 1.5 m.
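The box construction can be sketched as follows, reusing project_to_image from the earlier sketch. For simplicity a single ground height is assumed per group, whereas the text determines the bottom per boundary point from the ground-removal result.

```python
import numpy as np

def proposals_from_groups(groups, points3d, ground_z, P, img_w, img_h, box_h=1.5):
    """Build a 2D proposal box of real-world height 1.5 m per edge group."""
    proposals = []
    for g in groups:
        p_left, p_right = points3d[g[0]], points3d[g[-1]]  # group boundary points
        # Bottom corners on the ground, top corners 1.5 m above it.
        corners = np.array([[p[0], p[1], z]
                            for p in (p_left, p_right)
                            for z in (ground_z, ground_z + box_h)])
        uv, _, _ = project_to_image(corners, P, img_w, img_h)
        if len(uv) == 4:                                   # all corners visible
            x0, y0 = uv.min(axis=0)
            x1, y1 = uv.max(axis=0)
            proposals.append((x0, y0, x1, y1))
    return proposals
```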

Figure 4. A box proposal created based on edge group boundaries divided by edge affinity.

B. Classification with R-FCN

The generated proposals are classified by combining them with R-FCN, which shows state-of-the-art performance. Figure 5-(b) shows the R-FCN implemented on ResNet-101 [13]. R-FCN resolves the translation invariance-variance dilemma between detection and classification by dividing each ROI into a k x k grid and training a feature set for each grid cell. With a model trained this way, a box proposal that wraps an object tightly enough returns a high voting score. Using this property, redundant boxes caused by LiDAR noise can be organized according to their voting scores.
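The text does not give the exact procedure for organizing the redundant boxes, so the sketch below uses standard greedy non-maximum suppression driven by the R-FCN voting score as a plausible stand-in.

```python
import numpy as np

def suppress_by_vote(boxes, scores, iou_thresh=0.5):
    """Keep the highest-voting box among overlapping duplicates (greedy NMS).

    boxes: (N, 4) as (x0, y0, x1, y1); scores: (N,) R-FCN voting scores.
    """
    order = np.argsort(scores)[::-1]          # highest voting score first
    keep = []
    while len(order):
        i = order[0]
        keep.append(i)
        # Intersection of the top box with every remaining box.
        xx0 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy0 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx1 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy1 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx1 - xx0, 0, None) * np.clip(yy1 - yy0, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                  (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= iou_thresh]   # drop near-duplicates
    return keep
```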

Figure 3. The process by which objects are detected by bidirectional projection filtering. (a) Original image, (b) 2D LiDAR projection, (c) 2D gradient-based edge cleanup, (d) proposal creation using edge affinity and classification by R-FCN, (e) 3D bounding-box expansion by LiDAR projection (red boxes), and (f) final result.


Figure 5. Detailed structure of the proposed method. (a) 3D-to-2D projection filtering, (b) R-FCN, (c) 2D-to-3D projection filtering, (d) result.

C. 3D projection and box extension

Next, we assign the class label to the edge group inside the edge boundary of each box classified by R-FCN. Each labeled edge is projected back into 3D space through the indices kept earlier, as in Figure 5-(c), and the orientation of the object is estimated from the extreme points along the X-Y axes among the points constituting the edge. Once the orientation is determined, the 3D box is expanded to the class-specific size and aspect ratio. The edge to be used as the base of the expansion is determined from the positional relationship of the edge group and the height difference between hP_i and hP_j at the segmentation boundary surrounding the edge in 2D space. Using this, we extend the box around the edge that contains the visible outline of the object. Figure 6 shows the process of expanding boxes through the labeled edges and merging overlapping boxes and edges.
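A sketch of the extension step under our assumptions: orientation is taken from the dominant direction of the labeled edge points in the X-Y plane (via SVD rather than the paper's extreme-point test), and the class sizes in CLASS_SIZE are illustrative values, not the paper's.

```python
import numpy as np

# Assumed class-specific metric sizes (length, width, height in meters);
# the paper fixes sizes per class but does not list exact values.
CLASS_SIZE = {'Car': (4.0, 1.7, 1.5), 'Pedestrian': (0.8, 0.6, 1.75)}

def extend_box_3d(edge_pts, label, ground_z):
    """Grow a full 3D box from a labeled edge group."""
    L, W, H = CLASS_SIZE[label]
    center = edge_pts[:, :2].mean(axis=0)
    xy = edge_pts[:, :2] - center
    # Principal direction of the visible edge = orientation estimate.
    _, _, vt = np.linalg.svd(xy, full_matrices=False)
    yaw = np.arctan2(vt[0, 1], vt[0, 0])
    # Push the center half a width away from the sensor, behind the visible
    # edge, so the box covers the unseen far side of the object (assumption).
    normal = np.array([-np.sin(yaw), np.cos(yaw)])
    if np.dot(normal, center) < 0:
        normal = -normal
    center = center + normal * (W / 2.0)
    return dict(center=(center[0], center[1], ground_z + H / 2.0),
                size=(L, W, H), yaw=yaw)
```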

III. EXPERIMENT RESULTS

This study was implemented in the Caffe framework and run on an NVIDIA TITAN X GPU. The two networks used for evaluation, Faster R-CNN and R-FCN, were originally trained and evaluated on the PASCAL VOC dataset, but this study uses the KITTI dataset because LiDAR information is required as well. For an objective comparison, we also compared the original R-FCN using Selective Search [14] against the results of this study on the KITTI object dataset, using the R-FCN model trained on PASCAL VOC. The metrics used in the evaluation are the standard mean average precision and the mean intersection-over-union.
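For reference, the intersection-over-union criterion used throughout the evaluation is the usual one; a minimal sketch:

```python
def iou_2d(a, b):
    """IoU of two boxes (x0, y0, x1, y1); the overlap criterion behind the
    AP thresholds reported below."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```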

A. Experiments on KITTI

We first train on the KITTI object dataset and tracking sequences, the representative public dataset that provides both CCD and LiDAR information usable with R-FCN. In the training phase, the four classes 'Car', 'Pedestrian', 'Cyclist', and 'background' were used. Since the proposed method requires no additional learning for LiDAR, the R-FCN architecture trained on CCD images does not need to be changed; however, as the set of classes changes, the volume of the feature set changes. The valid range of the LiDAR sensor data given in the KITTI dataset is about 50 m, which is not wide enough to cover the entire CCD image, so objects of hard difficulty lying beyond 50 m could not be evaluated. Table 1 compares the object detection rates of existing state-of-the-art methods and our method, measuring the mAP per difficulty for the three classes. We measured the precision of boxes with an IoU of 50% or more against the ground-truth area and obtained good results for 'Car' and 'Pedestrian'. In the case of 'Cyclist', however, the LiDAR points are not uniformly distributed because of the spoke shape of the bicycle wheels, so box proposals of sufficient size could not be generated, resulting in a relatively low value.

Method           Car (E)   Car (M)   Ped. (E)   Ped. (M)   Cyc. (E)   Cyc. (M)
Regionlet [15]   84.75     76.45     73.14      61.15      70.41      58.72
3DVP [10]        87.46     75.77     -          -          -          -
SubCat [16]      84.14     75.46     -          -          -          -
SDP [17]         90.33     83.53     77.74      64.19      74.08      61.31
Ours             95.41     88.54     81.78      65.71      72.11      60.85

Table 1. Comparison of our results on the KITTI dataset with state-of-the-art methods (E = easy, M = moderate). Hard difficulty was excluded from the comparison because of the LiDAR sensor's measurement range limit of about 50 m.

B. Experiments on PASCAL VOC & KITTI

This study is, in essence, a proposal generation method, and the classifier R-FCN itself is used without modification. The mAP measured on KITTI is therefore not sufficient to analyze the performance of our contribution, since it is strongly influenced by the performance of R-FCN itself. We therefore compared our method, as a proposal generation and result-correction tool, against R-FCN with RPN or Selective Search proposals to measure its performance objectively. In this comparison, we used the Caffe models trained on PASCAL VOC 07+12, as published in the original papers, to avoid the problems that could arise from re-training the two networks on the KITTI dataset.

Figure 6. Box expanded using the determined edge orientation and corner detection. (a) The object box classified by R-FCN; (b) the box extended to the actual size of the class through 2D-to-3D projection and then projected back onto 2D.


Table 2 compares the accuracy and computation time of two other approaches with this study, and Table 3 shows the AP variation with the IoU ratio to the ground truth. As the tables show, the overlap ratio is higher than that of the conventional RPN or Selective Search (SS). This is because the box extension through the class label reduces the error by re-expanding the box closer to the actual object size. We confirmed that, as long as an object lies within the LiDAR sensor range, this method expresses it as a bounding box closer to the ground truth than existing methods do. The box extension is currently performed only for the vehicle category because of the difficulty of acquiring orientation, as shown in the experiment of Zhang et al. [18]. However, if the aspect ratio and size of objects such as pedestrians can be generalized, additional categories can be supported.

Method               Training Data   Test Data   mAP (%)   Test time (sec/img)
RPN + Faster R-CNN   07+12           KITTI       75.7      0.37
RPN + R-FCN          07+12           KITTI       77.4      0.20
SS + R-FCN           07+12           KITTI       80.4      2.21
Ours + R-FCN         07+12           KITTI       82.4      0.17

Table 2. Comparison of detection rates on the KITTI object and tracking datasets. To reduce errors, we used the original pre-trained models.

Method        Training Data   Test Data     AP@0.5   AP@0.7   AP@0.9
RPN + R-FCN   07+12           KITTI (Car)   84.8     77.4     55.2
SS + R-FCN    07+12           KITTI (Car)   86.3     80.4     58.4
Ours          07+12           KITTI (Car)   89.7     82.4     80.1

Table 3. AP as a function of the IoU threshold in the Car category. The proposed method shows a higher overlap ratio with the ground truth than RGB-feature-based proposal generators such as Selective Search.

IV. CONCLUSION

Through this study, we propose a simple but strong method for autonomous driving. We have shown better results than state-of-the-art methods by effectively combining a CCD-based classifier with LiDAR information through the proposed bidirectional projection filtering (BPF). In particular, the 3D spatial information of partially obscured objects is grasped in real time, and the detection area is extended accordingly, so the result shows a high IoU with the ground truth. However, problems remain to be solved, such as how to handle objects that appear on the CCD but lie outside the LiDAR range, and how to expand boxes on both sides of an object. In addition, the LiDAR sensing range is shorter than the visible range of the CCD, so objects visible only to the CCD must be compensated for. We will address these issues in future research.

REFERENCES

[1] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.

[2] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.

[3] Kong, Tao, et al. "HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection." arXiv preprint arXiv:1604.00600 (2016).

[4] Dai, Jifeng, et al. "R-FCN: Object Detection via Region-based Fully Convolutional Networks." arXiv preprint arXiv:1605.06409 (2016).

[5] Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).

[6] Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." arXiv preprint arXiv:1512.02325 (2015).

[7] Chen, Xiaozhi, et al. "3d object proposals for accurate object class detection." Advances in Neural Information Processing Systems. 2015.

[8] Wang, Dominic Zeng, and Ingmar Posner. "Voting for voting in online point cloud object detection." Proceedings of the Robotics: Science and Systems, Rome, Italy 1317 (2015).

[9] Xiang, Yu, et al. "Subcategory-aware Convolutional Neural Networks for Object Proposals and Detection." arXiv preprint arXiv:1604.04693 (2016).

[10] Xiang, Yu, et al. "Data-driven 3d voxel patterns for object category recognition." 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.

[11] T. Huang, G. Yang, and G. Tang, "A fast two-dimensional median filtering algorithm", IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, no. 1, pp. 13–18, 1979.

[12] Zitnick, C. Lawrence, and Piotr Dollár. "Edge boxes: Locating object proposals from edges." European Conference on Computer Vision. Springer International Publishing, 2014.

[13] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).

[14] Uijlings, Jasper RR, et al. "Selective search for object recognition." International journal of computer vision 104.2 (2013): 154-171.

[15] Wang, Xiaoyu, et al. "Regionlets for generic object detection." IEEE transactions on pattern analysis and machine intelligence 37.10 (2015): 2071-2084.

[16] Ohn-Bar, Eshed, and Mohan Manubhai Trivedi. "Learning to detect vehicles by clustering appearance patterns." IEEE Transactions on Intelligent Transportation Systems 16.5 (2015): 2511-2521.

[17] Yang, Fan, Wongun Choi, and Yuanqing Lin. "Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

[18] Zhang, Liliang, et al. "Is Faster R-CNN Doing Well for Pedestrian Detection?." European Conference on Computer Vision. Springer International Publishing, 2016.