



Traffic Sign Detection with CNN Approach

Ge Luo1, Zhongling Wang1, Ang Li1, Sihan Wang2, Zhenzhong Chen3

School of Computer Science, Wuhan University, China

Abstract—Traffic sign detection is challenging due to complex road environments involving snow, rain, haze, shadow, and other conditions. A comprehensive traffic sign detection system should be robust to these environments and be able to detect and classify traffic signs quickly and correctly. In this paper, we present a system for the Traffic Sign Detection task of the Video and Image Processing Cup (VIP Cup) 2017. In the system, an automatic traffic sign detection model with a deep neural network structure is applied to detect traffic signs under different environments. The resulting system decomposes videos into frame images, extracts features, produces region proposals, and detects and classifies traffic signs automatically. As shown in the experiments, our traffic sign detection system can effectively detect different traffic signs under different environments.

I. INTRODUCTION

Due to the rise of automobiles in recent decades, traffic accidents have become a major cause of death. Each year, about 10 million people are injured in traffic accidents around the world. Many works have been proposed to promote the development of protection systems that improve traffic safety, such as autonomous driving systems [1] and pedestrian detection [2-3]. One situation that contributes to accidents to a great extent is the misunderstanding or neglect of traffic signs. Traffic sign recognition is therefore an important part of such protection systems, because it can enhance the driver's awareness of traffic signs, especially under challenging conditions such as bad weather. Although it has been studied for many years [11-23], the main difficulties, including the diversity in the design of traffic signs, poor lighting conditions, background color similarity, and occlusion, still pose a huge obstacle to progress.

In VIP Cup 2017, traffic videos under different environmental conditions are provided, such as extreme lighting conditions, blur, haze, heavy rain, and snow. These videos include processed versions of captured videos as well as synthesized videos. The type and level of the challenging conditions vary across the video dataset, including but not limited to a range of lighting, blur, haze, rain, and snow levels, as shown in Fig. 1. Participants are expected to design a traffic sign detection algorithm that works robustly under such challenging environmental conditions.

Traffic sign detection usually relies on two key observations: the sign has a clear shape and a unique color.

1 Undergraduate
2 Postgraduate
3 Supervisor

Fig. 1. Sample images with different weather conditions and different distortion levels in VIP Cup 2017

Thus traffic sign detection methods can be categorized into three types: shape-based, color-based, and hybrid. Shape-based detection approaches mainly rely on complex shape analysis, the distance transform, the Hough transform, Laplacian filters, and HOG features [10]. These methods are usually slow because they must handle a wide range of geometric variations of traffic signs in very large images. Color-based algorithms usually follow a common two-step scheme: traffic sign detection and traffic sign classification. In the detection phase, the goal is to locate the bounding box that contains the traffic sign; the classification phase then determines which category each detected region falls into. These methods are usually robust to deformation but sensitive to lighting conditions. To combine the advantages of the two, hybrid structures extract both shape and color features.

Deep convolutional neural networks (CNNs) [9, 24] are hybrid models that can extract various types of features. They have achieved remarkable progress in a variety of computer vision tasks, such as image classification [5] and object detection. Among state-of-the-art detection algorithms based on deep neural networks, Faster R-CNN [4] has shown excellent performance. Thus, in this paper, we use Faster R-CNN as our algorithm and Caffe [6] as our deep learning framework to realize our traffic sign detection system. Our method achieves good results with acceptable detection speed and admissible computational resources.

The remainder of the paper is organized as follows: The second section explains the technical details, including the architecture of our system and the training method. The third section presents the detailed evaluation and analysis of our method. In the last section, we summarize our method and point out some issues that need to be improved.

II. OUR PROPOSED SYSTEM

A. Architecture of the proposed system

Our system was developed using the Caffe deep learning framework with its Python interface. In the dataset, all the videos provided for training were well organized with annotations. The architecture of our system is shown in Fig. 2.

Fig. 2. Architecture of the proposed system

First of all, we extracted the frames containing traffic signs for training, then the annotation files were converted into XML format to meet the requirements of the training scripts. The Python interface of Caffe was used to train the final model on a GPU device to accelerate the training process. The trained Caffe model was then used to detect and classify the traffic signs in the test set. After detection and classification, we employed non-maximum suppression (NMS) with a threshold of 0.1 to merge highly overlapped bounding boxes. Average precision and recall were used to evaluate the results.
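As an illustration of the first step, the following is a minimal sketch of pulling annotated frames out of a training video; it assumes OpenCV is available, and the output naming scheme and the annotated_frames index set are hypothetical, not the exact code used in the system.

import os
import cv2

def extract_annotated_frames(video_path, out_dir, annotated_frames):
    # annotated_frames: set of frame indices that carry annotations
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index in annotated_frames:
            cv2.imwrite(os.path.join(out_dir, "%06d.jpg" % index), frame)
        index += 1
    cap.release()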

B. Algorithm details

The basic task of detecting traffic signs is object detection. In this field, many methods have been proposed, such as R-CNN [7], Fast R-CNN [8], and Faster R-CNN. A general object detection structure needs four components: region proposal, feature extraction, classification, and position refinement.

We employ the Faster R-CNN method, which combines these four parts in one deep neural network with a good tradeoff between computational complexity and accuracy.

In our algorithm, the dataset was converted into the VOC2007 format so that it could be used with the Faster R-CNN training scripts. In this format, annotated frames are extracted from the video sequences as images, and the annotation files must be converted into XML files.
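A minimal sketch of this conversion step is given below: it writes one Pascal VOC 2007 style XML file per frame. The element names follow the standard VOC layout; the (label, xmin, ymin, xmax, ymax) tuple format is an assumption about the raw annotation files.

import xml.etree.ElementTree as ET

def write_voc_annotation(image_name, width, height, boxes, xml_path):
    # boxes: iterable of (label, xmin, ymin, xmax, ymax) tuples
    root = ET.Element("annotation")
    ET.SubElement(root, "folder").text = "VOC2007"
    ET.SubElement(root, "filename").text = image_name
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        ET.SubElement(obj, "difficult").text = "0"
        bndbox = ET.SubElement(obj, "bndbox")
        ET.SubElement(bndbox, "xmin").text = str(xmin)
        ET.SubElement(bndbox, "ymin").text = str(ymin)
        ET.SubElement(bndbox, "xmax").text = str(xmax)
        ET.SubElement(bndbox, "ymax").text = str(ymax)
    ET.ElementTree(root).write(xml_path)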

Training was then performed on a workstation with an NVIDIA 1080 Ti GPU, 96 GB of memory, and 150 GB of free swap space. Because of the large number of images used for training, our system often ran out of memory while training the model, so we added the 150 GB of swap space for storing temporary files. We used ZF Net as the feature extraction network in Faster R-CNN. The Python interface of Caffe does not reliably free GPU memory when instantiated nets are discarded. To work around this issue, each training stage is executed in a separate process using the multiprocessing module, and a queue is used to communicate results between processes.
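The following is a sketch of this per-stage process isolation: each training stage runs in its own process so that all GPU memory is reclaimed when the process exits. Here stage_fn is a hypothetical stand-in for one Faster R-CNN training stage.

import multiprocessing as mp

def _run_stage(stage_fn, cfg, queue):
    # stage_fn trains one stage and returns, e.g., the snapshot model path
    queue.put(stage_fn(cfg))

def run_stage_in_subprocess(stage_fn, cfg):
    queue = mp.Queue()
    p = mp.Process(target=_run_stage, args=(stage_fn, cfg, queue))
    p.start()
    result = queue.get()   # read before join to avoid blocking on a full pipe
    p.join()               # GPU memory is released when the process exits
    return result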

To detect the traffic signs in a video sequence, each video frame was fed into Faster R-CNN, which produced bounding box regression vectors, class labels, and class scores. We then employed NMS to merge highly overlapped bounding boxes. Finally, candidates with a class score higher than a threshold of 0.8 were kept as the final detection results.
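A minimal sketch of the final filtering step, assuming the network's per-frame outputs are available as NumPy arrays (the variable names are hypothetical):

import numpy as np

SCORE_THRESH = 0.8  # class-score cutoff used for the final detections

def filter_by_score(boxes, class_ids, scores, thresh=SCORE_THRESH):
    # Keep only candidates whose class score reaches the threshold;
    # NMS is assumed to have been applied beforehand.
    keep = scores >= thresh
    return boxes[keep], class_ids[keep], scores[keep]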

During detection, many candidate windows may contain the same traffic sign, but we only need the best one to represent the location of the sign. Non-maximum suppression (NMS) is used to merge highly overlapped candidates. First, the area of every candidate is calculated and the candidates are sorted by detection score. The candidate with the highest score is kept, and its Intersection-over-Union (IoU) with each remaining candidate is computed; any candidate whose IoU exceeds the chosen threshold is removed. These steps are repeated on the remaining candidates to obtain the final detection results. We set the NMS threshold to 0.1 to obtain more precise detections. A simple example is shown in Fig. 3: with NMS, the many overlapping candidates are reduced to the most precise one with the highest detection and classification score.
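The following is a self-contained sketch of the NMS procedure just described, using the paper's IoU threshold of 0.1. It is a standard greedy implementation, assuming boxes are (xmin, ymin, xmax, ymax) rows of a NumPy array, not the exact code used in the system.

import numpy as np

def nms(boxes, scores, iou_thresh=0.1):
    # boxes: (N, 4) array of (xmin, ymin, xmax, ymax); scores: (N,) array
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)   # area of every candidate
    order = scores.argsort()[::-1]          # indices sorted by score, descending
    keep = []
    while order.size > 0:
        i = order[0]                        # current highest-scoring candidate
        keep.append(i)
        # overlap of candidate i with all remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1 + 1) * np.maximum(0.0, yy2 - yy1 + 1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop every candidate whose IoU with the kept box exceeds the threshold
        order = order[1:][iou <= iou_thresh]
    return keep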

III. EXPERIMENTAL RESULTS

The core model of our system is implemented using the open-source Caffe framework. Experiments were performed on a machine with an NVIDIA 1080 Ti GPU, 96 GB of memory, and 150 GB of free swap space, running Ubuntu.

The methods introduced above were applied to the test set. To assess the performance of Faster R-CNN with ZF Net on this task, average precision and recall for the different sequence types, challenge types, and levels were computed, as reported in the following tables and figures.
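The paper does not spell out its matching criterion; the sketch below shows one common way to compute precision and recall for a sequence by greedy IoU matching, where the 0.5 match threshold and the iou_fn helper are assumptions.

def precision_recall(detections, ground_truths, iou_fn, iou_thresh=0.5):
    # Greedy one-to-one matching: each detection may claim at most one
    # ground-truth box with IoU at or above the threshold.
    tp = 0
    unmatched = list(ground_truths)
    for det in detections:
        match = next((gt for gt in unmatched if iou_fn(det, gt) >= iou_thresh), None)
        if match is not None:
            tp += 1
            unmatched.remove(match)
    fp = len(detections) - tp
    fn = len(unmatched)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall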

According to Fig. 4 and Fig. 5, the algorithm performs poorly on the Rain type (the 9th type) and the Codec error type (the 3rd type), while it performs relatively well on the Dirty lens type (the 5th type) and the Shadow type (the 10th type). As the challenge level increases, performance drops rapidly on the other types.

Fig. 3. Original detection results and results after NMS with threshold 0.1

TABLE I
PRECISION/RECALL ON REAL ENVIRONMENTS

Type  Level 0    Level 1    Level 2    Level 3    Level 4    Level 5
 0    0.30/0.33      -          -          -          -          -
 1        -      0.30/0.32  0.27/0.27  0.18/0.16  0.06/0.05  0.01/0.01
 2        -      0.07/0.08  0.05/0.05  0.04/0.04  0.03/0.03  0.03/0.03
 4        -      0.30/0.30  0.18/0.16  0.02/0.02  0.00/0.00  0.00/0.00
 5        -      0.29/0.31  0.28/0.31  0.27/0.28  0.23/0.23  0.11/0.09
 6        -      0.26/0.27  0.19/0.18  0.09/0.08  0.03/0.03  0.00/0.00
 7        -      0.28/0.28  0.22/0.20  0.09/0.08  0.06/0.06  0.02/0.02
 8        -      0.29/0.31  0.25/0.25  0.17/0.15  0.07/0.06  0.01/0.01
 9        -      0.08/0.06  0.03/0.02  0.03/0.02  0.02/0.01  0.01/0.00
10        -      0.29/0.31  0.29/0.30  0.25/0.24  0.20/0.18  0.14/0.12
11        -      0.30/0.33  0.23/0.22  0.12/0.10  0.04/0.03  0.00/0.00
12        -      0.31/0.31  0.25/0.23  0.14/0.12  0.09/0.08  0.08/0.07


From Fig. 6 and Fig. 7 we can see that the algorithm performs poorly on the Rain type (the 9th type), while it performs relatively well on the Dirty lens type (the 5th type) and the Shadow type (the 10th type). As the challenge level increases, performance drops rapidly on the other types, but remains stable on the Codec error type (the 3rd type).

TABLE II
PRECISION/RECALL ON VIRTUAL ENVIRONMENTS

Type  Level 0    Level 1    Level 2    Level 3    Level 4    Level 5
 0    0.29/0.40      -          -          -          -          -
 1        -      0.30/0.39  0.31/0.32  0.23/0.18  0.03/0.02  0.01/0.01
 2        -      0.33/0.35  0.32/0.23  0.22/0.15  0.13/0.07  0.07/0.04
 3        -      0.19/0.26  0.15/0.20  0.13/0.18  0.12/0.15  0.11/0.15
 4        -      0.34/0.35  0.16/0.10  0.01/0.01  0.00/0.00  0.00/0.00
 5        -      0.29/0.39  0.29/0.39  0.29/0.37  0.30/0.32  0.22/0.15
 6        -      0.29/0.32  0.22/0.17  0.11/0.07  0.06/0.04  0.02/0.01
 7        -      0.36/0.36  0.24/0.16  0.12/0.07  0.05/0.03  0.02/0.01
 8        -      0.30/0.39  0.30/0.32  0.22/0.16  0.07/0.04  0.01/0.01
 9        -      0.13/0.08  0.04/0.03  0.02/0.01  0.02/0.01  0.00/0.00
10        -      0.30/0.39  0.32/0.34  0.32/0.27  0.27/0.19  0.22/0.14
11        -      0.30/0.40  0.32/0.29  0.21/0.14  0.09/0.05  0.00/0.00

Fig. 4. Precision on real environments

Fig. 5. Recall on real environments


Fig. 6. Precision on virtual environments

Fig. 7. Recall on virtual environments

IV. CONCLUSION AND FUTURE WORK

In this paper we have presented the detailed implementation of our system for the IEEE VIP Cup 2017. The system detects traffic signs under complex environments. We trained a deep learning model using the Caffe framework and integrated it into our system to realize automatic traffic sign detection and classification. The current results are not good enough, and more work needs to be done to improve performance in the future.

REFERENCES

[1] J. Levinson, J. Askeland, J. Becker, J. Dolson, D. Held, S. Kammel, J. Z. Kolter, D. Langer, O. Pink, V. Pratt, M. Sokolsky, G. Stanek, D. Stavens, A. Teichman, M. Werling, and S. Thrun, "Towards fully autonomous driving: Systems and algorithms," in Intelligent Vehicles Symposium (IV). IEEE, 2011.

[2] P. Dollár, R. Appel, and W. Kienzle, "Crosstalk cascades for frame-rate pedestrian detection," in ECCV, 2012.

[3] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool, "Pedestrian detection at 100 frames per second," in CVPR, 2012.

[4] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, p. 1137, 2017.

[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in International Conference on Neural Information Processing Systems, pp. 1097–1105, 2012.

[6] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.

[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," pp. 580–587, 2013.

[8] R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.

[9] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision, pp. 818–833, 2014.

[10] M. Mathias, R. Timofte, R. Benenson, and L. Van Gool, "Traffic sign recognition - how far are we from the solution?" in Neural Networks (IJCNN), The 2013 International Joint Conference on, Aug 2013, pp. 1–8.

[11] D. Gavrila, "Traffic sign recognition revisited," in Mustererkennung 1999, 21. DAGM-Symposium, London, UK, 1999, pp. 86–93, Springer-Verlag.

[12] A. Møgelmose, M. M. Trivedi, and T. B. Moeslund, "Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey," IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 4, pp. 1484–1497, Dec 2012.

[13] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, "The German traffic sign recognition benchmark: A multi-class classification competition," in Neural Networks (IJCNN), The 2011 International Joint Conference on, July 2011, pp. 1453–1460.

[14] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, "Man versus computer: Benchmarking machine learning algorithms for traffic sign recognition," Neural Networks, vol. 32, pp. 323–332, 2012, Selected Papers from IJCNN 2011.

[15] R. Timofte, K. Zimmermann, and L. V. Gool, "Multi-view traffic sign detection, recognition, and 3D localisation," in Workshop on Applications of Computer Vision (WACV), Dec 2009, pp. 1–8.

[16] F. Larsson and M. Felsberg, "Using Fourier descriptors and spatial models for traffic sign recognition," in Proceedings of the 17th Scandinavian Conference on Image Analysis (SCIA '11), Berlin, Heidelberg, 2011, pp. 238–249, Springer-Verlag.

[17] C. Grigorescu and N. Petkov, "Distance sets for shape filters and shape recognition," IEEE Transactions on Image Processing, vol. 12, no. 10, pp. 1274–1286, Oct 2003.

[18] R. Belaroussi, P. Foucher, J. P. Tarel, B. Soheilian, P. Charbonnier, and N. Paparoditis, "Road sign detection in images: A case study," in Pattern Recognition (ICPR), 2010 20th International Conference on, Aug 2010, pp. 484–488.

[19] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel, "Detection of traffic signs in real-world images: The German traffic sign detection benchmark," in Neural Networks (IJCNN), The 2013 International Joint Conference on, Aug 2013, pp. 1–8.

[20] R. Timofte and L. Van Gool, "Sparse representation based projections," in BMVC, 2011.

[21] R. Timofte, V. A. Prisacariu, L. J. Van Gool, and I. Reid, "Chapter 3.5: Combining traffic sign detection with 3D tracking towards better driver assistance," in Emerging Topics in Computer Vision and its Applications, C. H. Chen, Ed. World Scientific Publishing, September 2011.

[22] R. Timofte, K. Zimmermann, and L. Van Gool, "Multi-view traffic sign detection, recognition, and 3D localisation," Machine Vision and Applications, December 2011.

[23] A. Møgelmose, M. M. Trivedi, and T. B. Moeslund, "Vision based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey," IEEE Intelligent Transportation Systems Transactions and Magazine. Special Issue on MLFTSR, 2012.

[24] P. Sermanet and Y. LeCun, "Traffic sign recognition with multi-scale convolutional networks," in IJCNN, 2011, pp. 2809–2813.