
Towards New Retail: A Benchmark Dataset for Smart Unmanned Vending Machines

Haijun Zhang, Donghai Li, Yuzhu Ji, Haibin Zhou, Weiwei Wu, and Kai Liu

Abstract—Deep learning is a popular direction in computer vision and digital image processing. It is widely utilized in many fields, such as robot navigation, intelligent video surveillance, industrial inspection, and aerospace. With the extensive use of deep learning techniques, classification and object detection algorithms have developed rapidly. In recent years, with the introduction of the concept of “unmanned retail”, object detection and image classification have come to play a central role in unmanned retail applications. However, open-source datasets for traditional classification and object detection have not yet been optimized for the application scenarios of unmanned retail, and no classification or object detection dataset currently exists that focuses solely on unmanned retail. Therefore, in order to promote unmanned retail applications based on deep learning classification and object detection, we collected more than 30,000 images of unmanned retail containers using a refrigerator fitted with different cameras under both static and dynamic recognition environments. These images were categorized into 10 kinds of beverages. After manual labeling, the images in our constructed dataset contained 155,153 instances, each annotated with a bounding box. We performed extensive experiments on this dataset using 10 state-of-the-art deep learning-based models. Experimental results indicate the great potential of using these deep learning-based models for real-world smart unmanned vending machines (UVMs).

Index Terms—Unmanned retail, vending machine, object detection, benchmark dataset.

I. INTRODUCTION

Recent years have witnessed the rapid development of e-commerce, along with dramatic advances in technologies of mobile payment [1], the Internet of Things (IoT) [2], cloud computing, and artificial intelligence (AI). In the past year, the unmanned retail mode, a new e-commerce concept focusing on next-generation retailers, was proposed by giant e-commerce companies [3]. It aims at transforming the traditional retail mode into unmanned retail scenarios by utilizing the above-mentioned advanced technologies.

This work was supported in part by the National Key R&D Program of China under Grants no. 2018YFB1003800 and 2018YFB1003805, the National Natural Science Foundation of China under Grants no. 61832004 and no. 61572156, and the Shenzhen Science and Technology Program under Grants no. JCYJ20170413105929681 and no. JCYJ20170811161545863. Weiwei Wu's work is supported by the National Natural Science Foundation of China under Grants No. 61672154 and No. 61972086 and the Fund of ZTE Corporation under Grant No. HC-CN-20190826009. Kai Liu's work was supported in part by the National Natural Science Foundation of China under Grant No. 61872049.

Copyright © 2009 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

Haijun Zhang, Donghai Li, Yuzhu Ji, and Haibin Zhou are with the Harbin Institute of Technology, Shenzhen, Xili University Town, Shenzhen 518055, P. R. China. Weiwei Wu is with Southeast University, Nanjing, P. R. China. Kai Liu is with Chongqing University, Chongqing, P. R. China. (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). Haijun Zhang is the corresponding author.

A veritable revolution has occurred in traditional retail with the unprecedented development of technologies in AI and Internet Plus [4]. Indeed, building unmanned stores, smart unmanned vending machines (UVMs), and unstaffed sales shelves has become possible by relying on these new technological advances. Traditional UVMs usually rely on automation technology to deliver beverages when a customer selects an item by pressing buttons. Although new methods such as quick response (QR) codes and facial recognition have been employed to replace buttons [5] [6] [7], traditional vending machines have drawbacks such as a purchase experience in which products cannot be touched before buying, the ability to select only one item at a time, and no option to return a product if the customer is not satisfied with it. These limits can be addressed by AI-based UVMs. Advanced computer vision techniques can be regarded as an upgrade for traditional UVMs by guaranteeing recognition accuracy and enhancing the user experience. In the big data era, AI-based UVMs are also able to boost potential commercial applications by collecting data on user purchase behavior. For example, at the end of 2016, Amazon launched an experimental unmanned convenience store, called “Amazon Go”. This store integrates new technologies in computer vision, deep learning, sensors, and mobile payment1. Specifically, it enables customers to purchase products without waiting to check out. Payment is automatically taken when a consumer has completed his or her shopping and walks out of the store, which brings a truly qualitative improvement in the shopping experience. After this innovation, many e-commerce companies in China, including Alibaba, JD.com, Suning.com, etc., have made new attempts to develop unmanned retail stores. For instance, Alibaba launched an unstaffed concept store, called Tao Café, in Hangzhou, China in 2017. Specifically, a facial-recognition system was utilized to accurately identify customers, and sensors were used for tracking the product selections of customers. Payments were automatically deducted from the customer's Alipay account. At the same time, JD.com developed JD.ID X-mart by employing technologies such as QR codes, facial recognition, and radio frequency identification (RFID). Similar technologies were also applied in an unmanned automated retail store, called Biu, which was developed by Suning.com. This store was unique in that it incorporated the concepts of automatic upselling and personalized recommendations.

1 https://www.amazon.com/b?ie=UTF8&node=16008589011


Other solutions, including the F5 Future Store2, Bianlifeng3, BingoBox4, and Auchan Minute5, were developed in 2017. The above-mentioned innovations show that unmanned retail is becoming a major focus of China's e-commerce companies. It was reported that dozens of enterprises and start-ups, many of which are implementing innovative state-of-the-art technologies, obtained an enormous amount of financing in 2018 [8]. According to [8], total financing in the field of UVMs alone in the unmanned sales market reached more than 1.7 billion USD in 2018. Unmanned convenience stores, smart UVMs, and unstaffed shelves constitute the three main forms of unmanned retailers, which are piloting the new trend to develop the next generation of retail. As noted above, the limits of traditional vending machines can be addressed by AI-based UVMs. By adopting advanced computer vision techniques, such as object detection and classification, the developed smart UVMs can detect the items that a user takes or returns before a transaction is completed, based on the images captured by cameras. During the purchasing process, a user is able to select product items by hand instead of through the interface embedded in UVMs.

The concept of unmanned retail is not new. Vending machines, initially developed in the early 1880s, are an instance of unmanned retail. Currently, entirely new types of UVMs have been developed by leveraging new technologies, including mobile payment and QR codes. In comparison with traditional vending machines, the efficiency of selling items and the user experience have been greatly augmented by these newly developed technologies. In general, when using these UVMs, a consumer needs to open an app that offers mobile payment services, such as Alipay or WeChat payment, and then move to the transaction settlement page by scanning a QR code. However, it is important to note that a traditional operation process still has to be followed. For example, only one item can be selected at a time, which is relatively inconvenient when a user wants to buy more than one item. In contrast, newly developed smart UVMs can greatly improve the shopping experience by adopting advanced computer vision techniques, such as object detection and classification [34] [35]. By identifying the products in videos or images taken by the camera embedded in a smart UVM, object detection and identification models can determine which items were taken out by a customer. For example, Tencent Youtu Laboratory introduced intelligent containers equipped with AI technologies at the Computer Vision Summit in 2018 [9]. These containers integrate various technologies, including deep learning techniques, visual product identification algorithms, WeChat online payment, etc., into a visual identification-based system, which offers a new shopping mode that greatly improves the purchase experience in comparison with traditional vending machines. Fig. 1 illustrates the key technologies involved in smart UVMs.

2 https://www.f5-futurestore.com/
3 https://www.bianlifeng.com/
4 https://www.bingobox.com/
5 https://www.auchan-retail.com/

Due to the above-mentioned new technologies, smart UVMs, as a main form of new retail, are achieving increasing popularity in the global e-commerce market.

Fig. 1. Technologies involved in a general solution of unmanned vending machine: a cloud management system; computer vision and deep learning; QR code, mobile payment, and RFID; and hardware (sensors, cameras, electronic tags, and antennas).

However, a publicly available benchmark dataset does not yet exist for object detection in retail unmanned containers. Therefore, in this paper, by considering real-world scenarios of UVMs from the perspective of computer vision, we constructed a large-scale dataset for multi-class beverage detection. Specifically, based on static and dynamic purchasing events, we divided the related tasks into static object detection and dynamic object classification during a customer's purchasing process. As a result, 34,052 images comprising 10 categories of commonly available beverages in China's market were collected by three different kinds of cameras, including a smart camera with a wide shooting angle, a high-definition camera, and a fisheye camera, which were affixed to a refrigerator. For static object detection, 155,153 object instances were manually annotated with labels and bounding boxes. For dynamic multi-class recognition, 3,046 images were collected during the process in which a beverage was taken out of the refrigerator by a customer. Based on this dataset, we established a benchmark for beverage detection and classification for applications in UVMs by evaluating the performance of six state-of-the-art object detection models and four classification models, respectively. The established dataset is expected to promote the development of UVMs based on computer vision technologies.

The remainder of this paper is organized as follows. Section II introduces the motivation behind the construction of the proposed benchmark dataset. Section III presents related work on beverage detection and classification, respectively. Section IV describes the detailed construction of our beverage dataset, including image collection, category selection, and annotation. In Section V, we implement six object detection models and four classification models as our baselines and present the evaluation results. Finally, we conclude this research with propositions for future work in Section VI.

II. MOTIVATIONS

UVMs have great market potential as they are suitable to be placed in most office buildings, residential communities, public transport hubs, etc. Importantly, the simplified purchase process in a UVM offers convenience to customers in a way that largely augments product sales. Customers only need to possess an account on a social media platform, such as WeChat, or download an app provided and maintained by the company that owns the particular UVM. The UVM opens when a customer follows the instructions of the official account or app. The customer can then select a desired product and take it out of the UVM. Payment is completed in the official account or app when the door of the UVM is closed. During this purchase process, an infrared laser sensor installed near the entrance of the UVM is usually utilized to detect the human hand. Simultaneously, a camera is triggered to take images. The collected images are subsequently uploaded to a cloud server, in which deployed models for classification or object detection determine the appropriate payment by identifying which items were taken out of the machine by the customer. A transaction settlement page is then generated and sent to the customer's mobile phone for payment.

Models of object detection and classification are key components in computer vision-based self-service systems for UVMs. In recent years, deep convolutional neural network (CNN)-based models [10] working under an end-to-end framework have achieved superior performance in both object detection and classification tasks due to their powerful generalization and feature learning capabilities. In addition, as a consequence of the rapid development of high-performance graphics processing units (GPUs), it has become feasible to deploy deep models as cloud services for real-time predictions.

Two typical processes exist for computer vision-based recognition strategies in UVMs: the static object detection process and the dynamic object classification process. Specifically, the static object detection process is defined by comparing the detection results of only two frames, which are captured by a camera when the infrared laser sensor detects that a user's hand has entered into and left the UVM, respectively. The dynamic object classification process is defined by predicting the categories of a set of frames recorded by a camera during the process in which a user's hand enters into and leaves the UVM, as detected by the infrared laser sensor equipped near the door of the UVM. For the static object detection process, only two images need to be uploaded to the cloud server, which is responsible for executing an algorithm to detect the product items in these two images and returning what item was taken out of the UVM by comparing the detection results of the two images. This can largely improve time performance, which enhances the user's purchase experience. However, a large amount of data covering different layout combinations of items on the shelves of UVMs must be labeled in order to ensure accurate detection. In contrast, the dynamic object classification process is simplified into a prediction task, in which a classifier is adopted to classify the set of images comprising a hand holding a product into the category to which the product belongs. In this situation, only images that contain different views of a single product and samples captured at the moment that a user's hand is holding a product need to be labeled to provide a well-trained classifier. A set of images from this dynamic purchase process, however, must be uploaded to the cloud server for classification. Consequently, time performance will be degraded due to the requirement for more computational power.

In this paper, motivated by the application of computer vision-based UVMs, we constructed two large-scale datasets considering the aforementioned two processes, i.e., the static object detection process and the dynamic object classification process, respectively. The datasets comprise 10 categories of beverages which are commonly sold in the Chinese retail market. Images were collected by using cameras installed in different positions of a refrigerator, which was used to mimic the real environment of a UVM.

III. RELATED TASKS

A. Static object detection

1) Static detection-based UVM: According to the procedure of the static recognition strategy, object detection-based deep learning models are used to determine what kind of beverage, and how many items, are taken out of the vending machine. Concretely, when a customer opens the door of a vending machine, the camera is triggered to take the first image. When the consumer takes the beverage(s) out of the container, the infrared laser sensor is triggered by detecting the hand of the consumer. At the same time, the camera takes the second image. These two images, which capture the states of the beverages in the container before and after a customer selects an item(s), are utilized to determine the selected beverage(s). According to the object detection results, we can obtain what kind of beverage, and how many products, were taken out by a customer. Thus, determining the number of beverages and the corresponding categories in the captured images by relying on an effective object detection algorithm becomes critical for this UVM solution.
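To make this comparison step concrete, the following Python sketch shows one possible way to infer the purchased items from the two detection results, assuming each detector output has already been reduced to a list of predicted category labels (one per detected bottle); it is an illustration of the idea rather than our deployed implementation.

```python
from collections import Counter

def items_taken(before_labels, after_labels):
    """Infer which products were removed by comparing the category labels
    detected in the 'before' image with those detected in the 'after' image.

    before_labels / after_labels: lists of predicted category names,
    e.g. ['A', 'A', 'C'], with one entry per detected bottle.
    Returns a dict mapping category -> number of items taken out.
    """
    # Counter subtraction keeps only positive differences, i.e., removals.
    # Swapping the arguments would instead reveal items that were put back.
    diff = Counter(before_labels) - Counter(after_labels)
    return dict(diff)

# Example: three bottles of A and one C before; two A and one C after.
print(items_taken(['A', 'A', 'A', 'C'], ['A', 'A', 'C']))  # {'A': 1}
```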

In the working procedure of a static object detection process during a purchasing event, it is worth noting that a user can repeatedly return a beverage and re-select other items. In our experiment, in order to simulate the purchasing procedure as realistically as possible, we developed a prototype of a smart vending machine as our experimental platform by integrating cameras and other sensors into a self-service retail refrigerator (see Fig. 1). Image collection for our proposed datasets was conducted in this experimental environment.

2) Existing datasets for object detection: In recent years, several natural scene image datasets, such as PASCAL VOC [11] and MSCOCO [12], have been widely used for benchmarking the performance of object detection models. The PASCAL VOC dataset [11] focuses on classification and detection. Current model evaluations are generally performed on VOC2007 and VOC2012. VOC2007 contains 20 categories, 9,963 images, and 24,640 annotated object instances. Since the test set of the VOC2012 dataset has not yet been published, only the training and validation sets, which consist of 11,540 images and 27,450 object instances, have been released. Objects in these images are mostly related to human beings. The MSCOCO (Common Objects in Context) dataset [12] collects a large number of general scene images containing 91 categories of common objects. In comparison with the PASCAL VOC dataset, images in MSCOCO contain natural scenes and targets that are more commonly seen in real life. Moreover, images with complex backgrounds and small objects are also included, which makes the MSCOCO dataset more challenging.

3) Object detection methods: Recently, CNN-based two-stage and one-stage detectors have continuously advanced object detection performance on several benchmark datasets. In 2014, Girshick et al. [13] proposed the R-CNN object detection method, which remains an important basis for research on deep learning-based object detection. However, a pre-trained CNN-based classification model has to be adopted to perform feature extraction for each generated candidate region separately. As a result, a large number of repeated operations exists, which limits the efficiency of the R-CNN method. In a later study, Girshick [14] proposed the improved Fast R-CNN method. Inspired by the idea of a multi-task loss function, Fast R-CNN combines classification loss and bounding box regression loss into a unified framework for end-to-end training. However, it still requires the selective search proposal generation algorithm [15] to generate positive and negative candidate bounding boxes, which separates the training process of the detector. In addition, the testing stage is quite time-consuming. In order to solve this problem, Ren et al. [16] proposed Faster R-CNN with an RPN (Region Proposal Network) module to assist the generation of objectness proposals. Instead of using traditional objectness proposal generation methods [15], [17], the anchor mechanism was introduced by setting a series of anchor boxes with multiple scales and aspect ratios at candidate locations for the RPN. The entire network can be trained in an end-to-end fashion.

The YOLO method, proposed by Redmon et al. [18], inherits the regression-based one-stage approach of the OverFeat algorithm [17]. The candidate bounding box extraction branch (proposal stage) is omitted in YOLO, and feature extraction, candidate bounding box regression, and classification are directly integrated into the same convolutional network. Thanks to this simplified network structure, detection speed can be up to 10 times faster than Faster R-CNN. However, the original YOLO detection method usually produces inaccurate position predictions and a lower recall rate in comparison with region proposal-based methods. To overcome these limitations, Redmon et al. [19] proposed the YOLOv2 model, which focuses on improving recall and the accuracy of predicted bounding box positions. YOLOv2 adopts a Darknet-19 network with batch normalization for feature extraction. The anchor mechanism of Faster R-CNN is introduced, and better anchor templates are calculated through a k-means clustering algorithm over the training set (as sketched below), which greatly improves the recall rate of the original YOLO algorithm [18]. Moreover, Redmon et al. [20] further proposed the YOLOv3 network, which includes a 53-layer network with residual blocks, i.e., Darknet-53. For accurate bounding box prediction, it integrates a feature pyramid network (FPN) structure [21] for fusing high- and low-level features. At the same time, Liu et al. [22] proposed the single-shot detector (SSD), which combines the regression idea of the YOLO model and the anchor mechanism of Faster R-CNN. Furthermore, Fu et al. [23] proposed the deconvolutional SSD (DSSD) method, which changes the backbone network in SSD from VGG-16 to ResNet-101, aiming at improving the feature extraction ability of the network. In this research, we evaluate the performance of YOLOv2 [19], YOLOv3 [20], Faster R-CNN [16], R-FCN [24], SSD [22], and DSSD [23] on our constructed dataset with respect to the static object detection task.
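As a concrete illustration of the anchor clustering mentioned above, the sketch below runs a k-means-style procedure over ground-truth box sizes with 1 - IoU as the distance, in the spirit of YOLOv2; the (width, height) input format, the number of clusters, and the toy data are illustrative assumptions rather than the original implementation.

```python
import numpy as np

def iou_wh(box, clusters):
    """IoU between one (w, h) box and k cluster (w, h) templates,
    treating all boxes as if they share the same top-left corner."""
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """Cluster (w, h) ground-truth boxes into k anchor templates,
    using 1 - IoU as the distance (YOLOv2-style anchor selection)."""
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)].copy()
    for _ in range(iters):
        # Assign each box to the cluster with the highest IoU.
        distances = np.array([1.0 - iou_wh(b, clusters) for b in boxes])
        assignment = distances.argmin(axis=1)
        # Update each cluster as the median (w, h) of its member boxes.
        for c in range(k):
            members = boxes[assignment == c]
            if len(members) > 0:
                clusters[c] = np.median(members, axis=0)
    return clusters

# Toy example with bottle-like (width, height) boxes in pixels.
boxes = np.array([[40, 120], [42, 118], [60, 60], [38, 125], [65, 58]], dtype=float)
print(kmeans_anchors(boxes, k=2))
```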

B. Dynamic object classification

1) Dynamic classification-based UVM: As mentioned above, dynamic object classification is another strategy implemented in real-world industrial applications. Concretely, during the process of dynamic recognition, images can be collected when a consumer takes an item out of the vending machine. In practice, a certain number of frames should be selected. In our implementation, an ordinary camera was installed on the top of the refrigerator near the door to capture images during a dynamic purchasing event. It is worth noting that a high-speed camera could be used in real-world applications, but it requires more computational power. Since the time span of this dynamic process is quite short for a participant taking a beverage out of the refrigerator, images extracted from the video frames sometimes appear blurry. According to the goal of this task, image-level classification is performed to directly determine which kind of beverage is being held by a consumer.

2) Dataset for classification: The ImageNet [25] dataset is widely applied in the field of computer vision for image classification. Indeed, it has become the benchmark dataset for performance evaluation of algorithms in the fields of image classification and object detection. The ImageNet dataset contains more than 14 million images, which cover more than 20,000 categories. Among them, more than one million images have category labels and bounding box annotations.

3) Image classification methods: The past decade has witnessed the rapid development of deep CNN-based models for image-level classification. In 2012, the winner of the ILSVRC competition was AlexNet [26], which achieved a Top-5 error rate of 15.3%. AlexNet is regarded as a milestone in the field of computer vision, as it demonstrated the power of deep learning for image classification. Subsequently, the winner of the 2014 competition was GoogLeNet [27] and the runner-up was VGGNet [28]. VGGNet simplifies the structure of CNNs by using 3×3 convolutional kernels and 2×2 max-pooling operations. On the other hand, the key component of GoogLeNet is its Inception module, which decouples cross-channel correlations and spatial correlations within a convolutional layer in three-dimensional space. In [29], a 152-layer network structure with residual blocks was proposed for multi-class object recognition. As ResNet adopts residual (skip) connections, it solves the problem in which a deep model produces a higher training error than its shallow counterpart as the network grows deeper. Based on ResNet, Huang et al. [30] proposed DenseNet, with dense connections, which aims at mining rich features by densely reusing high- and low-level features in each layer along the feed-forward path. In comparison with ResNet, DenseNet has fewer parameters and a certain regularization effect, which further alleviates the problems of vanishing gradients and model degradation. In this research, we test the performance of Inception v3 [31], Inception v4 [32], Inception-ResNet-v2 [32], and ResNet v2-152 [33] on our constructed dataset with respect to the dynamic object classification task.
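For readers unfamiliar with the residual (skip) connection discussed above, the following minimal sketch (written with the TensorFlow/Keras API, the framework used for our classification baselines in Table VII) shows an identity-shortcut block; it is a simplified illustration, not the exact block used in ResNet v2-152.

```python
import tensorflow as tf

def residual_block(x, filters=64):
    """A minimal identity-shortcut residual block: the input is added back
    to the output of two 3x3 convolutions, so the stacked layers only need
    to learn the residual mapping F(x) = H(x) - x."""
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.Add()([shortcut, y])   # the skip connection
    return tf.keras.layers.ReLU()(y)

# Toy usage: one residual block applied to a 64-channel feature map.
inputs = tf.keras.Input(shape=(56, 56, 64))
model = tf.keras.Model(inputs, residual_block(inputs))
```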

IV. OUR DATASET FOR UVM

A. Image collection

An experimental platform was developed by integrating four different kinds of cameras and other sensors into a self-service retail refrigerator. Image collection for both the detection and classification datasets was conducted in this experimental environment. For static dataset collection, three kinds of cameras for the object detection-oriented purchasing process, i.e., a Mimi Home smart camera, a high-definition (HD) USB industrial camera, and an SY019HD 1080P fisheye camera, were adopted and installed. Fig. 2 shows the experimental environment for static data collection, illustrating the locations of the cameras and the size of the refrigerator from the front view and the vertical view, respectively. The SY019HD 1080P fisheye camera was installed on the third shelf of the refrigerator, which is relatively low in height. As shown in Fig. 2, this camera is mounted ten centimeters above the tops of the bottles on the third shelf, considering that it has a much wider angle than the other two cameras. The other two cameras do not differ much in viewing angle, so they were placed on the first and second shelves.

Fig. 2. Illustration of our experimental platform for static data collection by integrating cameras into a self-service retail refrigerator from: (a) front view; (b) vertical view. (Cameras: Mimi Home smart camera, HD USB industrial camera, and SY019HD 1080P fisheye camera.)

In the process of collecting static detection data, three kinds of cameras were adopted. The first one is the Xiaomi Mijia smart camera, which can capture 1080p full HD video. Its highest frame rate is 20 frames per second (FPS), and its 130-degree ultra-wide-angle lens enables it to cover the entire interior space of the refrigerator.

Moreover, this camera can be wirelessly controlled through Wi-Fi, which is convenient for collecting images. The second camera is a 1080p HD USB industrial camera with a wide angle of 170 degrees. Its frame rate can reach 30 FPS, its size is 3.6 cm × 3.6 cm, and its lens thickness is 2.5 cm. The third camera is a fisheye camera. Images taken by this fisheye camera have a nearly spherical view, and its wide angle can reach 234 degrees, covering the scene of a container to a large extent. It also allows the container to hold more items in a limited space. This camera adopts a seven-band HD lens with a large aperture and a full copper data cable.

For collecting the dynamic classification dataset, another 1080p, 30 FPS HD USB camera was used. It was installed on the top of the refrigerator, as shown in Fig. 3, to record the dynamic process of a customer removing a beverage. The sensor of this camera, an OV2710, is a high-end CMOS sensor specially designed for 1080p video recording. It is 0.57 inches in size and has an aspect ratio of 16:9.

Fig. 3. Illustration of our experimental platform for dynamic data collection (a 1080P, 30 FPS HD USB camera mounted near the door) from: (a) front view; (b) side view; (c) vertical view.

B. Image annotation procedure

Considering that the illumination conditions and backgrounds of images captured by cameras in a UVM only change slightly, it is possible to deploy a computer vision-based UVM by using deep learning models with high accuracy. This kind of UVM possesses enormous potential in retail markets for selling fast-moving products, such as snacks, beverages, instant noodles, fruits, etc. In this research, we chose beverages as a representative product to build the dataset. Specifically, we selected 10 kinds of beverages from China's market and named them drinks A, B, C, and so on, as shown in Fig. 4.

In order to train the detection models, we manually labeled the category and bounding box for each object instance. In particular, for the static object detection process, the bounding box is a rectangular box that completely circumscribes the object in an image, rather than a polygon. For a properly placed beverage, a bounding box covers the contour of the bottle well. For some bottles at the boundary of an image or occluded by others, the annotated bounding box may not cover the region of the beverage well. However, such circumstances can be tolerated because they do not substantially affect the accuracy of bounding box prediction.


Fig. 4. Ten kinds of beverages (A, B, C, D, E, F, G, H, I, and J) for dataset collection.

It is worth noting that, in practice, the ultimate goal of the object detection model is to determine the category and number of object instances in an image. Thus, we did not invest additional effort in highly accurate bounding box coordinate prediction, i.e., accurate annotations for complicated situations such as occlusions, fallen-over objects, etc. For the dynamic classification dataset, only the category label of each image was manually annotated for training image-level classification models.

The process of dataset construction can be mainly divided into two sub-procedures: image collection and annotation. A portion of the images in the dataset was extracted from a series of video recordings. The other portion was collected after taking a beverage out of the container, simulating a real sales process. As a result, a total of 9,000 images were collected and manually annotated by 10 participants using the LabelImg tool6. In order to reduce the cost of manual labeling, a YOLOv2 model [19] was trained using these labeled images. This trained model was then used to predict the approximate locations of the bounding boxes covering the regions of bottles in an image in order to facilitate manual annotation. Bounding boxes and labels were then manually adjusted and annotated. For the dynamic classification dataset, 10 participants were asked to take beverages out of the simulated vending machine. Each participant was assigned to take two kinds of beverages. As a result, 20 videos were recorded. For each video, we selected one frame per second by using the OpenCV library. For dataset cleaning, images with a strong motion blur effect were manually filtered out. Finally, a total of 3,046 images covering the 10 kinds of beverages were selected and manually labeled.
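A minimal sketch of the frame-sampling step described above (keeping roughly one frame per second from each recorded video with OpenCV) is given below; the file paths and sampling rate are illustrative, and the actual tooling we used may differ in detail.

```python
import os
import cv2

def sample_frames(video_path, out_dir, frames_per_second=1):
    """Save roughly `frames_per_second` frames from a video by keeping one
    frame out of every (video_fps / frames_per_second) frames."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back to 30 FPS if unknown
    step = max(int(round(fps / frames_per_second)), 1)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Hypothetical usage:
# sample_frames("videos/participant01_drinkA.mp4", "frames/participant01_drinkA")
```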

In fact, image dataset annotation constitutes one of the difficulties in the practical application of deep learning in the field of computer vision. Manual image annotation for UVM applications is quite time-consuming and impractical due to the diverse categories of retail products. In the future, the development of a semi-automatic shooting and labeling hardware device is anticipated to facilitate the annotation process. Another feasible solution is to build a 3D model of a product and simulate the appearance of real product images by using computer graphics (CG) technologies.

6 https://github.com/tzutalin/labelImg

C. Dataset statistics

The organization of the static detection dataset is similar to that of the VOC2007 [11] dataset, and the images mainly have two resolutions, 1280 × 720 and 1920 × 1080. All of the images were taken by the three cameras (see Fig. 2). Some sample images and ground-truth bounding boxes are shown in Fig. 5. The entire dataset was divided into three subsets, i.e., the training, validation, and test sets. The training and validation sets can be combined into a trainval set for training models. A total of 34,052 images associated with ground-truth labels were included in this dataset. Specifically, the dataset contains 10 kinds of beverages, labeled A, B, C, and so on. Moreover, a total of 155,153 beverage instances were labeled with category labels and bounding boxes. For the dataset partition, the object distributions of each category in the training, validation, and test sets are shown in Table I.
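Since the dataset follows a VOC2007-style organization, each image is accompanied by a per-object annotation giving the category label and bounding box. The sketch below assumes standard PASCAL VOC XML annotation files (an <object> element containing <name> and <bndbox> children) and shows how such a record can be read; it is illustrative and not necessarily the exact loader used in our experiments.

```python
import xml.etree.ElementTree as ET

def load_voc_annotation(xml_path):
    """Parse a PASCAL VOC-style annotation file and return a list of
    (category, xmin, ymin, xmax, ymax) tuples, one per labeled bottle."""
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        coords = tuple(int(float(box.find(tag).text))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((name,) + coords)
    return objects

# Hypothetical usage: each entry looks like ('A', 512, 300, 590, 520).
# print(load_voc_annotation("Annotations/000001.xml"))
```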

TABLE I
SAMPLE DISTRIBUTIONS OF EACH CATEGORY OVER THE STATIC DETECTION DATASET (IMAGES INDICATES THE NUMBER OF IMAGES FOR THE CATEGORY, AND OBJECTS DENOTES THE NUMBER OF INSTANCES).

Beverage | train (images / objects) | validation (images / objects) | trainval (images / objects) | test (images / objects)
A | 4722 / 7968 | 4320 / 6775 | 9042 / 14743 | 1733 / 2947
D | 4582 / 6954 | 4616 / 7060 | 9198 / 14014 | 1807 / 3053
C | 5003 / 8117 | 4539 / 7016 | 9542 / 15133 | 1834 / 3076
H | 3127 / 5502 | 3262 / 5739 | 6389 / 11241 | 1561 / 2867
E | 3330 / 5839 | 3205 / 5683 | 6535 / 11522 | 1580 / 2817
I | 3156 / 5572 | 3159 / 5543 | 6315 / 11115 | 1459 / 2644
B | 4968 / 8203 | 4563 / 6960 | 9531 / 15163 | 1763 / 2951
G | 3220 / 5666 | 3225 / 5684 | 6445 / 11350 | 1439 / 2602
F | 3238 / 5706 | 3134 / 5526 | 6372 / 11232 | 1501 / 2731
J | 3181 / 5556 | 3195 / 5657 | 6376 / 11213 | 1518 / 2739

Fig. 5. Samples of ground-truth bounding box annotations for static beverage detection, from the Mimi Home smart camera, the HD USB industrial camera, and the SY019HD 1080P fisheye camera.

For dynamic recognition, there are 3,046 images in the dataset, including 2,460 images in the training set, 278 images in the validation set, and 309 images in the test set. In the recording process of a video, 10 kinds of beverages were placed in the refrigerator in advance, and then 10 participants were asked to take one beverage out of the refrigerator each time. For each participant, two videos were recorded when taking out two different kinds of beverages. As a result, a total of 20 videos were recorded. The frames of these videos were selected manually. For each participant, the process of taking a beverage lasts about 5 seconds, and each second contains 30 frames; as a result, a video includes 150 frames in total.


Fig. 6. Sample images collected for dynamic recognition.

We removed the blurry frames and manually kept the clear ones. Fig. 6 visually illustrates some frame samples collected in the dynamic classification process simulated by the 10 participants. It is worth noting that, since the time span is quite short when a participant takes a beverage out of the refrigerator, images extracted from the video frames may appear blurry. The detailed information of our datasets is shown in Table II and Table III. The constructed datasets are available at our website (www.dl2link.com).

TABLE II
STATIC DETECTION DATASET DESCRIPTION.

Data Set Characteristics: Multivariate | Number of Instances: 34,052 | Area: Life
Attribute Characteristics: Real | Number of Attributes: 10 | Date Collected: 2018/9/1
Associated Tasks: Detection | Missing Values?: N/A | Number of Web Hits: N/A

TABLE III
DYNAMIC RECOGNITION DATASET DESCRIPTION.

Data Set Characteristics: Multivariate | Number of Instances: 3,064 | Area: Life
Attribute Characteristics: Real | Number of Attributes: 10 | Date Collected: 2018/12/1
Associated Tasks: Classification | Missing Values?: N/A | Number of Web Hits: N/A

V. PERFORMANCE EVALUATIONS

A. Static detection

1) Evaluated models: Six state-of-the-art object detection models, namely YOLOv2 [19], YOLOv3 [20], Faster R-CNN [16], R-FCN [24], SSD [22], and DSSD [23], were trained on our static detection dataset. The implementation of each model is based on the latest official code released by its authors. The hyper-parameters used in training and testing each model are listed in Table IV. All of the deep CNN models involved in the experiment were implemented and trained on a machine with two GeForce GTX TITAN GPUs and 128 GB of memory.

TABLE IV
PARAMETER SETTINGS OF OBJECT DETECTION MODELS.

Models | YOLOv2 | YOLOv3 | Faster R-CNN | R-FCN | SSD | DSSD
Batch size | 64 | 64 | 1 | 1 | 32 | 8
Subdivision | 8 | 16 | - | - | - | -
Iterations | 120000 | 50200 | 50000 | 80000 | 120000 | 80000
Optimizer | SGD | SGD | SGD | SGD | SGD | SGD
Momentum | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9
Base learning rate | 0.001 | 0.001 | 0.001 | 0.001 | 0.0004 | 0.00004
Input size | 416x416 | 608x608 | 600x600 | 600x600 | 300x300 | 321x321
Backbone | Darknet19 | Darknet53 | VGG16 | ResNet50 | VGG16 | ResNet101
Framework | Darknet | Darknet | Caffe | Caffe | Caffe | Caffe

2) Evaluated metrics: In order to measure the performance of the detection models, average precision (AP) [11], [16], [20], a widely used metric, was calculated for each category in our experiment. AP can be conceptually regarded as the area under the precision-recall curve. In our implementation, we adopted the widely used 11-point method to calculate the AP value for each category [11]:

AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{\mathrm{interp}}(r), \quad (1)

p(r) = \frac{tp}{tp + fp}, \quad (2)

mAP = \frac{1}{m} \sum_{i=1}^{m} AP_i, \quad (3)

where p_{\mathrm{interp}}(r) = \max_{\tilde{r}: \tilde{r} \geq r} p(\tilde{r}) denotes the maximum precision over all recall values greater than or equal to r, tp denotes true positives, fp denotes false positives, and m denotes the number of categories. The mean of the AP values over all categories, mAP, ranging from 0 to 1, is adopted to assess the overall performance of a detection model. Additional details about AP and mAP can be found in [11].
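For reference, the 11-point interpolated AP of Eqs. (1) and (2) can be computed per category as in the following sketch, assuming the detections of that category have already been matched to ground-truth instances and sorted by descending confidence; it is a simplified illustration rather than the exact evaluation code.

```python
import numpy as np

def eleven_point_ap(tp_flags, num_gt):
    """11-point interpolated AP for one category.

    tp_flags: 0/1 sequence over detections sorted by descending confidence,
              where 1 marks a detection matched to an unmatched ground truth.
    num_gt:   number of ground-truth instances of this category.
    """
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / max(num_gt, 1)
    ap = 0.0
    for r in np.linspace(0, 1, 11):   # r in {0, 0.1, ..., 1}
        # p_interp(r): maximum precision at any recall >= r (Eq. (1)).
        p_interp = precision[recall >= r].max() if np.any(recall >= r) else 0.0
        ap += p_interp / 11.0
    return ap

# mAP (Eq. (3)) is the mean of the per-category AP values, e.g.:
# mAP = np.mean([eleven_point_ap(flags_c, n_gt_c) for flags_c, n_gt_c in per_class])
```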

However, in the static detection task on our dataset, we focus more on the accuracy of the category and number of beverages in an image than on the precision of the bounding box locations. Therefore, we adopted another metric, the success rate, to further measure the performance of the trained object detection models. The success rate is defined as the number of successful predictions divided by the number of test images. Here, a result produced by the detector is regarded as a successful prediction if and only if the model simultaneously predicts the correct categories and the correct number of instances. For example, given an image containing three bottles of beverage E and one bottle of beverage H, if the predictor reports two bottles of beverage E and two bottles of beverage H, the prediction is counted as incorrect.
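The success rate described above reduces to checking, for every test image, whether the multiset of predicted labels equals the multiset of ground-truth labels; a minimal sketch, assuming per-image label lists, is given below.

```python
from collections import Counter

def success_rate(predictions, ground_truths):
    """Fraction of test images whose predicted categories AND counts both
    match the ground truth exactly.

    predictions / ground_truths: lists of per-image label lists,
    e.g. predictions[i] = ['E', 'E', 'H'] for the i-th test image.
    """
    correct = sum(Counter(pred) == Counter(gt)
                  for pred, gt in zip(predictions, ground_truths))
    return correct / len(ground_truths)

# Example from the text: the image contains three E and one H, but the
# detector reports two E and two H, so the prediction counts as incorrect.
print(success_rate([['E', 'E', 'H', 'H']], [['E', 'E', 'E', 'H']]))  # 0.0
```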

3) Experimental results: In order to evaluate the performance of the state-of-the-art object detection methods on this dataset, we computed two metrics, i.e., mAP and the number of images with correct predictions in terms of both categories and number of instances. The comparative results are listed in Table V and Table VI, respectively.

TABLE V
QUANTITATIVE RESULTS IN TERMS OF AP IN COMPARISON OF STATE-OF-THE-ART OBJECT DETECTION MODELS.

Category | YOLOv2 | YOLOv3 | Faster R-CNN | R-FCN | SSD | DSSD
C | 0.9090 | 0.9090 | 0.9091 | 0.9090 | 0.9091 | 0.9090
D | 0.9091 | 0.9091 | 0.9089 | 0.9090 | 0.9090 | 0.9089
A | 0.9088 | 0.9091 | 0.9091 | 0.9091 | 0.9091 | 0.9090
B | 0.9091 | 0.9999 | 0.9091 | 0.9637 | 0.9992 | 0.9090
F | 0.9084 | 0.9090 | 0.9073 | 0.9088 | 0.9087 | 0.9085
H | 0.9082 | 0.9091 | 0.9062 | 0.9088 | 0.9088 | 0.9083
J | 0.9090 | 0.9089 | 0.9083 | 0.9086 | 0.9086 | 0.9086
I | 0.9077 | 0.9087 | 0.9078 | 0.9085 | 0.9085 | 0.9085
E | 0.9089 | 0.9091 | 0.9091 | 0.9090 | 0.9090 | 0.9089
G | 0.9090 | 0.9088 | 0.9088 | 0.9091 | 0.9089 | 0.9089
mAP | 0.9087 | 0.9181 | 0.9084 | 0.9143 | 0.9179 | 0.9088

The quantitative results shown in Table V reveal that the performances of all compared models in terms of AP and mAP values are quite similar.


TABLE VI
COMPARATIVE RESULTS BASED ON ACCURACY (THE BOUNDING BOX CONFIDENCE THRESHOLD IS SET TO 0.2).

Model | Correct number of images (5,361 in total) | Success rate
YOLOv2 | 4,681 | 87.32%
YOLOv3 | 5,251 | 97.95%
Faster R-CNN | 5,096 | 95.06%
R-FCN | 5,244 | 97.82%
SSD | 5,206 | 97.11%
DSSD | 5,134 | 95.77%

However, from the results listed in Table VI in terms of the correct number of images, YOLOv3 achieved relatively higher performance than the other models. Moreover, it takes approximately 0.0625 s to process an image with a resolution of 1920 × 1080 (i.e., about 16 FPS), indicating that YOLOv3 can meet the requirements of real-time detection.

B. Dynamic classification

1) Compared models: In this section, four state-of-the-art classification models, namely Inception v3 [31], Inception v4 [32], Inception-ResNet-v2 [32], and ResNet v2-152 [33], were selected for performance evaluation on dynamic classification. Compared with AlexNet [26] and VGG [28], Inception-v1 [27] has multiple branches and introduces 1×1 convolutions to reduce network computation. Inception-v2 then introduces a batch normalization (BN) layer and replaces the 5×5 convolutional kernel with two 3×3 convolutional kernels. Furthermore, Inception-v3 [31] factorizes the n×n convolutional kernel into two convolutional kernels of size 1×n and n×1. The generalization ability of Inception-v4 [32] and Inception-ResNet-v2 [32] is further improved by introducing the residual connections of ResNet. The hyper-parameters of each model for training were adopted from the original settings provided by the authors. Table VII presents the detailed settings for training each model.

TABLE VII
HYPER-PARAMETER SETTINGS FOR TRAINING CLASSIFICATION MODELS.

Models | Inception v3 | Inception v4 | Inception-ResNet-v2 | ResNet v2-152
Batch size | 32 | 32 | 32 | 32
Iterations | 13294 | 9761 | 9827 | 9739
Optimizer | RMSProp | RMSProp | RMSProp | RMSProp
Momentum | 0.9 | 0.9 | 0.9 | 0.9
Base learning rate | 0.01 | 0.01 | 0.01 | 0.01
Framework | TensorFlow | TensorFlow | TensorFlow | TensorFlow
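The convolution factorizations mentioned above can be written down directly; the following Keras sketch, purely illustrative and not taken from the evaluated models, shows a 5×5 convolution replaced by two stacked 3×3 convolutions and an n×n convolution factorized into 1×n followed by n×1.

```python
import tensorflow as tf

def stacked_3x3(x, filters):
    """Replace a single 5x5 convolution with two stacked 3x3 convolutions,
    which cover the same receptive field with fewer parameters."""
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(y)

def factorized_nxn(x, filters, n=7):
    """Approximate an n x n convolution with a 1 x n convolution followed by
    an n x 1 convolution (Inception-v3-style factorization)."""
    y = tf.keras.layers.Conv2D(filters, (1, n), padding="same", activation="relu")(x)
    return tf.keras.layers.Conv2D(filters, (n, 1), padding="same", activation="relu")(y)
```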

2) Evaluation metrics and experimental results: We adopted a widely used evaluation metric for classification models, i.e., Top-1 accuracy [25]. Top-1 accuracy refers to the proportion of samples for which the class with the highest predicted probability matches the actual label. Table VIII shows the quantitative results in terms of the Top-1 accuracy of each model. Note that most of the consecutive frames taken while a participant was removing a beverage from the refrigerator differ only slightly from each other. The mainstream classification models can thus achieve high Top-1 accuracy. In particular, ResNet v2-152 delivered the best performance with 99.67% Top-1 accuracy, indicating the potential of this model for real-world applications in industry.

TABLE VIII
QUANTITATIVE RESULTS IN TERMS OF TOP-1 ACCURACY FOR EACH MODEL.

Model | Top-1 Accuracy
Inception v3 | 0.9902
Inception v4 | 0.9902
ResNet v2-152 | 0.9967
Inception-ResNet-v2 | 0.9902

C. Discussion

Based on the experimental results for the above two tasks, it can be concluded that both static and dynamic purchasing logics are feasible in real-life applications of UVMs. Moreover, deep CNN-based models for object detection and image-level classification can be deployed in these demanding industrial applications with acceptable efficiency and accuracy. We summarize the comparative features of the object detection-based solution and the classification-based solution as follows:

(1) For the static detection process, images taken before and after a purchasing event are used to determine which kind of beverage, and how many items, were taken out of the vending machine by a customer. Object detection models can be directly applied to facilitate this determination by comparing the detection results. In contrast, for the dynamic classification process, it is necessary to analyze and classify a series of key frames from a fixed-length video recording that captures the process of a customer taking a beverage out of the container. This generally requires more computational time than static detection.

(2) In our experiment, deep CNN-based image-level classification models achieved high performance in terms of Top-1 accuracy. However, it is necessary to alleviate the strong motion blur in images in order to further improve performance. In practice, high-speed cameras may be utilized to extract clearer images when capturing the action of consumers taking beverages. However, the cost of a high-speed camera is relatively higher than that of an ordinary one.

(3) The static detection strategy allows a customer to take multiple beverages out of the vending machine at one time. However, dynamic classification models currently only allow a customer to take out one beverage at a time. Thus, further exploration of powerful detection models is needed to enable a user to take out multiple products at one time in the dynamic recognition mode. In addition, the accuracy of dynamic recognition declines when a consumer takes a beverage out of the container very quickly, due to the motion blur effect.

(4) For the static detection strategy, at least one camera should be installed for each layer in a UVM, whereas only one camera needs to be mounted on the top of a UVM near its door for the dynamic classification strategy.

VI. CONCLUSION

In this paper, we built two benchmark datasets that cover the static object detection-based and dynamic classification-based application scenarios for smart UVMs. Several state-of-the-art object detection models and classification models based on deep CNNs were trained and evaluated on these datasets for the two tasks, respectively. From the experimental results, it can be clearly seen that YOLOv3, as a solution for static object detection, achieves superior results in terms of mAP and success rate. Moreover, for recognition during the dynamic purchasing process, most of the classification models achieve promising results, indicating their potential for practical applications in industry. Certainly, real unmanned selling scenarios of a UVM are much more complicated than those examined in the present study. In future work, it would be worth extending the current datasets by collecting images with richer diversity, such as more product categories, irregular placement of products, and products with occlusion. At present, the examined deep learning models achieve up to approximately 99.67% accuracy in our experiments, and it is possible to further improve their performance in real-life unmanned selling scenarios. Besides collecting images with richer diversity and improving model structures, the combination of AI-based technologies with other automated sensing technologies, such as RFID, infrared sensors, and gravity sensors, can be adopted to meet the requirements of real-world industrial applications in future work.

REFERENCES

[1] PG Schierz, O Schilke, et al. Understanding consumer acceptance of mobile payment services: An empirical analysis. Electronic Commerce Research and Applications, 9(3):209–216, 2010.

[2] S Singh, N Singh. Internet of Things (IoT): Security challenges, business opportunities & reference architecture for e-commerce. In International Conference on Green Computing and Internet of Things, pages 1577–1581. IEEE, 2015.

[3] Y Zheng and Y Li. Unmanned retail’s distribution strategy based on sales forecasting. In International Conference on Logistics, Informatics and Service Sciences, pages 1–5, Aug 2018.

[4] R Thiebaut. AI revolution: How data can identify and shape consumer behavior in ecommerce. Entrepreneurship and Development in the 21st Century, Emerald Press, Forthcoming, 2018.

[5] KH Eom, L Sen, CW Lee, et al. The implementation of unmanned clothing stores management system using the smart RFID system. International Journal of u- and e-Service, Science and Technology, 6(2):77–88, 2013.

[6] YT Wei, JH Cho, JO Lee. Management framework based on a vending machine scenario. In International Conference on Hybrid Information Technology, pages 9–16. Springer, 2012.

[7] HS Ahn, IK Sa, YM Baek, JY Choi. Intelligent unmanned store service robot “Part Timer”. In Advances in Service Robotics. IntechOpen, 2008.

[8] B News. In faddish China, even glorified vending machines raise billions, 2018 [Online]. Available: https://www.bloomberg.com/news/articles/2018-05-20/in-faddish-china-even-glorified-vending-machines-raise-billions. Online; accessed 21-May-2018.

[9] Tencent YouTu Lab Announces Plan to Transform into Tencent’s Computer Vision Research Center and Strategic Partnership with Science Magazine, 2018 [Online]. Available: https://www.prnewswire.com/news-releases/tencent-youtu-lab-announces-plan-to-transform-into-tencents-computer-vision-research-center-and-strategic-partnership-with-science-magazine-300712125.html. Online; accessed 12-September-2018.

[10] Y LeCun, Y Bengio, and G Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[11] M Everingham, L Van Gool, CKI Williams, et al. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[12] TY Lin, M Maire, S Belongie, et al. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[13] R Girshick, J Donahue, T Darrell, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.

[14] R Girshick. Fast R-CNN. In Computer Vision and Pattern Recognition, pages 1440–1448, 2015.

[15] JRR Uijlings, KEA Van De Sande, T Gevers, et al. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.

[16] SQ Ren, KM He, R Girshick, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[17] P Sermanet, D Eigen, X Zhang, et al. OverFeat: Integrated recognition, localization and detection using convolutional networks. Computer Research Repository, abs/1312.6229, 2013.

[18] J Redmon, S Divvala, R Girshick, et al. You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition, pages 779–788, 2016.

[19] J Redmon, A Farhadi. YOLO9000: Better, faster, stronger. In Computer Vision and Pattern Recognition, pages 6517–6525, 2017.

[20] J Redmon, A Farhadi. YOLOv3: An incremental improvement. Computer Research Repository, abs/1804.02767, 2018.

[21] TY Lin, P Dollar, R Girshick, et al. Feature pyramid networks for object detection. In Computer Vision and Pattern Recognition, volume 1, page 4, 2017.

[22] W Liu, D Anguelov, D Erhan, et al. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[23] CY Fu, W Liu, A Ranga, et al. DSSD: Deconvolutional single shot detector. Computer Research Repository, abs/1701.06659, 2017.

[24] JF Dai, Y Li, KM He, et al. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.

[25] O Russakovsky, J Deng, H Su, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[26] A Krizhevsky, I Sutskever, GE Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[27] C Szegedy, W Liu, YQ Jia, et al. Going deeper with convolutions. In Computer Vision and Pattern Recognition, pages 1–9, 2015.

[28] K Simonyan, A Zisserman. Very deep convolutional networks for large-scale image recognition. Computer Research Repository, abs/1409.1556, 2014.

[29] KM He, XY Zhang, SQ Ren, et al. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pages 770–778, 2016.

[30] G Huang, Z Liu, L Van Der Maaten, et al. Densely connected convolutional networks. In Computer Vision and Pattern Recognition, pages 2261–2269. IEEE, 2017.

[31] C Szegedy, V Vanhoucke, S Ioffe, et al. Rethinking the Inception architecture for computer vision. In Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[32] C Szegedy, S Ioffe, V Vanhoucke, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Association for the Advancement of Artificial Intelligence, volume 4, page 12, 2017.

[33] KM He, XY Zhang, SQ Ren, et al. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

[34] H Gao, B Cheng, J Wang, et al. Object classification using CNN-based fusion of vision and LIDAR in autonomous vehicle environment. IEEE Transactions on Industrial Informatics, 14(12):4224–4231, 2018.

[35] A Filonenko, KH Jo. Unattended object identification for intelligent surveillance systems using sequence of dual background difference. IEEE Transactions on Industrial Informatics, 12(6):2247–2255, 2016.


Haijun Zhang (SM’19) received the B.Eng. and Master’s degrees from Northeastern University, Shenyang, China, and the Ph.D. degree from the Department of Electronic Engineering, City University of Hong Kong, Hong Kong, in 2004, 2007, and 2010, respectively. He was a Post-Doctoral Research Fellow with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, Canada, from 2010 to 2011. Since 2012, he has been with the Shenzhen Graduate School, Harbin Institute of Technology, China, where he is currently a Professor of Computer Science. His current research interests include multimedia data mining, machine learning, computational advertising, and service computing. Prof. Zhang is currently an Associate Editor of Neurocomputing, Neural Computing and Applications, and Pattern Analysis and Applications.

Donghai Li received the B.S. degree in computer science from Guangdong University of Technology, Guangzhou, China, in 2016. He was a Master’s candidate in computer engineering at the Harbin Institute of Technology Shenzhen Graduate School when this research was performed. He is now working at SenseTime in Shenzhen. His research interests include computer vision and its applications in smart cities.

Yuzhu Ji received the B.S. degree in computer science from PLA Information Engineering University, Zhengzhou, China, in 2012, and the M.S. degree in computer engineering from the Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China, in 2015, where he is currently pursuing the Ph.D. degree in computer science. His research interests include data mining, computer vision, image processing, and deep learning.

Haibin Zhou received the B.S. degree in Internet of Things Engineering from the Harbin Institute of Technology, Harbin, China, in 2019. He is currently a Master’s candidate in computer science at the Harbin Institute of Technology, Shenzhen, China. His research interests include computer vision and deep learning.

Weiwei Wu (M’14) received the BSc degree from South China University of Technology (SCUT, Computer Science) in 2006 and the PhD degree from City University of Hong Kong (CityU, Computer Science) and the University of Science and Technology of China (USTC) in 2011, and went to Nanyang Technological University (NTU, Mathematical Division, Singapore) for post-doctoral research in 2012. He is a Professor in the School of Computer Science and Engineering, Southeast University, P. R. China. His research interests include optimization and algorithm analysis, crowdsourcing, machine learning, game theory, wireless communications, and network economics.

Kai Liu (M’12) received his Ph.D. degree in computer science from the City University of Hong Kong in 2011. From December 2010 to May 2011, he was a Visiting Scholar with the Department of Computer Science, University of Virginia, USA. From 2011 to 2014, he was a Postdoctoral Fellow with Nanyang Technological University, Singapore, City University of Hong Kong, and Hong Kong Baptist University. He is currently an Assistant Professor with the College of Computer Science, Chongqing University, China. His research interests include Internet of Vehicles, mobile computing, pervasive computing, and big data.