
Simplifying Grocery Checkout with Deep Learning

Jing Ning
Department of Computer Science
Stanford University
[email protected]

Yongfeng Li
Department of Computer Science
Stanford University
[email protected]

Ajay Ramesh
Department of Computer Science
Stanford University
[email protected]

Abstract

Self-checkout systems are becoming increasingly popular in grocery stores. Most stores stick plastic bar-code tags onto products, especially fruits and vegetables, to help customers check out. This practice presents several problems regarding maintenance cost and the environment. In this paper, we propose an alternative: a Mask R-CNN based instance segmentation system that precisely classifies and localizes items at the pixel level. We also study the effect of various hyper-parameters on the performance of the system and present a breakdown of Average Precision results. The algorithm performs well on both object detection and instance segmentation, with an overall mean Average Precision (mAP) of 84% over the IoU interval [50:95]. Meanwhile, the detection time of the model is around 75 ms per image, which enables near-real-time detection.

1 Introduction

Our goal is to build a model that can improve the efficiency of the supermarket checkout process. In most grocery chains, stores stick plastic tags with identification numbers onto the surface of checkout items. This presents several problems: environmental sustainability, since significant plastic waste is generated; consumer health, since potentially harmful glue residue is left on sensitive items such as fruits and vegetables; the labor intensity of manual labelling; and the propensity of these tags to be easily damaged, which in turn degrades checkout efficiency.

A classic deep learning Convolutional Neural Network (CNN) can only classify an item at the image level [1]. More advanced semantic segmentation algorithms have limitations in separating different instances within an image. State-of-the-art object detection algorithms such as YOLO can localize items, but not at the pixel level [2]. These limitations motivate an alternative: a single computer-vision deep learning model that performs classification, detection, and segmentation at the instance level. We apply a state-of-the-art object-classification and instance-segmentation technique called Mask R-CNN; more details are given in the Methods section.

CS230: Deep Learning, Winter 2018, Stanford University, CA. (LaTeX template borrowed from NIPS 2017.)


2 Related work

Classic CNN: Horea Muresan et al. [3] applied a classic convolutional network to fruit detection. They used a four-layer convolutional network with a (5,5) filter size, increasing the number of filters from 4 to 64, with each convolutional layer followed by max pooling and then fully connected layers. They reported a test set accuracy of 94% over 60 different fruit classes. Frida Femling et al. [2] experimented with different architectures, including Inception v3 and MobileNet, and reported top-three test accuracies of 96% with Inception v3 and 97% with MobileNet. They were able to train and test their system with multiple instances represented in the same image.

Faster Region-based CNN: The authors of [4] proposed a method for detecting fruits using a Faster R-CNN model, achieving an F1 score above 0.8. Their network used a VGG-16 model with 13 convolutional layers followed by a Region Proposal Network (RPN). However, it only detected the presence of a fruit and did not classify the fruit itself.

While these approaches work well in their respective applications, our setting is uniquely challenging: the system must not only detect grocery items but also classify them accurately, and it must handle multiple instances of each item. Even with limited GPU resources and limited training iterations, our Mask R-CNN based system achieves accurate detection and instance segmentation.

3 Dataset and Features

The MVTec Densely Segmented Supermarket (D2S) dataset is a benchmark for instance-aware semantic segmentation in the industrial domain [5]. It consists of high-resolution (1920 x 1440) RGB images covering 60 unique grocery product classes on a simulated checkout belt. The dataset contains a total of 15,654 objects over 6 different backgrounds, captured under various lighting conditions and rotation angles to augment the dataset. Annotations provide class information, bounding boxes, and pixel-wise segmentation for all object instances in a format compliant with MS-COCO [6]. Bounding boxes are specified by four coordinate values, and masks are given as per-pixel boolean annotations [7].
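Because the D2S annotations follow the MS-COCO format, they can be read with standard COCO tooling. Below is a minimal sketch using pycocotools; the annotation file name is a hypothetical placeholder, not the exact file used in our experiments.

from pycocotools.coco import COCO

# Load D2S-style (MS-COCO format) annotations; the path is a placeholder.
coco = COCO("annotations/D2S_training.json")

# The 60 grocery product categories.
categories = coco.loadCats(coco.getCatIds())
print([c["name"] for c in categories][:5])

# For one image, fetch its annotations: class id, bounding box, and per-pixel mask.
img_id = coco.getImgIds()[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=[img_id])):
    class_id = ann["category_id"]
    x, y, w, h = ann["bbox"]          # four coordinate values per bounding box
    mask = coco.annToMask(ann)        # boolean mask with the image's height and width
    print(class_id, (x, y, w, h), mask.sum())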

Below is one example image and its pixel-level annotations from the MVTec D2S dataset.

(a) Images with pixel-level annotations and masks (b) Images with annotations (c) Class distribution

Figure 1: Examples from the MVTec D2S dataset and its annotations

We shuffled and selected a subset of 3,600 images from the dataset and split them into train/validation/test sets with an 80:10:10 ratio, resulting in a 2,880/360/360 image split, respectively. We normalized all images to zero mean based on training set statistics. We set the mini-batch size to two images based on the characteristics of our single-GPU (NVIDIA K80) machine.
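The split itself is straightforward; the sketch below illustrates it, with all_image_ids standing in for the 3,600 selected D2S image IDs (a placeholder variable).

import random

all_image_ids = list(range(3600))      # placeholder for the selected D2S image IDs
random.seed(0)
random.shuffle(all_image_ids)

n = len(all_image_ids)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train_ids = all_image_ids[:n_train]                    # 2880 images
val_ids   = all_image_ids[n_train:n_train + n_val]     # 360 images
test_ids  = all_image_ids[n_train + n_val:]            # 360 images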

4 Methods

Here we describe how Mask R-CNN performs instance segmentation with high accuracy. The method was originally proposed by Kaiming He et al. of Facebook AI Research (FAIR) [8]. We use a reference GitHub repository from Matterport in our implementation [9].

Mask R-CNN is an extension of Faster R-CNN [10]. Faster R-CNN predicts bounding boxes, and Mask R-CNN essentially adds a branch to predict an object mask [11]. The additional mask output is distinct from the class and box outputs, requiring extraction of a much finer spatial layout of the object. Mask R-CNN is considered the state of the art for instance segmentation [10] and has proven useful across many industries and domains.

Below we show the high-level architecture of Mask R-CNN and describe its major components:

Figure 2: Mask R-CNN Architecture

• Backbone model: a standard convolutional neural network (such as MobileNet, ResNet-50, or ResNet-101) that works as a feature extractor over the entire image. For example, it turns a 1024x1024x3 RGB image into a 32x32x2048 feature map that serves as the input to the subsequent layers.

• Region Proposal Network (RPN): Using regions defined by as many as two hundred thousand anchor boxes, the RPN scans each region and predicts whether it contains an object. To its advantage, the RPN does not scan the actual image, but the feature maps. A Feature Pyramid Network (FPN) provides multiple feature-map pathways, including bottom-up, top-down, and lateral connections, which allows the network to output proportionally sized feature maps at multiple levels from a single-scale image input. This stage outputs several Regions of Interest (RoIs).

• Region of Interest Classification and Bounding Box: This step takes as its input the RoIs proposed by the RPN and outputs a classification (softmax) and a bounding box (regressor). RoIAlign is introduced to handle the differences in RoI size: feature maps are cropped and resized to a fixed shape (see the sketch after this list).

• Segmentation Masks: In the final step, the algorithm takes the positive RoI regions and outputs 28x28 pixel masks with float values representing the detected objects. During training, the ground-truth masks are scaled down to this resolution; at inference time, the predicted masks are scaled up to the size of the RoI bounding box.
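To make the RoI feature extraction step concrete, the sketch below crops each proposed region out of a backbone feature map and bilinearly resizes it to a fixed size. This is an approximation of RoIAlign (the Matterport implementation uses a similar crop-and-resize on pyramid feature maps); the tensor shapes are illustrative only.

import tensorflow as tf

feature_map = tf.random.normal([1, 32, 32, 2048])      # backbone output for one image
rois = tf.constant([[0.10, 0.20, 0.55, 0.70],          # normalized [y1, x1, y2, x2]
                    [0.30, 0.05, 0.90, 0.40]])
box_indices = tf.zeros([2], dtype=tf.int32)            # both RoIs belong to image 0

# 7x7 pooled features feed the class/box heads; 14x14 would feed the mask head.
pooled = tf.image.crop_and_resize(feature_map, rois, box_indices, crop_size=[7, 7])
print(pooled.shape)                                    # (2, 7, 7, 2048)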

Loss function on each sampled RoI: During training, Mask R-CNN defines a multi-task loss on each sampled RoI as follows:

$$L = L_{cls} + L_{box} + L_{mask}$$

The classification loss $L_{cls}$ and bounding-box loss $L_{box}$ are identical to those defined in Faster R-CNN. Specifically, the classification loss is $L_{cls}(p, u) = -\log p_u$, where $p = (p_0, \ldots, p_K)$ is the discrete probability distribution (per RoI) over the $K + 1$ categories; $p$ is computed by a softmax over the $K + 1$ outputs of a fully connected layer. The bounding-box loss $L_{box}$ is defined over the tuple of true bounding-box regression targets for class $u$, $v = (v_x, v_y, v_w, v_h)$, and the predicted tuple $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, as follows [4]:

$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i),$$

where

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

is a robust $L_1$ loss that is less sensitive to outliers than the $L_2$ loss. When regression targets are unbounded, training with the $L_2$ loss can require careful tuning of learning rates to prevent exploding gradients.
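The sketch below shows this robust loss in NumPy, summed over the four box coordinates as in the equation above; the numbers are illustrative.

import numpy as np

def smooth_l1(x):
    # Quadratic near zero, linear beyond |x| = 1, so large errors are penalized less than with L2.
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def box_loss(t_u, v):
    # L_loc: sum of smooth L1 over the (x, y, w, h) regression deltas.
    return np.sum(smooth_l1(np.asarray(t_u, dtype=float) - np.asarray(v, dtype=float)))

print(box_loss([0.1, -0.2, 1.5, 0.0], [0.0, 0.0, 0.0, 0.0]))   # 0.005 + 0.02 + 1.0 + 0.0 = 1.025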


$$L_{mask} = -\frac{1}{m^2} \sum_{1 \le i,j \le m} \left[ y_{ij} \log y^k_{ij} + (1 - y_{ij}) \log (1 - y^k_{ij}) \right]$$

The mask branch has a $K m^2$-dimensional output for each RoI, encoding $K$ binary masks of resolution $m \times m$, one for each of the $K$ classes. Mask R-CNN applies a per-pixel sigmoid to this output and defines $L_{mask}$ as the average binary cross-entropy loss. For an RoI associated with ground-truth class $k$, $L_{mask}$ is only defined on the $k$-th mask; the other mask outputs do not contribute to the loss. This definition of $L_{mask}$ allows the network to generate masks for every class without competition among classes.
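A minimal NumPy sketch of this mask loss is given below: a per-pixel binary cross-entropy averaged over an m x m mask, computed only on the channel of the ground-truth class k (the random inputs are placeholders).

import numpy as np

def mask_loss(pred_masks, gt_mask, k, eps=1e-7):
    # pred_masks: (m, m, K) sigmoid outputs; gt_mask: (m, m) binary; k: ground-truth class index.
    p = np.clip(pred_masks[:, :, k], eps, 1.0 - eps)   # only the k-th mask contributes
    bce = -(gt_mask * np.log(p) + (1.0 - gt_mask) * np.log(1.0 - p))
    return bce.mean()                                  # average over the m^2 pixels

m, K = 28, 60
pred = np.random.uniform(size=(m, m, K))
gt = (np.random.uniform(size=(m, m)) > 0.5).astype(np.float64)
print(mask_loss(pred, gt, k=3))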

5 Experiments/Results/Discussion

We evaluate our system with substantial experimentation and report its performance using the standard MS-COCO metrics, including mean Average Precision (mAP) at various Intersection over Union (IoU) thresholds and object scales [6]. For validation iterations we use a single metric: the Average Precision over the test dataset at IoU thresholds in the range [50:90]. All validation models are trained for just 10 epochs due to limited GPU time budgets.
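In practice these metrics can be computed with the COCO evaluation API once predictions are exported in COCO result format; the sketch below assumes hypothetical file names for the ground truth and detections.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/D2S_test.json")                       # ground truth (placeholder path)
coco_dt = coco_gt.loadRes("results/mask_rcnn_test_results.json")  # model output (placeholder path)

coco_eval = COCOeval(coco_gt, coco_dt, iouType="segm")   # use iouType="bbox" for detection AP
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()    # prints AP averaged over IoU 0.50:0.95, plus AP50, AP75 and per-scale AP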

To measure baseline performance, we set up a model pre-trained with MS-COCO weights and further trained all of its layers for 10 epochs. This achieved a baseline mAP of 0.239 over the IoU interval [50:90].

To improve model performance, we then performed an extensive hyper-parameter exploration using a coarse-to-fine search. First, we compared optimization algorithms, including Adam, RMSprop, and Stochastic Gradient Descent with momentum (SGDm). We found that Adam required much lower learning rates, which elongated training times. Using running time (< 2 s/step) as a satisficing metric, we picked SGDm.

With the optimization algorithm fixed, we then tuned the learning rate, sampling the system's performance on a log scale using $10^r$ with $r$ sampled from $(-6, -1)$. After a few such iterations, we narrowed the learning rate down to the range $(0.001, 0.01)$, within which we performed a random search.
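The sampling itself can be as simple as the sketch below (the sample counts are illustrative).

import numpy as np

rng = np.random.default_rng(0)
coarse_lrs = 10.0 ** rng.uniform(-6, -1, size=8)   # coarse search: r in (-6, -1), lr = 10^r
fine_lrs = rng.uniform(0.001, 0.01, size=8)        # fine random search in the narrowed range
print(np.sort(coarse_lrs))
print(np.sort(fine_lrs))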

With a final learning rate of 0.002 and a weight decay of 0.0001, we improved on our baseline by 28%. We also analyzed the system's performance with different backbone architectures and found that ResNet-50 worked well for our application, providing good training performance with little accuracy degradation.
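Putting the chosen hyper-parameters together, the sketch below shows a possible training configuration written against the Matterport Mask R-CNN API [9]. The weight file path and the D2S dataset objects (dataset_train, dataset_val) are placeholders; attribute and argument names follow that repository.

from mrcnn.config import Config
from mrcnn import model as modellib

class D2SConfig(Config):
    NAME = "d2s"
    IMAGES_PER_GPU = 2          # mini-batch of two images on a single K80
    NUM_CLASSES = 1 + 60        # background + 60 grocery product classes
    BACKBONE = "resnet50"
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512
    LEARNING_RATE = 0.002       # tuned value; the repo uses SGD with momentum by default
    WEIGHT_DECAY = 0.0001

config = D2SConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs/")
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])   # re-initialize the heads for 61 classes

# "layers" selects which layers are re-trained: "heads", "3+", "4+", or "all" (cf. Table 3).
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE, epochs=10, layers="3+")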

Below we show some important details of the parameter tuning process, covering the backbone architecture, image size, and the transfer-learning choice of which layers to tune.

Table 1: Results for backbone architecture

Backbone     AP50:95  AP50   AP75   APM    APL
ResNet-50    0.753    0.962  0.962  0.314  0.756
ResNet-101   0.788    0.957  0.924  0.291  0.791

Table 2: Results for image size

Image Size   AP50:95  AP50   AP75   APM    APL
800x1024     0.837    0.979  0.979  0.61   0.839
512x512      0.753    0.962  0.962  0.314  0.756
256x256      0.04     0.072  0.042  0      0.058

Table 3: Results for layers to tune

Layers to tune  AP50:95  AP50   AP75   APM    APL
heads           0.625    0.946  0.764  0.147  0.628
3+              0.84     0.98   0.97   0.451  0.843
4+              0.795    0.976  0.958  0.441  0.798
all             0.753    0.962  0.962  0.314  0.756

• Architecture: After a full exploration of the hyper-parameters, we picked the best values and then closely monitored the training process. Network performance improved slightly with architecture depth, as shown in Table 1.

• Image Size: We explored the effect of image resolution on training accuracy and observed that mAP increases with image resolution, as shown in Table 2. To balance training time and accuracy, we chose a resolution of 512 x 512, which produced good results with reasonable training time.

• Layers to re-train: Because the labelled supermarket data available for training is limited, we tested different training depths, applying transfer learning and freezing the remaining layers (Table 3). We are able to predict the supermarket item class at the instance level, and the pixel-level masks work even with a certain degree of occlusion. The algorithm also handles different background settings and various lighting conditions. The figure below shows some results from our test set.

Figure 3: Prediction on test set
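For reference, predictions such as those in Figure 3 can be produced with an inference-mode model; the sketch below again assumes the Matterport API [9], with placeholder paths and a placeholder class_names list of the 61 class labels.

import skimage.io
from mrcnn import model as modellib, visualize

class InferenceConfig(D2SConfig):     # D2SConfig as defined in the earlier training sketch
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(), model_dir="logs/")
model.load_weights("logs/mask_rcnn_d2s.h5", by_name=True)   # placeholder weights path

image = skimage.io.imread("test_images/example.jpg")        # placeholder test image
r = model.detect([image], verbose=0)[0]                     # roughly 75 ms per image in our setup
visualize.display_instances(image, r["rois"], r["masks"], r["class_ids"],
                            class_names, r["scores"])       # class_names: placeholder label list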

We achieved our best results by tuning the '3+' layers, yielding an mAP of 0.84 over IoU 50:90. As can be seen in Figure 3, the model works well even with some occlusion. To visualize what the model has learned, we inspected feature maps from several intermediate convolutional layers, where activations concentrate on uniquely identifying areas of each object. We also conducted error analysis of the per-class mAP against object count, as shown in Figure 5; Carrot, Adelholzener water, and Kamillentee tea have the lowest mAP. The errors appear to be related to large size differences among objects of the same category and to object occlusion that leaves little of the label visible.

Figure 4: Feature maps from ResNet layers

Figure 5: (a) Error analysis on each class (b) Training loss (c) Validation loss

6 Conclusion/Future Work

In conclusion, we explored the hyper-parameter space within our computation budget and found parameters that fit this dataset well. The model achieved an mAP of 84% and a mean Average Recall (mAR) of 81% over the IoU interval [50:90] on the test set, surpassing our baseline model by a large margin.

For future work, we would like to expand the model's application to similar tasks such as fruit and vegetable sorting, quality control, and other classification tasks. We could also make the model more robust by increasing the complexity of the dataset (albeit at a much increased computation cost), for example by collecting more images taken from different angles.


7 Contributions

Development was largely a collaborative effort with all team members working together and equallyon research, coding, model parameter tuning and writing.

We sincerely thank the teaching staff of CS230: Deep Learning, with special thanks to Zahra Koochakfor her immense help in every step of the project - from the project proposal to the final report.

Code has been made available on Fruit Detection Github Repo.

References

[1] Jiuxiang Gu et al. "Recent Advances in Convolutional Neural Networks". In: CoRR abs/1512.07108 (2015). arXiv: 1512.07108. URL: http://arxiv.org/abs/1512.07108.

[2] Joseph Redmon and Ali Farhadi. "YOLOv3: An Incremental Improvement". In: CoRR abs/1804.02767 (2018). arXiv: 1804.02767. URL: http://arxiv.org/abs/1804.02767.

[3] Horea Muresan and Mihai Oltean. "Fruit recognition from images using deep learning". In: CoRR abs/1712.00580 (2017). arXiv: 1712.00580. URL: http://arxiv.org/abs/1712.00580.

[4] Ross B. Girshick. "Fast R-CNN". In: CoRR abs/1504.08083 (2015). arXiv: 1504.08083. URL: http://arxiv.org/abs/1504.08083.

[5] Patrick Follmann et al. "MVTec D2S: Densely Segmented Supermarket Dataset". In: CoRR abs/1804.08292 (2018). arXiv: 1804.08292. URL: http://arxiv.org/abs/1804.08292.

[6] Tsung-Yi Lin et al. "Microsoft COCO: Common Objects in Context". In: CoRR abs/1405.0312 (2014). arXiv: 1405.0312. URL: http://arxiv.org/abs/1405.0312.

[7] Patrick Follmann et al. "Learning to See the Invisible: End-to-End Trainable Amodal Instance Segmentation". In: CoRR abs/1804.08864 (2018). arXiv: 1804.08864. URL: http://arxiv.org/abs/1804.08864.

[8] Kaiming He et al. "Mask R-CNN". In: CoRR abs/1703.06870 (2017). arXiv: 1703.06870. URL: http://arxiv.org/abs/1703.06870.

[9] Waleed Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN. 2017.

[10] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark. Accessed: Fall 2019. 2018.

[11] Shaoqing Ren et al. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". In: CoRR abs/1506.01497 (2015). arXiv: 1506.01497. URL: http://arxiv.org/abs/1506.01497.
