Object Detection
R. Q. FEITOSA
Content
1. Computer Vision Tasks
2. Data Sets for Object Detection
3. Performance Metrics
4. YOLO
5. Mask R-CNN
Computer Vision Tasks
Image Classification · Object Localization · Semantic Segmentation · Object Detection · Instance Segmentation
Computer Vision Tasks
Image Classification: assigns a label (e.g., CAT) to an image.
Computer Vision Tasks
Object Localization: locates the single object in the image and indicates its location with a bounding box.
Computer Vision Tasks
Object Detection: locates multiple (if any) instances of object classes with bounding boxes.
Computer Vision Tasks
Semantic Segmentation: assigns a class label (e.g., CAT, ROAD, VEGETATION) to each individual pixel, without separating instances.
Computer Vision Tasks
Instance Segmentation: identifies each instance of each object class at the pixel level.
(In the diagram: object localization and detection are region based; the segmentation tasks are pixel based. Semantic segmentation recognizes classes but ignores instances; instance segmentation recognizes both classes and instances.)
Datasets
• PASCAL Visual Object Classes (PASCAL VOC)
• 8 different challenges held between 2005 and 2012
• Around 10,000 images of 20 categories
Source: Everingham et al., 2012, Visual Object Classes Challenge 2012 Dataset
(sample images of PASCAL VOC)
Datasets
• Common Objects in COntext (COCO) – T.-Y. Lin et al. (2015)
• Multiple challenges
• Around 160,000 images of 80 categories
Source: Arthur Ouaknine, Review of Deep Learning Algorithms for Object Detection (2018)
(sample images of COCO)
Performance Metrics

Precision = true positives / (true positives + false positives)

Recall = true positives / (true positives + false negatives)

Intersection over Union (IoU) = area of overlap / area of union, computed between the predicted (outcome) box and the reference box. A prediction counts as a true positive, false positive, or false negative by comparing the outcome against the reference.

Average Precision (AP) = precision averaged over all recall levels (for one class).
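As a concrete illustration of these definitions, here is a minimal sketch in Python (the function names and the (x1, y1, x2, y2) box format are my own assumptions, not from the slides):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    return tp / (tp + fp), tp / (tp + fn)
```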
Performance Metrics
Most competitions use mean average precision (mAP) as the principal metric for evaluating object detectors, though definitions and implementations vary.
Performance Metrics – Precision-recall curve
The confidence score is the probability that an anchor box contains an object. It is usually predicted by a classifier.
As the confidence-score threshold decreases,
• recall increases monotonically;
• precision can go up and down, but the general tendency is to decrease.
Source: Nickzeng (2018)
Performance Metrics – Precision-recall curve – Average Precision
The average precision (AP) is the precision averaged across all unique recall levels.
It is based on the interpolated precision-recall curve.
The interpolated precision p_interp at a certain recall level r is defined as the highest precision found for any recall level r′ ≥ r:
p_interp(r) = max_{r′ ≥ r} p(r′)
Source: Nickzeng (2018)
Performance Metrics – Mean average precision (mAP)
The calculation of AP involves one class.
However, in object detection, there are usually K > 1 classes.
Mean average precision (mAP) is defined as the mean of AP across all K classes:
mAP = (1/K) Σ_{i=1}^{K} AP_i
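The interpolated-precision AP and the mAP above can be sketched as follows (the function names and the use of NumPy are assumptions for illustration):

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP as the area under the interpolated precision-recall curve,
    using p_interp(r) = max over r' >= r of p(r').
    recalls must be sorted in ascending order."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # interpolate: each precision becomes the max precision at any higher recall
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # sum the rectangles where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP = mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```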
YOLO History

Date | Event
May 2016 | Joseph Redmon's paper "You Only Look Once: Unified, Real-Time Object Detection" introduced YOLO.
December 2017 | Redmon introduced YOLOv2 in the paper "YOLO9000: Better, Faster, Stronger".
April 2018 | Redmon published "YOLOv3: An Incremental Improvement".
February 2020 | Redmon announced he was stepping away from computer vision.
April 23, 2020 | Alexey Bochkovskiy published YOLOv4.
May 29, 2020 | Glenn Jocher created a repository called YOLOv5.
Joseph Redmon
YOLOv3
Source: Redmon & Farhadi, 2018, "YOLOv3: An Incremental Improvement"
YOLOv3
Source: Sik-Ho Tsang, Review: YOLOv3 – You Only Look Once (Object Detection)
Example of Object Detection
Training set: pairs (x, y), where y = 1 if the image contains the object and y = 0 otherwise.
A CNN maps the image x to the output y ∈ {0, 1}.
Sliding Window Detection
A CNN classifies the content of a window that slides over the image, at several window sizes.
Problem: high computational cost!
Accurate Bounding Boxes
YOLO Basics
Source: Redmon, et al., 2016, You Only Look Once: Unified, Real-Time Object Detection

The image x (n × n × 3) is divided into a 3 × 3 grid. A single CNN (conv and max-pool layers) maps x to an output y of shape 3 × 3 × 8.

Labels for training – for each grid cell:
y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)ᵀ
• cell with no object: y = (0, ?, ?, ?, ?, ?, ?, ?)ᵀ
• cell with a car (class 2): y = (1, b_x, b_y, b_h, b_w, 0, 1, 0)ᵀ

Remarks:
• It indicates whether there is an object in each grid cell (p_c), and of which class (c_1, c_2, c_3).
• It outputs the bounding box coordinates (b_x, b_y, b_h, b_w) for each grid cell.
• With finer grids you reduce the chance of multiple objects falling in the same grid cell.
• The object is assigned to just one grid cell – the one that contains its midpoint – no matter whether the object extends beyond the cell limits.
• A single CNN accounts for all objects in the image → efficient.
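The label construction described in the remarks can be sketched as follows (a hypothetical helper for the 3×3-grid, 3-class example; the convention that b_h and b_w are expressed in cell units is an assumption):

```python
import numpy as np

def build_yolo_target(objects, S=3, C=3):
    """Build the S x S x (5 + C) training label tensor.
    objects: list of (bx, by, bh, bw, class_id), with bx, by in [0, 1]
    relative to the whole image, and class_id 0-based.
    Each object goes to the single cell containing its midpoint."""
    y = np.zeros((S, S, 5 + C))
    for bx, by, bh, bw, cls in objects:
        # cell indices of the midpoint
        col, row = min(int(bx * S), S - 1), min(int(by * S), S - 1)
        # midpoint coordinates relative to the cell, so they fall in [0, 1]
        cell_x, cell_y = bx * S - col, by * S - row
        # height/width in cell units, so they may exceed 1
        y[row, col, :5] = [1.0, cell_x, cell_y, bh * S, bw * S]
        y[row, col, 5 + cls] = 1.0
    return y
```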
Parametrization of box coordinates
Source: Redmon, et al., 2016, You Only Look Once: Unified, Real-Time Object Detection

The cell's top-left corner is (0, 0) and its bottom-right corner is (1, 1).
Example: y = (1, 0.2, 0.4, 0.9, 1.5, 0, 1, 0)ᵀ
• b_x, b_y (here 0.2, 0.4) lie between 0 and 1: the midpoint falls inside the cell.
• b_h, b_w (here 0.9, 1.5) are measured relative to the cell size and can be > 1: the box may extend beyond the cell.
Remark: actually, there are other parametrizations, but this one is enough for the time being.
Overlapping boxes
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg
Non-maximum suppression
Often, some of the generated boxes overlap (e.g., boxes with p_c = 0.9, 0.8, 0.7, 0.6 around the same object).
Algorithm:
Discard all boxes with p_c ≤ 0.6.
While there are boxes:
• Pick the box with the largest p_c and output it as a prediction.
• Discard any remaining box with IoU ≥ 0.5 with the box output in the previous step.
The algorithm must run separately for each class.
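The algorithm above can be sketched directly in Python (the function name, the (x1, y1, x2, y2) box format, and the exact threshold comparisons are illustrative assumptions):

```python
def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """boxes: list of (x1, y1, x2, y2); scores: confidence p_c per box.
    Returns indices of the boxes kept. Run once per class."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
        return inter / (area(a) + area(b) - inter)

    # discard low-confidence boxes, sort the rest by descending confidence
    candidates = [i for i, s in enumerate(scores) if s > score_thresh]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while candidates:
        best = candidates.pop(0)          # highest remaining p_c
        kept.append(best)
        # drop boxes overlapping too much with the chosen one
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_thresh]
    return kept
```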
Dealing with Overlapping Objects
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

When two objects have their midpoints in the same grid cell, two anchor boxes of different shapes are used, and the output vector stacks one prediction per anchor:
y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3 | p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)ᵀ
      (anchor box 1)                         (anchor box 2)
And if the cell contains a car only? The anchor box whose shape does not match the car gets p_c = 0:
y = (0, ?, ?, ?, ?, ?, ?, ? | 1, b_x, b_y, b_h, b_w, 0, 1, 0)ᵀ
Anchor box algorithm
Previously: each object in the training image was assigned to the grid cell that contains that object's midpoint. In our example so far, output y: 3×3×8.
With two anchor boxes: each object in the training image is assigned to the grid cell that contains the object's midpoint and to the anchor box with the highest IoU with the object. In our example so far, output y: 3×3×16.
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg
Selecting Anchor Boxes
1. Choose by hand, guided by the frequency of shapes in your application.
2. Use a clustering algorithm to find representative shapes.

Figure 2: Clustering box dimensions on VOC and COCO. We run k-means clustering on the dimensions of bounding boxes to get good priors for our model. The left image shows the average IOU we get with various choices for k. We find that k = 5 gives a good tradeoff for recall vs. complexity of the model. The right image shows the relative centroids for VOC and COCO. Both sets of priors favor thinner, taller boxes while COCO has greater variation in size than VOC.
Source: Redmon, J. & Farhadi, A., 2016, YOLO9000: Better, Faster, Stronger
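Option 2 (clustering) can be sketched as a k-means on box dimensions that uses the 1 − IoU distance from the YOLO9000 paper; the implementation details below (initialization, iteration count, the `kmeans_anchors` name) are assumptions:

```python
import numpy as np

def kmeans_anchors(wh, k=5, iters=50, seed=0):
    """Cluster (width, height) pairs into k anchor shapes.
    Distance is 1 - IoU with boxes aligned at the origin, so assignment
    maximizes IoU with a centroid. wh: float array of shape (N, 2)."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # IoU between every box and every centroid (origin-aligned boxes)
        inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
                np.minimum(wh[:, None, 1], centroids[None, :, 1])
        union = wh[:, None, 0] * wh[:, None, 1] + \
                centroids[None, :, 0] * centroids[None, :, 1] - inter
        assign = np.argmax(inter / union, axis=1)   # i.e., min (1 - IoU)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = wh[assign == j].mean(axis=0)
    return centroids
```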
YOLO – wrap up: Training
Classes: 1. Pedestrian, 2. Car, 3. Motorcycle
For each grid cell, the label stacks the two anchor-box vectors (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3):
• empty cell: both anchors get (0, ?, ?, ?, ?, ?, ?, ?)
• cell with a car matching anchor box 2: (0, ?, ?, ?, ?, ?, ?, ? | 1, b_x, b_y, b_h, b_w, 0, 1, 0)
In our example, y: 3 × 3 × 2 × 8 — grid cells × #anchors × (5 + #classes).
The CNN maps the n × n × 3 image so that each grid cell yields a 1 × 1 × 16 outcome.
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg
YOLO – wrap up: Prediction
For each grid cell, the CNN outputs the same 1 × 1 × 16 vector: two stacked anchor-box predictions (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3).
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg
YOLO – wrap up: Prediction
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg
• For each grid cell, get 2 predicted bounding boxes.
• Get rid of low-probability predictions.
• For each class (pedestrian, car, motorcycle), use non-max suppression to generate the final predictions.
YOLOv3 Network Architecture
Source: Ayoosh Kathuria, What's new in YOLO V3?
• Deeper than YOLO9000: from 53 to 106 layers.
• Input image: 320, 416 or 608.
• Detection at three scales (13×13, 26×26, 52×52): better at detecting smaller objects.
• Skip connections.
YOLOv3 – Anchors
For each cell there are B anchors.
For each anchor, the prediction is a vector of dimension (5 + C).
Example: for COCO, B = 3 and C = 80.
The number of anchor predictions is
(13 × 13 + 26 × 26 + 52 × 52) × B = 10,647
This is 10 times YOLO9000!
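The count can be verified with a few lines of arithmetic:

```python
# Total YOLOv3 anchor predictions for a 416x416 input: three detection
# scales (13x13, 26x26, 52x52) with B = 3 anchors per grid cell.
B = 3
cells = 13 * 13 + 26 * 26 + 52 * 52   # 169 + 676 + 2704 = 3549 cells
print(cells * B)                       # 10647 predictions
```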
YOLOv3 – Bounding Box Prediction
Source: Redmon, J., Farhadi, A., (2018), YOLOv3: An Incremental Improvement.
Same as YOLO9000.
It predicts t_x, t_y, t_w, t_h for each anchor box.
The predictions b_x, b_y, b_w, b_h are given by
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
where
• (c_x, c_y) is the offset of the grid cell from the top-left corner of the image,
• p_w and p_h are the anchor width and height, and
• σ is the sigmoid function.
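The decoding equations can be sketched as follows (the function name and argument order are illustrative):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv3 box decoding: the sigmoid keeps the center offset inside
    its grid cell, and the exponential scales the anchor (prior) size."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    bx = sigmoid(tx) + cx          # center x, in grid-cell units
    by = sigmoid(ty) + cy          # center y, in grid-cell units
    bw = pw * math.exp(tw)         # width  = anchor width  * e^tw
    bh = ph * math.exp(th)         # height = anchor height * e^th
    return bx, by, bw, bh
```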
YOLOv3 – Loss Function
The prediction for each anchor is y = (t_0, t_x, t_y, t_h, t_w, s_1, …, s_C)ᵀ.

ℒ_YOLO = Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{i,j}^{obj} [ −log σ(t_0) + Σ_{k=1}^{C} BCE(ŷ_k, σ(s_k)) ]   ← for non-empty anchors (BCE instead of softmax, as in YOLOv2)

  + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{i,j}^{obj} [ (t_x − t̂_x)² + (t_y − t̂_y)² + (t_w − t̂_w)² + (t_h − t̂_h)² ]   ← for non-empty anchors

  + λ_nobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{i,j}^{nobj} [ −log(1 − σ(t_0)) ]   ← for empty anchors

where 1_{i,j}^{obj} / 1_{i,j}^{nobj} is the indicator function, λ_nobj and λ_coord are weights, y/ŷ is the prediction/ground truth, σ is the sigmoid, and BCE is the binary cross entropy.
Performance on COCO
Object Detector Architectures
Typically composed of several components:
• Backbone – a network that extracts the features: VGG16, ResNet-50, Darknet-53, ResNeXt-50, …
• Neck – enhances feature discriminability and robustness: ASPP, PAN, RFB, …
• Head – handles the prediction: RPN, YOLO, SSD, …; Faster R-CNN, R-CNN, …
Source: Bochkovskiy, et al., 2020
Bag of Freebies
Methods that only change the training strategy or only increase the training cost.
• "bag" refers to a set of strategies, and
• "freebies" means that we get additional performance for free.
The authors investigated a large arsenal of "freebies" (too many to cover in depth here), in particular data augmentation strategies.
Bag of Specials
A set of modules that increase the inference cost by a small amount but significantly improve the accuracy, such as
• attention mechanisms,
• special activation functions,
etc., both for the backbone and the head.
Source: Bochkovskiy, et al., 2020
Additional Improvements
• A new data augmentation method: Mosaic + Self-Adversarial Training (SAT)
• Optimal hyper-parameters found by applying genetic algorithms
• Modifications to some existing methods for efficient training and detection
YOLOv4 Architecture
YOLOv4 consists of:
• Backbone: CSPDarknet53
• Neck: SPP, PAN
• Head: YOLOv3
Recall DenseNet
DenseNet increases model capacity (complexity) and improves information flow (training) with highly interconnected layers.
Sources: Huang et al., 2018; Huang & Wang, 2018
Recall DenseNet
A DenseNet is formed by composing multiple DenseBlocks with a transition layer in between.
Source: Wang et al., 2020
Cross-Stage Partial Connection (CSP)
CSPNet separates the input feature maps of the DenseBlock into two parts:
• The first part, x₀′, bypasses the DenseBlock and becomes part of the input to the next transition layer.
• The second part, x₀′′, goes through the DenseBlock as below.
This reduces the computational complexity, since only one part goes through the DenseBlock.
Source: Wang et al., 2020
CSPDarknet-53
"YOLOv4 utilizes the CSP connections with the Darknet-53 (on the right) as the backbone in feature extraction."¹
It reduces the computational complexity by separating the input into two parts, with only one going through the DenseBlock.
¹ Jonathan Hui, 2020
Improved Spatial Pyramid Pooling (SPP)
Feature maps are max pooled at three scales with stride 1, and the three resulting feature maps are concatenated.
Source: Huang & Wang, 2018 (original SPP vs. enhanced SPP)
CSP and SPP integrated in YOLO
Path Aggregation Network (PAN)
YOLOv3 adopts an approach similar to FPN, making object detection predictions at different scale levels.
It upsamples (2×) the previous top-down stream and adds it to the neighboring layer of the bottom-up stream. The result is passed through a 3×3 convolution filter to reduce upsampling artifacts, creating the feature maps (e.g., P4) for the head.
Source: Lin et al., 2019
YOLOv3 vs YOLOv4 – Demo
click here to watch the demo
Useful links
• Jonathan Hui, 2020, YOLOv4
• Deval Shah, 2020, YOLOv4 – Version 3: Proposed Workflow
• Shreejai Trivedi, YOLOv4 – Version 4: Final Verdict
YOLOv5
• Glenn Jocher, founder and CEO of Ultralytics, released his open-source PyTorch implementation of YOLOv5 on GitHub.
• No paper published, just a blog post.
• Authors claim:
  • 140 frames per second (Tesla P100) vs. 50 FPS for YOLOv4
  • Small: about 90% smaller than YOLOv4
  • Fast
YOLOv5 Performance
Source: Jacob Solawetz, 2020
Other OD approaches
• R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation (2014)
• Fast R-CNN (2015)
• Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2016)
• R-FCN: Object Detection via Region-based Fully Convolutional Networks (2016)
• SSD: Single Shot MultiBox Detector (2016)
• FPN: Feature Pyramid Networks for Object Detection (2017)
• RetinaNet: Focal Loss for Dense Object Detection (2018)
• PANet: Path Aggregation Network for Instance Segmentation (2018)
• EfficientDet: Scalable and Efficient Object Detection (2020)
Mask R-CNN
CV Tasks
Figure source: Tiba Razmi, (2019) Mask R-CNN
Instance Segmentation combines
• object detection, and
• semantic segmentation.
Mask R-CNN
It is a two-stage framework (meta-algorithm):
a) generate proposals (areas likely to contain an object);
b) segment the proposals.
Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
Mask R-CNN
Proposed by Kaiming He and co-authors in 2018.
It is built upon Faster R-CNN.
It involves three steps:
a) the Region Proposal Network (RPN), which generates region proposals,
b) the Region of Interest Alignment (ROIAlign), which reshapes each region proposal to a standard shape/size, and
c) the mask generation itself, which segments the image inside the bounding boxes that contain objects.
Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
Backbone: VGG, ResNet C4, FPN, …
Region Proposal Network (RPN)
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
The input image is resized. (Sizes in the figure are just illustrative.)
Region Proposal Network (RPN)
A backbone network (VGG, ResNet C4, FPN, …) produces the feature maps.
Region Proposal Network (RPN)
The network has to learn …
a) whether an object is present in a location, and
b) to estimate its size/shape.
… this is done by placing a set of anchors on the input image for each location on the output feature map.
Region Proposal Network (RPN)
Example of 9 anchors per image location.
Region Proposal Network (RPN)
First, a 3×3 convolution with 512 filters is applied to the feature map.
This is followed by two parallel 1×1 convolutions:
a) one with 36 filters (box regression: 9 anchors × 4 coordinates), and
b) one with 18 filters (objectness: 9 anchors × 2 scores).
Region Proposal Network (RPN)
The multi-task loss function for the RPN is given by

ℒ({p_i}, {t_i} | {p̂_i}, {t̂_ij}) = (1/N_cls) Σ_i BCE(p_i, p̂_i) + (λ/N_reg) Σ_i p̂_i Σ_{j∈{x,y,w,h}} R(t_ij − t̂_ij)

classification loss + bounding box loss (activated only for positive anchors, p̂_i = 1)

where
• p_i is the output score for anchor i,
• p̂_i is the ground truth for anchor i,
• R(x) is the smooth L1 function: R(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise,
• t_ix = (x_i − x_ia)/w_ia;  t_iy = (y_i − y_ia)/h_ia;  t_iw = log(w_i/w_ia);  t_ih = log(h_i/h_ia),
• t̂_ix = (x̂_i − x_ia)/w_ia;  t̂_iy = (ŷ_i − y_ia)/h_ia;  t̂_iw = log(ŵ_i/w_ia);  t̂_ih = log(ĥ_i/h_ia),
• x_i, y_i, w_i, and h_i denote the center coordinates, width and height of box i,
• x_i, x_ia, and x̂_i refer to the predicted, anchor, and ground-truth boxes (likewise for y_i, w_i, and h_i),
• BCE stands for binary cross entropy,
• N_cls is the mini-batch size, and
• N_reg is the number of anchor locations.
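The smooth L1 function R(x) used in the box-regression term can be written directly (a one-function sketch):

```python
def smooth_l1(x):
    """Smooth L1 regression loss R(x): quadratic near zero (stable
    gradients for small errors), linear in the tails (robust to outliers)."""
    x = abs(x)
    return 0.5 * x * x if x < 1 else x - 0.5
```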
Non-Maximum Suppression
The regions delivered by the RPN are validated as follows:
Discard all boxes with objectness p_i ≤ threshold.
While there are boxes:
• Pick the box with the largest p_i and output it as a prediction.
• Discard any remaining box with IoU ≥ 0.5 with the box output in the previous step.
The algorithm must run separately for each class.
Mask R-CNN (recap)
Three steps: a) the Region Proposal Network (RPN), b) ROIAlign, and c) mask generation. Next: ROIAlign.
Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
ROI pooling
The next step (mask generation) requires inputs of a fixed shape/size. Thus, proposals (anchors) must be resized.
Example: the anchor is 4×6 and the mask-generator input is 2×2 — each 2×3 block of the feature map is max pooled into one output value.
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
ROI pooling
Often, the anchor size is not an integer multiple of the pooled output size!
Solution: round to the nearest integer.
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
ROI Alignment
Instead of rounding to the nearest integer, the values are bilinearly interpolated.
ROIAlign "improves mask accuracy by 10% to 50%".
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
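The bilinear sampling that ROIAlign performs at fractional coordinates can be sketched as follows (a hypothetical helper operating on a nested-list feature map):

```python
def bilinear(feature, x, y):
    """Sample a 2D feature map at fractional coordinates (x, y),
    as ROIAlign does instead of rounding to the nearest cell."""
    x0, y0 = int(x), int(y)                        # top-left integer cell
    x1 = min(x0 + 1, len(feature[0]) - 1)          # clamp at the border
    y1 = min(y0 + 1, len(feature) - 1)
    dx, dy = x - x0, y - y0                        # fractional parts
    # interpolate horizontally on the two rows, then vertically
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bot = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bot * dy
```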
Mask R-CNN (recap)
Three steps: a) the Region Proposal Network (RPN), b) ROIAlign, and c) mask generation. Next: mask generation.
Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
Steps of Mask Generation
• The fixed-size feature map from ROIAlign goes into the R-CNN head (FC layers), resulting in bounding box predictions (regression coefficients) and classification scores.
• The fixed-size feature map from ROIAlign also goes into an FCN to get the segmentation mask.
• The mask prediction is resampled to recover the original resolution (final mask).
Loss for the Mask Generation
The multi-task loss function for the mask generator is given by

ℒ({p_i}, {t_i} | {p̂_i}, {t̂_ij}) = Σ_i [ −log p_i + Σ_{j∈{x,y,w,h}} R(t_ij − t̂_ij) + Σ_{1≤k≤K} BCE(p_ik, p̂_ik) ]

classification loss + bounding box loss + mask loss

where
• p_i is the output score/probability of the true class for box i,
• R(x) is the smooth L1 function: R(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise,
• t_ix = (x_i − x_ia)/w_ia;  t_iy = (y_i − y_ia)/h_ia;  t_iw = log(w_i/w_ia);  t_ih = log(h_i/h_ia),
• t̂_ix = (x̂_i − x_ia)/w_ia;  t̂_iy = (ŷ_i − y_ia)/h_ia;  t̂_iw = log(ŵ_i/w_ia);  t̂_ih = log(ĥ_i/h_ia),
• x_i, y_i, w_i, and h_i denote the center coordinates, width and height of box i,
• x_i, x_ia, and x̂_i refer to the predicted, anchor, and ground-truth boxes (likewise for y_i, w_i, and h_i),
• BCE stands for binary cross entropy, and
• K denotes the number of classes.
OD vs IS vs SS – Demo
click here to watch the demo
Main Sources
Original paper: He, K., Gkioxari, G., Dollár, P., Girshick, R., (2018), Mask R-CNN
Presentation at ICCV17 by Kaiming He, the first author
Object Detection
Thanks for your attention!
R. Q. FEITOSA