Object Detection
R. Q. FEITOSA
Content
1. Computer Vision Tasks
2. Data Sets for Object Detection
3. Performance Metrics
4. YOLO
5. Mask R-CNN
Computer Vision Tasks
Image Classification · Object Localization · Semantic Segmentation · Object Detection · Instance Segmentation
Computer Vision Tasks
Image Classification: assigns a label (e.g., CAT) to an image.
Computer Vision Tasks
Object Localization: locates the single object in the image and indicates its location with a bounding box.
Computer Vision Tasks
Object Detection: locates multiple (if any) instances of object classes with bounding boxes.
Computer Vision Tasks
Semantic Segmentation: assigns a class label (e.g., CAT, ROAD, VEGETATION) to each individual pixel, without separating instances.
Computer Vision Tasks
Instance Segmentation: identifies each instance of each object class at the pixel level.
(In the diagram: object localization and detection are region based; the segmentation tasks are pixel based. Semantic segmentation recognizes classes but ignores instances; instance segmentation recognizes both classes and instances.)
Datasets
• PASCAL Visual Object Classes (PASCAL VOC)
• 8 different challenges held between 2005 and 2012
• Around 10,000 images of 20 categories
Source: Everingham et al., 2012, Visual Object Classes Challenge 2012 Dataset
(sample images of PASCAL VOC)
Datasets
• Common Objects in COntext (COCO) – T.-Y. Lin et al. (2015)
• Multiple challenges
• Around 160,000 images of 80 categories
Source: Arthur Ouaknine, Review of Deep Learning Algorithms for Object Detection (2018)
(sample images of COCO)
Performance Metrics

Precision = true positives / (true positives + false positives)

Recall = true positives / (true positives + false negatives)

Intersection over Union (IoU) = area of overlap / area of union, computed between the predicted (outcome) box and the reference box. A prediction counts as a true positive, false positive, or false negative by comparing the outcome against the reference.

Average Precision (AP) = precision averaged over all recall levels (for one class).
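As a concrete illustration of these definitions, here is a minimal sketch in Python (the function names and the (x1, y1, x2, y2) box format are my own assumptions, not from the slides):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    return tp / (tp + fp), tp / (tp + fn)
```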
Performance Metrics
Most competitions use mean average precision (mAP) as the principal metric for evaluating object detectors, though definitions and implementations vary.
Performance Metrics – Precision-recall curve
The confidence score is the probability that an anchor box contains an object. It is usually predicted by a classifier.
As the confidence-score threshold decreases,
• recall increases monotonically;
• precision can go up and down, but the general tendency is to decrease.
Source: Nickzeng (2018)
Performance Metrics – Precision-recall curve – Average Precision
The average precision (AP) is the precision averaged across all unique recall levels.
It is based on the interpolated precision-recall curve.
The interpolated precision p_interp at a certain recall level r is defined as the highest precision found for any recall level r′ ≥ r:
p_interp(r) = max_{r′ ≥ r} p(r′)
Source: Nickzeng (2018)
Performance Metrics – Mean average precision (mAP)
The calculation of AP involves one class.
However, in object detection, there are usually K > 1 classes.
Mean average precision (mAP) is defined as the mean of AP across all K classes:
mAP = (1/K) Σ_{i=1}^{K} AP_i
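The interpolated-precision AP and the mAP above can be sketched as follows (the function names and the use of NumPy are assumptions for illustration):

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP as the area under the interpolated precision-recall curve,
    using p_interp(r) = max over r' >= r of p(r').
    recalls must be sorted in ascending order."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # interpolate: each precision becomes the max precision at any higher recall
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # sum the rectangles where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP = mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```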
YOLO History

Date | Event
May 2016 | Joseph Redmon's paper "You Only Look Once: Unified, Real-Time Object Detection" introduced YOLO.
December 2017 | Redmon introduced YOLOv2 in the paper "YOLO9000: Better, Faster, Stronger".
April 2018 | Redmon published "YOLOv3: An Incremental Improvement".
February 2020 | Redmon announced he was stepping away from computer vision.
April 23, 2020 | Alexey Bochkovskiy published YOLOv4.
May 29, 2020 | Glenn Jocher created a repository called YOLOv5.
Joseph Redmon
YOLOv3
Source: Redmon & Farhadi, 2018, "YOLOv3: An Incremental Improvement"
YOLOv3
Source: Sik-Ho Tsang, Review: YOLOv3 – You Only Look Once (Object Detection)
Example of Object Detection
Training set: pairs (x, y), where y = 1 if the image contains the object and y = 0 otherwise.
A CNN maps the image x to the output y ∈ {0, 1}.
Sliding Window Detection
A CNN classifies the content of a window that slides over the image, at several window sizes.
Problem: high computational cost!
Accurate Bounding Boxes
YOLO Basics
Source: Redmon, et al., 2016, You Only Look Once: Unified, Real-Time Object Detection

The image x (n × n × 3) is divided into a 3 × 3 grid. A single CNN (conv and max-pool layers) maps x to an output y of shape 3 × 3 × 8.

Labels for training – for each grid cell:
y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)ᵀ
• cell with no object: y = (0, ?, ?, ?, ?, ?, ?, ?)ᵀ
• cell with a car (class 2): y = (1, b_x, b_y, b_h, b_w, 0, 1, 0)ᵀ

Remarks:
• It indicates whether there is an object in each grid cell (p_c), and of which class (c_1, c_2, c_3).
• It outputs the bounding box coordinates (b_x, b_y, b_h, b_w) for each grid cell.
• With finer grids you reduce the chance of multiple objects falling in the same grid cell.
• The object is assigned to just one grid cell – the one that contains its midpoint – no matter whether the object extends beyond the cell limits.
• A single CNN accounts for all objects in the image → efficient.
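The label construction described in the remarks can be sketched as follows (a hypothetical helper for the 3×3-grid, 3-class example; the convention that b_h and b_w are expressed in cell units is an assumption):

```python
import numpy as np

def build_yolo_target(objects, S=3, C=3):
    """Build the S x S x (5 + C) training label tensor.
    objects: list of (bx, by, bh, bw, class_id), with bx, by in [0, 1]
    relative to the whole image, and class_id 0-based.
    Each object goes to the single cell containing its midpoint."""
    y = np.zeros((S, S, 5 + C))
    for bx, by, bh, bw, cls in objects:
        # cell indices of the midpoint
        col, row = min(int(bx * S), S - 1), min(int(by * S), S - 1)
        # midpoint coordinates relative to the cell, so they fall in [0, 1]
        cell_x, cell_y = bx * S - col, by * S - row
        # height/width in cell units, so they may exceed 1
        y[row, col, :5] = [1.0, cell_x, cell_y, bh * S, bw * S]
        y[row, col, 5 + cls] = 1.0
    return y
```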
Parametrization of box coordinates
Source: Redmon, et al., 2016, You Only Look Once: Unified, Real-Time Object Detection

The cell's top-left corner is (0, 0) and its bottom-right corner is (1, 1).
Example: y = (1, 0.2, 0.4, 0.9, 1.5, 0, 1, 0)ᵀ
• b_x, b_y (here 0.2, 0.4) lie between 0 and 1: the midpoint falls inside the cell.
• b_h, b_w (here 0.9, 1.5) are measured relative to the cell size and can be > 1: the box may extend beyond the cell.
Remark: actually, there are other parametrizations, but this one is enough for the time being.
Overlapping boxes
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg
Non-maximum suppression
Often, some of the generated boxes overlap (e.g., boxes with p_c = 0.9, 0.8, 0.7, 0.6 around the same object).
Algorithm:
Discard all boxes with p_c ≤ 0.6.
While there are boxes:
• Pick the box with the largest p_c and output it as a prediction.
• Discard any remaining box with IoU ≥ 0.5 with the box output in the previous step.
The algorithm must run separately for each class.
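The algorithm above can be sketched directly in Python (the function name, the (x1, y1, x2, y2) box format, and the exact threshold comparisons are illustrative assumptions):

```python
def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """boxes: list of (x1, y1, x2, y2); scores: confidence p_c per box.
    Returns indices of the boxes kept. Run once per class."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
        return inter / (area(a) + area(b) - inter)

    # discard low-confidence boxes, sort the rest by descending confidence
    candidates = [i for i, s in enumerate(scores) if s > score_thresh]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while candidates:
        best = candidates.pop(0)          # highest remaining p_c
        kept.append(best)
        # drop boxes overlapping too much with the chosen one
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_thresh]
    return kept
```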
Dealing with Overlapping Objects
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

When two objects have their midpoints in the same grid cell, two anchor boxes of different shapes are used, and the output vector stacks one prediction per anchor:
y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3 | p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)ᵀ
      (anchor box 1)                         (anchor box 2)
And if the cell contains a car only? The anchor box whose shape does not match the car gets p_c = 0:
y = (0, ?, ?, ?, ?, ?, ?, ? | 1, b_x, b_y, b_h, b_w, 0, 1, 0)ᵀ
Anchor box algorithm
Previously: each object in the training image was assigned to the grid cell that contains that object's midpoint. In our example so far, output y: 3×3×8.
With two anchor boxes: each object in the training image is assigned to the grid cell that contains the object's midpoint and to the anchor box with the highest IoU with the object. In our example so far, output y: 3×3×16.
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg
Selecting Anchor Boxes
1. Choose by hand, guided by the frequency of shapes in your application.
2. Use a clustering algorithm to find representative shapes.

Figure 2: Clustering box dimensions on VOC and COCO. We run k-means clustering on the dimensions of bounding boxes to get good priors for our model. The left image shows the average IOU we get with various choices for k. We find that k = 5 gives a good tradeoff for recall vs. complexity of the model. The right image shows the relative centroids for VOC and COCO. Both sets of priors favor thinner, taller boxes while COCO has greater variation in size than VOC.
Source: Redmon, J. & Farhadi, A., 2016, YOLO9000: Better, Faster, Stronger
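Option 2 (clustering) can be sketched as a k-means on box dimensions that uses the 1 − IoU distance from the YOLO9000 paper; the implementation details below (initialization, iteration count, the `kmeans_anchors` name) are assumptions:

```python
import numpy as np

def kmeans_anchors(wh, k=5, iters=50, seed=0):
    """Cluster (width, height) pairs into k anchor shapes.
    Distance is 1 - IoU with boxes aligned at the origin, so assignment
    maximizes IoU with a centroid. wh: float array of shape (N, 2)."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # IoU between every box and every centroid (origin-aligned boxes)
        inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
                np.minimum(wh[:, None, 1], centroids[None, :, 1])
        union = wh[:, None, 0] * wh[:, None, 1] + \
                centroids[None, :, 0] * centroids[None, :, 1] - inter
        assign = np.argmax(inter / union, axis=1)   # i.e., min (1 - IoU)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = wh[assign == j].mean(axis=0)
    return centroids
```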
YOLO – wrap up: Training
Classes: 1. Pedestrian, 2. Car, 3. Motorcycle
For each grid cell, the label stacks the two anchor-box vectors (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3):
• empty cell: both anchors get (0, ?, ?, ?, ?, ?, ?, ?)
• cell with a car matching anchor box 2: (0, ?, ?, ?, ?, ?, ?, ? | 1, b_x, b_y, b_h, b_w, 0, 1, 0)
In our example, y: 3 × 3 × 2 × 8 — grid cells × #anchors × (5 + #classes).
The CNN maps the n × n × 3 image so that each grid cell yields a 1 × 1 × 16 outcome.
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg
YOLO – wrap up: Prediction
For each grid cell, the CNN outputs the same 1 × 1 × 16 vector: two stacked anchor-box predictions (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3).
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg
YOLO – wrap up: Prediction
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg
• For each grid cell, get 2 predicted bounding boxes.
• Get rid of low-probability predictions.
• For each class (pedestrian, car, motorcycle), use non-max suppression to generate the final predictions.
YOLOv3 Network Architecture
Source: Ayoosh Kathuria, What's new in YOLO V3?
• Deeper than YOLO9000: from 53 to 106 layers.
• Input image: 320, 416 or 608.
• Detection at three scales (13×13, 26×26, 52×52): better at detecting smaller objects.
• Skip connections.
YOLOv3 – Anchors
For each cell there are B anchors.
For each anchor, the prediction is a vector of dimension (5 + C).
Example: for COCO, B = 3 and C = 80.
The number of anchor predictions is
(13 × 13 + 26 × 26 + 52 × 52) × B = 10,647
This is 10 times YOLO9000!
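The count can be verified with a few lines of arithmetic:

```python
# Total YOLOv3 anchor predictions for a 416x416 input: three detection
# scales (13x13, 26x26, 52x52) with B = 3 anchors per grid cell.
B = 3
cells = 13 * 13 + 26 * 26 + 52 * 52   # 169 + 676 + 2704 = 3549 cells
print(cells * B)                       # 10647 predictions
```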
YOLOv3 – Bounding Box Prediction
Source: Redmon, J., Farhadi, A., (2018), YOLOv3: An Incremental Improvement.
Same as YOLO9000.
It predicts t_x, t_y, t_w, t_h for each anchor box.
The predictions b_x, b_y, b_w, b_h are given by
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
where
• (c_x, c_y) is the offset of the grid cell from the top-left corner of the image,
• p_w and p_h are the anchor width and height, and
• σ is the sigmoid function.
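The decoding equations can be sketched as follows (the function name and argument order are illustrative):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv3 box decoding: the sigmoid keeps the center offset inside
    its grid cell, and the exponential scales the anchor (prior) size."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    bx = sigmoid(tx) + cx          # center x, in grid-cell units
    by = sigmoid(ty) + cy          # center y, in grid-cell units
    bw = pw * math.exp(tw)         # width  = anchor width  * e^tw
    bh = ph * math.exp(th)         # height = anchor height * e^th
    return bx, by, bw, bh
```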
YOLOv3 – Loss Function
The prediction for each anchor is y = (t_0, t_x, t_y, t_h, t_w, s_1, …, s_C)ᵀ.

ℒ_YOLO = Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{i,j}^{obj} [ −log σ(t_0) + Σ_{k=1}^{C} BCE(ŷ_k, σ(s_k)) ]   ← for non-empty anchors (BCE instead of softmax, as in YOLOv2)

  + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{i,j}^{obj} [ (t_x − t̂_x)² + (t_y − t̂_y)² + (t_w − t̂_w)² + (t_h − t̂_h)² ]   ← for non-empty anchors

  + λ_nobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{i,j}^{nobj} [ −log(1 − σ(t_0)) ]   ← for empty anchors

where 1_{i,j}^{obj} / 1_{i,j}^{nobj} is the indicator function, λ_nobj and λ_coord are weights, y/ŷ is the prediction/ground truth, σ is the sigmoid, and BCE is the binary cross entropy.
Performance on COCO
Object Detector Architectures
Typically composed of several components:
• Backbone – a network that extracts the features: VGG16, ResNet-50, Darknet-53, ResNeXt-50, …
• Neck – enhances feature discriminability and robustness: ASPP, PAN, RFB, …
• Head – handles the prediction: RPN, YOLO, SSD, …; Faster R-CNN, R-CNN, …
Source: Bochkovskiy, et al., 2020
Bag of Freebies
Methods that only change the training strategy or only increase the training cost.
• "bag" refers to a set of strategies, and
• "freebies" means that we get additional performance for free.
The authors investigated a large arsenal of "freebies" (too many to cover in depth here), in particular data augmentation strategies.
Bag of Specials
A set of modules that increase the inference cost by a small amount but significantly improve the accuracy, such as
• attention mechanisms,
• special activation functions,
etc., both for the backbone and the head.
Source: Bochkovskiy, et al., 2020
Additional Improvements
• A new data augmentation method: Mosaic + Self-Adversarial Training (SAT)
• Optimal hyper-parameters found by applying genetic algorithms
• Modifications to some existing methods for efficient training and detection
YOLOv4 Architecture
YOLOv4 consists of:
• Backbone: CSPDarknet53
• Neck: SPP, PAN
• Head: YOLOv3
Recall DenseNet
DenseNet increases model capacity (complexity) and improves information flow (training) with highly interconnected layers.
Sources: Huang et al., 2018; Huang & Wang, 2018
Recall DenseNet
A DenseNet is formed by composing multiple DenseBlocks with a transition layer in between.
Source: Wang et al., 2020
Cross-Stage Partial Connection (CSP)
CSPNet separates the input feature maps of the DenseBlock into two parts:
• The first part, x₀′, bypasses the DenseBlock and becomes part of the input to the next transition layer.
• The second part, x₀′′, goes through the DenseBlock as below.
This reduces the computational complexity, since only one part goes through the DenseBlock.
Source: Wang et al., 2020
CSPDarknet-53
"YOLOv4 utilizes the CSP connections with the Darknet-53 (on the right) as the backbone in feature extraction."¹
It reduces the computational complexity by separating the input into two parts, with only one going through the DenseBlock.
¹ Jonathan Hui, 2020
Improved Spatial Pyramid Pooling (SPP)
Feature maps are max pooled at three scales with stride 1, and the three resulting feature maps are concatenated.
Source: Huang & Wang, 2018 (original SPP vs. enhanced SPP)
CSP and SPP integrated in YOLO
Path Aggregation Network (PAN)
YOLOv3 adopts an approach similar to FPN, making object detection predictions at different scale levels.
It upsamples (2×) the previous top-down stream and adds it to the neighboring layer of the bottom-up stream. The result is passed through a 3×3 convolution filter to reduce upsampling artifacts, creating the feature maps (e.g., P4) for the head.
Source: Lin et al., 2019
YOLOv3 vs YOLOv4 – Demo
click here to watch the demo
Useful links
• Jonathan Hui, 2020, YOLOv4
• Deval Shah, 2020, YOLOv4 – Version 3: Proposed Workflow
• Shreejai Trivedi, YOLOv4 – Version 4: Final Verdict
YOLOv5
• Glenn Jocher, founder and CEO of Ultralytics, released his open-source PyTorch implementation of YOLOv5 on GitHub.
• No paper published, just a blog post.
• Authors claim:
  • 140 frames per second (Tesla P100) vs. 50 FPS for YOLOv4
  • Small: about 90% smaller than YOLOv4
  • Fast
YOLOv5 Performance
Source: Jacob Solawetz, 2020
Other OD approaches
• R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation (2014)
• Fast R-CNN (2015)
• Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2016)
• R-FCN: Object Detection via Region-based Fully Convolutional Networks (2016)
• SSD: Single Shot MultiBox Detector (2016)
• FPN: Feature Pyramid Networks for Object Detection (2017)
• RetinaNet: Focal Loss for Dense Object Detection (2018)
• PANet: Path Aggregation Network for Instance Segmentation (2018)
• EfficientDet: Scalable and Efficient Object Detection (2020)
Mask R-CNN
CV Tasks
Figure source: Tiba Razmi, (2019) Mask R-CNN
Instance Segmentation combines
• object detection, and
• semantic segmentation.
Mask R-CNN
It is a two-stage framework (meta-algorithm):
a) generate proposals (areas likely to contain an object);
b) segment the proposals.
Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
Mask R-CNN
Proposed by Kaiming He and co-authors in 2018.
It is built upon Faster R-CNN.
It involves three steps:
a) the Region Proposal Network (RPN), which generates region proposals,
b) the Region of Interest Alignment (ROIAlign), which reshapes each region proposal to a standard shape/size, and
c) the mask generation itself, which segments the image inside the bounding boxes that contain objects.
Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
Backbone: VGG, ResNet C4, FPN, …
Region Proposal Network (RPN)
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
The input image is resized. (Sizes in the figure are just illustrative.)
Region Proposal Network (RPN)
A backbone network (VGG, ResNet C4, FPN, …) produces the feature maps.
Region Proposal Network (RPN)
The network has to learn …
a) whether an object is present in a location, and
b) to estimate its size/shape.
… this is done by placing a set of anchors on the input image for each location on the output feature map.
Region Proposal Network (RPN)
Example of 9 anchors per image location.
Region Proposal Network (RPN)
First, a 3×3 convolution with 512 filters is applied to the feature map.
This is followed by two parallel 1×1 convolutions:
a) one with 36 filters (box regression: 9 anchors × 4 coordinates), and
b) one with 18 filters (objectness: 9 anchors × 2 scores).
Region Proposal Network (RPN)
The multi-task loss function for the RPN is given by

ℒ({p_i}, {t_i} | {p̂_i}, {t̂_ij}) = (1/N_cls) Σ_i BCE(p_i, p̂_i) + (λ/N_reg) Σ_i p̂_i Σ_{j∈{x,y,w,h}} R(t_ij − t̂_ij)

classification loss + bounding box loss (activated only for positive anchors, p̂_i = 1)

where
• p_i is the output score for anchor i,
• p̂_i is the ground truth for anchor i,
• R(x) is the smooth L1 function: R(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise,
• t_ix = (x_i − x_ia)/w_ia;  t_iy = (y_i − y_ia)/h_ia;  t_iw = log(w_i/w_ia);  t_ih = log(h_i/h_ia),
• t̂_ix = (x̂_i − x_ia)/w_ia;  t̂_iy = (ŷ_i − y_ia)/h_ia;  t̂_iw = log(ŵ_i/w_ia);  t̂_ih = log(ĥ_i/h_ia),
• x_i, y_i, w_i, and h_i denote the center coordinates, width and height of box i,
• x_i, x_ia, and x̂_i refer to the predicted, anchor, and ground-truth boxes (likewise for y_i, w_i, and h_i),
• BCE stands for binary cross entropy,
• N_cls is the mini-batch size, and
• N_reg is the number of anchor locations.
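The smooth L1 function R(x) used in the box-regression term can be written directly (a one-function sketch):

```python
def smooth_l1(x):
    """Smooth L1 regression loss R(x): quadratic near zero (stable
    gradients for small errors), linear in the tails (robust to outliers)."""
    x = abs(x)
    return 0.5 * x * x if x < 1 else x - 0.5
```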
Non-Maximum Suppression
The regions delivered by the RPN are validated as follows:
Discard all boxes with objectness p_i ≤ threshold.
While there are boxes:
• Pick the box with the largest p_i and output it as a prediction.
• Discard any remaining box with IoU ≥ 0.5 with the box output in the previous step.
The algorithm must run separately for each class.
Mask R-CNN (recap)
Three steps: a) the Region Proposal Network (RPN), b) ROIAlign, and c) mask generation. Next: ROIAlign.
Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
ROI pooling
The next step (mask generation) requires inputs of a fixed shape/size. Thus, proposals (anchors) must be resized.
Example: the anchor is 4×6 and the mask-generator input is 2×2 — each 2×3 block of the feature map is max pooled into one output value.
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
ROI pooling
Often, the anchor size is not an integer multiple of the pooled output size!
Solution: round to the nearest integer.
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
ROI Alignment
Instead of rounding to the nearest integer, the values are bilinearly interpolated.
ROIAlign "improves mask accuracy by 10% to 50%".
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
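The bilinear sampling that ROIAlign performs at fractional coordinates can be sketched as follows (a hypothetical helper operating on a nested-list feature map):

```python
def bilinear(feature, x, y):
    """Sample a 2D feature map at fractional coordinates (x, y),
    as ROIAlign does instead of rounding to the nearest cell."""
    x0, y0 = int(x), int(y)                        # top-left integer cell
    x1 = min(x0 + 1, len(feature[0]) - 1)          # clamp at the border
    y1 = min(y0 + 1, len(feature) - 1)
    dx, dy = x - x0, y - y0                        # fractional parts
    # interpolate horizontally on the two rows, then vertically
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bot = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bot * dy
```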
Mask R-CNN (recap)
Three steps: a) the Region Proposal Network (RPN), b) ROIAlign, and c) mask generation. Next: mask generation.
Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
Steps of Mask Generation
• The fixed-size feature map from ROIAlign goes into the R-CNN head (FC layers), resulting in bounding box predictions (regression coefficients) and classification scores.
• The fixed-size feature map from ROIAlign also goes into an FCN to get the segmentation mask.
• The mask prediction is resampled to recover the original resolution (final mask).
Loss for the Mask Generation
The multi-task loss function for the mask generator is given by

ℒ({p_i}, {t_i} | {p̂_i}, {t̂_ij}) = Σ_i [ −log p_i + Σ_{j∈{x,y,w,h}} R(t_ij − t̂_ij) + Σ_{1≤k≤K} BCE(p_ik, p̂_ik) ]

classification loss + bounding box loss + mask loss

where
• p_i is the output score/probability of the true class for box i,
• R(x) is the smooth L1 function: R(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise,
• t_ix = (x_i − x_ia)/w_ia;  t_iy = (y_i − y_ia)/h_ia;  t_iw = log(w_i/w_ia);  t_ih = log(h_i/h_ia),
• t̂_ix = (x̂_i − x_ia)/w_ia;  t̂_iy = (ŷ_i − y_ia)/h_ia;  t̂_iw = log(ŵ_i/w_ia);  t̂_ih = log(ĥ_i/h_ia),
• x_i, y_i, w_i, and h_i denote the center coordinates, width and height of box i,
• x_i, x_ia, and x̂_i refer to the predicted, anchor, and ground-truth boxes (likewise for y_i, w_i, and h_i),
• BCE stands for binary cross entropy, and
• K denotes the number of classes.
OD vs IS vs SS – Demo
click here to watch the demo
Main Sources
Original paper: He, K., Gkioxari, G., Dollár, P., Girshick, R., (2018), Mask R-CNN
Presentation at ICCV17 by Kaiming He, the first author
Object Detection
Thanks for your attention!
R. Q. FEITOSA