Object Detection R. Q. FEITOSA


Page 1: PROJECT LIVENESS REPORT 01

Object Detection

R. Q. FEITOSA

Page 2: PROJECT LIVENESS REPORT 01

Content

1. Computer Vision Tasks

2. Data Sets for Object Detection

3. Performance Metrics

4. YOLO

5. Mask R-CNN

2

Page 3: PROJECT LIVENESS REPORT 01

Computer Vision Tasks

3

[Figure: Computer Vision tasks: Image Classification, Object Localization, Object Detection, Semantic Segmentation, Instance Segmentation]

Page 4: PROJECT LIVENESS REPORT 01

Computer Vision Tasks

Image Classification: assigns a label to an image.

4

[Figure: task overview, with an image classification example labeled CAT]

Page 5: PROJECT LIVENESS REPORT 01

Computer Vision Tasks

Object Localization: locates the single object in the image and indicates its location with a bounding box.

5

[Figure: task overview, with an object localization example: a bounding box around the CAT]

Page 6: PROJECT LIVENESS REPORT 01

Computer Vision Tasks

Object Detection: locates multiple (if any) instances of object classes with bounding boxes.

6

[Figure: task overview]

Page 7: PROJECT LIVENESS REPORT 01

Computer Vision Tasks

Semantic Segmentation: links a class label to individual pixels with no concern to separate instances.

7

[Figure: task overview, with a semantic segmentation example labeled CAT, ROAD, VEGETATION]

Page 8: PROJECT LIVENESS REPORT 01

Computer Vision Tasks

Instance Segmentation: identifies each instance of each object class at the pixel level.

8

[Figure: task overview contrasting region-based and pixel-based tasks: semantic segmentation recognizes classes but ignores instances; instance segmentation recognizes classes and instances]

Page 9: PROJECT LIVENESS REPORT 01

Content

1. Computer Vision Tasks

2. Data Sets for Object Detection

3. Performance Metrics

4. YOLO

5. Mask R-CNN

9

Page 10: PROJECT LIVENESS REPORT 01

Datasets

• PASCAL Visual Object Classes (PASCAL VOC)

• 8 different challenges from 2005 to 2012

• Around 10,000 images of 20 categories

10

Source: Everingham et al., 2012, Visual Object Classes Challenge 2012 Dataset

sample images of PASCAL VOC

Page 11: PROJECT LIVENESS REPORT 01

Datasets

• Common Objects in Context (COCO), T.-Y. Lin et al. (2015)

• Multiple challenges

• Around 160,000 images of 80 categories

11

Source: Arthur Ouaknine, Review of Deep Learning Algorithms for Object Detection (2018)

sample images of PASCAL VOC

sample images of COCO

Page 12: PROJECT LIVENESS REPORT 01

Content

1. Computer Vision Tasks

2. Data Sets for Object Detection

3. Performance Metrics

4. YOLO

5. Mask R-CNN

12

Page 13: PROJECT LIVENESS REPORT 01

Performance Metrics

13

$\text{Precision} = \dfrac{\text{true positive}}{\text{true positive} + \text{false positive}}$

$\text{Recall} = \dfrac{\text{true positive}}{\text{true positive} + \text{false negative}}$

$\text{Intersection over Union (IoU)} = \dfrac{\text{area of overlap between predicted and reference boxes}}{\text{area of their union}}$

$\text{Average Precision (AP)} = \text{precision averaged over all recall levels}$

[Figure: outcome vs. reference boxes illustrating true positives, false positives and false negatives]
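The following is a minimal sketch (not from the slides) of how these quantities are typically computed for detection; boxes are assumed to be in corner format (x1, y1, x2, y2), and a detection counts as a true positive when its IoU with a reference box of the same class exceeds a chosen threshold (e.g., 0.5).

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    return tp / (tp + fp), tp / (tp + fn)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # ~0.143
print(precision_recall(tp=8, fp=2, fn=4))    # (0.8, 0.666...)
```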

Page 14: PROJECT LIVENESS REPORT 01

Performance Metrics

Most competitions use mean average precision (mAP) as the principal metric to evaluate object detectors, though definitions and implementations vary.

14

Page 15: PROJECT LIVENESS REPORT 01

Performance Metrics: Precision-recall curve

15

The confidence score is the probability that an anchor box contains an object. It is usually predicted by a classifier.

As the threshold on the confidence score decreases:

• recall increases monotonically;

• precision can go up and down, but the general tendency is to decrease.

Source: Nickzeng (2018)

Page 16: PROJECT LIVENESS REPORT 01

Performance Metrics: Precision-recall curve – Average Precision

16

The average precision (AP) is the precision averaged across all unique recall levels. It is based on the interpolated precision-recall curve.

The interpolated precision $p_{\text{interp}}$ at a certain recall level $r$ is defined as the highest precision found for any recall level $r' \ge r$:

$p_{\text{interp}}(r) = \max_{r' \ge r} p(r')$

Source: Nickzeng (2018)

[Figure: original vs. interpolated precision-recall curve]

Page 17: PROJECT LIVENESS REPORT 01

Performance Metrics: Mean average precision (mAP)

The calculation of 𝐴𝑃 involves one class.

However, in object detection, there are usually 𝐾 > 1 classes.

Mean average precision (𝑚𝐴𝑃) is defined as the mean of 𝐴𝑃 across all 𝐾 classes:

17

$mAP = \dfrac{1}{K} \sum_{i=1}^{K} AP_i$
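As an illustration, here is a small sketch (one of several possible variants; competitions differ in how they interpolate and sample the curve) of computing AP from a precision-recall curve with interpolated precision, and mAP as the mean over classes:

```python
import numpy as np

def average_precision(precision, recall):
    """AP from a PR curve using p_interp(r) = max_{r' >= r} p(r');
    the operating points must be sorted by increasing recall."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    # interpolation: running maximum of precision from right to left
    p_interp = np.maximum.accumulate(precision[::-1])[::-1]
    # pad so the integration covers the whole recall axis [0, 1]
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([p_interp[0]], p_interp, [0.0]))
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

def mean_average_precision(ap_per_class):
    """mAP = mean of the per-class APs."""
    return float(np.mean(ap_per_class))

print(average_precision([1.0, 0.8, 0.6, 0.5], [0.2, 0.4, 0.6, 0.8]))  # 0.58
print(mean_average_precision([0.62, 0.71, 0.55]))                     # ~0.627
```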

Page 18: PROJECT LIVENESS REPORT 01

Content

1. Computer Vision Tasks

2. Data Sets for Object Detection

3. Performance Metrics

4. YOLO

5. Mask R-CNN

18

Page 19: PROJECT LIVENESS REPORT 01

YOLO History

19

May 2016: Joseph Redmon's paper "You Only Look Once: Unified, Real-Time Object Detection" introduced YOLO.

December 2017: Redmon introduced YOLOv2 in the paper "YOLO9000: Better, Faster, Stronger."

April 2018: Redmon published "YOLOv3: An Incremental Improvement".

February 2020: Redmon announced he was stepping away from computer vision.

April 23, 2020: Alexey Bochkovskiy published YOLOv4.

May 29, 2020: Glenn Jocher created a repository called YOLOv5.

[Photo: Joseph Redmon]

Page 20: PROJECT LIVENESS REPORT 01

YOLOv3

20

Source: Redmon & Farhadi, 2018, "YOLOv3: An Incremental Improvement"

Page 21: PROJECT LIVENESS REPORT 01

YOLOv3

21

Source: Sik-Ho Tsang, Review: YOLOv3 – You Only Look Once (Object Detection)

Page 22: PROJECT LIVENESS REPORT 01

Example of Object Detection

22

Training set: images x with binary labels y (1 if the object of interest is present, 0 otherwise), e.g., y = 1, 0, 1, 1, 0 for five training images.

[Figure: a CNN maps the input image x to the prediction y ∈ {0, 1}]

Page 23: PROJECT LIVENESS REPORT 01

Sliding Window Detection

23

[Figure: a window slides over the image; each crop is fed to the CNN, which outputs 0 or 1]

Problem: High computational cost!
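A small sketch of the brute-force approach makes the cost explicit: the classifier runs once per window position (and, in practice, once more for every window size), which quickly becomes expensive. The `classifier` argument is any function returning the probability that the crop contains the object; window size, stride and threshold are illustrative.

```python
def sliding_window_detect(image, classifier, win=64, stride=16, thresh=0.5):
    """Run the classifier on every window position and keep confident hits."""
    h, w = image.shape[:2]
    detections = []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            crop = image[y:y + win, x:x + win]
            score = classifier(crop)          # probability of "object present"
            if score > thresh:
                detections.append((x, y, x + win, y + win, score))
    return detections
```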

Page 24: PROJECT LIVENESS REPORT 01

Accurate Bounding Boxes

24

Page 25: PROJECT LIVENESS REPORT 01

YOLO Basics

25Redmon, et al., 2016, You Only Look Once: Unified, Real-Time Object Detection

Labels for training → for each grid cell:

$\mathbf{y} = [\,p_c,\; b_x,\; b_y,\; b_h,\; b_w,\; c_1,\; c_2,\; c_3\,]^T$

A cell with no object is labeled $[0, ?, ?, ?, ?, ?, ?, ?]^T$ (the remaining entries are "don't care"); a cell containing an object of class 2 is labeled $[1, b_x, b_y, b_h, b_w, 0, 1, 0]^T$.

[Figure: a single CNN (conv and max-pool layers) maps the n × n × 3 input image x to a 3 × 3 × 8 output y, one 8-dimensional vector per grid cell]

Remarks:

• It indicates whether there is an object in each grid cell ($p_c$), and from which class ($c_1$, $c_2$, $c_3$).

• It outputs the bounding box coordinates ($b_x$, $b_y$, $b_h$, $b_w$) for each grid cell.

• With finer grids you reduce the chance of multiple objects falling in the same grid cell.

• Each object is assigned to just one grid cell, the one that contains its midpoint, even if the object extends beyond the cell limits.

• A single CNN accounts for all objects in the image → efficient.
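To make the label layout concrete, here is a hedged sketch (not the original YOLO code) that builds the 3 × 3 × 8 training target described above; it assumes object midpoints and sizes are given normalized to the whole image, and uses the cell-relative parametrization of the next slide.

```python
import numpy as np

def encode_yolo_targets(objects, S=3, C=3):
    """objects: list of (x, y, h, w, class_id), all in [0, 1] image coordinates.
    Returns the S x S x (5 + C) target [p_c, b_x, b_y, b_h, b_w, c_1..c_C]."""
    y = np.zeros((S, S, 5 + C), dtype=float)
    for (x, yc, h, w, cls) in objects:
        col = min(int(x * S), S - 1)      # grid cell containing the midpoint
        row = min(int(yc * S), S - 1)
        bx = x * S - col                  # midpoint relative to the cell, in [0, 1)
        by = yc * S - row
        bh, bw = h * S, w * S             # size relative to the cell, can be > 1
        y[row, col, :5] = [1.0, bx, by, bh, bw]
        y[row, col, 5 + cls] = 1.0        # one-hot class (c_1, c_2, c_3)
    return y

# a single object of class 2 (index 1), slightly right of the image center
target = encode_yolo_targets([(0.55, 0.5, 0.3, 0.6, 1)])
print(target[1, 1])   # [1.  0.65 0.5  0.9  1.8  0.  1.  0. ]
```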

Page 26: PROJECT LIVENESS REPORT 01

Parametrization of box coordinates

26Redmon, et al., 2016, You Only Look Once: Unified, Real-Time Object Detection

$y = [\,1,\; b_x,\; b_y,\; b_h,\; b_w,\; 0,\; 1,\; 0\,]^T$

[Figure: a grid cell with local coordinates running from (0, 0) at its top-left corner to (1, 1) at its bottom-right corner; example values 0.2, 0.4, 0.9 and 1.5 are shown for the box coordinates]

• $b_x$, $b_y$: position of the box midpoint within the cell, between 0 and 1.

• $b_h$, $b_w$: box height and width relative to the cell size; they can be > 1.

Remark: actually, there are other parametrizations, but this is enough for the time being.

Page 27: PROJECT LIVENESS REPORT 01

Overlapping boxes

27

Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Page 28: PROJECT LIVENESS REPORT 01

Non-maximum suppression

28

[Figure: several overlapping detections of the same objects, with confidence scores 0.9, 0.8, 0.7, 0.6 and 0.6]

Often, some of the generated boxes overlap.

Algorithm:

Discard all boxes with 𝑝𝑐 ≤ 0.6

While there are boxes:

• Pick the box with the largest 𝑝𝑐 and output it as a prediction.

• Discard any remaining box with IoU≥0.5 with the box output in the previous step.

The algorithm must run separately for each class.
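A direct transcription of this algorithm (reusing the `iou` helper sketched earlier; thresholds as on the slide) could look like this:

```python
def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """Greedy NMS: keep the most confident box, drop overlapping ones, repeat.
    Must be run separately for each class."""
    candidates = [(s, b) for s, b in zip(scores, boxes) if s > score_thresh]
    candidates.sort(key=lambda sb: sb[0], reverse=True)
    kept = []
    while candidates:
        score, best = candidates.pop(0)       # box with the largest p_c
        kept.append((best, score))
        # discard remaining boxes that overlap the chosen one too much
        candidates = [(s, b) for s, b in candidates if iou(b, best) < iou_thresh]
    return kept
```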

Page 29: PROJECT LIVENESS REPORT 01

Dealing with Overlapping Objects

29

When the midpoints of two objects (e.g., a pedestrian and a car) fall in the same grid cell, a single 8-dimensional label cannot represent both. With two anchor boxes, each grid cell gets a 16-dimensional label, one 8-dimensional block per anchor:

$\mathbf{y} = [\,\underbrace{p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3}_{\text{anchor box 1}},\;\; \underbrace{p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3}_{\text{anchor box 2}}\,]^T$

Example: the pedestrian is encoded in the block of anchor box 1 as $[1, b_x, b_y, b_h, b_w, 1, 0, 0]$ and the car in the block of anchor box 2 as $[1, b_x, b_y, b_h, b_w, 0, 1, 0]$. If the cell contains the car only, anchor box 1 gets $[0, ?, \ldots, ?]$ and anchor box 2 gets $[1, b_x, b_y, b_h, b_w, 0, 1, 0]$.

Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Page 30: PROJECT LIVENESS REPORT 01

Anchor box algorithm

30

Previously: each object in the training image was assigned to the grid cell that contains that object's midpoint. In our example so far, the output y is 3×3×8.

With two anchor boxes: each object in the training image is assigned to the grid cell that contains the object's midpoint and to the anchor box with the highest IoU with the object. In our example, the output y becomes 3×3×16.

Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Page 31: PROJECT LIVENESS REPORT 01

Selecting Anchor Boxes

1. Choose by hand, guided by the frequency of shapes in your application.

2. Use a clustering algorithm to find representative shapes, as sketched below.

31

Figure 2: Clustering box dimensions on VOC and COCO. We run k-means clustering on the dimensions of bounding boxes to get good priors for our model. The left image shows the average IOU we get with various choices for k. We find that k = 5 gives a good tradeoff for recall vs. complexity of the model. The right image shows the relative centroids for VOC and COCO. Both sets of priors favor thinner, taller boxes while COCO has greater variation in size than VOC.

Source: Redmon, J. & Farhadi, A., 2016 YOLO9000: Better, Faster Stronger
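A common way to implement option 2 is k-means on the box (width, height) pairs with 1 - IoU as the distance, as in YOLO9000; the sketch below is an illustration of that idea, not the paper's code, and the toy data are made up.

```python
import numpy as np

def wh_iou(wh, centroids):
    """IoU between (w, h) boxes and centroids, all anchored at the same corner."""
    inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centroids[None, :, 1])
    union = (wh[:, 0] * wh[:, 1])[:, None] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iters=100, seed=0):
    """k-means with distance d(box, centroid) = 1 - IoU(box, centroid)."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, centroids), axis=1)  # max IoU = min distance
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = wh[assign == j].mean(axis=0)
    return centroids

boxes_wh = np.array([[0.10, 0.20], [0.12, 0.25], [0.50, 0.40], [0.55, 0.45], [0.30, 0.80]])
print(kmeans_anchors(boxes_wh, k=3))
```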

Page 32: PROJECT LIVENESS REPORT 01

YOLO – wrap up: Training

32

Classes: 1. Pedestrian, 2. Car, 3. Motorcycle

For each grid cell and each of the two anchor boxes, the training label is an 8-dimensional block $[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]$. An anchor with no object is labeled $[0, ?, \ldots, ?]$; an anchor matching, e.g., a car is labeled $[1, b_x, b_y, b_h, b_w, 0, 1, 0]$.

In our example, the target $\mathbf{y}$ has shape 3 × 3 × 2 × 8, i.e., (grid cells) × (# anchors) × (5 + # classes).

[Figure: the CNN maps each n × n × 3 training image to the corresponding 3 × 3 × 16 target, one 1 × 1 × 16 vector per grid cell]

Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Page 33: PROJECT LIVENESS REPORT 01

YOLO – wrap up: Prediction

33

For each grid cell, the CNN outputs a 1 × 1 × 16 vector: one block $[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]$ per anchor box, e.g., $[0, ?, \ldots, ?]$ for anchor box 1 (no object) and $[1, b_x, b_y, b_h, b_w, 0, 1, 0]$ for anchor box 2 (a car).

Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Page 34: PROJECT LIVENESS REPORT 01

YOLO – wrap up: Prediction

34

• For each grid cell, get 2 predicted bounding boxes.

• Get rid of low-probability predictions.

• For each class (pedestrian, car, motorcycle), use non-max suppression to generate the final predictions.

Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Page 35: PROJECT LIVENESS REPORT 01

YOLOv3 Network Architecture

35

Source: Ayoosh Kathuria, What's new in YOLO V3?

[Figure: YOLOv3 network; detections are made from 13×13, 26×26 and 52×52 feature maps]

Deeper than YOLO9000: from 53 to 106 layers

Input image: 320, 416 or 608 pixels

Detection at three scales → better at detecting smaller objects

Skip connections

Page 36: PROJECT LIVENESS REPORT 01

YOLOv3 – Anchors

36

For each cell there are $B$ anchors. For each anchor the prediction is a vector of dimension $(5 + C)$.

The number of anchor predictions is

$(13 \times 13 + 26 \times 26 + 52 \times 52) \times B = 10{,}647 \quad (B = 3)$

This is about 10 times YOLO9000!

Example: for COCO, $B = 3$ and $C = 80$, so there are 3 anchor predictions per grid cell, each of dimension 85.

Page 37: PROJECT LIVENESS REPORT 01

YOLOv3 – Bounding Box Prediction

37

Source: Redmon, J., Farhadi, A., (2018), YOLOv3: An Incremental Improvement.

Same as YOLO 9000.

It predicts 𝑡𝑥, 𝑡𝑦, 𝑡𝑤, 𝑡ℎ for each anchor box.

The predictions 𝑏𝑥, 𝑏𝑦, 𝑏𝑤, 𝑏ℎ are given by

$b_x = \sigma(t_x) + c_x$

$b_y = \sigma(t_y) + c_y$

$b_w = p_w e^{t_w}$

$b_h = p_h e^{t_h}$

[Figure: an anchor (prior) of size $p_w \times p_h$ and the predicted box of size $b_w \times b_h$, centered at $(\sigma(t_x) + c_x,\; \sigma(t_y) + c_y)$ inside the cell at offset $(c_x, c_y)$]

where

• $(c_x, c_y)$ is the offset of the grid cell from the top left corner of the image,

• $p_w$ and $p_h$ are the anchor width and height, and

• $\sigma$ is the sigmoid function.
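These equations translate directly into a small decoding routine; the sketch below (illustrative, not the reference implementation) works in grid-cell units.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell, anchor):
    """Decode raw predictions t = (tx, ty, tw, th) for one anchor.
    cell = (cx, cy): offset of the grid cell from the image top-left (in cells);
    anchor = (pw, ph): width and height of the anchor prior (in cells)."""
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = sigmoid(tx) + cx          # box center x
    by = sigmoid(ty) + cy          # box center y
    bw = pw * np.exp(tw)           # box width, scaled from the prior
    bh = ph * np.exp(th)           # box height
    return bx, by, bw, bh

# prediction for the cell at (6, 4) with an anchor of 3.6 x 2.4 cells
print(decode_box((0.2, -0.1, 0.3, -0.2), cell=(6, 4), anchor=(3.6, 2.4)))
```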

Page 38: PROJECT LIVENESS REPORT 01

YOLOv3 – Loss Function

38

The prediction for each anchor is $y = [\,t_0, t_x, t_y, t_h, t_w, s_1, \ldots, s_C\,]^T$.

$\mathcal{L}_{YOLO} = \lambda_{nobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{i,j}^{nobj} \big[ -\log\big(1 - \sigma(t_0)\big) \big]$   (for empty anchors)

$\qquad + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{i,j}^{obj} \Big[ -\log \sigma(t_0) + \sum_{k=1}^{C} BCE\big(\hat{y}_k, \sigma(s_k)\big) \Big]$   (for non-empty anchors; BCE is used instead of the softmax of YOLOv2)

$\qquad + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{i,j}^{obj} \big[ (t_x - \hat{t}_x)^2 + (t_y - \hat{t}_y)^2 + (t_w - \hat{t}_w)^2 + (t_h - \hat{t}_h)^2 \big]$   (for non-empty anchors)

where $\mathbb{1}_{i,j}^{obj}$ / $\mathbb{1}_{i,j}^{nobj}$ is the indicator function, $\lambda_{nobj}$ and $\lambda_{coord}$ are weights, $y$ / $\hat{y}$ is the prediction / ground truth, $\sigma$ is the sigmoid, and $BCE$ is the binary cross entropy.
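For concreteness, the loss above can be transcribed almost literally into NumPy; the sketch below assumes the predictions and targets are already arranged per grid cell and anchor, and the two lambda weights are illustrative values, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(target, prob, eps=1e-7):
    prob = np.clip(prob, eps, 1.0 - eps)
    return -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))

def yolov3_loss(pred, target, obj_mask, lam_coord=5.0, lam_nobj=0.5):
    """pred, target: (S, S, B, 5 + C) arrays ordered [t0, tx, ty, th, tw, s1..sC];
    obj_mask: (S, S, B), 1 where the anchor is responsible for an object."""
    p0 = sigmoid(pred[..., 0])                                          # objectness
    nobj = (1.0 - obj_mask) * -np.log(np.clip(1.0 - p0, 1e-7, None))    # empty anchors
    obj = obj_mask * -np.log(np.clip(p0, 1e-7, None))                   # non-empty anchors
    cls = obj_mask * bce(target[..., 5:], sigmoid(pred[..., 5:])).sum(axis=-1)
    coord = obj_mask * ((pred[..., 1:5] - target[..., 1:5]) ** 2).sum(axis=-1)
    return float((lam_nobj * nobj + obj + cls + lam_coord * coord).sum())
```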

Page 39: PROJECT LIVENESS REPORT 01

Performance on COCO

39

Page 40: PROJECT LIVENESS REPORT 01

YOLOv4

40

Source: https://arxiv.org/pdf/2004.10934.pdf

Page 41: PROJECT LIVENESS REPORT 01

Object Detector Architectures

Object detector architectures are typically composed of several components:

41

• Backbone (the network that extracts the features): VGG16, ResNet-50, Darknet-53, ResNeXt-50, ...

• Neck (enhances feature discriminability and robustness): ASPP, PAN, RFB, ...

• Head (handles the prediction):

  • dense prediction: RPN, YOLO, SSD, ...

  • sparse prediction: Faster R-CNN, R-CNN, ...

Source: Bochkovskiy, et al., 2020

Page 42: PROJECT LIVENESS REPORT 01

Bag of Freebies

Methods that only change the training strategy or only increase the training cost.

• "bag" refers to a set of strategies, and

• "freebies" means that we get additional performance for free.

The authors investigated a large arsenal of "freebies" (too many to cover in depth here), in particular data augmentation strategies.

42

Page 43: PROJECT LIVENESS REPORT 01

Bag of Specials

A set of modules that increase the inference cost only by a small amount but significantly improve the accuracy, such as

• attention mechanisms,

• special activation functions,

etc., both for the backbone and the head.

43

Source: Bochkovskiy, et al., 2020

Page 44: PROJECT LIVENESS REPORT 01

Additional Improvements

• A new data augmentation method: MOSAIC + Self-Adversarial Training (SAT)

• Optimal hyper-parameters found by applying genetic algorithms.

• Modifications to some existing methods for efficient training and detection.

44

Page 45: PROJECT LIVENESS REPORT 01

YOLOv4 Architecture

YOLOv4 consists of:

• Backbone: CSPDarknet53

• Neck: SPP, PAN

• Head: YOLOv3

45

Page 46: PROJECT LIVENESS REPORT 01

Recall DenseNet

DenseNet increases model capacity (complexity) and improves information flow (training) with highly interconnected layers.

46

Source: Huang et al., 2018

Source: Huang & Wang, 2018

Page 47: PROJECT LIVENESS REPORT 01

Recall DenseNet

Formed by composing multiple DenseBlocks with a transition layer in between.

47

Source: Wang et al., 2020

Page 48: PROJECT LIVENESS REPORT 01

Cross-Stage Partial Connection (CSP)

CSPNet separates the input feature maps of the DenseBlock into two parts:

• The first part x₀' bypasses the DenseBlock and becomes part of the input to the next transition layer.

• The second part x₀'' goes through the DenseBlock as below.

This reduces the computational complexity by separating the input into two parts, with only one of them going through the DenseBlock.

48

Source: Wang et al., 2020
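The idea can be sketched with a toy PyTorch module (an illustration of the cross-stage split, not the actual CSPDarknet code; layer widths are arbitrary): half of the channels bypass a small dense block and the two parts are concatenated before the transition layer.

```python
import torch
import torch.nn as nn

class CSPDenseBlock(nn.Module):
    """Toy cross-stage partial block: part 1 bypasses the dense block,
    part 2 goes through it; both are merged by the transition layer."""
    def __init__(self, channels, growth=16, layers=2):
        super().__init__()
        self.half = channels // 2
        blocks, c = [], channels - self.half
        for _ in range(layers):
            blocks.append(nn.Sequential(
                nn.Conv2d(c, growth, kernel_size=3, padding=1),
                nn.BatchNorm2d(growth), nn.LeakyReLU(0.1)))
            c += growth                       # dense connectivity: inputs accumulate
        self.blocks = nn.ModuleList(blocks)
        self.transition = nn.Conv2d(self.half + c, channels, kernel_size=1)

    def forward(self, x):
        part1, part2 = x[:, :self.half], x[:, self.half:]          # channel split
        feats = [part2]
        for blk in self.blocks:
            feats.append(blk(torch.cat(feats, dim=1)))             # dense block on part 2
        return self.transition(torch.cat([part1] + feats, dim=1))  # part 1 bypasses

x = torch.randn(1, 64, 32, 32)
print(CSPDenseBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```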

Page 49: PROJECT LIVENESS REPORT 01

Cross-Stage Partial Connection (CSP)

CSP separates the input feature maps of the DenseBlock into two parts. The first part bypasses the DenseBlock and becomes part of the input to the next transition layer.

49

Source: Wang et al., 2020

Page 50: PROJECT LIVENESS REPORT 01

CSPDarknet-53

“YOLOv4 utilizes the CSP connections with the Darknet-53 (on the right) as the backbone in feature extraction.”1

It reduces the computational complexity by separating the input into two parts, with only one of them going through the DenseBlock.

50

Darknet-53

¹ Jonathan Hui, 2020

Page 51: PROJECT LIVENESS REPORT 01

YOLOv4 Architecture

YOLOv4 consists of:

• Backbone: CSPDarknet53

• Neck: SPP, PAN

• Head: YOLOv3

51

Page 52: PROJECT LIVENESS REPORT 01

Improved Spatial Pyramid Pooling (SPP)

Feature maps are max pooled at three scales with stride 1.

The three pooled feature maps are concatenated.

52

Source: Huang & Wang, 2018

[Figure: original SPP vs. enhanced SPP]
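A compact sketch of this block (following the common YOLOv4-style implementation; the kernel sizes 5, 9, 13 and the concatenation of the unpooled input are assumptions, not stated on the slide):

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """Max pooling at several kernel sizes with stride 1 (spatial size is
    preserved), followed by channel concatenation."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
             for k in kernel_sizes])

    def forward(self, x):
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

x = torch.randn(1, 512, 13, 13)
print(SPPBlock()(x).shape)   # torch.Size([1, 2048, 13, 13])
```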

Page 53: PROJECT LIVENESS REPORT 01

CSP and SPP integrated in YOLO

53

Page 54: PROJECT LIVENESS REPORT 01

Path Aggregation Network (PAN)

YOLOv3 adopts an approach similar to FPN, making object detection predictions at different scale levels.

It upsamples (2×) the previous top-down stream and adds it to the neighboring layer of the bottom-up stream. The result is passed through a 3×3 convolution filter to reduce upsampling artifacts and create the feature map P4 below for the head.

54

Source: Lin et al., 2019

Page 55: PROJECT LIVENESS REPORT 01

Performance on COCO

55

Source: Bochkovskiy et al., 2020

Page 56: PROJECT LIVENESS REPORT 01

YOLO v3 vs YOLO v4 - Demo

56

click here to watch the demo

Page 57: PROJECT LIVENESS REPORT 01

Useful links

• Jonathan Hui, 2020, YOLOv4

• Deval Shah, 2020, YOLOv4 – Version 3: Proposed Workflow

• Shreejai Trivedi, YOLOv4 – Version 4: Final Verdict

57

Page 58: PROJECT LIVENESS REPORT 01

YOLOv5

• Glenn Jocher, founder and CEO of Ultralytics, released his open-source PyTorch implementation of YOLOv5 on GitHub.

• No paper was published, just a blog post.

• The authors claim:

  • 140 frames per second (Tesla P100) vs. 50 FPS for YOLOv4

  • Small: about 90% smaller than YOLOv4

  • Fast

58

Page 59: PROJECT LIVENESS REPORT 01

YOLOv5 Performance

59

Source: Jacob Solawetz, 2020

Page 60: PROJECT LIVENESS REPORT 01

Other OD approaches

• R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation (2014)

• Fast R-CNN (2015)

• Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, (2016)

• R-FCN: Object Detection via Region-based Fully Convolutional Networks (2016)

• SSD: Single Shot MultiBox Detector (2016)

• FPN: Feature Pyramid Networks for Object Detection (2017)

• RetinaNet: Focal Loss for Dense Object Detection (2018)

• Path Aggregation Network for Instance Segmentation (2018)

• EfficientDet: Scalable and Efficient Object Detection (2020)

60

Page 61: PROJECT LIVENESS REPORT 01

Content

1. Computer Vision Tasks

2. Data Sets for Object Detection

3. Performance Metrics

4. YOLO

5. Mask R-CNN

61

Page 62: PROJECT LIVENESS REPORT 01

Mask R-CNN

62

Page 63: PROJECT LIVENESS REPORT 01

CV tasks

63

Figure source: Tiba Razmi, (2019) Mask R-CNN

Page 64: PROJECT LIVENESS REPORT 01

Instance Segmentation

64

Instance segmentation combines

• object detection, and

• semantic segmentation.

[Figure: object detection and semantic segmentation examples]

Page 65: PROJECT LIVENESS REPORT 01

Mask R-CNN

It is a two-stage framework (meta-algorithm):

a) generates proposals (areas likely to contain an object)

b) segments the proposals

65

Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/

Page 66: PROJECT LIVENESS REPORT 01

Mask R-CNN

Proposed by Kaiming He and co-authors in 2018. It is built upon Faster R-CNN.

It involves three steps:

a) the Region Proposal Network (RPN), which generates region proposals,

b) the Region of Interest Alignment (ROIAlign), which reshapes each region proposal to a standard shape/size, and

c) Mask Generation itself, which segments the image inside the bounding boxes containing objects.

66

Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/

Page 67: PROJECT LIVENESS REPORT 01

Region Proposal Network (RPN)

67

[Figure: input image → backbone (VGG, ResNet C4, or FPN) → feature map]

The input image is resized; the sizes in the figure are just illustrative.

Source: Shipla Ananth, (2019) Faster R-CNN for object detection

Page 68: PROJECT LIVENESS REPORT 01

Region Proposal Network (RPN)

68

Source: Shipla Ananth, (2019) Faster R-CNN for object detection

A backbone network produces the feature maps.

[Figure: backbone (VGG, ResNet C4, or FPN) → feature map]

Page 69: PROJECT LIVENESS REPORT 01

Region Proposal Network (RPN)

69

Source: Shipla Ananth, (2019) Faster R-CNN for object detection

The network has to learn

a) whether an object is present in a location, and

b) to estimate its size/shape.

This is done by placing a set of anchors on the input image for each location on the output feature map.

[Figure: backbone (VGG, ResNet C4, or FPN) → feature map]

Page 70: PROJECT LIVENESS REPORT 01

Region Proposal Network (RPN)

70

Source: Shipla Ananth, (2019) Faster R-CNN for object detection

The network has to learn

a) whether an object is present in a location, and

b) to estimate its size/shape.

This is done by placing a set of anchors on the input image for each location on the output feature map.

Example of 9 anchors per image location.

[Figure: backbone (VGG, ResNet C4, or FPN) → feature map]
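To make the anchor mechanism concrete, here is a small illustrative sketch (scales, ratios and stride are assumptions, not values from the slides) that builds 9 anchor shapes and places them at every feature-map location mapped back to the input image:

```python
import numpy as np

def make_anchor_shapes(scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """9 anchor shapes = 3 scales x 3 aspect ratios, as (width, height) in pixels."""
    shapes = []
    for s in scales:
        for r in ratios:
            shapes.append((s * np.sqrt(r), s / np.sqrt(r)))
    return np.array(shapes)

def place_anchors(feat_h, feat_w, stride=16, shapes=None):
    """Center every anchor shape on each feature-map location projected
    back onto the input image (one location every `stride` pixels)."""
    shapes = make_anchor_shapes() if shapes is None else shapes
    boxes = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for w, h in shapes:
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

print(place_anchors(38, 50).shape)   # (17100, 4) = 38 * 50 locations x 9 anchors
```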

Page 71: PROJECT LIVENESS REPORT 01

Region Proposal Network (RPN)

71

Source: Shipla Ananth, (2019) Faster R-CNN for object detection

First, a 3x3 convolution with 512 filters is applied to the feature map.

This is followed by two parallel 1x1 convolution layers:

a) one with 36 filters (9 anchors × 4 box coordinates), and

b) one with 18 filters (9 anchors × 2 objectness scores).

[Figure: backbone (VGG, ResNet C4, or FPN) → feature map]

Page 72: PROJECT LIVENESS REPORT 01

Region Proposal Network (RPN)

72

The multi-task loss function for the RPN is given by

$\mathcal{L}(p_i, t_{ij} \mid \hat{p}_i, \hat{t}_{ij}) = \underbrace{\frac{1}{N_{cls}} \sum_i BCE(p_i, \hat{p}_i)}_{\text{classification loss}} \;+\; \underbrace{\frac{\lambda}{N_{reg}} \sum_i \sum_{j \in \{x,y,w,h\}} \hat{p}_i \, R\big(t_{ij} - \hat{t}_{ij}\big)}_{\text{bounding box loss, activated only for positive anchors } (\hat{p}_i = 1)}$

where

• $p_i$ is the output score for anchor $i$

• $\hat{p}_i$ is the ground truth for anchor $i$

• $R(x)$ is the smooth L1 function: $R(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$

• $t_{ix} = (x_i - x_{ia})/w_{ia}$; $\; t_{iy} = (y_i - y_{ia})/h_{ia}$; $\; t_{iw} = \log(w_i/w_{ia})$; $\; t_{ih} = \log(h_i/h_{ia})$

• $\hat{t}_{ix} = (\hat{x}_{ia} - x_{ia})/w_{ia}$; $\; \hat{t}_{iy} = (\hat{y}_{ia} - y_{ia})/h_{ia}$; $\; \hat{t}_{iw} = \log(\hat{w}_{ia}/w_{ia})$; $\; \hat{t}_{ih} = \log(\hat{h}_{ia}/h_{ia})$

• $x_i$, $y_i$, $w_i$, and $h_i$ denote the center coordinates, width and height of box $i$

• $x_i$, $x_{ia}$, and $\hat{x}_{ia}$ refer to the predicted, anchor and ground-truth boxes (likewise for $y_i$, $w_i$, and $h_i$)

• BCE stands for binary cross entropy

• $N_{cls}$ is the mini-batch size

• $N_{reg}$ is the number of anchor locations

[Figure: plot of the smooth L1 function $R(x)$]
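The smooth L1 term and the box parametrization above translate directly into code; the sketch below is illustrative (the normalization by the number of positive anchors is an assumption; the slide normalizes by N_reg with a weight λ).

```python
import numpy as np

def smooth_l1(x):
    """R(x): quadratic near zero, linear elsewhere."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def encode(box, anchor):
    """Map a box (cx, cy, w, h) to targets (tx, ty, tw, th) relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def rpn_box_loss(t_pred, t_gt, positive):
    """Bounding-box part of the RPN loss: smooth L1 on the residuals,
    counted only for positive anchors (p_hat = 1)."""
    per_anchor = smooth_l1(t_pred - t_gt).sum(axis=1)
    return float((positive * per_anchor).sum() / max(positive.sum(), 1))

anchor = (100.0, 100.0, 60.0, 40.0)   # (cx, cy, w, h)
gt_box = (110.0, 95.0, 80.0, 50.0)
print(encode(gt_box, anchor))          # [ 0.167 -0.125  0.288  0.223]
```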

Page 73: PROJECT LIVENESS REPORT 01

Non-Maximum Suppression

The regions delivered by the RPN are validated as follows:

Discard all boxes with objectness 𝑝𝑖 ≤ 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑

While there are boxes:

• Pick the box with the largest 𝑝𝑖 and output it as a prediction.

• Discard any remaining box with IoU≥0.5 with the box output in the previous step.

The algorithm must run separately for each class.

73

Page 74: PROJECT LIVENESS REPORT 01

Mask R-CNN

Proposed by Kaiming He and co-authors in 2018. It is built upon Faster R-CNN.

It involves three steps:

a) the Region Proposal Network (RPN), which generates region proposals,

b) the Region of Interest Alignment (ROIAlign), which reshapes each region proposal to a standard shape/size, and

c) Mask Generation itself, which segments the image inside the bounding boxes containing objects.

74

Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/

Page 75: PROJECT LIVENESS REPORT 01

ROI pooling

The next step (the Mask Generator) requires inputs of a fixed shape/size. Thus, proposals (anchors) must be resized.

Example: the anchor is 4x6 and the Mask Generator input is 2x2.

75

[Figure: max pooling the 4x6 region into a 2x2 output; pooled values 0.8, 0.8, 0.6 and 0.7]

Source: Shipla Ananth, (2019) Faster R-CNN for object detection

Page 76: PROJECT LIVENESS REPORT 01

ROI pooling

Often, the anchor size is not an integer multiple of the FCN input size!

Solution: round to the nearest integer.

76

Source: Shipla Ananth, (2019) Faster R-CNN for object detection

Page 77: PROJECT LIVENESS REPORT 01

ROI Alignment

Instead of rounding to the nearest integer, the values are bilinearly interpolated.

77

Source: Shipla Ananth, (2019) Faster R-CNN for object detection

ROIAlign “improves mask accuracy by 10% to 50%”.
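A very small sketch of the idea (one bilinear sample at the center of each output bin; real ROIAlign implementations average several sample points per bin, and this is not the original code):

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at a real-valued (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align(feat, roi, out_size=2):
    """roi = (y1, x1, y2, x2) in (fractional) feature-map coordinates."""
    y1, x1, y2, x2 = roi
    bin_h, bin_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear(feat, y1 + (i + 0.5) * bin_h,
                                 x1 + (j + 0.5) * bin_w)
    return out

feat = np.arange(49, dtype=float).reshape(7, 7)
print(roi_align(feat, (1.2, 0.8, 5.7, 4.3)))
```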

Page 78: PROJECT LIVENESS REPORT 01

Mask R-CNN

Proposed by Kaiming He and co-authors in 2018. It is built upon Faster R-CNN.

It involves three steps:

a) the Region Proposal Network (RPN), which generates region proposals,

b) the Region of Interest Alignment (ROIAlign), which reshapes each region proposal to a standard shape/size, and

c) Mask Generation itself, which segments the image inside the bounding boxes containing objects.

78

Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/

Page 79: PROJECT LIVENESS REPORT 01

Steps of Mask Generation

79

• The fixed-size feature map from ROIAlign goes into the R-CNN head (fully connected layers), resulting in bounding box predictions (regression coefficients) and classification scores.

• The fixed-size feature map from ROIAlign also goes into an FCN to get the segmentation mask.

• The mask prediction is resampled to recover the original resolution, yielding the final mask.

[Figure: feature map → ROIAlign → conv/FC layers → classification scores and regression coefficients → bounding box; in parallel, ROIAlign → FCN → mask prediction → resample → final mask]

Page 80: PROJECT LIVENESS REPORT 01

Loss for the Mask Generation

80

The multi-task loss function for the Mask generator is given by

$\mathcal{L}(p_i, t_{ij} \mid \hat{p}_i, \hat{t}_{ij}) = \sum_i \Big[ \underbrace{-\log p_i}_{\text{classification loss}} \;+\; \underbrace{\sum_{j \in \{x,y,w,h\}} R\big(t_{ij} - \hat{t}_{ij}\big)}_{\text{bounding box loss}} \;+\; \underbrace{\sum_{1 \le k \le K} BCE\big(p_{ik}, \hat{p}_{ik}\big)}_{\text{mask loss}} \Big]$

where

• $p_i$ is the output score/probability of the true class for box $i$

• $R(x)$ is the smooth L1 function: $R(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$

• $t_{ix} = (x_i - x_{ia})/w_{ia}$; $\; t_{iy} = (y_i - y_{ia})/h_{ia}$; $\; t_{iw} = \log(w_i/w_{ia})$; $\; t_{ih} = \log(h_i/h_{ia})$

• $\hat{t}_{ix} = (\hat{x}_{ia} - x_{ia})/w_{ia}$; $\; \hat{t}_{iy} = (\hat{y}_{ia} - y_{ia})/h_{ia}$; $\; \hat{t}_{iw} = \log(\hat{w}_{ia}/w_{ia})$; $\; \hat{t}_{ih} = \log(\hat{h}_{ia}/h_{ia})$

• $x_i$, $y_i$, $w_i$, and $h_i$ denote the center coordinates, width and height of box $i$

• $x_i$, $x_{ia}$, and $\hat{x}_{ia}$ refer to the predicted, anchor and ground-truth boxes (likewise for $y_i$, $w_i$, and $h_i$)

• BCE stands for binary cross entropy

• $K$ denotes the number of classes

[Figure: plot of the smooth L1 function $R(x)$]

Page 81: PROJECT LIVENESS REPORT 01

OD vs IS vs SS - Demo

81

click here to watch the demo

Page 82: PROJECT LIVENESS REPORT 01

Main Sources

Original paper: He, K., Gkioxari, G., Dollár, P., Girshick, R., (2018), Mask R-CNN

Presentation at ICCV17 by Kaiming He, the first author

82

Page 83: PROJECT LIVENESS REPORT 01

Object Detection

Thanks for your attention

R. Q. FEITOSA