Region-Based CNNs
Detection & Segmentation
Today’s program – 1st hour
• Classification & Localization
• Object detection
• Region proposals
  • Selective search
• Region proposal algorithms
  • RCNN
  • Fast RCNN
  • Faster RCNN
• Detection without proposals
  • YOLO
  • SSD
References
Some slides were taken from the following resources:
1. Stanford University course CS231n 2017 – Lecture 11
2. www.coursera.org – Convolutional Neural Networks – Object Detection
Article references:
• "Efficient Graph-Based Image Segmentation", Pedro F. Felzenszwalb, Artificial Intelligence Lab, Massachusetts Institute of Technology
• "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation" (tech report v5), Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik, UC Berkeley
• "Fast R-CNN", Ross Girshick, Microsoft Research
• "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
• "You Only Look Once: Unified, Real-Time Object Detection", Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi – University of Washington, Allen Institute for AI, Facebook AI Research
• "SSD: Single Shot MultiBox Detector", Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg – UNC Chapel Hill, Zoox Inc., Google Inc., University of Michigan, Ann Arbor
Already covered: Classification
Some definitions…
Single object → Classification; Classification with localization
Multiple objects → Object detection
(each example image labeled "Car")
Classification + Localization
Classification + Localization task
Evaluation metric – IoU (Intersection Over Union)
To evaluate how good our bounding-box prediction is, we use the Intersection over Union measure:

IoU = Area(P.Bbox ∩ G.Bbox) / Area(P.Bbox ∪ G.Bbox)

P.Bbox – prediction Bbox
G.Bbox – ground-truth Bbox
In practice, the Bbox prediction is considered "correct" if IoU > ~0.5–0.6
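The IoU formula above can be sketched directly for axis-aligned boxes; this is a minimal illustration assuming boxes are given as (x0, y0, x1, y1) tuples:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x0, y0, x1, y1)."""
    x0 = max(box_a[0], box_b[0])
    y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2])
    y1 = min(box_a[3], box_b[3])
    # clamp to 0 when the boxes do not overlap
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```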
Localization as Regression
• For example, for 3 classes – Pedestrian, Car & Motorcycle – the ConvNet output vector is:

y = [Pc, Bx, By, Bh, Bw, Pc1, Pc2, Pc3]

Pc – probability that an object exists in the image
Bx – x0 coordinate of the Bbox
By – y0 coordinate of the Bbox
Bh – height of the Bbox
Bw – width of the Bbox
Pc1 – probability for class C1
Pc2 – probability for class C2
Pc3 – probability for class C3

(Image coordinates are normalized so the top-left corner is (0,0) and the bottom-right is (1,1).)
Multi Task Loss
• How do we train the network with two completely different loss functions?
• The solution – a hyperparameter that weights each loss.
• Be careful, it's tricky: this hyperparameter is special in that it changes the absolute value of the loss, so its setting directly affects the loss scale rather than just the optimization dynamics.
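A minimal sketch of such a weighted combination. The helper names and the choice of negative log-likelihood for classification and squared error for localization are illustrative assumptions, not the slide's exact losses:

```python
import math

def multi_task_loss(class_probs, true_class, box_pred, box_true, lam=1.0):
    """Weighted sum of a classification loss and a localization loss.

    lam is the weighting hyperparameter discussed above (assumed value 1.0).
    """
    # classification term: negative log-likelihood of the true class
    cls = -math.log(class_probs[true_class])
    # localization term: squared error over the 4 box numbers
    reg = sum((p - t) ** 2 for p, t in zip(box_pred, box_true))
    return cls + lam * reg
```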
Side note – Landmark detection
Treat each landmark as a class; for each landmark, output its (Lx, Ly) coordinates.
For example:
face recognition (64 facial landmarks), pose estimation (14 joints)
Application example – virtual makeover
Object Detection
Object detection reminder…
• We are looking for all the objects in the input image
• The output should say the class and the bounding box of each object
Main challenge:
The image has a varying number of objects, so we don't know in advance how many outputs to expect.
A fixed-size regression output is not a good paradigm for this kind of problem…
Sliding window
• Small patches cropped from the input are fed to the ConvNet, which classifies each patch as true/false for each class, or as background if there is no object.
Is there any problem with this method?
Sliding window main pitfall
• How to choose the crops?
• Each object can appear at any location / size / aspect ratio
• We would need to evaluate thousands of crops
• Before ConvNets, with linear classifiers, this was computationally feasible
• With ConvNets it is computationally very inefficient
• Computation can be improved by implementing the sliding window convolutionally, as in "OverFeat"
Region proposals
Goal – Find image regions that are likely to contain an object.
Example of a fast method, widely used in practice – "Selective Search".
This algorithm produces ~2000 region proposals in a few seconds on a CPU, finds regions at any scale, considers multiple grouping criteria (e.g. a cup on a table), and is not limited to rectangular ROIs.
Idea: Use bottom-up grouping of image regions to generate a hierarchy of small to large regions.
Efficient Graph-Based Image Segmentation Pedro F. Felzenszwalb, Artificial Intelligence Lab, Massachusetts Institute of Technology
We define a graph G = (V, E) s.t.
Vertices v_i ∈ V are the elements we want to segment.
Edges (v_i, v_j) ∈ E connect two neighboring vertices.
Each edge has a non-negative weight w(e) that describes the dissimilarity between the two elements.
In image segmentation, V = the pixels of the image and
w(e) = intensity difference (or any other measurable pixel attribute).
We define S as a segmentation of V into components (regions in the image) s.t.
• pixels in the same component are similar;
• pixels from disjoint components are dissimilar.
We define the predicate D, which evaluates the dissimilarity between two components, as follows:

D(C₁, C₂) = true if Dif(C₁, C₂) > MInt(C₁, C₂), false otherwise

Dif(C₁, C₂) = min over v_i ∈ C₁, v_j ∈ C₂, (v_i, v_j) ∈ E of w(v_i, v_j)
MInt(C₁, C₂) = min( Int(C₁) + γ(C₁), Int(C₂) + γ(C₂) )
Int(C) = max over e ∈ MST(C, E) of w(e)
Efficient Graph-Based Image Segmentation – Algorithm (Felzenszwalb, MIT AI Lab)
0. Sort E by non-decreasing edge weight into o₁, …, o_m.
1. Start with a segmentation S₀, where each vertex v_i is in its own component.
2. Repeat step 3 for q = 1, …, m.
3. Construct S_q given S_{q−1} as follows. Let v_i and v_j denote the vertices connected by the q-th edge in the ordering, i.e. o_q = (v_i, v_j). If v_i and v_j are in disjoint components of S_{q−1} and w(o_q) is small compared to the internal difference of both those components, merge the two components; otherwise do nothing.
4. Return S = S_m
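The merge rule above can be sketched with a plain union-find on a toy graph. Here γ(C) = K/|C| follows the paper's form, with K an assumed constant; edge weights stand in for pixel dissimilarity:

```python
K = 1.0  # assumed threshold constant for gamma(C) = K / |C|

class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n  # Int(C): max MST edge weight seen so far

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b, w):
        self.parent[b] = a
        self.size[a] += self.size[b]
        # edges are processed in sorted order, so w is the new max MST edge
        self.internal[a] = w

def segment(n_vertices, edges):
    """edges: list of (weight, vi, vj); returns the disjoint-set forest."""
    ds = DisjointSet(n_vertices)
    for w, vi, vj in sorted(edges):
        a, b = ds.find(vi), ds.find(vj)
        if a != b:
            # MInt(C1, C2) with gamma(C) = K / |C|
            mint = min(ds.internal[a] + K / ds.size[a],
                       ds.internal[b] + K / ds.size[b])
            if w <= mint:  # Dif(C1, C2) not greater than MInt: merge
                ds.union(a, b, w)
    return ds
```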
Back to "Selective Search"
Selective search
1. Generate an initial over-segmentation (using the method of Felzenszwalb et al.), R.
2. Initialize S = ∅. Recursively combine similar regions into larger ones:
   a. From the set of regions R, choose the two that are most similar.
   b. Combine them into a single, larger region.
   c. Repeat until only one region remains.
3. Use the generated regions to produce candidate object locations.
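Step 2's greedy merging can be sketched as follows. The `similarity` function is a stand-in assumption (the real algorithm combines color, texture, size, and fill similarity), and regions are toy sets of pixel ids:

```python
def selective_search_merge(regions, similarity):
    """Greedily merge the most similar pair until one region remains;
    every intermediate region becomes a candidate object location."""
    regions = [frozenset(r) for r in regions]
    candidates = list(regions)
    while len(regions) > 1:
        # a. pick the most similar pair of regions
        i, j = max(
            ((i, j) for i in range(len(regions))
                    for j in range(i + 1, len(regions))),
            key=lambda p: similarity(regions[p[0]], regions[p[1]]),
        )
        # b. combine them into a single, larger region
        merged = regions[i] | regions[j]
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        candidates.append(merged)  # c. keep it as a candidate location
    return candidates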
RCNN
Problem: each ROI may be of a different size.
Warp = transform the I(u, v) pixel locations according to a mapping function – in this case, scaling each ROI to a fixed input size.
RCNN Problems
• ~2k regions, each computed independently – computationally expensive
• Training is slow (84h) and takes a lot of disk space
• Every training image has all ROIs labeled with a category and a bounding box, i.e. full supervision of all objects is assumed
• Inference (detection) is slow: 47s / image with VGG16 [Simonyan & Zisserman, ICLR15]
Fast & Faster RCNN – Summary
Fast RCNN
1. Instead of running each region separately, we first run the whole image through a convolutional network, which extracts a full-resolution feature map for the whole image.
2. Still use "Selective Search", only now we crop each ROI projected onto the feature map instead of cropping image pixels. This lets us share the heavy computation between regions.
3. ROI pooling layer (instead of warp) – quite similar to max pooling.
4. Multi-task loss: log loss (classification), smooth L1 (regressing the offsets).
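ROI pooling (step 3) can be sketched in pure Python: divide the ROI on the feature map into a fixed H×W grid and max-pool each cell, so ROIs of any size map to the same output size. This is a toy illustration on nested lists, not the library implementation:

```python
def roi_pool(feature_map, roi, out_h=2, out_w=2):
    """feature_map: 2D list; roi: (y0, x0, y1, x1) in feature-map coords."""
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    out = [[None] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # integer sub-window boundaries for this output cell
            ys, ye = y0 + i * h // out_h, y0 + (i + 1) * h // out_h
            xs, xe = x0 + j * w // out_w, x0 + (j + 1) * w // out_w
            # max-pool the sub-window (force at least one element)
            out[i][j] = max(feature_map[y][x]
                            for y in range(ys, max(ye, ys + 1))
                            for x in range(xs, max(xe, xs + 1)))
    return out
```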
Fast RCNN – Multi task loss

L(p, k*, t, t*) = L_cls(p, k*) + λ·[k* ≥ 1]·L_reg(t, t*)

Where:
k* is the true class label
L_cls(p, k*) = −log p_k* : standard cross entropy / log loss
L_reg(t, t*):
t* = (t*_x, t*_y, t*_w, t*_h) – true bbox regression target
t = (t_x, t_y, t_w, t_h) – predicted tuple for class k*
L_reg(t, t*) = Σ_{i ∈ {x,y,w,h}} smooth_L1(t_i − t*_i)
smooth_L1(x) = 0.5x² if |x| < 1, |x| − 0.5 otherwise
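The smooth-L1 function is quadratic near zero (gentle on small errors) and linear for large errors (robust to outliers). A direct sketch of it and of the box-regression sum:

```python
def smooth_l1(x):
    """Smooth-L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = abs(x)
    return 0.5 * x * x if x < 1 else x - 0.5

def l_reg(t, t_star):
    """Box regression loss: sum of smooth-L1 over the 4 offsets."""
    return sum(smooth_l1(p - q) for p, q in zip(t, t_star))
```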
Fast & Faster RCNN – Summary
Faster RCNN
1. Instead of selective search, we do the following:
• Produce a high-resolution feature map corresponding to the whole image.
• RPN – Region Proposal Network – a CNN that extracts region proposals from the feature map. It classifies each candidate as binary { object / no object } and regresses a bounding box.
2. Using the feature map and the region proposals, continue the same as Fast RCNN.
Region proposals Network (RPN)
Input – image at any size
Output – rectangular regions proposed as object locations, each with a confidence score
1. The input image goes through a CNN to generate a feature map.
Region proposals Network (RPN) cont.
2. Slide a 3×3 window over the feature map; for each window position, a set of 9 anchors is generated, all with the same center (x_a, y_a) but with 3 different aspect ratios and 3 different scales.
3. Each window is passed to 2 FC layers:
• Classify object / no object (binary softmax)
• Bounding-box regression
4. For each window, k = 9 predictions are made, one per anchor box. The output is 4k (36) numbers for the Bbox regression and 2k (18) numbers for the binary softmax classifier.
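Anchor generation at one window center can be sketched as follows. The base size of 16 px, scales (8, 16, 32), and aspect ratios (0.5, 1, 2) are the commonly cited Faster R-CNN defaults, assumed here for illustration:

```python
def make_anchors(xc, yc, base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return 9 anchor boxes (x0, y0, x1, y1) centered at (xc, yc)."""
    anchors = []
    for s in scales:
        for r in ratios:
            area = (base * s) ** 2       # scale fixes the anchor's area
            w = (area / r) ** 0.5        # width shrinks as ratio (h/w) grows
            h = w * r                    # so that w * h == area for every ratio
            anchors.append((xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2))
    return anchors
```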
Region proposals Network (RPN) cont.
RPN Loss function:
Positive label – the anchor has the highest IoU with a ground-truth box, or has IoU > 0.7 with any ground-truth box.
Negative label – IoU < 0.3 for all ground-truth boxes.

p* = 1 if IoU > 0.7; −1 if IoU < 0.3; 0 otherwise

Objective function with multi-task loss, similar to Fast R-CNN:

L(p_i, t_i) = L_cls(p_i, p_i*) + λ·p_i*·L_reg(t_i, t_i*)

Where p_i* is 1 if the anchor is labeled positive and 0 if negative.
λ = 10 biases the loss towards better box locations.
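The anchor-labeling rule above can be sketched given a precomputed IoU table; the table layout and function name are illustrative assumptions:

```python
def label_anchors(ious):
    """ious[a][g] = IoU of anchor a with ground-truth box g.
    Returns +1 (positive), -1 (negative) or 0 (ignored) per anchor."""
    labels = []
    # the single best anchor for each ground-truth box is also positive,
    # even if its IoU never exceeds 0.7
    best_per_gt = set()
    n_gt = len(ious[0]) if ious else 0
    for g in range(n_gt):
        best_per_gt.add(max(range(len(ious)), key=lambda a: ious[a][g]))
    for a, row in enumerate(ious):
        top = max(row) if row else 0.0
        if top > 0.7 or a in best_per_gt:
            labels.append(1)     # positive
        elif top < 0.3:
            labels.append(-1)    # negative
        else:
            labels.append(0)     # ignored during training
    return labels
```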
Fast & Faster RCNN – Summary
Faster RCNN multi-task loss:
1. RPN – classify object yes/no
2. RPN – regress box coordinates
3. Final classification score (object classes)
4. Final box coordinates
Detection without proposals
Detection without region proposals
YOLO
YOLO Architecture
• A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.
• 24 convolution layers
• 2 FC layers
Yolo – You Only Look Once

Output per grid cell = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]

[Redmon et al., 2015, You Only Look Once: Unified, Real-Time Object Detection]
1 – pedestrian
2 – car
3 – motorcycle
For a 3 × 3 grid, the output size is 3 × 3 × 8 (grid cells × (offsets + classes)).
• Classification + localization per cell
Example – 2 anchor boxes:

Output per cell = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3] repeated once per anchor box (anchor box 1, anchor box 2)

[Redmon et al., 2015, You Only Look Once: Unified, Real-Time Object Detection]
1 – pedestrian
2 – car
3 – motorcycle
The output size is 3 × 3 × 2 × 8 (grid cells × anchor boxes × (offsets + classes)).
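With an S×S grid, B anchor boxes, and C classes, the output tensor shape follows directly from the layout above. A small sketch (helper names are hypothetical):

```python
def yolo_output_shape(s, b, c):
    """Each box carries (p_c, b_x, b_y, b_h, b_w) plus C class scores."""
    return (s, s, b, 5 + c)

def yolo_output_numel(s, b, c):
    """Total number of values the network must output."""
    s1, s2, b1, d = yolo_output_shape(s, b, c)
    return s1 * s2 * b1 * d
```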
YOLO loss function – terms:
• Bbox coordinate error
• Bbox w/h error
• Class error when an object is found
• Class error when no object is found
• "Is there an object" (confidence) error

1_i^obj = 1 if there is an object in the i-th cell
1_ij^obj = 1 if there is an object in the i-th cell and anchor box j predicted it
λ_coord, λ_noobj are optimization parameters that give some loss parts a higher/lower effect
What to do with overlapping objects?
Overlapping objects: when two objects' midpoints fall in the same grid cell, each object is assigned to a different anchor box (anchor box 1, anchor box 2).
[Redmon et al., 2015, You Only Look Once: Unified, Real-Time Object Detection]
Previously:
Each object in a training image is assigned to the grid cell that contains the object's midpoint.
With two anchor boxes:
Each object is assigned to the grid cell that contains its midpoint, and within that cell to the anchor box with the highest IoU.
How to choose the anchor boxes ?
YOLO – Choosing the anchor boxes
• Ad hoc – define default anchor boxes (in practice 9: 3 scales × 3 aspect ratios).
• Using k-means clustering – use the data set to derive the anchor boxes:
  IoU is used as the distance metric for k-means.
  Using box width and height as features, we compute the cluster centers.
  The calculation is repeated for various numbers of clusters, choosing the number that best trades off mean IoU against anchor-box overlap.
• Example taken from: Vivek Yadav, Jul 10, 2017.
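The clustering idea can be sketched with 1 − IoU as the distance, comparing (w, h) pairs as if the boxes were centered at the origin. A toy sketch, not a reference implementation:

```python
import random

def iou_wh(a, b):
    """IoU of two boxes given only (w, h), as if sharing a center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """boxes: list of (w, h); returns k anchor (w, h) cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for wh in boxes:
            # assign to the center with the highest IoU (lowest 1 - IoU)
            best = max(range(k), key=lambda c: iou_wh(wh, centers[c]))
            clusters[best].append(wh)
        centers = [
            (sum(w for w, _ in cl) / len(cl), sum(h for _, h in cl) / len(cl))
            if cl else centers[c]
            for c, cl in enumerate(clusters)
        ]
    return centers
```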
How do we deal with detecting the same object more than once?
Non-max suppression example
(Figure: multiple candidate boxes per object with confidence scores such as 0.9, 0.8, 0.7, 0.6, 0.5, 0.3, clustered around each object's midpoint; NMS keeps the highest-scoring box per object.)
Non-max suppression
Each output prediction is: [p_c, b_x, b_y, b_h, b_w]
• Discard all boxes with p_c ≤ 0.6.
• While there are any remaining boxes:
  • Pick the box with the largest p_c and output it as a prediction.
  • Discard any remaining box with IoU ≥ 0.5 with the box output in the previous step.
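The NMS procedure above can be sketched directly. Boxes here are assumed to be (p_c, x0, y0, x1, y1) tuples, with a standard box-IoU helper:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes (x0, y0, x1, y1)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def non_max_suppression(preds, p_thresh=0.6, iou_thresh=0.5):
    # discard all boxes with p_c <= threshold
    boxes = [p for p in preds if p[0] > p_thresh]
    kept = []
    while boxes:
        # pick the box with the largest p_c and output it
        best = max(boxes, key=lambda p: p[0])
        kept.append(best)
        # discard remaining boxes that overlap it too much
        boxes = [p for p in boxes
                 if p is not best and iou(p[1:], best[1:]) < iou_thresh]
    return kept
```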
Outputting the non-max suppressed outputs
• For each grid cell, get the 2 predicted bounding boxes.
• Get rid of low probability predictions.
• For each class (pedestrian, car, motorcycle) use non-max suppression to generate final predictions.
SSD
SSD – Single Shot Detector
Main differences from YOLO