IEEE 2017 Conference on
Computer Vision and Pattern
Recognition
Difficulty-Aware Semantic Segmentation via Deep Layer Cascade
Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou Tang
Department of Information Engineering, The Chinese University of Hong Kong
{lx015,lz013,pluo,ccloy,xtang}@ie.cuhk.edu.hk
1. Introduction
• Motivation:
Pixels can be partitioned into three sets (easy, moderate, hard) by classification confidence
Nearly 40% of pixels fall in the easy region
• Task & General Approaches:
Semantic Segmentation
ConvNets
[Figure: example segmentation with the labels "Bus" and "Human".]
[Figure, panels (a) and (b): percentage of pixels on VOC for each class (bkg, aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, mbike, person, plant, sheep, sofa, train, tv), with the split into Easy, Moderate and Hard sets; pixels on boundary (%) for the Easy, Moderate and Hard sets, shown for background and cow.]
• Existing Works:
Increase the model capacity to achieve promising results
High runtime complexity
Backbone Network        Speed (300 × 500 image)
VGG16                   5.7 fps
ResNet-101              7.1 fps
Inception-ResNet        9.0 fps
Layer Cascade           14.7 fps
Layer Cascade (fast)    23.6 fps
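The speeds above are easier to compare as per-image latency and as a speed-up over the Inception-ResNet baseline. The snippet below is only a small arithmetic sketch over the table values; the helper names `latency_ms` and `speedup` are introduced here for illustration.

```python
# Frames per second from the table above (300 x 500 input).
fps = {
    "VGG16": 5.7,
    "ResNet-101": 7.1,
    "Inception-ResNet": 9.0,
    "Layer Cascade": 14.7,
    "Layer Cascade (fast)": 23.6,
}

def latency_ms(frames_per_second):
    """Milliseconds spent on one image at the given frame rate."""
    return 1000.0 / frames_per_second

def speedup(name, baseline="Inception-ResNet"):
    """Frame-rate ratio of `name` relative to the baseline network."""
    return fps[name] / fps[baseline]

for name in fps:
    print(f"{name}: {latency_ms(fps[name]):.1f} ms/image, "
          f"{speedup(name):.2f}x vs Inception-ResNet")
```

Under these numbers, Layer Cascade takes about 68 ms per image, roughly 1.6 times the frame rate of the Inception-ResNet backbone it is built on.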
• Turning Inception-ResNet into Deep Layer Cascade (LC)
[Figure (a) IRNet: Image, Stem, IRNet-A (×5), Reduction-A, IRNet-B (×10), Reduction-B, IRNet-C (×5), pooling, Fully-Connected, Softmax; example output label: horse.]
[Figure (b) IRNet after LC: the same backbone split into stage-1, stage-2 and stage-3, each followed by a Conv classifier, with outputs L1, L2 and L3 for input I; legend: background, horse, person, car, unknown.]
2. Approach
• Our Idea:
Treat a single deep model as a cascade of several sub-models
Earlier sub-models are trained to handle easy and confident regions
Harder regions are fed forward to the next sub-model for processing
Stage-1 handles the easy regions (background)
Stage-2 focuses more on the foreground than stage-1 does
Stage-3 further focuses on harder classes
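The cascade idea above can be sketched as a per-pixel early exit. This is a minimal illustration, not the authors' implementation: it assumes each stage's class probabilities are already available as nested lists, and the function name `layer_cascade_predict` and the default threshold are hypothetical. A real cascade would not compute later stages for pixels that have already exited.

```python
def layer_cascade_predict(stage_probs, rho=0.95):
    """Per-pixel early exit over precomputed stage outputs.

    stage_probs[k][i][j] is the list of class probabilities that stage k
    assigns to pixel (i, j). A pixel takes its label from the first stage
    whose top probability is at least rho; the last stage labels the rest.
    Returns (labels, exit_stage), both H x W nested lists.
    """
    height, width = len(stage_probs[0]), len(stage_probs[0][0])
    labels = [[-1] * width for _ in range(height)]
    exit_stage = [[-1] * width for _ in range(height)]
    last = len(stage_probs) - 1

    for k, probs in enumerate(stage_probs):
        for i in range(height):
            for j in range(width):
                if exit_stage[i][j] != -1:
                    continue  # already decided by an earlier stage
                p = probs[i][j]
                best = max(range(len(p)), key=p.__getitem__)
                if p[best] >= rho or k == last:
                    labels[i][j] = best
                    exit_stage[i][j] = k
    return labels, exit_stage


# Toy 1 x 2 image, two classes, two stages (made-up numbers).
stage1 = [[[0.99, 0.01], [0.60, 0.40]]]
stage2 = [[[0.50, 0.50], [0.10, 0.90]]]
labels, exit_stage = layer_cascade_predict([stage1, stage2], rho=0.95)
print(labels, exit_stage)  # [[0, 1]] [[0, 1]]
```

In the toy example the first pixel is confident at stage 1 and exits there, while the second is passed on and labeled by the final stage.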
• Ablation Study on Probability Thresholds ρ
ρ controls the percentage of easy and hard regions
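A small sketch of how the threshold shifts pixels between the easy and hard sets. The confidence values are made up for illustration (they are not data from the poster), and `easy_fraction` is a hypothetical helper; the ρ values are the ones swept in the ablation.

```python
def easy_fraction(confidences, rho):
    """Fraction of pixels whose top-class confidence is at least rho."""
    return sum(c >= rho for c in confidences) / len(confidences)

# Made-up stage-1 confidences for ten pixels (not data from the poster).
conf = [0.999, 0.99, 0.98, 0.97, 0.96, 0.94, 0.91, 0.80, 0.65, 0.50]

for rho in (0.90, 0.93, 0.95, 0.97, 0.98, 0.985):
    print(f"rho={rho}: {easy_fraction(conf, rho):.0%} of pixels exit early")
```

A lower ρ lets more pixels exit at an early stage, which should reduce running time but risks accepting less reliable predictions; a higher ρ sends more pixels to the later, more expensive stages.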
[Figure: mIoU (axis 66% to 74%) versus running time (ms, axis 41 to 66), one point per threshold ρ = 0.90, 0.93, 0.95, 0.97, 0.98, 0.985; inset "Time Spent in Each Stage (ms)" for stage-1, stage-2 and stage-3, with extracted bar values 15.7, 15.6, 21.7, 31.2, 4.7, 18.2.]
• Performance and Speed Trade-off
3. Experiments
• Region Convolution
[Figure: (a) Convolution; (b) Region Convolution, applied within a mask M; (c) Region Convolution in Residual, where the result is added (+) to the input.]
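The three panels can be illustrated with a single-channel toy version. This is a sketch under stated assumptions, not the poster's implementation: it uses a 3 × 3 kernel with zero padding on nested lists, treats M as a binary mask, and the function names `region_conv` and `residual_region_conv` are hypothetical.

```python
def region_conv(x, kernel, mask):
    """3x3 'same' convolution (zero padding, cross-correlation form)
    evaluated only where mask is 1; other outputs are left at 0."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if not mask[i][j]:
                continue  # skipped pixel: no multiply-adds spent here
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di + 1][dj + 1] * x[ii][jj]
            out[i][j] = acc
    return out


def residual_region_conv(x, kernel, mask):
    """Residual form: x + region_conv(x), so masked-out pixels keep x."""
    r = region_conv(x, kernel, mask)
    return [[x[i][j] + r[i][j] for j in range(len(x[0]))]
            for i in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
ones = [[1.0] * 3 for _ in range(3)]
mask = [[1, 0], [0, 0]]  # only the top-left pixel is still "hard"
print(region_conv(x, ones, mask))           # [[10.0, 0.0], [0.0, 0.0]]
print(residual_region_conv(x, ones, mask))  # [[11.0, 2.0], [3.0, 4.0]]
```

In the residual form, pixels outside the mask pass through unchanged, so later layers still see valid features there while the computation is spent only on the masked region.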
• Stage-wise Label Distribution
[Figure (a): label percentage ratio (axis 1 to 2) per class (aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, mbike, person, plant, sheep, sofa, train, tv), one panel for stage-2 and one for stage-3.]
[Figure (b): percentage of pixels per stage (legend: stage-1, stage-2, stage-3) for four classes; extracted values: car 10.3%, 45.9%, 43.7%; cat 8.2%, 39.2%, 52.6%; chair 6.6%, 16.6%, 76.8%; table 4.9%, 14.3%, 80.8%.]
4. Stages’ Outputs
[Figure: (a) input image, (b) stage-1, (c) stage-2, (d) stage-3, (e) ground truth.]
5. Overall Performance
• Comparisons with DeepLab & SegNet
6. Conclusion
• LC adopts a “difficulty-aware” learning paradigm.
• LC accelerates both training and testing by using region convolution. It is capable of running in real time.
• LC is an end-to-end trainable framework that jointly optimizes the feature learning for different regions.
[Unlabeled values left from the poster layout: 80.3, 82.7]