1
IEEE 2017 Conference on Computer Vision and Pattern Recognition : Difficulty-Aware Semantic Segmentation via Deep Layer Cascade Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou Tang Department of Information Engineering, The Chinese University of Hong Kong {lx015,lz013,pluo,ccloy,xtang}@ie.cuhk.edu.hk IEEE 2017 Conference on Computer Vision and Pattern Recognition 1. Introduction Motivation: Partition all pixels into three sets by classification confidence Nearly 40% easy region Task & General Approaches: Semantic Segmentation ConvNets Bus Human 72.5% 0.9% 1.1% 1.0% 0.8% 0.9% 1.8% 1.7% 2.6% 1.0% 1.3% 0.6% 2.2% 1.1%1.2% 4.8% 0.7% 0.8% 1.2% 1.4% 0.6% 6.7% 19.0% 68.0% bkg areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv Easy Mode… Hard Percentage of pixels on VOC Easy Moderate Hard Pixels on Boundary(%) background cow Easy Moderate Hard (a) (b) Existing Works: Increase the model capacity to achieve promising results High runtime complexity Backbone Network Speed(300*500 image) VGG16 5.7 fps ResNet-101 7.1 fps Inception-ResNet 9.0 fps Layer Cascade 14.7 fps Layer Cascade (fast) 23.6 fps Turning Inception-ResNet into Deep Layer Cascade (LC) (a) IRNet Image Stem Reduction-A 5IRNet-A 10IRNet-B Reduction-B 5IRNet-C pooling Fully-Connected horse Softmax background horse person car unknown stage-1 stage-2 stage-3 Image Stem Reduction-A 5IRNet-A 10IRNet-B Reduction-B 5IRNet-C Conv Conv Conv (b) IRNet after LC I L 1 L 2 L 3 2. Approach Our Idea: Treats a single deep model as a cascade of several sub-models Earlier sub-models are trained to handle easy and confident regions Feed-forward harder regions to the next sub-model for processing Stage-1 handles the easy regions (background) Stage-2 focus more on the foreground than stage-1 does Stage-3 further focus on harder classes 1 1.2 1.4 1.6 1.8 2 areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv Label Percentage ratio stage-3 1 1.2 1.4 1.6 1.8 2 areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv Label Percentage ratio stage-2 (a) Ablation Study on Probability Thresholds ρ ρ controls percentage of easy and hard regions 66% 68% 70% 72% 74% 41 46 51 56 61 66 mIoU Running Time (ms) 15.7 15.6 21.7 31.2 4.7 18.2 Time Spent in Each Stage (ms) stage-3 stage-2 stage-1 0.93 0.95 0.97 0.98 0.985 0.90 Performance and Speed Trade-off 3. Experiments (a) Convolution (b) Region Convolution M Region Convolution + (c) Region Convolution in Residual Stage-wise Label Distribution 10.3% 45.9% 43.7% car 8.2% 39.2% 52.6% cat 6.6% 16.6% 76.8% chair 4.9% 14.3% 80.8% table stage-1 stage-2 stage-3 Percentage of pixels (b) (a) input image (e) ground truth (b) stage1 (c) stage2 (d) stage3 4. Stages’ Outputs 5. Overall Performance Comparisons with DeepLab & SegNet 6. Conclusion LC adopts a “difficulty-aware” learning paradigm. LC accelerates both training and testing by the usage of region convolution. It is capable of running in real-time. LC is an end-to-end trainable framework that jointly optimizes the feature learning for different regions. 80.3 82.7

background IEEE 2017 Conference on car 0.90 0 · 2020-07-27 · IEEE 2017 Conference on Computer Vision and Pattern Recognition : Difficulty-Aware Semantic Segmentation via Deep Layer

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: background IEEE 2017 Conference on car 0.90 0 · 2020-07-27 · IEEE 2017 Conference on Computer Vision and Pattern Recognition : Difficulty-Aware Semantic Segmentation via Deep Layer

IEEE 2017 Conference on

Computer Vision and Pattern

Recognition

: Difficulty-Aware Semantic Segmentation via Deep Layer CascadeXiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou Tang

Department of Information Engineering, The Chinese University of Hong Kong

{lx015,lz013,pluo,ccloy,xtang}@ie.cuhk.edu.hk

IEEE 2017 Conference on

Computer Vision and Pattern

Recognition

1. Introduction

• Motivation:

Partition all pixels into three sets by classification confidence

Nearly 40% easy region

• Task & General Approaches:

Semantic Segmentation

ConvNets

Bus

Human

72.5%

0.9%1.1%1.0%0.8%

0.9%

1.8%1.7%

2.6%

1.0%1.3%

0.6%

2.2%

1.1%1.2%

4.8%

0.7%0.8%1.2%1.4%

0.6%6.7%

19.0%

68.0%

bk

g

are

o

bik

e

bir

d

bo

at

bo

ttle

bu

s

car

cat

cha

ir

cow

tab

le

do

g

ho

rse

mb

ike

per

son

pla

nt

shee

p

sofa

train tv

Ea

sy

Mode…

Ha

rdPer

cen

tage

of

pix

els

on

VO

C

Easy Moderate Hard

Pix

els on

Bo

un

da

ry(%

)

background cow Easy Moderate Hard

(a)

(b)

• Existing Works:

Increase the model capacity to achieve promising results

High runtime complexity

Backbone Network Speed(300*500 image)

VGG16 5.7 fps

ResNet-101 7.1 fps

Inception-ResNet 9.0 fps

Layer Cascade 14.7 fps

Layer Cascade (fast) 23.6 fps

• Turning Inception-ResNet into Deep Layer Cascade (LC)

(a) IRNet

Image Stem Reduction-A5 IRNet-A 10 IRNet-B Reduction-B 5 IRNet-C

pooling

Fully-Connected

horse

Softmax

background

horse

person

car

unknown

stag

e-1

stag

e-2

stag

e-3

Image Stem Reduction-A5 IRNet-A 10 IRNet-B Reduction-B 5 IRNet-C Conv

ConvConv

(b) IRNet after LC

I

L1 L2

L3

2. Approach

• Our Idea:

Treats a single deep model as a cascade of several sub-models

Earlier sub-models are trained to handle easy and confident regions

Feed-forward harder regions to the next sub-model for processing

Stage-1 handles the easy regions (background)

Stage-2 focus more on the foreground than stage-1 does

Stage-3 further focus on harder classes

1

1.2

1.4

1.6

1.8

2

are

o

bik

e

bir

d

boa

t

bott

le

bu

s

car

cat

chair

cow

tab

le

dog

hors

e

mb

ike

per

son

pla

nt

shee

p

sofa

train tv

Lab

el P

erce

nta

ge

rati

o stage-3

1

1.2

1.4

1.6

1.8

2

are

o

bik

e

bir

d

boa

t

bott

le

bu

s

car

cat

chair

cow

tab

le

dog

hors

e

mb

ike

per

son

pla

nt

shee

p

sofa

train tv

Lab

el P

erce

nta

ge

rati

o stage-2

10.3%

45.9%43.7%

car

8.2%

39.2%52.6%

cat

6.6%

16.6%

76.8%

chair

4.9%

14.3%

80.8%

table

stage-1 stage-2 stage-3

Percentage of pixels

(a) (b)

• Ablation Study on Probability Thresholds ρ

ρ controls percentage of easy and hard regions

66%

68%

70%

72%

74%

41 46 51 56 61 66

mIo

U

Running Time (ms)

15.7 15.6

21.731.2

4.7

18.2

Time Spent in Each Stage (ms)

stage-3

stage-2

stage-1

0.93

0.95

0.97

0.98

0.985

0.90

• Performance and Speed Trade-off

3. Experiments

(a) Convolution (b) Region Convolution

M

• Region Convolution

+

(c) Region Convolution in Residual

• Stage-wise Label Distribution

1

1.2

1.4

1.6

1.8

2

are

o

bik

e

bir

d

boa

t

bott

le

bu

s

car

cat

chair

cow

tab

le

dog

hors

e

mb

ike

per

son

pla

nt

shee

p

sofa

train tv

La

bel

Per

cen

tag

e ra

tio stage-3

1

1.2

1.4

1.6

1.8

2

are

o

bik

e

bir

d

boa

t

bott

le

bu

s

car

cat

chair

cow

tab

le

dog

hors

e

mb

ike

per

son

pla

nt

shee

p

sofa

train tv

La

bel

Per

cen

tag

e ra

tio stage-2

10.3%

45.9%43.7%

car

8.2%

39.2%52.6%

cat

6.6%

16.6%

76.8%

chair

4.9%

14.3%

80.8%

table

stage-1 stage-2 stage-3

Percentage of pixels

(a) (b)

(a) input image (e) ground truth(b) stage1 (c) stage2 (d) stage3

4. Stages’ Outputs

5. Overall Performance

• Comparisons with DeepLab & SegNet

6. Conclusion

• LC adopts a “difficulty-aware” learning

paradigm.

• LC accelerates both training and testing by the

usage of region convolution. It is capable of

running in real-time.

• LC is an end-to-end trainable

framework that jointly optimizes

the feature learning for different

regions.80.3

82.7