Deformable Convolutional Networks - cocodataset.orgpresentations.cocodataset.org/COCO17-Detect-MSRA.pdfJifeng Dai With Haozhi Qi, Zheng Zhang, Bin Xiao, Han Hu, Bowen Cheng, Yichen

Jifeng DaiWith Haozhi Qi*, Zheng Zhang, Bin Xiao, Han Hu, Bowen Cheng*, Yichen Wei

Visual Computing Group

Microsoft Research Asia

(*interns at MSRA)

Deformable Convolutional Networks-- MSRA COCO Detection & Segmentation Challenge 2017 Entry

Outline

• Deformable ConvNets idea

• Deformable ConvNets for COCO challenge

Highlights

• Enabling effective modeling of spatial transformation in ConvNets

• No additional supervision for learning spatial transformation

• Significant accuracy improvements on sophisticated vision tasks

Code is available at https://github.com/msracver/Deformable-ConvNets

Modeling Spatial Transformations

• A long standing problem in computer visionDeformation: Scale:

Viewpoint variation: Intra-class variation:

(Some examples are taken from Li Fei-fei’s course CS223B, 2009-2010)

Traditional Approaches

• 1) To build training datasets with sufficient desired variations

• 2) To use transformation-invariant features and algorithms

• Drawbacks: geometric transformations are assumed fixed and known, hand-crafted design of invariant features and algorithms

Scale Invariant Feature Transform (SIFT) Deformable Part-based Model (DPM)

Spatial Transformations in CNNs

• Regular CNNs are inherently limited to model large unknown transformations• The limitation originates from the fixed geometric structures of CNN modules

regular convolution regular RoI Pooling2 layers of regular convolution

Spatial Transformer Networks

• Learning a global, parametric transformation on feature maps• Prefixed transformation family, infeasible for complex vision tasks

Deformable Convolution

• Local, dense, non-parametric transformation• Learning to deform the sampling locations in the convolution/RoI Pooling modules

regular deformed scale & aspect ratio rotation

Deformable Convolution

Regular convolution

Deformable convolution

where is generated by a sibling branch of regular convolution

Deformable RoI Pooling

deformable RoI Pooling

Regular RoI pooling

Deformable RoI pooling

where is generated by a sibling fc branch

Deformable ConvNets

• Same input & output as the plain versions• Regular convolution -> deformable convolution

• Regular RoI pooling -> deformable RoI pooling

• End-to-end trainable without additional supervision

Sampling Locations of Deformable Convolution

(a) standard convolution (b) deformable convolution

Part Offsets in Deformable RoI Pooling

Deformable ConvNets for Object Detection

Fast(er) RCNN R-FCN FPN

RoI Pooling

Car Person

Car

Person

Position SensitiveRoI Pooling

Car Person

RoI Pooling

Conv5 Position SensitiveScore Map

Conv3

Conv4

Conv5 P5

P4

P3

• Regular object detectors

Deformable ConvNets for Object Detection

Fast(er) RCNN R-FCN FPN

DeformableRoI Pooling

Car Person

Car

Person

DeformablePS RoI Pooling

Car Person

DeformableRoI Pooling

: Deformable Convolution / RoI Pooling

Conv5 Position SensitiveScore Map

Conv3

Conv4

Conv5 P5

P4

P3

• Deformable object detectors

XCeption -> Aligned XCeption

14x14x728 feature maps

separable conv 728, 3x3, pad 1


max pool 3x3, stride2, pad 1

conv 1024 1x1 stride2, pad 0





entry flow

conv 32, 3x3, stride2, pad 1

conv 64, 3x3, pad 1



max pool 3x3, stride 2, pad 1

conv 128 1x1 stride2, pad 0





224x224x3 images


separable conv 256, 3x3, pad 1conv 256 1x1








conv 256 1x1, stride2, pad 0

conv 728 1x1

conv 728 1x1, stride2, pad 0





Repeat 16 times


middle flow

exit flow

• Proper feature alignment in XCeption• Efficient: 9.5 GFLOPS on 224*224 img (ResNet-101, 7.6 GFLOPS)

• Accurate: mAP 2.8% better than ResNet-101 using FPN on COCO (det, test-dev)

Object Detection on COCO (Test-dev)

• MSRA 2017 Entry• ~3% mAP improvements by Deformable ConvNets

• Best single model performance: 48.5%

37.4

40.7

41.6

45.2

40.5

43.3

44.2

45.3

46.9

47.9

48.5

50.7

30 35 40 45 50

FPN+OHEM (RESNET-101)

FPN+OHEM (ALIGNED XCEPTION)

+ MASK

+ SOFT NMS

+ MULTI-SCALE TESTING

+ ITERATIVE TESTING

+ HORIZONTAL FLIP

+ ENSEMBLE (6 MODELS)

mAP (%)

Deformable Regular

Object Detection on COCO (Test-dev)

• Deformable ConvNets v.s. regular ConvNets• Noticeable improvements for varies baselines

• Marginal parameter & computation overhead

23.2

30.3

32.1

34.5

37.4

40.2

45.2

25.8

35

35.7

37.5

40.5

43.3

48.5

20 25 30 35 40 45 50

CLASS-AWARE RPN (RESNET-101)

FASTER R-CNN, 2FC (RESNET-101)

R-FCN (RESNET-101)

R-FCN (ALIGNED-INCEPTION-RESNET)

FPN+OHEM (RESNET-101)

FPN+OHEM (ALIGNED-XCEPTION)

FPN++ (ALIGNED-XCEPTION)

mAP (%)

Deformable Regular

Conclusion

• Deformable ConvNets for dense spatial modeling• Simple, efficient, deep, and end-to-end

• No additional supervision

• Feasible and effective on sophisticated vision tasks for the first time

• Our team

Jifeng DaiHaozhi Qi* Yichen Wei

* interns at MSRA

Zheng Zhang Bowen Cheng*Bin Xiao Han Hu

Documents

Deformable Convolutional Networks - cocodataset.orgpresentations.cocodataset.org/COCO17-Detect-MSRA.pdfJifeng Dai With Haozhi Qi*, Zheng Zhang, Bin Xiao, Han Hu, Bowen Cheng*, Yichen

Deformable Convolutional Networks - cocodataset.orgpresentations.cocodataset.org/COCO17-Detect-MSRA.pdfJifeng Dai With Haozhi Qi, Zheng Zhang, Bin Xiao, Han Hu, Bowen Cheng, Yichen