
Triply Supervised Decoder Networks for Joint Detection and Segmentation

Jiale Cao 1, Yanwei Pang 1*, Xuelong Li 2
1 School of Electrical and Information Engineering, Tianjin University
2 Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences
[email protected], [email protected], xuelong [email protected]

Abstract

Joint object detection and semantic segmentation can be applied to many fields, such as self-driving cars and unmanned surface vessels. Initial and important progress towards this goal has been achieved by simply sharing the deep convolutional features between the two tasks. However, this simple scheme is unable to make full use of the fact that detection and segmentation are mutually beneficial. To overcome this drawback, we propose a framework called TripleNet, where triple supervisions, including detection-oriented supervision, class-aware segmentation supervision, and class-agnostic segmentation supervision, are imposed on each layer of the decoder network. Class-agnostic segmentation supervision provides an objectness prior for both semantic segmentation and object detection. Besides the three types of supervisions, two light-weight modules (i.e., the inner-connected module and attention skip-layer fusion) are also incorporated into each layer of the decoder. In the proposed framework, detection and segmentation can sufficiently boost each other. Moreover, class-agnostic and class-aware segmentation on each decoder layer are not performed at the test stage. Therefore, no extra computational costs are introduced at the test stage. Experimental results on the VOC2007 and VOC2012 datasets demonstrate that the proposed TripleNet is able to improve both the detection and segmentation accuracies without adding extra computational costs.

1. Introduction

Object detection and semantic segmentation are two fundamental and important tasks in the field of computer vision. In recent years, object detection [36, 29, 26] and semantic segmentation [30, 5, 1] with deep convolutional networks [20, 39, 14, 17] have achieved great progress, respectively. Most state-of-the-art methods focus on only a single task and do not join object detection and semantic segmentation together. However, joint object detection and semantic segmentation is very necessary and important in many applications, such as self-driving cars and unmanned surface vessels.

Figure 1. Some architectures of joint detection and segmentation. (a) The last layer of the encoder is used for detection and segmentation [2]. (b) The branch for detection is refined by the branch for segmentation [31, 47]. (c) Each layer of the decoder detects objects of different scales, and the fused layer is used for segmentation [7]. (d) The proposed PairNet. Each layer of the decoder is simultaneously used for detection and segmentation. (e) The proposed TripleNet, which has three types of supervisions and some light-weight modules.

In fact, object detection and semantic segmentation are highly related. On the one hand, semantic segmentation, usually used as a multi-task supervision, can help improve object detection [31, 24]. On the other hand, object detection can be used as a prior knowledge to help improve the performance of semantic segmentation [14, 34].

Due to application requirements and task relevance, joint object detection and semantic segmentation has gradually attracted the attention of researchers. Fig. 1 summarizes three typical methods of joint object detection and semantic segmentation. Fig. 1(a) shows the simplest and most naive way, where one branch for object detection and one branch for semantic segmentation are attached in parallel to the last layer of the encoder [2]. In Fig. 1(b), the branch for object detection is further refined by the features from the branch for semantic segmentation [31, 47]. Recently, the encoder-decoder network has been further used for joint object detection and semantic segmentation.


In Fig. 1(c), each layer of the decoder is used for multi-scale object detection, and the concatenated feature map from different layers of the decoder is used for semantic segmentation [7]. The above methods have achieved great success for detection and segmentation. However, the performance is still far from the strict demands of real applications such as self-driving cars and unmanned surface vessels. One possible reason is that the mutual benefit between the two tasks is not fully exploited.

To exploit the mutual benefit between object detection and semantic segmentation, in this paper we propose to impose three types of supervisions (i.e., detection-oriented supervision, class-aware segmentation supervision, and class-agnostic segmentation supervision) on each layer of the decoder network. Meanwhile, two light-weight modules (i.e., the inner-connected module and attention skip-layer fusion) are also incorporated. The corresponding network is called TripleNet (see Fig. 1(e)). It is noted that we also propose to impose only the detection-oriented supervision and class-aware segmentation supervision on each layer of the decoder, which is called PairNet (see Fig. 1(d)). The contributions of this paper can be summarized as follows:

(1) Two novel frameworks (i.e., PairNet and TripleNet) for joint object detection and semantic segmentation are proposed. In TripleNet, the detection-oriented supervision, class-aware segmentation supervision, and class-agnostic segmentation supervision are imposed on each layer of the decoder. Meanwhile, two light-weight modules (i.e., the inner-connected module and attention skip-layer fusion) are also incorporated into each layer of the decoder.

(2) A lot of synergies are gained from TripleNet. Both detection and segmentation accuracies are significantly improved. The improvement is not at the expense of extra computational costs, because the class-agnostic segmentation and class-aware segmentation are not performed in each layer of the decoder at the test stage.

(3) Experiments on the VOC 2007 and VOC 2012 datasets are conducted to demonstrate the effectiveness of the proposed TripleNet.

The rest of this paper is organized as follows. Section 2 reviews related works on object detection and semantic segmentation. Section 3 introduces the proposed method in detail. Experiments are presented in Section 4. Finally, Section 5 concludes the paper.

2. Related works

In this section, a review of object detection and semantic segmentation is first given. After that, some related works on joint object detection and semantic segmentation are introduced.

Object detection It aims to classify and locate objects in an image. Generally, the methods of object detection can be divided into two main classes: two-stage methods and one-stage methods. Two-stage methods first extract candidate object proposals from an image and then classify these candidate proposals into specific object categories. R-CNN [11] and its variants (e.g., Fast R-CNN [10] and Faster R-CNN [36]) are the most representative frameworks among the two-stage methods. Based on the R-CNN series, researchers have made many improvements [6, 22, 4]. To accelerate detection speed, Dai et al. [6] proposed R-FCN, which uses position-sensitive feature maps for proposal classification and bounding box regression. To output multi-scale feature maps with strong semantics, Lin et al. [22] proposed the feature pyramid network (FPN) based on skip-layer connections and a top-down pathway. Recently, Cai et al. [4] trained a sequence of object detectors with increasing IoU thresholds to improve detection quality.

One-stage methods directly predict object classes and bounding boxes in a single network. YOLO [35] and SSD [29] are two of the earliest one-stage methods. After that, many variants were proposed [9, 18, 38, 50]. DSSD [9] and RON [18] use the encoder-decoder network to add context information for multi-scale object detection. To train an object detector from scratch, DSOD [38] uses dense layer-wise connections on SSD for deep supervision. Instead of using in-network feature maps of different resolutions for multi-scale object detection, STDN [50] uses a scale-transferrable module to generate high-resolution feature maps of different scales from the last feature map. To solve class imbalance in the training stage, RetinaNet [26] introduces the focal loss to down-weight the contribution of easy samples.

Semantic segmentation It aims to predict the semantic label of each pixel in an image and has achieved significant progress based on fully convolutional networks (i.e., FCN [30]). Generally, the methods of semantic segmentation can also be divided into two main classes: encoder-decoder methods and spatial pyramid methods. Encoder-decoder methods contain two subnetworks: an encoder subnetwork and a decoder subnetwork. The encoder subnetwork extracts strong semantic features and reduces the spatial resolution of feature maps; it is usually based on classical CNN models (e.g., VGG [39], ResNet [15], DenseNet [17]) pre-trained on ImageNet [37]. The decoder subnetwork gradually upsamples the feature maps of the encoder subnetwork. DeconvNet [32] and SegNet [1] use the max-pooling indices of the encoder subnetwork to upsample the feature maps. To extract context information, some methods [33, 25, 45] adopt skip-layer connections to combine the feature maps from the encoder and decoder subnetworks.

Spatial pyramid methods adopt the idea of spatial pyramid pooling [13] to extract multi-scale information from the last output feature maps. Chen et al. [5, 3, 42, 43] proposed to use multiple convolutional layers with different atrous rates in parallel (called ASPP) to extract multi-scale features.

Figure 2. The proposed PairNet for joint object detection and semantic segmentation. (a) The detailed architecture of PairNet. Each layer of the decoder is simultaneously used for detection and segmentation. (b) The skip-layer fusion used in PairNet.

Instead of using convolutional layers with different atrous rates, Zhao et al. [49] proposed a pyramid pooling module (called PSPNet), which downsamples and upsamples the feature maps in parallel. Yang et al. [43] proposed to use dense connections to cover the object scale range densely.

Joint object detection and semantic segmentation It aims to simultaneously detect objects and predict pixel semantic labels with a single network. Recently, researchers have made some attempts. Yao et al. [44] proposed to use a graphical model for holistic scene understanding. Teichmann et al. [40] proposed to join object detection and semantic segmentation by sharing the encoder subnetwork. Kokkinos [21] also proposed to integrate multiple computer vision tasks together. Mao et al. [31] found that joint semantic segmentation and pedestrian detection can help improve the performance of pedestrian detection. A similar conclusion is also demonstrated by SDS-RCNN [2]. Meanwhile, joint instance segmentation and object detection has also been proposed [7]. Recently, Dvornik et al. [7] proposed a real-time framework (called BlitzNet) for joint object detection and semantic segmentation. It is based on the encoder-decoder network, where each layer of the decoder is used to detect objects of different scales and a multi-scale fused layer is used for semantic segmentation.

Though joint object detection and semantic segmentation has been explored recently, we argue that there is still room for further improvement. Thus, in this paper, we give some further exploration of joint object detection and semantic segmentation.

3. The proposed methods

In recent years, fully convolutional networks (FCN) with an encoder-decoder structure have achieved great success on object detection [26, 9] and semantic segmentation [1], respectively. For example, DSSD [9, 34] and RetinaNet [26] use different layers of the decoder to detect objects of different scales. By using the encoder-decoder structure, SegNet [1] and LargeKernel [33] generate high-resolution logits for semantic segmentation. Based on the above observations, a very natural and simple idea is that an FCN with an encoder-decoder structure is suitable for joint object detection and semantic segmentation.

In this section, we give a detailed introduction to the proposed paired supervision decoder network (i.e., PairNet) and triply supervised decoder network (i.e., TripleNet) for joint object detection and semantic segmentation.

3.1. Paired supervision decoder network (PairNet)

Based on the encoder-decoder structure, a feature pyramid network is naturally proposed to join object detection and semantic segmentation. Namely, the supervision of object detection and semantic segmentation is added to each layer of the decoder, which is called PairNet. On the one hand, PairNet uses different layers of the decoder to detect objects of different scales. On the other hand, instead of using only the last high-resolution layer for semantic segmentation, which is adopted by most state-of-the-art methods [1, 33], PairNet uses each layer of the decoder to parse pixel semantic labels. Though the proposed PairNet is very simple and naive, it has not been explored for joint object detection and semantic segmentation to the best of our knowledge.

Fig. 2(a) gives the detailed architecture of PairNet. The input image first goes through a fully convolutional network with an encoder-decoder structure. The encoder gradually down-samples the feature map. In this paper, ResNet-50 or ResNet-101 [15] (i.e., res1-res4) and some newly added residual blocks (i.e., res5-res7) construct the encoder. The decoder gradually maps the low-resolution feature map to a high-resolution feature map. To enhance context information, skip-layer fusion is used to fuse the feature map from the decoder with the corresponding feature map from the encoder. Fig. 2(b) illustrates the skip-layer fusion.
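A minimal PyTorch-style sketch of such an encoder is given below for illustration: a ResNet-50 backbone (res1-res4) followed by extra stride-2 residual blocks (res5-res7). The block design and channel widths are assumptions for illustration, not the authors' exact configuration.

```python
import torch.nn as nn
import torchvision


class ExtraResBlock(nn.Module):
    """A simple stride-2 residual block standing in for the added res5-res7 stages."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=2, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))


class Encoder(nn.Module):
    """ResNet-50 stages plus extra residual blocks, returning multi-scale features."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50()
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([
            r.layer1, r.layer2, r.layer3, r.layer4,   # res1-res4
            ExtraResBlock(2048, 512),                 # res5 (assumed width)
            ExtraResBlock(512, 512),                  # res6
            ExtraResBlock(512, 512)])                 # res7

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)   # encoder feature maps at decreasing resolution
        return feats
```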

Figure 3. The proposed TripleNet for joint object detection and semantic segmentation. (a) The detailed architecture of TripleNet. (b) The inner-connected module. (c) Attention skip-layer fusion.

In the skip-layer fusion, the feature maps in the decoder are first upsampled by bilinear interpolation and then concatenated with the corresponding feature maps of the same resolution in the encoder. After that, the concatenated feature maps go through a residual unit to generate the output feature maps.
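A minimal PyTorch-style sketch of this skip-layer fusion, assuming a small residual unit on the concatenated features; channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipLayerFusion(nn.Module):
    def __init__(self, dec_ch, enc_ch, out_ch):
        super().__init__()
        # residual unit applied to the concatenated feature maps
        self.body = nn.Sequential(
            nn.Conv2d(dec_ch + enc_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1))
        self.proj = nn.Conv2d(dec_ch + enc_ch, out_ch, 1)  # match channels for the sum

    def forward(self, dec_feat, enc_feat):
        # upsample the decoder feature map to the encoder resolution
        dec_up = F.interpolate(dec_feat, size=enc_feat.shape[-2:],
                               mode='bilinear', align_corners=False)
        x = torch.cat([dec_up, enc_feat], dim=1)
        return F.relu(self.body(x) + self.proj(x))  # residual output


# usage: fused = SkipLayerFusion(512, 1024, 512)(decoder_map, encoder_map)
```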

To join object detection and semantic segmentation, each layer of the decoder is further split into two different branches. The branch for object detection consists of a 3×3 convolutional layer and two sibling 1×1 convolutional layers for object classification and bounding box regression. The detection branches at different layers are used to detect objects of different scales. Specifically, the branch at an early layer of the decoder with low resolution is used to detect large-scale objects, while the branch at a later layer with high resolution is used to detect small-scale objects.
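A sketch of this per-layer detection branch follows: one 3×3 convolution and two sibling 1×1 convolutions for classification and box regression. The number of anchors per location and the intermediate channel width are illustrative assumptions.

```python
import torch.nn as nn


class DetectionHead(nn.Module):
    def __init__(self, in_ch, num_anchors=6, num_classes=21):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 256, 3, padding=1)
        self.cls = nn.Conv2d(256, num_anchors * num_classes, 1)  # class scores
        self.reg = nn.Conv2d(256, num_anchors * 4, 1)             # box offsets

    def forward(self, x):
        x = self.conv(x).relu()
        return self.cls(x), self.reg(x)
```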

The branch for semantic segmentation consists of a 3×3 convolutional layer that generates the logits. There are two different ways to compute the segmentation loss. The first one is that the segmentation logits are upsampled to the resolution of the ground truth; the second one is that the ground truth is downsampled to the resolution of the logits. We found that the first strategy has slightly better performance, so it is adopted in the following.
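A sketch of the two loss options, assuming a standard per-pixel cross-entropy and the usual VOC ignore label of 255 (an assumption about the label encoding); the paper adopts the first strategy.

```python
import torch.nn.functional as F


def seg_loss_upsample_logits(logits, gt, ignore_index=255):
    # strategy 1: upsample the logits to the ground-truth resolution
    logits = F.interpolate(logits, size=gt.shape[-2:],
                           mode='bilinear', align_corners=False)
    return F.cross_entropy(logits, gt, ignore_index=ignore_index)


def seg_loss_downsample_gt(logits, gt, ignore_index=255):
    # strategy 2: downsample the ground truth to the logits' resolution
    # (nearest-neighbour so labels stay valid integers)
    gt_small = F.interpolate(gt[:, None].float(), size=logits.shape[-2:],
                             mode='nearest').squeeze(1).long()
    return F.cross_entropy(logits, gt_small, ignore_index=ignore_index)
```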

3.2. Triply supervised decoder network (TripleNet)

To further improve the performance of joint object detection and semantic segmentation, a triply supervised decoder network (called TripleNet) is further proposed, where detection-oriented supervision, class-aware segmentation supervision, and class-agnostic segmentation supervision are added to each layer of the decoder. Fig. 3(a) gives the detailed architecture of TripleNet. Compared to PairNet, TripleNet adds some new modules (i.e., multiscale fused segmentation, the inner-connected module, class-agnostic segmentation supervision, and attention skip-layer fusion). In the following, we introduce these modules in detail.

Multiscale fused segmentation It has been demonstrated that multi-scale features are useful for semantic segmentation [7, 49, 46]. To use the multi-scale features of different layers in the decoder for better semantic segmentation, the feature maps of different layers in the decoder are upsampled to the same spatial resolution and concatenated together. After that, a 3×3 convolutional layer is used to generate the segmentation logits. Compared to segmentation based on a single layer of the decoder, the multilayer fused features make better use of context information. Thus, multiscale fused segmentation is used for the final prediction at the test stage. Meanwhile, the semantic segmentation based on each layer of the decoder can be seen as a deep supervision for feature learning.
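A sketch of multiscale fused segmentation under these assumptions: all decoder feature maps are upsampled to one common resolution, concatenated, and a 3×3 convolution produces the fused logits. The choice of the last decoder map as the target resolution is an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiscaleFusedSeg(nn.Module):
    def __init__(self, in_channels, num_classes=21):
        super().__init__()
        # in_channels: list with the channel count of each decoder layer
        self.fuse = nn.Conv2d(sum(in_channels), num_classes, 3, padding=1)

    def forward(self, decoder_feats):
        target = decoder_feats[-1].shape[-2:]  # assumed highest decoder resolution
        ups = [F.interpolate(f, size=target, mode='bilinear', align_corners=False)
               for f in decoder_feats]
        return self.fuse(torch.cat(ups, dim=1))  # fused segmentation logits
```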

The inner-connected module In Section 3.1, PairNet only shares the base network between object detection and semantic segmentation, while the branches for object detection and semantic segmentation do not interact. To further help object detection, an inner-connected module is proposed to refine object detection with the logits of semantic segmentation. Fig. 3(b) shows the inner-connected module in layer i. The feature map in layer i first goes through a 3×3 convolutional layer to produce the segmentation logits for the semantic segmentation branch. Meanwhile, the segmentation logits go through two 3×3 convolutional layers to generate new feature maps, which are further concatenated with the feature maps in layer i. Based on the concatenated feature maps, a 3×3 convolutional layer is used to generate the feature map for the object detection branch.
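A sketch of the inner-connected module as described above; the intermediate channel width is an illustrative assumption.

```python
import torch
import torch.nn as nn


class InnerConnected(nn.Module):
    def __init__(self, feat_ch, num_classes=21, mid_ch=64):
        super().__init__()
        self.seg_logits = nn.Conv2d(feat_ch, num_classes, 3, padding=1)
        self.refine = nn.Sequential(                 # two 3x3 convs on the logits
            nn.Conv2d(num_classes, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.det_feat = nn.Conv2d(feat_ch + mid_ch, feat_ch, 3, padding=1)

    def forward(self, feat_i):
        logits = self.seg_logits(feat_i)                       # segmentation branch
        det_in = torch.cat([feat_i, self.refine(logits)], dim=1)
        return logits, self.det_feat(det_in)                   # logits + detection feature
```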

Class-agnostic segmentation supervision The semantic segmentation mentioned above is class-aware, which aims to simultaneously identify specific object categories and the background.


Method                                | det | seg fine | seg all | MFS | IC | CAS | ASF | mAP  | mIoU
(a) only detection                    |  ✓  |          |         |     |    |     |     | 78.0 | N/A
(b) only segmentation with fine layer |     |    ✓     |         |     |    |     |     | N/A  | 72.5
(c) only segmentation with all layers |     |          |    ✓    |     |    |     |     | N/A  | 72.9
(d) PairNet                           |  ✓  |          |    ✓    |     |    |     |     | 78.9 | 73.1
(e) add MFS                           |  ✓  |          |    ✓    |  ✓  |    |     |     | 79.0 | 73.5
(f) add MFS and IC                    |  ✓  |          |    ✓    |  ✓  |  ✓ |     |     | 79.5 | 73.6
(g) add MFS, IC, and ASF              |  ✓  |          |    ✓    |  ✓  |  ✓ |     |  ✓  | 79.7 | 74.4
(h) TripleNet                         |  ✓  |          |    ✓    |  ✓  |  ✓ |  ✓  |  ✓  | 80.0 | 74.8

Table 1. Ablation experiments of PairNet and TripleNet on the VOC2012-val-seg set. The backbone model is ResNet-50 [15], and the input image is rescaled to 300×300. "MFS" means multiscale fused segmentation, "IC" means inner-connected module, "CAS" means class-agnostic segmentation supervision, and "ASF" means attention skip-layer fusion.

We argue that class-aware semantic segmentation may ignore the discrimination between objects and the background. Therefore, a class-agnostic segmentation supervision module is further added to each layer of the decoder. Specifically, a 3×3 convolutional layer is added to generate the logits of class-agnostic semantic segmentation. To generate the ground truth of class-agnostic semantic segmentation, the objects of all categories are set as one category, and the background is set as another category.
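A sketch of how the class-agnostic ground truth can be derived from the class-aware labels; the VOC ignore label 255 is an assumption about the label encoding.

```python
def to_class_agnostic(gt, ignore_index=255):
    # gt: integer label map, 0 = background, 1..C = object classes
    agnostic = gt.clone()
    fg = (gt > 0) & (gt != ignore_index)  # pixels of any object category
    agnostic[fg] = 1                      # all objects -> one foreground class
    return agnostic                       # 0: background, 1: object, 255: ignore
```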

Attention skip-layer fusion In Section 3.1, PairNet simply fuses the feature maps of the decoder with the corresponding feature maps of the encoder. Generally, the features from a layer of the encoder have relatively low-level semantics, while those from a layer of the decoder have relatively high-level semantics. To enhance informative features and suppress less useful features from the encoder using the features from the decoder, a Squeeze-and-Excitation (SE) [16] block is used. The input of the SE block is the decoder layer, and the output of the SE block is used to scale the encoder layer. After that, the decoder layer and the scaled encoder layer are concatenated for fusion.
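A sketch of attention skip-layer fusion under these assumptions: an SE-style block [16] driven by the decoder feature produces per-channel weights that rescale the encoder feature before concatenation. The reduction ratio and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionSkipFusion(nn.Module):
    def __init__(self, dec_ch, enc_ch, out_ch, reduction=16):
        super().__init__()
        self.se = nn.Sequential(                      # SE block on the decoder feature
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dec_ch, dec_ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dec_ch // reduction, enc_ch, 1), nn.Sigmoid())
        self.out = nn.Conv2d(dec_ch + enc_ch, out_ch, 3, padding=1)

    def forward(self, dec_feat, enc_feat):
        dec_up = F.interpolate(dec_feat, size=enc_feat.shape[-2:],
                               mode='bilinear', align_corners=False)
        enc_scaled = enc_feat * self.se(dec_up)       # rescale encoder channels
        return self.out(torch.cat([dec_up, enc_scaled], dim=1))
```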

4. Experiments

4.1. Datasets

To demonstrate the effectiveness of the proposed methods and compare them with some state-of-the-art methods, experiments are conducted in this section on the widely used VOC 2007 and VOC 2012 datasets [8].

The PASCAL VOC challenge [8] has been held annually since 2006 and consists of three principal challenges (i.e., image classification, object detection, and semantic segmentation). Among these annual challenges, the VOC 2007 and VOC 2012 datasets are usually used to evaluate the performance of object detection and semantic segmentation; they contain 20 object categories. The VOC 2007 dataset contains 5011 trainval images and 4952 test images. The VOC 2012 dataset is split into three subsets (i.e., train, val, and test). The train set contains 5717 images for detection and 1464 images for semantic segmentation (called VOC12-train-seg). The val set contains 5823 images for detection and 1449 images for segmentation (called VOC12-val-seg). The test set contains 10991 images for detection and 1456 for segmentation. To enlarge the training set for semantic segmentation, the additional segmentation data provided by [12] is used, which contains 10582 training images (called VOC12-trainaug-seg).

For object detection, mean average precision (mAP) is used for performance evaluation. On the PASCAL VOC datasets, mAP is calculated under an IoU threshold of 0.5. For semantic segmentation, mean intersection over union (mIoU) is used for performance evaluation.
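A minimal sketch of the mIoU metric, computed from a per-class confusion matrix accumulated over the validation set (a common implementation choice, not necessarily the authors' exact evaluation code).

```python
import numpy as np


def mean_iou(confusion):
    """confusion[i, j]: number of pixels with true class i predicted as class j."""
    tp = np.diag(confusion).astype(float)
    union = confusion.sum(1) + confusion.sum(0) - tp   # per-class union
    iou = tp / np.maximum(union, 1)
    # classes absent from both prediction and ground truth should be excluded in practice
    return iou.mean()
```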

4.2. Ablation experiments on the VOC 2012 dataset

In this subsection, experiments are conducted on PASCAL VOC 2012 to validate the effectiveness of the proposed method. The VOC12-trainaug-seg set is used for training and the VOC12-val-seg set is used for performance evaluation, since they have the ground truth of both object detection and semantic segmentation. The input images are rescaled to 300×300, and the mini-batch size is 32. The total number of iterations in the training stage is 40k, where the learning rate of the first 25k iterations is 0.0001, that of the following 10k iterations is 0.00001, and that of the last 5k iterations is 0.000001.
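The step learning-rate schedule described above can be written as a small helper, for example:

```python
def learning_rate(iteration):
    # 1e-4 for the first 25k iterations, 1e-5 for the next 10k, 1e-6 for the last 5k
    if iteration < 25000:
        return 1e-4
    elif iteration < 35000:
        return 1e-5
    return 1e-6
```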

The top part of Table 1 shows the ablation experiments of PairNet. When all layers of the decoder are used only for multi-scale object detection (i.e., Table 1(a)), the mAP of object detection is 78.0%. When all layers of the decoder are used for semantic segmentation (i.e., Table 1(c)), the mIoU of semantic segmentation is 72.9%. When all layers of the decoder are used for object detection and semantic segmentation together (i.e., Table 1(d)), the mAP and mIoU of PairNet are 78.9% and 73.1%, respectively. Namely, PairNet can improve both object detection and semantic segmentation, which indicates that joint object detection and semantic segmentation on each layer of the decoder is useful.

Figure 4. Visualization of detection or segmentation results of the methods in Table 1 (i.e., "only det", "only seg", PairNet, and TripleNet). (a) demonstrates that detection and segmentation can both be improved by PairNet and TripleNet. (b) demonstrates that detection is mainly improved by PairNet or TripleNet. (c) demonstrates that segmentation is mainly improved by PairNet or TripleNet.

Meanwhile, the method using all layers of the decoder for segmentation (i.e., Table 1(c)) performs better than the method using only the last layer of the decoder for segmentation (i.e., Table 1(b)). The reason may be that supervising all layers of the decoder for semantic segmentation provides a much deeper supervision.


method         input size   mAP    mIoU
BlitzNet300    300×300      78.8   72.9
PairNet300     300×300      78.9   73.1
TripleNet300   300×300      80.0   74.8

Table 2. Comparison of BlitzNet, the proposed PairNet, and the proposed TripleNet. All methods are re-implemented with the same parameter settings.


The bottom part of Table 1 shows the ablation experiments of TripleNet. Based on PairNet, TripleNet adds four modules (i.e., MFS, IC, CAS, and ASF). When adding the MFS module, TripleNet outperforms PairNet by 0.1% on object detection and 0.4% on semantic segmentation, respectively. When adding the MFS and IC modules, TripleNet outperforms PairNet by 0.6% on object detection and 0.5% on semantic segmentation. When adding all four modules, TripleNet achieves the best detection and segmentation performance.

Fig. 4 further visualizes the results of some methods in Table 1. The first two columns show the ground truth of detection and segmentation. The results of detection only (Table 1(a)) and segmentation only (Table 1(c)) are shown in the third and fourth columns. The results of PairNet (Table 1(d)) and TripleNet (Table 1(h)) are shown in the fifth to eighth columns. Fig. 4(a) gives examples where both detection and segmentation are improved by joint detection and segmentation. For example, in the first row, "only det" and "only seg" both miss three potted plants, while PairNet misses only one potted plant and TripleNet does not miss any. Fig. 4(b) shows examples where the detection results are improved. For example, in the first row, "only det" detects only one ship, PairNet detects three ships, and TripleNet detects four ships. Fig. 4(c) shows examples where the segmentation results are improved. For example, in the second row, "only seg" recognizes the blue bag as a motorbike, but PairNet and TripleNet recognize it as background.

Meanwhile, the proposed PairNet and TripleNet are also compared with the related BlitzNet [7] in Table 2. For a fair comparison, BlitzNet is re-implemented with similar parameter settings as the proposed PairNet and TripleNet. PairNet, which simply joins detection and segmentation in each layer of the decoder, is already comparable with BlitzNet. TripleNet outperforms BlitzNet on both object detection and semantic segmentation, which demonstrates that the proposed method can make full use of the mutual information to improve the two tasks.

method             backbone    mAP    mIoU
SSD300 [29]        VGG         75.4   -
SSD512 [29]        VGG         79.4   -
RON384 [18]        VGG16       75.4   -
DSSD513 [9]        ResNet101   80.0   -
DES300 [47]        VGG16       77.1   -
DES512 [47]        VGG16       80.3   -
RefineDet512 [48]  ResNet101   80.0   -
DFPR [19]          ResNet101   81.1   -
FCN [30]           VGG16       -      62.2
ParseNet [27]      VGG16       -      69.8
Deeplab v2 [5]     VGG16       -      71.6
DPN [28]           VGG-like    -      74.1
RefineNet [25]     ResNet101   -      82.4
PSPNet [49]        ResNet101   -      82.6
DFN [45]           ResNet101   -      82.7
BlitzNet512 [7]    ResNet50    79.0   75.6
TripleNet512       ResNet101   81.0   82.9

Table 3. Results of object detection (mAP) and semantic segmentation (mIoU) on the VOC 2012 test set.

4.3. Comparison with state-of-the-art methods on the VOC 2012 test dataset

In this section, the proposed TripleNet is compared with some state-of-the-art methods on the VOC 2012 dataset. Among these methods, SSD [29], RON [18], DSSD [9], DES [47], RefineDet [48], and DFPR [19] are used only for object detection, while FCN [30], ParseNet [27], Deeplab v2 [5], DPN [28], RefineNet [25], PSPNet [49], and DFN [45] are used only for semantic segmentation. Table 3 shows the object detection results (mAP) and semantic segmentation results (mIoU) of these methods on the VOC 2012 test set. It can be seen that most state-of-the-art methods can only output detection results (i.e., SSD, RON, DSSD, DES, RefineDet, and DFPR) or segmentation results (i.e., FCN, ParseNet, Deeplab v2, DPN, RefineNet, PSPNet, and DFN). Only BlitzNet and the proposed TripleNet can simultaneously output the results of object detection and semantic segmentation. The mAP and mIoU of BlitzNet are 79.0% and 75.6%, while the mAP and mIoU of TripleNet are 81.0% and 82.9%. Thus, TripleNet outperforms BlitzNet by 2.0% on object detection and 7.3% on semantic segmentation. It can also be seen that TripleNet almost achieves state-of-the-art performance on both object detection and semantic segmentation.

4.4. Comparison with some state-of-the-art methods on the VOC 2007 test dataset

In this section, the proposed TripleNet and some state-of-the-art methods (i.e., SSD [29], DES [47], DSSD [9], STDN [50], BlitzNet [7], RefineDet [48], and DFPR [19]) are further compared on the VOC 2007 test set. Because only the ground truth of object detection is provided, these methods are evaluated only on object detection.


method            backbone    mAP   aero bike bird boat bottle bus  car  cat  chair cow  table dog  horse mbike persn plant sheep sofa train tv
SSD300 [29]       VGG16       77.5  79.5 83.9 76.0 69.6 50.5  87.0 85.7 88.1 60.3  81.5 77.0  86.1 87.5  84.0  79.4  52.3  77.9  79.5 87.6  76.8
SSD512 [29]       VGG16       79.5  84.8 85.1 81.5 73.0 57.8  87.8 88.3 87.4 63.5  85.4 73.2  86.2 86.7  83.9  82.5  55.6  81.7  79.0 86.6  80.0
DES300 [47]       VGG16       79.7  83.5 86.0 78.1 74.8 53.4  87.9 87.3 88.6 64.0  83.8 77.2  85.9 88.6  87.5  80.8  57.3  80.2  80.4 88.5  79.5
DES512 [47]       VGG16       81.7  87.7 86.7 85.2 76.3 60.6  88.7 89.0 88.0 67.0  86.9 78.0  87.2 87.9  87.4  84.4  59.2  86.1  79.2 88.1  80.5
DSSD321 [9]       ResNet101   78.6  81.9 84.9 80.5 68.4 53.9  85.6 86.2 88.9 61.1  83.5 78.7  86.7 88.7  86.7  79.7  51.7  78.0  80.9 87.2  79.4
DSSD513 [9]       ResNet101   81.5  86.6 86.2 82.6 74.9 62.5  89.0 88.7 88.8 65.2  87.0 78.7  88.2 89.0  87.5  83.7  51.1  86.3  81.6 85.7  83.7
STDN300 [50]      DenseNet169 78.1  81.1 86.9 76.4 69.2 52.4  87.7 84.2 88.3 60.2  81.3 77.6  86.6 88.9  87.8  76.8  51.8  78.4  81.3 87.5  77.8
STDN513 [50]      DenseNet169 80.9  86.1 89.3 79.5 74.3 61.9  88.5 88.3 89.4 67.4  86.5 79.5  86.4 89.2  88.5  79.3  53.0  77.9  81.4 86.6  85.5
BlitzNet300 [7]   ResNet50    79.1  86.7 86.2 78.9 73.1 47.6  85.7 86.1 87.7 59.3  85.1 78.4  86.3 87.9  84.2  79.1  58.5  82.5  81.7 85.7  81.8
BlitzNet512 [7]   ResNet50    81.5  87.0 87.6 83.5 75.7 59.1  87.6 88.0 88.8 64.1  88.4 80.9  87.5 88.5  86.9  81.5  60.6  86.5  79.3 87.5  81.7
RefineDet320 [48] VGG16       80.0  -    -    -    -    -     -    -    -    -     -    -     -    -     -     -     -     -     -    -     -
RefineDet512 [48] VGG16       81.8  -    -    -    -    -     -    -    -    -     -    -     -    -     -     -     -     -     -    -     -
DFPR300 [19]      ResNet101   77.1  89.3 84.9 79.9 75.6 55.4  88.2 88.6 88.6 63.3  87.9 78.8  87.3 87.7  85.5  80.5  55.4  81.1  79.6 87.8  78.5
DFPR512 [19]      ResNet101   82.4  92.0 88.2 81.1 71.2 65.7  88.2 87.9 92.2 65.8  86.5 79.4  90.3 90.4  89.3  88.6  59.4  88.4  75.3 89.2  78.5
TripleNet300      ResNet50    79.3  81.4 85.0 79.5 72.1 53.7  85.3 85.9 87.8 62.5  85.1 78.7  87.8 88.6  85.7  79.5  56.8  80.7  79.2 88.7  81.4
TripleNet512      ResNet50    82.4  88.9 86.9 85.0 77.5 61.5  87.7 88.2 89.2 66.0  88.3 79.6  87.5 88.8  87.3  82.3  62.4  86.1  81.7 89.2  82.4
TripleNet512      ResNet101   82.7  88.9 87.8 83.7 79.6 62.9  87.9 88.3 88.5 67.5  89.1 81.2  88.0 89.5  87.9  83.3  58.7  85.1  83.4 88.8  84.1

Table 4. Results of object detection (mAP) on the VOC 2007 test set.

Table 4 shows the mAP of these methods. The mAP of TripleNet is 82.7%, which is higher than that of all the compared state-of-the-art methods.

5. Conclusion

In this paper, we proposed two fully convolutional networks (i.e., PairNet and TripleNet) for joint object detection and semantic segmentation. PairNet simultaneously predicts objects of different scales at different layers and parses pixel semantic labels at all layers. TripleNet adds four modules (i.e., multiscale fused segmentation, the inner-connected module, class-agnostic segmentation supervision, and attention skip-layer fusion) to PairNet. Experiments demonstrate that TripleNet achieves state-of-the-art performance on both object detection and semantic segmentation.

References
[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[2] G. Brazil, X. Yin, and X. Liu. Illuminating pedestrians via simultaneous detection & segmentation. In Proc. IEEE International Conference on Computer Vision, 2017.
[3] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.
[4] Z. Cai and N. Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[6] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Proc. Advances in Neural Information Processing Systems, 2016.
[7] N. Dvornik, K. Shmelkov, J. Mairal, and C. Schmid. BlitzNet: A real-time deep network for scene understanding. In Proc. IEEE International Conference on Computer Vision, 2017.
[8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[9] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv:1701.06659, 2017.
[10] R. Girshick. Fast R-CNN. In Proc. IEEE International Conference on Computer Vision, 2015.
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[12] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In Proc. IEEE International Conference on Computer Vision, 2011.

[13] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Proc. European Conference on Computer Vision, 2014.

[14] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In Proc. IEEE International Conference on Computer Vision, 2017.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[16] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[17] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[18] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen. RON: Reverse connection with objectness prior networks for object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[19] T. Kong, F. Sun, W. Huang, and H. Liu. Deep feature pyramid reconfiguration for object detection. In Proc. European Conference on Computer Vision, 2018.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems, 2012.
[21] I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[22] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Proc. European Conference on Computer Vision, 2014.
[24] C. Lin, J. Lu, G. Wang, and J. Zhou. Graininess-aware deep feature learning for pedestrian detection. In Proc. European Conference on Computer Vision, 2018.
[25] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[26] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In Proc. IEEE International Conference on Computer Vision, 2017.
[27] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. In Proc. International Conference on Learning Representations, 2016.
[28] Z. Liu, X. Li, P. Luo, C. Change Loy, and X. Tang. Semantic image segmentation via deep parsing network. In Proc. IEEE International Conference on Computer Vision, 2015.
[29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In Proc. European Conference on Computer Vision, 2016.
[30] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[31] J. Mao, T. Xiao, Y. Jiang, and Z. Cao. What can help pedestrian detection? In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[32] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proc. IEEE International Conference on Computer Vision, 2017.
[33] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters – improve semantic segmentation by global convolutional network. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[34] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollar. Learning to refine object segments. In Proc. European Conference on Computer Vision, 2016.
[35] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[36] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proc. Advances in Neural Information Processing Systems, 2015.
[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
[38] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue. DSOD: Learning deeply supervised object detectors from scratch. In Proc. IEEE International Conference on Computer Vision, 2017.
[39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.


[40] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun. MultiNet: Real-time joint semantic reasoning for autonomous driving. arXiv:1612.07695, 2016.
[41] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[42] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. In Proc. IEEE Winter Conference on Applications of Computer Vision, 2017.
[43] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang. DenseASPP for semantic segmentation in street scenes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[44] J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[45] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. Learning a discriminative feature network for semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[46] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In Proc. International Conference on Learning Representations, 2016.
[47] Z. Zhang, S. Qiao, C. Xie, W. Shen, B. Wang, and A. L. Yuille. Single-shot object detection with enriched semantics. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[48] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot refinement neural network for object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[49] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[50] P. Zhou, B. Ni, C. Geng, J. Hu, and Y. Xu. Scale-transferrable object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
