



EfficientPS: Efficient Panoptic Segmentation

Rohit Mohan · Abhinav Valada

Abstract Understanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics, which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently, and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popular and challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and the Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all four benchmarks while being the most efficient and fastest panoptic segmentation architecture to date.

Keywords Panoptic Segmentation · Semantic Segmentation · Instance Segmentation · Scene Understanding

1 Introduction

Holistic scene understanding plays a pivotal role in enabling intelligent behavior.

Rohit Mohan, [email protected]

Abhinav Valada, [email protected]

University of Freiburg, Germany

Figure 1 Overview of our proposed EfficientPS architecture for panoptic segmentation. Our model predicts four outputs: the semantic prediction from the semantic head, and the class, bounding box and mask predictions from the instance head. All the aforementioned predictions are then fused in the panoptic fusion module to yield the final panoptic segmentation output.

Humans from an early age are able to effortlessly comprehend complex visual scenes, which forms the basis for learning more advanced capabilities (Bremner and Slater, 2008). Similarly, intelligent systems such as robots should have the ability to coherently understand visual scenes at both the fundamental pixel level as well as at the distinctive object instance level. This enables them to perceive and reason about the environment holistically, which facilitates interaction. Such modeling ability is a crucial enabler that can revolutionize several diverse applications including autonomous driving, surveillance, and augmented reality.

The components of a scene can generally be categorized into 'stuff' and 'thing' objects. 'Stuff' can be defined as uncountable and amorphous regions such as sky, road and sidewalk, while 'things' are countable objects, for example pedestrians, cars and riders. Segmentation of 'stuff' classes is primarily addressed using the semantic segmentation task, whereas segmentation of 'thing' classes is addressed by the instance segmentation task. Both tasks have garnered a substantial amount of attention in recent years (Shotton et al., 2008; Krähenbühl and Koltun, 2011; Silberman et al., 2014; He and Gould, 2014a).

Moreover, advances in deep learning (Chen et al., 2018b; Zhao et al., 2017; Valada et al., 2016a; He et al., 2017; Liu et al., 2018; Zürn et al., 2019) have further boosted the performance of these tasks to new heights. However, state-of-the-art deep learning methods still predominantly address these tasks independently, although their objective of understanding the scene at the pixel level establishes an inherent connection between them. More surprisingly, they have also fundamentally branched out into different directions of proposal based methods (He et al., 2017) for instance segmentation and fully convolutional networks (Long et al., 2015) for semantic segmentation, even though some earlier approaches (Tighe et al., 2014; Tu et al., 2005; Yao et al., 2012) have demonstrated the potential benefits of combining them.

Recently, Kirillov et al. (2019) revived the need to tackle these tasks jointly by coining the term panoptic segmentation and introducing the panoptic quality metric for combined evaluation. The goal of this task is to jointly predict 'stuff' and 'thing' classes, essentially unifying the separate tasks of semantic and instance segmentation. More specifically, if a pixel belongs to a 'stuff' class, the panoptic segmentation network assigns a class label from the 'stuff' classes, whereas if the pixel belongs to a 'thing' class, the network predicts both which 'thing' class it corresponds to as well as the instance of the object class. Kirillov et al. (2019) also present a baseline approach for panoptic segmentation that heuristically combines predictions from individual state-of-the-art instance and semantic segmentation networks in a post-processing step. However, this disjoint approach has several drawbacks including large computational overhead, redundancy in learning, and discrepancy between the predictions of each network. Although recent methods have made significant strides to address this task in a top-down manner with shared components or in a bottom-up manner sequentially, these approaches still face several challenges in terms of computational efficiency, slow runtimes and subpar results compared to task-specific individual networks.

In this paper, we propose the novel EfficientPS architecture that provides effective solutions to the aforementioned problems. The architecture consists of our new shared backbone with mobile inverted bottleneck units and our proposed 2-way Feature Pyramid Network (FPN), followed by task-specific instance and semantic segmentation heads with separable convolutions, whose outputs are combined in our parameter-free panoptic fusion module. The entire network is jointly optimized in an end-to-end manner to yield the final panoptic segmentation output. Figure 1 shows an overview of the information flow in our network along with the intermediate predictions and the final output. The design of our proposed EfficientPS is influenced by the goal of achieving superior performance compared to existing methods while simultaneously being fast and computationally efficient.

Currently, the best performing top-down panoptic segmentation models (Porzi et al., 2019; Xiong et al., 2019; Li et al., 2018a) primarily employ the ResNet-101 (He et al., 2016) or ResNeXt-101 (Xie et al., 2017) architecture with Feature Pyramid Networks (Lin et al., 2017) as the backbone. Although these backbones have a high representational capacity, they consume a significant amount of parameters. In order to achieve a better trade-off, we propose a new backbone network consisting of a modified EfficientNet (Tan and Le, 2019) architecture that employs compound scaling to uniformly scale all the dimensions of the network, coupled with our novel 2-way FPN. Our proposed backbone is substantially more efficient as well as more effective than its popular counterparts (He et al., 2016; Kaiser et al., 2017; Xie et al., 2017). Moreover, we identify that the standard FPN architecture is limited in its ability to aggregate multi-scale features due to the unidirectional flow of information. While there are other extensions that aim to mitigate this problem by adding bottom-up path augmentation (Liu et al., 2018) to the outputs of the FPN, they are considerably slow. Therefore, we propose our novel 2-way FPN that facilitates bidirectional flow of information, which substantially improves the panoptic quality of 'thing' classes while remaining comparable in runtime.

The outputs of our 2-way FPN are of multiple scales, which we refer to as large-scale features when they have a downsampling factor of ×4 or ×8 with respect to the input image, and small-scale features when they have a downsampling factor of ×16 or ×32. The large-scale outputs comprise fine or characteristic features, whereas the small-scale outputs contain features rich in semantic information. The presence of these distinct characteristics necessitates processing the features at each scale uniquely. Therefore, we propose a new semantic head with separable convolutions which aggregates small-scale and large-scale features independently before correlating and fusing contextual features with fine features. We demonstrate that this semantically reinforces fine features, resulting in better object boundary refinement. For our instance head, we build upon Mask R-CNN and augment it with separable convolutions and iABN sync (Rota Bulò et al., 2018) layers.

One of the critical challenges in panoptic segmentation deals with resolving the conflict of overlapping predictions from the semantic and instance heads. Most architectures (Kirillov et al., 2019; Porzi et al., 2019; Li et al., 2019b; de Geus et al., 2018) employ a standard post-processing step (Kirillov et al., 2019) that adopts instance-specific 'thing' segmentation from the instance head and 'stuff' segmentation from the semantic head. This fusion technique completely ignores the logits of the semantic head while segmenting 'thing' regions in the panoptic segmentation output, which is sub-optimal as the 'thing' logits of the semantic head can aid in resolving the conflict more effectively. In order to thoroughly exploit the logits from both heads, we propose a parameter-free panoptic fusion module that adaptively fuses logits by selectively attenuating or amplifying fused logit scores based on how agreeable or disagreeable the predictions of the individual heads are for each pixel in a given instance. We demonstrate that our proposed panoptic fusion mechanism is more effective and efficient than other widely used methods in existing architectures.

Furthermore, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for images in the challenging KITTI benchmark (Geiger et al., 2013). As KITTI provides groundtruth for a whole suite of perception and localization tasks, these new panoptic annotations further complement the widely popular benchmark. We hope that these panoptic annotations, which we make publicly available, encourage future research in multi-task learning for holistic scene understanding. In order to facilitate comparison, we also benchmark previous state-of-the-art models on our newly introduced KITTI panoptic segmentation dataset and the IDD dataset. We perform exhaustive experimental evaluations and benchmarking of our proposed EfficientPS architecture on four standard urban scene understanding datasets, namely Cityscapes (Cordts et al., 2016), Mapillary Vistas (Neuhold et al., 2017), KITTI (Geiger et al., 2013) and the Indian Driving Dataset (IDD) (Varma et al., 2019).

Our proposed EfficientPS, with a PQ score of 66.4%, is ranked first for panoptic segmentation on the Cityscapes benchmark leaderboard without training on coarse annotations or using model ensembles. Additionally, EfficientPS is also ranked second for the semantic segmentation task as well as the instance segmentation task on the Cityscapes benchmark, with an mIoU score of 84.2% and an AP of 39.1% respectively. On the Mapillary Vistas dataset, our single EfficientPS model achieves a PQ score of 40.5% on the validation set, thereby outperforming all existing methods. Similarly, EfficientPS consistently outperforms existing panoptic segmentation models on both the KITTI and IDD datasets by a large margin. More importantly, our EfficientPS architecture not only sets the new state-of-the-art on all four panoptic segmentation benchmarks, but it is also the most computationally efficient, consuming the least amount of parameters and having the fastest inference time compared to previous state-of-the-art methods. Finally, we present detailed ablation studies that demonstrate the improvement in performance due to each of the architectural contributions that we make in this work. Moreover, we also make implementations of our proposed EfficientPS architecture, training code and pre-trained models publicly available.

In summary, the following are the main contributions of this work:

1. The novel EfficientPS architecture for panoptic segmentation that incorporates our proposed efficient shared backbone with our new feature aligning semantic head, a new variant of Mask R-CNN as the instance head, and our novel adaptive panoptic fusion module.

2. A new panoptic backbone consisting of an augmented EfficientNet architecture and our proposed 2-way FPN that both encodes and aggregates semantically rich multi-scale features in a bidirectional manner.

3. A novel semantic head that captures fine features and long-range context efficiently, as well as correlates them before fusion for better object boundary refinement.

4. A new panoptic fusion module that dynamically adapts the fusion of logits from the semantic and instance heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to compute the panoptic prediction.

5. The KITTI panoptic segmentation dataset that provides panoptic groundtruth annotations for images from the challenging KITTI benchmark dataset.

6. Benchmarking of existing state-of-the-art panoptic segmentation architectures on the newly introduced KITTI panoptic segmentation dataset and the IDD dataset.

7. Comprehensive benchmarking of our proposed EfficientPS architecture on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets.

8. Extensive ablation studies that compare the performance of the various architectural components that we propose in this work with their counterparts from state-of-the-art architectures.

9. Implementation of our proposed architecture and a live demo on all four datasets, which are publicly available at http://rl.uni-freiburg.de/research/panoptic.

2 Related Works

Panoptic segmentation is a recently introduced scene understanding problem (Kirillov et al., 2019) that unifies the tasks of semantic segmentation and instance segmentation. Numerous methods have been proposed for each of these sub-tasks; however, only a handful of approaches have been introduced to tackle the coherent scene understanding problem of panoptic segmentation. Most works in this domain are largely built upon advances made in semantic segmentation and instance segmentation, therefore we first review recent methods that have been proposed for these closely related sub-tasks, followed by state-of-the-art approaches that have been introduced for panoptic segmentation.

Semantic Segmentation: There have been significant advances in semantic segmentation approaches in recent years. In this section, we briefly review methods that use a single monocular image to tackle this task. Approaches from the past decade typically employ random decision forests to address this task. Shotton et al. (2008) use randomized decision forests on local patches for classification, whereas Plath et al. (2009) fuse local and global features along with Conditional Random Fields (CRFs) for segmentation. As opposed to leveraging appearance-based features, Brostow et al. (2008) use cues from motion with random forests. Sturgess et al. (2009) further combine appearance-based features with structure-from-motion features, in addition to CRFs, to improve the performance. However, 3D features extracted from dense depth maps (Zhang et al., 2010) have been demonstrated to be more effective than the combined features. Kontschieder et al. (2011) exploit the inherent topological distribution of object classes to improve the performance, whereas Krähenbühl and Koltun (2011) improve segmentation by pairing CRFs with Gaussian edge potentials. Nevertheless, all these methods employ handcrafted features that do not encapsulate all the high-level and low-level relations, thereby limiting their representational ability.

The significant improvement in the performance of classification tasks brought about by Convolutional Neural Network (CNN) based approaches motivated researchers to explore such methods for semantic segmentation. Initially, these approaches relied on patch-wise training that severely limited their ability to accurately segment object boundaries. However, they still perform substantially better than the previous handcrafted methods. The advent of end-to-end learning approaches for semantic segmentation, led by the introduction of Fully Convolutional Networks (FCNs) (Long et al., 2015), revolutionized this field, and FCNs still form the base upon which state-of-the-art architectures are built today. FCN is an encoder-decoder architecture where the encoder is based on the VGG-16 (Simonyan and Zisserman, 2014) architecture with inner-product layers replaced with convolutions, and the decoder consists of convolution and transposed convolution layers. The subsequently proposed SegNet (Badrinarayanan et al., 2015) architecture introduced unpooling layers for upsampling as a replacement for transposed convolutions, whereas ParseNet (Liu et al., 2015) models global context directly as opposed to only relying on the largest receptive field of the network.

The PSPNet (Zhao et al., 2017) architecture emphasizes the importance of multi-scale features and proposes pyramid pooling to learn feature representations at different scales. Yu and Koltun (2015) introduce atrous convolutions to further exploit multi-scale features in semantic segmentation networks. Subsequently, Valada et al. (2017) propose multi-scale residual units with parallel atrous convolutions with different dilation rates to efficiently learn multi-scale features throughout the network without increasing the number of parameters. Chen et al. (2017b) propose the Atrous Spatial Pyramid Pooling (ASPP) module that concatenates feature maps from multiple parallel atrous convolutions with different dilation rates and a global pooling layer. ASPP substantially improves the performance of semantic segmentation networks by aggregating multi-scale features and capturing long-range context; however, it significantly increases the computational complexity. Therefore, Chen et al. (2018a) propose Dense Prediction Cells (DPC) and Valada et al. (2019) propose Efficient Atrous Spatial Pyramid Pooling (eASPP), which yield better semantic segmentation performance than ASPP while being 10 times more efficient. Li et al. (2019a) suggest that global feature aggregation often leads to large pattern features and over-smooths regions of small patterns, which results in sub-optimal performance. In order to alleviate this problem, the authors propose the use of a global aggregation module coupled with a local distribution module, which results in features that are balanced in small and large pattern regions. There are also several works that have been proposed to improve the upsampling in decoders of encoder-decoder architectures. In (Chen et al., 2018b), the authors introduce a novel decoder module for object boundary refinement. Tian et al. (2019) propose data-dependent upsampling, which accounts for the redundancy in the label space, as opposed to simple bilinear upsampling.

Instance Segmentation: Some of the initial approaches employ CRFs (He and Gould, 2014b) and minimize integer quadratic relations (Tighe et al., 2014). Methods that exploit CNNs with Markov random fields (Zhang et al., 2016) and recurrent neural networks (Romera-Paredes and Torr, 2016; Ren and Zemel, 2017) have also been explored. In this section, we primarily discuss CNN-based approaches for instance segmentation. These methods can be categorized into proposal free and proposal based methods.

Methods in the proposal free category often obtain instance masks from a resulting transformation. Bai and Urtasun (2017) use CNNs to produce an energy map of the image and then perform a cut at a single energy level to obtain the corresponding object instances. Liu et al. (2017) employ a sequence of CNNs to solve sub-grouping problems in order to compose object instances. Some approaches exploit FCNs which either use local coherence for estimating instances (Dai et al., 2016) or encode the direction of each pixel to its corresponding instance centre (Uhrig et al., 2016). The recent approach SSAP (Gao et al., 2019) uses pixel-pair affinity pyramids for computing the probability that two pixels hierarchically belong to the same instance. However, these methods achieve a lower performance than proposal based methods, which has led to a decline in their popularity.

In proposal based methods, Hariharan et al. (2014) propose a method that uses Multiscale Combinatorial Grouping (Arbelaez et al., 2014) proposals as input to CNNs for feature extraction and then employ an SVM classifier for region classification. Subsequently, Hariharan et al. (2015) propose hypercolumn pixel descriptors for simultaneous detection and segmentation. In recent works, DeepMask (Pinheiro et al., 2015) uses a patch of an image as input to a CNN which yields a class-agnostic segmentation mask and the likelihood of the patch containing an object. FCIS (Li et al., 2017) employs position-sensitive score maps obtained from classification of pixels based on their relative positions to perform segmentation and detection jointly. Dai et al. (2016) propose an approach for instance segmentation that uses three networks for distinguishing instances, estimating masks and categorizing objects. Mask R-CNN (He et al., 2017) is one of the most popular and widely used approaches at present. It extends Faster R-CNN for instance segmentation by adding an object segmentation branch parallel to a branch that performs bounding box regression and classification. More recently, Liu et al. (2018) propose an approach to improve Mask R-CNN by adding bottom-up path augmentation that enhances the object localization ability in earlier layers of the network. Subsequently, BshapeNet (Kang and Kim, 2018) extends Faster R-CNN by adding a bounding box mask branch that provides additional information on object positions and coordinates to improve the performance of object detection and instance segmentation.

Panoptic Segmentation: Kirillov et al. (2019) revived the unification of the semantic segmentation and instance segmentation tasks by introducing panoptic segmentation. They propose a baseline model that combines the outputs of PSPNet (Zhao et al., 2017) and Mask R-CNN (He et al., 2017) with a simple post-processing step in which each model processes the inputs independently. The methods that address the task of panoptic segmentation can be broadly classified into two categories: top-down or proposal based methods, and bottom-up or proposal free methods. Most of the current state-of-the-art methods adopt the top-down approach. de Geus et al. (2018) propose joint training with a shared backbone that branches into Mask R-CNN for instance segmentation and an augmented Pyramid Pooling module for semantic segmentation. Subsequently, Li et al. (2019b) introduce the Attention-guided Unified Network that uses a proposal attention module and a mask attention module for better segmentation of 'stuff' classes. All the aforementioned methods use a fusion technique similar to Kirillov et al. (2019) for the fusion of 'stuff' and 'thing' predictions.

In top-down panoptic segmentation architectures, the predictions of both heads have an inherent overlap between them, resulting in the mask overlapping problem. In order to mitigate this problem, Li et al. (2018b) propose a weakly supervised model where 'thing' classes are weakly supervised by bounding boxes and 'stuff' classes are supervised with image-level tags, whereas Liu et al. (2019) address the problem by introducing a spatial ranking module, and Li et al. (2018a) propose a method that learns a binary mask to constrain the output distributions of 'stuff' and 'thing' explicitly. Subsequently, UPSNet (Xiong et al., 2019) introduces a parameter-free panoptic head to address the problem of overlapping instances and also predicts an extra unknown class. More recently, AdaptIS (Sofiiuk et al., 2019) uses point proposals to produce instance masks and jointly trains with a standard semantic segmentation pipeline to perform panoptic segmentation. In contrast, Porzi et al. (2019) propose an architecture for panoptic segmentation that effectively integrates contextual information from a lightweight DeepLab-inspired module with multi-scale features from an FPN.

Compared to the popular proposal based methods, only a handful of proposal free methods have been proposed. DeeperLab (Yang et al., 2019) was the first bottom-up approach that was introduced, and it employs an encoder-decoder topology to pair object centres for class-agnostic instance segmentation with DeepLab semantic segmentation. Cheng et al. (2019) further build on DeeperLab by introducing a dual-ASPP and dual-decoder structure for each sub-task branch. SSAP (Gao et al., 2019) proposes to group pixels based on a pixel-pair affinity pyramid and incorporates an efficient graph method to generate instances while jointly learning semantic labeling.

In this work, we adopt a top-down approach due to its exceptional ability to handle large scale variations of instances, which is a critical requirement for segmenting 'thing' classes. We present the novel EfficientPS architecture that incorporates our proposed efficient backbone with our 2-way FPN for learning rich multi-scale features in a bidirectional manner, coupled with a new semantic head that captures fine features and long-range context effectively, and a variant of Mask R-CNN augmented with separable convolutions as the instance head. We propose a novel panoptic fusion module to dynamically adapt the fusion of logits from the semantic and instance heads to yield the panoptic segmentation output. Our architecture achieves state-of-the-art results on benchmark datasets while being the most efficient and fastest panoptic segmentation architecture.

3 EfficientPS Architecture

In this section, we first give a brief overview of our proposed EfficientPS architecture and then detail each of its constituent components. Our network follows the top-down layout shown in Figure 2. It consists of a shared backbone with a 2-way Feature Pyramid Network (FPN), followed by task-specific semantic segmentation and instance segmentation heads. We build upon the EfficientNet (Tan and Le, 2019) architecture for the encoder of our shared backbone (depicted in red). It consists of mobile inverted bottleneck (Xie et al., 2017) units and employs compound scaling to uniformly scale all the dimensions of the encoder network. This enables our encoder to have a rich representational capacity with fewer parameters in comparison to other encoders or backbones of similar discriminative capability.

Figure 2 Illustration of our proposed EfficientPS architecture consisting of a shared backbone with our 2-way FPN and parallel semantic and instance segmentation heads, followed by our panoptic fusion module. The shared backbone is built upon the EfficientNet architecture and our new 2-way FPN that enables bidirectional flow of information. The instance segmentation head is based on a modified Mask R-CNN topology, and we incorporate our proposed semantic segmentation head. Finally, the outputs of both heads are fused in our panoptic fusion module to yield the panoptic segmentation output.

As opposed to employing the conventional FPN (Lin et al., 2017) that is commonly used in other panoptic segmentation architectures (Kirillov et al., 2019; Li et al., 2018a; Porzi et al., 2019), we incorporate our proposed 2-way FPN that fuses multi-scale features more effectively than its counterparts. This can be attributed to the fact that the information flow in our 2-way FPN is not bounded to only one direction, as depicted by the purple, blue and green blocks in Figure 2. Subsequent to the 2-way FPN, we employ two heads in parallel for semantic segmentation (depicted in yellow) and instance segmentation (depicted in gray and orange) respectively. We use a variant of the Mask R-CNN (He et al., 2017) architecture as the instance head, and we incorporate our novel semantic segmentation head consisting of dense prediction cells (Chen et al., 2018a) and residual pyramids. The semantic head consists of three different modules for capturing fine features, capturing long-range contextual features, and correlating the distinctly captured features to improve object boundary refinement. Finally, we employ our proposed panoptic fusion module to fuse the outputs of the semantic and instance heads to yield the panoptic segmentation output.

3.1 Network Backbone

The backbone of our network consists of an encoder with our proposed 2-way FPN. The encoder is the basic building block of any segmentation network, and a strong encoder is essential for high representational capacity. In this work, we seek a good trade-off between the number of parameters and computational complexity on the one hand, and the representational capacity of the network on the other. EfficientNets (Tan and Le, 2019), a recent family of architectures, have been shown to significantly outperform other networks in classification tasks while having fewer parameters and FLOPs. They employ compound scaling to uniformly scale the width, depth and resolution of the network efficiently. Therefore, we choose to build upon this scaled architecture with width, depth and resolution coefficients of 1.6, 2.2 and 456, commonly known as the EfficientNet-B5 model. This can be easily replaced with any of the EfficientNet models based on the capacity of the resources that are available and the computational budget.

In order to adapt EfficientNet to our task, we first remove the classification head as well as the Squeeze-and-Excitation (SE) (Hu et al., 2018) connections in the network. We find that the explicit modelling of interdependencies between channels of the convolutional feature maps enabled by the SE connections tends to suppress localization of features in favour of contextual elements. This property is desirable in classification networks; however, both are equally important for segmentation tasks, therefore we do not add any SE connections in our backbone. Second, we replace all the batch normalization (Ioffe and Szegedy, 2015) layers with synchronized Inplace Activated Batch Normalization (iABN sync) (Rota Bulò et al., 2018). This enables synchronization across different GPUs, which in turn yields a better estimate of gradients while performing multi-GPU training, and the in-place operations free up additional GPU memory. We analyze the performance of our modified EfficientNet in comparison to other encoders commonly used in state-of-the-art architectures in the ablation study presented in Section 4.4.2.

Our EfficientNet encoder comprises nine blocks, as shown in Figure 2 (in red). We refer to each block in the figure as block 1 to block 9 in a left-to-right manner. The outputs of blocks 2, 3, 5 and 9 correspond to downsampling factors of ×4, ×8, ×16 and ×32 respectively. The outputs from these blocks with downsampling are also the inputs to our 2-way FPN. The conventional FPN used in other panoptic segmentation networks aims to address the problem of multi-scale feature fusion by aggregating features of different resolutions in a top-down manner. This is performed by first employing a 1×1 convolution to reduce or increase the number of channels of the different encoder output resolutions to a predefined number, typically 256. Then the lower resolution features are upsampled to a higher resolution and subsequently added together. For example, the ×32 resolution encoder output features are resized to the ×16 resolution and added to the ×16 resolution encoder output features. Finally, a 3×3 convolution is used at each scale to further learn fused features, which then yields the P4, P8, P16 and P32 outputs. This FPN topology has a limited unidirectional flow of information, resulting in an ineffective fusion of multi-scale features. Therefore, we propose to mitigate this problem by adding a second branch that aggregates multi-scale features in a bottom-up fashion to enable bidirectional flow of information.

Our proposed 2-way FPN, shown in Figure 2, consists of two parallel branches. Each branch consists of a 1×1 convolution with 256 output filters at each scale for channel reduction. The top-down branch, shown in blue, follows the aggregation scheme of a conventional FPN from right to left, whereas the bottom-up branch, shown in purple, downsamples the higher resolution features to the next lower resolution from left to right and subsequently adds them to the next lower resolution encoder output features. For example, the ×4 resolution features are resized to the ×8 resolution and added to the ×8 resolution encoder output features.

Figure 3 Topologies of various architectural components in the proposed semantic head and instance head of our EfficientPS architecture: (a) LSFE module, (b) RPN module, (c) MC module and (d) DPC module.

Then, in the next stage, the outputs of the bottom-up and top-down branches at each resolution are correspondingly summed together and passed through a 3×3 separable convolution with 256 output channels to obtain the P4, P8, P16 and P32 outputs respectively. We employ separable convolutions as opposed to standard convolutions in an effort to keep the parameter consumption low. We evaluate the performance of our proposed 2-way FPN in comparison to the conventional FPN in the ablation study presented in Section 4.4.3.
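To make the data flow concrete, the following is a minimal PyTorch sketch of such a 2-way FPN, assuming four encoder outputs at ×4, ×8, ×16 and ×32 resolution. The channel counts, module names and the use of BatchNorm2d with Leaky ReLU in place of iABN sync are illustrative assumptions rather than the reference implementation.

```python
# Hedged sketch of a 2-way FPN: a top-down and a bottom-up aggregation branch,
# summed per scale and refined with 3x3 separable convolutions (256 channels).
import torch
import torch.nn as nn
import torch.nn.functional as F


def separable_conv(in_ch, out_ch):
    # 3x3 depthwise + 1x1 pointwise convolution, then normalization and activation.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),          # stand-in for iABN sync
        nn.LeakyReLU(0.01, inplace=True),
    )


class TwoWayFPN(nn.Module):
    def __init__(self, in_channels=(40, 64, 176, 2048), out_ch=256):
        super().__init__()
        # 1x1 channel-reduction convolutions, one set per branch.
        self.lateral_td = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.lateral_bu = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        # 3x3 separable convolutions producing P4, P8, P16 and P32.
        self.out_convs = nn.ModuleList([separable_conv(out_ch, out_ch) for _ in in_channels])

    def forward(self, feats):  # feats: [x4, x8, x16, x32] encoder outputs
        # Top-down branch: start from the coarsest scale and upsample.
        td = [self.lateral_td[i](f) for i, f in enumerate(feats)]
        for i in range(len(td) - 2, -1, -1):
            td[i] = td[i] + F.interpolate(td[i + 1], size=td[i].shape[-2:], mode="nearest")
        # Bottom-up branch: start from the finest scale and downsample.
        bu = [self.lateral_bu[i](f) for i, f in enumerate(feats)]
        for i in range(1, len(bu)):
            bu[i] = bu[i] + F.interpolate(bu[i - 1], size=bu[i].shape[-2:], mode="nearest")
        # Sum both branches at each scale and refine with separable convolutions.
        return [conv(t + b) for conv, t, b in zip(self.out_convs, td, bu)]


if __name__ == "__main__":
    feats = [torch.randn(1, c, 256 // s, 512 // s)
             for c, s in zip((40, 64, 176, 2048), (1, 2, 4, 8))]
    outs = TwoWayFPN()(feats)
    print([o.shape for o in outs])  # four 256-channel maps: P4, P8, P16, P32
```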

3.2 Semantic Segmentation Head

Our proposed semantic segmentation head consists of three components, each aimed at targeting one of the critical requirements. First, at large scales, the network should have the ability to capture fine features efficiently. In order to enable this, we employ our Large Scale Feature Extractor (LSFE) module that has two 3×3 separable convolutions with 128 output filters, each followed by an iABN sync and a Leaky ReLU activation function. The first 3×3 separable convolution reduces the number of filters to 128 and the second 3×3 separable convolution further learns deeper features.
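A possible PyTorch rendering of the LSFE module is sketched below; the 256-channel FPN input and the use of BatchNorm2d with Leaky ReLU in place of iABN sync are assumptions.

```python
# Hedged sketch of the LSFE module: two 3x3 separable convolutions with 128 filters.
import torch.nn as nn


class LSFE(nn.Module):
    def __init__(self, in_ch=256, mid_ch=128):
        super().__init__()
        self.block = nn.Sequential(
            # first separable convolution reduces the 256 FPN channels to 128
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.LeakyReLU(0.01, inplace=True),
            # second separable convolution learns deeper features at 128 channels
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=mid_ch, bias=False),
            nn.Conv2d(mid_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.LeakyReLU(0.01, inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```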

The second requirement is that, at small scales, the network should be able to capture long-range context. Modules inspired by Atrous Spatial Pyramid Pooling (ASPP) (Chen et al., 2017a), which are widely used in state-of-the-art semantic segmentation architectures, have been demonstrated to be effective for this purpose. Dense Prediction Cells (DPC) (Chen et al., 2018a) and Efficient Atrous Spatial Pyramid Pooling (eASPP) (Valada et al., 2019) are two variants of ASPP that are significantly more efficient and also yield a better performance. We find that DPC demonstrates a better performance with a minor increase in the number of parameters compared to eASPP. Therefore, we employ a modified DPC module in our semantic head, as shown in Figure 2. We augment the original DPC topology by replacing batch normalization layers with iABN sync and ReLUs with Leaky ReLUs. The DPC module consists of a 3×3 separable convolution with 256 output channels having a dilation rate of (1,6) that extends out to five parallel branches. Three of the branches each consist of a 3×3 dilated separable convolution with 256 output channels, where the dilation rates are (1,1), (6,21) and (18,15) respectively. The fourth branch takes the output of the dilated separable convolution with a dilation rate of (18,15) as input and passes it through another 3×3 dilated separable convolution with 256 output channels and a dilation rate of (6,3). The outputs from all these parallel branches are then concatenated to yield a tensor with 1280 channels. This tensor is finally passed through a 1×1 convolution with 256 output channels to form the output of the DPC module. Note that each of the convolutions in the DPC module is followed by an iABN sync and a Leaky ReLU activation function.
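The modified DPC topology described above could be sketched as follows. BatchNorm2d and Leaky ReLU again approximate iABN sync, and treating the initial convolution output as the fifth branch so that the concatenation reaches 1280 channels is an interpretation of the description rather than the authors' exact implementation.

```python
# Hedged sketch of the modified DPC module with dilated separable convolutions.
import torch
import torch.nn as nn


def sep_conv(in_ch, out_ch, dilation):
    dh, dw = dilation
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=(dh, dw), dilation=(dh, dw),
                  groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.01, inplace=True),
    )


class DPC(nn.Module):
    def __init__(self, in_ch=256, ch=256):
        super().__init__()
        self.initial = sep_conv(in_ch, ch, (1, 6))
        self.branch_1x1 = sep_conv(ch, ch, (1, 1))
        self.branch_6x21 = sep_conv(ch, ch, (6, 21))
        self.branch_18x15 = sep_conv(ch, ch, (18, 15))
        self.branch_6x3 = sep_conv(ch, ch, (6, 3))    # cascaded after the (18,15) branch
        self.project = nn.Sequential(
            nn.Conv2d(5 * ch, ch, 1, bias=False),
            nn.BatchNorm2d(ch),
            nn.LeakyReLU(0.01, inplace=True),
        )

    def forward(self, x):
        x0 = self.initial(x)
        b1 = self.branch_1x1(x0)
        b2 = self.branch_6x21(x0)
        b3 = self.branch_18x15(x0)
        b4 = self.branch_6x3(b3)                      # takes the (18,15) output as input
        out = torch.cat([x0, b1, b2, b3, b4], dim=1)  # 5 x 256 = 1280 channels
        return self.project(out)
```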

The third and final requirement for the semantic head is that it should be able to mitigate the mismatch between large-scale and small-scale features while performing feature aggregation. To this end, we employ our Mismatch Correction (MC) module that correlates the small-scale features with respect to the large-scale features. It consists of cascaded 3×3 separable convolutions with 128 output channels, each followed by an iABN sync with Leaky ReLU, and a bilinear upsampling layer that upsamples the feature maps by a factor of 2. Figures 3 (a), 3 (c) and 3 (d) illustrate the topologies of these main components of our semantic head.
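A corresponding sketch of the MC module, under the same stand-in assumptions for iABN sync, is shown below.

```python
# Hedged sketch of the Mismatch Correction (MC) module: two cascaded 3x3 separable
# convolutions with 128 channels followed by bilinear upsampling by a factor of 2.
import torch.nn as nn


def sep_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.01, inplace=True),
    )


class MC(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.convs = nn.Sequential(sep_conv(ch, ch), sep_conv(ch, ch))
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        return self.up(self.convs(x))
```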

The four different scaled outputs of our 2-way FPN, namely P4, P8, P16 and P32, are the inputs to our semantic head. The small-scale inputs, P32 and P16, with downsampling factors of ×32 and ×16, are each fed into two parallel DPC modules, while the large-scale inputs, P8 and P4, with downsampling factors of ×8 and ×4, are each passed through two parallel LSFE modules. Subsequently, the outputs from each of these parallel DPC and LSFE modules are augmented with feature alignment connections and each of them is upsampled by a factor of 4. These upsampled feature maps are then concatenated to yield a tensor with 512 channels, which is then input to a 1×1 convolution with $N_{stuff+thing}$ output filters. This tensor is finally upsampled by a factor of 2 and passed through a softmax layer to yield the semantic logits having the same resolution as the input image. The feature alignment connections from the DPC and LSFE modules interconnect each of these outputs by element-wise summation, as shown in Figure 2. We add our MC modules in the interconnections between the second DPC and LSFE, as well as between both the LSFE connections. These correlation connections aggregate contextual information from small-scale features and characteristic large-scale features for better object boundary refinement. We use the weighted per-pixel log-loss (Bulò et al., 2017) for training, which is given by

$$L_{pp}(\Theta) = -\sum_{ij} w_{ij} \log p_{ij}(p^*_{ij}), \qquad (1)$$

where $p^*_{ij}$ is the groundtruth for a given image, $p_{ij}$ is the predicted probability for the pixel $(i,j)$ being assigned class $c \in p^*$, $w_{ij} = \frac{4}{WH}$ if pixel $(i,j)$ belongs to the 25% of the worst predictions, and $w_{ij} = 0$ otherwise. $W$ and $H$ are the width and height of the given input image. The overall semantic head loss is given by

$$L_{semantic}(\Theta) = \frac{1}{n} \sum L_{pp}, \qquad (2)$$

where $n$ is the batch size. We present an in-depth analysis of our proposed semantic head in comparison to other semantic heads commonly used in state-of-the-art architectures in Section 4.4.4.
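As an illustration, a hedged PyTorch sketch of this bootstrapped per-pixel loss (Equations (1) and (2)) is given below. The function name, the use of cross-entropy to obtain per-pixel log-losses and the ignore_index handling are implementation assumptions.

```python
# Hedged sketch of the weighted per-pixel log-loss: only the 25% hardest pixels of
# each image receive the weight 4/(W*H), all other pixels are weighted zero.
import torch
import torch.nn.functional as F


def weighted_per_pixel_loss(logits, target, top_fraction=0.25, ignore_index=255):
    # logits: (B, C, H, W) semantic logits; target: (B, H, W) integer labels
    b, _, h, w = logits.shape
    pixel_loss = F.cross_entropy(logits, target, ignore_index=ignore_index,
                                 reduction="none")          # (B, H, W) per-pixel log-loss
    pixel_loss = pixel_loss.view(b, -1)
    k = int(top_fraction * h * w)
    topk_loss, _ = torch.topk(pixel_loss, k, dim=1)         # hardest 25% per image
    weight = 4.0 / (h * w)
    per_image = weight * topk_loss.sum(dim=1)               # L_pp for each image, Eq. (1)
    return per_image.mean()                                 # average over the batch, Eq. (2)


if __name__ == "__main__":
    logits = torch.randn(2, 19, 64, 128, requires_grad=True)
    target = torch.randint(0, 19, (2, 64, 128))
    print(weighted_per_pixel_loss(logits, target))
```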

3.3 Instance Segmentation Head

The instance segmentation head of our EfficientPS network, shown in Figure 2, has a topology similar to Mask R-CNN (He et al., 2017) with certain modifications. More specifically, we replace all the standard convolutions, batch normalization layers and ReLU activations with separable convolutions, iABN sync and Leaky ReLU respectively. Similar to the rest of our architecture, we use separable convolutions instead of standard convolutions to reduce the number of parameters consumed by the network. This enables us to conserve 2.09 M parameters in comparison to the conventional Mask R-CNN.

Mask R-CNN consists of two stages. In the first stage, the Region Proposal Network (RPN) module, shown in Figure 3 (b), employs a fully convolutional network to output a set of rectangular object proposals and an objectness score for the given input FPN level. The network architecture incorporates a 3×3 separable convolution with 256 output channels, an iABN sync and a Leaky ReLU activation, followed by two parallel 1×1 convolutions with 4k and k output filters respectively. Here, k is the maximum number of object proposals. The k proposals are parameterized relative to k reference bounding boxes, which are called anchors. The scale and aspect ratio define a particular anchor and vary depending on the dataset being used. As the generated proposals can potentially overlap, an additional filtering step of Non-Maximum Suppression (NMS) is employed. Essentially, in case of overlaps, the proposal with the highest objectness score is retained and the rest are discarded.


The RPN first computes object proposals of the form $(p_u, p_v, p_w, p_h)$ and the objectness score $\sigma(p_s)$, which transform an anchor $a$ of the form $(u, v, w, h)$ to $(u + p_u, v + p_v)$ and $(w \cdot e^{p_w}, h \cdot e^{p_h})$, where $(u, v)$ is the position of anchor $a$ in the image coordinate system, $(w, h)$ are its dimensions, and $\sigma(\cdot)$ is the sigmoid function. This is followed by NMS to yield the final candidate bounding box proposals to be used in the next stage.
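A rough sketch of such an RPN head for a single FPN level is shown below; the number of anchors per spatial location and the BatchNorm2d/Leaky ReLU stand-ins for iABN sync are assumptions.

```python
# Hedged sketch of an RPN head: a shared 3x3 separable convolution followed by two
# parallel 1x1 convolutions predicting 4k box deltas and k objectness scores.
import torch
import torch.nn as nn


class RPNHead(nn.Module):
    def __init__(self, in_ch=256, num_anchors=3):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
            nn.Conv2d(in_ch, in_ch, 1, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.LeakyReLU(0.01, inplace=True),
        )
        self.bbox_deltas = nn.Conv2d(in_ch, 4 * num_anchors, 1)  # (p_u, p_v, p_w, p_h) per anchor
        self.objectness = nn.Conv2d(in_ch, num_anchors, 1)       # p_s per anchor

    def forward(self, fpn_level):
        x = self.shared(fpn_level)
        return self.bbox_deltas(x), self.objectness(x)


if __name__ == "__main__":
    deltas, scores = RPNHead()(torch.randn(1, 256, 64, 128))
    print(deltas.shape, scores.shape)
```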

Subsequently, ROI align (He et al., 2017) uses the object proposals to extract features from the FPN encodings by directly pooling features from the $n$-th FPN encoding with a 14×14 spatial resolution, bounded within the bounding box proposal. Here,

$$n = \max\left(1, \min\left(4, \left\lfloor 3 + \log_2\left(\sqrt{wh}/224\right) \right\rfloor\right)\right), \qquad (3)$$

where $w$ and $h$ are the width and height of the bounding box proposal.
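A small helper implementing Equation (3) might look as follows; the explicit mapping of $n$ to the P4–P32 levels is written out for illustration.

```python
# Helper implementing Equation (3): map a box proposal to the FPN level whose
# features are pooled by ROI align (n = 1..4 selecting P4, P8, P16 or P32).
import math


def fpn_level_for_proposal(w, h):
    """Return n in {1, 2, 3, 4} selecting P4, P8, P16 or P32 for a w x h proposal."""
    n = math.floor(3 + math.log2(math.sqrt(w * h) / 224))
    return max(1, min(4, n))


if __name__ == "__main__":
    print(fpn_level_for_proposal(224, 224))   # 3 -> P16
    print(fpn_level_for_proposal(64, 64))     # 1 -> P4
    print(fpn_level_for_proposal(800, 600))   # 4 -> P32
```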

The features that are extracted then serve as input to the bounding box regression, object classification and mask segmentation networks. The RPN operates on each of the output scales of the FPN sequentially, thereby accumulating candidates from the different scales, and ROI align uses the accumulated list for feature extraction from all the scales. The bounding box regression and object classification networks consist of two shared consecutive fully-connected layers with 1024 channels, iABN sync and Leaky ReLU. Subsequently, the feature maps are passed through a fully-connected layer at the end for each task, with $4N_{thing}$ outputs and $N_{thing}+1$ outputs respectively. The object classification logits yield a probability distribution over the classes and void from a softmax layer, whereas the bounding box regression network encodes class-specific correction factors. For a given object proposal of the form $(u, v, w, h)$ and a class $c$, a new bounding box is computed of the form $(u + c_u, v + c_v, w \cdot e^{c_w}, h \cdot e^{c_h})$, where $(c_u, c_v, c_w, c_h)$ are the class-specific correction factors, and $(u, v)$ and $(w, h)$ are the position and dimensions of the bounding box proposal respectively.

The mask segmentation network employs four consecutive 3×3 separable convolutions with 256 output channels, each followed by an iABN sync and Leaky ReLU. Subsequently, the resulting feature maps are passed through a 2×2 transposed convolution with 256 output channels and an output stride of 2, followed by an iABN sync and a Leaky ReLU activation function. Finally, a 1×1 convolution with $N_{thing}$ output channels is employed that yields 28×28 mask logits for each class. The resulting logits give the mask foreground probabilities for the candidate bounding box proposal using a sigmoid function. These are then fused with the semantic logits in our proposed panoptic fusion module described in Section 3.4.
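The mask segmentation network could be sketched in PyTorch as below; the number of 'thing' classes and the normalization layers are illustrative assumptions.

```python
# Hedged sketch of the mask head: four 3x3 separable convolutions, a 2x2 transposed
# convolution with stride 2, and a final 1x1 convolution yielding 28x28 mask logits.
import torch
import torch.nn as nn


def sep_conv(ch):
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False),
        nn.Conv2d(ch, ch, 1, bias=False),
        nn.BatchNorm2d(ch),
        nn.LeakyReLU(0.01, inplace=True),
    )


class MaskHead(nn.Module):
    def __init__(self, in_ch=256, num_thing_classes=8):  # e.g. the 8 'thing' classes of Cityscapes
        super().__init__()
        self.convs = nn.Sequential(*[sep_conv(in_ch) for _ in range(4)])
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(in_ch, in_ch, 2, stride=2),
            nn.BatchNorm2d(in_ch),
            nn.LeakyReLU(0.01, inplace=True),
        )
        self.mask_logits = nn.Conv2d(in_ch, num_thing_classes, 1)

    def forward(self, roi_feats):      # roi_feats: (num_rois, 256, 14, 14) from ROI align
        x = self.upsample(self.convs(roi_feats))
        return self.mask_logits(x)     # (num_rois, N_thing, 28, 28)


if __name__ == "__main__":
    print(MaskHead()(torch.randn(5, 256, 14, 14)).shape)
```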

In order to train the instance segmentation head, we adopt the loss functions proposed in Mask R-CNN, i.e., two loss functions for the first stage, the objectness score loss and the object proposal loss, and three loss functions for the second stage, the classification loss, the bounding box loss and the mask segmentation loss. We take a set of randomly sampled positive matches and negative matches such that $|N_s| \le 256$.

The objectness score loss $L_{os}$, defined as the log loss for a given $N_s$, is given by

$$L_{os}(\Theta) = -\frac{1}{|N_s|} \sum_{(p^*_{os},\, p_{os}) \in N_s} \left[ p^*_{os} \cdot \log p_{os} + (1 - p^*_{os}) \cdot \log(1 - p_{os}) \right], \qquad (4)$$

where $p_{os}$ is the output of the objectness score branch of the RPN and $p^*_{os}$ is the groundtruth label, which is 1 if the anchor is positive and 0 if the anchor is negative. We use the same strategy as Mask R-CNN for defining positive and negative matches. For a given anchor $a$, if the groundtruth box $b^*$ has the largest Intersection over Union (IoU), or $IoU(b^*, a) > T_H$, then the corresponding prediction $b$ is a positive match, and $b$ is a negative match when $IoU(b^*, a) < T_L$. The thresholds $T_H$ and $T_L$ are pre-defined, where $T_H > T_L$.

The object proposal loss $L_{op}$ is a regression loss that is defined only on positive matches and is given by

$$L_{op}(\Theta) = \frac{1}{|N_s|} \sum_{(t^*,\, t) \in N_p} \; \sum_{(i^*,\, i) \in (t^*,\, t)} L_1(i^*, i), \qquad (5)$$

where $L_1$ is the smooth L1 norm, $N_p$ is the subset of $N_s$ positive matches, and $t^* = (t^*_x, t^*_y, t^*_w, t^*_h)$ and $t = (t_x, t_y, t_w, t_h)$ are the parameterizations of $b^*$ and $b$ respectively. $b^* = (x^*, y^*, w^*, h^*)$ is the groundtruth box and $b = (x, y, w, h)$ is the predicted bounding box, where $x$, $y$, $w$ and $h$ are the center coordinates, width and height of the predicted bounding box. Similarly, $x^*$, $y^*$, $w^*$ and $h^*$ denote the center coordinates, width and height of the groundtruth bounding box. The parameterizations (Girshick, 2015) are given by

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}, \qquad (6)$$

$$t^*_x = \frac{x^* - x_a}{w_a}, \quad t^*_y = \frac{y^* - y_a}{h_a}, \quad t^*_w = \log\frac{w^*}{w_a}, \quad t^*_h = \log\frac{h^*}{h_a}, \qquad (7)$$

where $x_a$, $y_a$, $w_a$ and $h_a$ denote the center coordinates, width and height of the anchor $a$.
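For illustration, Equations (6) and (7) and the corresponding inverse transformation can be written as a small pair of helper functions; the center-format box convention and function names are assumptions made for this sketch.

```python
# Box parameterization of Equations (6)-(7): encode a box relative to its anchor
# and decode predicted deltas back to a box. Boxes are (x, y, w, h) center format.
import math


def encode(box, anchor):
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(deltas, anchor):
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


if __name__ == "__main__":
    anchor = (100.0, 80.0, 64.0, 32.0)
    box = (110.0, 90.0, 80.0, 40.0)
    t = encode(box, anchor)
    print(t)                    # parameterization (t_x, t_y, t_w, t_h)
    print(decode(t, anchor))    # recovers the original box
```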

Similar to the objectness score loss $L_{os}$, the classification loss $L_{cls}$ is defined for a set of $K_s$ randomly sampled positive and negative matches such that $|K_s| \le 512$. The classification loss $L_{cls}$ is given by

$$L_{cls}(\Theta) = -\frac{1}{|K_s|} \sum_{c=1}^{N_{thing}+1} Y^*_{oc} \cdot \log Y_{oc} \quad \text{for } (Y^*, Y) \in K_s, \qquad (8)$$

where $Y$ is the output of the classification branch, $Y^*$ is the one-hot encoded groundtruth label, $o$ is the observed class and $c$ is the correct classification for object $o$. For a given image, it is a positive match if $IoU(b^*, b) > T_n$ and otherwise a negative match, where $b^*$ is the groundtruth box and $b$ is the object proposal from the first stage.


Figure 4 Illustration of our proposed panoptic fusion module. Here, the mask logits $ML_A$ and $ML_B$ are fused as $(\sigma(ML_A) + \sigma(ML_B)) \odot (ML_A + ML_B)$, where $ML_B$ is the output of the function $f^*$, $\sigma(\cdot)$ is the sigmoid function and $\odot$ is the Hadamard product. The $f^*$ function, for a given class prediction $c$ (cyclist in this example), zeroes out the scores of the $c$ channel of the semantic logits outside the corresponding bounding box.

The bounding box loss $L_{bbx}$ is a regression loss that is defined only on positive matches and is expressed as

$$L_{bbx}(\Theta) = \frac{1}{|K_s|} \sum_{(T^*,\, T) \in K_p} \; \sum_{(i^*,\, i) \in (T^*,\, T)} L_1(i^*, i), \qquad (9)$$

where $L_1$ is the smooth L1 norm (Girshick, 2015), $K_p$ is the subset of $K_s$ positive matches, and $T^*$ and $T$ are the parameterizations of $B^*$ and $B$ respectively, similar to Equations (4) and (5), where $B^*$ is the groundtruth box and $B$ is the corresponding predicted bounding box.

Finally, the mask segmentation loss is also defined only for positive samples and is given by

$$L_{mask}(\Theta) = -\frac{1}{|K_s|} \sum_{(P^*,\, P) \in K_s} L_p(P^*, P), \qquad (10)$$

where $L_p(P^*, P)$ is given as

$$L_p(P^*, P) = -\frac{1}{|T_p|} \sum_{(i,j) \in T_p} \left[ P^*_{ij} \cdot \log P_{ij} + (1 - P^*_{ij}) \cdot \log(1 - P_{ij}) \right], \qquad (11)$$

where $P$ is the predicted 28×28 binary mask for a class, with $P_{ij}$ denoting the probability of the mask pixel $(i,j)$, $P^*$ is the 28×28 groundtruth binary mask for the class, and $T_p$ is the set of non-void pixels in $P^*$.

All five losses are weighted equally and the total instance segmentation head loss is given by

$$L_{instance} = L_{os} + L_{op} + L_{cls} + L_{bbx} + L_{mask}. \qquad (12)$$

Similar to Mask R-CNN, the gradients that are computed with respect to the losses $L_{cls}$, $L_{bbx}$ and $L_{mask}$ flow only through the network backbone and not through the region proposal network.

3.4 Panoptic Fusion Module

In order to obtain the panoptic segmentation output, we need to fuse the predictions of the semantic segmentation head and the instance segmentation head. However, fusing both predictions is not a straightforward task due to the inherent overlap between them. Therefore, we propose a novel panoptic fusion module to tackle this problem in an adaptive manner in order to thoroughly exploit the predictions from both heads congruously. Figure 4 shows the topology of our panoptic fusion module. We obtain a set of object instances from the instance segmentation head of our network, where for each instance we have its corresponding class prediction, confidence score, bounding box and mask logits. First, we reduce the number of predicted object instances in two stages. We begin by discarding all object instances that have a confidence score lower than a certain confidence threshold. We then resize, zero pad and scale the 28×28 mask logits of each remaining object instance to the same resolution as the input image. Subsequently, we sort the class predictions, bounding boxes and mask logits according to their respective confidence scores. In the second stage, we check each sorted instance mask logit for overlap with the other object instances, and in case the overlap is higher than a given overlap threshold, we discard the other object instance.

After filtering the object instances, we have the class prediction, bounding box prediction and mask logit $ML_A$ of each instance. We simultaneously obtain the semantic logits with $N$ channels from the semantic head, where $N$ is the sum of $N_{stuff}$ and $N_{thing}$. We then compute a second mask logit $ML_B$ for each instance, where we select the channel of the semantic logits based on its class prediction. We only keep the logit scores of the selected channel for the area within the instance bounding box, while we zero out the scores that are outside this region. In the end, we have two mask logits for each instance: one from the instance segmentation head and the other from the semantic segmentation head. We combine these two logits adaptively by computing the Hadamard product of the sum of the sigmoids of $ML_A$ and $ML_B$, and the sum of $ML_A$ and $ML_B$, to obtain the fused mask logits $FL$ of the instances, expressed as

$$FL = (\sigma(ML_A) + \sigma(ML_B)) \odot (ML_A + ML_B), \qquad (13)$$

where $\sigma(\cdot)$ is the sigmoid function and $\odot$ is the Hadamard product. We then concatenate the fused mask logits of the object instances with the 'stuff' logits along the channel dimension to generate the intermediate panoptic logits. Subsequently, we apply the argmax operation along the channel dimension to obtain the intermediate panoptic prediction. In the final step, we take a zero-filled canvas and first copy the instance-specific 'thing' predictions from the intermediate panoptic prediction. We then fill the empty parts of the canvas with the 'stuff' class predictions by copying them from the predictions of the semantic head, while ignoring the classes that have an area smaller than a predefined threshold called the minimum stuff area. This gives us the final panoptic segmentation output.

We fuse the $ML_A$ and $ML_B$ instance logits in the aforementioned manner due to the fact that if both logits for a given pixel conform with each other, the final instance score will increase proportionately to their agreement, or vice-versa. In case of agreement, the corresponding object instance will dominate or be superseded by other instances as well as the 'stuff' class scores. Similarly, in case of disagreement, the score of the given object instance will reflect the extent of their difference. Simply put, the fused logit score is either adaptively attenuated or amplified according to the consensus. We evaluate the performance of our proposed panoptic fusion module in comparison to other existing methods in the ablation study presented in Section 4.4.5.
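A simplified sketch of this fusion for a single image is given below, assuming that instance filtering has already been applied; the tensor shapes, the stuff/thing channel split and the function interface are illustrative rather than the reference implementation.

```python
# Hedged sketch of the adaptive logit fusion of Equation (13): box-masked semantic
# logits ML_B are combined with instance mask logits ML_A, concatenated with the
# 'stuff' logits, and reduced with a per-pixel argmax.
import torch


def fuse_panoptic(sem_logits, inst_mask_logits, inst_classes, inst_boxes, num_stuff):
    """sem_logits: (N_stuff + N_thing, H, W); inst_mask_logits: (I, H, W) at image size;
    inst_classes: (I,) channel index of each instance's class in sem_logits;
    inst_boxes: (I, 4) boxes as (x1, y1, x2, y2) in pixel coordinates."""
    _, h, w = sem_logits.shape
    fused = []
    for ml_a, cls, box in zip(inst_mask_logits, inst_classes, inst_boxes):
        x1, y1, x2, y2 = [int(v) for v in box]
        ml_b = torch.zeros_like(ml_a)
        # keep the class-channel semantic scores only inside the instance bounding box
        ml_b[y1:y2, x1:x2] = sem_logits[int(cls), y1:y2, x1:x2]
        # adaptive fusion of Equation (13)
        fused.append((torch.sigmoid(ml_a) + torch.sigmoid(ml_b)) * (ml_a + ml_b))
    fused = torch.stack(fused) if fused else sem_logits.new_zeros((0, h, w))
    # concatenate 'stuff' logits with the fused instance logits and take the argmax
    panoptic_logits = torch.cat([sem_logits[:num_stuff], fused], dim=0)
    return panoptic_logits.argmax(dim=0)   # intermediate panoptic prediction


if __name__ == "__main__":
    sem = torch.randn(19, 64, 128)                    # e.g. 11 'stuff' + 8 'thing' channels
    masks = torch.randn(2, 64, 128)                   # two surviving instances
    classes = torch.tensor([11, 13])                  # their 'thing' channels in sem
    boxes = torch.tensor([[10, 5, 60, 40], [70, 20, 120, 60]])
    print(fuse_panoptic(sem, masks, classes, boxes, num_stuff=11).shape)  # (64, 128)
```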

4 Experimental Results

In this section, we first describe the standard evaluation metrics that we adopt for empirical evaluations, followed by brief descriptions of the datasets that we benchmark on in Section 4.1. We then present extensive quantitative comparisons and benchmarking results in Section 4.3, and detailed ablation studies on the various proposed architectural components in Section 4.4. Finally, we present qualitative comparisons and visualizations of panoptic segmentation on each of the datasets that we evaluate on in Section 4.5 and Section 4.6 respectively.

We use PyTorch (Paszke et al., 2019) for implementing all our architectures, and we trained our models on a system with an Intel Xeon 2.20GHz processor and NVIDIA TITAN X GPUs. We use the standard Panoptic Quality (PQ)

metric (Kirillov et al., 2019) for quantifying the performance of our models. The PQ metric is computed as

PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}     (14)

where TP, FP, FN and IoU are the true positives, false positives, false negatives and the intersection-over-union, respectively. The IoU is computed as IoU = TP/(TP + FP + FN). We also report the Segmentation Quality (SQ) and Recognition Quality (RQ) metrics, computed as

SQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP|}     (15)

RQ = \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}     (16)

Following the standard benchmarking criteria for panoptic segmentation, we report PQ, SQ and RQ over all the classes in the dataset, and we also report them for the 'stuff' classes (PQSt, SQSt, RQSt) and the 'thing' classes (PQTh, SQTh, RQTh). Additionally, for the sake of completeness, we report the Average Precision (AP) and the mean Intersection-over-Union (mIoU) for both 'stuff' and 'thing' classes, as well as the inference time and FLOPs for comparisons. The implementation of our proposed EfficientPS model and a live demo on various datasets is publicly available at https://rl.uni-freiburg.de/research/panoptic.
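As a reference for how these metrics relate, the following minimal sketch computes PQ, SQ and RQ from the IoU values of matched (prediction, groundtruth) segment pairs; the matching itself (pairs with IoU > 0.5, following Kirillov et al., 2019) is assumed to have been done beforehand.

def panoptic_quality(tp_ious, num_fp, num_fn):
    # tp_ious: IoU of each true-positive segment pair; num_fp/num_fn: unmatched predicted/groundtruth segments.
    tp = len(tp_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / tp if tp > 0 else 0.0   # Eq. (15)
    rq = tp / denom                              # Eq. (16)
    return sq * rq, sq, rq                       # PQ = SQ * RQ, Eq. (14)

# Example: three matched segments, one false positive, two false negatives.
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=2)  # sq = 0.8, rq ~ 0.67, pq ~ 0.53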

4.1 Datasets

We benchmark our proposed EfficientPS for panoptic segmentation on four challenging urban scene understanding datasets, namely Cityscapes (Cordts et al., 2016), KITTI (Geiger et al., 2013), Mapillary Vistas (Neuhold et al., 2017) and the Indian Driving Dataset (Varma et al., 2019). The KITTI benchmark does not provide panoptic annotations, therefore to facilitate this work we publicly release manually annotated panoptic groundtruth segmentation labels for the popular KITTI benchmark. These four diverse datasets contain images that range from congested city driving scenarios to rural scenes and highways. They also contain scenes in challenging perceptual conditions including snow, motion blur and other seasonal visual changes. We briefly describe the characteristics of these datasets in this section.

Cityscapes: The Cityscapes dataset (Cordts et al., 2016) consists of urban street scenes and focuses on semantic understanding of common driving scenarios. It is one of the most challenging datasets for panoptic segmentation due to its sheer diversity, as it covers scenes from over 50 European cities recorded over several seasons such as spring, summer and fall. The presence of a large number of dynamic objects further adds to its complexity.


Figure 5 Example images (left: RGB, right: panoptic groundtruth) from the challenging urban scene understanding datasets that we benchmark on, namely (a) Cityscapes, (b) KITTI, (c) Mapillary Vistas and (d) Indian Driving Dataset (IDD). The images show cluttered urban scenes with many dynamic objects, occluded objects, perpetual snowy conditions and unstructured environments.

Figure 5 (a) shows an example image and the corresponding panoptic groundtruth annotation from the Cityscapes dataset. As we see from this example, the scenes are extremely cluttered, with many dynamic objects such as pedestrians and cyclists that are often grouped near one another or partially occluded. These factors make panoptic segmentation, especially segmenting the 'thing' classes, exceedingly challenging.

The widely used Cityscapes dataset recently introduced a benchmark for the task of panoptic segmentation. The dataset contains pixel-level annotations for 19 object classes, of which 11 are 'stuff' classes and 8 are instance-specific 'thing' classes. It consists of 5000 finely annotated images and 20000 coarsely annotated images that were captured at a resolution of 2048×1024 pixels using an automotive-grade 22 cm baseline stereo camera. The finely annotated images are divided into 2975 for training, 500 for validation and 1525 for testing. The annotations for the test set are not publicly released; they are rather only available to the online evaluation server that automatically computes the metrics and publishes the results. We report the performance of our proposed EfficientPS on both the validation set as well as the test set. We also use the Cityscapes dataset for evaluating the improvement due to the various architectural contributions that we make in the ablation study. We report results on the validation set for our model trained only on the fine annotations, and we report the results on the test set from the benchmarking server for our model trained on both the fine and coarse annotations.

KITTI: The KITTI vision benchmark suite (Geiger et al., 2013) is one of the most comprehensive datasets that provides groundtruth for a variety of tasks such as semantic segmentation, scene flow estimation, optical flow estimation, depth prediction, odometry estimation, tracking and road lane detection. However, it still has not expanded its annotations to support the recently introduced panoptic segmentation task. The challenging nature of the KITTI scenes and its potential for benchmarking multi-task learning problems makes extending this dataset to include panoptic annotations of great interest to the community. Therefore, in this work we introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI vision benchmark suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set, therefore in addition to panoptic segmentation they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually.

Our dataset consists of a total of 1055 images, out of which 855 are used for the training set and 200 are used for the validation set. We provide annotations for 11 'stuff' classes and 8 'thing' classes, adhering to the Cityscapes 'stuff' and 'thing' class distribution. In order to create the panoptic annotations, we gathered semantic annotations from community driven extensions of KITTI (Xu et al., 2016; Ros et al., 2015) and combined them with the 200 training images from the KITTI semantic training set. We then manually annotated all the images with instance masks, followed by merging them with the semantic annotations to generate the panoptic segmentation groundtruth labels. The images in our KITTI panoptic segmentation dataset have a resolution of 1280×384 pixels and contain scenes from both residential and inner city scenarios. Figure 5 (b) shows an example image from the KITTI panoptic segmentation dataset and its corresponding panoptic segmentation labels. We observe that the car denoted in teal color pixels and the van are both partially occluded by other 'stuff' classes, such that each object instance is split into two disjoint components. We find that scenarios such as these are extremely challenging for the task of panoptic segmentation, as the disjoint object mask has to be assigned to the same instance ID. We hope that this dataset encourages innovative solutions to such real-world problems that are uncommon in other datasets and also accelerates research in multi-task learning for urban scene understanding.

Mapillary Vistas: Mapillary Vistas (Neuhold et al., 2017) is one of the largest publicly available street-level imagery datasets that contains pixel-accurate and instance-specific semantic annotations. The novel aspects of this dataset include diverse scenes from over six continents and in a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18000 images for training, 2000 images for validation and 5000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that due to the snowy condition, recognizing distant objects such as the car in this example becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset: The Indian Driving Dataset (IDD) (Varma et al., 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructure such as lanes and sidewalks. It has a significantly larger number of 'thing' instances in each scene compared to other datasets, and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car, and the data was gathered in two Indian cities as well as in their outskirts. IDD consists of a total of 10003 images, where 6993 are used for training, 981 for validation and 2029 for testing. The images have a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3, therefore we report the results of our model on the same to facilitate comparison. This level comprises a total of 26 classes, out of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset is shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels. We take crops from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky et al., 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio, 2010) for the other layers, zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set TH = 0.7, TL = 0.3 and TN = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of ct = 0.5, an overlap threshold of ot = 0.5 and a minimum stuff area of minsa = 2048.

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone and continuing training until convergence. We denote the base learning rate lr_base, the milestones and the total number of iterations ti for each dataset in the following format: {lr_base, milestone, milestone, ti}. The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are {0.07, 32K, 44K, 50K}, {0.07, 144K, 176K, 192K}, {0.07, 16K, 22K, 25K} and {0.07, 108K, 130K, 144K} respectively. At the beginning of the training, we have a warm-up phase where lr_base is increased linearly from 1/3 · lr_base to lr_base in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^-4. The final loss L_total that we optimize is computed as

L_{total} = L_{semantic} + L_{instance}     (17)

where L_semantic and L_instance are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA Titan X GPUs, where each GPU processes a single image.
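The multi-step schedule with linear warm-up described above can be realized, for example, as follows; the function uses the Cityscapes values (lr_base = 0.07, milestones at 32K and 44K iterations) and is a sketch rather than the exact training code.

import torch

def lr_at_iteration(it, lr_base=0.07, milestones=(32000, 44000), warmup_iters=200):
    if it < warmup_iters:
        # linear warm-up from lr_base/3 to lr_base over the first 200 iterations
        alpha = it / warmup_iters
        return (lr_base / 3) * (1 - alpha) + lr_base * alpha
    lr = lr_base
    for m in milestones:
        if it >= m:
            lr *= 0.1          # lower the learning rate by a factor of 10 at each milestone
    return lr

# Minimal usage with a placeholder model; a real training loop would add the loss and backward pass.
model = torch.nn.Conv2d(3, 19, kernel_size=1)
optimizer = torch.optim.SGD(model.parameters(), lr=lr_at_iteration(0), momentum=0.9)
for step in range(50000):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_iteration(step)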

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts. While for KITTI and IDD, we report results for the models that we trained using the official implementations that have been publicly released by the authors, after further tuning of hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets, and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that at the time of submission, only the Cityscapes benchmark has the provision to evaluate the results on the test set.


Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively, and − denotes that the metric has not been reported for the corresponding method.

Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all values in %)

Single-Scale:
WeaklySupervised | − | 47.3 | − | − | 39.6 | − | − | 52.9 | − | − | 24.3 | 71.6
TASCNet | − | 55.9 | − | − | 50.5 | − | − | 59.8 | − | − | − | −
Panoptic FPN | − | 58.1 | − | − | 52.0 | − | − | 62.5 | − | − | 33.0 | 75.7
AUNet | − | 59.0 | − | − | 54.8 | − | − | 62.1 | − | − | 34.4 | 75.6
UPSNet | − | 59.3 | 79.7 | 73.0 | 54.6 | 79.3 | 68.7 | 62.7 | 80.1 | 76.2 | 33.3 | 75.2
DeeperLab | − | 56.3 | − | − | − | − | − | − | − | − | − | −
Seamless | − | 60.3 | − | − | 56.1 | − | − | 63.3 | − | − | 33.6 | 77.5
SSAP | − | 61.1 | − | − | 55.0 | − | − | − | − | − | − | −
AdaptIS | − | 62.0 | − | − | 58.7 | − | − | 64.4 | − | − | 36.3 | 79.2
Panoptic-DeepLab | − | 63.0 | − | − | − | − | − | − | − | − | 35.3 | 80.5
EfficientPS (ours) | − | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3
TASCNet | COCO | 59.3 | − | − | 56.0 | − | − | 61.5 | − | − | 37.6 | 78.1
UPSNet | COCO | 60.5 | 80.9 | 73.5 | 57.0 | − | − | 63.0 | − | − | 37.8 | 77.8
Seamless | Vistas | 65.0 | − | − | 60.7 | − | − | 68.0 | − | − | − | 80.7
Panoptic-DeepLab | Vistas | 65.3 | − | − | − | − | − | − | − | − | 38.8 | 82.5
EfficientPS (ours) | Vistas | 66.1 | 82.5 | 78.9 | 62.7 | 81.9 | 75.2 | 68.5 | 82.9 | 81.6 | 41.9 | 81.0

Multi-Scale:
Panoptic-DeepLab | − | 64.1 | − | − | − | − | − | − | − | − | 38.5 | 81.5
EfficientPS (ours) | − | 65.1 | 82.2 | 79.0 | 61.5 | 81.4 | 75.4 | 67.7 | 82.8 | 81.7 | 39.7 | 80.3
TASCNet | COCO | 60.4 | − | − | 56.1 | − | − | 63.3 | − | − | 39.1 | 78.7
M-RCNN + PSPNet | COCO | 61.2 | 80.9 | 74.4 | 54.0 | − | − | 66.4 | − | − | 36.4 | 80.9
UPSNet | COCO | 61.8 | 81.3 | 74.8 | 57.6 | 77.7 | 70.5 | 64.8 | 81.4 | − | 39.0 | 79.2
Panoptic-DeepLab | Vistas | 67.0 | − | − | − | − | − | − | − | − | 42.5 | 83.1
EfficientPS (ours) | Vistas | 67.5 | 83.2 | 80.2 | 63.5 | 82.2 | 77.2 | 70.4 | 83.9 | 82.4 | 43.8 | 82.1

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Network | Pre-training | PQ | SQ | RQ | PQTh | PQSt (all values in %)
SSAP | − | 58.9 | 82.4 | 70.6 | 48.4 | 66.5
TASCNet | COCO | 60.7 | 81.0 | 73.8 | 53.4 | 66.0
Panoptic-DeepLab | − | 62.3 | 82.4 | 74.8 | 52.1 | 69.7
Seamless | Vistas | 62.6 | 82.1 | 75.3 | 56.0 | 67.5
Panoptic-DeepLab | Vistas | 65.5 | 77.8 | 75.2 | 56.5 | 72.0
EfficientPS (ours) | − | 64.1 | 82.6 | 76.8 | 56.7 | 69.4
EfficientPS (ours) | Vistas | 66.4 | 83.5 | 78.8 | 59.3 | 71.5

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures

Network | Input Size (pixels) | Params (M) | FLOPs (B) | Time (ms)
DeeperLab | 1025×2049 | − | − | 463
UPSNet | 1024×2048 | 45.05 | 487.02 | 202
Seamless | 1024×2048 | 51.43 | 514.00 | 168
Panoptic-DeepLab | 1025×2049 | 46.73 | 547.49 | 175
EfficientPS (ours) | 1024×2048 | 40.89 | 433.94 | 166

On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75, 2) during the multi-scale evaluations.
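For the semantic logits, multi-scale evaluation can be sketched as averaging predictions over the scaled and horizontally flipped inputs, as shown below; how the instance predictions are merged across scales is not detailed here, and `model` returning per-pixel class logits of shape (1, C, H', W') is an assumption for illustration.

import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_semantic_logits(model, image, scales=(0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    # image: (1, 3, H, W); the averaged logits are returned at the original resolution.
    _, _, h, w = image.shape
    acc = torch.zeros(0)
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = model(inp)
            if flip:
                logits = torch.flip(logits, dims=[3])                  # undo the horizontal flip
            logits = F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
            acc = logits if acc.numel() == 0 else acc + logits
    return acc / (2 * len(scales))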

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al., 2018b), TASCNet (Li et al., 2018a), Panoptic FPN (Kirillov et al., 2019), AUNet (Li et al., 2019b), UPSNet (Xiong et al., 2019), DeeperLab (Yang et al., 2019), Seamless (Porzi et al., 2019), SSAP (Gao et al., 2019), AdaptIS (Sofiiuk et al., 2019) and Panoptic-DeepLab (Cheng et al., 2019). Table 1 shows the results on the Cityscapes validation set. For a fair comparison, we categorize models in the table separately according to those that report


Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively, and − denotes that the metric has not been reported for the corresponding method.

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all values in %)

Single-Scale:
JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | −
DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3
TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | −
AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | −
Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4
Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3
EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6

Multi-Scale:
TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | −
Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8
EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1

single-scale and multi-scale evaluation, as well as without any pre-training and with pre-training on other datasets, namely Mapillary Vistas (Neuhold et al., 2017), denoted as Vistas, and Microsoft COCO (Lin et al., 2014), abbreviated as COCO. We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations, depth data or exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQSt, PQTh, SQ and RQ metrics, and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS without pre-training on any extra data achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also

uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime, including the time consumed for fusing the semantic and instance predictions that yields the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as the presence of 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-the-art Panoptic-DeepLab.


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all values in %)

Single-Scale:
Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1
UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6
Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8
EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3

Multi-Scale:
Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8
UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2
Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1
EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all values in %)

Single-Scale:
Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1
UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4
Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6
EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3

Multi-Scale:
Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1
UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6
Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3
EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1

Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQSt score, primarily due to the output stride of 16 at which it operates, which increases the computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQSt of Panoptic-DeepLab. While on the other hand, top-down approaches tend to have better instance segmentation ability, as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured

by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4.0% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQSt score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on the various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote the Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '−' refers to the standard configuration as in Kirillov et al. (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model | Encoder | 2-way FPN | SIH | SH | PFM | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all values in %)
M1 | ResNet-50 | − | − | − | − | 58.2 | 79.1 | 72.0 | 52.4 | 78.8 | 67.2 | 62.4 | 79.4 | 75.6 | 31.6 | 74.3
M2 | ResNet-50 | − | − | − | X | 58.8 | 79.5 | 72.6 | 53.4 | 79.2 | 62.8 | 67.9 | 79.7 | 76.1 | 33.8 | 75.1
M3 | ResNet-50 | − | X | − | X | 58.6 | 79.4 | 72.4 | 53.1 | 79.1 | 67.5 | 62.6 | 79.6 | 75.9 | 33.7 | 75.0
M4 | EfficientNet-B5 | − | X | − | X | 59.7 | 79.9 | 73.3 | 54.7 | 76.6 | 68.1 | 63.3 | 80.3 | 79.5 | 34.1 | 76.3
M5 | EfficientNet-B5 | X | X | − | X | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
M6 | EfficientNet-B5 | X | X | X | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of the various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and its comparison with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the same network configuration and panoptic fusion heuristics as Kirillov et al. (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network is comprised of an upsampling stage which has a 3×3 convolution, group norm (Wu and He, 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M with a drop of 0.2% in PQ and a 0.1% drop in the AP and mIoU scores. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively, enabling it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, the improvement cannot be solely attributed to the semantic head. This is due to the fact that if we employ standard panoptic fusion heuristics, an improvement in semantic segmentation


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Encoder | Params (M) | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all values in %)
MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2
ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1
ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2
Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1
ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9
Mod. EfficientNet-B5 (ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Architecture | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all values in %)
Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3
Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2
PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8
2-way FPN (ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

would only contribute to an increase in the PQSt score. However, our proposed adaptive panoptic fusion yields an improvement in PQTh as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al., 2019), ResNet-50 (He et al., 2016), ResNet-101 (He et al., 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al., 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have a comparable number of or fewer parameters

than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al., 2017), the bottom-up FPN and the PANet FPN variants. We refer to the FPN architecture described in Liu et al. (2018) as PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. Both these models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the ×os pyramid scale level, c^k_f refers to a convolution layer with f filters and a k×k kernel size, LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all values in %)
M61 | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | − | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2
M62 | LSFE | LSFE | LSFE | LSFE | − | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4
M63 | DPC | LSFE | LSFE | LSFE | − | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1
M64 | DPC | DPC | LSFE | LSFE | − | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6
M65 | DPC | DPC | DPC | LSFE | − | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2
M66 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

the standard FPN in a sequential or parallel manner respectively. We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output which is 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output which is the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model, therefore we employ separable convolutions in all the experiments that follow.
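For reference, a module of this form (two cascaded 3×3 separable convolutions with 128 filters, each followed by normalization and leaky ReLU) could be sketched as follows; BatchNorm2d stands in for the iABN sync layer used in the paper, and the channel count is the value from the ablation.

import torch.nn as nn

class LSFE(nn.Module):
    """Sketch of the Large Scale Feature Extractor configuration used in the ablation."""
    def __init__(self, in_channels, channels=128):
        super().__init__()
        def sep_conv(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin, bias=False),  # depthwise
                nn.Conv2d(cin, cout, kernel_size=1, bias=False),                        # pointwise
                nn.BatchNorm2d(cout),      # stand-in for iABN sync
                nn.LeakyReLU(0.01, inplace=True),
            )
        self.block = nn.Sequential(sep_conv(in_channels, channels), sep_conv(channels, channels))

    def forward(self, x):
        return self.block(x)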

In the M63 model, we replace the LSFE module at the P32 level of the 2-way FPN with dense prediction cells (DPC), described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module at the P16 level with DPC, and in the subsequent M65 model, we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with the fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features that result in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al. (2019), which we denote as the baseline, UPSNet (Xiong et al., 2019) and Seamless (Porzi et al., 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Semantic Head | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all values in %)
Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1
Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt (all values in %)
Baseline | 62.4 | 80.8 | 75.4 | 58.7 | 80.4 | 72.6 | 65.1 | 81.1 | 77.4
TASCNet | 62.5 | 80.9 | 75.6 | 58.6 | 80.5 | 72.8 | 65.3 | 81.2 | 77.7
UPSNet | 63.1 | 81.3 | 76.1 | 59.5 | 80.6 | 73.2 | 65.7 | 81.8 | 78.2
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as a baseline as it is extensively used in several panoptic segmentation networks. We then compare with Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in (Xiong et al., 2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and instance head that have an inherent overlap is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them, and therefore it relies on the mode computed over the segmentation output to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model Panoptic-DeepLab does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding datasets.


Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on the different benchmark datasets: (a), (b) Cityscapes; (c), (d) Mapillary Vistas; (e), (f) KITTI; (g), (h) IDD. Each row shows the input image, the Seamless output, the EfficientPS output and the improvement/error map. The improvement/error map denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


For each example, we show the input image and the corresponding panoptic segmentation outputs from the Seamless model and from our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset, in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN that effectively aggregates multi-scale features to learn semantically richer representations, and to the panoptic fusion module that addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c), the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately address this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the fact that there is a lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck in the right lane is partially occluded by the cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios, consistently across the different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head that correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearance of these two objects; however, our proposed EfficientPS model, with our novel panoptic backbone, has an extensive representational capacity which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of


Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on (rows (a), (b): Cityscapes; (c), (d): Mapillary Vistas; (e), (f): KITTI; (g), (h): IDD; columns (i)-(iv) show different examples), which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances in multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk, as well as seasonal changes.


traffic participants. These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from uncommon viewpoints compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets that demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analysis and visualizations that highlight the improvements that we make to various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335
Badrinarayanan V, Kendall A, Cipolla R (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229
Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons
Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision, Springer, pp 44–57
Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848
Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems, pp 8713–8724
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-DeepLab: A simple, strong and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, pp 534–549
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the International Conference on Computer Vision, pp 764–773
Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) SSAP: Single-shot instance segmentation with affinity pyramid. In: Proceedings of the International Conference on Computer Vision, pp 642–651
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The KITTI dataset. International Journal of Robotics Research
de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110
Girshick R (2015) Fast R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 1440–1448
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256
Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, pp 297–312
Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 2961–2969
He X, Gould S (2014a) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
He X, Gould S (2014b) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for MobileNetV3. In: Proceedings of the International Conference on Computer Vision, pp 1314–1324
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059
Kang BR, Kim HY (2018) BshapeNet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327
Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413
Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In: Proceedings of the International Conference on Computer Vision, pp 2190–2197
Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in Neural Information Processing Systems, pp 109–117
Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192
Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118
Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229
Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: European Conference on Computer Vision, Springer, pp 740–755
Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181
Liu S, Jia J, Fidler S, Urtasun R (2017) SGN: Sequential grouping networks for instance segmentation. In: Proceedings of the International Conference on Computer Vision, pp 3496–3504
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768
Liu W, Rabinovich A, Berg AC (2015) ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In: Proceedings of the International Conference on Computer Vision, pp 4990–4999
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035
Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998
Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824
Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286
Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887
Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664
Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329
Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238
Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of DNNs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sofiiuk K, Barinova O, Konushin A (2019) AdaptIS: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363
Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference
Tan M, Le QV (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135
Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision 63(2):113–140
Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25
Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots
Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop: Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics
Valada A, Vertens J, Dhall A, Burgard W (2017) AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651
Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)
Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, Special Issue: Deep Learning for Robotic Vision
Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751
Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) UPSNet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826
Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349
Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) DeeperLab: Single-shot image parser. arXiv preprint arXiv:1902.05093
Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721
Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890
Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227


stantial amount of attention in recent years (Shotton et al 2008; Krahenbuhl and Koltun 2011; Silberman et al 2014; He and Gould 2014a). Moreover, advances in deep learning (Chen et al 2018b; Zhao et al 2017; Valada et al 2016a; He et al 2017; Liu et al 2018; Zurn et al 2019) have further boosted the performance of these tasks to new heights. However, state-of-the-art deep learning methods still predominantly address these tasks independently, although their objective of understanding the scene at the pixel level establishes an inherent connection between them. More surprisingly, they have also fundamentally branched out into different directions of proposal based methods (He et al 2017) for instance segmentation and fully convolutional networks (Long et al 2015) for semantic segmentation, even though some earlier approaches (Tighe et al 2014; Tu et al 2005; Yao et al 2012) have demonstrated the potential benefits of combining them.

Recently, Kirillov et al (2019) revived the need to tackle these tasks jointly by coining the term panoptic segmentation and introducing the panoptic quality metric for combined evaluation. The goal of this task is to jointly predict 'stuff' and 'thing' classes, essentially unifying the separate tasks of semantic and instance segmentation. More specifically, if a pixel belongs to the 'stuff' class, the panoptic segmentation network assigns a class label from the 'stuff' classes, whereas if the pixel belongs to the 'thing' class, the network predicts both which 'thing' class it corresponds to as well as the instance of the object class. Kirillov et al (2019) also present a baseline approach for panoptic segmentation that heuristically combines predictions from individual state-of-the-art instance and semantic segmentation networks in a post-processing step. However, this disjoint approach has several drawbacks including large computational overhead, redundancy in learning, and discrepancy between the predictions of each network. Although recent methods have made significant strides to address this task in a top-down manner with shared components or in a bottom-up manner sequentially, these approaches still face several challenges in terms of computational efficiency, slow runtimes and subpar results compared to task-specific individual networks.

In this paper, we propose the novel EfficientPS architecture that provides effective solutions to the aforementioned problems. The architecture consists of our new shared backbone with mobile inverted bottleneck units and our proposed 2-way Feature Pyramid Network (FPN), followed by task-specific instance and semantic segmentation heads with separable convolutions whose outputs are combined in our parameter-free panoptic fusion module. The entire network is jointly optimized in an end-to-end manner to yield the final panoptic segmentation output. Figure 1 shows an overview of the information flow in our network along with the intermediate predictions and the final output. The design of our proposed EfficientPS is influenced by the goal of achieving superior performance compared to existing methods while simultaneously being fast and computationally efficient.

Currently, the best performing top-down panoptic segmentation models (Porzi et al 2019; Xiong et al 2019; Li et al 2018a) primarily employ the ResNet-101 (He et al 2016) or ResNeXt-101 (Xie et al 2017) architecture with Feature Pyramid Networks (Lin et al 2017) as the backbone. Although these backbones have a high representational capacity, they consume a significant amount of parameters. In order to achieve a better trade-off, we propose a new backbone network consisting of a modified EfficientNet (Tan and Le 2019) architecture that employs compound scaling to uniformly scale all the dimensions of the network, coupled with our novel 2-way FPN. Our proposed backbone is substantially more efficient as well as effective than its popular counterparts (He et al 2016; Kaiser et al 2017; Xie et al 2017). Moreover, we identify that the standard FPN architecture has its limitations in aggregating multi-scale features due to the unidirectional flow of information. While there are other extensions that aim to mitigate this problem by adding bottom-up path augmentation (Liu et al 2018) to the outputs of the FPN, they are considerably slower. Therefore, we propose our novel 2-way FPN that facilitates bidirectional flow of information, which substantially improves the panoptic quality of 'thing' classes while remaining comparable in runtime.

Now, the outputs of our 2-way FPN are of multiple scales, which we refer to as large-scale features when they have a downsampling factor of ×4 or ×8 with respect to the input image, and small-scale features when they have a downsampling factor of ×16 or ×32. The large-scale outputs comprise fine or characteristic features, whereas the small-scale outputs contain features rich in semantic information. The presence of these distinct characteristics necessitates processing features at each scale uniquely. Therefore, we propose a new semantic head with separable convolutions which aggregates small-scale and large-scale features independently before correlating and fusing contextual features with fine features. We demonstrate that this semantically reinforces fine features, resulting in better object boundary refinement. For our instance head, we build upon Mask R-CNN and augment it with separable convolutions and iABN sync (Rota Bulo et al 2018) layers.

One of the critical challenges in panoptic segmentation deals with resolving the conflict of overlapping predictions from the semantic and instance heads. Most architectures (Kirillov et al 2019; Porzi et al 2019; Li et al 2019b; de Geus et al 2018) employ a standard post-processing step (Kirillov et al 2019) that adopts instance-specific 'thing' segmentation from the instance head and 'stuff' segmentation from the semantic head. This fusion technique completely ignores the logits of the semantic head while segmenting 'thing' regions in the panoptic segmentation output, which is sub-optimal as the 'thing' logits of the semantic head can aid in resolving the conflict more effectively. In order to thoroughly exploit the logits from both heads, we propose a parameter-free panoptic fusion module that adaptively fuses logits by selectively attenuating or amplifying fused logit scores based on how agreeable or disagreeable the predictions of the individual heads are for each pixel in a given instance. We demonstrate that our proposed panoptic fusion mechanism is more effective and efficient than other widely used methods in existing architectures.

Furthermore, we also introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for images in the challenging KITTI benchmark (Geiger et al 2013). As KITTI provides groundtruth for a whole suite of perception and localization tasks, these new panoptic annotations further complement the widely popular benchmark. We hope that these panoptic annotations that we make publicly available encourage future research in multi-task learning for holistic scene understanding. Furthermore, in order to facilitate comparison, we benchmark previous state-of-the-art models on our newly introduced KITTI panoptic segmentation dataset and the IDD dataset. We perform exhaustive experimental evaluations and benchmarking of our proposed EfficientPS architecture on four standard urban scene understanding datasets, including Cityscapes (Cordts et al 2016), Mapillary Vistas (Neuhold et al 2017), KITTI (Geiger et al 2013) and the Indian Driving Dataset (IDD) (Varma et al 2019).

Our proposed EfficientPS, with a PQ score of 66.4, is ranked first for panoptic segmentation on the Cityscapes benchmark leaderboard without training on coarse annotations or using model ensembles. Additionally, EfficientPS is also ranked second for the semantic segmentation task as well as the instance segmentation task on the Cityscapes benchmark, with an mIoU score of 84.2 and an AP of 39.1, respectively. On the Mapillary Vistas dataset, our single EfficientPS model achieves a PQ score of 40.5 on the validation set, thereby outperforming all the existing methods. Similarly, EfficientPS consistently outperforms existing panoptic segmentation models on both the KITTI and IDD datasets by a large margin. More importantly, our EfficientPS architecture not only sets the new state-of-the-art on all the four panoptic segmentation benchmarks, but it is also the most computationally efficient, consuming the least amount of parameters and having the fastest inference time compared to previous state-of-the-art methods. Finally, we present detailed ablation studies that demonstrate the improvement in performance due to each of the architectural contributions that we make in this work. Moreover, we also make implementations of our proposed EfficientPS architecture, training code, and pre-trained models publicly available.

In summary, the following are the main contributions of this work:

1. The novel EfficientPS architecture for panoptic segmentation that incorporates our proposed efficient shared backbone with our new feature-aligning semantic head, a new variant of Mask R-CNN as the instance head, and our novel adaptive panoptic fusion module.
2. A new panoptic backbone consisting of an augmented EfficientNet architecture and our proposed 2-way FPN that both encodes and aggregates semantically rich multi-scale features in a bidirectional manner.
3. A novel semantic head that captures fine features and long-range context efficiently, as well as correlates them before fusion for better object boundary refinement.
4. A new panoptic fusion module that dynamically adapts the fusion of logits from the semantic and instance heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to compute the panoptic prediction.
5. The KITTI panoptic segmentation dataset that provides panoptic groundtruth annotations for images from the challenging KITTI benchmark dataset.
6. Benchmarking of existing state-of-the-art panoptic segmentation architectures on the newly introduced KITTI panoptic segmentation dataset and the IDD dataset.
7. Comprehensive benchmarking of our proposed EfficientPS architecture on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets.
8. Extensive ablation studies that compare the performance of various architectural components that we propose in this work with their counterparts from state-of-the-art architectures.
9. Implementation of our proposed architecture and a live demo on all the four datasets is publicly available at http://rl.uni-freiburg.de/research/panoptic.

2 Related Works

Panoptic segmentation is a recently introduced scene understanding problem (Kirillov et al 2019) that unifies the tasks of semantic segmentation and instance segmentation. There are numerous methods that have been proposed for each of these sub-tasks, however only a handful of approaches have been introduced to tackle this coherent scene understanding problem of panoptic segmentation. Most works in this domain are largely built upon advances made in semantic segmentation and instance segmentation, therefore we first review recent methods that have been proposed for these closely related sub-tasks, followed by state-of-the-art approaches that have been introduced for panoptic segmentation.

Semantic Segmentation: There have been significant advances in semantic segmentation approaches in recent years. In this section, we briefly review methods that use a single monocular image to tackle this task. Approaches from the past decade typically employ random decision forests to address this task. Shotton et al (2008) use randomized decision forests on local patches for classification, whereas Plath et al (2009) fuse local and global features along with Conditional Random Fields (CRFs) for segmentation. As opposed to leveraging appearance-based features, Brostow et al (2008) use cues from motion with random forests. Sturgess et al (2009) further combine appearance-based features with structure-from-motion features in addition to CRFs to improve the performance. However, 3D features extracted from dense depth maps (Zhang et al 2010) have been demonstrated to be more effective than the combined features. Kontschieder et al (2011) exploit the inherent topological distribution of object classes to improve the performance, whereas Krahenbuhl and Koltun (2011) improve segmentation by pairing CRFs with Gaussian edge potentials. Nevertheless, all these methods employ handcrafted features that do not encapsulate all the high-level and low-level relations, thereby limiting their representational ability.

The significant improvement in the performance of classification tasks brought about by Convolutional Neural Network (CNN) based approaches motivated researchers to explore such methods for semantic segmentation. Initially, these approaches relied on patch-wise training that severely limited their ability to accurately segment object boundaries. However, they still perform substantially better than previous handcrafted methods. The advent of end-to-end learning approaches for semantic segmentation, led by the introduction of Fully Convolutional Networks (FCNs) (Long et al 2015), revolutionized this field, and FCNs still form the base upon which state-of-the-art architectures are built today. FCN is an encoder-decoder architecture where the encoder is based on the VGG-16 (Simonyan and Zisserman 2014) architecture with inner-product layers replaced with convolutions, and the decoder consists of convolution and transposed convolution layers. The subsequently proposed SegNet (Badrinarayanan et al 2015) architecture introduced unpooling layers for upsampling as a replacement for transposed convolutions, whereas ParseNet (Liu et al 2015) models global context directly as opposed to only relying on the largest receptive field of the network.

The PSPNet (Zhao et al 2017) architecture emphasizes the importance of multi-scale features and proposes pyramid pooling to learn feature representations at different scales. Yu and Koltun (2015) introduce atrous convolutions to further exploit multi-scale features in semantic segmentation networks. Subsequently, Valada et al (2017) propose multi-scale residual units with parallel atrous convolutions with different dilation rates to efficiently learn multi-scale features throughout the network without increasing the number of parameters. Chen et al (2017b) propose the Atrous Spatial Pyramid Pooling (ASPP) module that concatenates feature maps from multiple parallel atrous convolutions with different dilation rates and a global pooling layer. ASPP substantially improves the performance of semantic segmentation networks by aggregating multi-scale features and capturing long-range context, however it significantly increases the computational complexity. Therefore, Chen et al (2018a) propose Dense Prediction Cells (DPC) and Valada et al (2019) propose Efficient Atrous Spatial Pyramid Pooling (eASPP) that yield better semantic segmentation performance than ASPP while being 10-times more efficient. Li et al (2019a) suggest that global feature aggregation often leads to large pattern features and over-smooth regions of small patterns, which results in sub-optimal performance. In order to alleviate this problem, the authors propose the use of a global aggregation module coupled with a local distribution module, which results in features that are balanced in small and large pattern regions. There are also several works that have been proposed to improve the upsampling in decoders of encoder-decoder architectures. In Chen et al (2018b), the authors introduce a novel decoder module for object boundary refinement. Tian et al (2019) propose data-dependent upsampling which accounts for the redundancy in the label space as opposed to simple bilinear upsampling.

Instance Segmentation: Some of the initial approaches employ CRFs (He and Gould 2014b) and minimize integer quadratic relations (Tighe et al 2014). Methods that exploit CNNs with Markov random fields (Zhang et al 2016) and recurrent neural networks (Romera-Paredes and Torr 2016; Ren and Zemel 2017) have also been explored. In this section, we primarily discuss CNN-based approaches for instance segmentation. These methods can be categorized into proposal free and proposal based methods.

Methods in the proposal free category often obtain instance masks from a resulting transformation. Bai and Urtasun (2017) use CNNs to produce an energy map of the image and then perform a cut at a single energy level to obtain the corresponding object instances. Liu et al (2017) employ a sequence of CNNs to solve sub-grouping problems in order to compose object instances. Some approaches exploit FCNs which either use local coherence for estimating instances (Dai et al 2016) or encode the direction of each pixel to its corresponding instance centre (Uhrig et al 2016). The recent approach SSAP (Gao et al 2019) uses pixel-pair affinity pyramids for computing the probability that two pixels hierarchically belong to the same instance. However, they achieve a lower performance than proposal based methods, which has led to a decline in their popularity.

In proposal based methods, Hariharan et al (2014) propose a method that uses Multiscale Combinatorial Grouping (Arbelaez et al 2014) proposals as input to CNNs for feature extraction and then employ an SVM classifier for region classification. Subsequently, Hariharan et al (2015) propose hypercolumn pixel descriptors for simultaneous detection and segmentation. In recent works, DeepMask (Pinheiro et al 2015) uses a patch of an image as input to a CNN which yields a class-agnostic segmentation mask and the likelihood of the patch containing an object. FCIS (Li et al 2017) employs position-sensitive score maps obtained from classification of pixels based on their relative positions to perform segmentation and detection jointly. Dai et al (2016) propose an approach for instance segmentation that uses three networks for distinguishing instances, estimating masks and categorizing objects. Mask R-CNN (He et al 2017) is one of the most popular and widely used approaches at present. It extends Faster R-CNN for instance segmentation by adding an object segmentation branch parallel to the branch that performs bounding box regression and classification. More recently, Liu et al (2018) propose an approach to improve Mask R-CNN by adding bottom-up path augmentation that enhances the object localization ability in earlier layers of the network. Subsequently, BshapeNet (Kang and Kim 2018) extends Faster R-CNN by adding a bounding box mask branch that provides additional information on object positions and coordinates to improve the performance of object detection and instance segmentation.

Panoptic Segmentation: Kirillov et al (2019) revived the unification of semantic segmentation and instance segmentation tasks by introducing panoptic segmentation. They propose a baseline model that combines the output of PSPNet (Zhao et al 2017) and Mask R-CNN (He et al 2017) with a simple post-processing step in which each model processes the inputs independently. The methods that address this task of panoptic segmentation can be broadly classified into two categories: top-down or proposal based methods and bottom-up or proposal free methods. Most of the current state-of-the-art methods adopt the top-down approach. de Geus et al (2018) propose joint training with a shared backbone that branches into Mask R-CNN for instance segmentation and an augmented Pyramid Pooling module for semantic segmentation. Subsequently, Li et al (2019b) introduce the Attention-guided Unified Network that uses a proposal attention module and a mask attention module for better segmentation of 'stuff' classes. All the aforementioned methods use a fusion technique similar to Kirillov et al (2019) for the fusion of 'stuff' and 'thing' predictions.

In top-down panoptic segmentation architectures, the predictions of both heads have an inherent overlap between them, resulting in the mask overlapping problem. In order to mitigate this problem, Li et al (2018b) propose a weakly supervised model where 'thing' classes are weakly supervised by bounding boxes and 'stuff' classes are supervised with image-level tags. Whereas, Liu et al (2019) address the problem by introducing the spatial ranking module, and Li et al (2018a) propose a method that learns a binary mask to explicitly constrain the output distributions of 'stuff' and 'thing' classes. Subsequently, UPSNet (Xiong et al 2019) introduces a parameter-free panoptic head to address the problem of overlapping instances and also predicts an extra unknown class. More recently, AdaptIS (Sofiiuk et al 2019) uses point proposals to produce instance masks and jointly trains with a standard semantic segmentation pipeline to perform panoptic segmentation. In contrast, Porzi et al (2019) propose an architecture for panoptic segmentation that effectively integrates contextual information from a lightweight DeepLab-inspired module with multi-scale features from an FPN.

Compared to the popular proposal based methods, there are only a handful of proposal free methods that have been proposed. DeeperLab (Yang et al 2019) was the first bottom-up approach that was introduced, and it employs an encoder-decoder topology to pair object centres for class-agnostic instance segmentation with DeepLab semantic segmentation. Cheng et al (2019) further build on DeeperLab by introducing a dual-ASPP and dual-decoder structure for each sub-task branch. SSAP (Gao et al 2019) proposes to group pixels based on a pixel-pair affinity pyramid and incorporates an efficient graph method to generate instances while jointly learning semantic labeling.

In this work, we adopt a top-down approach due to its exceptional ability to handle large scale variations of instances, which is a critical requirement for segmenting 'thing' classes. We present the novel EfficientPS architecture that incorporates our proposed efficient backbone with our 2-way FPN for learning rich multi-scale features in a bidirectional manner, coupled with a new semantic head that captures fine features and long-range context effectively, and a variant of Mask R-CNN augmented with separable convolutions as the instance head. We propose a novel panoptic fusion module to dynamically adapt the fusion of logits from the semantic and instance heads to yield the panoptic segmentation output. Our architecture achieves state-of-the-art results on benchmark datasets while being the most efficient and fast panoptic segmentation architecture.

3 EfficientPS Architecture

In this section, we first give a brief overview of our proposed EfficientPS architecture and then detail each of its constituting components. Our network follows the top-down layout as shown in Figure 2. It consists of a shared backbone with a 2-way Feature Pyramid Network (FPN), followed by task-specific semantic segmentation and instance segmentation heads. We build upon the EfficientNet (Tan and Le 2019) architecture for the encoder of our shared backbone (depicted in red). It consists of mobile inverted bottleneck (Xie et al 2017) units and employs compound scaling to uniformly scale all the dimensions of the encoder network. This enables our encoder to have a rich representational capacity with fewer parameters in comparison to other encoders or backbones of similar discriminative capability.


Figure 2 Illustration of our proposed EfficientPS architecture consisting of a shared backbone with our 2-way FPN and parallel semantic and instance segmentation heads, followed by our panoptic fusion module. The shared backbone is built upon the EfficientNet architecture and our new 2-way FPN that enables bidirectional flow of information. The instance segmentation head is based on a modified Mask R-CNN topology and we incorporate our proposed semantic segmentation head. Finally, the outputs of both heads are fused in our panoptic fusion module to yield the panoptic segmentation output.

As opposed to employing the conventional FPN (Lin et al 2017) that is commonly used in other panoptic segmentation architectures (Kirillov et al 2019; Li et al 2018a; Porzi et al 2019), we incorporate our proposed 2-way FPN that fuses multi-scale features more effectively than its counterparts. This can be attributed to the fact that the information flow in our 2-way FPN is not bounded to only one direction, as depicted by the purple, blue and green blocks in Figure 2. Subsequently, after the 2-way FPN, we employ two heads in parallel for semantic segmentation (depicted in yellow) and instance segmentation (depicted in gray and orange) respectively. We use a variant of the Mask R-CNN (He et al 2017) architecture as the instance head and we incorporate our novel semantic segmentation head consisting of dense prediction cells (Chen et al 2018a) and residual pyramids. The semantic head consists of three different modules for capturing fine features, long-range contextual features, and correlating the distinctly captured features for improving object boundary refinement. Finally, we employ our proposed panoptic fusion module to fuse the outputs of the semantic and instance heads to yield the panoptic segmentation output.

3.1 Network Backbone

The backbone of our network consists of an encoder with our proposed 2-way FPN. The encoder is the basic building block of any segmentation network and a strong encoder is essential for high representational capacity. In this work, we seek to find a good trade-off between the number of parameters and computational complexity, and the representational capacity of the network. EfficientNets (Tan and Le 2019), which are a recent family of architectures, have been shown to significantly outperform other networks in classification tasks while having fewer parameters and FLOPs. They employ compound scaling to uniformly scale the width, depth and resolution of the network efficiently. Therefore, we choose to build upon this scaled architecture with width, depth and resolution coefficients of 1.6, 2.2 and 456, commonly known as the EfficientNet-B5 model. This can be easily replaced with any of the EfficientNet models based on the capacity of the resources that are available and the computational budget.
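To make the scaling concrete, the following is a minimal sketch of how a compound-scaled configuration such as B5 can be derived from the B0 baseline. The per-stage channel and repeat values and the rounding rules follow the publicly known EfficientNet reference scheme and are assumptions here rather than part of this paper.

```python
import math

# Baseline (B0) stage configuration: (output channels, number of repeats).
# These values are the commonly published EfficientNet-B0 settings (assumption).
B0_STAGES = [(16, 1), (24, 2), (40, 2), (80, 3), (112, 3), (192, 4), (320, 1)]

def round_filters(filters: int, width_coeff: float, divisor: int = 8) -> int:
    """Scale the channel count by the width coefficient and round to a multiple of `divisor`."""
    filters *= width_coeff
    new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
    if new_filters < 0.9 * filters:  # avoid rounding down by more than 10%
        new_filters += divisor
    return int(new_filters)

def round_repeats(repeats: int, depth_coeff: float) -> int:
    """Scale the number of block repeats by the depth coefficient."""
    return int(math.ceil(depth_coeff * repeats))

# EfficientNet-B5 coefficients used in this work: width 1.6, depth 2.2, resolution 456.
b5_stages = [(round_filters(c, 1.6), round_repeats(r, 2.2)) for c, r in B0_STAGES]
print(b5_stages)  # e.g. the first stage becomes (24, 3)
```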

In order to adapt EfficientNet to our task, we first remove the classification head as well as the Squeeze-and-Excitation (SE) (Hu et al 2018) connections in the network. We find that the explicit modelling of interdependencies between channels of the convolutional feature maps that is enabled by the SE connections tends to suppress localization of features in favour of contextual elements. This property is desirable in classification networks, however both localization and contextual features are equally important for segmentation tasks, therefore we do not add any SE connections in our backbone. Second, we replace all the batch normalization (Ioffe and Szegedy 2015) layers with synchronized Inplace Activated Batch Normalization (iABN sync) (Rota Bulo et al 2018). This enables synchronization across different GPUs, which in turn yields a better estimate of gradients while performing multi-GPU training, and the in-place operations free up additional GPU memory. We analyze the performance of our modified EfficientNet in comparison to other encoders commonly used in state-of-the-art architectures in the ablation study presented in Section 4.4.2.
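As a rough illustration of the normalization swap, the sketch below recursively replaces every BatchNorm2d layer in a backbone with a drop-in replacement built by a user-supplied factory. The InPlaceABNSync layer from the inplace_abn package is the intended target, but its availability is an assumption here, so a readily available stand-in is shown in the usage comment.

```python
import torch.nn as nn

def replace_batchnorm(module: nn.Module, norm_factory) -> nn.Module:
    """Recursively replace nn.BatchNorm2d layers with the layer built by `norm_factory`.

    `norm_factory(num_features)` should return a drop-in replacement, e.g.
    lambda c: InPlaceABNSync(c) from the inplace_abn package (assumed),
    or nn.SyncBatchNorm(c) as an easily available stand-in.
    """
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, norm_factory(child.num_features))
        else:
            replace_batchnorm(child, norm_factory)
    return module

# Example usage with a stand-in normalization layer:
# backbone = replace_batchnorm(backbone, lambda c: nn.SyncBatchNorm(c))
```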

Our EfficientNet encoder comprises nine blocks, as shown in Figure 2 (in red). We refer to each block in the figure as block 1 to block 9, in the left to right manner. The outputs of blocks 2, 3, 5 and 9 correspond to downsampling factors of ×4, ×8, ×16 and ×32 respectively. The outputs from these blocks with downsampling are also the inputs to our 2-way FPN. The conventional FPN used in other panoptic segmentation networks aims to address the problem of multi-scale feature fusion by aggregating features of different resolutions in a top-down manner. This is performed by first employing a 1×1 convolution to reduce or increase the number of channels of the different encoder output resolutions to a predefined number, typically 256. Then the lower resolution features are upsampled to a higher resolution and are subsequently added together. For example, ×32 resolution encoder output features will be resized to the ×16 resolution and added to the ×16 resolution encoder output features. Finally, a 3×3 convolution is used at each scale to further learn fused features, which then yields the P4, P8, P16 and P32 outputs. This FPN topology has a limited unidirectional flow of information, resulting in an ineffective fusion of multi-scale features. Therefore, we propose to mitigate this problem by adding a second branch that aggregates multi-scale features in a bottom-up fashion to enable bidirectional flow of information.

Our proposed 2-way FPN shown in Figure 2 consists of two parallel branches. Each branch consists of a 1×1 convolution with 256 output filters at each scale for channel reduction. The top-down branch, shown in blue, follows the aggregation scheme of a conventional FPN from right to left, whereas the bottom-up branch, shown in purple, downsamples the higher resolution features to the next lower resolution from left to right and subsequently adds them to the next lower resolution encoder output features. For example, ×4 resolution features will be resized to the ×8 resolution and added to the ×8 resolution encoder output features.

Figure 3 Topologies of various architectural components in our proposed semantic head and instance head of our EfficientPS architecture: (a) LSFE module, (b) RPN module, (c) MC module, and (d) DPC module.

Then in the next stage, the outputs from the bottom-up and top-down branches at each resolution are correspondingly summed together and passed through a 3×3 separable convolution with 256 output channels to obtain the P4, P8, P16 and P32 outputs respectively. We employ separable convolutions as opposed to standard convolutions in an effort to keep the parameter consumption low. We evaluate the performance of our proposed 2-way FPN in comparison to the conventional FPN in the ablation study presented in Section 4.4.3.
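The following is a minimal PyTorch sketch of this bidirectional aggregation. The module names, the use of plain BatchNorm2d in place of iABN sync, and bilinear resizing for the bottom-up downsampling step are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableConv(nn.Module):
    """3x3 depthwise-separable convolution followed by normalization and Leaky ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.norm = nn.BatchNorm2d(out_ch)  # stand-in for iABN sync
        self.act = nn.LeakyReLU(0.01, inplace=True)

    def forward(self, x):
        return self.act(self.norm(self.pointwise(self.depthwise(x))))

class TwoWayFPN(nn.Module):
    """Fuses the x4, x8, x16 and x32 encoder outputs with a top-down and a bottom-up branch."""
    def __init__(self, in_channels, out_ch=256):
        super().__init__()
        self.lateral_td = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.lateral_bu = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.out_convs = nn.ModuleList(SeparableConv(out_ch, out_ch) for _ in in_channels)

    def forward(self, feats):  # feats: [C4, C8, C16, C32], finest resolution first
        # Top-down branch: start from the coarsest feature map and upsample step by step.
        td = [self.lateral_td[i](f) for i, f in enumerate(feats)]
        for i in range(len(td) - 2, -1, -1):
            td[i] = td[i] + F.interpolate(td[i + 1], size=td[i].shape[-2:], mode='nearest')
        # Bottom-up branch: start from the finest feature map and downsample step by step.
        bu = [self.lateral_bu[i](f) for i, f in enumerate(feats)]
        for i in range(1, len(bu)):
            bu[i] = bu[i] + F.interpolate(bu[i - 1], size=bu[i].shape[-2:],
                                          mode='bilinear', align_corners=False)
        # Sum both branches at each scale and refine with a separable convolution.
        return [conv(t + b) for conv, t, b in zip(self.out_convs, td, bu)]

# Example usage: fpn = TwoWayFPN([40, 64, 176, 2048]); p4, p8, p16, p32 = fpn([c4, c8, c16, c32])
```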

3.2 Semantic Segmentation Head

Our proposed semantic segmentation head consists of three components, each aimed at targeting one of the critical requirements. First, at large-scale, the network should have the ability to capture fine features efficiently. In order to enable this, we employ our Large Scale Feature Extractor (LSFE) module that has two 3×3 separable convolutions with 128 output filters, each followed by an iABN sync and a Leaky ReLU activation function. The first 3×3 separable convolution reduces the number of filters to 128 and the second 3×3 separable convolution further learns deeper features.
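A minimal sketch of the LSFE module is shown below; BatchNorm2d stands in for iABN sync, and the class and function names are assumptions.

```python
import torch.nn as nn

def separable_block(in_ch, out_ch):
    """3x3 separable convolution followed by normalization (stand-in for iABN sync) and Leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),  # depthwise
        nn.Conv2d(in_ch, out_ch, 1, bias=False),                          # pointwise
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.01, inplace=True),
    )

class LSFE(nn.Module):
    """Large Scale Feature Extractor: two 3x3 separable convolutions with 128 filters each."""
    def __init__(self, in_ch=256, mid_ch=128):
        super().__init__()
        self.conv1 = separable_block(in_ch, mid_ch)   # reduces the number of channels to 128
        self.conv2 = separable_block(mid_ch, mid_ch)  # learns deeper features

    def forward(self, x):
        return self.conv2(self.conv1(x))
```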

The second requirement is that at small-scale, the network should be able to capture long-range context. Modules inspired by Atrous Spatial Pyramid Pooling (ASPP) (Chen et al 2017a) that are widely used in state-of-the-art semantic segmentation architectures have been demonstrated to be effective for this purpose. Dense Prediction Cells (DPC) (Chen et al 2018a) and Efficient Atrous Spatial Pyramid Pooling (eASPP) (Valada et al 2019) are two variants of ASPP that are significantly more efficient and also yield a better performance. We find that DPC demonstrates a better performance with a minor increase in the number of parameters compared to eASPP. Therefore, we employ a modified DPC module in our semantic head, as shown in Figure 2. We augment the original DPC topology by replacing batch normalization layers with iABN sync and ReLUs with Leaky ReLUs. The DPC module consists of a 3×3 separable convolution with 256 output channels having a dilation rate of (1,6) and extends out to five parallel branches. Three of the branches each consist of a 3×3 dilated separable convolution with 256 output channels, where the dilation rates are (1,1), (6,21) and (18,15) respectively. The fourth branch takes the output of the dilated separable convolution with a dilation rate of (18,15) as input and passes it through another 3×3 dilated separable convolution with 256 output channels and a dilation rate of (6,3), while the remaining branch is an identity mapping of the initial 3×3 separable convolution output. The outputs from all these parallel branches are then concatenated to yield a tensor with 1280 channels. This tensor is then finally passed through a 1×1 convolution with 256 output channels and forms the output of the DPC module. Note that each of the convolutions in the DPC module is followed by an iABN sync and a Leaky ReLU activation function.
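The modified DPC topology described above can be sketched as follows; BatchNorm2d again stands in for iABN sync and the class structure is an assumption for illustration.

```python
import torch
import torch.nn as nn

def dilated_sep_block(in_ch, out_ch, dilation):
    """3x3 separable convolution with the given (height, width) dilation, normalization and Leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),  # stand-in for iABN sync
        nn.LeakyReLU(0.01, inplace=True),
    )

class DPC(nn.Module):
    """Modified Dense Prediction Cell with the dilation rates used in the semantic head."""
    def __init__(self, in_ch=256, ch=256):
        super().__init__()
        self.initial = dilated_sep_block(in_ch, ch, (1, 6))
        self.branch_a = dilated_sep_block(ch, ch, (1, 1))
        self.branch_b = dilated_sep_block(ch, ch, (6, 21))
        self.branch_c = dilated_sep_block(ch, ch, (18, 15))
        self.branch_d = dilated_sep_block(ch, ch, (6, 3))  # operates on the (18, 15) branch output
        self.project = nn.Sequential(
            nn.Conv2d(5 * ch, ch, 1, bias=False),
            nn.BatchNorm2d(ch),
            nn.LeakyReLU(0.01, inplace=True),
        )

    def forward(self, x):
        x0 = self.initial(x)
        xc = self.branch_c(x0)
        out = torch.cat([x0, self.branch_a(x0), self.branch_b(x0), xc, self.branch_d(xc)], dim=1)
        return self.project(out)  # 1280 -> 256 channels
```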

The third and final requirement for the semantic head is that it should be able to mitigate the mismatch between large-scale and small-scale features while performing feature aggregation. To this end, we employ our Mismatch Correction (MC) module that correlates the small-scale features with respect to the large-scale features. It consists of cascaded 3×3 separable convolutions with 128 output channels, followed by an iABN sync with Leaky ReLU, and a bilinear upsampling layer that upsamples the feature maps by a factor of 2. Figures 3 (a), 3 (c) and 3 (d) illustrate the topologies of these main components of our semantic head.
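A possible reading of the MC module in code, following the same separable-convolution pattern as the LSFE sketch (again with BatchNorm2d standing in for iABN sync):

```python
import torch.nn as nn

class MC(nn.Module):
    """Mismatch Correction module: cascaded 3x3 separable convolutions followed by x2 bilinear upsampling."""
    def __init__(self, ch=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False),  # depthwise
            nn.Conv2d(ch, ch, 1, bias=False),                        # pointwise
            nn.BatchNorm2d(ch),
            nn.LeakyReLU(0.01, inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False),
            nn.Conv2d(ch, ch, 1, bias=False),
            nn.BatchNorm2d(ch),
            nn.LeakyReLU(0.01, inplace=True),
        )
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, x):
        return self.upsample(self.convs(x))
```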

The four different scaled outputs of our 2-way FPN, namely P4, P8, P16 and P32, are the inputs to our semantic head. The small-scale inputs, P32 and P16 with downsampling factors of ×32 and ×16, are each fed into two parallel DPC modules, while the large-scale inputs, P8 and P4 with downsampling factors of ×8 and ×4, are each passed through two parallel LSFE modules. Subsequently, the outputs from each of these parallel DPC and LSFE modules are augmented with feature alignment connections and each of them is upsampled by a factor of 4. These upsampled feature maps are then concatenated to yield a tensor with 512 channels, which is then input to a 1×1 convolution with N_'stuff'+'thing' output filters. This tensor is then finally upsampled by a factor of 2 and passed through a softmax layer to yield the semantic logits having the same resolution as the input image. Now, the feature alignment connections from the DPC and LSFE modules interconnect each of these outputs by element-wise summation, as shown in Figure 2. We add our MC modules in the interconnections between the second DPC and LSFE, as well as between both the LSFE connections. These correlation connections aggregate contextual information from small-scale features and characteristic large-scale features for better object boundary refinement. We use the weighted per-pixel log-loss (Bulo et al 2017) for training, which is given by

L_{pp}(\Theta) = -\sum_{ij} w_{ij} \log p_{ij}(p^{*}_{ij}), (1)

where p*_ij is the groundtruth for a given image, p_ij is the predicted probability for the pixel (i, j) being assigned class c \in p*, w_ij = 4/(W \cdot H) if pixel (i, j) belongs to the 25% worst predictions, and w_ij = 0 otherwise. W and H are the width and height of the given input image. The overall semantic head loss is given by

L_{semantic}(\Theta) = \frac{1}{n} \sum L_{pp}, (2)

where n is the batch size. We present an in-depth analysis of our proposed semantic head in comparison to other semantic heads commonly used in state-of-the-art architectures in Section 4.4.4.
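A minimal sketch of this bootstrapped per-pixel loss (Eqs. (1)-(2)) in PyTorch could look as follows; the function name and the use of cross_entropy to obtain the per-pixel negative log-probabilities are assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_per_pixel_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Weighted per-pixel log-loss: only the 25% worst pixels per image contribute,
    each weighted by 4 / (W * H), and the result is averaged over the batch.

    logits: (B, C, H, W) semantic logits, target: (B, H, W) groundtruth class indices.
    """
    b, _, h, w = logits.shape
    # Per-pixel negative log-likelihood of the groundtruth class.
    nll = F.cross_entropy(logits, target, reduction='none').view(b, -1)  # (B, H*W)
    num_worst = (h * w) // 4
    worst, _ = torch.topk(nll, num_worst, dim=1)      # hardest 25% of the pixels
    per_image = (4.0 / (w * h)) * worst.sum(dim=1)    # Eq. (1)
    return per_image.mean()                           # Eq. (2), averaged over the batch
```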

3.3 Instance Segmentation Head

The instance segmentation head of our EfficientPS network, shown in Figure 2, has a topology similar to Mask R-CNN (He et al 2017) with certain modifications. More specifically, we replace all the standard convolutions, batch normalization layers and ReLU activations with separable convolutions, iABN sync and Leaky ReLU respectively. Similar to the rest of our architecture, we use separable convolutions instead of standard convolutions to reduce the number of parameters consumed by the network. This enables us to conserve 2.09 M parameters in comparison to the conventional Mask R-CNN.

Mask R-CNN consists of two stages. In the first stage, the Region Proposal Network (RPN) module shown in Figure 3 (b) employs a fully convolutional network to output a set of rectangular object proposals and an objectness score for the given input FPN level. The network architecture incorporates a 3×3 separable convolution with 256 output channels, an iABN sync and Leaky ReLU activation, followed by two parallel 1×1 convolutions with 4k and k output filters respectively. Here, k is the maximum number of object proposals. The k proposals are parameterized relative to k reference bounding boxes, which are called anchors. The scale and aspect ratio define a particular anchor, and they vary depending on the dataset being used. As the generated proposals can potentially overlap, an additional filtering step of Non-Max Suppression (NMS) is employed. Essentially, in case of overlaps, the proposal with the highest objectness score is retained and the rest are discarded. RPN first computes object proposals of the form (p_u, p_v, p_w, p_h) and the objectness score \sigma(p_s) that transform an anchor a of the form (u, v, w, h) to (u + p_u, v + p_v) and (w \cdot e^{p_w}, h \cdot e^{p_h}), where (u, v) is the position of anchor a in the image coordinate system, (w, h) are its dimensions, and \sigma(\cdot) is the sigmoid function. This is followed by NMS to yield the final candidate bounding box proposals to be used in the next stage.

Subsequently, ROI align (He et al 2017) uses the object proposals to extract features from the FPN encodings by directly pooling features from the nth channel with a 14×14 spatial resolution bounded within a bounding box proposal. Here,

n = \max(1, \min(4, \lfloor 3 + \log_2(\sqrt{wh}/224) \rfloor)), (3)

where w and h are the width and height of the bounding box proposal. The features that are extracted then serve as input to the bounding box regression, object classification and mask segmentation networks. The RPN network operates on each of the output scales of the FPN sequentially, thereby accumulating candidates from different scales, and ROI align uses the accumulated list for feature extraction from all the scales. The bounding box regression and object classification networks consist of two shared consecutive fully-connected layers with 1024 channels, iABN sync and Leaky ReLU. Subsequently, the feature maps are passed through a fully-connected layer at the end for each task, with 4·N_'thing' outputs and N_'thing'+1 outputs respectively. The object classification logits yield a probability distribution over the classes and void from a softmax layer, whereas the bounding box regression network encodes class-specific correction factors. For a given object proposal and a class c, a new bounding box of the form (u + c_u, v + c_v, w \cdot e^{c_w}, h \cdot e^{c_h}) is computed, where (c_u, c_v, c_w, c_h) are the class-specific correction factors, and (u, v) and (w, h) are the position and dimensions of the bounding box proposal respectively.
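Eq. (3) maps each box proposal to one of the four FPN scales; a small sketch, with the function name being an assumption:

```python
import math

def fpn_level(box_w: float, box_h: float) -> int:
    """Select the FPN level (1=P4 ... 4=P32) for a proposal of size box_w x box_h, following Eq. (3)."""
    n = math.floor(3 + math.log2(math.sqrt(box_w * box_h) / 224))
    return max(1, min(4, n))

# Example: a 224x224 proposal maps to level 3, while very small boxes fall back to level 1.
assert fpn_level(224, 224) == 3
assert fpn_level(32, 32) == 1
```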

The mask segmentation network employs four consecutive 3×3 separable convolutions with 256 output channels, each followed by an iABN sync and Leaky ReLU. Subsequently, the resulting feature maps are passed through a 2×2 transposed convolution with 256 output channels and an output stride of 2, followed by an iABN sync and Leaky ReLU activation function. Finally, a 1×1 convolution with N_'thing' output channels is employed that yields 28×28 logits for each class. The resulting logits give mask foreground probabilities for the candidate bounding box proposal using a sigmoid function. This is then fused with the semantic logits in our proposed panoptic fusion module described in Section 3.4.

In order to train the instance segmentation head, we adopt the loss functions proposed in Mask R-CNN, i.e., two loss functions for the first stage, the objectness score loss and the object proposal loss, and three loss functions for the second stage, the classification loss, the bounding box loss and the mask segmentation loss. We take a set N_s of randomly sampled positive and negative matches such that |N_s| ≤ 256. The objectness score loss L_os, defined as the log loss over a given N_s, is given by

L_{os}(\Theta) = -\frac{1}{|N_s|} \sum_{(p^*_{os},\, p_{os}) \in N_s} \left[ p^*_{os} \cdot \log p_{os} + (1 - p^*_{os}) \cdot \log(1 - p_{os}) \right], \quad (4)

where p_os is the output of the objectness score branch of the RPN and p*_os is the groundtruth label, which is 1 if the anchor is positive and 0 if the anchor is negative. We use the same strategy as Mask R-CNN for defining positive and negative matches. For a given anchor a, if the groundtruth box b* has the largest Intersection over Union (IoU) or IoU(b*, a) > T_H, then the corresponding prediction b is a positive match, and b is a negative match when IoU(b*, a) < T_L. The thresholds T_H and T_L are pre-defined such that T_H > T_L.
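The objectness score loss of Equation (4) is a standard binary cross-entropy over the sampled matches; a minimal PyTorch sketch (our illustration, with assumed tensor shapes) is given below.

```python
import torch
import torch.nn.functional as F

def objectness_score_loss(p_os, p_os_star):
    """L_os of Eq. (4): binary cross-entropy averaged over the sampled set N_s.

    p_os:      (|N_s|,) predicted objectness probabilities sigma(p_s).
    p_os_star: (|N_s|,) groundtruth labels, 1 for positive and 0 for negative matches.
    """
    return F.binary_cross_entropy(p_os, p_os_star.float())
```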

The object proposal loss L_op is a regression loss that is defined only on positive matches and is given by

L_{op}(\Theta) = \frac{1}{|N_s|} \sum_{(t^*,\, t) \in N_p} \sum_{(i^*,\, i) \in (t^*,\, t)} L_1(i^*, i), \quad (5)

where L_1 is the smooth L1 norm, N_p is the subset of positive matches in N_s, and t^* = (t^*_x, t^*_y, t^*_w, t^*_h) and t = (t_x, t_y, t_w, t_h) are the parameterizations of b^* and b respectively. Here, b^* = (x^*, y^*, w^*, h^*) is the groundtruth box and b = (x, y, w, h) is the predicted bounding box, where x, y, w and h are the center coordinates, width and height of the predicted bounding box. Similarly, x^*, y^*, w^* and h^* denote the center coordinates, width and height of the groundtruth bounding box. The parameterizations (Girshick, 2015) are given by

t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}, \quad (6)

t^*_x = \frac{x^* - x_a}{w_a}, \quad t^*_y = \frac{y^* - y_a}{h_a}, \quad t^*_w = \log\frac{w^*}{w_a}, \quad t^*_h = \log\frac{h^*}{h_a}, \quad (7)

where x_a, y_a, w_a and h_a denote the center coordinates, width and height of the anchor a.
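For clarity, Equations (6) and (7) can be written as a single encoding function; the sketch below (our illustration in plain Python) maps a box and its anchor to the parameterization (t_x, t_y, t_w, t_h).

```python
import math

def encode_box(box, anchor):
    """Box parameterization of Eq. (6)/(7) relative to an anchor.

    box, anchor: tuples (x, y, w, h) of center coordinates, width and height.
    Returns (t_x, t_y, t_w, t_h).
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

# The same function encodes both the prediction b and the groundtruth b* against the
# anchor, yielding t and t* for the smooth L1 loss of Eq. (5).
```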

Similar to the objectness score loss L_os, the classification loss L_cls is defined for a set K_s of randomly sampled positive and negative matches such that |K_s| ≤ 512. The classification loss L_cls is given by

L_{cls}(\Theta) = -\frac{1}{|K_s|} \sum_{(Y^*,\, Y) \in K_s} \sum_{c=1}^{N_{thing}+1} Y^*_{oc} \cdot \log Y_{oc}, \quad (8)

where Y is the output of the classification branch, Y^* is the one-hot encoded groundtruth label, o is the observed class, and c is the correct classification for object o. For a given image, it is a positive match if IoU(b^*, b) > T_N and otherwise a negative match, where b^* is the groundtruth box and b is the object proposal from the first stage.


Figure 4 Illustration of our proposed Panoptic Fusion Module. Here, the MLA and MLB mask logits are fused as (σ(MLA) + σ(MLB)) ⊙ (MLA + MLB), where MLB is the output of the function f*, σ(·) is the sigmoid function and ⊙ is the Hadamard product. The f* function, for a given class prediction c (cyclist in this example), zeroes out the scores of the c channel of the semantic logits outside the corresponding bounding box.

The bounding box loss L_bbx is a regression loss that is defined only on positive matches and is expressed as

L_{bbx}(\Theta) = \frac{1}{|K_s|} \sum_{(T^*,\, T) \in K_p} \sum_{(i^*,\, i) \in (T^*,\, T)} L_1(i^*, i), \quad (9)

where L_1 is the smooth L1 norm (Girshick, 2015), K_p is the subset of positive matches in K_s, and T^* and T are the parameterizations of B^* and B respectively, analogous to Equation (6) and Equation (7), where B^* is the groundtruth box and B is the corresponding predicted bounding box.

Finally, the mask segmentation loss is also defined only for positive samples and is given by

L_{mask}(\Theta) = \frac{1}{|K_s|} \sum_{(P^*,\, P) \in K_s} L_p(P^*, P), \quad (10)

where L_p(P^*, P) is given as

L_p(P^*, P) = -\frac{1}{|T_p|} \sum_{(i,j) \in T_p} \left[ P^*_{ij} \cdot \log P_{ij} + (1 - P^*_{ij}) \cdot \log(1 - P_{ij}) \right], \quad (11)

where P is the predicted 28×28 binary mask for a class, with P_ij denoting the probability of the mask pixel (i, j), P^* is the 28×28 groundtruth binary mask for the class, and T_p is the set of non-void pixels in P^*.
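A compact sketch of the mask segmentation loss of Equations (10) and (11), written as a single PyTorch function under the assumption that the void pixels are supplied as a boolean mask, is shown below.

```python
import torch

def mask_segmentation_loss(pred, gt, void):
    """Per-pixel binary cross-entropy of Eq. (10)/(11), averaged over the non-void
    pixels T_p of each 28x28 groundtruth mask and over the sampled matches K_s.

    pred: (K, 28, 28) predicted foreground probabilities.
    gt:   (K, 28, 28) binary groundtruth masks.
    void: (K, 28, 28) boolean tensor that is True for void pixels.
    """
    eps = 1e-6
    bce = -(gt * torch.log(pred + eps) + (1 - gt) * torch.log(1 - pred + eps))
    valid = (~void).float()
    per_instance = (bce * valid).sum(dim=(1, 2)) / valid.sum(dim=(1, 2)).clamp(min=1)
    return per_instance.mean()
```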

All five losses are weighted equally and the total instance segmentation head loss is given by

L_{instance} = L_{os} + L_{op} + L_{cls} + L_{bbx} + L_{mask}. \quad (12)

Similar to Mask R-CNN, the gradients that are computed with respect to the losses L_cls, L_bbx and L_mask flow only through the network backbone and not through the region proposal network.

3.4 Panoptic Fusion Module

In order to obtain the panoptic segmentation output, we need to fuse the predictions of the semantic segmentation head and the instance segmentation head. However, fusing both these predictions is not a straightforward task due to the inherent overlap between them. Therefore, we propose a novel panoptic fusion module to tackle the aforementioned problem in an adaptive manner, in order to thoroughly exploit the predictions from both heads congruously. Figure 4 shows the topology of our panoptic fusion module. We obtain a set of object instances from the instance segmentation head of our network, where for each instance we have its corresponding class prediction, confidence score, bounding box and mask logits. First, we reduce the number of predicted object instances in two stages. We begin by discarding all object instances that have a confidence score lower than a given confidence threshold. We then resize, zero-pad and scale the 28×28 mask logits of each remaining object instance to the same resolution as the input image. Subsequently, we sort the class predictions, bounding boxes and mask logits according to their respective confidence scores. In the second stage, we check each sorted instance mask logit for overlap with the other object instances, and in case the overlap is higher than a given overlap threshold, we discard the other object instance.
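The two-stage filtering can be summarized by the following sketch (our own illustration; the mask logits are assumed to have already been resized to the input resolution, and the default thresholds shown here follow the values reported later in Section 4.2).

```python
import torch

def filter_instances(scores, mask_probs, conf_thresh=0.5, overlap_thresh=0.5):
    """Two-stage instance filtering: drop low-confidence instances, then process the
    remaining ones in descending order of confidence and discard any instance whose
    mask overlaps the already accepted instances by more than the overlap threshold.

    scores:     (K,) confidence scores.
    mask_probs: (K, H, W) sigmoid of the mask logits, resized to the input resolution.
    Returns the indices of the retained instances.
    """
    candidates = torch.nonzero(scores > conf_thresh, as_tuple=False).squeeze(1)
    order = candidates[torch.argsort(scores[candidates], descending=True)]
    occupied = torch.zeros(mask_probs.shape[1:], dtype=torch.bool)
    keep = []
    for idx in order.tolist():
        fg = mask_probs[idx] > 0.5
        area = fg.sum().item()
        if area == 0:
            continue
        if (fg & occupied).sum().item() / area > overlap_thresh:
            continue  # largely covered by higher-scoring instances, discard it
        keep.append(idx)
        occupied |= fg
    return keep
```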

After filtering the object instances, we have the class prediction, bounding box prediction and mask logit ML_A of each instance. We simultaneously obtain the semantic logits with N channels from the semantic head, where N is the sum of N_stuff and N_thing. We then compute a second mask logit ML_B for each instance, where we select the channel of the semantic logits according to its class prediction. We only keep the logit scores of the selected channel for the area within the instance bounding box, while we zero out the scores that are outside this region. In the end, we have two mask logits for each instance, one from the instance segmentation head and the other from the semantic segmentation head. We combine these two logits adaptively by computing the Hadamard product of the sum of the sigmoids of ML_A and ML_B, and the sum of ML_A and ML_B, to obtain the fused mask logits FL of the instances, expressed as

FL = (\sigma(ML_A) + \sigma(ML_B)) \odot (ML_A + ML_B), \quad (13)

where σ(·) is the sigmoid function and ⊙ is the Hadamard product. We then concatenate the fused mask logits of the object instances with the 'stuff' logits along the channel dimension to generate the intermediate panoptic logits. Subsequently, we apply the argmax operation along the channel dimension to obtain the intermediate panoptic prediction. In the final step, we take a zero-filled canvas and first copy the instance-specific 'thing' predictions from the intermediate panoptic prediction. We then fill the empty parts of the canvas with 'stuff' class predictions by copying them from the predictions of the semantic head, while ignoring classes that have an area smaller than a predefined threshold called the minimum stuff area. This then gives us the final panoptic segmentation output.
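A minimal sketch of this fusion step (our illustration, assuming the per-instance logits ML_A and ML_B have already been computed at the input resolution) is given below; the argmax over the concatenated logits yields the intermediate panoptic prediction.

```python
import torch

def fuse_panoptic_logits(ml_a, ml_b, stuff_logits):
    """Adaptive fusion of Eq. (13) followed by the intermediate argmax.

    ml_a:         (K, H, W) instance-head mask logits of the filtered instances.
    ml_b:         (K, H, W) semantic logits of each instance's class, zeroed outside its box.
    stuff_logits: (N_stuff, H, W) semantic logits of the 'stuff' classes.
    Returns an (H, W) intermediate panoptic prediction in which labels < N_stuff are
    'stuff' classes and labels >= N_stuff index the object instances.
    """
    # FL = (sigmoid(ML_A) + sigmoid(ML_B)) * (ML_A + ML_B), applied element-wise
    fused = (torch.sigmoid(ml_a) + torch.sigmoid(ml_b)) * (ml_a + ml_b)
    intermediate_logits = torch.cat([stuff_logits, fused], dim=0)
    return intermediate_logits.argmax(dim=0)
```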

We fuse the ML_A and ML_B instance logits in the aforementioned manner due to the fact that if both logits for a given pixel conform with each other, the final instance score increases proportionately to their agreement, and vice-versa. In case of agreement, the corresponding object instance will either dominate or be superseded by the other instances as well as the 'stuff' class scores, depending on whether both logits agree that the pixel belongs to the instance or not. Similarly, in case of disagreement, the score of the given object instance will reflect the extent of their difference. Simply put, the fused logit score is adaptively attenuated or amplified according to the consensus. We evaluate the performance of our proposed panoptic fusion module in comparison to other existing methods in the ablation study presented in Section 4.4.5.

4 Experimental Results

In this section, we first describe the standard evaluation metrics that we adopt for empirical evaluations, followed by brief descriptions of the datasets that we benchmark on in Section 4.1. We then present extensive quantitative comparisons and benchmarking results in Section 4.3, and detailed ablation studies on the various proposed architectural components in Section 4.4. Finally, we present qualitative comparisons and visualizations of panoptic segmentation on each of the datasets that we evaluate on in Section 4.5 and Section 4.6 respectively.

We use PyTorch (Paszke et al., 2019) for implementing all our architectures, and we trained our models on a system with an Intel Xeon 2.20GHz processor and NVIDIA TITAN X GPUs. We use the standard Panoptic Quality (PQ) metric (Kirillov et al., 2019) for quantifying the performance of our models. The PQ metric is computed as

PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}, \quad (14)

where TP, FP, FN and IoU are the true positives, false positives, false negatives and the intersection-over-union. The IoU is computed as IoU = TP/(TP + FP + FN). We also report the Segmentation Quality (SQ) and Recognition Quality (RQ) metrics computed as

SQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP|}, \quad (15)

RQ = \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}. \quad (16)

Following the standard benchmarking criteria for panoptic segmentation, we report PQ, SQ and RQ over all the classes in the dataset, and we also report them for the 'stuff' classes (PQ^St, SQ^St, RQ^St) and the 'thing' classes (PQ^Th, SQ^Th, RQ^Th). Additionally, for the sake of completeness, we report the Average Precision (AP) and the mean Intersection-over-Union (mIoU) for both 'stuff' and 'thing' classes, as well as the inference time and FLOPs for comparisons. The implementation of our proposed EfficientPS model and a live demo on various datasets is publicly available at https://rl.uni-freiburg.de/research/panoptic.
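The relationship between the three metrics can be seen in the following sketch (our own illustration), which computes PQ, SQ and RQ from the matched-segment statistics; note that PQ = SQ × RQ.

```python
def panoptic_quality(iou_sum, tp, fp, fn):
    """PQ, SQ and RQ of Eq. (14)-(16).

    iou_sum:    sum of IoU(p, g) over the true-positive matches.
    tp, fp, fn: counts of true positives, false positives and false negatives.
    """
    sq = iou_sum / tp if tp > 0 else 0.0
    rq = tp / (tp + 0.5 * fp + 0.5 * fn) if (tp + fp + fn) > 0 else 0.0
    return sq * rq, sq, rq

# Example: 80 matched segments with an average IoU of 0.8, 10 FPs and 10 FNs give
# SQ = 0.8, RQ = 0.889 and PQ = 0.711.
```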

4.1 Datasets

We benchmark our proposed EfficientPS for panoptic segmentation on four challenging urban scene understanding datasets, namely Cityscapes (Cordts et al., 2016), KITTI (Geiger et al., 2013), Mapillary Vistas (Neuhold et al., 2017) and the Indian Driving Dataset (Varma et al., 2019). The KITTI benchmark does not provide panoptic annotations, therefore to facilitate this work, we publicly release manually annotated panoptic groundtruth segmentation labels for the popular KITTI benchmark. These four diverse datasets contain images that range from congested city driving scenarios to rural scenes and highways. They also contain scenes in challenging perceptual conditions including snow, motion blur and other seasonal visual changes. We briefly describe the characteristics of these datasets in this section.

Cityscapes: The Cityscapes dataset (Cordts et al., 2016) consists of urban street scenes and focuses on semantic understanding of common driving scenarios. It is one of the most challenging datasets for panoptic segmentation due to its sheer diversity, as it covers scenes from over 50 European cities recorded over several seasons such as spring, summer and fall. The presence of a large number of dynamic objects further adds to its complexity. Figure 5 (a) shows an example image and the corresponding panoptic groundtruth annotation from the Cityscapes dataset.


Figure 5 Example images from the challenging urban scene understanding datasets that we benchmark on, namely Cityscapes, KITTI, Mapillary Vistas and the Indian Driving Dataset (IDD). The images show cluttered urban scenes with many dynamic objects, occluded objects, perpetual snowy conditions and unstructured environments.

As we see from this example, the scenes are extremely cluttered with many dynamic objects such as pedestrians and cyclists that are often grouped near one another or partially occluded. These factors make panoptic segmentation, especially segmenting the 'thing' classes, exceedingly challenging.

The widely used Cityscapes dataset recently introduced a benchmark for the task of panoptic segmentation. The dataset contains pixel-level annotations for 19 object classes, of which 11 are 'stuff' classes and 8 are instance-specific 'thing' classes. It consists of 5000 finely annotated images and 20000 coarsely annotated images that were captured at a resolution of 2048×1024 pixels using an automotive-grade 22cm baseline stereo camera. The finely annotated images are divided into 2975 for training, 500 for validation and 1525 for testing. The annotations for the test set are not publicly released; they are only available to the online evaluation server that automatically computes the metrics and publishes the results. We report the performance of our proposed EfficientPS on both the validation set as well as the test set. We also use the Cityscapes dataset for evaluating the improvement due to the various architectural contributions that we make in the ablation study. We report results on the validation set for our model trained only on the fine annotations, and we report the results on the test set from the benchmarking server for our model trained on both the fine and coarse annotations.

KITTI: The KITTI vision benchmark suite (Geiger et al., 2013) is one of the most comprehensive datasets that provides groundtruth for a variety of tasks such as semantic segmentation, scene flow estimation, optical flow estimation, depth prediction, odometry estimation, tracking and road lane detection. However, it still has not expanded its annotations to support the recently introduced panoptic segmentation task. The challenging nature of the KITTI scenes and its potential for benchmarking multi-task learning problems makes extending this dataset to include panoptic annotations of great interest to the community. Therefore, in this work, we introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI vision benchmark suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set, therefore in addition to panoptic segmentation, they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually.

Our dataset consists of a total of 1055 images, out of which 855 are used for the training set and 200 are used for the validation set. We provide annotations for 11 'stuff' classes and 8 'thing' classes, adhering to the Cityscapes 'stuff' and 'thing' class distribution. In order to create the panoptic annotations, we gathered semantic annotations from community-driven extensions of KITTI (Xu et al., 2016; Ros et al., 2015) and combined them with the 200 training images from the KITTI semantic training set. We then manually annotated all the images with instance masks, followed by merging them with the semantic annotations to generate the panoptic segmentation groundtruth labels. The images in our KITTI panoptic segmentation dataset have a resolution of 1280×384 pixels and contain scenes from both residential and inner city scenarios. Figure 5 (b) shows an example image from the KITTI panoptic segmentation dataset and its corresponding panoptic segmentation labels. We observe that the car denoted in teal color pixels and the van are both partially occluded by other 'stuff' classes, causing these object instances to be split into disjoint components. We find that scenarios such as these are extremely challenging for the task of panoptic segmentation, as the disjoint object mask has to be assigned to the same instance ID. We hope that this dataset encourages innovative solutions to such real-world problems that are uncommon in other datasets, and also accelerates research in multi-task learning for urban scene understanding.

Mapillary Vistas: Mapillary Vistas (Neuhold et al., 2017) is one of the largest publicly available street-level imagery datasets that contains pixel-accurate and instance-specific semantic annotations. The novel aspects of this dataset include diverse scenes from over six continents and in a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18000 images for training, 2000 images for validation and 5000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that due to the snowy conditions, recognizing distant objects such as the car in this example becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset: The Indian Driving Dataset (IDD) (Varma et al., 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructure such as lanes and sidewalks. It has a significantly larger number of 'thing' instances in each scene compared to other datasets, and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car, and the data was gathered in two Indian cities as well as in their outskirts. IDD consists of a total of 10003 images, where 6993 are used for training, 981 for validation and 2029 for testing. The images have a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3, therefore we report the results of our model on the same level to facilitate comparison. This level comprises a total of 26 classes, out of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset is shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels. We take crops from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky et al., 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio, 2010) for the other layers, zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set T_H = 0.7, T_L = 0.3 and T_N = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of c_t = 0.5, an overlap threshold of o_t = 0.5 and a minimum stuff area of min_sa = 2048.

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone, and continue training until convergence. We denote the base learning rate lr_base, the milestones and the total number of iterations t_i for each dataset in the format (lr_base, milestone, milestone, t_i). The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are (0.07, 32K, 44K, 50K), (0.07, 144K, 176K, 192K), (0.07, 16K, 22K, 25K) and (0.07, 108K, 130K, 144K) respectively. At the beginning of the training, we have a warm-up phase where the learning rate is increased linearly from 1/3 · lr_base to lr_base in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^-4. The final loss L_total that we optimize is computed as

L_{total} = L_{semantic} + L_{instance}, \quad (17)

where L_semantic and L_instance are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA TITAN X GPUs, where each GPU processes a single image.
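For illustration, the learning rate schedule described above can be sketched as follows (our own code, shown with the Cityscapes values; the milestone iterations and warm-up length are taken from the schedule stated in the text).

```python
def learning_rate_at(step, lr_base=0.07, milestones=(32000, 44000), warmup_iters=200):
    """Multi-step schedule with a linear warm-up from lr_base/3 to lr_base over the
    first 200 iterations, followed by a decay by a factor of 10 at each milestone."""
    if step < warmup_iters:
        return lr_base / 3 + (lr_base - lr_base / 3) * step / warmup_iters
    lr = lr_base
    for milestone in milestones:
        if step >= milestone:
            lr /= 10
    return lr

# Example: learning_rate_at(0) = lr_base/3, learning_rate_at(10000) = 0.07,
# learning_rate_at(40000) = 0.007 and learning_rate_at(45000) = 0.0007.
```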

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts, while for KITTI and IDD we report results for the models that we trained using the official implementations that have been publicly released by the authors, after further tuning of hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets, and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that at the time of submission, only the Cityscapes benchmark has the provision to evaluate results on the test set.


Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively; − denotes that the metric has not been reported for the corresponding method. All values are in (%).

Single-scale evaluation:
Network | Pre-training | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
WeaklySupervised | − | 47.3 | − | − | 39.6 | − | − | 52.9 | − | − | 24.3 | 71.6
TASCNet | − | 55.9 | − | − | 50.5 | − | − | 59.8 | − | − | − | −
Panoptic FPN | − | 58.1 | − | − | 52.0 | − | − | 62.5 | − | − | 33.0 | 75.7
AUNet | − | 59.0 | − | − | 54.8 | − | − | 62.1 | − | − | 34.4 | 75.6
UPSNet | − | 59.3 | 79.7 | 73.0 | 54.6 | 79.3 | 68.7 | 62.7 | 80.1 | 76.2 | 33.3 | 75.2
DeeperLab | − | 56.3 | − | − | − | − | − | − | − | − | − | −
Seamless | − | 60.3 | − | − | 56.1 | − | − | 63.3 | − | − | 33.6 | 77.5
SSAP | − | 61.1 | − | − | 55.0 | − | − | − | − | − | − | −
AdaptIS | − | 62.0 | − | − | 58.7 | − | − | 64.4 | − | − | 36.3 | 79.2
Panoptic-DeepLab | − | 63.0 | − | − | − | − | − | − | − | − | 35.3 | 80.5
EfficientPS (ours) | − | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3
TASCNet | COCO | 59.3 | − | − | 56.0 | − | − | 61.5 | − | − | 37.6 | 78.1
UPSNet | COCO | 60.5 | 80.9 | 73.5 | 57.0 | − | − | 63.0 | − | − | 37.8 | 77.8
Seamless | Vistas | 65.0 | − | − | 60.7 | − | − | 68.0 | − | − | − | 80.7
Panoptic-DeepLab | Vistas | 65.3 | − | − | − | − | − | − | − | − | 38.8 | 82.5
EfficientPS (ours) | Vistas | 66.1 | 82.5 | 78.9 | 62.7 | 81.9 | 75.2 | 68.5 | 82.9 | 81.6 | 41.9 | 81.0

Multi-scale evaluation:
Network | Pre-training | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Panoptic-DeepLab | − | 64.1 | − | − | − | − | − | − | − | − | 38.5 | 81.5
EfficientPS (ours) | − | 65.1 | 82.2 | 79.0 | 61.5 | 81.4 | 75.4 | 67.7 | 82.8 | 81.7 | 39.7 | 80.3
TASCNet | COCO | 60.4 | − | − | 56.1 | − | − | 63.3 | − | − | 39.1 | 78.7
M-RCNN + PSPNet | COCO | 61.2 | 80.9 | 74.4 | 54.0 | − | − | 66.4 | − | − | 36.4 | 80.9
UPSNet | COCO | 61.8 | 81.3 | 74.8 | 57.6 | 77.7 | 70.5 | 64.8 | 81.4 | − | 39.0 | 79.2
Panoptic-DeepLab | Vistas | 67.0 | − | − | − | − | − | − | − | − | 42.5 | 83.1
EfficientPS (ours) | Vistas | 67.5 | 83.2 | 80.2 | 63.5 | 82.2 | 77.2 | 70.4 | 83.9 | 82.4 | 43.8 | 82.1

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Network | Pre-training | PQ | SQ | RQ | PQ^Th | PQ^St
SSAP | − | 58.9 | 82.4 | 70.6 | 48.4 | 66.5
TASCNet | COCO | 60.7 | 81.0 | 73.8 | 53.4 | 66.0
Panoptic-DeepLab | − | 62.3 | 82.4 | 74.8 | 52.1 | 69.7
Seamless | Vistas | 62.6 | 82.1 | 75.3 | 56.0 | 67.5
Panoptic-DeepLab | Vistas | 65.5 | 77.8 | 75.2 | 56.5 | 72.0
EfficientPS (ours) | − | 64.1 | 82.6 | 76.8 | 56.7 | 69.4
EfficientPS (ours) | Vistas | 66.4 | 83.5 | 78.8 | 59.3 | 71.5

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures.

Network | Input Size (pixels) | Params (M) | FLOPs (B) | Time (ms)
DeeperLab | 1025×2049 | − | − | 463
UPSNet | 1024×2048 | 45.05 | 487.02 | 202
Seamless | 1024×2048 | 51.43 | 514.00 | 168
Panoptic-DeepLab | 1025×2049 | 46.73 | 547.49 | 175
EfficientPS (ours) | 1024×2048 | 40.89 | 433.94 | 166

On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75, 2) during the multi-scale evaluations.

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al., 2018b), TASCNet (Li et al., 2018a), Panoptic FPN (Kirillov et al., 2019), AUNet (Li et al., 2019b), UPSNet (Xiong et al., 2019), DeeperLab (Yang et al., 2019), Seamless (Porzi et al., 2019), SSAP (Gao et al., 2019), AdaptIS (Sofiiuk et al., 2019) and Panoptic-DeepLab (Cheng et al., 2019). Table 1 shows the results on the Cityscapes validation set. For a fair comparison, we categorize the models in the table separately according to whether they report single-scale or multi-scale evaluation, as well as whether they use no pre-training or pre-training on other datasets, namely Mapillary Vistas (Neuhold et al., 2017), denoted as Vistas, and Microsoft COCO (Lin et al., 2014), abbreviated as COCO.


Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively; − denotes that the metric has not been reported for the corresponding method. All values are in (%).

Single-scale evaluation:
Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | −
DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3
TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | −
AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | −
Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4
Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3
EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6

Multi-scale evaluation:
Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | −
Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8
EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1

We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations, depth data, or exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQ^St, PQ^Th, SQ and RQ metrics and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS without pre-training on any extra data achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime, including the time consumed for fusing the semantic and instance predictions to yield the final panoptic segmentation output. Table 3 shows the comparison with the two best-performing top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as the presence of 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-the-art Panoptic-DeepLab.


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Single-scale evaluation:
Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1
UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6
Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8
EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3

Multi-scale evaluation:
Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8
UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2
Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1
EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Single-scale evaluation:
Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1
UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4
Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6
EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3

Multi-scale evaluation:
Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1
UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6
Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3
EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1

Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQ^St score, primarily due to the output stride of 16 at which it operates, which increases its computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQ^St of Panoptic-DeepLab. On the other hand, top-down approaches tend to have a better instance segmentation ability, as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQ^St score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on the various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote the Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '−' refers to the standard configuration as in Kirillov et al. (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Model | Encoder | 2-way FPN | SIH | SH | PFM | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
M1 | ResNet-50 | − | − | − | − | 58.2 | 79.1 | 72.0 | 52.4 | 78.8 | 67.2 | 62.4 | 79.4 | 75.6 | 31.6 | 74.3
M2 | ResNet-50 | − | − | − | X | 58.8 | 79.5 | 72.6 | 53.4 | 79.2 | 62.8 | 67.9 | 79.7 | 76.1 | 33.8 | 75.1
M3 | ResNet-50 | − | X | − | X | 58.6 | 79.4 | 72.4 | 53.1 | 79.1 | 67.5 | 62.6 | 79.6 | 75.9 | 33.7 | 75.0
M4 | EfficientNet-B5 | − | X | − | X | 59.7 | 79.9 | 73.3 | 54.7 | 76.6 | 68.1 | 63.3 | 80.3 | 79.5 | 34.1 | 76.3
M5 | EfficientNet-B5 | X | X | − | X | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
M6 | EfficientNet-B5 | X | X | X | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of the various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and its comparison with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the network configuration and panoptic fusion heuristics of Kirillov et al. (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network comprises an upsampling stage which has a 3×3 convolution, group norm (Wu and He, 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are 1/4 scale of the input. The resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of the semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09M with a drop of 0.2% in PQ and a 0.1% drop in AP and the mIoU score. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and Leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables a bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively, enabling it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, the improvement cannot be solely attributed to the semantic head. This is due to the fact that if we employed the standard panoptic fusion heuristics, an improvement in semantic segmentation would only contribute to an increase in the PQ^St score.


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Encoder | Params (M) | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2
ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1
ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2
Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1
ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9
Mod. EfficientNet-B5 (Ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Architecture | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3
Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2
PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8
2-way FPN (Ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

However, our proposed adaptive panoptic fusion yields an improvement in PQ^Th as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al., 2019), ResNet-50 (He et al., 2016), ResNet-101 (He et al., 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al., 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have a comparable or fewer number of parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al., 2017), the bottom-up FPN and the PANet FPN variants. We refer to the FPN architecture described in Liu et al. (2018) as PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. Both these models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to the standard FPN, in a sequential or parallel manner respectively.


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the ×os pyramid scale level, and c^k_f refers to a convolution layer with f filters and a k×k kernel size. LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
M61 | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | − | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2
M62 | LSFE | LSFE | LSFE | LSFE | − | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4
M63 | DPC | LSFE | LSFE | LSFE | − | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1
M64 | DPC | DPC | LSFE | LSFE | − | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6
M65 | DPC | DPC | DPC | LSFE | − | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2
M66 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activations sequentially at each level of the 2-way FPN. This series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output that is 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output of the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model; therefore, we employ separable convolutions in all the experiments that follow.
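A minimal sketch of the separable-convolution LSFE variant used in M62 (our own illustration; BatchNorm and LeakyReLU stand in for iABN sync, and the channel width of 128 follows the c^3_128 notation of Table 10) could look as follows.

```python
import torch
import torch.nn as nn

class LSFE(nn.Module):
    """Large Scale Feature Extractor sketch: two cascaded 3x3 separable convolutions,
    each followed by a normalization layer and Leaky ReLU, as used in the M62 model."""
    def __init__(self, in_channels=256, out_channels=128):
        super().__init__()
        def separable(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False),  # depthwise
                nn.Conv2d(c_in, c_out, 1, bias=False),                          # pointwise
                nn.BatchNorm2d(c_out),  # iABN sync in the actual model
                nn.LeakyReLU(0.01, inplace=True),
            )
        self.block = nn.Sequential(separable(in_channels, out_channels),
                                   separable(out_channels, out_channels))

    def forward(self, x):
        return self.block(x)
```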

we employ separable convolutions in all the experiments thatfollow

In the M63 model, we replace the LSFE module at the P32 level of the 2-way FPN with dense prediction cells (DPC), described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score, which can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module at the P16 level with DPC, and in the subsequent M65 model, we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with the fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, resulting in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al. (2019), which we denote as the baseline, UPSNet (Xiong et al., 2019) and Seamless (Porzi et al., 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments, while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Semantic Head | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1
Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Model | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St
Baseline | 62.4 | 80.8 | 75.4 | 58.7 | 80.4 | 72.6 | 65.1 | 81.1 | 77.4
TASCNet | 62.5 | 80.9 | 75.6 | 58.6 | 80.5 | 72.8 | 65.3 | 81.2 | 77.7
UPSNet | 63.1 | 81.3 | 76.1 | 59.5 | 80.6 | 73.2 | 65.7 | 81.8 | 78.2
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model, which employs their MiniDL module at each level of the 2-way FPN, further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at the different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as a baseline as it is extensively used in several panoptic segmentation networks. We then compare with the Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in (Xiong et al., 2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents the results from this experiment. Combining the outputs of the semantic head and the instance head, which have an inherent overlap, is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between the predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask which selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them, and it therefore relies on the mode computed over the segmentation outputs to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe consistently higher performance in all the other metrics.

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model Panoptic-DeepLab does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding datasets.



Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on the different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


For each example, we show the input image and the corresponding panoptic segmentation outputs from the Seamless model and our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can primarily be attributed to our 2-way FPN that effectively aggregates multi-scale features to learn semantically richer representations, and to the panoptic fusion module that addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c), the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. Interestingly, in Figure 6 (c), the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately address this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the fact that there is a lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across the different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are better defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head that correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearances of these two objects; however, our proposed EfficientPS model with our novel panoptic backbone has an extensive representational capacity, which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of traffic participants.


[Figure 7: panels (i)-(iv) show different examples per row; rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD.]

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances at multiple scales and with partial occlusion. The scenes also show diverse lighting conditions from dawn to dusk, as well as seasonal changes.


These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from viewpoints that are uncommon compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analysis and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and to exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335
Badrinarayanan V, Kendall A, Cipolla R (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229
Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons
Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision, Springer, pp 44–57
Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848
Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems, pp 8713–8724
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-DeepLab: A simple, strong and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, pp 534–549
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the International Conference on Computer Vision, pp 764–773
Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) SSAP: Single-shot instance segmentation with affinity pyramid. In: Proceedings of the International Conference on Computer Vision, pp 642–651
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The KITTI dataset. International Journal of Robotics Research
de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110
Girshick R (2015) Fast R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 1440–1448
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256
Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, pp 297–312
Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 2961–2969
He X, Gould S (2014a) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
He X, Gould S (2014b) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for MobileNetV3. In: Proceedings of the International Conference on Computer Vision, pp 1314–1324
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059
Kang BR, Kim HY (2018) BshapeNet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327
Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413
Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In: Proceedings of the International Conference on Computer Vision, pp 2190–2197
Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in Neural Information Processing Systems, pp 109–117
Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192
Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118
Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229
Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: European Conference on Computer Vision, Springer, pp 740–755
Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181
Liu S, Jia J, Fidler S, Urtasun R (2017) SGN: Sequential grouping networks for instance segmentation. In: Proceedings of the International Conference on Computer Vision, pp 3496–3504
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768
Liu W, Rabinovich A, Berg AC (2015) ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In: Proceedings of the International Conference on Computer Vision, pp 4990–4999
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035
Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998
Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824
Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286
Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887
Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664
Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329
Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238
Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of DNNs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sofiiuk K, Barinova O, Konushin A (2019) AdaptIS: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363
Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference
Tan M, Le QV (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135
Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision 63(2):113–140
Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25
Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots
Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics
Valada A, Vertens J, Dhall A, Burgard W (2017) AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651
Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)
Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, Special Issue: Deep Learning for Robotic Vision
Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751
Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) UPSNet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826
Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349
Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) DeeperLab: Single-shot image parser. arXiv preprint arXiv:1902.05093
Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721
Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890
Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227


EfficientPS Efficient Panoptic Segmentation 3

the lsquothingrsquo logits of the semantic head can aid in resolvingthe conflict more effectively In order to thoroughly exploitthe logits from both heads we propose a parameter-free pan-optic fusion module that adaptively fuses logits by selectivelyattenuating or amplifying fused logit scores based on howagreeable or disagreeable the predictions of individual headsare for each pixel in a given instance We demonstrate thatour proposed panoptic fusion mechanism is more effectiveand efficient than other widely used methods in existing ar-chitectures

Furthermore we also introduce the KITTI panoptic seg-mentation dataset that contains panoptic annotations for im-ages in the challenging KITTI benchmark (Geiger et al2013) As KITTI provides groundtruth for a whole suite ofperception and localization tasks these new panoptic annota-tions further complement the widely popularly benchmarkWe hope that these panoptic annotations that we make pub-licly available encourages future research in multi-task learn-ing for holistic scene understanding Furthermore in order tofacilitate comparison we benchmark previous state-of-the-artmodels on our newly introduced KITTI panoptic segment-ation dataset and the IDD dataset We perform exhaustiveexperimental evaluations and benchmarking of our proposedEfficientPS architecture on four standard urban scene under-standing datasets including Cityscapes (Cordts et al 2016)Mapillary Vistas (Neuhold et al 2017) KITTI (Geiger et al2013) and Indian Driving Dataset (IDD) (Varma et al 2019)

Our proposed EfficientPS with a PQ score of 664 isranked first for panoptic segmentation on the Cityscapesbenchmark leaderboard without training on coarse annota-tions or using model ensembles Additionally EfficientPSis also ranked second for the semantic segmentation taskas well as the instance segmentation task on the Cityscapesbenchmark with a mIoU score of 842 and an AP of 391respectively On the Mapillary Vistas dataset our single Effi-cientPS model achieves a PQ score of 405 on the valida-tion set thereby outperforming all the existing methods Sim-ilarly EfficientPS consistently outperforms existing panopticsegmentation models on both the KITTI and IDD datasetsby a large margin More importantly our EfficientPS archi-tecture not only sets the new state-of-the-art on all the fourpanoptic segmentation benchmarks but it is also the mostcomputationally efficient by consuming the least amount ofparameters and having the fastest inference time comparedto previous state-of-the-art methods Finally we present de-tailed ablation studies that demonstrate the improvement inperformance due to each of the architectural contributionsthat we make in this work Moreover we also make imple-mentations of our proposed EfficientPS architecture trainingcode and pre-trained models publicly available

In summary the following are the main contributions ofthis work

1 The novel EfficientPS architecture for panoptic segmenta-tion that incorporates our proposed efficient shared back-bone with our new feature aligning semantic head a newvariant of Mask R-CNN as the instance head and ournovel adaptive panoptic fusion module

2 A new panoptic backbone consisting of an augmentedEfficientNet architecture and our proposed 2-way FPNthat both encodes and aggregates semantically rich multi-scale features in a bidirectional manner

3 A novel semantic head that captures fine features andlong-range context efficiently as well as correlates thembefore fusion for better object boundary refinement

4 A new panoptic fusion module that dynamically adaptsthe fusion of logits from the semantic and instance headsbased on their mask confidences and congruously integ-rates instance-specific lsquothingrsquo classes with lsquostuffrsquo classesto compute the panoptic prediction

5 The KITTI panoptic segmentation dataset that providespanoptic groundtruth annotations for images from thechallenging KITTI benchmark dataset

6 Benchmarking of existing state-of-the-art panoptic seg-mentation architectures on the newly introduced KITTIpanoptic segmentation dataset and IDD dataset

7 Comprehensive benchmarking of our proposed Effi-cientPS architecture on Cityscapes Mapilliary VistasKITTI and IDD datasets

8 Extensive ablation studies that compare the performanceof various architectural components that we propose inthis work with their counterparts from state-of-the-artarchitectures

9 Implementation of our proposed architecture and a livedemo on all the four datasets is publicly available athttprluni-freiburgderesearchpanoptic

2 Related Works

Panoptic segmentation is a recently introduced scene under-standing problem (Kirillov et al 2019) that unifies the tasksof semantic segmentation and instance segmentation Thereare numerous methods that have been proposed for each ofthese sub-tasks however only a handful of approaches havebeen introduced to tackle this coherent scene understand-ing problem of panoptic segmentation Most works in thisdomain are largely built upon advances made in semanticsegmentation and instance segmentation therefore we firstreview recent methods that have been proposed for theseclosely related sub-tasks followed by state-of-the-art ap-proaches that have been introduced for panoptic segmenta-tion

Semantic Segmentation There has been significant advanc-es in semantic segmentation approaches in recent years Inthis section we briefly review methods that use a single

4 Rohit Mohan Abhinav Valada

monocular image to tackle this task Approaches from thepast decade typically employ random decision forests to ad-dress this task Shotton et al (2008) use randomized decisionforests on local patches for classification whereas Plath et al(2009) fuse local and global features along with ConditionalRandom Fields(CRFs) for segmentation As opposed to lever-aging appearance-based features Brostow et al (2008) usecues from motion with random forests Sturgess et al (2009)further combine appearance-based features with structure-from-motion features in addition to CRFs to improve theperformance However 3D features extracted from densedepth maps (Zhang et al 2010) have been demonstrated to bemore effective than the combined features Kontschieder et al(2011) exploit the inherent topological distribution of objectclasses to improve the performance whereas Krahenbuhland Koltun (2011) improve segmentation by pairing CRFswith Gaussian edge potentials Nevertheless all these meth-ods employ handcrafted features that do not encapsulate allthe high-level and low-level relations thereby limiting theirrepresentational ability

The significant improvement in performance of classific-ation tasks brought about by Convolutional Neural Network(CNN) based approaches motivated researchers to exploresuch methods for semantic segmentation Initially these ap-proaches relied on patch-wise training that severely limitedtheir ability to accurately segment object boundaries How-ever they still perform substantially better than previoushandcrafted methods The advent of end-to-end learning ap-proaches for semantic segmentation lead by the introduc-tion of Fully Convolutional Networks (FCNs) (Long et al2015) revolutionized this field and FCNs still form the baseupon which state-of-the-art architecture are built upon todayFCN is an encoded-decoder architecture where the encoderis based on the VGG-16 (Simonyan and Zisserman 2014)architecture with inner-product layers replaced with convolu-tions and the decoder consists of convolution and transposedconvolution layers The subsequently proposed SegNet (Bad-rinarayanan et al 2015) architecture introduced unpoolinglayers for upsampling as a replacement for transposed con-volutions whereas ParseNet (Liu et al 2015) models globalcontext directly as opposed to only relying on the largestreceptive field of the network

The PSPNet (Zhao et al 2017) architecture emphasizeson the importance of multi-scale features and propose pyr-amid pooling to learn feature representations at differentscales Yu and Koltun (2015) introduce atrous convolutionsto further exploit multi-scale features in semantic segment-ation networks Subsequently Valada et al (2017) proposemulti-scale residual units with parallel atrous convolutionswith different dilation rates to efficiently learn multiscale fea-tures throughout the network without increasing the numberof parameters Chen et al (2017b) propose the Atrous SpatialPyramid Pooling (ASPP) module that concatenates feature

maps from multiple parallel atrous convolutions with differ-ent dilation rates and a global pooling layer ASPP substan-tially improves the performance of semantic segmentationnetworks by aggregating multi-scale features and capturinglong-range context however it significantly increases thecomputational complexity Therefore Chen et al (2018a) pro-pose Dense Prediction Cells (DPC) and Valada et al (2019)propose Efficient Atrous Spatial Pyramid Pooling (eASPP)that yield better semantic segmentation performance thanASPP while being 10-times more efficient Li et al (2019a)suggest that global feature aggregation often leads to largepattern features and also over-smooth regions of small pat-terns which results in sub-optimal performance In order toalleviate this problem the authors propose the use of a globalaggregation module coupled with a local distribution mod-ule which results in features that are balanced in small andlarge pattern regions There are also several works that havebeen proposed to improve the upsampling in decoders ofencoder-decoder architectures In (Chen et al 2018b) theauthors introduce a novel decoder module for object bound-ary refinement Tian et al (2019) propose data-dependentupsampling which accounts for the redundancy in the labelspace as opposed to simple bilinear upsampling

Instance Segmentation Some of the initial approaches em-ploy CRFs (He and Gould 2014b) and minimize integerquadratic relations (Tighe et al 2014) Methods that exploitCNNs with Markov random fields (Zhang et al 2016) andrecurrent neural networks (Romera-Paredes and Torr 2016Ren and Zemel 2017) have also been explored In this sec-tion we primarily discuss CNN-based approaches for in-stance segmentation These methods can be categorized intoproposal free and proposal based methods

Methods in the proposal free category often obtain in-stance masks from a resulting transformation Bai and Ur-tasun (2017) uses CNNs to produce an energy map of theimage and then perform a cut at a single energy level toobtain the corresponding object instances Liu et al (2017)employ a sequence of CNNs to solve sub-grouping problemsin order to compose object instances Some approaches ex-ploit FCNs which either use local coherence for estimatinginstances (Dai et al 2016) or encode the direction of eachpixel to its corresponding instance centre (Uhrig et al 2016)The recent approach SSAP (Gao et al 2019) uses pixel-pair affinity pyramids for computing the probability that twopixels hierarchically belong to the same instance Howeverthey achieve a lower than proposal based methods which hasled to a decline in their popularity

In proposal based methods Hariharan et al (2014) pro-pose a method that uses Multiscale Combinatorial Group-ing (Arbelaez et al 2014) proposals as input to CNNs forfeature extraction and then employ an SVM classifier forregion classification Subsequently Hariharan et al (2015)propose hypercolumn pixel descriptors for simultaneous de-

EfficientPS Efficient Panoptic Segmentation 5

tection and segmentation In recent works DeepMask (Pin-heiro et al 2015) uses a patch of an image as input to a CNNwhich yields a class-agnostic segmentation mask and thelikelihood of the patch containing an object FCIS (Li et al2017) employs position-sensitive score maps obtained fromclassification of pixels based on their relative positions toperform segmentation and detection jointly Dai et al (2016)propose an approach for instance segmentation that uses threenetworks for distinguishing instances estimating masks andcategorizing objects Mask R-CNN (He et al 2017) is one ofthe most popular and widely used approaches in the presenttime It extends Faster R-CNN for instance segmentation byadding an object segmentation branch parallel to an branchthat performs bounding box regression and classificationMore recently Liu et al (2018) propose an approach to im-prove Mask R-CNN by adding bottom-up path augmentationthat enhances object localization ability in earlier layers of thenetwork Subsequently BshapeNet (Kang and Kim 2018) ex-tends Faster R-CNN by adding a bounding box mask branchthat provides additional information of object positions andcoordinates to improve the performance of object detectionand instance segmentation

Panoptic Segmentation Kirillov et al (2019) revived theunification of semantic segmentation and instance segmenta-tion tasks by introducing panoptic segmentation They pro-pose a baseline model that combines the output of PSPNet(Zhao et al 2017) and Mask R-CNN (He et al 2017) with asimple post-processing step in which each model processesthe inputs independently The methods that address this taskof panoptic segmentation can be broadly classified into twocategories top-down or proposal based methods and bottom-up or proposal free methods Most of the current state-of-the-art methods adopt the top-down approach de Geus et al(2018) propose joint training with a shared backbone thatbranches into Mask R-CNN for instance segmentation andaugmented Pyramid Pooling module for semantic segment-ation Subsequently Li et al (2019b) introduce Attention-guided Unified Network that uses proposal attention moduleand mask attention module for better segmentation of lsquostuffrsquoclasses All the aforementioned methods use a similar fusiontechnique to Kirillov et al (2019) for the fusion of lsquostuffrsquo andlsquothingrsquo predictions

In top-down panoptic segmentation architectures pre-dictions of both heads have an inherent overlap betweenthem resulting in the mask overlapping problem In orderto mitigate this problem Li et al (2018b) propose a weaklysupervised model where lsquothingrsquo classes are weakly super-vised by bounding boxes and lsquostuffrsquo classes are supervisedwith image-level tags Whereas Liu et al (2019) address theproblem by introducing the spatial ranking module and Liet al (2018a) propose a method that learns a binary maskto constrain output distributions of lsquostuffrsquo and lsquothingrsquo expli-citly Subsequently UPSNet (Xiong et al 2019) introduces

a parameter-free panoptic head to address the problem ofoverlapping of instances and also predicts an extra unknownclass More recently AdaptIS (Sofiiuk et al 2019) uses pointproposals to produce instance masks and jointly trains with astandard semantic segmentation pipeline to perform panopticsegmentation In contrast Porzi et al (2019) propose an archi-tecture for panoptic segmentation that effectively integratescontextual information from a lightweight DeepLab-inspiredmodule with multi-scale features from a FPN

Compared to the popular proposal based methods thereare only a handful of proposal free methods that have beenproposed Deeper-Lab (Yang et al 2019) was the first bottom-up approach that was introduced and it employs an encoder-decoder topology to pair object centres for class-agnosticinstance segmentation with DeepLab semantic segmentationCheng et al (2019) further builds on Deeper-Lab by intro-ducing a dual-ASPP and dual-decoder structure for eachsub-task branch SSAP (Gao et al 2019) proposes to grouppixels based on a pixel-pair affinity pyramid and incorporatean efficient graph method to generate instances while jointlylearning semantic labeling

In this work we adopt a top-down approach due to itsexceptional ability to handle large scale variation of instanceswhich is a critical requirement for segmenting lsquothingrsquo classesWe present the novel EfficientPS architecture that incorpor-ates our proposed efficient backbone with our 2-way FPN forlearning rich multi-scale features in a bidirectional mannercoupled with a new semantic head that captures fine-featuresand long-range context effectively and a variant of MaskR-CNN augmented with separable convolutions as the in-stance head We propose a novel panoptic fusion moduleto dynamically adapt the fusion of logits from the semanticand instance heads to yield the panoptic segmentation outputOur architecture achieves state-of-the-art results on bench-mark datasets while being the most efficient and fast panopticsegmentation architecture

3 EfficientPS Architecture

In this section we first give a brief overview of our proposedEfficientPS architecture and then detail each of its constitut-ing components Our network follows the top-down layoutas shown in Figure 2 It consists of a shared backbone witha 2-way Feature Pyramid Network (FPN) followed by task-specific semantic segmentation and instance segmentationheads We build upon the EfficientNet (Tan and Le 2019)architecture for the encoder of our shared backbone (depictedin red) It consists of mobile inverted bottleneck (Xie et al2017) units and employs compound scaling to uniformlyscale all the dimensions of the encoder network This en-ables our encoder to have a rich representational capacitywith fewer parameters in comparison to other encoders orbackbones of similar discriminative capability

6 Rohit Mohan Abhinav Valada

256x14x14 256x28x28

NIx28x28

256X14x14

256x256x512

256x256x512

256x256x512

256x128x256

256x128x256 256x

64x128

256x64x128

256x64x128

256x128x256

48x512x1024

24x512x1024

40x256x512

64x128x256

128x64x128

176x64x128 304x32x64 512x32x64

2048x32x64

11

P32

P16

P8

P4

128x256x512

128x128x256

128x64x128

128x32x64

512x256x512NCx

1024x2048

P32

13

11

P16

13

13

P8

13

P4

11

11

11

11

11

11

13

13

13

13

22 1

1

11

256x32x64

256x32x64

256x32x64

RPN

ROI Align

MaskLogits

BBox

Class

LSFE

LSFE

DPC

DPC

Concat

SemanticLogits

PanopticFusion

(P4 P8 P16 P32)

(P4 P8 P16 P32)

Instance Segmentation Head

Semantic Segmentation Head

MC

MC

nxn Convolutionstride s

ns

Leaky ReLUBilinear upsamplingnxn Transpose convolutionstride s

ns

iABN Sync Fully connectednxn seperable convolutionstride s

ns

Output1024x2048

Input3x1024x2048

Figure 2 Illustration of our proposed EfficientPS architecture consisting of a shared backbone with our 2-way FPN and parallel semantic andinstance segmentation heads followed by our panoptic fusion module The shared backbone is built upon on the EfficientNet architecture and ournew 2-way FPN that enables bidirectional flow of information The instance segmentation head is based on a modified Mask R-CNN topology andwe incorporate our proposed semantic segmentation head Finally the outputs of both heads are fused in our panoptic fusion module to yield thepanoptic segmentation output

As opposed to employing the conventional FPN (Lin et al2017) that is commonly used in other panoptic segmentationarchitectures (Kirillov et al 2019 Li et al 2018a Porzi et al2019) we incorporate our proposed 2-way FPN that fusesmulti-scale features more effectively than its counterpartsThis can be attributed to the fact that the information flowin our 2-way FPN is not bounded to only one direction asdepicted by the purple blue and green blocks in Figure 2Subsequently after the 2-way FPN we employ two headsin parallel for semantic segmentation (depicted in yellow)and instance segmentation (depicted in gray and orange)respectively We use a variant of the Mask R-CNN (He et al2017) architecture as the instance head and we incorporateour novel semantic segmentation head consisting of denseprediction cells (Chen et al 2018a) and residual pyramidsThe semantic head consists of three different modules forcapturing fine features long-range contextual features andcorrelating the distinctly captured features for improvingobject boundary refinement Finally we employ our proposedpanoptic fusion module to fuse the outputs of the semanticand instance heads to yield the panoptic segmentation output

31 Network Backbone

The backbone of our network consists of an encoder withour proposed 2-way FPN The encoder is the basic buildingblock of any segmentation network and a strong encoder isessential to have high representational capacity In this workwe seek to find a good trade-off between the number of para-meters and computational complexity to the representationalcapacity of the network EfficientNets (Tan and Le 2019)which are a recent family of architectures have been shownto significantly outperform other networks in classificationtasks while having fewer parameters and FLOPs It employscompound scaling to uniformly scale the width depth andresolution of the network efficiently Therefore we chooseto build upon this scaled architecture with 16 22 and 456coefficients commonly known as the EfficientNet-B5 modelThis can be easily replaced with any of the EfficientNet mod-els based on the capacity of the resources that are availableand the computational budget

In order to adapt EfficientNet to our task we first removethe classification head as well as the Squeeze-and-Excitation(SE) (Hu et al 2018) connections in the network We find that

EfficientPS Efficient Panoptic Segmentation 7

the explicit modelling of interdependencies between chan-nels of the convolutional feature maps that are enabled bythe SE connections tend to suppress localization of featuresin favour of contextual elements This property is a desiredin classification networks however both are equally import-ant for segmentation tasks therefore we do not add any SEconnections in our backbone Second we replace all thebatch normalization (Ioffe and Szegedy 2015) layers withsynchronized Inplace Activated Batch Normalization (iABNsync) (Rota Bulo et al 2018) This enables synchronizationacross different GPUs which in turn yields a better estimateof gradients while performing multi-GPU training and thein-place operations frees up additional GPU memory Weanalyze the performance of our modified EfficientNet in com-parison to other encoders commonly used in state-of-the-artarchitectures in the ablation study presented in Section 442

Our EfficientNet encoder comprises of nine blocks asshown in Figure 2 (in red) We refer to each block in thefigure as block 1 to block 9 in the left to right manner Theoutput of block 2 3 5 and 9 corresponds to downsamplingfactors times4times8times16 and times32 respectively The outputs fromthese blocks with downsampling are also inputs to our 2-wayFPN The conventional FPN used in other panoptic segment-ation networks aims to address the problem of multi-scalefeature fusion by aggregating features of different resolutionsin a top-down manner This is performed by first employinga 1times1 convolution to reduce or increase the number of chan-nels of different encoder output resolutions to a predefinednumber typically 256 Then the lower resolution featuresare upsampled to a higher resolution and are subsequentlyadded together For example times32 resolution encoder outputfeatures will be resized to the times16 resolution and added tothe times16 resolution encoder output features Finally a 3times3convolution is used at each scale to further learn fused fea-tures which then yields the P4 P8 P16 and P32 outputs ThisFPN topology has a limited unidirectional flow of informa-tion resulting in an ineffective fusion of multi-scale featuresTherefore we propose to mitigate this problem by addinga second branch that aggregates multi-scale features in abottom-up fashion to enable bidirectional flow of informa-tion

Our proposed 2-way FPN shown in Figure 2 consistsof two parallel branches Each branch consists of a 1times 1convolution with 256 output filters at each scale for chan-nel reduction The top-down branch shown in blue followsthe aggregation scheme of a conventional FPN from right toleft Whereas the bottom-up branch shown in purple down-samples the higher resolution features to the next lower res-olution from left to right and subsequently adds them withthe next lower resolution encoder output features For ex-ample times4 resolution features will be resized to the times8 resol-ution and added to the times8 resolution encoder output featuresThen in the next stage the outputs from the bottom-up and

128xHxW

128xHxW

13

13

256xHxW

256xHxW

kxHxW

13 1

1

11

4kxHxW

256xHxW

128xHxW

128xHxW

13

13

256xHxW

(a) LSFE Module (b) RPN Module (c) MC Module

256xHxW

256xHxW

256xHxW

256xHxW

256xHxW

Concat

1280xHxW

11

256xHxW

128xHxW

113

6213

18153

633

163

(d) DPC Module

nxn convolutionstride s

ns

Leaky ReLUnxn seperable convolutionstride s

ns

nxn seperable convolutiondilation rate

nrhrw iABN Sync

Figure 3 Topologies of various architectural components in our pro-posed semantic head and instance head of our EfficientPS architecture

top-down branches at each resolution are correspondinglysummed together and passed through a 3times3 separable convo-lution with 256 output channels to obtain the P4 P8 P16 andP32 outputs respectively We employ separable convolutionsas opposed to standard convolutions in an effort to keep theparameter consumption low We evaluate the performance ofour proposed 2-way FPN in comparison to the conventionalFPN in the ablation study presented in Section 443

32 Semantic Segmentation Head

Our proposed semantic segmentation head consists of threecomponents each aimed at targeting one of the critical re-quirements First at large-scale the network should have theability to capture fine features efficiently In order to enablethis we employ our Large Scale Feature Extractor (LSFE)module that has two 3times3 separable convolutions with 128output filters each followed by an iABN sync and a LeakyReLU activation function The first 3times3 separable convolu-tion reduces the number of filters to 128 and the second 3times3separable convolution further learns deeper features

The second requirement is that at small-scale the net-work should be able to capture long-range context Modulesinspired by Atrous Spatial Pyramid Pooling (ASPP) Chenet al (2017a) that are widely used in state-of-the-art semanticsegmentation architectures have been demonstrated to be ef-fective for this purpose Dense Prediction Cells (DPC) (Chenet al 2018a) and Efficient Atrous Spatial Pyramid Pooling

8 Rohit Mohan Abhinav Valada

(eASPP) (Valada et al 2019) are two variants of ASPP thatare significantly more efficient and also yield a better perform-ance We find that DPC demonstrates a better performancewith a minor increase in the number of parameters comparedto eASPP Therefore we employ a modified DPC module inour semantic head as shown in Figure 2 We augment the ori-ginal DPC topology by replacing batch normalization layerswith iABN sync and ReLUs with Leaky ReLUs The DPCmodule consists of a 3times3 separable convolution with 256output channels having a dilation rate of (16) and extendsout to five parallel branches Three of the branches eachconsist of a 3times 3 dilated separable convolution with 256outputs where the dilation rates are (11) (621) and (1815)respectively The fourth branch takes the output of the dilatedseparable convolution with a dilation rate of (1815) as inputand passes it through another 3times3 dilated separable convolu-tion with 256 output channels and a dilation rate of (63) Theoutputs from all these parallel branches are then concaten-ated to yield a tensor with 1280 channels This tensor is thenfinally passed through a 1times1 convolution with 256 outputchannels and forms the output of the DPC module Note thateach of the convolutions in the DPC module is followed by aiABN sync and a Leaky ReLU activation function

The third and final requirement for the semantic head isthat it should be able to mitigate the mismatch between large-scale and small-scale features while performing feature ag-gregation To this end we employ our Mismatch CorrectionModule (MC) that correlates the small-scale features withrespect to large-scale features It consists of cascaded 3times3separable convolutions with 128 output channels followedby iABN sync with Leaky ReLU and a bilinear upsamplinglayer that upsamples the feature maps by a factor of 2 Fig-ures 3 (a) 3 (c) and 3 (d) illustrate the topologies of thesemain components of our semantic head

The four different scaled outputs of our 2-way FPNnamely P4 P8 P16 and P32 are the inputs to our semantichead The small-scale inputs P32 and P16 with downsamplingfactors of times32 and times16 are each fed into two parallel DPCmodules While the large-scale inputs P8 and P4 with down-sampling factors of times8 and times4 are each passed through twoparallel LSFE modules Subsequently the outputs from eachof these parallel DPC and LSFE modules are augmentedwith feature alignment connections and each of them is up-sampled by a factor of 4 These upsampled feature maps arethen concatenated to yield a tensor with 512 channels whichis then input to a 1times1 convolution with Nlsquostuffrsquo+lsquothingrsquo outputfilters This tensor is then finally upsampled by a factor of2 and passed through a softmax layer to yield the semanticlogits having the same resolution as the input image Nowthe feature alignment connections from the DPC and LSFEmodules interconnect each of these outputs by element-wisesummation as shown in Figure 2 We add our MC modulesin the interconnections between the second DPC and LSFE

as well as between both the LSFE connections These cor-relation connections aggregate contextual information fromsmall-scale features and characteristic large-scale featuresfor better object boundary refinement We use the weightedper-pixel log-loss (Bulo et al 2017) for training which isgiven by

Lpp(Θ) =minussumi j

wi j log pi j(plowasti j) (1)

where plowasti j is the groundtruth for a given image pi j is thepredicted probability for the pixel (i j) being assigned classc isin p wi j =

4WH if pixel (i j) belongs to 25 of the worst

prediction and wi j = 0 otherwise W and H are the widthand height of the given input image The overall semantichead loss is given by

Lsemantic(Θ) =1n sumLpp (2)

where n is the batch size We present in-depth analysis ofour proposed semantic head in comparison other semanticheads commonly used in state-of-the-art architectures in Sec-tion 444

33 Instance Segmentation Head

The instance segmentation head of our EfficientPS networkshown in Figure 2 has a topology similar to Mask R-CNN (Heet al 2017) with certain modifications More specifically wereplace all the standard convolutions batch normalizationlayers and ReLU activations with separable convolutioniABN sync and Leaky ReLU respectively Similar to the restof our architecture we use separable convolutions instead ofstandard convolutions to reduce the number of parametersconsumed by the network This enables us to conserve 209Mparameters in comparison to the conventional Mask R-CNN

Mask R-CNN consists of two stages. In the first stage, the Region Proposal Network (RPN) module shown in Figure 3 (b) employs a fully convolutional network to output a set of rectangular object proposals and an objectness score for the given input FPN level. The network architecture incorporates a 3×3 separable convolution with 256 output channels, an iABN sync and a Leaky ReLU activation, followed by two parallel 1×1 convolutions with 4k and k output filters respectively. Here, k is the maximum number of object proposals. The k proposals are parameterized relative to k reference bounding boxes which are called anchors. The scale and aspect ratio define a particular anchor and vary depending on the dataset being used. As the generated proposals can potentially overlap, an additional filtering step of Non-Max Suppression (NMS) is employed: in case of overlaps, the proposal with the highest objectness score is retained and the rest are discarded. RPN first computes object proposals of the form (p_u, p_v, p_w, p_h) and the objectness score σ(p_s) that transform an anchor a of the form (u, v, w, h) to (u + p_u, v + p_v) and (w·e^{p_w}, h·e^{p_h}), where (u, v) is the position of anchor a in the image coordinate system, (w, h) are the dimensions and σ(·) is the sigmoid function. This is followed by NMS to yield the final candidate bounding box proposals to be used in the next stage.
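The anchor transformation described above can be summarized by the following sketch, which assumes anchors and regression outputs stored as (N, 4) tensors in the (u, v, w, h) parameterization given in the text; NMS is omitted.

```python
import torch

def decode_proposals(anchors, deltas, obj_logits):
    """Apply the RPN regression outputs (p_u, p_v, p_w, p_h) to anchors
    (u, v, w, h) and squash the objectness logits with a sigmoid.
    anchors: (N, 4), deltas: (N, 4), obj_logits: (N,)."""
    u, v, w, h = anchors.unbind(dim=1)
    pu, pv, pw, ph = deltas.unbind(dim=1)
    boxes = torch.stack([u + pu, v + pv, w * pw.exp(), h * ph.exp()], dim=1)
    scores = torch.sigmoid(obj_logits)
    return boxes, scores
```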

Subsequently, ROI align (He et al., 2017) uses the object proposals to extract features from the FPN encodings by directly pooling features from the nth channel with a 14×14 spatial resolution bounded within a bounding box proposal. Here,

n = max(1, min(4, ⌊3 + log_2(√(w·h)/224)⌋)),    (3)

where w and h are the width and height of the bounding box proposal. The features that are extracted then serve as input to the bounding box regression, object classification and mask segmentation networks. The RPN network operates on each of the output scales of the FPN sequentially, thereby accumulating candidates from different scales, and ROI align uses the accumulated list for feature extraction from all the scales. The bounding box regression and object classification networks consist of two shared consecutive fully-connected layers with 1024 channels, iABN sync and Leaky ReLU. Subsequently, the feature maps are passed through a fully-connected layer at the end for each task, with 4·N_'thing' outputs and N_'thing'+1 outputs respectively. The object classification logits yield a probability distribution over the classes and void from a softmax layer, whereas the bounding box regression network encodes class-specific correction factors. For a given object proposal of the form (u_c, v_c, w_c, h_c) for a class c, a new bounding box is computed of the form (u + c_u, v + c_v, w·e^{c_w}, h·e^{c_h}), where (c_u, c_v, c_w, c_h) are the class-specific correction factors, and (u, v) and (w, h) are the position and dimensions of the bounding box proposal respectively.
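Equation (3) amounts to the following level-selection rule; the helper name is illustrative. For example, a 224×224 proposal maps to n = 3, while smaller proposals are pooled from finer pyramid levels.

```python
import math

def roi_fpn_level(w, h):
    """FPN level selection for ROI align (Eq. 3): larger proposals are pooled
    from coarser pyramid levels; w, h are the proposal size in pixels."""
    return max(1, min(4, math.floor(3 + math.log2(math.sqrt(w * h) / 224))))
```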

The mask segmentation network employs four consecutive 3×3 separable convolutions with 256 output channels, followed by iABN sync and Leaky ReLU. Subsequently, the resulting feature maps are passed through a 2×2 transposed convolution with 256 output channels and an output stride of 2, followed by an iABN sync and a Leaky ReLU activation function. Finally, a 1×1 convolution with N_'thing' output channels is employed that yields 28×28 logits for each class. The resulting logits give the mask foreground probabilities for the candidate bounding box proposal using a sigmoid function. This is then fused with the semantic logits in our proposed panoptic fusion module described in Section 3.4.
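The following is a hedged sketch of the mask branch as described above, again using BatchNorm and Leaky ReLU as stand-ins for iABN sync; the builder function and layer grouping are our own illustration rather than the authors' implementation.

```python
import torch.nn as nn

def separable_conv(in_ch, out_ch):
    # depthwise 3x3 + pointwise 1x1, with BatchNorm/LeakyReLU as stand-ins
    # for the iABN sync + Leaky ReLU used in the paper
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.01, inplace=True))

def build_mask_head(num_thing_classes, in_ch=256):
    """Mask branch sketch: four 3x3 separable convolutions, a 2x2 transposed
    convolution with stride 2 (14x14 -> 28x28), and a final 1x1 convolution
    yielding 28x28 mask logits per 'thing' class."""
    return nn.Sequential(
        separable_conv(in_ch, 256),
        separable_conv(256, 256),
        separable_conv(256, 256),
        separable_conv(256, 256),
        nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2),
        nn.BatchNorm2d(256),
        nn.LeakyReLU(0.01, inplace=True),
        nn.Conv2d(256, num_thing_classes, kernel_size=1))
```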

In order to train the instance segmentation head, we adopt the loss functions proposed in Mask R-CNN, i.e., two loss functions for the first stage, the objectness score loss and the object proposal loss, and three loss functions for the second stage, the classification loss, the bounding box loss and the mask segmentation loss. We take a set of randomly sampled positive matches and negative matches such that |N_s| ≤ 256. The objectness score loss L_os, defined as a log loss for a given N_s, is given by

L_os(Θ) = −(1/|N_s|) ∑_{(p*_os, p_os) ∈ N_s} [p*_os · log p_os + (1 − p*_os) · log(1 − p_os)],    (4)

where p_os is the output of the objectness score branch of the RPN and p*_os is the groundtruth label, which is 1 if the anchor is positive and 0 if the anchor is negative. We use the same strategy as Mask R-CNN for defining positive and negative matches. For a given anchor a, if the groundtruth box b* has the largest Intersection over Union (IoU), or IoU(b*, a) > T_H, then the corresponding prediction b is a positive match, and b is a negative match when IoU(b*, a) < T_L. The thresholds T_H and T_L are pre-defined, where T_H > T_L.

The object proposal loss L_op is a regression loss that is defined only on positive matches and is given by

L_op(Θ) = (1/|N_s|) ∑_{(t*, t) ∈ N_p} ∑_{(i*, i) ∈ (t*, t)} L1(i*, i),    (5)

where L1 is the smooth L1 norm, N_p is the subset of N_s positive matches, and t* = (t*_x, t*_y, t*_w, t*_h) and t = (t_x, t_y, t_w, t_h) are the parameterizations of b* and b respectively. b* = (x*, y*, w*, h*) is the groundtruth box and b = (x, y, w, h) is the predicted bounding box. x, y, w and h are the center coordinates, width and height of the predicted bounding box. Similarly, x*, y*, w* and h* denote the center coordinates, width and height of the groundtruth bounding box. The parameterizations (Girshick, 2015) are given by

t_x = (x − x_a)/w_a,   t_y = (y − y_a)/h_a,   t_w = log(w/w_a),   t_h = log(h/h_a),    (6)

t*_x = (x* − x_a)/w_a,   t*_y = (y* − y_a)/h_a,   t*_w = log(w*/w_a),   t*_h = log(h*/h_a),    (7)

where x_a, y_a, w_a and h_a denote the center coordinates, width and height of the anchor a.
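The parameterizations of Equations (6) and (7) can be computed with a small helper, shown here as an illustrative sketch operating on plain (x, y, w, h) tuples.

```python
import math

def box_parameterization(box, anchor):
    """Targets (t_x, t_y, t_w, t_h) of Equations (6)/(7) for a box
    (x, y, w, h) relative to an anchor (x_a, y_a, w_a, h_a)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))
```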

Similar to the objectness score loss L_os, the classification loss L_cls is defined for a set of K_s randomly sampled positive and negative matches such that |K_s| ≤ 512. The classification loss L_cls is given by

L_cls(Θ) = −(1/|K_s|) ∑_{(Y*, Y) ∈ K_s} ∑_{c=1}^{N_'thing'+1} Y*_oc · log Y_oc,    (8)

where Y is the output of the classification branch, Y* is the one-hot encoded groundtruth label, o is the observed class and c is the correct classification for object o. For a given image, it is a positive match if IoU(b*, b) > T_N and otherwise a negative match, where b* is the groundtruth box and b is the object proposal from the first stage.



Figure 4 Illustration of our proposed Panoptic Fusion Module. Here, the MLA and MLB mask logits are fused as (σ(MLA) + σ(MLB)) ⊙ (MLA + MLB), where MLB is the output of the function f*, σ(·) is the sigmoid function and ⊙ is the Hadamard product. Here, the f* function for a given class prediction c (cyclist in this example) zeroes out the scores of the c channel of the semantic logits outside the corresponding bounding box.

The bounding box loss L_bbx is a regression loss that is defined only on positive matches and is expressed as

L_bbx(Θ) = (1/|K_s|) ∑_{(T*, T) ∈ K_p} ∑_{(i*, i) ∈ (T*, T)} L1(i*, i),    (9)

where L1 is the smooth L1 norm (Girshick, 2015), K_p is the subset of K_s positive matches, and T* and T are the parameterizations of B* and B respectively, similar to Equation (4) and (5), where B* is the groundtruth box and B is the corresponding predicted bounding box.

Finally, the mask segmentation loss is also defined only for positive samples and is given by

L_mask(Θ) = −(1/|K_s|) ∑_{(P*, P) ∈ K_s} L_p(P*, P),    (10)

where L_p(P*, P) is given as

L_p(P*, P) = −(1/|T_p|) ∑_{(i,j) ∈ T_p} [P*_ij · log P_ij + (1 − P*_ij) · log(1 − P_ij)],    (11)

where P is the predicted 28×28 binary mask for a class, with P_ij denoting the probability of the mask pixel (i, j), P* is the 28×28 groundtruth binary mask for the class, and T_p is the set of non-void pixels in P*.

All five losses are weighted equally and the total instance segmentation head loss is given by

L_instance = L_os + L_op + L_cls + L_bbx + L_mask.    (12)

Similar to Mask R-CNN, the gradients that are computed with respect to the losses L_cls, L_bbx and L_mask flow only through the network backbone and not through the region proposal network.

3.4 Panoptic Fusion Module

In order to obtain the panoptic segmentation output, we need to fuse the predictions of the semantic segmentation head and the instance segmentation head. However, fusing both these predictions is not a straightforward task due to the inherent overlap between them. Therefore, we propose a novel panoptic fusion module to tackle the aforementioned problem in an adaptive manner, in order to thoroughly exploit the predictions from both heads congruously. Figure 4 shows the topology of our panoptic fusion module. We obtain a set of object instances from the instance segmentation head of our network, where for each instance we have its corresponding class prediction, confidence score, bounding box and mask logits. First, we reduce the number of predicted object instances in two stages. We begin by discarding all object instances that have a confidence score of less than a certain confidence threshold. We then resize, zero pad and scale the 28×28 mask logits of each object instance to the same resolution as the input image. Subsequently, we sort the class prediction, bounding box and mask logits according to the respective confidence scores. In the second stage, we check each sorted instance mask logit for overlap with the other object instances. If the overlap is higher than a given overlap threshold, we discard the other object instance.

After filtering the object instances, we have the class prediction, bounding box prediction and mask logit MLA of each instance. We simultaneously obtain semantic logits with N channels from the semantic head, where N is the sum of N_'stuff' and N_'thing'. We then compute a second mask logit MLB for each instance, where we select the channel of the semantic logits based on its class prediction. We only keep the logit scores of the selected channel for the area within the instance bounding box, while we zero out the scores that are outside this region. In the end, we have two mask logits for each instance: one from the instance segmentation head and the other from the semantic segmentation head. We combine these two logits adaptively by computing the Hadamard product of the sum of the sigmoids of MLA and MLB, and the sum of MLA and MLB, to obtain the fused mask logits FL of the instances, expressed as

FL = (σ(MLA) + σ(MLB)) ⊙ (MLA + MLB),    (13)

where σ(·) is the sigmoid function and ⊙ is the Hadamard product. We then concatenate the fused mask logits of the object instances with the 'stuff' logits along the channel dimension to generate the intermediate panoptic logits. Subsequently, we apply the argmax operation along the channel dimension to obtain the intermediate panoptic prediction. In the final step, we take a zero-filled canvas and first copy the instance-specific 'thing' predictions from the intermediate panoptic prediction. We then fill the empty parts of the canvas with the 'stuff' class predictions by copying them from the predictions of the semantic head, while ignoring the classes that have an area smaller than a predefined threshold called the minimum stuff area. This gives us the final panoptic segmentation output.

We fuse the MLA and MLB instance logits in the aforementioned manner due to the fact that if both logits for a given pixel conform with each other, the final instance score will increase proportionately to their agreement, or vice-versa. In case of agreement, the corresponding object instance will either dominate or be superseded by the other instances as well as the 'stuff' class scores. Similarly, in case of disagreement, the score of the given object instance will reflect the extent of their difference. Simply put, the fused logit score is either adaptively attenuated or amplified according to the consensus. We evaluate the performance of our proposed panoptic fusion module in comparison to other existing methods in the ablation study presented in Section 4.4.5.
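The following sketch summarizes the fusion of Equation (13) and the subsequent canvas construction. It assumes instance mask logits already resized to the input resolution, 'stuff' channels indexed 0..N_'stuff'−1, and an argmax prediction from the semantic head; the confidence/overlap filtering stages and instance-ID bookkeeping are omitted, so this is an illustration rather than the full module.

```python
import torch

def fuse_instance_logits(ml_a, ml_b):
    """Adaptive fusion of Eq. (13): the fused logit is amplified where the
    instance and semantic mask logits agree and attenuated where they differ.
    ml_a, ml_b: (N_inst, H, W) mask logits from the two heads."""
    return (torch.sigmoid(ml_a) + torch.sigmoid(ml_b)) * (ml_a + ml_b)

def panoptic_prediction(fused_inst_logits, stuff_logits, semantic_pred,
                        n_stuff, min_stuff_area=2048):
    """Concatenate 'stuff' logits and fused instance logits, take the channel
    argmax, then fill a canvas with 'thing' instances first and 'stuff'
    classes from the semantic prediction afterwards."""
    inter = torch.cat([stuff_logits, fused_inst_logits], dim=0)
    inter_pred = inter.argmax(dim=0)            # (H, W) channel indices
    canvas = torch.full_like(inter_pred, -1)    # -1 marks unassigned pixels
    thing_mask = inter_pred >= n_stuff          # channels >= n_stuff are instances
    canvas[thing_mask] = inter_pred[thing_mask]
    # fill remaining pixels with 'stuff' classes, ignoring tiny regions
    for c in range(n_stuff):
        region = (~thing_mask) & (semantic_pred == c)
        if region.sum() >= min_stuff_area:
            canvas[region] = c
    return canvas
```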

4 Experimental Results

In this section, we first describe the standard evaluation metrics that we adopt for empirical evaluations, followed by brief descriptions of the datasets that we benchmark on in Section 4.1. We then present extensive quantitative comparisons and benchmarking results in Section 4.3, and detailed ablation studies on the various proposed architectural components in Section 4.4. Finally, we present qualitative comparisons and visualizations of panoptic segmentation on each of the datasets that we evaluate on in Section 4.5 and Section 4.6 respectively.

We use PyTorch (Paszke et al., 2019) for implementing all our architectures, and we trained our models on a system with an Intel Xeon@2.20GHz processor and NVIDIA TITAN X GPUs. We use the standard Panoptic Quality (PQ) metric (Kirillov et al., 2019) for quantifying the performance of our models. The PQ metric is computed as

PQ = ∑_{(p,g) ∈ TP} IoU(p, g) / (|TP| + ½|FP| + ½|FN|),    (14)

where TP, FP, FN and IoU are the true positives, false positives, false negatives and the intersection-over-union. The IoU is computed as IoU = TP/(TP + FP + FN). We also report the Segmentation Quality (SQ) and Recognition Quality (RQ) metrics, computed as

SQ = ∑_{(p,g) ∈ TP} IoU(p, g) / |TP|,    (15)

RQ = |TP| / (|TP| + ½|FP| + ½|FN|).    (16)

Following the standard benchmarking criteria for panoptic segmentation, we report PQ, SQ and RQ over all the classes in the dataset, and we also report them for the 'stuff' classes (PQSt, SQSt, RQSt) and the 'thing' classes (PQTh, SQTh, RQTh). Additionally, for the sake of completeness, we report the Average Precision (AP) and the mean Intersection-over-Union (mIoU) for both 'stuff' and 'thing' classes, as well as the inference time and FLOPs for comparisons. The implementation of our proposed EfficientPS model and a live demo on various datasets is publicly available at https://rl.uni-freiburg.de/research/panoptic.
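For reference, PQ, SQ and RQ of Equations (14)-(16) reduce to the following computation once the true-positive matches (segments with IoU > 0.5) and the false positive/negative counts are known. This is a simplified per-class sketch, not the official evaluation code.

```python
def panoptic_quality(iou_per_tp, num_fp, num_fn):
    """PQ, SQ and RQ of Eqs. (14)-(16) from matched segment IoUs.
    iou_per_tp: list of IoU values of the true-positive matches."""
    tp = len(iou_per_tp)
    iou_sum = sum(iou_per_tp)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    sq = iou_sum / tp if tp else 0.0
    rq = tp / denom if denom else 0.0
    return sq * rq, sq, rq   # PQ factorizes as SQ * RQ
```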

4.1 Datasets

We benchmark our proposed EfficientPS for panoptic segmentation on four challenging urban scene understanding datasets, namely Cityscapes (Cordts et al., 2016), KITTI (Geiger et al., 2013), Mapillary Vistas (Neuhold et al., 2017) and the Indian Driving Dataset (Varma et al., 2019). The KITTI benchmark does not provide panoptic annotations, therefore to facilitate this work we publicly release manually annotated panoptic groundtruth segmentation labels for the popular KITTI benchmark. These four diverse datasets contain images that range from congested city driving scenarios to rural scenes and highways. They also contain scenes in challenging perceptual conditions including snow, motion blur and other seasonal visual changes. We briefly describe the characteristics of these datasets in this section.

Cityscapes: The Cityscapes dataset (Cordts et al., 2016) consists of urban street scenes and focuses on semantic understanding of common driving scenarios. It is one of the most challenging datasets for panoptic segmentation due to its sheer diversity, as it covers scenes from over 50 European cities recorded over several seasons such as spring, summer and fall. The presence of a large number of dynamic objects further adds to its complexity. Figure 5 (a) shows an



Figure 5 Example images from the challenging urban scene understanding datasets that we benchmark on, namely Cityscapes, KITTI, Mapillary Vistas and the Indian Driving Dataset (IDD). The images show cluttered urban scenes with many dynamic objects, occluded objects, perpetual snowy conditions and unstructured environments.

example image and the corresponding panoptic groundtruth annotation from the Cityscapes dataset. As we see from this example, the scenes are extremely cluttered with many dynamic objects such as pedestrians and cyclists that are often grouped near one another or partially occluded. These factors make panoptic segmentation, especially segmenting the 'thing' classes, exceedingly challenging.

The widely used Cityscapes dataset recently introduced a benchmark for the task of panoptic segmentation. The dataset contains pixel-level annotations for 19 object classes, of which 11 are 'stuff' classes and 8 are instance-specific 'thing' classes. It consists of 5000 finely annotated images and 20000 coarsely annotated images that were captured at a resolution of 2048×1024 pixels using an automotive-grade 22 cm baseline stereo camera. The finely annotated images are divided into 2975 for training, 500 for validation and 1525 for testing. The annotations for the test set are not publicly released; rather, they are only available to the online evaluation server that automatically computes the metrics and publishes the results. We report the performance of our proposed EfficientPS on both the validation set as well as the test set. We also use the Cityscapes dataset for evaluating the improvement due to the various architectural contributions that we make in the ablation study. We report results on the validation set for our model trained only on the fine annotations, and we report the results on the test set from the benchmarking server for our model trained on both the fine and coarse annotations.

KITTI: The KITTI vision benchmark suite (Geiger et al., 2013) is one of the most comprehensive datasets that provides groundtruth for a variety of tasks such as semantic segmentation, scene flow estimation, optical flow estimation, depth prediction, odometry estimation, tracking and road lane detection. However, it still has not expanded its annotations to support the recently introduced panoptic segmentation task. The challenging nature of the KITTI scenes and its potential for benchmarking multi-task learning problems make extending this dataset to include panoptic annotations of great interest to the community. Therefore, in this work we introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI vision benchmark suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set; therefore, in addition to panoptic segmentation, they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually.

Our dataset consists of a total of 1055 images, out of which 855 are used for the training set and 200 are used for the validation set. We provide annotations for 11 'stuff' classes and 8 'thing' classes, adhering to the Cityscapes 'stuff' and 'thing' class distribution. In order to create the panoptic annotations, we gathered semantic annotations from community driven extensions of KITTI (Xu et al., 2016; Ros et al., 2015) and combined them with the 200 training images from the KITTI semantic training set. We then manually annotated all the images with instance masks, followed by merging them with the semantic annotations to generate the panoptic segmentation groundtruth labels. The images in our KITTI panoptic segmentation dataset are at a resolution of 1280×384 pixels and contain scenes from both residential and inner city scenarios. Figure 5 (b) shows an example image from the KITTI panoptic segmentation dataset and its corresponding panoptic segmentation labels. We observe that the car denoted in teal color pixels and the van are both partially occluded by other 'stuff' classes such that an object instance is split into two disjoint components. We find that scenarios such as these are extremely challenging for the task of panoptic segmentation, as the disjoint object mask has to be assigned to the same instance ID. We hope that this dataset encourages innovative solutions to such real-world problems that are uncommon in other datasets, and also accelerates research in multi-task learning for urban scene understanding.

Mapillary Vistas: Mapillary Vistas (Neuhold et al., 2017) is one of the largest publicly available street-level imagery datasets that contains pixel-accurate and instance-specific semantic annotations. The novel aspects of this dataset include diverse scenes from over six continents and in a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18000 images for training, 2000 images for validation and 5000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that due to the snowy conditions, recognizing distant objects such as the car in this example becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset: The Indian Driving Dataset (IDD) (Varma et al., 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructure such as lanes and sidewalks. It has significantly more 'thing' instances in each scene compared to other datasets and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car, and the data was gathered in two Indian cities as well as in their outskirts. IDD consists of a total of 10003 images, where 6993 are used for training, 981 for validation and 2029 for testing. The images are at a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3, therefore we report the results of our model on the same to facilitate comparison. This level comprises a total of 26 classes, out of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset is shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels. We take crops from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky et al., 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio, 2010) for the other layers, zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set T_H = 0.7, T_L = 0.3 and T_N = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of c_t = 0.5, an overlap threshold of o_t = 0.5 and a minimum stuff area of min_sa = 2048.

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone, and continue training until convergence. We denote the base learning rate lr_base, the milestones and the total number of iterations t_i for each dataset in the following format: {lr_base, milestone, milestone, t_i}. The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are {0.07, 32K, 44K, 50K}, {0.07, 144K, 176K, 192K}, {0.07, 16K, 22K, 25K} and {0.07, 108K, 130K, 144K} respectively. At the beginning of the training, we have a warm-up phase where the learning rate is increased linearly from (1/3)·lr_base to lr_base in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^-4. The final loss L_total that we optimize is computed as

L_total = L_semantic + L_instance,    (17)

where L_semantic and L_instance are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA Titan X GPUs, where each GPU processes a single image.
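The warm-up and multi-step schedule described above corresponds to the following sketch, shown with the Cityscapes values; the helper is illustrative and does not include the final fine-tuning phase with frozen iABN sync layers.

```python
def learning_rate(step, lr_base=0.07, milestones=(32000, 44000), warmup_steps=200):
    """Multi-step schedule with linear warm-up: the rate grows linearly from
    lr_base/3 to lr_base over the first 200 iterations and is divided by 10
    at every milestone afterwards."""
    if step < warmup_steps:
        return lr_base / 3 + (lr_base - lr_base / 3) * step / warmup_steps
    drops = sum(step >= m for m in milestones)
    return lr_base * (0.1 ** drops)
```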

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts, while for KITTI and IDD we report results for the models that we trained using the official implementations that have been publicly released by the authors, after further tuning of the hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that at the time of submission, only the Cityscapes benchmark has the provision to evaluate the results


Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. − denotes that the metric has not been reported for the corresponding method. All values are in (%).

Single-scale evaluation:

Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
WeaklySupervised | − | 47.3 | − | − | 39.6 | − | − | 52.9 | − | − | 24.3 | 71.6
TASCNet | − | 55.9 | − | − | 50.5 | − | − | 59.8 | − | − | − | −
Panoptic FPN | − | 58.1 | − | − | 52.0 | − | − | 62.5 | − | − | 33.0 | 75.7
AUNet | − | 59.0 | − | − | 54.8 | − | − | 62.1 | − | − | 34.4 | 75.6
UPSNet | − | 59.3 | 79.7 | 73.0 | 54.6 | 79.3 | 68.7 | 62.7 | 80.1 | 76.2 | 33.3 | 75.2
DeeperLab | − | 56.3 | − | − | − | − | − | − | − | − | − | −
Seamless | − | 60.3 | − | − | 56.1 | − | − | 63.3 | − | − | 33.6 | 77.5
SSAP | − | 61.1 | − | − | 55.0 | − | − | − | − | − | − | −
AdaptIS | − | 62.0 | − | − | 58.7 | − | − | 64.4 | − | − | 36.3 | 79.2
Panoptic-DeepLab | − | 63.0 | − | − | − | − | − | − | − | − | 35.3 | 80.5
EfficientPS (ours) | − | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3
TASCNet | COCO | 59.3 | − | − | 56 | − | − | 61.5 | − | − | 37.6 | 78.1
UPSNet | COCO | 60.5 | 80.9 | 73.5 | 57.0 | − | − | 63.0 | − | − | 37.8 | 77.8
Seamless | Vistas | 65.0 | − | − | 60.7 | − | − | 68.0 | − | − | − | 80.7
Panoptic-Deeplab | Vistas | 65.3 | − | − | − | − | − | − | − | − | 38.8 | 82.5
EfficientPS (ours) | Vistas | 66.1 | 82.5 | 78.9 | 62.7 | 81.9 | 75.2 | 68.5 | 82.9 | 81.6 | 41.9 | 81.0

Multi-scale evaluation:

Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic-DeepLab | − | 64.1 | − | − | − | − | − | − | − | − | 38.5 | 81.5
EfficientPS (ours) | − | 65.1 | 82.2 | 79.0 | 61.5 | 81.4 | 75.4 | 67.7 | 82.8 | 81.7 | 39.7 | 80.3
TASCNet | COCO | 60.4 | − | − | 56.1 | − | − | 63.3 | − | − | 39.1 | 78.7
M-RCNN + PSPNet | COCO | 61.2 | 80.9 | 74.4 | 54.0 | − | − | 66.4 | − | − | 36.4 | 80.9
UPSNet | COCO | 61.8 | 81.3 | 74.8 | 57.6 | 77.7 | 70.5 | 64.8 | 81.4 | | 39.0 | 79.2
Panoptic-Deeplab | Vistas | 67.0 | − | − | − | − | − | − | − | − | 42.5 | 83.1
EfficientPS (ours) | Vistas | 67.5 | 83.2 | 80.2 | 63.5 | 82.2 | 77.2 | 70.4 | 83.9 | 82.4 | 43.8 | 82.1

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Network | Pre-training | PQ | SQ | RQ | PQTh | PQSt
SSAP | − | 58.9 | 82.4 | 70.6 | 48.4 | 66.5
TASCNet | COCO | 60.7 | 81.0 | 73.8 | 53.4 | 66.0
Panoptic-Deeplab | − | 62.3 | 82.4 | 74.8 | 52.1 | 69.7
Seamless | Vistas | 62.6 | 82.1 | 75.3 | 56.0 | 67.5
Panoptic-Deeplab | Vistas | 65.5 | 77.8 | 75.2 | 56.5 | 72.0
EfficientPS (ours) | − | 64.1 | 82.6 | 76.8 | 56.7 | 69.4
EfficientPS (ours) | Vistas | 66.4 | 83.5 | 78.8 | 59.3 | 71.5

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures.

Network | Input Size (pixels) | Params (M) | FLOPs (B) | Time (ms)
DeeperLab | 1025×2049 | − | − | 463
UPSNet | 1024×2048 | 45.05 | 487.02 | 202
Seamless | 1024×2048 | 51.43 | 514.00 | 168
Panoptic-Deeplab | 1025×2049 | 46.73 | 547.49 | 175
EfficientPS (ours) | 1024×2048 | 40.89 | 433.94 | 166

on the test set. On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75, 2) during the multi-scale evaluations.
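A minimal sketch of such a multi-scale evaluation for the semantic logits is given below, assuming a model that returns per-pixel logits for a (1, 3, H, W) tensor; fusing instance predictions across scales requires additional box and mask merging that is not shown here.

```python
import torch
import torch.nn.functional as F

def multi_scale_semantic_logits(model, image, scales=(0.75, 1, 1.25, 1.5, 1.75, 2)):
    """Average semantic logits over horizontally flipped and rescaled copies
    of the input image, then resize back to the original resolution."""
    _, _, h, w = image.shape
    acc = 0
    for s in scales:
        for flip in (False, True):
            x = F.interpolate(image, scale_factor=s, mode='bilinear',
                              align_corners=False)
            if flip:
                x = torch.flip(x, dims=[3])
            logits = model(x)
            if flip:
                logits = torch.flip(logits, dims=[3])
            logits = F.interpolate(logits, size=(h, w), mode='bilinear',
                                   align_corners=False)
            acc = acc + logits
    return acc / (2 * len(scales))
```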

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al., 2018b), TASCNet (Li et al., 2018a), Panoptic FPN (Kirillov et al., 2019), AUNet (Li et al., 2019b), UPSNet (Xiong et al., 2019), DeeperLab (Yang et al., 2019), Seamless (Porzi et al., 2019), SSAP (Gao et al., 2019), AdaptIS (Sofiiuk et al., 2019) and Panoptic-DeepLab (Cheng et al., 2019). Table 1 shows the results on the Cityscapes validation set. For a fair comparison, we categorize models in the table separately according to those that report


Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. − denotes that the metric has not been reported for the corresponding method. All values are in (%).

Single-scale evaluation:

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | −
DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3
TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | −
AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | −
Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4
Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3
EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6

Multi-scale evaluation:

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | −
Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8
EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1

single-scale and multi-scale evaluation, as well as those without any pre-training and those pre-trained on other datasets, namely Mapillary Vistas (Neuhold et al., 2017), denoted as Vistas, and Microsoft COCO (Lin et al., 2014), abbreviated as COCO. We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations or depth data, nor do we exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-Deeplab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-Deeplab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-Deeplab in both single-scale and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-Deeplab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-Deeplab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQSt, PQTh, SQ and RQ metrics and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS, without pre-training on any extra data, achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-Deeplab trained only using Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-Deeplab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime including the time consumed for fusing the semantic and instance predictions that yield the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as the presence of 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Single-scale evaluation:

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1
UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6
Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8
EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3

Multi-scale evaluation:

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8
UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2
Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1
EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Single-scale evaluation:

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1
UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4
Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6
EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3

Multi-scale evaluation:

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1
UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6
Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3
EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1

the-art Panoptic-DeepLab. Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-Deeplab in the PQSt score, primarily due to the output stride of 16 at which it operates, which increases the computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQSt of Panoptic-Deeplab, while on the other hand, top-down approaches tend to have a better instance segmentation ability, as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4.0% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQSt score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '-' refers to the standard configuration as Kirillov et al. (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Model | Encoder | 2-way FPN | SIH | SH | PFM | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
M1 | ResNet-50 | - | - | - | - | 58.2 | 79.1 | 72.0 | 52.4 | 78.8 | 67.2 | 62.4 | 79.4 | 75.6 | 31.6 | 74.3
M2 | ResNet-50 | - | - | - | X | 58.8 | 79.5 | 72.6 | 53.4 | 79.2 | 62.8 | 67.9 | 79.7 | 76.1 | 33.8 | 75.1
M3 | ResNet-50 | - | X | - | X | 58.6 | 79.4 | 72.4 | 53.1 | 79.1 | 67.5 | 62.6 | 79.6 | 75.9 | 33.7 | 75.0
M4 | EfficientNet-B5 | - | X | - | X | 59.7 | 79.9 | 73.3 | 54.7 | 76.6 | 68.1 | 63.3 | 80.3 | 79.5 | 34.1 | 76.3
M5 | EfficientNet-B5 | X | X | - | X | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
M6 | EfficientNet-B5 | X | X | X | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of the various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and compare it with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the same network configuration and panoptic fusion heuristics as Kirillov et al. (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network is comprised of an upsampling stage, which has a 3×3 convolution, group norm (Wu and He, 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are at 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score, without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of the semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M with a drop of 0.2% in PQ and a drop of 0.1% in AP and the mIoU score. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and Leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables a bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively, enabling it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, the gain cannot be solely attributed to the semantic head. This is due to the fact that if we employ the standard panoptic fusion heuristics, an improvement in semantic


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Encoder | Params (M) | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2
ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1
ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2
Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1
ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9
Mod. EfficientNet-B5 (Ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Architecture | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3
Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2
PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8
2-way FPN (Ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

segmentation would only contribute to an increase in the PQSt score. However, our proposed adaptive panoptic fusion yields an improvement in PQTh as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al., 2019), ResNet-50 (He et al., 2016), ResNet-101 (He et al., 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al., 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have a comparable or smaller number of parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al., 2017), the bottom-up FPN, and the PANet FPN variants. We refer to the FPN architecture described in Liu et al. (2018) as PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. These models achieve a performance that is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the os pyramid scale level, c^k_f refers to a convolution layer with f filters and a k×k kernel size. LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
M61 | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | - | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2
M62 | LSFE | LSFE | LSFE | LSFE | - | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4
M63 | DPC | LSFE | LSFE | LSFE | - | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1
M64 | DPC | DPC | LSFE | LSFE | - | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6
M65 | DPC | DPC | DPC | LSFE | - | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2
M66 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

the standard FPN in a sequential or parallel manner respectively. We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small- and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and Leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output at 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output at the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model; therefore, we employ separable convolutions in all the experiments that follow.

In the M63 model, we replace the LSFE module in the P32 level of the 2-way FPN with dense prediction cells (DPC), described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module in the P16 level with DPC, and in the subsequent M65 model we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections, as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with the fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, resulting in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al. (2019), which we denote as the baseline, UPSNet (Xiong et al., 2019) and Seamless (Porzi et al., 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments, while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Semantic Head | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1
Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Model | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt
Baseline | 62.4 | 80.8 | 75.4 | 58.7 | 80.4 | 72.6 | 65.1 | 81.1 | 77.4
TASCNet | 62.5 | 80.9 | 75.6 | 58.6 | 80.5 | 72.8 | 65.3 | 81.2 | 77.7
UPSNet | 63.1 | 81.3 | 76.1 | 59.5 | 80.6 | 73.2 | 65.7 | 81.8 | 78.2
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as a baseline as they are extensively used in several panoptic segmentation networks. We then compare with the Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in (Xiong et al., 2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and the instance head, which have an inherent overlap, is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between the predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them, and it therefore relies on the mode computed over the segmentation output to resolve the conflict between the 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model, Panoptic-DeepLab, does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding datasets.

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on different benchmark datasets. Rows (a)-(b) show Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI and (g)-(h) IDD examples; the columns show the input image, the Seamless output, the EfficientPS output and the improvement/error map. The improvement/error map denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


For each example, we show the input image and the corresponding panoptic segmentation output from the Seamless model and our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS model made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN that effectively aggregates multi-scale features to learn semantically richer representations, and to the panoptic fusion module that addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c), the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. This causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately resolve this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are better defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head that correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearance of these two objects; however, our proposed EfficientPS model with our novel panoptic backbone has an extensive representational capacity, which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of the panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset, which exhibit complex road scenes consisting of a large number of traffic participants.

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. Rows (a)-(b) show Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI and (g)-(h) IDD, with four examples (i)-(iv) per row. These examples show complex urban scenarios with numerous object instances at multiple scales and with partial occlusion. The scenes also show diverse lighting conditions from dawn to dusk as well as seasonal changes.


These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of the day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from viewpoints that are uncommon compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analysis and visualizations that highlight the improvements that we make to various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335
Badrinarayanan V, Kendall A, Cipolla R (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229
Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons
Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision, Springer, pp 44–57
Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848
Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems, pp 8713–8724
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, pp 534–549
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the International Conference on Computer Vision, pp 764–773
Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) SSAP: Single-shot instance segmentation with affinity pyramid. In: Proceedings of the International Conference on Computer Vision, pp 642–651
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The KITTI dataset. International Journal of Robotics Research
de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110
Girshick R (2015) Fast R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 1440–1448
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256
Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, pp 297–312
Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 2961–2969
He X, Gould S (2014a) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
He X, Gould S (2014b) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for MobileNetV3. In: Proceedings of the International Conference on Computer Vision, pp 1314–1324
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059
Kang BR, Kim HY (2018) BshapeNet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327
Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413
Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In: Proceedings of the International Conference on Computer Vision, pp 2190–2197
Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in Neural Information Processing Systems, pp 109–117
Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192
Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118
Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229
Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: European Conference on Computer Vision, Springer, pp 740–755
Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181
Liu S, Jia J, Fidler S, Urtasun R (2017) SGN: Sequential grouping networks for instance segmentation. In: Proceedings of the International Conference on Computer Vision, pp 3496–3504
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768
Liu W, Rabinovich A, Berg AC (2015) ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In: Proceedings of the International Conference on Computer Vision, pp 4990–4999
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035
Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998
Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824
Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286
Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887
Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664
Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329
Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238
Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of DNNs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sofiiuk K, Barinova O, Konushin A (2019) AdaptIS: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363
Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference
Tan M, Le QV (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135
Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection, and recognition. International Journal of Computer Vision 63(2):113–140
Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25
Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots
Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics
Valada A, Vertens J, Dhall A, Burgard W (2017) AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651
Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)
Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, special issue: Deep Learning for Robotic Vision
Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751
Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) UPSNet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826
Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349
Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) DeeperLab: Single-shot image parser. arXiv preprint arXiv:1902.05093
Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721
Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890
Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227


monocular image to tackle this task Approaches from thepast decade typically employ random decision forests to ad-dress this task Shotton et al (2008) use randomized decisionforests on local patches for classification whereas Plath et al(2009) fuse local and global features along with ConditionalRandom Fields(CRFs) for segmentation As opposed to lever-aging appearance-based features Brostow et al (2008) usecues from motion with random forests Sturgess et al (2009)further combine appearance-based features with structure-from-motion features in addition to CRFs to improve theperformance However 3D features extracted from densedepth maps (Zhang et al 2010) have been demonstrated to bemore effective than the combined features Kontschieder et al(2011) exploit the inherent topological distribution of objectclasses to improve the performance whereas Krahenbuhland Koltun (2011) improve segmentation by pairing CRFswith Gaussian edge potentials Nevertheless all these meth-ods employ handcrafted features that do not encapsulate allthe high-level and low-level relations thereby limiting theirrepresentational ability

The significant improvement in performance of classific-ation tasks brought about by Convolutional Neural Network(CNN) based approaches motivated researchers to exploresuch methods for semantic segmentation Initially these ap-proaches relied on patch-wise training that severely limitedtheir ability to accurately segment object boundaries How-ever they still perform substantially better than previoushandcrafted methods The advent of end-to-end learning ap-proaches for semantic segmentation lead by the introduc-tion of Fully Convolutional Networks (FCNs) (Long et al2015) revolutionized this field and FCNs still form the baseupon which state-of-the-art architecture are built upon todayFCN is an encoded-decoder architecture where the encoderis based on the VGG-16 (Simonyan and Zisserman 2014)architecture with inner-product layers replaced with convolu-tions and the decoder consists of convolution and transposedconvolution layers The subsequently proposed SegNet (Bad-rinarayanan et al 2015) architecture introduced unpoolinglayers for upsampling as a replacement for transposed con-volutions whereas ParseNet (Liu et al 2015) models globalcontext directly as opposed to only relying on the largestreceptive field of the network

The PSPNet (Zhao et al 2017) architecture emphasizeson the importance of multi-scale features and propose pyr-amid pooling to learn feature representations at differentscales Yu and Koltun (2015) introduce atrous convolutionsto further exploit multi-scale features in semantic segment-ation networks Subsequently Valada et al (2017) proposemulti-scale residual units with parallel atrous convolutionswith different dilation rates to efficiently learn multiscale fea-tures throughout the network without increasing the numberof parameters Chen et al (2017b) propose the Atrous SpatialPyramid Pooling (ASPP) module that concatenates feature

maps from multiple parallel atrous convolutions with differ-ent dilation rates and a global pooling layer ASPP substan-tially improves the performance of semantic segmentationnetworks by aggregating multi-scale features and capturinglong-range context however it significantly increases thecomputational complexity Therefore Chen et al (2018a) pro-pose Dense Prediction Cells (DPC) and Valada et al (2019)propose Efficient Atrous Spatial Pyramid Pooling (eASPP)that yield better semantic segmentation performance thanASPP while being 10-times more efficient Li et al (2019a)suggest that global feature aggregation often leads to largepattern features and also over-smooth regions of small pat-terns which results in sub-optimal performance In order toalleviate this problem the authors propose the use of a globalaggregation module coupled with a local distribution mod-ule which results in features that are balanced in small andlarge pattern regions There are also several works that havebeen proposed to improve the upsampling in decoders ofencoder-decoder architectures In (Chen et al 2018b) theauthors introduce a novel decoder module for object bound-ary refinement Tian et al (2019) propose data-dependentupsampling which accounts for the redundancy in the labelspace as opposed to simple bilinear upsampling

Instance Segmentation Some of the initial approaches em-ploy CRFs (He and Gould 2014b) and minimize integerquadratic relations (Tighe et al 2014) Methods that exploitCNNs with Markov random fields (Zhang et al 2016) andrecurrent neural networks (Romera-Paredes and Torr 2016Ren and Zemel 2017) have also been explored In this sec-tion we primarily discuss CNN-based approaches for in-stance segmentation These methods can be categorized intoproposal free and proposal based methods

Methods in the proposal free category often obtain in-stance masks from a resulting transformation Bai and Ur-tasun (2017) uses CNNs to produce an energy map of theimage and then perform a cut at a single energy level toobtain the corresponding object instances Liu et al (2017)employ a sequence of CNNs to solve sub-grouping problemsin order to compose object instances Some approaches ex-ploit FCNs which either use local coherence for estimatinginstances (Dai et al 2016) or encode the direction of eachpixel to its corresponding instance centre (Uhrig et al 2016)The recent approach SSAP (Gao et al 2019) uses pixel-pair affinity pyramids for computing the probability that twopixels hierarchically belong to the same instance Howeverthey achieve a lower than proposal based methods which hasled to a decline in their popularity

In proposal based methods Hariharan et al (2014) pro-pose a method that uses Multiscale Combinatorial Group-ing (Arbelaez et al 2014) proposals as input to CNNs forfeature extraction and then employ an SVM classifier forregion classification Subsequently Hariharan et al (2015)propose hypercolumn pixel descriptors for simultaneous de-

EfficientPS Efficient Panoptic Segmentation 5

tection and segmentation In recent works DeepMask (Pin-heiro et al 2015) uses a patch of an image as input to a CNNwhich yields a class-agnostic segmentation mask and thelikelihood of the patch containing an object FCIS (Li et al2017) employs position-sensitive score maps obtained fromclassification of pixels based on their relative positions toperform segmentation and detection jointly Dai et al (2016)propose an approach for instance segmentation that uses threenetworks for distinguishing instances estimating masks andcategorizing objects Mask R-CNN (He et al 2017) is one ofthe most popular and widely used approaches in the presenttime It extends Faster R-CNN for instance segmentation byadding an object segmentation branch parallel to an branchthat performs bounding box regression and classificationMore recently Liu et al (2018) propose an approach to im-prove Mask R-CNN by adding bottom-up path augmentationthat enhances object localization ability in earlier layers of thenetwork Subsequently BshapeNet (Kang and Kim 2018) ex-tends Faster R-CNN by adding a bounding box mask branchthat provides additional information of object positions andcoordinates to improve the performance of object detectionand instance segmentation

Panoptic Segmentation Kirillov et al (2019) revived theunification of semantic segmentation and instance segmenta-tion tasks by introducing panoptic segmentation They pro-pose a baseline model that combines the output of PSPNet(Zhao et al 2017) and Mask R-CNN (He et al 2017) with asimple post-processing step in which each model processesthe inputs independently The methods that address this taskof panoptic segmentation can be broadly classified into twocategories top-down or proposal based methods and bottom-up or proposal free methods Most of the current state-of-the-art methods adopt the top-down approach de Geus et al(2018) propose joint training with a shared backbone thatbranches into Mask R-CNN for instance segmentation andaugmented Pyramid Pooling module for semantic segment-ation Subsequently Li et al (2019b) introduce Attention-guided Unified Network that uses proposal attention moduleand mask attention module for better segmentation of lsquostuffrsquoclasses All the aforementioned methods use a similar fusiontechnique to Kirillov et al (2019) for the fusion of lsquostuffrsquo andlsquothingrsquo predictions

In top-down panoptic segmentation architectures pre-dictions of both heads have an inherent overlap betweenthem resulting in the mask overlapping problem In orderto mitigate this problem Li et al (2018b) propose a weaklysupervised model where lsquothingrsquo classes are weakly super-vised by bounding boxes and lsquostuffrsquo classes are supervisedwith image-level tags Whereas Liu et al (2019) address theproblem by introducing the spatial ranking module and Liet al (2018a) propose a method that learns a binary maskto constrain output distributions of lsquostuffrsquo and lsquothingrsquo expli-citly Subsequently UPSNet (Xiong et al 2019) introduces

a parameter-free panoptic head to address the problem ofoverlapping of instances and also predicts an extra unknownclass More recently AdaptIS (Sofiiuk et al 2019) uses pointproposals to produce instance masks and jointly trains with astandard semantic segmentation pipeline to perform panopticsegmentation In contrast Porzi et al (2019) propose an archi-tecture for panoptic segmentation that effectively integratescontextual information from a lightweight DeepLab-inspiredmodule with multi-scale features from a FPN

Compared to the popular proposal based methods thereare only a handful of proposal free methods that have beenproposed Deeper-Lab (Yang et al 2019) was the first bottom-up approach that was introduced and it employs an encoder-decoder topology to pair object centres for class-agnosticinstance segmentation with DeepLab semantic segmentationCheng et al (2019) further builds on Deeper-Lab by intro-ducing a dual-ASPP and dual-decoder structure for eachsub-task branch SSAP (Gao et al 2019) proposes to grouppixels based on a pixel-pair affinity pyramid and incorporatean efficient graph method to generate instances while jointlylearning semantic labeling

In this work we adopt a top-down approach due to itsexceptional ability to handle large scale variation of instanceswhich is a critical requirement for segmenting lsquothingrsquo classesWe present the novel EfficientPS architecture that incorpor-ates our proposed efficient backbone with our 2-way FPN forlearning rich multi-scale features in a bidirectional mannercoupled with a new semantic head that captures fine-featuresand long-range context effectively and a variant of MaskR-CNN augmented with separable convolutions as the in-stance head We propose a novel panoptic fusion moduleto dynamically adapt the fusion of logits from the semanticand instance heads to yield the panoptic segmentation outputOur architecture achieves state-of-the-art results on bench-mark datasets while being the most efficient and fast panopticsegmentation architecture

3 EfficientPS Architecture

In this section we first give a brief overview of our proposedEfficientPS architecture and then detail each of its constitut-ing components Our network follows the top-down layoutas shown in Figure 2 It consists of a shared backbone witha 2-way Feature Pyramid Network (FPN) followed by task-specific semantic segmentation and instance segmentationheads We build upon the EfficientNet (Tan and Le 2019)architecture for the encoder of our shared backbone (depictedin red) It consists of mobile inverted bottleneck (Xie et al2017) units and employs compound scaling to uniformlyscale all the dimensions of the encoder network This en-ables our encoder to have a rich representational capacitywith fewer parameters in comparison to other encoders orbackbones of similar discriminative capability

6 Rohit Mohan Abhinav Valada

256x14x14 256x28x28

NIx28x28

256X14x14

256x256x512

256x256x512

256x256x512

256x128x256

256x128x256 256x

64x128

256x64x128

256x64x128

256x128x256

48x512x1024

24x512x1024

40x256x512

64x128x256

128x64x128

176x64x128 304x32x64 512x32x64

2048x32x64

11

P32

P16

P8

P4

128x256x512

128x128x256

128x64x128

128x32x64

512x256x512NCx

1024x2048

P32

13

11

P16

13

13

P8

13

P4

11

11

11

11

11

11

13

13

13

13

22 1

1

11

256x32x64

256x32x64

256x32x64

RPN

ROI Align

MaskLogits

BBox

Class

LSFE

LSFE

DPC

DPC

Concat

SemanticLogits

PanopticFusion

(P4 P8 P16 P32)

(P4 P8 P16 P32)

Instance Segmentation Head

Semantic Segmentation Head

MC

MC

nxn Convolutionstride s

ns

Leaky ReLUBilinear upsamplingnxn Transpose convolutionstride s

ns

iABN Sync Fully connectednxn seperable convolutionstride s

ns

Output1024x2048

Input3x1024x2048

Figure 2 Illustration of our proposed EfficientPS architecture consisting of a shared backbone with our 2-way FPN and parallel semantic andinstance segmentation heads followed by our panoptic fusion module The shared backbone is built upon on the EfficientNet architecture and ournew 2-way FPN that enables bidirectional flow of information The instance segmentation head is based on a modified Mask R-CNN topology andwe incorporate our proposed semantic segmentation head Finally the outputs of both heads are fused in our panoptic fusion module to yield thepanoptic segmentation output

As opposed to employing the conventional FPN (Lin et al2017) that is commonly used in other panoptic segmentationarchitectures (Kirillov et al 2019 Li et al 2018a Porzi et al2019) we incorporate our proposed 2-way FPN that fusesmulti-scale features more effectively than its counterpartsThis can be attributed to the fact that the information flowin our 2-way FPN is not bounded to only one direction asdepicted by the purple blue and green blocks in Figure 2Subsequently after the 2-way FPN we employ two headsin parallel for semantic segmentation (depicted in yellow)and instance segmentation (depicted in gray and orange)respectively We use a variant of the Mask R-CNN (He et al2017) architecture as the instance head and we incorporateour novel semantic segmentation head consisting of denseprediction cells (Chen et al 2018a) and residual pyramidsThe semantic head consists of three different modules forcapturing fine features long-range contextual features andcorrelating the distinctly captured features for improvingobject boundary refinement Finally we employ our proposedpanoptic fusion module to fuse the outputs of the semanticand instance heads to yield the panoptic segmentation output

31 Network Backbone

The backbone of our network consists of an encoder withour proposed 2-way FPN The encoder is the basic buildingblock of any segmentation network and a strong encoder isessential to have high representational capacity In this workwe seek to find a good trade-off between the number of para-meters and computational complexity to the representationalcapacity of the network EfficientNets (Tan and Le 2019)which are a recent family of architectures have been shownto significantly outperform other networks in classificationtasks while having fewer parameters and FLOPs It employscompound scaling to uniformly scale the width depth andresolution of the network efficiently Therefore we chooseto build upon this scaled architecture with 16 22 and 456coefficients commonly known as the EfficientNet-B5 modelThis can be easily replaced with any of the EfficientNet mod-els based on the capacity of the resources that are availableand the computational budget

In order to adapt EfficientNet to our task we first removethe classification head as well as the Squeeze-and-Excitation(SE) (Hu et al 2018) connections in the network We find that

EfficientPS Efficient Panoptic Segmentation 7

the explicit modelling of interdependencies between chan-nels of the convolutional feature maps that are enabled bythe SE connections tend to suppress localization of featuresin favour of contextual elements This property is a desiredin classification networks however both are equally import-ant for segmentation tasks therefore we do not add any SEconnections in our backbone Second we replace all thebatch normalization (Ioffe and Szegedy 2015) layers withsynchronized Inplace Activated Batch Normalization (iABNsync) (Rota Bulo et al 2018) This enables synchronizationacross different GPUs which in turn yields a better estimateof gradients while performing multi-GPU training and thein-place operations frees up additional GPU memory Weanalyze the performance of our modified EfficientNet in com-parison to other encoders commonly used in state-of-the-artarchitectures in the ablation study presented in Section 442

Our EfficientNet encoder comprises of nine blocks asshown in Figure 2 (in red) We refer to each block in thefigure as block 1 to block 9 in the left to right manner Theoutput of block 2 3 5 and 9 corresponds to downsamplingfactors times4times8times16 and times32 respectively The outputs fromthese blocks with downsampling are also inputs to our 2-wayFPN The conventional FPN used in other panoptic segment-ation networks aims to address the problem of multi-scalefeature fusion by aggregating features of different resolutionsin a top-down manner This is performed by first employinga 1times1 convolution to reduce or increase the number of chan-nels of different encoder output resolutions to a predefinednumber typically 256 Then the lower resolution featuresare upsampled to a higher resolution and are subsequentlyadded together For example times32 resolution encoder outputfeatures will be resized to the times16 resolution and added tothe times16 resolution encoder output features Finally a 3times3convolution is used at each scale to further learn fused fea-tures which then yields the P4 P8 P16 and P32 outputs ThisFPN topology has a limited unidirectional flow of informa-tion resulting in an ineffective fusion of multi-scale featuresTherefore we propose to mitigate this problem by addinga second branch that aggregates multi-scale features in abottom-up fashion to enable bidirectional flow of informa-tion

Our proposed 2-way FPN shown in Figure 2 consistsof two parallel branches Each branch consists of a 1times 1convolution with 256 output filters at each scale for chan-nel reduction The top-down branch shown in blue followsthe aggregation scheme of a conventional FPN from right toleft Whereas the bottom-up branch shown in purple down-samples the higher resolution features to the next lower res-olution from left to right and subsequently adds them withthe next lower resolution encoder output features For ex-ample times4 resolution features will be resized to the times8 resol-ution and added to the times8 resolution encoder output featuresThen in the next stage the outputs from the bottom-up and

128xHxW

128xHxW

13

13

256xHxW

256xHxW

kxHxW

13 1

1

11

4kxHxW

256xHxW

128xHxW

128xHxW

13

13

256xHxW

(a) LSFE Module (b) RPN Module (c) MC Module

256xHxW

256xHxW

256xHxW

256xHxW

256xHxW

Concat

1280xHxW

11

256xHxW

128xHxW

113

6213

18153

633

163

(d) DPC Module

nxn convolutionstride s

ns

Leaky ReLUnxn seperable convolutionstride s

ns

nxn seperable convolutiondilation rate

nrhrw iABN Sync

Figure 3 Topologies of various architectural components in our pro-posed semantic head and instance head of our EfficientPS architecture

top-down branches at each resolution are correspondinglysummed together and passed through a 3times3 separable convo-lution with 256 output channels to obtain the P4 P8 P16 andP32 outputs respectively We employ separable convolutionsas opposed to standard convolutions in an effort to keep theparameter consumption low We evaluate the performance ofour proposed 2-way FPN in comparison to the conventionalFPN in the ablation study presented in Section 443

32 Semantic Segmentation Head

Our proposed semantic segmentation head consists of threecomponents each aimed at targeting one of the critical re-quirements First at large-scale the network should have theability to capture fine features efficiently In order to enablethis we employ our Large Scale Feature Extractor (LSFE)module that has two 3times3 separable convolutions with 128output filters each followed by an iABN sync and a LeakyReLU activation function The first 3times3 separable convolu-tion reduces the number of filters to 128 and the second 3times3separable convolution further learns deeper features

The second requirement is that at small-scale the net-work should be able to capture long-range context Modulesinspired by Atrous Spatial Pyramid Pooling (ASPP) Chenet al (2017a) that are widely used in state-of-the-art semanticsegmentation architectures have been demonstrated to be ef-fective for this purpose Dense Prediction Cells (DPC) (Chenet al 2018a) and Efficient Atrous Spatial Pyramid Pooling

8 Rohit Mohan Abhinav Valada

(eASPP) (Valada et al 2019) are two variants of ASPP thatare significantly more efficient and also yield a better perform-ance We find that DPC demonstrates a better performancewith a minor increase in the number of parameters comparedto eASPP Therefore we employ a modified DPC module inour semantic head as shown in Figure 2 We augment the ori-ginal DPC topology by replacing batch normalization layerswith iABN sync and ReLUs with Leaky ReLUs The DPCmodule consists of a 3times3 separable convolution with 256output channels having a dilation rate of (16) and extendsout to five parallel branches Three of the branches eachconsist of a 3times 3 dilated separable convolution with 256outputs where the dilation rates are (11) (621) and (1815)respectively The fourth branch takes the output of the dilatedseparable convolution with a dilation rate of (1815) as inputand passes it through another 3times3 dilated separable convolu-tion with 256 output channels and a dilation rate of (63) Theoutputs from all these parallel branches are then concaten-ated to yield a tensor with 1280 channels This tensor is thenfinally passed through a 1times1 convolution with 256 outputchannels and forms the output of the DPC module Note thateach of the convolutions in the DPC module is followed by aiABN sync and a Leaky ReLU activation function

The third and final requirement for the semantic head isthat it should be able to mitigate the mismatch between large-scale and small-scale features while performing feature ag-gregation To this end we employ our Mismatch CorrectionModule (MC) that correlates the small-scale features withrespect to large-scale features It consists of cascaded 3times3separable convolutions with 128 output channels followedby iABN sync with Leaky ReLU and a bilinear upsamplinglayer that upsamples the feature maps by a factor of 2 Fig-ures 3 (a) 3 (c) and 3 (d) illustrate the topologies of thesemain components of our semantic head

The four different scaled outputs of our 2-way FPNnamely P4 P8 P16 and P32 are the inputs to our semantichead The small-scale inputs P32 and P16 with downsamplingfactors of times32 and times16 are each fed into two parallel DPCmodules While the large-scale inputs P8 and P4 with down-sampling factors of times8 and times4 are each passed through twoparallel LSFE modules Subsequently the outputs from eachof these parallel DPC and LSFE modules are augmentedwith feature alignment connections and each of them is up-sampled by a factor of 4 These upsampled feature maps arethen concatenated to yield a tensor with 512 channels whichis then input to a 1times1 convolution with Nlsquostuffrsquo+lsquothingrsquo outputfilters This tensor is then finally upsampled by a factor of2 and passed through a softmax layer to yield the semanticlogits having the same resolution as the input image Nowthe feature alignment connections from the DPC and LSFEmodules interconnect each of these outputs by element-wisesummation as shown in Figure 2 We add our MC modulesin the interconnections between the second DPC and LSFE

as well as between both the LSFE connections These cor-relation connections aggregate contextual information fromsmall-scale features and characteristic large-scale featuresfor better object boundary refinement We use the weightedper-pixel log-loss (Bulo et al 2017) for training which isgiven by

Lpp(Θ) =minussumi j

wi j log pi j(plowasti j) (1)

where plowasti j is the groundtruth for a given image pi j is thepredicted probability for the pixel (i j) being assigned classc isin p wi j =

4WH if pixel (i j) belongs to 25 of the worst

prediction and wi j = 0 otherwise W and H are the widthand height of the given input image The overall semantichead loss is given by

Lsemantic(Θ) =1n sumLpp (2)

where n is the batch size We present in-depth analysis ofour proposed semantic head in comparison other semanticheads commonly used in state-of-the-art architectures in Sec-tion 444

33 Instance Segmentation Head

The instance segmentation head of our EfficientPS networkshown in Figure 2 has a topology similar to Mask R-CNN (Heet al 2017) with certain modifications More specifically wereplace all the standard convolutions batch normalizationlayers and ReLU activations with separable convolutioniABN sync and Leaky ReLU respectively Similar to the restof our architecture we use separable convolutions instead ofstandard convolutions to reduce the number of parametersconsumed by the network This enables us to conserve 209Mparameters in comparison to the conventional Mask R-CNN

Mask R-CNN consists of two stages In the first stage theRegion Proposal Network (RPN) module shown in Figure3 (b) employs a fully convolutional network to output a set ofrectangular object proposals and an objectness score for thegiven input FPN level The network architecture incorporatesa 3times3 separable convolution with 256 output channels aniABN sync and Leaky ReLU activation followed by twoparallel 1times1 convolutions with 4k and k number of outputfilters respectively Here k is the maximum number of objectproposals The k proposals are parameterized relative to kreference bounding boxes which are called anchors Thescale and aspect ratio define a particular anchor and it var-ies depending on the dataset being used As the generatedproposals can potentially overlap an additional filtering stepof Non-Max Suppression (NMS) is employed Essentiallyin case of overlaps the proposal with the highest objectnessscore is retained and the rest are discarded RPN first com-putes object proposals of the form (pu pv pw ph) and the

EfficientPS Efficient Panoptic Segmentation 9

objectness score σ(ps) that transforms an anchor a of theform (uvwh) to (u+ puv+ pv) and (wepw heph) where(uv) is the position of anchor a in image coordinate system(wh) are the dimensions and σ(middot) is the sigmoid functionThis is followed by NMS to yield the final candidate bound-ing box proposals to be used in the next stage

Subsequently ROI align (He et al 2017) uses objectproposals to extract features from FPN encodings by directlypooling features from the nth channel with a 14times14 spatialresolution bounded within a bounding box proposal Here

n = max(1min(4b3+ log2 (radic

wh224)c)) (3)

where w and h are the width and height of the bounding boxproposal The features that are extracted then serve as input tothe bounding box regression object classification and masksegmentation networks The RPN network operates on eachof the output scales of the FPN sequentially thereby accumu-lating candidates from different scales and ROI align uses theaccumulated list for feature extraction from all the scales Thebounding box regression and object classification networksconsist of two shared consecutive fully-connected layers with1024 channels iABN sync and Leaky ReLU Subsequentlythe feature maps are passed through a fully-connected layerat the end for each task with 4Nlsquothingrsquo outputs and Nlsquothingrsquo+1outputs respectively The object classification logits yield aprobability distribution over class and void from a softmaxlayer whereas the bounding box regression network encodesclass-specific correction factors For a given object proposalof the form (ucvcwchc) for a class c a new bounding boxis computed of the form (u+ cuv+ cvwecw hech) where(cucvcwch) are the class-specific correction factors (uv)and (wh) are the position and dimensions of the boundingbox proposal respectively

The mask segmentation network employs four consec-utive 3times3 separable convolution with 256 output channelsfollowed by iABN sync and Leaky ReLU Subsequently theresulting feature maps are passed through a 2times2 transposedconvolution with 256 output channels and an output strideof 2 followed by iABN sync and Leaky ReLU activationfunction Finally a 1times 1 convolution with Nlsquothingrsquo outputchannels is employed that yields 28times 28 logits for eachclass The resulting logits give mask foreground probabilitiesfor the candidate bounding box proposal using a sigmoidfunction This is then fused with the semantic logits in ourproposed panoptic fusion module described in Section 34

In order to train the instance segmentation head, we adopt the loss functions proposed in Mask R-CNN, i.e. two loss functions for the first stage, the objectness score loss and the object proposal loss, and three loss functions for the second stage, the classification loss, the bounding box loss and the mask segmentation loss. We take a set of randomly sampled positive matches and negative matches such that |N_s| ≤ 256. The objectness score loss L_os, defined as the log loss for a given N_s, is given by

L_{os}(\Theta) = -\frac{1}{|N_s|} \sum_{(p^*_{os}, p_{os}) \in N_s} \left[ p^*_{os} \cdot \log p_{os} + (1 - p^*_{os}) \cdot \log(1 - p_{os}) \right],   (4)

where p_os is the output of the objectness score branch of the RPN and p*_os is the groundtruth label, which is 1 if the anchor is positive and 0 if the anchor is negative. We use the same strategy as Mask R-CNN for defining positive and negative matches. For a given anchor a, if the groundtruth box b* has the largest Intersection over Union (IoU) or IoU(b*, a) > T_H, then the corresponding prediction b is a positive match, and b is a negative match when IoU(b*, a) < T_L. The thresholds T_H and T_L are pre-defined, where T_H > T_L.

The object proposal loss L_op is a regression loss that is defined only on positive matches and is given by

L_{op}(\Theta) = \frac{1}{|N_s|} \sum_{(t^*, t) \in N_p} \sum_{(i^*, i) \in (t^*, t)} L_1(i^*, i),   (5)

where L_1 is the smooth L1 norm, N_p is the subset of N_s positive matches, t* = (t*_x, t*_y, t*_w, t*_h) and t = (t_x, t_y, t_w, t_h) are the parameterizations of b* and b respectively, b* = (x*, y*, w*, h*) is the groundtruth box and b = (x, y, w, h) is the predicted bounding box. x, y, w and h are the center coordinates, width and height of the predicted bounding box. Similarly, x*, y*, w* and h* denote the center coordinates, width and height of the groundtruth bounding box. The parameterizations (Girshick 2015) are given by

t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a},   (6)

t^*_x = \frac{x^* - x_a}{w_a}, \quad t^*_y = \frac{y^* - y_a}{h_a}, \quad t^*_w = \log\frac{w^*}{w_a}, \quad t^*_h = \log\frac{h^*}{h_a},   (7)

where x_a, y_a, w_a and h_a denote the center coordinates, width and height of the anchor a.
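The parameterizations of Equation (6) and (7) amount to encoding a box relative to its anchor. A minimal sketch (the function name is hypothetical; boxes are given in center-coordinate form):

import math

def encode_box(box, anchor):
    """Eq. (6)/(7): encode a box (x, y, w, h) relative to an anchor
    (x_a, y_a, w_a, h_a); both given as center coordinates plus size."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa,          # t_x
            (y - ya) / ha,          # t_y
            math.log(w / wa),       # t_w
            math.log(h / ha))       # t_h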

Similar to the objectness score loss L_os, the classification loss L_cls is defined for a set K_s of randomly sampled positive and negative matches such that |K_s| ≤ 512. The classification loss L_cls is given by

L_{cls}(\Theta) = -\frac{1}{|K_s|} \sum_{c=1}^{N_{thing}+1} Y^*_{oc} \cdot \log Y_{oc}, \quad \text{for } (Y^*, Y) \in K_s,   (8)

where Y is the output of the classification branch, Y* is the one-hot encoded groundtruth label, o is the observed class and c is the correct classification for object o. For a given image, it is a positive match if IoU(b*, b) > T_N and otherwise a negative match, where b* is the groundtruth box and b is the object proposal from the first stage.


Figure 4 Illustration of our proposed Panoptic Fusion Module. Here, the MLA and MLB mask logits are fused as (σ(MLA) + σ(MLB)) ⊙ (MLA + MLB), where MLB is the output of the function f*, σ(·) is the sigmoid function and ⊙ is the Hadamard product. Here, the f* function for a given class prediction c (cyclist in this example) zeroes out the score of the c channel of the semantic logits outside the corresponding bounding box.

The bounding box loss L_bbx is a regression loss that is defined only on positive matches and is expressed as

L_{bbx}(\Theta) = \frac{1}{|K_s|} \sum_{(T^*, T) \in K_p} \sum_{(i^*, i) \in (T^*, T)} L_1(i^*, i),   (9)

where L_1 is the smooth L1 norm (Girshick 2015), K_p is the subset of K_s positive matches, and T* and T are the parameterizations of B* and B respectively, similar to Equation (4) and (5), where B* is the groundtruth box and B is the corresponding predicted bounding box.

Finally, the mask segmentation loss is also defined only for positive samples and is given by

L_{mask}(\Theta) = -\frac{1}{|K_s|} \sum_{(P^*, P) \in K_s} L_p(P^*, P),   (10)

where L_p(P*, P) is given as

L_p(P^*, P) = -\frac{1}{|T_p|} \sum_{(i,j) \in T_p} \left[ P^*_{ij} \cdot \log P_{ij} + (1 - P^*_{ij}) \cdot \log(1 - P_{ij}) \right],   (11)

where P is the predicted 28×28 binary mask for a class, with P_ij denoting the probability of the mask pixel (i, j), P* is the 28×28 groundtruth binary mask for the class, and T_p is the set of non-void pixels in P*.
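Equation (10) and (11) reduce to a masked binary cross-entropy averaged over the sampled matches. The following sketch assumes the 28×28 logits and groundtruth masks are stacked into tensors, with a boolean validity mask marking the non-void pixels T_p; the function and argument names are illustrative.

import torch
import torch.nn.functional as F

def mask_loss(pred_logits, gt_masks, valid):
    """Sketch of Eq. (10)/(11): binary cross-entropy over the non-void
    pixels T_p of each 28x28 groundtruth mask, averaged over the sampled
    matches. pred_logits, gt_masks: float tensors of shape (K, 28, 28);
    valid: boolean tensor of the same shape marking non-void pixels."""
    valid = valid.float()
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt_masks, reduction='none')
    per_instance = (bce * valid).sum(dim=(1, 2)) / valid.sum(dim=(1, 2)).clamp(min=1)
    return per_instance.mean()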

All five losses are weighed equally and the total instance segmentation head loss is given by

L_{instance} = L_{os} + L_{op} + L_{cls} + L_{bbx} + L_{mask}.   (12)

Similar to Mask R-CNN, the gradients computed with respect to the losses L_cls, L_bbx and L_mask flow only through the network backbone and not through the region proposal network.

3.4 Panoptic Fusion Module

In order to obtain the panoptic segmentation output, we need to fuse the predictions of the semantic segmentation head and the instance segmentation head. However, fusing both these predictions is not a straightforward task due to the inherent overlap between them. Therefore, we propose a novel panoptic fusion module to tackle the aforementioned problem in an adaptive manner, in order to thoroughly exploit the predictions from both heads congruously. Figure 4 shows the topology of our panoptic fusion module. We obtain a set of object instances from the instance segmentation head of our network, where for each instance we have its corresponding class prediction, confidence score, bounding box and mask logits. First, we reduce the number of predicted object instances in two stages. We begin by discarding all object instances that have a confidence score of less than a certain confidence threshold. We then resize, zero pad and scale the 28×28 mask logits of each object instance to the same resolution as the input image. Subsequently, we sort the class prediction, bounding box and mask logits according to the respective confidence scores. In the second stage, we check each sorted instance mask logit for overlap with other object instances. In case the overlap is higher than a given overlap threshold, we discard the other object instance.

After filtering the object instances, we have the class prediction, bounding box prediction and mask logit MLA of each instance. We simultaneously obtain semantic logits with N channels from the semantic head, where N is the sum of N_stuff and N_thing. We then compute a second mask logit MLB for each instance, where we select the channel of the semantic logits based on its class prediction. We only keep the logit scores of the selected channel for the area within the instance bounding box, while we zero out the scores that are outside this region. In the end, we have two mask logits for each instance: one from the instance segmentation head and the other from the semantic segmentation head. We combine these two logits adaptively by computing the Hadamard product of the sum of the sigmoid of MLA and the sigmoid of MLB, and the sum of MLA and MLB, to obtain the fused mask logits FL of the instances, expressed as

FL = (\sigma(ML_A) + \sigma(ML_B)) \odot (ML_A + ML_B),   (13)

where σ(·) is the sigmoid function and ⊙ is the Hadamard product. We then concatenate the fused mask logits of the object instances with the 'stuff' logits along the channel dimension to generate the intermediate panoptic logits. Subsequently, we apply the argmax operation along the channel dimension to obtain the intermediate panoptic prediction. In the final step, we take a zero-filled canvas and first copy the instance-specific 'thing' predictions from the intermediate panoptic prediction. We then fill the empty parts of the canvas with 'stuff' class predictions by copying them from the predictions of the semantic head, while ignoring classes that have an area smaller than a predefined threshold called the minimum stuff area. This then gives us the final panoptic segmentation output.
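The core of this fusion can be sketched in a few tensor operations. The snippet below is a simplified illustration, not the full module: instance filtering, resizing and the final canvas-filling step are omitted, and the tensor shapes and function name are assumptions.

import torch

def fuse_and_predict(ml_a, ml_b, stuff_logits):
    """ml_a: (N_inst, H, W) instance-head mask logits resized to image size;
    ml_b: (N_inst, H, W) semantic-head logits of each instance's class,
    zeroed outside its bounding box; stuff_logits: (N_stuff, H, W)."""
    fused = (torch.sigmoid(ml_a) + torch.sigmoid(ml_b)) * (ml_a + ml_b)   # Eq. (13)
    intermediate = torch.cat([stuff_logits, fused], dim=0)   # intermediate panoptic logits
    return intermediate.argmax(dim=0)   # per-pixel 'stuff' class or instance index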

We fuse the MLA and MLB instance logits in the aforementioned manner due to the fact that if both logits for a given pixel conform with each other, the final instance score increases proportionately to their agreement, and vice-versa. In case of agreement, the corresponding object instance will either dominate or be superseded by other instances as well as the 'stuff' class scores. Similarly, in case of disagreement, the score of the given object instance will reflect the extent of their difference. Simply put, the fused logit score is either adaptively attenuated or amplified according to the consensus. We evaluate the performance of our proposed panoptic fusion module in comparison to other existing methods in the ablation study presented in Section 4.4.5.

4 Experimental Results

In this section, we first describe the standard evaluation metrics that we adopt for empirical evaluations, followed by brief descriptions of the datasets that we benchmark on in Section 4.1. We then present extensive quantitative comparisons and benchmarking results in Section 4.3, and detailed ablation studies on the various proposed architectural components in Section 4.4. Finally, we present qualitative comparisons and visualizations of panoptic segmentation on each of the datasets that we evaluate on in Section 4.5 and Section 4.6 respectively.

We use PyTorch (Paszke et al 2019) for implementing all our architectures and we trained our models on a system with an Intel Xeon 2.20 GHz processor and NVIDIA TITAN X GPUs. We use the standard Panoptic Quality (PQ) metric (Kirillov et al 2019) for quantifying the performance of our models. The PQ metric is computed as

PQ = \frac{\sum_{(p,g) \in TP} \text{IoU}(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|},   (14)

where TP, FP, FN and IoU are the true positives, false positives, false negatives and the intersection-over-union. The IoU is computed as IoU = TP/(TP + FP + FN). We also report the Segmentation Quality (SQ) and Recognition Quality (RQ) metrics, computed as

SQ = \frac{\sum_{(p,g) \in TP} \text{IoU}(p,g)}{|TP|},   (15)

RQ = \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}.   (16)

Following the standard benchmarking criteria for panoptic segmentation, we report PQ, SQ and RQ over all the classes in the dataset, and we also report them for the 'stuff' classes (PQSt, SQSt, RQSt) and the 'thing' classes (PQTh, SQTh, RQTh). Additionally, for the sake of completeness, we report the Average Precision (AP) and the mean Intersection-over-Union (mIoU) for both 'stuff' and 'thing' classes, as well as the inference time and FLOPs for comparisons. The implementation of our proposed EfficientPS model and a live demo on various datasets is publicly available at https://rl.uni-freiburg.de/research/panoptic.
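Given the IoUs of the matched true-positive segment pairs and the number of unmatched predictions and groundtruth segments, the three metrics of Equation (14)-(16) can be computed as follows (a minimal sketch; the matching step itself is assumed to be done elsewhere):

def panoptic_quality(tp_ious, num_fp, num_fn):
    """Compute (PQ, SQ, RQ) from the IoUs of matched (prediction,
    groundtruth) pairs and the false positive / false negative counts."""
    tp = len(tp_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / tp                              # Eq. (15)
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)        # Eq. (16)
    return sq * rq, sq, rq                              # PQ = SQ * RQ, Eq. (14)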

4.1 Datasets

We benchmark our proposed EfficientPS for panoptic segmentation on four challenging urban scene understanding datasets, namely Cityscapes (Cordts et al 2016), KITTI (Geiger et al 2013), Mapillary Vistas (Neuhold et al 2017) and the Indian Driving Dataset (Varma et al 2019). The KITTI benchmark does not provide panoptic annotations, therefore to facilitate this work we publicly release manually annotated panoptic groundtruth segmentation labels for the popular KITTI benchmark. These four diverse datasets contain images that range from congested city driving scenarios to rural scenes and highways. They also contain scenes in challenging perceptual conditions including snow, motion blur and other seasonal visual changes. We briefly describe the characteristics of these datasets in this section.

Cityscapes: The Cityscapes dataset (Cordts et al 2016) consists of urban street scenes and focuses on semantic understanding of common driving scenarios. It is one of the most challenging datasets for panoptic segmentation due to its sheer diversity, as it covers scenes from over 50 European cities recorded over several seasons such as spring, summer and fall. The presence of a large number of dynamic objects further adds to its complexity. Figure 5 (a) shows an example image and the corresponding panoptic groundtruth annotation from the Cityscapes dataset.


Figure 5 Example images from the challenging urban scene understanding datasets that we benchmark on, namely Cityscapes, KITTI, Mapillary Vistas and the Indian Driving Dataset (IDD). The images show cluttered urban scenes with many dynamic objects, occluded objects, perpetual snowy conditions and unstructured environments.

As we see from this example, the scenes are extremely cluttered with many dynamic objects such as pedestrians and cyclists that are often grouped near one another or partially occluded. These factors make panoptic segmentation, especially segmenting the 'thing' classes, exceedingly challenging.

The widely used Cityscapes dataset recently introduced a benchmark for the task of panoptic segmentation. The dataset contains pixel-level annotations for 19 object classes, of which 11 are 'stuff' classes and 8 are instance-specific 'thing' classes. It consists of 5000 finely annotated images and 20000 coarsely annotated images that were captured at a resolution of 2048×1024 pixels using an automotive-grade 22 cm baseline stereo camera. The finely annotated images are divided into 2975 for training, 500 for validation and 1525 for testing. The annotations for the test set are not publicly released; rather, they are only available to the online evaluation server that automatically computes the metrics and publishes the results. We report the performance of our proposed EfficientPS on both the validation set as well as the test set. We also use the Cityscapes dataset for evaluating the improvement due to the various architectural contributions that we make in the ablation study. We report results on the validation set for our model trained only on the fine annotations, and we report the results on the test set from the benchmarking server for our model trained on both the fine and coarse annotations.

KITTI: The KITTI vision benchmark suite (Geiger et al 2013) is one of the most comprehensive datasets that provides groundtruth for a variety of tasks such as semantic segmentation, scene flow estimation, optical flow estimation, depth prediction, odometry estimation, tracking and road lane detection. However, it still has not expanded its annotations to support the recently introduced panoptic segmentation task. The challenging nature of the KITTI scenes and its potential for benchmarking multi-task learning problems makes extending this dataset to include panoptic annotations of great interest to the community. Therefore, in this work we introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI vision benchmark suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set, therefore in addition to panoptic segmentation, they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually.

Our dataset consists of a total of 1055 images, out of which 855 are used for the training set and 200 are used for the validation set. We provide annotations for 11 'stuff' classes and 8 'thing' classes, adhering to the Cityscapes 'stuff' and 'thing' class distribution. In order to create the panoptic annotations, we gathered semantic annotations from community driven extensions of KITTI (Xu et al 2016; Ros et al 2015) and combined them with the 200 training images from the KITTI semantic training set. We then manually annotated all the images with instance masks, followed by merging them with the semantic annotations to generate the panoptic segmentation groundtruth labels. The images in our KITTI panoptic segmentation dataset have a resolution of 1280×384 pixels and contain scenes from both residential and inner city scenarios. Figure 5 (b) shows an example image from the KITTI panoptic segmentation dataset and its corresponding panoptic segmentation labels. We observe that the car denoted in teal color pixels and the van are both partially occluded by other 'stuff' classes, such that the object instance is split into two disjoint components. We find that scenarios such as these are extremely challenging for the task of panoptic segmentation, as the disjoint object mask has to be assigned to the same instance ID. We hope that this dataset encourages innovative solutions to such real-world problems that are uncommon in other datasets and also accelerates research in multi-task learning for urban scene understanding.

Mapillary Vistas: Mapillary Vistas (Neuhold et al 2017) is one of the largest publicly available street-level imagery datasets that contains pixel-accurate and instance-specific semantic annotations. The novel aspects of this dataset include diverse scenes from over six continents and in a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18000 images for training, 2000 images for validation and 5000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that due to the snowy condition, recognizing distant objects such as the car in this example becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset: The Indian Driving Dataset (IDD) (Varma et al 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructure such as lanes and sidewalks. It has a significantly larger number of 'thing' instances in each scene compared to other datasets and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car and the data was gathered in two Indian cities as well as in their outskirts. IDD consists of a total of 10003 images, where 6993 are used for training, 981 for validation and 2029 for testing. The images have a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3, therefore we report the results of our model on the same to facilitate comparison. This level comprises a total of 26 classes, out of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset is shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels. We take crops from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky et al 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio 2010) for the other layers, zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set T_H = 0.7, T_L = 0.3 and T_N = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of c_t = 0.5, an overlap threshold of o_t = 0.5 and a minimum stuff area of min_sa = 2048.

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e. we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone, and continue training until convergence. We denote the base learning rate lr_base, the milestones and the total number of iterations t_i for each dataset in the following format: {lr_base, milestone, milestone, t_i}. The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are {0.07, 32K, 44K, 50K}, {0.07, 144K, 176K, 192K}, {0.07, 16K, 22K, 25K} and {0.07, 108K, 130K, 144K} respectively. At the beginning of the training, we have a warm-up phase where the learning rate is increased linearly from (1/3)·lr_base to lr_base in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^-4. The final loss L_total that we optimize is computed as

L_{total} = L_{semantic} + L_{instance},   (17)

where L_semantic and L_instance are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA Titan X GPUs, where each GPU processes a single image.
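The multi-step schedule with warm-up described above can be expressed as a simple function of the iteration index. The sketch below uses the Cityscapes values as defaults and shows how it could be hooked into a PyTorch optimizer; the function name and the LambdaLR hook are illustrative, not the authors' implementation.

def lr_at_iteration(it, lr_base=0.07, milestones=(32000, 44000), warmup_iters=200):
    """Multi-step schedule with linear warm-up: the learning rate rises from
    lr_base/3 to lr_base over the first 200 iterations and is divided by 10
    at each milestone."""
    if it < warmup_iters:
        alpha = it / warmup_iters
        return (lr_base / 3) * (1 - alpha) + lr_base * alpha
    return lr_base * (0.1 ** sum(it >= m for m in milestones))

# Possible hook into an SGD optimizer (model assumed to be defined elsewhere):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.07, momentum=0.9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda it: lr_at_iteration(it) / 0.07)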

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts, while for KITTI and IDD we report results for the models that we trained using the official implementations that have been publicly released by the authors, after further tuning of hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that at the time of submission, only the Cityscapes benchmark has the provision to evaluate the results on the test set.


Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. − denotes that the metric has not been reported for the corresponding method. All metric values are in (%).

Single-scale evaluation:
Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
WeaklySupervised | − | 47.3 | − | − | 39.6 | − | − | 52.9 | − | − | 24.3 | 71.6
TASCNet | − | 55.9 | − | − | 50.5 | − | − | 59.8 | − | − | − | −
Panoptic FPN | − | 58.1 | − | − | 52.0 | − | − | 62.5 | − | − | 33.0 | 75.7
AUNet | − | 59.0 | − | − | 54.8 | − | − | 62.1 | − | − | 34.4 | 75.6
UPSNet | − | 59.3 | 79.7 | 73.0 | 54.6 | 79.3 | 68.7 | 62.7 | 80.1 | 76.2 | 33.3 | 75.2
DeeperLab | − | 56.3 | − | − | − | − | − | − | − | − | − | −
Seamless | − | 60.3 | − | − | 56.1 | − | − | 63.3 | − | − | 33.6 | 77.5
SSAP | − | 61.1 | − | − | 55.0 | − | − | − | − | − | − | −
AdaptIS | − | 62.0 | − | − | 58.7 | − | − | 64.4 | − | − | 36.3 | 79.2
Panoptic-DeepLab | − | 63.0 | − | − | − | − | − | − | − | − | 35.3 | 80.5
EfficientPS (ours) | − | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3
TASCNet | COCO | 59.3 | − | − | 56 | − | − | 61.5 | − | − | 37.6 | 78.1
UPSNet | COCO | 60.5 | 80.9 | 73.5 | 57.0 | − | − | 63.0 | − | − | 37.8 | 77.8
Seamless | Vistas | 65.0 | − | − | 60.7 | − | − | 68.0 | − | − | − | 80.7
Panoptic-DeepLab | Vistas | 65.3 | − | − | − | − | − | − | − | − | 38.8 | 82.5
EfficientPS (ours) | Vistas | 66.1 | 82.5 | 78.9 | 62.7 | 81.9 | 75.2 | 68.5 | 82.9 | 81.6 | 41.9 | 81.0

Multi-scale evaluation:
Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic-DeepLab | − | 64.1 | − | − | − | − | − | − | − | − | 38.5 | 81.5
EfficientPS (ours) | − | 65.1 | 82.2 | 79.0 | 61.5 | 81.4 | 75.4 | 67.7 | 82.8 | 81.7 | 39.7 | 80.3
TASCNet | COCO | 60.4 | − | − | 56.1 | − | − | 63.3 | − | − | 39.1 | 78.7
M-RCNN + PSPNet | COCO | 61.2 | 80.9 | 74.4 | 54.0 | − | − | 66.4 | − | − | 36.4 | 80.9
UPSNet | COCO | 61.8 | 81.3 | 74.8 | 57.6 | 77.7 | 70.5 | 64.8 | 81.4 | − | 39.0 | 79.2
Panoptic-DeepLab | Vistas | 67.0 | − | − | − | − | − | − | − | − | 42.5 | 83.1
EfficientPS (ours) | Vistas | 67.5 | 83.2 | 80.2 | 63.5 | 82.2 | 77.2 | 70.4 | 83.9 | 82.4 | 43.8 | 82.1

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Network | Pre-training | PQ | SQ | RQ | PQTh | PQSt
SSAP | − | 58.9 | 82.4 | 70.6 | 48.4 | 66.5
TASCNet | COCO | 60.7 | 81.0 | 73.8 | 53.4 | 66.0
Panoptic-DeepLab | − | 62.3 | 82.4 | 74.8 | 52.1 | 69.7
Seamless | Vistas | 62.6 | 82.1 | 75.3 | 56.0 | 67.5
Panoptic-DeepLab | Vistas | 65.5 | 77.8 | 75.2 | 56.5 | 72.0
EfficientPS (ours) | − | 64.1 | 82.6 | 76.8 | 56.7 | 69.4
EfficientPS (ours) | Vistas | 66.4 | 83.5 | 78.8 | 59.3 | 71.5

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures.

Network | Input Size (pixels) | Params (M) | FLOPs (B) | Time (ms)
DeeperLab | 1025×2049 | − | − | 463
UPSNet | 1024×2048 | 45.05 | 487.02 | 202
Seamless | 1024×2048 | 51.43 | 514.00 | 168
Panoptic-DeepLab | 1025×2049 | 46.73 | 547.49 | 175
EfficientPS (ours) | 1024×2048 | 40.89 | 433.94 | 166

On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75, 2) during the multi-scale evaluations.
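A straightforward way to realize this evaluation protocol is to average the semantic logits over all scales and their horizontally flipped counterparts, as in the sketch below (assuming model returns per-pixel class logits for a (1, 3, H, W) tensor; the function name is illustrative):

import torch
import torch.nn.functional as F

def multiscale_semantic_logits(model, image, scales=(0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """Average semantic logits over multiple scales and horizontal flips,
    resampled back to the input resolution."""
    h, w = image.shape[-2:]
    accumulated = 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[-1]) if flip else scaled
            logits = model(inp)
            if flip:
                logits = torch.flip(logits, dims=[-1])
            accumulated = accumulated + F.interpolate(
                logits, size=(h, w), mode='bilinear', align_corners=False)
    return accumulated / (2 * len(scales))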

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al 2018b), TASCNet (Li et al 2018a), Panoptic FPN (Kirillov et al 2019), AUNet (Li et al 2019b), UPSNet (Xiong et al 2019), DeeperLab (Yang et al 2019), Seamless (Porzi et al 2019), SSAP (Gao et al 2019), AdaptIS (Sofiiuk et al 2019) and Panoptic-DeepLab (Cheng et al 2019). Table 1 shows the results on the Cityscapes validation set. For a fair comparison, we categorize models in the table separately according to those that report single-scale and multi-scale evaluation, as well as those without any pre-training and those pre-trained on other datasets, namely Mapillary Vistas (Neuhold et al 2017), denoted as Vistas, and Microsoft COCO (Lin et al 2014), abbreviated as COCO.


Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. − denotes that the metric has not been reported for the corresponding method. All values are in (%).

Single-scale evaluation:
Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | −
DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3
TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | −
AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | −
Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4
Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3
EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6

Multi-scale evaluation:
TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | −
Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8
EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1

We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations, depth data, or exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale evaluation and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQSt, PQTh, SQ and RQ metrics and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS without pre-training on any extra data achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime including the time consumed for fusing the semantic and instance predictions that yields the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as the presence of 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab. For multi-scale evaluation, it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-the-art Panoptic-DeepLab.


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Single-scale evaluation:
Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1
UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6
Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8
EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3

Multi-scale evaluation:
Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8
UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2
Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1
EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Single-scale evaluation:
Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1
UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4
Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6
EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3

Multi-scale evaluation:
Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1
UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6
Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3
EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1

Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQSt score, primarily due to the output stride of 16 at which it operates, which increases the computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQSt of Panoptic-DeepLab. On the other hand, top-down approaches tend to have better instance segmentation ability as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQSt score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '−' refers to the standard configuration as in Kirillov et al (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Model | Encoder | 2-way FPN | SIH | SH | PFM | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
M1 | ResNet-50 | − | − | − | − | 58.2 | 79.1 | 72.0 | 52.4 | 78.8 | 67.2 | 62.4 | 79.4 | 75.6 | 31.6 | 74.3
M2 | ResNet-50 | − | − | − | X | 58.8 | 79.5 | 72.6 | 53.4 | 79.2 | 62.8 | 67.9 | 79.7 | 76.1 | 33.8 | 75.1
M3 | ResNet-50 | − | X | − | X | 58.6 | 79.4 | 72.4 | 53.1 | 79.1 | 67.5 | 62.6 | 79.6 | 75.9 | 33.7 | 75.0
M4 | EfficientNet-B5 | − | X | − | X | 59.7 | 79.9 | 73.3 | 54.7 | 76.6 | 68.1 | 63.3 | 80.3 | 79.5 | 34.1 | 76.3
M5 | EfficientNet-B5 | X | X | − | X | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
M6 | EfficientNet-B5 | X | X | X | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of the various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and its comparison with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined at the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the same network configuration and panoptic fusion heuristics as Kirillov et al (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network comprises an upsampling stage which has a 3×3 convolution, group norm (Wu and He 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M, with a drop of 0.2% in PQ and a 0.1% drop in AP and the mIoU score. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and Leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model that fuses and aligns multi-scale features effectively, which enables it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, it cannot be solely attributed to the semantic head. This is due to the fact that if we employ standard panoptic fusion heuristics, an improvement in semantic


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Encoder | Params (M) | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2
ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1
ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2
Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1
ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9
Mod. EfficientNet-B5 (ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Architecture | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3
Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2
PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8
2-way FPN (ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

segmentation would only contribute to an increase in the PQSt score. However, our proposed adaptive panoptic fusion yields an improvement in PQTh as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al 2019), ResNet-50 (He et al 2016), ResNet-101 (He et al 2016), Xception-71 (Chollet 2017), ResNeXt-101 (Xie et al 2017) and EfficientNet-B5 (Tan and Le 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have a comparable or fewer number of parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al 2017), bottom-up FPN and PANet FPN variants. We refer to the FPN architecture described in Liu et al (2018) as PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. These models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the ×os pyramid scale level. c^k_f refers to a convolution layer with f filters and a k×k kernel size. LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
M61 | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | − | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2
M62 | LSFE | LSFE | LSFE | LSFE | − | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4
M63 | DPC | LSFE | LSFE | LSFE | − | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1
M64 | DPC | DPC | LSFE | LSFE | − | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6
M65 | DPC | DPC | DPC | LSFE | − | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2
M66 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

the standard FPN, in a sequential or parallel manner respectively. We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they should both be captured distinctly by the semantic head. Second, while fusing small and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activations sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output which is 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output which is the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model, therefore we employ separable convolutions in all the experiments that follow.
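For reference, a minimal PyTorch sketch of the LSFE module as used from M62 onwards is given below, with BatchNorm2d standing in for iABN sync and illustrative channel defaults:

import torch.nn as nn

class LSFE(nn.Module):
    """Large Scale Feature Extractor: two cascaded 3x3 depthwise separable
    convolutions with 128 output channels, each followed by a normalization
    layer and Leaky ReLU."""
    def __init__(self, in_ch=256, out_ch=128):
        super().__init__()
        def sep_conv(ci, co):
            return nn.Sequential(
                nn.Conv2d(ci, ci, 3, padding=1, groups=ci, bias=False),
                nn.Conv2d(ci, co, 1, bias=False),
                nn.BatchNorm2d(co),               # stand-in for iABN sync
                nn.LeakyReLU(0.01, inplace=True))
        self.block = nn.Sequential(sep_conv(in_ch, out_ch), sep_conv(out_ch, out_ch))

    def forward(self, x):
        return self.block(x)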

In the M63 model, we replace the LSFE module at the P32 level of the 2-way FPN with dense prediction cells (DPC) described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module at the P16 level with DPC, and in the subsequent M65 model we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, resulting in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al (2019), which we denote as the baseline, UPSNet (Xiong et al 2019) and Seamless (Porzi et al 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Semantic Head | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1
Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Model | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt
Baseline | 62.4 | 80.8 | 75.4 | 58.7 | 80.4 | 72.6 | 65.1 | 81.1 | 77.4
TASCNet | 62.5 | 80.9 | 75.6 | 58.6 | 80.5 | 72.8 | 65.3 | 81.2 | 77.7
UPSNet | 63.1 | 81.3 | 76.1 | 59.5 | 80.6 | 73.2 | 65.7 | 81.8 | 78.2
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al (2019), which we consider as a baseline as it is extensively used in several panoptic segmentation networks. We then compare with the Mask-Guided fusion (Li et al 2018a) and the panoptic fusion heuristics proposed in (Xiong et al 2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and instance head, which have an inherent overlap, is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e. if there is an overlap between the predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them, and therefore relies on the mode computed over the segmentation output to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model Panoptic-DeepLab does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding


Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al 2019) on the different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


datasets. For each example, we show the input image and the corresponding panoptic segmentation outputs from the Seamless model and our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset, in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN that effectively aggregates multi-scale features to learn semantically richer representations, and the panoptic fusion module that addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c), the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. Interestingly, in Figure 6 (c) the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately resolve this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divides it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are better defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head that correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearance of these two objects; however, our proposed EfficientPS model with our novel panoptic backbone has an extensive representational capacity, which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of traffic participants.


[Figure 7 panels: columns (i)-(iv) per row, with rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI and (g)-(h) IDD.]

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances at multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk as well as seasonal changes.


These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from viewpoints uncommon among those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets that demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analysis and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335
Badrinarayanan V, Kendall A, Cipolla R (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229
Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons
Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision, Springer, pp 44–57
Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848
Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems, pp 8713–8724
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, pp 534–549
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the International Conference on Computer Vision, pp 764–773
Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) SSAP: Single-shot instance segmentation with affinity pyramid. In: Proceedings of the International Conference on Computer Vision, pp 642–651
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The KITTI dataset. International Journal of Robotics Research
de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110
Girshick R (2015) Fast R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 1440–1448
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256
Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, pp 297–312
Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 2961–2969
He X, Gould S (2014a) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
He X, Gould S (2014b) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for MobileNetV3. In: Proceedings of the International Conference on Computer Vision, pp 1314–1324
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059
Kang BR, Kim HY (2018) BshapeNet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327
Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413
Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In: Proceedings of the International Conference on Computer Vision, pp 2190–2197
Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in Neural Information Processing Systems, pp 109–117
Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192
Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118
Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229
Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: European Conference on Computer Vision, Springer, pp 740–755
Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181
Liu S, Jia J, Fidler S, Urtasun R (2017) SGN: Sequential grouping networks for instance segmentation. In: Proceedings of the International Conference on Computer Vision, pp 3496–3504
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768
Liu W, Rabinovich A, Berg AC (2015) ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In: Proceedings of the International Conference on Computer Vision, pp 4990–4999
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035
Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998
Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824
Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286
Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887
Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664
Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329
Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238
Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of DNNs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sofiiuk K, Barinova O, Konushin A (2019) AdaptIS: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363
Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference
Tan M, Le QV (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135
Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection, and recognition. International Journal of Computer Vision 63(2):113–140
Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25
Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots
Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics
Valada A, Vertens J, Dhall A, Burgard W (2017) AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651
Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)
Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, Special Issue: Deep Learning for Robotic Vision
Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751
Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) UPSNet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826
Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349
Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) DeeperLab: Single-shot image parser. arXiv preprint arXiv:1902.05093
Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721
Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890
Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227


EfficientPS Efficient Panoptic Segmentation 5

tection and segmentation In recent works DeepMask (Pin-heiro et al 2015) uses a patch of an image as input to a CNNwhich yields a class-agnostic segmentation mask and thelikelihood of the patch containing an object FCIS (Li et al2017) employs position-sensitive score maps obtained fromclassification of pixels based on their relative positions toperform segmentation and detection jointly Dai et al (2016)propose an approach for instance segmentation that uses threenetworks for distinguishing instances estimating masks andcategorizing objects Mask R-CNN (He et al 2017) is one ofthe most popular and widely used approaches in the presenttime It extends Faster R-CNN for instance segmentation byadding an object segmentation branch parallel to an branchthat performs bounding box regression and classificationMore recently Liu et al (2018) propose an approach to im-prove Mask R-CNN by adding bottom-up path augmentationthat enhances object localization ability in earlier layers of thenetwork Subsequently BshapeNet (Kang and Kim 2018) ex-tends Faster R-CNN by adding a bounding box mask branchthat provides additional information of object positions andcoordinates to improve the performance of object detectionand instance segmentation

Panoptic Segmentation Kirillov et al (2019) revived theunification of semantic segmentation and instance segmenta-tion tasks by introducing panoptic segmentation They pro-pose a baseline model that combines the output of PSPNet(Zhao et al 2017) and Mask R-CNN (He et al 2017) with asimple post-processing step in which each model processesthe inputs independently The methods that address this taskof panoptic segmentation can be broadly classified into twocategories top-down or proposal based methods and bottom-up or proposal free methods Most of the current state-of-the-art methods adopt the top-down approach de Geus et al(2018) propose joint training with a shared backbone thatbranches into Mask R-CNN for instance segmentation andaugmented Pyramid Pooling module for semantic segment-ation Subsequently Li et al (2019b) introduce Attention-guided Unified Network that uses proposal attention moduleand mask attention module for better segmentation of lsquostuffrsquoclasses All the aforementioned methods use a similar fusiontechnique to Kirillov et al (2019) for the fusion of lsquostuffrsquo andlsquothingrsquo predictions

In top-down panoptic segmentation architectures pre-dictions of both heads have an inherent overlap betweenthem resulting in the mask overlapping problem In orderto mitigate this problem Li et al (2018b) propose a weaklysupervised model where lsquothingrsquo classes are weakly super-vised by bounding boxes and lsquostuffrsquo classes are supervisedwith image-level tags Whereas Liu et al (2019) address theproblem by introducing the spatial ranking module and Liet al (2018a) propose a method that learns a binary maskto constrain output distributions of lsquostuffrsquo and lsquothingrsquo expli-citly Subsequently UPSNet (Xiong et al 2019) introduces

a parameter-free panoptic head to address the problem ofoverlapping of instances and also predicts an extra unknownclass More recently AdaptIS (Sofiiuk et al 2019) uses pointproposals to produce instance masks and jointly trains with astandard semantic segmentation pipeline to perform panopticsegmentation In contrast Porzi et al (2019) propose an archi-tecture for panoptic segmentation that effectively integratescontextual information from a lightweight DeepLab-inspiredmodule with multi-scale features from a FPN

Compared to the popular proposal based methods thereare only a handful of proposal free methods that have beenproposed Deeper-Lab (Yang et al 2019) was the first bottom-up approach that was introduced and it employs an encoder-decoder topology to pair object centres for class-agnosticinstance segmentation with DeepLab semantic segmentationCheng et al (2019) further builds on Deeper-Lab by intro-ducing a dual-ASPP and dual-decoder structure for eachsub-task branch SSAP (Gao et al 2019) proposes to grouppixels based on a pixel-pair affinity pyramid and incorporatean efficient graph method to generate instances while jointlylearning semantic labeling

In this work we adopt a top-down approach due to itsexceptional ability to handle large scale variation of instanceswhich is a critical requirement for segmenting lsquothingrsquo classesWe present the novel EfficientPS architecture that incorpor-ates our proposed efficient backbone with our 2-way FPN forlearning rich multi-scale features in a bidirectional mannercoupled with a new semantic head that captures fine-featuresand long-range context effectively and a variant of MaskR-CNN augmented with separable convolutions as the in-stance head We propose a novel panoptic fusion moduleto dynamically adapt the fusion of logits from the semanticand instance heads to yield the panoptic segmentation outputOur architecture achieves state-of-the-art results on bench-mark datasets while being the most efficient and fast panopticsegmentation architecture

3 EfficientPS Architecture

In this section we first give a brief overview of our proposedEfficientPS architecture and then detail each of its constitut-ing components Our network follows the top-down layoutas shown in Figure 2 It consists of a shared backbone witha 2-way Feature Pyramid Network (FPN) followed by task-specific semantic segmentation and instance segmentationheads We build upon the EfficientNet (Tan and Le 2019)architecture for the encoder of our shared backbone (depictedin red) It consists of mobile inverted bottleneck (Xie et al2017) units and employs compound scaling to uniformlyscale all the dimensions of the encoder network This en-ables our encoder to have a rich representational capacitywith fewer parameters in comparison to other encoders orbackbones of similar discriminative capability

6 Rohit Mohan Abhinav Valada

256x14x14 256x28x28

NIx28x28

256X14x14

256x256x512

256x256x512

256x256x512

256x128x256

256x128x256 256x

64x128

256x64x128

256x64x128

256x128x256

48x512x1024

24x512x1024

40x256x512

64x128x256

128x64x128

176x64x128 304x32x64 512x32x64

2048x32x64

11

P32

P16

P8

P4

128x256x512

128x128x256

128x64x128

128x32x64

512x256x512NCx

1024x2048

P32

13

11

P16

13

13

P8

13

P4

11

11

11

11

11

11

13

13

13

13

22 1

1

11

256x32x64

256x32x64

256x32x64

RPN

ROI Align

MaskLogits

BBox

Class

LSFE

LSFE

DPC

DPC

Concat

SemanticLogits

PanopticFusion

(P4 P8 P16 P32)

(P4 P8 P16 P32)

Instance Segmentation Head

Semantic Segmentation Head

MC

MC

nxn Convolutionstride s

ns

Leaky ReLUBilinear upsamplingnxn Transpose convolutionstride s

ns

iABN Sync Fully connectednxn seperable convolutionstride s

ns

Output1024x2048

Input3x1024x2048

Figure 2 Illustration of our proposed EfficientPS architecture consisting of a shared backbone with our 2-way FPN and parallel semantic andinstance segmentation heads followed by our panoptic fusion module The shared backbone is built upon on the EfficientNet architecture and ournew 2-way FPN that enables bidirectional flow of information The instance segmentation head is based on a modified Mask R-CNN topology andwe incorporate our proposed semantic segmentation head Finally the outputs of both heads are fused in our panoptic fusion module to yield thepanoptic segmentation output

As opposed to employing the conventional FPN (Lin et al2017) that is commonly used in other panoptic segmentationarchitectures (Kirillov et al 2019 Li et al 2018a Porzi et al2019) we incorporate our proposed 2-way FPN that fusesmulti-scale features more effectively than its counterpartsThis can be attributed to the fact that the information flowin our 2-way FPN is not bounded to only one direction asdepicted by the purple blue and green blocks in Figure 2Subsequently after the 2-way FPN we employ two headsin parallel for semantic segmentation (depicted in yellow)and instance segmentation (depicted in gray and orange)respectively We use a variant of the Mask R-CNN (He et al2017) architecture as the instance head and we incorporateour novel semantic segmentation head consisting of denseprediction cells (Chen et al 2018a) and residual pyramidsThe semantic head consists of three different modules forcapturing fine features long-range contextual features andcorrelating the distinctly captured features for improvingobject boundary refinement Finally we employ our proposedpanoptic fusion module to fuse the outputs of the semanticand instance heads to yield the panoptic segmentation output

31 Network Backbone

The backbone of our network consists of an encoder withour proposed 2-way FPN The encoder is the basic buildingblock of any segmentation network and a strong encoder isessential to have high representational capacity In this workwe seek to find a good trade-off between the number of para-meters and computational complexity to the representationalcapacity of the network EfficientNets (Tan and Le 2019)which are a recent family of architectures have been shownto significantly outperform other networks in classificationtasks while having fewer parameters and FLOPs It employscompound scaling to uniformly scale the width depth andresolution of the network efficiently Therefore we chooseto build upon this scaled architecture with 16 22 and 456coefficients commonly known as the EfficientNet-B5 modelThis can be easily replaced with any of the EfficientNet mod-els based on the capacity of the resources that are availableand the computational budget

In order to adapt EfficientNet to our task we first removethe classification head as well as the Squeeze-and-Excitation(SE) (Hu et al 2018) connections in the network We find that

EfficientPS Efficient Panoptic Segmentation 7

the explicit modelling of interdependencies between chan-nels of the convolutional feature maps that are enabled bythe SE connections tend to suppress localization of featuresin favour of contextual elements This property is a desiredin classification networks however both are equally import-ant for segmentation tasks therefore we do not add any SEconnections in our backbone Second we replace all thebatch normalization (Ioffe and Szegedy 2015) layers withsynchronized Inplace Activated Batch Normalization (iABNsync) (Rota Bulo et al 2018) This enables synchronizationacross different GPUs which in turn yields a better estimateof gradients while performing multi-GPU training and thein-place operations frees up additional GPU memory Weanalyze the performance of our modified EfficientNet in com-parison to other encoders commonly used in state-of-the-artarchitectures in the ablation study presented in Section 442

Our EfficientNet encoder comprises of nine blocks asshown in Figure 2 (in red) We refer to each block in thefigure as block 1 to block 9 in the left to right manner Theoutput of block 2 3 5 and 9 corresponds to downsamplingfactors times4times8times16 and times32 respectively The outputs fromthese blocks with downsampling are also inputs to our 2-wayFPN The conventional FPN used in other panoptic segment-ation networks aims to address the problem of multi-scalefeature fusion by aggregating features of different resolutionsin a top-down manner This is performed by first employinga 1times1 convolution to reduce or increase the number of chan-nels of different encoder output resolutions to a predefinednumber typically 256 Then the lower resolution featuresare upsampled to a higher resolution and are subsequentlyadded together For example times32 resolution encoder outputfeatures will be resized to the times16 resolution and added tothe times16 resolution encoder output features Finally a 3times3convolution is used at each scale to further learn fused fea-tures which then yields the P4 P8 P16 and P32 outputs ThisFPN topology has a limited unidirectional flow of informa-tion resulting in an ineffective fusion of multi-scale featuresTherefore we propose to mitigate this problem by addinga second branch that aggregates multi-scale features in abottom-up fashion to enable bidirectional flow of informa-tion

Our proposed 2-way FPN shown in Figure 2 consistsof two parallel branches Each branch consists of a 1times 1convolution with 256 output filters at each scale for chan-nel reduction The top-down branch shown in blue followsthe aggregation scheme of a conventional FPN from right toleft Whereas the bottom-up branch shown in purple down-samples the higher resolution features to the next lower res-olution from left to right and subsequently adds them withthe next lower resolution encoder output features For ex-ample times4 resolution features will be resized to the times8 resol-ution and added to the times8 resolution encoder output featuresThen in the next stage the outputs from the bottom-up and

128xHxW

128xHxW

13

13

256xHxW

256xHxW

kxHxW

13 1

1

11

4kxHxW

256xHxW

128xHxW

128xHxW

13

13

256xHxW

(a) LSFE Module (b) RPN Module (c) MC Module

256xHxW

256xHxW

256xHxW

256xHxW

256xHxW

Concat

1280xHxW

11

256xHxW

128xHxW

113

6213

18153

633

163

(d) DPC Module

nxn convolutionstride s

ns

Leaky ReLUnxn seperable convolutionstride s

ns

nxn seperable convolutiondilation rate

nrhrw iABN Sync

Figure 3 Topologies of various architectural components in our pro-posed semantic head and instance head of our EfficientPS architecture

top-down branches at each resolution are correspondinglysummed together and passed through a 3times3 separable convo-lution with 256 output channels to obtain the P4 P8 P16 andP32 outputs respectively We employ separable convolutionsas opposed to standard convolutions in an effort to keep theparameter consumption low We evaluate the performance ofour proposed 2-way FPN in comparison to the conventionalFPN in the ablation study presented in Section 443

32 Semantic Segmentation Head

Our proposed semantic segmentation head consists of threecomponents each aimed at targeting one of the critical re-quirements First at large-scale the network should have theability to capture fine features efficiently In order to enablethis we employ our Large Scale Feature Extractor (LSFE)module that has two 3times3 separable convolutions with 128output filters each followed by an iABN sync and a LeakyReLU activation function The first 3times3 separable convolu-tion reduces the number of filters to 128 and the second 3times3separable convolution further learns deeper features

The second requirement is that at small-scale the net-work should be able to capture long-range context Modulesinspired by Atrous Spatial Pyramid Pooling (ASPP) Chenet al (2017a) that are widely used in state-of-the-art semanticsegmentation architectures have been demonstrated to be ef-fective for this purpose Dense Prediction Cells (DPC) (Chenet al 2018a) and Efficient Atrous Spatial Pyramid Pooling

8 Rohit Mohan Abhinav Valada

(eASPP) (Valada et al 2019) are two variants of ASPP thatare significantly more efficient and also yield a better perform-ance We find that DPC demonstrates a better performancewith a minor increase in the number of parameters comparedto eASPP Therefore we employ a modified DPC module inour semantic head as shown in Figure 2 We augment the ori-ginal DPC topology by replacing batch normalization layerswith iABN sync and ReLUs with Leaky ReLUs The DPCmodule consists of a 3times3 separable convolution with 256output channels having a dilation rate of (16) and extendsout to five parallel branches Three of the branches eachconsist of a 3times 3 dilated separable convolution with 256outputs where the dilation rates are (11) (621) and (1815)respectively The fourth branch takes the output of the dilatedseparable convolution with a dilation rate of (1815) as inputand passes it through another 3times3 dilated separable convolu-tion with 256 output channels and a dilation rate of (63) Theoutputs from all these parallel branches are then concaten-ated to yield a tensor with 1280 channels This tensor is thenfinally passed through a 1times1 convolution with 256 outputchannels and forms the output of the DPC module Note thateach of the convolutions in the DPC module is followed by aiABN sync and a Leaky ReLU activation function

The third and final requirement for the semantic head isthat it should be able to mitigate the mismatch between large-scale and small-scale features while performing feature ag-gregation To this end we employ our Mismatch CorrectionModule (MC) that correlates the small-scale features withrespect to large-scale features It consists of cascaded 3times3separable convolutions with 128 output channels followedby iABN sync with Leaky ReLU and a bilinear upsamplinglayer that upsamples the feature maps by a factor of 2 Fig-ures 3 (a) 3 (c) and 3 (d) illustrate the topologies of thesemain components of our semantic head

The four different scaled outputs of our 2-way FPNnamely P4 P8 P16 and P32 are the inputs to our semantichead The small-scale inputs P32 and P16 with downsamplingfactors of times32 and times16 are each fed into two parallel DPCmodules While the large-scale inputs P8 and P4 with down-sampling factors of times8 and times4 are each passed through twoparallel LSFE modules Subsequently the outputs from eachof these parallel DPC and LSFE modules are augmentedwith feature alignment connections and each of them is up-sampled by a factor of 4 These upsampled feature maps arethen concatenated to yield a tensor with 512 channels whichis then input to a 1times1 convolution with Nlsquostuffrsquo+lsquothingrsquo outputfilters This tensor is then finally upsampled by a factor of2 and passed through a softmax layer to yield the semanticlogits having the same resolution as the input image Nowthe feature alignment connections from the DPC and LSFEmodules interconnect each of these outputs by element-wisesummation as shown in Figure 2 We add our MC modulesin the interconnections between the second DPC and LSFE

as well as between both the LSFE connections These cor-relation connections aggregate contextual information fromsmall-scale features and characteristic large-scale featuresfor better object boundary refinement We use the weightedper-pixel log-loss (Bulo et al 2017) for training which isgiven by

Lpp(Θ) =minussumi j

wi j log pi j(plowasti j) (1)

where plowasti j is the groundtruth for a given image pi j is thepredicted probability for the pixel (i j) being assigned classc isin p wi j =

4WH if pixel (i j) belongs to 25 of the worst

prediction and wi j = 0 otherwise W and H are the widthand height of the given input image The overall semantichead loss is given by

Lsemantic(Θ) =1n sumLpp (2)

where n is the batch size We present in-depth analysis ofour proposed semantic head in comparison other semanticheads commonly used in state-of-the-art architectures in Sec-tion 444

33 Instance Segmentation Head

The instance segmentation head of our EfficientPS networkshown in Figure 2 has a topology similar to Mask R-CNN (Heet al 2017) with certain modifications More specifically wereplace all the standard convolutions batch normalizationlayers and ReLU activations with separable convolutioniABN sync and Leaky ReLU respectively Similar to the restof our architecture we use separable convolutions instead ofstandard convolutions to reduce the number of parametersconsumed by the network This enables us to conserve 209Mparameters in comparison to the conventional Mask R-CNN

Mask R-CNN consists of two stages In the first stage theRegion Proposal Network (RPN) module shown in Figure3 (b) employs a fully convolutional network to output a set ofrectangular object proposals and an objectness score for thegiven input FPN level The network architecture incorporatesa 3times3 separable convolution with 256 output channels aniABN sync and Leaky ReLU activation followed by twoparallel 1times1 convolutions with 4k and k number of outputfilters respectively Here k is the maximum number of objectproposals The k proposals are parameterized relative to kreference bounding boxes which are called anchors Thescale and aspect ratio define a particular anchor and it var-ies depending on the dataset being used As the generatedproposals can potentially overlap an additional filtering stepof Non-Max Suppression (NMS) is employed Essentiallyin case of overlaps the proposal with the highest objectnessscore is retained and the rest are discarded RPN first com-putes object proposals of the form (pu pv pw ph) and the

EfficientPS Efficient Panoptic Segmentation 9

objectness score σ(ps) that transforms an anchor a of theform (uvwh) to (u+ puv+ pv) and (wepw heph) where(uv) is the position of anchor a in image coordinate system(wh) are the dimensions and σ(middot) is the sigmoid functionThis is followed by NMS to yield the final candidate bound-ing box proposals to be used in the next stage

Subsequently ROI align (He et al 2017) uses objectproposals to extract features from FPN encodings by directlypooling features from the nth channel with a 14times14 spatialresolution bounded within a bounding box proposal Here

n = max(1min(4b3+ log2 (radic

wh224)c)) (3)

where w and h are the width and height of the bounding boxproposal The features that are extracted then serve as input tothe bounding box regression object classification and masksegmentation networks The RPN network operates on eachof the output scales of the FPN sequentially thereby accumu-lating candidates from different scales and ROI align uses theaccumulated list for feature extraction from all the scales Thebounding box regression and object classification networksconsist of two shared consecutive fully-connected layers with1024 channels iABN sync and Leaky ReLU Subsequentlythe feature maps are passed through a fully-connected layerat the end for each task with 4Nlsquothingrsquo outputs and Nlsquothingrsquo+1outputs respectively The object classification logits yield aprobability distribution over class and void from a softmaxlayer whereas the bounding box regression network encodesclass-specific correction factors For a given object proposalof the form (ucvcwchc) for a class c a new bounding boxis computed of the form (u+ cuv+ cvwecw hech) where(cucvcwch) are the class-specific correction factors (uv)and (wh) are the position and dimensions of the boundingbox proposal respectively

The mask segmentation network employs four consec-utive 3times3 separable convolution with 256 output channelsfollowed by iABN sync and Leaky ReLU Subsequently theresulting feature maps are passed through a 2times2 transposedconvolution with 256 output channels and an output strideof 2 followed by iABN sync and Leaky ReLU activationfunction Finally a 1times 1 convolution with Nlsquothingrsquo outputchannels is employed that yields 28times 28 logits for eachclass The resulting logits give mask foreground probabilitiesfor the candidate bounding box proposal using a sigmoidfunction This is then fused with the semantic logits in ourproposed panoptic fusion module described in Section 34

In order to train the instance segmentation head we adoptthe loss functions proposed in Mask R-CNN ie two lossfunctions for the first stage objectness score loss and objectproposal loss and three loss functions for the second stageclassification loss bounding box loss and mask segmentationloss We take a set of randomly sampled positive matchesand negative matches such that |Ns| le 256 The objectness

score loss Los defined as log loss for a given Ns is given by

Los(Θ) =minus 1|Ns| sum

(plowastospos)isinNs

plowastos middot log pos

+(1minus plowastos) middot log(1minus pos) (4)

where pos is the output of the objectness score branch of RPNand plowastos is the groundtruth label which is 1 if the anchor ispositive and 0 if the anchor is negative We use the samestrategy as Mask R-CNN for defining positive and negativematches For a given anchor a if the groundtruth box blowast hasthe largest Intersection over Union (IoU) or IoU(blowasta)gt TH then the corresponding prediction b is a positive match andb is a negative match when IoU(blowasta)lt TL The thresholdsTH and TL are pre-defined where TH gt TL

The object proposal loss Lop is a regression loss that isdefined only on positive matches and is given by

Lop(Θ) =1|Ns| sum

(tlowastt)isinNp

sum(ilowasti)isin(tlowastt)

L1(ilowast i) (5)

where L1 is the smooth L1 Norm Np is the subset of Ns pos-itive matches tlowast = (tlowastx t

lowasty tlowastw tlowasth ) and t = (tx ty tw th) are the

parameterizations of blowast and b respectively blowast=(xlowastylowastwlowasthlowast)is the groundtruth box blowast= (xywh) is the predicted bound-ing box xyw and h are the center coordinates width andheight of the predicted bounding box Similarly xlowastylowastwlowast

and hlowast denote the center coordinates width and height of thegroundtruth bounding box The parameterizations (Girshick2015) are given by

tx =(xminus xa)

wa ty =

(yminus ya)

ha tw = log

wwa

th = loghha

(6)

tlowastx =(xlowastminus xa)

wa tlowasty =

(ylowastminus ya)

ha tlowastw = log

wlowast

wa tlowasth = log

hlowast

ha (7)

where xayawa and ha denote the center coordinates widthand height of the anchor a

Similar to the objectness score loss Los the classificationloss Lcls is defined for a set of Ks randomly sampled positiveand negative matches such that |Ks| le 512 The classificationloss Lcls is given by

Lcls(Θ) =minus 1|Ks|

Nlsquothingprime+1

sumc=1

Y lowastoc middot logYoc for(Y lowastY ) isin Ks

(8)

where Y is the output of the classification branch Y lowast is theone hot encoded groundtruth label o is the observed classand c is the correct classification for object o For a givenimage it is a positive match if IoU(blowastb)gt Tn and otherwisea negative match where blowast is the groundtruth box and b isthe object proposal from the first stage

10 Rohit Mohan Abhinav Valada

kx28x28

kx28x28

48x512x1024

11

Input3x1024x2048

P32

128x64x128

128x32x641

1

11

11

DPC

256x256x512

256x64x128

256x128x256

24x512x1024

40x256x512

64x128x256

13

13

11

11

Confidence

Threshold

Sort based

on confidence Overlap

Threshold

8x28x28

Scale Resize

Pad

8xHxW

MLA

4xHxW

Semantic Logits

MLB

4xHxW

Nst+thxHxW

8x28x28

σ()

σ()

4xHxW

FusedLogits

IntermediateLogits

NstxHxW

(4+Nst)xHxW

16x28x28

Mask Logitskx28x28

Confidence

BBox Pred

Class Predf

Intermediate Prediction

argmax()

Semantic Prediction

argmax()

Canvas

Panoptic Prediction

Fill canvas

with stuff

Fill canvas

with things instances

Figure 4 Illustration of our proposed Panoptic Fusion Module Here MLA and MLB mask logits are fused as (σ(MLA)+σ(MLB))(MLA+MLA))where MLB is output of the function f lowast σ(middot) is the sigmoid function and is the Hadamard product Here the f lowast function for given class predictionc (cyclist in this example) zeroes out the score of the c channel of the semantic logits outside the corresponding bounding box

The bounding box loss Lbbx is a regression loss that isdefined only on positive matches and is expressed as

Lbbx(Θ) =1|Ks| sum

(T lowastT )isinKp

sum(ilowasti)isin(T lowastT )

L1(ilowast i) (9)

where L1 is the smooth L1 Norm (Girshick 2015) Kp is thesubset of Ks positive matches T lowast and T are the parameteriza-tions of Blowast and B respectively similar to Equation (4) and (5)where Blowast is the groundtruth box and B is the correspondingpredicted bounding box

Finally the mask segmentation loss is also defined onlyfor positive samples and is given by

Lmask(Θ) =minus 1|Ks| sum

(PlowastP)isinKs

Lp(PlowastP) (10)

where Lp(PlowastP) is given as

Lp(PlowastP) =minus1|Tp| sum

(i j)isinTp

Plowasti j middot logPi j

+(1minusPlowasti j) middot log(1minusPi j) (11)

where P is the predicted 28times28 binary mask for a class withPi j denoting the probability of the mask pixel (i j) Plowast is the28times28 groundtruth binary mask for the class and Tp is theset of non-void pixels in Plowast

All the five losses are weighed equally and the total in-stance segmentation head loss is given by

Linstance = Los +Lop +Lcls +Lbbx +Lmask (12)

Similar to Mask R-CNN the gradient that is computed wrt tothe losses Lcls Lbbx and Lmask flow only through the networkbackbone and not through the region proposal network

34 Panoptic Fusion Module

In order to obtain the panoptic segmentation output we needto fuse the prediction of the semantic segmentation head andthe instance segmentation head However fusing both thesepredictions is not a straightforward task due to the inherentoverlap between them Therefore we propose a novel pan-optic fusion module to tackle the aforementioned problemin an adaptive manner in order to thoroughly exploit the pre-dictions from both the heads congruously Figure 4 showsthe topology of our panoptic fusion module We obtain aset of object instances from the instance segmentation headof our network where for each instance we have its corres-ponding class prediction confidence score bounding boxand mask logits First we reduce the number of predictedobject instances in two stages We begin by discarding allobject instances that have a confidence score of less than acertain confidence threshold We then resize zero pad andscale the 28times28 mask logits of each object instance to thesame resolution as the input image Subsequently we sortthe class prediction bounding box and mask logits accordingto the respective confidence scores In the second stage wecheck each sorted instance mask logit for overlap with otherobject instances In case the overlap is higher than a givenoverlap threshold we discard the other object instance

After filtering the object instances we have the classprediction bounding box prediction and mask logit MLA ofeach instance We simultaneously obtain semantic logits withN channels from the semantic head where N is the sum ofNlsquostu f f prime and Nlsquothingprime We then compute a second mask logitMLB for each instance where we select the channel of thesemantic logits based on its class prediction We only keepthe logit score of the selected channel for the area withinthe instance bounding box while we zero out the scores that

EfficientPS Efficient Panoptic Segmentation 11

are outside this region In the end we have two mask logitsfor each instance one from instance segmentation head andthe other from the semantic segmentation head We com-bine these two logits adaptively by computing the Hadamardproduct of the sum of sigmoid of MLA and sigmoid of MLBand the sum of MLA and MLB to obtain the fused mask logitsFL of instances expressed as

FL = (σ(MLA)+σ(MLB)) (MLA +MLA) (13)

where σ(middot) is the sigmoid function and is the Hadamardproduct We then concatenate the fused mask logits of the ob-ject instances with the lsquostuffrsquo logits along the channel dimen-sion to generate intermediate panoptic logits Subsequentlywe apply the argmax operation along the channel dimensionto obtain the intermediate panoptic prediction In the finalstep we take a zero-filled canvas and first copy the instance-specific lsquothingrsquo prediction from the intermediate panopticprediction We then fill the empty parts of the canvas withlsquostuffrsquo class predictions by copying them from the predictionsof the semantic head while ignoring the classes that havean area smaller than a predefined threshold called minimumstuff area This then gives us the final panoptic segmentationoutput

We fuse MLA and MLB instance logits in the aforemen-tioned manner due to the fact that if both logits for a givenpixel conform with each other the final instance score willincrease proportionately to their agreement or vice-versa Incase of agreement the corresponding object instance willdominate or be superseded by other instances as well as thelsquostuffrsquo classes score Similarly in case of disagreement thescore of the given object instance will reflect the extent oftheir difference Simply put the fused logit score is either ad-aptively attenuated or amplified according to the consensusWe evaluate the performance of our proposed panoptic fu-sion module in comparison to other existing methods in theablation study presented in Section 445

4 Experimental Results

In this section, we first describe the standard evaluation metrics that we adopt for empirical evaluations, followed by brief descriptions of the datasets that we benchmark on in Section 4.1. We then present extensive quantitative comparisons and benchmarking results in Section 4.3 and detailed ablation studies on the various proposed architectural components in Section 4.4. Finally, we present qualitative comparisons and visualizations of panoptic segmentation on each of the datasets that we evaluate on in Section 4.5 and Section 4.6, respectively.

We use PyTorch (Paszke et al., 2019) for implementing all our architectures and we trained our models on a system with an Intel Xeon@2.20GHz processor and NVIDIA TITAN X GPUs. We use the standard Panoptic Quality (PQ)

metric (Kirillov et al., 2019) for quantifying the performance of our models. The PQ metric is computed as

PQ = Σ_(p,g)∈TP IoU(p,g) / (|TP| + ½|FP| + ½|FN|),   (14)

where TP, FP, FN and IoU denote the true positives, false positives, false negatives and the intersection-over-union, respectively. The IoU is computed as IoU = TP/(TP + FP + FN). We also report the Segmentation Quality (SQ) and Recognition Quality (RQ) metrics computed as

SQ = Σ_(p,g)∈TP IoU(p,g) / |TP|,   (15)

RQ = |TP| / (|TP| + ½|FP| + ½|FN|).   (16)

Following the standard benchmarking criteria for panoptic segmentation, we report PQ, SQ and RQ over all the classes in the dataset, and we also report them for the 'stuff' classes (PQ^St, SQ^St, RQ^St) and the 'thing' classes (PQ^Th, SQ^Th, RQ^Th). Additionally, for the sake of completeness, we report the Average Precision (AP), the mean Intersection-over-Union (mIoU) for both 'stuff' and 'thing' classes, as well as the inference time and FLOPs for comparisons. The implementation of our proposed EfficientPS model and a live demo on various datasets is publicly available at https://rl.uni-freiburg.de/research/panoptic.
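As a concrete illustration of Equations (14)–(16) for a single class, the following sketch computes PQ, SQ and RQ from already matched segments (a simplified example; in practice the matching of predicted and groundtruth segments and the averaging over classes follow the standard protocol of Kirillov et al. (2019)):

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """tp_ious: IoU values of the true-positive (prediction, groundtruth) pairs
    of one class, i.e. pairs with IoU > 0.5; num_fp / num_fn: counts of
    unmatched predicted / groundtruth segments of that class."""
    tp = len(tp_ious)
    if tp + num_fp + num_fn == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / tp if tp > 0 else 0.0        # Eq. (15)
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)     # Eq. (16)
    pq = sq * rq                                     # Eq. (14), since PQ = SQ * RQ
    return pq, sq, rq

# Three matched segments with IoUs 0.9, 0.8, 0.7, one FP and one FN -> PQ = 0.6
print(panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1))
```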

4.1 Datasets

We benchmark our proposed EfficientPS for panoptic segmentation on four challenging urban scene understanding datasets, namely Cityscapes (Cordts et al., 2016), KITTI (Geiger et al., 2013), Mapillary Vistas (Neuhold et al., 2017) and the Indian Driving Dataset (Varma et al., 2019). The KITTI benchmark does not provide panoptic annotations, therefore to facilitate this work we publicly release manually annotated panoptic groundtruth segmentation labels for the popular KITTI benchmark. These four diverse datasets contain images that range from congested city driving scenarios to rural scenes and highways. They also contain scenes in challenging perceptual conditions including snow, motion blur and other seasonal visual changes. We briefly describe the characteristics of these datasets in this section.

Cityscapes: The Cityscapes dataset (Cordts et al., 2016) consists of urban street scenes and focuses on semantic understanding of common driving scenarios. It is one of the most challenging datasets for panoptic segmentation due to its sheer diversity, as it covers scenes from over 50 European cities recorded over several seasons such as spring, summer and fall. The presence of a large number of dynamic objects further adds to its complexity. Figure 5 (a) shows an


(Figure 5 panels: RGB image and panoptic groundtruth for (a) Cityscapes, (b) KITTI, (c) Mapillary Vistas, (d) IDD.)
Figure 5 Example images from the challenging urban scene understanding datasets that we benchmark on, namely Cityscapes, KITTI, Mapillary Vistas and the Indian Driving Dataset (IDD). The images show cluttered urban scenes with many dynamic objects, occluded objects, perpetual snowy conditions and unstructured environments.

example image and the corresponding panoptic groundtruth annotation from the Cityscapes dataset. As we see from this example, the scenes are extremely cluttered with many dynamic objects such as pedestrians and cyclists that are often grouped near one another or partially occluded. These factors make panoptic segmentation, especially segmenting the 'thing' classes, exceedingly challenging.

The widely used Cityscapes dataset recently introduced a benchmark for the task of panoptic segmentation. The dataset contains pixel-level annotations for 19 object classes, of which 11 are 'stuff' classes and 8 are instance-specific 'thing' classes. It consists of 5000 finely annotated images and 20000 coarsely annotated images that were captured at a resolution of 2048×1024 pixels using an automotive-grade 22 cm baseline stereo camera. The finely annotated images are divided into 2975 for training, 500 for validation and 1525 for testing. The annotations for the test set are not publicly released; they are rather only available to the online evaluation server that automatically computes the metrics and publishes the results. We report the performance of our proposed EfficientPS on both the validation set as well as the test set. We also use the Cityscapes dataset for evaluating the improvement due to the various architectural contributions that we make in the ablation study. We report results on the validation set for our model trained only on the fine annotations, and we report the results on the test set from the

benchmarking server for our model trained on both the fine and coarse annotations.

KITTI: The KITTI vision benchmark suite (Geiger et al., 2013) is one of the most comprehensive datasets that provides groundtruth for a variety of tasks such as semantic segmentation, scene flow estimation, optical flow estimation, depth prediction, odometry estimation, tracking and road lane detection. However, it still has not expanded its annotations to support the recently introduced panoptic segmentation task. The challenging nature of the KITTI scenes and its potential for benchmarking multi-task learning problems makes extending this dataset to include panoptic annotations of great interest to the community. Therefore, in this work we introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI vision benchmark suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set, therefore in addition to panoptic segmentation, they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually.

Our dataset consists of a total of 1055 images, out of which 855 are used for the training set and 200 are used for the validation set. We provide annotations for 11 'stuff' classes and 8 'thing' classes, adhering to the Cityscapes 'stuff' and 'thing' class distribution. In order to create the panoptic annotations, we gathered semantic annotations from community driven extensions of KITTI (Xu et al., 2016; Ros et al., 2015) and combined them with the 200 training images from the KITTI semantic training set. We then manually annotated all the images with instance masks, followed by merging them with the semantic annotations to generate the panoptic segmentation groundtruth labels. The images in our KITTI panoptic segmentation dataset are of a resolution of 1280×384 pixels and contain scenes from both residential and inner city scenarios. Figure 5 (b) shows an example image from the KITTI panoptic segmentation dataset and its corresponding panoptic segmentation labels. We observe that the car denoted in teal color pixels and the van are both partially occluded by other 'stuff' classes, such that they cause an object instance to be disjoint into two components. We find that scenarios such as these are extremely challenging for the task of panoptic segmentation as the disjoint object mask has to be assigned to the same instance ID. We hope that this dataset encourages innovative solutions to such real-world problems that are uncommon in other datasets and also accelerates research in multi-task learning for urban scene understanding.

Mapillary Vistas: Mapillary Vistas (Neuhold et al., 2017) is one of the largest publicly available street-level imagery datasets that contains pixel-accurate and instance-specific semantic annotations. The novel aspects of this dataset include


diverse scenes from over six continents and in a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18000 images for training, 2000 images for validation and 5000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that due to the snowy condition, recognizing distant objects such as the car in this example becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset: The Indian Driving Dataset (IDD) (Varma et al., 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructures such as lanes and sidewalks. It has a significantly larger number of 'thing' instances in each scene compared to other datasets and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car and the data was gathered in two Indian cities as well as in their outskirts. IDD consists of a total of 10003 images, where 6993 are used for training, 981 for validation and 2029 for testing. The images are of a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3, therefore we report the results of our model on the same to facilitate comparison. This level comprises a total of 26 classes, out of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset is shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels. We take crops from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky

et al., 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio, 2010) for the other layers, zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set TH = 0.7, TL = 0.3 and TN = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of ct = 0.5, an overlap threshold of ot = 0.5 and a minimum stuff area of minsa = 2048.

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone, and continue training until convergence. We denote the base learning rate lr_base, the milestones and the total number of iterations ti for each dataset in the following format: {lr_base, milestone, milestone, ti}. The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are {0.07, 32K, 44K, 50K}, {0.07, 144K, 176K, 192K}, {0.07, 16K, 22K, 25K} and {0.07, 108K, 130K, 144K} respectively. At the beginning of the training, we have a warm-up phase where the lr_base is increased linearly from 1/3 · lr_base to lr_base in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^-4. The final loss L_total that we optimize is computed as

L_total = L_semantic + L_instance,   (17)

where L_semantic and L_instance are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA Titan X GPUs, where each GPU tends to a single image.
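The multi-step schedule with linear warm-up described above can, for instance, be realized with a PyTorch learning rate scheduler as sketched below (using the Cityscapes milestones; `model` and the `train_one_iteration` step are assumed placeholders, not part of the released code):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.07, momentum=0.9)

def lr_lambda(it, warmup_iters=200, milestones=(32000, 44000)):
    # Linear warm-up from 1/3 * lr_base to lr_base over the first 200 iterations,
    # then decay by a factor of 10 at each milestone.
    if it < warmup_iters:
        return (1.0 + 2.0 * it / warmup_iters) / 3.0
    return 0.1 ** sum(it >= m for m in milestones)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for it in range(50000):                      # 50K iterations for Cityscapes
    train_one_iteration(model, optimizer)    # assumed single training step
    scheduler.step()
```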

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts, while for KITTI and IDD we report results for the models that we trained using the official implementations that have been publicly released by the authors, after further tuning of the hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that at the time of submission, only the Cityscapes benchmark has the provision to evaluate the results


Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. '−' denotes that the metric has not been reported for the corresponding method.

Network | Pre-training | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU   (all values in %)

Single-Scale:
WeaklySupervised | − | 47.3 | − | − | 39.6 | − | − | 52.9 | − | − | 24.3 | 71.6
TASCNet | − | 55.9 | − | − | 50.5 | − | − | 59.8 | − | − | − | −
Panoptic FPN | − | 58.1 | − | − | 52.0 | − | − | 62.5 | − | − | 33.0 | 75.7
AUNet | − | 59.0 | − | − | 54.8 | − | − | 62.1 | − | − | 34.4 | 75.6
UPSNet | − | 59.3 | 79.7 | 73.0 | 54.6 | 79.3 | 68.7 | 62.7 | 80.1 | 76.2 | 33.3 | 75.2
DeeperLab | − | 56.3 | − | − | − | − | − | − | − | − | − | −
Seamless | − | 60.3 | − | − | 56.1 | − | − | 63.3 | − | − | 33.6 | 77.5
SSAP | − | 61.1 | − | − | 55.0 | − | − | − | − | − | − | −
AdaptIS | − | 62.0 | − | − | 58.7 | − | − | 64.4 | − | − | 36.3 | 79.2
Panoptic-DeepLab | − | 63.0 | − | − | − | − | − | − | − | − | 35.3 | 80.5
EfficientPS (ours) | − | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3
TASCNet | COCO | 59.3 | − | − | 56.0 | − | − | 61.5 | − | − | 37.6 | 78.1
UPSNet | COCO | 60.5 | 80.9 | 73.5 | 57.0 | − | − | 63.0 | − | − | 37.8 | 77.8
Seamless | Vistas | 65.0 | − | − | 60.7 | − | − | 68.0 | − | − | − | 80.7
Panoptic-DeepLab | Vistas | 65.3 | − | − | − | − | − | − | − | − | 38.8 | 82.5
EfficientPS (ours) | Vistas | 66.1 | 82.5 | 78.9 | 62.7 | 81.9 | 75.2 | 68.5 | 82.9 | 81.6 | 41.9 | 81.0

Multi-Scale:
Panoptic-DeepLab | − | 64.1 | − | − | − | − | − | − | − | − | 38.5 | 81.5
EfficientPS (ours) | − | 65.1 | 82.2 | 79.0 | 61.5 | 81.4 | 75.4 | 67.7 | 82.8 | 81.7 | 39.7 | 80.3
TASCNet | COCO | 60.4 | − | − | 56.1 | − | − | 63.3 | − | − | 39.1 | 78.7
M-RCNN + PSPNet | COCO | 61.2 | 80.9 | 74.4 | 54.0 | − | − | 66.4 | − | − | 36.4 | 80.9
UPSNet | COCO | 61.8 | 81.3 | 74.8 | 57.6 | 77.7 | 70.5 | 64.8 | 81.4 | − | 39.0 | 79.2
Panoptic-DeepLab | Vistas | 67.0 | − | − | − | − | − | − | − | − | 42.5 | 83.1
EfficientPS (ours) | Vistas | 67.5 | 83.2 | 80.2 | 63.5 | 82.2 | 77.2 | 70.4 | 83.9 | 82.4 | 43.8 | 82.1

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Network | Pre-training | PQ (%) | SQ (%) | RQ (%) | PQ^Th (%) | PQ^St (%)
SSAP | − | 58.9 | 82.4 | 70.6 | 48.4 | 66.5
TASCNet | COCO | 60.7 | 81.0 | 73.8 | 53.4 | 66.0
Panoptic-DeepLab | − | 62.3 | 82.4 | 74.8 | 52.1 | 69.7
Seamless | Vistas | 62.6 | 82.1 | 75.3 | 56.0 | 67.5
Panoptic-DeepLab | Vistas | 65.5 | 77.8 | 75.2 | 56.5 | 72.0
EfficientPS (ours) | − | 64.1 | 82.6 | 76.8 | 56.7 | 69.4
EfficientPS (ours) | Vistas | 66.4 | 83.5 | 78.8 | 59.3 | 71.5

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures.

Network | Input Size (pixels) | Params (M) | FLOPs (B) | Time (ms)
DeeperLab | 1025×2049 | − | − | 463
UPSNet | 1024×2048 | 45.05 | 487.02 | 202
Seamless | 1024×2048 | 51.43 | 514.00 | 168
Panoptic-DeepLab | 1025×2049 | 46.73 | 547.49 | 175
EfficientPS (ours) | 1024×2048 | 40.89 | 433.94 | 166

on the test set. On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75, 2) during the multi-scale evaluations.
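A simplified sketch of this multi-scale evaluation with horizontal flipping is shown below for the semantic logits (assuming a `model` callable that returns per-pixel logits; the instance predictions are handled analogously):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_semantic_logits(model, image, scales=(0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """Average semantic logits over multiple scales and horizontal flips."""
    _, _, h, w = image.shape
    accumulated = 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = model(inp)  # [B, C, h', w'] semantic logits (assumed output)
            if flip:
                logits = torch.flip(logits, dims=[3])
            # Resize back to the original resolution before averaging.
            logits = F.interpolate(logits, size=(h, w), mode='bilinear', align_corners=False)
            accumulated = accumulated + logits
    return accumulated / (2 * len(scales))
```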

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al., 2018b), TASCNet (Li et al., 2018a), Panoptic FPN (Kirillov et al., 2019), AUNet (Li et al., 2019b), UPSNet (Xiong et al., 2019), DeeperLab (Yang et al., 2019), Seamless (Porzi et al., 2019), SSAP (Gao et al., 2019), AdaptIS (Sofiiuk et al., 2019) and Panoptic-DeepLab (Cheng et al., 2019). Table 1 shows the results on the Cityscapes validation set. For a fair comparison, we categorize models in the table separately according to those that report


Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. '−' denotes that the metric has not been reported for the corresponding method.

Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU   (all values in %)

Single-Scale:
JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | −
DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3
TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | −
AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | −
Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4
Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3
EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6

Multi-Scale:
TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | −
Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8
EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1

single-scale and multi-scale evaluation, as well as without any pre-training and with pre-training on other datasets, namely Mapillary Vistas (Neuhold et al., 2017) denoted as Vistas and Microsoft COCO (Lin et al., 2014) abbreviated as COCO. We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations, depth data or exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale evaluation and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQ^St, PQ^Th, SQ and RQ metrics and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS without pre-training on any extra data achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also

uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime including the time consumed for fusing the semantic and instance predictions that yields the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as the presence of 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU   (all values in %)

Single-Scale:
Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1
UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6
Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8
EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3

Multi-Scale:
Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8
UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2
Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1
EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU   (all values in %)

Single-Scale:
Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1
UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4
Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6
EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3

Multi-Scale:
Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1
UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6
Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3
EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1

the-art Panoptic-DeepLab. Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQ^St score, primarily due to the output stride of 16 at which it operates, which increases the computational complexity, whereas our EfficientPS uses an output stride of 32 and hence is more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQ^St of Panoptic-DeepLab, while on the other hand, top-down approaches tend to have a better instance segmentation ability as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured

by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4.0% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQ^St score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on the various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote the Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '-' refers to the standard configuration as in Kirillov et al. (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model | Encoder | 2-way FPN | SIH | SH | PFM | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU   (all values in %)
M1 | ResNet-50 | - | - | - | - | 58.2 | 79.1 | 72.0 | 52.4 | 78.8 | 67.2 | 62.4 | 79.4 | 75.6 | 31.6 | 74.3
M2 | ResNet-50 | - | - | - | X | 58.8 | 79.5 | 72.6 | 53.4 | 79.2 | 62.8 | 67.9 | 79.7 | 76.1 | 33.8 | 75.1
M3 | ResNet-50 | - | X | - | X | 58.6 | 79.4 | 72.4 | 53.1 | 79.1 | 67.5 | 62.6 | 79.6 | 75.9 | 33.7 | 75.0
M4 | EfficientNet-B5 | - | X | - | X | 59.7 | 79.9 | 73.3 | 54.7 | 76.6 | 68.1 | 63.3 | 80.3 | 79.5 | 34.1 | 76.3
M5 | EfficientNet-B5 | X | X | - | X | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
M6 | EfficientNet-B5 | X | X | X | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of the various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and its comparison with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the same network configuration and panoptic fusion heuristics as Kirillov et al. (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network is comprised of an upsampling stage which has a 3×3 convolution, group norm (Wu and He, 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation

output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.
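For reference, a rough sketch of this baseline semantic head is given below (assuming 256-channel FPN features P4 to P32 and 19 output classes; the module name and the exact treatment of the P4 level are illustrative and not taken from an official implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BaselineSemanticHead(nn.Module):
    """M1 semantic head: each FPN level passes through repeated
    (3x3 conv + GroupNorm + ReLU + x2 upsampling) stages until it reaches 1/4
    of the input scale; the results are summed and a 1x1 conv + x4 upsampling
    + softmax yields the semantic output."""
    def __init__(self, in_ch=256, mid_ch=128, num_classes=19):
        super().__init__()
        def block(cin):
            return nn.Sequential(nn.Conv2d(cin, mid_ch, 3, padding=1),
                                 nn.GroupNorm(32, mid_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList()
        for n_up in (0, 1, 2, 3):            # P4, P8, P16, P32
            layers = [block(in_ch)]
            for i in range(n_up):
                layers.append(nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False))
                if i < n_up - 1:
                    layers.append(block(mid_ch))
            self.branches.append(nn.Sequential(*layers))
        self.classifier = nn.Conv2d(mid_ch, num_classes, 1)

    def forward(self, feats):                # feats: [P4, P8, P16, P32]
        fused = sum(branch(f) for branch, f in zip(self.branches, feats))
        logits = F.interpolate(self.classifier(fused), scale_factor=4,
                               mode='bilinear', align_corners=False)
        return torch.softmax(logits, dim=1)
```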

The next model, M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score, without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of the semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M with a drop of 0.2% in PQ and a 0.1% drop in AP and the mIoU score. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and Leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables a bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model that fuses and aligns multi-scale features effectively, which enables it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, it cannot be solely attributed to the semantic head. This is due to the fact that if we employ the standard panoptic fusion heuristics, an improvement in semantic


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Encoder | Params (M) | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU   (all values in %)
MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2
ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1
ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2
Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1
ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9
Mod. EfficientNet-B5 (Ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Architecture | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU   (all values in %)
Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3
Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2
PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8
2-way FPN (Ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

segmentation would only contribute to an increase in the PQ^St score. However, our proposed adaptive panoptic fusion yields an improvement in PQ^Th as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al., 2019), ResNet-50 (He et al., 2016), ResNet-101 (He et al., 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al., 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have a comparable or fewer parameters

than our modified EfficientNet-B5, however they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al., 2017), bottom-up FPN and PANet FPN variants. We refer to the FPN architecture described in Liu et al. (2018) as PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. Both these models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN, respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the os pyramid scale level, c^k_f refers to a convolution layer with f filters and a k×k kernel size, LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU   (all values in %)
M61 | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | - | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2
M62 | LSFE | LSFE | LSFE | LSFE | - | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4
M63 | DPC | LSFE | LSFE | LSFE | - | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1
M64 | DPC | DPC | LSFE | LSFE | - | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6
M65 | DPC | DPC | DPC | LSFE | - | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2
M66 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

the standard FPN, in a sequential or parallel manner respectively. We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.4. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and Leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output which is 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output which is the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model, therefore

we employ separable convolutions in all the experiments that follow.
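For illustration, a minimal sketch of the resulting LSFE module is given below (two cascaded 3×3 separable convolutions with 128 output filters, each followed by a normalization layer and Leaky ReLU; here nn.BatchNorm2d stands in for the iABN sync layer, which comes from a separate library):

```python
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class LSFE(nn.Module):
    """Large Scale Feature Extractor: two cascaded separable convolutions,
    each followed by a normalization layer and Leaky ReLU activation."""
    def __init__(self, in_ch=256, out_ch=128):
        super().__init__()
        self.block = nn.Sequential(
            SeparableConv2d(in_ch, out_ch), nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.01),
            SeparableConv2d(out_ch, out_ch), nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.01))

    def forward(self, x):
        return self.block(x)
```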

In the M63 model, we replace the LSFE module in the P32 level of the 2-way FPN with dense prediction cells (DPC), described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module in the P16 level with DPC, and in the subsequent M65 model we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63, however the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections, as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, which results in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al. (2019), which we denote as the baseline, UPSNet (Xiong et al., 2019) and Seamless (Porzi et al., 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments, while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Semantic Head | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU   (all values in %)
Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1
Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St   (all values in %)
Baseline | 62.4 | 80.8 | 75.4 | 58.7 | 80.4 | 72.6 | 65.1 | 81.1 | 77.4
TASCNet | 62.5 | 80.9 | 75.6 | 58.6 | 80.5 | 72.8 | 65.3 | 81.2 | 77.7
UPSNet | 63.1 | 81.3 | 76.1 | 59.5 | 80.6 | 73.2 | 65.7 | 81.8 | 78.2
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head that employs a combination of LSFE and DPC modules at different levels of the 2-way FPN achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as a baseline as it is extensively used in several panoptic segmentation networks. We then compare with Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in (Xiong et al., 2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and instance head that have an inherent overlap is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them, and therefore it relies on the mode computed over the segmentation output to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model Panoptic-DeepLab does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding


(Figure 6 panels, per row (a)–(h): input image, Seamless output, EfficientPS output, and improvement/error map for Cityscapes, Mapillary Vistas, KITTI and IDD.)
Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


datasets. For each example, we show the input image, the corresponding panoptic segmentation output from the Seamless model and our proposed EfficientPS model. Additionally, we show the improvement and error map where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN that effectively aggregates multi-scale features to learn semantically richer representations, and the panoptic fusion module that addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c) the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately address this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the fact that there is a lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless

model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divides it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by the cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both the examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head that correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearances of these two objects; however, our proposed EfficientPS model with our novel panoptic backbone has an extensive representational capacity, which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of


(Figure 7 panels: columns (i)–(iv) show four examples each for rows (a)–(b) Cityscapes, (c)–(d) Mapillary Vistas, (e)–(f) KITTI and (g)–(h) IDD.)
Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances in multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk, as well as seasonal changes.


traffic participants. These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from uncommon viewpoints compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even when the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic

fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analyses and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and to exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335
Badrinarayanan V, Kendall A, Cipolla R (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229
Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons
Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision, Springer, pp 44–57
Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848
Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems, pp 8713–8724
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, pp 534–549
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the International Conference on Computer Vision, pp 764–773
Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) SSAP: Single-shot instance segmentation with affinity pyramid. In: Proceedings of the International Conference on Computer Vision, pp 642–651
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The KITTI dataset. International Journal of Robotics Research
de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110
Girshick R (2015) Fast R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 1440–1448
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256
Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, pp 297–312
Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 2961–2969
He X, Gould S (2014a) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
He X, Gould S (2014b) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for MobileNetV3. In: Proceedings of the International Conference on Computer Vision, pp 1314–1324
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059
Kang BR, Kim HY (2018) BshapeNet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327
Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413
Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In: Proceedings of the International Conference on Computer Vision, pp 2190–2197
Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in Neural Information Processing Systems, pp 109–117
Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192
Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118
Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229
Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: European Conference on Computer Vision, Springer, pp 740–755
Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181
Liu S, Jia J, Fidler S, Urtasun R (2017) SGN: Sequential grouping networks for instance segmentation. In: Proceedings of the International Conference on Computer Vision, pp 3496–3504
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768
Liu W, Rabinovich A, Berg AC (2015) ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In: Proceedings of the International Conference on Computer Vision, pp 4990–4999
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035
Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998
Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824
Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286
Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887
Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664
Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329
Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238
Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of DNNs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sofiiuk K, Barinova O, Konushin A (2019) AdaptIS: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363
Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference
Tan M, Le QV (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135
Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision 63(2):113–140
Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25
Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots
Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics
Valada A, Vertens J, Dhall A, Burgard W (2017) AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651
Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)
Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, Special Issue: Deep Learning for Robotic Vision
Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751
Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) UPSNet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826
Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349
Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) DeeperLab: Single-shot image parser. arXiv preprint arXiv:1902.05093
Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721
Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890
Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227



Figure 2 Illustration of our proposed EfficientPS architecture consisting of a shared backbone with our 2-way FPN and parallel semantic and instance segmentation heads, followed by our panoptic fusion module. The shared backbone is built upon the EfficientNet architecture and our new 2-way FPN that enables bidirectional flow of information. The instance segmentation head is based on a modified Mask R-CNN topology, and we incorporate our proposed semantic segmentation head. Finally, the outputs of both heads are fused in our panoptic fusion module to yield the panoptic segmentation output.

As opposed to employing the conventional FPN (Lin et al 2017) that is commonly used in other panoptic segmentation architectures (Kirillov et al 2019; Li et al 2018a; Porzi et al 2019), we incorporate our proposed 2-way FPN that fuses multi-scale features more effectively than its counterparts. This can be attributed to the fact that the information flow in our 2-way FPN is not bounded to only one direction, as depicted by the purple, blue and green blocks in Figure 2. Subsequently, after the 2-way FPN, we employ two heads in parallel for semantic segmentation (depicted in yellow) and instance segmentation (depicted in gray and orange) respectively. We use a variant of the Mask R-CNN (He et al 2017) architecture as the instance head, and we incorporate our novel semantic segmentation head consisting of dense prediction cells (Chen et al 2018a) and residual pyramids. The semantic head consists of three different modules for capturing fine features, capturing long-range contextual features, and correlating the distinctly captured features to improve object boundary refinement. Finally, we employ our proposed panoptic fusion module to fuse the outputs of the semantic and instance heads to yield the panoptic segmentation output.

3.1 Network Backbone

The backbone of our network consists of an encoder with our proposed 2-way FPN. The encoder is the basic building block of any segmentation network, and a strong encoder is essential for high representational capacity. In this work, we seek a good trade-off between the number of parameters and computational complexity on the one hand, and the representational capacity of the network on the other. EfficientNets (Tan and Le 2019), a recent family of architectures, have been shown to significantly outperform other networks in classification tasks while having fewer parameters and FLOPs. They employ compound scaling to uniformly scale the width, depth and resolution of the network efficiently. Therefore, we choose to build upon the scaled architecture with 1.6, 2.2 and 456 coefficients, commonly known as the EfficientNet-B5 model. This can easily be replaced with any of the EfficientNet models, depending on the capacity of the resources that are available and the computational budget.

In order to adapt EfficientNet to our task, we first remove the classification head as well as the Squeeze-and-Excitation (SE) (Hu et al 2018) connections in the network. We find that the explicit modelling of interdependencies between channels of the convolutional feature maps enabled by the SE connections tends to suppress localization of features in favour of contextual elements. This property is desirable in classification networks; however, both are equally important for segmentation tasks, therefore we do not add any SE connections in our backbone. Second, we replace all the batch normalization (Ioffe and Szegedy 2015) layers with synchronized Inplace Activated Batch Normalization (iABN sync) (Rota Bulo et al 2018). This enables synchronization across different GPUs, which in turn yields a better estimate of the gradients while performing multi-GPU training, and the in-place operations free up additional GPU memory. We analyze the performance of our modified EfficientNet in comparison to other encoders commonly used in state-of-the-art architectures in the ablation study presented in Section 4.4.2.
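To make the recurring building blocks concrete, the following is a minimal PyTorch sketch of a depthwise separable convolution and a stand-in for the iABN sync layer. The stand-in simply chains batch normalization with Leaky ReLU; the actual iABN sync of Rota Bulo et al (2018) additionally fuses the two operations in place and synchronizes statistics across GPUs, so this is a functional approximation rather than the authors' implementation. The helper names are our own and are reused by the later sketches.

```python
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable convolution: a depthwise kxk conv followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride, padding=padding,
                                   dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def iabn(channels, slope=0.01):
    """Stand-in for iABN sync: normalization followed by Leaky ReLU.
    For multi-GPU training one would swap in nn.SyncBatchNorm or the
    memory-optimized in-place activated batch norm."""
    return nn.Sequential(nn.BatchNorm2d(channels), nn.LeakyReLU(slope, inplace=True))
```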

Our EfficientNet encoder comprises nine blocks, as shown in Figure 2 (in red). We refer to each block in the figure as block 1 to block 9, from left to right. The outputs of blocks 2, 3, 5 and 9 correspond to downsampling factors of ×4, ×8, ×16 and ×32 respectively, and the outputs from these blocks are also the inputs to our 2-way FPN. The conventional FPN used in other panoptic segmentation networks aims to address the problem of multi-scale feature fusion by aggregating features of different resolutions in a top-down manner. This is performed by first employing a 1×1 convolution to reduce or increase the number of channels of the different encoder output resolutions to a predefined number, typically 256. Then the lower resolution features are upsampled to the next higher resolution and are subsequently added together. For example, the ×32 resolution encoder output features are resized to the ×16 resolution and added to the ×16 resolution encoder output features. Finally, a 3×3 convolution is used at each scale to further learn fused features, which then yields the P4, P8, P16 and P32 outputs. This FPN topology has a limited, unidirectional flow of information, resulting in an ineffective fusion of multi-scale features. Therefore, we propose to mitigate this problem by adding a second branch that aggregates multi-scale features in a bottom-up fashion to enable bidirectional flow of information.

Our proposed 2-way FPN, shown in Figure 2, consists of two parallel branches. Each branch consists of a 1×1 convolution with 256 output filters at each scale for channel reduction. The top-down branch, shown in blue, follows the aggregation scheme of a conventional FPN from right to left. The bottom-up branch, shown in purple, downsamples the higher resolution features to the next lower resolution from left to right and subsequently adds them to the next lower resolution encoder output features. For example, the ×4 resolution features are resized to the ×8 resolution and added to the ×8 resolution encoder output features.

Figure 3 Topologies of the various architectural components in our proposed semantic head and instance head of the EfficientPS architecture: (a) the LSFE module, (b) the RPN module, (c) the MC module, and (d) the DPC module.

Then, in the next stage, the outputs from the bottom-up and top-down branches at each resolution are correspondingly summed together and passed through a 3×3 separable convolution with 256 output channels to obtain the P4, P8, P16 and P32 outputs respectively. We employ separable convolutions as opposed to standard convolutions in an effort to keep the parameter consumption low. We evaluate the performance of our proposed 2-way FPN in comparison to the conventional FPN in the ablation study presented in Section 4.4.3.
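The aggregation just described can be sketched as follows. This is an illustrative PyTorch re-implementation rather than the released code: it reuses the SeparableConv2d and iabn helpers from the backbone sketch above, and it uses interpolation for the bottom-up downsampling step, which the text does not specify and is therefore an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoWayFPN(nn.Module):
    """Sketch of the 2-way FPN: a top-down and a bottom-up branch, summed per scale
    and refined with a 3x3 separable convolution to yield P4, P8, P16 and P32."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral_td = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.lateral_bu = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.output = nn.ModuleList([
            nn.Sequential(SeparableConv2d(out_channels, out_channels), iabn(out_channels))
            for _ in in_channels])

    def forward(self, feats):  # feats: encoder outputs at x4, x8, x16, x32 (high to low resolution)
        td = [l(f) for l, f in zip(self.lateral_td, feats)]
        for i in range(len(td) - 2, -1, -1):   # top-down: upsample the coarser map and add
            td[i] = td[i] + F.interpolate(td[i + 1], size=td[i].shape[-2:], mode='nearest')
        bu = [l(f) for l, f in zip(self.lateral_bu, feats)]
        for i in range(1, len(bu)):            # bottom-up: downsample the finer map and add
            bu[i] = bu[i] + F.interpolate(bu[i - 1], size=bu[i].shape[-2:], mode='nearest')
        return [out(t + b) for out, t, b in zip(self.output, td, bu)]  # [P4, P8, P16, P32]
```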

3.2 Semantic Segmentation Head

Our proposed semantic segmentation head consists of three components, each aimed at one of the critical requirements. First, at large scale, the network should have the ability to capture fine features efficiently. In order to enable this, we employ our Large Scale Feature Extractor (LSFE) module that has two 3×3 separable convolutions with 128 output filters, each followed by an iABN sync and a Leaky ReLU activation function. The first 3×3 separable convolution reduces the number of filters to 128 and the second 3×3 separable convolution further learns deeper features.
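A minimal sketch of the LSFE module, assuming the 256-channel FPN outputs as input and reusing the SeparableConv2d and iabn helpers from the backbone sketch:

```python
import torch.nn as nn

class LSFE(nn.Module):
    """Large Scale Feature Extractor: two 3x3 separable convolutions with 128 filters,
    each followed by the iABN-sync stand-in with Leaky ReLU."""
    def __init__(self, in_channels=256, channels=128):
        super().__init__()
        self.block = nn.Sequential(
            SeparableConv2d(in_channels, channels), iabn(channels),
            SeparableConv2d(channels, channels), iabn(channels))

    def forward(self, x):
        return self.block(x)
```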

The second requirement is that, at small scale, the network should be able to capture long-range context. Modules inspired by Atrous Spatial Pyramid Pooling (ASPP) (Chen et al 2017a), which are widely used in state-of-the-art semantic segmentation architectures, have been demonstrated to be effective for this purpose. Dense Prediction Cells (DPC) (Chen et al 2018a) and Efficient Atrous Spatial Pyramid Pooling (eASPP) (Valada et al 2019) are two variants of ASPP that are significantly more efficient and also yield better performance. We find that DPC demonstrates a better performance with a minor increase in the number of parameters compared to eASPP. Therefore, we employ a modified DPC module in our semantic head, as shown in Figure 2. We augment the original DPC topology by replacing batch normalization layers with iABN sync and ReLUs with Leaky ReLUs. The DPC module consists of a 3×3 separable convolution with 256 output channels and a dilation rate of (1,6), which extends out into five parallel branches. Three of the branches each consist of a 3×3 dilated separable convolution with 256 output channels, where the dilation rates are (1,1), (6,21) and (18,15) respectively. The fourth branch takes the output of the dilated separable convolution with a dilation rate of (18,15) as input and passes it through another 3×3 dilated separable convolution with 256 output channels and a dilation rate of (6,3). The outputs from all these parallel branches are then concatenated to yield a tensor with 1280 channels. This tensor is finally passed through a 1×1 convolution with 256 output channels and forms the output of the DPC module. Note that each of the convolutions in the DPC module is followed by an iABN sync and a Leaky ReLU activation function.
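The modified DPC topology described above can be sketched as follows; the dilation rates are taken from the text, while the helper names and the iabn stand-in from the earlier sketches are our own.

```python
import torch
import torch.nn as nn

def dilated_sep_conv(channels, dilation):
    """3x3 depthwise separable convolution with an (h, w) dilation tuple and 'same' padding."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation,
                  groups=channels, bias=False),
        nn.Conv2d(channels, channels, 1, bias=False))

class DPC(nn.Module):
    """Modified Dense Prediction Cell: a (1,6)-dilated stem feeding parallel dilated branches,
    concatenated to 1280 channels and projected back to 256 with a 1x1 convolution."""
    def __init__(self, channels=256):
        super().__init__()
        self.stem = nn.Sequential(dilated_sep_conv(channels, (1, 6)), iabn(channels))
        self.b1 = nn.Sequential(dilated_sep_conv(channels, (1, 1)), iabn(channels))
        self.b2 = nn.Sequential(dilated_sep_conv(channels, (6, 21)), iabn(channels))
        self.b3 = nn.Sequential(dilated_sep_conv(channels, (18, 15)), iabn(channels))
        self.b4 = nn.Sequential(dilated_sep_conv(channels, (6, 3)), iabn(channels))  # fed by b3
        self.project = nn.Sequential(nn.Conv2d(5 * channels, channels, 1), iabn(channels))

    def forward(self, x):
        stem = self.stem(x)
        b3 = self.b3(stem)
        feats = [stem, self.b1(stem), self.b2(stem), b3, self.b4(b3)]
        return self.project(torch.cat(feats, dim=1))
```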

The third and final requirement for the semantic head is that it should be able to mitigate the mismatch between large-scale and small-scale features while performing feature aggregation. To this end, we employ our Mismatch Correction (MC) module that correlates the small-scale features with respect to the large-scale features. It consists of cascaded 3×3 separable convolutions with 128 output channels, each followed by an iABN sync with Leaky ReLU, and a bilinear upsampling layer that upsamples the feature maps by a factor of 2. Figures 3 (a), 3 (c) and 3 (d) illustrate the topologies of these main components of our semantic head.
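A corresponding sketch of the MC module; the 128-channel input width is an assumption based on the LSFE output width, and the helpers come from the earlier sketches.

```python
import torch.nn as nn
import torch.nn.functional as F

class MC(nn.Module):
    """Mismatch Correction module: two cascaded 3x3 separable convolutions with 128 channels
    followed by bilinear upsampling by a factor of 2."""
    def __init__(self, in_channels=128, channels=128):
        super().__init__()
        self.block = nn.Sequential(
            SeparableConv2d(in_channels, channels), iabn(channels),
            SeparableConv2d(channels, channels), iabn(channels))

    def forward(self, x):
        return F.interpolate(self.block(x), scale_factor=2, mode='bilinear', align_corners=False)
```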

The four different scaled outputs of our 2-way FPN, namely P4, P8, P16 and P32, are the inputs to our semantic head. The small-scale inputs, P32 and P16 with downsampling factors of ×32 and ×16, are each fed into two parallel DPC modules, while the large-scale inputs, P8 and P4 with downsampling factors of ×8 and ×4, are each passed through two parallel LSFE modules. Subsequently, the outputs from each of these parallel DPC and LSFE modules are augmented with feature alignment connections and each of them is upsampled by a factor of 4. These upsampled feature maps are then concatenated to yield a tensor with 512 channels, which is then input to a 1×1 convolution with N_{stuff+thing} output filters. This tensor is finally upsampled by a factor of 2 and passed through a softmax layer to yield the semantic logits at the same resolution as the input image. The feature alignment connections from the DPC and LSFE modules interconnect each of these outputs by element-wise summation, as shown in Figure 2. We add our MC modules in the interconnections between the second DPC and LSFE,

as well as between both the LSFE connections. These correlation connections aggregate contextual information from small-scale features and characteristic large-scale features for better object boundary refinement. We use the weighted per-pixel log-loss (Bulo et al 2017) for training, which is given by

L_{pp}(\Theta) = -\sum_{ij} w_{ij} \log p_{ij}(p^*_{ij}),   (1)

where p^*_{ij} is the groundtruth for a given image, p_{ij} is the predicted probability for the pixel (i, j) being assigned class c \in p, w_{ij} = 4/(W \cdot H) if pixel (i, j) belongs to the 25% of the worst predictions, and w_{ij} = 0 otherwise. W and H are the width and height of the given input image. The overall semantic head loss is given by

L_{semantic}(\Theta) = \frac{1}{n} \sum L_{pp},   (2)

where n is the batch size. We present an in-depth analysis of our proposed semantic head in comparison to other semantic heads commonly used in state-of-the-art architectures in Section 4.4.4.
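A sketch of how the weighted per-pixel log-loss of Equation (1) could be computed in practice; the hard-pixel selection via top-k is our reading of the "25% of the worst predictions" criterion, and the function name is illustrative.

```python
import torch.nn.functional as F

def weighted_per_pixel_loss(logits, target, top_fraction=0.25, ignore_index=255):
    """Bootstrapped per-pixel log-loss: only the hardest 25% of pixels of each image
    contribute, each with weight 4/(W*H); the result is averaged over the batch (Eq. 2)."""
    n, _, h, w = logits.shape
    ce = F.cross_entropy(logits, target, ignore_index=ignore_index, reduction='none')  # (N, H, W)
    k = max(1, int(top_fraction * h * w))
    hardest, _ = ce.view(n, -1).topk(k, dim=1)       # worst-predicted pixels per image
    per_image = hardest.sum(dim=1) * (4.0 / (w * h))
    return per_image.mean()
```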

3.3 Instance Segmentation Head

The instance segmentation head of our EfficientPS network, shown in Figure 2, has a topology similar to Mask R-CNN (He et al 2017) with certain modifications. More specifically, we replace all the standard convolutions, batch normalization layers and ReLU activations with separable convolutions, iABN sync and Leaky ReLU respectively. Similar to the rest of our architecture, we use separable convolutions instead of standard convolutions to reduce the number of parameters consumed by the network. This enables us to conserve 2.09 M parameters in comparison to the conventional Mask R-CNN.

Mask R-CNN consists of two stages. In the first stage, the Region Proposal Network (RPN) module, shown in Figure 3 (b), employs a fully convolutional network to output a set of rectangular object proposals and an objectness score for the given input FPN level. The network architecture incorporates a 3×3 separable convolution with 256 output channels, an iABN sync and a Leaky ReLU activation, followed by two parallel 1×1 convolutions with 4k and k output filters respectively. Here, k is the maximum number of object proposals. The k proposals are parameterized relative to k reference bounding boxes, which are called anchors. An anchor is defined by a scale and an aspect ratio, which vary depending on the dataset being used. As the generated proposals can potentially overlap, an additional filtering step of Non-Maximum Suppression (NMS) is employed: in case of overlaps, the proposal with the highest objectness score is retained and the rest are discarded. The RPN first computes object proposals of the form (p_u, p_v, p_w, p_h) and the objectness score \sigma(p_s), which transform an anchor a of the form (u, v, w, h) to (u + p_u, v + p_v) and (w e^{p_w}, h e^{p_h}), where (u, v) is the position of anchor a in the image coordinate system, (w, h) are its dimensions and \sigma(\cdot) is the sigmoid function. This is followed by NMS to yield the final candidate bounding box proposals to be used in the next stage.

Subsequently, ROI align (He et al 2017) uses the object proposals to extract features from the FPN encodings by directly pooling features from the n-th FPN level with a 14×14 spatial resolution bounded within a bounding box proposal, where

n = \max(1, \min(4, \lfloor 3 + \log_2(\sqrt{wh}/224) \rfloor)),   (3)

and w and h are the width and height of the bounding box proposal. The extracted features then serve as input to the bounding box regression, object classification and mask segmentation networks. The RPN operates on each of the output scales of the FPN sequentially, thereby accumulating candidates from the different scales, and ROI align uses the accumulated list for feature extraction from all the scales. The bounding box regression and object classification networks consist of two shared consecutive fully-connected layers with 1024 channels, iABN sync and Leaky ReLU. Subsequently, the feature maps are passed through a task-specific fully-connected layer at the end with 4N_{'thing'} and N_{'thing'}+1 outputs respectively. The object classification logits yield a probability distribution over the classes and void from a softmax layer, whereas the bounding box regression network encodes class-specific correction factors. For a given object proposal of the form (u_c, v_c, w_c, h_c) for a class c, a new bounding box is computed of the form (u + c_u, v + c_v, w e^{c_w}, h e^{c_h}), where (c_u, c_v, c_w, c_h) are the class-specific correction factors and (u, v) and (w, h) are the position and dimensions of the bounding box proposal respectively.
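Equation (3) simply selects which FPN level a proposal is pooled from; a small sketch with illustrative names:

```python
import math

def fpn_level(w, h, canonical_size=224, min_level=1, max_level=4):
    """FPN level for ROI align, following Eq. (3): levels 1-4 correspond to P4, P8, P16, P32."""
    n = math.floor(3 + math.log2(math.sqrt(w * h) / canonical_size))
    return max(min_level, min(max_level, n))

# a 224x224 proposal maps to level 3, a small 56x56 proposal to level 1
print(fpn_level(224, 224), fpn_level(56, 56))  # -> 3 1
```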

The mask segmentation network employs four consecutive 3×3 separable convolutions with 256 output channels, each followed by an iABN sync and Leaky ReLU. Subsequently, the resulting feature maps are passed through a 2×2 transposed convolution with 256 output channels and an output stride of 2, followed by an iABN sync and a Leaky ReLU activation function. Finally, a 1×1 convolution with N_{'thing'} output channels is employed that yields 28×28 logits for each class. The resulting logits give mask foreground probabilities for the candidate bounding box proposal using a sigmoid function. This is then fused with the semantic logits in our proposed panoptic fusion module described in Section 3.4.

In order to train the instance segmentation head, we adopt the loss functions proposed in Mask R-CNN, i.e., two loss functions for the first stage, the objectness score loss and the object proposal loss, and three loss functions for the second stage, the classification loss, the bounding box loss and the mask segmentation loss. We take a set of randomly sampled positive matches and negative matches such that |N_s| \le 256. The objectness score loss L_{os}, defined as a log loss over a given N_s, is given by

L_{os}(\Theta) = -\frac{1}{|N_s|} \sum_{(p^*_{os}, p_{os}) \in N_s} \left[ p^*_{os} \cdot \log p_{os} + (1 - p^*_{os}) \cdot \log(1 - p_{os}) \right],   (4)

where p_{os} is the output of the objectness score branch of the RPN and p^*_{os} is the groundtruth label, which is 1 if the anchor is positive and 0 if the anchor is negative. We use the same strategy as Mask R-CNN for defining positive and negative matches. For a given anchor a, if the groundtruth box b^* has the largest Intersection over Union (IoU) or IoU(b^*, a) > T_H, then the corresponding prediction b is a positive match, and b is a negative match when IoU(b^*, a) < T_L. The thresholds T_H and T_L are pre-defined, where T_H > T_L.

The object proposal loss L_{op} is a regression loss that is defined only on positive matches and is given by

L_{op}(\Theta) = \frac{1}{|N_s|} \sum_{(t^*, t) \in N_p} \sum_{(i^*, i) \in (t^*, t)} L_1(i^*, i),   (5)

where L_1 is the smooth L1 norm, N_p is the subset of N_s containing the positive matches, and t^* = (t^*_x, t^*_y, t^*_w, t^*_h) and t = (t_x, t_y, t_w, t_h) are the parameterizations of b^* and b respectively. Here, b^* = (x^*, y^*, w^*, h^*) is the groundtruth box and b = (x, y, w, h) is the predicted bounding box, where x, y, w and h are the center coordinates, width and height of the predicted bounding box, and x^*, y^*, w^* and h^* denote the center coordinates, width and height of the groundtruth bounding box. The parameterizations (Girshick 2015) are given by

t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a},   (6)

t^*_x = \frac{x^* - x_a}{w_a}, \quad t^*_y = \frac{y^* - y_a}{h_a}, \quad t^*_w = \log\frac{w^*}{w_a}, \quad t^*_h = \log\frac{h^*}{h_a},   (7)

where x_a, y_a, w_a and h_a denote the center coordinates, width and height of the anchor a.
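The parameterization of Equations (6) and (7) and its inverse (used when decoding RPN outputs back into boxes) can be written compactly as follows; the function names are illustrative.

```python
import math

def encode_box(box, anchor):
    """Eqs. (6)-(7): parameterize a box (x, y, w, h) relative to an anchor (x_a, y_a, w_a, h_a)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode_box(t, anchor):
    """Inverse transform: recover (x, y, w, h) from regression outputs (t_x, t_y, t_w, t_h)."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))
```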

Similar to the objectness score loss L_{os}, the classification loss L_{cls} is defined for a set K_s of randomly sampled positive and negative matches such that |K_s| \le 512. The classification loss L_{cls} is given by

L_{cls}(\Theta) = -\frac{1}{|K_s|} \sum_{c=1}^{N_{'thing'}+1} Y^*_{oc} \cdot \log Y_{oc}, \quad \text{for } (Y^*, Y) \in K_s,   (8)

where Y is the output of the classification branch, Y^* is the one-hot encoded groundtruth label, o is the observed class and c is the correct classification for object o. For a given image, it is a positive match if IoU(b^*, b) > T_N and a negative match otherwise, where b^* is the groundtruth box and b is the object proposal from the first stage.


Figure 4 Illustration of our proposed panoptic fusion module. Here, the ML_A and ML_B mask logits are fused as (\sigma(ML_A) + \sigma(ML_B)) \odot (ML_A + ML_B), where ML_B is the output of the function f^*, \sigma(\cdot) is the sigmoid function and \odot is the Hadamard product. The f^* function, for a given class prediction c (cyclist in this example), zeroes out the scores of the c channel of the semantic logits outside the corresponding bounding box.

The bounding box loss L_{bbx} is a regression loss that is defined only on positive matches and is expressed as

L_{bbx}(\Theta) = \frac{1}{|K_s|} \sum_{(T^*, T) \in K_p} \sum_{(i^*, i) \in (T^*, T)} L_1(i^*, i),   (9)

where L_1 is the smooth L1 norm (Girshick 2015), K_p is the subset of K_s containing the positive matches, and T^* and T are the parameterizations of B^* and B respectively, similar to Equation (4) and Equation (5), where B^* is the groundtruth box and B is the corresponding predicted bounding box.

Finally, the mask segmentation loss is also defined only for positive samples and is given by

L_{mask}(\Theta) = -\frac{1}{|K_s|} \sum_{(P^*, P) \in K_s} L_p(P^*, P),   (10)

where L_p(P^*, P) is given as

L_p(P^*, P) = -\frac{1}{|T_p|} \sum_{(i,j) \in T_p} \left[ P^*_{ij} \cdot \log P_{ij} + (1 - P^*_{ij}) \cdot \log(1 - P_{ij}) \right],   (11)

where P is the predicted 28×28 binary mask for a class, with P_{ij} denoting the probability of the mask pixel (i, j), P^* is the 28×28 groundtruth binary mask for the class, and T_p is the set of non-void pixels in P^*.

All five losses are weighted equally, and the total instance segmentation head loss is given by

L_{instance} = L_{os} + L_{op} + L_{cls} + L_{bbx} + L_{mask}.   (12)

Similar to Mask R-CNN, the gradients computed w.r.t. the losses L_{cls}, L_{bbx} and L_{mask} flow only through the network backbone and not through the region proposal network.

3.4 Panoptic Fusion Module

In order to obtain the panoptic segmentation output, we need to fuse the predictions of the semantic segmentation head and the instance segmentation head. However, fusing both predictions is not a straightforward task due to the inherent overlap between them. Therefore, we propose a novel panoptic fusion module that tackles this problem in an adaptive manner in order to thoroughly exploit the predictions from both heads congruously. Figure 4 shows the topology of our panoptic fusion module. We obtain a set of object instances from the instance segmentation head of our network, where for each instance we have its corresponding class prediction, confidence score, bounding box and mask logits. First, we reduce the number of predicted object instances in two stages. We begin by discarding all object instances that have a confidence score lower than a given confidence threshold. We then resize, zero pad and scale the 28×28 mask logits of each remaining object instance to the same resolution as the input image. Subsequently, we sort the class predictions, bounding boxes and mask logits according to their respective confidence scores. In the second stage, we check each sorted instance mask logit for overlap with the other object instances; if the overlap is higher than a given overlap threshold, we discard the other object instance.

After filtering the object instances, we have the class prediction, bounding box prediction and mask logit ML_A of each instance. We simultaneously obtain semantic logits with N channels from the semantic head, where N is the sum of N_{'stuff'} and N_{'thing'}. We then compute a second mask logit ML_B for each instance, where we select the channel of the semantic logits based on its class prediction. We only keep the logit scores of the selected channel for the area within the instance bounding box, while we zero out the scores that are outside this region. In the end, we have two mask logits for each instance: one from the instance segmentation head and the other from the semantic segmentation head. We combine these two logits adaptively by computing the Hadamard product of the sum of the sigmoids of ML_A and ML_B and the sum of ML_A and ML_B, to obtain the fused mask logits FL of the instances, expressed as

FL = (\sigma(ML_A) + \sigma(ML_B)) \odot (ML_A + ML_B),   (13)

where \sigma(\cdot) is the sigmoid function and \odot is the Hadamard product. We then concatenate the fused mask logits of the object instances with the 'stuff' logits along the channel dimension to generate the intermediate panoptic logits. Subsequently, we apply the argmax operation along the channel dimension to obtain the intermediate panoptic prediction. In the final step, we take a zero-filled canvas and first copy the instance-specific 'thing' predictions from the intermediate panoptic prediction. We then fill the empty parts of the canvas with the 'stuff' class predictions by copying them from the predictions of the semantic head, while ignoring classes whose area is smaller than a predefined threshold called the minimum stuff area. This gives us the final panoptic segmentation output.

We fuse the ML_A and ML_B instance logits in the aforementioned manner because, if both logits for a given pixel conform with each other, the final instance score increases proportionately to their agreement, or vice versa. In case of agreement, the corresponding object instance will dominate, or be superseded by, the scores of other instances as well as the 'stuff' classes. Similarly, in case of disagreement, the score of the given object instance will reflect the extent of their difference. Simply put, the fused logit score is either adaptively attenuated or amplified according to the consensus. We evaluate the performance of our proposed panoptic fusion module in comparison to other existing methods in the ablation study presented in Section 4.4.5.
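The core of the fusion can be sketched in a few lines; instance filtering, resizing of the 28×28 mask logits and the final canvas-filling step are omitted, and the function names are illustrative.

```python
import torch

def fuse_mask_logits(ml_a, ml_b):
    """Eq. (13): adaptively attenuate or amplify the instance mask logits ML_A using the
    box-masked semantic logits ML_B, depending on how strongly the two agree."""
    return (torch.sigmoid(ml_a) + torch.sigmoid(ml_b)) * (ml_a + ml_b)

def intermediate_panoptic(fused_instance_logits, stuff_logits):
    """Concatenate fused 'thing' instance logits with the 'stuff' logits along the channel
    dimension and take the per-pixel argmax to obtain the intermediate panoptic prediction."""
    logits = torch.cat([stuff_logits, fused_instance_logits], dim=0)  # (N_stuff + N_instances, H, W)
    return logits.argmax(dim=0)
```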

4 Experimental Results

In this section, we first describe the standard evaluation metrics that we adopt for our empirical evaluations, followed by brief descriptions of the datasets that we benchmark on in Section 4.1. We then present extensive quantitative comparisons and benchmarking results in Section 4.3, and detailed ablation studies on the various proposed architectural components in Section 4.4. Finally, we present qualitative comparisons and visualizations of panoptic segmentation on each of the datasets that we evaluate on in Section 4.5 and Section 4.6 respectively.

We use PyTorch (Paszke et al 2019) for implementing all our architectures, and we trained our models on a system with an Intel Xeon 2.20 GHz processor and NVIDIA TITAN X GPUs. We use the standard Panoptic Quality (PQ) metric (Kirillov et al 2019) for quantifying the performance of our models. The PQ metric is computed as

PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|},   (14)

where TP, FP, FN and IoU are the true positives, false positives, false negatives and the intersection-over-union, respectively. The IoU is computed as IoU = TP/(TP + FP + FN). We also report the Segmentation Quality (SQ) and Recognition Quality (RQ) metrics, computed as

SQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP|},   (15)

RQ = \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}.   (16)

Following the standard benchmarking criteria for panoptic segmentation, we report PQ, SQ and RQ over all the classes in the dataset, and we also report them separately for the 'stuff' classes (PQ^St, SQ^St, RQ^St) and the 'thing' classes (PQ^Th, SQ^Th, RQ^Th). Additionally, for the sake of completeness, we report the Average Precision (AP) and the mean Intersection-over-Union (mIoU) for both 'stuff' and 'thing' classes, as well as the inference time and FLOPs for comparison. The implementation of our proposed EfficientPS model and a live demo on the various datasets is publicly available at https://rl.uni-freiburg.de/research/panoptic.
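Given per-class match statistics (segments matched as true positives when their IoU exceeds 0.5), Equations (14)-(16) reduce to a few lines; this sketch omits the matching itself.

```python
def panoptic_quality(num_tp, num_fp, num_fn, sum_iou):
    """PQ, SQ and RQ from Eqs. (14)-(16); note that PQ factorizes as SQ * RQ."""
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    sq = sum_iou / num_tp if num_tp else 0.0
    rq = num_tp / denom if denom else 0.0
    return sq * rq, sq, rq

# example: 80 matched segments with accumulated IoU 68.0, 10 false positives, 12 false negatives
print(panoptic_quality(80, 10, 12, 68.0))
```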

4.1 Datasets

We benchmark our proposed EfficientPS for panoptic segmentation on four challenging urban scene understanding datasets, namely Cityscapes (Cordts et al 2016), KITTI (Geiger et al 2013), Mapillary Vistas (Neuhold et al 2017) and the Indian Driving Dataset (Varma et al 2019). The KITTI benchmark does not provide panoptic annotations; therefore, to facilitate this work, we publicly release manually annotated panoptic groundtruth segmentation labels for the popular KITTI benchmark. These four diverse datasets contain images that range from congested city driving scenarios to rural scenes and highways. They also contain scenes in challenging perceptual conditions including snow, motion blur and other seasonal visual changes. We briefly describe the characteristics of these datasets in this section.

Cityscapes: The Cityscapes dataset (Cordts et al 2016) consists of urban street scenes and focuses on the semantic understanding of common driving scenarios. It is one of the most challenging datasets for panoptic segmentation due to its sheer diversity, as it covers scenes from over 50 European cities recorded over several seasons such as spring, summer and fall. The presence of a large number of dynamic objects further adds to its complexity. Figure 5 (a) shows an

Figure 5 Example images from the challenging urban scene understanding datasets that we benchmark on, namely (a) Cityscapes, (b) KITTI, (c) Mapillary Vistas and (d) the Indian Driving Dataset (IDD), each showing the RGB image and the corresponding panoptic groundtruth. The images show cluttered urban scenes with many dynamic objects, occluded objects, perpetual snowy conditions and unstructured environments.

example image and the corresponding panoptic groundtruth annotation from the Cityscapes dataset. As we can see from this example, the scenes are extremely cluttered with many dynamic objects such as pedestrians and cyclists that are often grouped near one another or partially occluded. These factors make panoptic segmentation, especially segmenting the 'thing' classes, exceedingly challenging.

The widely used Cityscapes dataset recently introduced a benchmark for the task of panoptic segmentation. The dataset contains pixel-level annotations for 19 object classes, of which 11 are 'stuff' classes and 8 are instance-specific 'thing' classes. It consists of 5000 finely annotated images and 20000 coarsely annotated images that were captured at a resolution of 2048×1024 pixels using an automotive-grade 22 cm baseline stereo camera. The finely annotated images are divided into 2975 for training, 500 for validation and 1525 for testing. The annotations for the test set are not publicly released; rather, they are only available to the online evaluation server that automatically computes the metrics and publishes the results. We report the performance of our proposed EfficientPS on both the validation set as well as the test set. We also use the Cityscapes dataset for evaluating the improvements due to the various architectural contributions that we make in the ablation study. We report results on the validation set for our model trained only on the fine annotations, and we report the results on the test set from the benchmarking server for our model trained on both the fine and coarse annotations.

KITTI: The KITTI vision benchmark suite (Geiger et al 2013) is one of the most comprehensive datasets that provides groundtruth for a variety of tasks such as semantic segmentation, scene flow estimation, optical flow estimation, depth prediction, odometry estimation, tracking and road lane detection. However, it has not yet expanded its annotations to support the recently introduced panoptic segmentation task. The challenging nature of the KITTI scenes and its potential for benchmarking multi-task learning problems make extending this dataset to include panoptic annotations of great interest to the community. Therefore, in this work, we introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI vision benchmark suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set; therefore, in addition to panoptic segmentation, they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually.

Our dataset consists of a total of 1055 images, out of which 855 are used for the training set and 200 for the validation set. We provide annotations for 11 'stuff' classes and 8 'thing' classes, adhering to the Cityscapes 'stuff' and 'thing' class distribution. In order to create the panoptic annotations, we gathered semantic annotations from community-driven extensions of KITTI (Xu et al 2016; Ros et al 2015) and combined them with the 200 training images from the KITTI semantic training set. We then manually annotated all the images with instance masks, followed by merging them with the semantic annotations to generate the panoptic segmentation groundtruth labels. The images in our KITTI panoptic segmentation dataset have a resolution of 1280×384 pixels and contain scenes from both residential and inner city scenarios. Figure 5 (b) shows an example image from the KITTI panoptic segmentation dataset and its corresponding panoptic segmentation labels. We observe that the car denoted in teal color pixels and the van are both partially occluded by other 'stuff' classes such that the object instance is split into two disjoint components. We find that scenarios such as these are extremely challenging for the task of panoptic segmentation, as the disjoint object mask has to be assigned to the same instance ID. We hope that this dataset encourages innovative solutions to such real-world problems that are uncommon in other datasets, and also accelerates research in multi-task learning for urban scene understanding.

Mapillary Vistas: Mapillary Vistas (Neuhold et al 2017) is one of the largest publicly available street-level imagery datasets that contains pixel-accurate and instance-specific semantic annotations. The novel aspects of this dataset include diverse scenes from over six continents and in a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18000 images for training, 2000 images for validation and 5000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that due to the snowy conditions, recognizing distant objects, such as the car in this example, becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset: The Indian Driving Dataset (IDD) (Varma et al 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructure such as lanes and sidewalks. It has significantly more 'thing' instances in each scene compared to other datasets, and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car, and the data was gathered in two Indian cities as well as their outskirts. IDD consists of a total of 10003 images, of which 6993 are used for training, 981 for validation and 2029 for testing. The images have a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3, therefore we report the results of our model on the same level to facilitate comparison. This level comprises a total of 26 classes, of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset are shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels, where the crops are taken from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky et al 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio 2010) for the other layers, zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set T_H = 0.7, T_L = 0.3 and T_N = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of c_t = 0.5, an overlap threshold of o_t = 0.5 and a minimum stuff area of min_{sa} = 2048.

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone and continuing training until convergence. We denote the base learning rate lr_{base}, the milestones and the total number of iterations t_i for each dataset in the following format: {lr_{base}, milestone, milestone, t_i}. The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are {0.07, 32K, 44K, 50K}, {0.07, 144K, 176K, 192K}, {0.07, 16K, 22K, 25K} and {0.07, 108K, 130K, 144K} respectively. At the beginning of the training, we have a warm-up phase where lr_{base} is increased linearly from \frac{1}{3} \cdot lr_{base} to lr_{base} in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^{-4}. The final loss L_{total} that we optimize is computed as

L_{total} = L_{semantic} + L_{instance},   (17)

where L_{semantic} and L_{instance} are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA TITAN X GPUs, where each GPU processes a single image.
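The multi-step schedule with linear warm-up can be sketched as follows, shown with the Cityscapes settings from the text; the scheduler is stepped once per iteration, and the helper name is illustrative.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def multistep_with_warmup(optimizer, milestones, warmup_iters=200, warmup_factor=1.0 / 3.0, gamma=0.1):
    """Linear warm-up from lr_base/3 to lr_base over 200 iterations, then a 10x drop at each milestone."""
    def factor(it):
        if it < warmup_iters:
            alpha = it / warmup_iters
            return warmup_factor * (1.0 - alpha) + alpha
        return gamma ** sum(it >= m for m in milestones)
    return LambdaLR(optimizer, factor)

model = torch.nn.Linear(8, 2)                                    # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.07, momentum=0.9)
sched = multistep_with_warmup(opt, milestones=[32_000, 44_000])  # Cityscapes: drops at 32K and 44K
for it in range(5):                                              # in training, step once per iteration
    opt.step()
    sched.step()
```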

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts, while for KITTI and IDD we report results for models that we trained using the official implementations publicly released by the authors, after further tuning of the hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets, and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that at the time of submission, only the Cityscapes benchmark provides the option to evaluate results


Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metrics are in %; '−' denotes that the metric has not been reported for the corresponding method.

Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU

Single-scale evaluation:
WeaklySupervised | − | 47.3 | − | − | 39.6 | − | − | 52.9 | − | − | 24.3 | 71.6
TASCNet | − | 55.9 | − | − | 50.5 | − | − | 59.8 | − | − | − | −
Panoptic FPN | − | 58.1 | − | − | 52.0 | − | − | 62.5 | − | − | 33.0 | 75.7
AUNet | − | 59.0 | − | − | 54.8 | − | − | 62.1 | − | − | 34.4 | 75.6
UPSNet | − | 59.3 | 79.7 | 73.0 | 54.6 | 79.3 | 68.7 | 62.7 | 80.1 | 76.2 | 33.3 | 75.2
DeeperLab | − | 56.3 | − | − | − | − | − | − | − | − | − | −
Seamless | − | 60.3 | − | − | 56.1 | − | − | 63.3 | − | − | 33.6 | 77.5
SSAP | − | 61.1 | − | − | 55.0 | − | − | − | − | − | − | −
AdaptIS | − | 62.0 | − | − | 58.7 | − | − | 64.4 | − | − | 36.3 | 79.2
Panoptic-DeepLab | − | 63.0 | − | − | − | − | − | − | − | − | 35.3 | 80.5
EfficientPS (ours) | − | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3
TASCNet | COCO | 59.3 | − | − | 56 | − | − | 61.5 | − | − | 37.6 | 78.1
UPSNet | COCO | 60.5 | 80.9 | 73.5 | 57.0 | − | − | 63.0 | − | − | 37.8 | 77.8
Seamless | Vistas | 65.0 | − | − | 60.7 | − | − | 68.0 | − | − | − | 80.7
Panoptic-Deeplab | Vistas | 65.3 | − | − | − | − | − | − | − | − | 38.8 | 82.5
EfficientPS (ours) | Vistas | 66.1 | 82.5 | 78.9 | 62.7 | 81.9 | 75.2 | 68.5 | 82.9 | 81.6 | 41.9 | 81.0

Multi-scale evaluation:
Panoptic-DeepLab | − | 64.1 | − | − | − | − | − | − | − | − | 38.5 | 81.5
EfficientPS (ours) | − | 65.1 | 82.2 | 79.0 | 61.5 | 81.4 | 75.4 | 67.7 | 82.8 | 81.7 | 39.7 | 80.3
TASCNet | COCO | 60.4 | − | − | 56.1 | − | − | 63.3 | − | − | 39.1 | 78.7
M-RCNN + PSPNet | COCO | 61.2 | 80.9 | 74.4 | 54.0 | − | − | 66.4 | − | − | 36.4 | 80.9
UPSNet | COCO | 61.8 | 81.3 | 74.8 | 57.6 | 77.7 | 70.5 | 64.8 | 81.4 | − | 39.0 | 79.2
Panoptic-Deeplab | Vistas | 67.0 | − | − | − | − | − | − | − | − | 42.5 | 83.1
EfficientPS (ours) | Vistas | 67.5 | 83.2 | 80.2 | 63.5 | 82.2 | 77.2 | 70.4 | 83.9 | 82.4 | 43.8 | 82.1

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Network | Pre-training | PQ | SQ | RQ | PQTh | PQSt
SSAP | − | 58.9 | 82.4 | 70.6 | 48.4 | 66.5
TASCNet | COCO | 60.7 | 81.0 | 73.8 | 53.4 | 66.0
Panoptic-DeepLab | − | 62.3 | 82.4 | 74.8 | 52.1 | 69.7
Seamless | Vistas | 62.6 | 82.1 | 75.3 | 56.0 | 67.5
Panoptic-DeepLab | Vistas | 65.5 | 77.8 | 75.2 | 56.5 | 72.0
EfficientPS (ours) | − | 64.1 | 82.6 | 76.8 | 56.7 | 69.4
EfficientPS (ours) | Vistas | 66.4 | 83.5 | 78.8 | 59.3 | 71.5

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures.

Network | Input Size (pixels) | Params (M) | FLOPs (B) | Time (ms)
DeeperLab | 1025×2049 | − | − | 463
UPSNet | 1024×2048 | 45.05 | 487.02 | 202
Seamless | 1024×2048 | 51.43 | 514.00 | 168
Panoptic-DeepLab | 1025×2049 | 46.73 | 547.49 | 175
EfficientPS (ours) | 1024×2048 | 40.89 | 433.94 | 166

on the test set. On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75, 2) during the multi-scale evaluations.
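Multi-scale evaluation of the semantic logits can be sketched as below. This is a simplified illustration with our own function and variable names; it averages softmax scores over the listed scales and their horizontal flips, which is one common way such test-time augmentation is implemented, and is not claimed to be the exact evaluation script used for the reported numbers.

```python
import torch
import torch.nn.functional as F

SCALES = (0.75, 1.0, 1.25, 1.5, 1.75, 2.0)

@torch.no_grad()
def multiscale_semantic_probs(semantic_model, image, scales=SCALES):
    """image: (1, 3, H, W) tensor; returns class probabilities averaged over
    all scales and horizontal flips, resized back to (H, W)."""
    _, _, h, w = image.shape
    avg_probs = 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear',
                               align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = semantic_model(inp)              # (1, C, s*H, s*W)
            if flip:
                logits = torch.flip(logits, dims=[3])  # undo the flip
            probs = torch.softmax(logits, dim=1)
            probs = F.interpolate(probs, size=(h, w), mode='bilinear',
                                  align_corners=False)
            avg_probs = avg_probs + probs
    return avg_probs / (len(scales) * 2)
```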

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al 2018b), TASCNet (Li et al 2018a), Panoptic FPN (Kirillov et al 2019), AUNet (Li et al 2019b), UPSNet (Xiong et al 2019), DeeperLab (Yang et al 2019), Seamless (Porzi et al 2019), SSAP (Gao et al 2019), AdaptIS (Sofiiuk et al 2019) and Panoptic-DeepLab (Cheng et al 2019). Table 1 shows the results on the Cityscapes validation set. For a fair comparison, we categorize models in the table separately according to those that report


Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. − denotes that the metric has not been reported for the corresponding method. All values are in (%).

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU

Single-scale evaluation:
JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | −
DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3
TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | −
AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | −
Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4
Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3
EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6

Multi-scale evaluation:
TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | −
Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8
EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1

single-scale and multi-scale evaluation, as well as without any pre-training and with pre-training on other datasets, namely Mapillary Vistas (Neuhold et al 2017) denoted as Vistas and Microsoft COCO (Lin et al 2014) abbreviated as COCO. We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations, depth data, or exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQSt, PQTh, SQ and RQ metrics and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS without pre-training on any extra data achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime including the time consumed for fusing the semantic and instance predictions that yields the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.
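End-to-end runtimes of the kind reported in Table 3 are typically measured along the following lines. This is a generic timing sketch (the model object and input shape are placeholders), not the exact benchmarking script used for the paper.

```python
import time
import torch

@torch.no_grad()
def measure_runtime_ms(model, input_size=(1, 3, 1024, 2048), warmup=10, iters=100):
    """Average forward time in milliseconds on a CUDA device, including any
    panoptic fusion that is part of the model's forward pass."""
    model.eval().cuda()
    x = torch.randn(*input_size, device='cuda')
    for _ in range(warmup):       # warm-up runs to exclude CUDA init / autotuning
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()      # wait for all kernels before stopping the clock
    return (time.perf_counter() - start) / iters * 1000.0
```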

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU

Single-scale evaluation:
Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1
UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6
Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8
EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3

Multi-scale evaluation:
Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8
UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2
Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1
EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU

Single-scale evaluation:
Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1
UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4
Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6
EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3

Multi-scale evaluation:
Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1
UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6
Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3
EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1

state-of-the-art Panoptic-DeepLab. Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQSt score, primarily due to the output stride of 16 at which it operates, which increases the computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQSt of Panoptic-DeepLab. On the other hand, top-down approaches tend to have better instance segmentation ability, as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4.0% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQSt score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on the various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '-' refers to the standard configuration as in Kirillov et al (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Model | Encoder | 2-way FPN | SIH | SH | PFM | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
M1 | ResNet-50 | - | - | - | - | 58.2 | 79.1 | 72.0 | 52.4 | 78.8 | 67.2 | 62.4 | 79.4 | 75.6 | 31.6 | 74.3
M2 | ResNet-50 | - | - | - | X | 58.8 | 79.5 | 72.6 | 53.4 | 79.2 | 62.8 | 67.9 | 79.7 | 76.1 | 33.8 | 75.1
M3 | ResNet-50 | - | X | - | X | 58.6 | 79.4 | 72.4 | 53.1 | 79.1 | 67.5 | 62.6 | 79.6 | 75.9 | 33.7 | 75.0
M4 | EfficientNet-B5 | - | X | - | X | 59.7 | 79.9 | 73.3 | 54.7 | 76.6 | 68.1 | 63.3 | 80.3 | 79.5 | 34.1 | 76.3
M5 | EfficientNet-B5 | X | X | - | X | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
M6 | EfficientNet-B5 | X | X | X | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of the various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and compare it with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the same network configuration and panoptic fusion heuristics as Kirillov et al (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network is comprised of an upsampling stage which has a 3×3 convolution, group norm (Wu and He 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax, to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.
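The baseline semantic head used in M1 can be sketched roughly as follows. This is an illustrative PyTorch reimplementation of the described structure; the channel widths (256 FPN channels, 128 intermediate channels, 19 Cityscapes classes) are assumptions and the module names are our own, so it should not be taken as the code of Kirillov et al (2019).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_gn_relu(in_ch, out_ch):
    # 3x3 convolution + GroupNorm + ReLU, one "upsampling stage" body
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.GroupNorm(32, out_ch),
        nn.ReLU(inplace=True),
    )

class BaselineSemanticHead(nn.Module):
    """Panoptic-FPN-style head: each FPN level is repeatedly processed by
    (3x3 conv, GN, ReLU, x2 bilinear upsampling) until it reaches 1/4 of the
    input resolution, the per-level features are summed, and a 1x1 conv with
    x4 upsampling and softmax produces the semantic output."""
    def __init__(self, in_ch=256, mid_ch=128, num_classes=19):
        super().__init__()
        # number of x2 upsamplings per level: P32 -> 3, P16 -> 2, P8 -> 1, P4 -> 0
        self.num_ups = (3, 2, 1, 0)
        self.blocks = nn.ModuleList()
        for n in self.num_ups:
            stages = [conv_gn_relu(in_ch, mid_ch)]
            stages += [conv_gn_relu(mid_ch, mid_ch) for _ in range(max(n - 1, 0))]
            self.blocks.append(nn.ModuleList(stages))
        self.classifier = nn.Conv2d(mid_ch, num_classes, 1)

    def forward(self, p32, p16, p8, p4):
        fused = 0
        for feat, stages, n in zip((p32, p16, p8, p4), self.blocks, self.num_ups):
            x = feat
            for i, stage in enumerate(stages):
                x = stage(x)
                if i < n:  # upsample after each stage until 1/4 scale is reached
                    x = F.interpolate(x, scale_factor=2, mode='bilinear',
                                      align_corners=False)
            fused = fused + x
        logits = self.classifier(fused)
        logits = F.interpolate(logits, scale_factor=4, mode='bilinear',
                               align_corners=False)
        return torch.softmax(logits, dim=1)
```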

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M with a drop of 0.2% in PQ and a 0.1% drop in the AP and mIoU scores. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.
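The parameter saving in M3 comes from swapping standard convolutions for depthwise separable ones. A minimal sketch of such a replacement block is shown below; the class name and default settings are ours and purely illustrative.

```python
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution.

    A standard 3x3 conv uses 9*Cin*Cout weights, whereas this pair uses
    9*Cin + Cin*Cout, which is what reduces the instance head parameters."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```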

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and Leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively, enabling it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, the improvement cannot be solely attributed to the semantic head. This is due to the fact that if we employ standard panoptic fusion heuristics, an improvement in semantic


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Encoder | Params (M) | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2
ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1
ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2
Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1
ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9
Mod. EfficientNet-B5 (Ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Architecture | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3
Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2
PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8
2-way FPN (Ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

segmentation would only contribute to an increase in the PQSt score. However, our proposed adaptive panoptic fusion yields an improvement in PQTh as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al 2019), ResNet-50 (He et al 2016), ResNet-101 (He et al 2016), Xception-71 (Chollet 2017), ResNeXt-101 (Xie et al 2017) and EfficientNet-B5 (Tan and Le 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have a comparable or fewer number of parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al 2017), the bottom-up FPN and the PANet FPN variants. We refer to the FPN architecture described in Liu et al (2018) as PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. These models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the os pyramid scale level. ck_f refers to a convolution layer with f filters and a k×k kernel size. LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
M61 | [c3_128, c3_128] | [c3_128, c3_128] | [c3_128, c3_128] | [c3_128, c3_128] | - | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2
M62 | LSFE | LSFE | LSFE | LSFE | - | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4
M63 | DPC | LSFE | LSFE | LSFE | - | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1
M64 | DPC | DPC | LSFE | LSFE | - | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6
M65 | DPC | DPC | DPC | LSFE | - | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2
M66 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

the standard FPN, in a sequential or parallel manner respectively. We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.
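The difference between the sequential (PANet-style) and parallel (2-way) arrangement can be illustrated schematically as follows. This is a rough sketch with our own module names: it assumes simple 1×1 lateral projections, nearest-neighbor resampling and element-wise summation of the two branches, and it omits the per-level output convolutions of the actual EfficientPS backbone described in Section 3.1.

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoWayFPNSketch(nn.Module):
    """Schematic 2-way FPN: a top-down and a bottom-up branch operate in
    parallel on the encoder features and their per-level outputs are summed."""
    def __init__(self, in_channels, out_ch=256):
        super().__init__()
        self.lateral_td = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.lateral_bu = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])

    def forward(self, feats):
        # feats: encoder outputs ordered fine -> coarse (1/4, 1/8, 1/16, 1/32)
        td = [l(f) for l, f in zip(self.lateral_td, feats)]
        for i in range(len(td) - 2, -1, -1):            # top-down: coarse -> fine
            td[i] = td[i] + F.interpolate(td[i + 1], size=td[i].shape[-2:],
                                          mode='nearest')
        bu = [l(f) for l, f in zip(self.lateral_bu, feats)]
        for i in range(1, len(bu)):                     # bottom-up: fine -> coarse
            bu[i] = bu[i] + F.interpolate(bu[i - 1], size=bu[i].shape[-2:],
                                          mode='nearest')
        # the parallel pathways are merged per level by element-wise summation
        return [t + b for t, b in zip(td, bu)]
```

In contrast, a PANet-style FPN would run the bottom-up pass on the outputs of the top-down pass, i.e. sequentially rather than in parallel.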

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and Leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output which is 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output which is the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model; therefore, we employ separable convolutions in all the experiments that follow.
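A minimal sketch of the LSFE block as described above (two cascaded 3×3 separable convolutions, each followed by normalization and Leaky ReLU with slope 0.01) is given below. Since iABN sync is an external dependency, we approximate it here with standard BatchNorm2d purely to keep the sketch self-contained; the class and argument names are ours.

```python
import torch.nn as nn

class LSFE(nn.Module):
    """Large Scale Feature Extractor: two cascaded 3x3 separable convolutions,
    each followed by a normalization layer and Leaky ReLU.

    Note: the paper uses synchronized iABN; plain BatchNorm2d is used here
    only as a stand-in for illustration."""
    def __init__(self, in_ch, out_ch=128):
        super().__init__()
        def sep_conv_block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cin, 3, padding=1, groups=cin, bias=False),  # depthwise
                nn.Conv2d(cin, cout, 1, bias=False),                        # pointwise
                nn.BatchNorm2d(cout),
                nn.LeakyReLU(0.01, inplace=True),
            )
        self.block = nn.Sequential(sep_conv_block(in_ch, out_ch),
                                   sep_conv_block(out_ch, out_ch))

    def forward(self, x):
        return self.block(x)
```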

In the M63 model, we replace the LSFE module at the P32 level of the 2-way FPN with the dense prediction cells (DPC) described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module at the P16 level with DPC, and in the subsequent M65 model, we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with the fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, resulting in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al (2019), which we denote as the baseline, as well as those of UPSNet (Xiong et al 2019) and Seamless (Porzi et al 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Semantic Head | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1
Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Model | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt
Baseline | 62.4 | 80.8 | 75.4 | 58.7 | 80.4 | 72.6 | 65.1 | 81.1 | 77.4
TASCNet | 62.5 | 80.9 | 75.6 | 58.6 | 80.5 | 72.8 | 65.3 | 81.2 | 77.7
UPSNet | 63.1 | 81.3 | 76.1 | 59.5 | 80.6 | 73.2 | 65.7 | 81.8 | 78.2
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, even though the levels are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at the different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al (2019), which we consider as a baseline as it is extensively used in several panoptic segmentation networks. We then compare with the Mask-Guided fusion (Li et al 2018a) and the panoptic fusion heuristics proposed in Xiong et al (2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and the instance head, which have an inherent overlap, is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them, and therefore relies on the mode computed over the segmentation output to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.
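To make the comparison concrete, the baseline heuristic of Kirillov et al (2019) can be sketched roughly as follows. This is our own simplified illustration (array and key names are hypothetical), not the adaptive fusion module proposed in this paper, whose formulation is given in Section 3.4.

```python
import numpy as np

def baseline_panoptic_fusion(sem_pred, instances, min_stuff_area=2048,
                             num_thing_classes=8):
    """Simplified Kirillov et al. (2019)-style heuristic: instance masks are
    pasted in order of confidence and always win over 'stuff' predictions.

    sem_pred: (H, W) int array of semantic class ids.
    instances: list of dicts with 'mask' (H, W bool), 'class_id' and 'score'.
    """
    h, w = sem_pred.shape
    panoptic = np.full((h, w), -1, dtype=np.int64)  # -1 = unassigned
    segment_id = 0

    # 1. Paste instance masks, resolving overlaps in favor of higher scores.
    for inst in sorted(instances, key=lambda d: d['score'], reverse=True):
        free = inst['mask'] & (panoptic == -1)
        if free.sum() == 0:
            continue
        segment_id += 1
        panoptic[free] = inst['class_id'] * 1000 + segment_id  # encode class + id

    # 2. Fill the remaining pixels with 'stuff' predictions, dropping tiny regions.
    for cls in np.unique(sem_pred):
        if cls < num_thing_classes:   # assumed convention: 'thing' ids come first
            continue
        region = (sem_pred == cls) & (panoptic == -1)
        if region.sum() >= min_stuff_area:
            panoptic[region] = cls * 1000
    return panoptic
```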

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model, Panoptic-DeepLab, does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding


[Figure 6: rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD; columns: Input Image, Seamless Output, EfficientPS Output, Improvement/Error Map]

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al 2019) on different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


datasets. For each example, we show the input image, the corresponding panoptic segmentation output from the Seamless model and our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset, in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN that effectively aggregates multi-scale features to learn semantically richer representations, and the panoptic fusion module that addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c) the group of people towards the left side of the image, who are behind the fence, are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately address this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across the different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head, which correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearance of these two objects; however, our proposed EfficientPS model, with our novel panoptic backbone, has an extensive representational capacity which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of


[Figure 7: rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD; columns (i)-(iv) show different example scenes]

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances in multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk as well as seasonal changes.


traffic participants. These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighboring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from viewpoints that are uncommon compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even when the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analysis and visualizations that highlight the improvements that we make to various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and to exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335
Badrinarayanan V, Kendall A, Cipolla R (2015) Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229
Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons
Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision, Springer, pp 44–57
Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848
Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems, pp 8713–8724
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-deeplab: A simple, strong and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, pp 534–549
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the International Conference on Computer Vision, pp 764–773
Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) Ssap: Single-shot instance segmentation with affinity pyramid. In: Proceedings of the International Conference on Computer Vision, pp 642–651
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The kitti dataset. International Journal of Robotics Research
de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110
Girshick R (2015) Fast r-cnn. In: Proceedings of the International Conference on Computer Vision, pp 1440–1448
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256
Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, pp 297–312
Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask r-cnn. In: Proceedings of the International Conference on Computer Vision, pp 2961–2969
He X, Gould S (2014a) An exemplar-based crf for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
He X, Gould S (2014b) An exemplar-based crf for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for mobilenetv3. In: Proceedings of the International Conference on Computer Vision, pp 1314–1324
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059
Kang BR, Kim HY (2018) Bshapenet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327
Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413
Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In: Proceedings of the International Conference on Computer Vision, pp 2190–2197
Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In: Advances in Neural Information Processing Systems, pp 109–117
Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192
Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118
Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229
Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: European Conference on Computer Vision, Springer, pp 740–755
Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181
Liu S, Jia J, Fidler S, Urtasun R (2017) Sgn: Sequential grouping networks for instance segmentation. In: Proceedings of the International Conference on Computer Vision, pp 3496–3504
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768
Liu W, Rabinovich A, Berg AC (2015) Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The mapillary vistas dataset for semantic understanding of street scenes. In: Proceedings of the International Conference on Computer Vision, pp 4990–4999
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035
Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998
Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824
Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286
Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887
Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664
Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329
Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238
Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of dnns. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sofiiuk K, Barinova O, Konushin A (2019) Adaptis: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363
Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference
Tan M, Le QV (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135
Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision 63(2):113–140
Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25
Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots
Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics
Valada A, Vertens J, Dhall A, Burgard W (2017) Adapnet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651
Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)
Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, special issue: Deep Learning for Robotic Vision
Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751
Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) Upsnet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826
Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349
Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) Deeperlab: Single-shot image parser. arXiv preprint arXiv:1902.05093
Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721
Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected mrfs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890
Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227


the explicit modelling of interdependencies between channels of the convolutional feature maps that is enabled by the SE connections tends to suppress localization of features in favour of contextual elements. This property is desirable in classification networks; however, both localization and context are equally important for segmentation tasks, therefore we do not add any SE connections in our backbone. Second, we replace all the batch normalization (Ioffe and Szegedy 2015) layers with synchronized Inplace Activated Batch Normalization (iABN sync) (Rota Bulo et al 2018). This enables synchronization across different GPUs, which in turn yields a better estimate of the gradients while performing multi-GPU training, and the in-place operations free up additional GPU memory. We analyze the performance of our modified EfficientNet in comparison to other encoders commonly used in state-of-the-art architectures in the ablation study presented in Section 4.4.2.

Our EfficientNet encoder comprises nine blocks, as shown in Figure 2 (in red). We refer to each block in the figure as block 1 to block 9 in a left-to-right manner. The outputs of block 2, block 3, block 5 and block 9 correspond to downsampling factors of ×4, ×8, ×16 and ×32 respectively, and serve as the inputs to our 2-way FPN. The conventional FPN used in other panoptic segmentation networks aims to address the problem of multi-scale feature fusion by aggregating features of different resolutions in a top-down manner. This is performed by first employing a 1×1 convolution to reduce or increase the number of channels of the different encoder output resolutions to a predefined number, typically 256. Then the lower resolution features are upsampled to the next higher resolution and added together. For example, the ×32 resolution encoder output features are resized to the ×16 resolution and added to the ×16 resolution encoder output features. Finally, a 3×3 convolution is used at each scale to further learn fused features, which yields the P4, P8, P16 and P32 outputs. This FPN topology has a limited, unidirectional flow of information, resulting in an ineffective fusion of multi-scale features. Therefore, we propose to mitigate this problem by adding a second branch that aggregates multi-scale features in a bottom-up fashion to enable a bidirectional flow of information.

Our proposed 2-way FPN, shown in Figure 2, consists of two parallel branches. Each branch consists of a 1×1 convolution with 256 output filters at each scale for channel reduction. The top-down branch, shown in blue, follows the aggregation scheme of a conventional FPN from right to left, whereas the bottom-up branch, shown in purple, downsamples the higher resolution features to the next lower resolution from left to right and subsequently adds them with the next lower resolution encoder output features. For example, the ×4 resolution features are resized to the ×8 resolution and added to the ×8 resolution encoder output features.

Figure 3: Topologies of the various architectural components in our proposed semantic head and instance head of the EfficientPS architecture: (a) LSFE module, (b) RPN module, (c) MC module and (d) DPC module.

Then, in the next stage, the outputs from the bottom-up and top-down branches at each resolution are summed together and passed through a 3×3 separable convolution with 256 output channels to obtain the P4, P8, P16 and P32 outputs respectively. We employ separable convolutions, as opposed to standard convolutions, in an effort to keep the parameter consumption low. We evaluate the performance of our proposed 2-way FPN in comparison to the conventional FPN in the ablation study presented in Section 4.4.3.
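To make the aggregation scheme concrete, the following is a minimal PyTorch sketch of the 2-way FPN described above. The encoder channel counts in the example are dummy values, nearest-neighbor resizing is used for both branches, and BatchNorm2d stands in for the iABN sync layers of the paper; it is a simplified illustration rather than the reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoWayFPN(nn.Module):
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        # 1x1 convolutions for channel reduction, one per branch and per scale
        self.lateral_td = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.lateral_bu = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 separable convolutions that produce the P4, P8, P16 and P32 outputs
        self.out_convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(out_channels, out_channels, 3, padding=1, groups=out_channels),
                nn.Conv2d(out_channels, out_channels, 1),
                nn.BatchNorm2d(out_channels),          # iABN sync in the paper
                nn.LeakyReLU(0.01, inplace=True),
            ) for _ in in_channels])

    def forward(self, feats):  # feats: encoder outputs at x4, x8, x16, x32
        td = [l(f) for l, f in zip(self.lateral_td, feats)]
        bu = [l(f) for l, f in zip(self.lateral_bu, feats)]
        # top-down branch: aggregate from the coarsest to the finest scale
        for i in range(len(td) - 2, -1, -1):
            td[i] = td[i] + F.interpolate(td[i + 1], size=td[i].shape[-2:], mode='nearest')
        # bottom-up branch: aggregate from the finest to the coarsest scale
        for i in range(1, len(bu)):
            bu[i] = bu[i] + F.interpolate(bu[i - 1], size=bu[i].shape[-2:], mode='nearest')
        # sum both branches at each scale and refine with a separable 3x3 convolution
        return [conv(t + b) for conv, t, b in zip(self.out_convs, td, bu)]

# Example with dummy channel counts and a 1024x2048 input resolution:
if __name__ == '__main__':
    chans = [40, 64, 176, 352]                                   # placeholder values
    feats = [torch.randn(1, c, 256 // 2**i, 512 // 2**i) for i, c in enumerate(chans)]
    print([p.shape for p in TwoWayFPN(chans)(feats)])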

3.2 Semantic Segmentation Head

Our proposed semantic segmentation head consists of three components, each aimed at targeting one of the critical requirements. First, at large-scale, the network should have the ability to capture fine features efficiently. In order to enable this, we employ our Large Scale Feature Extractor (LSFE) module that has two 3×3 separable convolutions with 128 output filters, each followed by an iABN sync and a Leaky ReLU activation function. The first 3×3 separable convolution reduces the number of filters to 128 and the second 3×3 separable convolution further learns deeper features.
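A minimal sketch of the LSFE module as described above is given below, assuming PyTorch and using BatchNorm2d as a stand-in for iABN sync; the separable_conv helper is a hypothetical convenience function, not part of the released code.

import torch.nn as nn

def separable_conv(in_ch, out_ch):
    # depthwise 3x3 convolution followed by a pointwise 1x1 convolution
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, 1),
    )

class LSFE(nn.Module):
    def __init__(self, in_channels=256, channels=128):
        super().__init__()
        self.block = nn.Sequential(
            separable_conv(in_channels, channels),   # reduces the number of filters to 128
            nn.BatchNorm2d(channels), nn.LeakyReLU(0.01, inplace=True),
            separable_conv(channels, channels),      # learns deeper features
            nn.BatchNorm2d(channels), nn.LeakyReLU(0.01, inplace=True),
        )

    def forward(self, x):
        return self.block(x)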

The second requirement is that, at small-scale, the network should be able to capture long-range context. Modules inspired by Atrous Spatial Pyramid Pooling (ASPP) (Chen et al 2017a) that are widely used in state-of-the-art semantic segmentation architectures have been demonstrated to be effective for this purpose.


Dense Prediction Cells (DPC) (Chen et al 2018a) and the Efficient Atrous Spatial Pyramid Pooling (eASPP) (Valada et al 2019) are two variants of ASPP that are significantly more efficient and also yield a better performance. We find that DPC demonstrates a better performance with a minor increase in the number of parameters compared to eASPP. Therefore, we employ a modified DPC module in our semantic head, as shown in Figure 2. We augment the original DPC topology by replacing batch normalization layers with iABN sync and ReLUs with Leaky ReLUs. The DPC module consists of a 3×3 separable convolution with 256 output channels having a dilation rate of (1,6), which extends out to five parallel branches. Three of the branches each consist of a 3×3 dilated separable convolution with 256 output channels, where the dilation rates are (1,1), (6,21) and (18,15) respectively. The fourth branch takes the output of the dilated separable convolution with a dilation rate of (18,15) as input and passes it through another 3×3 dilated separable convolution with 256 output channels and a dilation rate of (6,3). The outputs from all these parallel branches are then concatenated to yield a tensor with 1280 channels. This tensor is finally passed through a 1×1 convolution with 256 output channels and forms the output of the DPC module. Note that each of the convolutions in the DPC module is followed by an iABN sync and a Leaky ReLU activation function.
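A minimal sketch of this modified DPC topology is shown below, again in PyTorch with BatchNorm2d and Leaky ReLU standing in for iABN sync; the branch naming and the sep_conv helper are illustrative assumptions.

import torch
import torch.nn as nn

def sep_conv(in_ch, out_ch, dilation):
    dh, dw = dilation
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=(dh, dw), dilation=(dh, dw), groups=in_ch),
        nn.Conv2d(in_ch, out_ch, 1),
        nn.BatchNorm2d(out_ch),                     # iABN sync in the paper
        nn.LeakyReLU(0.01, inplace=True),
    )

class DPC(nn.Module):
    def __init__(self, in_channels=256, channels=256):
        super().__init__()
        self.stem = sep_conv(in_channels, channels, (1, 6))
        self.branch_1_1 = sep_conv(channels, channels, (1, 1))
        self.branch_6_21 = sep_conv(channels, channels, (6, 21))
        self.branch_18_15 = sep_conv(channels, channels, (18, 15))
        self.branch_6_3 = sep_conv(channels, channels, (6, 3))   # fed by the (18,15) branch
        self.project = nn.Sequential(
            nn.Conv2d(5 * channels, channels, 1),                # 1280 -> 256 channels
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.01, inplace=True),
        )

    def forward(self, x):
        y = self.stem(x)
        b1 = self.branch_1_1(y)
        b2 = self.branch_6_21(y)
        b3 = self.branch_18_15(y)
        b4 = self.branch_6_3(b3)
        # concatenating the stem output with the four branch outputs yields 1280 channels
        return self.project(torch.cat([y, b1, b2, b3, b4], dim=1))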

The third and final requirement for the semantic head is that it should be able to mitigate the mismatch between large-scale and small-scale features while performing feature aggregation. To this end, we employ our Mismatch Correction (MC) module that correlates the small-scale features with respect to the large-scale features. It consists of cascaded 3×3 separable convolutions with 128 output channels, each followed by an iABN sync with Leaky ReLU, and a bilinear upsampling layer that upsamples the feature maps by a factor of 2. Figures 3 (a), 3 (c) and 3 (d) illustrate the topologies of these main components of our semantic head.
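The MC module can be sketched analogously; the choice of two cascaded separable convolutions and BatchNorm2d in place of iABN sync are assumptions of this illustration.

import torch.nn as nn

class MC(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        layers = []
        for _ in range(2):                                        # cascaded separable convolutions
            layers += [
                nn.Conv2d(channels, channels, 3, padding=1, groups=channels),   # depthwise
                nn.Conv2d(channels, channels, 1),                               # pointwise
                nn.BatchNorm2d(channels),                                       # iABN sync in the paper
                nn.LeakyReLU(0.01, inplace=True),
            ]
        layers.append(nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)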

The four differently scaled outputs of our 2-way FPN, namely P4, P8, P16 and P32, are the inputs to our semantic head. The small-scale inputs, P32 and P16, with downsampling factors of ×32 and ×16, are each fed into two parallel DPC modules, while the large-scale inputs, P8 and P4, with downsampling factors of ×8 and ×4, are each passed through two parallel LSFE modules. Subsequently, the outputs from each of these parallel DPC and LSFE modules are augmented with feature alignment connections and each of them is upsampled by a factor of 4. These upsampled feature maps are then concatenated to yield a tensor with 512 channels, which is then input to a 1×1 convolution with N_{'stuff'+'thing'} output filters. This tensor is finally upsampled by a factor of 2 and passed through a softmax layer to yield the semantic logits having the same resolution as the input image. The feature alignment connections from the DPC and LSFE modules interconnect each of these outputs by element-wise summation, as shown in Figure 2.

We add our MC modules in the interconnections between the second DPC and LSFE, as well as between both the LSFE connections. These correlation connections aggregate contextual information from small-scale features and characteristic large-scale features for better object boundary refinement. We use the weighted per-pixel log-loss (Bulo et al 2017) for training, which is given by

L_{pp}(\Theta) = -\sum_{i,j} w_{ij} \log p_{ij}(p^*_{ij}),    (1)

where p^*_{ij} is the groundtruth for a given image, p_{ij} is the predicted probability for the pixel (i,j) being assigned its groundtruth class, w_{ij} = 4/(W·H) if pixel (i,j) belongs to the 25% worst predictions and w_{ij} = 0 otherwise, and W and H are the width and height of the given input image. The overall semantic head loss is given by

L_{semantic}(\Theta) = \frac{1}{n} \sum L_{pp},    (2)

where n is the batch size. We present an in-depth analysis of our proposed semantic head in comparison to other semantic heads commonly used in state-of-the-art architectures in Section 4.4.4.
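A minimal sketch of the weighted per-pixel log-loss of Equations (1) and (2) is given below: only the hardest 25% of pixels per image receive the non-zero weight 4/(W·H). The tensor shapes and the ignore_index value are assumptions of this illustration.

import torch
import torch.nn.functional as F

def weighted_per_pixel_loss(logits, target, top_fraction=0.25, ignore_index=255):
    """logits: (B, C, H, W) semantic logits; target: (B, H, W) groundtruth labels."""
    b, _, h, w = logits.shape
    # per-pixel negative log-likelihood of the groundtruth class
    nll = F.cross_entropy(logits, target, reduction='none', ignore_index=ignore_index)
    # keep only the hardest 25% of pixels in each image, weighted by 4/(W*H)
    k = max(1, int(top_fraction * h * w))
    top_vals, _ = torch.topk(nll.view(b, -1), k, dim=1)
    loss_per_image = (4.0 / (w * h)) * top_vals.sum(dim=1)       # Equation (1)
    return loss_per_image.mean()                                 # average over the batch, Equation (2)

# Example with random data (19 classes as in Cityscapes):
if __name__ == '__main__':
    logits = torch.randn(2, 19, 64, 128, requires_grad=True)
    target = torch.randint(0, 19, (2, 64, 128))
    print(weighted_per_pixel_loss(logits, target))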

3.3 Instance Segmentation Head

The instance segmentation head of our EfficientPS network, shown in Figure 2, has a topology similar to Mask R-CNN (He et al 2017) with certain modifications. More specifically, we replace all the standard convolutions, batch normalization layers and ReLU activations with separable convolutions, iABN sync and Leaky ReLU respectively. Similar to the rest of our architecture, we use separable convolutions instead of standard convolutions to reduce the number of parameters consumed by the network. This enables us to conserve 2.09 M parameters in comparison to the conventional Mask R-CNN.

Mask R-CNN consists of two stages. In the first stage, the Region Proposal Network (RPN) module, shown in Figure 3 (b), employs a fully convolutional network to output a set of rectangular object proposals and an objectness score for the given input FPN level. The network architecture incorporates a 3×3 separable convolution with 256 output channels, an iABN sync and a Leaky ReLU activation, followed by two parallel 1×1 convolutions with 4k and k output filters respectively. Here, k is the maximum number of object proposals. The k proposals are parameterized relative to k reference bounding boxes, which are called anchors. The scale and aspect ratio define a particular anchor, and they vary depending on the dataset being used. As the generated proposals can potentially overlap, an additional filtering step of Non-Maximum Suppression (NMS) is employed: in case of overlaps, the proposal with the highest objectness score is retained and the rest are discarded.


The RPN first computes object proposals of the form (p_u, p_v, p_w, p_h) and the objectness score σ(p_s), which transform an anchor a of the form (u, v, w, h) to (u+p_u, v+p_v) and (w·e^{p_w}, h·e^{p_h}), where (u, v) is the position of the anchor a in the image coordinate system, (w, h) are its dimensions and σ(·) is the sigmoid function. This is followed by NMS to yield the final candidate bounding box proposals to be used in the next stage.

Subsequently, ROI align (He et al 2017) uses the object proposals to extract features from the FPN encodings by directly pooling features from the n-th FPN level with a 14×14 spatial resolution bounded within a bounding box proposal. Here,

n = \max(1, \min(4, \lfloor 3 + \log_2(\sqrt{wh}/224) \rfloor)),    (3)

where w and h are the width and height of the bounding box proposal. The extracted features then serve as input to the bounding box regression, object classification and mask segmentation networks. The RPN operates on each of the output scales of the FPN sequentially, thereby accumulating candidates from different scales, and ROI align uses the accumulated list for feature extraction from all the scales. The bounding box regression and object classification networks consist of two shared consecutive fully-connected layers with 1024 channels, iABN sync and Leaky ReLU. Subsequently, the feature maps are passed through a fully-connected layer at the end for each task, with 4·N_'thing' outputs and N_'thing'+1 outputs respectively. The object classification logits yield a probability distribution over the classes and void from a softmax layer, whereas the bounding box regression network encodes class-specific correction factors. For a given object proposal and class-specific correction factors (c_u, c_v, c_w, c_h) for a class c, a new bounding box is computed of the form (u+c_u, v+c_v, w·e^{c_w}, h·e^{c_h}), where (u, v) and (w, h) are the position and dimensions of the bounding box proposal respectively.
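The level assignment of Equation (3) can be illustrated with the following short sketch; the function name is hypothetical.

import math

def fpn_level(width: float, height: float) -> int:
    """Return the FPN level n in {1, ..., 4} used by ROI align, following Equation (3)."""
    n = math.floor(3 + math.log2(math.sqrt(width * height) / 224.0))
    return max(1, min(4, n))

# A 224x224 proposal maps to level 3, a 56x56 proposal to level 1,
# and a very large proposal is clipped to level 4.
if __name__ == '__main__':
    print(fpn_level(224, 224), fpn_level(56, 56), fpn_level(900, 900))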

The mask segmentation network employs four consecutive 3×3 separable convolutions with 256 output channels, each followed by an iABN sync and Leaky ReLU. Subsequently, the resulting feature maps are passed through a 2×2 transposed convolution with 256 output channels and an output stride of 2, followed by an iABN sync and a Leaky ReLU activation function. Finally, a 1×1 convolution with N_'thing' output channels is employed, which yields 28×28 logits for each class. The resulting logits give the mask foreground probabilities for the candidate bounding box proposal using a sigmoid function. These are then fused with the semantic logits in our proposed panoptic fusion module described in Section 3.4.

In order to train the instance segmentation head, we adopt the loss functions proposed in Mask R-CNN, i.e., two loss functions for the first stage, the objectness score loss and the object proposal loss, and three loss functions for the second stage, the classification loss, the bounding box loss and the mask segmentation loss. We take a set N_s of randomly sampled positive matches and negative matches such that |N_s| ≤ 256. The objectness score loss L_os, defined as the log loss over N_s, is given by

L_{os}(\Theta) = -\frac{1}{|N_s|} \sum_{(p^*_{os}, p_{os}) \in N_s} \left[ p^*_{os} \cdot \log p_{os} + (1 - p^*_{os}) \cdot \log(1 - p_{os}) \right],    (4)

where p_{os} is the output of the objectness score branch of the RPN and p^*_{os} is the groundtruth label, which is 1 if the anchor is positive and 0 if the anchor is negative. We use the same strategy as Mask R-CNN for defining positive and negative matches. For a given anchor a, if the groundtruth box b^* has the largest Intersection over Union (IoU), or IoU(b^*, a) > T_H, then the corresponding prediction b is a positive match, and b is a negative match when IoU(b^*, a) < T_L. The thresholds T_H and T_L are pre-defined, where T_H > T_L.

The object proposal loss L_op is a regression loss that is defined only on positive matches and is given by

L_{op}(\Theta) = \frac{1}{|N_s|} \sum_{(t^*, t) \in N_p} \sum_{(i^*, i) \in (t^*, t)} L_1(i^*, i),    (5)

where L_1 is the smooth L1 norm, N_p is the subset of positive matches in N_s, and t^* = (t^*_x, t^*_y, t^*_w, t^*_h) and t = (t_x, t_y, t_w, t_h) are the parameterizations of b^* and b respectively. b^* = (x^*, y^*, w^*, h^*) is the groundtruth box and b = (x, y, w, h) is the predicted bounding box, where x, y, w and h are the center coordinates, width and height of the predicted bounding box. Similarly, x^*, y^*, w^* and h^* denote the center coordinates, width and height of the groundtruth bounding box. The parameterizations (Girshick 2015) are given by

t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a},    (6)

t^*_x = \frac{x^* - x_a}{w_a}, \quad t^*_y = \frac{y^* - y_a}{h_a}, \quad t^*_w = \log\frac{w^*}{w_a}, \quad t^*_h = \log\frac{h^*}{h_a},    (7)

where x_a, y_a, w_a and h_a denote the center coordinates, width and height of the anchor a.
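The parameterization of Equations (6) and (7), together with the corresponding decoding used when the predicted correction factors are applied to a box, can be sketched as follows; the function names are hypothetical.

import torch

def encode_boxes(boxes, anchors):
    """boxes, anchors: (N, 4) tensors of (x_center, y_center, w, h); returns (t_x, t_y, t_w, t_h)."""
    x, y, w, h = boxes.unbind(dim=1)
    xa, ya, wa, ha = anchors.unbind(dim=1)
    return torch.stack([(x - xa) / wa, (y - ya) / ha,
                        torch.log(w / wa), torch.log(h / ha)], dim=1)

def decode_boxes(deltas, anchors):
    """Inverse of encode_boxes: apply the correction factors to the anchors."""
    tx, ty, tw, th = deltas.unbind(dim=1)
    xa, ya, wa, ha = anchors.unbind(dim=1)
    return torch.stack([xa + tx * wa, ya + ty * ha,
                        wa * torch.exp(tw), ha * torch.exp(th)], dim=1)

# Round-trip check: decoding the encoded box recovers the original box.
if __name__ == '__main__':
    anchors = torch.tensor([[50.0, 60.0, 32.0, 64.0]])
    boxes = torch.tensor([[55.0, 58.0, 40.0, 70.0]])
    assert torch.allclose(decode_boxes(encode_boxes(boxes, anchors), anchors), boxes)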

Similar to the objectness score loss L_os, the classification loss L_cls is defined for a set K_s of randomly sampled positive and negative matches such that |K_s| ≤ 512. The classification loss L_cls is given by

L_{cls}(\Theta) = -\frac{1}{|K_s|} \sum_{(Y^*, Y) \in K_s} \sum_{c=1}^{N_{'thing'}+1} Y^*_{oc} \cdot \log Y_{oc},    (8)

where Y is the output of the classification branch, Y^* is the one-hot encoded groundtruth label, o is the observed class and c is the correct classification for object o. For a given image, a match is positive if IoU(b^*, b) > T_N and otherwise negative, where b^* is the groundtruth box and b is the object proposal from the first stage.



Figure 4: Illustration of our proposed panoptic fusion module. Here, the MLA and MLB mask logits are fused as (σ(MLA) + σ(MLB)) ⊙ (MLA + MLB), where MLB is the output of the function f*, σ(·) is the sigmoid function and ⊙ is the Hadamard product. The f* function, for a given class prediction c (cyclist in this example), zeroes out the scores of the c channel of the semantic logits outside the corresponding bounding box.

The bounding box loss L_bbx is a regression loss that is defined only on positive matches and is expressed as

L_{bbx}(\Theta) = \frac{1}{|K_s|} \sum_{(T^*, T) \in K_p} \sum_{(i^*, i) \in (T^*, T)} L_1(i^*, i),    (9)

where L_1 is the smooth L1 norm (Girshick 2015), K_p is the subset of positive matches in K_s, and T^* and T are the parameterizations of B^* and B respectively, similar to Equations (4) and (5), where B^* is the groundtruth box and B is the corresponding predicted bounding box.

Finally, the mask segmentation loss is also defined only for positive samples and is given by

L_{mask}(\Theta) = -\frac{1}{|K_s|} \sum_{(P^*, P) \in K_s} L_p(P^*, P),    (10)

where L_p(P^*, P) is given as

L_p(P^*, P) = -\frac{1}{|T_p|} \sum_{(i,j) \in T_p} \left[ P^*_{ij} \cdot \log P_{ij} + (1 - P^*_{ij}) \cdot \log(1 - P_{ij}) \right],    (11)

where P is the predicted 28×28 binary mask for a class, with P_{ij} denoting the probability of the mask pixel (i,j), P^* is the 28×28 groundtruth binary mask for the class, and T_p is the set of non-void pixels in P^*.

All five losses are weighted equally and the total instance segmentation head loss is given by

L_{instance} = L_{os} + L_{op} + L_{cls} + L_{bbx} + L_{mask}.    (12)

Similar to Mask R-CNN, the gradients computed with respect to the losses L_cls, L_bbx and L_mask flow only through the network backbone and not through the region proposal network.

3.4 Panoptic Fusion Module

In order to obtain the panoptic segmentation output, we need to fuse the predictions of the semantic segmentation head and the instance segmentation head. However, fusing both these predictions is not a straightforward task due to the inherent overlap between them. Therefore, we propose a novel panoptic fusion module that tackles the aforementioned problem in an adaptive manner in order to thoroughly exploit the predictions from both heads congruously. Figure 4 shows the topology of our panoptic fusion module. We obtain a set of object instances from the instance segmentation head of our network, where for each instance we have its corresponding class prediction, confidence score, bounding box and mask logits. First, we reduce the number of predicted object instances in two stages. We begin by discarding all object instances that have a confidence score lower than a given confidence threshold. We then resize, zero pad and scale the 28×28 mask logits of each object instance to the same resolution as the input image. Subsequently, we sort the class predictions, bounding boxes and mask logits according to their respective confidence scores. In the second stage, we check each sorted instance mask logit for overlap with the other object instances. In case the overlap is higher than a given overlap threshold, we discard the other object instance.

After filtering the object instances, we have the class prediction, bounding box prediction and mask logit MLA of each instance. We simultaneously obtain semantic logits with N channels from the semantic head, where N is the sum of N_'stuff' and N_'thing'. We then compute a second mask logit MLB for each instance, where we select the channel of the semantic logits based on its class prediction. We only keep the logit scores of the selected channel for the area within the instance bounding box, while we zero out the scores that are outside this region.


In the end, we have two mask logits for each instance: one from the instance segmentation head and the other from the semantic segmentation head. We combine these two logits adaptively by computing the Hadamard product of the sum of the sigmoids of MLA and MLB, and the sum of MLA and MLB, to obtain the fused mask logits FL of the instances, expressed as

FL = (\sigma(ML_A) + \sigma(ML_B)) \odot (ML_A + ML_B),    (13)

where σ(·) is the sigmoid function and ⊙ is the Hadamard product. We then concatenate the fused mask logits of the object instances with the 'stuff' logits along the channel dimension to generate the intermediate panoptic logits. Subsequently, we apply the argmax operation along the channel dimension to obtain the intermediate panoptic prediction. In the final step, we take a zero-filled canvas and first copy the instance-specific 'thing' predictions from the intermediate panoptic prediction. We then fill the empty parts of the canvas with the 'stuff' class predictions by copying them from the predictions of the semantic head, while ignoring classes whose area is smaller than a predefined threshold called the minimum stuff area. This yields the final panoptic segmentation output.

We fuse the MLA and MLB instance logits in the aforementioned manner due to the fact that, if both logits for a given pixel conform with each other, the final instance score increases proportionately to their agreement, and vice versa. In case of agreement, the corresponding object instance will either dominate or be superseded by other instances as well as the 'stuff' class scores. Similarly, in case of disagreement, the score of the given object instance will reflect the extent of their difference. Simply put, the fused logit score is adaptively attenuated or amplified according to the consensus. We evaluate the performance of our proposed panoptic fusion module in comparison to other existing methods in the ablation study presented in Section 4.4.5.
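A minimal sketch of the adaptive fusion of Equation (13), followed by the channel-wise argmax that yields the intermediate panoptic prediction, is shown below. The instance filtering, the computation of MLB from the semantic logits and the final canvas-filling step are omitted, and the class counts are placeholders.

import torch

def fuse_panoptic(ml_a, ml_b, stuff_logits):
    """ml_a, ml_b: (N_inst, H, W) instance mask logits from the instance and semantic
    heads; stuff_logits: (N_stuff, H, W) 'stuff' channels of the semantic logits."""
    fused = (torch.sigmoid(ml_a) + torch.sigmoid(ml_b)) * (ml_a + ml_b)   # Equation (13)
    intermediate_logits = torch.cat([stuff_logits, fused], dim=0)
    # channel-wise argmax: indices < N_stuff are 'stuff' classes, the rest are instances
    return intermediate_logits.argmax(dim=0)

if __name__ == '__main__':
    n_inst, n_stuff, h, w = 3, 11, 64, 128                                # placeholder sizes
    pred = fuse_panoptic(torch.randn(n_inst, h, w), torch.randn(n_inst, h, w),
                         torch.randn(n_stuff, h, w))
    print(pred.shape, int(pred.max()))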

4 Experimental Results

In this section, we first describe the standard evaluation metrics that we adopt for empirical evaluations, followed by brief descriptions of the datasets that we benchmark on in Section 4.1. We then present extensive quantitative comparisons and benchmarking results in Section 4.3 and detailed ablation studies on the various proposed architectural components in Section 4.4. Finally, we present qualitative comparisons and visualizations of panoptic segmentation on each of the datasets that we evaluate on in Section 4.5 and Section 4.6 respectively.

We use PyTorch (Paszke et al 2019) for implementing all our architectures and we train our models on a system with an Intel Xeon 2.20 GHz processor and NVIDIA TITAN X GPUs.

We use the standard Panoptic Quality (PQ) metric (Kirillov et al 2019) for quantifying the performance of our models. The PQ metric is computed as

PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|},    (14)

where TP, FP, FN and IoU are the true positives, false positives, false negatives and the intersection-over-union. The IoU is computed as IoU = TP/(TP + FP + FN). We also report the Segmentation Quality (SQ) and Recognition Quality (RQ) metrics, computed as

SQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP|},    (15)

RQ = \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}.    (16)
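The metrics of Equations (14)-(16) can be computed from the IoUs of matched segment pairs as sketched below; the matching itself (unique matches with IoU greater than 0.5, as defined by Kirillov et al 2019) is assumed to be done beforehand.

def panoptic_quality(tp_ious, num_fp, num_fn):
    """tp_ious: IoUs of matched (prediction, groundtruth) segment pairs."""
    tp = len(tp_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / tp                                # Equation (15)
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)          # Equation (16)
    return sq * rq, sq, rq                                # PQ = SQ * RQ, Equation (14)

# Example: three matched segments with IoUs 0.9, 0.8 and 0.6, one FP and one FN.
if __name__ == '__main__':
    pq, sq, rq = panoptic_quality([0.9, 0.8, 0.6], num_fp=1, num_fn=1)
    print(f'PQ={pq:.3f} SQ={sq:.3f} RQ={rq:.3f}')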

Following the standard benchmarking criteria for panoptic segmentation, we report PQ, SQ and RQ over all the classes in the dataset, and we also report them separately for the 'stuff' classes (PQ^St, SQ^St, RQ^St) and the 'thing' classes (PQ^Th, SQ^Th, RQ^Th). Additionally, for the sake of completeness, we report the Average Precision (AP) and the mean Intersection-over-Union (mIoU) for both 'stuff' and 'thing' classes, as well as the inference time and FLOPs for comparisons. The implementation of our proposed EfficientPS model and a live demo on various datasets is publicly available at https://rl.uni-freiburg.de/research/panoptic.

4.1 Datasets

We benchmark our proposed EfficientPS for panoptic segmentation on four challenging urban scene understanding datasets, namely Cityscapes (Cordts et al 2016), KITTI (Geiger et al 2013), Mapillary Vistas (Neuhold et al 2017) and the Indian Driving Dataset (Varma et al 2019). The KITTI benchmark does not provide panoptic annotations, therefore to facilitate this work we publicly release manually annotated panoptic groundtruth segmentation labels for the popular KITTI benchmark. These four diverse datasets contain images that range from congested city driving scenarios to rural scenes and highways. They also contain scenes in challenging perceptual conditions, including snow, motion blur and other seasonal visual changes. We briefly describe the characteristics of these datasets in this section.

Cityscapes: The Cityscapes dataset (Cordts et al 2016) consists of urban street scenes and focuses on the semantic understanding of common driving scenarios. It is one of the most challenging datasets for panoptic segmentation due to its sheer diversity, as it covers scenes from over 50 European cities recorded over several seasons such as spring, summer and fall. The presence of a large number of dynamic objects further adds to its complexity.



Figure 5: Example images from the challenging urban scene understanding datasets that we benchmark on, namely Cityscapes, KITTI, Mapillary Vistas and the Indian Driving Dataset (IDD). The images show cluttered urban scenes with many dynamic objects, occluded objects, perpetual snowy conditions and unstructured environments.

Figure 5 (a) shows an example image and the corresponding panoptic groundtruth annotation from the Cityscapes dataset. As we see from this example, the scenes are extremely cluttered, with many dynamic objects such as pedestrians and cyclists that are often grouped near one another or partially occluded. These factors make panoptic segmentation, especially segmenting the 'thing' classes, exceedingly challenging.

The widely used Cityscapes dataset recently introduced a benchmark for the task of panoptic segmentation. The dataset contains pixel-level annotations for 19 object classes, of which 11 are 'stuff' classes and 8 are instance-specific 'thing' classes. It consists of 5000 finely annotated images and 20000 coarsely annotated images that were captured at a resolution of 2048×1024 pixels using an automotive-grade 22 cm baseline stereo camera. The finely annotated images are divided into 2975 for training, 500 for validation and 1525 for testing. The annotations for the test set are not publicly released; rather, they are only available to the online evaluation server that automatically computes the metrics and publishes the results. We report the performance of our proposed EfficientPS on both the validation set as well as the test set. We also use the Cityscapes dataset for evaluating the improvement due to the various architectural contributions that we make in the ablation study. We report results on the validation set for our model trained only on the fine annotations, and we report the results on the test set from the benchmarking server for our model trained on both the fine and coarse annotations.

KITTI: The KITTI vision benchmark suite (Geiger et al 2013) is one of the most comprehensive datasets that provides groundtruth for a variety of tasks, such as semantic segmentation, scene flow estimation, optical flow estimation, depth prediction, odometry estimation, tracking and road lane detection. However, it has not yet expanded its annotations to support the recently introduced panoptic segmentation task. The challenging nature of the KITTI scenes and its potential for benchmarking multi-task learning problems make extending this dataset to include panoptic annotations of great interest to the community. Therefore, in this work, we introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI vision benchmark suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set; therefore, in addition to panoptic segmentation, they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually.

Our dataset consists of a total of 1055 images, out of which 855 are used for the training set and 200 for the validation set. We provide annotations for 11 'stuff' classes and 8 'thing' classes, adhering to the Cityscapes 'stuff' and 'thing' class distribution. In order to create the panoptic annotations, we gathered semantic annotations from community-driven extensions of KITTI (Xu et al 2016; Ros et al 2015) and combined them with the 200 training images from the KITTI semantic training set. We then manually annotated all the images with instance masks, followed by merging them with the semantic annotations to generate the panoptic segmentation groundtruth labels. The images in our KITTI panoptic segmentation dataset have a resolution of 1280×384 pixels and contain scenes from both residential and inner-city scenarios. Figure 5 (b) shows an example image from the KITTI panoptic segmentation dataset and its corresponding panoptic segmentation labels. We observe that the car, denoted in teal color pixels, and the van are both partially occluded by other 'stuff' classes, such that an object instance is split into two disjoint components. We find that scenarios such as these are extremely challenging for the task of panoptic segmentation, as the disjoint object mask has to be assigned to the same instance ID. We hope that this dataset encourages innovative solutions to such real-world problems that are uncommon in other datasets and also accelerates research in multi-task learning for urban scene understanding.

Mapillary Vistas: Mapillary Vistas (Neuhold et al 2017) is one of the largest publicly available street-level imagery datasets that contains pixel-accurate and instance-specific semantic annotations.


The novel aspects of this dataset include diverse scenes from over six continents and a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18000 images for training, 2000 images for validation and 5000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that, due to the snowy conditions, recognizing distant objects such as the car in this example becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset: The Indian Driving Dataset (IDD) (Varma et al 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructure such as lanes and sidewalks. It has significantly more 'thing' instances in each scene compared to other datasets, and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car, and the data was gathered in two Indian cities as well as their outskirts. IDD consists of a total of 10003 images, where 6993 are used for training, 981 for validation and 2029 for testing. The images have a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3, therefore we report the results of our model on the same level to facilitate comparison. This level comprises a total of 26 classes, out of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset is shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels. We take crops from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations, including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky et al 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio 2010) for the other layers, zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set T_H = 0.7, T_L = 0.3 and T_N = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of c_t = 0.5, an overlap threshold of o_t = 0.5 and a minimum stuff area of min_sa = 2048.

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone, and continue training until convergence. We denote the base learning rate lr_base, the milestones and the total number of iterations t_i for each dataset in the following format: {lr_base, milestone, milestone, t_i}. The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are {0.07, 32K, 44K, 50K}, {0.07, 144K, 176K, 192K}, {0.07, 16K, 22K, 25K} and {0.07, 108K, 130K, 144K} respectively. At the beginning of the training, we have a warm-up phase where lr_base is increased linearly from 1/3·lr_base to lr_base in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^{-4}. The final loss L_total that we optimize is computed as

L_{total} = L_{semantic} + L_{instance},    (17)

where L_semantic and L_instance are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA Titan X GPUs, where each GPU processes a single image.
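The learning rate schedule described above can be sketched as follows, using the Cityscapes values {0.07, 32K, 44K, 50K} from the text; the helper function and the placeholder model are illustrative assumptions.

import torch

def lr_at_iteration(it, lr_base=0.07, milestones=(32000, 44000), warmup_iters=200):
    """Linear warm-up from lr_base/3 to lr_base, then decay by 10 at each milestone."""
    if it < warmup_iters:
        return lr_base / 3 + (lr_base - lr_base / 3) * it / warmup_iters
    return lr_base * (0.1 ** sum(it >= m for m in milestones))

if __name__ == '__main__':
    model = torch.nn.Linear(8, 2)                                     # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.07, momentum=0.9)
    for it in (0, 100, 200, 40000, 48000):
        for group in optimizer.param_groups:
            group['lr'] = lr_at_iteration(it)
        print(it, optimizer.param_groups[0]['lr'])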

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts, while for KITTI and IDD we report results for models that we trained using the official implementations publicly released by the authors, after tuning their hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets, and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that, at the time of submission, only the Cityscapes benchmark provides the provision to evaluate results on the test set.


Table 1: Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively; '−' denotes that the metric has not been reported for the corresponding method. All values are in %.

Single-scale evaluation:
Network | Pre-training | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
WeaklySupervised | − | 47.3 | − | − | 39.6 | − | − | 52.9 | − | − | 24.3 | 71.6
TASCNet | − | 55.9 | − | − | 50.5 | − | − | 59.8 | − | − | − | −
Panoptic FPN | − | 58.1 | − | − | 52.0 | − | − | 62.5 | − | − | 33.0 | 75.7
AUNet | − | 59.0 | − | − | 54.8 | − | − | 62.1 | − | − | 34.4 | 75.6
UPSNet | − | 59.3 | 79.7 | 73.0 | 54.6 | 79.3 | 68.7 | 62.7 | 80.1 | 76.2 | 33.3 | 75.2
DeeperLab | − | 56.3 | − | − | − | − | − | − | − | − | − | −
Seamless | − | 60.3 | − | − | 56.1 | − | − | 63.3 | − | − | 33.6 | 77.5
SSAP | − | 61.1 | − | − | 55.0 | − | − | − | − | − | − | −
AdaptIS | − | 62.0 | − | − | 58.7 | − | − | 64.4 | − | − | 36.3 | 79.2
Panoptic-DeepLab | − | 63.0 | − | − | − | − | − | − | − | − | 35.3 | 80.5
EfficientPS (ours) | − | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3
TASCNet | COCO | 59.3 | − | − | 56.0 | − | − | 61.5 | − | − | 37.6 | 78.1
UPSNet | COCO | 60.5 | 80.9 | 73.5 | 57.0 | − | − | 63.0 | − | − | 37.8 | 77.8
Seamless | Vistas | 65.0 | − | − | 60.7 | − | − | 68.0 | − | − | − | 80.7
Panoptic-DeepLab | Vistas | 65.3 | − | − | − | − | − | − | − | − | 38.8 | 82.5
EfficientPS (ours) | Vistas | 66.1 | 82.5 | 78.9 | 62.7 | 81.9 | 75.2 | 68.5 | 82.9 | 81.6 | 41.9 | 81.0

Multi-scale evaluation:
Network | Pre-training | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Panoptic-DeepLab | − | 64.1 | − | − | − | − | − | − | − | − | 38.5 | 81.5
EfficientPS (ours) | − | 65.1 | 82.2 | 79.0 | 61.5 | 81.4 | 75.4 | 67.7 | 82.8 | 81.7 | 39.7 | 80.3
TASCNet | COCO | 60.4 | − | − | 56.1 | − | − | 63.3 | − | − | 39.1 | 78.7
M-RCNN + PSPNet | COCO | 61.2 | 80.9 | 74.4 | 54.0 | − | − | 66.4 | − | − | 36.4 | 80.9
UPSNet | COCO | 61.8 | 81.3 | 74.8 | 57.6 | 77.7 | 70.5 | 64.8 | 81.4 | − | 39.0 | 79.2
Panoptic-DeepLab | Vistas | 67.0 | − | − | − | − | − | − | − | − | 42.5 | 83.1
EfficientPS (ours) | Vistas | 67.5 | 83.2 | 80.2 | 63.5 | 82.2 | 77.2 | 70.4 | 83.9 | 82.4 | 43.8 | 82.1

Table 2: Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

Network | Pre-training | PQ | SQ | RQ | PQ^Th | PQ^St
SSAP | − | 58.9 | 82.4 | 70.6 | 48.4 | 66.5
TASCNet | COCO | 60.7 | 81.0 | 73.8 | 53.4 | 66.0
Panoptic-DeepLab | − | 62.3 | 82.4 | 74.8 | 52.1 | 69.7
Seamless | Vistas | 62.6 | 82.1 | 75.3 | 56.0 | 67.5
Panoptic-DeepLab | Vistas | 65.5 | 77.8 | 75.2 | 56.5 | 72.0
EfficientPS (ours) | − | 64.1 | 82.6 | 76.8 | 56.7 | 69.4
EfficientPS (ours) | Vistas | 66.4 | 83.5 | 78.8 | 59.3 | 71.5

Table 3: Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures.

Network | Input Size (pixels) | Params (M) | FLOPs (B) | Time (ms)
DeeperLab | 1025×2049 | − | − | 463
UPSNet | 1024×2048 | 45.05 | 487.02 | 202
Seamless | 1024×2048 | 51.43 | 514.00 | 168
Panoptic-DeepLab | 1025×2049 | 46.73 | 547.49 | 175
EfficientPS (ours) | 1024×2048 | 40.89 | 433.94 | 166

On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75, 2) during the multi-scale evaluations.

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al 2018b), TASCNet (Li et al 2018a), Panoptic FPN (Kirillov et al 2019), AUNet (Li et al 2019b), UPSNet (Xiong et al 2019), DeeperLab (Yang et al 2019), Seamless (Porzi et al 2019), SSAP (Gao et al 2019), AdaptIS (Sofiiuk et al 2019) and Panoptic-DeepLab (Cheng et al 2019). Table 1 shows the results on the Cityscapes validation set.


Table 4: Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively; '−' denotes that the metric has not been reported for the corresponding method. All values are in %.

Single-scale evaluation:
Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | −
DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3
TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | −
AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | −
Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4
Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3
EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6

Multi-scale evaluation:
Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | −
Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8
EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1

For a fair comparison, we categorize the models in the table separately according to those that report single-scale and multi-scale evaluation, as well as those without any pre-training and those pre-trained on other datasets, namely Mapillary Vistas (Neuhold et al 2017), denoted as Vistas, and Microsoft COCO (Lin et al 2014), abbreviated as COCO. We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations, depth data, or temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQ^St, PQ^Th, SQ and RQ metrics and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS, without pre-training on any extra data, achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime, including the time consumed for fusing the semantic and instance predictions that yields the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-the-art Panoptic-DeepLab.


Table 5: Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

Single-scale evaluation:
Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1
UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6
Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8
EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3

Multi-scale evaluation:
Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8
UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2
Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1
EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4

Table 6: Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

Single-scale evaluation:
Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1
UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4
Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6
EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3

Multi-scale evaluation:
Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1
UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6
Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3
EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1

Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQ^St score, primarily due to the output stride of 16 at which it operates, which increases its computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQ^St of Panoptic-DeepLab. On the other hand, top-down approaches tend to have a better instance segmentation ability as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4.0% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQ^St score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7: Ablation study on the various architectural contributions proposed in our EfficientPS model. The performance is shown for models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively; '−' refers to the standard configuration as in Kirillov et al (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Model | Encoder | 2-way FPN | SIH | SH | PFM | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
M1 | ResNet-50 | − | − | − | − | 58.2 | 79.1 | 72.0 | 52.4 | 78.8 | 67.2 | 62.4 | 79.4 | 75.6 | 31.6 | 74.3
M2 | ResNet-50 | − | − | − | X | 58.8 | 79.5 | 72.6 | 53.4 | 79.2 | 62.8 | 67.9 | 79.7 | 76.1 | 33.8 | 75.1
M3 | ResNet-50 | − | X | − | X | 58.6 | 79.4 | 72.4 | 53.1 | 79.1 | 67.5 | 62.6 | 79.6 | 75.9 | 33.7 | 75.0
M4 | EfficientNet-B5 | − | X | − | X | 59.7 | 79.9 | 73.3 | 54.7 | 76.6 | 68.1 | 63.3 | 80.3 | 79.5 | 34.1 | 76.3
M5 | EfficientNet-B5 | X | X | − | X | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
M6 | EfficientNet-B5 | X | X | X | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of the various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and its comparison with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined at the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the same network configuration and panoptic fusion heuristics as Kirillov et al (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network comprises an upsampling stage, which has a 3×3 convolution, group norm (Wu and He 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are at 1/4 scale of the input. The resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.3%.

The next model, M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score, without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of the semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M, with a drop of 0.2% in PQ and a drop of 0.1% in AP and the mIoU score. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and Leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables a bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively and enables it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, it cannot be solely attributed to the semantic head.


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Encoder | Params (M) | PQ (%) | SQ (%) | RQ (%) | PQTh (%) | SQTh (%) | RQTh (%) | PQSt (%) | SQSt (%) | RQSt (%) | AP (%) | mIoU (%)
MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2
ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1
ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2
Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1
ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9
Mod. EfficientNet-B5 (Ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Architecture | PQ (%) | SQ (%) | RQ (%) | PQTh (%) | SQTh (%) | RQTh (%) | PQSt (%) | SQSt (%) | RQSt (%) | AP (%) | mIoU (%)
Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3
Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2
PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8
2-way FPN (Ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

This is due to the fact that if we employ the standard panoptic fusion heuristics, an improvement in semantic segmentation would only contribute to an increase in the PQSt score. However, our proposed adaptive panoptic fusion yields an improvement in PQTh as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al., 2019), ResNet-50 (He et al., 2016), ResNet-101 (He et al., 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al., 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have comparable or fewer parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al., 2017), bottom-up FPN and PANet FPN variants. We refer to the FPN architecture described in Liu et al. (2018), in which the top-down path is followed by a bottom-up path, as PANet FPN. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. Both these models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to the standard FPN, in a sequential or parallel manner respectively.


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the ×os pyramid scale level. c^k_f refers to a convolution layer with f filters and a k×k kernel size. LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ (%) | SQ (%) | RQ (%) | PQTh (%) | SQTh (%) | RQTh (%) | PQSt (%) | SQSt (%) | RQSt (%) | AP (%) | mIoU (%)
M6.1 | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | - | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2
M6.2 | LSFE | LSFE | LSFE | LSFE | - | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4
M6.3 | DPC | LSFE | LSFE | LSFE | - | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1
M6.4 | DPC | DPC | LSFE | LSFE | - | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6
M6.5 | DPC | DPC | DPC | LSFE | - | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2
M6.6 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.
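To make the distinction concrete, the following is a simplified PyTorch-style sketch of the parallel two-branch aggregation idea behind the 2-way FPN; the module and variable names are illustrative, and the sketch deliberately omits the iABN sync and Leaky ReLU layers as well as other details of the actual backbone.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoWayFPN(nn.Module):
    """Parallel top-down and bottom-up aggregation over four encoder stages."""
    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.output = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels])

    def forward(self, feats):
        # feats: four encoder outputs ordered from the highest to the lowest resolution.
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down branch: start from the coarsest level and upsample.
        td = [lat[-1]]
        for f in reversed(lat[:-1]):
            td.insert(0, f + F.interpolate(td[0], size=f.shape[-2:], mode='nearest'))
        # Bottom-up branch: start from the finest level and downsample.
        bu = [lat[0]]
        for f in lat[1:]:
            bu.append(f + F.interpolate(bu[-1], size=f.shape[-2:], mode='nearest'))
        # Fuse the two branches level-wise and apply the output convolutions.
        return [conv(t + b) for conv, t, b in zip(self.output, td, bu)]

feats = [torch.randn(1, c, 256 // 2**i, 512 // 2**i)
         for i, c in enumerate((64, 128, 256, 512))]
outs = TwoWayFPN()(feats)
print([tuple(o.shape[-2:]) for o in outs])  # four pyramid levels P4, P8, P16, P32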

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small-scale and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M6.1 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and Leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output which is 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output which is the same scale as the input image. This M6.1 model achieves a PQ of 61.6%. In the subsequent M6.2 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M6.1 model; therefore, we employ separable convolutions in all the experiments that follow.
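For reference, a minimal PyTorch sketch of such an LSFE block is shown below, using a plain BatchNorm2d and LeakyReLU as stand-ins for iABN sync; the class name is illustrative and the 128-channel width follows the configuration listed in Table 10.

import torch
import torch.nn as nn

def separable_conv(in_ch, out_ch):
    # 3x3 depthwise convolution followed by a 1x1 pointwise convolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False))

class LSFE(nn.Module):
    """Large Scale Feature Extractor: two cascaded separable 3x3 convolutions."""
    def __init__(self, in_channels=256, channels=128):
        super().__init__()
        self.block = nn.Sequential(
            separable_conv(in_channels, channels),
            nn.BatchNorm2d(channels), nn.LeakyReLU(inplace=True),
            separable_conv(channels, channels),
            nn.BatchNorm2d(channels), nn.LeakyReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

print(LSFE()(torch.randn(1, 256, 128, 256)).shape)  # torch.Size([1, 128, 128, 256])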

In the M6.3 model, we replace the LSFE module in the P32 level of the 2-way FPN with the dense prediction cells (DPC) described in Section 3.2. This M6.3 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M6.4 model, we replace the LSFE module in the P16 level with DPC, and in the subsequent M6.5 model, we introduce DPC at both the P16 and P8 levels. We find that the M6.4 model achieves an improvement of 0.6% in PQ over M6.3; however, the performance drops in the M6.5 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that the DPC, which consists of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M6.6 model is derived from the M6.4 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M6.4 model. This can be attributed to the MC module that correlates the semantically rich contextual features with the fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features that result in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al. (2019), which we denote as the baseline, as well as those of UPSNet (Xiong et al., 2019) and Seamless (Porzi et al., 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Semantic Head | PQ (%) | SQ (%) | RQ (%) | PQTh (%) | SQTh (%) | RQTh (%) | PQSt (%) | SQSt (%) | RQSt (%) | AP (%) | mIoU (%)
Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1
Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model | PQ (%) | SQ (%) | RQ (%) | PQTh (%) | SQTh (%) | RQTh (%) | PQSt (%) | SQSt (%) | RQSt (%)
Baseline | 62.4 | 80.8 | 75.4 | 58.7 | 80.4 | 72.6 | 65.1 | 81.1 | 77.4
TASCNet | 62.5 | 80.9 | 75.6 | 58.6 | 80.5 | 72.8 | 65.3 | 81.2 | 77.7
UPSNet | 63.1 | 81.3 | 76.1 | 59.5 | 80.6 | 73.2 | 65.7 | 81.8 | 78.2
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2

The semantic head of UPSNet, which is essentially a sub-network comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as the baseline as it is extensively used in several panoptic segmentation networks. We then compare with the Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in (Xiong et al., 2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and the instance head that have an inherent overlap is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both the heads by adding them, and it therefore relies on the mode computed over the segmentation output to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both the heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model, Panoptic-DeepLab, does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding datasets.

EfficientPS Efficient Panoptic Segmentation 21

[Figure 6: rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD; columns show the input image, the Seamless output, the EfficientPS output and the improvement/error map.]
Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


For each example, we show the input image and the corresponding panoptic segmentation outputs from the Seamless model and our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN, which effectively aggregates multi-scale features to learn semantically richer representations, and to the panoptic fusion module, which addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c), the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately resolve this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck in the right lane is partially occluded by the cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across the different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both the examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head, which correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearance of these two objects; however, our proposed EfficientPS model with our novel panoptic backbone has an extensive representational capacity, which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of traffic participants.


[Figure 7: rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD; columns (i)-(iv) show different example scenes from each dataset.]
Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances at multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk as well as seasonal changes.


These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from uncommon viewpoints compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analyses and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and to exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335
Badrinarayanan V, Kendall A, Cipolla R (2015) Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229
Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons
Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision, Springer, pp 44–57
Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848
Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems, pp 8713–8724
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-deeplab: A simple, strong and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, pp 534–549
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the International Conference on Computer Vision, pp 764–773
Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) Ssap: Single-shot instance segmentation with affinity pyramid. In: Proceedings of the International Conference on Computer Vision, pp 642–651
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The kitti dataset. International Journal of Robotics Research
de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110
Girshick R (2015) Fast r-cnn. In: Proceedings of the International Conference on Computer Vision, pp 1440–1448
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256
Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, pp 297–312
Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask r-cnn. In: Proceedings of the International Conference on Computer Vision, pp 2961–2969
He X, Gould S (2014a) An exemplar-based crf for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
He X, Gould S (2014b) An exemplar-based crf for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for mobilenetv3. In: Proceedings of the International Conference on Computer Vision, pp 1314–1324
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059
Kang BR, Kim HY (2018) Bshapenet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327
Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413
Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In: Proceedings of the International Conference on Computer Vision, pp 2190–2197
Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In: Advances in Neural Information Processing Systems, pp 109–117
Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192
Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118
Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229
Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: European Conference on Computer Vision, Springer, pp 740–755
Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181
Liu S, Jia J, Fidler S, Urtasun R (2017) Sgn: Sequential grouping networks for instance segmentation. In: Proceedings of the International Conference on Computer Vision, pp 3496–3504
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768
Liu W, Rabinovich A, Berg AC (2015) Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The mapillary vistas dataset for semantic understanding of street scenes. In: Proceedings of the International Conference on Computer Vision, pp 4990–4999
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035
Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998
Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824
Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286
Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887
Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664
Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329
Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238
Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of dnns. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sofiiuk K, Barinova O, Konushin A (2019) Adaptis: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363
Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference
Tan M, Le QV (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135
Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision 63(2):113–140
Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25
Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop, State Estimation and Terrain Perception for All Terrain Mobile Robots
Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics
Valada A, Vertens J, Dhall A, Burgard W (2017) Adapnet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651
Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)
Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, Special Issue: Deep Learning for Robotic Vision
Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751
Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) Upsnet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826
Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349
Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) Deeperlab: Single-shot image parser. arXiv preprint arXiv:1902.05093
Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721
Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected mrfs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890
Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227


(eASPP) (Valada et al., 2019) are two variants of ASPP that are significantly more efficient and also yield a better performance. We find that DPC demonstrates a better performance with a minor increase in the number of parameters compared to eASPP. Therefore, we employ a modified DPC module in our semantic head as shown in Figure 2. We augment the original DPC topology by replacing batch normalization layers with iABN sync and ReLUs with Leaky ReLUs. The DPC module consists of a 3×3 separable convolution with 256 output channels having a dilation rate of (1, 6) that extends out to five parallel branches. Three of the branches each consist of a 3×3 dilated separable convolution with 256 output channels, where the dilation rates are (1, 1), (6, 21) and (18, 15) respectively. The fourth branch takes the output of the dilated separable convolution with a dilation rate of (18, 15) as input and passes it through another 3×3 dilated separable convolution with 256 output channels and a dilation rate of (6, 3). The outputs from all these parallel branches are then concatenated to yield a tensor with 1280 channels. This tensor is then finally passed through a 1×1 convolution with 256 output channels and forms the output of the DPC module. Note that each of the convolutions in the DPC module is followed by an iABN sync and a Leaky ReLU activation function.
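The following is a condensed PyTorch sketch of this modified DPC topology; BatchNorm2d and LeakyReLU are used here as stand-ins for iABN sync, and the helper and layer names are illustrative.

import torch
import torch.nn as nn

def sep_conv(in_ch, out_ch, dilation):
    # Depthwise 3x3 convolution with the given (h, w) dilation, then a 1x1 pointwise
    # convolution, followed by normalization and activation.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation,
                  groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.LeakyReLU(inplace=True))

class DPC(nn.Module):
    """Dense Prediction Cell with the dilation rates described above."""
    def __init__(self, in_channels=256, channels=256):
        super().__init__()
        self.head = sep_conv(in_channels, channels, (1, 6))
        self.branch_1_1 = sep_conv(channels, channels, (1, 1))
        self.branch_6_21 = sep_conv(channels, channels, (6, 21))
        self.branch_18_15 = sep_conv(channels, channels, (18, 15))
        self.branch_6_3 = sep_conv(channels, channels, (6, 3))  # fed by the (18, 15) branch
        self.project = nn.Sequential(nn.Conv2d(5 * channels, channels, 1, bias=False),
                                     nn.BatchNorm2d(channels), nn.LeakyReLU(inplace=True))

    def forward(self, x):
        x = self.head(x)
        b1 = self.branch_1_1(x)
        b2 = self.branch_6_21(x)
        b3 = self.branch_18_15(x)
        b4 = self.branch_6_3(b3)
        # Concatenation yields 5 x 256 = 1280 channels before the final 1x1 projection.
        return self.project(torch.cat([x, b1, b2, b3, b4], dim=1))

print(DPC()(torch.randn(1, 256, 32, 64)).shape)  # torch.Size([1, 256, 32, 64])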

The third and final requirement for the semantic head is that it should be able to mitigate the mismatch between large-scale and small-scale features while performing feature aggregation. To this end, we employ our Mismatch Correction (MC) module that correlates the small-scale features with respect to the large-scale features. It consists of cascaded 3×3 separable convolutions with 128 output channels, followed by iABN sync with Leaky ReLU, and a bilinear upsampling layer that upsamples the feature maps by a factor of 2. Figures 3 (a), 3 (c) and 3 (d) illustrate the topologies of these main components of our semantic head.
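A minimal sketch of the MC module under the same assumptions (BatchNorm2d and LeakyReLU in place of iABN sync, illustrative names) could look as follows.

import torch
import torch.nn as nn

def sep_conv(in_ch, out_ch):
    # 3x3 depthwise + 1x1 pointwise convolution, with normalization and activation.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.LeakyReLU(inplace=True))

class MismatchCorrection(nn.Module):
    """Cascaded separable convolutions followed by x2 bilinear upsampling."""
    def __init__(self, in_channels=128, channels=128):
        super().__init__()
        self.convs = nn.Sequential(sep_conv(in_channels, channels),
                                   sep_conv(channels, channels))
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, small_scale_features):
        # Align coarse contextual features before they are summed with finer features.
        return self.up(self.convs(small_scale_features))

print(MismatchCorrection()(torch.randn(1, 128, 32, 64)).shape)  # torch.Size([1, 128, 64, 128])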

The four different scaled outputs of our 2-way FPN, namely P4, P8, P16 and P32, are the inputs to our semantic head. The small-scale inputs P32 and P16, with downsampling factors of ×32 and ×16, are each fed into two parallel DPC modules, while the large-scale inputs P8 and P4, with downsampling factors of ×8 and ×4, are each passed through two parallel LSFE modules. Subsequently, the outputs from each of these parallel DPC and LSFE modules are augmented with feature alignment connections and each of them is upsampled by a factor of 4. These upsampled feature maps are then concatenated to yield a tensor with 512 channels, which is then input to a 1×1 convolution with N_'stuff'+'thing' output filters. This tensor is then finally upsampled by a factor of 2 and passed through a softmax layer to yield the semantic logits having the same resolution as the input image. Now, the feature alignment connections from the DPC and LSFE modules interconnect each of these outputs by element-wise summation, as shown in Figure 2. We add our MC modules in the interconnections between the second DPC and LSFE,

as well as between both the LSFE connections. These correlation connections aggregate contextual information from small-scale features and characteristic large-scale features for better object boundary refinement. We use the weighted per-pixel log-loss (Bulo et al., 2017) for training, which is given by

L_pp(Θ) = - Σ_{ij} w_{ij} · log p_{ij}(p*_{ij}),     (1)

where p*_{ij} is the groundtruth label of pixel (i, j) for a given image, p_{ij}(p*_{ij}) is the predicted probability of pixel (i, j) being assigned its groundtruth class, w_{ij} = 4/(W·H) if pixel (i, j) belongs to the 25% worst predictions, and w_{ij} = 0 otherwise. W and H are the width and height of the given input image. The overall semantic head loss is given by

L_semantic(Θ) = (1/n) Σ L_pp,     (2)

where n is the batch size. We present an in-depth analysis of our proposed semantic head in comparison to other semantic heads commonly used in state-of-the-art architectures in Section 4.4.4.
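As an illustration, a minimal PyTorch sketch of this bootstrapped, weighted per-pixel log-loss could look as follows; the function name and the reduction over the batch are our own choices.

import torch
import torch.nn.functional as F

def weighted_per_pixel_loss(logits, target, worst_fraction=0.25):
    """Cross-entropy averaged over the hardest 25% of pixels of each image.

    logits: (N, C, H, W) semantic logits, target: (N, H, W) groundtruth labels.
    """
    n, _, h, w = logits.shape
    per_pixel = F.cross_entropy(logits, target, reduction='none')  # (N, H, W)
    per_pixel = per_pixel.view(n, -1)
    k = int(worst_fraction * h * w)
    # Keep only the k pixels with the highest loss; their weight is 1/k = 4/(W*H).
    worst, _ = per_pixel.topk(k, dim=1)
    return worst.mean(dim=1).mean()

logits = torch.randn(2, 19, 64, 128, requires_grad=True)
target = torch.randint(0, 19, (2, 64, 128))
print(weighted_per_pixel_loss(logits, target))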

3.3 Instance Segmentation Head

The instance segmentation head of our EfficientPS network, shown in Figure 2, has a topology similar to Mask R-CNN (He et al., 2017) with certain modifications. More specifically, we replace all the standard convolutions, batch normalization layers and ReLU activations with separable convolutions, iABN sync and Leaky ReLU respectively. Similar to the rest of our architecture, we use separable convolutions instead of standard convolutions to reduce the number of parameters consumed by the network. This enables us to conserve 2.09 M parameters in comparison to the conventional Mask R-CNN.

Mask R-CNN consists of two stages. In the first stage, the Region Proposal Network (RPN) module shown in Figure 3 (b) employs a fully convolutional network to output a set of rectangular object proposals and an objectness score for the given input FPN level. The network architecture incorporates a 3×3 separable convolution with 256 output channels, an iABN sync and Leaky ReLU activation, followed by two parallel 1×1 convolutions with 4k and k output filters respectively. Here, k is the maximum number of object proposals. The k proposals are parameterized relative to k reference bounding boxes, which are called anchors. The scale and aspect ratio define a particular anchor, and they vary depending on the dataset being used. As the generated proposals can potentially overlap, an additional filtering step of Non-Max Suppression (NMS) is employed. Essentially, in case of overlaps, the proposal with the highest objectness score is retained and the rest are discarded. RPN first computes object proposals of the form (p_u, p_v, p_w, p_h) and the objectness score σ(p_s),


which transform an anchor a of the form (u, v, w, h) to (u + p_u, v + p_v) and (w·e^{p_w}, h·e^{p_h}), where (u, v) is the position of the anchor a in the image coordinate system, (w, h) are its dimensions, and σ(·) is the sigmoid function. This is followed by NMS to yield the final candidate bounding box proposals to be used in the next stage.

Subsequently, ROI align (He et al., 2017) uses the object proposals to extract features from the FPN encodings by directly pooling features from the nth FPN level with a 14×14 spatial resolution bounded within the bounding box proposal. Here,

n = max(1, min(4, ⌊3 + log_2(√(w·h)/224)⌋)),     (3)

where w and h are the width and height of the bounding box proposal. The features that are extracted then serve as input to the bounding box regression, object classification and mask segmentation networks. The RPN network operates on each of the output scales of the FPN sequentially, thereby accumulating candidates from different scales, and ROI align uses the accumulated list for feature extraction from all the scales. The bounding box regression and object classification networks consist of two shared consecutive fully-connected layers with 1024 channels, iABN sync and Leaky ReLU. Subsequently, the feature maps are passed through a fully-connected layer at the end for each task, with 4·N_'thing' outputs and N_'thing'+1 outputs respectively. The object classification logits yield a probability distribution over the classes and void from a softmax layer, whereas the bounding box regression network encodes class-specific correction factors. For a given object proposal and class-specific correction factors (c_u, c_v, c_w, c_h) for a class c, a new bounding box is computed of the form (u + c_u, v + c_v, w·e^{c_w}, h·e^{c_h}), where (u, v) and (w, h) are the position and dimensions of the bounding box proposal respectively.
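A small sketch of these two computations, the FPN level assignment of Equation (3) and the application of the class-specific correction factors, might look as follows; the function names are illustrative.

import math

def roi_fpn_level(w, h):
    # Equation (3): assign a proposal of size (w, h) to one of the four FPN levels.
    return max(1, min(4, math.floor(3 + math.log2(math.sqrt(w * h) / 224))))

def apply_corrections(proposal, deltas):
    # proposal = (u, v, w, h), deltas = (c_u, c_v, c_w, c_h) for the predicted class.
    u, v, w, h = proposal
    cu, cv, cw, ch = deltas
    return (u + cu, v + cv, w * math.exp(cw), h * math.exp(ch))

print(roi_fpn_level(224, 224))  # 3: a 224x224 proposal maps to the middle level
print(apply_corrections((100., 50., 64., 32.), (2., -1., 0.1, 0.2)))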

The mask segmentation network employs four consecutive 3×3 separable convolutions with 256 output channels, each followed by iABN sync and Leaky ReLU. Subsequently, the resulting feature maps are passed through a 2×2 transposed convolution with 256 output channels and an output stride of 2, followed by iABN sync and a Leaky ReLU activation function. Finally, a 1×1 convolution with N_'thing' output channels is employed that yields 28×28 logits for each class. The resulting logits give the mask foreground probabilities for the candidate bounding box proposal using a sigmoid function. This is then fused with the semantic logits in our proposed panoptic fusion module described in Section 3.4.

In order to train the instance segmentation head, we adopt the loss functions proposed in Mask R-CNN, i.e., two loss functions for the first stage, the objectness score loss and the object proposal loss, and three loss functions for the second stage, the classification loss, the bounding box loss and the mask segmentation loss. We take a set of randomly sampled positive matches and negative matches such that |N_s| ≤ 256. The objectness score loss L_os, defined as a log loss over a given N_s, is given by

L_os(Θ) = -(1/|N_s|) Σ_{(p*_os, p_os) ∈ N_s} [p*_os · log p_os + (1 - p*_os) · log(1 - p_os)],     (4)

where p_os is the output of the objectness score branch of the RPN and p*_os is the groundtruth label, which is 1 if the anchor is positive and 0 if the anchor is negative. We use the same strategy as Mask R-CNN for defining positive and negative matches. For a given anchor a, if the groundtruth box b* has the largest Intersection over Union (IoU) or IoU(b*, a) > T_H, then the corresponding prediction b is a positive match, and b is a negative match when IoU(b*, a) < T_L. The thresholds T_H and T_L are pre-defined, where T_H > T_L.

The object proposal loss L_op is a regression loss that is defined only on positive matches and is given by

L_op(Θ) = (1/|N_s|) Σ_{(t*, t) ∈ N_p} Σ_{(i*, i) ∈ (t*, t)} L_1(i*, i),     (5)

where L_1 is the smooth L1 norm, N_p is the subset of N_s positive matches, and t* = (t*_x, t*_y, t*_w, t*_h) and t = (t_x, t_y, t_w, t_h) are the parameterizations of b* and b respectively. b* = (x*, y*, w*, h*) is the groundtruth box and b = (x, y, w, h) is the predicted bounding box, where x, y, w and h are the center coordinates, width and height of the predicted bounding box. Similarly, x*, y*, w* and h* denote the center coordinates, width and height of the groundtruth bounding box. The parameterizations (Girshick, 2015) are given by

t_x = (x - x_a)/w_a,  t_y = (y - y_a)/h_a,  t_w = log(w/w_a),  t_h = log(h/h_a),     (6)

t*_x = (x* - x_a)/w_a,  t*_y = (y* - y_a)/h_a,  t*_w = log(w*/w_a),  t*_h = log(h*/h_a),     (7)

where x_a, y_a, w_a and h_a denote the center coordinates, width and height of the anchor a.
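For illustration, the parameterization in Equations (6) and (7) corresponds to the following small helper; the function name is ours.

import math

def encode_box(box, anchor):
    """Encode a box (x, y, w, h) relative to an anchor (x_a, y_a, w_a, h_a)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

# A box identical to its anchor encodes to all zeros.
print(encode_box((10., 20., 32., 64.), (10., 20., 32., 64.)))  # (0.0, 0.0, 0.0, 0.0)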

Similar to the objectness score loss L_os, the classification loss L_cls is defined for a set K_s of randomly sampled positive and negative matches such that |K_s| ≤ 512. The classification loss L_cls is given by

L_cls(Θ) = -(1/|K_s|) Σ_{(Y*, Y) ∈ K_s} Σ_{c=1}^{N_'thing'+1} Y*_oc · log Y_oc,     (8)

where Y is the output of the classification branch, Y* is the one-hot encoded groundtruth label, o is the observed class and c is the correct classification for object o. For a given image, it is a positive match if IoU(b*, b) > T_n and otherwise a negative match, where b* is the groundtruth box and b is the object proposal from the first stage.


[Figure 4: schematic of the panoptic fusion module, showing the filtered instance outputs (class prediction, bounding box, confidence and 28×28 mask logits) being resized to the input resolution, fused with the semantic logits, concatenated with the 'stuff' logits, and converted to the panoptic prediction via argmax and canvas filling.]
Figure 4 Illustration of our proposed panoptic fusion module. Here, the MLA and MLB mask logits are fused as (σ(MLA) + σ(MLB)) ⊙ (MLA + MLB), where MLB is the output of the function f*, σ(·) is the sigmoid function and ⊙ is the Hadamard product. Here, the f* function for a given class prediction c (cyclist in this example) zeroes out the score of the c channel of the semantic logits outside the corresponding bounding box.

The bounding box loss L_bbx is a regression loss that is defined only on positive matches and is expressed as

L_bbx(Θ) = (1/|K_s|) Σ_{(T*, T) ∈ K_p} Σ_{(i*, i) ∈ (T*, T)} L_1(i*, i),     (9)

where L_1 is the smooth L1 norm (Girshick, 2015), K_p is the subset of K_s positive matches, and T* and T are the parameterizations of B* and B respectively, similar to Equations (6) and (7), where B* is the groundtruth box and B is the corresponding predicted bounding box.

Finally, the mask segmentation loss is also defined only for positive samples and is given by

L_mask(Θ) = (1/|K_s|) Σ_{(P*, P) ∈ K_s} L_p(P*, P),     (10)

where L_p(P*, P) is given as

L_p(P*, P) = -(1/|T_p|) Σ_{(i, j) ∈ T_p} [P*_ij · log P_ij + (1 - P*_ij) · log(1 - P_ij)],     (11)

where P is the predicted 28×28 binary mask for a class, with P_ij denoting the probability of the mask pixel (i, j), P* is the 28×28 groundtruth binary mask for the class, and T_p is the set of non-void pixels in P*.

All five losses are weighed equally, and the total instance segmentation head loss is given by

L_instance = L_os + L_op + L_cls + L_bbx + L_mask.     (12)
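A compact PyTorch sketch of the mask segmentation term, computing a binary cross-entropy over the 28×28 mask of the groundtruth class only, could look like this; names are illustrative and void-pixel masking is omitted.

import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    """mask_logits: (K, N_thing, 28, 28); gt_masks: (K, 28, 28) in {0, 1};
    gt_classes: (K,) groundtruth class index of each positive sample."""
    k = mask_logits.shape[0]
    # Select the mask channel of the groundtruth class for every sample.
    selected = mask_logits[torch.arange(k), gt_classes]  # (K, 28, 28)
    return F.binary_cross_entropy_with_logits(selected, gt_masks.float())

logits = torch.randn(4, 8, 28, 28, requires_grad=True)
gt = torch.randint(0, 2, (4, 28, 28))
cls = torch.randint(0, 8, (4,))
print(mask_loss(logits, gt, cls))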

Similar to Mask R-CNN, the gradients that are computed with respect to the losses L_cls, L_bbx and L_mask flow only through the network backbone and not through the region proposal network.

3.4 Panoptic Fusion Module

In order to obtain the panoptic segmentation output, we need to fuse the predictions of the semantic segmentation head and the instance segmentation head. However, fusing both these predictions is not a straightforward task due to the inherent overlap between them. Therefore, we propose a novel panoptic fusion module to tackle the aforementioned problem in an adaptive manner, in order to thoroughly exploit the predictions from both the heads congruously. Figure 4 shows the topology of our panoptic fusion module. We obtain a set of object instances from the instance segmentation head of our network, where for each instance we have its corresponding class prediction, confidence score, bounding box and mask logits. First, we reduce the number of predicted object instances in two stages. We begin by discarding all object instances that have a confidence score of less than a certain confidence threshold. We then resize, zero pad and scale the 28×28 mask logits of each object instance to the same resolution as the input image. Subsequently, we sort the class prediction, bounding box and mask logits according to their respective confidence scores. In the second stage, we check each sorted instance mask logit for overlap with the other object instances. In case the overlap is higher than a given overlap threshold, we discard the other object instance.

After filtering the object instances, we have the class prediction, bounding box prediction and mask logit MLA of each instance. We simultaneously obtain semantic logits with N channels from the semantic head, where N is the sum of N_'stuff' and N_'thing'. We then compute a second mask logit MLB for each instance, where we select the channel of the semantic logits based on its class prediction. We only keep the logit score of the selected channel for the area within the instance bounding box, while we zero out the scores that are outside this region.

EfficientPS Efficient Panoptic Segmentation 11

are outside this region In the end we have two mask logitsfor each instance one from instance segmentation head andthe other from the semantic segmentation head We com-bine these two logits adaptively by computing the Hadamardproduct of the sum of sigmoid of MLA and sigmoid of MLBand the sum of MLA and MLB to obtain the fused mask logitsFL of instances expressed as

FL = (σ(MLA) + σ(MLB)) ⊙ (MLA + MLB), \qquad (13)

where σ(·) is the sigmoid function and ⊙ is the Hadamard product. We then concatenate the fused mask logits of the object instances with the 'stuff' logits along the channel dimension to generate the intermediate panoptic logits. Subsequently, we apply the argmax operation along the channel dimension to obtain the intermediate panoptic prediction. In the final step, we take a zero-filled canvas and first copy the instance-specific 'thing' predictions from the intermediate panoptic prediction. We then fill the empty parts of the canvas with the 'stuff' class predictions by copying them from the predictions of the semantic head, while ignoring the classes that have an area smaller than a predefined threshold called the minimum stuff area. This then gives us the final panoptic segmentation output.
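The following is a minimal PyTorch sketch of this fusion step for a single image, assuming the instance mask logits have already been filtered, resized to the input resolution and sorted by confidence; the variable names and the simple per-instance loop are illustrative assumptions.

```python
import torch

def panoptic_fusion(ml_a, inst_classes, semantic_logits, boxes, n_stuff):
    # ml_a:            (K, H, W) instance mask logits ML_A (already filtered/resized/sorted)
    # inst_classes:    (K,) predicted 'thing' class index of each instance
    # semantic_logits: (N, H, W) semantic logits with channels ordered [stuff | thing]
    # boxes:           (K, 4) bounding boxes as (x1, y1, x2, y2) in pixel coordinates
    K, H, W = ml_a.shape
    ml_b = torch.zeros_like(ml_a)
    for k in range(K):
        x1, y1, x2, y2 = boxes[k].long()
        c = n_stuff + inst_classes[k]  # semantic channel of the predicted class
        # Keep the semantic scores only inside the instance bounding box (f* above).
        ml_b[k, y1:y2, x1:x2] = semantic_logits[c, y1:y2, x1:x2]
    # Equation (13): adaptive fusion of the two mask logits.
    fl = (torch.sigmoid(ml_a) + torch.sigmoid(ml_b)) * (ml_a + ml_b)
    # Concatenate the fused 'thing' logits with the 'stuff' logits and take the argmax.
    panoptic_logits = torch.cat([semantic_logits[:n_stuff], fl], dim=0)
    return panoptic_logits.argmax(dim=0)  # intermediate panoptic prediction
```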

We fuse the MLA and MLB instance logits in the aforementioned manner due to the fact that if both logits for a given pixel conform with each other, the final instance score will increase proportionately to their agreement, or vice versa. In case of agreement, the corresponding object instance will dominate or be superseded by the scores of other instances as well as the 'stuff' classes. Similarly, in case of disagreement, the score of the given object instance will reflect the extent of their difference. Simply put, the fused logit score is either adaptively attenuated or amplified according to the consensus. We evaluate the performance of our proposed panoptic fusion module in comparison to other existing methods in the ablation study presented in Section 4.4.5.

4 Experimental Results

In this section, we first describe the standard evaluation metrics that we adopt for the empirical evaluations, followed by brief descriptions of the datasets that we benchmark on in Section 4.1. We then present extensive quantitative comparisons and benchmarking results in Section 4.3, and detailed ablation studies on the various proposed architectural components in Section 4.4. Finally, we present qualitative comparisons and visualizations of panoptic segmentation on each of the datasets that we evaluate on in Section 4.5 and Section 4.6 respectively.

We use PyTorch (Paszke et al., 2019) for implementing all our architectures and we trained our models on a system with an Intel Xeon@2.20GHz processor and NVIDIA TITAN X GPUs. We use the standard Panoptic Quality (PQ) metric (Kirillov et al., 2019) for quantifying the performance of our models. The PQ metric is computed as

PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}, \qquad (14)

where TP, FP, FN and IoU denote the true positives, false positives, false negatives and the intersection-over-union respectively. The IoU is computed as IoU = TP/(TP + FP + FN). We also report the Segmentation Quality (SQ) and Recognition Quality (RQ) metrics computed as

SQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP|}, \qquad (15)

RQ = \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}. \qquad (16)

Following the standard benchmarking criteria for panoptic segmentation, we report PQ, SQ and RQ over all the classes in the dataset, and we also report them for the 'stuff' classes (PQSt, SQSt, RQSt) and the 'thing' classes (PQTh, SQTh, RQTh). Additionally, for the sake of completeness, we report the Average Precision (AP) and the mean Intersection-over-Union (mIoU) for both 'stuff' and 'thing' classes, as well as the inference time and FLOPs for comparisons. The implementation of our proposed EfficientPS model and a live demo on various datasets is publicly available at https://rl.uni-freiburg.de/research/panoptic.
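For reference, the following small sketch computes PQ, SQ and RQ from the IoUs of the matched segments and the false positive and false negative counts; it assumes that the matching of predicted and groundtruth segments has already been performed.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    # tp_ious: list of IoU values of the matched (true positive) segments.
    num_tp = len(tp_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / num_tp if num_tp > 0 else 0.0  # Equation (15)
    rq = num_tp / denom                                 # Equation (16)
    pq = sum(tp_ious) / denom                           # Equation (14), PQ = SQ * RQ
    return pq, sq, rq
```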

4.1 Datasets

We benchmark our proposed EfficientPS for panoptic segmentation on four challenging urban scene understanding datasets, namely Cityscapes (Cordts et al., 2016), KITTI (Geiger et al., 2013), Mapillary Vistas (Neuhold et al., 2017) and the Indian Driving Dataset (Varma et al., 2019). The KITTI benchmark does not provide panoptic annotations, therefore to facilitate this work, we publicly release manually annotated panoptic groundtruth segmentation labels for the popular KITTI benchmark. These four diverse datasets contain images that range from congested city driving scenarios to rural scenes and highways. They also contain scenes in challenging perceptual conditions including snow, motion blur and other seasonal visual changes. We briefly describe the characteristics of these datasets in this section.

Cityscapes The Cityscapes dataset (Cordts et al., 2016) consists of urban street scenes and focuses on the semantic understanding of common driving scenarios. It is one of the most challenging datasets for panoptic segmentation due to its sheer diversity, as it covers scenes from over 50 European cities recorded over several seasons such as spring, summer and fall. The presence of a large number of dynamic objects further adds to its complexity. Figure 5 (a) shows an example image and the corresponding panoptic groundtruth annotation from the Cityscapes dataset. As we see from this example, the scenes are extremely cluttered with many dynamic objects such as pedestrians and cyclists that are often grouped near one another or partially occluded. These factors make panoptic segmentation, especially segmenting the 'thing' classes, exceedingly challenging.

Figure 5 Example images (RGB and panoptic groundtruth) from the challenging urban scene understanding datasets that we benchmark on, namely (a) Cityscapes, (b) KITTI, (c) Mapillary Vistas and (d) Indian Driving Dataset (IDD). The images show cluttered urban scenes with many dynamic objects, occluded objects, perpetual snowy conditions and unstructured environments.

The widely used Cityscapes dataset recently introduced a benchmark for the task of panoptic segmentation. The dataset contains pixel-level annotations for 19 object classes, of which 11 are 'stuff' classes and 8 are instance-specific 'thing' classes. It consists of 5,000 finely annotated images and 20,000 coarsely annotated images that were captured at a resolution of 2048×1024 pixels using an automotive-grade 22 cm baseline stereo camera. The finely annotated images are divided into 2,975 for training, 500 for validation and 1,525 for testing. The annotations for the test set are not publicly released; rather, they are only available to the online evaluation server that automatically computes the metrics and publishes the results. We report the performance of our proposed EfficientPS on both the validation set as well as the test set. We also use the Cityscapes dataset for evaluating the improvement due to the various architectural contributions that we make in the ablation study. We report results on the validation set for our model trained only on the fine annotations, and we report the results on the test set from the benchmarking server for our model trained on both the fine and coarse annotations.

KITTI The KITTI vision benchmark suite (Geiger et al., 2013) is one of the most comprehensive datasets that provides groundtruth for a variety of tasks such as semantic segmentation, scene flow estimation, optical flow estimation, depth prediction, odometry estimation, tracking and road lane detection. However, it still has not expanded its annotations to support the recently introduced panoptic segmentation task. The challenging nature of the KITTI scenes and its potential for benchmarking multi-task learning problems make extending this dataset to include panoptic annotations of great interest to the community. Therefore, in this work, we introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI vision benchmark suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set, therefore in addition to panoptic segmentation, they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually.

Our dataset consists of a total of 1,055 images, out of which 855 are used for the training set and 200 are used for the validation set. We provide annotations for 11 'stuff' classes and 8 'thing' classes, adhering to the Cityscapes 'stuff' and 'thing' class distribution. In order to create the panoptic annotations, we gathered semantic annotations from community-driven extensions of KITTI (Xu et al., 2016; Ros et al., 2015) and combined them with the 200 training images from the KITTI semantic training set. We then manually annotated all the images with instance masks, followed by merging them with the semantic annotations to generate the panoptic segmentation groundtruth labels. The images in our KITTI panoptic segmentation dataset are of a resolution of 1280×384 pixels and contain scenes from both residential and inner city scenarios. Figure 5 (b) shows an example image from the KITTI panoptic segmentation dataset and its corresponding panoptic segmentation labels. We observe that the car denoted in teal color pixels and the van are both partially occluded by other 'stuff' classes, such that an object instance is split into two disjoint components. We find that scenarios such as these are extremely challenging for the task of panoptic segmentation, as the disjoint object mask has to be assigned to the same instance ID. We hope that this dataset encourages innovative solutions to such real-world problems that are uncommon in other datasets, and also accelerates research in multi-task learning for urban scene understanding.

Mapillary Vistas Mapillary Vistas (Neuhold et al., 2017) is one of the largest publicly available street-level imagery datasets that contains pixel-accurate and instance-specific semantic annotations. The novel aspects of this dataset include diverse scenes from over six continents and in a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18,000 images for training, 2,000 images for validation and 5,000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that due to the snowy conditions, recognizing distant objects such as the car in this example becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset The Indian Driving Dataset (IDD) (Varma et al., 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructure such as lanes and sidewalks. It has a significantly larger number of 'thing' instances in each scene compared to other datasets and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car, and the data was gathered in two Indian cities as well as in their outskirts. IDD consists of a total of 10,003 images, where 6,993 are used for training, 981 for validation and 2,029 for testing. The images are of a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3, therefore we report the results of our model on the same to facilitate comparison. This level comprises a total of 26 classes, out of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset is shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels. We take crops from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky et al., 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio, 2010) for the other layers, zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set TH = 0.7, TL = 0.3 and TN = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of ct = 0.5, an overlap threshold of ot = 0.5 and a minimum stuff area of minsa = 2048.

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone, and continue training until convergence. We denote the base learning rate lr_base, the milestones and the total number of iterations ti for each dataset in the following format: {lr_base, milestone, milestone, ti}. The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are {0.07, 32K, 44K, 50K}, {0.07, 144K, 176K, 192K}, {0.07, 16K, 22K, 25K} and {0.07, 108K, 130K, 144K} respectively. At the beginning of the training, we have a warm-up phase where the learning rate is increased linearly from (1/3) · lr_base to lr_base in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^-4. The final loss L_total that we optimize is computed as

L_{total} = L_{semantic} + L_{instance}, \qquad (17)

where L_semantic and L_instance are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA Titan X GPUs, where each GPU processes a single image.
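The schedule described above can be realized with standard PyTorch components; the following sketch shows the optimizer and the multi-step schedule with linear warm-up for the Cityscapes configuration, with a placeholder network and the loss computation omitted.

```python
import torch

# Placeholder network; in practice this is the full EfficientPS model.
model = torch.nn.Conv2d(3, 19, kernel_size=1)

# Cityscapes schedule: lr_base = 0.07, milestones at 32K and 44K iterations, 50K total.
optimizer = torch.optim.SGD(model.parameters(), lr=0.07, momentum=0.9)

def lr_factor(it, warmup_iters=200, milestones=(32000, 44000)):
    # Linear warm-up from lr_base / 3 to lr_base, then a decay by 10 at each milestone.
    if it < warmup_iters:
        return (1.0 + 2.0 * it / warmup_iters) / 3.0
    return 0.1 ** sum(it >= m for m in milestones)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for it in range(50000):
    # loss = L_semantic + L_instance computed on a batch (omitted in this sketch)
    optimizer.step()
    scheduler.step()
```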

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts, while for KITTI and IDD we report results for the models that we trained using the official implementations that have been publicly released by the authors, after further tuning of the hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets, and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that at the time of submission, only the Cityscapes benchmark has the provision to evaluate the results on the test set.


Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively; − denotes that the metric has not been reported for the corresponding method. All values are in %.

Single-Scale:

| Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WeaklySupervised | − | 47.3 | − | − | 39.6 | − | − | 52.9 | − | − | 24.3 | 71.6 |
| TASCNet | − | 55.9 | − | − | 50.5 | − | − | 59.8 | − | − | − | − |
| Panoptic FPN | − | 58.1 | − | − | 52.0 | − | − | 62.5 | − | − | 33.0 | 75.7 |
| AUNet | − | 59.0 | − | − | 54.8 | − | − | 62.1 | − | − | 34.4 | 75.6 |
| UPSNet | − | 59.3 | 79.7 | 73.0 | 54.6 | 79.3 | 68.7 | 62.7 | 80.1 | 76.2 | 33.3 | 75.2 |
| DeeperLab | − | 56.3 | − | − | − | − | − | − | − | − | − | − |
| Seamless | − | 60.3 | − | − | 56.1 | − | − | 63.3 | − | − | 33.6 | 77.5 |
| SSAP | − | 61.1 | − | − | 55.0 | − | − | − | − | − | − | − |
| AdaptIS | − | 62.0 | − | − | 58.7 | − | − | 64.4 | − | − | 36.3 | 79.2 |
| Panoptic-DeepLab | − | 63.0 | − | − | − | − | − | − | − | − | 35.3 | 80.5 |
| EfficientPS (ours) | − | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3 |
| TASCNet | COCO | 59.3 | − | − | 56.0 | − | − | 61.5 | − | − | 37.6 | 78.1 |
| UPSNet | COCO | 60.5 | 80.9 | 73.5 | 57.0 | − | − | 63.0 | − | − | 37.8 | 77.8 |
| Seamless | Vistas | 65.0 | − | − | 60.7 | − | − | 68.0 | − | − | − | 80.7 |
| Panoptic-DeepLab | Vistas | 65.3 | − | − | − | − | − | − | − | − | 38.8 | 82.5 |
| EfficientPS (ours) | Vistas | 66.1 | 82.5 | 78.9 | 62.7 | 81.9 | 75.2 | 68.5 | 82.9 | 81.6 | 41.9 | 81.0 |

Multi-Scale:

| Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Panoptic-DeepLab | − | 64.1 | − | − | − | − | − | − | − | − | 38.5 | 81.5 |
| EfficientPS (ours) | − | 65.1 | 82.2 | 79.0 | 61.5 | 81.4 | 75.4 | 67.7 | 82.8 | 81.7 | 39.7 | 80.3 |
| TASCNet | COCO | 60.4 | − | − | 56.1 | − | − | 63.3 | − | − | 39.1 | 78.7 |
| M-RCNN + PSPNet | COCO | 61.2 | 80.9 | 74.4 | 54.0 | − | − | 66.4 | − | − | 36.4 | 80.9 |
| UPSNet | COCO | 61.8 | 81.3 | 74.8 | 57.6 | 77.7 | 70.5 | 64.8 | 81.4 | − | 39.0 | 79.2 |
| Panoptic-DeepLab | Vistas | 67.0 | − | − | − | − | − | − | − | − | 42.5 | 83.1 |
| EfficientPS (ours) | Vistas | 67.5 | 83.2 | 80.2 | 63.5 | 82.2 | 77.2 | 70.4 | 83.9 | 82.4 | 43.8 | 82.1 |

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

| Network | Pre-training | PQ | SQ | RQ | PQTh | PQSt |
|---|---|---|---|---|---|---|
| SSAP | − | 58.9 | 82.4 | 70.6 | 48.4 | 66.5 |
| TASCNet | COCO | 60.7 | 81.0 | 73.8 | 53.4 | 66.0 |
| Panoptic-DeepLab | − | 62.3 | 82.4 | 74.8 | 52.1 | 69.7 |
| Seamless | Vistas | 62.6 | 82.1 | 75.3 | 56.0 | 67.5 |
| Panoptic-DeepLab | Vistas | 65.5 | 77.8 | 75.2 | 56.5 | 72.0 |
| EfficientPS (ours) | − | 64.1 | 82.6 | 76.8 | 56.7 | 69.4 |
| EfficientPS (ours) | Vistas | 66.4 | 83.5 | 78.8 | 59.3 | 71.5 |

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures.

| Network | Input Size (pixels) | Params (M) | FLOPs (B) | Time (ms) |
|---|---|---|---|---|
| DeeperLab | 1025×2049 | − | − | 463 |
| UPSNet | 1024×2048 | 45.05 | 487.02 | 202 |
| Seamless | 1024×2048 | 51.43 | 514.00 | 168 |
| Panoptic-DeepLab | 1025×2049 | 46.73 | 547.49 | 175 |
| EfficientPS (ours) | 1024×2048 | 40.89 | 433.94 | 166 |

On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75 and 2) during the multi-scale evaluations.
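Since the exact merging strategy is not detailed here, the following sketch illustrates one common way to realize such a multi-scale evaluation for the semantic logits, by averaging the softmax scores over the flipped and rescaled inputs; the averaging scheme is our assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_semantic_probs(model, image, scales=(0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    # image: (1, 3, H, W); model returns per-pixel semantic logits of shape (1, C, h, w).
    _, _, H, W = image.shape
    avg_probs = None
    for s in scales:
        for flip in (False, True):
            x = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
            if flip:
                x = torch.flip(x, dims=[3])  # horizontal flip
            logits = model(x)
            if flip:
                logits = torch.flip(logits, dims=[3])  # flip the prediction back
            probs = F.interpolate(logits.softmax(dim=1), size=(H, W),
                                  mode='bilinear', align_corners=False)
            avg_probs = probs if avg_probs is None else avg_probs + probs
    return avg_probs / (2 * len(scales))
```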

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al., 2018b), TASCNet (Li et al., 2018a), Panoptic FPN (Kirillov et al., 2019), AUNet (Li et al., 2019b), UPSNet (Xiong et al., 2019), DeeperLab (Yang et al., 2019), Seamless (Porzi et al., 2019), SSAP (Gao et al., 2019), AdaptIS (Sofiiuk et al., 2019) and Panoptic-DeepLab (Cheng et al., 2019). Table 1 shows the results on the Cityscapes validation set. For a fair comparison, we categorize models in the table separately according to those that report single-scale and multi-scale evaluation, as well as those without any pre-training and those pre-trained on other datasets, namely Mapillary Vistas (Neuhold et al., 2017), denoted as Vistas, and Microsoft COCO (Lin et al., 2014), abbreviated as COCO.


Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively; − denotes that the metric has not been reported for the corresponding method. All values are in %.

Single-Scale:

| Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | − |
| DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3 |
| TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | − |
| AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | − |
| Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4 |
| Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3 |
| EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6 |

Multi-Scale:

| Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | − |
| Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8 |
| EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1 |

We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations, depth data or temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQSt, PQTh, SQ and RQ metrics, and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS, without pre-training on any extra data, achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using the Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime including the time consumed for fusing the semantic and instance predictions that yields the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-the-art Panoptic-DeepLab.


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

Single-Scale:

| Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1 |
| UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6 |
| Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8 |
| EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3 |

Multi-Scale:

| Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8 |
| UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2 |
| Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1 |
| EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4 |

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

Single-Scale:

| Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1 |
| UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4 |
| Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6 |
| EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3 |

Multi-Scale:

| Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1 |
| UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6 |
| Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3 |
| EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1 |

Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQSt score, primarily due to the output stride of 16 at which it operates, which increases the computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQSt of Panoptic-DeepLab. On the other hand, top-down approaches tend to have a better instance segmentation ability, as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4.0% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQSt score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on the various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote the Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '-' refers to the standard configuration as in Kirillov et al. (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

| Model | Encoder | 2-way FPN | SIH | SH | PFM | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | ResNet-50 | - | - | - | - | 58.2 | 79.1 | 72.0 | 52.4 | 78.8 | 67.2 | 62.4 | 79.4 | 75.6 | 31.6 | 74.3 |
| M2 | ResNet-50 | - | - | - | X | 58.8 | 79.5 | 72.6 | 53.4 | 79.2 | 62.8 | 67.9 | 79.7 | 76.1 | 33.8 | 75.1 |
| M3 | ResNet-50 | - | X | - | X | 58.6 | 79.4 | 72.4 | 53.1 | 79.1 | 67.5 | 62.6 | 79.6 | 75.9 | 33.7 | 75.0 |
| M4 | EfficientNet-B5 | - | X | - | X | 59.7 | 79.9 | 73.3 | 54.7 | 76.6 | 68.1 | 63.3 | 80.3 | 79.5 | 34.1 | 76.3 |
| M5 | EfficientNet-B5 | X | X | - | X | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3 |
| M6 | EfficientNet-B5 | X | X | X | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3 |

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of the various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and its comparison with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the same network configuration and panoptic fusion heuristics as Kirillov et al. (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network is comprised of an upsampling stage which has a 3×3 convolution, group norm (Wu and He, 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are at 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.3%.
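To make this baseline topology concrete, below is a minimal PyTorch sketch of the upsampling-based semantic head described above; the FPN channel width of 256, the 128 intermediate filters and the use of 32 groups in the group norm are illustrative assumptions rather than the exact configuration of Kirillov et al. (2019).

```python
import torch.nn as nn
import torch.nn.functional as F

class BaselineSemanticHead(nn.Module):
    # Each FPN level is processed by (3x3 conv, group norm, ReLU, x2 upsample) stages
    # until it reaches 1/4 of the input scale; the results are summed, projected to
    # per-class logits with a 1x1 conv and upsampled x4 to the input resolution.
    def __init__(self, in_ch=256, mid_ch=128, num_classes=19):
        super().__init__()

        def stage(cin, cout, upsample):
            layers = [nn.Conv2d(cin, cout, 3, padding=1, bias=False),
                      nn.GroupNorm(32, cout), nn.ReLU(inplace=True)]
            if upsample:
                layers.append(nn.Upsample(scale_factor=2, mode='bilinear',
                                          align_corners=False))
            return nn.Sequential(*layers)

        # Number of x2 upsamplings per level: P4 -> 0, P8 -> 1, P16 -> 2, P32 -> 3.
        self.branches = nn.ModuleList()
        for n_up in (0, 1, 2, 3):
            blocks = [stage(in_ch, mid_ch, n_up > 0)]
            blocks += [stage(mid_ch, mid_ch, True) for _ in range(max(n_up - 1, 0))]
            self.branches.append(nn.Sequential(*blocks))
        self.classifier = nn.Conv2d(mid_ch, num_classes, 1)

    def forward(self, feats):  # feats: [P4, P8, P16, P32]
        fused = sum(branch(f) for branch, f in zip(self.branches, feats))
        return F.interpolate(self.classifier(fused), scale_factor=4,
                             mode='bilinear', align_corners=False)
```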

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of the semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M, with a drop of 0.2% in PQ and a 0.1% drop in the AP and mIoU scores. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables a bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively, enabling it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, it cannot be solely attributed to the semantic head. This is due to the fact that if we employ the standard panoptic fusion heuristics, an improvement in semantic segmentation would only contribute to an increase in the PQSt score.


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

| Encoder | Params (M) | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2 |
| ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1 |
| ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2 |
| Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1 |
| ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9 |
| Mod. EfficientNet-B5 (ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3 |

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

| Architecture | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3 |
| Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2 |
| PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8 |
| 2-way FPN (ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3 |

However, our proposed adaptive panoptic fusion yields an improvement in PQTh as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al., 2019), ResNet-50 (He et al., 2016), ResNet-101 (He et al., 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al., 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have comparable or fewer parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al., 2017), the bottom-up FPN and the PANet FPN variants. We refer to the FPN architecture described in Liu et al. (2018) as the PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. Both these models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to the standard FPN, in a sequential or parallel manner respectively.


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the ×os pyramid scale level, c^k_f refers to a convolution layer with f filters and a k×k kernel size, LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

| Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M6.1 | [c3_128, c3_128] | [c3_128, c3_128] | [c3_128, c3_128] | [c3_128, c3_128] | - | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2 |
| M6.2 | LSFE | LSFE | LSFE | LSFE | - | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4 |
| M6.3 | DPC | LSFE | LSFE | LSFE | - | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1 |
| M6.4 | DPC | DPC | LSFE | LSFE | - | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6 |
| M6.5 | DPC | DPC | DPC | LSFE | - | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2 |
| M6.6 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3 |

We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small-scale and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M6.1 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitute the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output which is at 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output which is at the same scale as the input image. This M6.1 model achieves a PQ of 61.6%. In the subsequent M6.2 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M6.1 model, therefore we employ separable convolutions in all the experiments that follow. A sketch of the resulting LSFE module is shown below.
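This is a minimal PyTorch sketch of the LSFE module with separable convolutions as used from M6.2 onwards; the input channel width of 256 is an assumption, and standard BatchNorm2d is used here in place of iABN sync purely for illustration.

```python
import torch.nn as nn

class SeparableConv(nn.Module):
    # Depthwise 3x3 convolution followed by a pointwise 1x1 convolution.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class LSFE(nn.Module):
    # Large Scale Feature Extractor: two cascaded separable convolutions with 128
    # filters, each followed by normalization and leaky ReLU (iABN sync in the paper).
    def __init__(self, in_ch=256, out_ch=128):
        super().__init__()
        self.block = nn.Sequential(
            SeparableConv(in_ch, out_ch), nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.01),
            SeparableConv(out_ch, out_ch), nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.01),
        )

    def forward(self, x):
        return self.block(x)
```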

In the M6.3 model, we replace the LSFE module in the P32 level of the 2-way FPN with dense prediction cells (DPC), described in Section 3.2. This M6.3 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M6.4 model, we replace the LSFE module in the P16 level with DPC, and in the subsequent M6.5 model, we introduce DPC at both the P16 and P8 levels. We find that the M6.4 model achieves an improvement of 0.6% in PQ over M6.3; however, the performance drops in the M6.5 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M6.6 model is derived from the M6.4 model, to which we add our mismatch correction (MC) module along with the feature correlation connections, as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M6.4 model. This can be attributed to the MC module that correlates the semantically rich contextual features with the fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, which results in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al. (2019), which we denote as the baseline, UPSNet (Xiong et al., 2019) and Seamless (Porzi et al., 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments, while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

| Semantic Head | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3 |
| UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1 |
| Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5 |
| Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3 |

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

| Model | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 62.4 | 80.8 | 75.4 | 58.7 | 80.4 | 72.6 | 65.1 | 81.1 | 77.4 |
| TASCNet | 62.5 | 80.9 | 75.6 | 58.6 | 80.5 | 72.8 | 65.3 | 81.2 | 77.7 |
| UPSNet | 63.1 | 81.3 | 76.1 | 59.5 | 80.6 | 73.2 | 65.7 | 81.8 | 78.2 |
| Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 |

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as a baseline as it is extensively used in several panoptic segmentation networks. We then compare with the Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in (Xiong et al., 2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and the instance head that have an inherent overlap is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between the predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them, and therefore relies on the mode computed over the segmentation outputs to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model Panoptic-DeepLab does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation set of each of the urban scene understanding datasets.


Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on the different benchmark datasets (rows: (a), (b) Cityscapes; (c), (d) Mapillary Vistas; (e), (f) KITTI; (g), (h) IDD; columns: input image, Seamless output, EfficientPS output, improvement/error map). In addition to the panoptic segmentation output, we also show the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


For each example, we show the input image, the corresponding panoptic segmentation output from the Seamless model and the output of our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN that effectively aggregates multi-scale features to learn semantically richer representations, and to the panoptic fusion module that addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c), the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately address this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by the cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios, consistently across the different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head that correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearance of these two objects; however, our proposed EfficientPS model with our novel panoptic backbone has an extensive representational capacity, which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of the panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of traffic participants.


Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on (rows: (a), (b) Cityscapes; (c), (d) Mapillary Vistas; (e), (f) KITTI; (g), (h) IDD; columns (i)-(iv) show different scenes), which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances at multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk, as well as seasonal changes.


These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from viewpoints that are uncommon among those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was trained only on this relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analysis and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P Pont-Tuset J Barron JT Marques F Malik J (2014) Multiscale combinatorial grouping. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335
Badrinarayanan V Kendall A Cipolla R (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
Bai M Urtasun R (2017) Deep watershed transform for instance segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229
Bremner JG Slater A (2008) Theories of infant development. John Wiley & Sons
Brostow GJ Shotton J Fauqueur J Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In European Conference on Computer Vision, Springer, pp 44–57
Bulo SR Neuhold G Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091
Chen LC Papandreou G Kokkinos I Murphy K Yuille AL (2017a) DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848
Chen LC Papandreou G Schroff F Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen LC Collins M Zhu Y Papandreou G Zoph B Schroff F Adam H Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems, pp 8713–8724
Chen LC Zhu Y Papandreou G Schroff F Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
Cheng B Collins MD Zhu Y Liu T Huang TS Adam H Chen LC (2019) Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258
Cordts M Omran M Ramos S Rehfeld T Enzweiler M Benenson R Franke U Roth S Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai J He K Li Y Ren S Sun J (2016) Instance-sensitive fully convolutional networks. In European Conference on Computer Vision, pp 534–549
Dai J Qi H Xiong Y Li Y Zhang G Hu H Wei Y (2017) Deformable convolutional networks. In Proceedings of the International Conference on Computer Vision, pp 764–773
Gao N Shan Y Wang Y Zhao X Yu Y Yang M Huang K (2019) SSAP: Single-shot instance segmentation with affinity pyramid. In Proceedings of the International Conference on Computer Vision, pp 642–651
Geiger A Lenz P Stiller C Urtasun R (2013) Vision meets robotics: The KITTI dataset. International Journal of Robotics Research
de Geus D Meletis P Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110
Girshick R (2015) Fast R-CNN. In Proceedings of the International Conference on Computer Vision, pp 1440–1448
Glorot X Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256
Hariharan B Arbelaez P Girshick R Malik J (2014) Simultaneous detection and segmentation. In European Conference on Computer Vision, pp 297–312
Hariharan B Arbelaez P Girshick R Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456
He K Zhang X Ren S Sun J (2016) Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778
He K Gkioxari G Dollar P Girshick R (2017) Mask R-CNN. In Proceedings of the International Conference on Computer Vision, pp 2961–2969
He X Gould S (2014a) An exemplar-based CRF for multi-instance object segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition
He X Gould S (2014b) An exemplar-based CRF for multi-instance object segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303
Howard A Sandler M Chu G Chen LC Chen B Tan M Wang W Zhu Y Pang R Vasudevan V et al (2019) Searching for MobileNetV3. In Proceedings of the International Conference on Computer Vision, pp 1314–1324
Hu J Shen L Sun G (2018) Squeeze-and-excitation networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Ioffe S Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kaiser L Gomez AN Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059
Kang BR Kim HY (2018) BshapeNet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327
Kirillov A He K Girshick R Rother C Dollar P (2019) Panoptic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413
Kontschieder P Bulo SR Bischof H Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In Proceedings of the International Conference on Computer Vision, pp 2190–2197
Krahenbuhl P Koltun V (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, pp 109–117
Li J Raventos A Bhargava A Tagawa T Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192
Li Q Arnab A Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118
Li X Zhang L You A Yang M Yang K Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229
Li Y Qi H Dai J Ji X Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Li Y Chen X Zhu Z Xie L Huang G Du D Wang X (2019b) Attention-guided unified network for panoptic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035
Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar P Zitnick CL (2014) Microsoft COCO: Common objects in context. In European Conference on Computer Vision, Springer, pp 740–755
Lin TY Dollar P Girshick R He K Hariharan B Belongie S (2017) Feature pyramid networks for object detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu H Peng C Yu C Wang J Liu X Yu G Jiang W (2019) An end-to-end network for panoptic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181
Liu S Jia J Fidler S Urtasun R (2017) SGN: Sequential grouping networks for instance segmentation. In Proceedings of the International Conference on Computer Vision, pp 3496–3504
Liu S Qi L Qin H Shi J Jia J (2018) Path aggregation network for instance segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768
Liu W Rabinovich A Berg AC (2015) ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579
Long J Shelhamer E Darrell T (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition
Neuhold G Ollmann T Rota Bulo S Kontschieder P (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In Proceedings of the International Conference on Computer Vision, pp 4990–4999
Paszke A Gross S Massa F Lerer A Bradbury J Chanan G Killeen T Lin Z Gimelshein N Antiga L et al (2019) PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp 8024–8035
Pinheiro PO Collobert R Dollar P (2015) Learning to segment object candidates. In Advances in Neural Information Processing Systems, pp 1990–1998
Plath N Toussaint M Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In Proceedings of the International Conference on Machine Learning, pp 817–824
Porzi L Bulo SR Colovic A Kontschieder P (2019) Seamless scene segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286
Radwan N Valada A Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887
Ren M Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664
Romera-Paredes B Torr PHS (2016) Recurrent instance segmentation. In European Conference on Computer Vision, Springer, pp 312–329
Ros G Ramos S Granados M Bakhtiary A Vazquez D Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In IEEE Winter Conference on Applications of Computer Vision, pp 231–238
Rota Bulo S Porzi L Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of DNNs. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647
Russakovsky O Deng J Su H Krause J Satheesh S Ma S Huang Z Karpathy A Khosla A Bernstein M et al (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
Shotton J Johnson M Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition
Silberman N Sontag D Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In European Conference on Computer Vision, pp 616–631
Simonyan K Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sofiiuk K Barinova O Konushin A (2019) AdaptIS: Adaptive instance selection network. In Proceedings of the International Conference on Computer Vision, pp 7355–7363
Sturgess P Alahari K Ladicky L Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In British Machine Vision Conference
Tan M Le QV (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
Tian Z He T Shen C Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135
Tighe J Niethammer M Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Tu Z Chen X Yuille AL Zhu SC (2005) Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision 63(2):113–140
Uhrig J Cordts M Franke U Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In German Conference on Pattern Recognition, pp 14–25
Valada A Dhall A Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots
Valada A Oliveira G Brox T Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics
Valada A Vertens J Dhall A Burgard W (2017) AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651
Valada A Radwan N Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)
Valada A Mohan R Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, Special Issue: Deep Learning for Robotic Vision
Varma G Subramanian A Namboodiri A Chandraker M Jawahar C (2019) IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751
Wu Y He K (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Xie S Girshick R Dollar P Tu Z He K (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Xiong Y Liao R Zhao H Hu R Bai M Yumer E Urtasun R (2019) UPSNet: A unified panoptic segmentation network. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826
Xu P Davoine F Bordes JB Zhao H Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349
Yang TJ Collins MD Zhu Y Hwang JJ Liu T Zhang X Sze V Papandreou G Chen LC (2019) DeeperLab: Single-shot image parser. arXiv preprint arXiv:1902.05093
Yao J Fidler S Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709
Yu F Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Zhang C Wang L Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In European Conference on Computer Vision, Springer, pp 708–721
Zhang Z Fidler S Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677
Zhao H Shi J Qi X Wang X Jia J (2017) Pyramid scene parsing network. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890
Zurn J Burgard W Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227



objectness score σ(p_s) that transforms an anchor a of the form (u, v, w, h) to (u + p_u, v + p_v) and (w·e^{p_w}, h·e^{p_h}), where (u, v) is the position of anchor a in the image coordinate system, (w, h) are its dimensions, and σ(·) is the sigmoid function. This is followed by NMS to yield the final candidate bounding box proposals to be used in the next stage.

Subsequently, ROI align (He et al, 2017) uses the object proposals to extract features from the FPN encodings by directly pooling features from the n-th FPN level with a 14×14 spatial resolution bounded within a bounding box proposal. Here,

n = \max\big(1, \min\big(4, \lfloor 3 + \log_2(\sqrt{wh}/224) \rfloor\big)\big),  (3)

where w and h are the width and height of the bounding box proposal. The features that are extracted then serve as input to the bounding box regression, object classification and mask segmentation networks. The RPN network operates on each of the output scales of the FPN sequentially, thereby accumulating candidates from different scales, and ROI align uses the accumulated list for feature extraction from all the scales. The bounding box regression and object classification networks consist of two shared consecutive fully-connected layers with 1024 channels, iABN sync and Leaky ReLU. Subsequently, the feature maps are passed through a fully-connected layer at the end for each task, with 4N_'thing' outputs and N_'thing'+1 outputs respectively. The object classification logits yield a probability distribution over the classes and void from a softmax layer, whereas the bounding box regression network encodes class-specific correction factors. For a given object proposal of the form (u, v, w, h) for a class c, a new bounding box is computed of the form (u + c_u, v + c_v, w·e^{c_w}, h·e^{c_h}), where (c_u, c_v, c_w, c_h) are the class-specific correction factors, and (u, v) and (w, h) are the position and dimensions of the bounding box proposal respectively.
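As a concrete illustration of Equation (3), the following minimal Python sketch maps a proposal's width and height to an FPN level; the function name and the clamping arguments are illustrative rather than taken from our implementation.

```python
import math

def fpn_level_for_roi(w, h, k_min=1, k_max=4, canonical_size=224):
    # Equation (3): n = max(1, min(4, floor(3 + log2(sqrt(w*h) / 224))))
    n = math.floor(3 + math.log2(math.sqrt(w * h) / canonical_size))
    return max(k_min, min(k_max, n))

# A 224x224 proposal maps to level 3, while a very small box falls back to level 1.
print(fpn_level_for_roi(224, 224))  # -> 3
print(fpn_level_for_roi(32, 32))    # -> 1
```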

The mask segmentation network employs four consecutive 3×3 separable convolutions with 256 output channels, each followed by iABN sync and Leaky ReLU. Subsequently, the resulting feature maps are passed through a 2×2 transposed convolution with 256 output channels and an output stride of 2, followed by iABN sync and a Leaky ReLU activation function. Finally, a 1×1 convolution with N_'thing' output channels is employed that yields 28×28 logits for each class. The resulting logits give mask foreground probabilities for the candidate bounding box proposal using a sigmoid function. This is then fused with the semantic logits in our proposed panoptic fusion module described in Section 3.4.
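The following PyTorch sketch outlines the structure of the mask segmentation network described above. It is only an approximation: standard BatchNorm2d stands in for the synchronized in-place activated batch norm (iABN sync) that we use, and the helper names are illustrative.

```python
import torch.nn as nn

def separable_conv(in_ch, out_ch):
    # A 3x3 depthwise convolution followed by a 1x1 pointwise convolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False))

def mask_head(num_thing_classes, channels=256):
    layers = []
    for _ in range(4):  # four consecutive 3x3 separable convolutions
        layers += [separable_conv(channels, channels),
                   nn.BatchNorm2d(channels), nn.LeakyReLU(0.01, inplace=True)]
    # 2x2 transposed convolution with stride 2 upsamples the 14x14 ROI features to 28x28.
    layers += [nn.ConvTranspose2d(channels, channels, 2, stride=2),
               nn.BatchNorm2d(channels), nn.LeakyReLU(0.01, inplace=True)]
    layers += [nn.Conv2d(channels, num_thing_classes, 1)]  # 28x28 logits per 'thing' class
    return nn.Sequential(*layers)
```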

In order to train the instance segmentation head, we adopt the loss functions proposed in Mask R-CNN, i.e., two loss functions for the first stage, the objectness score loss and the object proposal loss, and three loss functions for the second stage, the classification loss, the bounding box loss and the mask segmentation loss. We take a set of randomly sampled positive matches and negative matches such that |N_s| ≤ 256. The objectness score loss L_os, defined as a log loss for a given N_s, is given by

L_{os}(\Theta) = -\frac{1}{|N_s|} \sum_{(p^*_{os},\, p_{os}) \in N_s} \big[ p^*_{os} \cdot \log p_{os} + (1 - p^*_{os}) \cdot \log(1 - p_{os}) \big],  (4)

where p_os is the output of the objectness score branch of the RPN and p*_os is the groundtruth label, which is 1 if the anchor is positive and 0 if the anchor is negative. We use the same strategy as Mask R-CNN for defining positive and negative matches: for a given anchor a, if the groundtruth box b* has the largest Intersection over Union (IoU) or IoU(b*, a) > T_H, then the corresponding prediction b is a positive match, and b is a negative match when IoU(b*, a) < T_L. The thresholds T_H and T_L are pre-defined, where T_H > T_L.
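A minimal sketch of this matching rule is shown below, assuming the IoUs between each anchor and all groundtruth boxes are already computed; the additional Mask R-CNN rule that marks the highest-IoU anchor of every groundtruth box as positive is omitted for brevity.

```python
def label_anchors(anchor_gt_ious, t_high=0.7, t_low=0.3):
    """anchor_gt_ious: list of lists, IoUs of each anchor with every groundtruth box.
    Returns 1 for positive anchors, 0 for negative anchors and -1 for ignored ones."""
    labels = []
    for ious in anchor_gt_ious:
        best = max(ious, default=0.0)
        if best > t_high:
            labels.append(1)       # positive match
        elif best < t_low:
            labels.append(0)       # negative match
        else:
            labels.append(-1)      # between T_L and T_H: not sampled
    return labels
```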

The object proposal loss L_op is a regression loss that is defined only on positive matches and is given by

L_{op}(\Theta) = \frac{1}{|N_s|} \sum_{(t^*,\, t) \in N_p} \; \sum_{(i^*,\, i) \in (t^*,\, t)} L_1(i^*, i),  (5)

where L_1 is the smooth L1 norm, N_p is the subset of N_s positive matches, t* = (t*_x, t*_y, t*_w, t*_h) and t = (t_x, t_y, t_w, t_h) are the parameterizations of b* and b respectively, b* = (x*, y*, w*, h*) is the groundtruth box and b = (x, y, w, h) is the predicted bounding box. Here, x, y, w and h are the center coordinates, width and height of the predicted bounding box. Similarly, x*, y*, w* and h* denote the center coordinates, width and height of the groundtruth bounding box. The parameterizations (Girshick, 2015) are given by

t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a},  (6)

t^*_x = \frac{x^* - x_a}{w_a}, \quad t^*_y = \frac{y^* - y_a}{h_a}, \quad t^*_w = \log\frac{w^*}{w_a}, \quad t^*_h = \log\frac{h^*}{h_a},  (7)

where x_a, y_a, w_a and h_a denote the center coordinates, width and height of the anchor a.
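The parameterizations in Equations (6) and (7) and their inverse can be written compactly as below; this is a generic sketch of the standard box encoding (Girshick, 2015), not code from our implementation.

```python
import math

def encode_box(box, anchor):
    # box and anchor are (x, y, w, h) tuples with (x, y) the center coordinates.
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode_box(t, anchor):
    # Inverse of encode_box: recover the box from its parameterization and the anchor.
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))
```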

Similar to the objectness score loss L_os, the classification loss L_cls is defined for a set K_s of randomly sampled positive and negative matches such that |K_s| ≤ 512. The classification loss L_cls is given by

L_{cls}(\Theta) = -\frac{1}{|K_s|} \sum_{c=1}^{N_{\text{'thing'}}+1} Y^*_{oc} \cdot \log Y_{oc} \quad \text{for } (Y^*, Y) \in K_s,  (8)

where Y is the output of the classification branch, Y* is the one-hot encoded groundtruth label, o is the observed class and c is the correct classification for object o. For a given image, it is a positive match if IoU(b*, b) > T_N and otherwise a negative match, where b* is the groundtruth box and b is the object proposal from the first stage.


Figure 4 Illustration of our proposed panoptic fusion module. Here, the MLA and MLB mask logits are fused as (σ(MLA) + σ(MLB)) ⊙ (MLA + MLB), where MLB is the output of the function f*, σ(·) is the sigmoid function and ⊙ is the Hadamard product. The f* function, for a given class prediction c (cyclist in this example), zeroes out the scores of the c channel of the semantic logits outside the corresponding bounding box.

The bounding box loss L_bbx is a regression loss that is defined only on positive matches and is expressed as

L_{bbx}(\Theta) = \frac{1}{|K_s|} \sum_{(T^*,\, T) \in K_p} \; \sum_{(i^*,\, i) \in (T^*,\, T)} L_1(i^*, i),  (9)

where L_1 is the smooth L1 norm (Girshick, 2015), K_p is the subset of K_s positive matches, and T* and T are the parameterizations of B* and B respectively, analogous to Equations (6) and (7), where B* is the groundtruth box and B is the corresponding predicted bounding box.

Finally, the mask segmentation loss is also defined only for positive samples and is given by

L_{mask}(\Theta) = \frac{1}{|K_s|} \sum_{(P^*,\, P) \in K_s} L_p(P^*, P),  (10)

where L_p(P*, P) is given as

L_p(P^*, P) = -\frac{1}{|T_p|} \sum_{(i,j) \in T_p} \big[ P^*_{ij} \cdot \log P_{ij} + (1 - P^*_{ij}) \cdot \log(1 - P_{ij}) \big],  (11)

where P is the predicted 28×28 binary mask for a class, with P_ij denoting the probability of the mask pixel (i, j), P* is the 28×28 groundtruth binary mask for the class, and T_p is the set of non-void pixels in P*.

All five losses are weighed equally and the total instance segmentation head loss is given by

L_{instance} = L_{os} + L_{op} + L_{cls} + L_{bbx} + L_{mask}.  (12)

Similar to Mask R-CNN, the gradients that are computed with respect to the losses L_cls, L_bbx and L_mask flow only through the network backbone and not through the region proposal network.

3.4 Panoptic Fusion Module

In order to obtain the panoptic segmentation output, we need to fuse the predictions of the semantic segmentation head and the instance segmentation head. However, fusing both these predictions is not a straightforward task due to the inherent overlap between them. Therefore, we propose a novel panoptic fusion module to tackle this problem in an adaptive manner in order to thoroughly exploit the predictions from both heads congruously. Figure 4 shows the topology of our panoptic fusion module. We obtain a set of object instances from the instance segmentation head of our network, where for each instance we have its corresponding class prediction, confidence score, bounding box and mask logits. First, we reduce the number of predicted object instances in two stages. We begin by discarding all object instances that have a confidence score lower than a certain confidence threshold. We then resize, zero pad and scale the 28×28 mask logits of each object instance to the same resolution as the input image. Subsequently, we sort the class predictions, bounding boxes and mask logits according to their respective confidence scores. In the second stage, we check each sorted instance mask logit for overlap with the other object instances. If the overlap is higher than a given overlap threshold, we discard the other object instance.

After filtering the object instances, we have the class prediction, bounding box prediction and mask logit MLA of each instance. We simultaneously obtain the semantic logits with N channels from the semantic head, where N is the sum of N_'stuff' and N_'thing'. We then compute a second mask logit MLB for each instance, where we select the channel of the semantic logits based on its class prediction. We only keep the logit scores of the selected channel for the area within the instance bounding box, while we zero out the scores that are outside this region. In the end, we have two mask logits for each instance, one from the instance segmentation head and the other from the semantic segmentation head. We combine these two logits adaptively by computing the Hadamard product of the sum of the sigmoids of MLA and MLB with the sum of MLA and MLB, to obtain the fused mask logits FL of the instances, expressed as

FL = \big(\sigma(ML_A) + \sigma(ML_B)\big) \odot \big(ML_A + ML_B\big),  (13)

where σ(·) is the sigmoid function and ⊙ is the Hadamard product. We then concatenate the fused mask logits of the object instances with the 'stuff' logits along the channel dimension to generate the intermediate panoptic logits. Subsequently, we apply the argmax operation along the channel dimension to obtain the intermediate panoptic prediction. In the final step, we take a zero-filled canvas and first copy the instance-specific 'thing' predictions from the intermediate panoptic prediction. We then fill the empty parts of the canvas with 'stuff' class predictions by copying them from the predictions of the semantic head, while ignoring the classes that have an area smaller than a predefined threshold called the minimum stuff area. This yields the final panoptic segmentation output.

We fuse the MLA and MLB instance logits in the aforementioned manner due to the fact that if both logits for a given pixel conform with each other, the final instance score increases proportionately to their agreement, and vice-versa. In case of agreement, the corresponding object instance will either dominate or be superseded by the other instances as well as the 'stuff' class scores. Similarly, in case of disagreement, the score of the given object instance will reflect the extent of their difference. Simply put, the fused logit score is either adaptively attenuated or amplified according to the consensus. We evaluate the performance of our proposed panoptic fusion module in comparison to other existing methods in the ablation study presented in Section 4.4.5.

4 Experimental Results

In this section, we first describe the standard evaluation metrics that we adopt for empirical evaluations, followed by brief descriptions of the datasets that we benchmark on in Section 4.1. We then present extensive quantitative comparisons and benchmarking results in Section 4.3, and detailed ablation studies on the various proposed architectural components in Section 4.4. Finally, we present qualitative comparisons and visualizations of panoptic segmentation on each of the datasets that we evaluate on in Section 4.5 and Section 4.6 respectively.

We use PyTorch (Paszke et al, 2019) for implementing all our architectures, and we trained our models on a system with an Intel Xeon 2.20 GHz processor and NVIDIA TITAN X GPUs. We use the standard Panoptic Quality (PQ) metric (Kirillov et al, 2019) for quantifying the performance of our models. The PQ metric is computed as

PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|},  (14)

where TP, FP, FN and IoU are the true positives, false positives, false negatives and the intersection-over-union. The IoU is computed as IoU = TP/(TP + FP + FN). We also report the Segmentation Quality (SQ) and Recognition Quality (RQ) metrics, computed as

SQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP|},  (15)

RQ = \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}.  (16)

Following the standard benchmarking criteria for panoptic segmentation, we report PQ, SQ and RQ over all the classes in the dataset, and we also report them for the 'stuff' classes (PQSt, SQSt, RQSt) and the 'thing' classes (PQTh, SQTh, RQTh). Additionally, for the sake of completeness, we report the Average Precision (AP) and the mean Intersection-over-Union (mIoU) for both 'stuff' and 'thing' classes, as well as the inference time and FLOPs for comparisons. The implementation of our proposed EfficientPS model and a live demo on various datasets is publicly available at https://rl.uni-freiburg.de/research/panoptic.
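For clarity, a minimal sketch of how Equations (14) to (16) are evaluated for one class is given below, assuming the matched true-positive IoUs and the FP/FN counts have already been determined; the function name is illustrative.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """tp_ious: IoU values of the matched (prediction, groundtruth) pairs, i.e. the TP set."""
    tp = len(tp_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    sq = sum(tp_ious) / tp if tp else 0.0          # Equation (15)
    rq = tp / denom if denom else 0.0              # Equation (16)
    pq = sum(tp_ious) / denom if denom else 0.0    # Equation (14), equal to SQ * RQ
    return pq, sq, rq
```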

4.1 Datasets

We benchmark our proposed EfficientPS for panoptic segmentation on four challenging urban scene understanding datasets, namely Cityscapes (Cordts et al, 2016), KITTI (Geiger et al, 2013), Mapillary Vistas (Neuhold et al, 2017) and the Indian Driving Dataset (Varma et al, 2019). The KITTI benchmark does not provide panoptic annotations, therefore to facilitate this work we publicly release manually annotated panoptic groundtruth segmentation labels for the popular KITTI benchmark. These four diverse datasets contain images that range from congested city driving scenarios to rural scenes and highways. They also contain scenes in challenging perceptual conditions including snow, motion blur and other seasonal visual changes. We briefly describe the characteristics of these datasets in this section.

Cityscapes: The Cityscapes dataset (Cordts et al, 2016) consists of urban street scenes and focuses on semantic understanding of common driving scenarios. It is one of the most challenging datasets for panoptic segmentation due to its sheer diversity, as it covers scenes from over 50 European cities recorded over several seasons such as spring, summer and fall. The presence of a large number of dynamic objects further adds to its complexity. Figure 5 (a) shows an example image and the corresponding panoptic groundtruth annotation from the Cityscapes dataset.


Figure 5 Example images (RGB and corresponding panoptic groundtruth) from the challenging urban scene understanding datasets that we benchmark on, namely (a) Cityscapes, (b) KITTI, (c) Mapillary Vistas and (d) Indian Driving Dataset (IDD). The images show cluttered urban scenes with many dynamic objects, occluded objects, snowy conditions and unstructured environments.

As we can see from this example, the scenes are extremely cluttered with many dynamic objects such as pedestrians and cyclists that are often grouped near one another or partially occluded. These factors make panoptic segmentation, especially segmenting the 'thing' classes, exceedingly challenging.

The widely used Cityscapes dataset recently introduced a benchmark for the task of panoptic segmentation. The dataset contains pixel-level annotations for 19 object classes, of which 11 are 'stuff' classes and 8 are instance-specific 'thing' classes. It consists of 5000 finely annotated images and 20000 coarsely annotated images that were captured at a resolution of 2048×1024 pixels using an automotive-grade 22 cm baseline stereo camera. The finely annotated images are divided into 2975 for training, 500 for validation and 1525 for testing. The annotations for the test set are not publicly released; they are only available to the online evaluation server that automatically computes the metrics and publishes the results. We report the performance of our proposed EfficientPS on both the validation set as well as the test set. We also use the Cityscapes dataset for evaluating the improvement due to the various architectural contributions that we make in the ablation study. We report results on the validation set for our model trained only on the fine annotations, and we report the results on the test set from the benchmarking server for our model trained on both the fine and coarse annotations.

KITTI: The KITTI vision benchmark suite (Geiger et al, 2013) is one of the most comprehensive datasets that provides groundtruth for a variety of tasks such as semantic segmentation, scene flow estimation, optical flow estimation, depth prediction, odometry estimation, tracking and road lane detection. However, it still has not expanded its annotations to support the recently introduced panoptic segmentation task. The challenging nature of the KITTI scenes and its potential for benchmarking multi-task learning problems make extending this dataset to include panoptic annotations of great interest to the community. Therefore, in this work, we introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI vision benchmark suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set, therefore in addition to panoptic segmentation, they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually.

Our dataset consists of a total of 1055 images, out of which 855 are used for the training set and 200 are used for the validation set. We provide annotations for 11 'stuff' classes and 8 'thing' classes, adhering to the Cityscapes 'stuff' and 'thing' class distribution. In order to create the panoptic annotations, we gathered semantic annotations from community-driven extensions of KITTI (Xu et al, 2016; Ros et al, 2015) and combined them with the 200 training images from the KITTI semantic training set. We then manually annotated all the images with instance masks, followed by merging them with the semantic annotations to generate the panoptic segmentation groundtruth labels. The images in our KITTI panoptic segmentation dataset have a resolution of 1280×384 pixels and contain scenes from both residential and inner-city scenarios. Figure 5 (b) shows an example image from the KITTI panoptic segmentation dataset and its corresponding panoptic segmentation labels. We observe that the car denoted in teal colored pixels and the van are both partially occluded by other 'stuff' classes, such that each object instance is split into two disjoint components. We find that scenarios such as these are extremely challenging for the task of panoptic segmentation, as the disjoint object mask has to be assigned to the same instance ID. We hope that this dataset encourages innovative solutions to such real-world problems that are uncommon in other datasets and also accelerates research in multi-task learning for urban scene understanding.

Mapillary Vistas: Mapillary Vistas (Neuhold et al, 2017) is one of the largest publicly available street-level imagery datasets that contains pixel-accurate and instance-specific semantic annotations. The novel aspects of this dataset include diverse scenes from over six continents and in a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18000 images for training, 2000 images for validation and 5000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that due to the snowy condition, recognizing distant objects such as the car in this example becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset: The Indian Driving Dataset (IDD) (Varma et al, 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructure such as lanes and sidewalks. It has significantly more 'thing' instances in each scene compared to other datasets, and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car, and the data was gathered in two Indian cities as well as in their outskirts. IDD consists of a total of 10003 images, where 6993 are used for training, 981 for validation and 2029 for testing. The images have a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3, therefore we report the results of our model on the same level to facilitate comparison. This level comprises a total of 26 classes, out of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset are shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels. We take crops from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky et al, 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio, 2010) for the other layers, zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set T_H = 0.7, T_L = 0.3 and T_N = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of c_t = 0.5, an overlap threshold of o_t = 0.5 and a minimum stuff area of min_sa = 2048.

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone, and continue training until convergence. We denote the base learning rate lr_base, the milestones and the total number of iterations t_i for each dataset in the following format: {lr_base, milestone, milestone, t_i}. The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are {0.07, 32K, 44K, 50K}, {0.07, 144K, 176K, 192K}, {0.07, 16K, 22K, 25K} and {0.07, 108K, 130K, 144K} respectively. At the beginning of the training, we have a warm-up phase where the learning rate is increased linearly from (1/3)·lr_base to lr_base in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^-4. The final loss L_total that we optimize is computed as

L_{total} = L_{semantic} + L_{instance},  (17)

where L_semantic and L_instance are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA Titan X GPUs, where each GPU processes a single image.
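The learning rate schedule described above can be sketched as follows, using the Cityscapes values from the text (base learning rate 0.07, milestones at 32K and 44K iterations, 200 warm-up iterations); the separate fine-tuning phase with frozen iABN sync layers at a fixed learning rate of 10^-4 is not shown.

```python
def learning_rate(step, lr_base=0.07, milestones=(32000, 44000), warmup_steps=200):
    # Linear warm-up from lr_base/3 to lr_base, then divide by 10 at every milestone.
    if step < warmup_steps:
        return lr_base / 3.0 + (lr_base - lr_base / 3.0) * step / warmup_steps
    lr = lr_base
    for m in milestones:
        if step >= m:
            lr /= 10.0
    return lr
```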

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts, while for KITTI and IDD we report results for the models that we trained using the official implementations that have been publicly released by the authors, after further tuning of hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets, and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that at the time of submission, only the Cityscapes benchmark has the provision to evaluate the results on the test set.


Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. − denotes that the metric has not been reported for the corresponding method. All values are in (%).

Single-scale evaluation:
Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
WeaklySupervised | − | 47.3 | − | − | 39.6 | − | − | 52.9 | − | − | 24.3 | 71.6
TASCNet | − | 55.9 | − | − | 50.5 | − | − | 59.8 | − | − | − | −
Panoptic FPN | − | 58.1 | − | − | 52.0 | − | − | 62.5 | − | − | 33.0 | 75.7
AUNet | − | 59.0 | − | − | 54.8 | − | − | 62.1 | − | − | 34.4 | 75.6
UPSNet | − | 59.3 | 79.7 | 73.0 | 54.6 | 79.3 | 68.7 | 62.7 | 80.1 | 76.2 | 33.3 | 75.2
DeeperLab | − | 56.3 | − | − | − | − | − | − | − | − | − | −
Seamless | − | 60.3 | − | − | 56.1 | − | − | 63.3 | − | − | 33.6 | 77.5
SSAP | − | 61.1 | − | − | 55.0 | − | − | − | − | − | − | −
AdaptIS | − | 62.0 | − | − | 58.7 | − | − | 64.4 | − | − | 36.3 | 79.2
Panoptic-DeepLab | − | 63.0 | − | − | − | − | − | − | − | − | 35.3 | 80.5
EfficientPS (ours) | − | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3
TASCNet | COCO | 59.3 | − | − | 56.0 | − | − | 61.5 | − | − | 37.6 | 78.1
UPSNet | COCO | 60.5 | 80.9 | 73.5 | 57.0 | − | − | 63.0 | − | − | 37.8 | 77.8
Seamless | Vistas | 65.0 | − | − | 60.7 | − | − | 68.0 | − | − | − | 80.7
Panoptic-DeepLab | Vistas | 65.3 | − | − | − | − | − | − | − | − | 38.8 | 82.5
EfficientPS (ours) | Vistas | 66.1 | 82.5 | 78.9 | 62.7 | 81.9 | 75.2 | 68.5 | 82.9 | 81.6 | 41.9 | 81.0

Multi-scale evaluation:
Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic-DeepLab | − | 64.1 | − | − | − | − | − | − | − | − | 38.5 | 81.5
EfficientPS (ours) | − | 65.1 | 82.2 | 79.0 | 61.5 | 81.4 | 75.4 | 67.7 | 82.8 | 81.7 | 39.7 | 80.3
TASCNet | COCO | 60.4 | − | − | 56.1 | − | − | 63.3 | − | − | 39.1 | 78.7
M-RCNN + PSPNet | COCO | 61.2 | 80.9 | 74.4 | 54.0 | − | − | 66.4 | − | − | 36.4 | 80.9
UPSNet | COCO | 61.8 | 81.3 | 74.8 | 57.6 | 77.7 | 70.5 | 64.8 | 81.4 | − | 39.0 | 79.2
Panoptic-DeepLab | Vistas | 67.0 | − | − | − | − | − | − | − | − | 42.5 | 83.1
EfficientPS (ours) | Vistas | 67.5 | 83.2 | 80.2 | 63.5 | 82.2 | 77.2 | 70.4 | 83.9 | 82.4 | 43.8 | 82.1

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Network | Pre-training | PQ | SQ | RQ | PQTh | PQSt
SSAP | − | 58.9 | 82.4 | 70.6 | 48.4 | 66.5
TASCNet | COCO | 60.7 | 81.0 | 73.8 | 53.4 | 66.0
Panoptic-DeepLab | − | 62.3 | 82.4 | 74.8 | 52.1 | 69.7
Seamless | Vistas | 62.6 | 82.1 | 75.3 | 56.0 | 67.5
Panoptic-DeepLab | Vistas | 65.5 | 77.8 | 75.2 | 56.5 | 72.0
EfficientPS (ours) | − | 64.1 | 82.6 | 76.8 | 56.7 | 69.4
EfficientPS (ours) | Vistas | 66.4 | 83.5 | 78.8 | 59.3 | 71.5

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures.

Network | Input size (pixels) | Params (M) | FLOPs (B) | Time (ms)
DeeperLab | 1025×2049 | − | − | 463
UPSNet | 1024×2048 | 45.05 | 487.02 | 202
Seamless | 1024×2048 | 51.43 | 514.00 | 168
Panoptic-DeepLab | 1025×2049 | 46.73 | 547.49 | 175
EfficientPS (ours) | 1024×2048 | 40.89 | 433.94 | 166

On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75, 2) during the multi-scale evaluations.
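For illustration, a simple way to realize such a multi-scale, flipped evaluation for the semantic logits is sketched below in PyTorch; it assumes a hypothetical model(x) that returns per-pixel class logits and only averages softmax probabilities over scales and flips, while the corresponding merging of instance predictions is not shown.

```python
import torch
import torch.nn.functional as F

def multi_scale_semantic_inference(model, image, scales=(0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """image: [1, 3, H, W] tensor. Averages class probabilities over rescaled and
    horizontally flipped inputs; model(x) is assumed to return semantic logits."""
    _, _, h, w = image.shape
    prob_sum = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = model(inp)
            if flip:
                logits = torch.flip(logits, dims=[3])  # undo the horizontal flip
            logits = F.interpolate(logits, size=(h, w), mode='bilinear', align_corners=False)
            prob_sum = prob_sum + torch.softmax(logits, dim=1)
    return prob_sum / (2 * len(scales))
```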

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al, 2018b), TASCNet (Li et al, 2018a), Panoptic FPN (Kirillov et al, 2019), AUNet (Li et al, 2019b), UPSNet (Xiong et al, 2019), DeeperLab (Yang et al, 2019), Seamless (Porzi et al, 2019), SSAP (Gao et al, 2019), AdaptIS (Sofiiuk et al, 2019) and Panoptic-DeepLab (Cheng et al, 2019). Table 1 shows the results on the Cityscapes validation set. For a fair comparison, we categorize models in the table separately according to whether they report single-scale or multi-scale evaluation, as well as whether they use no pre-training or pre-training on other datasets, namely Mapillary Vistas (Neuhold et al, 2017), denoted as Vistas, and Microsoft COCO (Lin et al, 2014), abbreviated as COCO.


Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. − denotes that the metric has not been reported for the corresponding method. All values are in (%).

Single-scale evaluation:
Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | −
DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3
TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | −
AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | −
Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4
Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3
EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6

Multi-scale evaluation:
Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | −
Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8
EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1

We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations or depth data, nor do we exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQSt, PQTh, SQ and RQ metrics, and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS, without pre-training on any extra data, achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime, including the time consumed for fusing the semantic and instance predictions that yields the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-the-art Panoptic-DeepLab.


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Single-scale evaluation:
Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1
UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6
Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8
EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3

Multi-scale evaluation:
Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8
UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2
Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1
EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Single-scale evaluation:
Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1
UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4
Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6
EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3

Multi-scale evaluation:
Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1
UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6
Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3
EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1

Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQSt score, primarily due to the output stride of 16 at which Panoptic-DeepLab operates, which increases its computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQSt of Panoptic-DeepLab. On the other hand, top-down approaches tend to have a better instance segmentation ability, as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect the object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQSt score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on the various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '-' refers to the standard configuration as in Kirillov et al. (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model  Encoder          2-way FPN  SIH  SH  PFM  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
                                                 (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)
M1     ResNet-50        -          -    -   -    58.2  79.1  72.0  52.4  78.8  67.2  62.4  79.4  75.6  31.6  74.3
M2     ResNet-50        -          -    -   X    58.8  79.5  72.6  53.4  79.2  62.8  67.9  79.7  76.1  33.8  75.1
M3     ResNet-50        -          X    -   X    58.6  79.4  72.4  53.1  79.1  67.5  62.6  79.6  75.9  33.7  75.0
M4     EfficientNet-B5  -          X    -   X    59.7  79.9  73.3  54.7  76.6  68.1  63.3  80.3  79.5  34.1  76.3
M5     EfficientNet-B5  X          X    -   X    61.5  80.7  75.6  57.2  80.6  72.5  64.6  80.9  77.9  36.8  77.3
M6     EfficientNet-B5  X          X    X   X    63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of the various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and compare it with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the network configuration and panoptic fusion heuristics of Kirillov et al. (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network is comprised of an upsampling stage which has a 3×3 convolution, group norm (Wu and He, 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are at 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of the semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09M with a drop of 0.2% in PQ and a 0.1% drop in AP and the mIoU score. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.
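To illustrate the kind of replacement made in the M3 model, the following is a minimal PyTorch sketch of a depthwise separable 3×3 convolution used as a drop-in for a standard 3×3 convolution. The 256-channel width is only an assumption for the illustration and does not reflect the exact instance head configuration.

import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable 3x3 convolution: a depthwise 3x3 convolution followed
    by a pointwise 1x1 convolution. Compared to a standard 3x3 convolution, the
    parameter count drops from 9*C_in*C_out to 9*C_in + C_in*C_out."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison for a hypothetical 256 -> 256 channel layer
standard = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
separable = SeparableConv2d(256, 256)
print(sum(p.numel() for p in standard.parameters()))   # 589824
print(sum(p.numel() for p in separable.parameters()))  # 2304 + 65536 = 67840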

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively and enables it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, it cannot be solely attributed to the semantic head.


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Encoder                      Params  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
                             (M)     (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)
MobileNetV3                  5.40    55.8  78.1  70.2  50.4  77.4  67.1  59.8  78.6  72.4  29.1  72.2
ResNet-50                    25.60   60.3  80.1  72.6  55.3  79.9  68.9  63.9  80.3  75.3  34.9  76.1
ResNet-101                   44.50   61.1  80.3  75.1  56.5  80.1  71.9  64.2  80.5  77.4  35.9  77.2
Xception-71                  27.50   62.1  81.1  75.4  58.5  80.9  72.3  64.7  81.2  77.7  36.2  78.1
ResNeXt-101                  86.74   63.2  81.2  76.0  59.6  80.4  72.9  65.8  81.7  78.3  36.9  78.9
Mod. EfficientNet-B5 (Ours)  30.00   63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Architecture      PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
                  (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)
Bottom-Up FPN     60.4  80.6  73.7  56.3  80.4  69.9  63.4  80.8  76.4  35.2  75.3
Top-Down FPN      62.2  80.9  75.7  58.1  80.1  72.4  65.1  81.4  78.0  36.5  78.2
PANet FPN         63.1  81.1  75.5  59.4  80.3  72.3  65.8  81.6  77.8  37.1  78.8
2-way FPN (Ours)  63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

This is due to the fact that if we employ the standard panoptic fusion heuristics, an improvement in semantic segmentation would only contribute to an increase in the PQSt score. However, our proposed adaptive panoptic fusion yields an improvement in PQTh as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al., 2019), ResNet-50 (He et al., 2016), ResNet-101 (He et al., 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al., 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have comparable or fewer parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al., 2017), bottom-up FPN and PANet FPN variants. We refer to the FPN architecture described in Liu et al. (2018) as PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. Both these models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ score of the individual bottom-up FPN and top-down FPN models substantiates the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the pyramid scale level with output stride os, c^k_f refers to a convolution layer with f filters and a k×k kernel size, LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model  P32                 P16                 P8                  P4                  Feature Correlation  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
M61    [c^3_128, c^3_128]  [c^3_128, c^3_128]  [c^3_128, c^3_128]  [c^3_128, c^3_128]  -                    61.6  80.6  75.7  57.3  80.4  72.6  64.7  80.7  77.9  36.7  77.2
M62    LSFE                LSFE                LSFE                LSFE                -                    61.7  80.5  75.4  57.9  80.5  72.6  64.5  80.6  77.4  36.6  77.4
M63    DPC                 LSFE                LSFE                LSFE                -                    62.3  80.9  75.9  57.9  80.4  72.0  65.6  81.3  78.8  36.8  78.1
M64    DPC                 DPC                 LSFE                LSFE                -                    62.9  81.0  75.7  59.0  80.5  71.4  65.8  81.4  78.7  37.0  78.6
M65    DPC                 DPC                 DPC                 LSFE                -                    62.4  80.8  75.4  58.8  80.3  72.4  65.1  81.1  77.6  36.7  78.2
M66    DPC                 DPC                 LSFE                LSFE                X                    63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

the standard FPN in a sequential or parallel manner respectively. We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.
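To make the difference between these pathways concrete, the following is a minimal PyTorch sketch of a 2-way FPN in which a top-down and a bottom-up pathway operate in parallel on the encoder features and are fused at each pyramid level. The channel widths, the use of plain convolutions with nearest-neighbor upsampling and max-pooling, and the absence of iABN sync and separable convolutions are simplifying assumptions for illustration, not the exact EfficientPS configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoWayFPN(nn.Module):
    """Sketch of a 2-way FPN: parallel top-down and bottom-up pathways whose
    outputs are summed at each pyramid level before a 3x3 output convolution."""
    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=256):
        super().__init__()
        self.lateral_td = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.lateral_bu = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.out_convs = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                       for _ in in_channels)

    def forward(self, feats):  # feats: [C2, C3, C4, C5], from high to low resolution
        # Top-down pathway (coarse to fine)
        td = [self.lateral_td[-1](feats[-1])]
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(td[0], size=feats[i].shape[-2:], mode='nearest')
            td.insert(0, self.lateral_td[i](feats[i]) + up)
        # Bottom-up pathway (fine to coarse), computed in parallel
        bu = [self.lateral_bu[0](feats[0])]
        for i in range(1, len(feats)):
            down = F.max_pool2d(bu[-1], kernel_size=2)
            bu.append(self.lateral_bu[i](feats[i]) + down)
        # Fuse both pathways at each level
        return [conv(t + b) for conv, t, b in zip(self.out_convs, td, bu)]

fpn = TwoWayFPN()
feats = [torch.randn(1, c, s, 2 * s) for c, s in zip((64, 128, 256, 512), (128, 64, 32, 16))]
outputs = fpn(feats)  # four pyramid outputs with 256 channels each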

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small- and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output which is at 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output which is the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model; therefore, we employ separable convolutions in all the experiments that follow.
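As a concrete illustration of the LSFE module used from the M62 configuration onwards, the following is a minimal PyTorch sketch: two cascaded 3×3 separable convolutions with 128 output filters, each followed by a normalization layer and leaky ReLU. BatchNorm2d is used here only as a stand-in for iABN sync, and the 256 input channels are an assumption for the example.

import torch
import torch.nn as nn

def separable_conv(in_ch, out_ch):
    # 3x3 depthwise convolution followed by a 1x1 pointwise convolution
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False))

class LSFE(nn.Module):
    """Large Scale Feature Extractor sketch: two cascaded 3x3 separable
    convolutions with 128 output filters, each followed by normalization
    and leaky ReLU."""
    def __init__(self, in_channels=256, out_channels=128):
        super().__init__()
        self.block = nn.Sequential(
            separable_conv(in_channels, out_channels),
            nn.BatchNorm2d(out_channels), nn.LeakyReLU(inplace=True),
            separable_conv(out_channels, out_channels),
            nn.BatchNorm2d(out_channels), nn.LeakyReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

# One LSFE per 2-way FPN level; its output is later upsampled to 1/4 of the
# input resolution and concatenated before the final 1x1 convolution.
p4 = torch.randn(1, 256, 256, 512)
print(LSFE()(p4).shape)  # torch.Size([1, 128, 256, 512])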

In the M63 model, we replace the LSFE module at the P32 level of the 2-way FPN with dense prediction cells (DPC) described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module at the P16 level with DPC, and in the subsequent M65 model, we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with the fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features that result in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al. (2019), which we denote as the baseline, UPSNet (Xiong et al., 2019) and Seamless (Porzi et al., 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Semantic Head  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
               (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)
Baseline       61.5  80.7  75.6  57.2  80.6  72.5  64.6  80.9  77.9  36.8  77.3
UPSNet         62.0  81.0  74.7  58.5  80.5  70.9  64.5  81.3  77.5  35.9  76.1
Seamless       62.9  81.1  75.5  58.9  80.4  71.3  65.6  81.6  78.5  36.8  78.5
Ours           63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model     PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt
          (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)
Baseline  62.4  80.8  75.4  58.7  80.4  72.6  65.1  81.1  77.4
TASCNet   62.5  80.9  75.6  58.6  80.5  72.8  65.3  81.2  77.7
UPSNet    63.1  81.3  76.1  59.5  80.6  73.2  65.7  81.8  78.2
Ours      63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, even though the levels are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as a baseline as it is extensively used in several panoptic segmentation networks. We then compare with Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in Xiong et al. (2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and the instance head that have an inherent overlap is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both the heads by adding them and therefore relies on the mode computed over the segmentation output to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both the heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model, Panoptic-DeepLab, does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding


[Figure 6: rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD; columns: input image, Seamless output, EfficientPS output, improvement/error map.]

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


datasets. For each example, we show the input image and the corresponding panoptic segmentation outputs from the Seamless model and our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN that effectively aggregates multi-scale features to learn semantically richer representations, and to the panoptic fusion module that addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c) the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately address this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the fact that there is a lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by the cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head that correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearance of these two objects; however, our proposed EfficientPS model, with our novel panoptic backbone, has an extensive representational capacity which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of


[Figure 7: rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD; columns (i)-(iv) show four examples per dataset.]

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances in multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk as well as seasonal changes.


traffic participants. These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from uncommon viewpoints compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analysis and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all the four standard urban scene understanding datasets that support panoptic segmentation and exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335
Badrinarayanan V, Kendall A, Cipolla R (2015) Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229
Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons
Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision, Springer, pp 44–57
Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848

Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems, pp 8713–8724
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-deeplab: A simple, strong and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, pp 534–549
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the International Conference on Computer Vision, pp 764–773
Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) Ssap: Single-shot instance segmentation with affinity pyramid. In: Proceedings of the International Conference on Computer Vision, pp 642–651
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The kitti dataset. International Journal of Robotics Research
de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110
Girshick R (2015) Fast r-cnn. In: Proceedings of the International Conference on Computer Vision, pp 1440–1448
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256

Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, pp 297–312
Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask r-cnn. In: Proceedings of the International Conference on Computer Vision, pp 2961–2969
He X, Gould S (2014a) An exemplar-based crf for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
He X, Gould S (2014b) An exemplar-based crf for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for mobilenetv3. In: Proceedings of the International Conference on Computer Vision, pp 1314–1324
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059
Kang BR, Kim HY (2018) Bshapenet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327
Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413
Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In: Proceedings of the International Conference on Computer Vision, pp 2190–2197
Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In: Advances in Neural Information Processing Systems, pp 109–117

Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192
Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118
Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229
Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: European Conference on Computer Vision, Springer, pp 740–755
Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181
Liu S, Jia J, Fidler S, Urtasun R (2017) Sgn: Sequential grouping networks for instance segmentation. In: Proceedings of the International Conference on Computer Vision, pp 3496–3504
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768
Liu W, Rabinovich A, Berg AC (2015) Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The mapillary vistas dataset for semantic understanding of street scenes. In: Proceedings of the International Conference on Computer Vision, pp 4990–4999
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035


Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998
Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824
Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286
Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887
Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664
Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329
Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238
Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of dnns. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sofiiuk K, Barinova O, Konushin A (2019) Adaptis: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363
Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference
Tan M, Le QV (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135

Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision 63(2):113–140
Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25
Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop, State Estimation and Terrain Perception for All Terrain Mobile Robots
Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics
Valada A, Vertens J, Dhall A, Burgard W (2017) Adapnet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651
Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)
Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, Special Issue: Deep Learning for Robotic Vision
Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751
Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) Upsnet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826
Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349
Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) Deeperlab: Single-shot image parser. arXiv preprint arXiv:1902.05093
Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721
Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected mrfs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890
Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227



Figure 4 Illustration of our proposed Panoptic Fusion Module. Here, the MLA and MLB mask logits are fused as (σ(MLA) + σ(MLB)) ⊙ (MLA + MLB), where MLB is the output of the function f*, σ(·) is the sigmoid function and ⊙ is the Hadamard product. Here, the f* function for a given class prediction c (cyclist in this example) zeroes out the score of the c channel of the semantic logits outside the corresponding bounding box.

The bounding box loss Lbbx is a regression loss that is defined only on positive matches and is expressed as

L_{bbx}(\Theta) = \frac{1}{|K_s|} \sum_{(T^*, T) \in K_p} \sum_{(i^*, i) \in (T^*, T)} L_1(i^*, i),    (9)

where L_1 is the smooth L1 norm (Girshick, 2015), K_p is the subset of K_s containing the positive matches, and T^* and T are the parameterizations of B^* and B respectively, similar to Equations (4) and (5), where B^* is the groundtruth box and B is the corresponding predicted bounding box.

Finally, the mask segmentation loss is also defined only for positive samples and is given by

L_{mask}(\Theta) = -\frac{1}{|K_s|} \sum_{(P^*, P) \in K_s} L_p(P^*, P),    (10)

where L_p(P^*, P) is given as

L_p(P^*, P) = -\frac{1}{|T_p|} \sum_{(i,j) \in T_p} \left[ P^*_{ij} \cdot \log P_{ij} + (1 - P^*_{ij}) \cdot \log(1 - P_{ij}) \right],    (11)

where P is the predicted 28×28 binary mask for a class, with P_{ij} denoting the probability of the mask pixel (i, j), P^* is the 28×28 groundtruth binary mask for the class, and T_p is the set of non-void pixels in P^*.
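The following is a minimal PyTorch sketch of how such a mask loss can be computed. It uses the standard (non-negative) binary cross-entropy form averaged over the non-void pixels of each positive match; the tensor shapes and the num_samples argument are assumptions for illustration only.

import torch

def mask_loss(pred_masks, gt_masks, valid, num_samples):
    """Sketch of the mask segmentation loss of Eqs. (10)-(11).
    pred_masks:  (K, 28, 28) predicted mask probabilities for positive matches
    gt_masks:    (K, 28, 28) binary groundtruth masks
    valid:       (K, 28, 28) boolean mask of non-void pixels (T_p)
    num_samples: |K_s|, the total number of matched samples (assumption)"""
    eps = 1e-7
    p = pred_masks.clamp(eps, 1.0 - eps)
    bce = -(gt_masks * p.log() + (1.0 - gt_masks) * (1.0 - p).log())
    # Average over the non-void pixels of each mask, then average over samples
    per_instance = (bce * valid).sum(dim=(1, 2)) / valid.sum(dim=(1, 2)).clamp(min=1)
    return per_instance.sum() / num_samples

pred = torch.rand(4, 28, 28)
gt = (torch.rand(4, 28, 28) > 0.5).float()
valid = torch.ones(4, 28, 28, dtype=torch.bool)
print(mask_loss(pred, gt, valid, num_samples=4))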

All the five losses are weighed equally and the total instance segmentation head loss is given by

L_{instance} = L_{os} + L_{op} + L_{cls} + L_{bbx} + L_{mask}.    (12)

Similar to Mask R-CNN, the gradients that are computed with respect to the losses Lcls, Lbbx and Lmask flow only through the network backbone and not through the region proposal network.

3.4 Panoptic Fusion Module

In order to obtain the panoptic segmentation output, we need to fuse the predictions of the semantic segmentation head and the instance segmentation head. However, fusing both these predictions is not a straightforward task due to the inherent overlap between them. Therefore, we propose a novel panoptic fusion module to tackle the aforementioned problem in an adaptive manner, in order to thoroughly exploit the predictions from both the heads congruously. Figure 4 shows the topology of our panoptic fusion module. We obtain a set of object instances from the instance segmentation head of our network, where for each instance we have its corresponding class prediction, confidence score, bounding box and mask logits. First, we reduce the number of predicted object instances in two stages. We begin by discarding all object instances that have a confidence score of less than a certain confidence threshold. We then resize, zero pad and scale the 28×28 mask logits of each object instance to the same resolution as the input image. Subsequently, we sort the class predictions, bounding boxes and mask logits according to the respective confidence scores. In the second stage, we check each sorted instance mask logit for overlap with the other object instances. In case the overlap is higher than a given overlap threshold, we discard the other object instance.
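A minimal PyTorch sketch of this two-stage filtering is shown below. The paste_mask helper, the concrete threshold values, the assumption that boxes lie inside the image, and the interpretation of the overlap test (overlap of a mask with the union of already kept instances) are assumptions for illustration; the exact resizing, padding and thresholds used in EfficientPS may differ.

import torch
import torch.nn.functional as F

def paste_mask(mask_logit_28, box, image_size):
    # Resize a 28x28 mask logit to its bounding box and paste it onto a canvas
    # at the input resolution; pixels outside the box get a large negative logit.
    H, W = image_size
    x1, y1, x2, y2 = [int(v) for v in box]
    h, w = max(min(y2, H) - y1, 1), max(min(x2, W) - x1, 1)
    resized = F.interpolate(mask_logit_28[None, None], size=(h, w),
                            mode='bilinear', align_corners=False)[0, 0]
    canvas = torch.full((H, W), -1e4)
    canvas[y1:y1 + h, x1:x1 + w] = resized
    return canvas

def filter_instances(boxes, classes, scores, mask_logits,
                     conf_thresh=0.5, overlap_thresh=0.5, image_size=(1024, 2048)):
    # Stage 1: discard low-confidence detections
    keep = scores > conf_thresh
    boxes, classes, scores, mask_logits = boxes[keep], classes[keep], scores[keep], mask_logits[keep]
    # Sort the remaining instances by confidence
    order = scores.argsort(descending=True)
    boxes, classes, scores, mask_logits = boxes[order], classes[order], scores[order], mask_logits[order]
    # Stage 2: discard instances whose mask heavily overlaps an already kept one
    occupied = torch.zeros(image_size, dtype=torch.bool)
    kept = []
    for i in range(len(scores)):
        mask = paste_mask(mask_logits[i], boxes[i], image_size) > 0
        overlap = (mask & occupied).sum().float() / mask.sum().clamp(min=1)
        if overlap <= overlap_thresh:
            kept.append(i)
            occupied |= mask
    return boxes[kept], classes[kept], scores[kept], mask_logits[kept]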

After filtering the object instances, we have the class prediction, bounding box prediction and mask logit MLA of each instance. We simultaneously obtain the semantic logits with N channels from the semantic head, where N is the sum of N'stuff' and N'thing'. We then compute a second mask logit MLB for each instance, where we select the channel of the semantic logits based on its class prediction. We only keep the logit score of the selected channel for the area within the instance bounding box, while we zero out the scores that


are outside this region. In the end, we have two mask logits for each instance: one from the instance segmentation head and the other from the semantic segmentation head. We combine these two logits adaptively by computing the Hadamard product of the sum of the sigmoid of MLA and the sigmoid of MLB, and the sum of MLA and MLB, to obtain the fused mask logits FL of the instances, expressed as

FL = (\sigma(ML_A) + \sigma(ML_B)) \odot (ML_A + ML_B),    (13)

where σ(·) is the sigmoid function and ⊙ is the Hadamard product. We then concatenate the fused mask logits of the object instances with the 'stuff' logits along the channel dimension to generate the intermediate panoptic logits. Subsequently, we apply the argmax operation along the channel dimension to obtain the intermediate panoptic prediction. In the final step, we take a zero-filled canvas and first copy the instance-specific 'thing' predictions from the intermediate panoptic prediction. We then fill the empty parts of the canvas with 'stuff' class predictions by copying them from the predictions of the semantic head, while ignoring the classes that have an area smaller than a predefined threshold called the minimum stuff area. This then gives us the final panoptic segmentation output.

We fuse the MLA and MLB instance logits in the aforementioned manner due to the fact that if both logits for a given pixel conform with each other, the final instance score will increase proportionately to their agreement, or vice-versa. In case of agreement, the corresponding object instance will dominate, or be superseded by other instances as well as the 'stuff' class scores. Similarly, in case of disagreement, the score of the given object instance will reflect the extent of their difference. Simply put, the fused logit score is either adaptively attenuated or amplified according to the consensus. We evaluate the performance of our proposed panoptic fusion module in comparison to other existing methods in the ablation study presented in Section 4.4.5.
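A minimal PyTorch sketch of this fusion step is given below. The fuse_instance_logits function follows Equation (13), while panoptic_prediction sketches the intermediate argmax and the canvas-filling procedure; the min_stuff_area value, the use of only the 'stuff' channels for the fill step and the simple channel-index instance IDs are assumptions for illustration.

import torch

def fuse_instance_logits(ml_a, ml_b):
    # Adaptive fusion of the two mask logits of an instance, Eq. (13):
    # FL = (sigmoid(MLA) + sigmoid(MLB)) elementwise-multiplied by (MLA + MLB)
    return (torch.sigmoid(ml_a) + torch.sigmoid(ml_b)) * (ml_a + ml_b)

def panoptic_prediction(fused_logits, stuff_logits, min_stuff_area=2048):
    # fused_logits: (num_instances, H, W) fused 'thing' logits
    # stuff_logits: (num_stuff, H, W) 'stuff' channels of the semantic logits
    num_stuff = stuff_logits.shape[0]
    # Intermediate panoptic logits and prediction
    logits = torch.cat([stuff_logits, fused_logits], dim=0)
    intermediate = logits.argmax(dim=0)

    canvas = torch.full_like(intermediate, -1)
    # Copy the instance-specific 'thing' predictions first
    for idx in range(fused_logits.shape[0]):
        canvas[intermediate == num_stuff + idx] = num_stuff + idx
    # Fill the remaining pixels with 'stuff' predictions covering a sufficient area
    stuff_pred = stuff_logits.argmax(dim=0)
    for cls in range(num_stuff):
        region = (canvas == -1) & (stuff_pred == cls)
        if region.sum() >= min_stuff_area:
            canvas[region] = cls
    return canvas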

4 Experimental Results

In this section, we first describe the standard evaluation metrics that we adopt for empirical evaluations, followed by brief descriptions of the datasets that we benchmark on in Section 4.1. We then present extensive quantitative comparisons and benchmarking results in Section 4.3 and detailed ablation studies on the various proposed architectural components in Section 4.4. Finally, we present qualitative comparisons and visualizations of panoptic segmentation on each of the datasets that we evaluate on in Section 4.5 and Section 4.6 respectively.

We use PyTorch (Paszke et al., 2019) for implementing all our architectures, and we trained our models on a system with an Intel Xeon 2.20GHz processor and NVIDIA TITAN X GPUs. We use the standard Panoptic Quality (PQ)

metric (Kirillov et al., 2019) for quantifying the performance of our models. The PQ metric is computed as

PQ = \frac{\sum_{(p,g) \in TP} IoU(p, g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|},    (14)

where TP, FP, FN and IoU are the true positives, false positives, false negatives and the intersection-over-union respectively. The IoU is computed as IoU = TP / (TP + FP + FN). We also report the Segmentation Quality (SQ) and Recognition Quality (RQ) metrics, computed as

SQ = \frac{\sum_{(p,g) \in TP} IoU(p, g)}{|TP|},    (15)

RQ = \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}.    (16)

Following the standard benchmarking criteria for panoptic segmentation, we report PQ, SQ and RQ over all the classes in the dataset, and we also report them for the 'stuff' classes (PQSt, SQSt, RQSt) and the 'thing' classes (PQTh, SQTh, RQTh). Additionally, for the sake of completeness, we report the Average Precision (AP) and the mean Intersection-over-Union (mIoU) for both 'stuff' and 'thing' classes, as well as the inference time and FLOPs for comparisons. The implementation of our proposed EfficientPS model and a live demo on various datasets is publicly available at https://rl.uni-freiburg.de/research/panoptic.
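For clarity, the following small Python sketch evaluates Equations (14)-(16) for a single class, given the IoU of each matched (true-positive) segment pair and the counts of unmatched predicted and groundtruth segments; it also uses the identity PQ = SQ × RQ that follows directly from the definitions. The example inputs are hypothetical.

def panoptic_quality(iou_per_tp, num_fp, num_fn):
    """Compute PQ, SQ and RQ (Eqs. 14-16) for one class.
    iou_per_tp: IoU of every matched segment pair (matching follows the
                standard PQ definition of Kirillov et al., 2019)
    num_fp:     number of unmatched predicted segments
    num_fn:     number of unmatched groundtruth segments"""
    tp = len(iou_per_tp)
    if tp + num_fp + num_fn == 0:
        return 0.0, 0.0, 0.0
    sq = sum(iou_per_tp) / tp if tp > 0 else 0.0
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)
    return sq * rq, sq, rq  # PQ = SQ * RQ

# Example: 3 matched segments, 1 false positive, 2 false negatives
print(panoptic_quality([0.9, 0.8, 0.75], num_fp=1, num_fn=2))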

4.1 Datasets

We benchmark our proposed EfficientPS for panoptic segmentation on four challenging urban scene understanding datasets, namely Cityscapes (Cordts et al., 2016), KITTI (Geiger et al., 2013), Mapillary Vistas (Neuhold et al., 2017) and the Indian Driving Dataset (Varma et al., 2019). The KITTI benchmark does not provide panoptic annotations; therefore, to facilitate this work, we publicly release manually annotated panoptic groundtruth segmentation labels for the popular KITTI benchmark. These four diverse datasets contain images that range from congested city driving scenarios to rural scenes and highways. They also contain scenes in challenging perceptual conditions including snow, motion blur and other seasonal visual changes. We briefly describe the characteristics of these datasets in this section.

Cityscapes: The Cityscapes dataset (Cordts et al., 2016) consists of urban street scenes and focuses on semantic understanding of common driving scenarios. It is one of the most challenging datasets for panoptic segmentation due to its sheer diversity, as it covers scenes from over 50 European cities recorded over several seasons such as spring, summer and fall. The presence of a large number of dynamic objects further adds to its complexity. Figure 5 (a) shows an


[Figure 5: rows (a) Cityscapes, (b) KITTI, (c) Mapillary Vistas, (d) IDD; columns: RGB image, panoptic groundtruth.]

Figure 5 Example images from the challenging urban scene understanding datasets that we benchmark on, namely Cityscapes, KITTI, Mapillary Vistas and the Indian Driving Dataset (IDD). The images show cluttered urban scenes with many dynamic objects, occluded objects, perpetual snowy conditions and unstructured environments.

example image and the corresponding panoptic groundtruth annotation from the Cityscapes dataset. As we see from this example, the scenes are extremely cluttered with many dynamic objects such as pedestrians and cyclists that are often grouped near one another or partially occluded. These factors make panoptic segmentation, especially segmenting the 'thing' classes, exceedingly challenging.

The widely used Cityscapes dataset recently introduced a benchmark for the task of panoptic segmentation. The dataset contains pixel-level annotations for 19 object classes, of which 11 are 'stuff' classes and 8 are instance-specific 'thing' classes. It consists of 5000 finely annotated images and 20,000 coarsely annotated images that were captured at a resolution of 2048×1024 pixels using an automotive-grade 22 cm baseline stereo camera. The finely annotated images are divided into 2975 for training, 500 for validation and 1525 for testing. The annotations for the test set are not publicly released; they are rather only available to the online evaluation server that automatically computes the metrics and publishes the results. We report the performance of our proposed EfficientPS on both the validation set as well as the test set. We also use the Cityscapes dataset for evaluating the improvement due to the various architectural contributions that we make in the ablation study. We report results on the validation set for our model trained only on the fine annotations, and we report the results on the test set from the benchmarking server for our model trained on both the fine and coarse annotations.

KITTI: The KITTI vision benchmark suite (Geiger et al., 2013) is one of the most comprehensive datasets that provides groundtruth for a variety of tasks such as semantic segmentation, scene flow estimation, optical flow estimation, depth prediction, odometry estimation, tracking and road lane detection. However, it still has not expanded its annotations to support the recently introduced panoptic segmentation task. The challenging nature of the KITTI scenes and its potential for benchmarking multi-task learning problems makes extending this dataset to include panoptic annotations of great interest to the community. Therefore, in this work we introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI vision benchmark suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set, therefore in addition to panoptic segmentation, they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually.

Our dataset consists of a total of 1055 images, out of which 855 are used for the training set and 200 are used for the validation set. We provide annotations for 11 'stuff' classes and 8 'thing' classes, adhering to the Cityscapes 'stuff' and 'thing' class distribution. In order to create panoptic annotations, we gathered semantic annotations from community-driven extensions of KITTI (Xu et al., 2016; Ros et al., 2015) and combined them with the 200 training images from the KITTI semantic training set. We then manually annotated all the images with instance masks, followed by merging them with the semantic annotations to generate the panoptic segmentation groundtruth labels. The images in our KITTI panoptic segmentation dataset are at a resolution of 1280×384 pixels and contain scenes from both residential and inner city scenarios. Figure 5 (b) shows an example image from the KITTI panoptic segmentation dataset and its corresponding panoptic segmentation labels. We observe that the car denoted in teal color pixels and the van are both partially occluded by other 'stuff' classes such that the object instances are split into two disjoint components. We find that scenarios such as these are extremely challenging for the task of panoptic segmentation as the disjoint object mask has to be assigned to the same instance ID. We hope that this dataset encourages innovative solutions to such real-world problems that are uncommon in other datasets and also accelerates research in multi-task learning for urban scene understanding.
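To illustrate the final merging step described above, the sketch below combines a semantic annotation map and an instance mask into a single panoptic label map. The class_id × 1000 + instance_id encoding, the function name and the label divisor are illustrative assumptions and do not necessarily correspond to the released annotation format.

```python
import numpy as np

def merge_to_panoptic(semantic, instance, thing_ids, label_divisor=1000):
    """Combine a semantic map and an instance mask into one panoptic ID map.

    semantic:  HxW array of semantic class IDs.
    instance:  HxW array with a unique ID (>0) per annotated object instance.
    thing_ids: set of class IDs that are instance-specific 'thing' classes.
    """
    panoptic = semantic.astype(np.int64) * label_divisor
    thing_mask = np.isin(semantic, list(thing_ids)) & (instance > 0)
    # 'thing' pixels additionally carry their instance ID; 'stuff' pixels keep ID 0.
    panoptic[thing_mask] += instance[thing_mask].astype(np.int64)
    return panoptic
```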

Mapillary Vistas: Mapillary Vistas (Neuhold et al., 2017) is one of the largest publicly available street-level imagery datasets that contains pixel-accurate and instance-specific semantic annotations. The novel aspects of this dataset include


diverse scenes from over six continents and in a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18,000 images for training, 2,000 images for validation and 5,000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that due to the snowy conditions, recognizing distant objects such as the car in this example becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset: The Indian Driving Dataset (IDD) (Varma et al., 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructure such as lanes and sidewalks. It has a significantly larger number of 'thing' instances in each scene compared to other datasets and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car, and the data was gathered in two Indian cities as well as in their outskirts. IDD consists of a total of 10,003 images, where 6993 are used for training, 981 for validation and 2029 for testing. The images are at a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3, therefore we report the results of our model on the same to facilitate comparison. This level comprises a total of 26 classes, out of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset is shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels. We take crops from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky et al., 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio, 2010) for the other layers, zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set T_H = 0.7, T_L = 0.3 and T_N = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of c_t = 0.5, an overlap threshold of o_t = 0.5 and a minimum stuff area of min_sa = 2048.

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone, and continue training until convergence. We denote the base learning rate lr_base, milestones and the total number of iterations t_i for each dataset in the following format: {lr_base, milestone, milestone, t_i}. The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are {0.07, 32K, 44K, 50K}, {0.07, 144K, 176K, 192K}, {0.07, 16K, 22K, 25K} and {0.07, 108K, 130K, 144K} respectively. At the beginning of the training, we have a warm-up phase where the learning rate is increased linearly from 1/3 · lr_base to lr_base in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^-4. The final loss L_total that we optimize is computed as

L_{total} = L_{semantic} + L_{instance},   (17)

where L_{semantic} and L_{instance} are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA Titan X GPUs, where each GPU processes a single image.
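A minimal sketch of this optimization schedule, assuming a standard PyTorch SGD optimizer and using the Cityscapes milestones from above, is shown below. The function name and the use of LambdaLR are illustrative choices, and the final fine-tuning phase with frozen iABN sync layers is not included.

```python
import torch

def make_optimizer_and_scheduler(model, lr_base=0.07, milestones=(32000, 44000),
                                 warmup_iters=200, warmup_start_factor=1.0 / 3.0):
    """Multi-step learning rate schedule with a linear warm-up phase."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_base, momentum=0.9)

    def lr_lambda(it):
        if it < warmup_iters:
            # linear warm-up from lr_base/3 to lr_base over the first 200 iterations
            alpha = it / warmup_iters
            return warmup_start_factor * (1.0 - alpha) + alpha
        # drop the learning rate by a factor of 10 at each milestone
        return 0.1 ** sum(it >= m for m in milestones)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

The scheduler is stepped once per training iteration; for the Cityscapes schedule above, training then continues until the 50K-iteration budget is reached.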

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts, while for KITTI and IDD we report results for the models that we trained using the official implementations that have been publicly released by the authors, after further tuning of hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets, and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that at the time of submission, only the Cityscapes benchmark has the provision to evaluate the results on the test set.


Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively; − denotes that the metric has not been reported for the corresponding method. All values are in (%).

Single-Scale:

| Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WeaklySupervised | − | 47.3 | − | − | 39.6 | − | − | 52.9 | − | − | 24.3 | 71.6 |
| TASCNet | − | 55.9 | − | − | 50.5 | − | − | 59.8 | − | − | − | − |
| Panoptic FPN | − | 58.1 | − | − | 52.0 | − | − | 62.5 | − | − | 33.0 | 75.7 |
| AUNet | − | 59.0 | − | − | 54.8 | − | − | 62.1 | − | − | 34.4 | 75.6 |
| UPSNet | − | 59.3 | 79.7 | 73.0 | 54.6 | 79.3 | 68.7 | 62.7 | 80.1 | 76.2 | 33.3 | 75.2 |
| DeeperLab | − | 56.3 | − | − | − | − | − | − | − | − | − | − |
| Seamless | − | 60.3 | − | − | 56.1 | − | − | 63.3 | − | − | 33.6 | 77.5 |
| SSAP | − | 61.1 | − | − | 55.0 | − | − | − | − | − | − | − |
| AdaptIS | − | 62.0 | − | − | 58.7 | − | − | 64.4 | − | − | 36.3 | 79.2 |
| Panoptic-DeepLab | − | 63.0 | − | − | − | − | − | − | − | − | 35.3 | 80.5 |
| EfficientPS (ours) | − | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3 |
| TASCNet | COCO | 59.3 | − | − | 56 | − | − | 61.5 | − | − | 37.6 | 78.1 |
| UPSNet | COCO | 60.5 | 80.9 | 73.5 | 57.0 | − | − | 63.0 | − | − | 37.8 | 77.8 |
| Seamless | Vistas | 65.0 | − | − | 60.7 | − | − | 68.0 | − | − | − | 80.7 |
| Panoptic-DeepLab | Vistas | 65.3 | − | − | − | − | − | − | − | − | 38.8 | 82.5 |
| EfficientPS (ours) | Vistas | 66.1 | 82.5 | 78.9 | 62.7 | 81.9 | 75.2 | 68.5 | 82.9 | 81.6 | 41.9 | 81.0 |

Multi-Scale:

| Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Panoptic-DeepLab | − | 64.1 | − | − | − | − | − | − | − | − | 38.5 | 81.5 |
| EfficientPS (ours) | − | 65.1 | 82.2 | 79.0 | 61.5 | 81.4 | 75.4 | 67.7 | 82.8 | 81.7 | 39.7 | 80.3 |
| TASCNet | COCO | 60.4 | − | − | 56.1 | − | − | 63.3 | − | − | 39.1 | 78.7 |
| M-RCNN + PSPNet | COCO | 61.2 | 80.9 | 74.4 | 54.0 | − | − | 66.4 | − | − | 36.4 | 80.9 |
| UPSNet | COCO | 61.8 | 81.3 | 74.8 | 57.6 | 77.7 | 70.5 | 64.8 | 81.4 | | 39.0 | 79.2 |
| Panoptic-DeepLab | Vistas | 67.0 | − | − | − | − | − | − | − | − | 42.5 | 83.1 |
| EfficientPS (ours) | Vistas | 67.5 | 83.2 | 80.2 | 63.5 | 82.2 | 77.2 | 70.4 | 83.9 | 82.4 | 43.8 | 82.1 |

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

| Network | Pre-training | PQ | SQ | RQ | PQTh | PQSt |
|---|---|---|---|---|---|---|
| SSAP | − | 58.9 | 82.4 | 70.6 | 48.4 | 66.5 |
| TASCNet | COCO | 60.7 | 81.0 | 73.8 | 53.4 | 66.0 |
| Panoptic-DeepLab | − | 62.3 | 82.4 | 74.8 | 52.1 | 69.7 |
| Seamless | Vistas | 62.6 | 82.1 | 75.3 | 56.0 | 67.5 |
| Panoptic-DeepLab | Vistas | 65.5 | 77.8 | 75.2 | 56.5 | 72.0 |
| EfficientPS (ours) | − | 64.1 | 82.6 | 76.8 | 56.7 | 69.4 |
| EfficientPS (ours) | Vistas | 66.4 | 83.5 | 78.8 | 59.3 | 71.5 |

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures.

| Network | Input Size (pixels) | Params (M) | FLOPs (B) | Time (ms) |
|---|---|---|---|---|
| DeeperLab | 1025×2049 | − | − | 463 |
| UPSNet | 1024×2048 | 45.05 | 487.02 | 202 |
| Seamless | 1024×2048 | 51.43 | 514.00 | 168 |
| Panoptic-DeepLab | 1025×2049 | 46.73 | 547.49 | 175 |
| EfficientPS (ours) | 1024×2048 | 40.89 | 433.94 | 166 |

On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75, 2) during the multi-scale evaluations.
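The following sketch illustrates this multi-scale evaluation protocol for the semantic logits, averaging predictions over the listed scales and their horizontal flips. The function name and the simple averaging of logits are illustrative assumptions; merging the instance predictions across scales additionally requires box and mask aggregation that is not shown here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_logits(predict_fn, image,
                       scales=(0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """Average semantic logits over rescaled and horizontally flipped inputs.

    predict_fn: callable mapping a (1, 3, H, W) tensor to (1, C, H, W) logits.
    image:      (1, 3, H, W) input tensor.
    """
    _, _, h, w = image.shape
    fused = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear',
                               align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = predict_fn(inp)
            if flip:
                logits = torch.flip(logits, dims=[3])   # undo the horizontal flip
            fused = fused + F.interpolate(logits, size=(h, w), mode='bilinear',
                                          align_corners=False)
    return fused / (2 * len(scales))
```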

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al., 2018b), TASCNet (Li et al., 2018a), Panoptic FPN (Kirillov et al., 2019), AUNet (Li et al., 2019b), UPSNet (Xiong et al., 2019), DeeperLab (Yang et al., 2019), Seamless (Porzi et al., 2019), SSAP (Gao et al., 2019), AdaptIS (Sofiiuk et al., 2019) and Panoptic-DeepLab (Cheng et al., 2019). Table 1 shows the results on the Cityscapes validation set. For a fair comparison, we categorize models in the table separately according to those that report single-scale and multi-scale evaluation, as well as those without any pre-training and with pre-training on other datasets, namely Mapillary Vistas (Neuhold et al., 2017), denoted as Vistas, and Microsoft COCO (Lin et al., 2014), abbreviated as COCO.


Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively; − denotes that the metric has not been reported for the corresponding method. All values are in (%).

Single-Scale:

| Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | − |
| DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3 |
| TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | − |
| AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | − |
| Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4 |
| Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3 |
| EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6 |

Multi-Scale:

| Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | − |
| Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8 |
| EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1 |

We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations, depth data, or exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale evaluation and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQSt, PQTh, SQ and RQ metrics and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS, without pre-training on any extra data, achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime including the time consumed for fusing the semantic and instance predictions that yields the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as the presence of 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-the-art Panoptic-DeepLab.


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Single-Scale:

| Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1 |
| UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6 |
| Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8 |
| EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3 |

Multi-Scale:

| Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8 |
| UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2 |
| Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1 |
| EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4 |

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Single-Scale:

| Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1 |
| UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4 |
| Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6 |
| EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3 |

Multi-Scale:

| Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1 |
| UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6 |
| Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3 |
| EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1 |

Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQSt score, primarily due to the output stride of 16 at which it operates, which increases the computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQSt of Panoptic-DeepLab. On the other hand, top-down approaches tend to have better instance segmentation ability as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4.0% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQSt score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '-' refers to the standard configuration as in Kirillov et al. (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

| Model | Encoder | 2-way FPN | SIH | SH | PFM | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | ResNet-50 | - | - | - | - | 58.2 | 79.1 | 72.0 | 52.4 | 78.8 | 67.2 | 62.4 | 79.4 | 75.6 | 31.6 | 74.3 |
| M2 | ResNet-50 | - | - | - | X | 58.8 | 79.5 | 72.6 | 53.4 | 79.2 | 62.8 | 67.9 | 79.7 | 76.1 | 33.8 | 75.1 |
| M3 | ResNet-50 | - | X | - | X | 58.6 | 79.4 | 72.4 | 53.1 | 79.1 | 67.5 | 62.6 | 79.6 | 75.9 | 33.7 | 75.0 |
| M4 | EfficientNet-B5 | - | X | - | X | 59.7 | 79.9 | 73.3 | 54.7 | 76.6 | 68.1 | 63.3 | 80.3 | 79.5 | 34.1 | 76.3 |
| M5 | EfficientNet-B5 | X | X | - | X | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3 |
| M6 | EfficientNet-B5 | X | X | X | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3 |

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and its comparison with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the network configuration and panoptic fusion heuristics of Kirillov et al. (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network is comprised of an upsampling stage which has a 3×3 convolution, group norm (Wu and He, 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are at 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.
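For reference, a minimal PyTorch sketch of this baseline (M1) semantic head is given below. The 256 input channels, 128 intermediate channels, 19 output classes and the 1×1 projection at the P4 level are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleStage(nn.Module):
    """3x3 convolution + group norm + ReLU + x2 bilinear upsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.norm = nn.GroupNorm(32, out_ch)

    def forward(self, x):
        x = F.relu(self.norm(self.conv(x)))
        return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)


class BaselineSemanticHead(nn.Module):
    """Baseline (M1) semantic head: every FPN level is brought to 1/4 scale,
    summed element-wise and projected to per-class logits at full resolution."""
    def __init__(self, in_ch=256, mid_ch=128, num_classes=19):
        super().__init__()
        # number of x2 upsampling stages per level: P32 -> 3, P16 -> 2, P8 -> 1, P4 -> 0
        self.branches = nn.ModuleList(
            [self._make_branch(in_ch, mid_ch, n) for n in (3, 2, 1, 0)])
        self.classifier = nn.Conv2d(mid_ch, num_classes, 1)

    @staticmethod
    def _make_branch(in_ch, mid_ch, num_stages):
        if num_stages == 0:
            # P4 is already at 1/4 scale; only the channel dimension is matched here
            return nn.Conv2d(in_ch, mid_ch, 1)
        stages = [UpsampleStage(in_ch if i == 0 else mid_ch, mid_ch)
                  for i in range(num_stages)]
        return nn.Sequential(*stages)

    def forward(self, p32, p16, p8, p4):
        feats = [branch(x) for branch, x in zip(self.branches, (p32, p16, p8, p4))]
        fused = torch.stack(feats, dim=0).sum(dim=0)    # element-wise sum at 1/4 scale
        logits = self.classifier(fused)
        # x4 bilinear upsampling to the input resolution; softmax is applied outside
        return F.interpolate(logits, scale_factor=4, mode='bilinear', align_corners=False)
```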

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M with a drop of 0.2% in PQ and a 0.1% drop in the AP and mIoU scores. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively and enables it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, it cannot be solely attributed to the semantic head. This is due to the fact that if we employed standard panoptic fusion heuristics, an improvement in semantic segmentation would only contribute to an increase in the PQSt score.


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

| Encoder | Params (M) | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2 |
| ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1 |
| ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2 |
| Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1 |
| ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9 |
| Mod. EfficientNet-B5 (Ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3 |

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

| Architecture | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3 |
| Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2 |
| PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8 |
| 2-way FPN (Ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3 |

However, our proposed adaptive panoptic fusion yields an improvement in PQTh as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al., 2019), ResNet-50 (He et al., 2016), ResNet-101 (He et al., 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al., 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have comparable or fewer parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al., 2017), the bottom-up FPN and PANet FPN variants. We refer to the FPN architecture described in Liu et al. (2018) as PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. Both these models achieve a performance that is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ score of the individual bottom-up FPN and top-down FPN models substantiates the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to the standard FPN, in a sequential or parallel manner respectively.


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the os pyramid scale level. c^k_f refers to a convolution layer with f filters and a k×k kernel size. LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

| Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M61 | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | - | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2 |
| M62 | LSFE | LSFE | LSFE | LSFE | - | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4 |
| M63 | DPC | LSFE | LSFE | LSFE | - | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1 |
| M64 | DPC | DPC | LSFE | LSFE | - | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6 |
| M65 | DPC | DPC | DPC | LSFE | - | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2 |
| M66 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3 |

We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small- and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output that is at 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output at the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model; therefore, we employ separable convolutions in all the experiments that follow.
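A minimal sketch of such an LSFE block with separable convolutions is given below; standard batch normalization is used here as a self-contained stand-in for iABN sync, and the 256 input and 128 output channels are assumptions made for illustration.

```python
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class LSFE(nn.Module):
    """Large Scale Feature Extractor: two cascaded 3x3 separable convolutions,
    each followed by normalization and a leaky ReLU activation."""
    def __init__(self, in_ch=256, out_ch=128):
        super().__init__()
        self.block = nn.Sequential(
            SeparableConv2d(in_ch, out_ch),
            nn.BatchNorm2d(out_ch),            # stand-in for iABN sync
            nn.LeakyReLU(0.01, inplace=True),
            SeparableConv2d(out_ch, out_ch),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.01, inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```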

In the M63 model, we replace the LSFE module at the P32 level of the 2-way FPN with dense prediction cells (DPC) described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module at the P16 level with DPC, and in the subsequent M65 model we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with the fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, resulting in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al. (2019), which we denote as the baseline, UPSNet (Xiong et al., 2019) and Seamless (Porzi et al., 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

| Semantic Head | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3 |
| UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1 |
| Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5 |
| Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3 |

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

| Model | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 62.4 | 80.8 | 75.4 | 58.7 | 80.4 | 72.6 | 65.1 | 81.1 | 77.4 |
| TASCNet | 62.5 | 80.9 | 75.6 | 58.6 | 80.5 | 72.8 | 65.3 | 81.2 | 77.7 |
| UPSNet | 63.1 | 81.3 | 76.1 | 59.5 | 80.6 | 73.2 | 65.7 | 81.8 | 78.2 |
| Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 |

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as a baseline as they are extensively used in several panoptic segmentation networks. We then compare with Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in (Xiong et al., 2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and instance head that have an inherent overlap is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them, and therefore relies on the mode computed over the combined segmentation output to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.
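For reference, the sketch below implements the simple baseline heuristic described above, in which overlaps are always resolved in favor of the instance head. The ID encoding, function name and the confidence-sorted ordering of the masks are illustrative assumptions, and this is not our proposed adaptive fusion.

```python
import numpy as np

def baseline_fusion(semantic_pred, instance_masks, instance_classes,
                    label_divisor=1000):
    """Resolve 'thing'/'stuff' overlaps by always preferring the instance head.

    semantic_pred:    HxW array of predicted semantic class IDs.
    instance_masks:   list of HxW boolean masks, sorted by descending confidence.
    instance_classes: predicted 'thing' class ID for each mask.
    """
    panoptic = semantic_pred.astype(np.int64) * label_divisor   # 'stuff' prediction
    assigned = np.zeros(semantic_pred.shape, dtype=bool)
    for inst_id, (mask, cls) in enumerate(zip(instance_masks, instance_classes), 1):
        mask = mask & ~assigned               # higher-confidence instances win overlaps
        panoptic[mask] = cls * label_divisor + inst_id
        assigned |= mask
    return panoptic
```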

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model Panoptic-DeepLab does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding datasets.


Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI and (g)-(h) IDD. Each row shows the input image, the Seamless output, the EfficientPS output and the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


For each example, we show the input image and the corresponding panoptic segmentation outputs from the Seamless model and our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN that effectively aggregates multi-scale features to learn semantically richer representations, and to the panoptic fusion module that addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c) the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately address this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by the cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head that correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearance of these two objects; however, our proposed EfficientPS model with our novel panoptic backbone has an extensive representational capacity, which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of traffic participants.


Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI and (g)-(h) IDD, with columns (i)-(iv) showing different examples per dataset; in total these encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances at multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk as well as seasonal changes.


These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from uncommon viewpoints compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analyses and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and to exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P Pont-Tuset J Barron JT Marques F Malik J (2014)Multiscale combinatorial grouping In Proceedings of the Confer-ence on Computer Vision and Pattern Recognition pp 328ndash335

Badrinarayanan V Kendall A Cipolla R (2015) Segnet A deep convo-lutional encoder-decoder architecture for image segmentation arXivpreprint arXiv 151100561

Bai M Urtasun R (2017) Deep watershed transform for instance seg-mentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 5221ndash5229

Bremner JG Slater A (2008) Theories of infant development JohnWiley amp Sons

Brostow GJ Shotton J Fauqueur J Cipolla R (2008) Segmentation andrecognition using structure from motion point clouds In Europeanconference on computer vision Springer pp 44ndash57

Bulo SR Neuhold G Kontschieder P (2017) Loss max-pooling forsemantic image segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 7082ndash7091

Chen LC Papandreou G Kokkinos I Murphy K Yuille AL (2017a)Deeplab Semantic image segmentation with deep convolutional nets

EfficientPS Efficient Panoptic Segmentation 25

atrous convolution and fully connected crfs IEEE transactions onpattern analysis and machine intelligence 40(4)834ndash848

Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587

Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems, pp 8713–8724

Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611

Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-deeplab: A simple, strong and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194

Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258

Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223

Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In European Conference on Computer Vision, pp 534–549

Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In Proceedings of the International Conference on Computer Vision, pp 764–773

Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) Ssap: Single-shot instance segmentation with affinity pyramid. In Proceedings of the International Conference on Computer Vision, pp 642–651

Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The kitti dataset. International Journal of Robotics Research

de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110

Girshick R (2015) Fast r-cnn. In Proceedings of the International Conference on Computer Vision, pp 1440–1448

Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256

Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In European Conference on Computer Vision, pp 297–312

Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456

He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778

He K, Gkioxari G, Dollar P, Girshick R (2017) Mask r-cnn. In Proceedings of the International Conference on Computer Vision, pp 2961–2969

He X, Gould S (2014a) An exemplar-based crf for multi-instance object segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition

He X, Gould S (2014b) An exemplar-based crf for multi-instance object segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303

Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for mobilenetv3. In Proceedings of the International Conference on Computer Vision, pp 1314–1324

Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141

Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167

Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059

Kang BR, Kim HY (2018) Bshapenet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327

Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413

Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In Proceedings of the International Conference on Computer Vision, pp 2190–2197

Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in Neural Information Processing Systems, pp 109–117

Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192

Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118

Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229

Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367

Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035

Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft coco: Common objects in context. In European Conference on Computer Vision, Springer, pp 740–755

Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125

Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181

Liu S, Jia J, Fidler S, Urtasun R (2017) Sgn: Sequential grouping networks for instance segmentation. In Proceedings of the International Conference on Computer Vision, pp 3496–3504

Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768

Liu W, Rabinovich A, Berg AC (2015) Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579

Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition

Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the International Conference on Computer Vision, pp 4990–4999

Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp 8024–8035


Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In Advances in Neural Information Processing Systems, pp 1990–1998

Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In Proceedings of the International Conference on Machine Learning, pp 817–824

Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286

Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887

Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664

Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In European Conference on Computer Vision, Springer, pp 312–329

Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In IEEE Winter Conference on Applications of Computer Vision, pp 231–238

Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of dnns. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252

Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition

Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In European Conference on Computer Vision, pp 616–631

Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

Sofiiuk K, Barinova O, Konushin A (2019) Adaptis: Adaptive instance selection network. In Proceedings of the International Conference on Computer Vision, pp 7355–7363

Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In British Machine Vision Conference

Tan M, Le QV (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946

Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135

Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755

Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision 63(2):113–140

Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In German Conference on Pattern Recognition, pp 14–25

Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots

Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics

Valada A, Vertens J, Dhall A, Burgard W (2017) Adapnet: Adaptive semantic segmentation in adverse environmental conditions. In Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651

Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)

Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, special issue: Deep Learning for Robotic Vision

Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751

Wu Y, He K (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19

Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500

Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) Upsnet: A unified panoptic segmentation network. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826

Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349

Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) Deeperlab: Single-shot image parser. arXiv preprint arXiv:1902.05093

Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709

Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122

Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In European Conference on Computer Vision, Springer, pp 708–721

Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected mrfs. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677

Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890

Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227


are outside this region. In the end, we have two mask logits for each instance: one from the instance segmentation head and the other from the semantic segmentation head. We combine these two logits adaptively by computing the Hadamard product of the sum of the sigmoid of ML_A and the sigmoid of ML_B, and the sum of ML_A and ML_B, to obtain the fused mask logits F_L of the instances, expressed as

F_L = (\sigma(ML_A) + \sigma(ML_B)) \odot (ML_A + ML_B),     (13)

where σ(·) is the sigmoid function and ⊙ is the Hadamard product. We then concatenate the fused mask logits of the object instances with the 'stuff' logits along the channel dimension to generate intermediate panoptic logits. Subsequently, we apply the argmax operation along the channel dimension to obtain the intermediate panoptic prediction. In the final step, we take a zero-filled canvas and first copy the instance-specific 'thing' predictions from the intermediate panoptic prediction. We then fill the empty parts of the canvas with 'stuff' class predictions by copying them from the predictions of the semantic head, while ignoring the classes that have an area smaller than a predefined threshold called the minimum stuff area. This then gives us the final panoptic segmentation output.

We fuse the ML_A and ML_B instance logits in the aforementioned manner due to the fact that if both logits for a given pixel conform with each other, the final instance score will increase proportionately to their agreement, or vice-versa. In case of agreement, the corresponding object instance will either dominate or be superseded by other instances as well as the 'stuff' class scores. Similarly, in case of disagreement, the score of the given object instance will reflect the extent of their difference. Simply put, the fused logit score is either adaptively attenuated or amplified according to the consensus. We evaluate the performance of our proposed panoptic fusion module in comparison to other existing methods in the ablation study presented in Section 4.4.5.
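To make this procedure concrete, the following is a minimal PyTorch sketch of the adaptive fusion and the canvas filling, assuming the per-instance mask logits ML_A (instance head) and ML_B (semantic head) and the 'stuff' logits are already available as aligned tensors; the tensor layout, the instance-ID encoding and the void initialization of the canvas are our own illustrative choices, not the released implementation.

```python
import torch

def fuse_instance_logits(ml_a, ml_b):
    # Adaptive fusion of Eq. (13): (sigmoid(ML_A) + sigmoid(ML_B)) * (ML_A + ML_B)
    return (torch.sigmoid(ml_a) + torch.sigmoid(ml_b)) * (ml_a + ml_b)

def panoptic_canvas(fused_inst_logits, stuff_logits, min_stuff_area=2048):
    # fused_inst_logits: [N_inst, H, W], stuff_logits: [N_stuff, H, W]
    num_stuff = stuff_logits.shape[0]
    inter = torch.cat([stuff_logits, fused_inst_logits], dim=0)   # intermediate panoptic logits
    inter_pred = torch.argmax(inter, dim=0)                       # intermediate panoptic prediction
    # The paper uses a zero-filled canvas; -1 here only keeps "empty" distinct
    # from stuff class 0 in this sketch.
    canvas = torch.full_like(inter_pred, -1)
    # First copy the instance-specific 'thing' predictions (hypothetical ID encoding).
    for inst_id in range(fused_inst_logits.shape[0]):
        canvas[inter_pred == num_stuff + inst_id] = 1000 + inst_id
    # Then fill empty parts with 'stuff' predictions above the minimum stuff area.
    for cls in range(num_stuff):
        mask = (inter_pred == cls) & (canvas == -1)
        if mask.sum() >= min_stuff_area:
            canvas[mask] = cls
    return canvas
```

The sigmoid-weighted sum realizes the attenuation and amplification described above: pixels where both heads agree receive a larger fused logit than either head alone, and pixels where they disagree are suppressed.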

4 Experimental Results

In this section, we first describe the standard evaluation metrics that we adopt for empirical evaluations, followed by brief descriptions of the datasets that we benchmark on in Section 4.1. We then present extensive quantitative comparisons and benchmarking results in Section 4.3, and detailed ablation studies on the various proposed architectural components in Section 4.4. Finally, we present qualitative comparisons and visualizations of panoptic segmentation on each of the datasets that we evaluate on in Section 4.5 and Section 4.6, respectively.

We use PyTorch (Paszke et al 2019) for implementing all our architectures, and we trained our models on a system with an Intel Xeon@2.20GHz processor and NVIDIA TITAN X GPUs. We use the standard Panoptic Quality (PQ) metric (Kirillov et al 2019) for quantifying the performance of our models. The PQ metric is computed as

PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|},     (14)

where TP, FP, FN and IoU are the true positives, false positives, false negatives and the intersection-over-union. The IoU is computed as IoU = TP/(TP+FP+FN). We also report the Segmentation Quality (SQ) and Recognition Quality (RQ) metrics computed as

SQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP|},     (15)

RQ = \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}.     (16)

Following the standard benchmarking criteria for panoptic segmentation, we report PQ, SQ and RQ over all the classes in the dataset, and we also report them for the 'stuff' classes (PQSt, SQSt, RQSt) and the 'thing' classes (PQTh, SQTh, RQTh). Additionally, for the sake of completeness, we report the Average Precision (AP) and mean Intersection-over-Union (mIoU) for both 'stuff' and 'thing' classes, as well as the inference time and FLOPs for comparisons. The implementation of our proposed EfficientPS model and a live demo on various datasets is publicly available at https://rl.uni-freiburg.de/research/panoptic.
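As a small worked example, the following Python snippet evaluates these metrics from already-matched segments (a pair counts as a true positive when its IoU exceeds 0.5, following Kirillov et al (2019)); the function and variable names are ours.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """tp_ious: IoU values of matched (predicted, groundtruth) segment pairs."""
    tp = len(tp_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / tp                            # Eq. (15)
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)      # Eq. (16)
    pq = sq * rq                                      # Eq. (14), since PQ = SQ * RQ
    return pq, sq, rq

# Example: 3 matched segments, 1 false positive, 2 false negatives
print(panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=2))  # PQ = 0.8 * (3/4.5) = 0.533
```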

4.1 Datasets

We benchmark our proposed EfficientPS for panoptic segmentation on four challenging urban scene understanding datasets, namely Cityscapes (Cordts et al 2016), KITTI (Geiger et al 2013), Mapillary Vistas (Neuhold et al 2017) and the Indian Driving Dataset (Varma et al 2019). The KITTI benchmark does not provide panoptic annotations, therefore to facilitate this work we publicly release manually annotated panoptic groundtruth segmentation labels for the popular KITTI benchmark. These four diverse datasets contain images that range from congested city driving scenarios to rural scenes and highways. They also contain scenes in challenging perceptual conditions including snow, motion blur and other seasonal visual changes. We briefly describe the characteristics of these datasets in this section.

Cityscapes: The Cityscapes dataset (Cordts et al 2016) consists of urban street scenes and focuses on semantic understanding of common driving scenarios. It is one of the most challenging datasets for panoptic segmentation due to its sheer diversity, as it covers scenes from over 50 European cities recorded over several seasons such as spring, summer and fall. The presence of a large number of dynamic objects further adds to its complexity. Figure 5 (a) shows an



Figure 5 Example images from the challenging urban scene understanding datasets that we benchmark on, namely Cityscapes, KITTI, Mapillary Vistas and the Indian Driving Dataset (IDD), shown alongside their panoptic groundtruth. The images show cluttered urban scenes with many dynamic objects, occluded objects, perpetual snowy conditions and unstructured environments.

example image and the corresponding panoptic groundtruth annotation from the Cityscapes dataset. As we see from this example, the scenes are extremely cluttered, with many dynamic objects such as pedestrians and cyclists that are often grouped near one another or partially occluded. These factors make panoptic segmentation, especially segmenting the 'thing' classes, exceedingly challenging.

The widely used Cityscapes dataset recently introduced a benchmark for the task of panoptic segmentation. The dataset contains pixel-level annotations for 19 object classes, of which 11 are 'stuff' classes and 8 are instance-specific 'thing' classes. It consists of 5000 finely annotated images and 20000 coarsely annotated images that were captured at a resolution of 2048×1024 pixels using an automotive-grade 22 cm baseline stereo camera. The finely annotated images are divided into 2975 for training, 500 for validation and 1525 for testing. The annotations for the test set are not publicly released; they are rather only available to the online evaluation server that automatically computes the metrics and publishes the results. We report the performance of our proposed EfficientPS on both the validation set as well as the test set. We also use the Cityscapes dataset for evaluating the improvement due to the various architectural contributions that we make in the ablation study. We report results on the validation set for our model trained only on the fine annotations, and we report the results on the test set from the benchmarking server for our model trained on both the fine and coarse annotations.

KITTI: The KITTI vision benchmark suite (Geiger et al 2013) is one of the most comprehensive datasets that provides groundtruth for a variety of tasks such as semantic segmentation, scene flow estimation, optical flow estimation, depth prediction, odometry estimation, tracking and road lane detection. However, it still has not expanded its annotations to support the recently introduced panoptic segmentation task. The challenging nature of the KITTI scenes and its potential for benchmarking multi-task learning problems make extending this dataset to include panoptic annotations of great interest to the community. Therefore, in this work we introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI vision benchmark suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set, therefore in addition to panoptic segmentation, they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually.

Our dataset consists of a total of 1055 images, out of which 855 are used for the training set and 200 are used for the validation set. We provide annotations for 11 'stuff' classes and 8 'thing' classes, adhering to the Cityscapes 'stuff' and 'thing' class distribution. In order to create the panoptic annotations, we gathered semantic annotations from community-driven extensions of KITTI (Xu et al 2016; Ros et al 2015) and combined them with the 200 training images from the KITTI semantic training set. We then manually annotated all the images with instance masks, followed by merging them with the semantic annotations to generate the panoptic segmentation groundtruth labels. The images in our KITTI panoptic segmentation dataset are at a resolution of 1280×384 pixels and contain scenes from both residential and inner city scenarios. Figure 5 (b) shows an example image from the KITTI panoptic segmentation dataset and its corresponding panoptic segmentation labels. We observe that the car denoted in teal color pixels and the van are both partially occluded by other 'stuff' classes, such that an object instance is split into two disjoint components. We find that scenarios such as these are extremely challenging for the task of panoptic segmentation, as the disjoint object mask has to be assigned to the same instance ID. We hope that this dataset encourages innovative solutions to such real-world problems that are uncommon in other datasets and also accelerates research in multi-task learning for urban scene understanding.

Mapillary Vistas: Mapillary Vistas (Neuhold et al 2017) is one of the largest publicly available street-level imagery datasets that contains pixel-accurate and instance-specific semantic annotations. The novel aspects of this dataset include


diverse scenes from over six continents and in a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18000 images for training, 2000 images for validation and 5000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that due to the snowy conditions, recognizing distant objects such as the car in this example becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset: The Indian Driving Dataset (IDD) (Varma et al 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructure such as lanes and sidewalks. It has a significantly larger number of 'thing' instances in each scene compared to other datasets, and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car, and the data was gathered in two Indian cities as well as in their outskirts. IDD consists of a total of 10003 images, where 6993 are used for training, 981 for validation and 2029 for testing. The images are at a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3, therefore we report the results of our model on the same to facilitate comparison. This level comprises a total of 26 classes, out of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset is shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels. We take crops from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky et al 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio 2010) for the other layers, zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set TH = 0.7, TL = 0.3 and TN = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of ct = 0.5, an overlap threshold of ot = 0.5 and a minimum stuff area of minsa = 2048.
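For readability, the hyperparameters listed above collected into a single configuration; the dictionary keys are our own naming, not the released configuration format.

```python
HYPERPARAMS = {
    "crop_sizes": [(1024, 2048), (1024, 1024), (384, 1280), (720, 1280)],
    "scale_range": (0.5, 2.0),          # random scaling for augmentation
    "leaky_relu_slope": 0.01,
    "instance_head": {"TH": 0.7, "TL": 0.3, "TN": 0.5},
    "panoptic_fusion": {"ct": 0.5, "ot": 0.5, "min_stuff_area": 2048},
}
```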

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone, and continue training until convergence. We denote the base learning rate lr_base, the milestones and the total number of iterations ti for each dataset in the following format: {lr_base, milestone, milestone, ti}. The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are {0.07, 32K, 44K, 50K}, {0.07, 144K, 176K, 192K}, {0.07, 16K, 22K, 25K} and {0.07, 108K, 130K, 144K} respectively. At the beginning of the training, we have a warm-up phase where the learning rate is increased linearly from 1/3 · lr_base to lr_base in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^-4. The final loss L_total that we optimize is computed as

L_{total} = L_{semantic} + L_{instance},     (17)

where L_semantic and L_instance are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA Titan X GPUs, where each GPU tends to a single image.
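A minimal sketch of this optimization setup for the Cityscapes schedule, assuming a `model` that returns the total loss and an iterable `train_loader`; the exact warm-up and milestone handling in the released code may differ in detail.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.07, momentum=0.9)

def lr_at(step, base_lr=0.07, milestones=(32000, 44000), warmup_steps=200):
    # Linear warm-up from base_lr/3 to base_lr, then drop by 10x at each milestone.
    if step < warmup_steps:
        return base_lr / 3 + (base_lr - base_lr / 3) * step / warmup_steps
    return base_lr * (0.1 ** sum(step >= m for m in milestones))

for step, batch in enumerate(train_loader):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    loss = model(batch)           # returns L_total = L_semantic + L_instance, Eq. (17)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```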

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts, while for KITTI and IDD, we report results for the models that we trained using the official implementations that have been publicly released by the authors, after further tuning of hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets, and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that at the time of submission, only the Cityscapes benchmark has the provision to evaluate the results


Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. − denotes that the metric has not been reported for the corresponding method. All metric values are in %.

Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU

Single-scale:
WeaklySupervised | − | 47.3 | − | − | 39.6 | − | − | 52.9 | − | − | 24.3 | 71.6
TASCNet | − | 55.9 | − | − | 50.5 | − | − | 59.8 | − | − | − | −
Panoptic FPN | − | 58.1 | − | − | 52.0 | − | − | 62.5 | − | − | 33.0 | 75.7
AUNet | − | 59.0 | − | − | 54.8 | − | − | 62.1 | − | − | 34.4 | 75.6
UPSNet | − | 59.3 | 79.7 | 73.0 | 54.6 | 79.3 | 68.7 | 62.7 | 80.1 | 76.2 | 33.3 | 75.2
DeeperLab | − | 56.3 | − | − | − | − | − | − | − | − | − | −
Seamless | − | 60.3 | − | − | 56.1 | − | − | 63.3 | − | − | 33.6 | 77.5
SSAP | − | 61.1 | − | − | 55.0 | − | − | − | − | − | − | −
AdaptIS | − | 62.0 | − | − | 58.7 | − | − | 64.4 | − | − | 36.3 | 79.2
Panoptic-DeepLab | − | 63.0 | − | − | − | − | − | − | − | − | 35.3 | 80.5
EfficientPS (ours) | − | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3
TASCNet | COCO | 59.3 | − | − | 56.0 | − | − | 61.5 | − | − | 37.6 | 78.1
UPSNet | COCO | 60.5 | 80.9 | 73.5 | 57.0 | − | − | 63.0 | − | − | 37.8 | 77.8
Seamless | Vistas | 65.0 | − | − | 60.7 | − | − | 68.0 | − | − | − | 80.7
Panoptic-Deeplab | Vistas | 65.3 | − | − | − | − | − | − | − | − | 38.8 | 82.5
EfficientPS (ours) | Vistas | 66.1 | 82.5 | 78.9 | 62.7 | 81.9 | 75.2 | 68.5 | 82.9 | 81.6 | 41.9 | 81.0

Multi-scale:
Panoptic-DeepLab | − | 64.1 | − | − | − | − | − | − | − | − | 38.5 | 81.5
EfficientPS (ours) | − | 65.1 | 82.2 | 79.0 | 61.5 | 81.4 | 75.4 | 67.7 | 82.8 | 81.7 | 39.7 | 80.3
TASCNet | COCO | 60.4 | − | − | 56.1 | − | − | 63.3 | − | − | 39.1 | 78.7
M-RCNN + PSPNet | COCO | 61.2 | 80.9 | 74.4 | 54.0 | − | − | 66.4 | − | − | 36.4 | 80.9
UPSNet | COCO | 61.8 | 81.3 | 74.8 | 57.6 | 77.7 | 70.5 | 64.8 | 81.4 | − | 39.0 | 79.2
Panoptic-Deeplab | Vistas | 67.0 | − | − | − | − | − | − | − | − | 42.5 | 83.1
EfficientPS (ours) | Vistas | 67.5 | 83.2 | 80.2 | 63.5 | 82.2 | 77.2 | 70.4 | 83.9 | 82.4 | 43.8 | 82.1

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Network | Pre-training | PQ | SQ | RQ | PQTh | PQSt
SSAP | − | 58.9 | 82.4 | 70.6 | 48.4 | 66.5
TASCNet | COCO | 60.7 | 81.0 | 73.8 | 53.4 | 66.0
Panoptic-Deeplab | − | 62.3 | 82.4 | 74.8 | 52.1 | 69.7
Seamless | Vistas | 62.6 | 82.1 | 75.3 | 56.0 | 67.5
Panoptic-Deeplab | Vistas | 65.5 | 77.8 | 75.2 | 56.5 | 72.0
EfficientPS (ours) | − | 64.1 | 82.6 | 76.8 | 56.7 | 69.4
EfficientPS (ours) | Vistas | 66.4 | 83.5 | 78.8 | 59.3 | 71.5

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures.

Network | Input Size (pixels) | Params (M) | FLOPs (B) | Time (ms)
DeeperLab | 1025×2049 | − | − | 463
UPSNet | 1024×2048 | 45.05 | 487.02 | 202
Seamless | 1024×2048 | 51.43 | 514.00 | 168
Panoptic-Deeplab | 1025×2049 | 46.73 | 547.49 | 175
EfficientPS (ours) | 1024×2048 | 40.89 | 433.94 | 166

on the test set. On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75, 2) during the multi-scale evaluations.
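A hedged sketch of how such a multi-scale evaluation of the semantic logits can be realized, assuming a `model` callable that returns per-class logits; averaging softmax probabilities over flips and scales is one common choice and may differ in detail from the released evaluation code.

```python
import torch
import torch.nn.functional as F

def multiscale_semantic_logits(model, image, scales=(0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    # image: [1, 3, H, W]; accumulate softmax probabilities over scales and horizontal flips.
    _, _, h, w = image.shape
    probs = 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = model(inp)                                    # [1, C, h', w']
            if flip:
                logits = torch.flip(logits, dims=[3])
            logits = F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
            probs = probs + torch.softmax(logits, dim=1)
    return probs / (2 * len(scales))
```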

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al 2018b), TASCNet (Li et al 2018a), Panoptic FPN (Kirillov et al 2019), AUNet (Li et al 2019b), UPSNet (Xiong et al 2019), DeeperLab (Yang et al 2019), Seamless (Porzi et al 2019), SSAP (Gao et al 2019), AdaptIS (Sofiiuk et al 2019) and Panoptic-DeepLab (Cheng et al 2019). Table 1 shows the results on the Cityscapes validation set. For a fair comparison, we categorize models in the table separately according to those that report


Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. − denotes that the metric has not been reported for the corresponding method. All metric values are in %.

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU

Single-scale:
JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | −
DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3
TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | −
AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | −
Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4
Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3
EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6

Multi-scale:
TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | −
Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8
EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1

single-scale and multi-scale evaluation, as well as without any pre-training and with pre-training on other datasets, namely Mapillary Vistas (Neuhold et al 2017), denoted as Vistas, and Microsoft COCO (Lin et al 2014), abbreviated as COCO. We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations, depth data, or exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-Deeplab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-Deeplab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-Deeplab in both single-scale evaluation and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-Deeplab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-Deeplab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQSt, PQTh, SQ and RQ metrics, and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS, without pre-training on any extra data, achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-Deeplab trained only using Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-Deeplab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime including the time consumed for fusing the semantic and instance predictions that yields the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as the presence of 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-the-art


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU

Single-scale:
Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1
UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6
Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8
EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3

Multi-scale:
Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8
UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2
Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1
EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU

Single-scale:
Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1
UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4
Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6
EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3

Multi-scale:
Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1
UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6
Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3
EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1

Panoptic-DeepLab. Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-Deeplab in the PQSt score, primarily due to the output stride of 16 at which it operates, which increases the computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQSt of Panoptic-Deeplab. On the other hand, top-down approaches tend to have a better instance segmentation ability, as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQSt score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '−' refers to the standard configuration as in Kirillov et al (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Model | Encoder | 2-way FPN | SIH | SH | PFM | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
M1 | ResNet-50 | − | − | − | − | 58.2 | 79.1 | 72.0 | 52.4 | 78.8 | 67.2 | 62.4 | 79.4 | 75.6 | 31.6 | 74.3
M2 | ResNet-50 | − | − | − | X | 58.8 | 79.5 | 72.6 | 53.4 | 79.2 | 62.8 | 67.9 | 79.7 | 76.1 | 33.8 | 75.1
M3 | ResNet-50 | − | X | − | X | 58.6 | 79.4 | 72.4 | 53.1 | 79.1 | 67.5 | 62.6 | 79.6 | 75.9 | 33.7 | 75.0
M4 | EfficientNet-B5 | − | X | − | X | 59.7 | 79.9 | 73.3 | 54.7 | 76.6 | 68.1 | 63.3 | 80.3 | 79.5 | 34.1 | 76.3
M5 | EfficientNet-B5 | X | X | − | X | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
M6 | EfficientNet-B5 | X | X | X | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and its comparison with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined at the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the same network configuration and panoptic fusion heuristics as Kirillov et al (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network is comprised of an upsampling stage which has a 3×3 convolution, group norm (Wu and He 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.
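For reference, a simplified sketch of this baseline upsampling head, assuming 256-channel FPN features and 19 Cityscapes classes; layer names are ours, and P4 is passed through unchanged here, which is a simplification of the description above.

```python
import torch.nn as nn
import torch.nn.functional as F

class UpsampleStage(nn.Module):
    # One 3x3 conv + GroupNorm + ReLU + x2 bilinear upsampling, as in the M1 baseline head.
    def __init__(self, channels=256, groups=32):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.norm = nn.GroupNorm(groups, channels)

    def forward(self, x):
        x = F.relu(self.norm(self.conv(x)))
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

class BaselineSemanticHead(nn.Module):
    # P32, P16 and P8 are upsampled 3x, 2x and 1x respectively; the summed features
    # are projected to class logits at 1/4 scale and upsampled x4 to the input size.
    def __init__(self, channels=256, num_classes=19):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.Sequential(*[UpsampleStage(channels) for _ in range(n)]) for n in (3, 2, 1, 0)]
        )
        self.classifier = nn.Conv2d(channels, num_classes, 1)

    def forward(self, feats):  # feats = [P32, P16, P8, P4]
        fused = sum(stage(f) for stage, f in zip(self.stages, feats))
        logits = self.classifier(fused)  # softmax over these logits yields the segmentation
        return F.interpolate(logits, scale_factor=4, mode="bilinear", align_corners=False)
```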

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score, without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M with a drop of 0.2% in PQ and a 0.1% drop in AP and the mIoU score. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and Leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively, enabling it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, it cannot be solely attributed to the semantic head. This is due to the fact that if we employ standard panoptic fusion heuristics, an improvement in semantic


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Encoder | Params (M) | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2
ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1
ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2
Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1
ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9
Mod. EfficientNet-B5 (Ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Architecture | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3
Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2
PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8
2-way FPN (Ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

segmentation would only contribute to an increase in the PQSt score. However, our proposed adaptive panoptic fusion yields an improvement in PQTh as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al 2019), ResNet-50 (He et al 2016), ResNet-101 (He et al 2016), Xception-71 (Chollet 2017), ResNeXt-101 (Xie et al 2017) and EfficientNet-B5 (Tan and Le 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have a comparable or fewer number of parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al 2017), bottom-up FPN and PANet FPN variants. We refer to the FPN architecture described in Liu et al (2018) as PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. Both these models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the ×os pyramid scale level, c^k_f refers to a convolution layer with f filters and a k×k kernel size, LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
M61 | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | − | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2
M62 | LSFE | LSFE | LSFE | LSFE | − | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4
M63 | DPC | LSFE | LSFE | LSFE | − | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1
M64 | DPC | DPC | LSFE | LSFE | − | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6
M65 | DPC | DPC | DPC | LSFE | − | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2
M66 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

the standard FPN in a sequential or parallel manner respectively. We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they should both be captured distinctly by the semantic head. Second, while fusing small and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output which is 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output which is the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model, therefore we employ separable convolutions in all the experiments that follow.
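A minimal sketch of the LSFE module under these choices; plain batch normalization stands in for iABN sync here, and the channel widths are assumptions taken from the 2-way FPN output and Table 10.

```python
import torch.nn as nn

class SeparableConv2d(nn.Module):
    # 3x3 depthwise convolution followed by a 1x1 pointwise convolution.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class LSFE(nn.Module):
    # Large Scale Feature Extractor: two cascaded 3x3 separable convolutions,
    # each followed by normalization and leaky ReLU (the paper uses iABN sync).
    def __init__(self, in_ch=256, out_ch=128):
        super().__init__()
        self.block = nn.Sequential(
            SeparableConv2d(in_ch, out_ch), nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.01),
            SeparableConv2d(out_ch, out_ch), nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.01),
        )

    def forward(self, x):
        return self.block(x)
```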

In the M63 model, we replace the LSFE module in the P32 level of the 2-way FPN with dense prediction cells (DPC), described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module in the P16 level with DPC, and in the subsequent M65 model we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with the fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, resulting in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al (2019), which we denote as the baseline, UPSNet (Xiong et al 2019) and Seamless (Porzi et al 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments, while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Semantic Head | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1
Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model      PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt
           (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)

Baseline   62.4  80.8  75.4  58.7  80.4  72.6  65.1  81.1  77.4
TASCNet    62.5  80.9  75.6  58.6  80.5  72.8  65.3  81.2  77.7
UPSNet     63.1  81.3  76.1  59.5  80.6  73.2  65.7  81.8  78.2
Ours       63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as a baseline as it is extensively used in several panoptic segmentation networks. We then compare with Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in (Xiong et al., 2019), which we refer to as TASCNet and UPSNet in the results, respectively. Once again, for a fair comparison, we keep all the other network components the same across different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and instance head that have an inherent overlap is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixel to consider from the instance segmentation output and which pixel to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both the heads by adding them, and therefore it relies on the mode computed over the segmentation output to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both the heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.
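As an illustration of the kind of confidence-based fusion discussed above, the following sketch scales the summed instance and semantic logits of a given class by their combined sigmoid confidences, so that each head contributes in proportion to its certainty at every pixel. This is a hedged approximation of the adaptive fusion idea for illustration only, not necessarily the exact formulation used in our panoptic fusion module, which is described earlier in the paper together with the thresholds ct, ot and min_sa.

import torch

def fuse_logits(instance_logit, semantic_logit):
    """Confidence-weighted fusion sketch: the summed logits of the instance mask and
    the semantic map of the corresponding class are scaled by the sum of their
    sigmoid confidences, so a confident head dominates at each pixel."""
    return (torch.sigmoid(instance_logit) + torch.sigmoid(semantic_logit)) * \
           (instance_logit + semantic_logit)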

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model, Panoptic-DeepLab, does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding


[Figure 6: rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD; columns show the input image, the Seamless output, the EfficientPS output, and the improvement/error map.]

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


datasets. For each example, we show the input image, the corresponding panoptic segmentation output from the Seamless model, and that of our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN that effectively aggregates multi-scale features to learn semantically richer representations, and to the panoptic fusion module that addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c), the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately address this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both the examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head that correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearances of these two objects; however, our proposed EfficientPS model with our novel panoptic backbone has an extensive representational capacity which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of


[Figure 7: panels (i)-(iv) per row; rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD.]

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances at multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk, as well as seasonal changes.


traffic participants. These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from viewpoints that are uncommon compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets that demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analysis and visualizations that highlight the improvements that we make to various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P Pont-Tuset J Barron JT Marques F Malik J (2014) Multiscale combinatorial grouping In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 328-335

Badrinarayanan V Kendall A Cipolla R (2015) Segnet: A deep convolutional encoder-decoder architecture for image segmentation arXiv preprint arXiv:1511.00561

Bai M Urtasun R (2017) Deep watershed transform for instance segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 5221-5229

Bremner JG Slater A (2008) Theories of infant development John Wiley & Sons

Brostow GJ Shotton J Fauqueur J Cipolla R (2008) Segmentation and recognition using structure from motion point clouds In European conference on computer vision Springer pp 44-57

Bulo SR Neuhold G Kontschieder P (2017) Loss max-pooling for semantic image segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 7082-7091

Chen LC Papandreou G Kokkinos I Murphy K Yuille AL (2017a) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected crfs IEEE transactions on pattern analysis and machine intelligence 40(4):834-848

Chen LC Papandreou G Schroff F Adam H (2017b) Rethinking atrous convolution for semantic image segmentation arXiv preprint arXiv:1706.05587

Chen LC Collins M Zhu Y Papandreou G Zoph B Schroff F Adam H Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction In Advances in Neural Information Processing Systems pp 8713-8724

Chen LC Zhu Y Papandreou G Schroff F Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation arXiv preprint arXiv:1802.02611

Cheng B Collins MD Zhu Y Liu T Huang TS Adam H Chen LC (2019) Panoptic-deeplab: A simple, strong and fast baseline for bottom-up panoptic segmentation arXiv preprint arXiv:1911.10194

Chollet F (2017) Xception: Deep learning with depthwise separable convolutions In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 1251-1258

Cordts M Omran M Ramos S Rehfeld T Enzweiler M Benenson R Franke U Roth S Schiele B (2016) The cityscapes dataset for semantic urban scene understanding In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 3213-3223

Dai J He K Li Y Ren S Sun J (2016) Instance-sensitive fully convolutional networks In European Conference on Computer Vision pp 534-549

Dai J Qi H Xiong Y Li Y Zhang G Hu H Wei Y (2017) Deformable convolutional networks In Proceedings of the International Conference on Computer Vision pp 764-773

Gao N Shan Y Wang Y Zhao X Yu Y Yang M Huang K (2019) Ssap: Single-shot instance segmentation with affinity pyramid In Proceedings of the International Conference on Computer Vision pp 642-651

Geiger A Lenz P Stiller C Urtasun R (2013) Vision meets robotics: The kitti dataset International Journal of Robotics Research

de Geus D Meletis P Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network arXiv preprint arXiv:1809.02110

Girshick R (2015) Fast r-cnn In Proceedings of the International Conference on Computer Vision pp 1440-1448

Glorot X Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks In Proceedings of the thirteenth international conference on artificial intelligence and statistics pp 249-256

Hariharan B Arbelaez P Girshick R Malik J (2014) Simultaneous detection and segmentation In European Conference on Computer Vision pp 297-312

Hariharan B Arbelaez P Girshick R Malik J (2015) Hypercolumns for object segmentation and fine-grained localization In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 447-456

He K Zhang X Ren S Sun J (2016) Deep residual learning for image recognition In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 770-778

He K Gkioxari G Dollar P Girshick R (2017) Mask r-cnn In Proceedings of the International Conference on Computer Vision pp 2961-2969

He X Gould S (2014a) An exemplar-based crf for multi-instance object segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition

He X Gould S (2014b) An exemplar-based crf for multi-instance object segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 296-303

Howard A Sandler M Chu G Chen LC Chen B Tan M Wang W Zhu Y Pang R Vasudevan V et al (2019) Searching for mobilenetv3 In Proceedings of the International Conference on Computer Vision pp 1314-1324

Hu J Shen L Sun G (2018) Squeeze-and-excitation networks In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 7132-7141

Ioffe S Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift arXiv preprint arXiv:1502.03167

Kaiser L Gomez AN Chollet F (2017) Depthwise separable convolutions for neural machine translation arXiv preprint arXiv:1706.03059

Kang BR Kim HY (2018) Bshapenet: Object detection and instance segmentation with bounding shape masks arXiv preprint arXiv:1810.10327

Kirillov A He K Girshick R Rother C Dollar P (2019) Panoptic segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 9404-9413

Kontschieder P Bulo SR Bischof H Pelillo M (2011) Structured class-labels in random forests for semantic image labelling In Proceedings of the International Conference on Computer Vision pp 2190-2197

Krahenbuhl P Koltun V (2011) Efficient inference in fully connected crfs with gaussian edge potentials In Advances in neural information processing systems pp 109-117

Li J Raventos A Bhargava A Tagawa T Gaidon A (2018a) Learning to fuse things and stuff arXiv preprint arXiv:1812.01192

Li Q Arnab A Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation In Proceedings of the European Conference on Computer Vision (ECCV) pp 102-118

Li X Zhang L You A Yang M Yang K Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks arXiv preprint arXiv:1909.07229

Li Y Qi H Dai J Ji X Wei Y (2017) Fully convolutional instance-aware semantic segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 2359-2367

Li Y Chen X Zhu Z Xie L Huang G Du D Wang X (2019b) Attention-guided unified network for panoptic segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 7026-7035

Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar P Zitnick CL (2014) Microsoft coco: Common objects in context In European conference on computer vision Springer pp 740-755

Lin TY Dollar P Girshick R He K Hariharan B Belongie S (2017) Feature pyramid networks for object detection In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 2117-2125

Liu H Peng C Yu C Wang J Liu X Yu G Jiang W (2019) An end-to-end network for panoptic segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 6172-6181

Liu S Jia J Fidler S Urtasun R (2017) Sgn: Sequential grouping networks for instance segmentation In Proceedings of the International Conference on Computer Vision pp 3496-3504

Liu S Qi L Qin H Shi J Jia J (2018) Path aggregation network for instance segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 8759-8768

Liu W Rabinovich A Berg AC (2015) Parsenet: Looking wider to see better arXiv preprint arXiv:1506.04579

Long J Shelhamer E Darrell T (2015) Fully convolutional networks for semantic segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition

Neuhold G Ollmann T Rota Bulo S Kontschieder P (2017) The mapillary vistas dataset for semantic understanding of street scenes In Proceedings of the International Conference on Computer Vision pp 4990-4999

Paszke A Gross S Massa F Lerer A Bradbury J Chanan G Killeen T Lin Z Gimelshein N Antiga L et al (2019) Pytorch: An imperative style, high-performance deep learning library In Advances in Neural Information Processing Systems pp 8024-8035


Pinheiro PO Collobert R Dollar P (2015) Learning to segment object candidates In Advances in Neural Information Processing Systems pp 1990-1998

Plath N Toussaint M Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification In Proceedings of the International Conference on Machine Learning pp 817-824

Porzi L Bulo SR Colovic A Kontschieder P (2019) Seamless scene segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 8277-8286

Radwan N Valada A Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing arXiv preprint arXiv:1808.06887

Ren M Zemel RS (2017) End-to-end instance segmentation with recurrent attention In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 6656-6664

Romera-Paredes B Torr PHS (2016) Recurrent instance segmentation In European conference on computer vision Springer pp 312-329

Ros G Ramos S Granados M Bakhtiary A Vazquez D Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving In IEEE Winter Conference on Applications of Computer Vision pp 231-238

Rota Bulo S Porzi L Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of dnns In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 5639-5647

Russakovsky O Deng J Su H Krause J Satheesh S Ma S Huang Z Karpathy A Khosla A Bernstein M et al (2015) Imagenet large scale visual recognition challenge International Journal of Computer Vision 115(3):211-252

Shotton J Johnson M Cipolla R (2008) Semantic texton forests for image categorization and segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition

Silberman N Sontag D Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss In European Conference on Computer Vision pp 616-631

Simonyan K Zisserman A (2014) Very deep convolutional networks for large-scale image recognition arXiv preprint arXiv:1409.1556

Sofiiuk K Barinova O Konushin A (2019) Adaptis: Adaptive instance selection network In Proceedings of the International Conference on Computer Vision pp 7355-7363

Sturgess P Alahari K Ladicky L Torr PH (2009) Combining appearance and structure from motion features for road scene understanding In British Machine Vision Conference

Tan M Le QV (2019) Efficientnet: Rethinking model scaling for convolutional neural networks arXiv preprint arXiv:1905.11946

Tian Z He T Shen C Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 3126-3135

Tighe J Niethammer M Lazebnik S (2014) Scene parsing with object instances and occlusion ordering In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 3748-3755

Tu Z Chen X Yuille AL Zhu SC (2005) Image parsing: Unifying segmentation, detection and recognition International Journal of Computer Vision 63(2):113-140

Uhrig J Cordts M Franke U Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling In German Conference on Pattern Recognition pp 14-25

Valada A Dhall A Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation In IEEE/RSJ International conference on intelligent robots and systems (IROS) workshop on state estimation and terrain perception for all terrain mobile robots

Valada A Oliveira G Brox T Burgard W (2016b) Towards robust semantic segmentation using deep fusion In Robotics: Science and Systems (RSS 2016) workshop, are the sceptics right? Limits and potentials of deep learning in robotics

Valada A Vertens J Dhall A Burgard W (2017) Adapnet: Adaptive semantic segmentation in adverse environmental conditions In Proceedings of the IEEE International Conference on Robotics and Automation pp 4644-4651

Valada A Radwan N Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression In Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)

Valada A Mohan R Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, special issue: Deep Learning for Robotic Vision

Varma G Subramanian A Namboodiri A Chandraker M Jawahar C (2019) Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments In IEEE Winter Conference on Applications of Computer Vision (WACV) pp 1743-1751

Wu Y He K (2018) Group normalization In Proceedings of the European Conference on Computer Vision (ECCV) pp 3-19

Xie S Girshick R Dollar P Tu Z He K (2017) Aggregated residual transformations for deep neural networks In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 1492-1500

Xiong Y Liao R Zhao H Hu R Bai M Yumer E Urtasun R (2019) Upsnet: A unified panoptic segmentation network In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 8818-8826

Xu P Davoine F Bordes JB Zhao H Denoeux T (2016) Multimodal information fusion for urban scene understanding Machine Vision and Applications 27(3):331-349

Yang TJ Collins MD Zhu Y Hwang JJ Liu T Zhang X Sze V Papandreou G Chen LC (2019) Deeperlab: Single-shot image parser arXiv preprint arXiv:1902.05093

Yao J Fidler S Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 702-709

Yu F Koltun V (2015) Multi-scale context aggregation by dilated convolutions arXiv preprint arXiv:1511.07122

Zhang C Wang L Yang R (2010) Semantic segmentation of urban scenes using dense depth maps In European Conference on Computer Vision Springer pp 708-721

Zhang Z Fidler S Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected mrfs In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 669-677

Zhao H Shi J Qi X Wang X Jia J (2017) Pyramid scene parsing network In Proceedings of the Conference on Computer Vision and Pattern Recognition pp 2881-2890

Zurn J Burgard W Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning arXiv preprint arXiv:1912.03227



[Figure 5: columns show the RGB image and the panoptic groundtruth; rows (a) Cityscapes, (b) KITTI, (c) Mapillary Vistas, (d) IDD.]

Figure 5 Example images from the challenging urban scene understanding datasets that we benchmark on, namely Cityscapes, KITTI, Mapillary Vistas and the Indian Driving Dataset (IDD). The images show cluttered urban scenes with many dynamic objects, occluded objects, perpetual snowy conditions, and unstructured environments.

example image and the corresponding panoptic groundtruth annotation from the Cityscapes dataset. As we see from this example, the scenes are extremely cluttered with many dynamic objects such as pedestrians and cyclists that are often grouped near one another or partially occluded. These factors make panoptic segmentation, especially segmenting the 'thing' classes, exceedingly challenging.

The widely used Cityscapes dataset recently introduced a benchmark for the task of panoptic segmentation. The dataset contains pixel-level annotations for 19 object classes, of which 11 are 'stuff' classes and 8 are instance-specific 'thing' classes. It consists of 5000 finely annotated images and 20000 coarsely annotated images that were captured at a resolution of 2048×1024 pixels using an automotive-grade 22 cm baseline stereo camera. The finely annotated images are divided into 2975 for training, 500 for validation and 1525 for testing. The annotations for the test set are not publicly released; they are rather only available to the online evaluation server that automatically computes the metrics and publishes the results. We report the performance of our proposed EfficientPS on both the validation set as well as the test set. We also use the Cityscapes dataset for evaluating the improvement due to the various architectural contributions that we make in the ablation study. We report results on the validation set for our model trained only on the fine annotations, and we report the results on the test set from the benchmarking server for our model trained on both the fine and coarse annotations.

KITTI: The KITTI vision benchmark suite (Geiger et al., 2013) is one of the most comprehensive datasets that provides groundtruth for a variety of tasks such as semantic segmentation, scene flow estimation, optical flow estimation, depth prediction, odometry estimation, tracking and road lane detection. However, it still has not expanded its annotations to support the recently introduced panoptic segmentation task. The challenging nature of the KITTI scenes and its potential for benchmarking multi-task learning problems makes extending this dataset to include panoptic annotations of great interest to the community. Therefore, in this work, we introduce the KITTI panoptic segmentation dataset for urban scene understanding that provides panoptic annotations for a subset of images from the KITTI vision benchmark suite. The annotations for the images that we provide do not intersect with the official KITTI semantic/instance segmentation test set; therefore, in addition to panoptic segmentation, they can also be used as supplementary training data for benchmarking semantic or instance segmentation tasks individually.

Our dataset consists of a total of 1055 images, out of which 855 are used for the training set and 200 are used for the validation set. We provide annotations for 11 'stuff' classes and 8 'thing' classes, adhering to the Cityscapes 'stuff' and 'thing' class distribution. In order to create panoptic annotations, we gathered semantic annotations from community-driven extensions of KITTI (Xu et al., 2016; Ros et al., 2015) and combined them with the 200 training images from the KITTI semantic training set. We then manually annotated all the images with instance masks, followed by merging them with the semantic annotations to generate the panoptic segmentation groundtruth labels. The images in our KITTI panoptic segmentation dataset are at a resolution of 1280×384 pixels and contain scenes from both residential and inner-city scenarios. Figure 5 (b) shows an example image from the KITTI panoptic segmentation dataset and its corresponding panoptic segmentation labels. We observe that the car denoted in teal color pixels and the van are both partially occluded by other 'stuff' classes such that they cause an object instance to be disjoint into two components. We find that scenarios such as these are extremely challenging for the task of panoptic segmentation, as the disjoint object mask has to be assigned to the same instance ID. We hope that this dataset encourages innovative solutions to such real-world problems that are uncommon in other datasets and also accelerates research in multi-task learning for urban scene understanding.
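The merging step can be illustrated with a short sketch: starting from the semantic annotation, each manually drawn instance mask overwrites the corresponding pixels with a unique instance label. The class_id * label_divisor + instance_id encoding used here is a common convention assumed for illustration, not necessarily the exact label format of our dataset.

import numpy as np

def merge_to_panoptic(semantic, instance_masks, instance_classes, label_divisor=1000):
    """Merge a per-pixel semantic map with a list of binary instance masks into a
    single panoptic label map. 'Stuff' pixels keep their semantic class id, while
    each 'thing' instance is encoded as class_id * label_divisor + instance_id
    (the label_divisor convention is an assumption for illustration)."""
    panoptic = semantic.astype(np.int64).copy()
    for inst_id, (mask, cls) in enumerate(zip(instance_masks, instance_classes), start=1):
        panoptic[mask.astype(bool)] = cls * label_divisor + inst_id
    return panoptic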

Mapillary Vistas: Mapillary Vistas (Neuhold et al., 2017) is one of the largest publicly available street-level imagery datasets that contains pixel-accurate and instance-specific semantic annotations. The novel aspects of this dataset include


diverse scenes from over six continents and in a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18000 images for training, 2000 images for validation and 5000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that due to the snowy condition, recognizing distant objects such as the car in this example becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset: The Indian Driving Dataset (IDD) (Varma et al., 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructures such as lanes and sidewalks. It has a significantly larger number of 'thing' instances in each scene compared to other datasets, and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car, and the data was gathered in two Indian cities as well as in their outskirts. IDD consists of a total of 10003 images, where 6993 are used for training, 981 for validation and 2029 for testing. The images are at a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3; therefore, we report the results of our model on the same to facilitate comparison. This level comprises a total of 26 classes, out of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset is shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels. We take crops from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky et al., 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio, 2010) for the other layers, zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set TH = 0.7, TL = 0.3 and TN = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of ct = 0.5, an overlap threshold of ot = 0.5 and a minimum stuff area of min_sa = 2048.

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone, and continue training until convergence. We denote the base learning rate lr_base, the milestones and the total number of iterations t_i for each dataset in the following format: {lr_base, milestone, milestone, t_i}. The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are {0.07, 32K, 44K, 50K}, {0.07, 144K, 176K, 192K}, {0.07, 16K, 22K, 25K} and {0.07, 108K, 130K, 144K} respectively. At the beginning of the training, we have a warm-up phase where the learning rate is increased linearly from (1/3)·lr_base to lr_base in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^-4. The final loss L_total that we optimize is computed as

L_total = L_semantic + L_instance,   (17)

where L_semantic and L_instance are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA Titan X GPUs, where each GPU tends to a single image.
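The schedule described above can be expressed compactly. The following sketch reproduces the warm-up and multi-step decay for the Cityscapes configuration (lr_base = 0.07, milestones at 32K and 44K iterations); the LambdaLR usage shown in the comments is one possible way to plug this into an optimizer and is an assumption, not our exact training code.

def lr_at_iteration(it, lr_base=0.07, milestones=(32000, 44000), warmup_iters=200):
    """Multi-step schedule with linear warm-up: the learning rate rises linearly from
    lr_base / 3 to lr_base over the first 200 iterations and is then divided by 10
    at each milestone."""
    if it < warmup_iters:
        alpha = it / warmup_iters
        return lr_base * ((1.0 / 3.0) * (1.0 - alpha) + alpha)
    drops = sum(it >= m for m in milestones)
    return lr_base * (0.1 ** drops)

# Hypothetical usage with a PyTorch SGD optimizer (momentum 0.9):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.07, momentum=0.9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda it: lr_at_iteration(it) / 0.07)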

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts, while for KITTI and IDD we report results for the models that we trained using the official implementations that have been publicly released by the authors, after further tuning of hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets, and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that at the time of submission, only the Cityscapes benchmark has the provision to evaluate the results


Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. − denotes that the metric has not been reported for the corresponding method.

Mode: Single-Scale
Network              Pre-training  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
                                   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)
WeaklySupervised     −             47.3  −     −     39.6  −     −     52.9  −     −     24.3  71.6
TASCNet              −             55.9  −     −     50.5  −     −     59.8  −     −     −     −
Panoptic FPN         −             58.1  −     −     52.0  −     −     62.5  −     −     33.0  75.7
AUNet                −             59.0  −     −     54.8  −     −     62.1  −     −     34.4  75.6
UPSNet               −             59.3  79.7  73.0  54.6  79.3  68.7  62.7  80.1  76.2  33.3  75.2
DeeperLab            −             56.3  −     −     −     −     −     −     −     −     −     −
Seamless             −             60.3  −     −     56.1  −     −     63.3  −     −     33.6  77.5
SSAP                 −             61.1  −     −     55.0  −     −     −     −     −     −     −
AdaptIS              −             62.0  −     −     58.7  −     −     64.4  −     −     36.3  79.2
Panoptic-DeepLab     −             63.0  −     −     −     −     −     −     −     −     35.3  80.5
EfficientPS (ours)   −             63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3
TASCNet              COCO          59.3  −     −     56    −     −     61.5  −     −     37.6  78.1
UPSNet               COCO          60.5  80.9  73.5  57.0  −     −     63.0  −     −     37.8  77.8
Seamless             Vistas        65.0  −     −     60.7  −     −     68.0  −     −     −     80.7
Panoptic-DeepLab     Vistas        65.3  −     −     −     −     −     −     −     −     38.8  82.5
EfficientPS (ours)   Vistas        66.1  82.5  78.9  62.7  81.9  75.2  68.5  82.9  81.6  41.9  81.0

Mode: Multi-Scale
Panoptic-DeepLab     −             64.1  −     −     −     −     −     −     −     −     38.5  81.5
EfficientPS (ours)   −             65.1  82.2  79.0  61.5  81.4  75.4  67.7  82.8  81.7  39.7  80.3
TASCNet              COCO          60.4  −     −     56.1  −     −     63.3  −     −     39.1  78.7
M-RCNN + PSPNet      COCO          61.2  80.9  74.4  54.0  −     −     66.4  −     −     36.4  80.9
UPSNet               COCO          61.8  81.3  74.8  57.6  77.7  70.5  64.8  81.4  −     39.0  79.2
Panoptic-DeepLab     Vistas        67.0  −     −     −     −     −     −     −     −     42.5  83.1
EfficientPS (ours)   Vistas        67.5  83.2  80.2  63.5  82.2  77.2  70.4  83.9  82.4  43.8  82.1

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Network              Pre-training  PQ    SQ    RQ    PQTh  PQSt
                                   (%)   (%)   (%)   (%)   (%)
SSAP                 −             58.9  82.4  70.6  48.4  66.5
TASCNet              COCO          60.7  81.0  73.8  53.4  66.0
Panoptic-DeepLab     −             62.3  82.4  74.8  52.1  69.7
Seamless             Vistas        62.6  82.1  75.3  56.0  67.5
Panoptic-DeepLab     Vistas        65.5  77.8  75.2  56.5  72.0
EfficientPS (ours)   −             64.1  82.6  76.8  56.7  69.4
EfficientPS (ours)   Vistas        66.4  83.5  78.8  59.3  71.5

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures.

Network              Input Size   Params  FLOPs   Time
                     (pixels)     (M)     (B)     (ms)
DeeperLab            1025×2049    −       −       463
UPSNet               1024×2048    45.05   487.02  202
Seamless             1024×2048    51.43   514.00  168
Panoptic-DeepLab     1025×2049    46.73   547.49  175
EfficientPS (ours)   1024×2048    40.89   433.94  166

on the test set. On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75, 2) during the multi-scale evaluations.
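For clarity, the following sketch shows one way such multi-scale evaluation can be realized for the semantic logits: the input is rescaled and horizontally flipped, the per-pixel logits are mapped back to the original resolution, and the results are averaged. How the instance predictions are aggregated across scales is not shown, and the function signature is an assumption for illustration.

import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_semantic_logits(model, image, scales=(0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """Average semantic logits over rescaled and horizontally flipped copies of the
    input, resized back to the original resolution. `model` is assumed to return
    per-pixel class logits of shape (N, C, H', W')."""
    _, _, h, w = image.shape
    accumulated = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = model(inp)
            if flip:
                logits = torch.flip(logits, dims=[3])
            logits = F.interpolate(logits, size=(h, w), mode='bilinear', align_corners=False)
            accumulated = accumulated + logits
    return accumulated / (2 * len(scales))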

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al., 2018b), TASCNet (Li et al., 2018a), Panoptic FPN (Kirillov et al., 2019), AUNet (Li et al., 2019b), UPSNet (Xiong et al., 2019), DeeperLab (Yang et al., 2019), Seamless (Porzi et al., 2019), SSAP (Gao et al., 2019), AdaptIS (Sofiiuk et al., 2019) and Panoptic-DeepLab (Cheng et al., 2019). Table 1 shows the results on the Cityscapes validation set. For a fair comparison, we categorize models in the table separately according to those that report


Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. − denotes that the metric has not been reported for the corresponding method.

Mode: Single-Scale
Network              PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
                     (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)
JSIS-Net             17.6  55.9  23.5  10.0  47.6  14.1  27.5  66.9  35.8  −     −
DeeperLab            32.0  −     −     −     −     −     −     −     −     −     55.3
TASCNet              32.6  −     −     31.1  −     −     34.4  −     −     18.5  −
AdaptIS              35.9  −     −     31.5  −     −     41.9  −     −     −     −
Seamless             37.7  −     −     33.8  −     −     42.9  −     −     16.4  50.4
Panoptic-DeepLab     37.7  −     −     30.4  −     −     47.4  −     −     14.9  55.3
EfficientPS (ours)   38.3  74.2  48.0  33.9  73.3  43.0  44.2  75.4  54.7  18.7  52.6

Mode: Multi-Scale
TASCNet              34.3  −     −     34.8  −     −     33.6  −     −     20.4  −
Panoptic-DeepLab     40.3  −     −     33.5  −     −     49.3  −     −     17.2  56.8
EfficientPS (ours)   40.5  74.9  49.5  35.0  73.8  44.4  47.7  76.2  56.4  20.8  54.1

single-scale and multi-scale evaluation, as well as without any pre-training and with pre-training on other datasets, namely Mapillary Vistas (Neuhold et al., 2017), denoted as Vistas, and Microsoft COCO (Lin et al., 2014), abbreviated as COCO. We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations, depth data, or exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale evaluation and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQSt, PQTh, SQ and RQ metrics and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS without pre-training on any extra data achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also

uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime including the time consumed for fusing the semantic and instance predictions that yields the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.
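For reference, a minimal sketch of how such end-to-end latency and parameter counts can be measured with PyTorch is shown below; it assumes a CUDA device and a model that accepts a single image tensor, and it does not reproduce our exact benchmarking setup (FLOP counting requires an external profiler and is not shown).

import time
import torch

@torch.no_grad()
def measure(model, input_size=(1, 3, 1024, 2048), warmup=10, runs=50):
    """Return the parameter count (in millions) and the average forward-pass
    latency (in milliseconds) for a 1024x2048 input on the GPU."""
    device = torch.device('cuda')
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    params = sum(p.numel() for p in model.parameters()) / 1e6
    for _ in range(warmup):
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    latency_ms = (time.time() - start) / runs * 1000.0
    return params, latency_ms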

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as the presence of 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Mode: Single-Scale
Network              PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
                     (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)
Panoptic FPN         38.6  70.4  51.2  26.1  68.3  40.1  47.6  71.9  59.2  24.4  52.1
UPSNet               39.1  70.7  51.7  26.6  68.5  40.6  48.3  72.4  59.8  24.7  52.6
Seamless             41.3  71.7  52.3  28.5  69.2  42.3  50.6  73.6  59.6  25.9  53.8
EfficientPS (ours)   42.9  72.7  53.6  30.4  69.8  43.7  52.0  74.9  60.9  27.1  55.3

Mode: Multi-Scale
Panoptic FPN         39.3  70.8  51.6  26.9  68.7  40.4  48.3  72.4  59.8  24.8  52.8
UPSNet               39.9  71.2  52.0  27.2  68.8  40.8  49.1  72.9  60.2  25.2  53.2
Seamless             42.2  72.3  52.9  29.1  69.7  42.9  51.8  74.2  60.1  26.6  55.1
EfficientPS (ours)   43.7  73.2  54.1  30.9  70.2  44.0  53.1  75.4  61.5  27.9  56.4

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Mode: Single-Scale
Network              PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
                     (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)
Panoptic FPN         45.9  75.9  60.8  46.1  77.8  60.9  45.8  74.9  60.7  27.8  68.1
UPSNet               46.6  76.5  60.9  47.6  78.9  61.1  46.0  75.3  60.8  28.2  68.4
Seamless             47.7  77.2  61.2  48.9  79.5  61.5  47.1  76.1  61.1  30.1  69.6
EfficientPS (ours)   50.1  78.4  62.0  50.7  80.6  61.6  49.8  77.1  62.2  31.6  71.3

Mode: Multi-Scale
Panoptic FPN         46.7  77.0  61.0  47.3  78.9  61.1  46.4  76.1  61.0  28.9  70.1
UPSNet               47.1  77.9  60.9  47.6  79.8  61.2  46.8  76.9  60.8  29.2  70.6
Seamless             48.5  78.2  61.9  49.5  80.4  62.2  47.9  77.1  61.7  31.4  71.3
EfficientPS (ours)   51.1  78.8  63.5  52.6  81.2  65.4  50.3  77.5  62.5  32.9  72.1

the-art Panoptic-DeepLab. Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQSt score, primarily due to the output stride of 16 at which it operates, which increases the computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQSt of Panoptic-DeepLab, while on the other hand, top-down approaches tend to have better instance segmentation ability as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured

by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQSt score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '-' refers to the standard configuration as in Kirillov et al. (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model  Encoder          2-way FPN  SIH  SH  PFM  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
                                                 (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)   (%)
M1     ResNet-50        -          -    -   -    58.2  79.1  72.0  52.4  78.8  67.2  62.4  79.4  75.6  31.6  74.3
M2     ResNet-50        -          -    -   X    58.8  79.5  72.6  53.4  79.2  62.8  67.9  79.7  76.1  33.8  75.1
M3     ResNet-50        -          X    -   X    58.6  79.4  72.4  53.1  79.1  67.5  62.6  79.6  75.9  33.7  75.0
M4     EfficientNet-B5  -          X    -   X    59.7  79.9  73.3  54.7  76.6  68.1  63.3  80.3  79.5  34.1  76.3
M5     EfficientNet-B5  X          X    -   X    61.5  80.7  75.6  57.2  80.6  72.5  64.6  80.9  77.9  36.8  77.3
M6     EfficientNet-B5  X          X    X   X    63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and its comparison with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the same network configuration and panoptic fusion heuristics as Kirillov et al. (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network is comprised of an upsampling stage which has a 3×3 convolution, group norm (Wu and He, 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.
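For illustration, a minimal PyTorch-style sketch of this baseline semantic head is given below; the channel width, the number of groups in the group norm, the number of classes and the handling of the P4 level (which is already at 1/4 scale) are assumptions made for the example rather than details taken from Kirillov et al. (2019).

import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleStage(nn.Module):
    """One 3x3 conv + group norm + ReLU + x2 bilinear upsampling step."""
    def __init__(self, channels, num_groups=32):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.norm = nn.GroupNorm(num_groups, channels)

    def forward(self, x):
        x = F.relu(self.norm(self.conv(x)))
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

class BaselineSemanticHead(nn.Module):
    """Panoptic FPN-style semantic head (model M1): every FPN level is brought
    to 1/4 of the input resolution, the levels are summed element-wise, and a
    1x1 convolution followed by x4 upsampling and softmax yields the output."""
    def __init__(self, fpn_channels=256, num_classes=19):
        super().__init__()
        # P4 is already at 1/4 scale (0 upsampling steps), P8 needs 1,
        # P16 needs 2 and P32 needs 3 repetitions of the upsampling stage.
        self.branches = nn.ModuleList([
            nn.Sequential(*[UpsampleStage(fpn_channels) for _ in range(n)])
            if n > 0 else nn.Identity()
            for n in (0, 1, 2, 3)
        ])
        self.classifier = nn.Conv2d(fpn_channels, num_classes, 1)

    def forward(self, p4, p8, p16, p32):
        feats = [branch(p) for branch, p in zip(self.branches, (p4, p8, p16, p32))]
        fused = feats[0] + feats[1] + feats[2] + feats[3]  # element-wise sum at 1/4 scale
        logits = F.interpolate(self.classifier(fused), scale_factor=4,
                               mode="bilinear", align_corners=False)
        return torch.softmax(logits, dim=1)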

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M with a drop of 0.2% in PQ and a 0.1% drop in AP and the mIoU score. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively and thereby achieves a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, it cannot be solely attributed to the semantic head. This is due to the fact that if we employ the standard panoptic fusion heuristics, an improvement in semantic


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Encoder | Params (M) | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all metric values in %)
MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2
ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1
ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2
Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1
ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9
Mod. EfficientNet-B5 (Ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Architecture | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all metric values in %)
Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3
Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2
PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8
2-way FPN (Ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

segmentation would only contribute to an increase in the PQSt score. However, our proposed adaptive panoptic fusion yields an improvement in PQTh as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al., 2019), ResNet-50 (He et al., 2016), ResNet-101 (He et al., 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al., 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have comparable or fewer parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al., 2017), bottom-up FPN and PANet FPN variants. We refer to the FPN architecture described in Liu et al. (2018) as PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. Both these models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the ×os pyramid scale level, and c^k_f refers to a convolution layer with f filters and a k×k kernel size. LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all metric values in %)
M61 | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | - | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2
M62 | LSFE | LSFE | LSFE | LSFE | - | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4
M63 | DPC | LSFE | LSFE | LSFE | - | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1
M64 | DPC | DPC | LSFE | LSFE | - | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6
M65 | DPC | DPC | DPC | LSFE | - | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2
M66 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

the standard FPN, in a sequential or parallel manner respectively. We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.
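The following is a simplified PyTorch sketch of the parallel two-pathway idea; the 1×1 lateral projections, the element-wise-sum merge followed by a 3×3 convolution, and the listed encoder channel widths are illustrative assumptions and not the exact EfficientPS implementation.

import torch.nn as nn
import torch.nn.functional as F

class TwoWayFPN(nn.Module):
    """Illustrative 2-way FPN: a top-down and a bottom-up pathway run in
    parallel on the encoder features and their per-scale outputs are merged."""
    def __init__(self, encoder_channels=(40, 64, 176, 2048), out_channels=256):
        super().__init__()
        self.lateral_td = nn.ModuleList(
            [nn.Conv2d(c, out_channels, 1) for c in encoder_channels])
        self.lateral_bu = nn.ModuleList(
            [nn.Conv2d(c, out_channels, 1) for c in encoder_channels])
        self.merge = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, 3, padding=1)
             for _ in encoder_channels])

    def forward(self, feats):
        # feats: [C4, C8, C16, C32] from fine to coarse resolution.
        td = [self.lateral_td[-1](feats[-1])]                  # top-down: coarse to fine
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(td[0], size=feats[i].shape[-2:], mode="nearest")
            td.insert(0, self.lateral_td[i](feats[i]) + up)
        bu = [self.lateral_bu[0](feats[0])]                    # bottom-up: fine to coarse
        for i in range(1, len(feats)):
            down = F.interpolate(bu[-1], size=feats[i].shape[-2:], mode="nearest")
            bu.append(self.lateral_bu[i](feats[i]) + down)
        # Merge the two parallel pathways at every pyramid level (P4 ... P32).
        return [m(t + b) for m, t, b in zip(self.merge, td, bu)]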

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output that is 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output at the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model; therefore, we employ separable convolutions in all the experiments that follow.
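A minimal sketch of the LSFE module as described above is shown below; standard BatchNorm2d stands in for the iABN sync layers, and the channel widths (256 FPN channels in, 128 filters as in Table 10) are assumptions made for the example.

import torch.nn as nn

class SeparableConv2d(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class LSFE(nn.Module):
    """Large Scale Feature Extractor: two cascaded 3x3 separable convolutions,
    each followed by normalization and leaky ReLU."""
    def __init__(self, in_ch=256, out_ch=128):
        super().__init__()
        self.block = nn.Sequential(
            SeparableConv2d(in_ch, out_ch), nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.01),
            SeparableConv2d(out_ch, out_ch), nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.01),
        )

    def forward(self, x):
        return self.block(x)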

In the M63 model, we replace the LSFE module in the P32 level of the 2-way FPN with dense prediction cells (DPC) described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score, which can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module in the P16 level with DPC, and in the subsequent M65 model we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with the fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, which results in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al. (2019), which we denote as the baseline, as well as those of UPSNet (Xiong et al., 2019) and Seamless (Porzi et al., 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments, while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Semantic Head | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all metric values in %)
Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1
Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt (all metric values in %)
Baseline | 62.4 | 80.8 | 75.4 | 58.7 | 80.4 | 72.6 | 65.1 | 81.1 | 77.4
TASCNet | 62.5 | 80.9 | 75.6 | 58.6 | 80.5 | 72.8 | 65.3 | 81.2 | 77.7
UPSNet | 63.1 | 81.3 | 76.1 | 59.5 | 80.6 | 73.2 | 65.7 | 81.8 | 78.2
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at the different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as a baseline as it is extensively used in several panoptic segmentation networks. We then compare with Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in (Xiong et al., 2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and the instance head, which have an inherent overlap, is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both the heads by adding them, and therefore it relies on the mode computed over the segmentation output to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both the heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.
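For clarity, the sketch below implements only the baseline overlap heuristic described above (instance masks pasted in order of confidence and always winning overlaps, remaining pixels filled from the semantic prediction, small 'stuff' regions discarded); the mask format, the class-id-times-1000 encoding and the minimum stuff area value are illustrative assumptions, and this is not our adaptive fusion module.

import numpy as np

def baseline_panoptic_fusion(sem_pred, instances, stuff_ids, min_stuff_area=2048):
    """Baseline fusion heuristic in the style of Kirillov et al. (2019).

    sem_pred:  HxW array of semantic class ids.
    instances: list of (mask, class_id, score) with boolean HxW masks.
    Returns an HxW panoptic map encoded as class_id * 1000 + instance_id."""
    panoptic = np.zeros_like(sem_pred, dtype=np.int64)        # 0 = unassigned
    next_instance = 1
    for mask, class_id, _ in sorted(instances, key=lambda t: -t[2]):
        free = mask & (panoptic == 0)                         # never overwrite earlier instances
        if not free.any():
            continue
        panoptic[free] = class_id * 1000 + next_instance
        next_instance += 1
    for class_id in stuff_ids:                                # fill the rest with 'stuff'
        region = (sem_pred == class_id) & (panoptic == 0)
        if region.sum() >= min_stuff_area:
            panoptic[region] = class_id * 1000
    return panoptic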

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model Panoptic-DeepLab does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding


[Figure 6 layout: columns show the input image, the Seamless output, the EfficientPS output and the improvement/error map; rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD.]

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


datasets. For each example, we show the input image, the corresponding panoptic segmentation output from the Seamless model, and the output of our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset, in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN, which effectively aggregates multi-scale features to learn semantically richer representations, and the panoptic fusion module, which addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c), the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately resolve this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divides it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by the cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across the different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head, which correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearance of these two objects; however, our proposed EfficientPS model, with our novel panoptic backbone, has an extensive representational capacity which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of


[Figure 7 layout: four panels (i)-(iv) per row; rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD.]

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances at multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk as well as seasonal changes.


traffic participants. These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from viewpoints uncommon among those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analysis and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P Pont-Tuset J Barron JT Marques F Malik J (2014)Multiscale combinatorial grouping In Proceedings of the Confer-ence on Computer Vision and Pattern Recognition pp 328ndash335

Badrinarayanan V Kendall A Cipolla R (2015) Segnet A deep convo-lutional encoder-decoder architecture for image segmentation arXivpreprint arXiv 151100561

Bai M Urtasun R (2017) Deep watershed transform for instance seg-mentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 5221ndash5229

Bremner JG Slater A (2008) Theories of infant development JohnWiley amp Sons

Brostow GJ Shotton J Fauqueur J Cipolla R (2008) Segmentation andrecognition using structure from motion point clouds In Europeanconference on computer vision Springer pp 44ndash57

Bulo SR Neuhold G Kontschieder P (2017) Loss max-pooling forsemantic image segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 7082ndash7091

Chen LC Papandreou G Kokkinos I Murphy K Yuille AL (2017a) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40(4):834-848

Chen LC Papandreou G Schroff F Adam H (2017b) Rethinking at-rous convolution for semantic image segmentation arXiv preprintarXiv170605587

Chen LC Collins M Zhu Y Papandreou G Zoph B Schroff F AdamH Shlens J (2018a) Searching for efficient multi-scale architecturesfor dense image prediction In Advances in Neural InformationProcessing Systems pp 8713ndash8724

Chen LC Zhu Y Papandreou G Schroff F Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image seg-mentation arXiv preprint arXiv180202611

Cheng B Collins MD Zhu Y Liu T Huang TS Adam H Chen LC(2019) Panoptic-deeplab A simple strong and fast baseline forbottom-up panoptic segmentation arXiv preprint arXiv191110194

Chollet F (2017) Xception Deep learning with depthwise separableconvolutions In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 1251ndash1258

Cordts M Omran M Ramos S Rehfeld T Enzweiler M Benenson RFranke U Roth S Schiele B (2016) The cityscapes dataset for se-mantic urban scene understanding In Proceedings of the Conferenceon Computer Vision and Pattern Recognition pp 3213ndash3223

Dai J He K Li Y Ren S Sun J (2016) Instance-sensitive fully convolu-tional networks In European Conference on Computer Vision pp534ndash549

Dai J Qi H Xiong Y Li Y Zhang G Hu H Wei Y (2017) Deform-able convolutional networks In Proceedings of the InternationalConference on Computer Vision pp 764ndash773

Gao N Shan Y Wang Y Zhao X Yu Y Yang M Huang K (2019)Ssap Single-shot instance segmentation with affinity pyramid InProceedings of the International Conference on Computer Vision pp642ndash651

Geiger A Lenz P Stiller C Urtasun R (2013) Vision meets roboticsThe kitti dataset International Journal of Robotics Research

de Geus D Meletis P Dubbelman G (2018) Panoptic segmentation witha joint semantic and instance segmentation network arXiv preprintarXiv180902110

Girshick R (2015) Fast r-cnn In Proceedings of the International Con-ference on Computer Vision pp 1440ndash1448

Glorot X Bengio Y (2010) Understanding the difficulty of trainingdeep feedforward neural networks In Proceedings of the thirteenthinternational conference on artificial intelligence and statistics pp249ndash256

Hariharan B Arbelaez P Girshick R Malik J (2014) Simultaneousdetection and segmentation In European Conference on ComputerVision pp 297ndash312

Hariharan B Arbelaez P Girshick R Malik J (2015) Hypercolumns forobject segmentation and fine-grained localization In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp447ndash456

He K Zhang X Ren S Sun J (2016) Deep residual learning for imagerecognition In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 770ndash778

He K Gkioxari G Dollar P Girshick R (2017) Mask r-cnn In Pro-ceedings of the International Conference on Computer Vision pp2961ndash2969

He X Gould S (2014a) An exemplar-based crf for multi-instance objectsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition

He X Gould S (2014b) An exemplar-based crf for multi-instance objectsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 296ndash303

Howard A Sandler M Chu G Chen LC Chen B Tan M Wang W ZhuY Pang R Vasudevan V et al (2019) Searching for mobilenetv3 InProceedings of the International Conference on Computer Vision pp1314ndash1324

Hu J Shen L Sun G (2018) Squeeze-and-excitation networks In Pro-ceedings of the Conference on Computer Vision and Pattern Recog-nition pp 7132ndash7141

Ioffe S Szegedy C (2015) Batch normalization Accelerating deepnetwork training by reducing internal covariate shift arXiv preprintarXiv150203167

Kaiser L Gomez AN Chollet F (2017) Depthwise separable convolu-tions for neural machine translation arXiv preprint arXiv170603059

Kang BR Kim HY (2018) Bshapenet Object detection and in-stance segmentation with bounding shape masks arXiv preprintarXiv181010327

Kirillov A He K Girshick R Rother C Dollar P (2019) Panopticsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 9404ndash9413

Kontschieder P Bulo SR Bischof H Pelillo M (2011) Structured class-labels in random forests for semantic image labelling In Proceed-ings of the International Conference on Computer Vision pp 2190ndash2197

Krahenbuhl P Koltun V (2011) Efficient inference in fully connectedcrfs with gaussian edge potentials In Advances in neural informa-tion processing systems pp 109ndash117

Li J Raventos A Bhargava A Tagawa T Gaidon A (2018a) Learningto fuse things and stuff arXiv preprint arXiv181201192

Li Q Arnab A Torr PH (2018b) Weakly-and semi-supervised panop-tic segmentation In Proceedings of the European Conference onComputer Vision (ECCV) pp 102ndash118

Li X Zhang L You A Yang M Yang K Tong Y (2019a) Globalaggregation then local distribution in fully convolutional networksarXiv preprint arXiv190907229

Li Y Qi H Dai J Ji X Wei Y (2017) Fully convolutional instance-aware semantic segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 2359ndash2367

Li Y Chen X Zhu Z Xie L Huang G Du D Wang X (2019b) Attention-guided unified network for panoptic segmentation In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp7026ndash7035

Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar PZitnick CL (2014) Microsoft coco Common objects in context InEuropean conference on computer vision Springer pp 740ndash755

Lin TY Dollar P Girshick R He K Hariharan B Belongie S (2017)Feature pyramid networks for object detection In Proceedings ofthe Conference on Computer Vision and Pattern Recognition pp2117ndash2125

Liu H Peng C Yu C Wang J Liu X Yu G Jiang W (2019) An end-to-end network for panoptic segmentation In Proceedings of theConference on Computer Vision and Pattern Recognition pp 6172ndash6181

Liu S Jia J Fidler S Urtasun R (2017) Sgn Sequential grouping net-works for instance segmentation In Proceedings of the InternationalConference on Computer Vision pp 3496ndash3504

Liu S Qi L Qin H Shi J Jia J (2018) Path aggregation network for in-stance segmentation In Proceedings of the Conference on ComputerVision and Pattern Recognition pp 8759ndash8768

Liu W Rabinovich A Berg AC (2015) Parsenet Looking wider to seebetter arXiv preprint arXiv 150604579

Long J Shelhamer E Darrell T (2015) Fully convolutional networksfor semantic segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition

Neuhold G Ollmann T Rota Bulo S Kontschieder P (2017) The mapil-lary vistas dataset for semantic understanding of street scenes InProceedings of the International Conference on Computer Vision pp4990ndash4999

Paszke A Gross S Massa F Lerer A Bradbury J Chanan G Killeen TLin Z Gimelshein N Antiga L et al (2019) Pytorch An imperativestyle high-performance deep learning library In Advances in NeuralInformation Processing Systems pp 8024ndash8035


Pinheiro PO Collobert R Dollar P (2015) Learning to segment objectcandidates In Advances in Neural Information Processing Systemspp 1990ndash1998

Plath N Toussaint M Nakajima S (2009) Multi-class image segment-ation using conditional random fields and global classification InProceedings of the International Conference on Machine Learningpp 817ndash824

Porzi L Bulo SR Colovic A Kontschieder P (2019) Seamless scenesegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 8277ndash8286

Radwan N Valada A Burgard W (2018) Multimodal interaction-awaremotion prediction for autonomous street crossing arXiv preprintarXiv180806887

Ren M Zemel RS (2017) End-to-end instance segmentation with re-current attention In Proceedings of the Conference on ComputerVision and Pattern Recognition pp 6656ndash6664

Romera-Paredes B Torr PHS (2016) Recurrent instance segmentationIn European conference on computer vision Springer pp 312ndash329

Ros G Ramos S Granados M Bakhtiary A Vazquez D Lopez AM(2015) Vision-based offline-online perception paradigm for autonom-ous driving In IEEE Winter Conference on Applications of Com-puter Vision pp 231ndash238

Rota Bulo S Porzi L Kontschieder P (2018) In-place activated batch-norm for memory-optimized training of dnns In Proceedings ofthe Conference on Computer Vision and Pattern Recognition pp5639ndash5647

Russakovsky O Deng J Su H Krause J Satheesh S Ma S Huang ZKarpathy A Khosla A Bernstein M et al (2015) Imagenet largescale visual recognition challenge International Journal of ComputerVision 115(3)211ndash252

Shotton J Johnson M Cipolla R (2008) Semantic texton forests forimage categorization and segmentation In Proceedings of the Con-ference on Computer Vision and Pattern Recognition

Silberman N Sontag D Fergus R (2014) Instance segmentation ofindoor scenes using a coverage loss In European Conference onComputer Vision pp 616ndash631

Simonyan K Zisserman A (2014) Very deep convolutional networksfor large-scale image recognition arXiv preprint arXiv14091556

Sofiiuk K Barinova O Konushin A (2019) Adaptis Adaptive instanceselection network In Proceedings of the International Conferenceon Computer Vision pp 7355ndash7363

Sturgess P Alahari K Ladicky L Torr PH (2009) Combining appear-ance and structure from motion features for road scene understandingIn British Machine Vision Conference

Tan M Le QV (2019) Efficientnet Rethinking model scaling for convo-lutional neural networks arXiv preprint arXiv190511946

Tian Z He T Shen C Yan Y (2019) Decoders matter for semanticsegmentation Data-dependent decoding enables flexible feature ag-gregation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 3126ndash3135

Tighe J Niethammer M Lazebnik S (2014) Scene parsing with objectinstances and occlusion ordering In Proceedings of the Conferenceon Computer Vision and Pattern Recognition pp 3748ndash3755

Tu Z Chen X Yuille AL Zhu SC (2005) Image parsing Unifyingsegmentation detection and recognition International Journal ofComputer Vision 63(2)113ndash140

Uhrig J Cordts M Franke U Brox T (2016) Pixel-level encodingand depth layering for instance-level semantic labeling In GermanConference on Pattern Recognition pp 14ndash25

Valada A Dhall A Burgard W (2016a) Convoluted mixture of deepexperts for robust semantic segmentation In IEEERSJ Internationalconference on intelligent robots and systems (IROS) workshop stateestimation and terrain perception for all terrain mobile robots

Valada A Oliveira G Brox T Burgard W (2016b) Towards robustsemantic segmentation using deep fusion In Robotics Science andsystems (RSS 2016) workshop are the sceptics right Limits andpotentials of deep learning in robotics

Valada A Vertens J Dhall A Burgard W (2017) Adapnet Adaptivesemantic segmentation in adverse environmental conditions In Pro-ceedings of the IEEE International Conference on Robotics andAutomation pp 4644ndash4651

Valada A Radwan N Burgard W (2018) Incorporating semantic andgeometric priors in deep pose regression In Workshop on Learningand Inference in Robotics Integrating Structure Priors and Modelsat Robotics Science and Systems (RSS)

Valada A Mohan R Burgard W (2019) Self-supervised model adapta-tion for multimodal semantic segmentation International Journal ofComputer Vision DOI 101007s11263-019-01188-y special IssueDeep Learning for Robotic Vision

Varma G Subramanian A Namboodiri A Chandraker M Jawahar C(2019) Idd A dataset for exploring problems of autonomous naviga-tion in unconstrained environments In IEEE Winter Conference onApplications of Computer Vision (WACV) pp 1743ndash1751

Wu Y He K (2018) Group normalization In Proceedings of theEuropean Conference on Computer Vision (ECCV) pp 3ndash19

Xie S Girshick R Dollar P Tu Z He K (2017) Aggregated residualtransformations for deep neural networks In Proceedings of theConference on Computer Vision and Pattern Recognition pp 1492ndash1500

Xiong Y Liao R Zhao H Hu R Bai M Yumer E Urtasun R (2019)Upsnet A unified panoptic segmentation network In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp8818ndash8826

Xu P Davoine F Bordes JB Zhao H Denœux T (2016) Multimodalinformation fusion for urban scene understanding Machine Visionand Applications 27(3)331ndash349

Yang TJ Collins MD Zhu Y Hwang JJ Liu T Zhang X Sze VPapandreou G Chen LC (2019) Deeperlab Single-shot image parserarXiv preprint arXiv190205093

Yao J Fidler S Urtasun R (2012) Describing the scene as a whole Jointobject detection scene classification and semantic segmentationIn Proceedings of the Conference on Computer Vision and PatternRecognition pp 702ndash709

Yu F Koltun V (2015) Multi-scale context aggregation by dilated con-volutions arXiv preprint arXiv151107122

Zhang C Wang L Yang R (2010) Semantic segmentation of urbanscenes using dense depth maps In European Conference on Com-puter Vision Springer pp 708ndash721

Zhang Z Fidler S Urtasun R (2016) Instance-level segmentation forautonomous driving with deep densely connected mrfs In Proceed-ings of the Conference on Computer Vision and Pattern Recognitionpp 669ndash677

Zhao H Shi J Qi X Wang X Jia J (2017) Pyramid scene parsingnetwork In Proceedings of the Conference on Computer Vision andPattern Recognition pp 2881ndash2890

Zurn J Burgard W Valada A (2019) Self-supervised visual terrainclassification from unsupervised acoustic feature learning arXivpreprint arXiv191203227



diverse scenes from over six continents and in a variety of weather conditions, seasons, times of day, cameras and viewpoints. It consists of 18,000 images for training, 2,000 images for validation and 5,000 images for testing. The dataset provides panoptic annotations for 37 'thing' classes and 28 'stuff' classes. The images in this dataset are of different resolutions, ranging from 1024×768 pixels to 4000×6000 pixels. Figure 5 (c) shows an example image and the corresponding panoptic segmentation groundtruth from the Mapillary Vistas dataset. We can see that due to the snowy condition, recognizing distant objects such as the car in this example becomes extremely difficult. Such drastic seasonal changes make this dataset one of the most challenging for panoptic segmentation.

Indian Driving Dataset: The Indian Driving Dataset (IDD) (Varma et al., 2019) was recently introduced for scene understanding of unstructured environments. Unlike other urban scene understanding datasets, IDD consists of scenes that do not have well-delineated infrastructure such as lanes and sidewalks. It has a significantly larger number of 'thing' instances in each scene compared to other datasets, and it only has a small number of well-defined categories for traffic participants. The images in this dataset were captured with a front-facing camera mounted on a car, and the data was gathered in two Indian cities as well as in their outskirts. IDD consists of a total of 10,003 images, where 6,993 are used for training, 981 for validation and 2,029 for testing. The images have a resolution of either 1920×1080 pixels or 720×1280 pixels. We train and evaluate all our models at 720p resolution on this dataset. The annotations are provided in four levels of hierarchy. Existing approaches primarily report their results for level 3, therefore we report the results of our model on the same to facilitate comparison. This level comprises a total of 26 classes, out of which 17 are 'stuff' classes and 9 are instance-specific 'thing' classes. An example image and the corresponding panoptic segmentation groundtruth from the IDD dataset are shown in Figure 5 (d). We observe that the transition between the road and the sidewalk class is structurally not well defined, which often leads to misclassifications. Factors such as this make evaluating on this dataset uniquely challenging.

4.2 Training Protocol

We train our network on crops of different resolutions of the input image, namely 1024×2048, 1024×1024, 384×1280 and 720×1280 pixels. We take crops from the full resolution of the image provided in each of the datasets. We perform a limited set of random data augmentations including flipping and scaling within the range of [0.5, 2.0]. We initialize the backbone of our EfficientPS with weights from the EfficientNet model pre-trained on the ImageNet dataset (Russakovsky

et al., 2015) and initialize the weights of the iABN sync layers to 1. We use Xavier initialization (Glorot and Bengio, 2010) for the other layers, zero constant initialization for the biases, and we use leaky ReLU with a slope of 0.01. We use the same hyperparameters as Girshick (2015) for our instance head and additionally set TH = 0.7, TL = 0.3 and TN = 0.5. In our proposed panoptic fusion module, we use a confidence threshold of ct = 0.5, an overlap threshold of ot = 0.5 and a minimum stuff area of minsa = 2048.

We train our model with Stochastic Gradient Descent (SGD) with a momentum of 0.9 using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, followed by lowering the learning rate by a factor of 10 at each milestone, and continue training until convergence. We denote the base learning rate lr_base, the milestones and the total number of iterations ti for each dataset in the following format: {lr_base, milestone, milestone, ti}. The training schedules for Cityscapes, Mapillary Vistas, KITTI and IDD are {0.07, 32K, 44K, 50K}, {0.07, 144K, 176K, 192K}, {0.07, 16K, 22K, 25K} and {0.07, 108K, 130K, 144K} respectively. At the beginning of the training, we have a warm-up phase where lr_base is increased linearly from 1/3 · lr_base to lr_base in 200 iterations. Additionally, we freeze the iABN sync layers and further train the model for 10 epochs with a fixed learning rate of lr = 10^-4. The final loss L_total that we optimize is computed as

L_total = L_semantic + L_instance,    (17)

where L_semantic and L_instance are given in Equation (2) and Equation (12) respectively. We train our EfficientPS with a batch size of 16 on 16 NVIDIA Titan X GPUs, where each GPU tends to a single image.
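A small sketch of this multi-step schedule with the linear warm-up phase is given below, shown with the Cityscapes milestones; wiring it into an SGD optimizer via LambdaLR is only one possible way to realize the protocol.

def lr_at_iteration(it, base_lr=0.07, milestones=(32000, 44000), warmup_iters=200):
    """Multi-step learning rate with a linear warm-up from base_lr / 3 to
    base_lr over the first 200 iterations, then a division by 10 at every
    milestone (Cityscapes milestones shown)."""
    if it < warmup_iters:
        alpha = it / warmup_iters
        return base_lr * ((1.0 - alpha) / 3.0 + alpha)
    return base_lr * (0.1 ** sum(it >= m for m in milestones))

# One possible way to plug this into SGD with momentum 0.9:
#   optimizer = torch.optim.SGD(model.parameters(), lr=0.07, momentum=0.9)
#   scheduler = torch.optim.lr_scheduler.LambdaLR(
#       optimizer, lambda it: lr_at_iteration(it) / 0.07)
#   (call scheduler.step() once per training iteration)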

4.3 Benchmarking Results

In this section, we report results comparing the performance of our proposed EfficientPS architecture against current state-of-the-art panoptic segmentation approaches. For comparisons on the Cityscapes and Mapillary Vistas datasets, we directly report the performance metrics of the state-of-the-art methods as stated in their corresponding manuscripts, while for KITTI and IDD, we report results for the models that we trained using the official implementations that have been publicly released by the authors, after further tuning of hyperparameters to the best of our ability. Note that existing methods have not reported results on the KITTI and IDD validation sets. We report results on the validation sets for all the datasets, and we additionally report results on the test set for the Cityscapes dataset by evaluating them on the official server. Note that at the time of submission, only the Cityscapes benchmark has the provision to evaluate the results


Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. '-' denotes that the metric has not been reported for the corresponding method.

Mode | Network | Pre-training | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all metric values in %)
Single-Scale | WeaklySupervised | - | 47.3 | - | - | 39.6 | - | - | 52.9 | - | - | 24.3 | 71.6
Single-Scale | TASCNet | - | 55.9 | - | - | 50.5 | - | - | 59.8 | - | - | - | -
Single-Scale | Panoptic FPN | - | 58.1 | - | - | 52.0 | - | - | 62.5 | - | - | 33.0 | 75.7
Single-Scale | AUNet | - | 59.0 | - | - | 54.8 | - | - | 62.1 | - | - | 34.4 | 75.6
Single-Scale | UPSNet | - | 59.3 | 79.7 | 73.0 | 54.6 | 79.3 | 68.7 | 62.7 | 80.1 | 76.2 | 33.3 | 75.2
Single-Scale | DeeperLab | - | 56.3 | - | - | - | - | - | - | - | - | - | -
Single-Scale | Seamless | - | 60.3 | - | - | 56.1 | - | - | 63.3 | - | - | 33.6 | 77.5
Single-Scale | SSAP | - | 61.1 | - | - | 55.0 | - | - | - | - | - | - | -
Single-Scale | AdaptIS | - | 62.0 | - | - | 58.7 | - | - | 64.4 | - | - | 36.3 | 79.2
Single-Scale | Panoptic-DeepLab | - | 63.0 | - | - | - | - | - | - | - | - | 35.3 | 80.5
Single-Scale | EfficientPS (ours) | - | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3
Single-Scale | TASCNet | COCO | 59.3 | - | - | 56.0 | - | - | 61.5 | - | - | 37.6 | 78.1
Single-Scale | UPSNet | COCO | 60.5 | 80.9 | 73.5 | 57.0 | - | - | 63.0 | - | - | 37.8 | 77.8
Single-Scale | Seamless | Vistas | 65.0 | - | - | 60.7 | - | - | 68.0 | - | - | - | 80.7
Single-Scale | Panoptic-DeepLab | Vistas | 65.3 | - | - | - | - | - | - | - | - | 38.8 | 82.5
Single-Scale | EfficientPS (ours) | Vistas | 66.1 | 82.5 | 78.9 | 62.7 | 81.9 | 75.2 | 68.5 | 82.9 | 81.6 | 41.9 | 81.0
Multi-Scale | Panoptic-DeepLab | - | 64.1 | - | - | - | - | - | - | - | - | 38.5 | 81.5
Multi-Scale | EfficientPS (ours) | - | 65.1 | 82.2 | 79.0 | 61.5 | 81.4 | 75.4 | 67.7 | 82.8 | 81.7 | 39.7 | 80.3
Multi-Scale | TASCNet | COCO | 60.4 | - | - | 56.1 | - | - | 63.3 | - | - | 39.1 | 78.7
Multi-Scale | M-RCNN + PSPNet | COCO | 61.2 | 80.9 | 74.4 | 54.0 | - | - | 66.4 | - | - | 36.4 | 80.9
Multi-Scale | UPSNet | COCO | 61.8 | 81.3 | 74.8 | 57.6 | 77.7 | 70.5 | 64.8 | 81.4 | - | 39.0 | 79.2
Multi-Scale | Panoptic-DeepLab | Vistas | 67.0 | - | - | - | - | - | - | - | - | 42.5 | 83.1
Multi-Scale | EfficientPS (ours) | Vistas | 67.5 | 83.2 | 80.2 | 63.5 | 82.2 | 77.2 | 70.4 | 83.9 | 82.4 | 43.8 | 82.1

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Network | Pre-training | PQ | SQ | RQ | PQTh | PQSt (all metric values in %)
SSAP | - | 58.9 | 82.4 | 70.6 | 48.4 | 66.5
TASCNet | COCO | 60.7 | 81.0 | 73.8 | 53.4 | 66.0
Panoptic-DeepLab | - | 62.3 | 82.4 | 74.8 | 52.1 | 69.7
Seamless | Vistas | 62.6 | 82.1 | 75.3 | 56.0 | 67.5
Panoptic-DeepLab | Vistas | 65.5 | 77.8 | 75.2 | 56.5 | 72.0
EfficientPS (ours) | - | 64.1 | 82.6 | 76.8 | 56.7 | 69.4
EfficientPS (ours) | Vistas | 66.4 | 83.5 | 78.8 | 59.3 | 71.5

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures

Network | Input Size (pixels) | Params (M) | FLOPs (B) | Time (ms)
DeeperLab | 1025×2049 | - | - | 463
UPSNet | 1024×2048 | 45.05 | 487.02 | 202
Seamless | 1024×2048 | 51.43 | 514.00 | 168
Panoptic-DeepLab | 1025×2049 | 46.73 | 547.49 | 175
EfficientPS (ours) | 1024×2048 | 40.89 | 433.94 | 166

on the test set. On each of the datasets, we report both the single-scale and multi-scale evaluation results. Following standard practice, we perform horizontal flipping and scaling (scales of 0.75, 1, 1.25, 1.5, 1.75, 2) during the multi-scale evaluations.
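A sketch of such a multi-scale, flipped evaluation for the semantic logits is shown below; averaging softmax scores and the interface model(image) returning per-class logits at the input resolution are assumptions made for the example, and the merging of the instance outputs across scales is not covered.

import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_flip_logits(model, image, scales=(0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """Averages semantic class scores over rescaled and horizontally flipped
    copies of the input image (NCHW tensor)."""
    _, _, h, w = image.shape
    accumulated = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = model(inp)
            if flip:
                logits = torch.flip(logits, dims=[3])
            logits = F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
            accumulated = accumulated + torch.softmax(logits, dim=1)
    return accumulated / (2 * len(scales))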

We compare the performance of our proposed EfficientPS against state-of-the-art models on the Cityscapes dataset, including WeaklySupervised (Li et al., 2018b), TASCNet (Li et al., 2018a), Panoptic FPN (Kirillov et al., 2019), AUNet (Li et al., 2019b), UPSNet (Xiong et al., 2019), DeeperLab (Yang et al., 2019), Seamless (Porzi et al., 2019), SSAP (Gao et al., 2019), AdaptIS (Sofiiuk et al., 2019) and Panoptic-DeepLab (Cheng et al., 2019). Table 1 shows the results on the Cityscapes validation set. For a fair comparison, we categorize models in the table separately according to those that report


Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. '-' denotes that the metric has not been reported for the corresponding method.

Mode | Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU (all metric values in %)
Single-Scale | JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | - | -
Single-Scale | DeeperLab | 32.0 | - | - | - | - | - | - | - | - | - | 55.3
Single-Scale | TASCNet | 32.6 | - | - | 31.1 | - | - | 34.4 | - | - | 18.5 | -
Single-Scale | AdaptIS | 35.9 | - | - | 31.5 | - | - | 41.9 | - | - | - | -
Single-Scale | Seamless | 37.7 | - | - | 33.8 | - | - | 42.9 | - | - | 16.4 | 50.4
Single-Scale | Panoptic-DeepLab | 37.7 | - | - | 30.4 | - | - | 47.4 | - | - | 14.9 | 55.3
Single-Scale | EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6
Multi-Scale | TASCNet | 34.3 | - | - | 34.8 | - | - | 33.6 | - | - | 20.4 | -
Multi-Scale | Panoptic-DeepLab | 40.3 | - | - | 33.5 | - | - | 49.3 | - | - | 17.2 | 56.8
Multi-Scale | EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1

single-scale and multi-scale evaluation, as well as those without any pre-training and with pre-training on other datasets, namely Mapillary Vistas (Neuhold et al., 2017), denoted as Vistas, and Microsoft COCO (Lin et al., 2014), abbreviated as COCO. We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations, depth data, or exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQSt, PQTh, SQ and RQ metrics, and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS, without pre-training on any extra data, achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using the Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime, including the time consumed for fusing the semantic and instance predictions that yields the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.
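Runtime figures of this kind depend heavily on how the measurement is taken. As a hedged illustration only (not the exact protocol used for Table 3), the following PyTorch sketch shows a common way to time end-to-end GPU inference at the 1024×2048 input resolution, with explicit synchronization so that asynchronous CUDA kernels are fully accounted for; model is a placeholder for any panoptic segmentation network whose forward pass already includes the fusion step.

import time
import torch

@torch.no_grad()
def measure_runtime_ms(model, iters=100, size=(1, 3, 1024, 2048), device='cuda'):
    # Rough sketch for timing end-to-end inference on a GPU.
    model = model.to(device).eval()
    x = torch.randn(size, device=device)
    for _ in range(10):              # warm-up iterations to amortize setup costs
        model(x)
    torch.cuda.synchronize()         # wait for all queued kernels before timing
    start = time.time()
    for _ in range(iters):
        model(x)
        torch.cuda.synchronize()     # include all GPU work of this pass in the timing
    return (time.time() - start) / iters * 1000.0   # milliseconds per image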

In Table 4 we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as the presence of 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab. For multi-scale evaluation, it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-the-art Panoptic-DeepLab.


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

Mode          Network             PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
Single-Scale  Panoptic FPN        38.6  70.4  51.2  26.1  68.3  40.1  47.6  71.9  59.2  24.4  52.1
Single-Scale  UPSNet              39.1  70.7  51.7  26.6  68.5  40.6  48.3  72.4  59.8  24.7  52.6
Single-Scale  Seamless            41.3  71.7  52.3  28.5  69.2  42.3  50.6  73.6  59.6  25.9  53.8
Single-Scale  EfficientPS (ours)  42.9  72.7  53.6  30.4  69.8  43.7  52.0  74.9  60.9  27.1  55.3
Multi-Scale   Panoptic FPN        39.3  70.8  51.6  26.9  68.7  40.4  48.3  72.4  59.8  24.8  52.8
Multi-Scale   UPSNet              39.9  71.2  52.0  27.2  68.8  40.8  49.1  72.9  60.2  25.2  53.2
Multi-Scale   Seamless            42.2  72.3  52.9  29.1  69.7  42.9  51.8  74.2  60.1  26.6  55.1
Multi-Scale   EfficientPS (ours)  43.7  73.2  54.1  30.9  70.2  44.0  53.1  75.4  61.5  27.9  56.4

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

Mode          Network             PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
Single-Scale  Panoptic FPN        45.9  75.9  60.8  46.1  77.8  60.9  45.8  74.9  60.7  27.8  68.1
Single-Scale  UPSNet              46.6  76.5  60.9  47.6  78.9  61.1  46.0  75.3  60.8  28.2  68.4
Single-Scale  Seamless            47.7  77.2  61.2  48.9  79.5  61.5  47.1  76.1  61.1  30.1  69.6
Single-Scale  EfficientPS (ours)  50.1  78.4  62.0  50.7  80.6  61.6  49.8  77.1  62.2  31.6  71.3
Multi-Scale   Panoptic FPN        46.7  77.0  61.0  47.3  78.9  61.1  46.4  76.1  61.0  28.9  70.1
Multi-Scale   UPSNet              47.1  77.9  60.9  47.6  79.8  61.2  46.8  76.9  60.8  29.2  70.6
Multi-Scale   Seamless            48.5  78.2  61.9  49.5  80.4  62.2  47.9  77.1  61.7  31.4  71.3
Multi-Scale   EfficientPS (ours)  51.1  78.8  63.5  52.6  81.2  65.4  50.3  77.5  62.5  32.9  72.1

Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQSt score, primarily due to the output stride of 16 at which it operates, which increases the computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQSt of Panoptic-DeepLab. On the other hand, top-down approaches tend to have better instance segmentation ability as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4.0% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQSt score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on the various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '-' refers to the standard configuration as in Kirillov et al (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Model  Encoder          2-way FPN  SIH  SH  PFM  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
M1     ResNet-50        -          -    -   -    58.2  79.1  72.0  52.4  78.8  67.2  62.4  79.4  75.6  31.6  74.3
M2     ResNet-50        -          -    -   X    58.8  79.5  72.6  53.4  79.2  62.8  67.9  79.7  76.1  33.8  75.1
M3     ResNet-50        -          X    -   X    58.6  79.4  72.4  53.1  79.1  67.5  62.6  79.6  75.9  33.7  75.0
M4     EfficientNet-B5  -          X    -   X    59.7  79.9  73.3  54.7  76.6  68.1  63.3  80.3  79.5  34.1  76.3
M5     EfficientNet-B5  X          X    -   X    61.5  80.7  75.6  57.2  80.6  72.5  64.6  80.9  77.9  36.8  77.3
M6     EfficientNet-B5  X          X    X   X    63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of the various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and its comparison with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the same network configuration and panoptic fusion heuristics as Kirillov et al (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network is comprised of an upsampling stage which has a 3×3 convolution, group norm (Wu and He 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are at 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.
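For concreteness, the following is a minimal PyTorch sketch of this baseline semantic head, assuming 256-channel FPN features and 19 Cityscapes classes; the class name and arguments are illustrative and not taken from any official implementation.

import torch.nn as nn
import torch.nn.functional as F

class BaselineSemanticHead(nn.Module):
    # Sketch of the Panoptic FPN-style semantic head used in model M1: every
    # FPN level is processed by repeated [3x3 conv, GroupNorm, ReLU, x2 bilinear
    # upsampling] stages until it reaches 1/4 of the input resolution, the levels
    # are summed element-wise, and a 1x1 conv with x4 bilinear upsampling and
    # softmax yields the per-pixel class scores.
    def __init__(self, in_channels=256, channels=128, num_classes=19):
        super().__init__()
        self.stages = nn.ModuleList()
        # P4, P8, P16, P32 require 0, 1, 2 and 3 upsampling steps respectively
        for num_ups in range(4):
            layers, ch = [], in_channels
            for _ in range(max(num_ups, 1)):
                layers += [nn.Conv2d(ch, channels, 3, padding=1, bias=False),
                           nn.GroupNorm(32, channels),
                           nn.ReLU(inplace=True)]
                if num_ups > 0:
                    layers.append(nn.Upsample(scale_factor=2, mode='bilinear',
                                              align_corners=False))
                ch = channels
            self.stages.append(nn.Sequential(*layers))
        self.classifier = nn.Conv2d(channels, num_classes, 1)

    def forward(self, p4, p8, p16, p32):
        feats = [stage(p) for stage, p in zip(self.stages, (p4, p8, p16, p32))]
        fused = feats[0] + feats[1] + feats[2] + feats[3]   # all at 1/4 scale
        logits = F.interpolate(self.classifier(fused), scale_factor=4,
                               mode='bilinear', align_corners=False)
        return logits.softmax(dim=1)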

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M with a drop of 0.2% in PQ and a 0.1% drop in AP and the mIoU score. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.
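To illustrate where this parameter saving comes from, the sketch below compares a standard 3×3 convolution with a depthwise separable replacement for a single hypothetical 256-to-256 channel layer; the numbers are for this illustrative layer only, not the exact savings obtained in the instance head.

import torch.nn as nn

def standard_conv(in_ch, out_ch):
    # Standard 3x3 convolution: in_ch * out_ch * 9 weights
    return nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)

def separable_conv(in_ch, out_ch):
    # Depthwise 3x3 (in_ch * 9 weights) followed by a pointwise 1x1
    # (in_ch * out_ch weights), a common drop-in replacement
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False))

def num_params(module):
    return sum(p.numel() for p in module.parameters())

# For a 256->256 layer: 589,824 vs 67,840 parameters (roughly 8.7x fewer)
print(num_params(standard_conv(256, 256)), num_params(separable_conv(256, 256)))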

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively and enables it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, the gain cannot be solely attributed to the semantic head. This is due to the fact that if we employ the standard panoptic fusion heuristics, an improvement in semantic segmentation would only contribute to an increase in the PQSt score.


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Encoder                      Params (M)  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
MobileNetV3                  5.40        55.8  78.1  70.2  50.4  77.4  67.1  59.8  78.6  72.4  29.1  72.2
ResNet-50                    25.60       60.3  80.1  72.6  55.3  79.9  68.9  63.9  80.3  75.3  34.9  76.1
ResNet-101                   44.50       61.1  80.3  75.1  56.5  80.1  71.9  64.2  80.5  77.4  35.9  77.2
Xception-71                  27.50       62.1  81.1  75.4  58.5  80.9  72.3  64.7  81.2  77.7  36.2  78.1
ResNeXt-101                  86.74       63.2  81.2  76.0  59.6  80.4  72.9  65.8  81.7  78.3  36.9  78.9
Mod. EfficientNet-B5 (Ours)  30.00       63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

Architecture      PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
Bottom-Up FPN     60.4  80.6  73.7  56.3  80.4  69.9  63.4  80.8  76.4  35.2  75.3
Top-Down FPN      62.2  80.9  75.7  58.1  80.1  72.4  65.1  81.4  78.0  36.5  78.2
PANet FPN         63.1  81.1  75.5  59.4  80.3  72.3  65.8  81.6  77.8  37.1  78.8
2-way FPN (Ours)  63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

However, our proposed adaptive panoptic fusion yields an improvement in PQTh as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al 2019), ResNet-50 (He et al 2016), ResNet-101 (He et al 2016), Xception-71 (Chollet 2017), ResNeXt-101 (Xie et al 2017) and EfficientNet-B5 (Tan and Le 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have comparable or fewer parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al 2017), the bottom-up FPN and the PANet FPN variants. We refer to the FPN architecture described in Liu et al (2018) as the PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. These models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN, respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to the standard FPN, in a sequential or parallel manner respectively.


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the os pyramid scale level. c^k_f refers to a convolution layer with f filters and a k×k kernel size. LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Model  P32                 P16                 P8                  P4                  Feature Correlation  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
M61    [c^3_128, c^3_128]  [c^3_128, c^3_128]  [c^3_128, c^3_128]  [c^3_128, c^3_128]  -                    61.6  80.6  75.7  57.3  80.4  72.6  64.7  80.7  77.9  36.7  77.2
M62    LSFE                LSFE                LSFE                LSFE                -                    61.7  80.5  75.4  57.9  80.5  72.6  64.5  80.6  77.4  36.6  77.4
M63    DPC                 LSFE                LSFE                LSFE                -                    62.3  80.9  75.9  57.9  80.4  72.0  65.6  81.3  78.8  36.8  78.1
M64    DPC                 DPC                 LSFE                LSFE                -                    62.9  81.0  75.7  59.0  80.5  71.4  65.8  81.4  78.7  37.0  78.6
M65    DPC                 DPC                 DPC                 LSFE                -                    62.4  80.8  75.4  58.8  80.3  72.4  65.1  81.1  77.6  36.7  78.2
M66    DPC                 DPC                 LSFE                LSFE                X                    63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.
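As a rough illustration of this parallel design (not the exact module defined in Section 3), the following hedged sketch runs a top-down and a bottom-up aggregation path side by side and sums their per-level outputs; the channel counts, 1×1 lateral convolutions, nearest-neighbor upsampling and max-pool downsampling are placeholder choices, and the normalization (iABN sync) and output convolutions of the actual backbone are omitted.

import torch.nn as nn
import torch.nn.functional as F

class TwoWayFPNSketch(nn.Module):
    # Simplified 2-way FPN idea: two parallel aggregation paths over the
    # encoder features C4..C32, fused by element-wise addition per level.
    def __init__(self, enc_channels=(40, 64, 176, 512), channels=256):
        super().__init__()
        self.lat_td = nn.ModuleList([nn.Conv2d(c, channels, 1) for c in enc_channels])
        self.lat_bu = nn.ModuleList([nn.Conv2d(c, channels, 1) for c in enc_channels])

    def forward(self, feats):                      # feats: [C4, C8, C16, C32]
        # Top-down path: start coarse, repeatedly upsample and add
        td = [self.lat_td[-1](feats[-1])]
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(td[0], size=feats[i].shape[-2:], mode='nearest')
            td.insert(0, self.lat_td[i](feats[i]) + up)
        # Bottom-up path: start fine, repeatedly downsample and add
        bu = [self.lat_bu[0](feats[0])]
        for i in range(1, len(feats)):
            down = F.max_pool2d(bu[-1], kernel_size=2, stride=2)
            bu.append(self.lat_bu[i](feats[i]) + down)
        # Parallel fusion: sum the two paths at each pyramid level (P4..P32)
        return [t + b for t, b in zip(td, bu)]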

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output at 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output at the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model; therefore, we employ separable convolutions in all the experiments that follow.

In the M63 model, we replace the LSFE module in the P32 level of the 2-way FPN with dense prediction cells (DPC) described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module in the P16 level with DPC, and in the subsequent M65 model, we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, resulting in better object boundary refinement.
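To make the LSFE configuration concrete, a minimal hedged PyTorch sketch follows; the channel sizes, the substitution of plain BatchNorm2d for iABN sync and the helper names are assumptions for illustration only.

import torch.nn as nn

def separable_conv3x3(in_ch, out_ch):
    # 3x3 depthwise convolution followed by a 1x1 pointwise convolution
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False))

class LSFE(nn.Module):
    # Sketch of the Large Scale Feature Extractor: two cascaded 3x3 separable
    # convolutions, each followed by normalization and leaky ReLU (the paper
    # uses iABN sync; BatchNorm2d is substituted here for simplicity).
    def __init__(self, in_ch=256, out_ch=128):
        super().__init__()
        self.block = nn.Sequential(
            separable_conv3x3(in_ch, out_ch), nn.BatchNorm2d(out_ch), nn.LeakyReLU(inplace=True),
            separable_conv3x3(out_ch, out_ch), nn.BatchNorm2d(out_ch), nn.LeakyReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

# In the best configurations (M64/M66), the per-level assignment is
# P32 -> DPC, P16 -> DPC, P8 -> LSFE, P4 -> LSFE, with the mismatch correction
# (MC) module applied along the feature correlation connections in M66.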

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al (2019), which we denote as the baseline, UPSNet (Xiong et al 2019) and Seamless (Porzi et al 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

Semantic Head  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
Baseline       61.5  80.7  75.6  57.2  80.6  72.5  64.6  80.9  77.9  36.8  77.3
UPSNet         62.0  81.0  74.7  58.5  80.5  70.9  64.5  81.3  77.5  35.9  76.1
Seamless       62.9  81.1  75.5  58.9  80.4  71.3  65.6  81.6  78.5  36.8  78.5
Ours           63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in %.

Model     PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt
Baseline  62.4  80.8  75.4  58.7  80.4  72.6  65.1  81.1  77.4
TASCNet   62.5  80.9  75.6  58.6  80.5  72.8  65.3  81.2  77.7
UPSNet    63.1  81.3  76.1  59.5  80.6  73.2  65.7  81.8  78.2
Ours      63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at the different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al (2019), which we consider as a baseline as it is extensively used in several panoptic segmentation networks. We then compare with the Mask-Guided fusion (Li et al 2018a) and the panoptic fusion heuristics proposed in (Xiong et al 2019), which we refer to as TASCNet and UPSNet in the results, respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and the instance head, which have an inherent overlap, is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them and therefore relies on the mode computed over the segmentation outputs to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.
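As a toy sketch of this kind of confidence-based logit fusion (the tensor names are illustrative and the exact operations of our panoptic fusion module are defined in Section 3), the snippet below modulates an instance mask logit map and the semantic logit map of its predicted class by the confidences of both heads before combining them, so that the head that is more confident at a pixel contributes more to the fused score.

import torch

def adaptive_fuse(instance_mask_logits, semantic_class_logits):
    # Sketch only: weight the sum of the two logit maps by the combined
    # sigmoid confidences of both heads; see the paper for the exact
    # formulation used in our panoptic fusion module.
    ins, sem = instance_mask_logits, semantic_class_logits
    return (torch.sigmoid(ins) + torch.sigmoid(sem)) * (ins + sem)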

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model Panoptic-DeepLab does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding datasets.


[Figure 6 image panels: rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD; columns: input image, Seamless output, EfficientPS output, improvement/error map.]

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al 2019) on different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


For each example, we show the input image and the corresponding panoptic segmentation outputs from the Seamless model and our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN, which effectively aggregates multi-scale features to learn semantically richer representations, and the panoptic fusion module, which addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c) the group of people towards the left side of the image, who are behind the fence, are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately address this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the fact that there is a lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divides it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by the cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios, consistently across the different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head, which correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearances of these two objects; however, our proposed EfficientPS model, with our novel panoptic backbone, has an extensive representational capacity which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of traffic participants.


[Figure 7 image panels (i)-(iv) per row: rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD.]

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances at multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk, as well as seasonal changes.


These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from uncommon viewpoints compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analysis and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335
Badrinarayanan V, Kendall A, Cipolla R (2015) Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229
Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons
Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision, Springer, pp 44–57
Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848
Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems, pp 8713–8724
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-deeplab: A simple, strong and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, pp 534–549
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the International Conference on Computer Vision, pp 764–773
Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) Ssap: Single-shot instance segmentation with affinity pyramid. In: Proceedings of the International Conference on Computer Vision, pp 642–651
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The kitti dataset. International Journal of Robotics Research
de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110
Girshick R (2015) Fast r-cnn. In: Proceedings of the International Conference on Computer Vision, pp 1440–1448
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256
Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, pp 297–312
Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask r-cnn. In: Proceedings of the International Conference on Computer Vision, pp 2961–2969
He X, Gould S (2014a) An exemplar-based crf for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
He X, Gould S (2014b) An exemplar-based crf for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for mobilenetv3. In: Proceedings of the International Conference on Computer Vision, pp 1314–1324
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059
Kang BR, Kim HY (2018) Bshapenet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327
Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413
Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In: Proceedings of the International Conference on Computer Vision, pp 2190–2197
Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In: Advances in Neural Information Processing Systems, pp 109–117
Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192
Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118
Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229
Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: European Conference on Computer Vision, Springer, pp 740–755
Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181
Liu S, Jia J, Fidler S, Urtasun R (2017) Sgn: Sequential grouping networks for instance segmentation. In: Proceedings of the International Conference on Computer Vision, pp 3496–3504
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768
Liu W, Rabinovich A, Berg AC (2015) Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The mapillary vistas dataset for semantic understanding of street scenes. In: Proceedings of the International Conference on Computer Vision, pp 4990–4999
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035
Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998
Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824
Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286
Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887
Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664
Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329
Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238
Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of dnns. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sofiiuk K, Barinova O, Konushin A (2019) Adaptis: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363
Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference
Tan M, Le QV (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135
Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision 63(2):113–140
Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25
Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots
Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop: Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics
Valada A, Vertens J, Dhall A, Burgard W (2017) Adapnet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651
Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)
Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, special issue: Deep Learning for Robotic Vision
Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751
Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) Upsnet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826
Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349
Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) Deeperlab: Single-shot image parser. arXiv preprint arXiv:1902.05093
Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721
Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected mrfs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890
Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227

Page 14: arXiv:2004.02307v2 [cs.CV] 19 May 2020 · 16 48x 512x 1024 1 Input 3x 1024x 2048 P 32 128x 64x128 32x64 DPC 1 256x 256x512 256x 64x128 128x256 24x 512x 1024 40x 256x 512 64x 128x

14 Rohit Mohan Abhinav Valada

Table 1 Performance comparison of panoptic segmentation on the Cityscapes validation set Superscripts St and Th refer to lsquostuffrsquo and lsquothingrsquoclasses respectively minus denotes that the metric has not been reported for the corresponding method

Mode Network Pre-training PQ SQ RQ PQTh SQTh RQTh PQSt SQSt RQSt AP mIoU() () () () () () () () () () ()

Sing

le-S

cale

WeaklySupervised 473 minus minus 396 minus minus 529 minus minus 243 716TASCNet 559 minus minus 505 minus minus 598 minus minus minus minusPanoptic FPN 581 minus minus 520 minus minus 625 minus minus 330 757AUNet 590 minus minus 548 minus minus 621 minus minus 344 756UPSNet 593 797 730 546 793 687 627 801 762 333 752DeeperLab 563 minus minus minus minus minus minus minus minus minus minusSeamless 603 minus minus 561 minus minus 633 minus minus 336 775SSAP 611 minus minus 550 minus minus minus minus minus minus minusAdaptIS 620 minus minus 587 minus minus 644 minus minus 363 792Panoptic-DeepLab 630 minus minus minus minus minus minus minus minus 353 805

EfficientPS (ours) 639 815 771 607 812 741 662 818 792 383 793

TASCNet COCO 593 minus minus 56 minus minus 615 minus minus 376 781UPSNet COCO 605 809 735 570 minus minus 630 minus minus 378 778Seamless Vistas 650 minus minus 607 minus minus 680 minus minus minus 807Panoptic-Deeplab Vistas 653 minus minus minus minus minus minus minus minus 388 825

EfficientPS (ours) Vistas 661 825 789 627 819 752 685 829 816 419 810

Mul

ti-Sc

ale

Panoptic-DeepLab 641 minus minus minus minus minus minus minus minus 385 815

EfficientPS (ours) 651 822 790 615 814 754 677 828 817 397 803

TASCNet COCO 604 minus minus 561 minus minus 633 minus minus 391 787M-RCNN + PSPNet COCO 612 809 744 540 minus minus 664 minus minus 364 809UPSNet COCO 618 813 748 576 777 705 648 814 390 792Panoptic-Deeplab Vistas 670 minus minus minus minus minus minus minus minus 425 831

EfficientPS (ours) Vistas 675 832 802 635 822 772 704 839 824 438 821

Table 2 Comparison of panoptic segmentation benchmarking results on the Cityscapes test set Superscripts St and Th refer to lsquostuffrsquo and lsquothingrsquoclasses respectively

Network Pre-training PQ SQ RQ PQTh PQSt

() () () () ()

SSAP 589 824 706 484 665TASCNet COCO 607 810 738 534 660Panoptic-Deeplab 623 824 748 521 697Seamless Vistas 626 821 753 560 675Panoptic-Deeplab Vistas 655 778 752 565 720

EfficientPS (ours) 641 826 768 567 694EfficientPS (ours) Vistas 664 835 788 593 715

Table 3 Comparison of model efficiency with both state-of-the-art top-down and bottom-up panoptic segmentation architectures

Network Input Size Params FLOPs Time(pixels) (M) (B) (ms)

DeeperLab 1025times2049 minus minus 463UPSNet 1024times2048 4505 48702 202Seamless 1024times2048 5143 51400 168Panoptic-Deeplab 1025times2049 4673 54749 175

EfficientPS (ours) 1024times2048 4089 43394 166

on the test set On each of the datasets we report both thesingle-scale and multi-scale evaluation results Following

standard practise we perform horizontal flipping and scaling(scales of 075 1 125 15 175 2) during the multi-scaleevaluations

We compare the performance of our proposed EfficientPSagainst state-of-the-art models on the Cityscapes dataset in-cluding WeaklySupervised (Li et al 2018b) TASCNet (Liet al 2018a) Panoptic FPN (Kirillov et al 2019) AUNet (Liet al 2019b) UPSNet (Xiong et al 2019) DeeperLab (Yanget al 2019) Seamless (Porzi et al 2019) SSAP (Gao et al2019) AdaptIS (Sofiiuk et al 2019) and Panoptic-DeepLab(Cheng et al 2019) Table 1 shows the results on the City-scapes validation set For a fair comparison we categorizemodels in the table separately according to those that report

EfficientPS Efficient Panoptic Segmentation 15

Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. − denotes that the metric has not been reported for the corresponding method. All metric values are in %.

Mode | Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Single-Scale | JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | −
Single-Scale | DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3
Single-Scale | TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | −
Single-Scale | AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | −
Single-Scale | Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4
Single-Scale | Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3
Single-Scale | EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6
Multi-Scale | TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | −
Multi-Scale | Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8
Multi-Scale | EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1

single-scale and multi-scale evaluation, as well as without any pre-training and with pre-training on other datasets, namely Mapillary Vistas (Neuhold et al., 2017) denoted as Vistas and Microsoft COCO (Lin et al., 2014) abbreviated as COCO. We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations, depth data, or exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQSt, PQTh, SQ and RQ metrics and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS, without pre-training on any extra data, achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also

uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime, including the time consumed for fusing the semantic and instance predictions that yield the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.
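The following is a minimal sketch of how such an end-to-end timing and parameter count can be obtained in PyTorch. It assumes a CUDA device and a model whose forward pass already includes the panoptic fusion step; it illustrates the measurement protocol rather than reproducing the exact benchmarking code used for Table 3.

```python
import time
import torch

def count_parameters_millions(model):
    """Number of trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def average_runtime_ms(model, input_size=(1, 3, 1024, 2048), warmup=10, iters=50):
    """Average end-to-end forward time in milliseconds on a CUDA device.

    Assumes model.forward also performs the fusion of semantic and instance
    predictions, so the measured time covers the full panoptic pipeline."""
    model = model.cuda().eval()
    image = torch.randn(*input_size, device='cuda')
    for _ in range(warmup):          # warm up kernels and allocator
        model(image)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(image)
    torch.cuda.synchronize()         # wait for all GPU work before stopping the clock
    return (time.perf_counter() - start) / iters * 1000.0
```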

In Table 4 we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Mode | Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Single-Scale | Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1
Single-Scale | UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6
Single-Scale | Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8
Single-Scale | EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3
Multi-Scale | Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8
Multi-Scale | UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2
Multi-Scale | Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1
Multi-Scale | EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Mode | Network | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Single-Scale | Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1
Single-Scale | UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4
Single-Scale | Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6
Single-Scale | EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3
Multi-Scale | Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1
Multi-Scale | UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6
Multi-Scale | Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3
Multi-Scale | EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1

the-art Panoptic-DeepLab. Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQSt score, primarily due to the output stride of 16 at which it operates, which increases its computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQSt of Panoptic-DeepLab, while on the other hand, top-down approaches tend to have a better instance segmentation ability as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4.0% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect the object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQSt score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '-' refers to the standard configuration as in Kirillov et al. (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Model | Encoder | 2-way FPN | SIH | SH | PFM | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
M1 | ResNet-50 | - | - | - | - | 58.2 | 79.1 | 72.0 | 52.4 | 78.8 | 67.2 | 62.4 | 79.4 | 75.6 | 31.6 | 74.3
M2 | ResNet-50 | - | - | - | X | 58.8 | 79.5 | 72.6 | 53.4 | 79.2 | 67.9 | 62.8 | 79.7 | 76.1 | 33.8 | 75.1
M3 | ResNet-50 | - | X | - | X | 58.6 | 79.4 | 72.4 | 53.1 | 79.1 | 67.5 | 62.6 | 79.6 | 75.9 | 33.7 | 75.0
M4 | EfficientNet-B5 | - | X | - | X | 59.7 | 79.9 | 73.3 | 54.7 | 76.6 | 68.1 | 63.3 | 80.3 | 79.5 | 34.1 | 76.3
M5 | EfficientNet-B5 | X | X | - | X | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
M6 | EfficientNet-B5 | X | X | X | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and compare it with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.
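For reference, the panoptic quality (PQ) metric of Kirillov et al. (2019) reported throughout the following tables matches predicted and groundtruth segments of each class (a match requires an IoU greater than 0.5), yielding the sets TP, FP and FN of true positives, false positives and false negatives, and decomposes into the segmentation quality (SQ) and recognition quality (RQ) terms that we also report:

$$\mathrm{PQ} = \frac{\sum_{(p,g)\in TP} \mathrm{IoU}(p,g)}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|} = \underbrace{\frac{\sum_{(p,g)\in TP} \mathrm{IoU}(p,g)}{|TP|}}_{\mathrm{SQ}} \times \underbrace{\frac{|TP|}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|}}_{\mathrm{RQ}}$$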

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the network configuration and panoptic fusion heuristics of Kirillov et al. (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network comprises an upsampling stage which has a 3×3 convolution, group norm (Wu and He, 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation

output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.
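A minimal sketch of this baseline semantic head is given below, assuming 256-channel FPN inputs and 19 Cityscapes classes; the channel widths and module granularity are illustrative assumptions and the block is not the exact Panoptic FPN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BaselineSemanticHead(nn.Module):
    """Sketch of the Kirillov et al. (2019)-style semantic head used in model M1.

    Each FPN level goes through repeated (3x3 conv, GroupNorm, ReLU, x2 bilinear
    upsampling) stages until it reaches 1/4 of the input resolution; the per-level
    maps are summed, projected with a 1x1 convolution, upsampled x4 and softmaxed."""

    def __init__(self, in_ch=256, mid_ch=128, num_classes=19):
        super().__init__()

        def stage(ic, upsample=True):
            layers = [nn.Conv2d(ic, mid_ch, 3, padding=1, bias=False),
                      nn.GroupNorm(32, mid_ch),
                      nn.ReLU(inplace=True)]
            if upsample:
                layers.append(nn.Upsample(scale_factor=2, mode='bilinear',
                                          align_corners=False))
            return nn.Sequential(*layers)

        self.p32 = nn.Sequential(stage(in_ch), stage(mid_ch), stage(mid_ch))  # 1/32 -> 1/4
        self.p16 = nn.Sequential(stage(in_ch), stage(mid_ch))                 # 1/16 -> 1/4
        self.p8 = stage(in_ch)                                                # 1/8  -> 1/4
        self.p4 = stage(in_ch, upsample=False)                                # already 1/4
        self.classifier = nn.Conv2d(mid_ch, num_classes, 1)

    def forward(self, p32, p16, p8, p4):
        fused = self.p32(p32) + self.p16(p16) + self.p8(p8) + self.p4(p4)
        logits = F.interpolate(self.classifier(fused), scale_factor=4,
                               mode='bilinear', align_corners=False)
        return torch.softmax(logits, dim=1)
```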

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score, without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M with a drop of 0.2% in PQ and a 0.1% drop in AP and the mIoU score. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively, enabling it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, the gain cannot be solely attributed to the semantic head. This is due to the fact that if we employ standard panoptic fusion heuristics, an improvement in semantic seg-


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Encoder | Params (M) | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2
ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1
ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2
Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1
ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9
Mod. EfficientNet-B5 (ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Architecture | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3
Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2
PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8
2-way FPN (ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

mentation would only contribute to an increase in the PQSt score. However, our proposed adaptive panoptic fusion yields an improvement in PQTh as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al., 2019), ResNet-50 (He et al., 2016), ResNet-101 (He et al., 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al., 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have comparable or fewer parameters

than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al., 2017), bottom-up FPN and PANet FPN variants. We refer to the FPN architecture described in Liu et al. (2018), in which the top-down path is followed by a bottom-up path, as PANet FPN. The results from comparing with various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. Both these models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the os pyramid scale level, ck_f refers to a convolution layer with f filters and a k×k kernel size, LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
M61 | [c3_128, c3_128] | [c3_128, c3_128] | [c3_128, c3_128] | [c3_128, c3_128] | - | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2
M62 | LSFE | LSFE | LSFE | LSFE | - | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4
M63 | DPC | LSFE | LSFE | LSFE | - | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1
M64 | DPC | DPC | LSFE | LSFE | - | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6
M65 | DPC | DPC | DPC | LSFE | - | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2
M66 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

the standard FPN, in a sequential or parallel manner respectively. We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.
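The following heavily simplified sketch illustrates the idea of parallel top-down and bottom-up pathways that are merged per level. The summation-based merge, the plain convolutions and the channel widths are illustrative assumptions for this sketch and do not reproduce the exact EfficientPS blocks, which use iABN sync and separable convolutions.

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoWayFPNSketch(nn.Module):
    """Simplified bidirectional FPN: a top-down branch and a bottom-up branch are
    computed in parallel from the same lateral features and merged at each level."""

    def __init__(self, in_channels=(64, 128, 256, 512), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.output = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):  # feats: encoder maps at 1/4, 1/8, 1/16, 1/32 scale
        lats = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down pathway: coarse to fine
        td = [lats[-1]]
        for lat in reversed(lats[:-1]):
            td.append(lat + F.interpolate(td[-1], size=lat.shape[-2:], mode='nearest'))
        td = td[::-1]
        # parallel bottom-up pathway: fine to coarse
        bu = [lats[0]]
        for lat in lats[1:]:
            bu.append(lat + F.interpolate(bu[-1], size=lat.shape[-2:], mode='nearest'))
        # merge the two pathways at each pyramid level
        return [conv(t + b) for conv, t, b in zip(self.output, td, bu)]
```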

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output which is 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output which is the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model; therefore, we employ separable convolutions in all the experiments that follow.
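A minimal sketch of the LSFE module with separable convolutions, as used in models M62-M66, is shown below; plain BatchNorm2d stands in for the iABN sync used in the paper, and the 128 output channels follow the c3_128 notation of Table 10.

```python
import torch.nn as nn

class SeparableConv3x3(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class LSFE(nn.Module):
    """Large Scale Feature Extractor sketch: two cascaded 3x3 separable convolutions,
    each followed by a normalization layer and a leaky ReLU activation."""
    def __init__(self, in_ch=256, out_ch=128):
        super().__init__()
        self.block = nn.Sequential(
            SeparableConv3x3(in_ch, out_ch), nn.BatchNorm2d(out_ch), nn.LeakyReLU(inplace=True),
            SeparableConv3x3(out_ch, out_ch), nn.BatchNorm2d(out_ch), nn.LeakyReLU(inplace=True))

    def forward(self, x):
        return self.block(x)
```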

In the M63 model, we replace the LSFE module in the P32 level of the 2-way FPN with dense prediction cells (DPC) described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module in the P16 level with DPC, and in the subsequent M65 model, we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, which results in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al. (2019), which we denote as the baseline, UPSNet (Xiong et al., 2019) and Seamless (Porzi et al., 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments, while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Semantic Head | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt | AP | mIoU
Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1
Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in %.

Model | PQ | SQ | RQ | PQTh | SQTh | RQTh | PQSt | SQSt | RQSt
Baseline | 62.4 | 80.8 | 75.4 | 58.7 | 80.4 | 72.6 | 65.1 | 81.1 | 77.4
TASCNet | 62.5 | 80.9 | 75.6 | 58.6 | 80.5 | 72.8 | 65.3 | 81.2 | 77.7
UPSNet | 63.1 | 81.3 | 76.1 | 59.5 | 80.6 | 73.2 | 65.7 | 81.8 | 78.2
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2

The semantic head of UPSNet, which is essentially a sub-network comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as a baseline as it is extensively used in several panoptic segmentation networks. We then compare with Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in Xiong et al. (2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and instance head that have an inherent overlap is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them, and therefore it relies on the mode computed over the segmentation output to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.
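To make the baseline heuristic concrete, the following is a minimal sketch of such an "instance wins" fusion. It assumes instance masks already sorted by descending confidence and uses an illustrative class_id * 1000 + instance_id panoptic encoding; it is not the adaptive fusion used in EfficientPS.

```python
import numpy as np

def simple_panoptic_fusion(semantic_pred, instance_masks, instance_classes, label_divisor=1000):
    """Sketch of the simple 'instance wins' heuristic described above.

    semantic_pred: [H, W] array of per-pixel class ids from the semantic head.
    instance_masks: list of boolean [H, W] masks, assumed sorted by confidence.
    instance_classes: predicted class id for each mask."""
    panoptic = semantic_pred.astype(np.int64) * label_divisor   # 'stuff' pixels, instance id 0
    occupied = np.zeros(semantic_pred.shape, dtype=bool)
    for inst_id, (mask, cls) in enumerate(zip(instance_masks, instance_classes), start=1):
        free = mask & ~occupied              # keep earlier (more confident) instances intact
        panoptic[free] = cls * label_divisor + inst_id
        occupied |= mask
    return panoptic
```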

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model, Panoptic-DeepLab, does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding


[Figure 6 spans this page: rows (a)-(b) show Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI and (g)-(h) IDD examples; the columns show the input image, the Seamless output, the EfficientPS output and the improvement/error map.]

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


datasets. For each example, we show the input image and the corresponding panoptic segmentation outputs from the Seamless model and our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN that effectively aggregates multi-scale features to learn semantically richer representations, and to the panoptic fusion module that addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c), the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately resolve this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless

model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across the different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head that correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearances of these two objects; however, our proposed EfficientPS model with our novel panoptic backbone has an extensive representational capacity, which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of


[Figure 7 spans this page: rows (a)-(b) show Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI and (g)-(h) IDD examples, with four panels (i)-(iv) per row.]

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances in multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk, as well as seasonal changes.


traffic participants. These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from viewpoints that are uncommon compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even when the network was only trained on this relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic

fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analyses and visualizations that highlight the improvements that we make to various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P Pont-Tuset J Barron JT Marques F Malik J (2014)Multiscale combinatorial grouping In Proceedings of the Confer-ence on Computer Vision and Pattern Recognition pp 328ndash335

Badrinarayanan V Kendall A Cipolla R (2015) Segnet A deep convo-lutional encoder-decoder architecture for image segmentation arXivpreprint arXiv 151100561

Bai M Urtasun R (2017) Deep watershed transform for instance seg-mentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 5221ndash5229

Bremner JG Slater A (2008) Theories of infant development JohnWiley amp Sons

Brostow GJ Shotton J Fauqueur J Cipolla R (2008) Segmentation andrecognition using structure from motion point clouds In Europeanconference on computer vision Springer pp 44ndash57

Bulo SR Neuhold G Kontschieder P (2017) Loss max-pooling forsemantic image segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 7082ndash7091

Chen LC Papandreou G Kokkinos I Murphy K Yuille AL (2017a)Deeplab Semantic image segmentation with deep convolutional nets


atrous convolution and fully connected crfs IEEE transactions onpattern analysis and machine intelligence 40(4)834ndash848

Chen LC Papandreou G Schroff F Adam H (2017b) Rethinking at-rous convolution for semantic image segmentation arXiv preprintarXiv170605587

Chen LC Collins M Zhu Y Papandreou G Zoph B Schroff F AdamH Shlens J (2018a) Searching for efficient multi-scale architecturesfor dense image prediction In Advances in Neural InformationProcessing Systems pp 8713ndash8724

Chen LC Zhu Y Papandreou G Schroff F Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image seg-mentation arXiv preprint arXiv180202611

Cheng B Collins MD Zhu Y Liu T Huang TS Adam H Chen LC(2019) Panoptic-deeplab A simple strong and fast baseline forbottom-up panoptic segmentation arXiv preprint arXiv191110194

Chollet F (2017) Xception Deep learning with depthwise separableconvolutions In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 1251ndash1258

Cordts M Omran M Ramos S Rehfeld T Enzweiler M Benenson RFranke U Roth S Schiele B (2016) The cityscapes dataset for se-mantic urban scene understanding In Proceedings of the Conferenceon Computer Vision and Pattern Recognition pp 3213ndash3223

Dai J He K Li Y Ren S Sun J (2016) Instance-sensitive fully convolu-tional networks In European Conference on Computer Vision pp534ndash549

Dai J Qi H Xiong Y Li Y Zhang G Hu H Wei Y (2017) Deform-able convolutional networks In Proceedings of the InternationalConference on Computer Vision pp 764ndash773

Gao N Shan Y Wang Y Zhao X Yu Y Yang M Huang K (2019)Ssap Single-shot instance segmentation with affinity pyramid InProceedings of the International Conference on Computer Vision pp642ndash651

Geiger A Lenz P Stiller C Urtasun R (2013) Vision meets roboticsThe kitti dataset International Journal of Robotics Research

de Geus D Meletis P Dubbelman G (2018) Panoptic segmentation witha joint semantic and instance segmentation network arXiv preprintarXiv180902110

Girshick R (2015) Fast r-cnn In Proceedings of the International Con-ference on Computer Vision pp 1440ndash1448

Glorot X Bengio Y (2010) Understanding the difficulty of trainingdeep feedforward neural networks In Proceedings of the thirteenthinternational conference on artificial intelligence and statistics pp249ndash256

Hariharan B Arbelaez P Girshick R Malik J (2014) Simultaneousdetection and segmentation In European Conference on ComputerVision pp 297ndash312

Hariharan B Arbelaez P Girshick R Malik J (2015) Hypercolumns forobject segmentation and fine-grained localization In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp447ndash456

He K Zhang X Ren S Sun J (2016) Deep residual learning for imagerecognition In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 770ndash778

He K Gkioxari G Dollar P Girshick R (2017) Mask r-cnn In Pro-ceedings of the International Conference on Computer Vision pp2961ndash2969

He X Gould S (2014a) An exemplar-based crf for multi-instance objectsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition

He X Gould S (2014b) An exemplar-based crf for multi-instance objectsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 296ndash303

Howard A Sandler M Chu G Chen LC Chen B Tan M Wang W ZhuY Pang R Vasudevan V et al (2019) Searching for mobilenetv3 InProceedings of the International Conference on Computer Vision pp1314ndash1324

Hu J Shen L Sun G (2018) Squeeze-and-excitation networks In Pro-ceedings of the Conference on Computer Vision and Pattern Recog-nition pp 7132ndash7141

Ioffe S Szegedy C (2015) Batch normalization Accelerating deepnetwork training by reducing internal covariate shift arXiv preprintarXiv150203167

Kaiser L Gomez AN Chollet F (2017) Depthwise separable convolu-tions for neural machine translation arXiv preprint arXiv170603059

Kang BR Kim HY (2018) Bshapenet Object detection and in-stance segmentation with bounding shape masks arXiv preprintarXiv181010327

Kirillov A He K Girshick R Rother C Dollar P (2019) Panopticsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 9404ndash9413

Kontschieder P Bulo SR Bischof H Pelillo M (2011) Structured class-labels in random forests for semantic image labelling In Proceed-ings of the International Conference on Computer Vision pp 2190ndash2197

Krahenbuhl P Koltun V (2011) Efficient inference in fully connectedcrfs with gaussian edge potentials In Advances in neural informa-tion processing systems pp 109ndash117

Li J Raventos A Bhargava A Tagawa T Gaidon A (2018a) Learningto fuse things and stuff arXiv preprint arXiv181201192

Li Q Arnab A Torr PH (2018b) Weakly-and semi-supervised panop-tic segmentation In Proceedings of the European Conference onComputer Vision (ECCV) pp 102ndash118

Li X Zhang L You A Yang M Yang K Tong Y (2019a) Globalaggregation then local distribution in fully convolutional networksarXiv preprint arXiv190907229

Li Y Qi H Dai J Ji X Wei Y (2017) Fully convolutional instance-aware semantic segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 2359ndash2367

Li Y Chen X Zhu Z Xie L Huang G Du D Wang X (2019b) Attention-guided unified network for panoptic segmentation In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp7026ndash7035

Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar PZitnick CL (2014) Microsoft coco Common objects in context InEuropean conference on computer vision Springer pp 740ndash755

Lin TY Dollar P Girshick R He K Hariharan B Belongie S (2017)Feature pyramid networks for object detection In Proceedings ofthe Conference on Computer Vision and Pattern Recognition pp2117ndash2125

Liu H Peng C Yu C Wang J Liu X Yu G Jiang W (2019) An end-to-end network for panoptic segmentation In Proceedings of theConference on Computer Vision and Pattern Recognition pp 6172ndash6181

Liu S Jia J Fidler S Urtasun R (2017) Sgn Sequential grouping net-works for instance segmentation In Proceedings of the InternationalConference on Computer Vision pp 3496ndash3504

Liu S Qi L Qin H Shi J Jia J (2018) Path aggregation network for in-stance segmentation In Proceedings of the Conference on ComputerVision and Pattern Recognition pp 8759ndash8768

Liu W Rabinovich A Berg AC (2015) Parsenet Looking wider to seebetter arXiv preprint arXiv 150604579

Long J Shelhamer E Darrell T (2015) Fully convolutional networksfor semantic segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition

Neuhold G Ollmann T Rota Bulo S Kontschieder P (2017) The mapil-lary vistas dataset for semantic understanding of street scenes InProceedings of the International Conference on Computer Vision pp4990ndash4999

Paszke A Gross S Massa F Lerer A Bradbury J Chanan G Killeen TLin Z Gimelshein N Antiga L et al (2019) Pytorch An imperativestyle high-performance deep learning library In Advances in NeuralInformation Processing Systems pp 8024ndash8035


Pinheiro PO Collobert R Dollar P (2015) Learning to segment objectcandidates In Advances in Neural Information Processing Systemspp 1990ndash1998

Plath N Toussaint M Nakajima S (2009) Multi-class image segment-ation using conditional random fields and global classification InProceedings of the International Conference on Machine Learningpp 817ndash824

Porzi L Bulo SR Colovic A Kontschieder P (2019) Seamless scenesegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 8277ndash8286

Radwan N Valada A Burgard W (2018) Multimodal interaction-awaremotion prediction for autonomous street crossing arXiv preprintarXiv180806887

Ren M Zemel RS (2017) End-to-end instance segmentation with re-current attention In Proceedings of the Conference on ComputerVision and Pattern Recognition pp 6656ndash6664

Romera-Paredes B Torr PHS (2016) Recurrent instance segmentationIn European conference on computer vision Springer pp 312ndash329

Ros G Ramos S Granados M Bakhtiary A Vazquez D Lopez AM(2015) Vision-based offline-online perception paradigm for autonom-ous driving In IEEE Winter Conference on Applications of Com-puter Vision pp 231ndash238

Rota Bulo S Porzi L Kontschieder P (2018) In-place activated batch-norm for memory-optimized training of dnns In Proceedings ofthe Conference on Computer Vision and Pattern Recognition pp5639ndash5647

Russakovsky O Deng J Su H Krause J Satheesh S Ma S Huang ZKarpathy A Khosla A Bernstein M et al (2015) Imagenet largescale visual recognition challenge International Journal of ComputerVision 115(3)211ndash252

Shotton J Johnson M Cipolla R (2008) Semantic texton forests forimage categorization and segmentation In Proceedings of the Con-ference on Computer Vision and Pattern Recognition

Silberman N Sontag D Fergus R (2014) Instance segmentation ofindoor scenes using a coverage loss In European Conference onComputer Vision pp 616ndash631

Simonyan K Zisserman A (2014) Very deep convolutional networksfor large-scale image recognition arXiv preprint arXiv14091556

Sofiiuk K Barinova O Konushin A (2019) Adaptis Adaptive instanceselection network In Proceedings of the International Conferenceon Computer Vision pp 7355ndash7363

Sturgess P Alahari K Ladicky L Torr PH (2009) Combining appear-ance and structure from motion features for road scene understandingIn British Machine Vision Conference

Tan M Le QV (2019) Efficientnet Rethinking model scaling for convo-lutional neural networks arXiv preprint arXiv190511946

Tian Z He T Shen C Yan Y (2019) Decoders matter for semanticsegmentation Data-dependent decoding enables flexible feature ag-gregation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 3126ndash3135

Tighe J Niethammer M Lazebnik S (2014) Scene parsing with objectinstances and occlusion ordering In Proceedings of the Conferenceon Computer Vision and Pattern Recognition pp 3748ndash3755

Tu Z Chen X Yuille AL Zhu SC (2005) Image parsing Unifyingsegmentation detection and recognition International Journal ofComputer Vision 63(2)113ndash140

Uhrig J Cordts M Franke U Brox T (2016) Pixel-level encodingand depth layering for instance-level semantic labeling In GermanConference on Pattern Recognition pp 14ndash25

Valada A Dhall A Burgard W (2016a) Convoluted mixture of deepexperts for robust semantic segmentation In IEEERSJ Internationalconference on intelligent robots and systems (IROS) workshop stateestimation and terrain perception for all terrain mobile robots

Valada A Oliveira G Brox T Burgard W (2016b) Towards robustsemantic segmentation using deep fusion In Robotics Science andsystems (RSS 2016) workshop are the sceptics right Limits andpotentials of deep learning in robotics

Valada A Vertens J Dhall A Burgard W (2017) Adapnet Adaptivesemantic segmentation in adverse environmental conditions In Pro-ceedings of the IEEE International Conference on Robotics andAutomation pp 4644ndash4651

Valada A Radwan N Burgard W (2018) Incorporating semantic andgeometric priors in deep pose regression In Workshop on Learningand Inference in Robotics Integrating Structure Priors and Modelsat Robotics Science and Systems (RSS)

Valada A Mohan R Burgard W (2019) Self-supervised model adapta-tion for multimodal semantic segmentation International Journal ofComputer Vision DOI 101007s11263-019-01188-y special IssueDeep Learning for Robotic Vision

Varma G Subramanian A Namboodiri A Chandraker M Jawahar C(2019) Idd A dataset for exploring problems of autonomous naviga-tion in unconstrained environments In IEEE Winter Conference onApplications of Computer Vision (WACV) pp 1743ndash1751

Wu Y He K (2018) Group normalization In Proceedings of theEuropean Conference on Computer Vision (ECCV) pp 3ndash19

Xie S Girshick R Dollar P Tu Z He K (2017) Aggregated residualtransformations for deep neural networks In Proceedings of theConference on Computer Vision and Pattern Recognition pp 1492ndash1500

Xiong Y Liao R Zhao H Hu R Bai M Yumer E Urtasun R (2019)Upsnet A unified panoptic segmentation network In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp8818ndash8826

Xu P Davoine F Bordes JB Zhao H Denœux T (2016) Multimodalinformation fusion for urban scene understanding Machine Visionand Applications 27(3)331ndash349

Yang TJ Collins MD Zhu Y Hwang JJ Liu T Zhang X Sze VPapandreou G Chen LC (2019) Deeperlab Single-shot image parserarXiv preprint arXiv190205093

Yao J Fidler S Urtasun R (2012) Describing the scene as a whole Jointobject detection scene classification and semantic segmentationIn Proceedings of the Conference on Computer Vision and PatternRecognition pp 702ndash709

Yu F Koltun V (2015) Multi-scale context aggregation by dilated con-volutions arXiv preprint arXiv151107122

Zhang C Wang L Yang R (2010) Semantic segmentation of urbanscenes using dense depth maps In European Conference on Com-puter Vision Springer pp 708ndash721

Zhang Z Fidler S Urtasun R (2016) Instance-level segmentation forautonomous driving with deep densely connected mrfs In Proceed-ings of the Conference on Computer Vision and Pattern Recognitionpp 669ndash677

Zhao H Shi J Qi X Wang X Jia J (2017) Pyramid scene parsingnetwork In Proceedings of the Conference on Computer Vision andPattern Recognition pp 2881ndash2890

Zurn J Burgard W Valada A (2019) Self-supervised visual terrainclassification from unsupervised acoustic feature learning arXivpreprint arXiv191203227

Page 15: arXiv:2004.02307v2 [cs.CV] 19 May 2020 · 16 48x 512x 1024 1 Input 3x 1024x 2048 P 32 128x 64x128 32x64 DPC 1 256x 256x512 256x 64x128 128x256 24x 512x 1024 40x 256x 512 64x 128x

EfficientPS Efficient Panoptic Segmentation 15

Table 4 Performance comparison of panoptic segmentation on the Mapillary Vistas validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. '−' denotes that the metric has not been reported for the corresponding method. All scores are reported in percent (%).

Mode | Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Single-Scale | JSIS-Net | 17.6 | 55.9 | 23.5 | 10.0 | 47.6 | 14.1 | 27.5 | 66.9 | 35.8 | − | −
Single-Scale | DeeperLab | 32.0 | − | − | − | − | − | − | − | − | − | 55.3
Single-Scale | TASCNet | 32.6 | − | − | 31.1 | − | − | 34.4 | − | − | 18.5 | −
Single-Scale | AdaptIS | 35.9 | − | − | 31.5 | − | − | 41.9 | − | − | − | −
Single-Scale | Seamless | 37.7 | − | − | 33.8 | − | − | 42.9 | − | − | 16.4 | 50.4
Single-Scale | Panoptic-DeepLab | 37.7 | − | − | 30.4 | − | − | 47.4 | − | − | 14.9 | 55.3
Single-Scale | EfficientPS (ours) | 38.3 | 74.2 | 48.0 | 33.9 | 73.3 | 43.0 | 44.2 | 75.4 | 54.7 | 18.7 | 52.6
Multi-Scale | TASCNet | 34.3 | − | − | 34.8 | − | − | 33.6 | − | − | 20.4 | −
Multi-Scale | Panoptic-DeepLab | 40.3 | − | − | 33.5 | − | − | 49.3 | − | − | 17.2 | 56.8
Multi-Scale | EfficientPS (ours) | 40.5 | 74.9 | 49.5 | 35.0 | 73.8 | 44.4 | 47.7 | 76.2 | 56.4 | 20.8 | 54.1

single-scale and multi-scale evaluation, as well as without any pre-training and with pre-training on other datasets, namely Mapillary Vistas (Neuhold et al, 2017), denoted as Vistas, and Microsoft COCO (Lin et al, 2014), abbreviated as COCO. We report the performance of all the aforementioned variants of our EfficientPS model. Note that we do not use the Cityscapes coarse annotations or depth data, nor do we exploit temporal data. Our EfficientPS model trained only on the Cityscapes fine annotations and with single-scale evaluation outperforms the previous best proposal-based approach AdaptIS by 1.9% in PQ and 2.0% in AP, while outperforming the best bottom-up approach Panoptic-DeepLab by 0.9% in PQ and 3.0% in AP. Furthermore, our EfficientPS model trained only on the Cityscapes fine annotations and with multi-scale evaluation achieves an improvement of 1.0% in PQ and 1.2% in AP over Panoptic-DeepLab. We observe a similar trend while comparing with models that have been pre-trained with additional data, where our proposed EfficientPS outperforms the former state-of-the-art Panoptic-DeepLab in both single-scale and multi-scale evaluation. EfficientPS pre-trained on Mapillary Vistas and with single-scale evaluation outperforms Panoptic-DeepLab in the same configuration by 0.8% in PQ and 3.1% in AP, while for multi-scale evaluation it exceeds the performance of Panoptic-DeepLab by 0.5% in PQ and 1.3% in AP.

We report the benchmarking results on the Cityscapes test set in Table 2, where the results were obtained directly from the leaderboard. Note that the official Cityscapes benchmark only reports the PQ, PQ^St, PQ^Th, SQ and RQ metrics, and ranks the methods primarily based on the standard PQ metric. Our proposed EfficientPS, without pre-training on any extra data, achieves a PQ of 64.1%, which is an improvement of 1.8% over the previous state-of-the-art Panoptic-DeepLab trained only using the Cityscapes fine annotations, and an improvement of 1.5% in PQ over the Seamless model that also uses extra data. More importantly, our proposed EfficientPS model pre-trained on Mapillary Vistas sets the new state-of-the-art on the Cityscapes panoptic benchmark, achieving a PQ score of 66.4%. This accounts for an improvement of 0.9% in PQ compared to the previous state-of-the-art Panoptic-DeepLab pre-trained on Mapillary Vistas. Moreover, our EfficientPS model ranks second in the semantic segmentation task with an mIoU of 84.2%, as well as second in the instance segmentation task with an AP of 39.1%, among all the published methods in the Cityscapes benchmark.

We compare the efficiency of our proposed EfficientPS architecture against state-of-the-art models in terms of the number of parameters and FLOPs that it consumes, as well as the runtime on the Cityscapes dataset. We compute the end-to-end runtime, including the time consumed for fusing the semantic and instance predictions that yields the final panoptic segmentation output. Table 3 shows the comparison with the top two top-down and bottom-up panoptic segmentation architectures. Our proposed EfficientPS has a runtime of 166 ms for an input image resolution of 1024×2048 pixels, which makes it faster than the competing methods. We also observe that our EfficientPS architecture consumes the least amount of parameters and FLOPs, thereby making it the most efficient state-of-the-art panoptic segmentation model.
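The end-to-end timing protocol described above can be reproduced with a short script. The following is a minimal sketch, not the authors' benchmarking code: it assumes a hypothetical EfficientPS module whose forward pass already includes the panoptic fusion step, and it uses CUDA events with warm-up iterations to obtain a stable per-image latency.

import torch

@torch.no_grad()
def benchmark_panoptic_model(model, iters=100, warmup=20, size=(1, 3, 1024, 2048)):
    """Measure the mean end-to-end inference time (ms) for one input image."""
    model = model.eval().cuda()
    dummy = torch.randn(*size, device="cuda")
    # Warm-up so that cuDNN autotuning and memory allocation do not skew the timings.
    for _ in range(warmup):
        model(dummy)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        model(dummy)  # assumed to return the fused panoptic prediction
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per image

# Usage (assuming an EfficientPS implementation is available):
# print(f"{benchmark_panoptic_model(EfficientPS(num_classes=19)):.1f} ms / image")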

In Table 4, we report results on the Mapillary Vistas validation set. The Mapillary Vistas dataset presents a substantial challenge as it contains images from varying seasons, weather conditions and times of day, as well as the presence of 65 semantic object classes. Our proposed EfficientPS model exceeds the state-of-the-art for both single-scale and multi-scale evaluation. For single-scale evaluation, it achieves an improvement of 0.6% in PQ over the top-down approach Seamless and the bottom-up approach Panoptic-DeepLab, while for multi-scale evaluation it achieves an improvement of 0.4% in PQ and 3.6% in AP over the previous state-of-the-art Panoptic-DeepLab.


Table 5 Performance comparison of panoptic segmentation on the KITTI validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All scores are reported in percent (%).

Mode | Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Single-Scale | Panoptic FPN | 38.6 | 70.4 | 51.2 | 26.1 | 68.3 | 40.1 | 47.6 | 71.9 | 59.2 | 24.4 | 52.1
Single-Scale | UPSNet | 39.1 | 70.7 | 51.7 | 26.6 | 68.5 | 40.6 | 48.3 | 72.4 | 59.8 | 24.7 | 52.6
Single-Scale | Seamless | 41.3 | 71.7 | 52.3 | 28.5 | 69.2 | 42.3 | 50.6 | 73.6 | 59.6 | 25.9 | 53.8
Single-Scale | EfficientPS (ours) | 42.9 | 72.7 | 53.6 | 30.4 | 69.8 | 43.7 | 52.0 | 74.9 | 60.9 | 27.1 | 55.3
Multi-Scale | Panoptic FPN | 39.3 | 70.8 | 51.6 | 26.9 | 68.7 | 40.4 | 48.3 | 72.4 | 59.8 | 24.8 | 52.8
Multi-Scale | UPSNet | 39.9 | 71.2 | 52.0 | 27.2 | 68.8 | 40.8 | 49.1 | 72.9 | 60.2 | 25.2 | 53.2
Multi-Scale | Seamless | 42.2 | 72.3 | 52.9 | 29.1 | 69.7 | 42.9 | 51.8 | 74.2 | 60.1 | 26.6 | 55.1
Multi-Scale | EfficientPS (ours) | 43.7 | 73.2 | 54.1 | 30.9 | 70.2 | 44.0 | 53.1 | 75.4 | 61.5 | 27.9 | 56.4

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All scores are reported in percent (%).

Mode | Network | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Single-Scale | Panoptic FPN | 45.9 | 75.9 | 60.8 | 46.1 | 77.8 | 60.9 | 45.8 | 74.9 | 60.7 | 27.8 | 68.1
Single-Scale | UPSNet | 46.6 | 76.5 | 60.9 | 47.6 | 78.9 | 61.1 | 46.0 | 75.3 | 60.8 | 28.2 | 68.4
Single-Scale | Seamless | 47.7 | 77.2 | 61.2 | 48.9 | 79.5 | 61.5 | 47.1 | 76.1 | 61.1 | 30.1 | 69.6
Single-Scale | EfficientPS (ours) | 50.1 | 78.4 | 62.0 | 50.7 | 80.6 | 61.6 | 49.8 | 77.1 | 62.2 | 31.6 | 71.3
Multi-Scale | Panoptic FPN | 46.7 | 77.0 | 61.0 | 47.3 | 78.9 | 61.1 | 46.4 | 76.1 | 61.0 | 28.9 | 70.1
Multi-Scale | UPSNet | 47.1 | 77.9 | 60.9 | 47.6 | 79.8 | 61.2 | 46.8 | 76.9 | 60.8 | 29.2 | 70.6
Multi-Scale | Seamless | 48.5 | 78.2 | 61.9 | 49.5 | 80.4 | 62.2 | 47.9 | 77.1 | 61.7 | 31.4 | 71.3
Multi-Scale | EfficientPS (ours) | 51.1 | 78.8 | 63.5 | 52.6 | 81.2 | 65.4 | 50.3 | 77.5 | 62.5 | 32.9 | 72.1

Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQ^St score, primarily due to the output stride of 16 at which it operates, which increases its computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQ^St of Panoptic-DeepLab. On the other hand, top-down approaches tend to have a better instance segmentation ability, as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that can combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely due to the fact that it contains images of unstructured urban environments and scenes that do not have clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4.0% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQ^St score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on the various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote the Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '-' refers to the standard configuration as in Kirillov et al (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All scores are reported in percent (%).

Model | Encoder | 2-way FPN | SIH | SH | PFM | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
M1 | ResNet-50 | - | - | - | - | 58.2 | 79.1 | 72.0 | 52.4 | 78.8 | 67.2 | 62.4 | 79.4 | 75.6 | 31.6 | 74.3
M2 | ResNet-50 | - | - | - | X | 58.8 | 79.5 | 72.6 | 53.4 | 79.2 | 62.8 | 67.9 | 79.7 | 76.1 | 33.8 | 75.1
M3 | ResNet-50 | - | X | - | X | 58.6 | 79.4 | 72.4 | 53.1 | 79.1 | 67.5 | 62.6 | 79.6 | 75.9 | 33.7 | 75.0
M4 | EfficientNet-B5 | - | X | - | X | 59.7 | 79.9 | 73.3 | 54.7 | 76.6 | 68.1 | 63.3 | 80.3 | 79.5 | 34.1 | 76.3
M5 | EfficientNet-B5 | X | X | - | X | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
M6 | EfficientNet-B5 | X | X | X | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of the various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and its comparison with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.
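For reference, the PQ metric used as the primary criterion follows the standard definition of Kirillov et al (2019), where a predicted segment p and a groundtruth segment g are matched (counted as a true positive, TP) if their IoU exceeds 0.5, and PQ factorizes into the segmentation quality (SQ) and recognition quality (RQ):

PQ = ( Σ_{(p,g) ∈ TP} IoU(p,g) ) / ( |TP| + ½|FP| + ½|FN| ) = SQ × RQ,

where SQ = ( Σ_{(p,g) ∈ TP} IoU(p,g) ) / |TP| and RQ = |TP| / ( |TP| + ½|FP| + ½|FN| ).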

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the network configuration and panoptic fusion heuristics of Kirillov et al (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network is comprised of an upsampling stage which has a 3×3 convolution, group norm (Wu and He, 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.
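To make the M1 semantic head described above concrete, the following is a minimal PyTorch sketch (an illustration, not the authors' code); the channel width of 128 and the number of upsampling stages per FPN level are assumptions chosen to match the textual description.

import torch
import torch.nn as nn
import torch.nn.functional as F

def upsample_stage(in_ch, out_ch=128):
    # One stage: 3x3 conv -> GroupNorm -> ReLU -> x2 bilinear upsampling.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.GroupNorm(16, out_ch),
        nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    )

class BaselineSemanticHead(nn.Module):
    """FPN levels P4..P32 are brought to 1/4 scale, summed, then classified."""
    def __init__(self, in_ch=256, num_classes=19):
        super().__init__()
        # P8, P16 and P32 need 1, 2 and 3 upsampling stages to reach 1/4 scale;
        # P4 is already at 1/4 scale and only needs the conv block.
        self.p4 = nn.Sequential(nn.Conv2d(in_ch, 128, 3, padding=1, bias=False),
                                nn.GroupNorm(16, 128), nn.ReLU(inplace=True))
        self.p8 = upsample_stage(in_ch)
        self.p16 = nn.Sequential(upsample_stage(in_ch), upsample_stage(128))
        self.p32 = nn.Sequential(upsample_stage(in_ch), upsample_stage(128), upsample_stage(128))
        self.classifier = nn.Conv2d(128, num_classes, 1)

    def forward(self, p4, p8, p16, p32):
        fused = self.p4(p4) + self.p8(p8) + self.p16(p16) + self.p32(p32)
        logits = self.classifier(fused)
        logits = F.interpolate(logits, scale_factor=4, mode="bilinear", align_corners=False)
        return torch.softmax(logits, dim=1)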

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of the semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M with a drop of 0.2% in PQ and a 0.1% drop in AP and the mIoU score. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.
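The parameter saving from this substitution is easy to see in code; the sketch below (an illustration, not the authors' implementation) contrasts a standard 3×3 convolution with a depthwise separable one of the same input and output width.

import torch.nn as nn

def standard_conv(ch=256):
    # 3x3 standard convolution: ch * ch * 3 * 3 = 589,824 weights for ch=256.
    return nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False)

def separable_conv(ch=256):
    # Depthwise 3x3 (ch * 3 * 3) followed by pointwise 1x1 (ch * ch):
    # 2,304 + 65,536 = 67,840 weights for ch=256, roughly 8.7x fewer.
    return nn.Sequential(
        nn.Conv2d(ch, ch, kernel_size=3, padding=1, groups=ch, bias=False),
        nn.Conv2d(ch, ch, kernel_size=1, bias=False),
    )

if __name__ == "__main__":
    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(standard_conv()), count(separable_conv()))  # 589824 vs 67840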

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables a bidirectional flow of information, thus breaking away from the limitation of the standard FPN.
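The kind of surgery performed on the encoder for M4 (dropping the squeeze-and-excitation blocks and swapping the normalization plus activation layers) can be expressed as a generic module traversal. This is a hedged sketch only: it uses a plain BatchNorm + leaky ReLU pair as a stand-in for the synchronized iABN layer used in the paper, and it assumes the backbone exposes its squeeze-and-excitation blocks under a hypothetical class name SqueezeExcite.

import torch.nn as nn

class BNLeakyReLU(nn.Sequential):
    # Stand-in for synchronized in-place activated batch norm (iABN sync) + leaky ReLU.
    def __init__(self, num_features, slope=0.01):
        super().__init__(nn.BatchNorm2d(num_features), nn.LeakyReLU(slope, inplace=True))

def adapt_backbone(backbone, se_block_type):
    """Remove SE blocks and replace BatchNorm / ReLU pairs with BNLeakyReLU."""
    for name, child in backbone.named_children():
        if isinstance(child, se_block_type):
            setattr(backbone, name, nn.Identity())        # drop squeeze-and-excitation
        elif isinstance(child, nn.BatchNorm2d):
            setattr(backbone, name, BNLeakyReLU(child.num_features))
        elif isinstance(child, nn.ReLU):
            setattr(backbone, name, nn.Identity())        # activation now lives in BNLeakyReLU
        else:
            adapt_backbone(child, se_block_type)           # recurse into nested modules
    return backbone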

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively, enabling it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, the gain cannot be solely attributed to the semantic head. This is due to the fact that if we employ the standard panoptic fusion heuristics, an improvement in semantic segmentation would only contribute to an increase in the PQ^St score.


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All scores are reported in percent (%).

Encoder | Params (M) | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
MobileNetV3 | 5.40 | 55.8 | 78.1 | 70.2 | 50.4 | 77.4 | 67.1 | 59.8 | 78.6 | 72.4 | 29.1 | 72.2
ResNet-50 | 25.60 | 60.3 | 80.1 | 72.6 | 55.3 | 79.9 | 68.9 | 63.9 | 80.3 | 75.3 | 34.9 | 76.1
ResNet-101 | 44.50 | 61.1 | 80.3 | 75.1 | 56.5 | 80.1 | 71.9 | 64.2 | 80.5 | 77.4 | 35.9 | 77.2
Xception-71 | 27.50 | 62.1 | 81.1 | 75.4 | 58.5 | 80.9 | 72.3 | 64.7 | 81.2 | 77.7 | 36.2 | 78.1
ResNeXt-101 | 86.74 | 63.2 | 81.2 | 76.0 | 59.6 | 80.4 | 72.9 | 65.8 | 81.7 | 78.3 | 36.9 | 78.9
Mod. EfficientNet-B5 (Ours) | 30.00 | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All scores are reported in percent (%).

Architecture | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Bottom-Up FPN | 60.4 | 80.6 | 73.7 | 56.3 | 80.4 | 69.9 | 63.4 | 80.8 | 76.4 | 35.2 | 75.3
Top-Down FPN | 62.2 | 80.9 | 75.7 | 58.1 | 80.1 | 72.4 | 65.1 | 81.4 | 78.0 | 36.5 | 78.2
PANet FPN | 63.1 | 81.1 | 75.5 | 59.4 | 80.3 | 72.3 | 65.8 | 81.6 | 77.8 | 37.1 | 78.8
2-way FPN (Ours) | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

However, our proposed adaptive panoptic fusion yields an improvement in PQ^Th as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al, 2019), ResNet-50 (He et al, 2016), ResNet-101 (He et al, 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al, 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have a comparable or smaller number of parameters than our modified EfficientNet-B5, however they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al, 2017), the bottom-up FPN and the PANet FPN variants. We refer to the FPN architecture described in Liu et al (2018) as the PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. These two models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to the standard FPN, in a sequential or parallel manner respectively.


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os is the output of our 2-way FPN at the os pyramid scale level. c^k_f refers to a convolution layer with f filters and a k×k kernel size. LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All scores are reported in percent (%).

Model | P32 | P16 | P8 | P4 | Feature Correlation | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
M61 | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | [c^3_128, c^3_128] | - | 61.6 | 80.6 | 75.7 | 57.3 | 80.4 | 72.6 | 64.7 | 80.7 | 77.9 | 36.7 | 77.2
M62 | LSFE | LSFE | LSFE | LSFE | - | 61.7 | 80.5 | 75.4 | 57.9 | 80.5 | 72.6 | 64.5 | 80.6 | 77.4 | 36.6 | 77.4
M63 | DPC | LSFE | LSFE | LSFE | - | 62.3 | 80.9 | 75.9 | 57.9 | 80.4 | 72.0 | 65.6 | 81.3 | 78.8 | 36.8 | 78.1
M64 | DPC | DPC | LSFE | LSFE | - | 62.9 | 81.0 | 75.7 | 59.0 | 80.5 | 71.4 | 65.8 | 81.4 | 78.7 | 37.0 | 78.6
M65 | DPC | DPC | DPC | LSFE | - | 62.4 | 80.8 | 75.4 | 58.8 | 80.3 | 72.4 | 65.1 | 81.1 | 77.6 | 36.7 | 78.2
M66 | DPC | DPC | LSFE | LSFE | X | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.
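To make the distinction between the sequential (PANet-style) and parallel aggregation concrete, the following is a schematic PyTorch sketch of a 2-way feature pyramid in which a top-down branch and a bottom-up branch are computed independently from the encoder outputs and their per-level results are summed. The channel widths, the 1×1 lateral projections and the absence of the paper's iABN layers are simplifying assumptions, not the authors' implementation.

import torch.nn as nn
import torch.nn.functional as F

class TwoWayFPN(nn.Module):
    """Parallel top-down and bottom-up pathways over encoder features C2..C5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral_td = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.lateral_bu = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_channels)

    def forward(self, feats):                      # feats: [C2, C3, C4, C5], fine to coarse
        td = [l(f) for l, f in zip(self.lateral_td, feats)]
        for i in range(len(td) - 2, -1, -1):       # top-down: propagate coarse to fine
            td[i] = td[i] + F.interpolate(td[i + 1], size=td[i].shape[-2:], mode="nearest")
        bu = [l(f) for l, f in zip(self.lateral_bu, feats)]
        for i in range(1, len(bu)):                # bottom-up: propagate fine to coarse
            bu[i] = bu[i] + F.max_pool2d(bu[i - 1], kernel_size=2, stride=2)
        # Sum the two parallel pathways and smooth to obtain P4, P8, P16, P32.
        return [s(t + b) for s, t, b in zip(self.smooth, td, bu)]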

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activations sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output which is 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output which is the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model, therefore we employ separable convolutions in all the experiments that follow.

In the M63 model, we replace the LSFE module in the P32 level of the 2-way FPN with dense prediction cells (DPC) described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module in the P16 level with DPC, and in the subsequent M65 model, we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63, however the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with the fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, resulting in better object boundary refinement.
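The LSFE block used in these configurations is small enough to sketch directly. The version below is an illustration under the assumptions already stated (BatchNorm + leaky ReLU standing in for iABN sync) and uses two cascaded 3×3 separable convolutions with 128 output filters, as in the M62 configuration.

import torch.nn as nn

def separable_block(in_ch, out_ch=128):
    # 3x3 depthwise + 1x1 pointwise convolution, followed by norm and leaky ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),          # stand-in for iABN sync
        nn.LeakyReLU(0.01, inplace=True),
    )

class LSFE(nn.Module):
    """Large Scale Feature Extractor: two cascaded 3x3 separable convolutions."""
    def __init__(self, in_ch=256, out_ch=128):
        super().__init__()
        self.block = nn.Sequential(separable_block(in_ch, out_ch),
                                   separable_block(out_ch, out_ch))

    def forward(self, x):
        return self.block(x)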

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al (2019), which we denote as the baseline, UPSNet (Xiong et al, 2019) and Seamless (Porzi et al, 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments, while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All scores are reported in percent (%).

Semantic Head | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St | AP | mIoU
Baseline | 61.5 | 80.7 | 75.6 | 57.2 | 80.6 | 72.5 | 64.6 | 80.9 | 77.9 | 36.8 | 77.3
UPSNet | 62.0 | 81.0 | 74.7 | 58.5 | 80.5 | 70.9 | 64.5 | 81.3 | 77.5 | 35.9 | 76.1
Seamless | 62.9 | 81.1 | 75.5 | 58.9 | 80.4 | 71.3 | 65.6 | 81.6 | 78.5 | 36.8 | 78.5
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2 | 38.3 | 79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All scores are reported in percent (%).

Model | PQ | SQ | RQ | PQ^Th | SQ^Th | RQ^Th | PQ^St | SQ^St | RQ^St
Baseline | 62.4 | 80.8 | 75.4 | 58.7 | 80.4 | 72.6 | 65.1 | 81.1 | 77.4
TASCNet | 62.5 | 80.9 | 75.6 | 58.6 | 80.5 | 72.8 | 65.3 | 81.2 | 77.7
UPSNet | 63.1 | 81.3 | 76.1 | 59.5 | 80.6 | 73.2 | 65.7 | 81.8 | 78.2
Ours | 63.9 | 81.5 | 77.1 | 60.7 | 81.2 | 74.1 | 66.2 | 81.8 | 79.2

The semantic head of UPSNet, which is essentially a subnetwork comprising sequential deformable convolution layers (Dai et al, 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, even though these outputs are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al (2019), which we consider as a baseline as it is extensively used in several panoptic segmentation networks. We then compare with Mask-Guided fusion (Li et al, 2018a) and the panoptic fusion heuristics proposed in (Xiong et al, 2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and the instance head, which have an inherent overlap, is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between the predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask: the mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them, and therefore relies on the mode computed over the segmentation outputs to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.
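For illustration, the baseline heuristic described above can be written down in a few lines. This is a simplified sketch, not the fusion used by EfficientPS: it pastes instance masks in order of detection confidence (so the instance head wins any overlap) and fills the remaining pixels with the 'stuff' prediction from the semantic head; the score and overlap thresholds are assumptions.

import numpy as np

def baseline_panoptic_fusion(sem_seg, inst_masks, inst_classes, inst_scores,
                             num_stuff, score_thr=0.5, overlap_thr=0.5):
    """sem_seg: (H, W) semantic labels; inst_masks: (N, H, W) boolean masks."""
    panoptic = np.full(sem_seg.shape, -1, dtype=np.int64)   # -1 marks unassigned pixels
    segment_id = num_stuff                                   # instance ids follow the 'stuff' labels
    for idx in np.argsort(-inst_scores):                     # highest confidence first
        if inst_scores[idx] < score_thr:
            break
        mask = inst_masks[idx] & (panoptic == -1)            # instance head wins overlaps
        if mask.sum() < overlap_thr * inst_masks[idx].sum():
            continue                                         # mostly occluded by earlier instances
        panoptic[mask] = segment_id
        segment_id += 1
    stuff_region = (panoptic == -1) & (sem_seg < num_stuff)
    panoptic[stuff_region] = sem_seg[stuff_region]           # fill remaining 'stuff' pixels
    return panoptic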

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al, 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model Panoptic-DeepLab does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding datasets.


Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al, 2019) on different benchmark datasets: rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI and (g)-(h) IDD; the columns show the input image, the Seamless output, the EfficientPS output and the improvement/error map. The improvement/error map denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


For each example, we show the input image, the corresponding panoptic segmentation output from the Seamless model, and the output from our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN that effectively aggregates multi-scale features to learn semantically richer representations, and to the panoptic fusion module that addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c), the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately resolve this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by the cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios, consistently across the different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head that correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearance of these two objects, however our proposed EfficientPS model, with our novel panoptic backbone, has an extensive representational capacity which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of traffic participants.


Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries: rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI and (g)-(h) IDD, with four examples (i)-(iv) per row. These examples show complex urban scenarios with numerous object instances at multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk, as well as seasonal changes.


These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from uncommon viewpoints compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analyses and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and to exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P Pont-Tuset J Barron JT Marques F Malik J (2014)Multiscale combinatorial grouping In Proceedings of the Confer-ence on Computer Vision and Pattern Recognition pp 328ndash335

Badrinarayanan V Kendall A Cipolla R (2015) Segnet A deep convo-lutional encoder-decoder architecture for image segmentation arXivpreprint arXiv 151100561

Bai M Urtasun R (2017) Deep watershed transform for instance seg-mentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 5221ndash5229

Bremner JG Slater A (2008) Theories of infant development JohnWiley amp Sons

Brostow GJ Shotton J Fauqueur J Cipolla R (2008) Segmentation andrecognition using structure from motion point clouds In Europeanconference on computer vision Springer pp 44ndash57

Bulo SR Neuhold G Kontschieder P (2017) Loss max-pooling forsemantic image segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 7082ndash7091

Chen LC Papandreou G Kokkinos I Murphy K Yuille AL (2017a) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848

Chen LC Papandreou G Schroff F Adam H (2017b) Rethinking at-rous convolution for semantic image segmentation arXiv preprintarXiv170605587

Chen LC Collins M Zhu Y Papandreou G Zoph B Schroff F AdamH Shlens J (2018a) Searching for efficient multi-scale architecturesfor dense image prediction In Advances in Neural InformationProcessing Systems pp 8713ndash8724

Chen LC Zhu Y Papandreou G Schroff F Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image seg-mentation arXiv preprint arXiv180202611

Cheng B Collins MD Zhu Y Liu T Huang TS Adam H Chen LC(2019) Panoptic-deeplab A simple strong and fast baseline forbottom-up panoptic segmentation arXiv preprint arXiv191110194

Chollet F (2017) Xception Deep learning with depthwise separableconvolutions In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 1251ndash1258

Cordts M Omran M Ramos S Rehfeld T Enzweiler M Benenson RFranke U Roth S Schiele B (2016) The cityscapes dataset for se-mantic urban scene understanding In Proceedings of the Conferenceon Computer Vision and Pattern Recognition pp 3213ndash3223

Dai J He K Li Y Ren S Sun J (2016) Instance-sensitive fully convolu-tional networks In European Conference on Computer Vision pp534ndash549

Dai J Qi H Xiong Y Li Y Zhang G Hu H Wei Y (2017) Deform-able convolutional networks In Proceedings of the InternationalConference on Computer Vision pp 764ndash773

Gao N Shan Y Wang Y Zhao X Yu Y Yang M Huang K (2019)Ssap Single-shot instance segmentation with affinity pyramid InProceedings of the International Conference on Computer Vision pp642ndash651

Geiger A Lenz P Stiller C Urtasun R (2013) Vision meets roboticsThe kitti dataset International Journal of Robotics Research

de Geus D Meletis P Dubbelman G (2018) Panoptic segmentation witha joint semantic and instance segmentation network arXiv preprintarXiv180902110

Girshick R (2015) Fast r-cnn In Proceedings of the International Con-ference on Computer Vision pp 1440ndash1448

Glorot X Bengio Y (2010) Understanding the difficulty of trainingdeep feedforward neural networks In Proceedings of the thirteenthinternational conference on artificial intelligence and statistics pp249ndash256

Hariharan B Arbelaez P Girshick R Malik J (2014) Simultaneousdetection and segmentation In European Conference on ComputerVision pp 297ndash312

Hariharan B Arbelaez P Girshick R Malik J (2015) Hypercolumns forobject segmentation and fine-grained localization In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp447ndash456

He K Zhang X Ren S Sun J (2016) Deep residual learning for imagerecognition In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 770ndash778

He K Gkioxari G Dollar P Girshick R (2017) Mask r-cnn In Pro-ceedings of the International Conference on Computer Vision pp2961ndash2969

He X Gould S (2014a) An exemplar-based crf for multi-instance objectsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition

He X Gould S (2014b) An exemplar-based crf for multi-instance objectsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 296ndash303

Howard A Sandler M Chu G Chen LC Chen B Tan M Wang W ZhuY Pang R Vasudevan V et al (2019) Searching for mobilenetv3 InProceedings of the International Conference on Computer Vision pp1314ndash1324

Hu J Shen L Sun G (2018) Squeeze-and-excitation networks In Pro-ceedings of the Conference on Computer Vision and Pattern Recog-nition pp 7132ndash7141

Ioffe S Szegedy C (2015) Batch normalization Accelerating deepnetwork training by reducing internal covariate shift arXiv preprintarXiv150203167

Kaiser L Gomez AN Chollet F (2017) Depthwise separable convolu-tions for neural machine translation arXiv preprint arXiv170603059

Kang BR Kim HY (2018) Bshapenet Object detection and in-stance segmentation with bounding shape masks arXiv preprintarXiv181010327

Kirillov A He K Girshick R Rother C Dollar P (2019) Panopticsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 9404ndash9413

Kontschieder P Bulo SR Bischof H Pelillo M (2011) Structured class-labels in random forests for semantic image labelling In Proceed-ings of the International Conference on Computer Vision pp 2190ndash2197

Krahenbuhl P Koltun V (2011) Efficient inference in fully connectedcrfs with gaussian edge potentials In Advances in neural informa-tion processing systems pp 109ndash117

Li J Raventos A Bhargava A Tagawa T Gaidon A (2018a) Learningto fuse things and stuff arXiv preprint arXiv181201192

Li Q Arnab A Torr PH (2018b) Weakly-and semi-supervised panop-tic segmentation In Proceedings of the European Conference onComputer Vision (ECCV) pp 102ndash118

Li X Zhang L You A Yang M Yang K Tong Y (2019a) Globalaggregation then local distribution in fully convolutional networksarXiv preprint arXiv190907229

Li Y Qi H Dai J Ji X Wei Y (2017) Fully convolutional instance-aware semantic segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 2359ndash2367

Li Y Chen X Zhu Z Xie L Huang G Du D Wang X (2019b) Attention-guided unified network for panoptic segmentation In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp7026ndash7035

Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar PZitnick CL (2014) Microsoft coco Common objects in context InEuropean conference on computer vision Springer pp 740ndash755

Lin TY Dollar P Girshick R He K Hariharan B Belongie S (2017)Feature pyramid networks for object detection In Proceedings ofthe Conference on Computer Vision and Pattern Recognition pp2117ndash2125

Liu H Peng C Yu C Wang J Liu X Yu G Jiang W (2019) An end-to-end network for panoptic segmentation In Proceedings of theConference on Computer Vision and Pattern Recognition pp 6172ndash6181

Liu S Jia J Fidler S Urtasun R (2017) Sgn Sequential grouping net-works for instance segmentation In Proceedings of the InternationalConference on Computer Vision pp 3496ndash3504

Liu S Qi L Qin H Shi J Jia J (2018) Path aggregation network for in-stance segmentation In Proceedings of the Conference on ComputerVision and Pattern Recognition pp 8759ndash8768

Liu W Rabinovich A Berg AC (2015) Parsenet Looking wider to seebetter arXiv preprint arXiv 150604579

Long J Shelhamer E Darrell T (2015) Fully convolutional networksfor semantic segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition

Neuhold G Ollmann T Rota Bulo S Kontschieder P (2017) The mapil-lary vistas dataset for semantic understanding of street scenes InProceedings of the International Conference on Computer Vision pp4990ndash4999

Paszke A Gross S Massa F Lerer A Bradbury J Chanan G Killeen TLin Z Gimelshein N Antiga L et al (2019) Pytorch An imperativestyle high-performance deep learning library In Advances in NeuralInformation Processing Systems pp 8024ndash8035


Pinheiro PO Collobert R Dollar P (2015) Learning to segment objectcandidates In Advances in Neural Information Processing Systemspp 1990ndash1998

Plath N Toussaint M Nakajima S (2009) Multi-class image segment-ation using conditional random fields and global classification InProceedings of the International Conference on Machine Learningpp 817ndash824

Porzi L Bulo SR Colovic A Kontschieder P (2019) Seamless scenesegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 8277ndash8286

Radwan N Valada A Burgard W (2018) Multimodal interaction-awaremotion prediction for autonomous street crossing arXiv preprintarXiv180806887

Ren M Zemel RS (2017) End-to-end instance segmentation with re-current attention In Proceedings of the Conference on ComputerVision and Pattern Recognition pp 6656ndash6664

Romera-Paredes B Torr PHS (2016) Recurrent instance segmentationIn European conference on computer vision Springer pp 312ndash329

Ros G Ramos S Granados M Bakhtiary A Vazquez D Lopez AM(2015) Vision-based offline-online perception paradigm for autonom-ous driving In IEEE Winter Conference on Applications of Com-puter Vision pp 231ndash238

Rota Bulo S Porzi L Kontschieder P (2018) In-place activated batch-norm for memory-optimized training of dnns In Proceedings ofthe Conference on Computer Vision and Pattern Recognition pp5639ndash5647

Russakovsky O Deng J Su H Krause J Satheesh S Ma S Huang ZKarpathy A Khosla A Bernstein M et al (2015) Imagenet largescale visual recognition challenge International Journal of ComputerVision 115(3)211ndash252

Shotton J Johnson M Cipolla R (2008) Semantic texton forests forimage categorization and segmentation In Proceedings of the Con-ference on Computer Vision and Pattern Recognition

Silberman N Sontag D Fergus R (2014) Instance segmentation ofindoor scenes using a coverage loss In European Conference onComputer Vision pp 616ndash631

Simonyan K Zisserman A (2014) Very deep convolutional networksfor large-scale image recognition arXiv preprint arXiv14091556

Sofiiuk K Barinova O Konushin A (2019) Adaptis Adaptive instanceselection network In Proceedings of the International Conferenceon Computer Vision pp 7355ndash7363

Sturgess P Alahari K Ladicky L Torr PH (2009) Combining appear-ance and structure from motion features for road scene understandingIn British Machine Vision Conference

Tan M Le QV (2019) Efficientnet Rethinking model scaling for convo-lutional neural networks arXiv preprint arXiv190511946

Tian Z He T Shen C Yan Y (2019) Decoders matter for semanticsegmentation Data-dependent decoding enables flexible feature ag-gregation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 3126ndash3135

Tighe J Niethammer M Lazebnik S (2014) Scene parsing with objectinstances and occlusion ordering In Proceedings of the Conferenceon Computer Vision and Pattern Recognition pp 3748ndash3755

Tu Z Chen X Yuille AL Zhu SC (2005) Image parsing Unifyingsegmentation detection and recognition International Journal ofComputer Vision 63(2)113ndash140

Uhrig J Cordts M Franke U Brox T (2016) Pixel-level encodingand depth layering for instance-level semantic labeling In GermanConference on Pattern Recognition pp 14ndash25

Valada A Dhall A Burgard W (2016a) Convoluted mixture of deepexperts for robust semantic segmentation In IEEERSJ Internationalconference on intelligent robots and systems (IROS) workshop stateestimation and terrain perception for all terrain mobile robots

Valada A Oliveira G Brox T Burgard W (2016b) Towards robustsemantic segmentation using deep fusion In Robotics Science andsystems (RSS 2016) workshop are the sceptics right Limits andpotentials of deep learning in robotics

Valada A Vertens J Dhall A Burgard W (2017) Adapnet Adaptivesemantic segmentation in adverse environmental conditions In Pro-ceedings of the IEEE International Conference on Robotics andAutomation pp 4644ndash4651

Valada A Radwan N Burgard W (2018) Incorporating semantic andgeometric priors in deep pose regression In Workshop on Learningand Inference in Robotics Integrating Structure Priors and Modelsat Robotics Science and Systems (RSS)

Valada A Mohan R Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, Special Issue: Deep Learning for Robotic Vision

Varma G Subramanian A Namboodiri A Chandraker M Jawahar C(2019) Idd A dataset for exploring problems of autonomous naviga-tion in unconstrained environments In IEEE Winter Conference onApplications of Computer Vision (WACV) pp 1743ndash1751

Wu Y He K (2018) Group normalization In Proceedings of theEuropean Conference on Computer Vision (ECCV) pp 3ndash19

Xie S Girshick R Dollar P Tu Z He K (2017) Aggregated residualtransformations for deep neural networks In Proceedings of theConference on Computer Vision and Pattern Recognition pp 1492ndash1500

Xiong Y Liao R Zhao H Hu R Bai M Yumer E Urtasun R (2019)Upsnet A unified panoptic segmentation network In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp8818ndash8826

Xu P Davoine F Bordes JB Zhao H Denœux T (2016) Multimodalinformation fusion for urban scene understanding Machine Visionand Applications 27(3)331ndash349

Yang TJ Collins MD Zhu Y Hwang JJ Liu T Zhang X Sze VPapandreou G Chen LC (2019) Deeperlab Single-shot image parserarXiv preprint arXiv190205093

Yao J Fidler S Urtasun R (2012) Describing the scene as a whole Jointobject detection scene classification and semantic segmentationIn Proceedings of the Conference on Computer Vision and PatternRecognition pp 702ndash709

Yu F Koltun V (2015) Multi-scale context aggregation by dilated con-volutions arXiv preprint arXiv151107122

Zhang C Wang L Yang R (2010) Semantic segmentation of urbanscenes using dense depth maps In European Conference on Com-puter Vision Springer pp 708ndash721

Zhang Z Fidler S Urtasun R (2016) Instance-level segmentation forautonomous driving with deep densely connected mrfs In Proceed-ings of the Conference on Computer Vision and Pattern Recognitionpp 669ndash677

Zhao H Shi J Qi X Wang X Jia J (2017) Pyramid scene parsingnetwork In Proceedings of the Conference on Computer Vision andPattern Recognition pp 2881ndash2890

Zurn J Burgard W Valada A (2019) Self-supervised visual terrainclassification from unsupervised acoustic feature learning arXivpreprint arXiv191203227

  • 1 Introduction
  • 2 Related Works
  • 3 EfficientPS Architecture
  • 4 Experimental Results
  • 5 Conclusions
Page 16: arXiv:2004.02307v2 [cs.CV] 19 May 2020 · 16 48x 512x 1024 1 Input 3x 1024x 2048 P 32 128x 64x128 32x64 DPC 1 256x 256x512 256x 64x128 128x256 24x 512x 1024 40x 256x 512 64x 128x

16 Rohit Mohan Abhinav Valada

Table 5 Performance comparison of panoptic segmentation on the KITTI validation set Note that no additional data was used for trainingEfficientPS on this dataset other than pre-training the encoder on ImageNet Superscripts St and Th refer to lsquostuffrsquo and lsquothingrsquo classes respectively

Mode Network PQ SQ RQ PQTh SQTh RQTh PQSt SQSt RQSt AP mIoU() () () () () () () () () () ()

Sing

le-S

cale Panoptic FPN 386 704 512 261 683 401 476 719 592 244 521

UPSNet 391 707 517 266 685 406 483 724 598 247 526Seamless 413 717 523 285 692 423 506 736 596 259 538

EfficientPS (ours) 429 727 536 304 698 437 520 749 609 271 553

Mul

ti-Sc

ale Panoptic FPN 393 708 516 269 687 404 483 724 598 248 528

UPSNet 399 712 520 272 688 408 491 729 602 252 532Seamless 422 723 529 291 697 429 518 742 601 266 551

EfficientPS (ours) 437 732 541 309 702 440 531 754 615 279 564

Table 6 Performance comparison of panoptic segmentation on the Indian Driving Dataset (IDD) validation set. Note that no additional data was used for training EfficientPS on this dataset other than pre-training the encoder on ImageNet. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Mode          Network             PQ    SQ    RQ    PQ^Th SQ^Th RQ^Th PQ^St SQ^St RQ^St AP    mIoU
Single-Scale  Panoptic FPN        45.9  75.9  60.8  46.1  77.8  60.9  45.8  74.9  60.7  27.8  68.1
Single-Scale  UPSNet              46.6  76.5  60.9  47.6  78.9  61.1  46.0  75.3  60.8  28.2  68.4
Single-Scale  Seamless            47.7  77.2  61.2  48.9  79.5  61.5  47.1  76.1  61.1  30.1  69.6
Single-Scale  EfficientPS (ours)  50.1  78.4  62.0  50.7  80.6  61.6  49.8  77.1  62.2  31.6  71.3
Multi-Scale   Panoptic FPN        46.7  77.0  61.0  47.3  78.9  61.1  46.4  76.1  61.0  28.9  70.1
Multi-Scale   UPSNet              47.1  77.9  60.9  47.6  79.8  61.2  46.8  76.9  60.8  29.2  70.6
Multi-Scale   Seamless            48.5  78.2  61.9  49.5  80.4  62.2  47.9  77.1  61.7  31.4  71.3
Multi-Scale   EfficientPS (ours)  51.1  78.8  63.5  52.6  81.2  65.4  50.3  77.5  62.5  32.9  72.1

the-art Panoptic-DeepLab. Note that we do not use model ensembles. Our network falls short of the bottom-up approach Panoptic-DeepLab in the PQ^St score primarily due to the output stride of 16 at which it operates, which increases the computational complexity, whereas our EfficientPS uses an output stride of 32 and is hence more efficient. On the one hand, bottom-up approaches tend to have a better semantic segmentation ability, which is evident from the high PQ^St of Panoptic-DeepLab. On the other hand, top-down approaches tend to have better instance segmentation ability as they can handle large-scale variations in object instances. It would be interesting to investigate architectures that combine the strengths of the two in the future.

We present results on the KITTI validation set in Table 5. Our proposed EfficientPS outperforms the previous state-of-the-art Seamless by 1.6% in PQ, 1.2% in AP and 1.5% in mIoU for single-scale evaluation, and by 1.5% in PQ, 1.3% in AP and 1.3% in mIoU for multi-scale evaluation. This dataset consists of cluttered and occluded objects that often have object masks split into two or more parts. In these cases, context aggregation plays a major role. Hence, the improvement that we observe can be attributed to three factors: the multi-scale feature aggregation in our 2-way FPN due to the bidirectional flow of information, the long-range context being captured by our semantic head, and the adaptive fusion in our panoptic fusion module that effectively leverages the predictions from the individual heads.

Finally, we also report results on the Indian Driving Dataset (IDD), largely because it contains images of unstructured urban environments and scenes without clearly delineated road infrastructure, which makes it extremely challenging. Table 6 presents results on the IDD validation set. Our proposed EfficientPS substantially exceeds the state-of-the-art by achieving a PQ score of 50.1% and 51.1% for single-scale and multi-scale evaluation respectively. This amounts to an improvement of 2.6% in PQ over Seamless and 4.0% in PQ over UPSNet for multi-scale evaluation. The unstructured scenes in this dataset challenge the ability of models to detect object boundaries of 'stuff' classes such as road and sidewalk. Our EfficientPS achieves a PQ^St score of 49.8% for single-scale evaluation, which is an improvement of 2.7% over Seamless, and this can be attributed to the effectiveness of our proposed semantic head in capturing object boundaries.


Table 7 Ablation study on various architectural contributions proposed in our EfficientPS model. The performance is shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. SIH, SH and PFM denote Separable Instance Head, Semantic Head and Panoptic Fusion Module respectively. '-' refers to the standard configuration as in Kirillov et al. (2019), whereas 'X' refers to our proposed configuration. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Model  Encoder          2-way FPN  SIH  SH  PFM  PQ    SQ    RQ    PQ^Th SQ^Th RQ^Th PQ^St SQ^St RQ^St AP    mIoU
M1     ResNet-50        -          -    -   -    58.2  79.1  72.0  52.4  78.8  67.2  62.4  79.4  75.6  31.6  74.3
M2     ResNet-50        -          -    -   X    58.8  79.5  72.6  53.4  79.2  62.8  67.9  79.7  76.1  33.8  75.1
M3     ResNet-50        -          X    -   X    58.6  79.4  72.4  53.1  79.1  67.5  62.6  79.6  75.9  33.7  75.0
M4     EfficientNet-B5  -          X    -   X    59.7  79.9  73.3  54.7  76.6  68.1  63.3  80.3  79.5  34.1  76.3
M5     EfficientNet-B5  X          X    -   X    61.5  80.7  75.6  57.2  80.6  72.5  64.6  80.9  77.9  36.8  77.3
M6     EfficientNet-B5  X          X    X   X    63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

4.4 Ablation Studies

In this section, we present extensive ablation studies on the various architectural components that we propose in our EfficientPS architecture, in comparison to their counterparts employed in state-of-the-art models. Primarily, we study the impact of our proposed network backbone, semantic head and panoptic fusion module on the overall panoptic segmentation performance of our network. We begin with a detailed analysis of the various components of our EfficientPS architecture, followed by comparisons of different encoder network topologies and FPN architectures for the network backbone. We then study the impact of different parameter configurations in our proposed semantic head and compare it with existing semantic head topologies. Finally, we assess the performance of our proposed panoptic fusion module by comparing it with different panoptic fusion methods proposed in the literature. For all the ablative experiments, we train our models on the Cityscapes fine annotations and evaluate them on the validation set. We use the PQ metric as the primary evaluation criterion for all the experiments presented in this section. Nevertheless, we also report the other metrics defined in the beginning of Section 4.
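For reference, the PQ, SQ and RQ metrics reported throughout these ablations follow the standard definitions of Kirillov et al. (2019), computed over the sets of true positives (TP), false positives (FP) and false negatives (FN) obtained by uniquely matching predicted and groundtruth segments with an IoU greater than 0.5:

\mathrm{PQ} = \mathrm{SQ} \times \mathrm{RQ}, \qquad
\mathrm{SQ} = \frac{\sum_{(p,g)\in TP} \mathrm{IoU}(p,g)}{|TP|}, \qquad
\mathrm{RQ} = \frac{|TP|}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|}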

4.4.1 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various components that we propose in our EfficientPS architecture. Results from this experiment are shown in Table 7. The basic model M1 employs the same network configuration and panoptic fusion heuristics as Kirillov et al. (2019). It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNN for the instance head. The semantic head of this network is comprised of an upsampling stage which has a 3×3 convolution, group norm (Wu and He, 2018), ReLU and ×2 bilinear upsampling. At each FPN level, this upsampling stage is repeated until the feature maps are at 1/4 scale of the input. These resulting feature maps are then summed element-wise and passed through a 1×1 convolution, followed by ×4 bilinear upsampling and softmax to yield the semantic segmentation output. This model M1 achieves a PQ of 58.2%, an AP of 31.6% and an mIoU score of 74.6%.
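To make the baseline configuration concrete, the following is a minimal PyTorch sketch of the M1 semantic head described above. The channel widths, the number of classes and the handling of the P4 level (already at 1/4 scale, so it only receives a convolution stage) are illustrative assumptions, not the authors' reference implementation.

```python
# Baseline (Kirillov et al., 2019-style) semantic head: per-level
# [3x3 conv -> GroupNorm -> ReLU -> x2 bilinear upsampling] stages repeated
# until 1/4 of the input resolution, element-wise sum over levels, then a
# 1x1 conv, x4 upsampling and softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleStage(nn.Module):
    def __init__(self, in_ch, out_ch, upsample=True):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.norm = nn.GroupNorm(32, out_ch)
        self.upsample = upsample

    def forward(self, x):
        x = F.relu(self.norm(self.conv(x)))
        if self.upsample:
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return x

class BaselineSemanticHead(nn.Module):
    def __init__(self, fpn_ch=256, mid_ch=128, num_classes=19):
        super().__init__()
        # Number of x2 upsampling stages needed to bring P32/P16/P8/P4 to 1/4 scale.
        self.branches = nn.ModuleList([
            nn.Sequential(*[UpsampleStage(fpn_ch if i == 0 else mid_ch, mid_ch) for i in range(n)])
            if n > 0 else UpsampleStage(fpn_ch, mid_ch, upsample=False)
            for n in (3, 2, 1, 0)   # P32, P16, P8, P4
        ])
        self.classifier = nn.Conv2d(mid_ch, num_classes, 1)

    def forward(self, p32, p16, p8, p4):
        fused = sum(branch(x) for branch, x in zip(self.branches, (p32, p16, p8, p4)))
        logits = F.interpolate(self.classifier(fused), scale_factor=4,
                               mode="bilinear", align_corners=False)
        return torch.softmax(logits, dim=1)
```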

The next model M2, which incorporates our proposed panoptic fusion module, achieves an improvement of 0.6% in PQ, 2.2% in AP and 0.8% in the mIoU score without increasing the number of parameters. This increase in performance demonstrates that the adaptive fusion of semantic and instance head outputs is effective in resolving the inherent overlap conflict. In the M3 model, we replace all the standard convolutions in the instance head with separable convolutions, which reduces the number of parameters of the model by 2.09 M with a drop of 0.2% in PQ and a 0.1% drop in AP and mIoU score. However, from the aspect of having an efficient model, a reduction of 5% of the model parameters for a drop of 0.2% in PQ can be considered a reasonable trade-off. Therefore, we employ separable convolutions in the instance head of our proposed EfficientPS architecture.
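The parameter saving from this replacement can be illustrated with a small, self-contained comparison of a standard 3×3 convolution and its depthwise separable counterpart; the channel width of 256 is an illustrative assumption, not the exact configuration of the instance head.

```python
# Standard 3x3 convolution vs. depthwise separable 3x3 convolution
# (depthwise spatial filtering followed by a 1x1 pointwise projection).
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)
separable = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256),  # depthwise
    nn.Conv2d(256, 256, kernel_size=1),                         # pointwise
)
print(count_params(standard), count_params(separable))  # roughly 590K vs. 68K parameters
```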

In the M4 model, we replace the ResNet-50 encoder with our modified EfficientNet-B5 encoder that does not have any squeeze-and-excitation connections, and we replace all the normalization layers and ReLU activations with iABN sync and leaky ReLU. This model achieves a PQ of 59.7%, which is an improvement of 1.1% in PQ over the M3 model, and a larger improvement is also observed in the mIoU score. The improvement in performance can be attributed to the richer representational capacity of the EfficientNet-B5 architecture. Subsequently, in the M5 model, we replace the standard FPN with our proposed 2-way FPN, which additionally improves the performance by 1.8% in PQ and 2.7% in AP. The addition of the parallel bottom-up branch in our 2-way FPN enables bidirectional flow of information, thus breaking away from the limitation of the standard FPN.

Finally, we incorporate our proposed semantic head into the M6 model, which fuses and aligns multi-scale features effectively, enabling it to achieve a PQ of 63.9%. Although our semantic head contributes to this improvement of 2.4% in the PQ score, the gain cannot be solely attributed to the semantic head. This is due to the fact that if we employ standard panoptic fusion heuristics, an improvement in semantic


Table 8 Performance comparison of various encoder topologies employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Encoder                       Params (M)  PQ    SQ    RQ    PQ^Th SQ^Th RQ^Th PQ^St SQ^St RQ^St AP    mIoU
MobileNetV3                   5.40        55.8  78.1  70.2  50.4  77.4  67.1  59.8  78.6  72.4  29.1  72.2
ResNet-50                     25.60       60.3  80.1  72.6  55.3  79.9  68.9  63.9  80.3  75.3  34.9  76.1
ResNet-101                    44.50       61.1  80.3  75.1  56.5  80.1  71.9  64.2  80.5  77.4  35.9  77.2
Xception-71                   27.50       62.1  81.1  75.4  58.5  80.9  72.3  64.7  81.2  77.7  36.2  78.1
ResNeXt-101                   86.74       63.2  81.2  76.0  59.6  80.4  72.9  65.8  81.7  78.3  36.9  78.9
Mod. EfficientNet-B5 (Ours)   30.00       63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

Table 9 Performance comparison of various FPN architectures employed in the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Architecture        PQ    SQ    RQ    PQ^Th SQ^Th RQ^Th PQ^St SQ^St RQ^St AP    mIoU
Bottom-Up FPN       60.4  80.6  73.7  56.3  80.4  69.9  63.4  80.8  76.4  35.2  75.3
Top-Down FPN        62.2  80.9  75.7  58.1  80.1  72.4  65.1  81.4  78.0  36.5  78.2
PANet FPN           63.1  81.1  75.5  59.4  80.3  72.3  65.8  81.6  77.8  37.1  78.8
2-way FPN (Ours)    63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

segmentation would only contribute to an increase in the PQ^St score. However, our proposed adaptive panoptic fusion yields an improvement in PQ^Th as well, which is evident from the overall improvement in the PQ score. We denote this M6 model configuration as EfficientPS in this work. In the following sections, we further analyze the individual architectural components of the M6 model in more detail.

4.4.2 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al., 2019), ResNet-50 (He et al., 2016), ResNet-101 (He et al., 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al., 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have comparable or fewer parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al., 2017), bottom-up FPN and PANet FPN variants. We refer to the FPN architecture described in Liu et al. (2018), in which the top-down path is followed by a bottom-up path, as the PANet FPN. The results from comparing with the various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. These models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os denotes the output of our 2-way FPN at the pyramid scale level with downsampling factor os, and c^k_f refers to a convolution layer with f filters and a k×k kernel size. LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metric values are in (%).

Model  P32                 P16                 P8                  P4                  Feature Correlation  PQ    SQ    RQ    PQ^Th SQ^Th RQ^Th PQ^St SQ^St RQ^St AP    mIoU
M61    [c^3_128, c^3_128]  [c^3_128, c^3_128]  [c^3_128, c^3_128]  [c^3_128, c^3_128]  -                    61.6  80.6  75.7  57.3  80.4  72.6  64.7  80.7  77.9  36.7  77.2
M62    LSFE                LSFE                LSFE                LSFE                -                    61.7  80.5  75.4  57.9  80.5  72.6  64.5  80.6  77.4  36.6  77.4
M63    DPC                 LSFE                LSFE                LSFE                -                    62.3  80.9  75.9  57.9  80.4  72.0  65.6  81.3  78.8  36.8  78.1
M64    DPC                 DPC                 LSFE                LSFE                -                    62.9  81.0  75.7  59.0  80.5  71.4  65.8  81.4  78.7  37.0  78.6
M65    DPC                 DPC                 DPC                 LSFE                -                    62.4  80.8  75.4  58.8  80.3  72.4  65.1  81.1  77.6  36.7  78.2
M66    DPC                 DPC                 LSFE                LSFE                X                    63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

the standard FPN, in a sequential or parallel manner respectively. We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.
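The following schematic PyTorch sketch illustrates the parallel two-path aggregation discussed above: a top-down and a bottom-up path operate on the same encoder features and their per-level outputs are summed. The encoder channel sizes are placeholders, and the iABN sync normalization and leaky ReLU activations used in EfficientPS are omitted; this conveys only the bidirectional information flow, not the exact 2-way FPN implementation.

```python
# Parallel top-down and bottom-up aggregation over a feature pyramid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoWayFPN(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), out_ch=256):
        super().__init__()
        self.lateral_td = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.lateral_bu = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.out_convs = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_channels)

    def forward(self, feats):  # feats: [C4, C8, C16, C32], high to low resolution
        # Top-down path: start from the coarsest level and add upsampled context.
        td = [l(f) for l, f in zip(self.lateral_td, feats)]
        for i in range(len(td) - 2, -1, -1):
            td[i] = td[i] + F.interpolate(td[i + 1], size=td[i].shape[-2:], mode="nearest")
        # Bottom-up path: start from the finest level and add downsampled detail.
        bu = [l(f) for l, f in zip(self.lateral_bu, feats)]
        for i in range(1, len(bu)):
            bu[i] = bu[i] + F.interpolate(bu[i - 1], size=bu[i].shape[-2:], mode="nearest")
        # Sum the two parallel paths and apply the output convolutions.
        return [conv(t + b) for conv, t, b in zip(self.out_convs, td, bu)]
```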

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we sequentially employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activation at each level of the 2-way FPN. This series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output at 1/4 scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output of the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model; therefore, we employ separable convolutions in all the experiments that follow.
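A minimal sketch of the LSFE block as described above, two cascaded 3×3 separable convolutions with 128 output filters, each followed by normalization and leaky ReLU, is given below; plain BatchNorm2d stands in for the iABN sync layers used in the paper, and the input channel width is an assumption.

```python
# Large Scale Feature Extractor (LSFE): captures characteristic (fine) features.
import torch.nn as nn

def separable_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),  # depthwise
        nn.Conv2d(in_ch, out_ch, 1, bias=False),                          # pointwise
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(inplace=True),
    )

class LSFE(nn.Module):
    def __init__(self, in_ch=256, out_ch=128):
        super().__init__()
        self.block = nn.Sequential(separable_conv(in_ch, out_ch),
                                   separable_conv(out_ch, out_ch))

    def forward(self, x):
        return self.block(x)
```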

In the M63 model, we replace the LSFE module in the P32 level of the 2-way FPN with dense prediction cells (DPC) as described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score, which can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module in the P16 level with DPC, and in the subsequent M65 model, we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that DPC, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with the fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, which results in better object boundary refinement.

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al. (2019), which we denote as the baseline, as well as those of UPSNet (Xiong et al., 2019) and Seamless (Porzi et al., 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across the different experiments while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Semantic Head   PQ    SQ    RQ    PQ^Th SQ^Th RQ^Th PQ^St SQ^St RQ^St AP    mIoU
Baseline        61.5  80.7  75.6  57.2  80.6  72.5  64.6  80.9  77.9  36.8  77.3
UPSNet          62.0  81.0  74.7  58.5  80.5  70.9  64.5  81.3  77.5  35.9  76.1
Seamless        62.9  81.1  75.5  58.9  80.4  71.3  65.6  81.6  78.5  36.8  78.5
Ours            63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the models trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All values are in (%).

Model      PQ    SQ    RQ    PQ^Th SQ^Th RQ^Th PQ^St SQ^St RQ^St
Baseline   62.4  80.8  75.4  58.7  80.4  72.6  65.1  81.1  77.4
TASCNet    62.5  80.9  75.6  58.6  80.5  72.8  65.3  81.2  77.7
UPSNet     63.1  81.3  76.1  59.5  80.6  73.2  65.7  81.8  78.2
Ours       63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2

The semantic head of UPSNet, which is essentially a sub-network comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs their MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, even though the levels are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at the different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as the baseline as they are extensively used in several panoptic segmentation networks. We then compare with the Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in Xiong et al. (2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and instance head, which have an inherent overlap, is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask that selects which pixels to take from the instance segmentation output and which pixels to take from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them, and therefore relies on the mode computed over the segmentation outputs to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.
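As an illustration of the adaptive fusion idea evaluated here, the snippet below sketches how a 'thing' instance mask logit map from the instance head could be fused with the corresponding class channel of the semantic head based on their sigmoid mask confidences, instead of hard-selecting one of the two as in the baseline heuristic. This single-instance form is a simplified assumption for illustration, not the full panoptic fusion module described earlier in the paper.

```python
# Sketch of confidence-weighted fusion of instance and semantic logits for one
# detected 'thing' object; the fused 'thing' logits would then be concatenated
# with the 'stuff' logits and a per-pixel argmax yields the panoptic prediction.
import torch

def fuse_instance_logits(ml_inst, ml_sem):
    # ml_inst: instance-head mask logits for one detected object (H x W)
    # ml_sem:  semantic-head logits of the same class, masked to the object box (H x W)
    return (torch.sigmoid(ml_inst) + torch.sigmoid(ml_sem)) * (ml_inst + ml_sem)
```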

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model, Panoptic-DeepLab, does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding


[Figure 6 shows, for each example, the input image, the Seamless output, the EfficientPS output and the improvement/error map; rows (a)-(b) are from Cityscapes, (c)-(d) from Mapillary Vistas, (e)-(f) from KITTI and (g)-(h) from IDD.]

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on the different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


datasets. For each example, we show the input image and the corresponding panoptic segmentation outputs from the Seamless model and our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.
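The improvement/error maps can be reproduced from the two panoptic predictions and the groundtruth with a few lines of array logic; the sketch below is illustrative, with hypothetical array names, and treats any label disagreement with the groundtruth as a misclassification.

```python
# Green: EfficientPS correct while Seamless is wrong; red: both models wrong.
import numpy as np

def improvement_error_map(pred_efficientps, pred_seamless, groundtruth):
    ours_ok = pred_efficientps == groundtruth
    theirs_ok = pred_seamless == groundtruth
    vis = np.zeros(groundtruth.shape + (3,), dtype=np.uint8)
    vis[ours_ok & ~theirs_ok] = (0, 255, 0)    # improvement (green)
    vis[~ours_ok & ~theirs_ok] = (255, 0, 0)   # shared errors (red)
    return vis
```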

Figure 6 (a) and (b) show examples from the Cityscapes dataset, in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN, which effectively aggregates multi-scale features to learn semantically richer representations, and to the panoptic fusion module, which addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c), the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground while the other leg, which is on the pedal of the bicycle, is barely visible; hence the Seamless model misclassifies the object instance. In contrast, our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately resolve this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by the cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across the different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head, which correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearance of these two objects; however, our proposed EfficientPS model with our novel panoptic backbone has an extensive representational capacity, which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of


[Figure 7 shows four panels (i)-(iv) per row; rows (a)-(b) are from Cityscapes, (c)-(d) from Mapillary Vistas, (e)-(f) from KITTI and (g)-(h) from IDD.]

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances at multiple scales and with partial occlusion. The scenes also show diverse lighting conditions from dawn to dusk as well as seasonal changes.


traffic participants. These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from viewpoints uncommon among those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was only trained on this relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information; thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analyses and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and to exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P Pont-Tuset J Barron JT Marques F Malik J (2014)Multiscale combinatorial grouping In Proceedings of the Confer-ence on Computer Vision and Pattern Recognition pp 328ndash335

Badrinarayanan V Kendall A Cipolla R (2015) Segnet A deep convo-lutional encoder-decoder architecture for image segmentation arXivpreprint arXiv 151100561

Bai M Urtasun R (2017) Deep watershed transform for instance seg-mentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 5221ndash5229

Bremner JG Slater A (2008) Theories of infant development JohnWiley amp Sons

Brostow GJ Shotton J Fauqueur J Cipolla R (2008) Segmentation andrecognition using structure from motion point clouds In Europeanconference on computer vision Springer pp 44ndash57

Bulo SR Neuhold G Kontschieder P (2017) Loss max-pooling forsemantic image segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 7082ndash7091

Chen LC Papandreou G Kokkinos I Murphy K Yuille AL (2017a)Deeplab Semantic image segmentation with deep convolutional nets

EfficientPS Efficient Panoptic Segmentation 25

atrous convolution and fully connected crfs IEEE transactions onpattern analysis and machine intelligence 40(4)834ndash848

Chen LC Papandreou G Schroff F Adam H (2017b) Rethinking at-rous convolution for semantic image segmentation arXiv preprintarXiv170605587

Chen LC Collins M Zhu Y Papandreou G Zoph B Schroff F AdamH Shlens J (2018a) Searching for efficient multi-scale architecturesfor dense image prediction In Advances in Neural InformationProcessing Systems pp 8713ndash8724

Chen LC Zhu Y Papandreou G Schroff F Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image seg-mentation arXiv preprint arXiv180202611

Cheng B Collins MD Zhu Y Liu T Huang TS Adam H Chen LC(2019) Panoptic-deeplab A simple strong and fast baseline forbottom-up panoptic segmentation arXiv preprint arXiv191110194

Chollet F (2017) Xception Deep learning with depthwise separableconvolutions In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 1251ndash1258

Cordts M Omran M Ramos S Rehfeld T Enzweiler M Benenson RFranke U Roth S Schiele B (2016) The cityscapes dataset for se-mantic urban scene understanding In Proceedings of the Conferenceon Computer Vision and Pattern Recognition pp 3213ndash3223

Dai J He K Li Y Ren S Sun J (2016) Instance-sensitive fully convolu-tional networks In European Conference on Computer Vision pp534ndash549

Dai J Qi H Xiong Y Li Y Zhang G Hu H Wei Y (2017) Deform-able convolutional networks In Proceedings of the InternationalConference on Computer Vision pp 764ndash773

Gao N Shan Y Wang Y Zhao X Yu Y Yang M Huang K (2019)Ssap Single-shot instance segmentation with affinity pyramid InProceedings of the International Conference on Computer Vision pp642ndash651

Geiger A Lenz P Stiller C Urtasun R (2013) Vision meets roboticsThe kitti dataset International Journal of Robotics Research

de Geus D Meletis P Dubbelman G (2018) Panoptic segmentation witha joint semantic and instance segmentation network arXiv preprintarXiv180902110

Girshick R (2015) Fast r-cnn In Proceedings of the International Con-ference on Computer Vision pp 1440ndash1448

Glorot X Bengio Y (2010) Understanding the difficulty of trainingdeep feedforward neural networks In Proceedings of the thirteenthinternational conference on artificial intelligence and statistics pp249ndash256

Hariharan B Arbelaez P Girshick R Malik J (2014) Simultaneousdetection and segmentation In European Conference on ComputerVision pp 297ndash312

Hariharan B Arbelaez P Girshick R Malik J (2015) Hypercolumns forobject segmentation and fine-grained localization In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp447ndash456

He K Zhang X Ren S Sun J (2016) Deep residual learning for imagerecognition In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 770ndash778

He K Gkioxari G Dollar P Girshick R (2017) Mask r-cnn In Pro-ceedings of the International Conference on Computer Vision pp2961ndash2969

He X Gould S (2014a) An exemplar-based crf for multi-instance objectsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition

He X Gould S (2014b) An exemplar-based crf for multi-instance objectsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 296ndash303

Howard A Sandler M Chu G Chen LC Chen B Tan M Wang W ZhuY Pang R Vasudevan V et al (2019) Searching for mobilenetv3 InProceedings of the International Conference on Computer Vision pp1314ndash1324

Hu J Shen L Sun G (2018) Squeeze-and-excitation networks In Pro-ceedings of the Conference on Computer Vision and Pattern Recog-nition pp 7132ndash7141

Ioffe S Szegedy C (2015) Batch normalization Accelerating deepnetwork training by reducing internal covariate shift arXiv preprintarXiv150203167

Kaiser L Gomez AN Chollet F (2017) Depthwise separable convolu-tions for neural machine translation arXiv preprint arXiv170603059

Kang BR Kim HY (2018) Bshapenet Object detection and in-stance segmentation with bounding shape masks arXiv preprintarXiv181010327

Kirillov A He K Girshick R Rother C Dollar P (2019) Panopticsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 9404ndash9413

Kontschieder P Bulo SR Bischof H Pelillo M (2011) Structured class-labels in random forests for semantic image labelling In Proceed-ings of the International Conference on Computer Vision pp 2190ndash2197

Krahenbuhl P Koltun V (2011) Efficient inference in fully connectedcrfs with gaussian edge potentials In Advances in neural informa-tion processing systems pp 109ndash117

Li J Raventos A Bhargava A Tagawa T Gaidon A (2018a) Learningto fuse things and stuff arXiv preprint arXiv181201192

Li Q Arnab A Torr PH (2018b) Weakly-and semi-supervised panop-tic segmentation In Proceedings of the European Conference onComputer Vision (ECCV) pp 102ndash118

Li X Zhang L You A Yang M Yang K Tong Y (2019a) Globalaggregation then local distribution in fully convolutional networksarXiv preprint arXiv190907229

Li Y Qi H Dai J Ji X Wei Y (2017) Fully convolutional instance-aware semantic segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 2359ndash2367

Li Y Chen X Zhu Z Xie L Huang G Du D Wang X (2019b) Attention-guided unified network for panoptic segmentation In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp7026ndash7035

Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar PZitnick CL (2014) Microsoft coco Common objects in context InEuropean conference on computer vision Springer pp 740ndash755

Lin TY Dollar P Girshick R He K Hariharan B Belongie S (2017)Feature pyramid networks for object detection In Proceedings ofthe Conference on Computer Vision and Pattern Recognition pp2117ndash2125

Liu H Peng C Yu C Wang J Liu X Yu G Jiang W (2019) An end-to-end network for panoptic segmentation In Proceedings of theConference on Computer Vision and Pattern Recognition pp 6172ndash6181

Liu S Jia J Fidler S Urtasun R (2017) Sgn Sequential grouping net-works for instance segmentation In Proceedings of the InternationalConference on Computer Vision pp 3496ndash3504

Liu S Qi L Qin H Shi J Jia J (2018) Path aggregation network for in-stance segmentation In Proceedings of the Conference on ComputerVision and Pattern Recognition pp 8759ndash8768

Liu W Rabinovich A Berg AC (2015) Parsenet Looking wider to seebetter arXiv preprint arXiv 150604579

Long J Shelhamer E Darrell T (2015) Fully convolutional networksfor semantic segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition

Neuhold G Ollmann T Rota Bulo S Kontschieder P (2017) The mapil-lary vistas dataset for semantic understanding of street scenes InProceedings of the International Conference on Computer Vision pp4990ndash4999

Paszke A Gross S Massa F Lerer A Bradbury J Chanan G Killeen TLin Z Gimelshein N Antiga L et al (2019) Pytorch An imperativestyle high-performance deep learning library In Advances in NeuralInformation Processing Systems pp 8024ndash8035

26 Rohit Mohan Abhinav Valada

Pinheiro PO Collobert R Dollar P (2015) Learning to segment objectcandidates In Advances in Neural Information Processing Systemspp 1990ndash1998

Plath N Toussaint M Nakajima S (2009) Multi-class image segment-ation using conditional random fields and global classification InProceedings of the International Conference on Machine Learningpp 817ndash824

Porzi L Bulo SR Colovic A Kontschieder P (2019) Seamless scenesegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 8277ndash8286

Radwan N Valada A Burgard W (2018) Multimodal interaction-awaremotion prediction for autonomous street crossing arXiv preprintarXiv180806887

Ren M Zemel RS (2017) End-to-end instance segmentation with re-current attention In Proceedings of the Conference on ComputerVision and Pattern Recognition pp 6656ndash6664

Romera-Paredes B Torr PHS (2016) Recurrent instance segmentationIn European conference on computer vision Springer pp 312ndash329

Ros G Ramos S Granados M Bakhtiary A Vazquez D Lopez AM(2015) Vision-based offline-online perception paradigm for autonom-ous driving In IEEE Winter Conference on Applications of Com-puter Vision pp 231ndash238

Rota Bulo S Porzi L Kontschieder P (2018) In-place activated batch-norm for memory-optimized training of dnns In Proceedings ofthe Conference on Computer Vision and Pattern Recognition pp5639ndash5647

Russakovsky O Deng J Su H Krause J Satheesh S Ma S Huang ZKarpathy A Khosla A Bernstein M et al (2015) Imagenet largescale visual recognition challenge International Journal of ComputerVision 115(3)211ndash252

Shotton J Johnson M Cipolla R (2008) Semantic texton forests forimage categorization and segmentation In Proceedings of the Con-ference on Computer Vision and Pattern Recognition

Silberman N Sontag D Fergus R (2014) Instance segmentation ofindoor scenes using a coverage loss In European Conference onComputer Vision pp 616ndash631

Simonyan K Zisserman A (2014) Very deep convolutional networksfor large-scale image recognition arXiv preprint arXiv14091556

Sofiiuk K Barinova O Konushin A (2019) Adaptis Adaptive instanceselection network In Proceedings of the International Conferenceon Computer Vision pp 7355ndash7363

Sturgess P Alahari K Ladicky L Torr PH (2009) Combining appear-ance and structure from motion features for road scene understandingIn British Machine Vision Conference

Tan M Le QV (2019) Efficientnet Rethinking model scaling for convo-lutional neural networks arXiv preprint arXiv190511946

Tian Z He T Shen C Yan Y (2019) Decoders matter for semanticsegmentation Data-dependent decoding enables flexible feature ag-gregation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 3126ndash3135

Tighe J Niethammer M Lazebnik S (2014) Scene parsing with objectinstances and occlusion ordering In Proceedings of the Conferenceon Computer Vision and Pattern Recognition pp 3748ndash3755

Tu Z Chen X Yuille AL Zhu SC (2005) Image parsing Unifyingsegmentation detection and recognition International Journal ofComputer Vision 63(2)113ndash140

Uhrig J Cordts M Franke U Brox T (2016) Pixel-level encodingand depth layering for instance-level semantic labeling In GermanConference on Pattern Recognition pp 14ndash25

Valada A Dhall A Burgard W (2016a) Convoluted mixture of deepexperts for robust semantic segmentation In IEEERSJ Internationalconference on intelligent robots and systems (IROS) workshop stateestimation and terrain perception for all terrain mobile robots

Valada A Oliveira G Brox T Burgard W (2016b) Towards robustsemantic segmentation using deep fusion In Robotics Science andsystems (RSS 2016) workshop are the sceptics right Limits andpotentials of deep learning in robotics

Valada A Vertens J Dhall A Burgard W (2017) Adapnet Adaptivesemantic segmentation in adverse environmental conditions In Pro-ceedings of the IEEE International Conference on Robotics andAutomation pp 4644ndash4651

Valada A Radwan N Burgard W (2018) Incorporating semantic andgeometric priors in deep pose regression In Workshop on Learningand Inference in Robotics Integrating Structure Priors and Modelsat Robotics Science and Systems (RSS)

Valada A Mohan R Burgard W (2019) Self-supervised model adapta-tion for multimodal semantic segmentation International Journal ofComputer Vision DOI 101007s11263-019-01188-y special IssueDeep Learning for Robotic Vision

Varma G Subramanian A Namboodiri A Chandraker M Jawahar C(2019) Idd A dataset for exploring problems of autonomous naviga-tion in unconstrained environments In IEEE Winter Conference onApplications of Computer Vision (WACV) pp 1743ndash1751

Wu Y He K (2018) Group normalization In Proceedings of theEuropean Conference on Computer Vision (ECCV) pp 3ndash19

Xie S Girshick R Dollar P Tu Z He K (2017) Aggregated residualtransformations for deep neural networks In Proceedings of theConference on Computer Vision and Pattern Recognition pp 1492ndash1500

Xiong Y Liao R Zhao H Hu R Bai M Yumer E Urtasun R (2019)Upsnet A unified panoptic segmentation network In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp8818ndash8826

Xu P Davoine F Bordes JB Zhao H Denœux T (2016) Multimodalinformation fusion for urban scene understanding Machine Visionand Applications 27(3)331ndash349

Yang TJ Collins MD Zhu Y Hwang JJ Liu T Zhang X Sze VPapandreou G Chen LC (2019) Deeperlab Single-shot image parserarXiv preprint arXiv190205093

Yao J Fidler S Urtasun R (2012) Describing the scene as a whole Jointobject detection scene classification and semantic segmentationIn Proceedings of the Conference on Computer Vision and PatternRecognition pp 702ndash709

Yu F Koltun V (2015) Multi-scale context aggregation by dilated con-volutions arXiv preprint arXiv151107122

Zhang C Wang L Yang R (2010) Semantic segmentation of urbanscenes using dense depth maps In European Conference on Com-puter Vision Springer pp 708ndash721

Zhang Z Fidler S Urtasun R (2016) Instance-level segmentation forautonomous driving with deep densely connected mrfs In Proceed-ings of the Conference on Computer Vision and Pattern Recognitionpp 669ndash677

Zhao H Shi J Qi X Wang X Jia J (2017) Pyramid scene parsingnetwork In Proceedings of the Conference on Computer Vision andPattern Recognition pp 2881ndash2890

Zurn J Burgard W Valada A (2019) Self-supervised visual terrainclassification from unsupervised acoustic feature learning arXivpreprint arXiv191203227

  • 1 Introduction
  • 2 Related Works
  • 3 EfficientPS Architecture
  • 4 Experimental Results
  • 5 Conclusions
Page 17: arXiv:2004.02307v2 [cs.CV] 19 May 2020 · 16 48x 512x 1024 1 Input 3x 1024x 2048 P 32 128x 64x128 32x64 DPC 1 256x 256x512 256x 64x128 128x256 24x 512x 1024 40x 256x 512 64x 128x

EfficientPS Efficient Panoptic Segmentation 17

Table 7 Ablation study on various architectural contributions proposed in our EfficientPS model The performance is shown for the modelstrained on Cityscapes fine annotations and evaluated on the validation set SIH SH and PFM denotes Separable Instance Head Semantic Headand Panoptic Fusion Module respectively rsquo-rsquo refers to the standard configuration as Kirillov et al (2019) whereas rsquoXrsquo refers to our proposedconfiguration Superscripts St and Th refer to lsquostuffrsquo and lsquothingrsquo classes respectively

Model Encoder 2-way SIH SH PFM PQ SQ RQ PQTh SQTh RQTh PQSt SQSt RQSt AP mIoUFPN () () () () () () () () () () ()

M1 ResNet-50 - - - - 582 791 720 524 788 672 624 794 756 316 743M2 ResNet-50 - - - X 588 795 726 534 792 628 679 797 761 338 751M3 ResNet-50 - X - X 586 794 724 531 791 675 626 796 759 337 750M4 EfficientNet-B5 - X - X 597 799 733 547 766 681 633 803 795 341 763M5 EfficientNet-B5 X X - X 615 807 756 572 806 725 646 809 779 368 773M6 EfficientNet-B5 X X X X 639 815 771 607 812 741 662 818 792 383 793

44 Ablation Studies

In this section we present extensive ablation studies on thevarious architectural components that we propose in our Ef-ficientPS architecture in comparison to their counterpartsemployed in state-of-the-art models Primarily we study theimpact of our proposed network backbone semantic headand panoptic fusion module on the overall panoptic segment-ation performance of our network We begin with a detailedanalysis of various components of our EfficientPS architec-ture followed by comparisons of different encoder networktopologies and FPN architectures for the network backboneWe then study the impact of different parameter configur-ations in our proposed semantic head and its comparisonwith existing semantic head topologies Finally we assessthe performance of our proposed panoptic fusion module bycomparing with different panoptic fusion methods proposedin the literature For all the ablative experiments we train ourmodels on the Cityscapes fine annotations and evaluate it onthe validation set We use the PQ metric as the primary evalu-ation criteria for all the experiments presented in this sectionNevertheless we also report the other metrics defined in thebeginning of Section 4

441 Detailed Study on the EfficientPS Architecture

We first study the improvement due to the various compon-ents that we propose in our EfficientPS architecture Resultsfrom this experiment are shown in Table 7 The basic modelM1 employs the network configuration and panoptic fusionheuristics as Kirillov et al (2019) It uses the standard ResNet-50 with FPN as the backbone and incorporates Mask R-CNNfor the instance head The semantic head of this network iscomprised of an upsampling stage which has a 3times3 convolu-tion group norm (Wu and He 2018) ReLU and times2 bilinearupsampling At each FPN level this upsampling stage is re-peated until the feature maps are 14 scale of the input Theseresulting feature maps are then summed element-wise andpassed through a 1times1 convolution followed by times4 bilinearupsampling and softmax to yield the semantic segmentation

output This model M1 achieves a PQ of 582 AP of 316and an mIoU score of 746

The next model M2 that incorporates our proposed pan-optic fusion module achieves an improvement of 06 inPQ 22 in AP and 08 in the mIoU score without increas-ing the number of parameters This increase in performancedemonstrates that the adaptive fusion of semantic and in-stance head outputs is effective in resolving the inherentoverlap conflict In the M3 model we replace all the standardconvolutions in the instance head with separable convolu-tions which reduces the number of parameters of the modelby 209M with a drop of 02 in PQ 01 drop in AP andmIoU score However from the aspect of having an efficientmodel a reduction of 5 of the model parameters for a dropof 02 in PQ can be considered as a reasonable trade-offTherefore we employ separable convolutions in the instancehead of our proposed EfficientPS architecture

In the M4 model we replace the ResNet-50 encoder withour modified EfficientNet-B5 encoder that does not have anysqueeze-and-excitation connections and we replace all thenormalization layers and ReLU activations with iABN syncand leaky ReLU This model achieves a PQ of 597 whichis an improvement of 11 in PQ over the M3 model and alarger improvement is also observed in the mIoU score Theimprovement in performance can be attributed to the richerrepresentational capacity of the EfficientNet-B5 architectureSubsequently in the M5 model we replace the standard FPNwith our proposed 2-way FPN which additionally improvesthe performance by 18 in PQ and 27 in AP The additionof the parallel bottom-up branch in our 2-way FPN enablesbidirectional flow of information thus breaking away fromthe limitation of the standard FPN

Finally we incorporate our proposed semantic head intothe M6 model that fuses and aligns multi-scale features effect-ively which enables it to achieve a PQ of 639 Althoughour semantic head contributes to this improvement of 24in the PQ score it cannot not be solely attributed to the se-mantic head This is due to the fact that if we employ standardpanoptic fusion heuristics an improvement in semantic seg-

18 Rohit Mohan Abhinav Valada

Table 8 Performance comparison of various encoder topologies employed in the M6 model Results are shown for the models trained on theCityscapes fine annotations and evaluated on the validation set Superscripts St and Th refer to lsquostuffrsquo and lsquothingrsquo classes respectively

Encoder Params PQ SQ RQ PQTh SQTh RQTh PQSt SQSt RQSt AP mIoU(M) () () () () () () () () () () ()

MobileNetV3 540 558 781 702 504 774 671 598 786 724 291 722ResNet-50 2560 603 801 726 553 799 689 639 803 753 349 761ResNet-101 4450 611 803 751 565 801 719 642 805 774 359 772Xception-71 2750 621 811 754 585 809 723 647 812 777 362 781ResNeXt-101 8674 632 812 760 596 804 729 658 817 783 369 789Mod EfficientNet-B5 (Ours) 3000 639 815 771 607 812 741 662 818 792 383 793

Table 9 Performance comparison of various FPN architectures employed in the M6 model Results are shown for the models trained on theCityscapes fine annotations and evaluated on the validation set Superscripts St and Th refer to lsquostuffrsquo and lsquothingrsquo classes respectively

Architecture PQ SQ RQ PQTh SQTh RQTh PQSt SQSt RQSt AP mIoU() () () () () () () () () () ()

Bottom-Up FPN 604 806 737 563 804 699 634 808 764 352 753Top-Down FPN 622 809 757 581 801 724 651 814 780 365 782PANet FPN 631 811 755 594 803 723 658 816 778 371 7882-way FPN (Ours) 639 815 771 607 812 741 662 818 792 383 793

mentation would only contribute to an increase in PQst scoreHowever our proposed adaptive panoptic fusion yields animprovement in PQth as well which is evident from the over-all improvement in the PQ score We denote this M6 modelconfiguration as EfficientPS in this work In the followingsections we further analyze the individual architectural com-ponents of the M6 model in more detail

442 Comparison of Encoder Topologies

There are numerous network architectures that have been proposed for addressing the task of image classification. Typically, these networks serve as the encoder or feature extractor for more complex tasks such as panoptic segmentation. In this section, we evaluate the performance of our proposed modified EfficientNet-B5 in comparison to five widely employed encoder architectures. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the encoder. More specifically, we compare with MobileNetV3 (Howard et al., 2019), ResNet-50 (He et al., 2016), ResNet-101 (He et al., 2016), Xception-71 (Chollet, 2017), ResNeXt-101 (Xie et al., 2017) and EfficientNet-B5 (Tan and Le, 2019). Results from this experiment are presented in Table 8. We observe that our modified EfficientNet-B5 architecture yields the highest PQ score, closely followed by the ResNeXt-101 architecture. However, ResNeXt-101 has an additional 56.74 M parameters, which is more than twice the number of parameters consumed by our modified EfficientNet-B5 architecture. We can see that the other encoder models, especially MobileNetV3, ResNet-50 and Xception-71, have comparable or fewer parameters than our modified EfficientNet-B5; however, they also yield a substantially lower PQ score. Therefore, we employ our modified EfficientNet-B5 as the encoder backbone in our proposed EfficientPS architecture.

4.4.3 Evaluation of the 2-way FPN

In this section, we compare the performance of our novel 2-way FPN with other existing FPN variants. For a fair comparison, we keep all the other components of our EfficientPS network the same and only replace the 2-way FPN in the backbone. We compare with the top-down FPN (Lin et al., 2017), bottom-up FPN and PANet FPN variants. We refer to the FPN architecture described in Liu et al. (2018) as PANet FPN, in which the top-down path is followed by a bottom-up path. The results from comparing with various FPN architectures are shown in Table 9.

The top-down FPN model predominantly propagates semantically high-level features which describe entire objects, whereas the bottom-up FPN model propagates low-level information such as local textures and patterns. The EfficientPS model with the bottom-up FPN achieves a PQ of 60.4%, while the model with the top-down FPN achieves a PQ of 62.2%. Both these models achieve a performance which is 3.2% and 1.4% lower in PQ than our 2-way FPN, respectively. A similar trend can also be observed in the other metrics. The lower PQ scores of the individual bottom-up FPN and top-down FPN models substantiate the limitation of the unidirectional flow of information in the standard FPN topology. Both the PANet FPN and our proposed 2-way FPN aim to mitigate this problem by adding another bottom-up path to the standard FPN, in a sequential or parallel manner respectively.


Table 10 Ablation study on our semantic head topology incorporated into the M6 model. Results are shown for the models trained on the Cityscapes fine annotations and evaluated on the validation set. P_os denotes the output of our 2-way FPN at the ×os pyramid scale level. c^k_f refers to a convolution layer with f filters and a k×k kernel. LSFE refers to the Large Scale Feature Extractor and DPC refers to Dense Prediction Cells. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metrics are reported in %.

Model  P32                 P16                 P8                  P4                  Feature Correlation  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
M61    [c^3_128, c^3_128]  [c^3_128, c^3_128]  [c^3_128, c^3_128]  [c^3_128, c^3_128]  -                    61.6  80.6  75.7  57.3  80.4  72.6  64.7  80.7  77.9  36.7  77.2
M62    LSFE                LSFE                LSFE                LSFE                -                    61.7  80.5  75.4  57.9  80.5  72.6  64.5  80.6  77.4  36.6  77.4
M63    DPC                 LSFE                LSFE                LSFE                -                    62.3  80.9  75.9  57.9  80.4  72.0  65.6  81.3  78.8  36.8  78.1
M64    DPC                 DPC                 LSFE                LSFE                -                    62.9  81.0  75.7  59.0  80.5  71.4  65.8  81.4  78.7  37.0  78.6
M65    DPC                 DPC                 DPC                 LSFE                -                    62.4  80.8  75.4  58.8  80.3  72.4  65.1  81.1  77.6  36.7  78.2
M66    DPC                 DPC                 LSFE                LSFE                ✓                    63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

We observe that the model with our proposed 2-way FPN demonstrates an improvement of 0.5% in PQ over the model with the PANet FPN. This can be attributed to the fact that the multi-scale aggregation of features in parallel information pathways enhances the capability of the FPN.
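To make the parallel structure concrete, the following is a minimal sketch (not the reference implementation) of the 2-way FPN: laterally reduced encoder features are processed by a top-down and a bottom-up branch in parallel, the per-level outputs of the two branches are summed, and a 3x3 separable convolution produces the final P4, P8, P16 and P32 feature maps. The encoder channel sizes, the use of nearest-neighbour resizing for the scale changes and plain batch norm in place of iABN sync are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def separable_conv(ch):
    # 3x3 depthwise separable convolution followed by normalization and leaky ReLU.
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False),
        nn.Conv2d(ch, ch, 1, bias=False),
        nn.BatchNorm2d(ch),
        nn.LeakyReLU(0.01, inplace=True),
    )

class TwoWayFPN(nn.Module):
    def __init__(self, in_channels=(40, 64, 176, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.out_convs = nn.ModuleList(separable_conv(out_ch) for _ in in_channels)

    def forward(self, feats):
        # feats: encoder outputs ordered from the highest to the lowest resolution.
        lat = [l(f) for l, f in zip(self.lateral, feats)]

        # Top-down branch: propagate coarse, semantically rich features downwards.
        td = [None] * len(lat)
        td[-1] = lat[-1]
        for i in range(len(lat) - 2, -1, -1):
            td[i] = lat[i] + F.interpolate(td[i + 1], size=lat[i].shape[-2:], mode="nearest")

        # Bottom-up branch: propagate fine, low-level features upwards in parallel.
        bu = [None] * len(lat)
        bu[0] = lat[0]
        for i in range(1, len(lat)):
            bu[i] = lat[i] + F.interpolate(bu[i - 1], size=lat[i].shape[-2:], mode="nearest")

        # Fuse the two parallel pathways at each level to obtain P4, P8, P16, P32.
        return [conv(t + b) for conv, t, b in zip(self.out_convs, td, bu)]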

4.4.4 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head considering two critical factors. First, since large-scale outputs comprise characteristic features and small-scale outputs consist of contextual features, they both should be captured distinctly by the semantic head. Second, while fusing small and large-scale outputs, the contextual features need to be aligned to obtain semantically reinforced fine features. In order to demonstrate that these two critical factors are essential, we perform ablative experiments on various configurations of our semantic head incorporated into the M6 model described in Section 4.4.1. Results from this experiment are presented in Table 10.

The outputs at each level of the 2-way FPN, P32, P16, P8 and P4, are the inputs to our semantic head. In the first M61 model configuration, we employ two cascaded 3×3 convolutions, iABN sync and leaky ReLU activation sequentially at each level of the 2-way FPN. The aforementioned series of layers constitutes the LSFE module, which is followed by a bilinear upsampling layer at each level of the 2-way FPN to yield an output which is 1/4 the scale of the input image. These upsampled features are then concatenated and passed through a 1×1 convolution and bilinear upsampling to yield an output which is the same scale as the input image. This M61 model achieves a PQ of 61.6%. In the subsequent M62 model configuration, we replace all the standard 3×3 convolutions with 3×3 separable convolutions in the LSFE module to reduce the number of parameters. This also yields a minor improvement in performance compared to the M61 model; therefore, we employ separable convolutions in all the experiments that follow.
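A minimal sketch of the LSFE module as used from the M62 configuration onwards is given below: two cascaded 3x3 separable convolutions with 128 output filters, each followed by normalization and leaky ReLU. As before, nn.BatchNorm2d is only a stand-in for iABN sync, and the 256 input channels are an assumption corresponding to the 2-way FPN outputs.

import torch.nn as nn

def sep_conv_norm_act(in_ch, out_ch):
    # 3x3 depthwise separable convolution + norm + leaky ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),  # depthwise
        nn.Conv2d(in_ch, out_ch, 1, bias=False),                          # pointwise
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.01, inplace=True),
    )

class LSFE(nn.Module):
    # Large Scale Feature Extractor: two cascaded 3x3 separable convolutions
    # with 128 filters each, applied to one level of the 2-way FPN output.
    def __init__(self, in_ch=256, out_ch=128):
        super().__init__()
        self.block = nn.Sequential(
            sep_conv_norm_act(in_ch, out_ch),
            sep_conv_norm_act(out_ch, out_ch),
        )

    def forward(self, x):
        return self.block(x)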

In the M63 model, we replace the LSFE module in the P32 level of the 2-way FPN with dense prediction cells (DPC) described in Section 3.2. This M63 model achieves an improvement of 0.6% in PQ and 0.7% in the mIoU score. This can be attributed to the ability of DPC to effectively capture long-range context. In the M64 model, we replace the LSFE module in the P16 level with DPC, and in the subsequent M65 model, we introduce DPC at both the P16 and P8 levels. We find that the M64 model achieves an improvement of 0.6% in PQ over M63; however, the performance drops in the M65 model by 0.5% in PQ when we add the DPC module at the P8 level. This can be attributed to the fact that the DPC module, consisting of dilated convolutions, does not capture characteristic features effectively at this large scale. The final M66 model is derived from the M64 model, to which we add our mismatch correction (MC) module along with the feature correlation connections as described in Section 3.2. This model achieves the highest PQ score of 63.9%, which is an improvement of 1.0% compared to the M64 model. This can be attributed to the MC module that correlates the semantically rich contextual features with fine features and subsequently merges them along the feature correlation connection to obtain semantically reinforced features, which results in better object boundary refinement.
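To show how these pieces fit together in the M66 configuration, the sketch below combines a DPC or LSFE output of a coarser level with the next finer level through an assumed MC module (cascaded separable convolutions followed by bilinear upsampling) and a feature correlation connection modeled here as an element-wise sum. The exact composition of the MC module and of the correlation operation is defined in Section 3.2; treat the details below, which reuse the sep_conv_norm_act helper from the LSFE sketch above, as illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MismatchCorrection(nn.Module):
    # Assumed structure: two cascaded 3x3 separable convolutions followed by
    # bilinear upsampling to the resolution of the next finer level.
    def __init__(self, ch=128):
        super().__init__()
        self.convs = nn.Sequential(sep_conv_norm_act(ch, ch), sep_conv_norm_act(ch, ch))

    def forward(self, coarse, fine):
        x = self.convs(coarse)
        x = F.interpolate(x, size=fine.shape[-2:], mode="bilinear", align_corners=False)
        # Feature correlation connection, modeled here as an element-wise sum,
        # aligns the contextual features with the fine features before fusion.
        return fine + x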

Additionally, we present experimental comparisons of our proposed semantic head against those that are used in other state-of-the-art panoptic segmentation architectures. Specifically, we compare against the semantic head proposed by Kirillov et al. (2019), which we denote as the baseline, UPSNet (Xiong et al., 2019) and Seamless (Porzi et al., 2019). For a fair comparison, we keep all the other components of the EfficientPS architecture the same across different experiments while only replacing the semantic head. Table 11 presents the results of this experiment.


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metrics are reported in %.

Semantic Head  PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt  AP    mIoU
Baseline       61.5  80.7  75.6  57.2  80.6  72.5  64.6  80.9  77.9  36.8  77.3
UPSNet         62.0  81.0  74.7  58.5  80.5  70.9  64.5  81.3  77.5  35.9  76.1
Seamless       62.9  81.1  75.5  58.9  80.4  71.3  65.6  81.6  78.5  36.8  78.5
Ours           63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2  38.3  79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively. All metrics are reported in %.

Model     PQ    SQ    RQ    PQTh  SQTh  RQTh  PQSt  SQSt  RQSt
Baseline  62.4  80.8  75.4  58.7  80.4  72.6  65.1  81.1  77.4
TASCNet   62.5  80.9  75.6  58.6  80.5  72.8  65.3  81.2  77.7
UPSNet    63.1  81.3  76.1  59.5  80.6  73.2  65.7  81.8  78.2
Ours      63.9  81.5  77.1  60.7  81.2  74.1  66.2  81.8  79.2

The semantic head of UPSNet, which is essentially a sub-network comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, which is an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs its MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, which are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.

4.4.5 Evaluation of the Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as a baseline as they are extensively used in several panoptic segmentation networks. We then compare with Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in Xiong et al. (2019), which we refer to as TASCNet and UPSNet in the results, respectively. Once again, for a fair comparison, we keep all the other network components the same across different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and instance head, which have an inherent overlap, is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them and therefore relies on the mode computed over the segmentation output to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.
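The following sketch illustrates the core of the adaptive fusion for a single 'thing' instance. It is a simplified illustration under the assumption that the fused logit is the sum of the instance-head and semantic-head logits weighted by the sum of their sigmoid confidences (the exact formulation is given in the earlier description of the panoptic fusion module), so that whichever head is more confident at a pixel dominates the fused output.

import torch

def fuse_instance_logits(ml_instance, ml_semantic):
    # ml_instance: (H, W) mask logits of one instance, resized and padded to image size.
    # ml_semantic: (H, W) semantic logits of the instance's predicted class,
    #              zeroed outside its bounding box.
    confidence = torch.sigmoid(ml_instance) + torch.sigmoid(ml_semantic)
    return confidence * (ml_instance + ml_semantic)

# The fused 'thing' logits of all instances are then concatenated with the
# 'stuff' logits from the semantic head, and a per-pixel argmax yields the
# intermediate panoptic prediction.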

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model, Panoptic-DeepLab, does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible. Figure 6 presents two examples from the validation sets of each of the urban scene understanding datasets.


[Figure 6 layout: columns show the input image, the Seamless output, the EfficientPS output and the improvement/error map; rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD.]

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


For each example, we show the input image, the corresponding panoptic segmentation output from the Seamless model and the output from our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS model made the right prediction while the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN, which effectively aggregates multi-scale features to learn semantically richer representations, and the panoptic fusion module, which addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c), the group of people towards the left side of the image who are behind the fence are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground and the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately address this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by cars behind it, which causes the Seamless model to not detect the truck as a separate instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are more well defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head, which correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearances of these two objects; however, our proposed EfficientPS model with our novel panoptic backbone has an extensive representational capacity, which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, which is overlaid on the input image. Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of traffic participants.


[Figure 7 layout: four example columns (i)-(iv) per dataset; rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD.]

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances in multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk as well as seasonal changes.


These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from uncommon viewpoints compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes, even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analysis and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and exceed the state-of-the-art on each of them while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335
Badrinarayanan V, Kendall A, Cipolla R (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229
Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons
Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In European Conference on Computer Vision, Springer, pp 44–57
Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848
Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems, pp 8713–8724
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-DeepLab: A simple, strong and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In European Conference on Computer Vision, pp 534–549
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In Proceedings of the International Conference on Computer Vision, pp 764–773
Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) SSAP: Single-shot instance segmentation with affinity pyramid. In Proceedings of the International Conference on Computer Vision, pp 642–651
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The KITTI dataset. International Journal of Robotics Research
de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110
Girshick R (2015) Fast R-CNN. In Proceedings of the International Conference on Computer Vision, pp 1440–1448
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256
Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In European Conference on Computer Vision, pp 297–312
Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask R-CNN. In Proceedings of the International Conference on Computer Vision, pp 2961–2969
He X, Gould S (2014a) An exemplar-based CRF for multi-instance object segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition
He X, Gould S (2014b) An exemplar-based CRF for multi-instance object segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for MobileNetV3. In Proceedings of the International Conference on Computer Vision, pp 1314–1324
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059
Kang BR, Kim HY (2018) BshapeNet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327
Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413
Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In Proceedings of the International Conference on Computer Vision, pp 2190–2197
Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, pp 109–117
Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192
Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118
Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229
Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In European Conference on Computer Vision, Springer, pp 740–755
Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181
Liu S, Jia J, Fidler S, Urtasun R (2017) SGN: Sequential grouping networks for instance segmentation. In Proceedings of the International Conference on Computer Vision, pp 3496–3504
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768
Liu W, Rabinovich A, Berg AC (2015) ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition
Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In Proceedings of the International Conference on Computer Vision, pp 4990–4999
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp 8024–8035
Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In Advances in Neural Information Processing Systems, pp 1990–1998
Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In Proceedings of the International Conference on Machine Learning, pp 817–824
Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286
Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887
Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664
Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In European Conference on Computer Vision, Springer, pp 312–329
Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In IEEE Winter Conference on Applications of Computer Vision, pp 231–238
Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of DNNs. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition
Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In European Conference on Computer Vision, pp 616–631
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sofiiuk K, Barinova O, Konushin A (2019) AdaptIS: Adaptive instance selection network. In Proceedings of the International Conference on Computer Vision, pp 7355–7363
Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In British Machine Vision Conference
Tan M, Le QV (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135
Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision 63(2):113–140
Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In German Conference on Pattern Recognition, pp 14–25
Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots
Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics
Valada A, Vertens J, Dhall A, Burgard W (2017) AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651
Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)
Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, Special Issue: Deep Learning for Robotic Vision
Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751
Wu Y, He K (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) UPSNet: A unified panoptic segmentation network. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826
Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349
Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) DeeperLab: Single-shot image parser. arXiv preprint arXiv:1902.05093
Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In European Conference on Computer Vision, Springer, pp 708–721
Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890
Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227

  • 1 Introduction
  • 2 Related Works
  • 3 EfficientPS Architecture
  • 4 Experimental Results
  • 5 Conclusions
Page 18: arXiv:2004.02307v2 [cs.CV] 19 May 2020 · 16 48x 512x 1024 1 Input 3x 1024x 2048 P 32 128x 64x128 32x64 DPC 1 256x 256x512 256x 64x128 128x256 24x 512x 1024 40x 256x 512 64x 128x

18 Rohit Mohan Abhinav Valada

Table 8 Performance comparison of various encoder topologies employed in the M6 model Results are shown for the models trained on theCityscapes fine annotations and evaluated on the validation set Superscripts St and Th refer to lsquostuffrsquo and lsquothingrsquo classes respectively

Encoder Params PQ SQ RQ PQTh SQTh RQTh PQSt SQSt RQSt AP mIoU(M) () () () () () () () () () () ()

MobileNetV3 540 558 781 702 504 774 671 598 786 724 291 722ResNet-50 2560 603 801 726 553 799 689 639 803 753 349 761ResNet-101 4450 611 803 751 565 801 719 642 805 774 359 772Xception-71 2750 621 811 754 585 809 723 647 812 777 362 781ResNeXt-101 8674 632 812 760 596 804 729 658 817 783 369 789Mod EfficientNet-B5 (Ours) 3000 639 815 771 607 812 741 662 818 792 383 793

Table 9 Performance comparison of various FPN architectures employed in the M6 model Results are shown for the models trained on theCityscapes fine annotations and evaluated on the validation set Superscripts St and Th refer to lsquostuffrsquo and lsquothingrsquo classes respectively

Architecture PQ SQ RQ PQTh SQTh RQTh PQSt SQSt RQSt AP mIoU() () () () () () () () () () ()

Bottom-Up FPN 604 806 737 563 804 699 634 808 764 352 753Top-Down FPN 622 809 757 581 801 724 651 814 780 365 782PANet FPN 631 811 755 594 803 723 658 816 778 371 7882-way FPN (Ours) 639 815 771 607 812 741 662 818 792 383 793

mentation would only contribute to an increase in PQst scoreHowever our proposed adaptive panoptic fusion yields animprovement in PQth as well which is evident from the over-all improvement in the PQ score We denote this M6 modelconfiguration as EfficientPS in this work In the followingsections we further analyze the individual architectural com-ponents of the M6 model in more detail

442 Comparison of Encoder Topologies

There are numerous network architectures that have beenproposed for addressing the task of image classification Typ-ically these networks serve as the encoder or feature ex-tractor for more complex tasks such as panoptic segment-ation In this section we evaluate the performance of ourproposed modified EfficientNet-B5 in comparison to fivewidely employed encoder architectures For a fair compar-ison we keep all the other components of our EfficientPSnetwork the same and only replace encoder More specific-ally we compare with MobileNetV3 (Howard et al 2019)ResNet-50 (He et al 2016) ResNet-101 (He et al 2016)Xception-71 (Chollet 2017) ResNeXt-101 (Xie et al 2017)and EfficientNet-B5 (Tan and Le 2019) Results from thisexperiment are presented in Table 8 We observe that our mod-ified EfficientNet-B5 architecture yields the highest PQ scoreclosely followed by the ResNeXt-101 architecture HoweverResNext-101 has an additional 5674M parameters whichis more than twice the number of parameters consumed byour modified EfficientNet-B5 architecture We can see thatthe other encoder models especially MobileNetV3 ResNet-50 and Xception-71 have a comparable or fewer parameters

than our modified EfficientNet-B5 however they also yielda substantially lower PQ score Therefore we employ ourmodified EfficientNet-B5 as the encoder backbone in ourproposed EfficientPS architecture

443 Evaluation of the 2-way FPN

In this section we compare the performance of our novel2-way FPN with other existing FPN variants For a fair com-parison we keep all the other components of our EfficientPSnetwork the same and only replace the 2-way FPN in thebackbone We compare with the top-down FPN (Lin et al2017) bottom-up FPN and PANet FPN variants We refer tothe FPN architecture described in Liu et al (2018) as PANetFPN in which the top-down path is followed by a bottom-uppath The results from comparing with various FPN architec-tures are shown in Table 9

The top-down FPN model predominantly propagates se-mantically high-level features which describe entire objectswhereas the bottom-up FPN model propagates low-level in-formation such as local textures and patterns The EfficientPSmodel with the bottom-up FPN achieves a PQ of 604while the model with the top-down FPN achieves a PQ of622 Both these models achieve a performance which is32 and 14 lower in PQ than our 2-way FPN respectivelyA similar trend can also be observed in the other metricsThe lower PQ score of the individual bottom-up FPN andtop-down FPN models substantiate the limitation of the uni-directional flow of information in the standard FPN topologyBoth the PANet FPN and our proposed 2-way FPN aim tomitigate this problem by adding another bottom-up path to

EfficientPS Efficient Panoptic Segmentation 19

Table 10 Ablation study on our semantic head topology incorporated into the M6 model Results are shown for the models trained on the Cityscapesfine annotations and evaluated on the validation set Pos is the output of our 2-way FPN at the os pyramid scale level ck

f refers to a convolution layerwith f number of filters and ktimes k kernel size LSFE refers to Large Scale Feature Extractor and DPC refers to Dense Prediction Cells SuperscriptsSt and Th refer to lsquostuffrsquo and lsquothingrsquo classes respectively

Model P32 P16 P8 P4 Feature PQ SQ RQ PQTh SQTh RQTh PQSt SQSt RQSt AP mIoU

Correlation () () () () () () () () () () ()

M61 [c3128c3

128] [c3128c3

128] [c3128c3

128] [c3128c3

128] - 616 806 757 573 804 726 647 807 779 367 772

M62 LSFE LSFE LSFE LSFE - 617 805 754 579 805 726 645 806 774 366 774

M63 DPC LSFE LSFE LSFE - 623 809 759 579 804 720 656 813 788 368 781

M64 DPC DPC LSFE LSFE - 629 810 757 590 805 714 658 814 787 370 786

M65 DPC DPC DPC LSFE - 624 808 754 588 803 724 651 811 776 367 782

M66 DPC DPC LSFE LSFE X 639 815 771 607 812 741 662 818 792 383 793

the standard FPN in a sequential or parallel manner respect-ively We observe that the model with our proposed 2-wayFPN demonstrates an improvement of 05 in PQ over themodel with the PANet FPN This can be attributed to thefact that the multi-scale aggregation of features in parallelinformation pathways enhances the capability of the FPN

444 Detailed Study on the Semantic Head

We construct the topology of our proposed semantic head con-sidering two critical factors First since large-scale outputscomprise of characteristic features and small-scale outputsconsist of contextual features they both should be captureddistinctly by the semantic head Second while fusing smalland large-scale outputs the contextual features need to bealigned to obtain semantically reinforced fine features In or-der to demonstrate that these two critical factors are essentialwe perform ablative experiments on various configurations ofour semantic head incorporated into the M6 model describedin Section 444 Results from this experiment are presentedin Table 10

The output at each level of the 2-way FPN P32 P16 P8and P4 are the inputs to our semantic head In the first M61model configuration we employ two cascaded 3times3 convo-lutions iABN sync and leaky ReLU activation sequentiallyat each level of the 2-way FPN The aforementioned seriesof layers constitute the LSFE module which is followed by abilinear upsampling layer at each level of the 2-way FPN toyield an output which is 14 scale of the input image Theseupsampled features are then concatenated and passed througha 1times1 convolution and bilinear upsamplig to yield an outputwhich is the same scale as the input image This M61 modelachieves a PQ of 616 In the subsequent M62 model con-figuration we replace all the standard 3times3 convolutions with3times3 separable convolutions in the LSFE module to reducethe number of parameters This also yields a minor improve-ment in performance compared to the M61 model therefore

we employ separable convolutions in all the experiments thatfollow

In the M63 model we replace the LSFE module in theP32 level of the 2-way FPN with dense prediction cells (DPC)described in Section 32 This M63 model achieves an im-provement of 06 in PQ and 07 in the mIoU score Thiscan be attributed to the ability of DPC to effectively capturelong-range context In the M64 model we replace the LSFEmodule in the P16 level with DPC and in the subsequent M65model we introduce DPC at both P16 and P8 levels We findthat the M64 model achieves an improvement of 06 inPQ over M63 however the performance drops in the M65model by 05 in PQ when we add the DPC module at the P8level This can be attributed to the fact that DPC consistingof dilated convolutions do not capture characteristic featureseffectively at this large-scale The final M66 model is de-rived from the M64 model to which we add our mismatchcorrection (MC) module along with the feature correlationconnections as described in Section 32 This model achievesthe highest PQ score of 639 which is an improvement of10 compared to the M64 model This can be attributed tothe MC module that correlates the semantically rich contex-tual features with fine features and subsequently merges themalong the feature correlation connection to obtain semantic-ally reinforced features that results in better object boundaryrefinement

Additionally we present experimental comparisons ofour proposed semantic head against those that are used inother state-of-the-art panoptic segmentation architecturesSpecifically we compare against the semantic head proposedby Kirillov et al (2019) which we denote as the baselineUPSNet (Xiong et al 2019) and Seamless (Porzi et al 2019)For a fair comparison we keep all the other componentsof the EfficientPS architecture the same across different ex-periments while only replacing the semantic head Table 11presents the results of this experiment

20 Rohit Mohan Abhinav Valada

Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model Results are reported for the modeltrained on the Cityscapes fine annotations and evaluated on the validation set Superscripts St and Th refer to lsquostuffrsquo and lsquothingrsquo classes respectively

Semantic Head PQ SQ RQ PQTh SQTh RQTh PQSt SQSt RQSt AP mIoU() () () () () () () () () () ()

Baseline 615 807 756 572 806 725 646 809 779 368 773UPSNet 620 810 747 585 805 709 645 813 775 359 761Seamless 629 811 755 589 804 713 656 816 785 368 785Ours 639 815 771 607 812 741 662 818 792 383 793

Table 12 Performance comparison of our proposed panoptic fusionmodule with various other panoptic fusion mechanisms employed in theM6 model Results are reported for the model trained on the Cityscapesfine annotations and evaluated on the validation set Superscripts St andTh refer to lsquostuffrsquo and lsquothingrsquo classes respectively

Model PQ SQ RQ PQTh SQTh RQTh PQSt SQSt RQSt

() () () () () () () () ()

Baseline 624 808 754 587 804 726 651 811 774TASCNet 625 809 756 586 805 728 653 812 777UPSNet 631 813 761 595 806 732 657 818 782Ours 639 815 771 607 812 741 662 818 792

The semantic head of UPSNet which is essentially a sub-network comprising of sequential deformable convolutionlayers (Dai et al 2017) achieves a PQ score of 620 whichis an improvement of 05 over the baseline model Thesemantic head of the Seamless model employs their MiniDLmodule at each level of the 2-way FPN that further improvesthe PQ by 09 over semantic head of UPSNet The semanticheads of all these models use the same module at each levelof the 2-way FPN output which are of different scales Incontrast our proposed semantic head that employs a com-bination of LSFE and DPC modules at different levels ofthe 2-way FPN achieves the highest PQ score of 639 andconsistently outperforms the other semantic head topologiesin all the evaluation metrics

445 Evaluation of Panoptic Fusion Module

In this section we evaluate the performance of our proposedpanoptic fusion module in comparison to other existing pan-optic fusion mechanisms First we compare with the panopticfusion heuristics introduced by Kirillov et al (2019) whichwe consider as a baseline as it is extensively used in sev-eral panoptic segmentation networks We then compare withMask-Guided fusion (Li et al 2018a) and the panoptic fusionheuristics proposed in (Xiong et al 2019) which we refer toas TASCNet and UPSNet in the results respectively Onceagain for a fair comparison we keep all the other networkcomponents the same across different experiments and onlychange the panoptic fusion mechanism

Table 12 presents results from this experiment Combin-ing the outputs of the semantic head and instance head thathave an inherent overlap is one of the critical challengesfaced by panoptic segmentation networks The baseline ap-proach directly chooses the output of the instance head ieif there is an overlap between predictions of the lsquothingrsquo andlsquostuffrsquo classes for a given pixel the baseline heuristic clas-sifies the pixel as a lsquothingrsquo class and assigns it an instanceID This baseline approach achieves the lowest performanceof 624 in PQ demonstrating that this fusion problem ismore complex than just assigning the output from one ofthe heads The Mask-Guided fusion method of TASCNetseeks to address this problem by using a segmentation maskThe mask selects which pixel to consider from the instancesegmentation output and which pixel to consider from thesemantic segmentation output This fusion approach achievesa PQ of 625 which is comparable to the baseline methodSubsequently the model that employs the UPSNet fusionheuristics achieves a larger improvement with a PQ score of631 This method leverages the logits of both the headsby adding them and therefore it relies on the mode computedover-segmentation output to resolve the conflict betweenlsquostuffrsquo and lsquothingrsquo classes However our proposed adaptivefusion method that dynamically fuses the outputs from boththe heads achieves the highest PQ score of 639 which isan improvement of 08 over the UPSNet method We alsoobserve a consistently higher performance in all the othermetrics

45 Qualitative Evaluations

In this section we qualitatively evaluate the panoptic seg-mentation performance of our proposed EfficientPS archi-tecture in comparison to the state-of-the-art Seamless (Porziet al 2019) model on each of the datasets that we benchmarkon We use the publicly available official implementationof the Seamless architecture to obtain the outputs for thequalitative comparisons The best performing state-of-the-artmodel Panoptic-Deeplab does not provide any publicly avail-able implementation or pre-trained models which makes suchcomparisons infeasible Figure 6 presents two examples fromthe validation sets of each of the urban scene understanding

EfficientPS Efficient Panoptic Segmentation 21

Input Image Seamless Output EfficientPS Output ImprovementError Map

(a)C

itysc

apes

(b)C

itysc

apes

(c)M

apill

ary

Vis

tas

(d)M

apill

ary

Vis

tas

(e)K

ITT

I(f

)KIT

TI

(g)I

DD

(h)I

DD

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architec-ture (Porzi et al 2019) on different benchmark datasets In addition to the panoptic segmentation output we also show the improvementerror mapwhich denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green and the pixels thatare misclassified by both the EfficientPS model and the Seamless model in red

22 Rohit Mohan Abhinav Valada

dataset For each example we show the input image the cor-responding panoptic segmentation output from the Seamlessmodel and our proposed EfficientPS model Additionally weshow the improvement and error map where a green pixelindicates that our EfficientPS made the right prediction butthe Seamless model misclassified it (improvement of Effi-cientPS over Seamless) and a red pixel denotes that bothmodels misclassified it with respect to the groundtruth

Figure 6 (a) and (b) show examples from the Cityscapesdataset in which the improvement over the Seamless modelcan be seen in the ability to segment heavily occluded lsquothingrsquoclass instances In the first example the truck far behind onthe bridge is occluded by cars and a cyclist and in the secondexample the distant car parked on the left side of the imageis only partially visible as the car in the front occludes itWe observe from the improvement maps that our proposedEfficientPS model accurately detect classify and segmentthese instances while the Seamless model misclassifies thesepixels This can be primarily attributed to our 2-way FPN thateffectively aggregates multi-scale features to learn semantic-ally richer representations and the panoptic fusion modulethat addresses the instance overlap ambiguity in an adaptivemanner

In Figure 6 (c) and (d) we qualitatively compare the per-formance on the challenging Mapillary Vistas dataset Weobserve that in Figure 6 (c) the group of people towards leftside of the image who are behind the fence are misclassifiedin the output of the Seamless model and the instances of thesepeople are not detected Whereas our EfficientPS model ac-curately segments each of the instances of the people Simil-arly the distant van on the right side of the image shown inFigure 6 (d) is partially occluded by the neighboring cars andis entirely misclassified by the Seamless model However ourEfficientPS model accurately captures this heavily occludedobject instance In Figure 6 (c) interestingly the Seamlessmodel misclassifies the cyclist on the road as a pedestrianWe hypothesize that this might be due to the fact that one ofthe legs of the cyclist is touching the ground and the other legwhich is on the pedal of the bicycle is barely visible Hencethis causes the Seamless model to misclassify the object in-stance Whereas our EfficientPS model effectively leveragesboth the semantic and instance prediction in our panopticfusion module to accurately address this ambiguity in thescene We also observe in Figure 6 (c) that the EfficientPSmodel misclassifies the traffic sign fixed on the fence andonly partially segments the advertisement board attached tothe building near the fence while it accurately segments allthe other instances of this class This is primarily due to thefact that there is a lack of relevant examples for this type oftraffic sign which is atypical of those found in the trainingset

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus that is towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divides it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck on the right lane is partially occluded by the cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across different datasets.
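The bidirectional aggregation mentioned above can be pictured as two parallel pathways over the feature pyramid, one flowing top-down and one bottom-up, whose outputs are summed per scale. The module below is a deliberately simplified stand-in, assuming all pyramid levels have already been projected to a common channel depth; it omits the separable convolutions and iABN sync layers of the actual architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoWayFPNSketch(nn.Module):
    """Simplified sketch of parallel top-down and bottom-up aggregation over a
    feature pyramid. `feats` are ordered fine-to-coarse (e.g. strides 4, 8, 16, 32),
    each with `channels` channels."""
    def __init__(self, channels: int = 256, levels: int = 4):
        super().__init__()
        self.out_convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(levels)])

    def forward(self, feats):
        # Top-down pathway: start at the coarsest level and propagate towards finer ones.
        td = [None] * len(feats)
        td[-1] = feats[-1]
        for i in range(len(feats) - 2, -1, -1):
            td[i] = feats[i] + F.interpolate(td[i + 1], size=feats[i].shape[-2:], mode="nearest")

        # Bottom-up pathway: start at the finest level and propagate towards coarser ones.
        bu = [None] * len(feats)
        bu[0] = feats[0]
        for i in range(1, len(feats)):
            bu[i] = feats[i] + F.interpolate(bu[i - 1], size=feats[i].shape[-2:], mode="nearest")

        # Fuse the two pathways per scale and refine each fused map.
        return [conv(t + b) for conv, t, b in zip(self.out_convs, td, bu)]
```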

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of ‘stuff’ classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are better defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head that correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearances of these two objects; however, our proposed EfficientPS model with our novel panoptic backbone has an extensive representational capacity, which enables it to accurately classify objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.
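The boundary refinement attributed to the semantic head rests on aligning coarse contextual features to the resolution of the fine features and correlating the two before fusion. The snippet below sketches this align-then-correlate idea under simplified assumptions; the class name and the exact operations are illustrative and do not reproduce the precise topology of the mismatch correction module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignAndCorrelate(nn.Module):
    """Illustrative sketch: upsample coarse contextual features to the resolution of
    fine features, correct them with a small convolution, and correlate them with the
    fine features before fusing."""
    def __init__(self, channels: int = 128):
        super().__init__()
        self.align = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True))

    def forward(self, fine: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        context = F.interpolate(context, size=fine.shape[-2:],
                                mode="bilinear", align_corners=False)
        context = self.align(context)
        # Emphasise fine features that agree with the long-range context,
        # which tends to sharpen predictions along object boundaries.
        return fine + fine * context
```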

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single scale evaluation, which is overlaid on the input image.


[Figure 7 panels: rows (a), (b) Cityscapes; (c), (d) Mapillary Vistas; (e), (f) KITTI; (g), (h) IDD; columns (i)–(iv).]

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances in multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk as well as seasonal changes.


Figure 7 (a) and (b) show examples from the Cityscapes dataset which exhibit complex road scenes consisting of a large number of traffic participants. These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring ‘thing’ class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from uncommon viewpoints compared to those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset which show residential and highway road scenes consisting of several parked and dynamic cars, as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes even though the network was only trained on the relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.
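The overlays shown in Figure 7 can be reproduced by colourising the panoptic prediction and alpha-blending it onto the input image. A small sketch is given below, assuming the prediction is available as a per-pixel map of panoptic segment IDs and that id_to_color maps every segment ID to an RGB colour; both names are hypothetical.

```python
import numpy as np

def overlay_panoptic(image, panoptic_ids, id_to_color, alpha=0.5):
    """Illustrative helper: alpha-blend a colourised panoptic ID map onto an
    RGB image (H x W x 3, uint8)."""
    colour = np.zeros_like(image)
    for seg_id, rgb in id_to_color.items():
        colour[panoptic_ids == seg_id] = rgb
    blended = (1.0 - alpha) * image.astype(np.float32) + alpha * colour.astype(np.float32)
    return blended.astype(np.uint8)
```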

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific ‘thing’ classes with ‘stuff’ classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analysis and visualizations that highlight the improvements that we make to various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335
Badrinarayanan V, Kendall A, Cipolla R (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561
Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229
Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons
Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision, Springer, pp 44–57
Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848

Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems, pp 8713–8724
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, pp 534–549
Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the International Conference on Computer Vision, pp 764–773
Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) SSAP: Single-shot instance segmentation with affinity pyramid. In: Proceedings of the International Conference on Computer Vision, pp 642–651
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The KITTI dataset. International Journal of Robotics Research
de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110
Girshick R (2015) Fast R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 1440–1448
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256
Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, pp 297–312
Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 2961–2969
He X, Gould S (2014a) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
He X, Gould S (2014b) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303
Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for MobileNetV3. In: Proceedings of the International Conference on Computer Vision, pp 1314–1324
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059
Kang BR, Kim HY (2018) BshapeNet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327
Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413
Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In: Proceedings of the International Conference on Computer Vision, pp 2190–2197
Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in Neural Information Processing Systems, pp 109–117
Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192
Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118
Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229
Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: European Conference on Computer Vision, Springer, pp 740–755
Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181
Liu S, Jia J, Fidler S, Urtasun R (2017) SGN: Sequential grouping networks for instance segmentation. In: Proceedings of the International Conference on Computer Vision, pp 3496–3504
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768
Liu W, Rabinovich A, Berg AC (2015) ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In: Proceedings of the International Conference on Computer Vision, pp 4990–4999
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035


Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998
Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824
Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286
Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887
Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664
Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329
Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238
Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of DNNs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sofiiuk K, Barinova O, Konushin A (2019) AdaptIS: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363
Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference
Tan M, Le QV (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946
Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135
Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection, and recognition. International Journal of Computer Vision 63(2):113–140
Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25
Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop, State Estimation and Terrain Perception for All Terrain Mobile Robots
Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics
Valada A, Vertens J, Dhall A, Burgard W (2017) AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651
Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)
Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, special issue: Deep Learning for Robotic Vision
Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751
Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) UPSNet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826
Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349
Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) DeeperLab: Single-shot image parser. arXiv preprint arXiv:1902.05093
Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721
Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890
Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227



EfficientPS Efficient Panoptic Segmentation 21

Input Image Seamless Output EfficientPS Output ImprovementError Map

(a)C

itysc

apes

(b)C

itysc

apes

(c)M

apill

ary

Vis

tas

(d)M

apill

ary

Vis

tas

(e)K

ITT

I(f

)KIT

TI

(g)I

DD

(h)I

DD

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architec-ture (Porzi et al 2019) on different benchmark datasets In addition to the panoptic segmentation output we also show the improvementerror mapwhich denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green and the pixels thatare misclassified by both the EfficientPS model and the Seamless model in red

22 Rohit Mohan Abhinav Valada

dataset For each example we show the input image the cor-responding panoptic segmentation output from the Seamlessmodel and our proposed EfficientPS model Additionally weshow the improvement and error map where a green pixelindicates that our EfficientPS made the right prediction butthe Seamless model misclassified it (improvement of Effi-cientPS over Seamless) and a red pixel denotes that bothmodels misclassified it with respect to the groundtruth

Figure 6 (a) and (b) show examples from the Cityscapesdataset in which the improvement over the Seamless modelcan be seen in the ability to segment heavily occluded lsquothingrsquoclass instances In the first example the truck far behind onthe bridge is occluded by cars and a cyclist and in the secondexample the distant car parked on the left side of the imageis only partially visible as the car in the front occludes itWe observe from the improvement maps that our proposedEfficientPS model accurately detect classify and segmentthese instances while the Seamless model misclassifies thesepixels This can be primarily attributed to our 2-way FPN thateffectively aggregates multi-scale features to learn semantic-ally richer representations and the panoptic fusion modulethat addresses the instance overlap ambiguity in an adaptivemanner

In Figure 6 (c) and (d) we qualitatively compare the per-formance on the challenging Mapillary Vistas dataset Weobserve that in Figure 6 (c) the group of people towards leftside of the image who are behind the fence are misclassifiedin the output of the Seamless model and the instances of thesepeople are not detected Whereas our EfficientPS model ac-curately segments each of the instances of the people Simil-arly the distant van on the right side of the image shown inFigure 6 (d) is partially occluded by the neighboring cars andis entirely misclassified by the Seamless model However ourEfficientPS model accurately captures this heavily occludedobject instance In Figure 6 (c) interestingly the Seamlessmodel misclassifies the cyclist on the road as a pedestrianWe hypothesize that this might be due to the fact that one ofthe legs of the cyclist is touching the ground and the other legwhich is on the pedal of the bicycle is barely visible Hencethis causes the Seamless model to misclassify the object in-stance Whereas our EfficientPS model effectively leveragesboth the semantic and instance prediction in our panopticfusion module to accurately address this ambiguity in thescene We also observe in Figure 6 (c) that the EfficientPSmodel misclassifies the traffic sign fixed on the fence andonly partially segments the advertisement board attached tothe building near the fence while it accurately segments allthe other instances of this class This is primarily due to thefact that there is a lack of relevant examples for this type oftraffic sign which is atypical of those found in the trainingset

Figure 6 (e) and (f) show qualitative comparisons on theKITTI dataset In Figure 6 (e) we see that the Seamless

model misclassifies the bus that is towards the right of theimage as a truck although it segments the object coherentlyThis is primarily due to the fact that there are poles as well asan advertisement board in front of the bus which divides theit into different subregions This leads the model to predict itas a truck that has a transition between the tractor unit and thetrailer However our proposed EfficientPS model mitigatesthis problem with its bidirectional aggregation of multi-scalefeatures that effectively captures contextual information InFigure 6 (f) we observe that a distant truck on the rightlane is partially occluded by cars behind it which causes theSeamless model to not detect the truck as a new instancerather it detects the truck and the car behind it as being thesame object This is similar to the scenario observed on theCityscapes dataset in Figure 6 (a) Nevertheless our pro-posed EfficientPS model yields accurate predictions in suchchallenging scenarios consistently across different datasets

In Figure 6 (g) and (h) we present examples from theIDD dataset We can see that our EfficientPS model cap-tures the boundaries of lsquostuffrsquo classes more precisely than theSeamless model in both the examples For instance the pillarof the bridge in Figure 6 (g) and the extent of the sidewalk inFigure 6 (h) are more well defined in the panoptic segmenta-tion output of our EfficientPS model This can be attributedto the object boundary refinement ability of our semantichead that correlates features of different scales before fusingthem In Figure 6 (h) the Seamless model misclassifies theauto-rickshaw as a caravan due to the similar visual appear-ances of these two objects however our proposed EfficientPSmodel with our novel panoptic backbone has an extensiverepresentational capacity which enables it to accurately clas-sify objects even with such subtle differences We observethat although the upper half of the cyclist towards the left ofthe image is accurately segmented the front leg of the cyclistis misclassifies as being part of the bicycle This is a chal-lenging scenario due to the high contrast in this region Wealso observe that the boundary of the sidewalk towards theleft of the auto rickshaw is misclassified However on visualinspection of the groundtruth it appears that the sidewalkboundary in this region is mislabeled in groundtruth maskwhile the model is making a reasonable prediction

46 Visualizations

We present visualizations of panoptic segmentation resultsfrom our proposed EfficientPS architecture on CityscapesMapillary Vistas KITTI and Indian Driving Dataset (IDD)in Figure 7 The figures show the panoptic segmentationoutput of our EfficientPS model using single scale evalu-ation which is overlaid on the input image Figure 7 (a) and(b) show examples from the Cityscapes dataset which ex-hibit complex road scenes consisting of a large number of

EfficientPS Efficient Panoptic Segmentation 23

(i) (ii) (iii) (iv)

(a)C

itysc

apes

(b)C

itysc

apes

(c)M

apill

ary

Vis

tas

(d)M

apill

ary

Vis

tas

(e)K

ITT

I(f

)KIT

TI

(g)I

DD

(h)I

DD

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasetsthat we benchmark on which in total encompasses scenes from over 50 countries These examples show complex urban scenarios with numerousobject instances in multiple scales and with partial occlusion These scenes also show diverse lighting conditions from dawn to dusk as well asseasonal changes

24 Rohit Mohan Abhinav Valada

traffic participants These examples show challenging scen-arios with dynamic as well as static pedestrian groups inclose proximity to each other and distant parked cars thatare barely visible due to their neighbouring lsquothingrsquo class in-stances Our proposed EfficientPS architecture effectivelyaddresses these challenges and yields reliable panoptic seg-mentation results In Figure 7 (c) and (d) we present resultson the Mapillary Vistas dataset that show drastic viewpointvariations and scenes in different times of day Figure 7 (civ)(di) and (div) show scenes that were captured from uncom-mon viewpoints from those observed in the training dataand Figure 7 (diii) shows a scene that was captured duringnighttime Nevertheless our EfficientPS model demonstratessubstantial robustness against these perceptual variations

In Figure 7 (e) and (f) we present results on the KITTIdataset which show residential and highway road scenesconsisting of several parked and dynamic cars as well as alarge amount of thin structures such as poles We observethat our EfficientPS model generalizes effectively to thesecomplex scenes even when the network was only trainedon the relatively small dataset Figure 7 (g) and (h) showexamples from the IDD dataset that highlight challengesof an unstructured environment One such challenge is theaccurate segmentation of sidewalks as the transition betweenthe road and the sidewalk is not well delineated often causedby a layer of sand over asphalt The examples also showheavy traffic with numerous types of vehicles motorcyclesand pedestrians scattered all over the scene However ourproposed EfficientPS model shows exceptional robustness inthese immensely challenging scenes thereby demonstratingits suitability for autonomous driving applications

5 Conclusions

In this paper we presented our EfficientPS architecture forpanoptic segmentation that achieves state-of-the-art perform-ance while being computationally efficient It incorporatesour proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instancehead a new semantic head that captures fine and contextualfeatures efficiently and our novel adaptive panoptic fusionmodule We demonstrated that our panoptic backbone con-sisting of the modified EfficientNet encoder and our 2-wayFPN achieves the right trade-off between performance andcomputational complexity Our 2-way FPN achieves effect-ive aggregation of semantically rich multi-scale features dueto its bidirectional flow of information Thus in combina-tion with our encoder it establishes a new strong panopticbackbone We proposed a new semantic head that employsscale-specific feature aggregation to capture long-range con-text and characteristic features effectively followed by cor-relating them to achieve better object boundary refinementcapability We also introduced our parameter-free panoptic

fusion module that dynamically fuses logits from both headsbased on their mask confidences and congruously integratesinstance-specific lsquothingrsquo classes with lsquostuffrsquo classes to yieldthe panoptic segmentation output

Additionally we introduced the KITTI panoptic segment-ation dataset that contains panoptic groundtruth annotationsfor images from the challenging KITTI benchmark We hopethat our panoptic annotations complement the suite of otherperception tasks in KITTI and encourage the research com-munity to develop novel multi-task learning methods thatinclude panoptic segmentation We presented exhaustivebenchmarking results on Cityscapes Mapillary Vistas KITTIand IDD datasets that demonstrate that our proposed Effi-cientPS sets the new state-of-the-art in panoptic segmentationwhile being faster and more parameter efficient than exist-ing architectures In addition to being ranked first on theCityscapes panoptic segmentation leaderboard our model isranked second on both the Cityscapes semantic segmentationand instance segmentation leaderboards We also presenteddetailed ablation studies qualitative analysis and visualiza-tions that highlight the improvements that we make to variouscore modules of panoptic segmentation architectures To thebest of our knowledge this work is the first to benchmarkon all the four standard urban scene understanding datasetsthat support panoptic segmentation and exceed the state-of-the-art on each of them while simultaneously being the mostefficient

Acknowledgements

This work was partly funded by the European Unionrsquos Ho-rizon 2020 research and innovation program under grantagreement No 871449-OpenDR and a Google Cloud researchgrant

References

Arbelaez P Pont-Tuset J Barron JT Marques F Malik J (2014)Multiscale combinatorial grouping In Proceedings of the Confer-ence on Computer Vision and Pattern Recognition pp 328ndash335

Badrinarayanan V Kendall A Cipolla R (2015) Segnet A deep convo-lutional encoder-decoder architecture for image segmentation arXivpreprint arXiv 151100561

Bai M Urtasun R (2017) Deep watershed transform for instance seg-mentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 5221ndash5229

Bremner JG Slater A (2008) Theories of infant development JohnWiley amp Sons

Brostow GJ Shotton J Fauqueur J Cipolla R (2008) Segmentation andrecognition using structure from motion point clouds In Europeanconference on computer vision Springer pp 44ndash57

Bulo SR Neuhold G Kontschieder P (2017) Loss max-pooling forsemantic image segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 7082ndash7091

Chen LC Papandreou G Kokkinos I Murphy K Yuille AL (2017a)Deeplab Semantic image segmentation with deep convolutional nets

EfficientPS Efficient Panoptic Segmentation 25

atrous convolution and fully connected crfs IEEE transactions onpattern analysis and machine intelligence 40(4)834ndash848

Chen LC Papandreou G Schroff F Adam H (2017b) Rethinking at-rous convolution for semantic image segmentation arXiv preprintarXiv170605587

Chen LC Collins M Zhu Y Papandreou G Zoph B Schroff F AdamH Shlens J (2018a) Searching for efficient multi-scale architecturesfor dense image prediction In Advances in Neural InformationProcessing Systems pp 8713ndash8724

Chen LC Zhu Y Papandreou G Schroff F Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image seg-mentation arXiv preprint arXiv180202611

Cheng B Collins MD Zhu Y Liu T Huang TS Adam H Chen LC(2019) Panoptic-deeplab A simple strong and fast baseline forbottom-up panoptic segmentation arXiv preprint arXiv191110194

Chollet F (2017) Xception Deep learning with depthwise separableconvolutions In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 1251ndash1258

Cordts M Omran M Ramos S Rehfeld T Enzweiler M Benenson RFranke U Roth S Schiele B (2016) The cityscapes dataset for se-mantic urban scene understanding In Proceedings of the Conferenceon Computer Vision and Pattern Recognition pp 3213ndash3223

Dai J He K Li Y Ren S Sun J (2016) Instance-sensitive fully convolu-tional networks In European Conference on Computer Vision pp534ndash549

Dai J Qi H Xiong Y Li Y Zhang G Hu H Wei Y (2017) Deform-able convolutional networks In Proceedings of the InternationalConference on Computer Vision pp 764ndash773

Gao N Shan Y Wang Y Zhao X Yu Y Yang M Huang K (2019)Ssap Single-shot instance segmentation with affinity pyramid InProceedings of the International Conference on Computer Vision pp642ndash651

Geiger A Lenz P Stiller C Urtasun R (2013) Vision meets roboticsThe kitti dataset International Journal of Robotics Research

de Geus D Meletis P Dubbelman G (2018) Panoptic segmentation witha joint semantic and instance segmentation network arXiv preprintarXiv180902110

Girshick R (2015) Fast r-cnn In Proceedings of the International Con-ference on Computer Vision pp 1440ndash1448

Glorot X Bengio Y (2010) Understanding the difficulty of trainingdeep feedforward neural networks In Proceedings of the thirteenthinternational conference on artificial intelligence and statistics pp249ndash256

Hariharan B Arbelaez P Girshick R Malik J (2014) Simultaneousdetection and segmentation In European Conference on ComputerVision pp 297ndash312

Hariharan B Arbelaez P Girshick R Malik J (2015) Hypercolumns forobject segmentation and fine-grained localization In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp447ndash456

He K Zhang X Ren S Sun J (2016) Deep residual learning for imagerecognition In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 770ndash778

He K Gkioxari G Dollar P Girshick R (2017) Mask r-cnn In Pro-ceedings of the International Conference on Computer Vision pp2961ndash2969

He X Gould S (2014a) An exemplar-based crf for multi-instance objectsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition

He X Gould S (2014b) An exemplar-based crf for multi-instance objectsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 296ndash303

Howard A Sandler M Chu G Chen LC Chen B Tan M Wang W ZhuY Pang R Vasudevan V et al (2019) Searching for mobilenetv3 InProceedings of the International Conference on Computer Vision pp1314ndash1324

Hu J Shen L Sun G (2018) Squeeze-and-excitation networks In Pro-ceedings of the Conference on Computer Vision and Pattern Recog-nition pp 7132ndash7141

Ioffe S Szegedy C (2015) Batch normalization Accelerating deepnetwork training by reducing internal covariate shift arXiv preprintarXiv150203167

Kaiser L Gomez AN Chollet F (2017) Depthwise separable convolu-tions for neural machine translation arXiv preprint arXiv170603059

Kang BR Kim HY (2018) Bshapenet Object detection and in-stance segmentation with bounding shape masks arXiv preprintarXiv181010327

Kirillov A He K Girshick R Rother C Dollar P (2019) Panopticsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 9404ndash9413

Kontschieder P Bulo SR Bischof H Pelillo M (2011) Structured class-labels in random forests for semantic image labelling In Proceed-ings of the International Conference on Computer Vision pp 2190ndash2197

Krahenbuhl P Koltun V (2011) Efficient inference in fully connectedcrfs with gaussian edge potentials In Advances in neural informa-tion processing systems pp 109ndash117

Li J Raventos A Bhargava A Tagawa T Gaidon A (2018a) Learningto fuse things and stuff arXiv preprint arXiv181201192

Li Q Arnab A Torr PH (2018b) Weakly-and semi-supervised panop-tic segmentation In Proceedings of the European Conference onComputer Vision (ECCV) pp 102ndash118

Li X Zhang L You A Yang M Yang K Tong Y (2019a) Globalaggregation then local distribution in fully convolutional networksarXiv preprint arXiv190907229

Li Y Qi H Dai J Ji X Wei Y (2017) Fully convolutional instance-aware semantic segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 2359ndash2367

Li Y Chen X Zhu Z Xie L Huang G Du D Wang X (2019b) Attention-guided unified network for panoptic segmentation In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp7026ndash7035

Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar PZitnick CL (2014) Microsoft coco Common objects in context InEuropean conference on computer vision Springer pp 740ndash755

Lin TY Dollar P Girshick R He K Hariharan B Belongie S (2017)Feature pyramid networks for object detection In Proceedings ofthe Conference on Computer Vision and Pattern Recognition pp2117ndash2125

Liu H Peng C Yu C Wang J Liu X Yu G Jiang W (2019) An end-to-end network for panoptic segmentation In Proceedings of theConference on Computer Vision and Pattern Recognition pp 6172ndash6181

Liu S Jia J Fidler S Urtasun R (2017) Sgn Sequential grouping net-works for instance segmentation In Proceedings of the InternationalConference on Computer Vision pp 3496ndash3504

Liu S Qi L Qin H Shi J Jia J (2018) Path aggregation network for in-stance segmentation In Proceedings of the Conference on ComputerVision and Pattern Recognition pp 8759ndash8768

Liu W Rabinovich A Berg AC (2015) Parsenet Looking wider to seebetter arXiv preprint arXiv 150604579

Long J Shelhamer E Darrell T (2015) Fully convolutional networksfor semantic segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition

Neuhold G Ollmann T Rota Bulo S Kontschieder P (2017) The mapil-lary vistas dataset for semantic understanding of street scenes InProceedings of the International Conference on Computer Vision pp4990ndash4999


Table 11 Performance comparison of various existing semantic head topologies employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Semantic Head |   PQ    SQ    RQ | PQ^Th SQ^Th RQ^Th | PQ^St SQ^St RQ^St |   AP  mIoU
              |  (%)   (%)   (%) |   (%)   (%)   (%) |   (%)   (%)   (%) |  (%)   (%)
Baseline      | 61.5  80.7  75.6 |  57.2  80.6  72.5 |  64.6  80.9  77.9 | 36.8  77.3
UPSNet        | 62.0  81.0  74.7 |  58.5  80.5  70.9 |  64.5  81.3  77.5 | 35.9  76.1
Seamless      | 62.9  81.1  75.5 |  58.9  80.4  71.3 |  65.6  81.6  78.5 | 36.8  78.5
Ours          | 63.9  81.5  77.1 |  60.7  81.2  74.1 |  66.2  81.8  79.2 | 38.3  79.3

Table 12 Performance comparison of our proposed panoptic fusion module with various other panoptic fusion mechanisms employed in the M6 model. Results are reported for the model trained on the Cityscapes fine annotations and evaluated on the validation set. Superscripts St and Th refer to 'stuff' and 'thing' classes respectively.

Model    |   PQ    SQ    RQ | PQ^Th SQ^Th RQ^Th | PQ^St SQ^St RQ^St
         |  (%)   (%)   (%) |   (%)   (%)   (%) |   (%)   (%)   (%)
Baseline | 62.4  80.8  75.4 |  58.7  80.4  72.6 |  65.1  81.1  77.4
TASCNet  | 62.5  80.9  75.6 |  58.6  80.5  72.8 |  65.3  81.2  77.7
UPSNet   | 63.1  81.3  76.1 |  59.5  80.6  73.2 |  65.7  81.8  78.2
Ours     | 63.9  81.5  77.1 |  60.7  81.2  74.1 |  66.2  81.8  79.2

The semantic head of UPSNet, which is essentially a sub-network comprising sequential deformable convolution layers (Dai et al., 2017), achieves a PQ score of 62.0%, an improvement of 0.5% over the baseline model. The semantic head of the Seamless model employs its MiniDL module at each level of the 2-way FPN, which further improves the PQ by 0.9% over the semantic head of UPSNet. The semantic heads of all these models use the same module at each level of the 2-way FPN output, even though the levels are of different scales. In contrast, our proposed semantic head, which employs a combination of LSFE and DPC modules at the different levels of the 2-way FPN, achieves the highest PQ score of 63.9% and consistently outperforms the other semantic head topologies in all the evaluation metrics.
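To make this scale-specific design concrete, the following PyTorch-style sketch assigns context modules to the two coarsest levels of the 2-way FPN and fine-feature modules to the two finest levels, and fuses their upsampled outputs before per-pixel classification. The module internals shown here (plain and dilated 3x3 convolutions with 128 output channels), the class names and the exact level assignment are simplified assumptions for illustration; the actual LSFE and DPC modules of EfficientPS are more elaborate and are described in the architecture section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSFE(nn.Module):
    """Simplified large-scale feature extractor: two plain 3x3 convolutions."""
    def __init__(self, in_ch, out_ch=128):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class DPC(nn.Module):
    """Simplified dense prediction cell: parallel dilated 3x3 convolutions."""
    def __init__(self, in_ch, out_ch=128, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class ScaleSpecificSemanticHead(nn.Module):
    """Scale-specific aggregation: context modules (DPC) on the coarse FPN
    levels, fine-feature modules (LSFE) on the fine levels, fused at the
    finest resolution before per-pixel classification."""
    def __init__(self, fpn_ch=256, num_classes=19):
        super().__init__()
        self.coarse = nn.ModuleList([DPC(fpn_ch), DPC(fpn_ch)])  # e.g. P32, P16
        self.fine = nn.ModuleList([LSFE(fpn_ch), LSFE(fpn_ch)])  # e.g. P8, P4
        self.classifier = nn.Conv2d(128 * 4, num_classes, 1)

    def forward(self, p32, p16, p8, p4):
        modules = list(self.coarse) + list(self.fine)
        feats = [m(x) for m, x in zip(modules, [p32, p16, p8, p4])]
        size = p4.shape[-2:]
        feats = [F.interpolate(f, size=size, mode='bilinear', align_corners=False)
                 for f in feats]
        logits = self.classifier(torch.cat(feats, dim=1))
        # Upsample by the stride of the finest level to reach input resolution.
        return F.interpolate(logits, scale_factor=4, mode='bilinear',
                             align_corners=False)
```

The point of the sketch is that each pyramid level is processed by a module matched to its scale, rather than by one shared module as in the compared topologies.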

4.4.5 Evaluation of Panoptic Fusion Module

In this section, we evaluate the performance of our proposed panoptic fusion module in comparison to other existing panoptic fusion mechanisms. First, we compare with the panoptic fusion heuristics introduced by Kirillov et al. (2019), which we consider as a baseline as they are extensively used in several panoptic segmentation networks. We then compare with Mask-Guided fusion (Li et al., 2018a) and the panoptic fusion heuristics proposed in (Xiong et al., 2019), which we refer to as TASCNet and UPSNet in the results respectively. Once again, for a fair comparison, we keep all the other network components the same across the different experiments and only change the panoptic fusion mechanism.

Table 12 presents results from this experiment. Combining the outputs of the semantic head and the instance head, which have an inherent overlap, is one of the critical challenges faced by panoptic segmentation networks. The baseline approach directly chooses the output of the instance head, i.e., if there is an overlap between predictions of the 'thing' and 'stuff' classes for a given pixel, the baseline heuristic classifies the pixel as a 'thing' class and assigns it an instance ID. This baseline approach achieves the lowest performance of 62.4% in PQ, demonstrating that this fusion problem is more complex than just assigning the output from one of the heads. The Mask-Guided fusion method of TASCNet seeks to address this problem by using a segmentation mask. The mask selects which pixels to consider from the instance segmentation output and which pixels to consider from the semantic segmentation output. This fusion approach achieves a PQ of 62.5%, which is comparable to the baseline method. Subsequently, the model that employs the UPSNet fusion heuristics achieves a larger improvement with a PQ score of 63.1%. This method leverages the logits of both heads by adding them and therefore relies on the mode computed over the segmentation output to resolve the conflict between 'stuff' and 'thing' classes. However, our proposed adaptive fusion method that dynamically fuses the outputs from both heads achieves the highest PQ score of 63.9%, which is an improvement of 0.8% over the UPSNet method. We also observe a consistently higher performance in all the other metrics.
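To make the difference between these fusion strategies concrete, the following is a minimal sketch. The helper names, the panoptic id encoding (class id x 1000 + instance id) and the particular confidence weighting are illustrative assumptions rather than our exact implementation; the precise fusion function and the subsequent assignment of 'stuff' regions are given in the description of the panoptic fusion module earlier in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def baseline_overlap_heuristic(sem_pred, inst_masks, inst_classes):
    """Baseline heuristic in the spirit of Kirillov et al. (2019): instance
    predictions simply overwrite the semantic prediction wherever an
    instance mask fires."""
    panoptic = sem_pred.astype(np.int64).copy()       # H x W map of class ids
    for inst_id, (mask, cls) in enumerate(zip(inst_masks, inst_classes), start=1):
        panoptic[mask] = cls * 1000 + inst_id          # illustrative id encoding
    return panoptic

def confidence_weighted_fusion(sem_logit, inst_logit):
    """Confidence-weighted combination of the semantic logits and the instance
    mask logits for one 'thing' instance: both mask confidences modulate the
    summed logits, so a pixel is claimed only where both heads agree."""
    return (sigmoid(sem_logit) + sigmoid(inst_logit)) * (sem_logit + inst_logit)
```

In contrast to the baseline heuristic, which lets the instance head overwrite the semantic head unconditionally, the confidence-weighted combination only lets an instance claim a pixel when both heads assign it sufficient mask confidence.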

4.5 Qualitative Evaluations

In this section, we qualitatively evaluate the panoptic segmentation performance of our proposed EfficientPS architecture in comparison to the state-of-the-art Seamless (Porzi et al., 2019) model on each of the datasets that we benchmark on. We use the publicly available official implementation of the Seamless architecture to obtain the outputs for the qualitative comparisons. The best performing state-of-the-art model, Panoptic-DeepLab, does not provide any publicly available implementation or pre-trained models, which makes such comparisons infeasible.


[Figure 6 layout: columns show the input image, the Seamless output, the EfficientPS output and the improvement/error map; rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD]

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architecture (Porzi et al., 2019) on different benchmark datasets. In addition to the panoptic segmentation output, we also show the improvement/error map, which denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green, and the pixels that are misclassified by both the EfficientPS model and the Seamless model in red.


Figure 6 presents two examples from the validation sets of each of the urban scene understanding datasets. For each example, we show the input image and the corresponding panoptic segmentation outputs from the Seamless model and from our proposed EfficientPS model. Additionally, we show the improvement and error map, where a green pixel indicates that our EfficientPS made the right prediction but the Seamless model misclassified it (improvement of EfficientPS over Seamless), and a red pixel denotes that both models misclassified it with respect to the groundtruth.
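The improvement/error maps themselves are straightforward to compute. The sketch below assumes per-pixel label maps for the groundtruth and the two predictions and uses the green/red coding described above; the helper name and the use of plain label maps rather than full panoptic ids are simplifications for illustration.

```python
import numpy as np

def improvement_error_map(gt, pred_ours, pred_baseline):
    """RGB visualization: green where our prediction matches the groundtruth
    and the baseline does not, red where both predictions are wrong."""
    h, w = gt.shape
    vis = np.zeros((h, w, 3), dtype=np.uint8)
    ours_ok = pred_ours == gt
    base_ok = pred_baseline == gt
    vis[ours_ok & ~base_ok] = (0, 255, 0)    # improvement over the baseline
    vis[~ours_ok & ~base_ok] = (255, 0, 0)   # both models misclassify the pixel
    return vis
```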

Figure 6 (a) and (b) show examples from the Cityscapes dataset in which the improvement over the Seamless model can be seen in the ability to segment heavily occluded 'thing' class instances. In the first example, the truck far behind on the bridge is occluded by cars and a cyclist, and in the second example, the distant car parked on the left side of the image is only partially visible as the car in front occludes it. We observe from the improvement maps that our proposed EfficientPS model accurately detects, classifies and segments these instances, while the Seamless model misclassifies these pixels. This can be primarily attributed to our 2-way FPN, which effectively aggregates multi-scale features to learn semantically richer representations, and to the panoptic fusion module, which addresses the instance overlap ambiguity in an adaptive manner.

In Figure 6 (c) and (d), we qualitatively compare the performance on the challenging Mapillary Vistas dataset. We observe that in Figure 6 (c) the group of people towards the left side of the image, who are behind the fence, are misclassified in the output of the Seamless model and the instances of these people are not detected, whereas our EfficientPS model accurately segments each of the instances of the people. Similarly, the distant van on the right side of the image shown in Figure 6 (d) is partially occluded by the neighboring cars and is entirely misclassified by the Seamless model. However, our EfficientPS model accurately captures this heavily occluded object instance. In Figure 6 (c), interestingly, the Seamless model misclassifies the cyclist on the road as a pedestrian. We hypothesize that this might be due to the fact that one of the legs of the cyclist is touching the ground while the other leg, which is on the pedal of the bicycle, is barely visible. Hence, this causes the Seamless model to misclassify the object instance, whereas our EfficientPS model effectively leverages both the semantic and instance predictions in our panoptic fusion module to accurately resolve this ambiguity in the scene. We also observe in Figure 6 (c) that the EfficientPS model misclassifies the traffic sign fixed on the fence and only partially segments the advertisement board attached to the building near the fence, while it accurately segments all the other instances of this class. This is primarily due to the lack of relevant examples for this type of traffic sign, which is atypical of those found in the training set.

Figure 6 (e) and (f) show qualitative comparisons on the KITTI dataset. In Figure 6 (e), we see that the Seamless model misclassifies the bus towards the right of the image as a truck, although it segments the object coherently. This is primarily due to the fact that there are poles as well as an advertisement board in front of the bus, which divide it into different subregions. This leads the model to predict it as a truck that has a transition between the tractor unit and the trailer. However, our proposed EfficientPS model mitigates this problem with its bidirectional aggregation of multi-scale features that effectively captures contextual information. In Figure 6 (f), we observe that a distant truck in the right lane is partially occluded by cars behind it, which causes the Seamless model to not detect the truck as a new instance; rather, it detects the truck and the car behind it as being the same object. This is similar to the scenario observed on the Cityscapes dataset in Figure 6 (a). Nevertheless, our proposed EfficientPS model yields accurate predictions in such challenging scenarios consistently across different datasets.

In Figure 6 (g) and (h), we present examples from the IDD dataset. We can see that our EfficientPS model captures the boundaries of 'stuff' classes more precisely than the Seamless model in both examples. For instance, the pillar of the bridge in Figure 6 (g) and the extent of the sidewalk in Figure 6 (h) are better defined in the panoptic segmentation output of our EfficientPS model. This can be attributed to the object boundary refinement ability of our semantic head, which correlates features of different scales before fusing them. In Figure 6 (h), the Seamless model misclassifies the auto-rickshaw as a caravan due to the similar visual appearance of these two objects; however, our proposed EfficientPS model, with our novel panoptic backbone that has an extensive representational capacity, accurately classifies objects even with such subtle differences. We observe that although the upper half of the cyclist towards the left of the image is accurately segmented, the front leg of the cyclist is misclassified as being part of the bicycle. This is a challenging scenario due to the high contrast in this region. We also observe that the boundary of the sidewalk towards the left of the auto-rickshaw is misclassified. However, on visual inspection of the groundtruth, it appears that the sidewalk boundary in this region is mislabeled in the groundtruth mask, while the model is making a reasonable prediction.

4.6 Visualizations

We present visualizations of panoptic segmentation results from our proposed EfficientPS architecture on Cityscapes, Mapillary Vistas, KITTI and the Indian Driving Dataset (IDD) in Figure 7. The figures show the panoptic segmentation output of our EfficientPS model using single-scale evaluation, which is overlaid on the input image.
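Such overlays can be produced by alpha-blending a colorized panoptic id map onto the input image, as in the sketch below; the random per-id palette, the blending factor and the function name are illustrative choices.

```python
import numpy as np

def overlay_panoptic(image, panoptic_ids, alpha=0.5, seed=0):
    """Alpha-blends a randomly colored panoptic id map onto the input image."""
    rng = np.random.default_rng(seed)
    ids = np.unique(panoptic_ids)
    palette = {i: rng.integers(0, 256, size=3, dtype=np.uint8) for i in ids}
    color = np.zeros((*panoptic_ids.shape, 3), dtype=np.uint8)
    for i in ids:
        color[panoptic_ids == i] = palette[i]
    return (alpha * color + (1 - alpha) * image).astype(np.uint8)
```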


[Figure 7 layout: columns (i)-(iv) show four examples per dataset; rows (a)-(b) Cityscapes, (c)-(d) Mapillary Vistas, (e)-(f) KITTI, (g)-(h) IDD]

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasets that we benchmark on, which in total encompass scenes from over 50 countries. These examples show complex urban scenarios with numerous object instances at multiple scales and with partial occlusion. These scenes also show diverse lighting conditions from dawn to dusk, as well as seasonal changes.


Figure 7 (a) and (b) show examples from the Cityscapes dataset, which exhibit complex road scenes consisting of a large number of traffic participants. These examples show challenging scenarios with dynamic as well as static pedestrian groups in close proximity to each other, and distant parked cars that are barely visible due to their neighbouring 'thing' class instances. Our proposed EfficientPS architecture effectively addresses these challenges and yields reliable panoptic segmentation results. In Figure 7 (c) and (d), we present results on the Mapillary Vistas dataset that show drastic viewpoint variations and scenes at different times of day. Figure 7 (c.iv), (d.i) and (d.iv) show scenes that were captured from viewpoints uncommon among those observed in the training data, and Figure 7 (d.iii) shows a scene that was captured during nighttime. Nevertheless, our EfficientPS model demonstrates substantial robustness against these perceptual variations.

In Figure 7 (e) and (f), we present results on the KITTI dataset, which show residential and highway road scenes consisting of several parked and dynamic cars as well as a large number of thin structures such as poles. We observe that our EfficientPS model generalizes effectively to these complex scenes even though the network was only trained on this relatively small dataset. Figure 7 (g) and (h) show examples from the IDD dataset that highlight the challenges of an unstructured environment. One such challenge is the accurate segmentation of sidewalks, as the transition between the road and the sidewalk is not well delineated, often caused by a layer of sand over the asphalt. The examples also show heavy traffic with numerous types of vehicles, motorcycles and pedestrians scattered all over the scene. However, our proposed EfficientPS model shows exceptional robustness in these immensely challenging scenes, thereby demonstrating its suitability for autonomous driving applications.

5 Conclusions

In this paper, we presented our EfficientPS architecture for panoptic segmentation that achieves state-of-the-art performance while being computationally efficient. It incorporates our proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instance head, a new semantic head that captures fine and contextual features efficiently, and our novel adaptive panoptic fusion module. We demonstrated that our panoptic backbone, consisting of the modified EfficientNet encoder and our 2-way FPN, achieves the right trade-off between performance and computational complexity. Our 2-way FPN achieves effective aggregation of semantically rich multi-scale features due to its bidirectional flow of information. Thus, in combination with our encoder, it establishes a new strong panoptic backbone. We proposed a new semantic head that employs scale-specific feature aggregation to capture long-range context and characteristic features effectively, followed by correlating them to achieve better object boundary refinement capability. We also introduced our parameter-free panoptic fusion module that dynamically fuses logits from both heads based on their mask confidences and congruously integrates instance-specific 'thing' classes with 'stuff' classes to yield the panoptic segmentation output.

Additionally, we introduced the KITTI panoptic segmentation dataset that contains panoptic groundtruth annotations for images from the challenging KITTI benchmark. We hope that our panoptic annotations complement the suite of other perception tasks in KITTI and encourage the research community to develop novel multi-task learning methods that include panoptic segmentation. We presented exhaustive benchmarking results on the Cityscapes, Mapillary Vistas, KITTI and IDD datasets, which demonstrate that our proposed EfficientPS sets the new state-of-the-art in panoptic segmentation while being faster and more parameter efficient than existing architectures. In addition to being ranked first on the Cityscapes panoptic segmentation leaderboard, our model is ranked second on both the Cityscapes semantic segmentation and instance segmentation leaderboards. We also presented detailed ablation studies, qualitative analyses and visualizations that highlight the improvements that we make to the various core modules of panoptic segmentation architectures. To the best of our knowledge, this work is the first to benchmark on all four standard urban scene understanding datasets that support panoptic segmentation and exceed the state-of-the-art on each of them, while simultaneously being the most efficient.

Acknowledgements

This work was partly funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 871449-OpenDR and a Google Cloud research grant.

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335

Badrinarayanan V, Kendall A, Cipolla R (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561

Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229

Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons

Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision, Springer, pp 44–57

Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091

Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848

Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587

Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems, pp 8713–8724

Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611

Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194

Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258

Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223

Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, pp 534–549

Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the International Conference on Computer Vision, pp 764–773

Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) SSAP: Single-shot instance segmentation with affinity pyramid. In: Proceedings of the International Conference on Computer Vision, pp 642–651

Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The KITTI dataset. International Journal of Robotics Research

de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110

Girshick R (2015) Fast R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 1440–1448

Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256

Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, pp 297–312

Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456

He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778

He K, Gkioxari G, Dollar P, Girshick R (2017) Mask R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 2961–2969

He X, Gould S (2014a) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition

He X, Gould S (2014b) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303

Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for MobileNetV3. In: Proceedings of the International Conference on Computer Vision, pp 1314–1324

Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141

Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167

Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059

Kang BR, Kim HY (2018) BshapeNet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327

Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413

Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In: Proceedings of the International Conference on Computer Vision, pp 2190–2197

Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in Neural Information Processing Systems, pp 109–117

Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192

Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118

Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229

Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367

Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035

Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: European Conference on Computer Vision, Springer, pp 740–755

Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125

Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181

Liu S, Jia J, Fidler S, Urtasun R (2017) SGN: Sequential grouping networks for instance segmentation. In: Proceedings of the International Conference on Computer Vision, pp 3496–3504

Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768

Liu W, Rabinovich A, Berg AC (2015) ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579

Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition

Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In: Proceedings of the International Conference on Computer Vision, pp 4990–4999

Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035

Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998

Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824

Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286

Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887

Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664

Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329

Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238

Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of DNNs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252

Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition

Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631

Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

Sofiiuk K, Barinova O, Konushin A (2019) AdaptIS: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363

Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference

Tan M, Le QV (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946

Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135

Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755

Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision 63(2):113–140

Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25

Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots

Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop: Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics

Valada A, Vertens J, Dhall A, Burgard W (2017) AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651

Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)

Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, Special Issue: Deep Learning for Robotic Vision

Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751

Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19

Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500

Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) UPSNet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826

Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349

Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) DeeperLab: Single-shot image parser. arXiv preprint arXiv:1902.05093

Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709

Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122

Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721

Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677

Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890

Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227

  • 1 Introduction
  • 2 Related Works
  • 3 EfficientPS Architecture
  • 4 Experimental Results
  • 5 Conclusions
Page 21: arXiv:2004.02307v2 [cs.CV] 19 May 2020 · 16 48x 512x 1024 1 Input 3x 1024x 2048 P 32 128x 64x128 32x64 DPC 1 256x 256x512 256x 64x128 128x256 24x 512x 1024 40x 256x 512 64x 128x

EfficientPS Efficient Panoptic Segmentation 21

Input Image Seamless Output EfficientPS Output ImprovementError Map

(a)C

itysc

apes

(b)C

itysc

apes

(c)M

apill

ary

Vis

tas

(d)M

apill

ary

Vis

tas

(e)K

ITT

I(f

)KIT

TI

(g)I

DD

(h)I

DD

Figure 6 Qualitative panoptic segmentation results of our proposed EfficientPS network in comparison to the state-of-the-art Seamless architec-ture (Porzi et al 2019) on different benchmark datasets In addition to the panoptic segmentation output we also show the improvementerror mapwhich denotes the pixels that are misclassified by the Seamless model but correctly predicted by the EfficientPS model in green and the pixels thatare misclassified by both the EfficientPS model and the Seamless model in red

22 Rohit Mohan Abhinav Valada

dataset For each example we show the input image the cor-responding panoptic segmentation output from the Seamlessmodel and our proposed EfficientPS model Additionally weshow the improvement and error map where a green pixelindicates that our EfficientPS made the right prediction butthe Seamless model misclassified it (improvement of Effi-cientPS over Seamless) and a red pixel denotes that bothmodels misclassified it with respect to the groundtruth

Figure 6 (a) and (b) show examples from the Cityscapesdataset in which the improvement over the Seamless modelcan be seen in the ability to segment heavily occluded lsquothingrsquoclass instances In the first example the truck far behind onthe bridge is occluded by cars and a cyclist and in the secondexample the distant car parked on the left side of the imageis only partially visible as the car in the front occludes itWe observe from the improvement maps that our proposedEfficientPS model accurately detect classify and segmentthese instances while the Seamless model misclassifies thesepixels This can be primarily attributed to our 2-way FPN thateffectively aggregates multi-scale features to learn semantic-ally richer representations and the panoptic fusion modulethat addresses the instance overlap ambiguity in an adaptivemanner

In Figure 6 (c) and (d) we qualitatively compare the per-formance on the challenging Mapillary Vistas dataset Weobserve that in Figure 6 (c) the group of people towards leftside of the image who are behind the fence are misclassifiedin the output of the Seamless model and the instances of thesepeople are not detected Whereas our EfficientPS model ac-curately segments each of the instances of the people Simil-arly the distant van on the right side of the image shown inFigure 6 (d) is partially occluded by the neighboring cars andis entirely misclassified by the Seamless model However ourEfficientPS model accurately captures this heavily occludedobject instance In Figure 6 (c) interestingly the Seamlessmodel misclassifies the cyclist on the road as a pedestrianWe hypothesize that this might be due to the fact that one ofthe legs of the cyclist is touching the ground and the other legwhich is on the pedal of the bicycle is barely visible Hencethis causes the Seamless model to misclassify the object in-stance Whereas our EfficientPS model effectively leveragesboth the semantic and instance prediction in our panopticfusion module to accurately address this ambiguity in thescene We also observe in Figure 6 (c) that the EfficientPSmodel misclassifies the traffic sign fixed on the fence andonly partially segments the advertisement board attached tothe building near the fence while it accurately segments allthe other instances of this class This is primarily due to thefact that there is a lack of relevant examples for this type oftraffic sign which is atypical of those found in the trainingset

Figure 6 (e) and (f) show qualitative comparisons on theKITTI dataset In Figure 6 (e) we see that the Seamless

model misclassifies the bus that is towards the right of theimage as a truck although it segments the object coherentlyThis is primarily due to the fact that there are poles as well asan advertisement board in front of the bus which divides theit into different subregions This leads the model to predict itas a truck that has a transition between the tractor unit and thetrailer However our proposed EfficientPS model mitigatesthis problem with its bidirectional aggregation of multi-scalefeatures that effectively captures contextual information InFigure 6 (f) we observe that a distant truck on the rightlane is partially occluded by cars behind it which causes theSeamless model to not detect the truck as a new instancerather it detects the truck and the car behind it as being thesame object This is similar to the scenario observed on theCityscapes dataset in Figure 6 (a) Nevertheless our pro-posed EfficientPS model yields accurate predictions in suchchallenging scenarios consistently across different datasets

In Figure 6 (g) and (h) we present examples from theIDD dataset We can see that our EfficientPS model cap-tures the boundaries of lsquostuffrsquo classes more precisely than theSeamless model in both the examples For instance the pillarof the bridge in Figure 6 (g) and the extent of the sidewalk inFigure 6 (h) are more well defined in the panoptic segmenta-tion output of our EfficientPS model This can be attributedto the object boundary refinement ability of our semantichead that correlates features of different scales before fusingthem In Figure 6 (h) the Seamless model misclassifies theauto-rickshaw as a caravan due to the similar visual appear-ances of these two objects however our proposed EfficientPSmodel with our novel panoptic backbone has an extensiverepresentational capacity which enables it to accurately clas-sify objects even with such subtle differences We observethat although the upper half of the cyclist towards the left ofthe image is accurately segmented the front leg of the cyclistis misclassifies as being part of the bicycle This is a chal-lenging scenario due to the high contrast in this region Wealso observe that the boundary of the sidewalk towards theleft of the auto rickshaw is misclassified However on visualinspection of the groundtruth it appears that the sidewalkboundary in this region is mislabeled in groundtruth maskwhile the model is making a reasonable prediction

46 Visualizations

We present visualizations of panoptic segmentation resultsfrom our proposed EfficientPS architecture on CityscapesMapillary Vistas KITTI and Indian Driving Dataset (IDD)in Figure 7 The figures show the panoptic segmentationoutput of our EfficientPS model using single scale evalu-ation which is overlaid on the input image Figure 7 (a) and(b) show examples from the Cityscapes dataset which ex-hibit complex road scenes consisting of a large number of

EfficientPS Efficient Panoptic Segmentation 23

(i) (ii) (iii) (iv)

(a)C

itysc

apes

(b)C

itysc

apes

(c)M

apill

ary

Vis

tas

(d)M

apill

ary

Vis

tas

(e)K

ITT

I(f

)KIT

TI

(g)I

DD

(h)I

DD

Figure 7 Visual panoptic segmentation results of our proposed EfficientPS model on each of the challenging urban scene understanding datasetsthat we benchmark on which in total encompasses scenes from over 50 countries These examples show complex urban scenarios with numerousobject instances in multiple scales and with partial occlusion These scenes also show diverse lighting conditions from dawn to dusk as well asseasonal changes

24 Rohit Mohan Abhinav Valada

traffic participants These examples show challenging scen-arios with dynamic as well as static pedestrian groups inclose proximity to each other and distant parked cars thatare barely visible due to their neighbouring lsquothingrsquo class in-stances Our proposed EfficientPS architecture effectivelyaddresses these challenges and yields reliable panoptic seg-mentation results In Figure 7 (c) and (d) we present resultson the Mapillary Vistas dataset that show drastic viewpointvariations and scenes in different times of day Figure 7 (civ)(di) and (div) show scenes that were captured from uncom-mon viewpoints from those observed in the training dataand Figure 7 (diii) shows a scene that was captured duringnighttime Nevertheless our EfficientPS model demonstratessubstantial robustness against these perceptual variations

In Figure 7 (e) and (f) we present results on the KITTIdataset which show residential and highway road scenesconsisting of several parked and dynamic cars as well as alarge amount of thin structures such as poles We observethat our EfficientPS model generalizes effectively to thesecomplex scenes even when the network was only trainedon the relatively small dataset Figure 7 (g) and (h) showexamples from the IDD dataset that highlight challengesof an unstructured environment One such challenge is theaccurate segmentation of sidewalks as the transition betweenthe road and the sidewalk is not well delineated often causedby a layer of sand over asphalt The examples also showheavy traffic with numerous types of vehicles motorcyclesand pedestrians scattered all over the scene However ourproposed EfficientPS model shows exceptional robustness inthese immensely challenging scenes thereby demonstratingits suitability for autonomous driving applications

5 Conclusions

In this paper we presented our EfficientPS architecture forpanoptic segmentation that achieves state-of-the-art perform-ance while being computationally efficient It incorporatesour proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instancehead a new semantic head that captures fine and contextualfeatures efficiently and our novel adaptive panoptic fusionmodule We demonstrated that our panoptic backbone con-sisting of the modified EfficientNet encoder and our 2-wayFPN achieves the right trade-off between performance andcomputational complexity Our 2-way FPN achieves effect-ive aggregation of semantically rich multi-scale features dueto its bidirectional flow of information Thus in combina-tion with our encoder it establishes a new strong panopticbackbone We proposed a new semantic head that employsscale-specific feature aggregation to capture long-range con-text and characteristic features effectively followed by cor-relating them to achieve better object boundary refinementcapability We also introduced our parameter-free panoptic

fusion module that dynamically fuses logits from both headsbased on their mask confidences and congruously integratesinstance-specific lsquothingrsquo classes with lsquostuffrsquo classes to yieldthe panoptic segmentation output

Additionally we introduced the KITTI panoptic segment-ation dataset that contains panoptic groundtruth annotationsfor images from the challenging KITTI benchmark We hopethat our panoptic annotations complement the suite of otherperception tasks in KITTI and encourage the research com-munity to develop novel multi-task learning methods thatinclude panoptic segmentation We presented exhaustivebenchmarking results on Cityscapes Mapillary Vistas KITTIand IDD datasets that demonstrate that our proposed Effi-cientPS sets the new state-of-the-art in panoptic segmentationwhile being faster and more parameter efficient than exist-ing architectures In addition to being ranked first on theCityscapes panoptic segmentation leaderboard our model isranked second on both the Cityscapes semantic segmentationand instance segmentation leaderboards We also presenteddetailed ablation studies qualitative analysis and visualiza-tions that highlight the improvements that we make to variouscore modules of panoptic segmentation architectures To thebest of our knowledge this work is the first to benchmarkon all the four standard urban scene understanding datasetsthat support panoptic segmentation and exceed the state-of-the-art on each of them while simultaneously being the mostefficient

Acknowledgements

This work was partly funded by the European Unionrsquos Ho-rizon 2020 research and innovation program under grantagreement No 871449-OpenDR and a Google Cloud researchgrant

References

Arbelaez P Pont-Tuset J Barron JT Marques F Malik J (2014)Multiscale combinatorial grouping In Proceedings of the Confer-ence on Computer Vision and Pattern Recognition pp 328ndash335

Badrinarayanan V Kendall A Cipolla R (2015) Segnet A deep convo-lutional encoder-decoder architecture for image segmentation arXivpreprint arXiv 151100561

Bai M Urtasun R (2017) Deep watershed transform for instance seg-mentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 5221ndash5229

Bremner JG Slater A (2008) Theories of infant development JohnWiley amp Sons

Brostow GJ Shotton J Fauqueur J Cipolla R (2008) Segmentation andrecognition using structure from motion point clouds In Europeanconference on computer vision Springer pp 44ndash57

Bulo SR Neuhold G Kontschieder P (2017) Loss max-pooling forsemantic image segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 7082ndash7091

Chen LC Papandreou G Kokkinos I Murphy K Yuille AL (2017a)Deeplab Semantic image segmentation with deep convolutional nets

EfficientPS Efficient Panoptic Segmentation 25

atrous convolution and fully connected crfs IEEE transactions onpattern analysis and machine intelligence 40(4)834ndash848

Chen LC Papandreou G Schroff F Adam H (2017b) Rethinking at-rous convolution for semantic image segmentation arXiv preprintarXiv170605587

Chen LC Collins M Zhu Y Papandreou G Zoph B Schroff F AdamH Shlens J (2018a) Searching for efficient multi-scale architecturesfor dense image prediction In Advances in Neural InformationProcessing Systems pp 8713ndash8724

Chen LC Zhu Y Papandreou G Schroff F Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image seg-mentation arXiv preprint arXiv180202611

Cheng B Collins MD Zhu Y Liu T Huang TS Adam H Chen LC(2019) Panoptic-deeplab A simple strong and fast baseline forbottom-up panoptic segmentation arXiv preprint arXiv191110194

Chollet F (2017) Xception Deep learning with depthwise separableconvolutions In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 1251ndash1258

Cordts M Omran M Ramos S Rehfeld T Enzweiler M Benenson RFranke U Roth S Schiele B (2016) The cityscapes dataset for se-mantic urban scene understanding In Proceedings of the Conferenceon Computer Vision and Pattern Recognition pp 3213ndash3223

Dai J He K Li Y Ren S Sun J (2016) Instance-sensitive fully convolu-tional networks In European Conference on Computer Vision pp534ndash549

Dai J Qi H Xiong Y Li Y Zhang G Hu H Wei Y (2017) Deform-able convolutional networks In Proceedings of the InternationalConference on Computer Vision pp 764ndash773

Gao N Shan Y Wang Y Zhao X Yu Y Yang M Huang K (2019)Ssap Single-shot instance segmentation with affinity pyramid InProceedings of the International Conference on Computer Vision pp642ndash651

Geiger A Lenz P Stiller C Urtasun R (2013) Vision meets roboticsThe kitti dataset International Journal of Robotics Research

de Geus D Meletis P Dubbelman G (2018) Panoptic segmentation witha joint semantic and instance segmentation network arXiv preprintarXiv180902110

Girshick R (2015) Fast r-cnn In Proceedings of the International Con-ference on Computer Vision pp 1440ndash1448

Glorot X Bengio Y (2010) Understanding the difficulty of trainingdeep feedforward neural networks In Proceedings of the thirteenthinternational conference on artificial intelligence and statistics pp249ndash256

Hariharan B Arbelaez P Girshick R Malik J (2014) Simultaneousdetection and segmentation In European Conference on ComputerVision pp 297ndash312

Hariharan B Arbelaez P Girshick R Malik J (2015) Hypercolumns forobject segmentation and fine-grained localization In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp447ndash456

He K Zhang X Ren S Sun J (2016) Deep residual learning for imagerecognition In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 770ndash778

He K Gkioxari G Dollar P Girshick R (2017) Mask r-cnn In Pro-ceedings of the International Conference on Computer Vision pp2961ndash2969

He X Gould S (2014a) An exemplar-based crf for multi-instance objectsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition

He X Gould S (2014b) An exemplar-based crf for multi-instance objectsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 296ndash303

Howard A Sandler M Chu G Chen LC Chen B Tan M Wang W ZhuY Pang R Vasudevan V et al (2019) Searching for mobilenetv3 InProceedings of the International Conference on Computer Vision pp1314ndash1324

Hu J Shen L Sun G (2018) Squeeze-and-excitation networks In Pro-ceedings of the Conference on Computer Vision and Pattern Recog-nition pp 7132ndash7141

Ioffe S Szegedy C (2015) Batch normalization Accelerating deepnetwork training by reducing internal covariate shift arXiv preprintarXiv150203167

Kaiser L Gomez AN Chollet F (2017) Depthwise separable convolu-tions for neural machine translation arXiv preprint arXiv170603059

Kang BR Kim HY (2018) Bshapenet Object detection and in-stance segmentation with bounding shape masks arXiv preprintarXiv181010327

Kirillov A He K Girshick R Rother C Dollar P (2019) Panopticsegmentation In Proceedings of the Conference on Computer Visionand Pattern Recognition pp 9404ndash9413

Kontschieder P Bulo SR Bischof H Pelillo M (2011) Structured class-labels in random forests for semantic image labelling In Proceed-ings of the International Conference on Computer Vision pp 2190ndash2197

Krahenbuhl P Koltun V (2011) Efficient inference in fully connectedcrfs with gaussian edge potentials In Advances in neural informa-tion processing systems pp 109ndash117

Li J Raventos A Bhargava A Tagawa T Gaidon A (2018a) Learningto fuse things and stuff arXiv preprint arXiv181201192

Li Q Arnab A Torr PH (2018b) Weakly-and semi-supervised panop-tic segmentation In Proceedings of the European Conference onComputer Vision (ECCV) pp 102ndash118

Li X Zhang L You A Yang M Yang K Tong Y (2019a) Globalaggregation then local distribution in fully convolutional networksarXiv preprint arXiv190907229

Li Y Qi H Dai J Ji X Wei Y (2017) Fully convolutional instance-aware semantic segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition pp 2359ndash2367

Li Y Chen X Zhu Z Xie L Huang G Du D Wang X (2019b) Attention-guided unified network for panoptic segmentation In Proceedingsof the Conference on Computer Vision and Pattern Recognition pp7026ndash7035

Lin TY Maire M Belongie S Hays J Perona P Ramanan D Dollar PZitnick CL (2014) Microsoft coco Common objects in context InEuropean conference on computer vision Springer pp 740ndash755

Lin TY Dollar P Girshick R He K Hariharan B Belongie S (2017)Feature pyramid networks for object detection In Proceedings ofthe Conference on Computer Vision and Pattern Recognition pp2117ndash2125

Liu H Peng C Yu C Wang J Liu X Yu G Jiang W (2019) An end-to-end network for panoptic segmentation In Proceedings of theConference on Computer Vision and Pattern Recognition pp 6172ndash6181

Liu S Jia J Fidler S Urtasun R (2017) Sgn Sequential grouping net-works for instance segmentation In Proceedings of the InternationalConference on Computer Vision pp 3496ndash3504

Liu S Qi L Qin H Shi J Jia J (2018) Path aggregation network for in-stance segmentation In Proceedings of the Conference on ComputerVision and Pattern Recognition pp 8759ndash8768

Liu W Rabinovich A Berg AC (2015) Parsenet Looking wider to seebetter arXiv preprint arXiv 150604579

Long J Shelhamer E Darrell T (2015) Fully convolutional networksfor semantic segmentation In Proceedings of the Conference onComputer Vision and Pattern Recognition

Neuhold G Ollmann T Rota Bulo S Kontschieder P (2017) The mapil-lary vistas dataset for semantic understanding of street scenes InProceedings of the International Conference on Computer Vision pp4990ndash4999

Paszke A Gross S Massa F Lerer A Bradbury J Chanan G Killeen TLin Z Gimelshein N Antiga L et al (2019) Pytorch An imperativestyle high-performance deep learning library In Advances in NeuralInformation Processing Systems pp 8024ndash8035

26 Rohit Mohan Abhinav Valada

Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998

Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824

Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286

Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887

Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664

Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329

Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238

Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of DNNs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252

Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition

Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631

Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

Sofiiuk K, Barinova O, Konushin A (2019) AdaptIS: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363

Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference

Tan M, Le QV (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946

Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135

Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755

Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection, and recognition. International Journal of Computer Vision 63(2):113–140

Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25

Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots

Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop: Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics

Valada A, Vertens J, Dhall A, Burgard W (2017) AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651

Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)

Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, special issue: Deep Learning for Robotic Vision

Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751

Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19

Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500

Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) UPSNet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826

Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349

Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) DeeperLab: Single-shot image parser. arXiv preprint arXiv:1902.05093

Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709

Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122

Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721

Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677

Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890

Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227

Page 24: arXiv:2004.02307v2 [cs.CV] 19 May 2020 · 16 48x 512x 1024 1 Input 3x 1024x 2048 P 32 128x 64x128 32x64 DPC 1 256x 256x512 256x 64x128 128x256 24x 512x 1024 40x 256x 512 64x 128x

24 Rohit Mohan Abhinav Valada

traffic participants These examples show challenging scen-arios with dynamic as well as static pedestrian groups inclose proximity to each other and distant parked cars thatare barely visible due to their neighbouring lsquothingrsquo class in-stances Our proposed EfficientPS architecture effectivelyaddresses these challenges and yields reliable panoptic seg-mentation results In Figure 7 (c) and (d) we present resultson the Mapillary Vistas dataset that show drastic viewpointvariations and scenes in different times of day Figure 7 (civ)(di) and (div) show scenes that were captured from uncom-mon viewpoints from those observed in the training dataand Figure 7 (diii) shows a scene that was captured duringnighttime Nevertheless our EfficientPS model demonstratessubstantial robustness against these perceptual variations

In Figure 7 (e) and (f) we present results on the KITTIdataset which show residential and highway road scenesconsisting of several parked and dynamic cars as well as alarge amount of thin structures such as poles We observethat our EfficientPS model generalizes effectively to thesecomplex scenes even when the network was only trainedon the relatively small dataset Figure 7 (g) and (h) showexamples from the IDD dataset that highlight challengesof an unstructured environment One such challenge is theaccurate segmentation of sidewalks as the transition betweenthe road and the sidewalk is not well delineated often causedby a layer of sand over asphalt The examples also showheavy traffic with numerous types of vehicles motorcyclesand pedestrians scattered all over the scene However ourproposed EfficientPS model shows exceptional robustness inthese immensely challenging scenes thereby demonstratingits suitability for autonomous driving applications

5 Conclusions

In this paper we presented our EfficientPS architecture forpanoptic segmentation that achieves state-of-the-art perform-ance while being computationally efficient It incorporatesour proposed panoptic backbone with a variant of Mask R-CNN augmented with separable convolutions as the instancehead a new semantic head that captures fine and contextualfeatures efficiently and our novel adaptive panoptic fusionmodule We demonstrated that our panoptic backbone con-sisting of the modified EfficientNet encoder and our 2-wayFPN achieves the right trade-off between performance andcomputational complexity Our 2-way FPN achieves effect-ive aggregation of semantically rich multi-scale features dueto its bidirectional flow of information Thus in combina-tion with our encoder it establishes a new strong panopticbackbone We proposed a new semantic head that employsscale-specific feature aggregation to capture long-range con-text and characteristic features effectively followed by cor-relating them to achieve better object boundary refinementcapability We also introduced our parameter-free panoptic

fusion module that dynamically fuses logits from both headsbased on their mask confidences and congruously integratesinstance-specific lsquothingrsquo classes with lsquostuffrsquo classes to yieldthe panoptic segmentation output

Additionally we introduced the KITTI panoptic segment-ation dataset that contains panoptic groundtruth annotationsfor images from the challenging KITTI benchmark We hopethat our panoptic annotations complement the suite of otherperception tasks in KITTI and encourage the research com-munity to develop novel multi-task learning methods thatinclude panoptic segmentation We presented exhaustivebenchmarking results on Cityscapes Mapillary Vistas KITTIand IDD datasets that demonstrate that our proposed Effi-cientPS sets the new state-of-the-art in panoptic segmentationwhile being faster and more parameter efficient than exist-ing architectures In addition to being ranked first on theCityscapes panoptic segmentation leaderboard our model isranked second on both the Cityscapes semantic segmentationand instance segmentation leaderboards We also presenteddetailed ablation studies qualitative analysis and visualiza-tions that highlight the improvements that we make to variouscore modules of panoptic segmentation architectures To thebest of our knowledge this work is the first to benchmarkon all the four standard urban scene understanding datasetsthat support panoptic segmentation and exceed the state-of-the-art on each of them while simultaneously being the mostefficient

Acknowledgements

This work was partly funded by the European Unionrsquos Ho-rizon 2020 research and innovation program under grantagreement No 871449-OpenDR and a Google Cloud researchgrant

References

Arbelaez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 328–335

Badrinarayanan V, Kendall A, Cipolla R (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561

Bai M, Urtasun R (2017) Deep watershed transform for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5221–5229

Bremner JG, Slater A (2008) Theories of infant development. John Wiley & Sons

Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: European Conference on Computer Vision, Springer, pp 44–57

Bulo SR, Neuhold G, Kontschieder P (2017) Loss max-pooling for semantic image segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7082–7091

Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017a) DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):834–848

Chen LC, Papandreou G, Schroff F, Adam H (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587

Chen LC, Collins M, Zhu Y, Papandreou G, Zoph B, Schroff F, Adam H, Shlens J (2018a) Searching for efficient multi-scale architectures for dense image prediction. In: Advances in Neural Information Processing Systems, pp 8713–8724

Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018b) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611

Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194

Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1251–1258

Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3213–3223

Dai J, He K, Li Y, Ren S, Sun J (2016) Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, pp 534–549

Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the International Conference on Computer Vision, pp 764–773

Gao N, Shan Y, Wang Y, Zhao X, Yu Y, Yang M, Huang K (2019) SSAP: Single-shot instance segmentation with affinity pyramid. In: Proceedings of the International Conference on Computer Vision, pp 642–651

Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The KITTI dataset. International Journal of Robotics Research

de Geus D, Meletis P, Dubbelman G (2018) Panoptic segmentation with a joint semantic and instance segmentation network. arXiv preprint arXiv:1809.02110

Girshick R (2015) Fast R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 1440–1448

Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 249–256

Hariharan B, Arbelaez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European Conference on Computer Vision, pp 297–312

Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 447–456

He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 770–778

He K, Gkioxari G, Dollar P, Girshick R (2017) Mask R-CNN. In: Proceedings of the International Conference on Computer Vision, pp 2961–2969

He X, Gould S (2014a) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition

He X, Gould S (2014b) An exemplar-based CRF for multi-instance object segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 296–303

Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al (2019) Searching for MobileNetV3. In: Proceedings of the International Conference on Computer Vision, pp 1314–1324

Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7132–7141

Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167

Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059

Kang BR, Kim HY (2018) BshapeNet: Object detection and instance segmentation with bounding shape masks. arXiv preprint arXiv:1810.10327

Kirillov A, He K, Girshick R, Rother C, Dollar P (2019) Panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 9404–9413

Kontschieder P, Bulo SR, Bischof H, Pelillo M (2011) Structured class-labels in random forests for semantic image labelling. In: Proceedings of the International Conference on Computer Vision, pp 2190–2197

Krahenbuhl P, Koltun V (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in Neural Information Processing Systems, pp 109–117

Li J, Raventos A, Bhargava A, Tagawa T, Gaidon A (2018a) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192

Li Q, Arnab A, Torr PH (2018b) Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 102–118

Li X, Zhang L, You A, Yang M, Yang K, Tong Y (2019a) Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229

Li Y, Qi H, Dai J, Ji X, Wei Y (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2359–2367

Li Y, Chen X, Zhu Z, Xie L, Huang G, Du D, Wang X (2019b) Attention-guided unified network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 7026–7035

Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: European Conference on Computer Vision, Springer, pp 740–755

Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2117–2125

Liu H, Peng C, Yu C, Wang J, Liu X, Yu G, Jiang W (2019) An end-to-end network for panoptic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6172–6181

Liu S, Jia J, Fidler S, Urtasun R (2017) SGN: Sequential grouping networks for instance segmentation. In: Proceedings of the International Conference on Computer Vision, pp 3496–3504

Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8759–8768

Liu W, Rabinovich A, Berg AC (2015) ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579

Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition

Neuhold G, Ollmann T, Rota Bulo S, Kontschieder P (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In: Proceedings of the International Conference on Computer Vision, pp 4990–4999

Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp 8024–8035


Pinheiro PO, Collobert R, Dollar P (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998

Plath N, Toussaint M, Nakajima S (2009) Multi-class image segmentation using conditional random fields and global classification. In: Proceedings of the International Conference on Machine Learning, pp 817–824

Porzi L, Bulo SR, Colovic A, Kontschieder P (2019) Seamless scene segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8277–8286

Radwan N, Valada A, Burgard W (2018) Multimodal interaction-aware motion prediction for autonomous street crossing. arXiv preprint arXiv:1808.06887

Ren M, Zemel RS (2017) End-to-end instance segmentation with recurrent attention. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 6656–6664

Romera-Paredes B, Torr PHS (2016) Recurrent instance segmentation. In: European Conference on Computer Vision, Springer, pp 312–329

Ros G, Ramos S, Granados M, Bakhtiary A, Vazquez D, Lopez AM (2015) Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision, pp 231–238

Rota Bulo S, Porzi L, Kontschieder P (2018) In-place activated batchnorm for memory-optimized training of DNNs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 5639–5647

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252

Shotton J, Johnson M, Cipolla R (2008) Semantic texton forests for image categorization and segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition

Silberman N, Sontag D, Fergus R (2014) Instance segmentation of indoor scenes using a coverage loss. In: European Conference on Computer Vision, pp 616–631

Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

Sofiiuk K, Barinova O, Konushin A (2019) AdaptIS: Adaptive instance selection network. In: Proceedings of the International Conference on Computer Vision, pp 7355–7363

Sturgess P, Alahari K, Ladicky L, Torr PH (2009) Combining appearance and structure from motion features for road scene understanding. In: British Machine Vision Conference

Tan M, Le QV (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946

Tian Z, He T, Shen C, Yan Y (2019) Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3126–3135

Tighe J, Niethammer M, Lazebnik S (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 3748–3755

Tu Z, Chen X, Yuille AL, Zhu SC (2005) Image parsing: Unifying segmentation, detection, and recognition. International Journal of Computer Vision 63(2):113–140

Uhrig J, Cordts M, Franke U, Brox T (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German Conference on Pattern Recognition, pp 14–25

Valada A, Dhall A, Burgard W (2016a) Convoluted mixture of deep experts for robust semantic segmentation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on State Estimation and Terrain Perception for All Terrain Mobile Robots

Valada A, Oliveira G, Brox T, Burgard W (2016b) Towards robust semantic segmentation using deep fusion. In: Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics

Valada A, Vertens J, Dhall A, Burgard W (2017) AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp 4644–4651

Valada A, Radwan N, Burgard W (2018) Incorporating semantic and geometric priors in deep pose regression. In: Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS)

Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, DOI 10.1007/s11263-019-01188-y, Special Issue: Deep Learning for Robotic Vision

Varma G, Subramanian A, Namboodiri A, Chandraker M, Jawahar C (2019) IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1743–1751

Wu Y, He K (2018) Group normalization. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19

Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 1492–1500

Xiong Y, Liao R, Zhao H, Hu R, Bai M, Yumer E, Urtasun R (2019) UPSNet: A unified panoptic segmentation network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 8818–8826

Xu P, Davoine F, Bordes JB, Zhao H, Denœux T (2016) Multimodal information fusion for urban scene understanding. Machine Vision and Applications 27(3):331–349

Yang TJ, Collins MD, Zhu Y, Hwang JJ, Liu T, Zhang X, Sze V, Papandreou G, Chen LC (2019) DeeperLab: Single-shot image parser. arXiv preprint arXiv:1902.05093

Yao J, Fidler S, Urtasun R (2012) Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 702–709

Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122

Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: European Conference on Computer Vision, Springer, pp 708–721

Zhang Z, Fidler S, Urtasun R (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 669–677

Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp 2881–2890

Zurn J, Burgard W, Valada A (2019) Self-supervised visual terrain classification from unsupervised acoustic feature learning. arXiv preprint arXiv:1912.03227
