
Universal Semi-Supervised Semantic Segmentation

Tarun Kalluri 1    Girish Varma 1    Manmohan Chandraker 2    C V Jawahar 1

1 IIIT Hyderabad    2 University of California, San Diego

Abstract

In recent years, the need for semantic segmentation has arisen across several different applications and environments. However, the expense and redundancy of annotation often limits the quantity of labels available for training in any domain, while deployment is easier if a single model works well across domains. In this paper, we pose the novel problem of universal semi-supervised semantic segmentation and propose a solution framework, to meet the dual needs of lower annotation and deployment costs. In contrast to counterpoints such as fine tuning, joint training or unsupervised domain adaptation, universal semi-supervised segmentation ensures that across all domains: (i) a single model is deployed, (ii) unlabeled data is used, (iii) performance is improved, (iv) only a few labels are needed and (v) label spaces may differ. To address this, we minimize supervised as well as within and cross-domain unsupervised losses, introducing a novel feature alignment objective based on pixel-aware entropy regularization for the latter. We demonstrate quantitative advantages over other approaches on several combinations of segmentation datasets across different geographies (Germany, England, India) and environments (outdoors, indoors), as well as qualitative insights on the aligned representations.

1. Introduction

Semantic segmentation is the task of pixel level classification of an image into a predefined set of categories. State-of-the-art semantic segmentation architectures [31, 3, 8] pretrain deep networks on a classification task on datasets like ImageNet [12, 44] and then fine-tune on finely annotated labeled examples [11, 54]. The availability of such large-scale labeled datasets has been crucial to achieve high accuracies for semantic segmentation in applications ranging from natural scene understanding [18] to medical imaging [43]. However, performance often suffers even in the presence of a minor domain shift. For example, a segmentation model trained on a driving dataset from a specific geographic location may not generalize to a new city due to differences in weather, lighting or traffic density. Further, a segmentation model trained on traffic scenes for outdoor navigation may not be applicable for an indoor robot.

Figure 1: The proposed universal segmentation model can be jointly trained across datasets with different label spaces, making use of the large amounts of unlabeled data available. Traditional domain adaptation and transfer learning based approaches typically require training separate models for each domain.

While such domain shift is a challenge for any machine learning problem, it is particularly exacerbated for segmentation, where human annotation is highly prohibitive and redundant for different locations and tasks. Thus, there is a growing interest towards learning segmentation representations that may be shared across domains. A prominent line of work addresses this through unsupervised domain adaptation from a labeled source to an unlabeled target domain [23, 51, 9, 36, 6]. But there remain limitations. For instance, unsupervised domain adaptation usually does not leverage target domain data to improve source performance. Further, it is designed for the restrictive setting of a large-scale labeled source domain and an unlabeled target domain. While some applications such as self-driving have large-scale annotated datasets for particular source domains, the vast majority of applications only have limited data in any domain. Finally, most of the above works assume that the target label set matches the source one, which is often not the case in practice. For example, road scene segmentation across different countries, or segmentation across outdoor and indoor scenes, have domain-specific label sets.

In this paper, we propose and address the novel problem of universal semi-supervised semantic segmentation as a practical setting for many real-world applications. It seeks to aggregate knowledge from several different domains during training, each of which has few labeled examples but several unlabeled examples. The goal is to simultaneously limit training cost through reduced annotations and deployment cost by obtaining a single model to be used across domains. Label spaces may be partially or fully non-overlapping across domains. While fine-tuning a source model on a small amount of target data is a possible counterpoint, it usually requires plentiful source labels and necessitates deployment of a separate model in every domain. Another option is joint training, which does yield a unified model across domains, but does not leverage unlabeled data available in each domain. Our semi-supervised universal segmentation approach leverages both limited labeled and larger-scale unlabeled data in every domain, to obtain a single model that performs well across domains.

In particular, we use the labeled examples in each domain to supervise the universal model, akin to multi-tasking [27, 34, 26], albeit with limited labels. We simultaneously make use of the large number of unlabeled examples to align pixel level deep feature representations from multiple domains using entropy regularization based objective functions. We calculate the similarity score vector between the features and the label embeddings (computed from class prototypes [49]) and minimize the entropy of this discrete distribution to positively align similar examples between the labeled and the unlabeled images. We do this unsupervised alignment both within domain, as well as across domains.

We believe such within and cross-domain alignment is fruitful even with non-overlapping label spaces, particularly so for semantic segmentation, since label definitions often encode relationships that may positively reinforce performance in each domain. For instance, two road scene datasets such as Cityscapes [11] and IDD [54] might have different labels, but share similar label hierarchies. Even an outdoor dataset like Cityscapes and an indoor one like SUN [50] may have label relationships, for example, between horizontal (road, floor) and vertical (building, wall) classes. Similar observations have been made for multi-task training [60].

We posit that our pixel wise entropy-based objective discovers such alignments to improve over joint training, as demonstrated quantitatively and qualitatively in our experiments. Specifically, our experiments lend insights across various notions of domain gaps. With Cityscapes [11] as one of the domains (road scenes in Germany), we derive universal models with respect to CamVid (roads in England) [4], IDD (roads in India) [54] and SUN (indoor rooms) [50]. In each case, our semi-supervised universal model improves over fine-tuning and joint training, with visualizations of the learned feature representations demonstrating conceptually meaningful alignments. We use dilated residual networks in our experiments [59], but the framework is equally applicable to any of the existing deep encoder-decoder architectures for semantic segmentation.

In summary, we make the following contributions:

• We propose a universal segmentation framework to train a single joint model on multiple domains with disparate label spaces to improve performance on each domain.

• We introduce a pixel-level entropy regularization scheme to train semantic segmentation architectures using datasets with few labeled examples and larger quantities of unlabeled examples.

• We demonstrate the effectiveness of our alignment over a wide variety of indoor [50] and outdoor [11, 54, 4] segmentation datasets with various degrees of label overlap.

2. Related Work

Semantic Segmentation  Most of the state of the art models for semantic segmentation [59, 31, 37, 3, 8, 42] have been possible largely due to breakthroughs in deep learning that have pushed the boundaries of performance in image classification [28, 21, 22] and related tasks. The pioneering work in [31] proposes an end-to-end trainable network for semantic segmentation by replacing the fully connected layers of pretrained AlexNet [28] and VGG Net [48] with fully convolutional layers that aggregate spatial information across various resolutions. Noh et al. [37] use transposed convolutions to build a learnable decoder module, while the DeepLab network [8] uses atrous convolutions along with atrous spatial pyramid pooling for better aggregation of spatial features. Segmentation architectures based on dilated convolutions [58] for real time semantic segmentation have also been proposed [59, 42].

Transfer Learning and Domain Adaptation  Transfer learning [55] involves transferring deep feature representations learned in one domain or task to another domain or task where labeled data availability is low. Previous works demonstrate transfer learning capabilities between related tasks [13, 61, 39, 40] or completely different tasks [19, 41, 31]. Unsupervised domain adaptation is a related paradigm which leverages labeled data from one or more source domains to learn a classifier for a new unsupervised target domain in the presence of a domain shift. Ganin et al. [16, 17], followed by [53, 52, 23, 51, 5, 9], propose various models for learning domain agnostic discriminative feature representations with the help of an auxiliary domain classifier trained using an adversarial loss. Most of these works in domain adaptation assume equal source and target dataset label spaces or a subset target label space, which is not the most general case for real world applications. Luo et al. [32] relax this assumption to perform few shot adaptation across arbitrary label spaces for image and action recognition tasks. While transfer learning or adaptation based methods are typically focused on using knowledge from a labeled source domain to improve performance on a specific target domain, we propose a joint training framework to train a single model that delivers good performance on both domains.

Figure 2: Different modules in the proposed universal semantic segmentation framework. {X_l^(1), Y^(1)} and {X_l^(2), Y^(2)} are the labeled examples and X_u^(1), X_u^(2) are the unlabeled examples. Both datasets share an encoder but have two different decoder layers G_1 and G_2 to enable prediction in their respective output spaces. The entropy module uses the unlabeled examples to align pixel wise features from multiple domains by calculating pixel wise similarity with the label embeddings, and minimizing the entropy of this discrete distribution.

Label Embeddings  Most tasks in computer vision perform prediction in a one-hot label space where each label is represented as a one-hot vector. However, following the success in metric learning based approaches [24], tasks such as fine grained classification [2, 1], latent hierarchy learning [45] and zero-shot prediction [38, 15, 29] have benefited greatly by projecting images and labels into a common subspace. Popular methods for obtaining vector representations for labels include hand crafted semantic attribute vectors [56] or semantic representations of labels [38, 15] from word2vec [35]. More recent works propose using prototypes computed from class wise feature representations as the label embeddings to perform few shot classification [49, 14], unsupervised domain adaptation [57] as well as few shot domain adaptation [32].

Universal Segmentation  Multitask learning [7] is shown to help in improving performance for many tasks that share useful relationships between them in computer vision [47, 27, 60], natural language processing [10, 34, 26] and speech recognition [46]. Universal segmentation builds on this idea by training a single joint network applicable across multiple semantic segmentation domains with possibly different label spaces to make use of transferable representations at lower levels of the network. Liang et al. [30] first propose the idea of universal segmentation by performing dynamic propagation through a label hierarchy graph constructed from an external knowledge source like WordNet. We propose an alternative method to perform universal segmentation without the need for any outside knowledge source or additional model parameters during inference, and instead make efficient use of the large set of unlabeled examples in each of the domains for unsupervised feature alignment.

3. Problem Description

In this section, we explain the framework used to train a single model across different segmentation datasets with possibly disparate label spaces using a novel pixel aware entropy regularization objective.

We have d datasets {D^(i)}, i = 1, ..., d, each of which has few labeled examples and many unlabeled examples. The labeled images and corresponding labels from D^(i) are denoted by {X_l^(i), Y^(i)}, where Y^(i) ∈ Y_i and N_l^(i) is the number of labeled examples. The unlabeled images are represented by {X_u^(i)}, where N_u^(i) is the number of unlabeled examples. We work with domains with very few labeled examples, so N_u^(i) ≫ N_l^(i). We consider the general case of non-intersecting label spaces, that is, Y_p ≠ Y_q for any p, q. The label spaces might still have a partial overlap between them, which is common in the case of segmentation datasets. For ease of notation, we consider the special case of two datasets {D^(1), D^(2)}, but a similar idea can be applied for the case of multiple datasets as well.

The proposed universal segmentation model is summarized in Figure 2. We first concatenate the datasets D^(1) and D^(2) and randomly select samples from this mixture in each mini-batch for training. Deep semantic segmentation architectures generally consist of an encoder module which aggregates the spatial information across various resolutions and a decoder module that upsamples the image to enable pixel wise predictions at a resolution that matches the input. In order to enable joint training with multiple datasets, we modify this encoder-decoder architecture by having a shared encoder module F and different decoder layers G_1(.), G_2(.) for prediction in different label spaces. For a labeled input image x_l, the pixel wise predictions are denoted by y^(k) = G_k(F(x_l)) for k = 1, 2, which, along with the labeled annotations, gives us the supervised loss. To include the visual information from the unlabeled examples X_u, we propose an entropy regularization module E(.). This entropy module takes as input the output of the encoder F(.) to give pixel wise embedding representations. The entropy of the pixel level similarity scores of these embedding representations with the label embeddings results in the unsupervised loss term. Each of these loss terms is explained in detail in the following sections.
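To make this layout concrete, the following is a minimal PyTorch sketch of a shared encoder with per-dataset decoder heads and a shared embedding head. The class name, default channel sizes and the use of 1 × 1 convolutional heads are our own illustrative choices, not the authors' released implementation.

```python
import torch.nn as nn

class UniversalSegNet(nn.Module):
    """Sketch: shared encoder F, per-dataset decoders G_k, embedding head E."""

    def __init__(self, encoder, feat_dim=512, num_classes=(19, 11), embed_dim=128):
        super().__init__()
        self.encoder = encoder                       # shared F(.), e.g. a dilated ResNet-18
        # one lightweight decoder head per dataset: G_1(.), G_2(.)
        self.decoders = nn.ModuleList(
            nn.Conv2d(feat_dim, c, kernel_size=1) for c in num_classes)
        # entropy module E(.): projects features into the common embedding space R^d
        self.embed = nn.Conv2d(feat_dim, embed_dim, kernel_size=1)

    def forward(self, x, dataset_idx):
        feats = self.encoder(x)                      # B x feat_dim x H' x W'
        logits = self.decoders[dataset_idx](feats)   # prediction in that label space
        embeddings = self.embed(feats)               # B x embed_dim x H' x W'
        return logits, embeddings
```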

Supervised Loss  The supervised loss is calculated separately at each output softmax layer as a masked cross entropy loss between the predicted segmentation mask at that layer y and the corresponding pixel wise ground truth segmentation masks for all labeled examples. Specifically, for the output softmax layer k which corresponds to dataset k,

L_S^{(k)} = \frac{1}{N_l^{(k)}} \sum_{i,\, x_{l_i} \in D^{(k)}} \psi_k\left(y_i, G_k(F(x_{l_i}))\right),    (1)

where ψ_k is the softmax cross entropy loss function over the label space Y_k, which is averaged over all the pixels of the segmentation map. L_S^(1) and L_S^(2) together comprise the supervised loss term L_S.
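A minimal sketch of one term of Eq (1): each labeled image is scored by the decoder of its own dataset with a pixel-averaged cross entropy. The ignore index and the bilinear upsampling to label resolution are our assumptions about the data format, not details fixed by the paper.

```python
import torch.nn.functional as F

def supervised_loss(model, images, masks, dataset_idx, ignore_index=255):
    """Masked cross entropy over the label space of dataset `dataset_idx` (Eq 1)."""
    logits, _ = model(images, dataset_idx)
    # upsample predictions to the ground-truth resolution before scoring
    logits = F.interpolate(logits, size=masks.shape[-2:],
                           mode='bilinear', align_corners=False)
    # cross_entropy averages over all non-ignored pixels in the batch
    return F.cross_entropy(logits, masks, ignore_index=ignore_index)
```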

Figure 3: In addition to a traditional decoder layer that outputs predictions in the respective label spaces R^c, the proposed network consists of a domain matching entropy module E(.) that first maps the features of both domains into a common embedding space R^d, and then calculates similarity scores with the label embeddings of the respective datasets.

Entropy Module  The large number of unlabeled images in the training set provides us with rich information regarding the visual similarity between the domains and the label structure, which the existing methods on few shot segmentation [14] or universal segmentation [30] do not exploit. To address this issue, we propose using entropy regularization [20] to transfer the information from the labeled images to the unlabeled images. We introduce a pixel level entropy regularization module, which helps align the visually similar pixel level features from both datasets, computed from the segmentation network, close to each other in an unsupervised way.

The entropy module E takes as input the encoder output F(x) and projects the representation at each pixel into a d dimensional embedding space R^d. A similarity metric φ(·,·) which operates on each pixel is then used to calculate the similarity score of the embedding representations with each of the d dimensional label embeddings using the equation

[v_{ij}]_k = \phi\left(E(F(x_u^{(i)})),\, c_k^{(j)}\right) \quad \forall k \in \{1, \dots, |Y_j|\},    (2)

where x_u^(i) is an image from the i-th unlabeled set, c_k^(j) ∈ R^d is the label embedding corresponding to the k-th label from the j-th dataset, and [v_ij] ∈ R^{|Y_j|}. When i = j, the scores correspond to the similarity scores within a dataset, and when i ≠ j, they provide the cross dataset similarity scores.
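Since the paper uses a dot product for φ(·,·), the similarity map of Eq (2) can be computed as a 1 × 1 convolution whose filters are the label embeddings. The sketch below assumes pixel embeddings of shape B × d × H × W and a |Y_j| × d centroid matrix; the tensor layout is our assumption.

```python
import torch.nn.functional as F

def similarity_scores(pixel_embeddings, label_embeddings):
    """Dot-product similarity map [v_ij] of Eq (2).

    pixel_embeddings: B x d x H x W output of the entropy module E(F(x)).
    label_embeddings: |Y_j| x d matrix of centroids c_k^(j).
    Returns a B x |Y_j| x H x W score map; the dot product against every
    centroid is exactly a 1 x 1 convolution with the centroids as filters.
    """
    weight = label_embeddings[:, :, None, None]   # |Y_j| x d x 1 x 1
    return F.conv2d(pixel_embeddings, weight)
```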

To calculate the vector representation of the labels of a dataset, we first train an end to end segmentation model with all the supervised training data available from that dataset. Using the output of the encoder of this network, we calculate the centroid of the vector representations of all the pixels belonging to a label to obtain the label embedding for that particular label. These centroids are then used to initialize the label embeddings for the universal segmentation model. In our experiments, the label embeddings are kept static over the course of training the joint segmentation network, since we found that the limited supervised data was not sufficient to jointly train a universal segmentation model as well as fine tune the label embeddings. More information regarding this is provided in the supplementary section.

As demonstrated in Figure 3, although the entropy module E is similar to the decoder module G, in practice two crucial differences exist between the two. First, the entropy module projects the encoder features into a large d dimensional embedding space, while the decoder projects the features into a much smaller label space. Second, unlike the decoder, features corresponding to both datasets are projected into a common embedding space by the entropy module.

Unsupervised Loss  We have two parts for the unsupervised entropy loss. The first part, the cross dataset entropy loss, is obtained by minimizing the entropy of the cross dataset similarity vectors,

L_{US,c} = H(\sigma([v_{12}])) + H(\sigma([v_{21}])),    (3)

where H(·) is the entropy measure of a discrete distribution, σ(·) is the softmax operator and the similarity vector [v] is from Eq (2). Minimizing L_{US,c} makes the probability distribution peaky over a single label from a dataset, which helps to align visually similar images across datasets close to each other, improving the overall prediction of the network. In addition, we also have a within dataset entropy loss given by

L_{US,w} = H(\sigma([v_{11}])) + H(\sigma([v_{22}])),    (4)

which aligns the unlabeled examples within the same domain.
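A sketch of how these two terms can be computed from the similarity maps of Eq (2): a pixel-wise softmax followed by the Shannon entropy, averaged over all pixels. Averaging over pixels is our assumption for the reduction; the paper does not spell it out.

```python
import torch.nn.functional as F

def mean_pixel_entropy(scores):
    """Mean Shannon entropy H(sigma([v])) over all pixels of a score map."""
    log_p = F.log_softmax(scores, dim=1)          # B x C x H x W
    return -(log_p.exp() * log_p).sum(dim=1).mean()

def unsupervised_losses(v11, v22, v12, v21):
    """Within-dataset (Eq 4) and cross-dataset (Eq 3) entropy terms."""
    l_within = mean_pixel_entropy(v11) + mean_pixel_entropy(v22)
    l_cross = mean_pixel_entropy(v12) + mean_pixel_entropy(v21)
    return l_within, l_cross
```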

The total loss L_T is the sum of the supervised loss from Eq (1) and the unsupervised loss terms from Eq (3) and Eq (4), written as

L_T = L_S(X_l^{(1)}, Y^{(1)}, X_l^{(2)}, Y^{(2)}) + \alpha \cdot L_{US,c}(X_u^{(1)}, X_u^{(2)}) + \beta \cdot L_{US,w}(X_u^{(1)}, X_u^{(2)}),    (5)

where α and β are hyperparameters that control the influence of the unsupervised losses in the total loss.
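Putting the pieces together, one optimization step of Eq (5) could look like the sketch below, reusing the helper functions sketched above. The way mini-batches are split per domain here is an assumption for illustration, not a detail specified by the paper.

```python
def training_step(model, labeled, unlabeled, label_embeds, optimizer,
                  alpha=1.0, beta=1.0):
    """One update of the total loss in Eq (5).

    labeled:      [(images_1, masks_1), (images_2, masks_2)] per-domain batches.
    unlabeled:    [images_u1, images_u2] unlabeled per-domain batches.
    label_embeds: [C_1 x d, C_2 x d] fixed label-embedding matrices.
    """
    l_sup = sum(supervised_loss(model, imgs, masks, k)
                for k, (imgs, masks) in enumerate(labeled))
    # pixel embeddings of the unlabeled images from both domains
    emb = [model(x, k)[1] for k, x in enumerate(unlabeled)]
    v11 = similarity_scores(emb[0], label_embeds[0])
    v22 = similarity_scores(emb[1], label_embeds[1])
    v12 = similarity_scores(emb[0], label_embeds[1])
    v21 = similarity_scores(emb[1], label_embeds[0])
    l_within, l_cross = unsupervised_losses(v11, v22, v12, v21)
    loss = l_sup + alpha * l_cross + beta * l_within       # Eq (5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```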

Multiple Centroids  Many labels in a segmentation dataset often appear in more than one visual form or modality. For example, the road class can appear as dry road, wet road, shady road etc., and a class labeled as building can come in different structures and sizes. To better capture the multiple modalities involved in the visual information of a label, we propose using multiple embeddings for each label instead of a single mean centroid. This is analogous to polysemy in vocabulary, where many words can have multiple meanings and occur in different contexts, and context specific word vector representations are used to capture this behavior. To calculate the multiple label embeddings, we perform K-means clustering of the pixel level encoder feature representations calculated from networks pretrained on the limited supervised data, calculate similarity scores with all of the resulting label embeddings, and include the results in the experiments.
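As an illustrative sketch of this step (the exact clustering setup is not specified in the paper), the K centroids per label can be obtained with an off-the-shelf K-means on the pixel-level features of the pretrained network; the fallback for rare classes is our own choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def multimodal_label_embeddings(pixel_features, pixel_labels, num_classes, k=3):
    """K centroids per label from pixel-level encoder features.

    pixel_features: N x d array of (projected) encoder outputs on labeled data.
    pixel_labels:   N array of ground-truth class ids for the same pixels.
    Returns a dict mapping class id -> array of up to k centroids of size d.
    """
    centroids = {}
    for c in range(num_classes):
        feats_c = pixel_features[pixel_labels == c]
        if len(feats_c) == 0:
            continue                                   # class absent in the labeled set
        if len(feats_c) < k:
            centroids[c] = feats_c.mean(axis=0, keepdims=True)   # rare class: single mean
        else:
            centroids[c] = KMeans(n_clusters=k, n_init=10).fit(feats_c).cluster_centers_
    return centroids
```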

Inference  For a query image q^(k) from dataset k at test time, the output of the k-th decoder y^(k) = G_k(F(q^(k))) is used to obtain the segmentation map over the label set Y_k, which gives us the pixel wise label predictions. Although we calculate feature and label embeddings in our method, and metric based inference schemes like nearest neighbor search might enable prediction in a label set agnostic manner, calculating pixel wise nearest neighbor predictions can prove very slow and costly for high resolution images.

4. Results and Discussion

Datasets  Cityscapes [11] is a standard autonomous driving dataset consisting of 2975 training images collected from various cities across Europe, finely annotated with 19 classes. The CamVid [4] dataset contains 367 training and 233 testing images from England taken from video sequences, finely labeled with 32 classes, although we use the more popular 11 class version from [3]. We also demonstrate results on the IDD [25, 54] dataset, which is an in-the-wild dataset for autonomous navigation in unconstrained environments. It consists of 6993 training and 981 validation images finely annotated with 26 classes, collected from 182 drive sequences on Indian roads taken in highly varying weather and environment conditions. This is a challenging driving dataset since it contains images taken from largely unstructured environments.

While autonomous driving datasets typically offer many challenges, there is still limited variation with respect to the classes, object orientation or camera angles. Therefore, we also use the SUN RGB-D [50] dataset for indoor segmentation, which contains 5285 training images along with 475 validation images finely annotated with 37 labels consisting of regular household objects like chair, table, desk, pillow etc. We use only the RGB information for our universal training and ignore the depth information provided.

Cityscapes:

Method | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic Light | Traffic Sign | Vegetation | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | Motorcycle | Bicycle | mIoU
Supervised Only | 90.53 | 51.03 | 77.48 | 12.63 | 11.35 | 26.56 | 14.14 | 35.32 | 79.2 | 33.47 | 78.13 | 47.09 | 12.86 | 75.98 | 25.94 | 6.0 | 0.0 | 16.36 | 40.07 | 38.64
Univ-basic | 90.23 | 53.72 | 79.52 | 15.54 | 10.79 | 30.64 | 16.24 | 40.03 | 81.25 | 37.64 | 82.91 | 50.9 | 19.33 | 76.05 | 20.89 | 3.79 | 0.0 | 19.54 | 42.04 | 40.58
Univ-full | 90.68 | 54.49 | 81.69 | 9.27 | 26.38 | 36.01 | 23.96 | 41.54 | 83.45 | 41.58 | 81.6 | 55.86 | 28.92 | 75.31 | 21.79 | 18.28 | 0.0 | 14.89 | 52.49 | 44.12
Universal (K=3) | 92.66 | 58.72 | 80.62 | 30.14 | 18.4 | 32.98 | 19.41 | 35.95 | 83.32 | 31.44 | 80.55 | 51.47 | 20.88 | 81.31 | 12.88 | 27.73 | 16.21 | 8.52 | 48.4 | 43.77

CamVid:

Method | Sky | Building | Pole | Road | Pavement | Tree | Sign | Fence | Car | Pedestrian | Bicyclist | mIoU
Only CamVid | 65.79 | 71.69 | 5.25 | 81.27 | 53.55 | 78.29 | 7.82 | 12.73 | 57.86 | 27.49 | 13.03 | 43.16
Univ-basic | 91.62 | 82.39 | 18.54 | 74.92 | 48.94 | 81.65 | 18.61 | 10.61 | 77.61 | 26.25 | 35.56 | 51.52
Univ-full | 90.49 | 77.23 | 24.19 | 71.37 | 72.01 | 80.88 | 33.22 | 5.09 | 78.38 | 40.54 | 20.52 | 53.99
Univ-full (K=3) | 90.93 | 80.55 | 21.33 | 67.76 | 77.69 | 83.8 | 18.4 | 19.62 | 72.65 | 5.93 | 62.42 | 54.64

Table 1: Class wise IoU values for the 19 classes in the Cityscapes dataset and the 11 classes in the CamVid dataset with various ablations of universal semantic segmentation models.

Method | CS (N=50) | CVD (N=50) | CS (N=100) | CVD (N=100) | CS (N=300) | CVD (N=300)
Supervised Only | 34.42 | 39.38 | 38.64 | 43.16 | 47.89 | 48.52
Finetune | 33.28 | 49.06 | 37.05 | 53.60 | 47.44 | 57.81
Univ-basic | 35.47 | 50.23 | 40.58 | 51.52 | 49.04 | 64.74
Univ-cross | 32.54 | 46.20 | 38.33 | 52.49 | 47.83 | 63.39
Univ-full | 37.18 | 51.76 | 44.12 | 53.99 | 49.41 | 61.04
Univ-full (K=3) | 39.3 | 51.31 | 43.77 | 54.64 | 53.07 | 63.67

Table 2: mIoU values for universal segmentation using the Cityscapes (CS) [11] and CamVid (CVD) [4] datasets. N is the number of supervised examples available from each dataset and K is the number of cluster centroids for each label.

Training Details  Although the proposed framework is readily applicable to any state-of-the-art encoder-decoder semantic segmentation framework, we use the openly available PyTorch implementation of dilated residual networks [59] with a ResNet-18 backbone, owing to its low latency in autonomous driving applications. We train every model on 2 Nvidia GeForce GTX 1080 GPUs with a uniform crop size of 512 × 512 across datasets and a batch size of 10. We employ the SGD learning algorithm with an initial learning rate of 0.001 and a momentum of 0.9, along with a poly learning rate schedule [8] with a power of 0.9. We take the embedding dimension d to be 128, and use the dot product for the pixel level similarity metric φ(.), as it can be implemented as a 1 × 1 convolution in most modern deep learning packages. We use 100 labeled examples from each dataset as the standard setting for the experiments.
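A minimal sketch of this optimizer setup; the total iteration count passed to the poly schedule is an assumption, since the paper does not state it.

```python
import torch

def build_optimizer(model, base_lr=1e-3, momentum=0.9, max_iters=40000, power=0.9):
    """SGD with the poly schedule lr = base_lr * (1 - t / max_iters) ** power."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=momentum)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda t: (1.0 - t / max_iters) ** power)
    return optimizer, scheduler
```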

Evaluation Metric  We use the mean IoU (Intersection over Union) as the performance analysis metric. The IoU for each class is given by

IoU = \frac{TP}{TP + FP + FN},    (6)

where TP, FP and FN are the true positive, false positive and false negative pixels respectively, and mIoU is the mean of the IoUs over all the classes. mIoUs are calculated separately for all datasets in a universal model, and the models are compared against each other using the sum of mIoUs as the metric. All the mIoU values reported are on the publicly available validation sets of the respective datasets.
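For reference, a common way to compute Eq (6) over a whole validation set is to accumulate a confusion matrix and average the per-class IoUs. This is a standard sketch of that computation, not code from the paper; ignoring classes absent from both ground truth and prediction is our assumption.

```python
import numpy as np

def mean_iou(conf_matrix):
    """mIoU from a C x C confusion matrix (rows: ground truth, cols: prediction)."""
    tp = np.diag(conf_matrix).astype(np.float64)
    fp = conf_matrix.sum(axis=0) - tp
    fn = conf_matrix.sum(axis=1) - tp
    denom = tp + fp + fn
    # classes absent from both ground truth and prediction are ignored in the mean
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou))
```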

Ablation Study  We perform the following ablation studies in our experiments to provide insights into the various components of the proposed objective function. (i) Supervised only: training two different segmentation models from scratch on the limited available supervised data from each dataset. (ii) Finetune: we also compare against the fine tuning strategy commonly used when there is a limited amount of training data; the network trained on the limited labeled data from one dataset is fine-tuned on the other labeled dataset after changing the final softmax classification layer of the decoder. (iii) Univ-basic: to study the effect of the unsupervised losses, we set α = β = 0 and perform training using only the supervised loss term from Eq (1), with no entropy module at all. (iv) Univ-cross: to study the effect of the cross dataset loss term from Eq (3), we conduct experiments with β = 0, using only the cross dataset unsupervised loss. (v) Univ-full: the proposed model, including all the supervised and unsupervised loss terms, with α = β = 1 in the loss function of Eq (5). Note that (i) and (ii) need to be trained separately for each domain, while our method makes it possible to universally deploy a single model across different domains. For reference, the mIoU when using the full dataset as labeled data is 63.23% on Cityscapes, 49.46% on CamVid, 64.54% on IDD and 28.11% on SUN RGB-D.

4.1. Driving Datasets

Cityscapes + CamVid.  The results of jointly training a universal model on the Cityscapes and CamVid datasets are given in Table 2. For our standard setting of N=100, which corresponds to using 100 labeled examples from each domain, the proposed method gives an mIoU value of 44.12% on Cityscapes and 53.99% on CamVid, clearly outperforming the supervised only training method as well as the fine tuning based method, with the added advantage of deploying only a single joint model in both domains. Moreover, the universal segmentation method using the proposed unsupervised losses also performs better than using only supervised losses, which demonstrates the advantage of having unsupervised entropy regularization in domains with few labeled data and lots of unlabeled data.

Method | IDD (N=100) | CS (N=100) | IDD (N=300) | CS (N=300)
Supervised Only | 30.83 | 38.64 | 37.81 | 47.89
Finetune | 31.95 | 40.46 | 37.63 | 47.69
Univ-basic | 31.64 | 43.48 | 37.37 | 49.53
Univ-cross | 28.54 | 37.72 | 43.17 | 54.24
Univ-full | 32.23 | 43.81 | 39.75 | 53.33

Table 3: Universal segmentation results using the IDD and CS datasets with K = 1.

In the case of semantic segmentation datasets, very low values of N offer challenges like limited representation for many of the smaller labels, but we notice that the proposed model for N=50 still manages to perform consistently better on both datasets, and the advantage is more pronounced in the case of Cityscapes, which has many more unlabeled examples compared to CamVid.

For N=300, we observe that our full universal segmentation model improves substantially over all the baselines on Cityscapes, with an mIoU of 49.41%, by making use of the many available unlabeled examples. But for CamVid, with a total dataset size of 367, this increase in the number of labeled examples to N=300 means sufficient supervision, and the univ-basic method gives better results, suggesting that the proposed universal segmentation model is particularly useful with few labeled data.

From Table 2, we also observe that using multiple centroids helps in all the experimental settings irrespective of the number of labeled examples available, as the sum of mIoUs is highest when the models are trained using 3 centroids per label (K=3). This is because many of the labels occur in multiple modalities, and more than one label embedding usually captures this variance better.

A comparison of class wise IoUs against other methods for N=100 is given in Table 1. The entropy based models perform better than the baselines on most of the labels, and we also observe that a few categories which show diversity in visual appearance, like wall, car and train in Cityscapes and pavement and tree in CamVid, benefit more from having more than one representation per label.

IDD + Cityscapes  The results for universal semantic segmentation using IDD and Cityscapes are shown in Table 3. This combination is a good candidate for validating the universal segmentation approach, as the images come from widely dissimilar domains in terms of geography, weather conditions as well as traffic setup, and capture the wide variety of road scenes one might encounter in autonomous driving. Using 100 training examples from each domain, the proposed univ-full model gives an mIoU of 32.23% on IDD and 43.81% on Cityscapes, performing better than the univ-basic and univ-cross methods as well as the traditional fine tuning based methods. If we increase the number of supervised examples from 100 to 300, univ-cross performs better than the univ-full model, which implies that the within dataset unsupervised loss has a greater effect with few labeled data in the case of the IDD + Cityscapes datasets.

Method | CS | SUN
Supervised Only | 38.64 | 8.22
Finetune | 38.24 | 8.71
Univ-basic | 33.77 | 6.87
Univ-full | 40.15 | 8.43

Table 4: mIoU values for universal segmentation across different task datasets. While Cityscapes is an autonomous driving dataset, the SUN dataset is used for indoor segmentation. This demonstrates the effectiveness of universal segmentation even across diverse tasks.

4.2. Cross Domain Experiment

A useful advantage of the universal segmentation model is its ability to perform knowledge transfer between datasets used in completely different settings, due to its ability to effectively exploit useful visual relationships. We demonstrate this effect in the case of joint training between Cityscapes, which is a driving dataset with road scenes used for autonomous navigation, and SUN RGB-D, which is an indoor segmentation dataset with household objects used for high-level scene understanding.

The 19 labels in Cityscapes and the 37 labels in the SUN dataset are completely different (non overlapping), so simple joint training techniques generally give poor results. However, from Table 4, our universal segmentation model with unsupervised entropy regularization losses shows a relative improvement of 18.8% in mIoU on the Cityscapes dataset and 22% in mIoU on the SUN RGB-D dataset compared to the simple joint training based method, demonstrating the effectiveness of having a label alignment module in addition to shared lower level representations for cross domain tasks.

4.3. Feature Visualization

A more intuitive understanding of the feature alignment performed by our universal model is obtained from the tSNE embeddings [33] of the visual features. The pixel wise output of the encoder module is used to plot the tSNE of selected labels in Figure 4. For the universal training between CS and CVD in Figures 4a and 4b, we can observe that large classes like Building and Sky from both datasets align with each other better when trained using the universal segmentation objective. For the universal training between CS and SUN in Figures 4c and 4d, labels with similar visual attributes such as Road and Floor align close to each other in spite of the label sets themselves being completely non overlapping.

Figure 4: tSNE visualizations of the encoder output representations for majority classes from the CS, CVD and SUN datasets. Plots (a) and (b) are for the Univ-basic and Univ-full models trained on CS-CVD. Observe that the feature embeddings for large classes like CS:Building-CVD:Building, CS:SideWalk-CVD:Pavement and CS:Sky-CVD:Sky align a lot better with the universal model. Plots (c) and (d) are for the Univ-basic and Univ-full models trained on CS-SUN, and labels with similar visual features like CS:Road-SUN:Floor show better feature alignment. Best viewed in color and zoom.

Figure 5: Qualitative examples from each dataset, trained with and without the proposed entropy regularization module: (a) original image, (b) without entropy module, (c) with entropy module, (d) ground truth segmentation. The first row shows an example from the IDD dataset, the second row from Cityscapes and the third from the CamVid dataset.

5. Conclusion

In this work, we demonstrate a simple and effective way to perform universal semi-supervised semantic segmentation. We train a joint model using the few labeled examples and large amounts of unlabeled examples from each domain via an entropy regularization based semantic transfer objective. We show this approach to be useful in better aligning the visual features corresponding to different domains. We demonstrate superior performance of the proposed approach when compared to supervised training or fine tuning based methods over a wide variety of segmentation datasets with varying degrees of label overlap. We hope that our work addresses the growing concern in the deep learning community over the difficulty of collecting large numbers of labeled examples for dense prediction tasks such as semantic segmentation. In the future, we aim to extend the present method to other problems in computer vision such as object detection or instance aware segmentation.


References

[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 819–826, June 2013.
[2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
[4] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.
[5] Z. Cao, L. Ma, M. Long, and J. Wang. Partial adversarial domain adaptation. In The European Conference on Computer Vision (ECCV), September 2018.
[6] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulo. AutoDIAL: Automatic domain alignment layers. In International Conference on Computer Vision (ICCV), 2017.
[7] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
[9] Y.-H. Chen, W.-Y. Chen, Y.-T. Chen, B.-C. Tsai, Y.-C. F. Wang, and M. Sun. No more discrimination: Cross city adaptation of road scene segmenters. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2011–2020. IEEE, 2017.
[10] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
[11] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[13] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR abs/1310.1531, 2013.
[14] N. Dong and E. P. Xing. Few-shot semantic segmentation with prototype learning. In BMVC, 2018.
[15] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013.
[16] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, pages 1180–1189, 2015.
[17] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[18] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-Gonzalez, and J. Garcia-Rodriguez. A survey on deep learning techniques for image and video semantic segmentation. Applied Soft Computing, 70:41–65, 2018.
[19] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[20] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pages 529–536, 2005.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[23] J. Hoffman, D. Wang, F. Yu, and T. Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
[24] J. Hu, J. Lu, and Y.-P. Tan. Deep transfer metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 325–333, 2015.
[25] C. Jawahar, A. Subramanian, A. Namboodiri, M. Chandraker, and S. Ramalingam. AutoNUE workshop and challenge at ECCV'18. http://cvit.iiit.ac.in/scene-understanding-challenge-2018/.
[26] L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit. One model to learn them all. arXiv preprint arXiv:1706.05137, 2017.
[27] I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, volume 2, page 8, 2017.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[29] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, March 2014.
[30] X. Liang, H. Zhou, and E. Xing. Dynamic-structured semantic propagation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 752–761, 2018.
[31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[32] Z. Luo, Y. Zou, J. Hoffman, and L. F. Fei-Fei. Label efficient learning of transferable representations across domains and tasks. In Advances in Neural Information Processing Systems, pages 165–177, 2017.
[33] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[34] B. McCann, N. S. Keskar, C. Xiong, and R. Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
[35] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[36] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4500–4509, 2018.
[37] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
[38] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. International Conference on Learning Representations (ICLR), 2014.
[39] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724, 2014.
[40] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[41] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[42] E. Romera, J. M. Álvarez, L. M. Bergasa, and R. Arroyo. ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1):263–272, Jan 2018.
[43] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[45] S. Saha, G. Varma, and C. Jawahar. Class2Str: End to end latent hierarchy learning. In International Conference on Pattern Recognition (ICPR), 2018.
[46] M. L. Seltzer and J. Droppo. Multi-task learning in deep neural networks for improved phoneme recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6965–6969. IEEE, 2013.
[47] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[48] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[49] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
[50] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 567–576. IEEE, 2015.
[51] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. arXiv preprint arXiv:1802.10349, 2018.
[52] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068–4076, 2015.
[53] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[54] G. Varma, A. Subramanian, A. Namboodiri, M. Chandraker, and C. Jawahar. IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019.
[55] K. Weiss, T. M. Khoshgoftaar, and D. Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.
[56] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2018.
[57] S. Xie, Z. Zheng, L. Chen, and C. Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, pages 5419–5428, 2018.
[58] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[59] F. Yu, V. Koltun, and T. A. Funkhouser. Dilated residual networks. In CVPR, volume 2, page 3, 2017.
[60] A. R. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.
[61] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.


Method | CS (N=50) | CVD (N=50) | CS (N=100) | CVD (N=100) | CS (N=300) | CVD (N=300)
Univ-full, θ = 0 | 29.52 | 34.24 | 37.84 | 35.52 | 40.01 | 58.14
Univ-full, θ = 0.5 | 32.43 | 48.69 | 28.88 | 33.51 | 41.89 | 52.46
Univ-full, θ = 1 | 37.18 | 51.76 | 44.12 | 53.99 | 49.41 | 61.04

Table 5: mIoU values for various values of θ for universal training between CS and CVD.

A. Calculating the label embeddings

In this section, we describe the method used to obtain the vector representations for the labels. For each dataset separately, we train an end-to-end segmentation network from scratch using only the limited training data available in that dataset. We use this trained segmentation network to calculate the encoder outputs of the training data at each pixel. Typically, the output dimension of the encoder at each pixel (512 for a ResNet-18 or ResNet-34 backbone) is not equal to the dimension of the label embeddings (d = 128 in our case). So we first apply a dimensionality reduction technique like PCA to reduce the dimension of the outputs to match the label embedding dimension d, and then upsample this output to match the resolution of the ground truth segmentation map. We then calculate the class wise centroids to obtain the label embeddings.
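A rough sketch of this procedure, under the simplifying assumption that the ground truth labels have already been brought to the feature resolution (the paper instead upsamples the reduced features to label resolution):

```python
import numpy as np
from sklearn.decomposition import PCA

def initial_label_embeddings(pixel_features, pixel_labels, num_classes, d=128):
    """Class-wise centroids of PCA-reduced encoder features (Appendix A).

    pixel_features: N x 512 array of per-pixel encoder outputs on labeled data.
    pixel_labels:   N array of class ids aligned with those pixels.
    Returns a num_classes x d array of label embeddings.
    """
    reduced = PCA(n_components=d).fit_transform(pixel_features)   # N x d
    embeddings = np.zeros((num_classes, d))
    for c in range(num_classes):
        feats_c = reduced[pixel_labels == c]
        if len(feats_c):
            embeddings[c] = feats_c.mean(axis=0)
    return embeddings
```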

B. Updating the label embeddings

In our original experiments, we fixed the pretrained label embeddings over the course of training the universal model. Here, we present a method to jointly train the segmentation model as well as the label embeddings. We initialize the embeddings with the values computed from the pretrained networks, and make use of the following exponentially weighted average rule to update the centroids at the t-th time step:

c_t^{(k)} = \theta \cdot c_{t-1}^{(k)} + (1 - \theta) \cdot \mu_L\left(F_t(x_u^{(i)})\right).    (7)

In Eq (7), c_{t-1}^(k) denotes the centroids at the (t-1)-th time step, F_t is the state of the encoder module at the t-th time step and μ_L calculates the class wise centroids. A value of θ = 1 implies that the centroids are not updated from their initial state, and a value of θ = 0 means that the centroids are calculated afresh at each update. We make an update to the centroids after every epoch of the original training data.
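A sketch of the update rule of Eq (7). Taking an explicit per-pixel class assignment as input is our assumption, since μ_L requires some class assignment for the unlabeled pixels (for example from the decoder predictions), which the paper does not detail.

```python
def update_centroids(centroids, features, labels, theta=0.5):
    """Exponentially weighted centroid update of Eq (7).

    centroids: C x d tensor of current label embeddings c_{t-1}.
    features:  N x d tensor of pixel embeddings from the current encoder F_t.
    labels:    N tensor of class assignments for those pixels (however obtained).
    theta = 1 keeps the centroids fixed; theta = 0 recomputes them from scratch.
    """
    new_centroids = centroids.clone()
    for c in range(centroids.shape[0]):
        feats_c = features[labels == c]
        if len(feats_c):
            new_centroids[c] = theta * centroids[c] + (1 - theta) * feats_c.mean(dim=0)
    return new_centroids
```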

However, we can observe from Table 5 that there is a drop in the mIoU values for both the Cityscapes and CamVid datasets when θ = 0 or θ = 0.5 compared to θ = 1. The reason for this could be that the limited training data available is not sufficient to jointly train the segmentation network as well as update the centroids. This can also be observed from the fact that an increase in the number of labeled examples presented to the network results in an increase in accuracy when jointly updating the embeddings.

Method | IDD (N=100) | CS (N=100) | IDD (N=300) | CS (N=300)
Univ-full (θ = 0) | 27.48 | 35.86 | 31.46 | 43.3
Univ-full (θ = 1) | 32.23 | 43.81 | 39.75 | 53.33

Table 6: Universal segmentation results using trainable centroids on the IDD and CS datasets.

Method | CS | SUN
Univ-full (θ = 0) | 34.39 | 7.95
Univ-full (θ = 1) | 40.15 | 8.43

Table 7: mIoU values for universal training between the CS and SUN datasets with trainable centroids.

Similar observations can be made for universal training between the IDD and CS datasets in Table 6, and for the cross task experiment between the CS and SUN datasets in Table 7.

C. Qualitative examples

From the qualitative examples presented in Figure 6 and Figure 7, we can observe that the entropy module is particularly useful for better segmentation of finer objects and foreground classes.


Figure 6: Qualitative examples from the Cityscapes dataset with and without the proposed entropy regularization module, when trained as a universal model on CS+CamVid using 100 labeled images: (a) original image, (b) without entropy module, (c) with entropy module, (d) ground truth segmentation.

Figure 7: Qualitative examples from the IDD dataset with and without the proposed entropy regularization module, when trained as a universal model on IDD+CS using 300 labeled images: (a) original image, (b) without entropy module, (c) with entropy module, (d) ground truth segmentation.