
One-Shot Weakly Supervised Video Object Segmentation

Mennatullah Siam∗
University of Alberta, [email protected]

Naren Doraiswamy∗
Indian Institute of Science (IISc), [email protected]

Boris N. Oreshkin∗
ElementAI, [email protected]

Hengshuai Yao
HiSilicon, Huawei Research, [email protected]

Martin Jagersand
University of Alberta, [email protected]

Abstract

Conventional few-shot object segmentation methods learn object segmentation from a few labelled support images with strongly labelled segmentation masks. Recent work has shown that weaker levels of supervision, in the form of scribbles and bounding boxes, can perform on par. However, limited attention has been given to the problem of few-shot object segmentation with image-level supervision. We propose a novel multi-modal interaction module for few-shot object segmentation that utilizes a co-attention mechanism using both visual and word embeddings. It enables our model to achieve a 5.1% improvement over previously proposed image-level few-shot object segmentation. Our method compares relatively closely to state-of-the-art methods that use strong supervision, while ours uses the least possible supervision. We further propose a novel setup for few-shot weakly supervised video object segmentation (VOS) that relies on image-level labels for the first frame. The proposed setup uses weak annotations, unlike the semi-supervised VOS setting that utilizes strongly labelled segmentation masks. The setup evaluates the effectiveness of generalizing to novel classes in the VOS setting by splitting the VOS data into multiple folds with different categories per fold. It provides a potential setup to evaluate how few-shot object segmentation methods can benefit from additional object poses or object interactions that are not available in static-frame benchmarks such as PASCAL-5i.

1 Introduction

The few-shot learning literature has mainly focused on the classification task, e.g., Koch et al. (2015); Vinyals et al. (2016); Snell et al. (2017); Qi et al. (2017); Finn et al. (2017); Ravi & Larochelle (2016); Sung et al. (2018); Qiao et al. (2018). Recently, solutions for few-shot object segmentation, which learn pixel-wise classification from a few labelled samples, have emerged (Shaban et al. (2017); Rakelly et al. (2018); Dong & Xing (2018); Wang et al. (2019); Zhang et al. (2019a); Zhang et al. (2019b); Siam et al. (2019)). Previous literature in few-shot object segmentation relied on manually labelled segmentation masks. A few recent works (Rakelly et al. (2018); Zhang et al. (2019b); Wang et al. (2019)) started to conduct experiments using weak annotations in the form of scribbles and bounding boxes. However, limited research has been conducted on using image-level supervision for few-shot object segmentation, with a single recent work Raza et al. (2019).

∗equally contributing

Preprint. Under review.


In order to improve image-level few-shot object segmentation, we propose a multi-modal interaction module that leverages the interaction between support and query visual features and word embeddings. The use of neural word embeddings pretrained on GoogleNews Mikolov et al. (2013) and visual embeddings from networks pretrained on ImageNet Deng et al. (2009) allows us to build such a model. We propose a novel approach that uses stacked co-attention to leverage the interaction among visual and word embeddings. We are mainly inspired by Lu et al. (2019), which proposed a co-attention siamese network for unsupervised video object segmentation. However, our setup focuses on the few-shot object segmentation aspect, assessing the ability to generalize to novel classes. This motivates why we meta-learn a multi-modal interaction module that leverages the interaction between support and query with image-level supervision.

Concurrently with few-shot object segmentation, video object segmentation (VOS) has been extensively researched (Pont-Tuset et al. (2017); Perazzi et al. (2016); Lu et al. (2019); Tokmakov et al. (2016); Jain et al. (2017); Luiten et al. (2018)). Two main categories of video object segmentation are studied: semi-supervised VOS and unsupervised VOS. Semi-supervised VOS, also referred to as few-shot VOS, requires an initial strongly labelled segmentation mask to be provided. In a recent work, Khoreva et al. (2018) studied video object segmentation using language expressions. However, their work does not focus on the aspect of segmenting novel categories, which we try to address since it can be of potential benefit to the few-shot learning community. The additional temporal information in VOS can provide extra unlabelled data about object poses and interactions with other objects to the few-shot object segmentation method. It has the potential to be closer to human-like learning than learning from a single frame. To the best of our knowledge, we are the first to propose a setup for video object segmentation that focuses on the ability to generalize to novel classes by splitting the categories into different folds, which better assesses the generalization ability. The setup also only requires image-level labels for the first frame to segment the corresponding objects, unlike conventional semi-supervised VOS. Although Youtube-VOS Xu et al. (2018) provides a way to evaluate on certain unseen categories, it does not utilize any of the category label information in the segmentation model. Moreover, in order to ensure that the evaluation of a few-shot method is not biased toward certain categories, it is best to split the data into multiple folds and evaluate on each in turn. To this end, we use the Youtube-VOS dataset Xu et al. (2018), a large-scale VOS dataset with 65 different object categories; the originally provided training data is further split into 5 folds with 13 novel classes per fold. The novel setup opens the door towards studying how few-shot object segmentation can benefit from temporal information about different object poses and interactions. The main contributions of this paper are summarized as follows:

• A novel formulation for learning image-level labelled few-shot segmentation based on a multi-modal interaction module is presented. It utilizes neural word embeddings and attention mechanisms that relate the support and query images, enabling attention to the most relevant spatial locations in the query feature maps.

• We propose a novel setup called few-shot weakly supervised video object segmentation thatcan be of potential benefit to few-shot object segmentation.

2 Method

2.1 Stacked Co-Attention

A simple conditioning on the support set image is insufficient when neither sparse nor dense annotations are provided. In order to leverage the interaction between the support set images and the query image, a conditioned support-to-query attention module is used to learn the correlation between them. Initially, a base network is used to extract features from the support set image $I_s$ and the query image $I_q$, which we denote as $V_s \in \mathbb{R}^{W \times H \times C}$ and $V_q \in \mathbb{R}^{W \times H \times C}$, where $H$ and $W$ denote the feature map height and width respectively, and $C$ denotes the number of feature channels.
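As a rough illustration of this step, the sketch below extracts support and query features with a single shared, ImageNet-pretrained ResNet-50 backbone; the choice of truncating the backbone after the third residual stage and the 224×224 input size are our assumptions rather than details stated here.

```python
import torch
import torchvision

# Shared feature extractor: one ImageNet-pretrained ResNet-50 applied to both
# the support image I_s and the query image I_q (a sketch; the truncation
# point and input resolution are assumptions).
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")  # older torchvision: pretrained=True
encoder = torch.nn.Sequential(*list(backbone.children())[:-3])   # conv1 ... layer3

i_s = torch.randn(1, 3, 224, 224)   # support image I_s (random stand-in)
i_q = torch.randn(1, 3, 224, 224)   # query image I_q
with torch.no_grad():
    v_s = encoder(i_s)              # V_s: (1, 1024, 14, 14)
    v_q = encoder(i_q)              # V_q: shared weights, same encoder
print(v_s.shape, v_q.shape)
```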

It is important to initially capture the class label's semantic representation using semantic word embeddings Mikolov et al. (2013) before performing the co-attention. The main reason is that both the support set and the query set can contain multiple common objects from different classes, so depending solely on support-to-query attention would fail in this case. A projection layer is used on the semantic word embeddings to construct $z \in \mathbb{R}^d$, where $d$ is 256. It is then spatially tiled and concatenated with the visual features, resulting in the conditioned features $V_s$ and $V_q$.
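A minimal PyTorch sketch of this conditioning step follows: a word embedding (e.g. a 300-d GoogleNews Word2Vec vector for the class name) is projected to $d = 256$, tiled over the spatial grid, and concatenated with the visual features. The layer names, the 300-d input size, and the way the embedding is loaded are our assumptions.

```python
import torch
import torch.nn as nn

class WordConditioning(nn.Module):
    """Project the class label's word embedding to d dims, tile it spatially,
    and concatenate it with the visual feature map (a sketch, not the
    authors' released implementation)."""
    def __init__(self, emb_dim=300, d=256):
        super().__init__()
        self.project = nn.Linear(emb_dim, d)  # projection layer producing z in R^d

    def forward(self, word_emb, visual_feat):
        # word_emb: (B, emb_dim), e.g. the Word2Vec vector for 'boat'
        # visual_feat: (B, C, H, W) features from the shared encoder
        b, _, h, w = visual_feat.shape
        z = self.project(word_emb)                      # (B, d)
        z = z.view(b, -1, 1, 1).expand(-1, -1, h, w)    # spatial tiling
        return torch.cat([visual_feat, z], dim=1)       # (B, C + d, H, W)

cond = WordConditioning()
v_s = torch.randn(2, 1024, 14, 14)       # support features from the encoder
w_boat = torch.randn(2, 300)             # stand-in for a Word2Vec embedding
v_s_hat = cond(w_boat, v_s)              # conditioned support features
```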


Figure 1: Architecture of the few-shot object segmentation model with co-attention (image-level supervision). The diagram depicts a support-set and a query-set branch sharing encoder weights, a Word2Vec + projection layer for the class label (e.g. 'boat') with spatial tiling, N× stacked multi-modal interaction modules, and a decoder. The ⊕ operator denotes concatenation, ⊗ denotes element-wise multiplication, ◦ denotes matrix multiplication.

An affinity matrix $S$ is then computed to capture the similarity between them using Equation 1:

$S = V_s^{\top} W_{co} V_q$ (1)

The feature maps are flattened into matrix representations $V_q \in \mathbb{R}^{C \times WH}$ and $V_s \in \mathbb{R}^{C \times WH}$, while $W_{co} \in \mathbb{R}^{C \times C}$ learns the correlation between feature channels. We use a vanilla co-attention similar to Lu et al. (2019), where $W_{co}$ is learned using a fully connected layer. The resulting affinity matrix $S \in \mathbb{R}^{WH \times WH}$ relates each column of $V_q$ to each column of $V_s$. A softmax operation is performed on $S$ row-wise or column-wise, depending on the relation direction of interest, following Equations 2a and 2b:

$S^c = \mathrm{softmax}(S)$ (2a)

$S^r = \mathrm{softmax}(S^{\top})$ (2b)

$S^c_{*,j}$ contains the relevance of the $j$-th spatial location in $V_q$ with respect to all spatial locations of $V_s$, where $j = 1, ..., WH$. The normalized affinity matrix can be used to compute $U_q$ using Equation 3, and $U_s$ similarly; $U_q$ and $U_s$ act as the attention summaries:

$U_q = V_s S^c$ (3)
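The sketch below writes out Equations 1–3 in PyTorch: flatten the conditioned features to $C \times WH$, form the affinity matrix with a learned channel-correlation matrix $W_{co}$ (realized as a bias-free fully connected layer), normalize it in both directions, and compute the attention summaries. Variable names and the exact parameterization of $W_{co}$ are our assumptions.

```python
import torch
import torch.nn as nn

class VanillaCoAttention(nn.Module):
    """Affinity matrix, row/column softmax, and attention summaries
    (Equations 1-3); a sketch under our own naming assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.w_co = nn.Linear(channels, channels, bias=False)  # W_co in R^{CxC}

    def forward(self, v_s, v_q):
        # v_s, v_q: (B, C, H, W) visual features concatenated with tiled word features
        b, c, h, w = v_q.shape
        v_s_flat = v_s.view(b, c, h * w)                     # V_s: (B, C, WH)
        v_q_flat = v_q.view(b, c, h * w)                     # V_q: (B, C, WH)

        # Eq. 1: S = V_s^T W_co V_q  ->  (B, WH, WH)
        s = torch.bmm(self.w_co(v_s_flat.transpose(1, 2)), v_q_flat)

        # Eqs. 2a/2b: normalize over support (S^c) and query (S^r) locations
        s_c = torch.softmax(s, dim=1)
        s_r = torch.softmax(s.transpose(1, 2), dim=1)

        # Eq. 3: attention summaries U_q = V_s S^c (and U_s symmetrically)
        u_q = torch.bmm(v_s_flat, s_c).view(b, c, h, w)
        u_s = torch.bmm(v_q_flat, s_r).view(b, c, h, w)
        return u_q, u_s
```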

The attention summaries are further gated using a gating function $f_g$ with learnable weights $W_g$ and bias $b_g$, following Equations 4a and 4b. The gating function uses a sigmoid activation $\sigma$ to ensure that the output lies in the interval [0, 1], which is then used to mask the attention summaries. The $\circ$ operator denotes the Hadamard (element-wise) product.

$f_g(U_q) = \sigma(W_g U_q + b_g) \in [0, 1]$ (4a)

$U_q = f_g(U_q) \circ U_q$ (4b)

The gated attention summaries $U_q$ are concatenated with the original visual features $V_q$ and reshaped back to construct the final output of the attention module. Figure 1 illustrates our proposed method. We utilize a ResNet-50 He et al. (2016) encoder pre-trained on ImageNet Deng et al. (2009) to extract visual features. The segmentation decoder is comprised of an iterative optimization module (IOM) Zhang et al. (2019b) and an atrous spatial pyramid pooling (ASPP) module Chen et al. (2017a); Chen et al. (2017b). In order to improve our model, we use two stacked co-attention modules. This allows the model to learn a better representation by letting the support set guide the attention on the query image, and vice versa, through multiple iterations.
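Continuing the sketch, the gating of Equations 4a/4b and the final concatenation with the original features could look as follows; using a 1×1 convolution for $(W_g, b_g)$ is an assumption on our part, as is the way the two stacked modules are chained.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gate the attention summary (Eqs. 4a/4b) and concatenate it with the
    original query features; a sketch, not the authors' exact layers."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)  # W_g, b_g as a 1x1 conv

    def forward(self, u_q, v_q):
        g = torch.sigmoid(self.gate(u_q))      # f_g(U_q) in [0, 1]   (Eq. 4a)
        u_q = g * u_q                          # Hadamard masking     (Eq. 4b)
        return torch.cat([u_q, v_q], dim=1)    # fused features for the decoder

# Chaining two co-attention + gating blocks (the "stacked" variant) would feed
# the fused output, reduced back to C channels, into a second VanillaCoAttention.
```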

2.2 Few-Shot Weakly Supervised Video Object Segmentation

We propose a novel setup for few-shot video object segmentation in which we utilize an image-level labelled first frame to learn object segmentation in the sequence, rather than using manual segmentation masks. More importantly, the setup is devised so that the categories are split into multiple folds, in order to assess the generalization ability to novel classes. We utilize the Youtube-VOS training data, which has 65 categories, and split it into 5 folds. Each fold has 13 classes that are used as novel classes, while the rest are used in the meta-training phase. A randomly sampled category $Y_s$ and sequence $V = \{I_1, I_2, ..., I_N\}$ are used to construct the support set $S_p = \{(I_1, Y_s)\}$ and the query images $I_i$.
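To make the setup concrete, the following sketch builds the 5 folds of 13 novel classes each and samples one weakly supervised episode (a support pair carrying only an image-level label, with the remaining frames as queries). The `videos_by_class` index mapping a category name to lists of frame paths is a hypothetical structure we assume, not part of the official Youtube-VOS tooling.

```python
import random

def make_folds(categories, num_folds=5, fold=0):
    """Split the 65 Youtube-VOS categories into 5 folds of 13 classes each;
    the chosen fold is held out as novel classes for meta-testing."""
    per_fold = len(categories) // num_folds          # 13 when len(categories) == 65
    novel = categories[fold * per_fold:(fold + 1) * per_fold]
    base = [c for c in categories if c not in novel]
    return base, novel

def sample_episode(videos_by_class, classes):
    """Sample a 1-way episode: category Y_s, sequence {I_1, ..., I_N},
    support set S_p = {(I_1, Y_s)} with an image-level label only, and the
    remaining frames as query images."""
    y_s = random.choice(classes)
    frames = random.choice(videos_by_class[y_s])     # list of frame paths I_1..I_N
    support = (frames[0], y_s)                        # first frame + class name only
    queries = frames[1:]
    return support, queries
```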


Table 1: Quantitative results for 1-way, 1-shot segmentation on the PASCAL-5i dataset, showing mean-IoU Shaban et al. (2017) and binary-IoU Rakelly et al. (2018); Dong & Xing (2018). W: stands for using weak supervision from image-level labels.

Method                  W   fold 0   fold 1   fold 2   fold 3   mean-IoU   binary-IoU
FG-BG                   ✗   -        -        -        -        -          55.1
Shaban et al. (2017)    ✗   33.6     55.3     40.9     33.5     40.8       -
Rakelly et al. (2018)   ✗   36.7     50.6     44.9     32.4     41.1       60.1
Dong & Xing (2018)      ✗   -        -        -        -        -          61.2
Siam et al. (2019)      ✗   41.9     50.2     46.7     34.7     43.4       62.2
Wang et al. (2019)      ✗   42.3     58.0     51.1     41.2     48.1       66.5
Zhang et al. (2019b)    ✗   52.5     65.9     51.3     51.9     55.4       66.2
Zhang et al. (2019a)    ✗   56.0     66.9     50.6     50.4     56.0       69.9
Raza et al. (2019)      ✓   -        -        -        -        -          58.7
Baseline                ✓   38.6     56.6     43.8     38.2     44.3       60.2
Ours-V1                 ✓   49.4     65.5     50.0     49.2     53.5       65.6
Ours-V2                 ✓   42.1     65.1     47.9     43.8     49.7       63.8

3 Experimental Results

In this section, we demonstrate results from experiments conducted on the PASCAL-5i dataset Shaban et al. (2017), which is the most widely used dataset for evaluating few-shot segmentation. We also conduct experiments on our novel few-shot weakly supervised video object segmentation setup.

3.1 Experimental Setup

We report two evaluation metrics. The mean-IoU computes the intersection over union for each of the 5 classes within the fold and averages them, neglecting the background Shaban et al. (2017). The binary-IoU metric proposed in Rakelly et al. (2018); Dong & Xing (2018) computes the mean of foreground and background IoU in a class-agnostic manner. Both metrics are reported as an average of 5 different runs to ensure a stable result, following Wang et al. (2019). We have also noticed some deviation in the validation schemes used in previous works. Zhang et al. (2019b) follows a procedure where validation is performed on the $L_{test}$ classes to save the best model, whereas Wang et al. (2019) does not perform validation and rather trains for a fixed number of iterations. We choose the approach followed in Wang et al. (2019), since we consider that the $L_{test}$ classes are not available during the initial meta-training phase.
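For clarity, the two metrics can be sketched as below; the exact accumulation order (per-episode averaging vs. accumulating intersections and unions over a fold) differs between papers, so treat this as an illustrative assumption rather than the evaluation code used here.

```python
import numpy as np

def binary_iou(pred, gt):
    """Class-agnostic mean of foreground and background IoU for one episode;
    pred and gt are binary masks (numpy arrays of 0/1)."""
    ious = []
    for cls in (1, 0):                                   # foreground, then background
        inter = np.logical_and(pred == cls, gt == cls).sum()
        union = np.logical_or(pred == cls, gt == cls).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

def mean_iou(inter_per_class, union_per_class):
    """mean-IoU over the 5 novel classes of a fold, ignoring background;
    per-class intersections and unions are accumulated over all test
    episodes before dividing (our assumed convention)."""
    ious = [i / u for i, u in zip(inter_per_class, union_per_class) if u > 0]
    return float(np.mean(ious))
```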

3.2 Few-shot Weakly Supervised Object Segmentation

We compare the results of our proposed method with stacked co-attention against the other state-of-the-art methods for 1-way 1-shot segmentation on PASCAL-5i in Table 1, using the mean-IoU and binary-IoU metrics. We report results for both validation schemes, where V1 follows Zhang et al. (2019b) and V2 follows the Wang et al. (2019) validation scheme. Without utilizing segmentation masks or even sparse annotations, our method, with the least supervision of image-level labels, performs relatively on par (53.5%) with the current state-of-the-art methods (56%), showing the efficacy of our proposed algorithm. It outperforms the previous one-shot weakly supervised segmentation method Raza et al. (2019) by 5.1%. Our proposed model also outperforms the baseline method, which utilizes a co-attention module without word embeddings. Figure 2 shows qualitative results for our proposed approach.

3.3 Few-shot Weakly Supervised Video Object Segmentation

Table 2 shows results on our proposed novel setup, comparing our method with the baseline of using a co-attention module without word embeddings, similar to Lu et al. (2019). It shows the potential benefit of utilizing neural word embeddings to guide the attention module.


Figure 2: Qualitative evaluation on PASCAL-5i 1-way 1-shot segmentation for (a) 'bicycle', (b) 'bottle', and (c) 'bird'. The support set and the prediction on the query image are shown in pairs.

Table 2: Quantitative results on the Youtube-VOS one-shot weakly supervised setup.

Method     fold 0   fold 1   fold 2   fold 3   fold 4   Mean
Baseline   40.1     33.7     47.1     36.4     36.6     38.8
Ours       41.6     40.8     51.4     41.5     39.1     42.9

4 Conclusions

Our proposed method demonstrates great promise toward performing few-shot object segmentation based on gated co-attention that leverages the interaction between the support set and the query image. Our model utilizes neural word embeddings to guide the attention mechanism, which improves segmentation accuracy compared to the baseline. We demonstrate promising results on PASCAL-5i, where we outperform the previously proposed image-level labelled one-shot segmentation method by 5.1% and perform close to methods that use strongly labelled masks. Our novel setup provides a means to experiment with the effect of capturing different object viewpoints and appearance changes in few-shot object segmentation. It closely mimics human learning of novel objects from few labelled samples by aggregating information from different viewpoints and capturing different object interactions.

References

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017a.

Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017b.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Nanqing Dong and Eric P. Xing. Few-shot semantic segmentation with prototype learning. In BMVC, volume 3, pp. 4, 2018.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135. JMLR.org, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Suyog Dutt Jain, Bo Xiong, and Kristen Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. arXiv preprint arXiv:1701.05384, 2017.

Anna Khoreva, Anna Rohrbach, and Bernt Schiele. Video object segmentation with language referring expressions. In Asian Conference on Computer Vision, pp. 123–141. Springer, 2018.


Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.

Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli. See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3623–3632, 2019.

Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. Premvos: Proposal-generation, refinement and merging for video object segmentation. In Asian Conference on Computer Vision, pp. 565–580. Springer, 2018.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.

F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition, 2016.

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675, 2017.

Hang Qi, Matthew Brown, and David G Lowe. Learning with imprinted weights. arXiv preprint arXiv:1712.07136, 2017.

Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238, 2018.

Kate Rakelly, Evan Shelhamer, Trevor Darrell, Alyosha Efros, and Sergey Levine. Conditional networks for few-shot semantic segmentation. 2018.

Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.

Hasnain Raza, Mahdyar Ravanbakhsh, Tassilo Klein, and Moin Nabi. Weakly supervised one shot segmentation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0, 2019.

Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410, 2017.

Mennatullah Siam, Boris N Oreshkin, and Martin Jagersand. Amp: Adaptive masked proxies for few-shot segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5249–5258, 2019.

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087, 2017.

Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208, 2018.

Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning motion patterns in videos. arXiv preprint arXiv:1612.07217, 2016.

Oriol Vinyals, Charles Blundell, Tim Lillicrap, and Daan Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.

Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9197–9206, 2019.


Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.

Chi Zhang, Guosheng Lin, Fayao Liu, Jiushuang Guo, Qingyao Wu, and Rui Yao. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9587–9595, 2019a.

Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua Shen. Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5217–5226, 2019b.
