Interactive Full Image Segmentation by Considering All Regions Jointly
Eirikur Agustsson
Google Research
Jasper R. R. Uijlings
Google Research
Vittorio Ferrari
Google Research
I) Input image with extreme points provided by annotator
II) Machine predictions from extreme points
III) Corrective scribbles provided by annotator
IV) Machine predictions from extreme points and corrective scribbles
Figure 1. Illustration of our interactive full image segmentation workflow. First (I) the annotator marks extreme points. Then (II) our
model (Sec. 3) uses them to generate a segmentation. This is presented to the annotator, after which we iterate: (III) the annotator makes
corrections using scribbles (Sec. 4), and (IV) our model uses them to update the predicted segmentation (Sec. 3).
Abstract
We address interactive full image annotation, where the
goal is to accurately segment all object and stuff regions in
an image. We propose an interactive, scribble-based an-
notation framework which operates on the whole image to
produce segmentations for all regions. This enables shar-
ing scribble corrections across regions, and allows the an-
notator to focus on the largest errors made by the machine
across the whole image. To realize this, we adapt Mask-
RCNN [22] into a fast interactive segmentation framework
and introduce an instance-aware loss measured at the pixel-
level in the full image canvas, which lets predictions for
nearby regions properly compete for space. Finally, we
compare to interactive single object segmentation on the
COCO panoptic dataset [11, 27, 34]. We demonstrate that
our interactive full image segmentation approach leads to a
5% IoU gain, reaching 90% IoU at a budget of four extreme
clicks and four corrective scribbles per region.
1. Introduction
We address the task of interactive full image segmen-
tation, where the goal is to obtain accurate segmentations
for all object and stuff regions in the image. Full im-
age annotations are important for many applications such
as self-driving cars [17, 19], navigation assistance for the
blind [51], and automatic image captioning [25, 56]. How-
ever, creating such datasets requires large amounts of hu-
man labor. For example, annotating a single image took
1.5 hours for Cityscapes [17]. For COCO+stuff [11, 34],
annotating one image took 19 minutes (80 seconds per ob-
ject [34] plus 3 minutes for stuff regions [11]), which totals
39k hours for the 123k images. So there is a clear need for
faster annotation tools.
This paper proposes an efficient interactive framework
for full image segmentation (Fig. 1 and 2). Given an image,
an annotator first marks extreme points [41] on all object
and stuff regions. These provide a tight bounding box with
four boundary points for each region, and can be efficiently
collected (7s per region [41]). Next, the machine predicts
an initial segmentation for the full image based on these
extreme points. Afterwards we present the whole image
with the predicted segmentation to the annotator and iterate
between (A) the annotator providing scribbles on the errors
of the current segmentation, and (B) the machine updating
the predicted segmentation accordingly (Fig. 1).
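For concreteness, a minimal Python sketch (ours, for illustration; not the authors' released code) of the tight box implied by four extreme clicks:

```python
def box_from_extreme_points(points):
    """Tight bounding box implied by four extreme clicks [41].

    points: four (x, y) tuples marking the left-, right-, top- and
    bottom-most pixels of a region, in any order.
    Returns (x0, y0, x1, y1) in pixel coordinates.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)
```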
Our approach of full image segmentation brings several
advantages over modern interactive single object segmenta-
tion methods [24, 30, 31, 32, 36, 37, 60]: (I) It enables the
annotator to focus on the largest errors in the whole image,
rather than on the largest error of one given object. (II) It
shares annotations across multiple object and stuff regions.
In our approach, a single scribble correction specifies the
extension of one region and the shrinkage of neighboring re-
gions (Sec. 3.2 and Fig. 3). In interactive single object seg-
mentation, corrections are used for the given target object
only. (III) Our approach lets regions compete for space in
the common image canvas, ensuring that a pixel is assigned
exactly one label (Sec. 3.1). In single object segmentation
instead, pixels along boundary regions may be assigned to
multiple objects, leading to contradictory labels, or to none,
leading to holes. At the same time, since regions compete,
corrections of one region influence nearby regions in our
framework (e.g. Fig. 6). (IV) Instead of only annotating
object instances, we also annotate stuff regions, capturing
important classes such as pavement or river.
We realize interactive full image segmenta-
tion by adapting Mask-RCNN [22] (Fig. 2). We start from
extreme points [41], which define bounding boxes. There-
fore we bypass the Region Proposal Network of Mask-
RCNN and use these boxes directly to extract Region-of-
Interest (RoI) features (Sec. 3.1). Afterwards, we incorpo-
rate extreme points and scribble annotations inside Mask-
RCNN by concatenating them to the RoI-features. We en-
code annotations in a way that allows sharing them across
regions, enabling advantage (II) above (Sec. 3.2). Finally,
while Mask-RCNN [22] predicts each mask separately, we
project the mask predictions back on the pixels in the com-
mon image canvas (Sec. 3.1). Then we define a new loss which
is instance-aware yet lets predictions properly compete for
space, enabling advantage (III) above (Sec. 3.3).
To the best of our knowledge, all deep interactive single
object segmentation methods [24, 30, 31, 32, 36, 37, 60]
are based on Fully Convolutional Networks (FCNs) [14,
35, 46]. We chose to start from Mask-RCNN [22] for ef-
ficiency. FCN-style interactive segmentation methods con-
catenate corrections to a crop of an RGB image, and pass
that through a large neural net (e.g. ResNet-101 [23]). This
requires a full inference pass for each region at each correc-
tion iteration. In our Mask-RCNN framework instead, the
RGB image is first passed through the large backbone net-
work. Afterwards, for each region only a pass over the final
segmentation head is required (Fig. 2). This is much faster
and more memory efficient (Sec. 6).
We perform thorough experiments in increasingly com-
plex settings: (1) Single object segmentation: On the
COCO dataset [34], our Mask-RCNN style architecture
achieves similar performance to DEXTR [37] on single ob-
ject segmentation from extreme points [41]. (2) Full image
segmentation: We evaluate on the COCO panoptic chal-
lenge [11, 27, 34] the task of segmenting all object and stuff
regions in an image, starting from extreme points. Our idea
to share annotations across regions in combination with our
pixel-wise loss yields a +3% IoU gain over an interactive
single region segmentation baseline. (3) Interactive full im-
age segmentation: On the COCO panoptic challenge, we
demonstrate the combined effects of our three advantages
(I)-(III) above: at a budget of four extreme clicks and four
scribbles per region, we get a total +5% IoU gain over the
interactive single region segmentation baseline.
2. Related Work
Semantic segmentation from weakly labeled data. Many
works address semantic segmentation by training from
weakly labeled data, such as image-level labels [28, 44, 58],
point-clicks [5, 6, 13, 57], boxes [26, 37, 41] and scrib-
bles [33, 59]. Boxes can be efficiently annotated using ex-
treme points [41] which can also be used as an extra sig-
nal for generating segmentations [37, 41]. This is related
to ours, as our method starts from extreme points for each region.
However, the above methods operate from annotations col-
lected before any machine processing. Our work instead is
in the interactive scenario, where the annotator iteratively
provides corrective annotations for the current machine seg-
mentation.
Interactive object segmentation. Interactive object seg-
mentation is a long-standing research topic. Most classi-
cal approaches [3, 4, 8, 47, 18, 16, 21, 38] formulate object
segmentation as energy minimization on a regular graph de-
fined over pixels, with unary potentials capturing low-level
appearance properties and pairwise or higher-order poten-
tials encouraging regular segmentation outputs.
Starting from Xu et al. [60], recent methods ad-
dress interactive object segmentation with deep neural net-
works [24, 30, 31, 32, 36, 37, 60]. These works build
on Fully Convolutional architectures such as FCNs [35] or
Deeplab [14]. They input the RGB image plus two extra
channels for object and non-object corrections, and output
a binary mask.
In [15] they perform interactive object segmentation in
video. They use Deeplab [14] to create a pixel-wise embed-
ding space. Annotator corrections are used to create a near-
est neighbor classifier on top of this embedding, enabling
quick updates of the object predictions.
Finally, Polygon-RNN [1, 12] is an interesting alterna-
tive approach. Instead of predicting a mask, they use a re-
current neural net to predict polygon vertices. Corrections
made by the annotator are used by the machine to refine its
vertex predictions.
Interactive full image segmentation. Recently, [2] pro-
posed Fluid Annotation, which also addresses the task of
full image annotation. Our work shares the spirit of fo-
cusing annotator effort on the biggest errors made by the
machine across the whole image. However, [2] uses Mask-
RCNN [22] to create a large pool of fixed segments and then
provides an efficient interface for the annotator to rapidly
select which of these should form the final segmentation. In
contrast, in our work all segments are created from the ini-
tial extreme points and are all part of the final annotation.
Our method then enables correcting the shape of segments
to precisely match object boundaries.
Figure 2: Our proposed region based model for interactive full image segmentation (see Sec. 3.1 for details). We start from Mask-
RCNN [22], but use user-provided boxes (from extreme points) instead of a box proposal network for RoI cropping, and concatenate
the RoI features with annotator-provided corrective scribbles. Instead of predicting binary masks for each region, we project all region
predictions into the common image canvas, where they compete for space. The network is trained end-to-end with a novel pixel-wise loss
for the full image segmentation task (see Sec. 3.3).
Several older works on interactive segmentation handle
multiple labels in a single image [39, 40, 50, 53]. We
present the first interactive deep learning framework which
does this. Moreover, in contrast to those works, we explic-
itly demonstrate the benefits of interactive full image seg-
mentation over interactive single object segmentation.
Other works on interactive annotation. In [48] they com-
bine a segmentation network with a language module to al-
low a human to correct the segmentation by typing feedback
in natural language, such as “there are no clouds visible in
this image”. The work of [42] annotates bounding boxes
using only human verification, while [29] trained agents to
determine whether it is more efficient to verify or draw a
bounding box. The avant-garde work of [49] had a ma-
chine dispatching many labeling questions to annotators, in-
cluding whether an object class is present, box verification,
box drawing, and finding missing instances of a particular
class in the image. In [54] they estimate the informative-
ness of having an image label, a box, or a segmentation
for an image, which they use to guide an active learning
scheme. Finally, several works tackle fine-grained classi-
fication through attributes interactive provided by annota-
tors [9, 43, 7, 55].
3. Our interactive segmentation model
This section describes our model which we use to pre-
dict a segmentation from extreme points and scribble cor-
rections (Fig. 1). We first discuss the model architecture
(Sec. 3.1). We then describe how we feed annotations to the
model (extreme points and scribble corrections, Sec. 3.2).
Finally, we describe model training with our new loss func-
tion (Sec. 3.3).
3.1. Model architecture
Our model is based on Mask-RCNN [22]. In Mask-
RCNN, inference is done as follows: (1) An input image X
is passed through a deep neural network backbone such as
ResNet [23], producing a feature map Z. (2) A specialized
network module (RPN [45]) predicts box proposals based
on Z. (3) These box proposals are used to crop out Region-
of-Interest (RoI) features z from Z with a RoI cropping
layer (RoI-align [22]). (4) Then each RoI feature z is fed
into three separate network modules which predict a class
label, refined box coordinates, and a segmentation mask.
Fig. 2 illustrates how we adapt Mask-RCNN [22] for in-
teractive full image segmentation. In particular, our net-
work takes three types of inputs: (1) an image X of
size W × H × 3; (2) N annotation maps S1, · · · ,SN of
size W × H (for extreme points and scribble corrections,
Sec. 3.2); and (3) N boxes b1, · · · ,bN determined by the
extreme points provided by annotators. Here N is the num-
ber of regions that we want to segment, which is determined
by the annotator, and which may vary per image.
As in Mask-RCNN, an image X is fed into our backbone
architecture (ResNet [23]) to produce a feature map Z of size
(1/r)W × (1/r)H × C, where C is the number of feature channels
and r is a reduction factor. Both C and r are determined by
the choice of backbone architecture.
In contrast to Mask-RCNN, we already have boxes
b1, · · · ,bN , so we do not need a box proposal module. In-
stead, we use each box bi directly to crop out an RoI feature
zi from feature map Z. All features zi have the same fixed
size w×h×C (i.e. w and h are only dependent on the RoI
cropping layer). We concatenate to this the corresponding
annotation map si which is described in Sec. 3.2, and obtain
a feature map vi, which is of size w × h × (C + 2). Using
vi, our network predicts a logit map li of size
w′ × h′ which represents the prediction of a single mask.
While Mask-RCNN stops at such mask predictions and pro-
cesses them with a sigmoid to obtain binary masks, we want
to have predictions influence each other. Therefore we use
the boxes b1, · · · ,bN to re-project the logit predictions of
all masks li back into the original image resolution which
results in N prediction maps Li. We concatenate these pre-
diction maps into a single tensor L of size W ×H×N . For
each pixel, we then obtain region probabilities P of dimen-
sion W ×H ×N by applying a softmax to the logits,
(P_1^{(x,y)}, · · · , P_N^{(x,y)}) = softmax(L_1^{(x,y)}, · · · , L_N^{(x,y)}),   (1)

where P_i^{(x,y)} denotes the probability that pixel (x, y) is as-
signed to region i. This makes multiple nearby regions com-
pete for space in the common image canvas.
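For illustration, a minimal PyTorch sketch of this canvas projection and per-pixel softmax (Eq. 1). Filling out-of-box pixels with a large negative logit is our assumption; the paper does not specify the fill value:

```python
import torch
import torch.nn.functional as F

def canvas_softmax(region_logits, boxes, H, W, fill=-1e4):
    """Paste per-region logit maps back into the image canvas and apply
    a per-pixel softmax across regions (Eq. 1).

    region_logits: (N, h', w') tensor of mask logits, one per region.
    boxes: N boxes (x0, y0, x1, y1) in pixels, from extreme points.
    Pixels outside a region's box get logit `fill`, so the softmax
    assigns them near-zero probability for that region (an assumption).
    """
    N = region_logits.shape[0]
    L = torch.full((N, H, W), fill)
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        # Bilinearly resize the fixed-size logit map to the box size.
        li = F.interpolate(region_logits[i][None, None],
                           size=(y1 - y0, x1 - x0), mode="bilinear",
                           align_corners=False)[0, 0]
        L[i, y0:y1, x0:x1] = li
    return L.softmax(dim=0)   # P: (N, H, W) region probabilities
```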
3.2. Incorporating annotations
Our model in Fig. 2 concatenates RoI features z with
annotation map s. We now describe how we create s. First,
for each region i we create a positive annotation map Si
which is of the same size W ×H as the image. We choose
the annotation map to be binary and we create it by pasting
all extreme points and corrective scribbles for region i onto
it. Extreme points are represented by a circle which is 6
pixels in diameter. Scribbles are 3 pixels wide.
For each region i, we collapse all annotations which
do not belong to it into a single negative annotation map
Σ_{j≠i} Sj. Then, we concatenate the positive and negative
annotation maps into a two-channel annotation map Fi:

Fi := ( Si , Σ_{j≠i} Sj ),   (2)
which is illustrated in Fig. 3. Finally, we apply RoI-
align [22] to Fi using box bi to obtain the desired cropped
annotation map si.
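A minimal numpy sketch of this construction (Eq. 2); rasterizing the points and scribbles into Si and the final RoI-align crop to si are omitted:

```python
import numpy as np

def two_channel_annotation_maps(S):
    """Per-region two-channel annotation maps F_i (Eq. 2).

    S: (N, H, W) binary {0, 1} array; S[i] holds region i's pasted
    extreme points and scribbles. Returns F of shape (N, 2, H, W):
    channel 0 is the positive map S_i, channel 1 the negatives,
    sum over j != i of S_j (clipped back to {0, 1} where annotations
    of different regions overlap, which is our choice).
    """
    S = np.asarray(S, dtype=np.int64)
    total = S.sum(axis=0)                       # all annotations at once
    F = np.empty((S.shape[0], 2, *S.shape[1:]), dtype=np.uint8)
    for i in range(S.shape[0]):
        F[i, 0] = S[i]                          # positive channel
        F[i, 1] = np.clip(total - S[i], 0, 1)   # negative channel
    return F
```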
The way we construct Fi enables the sharing of all an-
notation information across multiple object and stuff re-
gions in the image. The negative annotations for one re-
gion are formed by collecting the positive annotations of
all other regions. In contrast, in single object segmentation
works [3, 8, 18, 16, 21, 24, 30, 31, 32, 38, 36, 37, 47, 60]
both positive and negative annotations are made only on the
target object and they are never shared, so they only have an
effect on that one object.
3.3. Training
Training data. As training data, we have ground-truth
masks for all objects and stuff regions in all images. We
represent the (non-overlapping) N ground truth masks of
an image X with region indices. This results in a map Y
of dimension W × H, which assigns each pixel X^{(x,y)} to a
region Y^{(x,y)} ∈ {1, · · · , N}.
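As a small sketch of this encoding (the paper only states the representation; that every pixel is covered by some mask is assumed here):

```python
import numpy as np

def masks_to_region_map(masks):
    """Collapse N non-overlapping binary masks into an index map Y with
    Y[y, x] in {1, ..., N}.

    masks: (N, H, W) boolean array whose masks jointly cover the image.
    """
    return np.argmax(masks, axis=0).astype(np.int64) + 1
```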
Pixel-wise loss. Standard Mask-RCNN is trained with Bi-
nary Cross Entropy (BCE) losses for each mask prediction
separately. This means that there is no direct interaction
between adjacent masks, and they might even overlap. In-
stead, we propose a novel instance-aware loss which lets
predictions compete for space in the original image canvas.
In particular, as described in Sec. 3.1 we project all
region-specific logits into a single image-level logit tensor
L, which is softmaxed into region assignment probabili-
ties P of size W ×H ×N .
As described above, the ground-truth segmentation is
represented by Y with values in {1, · · · , N}, which spec-
ifies for each pixel its region index. Since we simulate the
extreme points from the ground-truth masks, there is a di-
rect correspondence between the region assignment proba-
bilities P1, · · · ,PN and Y. Thus, we can train our network
end-to-end for the Categorical Cross Entropy (CCE) loss for
the region assignments:
L_pixelwise = Σ_{(x,y)} −log P_{Y^{(x,y)}}^{(x,y)}   (3)
We note that while the CCE loss is commonly used in fully
convolutional networks for semantic segmentation [14, 35,
46], we instead use it in an architecture based on Mask-
RCNN [22]. Furthermore, usually the loss is defined over a
fixed number of classes [14, 35, 46], whereas we define it
over the number of regions N . This number of regions may
vary per image.
The loss in (3) is computed over the pixels in the full
resolution common image canvas. Consequently, larger re-
gions have a greater impact on the loss. However, in our
experiments we measure Intersection-over-Union (IoU) be-
tween ground-truth masks and predictions, which considers
all regions equally, independently of their size. Therefore we
weight the terms in (3) as follows. For each pixel we find the
smallest box bi which contains it, and re-weight the loss for
that pixel by the inverse of the size of bi. This causes each
region to contribute to the loss approximately equally.
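A PyTorch sketch of the re-weighted loss (Eq. 3) under our assumptions; in particular, the weight for pixels covered by no box is not specified in the paper, so we default it to 1 / (H * W):

```python
import torch

def pixel_weights(boxes, H, W):
    """Per-pixel weight = 1 / area of the smallest box containing the
    pixel (Sec. 3.3). boxes: (x0, y0, x1, y1) integer pixel boxes.
    Pixels covered by no box default to weight 1 / (H * W), which is
    our assumption."""
    area = torch.full((H, W), float(H * W))
    for (x0, y0, x1, y1) in boxes:
        a = float((x1 - x0) * (y1 - y0))
        # Keep the smallest covering-box area seen so far per pixel.
        area[y0:y1, x0:x1] = torch.clamp(area[y0:y1, x0:x1], max=a)
    return 1.0 / area

def pixelwise_loss(P, Y, weights):
    """Weighted categorical cross-entropy over the canvas (Eq. 3).
    P: (N, H, W) region probabilities; Y: (H, W) long tensor of region
    indices in {0, ..., N-1} (0-based here); weights: (H, W)."""
    p_gt = P.gather(0, Y[None])[0].clamp_min(1e-8)  # P at the gt region
    return -(weights * p_gt.log()).sum()
```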
Our loss shares similarities with [10]. They use Fast-
RCNN [20] with selective search regions [52] and generate
a class prediction vector for each region. Then they project
this vector back into the image canvas using its correspond-
ing region, while resolving conflicts using a max operator.
In our work instead, we project a full logit map back into
the image (Fig. 2). Furthermore, while in [10] the number
of logit channels is equal to the number of classes C, in our
work it depends on the number of regions N, which may
vary per image.
Annotations | Positive channel | Negative channel
Figure 3: We illustrate how we combine all annotations near a region (red) into two
annotation maps specific for that region. The colored regions denote the current pre-
dicted segmentation and the white boundaries depict true object boundaries. For the red
region, the extreme points and the single positive scribble are combined into a single
positive binary channel. All scribbles from other nearby regions are collected into a
single negative binary channel.
Scribble simulation
Figure 4: To simulate a corrective scribble,
we first sample an initial control point to in-
dicate which region we want to expand (yel-
low), followed by two control points (orange)
sampled uniformly from the error region.
3.4. Implementation details
The original implementation of Mask-RCNN [22] cre-
ates, for each RoI feature, mask predictions for all classes
that it is trained on. At inference time, it uses the predicted
class to select the corresponding predicted mask. Since we
build on Mask-RCNN, we also do this in our framework
for convenience of the implementation. During training we
use the class labels to train class-specific mask prediction
logits. During inference, for each region i we use the class
label predicted by Mask-RCNN to select the mask logits
we use as li. Hence at inference time, we
have implicit class labels. However, class labels are never
exposed to the annotator and are considered to be irrelevant
for this paper.
4. Annotations and their simulation
Our annotations consist of both extreme points and
scribble corrections. We chose scribble corrections [3, 8,
47] instead of click corrections [24, 30, 31, 32, 36, 37, 60]
as they are a more natural choice in our scenario. As we
consider multiple regions in an image, any annotation first
needs to indicate which region should be extended. With
scribbles one can start inside the region to be extended, fol-
lowed by a path which specifies how to extend the region.
In all our experiments we simulate annotations, follow-
ing previous interactive segmentation works [1, 12, 24, 30,
31, 32, 36, 37, 60].
Simulating extreme points. To simulate the extreme points
that the annotator provides at the beginning, we use the code
provided by [37].
Simulating scribble corrections. To simulate scribble cor-
rections during the interactive segmentation process, we
first need to select an error region. An error region is de-
fined as a connected group of pixels of a ground-truth re-
gion which has been wrongly assigned to a different region
(Fig. 4). We assess the importance of an error region by
measuring how much segmentation quality (IoU) would im-
prove if it was completely corrected. We use this to create
annotator corrections on the most important error regions
(the exact way depends on the particular experiment, details
in Sec. 5).
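A sketch of this selection step under our own reading (the paper gives no pseudocode; scoring by the gain in mean IoU over all regions is our interpretation of "importance"):

```python
import numpy as np
from scipy import ndimage

def region_iou(pred, gt, r):
    """IoU of region index r between prediction and ground truth."""
    inter = np.logical_and(pred == r, gt == r).sum()
    union = np.logical_or(pred == r, gt == r).sum()
    return inter / max(union, 1)

def rank_error_regions(pred, gt):
    """Connected error regions ranked by how much mean IoU would improve
    if each were fully corrected. pred, gt: (H, W) region-index maps.
    Returns (gain, error_mask) pairs, largest gain first."""
    regions = np.unique(gt)
    base = np.mean([region_iou(pred, gt, r) for r in regions])
    scored = []
    for r in regions:
        # Pixels of ground-truth region r currently assigned elsewhere.
        labels, k = ndimage.label((gt == r) & (pred != r))
        for c in range(1, k + 1):
            mask = labels == c
            fixed = np.where(mask, gt, pred)   # fully correct this error
            gain = np.mean([region_iou(fixed, gt, q) for q in regions])
            scored.append((gain - base, mask))
    return sorted(scored, key=lambda t: -t[0])
```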
To correct an error, we need a scribble that starts inside
the ground-truth region and extends into the error region
(Fig. 1). We simulate such scribbles with a three-step pro-
cess, illustrated in Fig. 4: (1) first we randomly sample the
first point on the border of the error region that touches
the ground-truth region (yellow point in Fig. 4); (2) then
we sample two more points uniformly inside the error re-
gion (orange points in Fig. 4); (3) we construct a scribble
as a smooth trajectory through these three points (using a
bezier curve). We repeat this process ten times, and keep the
longest scribble that is exclusively inside the ground-truth
region (while all simulated points are within the ground-
truth, the curve could cover parts outside the ground-truth).
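A sketch of one candidate curve; treating the middle sampled point as the Bezier control point (so the curve passes through the first and last points and bends towards the middle one) is our choice, since the paper only asks for a smooth trajectory:

```python
import numpy as np

def bezier_scribble(p0, p1, p2, n=50):
    """One candidate scribble as a quadratic Bezier curve (Sec. 4).

    p0: point where the error region touches the ground-truth region;
    p1, p2: points sampled uniformly inside the error region.
    Returns an (n, 2) array of (x, y) positions along the curve. In the
    full simulation this is repeated ten times, keeping the longest
    curve lying entirely inside the ground-truth region.
    """
    p0, p1, p2 = (np.asarray(p, dtype=float) for p in (p0, p1, p2))
    t = np.linspace(0.0, 1.0, n)[:, None]
    # B(t) = (1 - t)^2 p0 + 2 (1 - t) t p1 + t^2 p2
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2
```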
5. Results
We use Mask-RCNN as our basic segmentation framework
instead of Fully Convolutional architectures [14, 35, 46]
commonly used in single object segmentation works [24,
30, 31, 32, 36, 37, 60]. We first demonstrate in Sec. 5.1
that this is a valid choice by comparing to DEXTR [37] in
the non-interactive setting where we generate masks start-
ing from extreme points [41]. In Sec. 5.2 we move to the full
image segmentation task and demonstrate improvements re-
sulting from sharing extreme points across regions and from
our new pixel-wise loss. Finally, in Sec. 5.3 we show results
on interactive full image segmentation, where we also share
scribble corrections across regions, and allow the annotator
to freely allocate scribbles to regions while considering the
whole image.
Method                    IoU
DEXTR [37]                82.1
DEXTR (released model)    81.9
Our single region model   81.6
Table 1: Performance on COCO (objects only). The accuracy of
our single region model is comparable to DEXTR [37].
                  X-points not shared   X-points shared
Mask-wise loss           75.8                 76.0
Pixel-wise loss          78.4                 79.1
Table 2: Performance on the COCO Panoptic validation set when
predicting masks from extreme points (X-points). We vary the
loss and whether extreme points are shared across regions. The
top-left entry corresponds to our single region model, the
bottom-right entry corresponds to our full image model.
5.1. Single object segmentation
DEXTR. In DEXTR [37] they predict object masks from
four extreme points [41]. DEXTR is based on Deeplab-
v2 [14], using a ResNet-101 [23] backbone architecture and
a Pyramid Scene Parsing network [61] as prediction head.
As input they crop a bounding box out of the RGB im-
age based on the extreme points provided by the annotator.
The locations of the extreme points are Gaussian blurred
and fed as a heatmap to the network, concatenated to the
cropped RGB input. The DEXTR segmentation model ob-
tained state-of-the-art results on this task [37].
Details of our model. We compare DEXTR to a single ob-
ject segmentation variant of our model (single region
model). It uses the original Mask-RCNN loss, computed in-
dividually per mask, and does not share annotations across
regions. For fair comparison to DEXTR, here we also use
a ResNet-101 [23] backbone, which due to memory con-
straints limits the resolution of our RoI features to 14 × 14
pixels and our predicted mask to 33 × 33. Moreover, we
use their released code to generate simulated extreme point
annotations. In contrast to subsequent experiments, here we
also use the same Gaussian blurred heatmaps to input anno-
tations to our model as used in [37].
Dataset. We follow the experimental setup of [37] on
the COCO dataset [34], which has 80 object classes.
Models are trained on the 2014 training set and evalu-
ated on the 2017 validation set (previously referred to as
2014 minival). We measure performance in terms of
Intersection-over-Union averaged over all instances.
Results. Tab. 1 reports the original results from
DEXTR [37], our reproduction using their publicly released
model, and results of our single region model. Their
publicly released model and our model deliver very similar
results (81.9 and 81.6 IoU). This demonstrates that Mask-
RCNN-style models are competitive with commonly used
FCN-style models for this task.
5.2. Full image segmentation
Experimental setup. Given extreme points for each object
and stuff region, we now predict a full image segmentation.
We demonstrate the benefits of using our pixel-wise loss
(Sec. 3.3) and sharing extreme points across regions (i.e.
extreme points for one region are used as negative informa-
tion for nearby regions, Sec. 3.2).
Details of our model. In preliminary experiments we found
that RoI features of 14 × 14 pixel resolution were limiting
accuracy when feeding in annotations into the segmentation
head. Therefore we increased both the RoI features and the
predicted mask to 41 × 41 pixels and switched to ResNet-
50 [23] due to memory constraints. Importantly, in all ex-
periments from now on our model uses the two-channel an-
notation maps described in Sec. 3.2.
Dataset. We perform our experiments on the COCO panop-
tic challenge dataset [11, 27, 34], which has 80 object
classes and 53 stuff classes. Since the final goal is to ef-
ficiently annotate data, we train on only 12.5% of the 2017
training set (15k images). We evaluate on the 2017 valida-
tion set and measure IoU averaged over all object and stuff
regions in all images.
Results. As Tab. 2 shows, our single region model
yields 75.8 IoU. It uses a mask-wise loss and does not share
extreme points across regions. When only sharing extreme
points, we get a small gain of +0.2 IoU. In contrast, when
only switching to our pixel-wise loss, results improve by
+2.6 IoU. Sharing extreme points is more beneficial in com-
bination with our new loss, yielding an additional improve-
ment of +0.7 IoU. Overall this model with both improve-
ments achieves 79.1 IoU, +3.3 higher than the single
region model. We call it our full image model.
5.3. Interactive full image segmentation
We now move to our final system for interactive full im-
age segmentation. We start from the segmentations from
extreme points made by our single region and full
image models from Sec. 5.2. Then we iterate between:
(A) adding scribble corrections by the annotator, and (B)
updating the machine segmentations accordingly.
Dataset and training. As before, we experiment on the
COCO panoptic challenge dataset and report results on the
2017 validation set. Since during iterations our models
input scribble corrections in addition to extreme points,
we train two new interactive models: single region
scribble and full image scribble. These mod-
els have the same architecture as their counterparts in
Sec. 5.2 (which input only extreme points), but are trained
differently. To create training data for one of these inter-
active models, we apply its counterpart to another 12.5%
of the 2017 training set.
[Figure 5 plot: average IoU (0.75 to 0.90) vs. #scribbles/region (0 to 8), with curves for full image scribble (one scribble on average), full image scribble (one scribble per region), and single region scribble (one scribble per region).]
Figure 5: Results on the COCO Panoptic validation set for the
interactive full image segmentation task. We measure average IoU
vs the number of scribbles per region. We compare our full
image scribble model under two scribble allocation strate-
gies to the single region scribble baseline.
We generate simulated correc-
tive scribbles as described in Sec. 4 and train each model
on the combined extreme points and scribbles annotations
(Sec. 3.2). We keep these models fixed throughout all it-
erations of interactive segmentation. Note how, in addition
to sharing extreme points as in Sec. 5.2, the full image
scribble model also shares scribble corrections across
regions.
Allocation of scribble corrections. When using our
single region scribble model, in every iteration
we allocate exactly one scribble to each region. Instead,
when using our full image scribble model we also
consider an alternative interesting strategy: one scribble
per region on average, but the annotator can freely allocate
these scribbles to the regions in the image. This enables the
annotator to focus efforts on the biggest errors across the
whole image, typically resulting in some regions receiving
multiple scribbles and some receiving none.
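A minimal sketch of the free allocation strategy (names are hypothetical; scored errors could come from a ranking such as the one sketched in Sec. 4):

```python
def allocate_scribbles(scored_errors, budget):
    """Free allocation: spend the whole scribble budget on the largest
    errors anywhere in the image, so some regions receive several
    scribbles and others none.

    scored_errors: (gain, region_id) pairs for the current predictions;
    budget: total scribbles this round (= N for one per region on
    average). Returns the region ids to scribble on, biggest gain first.
    """
    ranked = sorted(scored_errors, key=lambda t: -t[0])
    return [region_id for _, region_id in ranked[:budget]]
```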
Results. Fig. 5 shows annotation quality (IoU) vs cost
(number of scribbles per region). The two starting points
at zero scribbles are the same as the top-left and bottom-
right entries of Tab. 2 since they are made using the same
non-interactive models (from extreme points only).
We first compare single region scribble to
full image scribble while using the same alloca-
tion strategy: exactly one scribble per region. Fig. 5
shows that for both models accuracy rapidly improves
with more scribble corrections. However, full image
scribble always offers a better trade-off between an-
notation effort and segmentation quality, e.g. to reach
85% IoU it takes 4 scribbles per region for the single
region scribble model but only 2 scribbles for our
full image scribble model. Similarly, to reach
88% IoU it takes 7 scribbles vs 4 scribbles. This confirms
that the benefits of sharing annotations across regions and
of our pixel-wise loss persist also in the interactive setting.
We now compare the two scribble allocation strate-
gies on the full image scribble model. As Fig. 5
shows, using the strategy of freely allocating scribbles to
regions (one scribble on average) brings further efficiency
gains. full image scribble reaches a very high
90% IoU at just 4 scribbles per region on average. Reaching
this IoU instead requires allocating exactly 8 scribbles per
region with the other strategy. This demonstrates the benefits
of focusing annotation effort on the largest errors across the
whole image.
Overall, at a budget of four extreme clicks and four scrib-
bles per region, we get a total 5% IoU gain over single
region scribble (90% vs 85%). This gain is brought
by the combined effects of our contributions: sharing anno-
tations across regions, the pixel-wise loss which lets regions
compete on the common image canvas, and the free scribble
allocation strategy.
Fig. 6 shows various examples for how annotation pro-
gresses over iterations. Notice how in the first example, the
corrective scribble on the left bear induces a negative scrib-
ble for the rock, which in turn improves the segmentation
of the right bear. This demonstrates the benefit of sharing
scribble annotations and competition between regions.
6. Discussion
Mask-RCNN vs FCNs. Our work builds on Mask-
RCNN [22] rather than FCN-based models [14, 35, 46] be-
cause it is faster and requires less memory. To see this, we
can reinterpret Fig. 2 as an FCN-based model: ignore the
backbone network, replace the backbone features Z by the
RGB image, and make the segmentation head a full FCN.
At inference time, we need to do a forward pass through
the segmentation head for every region for every correction.
When using Mask-RCNN, the heavy ResNet [23] backbone
network is applied only once for the whole image, and then
only a small 4-layer segmentation head is applied to each re-
gion. For the FCN-style alternative instead, nothing can be
precomputed and the segmentation head itself is the heavy
ResNet. Hence our framework is much faster during inter-
active annotation.
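Schematically, the interactive loop looks as follows (a sketch; every callable here is a hypothetical stand-in, not the paper's API):

```python
def interactive_rounds(image, boxes, backbone, seg_head, get_corrections,
                       annotations, max_rounds=10):
    """Why the Mask-RCNN split is cheap interactively: the heavy backbone
    runs once per image, while each correction round only re-runs the
    small segmentation head once per region on cached features.
    """
    Z = backbone(image)                       # heavy network, run once
    for _ in range(max_rounds):
        # One cheap head pass per region, reusing backbone features Z.
        logits = [seg_head(Z, box, annotations) for box in boxes]
        corrections = get_corrections(logits) # annotator scribbles
        if not corrections:                   # annotator is satisfied
            break
        annotations = annotations + corrections
    return logits
```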
During training, typically all intermediate network ac-
tivations are stored in memory. Crucially, for each region
distinct activations are generated in the segmentation head.
For FCN-style models this is a heavy ResNet and requires
lots of memory. This is why DEXTR [37] reports a maxi-
mal batch size of 5 regions. Therefore, it would be difficult
to train with our pixel-wise loss in an FCN-style model, as
that requires processing all regions in each image simulta-
neously (15 regions per image on average).

Figure 6: We show example results obtained by our system using the full image scribble model with a free allocation strategy.
The first two columns show the input image with extreme points provided by the annotator and the machine predictions from those
extreme points. Column 3 shows the first annotation step with one scribble correction per region on average, and column 4 shows the
updated predictions. The last two columns compare the final result after 9 steps (using 9 scribbles per region on average) with the COCO
ground-truth segmentation.
In fact our Mask-RCNN based architecture (Fig. 2) and
its reinterpretation as an FCN-based model span a contin-
uum. Its design space can be explored by varying the size of
the backbone and the segmentation head, as well as their in-
put and output resolution. We leave such exploration of the
trade-off between memory requirements, inference speed,
and model accuracy for future work.
Scribble and point simulations. Like other interactive seg-
mentation works [1, 12, 24, 30, 31, 32, 36, 37, 60], we sim-
ulate annotations. It remains to be studied how to best se-
lect the simulation parameters so that the models generalize
well to real human annotators. The optimal parameters will
likely depend on various factors, such as the desired anno-
tation quality and the accuracy of the provided corrections.
7. Conclusions
We proposed an interactive annotation framework which
operates on the whole image to produce segmentations for
all object and stuff regions. Our key contributions derive
from considering the full image at once: sharing annota-
tions across regions, focusing annotator effort on the biggest
errors across the whole image, and a pixel-wise loss for
Mask-RCNN that lets regions compete on the common im-
age canvas. We have shown through experiments on the
COCO panoptic challenge dataset [11, 27, 34] that all the
elements we propose improve the trade-off between anno-
tation cost and quality, leading to a very high IoU of 90%
using just four extreme points and four corrective scribbles
per region (compared to 85% for the baseline).
References
[1] D. Acuna, H. Ling, A. Kar, and S. Fidler. Efficient interactive
annotation of segmentation datasets with polygon-rnn++. In
CVPR, 2018. 2, 5, 8
[2] M. Andriluka, J. R. R. Uijlings, and V. Ferrari. Fluid an-
notation: A human-machine collaboration interface for full
image annotation. In ACM Multimedia, 2018. 2
[3] X. Bai and G. Sapiro. Geodesic matting: A framework for
fast interactive image and video segmentation and matting.
IJCV, 2009. 2, 4, 5
[4] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen. Inter-
actively co-segmentating topically related images with intel-
ligent scribble guidance. IJCV, 2011. 2
[5] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei.
What’s the point: Semantic segmentation with point super-
vision. In ECCV, 2016. 2
[6] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material
recognition in the wild with the materials in context database.
In CVPR, 2015. 2
[7] A. Biswas and D. Parikh. Simultaneous active learning of
classifiers & attributes via relative feedback. In CVPR, 2013.
3
[8] Y. Boykov and M. P. Jolly. Interactive graph cuts for optimal
boundary and region segmentation of objects in N-D images.
In ICCV, 2001. 2, 4, 5
[9] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder,
P. Perona, and S. Belongie. Visual recognition with humans
in the loop. In ECCV, 2010. 3
[10] H. Caesar, J. Uijlings, and V. Ferrari. Region-based semantic
segmentation with end-to-end training. In ECCV, 2016. 4
[11] H. Caesar, J. Uijlings, and V. Ferrari. COCO-stuff: Thing
and stuff classes in context. In CVPR, 2018. 1, 2, 6, 8
[12] L. Castrejon, K. Kundu, R. Urtasun, and S. Fidler. Annotat-
ing object instances with a polygon-rnn. In CVPR, 2017. 2,
5, 8
[13] D.-J. Chen, J.-T. Chien, H.-T. Chen, and L.-W. Chang. Tap
and shoot segmentation. In AAAI, 2018. 2
[14] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and
A. Yuille. Deeplab: Semantic image segmentation with deep
convolutional nets, atrous convolution, and fully connected
CRFs. IEEE Trans. on PAMI, 2017. 2, 4, 5, 6, 7
[15] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blaz-
ingly fast video object segmentation with pixel-wise metric
learning. In CVPR, 2018. 2
[16] M.-M. Cheng, V. A. Prisacariu, S. Zheng, P. H. S. Torr, and
C. Rother. Densecut: Densely connected crfs for realtime
grabcut. Computer Graphics Forum, 2015. 2, 4
[17] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler,
R. Benenson, U. Franke, S. Roth, and B. Schiele. The
cityscapes dataset for semantic urban scene understanding.
CVPR, 2016. 1
[18] A. Criminisi, T. Sharp, C. Rother, and P. Perez. Geodesic
image and video editing. In ACM Transactions on Graphics,
2010. 2, 4
[19] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision
meets robotics: The KITTI dataset. International Journal
of Robotics Research, 2013. 1
[20] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea-
ture hierarchies for accurate object detection and semantic
segmentation. In CVPR, 2014. 4
[21] V. Gulshan, C. Rother, A. Criminisi, A. Blake, and A. Zis-
serman. Geodesic star convexity for interactive image seg-
mentation. In CVPR, 2010. 2, 4
[22] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-
CNN. In ICCV, 2017. 1, 2, 3, 4, 5, 7
[23] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. In CVPR, 2016. 2, 3, 6, 7
[24] Y. Hu, A. Soltoggio, R. Lock, and S. Carter. A fully con-
volutional two-stream fusion network for interactive image
segmentation. Neural Networks, 2019. 1, 2, 4, 5, 8
[25] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully
convolutional localization networks for dense captioning. In
CVPR, 2016. 1
[26] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and
B. Schiele. Simple does it: Weakly supervised instance and
semantic segmentation. In CVPR, 2017. 2
[27] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar.
Panoptic segmentation. arXiv, 2018. 1, 2, 6, 8
[28] A. Kolesnikov and C. Lampert. Seed, expand and constrain:
Three principles for weakly-supervised image segmentation.
In ECCV, 2016. 2
[29] K. Konyushkova, J. Uijlings, C. Lampert, and V. Ferrari.
Learning intelligent dialogs for bounding box annotation. In
CVPR, 2018. 3
[30] H. Le, L. Mai, B. Price, S. Cohen, H. Jin, and F. Liu. In-
teractive boundary prediction for object selection. In ECCV,
2018. 1, 2, 4, 5, 8
[31] Z. Li, Q. Chen, and V. Koltun. Interactive image segmenta-
tion with latent diversity. In CVPR, 2018. 1, 2, 4, 5, 8
[32] J. Liew, Y. Wei, W. Xiong, S.-H. Ong, and J. Feng. Regional
interactive image segmentation networks. In ICCV, 2017. 1,
2, 4, 5, 8
[33] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribble-
Sup: Scribble-supervised convolutional networks for seman-
tic segmentation. In CVPR, 2016. 2
[34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-
manan, P. Dollar, and C. Zitnick. Microsoft COCO: Com-
mon objects in context. In ECCV, 2014. 1, 2, 6, 8
[35] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional
networks for semantic segmentation. In CVPR, 2015. 2, 4,
5, 7
[36] S. Mahadevan, P. Voigtlaender, and B. Leibe. Iteratively
trained interactive segmentation. In BMVC, 2018. 1, 2, 4,
5, 8
[37] K.-K. Maninis, S. Caelles, J. Pont-Tuset, and L. Van Gool.
Deep extreme cut: From extreme points to object segmenta-
tion. In CVPR, 2018. 1, 2, 4, 5, 6, 7, 8
[38] N. S. Nagaraja, F. R. Schmidt, and T. Brox. Video segmen-
tation with just a few strokes. In ICCV, 2015. 2, 4
[39] C. Nieuwenhuis and D. Cremers. Spatially varying color
distributions for interactive multilabel segmentation. IEEE
Trans. on PAMI, 2013. 3
[40] C. Nieuwenhuis, S. Hawe, M. Kleinsteuber, and D. Cremers.
Co-sparse textural similarity for interactive segmentation. In
ECCV, 2014. 3
[41] D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari.
Extreme clicking for efficient object annotation. In ICCV,
2017. 1, 2, 5, 6
[42] D. P. Papadopoulos, J. R. R. Uijlings, F. Keller, and V. Fer-
rari. We don’t need no bounding-boxes: Training object class
detectors using only human verification. In CVPR, 2016. 3
[43] A. Parkash and D. Parikh. Attributes for classifier feedback.
In ECCV, 2012. 3
[44] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained con-
volutional neural networks for weakly supervised segmenta-
tion. In ICCV, 2015. 2
[45] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: To-
wards real-time object detection with region proposal net-
works. In NIPS, 2015. 3
[46] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolu-
tional networks for biomedical image segmentation. In MIC-
CAI, 2015. 2, 4, 5, 7
[47] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Inter-
active foreground extraction using iterated graph cuts. In
SIGGRAPH, 2004. 2, 4, 5
[48] C. Rupprecht, I. Laina, N. Navab, G. D. Hager, and
F. Tombari. Guide me: Interacting with deep networks. In
CVPR, 2018. 3
[49] O. Russakovsky, L.-J. Li, and L. Fei-Fei. Best of both
worlds: human-machine collaboration for object annotation.
In CVPR, 2015. 3
[50] J. Santner, T. Pock, and H. Bischof. Interactive multi-label
segmentation. In ACCV, 2010. 3
[51] M. Serrao, S. Shahrabadi, M. Moreno, J. Jose, J. I. Ro-
drigues, J. M. Rodrigues, and J. H. du Buf. Computer vi-
sion and GIS for the navigation of blind persons in buildings.
Universal Access in the Information Society, 2015. 1
[52] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and
A. W. M. Smeulders. Selective search for object recognition.
IJCV, 2013. 4
[53] V. Vezhnevets and V. Konushin. GrowCut - interactive multi-
label N-D image segmentation by cellular automata. In
GraphiCon, 2005. 3
[54] S. Vijayanarasimhan and K. Grauman. What’s it going to
cost you?: Predicting effort vs. informativeness for multi-
label image annotations. In CVPR, 2009. 3
[55] C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and
S. Belongie. Similarity comparisons for interactive fine-
grained categorization. In CVPR, 2014. 3
[56] G. Wang, P. Luo, L. Lin, and X. Wang. Learning object in-
teractions and descriptions for semantic image segmentation.
In CVPR, 2017. 1
[57] T. Wang, B. Han, and J. Collomosse. Touchcut: Fast im-
age and video segmentation using single-touch interaction.
CVIU, 2014. 2
[58] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang. Re-
visiting dilated convolution: A simple approach for weakly-
and semi-supervised semantic segmentation. In CVPR, 2018.
2
[59] J. Xu, A. G. Schwing, and R. Urtasun. Learning to segment
under various forms of weak supervision. In CVPR, 2015. 2
[60] N. Xu, B. Price, S. Cohen, J. Yang, and T. Huang. Deep
interactive object selection. In CVPR, 2016. 1, 2, 4, 5, 8
[61] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene
parsing network. In CVPR, 2017. 6