
Visual vocabulary with a semantic twist:

Supplementary material

Relja Arandjelović and Andrew Zisserman

Department of Engineering Science, University of Oxford

Table of Contents

1 Fast semantic segmentation
    1.1 Inadequacy of current fast segmentation methods
    1.2 Qualitative results of FSSS
    1.3 Soft segments
    1.4 Efficient inference
    1.5 Segmentation uncertainty
    1.6 Implementation details: Features and unary potentials
    1.7 Implementation details: Fast inference
2 Retrieval results: Speedup


1 Fast semantic segmentation

1.1 Inadequacy of current fast segmentation methods

For the proposed SemanticSIFT method we have two principal requirements for the semantic segmentation: first, speed, as we envisage large scale deployment (millions to billions of images) and real time processing of uploaded query images; and second, accuracy, as incorrect segmentations may reduce recall. Of course, these two requirements often conflict: currently the methods of [1, 2, 3] produce state of the art semantic segmentations (e.g. on the Stanford background dataset [4]) but take several minutes per image; conversely, existing fast approaches either produce results of insufficient quality, such as Semantic Texton Forest [5] (figure 2), or make assumptions which are often violated in real world images, such as the "Tiered Scene" assumption [6], where sky, if present in the image, is forced to be above any other label (see figure 1).

Fig. 1. Limitations of the "Tiered Scene" model [6]. Real world images often violate the "Tiered Scene" model, in which sky, if it exists in the image, is assumed to be above any other class, and not more than three classes (ground, a single middle class, sky) can appear along one vertical line


Fig. 2. Semantic segmentations using Semantic Texton Forest [5]. Pairs of rows show the original image (taken from the Oxford 5k dataset) on the top, and the automatic semantic segmentation on the bottom. Text labels point out the most obvious mistakes in each image. Semantic Texton Forest produces quite noisy and low quality results. The results were obtained using the publicly available demo of the Semantic Texton Forest made by the second author of the paper [5], downloaded from http://www.matthewajohnson.org/demos.html. Figure 3 shows results of our FSSS method on the same images. Best viewed in colour


1.2 Qualitative results of FSSS

Several examples of the produced semantic segmentations are shown in figure 3. Examples of semantic segmentations which take the uncertainty of the segmentation into account are shown in figure 4.

Fig. 3. Example semantic segmentations. Pairs of rows show the original image on the top, and the automatic semantic segmentation on the bottom. Best viewed in colour


Fig. 4. Example semantic segmentations with uncertainty estimation. Triplets of rows show the original image on the top, the automatic semantic segmentation in the middle, and the semantic segmentation augmented with the "unknown" label on the bottom. Interfaces between different semantic classes are often labelled as "unknown" due to the uncertainty in the exact position of the border, but the width of the "unknown" region adapts to the data – it is thin for clear boundaries (e.g. between building and sky), and wide for hard cases (e.g. the boundary between grass and dirt road, or between tree and sky). Introduction of the "unknown" label fixes several mistakes (e.g. the top right image, where some "sky" pixels were assigned the "other" label) by not forcing the segmentation algorithm to make a decision in cases where it is not sure about the correct label. Best viewed in colour


1.3 Soft segments

Figure 5 shows several examples of soft-segments, as defined in section 3.1 of the main submission.

Fig. 5. Soft segments. Top left: the original image; all other images show soft-segments as defined in equation (1) of the main submission. Each image shows the "seed" for the soft-segment as a red cross, and the brightness depicts the weight $w_{i,j}$ between the seed pixel i and all pixels j. Note (i) that the soft-segments are localized and do not cross semantic boundaries, (ii) the varying effective sizes, e.g. soft-segments containing sky and buildings are large while the ones centred on faces and hands are small, and (iii) that picking different seed points from the same object produces similar soft-segments.


1.4 Efficient inference

This section provides further details of the inference procedure (section 3.2 of the main submission), namely the optimization of the energy function (equation (1) of the main submission) with respect to the pixel and soft-segment labels. For clarity, we repeat the energy definition here:

$$E(l, L) = \sum_i \left[ \lambda \phi_i(L_i) - \sum_j w_{i,j}\, \delta(L_i = l_j) \right] \qquad (1)$$

From equation (1) and the corresponding graphical model in figure 3a of the main submission, it is clear that, by design, the pixel labels l are conditionally independent from each other given the soft-segment labels L, and vice versa. The energy function is optimized by initializing the segment labels L according to the unary potentials, $L_i = \arg\min_v \phi_i(v)$, followed by iterating between two steps: fixing L and optimizing for the pixel labels l (step 1), and fixing l and optimizing for L (step 2). The iterations proceed until convergence, i.e. until a step does not change any label.

Step 1. The first step assumes that the $L_i$'s are fixed, so the non-constant part of the energy function in equation (1) is:

$$E^{(1)}(l) = -\sum_i \sum_j w_{i,j}\, \delta(L_i = l_j) = -\sum_j \sum_i w_{i,j}\, \delta(L_i = l_j) \qquad (2)$$

It is clear that the $l_j$'s are conditionally independent of each other given the $L_i$'s, and that the energy function can be decomposed into individual energy contributions for every pixel j:

$$E^{(1)}_j(l_j) = -\sum_i w_{i,j}\, \delta(L_i = l_j) \qquad (3)$$

$$E^{(1)}(l) = \sum_j E^{(1)}_j(l_j) \qquad (4)$$

The optimum is found simply by minimizing each pixel's energy contribution, $E^{(1)}_j(l_j)$, independently, by exhaustive search over all labels:

$$l_j = \arg\min_v E^{(1)}_j(v) \qquad (5)$$

Step 2. The second step is similar to the first one: all of the $L_i$'s are conditionally independent of each other given the $l_j$'s, and therefore equation (1) can be decomposed into individual soft-segment energy contributions, $E^{(2)}_i(L_i)$, which are minimized independently for each i:

$$E^{(2)}_i(L_i) = \lambda \phi_i(L_i) - \sum_j w_{i,j}\, \delta(L_i = l_j) \qquad (6)$$


As before, the optimum is found by setting $L_i = \arg\min_v E^{(2)}_i(v)$.

The iterations finish when a step does not change any label; in practice the algorithm converges quickly, in 1 to 7 iterations depending on the complexity of the scene.

The procedure can be interpreted as message passing, where the first set of messages is sent top-down from the soft superpixels to the underlying pixels, and the second set of messages is sent bottom-up from the pixels to the soft superpixels.
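To make the two steps concrete, the following is a minimal Python/NumPy sketch of the inference loop. It is an illustration under assumed data layouts, not the released implementation (which is pure MATLAB, section 1.7): phi is a (segments × labels) matrix of unary potentials, W a dense (segments × pixels) matrix of the non-negative weights $w_{i,j}$, and all names are ours.

    import numpy as np

    def infer(phi, W, lam, max_iters=50):
        """Alternating minimization of equation (1); a sketch, not the paper's code."""
        # phi : (S, V) unary potentials phi_i(v) for S soft-segments, V labels.
        # W   : (S, P) non-negative weights w_ij between seeds i and pixels j.
        S, V = phi.shape
        P = W.shape[1]
        L = phi.argmin(axis=1)              # initialize L_i = argmin_v phi_i(v)
        l = np.zeros(P, dtype=int)
        for _ in range(max_iters):
            # Step 1 (equations (3) and (5)): E1_j(v) = -sum_i w_ij [L_i == v],
            # so each pixel takes the label with the largest accumulated weight.
            votes = np.zeros((V, P))
            for v in range(V):
                votes[v] = W[L == v].sum(axis=0)
            l_new = votes.argmax(axis=0)
            # Step 2 (equation (6)): E2_i(v) = lam*phi_i(v) - sum_j w_ij [l_j == v].
            support = np.zeros((S, V))
            for v in range(V):
                support[:, v] = W[:, l_new == v].sum(axis=1)
            L_new = (lam * phi - support).argmin(axis=1)
            converged = np.array_equal(l_new, l) and np.array_equal(L_new, L)
            l, L = l_new, L_new
            if converged:                   # a full iteration changed no label
                break
        return l, L

In the setting above, S is only a few hundred (e.g. 400 soft-segments for a 20×20 grid, section 1.6), so both steps are cheap relative to feature extraction.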

1.5 Segmentation uncertainty

It is sometimes beneficial for a semantic segmentation algorithm to be able to provide an extra "unknown" label when it is uncertain about the true labelling, instead of making a hard and potentially wrong decision; such functionality is exploited by the SoftSemanticSIFT method (section 2.5 of the main submission). A pixel is assigned the "unknown" label if perturbing its optimal label does not change its contribution to the energy (equation (1)) much.

More formally, the contribution of pixel j's label $l_j$ to the global labelling energy is given in equation (3) and denoted $E^{(1)}_j(l_j)$. Let the optimal label for pixel j be $l_j$ and the second best be $l_{j,2}$. Pixel j is deemed to be "unknown" if $E^{(1)}_j(l_j)$ is not significantly larger in magnitude than $E^{(1)}_j(l_{j,2})$:

$$\frac{E^{(1)}_j(l_{j,2})}{E^{(1)}_j(l_j)} > t \qquad (7)$$

where t is a threshold. For t > 1 no pixel is assigned the "unknown" label: by definition $E^{(1)}_j(l_j) \le E^{(1)}_j(l_{j,2}) \le 0$, so the ratio in equation (7) never exceeds 1. For $t \le 0$ all pixels are assigned the "unknown" label. We set t = 0.1, but have experimentally found that a wide range of values (between 0.05 and 0.5) results in similar SoftSemanticSIFT retrieval performance.

The uncertainty estimation does not impact the computational efficiency of the inference algorithm, as all of the required quantities, $E^{(1)}_j(l_j)$ and $E^{(1)}_j(l_{j,2})$, are available during the minimization of equation (3) in equation (5), i.e. in step 1 (section 1.4).
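A sketch of how this can be folded into step 1, reusing the vote matrix from the inference sketch in section 1.4 (votes[v, j] accumulates the weights for label v at pixel j, so $E^{(1)}_j(v) = -$votes[v, j]; again Python/NumPy, with the degenerate no-support case handled by a choice of ours):

    import numpy as np

    def unknown_mask(votes, l, t=0.1):
        # votes : (V, P) accumulated weights from step 1; E1_j(v) = -votes[v, j].
        # l     : (P,) optimal pixel labels, l_j = argmax_v votes[v, j].
        V, P = votes.shape
        E1 = -votes
        e_best = E1[l, np.arange(P)]        # E1_j(l_j), the minimum energy
        E1_rest = E1.copy()
        E1_rest[l, np.arange(P)] = np.inf   # mask out the optimal label
        e_second = E1_rest.min(axis=0)      # E1_j(l_{j,2})
        # Equation (7): "unknown" iff E1_j(l_{j,2}) / E1_j(l_j) > t.
        # Pixels with no support at all (e_best == 0) are treated as maximally
        # uncertain -- our choice for the degenerate case.
        with np.errstate(divide='ignore', invalid='ignore'):
            ratio = np.where(e_best < 0, e_second / e_best, 1.0)
        return ratio > t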

1.6 Implementation details: Features and unary potentials

As described in section 3.2 of the main submission, the features used are a histogram of colours in HSV space, a bag-of-words of dense RootSIFT [7], and a histogram of normalized y pixel locations.

For dense RootSIFT extraction we use VLFeat [8] to extract dense SIFT descriptors at the default 2 pixel spacing, followed by the RootSIFT transformation [7]. Note that we have found it beneficial to remove descriptors which cross strong boundaries, as those can be confusing. For example, if a local descriptor is extracted from a sky region but its support region includes a building, the bag-of-words description of the sky superpixel is polluted by "building" words.


We therefore remove a local descriptor if there is a pair of pixels i and j in its support region such that $D_s(i, j) > \epsilon$, where we set $\epsilon$ empirically to 0.1.
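As an illustration only: the pairwise measure $D_s$ is defined in the main submission, so the Python/NumPy sketch below stands in for it with a precomputed per-pixel boundary-strength map and rejects a patch if any pixel inside it exceeds $\epsilon$. This is our approximation of the test, not the paper's exact criterion.

    import numpy as np

    def keep_descriptor(boundary, cy, cx, half, eps=0.1):
        # boundary : (H, W) boundary-strength map (a stand-in for pairwise D_s).
        # (cy, cx) : descriptor centre; half : half-width of its support region.
        patch = boundary[max(cy - half, 0):cy + half + 1,
                         max(cx - half, 0):cx + half + 1]
        # Drop descriptors whose support region contains a strong boundary.
        return bool(patch.max() <= eps)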

The vocabulary has 256 words for each of the bag-of-words and colour histograms. We found that segmentation accuracy does not vary much with different vocabulary sizes. The normalized y position histogram is quantized uniformly into 10 bins. Therefore the entire feature vector is 256 + 256 + 10 = 522 dimensional. A one-vs-all linear SVM is used to classify the aforementioned features (i.e. to compute $\phi_i(L_i)$), using liblinear [9].
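For illustration, assembling the per-segment feature might look as follows (a Python/NumPy sketch; the per-pixel word indices are assumed to come from the earlier quantization stages, the soft-segment weighting is omitted for brevity, and the L1 normalization is our assumption rather than a stated detail):

    import numpy as np

    def segment_feature(colour_words, visual_words, y_norm,
                        n_words=256, n_ybins=10):
        # colour_words, visual_words : per-pixel word indices in [0, n_words).
        # y_norm : per-pixel y coordinates normalized to [0, 1].
        h_col = np.bincount(colour_words, minlength=n_words).astype(float)
        h_vis = np.bincount(visual_words, minlength=n_words).astype(float)
        h_y, _ = np.histogram(y_norm, bins=n_ybins, range=(0.0, 1.0))
        f = np.concatenate([h_col, h_vis, h_y.astype(float)])  # 256+256+10 = 522-D
        return f / max(f.sum(), 1.0)                           # assumed L1 norm

The resulting 522-D vectors are what the one-vs-all linear SVM scores in order to produce the unary potentials.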

Values of the various parameters are found automatically using 5-fold cross validation. The values for the ParisSculpt360 / Oxford Buildings [10] and Stanford background [4] datasets are listed below.

ParisSculpt360 / Oxford Buildings

◦ α = 10
◦ β = 20
◦ Grid size: 20×20. Therefore the number of soft-segments is 400. For training the unary potentials, we use a finer 35×35 grid in order to obtain more training data
◦ λ = 500
◦ VLFeat dense SIFT scales, default: 4, 6, 8, 10
◦ Hard assignment into visual and colour words
◦ SVM parameters: C = 0.005, B = 1

Stanford background

◦ α = 15
◦ β = 30
◦ Grid size: 25×25. Therefore the number of soft-segments is 625
◦ λ = 50000
◦ VLFeat dense SIFT scales: 4, 6. We decrease the scales from the default 4, 6, 8, 10 as the images are small (320×240)
◦ Hard assignment into colour words
◦ Soft assignment into the 5 nearest visual words
◦ Instead of the 10-D normalized y pixel location histogram, here we use a 6×6 histogram of normalized (x, y) pixel locations, as in [2]. This is because the Stanford background dataset has a "foreground object" class which is often central in the image, i.e. the x dimension is informative
◦ SVM parameters: C = 0.005, B = 1

1.7 Implementation details: Fast inference

The entire inference algorithm, including extraction of features, takes 7 seconds on a 500×500 image, with a pure MATLAB CPU implementation. For smaller 320×240 images (from the Stanford background dataset [4]) the algorithm takes 3.7s. The breakdown of the computational time across the various stages of the algorithm, for images from the Oxford Buildings dataset (1024×768), is:


(1) 0.07s: Loading and resizing the image such that the maximal dimension is 500
(2) 1.71s: Soft segmentation embedding computation [11]
(3) 2.25s: Local feature extraction and assignment to visual/colour words
(4) 2.60s: Aggregation of pixel-wise features into soft-segment features (section 3.2 of the main submission)
(5) 0.01s: Computation of unaries (i.e. applying a linear SVM)
(6) 0.38s: Inference (section 1.4) until convergence

The code is easily parallelizable, e.g. stages (2) and (3) can be performed independently at the same time, while each of the stages (4)-(6) is trivially parallelizable. Furthermore, a C implementation of the soft segmentation embedding will be released which takes only several milliseconds instead of 1.7s¹, reducing the total runtime by 25%.

Stage (6) has been sped up with a simple observation. As most of the pixel and soft-segment labels remain unchanged throughout the inference (e.g. the unary potentials for a large portion of sky can be very strong and therefore its labels remain unchanged), most of the partial energies in equations (3) and (6) stay constant across iterations. Therefore, a lot of computation can be saved when calculating $E^{(1)}(l)$ and $E^{(2)}(L)$ by reusing the unchanged values of $E^{(1)}_j(l_j)$ and $E^{(2)}_i(L_i)$ from the previous iteration.
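In the notation of the inference sketch from section 1.4, the saving amounts to updating the vote matrix incrementally instead of rebuilding it each iteration (our sketch of the idea, not the released code; L and L_new are the soft-segment labels before and after step 2):

    # Only soft-segments whose label changed contribute a delta to the votes;
    # every other partial energy E1_j is reused from the previous iteration.
    changed = np.flatnonzero(L_new != L)
    for i in changed:
        votes[L[i]] -= W[i]        # retract the segment's old vote
        votes[L_new[i]] += W[i]    # add its new vote
    l_new = votes.argmax(axis=0)   # re-derive the optimal pixel labels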

¹ Personal communication with the author of [11].


2 Retrieval results: Speedup

This section is supplementary to section 4.2 of the main submission, which states that the expected speedup for an average query [12], when using SemanticSIFT compared to the baseline, is 40.7% and 50.4% on the Oxford 5k and 105k datasets, respectively. Here we give details of that computation.

Let p(i) be the fraction of features quantized to word i and N the total number of database features; then the length of the posting list associated with word i is M(i) = p(i)N. The expected number of posting list entries, E(M), traversed for a single query feature, where the expectation is taken over the feature's word assignment, is therefore [12]:

$$E(M) = \sum_{i=1}^{K} p(i)\, M(i) = N \sum_{i=1}^{K} p(i)^2 \qquad (8)$$

where K is the size of the vocabulary.

The expected number of traversed posting list entries for a query image is then $n_q E(M)$, where $n_q$ is the number of query features. Noting that $n_q$ and N are the same for the baseline and SemanticSIFT cases, the relative speed of retrieval using the two indexes is then:

$$\frac{E(M_{\mathrm{SemanticSIFT}})}{E(M_{\mathrm{baseline}})} = \frac{\sum_{i=1}^{K_{\mathrm{SemanticSIFT}}} p(i)^2_{\mathrm{SemanticSIFT}}}{\sum_{i=1}^{K_{\mathrm{baseline}}} p(i)^2_{\mathrm{baseline}}} \qquad (9)$$

where the SemanticSIFT and baseline subscripts denote the quantities associated with the respective indexes, and the $p(i)_{\mathrm{SemanticSIFT}}$'s and $p(i)_{\mathrm{baseline}}$'s are computed as fractions of features quantized to word i for the two indexes.
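The ratio is straightforward to compute from per-word feature counts of the two indexes (a Python/NumPy sketch; both N and the number of query features cancel, and the function name is ours):

    import numpy as np

    def traversal_ratio(counts_semantic, counts_baseline):
        # counts_* : per-word numbers of database features for each index.
        p_s = counts_semantic / counts_semantic.sum()
        p_b = counts_baseline / counts_baseline.sum()
        # Equation (9): ratio of expected posting list entries traversed.
        return (p_s ** 2).sum() / (p_b ** 2).sum()

A ratio of 0.5933 then corresponds to the reported 1 - 0.5933 ≈ 40.7% speedup on Oxford 5k.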

The ratios for the Oxford 5k and Oxford 105k tests are 0.5933 and 0.4962, respectively. Therefore the expected speedup for an average query is 40.7% and 50.4%, respectively.

SoftSemanticSIFT. Equation (8) is not valid for SoftSemanticSIFT because one word can match several others due to the introduction of the "unknown" label; for example, {flora} matches {unknown}, {flora} and {flora, unknown}. Therefore, for a single query word, several posting lists in the visual/semantic product vocabulary have to be traversed, such that the visual word is the same as the query visual word and the semantic words are the ones that can match the query semantic word. More formally, the expected number of traversed posting list entries for SoftSemanticSIFT is:

$$E(M) = \sum_{i_v=1}^{K_v} \sum_{i_s=1}^{K_s} p(i_v, i_s) \sum_{j_s=1}^{K_s} m(i_s, j_s)\, M(i_v, j_s) \qquad (10)$$

$$= N \sum_{i_v=1}^{K_v} \sum_{i_s=1}^{K_s} p(i_v, i_s) \sum_{j_s=1}^{K_s} m(i_s, j_s)\, p(i_v, j_s) \qquad (11)$$


where $K_v$ and $K_s$ are the sizes of the visual and semantic vocabularies, respectively, $p(i_v, i_s)$ is the fraction of features assigned to the $(i_v, i_s)$ word of the visual/semantic product vocabulary, and $m(i_s, j_s)$ is 1 if semantic words $i_s$ and $j_s$ match and 0 otherwise (e.g. m({flora}, {sky}) = 0 and m({flora}, {unknown}) = 1). Note that the equation reduces to equation (8) for the "hard" SemanticSIFT case, because there $m(i_s, j_s) = 1$ iff $i_s = j_s$.
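Over the product vocabulary this is a small matrix computation (a Python/NumPy sketch; P holds the fractions $p(i_v, i_s)$, m the binary semantic matching matrix, and the names are ours):

    import numpy as np

    def soft_expected_traversal(P, m):
        # P : (Kv, Ks) fractions p(i_v, i_s) over the product vocabulary.
        # m : (Ks, Ks) binary matrix, m[i_s, j_s] = 1 iff the semantic
        #     words i_s and j_s match.
        # (P @ m.T)[i_v, i_s] = sum_{j_s} m(i_s, j_s) * p(i_v, j_s), so the
        # full double sum of equation (11), up to the factor N, is:
        return (P * (P @ m.T)).sum()

Setting m to the identity matrix recovers the "hard" SemanticSIFT case, i.e. equation (8) over the product vocabulary.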

The expected speedup for an average query for SoftSemanticSIFT is 30.7% and 38.4% for the Oxford 5k and Oxford 105k datasets, respectively.


Bibliography

[1] Munoz, D., Bagnell, J.A., Hebert, M.: Stacked hierarchical labeling. In: Proc. ECCV. (2010)

[2] Lempitsky, V., Vedaldi, A., Zisserman, A.: A pylon model for semantic segmentation. In: NIPS. (2011)

[3] Ladicky, L., Russell, C., Kohli, P., Torr, P.H.S.: Associative hierarchical random fields. IEEE PAMI (2013)

[4] Gould, S., Fulton, R., Koller, D.: Decomposing a scene into geometric and semantically consistent regions. In: Proc. ICCV. (2009)

[5] Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: Proc. CVPR. (2008)

[6] Felzenszwalb, P.F., Veksler, O.: Tiered scene labelling with dynamic programming. In: Proc. CVPR. (2010)

[7] Arandjelović, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: Proc. CVPR. (2012)

[8] Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer vision algorithms (2008)

[9] Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. J. Machine Learning Research 9 (2008) 1871–1874

[10] Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: Proc. CVPR. (2007)

[11] Leordeanu, M., Sukthankar, R., Sminchisescu, C.: Efficient closed-form solution to generalized boundary detection. In: Proc. ECCV. (2012)

[12] Stewenius, H., Gunderson, S.H., Pilet, J.: Size matters: exhaustive geometric verification for image retrieval. In: Proc. ECCV. (2012)